Machine learning-based quantification for disease uncertainty increases the statistical power of genetic association studies

Jun Young Park; Jang Jae Lee; Younghwa Lee; Dongsoo Lee; Jungsoo Gim; Lindsay Farrer; Kun Ho Lee; Sungho Won

doi:10.1093/bioinformatics/btad534

Bioinformatics. 2023 Sep 2;39(9):btad534. doi: 10.1093/bioinformatics/btad534.

Authors

Jun Young Park¹ ² ³, Jang Jae Lee³, Younghwa Lee¹, Dongsoo Lee¹, Jungsoo Gim³ ⁴, Lindsay Farrer⁵ ⁶, Kun Ho Lee³ ⁴ ⁷, Sungho Won¹ ⁸ ⁹ ¹⁰

Affiliations

¹ Department of Public Health Sciences, Graduate School of Public Health, Seoul National University, Seoul 08826, Korea.
² Neurozen Inc., Seoul 06168, Korea.
³ Gwangju Alzheimer's & Related Dementia Cohort Research Center, Chosun University, Gwangju 61452, Korea.
⁴ Department of Biomedical Science, Chosun University, Gwangju 61452, Korea.
⁵ Departments of Medicine (Biomedical Genetics), Neurology, and Ophthalmology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, United States.
⁶ Departments of Epidemiology and Biostatistics, Boston University School of Public Health, Boston, MA 02118, United States.
⁷ Korea Brain Research Institute, Daegu 41068, Korea.
⁸ Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Korea.
⁹ Institute of Health and Environment, Seoul National University, Seoul 08826, Korea.
¹⁰ RexSoft Inc, Seoul 08826, Korea.

Abstract

Motivation: Accommodating increasingly large samples is key to identifying associations between genetic variants and Alzheimer's disease (AD) in genome-wide association studies (GWAS). Accordingly, we aimed to develop a method that incorporates patients with mild cognitive impairment and individuals of unknown cognitive status into GWAS using a machine learning-based AD prediction model.

Results: Simulation analyses showed that the weighting imputed phenotypes method increased statistical power compared with ordinary logistic regression using only AD cases and controls. Applied to real-world data, the penalized logistic method had the highest AUC (0.96) for AD prediction, and the weighting imputed phenotypes method performed well in terms of power. We identified an association (P < 5.0 × 10⁻⁸) of AD with several variants in the APOE region and with rs143625563 in LMX1A. Our method, which allows the inclusion of individuals with mild cognitive impairment, improves the statistical power of GWAS for AD. We discovered a novel association with LMX1A.
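
The abstract does not spell out how imputed phenotypes are weighted. The sketch below is an illustration only, not the authors' implementation. It assumes one common scheme: each individual with uncertain status enters the association model twice, once as a case weighted by a predicted AD probability and once as a control weighted by the complement. The function names, the simulated data, and the fixed 0.9/0.1 probabilities standing in for a classifier's output are all hypothetical.

```python
import math
import random

def fit_weighted_logistic(x, y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x with observation weights w."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)          # weighted score contribution
            v = wi * p * (1.0 - p)     # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def expand_uncertain(genotypes, ad_prob):
    """Duplicate each uncertain individual: a case record with weight p
    and a control record with weight 1 - p (assumed weighting scheme)."""
    x, y, w = [], [], []
    for g, p in zip(genotypes, ad_prob):
        x += [g, g]
        y += [1, 0]
        w += [p, 1.0 - p]
    return x, y, w

def simulate(n, rng, maf=0.3, b0=-0.5, b1=0.5):
    """Simulate additive genotypes (0/1/2) and a binary disease status."""
    g = [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]
    y = [int(rng.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * gi)))) for gi in g]
    return g, y

rng = random.Random(0)
g_known, y_known = simulate(1500, rng)   # confirmed cases and controls
g_unc, y_latent = simulate(1500, rng)    # individuals with uncertain status
# Hypothetical stand-in for a prediction model's output probabilities.
p_unc = [0.9 if yi else 0.1 for yi in y_latent]

x_u, y_u, w_u = expand_uncertain(g_unc, p_unc)
x_all = g_known + x_u
y_all = y_known + y_u
w_all = [1.0] * len(g_known) + w_u

beta0, beta1 = fit_weighted_logistic(x_all, y_all, w_all)
print("estimated genotype log-odds ratio:", round(beta1, 3))
```

Because the two duplicated records for an individual are not independent, valid standard errors would need a robust variance estimator, which this sketch does not compute.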

Availability and implementation: Simulation code can be accessed at https://github.com/Junkkkk/wGEE_GWAS.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Alzheimer Disease* / genetics
Genetic Association Studies
Genome-Wide Association Study* / methods
Humans
Machine Learning
Phenotype
Uncertainty