Enhancing patient representation learning with inferred family pedigrees improves disease risk prediction

Xiayuan Huang; Jatin Arora; Abdullah Mesut Erzurumluoglu; Stephen A Stanhope; Daniel Lam; Boehringer Ingelheim—Global Computational Biology and Digital Sciences; Hongyu Zhao; Zhihao Ding; Zuoheng Wang; Johann de Jong

doi:10.1093/jamia/ocae297

J Am Med Inform Assoc. 2024 Dec 26:ocae297. doi: 10.1093/jamia/ocae297. Online ahead of print.

Collaborators

Boehringer Ingelheim—Global Computational Biology and Digital Sciences:
Jatin Arora, Abdullah Mesut Erzurumluoglu, Daniel Lam, Pierre Khoueiry, Jan N Jensen, James Cai, Nathan Lawless, Jan Kriegl, Zhihao Ding, Johann de Jong

Affiliations

¹ Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, United States.
² Human Genetics, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany.
³ Real World Data and Analytics, Global Medical Affairs, Boehringer Ingelheim, Ridgefield, CT 06877, United States.
⁴ CB CMDR, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany.
⁵ Department of Biomedical Informatics & Data Science, Yale University School of Medicine, New Haven, CT 06510, United States.
⁶ Statistical Modeling, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany.

PMID: 39723811

Abstract

Background: Machine learning and deep learning are powerful tools for analyzing electronic health records (EHRs) in healthcare research. Although family health history has been recognized as a major predictor for a wide spectrum of diseases, research has so far adopted a limited view of family relations, essentially treating patients as independent samples in the analysis.

Methods: To address this gap, we present ALIGATEHR, which models inferred family relations in a graph attention network augmented with an attention-based medical ontology representation, thus accounting for the complex influence of genetics, shared environmental exposures, and disease dependencies.
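To make the idea of attending over relatives concrete, the following is a minimal sketch of one generic single-head graph attention layer applied to a toy family graph. It is not ALIGATEHR's implementation: the abstract does not specify the architecture, and the medical ontology component is left out entirely. The function name `family_attention_layer`, the toy pedigree, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    """Leaky ReLU, the nonlinearity commonly used for graph attention scores."""
    return np.where(x > 0, x, slope * x)


def family_attention_layer(features, adjacency, weight, attn_vec):
    """One generic single-head graph attention layer (hypothetical helper).

    features:  (n_patients, d_in) patient feature vectors
    adjacency: (n_patients, n_patients) boolean, True where two patients are relatives
    weight:    (d_in, d_out) shared linear projection
    attn_vec:  (2 * d_out,) attention parameters
    Returns the updated patient embeddings and the attention matrix.
    """
    n = features.shape[0]
    hidden = features @ weight
    d_out = hidden.shape[1]

    # Score for the pair (i, j): LeakyReLU(a^T [h_i || h_j]), split into two dot products.
    src = hidden @ attn_vec[:d_out]
    dst = hidden @ attn_vec[d_out:]
    scores = leaky_relu(src[:, None] + dst[None, :])

    # Each patient attends only to itself and to its relatives.
    mask = adjacency | np.eye(n, dtype=bool)
    scores = np.where(mask, scores, -np.inf)

    # Row-wise softmax; the self-loop guarantees a finite maximum in every row.
    scores = scores - scores.max(axis=1, keepdims=True)
    attention = np.exp(scores)
    attention = attention / attention.sum(axis=1, keepdims=True)

    return attention @ hidden, attention


# Toy pedigree (assumed): patients 0-1 and 1-2 are relatives, patient 3 has no known relatives.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5))
adjacency = np.zeros((4, 4), dtype=bool)
adjacency[0, 1] = adjacency[1, 0] = True
adjacency[1, 2] = adjacency[2, 1] = True
weight = rng.normal(size=(5, 3))
attn_vec = rng.normal(size=6)

embeddings, attention = family_attention_layer(features, adjacency, weight, attn_vec)
```

In this sketch, each row of `attention` is a probability distribution over the patient and their relatives, so a patient without inferred relatives simply keeps their own projected features. Inspecting such weights is one way a model of this kind could relate a patient's risk to relatives' clinical profiles.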

Results: Taking disease risk prediction as a use case, we demonstrate that explicitly modeling family relations significantly improves predictions across the disease spectrum. We then show how ALIGATEHR's attention mechanism, which links patients' disease risk to their relatives' clinical profiles, successfully captures genetic aspects of diseases using longitudinal EHR diagnosis data. Finally, we use ALIGATEHR to successfully distinguish the 2 main inflammatory bowel disease subtypes with highly shared risk factors and symptoms (Crohn's disease and ulcerative colitis).

Conclusion: Overall, our results highlight that family relations should not be overlooked in EHR research and illustrate ALIGATEHR's great potential for enhancing patient representation learning for predictive and interpretable modeling of EHRs.

Keywords: disease risk prediction; electronic health records; graph attention networks; patient modeling.

Grants and funding

AWD0006462/Yale-Boehringer Ingelheim