Clustering rare diseases within an ontology-enriched knowledge graph

Jaleal Sanjak; Jessica Binder; Arjun Singh Yadaw; Qian Zhu; Ewy A Mathé

doi:10.1093/jamia/ocad186

J Am Med Inform Assoc. 2023 Dec 22;31(1):154-164. doi: 10.1093/jamia/ocad186.

Authors

Jaleal Sanjak¹ ², Jessica Binder¹, Arjun Singh Yadaw¹, Qian Zhu¹, Ewy A Mathé¹

Affiliations

¹ Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, United States.
² Chief Technology Office, Booz Allen Hamilton, Bethesda, MD, United States.

Abstract

Objective: Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing. Toward that aim, we utilized an integrative knowledge graph to construct clusters of rare diseases.

Materials and methods: Data on 3242 rare diseases were extracted from the internal data resources of the National Center for Advancing Translational Sciences Genetic and Rare Diseases Information Center. The rare disease data were enriched with additional biomedical data, including gene and phenotype ontologies, biological pathway data, and small molecule-target activity data, to create a knowledge graph (KG). Node embeddings were trained and clustered. We validated the disease clusters through semantic similarity and feature enrichment analysis.
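
The embed-then-cluster step can be illustrated with a minimal sketch. The abstract does not name the embedding algorithm or the clustering method, so the code below substitutes simple stand-ins: each node is represented by its averaged short random-walk visit distribution, and disease vectors are grouped with a basic k-means. The toy triples, node names, and function names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Toy knowledge graph as (source, relation, target) triples.
# Every identifier here is hypothetical, for illustration only.
TRIPLES = [
    ("disease:A", "associated_gene", "gene:G1"),
    ("disease:B", "associated_gene", "gene:G1"),
    ("disease:A", "has_phenotype", "phenotype:P1"),
    ("disease:B", "has_phenotype", "phenotype:P1"),
    ("disease:C", "associated_gene", "gene:G2"),
    ("disease:D", "associated_gene", "gene:G2"),
    ("disease:C", "has_phenotype", "phenotype:P2"),
    ("disease:D", "has_phenotype", "phenotype:P2"),
    ("gene:G1", "in_pathway", "pathway:W1"),
    ("gene:G2", "in_pathway", "pathway:W1"),
]


def build_adjacency(triples):
    """Treat the KG as an undirected graph, ignoring relation types."""
    adj = defaultdict(set)
    for source, _relation, target in triples:
        adj[source].add(target)
        adj[target].add(source)
    return adj


def walk_embeddings(adj, steps=3):
    """Stand-in embedding: average of the 1..steps random-walk distributions."""
    nodes = sorted(adj)
    index = {node: i for i, node in enumerate(nodes)}
    embeddings = {}
    for start in nodes:
        dist = [0.0] * len(nodes)
        dist[index[start]] = 1.0
        total = [0.0] * len(nodes)
        for _ in range(steps):
            nxt = [0.0] * len(nodes)
            for node, prob in zip(nodes, dist):
                if prob:
                    share = prob / len(adj[node])
                    for neighbor in adj[node]:
                        nxt[index[neighbor]] += share
            dist = nxt
            total = [a + b for a, b in zip(total, dist)]
        embeddings[start] = [value / steps for value in total]
    return embeddings


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iters=20):
    """Basic k-means with deterministic farthest-first initialisation."""
    names = sorted(vectors)
    centers = [vectors[names[0]]]
    while len(centers) < k:
        centers.append(max((vectors[n] for n in names),
                           key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = {}
    for _ in range(iters):
        labels = {n: min(range(k), key=lambda j: sq_dist(vectors[n], centers[j]))
                  for n in names}
        for j in range(k):
            members = [vectors[n] for n in names if labels[n] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


embeddings = walk_embeddings(build_adjacency(TRIPLES))
disease_vectors = {n: v for n, v in embeddings.items() if n.startswith("disease:")}
labels = kmeans(disease_vectors, k=2)
print(labels)
```

In this toy graph, diseases that share a gene and a phenotype receive the same cluster label, which is the behaviour the pipeline relies on at scale. A production pipeline would more plausibly use a dedicated graph-embedding library and a clustering method tuned to the data.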

Results: Thirty-seven disease clusters were created with a mean size of 87 diseases. We validated the clusters quantitatively via semantic similarity based on the Orphanet Rare Disease Ontology. In addition, the clusters were analyzed for enrichment of associated genes, revealing that the enriched genes within clusters are highly related.
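
One common way to test whether a gene is over-represented in a cluster is a one-sided hypergeometric test. The abstract does not state which enrichment statistic was used, so the sketch below is an assumption. The population size (3242) and cluster size (87, the reported mean) come from the abstract. The gene annotation counts are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, annotated, cluster_size):
    """P(X >= k) for X ~ Hypergeometric(population, annotated, cluster_size).

    k            : diseases in the cluster annotated with the gene
    population   : all diseases in the graph
    annotated    : diseases in the whole graph annotated with the gene
    cluster_size : diseases in the cluster
    """
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, cluster_size)


# 3242 and 87 are from the abstract; 40 and 10 are hypothetical counts.
p_value = hypergeom_sf(k=10, population=3242, annotated=40, cluster_size=87)
print(f"enrichment p-value: {p_value:.3g}")
```

With these illustrative counts, about one annotated disease would be expected in the cluster by chance, so observing ten yields a very small p-value. In practice, p-values across many genes and clusters would need multiple-testing correction.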

Discussion: We demonstrate that node embeddings are an effective method for clustering diseases within a heterogeneous KG. Semantically similar diseases and relevant enriched genes have been uncovered within the clusters. Connections between disease clusters and drugs are enumerated for follow-up efforts.
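
To make "semantically similar" concrete, the sketch below scores two diseases by the Jaccard overlap of their ancestor sets in an is-a hierarchy. This is one simple ontology-based similarity measure, chosen here as an assumption. The abstract does not say which measure was computed on the Orphanet Rare Disease Ontology, and the small hierarchy below is hypothetical.

```python
# Hypothetical is-a hierarchy (child -> list of parents); not real ontology terms.
PARENTS = {
    "rare genetic disease": [],
    "rare neurologic disease": ["rare genetic disease"],
    "rare metabolic disease": ["rare genetic disease"],
    "disease:A": ["rare neurologic disease"],
    "disease:B": ["rare neurologic disease"],
    "disease:C": ["rare metabolic disease"],
}


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def jaccard_similarity(a, b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


same_branch = jaccard_similarity("disease:A", "disease:B", PARENTS)
different_branch = jaccard_similarity("disease:A", "disease:C", PARENTS)
print(same_branch, different_branch)
```

A cluster could then be judged coherent when the mean pairwise similarity of its members exceeds that of randomly drawn disease sets of the same size.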

Conclusion: We lay out a method for clustering rare diseases using graph node embeddings. We develop an easy-to-maintain pipeline that can be updated when new data on rare diseases emerges. The embeddings themselves can be paired with other representation learning methods for other data types, such as drugs, to address other predictive modeling problems.

Keywords: drug repurposing; knowledge graph; ontology; rare disease.

Published by Oxford University Press on behalf of the American Medical Informatics Association 2023.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Drug Repositioning
Humans
Pattern Recognition, Automated*
Phenotype
Rare Diseases* / genetics
Semantics

Associated data

figshare/10.6084/m9.figshare.23748846
