Archetypal Analysis for population genetics

Julia Gimbernat-Mayol; Albert Dominguez Mantes; Carlos D Bustamante; Daniel Mas Montserrat; Alexander G Ioannidis

doi:10.1371/journal.pcbi.1010301

PLoS Comput Biol. 2022 Aug 25;18(8):e1010301. doi: 10.1371/journal.pcbi.1010301. eCollection 2022 Aug.

Authors

Julia Gimbernat-Mayol¹, Albert Dominguez Mantes²,³,⁴, Carlos D Bustamante⁴, Daniel Mas Montserrat⁴, Alexander G Ioannidis⁴,⁵

Affiliations

¹ Department of Bioengineering, Faculty of Engineering, Imperial College London, London, United Kingdom.
² Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
³ Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
⁴ Department of Biomedical Data Science, Stanford Medical School, Stanford, California, United States of America.
⁵ Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California, United States of America.

Abstract

The estimation of genetic clusters using genomic data has applications ranging from genome-wide association studies (GWAS) to demographic history to polygenic risk scores (PRS), and is expected to play an important role in the analysis of increasingly diverse, large-scale cohorts. However, existing methods are computationally intensive, prohibitively so in the case of nationwide biobanks. Here we explore Archetypal Analysis as an efficient, unsupervised approach for identifying genetic clusters and for associating individuals with them. Such unsupervised approaches help avoid conflating socially constructed ethnic labels with genetic clusters by eliminating the need for exogenous training labels. We show that Archetypal Analysis yields cluster structure similar to that of existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. More importantly, we show that because Archetypal Analysis can be used with lower-dimensional representations of genetic data, significant reductions in computational time and memory requirements are possible. When Archetypal Analysis is run in this fashion, it takes several orders of magnitude less compute time than the current standard, ADMIXTURE. Finally, we demonstrate uses across datasets ranging from humans to canids.
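
To make the idea in the abstract concrete, the following is a minimal illustrative sketch of running Archetypal Analysis on a lower-dimensional representation of genotype data. It is not the authors' implementation. It assumes a generic formulation in which the data X is approximated by A B X, with the rows of A (per-individual archetype weights) and the rows of B (archetypes as mixtures of individuals) constrained to the probability simplex, and it uses plain alternating projected gradient descent as the solver. The function names, the synthetic genotypes, the choice of two principal components, and the choice of three archetypes are all assumptions made for illustration.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Euclidean projection of each row of M onto the probability simplex."""
    n, k = M.shape
    U = np.sort(M, axis=1)[:, ::-1]            # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    cond = U - css / np.arange(1, k + 1) > 0   # True for a leading run of each row
    rho = cond.sum(axis=1)                     # length of that run, always >= 1
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=200, seed=0):
    """Hypothetical helper: minimize 0.5 * ||X - A B X||^2 with simplex-constrained rows.

    Uses alternating projected gradient steps with step size 1/L per block.
    Returns A (n x k), B (k x n) and the loss recorded before each iteration
    plus once at the end.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.dirichlet(np.ones(k), size=n)      # per-sample archetype weights
    B = rng.dirichlet(np.ones(n), size=k)      # archetypes as mixtures of samples
    x_norm_sq = np.linalg.norm(X, 2) ** 2      # squared spectral norm of X
    losses = []
    for _ in range(n_iter):
        Z = B @ X                              # current archetypes (k x d)
        R = A @ Z - X                          # residual
        losses.append(0.5 * np.sum(R ** 2))
        step_a = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 1e-12)
        A = project_rows_to_simplex(A - step_a * (R @ Z.T))
        R = A @ Z - X
        step_b = 1.0 / (np.linalg.norm(A, 2) ** 2 * x_norm_sq + 1e-12)
        B = project_rows_to_simplex(B - step_b * (A.T @ R @ X.T))
    losses.append(0.5 * np.sum((A @ B @ X - X) ** 2))
    return A, B, losses


# Synthetic toy data (an assumption, not real genotypes): 200 individuals,
# 100 SNPs, each individual a mixture of 3 source allele-frequency profiles.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.95, size=(3, 100))
weights = rng.dirichlet(np.ones(3) * 0.5, size=200)
genotypes = rng.binomial(2, weights @ freqs).astype(float)

# Lower-dimensional representation: top 2 principal components via SVD.
centered = genotypes - genotypes.mean(axis=0)
U, S, _ = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :2] * S[:2]

# Fit 3 archetypes in PC space; each row of A is one individual's weights.
A, B, losses = archetypal_analysis(pcs, k=3)
```

In this sketch, each row of A plays a role loosely analogous to the per-individual cluster proportions reported by tools such as ADMIXTURE, while B @ pcs gives the archetype coordinates in the reduced space. Because the fit operates on a few principal components rather than the full genotype matrix, the cost of each iteration depends on the number of retained components rather than the number of SNPs.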

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Genetic Predisposition to Disease
Genetics, Population
Genome
Genome-Wide Association Study*
Genomics / methods
Humans
Polymorphism, Single Nucleotide* / genetics

Grants and funding

This work was supported in part by the Chan Zuckerberg Biohub (awarded to CDB) and by the Royal Academy of Engineering Leaders Scholarship (awarded to JGM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.