Hybrid autoencoder with orthogonal latent space for robust population structure inference

Meng Yuan; Hanne Hoskens; Seppe Goovaerts; Noah Herrick; Mark D Shriver; Susan Walsh; Peter Claes

doi:10.1038/s41598-023-28759-x

Sci Rep. 2023 Feb 14;13(1):2612. doi: 10.1038/s41598-023-28759-x.

Authors

Meng Yuan¹ ² ³, Hanne Hoskens⁴ ⁵, Seppe Goovaerts⁴ ⁵, Noah Herrick⁶, Mark D Shriver⁷, Susan Walsh⁶, Peter Claes⁸ ⁹ ¹⁰ ¹¹

Affiliations

¹ Department of Electrical Engineering, ESAT/PSI, KU Leuven, Leuven, Belgium. meng.yuan@kuleuven.be.
² Department of Human Genetics, KU Leuven, Leuven, Belgium. meng.yuan@kuleuven.be.
³ Medical Imaging Research Center, University Hospitals Leuven, Leuven, Belgium. meng.yuan@kuleuven.be.
⁴ Department of Human Genetics, KU Leuven, Leuven, Belgium.
⁵ Medical Imaging Research Center, University Hospitals Leuven, Leuven, Belgium.
⁶ Department of Biology, Indiana University Purdue University Indianapolis, Indianapolis, IN, USA.
⁷ Department of Anthropology, Pennsylvania State University, State College, PA, USA.
⁸ Department of Electrical Engineering, ESAT/PSI, KU Leuven, Leuven, Belgium. peter.claes@kuleuven.be.
⁹ Department of Human Genetics, KU Leuven, Leuven, Belgium. peter.claes@kuleuven.be.
¹⁰ Medical Imaging Research Center, University Hospitals Leuven, Leuven, Belgium. peter.claes@kuleuven.be.
¹¹ Murdoch Children's Research Institute, Melbourne, VIC, Australia. peter.claes@kuleuven.be.

Abstract

Analysis of population structure and genomic ancestry remains an important topic in human genetics and bioinformatics. Commonly used methods require high-quality genotype data to ensure accurate inference. However, in practice, laboratory artifacts and outliers are often present in the data. Moreover, existing methods are typically affected by the presence of related individuals in the dataset. In this work, we propose a novel hybrid method, called SAE-IBS, which combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions. Namely, it yields an orthogonal latent space, which aids dimensionality selection, while learning non-linear transformations. The proposed approach achieves higher accuracy than existing methods for projecting poor-quality target samples (those with genotyping errors and missing data) onto a reference ancestry space and generates a robust ancestry space in the presence of relatedness. We introduce a new approach and an accompanying open-source program for robust ancestry inference in the presence of missing data, genotyping errors, and relatedness. The obtained ancestry space allows for non-linear projections and exhibits orthogonality with clearly separable population groups.
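To make the idea of "projecting target samples onto a reference ancestry space" concrete, the following is a minimal sketch of the traditional matrix decomposition baseline mentioned in the abstract (principal component analysis via SVD), not of SAE-IBS itself. Everything in it is an illustrative assumption: the simulated genotypes, the two hypothetical populations, the choice of two components, and the mean imputation of missing genotypes. It shows only that the resulting axes are orthogonal and that a target sample with missing data can be placed in the reference space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reference panel (assumed data): 60 individuals x 200 SNPs, coded 0/1/2.
# Two hypothetical populations that differ in allele frequencies.
n_per_pop, n_snps = 30, 200
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.25, n_snps), 0.02, 0.98)
reference = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
]).astype(float)

# Centre each SNP on its reference mean, then decompose with SVD.
snp_mean = reference.mean(axis=0)
centered = reference - snp_mean
u, s, vt = np.linalg.svd(centered, full_matrices=False)

k = 2                               # number of ancestry axes kept (assumption)
loadings = vt[:k].T                 # (n_snps, k), orthonormal columns
ref_scores = centered @ loadings    # reference individuals in ancestry space

# A target sample with about 10% missing genotypes, marked as NaN.
target = rng.binomial(2, freq_b, size=n_snps).astype(float)
target[rng.random(n_snps) < 0.1] = np.nan

# Simple handling of missing data (assumption): impute the reference mean,
# which is the same as setting the centred value to zero.
target_centered = np.where(np.isnan(target), 0.0, target - snp_mean)
target_scores = target_centered @ loadings

# Orthogonality check: off-diagonal entries of the score Gram matrix vanish.
gram = ref_scores.T @ ref_scores
off_diag = gram - np.diag(np.diag(gram))
print("max off-diagonal:", np.abs(off_diag).max())
print("target coordinates:", target_scores)
```

This linear projection is the kind of baseline the abstract contrasts with; the hybrid method described there additionally learns non-linear transformations while keeping an orthogonal latent space.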

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Genetics, Population*
Genotype
Humans
Neural Networks, Computer*
Principal Component Analysis