Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records

David R Crosslin; Gerard Tromp; Amber Burt; Daniel S Kim; Shefali S Verma; Anastasia M Lucas; Yuki Bradford; Dana C Crawford; Sebastian M Armasu; John A Heit; M Geoffrey Hayes; Helena Kuivaniemi; Marylyn D Ritchie; Gail P Jarvik; Mariza de Andrade; electronic Medical Records and Genomics (eMERGE) Network

doi:10.3389/fgene.2014.00352

Front Genet. 2014 Nov 4:5:352. doi: 10.3389/fgene.2014.00352. eCollection 2014.

Authors

David R Crosslin¹, Gerard Tromp², Amber Burt³, Daniel S Kim¹, Shefali S Verma⁴, Anastasia M Lucas⁴, Yuki Bradford⁴, Dana C Crawford⁵, Sebastian M Armasu⁶, John A Heit⁷, M Geoffrey Hayes⁸, Helena Kuivaniemi², Marylyn D Ritchie⁴, Gail P Jarvik¹, Mariza de Andrade⁶; electronic Medical Records and Genomics (eMERGE) Network

Affiliations

¹ Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA.
² The Sigfried and Janet Weis Center for Research, Geisinger Health System, Danville, PA, USA.
³ Department of Genome Sciences, University of Washington, Seattle, WA, USA.
⁴ Department of Biochemistry and Molecular Biology, Center for Systems Genomics, Pennsylvania State University, University Park, PA, USA.
⁵ Center for Human Genetics Research, School of Medicine, Vanderbilt University, Nashville, TN, USA; Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, USA.
⁶ Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA.
⁷ Division of Cardiovascular Diseases, Mayo Clinic, Rochester, MN, USA.
⁸ Division of Endocrinology, Metabolism, and Molecular Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.

Abstract

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohorts with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.
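
The idea of deriving components from loadings calculated on a reference sample can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`reference_loadings`, `project`), the simulated genotypes, the per-variant standardization, and the choice of two components are all assumptions made for illustration. The sketch fits a PCA on a reference genotype matrix only, then applies the resulting variant loadings to study samples, so that the components are defined by the reference rather than by site or platform structure within the study data.

```python
import numpy as np

def reference_loadings(ref_genotypes, n_components=2):
    """Fit PCA on reference samples only (hypothetical helper).

    ref_genotypes: array of shape (n_ref_samples, n_variants), allele counts 0/1/2.
    Returns variant loadings plus the reference means, SDs, and variant mask.
    """
    ref = np.asarray(ref_genotypes, dtype=float)
    mean = ref.mean(axis=0)
    sd = ref.std(axis=0)
    keep = sd > 0  # drop variants that are monomorphic in the reference
    z = (ref[:, keep] - mean[keep]) / sd[keep]
    # SVD of the standardized matrix: rows of vt are the variant loadings
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return vt[:n_components].T, mean, sd, keep

def project(genotypes, loadings, mean, sd, keep):
    """Score samples on the reference-derived components (hypothetical helper)."""
    g = np.asarray(genotypes, dtype=float)
    # Standardize with the REFERENCE means and SDs, not the study's own
    z = (g[:, keep] - mean[keep]) / sd[keep]
    return z @ loadings

# Simulated data (assumption): 40 reference and 10 study samples, 200 variants
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.9, size=200)
ref = rng.binomial(2, freqs, size=(40, 200))
study = rng.binomial(2, freqs, size=(10, 200))

loadings, mean, sd, keep = reference_loadings(ref, n_components=2)
study_pcs = project(study, loadings, mean, sd, keep)
print(study_pcs.shape)
```

In such a setup, the columns of `study_pcs` would enter the association model as covariates in place of components computed from the pooled study samples themselves.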

Keywords: ancestry; biobank; genetic association study; loadings; principal component analysis.

Grants and funding

U01 HG006385/HG/NHGRI NIH HHS/United States