Manifold learning for human population structure studies

PLoS One. 2012;7(1):e29901. doi: 10.1371/journal.pone.0029901. Epub 2012 Jan 17.

Abstract

The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate the application of LLE to population genetic analysis, we systematically investigate several important properties of LLE and reveal the connection between LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions that can be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates the genomic information content of a genomic region, so that its association with population structure can be studied collectively, and a LASSO algorithm to search for such regions across the genome. We applied the developed methodologies to a low-coverage pilot dataset from the 1000 Genomes Project and a Phase III Mexico dataset from the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. These preliminary results demonstrate that next-generation sequencing offers a rich resource and that LLE provides a powerful tool for population structure analysis.
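To make the embedding step concrete, the following is a minimal sketch of the standard LLE algorithm applied to a genotype matrix. It is not the authors' implementation. The function name `lle`, the parameters `n_neighbors`, `n_components` and `reg`, and the simulated two-population genotype data are all illustrative assumptions; the abstract does not specify the neighborhood size, the regularization, or the preprocessing used in the study.

```python
import numpy as np


def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Standard LLE on the rows of X (individuals x markers).

    Hypothetical helper for illustration; parameter defaults are assumptions.
    Returns the embedding (n x n_components) and the weight matrix W (n x n).
    """
    n = X.shape[0]

    # Step 1: find the nearest neighbors of each individual (Euclidean distance).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # an individual is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: weights that best reconstruct each individual from its neighbors,
    # constrained so that each row of W sums to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]  # neighbors centered on individual i
        C = Z @ Z.T            # local Gram matrix (k x k)
        # Regularize, since C is singular when k exceeds the local dimension.
        C += np.eye(n_neighbors) * reg * max(np.trace(C), 1e-12)
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()

    # Step 3: low-dimensional coordinates are the bottom eigenvectors of
    # M = (I - W)^T (I - W), skipping the constant eigenvector (eigenvalue ~ 0).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, vecs = np.linalg.eigh(M)  # eigenvalues returned in ascending order
    return vecs[:, 1:n_components + 1], W


# Simulated data (an assumption, not the 1000 Genomes or HapMap data):
# two populations whose allele frequencies differ slightly at each marker.
rng = np.random.default_rng(0)
n_per_pop, n_markers = 40, 500
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_markers), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_markers)),  # genotypes coded 0/1/2
    rng.binomial(2, freq_b, (n_per_pop, n_markers)),
]).astype(float)
geno -= geno.mean(axis=0)  # center each marker

embedding, W = lle(geno, n_neighbors=10, n_components=2)
print(embedding.shape)
```

Each row of `embedding` gives the low-dimensional coordinates of one individual, which can then be plotted or compared with population labels. The per-row sum-to-one constraint on `W` is what makes the reconstruction weights invariant to translation of the data, and it is also why the constant eigenvector is discarded in the last step.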

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Gene Frequency
  • Genetic Association Studies / methods
  • Genetic Association Studies / statistics & numerical data
  • Genetics, Population / methods
  • Genetics, Population / statistics & numerical data
  • Genome, Human / genetics*
  • Genomics / methods*
  • Genomics / statistics & numerical data
  • Genotype
  • HapMap Project
  • Human Genome Project
  • Humans
  • Polymorphism, Single Nucleotide / genetics*
  • Principal Component Analysis*
  • Sequence Analysis, DNA / methods
  • Sequence Analysis, DNA / statistics & numerical data