Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic

Andriy Derkach; Theodore Chiang; Jiafen Gong; Laura Addis; Sara Dobbins; Ian Tomlinson; Richard Houlston; Deb K Pal; Lisa J Strug

doi:10.1093/bioinformatics/btu196

Bioinformatics. 2014 Aug 1;30(15):2179-88. doi: 10.1093/bioinformatics/btu196. Epub 2014 Apr 14.

Authors

Andriy Derkach¹, Theodore Chiang¹, Jiafen Gong¹, Laura Addis¹, Sara Dobbins¹, Ian Tomlinson¹, Richard Houlston¹, Deb K Pal¹, Lisa J Strug²

Affiliations

¹ Department of Statistical Science, University of Toronto, Toronto, ON, Canada, Program in Child Health Evaluative Sciences, the Hospital for Sick Children Research Institute, Toronto, ON, Canada, Department of Clinical Neuroscience, Institute of Psychiatry, King's College London, London, Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, Molecular and Population Genetics and NIHR Comprehensive Biomedical Research Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
² Department of Statistical Science, University of Toronto, Toronto, ON, Canada, Program in Child Health Evaluative Sciences, the Hospital for Sick Children Research Institute, Toronto, ON, Canada, Department of Clinical Neuroscience, Institute of Psychiatry, King's College London, London, Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, Molecular and Population Genetics and NIHR Comprehensive Biomedical Research Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.

Abstract

Motivation: Sufficiently powered case-control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, such studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that replaces genotype calls with their expected values given the observed sequence data.
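The substitution step described above, replacing a hard genotype call with its expected value given the sequence data, can be illustrated with a minimal sketch. This is not the authors' RVS script. The function name `expected_genotype`, the use of per-sample genotype likelihoods P(D | G = g) as input, and the Hardy-Weinberg prior built from an assumed minor allele frequency `maf` are all assumptions made for illustration; the abstract does not specify these details.

```python
def expected_genotype(likelihoods, maf):
    """Posterior mean genotype E[G | D] for one sample at one variant.

    likelihoods: three values P(D | G = g) for g = 0, 1, 2 minor alleles
                 (hypothetical input; e.g. derived from aligned reads).
    maf:         assumed minor allele frequency, used here to build a
                 Hardy-Weinberg genotype prior (an illustrative choice).
    """
    # Hardy-Weinberg prior over genotypes 0, 1, 2
    prior = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    # Unnormalised posterior P(D | G = g) * P(G = g)
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("likelihoods give zero posterior mass")
    # Expected genotype: sum over g of g * P(G = g | D)
    return sum(g * j for g, j in enumerate(joint)) / total


# Reads that identify the genotype with certainty give back that genotype
certain_het = expected_genotype([0.0, 1.0, 0.0], maf=0.1)
# Uninformative reads (e.g. very low depth) fall back to the prior mean 2 * maf
uninformative = expected_genotype([1.0, 1.0, 1.0], maf=0.1)
```

With uninformative reads the expected value shrinks towards the prior mean rather than forcing a call of 0, 1, or 2, which is the intuition behind avoiding hard calls when read depth differs between cases and controls.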

Results: We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. Using simulated and real NGS data, we also demonstrate that the RVS method controls Type I error and has power comparable to that of the 'gold standard' analysis with the true underlying genotypes, for both common and rare variants.

Availability and implementation: An RVS R script and instructions can be found at strug.research.sickkids.ca, and at https://github.com/strug-lab/RVS.

Contact: lisa.strug@utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Analysis of Variance
Case-Control Studies
Child
Computational Biology / methods*
Control Groups
Data Interpretation, Statistical
Epilepsy, Rolandic / genetics
Gene Frequency
Genotype
High-Throughput Nucleotide Sequencing*
Human Genome Project
Humans
Likelihood Functions
Polymorphism, Single Nucleotide
