Prospects for building large timetrees using molecular data with incomplete gene coverage among species

Alan Filipski; Oscar Murillo; Anna Freydenzon; Koichiro Tamura; Sudhir Kumar

doi:10.1093/molbev/msu200

Mol Biol Evol. 2014 Sep;31(9):2542-50. doi: 10.1093/molbev/msu200. Epub 2014 Jun 27.

Authors

Alan Filipski¹, Oscar Murillo², Anna Freydenzon¹, Koichiro Tamura³, Sudhir Kumar⁴

Affiliations

¹ Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University.
² Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University; School of Life Sciences, Arizona State University.
³ Department of Biological Sciences, Tokyo Metropolitan University, Tokyo, Japan; Research Center for Genomics and Bioinformatics, Tokyo Metropolitan University, Tokyo, Japan.
⁴ Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University; School of Life Sciences, Arizona State University; Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia; Institute for Genomics and Evolutionary Medicine, Temple University; Department of Biology, Temple University. s.kumar@temple.edu

Abstract

Scientists are assembling sequence data sets from increasing numbers of species and genes to build comprehensive timetrees. However, data are often unavailable for some species and gene combinations, and the proportion of missing data is often large for data sets containing many genes and species. Surprisingly, there has not been a systematic analysis of the effect of the degree of sparseness of the species-gene matrix on the accuracy of divergence time estimates. Here, we present results from computer simulations and empirical data analyses to quantify the impact of missing gene data on divergence time estimation in large phylogenies. We found that estimates of divergence times were robust even when sequences from a majority of genes for most of the species were absent. From the analysis of such extremely sparse data sets, we found that the most egregious errors occurred for nodes in the tree that had no common genes for any pair of species in the immediate descendant clades of the node in question. These problematic nodes can be easily detected prior to computational analyses based only on the input sequence alignment and the tree topology. We conclude that it is best to use larger alignments, because adding both genes and species to the alignment augments the number of genes available for estimating divergence events deep in the tree and improves their time estimates.
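
The abstract states that problematic nodes can be detected from the alignment and the tree topology alone. The following is a minimal sketch of one way such a check might look, not the authors' implementation. It assumes the tree is given as nested tuples of species names and the species-gene matrix as a dictionary mapping each species to the set of genes sampled for it; the function name `find_unlinked_nodes` and both data layouts are hypothetical. A node is flagged when no two of its immediate descendant clades have a gene in common, which is equivalent to saying that no pair of species drawn from different descendant clades shares a gene.

```python
from itertools import combinations


def find_unlinked_nodes(tree, coverage):
    """Return the leaf sets of internal nodes whose child clades share no gene.

    tree     -- nested tuples of species names (hypothetical layout)
    coverage -- dict mapping species name to the set of genes sampled for it
    """
    flagged = []

    def visit(node):
        # A leaf is a species name: return its leaf set and its gene set.
        if isinstance(node, str):
            return frozenset([node]), set(coverage[node])

        # Visit children first so that flags are reported in post-order.
        children = [visit(child) for child in node]

        # The node is linked if any two child clades have a gene in common.
        linked = any(
            genes_a & genes_b
            for (_, genes_a), (_, genes_b) in combinations(children, 2)
        )

        leaves = frozenset().union(*(child_leaves for child_leaves, _ in children))
        genes = set().union(*(child_genes for _, child_genes in children))

        if not linked:
            flagged.append(leaves)
        return leaves, genes

    visit(tree)
    return flagged


# Toy example with made-up species and gene labels.
tree = ((("A", "B"), "C"), ("D", "E"))
coverage = {
    "A": {"g1"},
    "B": {"g2"},
    "C": {"g1", "g2"},
    "D": {"g3"},
    "E": {"g3"},
}
flagged = find_unlinked_nodes(tree, coverage)
```

In this toy example the node joining A and B is flagged because the two species were sampled for different genes, and the root is flagged because the (A, B, C) clade and the (D, E) clade have no gene in common. The node joining (A, B) with C is not flagged, since C shares genes with both A and B.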

Keywords: divergence time; incomplete data; timetree.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computer Simulation
Evolution, Molecular
Genes*
Models, Genetic
Phylogeny*
Sequence Alignment / methods*
Sequence Analysis, DNA
