Utilization of defined microbial communities enables effective evaluation of meta-genomic assemblies

William W Greenwald; Niels Klitgord; Victor Seguritan; Shibu Yooseph; J Craig Venter; Chad Garner; Karen E Nelson; Weizhong Li

doi:10.1186/s12864-017-3679-5

BMC Genomics. 2017 Apr 13;18(1):296. doi: 10.1186/s12864-017-3679-5.

Authors

William W Greenwald¹, Niels Klitgord², Victor Seguritan², Shibu Yooseph³, J Craig Venter²,⁴, Chad Garner², Karen E Nelson²,⁴, Weizhong Li⁵,⁶

Affiliations

¹ Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA.
² Human Longevity Inc, San Diego, CA, USA.
³ Department of Computer Science, University of Central Florida, Orlando, FL, USA.
⁴ J. Craig Venter Institute, La Jolla, CA, USA.
⁵ Human Longevity Inc, San Diego, CA, USA. wli@humanlongevity.com.
⁶ J. Craig Venter Institute, La Jolla, CA, USA. wli@humanlongevity.com.

Abstract

Background: Metagenomics is the study of the microbial genomes isolated from communities found on our bodies or in our environment. By correctly determining the relation between human health and human-associated microbial communities, novel mechanisms of health and disease can be found, enabling the development of novel diagnostics and therapeutics. Because these microbial communities are so diverse, strategies developed for aligning human genomes cannot be used, and the genomes of the microbial species in a community must be assembled de novo. Obtaining the best metagenomic assemblies, however, requires choosing the proper assembler. Metagenomics is evolving rapidly, new assemblers are constantly created, and the field has not yet agreed on a standardized process. Furthermore, the truth sets used to compare these methods are either too simple (computationally derived diverse communities) or too complex (microbial communities of unknown composition), yielding results that are hard to interpret. In this analysis, we interrogate the strengths and weaknesses of five popular assemblers using defined biological samples of known genomic composition and abundance. We assessed each assembler on its ability to reassemble genomes, call taxonomic abundances, and recreate open reading frames (ORFs).

Results: We tested five metagenomic assemblers (Omega, metaSPAdes, IDBA-UD, metaVelvet and MEGAHIT) on known and synthetic metagenomic data sets. MetaSPAdes excelled in diverse sets, IDBA-UD performed well all around, metaVelvet had high accuracy in high-abundance organisms, and MEGAHIT was able to accurately differentiate similar organisms within a community. At the ORF level, metaSPAdes and MEGAHIT had the fewest missing ORFs within diverse and similar communities, respectively.
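To make the ORF-level comparison concrete, the following is a minimal sketch of how missing ORFs could be counted once each assembly's recovered ORFs have been matched against a reference set. It is an illustration only and not the pipeline used in the study: the function names, the ORF identifiers, and the labels `assembler_A` and `assembler_B` are hypothetical, and the step that matches predicted ORFs to reference ORFs is assumed to have been done already.

```python
def missing_orfs(reference_orfs, recovered_orfs):
    """Return the reference ORF identifiers that an assembly failed to recover."""
    return set(reference_orfs) - set(recovered_orfs)


def missing_fraction(reference_orfs, recovered_orfs):
    """Fraction of reference ORFs absent from an assembly (0.0 = none missing)."""
    reference = set(reference_orfs)
    if not reference:
        raise ValueError("reference ORF set is empty")
    return len(missing_orfs(reference, recovered_orfs)) / len(reference)


def rank_assemblers(reference_orfs, recovered_by_assembler):
    """Sort assembler labels from fewest to most missing ORFs (ties by label)."""
    return sorted(
        recovered_by_assembler,
        key=lambda name: (
            len(missing_orfs(reference_orfs, recovered_by_assembler[name])),
            name,
        ),
    )


# Hypothetical toy input: identifiers and assembler labels are placeholders,
# not results from the paper.
reference = {"orf1", "orf2", "orf3", "orf4"}
recovered = {
    "assembler_A": {"orf1", "orf2", "orf3"},
    "assembler_B": {"orf1", "orf4"},
}
ranking = rank_assemblers(reference, recovered)
```

Because the sample composition is known in advance, the reference set is fixed and the same for every assembler, which is what makes a per-assembler count of this kind directly comparable.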

Conclusions: The appropriate assembler depends on the metagenomic question being asked. Because different assemblers will give different answers to the same question, it is important to define the biological problem of an experiment clearly and to choose the assembler accordingly.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Chromosome Mapping / methods*
Computational Biology / methods*
Data Accuracy
Genome, Bacterial
Humans
Metagenomics / methods*
Open Reading Frames
Software