Robust expansion of phylogeny for fast-growing genome sequence data

PLoS Comput Biol. 2024 Feb 8;20(2):e1011871. doi: 10.1371/journal.pcbi.1011871. eCollection 2024 Feb.

Abstract

Massive sequencing of SARS-CoV-2 genomes has created an urgent need for novel methods that use existing phylogenies to add new samples efficiently instead of inferring trees de novo. 'TIPars' was developed to meet this challenge by integrating parsimony analysis with pre-computed ancestral sequences. It took about 21 seconds to insert 100 SARS-CoV-2 genomes into a 100k-taxa reference tree using 1.4 gigabytes of memory. In benchmarks on four datasets, TIPars achieved the highest accuracy for phylogenies of moderately similar sequences. For highly similar and for divergent sequences, fully parsimony-based and likelihood-based phylogenetic placement methods performed best, respectively, with TIPars second best. TIPars thus achieved efficient and accurate expansion of phylogenies of both similar and divergent sequences, which should have broad biological applications beyond SARS-CoV-2. TIPars is accessible at https://tipars.hku.hk/ and its source code is available at https://github.com/id-bioinfo/TIPars.
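
To illustrate the general idea of parsimony-based insertion using pre-computed ancestral sequences, the following is a minimal sketch and not the TIPars implementation. It assumes each branch of the reference tree is described by the sequences at its two ends (the parent one being an ancestral reconstruction) and scores a query by the number of sites where it matches neither end, which is the extra substitution count needed to attach the query on that branch under a simple unweighted model. The function names, the branch labels, and the toy sequences are all hypothetical.

```python
def added_cost(parent, child, query):
    """Count aligned sites where the query matches neither end of the branch.

    Under unweighted parsimony, each such site needs one extra substitution
    if the query is attached somewhere along this branch.
    """
    return sum(1 for a, b, q in zip(parent, child, query) if q != a and q != b)


def best_branch(branches, query):
    """Return the name of the branch with the lowest added cost.

    branches: dict mapping a branch name to (parent_seq, child_seq).
    Ties are resolved in favour of the first branch encountered.
    """
    return min(branches, key=lambda name: added_cost(*branches[name], query))


# Hypothetical toy reference tree: three branches with aligned sequences.
branches = {
    "root-n1": ("ACGTAC", "ACGTTC"),
    "n1-tipA": ("ACGTTC", "ACGTTG"),
    "n1-tipB": ("ACGTTC", "TCGTTC"),
}
query = "ACGTTG"

print(best_branch(branches, query))
```

In this toy case the query is identical to the child end of one branch, so that branch has an added cost of zero and is selected. A real placement method would also need to handle gaps, ambiguity codes, ties between branches, and the branch lengths of the inserted node, none of which this sketch attempts.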

MeSH terms

  • Genome*
  • Likelihood Functions
  • Phylogeny
  • SARS-CoV-2 / genetics
  • Software*

Grants and funding

This project is supported by the National Natural Science Foundation of China’s Excellent Young Scientists Fund (Hong Kong and Macau) (31922087; TL), the Hong Kong Research Grants Council’s General Research Fund (17150816; TL), the Health and Medical Research Fund (COVID1903011-WP1; TL), the Innovation and Technology Commission’s InnoHK funding (D24H; TL, JW, YG, HZ), and funding support from the Guangdong Government (2019B121205009, HZQB-KCZYZ-2021014, 200109155890863, 190830095586328 and 190824215544727; YG, HZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.