Leveraging machine learning for taxonomic classification of emerging astroviruses

Front Mol Biosci. 2024 Jan 11:10:1305506. doi: 10.3389/fmolb.2023.1305506. eCollection 2023.

Abstract

Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds, with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component that identifies recombinant sequences was added to the method's pipeline to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach, augmented by a principal component analysis (PCA), provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable method for predicting taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
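To make the notion of k-mer composition concrete, the following is a minimal illustrative sketch of how a genome sequence could be turned into a normalized k-mer frequency vector, the kind of alignment-free feature that a supervised or unsupervised learner might consume. It is not the authors' pipeline: the function name `kmer_frequencies`, the choice of k = 3, and the decision to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Return normalized k-mer frequencies over the alphabet A, C, G, T.

    Hypothetical helper for illustration only; the value of k and the
    handling of ambiguous bases are assumptions, not taken from the paper.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    # Count every overlapping window of length k, skipping windows that
    # contain ambiguous symbols such as N.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Emit all 4**k possible k-mers in a fixed order so that vectors from
    # different genomes are directly comparable.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


if __name__ == "__main__":
    # Toy sequence, not a real astrovirus genome.
    freqs = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(freqs), round(sum(freqs.values()), 6))
```

Because every genome maps to a vector of the same length (4^k entries), such vectors can be compared directly or passed to a dimensionality reduction step such as PCA, regardless of genome length.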

Keywords: Avastrovirus; Mamastrovirus; alignment-free classification; family Astroviridae; genomic signature; k-mer frequency; machine learning; viral classification and clustering.

Grants and funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by Natural Sciences and Engineering Research Council of Canada Grants R3511A12 to KH and RGPIN-2023-03663 to LK. This research was enabled in part by support provided by Compute Canada RPP (Research Platforms Portals), https://www.computecanada.ca/, Grant 616 to KH and LK. The funders had no role in the preparation of the manuscript.