Effects of data transformation and model selection on feature importance in microbiome classification data

Zuzanna Karwowska; Oliver Aasmets; Estonian Biobank research team; Tomasz Kosciolek; Elin Org

doi:10.1186/s40168-024-01996-6

Microbiome. 2025 Jan 4;13(1):2. doi: 10.1186/s40168-024-01996-6.

Authors

Zuzanna Karwowska^#¹ ² ³, Oliver Aasmets^#⁴; Estonian Biobank research team; Tomasz Kosciolek⁵ ⁶ ⁷, Elin Org⁸

Collaborators

Estonian Biobank research team:
Mait Metspalu, Andres Metspalu, Lili Milani, Tõnu Esko

Affiliations

¹ Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
² Doctoral School of Exact and Natural Sciences, Jagiellonian University, Krakow, Poland.
³ Sano Centre for Computational Medicine, Krakow, Poland.
⁴ Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia.
⁵ Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland. t.kosciolek@sanoscience.org.
⁶ Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland. t.kosciolek@sanoscience.org.
⁷ Sano Centre for Computational Medicine, Krakow, Poland. t.kosciolek@sanoscience.org.
⁸ Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia. elin.org@ut.ee.

^# Contributed equally.

Abstract

Background: Accurate classification of host phenotypes from microbiome data is crucial for advancing microbiome-based therapies, with machine learning offering effective solutions. However, the complexity of the gut microbiome, data sparsity, compositionality, and population-specificity present significant challenges. Microbiome data transformations can alleviate some of these challenges, but their use in machine learning tasks remains largely unexplored.

Results: Our analysis of over 8500 samples from 24 shotgun metagenomic datasets showed that it is possible to classify healthy and diseased individuals using microbiome data with minimal dependence on the choice of algorithm or transformation. Presence-absence transformations performed comparably to abundance-based transformations, and only a small subset of predictors is necessary for accurate classification. However, while different transformations resulted in comparable classification performance, the most important features varied significantly, which highlights the need to reevaluate machine learning-based biomarker detection.

Conclusions: Microbiome data transformations can significantly influence feature selection but have a limited effect on classification accuracy. Our findings suggest that while classification is robust across different transformations, the variation in feature selection necessitates caution when using machine learning for biomarker identification. This research provides valuable insights for applying machine learning to microbiome data and identifies important directions for future work.
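
The contrast the abstract draws between presence-absence and abundance-based transformations can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the toy count table, and the pseudocount of 0.5 are assumptions made for illustration, and relative abundance and the centered log-ratio are used only as common examples of abundance-based transformations, since the abstract does not list which ones were evaluated.

```python
import numpy as np

def presence_absence(counts):
    """Return 1 where a taxon was observed in a sample, 0 otherwise."""
    return (np.asarray(counts) > 0).astype(int)

def relative_abundance(counts):
    """Total-sum scaling: divide each sample (row) by its total count.

    Assumes every sample has at least one non-zero count.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals

def clr(counts, pseudocount=0.5):
    """Centered log-ratio per sample.

    The pseudocount avoids log(0); its value is an illustrative assumption.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Toy count table: 3 samples (rows) x 4 taxa (columns); values are made up.
toy_counts = np.array([
    [10, 0, 5, 85],
    [0, 3, 0, 97],
    [20, 20, 0, 60],
])

pa = presence_absence(toy_counts)
ra = relative_abundance(toy_counts)
clr_values = clr(toy_counts)
```

Each transformed table keeps the shape of the input, so any of them could be passed to the same classifier. Comparing which features a model ranks highest under each representation is the kind of check the abstract's conclusions call for.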

MeSH terms

Algorithms*
Bacteria / classification
Bacteria / genetics
Biomarkers
Gastrointestinal Microbiome* / genetics
Humans
Machine Learning*
Metagenome
Metagenomics* / methods
Microbiota / genetics

Substances

Biomarkers