Predictive Modeling for Metabolomics Data

Methods Mol Biol. 2020:2104:313-336. doi: 10.1007/978-1-0716-0239-3_16.

Abstract

In recent years, mass spectrometry (MS)-based metabolomics has been extensively applied to characterize biochemical mechanisms and to study physiological processes and phenotypic changes associated with disease. Metabolomics has also been important for identifying biomarkers suitable for clinical diagnosis. In this chapter, we review supervised learning algorithms for predictive modeling, such as random forest (RF), support vector machine (SVM), and partial least squares-discriminant analysis (PLS-DA). We also review feature selection methods for identifying the combination of metabolites that yields an accurate predictive model. We conclude with best practices for reproducibility: including internal and external replication, reporting metrics to assess performance, and following guidelines to avoid overfitting and to deal with imbalanced classes. An analysis of an example dataset illustrates the use of different machine learning methods and performance metrics.
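
The point about performance metrics and imbalanced classes can be illustrated with a small sketch. It is not taken from the chapter: the toy labels and scores, and the function names confusion_counts, accuracy, balanced_accuracy, and roc_auc, are hypothetical and chosen only for illustration. It shows that, with 2 cases among 10 samples, a classifier that always predicts the majority class still reaches 80% accuracy, whereas balanced accuracy and the area under the ROC curve (AUC) reveal how well the minority class is actually separated.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative (ties count as one half). Quadratic in sample size, which is
    fine for a toy example."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced data: 2 cases (1) and 8 controls (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical classifier scores (higher means "more likely a case").
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.05]

# Predictions from thresholding the scores at 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
# A trivial baseline that always predicts the majority class.
y_majority = [0] * len(y_true)

print("thresholded: acc=%.4f bal_acc=%.4f" % (accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred)))
print("majority:    acc=%.4f bal_acc=%.4f" % (accuracy(y_true, y_majority), balanced_accuracy(y_true, y_majority)))
print("AUC of scores: %.4f" % roc_auc(y_true, scores))
```

On this toy data both sets of predictions have the same accuracy, so accuracy alone cannot distinguish the thresholded classifier from the trivial baseline; balanced accuracy and AUC can. In practice an established library implementation would normally be preferred over hand-written metric functions.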

Keywords: Mass spectrometry; Metabolomics; Performance metrics; Predictive modeling; Supervised learning.

MeSH terms

  • Area Under Curve
  • Data Interpretation, Statistical*
  • Databases, Factual
  • Decision Trees
  • Discriminant Analysis
  • Least-Squares Analysis
  • Mass Spectrometry
  • Metabolomics* / statistics & numerical data
  • Models, Theoretical*
  • ROC Curve
  • Reproducibility of Results
  • Support Vector Machine