Systematic comparison of published host gene expression signatures for bacterial/viral discrimination

Genome Med. 2022 Feb 21;14(1):18. doi: 10.1186/s13073-022-01025-x.

Abstract

Background: Measuring host gene expression is a promising diagnostic strategy to discriminate bacterial and viral infections. Multiple signatures of varying size, complexity, and target populations have been described. However, there is little information to indicate how the performance of the various published signatures compares.

Methods: This systematic comparison of host gene expression signatures evaluated the performance of 28 signatures, validating them in 4589 subjects from 51 publicly available datasets. Thirteen COVID-specific datasets with 1416 subjects were included in a separate analysis. Individual signature performance was evaluated using the area under the receiver operating characteristic curve (AUC) value. Overall signature performance was evaluated using median AUCs and accuracies.
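As an illustration of the evaluation metric described in the Methods, the following is a minimal sketch of how a per-dataset AUC and a median AUC across datasets could be computed. It is not the authors' pipeline: the function names (auc, median_auc) and the toy labels and scores are hypothetical, and the rank-based (Mann-Whitney) formulation of the AUC is assumed here for simplicity.

```python
from statistics import median


def auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def median_auc(datasets):
    """Median of the per-dataset AUCs for one signature.
    `datasets` is an iterable of (labels, scores) pairs."""
    return median(auc(labels, scores) for labels, scores in datasets)


# Hypothetical toy data: three small "datasets" scored by one signature.
# Label 1 = target class (e.g. bacterial), 0 = everything else.
toy_datasets = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),  # every positive outranks every negative
    ([1, 0, 1, 0], [0.7, 0.6, 0.4, 0.2]),  # one misordered pair out of four
    ([1, 1, 0, 0], [0.6, 0.2, 0.5, 0.1]),  # one misordered pair out of four
]

per_dataset = [auc(labels, scores) for labels, scores in toy_datasets]
overall = median_auc(toy_datasets)
```

Summarizing with the median rather than the mean keeps a single unusually easy or hard dataset from dominating a signature's overall score.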

Results: Signature performance varied widely, with median AUCs ranging from 0.55 to 0.96 for bacterial classification and from 0.69 to 0.97 for viral classification. Signature size varied (1-398 genes), with smaller signatures generally performing more poorly (P < 0.04). Viral infection was easier to diagnose than bacterial infection (84% vs. 79% overall accuracy, respectively; P < 0.001). Host gene expression classifiers performed more poorly in some pediatric populations (3 months-1 year and 2-11 years) compared to the adult population for both bacterial infection (73% and 70% vs. 82%, respectively; P < 0.001) and viral infection (80% and 79% vs. 88%, respectively; P < 0.001). We did not observe classification differences based on illness severity, as defined by ICU admission, for bacterial or viral infections. The median AUC across all signatures for COVID-19 classification was 0.80, compared to 0.83 for viral classification in the same datasets.

Conclusions: In this systematic comparison of 28 host gene expression signatures, we observed performance differences based on a signature's size and on characteristics of the validation population, including age and infection type. However, populations used for signature discovery did not impact performance, underscoring the redundancy among many of these signatures. Furthermore, differential performance in specific populations may only be observable through this type of large-scale validation.

Keywords: Biomarkers; Diagnostics; Gene expression; Infectious disease; Machine learning.

Publication types

  • Comparative Study
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Bacterial Infections / diagnosis*
  • Bacterial Infections / epidemiology
  • Bacterial Infections / genetics
  • Biomarkers / analysis
  • COVID-19 / diagnosis
  • COVID-19 / genetics
  • Child
  • Cohort Studies
  • Datasets as Topic / statistics & numerical data*
  • Diagnosis, Differential
  • Gene Expression Profiling / statistics & numerical data
  • Genetic Association Studies / statistics & numerical data
  • Host-Pathogen Interactions / genetics*
  • Humans
  • Publications / statistics & numerical data
  • SARS-CoV-2 / pathogenicity
  • Transcriptome*
  • Validation Studies as Topic
  • Virus Diseases / diagnosis*
  • Virus Diseases / epidemiology
  • Virus Diseases / genetics

Substances

  • Biomarkers