Score matching for differential abundance testing of compositional high-throughput sequencing data

Johannes Ostner; Hongzhe Li; Christian L Müller

doi:10.1101/2024.12.05.627006

bioRxiv [Preprint]. 2024 Dec 9:2024.12.05.627006. doi: 10.1101/2024.12.05.627006.

Authors

Johannes Ostner¹,², Hongzhe Li³, Christian L Müller¹,²,⁴

Affiliations

¹ Computational Health Center, Helmholtz Munich, Neuherberg, Germany.
² Institut für Statistik, Ludwig-Maximilians-Universität München, Munich, Germany.
³ Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
⁴ Center for Computational Mathematics, Flatiron Institute, New York, NY, USA.

Abstract

The class of a-b power interaction models, proposed by Yu et al. (2024), provides a general framework for modeling sparse compositional count data with pairwise feature interactions. This class includes many distributions as special cases and enables zero count handling through power transformations, making it especially suitable for modern high-throughput sequencing data with excess zeros, including single-cell RNA-Seq and amplicon sequencing data. Here, we present an extension of this class of models that can include covariate information, allowing for accurate characterization of covariate dependencies in heterogeneous populations. Combining this model with a tailored differential abundance (DA) test leads to a novel DA testing scheme, cosmoDA, that can reduce false positive detection caused by correlated features. cosmoDA uses the generalized score matching estimation framework for power interaction models. Our benchmarks on simulated and real data show that cosmoDA can accurately estimate feature interactions in the presence of population heterogeneity and significantly reduces the false discovery rate when testing for differential abundance of correlated features. Finally, cosmoDA provides an explicit link to popular Box-Cox-type data transformations and makes it possible to assess the impact of zero replacement and power transformations on downstream differential abundance results. cosmoDA is available at https://github.com/bio-datascience/cosmoDA.
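
To illustrate why power transformations can handle zero counts where the logarithm cannot, the following is a minimal sketch of a generic Box-Cox-type transform applied to relative abundances. It is not the cosmoDA API and does not reproduce the exact parameterization of the a-b power interaction model, which the abstract does not spell out; the function name `power_transform`, the exponent value 0.5, and the count vector are assumptions chosen for illustration only.

```python
import numpy as np


def power_transform(x, a):
    """Box-Cox-type transform: (x**a - 1) / a for a > 0, log(x) for a == 0."""
    x = np.asarray(x, dtype=float)
    if a == 0:
        # The log limit is undefined at zero entries (evaluates to -inf).
        with np.errstate(divide="ignore"):
            return np.log(x)
    return (x ** a - 1.0) / a


# Hypothetical count vector for one sample, with one unobserved feature.
counts = np.array([0, 3, 12, 85])
props = counts / counts.sum()  # relative abundances on the simplex

power_vals = power_transform(props, a=0.5)  # finite everywhere, zero included
log_vals = power_transform(props, a=0)      # -inf at the zero entry
```

With a positive exponent, the zero entry maps to a finite value, so no pseudocount or zero replacement is needed before transforming; the log transform, by contrast, requires the zero to be replaced first.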

Keywords: Compositional data; Differential abundance; Generative model; Microbiome; Score matching; Single-cell RNA sequencing.

Publication types

Preprint

Grants and funding

R01 GM123056/GM/NIGMS NIH HHS/United States