A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data

Isabella N Grabski; Rafael A Irizarry

doi:10.1093/biostatistics/kxac021

Biostatistics. 2022 Oct 14;23(4):1150-1164. doi: 10.1093/biostatistics/kxac021.

Authors

Isabella N Grabski¹, Rafael A Irizarry²

Affiliations

¹ Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
² Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA and Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.

Abstract

Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find that current approaches are limited either by their reliance on known marker genes or by overfitting caused by systematic differences between studies, known as batch effects. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.
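The idea of annotating a cell probabilistically against cell-type barcodes can be illustrated with a toy sketch. This is not the model from the paper: it assumes each reference cell type is summarized by a binary on/off barcode over genes, that counts follow a Poisson distribution with one of two fixed rates, and that batch effects (which the approach described above accounts for) can be ignored. The function names, rates, cell types, and counts are all hypothetical.

```python
import math


def log_poisson(k, rate):
    """Log probability of observing count k under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def annotate(counts, barcodes, rate_off=0.1, rate_on=5.0, prior=None):
    """Return posterior probabilities over cell types for a single cell.

    counts   : list of per-gene counts for the cell
    barcodes : dict mapping cell-type name -> list of 0/1 gene states
    rate_off, rate_on : assumed Poisson rates for "off" and "on" genes
                        (illustrative values, not estimated from data)
    prior    : optional dict of prior probabilities; uniform if omitted
    """
    types = list(barcodes)
    if prior is None:
        prior = {t: 1.0 / len(types) for t in types}

    # Log prior plus log likelihood of the counts under each barcode.
    log_post = {}
    for t in types:
        total = math.log(prior[t])
        for k, on in zip(counts, barcodes[t]):
            total += log_poisson(k, rate_on if on else rate_off)
        log_post[t] = total

    # Normalize with the max subtracted for numerical stability (Bayes' rule).
    top = max(log_post.values())
    weights = {t: math.exp(v - top) for t, v in log_post.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


# Hypothetical four-gene barcodes for three reference cell types.
barcodes = {
    "T cell": [1, 1, 0, 0],
    "B cell": [0, 0, 1, 1],
    "Monocyte": [0, 1, 0, 1],
}
cell = [6, 4, 0, 0]  # made-up counts for one query cell
posterior = annotate(cell, barcodes)
best = max(posterior, key=posterior.get)
```

Because the output is a full posterior rather than a single label, an ambiguous cell shows up as probability spread across several types instead of a forced assignment.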

Keywords: Single-cell RNA-seq.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural

MeSH terms

Gene Expression
Gene Expression Profiling* / methods
Humans
RNA-Seq
Sequence Analysis, RNA / methods
Single-Cell Analysis*
Software
