Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell RNA-sequencing

Nat Commun. 2020 Jun 1;11(1):2704. doi: 10.1038/s41467-020-16522-z.

Abstract

Index hopping is the main cause of incorrect sample assignment of sequencing reads in multiplexed pooled libraries. We introduce a statistical model for estimating the sample index-hopping rate in multiplexed droplet-based single-cell RNA-seq data and for probabilistic inference of the true sample of origin of hopped reads. We analyze several datasets and estimate the sample index-hopping probability to range between 0.003 and 0.009, a small number that counter-intuitively gives rise to a large fraction of phantom molecules: the fraction exceeds 8% in more than 25% of samples and reaches as high as 85% in low-complexity samples. Phantom molecules lead to widespread complications in downstream analyses, including transcriptome mixing across cells, emergence of phantom copies of cells from other samples, and misclassification of empty droplets as cells. We demonstrate that our approach can correct for these artifacts by accurately purging the majority of phantom molecules from the data.
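
To see why a hopping probability well below 1% can still contaminate a sample heavily, a back-of-the-envelope calculation may help. The sketch below is not the statistical model introduced in the paper. It is a toy read-level illustration that assumes each read independently keeps its sample index with probability 1 - p and otherwise lands uniformly on one of the other samples in the lane. The function name `expected_phantom_fraction` and the read counts are hypothetical, chosen only to contrast seven deep samples with one low-complexity sample.

```python
def expected_phantom_fraction(reads_per_sample, hop_prob):
    """Expected fraction of reads observed in each sample that hopped in.

    Toy assumption (not the paper's model): every read hops independently
    with probability hop_prob, and a hopped read is equally likely to land
    on any of the other samples.
    """
    n_samples = len(reads_per_sample)
    if n_samples < 2:
        raise ValueError("need at least two multiplexed samples")
    total = sum(reads_per_sample)
    fractions = []
    for own in reads_per_sample:
        # Reads that originate in this sample and keep their index.
        stayed = own * (1.0 - hop_prob)
        # Reads from all other samples that hop onto this sample's index.
        incoming = (total - own) * hop_prob / (n_samples - 1)
        fractions.append(incoming / (incoming + stayed))
    return fractions


# Hypothetical lane: seven deep samples and one low-complexity sample.
reads = [100_000_000] * 7 + [100_000]
fractions = expected_phantom_fraction(reads, hop_prob=0.005)
for n_reads, frac in zip(reads, fractions):
    print(f"{n_reads:>11,d} reads -> expected phantom fraction {frac:.4f}")
```

Under these assumptions the deep samples stay below 1% contamination, while most reads observed in the small sample are hopped ones, because the absolute number of incoming reads depends on the size of the other samples rather than on its own. The paper's estimates concern molecules rather than reads, so the numbers here are only qualitatively comparable.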

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Artifacts*
  • Computer Simulation
  • High-Throughput Nucleotide Sequencing / methods*
  • High-Throughput Nucleotide Sequencing / standards
  • Humans
  • Models, Statistical*
  • RNA / analysis*
  • RNA / genetics
  • Reproducibility of Results
  • Single-Cell Analysis / methods*
  • Single-Cell Analysis / standards

Substances

  • RNA