Accurate estimation of molecular counts from amplicon sequence data with unique molecular identifiers

Bioinformatics. 2023 Jan 1;39(1):btad002. doi: 10.1093/bioinformatics/btad002.

Abstract

Motivation: Amplicon sequencing is widely applied to explore heterogeneity and rare variants in genetic populations. Resolving true biological variants and quantifying their abundance is crucial for downstream analyses, but measured abundances are distorted by stochasticity and bias in amplification, plus errors during polymerase chain reaction (PCR) and sequencing. One solution attaches unique molecular identifiers (UMIs) to sample sequences before amplification. Counting UMIs instead of sequences provides unbiased estimates of abundance. While modern methods improve over naïve counting by UMI identity, most do not account for UMI reuse or collision, and they do not adequately model PCR and sequencing errors in the UMIs and sample sequences.

Results: We introduce Deduplication and Abundance estimation with UMIs (DAUMI), a probabilistic framework to detect true biological amplicon sequences and accurately estimate their deduplicated abundance. DAUMI recognizes UMI collision, even on highly similar sequences, and detects and corrects most PCR and sequencing errors in the UMI and sampled sequences. DAUMI performs better on simulated and real data compared to other UMI-aware clustering methods.

Availability and implementation: Source code is available at https://github.com/DormanLab/AmpliCI.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Cluster Analysis
  • High-Throughput Nucleotide Sequencing* / methods
  • Polymerase Chain Reaction
  • Sequence Analysis, DNA / methods
  • Software*