Feature selection followed by a novel residuals-based normalization simplifies and improves single-cell gene expression analysis

bioRxiv [Preprint]. 2024 May 9:2023.03.02.530891. doi: 10.1101/2023.03.02.530891.

Abstract

Normalization is a crucial step in the analysis of single-cell RNA-sequencing (scRNA-seq) counts data. Its principal objectives are to reduce the systematic biases primarily introduced through technical sources and to transform the data to make it more amenable for application of established statistical frameworks. In the standard workflows, normalization is followed by feature selection to identify highly variable genes (HVGs) that capture most of the biologically meaningful variation across the cells. Here, we make the case for a revised workflow by proposing a simple feature selection method and showing that we can perform feature selection before normalization by relying on observed counts. We highlight that the feature selection step can be used to not only select HVGs but to also identify stable genes. We further propose a novel variance stabilization transformation inclusive residuals-based normalization method that in fact relies on the stable genes to inform the reduction of systematic biases. We demonstrate significant improvements in downstream clustering analyses through the application of our proposed methods on biological truth-known as well as simulated counts datasets. We have implemented this novel workflow for analyzing high-throughput scRNA-seq data in an R package called Piccolo.

Publication types

  • Preprint