Higher order asymptotics for negative binomial regression inferences from RNA-sequencing data

Stat Appl Genet Mol Biol. 2013 Mar 26;12(1):49-70. doi: 10.1515/sagmb-2012-0071.

Abstract

RNA sequencing (RNA-Seq) is the current method of choice for characterizing transcriptomes and quantifying gene expression changes. This next-generation sequencing-based method provides unprecedented depth and resolution. The negative binomial (NB) probability distribution has been shown to be a useful model for frequencies of mapped RNA-Seq reads and consequently provides a basis for statistical analysis of gene expression. Negative binomial exact tests are available for two-group comparisons but do not extend to negative binomial regression analysis, which is important for examining gene expression as a function of explanatory variables and for adjusted group comparisons that account for other factors. We address the adequacy of available large-sample tests for the small sample sizes typically available from RNA-Seq studies and consider a higher-order asymptotic (HOA) adjustment to likelihood ratio tests. We demonstrate that (1) the HOA-adjusted likelihood ratio test is practically indistinguishable from the exact test in situations where the exact test is available, (2) the type I error of the HOA test matches the nominal specification in the regression settings we examined via simulation, and (3) the power of the likelihood ratio test does not appear to be affected by the HOA adjustment. This work helps clarify the accuracy of the unadjusted likelihood ratio test and the degree of improvement available with the HOA adjustment. Furthermore, the HOA test may be preferable even when the exact test is available because it does not require ad hoc library size adjustments.
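
As a rough illustration of the unadjusted large-sample likelihood ratio test that the abstract takes as its starting point, the sketch below compares two groups of read counts for a single gene under an NB model. It is not the authors' implementation and it does not include the HOA adjustment. The function names (`nb_loglik`, `lrt_two_group`), the example counts, and the dispersion value are hypothetical. The sketch also assumes that the dispersion is known and that all libraries have the same size, so that the maximum likelihood estimate of each mean is simply the sample mean.

```python
import math


def nb_loglik(counts, mu, phi):
    """NB log-likelihood with mean mu and dispersion phi (variance = mu + phi * mu**2).

    Assumption: phi is treated as known rather than estimated.
    """
    k = 1.0 / phi  # NB size parameter
    total = 0.0
    for y in counts:
        total += math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        total += k * math.log(k / (k + mu))
        if y > 0:  # skip the 0 * log(0) case when mu == 0
            total += y * math.log(mu / (k + mu))
    return total


def lrt_two_group(group_a, group_b, phi):
    """Unadjusted likelihood ratio test of equal means in two groups.

    Assumption: equal library sizes, so no offset term is needed and the
    MLE of each NB mean (for fixed dispersion) is the sample mean.
    Returns the test statistic and its chi-square (1 df) p-value.
    """
    mu_a = sum(group_a) / len(group_a)
    mu_b = sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    mu_0 = sum(pooled) / len(pooled)

    ll_alt = nb_loglik(group_a, mu_a, phi) + nb_loglik(group_b, mu_b, phi)
    ll_null = nb_loglik(pooled, mu_0, phi)

    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts for one gene, three replicates per group.
    stat, p = lrt_two_group([5, 7, 6], [50, 60, 55], phi=0.1)
    print(f"LRT statistic = {stat:.2f}, p-value = {p:.3g}")
```

The chi-square reference distribution used here is a large-sample approximation. With only a few replicates per group, as in this example, its accuracy is exactly the question the abstract raises.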

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Arabidopsis / genetics
  • Base Sequence
  • Computer Simulation
  • Gene Expression Profiling / methods*
  • High-Throughput Nucleotide Sequencing
  • Likelihood Functions
  • Models, Genetic*
  • Models, Statistical
  • Poisson Distribution
  • Pseudomonas syringae / genetics
  • RNA, Bacterial / genetics
  • RNA, Plant / genetics
  • Regression Analysis
  • Sequence Analysis, RNA*

Substances

  • RNA, Bacterial
  • RNA, Plant