Impact of RNA-seq attributes on false positive rates in differential expression analysis of de novo assembled transcriptomes

BMC Res Notes. 2013 Dec 3:6:503. doi: 10.1186/1756-0500-6-503.

Abstract

Background: High-throughput RNA sequencing studies are becoming increasingly popular and differential expression studies represent an important downstream analysis that often follow de novo transcriptome assembly. If a lot of attention has been given to bioinformatics tools for differential gene expression, little has yet been given to the impact of the sequence data itself used in pipelines.

Results: We tested how using different types of reads from the ones used to assemble a de novo transcriptome (both differing in length and pairing attributes) could potentially affect differential expression (DE) results. To investigate this, we created artificial datasets out of long paired-end RNA-seq datasets initially used to build the assembly. All datasets were compared via DE analyses and because all samples come from the same sequencing run, DE of genes or isoforms can be interpreted as false positives resulting from sequence attributes. If the false positive rate for differential gene expression does not seem to be strongly affected by sequencing strategy (max. of 3.5%), it could reach 12.2% or 28.1% for differential isoform expression depending of the pipeline used. The effect of paired-end vs. single-end strategy was found to have a much greater impact in terms of false positives than sequence length.

Conclusion: In light of false positive rate results, we recommend using paired-end over single-end sequences in differential expression studies, even if the impact is less serious for differential gene expression.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Gene Expression Profiling
  • High-Throughput Nucleotide Sequencing
  • Sequence Analysis, RNA*
  • Transcriptome*