SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

Biomed Res Int. 2015:2015:780519. doi: 10.1155/2015/780519. Epub 2015 Dec 29.

Abstract

The performance evaluation of fusion-detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generating an unlimited number of true and false positive events, which in turn allows robust estimation of accuracy measures such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size, which makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse uses real sequencing data as the background for the fusions, so as to closely approximate the distribution of reads from a real sequencing library, and uses a reference genome as the template from which to simulate the fusions' supporting reads. To assess performance as a function of the number of supporting reads, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, from which the performance metrics needed to validate and compare alternative fusion-detection algorithms can be rigorously estimated.
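
The general idea described in the abstract (supporting reads drawn from a fusion template, generated at several supporting-read counts, then scored with precision and recall) can be illustrated with a minimal sketch. This is not SimFuse's code or interface: every function name, parameter, and the toy sequences below are hypothetical, and the sketch assumes fixed-length single-end reads that cross the fusion junction, which the abstract does not specify.

```python
import random

def make_fusion_template(seq5, seq3, break5, break3):
    """Join the 5' partner up to break5 with the 3' partner from break3 onward.

    Returns the fused sequence and the junction index (first base of the 3' part).
    Hypothetical helper for illustration only.
    """
    return seq5[:break5] + seq3[break3:], break5

def simulate_spanning_reads(template, junction, n_reads, read_len,
                            min_overhang=5, seed=0):
    """Sample n_reads fixed-length reads that each cross the junction.

    A read starting at s covers [s, s + read_len); it must keep at least
    min_overhang bases on each side of the junction.
    """
    rng = random.Random(seed)
    lo = max(0, junction - read_len + min_overhang)
    hi = min(len(template) - read_len, junction - min_overhang)
    if lo > hi:
        raise ValueError("template too short for this read length and overhang")
    starts = [rng.randint(lo, hi) for _ in range(n_reads)]
    return [template[s:s + read_len] for s in starts]

def precision_recall(called, truth):
    """Precision and recall of called fusions against the simulated truth set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy partner sequences (assumed); in practice these would come from a reference genome.
template, junction = make_fusion_template("A" * 50, "C" * 50, break5=30, break3=20)

# One simulated read set per supporting-read count, so that a detector's
# performance can be examined at each level of support.
datasets = {n: simulate_spanning_reads(template, junction, n, read_len=20)
            for n in (1, 5, 10)}
```

In a full simulation these reads would be mixed into a background of real RNA-Seq reads before running a fusion caller, and `precision_recall` would compare the caller's output with the list of simulated fusions.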

MeSH terms

  • Algorithms*
  • Animals
  • Databases, Nucleic Acid*
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • RNA / genetics*
  • Sequence Analysis, RNA / methods*

Substances

  • RNA