The prediction of vertebrate promoter regions using differential hexamer frequency analysis

Comput Appl Biosci. 1996 Oct;12(5):391-8. doi: 10.1093/bioinformatics/12.5.391.

Abstract

Motivation: To develop an algorithm utilizing differential hexamer frequency analysis to discriminate promoter from non-promoter regions in vertebrate DNA sequence, without relying upon an extensive database of known transcriptional elements.

Results: A discriminant measure that compares promoter regions with coding (D1) and non-coding (D2) sequence was created from hexamer frequencies determined in known promoter, coding and non-coding regions of vertebrate DNA sequence, using a formula first applied by Claverie and Bougueleret (1986). The algorithm correctly identifies the promoter regions in 18 of 29 loci (62.1%) from an independent test data set. With program options set to identify only one promoter region in the forward strand, there are 11 false-positive predictions in 208 714 nucleotides (one false positive per 18 974 single-stranded bp). With options set to analyze sequence in discrete segments, there is no appreciable improvement in sensitivity, whereas the specificity falls off predictably. It is of particular interest that a search for a peak score (independent of an absolute threshold) is more accurate than a search based upon a fixed scoring threshold. This suggests that the selection of promoter sites may be influenced by the global properties of an entire sequence domain, rather than exclusively by local characteristics.
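
The general idea of scoring a sequence by differential hexamer frequencies, and of choosing a peak score rather than applying a fixed threshold, can be illustrated with a minimal Python sketch. The abstract does not give the Claverie and Bougueleret formula, so the mean log-ratio used here is a stand-in assumption and not the published measure; the function names, the window width, the step size and the pseudocount are likewise hypothetical choices made for illustration only.

```python
import math
from collections import Counter

K = 6  # hexamer length


def hexamer_freqs(seqs, pseudocount=1.0):
    """Return a function mapping a hexamer to its smoothed frequency in seqs."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - K + 1):
            counts[s[i:i + K]] += 1
    # Additive smoothing over all 4**K possible hexamers (an assumption).
    denom = sum(counts.values()) + pseudocount * 4 ** K

    def freq(hexamer):
        return (counts.get(hexamer, 0) + pseudocount) / denom

    return freq


def window_score(window, f_promoter, f_background):
    """Mean log-ratio of promoter vs. background hexamer frequency.

    Stand-in for the discriminant measure; NOT the published formula.
    """
    hexamers = [window[i:i + K] for i in range(len(window) - K + 1)]
    total = sum(math.log(f_promoter(h) / f_background(h)) for h in hexamers)
    return total / len(hexamers)


def scan(seq, f_promoter, f_background, width=50, step=10):
    """Score overlapping windows; returns a list of (start, score) pairs."""
    seq = seq.upper()
    return [
        (start, window_score(seq[start:start + width], f_promoter, f_background))
        for start in range(0, len(seq) - width + 1, step)
    ]


def peak(scores):
    """Single best window, independent of any absolute threshold."""
    return max(scores, key=lambda pair: pair[1])


def above_threshold(scores, threshold):
    """All windows whose score reaches a fixed threshold."""
    return [pair for pair in scores if pair[1] >= threshold]
```

Under these assumptions, calling `window_score` once with a coding background and once with a non-coding background would give rough analogues of D1 and D2. `peak` returns exactly one candidate per sequence, which corresponds to the "identify only one promoter region" option, whereas `above_threshold` can return any number of windows, including none.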

MeSH terms

  • Algorithms*
  • Animals
  • Base Sequence
  • Discriminant Analysis
  • Gene Frequency
  • Microcomputers
  • Predictive Value of Tests
  • Promoter Regions, Genetic*
  • Software
  • Vertebrates / genetics*