Transcript identification by analysis of short sequence tags--influence of tag length, restriction site and transcript database

Nucleic Acids Res. 2003 Apr 15;31(8):2217-26. doi: 10.1093/nar/gkg313.

Abstract

There exist a number of gene expression profiling techniques that utilize restriction enzymes for generation of short expressed sequence tags. We have studied how the choice of restriction enzyme influences various characteristics of tags generated in an experiment. We have also investigated various aspects of in silico transcript identification that these profiling methods rely on. First, analysis of 14 248 mRNA sequences derived from the RefSeq transcript database showed that 1-30% of the sequences lack a given restriction enzyme recognition site. Moreover, 1-5% of the transcripts have recognition sites located less than 10 bases from the poly(A) tail. The uniqueness of 10 bp tags lies in the range 90-95%, which increases only slightly with longer tags, due to the existence of closely related transcripts. Furthermore, 3-30% of upstream 10 bp tags are identical to 3' tags, introducing a risk of misclassification if upstream tags are present in a sample. Second, we found that a sequence length of 16-17 bp, including the recognition site, is sufficient for unique transcript identification by BLAST based sequence alignment to the UniGene Human non-redundant database. Third, we constructed a tag-to-gene mapping for UniGene and compared it to an existing mapping database. The mappings agreed to 79-83%, where the selection of representative sequences in the UniGene clusters is the main cause of the disagreement. The results of this study may serve to improve the interpretation of sequence-based expression studies and the design of hybridization arrays, by identifying short tags that have a high reliability and separating them from tags that carry an inherent ambiguity in their capacity to discriminate between genes. To this end, supplementary information in the form of a web companion to this paper is located at http://biobase.biotech.kth.se/tagseq.
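
The first set of statistics in the abstract (transcripts lacking a recognition site, sites too close to the 3' end, and tag uniqueness) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The recognition site `CATG` (the NlaIII site commonly used in SAGE), the function names `three_prime_tag` and `tag_statistics`, and the toy sequences are all assumptions made for illustration. The sketch also assumes that the poly(A) tail has already been trimmed and that the tag is counted from the base after the recognition site.

```python
from collections import Counter


def three_prime_tag(seq, site="CATG", tag_len=10):
    """Return the tag directly downstream of the 3'-most site, or None.

    None is returned when the site is absent or when fewer than
    tag_len bases remain between the site and the 3' end.
    """
    pos = seq.rfind(site)  # 3'-most occurrence of the recognition site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tag_statistics(transcripts, site="CATG", tag_len=10):
    """Summarise site coverage and tag uniqueness for a dict of sequences."""
    total = len(transcripts)
    no_site = sum(1 for seq in transcripts.values() if site not in seq)
    tags = {
        name: three_prime_tag(seq, site, tag_len)
        for name, seq in transcripts.items()
    }
    found = {name: tag for name, tag in tags.items() if tag is not None}
    too_short = total - no_site - len(found)  # site present, tag truncated
    counts = Counter(found.values())
    unique = sum(1 for tag in found.values() if counts[tag] == 1)
    return {
        "no_site_fraction": no_site / total,
        "short_tag_fraction": too_short / total,
        "unique_fraction": unique / len(found) if found else 0.0,
    }


# Hypothetical toy sequences, not taken from RefSeq.
toy_transcripts = {
    "t1": "AAACATGGGGTTTTCCCCAA",
    "t2": "TTTCATGGGGTTTTCCCCTT",  # shares its 3' tag with t1
    "t3": "CATGACGTACGTACGT",
    "t4": "GGGGGGGGCATGAC",        # site too close to the 3' end
    "t5": "AAAAAAAAAAAAAAA",       # no recognition site
}

stats = tag_statistics(toy_transcripts)
```

In this toy set, `t1` and `t2` stand in for closely related transcripts that yield the same tag, which is the effect the abstract gives as the reason uniqueness rises only slightly with longer tags.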

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence
  • Binding Sites / genetics
  • DNA Restriction Enzymes / metabolism*
  • Databases, Nucleic Acid*
  • Expressed Sequence Tags*
  • Gene Expression Profiling / methods
  • Poly A / genetics
  • RNA, Messenger / genetics
  • RNA, Messenger / metabolism*
  • Sequence Alignment / methods
  • Transcription, Genetic / genetics*

Substances

  • RNA, Messenger
  • Poly A
  • DNA Restriction Enzymes