mRNA 5' region sequence incompleteness: a potential source of systematic errors in translation initiation codon assignment in human mRNAs

Gene. 2003 Dec 4:321:185-93. doi: 10.1016/s0378-1119(03)00835-7.

Abstract

The amino acid sequence of a gene product is routinely deduced from the nucleotide sequence of the corresponding cloned cDNA, according to the rules for start codon recognition (first-AUG rule, optimal sequence context) and the genetic code. Most subsequent types of product analysis stem from this prediction, although all standard methods for cDNA cloning are affected by a potential inability to effectively clone the 5' region of mRNA. Revision by bioinformatics and cloning methods of 109 known genes located on human chromosome 21 (HC 21) shows that 60 mRNAs lack any in-frame stop codon upstream of the first AUG, and that in five cases (DSCR1, KIAA0184, KIAA0539, SON, and TFF3) the coding region at the 5' end was incompletely characterized in the original descriptions. We describe the respective consequences for genomic annotation, domain and ortholog identification, and the design of functional experiments. We have also analyzed the sequences of 13,124 human mRNAs (RefSeq databank), discovering that in 6448 cases (49%) an in-frame stop codon is present upstream of the initiation codon, while in the other 6676 mRNAs (51%) identification of additional bases in the mRNA 5' region could well reveal new upstream in-frame AUG codons in the optimal context. Extrapolating proportionally from the HC 21 data, about 550 known human genes might thus be affected by this 5' end mRNA artifact.
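The check described in the abstract (is there an in-frame stop codon upstream of the first AUG?) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `has_upstream_inframe_stop` and `extrapolate_affected` are hypothetical, only the first-AUG rule is applied (the optimal sequence context is not evaluated), and the extrapolation assumes that the "about 550" figure comes from applying the HC 21 ratio of 5 revised genes per 60 mRNAs without an upstream stop to the 6676 RefSeq mRNAs in the same situation, which the abstract does not spell out.

```python
# Standard stop codons, written in DNA (cDNA) alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(seq, start=None):
    """Return True if a stop codon lies upstream of, and in frame with,
    the start codon of an mRNA/cDNA sequence.

    seq   -- nucleotide string (DNA or RNA alphabet, any case)
    start -- 0-based index of the start codon; if None, the first AUG
             in the sequence is used (first-AUG rule only; the sequence
             context around the AUG is NOT checked in this sketch)
    """
    s = seq.upper().replace("U", "T")
    if start is None:
        start = s.find("ATG")
    if start < 0:
        raise ValueError("no AUG found in sequence")
    # Walk upstream from the start codon one codon at a time.
    pos = start - 3
    while pos >= 0:
        if s[pos:pos + 3] in STOP_CODONS:
            return True
        pos -= 3
    return False


def extrapolate_affected(n_open, n_revised_ref, n_open_ref):
    """Scale a reference ratio (revised / open 5' ends) to n_open mRNAs.

    Assumed reading of the abstract's proportional estimate.
    """
    return round(n_open * n_revised_ref / n_open_ref)


if __name__ == "__main__":
    # TAA sits two codons upstream of ATG, in the same frame.
    print(has_upstream_inframe_stop("GGTAAGCCATGGCC"))
    # No upstream stop: the true start could lie further 5'.
    print(has_upstream_inframe_stop("GGCGCCATGGCC"))
    # Figures from the abstract: 6676 mRNAs, 5 of 60 on HC 21.
    print(extrapolate_affected(6676, 5, 60))
```

An mRNA for which the function returns False is one where a longer 5' sequence could still extend the open reading frame, which is the situation the abstract reports for 51% of the RefSeq mRNAs analyzed.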

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • 5' Untranslated Regions / genetics
  • Amino Acid Sequence
  • Carrier Proteins / genetics
  • Chromosomes, Human, Pair 21 / genetics
  • Codon, Initiator / genetics*
  • DNA, Complementary / chemistry
  • DNA, Complementary / genetics
  • DNA-Binding Proteins / genetics
  • Humans
  • Intracellular Signaling Peptides and Proteins
  • Minor Histocompatibility Antigens
  • Molecular Sequence Data
  • Mucins / genetics
  • Muscle Proteins / genetics
  • Nuclear Proteins
  • Peptides
  • Protein Biosynthesis / genetics*
  • Proteins / genetics
  • RNA, Messenger / genetics*
  • Reproducibility of Results
  • Sequence Alignment
  • Sequence Analysis, DNA / methods
  • Sequence Analysis, DNA / standards
  • Sequence Homology, Amino Acid
  • Trefoil Factor-3

Substances

  • 5' Untranslated Regions
  • Carrier Proteins
  • Codon, Initiator
  • DIP2A protein, human
  • DNA, Complementary
  • DNA-Binding Proteins
  • Intracellular Signaling Peptides and Proteins
  • Minor Histocompatibility Antigens
  • Mucins
  • Muscle Proteins
  • Nuclear Proteins
  • Peptides
  • Proteins
  • RCAN1 protein, human
  • RNA, Messenger
  • SON protein, human
  • TFF3 protein, human
  • Trefoil Factor-3
  • URB1 protein, human

Associated data

  • GENBANK/AF432263
  • GENBANK/AF432264
  • GENBANK/AF432265
  • GENBANK/AF435977