Unsupervised statistical discovery of spaced motifs in prokaryotic genomes

BMC Genomics. 2017 Jan 5;18(1):27. doi: 10.1186/s12864-016-3400-0.

Abstract

Background: DNA sequences contain repetitive motifs that have various functions in the physiology of the organism. A number of methods have been developed for discovery of such sequence motifs, with a primary focus on detection of regulatory motifs and particularly transcription factor binding sites. Most motif-finding methods apply probabilistic models to detect motifs characterized by an unusually high number of copies in the analyzed sequences.

Results: We present a novel method for detection of pairs of motifs separated by spacers of variable nucleotide sequence but conserved length. Unlike existing methods for motif discovery, the motifs themselves are not required to occur at unusually high frequency but only to exhibit a significant preference to occur at a specific distance from each other. In the present implementation of the method, motifs are represented by pentamers, and all pairs of pentamers are evaluated for a statistically significant preference for a specific distance. An important step of the algorithm eliminates motif pairs whose spacers exhibit a high degree of sequence similarity; such motif pairs likely arise from duplications of the whole segment, including the motifs and the spacer, rather than from selective constraints that would indicate functional importance of the motif pair. The method was used to scan 569 complete prokaryotic genomes for novel sequence motifs. Some of the detected motifs were previously known, whereas others appear to be novel. Selected motif pairs were investigated further, and possible biological functions were proposed for some of them.
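
The pair-distance idea described in the Results can be illustrated with a minimal Python sketch. It counts, for every ordered pair of pentamers, how often the second starts a given number of bases after the first ends, scores the most frequent spacer length, and measures spacer similarity so that likely segment duplications can be discarded. The function names, the max_gap parameter, the binomial test against a uniform distribution of spacer lengths, and the mean pairwise identity measure are all assumptions made for illustration; the abstract does not specify the statistical test or the similarity criterion that the method actually uses, and the sketch applies no correction for multiple testing.

```python
from collections import Counter
from math import comb


def pair_distance_counts(seq, k=5, max_gap=30):
    """For each ordered k-mer pair (a, b), count spacer lengths 0..max_gap
    at which b starts immediately after a spacer following a."""
    counts = {}
    n = len(seq)
    for i in range(n - k + 1):
        a = seq[i:i + k]
        for gap in range(max_gap + 1):
            j = i + k + gap          # start of the second k-mer
            if j + k > n:
                break
            b = seq[j:j + k]
            counts.setdefault((a, b), Counter())[gap] += 1
    return counts


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p); intended for small counts only."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def preferred_gap(gap_counts, n_gaps):
    """Return (gap, count, p_value) for the most frequent spacer length.
    Assumed null model: every spacer length is equally likely."""
    total = sum(gap_counts.values())
    gap, top = max(gap_counts.items(), key=lambda kv: kv[1])
    return gap, top, binom_tail(total, top, 1.0 / n_gaps)


def spacers_for(seq, a, b, gap):
    """Collect the spacer sequences between a and b at a fixed spacer length.
    Assumes a and b have the same length."""
    k = len(a)
    out = []
    for i in range(len(seq) - 2 * k - gap + 1):
        if seq[i:i + k] == a and seq[i + k + gap:i + 2 * k + gap] == b:
            out.append(seq[i + k:i + k + gap])
    return out


def mean_spacer_identity(spacers):
    """Mean fraction of identical positions over all pairs of equal-length
    spacers; values near 1 suggest duplicated segments."""
    total, pairs = 0.0, 0
    for x in range(len(spacers)):
        for y in range(x + 1, len(spacers)):
            s, t = spacers[x], spacers[y]
            if not s:
                continue
            total += sum(c1 == c2 for c1, c2 in zip(s, t)) / len(s)
            pairs += 1
    return total / pairs if pairs else 0.0
```

In this sketch, a pentamer pair would be retained only if preferred_gap reports a small p-value and mean_spacer_identity of its spacers stays below some chosen threshold; both cutoffs are left to the user and are not taken from the paper.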

Conclusions: We present a new motif-finding technique that is applicable to scanning complete genomes for sequence motifs. The results from analysis of 569 genomes suggest that the method detects previously known motifs that are expected to be found as well as new motifs that are unlikely to be discovered by traditional motif-finding methods. We conclude that our approach to detection of significant motif pairs can complement existing motif-finding techniques in discovery of novel functional sequence motifs in complete genomes.

Keywords: Archaea; Bacteria; DNA sequence repeats; Genome; Motif-finding; Sequence motifs.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Amino Acid Motifs
  • Clustered Regularly Interspaced Short Palindromic Repeats
  • Genome*
  • Genome, Archaeal
  • Genome, Bacterial
  • Genomics / methods*
  • Models, Genetic*
  • Nucleotide Motifs*
  • Position-Specific Scoring Matrices
  • Prokaryotic Cells / metabolism*
  • RNA, Transfer / chemistry
  • RNA, Transfer / genetics
  • Transcription Termination, Genetic
  • rho GTP-Binding Proteins / metabolism

Substances

  • RNA, Transfer
  • rho GTP-Binding Proteins