The statistical analysis of spatially clustered genes under the maximum gap criterion

J Comput Biol. 2005 Oct;12(8):1083-102. doi: 10.1089/cmb.2005.12.1083.

Abstract

Statistical validation of gene clusters is imperative for many important applications in comparative genomics that depend on the identification of genomic regions that are historically and/or functionally related. We develop the first rigorous statistical treatment of max-gap clusters, a cluster definition frequently used in empirical studies. We present exact expressions for the probability of observing an individual cluster of a set of marked genes in one genome, as well as upper and lower bounds on the probability of observing a cluster of h homologs in a pairwise whole-genome comparison. We demonstrate the utility of our approach by applying it to a whole-genome comparison of E. coli and B. subtilis. Code for statistical tests is available at.
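
To make the cluster definition concrete, the following is a minimal sketch, not the authors' code. It assumes the commonly used reading of the max-gap criterion: marked genes belong to the same cluster when the number of unmarked genes between consecutive marked genes never exceeds a gap parameter g. The function names, the representation of a genome as gene indices 0..n-1, and the Monte Carlo estimate are illustrative assumptions; the simulation is a stand-in for, and not a reproduction of, the exact expressions described in the abstract.

```python
import random


def max_gap_clusters(marked_positions, g):
    """Split marked gene positions into maximal max-gap clusters.

    Assumption: the gap between two consecutive marked genes is the
    number of unmarked genes between them, and a cluster may not
    contain a gap larger than g.
    """
    clusters = []
    current = []
    for p in sorted(marked_positions):
        # Start a new cluster when the gap to the previous marked gene exceeds g.
        if current and p - current[-1] - 1 > g:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters


def estimate_complete_cluster_probability(n, m, g, trials=10000, seed=0):
    """Monte Carlo estimate (hypothetical helper, not the paper's formula)
    of the chance that m marked genes, placed uniformly at random among
    n gene positions, all fall into a single max-gap cluster.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        marked = rng.sample(range(n), m)
        if len(max_gap_clusters(marked, g)) == 1:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Marked genes at positions 2, 3, 5, 10, 11 with at most one intervening gene.
    print(max_gap_clusters([2, 3, 5, 10, 11], g=1))
    print(estimate_complete_cluster_probability(n=1000, m=5, g=20))
```

A small simulated probability for a given n, m, and g would suggest that such a cluster is unlikely to arise by chance under random gene order, which is the kind of question the exact expressions are meant to answer without simulation.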

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bacillus subtilis / genetics
  • Chromosome Mapping
  • Cluster Analysis*
  • Computational Biology*
  • Data Interpretation, Statistical*
  • Escherichia coli / genetics
  • Evolution, Molecular
  • Genes
  • Genome*
  • Genomics
  • Models, Genetic
  • Multigene Family
  • Probability
  • Sequence Alignment
  • Sequence Analysis, DNA*