The statistical analysis of spatially clustered genes under the maximum gap criterion

Rose Hoberman; David Sankoff; Dannie Durand

doi:10.1089/cmb.2005.12.1083

J Comput Biol. 2005 Oct;12(8):1083-102. doi: 10.1089/cmb.2005.12.1083.

Authors

Rose Hoberman¹, David Sankoff, Dannie Durand

Affiliation

¹ Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. roseh@cs.cmu.edu

PMID: 16241899
DOI: 10.1089/cmb.2005.12.1083

Abstract

Statistical validation of gene clusters is imperative for many important applications in comparative genomics which depend on the identification of genomic regions that are historically and/or functionally related. We develop the first rigorous statistical treatment of max-gap clusters, a cluster definition frequently used in empirical studies. We present exact expressions for the probability of observing an individual cluster of a set of marked genes in one genome, as well as upper and lower bounds on the probability of observing a cluster of h homologs in a pairwise whole-genome comparison. We demonstrate the utility of our approach by applying it to a whole-genome comparison of E. coli and B. subtilis. Code for statistical tests is available at.
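
To make the max-gap criterion named in the abstract concrete, the following is a minimal sketch. It assumes the usual reading of the definition: consecutive marked genes belong to the same cluster when no more than g unmarked genes lie between them. The function name `max_gap_clusters`, the 0/1 encoding of the genome, and the parameter name `g` are assumptions made for this illustration. This is not the authors' code, and it only finds clusters; it does not compute the probabilities derived in the paper.

```python
def max_gap_clusters(marked, g):
    """Group marked-gene positions into max-gap clusters.

    marked: sequence of truthy/falsy values, one per gene in genome order
            (hypothetical encoding chosen for this sketch)
    g:      largest number of unmarked genes allowed between two
            consecutive members of the same cluster
    """
    clusters = []
    current = []
    for pos, is_marked in enumerate(marked):
        if not is_marked:
            continue
        # Number of unmarked genes between this marked gene and the previous one.
        if current and pos - current[-1] - 1 > g:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Toy genome: marked genes at positions 0, 2, 6, 7, 9.
    genome = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1]
    print(max_gap_clusters(genome, g=1))  # [[0, 2], [6, 7, 9]]
```

With g = 1 the three unmarked genes between positions 2 and 6 exceed the allowed gap, so the marked genes split into two clusters; raising g to 3 would merge them into one.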

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Bacillus subtilis / genetics
Chromosome Mapping
Cluster Analysis*
Computational Biology*
Data Interpretation, Statistical*
Escherichia coli / genetics
Evolution, Molecular
Genes
Genome*
Genomics
Models, Genetic
Multigene Family
Probability
Sequence Alignment
Sequence Analysis, DNA*

Grants and funding

1 K22 HG 02451-01/HG/NHGRI NIH HHS/United States