Nonoverlapping clusters: approximate distribution and application to molecular biology

X Su; S Wallenstein; D Bishop

doi:10.1111/j.0006-341x.2001.00420.x

Biometrics. 2001 Jun;57(2):420-6. doi: 10.1111/j.0006-341x.2001.00420.x.

Authors

X Su¹, S Wallenstein, D Bishop

Affiliation

¹ Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York 10029-6574, USA.

PMID: 11414565
DOI: 10.1111/j.0006-341x.2001.00420.x

Abstract

An approach is developed for the screening of genomic sequence data to identify gene regulatory regions. This approach is based on deciding if putative transcription factor binding sites are clustered together to a greater extent than one would expect by chance. Given n events occurring on an interval of width L (L base pairs), an r:w cluster is defined as r + 1 consecutive events all contained within a window of length wL. Accurate and easily computable approximations are derived for the distribution of the number of nonoverlapping r:w clusters under the model that the positions of the n events have a uniform distribution. Simulations demonstrate that these approximations have greater accuracy than existing methods. The approximation is applied to detect erythroid-specific regulatory regions in genomic DNA sequences, first in an artificial case where r is specified a priori and then as part of an exploratory approach.
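The cluster definition above can be made concrete with a short Python sketch. It is an illustration only and does not reproduce the paper's analytic approximation, which the abstract does not state. It counts nonoverlapping r:w clusters in a set of event positions and estimates the null distribution of that count by brute-force simulation under the uniform model. The function names, the greedy left-to-right rule for resolving overlaps, and the inclusive boundary (span <= wL) are assumptions made here for illustration.

```python
import random


def count_nonoverlapping_clusters(positions, r, w, L=1.0):
    """Count nonoverlapping r:w clusters among events on an interval of width L.

    A cluster is r + 1 consecutive events whose span is at most w * L.
    Assumption: overlaps are resolved by a greedy left-to-right scan, so
    once a cluster is found the scan resumes at the first event after it.
    """
    xs = sorted(positions)
    count = 0
    i = 0
    while i + r < len(xs):
        if xs[i + r] - xs[i] <= w * L:
            count += 1
            i += r + 1  # skip every event used by this cluster
        else:
            i += 1
    return count


def simulate_null(n, r, w, L=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of the distribution of the cluster count
    when the n event positions are independent and uniform on [0, L]."""
    rng = random.Random(seed)
    tally = {}
    for _ in range(trials):
        xs = [rng.uniform(0, L) for _ in range(n)]
        k = count_nonoverlapping_clusters(xs, r, w, L)
        tally[k] = tally.get(k, 0) + 1
    return {k: c / trials for k, c in sorted(tally.items())}


def upper_tail(dist, observed):
    """Estimated probability of seeing at least `observed` clusters by chance."""
    return sum(p for k, p in dist.items() if k >= observed)
```

For example, `upper_tail(simulate_null(n=30, r=3, w=0.02), observed=2)` gives a simulated estimate of the chance of seeing two or more nonoverlapping 3:0.02 clusters among 30 uniformly placed sites. The values 30, 3, 0.02, and 2 are arbitrary and are not taken from the paper. A simulation of this kind is also one way to check an analytic approximation of the distribution.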

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Binding Sites
Binomial Distribution
Cluster Analysis*
DNA / genetics
Genes, Regulator
Genome
Globins / genetics
Humans
Molecular Biology / methods*
Reproducibility of Results

Substances

Globins
DNA

Grants and funding

R01-DK26824/DK/NIDDK NIH HHS/United States