Identification of regulatory elements using a feature selection method

Sündüz Keleş; Mark van der Laan; Michael B Eisen

doi:10.1093/bioinformatics/18.9.1167

Bioinformatics. 2002 Sep;18(9):1167-75. doi: 10.1093/bioinformatics/18.9.1167.

Authors

Sündüz Keleş¹, Mark van der Laan, Michael B Eisen

Affiliation

¹ Division of Biostatistics, U. of California, Berkeley, CA 94720, USA. keles@stat.berkeley.edu

PMID: 12217908
DOI: 10.1093/bioinformatics/18.9.1167

Abstract

Motivation: Many methods have been described to identify regulatory motifs in the transcription control regions of genes that exhibit similar patterns of gene expression across a variety of experimental conditions. Here we focus on a single experimental condition and use gene expression data to identify sequence motifs associated with genes that are activated under this condition. We use a linear model with two-way interactions to model gene expression as a function of sequence features (words) present in presumptive transcription control regions. The most relevant features are selected by a feature selection method called stepwise selection with Monte Carlo cross-validation. We apply this method to a publicly available dataset of the yeast Saccharomyces cerevisiae, focusing on the 800 base pairs immediately upstream of each gene's translation start site, termed the upstream control region (UCR).
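The modelling idea described above can be illustrated with a minimal sketch. It is not the authors' software: it assumes word counts as features, ordinary least squares as the fitting procedure, forward-only stepwise selection, and mean squared error over random train/test splits as the Monte Carlo cross-validation criterion. The candidate words, the simulated sequences and expression values, and all function names (`count_words`, `with_interactions`, `mccv_error`, `forward_select`, `simulate_toy_data`) are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

WORDS = ["ACGT", "TTTT", "GGCC"]  # hypothetical candidate words

def count_words(seq, words):
    """Count overlapping occurrences of each word in one sequence."""
    return [sum(seq[i:i + len(w)] == w for i in range(len(seq) - len(w) + 1))
            for w in words]

def design_matrix(seqs, words):
    """Main-effect columns: one word-count column per word."""
    return np.array([count_words(s, words) for s in seqs], dtype=float)

def with_interactions(X, words):
    """Append two-way interaction columns (products of pairs of word counts)."""
    cols, names = [X[:, j] for j in range(len(words))], list(words)
    for a, b in itertools.combinations(range(len(words)), 2):
        cols.append(X[:, a] * X[:, b])
        names.append(f"{words[a]}:{words[b]}")
    return np.column_stack(cols), names

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns test predictions."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def mccv_error(X, y, cols, n_splits=20, test_frac=0.3, seed=0):
    """Monte Carlo cross-validation: mean test MSE over random splits.
    A fixed seed gives every candidate model the same splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = fit_predict(X[train][:, cols], y[train], X[test][:, cols])
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

def forward_select(X, y, names, max_terms=3):
    """Greedy forward selection: add the term that most lowers the CV error,
    stopping when no remaining term improves it."""
    selected = []
    best_err = mccv_error(X, y, selected)  # intercept-only baseline
    while len(selected) < max_terms:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        errs = {j: mccv_error(X, y, selected + [j]) for j in candidates}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return [names[j] for j in selected], best_err

def simulate_toy_data(seed=1, n_genes=60, length=50):
    """Simulated (not real) upstream sequences; expression depends on ACGT."""
    rng = np.random.default_rng(seed)
    seqs = []
    for i in range(n_genes):
        s = "".join(rng.choice(list("ACGT"), size=length))
        if i % 2 == 0:  # plant the word in every other sequence
            s = s[:10] + "ACGT" + s[14:]
        seqs.append(s)
    counts = design_matrix(seqs, ["ACGT"])[:, 0]
    y = 2.0 * counts + rng.normal(0.0, 0.1, size=n_genes)
    return seqs, y

if __name__ == "__main__":
    seqs, y = simulate_toy_data()
    X, names = with_interactions(design_matrix(seqs, WORDS), WORDS)
    print(forward_select(X, y, names))
```

In this sketch every candidate model is scored on the same random splits, so differences in cross-validated error reflect the added term rather than the partitioning. A fuller stepwise procedure would typically also consider removing terms at each step.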

Results: We successfully identify regulatory motifs that are known to be active under the experimental conditions analyzed, and find additional significant sequences that may represent novel regulatory motifs. We also discuss a complementary method that utilizes gene expression data from a single microarray experiment and allows averaging over a variety of experimental conditions as an alternative to motif finding methods that act on clusters of co-expressed genes.

Availability: The software is available upon request from the first author or may be downloaded from http://www.stat.berkeley.edu/~sunduz.

Contact: keles@stat.berkeley.edu

Publication types

Research Support, U.S. Gov't, P.H.S.
Validation Study

MeSH terms

Amino Acid Motifs / genetics*
Base Sequence
Gene Expression Regulation / genetics*
Mitosis / genetics
Models, Genetic*
Models, Statistical*
Molecular Sequence Data
Monte Carlo Method
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated
Reproducibility of Results
Saccharomyces cerevisiae / genetics
Sensitivity and Specificity
Sequence Analysis, DNA / methods*

Grants and funding

1R01 AI46182-01/AI/NIAID NIH HHS/United States