Large scale hierarchical clustering of protein sequences

BMC Bioinformatics. 2005 Jan 22;6:15. doi: 10.1186/1471-2105-6-15.

Abstract

Background: Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.

Results: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering that is being made available for querying and browsing at http://systers.molgen.mpg.de/.

Conclusions: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
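
To make the idea of a graph-based, two-level grouping concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes sequences are nodes, pairwise similarity scores are weighted edges, and clusters are the connected components that remain after discarding edges below a cutoff (a single-linkage style simplification). The sequence identifiers, the scores, and both cutoffs are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort for a deterministic order of clusters.
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


def hierarchical_levels(nodes, scored_pairs, thresholds):
    """Cluster at each similarity cutoff; stricter cutoffs give finer clusters."""
    levels = {}
    for t in thresholds:
        edges = [(a, b) for a, b, score in scored_pairs if score >= t]
        levels[t] = connected_components(nodes, edges)
    return levels


# Hypothetical sequence identifiers and similarity scores (illustration only).
sequences = ["s1", "s2", "s3", "s4", "s5"]
similarities = [("s1", "s2", 0.9), ("s2", "s3", 0.5), ("s4", "s5", 0.8)]

# Assumed cutoffs: a permissive "superfamily-like" level and a stricter
# "family-like" level.
levels = hierarchical_levels(sequences, similarities, thresholds=[0.4, 0.7])
for cutoff, clusters in levels.items():
    print(cutoff, [sorted(c) for c in clusters])
```

Because raising the cutoff only removes edges, every cluster at the stricter level is contained in exactly one cluster at the permissive level, which is what makes the result a hierarchy. A fixed global cutoff is the main simplification here; the abstract describes methods that adapt to the topology of the sequence space instead.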

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Computational Biology / methods*
  • Databases, Factual
  • Databases, Genetic
  • Databases, Nucleic Acid
  • Databases, Protein
  • Fungal Proteins / chemistry
  • Genetic Linkage
  • Genome
  • Information Storage and Retrieval
  • Models, Biological
  • Multigene Family
  • Phylogeny
  • Protein Structure, Tertiary
  • Proteins / chemistry*
  • Proteomics / methods*
  • Reproducibility of Results
  • Sequence Alignment
  • Sequence Analysis, Protein
  • Software

Substances

  • Fungal Proteins
  • Proteins