Clustering protein sequence and structure space with infinite Gaussian mixture models

Pac Symp Biocomput. 2004:399-410. doi: 10.1142/9789812704856_0038.

Abstract

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies, etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures, and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/~wid/PSB04/index.html
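The two properties claimed above (a data-driven number of components, and a pairwise probability of co-membership) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one-dimensional toy values rather than protein features, a known within-cluster variance, a conjugate Normal prior on each cluster mean, and a collapsed Gibbs sampler over a Dirichlet process prior. The names `gibbs_coclustering`, `predictive_logpdf`, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Normal(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predictive_logpdf(x, n, total, sigma2, mu0, tau2):
    """Log predictive density of x for a cluster holding n points that sum to `total`.

    Assumes a known-variance Gaussian likelihood (sigma2) and a
    Normal(mu0, tau2) prior on the cluster mean; n = 0 gives the prior predictive.
    """
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + total / sigma2)
    return log_normal_pdf(x, post_mean, post_var + sigma2)


def gibbs_coclustering(data, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=100.0,
                       sweeps=300, burn_in=100, seed=0):
    """Collapsed Gibbs sampling for a 1-D Dirichlet process Gaussian mixture.

    Returns (co, k_trace): co[a][b] is the fraction of retained sweeps in which
    points a and b share a cluster; k_trace is the cluster count per retained sweep.
    """
    rng = random.Random(seed)
    n = len(data)
    z = [0] * n                      # start with every point in one cluster
    counts = {0: n}                  # cluster label -> number of members
    sums = {0: sum(data)}            # cluster label -> sum of member values
    next_label = 1
    co = [[0.0] * n for _ in range(n)]
    k_trace = []

    for sweep in range(sweeps):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            c = z[i]
            counts[c] -= 1
            sums[c] -= x
            if counts[c] == 0:
                del counts[c], sums[c]

            # Weight of each existing cluster: size times predictive density.
            labels = list(counts)
            logw = [math.log(counts[k])
                    + predictive_logpdf(x, counts[k], sums[k], sigma2, mu0, tau2)
                    for k in labels]
            # Weight of opening a new cluster: alpha times prior predictive.
            logw.append(math.log(alpha)
                        + predictive_logpdf(x, 0, 0.0, sigma2, mu0, tau2))

            top = max(logw)
            weights = [math.exp(w - top) for w in logw]
            choice = rng.choices(range(len(weights)), weights=weights)[0]

            if choice == len(labels):
                c_new = next_label
                next_label += 1
                counts[c_new] = 0
                sums[c_new] = 0.0
            else:
                c_new = labels[choice]
            z[i] = c_new
            counts[c_new] += 1
            sums[c_new] += x

        if sweep >= burn_in:
            k_trace.append(len(counts))
            for a in range(n):
                for b in range(n):
                    if z[a] == z[b]:
                        co[a][b] += 1.0

    kept = sweeps - burn_in
    co = [[v / kept for v in row] for row in co]
    return co, k_trace


# Made-up toy values forming two well-separated groups (not protein data).
data = [-5.2, -4.9, -5.1, -4.8, 5.0, 5.3, 4.7, 5.1]
co, k_trace = gibbs_coclustering(data)
print("same-group co-clustering estimate:", round(co[0][1], 3))
print("cross-group co-clustering estimate:", round(co[0][4], 3))
print("cluster counts visited:", sorted(set(k_trace)))
```

No number of clusters is passed in: the sampler opens or empties clusters as the data and the concentration parameter `alpha` warrant, and the averaged matrix `co` plays the role of the pairwise "same cluster" probability described in the abstract. A real application would replace the scalar values with multivariate sequence or structure features and place priors on the variance and on `alpha`.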

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Amino Acid Sequence
  • Cluster Analysis
  • Computational Biology*
  • Databases, Protein
  • Globins / chemistry
  • Globins / genetics
  • Models, Statistical
  • Normal Distribution
  • Proteins / chemistry*
  • Proteins / classification
  • Proteins / genetics*
  • Receptors, G-Protein-Coupled / chemistry
  • Receptors, G-Protein-Coupled / genetics

Substances

  • Proteins
  • Receptors, G-Protein-Coupled
  • Globins