Building a biological space based on protein sequence similarities and biological ontologies

Comb Chem High Throughput Screen. 2008 Sep;11(8):653-60. doi: 10.2174/138620708785739925.

Abstract

Assignment of function to protein sequences is a task of growing importance in the life sciences, as new high-throughput DNA sequencing technologies generate ever-increasing quantities of genomic and metagenomic data. Patterns within the sequence space, caused by the evolutionary conservation and assembly of protein domains, make it possible to infer function from sequence similarity. Clustering similar sequences is a useful technique for finding conserved sequences; the CluSTr database is a publicly available resource that arranges proteins in a hierarchy structured by similarity. The protein classification tool InterProScan builds on this approach by applying a range of methods to detect proteins that contain signatures indicative of the presence of particular conserved domains. The use of ontologies to describe protein function provides a flexible and abstract language for classifying proteins. Together, these techniques can provide an understanding of the shape of the protein space, and can be used to explore the uncharted waters of the emerging metagenomic world.
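
As an illustration of what arranging proteins in a hierarchy structured by similarity can look like, the following minimal sketch groups a few toy sequences at progressively looser similarity thresholds, so that tight clusters nest inside broader ones. It is not CluSTr's actual scoring or algorithm: the k-mer Jaccard similarity, the single-linkage grouping, the threshold values, the sequences, and the function names (`kmers`, `similarity`, `cluster_at`) are all assumptions chosen for brevity.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-letter substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a crude stand-in for alignment scores)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cluster_at(seqs, threshold, k=3):
    """Single-linkage clusters: link any pair with similarity >= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if similarity(seqs[a], seqs[b], k) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy sequences, not real database entries.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "MKTAYLLKQR",
    "p4": "GGSWPLHHCC",
}

# Lowering the threshold merges clusters, giving the levels of a hierarchy.
for t in (0.7, 0.3):
    print(t, cluster_at(toy, t))
```

At the stricter threshold only the near-identical pair is grouped; at the looser one a more distant relative joins the same cluster, while the unrelated sequence stays separate at both levels.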

Publication types

  • Review

MeSH terms

  • Consensus Sequence
  • Databases, Protein*
  • Evolution, Molecular*
  • Protein Structure, Tertiary
  • Proteins / chemistry*
  • Proteins / classification

Substances

  • Proteins