Statistical potential-based amino acid similarity matrices for aligning distantly related protein sequences

Proteins. 2006 Aug 15;64(3):587-600. doi: 10.1002/prot.21020.

Abstract

Aligning distantly related protein sequences is a long-standing problem in bioinformatics and a key to successful protein structure prediction. Its importance has grown recently in the context of structural genomics projects, because more and more experimentally solved structures are becoming available as templates for protein structure modeling. To address this problem, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve the profile itself, thereby yielding more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the propensity of a residue to contact the other amino acids is expected to be one of the most conserved features of each position in a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels: the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in family-level alignments but clearly outperform them in fold-level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.
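
The abstract does not state how the similarity scores are computed from the contact potentials. As a rough illustration of the general idea only, the sketch below assumes that two amino acids are similar when their contact-energy profiles (their rows in a symmetric contact potential table) are correlated, and it compares two matrices by correlating their off-diagonal entries. The function names, the use of the Pearson correlation, and the randomly generated toy potential are all assumptions made for this example; none of them is taken from the paper, and the numbers carry no biological meaning.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def toy_potential(residues, seed=0):
    """Build a symmetric table of made-up contact energies (placeholder data)."""
    rng = random.Random(seed)
    pot = {a: {} for a in residues}
    for i, a in enumerate(residues):
        for b in residues[i:]:
            energy = rng.uniform(-1.0, 1.0)
            pot[a][b] = energy
            pot[b][a] = energy
    return pot


def similarity_from_contact_potential(potential):
    """Assumed scheme: score(a, b) = correlation of the contact profiles of a and b."""
    residues = sorted(potential)
    rows = {a: [potential[a][c] for c in residues] for a in residues}
    return {a: {b: pearson(rows[a], rows[b]) for b in residues} for a in residues}


def matrix_correlation(m1, m2):
    """Compare two similarity matrices via their off-diagonal (a < b) entries."""
    residues = sorted(m1)
    pairs = [(a, b) for i, a in enumerate(residues) for b in residues[i + 1:]]
    return pearson([m1[a][b] for a, b in pairs], [m2[a][b] for a, b in pairs])


RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
potential = toy_potential(RESIDUES)
similarity = similarity_from_contact_potential(potential)
```

With a real knowledge-based contact potential in place of `toy_potential`, `matrix_correlation` could be applied to the resulting matrix and an existing one such as BLOSUM45 to get a crude sense of how far apart the two lie; the measure used for that comparison in the paper may differ.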

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acids / chemistry*
  • Amino Acids / genetics
  • Computational Biology / methods
  • Databases, Protein / statistics & numerical data
  • Protein Folding
  • Proteins / chemistry*
  • Proteins / genetics
  • Reproducibility of Results
  • Sequence Alignment / methods*
  • Sequence Alignment / statistics & numerical data
  • Software

Substances

  • Amino Acids
  • Proteins