An integrated machine learning system to computationally screen protein databases for protein binding peptide ligands

Mol Cell Proteomics. 2006 Jul;5(7):1224-32. doi: 10.1074/mcp.M500346-MCP200. Epub 2006 Mar 29.

Abstract

A fairly large set of protein interactions is mediated by families of peptide binding domains, such as Src homology 2 (SH2), SH3, PDZ, and major histocompatibility complex. Identifying their ligands by experimental screening is not only labor-intensive but also nearly futile for low abundance species, whose signals are suppressed by high abundance species. A more practical way to study these protein-protein interactions is to screen protein sequence databases with high throughput computational approaches, directing the validating experiments toward the most promising peptides. Predictors that merely showed good cross-validation performance were not good enough to screen protein databases. In the current study we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank protein databases for potential ligands of 10 SH3 and three PDZ domains. A large fraction of the predictions has already been experimentally confirmed by independent research groups, indicating satisfactory generalization capability for future applications in identifying protein interactions.
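The general idea of computationally screening a protein database can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not describe the three coding methods or the learning algorithms, so the two scorers below (`has_pxxp`, `proline_fraction`) are hypothetical stand-ins, loosely motivated by the well-known preference of SH3 domains for proline-rich PxxP motifs. The window width, the threshold, and the simple averaging of scorer outputs are likewise assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# A scorer maps a candidate peptide to a score in [0, 1].
# In a real system each scorer would be a trained predictor; here they are toy rules.
Scorer = Callable[[str], float]


def windows(sequence: str, width: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) for every window of the given width in the sequence."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


def has_pxxp(peptide: str) -> float:
    """Hypothetical scorer: 1.0 if the peptide contains a PxxP motif, else 0.0."""
    for i in range(len(peptide) - 3):
        if peptide[i] == "P" and peptide[i + 3] == "P":
            return 1.0
    return 0.0


def proline_fraction(peptide: str) -> float:
    """Hypothetical scorer: fraction of residues that are proline."""
    return peptide.count("P") / len(peptide)


def screen(
    proteins: Dict[str, str],
    scorers: List[Scorer],
    width: int = 10,
    threshold: float = 0.5,
) -> List[Tuple[str, int, str, float]]:
    """Slide a window over each protein, average the scorer outputs,
    and return (protein, start, peptide, score) hits, best score first."""
    hits = []
    for name, sequence in proteins.items():
        for start, peptide in windows(sequence, width):
            score = sum(scorer(peptide) for scorer in scorers) / len(scorers)
            if score >= threshold:
                hits.append((name, start, peptide, score))
    hits.sort(key=lambda hit: hit[3], reverse=True)
    return hits
```

Combining several scorers reflects the "integrated" aspect only in spirit: a peptide is reported when the averaged evidence clears the threshold, and the ranked hits would then be candidates for experimental validation.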

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Motifs
  • Artificial Intelligence*
  • Databases, Protein*
  • Electronic Data Processing / methods*
  • Forecasting
  • Humans
  • Ligands*
  • Models, Theoretical
  • Peptide Fragments / analysis*
  • Peptide Fragments / metabolism
  • Predictive Value of Tests
  • Protein Array Analysis / statistics & numerical data
  • Protein Binding
  • Protein Interaction Mapping / methods*
  • Protein Structure, Tertiary
  • Yeasts

Substances

  • Ligands
  • Peptide Fragments