Three descriptor model sets a high standard for the CSAR-NRC HiQ benchmark

J Chem Inf Model. 2011 Sep 26;51(9):2139-45. doi: 10.1021/ci200030h. Epub 2011 Jun 27.

Abstract

We report results obtained with a proteochemometric approach to predicting ligand binding free energies for the CSAR-NRC HiQ benchmark data set. Using distance-dependent atom-type pair descriptors in a bagged stepwise multiple linear regression (MLR) model with subsequent complexity reduction, we identified three descriptors that suffice to build a very robust regression model for this data set. The model has an R²(cv) of 0.55, an MUE(cv) of 1.19, and an RMSE(cv) of 1.49 on the out-of-bag test set. The selected descriptors are the count of protein atoms in a shell between 4.5 Å and 6 Å around each heavy ligand atom excluding oxygen and phosphorus, the count of sulfur atoms in the vicinity of tryptophan, and the count of aliphatic ligand hydroxy hydrogens. The first two descriptors have a positive sign, indicating that they contribute favorably to the binding free energy, whereas the count of hydroxy hydrogens contributes unfavorably. That such a simple model can be so effective raises a couple of questions, which are addressed in the article.
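The first descriptor can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not settle: that the descriptor is a sum of ligand-protein atom pairs (so a protein atom near two ligand atoms is counted twice), that the exclusion of oxygen and phosphorus applies to the ligand atoms, and that the shell is half-open, 4.5 Å ≤ d < 6 Å. The function name `shell_pair_count`, the `(element, (x, y, z))` atom representation, and the toy coordinates are all hypothetical.

```python
import math


def shell_pair_count(ligand_atoms, protein_atoms, r_min=4.5, r_max=6.0,
                     excluded_elements=("H", "O", "P")):
    """Count ligand-protein atom pairs with distance in [r_min, r_max).

    Atoms are (element, (x, y, z)) tuples with coordinates in angstroms.
    Ligand atoms whose element is in `excluded_elements` are skipped:
    "H" restricts the count to heavy atoms, while "O" and "P" follow
    one reading of the abstract (an assumption, see the text above).
    """
    count = 0
    for element, lig_xyz in ligand_atoms:
        if element in excluded_elements:
            continue
        for _, prot_xyz in protein_atoms:
            # Euclidean distance between the two atom positions
            d = math.dist(lig_xyz, prot_xyz)
            if r_min <= d < r_max:
                count += 1
    return count


# Toy coordinates, invented purely for illustration.
ligand = [
    ("C", (0.0, 0.0, 0.0)),
    ("O", (1.4, 0.0, 0.0)),   # skipped: oxygen is excluded
    ("N", (0.0, 1.5, 0.0)),
]
protein = [
    ("C", (5.0, 0.0, 0.0)),
    ("N", (0.0, 7.0, 0.0)),
    ("S", (3.0, 0.0, 0.0)),
]

print(shell_pair_count(ligand, protein))
```

In a real workflow the atom lists would come from a structure file of the complex, and a value like this would be one column of the descriptor matrix passed to the regression.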

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Hydrogen / chemistry
  • Ligands
  • Linear Models
  • Models, Molecular*
  • Proteins / chemistry
  • Sulfur / chemistry

Substances

  • Ligands
  • Proteins
  • Sulfur
  • Hydrogen