Learning from the data: mining of large high-throughput screening databases

J Chem Inf Model. 2006 Nov-Dec;46(6):2381-95. doi: 10.1021/ci060102u.

Abstract

High-throughput screening (HTS) campaigns in pharmaceutical companies have accumulated large amounts of data for several million compounds across a couple of hundred assays. Despite general awareness that rich information is hidden in this data, little has been reported on systematic data mining methods that can reliably extract knowledge relevant to chemists and biologists. We developed a data mining approach based on an algorithm called ontology-based pattern identification (OPI) and applied it to our in-house HTS database. We identified nearly 1500 scaffold families with statistically significant structure-HTS activity profile relationships. Among them, dozens of scaffolds were characterized as producing artifactual results that stem from the screening technology employed, such as the assay format and/or readout. Four types of compound scaffolds can be characterized based on this data mining effort: tumor cytotoxic, general toxic, potential reporter gene assay artifact, and target-family-specific. The OPI-based data mining approach can reliably identify compounds that are not only structurally similar but also share statistically significant biological activity profiles. Statistical tests such as the Kruskal-Wallis test and analysis of variance (ANOVA) can then be applied to the discovered scaffolds to assign relevant biological information effectively. The scaffolds identified by our HTS data mining efforts are an invaluable resource for designing SAR-robust diversity libraries, generating in silico biological annotations of compounds on a scaffold basis, and providing novel target-family-specific scaffolds for focused compound library design.
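
The abstract names the Kruskal-Wallis test as one way to attach biological information to a discovered scaffold. The following is a minimal sketch of that kind of test, not the authors' implementation: it asks whether the activity of one scaffold family differs across assay categories. The function names (`chi2_sf`, `kruskal_wallis`), the assay category labels, and every activity value are hypothetical and chosen only for illustration.

```python
import math
from itertools import chain


def chi2_sf(x, df):
    """Survival function of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Base cases: df = 1 (via erfc) or df = 2 (exponential tail).
    if df % 2:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * e^(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of samples, using midranks and tie correction.

    Assumes at least two distinct values overall; otherwise the tie
    correction divides by zero.
    """
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)
    # Give each distinct value the average of the ranks it occupies.
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3.0 * (n + 1)
    h /= 1.0 - tie_term / (n ** 3 - n)
    # The p-value uses the usual chi-square approximation with k - 1 df.
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical percent-inhibition values for one scaffold family,
# grouped by assay category (made-up numbers, not from the paper).
activity_by_category = {
    "reporter_gene": [82, 75, 91, 68, 88],
    "biochemical": [12, 20, 8, 15, 25],
    "proliferation": [30, 18, 22, 35, 10],
}

h_stat, p_value = kruskal_wallis(list(activity_by_category.values()))
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

For these made-up values the statistic is about 9.98 with a p-value near 0.007, which is the pattern one might expect from a scaffold that is active mainly in reporter gene assays. With real data, `scipy.stats.kruskal` computes the same statistic and is the more practical choice; a correction for multiple testing would also be needed when many scaffolds are tested.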

MeSH terms

  • Algorithms
  • Animals
  • Cell Proliferation
  • Chemistry / methods
  • Chemistry, Pharmaceutical / methods*
  • Combinatorial Chemistry Techniques / methods*
  • Drug Evaluation / instrumentation
  • Drug Evaluation / methods*
  • Drug Evaluation, Preclinical
  • Genes, Reporter
  • Genomics
  • Humans
  • Ligands
  • Pattern Recognition, Automated
  • Proteomics / methods
  • Technology, Pharmaceutical / methods

Substances

  • Ligands