Data Set Augmentation Allows Deep Learning-Based Virtual Screening to Better Generalize to Unseen Target Classes and Highlight Important Binding Interactions

Jack Scantlebury; Nathan Brown; Frank Von Delft; Charlotte M Deane

doi:10.1021/acs.jcim.0c00263

J Chem Inf Model. 2020 Aug 24;60(8):3722-3730. doi: 10.1021/acs.jcim.0c00263. Epub 2020 Aug 4.

Authors

Jack Scantlebury¹, Nathan Brown², Frank Von Delft³ ⁴ ⁵, Charlotte M Deane¹

Affiliations

¹ Department of Statistics, University of Oxford, 24-29 St Giles, Oxford OX1 3LB, U.K.
² BenevolentAI, 4-8 Maple Street, London W1T 5HD, U.K.
³ Structural Genomics Consortium (SGC), University of Oxford, Oxford OX3 7DQ, U.K.
⁴ Diamond Light Source, Harwell Science and Innovation Campus, Didcot OX11 0DE, U.K.
⁵ Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa.

Abstract

Current deep learning methods for structure-based virtual screening take the structures of both the protein and the ligand as input but make little or no use of the protein structure when predicting ligand binding. Here, we show how a relatively simple method of data set augmentation forces such deep learning methods to take into account information from the protein. Models trained in this way are more generalizable (make better predictions on protein/ligand complexes from a different distribution to the training data). They also assign more meaningful importance to the protein and ligand atoms involved in binding. Overall, our results show that data set augmentation can help deep learning-based virtual screening to learn physical interactions rather than data set biases.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Deep Learning*
Ligands
Proteins

Substances

Ligands
Proteins
