Spatial Graph Convolutions for Drug Discovery

03/12/2018 ∙ by Evan N. Feinberg, et al. ∙ Stanford University 0

Predicting the binding free energy, or affinity, of a small molecule for a protein target is frequently the first step along the arc of drug discovery. High throughput experimental and virtual screening both suffer from low accuracy, whereas more accurate approaches in both domains suffer from lack of scale due to either financial or temporal constraints. While machine learning (ML) has made immense progress in the fields of computer vision and natural language processing, it has yet to offer comparable improvements over domain-expertise driven algorithms in the molecular sciences. In this paper, we propose new Deep Neural Network (DNN) architectures for affinity prediction. The new model architectures are at least competitive with, and in many cases state-of-the-art compared to previous knowledge-based and physics-based approaches. In addition to more standard evaluation metrics, we also propose the Regression Enrichment Factor EF_χ^(R) for the community to benchmark against in future affinity prediction studies. Finally, we suggest the adaptation of an agglomerative clustering cross-validation strategy to more accurately reflect the generalization capacity of ML-based affinity models in future works.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. They key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning—instead of feature engineering—deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor , to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.

I Introduction

Most FDA-approved drugs are small organic molecules that elicit a therapeutic response by binding to a target biological macromolecule. Once bound, small molecule ligands either inhibit the binding of other ligands or allosterically adjust the target’s conformational ensemble. Binding is thus crucial to any behavior of a therapeutic ligand. To maximize a molecule’s therapeutic effect, its affinity—or binding free energy ()—for the desired targets must be maximized, while simultaneously minimizing its affinity for other macromolecules. Historically, scientists have used both cheminformatic and structure-based approaches to model ligands and their targets, and most machine learning (ML) approaches use domain expertise-driven features.

More recently, deep neural networks (DNNs) have been translated to the molecular sciences. Training most conventional DNN architectures requires vast amounts of data: for example, ImageNet 

Deng et al. currently contains over labeled images. In contrast, the largest publicly available datasets for the properties of drug-like molecules include PDBBind 2017 Liu et al. (2017), with a little over samples of protein-ligand co-crystal structures and associated binding affinity values; Tox21 with nearly small molecules and associated toxicity endpoints; QM8 with around small molecules and associated electronic properties; and ESOL with a little over small molecules and associated solubility valuesWu et al. (2018). This scarcity of high-quality scientific data necessitates innovative neural architectures for molecular machine learning.

Successful DNNs often exploit relevant structure in data, such as pixel proximity in images. Predicting protein-ligand binding affinity seems to resemble computer vision problems. Just as neighboring pixels connote closeness between physical objects, a binding pocket could be divided into a voxel grid. Here, neighboring voxels denote neighboring atoms and blocks of empty space. Unfortunately, this 3D convolutional approach has several potential drawbacks. First, inputs and hidden weights require much more memory in three dimensions. Second, since the parameters grow exponentially with the number of dimensions, the model suffers from the “curse of dimensionality” 

Hastie, Tibshirani, and Friedman (2009): while image processing may entail a square filter, the corresponding filter for volumetric molecule processing has parameters.

In contrast, graph convolutions use fewer parameters by exploiting molecular structure and symmetry. Consider a carbon bonded to four other atoms. A 3D CNN would need several different filters to accommodate the subgroup’s symmetrically equivalent orientations. But a graph convolution as described in Refs. Kearnes et al., 2016; Kipf and Welling, 2016; Li et al., 2016; Gilmer et al., 2017 is symmetric to permutations and relative location of each of the four neighbors, thereby significantly reducing the number of model parameters. The use of graph convolutional approaches to learn molecular properties is reminiscent of the familiar canon of chemoinformatics algorithms such as Morgan fingerprints Morgan (1965), SMILES strings Weininger, Weininger, and Weininger (1989), and the Ullman algorithm for substructure search Ullmann (1976), all of which enrich chemical descriptions by propagating information about neighboring atoms.

In this paper, we first review a subset of DNN architectures applicable to protein-ligand interaction. Through the mathematical frameworks above, we contextualize our presentation of new models that generalize a graph convolution to include both intramolecular interactions and noncovalent interactions between different molecules. First, we describe a staged gated graph neural network, which distinguishes the derivation of differentiable bonded atom types from the propagation of information between different molecules. Second, we describe a more flexible model based on a new update rule using both the distance from source to target atom and the target atom’s feature map. Our direct incorporation of target atom information into the message function increases signal in some protein-ligand binding affinity benchmarks. Finally, we address a potential shortcoming of the standard benchmark in this space—namely, treating the PDBBind 2007 core set as a fixed test set—by choosing a cross-validation strategy that more closely resembles the reality of drug discovery. Though more challenging, this benchmark may better reflect a given model’s generalization capacity.

Ii Neural Network Architectures

First, we briefly review a subset of DNN architectures applicable to protein-ligand interaction in order to motivate the new models we present and test at the end of the paper.

ii.1 Ligand-based scoring models

ii.1.1 Fully Connected Neural Networks

The qualitatively simplest models for affinity prediction and related tasks incorporate only features of ligands and ignore the macromolecular target(s). Such a model could entail a fully connected neural network (FCNN), in which each molecule is represented by a flat vector

containing features. Then, these features are updated through “hidden” layers

by applying nonlinear activation functions.

The training data for such a network consists of a set of molecules, each represented by a vector of length , which have a one-to-one correspondence with a set of affinity labels. Domain-expertise driven flat vector features might include integer counts of different types of pre-determined functional groups (e.g., carboxylic acids, aromatic rings), polar or nonpolar atoms, and other ligand-based features. Cheminformatic featurizations include Extended Circular Fingerprints (ECFP) Rogers and Hahn (2010) and ROCS Hawkins, Skillman, and Nicholls (2007); Kearnes and Pande (2016).

ii.1.2 Graph Convolutional Neural Networks

Figure 1: Visual depiction of Gated Graph Neural Network with atoms as nodes and bonds as edges. The small molecule propanamide is chosen to illustrate the propagation of information among the different update layers of the network.

In convolutional neural networks (CNNs), each layer convolves the previous layer’s feature map with linear kernels followed by elementwise nonlinearities, producing new features of higher complexity that combine information from neighboring pixels 

Krizhevsky, Sutskever, and Hinton (2012). A graph convolutional neural network (GCNN) analogously exploits the inherent structure of data  Duvenaud et al. (2015). We can represent a given graph that contains nodes, features per node, and a single edge type, as consisting of node features and symmetric adjacency matrix , which designates whether a pair of nodes belong to each other’s neighbor sets . Consider the molecule propanamide (Figure 1). For the carbonyl carbon, the relevant row of the feature matrix might be to represent its element, and the corresponding row of the adjacency matrix might be to indicate its bonds to three neighbor atoms.

A graph convolution update, as summarized in  Gilmer et al. (2017), entails applying a function at each node that takes the node and its neighbors as input and outputs a new set of features for each node. It can be written as


where represents the node features of node at hidden layer , represents the neighbors of node , and and are the update and message functions, respectively, at hidden layer . When there are multiple edge types, we must define multiple message functions, , which is the message function at layer for edge type .

Our models are primarily inspired by the Gated Graph Neural Networks (GGNNs) Li et al. (2016)

. At all layers, the update function is the familiar gated recurrent unit (GRU). Message functions are simple linear operations that are different for each edge type but also the same across layers:


where is the adjacency matrix, and the weight matrix, respectively, for edge type .

Unlike conventional FCNNs, which learn non-linear combinations of the input hand-crafted features, the update described in (2) learns nonlinear combinations of more basic features of a given atom with the features of its immediate neighbors. Information propagates through increasingly distant atoms with each graph convolution, and the GRU enables information to be added selectively. Ultimately, the GGNN contains and leverages both per-node features via the feature matrix and structural information via the adjacency matrix . In both classification and regression settings, GCNN’s terminate in a “graph gather” step that sums over the rows of the final embeddings and is invariant to node ordering. The subsequent FCNNs produce output of desired size (). This completes the starting point for the graph convolutional update used in this paper:


ii.1.3 Generalization to multitask settings

Predicting affinity for multiple targets by GCNN can be implemented by training either different models for each target or by training a single multitask network. The latter setting has a last weight matrix , where

denotes the number of targets in the dataset. The corresponding multitask loss function would be the average binary cross-entropy loss across the targets,


ii.2 Structure-based scoring models

Since the advent of biomolecular crystallography by Perutz et al. Perutz et al. (1960), the drug discovery community has sought to leverage structural information about the target in addition to the ligand. Numerous physics-based approaches have attempted to realize this, including molecular docking Jain (2003); Shoichet, Kuntz, and Bodian (1992); Trott and Olson (2010); Friesner et al. (2004), free energy perturbation Wang et al. (2015), and QM/MM Hensen et al. (2004), among others. More recent approaches include RF-Score Li et al. (2015); Ballester and Mitchell (2010), NN-score Durrant and McCammon (2011), Grid Featurizer Wu et al. (2018), three dimensional CNN approaches Wallach, Dzamba, and Heifets (2015); Ragoza et al. (2017), and Atomic Convolutional Neural Networks Gomes et al. (2017).

ii.2.1 PotentialNet Architectures for Molecular Property Prediction

To motivate architectures for more principled DNN predictors, we invoke the following notation and framework. First, we introduce the distance matrix , whose entries denote the distance between and . Thus far, the concept of adjacency, as encoded in a symmetric matrix , has been restricted to chemical bonds. However, adjacency can also encompass a wider range of neighbor types to include non-covalent interactions (e.g., stacking, hydrogen bonds, hydrophobic contact). Adjacency need not require domain expertise: pairwise distances below a threshold value can also be used. Regardless of particular scheme, we see how the distance matrix motivates the construction of an expanded version of . In this framework,

becomes a tensor of shape

, where represents the number of edge types.

If we order the rows by the membership of to either the protein or ligand, we can view both and as block matrices, where the diagonal blocks are self-edges (i.e., bonds and non-covalent interactions) from one ligand atom to another ligand atom or from one protein atom to another protein atom, whereas off-diagonal block matrices encode edges from the protein to the ligand and from ligand to protein. For illustration purposes, we choose the special case where there is only one edge type, :


where is 1 for neighbors and 0 otherwise, and . Within this framework, we can mathematically express a spatial graph convolution—a graph convolution based on notions of adjacency predicated on Euclidean distance—as a generalization of the GGNN characterized by the update (2).

In addition to edge type generalization, we introduce nonlinearity in the message portion of the graph convolutional layer:


where is a neural network for each edge type and denotes the neighbors of edge type for atom/node .

Finally, we generalize the concept of a layer to the notion of a stage that can span several layers of a given type. The Staged PotentialNet consists of three main steps: (1) covalent-only propagation, (2) dual non-covalent and covalent propagation, and (3) ligand-based graph gather. Stage (1), covalent propagation, entails only the first slice of the adjacency matrix, , which contains a at entry if there is a bond between and a otherwise. Intuitively, stage (1) computes a new set of vector-valued atom types for each of the atoms in the system based on their local networks of bonded atoms. Subsequently, stage (2) entails propagation based on both the full adjacency tensor which begins with the vector-valued atom types computed in (1). While stage (1) computes new bond-based “atom types” for both amino acid and ligand atoms, stage (2) passes both bond and spatial information between the atoms. For instance, if stage (1) distinguishes an amide carbonyl oxygen from a ketone carbonyl oxygen, stage (2) might communicate in the first layer that that carbonyl oxygen is also within Angstroms of a hydrogen bond donor. Finally, in stage (3) a graph gather is performed solely on the ligand atoms. The ligand-only graph gather is made computationally straightforward by the block matrix formulation described in (5).

PotentialNet, Stage 1:


PotentialNet, Stage 2:


PotentialNet, Stage 3:


where , , , are bond and spatial neural networks, and denotes the feature map for the atom at the end of stage 2.

Figure 2: Visual depiction of multi-staged spatial gated graph neural network. Stage 1 entails graph convolutions over only bonds, which derives new node (atom) feature maps roughly analogous to differentiable atom types in more traditional forms of molecular modeling. Stage 2 entails both bond-based and spatial distance based propagation of information. In the final stage, a graph gather operation is conducted over the ligand atoms, whose feature maps are derived from bonded ligand information and spatial proximity to protein atoms.

A theoretically attractive concept in (7) is that atom types—the per-atom feature maps—are derived from the same initial features for both ligand and protein atoms. In contrast to molecular dynamics force fields, e.g. Ponder and Case (2003), which—for historical reasons—have distinct force fields for ligands and for proteins which then must interoperate (often poorly) in simulation, our approach derives the physicochemical properties of biomolecular interactions from a unified framework.

To further illustrate, PotentialNet Stage 1 and Stage 2 exploit different subsets of the full adjacency tensor :

Figure 3: PotentialNet Stage 1 exploits only covalent or bonded interaction edge types encoded in the first slices of the last dimension of the adjacency tensor .
Figure 4: PotentialNet Stage 2 exploits both bonded and non-bonded interaction edge types spanning the entirety of the last dimension of the adjacency tensor .

Iii Measuring Early Enrichment in Regression Settings for Virtual Screening

Traditional metrics of predictor performance suffer from general problems and drug discovery-specific issues. For regressors, both —the “coefficient of determination”—and the root-mean-square error (

) are susceptible to single data point outliers. The

for both classifiers and regressors account for neither the training data distribution nor the null model performance. The area under the receiver operating characteristic curve

 Triballeau et al. (2005) does correct this deficiency in for classifiers. However, all aforementioned metrics are global statistics that equally weight all data points. This property is particularly undesirable in drug discovery, which is most interested in predicting the tails of a distribution: while model predictions are made against an entire library containing millions of molecules, one will only purchase or synthesize the top scoring molecules. In response, the cheminformatics community has adopted the concept of early enrichment. Methods like  Truchon and Bayly (2007) and  Mysinger and Shoichet (2010) weight the importance of the model’s highest performers more heavily.

iii.1 Proposed Metric:

At present, this progress in early enrichment measurement has been limited to classification and has yet to include regression. Therefore, we propose a new metric for early enrichment in regression, , analogous to . For a given target:


in which , the experimental (observed) measurement for sample , are ranked in descending order according to , the model (predicted) measurement for sample

. In other words, we compute the average z-score for the observed values of the top

scoring samples. We prefer this approach to computing, for example, , which has units that are the same as (i.e., values). Unfortunately, this unnormalized approach depends on the distribution in the dataset. For instance, in a distribution of measurements, if the maximum deviation from the mean is , the best a model can possibly perform would be to achieve an of .

We normalize through division by

, the standard deviation of the data. This allows comparison of model performance across datasets with a common unit of measurement but different variances in those measurements. The upper bound is therefore equal to the right hand side of (

10), where the indexed set of molecules constitutes the subset of the most experimentally active molecules. This value is dependent on both the distribution of the training data as well as the value . The is an average over z-scores, which themselves are real numbers of standard deviations away from the mean experimental activity.\fnsymbolmult\fnsymbolmult values may therefore exceed 1.0, since this means that the percentage of top predicted molecules have an average standard deviation of more than 1.0 above the mean.

Iv Results

iv.1 Cross-validation strategies

It is well known that the performance of DNN algorithms is highly sensitive to chosen hyperparameters. Such sensitivity underscores the criticality of rigorous cross-validation 

Boulesteix (2015); Chicco (2017). Several recent papers, including works that claim specifically to improve binding affinity prediction on the PDBBind dataset  Cang and Wei (2017); Jiménez et al. (2018), engage in the practice of searching hyperparmeters directly on the test set. Compounding this problem is a fundamental deficiency of the main cross-validation procedure used in this subfield that is discussed below.

While there are newer iterations of the PDBBind dataset, e.g., Ref. Liu et al., 2017, we choose to evaluate performance on PDBBind 2007 Wang et al. (2004, 2005) to compare performance of our proposed architectures to previous methods. In previous works, the PDBBind 2007 dataset was split by (1) beginning with the “refined” set comprising protein-ligand co-crystal structures and associated binding free energy; (2) removing the “core” set comprising samples to form the test set, with (3) the remaining samples serving as the training data. We term this train-test split “PDBBind 2007, Refined Train, Core Test” below, and compare performance with RF-score Ballester and Mitchell (2010), X-Score Wang, Lai, and Wang (2002); Wang, Lu, and Wang (2003), and the networks (7)-(9) described in this work.

One drawback to train-test split is possible overfitting to the test set through hyperparameter searching. Another limitation is that train and test sets will contain similar examples. Whereas it is typical in other machine learning disciplines for the train and test set examples to be drawn from the same statistical distributions, such a setting is not necessarily desirable in a molecular machine learning setting Martin et al. (2017). Drug discovery campaigns typically involve the synthesis and investigation of novel chemical matter. To accurately assess the generalizability of a trained model, the cross-validation strategy should reflect how that model will be deployed practically. In context of this reasoning, the “Refined Train, Core Test” strategy is not optimal for cross-validation. For example, Li and Yang (2017), showed that systematically removing samples from the PDBBind 2007 refined set with structural or sequence homology to the core (test) set significantly attenuated the performance of recent ML-based methods for affinity prediction.

Therefore, we propose and investigate a cross-validation strategy that splits all data into three distinct folds—train, validation, and test subsets—with agglomerative hierarchical clustering based on pairwise structural and sequence homology of the

proteins as distance metrics Husic and Pande (2017); Kramer and Gedeck (2010). Figure 5 contrasts this technique with other common splitting methods for ligand binding studies. Both sequence and structural similarity measures are described in Ref. Li and Yang, 2017. The agglomerative clustering procedure is described in detail in Ref. Husic and Pande, 2017 and is a specific case of the method introduced in Ref. Kramer and Gedeck, 2010. Our cross-validation on the PDBBind 2007 refined set with sequence similarity resulted in 978 train samples, 221 valid samples, and 101 test samples (75%-17%-8%); meanwhile, clustering on structural similarity yielded 925 train samples, 257 valid samples, and 118 test samples (71%-20%-9%). A supplementary file is provided with the two sets of train, validation, and test assignments.

Figure 5: Notional comparison of cross-validation splitting algorithms. The first four vertical panels from the left depict simple examples of random split, stratified split, time split, and scaffold split. The rightmost panel depicts a toy example of the agglomerative split proposed in this work. Both scaffold split and agglomerative split group similar data points together in order to promote the generalizability of the network to new data. Scaffold split uses the algorithm introduced by Bemis and Murcko (1996) to group ligands into common frameworks. The agglomerative split uses hierarchical agglomerative clustering to group ligand-protein systems according to pairwise sequence or structural similarity of the proteins. This figure is adapted from Ref. Wu et al., 2018 with permission from the Royal Society of Chemistry.

iv.2 Performance of methods on benchmarks

Model Test Test Test Pearson Test Spearman Test stdev Test MUE
PotentialNet 0.668 (0.043) 1.643 (0.127) 0.822 (0.021) 0.826 (0.020) 1.388 (0.070) 0.626 (0.037)
PotentialNet, 0.419 (0.234) 1.404 (0.171) 0.650 (0.017) 0.670 (0.014) 1.832 (0.135) 0.839 (0.005)
(ligand-only control)
TopologyNet, N/A N/A 0.826 N/A N/A N/A
No Valid. Set
RF-Score N/A N/A 0.783 0.769 N/A N/A
X-Score N/A N/A 0.643 0.707 N/A N/A
Table 1: Benchmark: PDBBind 2007, Refined Train, Core Test. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score. MUE is mean unsigned error. Pearson test scores for TopologyNet are reported from Ref. Cang and Wei, 2017 and RF- and X-Scores are reported from Ref. Li and Yang, 2017.

On the standard PDBBind 2007 “refined train, core test” benchmark, the PotentialNet Spatial Graph Convolution achieves state-of-the-art performance as reflected by several metrics. PotentialNet outperforms RF-Score and X-Score according to Pearson and Spearman correlation coefficients. The Pearson correlation score for (7)-(9) is within error of the reported score for TopologyNet, the heretofore top performing model on this benchmark. However, a key caveat must be noted with respect to this comparison: all cross-validation for this manuscript, including all of our results reported in Tables 1, 2, and 3, was performed such that performance on the test set was recorded for the hyperparameter set that performed most highly on the validation set. In contrast, the TopologyNet paper Cang and Wei (2017), models were trained on a combination of the validation and training sets and evaluated directly on the test set. Performance for TopologyNet Cang and Wei (2017) therefore reflects a train-validation type split rather than a train-validation-test split, which likely inflated the performance of that method.

Model Test Test Test Pearson Test Spearman Test MUE
PotentialNet 0.480 (0.030) 0.867 (0.036) 0.700 (0.003) 0.694 (0.012) 1.680 (0.061)
Ligand-only PotentialNet 0.414 (0.058) 0.883 (0.025) 0.653 (0.031) 0.674 (0.020) 1.712 (0.110)
RF-score 0.527 (0.014) 1.078 (0.143) 0.732 (0.009) 0.723 (0.013) 1.582 (0.034)
X-score 0.470 1.117 0.702 0.764 1.667
Table 2: Benchmark: PDBBind 2007 Refined, Agglomerative Sequence Split. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score. MUE is mean unsigned error. X-score does not have error because it is a deterministic linear model.

Intriguingly, the gap in performance between the PotentialNet Spatial Graph Convolution and the other tested statistical models changes considerably on the agglomerative structure and sequence split benchmarks. On sequence split, RF-score achieves the best overall performance, followed by a statistical tie between the Staged Spatial Graph Convolution (

7)-(9) and X-Score, followed by the ligand-only graph convolutional control. Meanwhile, on structure split, PotentialNet achieves the highest overall performance, followed by RF-Score, followed by a statistical tie of X-Score, and the graph convolutional ligand-only control.

Model Test Test Test Pearson Test Spearman Test MUE
PotentialNet 0.629 (0.044) 1.576 (0.053) 0.823 (0.023) 0.805 (0.019) 1.553 (0.125)
Ligand-only PotentialNet 0.500 (0.010) 1.498 (0.411) 0.733 (0.007) 0.726 (0.005) 1.700 (0.067)
RF-score 0.594 (0.005) 0.869 (0.090) 0.779 (0.003) 0.757 (0.005) 1.542 (0.046)
X-score 0.517 0.891 0.730 0.751 1.751
Table 3: Benchmark: PDBBind 2007 Refined, Agglomerative Structure Split. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score. MUE is mean unsigned error. X-score does not have error because it is a deterministic linear model.

It is noteworthy that the PotentialNet Spatial Graph Convolutions ((7)-(9) perform competitively with other compared methods when the proposed Spatial Graph Convolutions are predicated on very simple, per-atom features and pure notions of distance whereas RF-Score, X-Score, and TopologyNet all directly incorporate domain-expertise driven information on protein-ligand interactions.

iv.2.1 Sanity check with a traditional RNN

Given the unreasonable effectiveness of deep learning methods in mostly unstructured settings, it is important to justify our incorporation of domain knowledge over a purely deep learning-based approach. To do this, we trained a bidirectional long short-term memory (LSTM) network, a commonly-used recurrent neural network (RNN) that handles both past and future context well. We represented the protein-ligand complexes using a sequential representation of protein-ligand complexes in PDBBind: proteins were one-hot encoded by amino acid, and ligands were similarly encoded on a character-level using their SMILES string representation. The test Pearson correlation coefficient corresponding to the best validation score (using the same metric) was 0.518, far worse than our results and justifying our model’s incorporation of domain knowledge.

iv.3 Ligand-based models

While crystallography, NMR, and, most recently, cryo electron microscopy have opened a new paradigm of structure-based drug design, many critical tasks of drug discovery can be predicted from the chemical composition of a given molecule itself, without explicit knowledge of the macromolecule(s) to which they bind. Such properties include electronic spectra (important for parameterizing small molecule force fields for molecular dynamics simulations, for example), solubility, and animal toxicity.

Quantum mechanical datasets are particularly ripe for machine learning algorithms since it is straightforward to generate training data at some known accuracy. The QM8 dataset Ramakrishnan et al. (2015), which contains several electronic properties for small molecules in the GDB-8 set, lends itself particularly well for benchmarking PotentialNet (7)-(9) since each compound’s properties are calculated based on the three-dimensional coordinates of each element. The ESOL solubility Delaney (2004) and Tox21 toxicity Tice et al. (2013) datasets map two-dimensional molecular representations consisting solely of atoms and their bonds to their respective single-task and multi-task outputs, and therefore serve as validation of our neural network implementations as well as of the value of incorporating nonlinearity into the message function.

To summarize, our computational experiments indicate that PotentialNet leads to statistically significant improvements in performance for all three investigated ligand-based tasks. For the QM8 dataset, we were able to directly assess the performance benefit that stems from separating spatial graph convolutions into distinct stages. Recall that Stage I of PotentialNet (7) propagates information over only bonds and therefore derives differentiable “atom types”, whereas Stage II of PotentialNet (8) propagates information over both bonds and different binned distances. We performed an experiment with QM8 in which Stage I was essentially skipped, and graph convolutions propagated both covalent and non-covalent interactions without a privileged first stage for only covalent interactions. Clearly, separating the two stages led to a significant boost in performance.

For each ligand model investigation we benchmark against the error suggested upon introduction of the dataset, or in order to enable direct comparison with previously published approaches. For extensive benchmarking of various models on these and other datasets, we refer the reader to Ref. Wu et al., 2018.

iv.3.1 Quantum Property Prediction

Table 4 reports the performances in mean absolute error (MAE) over 21786 compounds and 12 tasks in QM8. We utilize MAE for consistency with the original proposal of the database Ramakrishnan et al. (2015). Multiple PotentialNet variants and two mature deep learning models: deep tensor neural network Schütt et al. (2017) (DTNN) and message passing neural network Gilmer et al. (2017) (MPNN) are evaluated, in which the latter two models proved to be successful on quantum mechanical tasks (e.g., atomization energyWu et al. (2018)

). We restricted the training length to 100 epochs and performed 100 rounds of hyperparameter search on PotentialNet models. Staged spatial PotentialNet model achieved the best performances in the group, demonstrating strong predictive power on the tasks. We have also included taskwise results in Appendix 


Network Valid MAE Test MAE
Spatial PotentialNet, Staged 0.0120 0.0118 (0.0003)
Spatial PotentialNet, SingleUpdate 0.0133 0.0131 (0.0001)
MPNN 0.0142 0.0139 (0.0007)
DTNN 0.0168 0.0163 (0.0010)
Table 4: Quantum Property Prediction with QM8 Dataset. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score.

iv.3.2 Toxicity

In the multitask toxicity benchmark, we evaluated the performances of two graph convolutional type models Kearnes et al. (2016); Duvenaud et al. (2015) and PotentialNet on the Tox21 dataset under same evaluation pattern. With 100 epochs of training, PotentialNet demonstrated higher ROC-AUC scores on both validation and test scores, outperforming Weave and GraphConv by a comfortable margin.

Network Valid ROC AUC Test ROC AUC
PotentialNet 0.878 0.857 (0.006)
Weave 0.852 0.831 (0.013)
GraphConv 0.858 0.838 (0.001)
XGBoost 0.778 0.808 (0.000)
Table 5: Toxicity Prediction with the Tox21 Dataset. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score.

iv.3.3 Solubility

The same three models are also tested and compared on a solubility task Delaney (2004), using to quantify the error in order to compare to previous work Duvenaud et al. (2015). PotentialNet achieved slightly smaller than Weave and GraphConv (Table 6). Under the limited 100 epochs training, the final test RMSE is comparable or even superior to the best scores reported for Weave and GraphConv (0.46 Kearnes et al. (2016) and 0.52 Duvenaud et al. (2015) respectively).

Network Valid RMSE Test RMSE
PotentialNet 0.517 0.490 (0.014)
Weave 0.549 0.553 (0.035)
GraphConv 0.721 0.648 (0.019)
XGBoost 1.182 0.912 (0.000)
Table 6: Solubility Prediction with the Delaney ESOL Dataset. Error bars are recorded as standard deviation of the test metric over three random initializations of the best model as determined by average validation set score.
Network Hyperparameter Name Symbol Possible Values
PotentialNet Gather Widths (Bond and Spatial) [64, 128]
PotentialNet Number of Bond Convolution Layers [1, 2]
PotentialNet Number of Spatial Convolution Layers [1, 2, 3]
PotentialNet Gather Width [64, 128]
PotentialNet Number of Graph Convolution Layers [1, 2, 3]
Both Fully Connected Widths of [[128, 32, 1], [128, 1], [64, 32, 1], [64, 1]]
Both Learning Rate - [1e-3, 2e-4]
Both Weight Decay - [0., 1e-7, 1e-5, 1e-3]
Both Dropout - [0., 0.25, 0.4, 0.5]
Table 7: Hyperparameters for neural networks (7)-(9).

V Discussion

Spatial Graph Convolutions exhibit state-of-the-art performance in affinity prediction. Whether based on linear regression, random forests, or other classes of DNNs, all three of RF-Score, X-Score, and TopologyNet are machine learning models that explicitly draw upon traditional physics-based features. Meanwhile, the Spatial Graph Convolutions presented here use a more principled deep learning approach. Input features are only basic information about atoms, bonds, and distances. This framework does not use traditional hand-crafted features, such as hydrophobic effects,

-stacking, or hydrogen bonding. Instead, higher-level interaction “features” are learned through intermediate graph convolutional neural network layers.

The traditional PDBBind 2007 benchmark uses samples from the refined set for training and from the core set for testing. Here, Spatial Graph Convolutions outperform X-Score and RF-Score and perform comparably with TopologyNet (even though this searched hyperparameters directly over the test dataset). On our proposed agglomerative clustering cross-validation benchmark, the choice of sequence or structure split affects relative performance. On sequence split, RF-Score achieved the highest overall performance, with Staged Spatial Graph Convolutions and X-Score statistically tied for second. But on structure split, the Staged Spatial Graph Convolutions performed best, with RF-score in second place.

While the Pearson correlation was employed in the preceding performance comparison, instead comparing methods through tells a mildly different story. On the agglomerative sequence cross-validation split, in which test proteins are separated from train proteins based on amino acid sequence deviation, X-Score statistically ties RF-Score for the best model, while PotentialNet statistically ties the ligand-only PotentialNet control for last place at over average standard deviations worse than X-Score and RF-Score for the top of predictions. Meanwhile, using the agglomerative structure cross-validation split, PotentialNet exceeds the performance of X-Score and RF-Score by over average standard deviations, though is within a statistical tie of the ligand-only PotentialNet control (which has a surprisingly high variance in its ). Taken together, we aver that it is important for the future of ML-driven structure-based drug discovery to carefully choose both (1) the cross-validation technique and (2) the metric of performance on held-out test set in order to most accurately reflect the capacity of their models to generalize in simulated realistic settings.

In light of the continued importance and success of ligand-based methods in drug discovery, we benchmarked PotentialNet on several ligand based tasks: electronic property (multitask), solubility (single task), and toxicity prediction (multitask). Statistically significant performance increases were observed for all three prediction tasks. A potentially step change improvement was observed for the QM8 challenge which also reinforced the value of the concept of stages that privilege bonded from non-bonded interactions.

Despite the simpler input featurization, Spatial Graph Convolutions can learn an accurate mapping of protein-ligand structures to binding free energies using the same relatively low amount of data as previous expertise-driven approaches. We thus expect that as larger sets of training data become available, Spatial Graph Convolutions can become the gold standard in affinity prediction. Unfortunately, such larger, publicly available datasets are currently nonexistent. We thus call upon academic experimental scientists and/or their pharmaceutical industry counterparts to release as much high-quality protein-ligand binding affinity data as possible so the community can develop and benefit from better affinity prediction models.

Due to the field’s immense practical applications, our algorithms must prioritize realizable results over incremental improvements on somewhat arbitrary benchmarks. We thus also present a new benchmark score and accompanying cross-validation procedure. The latter draws on agglomerative clustering of sequence and structural similarity to construct challenging train-test splits. Using this proposed cross-validation schema, on sequence-based splitting (Table 2) we observe in the Pearson correlation column that RF-score exceeds X-Score, and X-Score statistically ties Spatial Graph Convolutions. For structure-based splitting (Table 3) we observe that Spatial Graph Convolution both RF-Score and X-Score in the Pearson correlation column. We highlight the Pearson correlation for consistency with the literature, but present other metrics in the Tables 2 and 3 from which similar conclusions could be drawn.

This construction (i.e., choice of cross-validation schema) helps assess models with a practical test set, such as one containing newly designed compounds on previously unseen protein targets. Although standard machine learning practice draws train and test sets from the same distribution, if machine leaning is to be applied to real-world drug discovery settings it is imperative that we accurately measure a given model’s capacity both to interpolate within familiar regions of chemical space as well as to generalize to its less charted territories.



DNNs were constructed and trained with PyTorch 

Paszke et al. (2017). Custom Python code was used based on RDKit rdk and OEChem oec with frequent use of NumPy Walt, Colbert, and Varoquaux (2011) and SciPy Jones et al. (01). Networks were trained on chemical element, formal charge, hybridization, aromaticity, and the total numbers of bonds, hydrogens (total and implicit), and radical electrons. Random forest and linear regression models (i.e., X-Score) were implemented using scikit-learn Pedregosa et al. (2011); all random forests models were trained with trees and features per tree, and the implementation of X-Score is described in Ref. Li and Yang, 2017. Hyperparameters for PotentialNet models trained for binding affinity, electronic properties, toxicity, and solubility studies are given in Table 7; for toxicity and solubility models, only bond graph convolution layers are employed since there are no 3D coordinates provided for the associated datasets. For these three tasks, random splitting was used for cross validation. For the RNN sanity check of the ligand binding task, the best-performing LSTM sanity check was constructed with 5 layers, a hidden size of 32, 10 classes, and a learning rate of 3.45e-4.

Cross-validation on PDBBind 2007 core test set benchmark. The core set was removed from the refined set sorted temporally to create the test set. Up to 8 hyperparameters were tuned through random search. -fold temporal cross validation was conducted within the train set for each hyperparameter set. For each held-out fold, validation set performance was recorded at the epoch with maximal Pearson correlation coefficient between the labeled and predicted values in the validation set. For each hyperparameter set, the validation score was the average Pearson score over the folds using the best epoch for each fold. The set with the best validation score was then used to evaluate test performance. The training set was split into temporal folds; for each fold, test set performance was recorded at the epoch with highest validation score. All reported metrics are given as the median with the standard deviation over K folds in parentheses.

Cross-validation on PDBBind 2007 structure and sequence agglomerative clustering split benchmarks. Agglomerative clustering was performed with Ward’s method Ward (1963). Pairwise distance between PDB proteins was measured as either minus the sequence homology or the TMScore Zhang and Skolnick (2004). Within the train set, for each hyperparameter set, random splits within the train set were performed. For each held-out fold, validation set performance was recorded at the epoch with maximal Pearson correlation coefficient. The set with the best average Pearson score on the validation set was used to evaluate test set performance. The training set was again randomly split into folds; for each fold, test set performance was recorded at the epoch at which the held out performance was highest according to Pearson score. Metrics are reported as detailed above.

Safety statement

No unexpected or unusually high safety hazards were encountered in this study.

Supporting Information

Sequence- and structure-based agglomerative clustering cross-validation splittings for the PDBBind 2007 refined set


We are grateful to the anonymous reviewers for their suggestions. E.N.F. is supported by the Blue Waters Graduate Fellowship. Y.L., S.S., and J.Y. acknowledge the support of the National Natural Science Foundation of China (11501306) and the Fok Ying-Tong Education Foundation (161003). B.R. was supported by the Fannie and John Hertz Foundation. V.S.P. is a consultant & SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel Inc, Asimov Inc, BioAge Labs, Freenome Inc, Omada Health, Patient Ping, Rigetti Computing, and is a General Partner at Andreessen Horowitz. The Pande Group acknowledges the generous support of Dr. Anders G. Frøseth and Mr. Christian Sundt for our work on machine learning. The Pande Group is broadly supported by grants from the NIH (R01 GM062868 and U19 AI109662) as well as gift funds and contributions from Folding@home donors.

Appendix A Computational complexity of network architectures

Here, we consider the complexity of the various neural architectures discussed in Sec. II, starting with the simple fully-connected setting. Forward propagation can be understood as passing an input vector of length through matrix multiplications, each , and elementwise nonlinear activation layers, each . Assuming , this yields a total complexity of . Since back-propagation involves the same dimensions and number of layers, just in reverse order, it has the same complexity as the forward operation.

Complexity analysis for a 2D convolutional neural network, typical in computer vision tasks, is a bit more involved. An -dimensional filter on an image yields computations on pixels, in total

. Using the Fast Fourier Transform for optimized processing, a single convolutional layer will be

, the same complexity as the entire network. Notice that these operations’ costs grow exponentially with the dimension of Euclidean data—making the exploitation of symmetry far more important for 3D graph data.

The GGNN family of graph convolutional architectures includes effective optimizations to reduce complexity on graphs. Let

be the dimension of each node’s internal hidden representation and

be the number of nodes in the graph. A single step of message passing for a dense graph requires multiplications. Breaking the dimensional node embeddings into different dimensional embeddings reduces this runtime to . As most molecules are sparse or relatively small graphs, these layers are typically . Although other optimizations exist, such as utilizing spectral representations of graphs, the models presented in this work build around this general GGNN framework with different nonlinearities and update rules. None of these are sufficiently computationally expensive enough to alter the total runtime.

Appendix B Taskwise results for quantum property prediction

Task DTNN MPNN PotentialNet, PotentialNet,
Single Update Staged
E1 - CC2 0.0092 0.0084 0.0070 0.0068
E2 - CC2 0.0092 0.0091 0.0079 0.0074
f1 - CC2 0.0182 0.0151 0.0137 0.0134
f2 - CC2 0.0377 0.0314 0.0296 0.0285
E1 - PBE0 0.0090 0.0083 0.0070 0.0067
E2 - PBE0 0.0086 0.0086 0.0074 0.0070
f1 - PBE0 0.0155 0.0123 0.0112 0.0108
f2 - PBE0 0.0281 0.0236 0.0228 0.0215
E1 - CAM 0.0086 0.0079 0.0066 0.0063
E2 - CAM 0.0082 0.0082 0.0069 0.0064
f1 - CAM 0.0180 0.0134 0.0123 0.0117
f2 - CAM 0.0322 0.0258 0.0245 0.0233
Table 8: QM8 Test Set Performances of All Tasks (Mean Absolute Error)

In Table 8 we have recorded the test set performances for all twelve tasks in the QM8 dataset using the MAE for a deep tensor neural network Schütt et al. (2017) (DTNN), a message passing neural network Gilmer et al. (2017) (MPNN), and the staged and single update Spatial PotentialNet networks as in Sec. II.2.1.



  • (1) J. Deng, W. Dong, R. Socher, L. Li, K. Li,  and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 

    2009 IEEE Conference on Computer Vision and Pattern Recognition

    , Miami, FL, USA, June 20-25, 2009
  • Liu et al. (2017) Z. Liu, M. Su, L. Han, J. Liu, Q. Yang, Y. Li,  and R. Wang, “Forging the basis for developing protein–ligand interaction scoring functions,” Acc. Chem. Res. 50, 302–309 (2017).
  • Wu et al. (2018) Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing,  and V. Pande, “MoleculeNet: A benchmark for molecular machine learning,” Chem. Sci. 9, 513–530 (2018).
  • Hastie, Tibshirani, and Friedman (2009)

    T. Hastie, R. Tibshirani,  and J. Friedman, “Overview of supervised learning,” in 

    The Elements of Statistical Learning (Springer Series in Statistics New York, NY, USA, 2009) pp. 9–41.
  • Kearnes et al. (2016) S. Kearnes, K. McCloskey, M. Berndl, V. Pande,  and P. Riley, “Molecular graph convolutions: Moving beyond fingerprints,” J. Comput. Aided Mol. Des. 30, 595–608 (2016).
  • Kipf and Welling (2016) T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907  (2016).
  • Li et al. (2016) Y. Li, R. Zemel, M. Brockschmidt,  and D. Tarlow, “Gated graph sequence neural networks,” in Proceedings of the International Conference on Learning Representations 2016, San Juan, Puerto Rico, May 2-4, 2016 (2016).
  • Gilmer et al. (2017) J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals,  and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, edited by D. Precup and Y. W. Teh (PMLR, International Convention Centre, Sydney, Australia, 2017) pp. 1263–1272.
  • Morgan (1965) H. Morgan, “The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service.” J. Chem. Doc. 5, 107–113 (1965).
  • Weininger, Weininger, and Weininger (1989) D. Weininger, A. Weininger,  and J. L. Weininger, “Smiles. 2. algorithm for generation of unique smiles notation,” J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
  • Ullmann (1976) J. R. Ullmann, “An algorithm for subgraph isomorphism,” J. Assoc. Comput. Mach. 23, 31–42 (1976).
  • Rogers and Hahn (2010) D. Rogers and M. Hahn, “Extended-connectivity fingerprints,” J. Chem. Inf. Model. 50, 742–754 (2010).
  • Hawkins, Skillman, and Nicholls (2007) P. C. Hawkins, A. G. Skillman,  and A. Nicholls, “Comparison of shape-matching and docking as virtual screening tools,” J. Med. Chem. 50, 74–82 (2007).
  • Kearnes and Pande (2016) S. Kearnes and V. Pande, “ROCS-derived features for virtual screening,” J. Comput. Aided Mol. Des. 30, 609–617 (2016).
  • Krizhevsky, Sutskever, and Hinton (2012) A. Krizhevsky, I. Sutskever,  and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, December 3-8, 2012, edited by F. Pereira, C. J. C. Burges, L. Bottou,  and K. Q. Weinberger (Curran Associates, Inc., 2012) pp. 1097–1105.
  • Duvenaud et al. (2015) D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik,  and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems 28, Montréal, Canada, December 7-12, 2015, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,  and R. Garnett (Curran Associates, Inc., 2015) pp. 2224–2232.
  • Perutz et al. (1960) M. F. Perutz, M. G. Rossmann, A. F. Cullis, H. Muirhead, G. Will,  and A. North, “Structure of hæmoglobin: A three-dimensional Fourier synthesis at 5.5-å. resolution, obtained by X-ray analysis,” Nature 185, 416–422 (1960).
  • Jain (2003) A. N. Jain, “Surflex: Fully automatic flexible molecular docking using a molecular similarity-based search engine,” J. Med. Chem. 46, 499–511 (2003).
  • Shoichet, Kuntz, and Bodian (1992) B. K. Shoichet, I. D. Kuntz,  and D. L. Bodian, “Molecular docking using shape descriptors,” J. Comput. Chem. 13, 380–397 (1992).
  • Trott and Olson (2010) O. Trott and A. J. Olson, “Autodock vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading,” J. Comput. Chem. 31, 455–461 (2010).
  • Friesner et al. (2004) R. A. Friesner, J. L. Banks, R. B. Murphy, T. A. Halgren, J. J. Klicic, D. T. Mainz, M. P. Repasky, E. H. Knoll, M. Shelley, J. K. Perry, et al., “Glide: A new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy,” J. Med. Chem. 47, 1739–1749 (2004).
  • Wang et al. (2015) L. Wang, Y. Wu, Y. Deng, B. Kim, L. Pierce, G. Krilov, D. Lupyan, S. Robinson, M. K. Dahlgren, J. Greenwood, et al., “Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field,” J. Am. Chem. Soc. 137, 2695–2703 (2015).
  • Hensen et al. (2004) C. Hensen, J. C. Hermann, K. Nam, S. Ma, J. Gao,  and H.-D. Höltje, “A combined QM/MM approach to protein-ligand interactions: Polarization effects of the HIV-1 protease on selected high affinity inhibitors,” J. Med. Chem. 47, 6673–6680 (2004).
  • Li et al. (2015) H. Li, K.-S. Leung, M.-H. Wong,  and P. J. Ballester, “Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets,” Mol. Inform. 34, 115–126 (2015).
  • Ballester and Mitchell (2010) P. J. Ballester and J. B. Mitchell, “A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking,” Bioinformatics 26, 1169–1175 (2010).
  • Durrant and McCammon (2011) J. D. Durrant and J. A. McCammon, “Nnscore 2.0: A neural-network receptor–ligand scoring function,” J. Chem. Inf. Model. 51, 2897–2903 (2011).
  • Wallach, Dzamba, and Heifets (2015) I. Wallach, M. Dzamba,  and A. Heifets, “AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery,” arXiv preprint arXiv:1510.02855  (2015).
  • Ragoza et al. (2017) M. Ragoza, J. Hochuli, E. Idrobo, J. Sunseri,  and D. R. Koes, “Protein–ligand scoring with convolutional neural networks,” J. Chem. Inf. Model 57, 942–957 (2017).
  • Gomes et al. (2017) J. Gomes, B. Ramsundar, E. N. Feinberg,  and V. S. Pande, “Atomic convolutional networks for predicting protein-ligand binding affinity,” arXiv preprint arXiv:1703.10603  (2017).
  • Ponder and Case (2003) J. W. Ponder and D. A. Case, “Force fields for protein simulations,” Adv. Protein Chem. 66, 27–85 (2003).
  • Triballeau et al. (2005) N. Triballeau, F. Acher, I. Brabet, J.-P. Pin,  and H.-O. Bertrand, “Virtual screening workflow development guided by the “receiver operating characteristic” curve approach. application to high-throughput docking on metabotropic glutamate receptor subtype 4,” J. Med. Chem. 48, 2534–2547 (2005).
  • Truchon and Bayly (2007) J.-F. Truchon and C. I. Bayly, “Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem,” J. Chem. Inf. Model. 47, 488–508 (2007).
  • Mysinger and Shoichet (2010) M. M. Mysinger and B. K. Shoichet, “Rapid context-dependent ligand desolvation in molecular docking,” J. Chem. Inf. Model. 50, 1561–1573 (2010).
  • (34) values may therefore exceed 1.0, since this means that the percentage of top predicted molecules have an average standard deviation of more than 1.0 above the mean.
  • Boulesteix (2015) A.-L. Boulesteix, “Ten simple rules for reducing overoptimistic reporting in methodological computational research,” PLoS Comput. Biol. 11, e1004191 (2015).
  • Chicco (2017) D. Chicco, “Ten quick tips for machine learning in computational biology,” BioData Min. 10, 35 (2017).
  • Cang and Wei (2017) Z. Cang and G. Wei, “TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions,” PLoS Comput. Biol. 13, e1005690 (2017).
  • Jiménez et al. (2018) J. Jiménez, M. Skalic, G. Martínez-Rosell,  and G. De Fabritiis, “K deep: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks,” J. Chem. Inf. Model. 58, 287–296 (2018).
  • Wang et al. (2004) R. Wang, X. Fang, Y. Lu,  and S. Wang, “The PDBbind database: Collection of binding affinities for protein–ligand complexes with known three-dimensional structures,” J. Med. Chem. 47, 2977–2980 (2004).
  • Wang et al. (2005) R. Wang, X. Fang, Y. Lu, C.-Y. Yang,  and S. Wang, “The PDBbind database: Methodologies and updates,” J. Med. Chem. 48, 4111–4119 (2005).
  • Wang, Lai, and Wang (2002) R. Wang, L. Lai,  and S. Wang, “Further development and validation of empirical scoring functions for structure-based binding affinity prediction,” J. Comput. Aided Mol. Des. 16, 11–26 (2002).
  • Wang, Lu, and Wang (2003) R. Wang, Y. Lu,  and S. Wang, “Comparative evaluation of 11 scoring functions for molecular docking,” J. Med. Chem. 46, 2287–2303 (2003).
  • Martin et al. (2017) E. J. Martin, V. R. Polyakov, L. Tian,  and R. C. Perez, “Profile-QSAR 2.0: Kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds,” J. Chem. Inf. Model. 57, 2077–2088 (2017).
  • Li and Yang (2017) Y. Li and J. Yang, “Structural and sequence similarity makes a significant impact on machine-learning-based scoring functions for protein–ligand interactions,” J. Chem. Inf. Model. 57, 1007–1012 (2017).
  • Husic and Pande (2017) B. E. Husic and V. S. Pande, “Unsupervised learning of dynamical and molecular similarity using variance minimization,” arXiv preprint arXiv:1712.07704  (2017).
  • Kramer and Gedeck (2010) C. Kramer and P. Gedeck, “Leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets,” J. Chem. Inf. Model. 50, 1961–1969 (2010).
  • Bemis and Murcko (1996) G. W. Bemis and M. A. Murcko, “The properties of known drugs. 1. molecular frameworks,” J. Med. Chem. 39, 2887–2893 (1996).
  • Ramakrishnan et al. (2015) R. Ramakrishnan, M. Hartmann, E. Tapavicza,  and O. A. von Lilienfeld, “Electronic spectra from TDDFT and machine learning in chemical space,” J. Chem. Phys. 143, 084111 (2015).
  • Delaney (2004)

    J. S. Delaney, “ESOL: Estimating aqueous solubility directly from molecular structure,” J. Chem. Inf. Comput. Sci. 

    44, 1000–1005 (2004).
  • Tice et al. (2013) R. R. Tice, C. P. Austin, R. J. Kavlock,  and J. R. Bucher, “Improving the human hazard characterization of chemicals: A tox21 update,” Environ. Health Perspect. 121, 756 (2013).
  • Schütt et al. (2017) K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller,  and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nat. Commun. 8, 13890 (2017).
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,  and A. Lerer, “Automatic differentiation in PyTorch,” in Neural Information Processing Systems Autodiff Workshop, Long Beach, CA, USA, December 9, 2017, edited by A. Wiltschko, B. van Merriënboer,  and P. Lamblin (2017) ,, [Online; accessed September 10, 2018].
  • (53) RDKit: Open-source cheminformatics;, [Online; accessed September 10, 2018].
  • (54) OEChem OpenEye Scientific Software, Santa Fe, NM., [Online; accessed September 10, 2018].
  • Walt, Colbert, and Varoquaux (2011) S. v. d. Walt, S. C. Colbert,  and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Comput. Sci. Eng. 13, 22–30 (2011).
  • Jones et al. (01 ) E. Jones, T. Oliphant, P. Peterson,  and others, “SciPy: Open source scientific tools for Python,”  (2001–),, [Online; accessed September 10, 2018].
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,  and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res. 12, 2825–2830 (2011).
  • Ward (1963) J. H. Ward, “Hierarchical grouping to optimize an objective function,” J. Amer. Statist. Assoc. 58, 236–244 (1963).
  • Zhang and Skolnick (2004) Y. Zhang and J. Skolnick, “Scoring function for automated assessment of protein structure template quality,” Proteins: Struct., Funct., Bioinf. 57, 702–710 (2004).