PADME: A Deep Learning-based Framework for Drug-Target Interaction Prediction

In silico Drug-target Interaction (DTI) prediction is an important and challenging problem in medicinal chemistry with a huge potential benefit to the pharmaceutical industry and patients. Most existing methods for DTI prediction generally have binary endpoints, which could be an oversimplification of the problem. With the advent of deep learning, some deep learning models were devised to solve the DTI prediction problem, but most of them still use binary endpoints, and they are generally unable to handle cold-target problems, i.e., problems involving target protein that never appeared in the training set. We contrived PADME (Protein And Drug Molecule interaction prEdiction), a framework based on Deep Neural Networks, to predict real-valued interaction strength between compounds and proteins. PADME inputs both compound and protein information into the model, so it is applicable to cold-target problems. To our knowledge, we are also the first to incorporate Molecular Graph Convolution (MGC) into the model for compound featurization. We used different Cross-Validation split schemes and different metrics to measure the performance of PADME on multiple datasets (in which we are the first to use ToxCast for such problems), and PADME consistently dominates baseline methods. We also conducted a case study, predicting the interaction between compounds and androgen receptor (AR) and compared the prediction results with growth inhibition activity of the compounds in NCI60, which also gave us satisfactory results, suggesting PADME's potential in drug development. We expect different variants of PADME to be proposed and experimented on in the future, and we believe Deep Learning will transform the field of cheminformatics.







In silico drug-target interaction (DTI) prediction is an important and challenging problem in biomedical research with a huge potential benefit to the pharmaceutical industry and patients. Most existing methods for DTI prediction including deep learning models generally have binary endpoints, which could be an oversimplification of the problem, and those methods are typically unable to handle cold-target problems, i.e., problems involving target protein that never appeared in the training set. Towards this, we contrived PADME (Protein And Drug Molecule interaction prEdiction), a framework based on Deep Neural Networks, to predict real-valued interaction strength between compounds and proteins. PADME takes both compound and protein information as inputs, so it is capable of solving cold-target (and cold-drug) problems. To our knowledge, we are the first to combine Molecular Graph Convolution (MGC) for compound featurization with protein descriptors for DTI prediction. We used multiple cross-validation split schemes and evaluation metrics to measure the performance of PADME on multiple datasets, including the ToxCast dataset, which we believe should be a standard benchmark for DTI problems, and PADME consistently dominates baseline methods. The results of a case study, which predicts the binding affinity between various compounds and androgen receptor (AR), suggest PADME’s potential in drug development. The scalability of PADME is another advantage in the age of Big Data.

1 Introduction

Finding out the interaction strengths between compounds (candidate drugs) and target proteins is of crucial importance in the drug development process. However, doing so in wet lab experiments is both expensive and time-consuming, while virtual screening using computational (also called “in silico”) methods to predict the interactions between compounds and target proteins can greatly accelerate the drug development process at a significantly reduced cost. Indeed, machine learning models for drug-target interaction (DTI) prediction are often used in computer-aided drug design.


Datasets used for training and evaluating machine learning models for DTI prediction often include compounds’ interaction strengths with enzymes, ion channels, nuclear receptors, etc. [47]. Traditionally, these datasets contain binary labels for the interaction of certain drug-target pairs, with 1 indicating a known interaction. Recently, the community has also explored the usage of datasets with real-valued interaction strength measurements [29, 12], which include the Davis dataset [7] that uses the dissociation constant ($K_d$), the Metz dataset [24] that uses the inhibition constant ($K_i$), and the KIBA dataset [39] whose authors devised their own measurement index.

Existing traditional machine learning methods for predicting DTI can be roughly divided into similarity-based and feature-based approaches, and most of them formulate the problem as a classification problem. Similarity-based methods depend on the assumption that compounds with similar structures should have similar effects. Feature-based methods construct feature vectors as input, which are generated by combining descriptors of compounds with descriptors of targets, and the feature vectors serve as inputs for algorithms such as support vector machine (SVM).


SimBoost [12] and KronRLS [29] are two state-of-the-art methods for DTI prediction. Both of them have single outputs. KronRLS is based on Regularized Least Squares and utilizes the similarity matrices for drugs and targets to get the parameter values. SimBoost is a feature-based method, but in its feature construction, similarity matrices of the drugs and those of targets are also involved. These methods can both predict continuous values and binarized values. However, these methods either simply rely on similarities, or require expert knowledge to define the relevant features of proteins and compounds. Additionally, they are unable to model highly complex interactions within compound molecules [22] and between the compounds and their target proteins.

Deep Neural Networks (DNN) promise to address these challenges.

Deep learning, the machine learning method based on DNN, has been enjoying an ever-rising popularity in the past few years. It has seen wide and exciting applications in computer vision, speech recognition, natural language processing, reinforcement learning, and drug-target interaction prediction. DNNs can automatically extract important features from the input data, synthesize and integrate low-level features into high-level features, and capture complicated nonlinear relationships in a dataset [18, 33]. Deep learning-based DTI prediction has been shown to consistently outperform the existing methods and has become the new “gold standard” [6, 40, 20].

The current deep learning approaches to drug-target interaction prediction can be roughly categorized based on their neural network types and prediction endpoints. Simple feedforward neural networks, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have been adopted in various papers [49]. To our knowledge, almost all existing deep learning methods, except those that have 3D structural information as input, treat the problem as a classification problem, most of which are binary, namely active/inactive. Though there are deep learning models using 3D structural information that yield good results in regression problems [44, 10], the requirement of 3D structural information limits the applicability of a model since such information is not always available, so we do not consider them in this paper. As deep learning for DTI is still in its infancy, the current models have several disadvantages.

First, formulating the problem as a classification problem has several disadvantages. The classification result depends on a predefined binarization threshold, which introduces some arbitrariness into the data. Some useful information is also lost; for instance, true-negative and missing values may not be discriminated in some chemical datasets [29, 12]. On the other hand, if we formulate it as a regression problem, not only can we avoid the problems above, but the real-valued outputs can also be easily converted to produce a ranking or classification. Some existing non-DNN methods formulate the problem as a regression problem, in which the interaction strength between the drug molecule and the target protein is a real number, serving as the regression target [12]. Common real-valued interaction strength metrics include $K_d$, $K_i$, the 50% inhibition concentration ($IC_{50}$), etc.

The second problem is that most of the existing deep learning methods do not incorporate the target protein information into the network, except very few recent works, like Wen et al. [45]. As a result, the models are unable to solve the “cold target” problem, i.e. to predict the drug-target interactions for target proteins absent in the training dataset.

A recent model, DeepDTI [45], addressed the second problem by combining the protein information with the compound feature vector. It uses the classical Extended-Connectivity Fingerprint (ECFP) [32] for describing compounds, which relies on a fixed hashing function and cannot adjust to specific problems at hand. DeepDTI concatenates ECFP and Protein Sequence Composition (PSC) descriptors [4] (describing the target proteins’ sequence information) to construct a feature vector, which is fed into a Deep Belief Network (DBN) to predict a binary endpoint. DeepDTI outperformed the state-of-the-art methods on a dataset extracted from DrugBank.

In this paper, we propose PADME (Protein And Drug Molecule interaction prEdiction), a deep learning-based framework for predicting DTI, which can be roughly categorized into the feature-based methods. PADME overcomes the limitations of the existing methods by predicting real-valued interaction strengths instead of binary class labels, and, to address the cold-start problems (drugs or targets that are absent from the training set but appear in the test set), PADME utilizes a combination of drug and target protein features/fingerprints as the input vector. Because the DBN used in DeepDTI has fallen out of favor in the deep learning community after Rectified Linear Units (ReLU) were introduced to improve the performance of feedforward networks, PADME uses a feedforward network, mainly composed of ReLU layers, to connect the input vector to the output layer. PADME adopts Molecular Graph Convolution (MGC), which is more flexible than ECFP because it learns the mapping function from molecular graph representations to feature vectors [9, 14, 2]. Similar to DeepDTI, we used the Protein Sequence Composition (PSC) descriptor to represent the protein. To the best of our knowledge, this work is the first to integrate MGC with protein descriptors for the DTI problem. In addition to the kinase inhibitor datasets used by previous researchers, we also use the ToxCast dataset [41], and we believe this large high-quality dataset, with its much larger variety of proteins, could be another useful benchmarking dataset for future research of the same type.

We conducted computational experiments with multiple cross-validation settings and evaluation metrics. The results demonstrated the superiority of PADME over baseline methods across all experimental settings. In addition, PADME is much more scalable than SimBoost and KronRLS since it does not rely on computationally expensive similarity matrices and can accommodate multiple outputs. As a case study, we also applied PADME to predict the binding affinity between some compounds and the androgen receptor (AR). We examined the top compounds among them and confirmed this prediction through literature research, suggesting that the predictions of PADME have practical implications.

We believe that PADME will be helpful in many tasks in medicinal chemistry, including but not limited to toxicity prediction, computer-aided drug discovery, and precision medicine.

The subsequent sections of the paper are organized as follows. The Method section will introduce the methods for compound featurization, protein featurization, and network structure. The Experiments section will present the experiments conducted, introducing the baseline methods, datasets used, experimental design, and the experimental results. The Discussion section clarifies some implementation and design choices, and outlines possible future directions to further this work. The last section concludes the whole paper.

2 Method

PADME is a deep learning-based DTI prediction model which uses the combined small-molecule compound (candidate drug) and target protein feature vectors. We consider two variants of PADME with either Molecular Graph Convolution (MGC) or ECFP [32] as the compound featurization method. For the protein, we use Protein Sequence Composition (PSC) descriptor [4]. In fact, PADME is compatible with all kinds of protein descriptors and molecular featurization methods. The compound vector is concatenated with a target protein vector to form the Combined Input Vector (CIV) for the neural network. PADME predicts a real-valued interaction strength, i.e., it solves a DTI regression problem. The structure of the network is shown in Figure 1: if we use the MGC network to get the molecular vector, that network will be trained together with the feedforward network connecting the CIV to the prediction endpoint in an end-to-end fashion.

Figure 1: a) PADME-ECFP architecture. The Extended-Connectivity Fingerprint was used as the molecular input to the model. b) PADME-GraphConv architecture. Note that the graph convolutional network generating the latent molecular vector is trained together with the rest of the network, while the protein descriptor generation process is independent from the training of the network. The black dots represent omitted neurons and layers.

2.1 Compound Featurization

There has been a lot of research on representing small molecules (compounds) as a descriptor or fingerprint. Among the traditional molecular descriptors and fingerprints, ECFP [32] is widely adopted as the state-of-the-art method for compound featurization [9], and was also used in DeepDTI [45]. However, it has a fixed set of mapping and hashing functions, which cannot be tailored to the specific task at hand.

DNN, especially MGC, can be used to generate more flexible feature vectors. Instead of depending only on the molecule, compound feature vectors generated using DNN depend on both the molecule and the prediction task (Boolean or continuous). DNN-derived feature vectors can outperform the ECFP baseline and at times offer some good interpretability [9, 2, 49].

MGC [9, 14, 2] is an extension of Convolutional Neural Network which learns a vector representing the compound from the graph-based representation of the molecule. In the graph representation of molecules, the atoms are denoted by nodes, while the bonds are denoted by edges. MGC takes into account the neighbors of a node when computing the intermediate feature vector for a specific node, and the same operation is applied to the neighborhood of each node (atom), hence it is analogous to ordinary convolutional networks typically used in Computer Vision [9, 11]. Because the GraphConv model [46] is more recent and popular than other MGC models and is easier to implement, we use it as a representative of MGC given our time and resource constraints.

We applied both types of compound featurization methods: ECFP and GraphConv, and compared their performances.

2.2 Target Protein Featurization

There exist many schemes to represent the target protein as a feature vector based on its amino acid sequence information. DeepDTI [45] used Protein Sequence Composition (PSC) descriptor, which has 8420 entries for each protein, consisting of amino acid composition (AAC), dipeptide composition (DC), and tripeptide composition (TC) [4]. It captures rich information and does not transform the protein as much as some other protein descriptors (which implies less human knowledge required and less information loss), which we think could be a desirable attribute as the input to a neural network. In addition to the 8420 entries for each protein sequence, we added an additional binary entry signaling the phosphorylation status so that the Davis dataset (see below) can be represented more accurately, with ’1’ denoting phosphorylated, resulting in 8421 entries in total.
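The 8420 entries follow from the number of possible 1-, 2- and 3-residue patterns over the 20 standard amino acids: 20 + 400 + 8000. The sketch below illustrates that bookkeeping and the extra phosphorylation flag. It is a minimal pure-Python illustration, not the propy implementation used in this work; the function names, the example sequence, and the choice to normalize counts to relative frequencies are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the relative frequency of every possible k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(max(len(sequence) - k + 1, 0)):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-standard residues
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def psc_vector(sequence, phosphorylated):
    """Concatenate AAC (20), DC (400), TC (8000) and one phosphorylation flag."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_composition(sequence, k))
    features.append(1.0 if phosphorylated else 0.0)
    return features


# Hypothetical short sequence, used only to check the vector length.
vec = psc_vector("MKTAYIAKQRQISFVKSHFSRQ", phosphorylated=True)
print(len(vec))  # 20 + 400 + 8000 + 1 = 8421
```

Because the descriptor depends only on the sequence, it can be computed once per protein and cached, which matches the description of PSC generation as independent of network training.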

Mousavian et al. [26] used the PSSM (Position Specific Scoring Matrix) descriptor to represent the protein, which focuses on dipeptide sequences and is related to the evolutionary history of proteins [35]. PSSM was observed to perform well. Other popular protein sequence descriptors include Autocorrelation, the CTD (Composition, Transition and Distribution) descriptor, Quasi-sequence order, etc. [4].

As PSC contains rich information (like tri-peptide sequence occurrence) with high dimensionality, and has already shown promising performance in deep learning-based models for DTI prediction [45], we use PSC in this research. There could be future comparisons of the performance of PSC and other protein featurization methods as an extension to this work.

2.3 Architecture of the Deep Neural Network

PADME uses a feedforward neural network taking the CIV as the input. The PADME architecture has one output neuron per prediction endpoint, i.e., one output neuron for most datasets, and 61 output neurons for the ToxCast dataset (see below). DNNs with a single output neuron are called single-task networks, and those with multiple output neurons are called multi-task networks. Although we only consider the DTI regression problem in this paper, PADME can also be used for constructing classification models with minimal changes, either by binarizing the continuous prediction results or by directly using a softmax/sigmoid layer as the output layer; the latter is likely preferable.

For regularization, we use Early Stopping, Dropout and Batch Normalization techniques. Hyperparameters like dropout rates are automatically searched to find the best set of them before running cross-validation, as elaborated in the “Experimental Design” section. The Adam optimizer was used to train the network. The activation functions used for fully connected layers are all Rectified Linear Units (ReLU).
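To make the data flow concrete, the following is a minimal sketch of a forward pass through such a network: the compound and protein vectors are concatenated into the CIV, passed through ReLU layers, and mapped to one linear output per endpoint. It is not the DeepChem/TensorFlow implementation; the layer sizes, the random stand-in inputs, the initialization scheme, and all function names are assumptions, and Dropout, Batch Normalization and training are omitted.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one row of input weights per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5  # He-style initialization, a common choice for ReLU
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(compound_vec, protein_vec, layers):
    """Concatenate the two vectors into the CIV and run it through the network."""
    activations = compound_vec + protein_vec  # Combined Input Vector
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_w, out_b = layers[-1]
    return dense(activations, out_w, out_b)  # linear outputs for regression


rng = random.Random(0)
compound_vec = [rng.random() for _ in range(16)]  # stand-in for ECFP or MGC output
protein_vec = [rng.random() for _ in range(32)]   # stand-in for the PSC descriptor
sizes = [16 + 32, 24, 12, 1]                      # last entry: number of endpoints
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prediction = forward(compound_vec, protein_vec, layers)
print(len(prediction))  # 1 output neuron: a single-task network
```

Changing the last entry of `sizes` to 61 would give the multi-task shape described for ToxCast without touching the rest of the sketch.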

3 Experiments

3.1 Baseline methods

There are two baseline methods used as comparisons: SimBoost [12] and KronRLS [29], which are state-of-the-art methods for the DTI regression task.


SimBoost predicts continuous DTI values using gradient boosting regression trees. Each drug-target pair corresponds to a continuous DTI value, and the authors defined 3 types of features to characterize the drug-target pairs: type 1 features for individual entities (drugs or targets); type 2 features, derived from the drug similarity networks and target similarity networks; type 3 features, which are derived from the drug-target interaction network. The 3 types of features are concatenated to form a feature vector.

Let $x_i \in \mathbb{R}^d$ denote the vector of features for the i-th drug-target pair, while $y_i \in \mathbb{R}$ is its binding affinity. The score $\hat{y}_i$ predicted for input $x_i$ is computed as follows:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $K$ is the number of regression trees and $\mathcal{F}$ is the space of possible trees. To learn the set of trees $\{f_k\}$, they defined a regularized objective function:

$$R(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),$$

where $l$ is a loss function that evaluates the prediction error and $\Omega$ is a function that penalizes overfitting. The model is trained additively: for each iteration $t$, the tree space $\mathcal{F}$ is searched to find a new tree $f_t$ that optimizes the objective function, which is added to the ensemble afterwards. Trees that optimize $R(\phi)$ are iteratively added to the model for a pre-specified number of iterations.
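The additive training idea can be illustrated with a toy example. The sketch below is not SimBoost: it uses depth-1 trees (stumps), a squared-error loss, a fixed learning rate, and no regularization term, all of which are simplifying assumptions, and the data and function names are invented for illustration. It only shows how each new tree is fit to what the current ensemble still gets wrong.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] > threshold]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            error = (sum((r - lmean) ** 2 for r in left)
                     + sum((r - rmean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, lmean, rmean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    j, threshold, left_value, right_value = stump
    return left_value if x[j] <= threshold else right_value


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Additive training: each new tree is fit to the current residuals."""
    trees, preds = [], [0.0] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        j, thr, left, right = fit_stump(xs, residuals)
        stump = (j, thr, learning_rate * left, learning_rate * right)
        trees.append(stump)
        preds = [p + stump_predict(stump, x) for p, x in zip(preds, xs)]
    return trees


def predict(trees, x):
    """The ensemble score is the sum of the individual tree outputs."""
    return sum(stump_predict(t, x) for t in trees)


# Invented toy data: two features per drug-target pair, one affinity each.
xs = [[0.1, 1.0], [0.4, 0.0], [0.5, 1.0], [0.9, 0.0], [0.7, 1.0]]
ys = [5.0, 5.2, 6.1, 7.4, 6.9]
trees = boost(xs, ys)
print([round(predict(trees, x), 2) for x in xs])
```

For squared error, the residuals are the negative gradient of the loss, which is why fitting trees to residuals is an instance of gradient boosting.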

SimBoost cannot handle cold-start problems, which means it does not work for pairs in the test set with a drug or target that is absent from the training set.


KronRLS stands for Kronecker Regularized Least Squares. It learns a prediction function $f(x)$ for drug-target pairs $x = (d, t)$. It uses a similarity measure $k(x, x')$ between two drug-target pairs $x$ and $x'$, and $f$ is constructed as a linear combination of the similarity values. The algorithm learns the coefficients of this linear combination from the training data.

Unlike SimBoost, KronRLS is applicable to cold-start problems.
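The linear combination described for KronRLS can be written out as below. This is a sketch of the usual kernel regularized least squares formulation with a product kernel over drugs and targets, given here for orientation rather than taken from this paper; the symbols $a_i$, $k_D$, $k_T$, $\lambda$, $\mathbf{K}$ and $\mathbf{y}$ are notation assumed for illustration.

```latex
f(d, t) = \sum_{i=1}^{n} a_i \, k\bigl((d, t), (d_i, t_i)\bigr),
\qquad
k\bigl((d, t), (d', t')\bigr) = k_D(d, d') \, k_T(t, t'),
\qquad
\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}
```

Here $(d_i, t_i)$ are the $n$ training pairs, $\mathbf{y}$ holds their interaction strengths, $\lambda$ is the regularization strength, and $\mathbf{K}$ is the matrix of pairwise kernel values among training pairs. When every drug-target pair is observed, $\mathbf{K}$ is a Kronecker product of the drug and target similarity matrices, which is where the method's name comes from.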

3.2 Non-proteochemometric (Compound-Only) DNN methods

To investigate the usefulness of including protein feature vector (PSC in this paper) in PADME, we implemented a version of PADME with only compound information as input. Different from the full PADME, this version has one output unit for each specific target protein, resulting in a network structure similar to that of [20]. Similar to PADME, we considered ECFP and GraphConv variants of this DNN model. We call these PADME versions Compound-Only DNNs later in this paper.

3.3 Datasets and tools

Similar to He et al. [12], we used kinase inhibitor datasets. Following their naming convention, we call them the Davis dataset [7], Metz dataset [24] and KIBA dataset [39], respectively. However, the versions of these datasets curated by Pahikkala et al. [29] that He et al. [12] used were slightly different from the original datasets, and Pahikkala et al. did not give the corresponding justifications. We thus used the data provided by the respective original authors, then preprocessed them ourselves as described in the Supporting Information. We assume the observations within each dataset were obtained under the same experimental settings. The Metz dataset contained many imprecise values, which we discarded in the preprocessing step.

Because of the limitations of SimBoost and KronRLS, we filtered the datasets. The original KIBA dataset contains 52498 compounds, a large proportion of which only have interaction values with very few proteins. Considering the huge compound similarity matrix required and the time-consuming matrix factorization used in SimBoost, it would be infeasible to work directly on the original KIBA dataset. Thus, we had to filter it rather aggressively so that the size becomes more manageable. This suggests that PADME, which does not require drug-drug or target-target similarity matrices or matrix factorization, is much more scalable than KronRLS and SimBoost, both of which rely on pairwise similarity matrices and therefore have at least quadratic time and space complexity in the number of drugs or targets. We chose to use a threshold of 6 (drugs and targets with no more than 6 observations are removed), more lenient than the threshold of 10 used in He et al. [12], aiming to reduce the unfair advantage that SimBoost can gain from keeping only the denser submatrix of the interaction matrix.
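The thresholding step can be sketched as follows. This is an illustrative reading of the rule "drugs and targets with no more than 6 observations are removed", not the preprocessing script used in this work (see the Supporting Information for that). In particular, repeating the filter until no further entries are dropped is an assumption, as are the function name and the toy records.

```python
from collections import Counter


def filter_sparse(pairs, threshold=6):
    """Drop drugs and targets that have no more than `threshold` observations.

    `pairs` is a list of (drug_id, target_id, value) tuples. Removing a drug
    can push a target to or below the threshold (and vice versa), so the
    filter is repeated until nothing changes; that iteration is an assumption.
    """
    while True:
        drug_counts = Counter(d for d, _, _ in pairs)
        target_counts = Counter(t for _, t, _ in pairs)
        kept = [(d, t, v) for d, t, v in pairs
                if drug_counts[d] > threshold and target_counts[t] > threshold]
        if len(kept) == len(pairs):
            return kept
        pairs = kept


# Toy records with a threshold of 1 so the cascade is easy to follow:
# t3 has one observation, and removing it leaves d3 with only one.
records = [
    ("d1", "t1", 5.0), ("d1", "t2", 6.0),
    ("d2", "t1", 7.0), ("d2", "t2", 5.5),
    ("d3", "t3", 4.0), ("d3", "t1", 4.2),
]
filtered = filter_sparse(records, threshold=1)
print(sorted({d for d, _, _ in filtered}))  # ['d1', 'd2']
```

A lower threshold keeps more of the sparse part of the interaction matrix, which is the trade-off discussed above when comparing a threshold of 6 with the threshold of 10 used by He et al.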

For the Metz and Davis datasets, as SimBoost cannot handle cold drug/target problems, we had to ensure that, in creating cross-validation folds, each drug or target appears in at least 2 folds; thus, drugs/targets with no more than 1 observation were discarded.

We also used the ToxCast dataset [41], which contains a much larger variety of proteins [42]. It contains toxicology data obtained from high-throughput in vitro screening of chemicals, mainly measured in $AC_{50}$, which means the concentration at half of the maximum activity. The prepared dataset (see Supporting Information) contains observations for 530605 drug-target pairs. Its large size and coverage of diverse protein types allow us to test the robustness and scalability of computational models for DTI prediction. After the preprocessing, it still contains a total of 672 assays, compared with the single assay/interaction strength measurement in each of the other 3 datasets. Some of those assays are closely related, but most of them are different from each other. Because it contains so many heterogeneous endpoints, we manually grouped those assays into 61 different measurements for interaction strength based on assay type, such that observations in each measurement are reasonably homogeneous, also increasing the number of observations for each measurement endpoint. The number of observations in each measurement ranges from 290 to 160,000. For the ToxCast dataset, we constructed multi-task networks, in which each measurement corresponds to a neuron in the output layer. As KronRLS and SimBoost are both single-task models, evaluating them on the ToxCast dataset would require training 61 models for each of them, which would be an extraordinarily tedious job, so we did not run the SimBoost and KronRLS models on ToxCast. This indicates PADME is not only more scalable in the number of drugs/targets, but also much more scalable in the number of endpoints, since it can have multiple outputs in one model. As ToxCast does not have the bottlenecks imposed by KronRLS and SimBoost, we did not filter it.

Please refer to Table 1 for the sizes of the datasets after filtering.

| Dataset | Number of drugs (compounds) | Number of target proteins | Total number of drug-target pairs used |
|---|---|---|---|
| Davis | 72 | 442 | 31824 |
| Metz | 1423 | 170 | 35259 |
| KIBA | 3807 | 408 | 160296 |
| ToxCast (no filtering) | 7657 | 335 | 530605 |

Table 1: Dataset sizes after filtering.

We applied the same numerical transformation as He et al. [12] to the datasets, namely a negative logarithmic transformation of the measured values. For the ToxCast dataset, we changed the inactive value from 1,000,000 to 1,000, so that there would be no large gaps in the distribution after transforming the data.

The model was constructed based on the implementation of the DeepChem Python package [31], in which RDKit [17] was used; the networks were constructed using TensorFlow 1.3 [1]. In practical application, PADME takes SMILES representations of the candidate drugs as part of the input, which are transformed into graph representations or ECFP by the program. PSC was obtained independently from this process: we used the propy Python package [4] to generate PSC descriptors, and manually added a binary entry indicating phosphorylation. Afterwards, PSC was saved in a standalone file, which the program reads into memory at runtime.

The experiments were conducted on a Linux server with 8 NVIDIA GeForce GTX 1080 Ti graphics cards, among which 4 were used. The server has 40 logical CPU cores and 256 GB of RAM. A computer with less than 110 GB of RAM might not be able to perform cross-validation for the ToxCast dataset using GraphConv-based PADME.

3.4 Experimental Design

To examine PADME’s prediction power, we used cross-validation (CV), both because it is the convention in prior research and because we believe its comprehensive coverage of the whole dataset offers a more thorough evaluation of the model than a single hold-out test set. To measure the performance of the model under different settings, multiple CV splitting schemes were employed to evaluate the predictions of the models trained from the training sets against the known interaction strengths in the test sets. The performances of PADME-ECFP and PADME-GraphConv were compared against each other under identical settings.

We performed 5-fold CV. For SimBoost to work, every compound (candidate drug) or target must be present in at least 2 folds; this splitting scheme is called “warm split” in this paper. There are no such restrictions for KronRLS, since it can handle cold-start data. Since we did not run SimBoost on the ToxCast data, there was no need to perform a warm split on it, so we used a random split in that case. If we forced a warm split on the ToxCast dataset, a filter threshold of 1 would have to be used to reduce the size of the dataset, which is undesirable. As cold-start prediction is an important objective in DTI prediction (and an advantage of PADME), we also included cold splitting in constructing the cross-validation folds, such that all compounds (candidate drugs) in the test fold are absent from the training folds (cold-drug split), or all targets in the test fold are absent from the training folds (cold-target split). In addition, similar to [22], we also implemented a cold-drug cluster split, using single-linkage clustering with Tanimoto similarity to create compound clusters. Compounds whose ECFP4 fingerprints had a similarity higher than 0.7 were assigned to the same cluster. Compounds belonging to the same cluster were assigned to the same fold, so that compounds in the validation fold would not be similar to those in the training folds. The cold-drug cluster split can prevent the performance estimation from being overly optimistic. Though [29] suggested another splitting scheme which results in simultaneous cold-drug and cold-target in each validation fold, it greatly decreases the size of the training set in each fold (4/9 of the original data instead of 4/5 in other splitting schemes), so we decided that it would cause an unfair comparison and did not use it.
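The cold-drug cluster split can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the splitting code used in this work: fingerprints are represented as plain sets of on-bits instead of RDKit ECFP4 objects, the toy fingerprints are invented, and the greedy rule for distributing clusters over folds is a hypothetical choice. Single-linkage clustering with a similarity cutoff is implemented as connected components of the graph that links any two compounds whose Tanimoto similarity exceeds the threshold.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def single_linkage_clusters(fingerprints, threshold=0.7):
    """Merge compounds whenever any pair exceeds the threshold (union-find)."""
    names = list(fingerprints)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tanimoto(fingerprints[a], fingerprints[b]) > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def assign_folds(clusters, n_folds=5):
    """Place each cluster, largest first, into the currently smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Invented fingerprints: c1-c2 and c2-c3 have similarity 0.8, c1-c3 only 0.6,
# yet all three end up together because single linkage chains through c2.
fps = {
    "c1": {1, 2, 3, 4}, "c2": {1, 2, 3, 4, 5}, "c3": {2, 3, 4, 5},
    "c4": {10, 11, 12}, "c5": {10, 11, 12, 13}, "c6": {20, 21},
}
clusters = single_linkage_clusters(fps)
folds = assign_folds(clusters, n_folds=3)
print(sorted(sorted(c) for c in clusters))
```

Drug-target pairs would then be routed to the fold of their compound, so that no validation compound shares a cluster with any training compound.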

For every dataset, we performed four types of CV splitting (warm, cold-target, cold-drug, cold-drug cluster), and for every CV splitting scheme, we calculated the prediction errors of the applicable models (KronRLS and PADME for all splitting schemes, SimBoost for warm splits only). To reduce the random effects, we repeated the splitting several times for each splitting scheme on Davis, Metz and KIBA datasets and calculated the average values of the evaluation metrics of the prediction results across the splits. For the Davis and Metz datasets, we repeated 3 splits for each splitting scheme; for the KIBA dataset, we did 2 for each, as it is a much bigger dataset; for the ToxCast dataset, the largest one, we only did 1 split for each scheme. The Compound-Only DNNs (as mentioned in subsection 3.2) take only compound information as input and predict the response for multiple proteins simultaneously. Therefore, they cannot handle cold-target scenarios, and it is unnatural to test them in a warm-split scenario. We only use them to compare against PADME in cold-drug splits.

Not only do we have multiple splitting methods, we also used multiple model settings and evaluation metrics. For each of PADME-ECFP and PADME-GraphConv, a single-task network was trained for every splitting scheme of every dataset, except ToxCast, for which we constructed a multi-task network with 61 output neurons to avoid the complexity resulting from 61 separate single-task networks.

We also wanted to investigate whether PADME can predict the ordering of the interaction strengths correctly, so in addition to metrics focusing on value correctness (RMSE (Root Mean Squared Error) and $R^2$), we also used metrics focusing on order correctness, such as the concordance index (CI). Using CI as a metric in a cheminformatics setting was proposed by Pahikkala et al. [29]. It measures the probability of correctly ordering the non-equal pairs in the dataset, ranging across [0, 1], with bigger values indicating better results. If one uses the same value (e.g. the mean value of the training set) as the prediction across the whole test set, the CI would be 0.5. We note that the CI neglects the magnitude of values while focusing on the pairwise comparison, and it does not consider the prediction correctness for datapoints that truly have values equal to each other. Thus, CI should be used alongside other metrics like RMSE. However, in virtual screening, we are typically only interested in the top predictions, so the drawback of neglecting the magnitude is not a big concern.
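As an illustration of the properties just described, the CI can be computed by direct pairwise comparison. This is a quadratic-time sketch for small inputs; counting tied predictions as half-correct is the convention assumed here, and it is what makes a constant prediction score 0.5.

```python
def concordance_index(y_true, y_pred):
    """Fraction of pairs with unequal true values that are ordered correctly."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue                      # pairs tied in the true values are ignored
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5             # tied predictions count as half-correct
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0             # same ordering in truth and prediction
    if comparable == 0:
        raise ValueError("no pairs with unequal true values")
    return concordant / comparable


y_true = [1.0, 2.0, 3.0, 4.0]
print(concordance_index(y_true, [0.1, 0.2, 0.3, 0.4]))  # 1.0 (perfect ordering)
print(concordance_index(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.5 (constant prediction)
print(concordance_index(y_true, [4.0, 3.0, 2.0, 1.0]))  # 0.0 (reversed ordering)
```

Only the signs of the pairwise differences enter the score, which is why the magnitude of the predictions has no effect on it.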

To improve the readability of the reported results for the ToxCast dataset, the performance metrics are averaged across the 61 different measurements, weighted by the number of records for each of the measurements, so the results reported for the ToxCast dataset look the same as other datasets with single endpoints. For Compound-Only DNNs, it is slightly more complicated, since we need to pool similar endpoints together before calculating the metric, instead of calculating a weighted average of evaluation metrics across different endpoints, but the basic idea is the same. In Compound-Only DNNs, the special case is the ToxCast dataset, where we pool the endpoints across the 61 subgroups of endpoints, calculate the metrics for each subgroup, and compute the weighted average across the 61 subgroups.
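The record-weighted averaging described above can be sketched as follows. The endpoint names, metric values and record counts are hypothetical placeholders, not values from the ToxCast experiments.

```python
def weighted_metric(metric_by_endpoint, records_by_endpoint):
    """Average per-endpoint metrics, weighting each endpoint by its record count."""
    total_records = sum(records_by_endpoint[name] for name in metric_by_endpoint)
    weighted_sum = sum(
        metric_by_endpoint[name] * records_by_endpoint[name]
        for name in metric_by_endpoint
    )
    return weighted_sum / total_records


# Hypothetical per-endpoint CI values and record counts (not from the paper).
ci_by_endpoint = {"assay_A": 0.70, "assay_B": 0.90}
records_by_endpoint = {"assay_A": 300, "assay_B": 100}
overall_ci = weighted_metric(ci_by_endpoint, records_by_endpoint)
print(round(overall_ci, 4))  # 0.75
```

Endpoints with more records pull the reported value toward their own metric, so the summary behaves like a single-endpoint result computed over all records.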

Figure 2: The histogram of the distribution of the negative log transformed ToxCast measurement results. The majority (over 94%) are concentrated at one inactive value.

As an exploratory analysis of the datasets, we found the ToxCast dataset to be special. As shown in Figure 2, the transformed ToxCast dataset is extremely concentrated at a value of 1, which corresponds to no interaction. This led us to ignore the $R^2$ values for this dataset: because $R^2$ is sensitive to the overall departure of the predicted values from the true values, we argue that the huge concentration of values has rendered $R^2$ uninformative in measuring the performance of the model on the ToxCast dataset. This concentration of values also makes RMSE less informative than it otherwise would be (since one can blindly guess the inactive value for all records and still get a fairly good RMSE), so we argue that CI is the most useful metric for evaluating predictions on the ToxCast dataset. This pronounced imbalance in the dataset caused us to consider balancing it through oversampling (see Supporting Information).
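A toy calculation makes the argument about concentrated values concrete. The data below are hypothetical (94 inactive records at 1.0 and 6 active records at 3.0, chosen only to mimic the imbalance), and $R^2$ is taken here to be the coefficient of determination, which is an assumption, since this section does not spell out its definition.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical imbalanced data: 94% inactive (1.0), 6% active (3.0).
y_true = [1.0] * 94 + [3.0] * 6
blind_guess = [1.0] * 100            # always predict the inactive value

blind_rmse = rmse(y_true, blind_guess)
blind_r2 = r_squared(y_true, blind_guess)
print(round(blind_rmse, 3), round(blind_r2, 3))
```

The blind guess misses every active record by 2.0, yet its RMSE stays below 0.5 because the actives are so rare, while its $R^2$ falls slightly below zero. A ranking metric such as CI would score the same guess at 0.5, i.e. no better than chance.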

Following the principle of parsimony, we wanted to use a minimal number of hyperparameter sets wherever possible, to keep the time and computational expenses manageable. If we had instead run a separate hyperparameter search for each CV iteration while ensuring reasonable coverage of the parameter space, a single CV run on one dataset would have taken well over a month, owing to the intrinsic complexity of DNN models involving protein information; this would have been unrealistic both for us and for future users. We therefore could not use nested/double cross-validation as in [3] and [23]. Also, since the datasets are not very large for deep learning, we wanted to use the full datasets for cross-validation, to maximize the training set in each iteration. Thus, in our hyperparameter tuning process, we randomly selected 90% of the dataset as the training set and the remaining 10% as the validation set, and the validation set was used to determine the best set of hyperparameters. This split is independent of the 5-fold CV splits. As a consequence, a small proportion of records (around 10%) in the validation fold of each CV iteration also belong to the hyperparameter validation set, which means that a minor proportion of the information used for selecting the hyperparameters is reused in evaluating the performance of the model. This is inevitable, given that we want a minimal number of hyperparameter sets and a maximal training fold size. Furthermore, since we use the same set of hyperparameters for all CV split settings (more on this in the next paragraph), this deliberate “unfairness” will compensate for the advantage that PADME gets as mentioned previously.
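The size of the overlap between the tuning split and the CV folds can be checked with a short simulation. This sketch assumes both splits are uniformly random (as in the warm setting) and uses an arbitrary dataset size; under that assumption, about 10% of each CV validation fold coincides with the hyperparameter validation set.

```python
import random


def overlap_fractions(n_records=10000, n_folds=5, val_fraction=0.1, seed=0):
    """Fraction of each CV validation fold that also lies in the tuning validation set."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    # Random 90/10 split used only for hyperparameter tuning.
    tuning_val = set(rng.sample(indices, int(val_fraction * n_records)))
    # Independent random k-fold split used for evaluation.
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    return [len(tuning_val.intersection(fold)) / len(fold) for fold in folds]


fractions = overlap_fractions()
print([round(f, 3) for f in fractions])
```

Seen from the other side, roughly one fifth of the tuning validation set lands in any single CV validation fold, since each fold holds one fifth of all records.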

To efficiently tune the hyperparameters (such as dropout rates, batch size, learning rate, number of layers and nodes per layer) for both PADME models and Compound-Only DNN models, we used Bayesian Optimization [34] as implemented in the Python package pygpgo [13]. We also used early stopping to determine the optimal number of training epochs. Early stopping was guided by a composite score, calculated on the validation set, that was to be minimized. The training and validation sets used for hyperparameter searching are split randomly from the original dataset without any concern for warm or cold drug/target splits. We only store one optimal set of hyperparameters per (dataset, PADME variant) pair, which is then used for all CV settings for that pair. Note that, for simplicity and to examine the robustness of PADME, the set of hyperparameters found with the random split was used in all CV settings, including those with cold-drug and cold-target splits, though we believe better CV results could be achieved if the hyperparameter search were designed specifically for each CV fold split scheme; e.g., for cold-target CV folds, we could use the hyperparameters found by running the Bayesian Optimization on cold-target split datasets.
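Early stopping on a validation score that is to be minimized can be sketched as below. The `patience` rule, the callback names and the synthetic score curve are assumptions for illustration; the section does not state the stopping criterion that was actually used.

```python
def train_with_early_stopping(train_one_epoch, validation_score,
                              max_epochs=100, patience=3):
    """Train until the validation score has not improved for `patience` epochs."""
    best_score = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = validation_score()            # lower is better
        if score < best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                             # no improvement for `patience` epochs
    return best_epoch, best_score


# Synthetic validation curve: improves until epoch 4, then degrades.
curve = iter([5.0, 4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1])
best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=lambda: None,             # placeholder for one training pass
    validation_score=lambda: next(curve),
)
print(best_epoch, best_score)  # 4 2.5
```

The returned epoch count is the quantity that early stopping selects, so it does not need to be searched over alongside the other hyperparameters.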

The resulting networks typically have 2 or 3 fully-connected layers connecting the CIV to the output unit, with thousands of neurons in each of the layers. Each fully-connected layer is batch-normalized.

3.5 Experimental Results

3.5.1 Quantitative Results

Based on the experimental design in subsection 3.4, we obtained the quantitative results for PADME. In Tables 2, 3 and 4, the bold numbers indicate the best values attained for each setting. We observe that the two versions of PADME dominate the other methods, including the Compound-Only DNN models, across all datasets and splits for all evaluation metrics. (Footnote 1: The SimBoost results reported here are considerably worse than the results reported in the original SimBoost paper. On examining their source code, we found that they calculated MSE but reported it as RMSE.)

We note the following exceptions. SimBoost outperforms PADME-GraphConv on the Metz dataset, which could be due to the small dataset size: PADME-GraphConv could be overfitting on the Metz data, while SimBoost uses gradient boosting trees, a machine learning model better suited to small datasets than Deep Neural Networks. Because it does not use MGC, PADME-ECFP has a much smaller network than PADME-GraphConv, which may explain why the former performs slightly better on the Metz dataset. However, we do not observe the same phenomenon on the Davis dataset, which has a similar size and even fewer entities. Comparing PADME-ECFP against Compound-Only ECFP and PADME-GraphConv against Compound-Only GraphConv, we observe that PADME consistently outperforms the Compound-Only DNNs by a substantial margin. For example, the PADME models outperform Compound-Only DNNs by around 10% or even more in terms of the Concordance Index on the ToxCast dataset. The only exception is that the Compound-Only ECFP DNN slightly outperforms both PADME models in terms of RMSE and $R^2$, but not in CI (the KIBA cold-drug cluster split in Tables 2 and 4). Besides, Compound-Only DNNs outperform KronRLS in general, except for the Compound-Only GraphConv on the Davis dataset.

It is somewhat surprising that PADME-ECFP is not outperformed by PADME-GraphConv; instead, it slightly outperforms PADME-GraphConv in many cases, though in general their performances are almost indistinguishable from each other. PADME-ECFP takes only about 23% of the time and 45% of the space (RAM) of PADME-GraphConv in the training process and yields similar (and sometimes better) results, so PADME-ECFP is the more reasonable choice. Nonetheless, we cannot be certain that PADME-GraphConv and PADME-ECFP truly have similar performance, as there might be a better set of hyperparameters for each model that would differentiate their performances significantly. Though our results demonstrate the power of ECFP, future researchers should continue investigating MGC.

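| Dataset | Cross Validation Splitting type | PADME-ECFP | PADME-GraphConv | SimBoost | KronRLS | Compound-Only ECFP | Compound-Only GraphConv |
|---|---|---|---|---|---|---|---|
| Davis | Warm | **0.432191** | 0.432245 | 0.481973 | 0.572941 | - | - |
| Davis | Cold Drug | **0.785358** | 0.806446 | - | 0.840484 | 0.806355 | 0.851808 |
| Davis | Cold Drug Cluster | **0.785435** | 0.806742 | - | 0.836805 | 0.791457 | 0.831428 |
| Davis | Cold Target | **0.560054** | 0.578407 | - | 0.659645 | - | - |
| Metz | Warm | **0.552927** | 0.59926 | 0.581254 | 0.781284 | - | - |
| Metz | Cold Drug | **0.711698** | 0.742916 | - | 0.784291 | 0.778718 | 0.768847 |
| Metz | Cold Drug Cluster | **0.782484** | 0.803057 | - | 0.83149 | 0.813565 | 0.828765 |
| Metz | Cold Target | **0.791535** | 0.818935 | - | 0.898888 | - | - |
| KIBA | Warm | 0.432138 | **0.418691** | 0.468883 | 0.656647 | - | - |
| KIBA | Cold Drug | **0.602012** | 0.620291 | - | 0.702427 | 0.645196 | 0.656793 |
| KIBA | Cold Drug Cluster | 0.70578 | 0.702965 | - | 0.753582 | **0.691925** | 0.707033 |
| KIBA | Cold Target | **0.616767** | 0.623449 | - | 0.681109 | - | - |
| ToxCast | Warm (random) | **0.405633** | 0.407789 | - | - | - | - |
| ToxCast | Cold Drug | **0.444847** | 0.445019 | - | - | 0.462564 | 0.46759 |
| ToxCast | Cold Drug Cluster | 0.450305 | **0.447883** | - | - | 0.448272 | 0.46195 |
| ToxCast | Cold Target | **0.486978** | 0.494392 | - | - | - | - |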

Table 2: The performance of regression across different datasets measured in RMSE (smaller is better)

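| Dataset | Cross Validation Splitting type | PADME-ECFP | PADME-GraphConv | SimBoost | KronRLS | Compound-Only ECFP | Compound-Only GraphConv |
|---|---|---|---|---|---|---|---|
| Davis | Warm | 0.903882 | **0.903892** | 0.887096 | 0.87578 | - | - |
| Davis | Cold Drug | 0.716298 | **0.720008** | - | 0.692436 | 0.706463 | 0.679001 |
| Davis | Cold Drug Cluster | **0.712209** | 0.70221 | - | 0.680037 | 0.704575 | 0.674147 |
| Davis | Cold Target | **0.855025** | 0.844831 | - | 0.80751 | - | - |
| Metz | Warm | **0.807563** | 0.794003 | 0.794381 | 0.748522 | - | - |
| Metz | Cold Drug | **0.742403** | 0.741041 | - | 0.709156 | 0.713287 | 0.72263 |
| Metz | Cold Drug Cluster | 0.710909 | **0.713545** | - | 0.681824 | 0.689896 | 0.694455 |
| Metz | Cold Target | 0.698305 | **0.707959** | - | 0.647 | - | - |
| KIBA | Warm | 0.857452 | **0.863699** | 0.840456 | 0.783103 | - | - |
| KIBA | Cold Drug | **0.773099** | 0.754999 | - | 0.688972 | 0.73697 | 0.73709 |
| KIBA | Cold Drug Cluster | **0.747929** | 0.722392 | - | 0.665382 | 0.710081 | 0.706125 |
| KIBA | Cold Target | **0.771671** | 0.767902 | - | 0.7122 | - | - |
| ToxCast | Warm (random) | 0.796547 | **0.798712** | - | - | - | - |
| ToxCast | Cold Drug | 0.720573 | **0.72858** | - | - | 0.661147 | 0.628878 |
| ToxCast | Cold Drug Cluster | 0.717687 | **0.72542** | - | - | 0.652918 | 0.616539 |
| ToxCast | Cold Target | 0.684814 | **0.690501** | - | - | - | - |

Table 3: The regression accuracy on the different datasets measured in Concordance Index (larger is better)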

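| Dataset | Cross Validation Splitting type | PADME-ECFP | PADME-GraphConv | SimBoost | KronRLS | Compound-Only ECFP | Compound-Only GraphConv |
|---|---|---|---|---|---|---|---|
| Davis | Warm | 0.760728 | **0.76099** | 0.703138 | 0.580148 | - | - |
| Davis | Cold Drug | **0.191118** | 0.115659 | - | 0.047802 | 0.131261 | 0.031104 |
| Davis | Cold Drug Cluster | **0.176708** | 0.138017 | - | 0.053961 | 0.149729 | 0.071702 |
| Davis | Cold Target | **0.594517** | 0.569119 | - | 0.439302 | - | - |
| Metz | Warm | **0.667115** | 0.608763 | 0.632297 | 0.335489 | - | - |
| Metz | Cold Drug | **0.446423** | 0.395589 | - | 0.328515 | 0.345574 | 0.353986 |
| Metz | Cold Drug Cluster | **0.327705** | 0.285132 | - | 0.241559 | 0.269211 | 0.235309 |
| Metz | Cold Target | **0.31438** | 0.267742 | - | 0.11295 | - | - |
| KIBA | Warm | 0.745663 | **0.761212** | 0.700674 | 0.412822 | - | - |
| KIBA | Cold Drug | **0.5058** | 0.472761 | - | 0.326592 | 0.449 | 0.426815 |
| KIBA | Cold Drug Cluster | 0.301779 | 0.318046 | - | 0.221459 | **0.349807** | 0.316956 |
| KIBA | Cold Target | **0.479066** | 0.467213 | - | 0.363054 | - | - |
| ToxCast (a) | Warm (random) | - | - | - | - | - | - |
| ToxCast (a) | Cold Drug | - | - | - | - | - | - |
| ToxCast (a) | Cold Drug Cluster | - | - | - | - | - | - |
| ToxCast (a) | Cold Target | - | - | - | - | - | - |

(a) We did not report $R^2$ for ToxCast because of its imbalanced nature.

Table 4: The performance of regression across different datasets measured in $R^2$ (larger is better)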

From Tables 2, 3 and 4 we can observe an interesting phenomenon: when there are many compounds and few targets in the training set, the cold-drug predictions tend to outperform the cold-target predictions; on the other hand, when there are many targets and few compounds, the cold-target predictions tend to be better than the cold-drug ones. We hypothesize that this is because the models are more robust for the entity type (drugs or targets) with more information in the training set, and thus perform better in the corresponding scenario. This trend is present not only in the PADME models, but in KronRLS as well. It seems that the models also require many more types of compounds than proteins to learn their chemical features, as can be seen from the KIBA dataset, whose cold-drug and cold-target performances are very similar even though it has 3807 compounds and only 408 proteins. And, as expected, the performance on warm splits is always better than that on cold-drug, cold-drug cluster, or cold-target splits.

The use of cold-drug clusters prevents us from overestimating the performance of the models: in the Metz and KIBA datasets, the performances for cold-drug cluster CV are noticeably worse than those for cold-drug CV, while the performances on the Davis and ToxCast datasets stay almost the same. This could be due to the different distributions of compounds in different datasets. Overall, the Compound-Only DNNs tend to be slightly more robust to the performance deterioration caused by cold-drug cluster splits. We suggest that future research also employ cold-drug cluster splits, so that a more stringent evaluation can be performed.

The fact that PADME outperforms both SimBoost and KronRLS demonstrates the power of DNNs to learn complicated nonlinear relationships between drug-target pairs and interaction strength. We were also able to show the superiority of PADME over the Compound-Only DNN models, both in applicability to cold-target scenarios and in overall performance, which suggests that the protein-specific descriptors (PSC in this paper) bring an improvement. Furthermore, the results presented might understate the real performance of PADME in cold-drug and cold-target scenarios, as the training and validation sets for hyperparameter searching are randomly split, resulting in a set of hyperparameters that is well suited to randomly split CV folds, but perhaps not to cold-drug and cold-target folds. This deliberately unfair comparison shows the robustness of the PADME models.

3.5.2 Qualitative Results

Figure 3: Scatter plot and contour plot of predicted vs. true values across all datasets. The panels a, b, c and d correspond to the Davis, Metz, KIBA and ToxCast datasets, respectively. The axes in the two plots of the same panel are the same, and both plots are generated from the same data. The diagonal lines in the scatter plots are the reference lines where predicted = true value.
Figure 4: ToxCast data scatter plot with marginal histograms, generated from the same data as Figure 3(d)

We used plots to visualize the prediction performance, so we can assess the results qualitatively.

Fig. 3 presents the predicted values (by PADME-ECFP) vs. the true values for each dataset. For each panel (row) in the figure, there is a scatter plot and a contour plot (the darkness of the color represents the density of data points) with univariate density curves on the margins. Both plots in each row are plotted from exactly the same data. Figure 3(d) is an exception: it includes a hexagon plot instead of a contour plot, because the contour plot fails to show anything (although the density curves on the margins are plotted), possibly because the data points of ToxCast are too concentrated to be shown correctly on the contour plot, as can be observed from the hexagon plot. To help visualize ToxCast better, we added Figure 4, which is a scatter plot of the same data as Figure 3(d), with histograms on the margins.

Clearly, all datasets except Metz data are very concentrated at some values.

Because the concentration of the Davis and ToxCast datasets poses problems in visualizing the prediction performance on them, we decided to plot the scatter plots, contour plots and hexagon plots of the true active and true inactive data points separately for those datasets (Figs. 5 and 6). From Fig. 5 we can see that the Davis dataset was predicted quite well on both true active and true inactive values. Fig. 5 (a) shows a clear correspondence between predicted and true values on the active datapoints, while Fig. 5 (b) presents the true inactive values, for which the hexagon plot shows a high concentration of predicted values close to the true values. As reflected in Fig. 6 (a), the model fitted on the ToxCast dataset was strongly influenced by inactive values, and the prediction performance for the true active datapoints was not very good, but the true inactive datapoints were predicted to concentrate around the true values (shown in the hexagon plot in Fig. 6 (b)), which might explain why its quantitative results were decent.

To tackle the imbalanced dataset problem in ToxCast, we tried training a model on an oversampled dataset and measured its performance. Please refer to the Supporting Information.

Figure 5: Plots of predicted vs. true values for the Davis dataset. Panel (a) corresponds to the true active values, while panel (b) corresponds to the true inactive values. Similar to Figure 3, all plots in the same panel are plotted from the same data.
Figure 6: Similar to Figure 5, plots for the ToxCast dataset. Panel (b) uses a different hexagon plot from (a), because that form of hexagon plot on panel (b) does not show properly.

So why does PADME perform well on the Davis, Metz and KIBA datasets, but not so satisfactorily on the ToxCast dataset? We think it might be related to the nature of the ToxCast dataset itself. The ToxCast dataset not only contains a much larger variety of proteins (unlike the other 3 datasets, which only contain kinase inhibitors), but it also has a much larger number of assays (measurement endpoints), which are often quite different from each other. Though we only selected the assays with single intended targets, many of those assays are cell-based (for example, OT_AR_ARSRC1_0480), which could introduce additional complexity beyond the drug-target interaction, due to the intricacies of biochemical processes in cells. These challenges might be among the reasons why, to our knowledge, previous non-docking studies on drug-target interaction prediction that take protein information as input did not use this dataset [12, 29, 45, 28], though other kinds of studies did [5, 21, 19].

The challenges with the ToxCast dataset, including its large number of measurement endpoints and the imbalanced dataset problem, should be investigated in future work, as it is an important objective to build a more general-purpose DTI prediction model that handles a larger variety of input proteins, compounds and measurement endpoints. Imbalanced datasets also occur commonly in virtual screening. We believe that, based on our work, future research will either make improvements on the ToxCast dataset or regard it as a great challenge for DTI prediction models involving protein information, and the results presented here, though not ideal, are of reference value to the community.

3.5.3 Case Study

In addition to the quantitative and qualitative evaluations shown above, we performed a case study to further validate the predictions of PADME by investigating the compounds predicted to interact strongly with selected target proteins.

We focused on the androgen receptor (AR), for which alterations of functions are associated with prostate [48] and breast cancers [25].

We used all compounds in the datasets used in this paper, together with all the compounds in the US National Cancer Institute human tumour cell line anticancer drug screen data (NCI60), totalling more than 100000, with AR as the only target protein. The NCI60 dataset records the in vitro drug response of cancer cell lines [36]. For prediction, we used PADME-ECFP and PADME-GraphConv trained on the whole ToxCast dataset and then took the average of their predictions; we call this averaged model PADME-Ensemble. The reason we chose ToxCast is that it has the most diverse set of compounds and proteins.

There are many different assays in ToxCast: some are cell-based, while others are cell-free. Cell-based assays are much more complicated than their cell-free counterparts, since the results of cell-based assays might involve some intricate biochemical reactions in the cells. Thus, we used the assay NVS_NR_hAR, a cell-free assay measuring the binding affinity between Human Androgen Receptor (AR) and ligands (please refer to the Supporting Information for details), to examine the efficacy of PADME’s predictions.

From the predictions of PADME-Ensemble, we selected the prediction results corresponding to NVS_NR_hAR, and then sorted the predicted values in descending order. Due to the transformations we performed in subsection 3.3, the larger the number (the higher in the sorted list), the stronger the binding affinity. Next, we filtered out those compounds in the predicted list that appear in the ToxCast dataset or have a large Tanimoto similarity, calculated using RDKit fingerprints, to some compound in ToxCast. We then searched the PubChem database [15] for the top 30 compounds predicted to bind strongly with AR.
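The averaging, ranking and novelty filter described in this case study can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' pipeline: fingerprints are represented as plain sets of on-bits instead of RDKit objects, the compound names and scores are hypothetical, and the similarity cutoff of 0.7 is a placeholder, since the text only says "large".

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def rank_novel_compounds(predictions, fingerprints, training_ids,
                         max_similarity=0.7, top_k=30):
    """Sort by predicted affinity (descending) and drop compounds seen in,
    or too similar to, the training set."""
    training_fps = [fingerprints[cid] for cid in training_ids]
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    novel = []
    for cid in ranked:
        if cid in training_ids:
            continue                          # already in the training data
        if any(tanimoto(fingerprints[cid], fp) >= max_similarity
               for fp in training_fps):
            continue                          # near-duplicate of a training compound
        novel.append(cid)
        if len(novel) == top_k:
            break
    return novel


# Hypothetical fingerprints and per-model predictions (not real data).
fingerprints = {"train1": {1, 2, 3, 4}, "near": {1, 2, 3, 4, 5},
                "far1": {10, 11, 12}, "far2": {20, 21}}
ecfp_pred = {"train1": 9.0, "near": 8.0, "far1": 5.0, "far2": 7.0}
graphconv_pred = {"train1": 9.0, "near": 8.5, "far1": 6.0, "far2": 6.0}

# Ensemble prediction: the mean of the two models' outputs.
ensemble = {cid: (ecfp_pred[cid] + graphconv_pred[cid]) / 2 for cid in ecfp_pred}
candidates = rank_novel_compounds(ensemble, fingerprints, training_ids={"train1"})
print(candidates)  # ['far2', 'far1']
```

Here `near` is excluded despite its high score because it shares 4 of 5 bits with a training compound, which mirrors the intent of the filter: the remaining candidates are predictions on genuinely unseen chemistry.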

The top 30 (and beyond) compounds all share highly similar structures with androgen, so it is quite possible that most of them are able to bind strongly with AR. After a stringent search on PubChem, we confirmed that 4 of them are active, as reflected in patents, bioassay results, or research papers. The other compounds in the top 30 are also possibly active, but since there is no direct evidence from PubChem, we take the conservative approach and do not consider them here.

The PubChem CIDs of the 4 compounds are: 88050176, 247304, 9921701, 220507. Fig. 7 shows their 2D images.

Figure 7: The 4 compounds from the top 30 predictions that are confirmed to bind strongly with AR. The numbers are their corresponding PubChem CIDs. On the right side is the 2D representation of testosterone, the major androgen. The images were downloaded from the PubChem website.

The compounds are all very similar to androgen. Since NVS_NR_hAR is a very simple assay, the model learns from the dataset that analogs of androgen tend to bind strongly with AR. This suggests that the predictions of PADME can be useful in drug discovery.

Based on this, we tried to go one step further and attempt a more interesting task: calculating the AR antagonist effect of compounds based on the predictions produced by PADME. Because there are 61 outputs in PADME models trained on ToxCast data, we had to propose a set of coefficients to calculate a composite AR antagonist score (details in the Supporting Information) from the averaged prediction results. We expected the compounds with higher scores to show stronger anticancer activity in AR-related cell lines in the NCI60 dataset. We then sorted the AR antagonist scores in descending order.

Compared to the compounds predicted to bind strongly with AR, the predicted list of AR antagonistic compounds has much more diverse structures. However, their results (see Supporting Information) are not well aligned with our expectations. This is not caused by PADME, which captures the patterns in the training data faithfully and reproduces them on the test set. Rather, it is the results on the true dataset that differ from our assumptions.

There are two major challenges in this process. First, the formula we used for calculating the AR antagonist score is based on assumptions, and the presence of cell-based assays is another possible source of problems. Second, our expectation that AR antagonistic compounds should act selectively on some known AR-related cancer cell lines might deviate from the truth, or AR might influence many types of cancers in ways different from what we knew, as suggested in [27]. Tackling these two challenges is a task that could require years or decades of work by the community.

In all, the obtained results indicate that PADME is capable of identifying compounds that have the desired simple interactions with the target protein.

4 Discussion

PADME is a very general and versatile framework, compatible with a large variety of different protein and compound featurization methods. By combining different protein descriptors like PSSM and other molecule featurization schemes like Weave [14], many more variants of PADME can be constructed, whose performances can be compared against each other. In fact, we used the Weave implementation in DeepChem as a molecular featurization method and ran hyperparameter searching on it, but the result was worse than ECFP and GraphConv, and it consumed much more time and memory than the other two, so we did not pursue it any further.

Mapping a protein into a feature vector is a task in proteochemometrics. However, most existing methods in proteochemometrics require expert knowledge and often involve 3D structural information [43, 30], which is often not available. We only considered sequence information for both drugs and targets in this work to make our model more generally applicable.

It is also possible to use a CNN or RNN to learn a latent feature vector representing each protein from its amino acid sequence, instead of using fixed-rule protein descriptors like PSC as the input, so that the whole model can be trained in a completely end-to-end fashion without standalone components such as PSC in our implementation, making the network structure more “symmetrical”. This was already attempted by Öztürk et al. [28], who showed a performance similar to PADME. However, they did not use cross-validation to obtain average performances; they only ran different models on the same test set, which was just 1/6 the size of the whole dataset, so we think there is still much room for improvement in that direction. Nonetheless, it is a very good step towards it.

We only used simple feedforward neural networks in our implementations of PADME from the Combined Input Vector to the output, but other types of Neural Networks might be able to generate better results, such as Highway Networks [38], which allow the units in the network to take shortcuts, circumventing the large number of layers in some networks.

Pretraining also has the potential to improve our model, but we did not include it, because it might make it difficult for the community to compare the real performance of PADME with other models.

Compared to previous models like SimBoost and KronRLS, PADME not only outperforms them in terms of prediction accuracy, but is also more scalable in terms of the number of entities and the number of prediction endpoints, because both SimBoost and KronRLS rely on similarity matrices and are only single-task models. In the age of Big Data, this scalability will be a big advantage in virtual high-throughput screening.

We envision PADME and its future derived models to be useful in many tasks in medicinal chemistry, which might include toxicity prediction, computer-aided drug discovery, precision medicine, etc. For toxicity prediction, using PADME, scientists can better predict the side-effects of known drugs, or predict the toxicity of a compound under development; in computer-aided drug discovery, such models will greatly narrow down the scope of candidates in a virtual screening process, leaving only very few top candidates to be further simulated or tested; in precision medicine, we believe the model can give physicians better insight based on the protein expression profile of the patient.

5 Conclusion

To tackle the DTI prediction problem from a new angle, we devised the PADME framework, which utilizes deep neural networks for this task. PADME incorporates both compound and target protein sequence information, so it can handle the cold-start problem, which most current deep learning-based models for DTI prediction cannot do. Predicting real-valued endpoints also makes it desirable for problems requiring finer granularity than simple binary classification. PADME is the first method to incorporate MGC with protein descriptors into the DTI prediction task, and has been shown to consistently outperform state-of-the-art methods as well as Compound-Only DNN models. Surprisingly enough, PADME based on MGC (GraphConv in our case) does not outperform PADME based on ECFP. PADME is also much more scalable than the state-of-the-art models for the DTI regression task, namely SimBoost and KronRLS, and this advantage might be significant with big datasets. Another contribution is the use of the ToxCast dataset in DTI prediction problems with protein information input, which we believe future research should investigate further in addition to the other benchmarking datasets. Our results on the ToxCast dataset suggest it is a greater challenge than we expected.

As a case study, we predicted the binding affinity between compounds and the androgen receptor (AR), and several of the compounds predicted to bind most strongly with AR (4 of the top 30) were confirmed through database and literature search. This suggests that PADME has the potential to be applied in drug development, and will likely benefit domains like toxicity prediction, computer-aided drug discovery, precision medicine, etc.

Given PADME's compatibility with different drug molecule and target protein featurization methods, we believe that future work could propose more PADME variants that advance the frontier of DTI prediction research.

Supporting Information

The following files are available free of charge.

  • will be available in journal version of this paper. Exceeds size limit of arXiv.

  • supplementary_text.pdf: explains data preprocessing steps and presents some additional computational experiments.

The source code and some processed datasets have been deposited online. Some larger processed datasets can be obtained upon request.


The authors thank Dr. Fuqiang Ban and Dr. Michael Hsing for giving us some useful information that we incorporated into this paper. We also thank the help and suggestions received from other fellow lab members, including but not limited to Zaccary Alperstein, Oliver Snow, Michael Lllamosa, Hossein Sharifi, Beidou Wang, Jiaxi Tang and Sahand Khakabimamaghani. We also express our gratitude towards our family and friends, especially Jun Li, Wen Xie, Qiao He, Yue Long, Aster Li, Lan Lin, Xuyan Qian, Yue Zhang, Qing Rong, Si Chen, Fengjie Lun and Stephen Tseng, the list goes on; most important of all, the late Mr. Xiefu Zang.

We also thank George Lucas (and his fantastic prequel trilogy) and Smith et al. [37] for the inspiration of naming.


  •  1. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from
  •  2. H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande. Low data drug discovery with one-shot learning. ACS central science, 3(4):283–293, 2017.
  •  3. D. Baumann and K. Baumann. Reliable estimation of prediction errors for qsar models under model uncertainty using double cross-validation. Journal of cheminformatics, 6(1):47, 2014.
  •  4. D.-S. Cao, Q.-S. Xu, and Y.-Z. Liang. propy: a tool to generate various modes of chou’s pseaac. Bioinformatics, 29(7):960–962, 2013.
  •  5. Y. Chushak, H. Shows, J. Gearhart, and H. Pangburn. In silico identification of protein targets for chemical neurotoxins using toxcast in vitro data and read-across within the qsar toolbox. Toxicology Research, 7(3):423–431, 2018.
  •  6. G. E. Dahl, N. Jaitly, and R. Salakhutdinov. Multi-task neural networks for qsar predictions. arXiv preprint arXiv:1406.1231, 2014.
  •  7. M. I. Davis, J. P. Hunt, S. Herrgard, P. Ciceri, L. M. Wodicka, G. Pallares, M. Hocker, D. K. Treiber, and P. P. Zarrinkar. Comprehensive analysis of kinase inhibitor selectivity. Nature Biotechnology, 29(11):1046–1051, 2011.
  •  8. H. Ding, I. Takigawa, H. Mamitsuka, and S. Zhu. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings in Bioinformatics, 15(5):734–747, 2014.
  •  9. D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
  •  10. J. Gomes, B. Ramsundar, E. N. Feinberg, and V. S. Pande. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603, 2017.
  •  11. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  •  12. T. He, M. Heidemeyer, F. Ban, A. Cherkasov, and M. Ester. SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. Journal of Cheminformatics, 9(1):24, 2017.
  •  13. J. Jimenez. pyGPGO: Bayesian optimization for Python.
  •  14. S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
  •  15. S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, 47(D1):D1102–D1109, 2018.
  •  16. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  •  17. G. Landrum. RDKit: Open-source cheminformatics.
  •  18. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  •  19. J. Liu, K. Mansouri, R. S. Judson, M. T. Martin, H. Hong, M. Chen, X. Xu, R. S. Thomas, and I. Shah. Predicting hepatotoxicity using ToxCast in vitro bioactivity and chemical structure. Chemical Research in Toxicology, 28(4):738–751, 2015.
  •  20. J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik. Deep neural nets as a method for quantitative structure–activity relationships. Journal of chemical information and modeling, 55(2):263–274, 2015.
  •  21. K. Mansouri, A. Abdelaziz, A. Rybacka, A. Roncaglioni, A. Tropsha, A. Varnek, A. Zakharov, A. Worth, A. M. Richard, C. M. Grulke, et al. CERAPP: Collaborative estrogen receptor activity prediction project. Environmental Health Perspectives, 124(7):1023, 2016.
  •  22. A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
  •  23. A. Mayr, G. Klambauer, T. Unterthiner, M. Steijaert, J. K. Wegner, H. Ceulemans, D.-A. Clevert, and S. Hochreiter. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chemical Science, 2018.
  •  24. J. T. Metz, E. F. Johnson, N. B. Soni, P. J. Merta, L. Kifle, and P. J. Hajduk. Navigating the kinome. Nature chemical biology, 7(4):200, 2011.
  •  25. A. Mina, R. Yoder, and P. Sharma. Targeting the androgen receptor in triple-negative breast cancer: current perspectives. OncoTargets and therapy, 10:4675, 2017.
  •  26. Z. Mousavian, S. Khakabimamaghani, K. Kavousi, and A. Masoudi-Nejad. Drug–target interaction prediction from PSSM based evolutionary information. Journal of Pharmacological and Toxicological Methods, 78:42–51, 2016.
  •  27. J. Munoz, J. J. Wheler, and R. Kurzrock. Androgen receptors beyond prostate cancer: an old marker as a new target. Oncotarget, 6(2):592, 2015.
  •  28. H. Öztürk, E. Ozkirimli, and A. Özgür. DeepDTA: Deep drug-target binding affinity prediction. arXiv preprint arXiv:1801.10193, 2018.
  •  29. T. Pahikkala, A. Airola, S. Pietilä, S. Shakyawar, A. Szwajda, J. Tang, and T. Aittokallio. Toward more realistic drug–target interaction predictions. Briefings in bioinformatics, 16(2):325–337, 2014.
  •  30. T. Qiu, J. Qiu, J. Feng, D. Wu, Y. Yang, K. Tang, Z. Cao, and R. Zhu. The recent progress in proteochemometric modelling: focusing on target descriptors, cross-term descriptors and application scope. Briefings in bioinformatics, page bbw004, 2016.
  •  31. B. Ramsundar, P. Eastman, K. Leswing, P. Walters, and V. Pande. Deep Learning for the Life Sciences. O’Reilly Media, 2019.
  •  32. D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
  •  33. J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  •  34. B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  •  35. A. Sharma, J. Lyons, A. Dehzangi, and K. K. Paliwal. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. Journal of theoretical biology, 320:41–46, 2013.
  •  36. R. H. Shoemaker. The NCI60 human tumour cell line anticancer drug screen. Nature Reviews Cancer, 6(10):813–823, 2006.
  •  37. J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017.
  •  38. R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
  •  39. J. Tang, A. Szwajda, S. Shakyawar, T. Xu, P. Hintsanen, K. Wennerberg, and T. Aittokallio. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling, 54(3):735–743, 2014.
  •  40. T. Unterthiner, A. Mayr, G. Klambauer, M. Steijaert, J. K. Wegner, H. Ceulemans, and S. Hochreiter. Multi-task deep networks for drug target prediction. In Neural Information Processing System, pages 1–4, 2014.
  •  41. US EPA. ToxCast summary files from invitrodb_v2, October 2015. Accessed: 2018-03-22.
  •  42. US EPA. Toxicity Forecaster (ToxCast) fact sheet, 2016. Accessed: 2018-03-22.
  •  43. G. J. van Westen, J. K. Wegner, A. P. IJzerman, H. W. van Vlijmen, and A. Bender. Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. MedChemComm, 2(1):16–30, 2011.
  •  44. I. Wallach, M. Dzamba, and A. Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015.
  •  45. M. Wen, Z. Zhang, S. Niu, H. Sha, R. Yang, Y. Yun, and H. Lu. Deep-learning-based drug–target interaction prediction. Journal of proteome research, 16(4):1401–1409, 2017.
  •  46. Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
  •  47. Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):232–240, 2008.
  •  48. T. A. Yap, A. D. Smith, R. Ferraldeschi, B. Al-Lazikani, P. Workman, and J. S. de Bono. Drug discovery in advanced prostate cancer: translating biology into therapy. Nature Reviews Drug Discovery, 15(10):699, 2016.
  •  49. Y. Xu and J. Pei. Applications of deep learning in cheminformatics (in Chinese). Big Data Research, 3(2):2017019, 2017.