Target-specific drug design is one of the key techniques in therapeutic drug discovery. Prediction of new drug-target interactions can help researchers find new uses for old drugs and understand their therapeutic profiles or side effects [2, 3, 4]. Since experimental determination of drug-target interactions is expensive and time-consuming [5, 6], computational methods have gained increasing popularity in recent years.
The computational approaches taken in the literature to address drug-target interaction prediction include, but are not limited to, ligand-based methods [7, 8], target- or receptor-based methods [9, 10], gene ontology based methods, and literature text mining methods [12, 13]. The performance of ligand-based methods degrades as the number of known ligands of a particular target protein decreases. Receptor-based methods often use docking simulation and rely heavily on the availability of the three-dimensional structure of the protein targets. Notably, determining the three-dimensional structure of a protein using NMR or X-ray crystallography is an expensive and time-consuming task. Moreover, three-dimensional structures are very difficult to predict for ion channel proteins and G-protein coupled receptors (GPCRs). On the other hand, the tremendous growth of the biomedical literature has increased the redundancy of compound and gene names, which is the main obstacle for systematic literature-based text mining methods.
Recently, chemo-genomic methods have been applied to predict drug-target interactions. These are mainly learning-based methods [16, 17], graph-based methods [18, 19], network-based methods [20, 21], [22, 23], [24], fuzzy methods, nearest neighbor algorithms, etc. Yamanishi et al. proposed a formalization of drug-target interaction inference as a supervised learning problem on a bipartite graph. In that pioneering work, they also introduced a gold standard dataset that has since been used extensively in the literature [22, 27, 24]. In a subsequent work, the same authors explored the relationship among the chemical space, the pharmacological space and the topology of drug-target interaction networks, and applied distance-based learning. Wang et al. proposed RLS-KF, which uses a regularized least squares method integrated with nonlinear kernel fusion. Drug-based similarity inference (DBSI) utilizes two-dimensional chemical structural similarity. Another method, KBMF2K, uses chemical and genomic kernels with Bayesian matrix factorization. Among other methods, NetCBP, DASPfind and SELF-BLM are noteworthy. In a recent work, position specific scoring matrix (PSSM) based bigram features and molecular fingerprints were used to predict drug-target interactions. Because three-dimensional structures of the protein targets are often unavailable, most of the supervised learning methods in the literature do not exploit structure-based features.
In this paper, we present iDTI-ESBoost, a method for the identification of Drug Target Interactions using Evolutionary and Structural information with Boosting. We exploit structural features along with evolutionary features to predict drug-protein interactions. Our work was inspired by successful modern secondary structure prediction tools such as SPIDER2 [32, 33] and their use for generating features in supervised learning and classification. Our proposed method uses a novel set of features extracted from structural information, along with evolutionary features and molecular fingerprints of drugs. To handle the large imbalance in the data, we propose a novel balancing method and use it with a boosting algorithm to achieve superior performance over the state-of-the-art algorithms. In our experiments, iDTI-ESBoost significantly outperformed other methods on a gold standard dataset widely used in the literature under standard evaluation criteria. Our method is publicly available at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/.
The rest of the paper follows general suggestions made in the literature: description of the datasets, formulation of statistical samples, selection and development of a powerful classification algorithm, demonstration of the performance of the predictor using cross-validation, and implementation of a web server, followed by a conclusion.
Materials and Methods
In this section, we provide the details of the benchmark datasets, feature extraction and balancing methods, classifiers and evaluation metrics used in this work. Figure 1 depicts the training module of our proposed method, iDTI-ESBoost. The training dataset of iDTI-ESBoost contains both interacting (positive) and non-interacting (negative) drug-target pairs. For each drug-target pair, the drug is looked up in the DrugBank database to fetch its chemical structure in SMILES format. Similarly, the target protein sequence is first fetched from the KEGG database and then fed to SPIDER2 and PSI-BLAST to obtain, respectively, structural information as an SPD file and a position specific scoring matrix (PSSM) based profile containing evolutionary information. A feature extraction module then uses these files to generate three types of features: drug molecular fingerprints, PSSM bigrams, and structural features based on the output of the secondary structure prediction software SPIDER2. The features generated in this phase are then fed to an AdaBoost classifier that learns the model for prediction purposes.
The prediction module is very similar to the training module shown in Figure 1. For prediction, a query drug-target pair is fed to the system, the same three types of features are extracted, and the trained and stored model is then used to predict whether the given drug-target pair is interacting or non-interacting.
Drug-target Interaction Datasets
In this paper, we have used the gold standard datasets introduced by Yamanishi et al. The datasets are publicly available at: http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Yamanishi et al. used DrugBank, KEGG BRITE, BRENDA and SuperTarget to extract information about drug-target interactions. They considered known drugs for four types of protein targets, namely enzymes, ion channels, G-protein coupled receptors (GPCRs) and nuclear receptors. The numbers of proteins in these classes are 664, 204, 95 and 26, respectively, which interact with 445, 210, 223 and 54 drugs through 2926, 1476, 635 and 90 known interactions. A brief description of the datasets is given in Table 1. These benchmark datasets have been used in many studies in the literature [22, 27, 21, 24] and are referred to as the ‘gold’ standard.
Graph Construction from the Dataset
Based on the interactions of the four types of proteins with known drugs, we build positive and negative samples for each dataset using a method similar to one used in the literature, as follows. The drug-target interaction network for each dataset is a bipartite graph $G = (V, E)$, where the set of vertices is $V = D \cup T$ such that $D$ is the set of drugs and $T$ is the set of targets, and $E$ is the set of edges. Here, any edge $e = (d, t) \in E$ denotes an interaction between a drug $d \in D$ and a protein target $t \in T$. For a particular graph $G$ from a dataset, all the known interactions, represented by its edges, are considered positive samples, and the non-existent edges are taken as negative samples. Note that non-existent edges here refer only to valid drug-target pairs that are not connected; they do not include pairs of vertices from the same partite set. Formally, a dataset $S$ is the union of a positive set and a negative set:
$$S = S^{+} \cup S^{-}$$
Here, $S^{+} = \{(d, t) : (d, t) \in E\}$, $S^{-} = \{(d, t) : d \in D,\ t \in T,\ (d, t) \notin E\}$ and $S^{+} \cap S^{-} = \emptyset$. For example, in the nuclear receptor dataset, there are 54 drugs and 26 proteins with $54 \times 26 = 1404$ possible interactions. Since 90 interactions are known, these are treated as positive and the remaining 1314 as negative. The same procedure was followed for each of the datasets. As expected, the datasets constructed using this technique are imbalanced, as the negative samples far outnumber the positive samples. This issue is addressed later by applying balancing techniques.
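The construction of positive and negative samples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this work: the function name `build_samples` and the toy identifiers are hypothetical, and interactions are assumed to be given as (drug, target) tuples.

```python
from itertools import product

def build_samples(drugs, targets, known_interactions):
    """Split all drug-target pairs into positives (known edges) and negatives."""
    known = set(known_interactions)
    positives, negatives = [], []
    # Only drug-target pairs are valid edges of the bipartite graph;
    # drug-drug and target-target pairs are never generated.
    for pair in product(drugs, targets):
        (positives if pair in known else negatives).append(pair)
    return positives, negatives

# Toy data with hypothetical identifiers (not taken from the real datasets).
drugs = ["d1", "d2", "d3"]
targets = ["t1", "t2"]
known = [("d1", "t1"), ("d3", "t2")]
pos, neg = build_samples(drugs, targets, known)
print(len(pos), len(neg))  # 2 4
```

With 54 drugs, 26 targets and 90 known interactions, the same procedure yields the 1314 negative samples mentioned for the nuclear receptor dataset.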
A dataset constructed in this way has drug-target pairs as instances. In the feature extraction phase, the drug identifier is looked up in the KEGG database and the corresponding structure in SMILES format is downloaded from the DrugBank database. The drug-based features are generated from this SMILES data.
Similarly, the protein target of each pair is first looked up in the KEGG database to fetch the protein sequence. This protein sequence is then fed to two different software tools: Position Specific Iterated BLAST (PSI-BLAST), to obtain the evolutionary profile as a position specific scoring matrix (PSSM), and the secondary structure prediction tool SPIDER2, to generate SPD3 files that contain the structural information. Three groups of features are extracted using these three files. The details are described in the rest of this section.
SMILES Based Features
Several descriptors are used to represent the features or properties of drug compounds. Among the most popular are molecular fingerprints, which are widely used for similarity searching, clustering, and classification. Each drug compound is represented by the 881 chemical substructures defined in the PubChem database. The presence (absence) of a particular substructure is encoded as 1 (0). Thus the length of the molecular fingerprint based feature vector is 881. We used the rcdk package of R to extract these molecular fingerprint based features.
PSSM Based Features
We used the PSSM returned by the PSI-BLAST software to generate evolutionary features from the protein target sequences. Each PSSM file contains a matrix that is constructed after multiple sequence alignment against the non-redundant (NR) database. The matrix $M$ has dimension $L \times 20$, where $L$ is the length of the protein, and each entry $M_{i,j}$ represents the probability of observing the $j$-th amino acid at the $i$-th location of the given protein sequence. We first convert this matrix to a normalized matrix $N$ using a normalization technique similar to one proposed in the literature. The dimension of $N$ is the same as that of the original matrix $M$. After that, we generate PSSM-bigram features using the following equation:
$$\text{PSSM-bigram}(k, l) = \frac{1}{L} \sum_{i=1}^{L-1} N_{i,k} N_{i+1,l}, \quad 1 \le k \le 20,\ 1 \le l \le 20$$
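The bigram computation can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work. It assumes that the PSSM has already been normalized and that the bigram for a column pair is the sum of products of consecutive rows divided by the sequence length; the function name `bigram_features` and the toy matrix are hypothetical.

```python
def bigram_features(matrix):
    """Return the flattened bigram feature vector of an L x C matrix."""
    length = len(matrix)
    cols = len(matrix[0])
    features = []
    for k in range(cols):
        for l in range(cols):
            # Pair column k at position i with column l at position i + 1.
            total = sum(matrix[i][k] * matrix[i + 1][l] for i in range(length - 1))
            features.append(total / length)  # assumed 1/L normalization
    return features

# Toy 3 x 2 matrix; a real normalized PSSM would have 20 columns.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(len(bigram_features(toy)))  # 4
```

For a matrix with 20 columns the function returns 400 values, which together with the 881 fingerprint bits matches the 1281 features of the base feature group.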
Structure Based Features
Traditional drug discovery is a lock-and-key problem, where the lock is the target. The structure of the target thus plays a very important role in traditional drug discovery and is at the center of docking-based software. We hypothesize that even if the full structure of a target is not available, its estimated structural properties can play an important role in drug-target interaction prediction. Structural features are generated from the structural information produced by the SPIDER2 software and stored in SPD files. The information produced by SPIDER2 comprises accessible surface area (ASA), secondary structural (SS) motifs, torsional angles (TA) and structural probabilities (SP). The following features are generated from this information:
Secondary Structure Composition: This feature is the normalized count, or frequency, of the structural motifs present at the amino acid residue positions. There are three types of motifs: $\alpha$-helix (H), $\beta$-sheet (E) and random coil (C). SPIDER2 returns a vector $SS$ of dimension $L \times 1$ containing this information. Thus we can define this feature as follows:
$$\text{SS-Composition}(i) = \frac{1}{L} \sum_{j=1}^{L} c_{i,j}, \quad 1 \le i \le 3$$
Here, $L$ is the length of the protein and
$$c_{i,j} = \begin{cases} 1, & \text{if } SS_j = f_i \\ 0, & \text{otherwise} \end{cases}$$
Here, $SS_j$ is the structural motif at position $j$ of the protein sequence and $f_i$ is one of the 3 different motif symbols.
Accessible Surface Area Composition: The accessible surface area composition is the normalized sum of the accessible surface area, defined by:
$$\text{ASA-Composition} = \frac{1}{L} \sum_{i=1}^{L} ASA(i)$$
Here, $ASA$ is the vector of dimension $L \times 1$ containing the accessible surface area values of all the amino acid residues.
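The two composition features described so far reduce to simple averages. The sketch below assumes that the predicted motifs are available as a string over the symbols H, E and C and that the accessible surface area values are available as a list of numbers; the function names and toy inputs are hypothetical, and parsing of the SPIDER2 output files is not shown.

```python
def ss_composition(ss, motifs="HEC"):
    """Fraction of residues assigned to each secondary structure motif."""
    length = len(ss)
    return [ss.count(m) / length for m in motifs]

def asa_composition(asa):
    """Mean accessible surface area over all residues."""
    return sum(asa) / len(asa)

ss = "CCHHHHEECC"                  # toy predicted motif string
asa = [10.0, 20.0, 30.0, 40.0]     # toy per-residue ASA values
print(ss_composition(ss))    # [0.4, 0.2, 0.4]
print(asa_composition(asa))  # 25.0
```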
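Torsional Angles Composition: Four different types of torsional angles, $\phi$, $\psi$, $\tau$ and $\theta$, are returned by SPIDER2 for each residue. First, we convert each of them from degrees to radians and then take the sine and cosine of the angles at each residue position. Thus we get a matrix of dimension $L \times 8$, which we denote by $T$. Torsional angles composition is defined as:
$$\text{TA-Composition}(k) = \frac{1}{L} \sum_{i=1}^{L} T_{i,k}, \quad 1 \le k \le 8$$
Torsional Angles Bigram: The bigram for the torsional angles is similar to that of the PSSM and is defined as:
$$\text{TA-bigram}(k, l) = \frac{1}{L} \sum_{i=1}^{L-1} T_{i,k} T_{i+1,l}, \quad 1 \le k \le 8,\ 1 \le l \le 8$$
Structural Probabilities Bigram: Structural probabilities for each amino acid residue position are given in the SPD file as a matrix of dimension $L \times 3$, which we denote by $P$. Recall that there are three types of structural motifs, namely $\alpha$-helix (H), $\beta$-sheet (E) and random coil (C). The bigram of the structural probabilities is similar to that of the PSSM and is defined as:
$$\text{SP-bigram}(k, l) = \frac{1}{L} \sum_{i=1}^{L-1} P_{i,k} P_{i+1,l}, \quad 1 \le k \le 3,\ 1 \le l \le 3$$
Torsional Angles Auto-Covariance: This feature is also derived from the torsional angles and is defined as:
$$\text{TA-Auto-Covariance}(k, j) = \frac{1}{L} \sum_{i=1}^{L-k} T_{i,j} T_{i+k,j}, \quad 1 \le j \le 8,\ 1 \le k \le DF$$
This feature group depends on the parameter $DF$, which is the distance factor. In this study, we used $DF = 10$.
Structural Probabilities Auto-Covariance: This feature is also derived from the structural probabilities and is defined as:
$$\text{SP-Auto-Covariance}(k, j) = \frac{1}{L} \sum_{i=1}^{L-k} P_{i,j} P_{i+k,j}, \quad 1 \le j \le 3,\ 1 \le k \le DF$$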
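The torsional angle transformation and the auto-covariance features can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: the sine and cosine columns are interleaved per angle, the auto-covariance for a lag is the sum of products of entries that many positions apart in the same column divided by the sequence length, and the names `torsion_matrix` and `auto_covariance` are hypothetical.

```python
import math

def torsion_matrix(angles_deg):
    """Map rows of four angles in degrees to rows of 8 sine and cosine values."""
    rows = []
    for row in angles_deg:
        radians = [math.radians(a) for a in row]
        # For each angle emit its sine followed by its cosine.
        rows.append([f(r) for r in radians for f in (math.sin, math.cos)])
    return rows

def auto_covariance(matrix, df):
    """Auto-covariance features for lags 1..df, one value per (lag, column)."""
    length, cols = len(matrix), len(matrix[0])
    features = []
    for k in range(1, df + 1):
        for j in range(cols):
            total = sum(matrix[i][j] * matrix[i + k][j] for i in range(length - k))
            features.append(total / length)  # assumed 1/L normalization
    return features

# Toy angles for 12 residues (hypothetical values in degrees).
angles = [[-60.0 + i, -45.0 + i, 50.0 + i, 90.0 + i] for i in range(12)]
T = torsion_matrix(angles)
print(len(T[0]), len(auto_covariance(T, df=10)))  # 8 80
```

With a distance factor of 10, the torsional angle matrix yields 80 auto-covariance values and a three-column probability matrix yields 30, which is consistent with the 110 features that separate the 1293 and 1403 feature counts reported in the experiments.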
A brief summary of the three groups of features derived from each drug-target pair is given in Table 2.
Recall that each of our four datasets is heavily imbalanced. Several sampling techniques from the literature have been deployed in imbalanced settings: random under sampling, synthetic over sampling, balanced random sampling (BRS), neighborhood cleaning rule, cluster based under sampling [53, 54], etc. In this paper, we explore the random under sampling (RUS) method, as done previously for drug-target interaction prediction. We also propose a novel modified cluster based under sampling method, as follows. In this method, the dataset is first divided into two subsets, the major class and the minor class. $k$-means clustering is then applied to the major class to divide its samples into $k$ clusters, while the minor class samples are kept unchanged. From the clusters of major class samples, subsamples are then chosen randomly to represent the entire major class. We denote this method as cluster based under sampling (CUS) throughout this paper, and random under sampling as RUS. The pseudo-code for the CUS algorithm is given in Algorithm 1.
Our CUS algorithm depends on two parameters: the number of clusters $k$ and a sampling parameter. In our experiments, we varied $k$ over a range of values and found the best performing value to be 23. More sophisticated clustering algorithms could also be applied to this data. The role of the second parameter is to control the random under sampling of the clustered majority class samples.
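A cluster based under sampling step of this kind can be sketched in plain Python. The sketch is not Algorithm 1 itself: it assumes a basic Lloyd's k-means with Euclidean distance, it interprets the sampling parameter as a fraction `rate` of each cluster to keep (with at least one sample per non-empty cluster), and all function and parameter names are hypothetical.

```python
import random

def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def cluster_under_sample(major, minor, k, rate, seed=0):
    """Cluster the major class, sample a fraction of each cluster, keep the minor class."""
    rng = random.Random(seed)
    labels = kmeans(major, k, rng)
    sampled = []
    for c in range(k):
        members = [p for p, lab in zip(major, labels) if lab == c]
        if members:
            size = max(1, round(rate * len(members)))  # rate assumed in (0, 1]
            sampled.extend(rng.sample(members, size))
    return sampled, list(minor)

# Toy 2-D feature vectors (hypothetical): 100 major and 2 minor class samples.
major = [(float(i % 10), float(i // 10)) for i in range(100)]
minor = [(0.5, 0.5), (9.5, 9.5)]
sampled, kept = cluster_under_sample(major, minor, k=3, rate=0.2)
print(len(sampled), len(kept))
```

Sampling inside each cluster, rather than from the major class as a whole, keeps every region of the majority feature space represented in the reduced training set.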
Description of the classifier
We have selected the adaptive boosting algorithm (AdaBoost) as our classification algorithm. Adaptive boosting is a meta or ensemble classifier that combines several weak learning algorithms, or weak classifiers, and improves on their performance. We choose decision tree classifiers as the weak classifiers. AdaBoost is a meta-classifier of the following form:
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{R} \alpha_t h_t(x)\right)$$
AdaBoost adds a weak classifier $h_t$ at each iteration $t$ of the algorithm, weighted by $\alpha_t$, where $\alpha_t$ is the weight obtained from the error function for the weak classifier at iteration $t$ and $R$ is the number of iterations. Each of these weak classifiers is chosen so as to minimize the error on the training samples $(x_i, y_i)$ weighted by the distribution $D_t$:
$$\epsilon_t = \sum_{i} D_t(i)\, \mathbb{1}\left[h_t(x_i) \ne y_i\right]$$
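The boosting loop can be illustrated with a small self-contained sketch. It is not the classifier used in the experiments, which relies on Scikit-learn: the sketch implements the textbook discrete AdaBoost with labels in {-1, +1}, uses single-threshold decision stumps in place of full decision trees, and all names are hypothetical.

```python
import math

def train_stump(X, y, weights):
    """Find the threshold stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                preds = [sign if threshold >= row[feature] else -sign for row in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or best[0] > error:
                    best = (error, feature, threshold, sign)
    return best

def stump_predict(stump, row):
    _, feature, threshold, sign = stump
    return sign if threshold >= row[feature] else -sign

def adaboost(X, y, rounds):
    """Discrete AdaBoost; returns a list of (alpha, stump) pairs."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, stump))
        # Misclassified samples gain weight, then weights are normalized.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, row))
                   for w, row, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Toy 1-D problem that no single stump can solve.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])  # [1, 1, -1, -1, 1, 1]
```

No single threshold separates this toy set, but the weighted vote of three stumps does, which is the effect that boosting is meant to achieve with weak classifiers.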
A large variety of performance metrics are used in the literature to compare supervised learning methods. The gold standard datasets used in the drug-target interaction prediction literature are highly imbalanced, with negative samples far outnumbering positive samples. Therefore, typical measures like accuracy are not very informative. Moreover, the decision of a classifier that produces probabilistic outputs depends on the threshold applied to the values predicted for each class. In such cases, the thresholds play an important role in the sensitivity and specificity of the classifier. Two measures that are independent of the thresholds set for decision making are the area under the Receiver Operating Characteristic curve (auROC) and the area under the precision-recall curve (auPR). These two measures are widely used in the literature of drug-target interaction prediction [22, 30, 58, 24] and have thus become standard metrics for comparison.
Let us assume that $P$ is the total number of positive samples in a dataset and $N$ is the total number of negative samples. Let $TP$, $TN$, $FN$ and $FP$ denote the numbers of true positives, true negatives, false negatives and false positives predicted by a classifier. True positives (negatives) are positive (negative) samples correctly classified by the classifier. Conversely, false positives (negatives) are negative (positive) samples incorrectly predicted as positives (negatives) by the classifier. Following these notions, we can define sensitivity, or true positive rate (TPR), as follows:
$$\text{Sensitivity} = TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$$
Therefore, sensitivity is the ratio of correctly predicted positive samples to the total number of positive samples. Precision, or positive predictive value (PPV), is defined as follows:
$$\text{Precision} = PPV = \frac{TP}{TP + FP}$$
Therefore, precision is the fraction of positive predictions made by the classifier that are correct. Another important measure is specificity (SPC), or true negative rate, defined as follows:
$$\text{Specificity} = SPC = \frac{TN}{N} = \frac{TN}{TN + FP}$$
Fall-out, or false positive rate (FPR), is the ratio of the number of wrongly classified negative samples to the total number of negative samples, defined as follows:
$$FPR = \frac{FP}{N} = \frac{FP}{FP + TN} = 1 - SPC$$
The F1 score is the harmonic mean of precision and sensitivity, defined as follows:
$$F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = \frac{2\,TP}{2\,TP + FP + FN}$$
All these performance measures take values in the range $[0, 1]$. For sensitivity, precision, specificity and F1 score, 0 is the worst and 1 is the best; for the false positive rate, lower is better.
Another score that is often used in comparison is the Matthews correlation coefficient (MCC), defined as follows:
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
The value of this coefficient ranges from $-1$ to $+1$, where $+1$ means a perfect predictor and $-1$ means total disagreement.
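The threshold-dependent measures can be computed directly from the four confusion matrix counts. The following sketch uses the standard definitions of these measures; the function name and the toy counts are hypothetical and are not results from this work, and degenerate cases with zero denominators are not handled.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Threshold-dependent metrics computed from confusion matrix counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "fpr": fpr,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts for a dataset with 90 positives and 1314 negatives.
metrics = confusion_metrics(tp=80, tn=1294, fp=20, fn=10)
print(round(metrics["precision"], 3), round(metrics["sensitivity"], 3))  # 0.8 0.889
```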
The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at various threshold values. The performance of a predictor is summarized by the area under the ROC curve (auROC). A perfect classifier has an auROC value of 1 and a random classifier has a value of 0.5. However, for imbalanced datasets like ours, the area under the precision-recall curve (auPR) is of more significance. The precision-recall curve plots precision against recall at different threshold values. This score penalizes false positives more than auROC does and is thus more suitable for skewed datasets. The value of auPR ranges from 0 to 1, and the higher the value, the better.
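Both threshold-independent measures can be estimated from predicted scores without fixing a threshold. The sketch below is one possible way to do so and is not the evaluation code used in this work: auROC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), and auPR is approximated by average precision, which is only one of several common estimators of the area under the precision-recall curve. The function names and toy scores are hypothetical.

```python
def au_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy predictions: 1 marks an interacting pair, 0 a non-interacting pair.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(round(au_roc(labels, scores), 3), round(average_precision(labels, scores), 3))  # 0.833 0.833
```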
It is very important to test the methods to check and balance the bias-variance trade-off. Various sampling methods are used to measure the performance of supervised learning algorithms. The most widely used among them are $k$-fold cross validation and jackknife tests. Because of the high imbalance, dimensionality and cardinality of the datasets, most of the methods in the literature have preferred 5-fold cross validation as the sampling method [58, 24, 22, 30]. We also use 5-fold cross validation to test our method, for the sake of fair comparison with the other state-of-the-art methods.
In 5-fold cross validation, the dataset is first randomly split into five equal parts, retaining in each part the same ratio of imbalance as in the original dataset. Each time, one part of the dataset is used as test data and the other four are used as training data. First, the balancing technique (cluster based or random under sampling) is applied to the training data, and then the classifier is trained on the balanced data to learn a model. The model is then used to predict the labels for the test data. Thus all the drug-target pairs in the datasets are used in testing the classifier performance. The measures reported are the averages over the five folds.
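The key point of this protocol is that balancing is applied to the training folds only, while the test fold keeps the original imbalance. A minimal sketch of such a split is given below. It assumes binary labels, uses random under sampling to a 1:1 ratio as the balancing step (the ratio is an assumption), and all function names are hypothetical.

```python
import random

def stratified_folds(labels, n_folds, seed=0):
    """Distribute sample indices over folds so that each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def random_under_sample(indices, labels, rng):
    """Keep all positives and an equally sized random subset of negatives."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    return pos + rng.sample(neg, min(len(neg), len(pos)))

def cross_validation_splits(labels, n_folds=5, seed=0):
    """Yield (balanced training indices, untouched test indices) for each fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield random_under_sample(train, labels, rng), test

# Toy labels: 10 positive and 90 negative samples.
labels = [1] * 10 + [0] * 90
for train, test in cross_validation_splits(labels):
    print(len(train), len(test))  # 16 20 for every fold
```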
Results and Discussion
In this section, we present the results of our experiments. All the methods were implemented in Python 3.4, and the Scikit-learn library was used for the implementation of the machine learning algorithms. All experiments were conducted on a computing machine hosted by CITS, United International University. Each of the experiments was carried out 5 times and the average is reported as the result. We conducted four different sets of experiments. First, we investigate the effectiveness of the different feature groups listed in Table 2. Recall that, in Table 2, four different feature groups, namely A, B, C and D, were formed. Second, we investigate the effectiveness of the classifiers used in our research. Third, we examine the effectiveness of the balancing methods applied to our highly imbalanced datasets. Finally, we test our method, iDTI-ESBoost, against the state-of-the-art.
Effectiveness of Feature Groups
We created four different feature groups to see the effects of the different sets of features on the classifier performance. The feature groups have already been reported in Table 2. Group A contains 1281 features and was used in previous work. We then added the other groups, namely B, C and D, incrementally in that order to the base feature group, Group A, obtaining feature vectors of size 1293, 1403 and 1476, respectively. We performed two sets of experiments to test the effectiveness of the feature groups. In both sets of experiments, we varied the feature groups, ran different classifiers and applied different balancing methods to the data to analyze the effect. The results of these experiments are reported in Table 3 and Table 5.
Table 3 reports the performance of three different classifiers on the four datasets during our experiments. Note that, although this experiment was intended for classifier selection, we clearly see that the best results in terms of auPR and auROC were found only when the structural features were added. For the enzymes dataset, the best result in terms of auPR was 0.66, found with the combination A, B, C, which uses the structural composition and structural auto-covariance groups together with the PSSM-bigram and molecular fingerprint based features. It was slightly better than the case when we used all the features A, B, C, D and got an auPR of 0.66. In terms of auROC, the results were comparable to each other; however, the best result was achieved when all four feature groups were used in combination. Thus the enzymes dataset shows the effectiveness of the structural information based features.
The ion channels and GPCRs datasets showed similar performance in terms of auPR. The nuclear receptors dataset showed the highest auPR value when only the composition features, i.e., Group B, were added to the base features. The increase in the value of auROC clearly reveals the effectiveness of the structural features (Groups B, C, D) when added to the base features (Group A).
The next set of experiments was run to show the performance of the different balancing, or under sampling, methods on the training data using various feature groups. These results are shown in Table 5. These experiments were run using the AdaBoost classifier. The results in Table 5 clearly show that, for all the datasets, the best results in terms of auPR and auROC were found when the structural features were added. In the case of the GPCRs, the auPR was found to be the highest, at 0.5, when three feature groups, namely Groups A, B and C, were combined. In all the other datasets, all four groups combined showed superior performance both in terms of auPR and auROC. Our hypothesis that the added structural features play an important role in the prediction of drug-target interactions is thus supported by these experiments.
Effectiveness of the AdaBoost Classifier
To select a suitable classifier for our problem, we tested three different classifiers: the AdaBoost ensemble classifier with decision trees as the weak classifiers, Random Forest and Support Vector Machines. For these experiments, we used random under sampling as the balancing method. As features, the four different combinations already mentioned were used. The results in terms of auPR and auROC are presented in Table 3. Here, for each combination of dataset and feature groups, the bold faced value in the table represents the highest value achieved for that combination. It is evident that, except for one case in the enzymes dataset, the AdaBoost classifier has shown superior performance in terms of auPR across all feature group combinations. It is also worth noting that, for all datasets, the highest auPR value was achieved by AdaBoost. The precision-recall curves for these experiments across all feature group combinations are illustrated in Figure 2.
In case of the ROC curve, the results are also in support of the selection of AdaBoost as a classifier. AdaBoost provides the highest auROC values for all the four datasets and it gives better auROC values for 11 out of 16 dataset-feature groups combinations. In other cases, SVM has achieved the highest auROC values, but only marginally so. The ROC curves for different classifiers across all feature groups combinations are illustrated in Figure 3.
Considering the auPR and auROC values on the different datasets, as shown in Table 3 and illustrated through the curves in Figure 3 and Figure 2, we select AdaBoost as the classifier for iDTI-ESBoost. Note that, because of the huge imbalance in the datasets, with far fewer positive samples than negative ones, auPR is more important than auROC, and AdaBoost clearly outperforms the other two classifiers in terms of auPR values.
Effectiveness of the Balancing Methods
The next set of experiments were run to test the effectiveness of the two different sampling methods on the datasets. The parameters used with AdaBoost classifier for random and cluster based under sampling are reported in Table 4.
For each of the datasets, we used four feature group combinations and used random and cluster based under sampling and report auPR and auROC values from cross-validation experiments in Table 5. We also show the ROC curves and auPR curves for all four datasets using all the features in Figure 5 and Figure 4.
From the results reported in Table 5, it is worth mentioning that, in terms of auPR, cluster based sampling significantly outperforms the random under sampling method on all four datasets. In terms of auROC, however, random sampling is slightly better than cluster based sampling on the enzymes and ion channels datasets, while cluster based sampling outperforms random sampling on the GPCRs and nuclear receptors datasets.
Comparison with Other Methods
Since the pioneering work of Yamanishi et al., many supervised learning methods have been applied to predict drug-target interactions on these gold standard benchmark datasets. However, a few of these methods [24, 28] do not use cross validation techniques and others [23, 3] do not use the same standard datasets. Our method uses molecular fingerprints and evolutionary and structural features for this supervised classification problem. Similar methods, albeit without the structural features and balancing techniques, are reported in [64, 22]. Most of the papers in the literature have used auROC as the main evaluation metric. We have compared the performance of our method on these four datasets with that of DBSI, KBMF2K, NetCBP, the two works of Yamanishi et al., Wang et al. and Mousavian et al. using auROC. The auROC values for all these methods along with iDTI-ESBoost are reported in Table 6.
From the values shown in bold face in Table 6, we notice that, for all the datasets, iDTI-ESBoost significantly outperforms all the previous state-of-the-art methods in terms of auROC. All the auROC values are greater than 90%, which indicates the effectiveness of the classifier, the balancing methods and the novel features proposed in this paper.
Moreover, it has been argued in the literature that auPR is a better measure for evaluating the performance of classifiers on skewed datasets, especially in drug-target interaction prediction, where negative samples outnumber the positive samples. This argument does have merit as, logically, a misclassification of a positive sample, i.e., a false negative, should be penalized more in the score. To compare the performance of our method with that of the latest and best-performing previous predictor, we report the auPR values of the two predictors in Table 7. The results clearly show that our method, iDTI-ESBoost, outperforms that predictor in terms of auPR as well.
In Table 8, we report specificity, sensitivity, precision, MCC and F1-Score for four datasets using different feature group combinations as achieved by iDTI-ESBoost in experiments. Specificity and sensitivity are very high as reported in this table.
Predicting New Interactions
In addition, we have analyzed the predictions produced by the classification algorithm. Among the drug-target pairs labeled as negative in the datasets, we noticed a number for which the predicted probability of interaction is too high for the pair to be plausibly considered a negative sample. Similar approaches were adopted in [16, 27]. In this paper, we suggest that such pairs, which are labeled as non-interacting in the gold standard but are predicted as positive by our method with a very high probability, could be potential candidates for new interactions. A list of such interactions for the four groups of targets is provided as supplementary information with this paper.
Web Server Implementation
We have also implemented our method, as shown in Figure 1, as a separate web server. The web server is freely available for use at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/. The mechanism of the web server is very simple. We also provide the pre-learned models for each of the datasets. The interface of the web server is easy to use. It requires a user first to select the target group and provide the PSSM and SPD files for the target protein. These files can be easily generated with the PSI-BLAST and SPIDER2 software using their online tools.
To specify the drug, one can select from a drop-down list. The drugs are pre-fetched into our system from the KEGG website. After selecting the drug and specifying the target files, one can click the prediction button to obtain the prediction for that drug-target pair. The web server also has a simple page with easy-to-use instructions.
In this paper, we have presented iDTI-ESBoost, a novel method to predict and identify drug-target interactions. iDTI-ESBoost is unique in its exploitation of structural features along with the evolutionary features to predict drug-protein interactions. It also uses a novel balancing technique and a boosting technique. We have conducted extensive experiments to test and analyze the performance of iDTI-ESBoost. On four benchmark datasets known as the gold standard data in the literature, iDTI-ESBoost outperforms the state-of-the-art methods in terms of area under Receiver Operating Characteristic (auROC) curve.
Notably, the gold standard datasets used in the literature as benchmarks to analyze the performance of the methods for drug-target interactions prediction and identification are highly imbalanced with negative samples far outnumbering the positive samples. In the literature it has been argued that area under Precision Recall (auPR) curve is more appropriate as a metric for comparison for such imbalanced datasets. To this end, iDTI-ESBoost also outperforms the latest and the best-performing method in the literature to-date in terms of area under precision recall (auPR) curve. We believe that the excellent performance of iDTI-ESBoost both in terms of auROC and auPR would motivate the researchers and practitioners to use it to predict drug-target interactions. To facilitate that, iDTI-ESBoost is readily available for use at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/.
The authors declare that they have no competing interests.
SS initiated the project with the idea of using structural features. FR, SA and DMF contributed equally to the idea of the modified balancing method and boosting. FR and SA contributed equally to the implementation of and experimentation with the system. All the methods, algorithms and results have been analyzed and verified by SS, DMF, MSR, ZM and AD. All authors contributed significantly to the preparation of the manuscript and approved the final version.
-  Keiser, M.J., Setola, V., Irwin, J.J., Laggner, C., Abbas, A.I., Hufeisen, S.J., Jensen, N.H., Kuijer, M.B., Matos, R.C., Tran, T.B., et al.: Predicting new molecular targets for known drugs. Nature 462(7270), 175–181 (2009)
-  Cheng, F., Li, W., Wu, Z., Wang, X., Zhang, C., Li, J., Liu, G., Tang, Y.: Prediction of polypharmacological profiles of drugs by the integration of chemical, side effect, and therapeutic space. Journal of chemical information and modeling 53(4), 753–762 (2013)
-  Wu, Z., Cheng, F., Li, J., Li, W., Liu, G., Tang, Y.: Sdtnbi: an integrated network and chemoinformatics tool for systematic prediction of drug–target interactions and drug repositioning. Briefings in bioinformatics 18(2), 333–347 (2017)
-  Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L.J., Bork, P.: Drug target identification using side-effect similarity. Science 321(5886), 263–266 (2008)
-  Haggarty, S.J., Koeller, K.M., Wong, J.C., Butcher, R.A., Schreiber, S.L.: Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays. Chemistry & biology 10(5), 383–396 (2003)
-  Kuruvilla, F.G., Shamji, A.F., Sternson, S.M., Hergenrother, P.J., Schreiber, S.L.: Dissecting glucose signalling with diversity-oriented synthesis and small-molecule microarrays. Nature 416(6881), 653–657 (2002)
-  Hopkins, A.L., Keserü, G.M., Leeson, P.D., Rees, D.C., Reynolds, C.H.: The role of ligand efficiency metrics in drug discovery. Nature Reviews Drug Discovery 13(2), 105–121 (2014)
-  Keiser, M.J., Roth, B.L., Armbruster, B.N., Ernsberger, P., Irwin, J.J., Shoichet, B.K.: Relating protein pharmacology by ligand chemistry. Nature biotechnology 25(2), 197–206 (2007)
-  Ma, D.-L., Chan, D.S.-H., Leung, C.-H.: Drug repositioning by structure-based virtual screening. Chemical Society Reviews 42(5), 2130–2141 (2013)
-  Pan, A.C., Borhani, D.W., Dror, R.O., Shaw, D.E.: Molecular determinants of drug–receptor binding kinetics. Drug discovery today 18(13), 667–673 (2013)
-  Mutowo, P., Bento, A.P., Dedman, N., Gaulton, A., Hersey, A., Lomax, J., Overington, J.P.: A drug target slim: using gene ontology and gene ontology annotations to navigate protein-ligand target space in chembl. Journal of biomedical semantics 7(1), 59 (2016)
-  Plake, C., Schroeder, M.: Computational polypharmacology with text mining and ontologies. Current pharmaceutical biotechnology 12(3), 449–457 (2011)
-  Zhu, S., Okuno, Y., Tsujimoto, G., Mamitsuka, H.: A probabilistic model for mining implicit ‘chemical compound–gene’ relations from literature. Bioinformatics 21(suppl 2), 245–251 (2005)
-  Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J.: Autodock4 and autodocktools4: Automated docking with selective receptor flexibility. Journal of computational chemistry 30(16), 2785–2791 (2009)
-  Mousavian, Z., Masoudi-Nejad, A.: Drug–target interaction prediction via chemogenomic space: learning-based methods. Expert opinion on drug metabolism & toxicology 10(9), 1273–1287 (2014)
-  Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., Kanehisa, M.: Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13), 232–240 (2008)
-  Bleakley, K., Yamanishi, Y.: Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25(18), 2397–2403 (2009)
-  Wang, W., Yang, S., Li, J.: Drug target predictions based on heterogeneous graph inference. In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, p. 53 (2013). NIH Public Access
-  Chen, X., Liu, M.-X., Yan, G.-Y.: Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems 8(7), 1970–1978 (2012)
-  Alaimo, S., Pulvirenti, A., Giugno, R., Ferro, A.: Drug–target interaction prediction through domain-tuned network-based inference. Bioinformatics 29(16), 2004–2008 (2013)
-  Cheng, F., Liu, C., Jiang, J., Lu, W., Li, W., Liu, G., Zhou, W., Huang, J., Tang, Y.: Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol 8(5), 1002503 (2012)
-  Mousavian, Z., Khakabimamaghani, S., Kavousi, K., Masoudi-Nejad, A.: Drug–target interaction prediction from pssm based evolutionary information. Journal of pharmacological and toxicological methods 78, 42–51 (2016)
-  Keum, J., Nam, H.: Self-blm: Prediction of drug-target interactions via self-training svm. PloS one 12(2), 0171839 (2017)
-  You, Z.-H., et al.: Large-scale prediction of drug-target interactions from deep representations. In: Neural Networks (IJCNN), 2016 International Joint Conference On, pp. 1236–1243 (2016). IEEE
-  Xiao, X., Min, J.-L., Wang, P., Chou, K.-C.: icdi-psefpt: identify the channel–drug interaction in cellular networking with pseaac and molecular fingerprints. Journal of theoretical biology 337, 71–79 (2013)
-  He, Z., Zhang, J., Shi, X.-H., Hu, L.-L., Kong, X., Cai, Y.-D., Chou, K.-C.: Predicting drug-target interaction networks based on functional groups and biological features. PloS one 5(3), 9603 (2010)
-  Yamanishi, Y., Kotera, M., Kanehisa, M., Goto, S.: Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics 26(12), 246–254 (2010)
-  Hao, M., Wang, Y., Bryant, S.H.: Improved prediction of drug-target interactions using regularized least squares integrating with kernel fusion technique. Analytica chimica acta 909, 41–50 (2016)
-  Gönen, M.: Predicting drug–target interactions from chemical and genomic kernels using bayesian matrix factorization. Bioinformatics 28(18), 2304–2310 (2012)
-  Chen, H., Zhang, Z.: A semi-supervised method for drug-target interaction prediction with consistency in networks. PloS one 8(5), 62975 (2013)
-  Ba-Alawi, W., Soufan, O., Essack, M., Kalnis, P., Bajic, V.B.: Daspfind: new efficient method to predict drug–target interactions. Journal of cheminformatics 8(1), 15 (2016)
-  Yang, Y., Gao, J., Wang, J., Heffernan, R., Hanson, J., Paliwal, K., Zhou, Y.: Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 129 (2016)
-  Yang, Y., Heffernan, R., Paliwal, K., Lyons, J., Dehzangi, A., Sharma, A., Wang, J., Sattar, A., Zhou, Y.: Spider2: A package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks. Prediction of Protein Secondary Structure, 55–63 (2017)
-  López, Y., Dehzangi, A., Lal, S.P., Taherzadeh, G., Michaelson, J., Sattar, A., Tsunoda, T., Sharma, A.: Sucstruct: Prediction of succinylated lysine residues by using structural properties of amino acids. Analytical Biochemistry (2017)
-  Chou, K.-C.: Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of theoretical biology 273(1), 236–247 (2011)
-  Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., Frolkis, A., Pon, A., Banco, K., Mak, C., Neveu, V., et al.: Drugbank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic acids research 39(suppl 1), 1035–1041 (2011)
-  Kanehisa, M., Goto, S.: Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research 28(1), 27–30 (2000)
-  Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research 25(17), 3389–3402 (1997)
-  Wishart, D.S., Knox, C., Guo, A.C., Cheng, D., Shrivastava, S., Tzur, D., Gautam, B., Hassanali, M.: Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research 36(suppl 1), 901–906 (2008)
-  Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., et al.: Kegg for linking genomes to life and the environment. Nucleic acids research 36(suppl 1), 480–484 (2008)
-  Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., Schomburg, D.: Brenda, the enzyme database: updates and major new developments. Nucleic acids research 32(suppl 1), 431–433 (2004)
-  Günther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J., et al.: Supertarget and matador: resources for exploring drug-target relationships. Nucleic acids research 36(suppl 1), 919–922 (2008)
-  Todeschini, R., Consonni, V.: Handbook of Molecular Descriptors vol. 11. John Wiley & Sons (2008)
-  Tabei, Y., Yamanishi, Y.: Scalable prediction of compound-protein interactions using minwise hashing. BMC systems biology 7(6), 3 (2013)
-  Tabei, Y., Pauwels, E., Stoven, V., Takemoto, K., Yamanishi, Y.: Identification of chemogenomic features from drug–target interaction networks using interpretable classifiers. Bioinformatics 28(18), 487–494 (2012)
-  Chen, B., Wild, D., Guha, R.: Pubchem as a source of polypharmacology. Journal of chemical information and modeling 49(9), 2044–2055 (2009)
-  Guha, R., et al.: Chemical informatics functionality in r. Journal of Statistical Software 18(5), 1–16 (2007)
-  Sharma, R., Dehzangi, A., Lyons, J., Paliwal, K., Tsunoda, T., Sharma, A.: Predict gram-positive and gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into chou’s general pseaac. IEEE Transactions on NanoBioscience 14(8), 915–926 (2015)
-  Paliwal, K.K., Sharma, A., Lyons, J., Dehzangi, A.: A tri-gram based feature extraction technique using linear probabilities of position specific scoring matrix for protein fold recognition. IEEE Transactions on Nanobioscience 13(1), 44–50 (2014)
-  Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, 321–357 (2002)
-  Yu, J., Guo, M., Needham, C.J., Huang, Y., Cai, L., Westhead, D.R.: Simple sequence-based kernels do not predict protein–protein interactions. Bioinformatics 26(20), 2610–2614 (2010)
-  Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Conference on Artificial Intelligence in Medicine in Europe, pp. 63–66 (2001). Springer
-  Yen, S.-J., Lee, Y.-S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3), 5718–5727 (2009)
-  Rahman, M.M., Davis, D.: Cluster based under-sampling for unbalanced cardiovascular data. In: Proceedings of the World Congress on Engineering, vol. 3, pp. 3–5 (2013)
-  Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory, pp. 23–37 (1995). Springer
-  Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT press (2012)
-  Powers, D.M.: Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation (2011)
-  Cao, D.-S., Liu, S., Xu, Q.-S., Lu, H.-M., Huang, J.-H., Hu, Q.-N., Liang, Y.-Z.: Large-scale prediction of drug–target interactions using protein sequences and drug topological structures. Analytica chimica acta 752, 1–10 (2012)
-  Friedman, J.H.: On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data mining and knowledge discovery 1(1), 55–77 (1997)
-  Efron, B., Gong, G.: A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37(1), 36–48 (1983)
-  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
-  Ho, T.K.: The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence 20(8), 832–844 (1998)
-  Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
-  Nanni, L., Lumini, A., Brahnam, S.: A set of descriptors for identifying the protein–drug interaction in cellular networking. Journal of theoretical biology 359, 120–128 (2014)
Classifier auROC Analysis. Receiver operating characteristic curves of different classifier algorithms using random under sampling and all the feature combinations on four datasets: (a) enzymes (b) ion channels (c) GPCRs (d) nuclear receptors.
|Feature Group||Number of Features||Reference||Group|
|Molecular fingerprint||881||||A|
|Secondary Structure Composition||3||This paper||B|
|Accessible Surface Area Composition||1||This paper||B|
|Torsional Angles Composition||8||This paper||B|
|Torsional Angles Auto-Covariance||80||This paper||C|
|Structural Probabilities Auto-Covariance||30||This paper||C|
|Torsional Angles bigram||64||This paper||D|
|Structural Probabilities bigram||9||This paper||D|
|Max||Min sample||Min samples|
|ion channels||8||4||1||Gini impurity|
|nuclear receptors||5||7||2||Gini impurity|
|ion channels||9||2||1||Gini impurity|
|nuclear receptors||150||2||1||Gini impurity|
|Dataset||Feature Combination||Balancing Method||auPR||auROC|
|Predictor||enzymes||ion channels||GPCRs||nuclear receptors|
|Mousavian et al. ||0.546||0.390||0.282||0.411|
|dataset||Feature Group||Specificity||Sensitivity||Precision||MCC||F1 score|
Additional file 1 — AdditionalFile1.xlsx
AdditionalFile1.xlsx contains the new drug-target predictions for each of the datasets. For each dataset, the top 10 predictions among the false negatives, ranked by probability score, are reported.