Predicting Drug-Drug Interactions from Heterogeneous Data: An Embedding Approach

by Devendra Singh Dhami, et al.

Predicting and discovering drug-drug interactions (DDIs) using machine learning has been studied extensively. However, most approaches have focused on text data or textual representations of drug structures. We present the first work that uses multiple data sources, such as drug structure images, string representations of drug structures and relational representations of drug relationships, as input. To this end, we exploit recent advances in deep networks to integrate these varied input sources for predicting DDIs. Our empirical evaluation against several state-of-the-art methods, each using a single data type for drugs, demonstrates the efficacy of combining heterogeneous data in predicting DDIs.








1 Introduction

Adverse drug events (ADEs) are “injuries resulting from medical intervention related to a drug” [34], and are distinct from medication errors (inappropriate prescription, dispensing, usage etc.). ADEs can account for as many as one-third of hospital-related complications, affect up to 2 million hospital stays annually, and prolong hospital stays by 2–5 days [17]. It has recently been observed that many of these ADEs can be attributed to very common medications [9] and that many are preventable [21] or ameliorable [18].

We focus on the specific problem of drug-drug interactions (DDIs), which are an important type of ADE and can potentially result in healthcare overload or even death [4]. An ADE is characterized as a DDI when multiple medications are co-administered and cause an adverse effect on the patient. Predicting and discovering DDIs is an important problem and has been studied extensively from both the medical and the machine learning points of view. Identifying DDIs is an important task during drug design and testing, and several regulatory agencies require large controlled clinical trials before approval. Beyond their expense and time-consuming nature, it is impossible to discover all possible interactions during such clinical trials. This necessitates computational methods for DDI prediction. A substantial amount of work on DDIs focuses on homogeneous data types such as text [31, 11], textual representations of the structural data of drugs [20, 3] and genetic data [38]. Recent approaches consider phenotypic, therapeutic, structural, genomic and reactive drug properties [13] or their combinations [16] to characterize drug interactivity, but this type of information only serves to extract in vivo/in vitro discoveries.

Our goal is to predict DDIs in large drug databases by exploiting heterogeneous data types of the drugs and identifying patterns in drug interaction behaviors. We take a novel perspective on DDI prediction by combining heterogeneous data representations of the drug structures, such as images, string representations and relations with proteins. While in principle multi-view learning methods such as co-training [7] or multiple kernel learning [19] can be used, these methods assume that each view independently provides enough information for classification, whereas we assume that each of these data sources provides only a weak prediction of DDI. While it is possible to combine the data sources directly, standardization can be a major bottleneck. We take an embedding-based approach to achieve the combination.

We make the following contributions: (1) we combine heterogeneous data types representing drug structures for DDI prediction; (2) we create embeddings to build a DDI prediction engine that can be integrated into a drug database seamlessly; (3) we show that using heterogeneous data types is more informative than using homogeneous data types.

2 Related Work

While DDIs have long been explored from a medical perspective [30, 23, 4, 5] and from social and economic perspectives [2, 41], we take a machine learning approach to this task.

Classically, the task of DDI discovery/prediction is modeled as a pairwise classification task. Kernel-based methods [42] are therefore a natural fit, since kernels are well suited to representing pairwise similarities. Most similarity-based methods for DDI discovery/prediction construct NLP-based kernels from literature data [40, 15]. A different direction is to learn kernels from different types of data, such as molecular and structural properties of the drugs, and then use these multiple kernels to predict DDIs [13, 16]. Recently, embeddings have been employed for learning from a single data source [37, 10]. Siamese networks have been applied in one-shot image recognition [28], signature verification [8], medical question retrieval [43] and Alzheimer's disease diagnosis [1]. For DDI, Siamese graph convolutional networks have been developed [12, 26]. Most of this work on DDI classification considers a homogeneous data source. Even when heterogeneous data sources are considered, the methods tend to compute similarity scores between drugs and then threshold the obtained scores for prediction. An important limitation is the exclusion of drug structure images as a type of data. Our work can be seen as the first generalization of these methods, in which we consider multiple data sources including images and combine them through embeddings.

3 Embeddings using Heterogeneous Data Sources

We consider 3 different types of data: (1) images of drug structures, (2) SMILES (Simplified Molecular Input Line Entry System) string [45] representations of drug structures and (3) relational representations of various associations between the drugs and proteins (targets, transporters and enzymes). Figure 1 shows the overall architecture of our approach. We now discuss the different components.

Figure 1: Overview of architecture for predicting DDIs using heterogeneous data types.

3.1 Drug Structure Image Embeddings

A discriminative approach for learning a similarity metric using a Siamese architecture [14] maps the input (a pair of images in our case) into a target space. The intuition is that the distance between the mappings in the target space is minimized for similar pairs of examples and maximized for dissimilar pairs. We adapt the Siamese architecture to the task of generating an embedding for each drug image. It consists of two identical sub-networks, i.e. networks with the same configuration, parameters and weights. Each sub-network takes a fixed-size gray-scale image as input (we convert colored images to gray-scale) and consists of a stack of convolutional layers, each with its own number of filters. The kernel size for each convolutional layer is ($9 \times 9$) and the activation function is relu, a non-linear activation function given by $\mathrm{relu}(x) = \max(0, x)$. Each convolutional layer is followed by a max-pooling layer and a batch normalization layer. After the convolutional layers, the sub-network has fully connected layers. Each drug pair is used to train the Siamese network, and the learned parameters are used to generate a fixed-dimension embedding for each drug image.
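The weight sharing at the heart of a Siamese architecture can be illustrated with a minimal sketch. It is not the network described above: a single dense relu layer stands in for the convolutional stack, and the weights and 4-pixel "images" are hypothetical values chosen only to show that both inputs pass through the same mapping before their distance is measured.

```python
import math

def relu(x):
    """relu(x) = max(0, x)."""
    return max(0.0, x)

def embed(image, weights):
    """Map a flattened image to an embedding with one dense relu layer.

    Simplification: the paper's sub-network uses convolutional,
    max-pooling and batch normalization layers instead.
    """
    return [relu(sum(w * p for w, p in zip(row, image))) for row in weights]

def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical shared weights: 2 output units over 4 pixels.
shared_weights = [[0.5, -0.25, 0.0, 1.0],
                  [-1.0, 0.5, 0.25, 0.0]]

# Hypothetical flattened gray-scale images, already scaled to [0, 1].
img_a = [0.0, 1.0, 1.0, 0.0]
img_b = [0.0, 1.0, 1.0, 0.0]   # identical to img_a
img_c = [1.0, 0.0, 0.0, 1.0]   # different structure

# Both branches use the same weights, which is what makes the network Siamese.
emb_a = embed(img_a, shared_weights)
emb_b = embed(img_b, shared_weights)
emb_c = embed(img_c, shared_weights)

same_distance = euclidean(emb_a, emb_b)
diff_distance = euclidean(emb_a, emb_c)
print(same_distance, diff_distance)
```

Because the two branches share parameters, identical inputs always map to identical embeddings, so their distance is zero regardless of the weight values.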

Note that the convolutions in the convolutional sub-network provide translational invariance. However, rotational invariance is also crucial, since isomers (chiral forms) of a drug are expected to react differently when interacting with a given drug [35]. For example, Fenfluramine and Dexfenfluramine are isomers of each other, but Fenfluramine interacts with Acebutolol while Dexfenfluramine does not. Thus, to introduce rotational invariance, we use spatial transformer networks (STN) [25], which consist of three basic building blocks: a localisation network, a grid generator and a sampler. The STN can be used as a pre-processing step before feeding the input image pair into our underlying architecture (Figure 2).

Figure 2: Using a spatial transformer network as a pre-processing step to mitigate rotational variance. Note that this process is applied to both input images.

3.2 Relational Data Embeddings

DDIs can be considered as the characterization of the relationships between the drugs and the various proteins (enzymes, transporters etc.) using ADMET (absorption, distribution, metabolism, excretion and toxicity) features. A natural representation for such data is using first-order logic and the rules can then be induced. Some example facts (features) in our knowledgebase are:

Using the given facts and the positive and negative examples, we learn a relational regression tree (RRT) [6] in which every path from the root to a leaf can be interpreted as a first-order rule. The obtained first-order rules are first partially grounded with the query drug pairs and then completely grounded using the fact set. The number of satisfied groundings for each drug pair is then counted to obtain the final embeddings, as shown in Figure 3.
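The grounding-and-counting step can be sketched as follows. The predicates, drug names, protein identifiers and the two rules are hypothetical placeholders, not facts or rules from the paper's knowledge base, and the rules are written by hand here rather than read off a learned RRT.

```python
from itertools import product

# Hypothetical fact set: (predicate, drug, protein).
facts = {
    ("Target", "drugA", "p1"),
    ("Target", "drugB", "p1"),
    ("Target", "drugB", "p2"),
    ("Target", "drugC", "p2"),
    ("Enzyme", "drugA", "e1"),
    ("Enzyme", "drugB", "e1"),
    ("Enzyme", "drugA", "e2"),
    ("Enzyme", "drugB", "e2"),
}

# Hypothetical first-order rule bodies. D1 and D2 are the query drugs;
# any other argument is a free variable to be grounded from the facts.
rules = [
    [("Target", "D1", "P"), ("Target", "D2", "P")],   # shared target
    [("Enzyme", "D1", "E"), ("Enzyme", "D2", "E")],   # shared enzyme
]

def count_groundings(rule, d1, d2, facts):
    """Count the groundings of `rule` satisfied for the pair (d1, d2)."""
    # Partial grounding: bind the query drug pair.
    binding = {"D1": d1, "D2": d2}
    free_vars = sorted({arg for _, *args in rule
                        for arg in args if arg not in binding})
    # Complete grounding: try every constant seen as a fact's last argument.
    constants = sorted({obj for _, _, obj in facts})
    count = 0
    for values in product(constants, repeat=len(free_vars)):
        full = {**binding, **dict(zip(free_vars, values))}
        if all((pred, *[full[a] for a in args]) in facts
               for pred, *args in rule):
            count += 1
    return count

def relational_embedding(d1, d2):
    """One count per rule, so the embedding size equals the number of rules."""
    return [count_groundings(rule, d1, d2, facts) for rule in rules]

print(relational_embedding("drugA", "drugB"))
print(relational_embedding("drugA", "drugC"))
```

Each position of the resulting vector corresponds to one rule, which is why the embedding length depends on how many rules the tree yields.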

Figure 3: Embedding creation from relational data.

3.3 Drug Structure SMILES Strings Embeddings

SMILES strings represent the drug structure in the form of a simple textual representation. For example, Fluvoxamine can be represented as COCCCCC(=NOCCN). We use the existing model of SMILESVec [36], which divides the SMILES string into several interacting sub-structures and then uses the word2vec method [32] to generate embeddings for these sub-structures. These embeddings are combined to generate the final embedding of the drugs.
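The general idea of splitting a SMILES string into sub-structures and combining their vectors can be sketched as below. This is an illustration under stated assumptions, not the SMILESVec implementation: the sub-structures are taken to be overlapping fixed-length substrings (the window length of 8 is a parameter chosen for the sketch), the combination is a plain average, and `toy_word_vector` is a hypothetical hash-based stand-in for vectors that word2vec would learn from a corpus.

```python
import hashlib

def smiles_words(smiles, length=8):
    """Split a SMILES string into overlapping substrings of a fixed length."""
    if len(smiles) <= length:
        return [smiles]
    return [smiles[i:i + length] for i in range(len(smiles) - length + 1)]

def toy_word_vector(word, dim=4):
    """Deterministic placeholder vector; NOT a trained word2vec embedding."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]

def smiles_embedding(smiles, length=8, dim=4):
    """Average the vectors of all substrings to get one drug embedding."""
    vectors = [toy_word_vector(w, dim) for w in smiles_words(smiles, length)]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# The SMILES fragment quoted in the text for Fluvoxamine.
fragment = "COCCCCC(=NOCCN)"
words = smiles_words(fragment)
embedding = smiles_embedding(fragment)
print(words)
print(embedding)
```

With a trained model, `toy_word_vector` would be replaced by a lookup into the learned substring vectors; the splitting and averaging steps would stay the same.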

3.4 Combining Embeddings of Heterogeneous Data

After the embeddings of all 3 heterogeneous data types are obtained as described above, they need to be aggregated to generate a lower-level representation. In the case of the image and SMILES string embeddings (both of the same size), we hypothesize that the more similar the structures of the drugs, the higher the probability of their interaction. To capture this notion of similarity, we use subtraction as the aggregation function to obtain 2 sets of embeddings, one for the image data and one for the SMILES string data. These 2 sets are then averaged to obtain a single set of embeddings of the same size.

Each relational embedding represents the counts of the satisfied groundings of the query, which in our case is the interaction between a pair of drugs; its size is the number of first-order rules learned using the relational regression trees. The relational embeddings are concatenated with the combined embeddings obtained from the SMILES and image data to yield the final embedding, which can then be passed to a machine learning classifier. We choose a neural network since it is a universal approximator, can handle a large number of features and inherently learns aggregated latent features in its hidden layers. The overall architecture is presented in Figure 1.
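The aggregation pipeline can be written out as a short sketch. It assumes that subtraction is applied element-wise between the embeddings of the two drugs in a pair, which is one reading of the description above; the vectors and rule counts are hypothetical toy values.

```python
def subtract(u, v):
    """Element-wise difference of two equally sized embeddings."""
    return [a - b for a, b in zip(u, v)]

def average(u, v):
    """Element-wise mean of two equally sized embeddings."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def combine(img1, img2, smi1, smi2, relational_counts):
    """Build the final pair embedding from the three data sources."""
    image_diff = subtract(img1, img2)             # image embeddings of the pair
    smiles_diff = subtract(smi1, smi2)            # SMILES embeddings of the pair
    structural = average(image_diff, smiles_diff) # same size as each input
    # Concatenate with the relational grounding counts (one per rule).
    return structural + list(relational_counts)

# Hypothetical embeddings for two drugs (size 3) and counts for 2 rules.
img_d1, img_d2 = [0.5, 1.0, 0.0], [0.25, 0.5, 1.0]
smi_d1, smi_d2 = [1.0, 0.0, 0.5], [0.75, 0.5, 0.5]
rule_counts = [1, 2]

final = combine(img_d1, img_d2, smi_d1, smi_d2, rule_counts)
print(final)  # length = structural size + number of rules
```

The final vector length is the size of the structural embedding plus the number of learned rules, which is the input dimension the downstream classifier has to accept.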


4 Empirical Evaluation

We aim to answer the following questions: Q1. Does using multiple data sources give an advantage over using a single data source? Q2. Does using STN in the Siamese neural network give better results? Q3. What is the effect of the aggregation functions? Q4. Is the classification performance sensitive to the choice of classifier? Q5. Does the size of hidden layers and different activation function in the neural classifier affect the performance?

Data set(s): Our image data set consists of images of drugs downloaded from the PubChem database and converted to grayscale. The images are then normalized by the maximum pixel value. The SMILES strings of these drugs are obtained from PubChem and DrugBank. For the relational data, we extract the different relations of the drugs with the proteins from DrugBank and convert them to a relational format. From the drugs we create drug interaction pairs, excluding reciprocal pairs (i.e. if one drug interacts with another, the same pair in reverse order is removed). The resulting drug pairs include both pairs that interact and pairs that do not.

Baselines: We consider baselines based on the different modalities to compare against the results from our architecture.

Structural Similarity Index (SSIM) [44] is used for measuring perceptual similarity between images and is calculated as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu_x$ and $\mu_y$ are the averages of the images $x$ and $y$ respectively, $\sigma_x^2$ and $\sigma_y^2$ are the variances of the images $x$ and $y$ respectively, and $\sigma_{xy}$ is the covariance of the two input images. The constants $C_1$ and $C_2$ are added to the SSIM to avoid instability. To obtain the predictions, the threshold is set to the mean SSIM value over all pairs.

Autoencoders [29] are neural networks with an encoder that extracts features from the input images and a decoder that restores the original images from the extracted features. The autoencoder is trained with binary cross-entropy loss, and the encoder then extracts features of the testing images. To find images with similar extracted features, two criteria were used: binary cross-entropy and cosine proximity. The threshold to decide similarity of images is the mean of the values calculated for all pairs of testing images.

CASTER [24] identifies the frequent substrings present in the SMILES strings using a sequential pattern mining algorithm. These are converted to an embedded representation by an encoder module to obtain a set of latent features, which are then converted into linear coefficients and passed through a decoder and a predictor to obtain the DDI predictions.

Siamese neural networks with and without STNs, trained using contrastive loss [22] based on a Euclidean distance, are also used as baselines. An interaction is predicted by thresholding the distance between images at 0.65 (obtained using AUC-PR curves).

RDN-Boost [33] takes an initial model (RRT) and uses the obtained predictions to compute gradients (residues). A new regression function is then learned to fit the residues and the model is updated. At the end, the sum of all the obtained regression functions gives the final model. MLN-Boost [27] boosts undirected Markov logic networks (MLNs) [39] using an approximation of the likelihood.
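The SSIM baseline can be sketched in a few lines. The version below computes the index from whole-image statistics rather than local windows, uses the constants conventionally chosen for images scaled to [0, 1] (C1 = 0.01², C2 = 0.03²), and predicts an interaction when a pair's SSIM is at or above the mean. All three choices are assumptions made for the illustration, and the pixel values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM of two flattened images with pixel values in [0, 1].

    SSIM = (2*mu_x*mu_y + C1) * (2*sigma_xy + C2)
           / ((mu_x^2 + mu_y^2 + C1) * (sigma_x^2 + sigma_y^2 + C2))
    """
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean([(a - mu_x) ** 2 for a in x])
    var_y = mean([(b - mu_y) ** 2 for b in y])
    cov_xy = mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator

def predict_by_mean_threshold(scores):
    """Threshold at the mean score; at-or-above counts as an interaction.

    The direction of the comparison is an assumption of this sketch.
    """
    threshold = mean(scores)
    return [score >= threshold for score in scores]

# Hypothetical flattened images.
img_a = [0.0, 0.5, 1.0, 0.5]
img_b = [1.0, 0.5, 0.0, 0.5]   # img_a with inverted contrast

scores = [ssim(img_a, img_a), ssim(img_a, img_b)]
predictions = predict_by_mean_threshold(scores)
print(scores, predictions)
```

An image compared with itself scores 1, while the contrast-inverted image has negative covariance with the original and therefore a negative score.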

Results: We optimize the Siamese network using the Adam optimizer, with the learning rate obtained using line search. We use the publicly available implementation of the SMILESVec method with default parameters. To learn the RRT, we use the publicly available software BoostSRL with the “-noBoost” parameter. For the classifier in our architecture we use a neural network with 4 hidden layers of sizes 1000, 500, 200 and 50, with relu activation units and the Adam optimizer. Table 1 shows the performance of our method with respect to various baselines.

Methods                 Accuracy  Recall  Precision  F1 score
SSIM                    0.519     0.487   0.304      0.374
Autoencoder             0.354     0.911   0.303      0.454
Siamese Network         0.837     0.780   0.705      0.741
Siamese Network + STN   0.823     0.825   0.661      0.734
CASTER                  0.821     0.663   0.736      0.698
RDN-BOOST               0.773     0.832   0.413      0.552
MLN-BOOST               0.767     0.653   0.540      0.592
Our Method (agg=avg)    0.877     0.769   0.805      0.787
Our Method (agg=sub)    0.884     0.781   0.818      0.799
Our Method (with STN)   0.881     0.779   0.811      0.794
Table 1: Comparison of our method with baselines. The first 4 methods use images as input, CASTER uses SMILES strings and the next 2 use relational data.
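The F1 column of Table 1 can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes F1 from the rounded table values; small differences in the third decimal place are expected because the inputs are themselves rounded.

```python
# (recall, precision, reported F1) copied from Table 1.
table = {
    "SSIM": (0.487, 0.304, 0.374),
    "Autoencoder": (0.911, 0.303, 0.454),
    "Siamese Network": (0.780, 0.705, 0.741),
    "Siamese Network + STN": (0.825, 0.661, 0.734),
    "CASTER": (0.663, 0.736, 0.698),
    "RDN-BOOST": (0.832, 0.413, 0.552),
    "MLN-BOOST": (0.653, 0.540, 0.592),
    "Our Method (agg=avg)": (0.769, 0.805, 0.787),
    "Our Method (agg=sub)": (0.781, 0.818, 0.799),
    "Our Method (with STN)": (0.779, 0.811, 0.794),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed = {name: f1_score(p, r) for name, (r, p, _) in table.items()}

# Largest absolute gap between recomputed and reported F1.
max_gap = max(abs(recomputed[name] - f1) for name, (_, _, f1) in table.items())

for name, value in recomputed.items():
    print(f"{name:24s} {value:.3f}")
print(f"largest gap: {max_gap:.4f}")
```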

(Q1) Advantage of Heterogeneous data. To demonstrate the effectiveness of using heterogeneous data, we compare with methods that use homogeneous data. To that end, the first 4 baselines consider the image data, CASTER uses the SMILES strings data, and RDN-Boost and MLN-Boost use the relational data. The results show that combining embeddings from heterogeneous data sources outperforms every method that uses a single data source on accuracy, precision and F1 score, thus answering Q1 affirmatively.

(Q2) Effect of STN. Table 1 also shows the result of using STN while generating the image embeddings before aggregation. The results deviate little from those obtained without STNs. This answers Q2: using STN as a pre-processing step to generate image embeddings does not have a significant effect.

(Q3) Effect of Aggregation Functions. As shown in Table 1, the subtraction aggregation for calculating the set of image and SMILES embeddings performs better than the average aggregation. As mentioned before, these embeddings represent similarity information, which subtraction captures. The difference is small, however (an F1 score of 0.799 versus 0.787). This answers Q3.

Figure 4: Effect of classifier choice
Figure 5: Effect of number of layers
Figure 6: Effect of activation function

(Q4) Effect of Chosen Classifier. Figure 4 shows the effect of classifier choice on the performance of our architecture. The results show that neural networks, which are universal approximators, significantly outperform the simpler logistic regression and gradient boosting classifiers. This is also due to the fact that we have a large number of features in our lower-level learned feature representations. This answers Q4.


(Q5) Effect of Number of Neural Network Classifier Layers. Figures 5 and 6 show the effect of the number of layers and the choice of activation function in the final neural network classifier on the model performance. As the results show, the number of layers does not have much effect on the performance, but the activation function “relu” outperforms “tanh” because its gradient does not saturate, which accelerates the convergence of stochastic gradient descent (SGD). We also used simple SGD as the optimizer, but the neural network did not converge. This answers Q5.
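The saturation argument can be made concrete by comparing the derivatives of the two activation functions. The sketch below evaluates them at a few hypothetical pre-activation values; it illustrates the general property only and does not reproduce the experiment.

```python
import math

def relu_grad(x):
    """Derivative of relu(x) = max(0, x) for x != 0."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh(x), which is 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Hypothetical positive pre-activation values of increasing magnitude.
inputs = [0.5, 2.0, 5.0]
relu_grads = [relu_grad(x) for x in inputs]
tanh_grads = [tanh_grad(x) for x in inputs]

for x, rg, tg in zip(inputs, relu_grads, tanh_grads):
    print(f"x={x:4.1f}  relu'={rg:.1f}  tanh'={tg:.6f}")
```

For positive inputs the relu derivative stays at 1, whereas the tanh derivative shrinks towards 0 as the input grows, so gradient updates flowing through saturated tanh units become very small.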


5 Conclusion and Future Work

We considered the challenging task of predicting DDIs from multiple sources. To this end, we combined the data using embeddings created from images, SMILES strings of drug structures, and relationships between drugs. We presented an architecture that significantly outperforms strong baselines that learn from a single type of data.

More rigorous evaluation using larger data sets is an interesting direction. Identifying potentially novel DDIs is an exciting direction for future research. Incorporating domain experts' knowledge could significantly boost the performance of the architecture, and this can be achieved by treating the knowledge as constraints during learning. Finally, understanding how explanations of these interactions can be extracted from the embeddings remains an interesting future direction.

6 Acknowledgements


  • [1] K. Aderghal, J. Benois-Pineau, and K. Afdel (2017) Classification of smri for alzheimer’s disease diagnosis with cnn: single siamese networks with 2d+ε approach and fusion on adni. In ACM ICMR.
  • [2] R. J. Arnold, J. Tang, J. Schrecker, and C. Hild (2018) Impact of definitive drug–drug interaction testing on medication management and patient care. Drugs-real world outcomes.
  • [3] M. Asada, M. Miwa, and Y. Sasaki (2018) Enhancing drug-drug interaction extraction from texts by molecular structure information. arXiv preprint arXiv:1805.05593.
  • [4] M. L. Becker, M. Kallewaard, P. W. Caspers, L. E. Visser, H. G. Leufkens, and B. H. Stricker (2007) Hospitalisations and emergency department visits due to drug–drug interactions: a literature review. Pharmacoepidemiology and drug safety.
  • [5] I. K. Björkman, J. Fastbom, et al. (2002) Drug-drug interactions in the elderly. Annals of Pharmacotherapy.
  • [6] H. Blockeel and L. De Raedt (1998) Top-down induction of first-order logical decision trees. Artificial intelligence.
  • [7] A. Blum and T. Mitchell (1998) Combining labeled and unlabeled data with co-training. In COLT.
  • [8] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a “siamese” time delay neural network. In NIPS.
  • [9] D. S. Budnitz, M. C. Lovegrove, N. Shehab, and C. L. Richards (2011) Emergency hospitalizations for adverse drug events in older americans. NEJM.
  • [10] R. Celebi, H. Uyar, E. Yasar, O. Gumus, O. Dikenelli, and M. Dumontier (2019) Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC bioinformatics.
  • [11] B. W. Chee, R. Berlin, and B. Schatz (2011) Predicting adverse drug events from personal health messages. In AMIA Annual Symposium Proceedings.
  • [12] X. Chen, X. Liu, and J. Wu (2019) Drug-drug interaction prediction with graph representation learning. In BIBM.
  • [13] F. Cheng and Z. Zhao (2014) Machine learning-based prediction of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. Journal of the American Medical Informatics Association.
  • [14] S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1).
  • [15] M. F. M. Chowdhury and A. Lavelli (2013) FBK-irst: a multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In SEM.
  • [16] D. S. Dhami, G. Kunapuli, M. Das, D. Page, and S. Natarajan (2018) Drug-drug interaction discovery: kernel learning from heterogeneous similarities. Smart Health.
  • [17] DHHS (2010) Adverse events in hospitals: national incidence among medicare beneficiaries.
  • [18] A. J. Forster, H. J. Murff, J. F. Peterson, T. K. Gandhi, and D. W. Bates (2005) Adverse drug events occurring following hospital discharge. J Gen Intern Med.
  • [19] M. Gönen and E. Alpaydın (2011) Multiple kernel learning algorithms. JMLR.
  • [20] H. Gurulingappa, A. Mateen-Rajput, and L. Toldo (2012) Extraction of potential adverse drug events from medical case reports. Journal of biomedical semantics.
  • [21] J. H. Gurwitz et al. (2003) Incidence and preventability of adverse drug events among older persons in the ambulatory setting. JAMA.
  • [22] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR.
  • [23] M. Hirano, K. Maeda, Y. Shitara, and Y. Sugiyama (2006) Drug-drug interaction between pitavastatin and various drugs via oatp1b1. DMD.
  • [24] K. Huang, C. Xiao, T. N. Hoang, L. M. Glass, and J. Sun (2020) CASTER: predicting drug interactions with chemical substructure representation. AAAI.
  • [25] M. Jaderberg, K. Simonyan, et al. (2015) Spatial transformer networks. In NIPS.
  • [26] M. Jeon, D. Park, et al. (2019) ReSimNet: drug response similarity prediction using siamese neural networks. Bioinformatics.
  • [27] T. Khot, S. Natarajan, K. Kersting, and J. Shavlik (2011) Learning markov logic networks via functional gradient boosting. In ICDM.
  • [28] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop.
  • [29] M. A. Kramer (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE journal.
  • [30] W. C. Lau, L. A. Waskell, et al. (2003) Atorvastatin reduces the ability of clopidogrel to inhibit platelet aggregation: a new drug–drug interaction. Circulation.
  • [31] X. Liu and H. Chen (2013) AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In ICSH.
  • [32] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [33] S. Natarajan, T. Khot, et al. (2012) Gradient-based boosting for statistical relational learning: the relational dependency network case. Machine Learning.
  • [34] J. R. Nebeker, P. Barach, and M. H. Samore (2004) Clarifying adverse drug events: a clinician’s guide to terminology, documentation and reporting. Ann. Intern. Med.
  • [35] L. A. Nguyen, H. He, and C. Pham-Huy (2006) Chiral drugs: an overview. International journal of biomedical science: IJBS.
  • [36] H. Öztürk, E. Ozkirimli, and A. Özgür (2018) A novel methodology on distributed representations of proteins using their interacting ligands. Bioinformatics.
  • [37] S. Purkayastha, I. Mondal, et al. (2019) Drug-drug interactions prediction based on drug embedding and graph auto-encoder. In BIBE.
  • [38] S. Qian, S. Liang, and H. Yu (2019) Leveraging genetic interactions for adverse drug-drug interaction prediction. PLoS computational biology.
  • [39] M. Richardson and P. Domingos (2006) Markov logic networks. Machine learning.
  • [40] I. Segura-Bedmar, P. Martinez, and C. de Pablo-Sánchez (2011) Using a shallow linguistic kernel for drug–drug interaction extraction. J. Biomed. Inform.
  • [41] M. U. Shad, C. Marsh, and S. H. Preskorn (2001) The economic consequences of a drug-drug interaction. Journal of clinical psychopharmacology.
  • [42] J. Shawe-Taylor and N. Cristianini (2004) Kernel methods for pattern analysis.
  • [43] K. Wang, B. Yang, G. Xu, and X. He (2019) Medical question retrieval based on siamese neural network and transfer learning method. In DASFAA.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.
  • [45] D. Weininger (1988) SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. J Chem Inform Comput Sci.