Drug-drug interactions (DDIs) are often preventable causes of medical injuries that occur when a drug causes a pharmacokinetic (PK) or pharmacodynamic (PD) effect on the body when taken together with another drug (Wang, 2017; Takeda et al., 2017). They are a common cause of adverse drug reactions (ADRs) and increased healthcare costs (Cheng and Zhao, 2014). The majority of ADRs are caused by unintended DDIs and occasionally arise through co-prescription of drugs. While it would be ideal to identify all possible DDIs during clinical trials, interactions are frequently reported only after the drugs are approved for clinical use. ADRs are a significant threat to public health: Shtar et al. found that about 6.7% of hospital readmissions in the USA in 2014 occurred because of ADRs, with a fatality rate of 0.32%. In that year, as many as 807,270 cases of serious ADRs were reported in the United States, resulting in 123,927 lost lives (Shtar et al., 2019).
For example, acetylsalicylic acid, commonly known as aspirin, is a drug used for the treatment of pain and fever due to various causes. It has both anti-inflammatory and antipyretic effects, inhibits platelet aggregation, and is used in the prevention of blood clots and myocardial infarction. However, the risk or severity of hypertension can be increased (i.e., a negative drug-drug interaction) when acetylsalicylic acid is combined with 1-benzylimidazole (Wishart et al., 2017). Predicting potential DDIs reduces unanticipated drug interactions, lowers drug development costs, and can be used to optimize the drug design process. Thus, the study of DDIs and ADRs is important in both drug development and clinical application, especially for co-administered medications. Since the majority of ADRs occur between pairs of drugs, drug pairs have become the focus of research and clinical studies (Wang, 2017).
To further reduce costs and to make the analysis of large numbers of interactions possible, automated methods for identifying ADRs are needed. Current approaches involve clinical evaluation of drugs and post-marketing surveillance. In computational approaches, features are extracted from drug properties such as targets, side-effects, chemical properties, fingerprints, and drug indications. Then statistical methods and various supervised ML algorithms, e.g., decision tree (DT), Naive Bayes (NB), k-nearest neighbors (k-NN), logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosting trees (GBT), are employed (Abdelaziz et al., 2017).
Deep learning-based approaches, which can utilize deep features, are mostly unexplored in the context of DDI prediction. While a deep architecture like a convolutional neural network (CNN) is good at reducing frequency variations by acting as a feature extractor (Zhang and Luo, 2018), a long short-term memory (LSTM) network is good at temporal modeling and learning ordered sequences from a large feature space (SHI et al., 2015). By combining these two deep architectures, the convolutional-LSTM (Conv-LSTM) can capture both locally and globally important drug features, which we found to lead to more accurate DDI predictions (SHI et al., 2015). However, the features that have traditionally been used for these approaches form either a large and sparse binary matrix or a dense but small similarity matrix, making them less than ideal for training ML models (Celebi et al., 2018). Further, although an increasing amount of data on drugs and small molecules is being generated, state-of-the-art approaches still rely on the analysis of a limited number of data sources, e.g., DrugBank.
Knowledge Graphs are a powerful tool for incorporating multiple data sources, and many biomedical knowledge bases have been published in this form. In such a graph, the nodes represent different entities like drugs, diseases, protein targets, substructures, side effects, and pathways. See, e.g., (Wang, 2017; Celebi et al., 2018) for examples of using Knowledge Graphs for DDI prediction. Once the data is in the form of a Knowledge Graph, we have to extract information from it as features for our interaction predictors. To do this, we use embedding methods, which project each node in the graph to a dense vector.
In our work, we consider more sources of DDIs than others. Scientific literature that has predicted DDIs very accurately is often ignored as ground truth in related work. In this paper, we collected DDI information from DrugBank (Wishart et al., 2017), the Kyoto Encyclopedia of Genes & Genomes (KEGG) (Kanehisa et al., 2009), TWOSIDES (Tatonetti et al., 2012a), and scientific literature. Then, we created an integrated KG using data from DrugBank, KEGG drug, PharmGKB (Whirl-Carrillo et al., 2012), and OFFSIDES (Tatonetti et al., 2012a) (excluding data already in the above-mentioned DDI data). To transform the information from this graph into a format suitable for the prediction models, we applied different KG embedding techniques. Then, we trained several ML models as baselines and also performed experiments with the Conv-LSTM model. The key contributions of this paper can be summarized as follows:
We have created a dataset with 2,898,937 drug-drug interaction pairs; we believe that this is the largest available.
We have prepared a large-scale integrated KG about DDIs with data from DrugBank, KEGG, OFFSIDES, and PharmGKB having 1.2 billion triples.
We have evaluated different KG embedding techniques with different settings to train and evaluate ML models.
We provide a comprehensive evaluation with a detailed analysis of the outcome and a comparison with state-of-the-art approaches and baseline models.
We found that a combined CNN and LSTM network called Conv-LSTM for predicting DDIs leads to the highest accuracy.
This paper is structured as follows: Section 2 discusses related work with its emerging use cases and potential limitations. Section 3 details the proposed approach, including problem formulation, data collection, KG construction, graph embeddings, network construction, and training. The results of our experiments can be found in Section 4, where we also discuss the key findings of the study. Section 5 explains the importance of the study, highlights its limitations, and discusses future work before concluding the paper. To avoid confusion, we use the terms node and drug interchangeably throughout the paper.
2. Related Work
To date, DDI prediction remains a non-trivial research problem in pharmacology, and numerous approaches have been proposed to predict novel DDIs by employing various data sources. Traditional work relies on in vitro and in vivo experiments, focuses on small sets of specific drug pairs, and is subject to laboratory limitations (Duke et al., 2012). With the emergence of available biomedical data, researchers moved the focus towards automatically populating and completing biomedical KGs using publicly available large-scale structured databases and text (Celebi et al., 2018). In this scope, the Bio2RDF project published 35 life sciences datasets as linked open data (LOD) in RDF, in which similar entities are mapped across different KGs to build large heterogeneous graphs that also contain biomedical drug-related facts. Although these approaches made numerous biomedical KGs available, the KGs often contain incomplete and inaccurate data that impede their application in the field of safe medicine development (Celebi et al., 2018). Lately, ML and text mining based approaches have been used in which pharmacological similarities of drugs serve as features and the DDI prediction task is regarded as a link prediction problem.
Different drug similarity metrics are used for inferring DDIs and their associated recommendations, in which LR is trained to calculate the maximum likelihood by using known DDIs (Gottlieb et al., 2012). Similarly, phenotypic, chemical, biological, therapeutic, structural, and genomic similarities of drugs are used for predicting DDIs (Cheng and Zhao, 2014). Other investigations used pharmacological and graph qualities between drugs (Cami et al., 2013) or drug structural similarities and interaction networks incorporating PK and PD knowledge (Takeda et al., 2017) using LR. Li et al. (Li et al., 2015) developed a Bayesian network, which combines molecular drug similarity and drug side-effect similarity to predict the combined effect of drugs. Lately, Kastrin et al. (Kastrin et al., 2018) formulated DDI prediction as a binary classification problem to predict unknown DDIs in five databases: DrugBank, KEGG, NDF-RT, SemMedDB, and Twosides. In these works, supervised ML approaches such as DT, NB, k-NN, LR, SVM, RF, and GBT are mostly used for predicting DDIs from topological and semantic similarity features. However, feature-based approaches only predict binary DDIs or those that have been pre-defined in structured databases, and they suffer from a lack of robustness caused by data sparsity and from vast computation requirements. Similarity-based approaches, in contrast, do not allow for the calculation of various similarities for many drugs due to a lack of drug information (Celebi et al., 2018).
Several other approaches have been proposed using biomedical KGs and text embedding, in which graph embedding is utilized (Wang, 2017) to overcome issues such as data incompleteness and sparsity. The learned embeddings are then mostly used for predicting DDIs. Other works have utilized text mining (Percha and Altman, 2012; Tari et al., 2010; Duke et al., 2012) to predict and evaluate new DDIs, in which either drug-related data from scientific literature was discovered from a large health information exchange repository (Duke et al., 2012) or automated reasoning was developed to derive new enzyme-based DDIs from MEDLINE abstracts (Tari et al., 2010). Despite the abundance of biomedical data characterizing drugs and their associated targets, these methods cannot effectively fuse multiple sources of information and perform inference over the network of drugs. Therefore, approaches employing KG embeddings and ML (Wang, 2017; Abdelaziz et al., 2017; Celebi et al., 2018; Hallstedt et al., Dec 2018) have emerged. In particular, approaches based on KG embeddings are powerful predictors and outperform state-of-the-art approaches for inferring new DDIs. Most of the embedding methods are translation-based; embeddings are built by treating relations as translations from the head entity to the tail entity. In these methods, the vector embeddings are created such that $\mathbf{h} \oplus \mathbf{r} \approx \mathbf{t}$, where $(h, r, t)$ is a triple of the knowledge base (i.e., the relation $r$ holds between $h$ and $t$). For this equation, there are various options for the operator $\oplus$ (see, e.g., (Feng et al., 2016; Bordes et al., 2013; Ji et al., 2015)).
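To make the translation idea concrete, the following is a minimal sketch of the additive choice used by TransE (Bordes et al., 2013), where a triple is scored by the distance between the translated head and the tail. The function name and the toy two-dimensional vectors are assumptions made for illustration and are not taken from any of the cited implementations.

```python
import math

def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail.

    A smaller distance means the triple is considered more plausible.
    """
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Toy 2-dimensional vectors, invented for illustration only.
drug = [0.1, 0.2]
has_target = [0.3, -0.1]
protein_true = [0.4, 0.1]    # lies at drug + has_target
protein_false = [-0.5, 0.9]  # lies far from drug + has_target

plausible = transe_distance(drug, has_target, protein_true)
implausible = transe_distance(drug, has_target, protein_false)
```

Other translation-based methods keep the same scoring pattern but change how the head and tail are projected before the translation is applied.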
With these ideas in mind, the DDI prediction framework Tiresias (Abdelaziz et al., 2017) was proposed, which takes various sources of drug-related data and knowledge as inputs and provides DDI predictions as outputs using an LR classifier. The process starts with the semantic integration of data into a KG describing drug attributes and relationships with various related entities such as enzymes, chemical structures, and pathways. The KG is then used to compute several similarity measures between the drugs in a scalable and distributed framework. A recent approach called PRD (Wang, 2017) was proposed for predicting DDIs, in which graph embedding techniques such as TransE, TransD, TransH, and HolE are employed to overcome the data incompleteness and sparsity issues (Abdelaziz et al., 2017). First, a large-scale drug KG is created from different sources containing biomedical texts; it is then embedded into a common low-dimensional continuous vector space in which both entities and relations are considered. The learned embeddings are subsequently used to predict DDIs using a rich DDI triple encoder (RDTE) network, in which an encoder incorporates the drug-related information to obtain the DDI relation representation. The decoder then reconstructs the embedding vector from the latent representation and is used to predict the labels for potential DDIs.
However, most translation-based embeddings are limited in their capacity to model complex and diverse objects, including important properties of relations, such as symmetric, transitive, one-to-many, many-to-one, and many-to-many relations in KGs (Feng et al., 2016). As KG embedding techniques show state-of-the-art performance, generating quality feature vectors using appropriate embedding methods plays a significant role. A more recent work (Celebi et al., 2018) employed KG embedding methods to extract feature vector representations of drugs using LOD to predict potential DDIs. That work also investigated how DDI prediction accuracy is affected by using LR, NB, and RF on a single source (the Bio2RDF DrugBank v4 dataset) with different embedding methods such as RDF2Vec, TransE, and TransD.
In these approaches, two steps are typical: extracting DDI information from biomedical texts and drug event reports using text mining (TM), and then inferring DDIs by integrating knowledge from several sources (Tatonetti et al., 2012a; Percha and Altman, 2013; Kastrin et al., 2018). However, the way DDIs are extracted and predictions are made varies across methods. Numerous approaches have extracted DDIs from biomedical text using knowledge-rich and knowledge-poor features (Hailu et al., 2013) or using lexical, syntactical, and semantic features (Bobić et al., 2013). Other approaches focus on classifying DDIs, in which an SVM is trained using drug features generated by similarity measures (Rastegar-Mojarad et al., 2013; Björne et al., 2013) or by exploiting linguistic information (Chowdhury and Lavelli, 2013). Apart from these, other approaches have employed text mining for extracting DDIs from a semantically annotated corpus of documents describing DDIs from DrugBank and MEDLINE abstracts (Herrero-Zazo et al., 2013; Segura-Bedmar et al., 2014).
3. Materials and Methods
In this section, we discuss our methods in detail, including the problem formulation, data collection and integration, KG embeddings, the Conv-LSTM network construction, and the network training with hyperparameter optimization. The last step is inferring DDI predictions. Figure 1 shows the workflow of the proposed approach.
3.1. Problem formulation
Since DDIs form a complex network in which nodes refer to drugs and links refer to potential interactions, we approach the DDI prediction task as a link prediction problem similar to Shtar et al. (Shtar et al., 2019). We are given a directed DDI KG $G = (V, E)$ in which each edge $e = (d_i, d_j) \in E$ represents an interaction between drugs $d_i$ and $d_j$. Letting $n = |V|$ denote the number of drugs, we can define the DDI matrix $Y \in \{0, 1\}^{n \times n}$ as follows:

$$Y_{ij} = \begin{cases} 1 & \text{if } (d_i, d_j) \in E \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

In eq. 1, a value of 1 for $Y_{ij}$ indicates an existing interaction between drugs $d_i$ and $d_j$. However, a value of 0 does not mean that no interaction exists; the interaction may simply not have been discovered yet (Shtar et al., 2019). Next, we proceed to DDI extraction, followed by KG construction.
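As a small illustration of the interaction matrix described above, the sketch below builds a binary matrix from a list of directed drug pairs. The function name and the drug identifiers are hypothetical and serve only to show the construction; a zero entry marks an unobserved pair, not a confirmed absence of interaction.

```python
def build_ddi_matrix(drugs, interactions):
    """Return (index, matrix) where matrix[i][j] is 1 iff (drug_i, drug_j) is a known interaction."""
    index = {drug: i for i, drug in enumerate(drugs)}
    n = len(drugs)
    matrix = [[0] * n for _ in range(n)]
    for head, tail in interactions:
        matrix[index[head]][index[tail]] = 1
    return index, matrix

# Hypothetical identifiers used purely for illustration.
drugs = ["D1", "D2", "D3"]
known_interactions = [("D1", "D2"), ("D3", "D1")]
index, ddi_matrix = build_ddi_matrix(drugs, known_interactions)
```

For the dataset sizes reported in this paper, a dense list of lists would be impractical, so a sparse representation would be the more realistic choice.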
3.2. DDIs extraction and KG construction
Since creating an integrated KG and extracting DDIs are the two most important steps in our approach, we first focus on the data and knowledge source selection. We constructed our integrated knowledge graph based on the drugs and drug-target related data from DrugBank, KEGG drug, and PharmGKB. On the other hand, OFFSIDES, TWOSIDES, and scientific literature from MEDLINE are used for finding DDI with enough evidence.
3.2.1. Data collection
At present, there is no automated method or data source available that provides complete DDI information. Moreover, the available data is spread over multiple sources. Therefore, we rely on several sources for collecting drug-related data. The DrugBank database is a bioinformatics and cheminformatics resource that combines detailed drug-related information, including chemical, pharmacological, and pharmaceutical data, with comprehensive drug target information. DrugBank (v5.1.3, April 02, 2019) contains 12,664 drug entries, including 2,588 approved small molecule drugs, 1,287 approved biotech drugs, 130 nutraceuticals, and over 6,305 experimental drugs (Wishart et al., 2017).
The PharmGKB database is a repository for genomic, molecular, and cellular phenotype data. It also contains clinical information about people who have participated in pharmacogenomics research studies and about the impact of genetic variation on drug response. PharmGKB contains data on genes, diseases, drugs, and pathways as well as detailed information on 470 genetic variants affecting drug metabolism.
The KEGG databases contain metabolic pathways that are hyperlinked to metabolite and protein/enzyme information. As of May 2019, the KEGG drug database has information on 10,979 drugs and 501,689 DDI relations. Finally, the OFFSIDES database, which contains drug effects mined from adverse event reports based on PharmGKB (Kastrin et al., 2018; Percha and Altman, 2013), reports 438,802 drug side-effects. From these sources, we create two datasets: i) the DDI dataset, which contains drug-drug interaction pairs, and ii) the knowledge graph, which we will later use as background knowledge for interactions. The latter does not contain any explicit information about the interactions.
3.2.2. DDI extraction
We employ a semi-supervised approach for extracting DDIs from the sources mentioned above. We parsed the DDI information from the XML file provided by DrugBank and compiled an edge list of drug identifier combinations, which gives us 2,641,889 pairwise DDIs and 2,630,796 unique DDIs spanning 12,112 drugs. Although the KEGG drug database covers 10,979 drugs and 501,689 DDI relations, mapping to DrugBank identifiers (IDs) results in only 58,205 interactions because of missing mappings.
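The mapping and deduplication step can be sketched as follows. This is an illustrative reading of the procedure rather than the actual extraction script: the function name and identifiers are hypothetical, and treating a pair as unordered when counting unique interactions is an assumption made here.

```python
def map_and_deduplicate(pairs, id_map):
    """Map source identifiers to target identifiers and keep each unordered pair once.

    Pairs with a missing mapping are dropped, which is why the number of
    mapped interactions can be much smaller than the number in the source.
    """
    unique = set()
    for first, second in pairs:
        mapped_first = id_map.get(first)
        mapped_second = id_map.get(second)
        if mapped_first is None or mapped_second is None:
            continue  # no mapping available for at least one drug
        if mapped_first == mapped_second:
            continue  # ignore self-pairs created by many-to-one mappings
        unique.add(tuple(sorted((mapped_first, mapped_second))))  # assumption: unordered pairs
    return sorted(unique)

# Hypothetical identifiers; "K4" deliberately has no mapping.
id_map = {"K1": "DB_A", "K2": "DB_B", "K3": "DB_C"}
source_pairs = [("K1", "K2"), ("K2", "K1"), ("K1", "K4"), ("K3", "K1")]
mapped_pairs = map_and_deduplicate(source_pairs, id_map)
```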
Data from TWOSIDES (Tatonetti et al., 2012b), which is a comprehensive source of polypharmacy ADRs is also used, but interactions are restricted to those that cannot be ascribed unambiguously to either drug alone. Therefore, a collection of the drug pairs for the interacting compounds built in literature (Kastrin et al., 2018) is used in which the PubChem IDs are used to map TWOSIDES IDs to DrugBank IDs. This way, we obtained a list of 19,020 DDIs between 351 compounds and 63,473 distinct pairwise DDIs between 645 drugs.
Dhami et al. (Dhami et al., 2018) have identified that a few DDIs reported in the DrugBank dataset have little evidence of actually interacting. We relied on multi-source evidence for those drug pairs and removed the contradictory ones from the DrugBank DDI list. Next, Zhang et al. (Zhang et al., 2015) reported 145,068 DDIs (https://astro.temple.edu/~tua87106/ddi.html) based on label propagation prediction using clinical side-effects. We added these interactions to our dataset. Finally, Sridhar et al. (Sridhar et al., 2016) listed the ten top-ranked predictions for interactions unknown in DrugBank using their PSL model; these pairs were also added to our dataset.
Besides these, we also incorporate the interactions from the 227 MEDLINE abstracts of the DDI corpus (Herrero-Zazo et al., 2013; Segura-Bedmar et al., 2014). This contributed 327 DDIs based on 1,826 pharmacological substances. Additionally, some abstracts that are not listed in the DDI corpus are used, e.g., (Dhami et al., 2018; Sridhar et al., 2016; Zhang et al., 2017, 2015). For these, the annotation guidelines developed by domain experts (http://hulat.inf.uc3m.es/DrugDDI/annotation_guidelines_ddi_corpus.pdf) were used. A certified pharmacist verified these annotations. The overall DDI dataset consists of information from all these sources combined. An overview of the process is shown in fig. 2, and statistics are collected in table 1, where it should be noted that duplicates between the data sources are removed to obtain the final number of interactions (2,898,937).
3.2.3. KGs construction and integration
To create our integrated knowledge graph, we used data from DrugBank, KEGG, OFFSIDES, and PharmGKB. Although PharmGKB does not contain DDI information, it publishes lexicons of known drug names and synonyms as well as gene and disease terms and data on genetic pathways. The dataset is directed at clinicians and researchers. Previously, Bio2RDF (Belleau et al., 2008) created a large RDF graph that interlinks data from major databases containing biological entities such as drugs, proteins, pathways, and diseases. Using that data would have been an option, but the latest version (i.e., v4.0) is already rather outdated. Instead, we collected the raw DrugBank, KEGG drug, PharmGKB, and OFFSIDES data from the respective portals and converted them into RDF using a modified version of the Bio2RDF scripts (https://github.com/rezacsedu/DDI-prediction-KG-embeddings-Conv-LSTM/scripts). Each RDF KG was then uploaded to a Blazegraph RDF triplestore as a named graph (Zenodo download link: https://zenodo.org/deposit/3270566). Then, similar to the literature (Wang, 2017), federated SPARQL queries are executed based on the ‘billion triples benchmark’ (Saleem et al., 2018) to extract selected triples. For our dataset, five types of drug-related entities, namely drugs, genes, proteins, pathways and enzymes, and phenotype (i.e., disease, side-effects), are included. Further, nine types of biological relations are considered: (drug, hasTarget, protein), (drug, hasTarget, gene), (drug, hasEnzyme, protein), (drug, hasEnzyme, gene), (drug, hasTransporter, protein), (drug, hasTransporter, gene; e.g., polymorphisms in the ABC drug transporter gene MDR1), (protein, isPresentIn, pathway), (gene, isPresentIn, pathway), and (pathway, isImplicatedIn, phenotype).
Before the extraction, mappings are created based on owl:sameAs and owl:equivalentProperty axioms, in which the respective drug identifiers are mapped to DrugBank IDs as shown in fig. 2. Although genes contain the information needed to make functional molecules called proteins, we treat relations around genes and proteins as distinct, since PharmGKB contains information about both genes and proteins. The extracted triples form our drug KG in the form (subject, predicate, object), indicating that the subject has the specified relation to the object. Since this integrated knowledge graph should not contain any explicit information about drug-drug interactions, it contains no information in the form of drugbank_vocabulary:ddi-interactor-in and kegg_vocabulary:Interaction from the DrugBank and KEGG drug KGs, respectively. The number of triples, entities, and relation types for the individual KGs and the integrated KG are given in table 2. Next, we will elaborate on how we prepared this data as an input for our classifiers.
| Knowledge graph | #Triples | #Entities | #Relation types |
3.3. Knowledge graph embeddings
We used the information in our knowledge graph for predicting the interaction between each pair of drugs. However, ML classifiers typically expect their input as fixed-length vectors. Hence, we perform a KG embedding procedure to encode the information from the graph into dense vectors. KG embedding consists of three steps: representing entities and relations, defining a scoring function, and learning entity and relation representations (Lin et al., 2015). We used RDF2Vec (Ristoski and Paulheim, 2016), SimplE (Kazemi and Poole, 2018), TransE (Bordes et al., 2013), KGloVe (Cochez et al., 2017a), CrossE (Zhang et al., 2019), and PBG (Lerer et al., 2019) for the KG embeddings. These representations capture the neighborhood of a node as well as the kinds of relations that exist to the neighboring nodes. Since these methods do not incorporate literal information into the embedding, literals are removed from the KG.
RDF2Vec works by first generating a corpus of text by performing uniform random walks starting from each entity in the graph (Cochez et al., 2017b). The corpus of edge-labeled random walks is then used as the input for learning an embedding of each node with the skip-gram (SG) word2vec model (Mikolov et al., 2013); previous work (Cochez et al., 2017b; Celebi et al., 2018) observed better performance with SG than with the CBOW model. Given a sequence of drug facts $(w_1, w_2, \dots, w_N)$, the SG model aims to maximize the average log probability (see eq. 2) according to the context within the fixed-size window, in which $c$ represents the size of the context:

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{n+j} \mid w_n) \tag{2}$$

To define $p(w_{n+j} \mid w_n)$, we use negative sampling by replacing $\log p(w_O \mid w_I)$ with a function that discriminates target words $w_O$ from a noise distribution $P_n(w)$, drawing $k$ words from $P_n(w)$:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right] \tag{3}$$

The embedding of a concept $s$ occurring in the corpus is the vector $v_s$ in eq. 3 derived by maximizing eq. 2. Besides RDF2Vec, we also trained TransE embeddings as a representative of the translation-based KG embedding methods. Here, every entity and relation is embedded as a low-dimensional vector, where relations are represented as translations from the head entity to the tail entity. The CrossE embedding method, which explicitly simulates crossover interactions, is also used. Although both general embeddings for each entity and relation and multiple triple-specific embeddings can be generated using CrossE, we used only the general embeddings. The SimplE embedding method, which allows two embeddings of each entity to be learned independently, is also employed. The embeddings learned through SimplE are interpretable, which helps to incorporate drug-related background knowledge into the embeddings. A reported advantage of SimplE is that it outperforms tensor factorization techniques, especially for link prediction problems.
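The walk-generation step of RDF2Vec described earlier in this section can be sketched in a few lines. This is a simplified illustration, not the RDF2Vec implementation: the adjacency-list format, the function name, and the toy graph are assumptions, and the resulting token sequences would subsequently be passed to a skip-gram trainer.

```python
import random

def random_walks(graph, start, depth, walks_per_entity, rng):
    """Generate edge-labeled uniform random walks starting from one entity.

    graph maps a node to a list of (predicate, neighbor) tuples.
    Each walk alternates node and predicate tokens: node, predicate, node, ...
    """
    walks = []
    for _ in range(walks_per_entity):
        walk = [start]
        node = start
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:
                break  # dead end: stop this walk early
            predicate, node = rng.choice(edges)  # uniform choice among outgoing edges
            walk.extend([predicate, node])
        walks.append(walk)
    return walks

# Toy graph with invented identifiers.
toy_graph = {
    "drug:A": [("hasTarget", "protein:P"), ("hasEnzyme", "protein:Q")],
    "protein:P": [("isPresentIn", "pathway:W")],
}
walks = random_walks(toy_graph, "drug:A", depth=2, walks_per_entity=3, rng=random.Random(0))
```

Each walk plays the role of a sentence, so entities that share neighborhoods tend to appear in similar contexts and receive similar vectors.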
Learning the representations in a KG relies on contrasting positive instances with negative ones. However, KGs typically include only positive relation instances (Socher et al., 2013). A solution to this issue is to use implicit negative evidence, in which instances that have not been observed in the KG are considered negative. Kotnis and Nastase (Kotnis and Nastase, 2017) employed several negative sampling approaches and observed their impact on the learned embeddings. They found that the “corrupting positives” method leads to the best result in a link prediction task. This corruption produces negative instances that are closer to the positive ones than those produced through random sampling. Several of the methods we used in this work (TransE, CrossE, SimplE, and RDF2Vec) also generate negative instances by corrupting positive samples.
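A minimal sketch of the “corrupting positives” idea is given below. It is not the sampling code of any of the cited libraries; the function name, the retry limit, and the toy triples are assumptions chosen for illustration.

```python
import random

def corrupt_triples(positives, entities, rng, max_attempts=100):
    """Create one negative per positive triple by replacing its head or its tail.

    A corrupted triple is kept only if it is not itself a known positive.
    """
    known = set(positives)
    negatives = []
    for head, relation, tail in positives:
        for _ in range(max_attempts):
            candidate = rng.choice(entities)
            if rng.random() < 0.5:
                corrupted = (candidate, relation, tail)  # replace the head
            else:
                corrupted = (head, relation, candidate)  # replace the tail
            if corrupted not in known:
                negatives.append(corrupted)
                break
    return negatives

# Toy triples with invented identifiers.
positives = [
    ("drug:A", "hasTarget", "protein:P"),
    ("drug:B", "hasTarget", "protein:Q"),
]
entities = ["drug:A", "drug:B", "protein:P", "protein:Q"]
negatives = corrupt_triples(positives, entities, random.Random(0))
```

Because the relation and one endpoint are preserved, the negatives stay close to the positives; this simple version does not enforce entity types, so a drug may be replaced by a protein.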
KGloVe (Cochez et al., 2017a) has some similarity with RDF2Vec, but uses a different technique to identify global patterns for creating vector space embeddings. First, a co-occurrence matrix is created by computing personalized PageRank for each node. Then, this process is repeated for the graph with all edges reversed. These two matrices are summed together and normalized. Finally, this matrix is used as an input to the GloVe word embedding algorithm. In this work, we have only used unbiased walks. As a last model, we train ComplEx (Trouillon et al., 2016) using the PBG implementation (Lerer et al., 2019) because our integrated KG contains many triples. Technically, ComplEx uses only the Hermitian dot product (the complex counterpart of the standard dot product between real vectors) for creating embeddings. The overall embedding method using ComplEx is arguably simpler, but since the composition of complex embeddings can handle a large variety of binary relations (among them symmetric and antisymmetric ones), research has shown that complex embeddings can outperform several other models.
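To illustrate how one row of such a co-occurrence matrix could be obtained, the sketch below computes personalized PageRank for a single source node by power iteration. It is a simplified stand-in and not the KGloVe code: the damping factor, the iteration count, the handling of dead-end nodes, and the toy graph are all assumptions.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Approximate personalized PageRank scores for one source node.

    graph maps a node to a list of successor nodes. At each step a walker
    follows a random outgoing edge with probability `damping` and otherwise
    jumps back to the source; walkers at dead ends always jump back.
    """
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node in nodes:
            mass = rank[node]
            targets = graph.get(node, [])
            if targets:
                share = damping * mass / len(targets)
                for target in targets:
                    updated[target] += share
                updated[source] += (1.0 - damping) * mass
            else:
                updated[source] += mass  # dead end: return all mass to the source
        rank = updated
    return rank

# Toy graph with invented node names.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
scores = personalized_pagerank(toy_graph, "A")
```

Repeating this for every node, and again on the graph with reversed edges, would give the two matrices that are summed and normalized.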
We trained the ComplEx embedding model of PBG i) because it creates embeddings quickly, scales to large graphs, and parallelizes the training, and ii) to observe whether the embeddings generated by this model are useful for predicting DDIs. Given these dense representations, we can now feed the information from the graph into our machine learning models. To represent the feature vector of a drug pair, we concatenate the embedding vectors of the two drugs in the pair. In PBG, negative samples are generated by corrupting positive edges, sampling either a new source or a new destination for each existing edge (Lerer et al., 2019).
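The pair representation described above amounts to a simple concatenation, sketched below with tiny invented vectors in place of real embeddings. The function name and the data are hypothetical; pairs involving a drug without an embedding are skipped, which mirrors the filtering of drugs that have no calculated feature vector.

```python
def build_pair_dataset(embeddings, labelled_pairs):
    """Concatenate the two drug embeddings of each pair into one feature vector."""
    features, labels, skipped = [], [], []
    for (drug_a, drug_b), label in labelled_pairs:
        if drug_a not in embeddings or drug_b not in embeddings:
            skipped.append((drug_a, drug_b))  # no embedding available
            continue
        features.append(embeddings[drug_a] + embeddings[drug_b])  # list concatenation
        labels.append(label)
    return features, labels, skipped

# Invented 2-dimensional embeddings; the embeddings in this work have more dimensions.
embeddings = {"D1": [0.1, 0.2], "D2": [0.3, 0.4], "D3": [0.5, 0.6]}
labelled_pairs = [(("D1", "D2"), 1), (("D3", "D1"), 0), (("D1", "D9"), 1)]
features, labels, skipped = build_pair_dataset(embeddings, labelled_pairs)
```

Note that concatenation is order-sensitive, so the pair (D1, D2) and the pair (D2, D1) produce different feature vectors.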
3.4. Network construction
We train various ML models, which we will use as baselines later. Here, we describe the more complex neural network architecture that gave the best results we obtained. We construct a so-called Conv-LSTM network (SHI et al., 2015) by combining both CNN and LSTM layers as shown in fig. 3. While the CNN uses convolutional filters to capture local relationship values in drug features, the LSTM network can carry overall relationships from the features extracted by the CNN. The Conv-LSTM has shown good performance on diverse prediction tasks such as hate speech detection from text (Zhang and Luo, 2018), precipitation nowcasting (SHI et al., 2015), and monocular depth prediction (CS Kumar et al., 2018). Consequently, it has been able to capture the most significant drug features in our case.
We extended the Conv-LSTM network proposed in the literature (SHI et al., 2015), in which the inputs $X_1, \dots, X_t$, cell outputs $C_1, \dots, C_t$, hidden states $H_1, \dots, H_t$, and gates $i_t$, $f_t$, $o_t$ of the network are 2D tensors whose dimensions are spatial dimensions of the drug features. Conv-LSTM determines the future state of a certain cell in the input hyperspace by the inputs and past states of its local neighbors. This is achieved by using a conv operator in the state-to-state and input-to-state transitions as follows (SHI et al., 2015):

\begin{align}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \tag{4} \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \tag{5} \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \tag{6} \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \tag{7} \\
H_t &= o_t \circ \tanh\left(C_t\right) \tag{8}
\end{align}

In the above equations, $*$ denotes the conv operator, and $\circ$ is the entrywise multiplication of two matrices of the same dimensions. The second LSTM layer emits an output $H_t$, which is then reshaped (i.e., flattened) into a feature sequence and fed into fully-connected layers to predict the DDIs at the next step and as an input at the next time step. The first layer is the embedding layer, which maps a drug sample as a ‘sequence’ into a real vector domain. Then the embedding representation with a shape of 100x300 is fed into a 1D convolutional layer, which has 100 filters and a kernel size of 4.
The output of each conv layer is then passed to a dropout layer to regularize learning and avoid overfitting. Intuitively, this can be thought of as forcing the classifier not to rely on any trivial individual drug features. The conv layer convolves the input feature space into a 100x100 representation, which is further down-sampled by the 1D max pooling layer (MPL) with a pool size of 4 along the embedding dimension, producing an output of shape 25x100, where each of the 25 dimensions can be considered an ‘extracted feature’. The MPL flattens the output space by taking the highest value in each timestep dimension, which produces a 1x100 vector containing drug features that are highly indicative of interest. The LSTM layer, in contrast, treats the dimensions of the flattened feature vector as timesteps and outputs 100 hidden units per timestep. Then, using a global MPL, the most influential features are fed into a fully-connected layer after passing through another dropout layer, and finally to a softmax layer, which generates the probability distribution over the classes. Additionally, we introduce Gaussian noise (Xie et al., 2012) into each conv, LSTM, and dense layer to improve model generalization.
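The layer shapes quoted above can be checked with a little arithmetic. The helper functions below are an illustrative sketch that follows the common output-length conventions for 1D convolution and non-overlapping max pooling; the assumption that the convolution uses 'same' padding is inferred from the reported 100x100 output and is not stated in the text.

```python
import math

def conv1d_length(steps, kernel_size, padding="same", stride=1):
    """Number of output steps of a 1D convolution."""
    if padding == "same":
        return math.ceil(steps / stride)
    if padding == "valid":
        return (steps - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

def max_pool_length(steps, pool_size):
    """Number of output steps of non-overlapping 1D max pooling."""
    return steps // pool_size

filters = 100
input_shape = (100, 300)                                        # embedding representation
conv_shape = (conv1d_length(input_shape[0], 4, "same"), filters)  # after the conv layer
pooled_shape = (max_pool_length(conv_shape[0], 4), filters)       # after max pooling
valid_steps = conv1d_length(input_shape[0], 4, "valid")           # for comparison only
```

With 'valid' padding the convolution would yield 97 steps instead of 100, and pooling with size 4 would then give 24 rather than 25 steps.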
3.5. Network training
Since all classifiers need both negative and positive samples for the link prediction problem, previous studies have randomly chosen negative samples from unknown interactions (Zhang et al., 2015; Sridhar et al., 2016). However, setting all unknown interactions as negative samples creates a data imbalance issue, which in turn affects performance metrics such as AUPR and F1-score (Celebi et al., 2018). Other research has tackled this issue through random undersampling from the unknown interactions at a ratio corresponding to the positive set (Cheng and Zhao, 2014), or by inferring negatives through unsupervised clustering analysis (Hameed et al., 2017).
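As an illustration of random undersampling from the unknown interactions, the following minimal sketch draws as many unknown drug pairs as there are known positives (scaled by a ratio). The function name, the `ratio` parameter, and the toy drug identifiers are assumptions made for this example; it is not the sampling code used in the cited studies.

```python
import itertools
import random


def sample_negatives(drugs, positive_pairs, ratio=1.0, seed=0):
    """Randomly draw unknown (unordered) drug pairs to serve as negatives.

    The number of negatives is ratio * number of positives, capped by the
    number of unknown pairs that exist.
    """
    known = {frozenset(pair) for pair in positive_pairs}
    # Enumerate all unordered pairs that are not known interactions.
    unknown = [
        pair
        for pair in itertools.combinations(sorted(drugs), 2)
        if frozenset(pair) not in known
    ]
    n_neg = min(len(unknown), round(ratio * len(known)))
    rng = random.Random(seed)
    return rng.sample(unknown, n_neg)


drugs = ["A", "B", "C", "D", "E"]
positives = [("A", "B"), ("C", "D")]
negatives = sample_negatives(drugs, positives, ratio=1.0)
print(negatives)
```

Enumerating all pairs is quadratic in the number of drugs, so for a large drug set one would rather sample candidate pairs and reject the known ones.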
The open source implementations of PBG (https://github.com/facebookresearch/PyTorch-BigGraph), CrossE (https://github.com/wencolani/CrossE), TransE (https://github.com/xjdwrj/TransE-Pytorch), and SimplE (https://github.com/baharefatemi/SimplE) were used for the KG embedding with the default parameters provided. For KGloVe, a modified version (https://github.com/miselico/globalRDFEmbeddingsISWC) was used and run until convergence. RDF2Vec was trained using skip-gram with a window size of 5, graph walks of depth 5, and 500 walks per entity. Each embedding method was run with the dimension of the feature vector set to 300 and a varying number of negative samples. After filtering out the drugs that have no calculated feature vector, we were able to extract features for 12,439 out of 12,664 drugs. The embeddings generated by RDF2Vec, TransE, PBG, KGloVe, CrossE, and SimplE are then used to train the Conv-LSTM network for link prediction, similar to (Alshahrani et al., 2017), in which we aim to estimate the probability that a relation (link) with a given label exists between two vertices, given their vector representations.
The first-order gradient-based optimizers Adam, AdaGrad, RMSprop, and AdaMax, with varying learning rates and batch sizes, are used to learn the model parameters by minimizing the binary cross-entropy loss (eq. 9). Hyperparameter optimization is based on random search and cross-validation: the model is trained with a batch size of 128, and in each of 5 runs 70% of the data is used for training and 30% for evaluating the network, while 10% of the training set is randomly held out for validation.
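The split and the loss described above can be sketched as follows. This is a minimal illustration under stated assumptions: the split is a plain random shuffle (the text does not say whether it is stratified), the loss is the standard averaged binary cross-entropy with clipping for numerical safety, and the function names are hypothetical.

```python
import math
import random


def split_indices(n_samples, test_frac=0.30, val_frac=0.10, seed=0):
    """Shuffle indices, hold out test_frac for evaluation, then take
    val_frac of the remaining training part for validation."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = round(n_samples * test_frac)
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = round(len(train_full) * val_frac)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

For 1,000 samples this yields 630 training, 70 validation, and 300 evaluation samples per run; repeating it with five different seeds mirrors the five runs mentioned above.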
The evaluation code (source code: https://github.com/rezacsedu/DDI-prediction-KG-embeddings-Conv-LSTM) was written in Python. The software stack consists of Scikit-learn, PyTorch, and Keras with the TensorFlow backend. Network training is carried out on an Nvidia GTX 1080 Ti GPU with CUDA and cuDNN enabled. We also trained LR, KNN, NB, SVM, RF, and GBT as ML baseline models. As with the Conv-LSTM network, we optimize the hyperparameters of these classifiers through random search and 5-fold cross-validation. For the experiment, 80% of the data is used for training with 5-fold cross-validation, and the optimized model, whose best hyperparameters were found through random search, is evaluated on the 20% held-out data. Although the AUC score is commonly used as a performance metric in previous studies, the literature has emphasized that it might not be sufficiently accurate for imbalanced data (Celebi et al., 2018; Kastrin et al., 2018). Therefore, we used the area under the precision-recall curve (AUPR) and the Matthews correlation coefficient (MCC) along with the AUC and F1-score to measure the performance of the classifiers. Finally, we use the model averaging ensemble (MAE) of the top-3 models to report the final prediction.
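To make the MCC metric and the model averaging step concrete, the sketch below computes the MCC from the confusion-matrix counts and averages the positive-class probabilities of several models. It assumes that the ensemble averages predicted probabilities with equal weights and thresholds at 0.5; the text does not spell out these details, and the function names are hypothetical.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary 0/1 labels, computed from TP, TN, FP, and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def average_ensemble(prob_lists, threshold=0.5):
    """Average per-sample probabilities over models, then threshold."""
    n_models = len(prob_lists)
    avg = [sum(ps) / n_models for ps in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in avg]
    return avg, labels


# Toy probabilities from three hypothetical models for four drug pairs.
model_probs = [
    [0.9, 0.2, 0.7, 0.4],
    [0.8, 0.4, 0.4, 0.3],
    [0.7, 0.3, 0.6, 0.1],
]
avg, labels = average_ensemble(model_probs)
print(labels, matthews_corrcoef([1, 0, 1, 0], labels))
```

Unlike accuracy, the MCC uses all four confusion-matrix cells, which is why it remains informative when negatives far outnumber positives.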
4.1. Analysis of DDIs predictions
Table 3 summarizes the results of the prediction task based on different embedding methods. A general observation is that the Conv-LSTM model outperformed all baseline models, in the best case resulting in an AUPR of 0.93. Overall, the LR, NB, KNN, and SVM models performed worst. Although these algorithms are intrinsically simple, have low variance, and are less prone to overfitting, feature selection based on them may discriminate drug-related features very aggressively, which forces these classifiers to lose some useful drug features and results in worse performance. Among the tree-based classifiers, RF performs best, with an F1-score of 0.91, which is also the best among the ML baselines. The model averaging ensemble of the top-3 models (i.e., GBT, RF, and Conv-LSTM) boosts the performance by 1.5% compared to the best Conv-LSTM model in terms of F1-score.
Interestingly, the MCC scores of all classifiers also show that the predictions were strongly correlated with the ground truth (measured as a Pearson product-moment correlation coefficient, we obtained 0.70), probably because the embedding methods produce drug features of learnable quality. The AUC score of the Conv-LSTM network is the highest, at least 3% better than the second-best score by the RF classifier, while the LR classifier performed worst. The ROC curve in fig. 4 shows consistent AUC scores across the folds, which signifies that the predictions are much better than random guessing.
4.2. Comparison of graph embedding methods
As seen in table 3, the classifiers work better with drug features generated by the ComplEx, SimplE, and KGloVe methods. In particular, the GBT, RF, and Conv-LSTM classifiers perform consistently best on the embeddings generated by ComplEx in terms of F1, MCC, and AUPR scores. In contrast, features generated by the RDF2Vec method yield the worst DDI prediction accuracy. This differs from earlier work (Celebi et al., 2018), where the best results were obtained with RDF2Vec using the uniform weighting setting. We suspect that the classifiers did not benefit much from more training samples in our case. Therefore, we validate this by calibrating the best performing Conv-LSTM classifier against different embedding methods, so that the output probability of the classifier can be directly interpreted as a confidence level in terms of the 'fraction of positives'; the result is illustrated in fig. 5. As seen, the Conv-LSTM classifier gave probability values between 0.82 and 0.93, which means that up to 93% of the predictions based on the PBG embeddings are true positives.
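The 'fraction of positives' reading of a calibration plot can be illustrated with a small sketch: predictions are grouped into equal-width probability bins, and for each non-empty bin the mean predicted probability is compared with the observed share of true positives. The binning scheme, the function name, and the toy data are assumptions for illustration; they are not the settings behind fig. 5.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives) per
    non-empty equal-width bin over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 falls into the last bin.
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve


# Toy labels and probabilities for six drug pairs.
y_true = [0, 0, 1, 1, 1, 1]
y_prob = [0.10, 0.15, 0.85, 0.90, 0.95, 0.30]
for mean_pred, frac_pos in calibration_curve(y_true, y_prob, n_bins=2):
    print(round(mean_pred, 3), round(frac_pos, 3))
```

A well-calibrated classifier has both values close to each other in every bin, so a predicted probability of about 0.9 would correspond to roughly 90% true interactions.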
4.3. Effects of number of drug samples
To understand the effects of having more training samples, and to understand whether our classifiers suffer more from variance errors or bias errors, we observed the learning curves of top-3 classifiers (i.e., RF, GBT, and Conv-LSTM) and SVM (a linear model) for varying numbers of training samples. As shown in fig. 6, for SVM the validation and training scores converge to a low value with increasing size of the training set. Consequently, SVM did not benefit much from more training samples. However, RF and GBT are tree-based ensemble methods, and the Conv-LSTM network can learn more complex concepts from the drug features. This results in a lower bias, which can be observed from higher training scores than the validation scores for the maximum number of drug samples, i.e., adding more training samples does increase generalization.
4.4. Influence of negative samples
Inspired by (Trouillon et al., 2016), we investigated the influence of the number of negatives per positive training sample, which we call η, for ComplEx. As we had already varied η and set it to 15, we further varied η in [5, 10, 20, 25] and regenerated the embeddings for each setting. We then observed whether the Conv-LSTM network performs better with a larger η. Generating more negative samples moderately improves the results. In particular, with 20 negative triples, we observed about a 1% boost in AUPR. Embedding training also converged slightly faster. Further increasing η to 25 results in a drop in AUPR.
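The role of the number of negatives per positive (called `eta` in this sketch) can be illustrated by generating corrupted triples: for each known triple, the head or the tail is replaced by a random entity, and candidates that are themselves known triples are rejected. This is a generic illustration of negative sampling for KG embeddings under these assumptions; the embedding libraries used here may sample negatives differently, and all names below are hypothetical.

```python
import random


def corrupt_triples(triples, entities, eta=20, seed=0):
    """Generate up to eta negatives per positive (head, relation, tail) triple
    by replacing either the head or the tail with a random entity."""
    rng = random.Random(seed)
    known = set(triples)
    pool = sorted(entities)
    negatives = []
    for head, rel, tail in triples:
        made, attempts = 0, 0
        # The attempt cap avoids an endless loop on very dense toy graphs.
        while made < eta and attempts < 100 * eta:
            attempts += 1
            entity = rng.choice(pool)
            if rng.random() < 0.5:
                candidate = (entity, rel, tail)  # corrupt the head
            else:
                candidate = (head, rel, entity)  # corrupt the tail
            if candidate not in known:  # also rejects the positive itself
                negatives.append(candidate)
                made += 1
    return negatives


triples = [("d1", "interacts", "d2"), ("d2", "interacts", "d3")]
entities = {"d1", "d2", "d3", "d4"}
for eta in (5, 10, 20, 25):
    print(eta, len(corrupt_triples(triples, entities, eta=eta)))
```

The number of training examples grows linearly with `eta`, so a larger value trades training cost for a denser set of contrasting examples.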
4.5. Comparison with state-of-the-art
Since our data collection and preparation differ from other approaches and we have more samples, a one-to-one comparison was not viable, especially with the Tiresias framework (Abdelaziz et al., 2017), PRD (Wang, 2017), and INDI (Gottlieb et al., 2012). Kastrin et al. (2018) used data from multiple sources, but evaluated and inferred unknown DDIs from TWOSIDES only. With that dataset, they achieved a best AUPR score of 0.93 using RF and GBT classifiers.
The Tiresias framework, which uses pharmacological similarities from embedding features, has reported an F1-score of 0.85 and an AUPR of 0.92. Their pharmacological similarity features are equivalent to those of INDI, which also uses the DrugBank v4.0 dataset. INDI evaluated the performance of DDI prediction on a total of 37,212 true DDIs, obtaining an AUC score of 0.93 and an F1-score of 0.89 in a 10-fold cross-validation setting, omitting the interaction type (i.e., PD or PK). Celebi et al. (2018) observed an F1-score of 0.867 and an AUPR of 0.918 using the DrugBank v4.0 dataset.
With our approach, evaluations against several baseline models yield an AUPR of up to 0.94, an F1-score of 0.92, and an MCC of 0.80 during 5-fold cross-validation tests. This signifies that a KG-based approach using multiple data sources is comparable to current state-of-the-art methods. To show the benefit of using a more robust classifier, we further trained the Conv-LSTM network on the DrugBank v4.0 dataset as shown in fig. 1. During a 5-fold cross-validation test, we observed slightly better accuracy, namely an F1-score of 0.895 and an AUPR of 0.926, which outperforms the results of (Celebi et al., 2018).
5. Conclusion and outlook
Adverse drug reactions are very dangerous and lead to a significant number of hospital (re-)admissions and even deaths. Many of these reactions are due to drug-drug interactions. Preferably, all drug-drug interactions should be known upfront to ensure that preventable cases do not occur. However, it is not feasible to investigate all possible interactions, and hence approaches able to predict possible interactions are investigated. In this paper, we proposed the use of knowledge graphs to integrate drug-related data from different sources. This way, we have integrated background knowledge about drugs, diseases, pathways, proteins, enzymes, chemical structures, etc. Since this background data is in a format that cannot be used as a direct input for typical classifiers, we applied several node embedding techniques to create a dense vector representation for each node in the KG. These representations are then fed into various traditional ML classifiers and a specifically designed neural network architecture based on a convolutional LSTM.
Our core observations are that i) we could consistently outperform the baseline classifiers, as well as earlier state-of-the-art models, with our proposed architecture: we obtained up to 0.94, 0.92, and 0.80 for AUPR, F1-score, and MCC, respectively, during 5-fold cross-validation tests, showing high confidence at predicting potential DDIs; and ii) among the embedding models we used, the PBG model performed best, but SimplE and KGloVe also gave reasonable results.
One limitation of our approach is the inability to provide explanations for the predicted DDIs. The embedding creates latent features, which behave like a black-box model. As future research directions, we see i) the possibility of including even more background data, for example NDFRT, SemMedDB, and SIDE, together with a large ablation study to measure the influence of each of these additions, ii) including explicit information about negative drug-drug interactions, as well as a prediction of the interaction type, iii) providing explanations for the interactions, iv) further investigation of other models to perform predictions on graphs, and v) interactions with other (non-drug) substances such as food.
- Large-scale structural and textual similarity-based mining of knowledge graph to predict drug–drug interactions. Journal of Web Semantics 44, pp. 104–117.
- Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33 (17), pp. 2723–2730.
- Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 41 (5), pp. 706–716.
- UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 651–659.
- SCAI: extracting drug-drug interactions using a rich feature vector. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 675–683.
- Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2787–2795.
- Pharmacointeraction network models predict unknown drug-drug interactions. PloS one 8 (4), pp. e61468.
- Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction using linked open data.
- Machine learning-based prediction of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. Journal of the American Medical Informatics Association 21 (e2), pp. e278–e286.
- FBK-irst: a multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 351–355.
- Global RDF vector space embeddings. In The Semantic Web – ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21–25, 2017, C. d’Amato, M. Fernandez, et al. (Eds.), pp. 190–207.
- Biased graph walks for RDF graph embeddings. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS ’17, New York, NY, USA, pp. 21:1–21:12.
- Depthnet: a recurrent neural network architecture for monocular depth prediction. Salt Lake City, Utah, pp. 283–291.
- Drug-drug interaction discovery: kernel learning from heterogeneous similarities. Smart Health 9, pp. 88–100.
- Literature based drug interaction prediction with clinical assessment using electronic medical records: novel myopathy associated drug interactions. PLoS computational biology 8 (8), pp. e1002614.
- Knowledge graph embedding by flexible translation. In Proceedings of the Fifteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’16, pp. 557–560.
- INDI: a computational framework for inferring drug interactions and their associated recommendations. Molecular systems biology 8 (1), pp. 592.
- UColorado_SOM: extraction of drug-drug interactions from biomedical text using knowledge-rich and knowledge-poor features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 684–688.
- Strategies to connect RDF graphs for link prediction using drug-disease knowledge graphs. Poster presented at the 11th International Conference Semantic Web Applications and Tools for Life Sciences (SWAT4HCLS 2018).
- Positive-unlabeled learning for inferring drug interactions based on heterogeneous attributes. BMC bioinformatics 18 (1), pp. 140.
- The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics 46 (5), pp. 914–920.
- Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 687–696.
- KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research 38 (suppl_1), pp. D355–D360.
- Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PloS one 13 (5), pp. e0196865.
- SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4284–4295.
- Analysis of the impact of negative sampling on link prediction in knowledge graphs. arXiv preprint arXiv:1708.06816.
- PyTorch-BigGraph: a large-scale graph embedding system. arXiv preprint arXiv:1903.12287.
- Large-scale exploration and analysis of drug combinations. Bioinformatics 31 (12), pp. 2007–2016.
- Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2181–2187.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Informatics confronts drug–drug interactions. Trends in pharmacological sciences 34 (3), pp. 178–184.
- Discovery and explanation of drug-drug interactions via text mining. In Biocomputing 2012, pp. 410–421.
- UWM-TRIADS: classifying drug-drug interactions with two-stage SVM and post-processing. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 667–674.
- RDF2Vec: RDF graph embeddings for data mining. In The Semantic Web – ISWC 2016, P. Groth, E. Simperl, A. Gray, M. Sabou, M. Krötzsch, F. Lecue, F. Flöck, and Y. Gil (Eds.), Cham, pp. 498–514.
- LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation. Journal of Web Semantics 48, pp. 85–125.
- Lessons learnt from the DDIExtraction-2013 shared task. Journal of biomedical informatics 51, pp. 152–164.
- Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 802–810.
- Detecting drug-drug interactions using artificial neural networks and classic graph similarity measures. arXiv preprint arXiv:1903.04571.
- Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934.
- A probabilistic approach for collective similarity-based drug–drug interaction prediction. Bioinformatics 32 (20), pp. 3175–3182.
- Predicting drug–drug interactions through drug structural similarities and interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge. Journal of cheminformatics 9 (1), pp. 16.
- Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 26 (18), pp. i547–i553.
- Data-driven prediction of drug effects and interactions. Science translational medicine 4 (125), pp. 125ra31–125ra31.
- Complex embeddings for simple link prediction. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 2071–2080.
- Predicting rich drug-drug interactions via biomedical knowledge graphs and text jointly embedding. arXiv preprint arXiv:1712.08875.
- Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics 92 (4), pp. 414–417.
- DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research 46 (D1), pp. D1074–D1082.
- Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 341–349.
- Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific reports 5, pp. 12339.
- Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC bioinformatics 18 (1), pp. 18.
- Interaction embeddings for prediction and explanation in knowledge graphs. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, New York, NY, USA, pp. 96–104.
- Hate speech detection: a solved problem? the challenging case of long tail on twitter. Semantic Web Pre-press (Preprint), pp. 1–21.