Predicting interactions among heterogeneous graph-structured data has numerous applications such as knowledge graph completion, recommendation systems, and drug discovery. Often, the links to be predicted belong to rare types, as is the case when repurposing drugs for novel diseases. This motivates the task of few-shot link prediction. Typically, GCNs are ill-equipped to learn such rare link types since the relation embedding is not learned in an inductive fashion. This paper proposes an inductive RGCN for learning informative relation embeddings even in the few-shot learning regime. The proposed inductive model significantly outperforms the RGCN and state-of-the-art KGE models in few-shot learning tasks. Furthermore, we apply our method to the drug-repurposing knowledge graph (DRKG) for discovering drugs for Covid-19. We pose the drug discovery task as link prediction and learn embeddings for the biological entities that partake in the DRKG. Our initial results corroborate that several drugs used in clinical trials were identified as possible drug candidates. The methods in this paper are implemented using the efficient deep graph learning (DGL) library.
The timeline of the Covid-19 pandemic showcases the dire need for fast development of effective treatments for new diseases. Drug-repurposing is a drug discovery strategy that starts from existing drugs and significantly shortens the time and reduces the cost compared to de novo drug discovery (sertkaya2014examination; avorn20152; setoain2015nffinder). Drug-repurposing leverages the fact that common molecular pathways contribute to different diseases and hence some drugs may be reused (ashburn2004drug).
Drug-repurposing relies on identifying novel interactions among biological entities like genes and compounds and can be posed as a link prediction task over a biological network. Several machine learning approaches have been developed for addressing the drug-repurposing task for Covid-19; see e.g. (gramatica2014graph; zhou2020network; udrescu2016clustering; drkg2020). To assist such machine learning techniques, (drkg2020) created a comprehensive biological knowledge graph relating genes, compounds, diseases, biological processes, side effects, and symptoms, termed the Drug Repurposing Knowledge Graph (DRKG).
However, for novel diseases like Covid-19 only a few interactions are available among viral proteins and possible chemical compounds that may inhibit the related genes. This motivates the framework of few-shot link prediction, where a certain edge type is rare and the model is called to make predictions on the particular edge type.
Link prediction has been addressed by several works in the context of knowledge-graph (KG) completion. These models embed the nodes and edges of the KG in a vector space and are trained by maximizing the score of existing edges in the KG; see e.g. (wang2017knowledge). An efficient implementation of these models in DGL is presented in (zheng2020dgl). Nevertheless, these KGE models do not naturally generalize to the few-shot scenario, where only a few edges are available for a rare edge type, which challenges learning the relation embedding. This was addressed in (chen2019meta), where a meta-learning model is proposed to learn the relation embeddings in an inductive fashion. However, this inductive-relation KGE model requires a specialized training scheme, cannot learn inductive node embeddings, and cannot incorporate node features if available.
Graph convolutional networks learn embeddings for nodes and edges in the graph by applying a sequence of nonlinear operations parametrized by the graph adjacency matrix and by utilizing node and edge features (kipf2016semi; schlichtkrull2018modeling). An inductive implementation of these models allows for learning node embeddings in an inductive fashion (hamilton2017inductive). The RGCN model (schlichtkrull2018modeling) has been successful in link prediction, where the RGCN is supervised by KGE models (wang2017knowledge). However, these GCN models for link prediction inherit the limitation of the KGE models and are challenged in learning relation embeddings for rare edge types.
This paper addresses the aforementioned limitation of GCN models by introducing a novel inductive RGCN (I-RGCN) that learns the relation and the node embeddings in an inductive fashion. The proposed I-RGCN naturally addresses few-shot link prediction and outperforms competing state-of-the-art models. I-RGCN is also tested on the DRKG for Covid-19 drug-repurposing. The drug discovery task is naturally formulated in a few-shot learning setting. The preliminary results indicate that several drugs used in clinical trials are discovered as possible drug candidates. While this study by no means recommends specific drugs, it demonstrates a powerful deep learning methodology to prioritize existing drugs for further investigation, which holds the potential of accelerating therapeutic development for Covid-19.
Consider a heterogeneous graph with multiple node types and relation types. A node type may represent, for example, genes or chemical compounds in the DRKG. A relation type holds all interactions of a certain type between two node types and may represent, for example, that a chemical compound inhibits a gene or that a disease is treated by a chemical compound.
Consider also that each node is associated with a feature vector. This feature may represent an embedding of the protein sequence associated with a gene (wang2017accurate). In KGs, some node types may not have features; for these we use an embedding layer to represent their features.
Few-shot link prediction. Given the edge sets of the relation types, a nodal attribute vector per node, and a small set of links in the few-shot relation, few-shot link prediction amounts to inferring the missing links of the rare type. In the DRKG, this few-shot relation is, for example, coronavirus treatment.
The relational GCN (RGCN) (schlichtkrull2018modeling) extends the graph convolution operation (kipf2016semi) to heterogeneous graphs. An RGCN model comprises a sequence of RGCN layers. The $(l+1)$th layer computes the node representation $\mathbf{h}_n^{(l+1)}$ of node $n$ as follows
$$\mathbf{h}_n^{(l+1)} = \sigma\Big(\sum_{r=1}^{R} \sum_{n' \in \mathcal{N}_n^r} \mathbf{W}_r^{(l)} \mathbf{h}_{n'}^{(l)}\Big) \qquad (1)$$
where $R$ is the number of relation types, $\mathcal{N}_n^r$ is the neighborhood of node $n$ under relation $r$, $\sigma$ is the rectified linear unit nonlinear function, and $\mathbf{W}_r^{(l)}$ is a learnable matrix associated with the $r$th relation. Essentially, the output of the RGCN layer for node $n$ is a nonlinear combination of the hidden representations of neighboring nodes weighted based on the relation type. The node features are the input of the first layer in the model, i.e., $\mathbf{h}_n^{(0)} = \mathbf{x}_n$, where $\mathbf{x}_n$ is the node feature for node $n$. For node types without features we use an embedding layer that takes as input a one-hot encoding of the node id.
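The relation-wise aggregation of an RGCN layer can be illustrated with a small sketch in plain Python. This is not the DGL implementation used in the paper; the function names, the toy graph, and the omission of any normalization or self-loop term are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(h: Dict[str, Vector],
               edges: List[Tuple[str, str, str]],
               weights: Dict[str, Matrix],
               out_dim: int) -> Dict[str, Vector]:
    """One simplified RGCN layer.

    h:       node id -> hidden vector of the previous layer
    edges:   (source, relation, destination) triples
    weights: relation -> learnable matrix with out_dim rows
    Assumption: plain sum over neighbors, no normalization, no self-loop.
    """
    agg = {n: [0.0] * out_dim for n in h}
    for src, rel, dst in edges:
        # Each neighbor's message is transformed by its relation-specific matrix.
        msg = matvec(weights[rel], h[src])
        agg[dst] = [a + m for a, m in zip(agg[dst], msg)]
    # ReLU nonlinearity applied elementwise.
    return {n: [max(0.0, v) for v in vec] for n, vec in agg.items()}


# Toy graph (hypothetical): a compound and a disease both point to a gene.
h0 = {"c": [1.0, 2.0], "d": [3.0, 1.0], "g": [0.5, -1.0]}
edges = [("c", "inhibits", "g"), ("d", "associated", "g")]
weights = {
    "inhibits": [[1.0, 0.0], [0.0, 1.0]],
    "associated": [[-1.0, 0.0], [0.0, 2.0]],
}
h1 = rgcn_layer(h0, edges, weights, out_dim=2)
print(h1["g"])  # [0.0, 4.0]
```

The gene node receives the message [1, 2] through the first relation and [-3, 2] through the second, so the summed input [-2, 4] becomes [0, 4] after the ReLU.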
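The RGCN model in this paper is supervised by a DistMult model (yang2014embedding) for link prediction. The loss function is
$$\mathcal{L} = \sum_{(h,r,t) \in \mathcal{D}^+ \cup \mathcal{D}^-} \log\big(1 + \exp(-y\, \mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \mathbf{t})\big) \qquad (2)$$
where $^\top$ denotes the transpose of a matrix, $\mathrm{diag}(\mathbf{r})$ denotes a diagonal matrix with $\mathbf{r}$ on its diagonal, $\mathbf{h}$, $\mathbf{r}$, $\mathbf{t}$ are the embeddings of the head entity $h$, relation $r$ and the tail entity $t$, respectively, $\mathcal{D}^+$ and $\mathcal{D}^-$ are the positive and negative sets of triplets, and $y=1$ if the triplet corresponds to a positive example and $y=-1$ otherwise. The scalar $\mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \mathbf{t}$ denotes the score of triplet $(h,r,t)$ as given by the DistMult model (yang2014embedding). The entity embeddings are obtained by the final layer of the RGCN. The relation type embeddings are trained directly from (2).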
The model in (2) is vulnerable when only a few training edges are available for a certain relation type, since the small number of edges challenges the learning of the embedding vector for the rare relation.
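To make the DistMult supervision concrete, the following sketch computes the DistMult score of a triplet (the sum over dimensions of the elementwise product of head, relation, and tail embeddings) and a logistic loss over labeled triplets. The label convention of +1 for positive and -1 for negative triplets and the function names are assumptions for illustration, not the paper's code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def distmult_score(h: Vector, r: Vector, t: Vector) -> float:
    """DistMult score h^T diag(r) t, i.e. sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def logistic_loss(triples: List[Tuple[Vector, Vector, Vector, int]]) -> float:
    """Sum of log(1 + exp(-y * score)) over (h, r, t, y) tuples.

    Assumption: y is +1 for positive and -1 for negative triplets.
    """
    return sum(
        math.log1p(math.exp(-y * distmult_score(h, r, t)))
        for h, r, t, y in triples
    )


score = distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(score)  # 63.0

# A rare relation contributes only a handful of terms to the loss, so its
# directly trained embedding r receives very few gradient updates.
loss = logistic_loss([([1.0, 0.0], [1.0, 1.0], [1.0, 0.0], 1),
                      ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], -1)])
```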
Certain relation types may be rare in the training set of links and require a specialized architecture. To address such a few-shot scenario, we introduce an MLP to learn the relation embeddings. Consider the node embeddings $\mathbf{h}$ and $\mathbf{t}$ of the head and tail nodes of an edge, extracted from the ultimate layer of the RGCN model. The proposed MLP learns an embedding for the $r$th relation as follows
$$\mathbf{r} = \mathrm{MLP}(\mathbf{h} \,\|\, \mathbf{t}) \qquad (3)$$
where $\|$ denotes the vector concatenation. Note that the relation embedding is calculated as a nonlinear function of the node embeddings for all node pairs participating in a certain relation type $r$. This allows the I-RGCN to learn relation embeddings in an inductive fashion. This model is supervised by a loss of the same form as (2), in which the triple score is computed with the relation embedding produced by the MLP and a label distinguishes the negative triples from the positive ones. We create negative triples by fixing the head node of a positive triple and randomly selecting a tail node of the same type as the original tail node. Differently from (2), the relation embedding is learned in an inductive fashion from the participating node pairs ($\mathbf{h}$, $\mathbf{t}$). Hence, upon learning the MLP parameters, the relation embedding is computed with a forward pass. This obviates the few-shot learning hurdle and enables the model to generalize to rare or even unseen relations.
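The two ingredients described above, a relation embedding obtained from the concatenated head and tail embeddings and negative triples obtained by corrupting the tail, can be sketched as follows. The single hidden layer with ReLU, the parameter names, and the choice to exclude the original tail from the candidates are assumptions; the text only specifies an MLP over the concatenation and a random tail of the same node type.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relation_embedding(h: Vector, t: Vector,
                       W1: Matrix, b1: Vector,
                       W2: Matrix, b2: Vector) -> Vector:
    """Relation embedding from the concatenation of head and tail embeddings.

    Assumption: one hidden layer with ReLU; the paper's MLP may differ.
    """
    x = h + t  # list concatenation plays the role of vector concatenation
    hidden = [max(0.0, v) for v in dense(W1, b1, x)]
    return dense(W2, b2, hidden)


def corrupt_tail(triple: Tuple[str, str, str],
                 nodes_by_type: Dict[str, List[str]],
                 node_type: Dict[str, str],
                 rng: random.Random) -> Tuple[str, str, str]:
    """Negative triple: keep the head, draw a random tail of the same type."""
    head, rel, tail = triple
    candidates = [n for n in nodes_by_type[node_type[tail]] if n != tail]
    return (head, rel, rng.choice(candidates))


r = relation_embedding([1.0], [2.0],
                       W1=[[1.0, 1.0], [1.0, -1.0]], b1=[0.0, 0.0],
                       W2=[[1.0, 1.0]], b2=[0.5])
print(r)  # [3.5]

nodes_by_type = {"gene": ["g1", "g2", "g3"], "drug": ["d1"]}
node_type = {"g1": "gene", "g2": "gene", "g3": "gene", "d1": "drug"}
negative = corrupt_tail(("d1", "inhibits", "g1"), nodes_by_type, node_type,
                        random.Random(0))
```

Because the relation embedding is a function of node embeddings rather than a free parameter, it can be evaluated for any node pair once the MLP weights are trained. Note that this simple sampler does not check whether the corrupted triple happens to be a true edge.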
Baselines. We consider the state-of-the-art KGE models RotatE (sun2019rotate) and ComplEx (trouillon2016complex), and the RGCN model (schlichtkrull2018modeling) as baselines for comparison. The parameters of these methods have been optimized via cross-validation.
|MRR||Hit 1||Hit 10|
We use the IMDB and DBLP datasets (fu2020magnn) detailed in Table 1, which also lists the total number of edges in the few-shot relation for each dataset. In the experiments, we train with only a small number of links from the few-shot relation together with all the links from the other relations, and test on the remaining edges of the few-shot relation. The nodes in the IMDB and DBLP graphs are associated with feature vectors. Further information on the datasets is included in the Appendix.
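The split described above can be sketched in a few lines. The function name, the triple layout `(head, relation, tail)`, and the use of a seeded shuffle are assumptions for illustration; the paper does not specify how the training links of the few-shot relation are drawn.

```python
import random
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def few_shot_split(edges: List[Edge], few_shot_rel: str, k: int,
                   seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Train on k edges of the rare relation plus all other edges;
    test on the remaining edges of the rare relation."""
    rng = random.Random(seed)
    rare = [e for e in edges if e[1] == few_shot_rel]
    other = [e for e in edges if e[1] != few_shot_rel]
    rng.shuffle(rare)
    return other + rare[:k], rare[k:]


# Toy example with hypothetical node and relation names.
edges = [("d%d" % i, "treats", "covid") for i in range(5)]
edges += [("d%d" % i, "inhibits", "g%d" % i) for i in range(3)]
train, test = few_shot_split(edges, few_shot_rel="treats", k=2)
print(len(train), len(test))  # 5 3
```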
Tables 2 and 3 report the MRR, Hit-1 and Hit-10 scores of the baseline methods along with the inductive RGCN and the RGCN in the task of few-shot link prediction for the IMDB and DBLP datasets, respectively. The I-RGCN significantly outperforms the alternative methods in the task of few-shot link prediction. Specifically, with only 10 training edges of the few-shot relation, the MRR of the inductive method is one order of magnitude greater. This corroborates the advantage of inductive relation learning for the few-shot learning task. As the number of training edges increases to 1000, the RGCN performance approaches that of the I-RGCN. This suggests that the I-RGCN method also performs well in non-few-shot learning tasks. The worse performance of the KGE models is explained by the fact that they do not account for node features and do not learn inductive relation embeddings.
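For reference, the reported metrics can be computed from the rank of each true tail among the scored candidates, as in the sketch below. The tie-handling rule (rank is one plus the number of candidates with a strictly higher score) and the function names are assumptions; evaluation protocols differ in how they treat ties and filtered candidates.

```python
from typing import List, Sequence


def rank_of_true(scores: Sequence[float], true_index: int) -> int:
    """Rank of the true candidate: 1 + number of strictly higher scores."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test triples whose true tail is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


print(rank_of_true([0.1, 0.9, 0.5], true_index=2))  # 2
ranks = [1, 2, 4]
print(hits_at_k(ranks, 10))  # 1.0
```

With ranks 1, 2 and 4, the MRR is the mean of 1, 0.5 and 0.25, and Hit-1 is one third, which shows how MRR rewards near misses that Hit-1 ignores.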
To further validate the performance of the I-RGCN, we conduct a general link prediction evaluation by splitting the links into training, validation, and testing sets at random, irrespective of their relation type. The results for different percentages of training links are reported in Table 4. Even in this training scenario, I-RGCN outperforms the RGCN and KGE baselines, which further corroborates the effectiveness of the model.
|MRR||Hit 1||Hit 10|
|Metrics||MRR||Hit 1||Hit 10|
For this experiment we utilize the drug-repurposing knowledge graph (DRKG) constructed in (drkg2020). The DRKG collects interactions from several biological databases such as Drugbank (drugbank@2017), GNBR (percha2018global), Hetionet (hetionet@2017), STRING (string@2019), IntAct (intact) and DGIdb (dgidb@2017).
Drug-repurposing aims at discovering the most effective existing drugs to treat a certain disease. It can be formulated as predicting direct links in the DRKG, such as whether a drug treats a disease, or as predicting whether a compound inhibits a certain gene that is related to the target disease. Drug-repurposing can be viewed as a few-shot link prediction task since only a few edges related to novel diseases are available in the DRKG.
We use coronavirus-related diseases, including SARS, MERS and SARS-COV2, as target diseases representing Covid-19, as their functionality is similar. We aim at predicting links among gene entities associated with the target disease and drug entities. We select FDA-approved drugs in Drugbank as candidates, while for simplicity we exclude drugs with molecular weight less than 250 daltons, as many of these drugs are actually health drugs. This amounts to 8104 candidate drugs.
We also obtain 442 Covid-19 related genes from the relations extracted from (gordon2020sars; zhou2020network). Similarly, we obtain the node embeddings for the genes and drugs, and the embeddings for the corresponding relations. Next, we score all triples and rank them per target gene. This way we obtain 442 ranked lists of drugs. Finally, to assess whether our predictions are on par with the drugs used for treatment, we check the overlap among the top 100 predicted drugs and the drugs used in clinical trials per gene. We used 32 clinical trial drugs for Covid-19 to validate our predictions (the clinical trial drugs were collected from http://www.covid19-trails.com/). Table 5 lists the clinical drugs included in the top-100 predicted drugs across all the genes with their corresponding number of hits for the RGCN and I-RGCN. It can be observed that several of the widely used drugs in clinical trials appear high on the predicted list, and that I-RGCN shows a higher hit rate than RGCN. Hence, the inductive relation prediction module is more appropriate for predicting links when information about the nodes is limited, as is the case with the novel Covid-19 disease node.
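The validation procedure above (rank drugs per gene, keep the top 100, count how often each clinical trial drug appears) can be sketched as follows. The data layout, a dictionary from gene to per-drug scores, and all names are hypothetical; the toy example uses a top-2 cutoff in place of the top-100 cutoff.

```python
from collections import Counter
from typing import Dict, Set


def count_hits(scores: Dict[str, Dict[str, float]],
               trial_drugs: Set[str],
               top_k: int = 100) -> Counter:
    """For each clinical trial drug, count the genes for which the drug
    is ranked among the top_k highest-scoring candidates."""
    hits: Counter = Counter()
    for gene, drug_scores in scores.items():
        # Sort candidate drugs by decreasing triple score for this gene.
        ranked = sorted(drug_scores, key=drug_scores.get, reverse=True)[:top_k]
        for drug in ranked:
            if drug in trial_drugs:
                hits[drug] += 1
    return hits


# Toy scores for two genes and three candidate drugs (made-up values).
scores = {
    "g1": {"a": 0.9, "b": 0.1, "c": 0.5},
    "g2": {"a": 0.7, "b": 0.8, "c": 0.3},
}
hits = count_hits(scores, trial_drugs={"a", "b"}, top_k=2)
print(hits["a"], hits["b"])  # 2 1
```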
|Drug name||# hits||Drug name||# hits|
Drug-inhibits-gene scores for Covid-19. Note that a random classifier would yield approximately 5.3 hits per drug. This suggests that the reported predictions are significantly better than random.
In this paper we develop a novel I-RGCN that learns inductive relation embeddings and can be applied to few-shot link prediction and drug repurposing. I-RGCN consistently outperforms baseline models on the IMDB and DBLP datasets for few-shot link prediction. We also formulate the Covid-19 drug-repurposing task as a link prediction task over the DRKG. I-RGCN successfully identifies a subset of clinical trial drugs for Covid-19 and can be used to assist researchers and prioritize existing drugs for further investigation in Covid-19 treatment.
We use the IMDB and DBLP datasets (fu2020magnn) detailed in Table 1, where the third column denotes the total number of edges in the few-shot relation. The nodes in the IMDB and DBLP graphs are associated with feature vectors. The original datasets in (fu2020magnn) are used for node classification. We adapt the datasets and create new edge types, where the edges are parametrized by the label of the associated nodes. For example, the edge type (director, directed, movie) becomes (director, directed_drama, movie) if the associated movie is in the drama genre, and the (actor, played, movie) relation undergoes the same transformation. Since there are 3 labels for movies, the original 4 edge types become 12. The same transformation is applied to the DBLP dataset.
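The edge-type transformation can be illustrated with the short sketch below. The data layout and names are hypothetical, and the sketch assumes that every edge touches exactly one labeled node (the movie in IMDB), whose label is appended to the relation name.

```python
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]    # (head type, relation, tail type)
Edge = Tuple[str, EdgeType, str]   # (head id, edge type, tail id)


def relabel_edge_types(edges: List[Edge],
                       node_label: Dict[str, str]) -> List[Edge]:
    """Append the label of the labeled endpoint to the relation name."""
    relabeled = []
    for head, (head_type, rel, tail_type), tail in edges:
        # Assumption: exactly one endpoint carries a class label.
        labeled = tail if tail in node_label else head
        new_rel = f"{rel}_{node_label[labeled]}"
        relabeled.append((head, (head_type, new_rel, tail_type), tail))
    return relabeled


edges = [
    ("d1", ("director", "directed", "movie"), "m1"),
    ("a1", ("actor", "played", "movie"), "m2"),
]
node_label = {"m1": "drama", "m2": "comedy"}
new_edges = relabel_edge_types(edges, node_label)
print(new_edges[0][1])  # ('director', 'directed_drama', 'movie')
```

With 3 movie labels, each of the 4 original edge types can split into 3 labeled variants, which gives the 12 edge types mentioned above.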