Entity Context and Relational Paths for Knowledge Graph Completion

02/17/2020 ∙ by Hongwei Wang, et al. ∙ Stanford University

Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. While many different methods have been proposed, there is a lack of a unifying framework that would lead to state-of-the-art results. Here we develop PathCon, a knowledge graph completion method that harnesses four novel insights to outperform existing methods. PathCon predicts relations between a pair of entities by: (1) considering the Relational Context of each entity by capturing the relation types adjacent to the entity, modeled through a novel edge-based message passing scheme; (2) considering the Relational Paths, capturing all paths between the two entities; and (3) adaptively integrating the Relational Context and Relational Paths through a learnable attention mechanism. Importantly, (4) in contrast to conventional node-based representations, PathCon represents context and paths using only the relation types, which makes it applicable in an inductive setting. Experimental results on knowledge graph benchmarks as well as our newly proposed dataset show that PathCon outperforms state-of-the-art knowledge graph completion methods by a large margin. Finally, PathCon is able to provide interpretable explanations by identifying the context relations and the paths that are important for a given predicted relation.




1. Introduction

Knowledge graphs (KGs) store structured information of real-world entities and facts. A KG usually consists of a collection of triplets. Each triplet $(h, r, t)$ indicates that head entity $h$ is related to tail entity $t$ through relation type $r$.

A range of important applications, including search (Xiong et al., 2017b), question answering (Huang et al., 2019), recommender systems (Wang et al., 2019c), and machine reading comprehension (Yang and Mitchell, 2017), all critically rely on existing KGs such as FreeBase (Bollacker et al., 2008), WordNet (Miller, 1995), NELL (Carlson et al., 2010), as well as Google Knowledge Graph (https://developers.google.com/knowledge-graph).

Nonetheless, KGs are often incomplete and noisy. To address this issue, researchers have proposed a number of KG completion methods to predict missing links/relations in KGs. These methods can be classified into two categories. The first class is embedding-based methods (Bordes et al., 2013; Trouillon et al., 2016; Yang et al., 2015; Sun et al., 2019; Kazemi and Poole, 2018; Zhang et al., 2019b), which learn an embedding vector for each entity and relation by minimizing a predefined loss function on all triplets. Such methods have the advantage that they consider the structural context of a given entity in the KG, but they fail to capture the multiple relationships (paths) between the head and the tail entity, which are very important for KG completion.

In contrast, the second class of methods is rule-based (Galárraga et al., 2015; Yang et al., 2017; Ho et al., 2018; Zhang et al., 2019a; Sadeghian et al., 2019), which aims to learn general logical rules from KGs by modeling paths between the head and the tail entities. However, a significant drawback of these methods is that meaningful rules are usually very rare, which limits their capability of predicting missing relations that are not covered by known rules.

Present work. Our work stems from the observation that two aspects are required for successful KG completion (Figure 1): (1) It is important to capture the relational context of a given entity in the KG (Figure 1(a)). The relations an entity has with other entities capture its context and provide valuable information about the nature or the “type” of the entity. Many entities in KGs are not typed or are very loosely typed, so being able to learn about the entity and its context in the KG is valuable. (2) It is also important to capture the set of different multi-faceted relational paths between the head and the tail entities (Figure 1(b)). Here different paths of connections between the entities reveal the nature of their relationship and help with the prediction. However, it is not enough for the model to have these two components independently; they also have to be combined properly. In particular, the importance of different paths between the head and the tail entity needs to depend both on the relational context of the two entities and on the relation being modeled.

(a) Suppose we aim to predict whether Ron Weasley or Hedwig is a Pet of Harry Potter. Both entities have the same relational path (Lives with) to Harry Potter, but they have distinct relational context: the relation types adjacent to Ron Weasley differ from those adjacent to Hedwig (Lives with, Bought). Capturing the relational context of entities allows PathCon to make a distinction between Ron Weasley, who is a person, and Hedwig, which is an owl.
(b) Two head entities, Hermione Granger and Draco Malfoy, have the same relational context but different paths to the tail entity Harry Potter: (House, House) and (Occupation, Occupation) vs. only (Occupation, Occupation). This allows PathCon to distinguish Hermione Granger from Draco Malfoy when predicting friendship with Harry Potter.
Figure 1. (a) Relational context of an entity and (b) relational paths between entities. PathCon is able to capture both.

Here we propose PathCon, a new method that combines relational context and relational paths for KG completion. PathCon models relations rather than entities which makes the model explainable and generalizable to inductive settings. Specifically, PathCon harnesses the following four novel insights to outperform existing methods:

  • Relational Context: We design a multi-layer edge-based message passing scheme to aggregate messages from the $K$-hop neighborhood edges of a given entity. The aggregated result captures the structure of adjacent relation types of the entity. For example, in Figure 1(a), the 1-hop relational context of entity Hedwig is captured by its neighboring relations (Lives with, Bought).

  • Relational Paths: We identify all paths from the head entity to the tail entity in the KG. Each path is represented by its relation types. For example, in Figure 1(a), the relational path between Harry Potter and Hagrid is (Lives with, Bought), and in Figure 1(b), the relational paths between Harry Potter and Hermione Granger are (House, House) and (Occupation, Occupation).

  • Importantly, the paths as well as the context are captured based on the sequence/structure of the relation types they contain (and not based on the identities of the entities). This is important as it provides better inductive bias and allows PathCon to be applicable in inductive settings where new entities not present during training can enter the KG and PathCon can still model them.

  • Furthermore, in PathCon the importance of paths depends both on the relation they are aiming to model as well as the relational context provided by the two entities. Therefore, PathCon uses a learnable attention score for each path based on the context information of the entity pair, and then aggregates path representations weighted by their attention scores.

A further benefit of our PathCon approach is that it provides interpretability and explainability. It allows us to identify the important relational context that determines the relation between a given pair of entities. Similarly, in PathCon different relational paths have different weights/attention scores, and we use these scores to identify important paths that explain the reasons for a given predicted relation.

We conduct extensive experiments on five KG datasets as well as a new KG dataset proposed by us. Experimental results demonstrate that PathCon significantly outperforms state-of-the-art KG completion methods; for example, the absolute Hit@1 gain over the best baseline is particularly large on WN18RR and NELL995. Our extensive ablation studies show the effectiveness of our approach and demonstrate the importance of relational context as well as paths. Our method is also shown to maintain strong performance in inductive KG completion, and it provides high explainability by identifying important relational context and relational paths for a given predicted relation.

Figure 2. An example of PathCon considering both the relational context within 2 hops of the head and the tail entities (denoted by red edges) and relational paths of length up to 3 relations that connect head to tail (denoted by green arrows). Context and paths are captured based on the relation types (not entities) they contain. By combining the context and paths, PathCon predicts the probability of relation $r$.


2. Problem Formulation

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an instance of a knowledge graph, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. Each edge $e$ has a relation type $r \in \mathcal{R}$. Our goal is to predict the missing links in $\mathcal{G}$, i.e., given an entity pair $(h, t)$, we aim to predict the relation of the edge between them. (Some of the related work formulates this problem as predicting the missing tail (head) entity given a head (tail) entity and a relation. The two problems are actually reducible to each other: given a model that outputs the distribution over relation types for an entity pair $(h, t)$, we can build a model that outputs the distribution over tail entities given $h$ and $r$, and vice versa. Since the two problems are equivalent, we only focus on relation prediction in this work.) Specifically, we aim to model the distribution over relation types given a pair of entities $(h, t)$: $p(r \mid h, t)$. This is equivalent to modeling the following term

$$p(r \mid h, t) \propto p(h, t \mid r) \cdot p(r) \tag{1}$$

according to Bayes' theorem. In Eq. (1), $p(r)$ is the prior distribution over relation types and serves as the regularization of the model. Then the first term can be further decomposed to

$$p(h, t \mid r) = \frac{1}{2} \Big( p(h \mid r) \cdot p(t \mid h, r) + p(t \mid r) \cdot p(h \mid t, r) \Big) \tag{2}$$
Eq. (2) sets up the guideline for designing our model. The term $p(h \mid r)$ or $p(t \mid r)$ measures the likelihood of an entity given a particular relation. Since our model does not consider the identity of entities, we use an entity's local relational subgraph to represent the entity itself, i.e., $p(C(h) \mid r)$ and $p(C(t) \mid r)$, where $C(\cdot)$ denotes the local relational subgraph of an entity. This is also known as the relational context of $h$ and $t$. The term $p(t \mid h, r)$ or $p(h \mid t, r)$ in Eq. (2) measures the likelihood that $t$ can be reached from $h$, or the other way around, given that there is a relation $r$ between them. This inspires us to model the connection paths between $h$ and $t$ in the KG. In the following we show how to model the two factors in our method and how they contribute to link prediction in KGs.

Symbol                      Description
$h$, $t$                    Head entity and tail entity
$r$                         Relation type
$s_e^i$                     Hidden state of edge $e$ at iteration $i$
$m_v^i$                     Message of node $v$ at iteration $i$
$N(e)$                      Endpoint nodes of edge $e$
$N(v)$                      Neighbor edges of node $v$
$s_{(h,t)}$                 Context representation of the entity pair $(h, t)$
$s_{h \to t}$               Path representation of all paths from $h$ to $t$
$\alpha_P$                  Attention weight of path $P$
$\mathcal{P}_{h \to t}$     Set of paths from $h$ to $t$
Table 1. Notation used in this paper.

3. Our Approach

PathCon captures the relational context (Section 3.1) and the relational paths (Section 3.2) of an entity pair, and combines them together to predict relations (Section 3.3). We show that PathCon is able to learn explainable rules (Section 3.4), and finally discuss several design alternatives (Section 3.5).

3.1. Relational Context

For a KG triplet $(h, r, t)$, the relational context of $h$ and $t$ is usually highly correlated with $r$. For example, if $r$ is graduated_from, it is reasonable to guess with high probability that the surrounding relations of $h$ are person.birthplace, person.gender, etc., and the surrounding relations of $t$ are institution.location, university.founder, university.president, etc. In this subsection, we propose to use a message passing scheme to capture such relational context of an entity.

Denote $s_e^0$ as the initial feature of edge $e$, which can be taken as the one-hot identity vector of the relation type that $e$ belongs to. In cases where relation types have names, initial features can also be bag-of-words (BOW) or sentence embeddings learned by language models like BERT (Devlin et al., 2018). Given initial features of edges, we design a message passing scheme to learn the representation of each edge by iteratively aggregating messages from its multi-hop neighbor edges. In iteration $i$, the hidden state of edge $e$ is updated according to the following equations:

$$m_v^i = \sum_{e \in N(v)} s_e^i \tag{3}$$

$$s_e^{i+1} = A\big(m_v^i, m_u^i, s_e^i\big), \quad v, u \in N(e) \tag{4}$$

As shown in Eq. (3), for each node $v$, we sum up the hidden states of the edges that $v$ connects to and get message $m_v^i$, where $N(v)$ denotes the set of neighbor edges of node $v$. Then in Eq. (4), we calculate the hidden state of edge $e$ for iteration $i+1$ by aggregating messages from its two endpoints $v$ and $u$ as well as the hidden state of $e$ itself in iteration $i$, where $N(e)$ denotes the two endpoints of edge $e$. The aggregation operation in Eq. (4) is abstracted as $A(\cdot)$. In PathCon, we implement $A(\cdot)$ as the concatenation function:

Concat neighbor aggregator. In iteration $i$, given the hidden state $s_e^i$ for edge $e$ as well as the messages $m_v^i$ and $m_u^i$ from its two endpoints $v$, $u$, the Concat neighbor aggregator calculates the hidden state $s_e^{i+1}$ by first concatenating the three input vectors and then applying a nonlinear transformation:

$$s_e^{i+1} = \sigma\big( [m_v^i, m_u^i, s_e^i] \cdot W^i + b^i \big), \quad v, u \in N(e)$$

where $[\cdot]$ denotes the concatenation operation, and $W^i$, $b^i$, and $\sigma(\cdot)$ are the learnable transformation matrix, bias, and nonlinear activation function, respectively. It can be seen that the Concat neighbor aggregator preserves the order of the two input endpoints. We shall discuss other implementations of $A(\cdot)$ in Section 3.5 and examine their performance in experiments.

The message passing in Eqs. (3) and (4) is repeated $K$ times. The final messages $m_h^{K-1}$ and $m_t^{K-1}$ are taken as the relational context representation for head $h$ and tail $t$, respectively. We also give an illustrative example of relational context for $h$ and $t$ in Figure 2, where the red/pink edges denote the first-order/second-order contextual relations.

We would like to emphasize that the message passing scheme in Eqs. (3) and (4) is based on edges, i.e., in each iteration we pass and transform messages of edges to their neighbor edges, and we update the hidden state of each edge after each iteration. Though in Eq. (3) we calculate a message for each node, nodes just serve as “distribution centers” that collect and temporarily store the messages from their neighbor edges, then propagate the aggregated messages back to each of them. We pass messages alternately between nodes and edges rather than directly between edges in order to improve computational efficiency. More analysis on the computational efficiency of message passing schemes is included in Appendix A.
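The alternating node/edge scheme described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' TensorFlow implementation: the function names `concat_aggregate` and `edge_message_passing` are hypothetical, the weights are random and untrained, tanh is assumed as the nonlinearity, and masking of the ground-truth edge is omitted.

```python
import math
import random

def concat_aggregate(m_v, m_u, s_e, W, b):
    """Concat aggregator sketch: tanh([m_v, m_u, s_e] W + b)."""
    x = m_v + m_u + s_e  # list concatenation keeps the order of the two endpoints
    return [math.tanh(sum(x[k] * W[k][j] for k in range(len(x))) + b[j])
            for j in range(len(b))]

def edge_message_passing(edges, init_states, num_iters, seed=0):
    """edges: list of (v, u) endpoint pairs; init_states: one feature vector per edge.

    Returns the edge hidden states after num_iters iterations and the node
    messages computed in the last iteration.
    """
    rng = random.Random(seed)
    d = len(init_states[0])
    # N(v): indices of the edges incident to each node.
    incident = {}
    for idx, (v, u) in enumerate(edges):
        incident.setdefault(v, []).append(idx)
        incident.setdefault(u, []).append(idx)
    states = [list(s) for s in init_states]
    messages = {}
    for _ in range(num_iters):
        # Node step: each node sums the hidden states of its incident edges.
        messages = {node: [sum(states[e][j] for e in es) for j in range(d)]
                    for node, es in incident.items()}
        # Edge step: each edge combines both endpoint messages with its own state.
        # Assumption: random, untrained weights stand in for learned parameters.
        W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(3 * d)]
        b = [0.0] * d
        states = [concat_aggregate(messages[v], messages[u], states[idx], W, b)
                  for idx, (v, u) in enumerate(edges)]
    return states, messages
```

Because every edge only reads the two messages stored at its endpoints, one iteration touches each edge a constant number of times, which is the efficiency argument made in the text.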

3.2. Relational Paths

In this subsection, we follow the discussion in Section 2 and discuss how to model the term $p(t \mid h, r)$ or $p(h \mid t, r)$. Note that we do not consider node identity in the aforementioned message passing for relational context. However, this leads to a potential issue: our model is not able to identify the relative position of head and tail in the KG. For example, suppose that for a given entity pair $(h, t)$, our model figures out that $h$ is surrounded by person.birthplace, person.gender, etc., and $t$ is surrounded by institution.location, university.founder, university.president, etc. Then the model will learn that $h$ is probably a person and $t$ is probably a university, and that there should be a relation graduated_from between them because such a pattern appears frequently in the training data. However, the truth may be that the person has nothing to do with the university and they are far from each other in the KG. Such false positives occur because message passing of relational context can only detect the type of $h$ and $t$, but is not aware of their relative position in the KG.

To solve this problem, we propose to explore the connectivity pattern between $h$ and $t$. We first define the relational path from $h$ to $t$ in KGs:

Definition 1 (relational path).

A raw path from $h$ to $t$ in a KG is a sequence of entities and edges: $h(e_0)v_1(e_1)v_2 \cdots v_{L-1}(e_{L-1})t$, in which two entities $v_i$ and $v_{i+1}$ are connected by edge $e_i$, and each entity in the path is unique. (Entities in a path are required to be unique because a loop within a path does not provide additional semantics and thus should be cut off from the path.) The corresponding relational path $P$ is the sequence of relation types of all edges in the given raw path, i.e., $P = (r_{e_0}, r_{e_1}, \ldots, r_{e_{L-1}})$, where $r_{e_i}$ is the relation type of edge $e_i$.

Note that we do not use the identity of nodes when modeling relational paths, which is the same as for relational context. Denote $\mathcal{P}_{h \to t}$ as the set of all relational paths from $h$ to $t$ in the KG. Our next step is to define and calculate the representation of relational paths. In PathCon, we assign an independent embedding vector $s_P$ to each relational path $P \in \mathcal{P}_{h \to t}$. A potential concern here is that the number of different paths increases exponentially with the path length (there are $|\mathcal{R}|^k$ possible $k$-hop paths). However, in practice we observe that in real-world KGs most paths do not actually occur (e.g., only 3.2% of all possible paths of length 2 occur in the FB15K dataset), and the number of different paths is quite manageable for relatively small values of $k$ ($k \le 4$).

An illustrative example of relational paths is shown in Figure 2, where the two green arrows denote the relational paths from $h$ to $t$.

In addition, other methods for calculating path representations are also possible. We shall discuss them in Section 3.5.
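The definition of a relational path can be illustrated with a small depth-first search. The sketch below is hypothetical (`relational_paths` is not a function from the PathCon code base) and assumes that edges may be traversed in either direction, which is how paths such as (House, House) in Figure 1(b) arise. The entity names Gryffindor, Slytherin, and Student in the usage example are made up for illustration. In training, the edge whose relation is being predicted would need to be removed from `triplets` first.

```python
def relational_paths(triplets, head, tail, max_len):
    """Return the set of relational paths (tuples of relation types) from head to tail.

    Each entity may appear at most once in a raw path, and paths are limited to
    max_len edges. Assumption: edge direction is ignored during traversal.
    """
    adjacency = {}
    for h, r, t in triplets:
        adjacency.setdefault(h, []).append((r, t))
        adjacency.setdefault(t, []).append((r, h))

    found = set()

    def dfs(node, visited, relations):
        if node == tail and relations:
            found.add(tuple(relations))  # only relation types are kept, not entities
            return
        if len(relations) == max_len:
            return
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:  # enforce unique entities along the raw path
                dfs(neighbor, visited | {neighbor}, relations + [relation])

    dfs(head, {head}, [])
    return found


# Toy KG loosely modeled on Figure 1(b); intermediate entities are hypothetical.
toy_kg = [
    ("Harry Potter", "House", "Gryffindor"),
    ("Hermione Granger", "House", "Gryffindor"),
    ("Draco Malfoy", "House", "Slytherin"),
    ("Harry Potter", "Occupation", "Student"),
    ("Hermione Granger", "Occupation", "Student"),
    ("Draco Malfoy", "Occupation", "Student"),
]
hermione_paths = relational_paths(toy_kg, "Hermione Granger", "Harry Potter", max_len=2)
draco_paths = relational_paths(toy_kg, "Draco Malfoy", "Harry Potter", max_len=2)
```

Several raw paths can map to the same relational path, which is why the result is a set: the model assigns one embedding per distinct sequence of relation types.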

3.3. Combining Relational Contexts and Paths

For relational context, we use the message passing scheme to calculate the final messages $m_h^{K-1}$ and $m_t^{K-1}$ for $h$ and $t$, which summarize their respective context information. $m_h^{K-1}$ and $m_t^{K-1}$ are further combined to calculate the context of the $(h, t)$ pair:

$$s_{(h,t)} = A\big(m_h^{K-1}, m_t^{K-1}\big)$$

where $s_{(h,t)}$ denotes the context representation of the entity pair $(h, t)$. It is worth noting here that the above neighbor aggregator should only take the messages of $h$ and $t$ as input, since the ground-truth relation $r$ should be treated as unobserved in the training stage.

For relational paths, we aggregate all path embeddings together to get the final representation of relational paths:

$$s_{h \to t} = A\big(\{ s_P \}_{P \in \mathcal{P}_{h \to t}}\big)$$

where $A(\cdot)$ here denotes the aggregation function for paths. Note that there may be a number of relational paths for a given $(h, t)$ pair, but not all paths are logically related to the predicted relation $r$, and the importance of each path also varies. In PathCon, since we already know the context $s_{(h,t)}$ for the $(h, t)$ pair and it can be seen as prior information for paths between $h$ and $t$, we can calculate the importance scores of paths based on $s_{(h,t)}$. Therefore, we implement the path aggregator as an attention function:

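Attention-based path aggregator. We first calculate the attention weight of each path $P$ with respect to the context $s_{(h,t)}$:

$$\alpha_P = \frac{\exp\big(s_P^\top s_{(h,t)}\big)}{\sum_{P' \in \mathcal{P}_{h \to t}} \exp\big(s_{P'}^\top s_{(h,t)}\big)}$$

then use the attention weights to average the representations of all paths:

$$s_{h \to t} = \sum_{P \in \mathcal{P}_{h \to t}} \alpha_P \, s_P$$

where $s_{h \to t}$ is the representation of relational paths for $(h, t)$. In this way, the context information is used to assist in identifying the most important relational paths.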

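Given the relational context representation $s_{(h,t)}$ and the relational path representation $s_{h \to t}$, we can predict relations by first adding the two representations together and then taking the softmax:

$$p(r \mid h, t) = \mathrm{softmax}\big(s_{(h,t)} + s_{h \to t}\big)$$

Our model can be trained by minimizing the loss between predictions and ground truths over the training triplets:

$$\min \mathcal{L} = \sum_{(h, r, t) \in \mathcal{D}} J\big(p(r \mid h, t), r\big)$$

where $\mathcal{D}$ is the training set and $J(\cdot)$ is the cross-entropy loss.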

It is worth noticing that the context representation plays two roles in the model: it directly contributes to the predicted relation distribution, and it also helps determine the importance of relational paths with respect to the predicted relation.
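The two roles of the context representation can be made concrete with a small numeric sketch. It assumes, for illustration only, that the context vector and every path embedding have one dimension per relation type, so that their sum can be passed directly to a softmax; the helper names `predict_relation` and `cross_entropy` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_relation(context, path_embeddings):
    """Return (relation probabilities, attention weights over paths).

    context: context representation of the entity pair (list of floats).
    path_embeddings: one embedding per relational path, same length as context.
    """
    if path_embeddings:
        # Role 2 of the context: score each path by its dot product with the context.
        scores = [sum(p_k * c_k for p_k, c_k in zip(p, context)) for p in path_embeddings]
        alpha = softmax(scores)
        path_repr = [sum(a * p[k] for a, p in zip(alpha, path_embeddings))
                     for k in range(len(context))]
    else:
        # No connecting path: fall back to the context alone.
        alpha, path_repr = [], [0.0] * len(context)
    # Role 1 of the context: it is added directly to the path representation.
    probs = softmax([c + s for c, s in zip(context, path_repr)])
    return probs, alpha

def cross_entropy(probs, true_relation):
    """Cross-entropy loss for a single triplet with a one-hot ground truth."""
    return -math.log(probs[true_relation])
```

For instance, with `context = [1.0, 0.0, 0.0]` and paths `[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]`, the first path receives the larger attention weight because it aligns with the context, and relation 0 gets the highest predicted probability.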

3.4. Discussion on Model Explainability

Since PathCon only models relations without entities, it is able to capture pure relationships among different relation types and thus can naturally be used to explain predictions. The explainability of PathCon is two-fold. On the one hand, modeling relational context captures the correlation between contextual relations and the predicted relation, which can be used to indicate important neighbor edges for the given relation. This can be achieved by studying the transformation matrix in context message passing or by using external explanation tools (Ying et al., 2019). For example, institution.location, university.founder, and university.president can be identified as important contextual relations for graduated_from. On the other hand, modeling relational paths captures the correlation between paths and the predicted relation, which can indicate important relational paths for the given relation. This can be achieved by studying the transformation matrix or attention weights in path modeling. For example, (schoolmate_of, graduated_from) can be identified as an important relational path for graduated_from. It is interesting to see that the explainability provided by relational paths is also connected to first-order logical rules of the following form:

$$B_1 \wedge B_2 \wedge \cdots \wedge B_L \Rightarrow H$$

where $\bigwedge_i B_i$ is the conjunction of relations in a path and $H$ is the predicted relation. The above example of a relational path can therefore be written as the following rule:

$$(h, \text{schoolmate\_of}, v) \wedge (v, \text{graduated\_from}, t) \Rightarrow (h, \text{graduated\_from}, t)$$
Therefore, PathCon can also be used to learn logical rules from KGs just as prior work (Galárraga et al., 2015; Yang et al., 2017; Ho et al., 2018; Zhang et al., 2019a; Sadeghian et al., 2019).
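As a rough illustration of the link between relational paths and logical rules, the hypothetical helpers below render a relational path as a rule string and rank paths by attention weight. The variable naming (h, v1, v2, ..., t), the ASCII connectives, and the attention weights in the usage example are assumptions made for this sketch. Since relational paths ignore edge direction, reading each body atom from left to right is a simplification.

```python
def path_to_rule(relational_path, predicted_relation):
    """Format a non-empty relational path as a rule: body atoms => head atom."""
    # Intermediate entities become variables v1, v2, ...; endpoints are h and t.
    variables = ["h"] + [f"v{i}" for i in range(1, len(relational_path))] + ["t"]
    atoms = [f"({variables[i]}, {relation}, {variables[i + 1]})"
             for i, relation in enumerate(relational_path)]
    return " AND ".join(atoms) + f" => (h, {predicted_relation}, t)"

def top_rules(paths_with_weights, predicted_relation, k=3):
    """Return the k rules whose paths have the largest attention weights."""
    ranked = sorted(paths_with_weights, key=lambda item: item[1], reverse=True)
    return [path_to_rule(path, predicted_relation) for path, _ in ranked[:k]]


# Made-up attention weights, for illustration only.
example = top_rules(
    [(("person.birthplace",), 0.1), (("schoolmate_of", "graduated_from"), 0.9)],
    "graduated_from",
    k=1,
)
```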

3.5. Design Alternatives

Next we discuss several design alternatives for PathCon. In our ablation experiments we shall also consider the following alternative implementations.

When modeling relational context, we propose two alternatives for neighbor aggregator:

Mean neighbor aggregator. It takes the element-wise mean of the input vectors, followed by a nonlinear transformation function:

$$s_e^{i+1} = \sigma\Big( \tfrac{1}{3}\big(m_v^i + m_u^i + s_e^i\big) \cdot W^i + b^i \Big), \quad v, u \in N(e)$$

The output of the Mean aggregator is invariant to the permutation of its two input nodes, indicating that it treats the head and the tail equally in a triplet.

Cross neighbor aggregator. It is inspired by combinatorial features in recommender systems (Wang et al., 2019b), which measure the interaction of unit features (e.g., AND(gender=female, language=English)). Note that the Mean and Concat neighbor aggregators simply transform messages from the two input nodes separately and add them up, without modeling the interaction between them that might be useful for link prediction. In the Cross neighbor aggregator, we first calculate all element-level pairwise interactions between messages from the head and the tail:

$$m_v^i \, {m_u^i}^\top = \begin{bmatrix} m_v^{i\,(1)} m_u^{i\,(1)} & \cdots & m_v^{i\,(1)} m_u^{i\,(d)} \\ \vdots & \ddots & \vdots \\ m_v^{i\,(d)} m_u^{i\,(1)} & \cdots & m_v^{i\,(d)} m_u^{i\,(d)} \end{bmatrix}$$

where we use a superscript in parentheses to indicate the element index and $d$ is the dimension of $m_v^i$ and $m_u^i$. Then we summarize all interactions by flattening the interaction matrix into a vector and multiplying it by a transformation matrix:

$$s_e^{i+1} = \sigma\Big( \mathrm{flatten}\big(m_v^i \, {m_u^i}^\top\big) \cdot W_1^i + s_e^i \cdot W_2^i + b^i \Big), \quad v, u \in N(e)$$

It is worth noting that the Cross neighbor aggregator preserves the order of input nodes.
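A minimal sketch of the Cross aggregator helps show both its order sensitivity and its parameter cost. It assumes tanh as the nonlinearity, a row-major flattening of the interaction matrix, and that messages and hidden states share the same dimension $d$; `cross_aggregate` and `param_counts` are hypothetical names, and the parameter counts follow only from the shapes assumed here.

```python
import math

def cross_aggregate(m_v, m_u, s_e, W1, W2, b):
    """Cross aggregator sketch: tanh(flatten(m_v m_u^T) W1 + s_e W2 + b)."""
    d = len(m_v)
    # Row-major flattening of the d x d outer product of the two messages.
    flat = [m_v[i] * m_u[j] for i in range(d) for j in range(d)]
    out = []
    for k in range(len(b)):
        z = sum(flat[p] * W1[p][k] for p in range(d * d))
        z += sum(s_e[q] * W2[q][k] for q in range(len(s_e)))
        out.append(math.tanh(z + b[k]))
    return out

def param_counts(d):
    """Per-layer parameter counts when all vectors have dimension d (assumption)."""
    return {
        "concat": 3 * d * d + d,         # W: 3d x d, plus bias of size d
        "cross": d ** 3 + d * d + d,     # W1: d^2 x d, W2: d x d, plus bias of size d
    }
```

Under these assumptions, `param_counts(64)` gives 12,352 parameters per layer for Concat versus 266,304 for Cross, which illustrates why Cross needs more running time and memory.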

Learning path representation with RNN. When modeling relational paths, recurrent neural networks (RNNs) can be used to learn the representation of a relational path $P = (r_{e_0}, r_{e_1}, \ldots, r_{e_{L-1}})$:

$$s_P = \mathrm{RNN}\big(r_{e_0}, r_{e_1}, \ldots, r_{e_{L-1}}\big)$$

The advantage of an RNN over path embeddings is that its number of parameters is fixed and does not depend on the number of relational paths. Another potential benefit is that an RNN can hopefully capture the similarity among different relational paths based on the sequence of relations.

Mean path aggregator. When calculating the final representation of relational paths for the $(h, t)$ pair, we can also simply average the representations of all paths from $h$ to $t$:

$$s_{h \to t} = \frac{1}{|\mathcal{P}_{h \to t}|} \sum_{P \in \mathcal{P}_{h \to t}} s_P$$

The Mean path aggregator can be used when the representation of relational context is unavailable, since it does not require attention weights as input.

4. Experiments

In this section, we evaluate the proposed PathCon model, and present its performance on six KG datasets. The code and all datasets are available at https://github.com/hwwang55/PathCon.

4.1. Experimental Setup

               FB15K     FB15K-237   WN18      WN18RR    NELL995   DDB14
#nodes         14,951    14,541      40,943    40,943    63,917    9,203
#relations     1,345     237         18        11        198       14
#training      483,142   272,115     141,442   86,835    137,465   36,561
#validation    50,000    17,535      5,000     3,034     5,000     4,000
#test          59,071    20,466      5,000     3,134     5,000     4,000
avg. degree    64.6      37.4        6.9       4.2       4.3       7.9
Table 2. Statistics of all datasets. “avg. degree” means average node degree of the KG.
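As a quick sanity check on Table 2, the reported average degrees appear to match 2 × #training / #nodes, i.e., each training triplet contributes to the degree of both of its endpoints. This is an observation from the numbers rather than something the text states, so the formula in the sketch below should be read as an assumption.

```python
# (number of nodes, number of training triplets, reported avg. degree) from Table 2.
TABLE_2 = {
    "FB15K": (14951, 483142, 64.6),
    "FB15K-237": (14541, 272115, 37.4),
    "WN18": (40943, 141442, 6.9),
    "WN18RR": (40943, 86835, 4.2),
    "NELL995": (63917, 137465, 4.3),
    "DDB14": (9203, 36561, 7.9),
}

def average_degree(num_nodes, num_edges):
    """Average node degree when every edge counts once for each endpoint."""
    return 2 * num_edges / num_nodes

# Gap between the computed and the reported value for each dataset.
gaps = {name: abs(average_degree(nodes, train) - reported)
        for name, (nodes, train, reported) in TABLE_2.items()}
```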
Model      FB15K                           FB15K-237                       WN18
           MRR     MR      Hit@1   Hit@3   MRR     MR      Hit@1   Hit@3   MRR     MR      Hit@1   Hit@3
TransE     0.962   1.684   0.940   0.982   0.966   1.352   0.946   0.984   0.971   1.160   0.955   0.984
ComplEx    0.901   1.553   0.844   0.952   0.924   1.494   0.879   0.970   0.985   1.098   0.979   0.991
DistMult   0.661   2.555   0.439   0.868   0.875   1.927   0.806   0.936   0.786   1.501   0.584   0.987
RotatE     0.979   1.206   0.967   0.986   0.970   1.315   0.951   0.980   0.984   1.139   0.979   0.986
SimplE     0.983   1.308   0.972   0.991   0.971   1.407   0.955   0.987   0.972   1.256   0.964   0.976
QuatE      0.984   1.207   0.972   0.991   0.974   1.283   0.958   0.988   0.981   1.170   0.975   0.983
DRUM       0.945   1.527   0.945   0.978   0.959   1.541   0.905   0.958   0.969   1.165   0.956   0.980
Table 3. Relation prediction results on FB15K, FB15K-237, and WN18. Best results are highlighted in bold.
Model      WN18RR                          NELL995                          DDB14
           MRR     MR      Hit@1   Hit@3   MRR     MR       Hit@1   Hit@3   MRR     MR      Hit@1   Hit@3
TransE     0.784   2.079   0.669   0.870   0.841   5.253    0.781   0.889   0.966   1.161   0.948   0.980
ComplEx    0.840   2.053   0.777   0.880   0.703   23.040   0.625   0.765   0.953   1.287   0.931   0.968
DistMult   0.847   2.024   0.787   0.891   0.634   23.530   0.524   0.720   0.927   1.419   0.886   0.961
RotatE     0.799   2.284   0.735   0.823   0.729   23.894   0.691   0.756   0.953   1.281   0.934   0.964
SimplE     0.730   3.259   0.659   0.755   0.716   26.120   0.671   0.748   0.924   1.540   0.892   0.948
QuatE      0.823   2.404   0.767   0.852   0.752   21.340   0.706   0.783   0.946   1.347   0.922   0.962
DRUM       0.854   1.575   0.778   0.912   0.715   18.203   0.640   0.740   0.958   1.140   0.930   0.987
Table 4. Relation prediction results on WN18RR, NELL995, and DDB14. Best results are highlighted in bold.

Datasets. We conduct experiments on five standard KG benchmarks: FB15K, FB15K-237, WN18, WN18RR, NELL995, and one KG dataset proposed by us: DDB14. The statistics of the six datasets are summarized in Table 2.

FB15K (Bordes et al., 2011) contains triplets from Freebase (Bollacker et al., 2008), a large-scale KG with general human knowledge. FB15K-237 (Toutanova and Chen, 2015) is a subset of FB15K where inverse relations are removed. WN18 (Bordes et al., 2011) contains conceptual-semantic and lexical relations among English words from WordNet (Miller, 1995). WN18RR (Dettmers et al., 2018) is a subset of WN18 where inverse relations are removed. NELL995 (Xiong et al., 2017a) is extracted from the 995th iteration of the NELL system (Carlson et al., 2010) and contains general knowledge.

In addition, we present a new dataset, DDB14, that is suitable for KG-related tasks. DDB14 is collected from Disease Database (http://www.diseasedatabase.com), which is a medical database containing terminologies and concepts such as diseases, symptoms, drugs, as well as their relationships. We randomly sample two subsets of 4,000 triplets from the original one as the validation set and test set, respectively.

Baselines. We compare PathCon with several state-of-the-art models, including TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016), DistMult (Yang et al., 2015), RotatE (Sun et al., 2019), SimplE (Kazemi and Poole, 2018), QuatE (Zhang et al., 2019b), and DRUM (Sadeghian et al., 2019). The first six models are embedding-based, while DRUM only uses relational paths to make predictions. We also conduct an extensive ablation study and propose two variants of our model, PathCon-context and PathCon-path, which only use context and paths, respectively, to test the performance of the two components separately.

Evaluation protocol. We evaluate all methods in the setting of relation prediction, i.e., for a given entity pair $(h, t)$ in the test set, we rank the ground-truth relation type $r$ against all other candidate relation types. Following the standard procedure in prior work, the candidate set of relation types is filtered, i.e., the candidate relation types for $(h, t)$ do not include any other relation $r'$ for which $(h, r', t)$ appears in the training, validation, or test set. Moreover, since most of the chosen baselines were previously evaluated in the setting of head/tail prediction, we modify the evaluation part of their code accordingly to fit the setting of relation prediction. For fair comparison, we also modify the negative sampling strategy in their implementations from replacing the head/tail to replacing the relation of a given triplet, and this indeed improves their performance. More details can be found in Appendix B.

We use Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit Ratio with cut-off values of 1 and 3 as evaluation metrics, as they are popular and standard metrics for measuring ranking quality. Note that a lower value of MR represents better performance, while higher values are preferred for the other metrics.
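The filtered ranking protocol and the three metrics can be sketched as follows. The helper names are hypothetical, and the tie-breaking rule (only strictly higher scores push the ground truth down) is an assumption, since the text does not say how ties are handled.

```python
def filtered_rank(scores, true_relation, other_true_relations=()):
    """1-based rank of the ground-truth relation among the filtered candidates.

    scores: one score per relation type for a given entity pair.
    other_true_relations: relations known to hold for the same pair in the
    training, validation, or test set; they are removed from the candidates.
    """
    target = scores[true_relation]
    higher = sum(1 for relation, score in enumerate(scores)
                 if relation != true_relation
                 and relation not in other_true_relations
                 and score > target)
    return 1 + higher

def ranking_metrics(ranks, cutoffs=(1, 3)):
    """Compute MRR, MR, and Hit@k from a list of 1-based ranks."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / rank for rank in ranks) / n,  # higher is better
        "MR": sum(ranks) / n,                          # lower is better
    }
    for k in cutoffs:
        metrics[f"Hit@{k}"] = sum(1 for rank in ranks if rank <= k) / n
    return metrics
```

For example, with `scores = [0.1, 0.7, 0.2]` and ground-truth relation 2, the unfiltered rank is 2, but it becomes 1 if relation 1 is also a known true relation for the same pair and is therefore filtered out.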

Implementation details. Our proposed method is implemented in TensorFlow and trained on a single GPU. We use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 0.005. L2 regularization is used to prevent overfitting. The batch size is 128, the number of epochs is 20, and the dimension of all hidden states is 64. Initial relation features are set to their identities, but we shall examine BOW/BERT features in Section 4.3. The above settings are determined by optimizing the classification accuracy on the validation set of WN18RR and are kept unchanged for all datasets.

During experiments we find that the best number of context hops, maximum path length, and implementation of the neighbor aggregator largely depend on the dataset, so these hyper-parameters are tuned separately for each dataset. We present their default settings in Table 5, and the search spaces of hyper-parameters in Appendix C.

                       FB15K    FB15K-237   WN18    WN18RR   NELL995   DDB14
#context hops          2        2           3       3        2         3
Max path len           2        3           3       4        3         4
Neighbor aggregator    Concat   Concat      Cross   Cross    Concat    Cross
Table 5. Dataset-specific hyper-parameter settings for all datasets: the number of context hops, maximum path length, and neighbor aggregator.

Each experiment of PathCon is repeated 3 times. We report average performance and standard deviation in the following results.

Figure 3. Results of inductive KG completion on WN18RR.
Figure 4. Results of PathCon with different hops/length on WN18RR.
Figure 5. Results of PathCon-context with different neighbor aggregators.

4.2. Main Results

Comparison with baselines. The results on all datasets are reported in Tables 3 and 4. In general, our method outperforms all baselines on all datasets, including in terms of absolute Hit@1 gain over the best baseline on each of the six datasets. The improvement is most significant for WN18RR and NELL995, which are exactly the two sparsest KGs according to the average node degree shown in Table 2. This finding empirically demonstrates that PathCon maintains strong performance on sparse KGs, probably because PathCon has far fewer parameters than the baselines and is less prone to overfitting. In contrast, the performance gain of PathCon on FB15K is less significant, which may be because the density of FB15K is very high, so that it is much easier for the baselines to handle.

In addition, the results also demonstrate the stability and robustness of PathCon as we observe that most of the standard deviations are quite small.

Results in Tables 3 and 4 also show that, in many cases, PathCon-context or PathCon-path alone is already able to beat most of the baselines. Combining relational context and relational paths usually leads to even better performance.

Inductive KG completion. We also examine the performance of our method in inductive KG completion. We randomly sample a subset of nodes that appear in the test set, then remove these nodes along with their associated edges from the training set. The remaining training set is used to train the models, and we add back the removed edges during evaluation. Therefore, the evaluation setting transforms from fully transductive to fully inductive when the ratio of removed nodes increases from 0 to 1. The results of PathCon, DistMult, and RotatE are plotted in Figure 3. We observe that the performance of our method decreases only slightly in the fully inductive setting (from 0.954 to 0.922), while DistMult and RotatE fall to the level of random guessing. This is because the baselines are embedding-based models that rely on modeling node identity, while our method does not consider node identity and is thus naturally generalizable to the inductive setting.
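The construction of the inductive setting can be sketched as below. This is a hypothetical helper, not the authors' script: it samples a fraction of the test-set nodes, removes every training triplet touching them, and returns the removed triplets so that they can be added back to the graph at evaluation time.

```python
import random

def inductive_split(train, test, ratio, seed=0):
    """Split training triplets for inductive evaluation.

    train, test: lists of (head, relation, tail) triplets.
    ratio: fraction of test-set nodes to hide during training
           (0 = fully transductive, 1 = fully inductive).
    Returns (kept training triplets, held-out triplets, removed nodes).
    """
    rng = random.Random(seed)
    test_nodes = sorted({entity for h, _, t in test for entity in (h, t)})
    removed = set(rng.sample(test_nodes, round(ratio * len(test_nodes))))
    kept, held_out = [], []
    for triplet in train:
        h, _, t = triplet
        # A triplet is held out if either endpoint was removed.
        (held_out if h in removed or t in removed else kept).append(triplet)
    return kept, held_out, removed
```

A model that stores one embedding per entity has no trained parameters for the removed nodes, whereas a model built only on relation types can still use the held-out edges as context and paths once they are added back.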

4.3. Model Variants

The number of context hops and maximum path length. We investigate the sensitivity of our model to the number of context hops and the maximum path length. We vary the two numbers from 0 to 4 (0 means the corresponding module is not used), and report the results of all combinations (except (0, 0)) on WN18RR in Figure 5. Increasing the number of context hops and the maximum path length significantly improves the result when they are small, which demonstrates that including more neighbor edges or counting longer paths does benefit performance. However, the marginal benefit diminishes as the number of layers increases. A similar trend is observed on the other datasets.

Neighbor aggregators. We study how different implementations of the neighbor aggregator affect model performance. The results of the Mean, Concat, and Cross neighbor aggregators on four datasets are shown in Figure 5 (results on FB15K and WN18 are omitted as they are similar to FB15K-237 and WN18RR, respectively). The results show that Mean performs worst on all datasets, which indicates the importance of node order when aggregating features from nodes to edges. It is also interesting to notice that the comparison between Concat and Cross varies across datasets: Concat is better than Cross on NELL995 and worse than Cross on WN18RR, while their performance is on par on FB15K-237 and DDB14. However, note that a significant drawback of Cross is that it has many more parameters than Concat, which requires more running time and memory.
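To make the role of node order concrete, the following minimal sketch contrasts three ways of combining the two endpoint messages of an edge. It is a simplified illustration rather than the paper's implementation: the function names and plain-list vectors are assumptions, Cross is assumed here to be a flattened outer product, and the learned weights and the edge's own state are omitted.

```python
def mean_agg(m_v, m_u):
    """Element-wise mean: output has d entries and ignores endpoint order."""
    return [(a + b) / 2 for a, b in zip(m_v, m_u)]

def concat_agg(m_v, m_u):
    """Concatenation: output has 2d entries and preserves endpoint order."""
    return list(m_v) + list(m_u)

def cross_agg(m_v, m_u):
    """Flattened outer product: output has d*d entries."""
    return [a * b for a in m_v for b in m_u]

m_v = [1.0, 2.0]
m_u = [3.0, 5.0]

# Mean cannot tell (head, tail) from (tail, head); Concat can.
mean_is_symmetric = mean_agg(m_v, m_u) == mean_agg(m_u, m_v)
concat_is_symmetric = concat_agg(m_v, m_u) == concat_agg(m_u, m_v)
output_sizes = (len(mean_agg(m_v, m_u)), len(concat_agg(m_v, m_u)), len(cross_agg(m_v, m_u)))
```

Because the Cross output grows quadratically with the hidden dimension, any layer that consumes it needs correspondingly more parameters, which is consistent with the cost noted above.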

Figure 6. Results of PathCon with different types of path representation and path aggregators on WN18RR.
Figure 7. Results of PathCon-context, PathCon-path, and PathCon with different initial relation features on NELL995.

Path representation types and path aggregators. We implement four combinations of path representation types and path aggregators: Embedding+Mean, Embedding+Attention, RNN+Mean, and RNN+Attention, whose results are presented in Figure 6. Unlike neighbor aggregators, the results for path representation types and path aggregators are similar across the six datasets, so we only report the results on WN18RR. We find that Embedding is consistently better than RNN, probably because relational paths are generally short (no longer than 4 in our experiments), so RNN can hardly demonstrate its strength in modeling sequences. The results also show that the Attention aggregator performs slightly better than the Mean aggregator. This demonstrates that the contextual information of the head and tail entities indeed helps identify the importance of relational paths.

| Dataset | Predicted relation | Important contextual relations | Important relational paths |
|---|---|---|---|
| FB15K-237 | award winner | award honored for, award nominee | (award nominated for), (award winner, award category) |
| FB15K-237 | film written by | film release region | (film edited by), (film crewmember) |
| FB15K-237 | education campus of | education major field of study | (education institution in) |
| DDB14 | may cause | may cause, belongs to the drug family of | (is a risk factor for), (see also, may cause) |
| DDB14 | is associated with | is associated with, is a risk factor for | (is associated with, is associated with) |
| DDB14 | may be allelic with | may be allelic with, belong(s) to the category of | (may cause, may cause), (may be allelic with, may be allelic with) |
Table 6. Examples of important context/paths identified by PathCon on FB15K-237 and DDB14.

Initial edge features. Here we examine three types of initial edge features: identity, BOW, and BERT embeddings of relation types. We choose to test on NELL995 because its relation names consist of relatively more English words and are thus semantically meaningful (e.g., “organization.headquartered.in.state.or.province”). The results of all variants are reported in Figure 7, which shows that BOW features are slightly better than identity, but BERT embeddings perform significantly worse than the other two. We attribute this finding to two factors: (1) the dimension of pre-trained BERT embeddings (768) may be too high for our task, and (2) BERT embeddings are better at identifying semantic relationships among relation types, whereas our method learns a mapping from the BERT embeddings of context/paths to the identity of predicted relation types. Therefore, BERT may perform better if the predicted relation types are also represented by BERT embeddings, so that the mapping can be learned within the embedding space. We leave this exploration as future work.

4.4. Case Study on Model Explainability

We choose FB15K-237 and DDB14 as the datasets to show the explainability of PathCon. The number of context hops is set to 1 and the maximum path length is set to 2. When training is completed, we choose 3 relations from each dataset and list the relational neighbors/paths that are most important to them according to the transformation matrix of the neighbor/path aggregator. The results are presented in Table 6, from which we find that most of the identified neighbors/paths are logically meaningful. For example, “education campus of” can be inferred from “education institution in”, and “is associated with” is found to be a transitive relation. More results and discussion on DDB14 are included in Appendix D.

5. Related Work

We discuss two lines of related work: knowledge graph completion and graph neural networks.

5.1. Knowledge Graph Completion

Most existing methods of KG completion are based on embeddings, which normally assign an embedding vector to each entity and relation in a continuous embedding space and train the embeddings based on the observed facts. One line of KG embedding methods is translation-based: entities are treated as points in a continuous space, and each relation translates the entity point. The objective is that the translated head entity should be close to the tail entity in real space (Bordes et al., 2013), complex space (Sun et al., 2019), or quaternion space (Zhang et al., 2019b); these methods have been shown to handle multiple relation patterns and achieve state-of-the-art results. To deal with 1-to-N/N-to-1 relations, several methods introduce relation-specific planes (Wang et al., 2014) or subspaces (Lin et al., 2015). Another line of work is multi-linear or bilinear models, which calculate semantic similarity by a matrix or vector dot product in real (Yang et al., 2015) or complex space (Trouillon et al., 2016). Besides, several embedding-based methods explore architecture designs that go beyond point vectors (Socher et al., 2013; Dettmers et al., 2018; Jiang et al., 2019). However, these embedding-based models fail to predict links in the inductive setting, nor can they discover any rules that explain the prediction.
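As a reference point for the two main families above, the sketch below scores a triplet with a translation-based function in the style of TransE and a bilinear function in the style of DistMult, then picks the best-scoring relation for an entity pair. The toy vectors and the helper `predict_relation` are illustrative assumptions, not code from any of the cited systems.

```python
import math

def transe_score(h, r, t):
    """Negative L2 distance between the translated head h + r and the tail t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """Trilinear product sum_i h_i * r_i * t_i (symmetric in h and t)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def predict_relation(h, t, relations, score_fn):
    """Return the name of the relation with the highest score for the pair (h, t)."""
    return max(relations, key=lambda name: score_fn(h, relations[name], t))

# Toy embeddings: every entity and relation needs its own learned vector.
h = [1.0, 0.0]
t = [1.0, 1.0]
relations = {"r_a": [0.0, 1.0], "r_b": [1.0, 0.0]}

best_by_transe = predict_relation(h, t, relations, transe_score)
```

Both functions look up a vector per entity, so an entity that never appeared during training has no embedding to score with, which is the limitation in the inductive setting noted above.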

Some prior work also considers modeling paths in KGs. For example, Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), and IterE (Zhang et al., 2019a) try to learn logical rules by modeling the paths that connect the head entity and the tail entity. However, they fail to consider the neighbor structure of the predicted relations and are thus not expressive enough for settings where paths are sparse.

There is also work that considers the context of entities explicitly. For example, A2N (Bansal et al., 2019) and COKE (Wang et al., 2019a) propose to leverage contextual information for link prediction by attending to the neighbor entities, whereas in our work we consider neighbor relations as context.

5.2. Graph Neural Networks

Existing GNNs generally follow the idea of neural message passing (Gilmer et al., 2017) that consists of two procedures: propagation and aggregation, i.e., each node on the graph propagates its feature to its neighbors and then aggregates the neighborhood features to perform one update. The two procedures are operated iteratively so as to gather messages from multi-hop neighbors. Under this framework, several GNNs are proposed that take inspiration from convolutional neural networks (Duvenaud et al., 2015; Hamilton et al., 2017; Kipf and Welling, 2017), recurrent neural networks (Li et al., 2016), recursive neural networks (Bianchini et al., 2001; Scarselli et al., 2008) and loopy belief propagation (Dai et al., 2016). However, these methods use node-based message passing, while we propose passing messages based on edges in this work.

There are two GNN models conceptually connected to our idea of identifying the relative position of nodes in a graph. PGNN (You et al., 2019) distinguishes two nodes with similar local structures by calculating the relative distance between the nodes and a set of pre-defined anchors. SEAL (Zhang and Chen, 2018) labels nodes with their distances to the two target nodes when predicting a link between them. In contrast, we use relational paths to indicate the relative position of two nodes.

Researchers have also tried to apply GNNs to knowledge graphs. For example, Schlichtkrull et al. (Schlichtkrull et al., 2018) use GNNs to model the entities and relations on KGs; however, their method does not consider relational paths and cannot make predictions in inductive settings. Wang et al. (Wang et al., 2019c) use GNNs to learn entity embeddings in KGs with the regularization of label smoothness, but their purpose is to use the learned embeddings to enhance the performance of recommender systems rather than KG completion.

6. Conclusion and Future Work

We propose PathCon for KG completion. PathCon considers two types of subgraph structure in KGs, i.e., contextual relations of the head/tail entity and relational paths between head and tail entity. We show that both context and paths are critical to the task of relation prediction, and they can be combined further to achieve better performance. The experimental results on six datasets demonstrate the superiority of our method over state-of-the-art baselines. In addition, our method is able to generalize to inductive settings, and it can provide explainable relation neighbors and paths as results.

We point out three directions for future work. First, as discussed in Section 4.3, designing a model that can better take advantage of pre-trained word embeddings is a promising direction. Second, it is worth studying why RNN does not perform well, and whether we can model relational paths better. Third, it is interesting to examine whether the context representation and path representation can be combined in a more principled way.


  • Bansal et al. (2019) Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. 2019. A2n: attending to neighbors for knowledge graph inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4387–4392.
  • Bianchini et al. (2001) Monica Bianchini, Marco Gori, and Franco Scarselli. 2001. Processing directed acyclic graphs with recursive neural networks. IEEE Transactions on Neural Networks 12, 6 (2001), 1464–1470.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 1247–1250.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
  • Bordes et al. (2011) Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence. 301–306.
  • Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence. 1306–1313.
  • Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning. 2702–2711.
  • Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence. 1811–1818.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.
  • Galárraga et al. (2015) Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. 2015. Fast rule mining in ontological knowledge bases with amie+. The VLDB Journal 24, 6 (2015), 707–730.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning. 1263–1272.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
  • Ho et al. (2018) Vinh Thinh Ho, Daria Stepanova, Mohamed H Gad-Elrab, Evgeny Kharlamov, and Gerhard Weikum. 2018. Rule learning from knowledge graphs guided by embedding models. In International Semantic Web Conference. Springer, 72–90.
  • Huang et al. (2019) Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 105–113.
  • Jiang et al. (2019) Xiaotian Jiang, Quan Wang, and Bin Wang. 2019. Adaptive convolution for multi-relational learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. 978–987.
  • Kazemi and Poole (2018) Seyed Mehran Kazemi and David Poole. 2018. Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems. 4284–4295.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In Proceedings of the 4th International Conference on Learning Representations.
  • Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence. 2181–2187.
  • Miller (1995) George A Miller. 1995. Wordnet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
  • Sadeghian et al. (2019) Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, and Daisy Zhe Wang. 2019. Drum: end-to-end differentiable rule mining on knowledge graphs. In Advances in Neural Information Processing Systems. 15321–15331.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems. 926–934.
  • Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: knowledge graph embedding by relational rotation in complex space. In Proceedings of the 7th International Conference on Learning Representations.
  • Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. 57–66.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning.
  • Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. Multi-task feature learning for knowledge graph enhanced recommendation. In The World Wide Web Conference. 2000–2010.
  • Wang et al. (2019c) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019c. Knowledge graph convolutional networks for recommender systems. In The World Wide Web Conference.
  • Wang et al. (2019a) Quan Wang, Pingping Huang, Haifeng Wang, Songtai Dai, Wenbin Jiang, Jing Liu, Yajuan Lyu, Yong Zhu, and Hua Wu. 2019a. Coke: contextualized knowledge graph embedding. arXiv preprint arXiv:1911.02168 (2019).
  • Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI conference on artificial intelligence. 1112–1119.
  • Xiong et al. (2017b) Chenyan Xiong, Russell Power, and Jamie Callan. 2017b. Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web. 1271–1279.
  • Xiong et al. (2017a) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017a. Deeppath: a reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 564–573.
  • Yang and Mitchell (2017) Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in lstms for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1436–1446.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the 3rd International Conference on Learning Representations.
  • Yang et al. (2017) Fan Yang, Zhilin Yang, and William W Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems. 2319–2328.
  • Ying et al. (2019) Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. Gnnexplainer: generating explanations for graph neural networks. In Advances in Neural Information Processing Systems. 9240–9251.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In Proceedings of the 36th International Conference on Machine Learning. 7134–7143.
  • Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems. 5165–5175.
  • Zhang et al. (2019b) Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019b. Quaternion knowledge graph embeddings. In Advances in Neural Information Processing Systems. 2731–2741.
  • Zhang et al. (2019a) Wen Zhang, Bibek Paudel, Liang Wang, Jiaoyan Chen, Hai Zhu, Wei Zhang, Abraham Bernstein, and Huajun Chen. 2019a. Iteratively learning embeddings and rules for knowledge graph reasoning. In The World Wide Web Conference. 2366–2377.


A. Computational Efficiency of the Proposed Message Passing Scheme

Analysis on node-based message passing. Consider a graph $G$ with $N$ nodes and $M$ edges. Traditional message passing methods propagate messages of nodes to their neighbor nodes and update their hidden states:

$$m_v^i = A\big(\{s_u^i\}_{u \in \mathcal{N}(v)}\big),$$
$$s_v^{i+1} = U\big(s_v^i, m_v^i\big),$$

where $\mathcal{N}(v)$ denotes the set of neighbor nodes of $v$ in the graph, $A(\cdot)$ is the message aggregation function, and $U(\cdot)$ is the node update function. This is also called node-based message passing since it considers features and hidden states of nodes. See Figure 8(a) for an illustrative example. The computational complexity of node-based message passing is given as follows:

Corollary 1.

In each iteration of node-based message passing, the aggregation operation is performed $N$ times, and each aggregation operation takes $\mathbb{E}_G[d] = 2M/N$ elements as input in expectation, where $\mathbb{E}_G[d]$ is the expected node degree of $G$. The cost of aggregation for each iteration is therefore $N \cdot \mathbb{E}_G[d] = 2M$.

Analysis on edge-based message passing. Since in this work we only model features of edges rather than nodes, a natural thought is to do edge-based message passing:

$$m_e^i = A\big(\{s_{e'}^i\}_{e' \in \mathcal{N}(e)}\big), \quad (21)$$
$$s_e^{i+1} = U\big(s_e^i, m_e^i\big), \quad (22)$$

where $\mathcal{N}(e)$ denotes the set of neighbor edges of $e$ (i.e., edges that share at least one common endpoint with $e$) in the graph. See Figure 8(b) for an illustrative example.

Edge-based message passing actually passes messages on the line graph of the original graph. The line graph of a given graph $G$, denoted by $L(G)$, is a graph such that (1) each node of $L(G)$ represents an edge of $G$, and (2) two nodes of $L(G)$ are adjacent if and only if their corresponding edges share a common endpoint in $G$. The following theorem shows that the line graph is much larger and denser than the original graph:

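Theorem 2.

The number of nodes in the line graph $L(G)$ is $M$, and the expected node degree of $L(G)$ is

$$\mathbb{E}_{L(G)}[d] = \frac{N \cdot \mathrm{Var}_G[d]}{M} + \frac{4M}{N} - 2,$$

where $\mathrm{Var}_G[d]$ is the variance of node degrees in $G$.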



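Proof. It is clear that the number of nodes in the line graph $L(G)$ is $M$ because each node in $L(G)$ corresponds to an edge in $G$. We now prove that the expected node degree of $L(G)$ is $\mathbb{E}_{L(G)}[d] = \frac{N \cdot \mathrm{Var}_G[d]}{M} + \frac{4M}{N} - 2$.

Let us first count the number of edges in $L(G)$. According to the definition of the line graph, each edge in $L(G)$ corresponds to an unordered pair of edges in $G$ connecting to the same node; conversely, each unordered pair of edges in $G$ that connect to the same node also determines an edge in $L(G)$. Therefore, the number of edges in $L(G)$ equals the number of all unordered pairs of edges connecting to the same node:

$$\#\text{edges in } L(G) = \sum_{i=1}^{N} \binom{d_i}{2} = \frac{1}{2}\sum_{i=1}^{N} d_i^2 - \frac{1}{2}\sum_{i=1}^{N} d_i = \frac{1}{2}\sum_{i=1}^{N} d_i^2 - M,$$

where $d_i$ is the degree of node $v_i$ in $G$ and $M = \frac{1}{2}\sum_{i=1}^{N} d_i$ is the number of edges. Then the expected node degree of $L(G)$ is

$$\mathbb{E}_{L(G)}[d] = \frac{2 \cdot \#\text{edges in } L(G)}{M} = \frac{\sum_{i=1}^{N} d_i^2}{M} - 2 = \frac{N\big(\mathrm{Var}_G[d] + \mathbb{E}_G[d]^2\big)}{M} - 2 = \frac{N \cdot \mathrm{Var}_G[d]}{M} + \frac{4M}{N} - 2,$$

where the last step uses $\mathbb{E}_G[d] = 2M/N$.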
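The closed form for the expected line-graph degree, $N \cdot \mathrm{Var}_G[d] / M + 4M/N - 2$, can be checked numerically on a small graph. The sketch below is only a sanity check under stated assumptions: the graph is simple and undirected, the variance is the population variance, and the function names and toy edge list are hypothetical.

```python
from itertools import combinations

def line_graph_stats(edges):
    """Return (number of nodes, mean node degree) of the line graph, by brute force."""
    m = len(edges)
    # Two edges are adjacent in the line graph iff they share an endpoint.
    adjacent_pairs = sum(1 for e1, e2 in combinations(edges, 2) if set(e1) & set(e2))
    return m, 2 * adjacent_pairs / m

def predicted_mean_degree(edges):
    """Evaluate N * Var[d] / M + 4M / N - 2 from the degree sequence of the graph."""
    nodes = sorted({v for e in edges for v in e})
    n, m = len(nodes), len(edges)
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    mean = 2 * m / n
    var = sum((d - mean) ** 2 for d in deg.values()) / n  # population variance
    return n * var / m + 4 * m / n - 2

# A hub node 0 with four spokes, plus one extra edge between two leaves.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
num_line_nodes, observed = line_graph_stats(edges)
predicted = predicted_mean_degree(edges)
```

On this toy graph the original mean degree is 2, while the line graph has mean degree 3.2, and the brute-force count agrees with the closed form.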


Figure 8. (a) Node-based message passing; (b) Edge-based message passing; (c) Aggregating green edges is redundant for the red edge and the blue edge in edge-based message passing; (d) Alternate message passing.

From Theorem 2 it is clear that $\mathbb{E}_{L(G)}[d]$ is at least twice $\mathbb{E}_G[d] = 2M/N$, i.e., the expected node degree of the original graph $G$, since $\mathrm{Var}_G[d] \geq 0$ (the constant $-2$ is omitted). Unfortunately, in real-world graphs (including KGs) node degrees vary significantly, and they typically follow a power-law distribution whose variance is extremely large due to the long tail. This means that $\mathbb{E}_{L(G)}[d] \gg \mathbb{E}_G[d]$ in general. On the other hand, the number of nodes in $L(G)$ (which is $M$) is also far larger than the number of nodes in $G$ (which is $N$). Therefore, $L(G)$ is generally much larger and denser than its original graph $G$. Based on Theorem 2, the complexity of edge-based message passing is given as follows:

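Corollary 3.

In each iteration of edge-based message passing, the aggregation operation is performed $M$ times, and each aggregation operation takes $\mathbb{E}_{L(G)}[d] = \frac{N \cdot \mathrm{Var}_G[d]}{M} + \frac{4M}{N} - 2$ elements as input in expectation. The cost of aggregation for each iteration is therefore $M \cdot \mathbb{E}_{L(G)}[d] = N \cdot \mathrm{Var}_G[d] + \frac{4M^2}{N} - 2M$.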

Analysis on alternate message passing. The cost of aggregation in edge-based message passing is time-inefficient in practice. Though we can sample a subset of neighbors for each aggregation instead of using full neighbors (Hamilton et al., 2017), message passing models are usually sensitive to the sampling size and a small number of sampled neighbors will lead to performance deterioration.

The key insight in reducing the heavy overhead of edge-based message passing is to notice that, though a large number of neighbor edges need to be aggregated for a given edge, two edges connecting to the same node share many common neighbor edges, making the aggregation of neighbor edges redundant for the two edges. For example, in Figure 8(c) we want to aggregate neighbor edges for the red edge and the blue edge, but their neighbor edges largely overlap (marked in green) since they connect to the same node. To reduce the redundant computation, we decompose the edge aggregation in Eq. (21) into two steps:

$$m_v^i = A_1\big(\{s_e^i\}_{e \in \mathcal{N}(v)}\big), \quad (26)$$
$$m_e^i = A_2\big(m_v^i, m_u^i\big), \quad v, u \in \mathcal{N}(e). \quad (27)$$

In Eq. (26), for each node $v$, we aggregate all the edges that $v$ connects to by an aggregation function $A_1(\cdot)$ and get message $m_v^i$. Then in Eq. (27), we get message $m_e^i$ of edge $e$ by aggregating messages from its two endpoints $v$ and $u$ using function $A_2(\cdot)$. We call Eqs. (26), (27), and (22) alternate message passing, as messages are passed alternately between nodes and edges. Figure 8(d) gives an illustrative example of alternate message passing.

Our proposed message passing scheme for relational context in Eqs. (3) and (4) is based on alternate message passing. To see this, notice that the message aggregation function in Eq. (26) is implemented as a sum in Eq. (3), and Eqs. (27) and (22) are combined and abstracted into Eq. (4). The complexity of alternate message passing is as follows:

Corollary 4.

In each iteration of alternate message passing, the aggregation from edges to nodes is performed $N$ times and each takes $\mathbb{E}_G[d] = 2M/N$ elements as input in expectation; the aggregation from nodes to edges is performed $M$ times and each takes $2$ elements as input. The cost of aggregation for each iteration is therefore $N \cdot \frac{2M}{N} + M \cdot 2 = 4M$.

Corollary 4 shows that alternate message passing greatly reduces the overhead of edge aggregation and brings it to the same order of magnitude as node-based message passing.
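The saving can be illustrated with a small sketch that compares direct edge-based aggregation with the two-step alternate scheme, using sum as the aggregator and scalar edge states. The function names, the toy graph, and the subtraction used to compare the two results are assumptions made for illustration; the model itself combines the endpoint messages through a learned function rather than by subtraction.

```python
def edge_based_messages(edges, state):
    """Direct aggregation: for each edge, sum the states of all edges sharing an endpoint."""
    msgs = {}
    for e in edges:
        msgs[e] = sum(state[f] for f in edges if f != e and set(e) & set(f))
    return msgs

def alternate_messages(edges, state):
    """Two-step aggregation: edges -> nodes, then nodes -> edges."""
    node_msg = {}
    # Step 1: every edge adds its state to both endpoints (2M additions in total).
    for (u, v) in edges:
        node_msg[u] = node_msg.get(u, 0.0) + state[(u, v)]
        node_msg[v] = node_msg.get(v, 0.0) + state[(u, v)]
    # Step 2: every edge reads exactly two node messages.
    return {(u, v): (node_msg[u], node_msg[v]) for (u, v) in edges}

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
state = {e: float(i + 1) for i, e in enumerate(edges)}

direct = edge_based_messages(edges, state)
alt = alternate_messages(edges, state)

# With a sum aggregator on a simple graph, the two endpoint messages contain every
# neighbor edge once and the edge itself twice, so the direct result is recoverable.
recovered = {e: alt[e][0] + alt[e][1] - 2 * state[e] for e in edges}
```

The node messages are computed once and shared by all incident edges, which is where the redundant work of the direct scheme is avoided.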

B. Implementation Details for Baselines

The implementation of TransE, DistMult, ComplEx, and RotatE is at https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding; the implementation of SimplE is at https://github.com/baharefatemi/SimplE; the implementation of QuatE is at https://github.com/cheungdaven/QuatE, and we use the QuatE variant without type constraints here; the implementation of DRUM is at https://github.com/alisadeghian/DRUM. For fair comparison, the embedding dimension for all the baselines is set to 400. We train each baseline for 1,000 epochs, and report the test result at the point where the result on the validation set is optimal. The other hyper-parameters are set to the defaults in their repositories.

These baselines were previously evaluated on head/tail prediction, i.e., predicting the missing head or tail for a given pair $(?, r, t)$ or $(h, r, ?)$. Therefore, their negative sampling strategy is to corrupt the head or the tail of a true triple $(h, r, t)$, i.e., replacing $h$ or $t$ with a randomly sampled entity $h'$ or $t'$ from the KG, and using $(h', r, t)$ or $(h, r, t')$ as the negative sample. Since our task is to predict the missing relation for a given pair $(h, t)$, we modify the negative sampling strategy accordingly by corrupting the relation $r$ of each true triplet $(h, r, t)$, and use $(h, r', t)$ as the negative sample, where $r'$ is randomly sampled from the set of relation types. Note that if $(h, r', t)$ happens to be a true triple, we remove it from the negative samples. This new negative sampling strategy does improve the performance of the baselines. For example, the Hit@1 of TransE, ComplEx, DistMult, RotatE, SimplE, and QuatE on WN18 increases from 0.931, 0.957, 0.578, 0.975, 0.951, 0.971 to 0.955, 0.979, 0.584, 0.979, 0.964, 0.975, respectively.
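The relation-corruption strategy described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `corrupt_relations`, the `num_neg` parameter, and the toy triples are hypothetical.

```python
import random

def corrupt_relations(triples, relations, num_neg=1, seed=0):
    """Build negative samples by replacing the relation of each true triple.

    triples: list of true (head, relation, tail) triples.
    relations: list of all relation types to sample from.
    Sampled triples that are themselves true are dropped, as in the text.
    """
    rng = random.Random(seed)
    true_set = set(triples)
    negatives = []
    for h, r, t in triples:
        for _ in range(num_neg):
            r_neg = rng.choice(relations)
            if (h, r_neg, t) in true_set:
                continue  # also covers the case r_neg == r
            negatives.append((h, r_neg, t))
    return negatives

triples = [("a", "r1", "b"), ("a", "r2", "b"), ("b", "r3", "c")]
relations = ["r1", "r2", "r3", "r4"]
negs = corrupt_relations(triples, relations, num_neg=3)
```

Only the relation slot changes, so every negative sample keeps an entity pair that is known to be related, and the model is trained to rank relation types rather than entities.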

C. Search Spaces of Hyper-parameters

The search spaces for hyper-parameters are as follows:

  • Dimension of hidden states: ;

  • Weight of L2 loss term: ;

  • Learning rate: ;

  • The number of context hops: ;

  • Maximum path length: .

Figure 9. Correlation between relational paths (length ) and the predicted relations learned by PathCon on DDB14.

D. More Results of Explainability on DDB14

After training on DDB14, we print out the transformation matrix of the neighbor aggregator and the path aggregator in PathCon, and the results are shown as heat maps in Figures 10 and 9, respectively. The degree of darkness of an entry in Figure 10 (Figure 9) denotes the strength of correlation between the existence of a neighbor relation (a relational path) and a predicted relation. Relation IDs as well as their meanings are listed as follows for readers’ reference:

0: belong(s) to the category of
1: is a category subset of
2: may cause
3: is a subtype of
4: is a risk factor for
5: is associated with
6: may contraindicate
7: interacts with
8: belongs to the drug family of
9: belongs to drug super-family
10: is a vector for
11: may be allelic with
12: see also
13: is an ingredient of

Figure 10 shows that most of the large values are distributed along the diagonal. This is in accordance with our intuition. For example, if we want to predict the relation for a pair $(h, t)$ and we observe that $h$ appears in another triplet $(h, \text{is a risk factor for}, t')$, then we know that $h$ is a risk factor and is likely to be a risk factor for other entities in the KG. Therefore, $(h, ?, t)$ is more likely to be “is a risk factor for” than “belongs to the drug family of”, since $h$ is not a drug. In addition, we also find some large values that are not on the diagonal, e.g., (belongs to the drug family of, belongs to the drug super-family) and (may contraindicate, interacts with).

Figure 10. Correlation between neighbor relations of head/tail and predicted relations learned by PathCon on DDB14.

We also have some interesting findings from Figure 9. First, we find that many rules from Figure 9 are of the form:


where is a relation type in the KG. These rules are indeed meaningful because means and are equivalent thus can interchange with each other.

We also find PathCon learns rules that show the relation type is transitive, for example:




Other interesting rules learned by PathCon include: