PathCon
Combining relational context and relational paths for knowledge graph completion
Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. While many different methods have been proposed, there is a lack of a unifying framework that leads to state-of-the-art results. Here we develop PathCon, a knowledge graph completion method that harnesses four novel insights to outperform existing methods. PathCon predicts relations between a pair of entities by: (1) considering the Relational Context of each entity, capturing the relation types adjacent to the entity and modeling them through a novel edge-based message passing scheme; (2) considering the Relational Paths capturing all paths between the two entities; and (3) adaptively integrating the Relational Context and Relational Paths through a learnable attention mechanism. Importantly, (4) in contrast to conventional node-based representations, PathCon represents context and paths using only the relation types, which makes it applicable in the inductive setting. Experimental results on knowledge graph benchmarks as well as our newly proposed dataset show that PathCon outperforms state-of-the-art knowledge graph completion methods by a large margin. Finally, PathCon is able to provide interpretable explanations by identifying the context relations and paths that are important for a given predicted relation.
Knowledge graphs (KGs) store structured information about real-world entities and facts. A KG usually consists of a collection of triplets. Each triplet (h, r, t) indicates that head entity h is related to tail entity t through relation type r.
A range of important applications, including search (Xiong et al., 2017b), question answering (Huang et al., 2019), recommender systems (Wang et al., 2019c), and machine reading comprehension (Yang and Mitchell, 2017), critically rely on existing KGs such as Freebase (Bollacker et al., 2008), WordNet (Miller, 1995), NELL (Carlson et al., 2010), and the Google Knowledge Graph (https://developers.google.com/knowledge-graph).
Nonetheless, KGs are often incomplete and noisy. To address this issue, researchers have proposed a number of KG completion methods to predict missing links/relations in KGs, which can be classified into two categories. The first class is embedding-based methods (Bordes et al., 2013; Trouillon et al., 2016; Yang et al., 2015; Sun et al., 2019; Kazemi and Poole, 2018; Zhang et al., 2019b), which learn an embedding vector for each entity and relation by minimizing a predefined loss function on all triplets. Such methods have the advantage of considering the structural context of a given entity in the KG, but they fail to capture the multiple relationships (paths) between the head and the tail entity, which are very important for KG completion. In contrast, the second class of methods is rule-based (Galárraga et al., 2015; Yang et al., 2017; Ho et al., 2018; Zhang et al., 2019a; Sadeghian et al., 2019), which aims to learn general logical rules from KGs by modeling paths between the head and the tail entities. However, a significant drawback of these methods is that meaningful rules are usually very rare, which limits their ability to predict missing relations not covered by known rules.

Present work. Our work stems from the observation that there are two important aspects required for successful KG completion (Figure 1): (1) It is important to capture the relational context of a given entity in the KG (Figure 1(a)). The relations an entity has with other entities capture its context and provide valuable information about the nature or "type" of the entity. Many entities in KGs are not typed or are very loosely typed, so being able to learn about the entity and its context in the KG is valuable. (2) It is also important to capture the set of different, multi-faceted relational paths between the head and the tail entities (Figure 1(b)). Here, different paths of connections between the entities reveal the nature of their relationship and help with the prediction. However, it is not enough for a model to have these two components independently; they also have to be combined properly. In particular, the importance of different paths between the head and the tail entity needs to depend both on the relational context of the two entities and on the relation the model is trying to predict.
Here we propose PathCon, a new method that combines relational context and relational paths for KG completion. PathCon models relations rather than entities, which makes the model explainable and generalizable to inductive settings. Specifically, PathCon harnesses the following four novel insights to outperform existing methods:
Relational Context: We design a multi-layer edge-based message passing scheme to aggregate messages from the multi-hop neighborhood edges of a given entity. The aggregated result captures the structure of the relation types adjacent to the entity. For example, in Figure 1(a), the 1-hop relational context of the entity Hedwig is captured by its neighboring relations (Lives with, Bought).
Relational Paths: We identify all paths from the head entity to the tail entity in the KG. Each path is represented by its sequence of relation types. For example, in Figure 1(a), the relational path between Harry Potter and Hagrid is (Lives with, Bought), and in Figure 1(b), the relational paths between Harry Potter and Hermione Granger are (House, House) and (Occupation, Occupation).
Importantly, the paths as well as the context are captured based on the sequence/structure of the relation types they contain (and not based on the identities of the entities). This is important as it provides better inductive bias and allows PathCon to be applicable in inductive settings where new entities not present during training can enter the KG and PathCon can still model them.
Furthermore, in PathCon the importance of paths depends on both the relation they are aiming to model and the relational context provided by the two entities. Therefore, PathCon uses a learnable attention score for each path based on the context information of the entity pair, and then aggregates path representations weighted by their attention scores.
A further benefit of our PathCon approach is that it provides interpretability and explainability. It allows us to identify the important relational context that determines the relation between a given pair of entities. Similarly, in PathCon different relational paths have different weights/attention scores, and we use these scores to identify the important paths that explain a given predicted relation.
We conduct extensive experiments on five KG benchmarks as well as a new KG dataset that we propose. Experimental results demonstrate that PathCon significantly outperforms state-of-the-art KG completion methods, with especially large absolute Hit@1 gains over the best baseline on WN18RR and NELL-995. Our extensive ablation studies show the effectiveness of our approach and demonstrate the importance of both relational context and relational paths. Our method is also shown to maintain strong performance in inductive KG completion, and it provides high explainability by identifying the relational context and relational paths that are important for a given predicted relation.
Let G = (V, E) be an instance of a knowledge graph, where V is the set of nodes and E is the set of edges. Each edge e has a relation type r ∈ R. Our goal is to predict the missing links in G, i.e., given an entity pair (h, t), we aim to predict the relation r of the edge between them. (Some related work formulates this problem as predicting the missing tail (head) entity given a head (tail) entity and a relation. The two problems are actually reducible to each other: given a model that outputs the distribution over relation types p(r | h, t) for an entity pair, we can build a model that outputs the distribution over tail entities given h and r, and vice versa. Since the two problems are equivalent, we focus only on relation prediction in this work.) Specifically, we aim to model the distribution over relation types given a pair of entities (h, t), i.e., p(r | h, t). This is equivalent to modeling the following term
p(r | h, t) ∝ p(h, t | r) · p(r)    (1)

according to Bayes' theorem. In Eq. (1), p(r) is the prior distribution over relation types and serves as the regularization of the model. The first term can then be further decomposed as

p(h, t | r) = p(h | r) · p(t | h, r).    (2)
Eq. (2) sets up the guideline for designing our model. The term p(h | r) or p(t | r) measures the likelihood of an entity given a particular relation. Since our model does not consider the identity of entities, we use an entity's local relational subgraph instead to represent the entity itself, i.e., p(h | r) ≈ p(C(h) | r) and p(t | r) ≈ p(C(t) | r), where C(·) denotes the local relational subgraph of an entity. This is also known as the relational context of h and t. The term p(t | h, r) or p(h | t, r) in Eq. (2) measures the likelihood of how t can be reached from h (or the other way around) given that there is a relation r between them. This inspires us to model the connecting paths between h and t in the KG. In the following we show how to model these two factors in our method and how they contribute to link prediction in KGs.
Symbol    Description
h, t      Head entity and tail entity
r         Relation type
s_e^i     Hidden state of edge e at iteration i
m_v^i     Message of node v at iteration i
N(e)      Endpoint nodes of edge e
N(v)      Neighbor edges of node v
c_(h,t)   Context representation of the entity pair (h, t)
p_(h,t)   Path representation of all paths from h to t
α_P       Attention weight of path P
P_(h,t)   Set of paths from h to t
PathCon captures the relational context (Section 3.1) and the relational paths (Section 3.2) of an entity pair, and combines them together to predict relations (Section 3.3). We show that PathCon is able to learn explainable rules (Section 3.4), and finally discuss several design alternatives (Section 3.5).
For a KG triplet (h, r, t), the relational context of h and t is usually highly correlated with r. For example, if r is graduated_from, it is reasonable to guess with high probability that the surrounding relations of h are person.birthplace, person.gender, etc., and that the surrounding relations of t are institution.location, university.founder, university.president, etc. In this subsection, we propose to use a message passing scheme to capture such relational context of an entity.
Denote the initial feature of edge e by s_e^0, which can be taken as the one-hot identity vector of the relation type that e belongs to. In cases where relation types have names, initial features can also be bag-of-words (BOW) vectors or sentence embeddings learned by language models like BERT (Devlin et al., 2018). Given the initial edge features, we design a message passing scheme that learns the representation of each edge by iteratively aggregating messages from its multi-hop neighbor edges. In iteration i, the hidden state s_e^i of edge e is updated according to the following equations:
m_v^i = Σ_{e ∈ N(v)} s_e^i    (3)

s_e^{i+1} = A(m_v^i, m_u^i, s_e^i),  where (v, u) = N(e)    (4)
As shown in Eq. (3), for each node v we sum up the hidden states of the edges that connect to v and obtain the message m_v^i, where N(v) denotes the set of neighbor edges of node v. Then in Eq. (4), we calculate the hidden state of edge e for iteration i+1 by aggregating the messages from its two endpoints v and u as well as the hidden state of e itself from iteration i, where N(e) = (v, u) denotes the two endpoints of edge e. The aggregation operation in Eq. (4) is abstracted as A(·). In PathCon, we implement A(·) as the concatenation function:
Concat neighbor aggregator. In iteration i, given the hidden state s_e^i of edge e as well as the messages m_v^i and m_u^i from its two endpoints v and u, the Concat neighbor aggregator calculates the hidden state s_e^{i+1} by first concatenating the three input vectors, followed by a nonlinear transformation:

s_e^{i+1} = σ([m_v^i ⊕ m_u^i ⊕ s_e^i] · W^i + b^i)    (5)
where ⊕ denotes the concatenation operation, and W^i, b^i, and σ are the learnable transformation matrix, the bias, and the nonlinear activation function, respectively. Note that the Concat neighbor aggregator preserves the order of the two input endpoints. We shall discuss other implementations of A(·) in Section 3.5 and examine their performance in experiments.

The message passing in Eqs. (3) and (4) is repeated K times. The final messages m_h and m_t are taken as the relational context representations for the head h and the tail t, respectively. We also give an illustrative example of relational context for h and t in Figure 2, where the red/pink edges denote the first-order/second-order contextual relations.
We would like to emphasize that the message passing scheme in Eqs. (3) and (4) is edge-based, i.e., in each iteration we pass and transform the messages of edges to their neighbor edges, and we update the hidden state of each edge after each iteration. Though in Eq. (3) we calculate a message for node v, nodes merely serve as "distribution centers" that collect and temporarily store the messages from their neighbor edges, then propagate the aggregated messages back to each of them. The reason we propose to pass messages alternately between nodes and edges, rather than directly between edges, is to improve computational efficiency. More analysis of the computational efficiency of message passing schemes is included in Appendix A.
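The edge-based message passing of Eqs. (3)-(5) can be sketched as follows. This is a minimal numpy illustration on a toy graph, not the authors' implementation: the graph, shapes, and variable names (`edges`, `W`, `b`) are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KG: 4 nodes, 4 edges, 3 relation types; edge i connects (u, v) with relation r.
edges = [(0, 1, 0), (1, 2, 1), (2, 3, 2), (0, 3, 1)]
n_nodes, n_rels, dim = 4, 3, 3

# Initial edge features: one-hot vector of the edge's relation type (Section 3.1).
s = np.eye(n_rels)[[r for (_, _, r) in edges]]     # shape: (n_edges, dim)

W = rng.normal(size=(3 * dim, dim)) * 0.1          # transform for the Concat aggregator
b = np.zeros(dim)

for _ in range(2):                                 # two message-passing iterations
    # Eq. (3): a node's message is the sum of hidden states of its incident edges.
    m = np.zeros((n_nodes, dim))
    for i, (u, v, _) in enumerate(edges):
        m[u] += s[i]
        m[v] += s[i]
    # Eqs. (4)-(5): Concat aggregator over the two endpoint messages and the edge state.
    s = np.tanh(np.concatenate(
        [np.stack([m[u] for (u, _, _) in edges]),
         np.stack([m[v] for (_, v, _) in edges]),
         s], axis=1) @ W + b)

print(s.shape)  # one hidden vector per edge
```

Note how nodes only accumulate and redistribute edge states; no node identity or node embedding ever enters the computation.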
In this subsection, we follow the discussion in Section 2 and show how to model the term p(t | h, r) or p(h | t, r). Note that we do not consider node identity in the aforementioned message passing for relational context; however, this leads to a potential issue: our model is not able to identify the relative position of the head h and the tail t in the KG. For example, suppose that for a given entity pair (h, t), our model figures out that h is surrounded by person.birthplace, person.gender, etc., and t is surrounded by institution.location, university.founder, university.president, etc. Then the model will learn that h is probably a person, t is probably a university, and there should be a relation graduated_from between them, because such a pattern appears frequently in the training data. However, the truth may be that the person has nothing to do with the university and the two are far from each other in the KG. Such a false-positive case happens because the message passing over relational context can only detect the types of h and t, but is not aware of their relative position in the KG.
To solve this problem, we propose to explore the connectivity pattern between h and t. We first define the relational path from h to t in a KG:
A raw path from h to t in a KG is a sequence of alternating entities and edges: h = v_0, e_1, v_1, e_2, …, e_k, v_k = t, in which consecutive entities v_{i-1} and v_i are connected by edge e_i, and each entity in the path is unique. (Entities in a path are required to be unique because a loop within a path does not provide additional semantics and thus should be cut off from the path.) The corresponding relational path is the sequence of relation types of all edges in the given raw path, i.e., (r_{e_1}, r_{e_2}, …, r_{e_k}), where r_{e_i} is the relation type of edge e_i.
Note that, as with relational context, we do not use the identity of nodes when modeling relational paths. Denote by P_(h,t) the set of all relational paths from h to t in the KG. Our next step is to define and calculate the representation of relational paths. In PathCon, we assign an independent embedding vector s_P to each relational path P. A potential concern here is that the number of different paths increases exponentially with the path length (there are |R|^k possible paths of length k); however, in practice we observe that in real-world KGs most paths do not actually occur (e.g., only 3.2% of all possible paths of length 2 occur in the FB15K dataset), so the number of distinct paths is quite manageable for relatively small path lengths.
An illustrative example of relational paths is shown in Figure 2, where the two green arrows denote the relational paths from h to t.
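Enumerating relational paths as defined above can be sketched with a simple depth-first search. This is an illustrative sketch only, using the toy entities from Figure 1; the function name and graph are assumptions, not the authors' code.

```python
from collections import defaultdict

def relational_paths(triplets, h, t, max_len):
    """Enumerate relational paths of length <= max_len from h to t.
    Entities along a path must be unique; only relation-type sequences are kept."""
    adj = defaultdict(list)                     # node -> [(neighbor, relation)]
    for (u, r, v) in triplets:
        adj[u].append((v, r))
        adj[v].append((u, r))                   # treat the KG as undirected here
    paths, stack = [], [(h, [], {h})]
    while stack:
        node, rels, seen = stack.pop()
        if node == t and rels:
            paths.append(tuple(rels))
            continue
        if len(rels) == max_len:
            continue
        for (nxt, r) in adj[node]:
            if nxt not in seen:                 # no repeated entities in a path
                stack.append((nxt, rels + [r], seen | {nxt}))
    return paths

kg = [("harry", "lives_with", "hedwig"),
      ("hagrid", "bought", "hedwig"),
      ("harry", "house", "gryffindor"),
      ("hermione", "house", "gryffindor")]
print(relational_paths(kg, "harry", "hagrid", 2))   # [('lives_with', 'bought')]
```

Each returned tuple of relation types is a relational path P that would receive its own embedding s_P in the model.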
In addition, other methods for calculating path representations are also possible. We shall discuss them in Section 3.5.
For relational context, we use the message passing scheme to calculate the final messages m_h and m_t for h and t, which summarize their respective context information. m_h and m_t are further combined to calculate the context of the (h, t) pair:
c_(h,t) = A(m_h, m_t)    (6)

where c_(h,t) denotes the context representation of the entity pair (h, t). It is worth noting that the above neighbor aggregator should only take the messages of h and t as input, since the ground-truth relation r must be treated as unobserved during training.
For relational paths, we aggregate all path embeddings to obtain the final representation of relational paths:

p_(h,t) = AGG({s_P : P ∈ P_(h,t)})    (7)

where AGG denotes the aggregation function for paths. Note that there may be many relational paths for a given (h, t) pair, but not all of them are logically related to the predicted relation r, and the importance of each path also varies. In PathCon, since we have already computed the context c_(h,t) of the (h, t) pair, which can be seen as prior information about the paths between h and t, we can calculate the importance scores of paths based on c_(h,t). Therefore, we implement AGG as the attention function:
Attention-based path aggregator. We first calculate the attention weight of each path P with respect to the context c_(h,t):

α_P = exp(s_P^⊤ c_(h,t)) / Σ_{P' ∈ P_(h,t)} exp(s_{P'}^⊤ c_(h,t))    (8)

then use the attention weights to average the representations of all paths:

p_(h,t) = Σ_{P ∈ P_(h,t)} α_P s_P    (9)
where p_(h,t) is the representation of relational paths for (h, t). In this way, the context information is used to assist in identifying the most important relational paths.
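The attention aggregation of Eqs. (8)-(9) amounts to a softmax over path-context dot products. A hedged numpy sketch with toy values (all tensors here are illustrative):

```python
import numpy as np

def attend_paths(path_embs, context):
    """Eq. (8): softmax attention of path embeddings against the context vector;
    Eq. (9): attention-weighted average of the path embeddings."""
    scores = path_embs @ context                  # unnormalized scores s_P . c
    alpha = np.exp(scores - scores.max())         # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha, alpha @ path_embs

paths = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three path embeddings
ctx = np.array([2.0, 0.0])                               # context rep of (h, t)
alpha, p = attend_paths(paths, ctx)
print(alpha.round(3), p.round(3))
```

Paths that align with the context receive higher weights; here the first and third paths dominate the second.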
Given the relational context representation c_(h,t) and the relational path representation p_(h,t), we predict relations by first adding the two representations together and then taking a softmax:

p(r | h, t) = softmax(c_(h,t) + p_(h,t))    (10)
Our model is trained by minimizing the loss between predictions and ground truths over the training triplets:

min Σ_{(h,r,t) ∈ D} J(p(r | h, t), r)    (11)

where D is the training set and J(·) is the cross-entropy loss.
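Eqs. (10)-(11) can be sketched end to end for a single triplet. This is a toy illustration under assumed shapes (three relation types), not the TensorFlow implementation:

```python
import numpy as np

def predict(context_rep, path_rep):
    """Eq. (10): add the context and path representations and softmax over relations."""
    logits = context_rep + path_rep
    z = np.exp(logits - logits.max())
    return z / z.sum()                            # distribution over relation types

def cross_entropy(probs, true_rel):
    """Eq. (11): cross-entropy loss for one training triplet."""
    return -np.log(probs[true_rel])

probs = predict(np.array([0.5, 2.0, -1.0]), np.array([0.1, 1.0, 0.0]))
loss = cross_entropy(probs, true_rel=1)
print(probs.argmax(), round(float(loss), 3))
```

Note that both representations here have the dimensionality of the relation-type vocabulary, so they can be summed directly before the softmax.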
It is worth noting that the context representation c_(h,t) plays two roles in the model: it directly contributes to the predicted relation distribution, and it also helps determine the importance of relational paths with respect to the predicted relation.
Since PathCon models only relations, not entities, it is able to capture pure relationships among different relation types and thus can naturally explain its predictions. The explainability of PathCon is twofold. On the one hand, modeling relational context captures the correlation between contextual relations and the predicted relation, which can be used to indicate important neighbor edges for a given relation. This can be achieved by studying the transformation matrices in context message passing or by using external explanation tools (Ying et al., 2019). For example, institution.location, university.founder, and university.president can be identified as important contextual relations for graduated_from. On the other hand, modeling relational paths captures the correlation between paths and the predicted relation, which can indicate important relational paths for the given relation. This can be achieved by studying the transformation matrices or attention weights in path modeling. For example, (schoolmate_of, graduated_from) can be identified as an important relational path for graduated_from. Interestingly, the explainability provided by relational paths is also connected to first-order logic rules of the following form:
r_1 ∧ r_2 ∧ ⋯ ∧ r_k → r    (12)

where the left-hand side is the conjunction of the relations in a path and r is the predicted relation. The above example of a relational path can therefore be written as the following rule:

schoolmate_of ∧ graduated_from → graduated_from    (13)
Therefore, PathCon can also be used to learn logical rules from KGs just as prior work (Galárraga et al., 2015; Yang et al., 2017; Ho et al., 2018; Zhang et al., 2019a; Sadeghian et al., 2019).
Next we discuss several design alternatives for PathCon. In our ablation experiments we shall also consider the following alternative implementations.
When modeling relational context, we propose two alternatives for neighbor aggregator:
Mean neighbor aggregator. It takes the element-wise mean of the input vectors, followed by a nonlinear transformation:

s_e^{i+1} = σ(Mean(m_v^i, m_u^i, s_e^i) · W^i + b^i)    (14)

The output of the Mean aggregator is invariant to the permutation of its two input nodes, meaning that it treats the head and the tail of a triplet symmetrically.
Cross neighbor aggregator. It is inspired by combinatorial features in recommender systems (Wang et al., 2019b), which measure the interaction of unit features (e.g., AND(gender=female, language=English)). Note that the Mean and Concat neighbor aggregators simply transform the messages from the two input nodes separately and add them up, without modeling the interaction between them, which may be useful for link prediction. In the Cross neighbor aggregator, we first calculate all element-level pairwise interactions between the messages from the head and the tail:

S^{(j,k)} = (m_v^i)^{(j)} · (m_u^i)^{(k)},  j, k = 1, …, d    (15)

where the superscript with parentheses indicates the element index and d is the dimension of m_v^i and m_u^i. We then summarize all interactions by flattening the interaction matrix into a vector and multiplying it by a transformation matrix:

s_e^{i+1} = σ(flatten(S) · W^i + b^i)    (16)

It is worth noting that the Cross neighbor aggregator preserves the order of the input nodes.
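The pairwise interaction of Eqs. (15)-(16) is an outer product followed by a flatten-and-transform step. A small numpy sketch with illustrative shapes (the sizes and random parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
m_v, m_u = rng.normal(size=dim), rng.normal(size=dim)   # endpoint messages

M = np.outer(m_v, m_u)                    # Eq. (15): d x d interaction matrix
W = rng.normal(size=(dim * dim, dim)) * 0.1
s_next = np.tanh(M.reshape(-1) @ W)       # Eq. (16): flatten, transform, activate

# Order matters: swapping the endpoints transposes M, so Cross is not symmetric.
M_swapped = np.outer(m_u, m_v)
print(np.allclose(M, M_swapped.T), s_next.shape)
```

The d*d flattened interaction explains why Cross has many more parameters than Concat (W grows quadratically in d), a trade-off noted again in the experiments.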
Learning path representations with RNNs. When modeling relational paths, recurrent neural networks (RNNs) can be used to learn the representation of a relational path P = (r_1, r_2, …, r_k):

s_P = RNN(r_1, r_2, …, r_k)    (17)

The advantage of an RNN over path embeddings is that its number of parameters is fixed and does not depend on the number of distinct relational paths. Another potential benefit is that an RNN can hopefully capture the similarity among different relational paths based on their sequences of relations.
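Eq. (17) can be sketched with a bare Elman-style recurrent cell over relation-type embeddings. This is a hedged toy sketch; the cell, sizes, and random parameters are illustrative assumptions, not the paper's RNN:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rels, dim = 5, 8
E = rng.normal(size=(n_rels, dim)) * 0.1      # relation-type embeddings
W_h = rng.normal(size=(dim, dim)) * 0.1       # recurrent weights
W_x = rng.normal(size=(dim, dim)) * 0.1       # input weights

def encode_path(rel_ids):
    """One Elman-style recurrent step per relation in the path; the final
    hidden state serves as the path representation s_P (Eq. 17)."""
    h = np.zeros(dim)
    for r in rel_ids:
        h = np.tanh(h @ W_h + E[r] @ W_x)
    return h

s_p = encode_path([0, 3, 1])
print(s_p.shape)  # fixed-size vector, independent of path length
```

The parameter count (E, W_h, W_x) is fixed regardless of how many distinct paths occur, which is exactly the advantage stated above.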
Mean path aggregator. When calculating the final representation of relational paths for the (h, t) pair, we can also simply average the representations of all paths from h to t:

p_(h,t) = (1 / |P_(h,t)|) Σ_{P ∈ P_(h,t)} s_P    (18)

The Mean path aggregator can be used when the representation of relational context is unavailable, since it does not require attention weights as input.
In this section, we evaluate the proposed PathCon model, and present its performance on six KG datasets. The code and all datasets are available at https://github.com/hwwang55/PathCon.
Table 2: Statistics of the six datasets.

             FB15K    FB15K-237  WN18     WN18RR   NELL-995  DDB14
#nodes       14,951   14,541     40,943   40,943   63,917    9,203
#relations   1,345    237        18       11       198       14
#training    483,142  272,115    141,442  86,835   137,465   36,561
#validation  50,000   17,535     5,000    3,034    5,000     4,000
#test        59,071   20,466     5,000    3,134    5,000     4,000
avg. degree  64.6     37.4       6.9      4.2      4.3       7.9
Table 3: Results on FB15K, FB15K-237, and WN18 (per dataset: MRR, MR, Hit@1, Hit@3).

Model      FB15K                         FB15K-237                     WN18
           MRR    MR     Hit@1  Hit@3    MRR    MR     Hit@1  Hit@3    MRR    MR     Hit@1  Hit@3
TransE     0.962  1.684  0.940  0.982    0.966  1.352  0.946  0.984    0.971  1.160  0.955  0.984
ComplEx    0.901  1.553  0.844  0.952    0.924  1.494  0.879  0.970    0.985  1.098  0.979  0.991
DistMult   0.661  2.555  0.439  0.868    0.875  1.927  0.806  0.936    0.786  1.501  0.584  0.987
RotatE     0.979  1.206  0.967  0.986    0.970  1.315  0.951  0.980    0.984  1.139  0.979  0.986
SimplE     0.983  1.308  0.972  0.991    0.971  1.407  0.955  0.987    0.972  1.256  0.964  0.976
QuatE      0.984  1.207  0.972  0.991    0.974  1.283  0.958  0.988    0.981  1.170  0.975  0.983
DRUM       0.945  1.527  0.945  0.978    0.959  1.541  0.905  0.958    0.969  1.165  0.956  0.980
PathCon-context / PathCon-path / PathCon  [values missing in the source text]
Table 4: Results on WN18RR, NELL-995, and DDB14 (per dataset: MRR, MR, Hit@1, Hit@3).

Model      WN18RR                        NELL-995                       DDB14
           MRR    MR     Hit@1  Hit@3    MRR    MR      Hit@1  Hit@3    MRR    MR     Hit@1  Hit@3
TransE     0.784  2.079  0.669  0.870    0.841  5.253   0.781  0.889    0.966  1.161  0.948  0.980
ComplEx    0.840  2.053  0.777  0.880    0.703  23.040  0.625  0.765    0.953  1.287  0.931  0.968
DistMult   0.847  2.024  0.787  0.891    0.634  23.530  0.524  0.720    0.927  1.419  0.886  0.961
RotatE     0.799  2.284  0.735  0.823    0.729  23.894  0.691  0.756    0.953  1.281  0.934  0.964
SimplE     0.730  3.259  0.659  0.755    0.716  26.120  0.671  0.748    0.924  1.540  0.892  0.948
QuatE      0.823  2.404  0.767  0.852    0.752  21.340  0.706  0.783    0.946  1.347  0.922  0.962
DRUM       0.854  1.575  0.778  0.912    0.715  18.203  0.640  0.740    0.958  1.140  0.930  0.987
PathCon-context / PathCon-path / PathCon  [values missing in the source text]
Datasets. We conduct experiments on five standard KG benchmarks: FB15K, FB15K-237, WN18, WN18RR, and NELL-995, as well as one KG dataset proposed by us: DDB14. The statistics of the six datasets are summarized in Table 2.
FB15K (Bordes et al., 2011) contains triplets from Freebase (Bollacker et al., 2008), a large-scale KG with general human knowledge. FB15K-237 (Toutanova and Chen, 2015) is a subset of FB15K in which inverse relations are removed. WN18 (Bordes et al., 2011) contains conceptual-semantic and lexical relations among English words from WordNet (Miller, 1995). WN18RR (Dettmers et al., 2018) is a subset of WN18 in which inverse relations are removed. NELL-995 (Xiong et al., 2017a) is extracted from the 995th iteration of the NELL system (Carlson et al., 2010) and contains general knowledge.
In addition, we present a new dataset, DDB14, that is suitable for KG-related tasks. DDB14 is collected from the Disease Database (http://www.diseasedatabase.com), a medical database containing terminologies and concepts such as diseases, symptoms, and drugs, as well as their relationships. We randomly sample two subsets of 4,000 triplets from the full set to serve as the validation and test sets, respectively.
Baselines. We compare PathCon with several state-of-the-art models, including TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016), DistMult (Yang et al., 2015), RotatE (Sun et al., 2019), SimplE (Kazemi and Poole, 2018), QuatE (Zhang et al., 2019b), and DRUM (Sadeghian et al., 2019). The first six models are embedding-based, while DRUM uses only relational paths to make predictions. We also conduct an extensive ablation study with two variants of our model, PathCon-context and PathCon-path, which use only context and only paths, respectively, to test the performance of the two components separately.
Evaluation protocol. We evaluate all methods in the setting of relation prediction, i.e., for a given entity pair (h, t) in the test set, we rank the ground-truth relation type r against all other candidate relation types. Following the standard procedure in prior work, the candidate set of relation types is filtered, i.e., the candidate relation types for (h, t) do not include any r' ≠ r such that (h, r', t) appears in the training, validation, or test set. Moreover, since most of the chosen baselines were previously evaluated in the setting of head/tail prediction, we modify the evaluation part of their code accordingly to fit the setting of relation prediction. For a fair comparison, we also modify the negative sampling strategy in their implementations from replacing the head/tail to replacing the relation of a given triplet, which indeed improves their performance. More details can be found in Appendix B.
We use Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit Ratio with cutoff values of 1 and 3 (Hit@1, Hit@3) as evaluation metrics, as they are popular and standard metrics for measuring ranking quality. Note that a lower value of MR represents better performance, while higher values are preferred for the other metrics.
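Given the rank of the ground-truth relation among the filtered candidates for each test triplet, these metrics reduce to simple averages. A minimal sketch (the function name and example ranks are illustrative):

```python
def ranking_metrics(ranks, ks=(1, 3)):
    """Compute MRR, MR, and Hit@k from the 1-indexed ranks of the
    ground-truth relation in each test case."""
    n = len(ranks)
    out = {
        "MRR": sum(1.0 / r for r in ranks) / n,   # higher is better
        "MR": sum(ranks) / n,                     # lower is better
    }
    for k in ks:
        out[f"Hit@{k}"] = sum(r <= k for r in ranks) / n
    return out

print(ranking_metrics([1, 2, 1, 5]))
```

For the four example ranks above, MRR is 0.675, MR is 2.25, Hit@1 is 0.5, and Hit@3 is 0.75.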
Implementation details. Our proposed method is implemented in TensorFlow and trained on a single GPU. We use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 0.005, and apply L2 regularization to prevent overfitting. The batch size is 128, the number of epochs is 20, and the dimension of all hidden states is 64. Initial relation features are set as their identities, but we also examine BOW/BERT features in Section 4.3. The above settings were determined by optimizing the classification accuracy on the validation set of WN18RR, and are kept unchanged across all datasets.

During experiments we found that the best number of context hops, maximum path length, and implementation of the neighbor aggregator largely depend on the dataset, so these hyperparameters are tuned separately for each dataset. We present their default settings in Table 5 and the hyperparameter search spaces in Appendix C.
Table 5: Dataset-specific hyperparameter settings.

                     FB15K   FB15K-237  WN18   WN18RR  NELL-995  DDB14
#context hops        2       2          3      3       2         3
Max path length      2       3          3      4       3         4
Neighbor aggregator  Concat  Concat     Cross  Cross   Concat    Cross
Each experiment of PathCon is repeated 3 times; we report the average performance and standard deviation in the following results.
Comparison with baselines. The results on all datasets are reported in Tables 3 and 4. In general, our method outperforms all baselines on all datasets, achieving absolute Hit@1 gains over the best baseline on all six datasets. The improvement is particularly significant on WN18RR and NELL-995, which are exactly the two sparsest KGs according to the average node degrees shown in Table 2. This finding empirically demonstrates that PathCon maintains strong performance on sparse KGs, probably because PathCon has far fewer parameters than the baselines and is therefore less prone to overfitting. In contrast, the performance gain of PathCon on FB15K is less significant, which may be because FB15K is so dense that it is much easier for the baselines to handle.
In addition, the results demonstrate the stability and robustness of PathCon, as most of the standard deviations are quite small.
Tables 3 and 4 also show that, in many cases, PathCon-context or PathCon-path alone already beats most of the baselines. Combining relational context and relational paths usually leads to even better performance.
Inductive KG completion. We also examine the performance of our method in inductive KG completion. We randomly sample a subset of the nodes that appear in the test set, then remove these nodes along with their associated edges from the training set. The remaining training set is used to train the models, and we add back the removed edges during evaluation. The evaluation setting therefore shifts from fully transductive to fully inductive as the ratio of removed nodes increases from 0 to 1. The results of PathCon, DistMult, and RotatE are plotted in Figure 5. We observe that the performance of our method decreases only slightly in the fully inductive setting (from 0.954 to 0.922), while DistMult and RotatE fall to the level of random guessing. This is because these baselines are embedding-based models that rely on modeling node identity, whereas our method does not consider node identity and is thus naturally generalizable to the inductive setting.
Number of context hops and maximum path length. We investigate the sensitivity of our model to the number of context hops and the maximum path length. We vary the two numbers from 0 to 4 (0 means the corresponding module is not used) and report the results of all combinations (except (0, 0)) on WN18RR in Figure 5. Increasing the number of context hops and the maximum path length significantly improves the results when they are small, which demonstrates that including more neighbor edges or counting longer paths does benefit performance. However, the marginal benefit diminishes as the numbers grow. A similar trend is observed on the other datasets.
Neighbor aggregators. We study how different implementations of the neighbor aggregator affect model performance. The results of the Mean, Concat, and Cross neighbor aggregators on four datasets are shown in Figure 5 (results on FB15K and WN18 are omitted, as they are similar to FB15K-237 and WN18RR, respectively). Mean performs worst on all datasets, which indicates the importance of node order when aggregating features from nodes to edges. It is also interesting that the comparison between Concat and Cross varies across datasets: Concat is better than Cross on NELL-995 and worse on WN18RR, while the two perform on par on FB15K-237 and DDB14. Note, however, that a significant drawback of Cross is that it has many more parameters than Concat, which requires more running time and memory.
Path representation types and path aggregators. We implement four combinations of path representation types and path aggregators: Embedding+Mean, Embedding+Attention, RNN+Mean, and RNN+Attention, whose results are presented in Figure 7. Unlike the neighbor aggregators, the results for path representation types and path aggregators are similar across the six datasets, so we only report the results on WN18RR. We find that Embedding is consistently better than RNN, probably because relational paths are generally short (no more than 4 in our experiments), so RNN can hardly demonstrate its strength in modeling sequences. The results also show that the Attention aggregator performs slightly better than the Mean aggregator, which demonstrates that the contextual information of head and tail entities indeed helps identify the importance of relational paths.
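An attention-based path aggregator of the kind described above can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: we assume each relational path has an embedding, and the query vector derived from the head/tail context scores each path before a weighted sum.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_path_agg(path_embs, context_query):
    """Score each path embedding against the context-derived query,
    then return the attention-weighted sum of path embeddings.
    path_embs: (num_paths, d); context_query: (d,)."""
    scores = path_embs @ context_query
    weights = softmax(scores)
    return weights @ path_embs, weights
```

A Mean aggregator is the special case where all weights are uniform, i.e., it ignores the context of the entity pair.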
Table 6: The most important contextual relations and relational paths identified for predicted relations on FB15K-237 and DDB14.

Dataset | Predicted relation | Important contextual relations | Important relational paths
FB15K-237 | award winner | award honored for, award nominee | (award nominated for), (award winner, award category)
FB15K-237 | film written by | film release region | (film edited by), (film crewmember)
FB15K-237 | education campus of | education major field of study | (education institution in)
DDB14 | may cause | may cause, belongs to the drug family of | (is a risk factor for), (see also, may cause)
DDB14 | is associated with | is associated with, is a risk factor for | (is associated with, is associated with)
DDB14 | may be allelic with | may be allelic with, belong(s) to the category of | (may cause, may cause), (may be allelic with, may be allelic with)
Initial edge features. Here we examine three types of initial edge features: identity, BOW, and BERT embeddings of relation types. We test on NELL-995 because its relation names consist of relatively many English words and are thus semantically meaningful (e.g., "organization.headquartered.in.state.or.province"). The results of all variants are reported in Figure 7, which shows that BOW features are slightly better than identity, while BERT embeddings perform significantly worse than the other two. We attribute this finding to two factors: (1) the dimension of pretrained BERT embeddings (768) may be too high for our task, and (2) BERT embeddings are better at identifying semantic relationships among relation types, but our method learns a mapping from BERT embeddings of context/paths to the identity of predicted relation types. BERT may therefore perform better if the predicted relation types are also represented by BERT embeddings, so that the mapping can be learned within the embedding space. We leave this exploration as future work.
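The identity and BOW feature variants can be sketched as follows. This is an illustrative construction (function names are ours), assuming relation names are tokenized on the "." separator as in the NELL-995 example above; the BERT variant would simply look up a pretrained sentence embedding instead.

```python
import numpy as np

def identity_features(relations):
    # one-hot identity vector per relation type
    eye = np.eye(len(relations))
    return {r: eye[i] for i, r in enumerate(relations)}

def bow_features(relations):
    # bag-of-words over tokens in the relation name, so relations that
    # share words (e.g. "in", "state") get overlapping features
    vocab = sorted({w for r in relations for w in r.split(".")})
    index = {w: i for i, w in enumerate(vocab)}
    feats = {}
    for r in relations:
        v = np.zeros(len(vocab))
        for w in r.split("."):
            v[index[w]] += 1
        feats[r] = v
    return feats
```

Identity features treat all relation types as unrelated symbols, while BOW lets semantically similar relation names share feature mass.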
We choose FB15K-237 and DDB14 to demonstrate the explainability of PathCon. The number of context hops is set to 1 and the maximum path length is set to 2. When training is completed, we choose 3 relations from each dataset and list the most important relational neighbors/paths for them based on the transformation matrix of the neighbor/path aggregator. The results are presented in Table 6, from which we find that most of the identified neighbors/paths are logically meaningful. For example, "education campus of" can be inferred from "education institution in", and "is associated with" is found to be a transitive relation. More results and discussion on DDB14 are included in Appendix D.
We discuss two lines of related work: knowledge graph completion and graph neural networks.
Most existing methods for KG completion are based on embeddings: they assign an embedding vector to each entity and relation in a continuous space and train the embeddings on the observed facts. One line of KG embedding methods is translation-based, treating entities as points in a continuous space and each relation as a translation of the entity point. The objective is that the translated head entity should be close to the tail entity in real space (Bordes et al., 2013), complex space (Sun et al., 2019), or quaternion space (Zhang et al., 2019b); these methods can handle multiple relation patterns and achieve state-of-the-art results. To deal with 1-to-N/N-to-1 relations, several methods introduce relation-specific planes (Wang et al., 2014) or subspaces (Lin et al., 2015). Another line of work comprises multilinear or bilinear models, which calculate semantic similarity by matrix or vector dot products in real (Yang et al., 2015) or complex space (Trouillon et al., 2016). In addition, several embedding-based methods explore architectures that go beyond point vectors (Socher et al., 2013; Dettmers et al., 2018; Jiang et al., 2019). However, these embedding-based models cannot predict links in the inductive setting, nor can they discover rules that explain their predictions.
Some prior work also considers modeling paths in KGs. For example, Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), and IterE (Zhang et al., 2019a) try to learn logical rules by modeling the paths that connect the head entity and the tail entity. However, they fail to consider the neighbor structure of the predicted relations and are thus not expressive enough in settings where paths are sparse.
Existing GNNs generally follow the idea of neural message passing (Gilmer et al., 2017), which consists of two procedures: propagation and aggregation, i.e., each node on the graph propagates its features to its neighbors and then aggregates the neighborhood features to perform one update. The two procedures are applied iteratively so as to gather messages from multi-hop neighbors. Under this framework, several GNNs have been proposed that take inspiration from convolutional neural networks (Duvenaud et al., 2015; Hamilton et al., 2017; Kipf and Welling, 2017), recurrent neural networks (Li et al., 2016), recursive neural networks (Bianchini et al., 2001; Scarselli et al., 2008), and loopy belief propagation (Dai et al., 2016). However, these methods use node-based message passing, whereas we propose passing messages based on edges. Two GNN models are conceptually connected to our idea of identifying the relative position of nodes in a graph. P-GNN (You et al., 2019) distinguishes two nodes with similar local structures by calculating the relative distances between the nodes and a set of predefined anchors. SEAL (Zhang and Chen, 2018) labels each node with its distance to the two target nodes when predicting the link between them. In contrast, we use relational paths to indicate the relative position of two nodes.
Researchers have also applied GNNs to knowledge graphs. For example, Schlichtkrull et al. (2018) use GNNs to model the entities and relations in KGs; however, they do not consider relational paths and cannot predict in inductive settings. Wang et al. (2019c) use GNNs to learn entity embeddings in KGs with label-smoothness regularization, but their purpose is to use the learned embeddings to improve recommender systems rather than KG completion.
We propose PathCon for KG completion. PathCon considers two types of subgraph structure in KGs: the contextual relations of the head/tail entities, and the relational paths between the head and the tail entity. We show that both context and paths are critical for relation prediction, and that they can be combined to achieve even better performance. The experimental results on six datasets demonstrate the superiority of our method over state-of-the-art baselines. In addition, our method generalizes to inductive settings, and it can provide explanatory relational neighbors and paths for its predictions.
We point out three directions for future work. First, as discussed in Section 4.3, designing a model that can better take advantage of pretrained word embeddings is a promising direction; second, it is worth studying why RNN does not perform well and whether we can model relational paths better; third, it is interesting to examine whether the context representation and path representation can be combined in a more principled way.
Analysis of node-based message passing. Consider a graph $G$ with $N$ nodes and $E$ edges. Traditional message passing methods propagate the messages of nodes to their neighboring nodes and update their hidden states:
$$m_v^{(i)} = A\left(\left\{ h_u^{(i)} : u \in \mathcal{N}(v) \right\}\right) \qquad (19)$$
$$h_v^{(i+1)} = U\left(h_v^{(i)},\ m_v^{(i)}\right) \qquad (20)$$
where $\mathcal{N}(v)$ denotes the set of neighbor nodes of $v$ in the graph, $A$ is the message aggregation function, and $U$ is the node update function. This is called node-based message passing since it considers the features and hidden states of nodes. See Figure 7(a) for an illustrative example. The computational complexity of node-based message passing is as follows:
In each iteration of node-based message passing, the aggregation operation is performed $N$ times, and each aggregation takes $\bar{d}$ elements as input in expectation, where $\bar{d} = 2E/N$ is the expected node degree of $G$. The cost of aggregation for each iteration is therefore $O(N\bar{d}) = O(E)$.
Analysis of edge-based message passing. Since in this work we only model the features of edges rather than nodes, a natural idea is to perform edge-based message passing:
$$m_e^{(i)} = A\left(\left\{ h_{e'}^{(i)} : e' \in \mathcal{N}(e) \right\}\right) \qquad (21)$$
$$h_e^{(i+1)} = U\left(h_e^{(i)},\ m_e^{(i)}\right) \qquad (22)$$
where $\mathcal{N}(e)$ denotes the set of neighbor edges of $e$ (i.e., edges that share at least one common endpoint with $e$) in the graph. See Figure 7(b) for an illustrative example.
Edge-based message passing actually passes messages on the line graph of the original graph. The line graph of a given graph $G$, denoted by $L(G)$, is a graph such that (1) each node of $L(G)$ represents an edge of $G$, and (2) two nodes of $L(G)$ are adjacent if and only if their corresponding edges share a common endpoint in $G$. We show by the following theorem that the line graph is much larger and denser than the original graph:
Theorem 2. The number of nodes in the line graph $L(G)$ is $E$, and the expected node degree of $L(G)$ is
$$\bar{d}_{L(G)} = \frac{2\sigma^2}{\bar{d}} + 2\bar{d} - 2, \qquad (23)$$
where $\sigma^2$ is the variance of node degrees in $G$.
It is clear that the number of nodes in the line graph $L(G)$ is $E$, because each node in $L(G)$ corresponds to an edge in $G$. We now prove that the expected node degree of $L(G)$ is as given in Eq. (23).
Let's first count the number of edges in $L(G)$. According to the definition of the line graph, each edge in $L(G)$ corresponds to an unordered pair of edges in $G$ connecting to a same node; conversely, each unordered pair of edges in $G$ that connect to a same node determines an edge in $L(G)$. Therefore, the number of edges in $L(G)$ equals the number of all unordered pairs of edges connecting to a same node:
$$E_{L(G)} = \sum_{v \in G} \binom{d_v}{2} = \frac{1}{2}\left(\sum_{v \in G} d_v^2 - 2E\right), \qquad (24)$$
where $d_v$ is the degree of node $v$ in $G$ and $E$ is the number of edges in $G$. Then the expected node degree of $L(G)$ is
$$\bar{d}_{L(G)} = \frac{2E_{L(G)}}{E} = \frac{\sum_{v} d_v^2 - 2E}{E} = \frac{N(\sigma^2 + \bar{d}^2)}{E} - 2 = \frac{2\sigma^2}{\bar{d}} + 2\bar{d} - 2. \qquad (25)$$
∎
From Theorem 2 it is clear that $\bar{d}_{L(G)}$ is at least twice $\bar{d}$, the expected node degree of the original graph $G$, since $\sigma^2 \geq 0$ (the constant $-2$ is omitted). Unfortunately, in real-world graphs (including KGs) node degrees vary significantly and typically follow a power-law distribution, whose variance is extremely large due to the long tail. This means that $\bar{d}_{L(G)} \gg 2\bar{d}$ in general. Moreover, the number of nodes in $L(G)$ (which is $E$) is also far larger than the number of nodes in $G$ (which is $N$). Therefore, $L(G)$ is generally much larger and denser than its original graph $G$. Based on Theorem 2, the complexity of edge-based message passing is as follows:
In each iteration of edge-based message passing, the aggregation operation is performed $E$ times, and each aggregation takes $\bar{d}_{L(G)}$ elements as input in expectation. The cost of aggregation for each iteration is therefore $O(E\bar{d}_{L(G)}) = O(N\sigma^2 + E\bar{d})$.
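As a quick numerical sanity check of Theorem 2 (an illustration of ours, not part of the paper), we can compare the closed form of Eq. (23) against a direct count of the line graph's mean degree from a degree sequence:

```python
import numpy as np

def line_graph_mean_degree(degrees):
    # direct count: 2 * |E_L| / |V_L|, where |V_L| = E and
    # |E_L| = sum_v C(d_v, 2), as in Eq. (24)
    d = np.asarray(degrees, dtype=float)
    E = d.sum() / 2.0
    edges_L = (d * (d - 1) / 2.0).sum()
    return 2.0 * edges_L / E

def theorem_2_prediction(degrees):
    # closed form of Eq. (23): 2*sigma^2/dbar + 2*dbar - 2
    d = np.asarray(degrees, dtype=float)
    dbar = d.mean()
    sigma2 = d.var()
    return 2.0 * sigma2 / dbar + 2.0 * dbar - 2.0
```

For example, the star graph with degree sequence [3, 1, 1, 1] has line-graph mean degree 2 under both computations, and the two functions agree on any valid degree sequence.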
Analysis of alternate message passing. The cost of aggregation in edge-based message passing is prohibitive in practice. Although we can sample a subset of neighbors for each aggregation instead of using all neighbors (Hamilton et al., 2017), message passing models are usually sensitive to the sampling size, and a small number of sampled neighbors leads to performance deterioration.
The key insight for reducing the heavy overhead of edge-based message passing is that, although a large number of neighbor edges needs to be aggregated for a given edge, two edges connecting to a same node share many common neighbor edges, which makes their neighbor aggregations largely redundant. For example, in Figure 7(c) we want to aggregate neighbor edges for the red edge and the blue edge, but their neighbor edges are highly overlapping (marked in green) since the two edges connect to a same node. To avoid this redundant computation, we decompose the edge aggregation in Eq. (21) into two steps:
$$m_v^{(i)} = A_1\left(\left\{ h_e^{(i)} : e \text{ connects to } v \right\}\right) \qquad (26)$$
$$m_e^{(i)} = A_2\left(m_u^{(i)},\ m_v^{(i)}\right), \quad e = (u, v) \qquad (27)$$
In Eq. (26), for each node $v$, we aggregate all the edges that connect to $v$ using an aggregation function $A_1$ and obtain the message $m_v$. Then in Eq. (27), we obtain the message of edge $e = (u, v)$ by aggregating the messages from its two endpoints $u$ and $v$ using function $A_2$. We call Eqs. (26), (27), and (22) alternate message passing, as messages are passed alternately between nodes and edges. Figure 7(d) gives an illustrative example of alternate message passing.
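The two-step scheme of Eqs. (26)-(27) can be sketched in a few lines. This is a simplified illustration with scalar edge states and sum aggregation (as in Eq. (3)); for simplicity, an edge's own state is included in its endpoints' messages, and the function name and data layout are ours.

```python
from collections import defaultdict

def alternate_message_passing(edges, edge_states):
    """edges: list of (u, v) pairs; edge_states: scalar hidden state per edge.
    Returns, for each edge, the pair of messages from its two endpoints."""
    node_msg = defaultdict(float)
    # Eq. (26): aggregate edge states into each endpoint node (sum aggregator)
    for (u, v), h in zip(edges, edge_states):
        node_msg[u] += h
        node_msg[v] += h
    # Eq. (27): each edge collects the messages of its two endpoints
    return [(node_msg[u], node_msg[v]) for (u, v) in edges]
```

Each node message is computed once and reused by every incident edge, which is exactly where the redundant work of edge-based aggregation is eliminated.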
Our proposed message passing scheme for relational context in Eqs. (3) and (4) is based on alternate message passing. To see this, notice that the aggregation function $A_1$ in Eq. (26) is implemented as a sum in Eq. (3), while Eqs. (27) and (22) are combined together and abstracted as the update in Eq. (4). The complexity of alternate message passing is as follows:
In each iteration of alternate message passing, the aggregation from edges to nodes is performed $N$ times, each taking $\bar{d}$ elements as input in expectation; the aggregation from nodes to edges is performed $E$ times, each taking 2 elements as input. The cost of aggregation for each iteration is therefore $O(N\bar{d} + 2E) = O(E)$.
From Corollary 4 it is clear that alternate message passing greatly reduces the overhead of edge aggregation and achieves the same order of magnitude as node-based message passing.
The implementation of TransE, DistMult, ComplEx, and RotatE is at https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding; the implementation of SimplE is at https://github.com/baharefatemi/SimplE; the implementation of QuatE is at https://github.com/cheungdaven/QuatE, and we use the QuatE variant without type constraints here; the implementation of DRUM is at https://github.com/alisadeghian/DRUM. For fair comparison, the embedding dimension for all baselines is set to 400. We train each baseline for 1,000 epochs and report the test result at the epoch where the result on the validation set is optimal. The other hyperparameters are set to their defaults in the respective repositories.
These baselines were previously evaluated on head/tail prediction, i.e., predicting the missing head or tail for a given pair $(r, t)$ or $(h, r)$. Their negative sampling strategy is therefore to corrupt the head or the tail of a true triple $(h, r, t)$, i.e., to replace $h$ or $t$ with a randomly sampled entity $h'$ or $t'$ from the KG and use $(h', r, t)$ or $(h, r, t')$ as the negative sample. Since our task is to predict the missing relation for a given pair $(h, t)$, we modify the negative sampling strategy accordingly by corrupting the relation of each true triple $(h, r, t)$, using $(h, r', t)$ as the negative sample, where $r'$ is randomly sampled from the set of relation types. If $(h, r', t)$ happens to be a true triple, we remove it from the negative samples. This new negative sampling strategy does indeed improve the performance of the baselines. For example, the Hit@1 of TransE, ComplEx, DistMult, RotatE, SimplE, and QuatE on WN18 increases from 0.931, 0.957, 0.578, 0.975, 0.951, 0.971 to 0.955, 0.979, 0.584, 0.979, 0.964, 0.975, respectively.
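The relation-corruption strategy above can be sketched as follows. This is a minimal illustration in our own notation, not the baseline repositories' code; the function name is hypothetical.

```python
import random

def corrupt_relation(triple, all_relations, true_triples, rng=random):
    """Replace r in (h, r, t) with a random relation r', rejecting any
    corruption (h, r', t) that is itself a true triple."""
    h, r, t = triple
    truths = set(true_triples)
    candidates = [r2 for r2 in all_relations
                  if r2 != r and (h, r2, t) not in truths]
    if not candidates:
        return None  # no valid negative exists for this triple
    return (h, rng.choice(candidates), t)
```

Filtering out corruptions that are true triples avoids training on false negatives, which is the likely source of the Hit@1 gains reported above.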
The search spaces for hyperparameters are as follows:
Dimension of hidden states: ;
Weight of L2 loss term: ;
Learning rate: ;
The number of context hops: ;
Maximum path length: .
After training on DDB14, we print out the transformation matrix of the neighbor aggregator and the path aggregator in PathCon, and the results are shown as heat maps in Figures 10 and 9, respectively. The degree of darkness of an entry in Figure 10 (Figure 9) denotes the strength of correlation between the existence of a neighbor relation (a relational path) and a predicted relation. Relation IDs as well as their meanings are listed as follows for readers’ reference:
0: belong(s) to the category of
1: is a category subset of
2: may cause
3: is a subtype of
4: is a risk factor for
5: is associated with
6: may contraindicate
7: interacts with
8: belongs to the drug family of
9: belongs to drug superfamily
10: is a vector for
11: may be allelic with
12: see also
13: is an ingredient of
Figure 10 shows that most of the large values are distributed along the diagonal. This is in accordance with our intuition: for example, if we want to predict the relation for a pair $(h, t)$ and we observe that $h$ appears in another triple with the contextual relation "is a risk factor for", then we know that $h$ is a risk factor and is likely to be a risk factor of other entities in the KG. Therefore, the relation between $h$ and $t$ is more likely to be "is a risk factor for" than "belongs to the drug family of", since $h$ is not a drug. In addition, we also find some large values off the diagonal, e.g., (belongs to the drug family of, belongs to the drug superfamily) and (may contraindicate, interacts with).
We also have some interesting findings from Figure 9. First, many rules in Figure 9 take the form
$$(\text{see also},\ r) \rightarrow r, \qquad (28)$$
where $r$ is a relation type in the KG. These rules are indeed meaningful, because "see also" indicates that two entities are equivalent and can thus interchange with each other.
We also find that PathCon learns rules showing that a relation type is transitive, for example:
$$(\text{is associated with},\ \text{is associated with}) \rightarrow \text{is associated with} \qquad (29)$$
and
$$(\text{may be allelic with},\ \text{may be allelic with}) \rightarrow \text{may be allelic with}. \qquad (30)$$
Other interesting rules learned by PathCon include:
(31) 
(32) 
(33) 
(34) 