Knowledge Graph Reasoning with Relational Directed Graph

08/13/2021 ∙ by Yongqi Zhang, et al. ∙ Tsinghua University

Reasoning on the knowledge graph (KG) aims to infer new facts from existing ones. Methods based on the relational path have shown strong, interpretable, and inductive reasoning ability. However, paths are naturally limited in capturing complex topology in a KG. In this paper, we introduce a novel relational structure, i.e., the relational directed graph (r-digraph), which is composed of overlapped relational paths, to capture the KG's structural information. Since the digraph exhibits more complex structure than paths, constructing and learning on the r-digraph are challenging. Here, we propose RED-GNN, which addresses these challenges by learning the RElational Digraph with a variant of graph neural network (GNN). Specifically, RED-GNN recursively encodes multiple r-digraphs with shared edges and selects the strongly correlated edges through query-dependent attention weights. We demonstrate the significant gains brought by the r-digraph on benchmarks for reasoning on both KGs with unseen entities and incomplete KGs, the efficiency of RED-GNN, and the interpretable dependencies learned on the r-digraph.




1 Introduction

A knowledge graph (KG), which contains the interactions among real-world objects, people, concepts, etc., connects artificial intelligence with human knowledge Battaglia et al. (2018); Ji et al. (2020). The interactions are formed as facts with the triple form (subject entity, relation, object entity) to indicate the relation between entities. Real-world KGs are large and highly incomplete Wang et al. (2017); Ji et al. (2020), thus inferring new facts is challenging. KG reasoning simulates such a process to deduce new facts from existing facts Chen et al. (2020). In this paper, we focus on learning the relational structures for reasoning on the queries in the form of (subject entity, relation, ?).

Over the last decade, triple-based models have gained much attention to learn knowledge in KGs Wang et al. (2017). These models directly reason on triples with the entity and relation embeddings, such as TransE Bordes et al. (2013), ConvE Dettmers et al. (2017), ComplEx Trouillon et al. (2017), RotatE Sun et al. (2019), QuatE Zhang et al. (2019) etc. Since the triples are independently learned, they cannot explicitly capture the structural information Niepert (2016); Xiong et al. (2017); Yang et al. (2017); Teru et al. (2020), i.e., the local structures around the query triples, which can be used as evidence for KG reasoning, based on the Gaifman’s locality theorem Niepert (2016).

Relational path is the first attempt to capture the structural information for reasoning Lao and Cohen (2010). DeepPath Xiong et al. (2017), MINERVA Das et al. (2017) and M-walk Shen et al. (2018) use reinforcement learning (RL) to sample relational paths that have strong correlation with the queries. Due to the sparsity of KGs, the RL approaches are hard to train on large-scale KGs Chen et al. (2020). PathCon Wang et al. (2020) samples all the relational paths between the entities and uses an attention mechanism to weight the different paths, but is expensive for the entity query tasks. The rule-based methods, such as RuleN Meilicke et al. (2018), Neural LP Yang et al. (2017), DRUM Sadeghian et al. (2019) and RNNLogic Qu et al. (2021), generalize the relational paths as logical rules, which learn to infer relations by logical composition of relations, and can provide interpretable insights. Besides, the logical rules can be transferred to previously unseen entities that are common in real-world applications, which cannot be handled by the triple-based models Yang et al. (2017); Sadeghian et al. (2019).

Subgraphs can naturally be more informative than paths in capturing the structural information Battaglia et al. (2018). Their effectiveness has been empirically verified in, e.g., graph-based recommendation Zhao et al. (2017) and node representation learning Hamilton et al. (2017). With the success of graph neural network (GNN) Gilmer et al. (2017); Kipf and Welling (2016); Xu et al. (2018) in modeling graph-structured data, GNN has been introduced to capture the subgraph structures in KG. R-GCN Schlichtkrull et al. (2018) and CompGCN Vashishth et al. (2019) propose to update the representations of entities by aggregating all the 1-hop neighbors on KG in each layer. However, they cannot distinguish the structural dependency of different neighbors and are not interpretable. DPMPN Xu et al. (2019) learns to reduce the size of subgraph for reasoning on large-scale KGs by preserving the most probable entities for a given query rather than learning the specific local structures. Recently, GraIL Teru et al. (2020) proposes to predict relation from the local enclosing subgraph structure and shows the inductive ability of subgraph. However, it suffers from both effectiveness and efficiency problems due to the limitation of the enclosing subgraph.

Inspired by the interpretable and transferable abilities of path-based methods and the structure preserving property of subgraphs, we introduce a novel relational structure into KG called r-digraph. The r-digraphs generalize relational paths to subgraphs by preserving the overlapped relational paths and the structures of relations for reasoning. Different from the relational paths that have simple structures, how to efficiently construct and learn from the r-digraphs are challenging since directly computing on each r-digraph is very expensive for reasoning the query. Inspired by solving computation costs in overlapping sub-problems using dynamic programming, we propose RED-GNN, an efficient learning framework for the RElational Digraph with a variant of GNN Gilmer et al. (2017). Empirically, RED-GNN shows significant gains over the state-of-the-art reasoning methods in both benchmarks for KG with unseen entities and incomplete KG. Besides, the training and inference processes are efficient, and the learned structures are interpretable.

2 Related Works

A knowledge graph is in the form of $\mathcal K = \{\mathcal V, \mathcal R, \mathcal F\}$, where $\mathcal V$, $\mathcal R$ and $\mathcal F$ are the sets of entities, relations and fact triples, respectively. Let $e_q$ be the query entity, $r_q$ be the query relation, and $e_a$ be the answer entity. Given the query $(e_q, r_q, ?)$, the reasoning task is to predict the missing answer entity $e_a$. Generally, all the entities in $\mathcal V$ are candidates for $e_a$ Chen et al. (2020); Das et al. (2017); Yang et al. (2017).

The key for KG reasoning is to capture the local evidence, such as relational paths or subgraphs, around the queries. In this part, we introduce the path-based methods and GNN-based methods that leverage structures in $\mathcal K$ for reasoning.

2.1 Path-based methods

Relational path is formed by a set of triples that are sequentially connected as in Def.1. It is more informative than single triple since it can provide interpretable results and transfer to unseen entities.

Definition 1 (Relational path Lao et al. (2011); Xiong et al. (2017)).

The relational path with length $L$ is a set of $L$ consecutive triples $(e_0, r_1, e_1), (e_1, r_2, e_2), \ldots, (e_{L-1}, r_L, e_L)$ that are connected head-to-tail in sequence.
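The head-to-tail condition in Def.1 can be checked mechanically. The following is a minimal sketch, assuming triples are stored as (subject, relation, object) tuples; the function name and the toy entities are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


def is_relational_path(triples: List[Triple]) -> bool:
    """Return True if each triple's object is the next triple's subject."""
    if not triples:
        return False
    return all(prev[2] == nxt[0] for prev, nxt in zip(triples, triples[1:]))


# Hypothetical toy path of length 2.
example = [("e0", "r1", "e1"), ("e1", "r2", "e2")]
print(is_relational_path(example))
```

The length of the path is simply `len(triples)`; the check says nothing about whether the path is useful evidence for a query.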

The path-based methods learn to predict the triple $(e_q, r_q, e_a)$ by a set of relational paths as local evidence. DeepPath Xiong et al. (2017) learns to generate the relational path from $e_q$ to $e_a$ by reinforcement learning (RL). To improve the efficiency, MINERVA Das et al. (2017) and M-walk Shen et al. (2018) learn to generate multiple paths starting from $e_q$ by RL. The scores are indicated by the arrival frequency on different $e_a$'s. Due to the complex structure of KG, the reward is very sparse, making it hard to train the RL models Chen et al. (2020). PathCon Wang et al. (2020) samples all the paths connecting the two entities to predict the relation between them, which is expensive for the reasoning task $(e_q, r_q, ?)$.

Instead of directly using paths, the rule-based methods learn logical rules as a generalized form of relational paths. A logical rule composes a set of relations to infer a specific relation, which provides better interpretation and can be transferred to unseen entities. The rules can be learned by discrete mining as in RuleN Meilicke et al. (2018), by an EM algorithm as in RNNLogic Qu et al. (2021), or by end-to-end training as in Neural LP Yang et al. (2017) and DRUM Sadeghian et al. (2019), to generate highly correlated relational paths between $e_q$ and $e_a$. However, the rules can only capture sequential evidence, and thus cannot learn more complex patterns such as the subgraph structures.

2.2 GNN-based methods

As noted in the introduction, subgraphs can naturally preserve richer information than relational paths. All the relational paths are sampled from some local subgraphs. Thus, they naturally lose some structural information in KGs, e.g., how multiple entities and edges are connected. GNN has shown strong power in modeling graph-structured data Battaglia et al. (2018). This inspires recent works, such as R-GCN Schlichtkrull et al. (2018) and CompGCN Vashishth et al. (2019), extending GNN on KG to aggregate the entities’ and relations’ representations under the message passing framework Gilmer et al. (2017) as

$$h_{e_o}^{\ell} = \delta\Big(W^{\ell} \cdot \sum_{(e_s, r, e_o) \in \mathcal F} \phi\big(h_{e_s}^{\ell-1}, h_r^{\ell}\big)\Big), \tag{1}$$

which aggregates over the message $\phi(\cdot)$ on the 1-hop neighbor edges $(e_s, r, e_o) \in \mathcal F$ of entity $e_o$ with dimension $d$. $\phi$ is the message function, $W^{\ell}$ is a weighting matrix and $\delta$ is the activation function. After $L$ layers, the representations capturing the local structures of entities jointly work with a scoring function to score the triples. Since the aggregation function aggregates information of all the neighbors and is independent of the query, R-GCN and CompGCN cannot capture the explicit structures used for reasoning specific queries and are not interpretable.

Instead of using all the neighborhoods, DPMPN Xu et al. (2019) designs one GNN to aggregate the embeddings of entities, and another GNN to dynamically expand and prune the inference subgraph from the query entity $e_q$. A query-dependent attention is applied over the sampled entities for pruning. This approach shows an explainable reasoning process by the attention flow on the pruned subgraph but still requires embeddings to guide the pruning, and thus cannot be generalized to unseen entities. Besides, it cannot capture the explicit subgraph structure supporting a given query triple. xERTE Han et al. (2021) extends DPMPN for reasoning future triples in temporal KGs.

Recently, GraIL Teru et al. (2020) proposes to extract the enclosing subgraph between the query entity $e_q$ and answer entity $e_a$. To learn the enclosing subgraph, the relational GNN Schlichtkrull et al. (2018) with query-dependent attention is applied over the edges in the subgraph to control the importance of edges for different queries. After $L$ layers’ aggregation, the graph-level representations aggregating all the entities in the subgraph are used to score the triple $(e_q, r_q, e_a)$. Since the subgraphs need to be explicitly extracted and scored for different triples, the computation cost is very high.

3 Relational Digraph

The relational paths, especially the logical rules, have shown strong reasoning ability on KGs Das et al. (2017); Yang et al. (2017); Sadeghian et al. (2019) to offer interpretable results and transfer to unseen entities. However, they are limited in capturing more complex dependencies in KG since they are sampled from the local subgraphs. GNN-based methods can learn different subgraph structures. But none of the existing methods can efficiently learn the subgraph structures that are both interpretable and transferable to unseen entities like rules. Hence, we are motivated to define a new kind of structure, i.e., r-digraph, to explore the structural dependency by generalizing relational paths. In subsequent Sect.4, we show how GNNs can be tailor-made to efficiently and effectively learn from the r-digraphs.

Before defining the r-digraph, we first introduce a special type of digraph in Def.2.

Definition 2 (Layered st-graph Battista et al. (1998)).

The layered st-graph is a directed graph with exactly one source node (s) and one sink node (t). All the edges are directed, connecting nodes between consecutive layers and pointing from lower layer to higher one.

Here, we adopt the general approach of augmenting the triples with reverse and identity relations Vashishth et al. (2019); Sadeghian et al. (2019); Yun et al. (2019). Then, all the relational paths with length less than or equal to $L$ between $e_q$ and $e_a$ can be represented as relational paths with length exactly $L$. In this way, they can be formed as paths in a layered st-graph, with the single source entity $e_q$ and single sink entity $e_a$. Such a structure preserves all the relational paths between $e_q$ and $e_a$ up to length $L$, and maintains the subgraph structure. Based on this observation, we define the r-digraph in Def. 3.
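The augmentation step can be illustrated with a short sketch. It assumes triples are (subject, relation, object) tuples; the `_inv` suffix and the `identity` relation name are hypothetical naming choices made for this example only.

```python
def augment(triples, entities):
    """Add a reverse triple for every fact and an identity self-loop per entity."""
    augmented = set(triples)
    for subj, rel, obj in triples:
        # Reverse relation: lets a walk traverse the fact in the other direction.
        augmented.add((obj, rel + "_inv", subj))
    for ent in entities:
        # Identity relation: lets a path shorter than L be padded to length L.
        augmented.add((ent, "identity", ent))
    return augmented


facts = {("a", "r", "b")}
print(sorted(augment(facts, {"a", "b"})))
```

With the identity self-loops in place, a path of length 1 from `a` to `b` also appears as a path of length 2 (`a` to `a` to `b`, or `a` to `b` to `b`), which is why all paths up to length $L$ fit into one layered graph of depth $L$.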

Definition 3 (r-digraph).

The r-digraph $\mathcal G_{e_q, e_a|L}$ is a layered st-graph with the source entity $e_q$ and the sink entity $e_a$. The entities in the same layer are different from each other. Any path pointing from $e_q$ to $e_a$ in the r-digraph is a relational path with length $L$, where the $\ell$-th edge connects an entity in the $(\ell-1)$-th layer to an entity in the $\ell$-th layer. We define $\mathcal G_{e_q, e_a|L} = \emptyset$ if there is no relational path connecting $e_q$ and $e_a$ with length $L$.

Fig.1(b) provides an example of an r-digraph, which can be used to infer the new triple (Sam, directed, Spider-2), in the exemplar KG in Fig.1(a). Inspired by the reasoning ability of relational paths Das et al. (2017); Yang et al. (2017); Sadeghian et al. (2019), we aim to exploit the r-digraph for KG reasoning. However, different from the relational paths which have simple structures, efficiently constructing and effectively learning from the r-digraph are challenging.

In the sequel, $\mathcal E^{\ell}_{e_q, e_a|L}$ are the edges and $\mathcal V^{\ell}_{e_q, e_a|L}$ are the entities in the $\ell$-th layer of the r-digraph $\mathcal G_{e_q, e_a|L}$ with $1 \le \ell \le L$. $\otimes$ denotes the layer-wise connection. We define $\mathcal G_{e_q, e_a|L} = \mathcal E^{1}_{e_q, e_a|L} \otimes \cdots \otimes \mathcal E^{L}_{e_q, e_a|L}$. Given the query entity $e_q$, we denote $\hat{\mathcal E}^{\ell}_{e_q}$ as the set of out-edges and $\mathcal V^{\ell}_{e_q}$ as the set of entities visible in $\ell$ steps walking from $e_q$. Both are graphically shown in Fig.1.

(a) Knowledge graph.
(b) r-digraph between Sam and Spider-2.
(c) Recursive encoding.
Figure 1: Graphical illustration. In (c), the gray ellipses and the yellow rectangles each form one subgraph. Dashed edges mean the reverse relations of corresponding color (best viewed in color).

4 RED-GNN

To encode the r-digraph $\mathcal G_{e_q, e_a|L}$, a simple solution is to construct it first and run message passing (1) as in Alg.1. For construction, we get the out-edges and entities visible from $e_q$ in steps 2-4. If $e_a \notin \mathcal V^{L}_{e_q}$, $\mathcal G_{e_q, e_a|L}$ will be empty and we set the representation as $\mathbf 0$ in step 5. For $\mathcal G_{e_q, e_a|L}$ that is not empty, we construct it backward from $e_a$ in steps 6-8. After the construction, we run message passing on $\mathcal G_{e_q, e_a|L}$ layer-by-layer in step 10. Since $e_a$ is the single sink entity in $\mathcal G_{e_q, e_a|L}$, the final layer representation $h^{L}_{e_a}$ is used as the subgraph representation to encode the structure of $\mathcal G_{e_q, e_a|L}$.

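0:  a knowledge graph $\mathcal K$, query triple $(e_q, r_q, e_a)$, depth $L$.
1:  initialize $h^{0}_{e_q} = \mathbf 0$ and the entity sets $\mathcal V^{0}_{e_q} = \{e_q\}$, $\mathcal V^{L}_{e_q, e_a|L} = \{e_a\}$;
2:  for $\ell = 1, \ldots, L$ do
3:     get the $\ell$-hop out-edges $\hat{\mathcal E}^{\ell}_{e_q} = \{(e_s, r, e_o) \in \mathcal F : e_s \in \mathcal V^{\ell-1}_{e_q}\}$ and entities $\mathcal V^{\ell}_{e_q} = \{e_o : (e_s, r, e_o) \in \hat{\mathcal E}^{\ell}_{e_q}\}$;
4:  end for
5:  if $e_a \notin \mathcal V^{L}_{e_q}$ return $\mathbf 0$;
6:  for $\ell = L, \ldots, 1$ do
7:     get the $\ell$-layer edges: $\mathcal E^{\ell}_{e_q, e_a|L} = \{(e_s, r, e_o) \in \hat{\mathcal E}^{\ell}_{e_q} : e_o \in \mathcal V^{\ell}_{e_q, e_a|L}\}$ and the $(\ell-1)$-layer entities: $\mathcal V^{\ell-1}_{e_q, e_a|L} = \{e_s : (e_s, r, e_o) \in \mathcal E^{\ell}_{e_q, e_a|L}\}$;
8:  end for
9:  for $\ell = 1, \ldots, L$ do
10:     message passing: $h^{\ell}_{e_o} = \delta\big(W^{\ell} \cdot \sum_{(e_s, r, e_o) \in \mathcal E^{\ell}_{e_q, e_a|L}} \phi(h^{\ell-1}_{e_s}, h^{\ell}_{r})\big)$, for $e_o \in \mathcal V^{\ell}_{e_q, e_a|L}$;
11:  end for
12:  return $h^{L}_{e_a}$.
Algorithm 1 Message passing on single r-digraph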
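The construction part of Alg.1 (forward expansion from the query entity, then backward filtering from the answer entity) can be sketched in plain Python. This is an illustrative sketch only: it assumes the triple list already contains any reverse and identity triples, it omits the message passing of steps 9-11, and the function name `build_r_digraph` is hypothetical.

```python
from collections import defaultdict


def build_r_digraph(triples, e_q, e_a, depth):
    """Return one edge set per layer, or [] if e_a is not reached in `depth` steps."""
    out_edges = defaultdict(list)
    for triple in triples:
        out_edges[triple[0]].append(triple)

    # Forward pass: edges visible from e_q at each step.
    frontier = {e_q}
    hop_edges = []
    for _ in range(depth):
        edges = {t for ent in frontier for t in out_edges[ent]}
        hop_edges.append(edges)
        frontier = {obj for _, _, obj in edges}
    if e_a not in frontier:
        return []

    # Backward pass: keep only edges that lie on a path ending at e_a.
    layers = [set() for _ in range(depth)]
    targets = {e_a}
    for layer in range(depth - 1, -1, -1):
        layers[layer] = {t for t in hop_edges[layer] if t[2] in targets}
        targets = {subj for subj, _, _ in layers[layer]}
    return layers


toy = [("q", "r1", "a"), ("q", "r2", "b"), ("a", "r3", "t"),
       ("b", "r4", "t"), ("b", "r5", "x")]
print(build_r_digraph(toy, "q", "t", 2))
```

In the toy graph, the edge `("b", "r5", "x")` is visible from `q` in two steps but is dropped by the backward pass because it does not end at `t`. Running this once per candidate answer entity is exactly the repeated work that motivates the recursive encoding.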

However, Alg.1 is very expensive. First, given a query $(e_q, r_q, ?)$, we need to run this algorithm for $|\mathcal V|$ different triples with different answer entities $e_a \in \mathcal V$. Second, three loops are required in this algorithm. Its time cost for predicting a given query (details in Appendix B) depends on the average degree of entities in $\mathcal V$ and on the average number of edges in the r-digraphs $\mathcal G_{e_q, e_a|L}$. These limitations also exist in PathCon Wang et al. (2020) and GraIL Teru et al. (2020) (see Appendix D). To improve efficiency, we propose to encode multiple r-digraphs recursively in Sect.4.1.

4.1 Recursive r-digraph encoding

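In Alg.1, when enumerating $\mathcal G_{e_q, e_a|L}$ with different $e_a$ for the same query $(e_q, r_q, ?)$, the neighboring edges of $e_q$ are shared. Thus, we have the following observation:

Proposition 4.1.

The set of edges $\hat{\mathcal E}^{\ell}_{e_q}$ visible from $e_q$ by $\ell$-steps is the union of the $\ell$-th layer edges in the r-digraphs between $e_q$ and all the entities $e_a \in \mathcal V$, namely $\hat{\mathcal E}^{\ell}_{e_q} = \bigcup_{e_a \in \mathcal V} \mathcal E^{\ell}_{e_q, e_a|L}$.

Prop.4.1 indicates that the $\ell$-th layer edges $\mathcal E^{\ell}_{e_q, e_a|L}$ with different entities $e_a$ share the same set of edges in $\hat{\mathcal E}^{\ell}_{e_q}$. Inspired by saving computation costs in overlapping sub-problems using dynamic programming, the r-digraph between $e_q$ and any entity $e_a$ can be recursively constructed as

$$\mathcal G_{e_q, e_a|\ell} = \bigcup_{(e, r, e_a) \in \hat{\mathcal E}^{\ell}_{e_q}} \mathcal G_{e_q, e|\ell-1} \otimes \{(e, r, e_a)\}. \tag{2}$$

Once the representations of $\mathcal G_{e_q, e|\ell-1}$ for different entities $e$ in the $(\ell-1)$-th layer are ready, we can encode $\mathcal G_{e_q, e_a|\ell}$ by combining them with the shared edges in the $\ell$-th layer. Based on Prop.4.1 and (2), we are motivated to recursively encode multiple r-digraphs with the shared edges in $\hat{\mathcal E}^{\ell}_{e_q}$ layer by layer. The full process is in Alg.2.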

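0:  a knowledge graph $\mathcal K$, query $(e_q, r_q, ?)$, depth $L$.
1:  initialize $h^{0}_{e_q} = \mathbf 0$ and the entity set $\mathcal V^{0}_{e_q} = \{e_q\}$;
2:  for $\ell = 1, \ldots, L$ do
3:     collect the $\ell$-hop edges $\hat{\mathcal E}^{\ell}_{e_q} = \{(e_s, r, e_o) \in \mathcal F : e_s \in \mathcal V^{\ell-1}_{e_q}\}$ and entities $\mathcal V^{\ell}_{e_q} = \{e_o : (e_s, r, e_o) \in \hat{\mathcal E}^{\ell}_{e_q}\}$;
4:     message passing: $h^{\ell}_{e_o} = \delta\big(W^{\ell} \cdot \sum_{(e_s, r, e_o) \in \hat{\mathcal E}^{\ell}_{e_q}} \phi(h^{\ell-1}_{e_s}, h^{\ell}_{r})\big)$, for $e_o \in \mathcal V^{\ell}_{e_q}$;
5:  end for
6:  assign $h^{L}_{e_a} = \mathbf 0$ for all $e_a \notin \mathcal V^{L}_{e_q}$;
7:  return $h^{L}_{e_a}$ for all $e_a \in \mathcal V$.
Algorithm 2 RED-GNN: recursive r-digraph encoding.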
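The layer-by-layer sharing in Alg.2 has the shape of a dynamic program. The sketch below keeps that control flow but swaps the learned GNN message for a deliberately simple stand-in: each entity's "representation" is the number of length-$\ell$ paths reaching it from the query entity. This substitution is an assumption made purely for illustration, and `recursive_encode` is a hypothetical name.

```python
from collections import defaultdict


def recursive_encode(triples, e_q, depth):
    """One pass over shared edges yields a value for every reachable entity."""
    out_edges = defaultdict(list)
    for triple in triples:
        out_edges[triple[0]].append(triple)

    h = {e_q: 1}  # stand-in representation: path count from e_q
    for _ in range(depth):
        new_h = defaultdict(int)
        for e_s, h_s in h.items():            # entities visible so far
            for _, _, e_o in out_edges[e_s]:  # shared edges of this layer
                new_h[e_o] += h_s             # stand-in message: plain sum
        h = dict(new_h)
    return h  # entities absent from h are unreachable in exactly `depth` steps


toy = [("q", "r1", "a"), ("q", "r2", "b"), ("a", "r3", "t"),
       ("b", "r4", "t"), ("b", "r5", "x")]
print(recursive_encode(toy, "q", 2))
```

A single call returns a value for every candidate answer at once, instead of building and encoding one subgraph per candidate. In the real model the sum would be replaced by the attention-weighted aggregation and the integer by a $d$-dimensional vector.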

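Initially, only $e_q$ is visible in $\mathcal V^{0}_{e_q}$. In the $\ell$-th layer, we collect the edges $\hat{\mathcal E}^{\ell}_{e_q}$ and entities $\mathcal V^{\ell}_{e_q}$ in step 3. Then, the message passing is constrained to $\hat{\mathcal E}^{\ell}_{e_q}$ to get the representations $h^{\ell}_{e_o}$ for $e_o \in \mathcal V^{\ell}_{e_q}$. Finally, the representations $h^{L}_{e_a}$, encoding $\mathcal G_{e_q, e_a|L}$ for all the entities $e_a \in \mathcal V$, are returned in step 7. The recursive encoding is more efficient because the edges in $\hat{\mathcal E}^{\ell}_{e_q}$ are shared and fewer loops are needed. It also learns the same representations as Alg.1 under Prop.4.2.

Proposition 4.2.

Given the same triple $(e_q, r_q, e_a)$, the structures encoded in $h^{L}_{e_a}$ by Algorithm 1 and Algorithm 2 are identical.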

4.2 Learning essential information for query

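Up to here, the information of the query relation $r_q$ has not yet been addressed. To learn query-dependent representations of the r-digraphs, we specify the aggregation function as

$$h^{\ell}_{e_o} = \delta\Big(W^{\ell} \cdot \sum_{(e_s, r, e_o) \in \hat{\mathcal E}^{\ell}_{e_q}} \alpha^{\ell}_{e_s, r, e_o|r_q} \big(h^{\ell-1}_{e_s} + h^{\ell}_{r}\big)\Big). \tag{3}$$

To discover important edges, especially relations, in each layer, $r_q$ is encoded into the attention weight $\alpha^{\ell}_{e_s, r, e_o|r_q}$ on each edge as

$$\alpha^{\ell}_{e_s, r, e_o|r_q} = \sigma\Big((w^{\ell}_{\alpha})^{\top} \mathrm{ReLU}\big(W^{\ell}_{\alpha} \cdot (h^{\ell-1}_{e_s} \oplus h^{\ell}_{r} \oplus h^{\ell}_{r_q})\big)\Big), \tag{4}$$

with $w^{\ell}_{\alpha} \in \mathbb R^{d_\alpha}$, $W^{\ell}_{\alpha} \in \mathbb R^{d_\alpha \times 3d}$, and $\oplus$ is the concatenation operator. The sigmoid function $\sigma$ is used rather than softmax attention Veličković et al. (2017) to ensure multiple edges can be selected in the same neighborhood.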
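The reason for preferring a sigmoid over a softmax can be seen on a tiny numeric example. The sketch below uses hypothetical edge scores (not values from the paper) and only the standard library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-activation scores for three edges in one neighborhood:
# two strongly relevant edges and one irrelevant edge.
scores = [4.0, 4.0, -4.0]
sigmoid_weights = [sigmoid(s) for s in scores]
softmax_weights = softmax(scores)
print(sigmoid_weights)
print(softmax_weights)
```

With the sigmoid, both relevant edges receive a weight close to 1 independently of each other. With the softmax, the weights must sum to 1, so the two relevant edges split the mass and each falls just below 0.5, which would make a fixed selection threshold harder to apply.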

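After $L$ layers’ aggregation in (3), the representations $h^{L}_{e_a}$ can encode essential information for scoring $(e_q, r_q, e_a)$. Hence, we design a simple scoring function $f(e_q, r_q, e_a) = w^{\top} h^{L}_{e_a}$ with $w \in \mathbb R^{d}$. We associate the multi-class log-loss Lacroix et al. (2018) with each training triple $(e_q, r_q, e_a)$, i.e.,

$$\mathcal L = \sum_{(e_q, r_q, e_a) \in \mathcal F_{tra}} \Big(-f(e_q, r_q, e_a) + \log \sum_{e \in \mathcal V} \exp\big(f(e_q, r_q, e)\big)\Big). \tag{5}$$

The first part in (5) is the score of the positive triple $(e_q, r_q, e_a)$ in the training set $\mathcal F_{tra}$, and the second part contains the scores of all triples with the same query $(e_q, r_q, ?)$. The model parameters $W^{\ell}$, $w^{\ell}_{\alpha}$, $W^{\ell}_{\alpha}$, $h^{\ell}_{r}$ and $w$ are randomly initialized and are optimized by minimizing (5) with stochastic gradient descent Kingma and Ba (2014).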
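For a single training triple, the multi-class log-loss is the negative score of the correct answer plus the log-sum-exp of the scores of all candidates. A minimal sketch, assuming the candidate scores are given as a plain dictionary (the function name and the toy scores are hypothetical):

```python
import math


def multiclass_log_loss(scores, positive):
    """Return -f(positive) + log(sum over candidates of exp(f(candidate)))."""
    peak = max(scores.values())  # shift by the maximum to avoid overflow
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return -scores[positive] + log_sum_exp


# Hypothetical scores of three candidate answers for one query.
candidate_scores = {"e1": 2.0, "e2": 0.5, "e3": -1.0}
print(multiclass_log_loss(candidate_scores, "e1"))
```

The loss is always non-negative and shrinks as the positive answer's score grows relative to the others. Because it needs a score for every candidate entity, it pairs naturally with an encoder that returns all candidates' representations in one pass.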

Theorem 4.3.

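Given a triple $(e_q, r_q, e_a)$, let $\mathcal P$ be a set of relational paths from $e_q$ to $e_a$ that have strong correlation with $(e_q, r_q, e_a)$. Denote $\mathcal G_{\mathcal P}$ as the r-digraph constructed by $\mathcal P$. There exists a parameter setting $\Theta$ and a threshold $\theta$ for RED-GNN such that $\mathcal G_{\mathcal P}$ equals the subgraph of $\mathcal G_{e_q, e_a|L}$ whose edges have attention weights of at least $\theta$.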

We provide Theorem 4.3 which means that if a set of relational paths are strongly correlated with the query triple, they can be identified by the attention weights in RED-GNN.

5 Experiments

All the experiments are written in Python with the PyTorch framework Paszke et al. (2017) and run on an RTX 2080Ti GPU with 11GB memory.

5.1 Reasoning on KG with unseen entities

Reasoning on KG with unseen entities has recently become a hot topic Sadeghian et al. (2019); Teru et al. (2020); Yang et al. (2017), as new entities, such as new users and new concepts, keep emerging in real-world scenarios. Transferring to unseen entities requires the model to understand how to infer relations based on the local evidence, ignoring the identity of entities.

Setup. We follow the general setting Sadeghian et al. (2019); Teru et al. (2020); Yang et al. (2017) that reasons on a KG with unseen entities. Specifically, training and testing use two KGs with the same set of relations but disjoint sets of entities. Three sets of query triples (training, validation and testing), augmented with reverse relations, are provided. The facts of the training KG are used to predict the training and validation queries, respectively. In testing, the facts of the testing KG are used to predict the testing queries. Same as Sadeghian et al. (2019); Teru et al. (2020); Yang et al. (2017), we use the filtered ranking metrics, i.e., mean reciprocal rank (MRR), Hit@1 and Hit@10 Bordes et al. (2013), where larger values indicate better performance. We use the benchmark datasets in Teru et al. (2020), created on WN18RR Dettmers et al. (2017), FB15k237 Toutanova and Chen (2015) and NELL-995 Xiong et al. (2017). Each dataset includes four versions with different groups of triples. The hyper-parameters, including learning rate, weight decay, dropout, batch size, dimension $d$, number of layers $L$ and activation function $\delta$, are selected by the MRR metric on the validation queries with maximum training epochs of 50. More details of the setup are in the Appendix.
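The filtered ranking metrics named in the setup can be computed as follows. This is a minimal sketch under two assumptions that are not stated in the text: candidate scores are given as a dictionary, and ties are resolved optimistically (only strictly higher scores count against the answer). Function names are hypothetical.

```python
def filtered_rank(scores, answer, known_answers):
    """Rank of `answer`, skipping other candidates already known to be correct."""
    target = scores[answer]
    better = sum(
        1
        for entity, score in scores.items()
        if entity != answer and entity not in known_answers and score > target
    )
    return 1 + better


def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and the fraction of ranks within the top k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


# Hypothetical scores: "a" is another known correct answer, so it is filtered out.
print(filtered_rank({"a": 0.9, "b": 0.8, "c": 0.7}, "c", {"a", "c"}))
print(mrr_and_hits([1, 2, 4], k=1))
```

Filtering matters when a query has several correct answers: without it, a model would be penalized for ranking another true answer above the one under evaluation.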


Baselines. Since training and testing contain disjoint sets of entities, all the methods requiring the entity embeddings cannot be applied here Sadeghian et al. (2019); Teru et al. (2020). We mainly compare with four methods: 1) RuleN Meilicke et al. (2018), the discrete rule induction method; 2) Neural-LP Yang et al. (2017), the first differentiable method for rule learning; 3) DRUM Sadeghian et al. (2019), an improved work of Neural-LP Yang et al. (2017); and 4) GraIL Teru et al. (2020), which designs the enclosing subgraph for inductive reasoning. MINERVA Das et al. (2017), PathCon Wang et al. (2020) and RNNLogic Qu et al. (2021) can potentially work in this setting, but no source code is available for it, so they are not compared.

Results. The performance is shown in Tab.1. First, GraIL is the worst among all the methods since the enclosing subgraphs do not capture well the relational structures that can be generalized to unseen entities. Second, there is no absolute winner among the rule-based methods as different rules adapt differently to these datasets. In comparison, RED-GNN outperforms the baselines across all of the benchmarks. Based on Theorem 4.3, the attention weights can help to adaptively learn correlated relational paths for different datasets, while preserving the structural patterns at the same time. The results of Hit@1 and Hit@10 are provided in Appendix A.2 due to space limitation. In some cases, the Hit@10 of RED-GNN is slightly worse than the rule-based methods since it may overfit to the top-ranked samples.

metric  methods     WN18RR                   FB15k-237                NELL-995
                    V1    V2    V3    V4     V1    V2    V3    V4     V1    V2    V3    V4
MRR     RuleN       .668  .645  .368  .624   .363  .433  .439  .429   .615  .385  .381  .333
        Neural LP   .649  .635  .361  .628   .325  .389  .400  .396   .610  .361  .367  .261
        DRUM        .666  .646  .380  .627   .333  .395  .402  .410   .628  .365  .375  .273
        GraIL       .627  .625  .323  .553   .279  .276  .251  .227   .481  .297  .322  .262
        RED-GNN     .701  .690  .427  .651   .369  .469  .445  .442   .637  .419  .436  .363
Table 1: Results on KG with unseen entities. Best performance is indicated by the bold face numbers.

5.2 Reasoning on incomplete KG

Reasoning on incomplete KGs is another general setting in the literature, i.e., KG completion Wang et al. (2017). It evaluates the models’ ability to learn the patterns on an incomplete KG.

Setup. In this setting, a KG and the query triples, augmented with reverse relations, are given. For the triple-based methods, the training triples are used for training and the testing queries are used for inference. For the others, part of the training triples is used to extract paths/subgraphs to predict the remaining triples in training, and the full set is then used to predict the testing queries in inference Yang et al. (2017); Sadeghian et al. (2019). We use the same ranking metrics as in Section 5.1, namely MRR, Hit@1 and Hit@10. Five well-known benchmarks are used, including Family Kok and Domingos (2007), UMLS Kok and Domingos (2007), WN18RR Dettmers et al. (2017), FB15k237 Toutanova and Chen (2015) and NELL-995 Xiong et al. (2017). The hyper-parameter settings of RED-GNN are the same as those in Section 5.1. More details are provided in Appendix A.3.

Baselines. We compare with the triple-based methods RotatE Sun et al. (2019) and QuatE Zhang et al. (2019); the path-based methods MINERVA Das et al. (2017), Neural LP Yang et al. (2017), DRUM Sadeghian et al. (2019) and RNNLogic Qu et al. (2021); and the GNN-based methods CompGCN Vashishth et al. (2019) and DPMPN Xu et al. (2019). RuleN Meilicke et al. (2018) is not compared here since it has been shown to be worse than DRUM Sadeghian et al. (2019) and RNNLogic Qu et al. (2021) in this setting. GraIL Teru et al. (2020) is not compared since it is computationally intractable (see Section 5.3).

Results. As in Tab.2, the triple-based methods are better than the path-based ones on Family and UMLS, and are comparable with DRUM and RNNLogic on WN18RR and FB15k-237. The entity embeddings can implicitly preserve local information around entities, while the path-based methods may lose the structural patterns. CompGCN performs similarly to the triple-based methods since it mainly relies on the aggregated embeddings and the decoder scoring function. Neural LP, DRUM and CompGCN run out of memory on NELL-995 because of its number of entities and the use of the full adjacency matrix. For DPMPN, the entities in the pruned subgraph are more informative than those in CompGCN, thus it has better performance. RED-GNN is the runner-up on FB15k-237 and the best on the other benchmarks. These results demonstrate that the r-digraph can not only transfer well to unseen entities, but also capture the important patterns in incomplete KGs without using entity embeddings.

type    models     Family             UMLS               WN18RR             FB15k-237          NELL-995
                   MRR   H@1   H@10   MRR   H@1   H@10   MRR   H@1   H@10   MRR   H@1   H@10   MRR   H@1   H@10
triple  RotatE     .921  86.6  98.8   .925  86.3  99.3   .477  42.8  57.1   .337  24.1  53.3   .508  44.8  60.8
        QuatE      .941  89.6  99.1   .944  90.5  99.3   .480  44.0  55.1   .350  25.6  53.8   .533  46.6  64.3
path    MINERVA    .885  82.5  96.1   .825  72.8  96.8   .448  41.3  51.3   .293  21.7  45.6   .513  41.3  63.7
        Neural LP  .924  87.1  99.4   .745  62.7  91.8   .435  37.1  56.6   .252  18.9  37.5   out of memory
        DRUM       .934  88.1  99.6   .813  67.4  97.6   .486  42.5  58.6   .343  25.5  51.6   out of memory
        RNNLogic*  -     -     -      .842  77.2  96.5   .483  44.6  55.8   .344  25.2  53.0   -     -     -
GNN     CompGCN    .933  88.3  99.1   .927  86.7  99.4   .479  44.3  54.6   .355  26.4  53.5   out of memory
        DPMPN      .981  97.4  98.1   .930  89.9  98.2   .482  44.4  55.8   .369  28.6  53.0   .513  45.2  61.5
        RED-GNN    .992  98.8  99.7   .964  94.6  99.0   .533  48.5  62.4   .364  27.3  54.4   .543  47.6  65.1
Table 2: Results on incomplete KG. H@1 and H@10 are short for Hit@1 and Hit@10 in percentage. Best performance is indicated by the bold face numbers. * is copied from best results in their paper; - marks a value that is not available.

5.3 Running time analysis

We compare the running time of different methods in this part. We show the training time and the inference time of each method in Fig 2(a) and 2(c), and learning curves in Fig 2(b) and 2(d).

For reasoning on KG with unseen entities, we compare RuleN, Neural LP, DRUM, GraIL and RED-GNN on WN18RR (V1), FB15k-237 (V1) and NELL-995 (V1). Both training and inference are very efficient in RuleN. Neural LP and DRUM have similar cost but are more expensive than RED-GNN because they use the full adjacency matrix. For GraIL, both training and inference are very expensive since they require bidirectional sampling to extract the subgraph and then compute for each triple. Overall, RED-GNN is more efficient than the differentiable methods Neural LP, DRUM and GraIL.

(a) Running time.
(b) Learning curve.
(c) Running time.
(d) Learning curve.
Figure 2: Running time analysis and learning curve on WN18RR (V1) and WN18RR.

For reasoning on incomplete KG, we compare RotatE, MINERVA, CompGCN, DPMPN and RED-GNN on Family, WN18RR and NELL-995. Due to the simple training framework on triples, RotatE is the most efficient method. MINERVA is more efficient than GNN-based methods since the sampled paths have simpler structures. CompGCN, with fewer layers, has a similar cost to RED-GNN as its computation is on the whole graph. For DPMPN, the pruning is expensive and it has two GNNs working together, thus it is more expensive than RED-GNN. Since GraIL is hundreds of times more expensive than RED-GNN, it is intractable on the larger KGs in this setting.

5.4 Ablation study

Interpretation. We visualize exemplar r-digraphs learned by RED-GNN on the Family and UMLS datasets. Given the triple $(e_q, r_q, e_a)$, we remove the edges in $\mathcal G_{e_q, e_a|L}$ whose attention weights are less than a threshold, and extract the remaining parts. Fig.3(a) shows one triple on which DRUM fails. As shown, inferring one entity as the son of another requires the knowledge, available from the local structure, that the latter is the only brother of the uncle. Fig.3(b) and 3(c) show two examples with the same $e_q$ and $e_a$, sharing the same digraph $\mathcal G_{e_q, e_a|L}$. First, RED-GNN can learn distinctive structures for different query relations, which is consistent with Theorem 4.3. Second, RED-GNN can learn similar structures for a certain relation, such as Complicates, verifying the transferability across triples regardless of the identity of entities. The examples in Fig.3 demonstrate that RED-GNN is interpretable. We provide more examples and the visualization algorithm in Appendix C.
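The edge-removal step used for visualization amounts to a per-layer filter. The sketch below is an assumption-laden illustration, not the visualization algorithm of Appendix C: it assumes the r-digraph is stored as one edge set per layer, that attention weights are keyed by (layer index, edge), and that the threshold is supplied by the caller.

```python
def prune_by_attention(layers, attention, threshold):
    """Keep, in each layer, the edges whose attention weight is at least `threshold`."""
    return [
        {edge for edge in layer if attention[(depth, edge)] >= threshold}
        for depth, layer in enumerate(layers, start=1)
    ]


# Hypothetical two-layer r-digraph with made-up attention weights.
layers = [
    {("q", "r1", "a"), ("q", "r2", "b")},
    {("a", "r3", "t"), ("b", "r4", "t")},
]
attention = {
    (1, ("q", "r1", "a")): 0.9,
    (1, ("q", "r2", "b")): 0.1,
    (2, ("a", "r3", "t")): 0.8,
    (2, ("b", "r4", "t")): 0.7,
}
print(prune_by_attention(layers, attention, 0.5))
```

Note that this filter alone can leave an edge such as `("b", "r4", "t")` whose source is no longer reachable from the query entity; a second pass that drops such dangling edges is not included here.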

(a) Family.
(b) UMLS-1.
(c) UMLS-2.
Figure 3: Visualization of the learned structures. Dashed lines mean inverse relations. The query triples are indicated by the red rectangles. Due to space limitation, entities in UMLS dataset are shown as black circles (best viewed in color).

Variants of RED-GNN. Tab.3 shows the performance of different variants. First, we study the impact of removing $r_q$ in attention (denoted as Attn-w.o.-$r_q$). Specifically, we remove $h^{\ell}_{r_q}$ from the attention weight in (4) and change the scoring function accordingly. Since the attention aims to figure out the important edges in $\mathcal G_{e_q, e_a|L}$, the learned structure will be less informative without the control of the query relation $r_q$, and thus the performance is poor.

Second, we replace Alg.2 in RED-GNN with the simple r-digraph encoding by Alg.1 (denoted as RED-Alg.1). Due to the efficiency issue, the loss function (5), which computes the scores over all candidate triples for each sample, cannot be used. Hence, we use the margin ranking loss with negative sampling as in GraIL Teru et al. (2020). As in Tab.3, the performance of RED-Alg.1 is weaker than RED-GNN since the multi-class log loss is better than loss functions with negative sampling Lacroix et al. (2018); Ruffinelli et al. (2020). RED-Alg.1 still outperforms GraIL in Tab. 1 since the r-digraphs are better structures for reasoning. The running time of RED-Alg.1 is in Fig.2(a) and 2(b). Alg.1 is much more expensive than Alg.2 but is cheaper than GraIL since GraIL needs the Dijkstra algorithm to label the entities.

                 WN18RR (V1)          FB15k-237 (V1)       NELL-995 (V1)
methods          MRR   Hit@1  Hit@10  MRR   Hit@1  Hit@10  MRR   Hit@1  Hit@10
Attn-w.o.-$r_q$  .659  58.6   78.3    .268  20.1   37.6    .517  40.5   73.4
RED-Alg.1        .683  63.1   79.6    .311  23.2   45.3    .563  48.5   75.8
RED-GNN          .701  65.3   79.9    .369  30.2   48.3    .637  52.5   86.6
Table 3: Comparison of different variants of RED-GNN.

Depth of models. In Fig.4, we show the testing MRR with different numbers of layers or steps $L$ on the left $y$-axis. The coverage (in %) of testing triples $(e_q, r_q, e_a)$ where $e_a$ is visible from $e_q$ in $L$ steps, i.e., $e_a \in \mathcal V^{L}_{e_q}$, is shown on the right $y$-axis. Intuitively, when $L$ increases, more triples will be covered, and the paths or subgraphs between $e_q$ and $e_a$ then contain richer information but will be harder to learn. As shown, the performance of DRUM, Neural LP and MINERVA decreases for larger $L$. CompGCN runs out of memory as $L$ grows, and it is also hard for it to capture complex structures with the depth it can afford. When $L$ is too small, RED-GNN has poor performance mainly due to the limited information encoded in such small r-digraphs. RED-GNN achieves the best performance for larger $L$, where the r-digraphs can contain richer information and the important information for reasoning can be effectively learned by (3). Since the computation cost significantly increases with $L$, we tune $L$ to balance the efficiency and effectiveness in practice.

Figure 4: The MRR performance with different and coverage of triples within steps.
Figure 5: The per-distance evaluation for MRR on WN18RR vs. the length of the shortest path.
distance    1     2     3     4     5     >5
ratios (%)  34.9  9.3   21.5  7.5   8.9   17.9
CompGCN     .993  .327  .337  .062  .061  .016
DPMPN       .982  .381  .333  .102  .057  .001
RED-GNN     .993  .563  .536  .186  .089  .005

Per-distance performance. Note that given an r-digraph with layers, the information between two nodes that are not reachable with hops cannot be propagated by Alg.2. This may raise a concern about the predictive ability of RED-GNN, especially for triples not reachable in steps. We demonstrate that this is not a problem here. Specifically, given a triple , we compute the shortest distance from to . Then, the MRR performance is grouped by distance. We compare CompGCN (), DPMPN () and RED-GNN () in Tab.5. All the models perform worse on triples with larger distance and cannot model well the triples that are far away. RED-GNN has the best per-distance performance within distance 5.
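The grouping step can be sketched as follows, assuming the shortest distances have already been computed (for instance by breadth-first search) and that each test triple comes with the rank of its true answer. The function name, the overflow bucket label and the choice to place unreachable pairs (distance `None`) in the overflow bucket are assumptions of this example.

```python
from collections import defaultdict

def mrr_by_distance(distances, ranks, max_bucket=5):
    """Mean reciprocal rank per shortest-distance bucket. Distances above
    max_bucket, and None for unreachable pairs, share one overflow bucket."""
    buckets = defaultdict(list)
    for d, rank in zip(distances, ranks):
        in_range = d is not None and d <= max_bucket
        key = str(d) if in_range else ">" + str(max_bucket)
        buckets[key].append(1.0 / rank)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Toy input: shortest distance and rank of the true answer for five triples.
per_distance = mrr_by_distance([1, 1, 3, 7, None], [1, 2, 4, 10, 100])
```

Reporting one MRR value per bucket, together with the share of triples in each bucket, makes it visible where a model's overall score comes from.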

6 Conclusion

In this paper, we introduce a novel relational structure, i.e., r-digraph, as a generalized structure of relational paths for KG reasoning. Individually computing on each r-digraph is expensive for the reasoning task. Hence, inspired by solving overlapping sub-problems by dynamic programming, we propose RED-GNN, a variant of GNN, to efficiently construct and effectively learn the r-digraph. We show that RED-GNN achieves state-of-the-art performance on both KGs with unseen entities and incomplete KG benchmarks. The training and inference of RED-GNN are very efficient compared with the other GNN-based baselines. Besides, interpretable structures for reasoning can be learned by RED-GNN. One limitation of this work is that it will be slow and require more memory for even larger KGs such as ogbl-biokg and ogbl-wikikg2 Hu et al. (2020). In future work, we can leverage the pruning technique in Xu et al. (2019) or distributed programming in Cohen et al. (2019) to make RED-GNN work on extremely large-scale KGs.


  • [1] P. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, et al. (2018) Relational inductive biases, deep learning, and graph networks. Technical report arXiv:1806.01261. Cited by: §1, §1, §2.2.
  • [2] D. Battista, P. Eades, R. Tamassia, and I. Tollis (1998) Graph drawing: algorithms for the visualization of graphs. Prentice Hall PTR. Cited by: Definition 2.
  • [3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In NeurIPS, pp. 2787–2795. Cited by: §1, §5.1.
  • [4] X. Chen, S. Jia, and Y. Xiang (2020) A review: knowledge reasoning over knowledge graph. Expert Systems with Applications 141, pp. 112948. Cited by: §1, §1, §2.1, §2.
  • [5] W. W. Cohen, H. Sun, R. A. Hofer, and M. Siegler (2019) Scalable neural methods for reasoning with a symbolic knowledge base. In ICLR, Cited by: §A.1, §6.
  • [6] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum (2017) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: Table 6, §1, §2.1, §2, §3, §3, §5.1, §5.2.
  • [7] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2017) Convolutional 2D knowledge graph embeddings. In AAAI, Cited by: §1, §5.1, §5.2.
  • [8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, pp. 1263–1272. Cited by: §1, §1, §2.2.
  • [9] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034. Cited by: §1.
  • [10] Z. Han, P. Chen, Y. Ma, and V. Tresp (2021) Explainable subgraph reasoning for forecasting on temporal knowledge graphs. In ICLR, Cited by: §2.2.
  • [11] K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), pp. 251–257. Cited by: 1st item.
  • [12] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. In NeurIPS, Cited by: §6.
  • [13] S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. Yu (2020) A survey on knowledge graphs: representation, acquisition and applications. Technical report arXiv:2002.00388. Cited by: §1.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Technical report arXiv:1412.6980. Cited by: §A.2, §4.2.
  • [15] T. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1.
  • [16] S. Kok and P. Domingos (2007) Statistical predicate invention. In ICML, pp. 433–440. Cited by: §5.2.
  • [17] T. Lacroix, N. Usunier, and G. Obozinski (2018) Canonical tensor decomposition for knowledge base completion. In ICML, pp. 2863–2872. Cited by: §4.2, §5.4.
  • [18] N. Lao and W. W. Cohen (2010) Relational retrieval using a combination of path-constrained random walks. Machine learning 81 (1), pp. 53–67. Cited by: §1.
  • [19] N. Lao, T. Mitchell, and W. Cohen (2011) Random walk inference and learning in a large scale knowledge base. In EMNLP, pp. 529–539. Cited by: Definition 1.
  • [20] C. Meilicke, M. Fink, Y. Wang, D. Ruffinelli, R. Gemulla, and H. Stuckenschmidt (2018) Fine-grained evaluation of rule-and embedding-based systems for knowledge graph completion. In ISWC, pp. 3–20. Cited by: §1, §2.1, §5.1, §5.2.
  • [21] M. Niepert (2016) Discriminative gaifman models. In NIPS, Vol. 29, pp. 3405–3413. Cited by: §1.
  • [22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In ICLR, Cited by: §5.
  • [23] M. Qu, J. Chen, L. Xhonneux, Y. Bengio, and J. Tang (2021) RNNLogic: learning logic rules for reasoning on knowledge graphs. In ICLR, Cited by: §1, §2.1, §5.1, §5.2.
  • [24] D. Ruffinelli, S. Broscheit, and R. Gemulla (2020) You can teach an old dog new tricks! on training knowledge graph embeddings. In ICLR, Cited by: §5.4.
  • [25] A. Sadeghian, M. Armandpour, P. Ding, and D. Wang (2019) DRUM: end-to-end differentiable rule mining on knowledge graphs. In NeurIPS, pp. 15347–15357. Cited by: §A.3, §1, §2.1, §3, §3, §3, §5.1, §5.1, §5.1, §5.2, §5.2.
  • [26] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607. Cited by: §1, §2.2, §2.2.
  • [27] Y. Shen, J. Chen, Y. Huang, and J. Gao (2018) M-walk: learning to walk over graphs using monte carlo tree search. In NeurIPS, Cited by: §1, §2.1.
  • [28] Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR, Cited by: §1, §5.2.
  • [29] K. K. Teru, E. Denis, and W. Hamilton (2020) Inductive relation prediction by subgraph reasoning. In ICML, Cited by: §A.2, Appendix D, §1, §1, §2.2, §4, §5.1, §5.1, §5.1, §5.2, §5.4.
  • [30] K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In PWCVSMC, pp. 57–66. Cited by: §5.1, §5.2.
  • [31] T. Trouillon, C. R. Dance, É. Gaussier, J. Welbl, S. Riedel, and G. Bouchard (2017) Knowledge graph completion via complex tensor factorization. JMLR 18 (1), pp. 4735–4772. Cited by: §1.
  • [32] S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar (2019) Composition-based multi-relational graph convolutional networks. In ICLR, Cited by: §A.3, §1, §2.2, §3, §5.2.
  • [33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. In ICLR, Cited by: §4.2.
  • [34] H. Wang, H. Ren, and J. Leskovec (2020) Entity context and relational paths for knowledge graph completion. Technical report arXiv:2002.06757. Cited by: §1, §2.1, §4, §5.1.
  • [35] Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge graph embedding: a survey of approaches and applications. TKDE 29 (12), pp. 2724–2743. Cited by: §1, §1, §5.2.
  • [36] W. Xiong, T. Hoang, and W. Wang (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. In EMNLP, pp. 564–573. Cited by: §1, §1, §2.1, §5.1, §5.2, Definition 1.
  • [37] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. In ICLR, Cited by: §1.
  • [38] X. Xu, W. Feng, Y. Jiang, X. Xie, Z. Sun, and Z. Deng (2019) Dynamically pruned message passing networks for large-scale knowledge graph reasoning. In ICLR, Cited by: §1, §2.2, §5.2, §6.
  • [39] F. Yang, Z. Yang, and W. Cohen (2017) Differentiable learning of logical rules for knowledge base reasoning. In NeurIPS, pp. 2319–2328. Cited by: §A.3, §1, §1, §2.1, §2, §3, §3, §5.1, §5.1, §5.1, §5.2, §5.2.
  • [40] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim (2019) Graph transformer networks. NeurIPS 32, pp. 11983–11993. Cited by: §3.
  • [41] S. Zhang, Y. Tay, L. Yao, and Q. Liu (2019) Quaternion knowledge graph embeddings. In NeurIPS, Cited by: §1, §5.2.
  • [42] H. Zhao, Q. Yao, J. Li, Y. Song, and K. L. Lee (2017) Meta-graph based recommendation fusion over heterogeneous information networks. In SIGKDD, pp. 635–644. Cited by: §1.

Appendix A Experiment materials

A.1 Implementation details

Efficient sampling. Given the augmented triples , we introduce how to efficiently collect and in step 3 in Algorithm 2. The implementation borrows the idea of using sparse matrices to represent symbols in the KG [5]. First, we build a sparse subject-fact matrix and if the -th entity is the subject entity of the -th triple. We then form a sparse matrix whose rows are one-hot representations for entities in . Next, we can use sparse matrix multiplication to get the indices of triples in . Then the triples obtained by these indices are used to form . is updated by collecting the set of tail entities in . This procedure can be efficient since both and are sparse matrices.
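The procedure above can be sketched in pure Python by storing each sparse binary matrix as a mapping from a row index to its non-zero column indices. All function names and the toy triples are hypothetical. A practical implementation would more likely rely on a sparse-matrix library, which is an assumption here rather than something stated in the text.

```python
def build_subject_fact_matrix(triples):
    """Sparse binary matrix stored as {entity: [triple indices]}: entry (i, j)
    is non-zero when entity i is the subject of the j-th triple."""
    rows = {}
    for j, (s, _r, _o) in enumerate(triples):
        rows.setdefault(s, []).append(j)
    return rows

def one_hot_times_subject_fact(entities, subject_fact):
    """Product of a one-hot row matrix (one entity per row) with the
    subject-fact matrix, returned as the non-zero columns of each row."""
    return [subject_fact.get(e, []) for e in entities]

def expand_layer(triples, subject_fact, entities):
    """Collect the edges leaving `entities` and the resulting tail entities."""
    rows = one_hot_times_subject_fact(sorted(entities), subject_fact)
    edge_ids = sorted({j for cols in rows for j in cols})
    edges = [triples[j] for j in edge_ids]
    next_entities = {o for _s, _r, o in edges}
    return edges, next_entities

# Toy KG with integer ids: (subject, relation, object).
triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (2, 0, 3), (3, 0, 4)]
subject_fact = build_subject_fact_matrix(triples)
edges_1, layer_1 = expand_layer(triples, subject_fact, {0})
edges_2, layer_2 = expand_layer(triples, subject_fact, layer_1)
```

Because a one-hot row selects exactly one row of the subject-fact matrix, the product reduces to a lookup, which is why the cost depends on the number of non-zero entries rather than on the full matrix size.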

Mini-batch computation. Different query triples in a mini-batch may correspond to different queries. Since the GNN in Section 4.2 is query-dependent, we should independently learn on each triple. For efficiency, we add an index for each triple as in the mini-batch so that different r-digraphs and triples can be constructed and computed in parallel, and entities corresponding to different queries will not be in conflict. Besides, adding an index id can also leverage the efficient sampling method in the previous paragraph by preserving the id for each matrix with the shared .
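One way to picture the indexing trick is to tag every entity with the position of its query in the mini-batch, so that the same entity reached from two different queries is kept as two separate nodes. The sketch below is a simplified illustration under that assumption. The function name and data layout are hypothetical.

```python
def expand_batch(triples, frontier):
    """Expand a batched frontier of (query_idx, entity) pairs over one shared
    KG; returns edges tagged with query_idx and the next batched frontier."""
    by_subject = {}
    for s, r, o in triples:
        by_subject.setdefault(s, []).append((r, o))
    edges = [
        (q, s, r, o)
        for q, s in sorted(frontier)
        for r, o in by_subject.get(s, [])
    ]
    next_frontier = {(q, o) for q, _s, _r, o in edges}
    return edges, next_frontier

# Two queries in one batch: query 0 starts at entity 0, query 1 at entity 1.
triples = [(0, 0, 1), (1, 0, 2), (2, 0, 0)]
batch_edges, batch_frontier = expand_batch(triples, {(0, 0), (1, 1)})
```

Entity 1 appears both as the start of query 1 and as a node reached by query 0, and the index keeps the two occurrences apart while the KG lookup structure is shared.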

A.2 Setup for KG with unseen entities

Hyper-parameters. For RED-GNN, we tune the learning rate in , weight decay in , dropout rate in , batch size in , dimension in , for attention in , layer in , and activation function in {identity, tanh, ReLU}. Adam [14] is used as the optimizer. The best hyper-parameter settings are selected by the MRR metric on with a maximum of 50 training epochs. For RuleN, we use the code with the default settings. For Neural LP and DRUM, we tune the learning rate in , dropout rate in , batch size in , dimension in , layer of RNN in , and number of steps in . For GraIL, we tune the learning rate in , weight decay in , batch size in , dropout rate in , edge_dropout rate in , GNN aggregator among {sum, MLP, GRU} and hop numbers among . The number of training epochs is set to 50 for all methods.

Tie policy. In evaluation, the tie policy is important. Specifically, when several triples receive the same score, choosing the largest or the smallest rank in the tie leads to rather different results. Considering that we give the same score for triples where , there will be a concern of the tie policy. Hence, we use the average rank among the tied triples, as suggested by the paper “Knowledge Graph Embedding for Link Prediction: A Comparative Analysis”.
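A minimal sketch of the average-rank policy is given below, assuming that higher scores are better and that the scores of all candidates are available as a list. The function name and the toy scores are illustrative.

```python
def average_rank(scores, answer):
    """Rank of `answer` (1 = best), with ties resolved by the mean of the
    best and the worst rank that the answer could take inside the tie."""
    target = scores[answer]
    higher = sum(1 for s in scores if s > target)
    tied = sum(1 for s in scores if s == target)  # includes the answer itself
    # Tied candidates occupy ranks higher+1 .. higher+tied; take their mean.
    return higher + (tied + 1) / 2.0

# The answer (index 1) ties with two other candidates below one better score.
rank = average_rank([0.9, 0.5, 0.5, 0.5, 0.1], answer=1)
reciprocal_rank = 1.0 / rank
```

In this toy case the optimistic policy would assign rank 2 and the pessimistic policy rank 4, so the average policy sits between the two.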

Benchmarks. We use the 12 groups of benchmark datasets from [29]. The detailed statistics are summarized in Tab.4.

datasets   WN18RR                     FB15k-237                  NELL-995
           ent    rel  fact   pred    ent    rel  fact   pred    ent    rel  fact   pred
v1 train   2746   9    5410   1268    1594   180  4245   981     3102   14   4687   853
   test    922         1618   373     1093        1993   411     255         833    201
v2 train   6954   10   15262  3706    1608   200  9739   2346    2563   88   8219   1890
   test    5084        4011   852     1660        4145   947     2086        4586   935
v3 train   12078  11   25901  6249    3668   215  17986  4408    4647   142  16393  3724
   test    5084        6327   1140    2501        7406   1731    3566        8048   1620
v4 train   3861   9    7940   1902    4707   219  27203  6713    2092   76   7546   1743
   test    7084        12334  2823    3051        11714  2840    2795        7073   1447
Table 4: Statistics of the benchmarks with unseen entities (inductive). “fact” denotes the number of triples used as graph, and “pred” means the triples used for reasoning.

Results. The full results with MRR, Hit@1 and Hit@10 metrics are shown in Tab.5.

                   WN18RR                    FB15k-237                 NELL-995
                   V1    V2    V3    V4      V1    V2    V3    V4      V1    V2    V3    V4
MRR     RuleN      .668  .645  .368  .624    .363  .433  .439  .429    .615  .385  .381  .333
        Neural LP  .649  .635  .361  .628    .325  .389  .400  .396    .610  .361  .367  .261
        DRUM       .666  .646  .380  .627    .333  .395  .402  .410    .628  .365  .375  .273
        GraIL      .627  .625  .323  .553    .279  .276  .251  .227    .481  .297  .322  .262
        RED-GNN    .701  .690  .427  .651    .369  .469  .445  .442    .637  .419  .436  .363
Hit@1   RuleN      63.5  61.1  34.7  59.2    30.9  34.7  34.5  33.8    54.5  30.4  30.3  24.8
        Neural LP  59.2  57.5  30.4  58.3    24.3  28.6  30.9  28.9    50.0  24.9  26.7  13.7
        DRUM       61.3  59.5  33.0  58.6    24.7  28.4  30.8  30.9    50.0  27.1  26.2  16.3
        GraIL      55.4  54.2  27.8  44.3    20.5  20.2  16.5  14.3    42.5  19.9  22.4  15.3
        RED-GNN    65.3  63.3  36.8  60.6    30.2  38.1  35.1  34.0    52.5  31.9  34.5  25.9
Hit@10  RuleN      73.0  69.4  40.7  68.1    44.6  59.9  60.0  60.5    76.0  51.4  53.1  48.4
        Neural LP  77.2  74.9  47.6  70.6    46.8  58.6  57.1  59.3    87.1  56.4  57.6  53.9
        DRUM       77.7  74.7  47.7  70.2    47.4  59.5  57.1  59.3    87.3  54.0  57.7  53.1
        GraIL      76.0  77.6  40.9  68.7    42.9  42.4  42.4  38.9    56.5  49.6  51.8  50.6
        RED-GNN    79.9  78.0  52.4  72.1    48.3  62.9  60.3  62.1    86.6  60.1  59.4  55.6
Table 5: Results on KG with unseen entities. Best performance is indicated by the bold face numbers.

A.3 Setup for incomplete KG

Hyper-parameters. The tuning ranges of hyper-parameters of RED-GNN are the same as those in Appendix A.2. For RotatE and QuatE, we tune the dimensions in , batch_size in , weight decay in , number of negative samples in , with training iterations of 100000. For MINERVA, Neural LP, DRUM and DPMPN, we use their default settings since the configurations are provided. For CompGCN, we choose ConvE as the scoring function and correlation as the composition operator, as suggested in their paper [32], and tune the learning rate in , layer in and dropout rate in with training epochs of 300.

Benchmarks. We provide the statistics of entities, relations and split of triples in Tab.6. The data split for Family, UMLS, WN18RR and FB15k-237 is the same as in the literature [39, 25] and is publicly available. For NELL-995, we use the version in

dataset     #entity  #relation  #train   #valid  #test
Family      3,007    12         23,483   2,038   2,835
UMLS        135      46         5,327    569     633
WN18RR      40,943   11         86,835   3,034   3,134
FB15k-237   14,541   237        272,115  17,535  20,466
NELL-995*   74,536   200        149,678  543     2,818
Table 6: Statistics of the incomplete KG benchmarks (transductive). Note that NELL-995* differs from the version in [6], since the training triples there contain the validation and test triples.

A.4 Optimal hyper-parameters

We summarize the optimal hyper-parameters for reproducing the results of RED-GNN on each benchmark in Tab.7.

benchmarks | learning rate | weight decay | dropout | batch size | dimension | attention dimension | layer | activation
WN18RR (V1) 0.005 0.0002 0.21 100 64 5 5 ReLU
WN18RR (V2) 0.0016 0.0004 0.02 20 48 3 5 ReLU
WN18RR (V3) 0.0014 0.000034 0.28 20 64 5 5 tanh
WN18RR (V4) 0.006 0.00013 0.11 10 32 5 5 ReLU
FB15k-237 (V1) 0.0092 0.0003 0.23 20 32 5 3 ReLU
FB15k-237 (V2) 0.0077 0.0002 0.3 10 48 5 3 ReLU
FB15k-237 (V3) 0.0006 0.000023 0.27 20 48 3 3 ReLU
FB15k-237 (V4) 0.0052 0.000018 0.07 20 48 5 5 identity
NELL-995 (V1) 0.0021 0.00019 0.25 10 48 5 5 ReLU
NELL-995 (V2) 0.0075 0.00066 0.29 100 48 5 3 ReLU