MPLR: a novel model for multi-target learning of logical rules for knowledge graph reasoning

by   Yuliang Wei, et al.
Harbin Institute of Technology

Large-scale knowledge graphs (KGs) provide structured representations of human knowledge. However, since no graph can contain all knowledge, KGs are usually incomplete. Reasoning over existing facts paves a way to discovering missing ones. In this paper, we study the problem of learning logic rules for reasoning on knowledge graphs, with the goal of completing missing factual triplets. Learning logic rules equips a model with strong interpretability as well as the ability to generalize to similar tasks. We propose a model called MPLR that improves on existing models by making full use of the training data and by handling multi-target scenarios. In addition, considering the deficiencies of current schemes for evaluating model performance and the quality of mined rules, we further propose two novel indicators to address these problems. Experimental results empirically demonstrate that our MPLR model outperforms state-of-the-art methods on five benchmark datasets. The results also prove the effectiveness of the indicators.





1 Introduction

Knowledge storage and representation, together with reasoning about the causal relationships between pieces of knowledge, is inspired by human problem solving and helps intelligent systems understand human knowledge and gain the ability to deal with complicated tasks [newell1959report, mycin1976computer]. Knowledge graphs, as a form of structured human knowledge, are collections of real-world factual triplets, where each triplet denotes a predicate (a.k.a. relation) between a subject and an object. Subjects and objects are usually called entities in KGs; e.g., the fact that Beijing is the capital of China can be represented by (Beijing, capitalOf, China). Knowledge graphs are now widely used in a variety of applications such as recommender systems [xian2019reinforcement, wang2019kgat] and question answering [lin2019kagnet, ding2019cognitive]. Recently, knowledge graphs have drawn growing interest in both the academic and industrial communities [dong2014knowledge, nickel2015review, sun2019rotate].

However, because of the rapid iteration and inherent incompleteness of data, there are usually missing facts in existing KGs. For example, we may have the facts that Thiago Messi, Mateo Messi and Ciro Messi are sons of Leo Messi, but our KG might be missing information about the relationship between the three brothers. A typical task is link prediction, which completes the relation between two entities by reasoning over the given facts. This paper studies learning first-order logic rules for knowledge graph reasoning (KGR). As illustrated in Fig. 1, there is a rule in the form of a logic program, sisterOf(X, Z) ∧ sonOf(Z, Y) → daughterOf(X, Y), meaning that if X is a sister of Z and Z is a son of Y, then we can infer that X is the daughter of Y. Such logic rules gain strong interpretability [qu2019probabilistic, zhang2020efficient] and can be applied and generalized to previously unseen domains and data without retraining the model [teru2020inductive]. The same might not be true for embedding methods like TransE [bordes2013translating].

Figure 1: An example of knowledge graph reasoning in multi-target scenario.

Mining collections of relational rules is a subtask of statistical relational learning [koller2007introduction], and when the procedure involves learning new logical rules, it is often called inductive logic programming [muggleton1994inductive]. Traditional methods such as Path Ranking [lao2010relational] and Markov Logic Networks [richardson2006markov] fail to learn the structure (i.e. logic rules in discrete space) and the parameters (i.e. the continuous confidence associated with each rule) simultaneously. The Neural LP method [yang2017differentiable], a fully end-to-end differentiable neural system, was the first to combine learning rule structures with learning appropriate scores. Unfortunately, Neural LP and current Neural LP-based methods do not address multi-target scenarios, where multiple objects may be connected to one subject by the same relation. Meanwhile, to the best of our knowledge, although there are metrics for evaluating models on KG completion tasks, there is still no established way to assess the quality of mined logical rules.

In this paper, with reference to previous research in graph theory, we first propose two novel indicators, saturation and bifurcation, that help with evaluation in KG reasoning tasks. Saturation helps indirectly analyse the interpretability of learned rules, while bifurcation serves as a supplement to traditional metrics. Then we present Multi-target Probabilistic Logic Reasoning (MPLR): an extension to the Neural LP framework that allows for reasoning in multi-target cases. Our approach reformulates the equations and improves on entity representation and model optimization, which enables it to learn over more facts in the KG.

We apply the indicators to several knowledge graph benchmarks for better understanding of their data structure. Further, we evaluate our model on these datasets and experimentally show that our model outperforms state-of-the-art methods for KG reasoning. In addition, MPLR is able to generate high-quality logic rules.

2 Related work

Knowledge graph embedding. Preliminary research on knowledge graph completion focused on learning low-dimensional embeddings for link prediction; we term these embedding-based methods. Representative methods, including TransE [bordes2013translating], TransR [lin2015learning], ComplEx [trouillon2016complex], etc., infer facts by projecting entities and relations onto a semantic space and performing algebraic operations in that space. Specifically, TransE [bordes2013translating] represents factual triplets in a d-dimensional representation space and makes the embeddings follow the translational principle h + r ≈ t. TransR [lin2015learning] tackles the insufficient representation ability of a single space and utilizes separate spaces for entities and relations. ComplEx [trouillon2016complex] is the first to introduce a complex vector space, which can capture both symmetric and antisymmetric relations. In this space, an embedding e can be written as e = Re(e) + i·Im(e), where Re(e) and Im(e) are the real and imaginary parts of e respectively. Unfortunately, these methods work in a black-box way that is uninterpretable to humans.
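As a concrete illustration of the translational principle, the minimal sketch below scores a triplet by the negative distance between h + r and t. The embedding values are made up for illustration, not trained:

```python
import numpy as np

def transe_score(h, r, t):
    """Negative L2 distance ||h + r - t||: higher means more plausible."""
    return -np.linalg.norm(h + r - t)

# Toy 3-dimensional embeddings (illustrative values, not trained ones).
beijing = np.array([0.1, 0.2, 0.3])
capital_of = np.array([0.5, -0.1, 0.0])
china = np.array([0.6, 0.1, 0.3])

# The true triplet (Beijing, capitalOf, China) scores higher than a corrupted one.
assert transe_score(beijing, capital_of, china) > transe_score(china, capital_of, beijing)
```

In a real system the embeddings would be trained with a margin-based ranking loss over corrupted triplets; the sketch only shows the scoring geometry.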

Relation path reasoning. Learning relational rules has previously been studied in the field of inductive logic programming (ILP) [muggleton1994inductive]. These methods often learn a probability as a confidence score for each rule between query entities and answer entities. Among these studies, the Path-Ranking Algorithm (PRA) [lao2010relational] enumerates relational paths under a combination of path constraints and performs maximum-likelihood classification. Markov Logic Networks [richardson2006markov] and Probabilistic Personalized Page Rank (ProPPR) [wang2013programming] equip logic rules with probabilities, so that one can leverage path information over the graph structure. Although ILP benefits from the interpretability of mined rules, these methods typically require both positive and negative examples and suffer from a potentially large version space, which is a critical shortcoming since most modern KGs are huge and contain only positive instances.

Neural logic programming. Extending the idea by simultaneously learning logic rules and their weights in a gradient-based way, Neural LP [yang2017differentiable] is the first end-to-end differentiable approach to combine the continuous parameters and the discrete structure of rules. Several recent methods [sadeghian2019drum, yang2019learn, wang2019differentiable] have improved on Neural LP [yang2017differentiable] in different manners. DRUM [sadeghian2019drum] introduces tensor approximation for optimization, and Neural-Num-LP [wang2019differentiable] addresses the limitation in mining numerical features like age and weight. However, existing Neural LP-based methods need a large proportion of the triplets just for constructing the graph structure and thus cannot make full use of the training data. Moreover, in contrast to our work, these models fail in the situation of multi-target inference.

Related research in graph theory. Although ILP shortens the gap between reasoning on KGs and interpretability, there is still a lack of a way to indicate the quality of learned rules. Inspired by accumulating studies on k-saturated graphs [hajnal1965theorem] and minimum saturated graphs [faudree2011survey], we propose the saturation concept as a complement for measuring the quality of rules. Besides, we also define the bifurcation indicator to supplement current metrics for evaluating reasoning models from the perspective of graph structure.

3 Preliminaries and two novel indicators

3.1 Knowledge graph reasoning

A knowledge graph can be modeled as a collection of factual triples K = {(h, q, t)} ⊆ E × Q × E, with E and Q representing the set of entities and the set of predicates (a.k.a. binary relations) in the knowledge graph respectively, and each triple taking the form (head, predicate, tail). The subgraph relating to a particular predicate q is described as the subset of K containing all triples with q as the predicate: K_q = {(h, q', t) ∈ K : q' = q}.

Probabilistic logic reasoning is to learn a confidence score α ∈ [0, 1] for a first-order logical rule of the form

q(X, Y) ← q_1(X, z_1) ∧ q_2(z_1, z_2) ∧ … ∧ q_l(z_{l-1}, Y),

written p = (q_1, …, q_l) for short, with q, q_i ∈ Q, where p is called a rule pattern and l its length. For example, the rule uncleOf(X, Y) ← brotherOf(X, Z) ∧ fatherOf(Z, Y) intuitively states that if X is the brother of Z and Z is the father of Y, then we can conclude that X is the uncle of Y. All rule patterns of length l (l ≥ 1) can be formally defined as a set of predicate tuples P_l = {(q_1, …, q_l) : q_i ∈ Q}, and the set of patterns no longer than L is denoted P_{≤L}. A rule path is an instance of a pattern p realized by a particular sequence of entities; different entity sequences yield different paths of the same pattern.

Multi-target reasoning. Traditionally, the logic reasoning problem is that of learning first-order logical Horn clauses from a KG [sadeghian2019drum]. However, in multi-target scenarios, there may be several tail entities t_i satisfying the predicate q given only one head entity h, such that (h, q, t_i) ∈ K. In other words, X may have more than one nephew (niece) Y_i, and we may finally infer that X is the uncle of each Y_i by following various rule patterns.

Therefore, the way we regard the knowledge graph reasoning task differs from that of Neural LP [yang2017differentiable]. The task here is considered to be composed of a query q, a head entity h that the query is about, and a set of tail entities T = {t_i} that are the answers to the query, such that (h, q, t_i) ∈ K. Finally, we want to find the most probable relational pattern for reasoning out the predicate q by inducing over the whole query (q, h, T). Thus, given a maximum length L, we assign a single confidence score (i.e. probability) α_p to the set of rule paths adhering to the same pattern p ∈ P_{≤L} that connect h with the tails in T.¹

During inference, given an entity h, the unified score of a tail t can be computed by adding up the confidence scores of all rule paths that infer t, and the model will produce a ranked list of entities where a higher score implies a higher ranking.

¹ In the Neural LP framework, the tail t is viewed as the question of the query, with a single head h as the answer. A confidence is then assigned to one particular path.

3.2 Graph structure

[Directed Labeled Multigraph] A directed labeled multigraph is a tuple G = (V, E), where V denotes the set of vertices and E is a multiset of directed, labeled vertex pairs (i.e. edges) in the graph G.

Because of its graph structure, a knowledge graph can be regarded as a directed labeled multigraph [stokman1988structuring]. In this paper, "graph" is used to refer to "directed labeled multigraph" for the sake of simplicity. G(K) is the corresponding graph structure of a KG K. |V| and |E| stand for the number of vertices and the number of edges respectively of a graph G. Particularly in a KG, the vertices are exactly the entities, and the total number of triplets equals the number of edges |E|.

In a graph G, the degree of a vertex is the number of edges incident to it. For directed graphs, the in-degree and out-degree of a vertex v are usually distinguished, which are defined as

in-degree(v) = |{(u, v) ∈ E}|,  out-degree(v) = |{(v, u) ∈ E}|.

Furthermore, in KGs the bw-degree(q) and fw-degree(q) of a given vertex v can be computed via the following equations:

bw-degree_v(q) = |{(h, q, v) ∈ K}|,  fw-degree_v(q) = |{(v, q, t) ∈ K}|,

which are exactly the numbers of answers to the query q asked about v when performing backward and forward reasoning respectively.
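These degree notions are straightforward to compute from a list of triples. The sketch below uses a hypothetical toy KG (entity and predicate names are illustrative):

```python
# Toy facts as (head, predicate, tail) triples; names are illustrative.
triples = [
    ("anna", "motherOf", "bob"),
    ("anna", "motherOf", "carol"),
    ("dave", "fatherOf", "bob"),
]

def fw_degree(head, q, kg):
    """Number of answers t to the query (head, q, ?) -- forward reasoning."""
    return sum(1 for h, p, t in kg if h == head and p == q)

def bw_degree(tail, q, kg):
    """Number of answers h to the query (?, q, tail) -- backward reasoning."""
    return sum(1 for h, p, t in kg if t == tail and p == q)

assert fw_degree("anna", "motherOf", triples) == 2
assert bw_degree("bob", "motherOf", triples) == 1
```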

3.3 Novel indicators for reasoning performance

In the following, we propose two novel computational methods to help evaluate a KG reasoning model. More specifically, we analyse the reasoning complexity from the inherent attributes of the graph structure corresponding to a KG .

3.3.1 Saturation

[Macro Reasoning Saturation] Given a query q and the maximum length L of a rule pattern p, the macro reasoning saturation of p in relation to predicate q, written MaS_L(p, q), is the percentage of triples (h, q, t) in the subgraph K_q such that at least one rule path following p connects h to t.

We compute the macro reasoning saturation using the following equation:

MaS_L(p, q) = |{(h, q, t) ∈ K_q : some path of pattern p connects h to t}| / |K_q|,

with |K_q| being the number of edges (i.e. the number of triples) in K_q. We can reasonably say that the larger MaS_L(p, q) grows, the more likely p serves as a proper inference pattern for the query q. When MaS_L(p, q) equals 1, every factual triple in K_q can be reasoned out through at least one rule path following the pattern p.
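For length-2 patterns, the macro saturation can be computed directly by checking, for every triple of the target predicate, whether some intermediate entity realizes the pattern. A minimal sketch on a hypothetical toy KG:

```python
# Toy KG stored per predicate as sets of (head, tail) pairs; names illustrative.
kg = {
    "motherOf": {("a", "b")},
    "sonOf": {("b", "c")},
    "wifeOf": {("a", "c"), ("x", "y")},
}

def follows_pattern(h, t, pattern):
    """True if some intermediate entity realizes the length-2 pattern h -> z -> t."""
    q1, q2 = pattern
    return any((h, z) in kg[q1] and (z, t) in kg[q2]
               for z in {b for _, b in kg[q1]})

def macro_saturation(q, pattern):
    """Fraction of triples in the subgraph of predicate q covered by `pattern`."""
    edges = kg[q]
    hits = sum(1 for h, t in edges if follows_pattern(h, t, pattern))
    return hits / len(edges)

# wifeOf(a, c) is covered by motherOf(a, b) and sonOf(b, c); wifeOf(x, y) is not.
assert macro_saturation("wifeOf", ("motherOf", "sonOf")) == 0.5
```

For longer patterns and large KGs one would replace the nested membership tests with sparse matrix products, as in the TensorLog view used later in the paper.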

[Micro Reasoning Saturation] Given the maximum length L of a rule pattern, we define the micro reasoning saturation of a pattern p as follows. First, for a specific triple (h, q, t), MiS_L(p, h, q, t) is the number of paths from h to t following p as a percentage of all paths of length at most L from h to t:

MiS_L(p, h, q, t) = |{paths of pattern p from h to t}| / |{paths of length ≤ L from h to t}|.

Then, we average over all triples in K_q and get the micro reasoning saturation of pattern p for query q:

MiS_L(p, q) = (1 / |K_q|) Σ_{(h,q,t) ∈ K_q} MiS_L(p, h, q, t).

The macro and micro saturations assess how easy it is to infer q following the pattern p from a macro and a micro perspective respectively. The higher the two indicators are, the easier it is to obtain the inference that p implies q. In order to obtain an overall result, we define the comprehensive reasoning saturation by combining the two indicators through multiplication:

CoS_L(p, q) = MaS_L(p, q) · MiS_L(p, q).
3.3.2 Bifurcation

In this section, we give the definition of bifurcation.

[Bifurcation] Given a query q and a threshold n, the forward bifurcation Bf_n(q) is the proportion of head entities h with fw-degree_h(q) ≥ n among all head entities in K_q. Likewise, the backward bifurcation Bb_n(q) is defined on the tail entities of K_q with bw-degree_t(q) ≥ n.

Bifurcations can be computed in both the forward and backward reasoning directions and are formulated as follows:

Bf_n(q) = |{h : h is a head entity in K_q, fw-degree_h(q) ≥ n}| / |{h : h is a head entity in K_q}|,
Bb_n(q) = |{t : t is a tail entity in K_q, bw-degree_t(q) ≥ n}| / |{t : t is a tail entity in K_q}|.

Bf_n(q) and Bb_n(q) indicate the problem scale when performing forward and backward reasoning in case there are multiple targets. As shown in Fig. 1, for the query daughterOf there are three head entities and two tail entities. Hence the backward bifurcation with n = 2 is 1/2, for one of the two parents has two daughters while the other has only one, meaning that half of the fathers (mothers) have at least two daughters. Similarly, the forward bifurcation with n = 2 is 0, because no one has more than one parent in this KG.
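Forward bifurcation reduces to counting, per head entity, how many tails a predicate reaches. A small sketch with made-up facts:

```python
from collections import Counter

# Toy facts; entity names are illustrative.
triples = [
    ("bob", "uncleOf", "eve"),
    ("bob", "uncleOf", "finn"),
    ("carl", "uncleOf", "gail"),
]

def forward_bifurcation(q, n, kg):
    """Proportion of head entities of predicate q with at least n tail answers."""
    fw = Counter(h for h, p, _ in kg if p == q)
    return sum(1 for c in fw.values() if c >= n) / len(fw)

# bob has two nephews/nieces while carl has one, so half the heads branch at n = 2.
assert forward_bifurcation("uncleOf", 2, triples) == 0.5
```

The backward variant is symmetric: count per tail entity instead of per head.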

4 A novel model for multi-target learning of logical rules for KGR

4.1 Neural LP for logic reasoning

4.1.1 TensorLog

Since Neural LP [yang2017differentiable] is based on the work of TensorLog [cohen2016tensorlog, kathryn2018tensorlog], we first introduce TensorLog, which connects inference using logic rules with sparse matrix multiplication. In a KG involving a set of entities E and a set of predicates Q, the factual triplets with respect to a predicate q are stored in a binary matrix M_q ∈ {0, 1}^{|E|×|E|}. This adjacency matrix M_q is called a TensorLog operator: (e_i, q, e_j) is in the KG if and only if the (i, j)-th entry of M_q is 1. Let v_h be the one-hot encoded vector of entity h. Then v_h^T M_{q_1} … M_{q_l} is the path features vector [yang2019learn], whose j-th entry counts the number of unique paths following the pattern (q_1, …, q_l) from h to entity e_j [guu2015traversing].

For example, every KG entity in Fig. 1 is encoded into a one-hot vector of length |E|. For every predicate q and every pair of entities (e_i, e_j), the TensorLog operator relevant to q is defined as a matrix M_q with its (i, j)-th element being 1 if (e_i, q, e_j) ∈ K. The rule sisterOf(X, Z) ∧ sonOf(Z, Y) → daughterOf(X, Y) can be simulated by performing the sparse matrix multiplication M_sisterOf · M_sonOf. By setting v_X as the one-hot vector of X and multiplying by v_X^T on the left, the row of M_sisterOf identified by X is selected. By then multiplying on the right-hand side by M_sonOf, we obtain, for each entity, the number of unique paths following the pattern from X to it: v_X^T M_sisterOf M_sonOf.
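This path-counting view can be reproduced with dense matrices in a few lines (sparse matrices would be used in practice; the 3-entity KG here is a toy stand-in for Fig. 1):

```python
import numpy as np

# Entities indexed 0..2: X = 0, Z = 1, Y = 2 (a toy stand-in for Fig. 1).
n = 3
M_sister = np.zeros((n, n)); M_sister[0, 1] = 1.0   # sisterOf(X, Z)
M_son = np.zeros((n, n)); M_son[1, 2] = 1.0          # sonOf(Z, Y)

v_x = np.zeros(n); v_x[0] = 1.0                      # one-hot vector of X

# Entry j counts the unique paths following sisterOf then sonOf from X to entity j.
path_counts = v_x @ M_sister @ M_son
assert path_counts.tolist() == [0.0, 0.0, 1.0]       # one path X -> Z -> Y
```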

4.1.2 Neural LP

Neural LP [yang2017differentiable] inherits the idea of TensorLog. Given a query (q, h), after l steps of reasoning the score of the query induced through a rule pattern p of length l is computed as

score_p(q, h) = v_h^T M_{q_1} M_{q_2} … M_{q_l},

where M_{q_k} is the adjacency matrix of the predicate used at the k-th hop.

The operators above are used to learn a score for the query (q, h) by calculating the weighted sum over all possible patterns:

score(q, h) = Σ_p α_p (v_h^T Π_{q_k ∈ p} M_{q_k}),

where p indexes over all potential patterns with maximum length L, α_p is the confidence score associated with the rule p, and (q_1, …, q_l) is the ordered list of predicates appearing in p.

To summarize, we update the score function in Eq. (13) by finding appropriate confidences α_p in

score(q, h) = Σ_{p ∈ P_{≤L}} α_p (v_h^T Π_{q_k ∈ p} M_{q_k}),

and the optimization objective is to maximize the total score assigned to the true answer entities over all training triplets, where the set of confidences {α_p} is to be learned.

Whereas the search space of learnable parameters is exponentially large, i.e. O(|Q|^L), direct optimization of Eq. (16) may fall into the dilemma of over-parameterization. Besides, it is difficult to apply gradient-based optimization, because each variable is bound to a specific rule pattern and enumerating rules is inherently discrete work. To overcome these defects, the parameter of each rule can be reformulated by distributing its confidence over its constituent predicates at each hop, resulting in a differentiable score function:

score(q, h) = v_h^T Π_{k=1}^{L} ( Σ_{q' ∈ Q ∪ {q_0}} a_k^{q'} M_{q'} ),

where L is a hyperparameter denoting the maximum length of patterns, |Q| is the number of predicates in the KG, a_k^{q'} is the attention weight of predicate q' at the k-th hop, and M_{q_0} = I is an identity matrix that enables the model to include all possible rule patterns of length L or smaller [sadeghian2019drum]. The key difference in parameterization between Eq. (15) and Eq. (17) is illustrated in Fig. 2. Fig. 2(a) shows that the rule brotherOf ∧ fatherOf gains more plausibility than workIn ∧ hasStudent for the query uncleOf, so it receives a higher confidence. In the latter parameterization, the score of the rule is obtained by multiplying the weight of brotherOf at the first hop with that of fatherOf at the second hop.
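The hop-wise parameterization can be sketched as follows; random softmax weights stand in for the attention produced by the learned controller, and the adjacency operators are toy matrices rather than a real KG:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_predicates, max_hops = 4, 3, 2

# Toy adjacency operators M_q, plus the identity for "skip this hop".
ops = [rng.integers(0, 2, (n_entities, n_entities)).astype(float)
       for _ in range(n_predicates)] + [np.eye(n_entities)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Per-hop attention over operators; random logits stand in for the controller.
attention = [softmax(rng.standard_normal(len(ops))) for _ in range(max_hops)]

v = np.zeros(n_entities); v[0] = 1.0        # one-hot head entity
for a in attention:                          # one weighted operator per hop
    v = v @ sum(w * M for w, M in zip(a, ops))

# v[j] scores entity j, implicitly summing over all patterns of length <= max_hops
# with the product of per-hop predicate weights as each pattern's confidence.
assert v.shape == (n_entities,) and np.all(v >= 0)
```

Expanding the product of sums recovers one term per pattern, which is exactly why this factorization collapses the exponential rule space into L·(|Q|+1) weights.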

To perform training and prediction in the Neural LP framework, we first construct a KG from a large subset of all triplets. Then, when facing the query (q, h), we remove the edge (h, q, t) from the graph, so that the score of t is free of the influence of passing from the head entity h directly through that edge, which is necessary for the correctness of reasoning.

Figure 2: A KG example to illustrate the parameterization difference between Eqs (15) and (17). (a) Assigning a confidence score to each rule. (b) Distributing weights into predicates at different hops.

4.2 MPLR model

In this section, we propose our MPLR model as an improvement to Neural LP [yang2017differentiable].

As aforementioned in Section 4.1.2, in a multi-target query all edges from the head entity h to the tails in T should be removed from the graph, and thus more edges are removed in each batch of queries, which would break the graph structure to a considerable extent. For example, in Fig. 1, suppose the query is (daughterOf, X): two edges will then be missing when we train the model, which makes it difficult to infer the rule sisterOf(X, Z) ∧ sisterOf(Z, Y) → daughterOf(X, Y). Therefore, we update Eq. (17) to address the limitation of Neural LP in the multi-target scenario: the bonus to the score of t from the edge (h, q, t) is avoided without removing the edge. For each sub-query (q, h, t) we have


where a_k^q is the attention score of predicate q at the k-th hop, and E is the matrix with only its (h, t)-th element being 1 and all other elements 0. v_h^T E is then the vector with all elements reduced to 0 except its t-th value; likewise, multiplying by E on the right keeps only the t-th value of the score vector.

Eq. (21) eliminates the redundant gain of the t-th value in the score vector passed from h to t directly through the edge (h, q, t), but retains this edge's ability to affect nodes other than t. That is to say, in the query (h, sisterOf, t), the score of entity t should not include the contribution from the edge (h, sisterOf, t) itself, but the score of t can still be increased by the path sisterOf(h, z) ∧ sisterOf(z, t).

In addition, considering that there might be multiple tail entities in relation to the given head entity in a query, and that, as shown in Eq. (2), only one score should be allocated to a set of rule paths, we modify the representation of the tail vector into a multi-hot one. Given a query q, a head entity h and a set of tails T, the target vector is also an |E|-dimensional vector, but with its i-th entry being 1 for every t_i ∈ T. For example, in the KG displayed in Fig. 1, given the query auntOf and a head entity who is the aunt of two other entities, the target vector in this query has exactly those two entries set to 1.
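Constructing the multi-hot target is mechanical; a sketch with hypothetical entity names:

```python
import numpy as np

# Hypothetical entity index for a 4-entity KG.
entity_index = {"alice": 0, "bob": 1, "carol": 2, "dan": 3}

def multi_hot(tails, index):
    """Multi-hot target over all entities: 1 at each answer tail."""
    v = np.zeros(len(index))
    for t in tails:
        v[index[t]] = 1.0
    return v

# One training example now covers every answer to (auntOf, head) at once,
# instead of one example per individual triple.
target = multi_hot(["bob", "carol"], entity_index)
assert target.tolist() == [0.0, 1.0, 1.0, 0.0]
```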

Finally, the confidence scores are learned over a bidirectional LSTM [hochreiter1997long] followed by attention, using Eqs. (22) and (23), to capture the temporal dependency among consecutive reasoning steps. The input to the network in Eq. (22) at each step is the embedding of the query q.


where h_k and h'_k are the hidden states of the forward and backward path LSTMs, with the subscripts denoting their time step. The attention vector a_k is obtained by performing a linear transformation over the concatenated forward and backward hidden states, followed by a softmax operator:


Figure 3: MPLR model overview with rank .

4.3 Optimization

Loss construction. In general, we treat this task as multi-label classification to handle multiple outcomes. For each query in the KG, we first split the objective function Eq. (21) into two parts, the target vector y and the prediction vector ŷ, and then construct the loss function using the Bernoulli negative log-likelihood with logits:

L = −(1/|E|) Σ_i [ y_i log σ(ŷ_i) + (1 − y_i) log(1 − σ(ŷ_i)) ],

where i indexes the elements of the vectors y and ŷ, and σ is the sigmoid function. To ensure numerical stability, the above equation can be reformulated into the equivalent Eq. (25) through the log-sum-exp method.
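A generic log-sum-exp reformulation of BCE-with-logits (not necessarily the paper's exact Eq. (25)) is max(x, 0) − x·y + log(1 + e^{−|x|}), which avoids evaluating the sigmoid on extreme logits; a sketch:

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Bernoulli negative log-likelihood on raw logits, in the stable form
    max(x, 0) - x*y + log(1 + exp(-|x|)), averaged over entries."""
    x, y = np.asarray(logits, float), np.asarray(targets, float)
    return float(np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))))

# A naive sigmoid-then-log pipeline overflows for large |logits|; this form does not.
assert np.isfinite(bce_with_logits([1000.0, -1000.0], [1.0, 0.0]))
```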


Low-rank approximation. It can be shown that the final confidences obtained by expanding Eq. (21) are a rank-one estimation of the confidence value tensor [sadeghian2019drum], and low-rank approximation is a popular technique for tensor approximation. Hence we follow the work of [sadeghian2019drum] and rewrite Eq. (21) using a rank-r approximation, as shown in Eq. (26).


More concretely, we update Eqs. (22) and (23), as shown in Eqs. (27) and (28), by deploying r BiLSTMs with the same network structure, each of which can extract features along different dimensions.


where the superscripts of the hidden states identify their bidirectional LSTM.

An overview of the model is shown in Fig. 3.

5 Experiment

5.1 Experiment setting

We conduct experiments on a knowledge graph completion task and evaluate our model against state-of-the-art baselines regarding the following aspects: (1) traditional evaluation metrics (e.g. Mean Reciprocal Rank); (2) the novel reasoning indicators proposed in Section 3.3; (3) interpretability and reasoning plausibility. After detailed explanations, the deficiencies of the existing Neural LP-based models are also discussed.

5.1.1 Datasets

We adopt five datasets for evaluation, which are described as follows:

  • FB15K-237 [toutanova2015observed], a more challenging version of FB15K [bordes2013translating] based on Freebase [bollacker2008freebase], a growing knowledge graph of general facts.

  • WN18 [dettmers2018convolutional], a subset of the knowledge graph WordNet [miller1995wordnet, miller1998wordnet], a widely used lexical database.

  • Unified Medical Language System (UMLS) [kok2007statistical], from biomedicine, where the entities are biomedical concepts (e.g. organism, virus) and the relations include affects, analyzes, etc.

  • Kinship [kok2007statistical], containing kinship relationships among members of a Central Australian native tribe.

  • Family [kok2007statistical], containing individuals from multiple families that are biologically related.

Statistics about each dataset are shown in Table 1. All datasets are divided into three files: train, valid and test. The train file is composed of query examples (h, q, t). The valid and test files both contain such queries, with the former used for early stopping and the latter for testing. Unlike the case of learning embeddings, our method does not necessarily require the entities in train, valid and test to overlap. As described in Section 4.2, our model is capable of using all triplets (which serve as the facts file in Neural LP [yang2017differentiable]) to construct the KG, including those from train, valid and test.

5.1.2 Comparison of algorithms

In experiment, the performance of our model is compared with that of the following algorithms:

  • Neural LP-based methods. Since our model is based on Neural LP [yang2017differentiable], we choose Neural LP and a Neural LP-based method DRUM [sadeghian2019drum].

  • Embedding-based methods. We choose several embedding-based algorithms, including TransE [bordes2013translating], DistMult [yang2014embedding], TuckER [balavzevic2019tucker], RotatE [sun2019rotate] and ConvE [dettmers2018convolutional].

  • Other rule learning methods. We also consider a probabilistic model called RNNLogic [qu2020rnnlogic]. (There are four variants of RNNLogic; we use RNNLogic without embedding for comparison.)

5.1.3 Model configuration

Our model is implemented using PyTorch [paszke2019pytorch] and the code will be made publicly available. We use the same hyperparameter suite in the experiments on all datasets. The hidden-state dimension of the BiLSTM(s) is 128. The query embedding has dimension 128 and is randomly initialized. As the optimization algorithm, we use mini-batch Adam [kingma2014adam] with batch size 128 and the learning rate initially set to 0.001. We also observe that the whole model tends to be more trainable if we normalize the prediction vector to unit length at the final step.

Dataset # Relation # Entity # Triplets # Train # Validation # Test
FB15K-237 237 14541 310116 272115 17535 20466
WN18 18 40943 151442 141442 5000 5000
Family 12 3007 28356 23483 2038 2835
Kinship 25 104 10686 8487 1099 1100
UMLS 46 135 6529 5327 569 633
Table 1: Statistics of datasets.

5.2 Experiment on knowledge graph completion

We conduct experiments on the knowledge graph completion task as described in [bordes2013translating] and compare the results with several state-of-the-art models. When training the model, the query q and head h come from some missing training triplet, and the goal is to complete the question and find the most probable answer tails. For example, if daughterOf(X, Y) is missing from the knowledge graph³, the goal is to reason over the existing graph structure and retrieve Y when presented with the query daughterOf and head X.

³ To be more accurate, our model simulates this situation in which the edges relating to the input query are removed, as already explained in Section 4.2.

During evaluation, for each test triplet (h, q, t), we build one query (q, h) with answer t.⁴ Remarkably, we adopt the same valid and test data as the compared algorithms, and we manually remove the edge (h, q, t) from the KG for the correctness of the reasoning results. Additionally, when computing the actual rank of t, the head entity h cannot be an answer to the query, so we manually remove it. For each query, a score is computed for each entity, along with the rank of the correct answer. Over the ranks computed from all queries, we report the Mean Reciprocal Rank (MRR) and Hit@k. MRR averages the reciprocal ranks of the answer entities, and Hit@k computes the percentage of desired entities ranked among the top k.

⁴ We notice that Neural LP [yang2017differentiable], DRUM [sadeghian2019drum], etc., add another reversed query with answer h for each triplet. We only use the query (q, h) for fair comparison.
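Both metrics are simple functions of the rank of each correct answer; a minimal sketch:

```python
def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hit@k from the ranks of the correct answers."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# Ranks of the true tail entity across four hypothetical test queries.
mrr, hits1 = mrr_and_hits([1, 2, 4, 1], k=1)
assert abs(mrr - 0.6875) < 1e-12 and hits1 == 0.5
```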

5.2.1 On KG reasoning assessment

We calculate the numerical features of KG datasets using the indicators proposed in Section 3.3, which helps to better comprehend the reasoning task over these knowledge graphs. Above all, considering that learning collections of relational rules is a type of statistical relational learning [koller2007introduction], these statistical properties provide a complement to currently popular evaluation metrics, such as MRR and Hit@.

Saturation. The macro, micro and comprehensive saturations measure, from different angles, the probability of a rule pattern occurring in a certain relational subgraph. However, the computation can be exceedingly costly: the cost grows with the total number of rules of length L and with how time-consuming it is to count the unique paths following a pattern p from h to t. Thus, when encountering a large dataset, it is preferable to first randomly sample a subgraph of the existing KG and then compute the saturations. We select some predicates and their related rules with the most prominent saturations from the Family dataset and show them in Table 2. We also present the statistics about UMLS in the appendix.

We use the rule wifeOf(X, Y) ← motherOf(X, Z) ∧ sonOf(Z, Y) from Table 2 as an example, where the body of the rule is the pattern p and the head is the predicate q. This rule can be read as: if X is the mother of Z and Z is a son of Y, then we can infer that X is the wife of Y. A macro saturation of .47 means that 47% of the factual triples whose predicate is wifeOf cover the reasoning rule motherOf ∧ sonOf. In Table 2, the macro saturation roughly tells the percentage of a potential rule pattern in the subgraph K_q, whereas the micro saturation contains more detailed information focused on individual triples: a micro saturation of .35 represents that on average, among all rule paths no longer than L that could reason out the predicate daughterOf, more than one third follow the pattern motherOf ∧ sonOf, which is a fairly high proportion. Finally, we heuristically propose the comprehensive saturation as a global metric that combines these two factors; it may individually serve as a score of a rule, where a higher score indicates more pronounced statistical features during inference.

Rule Predicate
.47 .35 .17
.36 .24 .09
.47 .35 .17
.36 .24 .09
1.00 .34 .34
.70 .27 .19
.62 .22 .14
.68 .25 .17
.61 .20 .12
.46 .15 .07
.46 .14 .06
.86 .14 .12
.77 .13 .10
.81 .13 .10
1.00 .08 .08
.68 .11 .08
.85 .23 .20
.82 .22 .18
.78 .22 .17
.74 .18 .13
.62 .09 .06
.38 .05 .02
.86 .25 .21
.79 .22 .17
.79 .21 .16
.72 .17 .12
.64 .10 .06
.36 .05 .02
Table 2: Saturations of the Family dataset (without sampling). The rule length is fixed to 2. The three numeric columns give the macro, micro and comprehensive saturations. The results relating to a predicate are sorted by comprehensive saturation in descending order.

Apart from this, we want to share some more heuristic observations on saturations. First, we can say that the rule wifeOf ∧ fatherOf is macro-saturated with regard to the predicate motherOf, because its macro saturation equals 1 (Table 2). As the saturation of a rule increases, the rule becomes more saturated in comparison to other rules. Second, the rules with high saturation shown in Table 2 are readily comprehensible to humans as reasoning patterns, so saturation may be a valuable complementary indicator for evaluating the performance and interpretability of a KG reasoning model. In the end, during the computation of saturations we are seemingly performing a type of frequent pattern mining [han2004mining, agrawal1994fast, dur2000three], which may be fertile ground for future research in the area of KG reasoning.

Dataset Predicate n=1 n=2 n=3 n=4 n=5 n=6
Family husbandOf 14 2 1 0 0 0
wifeOf 8 1 0 0 0 0
sonOf 85 0 0 0 0 0
daughterOf 84 0 0 0 0 0
brotherOf 77 57 42 30 23 18
uncleOf 84 74 64 52 44 39
UMLS issueIn 99 0 0 0 0 0
precede 93 93 93 93 50 0
prevents 100 100 100 100 100 40
associatedWith 78 64 61 58 56 53
WN-18 hasPart 38 21 15 11 9 7
memberOfDomainUsage 84 52 48 48 44 44
memberOfDomainTopic 68 58 48 40 35 31
memberOfDomainRegion 53 35 29 24 21 19
Table 3: Bifurcation (%) of Family, UMLS and WN-18 for n = 1, …, 6. The first column lists the selected datasets and the second column the predicates.

Bifurcation. We then report observations about bifurcation, which is defined with respect to a particular predicate. We choose a small group of predicates from several datasets, as illustrated in Table 3. Since the models in this work are evaluated over queries of the form (h, r, ?), we solely calculate and show the forward bifurcation (i.e., from head entities to tail entities). More statistics about bifurcation are provided in the appendix.

Table 3 shows the intuitive diversity of bifurcation across predicates within each dataset in the multi-target case. First, consider the predicate uncleOf in the Family dataset, whose bifurcation at n = 1 is 84%, meaning that most uncles have at least two nieces (or nephews). The difference between two consecutive numbers in the same row is also informative, e.g., the bifurcation of the predicate daughterOf at n = 1 and n = 2 is 84% and 0 respectively, which means that 84% of the daughters in the Family dataset have exactly two parents, and none of them have more than two.
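Under this reading, the forward bifurcation β_n is the share of head entities of a predicate with more than n distinct tails. A minimal sketch (the function name and this formalization of β_n are our assumptions):

```python
from collections import defaultdict

def forward_bifurcation(facts, predicate, max_n=3):
    """beta_n for n = 1..max_n: the fraction of head entities of `predicate`
    that reach MORE THAN n distinct tail entities."""
    tails = defaultdict(set)
    for (h, r, t) in facts:
        if r == predicate:
            tails[h].add(t)
    heads = list(tails)
    return [sum(1 for h in heads if len(tails[h]) > n) / len(heads)
            for n in range(1, max_n + 1)]

# Toy KG: "a" has two children, "d" has one.
facts = [("a", "parentOf", "b"), ("a", "parentOf", "c"), ("d", "parentOf", "e")]
print(forward_bifurcation(facts, "parentOf"))  # [0.5, 0.0, 0.0]
```

Half of the parent entities reach more than one child (β_1 = .5), and none reach more than two (β_2 = β_3 = 0), mirroring how the daughterOf row in Table 3 is read above.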

5.2.2 Results

We evaluate our model in comparison with several baselines (the URLs we use to implement these models are listed in App. B) on the KG completion benchmarks described in Section 5.1.1 and Section 5.1.2. Since Neural LP [yang2017differentiable], DRUM [sadeghian2019drum] and our model all follow a similar framework, we use the same hyperparameter setting when evaluating these models, where the maximum rule length is 2 and the rank of the estimator is kept the same for all three models. Part of the results are summarized in Table 4, and more are available in the appendix.

It is clear that our MPLR achieves state-of-the-art results on all metrics for the datasets listed in Table 4 among all methods, with an obvious improvement on almost all datasets. In addition, our model outperforms Neural LP and DRUM on the two real-world datasets shown in the appendix. We conjecture that this is due to the optimization that enables our model to utilize more training data at a time, as well as its advantage in multi-target cases.

Notably, it is not entirely fair to compare MPLR with embedding-based methods solely on the aforementioned metrics, because those methods are black boxes that do not provide interpretability, whereas our model has an advantage in this regard. We show some of the rules mined by our model later.

Family Kinship UMLS
MRR Hit@1 Hit@3 MRR Hit@1 Hit@3 MRR Hit@1 Hit@3
TransE .14 4 16 .10 2 8 .13 1 10
DistMult .30 11 35 .20 5 18 .09 .6 3
ComplEx .35 15 42 .22 7 23 .13 9 2
TuckER .33 13 39 .18 3 15 .11 2 6
RotatE .41 22 48 .26 9 26 .15 4 12
ConvE .20 8 22 .17 2 11 .12 1 9
RNNLogic .27 15 32 .28 11 31 .21 8 21
Neural LP .50 34 57 .25 9 26 .26 11 27
DRUM .52 35 60 .29 12 30 .28 14 30
MPLR .64 54 68 .31 16 33 .36 25 36
Table 4: Knowledge graph completion performance comparison. Hit@k is in %.
husbandOf wifeOf sonOf
MRR Hit@1 Hit@3 MRR Hit@1 Hit@3 MRR Hit@1 Hit@3
Neural LP [yang2017differentiable] .49 21 77 .48 25 69 .76 69 80
DRUM [sadeghian2019drum] .46 15 77 .54 27 69 .79 69 88
MPLR .78 75 80 .72 70 73 .79 68 89
daughterOf brotherOf uncleOf
MRR Hit@1 Hit@3 MRR Hit@1 Hit@3 MRR Hit@1 Hit@3
Neural LP [yang2017differentiable] .70 63 71 .51 30 62 .27 10 28
DRUM [sadeghian2019drum] .75 65 82 .54 35 64 .42 26 47
MPLR .75 65 80 .67 55 71 .45 32 46
Table 5: Results of reasoning on the Family dataset for specific predicates. Hit@k is in %.

To demonstrate in more detail the models' ability to induce logical rules, we compare our model against the other two neural logic programming models on specific predicates. We choose the Family dataset for better visual clarity; the results are shown in Table 5.

Compared with Neural LP and DRUM, our MPLR achieves a significant improvement on almost all metrics for these predicates. Moreover, during the experiments we find that evaluating a reasoning model simply by Hit@k lacks completeness and precision: the metric Hit@k depends not only on the performance of the model, but also on the bifurcation indicator. As analysed in Table 3, the bifurcations of the predicate daughterOf show that 16% of the daughters have only one parent and 84% of them have two. Thus, the Hit@1 of any model should be at most 57% on the whole KG. In fact, assume all of the daughters have exactly two parents, i.e., the bifurcation is 100% at n = 1 and 0 at n = 2; then at most one parent of each daughter can rank first while the other cannot, so the maximum Hit@1 is 50%.

To explain the results shown in Table 5, we further compute the bifurcation on the test data of Family, as shown in Table 6. The maximum Hit@1 of daughterOf on the test data should be 97.5% (95% + 5%/2, since 5% of the daughters in the test data have two parents). Meanwhile, a higher bifurcation empirically means that it is harder to achieve a higher Hit@k for small k. The same procedure can easily be adapted to obtain the upper bound of Hit@k for a model on any knowledge graph completion task.
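The 97.5% bound generalizes: an instance with k correct tails can contribute at most 1/k to Hit@1, since only one of the k answers can rank first. A short sketch (the helper name is ours):

```python
def hit1_upper_bound(answer_counts):
    """Upper bound on Hit@1 given, for each (head, predicate) instance,
    the number of correct tail entities: with k correct answers, at most
    one can rank first, so the instance contributes at most 1/k."""
    return sum(1.0 / k for k in answer_counts) / len(answer_counts)

# daughterOf on the Family test split: 5% of daughters have two parents,
# the rest have one (Table 6); with, say, 20 instances:
print(hit1_upper_bound([1] * 19 + [2]))  # 0.975
```

The same call with the full-KG daughterOf distribution (16% one parent, 84% two) reproduces the whole-KG bound discussed above.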

Predicate n=1 n=2 n=3
husbandOf 2 0 0
wifeOf 1 0 0
sonOf 3 0 0
daughterOf 5 0 0
brotherOf 23 4 1
uncleOf 40 11 2
Table 6: Bifurcation (%) on the test data of Family.
Rule Predicate
Table 7: Top rules learned by MPLR on the Family dataset.

5.3 Experiment on interpretability of mined rules

The Neural LP framework successfully combines structure learning and parameter learning: it not only induces multiple logical rules to capture the complex structure in the knowledge graph, but also learns to distribute confidences over rules [yang2017differentiable]. In addition to the improvement on the KG completion task in the multi-target setting, our model inherits the interpretability of Neural LP and, further, becomes more interpretable to humans. Throughout this section we use the Family dataset for visualization purposes as it is more tangible; other datasets such as UMLS produce similar outputs.

We sort the rules generated by MPLR according to their assigned confidences and show the top rules in Table 7. Admittedly, because of the constraints on expressiveness, some logically incorrect rules are mined by the model; these are highlighted in red in the table. We explain this in the next section. For more learned logic rules, please refer to App. E.

We can see that the rules are of high quality and good diversity, although there are a few inappropriate ones. More importantly, the mined rules shown in Table 7 are in close agreement with the highly saturated rules in Table 2, which reflects the power of reasoning saturation as an indicator and, on the other hand, illustrates the strong interpretability of MPLR.

5.4 Discussion on Neural LP framework

Despite the fact that Neural LP is an end-to-end gradient-based KG reasoning framework that bridges the gap between traditional KG reasoning models (e.g., embedding methods) and interpretability, through close observation of the datasets and analysis of the formulas, we discover that some restrictions remain for Neural LP-based algorithms:

  1. As proved in [sadeghian2019drum], the current framework inevitably mines incorrect rules with high confidences, i.e., if several rules share one or more predicates, their confidences become mutually coupled. Intuitively, this is because Eq. (17) distributes the score of a rule over the predicates that constitute the rule at different hops. For instance, brotherOf ∧ sonOf and brotherOf ∧ sisterOf share brotherOf at the first hop. However, when our query is sonOf, if brotherOf wins a high confidence at the first hop, the score of the second rule may not be low either, which is clearly an incorrect result. This reduces the interpretability of the output rules.

  2. Present models face a dilemma in which invertible relation pairs and rules of varied lengths mix together. (i) A relation pair is invertible if two triplets (x, r1, y) and (y, r2, x) simultaneously exist in a KG. (ii) The KG example shown in Fig. 4 contains candidate rules of length 2, 3 and 4 for the query brotherOf(X, Y). These two factors jointly may cause invalid induction results if we choose an improper hyperparameter as the maximum length of rules; e.g., if we set the maximum length to 4, the rule path sonOf(X, Z1) ∧ motherOf(Z1, Z2) ∧ sonOf(Z2, Z3) ∧ motherOf(Z3, Y) ⇒ brotherOf(X, Y) is possible but meaningless. This is again essentially due to the distribution of confidences induced by Eq. (17), and it impedes multi-hop reasoning over long rules.

  3. A high ranking of an entity results not solely from a top-scored rule, but also from a number of relatively low-scored rules. As formulated in Section 4.1.1, the result of the vector-matrix multiplication is a scalar representing the number of unique paths, and the final score of the entity is computed by summing up the confidences over all paths. Again, metrics like MRR and Hit@k only assess models in terms of the ranking of the desired entity, rather than the quality of the mined reasoning rules. Thus, models following the Neural LP framework with high MRR and Hit@k may be better suited to tasks like question answering or relation completion [ji2021survey], but this does not necessarily carry over to rule mining.
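Point 3 can be made concrete. Following the formulation in Section 4.1.1, entity scores are obtained by chaining relation adjacency matrices (which counts paths) and summing rule confidences over those counts; the matrices and confidences below are illustrative only:

```python
import numpy as np

# Adjacency matrices over three entities {0, 1, 2}: M[r][i, j] = 1 iff the
# triple (i, r, j) is in the KG.
M = {"motherOf": np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]]),
     "sonOf":    np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])}
v = np.array([1, 0, 0])  # one-hot vector of the query head, entity 0

# (rule body, confidence); the second, low-scored rule matches no path here.
rules = [(("motherOf", "sonOf"), 0.7),
         (("motherOf", "motherOf"), 0.1)]

score = np.zeros(3)
for body, conf in rules:
    path_counts = v
    for rel in body:              # chained products count distinct rule paths
        path_counts = path_counts @ M[rel]
    score += conf * path_counts   # final score sums confidence * path count
print(score)  # entity 2 scores 0.7 via the single motherOf-sonOf path
```

Because the score is a confidence-weighted sum over all matching paths, many weak rules can jointly push an entity to the top even when no single high-quality rule supports it, which is exactly why a high MRR or Hit@k does not guarantee good mined rules.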

Figure 4: An example demonstrating the dilemma existing KG reasoning models face when inducing among rules of different lengths. brotherOf(X, Y) is the query in this example, and multiple rules guide a way from X to the answer Y.
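The confidence coupling of restriction 1 can also be sketched numerically. If, as under Eq. (17), a rule's confidence factorizes into per-hop predicate weights (the weights below are illustrative, not learned):

```python
hop1 = {"brotherOf": 0.90, "sonOf": 0.05, "sisterOf": 0.05}  # hop-1 weights
hop2 = {"brotherOf": 0.10, "sonOf": 0.80, "sisterOf": 0.10}  # hop-2 weights

def rule_confidence(body):
    """Confidence of a length-2 rule as a product of per-hop weights."""
    return hop1[body[0]] * hop2[body[1]]

good = rule_confidence(("brotherOf", "sonOf"))     # the intended rule
bad  = rule_confidence(("brotherOf", "sisterOf"))  # shares hop-1 brotherOf
# The spurious rule inherits the large hop-1 factor, so it still outranks
# rules that do not share it, e.g. ("sonOf", "sonOf"):
print(bad > rule_confidence(("sonOf", "sonOf")))  # True
```

Any rule sharing the highly weighted first-hop predicate receives an inflated confidence, regardless of whether the full rule is logically sound.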

6 Conclusions

In this paper, we first propose novel indicators that help in understanding knowledge graph reasoning tasks and serve as a supplement to the existing metrics (e.g., Hit@k) for evaluating models. Saturation measures the possibility of a rule being a plausible inference for a relation, filling a blank in judging the interpretability of mined rules, while bifurcation, which computes the proportion of instances with multiple reasoning destinations, is useful for enhancing the power of MRR and Hit@k. We then address the problem of learning rules from knowledge graphs in multi-target cases, for which a model called MPLR is proposed. MPLR improves the Neural LP framework to allow more queries to be fed in one batch, and thus fits multi-target scenarios. Experimental results have shown that our proposed method improves performance on several knowledge graph reasoning datasets and exhibits strong interpretability, under the evaluation of both traditional metrics and our newly suggested ones. In the future, we would like to overcome the limitation of our current model regarding multi-hop reasoning with much longer rules.


The work of this paper is supported by the National Key R&D Program of China (2020YFB2009502) and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2020098).

Appendix A Extension to Table 2: saturations of UMLS

Rule Predicate σ_M σ_m σ_C
1.00 .05 .05
.91 .04 .04
.91 .04 .04
.71 .05 .04
.75 .04 .03
.83 .31 .26
.83 .13 .11
.33 .16 .05
.96 .61 .52
.86 .32 .31
.87 .29 .25
1.00 .21 .21
.87 .15 .13
.40 .07 .03
Table 8: Saturations of UMLS (without sampling). The rule length is fixed to 2. σ_M, σ_m and σ_C denote the macro, micro and comprehensive saturations, respectively. The results are sorted by the comprehensive saturation in descending order.

Appendix B Model URLs

Appendix C Extension to Table 3: bifurcation of all datasets

Dataset Predicate n=1 n=2 n=3 n=4 n=5 n=6
FB15K-237 position 94 92 92 92 91 89
nominatedFor 93 89 86 81 79 76
awardWinner 74 59 48 41 33 29
award 66 46 33 25 19 14
list 17 11 0 0 0 0
participant 16 6 2 1 1 0
season 1 1 1 94 90 90
artist 67 67 67 67 67 67
WN-18 alsoSee 48 23 11 6 2 1
hypernym 2 0 0 0 0 0
hyponym 56 35 24 18 14 11
partOf 16 4 1 0 0 0
Family auntOf 85 74 65 54 49 44
fatherOf 42 26 17 12 7 5
motherOf 53 35 22 14 8 6
nephewOf 82 70 55 43 34 28
nieceOf 87 73 60 49 41 36
sisterOf 82 65 52 35 29 24
UMLS manifestationOf 100 82 82 82 82 82
evaluationOf 100 100 100 100 100 100
performs 100 100 100 100 100 100
ingredientOf 0 0 0 0 0 0
interactWith 93 87 80 73 67 62
resultOf 60 57 57 57 57 57
Kinship term25 67 33 0 0 0 0
term22 69 55 43 33 31 27
term19 50 50 50 50 25 0
term18 91 76 61 49 42 38
term14 75 67 58 42 8 8
Table 9: Bifurcation (%) of all datasets for n = 1, …, 6. The first column lists the selected datasets and the second column the predicates.