LPRules: Rule Induction in Knowledge Graphs Using Linear Programming

by Sanjeeb Dash, et al.

Knowledge graph (KG) completion is a well-studied problem in AI. Rule-based methods and embedding-based methods form two of the solution techniques. Rule-based methods learn first-order logic rules that capture existing facts in an input graph and then use these rules for reasoning about missing facts. A major drawback of such methods is the lack of scalability to large datasets. In this paper, we present a simple linear programming (LP) model to choose rules from a list of candidate rules and assign weights to them. For smaller KGs, we use simple heuristics to create the candidate list. For larger KGs, we start with a small initial candidate list, and then use standard column generation ideas to add more rules in order to improve the LP model objective value. To foster interpretability and generalizability, we limit the complexity of the set of chosen rules via explicit constraints, and tune the complexity hyperparameter for individual datasets. We show that our method can obtain state-of-the-art results for three out of four widely used KG datasets, while taking significantly less computing time than other popular rule learners including some based on neuro-symbolic methods. The improved scalability of our method allows us to tackle large datasets such as YAGO3-10.




1 Introduction

Knowledge graphs (KGs) are used to represent a collection of known facts via labeled directed edges. Each node of the graph represents an entity, and a labeled directed edge from one node to another indicates that the pair of nodes satisfies a binary relation given by the edge label. A fact in the knowledge graph is a triplet of the form $(a, r, b)$ where $a$ and $b$ are nodes, and $r$ is a binary relation labeling a directed edge from $a$ to $b$ indicating that $r(a, b)$ is true. As an example, consider a KG where the nodes correspond to distinct cities, states, and countries and the relations are one of capital_of, shares_border_with, or part_of. A fact (a, part_of, b) in such a graph corresponds to a directed edge from a to b labeled by part_of implying that a is a part of b. Practical knowledge graphs do not contain all facts that are known to be true and that can be represented using the knowledge graph entities and relations. Some "knowledge acquisition" tasks that are commonly performed to generate or extract implied information from KGs are knowledge graph completion (KGC), triple classification, entity recognition, and relation extraction. See Ji et al. (2021) for a recent survey on KG representations, and algorithms to manipulate and extract information from KGs. Knowledge graph completion involves analyzing known facts in a KG and using these to infer additional (missing) facts. One approach to KGC is to learn first-order logic rules that use known facts to imply other known facts, and then use these rules to infer missing facts. Link prediction, entity prediction, and relation prediction are typical subtasks of KGC, and we focus on deriving first-order logic rules for link prediction in this paper.

In the example above, we could learn a "rule" capital_of(X,Y) → part_of(X,Y) where X, Y are variables that take on entity values. Then if we find a pair of entities P,Q such that (P, capital_of, Q) is a fact in the graph, we could infer that (P, part_of, Q) is also true and augment the set of facts with this new fact if not originally present. A more complex rule of length two is capital_of(X,Y) and part_of(Y,Z) → part_of(X,Z). Again, applying it to entities P and Q, if there exists a third entity R such that capital_of(P,R) is a fact in the graph, and so is part_of(R,Q), then we infer that P is contained in Q. Instead of learning one rule for a relation, a weighted linear combination of rules is learned in Yang et al. (2017) as a proxy for a disjunction of rules, with larger weights indicating more important rules.
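The two example rules can be applied mechanically by following labeled edges. The sketch below is only an illustration of that idea, not the authors' code; the function name `apply_chain_rule` and the toy entities are assumptions introduced here.

```python
from collections import defaultdict

def apply_chain_rule(facts, body, head):
    """Facts implied by body[0](X,X1) and ... and body[-1](X_{l-1},Y) -> head(X,Y).

    `facts` is a set of (a, relation, b) triples; only facts that are not
    already present are returned.
    """
    out_edges = defaultdict(set)  # (node, relation) -> successor nodes
    for a, r, b in facts:
        out_edges[(a, r)].add(b)
    inferred = set()
    for start in {a for a, _, _ in facts}:
        frontier = {start}
        for rel in body:  # follow one relation of the rule body per step
            frontier = {b for a in frontier for b in out_edges.get((a, rel), ())}
        for end in frontier:
            if (start, head, end) not in facts:
                inferred.add((start, head, end))
    return inferred

# Toy graph (illustrative entities, not taken from any dataset in the paper).
facts = {
    ("Paris", "capital_of", "France"),
    ("France", "part_of", "Europe"),
}
new_len1 = apply_chain_rule(facts, ["capital_of"], "part_of")
new_len2 = apply_chain_rule(facts, ["capital_of", "part_of"], "part_of")
```

Here the length-one rule adds (Paris, part_of, France) and the length-two rule adds (Paris, part_of, Europe).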

Embedding-based methods are an alternative to rule-based methods, and consist of representing nodes by vectors, and relations by vector transformations that are consistent with the facts in the knowledge graph. They exhibit better scaling with KG size, and yield more accurate predictions. See the surveys Ji et al. (2021) and Wang et al. (2017). On the other hand, rule-based methods can yield more interpretable explanations of associated reasoning steps (Qu and Tang, 2019), especially if one obtains compact rule sets (with few rules and few relations per rule). Furthermore, rules are more generalizable (Teru and Hamilton, 2020) and can be applied in an inductive setting, in the sense that they can be applied to entities that were not considered in the learning process. Embedding-based methods work mostly in a transductive setting.

Learning logical rules is a well-studied area. In a paper (Lao and Cohen, 2010) on path ranking algorithms and another (Richardson and Domingos, 2006) on Markov logic networks, candidate logic rules are obtained via relational path enumeration, and then weights for these paths are calculated. The work in Yang et al. (2017) uses neural logic programming to simultaneously obtain sets of rules and also the weights of individual rules, and a number of subsequent papers (Sadeghian et al., 2019) refined this approach. In a recent paper (Qu et al., 2021), the authors separate out the process of finding rules from the process of finding weights for rules. The former task is performed using a recurrent neural network, while the latter is performed using a probabilistic model.

In this work, we propose a novel approach to learning logical rules for knowledge graph reasoning. Our approach is based on combining rule enumeration with solving linear programs, and completely avoids the solution of difficult nonconvex optimization models inherent in neuro-symbolic methods, or other probabilistic methods. We describe a linear programming (LP) formulation with exponentially many variables corresponding to first-order rules and associated weights. Nonzero variable values correspond to chosen rules and their weights. Rather than solving this exponential-size LP, we deal with the exponential number of variables/rules by standard column generation ideas from linear optimization. We start off with some small initial set of rules and associated variables, find the best subset of these and associated weights via the partial LP defined on the initial set of rules, and then generate new rules which can best augment the existing set of rules. Our final output rule-based predictors have the same form as those in NeuralLP (Yang et al., 2017) and DRUM (Sadeghian et al., 2019), but our solution approach has some features in common with the one in Qu et al. (2021), namely the iterative process of adding new rules to improve a previous solution.

Our algorithm has better scaling with KG size than a number of existing rule-based methods. In addition, we obtain state-of-the-art results on three out of four standard KG datasets while taking significantly less running time than some competing codes. Furthermore, we are able to obtain results of reasonable quality (when compared to embedding-based methods) on YAGO3-10, a large dataset which is difficult for existing rule-based methods. As one of the goals of rule-based methods is to obtain interpretable solutions, we promote interpretability by adding a constraint in our LP formulation limiting the complexity of the chosen rule set (and this hyperparameter is tuned per dataset). In some cases, we obtain more accurate results (higher MRR) for the same level of complexity as other codes, and in other cases less complex rules for the same level of accuracy.

2 Related Work

There is a rich body of literature on rule-based methods for knowledge graph reasoning. As rules are discrete objects, whereas rule weights (indicating how important individual rules are) are continuous objects, Kok and Domingos (2007) use beam search to find rule sets, and learn rule parameters via standard numerical methods. In Lao and Cohen (2010), the authors create an initial set of rules, and then use Lasso-type regression to select the best subset of predictive rules from the initial set. There are many papers which first obtain rules and then learn rule weights. In recent work based on Neural Logic Programming (Yang et al., 2017), rules and rule weights are learned simultaneously by training an appropriate neural network; see also Sadeghian et al. (2019). Another neuro-symbolic approach can be found in Rocktäschel and Riedel (2017): their Neural Theorem Prover (NTP) also uses neural networks to find rules and rule weights. Simultaneously solving for rules and rule weights is a difficult task, and a natural question is how well the associated optimization problem is solved in the above papers. There have been a few recent attempts to use reinforcement learning to search for rules; see MINERVA (Das et al., 2018), MultiHopKG (Lin et al., 2018), M-Walk (Shen et al., 2018) and DeepPath (Xiong et al., 2017). In Qu et al. (2021), the rule generation process is separated from the weight calculation process (as in earlier work), but a feedback loop is added from the weight calculation routine to the rule generation routine. In other words, the iterative process has the feature that new rule generation is influenced by the calculated weights of previously generated rules.

An alternative approach to knowledge graph reasoning tasks is based on embedding entities and relations in the knowledge graph into (possibly different) vector spaces. For example, suppose one finds a vector $v_a$ for each node $a$ in the knowledge graph and a function $f_r$ for each relation $r$ such that $f_r(v_a) \approx v_b$ whenever $(a, r, b)$ is a fact in the graph. Then, for a pair of entities $c$ and $d$, one could assert that $(c, r, d)$ is a fact (assuming it is not present in the graph) if $f_r(v_c) \approx v_d$. Well known papers in this area are Sun et al. (2019), Bordes et al. (2013), Dettmers et al. (2018), Lacroix et al. (2018), Trouillon et al. (2016), Balažević et al. (2019), Wang et al. (2014), Yang et al. (2015), Nayyeri et al. (2021), and Chami et al. (2020).

There are a number of papers which combine embeddings and rules in different ways. In rule-injected embedding models such as RUGE (Guo et al., 2018), LogicENN (Nayyeri et al., 2019), and ComplEx-NEE-AER (Ding et al., 2018), the goal is to obtain embeddings that are consistent with prior rules (known before the training process). On the other hand, RNNLogic (Qu et al., 2021) combines rules and embeddings to give more precise scores to candidate answers to queries of the form $(t, r, ?)$. In others (Lin et al., 2018), information from embeddings is used to obtain better rules.

3 Model

We propose a linear programming model which is inspired by LP boosting methods for classification using classical column generation techniques (Demiriz et al., 2002; Goldberg, 2012; Eckstein and Goldberg, 2012; Eckstein et al., 2019; Dash et al., 2018). The goal of our model is to create a weighted combination of first-order logic rules (similar to the one proposed in Yang et al. (2017)) to be used as a prediction function for the task of knowledge graph link prediction. In principle, our model has exponentially many variables corresponding to the exponentially many possible rules, but our solution approach uses column generation to deal with this issue. We start off with a small set of candidate rules, find important rules and associated weights, and then generate additional rules that have the potential to improve the overall solution. The fact that previously generated rules influence the generation of new rules is an important feature; see also RNNLogic (Qu et al., 2021).

Knowledge graphs: Let $V$ be a set of entities, and let $\mathcal{R}$ be a set of binary relations defined over the domain $V$. A knowledge graph represents a collection of facts $\mathcal{F} \subseteq V \times \mathcal{R} \times V$ as a labeled, directed multigraph $G$. Let $\mathcal{F} = \{(t_i, r_i, h_i) : i = 1, \ldots, |\mathcal{F}|\}$ where $t_i, h_i \in V$, and $r_i \in \mathcal{R}$. The nodes of $G$ correspond to entities in $\mathcal{F}$ and the edges to facts in $\mathcal{F}$: if $(t, r, h)$ is a fact in $\mathcal{F}$, then $G$ has a directed edge $(t, h)$ labeled by the relation $r$, depicted as $t \xrightarrow{r} h$. Here $t$ is the tail of the directed edge, and $h$ is the head. We let $E$ stand for the list of directed edges in $G$. For each fact $(t, r, h)$, we say that $r(t, h)$ is true. Practical knowledge graphs are assumed to be incomplete: not all missing facts that can be defined over $V$ and $\mathcal{R}$ are assumed to be incorrect.

The knowledge graph link prediction task consists of taking a knowledge graph as input, and then answering a list of queries of the form $(t, r, ?)$ and $(?, r, h)$, typically constructed from facts $(t, r, h)$ in a test set. The query $(t, r, ?)$ means that a tail entity $t$ and a relation $r$ are given as input, and the goal is to find a head entity $h$ such that $(t, r, h)$ is a fact. In practice, a collection of facts $\mathcal{F}$ is divided into a training set $\mathcal{F}_{tr}$, a validation set $\mathcal{F}_{v}$, and a test set $\mathcal{F}_{te}$; the knowledge graph $G$ corresponding to $\mathcal{F}_{tr}$ is constructed, and a link prediction function is learnt from $G$ and evaluated on the test set.

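Goal: For each relation $r$ in $G$, find a set of closed, chain-like rules $R_1, \ldots, R_p$ and positive weights $w_1, \ldots, w_p$ where each rule $R_i$ has the form

$$r_1(X, X_1) \wedge r_2(X_1, X_2) \wedge \cdots \wedge r_l(X_{l-1}, Y) \rightarrow r(X, Y). \qquad (1)$$

Here $r_1, \ldots, r_l$ are relations represented in $G$, and the length of the rule is $l$. The interpretation of this rule is that if for some entities (or nodes) $X, Y$ of $G$ there exist entities $X_1, \ldots, X_{l-1}$ of $G$ such that $r_1(X, X_1)$, $r_l(X_{l-1}, Y)$ and $r_j(X_{j-1}, X_j)$ are true for $j = 2, \ldots, l-1$, then $r(X, Y)$ is true. We refer to the conjunction of relations in (1) as the clause associated with the rule $R_i$, and denote it by $C_i$. Thus each clause $C_i$ is a function from $V \times V$ to $\{0, 1\}$, and we define $|C_i|$ to be the number of relations in $C_i$. Clearly, $C_i(X, Y) = 1$ for entities $X, Y$ in $G$ if and only if there is a relational path of the form

$$X \xrightarrow{r_1} X_1 \xrightarrow{r_2} \cdots \xrightarrow{r_l} Y.$$

Our learned prediction function for relation $r$ is simply

$$f_r(X, Y) = \sum_{i=1}^{p} w_i \, C_i(X, Y) \quad \text{for all } X, Y \in V. \qquad (2)$$

Given a query $(t, r, ?)$ constructed from a fact $(t, r, h)$ from the test set, we use the approach from Bordes et al. (2013) where the score $f_r(t, v)$ is calculated for every entity $v \in V$, and the rank of the correct entity $h$ is calculated from the scores of all entities in the filtered setting. We similarly calculate the rank of $t$ for the query $(?, r, h)$. We then compute standard metrics such as MRR (mean reciprocal rank), Hits@1, Hits@3, and Hits@10.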
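A minimal sketch of how a weighted combination of chain-like clauses could be evaluated as a score, assuming the knowledge graph is stored as a set of (tail, relation, head) triples. The helper names `clause_holds` and `score` and the toy relations are hypothetical and introduced only for illustration.

```python
def clause_holds(edges, clause, x, y):
    """Return 1 if a relational path labelled by `clause` leads from x to y, else 0."""
    frontier = {x}
    for rel in clause:  # one hop per relation in the clause
        frontier = {b for (a, r, b) in edges if r == rel and a in frontier}
    return 1 if y in frontier else 0

def score(edges, weighted_clauses, x, y):
    """Weighted sum of clause indicators, in the spirit of the prediction function."""
    return sum(w * clause_holds(edges, c, x, y) for c, w in weighted_clauses)

# Toy graph and hypothetical clauses for a target relation "grandparent_of".
edges = {
    ("ann", "mother_of", "bob"),
    ("bob", "father_of", "cat"),
}
weighted_clauses = [
    (("mother_of", "father_of"), 0.7),
    (("father_of", "father_of"), 0.5),
]
s_hit = score(edges, weighted_clauses, "ann", "cat")   # only the first clause fires
s_miss = score(edges, weighted_clauses, "ann", "bob")  # no clause fires
```

Scoring a query then amounts to calling `score` for every candidate entity and ranking the results.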

A major issue in rank computation is that multiple entities (say $v$ and $v'$) can get the same score for a query. Different treatment of equal scores can lead to significantly different MRR values (Sun et al., 2020) (also see the Appendix). We use random break ranking (available as an option in NeuralLP (Yang et al., 2017)), where the correct entity is compared against all entities and any ties in scores are broken randomly.
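The effect of tie handling on ranks can be sketched as follows. This is one possible reading of optimistic, pessimistic, and random break ranking, not the NeuralLP implementation; the function names are hypothetical, and `scores` is assumed to already be restricted to the filtered candidate set.

```python
import random

def rank_of(correct, scores, policy="random", rng=random):
    """Rank of `correct` among the candidates in `scores` (entity -> score)."""
    s = scores[correct]
    higher = sum(1 for e, v in scores.items() if e != correct and v > s)
    ties = sum(1 for e, v in scores.items() if e != correct and v == s)
    if policy == "optimistic":   # correct entity wins every tie
        return higher + 1
    if policy == "pessimistic":  # correct entity loses every tie
        return higher + ties + 1
    # random break (assumed here to mean a uniformly random position in the tied block)
    return higher + 1 + rng.randint(0, ties)

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and Hits@k for a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits

scores = {"a": 0.9, "b": 0.5, "c": 0.5, "d": 0.1}
r_opt = rank_of("b", scores, "optimistic")
r_pes = rank_of("b", scores, "pessimistic")
r_rnd = rank_of("b", scores, "random", random.Random(0))
mrr, hits = mrr_and_hits([1, 2, 4])
```

With many tied scores, which is common for rule-based predictors, the gap between the optimistic and pessimistic ranks can be large, which is why the tie policy has to be stated alongside any MRR value.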

New LP model for rule learning for KG Link Prediction

Let $K$ denote the set of clauses of possible rules of the form (1) with maximum rule length $L$. Clearly, $|K| = O(n^L)$, where $n$ is the number of relations. Consider a relation $r$, and let $E_r$ be the set of edges in $E$ labeled by $r$, and assume that $|E_r| = m$. Let the $i$th edge in $E_r$ be $(t_i, h_i)$. We compute $a_{ik}$ as $a_{ik} = C_k(t_i, h_i)$. That is, $a_{ik}$ is 1 if there is a relational path associated with the clause $C_k$ from $t_i$ to $h_i$ and 0 otherwise.

Furthermore, let $\mathrm{neg}_k$ be a number associated with the number of "nonedges" $(t', h')$ from $(V \times V) \setminus E_r$ for which $C_k(t', h') = 1$. We calculate $\mathrm{neg}_k$ for the $k$th rule as follows. We consider the tail node $t$ and head node $h$ for each edge in $E_r$. We compute the set of nodes $S$ that can be reached by a path induced by the $k$th rule starting at the tail. If there is no edge from $t$ to a node $v$ in $S$ labeled by $r$, we say that $v$ is an invalid end-point. Let $\mathrm{right}_k(t)$ be the set of such invalid points. We similarly calculate the set $\mathrm{left}_k(h)$ of invalid start-points based on paths ending at $h$ induced by the $k$th rule. The total number of invalid start and end points for all tail and head nodes associated with edges in $E_r$ is $\mathrm{neg}_k = \sum_{(t,h) \in E_r} \left( |\mathrm{right}_k(t)| + |\mathrm{left}_k(h)| \right)$. For a query of the form $(t, r, ?)$ where $t$ is a tail node of an edge in $E_r$, the prediction function defined by the $k$th rule alone gives a positive and equal score to all nodes in $\mathrm{right}_k(t)$.
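The coverage indicators and the count of invalid start and end points described above can be computed by forward and backward path following. The sketch below is an assumption-laden illustration (triples stored in a list, hypothetical helper names), not the authors' implementation.

```python
def reach(edges, clause, start):
    """Nodes reachable from `start` along a path labelled by `clause`."""
    frontier = {start}
    for rel in clause:
        frontier = {b for (a, r, b) in edges if r == rel and a in frontier}
    return frontier

def reach_back(edges, clause, end):
    """Nodes from which `end` is reachable along a path labelled by `clause`."""
    frontier = {end}
    for rel in reversed(clause):
        frontier = {a for (a, r, b) in edges if r == rel and b in frontier}
    return frontier

def coverage_and_neg(edges, r, clause):
    """Coverage vector over edges labelled r, and the number of invalid end/start points."""
    e_r = [(t, h) for (t, rel, h) in edges if rel == r]
    known = set(e_r)
    cover = [1 if h in reach(edges, clause, t) else 0 for t, h in e_r]
    neg = 0
    for t, h in e_r:
        # invalid end-points: reached from t, but no r-edge from t to them
        neg += sum(1 for v in reach(edges, clause, t) if (t, v) not in known)
        # invalid start-points: reach h, but have no r-edge into h
        neg += sum(1 for u in reach_back(edges, clause, h) if (u, h) not in known)
    return cover, neg

edges = [("a", "r", "b"), ("a", "s", "b"), ("a", "s", "c"), ("d", "s", "b")]
cover, neg = coverage_and_neg(edges, "r", ["s"])
```

In this toy graph the clause `s` covers the single `r` edge, but it also reaches `c` from `a` and reaches `b` from `d`, so the count of invalid points is 2.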

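Our model for rule-learning is given below.

$$
\begin{aligned}
\text{(LPR)} \qquad z_{\min} = \min \; & \sum_{i=1}^{m} \eta_i + \tau \sum_{k \in K} \mathrm{neg}_k \, w_k && (3) \\
\text{s.t.} \; & \sum_{k \in K} a_{ik} w_k + \eta_i \ge 1 \quad \text{for all } i \in E_r && (4) \\
& \sum_{k \in K} (1 + |C_k|) \, w_k \le \kappa && (5) \\
& w_k \in [0, 1] \quad \text{for all } k \in K. && (6)
\end{aligned}
$$

The continuous variable $w_k$ is restricted to lie in $[0, 1]$ and is positive if and only if clause $C_k$ is a part of the prediction function (2). In other words, the variables $w_k$ with positive value define the function (2). The parameter $\kappa$ is an upper bound on the complexity of the prediction function (defined as the number of clauses plus the number of relations across all clauses). The variable $\eta_i$ is a penalty variable which is positive if the prediction function defined by positive $w_k$s gives a value less than 1 to the $i$th edge in $E_r$. Therefore, the $\sum_{i=1}^{m} \eta_i$ portion of the objective function attempts to maximize $\sum_{i=1}^{m} \min\{f_r(t_i, h_i), 1\}$, i.e., it attempts to approximately maximize the number of facts in $E_r$ that are given a "high score" of 1 by $f_r$. In addition, we have the parameter $\tau$ which represents a tradeoff between how well our weighted combination of rules performs on the known facts (gives positive scores), and how poorly it performs on some negative samples or "unknown" facts. We make this precise shortly.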
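For a fixed weight vector, the best penalty for each edge is determined in closed form (one minus the covered score, floored at zero), so the objective of such a model can be evaluated without a solver. The sketch below does only that, under the assumption that the complexity of a clause is one plus its number of relations; it does not solve the LP (the paper uses CPLEX for that), and all names and numbers are illustrative.

```python
def lpr_objective(a, neg, clause_len, w, tau, kappa):
    """Objective value for weights w, or None if w violates a constraint.

    a[i][k]       1 if clause k covers edge i, else 0
    neg[k]        penalty count of clause k
    clause_len[k] number of relations in clause k
    """
    if any(not 0.0 <= wk <= 1.0 for wk in w):
        return None  # weights must lie in [0, 1]
    complexity = sum((1 + l) * wk for l, wk in zip(clause_len, w))
    if complexity > kappa + 1e-9:
        return None  # complexity budget exceeded
    # Optimal penalty per edge: shortfall of the weighted coverage below 1.
    eta = [max(0.0, 1.0 - sum(aik * wk for aik, wk in zip(row, w))) for row in a]
    return sum(eta) + tau * sum(n * wk for n, wk in zip(neg, w))

a = [[1, 0], [1, 1], [0, 1]]  # three edges, two candidate clauses
neg = [4, 0]
clause_len = [1, 2]
both = lpr_objective(a, neg, clause_len, [1, 1], tau=0.1, kappa=5)
second_only = lpr_objective(a, neg, clause_len, [0, 1], tau=0.1, kappa=5)
too_complex = lpr_objective(a, neg, clause_len, [1, 1], tau=0.1, kappa=4)
```

With a budget of 5 both clauses fit and every edge is covered, leaving only the negative-sample penalty; tightening the budget to 4 makes that choice infeasible, which is how the complexity bound trades accuracy for compactness.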

Maximizing the MRR is a standard objective for knowledge graph link prediction problems, and thus the objective function of LPR is only an approximation; see Theorem 1 below (the proof is given in the Appendix). In spite of this fact, we can obtain state-of-the-art prediction rules using LPR.

Theorem 1.

Let IPR be the integer programming problem created from LPR by replacing equation (6) by $w_k \in \{0, 1\}$ for all $k \in K$, and letting $\tau = 0$. Then one can construct a prediction function from an optimal solution with objective function value $\gamma$ such that $1 - \gamma/m$ is a lower bound on the MRR of the prediction function calculated by the optimistic ranking method, when applied to the training set triples.

Assuming that the training set facts have a similar distribution to the test set facts, the theorem above justifies choosing IPR as an optimization formulation to find a high-quality collection of rules for a relation, assuming MRR calculation via optimistic ranking.

However, we use random break ranking in this paper. It is essential to perform negative sampling and penalize rules that create paths when there are no edges in order to produce good quality results. This is why we use $\mathrm{neg}_k$ in LPR. We will now give an interpretation of the term $\sum_{k \in K} \mathrm{neg}_k w_k$. To compute the MRR of the prediction function in (2) applied to the training set, for each edge $(t, h) \in E_r$ we need to compute the rank of the answer $h$ to the query $(t, r, ?)$, by comparing $f_r(t, h)$ with $f_r(t, v)$ for all nodes $v$ in $V$, and the rank of the answer $t$ to the query $(?, r, h)$, by comparing $f_r(t, h)$ with $f_r(v, h)$ for all nodes $v$. But $\sum_{k \in K} \mathrm{neg}_k w_k$ is exactly the sum of scores given by $f_r$ to all invalid end-points and invalid start-points (the nodes in the sets $\mathrm{right}_k(t)$ and $\mathrm{left}_k(h)$), and therefore we have the following proposition.

Proposition 2.

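Let $(t, h)$ be an edge in $E_r$, and let $U_t$ be the set of invalid answers for $(t, r, ?)$ and let $U_h$ be the set of invalid answers to $(?, r, h)$ in the filtered setting. Then

$$\sum_{k \in K} \mathrm{neg}_k \, w_k = \sum_{(t,h) \in E_r} \Big( \sum_{v \in U_t} f_r(t, v) + \sum_{v \in U_h} f_r(v, h) \Big).$$

In other words, rather than keeping individual scores of the form $f_r(t, v)$ and $f_r(v, h)$ small, we minimize the sum of these scores in LPR.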

It is impractical to solve LPR given the exponentially many variables $w_k$, except when the number of relations $n$ and the maximum rule length $L$ are both small. For WN18RR (Dettmers et al., 2018), $n$ is only 22 (WN18RR has 11 relations, but we introduce a reverse relation $r^{-1}$ for each relation $r$, and create rules that include reverse relations) and thus setting $L$ to 3 does not lead to too many variables. An effective way to solve such large LPs is to use column generation (Barnhart et al., 1998) where only a small subset of all possible variables is generated explicitly and the optimality of the LP is guaranteed by iteratively solving a pricing problem. We however do not attempt to solve LPR to optimality.

We start with an initial set of candidate rules $K_0 \subset K$ (and implicitly set all other rule variables from $K \setminus K_0$ to 0). Let LPR$_0$ be the associated LP. We solve LPR$_0$, and then dynamically augment the set of candidate rules to create sets $K_i$ such that $K_0 \subset K_1 \subset \cdots \subset K$. If LPR$_i$ is the LP associated with $K_i$ with optimal solution value $z_i$, then it is clear that a solution of LPR$_i$ is a solution of LPR$_{i+1}$, and therefore $z_{i+1} \le z_i$. However, the goal is to have $z_{i+1} < z_i$. We attempt to do this by considering the dual solution associated with an optimal solution of LPR$_i$, and then trying to find a negative reduced cost rule, which we discuss shortly.

| Dataset | # Relations | # Entities | # Train | # Test | # Valid |
|---|---|---|---|---|---|
| Kinship | 25 | 104 | 8544 | 1074 | 1068 |
| UMLS | 46 | 135 | 5216 | 661 | 652 |
| WN18RR | 11 | 40943 | 86835 | 3134 | 3034 |
| FB15k-237 | 237 | 14541 | 272115 | 20466 | 17535 |
| YAGO3-10 | 37 | 123182 | 1079040 | 5000 | 5000 |
| DB111K-174 | 298 | 98336 | 474123 | 65851 | 118531 |
Table 1: Sizes of datasets.

Setting up the initial LP

To set up $K_0$ and the associated LPR$_0$, we develop two heuristics. In Rule Heuristic 1, we generate rules of lengths one and two. For length one rules, we create a one-relation rule from a relation in $\mathcal{R} \setminus \{r\}$ if it labels a large number of edges from tail nodes to head nodes of edges in $E_r$. Similarly, to create rules of length two, we take each edge $(t, h)$ in $E_r$ and select the relations from edges $(t, v)$ in $E$ and $(v, h)$ in $E$ to create a rule, again taking into account how frequently a length two rule creates paths from the tail nodes to head nodes of edges in $E_r$.

In Rule Heuristic 2, we take each edge $(t, h)$ in $E_r$ and find a shortest path from $t$ to $h$ contained in the edge set $E \setminus \{(t, h)\}$ where the path length is bounded by a pre-determined maximum length. We then use the sequence of relations associated with the shortest path to generate a rule. We also use a path of length at least one more than the shortest path. Rules generated by any method (NeuralLP, DRUM, etc.) can be used to set up $K_0$.
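The shortest-path step of Rule Heuristic 2 can be sketched with a breadth-first search. This is an illustrative version under stated assumptions: triples are (tail, relation, head), only the edge carrying the target relation between the two endpoints is excluded, and the function name is hypothetical.

```python
from collections import deque

def shortest_path_rule(edges, t, h, r, max_len):
    """Relation sequence of a shortest t -> h path avoiding the edge (t, r, h).

    Returns a tuple of relations, or None if no path of length <= max_len exists.
    """
    adj = {}
    for a, rel, b in edges:
        if (a, rel, b) != (t, r, h):  # leave out the fact we want to explain
            adj.setdefault(a, []).append((rel, b))
    queue = deque([(t, ())])
    seen = {t}
    while queue:
        node, rels = queue.popleft()
        if len(rels) == max_len:  # do not extend paths past the length bound
            continue
        for rel, nxt in adj.get(node, []):
            if nxt == h:
                return rels + (rel,)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, rels + (rel,)))
    return None

edges = [("a", "r", "b"), ("a", "s", "c"), ("c", "u", "b")]
rule = shortest_path_rule(edges, "a", "b", "r", max_len=3)
no_rule = shortest_path_rule(edges, "a", "b", "r", max_len=1)
```

The returned relation sequence is the body of a candidate rule for the target relation; here the path a → c → b yields the body (s, u).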

Adding new rules

Each set $K_j$, for $j > 0$, is constructed by adding new rules to the set $K_{j-1}$. We use a modified version of Heuristic 2 to generate the additional rules. In this version of the heuristic, we use the dual variable values associated with the optimal solution of LPR$_{j-1}$. Let $\delta_i \ge 0$ for all $i \in E_r$ be dual variables corresponding to the per-edge covering constraints (4). Let $\lambda \le 0$ be the dual variable associated with the complexity constraint (5). Given a variable $w_k$ which is zero in a solution of LPR and associated dual solution values $\delta$ and $\lambda$, the reduced cost $\mathrm{red}_k$ for this variable is given by

$$\mathrm{red}_k = \tau \, \mathrm{neg}_k - \sum_{i \in E_r} a_{ik} \delta_i - (1 + |C_k|) \lambda.$$

If $\mathrm{red}_k < 0$, then increasing $w_k$ from zero may reduce the LP solution value.
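As a worked illustration of the pricing step, the reduced cost of a candidate rule can be computed directly from its coverage column, its negative-sample count, its length, and the dual values. The formula in the docstring is the standard LP reduced cost for a model with per-edge covering constraints and one complexity constraint; the function name and the numbers are assumptions.

```python
def reduced_cost(a_col, neg_k, clause_len_k, delta, lam, tau):
    """tau*neg_k - sum_i a_ik*delta_i - (1 + clause_len_k)*lam."""
    covered = sum(aik * d for aik, d in zip(a_col, delta))
    return tau * neg_k - covered - (1 + clause_len_k) * lam

delta = [1.0, 0.0, 1.0]  # duals of the per-edge constraints (nonnegative)
lam = -0.1               # dual of the complexity constraint (nonpositive)
tau = 0.05

# Covers the two edges with positive duals: attractive despite its penalties.
red_good = reduced_cost([1, 0, 1], neg_k=3, clause_len_k=2, delta=delta, lam=lam, tau=tau)
# Covers only an edge whose dual is zero: it pays for complexity and gains nothing.
red_bad = reduced_cost([0, 1, 0], neg_k=0, clause_len_k=1, delta=delta, lam=lam, tau=tau)
```

Only the first candidate has a negative reduced cost, so only it could improve the objective when added to the LP.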

Our implemented approach to making the reduced cost negative is to sort the dual values $\delta_i$ in decreasing order, and then go through the associated indices $i$, and create rules $k$ such that $a_{ik} = 1$ via a shortest path calculation. That is, we take the corresponding edge $(t_i, h_i)$ in $G$, find the shortest path between $t_i$ and $h_i$ and generate a new rule with the sequence of relations in that path. In this version of the heuristic, we limit the number of rules generated so that $K_{j+1}$ is only slightly larger than $K_j$. More precisely, we set $|K_{j+1}| \le |K_j| + 10$. We use the dual values to indicate which facts are not currently implied by the existing set of chosen rules. If the reduced cost of a new rule is nonnegative, then we do not add that rule to $K_{j+1}$.
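The dual-guided selection can be sketched as a loop over edges in decreasing order of dual value. This is a simplified illustration rather than the authors' routine: `path_rule_for_edge` is a hypothetical callback that returns the rule found by the shortest path search for edge `i`, `is_improving` is a hypothetical reduced-cost filter, and stopping at a zero dual is an assumption made here.

```python
def pick_new_rules(delta, path_rule_for_edge, existing, max_new=10,
                   is_improving=lambda rule: True):
    """Collect up to max_new unseen rules, visiting edges by decreasing dual value."""
    order = sorted(range(len(delta)), key=lambda i: delta[i], reverse=True)
    new_rules = []
    for i in order:
        if delta[i] <= 0 or len(new_rules) == max_new:
            break  # remaining edges carry no dual weight, or the cap is reached
        rule = path_rule_for_edge(i)
        if rule is None or rule in existing or rule in new_rules:
            continue
        if is_improving(rule):  # e.g. keep only rules with negative reduced cost
            new_rules.append(rule)
    return new_rules

delta = [0.0, 1.0, 0.5]
paths = {0: ("s",), 1: ("s", "u"), 2: ("s", "u")}
added = pick_new_rules(delta, paths.get, existing={("s",)})
```

Edges with large dual values are the ones the current rule set fails to cover, so they are the natural places to look for new rules.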

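| Algorithm | Kinship MRR | Kinship H@1 | Kinship H@3 | Kinship H@10 | UMLS MRR | UMLS H@1 | UMLS H@3 | UMLS H@10 |
|---|---|---|---|---|---|---|---|---|
| ComplEx-N3 | 0.889 | 0.824 | 0.950 | 0.986 | 0.962 | 0.934 | 0.990 | 0.996 |
| TuckER | 0.891 | 0.826 | 0.950 | 0.987 | 0.914 | 0.837 | 0.991 | 0.997 |
| ConvE | 0.83 | 0.74 | 0.92 | 0.98 | 0.94 | 0.92 | 0.96 | 0.99 |
| NeuralLP | 0.652 | 0.520 | 0.726 | 0.925 | 0.750 | 0.601 | 0.876 | 0.958 |
| DRUM | 0.566 | 0.404 | 0.663 | 0.910 | 0.845 | 0.722 | 0.959 | 0.991 |
| RNNLogic | *0.687 | 0.566 | 0.756 | 0.929 | *0.748 | 0.618 | 0.849 | 0.928 |
| LPRules | 0.746 | 0.639 | 0.816 | 0.959 | 0.869 | 0.812 | 0.917 | 0.970 |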
Table 2: Comparison of results on Kinship and UMLS. The results from NeuralLP, DRUM, and our code use the random break metric. *We modified the RNNLogic code to compute the MRR values based on the random break metric and the values obtained were the same as the values of the original MRR computation up to three decimal places. The results for ConvE were taken from the original paper.

4 Experiments

We conduct experiments on knowledge graph completion tasks with six datasets: Kinship (Denham, 1973), UMLS (McCray, 2003), FB15k-237 (Toutanova and Chen, 2015), WN18RR (Dettmers et al., 2018), YAGO3-10 (Mahdisoltani et al., 2015) and DB111K-174 (Hao et al., 2019). In Table 1, we give properties of the datasets: the number of entities, relations, and the number of facts in the training, testing and validation data sets. The partition of FB15k-237, WN18RR, and YAGO3-10 into training, testing, and validation data sets is standard. We chose the partition for UMLS and Kinship used in Dettmers et al. (2018) and the partition for DB111K-174 given in Cui et al. (2021). In early work on Kinship and UMLS (Kok and Domingos, 2007; Kemp et al., 2006), there is a slight difference in the number of relations (26 and 49 respectively) compared to recent papers.

4.1 Experimental setup

We introduce the reverse relation for each relation $r$, and denote it by $r^{-1}$. For each fact $(t, r, h)$ in the training set, we implicitly introduce the fact $(h, r^{-1}, t)$, doubling the number of relations and facts. For each original relation $r$ in the training set, we create a prediction function $f_r$ of the form in (2). For each test set fact $(t, r, h)$ we create two queries $(t, r, ?)$ and $(?, r, h)$, and use $f_r$ to predict answers to these queries. For each entity $v$ in $V$, we calculate the scores $f_r(t, v)$ and $f_r(v, h)$ (here $v$ is treated as a candidate solution to the queries $(t, r, ?)$ and $(?, r, h)$) and then use the filtered ranking method in Bordes et al. (2013) to calculate a ranking for the correct answer (namely $h$ and $t$, respectively) to the above queries. Ranks are computed using the random break method (option in NeuralLP (Yang et al., 2017)), and these are used to compute MRR and Hits@$k$ (for $k = 1, 3, 10$) across all test facts.
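Adding reverse relations is a simple preprocessing step. A minimal sketch, assuming facts are stored as (tail, relation, head) triples and that the reverse of a relation is named by appending a suffix; the suffix `_inv` and the function name are hypothetical choices.

```python
def add_reverse_facts(facts, suffix="_inv"):
    """Return the facts plus one reversed fact (h, r + suffix, t) per original fact."""
    augmented = list(facts)
    augmented += [(h, r + suffix, t) for (t, r, h) in facts]
    return augmented

facts = [("a", "part_of", "b"), ("c", "capital_of", "b")]
augmented = add_reverse_facts(facts)
relations = {r for _, r, _ in augmented}
```

With reverse edges present, a chain-like rule can traverse an original edge in either direction, which is why the number of relations available to rules doubles.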

We compare our results with the published embedding-based methods ConvE (Dettmers et al., 2018), ComplEx-N3 (Lacroix et al., 2018), TuckER (Balažević et al., 2019), RotatE (Sun et al., 2019), 5*E (Nayyeri et al., 2021), and ATTH (Chami et al., 2020). We obtained results for ComplEx-N3 and TuckER by running on our machines using the best published hyperparameters (if available). The results for the other embedding-based methods were taken from either the original papers or from Dettmers et al. (2018) or Qu et al. (2021). These codes do not implement random break ranking for equal scores, and some instead use nondeterministic ranking (Berrendorf et al., 2021), i.e., they sort entity scores before ranking. This approach is fine in the absence of equal scores for multiple entities, which we believe is the case for embedding-based methods. We also compare with the rule-based methods NeuralLP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), and RNNLogic (Qu et al., 2021). In the Appendix, we also give a comparison with recent reinforcement learning methods such as MINERVA (Das et al., 2018) and M-Walk (Shen et al., 2018); we do not include it here because the evaluation process in each of these papers is slightly different from ours: only right entities are removed to form queries. We obtain results for NeuralLP and DRUM by running these codes using default parameters and random break score ranking. We modify the RNNLogic code to implement random break ranking, and use the defaults suggested for different datasets. RNNLogic (Qu et al., 2021) claims to use midpoint ranking, but actually implements a harmonic mean of possible reciprocal ranks in the presence of equal scores while calculating the MRR, which yields a very slightly larger number than midpoint ranking based MRR.
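The difference between midpoint ranking and averaging over tied positions can be checked numerically. The sketch below assumes that the second scheme averages the reciprocal rank over every position the correct entity could occupy within its block of tied scores; that reading, and both function names, are assumptions rather than a description of the RNNLogic code.

```python
def midpoint_rr(higher, ties):
    """Reciprocal of the midpoint rank when `ties` other entities share the score."""
    lo, hi = higher + 1, higher + ties + 1
    return 1.0 / ((lo + hi) / 2.0)

def mean_rr_over_ties(higher, ties):
    """Average of 1/rank over all positions in the tied block."""
    lo, hi = higher + 1, higher + ties + 1
    return sum(1.0 / r for r in range(lo, hi + 1)) / (ties + 1)

# Nine entities score strictly higher and two tie with the correct answer,
# so the possible ranks are 10, 11 and 12.
mid = midpoint_rr(9, 2)
avg = mean_rr_over_ties(9, 2)
```

Because 1/x is convex, the average of reciprocal ranks is never smaller than the reciprocal of the average rank, and the two coincide when there are no ties.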

We ran two variants of our code which we call "LPRules". In the first variant, we create LPR$_0$ by generating rules using Rule Heuristic 1 and Rule Heuristic 2, and then solve LPR$_0$ to obtain rules. As the results are satisfactory for smaller datasets, we do not perform column generation. In the second variant, which we use only for the largest instances, we create LPR$_0$ with an empty set of rules, and then perform 5 iterations consisting of generating up to 10 rules using the modified version of Rule Heuristic 2 followed by solving the new LP. In other words, we create and solve a new LPR$_i$ in each of the 5 iterations.

We search for the best values of $\tau$ and $\kappa$ for each relation. We dynamically let $\kappa_0$ equal the length of the longest rule generated plus one. We then perform 20 iterations where, at the $i$th iteration, we set $\kappa$ to $i \kappa_0$. For each combination of $\tau$ and $\kappa$, we take the optimal weighted combination of rules, compute the MRR of the prediction function on the validation data set, and select those $\tau$ and $\kappa$ that yield the best MRR. We set the maximum rule length to 6 for WN18RR, 3 for YAGO3-10, and 4 for the other datasets. Thus $\kappa \le 100$ except for WN18RR. We search for the best $\tau$ from a list of values given as input for each problem. Table 11 in Appendix A.4 contains the lists of values of $\tau$ that we used for each dataset.
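The hyperparameter search can be sketched as a small grid search. This sketch assumes that the complexity bound at iteration i is i times (longest rule length + 1), which is consistent with the bound of 100 for a maximum rule length of 4; `validation_mrr` is a hypothetical callback standing in for "solve the LP with these parameters and evaluate on the validation set".

```python
def search_hyperparameters(taus, longest_rule_len, validation_mrr, n_steps=20):
    """Return (best_mrr, tau, kappa) over the grid of tau values and complexity bounds."""
    base = longest_rule_len + 1
    best = None
    for tau in taus:
        for i in range(1, n_steps + 1):
            kappa = i * base  # complexity bound grows linearly with the iteration
            mrr = validation_mrr(tau, kappa)
            if best is None or mrr > best[0]:
                best = (mrr, tau, kappa)
    return best

# Stand-in for the real "solve LP, score on validation set" step (illustrative only).
def toy_validation_mrr(tau, kappa):
    return 1.0 - abs(kappa - 15) / 100.0 - abs(tau - 0.05)

best_mrr, best_tau, best_kappa = search_hyperparameters(
    taus=[0.01, 0.05], longest_rule_len=4, validation_mrr=toy_validation_mrr)
```

Since each grid point requires only an LP solve on a fixed candidate set, the search can be run independently per relation.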

| Algorithm | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@3 | FB15k-237 H@10 | WN18RR MRR | WN18RR H@1 | WN18RR H@3 | WN18RR H@10 |
|---|---|---|---|---|---|---|---|---|
| ComplEx-N3 | 0.362 | 0.259 | 0.397 | 0.555 | 0.469 | 0.434 | 0.481 | 0.545 |
| TuckER | 0.353 | 0.259 | 0.390 | 0.538 | 0.464 | 0.436 | 0.477 | 0.517 |
| ConvE | 0.325 | 0.237 | 0.356 | 0.501 | 0.43 | 0.40 | 0.44 | 0.52 |
| RotatE | 0.338 | 0.241 | 0.375 | 0.533 | 0.476 | 0.428 | 0.492 | 0.571 |
| 5*E | 0.37 | 0.28 | 0.40 | 0.56 | 0.50 | 0.45 | 0.51 | 0.59 |
| NeuralLP | 0.222 | 0.160 | 0.240 | 0.349 | 0.381 | 0.367 | 0.386 | 0.409 |
| DRUM | 0.225 | 0.160 | 0.245 | 0.355 | 0.381 | 0.367 | 0.389 | 0.410 |
| RNNLogic | 0.288 | 0.208 | 0.315 | 0.445 | *0.451 | 0.415 | 0.474 | 0.524 |
| LPRules | 0.255 | 0.170 | 0.270 | 0.402 | 0.459 | 0.422 | 0.477 | 0.532 |

Table 3: Comparison of results on FB15k-237 and WN18RR. *The RNNLogic MRR value for WN18RR is obtained via random break ranking and it differs from the original code output of 0.452. We could not run RNNLogic on FB15k-237, and report numbers from the original paper. The results for ConvE and RotatE were taken from Qu et al. (2021). The results for 5*E were taken from Nayyeri et al. (2021).

4.2 Results

In Tables 2-4, we give values for different metrics obtained with the listed codes, first for embedding methods, then for rule-based methods (if available), and then for our code. All our experiments (with rule-based codes) are performed on a machine with 128 GBytes of memory, and four 2.8 GHz Intel Xeon E7-4890 v2 processors, each with 15 cores, for a total of 60 cores. We use coarse-grained parallelism in our code, and execute rule generation for each relation on a different thread, and solve LPs with IBM CPLEX (IBM, 2019). We obtain results for NeuralLP, DRUM, and RNNLogic with random break ranking on the same machine. These codes also exploit available cores.

In Table 2, we present results on Kinship and UMLS. Our method obtains better results than NeuralLP, DRUM, and RNNLogic on Kinship across all measures, and better MRR and Hits@1 values than these three codes on UMLS. This is true even though we generate relatively compact rules (see Table 6) and also use very simple rule generation heuristics. Therefore, for these datasets, even trivial rule generation heuristics suffice when used in conjunction with a nontrivial weight generation algorithm. The embedding-based methods are much better across all metrics. In Table 3, we present results on FB15k-237 and WN18RR. RNNLogic did not successfully terminate for FB15k-237 on our machine, and we take their published result (Qu et al., 2021). Our results for FB15k-237 are better than NeuralLP and DRUM but worse than RNNLogic. The best values for embedding-based methods are much better than for all rule-based methods. For WN18RR, our results are better than the other rule-based methods, even while taking significantly less computing time; see Table 5.

This better scaling behaviour allows us to tackle large datasets such as YAGO3-10, which we give results for in Table 4. We use column generation, generating 10 columns in each iteration, and iterate only 5 times for a total of 50 candidate rules per relation. To compute $\mathrm{neg}_k$, we sample 20% of the edges from $E_r$, and compute the number of invalid paths that start at the tails of these edges, and end at the heads of these edges. For YAGO3-10 and DB111K-174, our column generation approach becomes essential. We are simply unable to process a very large number of rules. The ability to generate a small number of rules, and then use the dual values to focus on "uncovered" facts (not implied by previous rules) and generate new rules covering these uncovered facts is very helpful.

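| Algorithm | YAGO3-10 MRR | YAGO3-10 H@1 | YAGO3-10 H@3 | YAGO3-10 H@10 | DB111K-174 MRR | DB111K-174 H@1 | DB111K-174 H@3 | DB111K-174 H@10 |
|---|---|---|---|---|---|---|---|---|
| ComplEx-N3 | 0.574 | 0.499 | 0.619 | 0.705 | 0.421 | 0.344 | 0.459 | 0.563 |
| TuckER | 0.265 | 0.184 | 0.290 | 0.426 | 0.345 | 0.247 | 0.397 | 0.529 |
| ConvE | 0.44 | 0.35 | 0.49 | 0.62 | - | - | - | - |
| RotatE | 0.495 | 0.402 | 0.550 | 0.670 | - | - | - | - |
| ATTH | 0.568 | 0.493 | 0.612 | 0.702 | - | - | - | - |
| LPRules | 0.449 | 0.367 | 0.501 | 0.684 | 0.363 | 0.312 | 0.390 | 0.460 |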
Table 4: Comparison of results on YAGO3-10 and DB111K-174 using random break metric. The results for ConvE are taken from Dettmers et al. (2018). The results for RotatE are from Sun et al. (2019). The results for ATTH are from Chami et al. (2020).

In Table 5, we give run times (in minutes) on the different datasets. The results in the top section of the table were obtained on the 60 core machine described above and the results in the bottom section of the table were obtained on a machine with 16 CPUs and 1 GPU. Our times include the time to evaluate the solution on the test set. For WN18RR, DRUM and NeuralLP take 400 minutes or more, and RNNLogic takes over 100 minutes, whereas we take 11 minutes and less than 2 minutes with LPRules. We note that we use maximum rule length of 6 for WN18RR, but we do not know the maximum path lengths used in the other codes.

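| Machine | Algorithm | Kinship | UMLS | WN18RR | FB15k-237 | YAGO3-10 | DB111K-174 |
|---|---|---|---|---|---|---|---|
| 60 cores | LPRules | 0.5 | 0.2 | 11.0 | 234.5 | 1648.4 | 152.4 |
| 60 cores | DRUM | 3.2 | 2.8 | 505.9 | 12053.3 | - | - |
| 60 cores | NeuralLP | 1.6 | 1.1 | 399.9 | - | - | - |
| 60 cores | RNNLogic | 108.8 | 133.4 | 104.0 | - | - | - |
| 16 CPUs + 1 GPU | LPRules | 2.7 | 0.4 | 11.7 | 267.3 | 1245.9 | 131.0 |
| 16 CPUs + 1 GPU | ComplEx-N3 | 2.6 | 1.7 | 195.9 | 238.6 | 2024.8 | 745.2 |
| 16 CPUs + 1 GPU | TuckER | 8.9 | 5.5 | 266.6 | 407.5 | 2894.7 | 6161.1 |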
Table 5: Wall clock run times in minutes when running in parallel on a 60 core machine for the top group of results and a machine with 16 CPUs and 1 GPU for the bottom results.

DRUM takes over 8 days for FB15K-237 (as does NeuralLP on a different machine), which is why we do not run these codes on YAGO3-10 and DB111K-174. Our code could be sped up further with fine-grained parallelism (see the Appendix). Finally, in Table 6, we give the MRR and the average number of rules in the final solution (column 'S'), confirming that we create compact rule-based predictors; NeuralLP does too, but we get better MRR values for similar levels of sparsity. For WN18RR and UMLS, we obtain state-of-the-art results with few rules. We could not extract rules from DRUM and RNNLogic.

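| Algorithm | Kinship MRR | Kinship S | UMLS MRR | UMLS S | FB15k-237 MRR | FB15k-237 S | WN18RR MRR | WN18RR S | YAGO3-10 MRR | YAGO3-10 S |
|---|---|---|---|---|---|---|---|---|---|---|
| NeuralLP | 0.652 | 10.2 | 0.750 | 14.2 | 0.222 | 8.3 | 0.381 | 14.6 | - | - |
| LPRules | 0.746 | 21.0 | 0.848 | 4.2 | 0.255 | 14.2 | 0.459 | 15.6 | 0.449 | 7.8 |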
Table 6: MRR and average number of rules selected (S)

5 Conclusion

Existing methods to obtain logic rules for knowledge graph completion can be fairly time-consuming. Our relatively simple linear programming formulation for selecting logical rules, together with the associated solution algorithm, returns state-of-the-art results for a number of standard KG datasets, even with sparse collections of rules, and does so much faster than existing methods.


  • I. Balažević, C. Allen, and T. M. Hospedales (2019) TuckER: tensor factorization for knowledge graph completion. In Empirical Methods in Natural Language Processing, Cited by: §2, §4.1.
  • C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. P. Savelsbergh, and P. H. Vance (1998) Branch-and-price: column generation for solving huge integer programs. Operations Research 46 (3), pp. 316–329. Cited by: §3.
  • M. Berrendorf, E. Faerman, L. Vermue, and V. Tresp (2021) On the ambiguity of rank-based evaluation of entity alignment or link prediction methods. arXiv:2002.06914. Cited by: item a., §A.1, §4.1.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, Cited by: §2, §3, §4.1.
  • I. Chami, A. Wolf, D. Juan, F. Sala, S. Ravi, and C. Re (2020) Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6901–6914. Cited by: §2, §4.1, Table 4.
  • Z. Cui, P. Kapanipathi, K. Talamadupula, T. Gao, and Q. Ji (2021) Type-augmented relation prediction in knowledge graphs. Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), pp. 7151–7159. Cited by: §4.
  • R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: §A.1, §2, §4.1.
  • S. Dash, O. Günlük, and D. Wei (2018) Boolean decision rules via column generation. In Advances in Neural Information Processing Systems, pp. 4655–4665. Cited by: §3.
  • A. Demiriz, K. P. Bennett, and J. Shawe-Taylor (2002) Linear programming boosting via column generation. Machine Learning 46, pp. 225–254. Cited by: §3.
  • W. Denham (1973) The detection of patterns in alyawarra nonverbal behavior. Ph.D. Thesis, University of Washington. Cited by: §4.
  • T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2d knowledge graph embeddings. In AAAI, Cited by: §2, §3, §4.1, Table 4, §4.
  • B. Ding, Q. Wang, B. Wang, and L. Guo (2018) Improving knowledge graph embedding using simple constraints. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 110–121. Cited by: §2.
  • J. Eckstein and N. Goldberg (2012) An improved branch-and-bound method for maximum monomial agreement. INFORMS Journal on Computing 24 (2), pp. 328–341. Cited by: §3.
  • J. Eckstein, A. Kagawa, and N. Goldberg (2019) REPR: rule-enhanced penalized regression. INFORMS Journal on Optimization 1 (2), pp. 143–163. Cited by: §3.
  • N. Goldberg (2012) Optimization for sparse and accurate classifiers. Ph.D. Thesis, Rutgers University, New Brunswick, NJ. Cited by: §3.
  • S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo (2018) Knowledge graph embedding with iterative guidance from soft rules. In AAAI, Cited by: §2.
  • J. Hao, M. Chen, W. Yu, Y. Sun, and W. Wang (2019) Universal representation learning of knowledge bases by jointly embedding instances and ontological concepts. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pp. 1709–1719. Cited by: §4.
  • IBM (2019) IBM CPLEX optimizer, version 12.10. Cited by: §A.5, §4.2.
  • S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu (2021) A survey on knowledge graphs: representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. Cited by: §1, §1.
  • C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda (2006) Learning systems of concepts with an infinite relational model. In AAAI, Cited by: §4.
  • S. Kok and P. Domingos (2007) Statistical predicate invention. In ICML, Cited by: §2, §4.
  • T. Lacroix, N. Usunier, and G. Obozinski (2018) Canonical tensor decomposition for knowledge base completion. In ICML, Cited by: §2, §4.1.
  • N. Lao and W. W. Cohen (2010) Relational retrieval using a combination of path-constrained random walks. Machine Learning 81, pp. 53–67. Cited by: §1, §2.
  • X. V. Lin, R. Socher, and C. Xiong (2018) Multi-hop knowledge graph reasoning with reward shaping. In Empirical Methods in Natural Language Processing, Cited by: §2, §2.
  • F. Mahdisoltani, J. Biega, and F. M. Suchanek (2015) YAGO3: a knowledge base from multilingual wikipedias. In CIDR, Cited by: §4.
  • A. T. McCray (2003) An upper level ontology for the biomedical domain. Comparative and Functional Genomics 4, pp. 80–84. Cited by: §4.
  • M. Nayyeri, C. Xu, J. Lehmann, and H. S. Yazdi (2019) LogicENN: a neural based knowledge graphs embedding model with logical rules. ArXiv abs/1908.07141. Cited by: §2.
  • M. Nayyeri, S. Vahdati, C. Aykul, and J. Lehmann (2021) Knowledge graph embeddings with projective transformations. In AAAI, Cited by: §2, §4.1, Table 3.
  • M. Qu, J. Chen, L. Xhonneux, Y. Bengio, and J. Tang (2021) RNNLogic: learning logic rules for reasoning on knowledge graphs. In ICLR, Cited by: item a., §A.6, §1, §1, §2, §2, §3, §4.1, §4.2, Table 3.
  • M. Qu and J. Tang (2019) Probabilistic logic neural networks for reasoning. In NeurIPS, Cited by: §1.
  • M. Richardson and P. Domingos (2006) Markov logic networks. Machine Learning 62, pp. 107–136. Cited by: §1.
  • T. Rocktäschel and S. Riedel (2017) End-to-end differentiable proving. In Advances in Neural Information Processing Systems, Cited by: §2.
  • D. Ruffinelli, S. Broscheit, and R. Gemulla (2020) You can teach an old dog new tricks! on training knowledge graph embeddings. In International Conference on Learning Representations, Cited by: item c..
  • A. Sadeghian, M. Armandpour, P. Ding, and D. Z. Wang (2019) DRUM: end-to-end differentiable rule mining on knowledge graphs. In NeurIPS, Cited by: §1, §1, §2, §4.1.
  • Y. Shen, J. Chen, P. Huang, Y. Guo, and J. Gao (2018) M-walk: learning to walk over graphs using monte carlo tree search. In Advances in Neural Information Processing Systems, Cited by: §2, §4.1.
  • Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR, Cited by: §2, §4.1, Table 4.
  • Z. Sun, S. Vashishth, S. Sanyal, P. Talukdar, and Y. Yang (2020) A re-evaluation of knowledge graph completion methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5516–5522. Cited by: item a., §A.1, §A.1, §3.
  • K. K. Teru and W. L. Hamilton (2020) Inductive relation prediction on knowledge graphs. In ICML, Cited by: §1.
  • K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pp. 57–66. Cited by: §4.
  • T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In ICML, Cited by: §2.
  • Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE TKDE 29, pp. 2724–2743. Cited by: §1.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: §2.
  • W. Xiong, T. Hoang, and W. Y. Wang (2017) Deeppath: a reinforcement learning method for knowledge graph reasoning. In EMNLP, Cited by: §2.
  • B. Yang, S. W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entities and relations for learning and inference in knowledge bases. In ICLR, Cited by: §2.
  • F. Yang, Z. Yang, and W. W. Cohen (2017) Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems 30, Cited by: §1, §1, §1, §2, §3, §3, §4.1, §4.1.

Appendix A

A.1 Additional comparisons

It is challenging to compare published codes for KGC as they differ in important ways.

  • Entity Ranking: There are different ways [Berrendorf et al., 2021, Sun et al., 2020] of ranking entities that form candidate solutions to queries and constructed from a test set fact , and the choice of the ranking method may have a significant effect on the final MRR, especially for rule-based methods, see Table 7. Assume entities receive a strictly better score than the true answer for the query and entities receive the same score. Some ranking methods listed in Berrendorf et al. [2021] are: receives a rank of (optimistic), i.e., the best among all equally-scored entities, or (pessimistic), or where is a random number between and (random [Sun et al., 2020]). It is also observed in Berrendorf et al. [2021] that numerous codes give a nondeterministic rank between and based on the position of the correct entity in a sorted order of equal score entities. Other options are a rank of (midpoint) [Qu et al., 2021] or random break, where the correct entity is compared against all entities and any ties in scores are broken randomly. We observe that random and random break are not the same: the probability of getting any rank between and is the same in random, whereas is more probable as a rank than or in random break. The choice of the ranking method also depends on whether one assumes a closed or open world hypothesis. The NTP and NTP-λ codes referred to in Table 8 use an optimistic ranking method, whereas MINERVA uses a nondeterministic ranking method.

  • Query construction: Some codes, such as MINERVA, only remove right entities from a fact in the test set to construct the query , and do not evaluate performance on the query .

  • Hyperparameters and Experimental Setup: It was shown recently [Ruffinelli et al., 2020] that some older, seemingly lower-quality codes could be made to perform better than more recent codes with appropriate choices of hyperparameters.
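The tie-handling schemes in the Entity Ranking item can be made concrete with a short sketch. The symbols are assumptions introduced here for illustration: `m` is the number of entities scoring strictly better than the true answer and `n` is the number of other entities tied with it (the true answer itself is not counted in `n`). Random break is modelled as an independent fair coin flip per tied entity, which is one reading of "ties are broken randomly" that agrees with the remark that middle ranks are more probable than extreme ones.

```python
import random


def tie_ranks(m: int, n: int, rng: random.Random) -> dict:
    """Rank of the true answer under several tie-handling schemes.

    Assumed convention: m entities score strictly higher, and n other
    entities have exactly the same score as the true answer.
    """
    return {
        # Best position among all equally scored entities.
        "optimistic": m + 1,
        # Worst position among all equally scored entities.
        "pessimistic": m + n + 1,
        # Average of the optimistic and pessimistic ranks.
        "midpoint": m + 1 + n / 2,
        # Random: every rank in [m + 1, m + n + 1] is equally likely.
        "random": m + 1 + rng.randint(0, n),
        # Random break (assumed model): each tie is lost with probability 1/2,
        # so the offset is binomial and concentrates around n / 2.
        "random_break": m + 1 + sum(rng.random() < 0.5 for _ in range(n)),
    }
```

Under this model the expected random-break rank equals the midpoint rank, which is consistent with the near-identical Midpoint and Random Break rows reported for WN18RR.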

Random ranking is proposed in Sun et al. [2020] as a suitable method for KGC, but the codes NeuralLP and DRUM provide random-break ranking (which is closer to midpoint ranking) instead of random ranking. We earlier compared our code with these codes and with RNNLogic using the random-break ranking method, and two-sided query construction during testing. In the next table, we show that our code returns very different numbers if we use optimistic ranking, but similar results if we use midpoint ranking.

| Ranking method | MRR | H@1 | H@3 | H@10 |
|---|---|---|---|---|
| Optimistic | 0.658 | 0.603 | 0.678 | 0.768 |
| Midpoint | 0.455 | 0.415 | 0.474 | 0.532 |
| Random Break | 0.459 | 0.422 | 0.477 | 0.532 |
Table 7: Results on WN18RR obtained with different ways of dealing with equal scores.

In Table 8, we give a comparison with results published in the MINERVA paper [Das et al., 2018] that are obtained with right-entity removal only in query construction. We modify our code to construct queries in a similar fashion, though we still use random-break ranking. The grouping of results by embedding-based methods, rule-based methods, and those obtained by our code is as before. MINERVA and ConvE use nondeterministic [Berrendorf et al., 2021] ranking, as they sort scores before ranking, but NTP does not. The NTP code refers to the ranking method in ComplEx and says that "we calculate the rank from only those corrupted triples that have a higher score", i.e., they use optimistic ranking. In our opinion, this accounts for the unusually high scores for UMLS obtained with NTP-λ. When we use optimistic ranking for UMLS, we obtain an MRR of 0.967. ComplEx also uses optimistic ranking. Though we cannot prove it, we suspect that some embedding-based methods (especially ConvE, RotatE, and TuckER; see Sun et al. [2020]) do not return much better MRR values when using optimistic ranking instead of random-break ranking. This can happen if few entities get equal scores. We cannot locate the ranking method used in DistMult.

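| Algorithm | Kinship MRR | Kinship H@1 | Kinship H@3 | Kinship H@10 | UMLS MRR | UMLS H@1 | UMLS H@3 | UMLS H@10 |
|---|---|---|---|---|---|---|---|---|
| ComplEx | 0.838 | 0.754 | 0.910 | 0.980 | 0.894 | 0.823 | 0.962 | 0.995 |
| ConvE | 0.797 | 0.697 | 0.886 | 0.974 | 0.933 | 0.894 | 0.964 | 0.992 |
| DistMult | 0.878 | 0.808 | 0.942 | 0.979 | 0.944 | 0.916 | 0.967 | 0.992 |
| NTP | 0.612 | 0.500 | 0.700 | 0.777 | 0.872 | 0.817 | 0.906 | 0.970 |
| NTP-λ | 0.793 | 0.759 | 0.798 | 0.878 | 0.912 | 0.843 | 0.983 | 1.000 |
| NeuralLP | 0.619 | 0.475 | 0.707 | 0.912 | 0.778 | 0.643 | 0.869 | 0.962 |
| MINERVA | 0.720 | 0.605 | 0.812 | 0.924 | 0.825 | 0.728 | 0.900 | 0.968 |
| LPRules | 0.776 | 0.682 | 0.836 | 0.966 | 0.887 | 0.841 | 0.924 | 0.971 |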
Table 8: Comparison with results on Kinship and UMLS taken from the MINERVA paper. Queries are constructed from test set facts by right entity removal. Our code uses random break ranking. NTP uses optimistic ranking.

| Algorithm | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@3 | FB15k-237 H@10 | WN18RR MRR | WN18RR H@1 | WN18RR H@3 | WN18RR H@10 |
|---|---|---|---|---|---|---|---|---|
| ComplEx | 0.394 | 0.303 | 0.434 | 0.572 | 0.415 | 0.382 | 0.433 | 0.480 |
| ConvE | 0.410 | 0.313 | 0.457 | 0.600 | 0.438 | 0.403 | 0.452 | 0.519 |
| DistMult | 0.370 | 0.275 | 0.417 | 0.568 | 0.433 | 0.410 | 0.441 | 0.475 |
| NeuralLP | 0.227 | 0.166 | 0.248 | 0.348 | 0.463 | 0.376 | 0.468 | 0.657 |
| Path-Baseline | 0.227 | 0.169 | 0.248 | 0.357 | 0.027 | 0.017 | 0.025 | 0.046 |
| MINERVA | 0.293 | 0.217 | 0.329 | 0.456 | 0.448 | 0.413 | 0.456 | 0.513 |
| M-Walk | 0.232 | 0.165 | 0.243 | - | 0.437 | 0.414 | 0.445 | - |
| LPRules | 0.350 | 0.261 | 0.383 | 0.533 | 0.486 | 0.443 | 0.511 | 0.571 |
Table 9: Comparison with results on FB15k-237 and WN18RR taken from the MINERVA paper. Our code uses random break ranking. The M-Walk results are taken from the associated paper.

In Table 9, we copy results for FB15K-237 and WN18RR – obtained by right-entity removal in query construction – from the MINERVA paper. We also copy results obtained with M-Walk (taken from the corresponding paper), as it uses the same query construction approach. We do not know the score ranking method used in M-Walk or in the NeuralLP runs from the MINERVA paper. We note that our code obtains the best value of MRR for WN18RR, and the best MRR value among all rule-based methods for FB15K-237. Thus, for the four datasets mentioned in this table and the previous one, our code returns state-of-the-art results among the rule-based methods compared here.

A.2 Proofs

We next present the proof of Theorem 1.


Let be an optimal solution of IPR. By definition, has components, and has components, all of which are binary, because of the form of the objective function. Let be the clause associated with rule . By definition, we have . Consider the function


Therefore, . As satisfies equation (6), we have

We can see that if and only if , and therefore

Therefore, either or . Furthermore, is a function for which fewest number of values are 1 or the highest number of values are 0, as form an optimal solution of IPR. In other words, "covers" the largest number of edges of (covering means ) among all possible functions that can be formed as a disjunction of rule clauses with complexity at most . For each such that , we have

for all as or 1 for any entities . If we take the facts in the training set as a test set, and use as a scoring function and use optimistic ranking of scores, then for each such that and for all entities , and therefore the rank of is 1 among all entities while scoring (denoted by ), and the rank of is 1 among all the entities while scoring (denoted by ). On the other hand and if . Therefore and . The MRR of the prediction function is


But is the optimal objective function value. Thus a lower value of yields a higher lower bound on the MRR computed via optimistic ranking. ∎

A.3 Analysis of YAGO3-10

We obtained an MRR of 0.449 with LPRules. We next analyze our performance and observe that it is primarily due to the performance on two relations, namely isAffiliatedTo and playsFor, which together account for 64.4% of the facts in the training set and 63.3% of the facts in the test set. In Table 10, we provide the rules generated by LPRules for these two relations. Here R_isAffiliatedTo is the reverse of the relation isAffiliatedTo (and denoted by isAffiliatedTo in the main document). The rule and weight columns together give the weighted combination of rules generated for a relation. The MRR column gives the MRR value that would be generated if the test set consisted only of facts associated with the relation in the "Relation" column, and the weighted combination of rules consisted only of the rules in the same line or above. Thus we can see that if we took only the first two rules for isAffiliatedTo, and the rules for playsFor, then the MRR would be at least 0.56 on the test set facts associated with these two relations. As these two relations account for 63.3% of the test set, just the four rules mentioned above would yield an MRR of at least 0.56 × 0.633 ≈ 0.354, as opposed to the MRR of 0.449 that we obtained.

The information in the table suggests a direct correlation between the relations isAffiliatedTo and playsFor. Indeed, we verify that for about 75% of the facts (x, isAffiliatedTo, y), we also have (x, playsFor, y) as a fact in the training set. Similarly, 87% of the playsFor facts are explained by the isAffiliatedTo relation in the training set.
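The overlap statistic quoted above can be computed with a few lines of code. The sketch below assumes the training set is available as an iterable of (head, relation, tail) string triples; the function name `overlap_fraction` is hypothetical and not part of the authors' code.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (x, relation, y)


def overlap_fraction(triples: Iterable[Triple], r1: str, r2: str) -> float:
    """Fraction of facts (x, r1, y) for which (x, r2, y) is also a fact."""
    pairs1 = set()
    pairs2 = set()
    for x, rel, y in triples:
        if rel == r1:
            pairs1.add((x, y))
        if rel == r2:
            pairs2.add((x, y))
    if not pairs1:
        return 0.0  # no facts for r1, so there is nothing to explain
    return len(pairs1 & pairs2) / len(pairs1)
```

Calling it once with `("isAffiliatedTo", "playsFor")` and once with the arguments swapped gives the two directions of the comparison, which need not be equal because the denominators differ.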

A natural question is the following: Is the second rule for isAffiliatedTo a "degenerate" rule, and does it simply reduce to isAffiliatedTo because the entity b is the same as entity a when we traverse a relational path from x to y in the training KG? To give an example that this is not the case, consider the following fact in the test set: (Pablo_Bonells, isAffiliatedTo, Club_Celaya). In the training data, the following three facts imply the previous fact by application of the second rule: (Pablo_Bonells, isAffiliatedTo, Club_León), (Salvador_Luis_Reyes, isAffiliatedTo, Club_León), and (Salvador_Luis_Reyes, isAffiliatedTo, Club_Celaya). The first rule is also applicable as the training data contains the fact (Pablo_Bonells, playsFor, Club_Celaya). We have similarly verified that the second rule for playsFor creates nontrivial relational paths, where the nodes are not repeated.

| Relation | Weight | Rule | MRR |
|---|---|---|---|
| isAffiliatedTo(x,y) | 1 | playsFor(x,y) | 0.463 |
| isAffiliatedTo(x,y) | 1 | isAffiliatedTo(x,a) R_isAffiliatedTo(a,b) isAffiliatedTo(b,y) | 0.582 |
| isAffiliatedTo(x,y) | 1 | graduatedFrom(x,a) R_graduatedFrom(a,b) isAffiliatedTo(b,y) | |
| isAffiliatedTo(x,y) | 1 | isPoliticianOf(x,a) R_isPoliticianOf(a,b) isAffiliatedTo(b,y) | |
| isAffiliatedTo(x,y) | 0.5 | livesIn(x,a) R_livesIn(a,b) isAffiliatedTo(b,y) | 0.585 |
| playsFor(x,y) | 1 | isAffiliatedTo(x,y) | 0.504 |
| playsFor(x,y) | 1 | playsFor(x,a) R_isAffiliatedTo(a,b) playsFor(b,y) | 0.561 |
Table 10: Rules generated by LPRules for two relations in YAGO3-10. The MRR values for a particular rule were calculated using only the rules in the same line or above.

A.4 Additional experimental details

Table 11 contains the list of values of the parameter given as input for each dataset. For larger datasets, this hyperparameter search is time-consuming, which is why we use fewer candidate values.

| Dataset | Candidate values |
|---|---|
| Kinship | 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06 |
| UMLS | 0.02, 0.03, 0.04, 0.05, 0.0055, 0.06, 0.07, 0.08, 0.09, 0.1 |
| WN18RR | 0.0025, 0.003, 0.0035, 0.004, 0.0045 |
| FB15k-237 | 0.005, 0.01, 0.025, 0.05, 0.1, 0.25 |
| YAGO3-10 | 0.005, 0.01, 0.03, 0.05, 0.07 |
| DB111K-174 | 0.005, 0.01, 0.03, 0.05, 0.07 |
Table 11: Values of the parameter for each dataset.

Removing values from the lists in Table 11 reduces the MRR only slightly, but using entirely different values can reduce it significantly. Indeed, the experiments in Table 12 show that the amount of weight given to negative sampling is very important. Furthermore, one cannot choose the same value for different datasets, and it is best to search through a list, as we do.

| Fixed value | Kinship | UMLS | WN18RR |
|---|---|---|---|
| 0.0001 | 0.640 | 0.560 | 0.453 |
| 0.001 | 0.686 | 0.758 | 0.461 |
| 0.01 | 0.728 | 0.799 | 0.444 |
| 0.1 | 0.667 | 0.830 | 0.385 |
Table 12: MRR obtained with fixed values of .

We use coarse-grained parallelism in our code. For example, FB15K-237 has 237 relations, and we run the rule learning problem for each relation on a different thread. As we only have 60 cores, multiple threads are assigned to the same core by the operating system.
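The one-task-per-relation decomposition described above can be illustrated as follows. This is a sketch of the task structure only: `learn_rules` is a hypothetical per-relation callable, and in CPython threads do not give true CPU parallelism for pure-Python work, unlike the threads in the authors' C++ code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, TypeVar

T = TypeVar("T")


def learn_all_relations(
    relations: Iterable[str],
    learn_rules: Callable[[str], T],
    max_workers: Optional[int] = None,
) -> Dict[str, T]:
    """Run the (placeholder) rule learner once per relation, one task each."""
    rels = list(relations)
    # Default: one thread per relation; the operating system then maps
    # threads to the available cores, as described in the text.
    workers = max_workers or max(1, len(rels))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(learn_rules, rels))
    return dict(zip(rels, results))
```

Because each relation is an independent task, the wall-clock time of such a scheme is bounded below by the slowest single relation, which is the effect the run-time discussion attributes to relations with very many facts.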

The run times increase significantly with increasing number of facts, and with increasing edge density in the knowledge graph. Recall that the rule learning linear program (LPR) that we solve for each relation has a number of constraints equal to , the number of edges labeled by relation in the knowledge graph. Given candidate rule in LPR, we need to compute for each edge in , where the th edge in is . Computing increases superlinearly with increasing average node degree and increasing path length, and so does (both calculations involve an operation similar to a BFS or DFS). See Table 13 for increasing run times on WN18RR with increasing path length. The cost of solving a linear program (LP) grows superlinearly with the number of constraints. The larger datasets (WN18RR, FB15K-237, and YAGO3-10) all have some relation which has many more associated facts/edges than the average number per relation. This leads to significant run times, both in setting up LPR for the relation and in solving LPR, in the case of FB15K-237 and YAGO3-10, and also to reduced benefits of parallelism. For example, for YAGO3-10, the relations isAffiliatedTo and playsFor have 373,783 and 321,024 associated facts, respectively, out of a total of roughly a million facts. Most relations are completed in a short amount of time, while these two relations run for a long time on a single thread each. A natural approach to reducing the run time would be to sample some of these facts while setting up LPR, but we do not do this in this work. For YAGO3-10, we do not compute exactly and use sampling to obtain an approximation, as described in the main document.

One can easily parallelize the hyperparameter search process in our code. Another operation that can be parallelized is the computation of the coefficients in LPR, but we have not done so. We thus believe there is scope for reducing our run times even further. We note that the operations we perform are not well-suited to run on GPUs.

In Table 13 we give our results for WN18RR when we use rules of length 4, 5 and 6, with the results for length 6 copied from Table 3 in the main document. The best MRR is obtained using path length 6, but this takes about 5 times as long as path length 4. We note that the reported results for RNNLogic were obtained using a maximum rule length of 5 for WN18RR.

| Length | Algorithm | MRR | H@1 | H@3 | H@10 | Time (mins) |
|---|---|---|---|---|---|---|
| 4 | LPRules | 0.449 | 0.414 | 0.465 | 0.518 | 2.3 |
| 5 | LPRules | 0.457 | 0.421 | 0.473 | 0.527 | 5.0 |
| 6 | LPRules | 0.459 | 0.422 | 0.477 | 0.532 | 11.0 |
Table 13: Results on WN18RR with maximum rule length=4,5,6 obtained using random break ranking.

A.5 Additional code details

For the experiments in this paper, in Rule Heuristic 1, we simply generate all length one and length two rules that create a relational path from the tail to head node for at least one edge associated with relation , while creating LPR for relation . The number of such rules is usually small (less than 50).
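A minimal sketch of Rule Heuristic 1 follows, assuming the knowledge graph is a list of (x, relation, y) triples read as directed edges from x to y, and that reverse relations are named with an `R_` prefix as in Table 10. The function names are hypothetical; the sketch enumerates every relation sequence of length one or two that connects the endpoints of at least one edge of the target relation without using that edge itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (x, relation, y): a directed edge x -> y


def build_adjacency(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency lists that include a reverse edge R_rel for every edge."""
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for x, rel, y in triples:
        adj[x].append((rel, y))
        adj[y].append(("R_" + rel, x))
    return adj


def short_rules(triples: Iterable[Triple], target: str) -> Set[Tuple[str, ...]]:
    """All length-1 and length-2 relation paths joining some target edge."""
    facts = list(triples)
    adj = build_adjacency(facts)
    rules: Set[Tuple[str, ...]] = set()
    for x, rel, y in facts:
        if rel != target:
            continue
        for q1, a in adj[x]:
            if a == y:
                if q1 != target:  # skip the edge (x, target, y) itself
                    rules.add((q1,))
            elif a != x:  # keep the path simple: no repeated nodes
                for q2, b in adj[a]:
                    if b == y:
                        rules.add((q1, q2))
    return rules
```

Each returned tuple is the body of a candidate rule, for example `("playsFor",)` for the target `isAffiliatedTo`.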

In Rule Heuristic 2, for every edge in (associated with relation ), we find a shortest path, using breadth-first search, from to in the knowledge graph that does not use the edge . However, when we perform column generation, we do not find a shortest path between and for every directed edge in . Instead, we only consider edges that are not "implied" by the currently chosen weighted combination of rules. Such edges are indicated by large dual values, as discussed earlier. A natural improvement to this algorithm would be to find rules which create relational paths between multiple pairs of tail and head nodes and which have large dual values.
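The shortest-path step of Rule Heuristic 2 can be sketched with a plain breadth-first search. The sketch assumes (x, relation, y) triples read as directed edges from x to y, `R_`-prefixed reverse relations as in Table 10, and a hypothetical `max_len` cap on path length. It handles a single edge; restricting the search to edges with large dual values during column generation is left to the caller.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (x, relation, y): a directed edge x -> y


def shortest_rule(
    triples: Iterable[Triple], edge: Triple, max_len: int = 6
) -> Optional[Tuple[str, ...]]:
    """Relation sequence of a shortest x-to-y path that avoids `edge`."""
    x, _, y = edge
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for h, r, t in triples:
        if (h, r, t) == edge:
            continue  # drop the edge itself, and with it its reverse
        adj[h].append((r, t))
        adj[t].append(("R_" + r, h))
    parent: Dict[str, Optional[Tuple[str, str]]] = {x: None}
    queue = deque([(x, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == y:
            break
        if depth == max_len:
            continue  # do not extend paths beyond the length cap
        for r, nxt in adj[node]:
            if nxt not in parent:  # first visit is along a shortest path
                parent[nxt] = (node, r)
                queue.append((nxt, depth + 1))
    if y not in parent:
        return None
    path: List[str] = []
    node = y
    while parent[node] is not None:
        prev, r = parent[node]
        path.append(r)
        node = prev
    return tuple(reversed(path))
```

Because breadth-first search visits each node at most once, the returned path never repeats a node, in line with the requirement that relational paths be simple.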

During the search for the best and values, we first set up LPR or LPR (for some , when we do column generation) for a fixed value of and . Subsequently, we do not add any more rules/columns, and instead change the values of and , and evaluate the resulting LP solution on a validation set by computing the MRR, and then choosing the combination which gives the best value of MRR on the validation set. During training, we obtain a weighted combination of rules for each relation separately, and then evaluate all these relations on the test set after training is complete.
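The validation-based selection described above reduces to computing the MRR for each hyperparameter combination and keeping the best one. In the sketch below, `validation_ranks` is a hypothetical callable that re-solves the LP for a given setting and returns the ranks of the correct entities on the validation set; only the metric and the selection step are shown.

```python
from typing import Callable, Hashable, Iterable, Sequence, Tuple


def mrr(ranks: Sequence[float]) -> float:
    """Mean reciprocal rank; ranks start at 1."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks: Sequence[float], k: int) -> float:
    """Fraction of queries whose correct entity is ranked k or better."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r <= k) / len(ranks)


def select_on_validation(
    grid: Iterable[Hashable],
    validation_ranks: Callable[[Hashable], Sequence[float]],
) -> Tuple[Hashable, float]:
    """Return the setting with the highest validation MRR and that MRR."""
    best_setting, best_score = None, float("-inf")
    for setting in grid:
        score = mrr(validation_ranks(setting))
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score
```

The candidate rule list is fixed before this loop starts, so each evaluation needs only an LP re-solve and a pass over the validation set.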

All relational paths that we create (either in shortest-path calculations or during evaluation on the test set) are simple, i.e., they do not repeat nodes.

All our code is written in C++. The LP solver we use is IBM CPLEX [IBM, 2019], which is a commercial MIP solver (though available for free for academic use). Any high-quality LP solver can be used instead of CPLEX, though the interface functions in our code that contain CPLEX-specific function calls would need to be changed.

A.6 Solution sparsity

One important feature of rule-based methods is the interpretability of rules. Link predictors with a very large number of rules are harder to understand than those with few rules. In the main document, we gave the average number of rules per relation in our solutions as compared to the NeuralLP solution. In our code, we vary the upper bound on the complexity of the chosen rules for a relation up to a certain number and use the validation set to choose the best complexity bound within the range of allowed bounds. However, we do not control the final complexity of the solution beyond the upper bound. Varying the upper bound allows us to generate solutions with fewer rules. RNNLogic has a parameter which allows one to select the number of rules per relation used in testing; see also Figure 2a in Qu et al. [2021], where the MRR is plotted against the number of selected rules. In Table 14 we give the average number of rules selected for Kinship by NeuralLP, RNNLogic and LPRules and the corresponding values of MRR. These results show that LPRules provides higher MRR values for Kinship than the other two codes for a similar number of selected rules.

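| Algorithm | MRR | S |
|---|---|---|
| NeuralLP | 0.652 | 10.2 |
| RNNLogic | 0.611 | 20.0 |
| RNNLogic | 0.624 | 100.0 |
| RNNLogic | 0.687 | 200.0 |
| LPRules | 0.739 | 11.6 |
| LPRules | 0.742 | 17.4 |
| LPRules | 0.746 | 21.0 |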
Table 14: MRR and average number of rules selected (S) for Kinship.