Neural Theorem Provers Do Not Learn Rules Without Exploration

06/17/2019 ∙ by Michiel de Jong, et al. ∙ Google ∙ University of Southern California

Neural symbolic processing aims to combine the generalization of logical learning approaches and the performance of neural networks. The Neural Theorem Proving (NTP) model by Rocktäschel et al. (2017) learns embeddings for concepts and performs logical unification. While NTP is promising and effective in predicting facts accurately, little is known about how well it can extract true relationships among data. To this end, we create synthetic logical datasets with injected relationships, which can be generated on-the-fly, to test neural-based relation learning algorithms including NTP. We show that it has difficulty recovering relationships in all but the simplest settings. Critical analysis and diagnostic experiments suggest that the optimization algorithm suffers from poor local minima due to its greedy winner-takes-all strategy in identifying the most informative structure (proof path) to pursue. We alter the NTP algorithm to increase exploration, which sharply improves performance. We argue and demonstrate that it is insightful to benchmark with synthetic data with ground-truth relationships, for both evaluating models and revealing algorithmic issues.




1 Introduction

Neural networks have advanced the state of the art on a variety of prediction tasks. However, they are also data-hungry, transfer poorly to data generated from different distributions [1, 10], and their decision-making is difficult to interpret [3]. On the other hand, learning logical rules through Inductive Logic Programming [13, 14] has good generalization properties, but does not cope well with noisy data or fuzzy relationships [4], and is difficult to scale beyond toy datasets because of the large search space [22]. Recently, there have been renewed efforts to combine the two paradigms by applying deep learning methods to classical symbolic AI methods.

Several differentiable logic learning approaches have been proposed [15, 6, 21] that are better equipped to deal with noisy data, and employ gradient descent rather than discrete search to cope with a large search space. Rocktäschel and Riedel [15] propose the Neural Theorem Prover (NTP) to fuse subsymbolic methods and logical rule learning by endowing constants and predicates with vector embeddings. We view this as a promising direction, as the embedding approach lets the model learn the semantics of entities and concepts concurrently with the relationships between those concepts, as well as allowing for nuanced and ambiguous rules. However, as with other differentiable logic models, this model currently only works on limited toy datasets. In order to make progress we need to challenge the model and understand its limitations as well as how those limitations can be addressed.

NTP and its variants have attracted considerable interest. Nonetheless, one of its most important motivations, that learning logical rules improves generalization, has not been critically analyzed. In particular, how well does NTP learn rules? Like other models, NTP was primarily evaluated on its fact prediction accuracy rather than its rule learning performance, because ground-truth rules in real data are hard to identify.

To address this gap, we create synthetic logical datasets with injected relationships (as ground-truths) and use them to measure directly rule learning performance. While we focus on NTP in this paper, the datasets can be used to assess other learning algorithms.

Our findings on NTP are surprising: NTP does not learn rules well, except when the relationship is very simple, even though it can achieve high fact prediction accuracy. Diagnostic experiments suggest that learning suffers from poor optimization. In particular, a wrong proof for a true fact can start out with a high score due to initialization, and the optimization will then continue to greedily increase the score of that wrong proof. This winner-takes-all credit assignment leads to systematically undesirable local minima, and thus to incorrect relationships being learned.

Fortunately, this weakness can be easily remedied. We take inspiration from beam search, a common technique in speech and language processing: we retain a subset of proofs even if they are scored low initially. This allows optimization to explore alternative proof paths and eventually leads to a very large increase in rule learning performance.

The main contribution of this work is to single out the need to measure rule learning performance in addition to fact prediction accuracy. We believe it is imperative to have challenging benchmark datasets on which such performance can be measured under different scenarios. To this end, we have designed a recipe to generate multiple datasets with varying degrees of complexity (in terms of underlying sets of rules). We analyze the root cause of NTP failing to achieve high performance on rule learning and propose a simple fix to optimization, thus demonstrating the utility of applying those datasets to advancing research in neural subsymbolic processing.

2 Related work

In a broad sense, our work aims to learn relationships between logical predicates. There exists a very large body of literature that shares this objective. Inductive Logic Programming (ILP) employs search methods to find logical rules that best explain the data [13, 14, 4, 22]. Statistical relational learning (SRL) also looks to learn relationships between logical predicates, but generally employs statistical methods to learn more general probabilistic relationships, not necessarily in the form of logical rules [9, 7, 18]. Some of the more recent work in this vein uses methods from or inspired by deep learning. Trouillon et al. [19] assign an embedding to each logical constant and predicate, and predict the presence of a predicate relationship between constants by the triplet product of the embeddings. Graph neural networks [20] operate over a graph structure, which is inherently a relational setting. Dong et al. [5] train a deep neural network to predict a probability vector of output predicates from a probability vector of input predicates, naturally extending SRL to deep learning.

Differentiable logic learners employ some form of continuous relaxation of the logical reasoning process so that its parameters can be trained through gradient descent [6, 17, 11, 15]. One family of models generates logical clauses and associates predicates with a trained attention vector over these clauses [6, 21]. In contrast, neural theorem proving (NTP), which is the focus of this paper, endows predicates, constants and rules with embeddings that are trained through gradient descent, and receive meaning by their position in the embedding space [15, 12, 2]. The embedding approach allows the model to learn semantics of entities and concepts, in a similar way to word embeddings in NLP [12].

We contribute to this line of work by systematically generating synthetic data to directly measure rule-learning performance. Existing work has either used very small toy examples [6] or real data without known ground-truth relationships. Although the problem of local minima when learning discrete structures is well known, to our knowledge this is the first work to closely study how this effect impacts performance and to show that increased exploration can address the problem in the differentiable logic setting.

3 Neural-based inductive logic programming

3.1 Problem statement

NTP operates on a subset of first-order logic. Constants represent entities, such as individuals or objects. Predicates are boolean functions of sets of constants, which represent properties of or relationships between entities. For example, GrandparentOf is a predicate, with GrandparentOf(a, b) taking value 1 if a is the grandparent of b and 0 otherwise. The number of arguments in a predicate is called its order. Logical datasets consist of facts, which are statements that a predicate holds for a particular set of constants, for example, GrandparentOf(Harry, Emily). We also have variables and rules. Variables are symbols that may represent any constant. We consider rules of the form $H(\mathbf{X}_0) \leftarrow B_1(\mathbf{X}_1) \wedge \dots \wedge B_R(\mathbf{X}_R)$, where the $\mathbf{X}_i$ are sets of variables (possibly overlapping). Collectively, all $B_i(\mathbf{X}_i)$ are called the body and $H(\mathbf{X}_0)$ is called the head of the rule. Such a rule implies that, for any assignment of constants to the variables, if all the predicates in the body hold, then the head also holds. $R$ is called the size of the rule.

We consider two types of learning settings. In the fact prediction task [15], we are interested in predicting the truth value of a set of test facts after learning from the true facts in the training data.

The other type of learning focuses on relation learning, where we are interested in identifying rules from the training dataset. Once identified, we can apply these rules to predict the truth of unknown test facts. Note that this type of learning may be more challenging, but potentially much more beneficial in many application settings, for instance, learning with a small set of training instances.

3.2 Neural Theorem Proving

The basic idea in NTP is to represent symbols such as predicates and constants from the data as vector embeddings. Rules are represented by tuples of predicate embeddings. Inference is the process of applying facts and rules to derive conclusions. Since symbols are all points in the embedding space, “applying” means computing minimum distances among them. Learning happens by placing the embeddings in positions such that correct chains of facts and rules that prove valid facts have minimum distance. A more detailed description can be found in [15].

Inference Given a set of facts and rules, logic programming infers the truth value of a goal fact by attempting to prove that the goal fact is derivable from other facts and rules. NTP is loosely based on using unification in a backward chaining proving technique [16].

In standard logical inference, unification compares logical statements and makes any necessary variable substitutions to make those statements identical. For example, in order to unify GrandparentOf(Harry, Emily) and GrandparentOf(Harry, X), we make the substitution X = Emily. On the other hand, GrandparentOf(Alex, Emily) and GrandparentOf(Harry, X) cannot be unified as the constant Alex is not the same as Harry. GrandparentOf(Harry, Emily) and ParentOf(Harry, X) cannot be unified either as the predicates do not match.
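The symbolic unification described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: atoms are written as `(predicate, arguments)` tuples, and the convention that a single uppercase letter such as `X` denotes a variable is an assumption made only for this example.

```python
def is_variable(term):
    # Assumed convention for this sketch: a single uppercase letter is a variable.
    return isinstance(term, str) and len(term) == 1 and term.isupper()

def unify(atom_a, atom_b):
    """Return a substitution that makes two atoms identical, or None if none exists."""
    pred_a, args_a = atom_a
    pred_b, args_b = atom_b
    # Predicates (and their order) must match exactly in standard unification.
    if pred_a != pred_b or len(args_a) != len(args_b):
        return None
    subst = {}
    for a, b in zip(args_a, args_b):
        # Apply bindings made so far before comparing the two terms.
        a = subst.get(a, a)
        b = subst.get(b, b)
        if a == b:
            continue
        if is_variable(a):
            subst[a] = b
        elif is_variable(b):
            subst[b] = a
        else:
            return None  # two different constants cannot be unified
    return subst

print(unify(("GrandparentOf", ("Harry", "Emily")), ("GrandparentOf", ("Harry", "X"))))
print(unify(("GrandparentOf", ("Alex", "Emily")), ("GrandparentOf", ("Harry", "X"))))
print(unify(("GrandparentOf", ("Harry", "Emily")), ("ParentOf", ("Harry", "X"))))
```

The three calls mirror the three cases in the text: a successful substitution, a constant mismatch, and a predicate mismatch.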

Standard backward chaining attempts to unify the goal fact with training facts as well as the head of rules. If the goal fact unifies with any training fact, the proof has been completed. Otherwise, if the fact matches a rule head, the rule body predicates are added to a queue to be proved. This process continues recursively until all necessary facts in the queue are successfully proved, in which case the goal fact has been proved, or until all facts and rules have been tried without finding a proof.
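As a rough illustration of backward chaining, the sketch below handles only unary predicates and rules whose head and body share a single variable. It uses recursion with a depth limit instead of an explicit queue; the data layout and the `depth` parameter are assumptions for this example rather than details from the paper.

```python
def prove(goal, facts, rules, depth=3):
    """Backward chaining for unary predicates.

    goal:  (predicate, constant)
    facts: set of (predicate, constant) pairs known to be true
    rules: list of (head_predicate, [body_predicates]) over one shared variable
    """
    # A goal that matches a known fact is proved immediately.
    if goal in facts:
        return True
    if depth == 0:
        return False  # give up rather than recurse forever on cyclic rules
    predicate, constant = goal
    for head, body in rules:
        # If the goal matches a rule head, every body predicate becomes a subgoal.
        if head == predicate and all(
            prove((b, constant), facts, rules, depth - 1) for b in body
        ):
            return True
    return False

facts = {("HasBrother", "Emily"), ("HasChild", "Alex")}
rules = [("HasSibling", ["HasBrother"])]
print(prove(("HasSibling", "Emily"), facts, rules))
print(prove(("HasSibling", "Alex"), facts, rules))
```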

For NTP, unification and backward chaining work somewhat differently. A fact can be unified with another fact or rule head even if the predicate or constants in the fact are different, as long as the predicate has the same order. If predicates or constants differ, the unification is given a score that is a function of the distance between the embeddings of the symbols being unified. The score of a proof is a function of the embedding distance of the “worst” unification in the proof. Let $U$ be the set of unifications in the proof, and $u_1$ and $u_2$ the logical symbols in unification $u \in U$, with embeddings $\theta_{u_1}$ and $\theta_{u_2}$; then the proof score is

$$S = \min_{u \in U} \exp\left(-\lVert \theta_{u_1} - \theta_{u_2} \rVert_2\right) \qquad (1)$$

An example of a simple proof of HasSibling(Emily), given that we know HasBrother(Emily), would be to directly unify HasSibling(Emily) and HasBrother(Emily). The score of this proof would be $\exp\left(-\lVert \theta_{\text{HasSibling}} - \theta_{\text{HasBrother}} \rVert_2\right)$.

In this definition of unification, there can be many successful proofs. We calculate the scores of all possible proofs, and define the score of a fact to be equal to the score of the best proof of that fact. The learning process involves raising the proof score for known facts as much as possible.
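The scoring scheme above (worst unification per proof, best proof per fact) can be sketched as follows. The kernel `exp(-distance)` is an assumption of this sketch, chosen because it yields scores between 0 and 1; the embedding values and the names `rule_head` and `rule_body` are invented for illustration.

```python
import math

def unification_score(emb_a, emb_b):
    # Assumed kernel: exp(-Euclidean distance), which lies in (0, 1].
    return math.exp(-math.dist(emb_a, emb_b))

def proof_score(unifications, emb):
    # A proof is only as good as its worst (lowest-scoring) unification.
    return min(unification_score(emb[a], emb[b]) for a, b in unifications)

def fact_score(proofs, emb):
    # A fact is scored by its best proof.
    return max(proof_score(proof, emb) for proof in proofs)

# Hypothetical 2-d embeddings for two data predicates and one rule's predicates.
emb = {
    "HasSibling": [0.0, 0.0],
    "HasBrother": [0.0, 1.0],
    "rule_head": [0.0, 0.1],
    "rule_body": [0.0, 0.9],
}
direct_proof = [("HasSibling", "HasBrother")]
rule_proof = [("HasSibling", "rule_head"), ("HasBrother", "rule_body")]

print(proof_score(direct_proof, emb))
print(proof_score(rule_proof, emb))
print(fact_score([direct_proof, rule_proof], emb))
```

With these particular embeddings the rule proof scores higher than the direct unification, so it determines the score of the fact.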

Learning Rules are instantiated from rule templates, which fix the number of body predicates and the pattern of variables. Rules are created by instantiating an embedding for each predicate in the template, that is, one for the head and one for each body predicate. Unlike the data predicates, these rule predicates do not have any intrinsic meaning, but take on a meaning during training by being placed close in the embedding space to data predicates. All symbols’ embeddings are randomly initialized.

Models are trained by attempting to prove logical facts. The goal during training is to promote rules that successfully prove true facts (facts present in the training set) and discourage rules that successfully prove false facts (generated by corrupting facts in the training set). The NTP algorithm assigns a score to each fact equal to the score of the highest scoring proof of that fact. Let $\theta$ be the set of all embeddings. Given a truth value $y \in \{0, 1\}$ and a fact score $s$, the loss is given by

$$\mathcal{L}(\theta) = -y \log s - (1 - y) \log(1 - s) \qquad (2)$$

Note that the score of the fact, while bounded between 0 and 1, should not be interpreted as a normalized probability. Nonetheless, the loss is differentiable w.r.t. $\theta$. In a single training step, the two embeddings in the worst unification of the best proof of the training fact are updated. See below for an illustrative example:

Example I: Learning procedure Consider a task that requires learning the relationship HasSibling(X) ← HasBrother(X). A rule with one head predicate and one body predicate is instantiated from a corresponding rule template. Embeddings are initialized for HasSibling, HasBrother, the two rule predicates, as well as any constants and other predicates in the task.
The first training data point is HasSibling(Emily). HasBrother(Emily) is also in the training dataset. The best proof for HasSibling(Emily) unifies HasBrother with the rule's body predicate and HasSibling with the rule's head predicate, and applies the rule to prove HasSibling(Emily) from HasBrother(Emily). The rule's head predicate and HasSibling were the furthest apart, and since the proof has successfully led to a true fact the next gradient step brings them closer together. Alongside HasSibling(Emily), we also train on the corruption HasSibling(Alex), which is not a fact in the training data. HasChild(Alex) is also in the training data, and the best proof for HasSibling(Alex) ends up unifying (HasSibling, HasChild). Since this proof leads to an untrue fact, the next gradient step places HasSibling and HasChild further apart.
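The update in Example I can be mimicked with a small hand-written gradient step. This is a sketch under several assumptions that are not taken from the paper: the unification score is `exp(-distance)`, the loss is the negative log-likelihood of the fact score, plain gradient descent replaces the Adam optimizer, and the embeddings and the names `rule_head` and `rule_body` are invented.

```python
import math

def unification_score(a, b):
    # Assumed kernel: exp(-Euclidean distance), bounded in (0, 1].
    return math.exp(-math.dist(a, b))

def nll_loss(score, truth):
    # Negative log-likelihood of a fact score for truth value 1 or 0.
    eps = 1e-12
    score = min(max(score, eps), 1.0 - eps)
    return -(truth * math.log(score) + (1 - truth) * math.log(1.0 - score))

def training_step(proofs, emb, truth, lr=0.1):
    """One plain gradient step on the worst unification of the best proof."""
    def proof_score(proof):
        return min(unification_score(emb[a], emb[b]) for a, b in proof)
    best = max(proofs, key=proof_score)
    a, b = min(best, key=lambda u: unification_score(emb[u[0]], emb[u[1]]))
    d = math.dist(emb[a], emb[b])
    if d == 0.0:
        return a, b  # identical embeddings: no direction to move in
    s = math.exp(-d)
    # d(loss)/d(distance): +1 for a true fact, -s / (1 - s) for a corruption.
    dl_dd = 1.0 if truth == 1 else -s / (1.0 - s)
    direction = [(x - y) / d for x, y in zip(emb[a], emb[b])]
    emb[a] = [x - lr * dl_dd * g for x, g in zip(emb[a], direction)]
    emb[b] = [y + lr * dl_dd * g for y, g in zip(emb[b], direction)]
    return a, b

emb = {"HasSibling": [0.0, 0.0], "rule_head": [0.0, 0.5],
       "HasBrother": [1.0, 0.0], "rule_body": [1.0, 0.2],
       "HasChild": [3.0, 0.0]}
rule_proof = [("HasSibling", "rule_head"), ("HasBrother", "rule_body")]
direct_proof = [("HasSibling", "HasBrother")]
print(training_step([rule_proof, direct_proof], emb, truth=1))  # pulls the pair together
print(training_step([[("HasSibling", "HasChild")]], emb, truth=0))  # pushes the pair apart
```

Only the two embeddings in the selected unification move; every other proof and embedding is left untouched, which is the behaviour the later diagnosis focuses on.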

4 Designing synthetic datasets for evaluating NTP

4.1 A Recipe for data generation

We use the following recipe to procedurally generate datasets. Our main design consideration is to finely control the properties of our data, so that it is possible to investigate the effect of different factors in isolation. First, we choose the size of the relationships and the order of the predicates to be used. Next, we select the number of constants and the number of predicates. Each predicate has a base truth probability which is identical for all inputs. A chosen number of relationships between predicates are then injected from a relationship template. In those (ground-truth) relationships, for a given input, if all the predicates in the body are true, then the predicate in the head has an increased probability of being true.

Using this generic recipe, we flexibly generate datasets on-the-fly for each experiment. Table 1 outlines the parameters of the data generation process used for our main experiments, depending on the body of the relationship. Experiments with binary predicates use fewer constants and a lower base probability to reduce computational cost, given that binary facts scale quadratically with constants.

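| Body size | Body order | Constants | Predicates | Base probability | Relationship probability | Relationships |
|---|---|---|---|---|---|---|
| 1 | unary | 200 | 5 | 0.5 | 1.0 | 1 |
| 1 | binary | 60 | 5 | 0.25 | 1.0 | 1 |
| 2 | unary | 400 | 5 | 0.5 | 1.0 | 1 |
| 2 | binary | 60 | 5 | 0.25 | 1.0 | 1 |
| 3 | unary | 800 | 5 | 0.5 | 1.0 | 1 |

Table 1: Various configurations for generating synthetic datasets on-the-fly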
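The generation recipe can be sketched for the unary case as follows. This is one possible reading of the recipe, not the authors' generator: the function name, the predicate and constant naming scheme, and the choice to draw the head as "base truth OR relationship fires" are all assumptions of this sketch.

```python
import random

def generate_dataset(n_constants, n_predicates, base_prob, rel_prob, body_size, seed=0):
    """Generate unary facts with one injected relationship head <- body_1, ..., body_R."""
    if n_predicates <= body_size:
        raise ValueError("need at least body_size + 1 predicates")
    rng = random.Random(seed)
    predicates = [f"p{i}" for i in range(n_predicates)]
    head, body = predicates[0], predicates[1:1 + body_size]
    facts = set()
    for c in range(n_constants):
        constant = f"c{c}"
        # Every predicate is first drawn independently with the base probability.
        truth = {p: rng.random() < base_prob for p in predicates}
        # If the whole body holds, the head gets an extra chance of being true.
        if all(truth[b] for b in body) and rng.random() < rel_prob:
            truth[head] = True
        facts.update((p, constant) for p in predicates if truth[p])
    return facts, (head, body)

facts, relationship = generate_dataset(
    n_constants=200, n_predicates=5, base_prob=0.5, rel_prob=1.0, body_size=1)
print(relationship, len(facts))
```

The call uses the parameter values from the first row of Table 1; with a relationship probability of 1.0 the head holds for every constant whose body predicates all hold.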

4.2 Evaluation

The NTP algorithm learns embeddings for rule instantiations. Each predicate embedding of an instantiation is decoded to the data predicate that is closest to it in the embedding space. The overall rule decoding score is equal to the score of the worst predicate decoding, calculated as in Eqn 1.
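A minimal sketch of this decoding step is shown below. It assumes Euclidean distance for "closest" and `exp(-distance)` as the decoding score; both are assumptions of this sketch, and the embeddings are invented.

```python
import math

def decode_rule(rule_embeddings, data_embeddings):
    """Decode each rule predicate to its nearest data predicate.

    Returns the decoded predicate names and the overall decoding score,
    which is the score of the worst (most distant) predicate decoding.
    """
    decoded, scores = [], []
    for rule_emb in rule_embeddings:
        nearest = min(data_embeddings,
                      key=lambda name: math.dist(rule_emb, data_embeddings[name]))
        decoded.append(nearest)
        # Assumed score: exp(-Euclidean distance) to the decoded predicate.
        scores.append(math.exp(-math.dist(rule_emb, data_embeddings[nearest])))
    return decoded, min(scores)

data_embeddings = {"HasSibling": [0.0, 0.0], "HasBrother": [1.0, 0.0], "HasChild": [0.0, 1.0]}
rule_embeddings = [[0.1, 0.0], [0.8, 0.0]]  # head first, then the single body predicate
print(decode_rule(rule_embeddings, data_embeddings))
```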

Relationship learning performance is measured by recall and precision, averaged over a number of runs. Recall for a single run is defined as the proportion of injected relationships for which at least one rule decodes to that relationship. We then take the mean over all runs. For the precision measure, we construct a list of all decoding scores, and a corresponding list of “gold values” that equal 1 if the corresponding rule decoding matches an injected relationship and 0 otherwise. The precision measure (PR-AUC) is then defined as the area under the precision-recall curve for these lists.
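The two rule-learning measures can be sketched as below. Average precision is used here as one common estimator of the area under the precision-recall curve; whether the paper uses this estimator or another interpolation is not stated, so treat that choice, and the tuple encoding of rules, as assumptions.

```python
def relationship_recall(decoded_rules, injected_relationships):
    # Share of injected relationships recovered by at least one decoded rule.
    found = sum(1 for rel in injected_relationships if rel in decoded_rules)
    return found / len(injected_relationships)

def average_precision(scores, gold):
    """Average precision over rule decodings ranked by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if gold[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each correct decoding
    return total / hits if hits else 0.0

injected = [("HasSibling", ("HasBrother",))]
decoded = [("HasSibling", ("HasBrother",)), ("HasChild", ("HasBrother",))]
print(relationship_recall(decoded, injected))
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1]))
```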

Fact prediction accuracy is normally evaluated on a randomly held-out test set. For NTP, fact evaluation is complicated by the fact that the algorithm relies on constant embeddings that are learned during training. Moreover, since predicate truth values not involved in a relationship are generated i.i.d., the model cannot learn to predict these, and performance on these facts is not very informative. We instead split off a small proportion of active facts, defined as facts which could be predicted to be true if the relationship is learned successfully.

We corrupt each test fact by changing the constants in all possible ways such that the result is not present in the training data, and apply the model to the facts and corruptions. The first evaluation measure is the mean reciprocal rank (MRR) of the true fact relative to corrupted facts. This measure depends on the number of corruptions, which in turn depends on the size of the dataset, making it hard to compare between tasks. Therefore, we also compute a size-invariant measure by duplicating the score of the true fact once for each corrupted fact, and calculating the area under the ROC curve, with target 1 for true facts and 0 for corruptions.
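Both fact-prediction measures can be sketched directly from this description. The input layout (a list of `(true_score, corruption_scores)` pairs) and the pairwise computation of the ROC-AUC are assumptions of this sketch; the scores are invented.

```python
def mean_reciprocal_rank(cases):
    """cases: list of (true_score, [scores of the corruptions of that fact])."""
    reciprocal_ranks = []
    for true_score, corrupted in cases:
        # Rank 1 means the true fact outscored every corruption.
        rank = 1 + sum(1 for s in corrupted if s > true_score)
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

def balanced_roc_auc(cases):
    """ROC-AUC after duplicating each true score once per corruption."""
    positives, negatives = [], []
    for true_score, corrupted in cases:
        positives.extend([true_score] * len(corrupted))
        negatives.extend(corrupted)
    # AUC as the probability that a positive outscores a negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

cases = [(0.9, [0.1, 0.95]), (0.8, [0.2, 0.3, 0.4])]
print(mean_reciprocal_rank(cases))
print(balanced_roc_auc(cases))
```

The pairwise loop is quadratic in the number of scores, which is fine for illustration but would need a rank-based computation for large test sets.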

5 Critical analysis of NTP

In this section, we perform a critical analysis of the NTP model with the synthetic datasets we have created. Prior work has shown the method attains good test accuracy in predicting truth values of facts [15]. We analyze the model’s ability to identify rules under different conditions, revealing that the base NTP model has poor rule learning performance, and investigate why this is the case.

5.1 Additional experiment details

During training, each batch contains 10 true facts from the training set. For each true fact, there is also one corrupted fact per predicate argument (so 1 for unary predicates, 2 for binary predicates), which is generated by randomly selecting a constant for which the predicate does not hold in the training set. For each of the 50 runs, 3 (randomly initialized) rules are instantiated from the correct template. The hope is that at least one of them will be driven close to the correct relationship. The model is trained for 50 epochs using the Adam optimizer [8], with gradient clipping of (-5, 5) and exponential learning rate decay.

We adhere to the evaluation protocol set in the previous section. None of the parameter choices qualitatively affect the results in this paper (the supplemental material presents detailed experiment results). The code can be found at

5.2 Ablation studies

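| Body size | Body order | Constants | Total facts | Active facts | Rule recall | Rule PR-AUC | Fact ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | Unary | 50 | 137 | 23 | 0.42 | 0.51 | 0.80 |
| 1 | Unary | 100 | 274 | 47 | 0.46 | 0.66 | 0.88 |
| 1 | Unary | 200 | 546 | 92 | 0.6 | 0.76 | 0.92 |
| 1 | Unary | 800 | 2194 | 388 | 0.46 | 0.68 | 0.96 |
| 2 | Unary | 50 | 133 | 11 | 0.0 | 0.31 | 0.71 |
| 2 | Unary | 100 | 266 | 21 | 0.0 | 0.36 | 0.75 |
| 2 | Unary | 200 | 532 | 43 | 0.02 | 0.37 | 0.78 |
| 2 | Unary | 800 | 2139 | 185 | 0.02 | 0.4 | 0.81 |

Table 2: NTP performance by number of constants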

How much data does NTP need?

One of the appealing properties of rule-learning algorithms is that they tend to be data efficient. Can NTP achieve this goal? Table 2 shows relationship learning and fact prediction accuracy as a function of the number of constants in the data. It also shows the resulting total facts and active facts, the latter of which constitute the actual signal in the data. We restrict ourselves to unary predicates for this experiment, as scaling binary predicates in this way becomes computationally expensive.

When NTP learns anything at all, it seems to be able to do so from very little data. Note that, while performance increases modestly with data for body size one, performance stays low at body size two. We discuss the likely cause of this poor performance in the next section.

| Body size | Body order | Predicates | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|
| 1 | unary | 5 | 0.48 | 0.62 | 0.27 | 0.91 |
| 1 | binary | 5 | 0.58 | 0.74 | 0.57 | 0.98 |
| 2 | unary | 5 | 0.04 | 0.37 | 0.04 | 0.81 |
| 2 | binary | 5 | 0.02 | 0.4 | 0.15 | 0.89 |
| 3 | unary | 5 | 0.02 | 0.4 | 0.02 | 0.81 |
| 1 | unary | 3 | 0.58 | 0.76 | 0.29 | 0.95 |
| 1 | unary | 10 | 0.34 | 0.58 | 0.23 | 0.87 |

Table 3: NTP performance by relationship type and the number of predicates

Can NTP learn complex relationships?

An important desideratum for relation learning is the ability to scale to complex relationships. The top half of Table 3 shows relationship learning and fact prediction accuracy by NTP under different sizes and orders of the relationships. Performance is reasonable for relationships of size 1. However, the model almost completely fails for sizes 2 and 3, with recall of at most 0.04.

Perhaps even more surprisingly, the bottom half of Table 3 shows that NTP does not scale with respect to the number of predicates even when the relationship is at its simplest: as the total number of predicates increases, the effective size of the state space grows and the signal-to-noise ratio falls, and model performance decreases sharply.

Our main conclusion is that NTP does not learn complicated relationships well. Nonetheless, the model still achieves better-than-random fact prediction accuracy. One possible reason is that the learning algorithm can place the predicates in a relationship close to each other in the embedding space and unify them directly. We analyze this in the next section.

5.3 Diagnosis

Why does NTP perform so disappointingly in learning relations? We argue that the problem lies in the nature of its greedy optimization and corresponding lack of exploration.

The model works in a winner-takes-all greedy fashion: (1) it picks the highest scoring proof; (2) if a correct fact was proved, it increases the score of that proof. Such a process naturally leads to highly stochastic outcomes, where the final winner depends on the structure that was initially chosen. See the example below:

Example II: Failed rule learning Consider again the relationship HasSibling(X) ← HasBrother(X) and a rule with one head predicate and one body predicate. We try to prove HasSibling(Emily), knowing HasBrother(Emily). There are two obvious candidate proofs. The desired proof applies the rule, unifying the rule's body predicate with HasBrother and its head predicate with HasSibling. The alternative proof directly unifies HasBrother and HasSibling. If the direct unification proof starts out with a higher score (due to random initialization, as explained below), the rule embeddings will not be updated and the unification proof will have a higher score permanently.

In other words, the model sticks with the first somewhat reasonable proof, trapped in a stable local minimum, rather than exploring for the best proof. If the winner-takes-all phenomenon is indeed causing problems for the model, one would expect model performance to be highly dependent on initialization. Our experiments show that this is indeed the case. The standard deviation of recall for the model on relationships of size 1 and order 1, purely from repeated initialization (calculated from 50 initializations each for 20 dataset draws), is equal to 0.39. The high variance reflects the feast-or-famine results of the model: for advantageous initializations the model finds the correct rule with close to perfect confidence, and for all other initializations the confidence of the rule is close to zero.

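| Body size | Body order | $r$ | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|
| 1 | Unary | 1.0 | 0.42 | 0.58 | 0.26 | 0.90 |
| 1 | Unary | 0.9 | 0.76 | 0.86 | 0.28 | 0.94 |
| 1 | Unary | 0.75 | 0.96 | 0.98 | 0.29 | 0.95 |
| 1 | Unary | 0.5 | 1.0 | 1.0 | 0.29 | 0.95 |
| 2 | Unary | 1.0 | 0.00 | 0.35 | 0.06 | 0.79 |
| 2 | Unary | 0.9 | 0.06 | 0.37 | 0.07 | 0.80 |
| 2 | Unary | 0.75 | 0.40 | 0.67 | 0.14 | 0.85 |
| 2 | Unary | 0.5 | 0.90 | 0.98 | 0.26 | 0.95 |

Table 4: Relationship learning performance with respect to initialization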

Table 4 contains a more thorough investigation of the effect of initialization on performance. After initializing all embeddings, we determine the rule for which the rule predicate embeddings are closest to the ground-truth relationship embeddings, and move those rule embeddings even closer, multiplying the distance by a ratio $r$. Modest reductions in distance lead to sharply improved performance; a ratio of 0.5 increases recall from 0.0 to 0.9 for relationships of size 2.
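The distance reduction used in this experiment amounts to a linear interpolation toward the target embedding. The sketch below shows that operation for a single rule predicate; the function name and the example vectors are assumptions for illustration.

```python
import math

def move_closer(rule_emb, target_emb, ratio):
    # The new point lies on the segment between target and rule embedding,
    # at `ratio` times the original distance from the target.
    return [t + ratio * (r - t) for r, t in zip(rule_emb, target_emb)]

rule_emb = [2.0, 0.0]
target_emb = [0.0, 0.0]  # embedding of the ground-truth data predicate
for ratio in (1.0, 0.9, 0.75, 0.5):
    moved = move_closer(rule_emb, target_emb, ratio)
    print(ratio, moved, math.dist(moved, target_emb))
```

A ratio of 1.0 leaves the initialization unchanged, which corresponds to the baseline rows of Table 4.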

Figure 1: Development of rule and unification proof scores during training, conditional on whether or not relationship was successfully learned

Initialization is clearly important, but is it a question of competing proofs? Define the rule score as the highest relationship decoding score amongst all the rules, and the unification score as the score of unifying the head and the closest body predicate directly. There exists a strongly negative (-0.85) correlation between the rule score and the unification score at the end of training. Figure 1 displays how the rule and unification scores develop during training, conditional on whether or not the correct relationship is eventually learned. The results show training runs go in one of two ways: either the rule catches on, and the rule score increases while the unification score increases only modestly, or it does not, and the unification score increases while the rule score remains low. This pattern strongly suggests these two proof types are competing in the simple case of a relationship of size 1, order 1, and provides support for the winner-takes-all hypothesis.

6 Rule learning needs exploration

This section proposes an adjusted version of the model in [15], propagating gradients to additional proofs beyond the best-scoring proof in order to encourage exploration and reduce the winner-takes-all property of the original model. Taking inspiration from beam search, we aim to keep around proofs with lower current scores that may prove successful later in training, when the influence of other data points has been incorporated into the embeddings.

Propagating gradients to more proofs is equivalent to altering the original loss function for a single training example (Eqn 2) by summing losses over a set of proofs $P$:

$$\mathcal{L}(\theta) = \sum_{p \in P} \left[ -y \log s_p - (1 - y) \log(1 - s_p) \right] \qquad (3)$$

where $y$ is the truth value of the fact and $s_p$ is the score of proof $p$.

Different choices of the set $P$ then lead to different heuristics. We employ two sets of heuristics for exploration. Top-$k$ propagates gradients to the $k$ highest scoring proofs. Define a proof path as a sequence of categories (facts and different rule templates) of logical statements applied in a proof. For example, with one rule template of size 2, the categories are fact and rule, and the two proof paths of depth 1 are (fact) and (rule, fact, fact).

The all-path heuristic propagates gradients to the top-scoring proof from each high-level proof path to encourage varied exploration.
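The proof-selection heuristics can be sketched as a function that chooses which proofs receive gradient, followed by a loss summed over that set. The combination rule used here for top-k together with all-path (keep the k best proofs from every proof path) is an assumption of this sketch, as are the proof scores and the negative log-likelihood loss.

```python
import math

def select_proofs(proofs, k=1, all_path=False):
    """proofs: list of (path, score), e.g. (("rule", "fact", "fact"), 0.2)."""
    ranked = sorted(proofs, key=lambda p: p[1], reverse=True)
    if not all_path:
        return ranked[:k]  # plain top-k; k=1 recovers the greedy base model
    selected, per_path = [], {}
    for path, score in ranked:
        # Assumed combination: keep the k best proofs from every proof path.
        if per_path.get(path, 0) < k:
            selected.append((path, score))
            per_path[path] = per_path.get(path, 0) + 1
    return selected

def summed_loss(selected, truth):
    # Negative log-likelihood summed over every selected proof.
    return sum(-(truth * math.log(s) + (1 - truth) * math.log(1.0 - s))
               for _, s in selected)

proofs = [(("fact",), 0.6), (("fact",), 0.5),
          (("rule", "fact", "fact"), 0.2), (("rule", "fact", "fact"), 0.1)]
print(select_proofs(proofs, k=1))
print(select_proofs(proofs, k=2))
print(select_proofs(proofs, k=1, all_path=True))
print(summed_loss(select_proofs(proofs, k=1, all_path=True), truth=1))
```

In this toy setting plain top-2 still selects only direct fact proofs, while the all-path variant guarantees that the low-scoring rule proof also receives gradient.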

Table 5 demonstrates the benefits of incorporating exploration heuristics, especially in learning complicated relationships. In contrast to the results in Table 3, rule learning performance improves consistently under both types of heuristics and their combination, and this translates into higher fact prediction accuracy.

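| Body size | Body order | Exploration heuristic | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|
| 1 | Unary | Top-$k$ | 0.9 | 0.94 | 0.28 | 0.93 |
| 1 | Unary | all-path | 1.0 | 1.0 | 0.29 | 0.93 |
| 1 | Unary | Top-$k$ all-path | 1.0 | 1.0 | 0.29 | 0.95 |
| 1 | Binary | Top-$k$ | 0.88 | 0.93 | 0.52 | 0.93 |
| 1 | Binary | all-path | 0.99 | 0.96 | 0.57 | 0.98 |
| 1 | Binary | Top-$k$ all-path | 0.97 | 0.99 | 0.54 | 0.96 |
| 2 | Unary | Top-$k$ | 0.14 | 0.46 | 0.07 | 0.84 |
| 2 | Unary | all-path | 0.14 | 0.46 | 0.07 | 0.84 |
| 2 | Unary | Top-$k$ all-path | 0.92 | 0.95 | 0.27 | 0.96 |
| 2 | Binary | Top-$k$ | 0.16 | 0.49 | 0.21 | 0.88 |
| 2 | Binary | all-path | 0.04 | 0.49 | 0.08 | 0.89 |
| 2 | Binary | Top-$k$ all-path | 0.92 | 0.96 | 0.54 | 0.98 |
| 3 | Unary | Top-$k$ | 0.08 | 0.52 | 0.03 | 0.82 |
| 3 | Unary | all-path | 0.18 | 0.48 | 0.07 | 0.85 |
| 3 | Unary | Top-$k$ all-path | 0.38 | 0.55 | 0.14 | 0.93 |
| 3 | Unary | Top-$k$ all-path | 0.64 | 0.73 | 0.20 | 0.96 |

Table 5: NTP performance by exploration heuristics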

Table 6 repeats the experiment from Section 5.2 using the new exploration heuristic. Here we find that the model can learn from extremely small amounts of data – as few as 11 active facts.

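| Body size | Body order | Constants | Total facts | Active facts | Rule recall | Rule PR-AUC | Fact ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | Unary | 50 | 137 | 23 | 0.76 | 0.79 | 0.86 |
| 1 | Unary | 100 | 274 | 47 | 0.96 | 0.98 | 0.92 |
| 1 | Unary | 200 | 546 | 92 | 1.0 | 1.0 | 0.92 |
| 1 | Unary | 800 | 2194 | 388 | 1.0 | 1.0 | 0.96 |
| 2 | Unary | 50 | 133 | 11 | 0.16 | 0.35 | 0.80 |
| 2 | Unary | 100 | 266 | 21 | 0.64 | 0.77 | 0.90 |
| 2 | Unary | 200 | 532 | 43 | 0.90 | 0.94 | 0.93 |
| 2 | Unary | 800 | 2139 | 185 | 0.98 | 0.99 | 0.97 |

Table 6: NTP performance by number of constants using the top-$k$ all-path heuristic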

7 Discussion

Neural theorem proving is a promising combination of logical learning and neural network approaches. In this work, we evaluate the performance of the NTP algorithm on synthetic logical datasets with injected relationships. We show that NTP has difficulty recovering relationships in all but the simplest settings. Our experiments suggest the problem lies in the presence of structural local minima, due to the winner-takes-all property of the model. We alter the NTP algorithm to increase exploration, which sharply improves performance.

We believe there are several lessons to be drawn from this work beyond the immediate application to NTP. First, it is helpful to look at synthetic data when evaluating prediction models that involve structure learning, as final prediction accuracy can mask problems with the structure learning component. Second, learning discrete structures as an intermediate step can be accompanied by severe structural local minima, which can be avoided through additional exploration.

Acknowledgments This work is partially supported by NSF Awards IIS-1513966/ 1632803/1833137, CCF-1139148, DARPA Award#: FA8750-18-2-0117, DARPA-D3M - Award UCB-00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484. We thank Shariq Iqbal, Zhiyun Lu, and Bowen Zhang for helpful comments.

Supplemental Parameter Experiments

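| Body size | Body order | Rules | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|
| 1 | Unary | 3 | 0.62 | 0.76 | 0.28 | 0.92 |
| 1 | Unary | 5 | 0.68 | 0.75 | 0.28 | 0.93 |
| 1 | Unary | 10 | 0.76 | 0.78 | 0.28 | 0.92 |
| 1 | Unary | 20 | 0.86 | 0.83 | 0.28 | 0.94 |
| 1 | Unary | 50 | 0.9 | 0.84 | 0.27 | 0.92 |
| 2 | Unary | 3 | 0.0 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | 5 | 0.04 | 0.3 | 0.04 | 0.82 |
| 2 | Unary | 10 | 0.04 | 0.21 | 0.04 | 0.82 |
| 2 | Unary | 20 | 0.0 | 0.14 | 0.04 | 0.81 |
| 2 | Unary | 50 | 0.06 | 0.09 | 0.04 | 0.81 |

Table 7: NTP base model performance by number of rules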

This section contains experiments verifying that the conclusions in the body of the paper hold broadly and are not sensitive to model parameters.


Table 7 shows how performance of the base model varies with the number of instantiated rules. Increasing the number of rules improves both recall and precision for relationships of size 1 and order 1. However, adding rules is not a panacea: for even slightly more complex relationships of size 2 and order 1, increasing the number of rules improves recall only slightly, at the cost of a large reduction in precision.

Rule Probability

Table 8 shows how performance of the base model and heuristic vary with the strength of the relationship, defined as the probability for the head predicate of a relationship to hold for a set of constants if the body predicates hold for that set of constants. The algorithm is still able to learn non-deterministic relationships, although rule learning and fact prediction performance do decrease as relationship strength decreases. The model with exploration heuristic still performs much better than the base algorithm with nondeterministic relationships. Note that the upper bound on fact prediction accuracy also decreases as the relationship strength decreases.

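| Body size | Body order | Exploration | Relationship probability | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | Unary | Vanilla | 1.0 | 0.62 | 0.76 | 0.28 | 0.92 |
| 1 | Unary | Vanilla | 0.9 | 0.36 | 0.51 | 0.11 | 0.83 |
| 1 | Unary | Vanilla | 0.8 | 0.18 | 0.4 | 0.07 | 0.80 |
| 1 | Unary | Vanilla | 0.7 | 0.16 | 0.37 | 0.05 | 0.73 |
| 1 | Unary | Top-$k$ all-path | 1.0 | 1.0 | 1 | 0.29 | 0.95 |
| 1 | Unary | Top-$k$ all-path | 0.9 | 0.98 | 0.99 | 0.12 | 0.87 |
| 1 | Unary | Top-$k$ all-path | 0.8 | 0.9 | 0.9 | 0.08 | 0.82 |
| 1 | Unary | Top-$k$ all-path | 0.7 | 0.44 | 0.51 | 0.05 | 0.73 |
| 2 | Unary | Vanilla | 1.0 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla | 0.9 | 0.0 | 0.36 | 0.03 | 0.79 |
| 2 | Unary | Vanilla | 0.8 | 0.04 | 0.36 | 0.03 | 0.77 |
| 2 | Unary | Vanilla | 0.7 | 0.04 | 0.39 | 0.02 | 0.74 |
| 2 | Unary | Top-$k$ all-path | 1.0 | 0.92 | 0.95 | 0.27 | 0.96 |
| 2 | Unary | Top-$k$ all-path | 0.9 | 0.96 | 0.98 | 0.12 | 0.95 |
| 2 | Unary | Top-$k$ all-path | 0.8 | 0.82 | 0.88 | 0.07 | 0.91 |
| 2 | Unary | Top-$k$ all-path | 0.7 | 0.34 | 0.44 | 0.04 | 0.82 |

Table 8: NTP base model and exploration heuristic performance by relationship probability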


Table 9 shows the effect of injecting a second relationship of the same type on performance of the base model as well as the model with the Top-2-all-type exploration heuristic. The additional relationship leads to a sharp reduction in relationship learning performance, though the model with exploration heuristic still performs much better than the base algorithm at learning multiple relationships.

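| Body size | Body order | Exploration | Relationships | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | Unary | Vanilla | 1 | 0.62 | 0.76 | 0.28 | 0.92 |
| 1 | Unary | Vanilla | 2 | 0.38 | 0.61 | 0.31 | 0.87 |
| 1 | Unary | Top-$k$ all-path | 1 | 1.0 | 1 | 0.29 | 0.95 |
| 1 | Unary | Top-$k$ all-path | 2 | 0.63 | 0.71 | 0.41 | 0.96 |
| 2 | Unary | Vanilla | 1 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla | 2 | 0.01 | 0.48 | 0.04 | 0.80 |
| 2 | Unary | Top-$k$ all-path | 1 | 0.92 | 0.95 | 0.27 | 0.96 |
| 2 | Unary | Top-$k$ all-path | 2 | 0.47 | 0.7 | 0.26 | 0.93 |

Table 9: NTP base model and exploration heuristic performance by number of relationships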


Following Rocktäschel and Riedel [15], for proofs that unify with several different facts, we only retain the top-$k$ highest scoring fact unifications per fact in each branch of the proof tree to reduce computational demands. For example, for a proof path of type (rule, fact, fact), we do not take the maximum over all proofs, but only over the proofs that retain the $k$ highest fact unification scores in the second step.

Table 10 shows that varying this parameter has a minimal effect on the outcome of the algorithm.

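| Body size | Body order | Exploration | $k$ | Rule recall | Rule PR-AUC | Fact MRR | Fact ROC-AUC |
|---|---|---|---|---|---|---|---|
| 2 | Unary | Vanilla | 10 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla | 20 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla | | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Top-$k$ all-path | 10 | 0.92 | 0.95 | 0.27 | 0.96 |
| 2 | Unary | Top-$k$ all-path | 20 | 0.88 | 0.93 | 0.26 | 0.96 |
| 2 | Unary | Top-$k$ all-path | | 0.88 | 0.93 | 0.26 | 0.96 |

Table 10: NTP base model and exploration heuristic performance by $k$ value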