1 Introduction
Neural networks have advanced the state of the art on a variety of prediction tasks. However, they are also data-hungry, transfer poorly to data generated from different distributions [1, 10], and their decision-making is difficult to interpret [3]. On the other hand, learning logical rules through Inductive Logic Programming [13, 14] has good generalization properties, but does not cope well with noisy data or fuzzy relationships [4], and is difficult to scale beyond toy datasets because of the large search space [22]. Recently, there have been renewed efforts to combine the two paradigms by applying deep learning methods to classical symbolic AI methods.
Several differentiable logic learning approaches have been proposed [15, 6, 21] that are better equipped to deal with noisy data, and employ gradient descent rather than discrete search to cope with a large search space. Rocktäschel and Riedel [15] propose the Neural Theorem Prover (NTP), which fuses subsymbolic methods and logical rule learning by endowing constants and predicates with vector embeddings. We view this as a promising direction, as the embedding approach lets the model learn the semantics of entities and concepts concurrently with the relationships between those concepts, while also allowing for nuanced and ambiguous rules. However, as with other differentiable logic models, this model currently only works on limited toy datasets. In order to make progress we need to challenge the model and understand its limitations as well as how those limitations can be addressed.
NTP and its variants have attracted considerable interest. Nonetheless, one of their most important motivations, that learning logical rules improves generalization, has not been critically analyzed. In particular, how well does NTP learn rules? Like other models, NTP was primarily evaluated on its fact prediction accuracy rather than its rule learning performance, as ground-truth rules in real data are hard to identify.
To address this gap, we create synthetic logical datasets with injected relationships (as ground truth) and use them to directly measure rule learning performance. While we focus on NTP in this paper, the datasets can be used to assess other learning algorithms.
Our findings are surprising: NTP does not learn rules well, except when the relationship is very simple, even though it can achieve high fact prediction accuracy. Diagnostic experiments suggest that learning suffers from poor optimization. In particular, a wrong proof for a true fact can start out with a high score due to initialization, and the optimization then continues to greedily increase the score of that wrong proof. This winner-takes-all credit assignment systematically leads to undesirable local minima, in which incorrect relationships are learned.
Fortunately, this weakness can be easily remedied. We take inspiration from beam search, a common technique in speech and language processing, and retain a subset of proofs even if they are scored low initially. This allows the optimization to explore alternative proof paths and eventually leads to a very large increase in rule learning performance.
The main contribution of this work is to single out the need to measure rule learning performance in addition to fact prediction accuracy. We believe it is imperative to have challenging benchmark datasets on which such performance can be measured under different scenarios. To this end, we have designed a recipe to generate multiple datasets with varying degrees of complexity (in terms of the underlying sets of rules). We analyze the root cause of NTP failing to achieve high performance on rule learning and propose a simple fix to the optimization, thus demonstrating the utility of applying those datasets to advancing research in neural subsymbolic processing.
2 Related work
In a broad sense, our work aims to learn relationships between logical predicates. There exists a very large body of literature that shares this objective. Inductive Logic Programming (ILP) employs search methods to find logical rules that best explain the data [13, 14, 4, 22]. Statistical relational learning (SRL) also looks to learn relationships between logical predicates, but generally employs statistical methods to learn more general probabilistic relationships, not necessarily in the form of logical rules [9, 7, 18]. Some of the more recent work in this vein uses methods from or inspired by deep learning. Trouillon et al. [19] assign an embedding to each logical constant and predicate, and predict the presence of a predicate relationship between constants by the triplet product of the embeddings. Graph neural networks [20] operate over a graph structure, which is inherently a relational setting. Dong et al. [5] train a deep neural network to predict a probability vector of output predicates from a probability vector of input predicates, naturally extending SRL to deep learning.
Differentiable logic learners employ some form of continuous relaxation of the logical reasoning process so that its parameters can be trained through gradient descent [6, 17, 11, 15]. One family of models generates logical clauses and associates predicates with a trained attention vector over these clauses [6, 21]. In contrast, neural theorem proving (NTP), which is the focus of this paper, endows predicates, constants and rules with embeddings that are trained through gradient descent, and receive meaning by their position in the embedding space [15, 12, 2]. The embedding approach allows the model to learn semantics of entities and concepts, in a similar way to word embeddings in NLP [12].
We contribute to this line of work by systematically generating synthetic data to directly measure rule-learning performance. Existing work has either used very small toy examples [6] or real data without known ground-truth relationships. Although the problem of local minima when learning discrete structures is well known, to our knowledge we are the first to closely study how this effect impacts performance and to show that increased exploration can address the problem in the differentiable logic setting.
3 Neural-based inductive logic programming
3.1 Problem statement
NTP operates on a subset of first-order logic. Constants represent entities, such as individuals or objects. Predicates are boolean functions of sets of constants, which represent properties of or relationships between entities. For example, GrandparentOf($a$, $b$) is a predicate, taking value 1 if $a$ is the grandparent of $b$ and 0 otherwise. The number of arguments in a predicate is called its order. Logical datasets consist of facts, which are statements that a predicate holds for a particular set of constants, for example, GrandparentOf(Harry, Emily). We also have variables and rules. Variables are symbols that may represent any constant. We consider rules of the form $H(X) \leftarrow B_1(X_1) \wedge \dots \wedge B_R(X_R)$, where $X, X_1, \dots, X_R$ are sets of variables (possibly overlapping). Collectively, all $B_i$ are called the body and $H$ is called the head of the rule. Such a rule implies that, for any assignment of constants to the variables, if all the predicates in the body hold, then the head also holds. $R$ is called the size of the rule.
We consider two types of learning settings. In the fact prediction task [15], we are interested in predicting the truth value of a set of test facts after learning from the true facts in the training data.
The other type of learning focuses on relation learning, where we are interested in identifying rules from the training dataset. Once identified, we can apply these rules to predict the truth of unknown test facts. Note that this type of learning may be more challenging, but potentially much more beneficial in many application settings, for instance, learning with a small set of training instances.
3.2 Neural Theorem Proving
The basic idea in NTP is to represent symbols from the data, such as predicates and constants, as vector embeddings. Rules are represented by tuples of predicate embeddings. Inference is the process of applying facts and rules to derive conclusions. Since symbols are all points in the embedding space, “applying” means computing minimum distances among them. Learning happens by placing the embeddings in a position such that correct chains of facts and rules that prove valid facts have minimum distance. A more detailed description can be found in [15].
Inference Given a set of facts and rules, logic programming infers the truth value of a goal fact by attempting to prove that the goal fact is derivable from other facts and rules. NTP is loosely based on using unification in a backward chaining proving technique [16].
In standard logical inference, unification compares logical statements and makes any necessary variable substitutions to make those statements identical. For example, in order to unify GrandparentOf(Harry, Emily) and GrandparentOf(Harry, $X$), we make the substitution $X$ = Emily. On the other hand, GrandparentOf(Alex, Emily) and GrandparentOf(Harry, $X$) cannot be unified as the constant Alex is not the same as Harry. GrandparentOf(Harry, Emily) and ParentOf(Harry, $X$) cannot be unified either as the predicates do not match.
Standard backward chaining attempts to unify the goal fact with training facts as well as the head of rules. If the goal fact unifies with any training fact, the proof has been completed. Otherwise, if the fact matches a rule head, the rule body predicates are added to a queue to be proved. This process continues recursively until all necessary facts in the queue are successfully proved, in which case the goal fact has been proved, or until all facts and rules have been tried without finding a proof.
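The standard (symbolic) unification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: atoms are represented as tuples, single uppercase letters are treated as variables, and the helper names `is_variable` and `unify` are hypothetical rather than taken from any NTP implementation.

```python
from typing import Dict, Optional, Tuple

Atom = Tuple[str, ...]  # (predicate, arg1, arg2, ...)


def is_variable(symbol: str) -> bool:
    # Assumed convention for this sketch: single uppercase letters are variables.
    return len(symbol) == 1 and symbol.isupper()


def unify(a: Atom, b: Atom) -> Optional[Dict[str, str]]:
    """Return a substitution that makes a and b identical, or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None  # predicates (or their orders) do not match
    subst: Dict[str, str] = {}
    for x, y in zip(a[1:], b[1:]):
        # Apply substitutions made so far before comparing.
        x, y = subst.get(x, x), subst.get(y, y)
        if x == y:
            continue
        if is_variable(x):
            subst[x] = y
        elif is_variable(y):
            subst[y] = x
        else:
            return None  # two different constants cannot be unified
    return subst


goal = ("GrandparentOf", "Harry", "Emily")
print(unify(goal, ("GrandparentOf", "Harry", "X")))                       # substitution for X
print(unify(("GrandparentOf", "Alex", "Emily"), ("GrandparentOf", "Harry", "X")))  # None
print(unify(goal, ("ParentOf", "Harry", "X")))                            # None
```

The three calls mirror the three GrandparentOf examples in the text: a successful substitution, a constant mismatch, and a predicate mismatch.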
For NTP, unification and backward chaining work somewhat differently. A fact can be unified with another fact or rule head even if the predicate or constants in the fact are different, as long as the predicate has the same order. If predicates or constants differ, the unification is given a score that is a function of the distance between the embeddings of the symbols being unified. The score of a proof is a function of the embedding distance of the “worst” unification in the proof. Let $U$ be the set of unifications in the proof, and $u_1$ and $u_2$ the logical symbols in unification $u \in U$, with embeddings $\theta_{u_1}$ and $\theta_{u_2}$; then the proof score is

$$S = \min_{u \in U} \exp\left(-\lVert \theta_{u_1} - \theta_{u_2} \rVert_2\right) \tag{1}$$

An example of a simple proof of HasSibling(Emily), given that we know HasBrother(Emily), would be to directly unify HasSibling(Emily) and HasBrother(Emily). The score of this proof would be $\exp\left(-\lVert \theta_{\text{HasSibling}} - \theta_{\text{HasBrother}} \rVert_2\right)$.
In this definition of unification, there can be many successful proofs. We calculate the scores of all possible proofs, and define the score of a fact to be equal to the score of the best proof of that fact. The learning process involves raising the proof score for known facts as much as possible.
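As a rough illustration of how proof and fact scores combine, the sketch below scores a proof by its worst unification and a fact by its best proof. It assumes, following the scoring used in [15], that a unification score is the exponential of the negative Euclidean distance between two embeddings; the function names, the rule predicate names `rule_head` and `rule_body`, and the two-dimensional embeddings are all made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Unification = Tuple[str, str]  # the two symbols being unified


def unification_score(a: Vector, b: Vector) -> float:
    # Assumed form: exp(-Euclidean distance), so identical embeddings score 1.
    return math.exp(-math.dist(a, b))


def proof_score(proof: List[Unification], emb: Dict[str, Vector]) -> float:
    # A proof is only as good as its worst unification.
    return min(unification_score(emb[s], emb[t]) for s, t in proof)


def fact_score(proofs: List[List[Unification]], emb: Dict[str, Vector]) -> float:
    # A fact is scored by its best proof.
    return max(proof_score(p, emb) for p in proofs)


emb = {
    "HasSibling": [0.0, 0.0],
    "HasBrother": [0.0, 1.0],
    "rule_head": [0.5, 0.0],  # hypothetical rule predicate embeddings
    "rule_body": [0.0, 1.0],
}
direct = [("HasSibling", "HasBrother")]
via_rule = [("HasSibling", "rule_head"), ("rule_body", "HasBrother")]
print(proof_score(direct, emb), proof_score(via_rule, emb))
print(fact_score([direct, via_rule], emb))
```

With these embeddings the rule proof wins, because its worst unification is at distance 0.5 while the direct unification is at distance 1.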
Learning Rules are instantiated from rule templates, which specify the structure of a rule but leave its predicates open. Rules are created by instantiating an embedding for each predicate in the template. Unlike the data predicates, these rule predicates do not have any intrinsic meaning, but take on a meaning during training by being placed close in the embedding space to data predicates. All symbols’ embeddings are randomly initialized.
Models are trained by attempting to prove logical facts. The goal during training is to promote rules that successfully prove true facts (facts present in the training set) and discourage rules that successfully prove false facts (generated by corrupting facts in the training set). The NTP algorithm assigns a score to each fact equal to the score of the highest scoring proof of that fact. Let $\theta$ be the set of all embeddings. Given a truth value $y \in \{0, 1\}$ and a fact score $S(\theta)$, the loss is given by

$$L(\theta) = -y \log S(\theta) - (1 - y) \log\left(1 - S(\theta)\right) \tag{2}$$

Note that the score of the fact, while bounded between 0 and 1, should not be interpreted as a normalized probability. Nonetheless, the loss is differentiable w.r.t. $\theta$. In a single training step, the two embeddings in the worst unification of the best proof of the training fact are updated. See below for an illustrative example:
Example I: Learning procedure
Consider a task that requires learning the relationship HasSibling($X$) $\leftarrow$ HasBrother($X$). A rule is instantiated from a corresponding rule template with one head predicate and one body predicate. Embeddings are initialized for HasSibling, HasBrother, and the two rule predicates, as well as any constants and other predicates in the task.
The first training data point is HasSibling(Emily). HasBrother(Emily) is also in the training dataset. The best proof for HasSibling(Emily) unifies HasBrother with the rule's body predicate and HasSibling with the rule's head predicate, and applies the rule to prove HasSibling(Emily) from HasBrother(Emily). The rule's head predicate and HasSibling were the furthest apart, and since the proof has successfully led to a true fact, the next gradient step brings them closer together. Alongside HasSibling(Emily), we also train on the corruption HasSibling(Alex), which is not a fact in the training data. HasChild(Alex) is also in the training data, and the best proof for HasSibling(Alex) ends up unifying (HasSibling, HasChild). Since this proof leads to an untrue fact, the next gradient step places HasSibling and HasChild further apart.
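The direction of these updates can be checked numerically with a small sketch. It assumes the loss is the negative log-likelihood of the fact score, as in [15], and that a score has the form exp(-distance); the function name `ntp_loss` and the clamping constant are illustrative choices, not part of the original model description.

```python
import math


def ntp_loss(score: float, y: int, eps: float = 1e-9) -> float:
    """Negative log-likelihood of a fact score in (0, 1) for truth value y."""
    score = min(max(score, eps), 1.0 - eps)  # guard against log(0)
    return -(y * math.log(score) + (1 - y) * math.log(1.0 - score))


# True fact: a smaller embedding distance (higher score) gives a lower loss,
# so gradient descent pulls the unified symbols together.
print(ntp_loss(math.exp(-1.0), y=1), ntp_loss(math.exp(-0.5), y=1))

# Corrupted fact: a larger distance (lower score) gives a lower loss,
# so gradient descent pushes the unified symbols apart.
print(ntp_loss(math.exp(-0.5), y=0), ntp_loss(math.exp(-1.0), y=0))
```

Under these assumptions the loss for a true fact is simply the distance of the worst unification in its best proof, which is why only that pair of embeddings receives a gradient.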
4 Designing synthetic datasets for evaluating NTP
4.1 A Recipe for data generation
We use the following recipe to procedurally generate datasets. Our main design consideration is to finely control the properties of our data, so that it is possible to investigate the effect of different factors in isolation. First, we choose the size and the order of the predicates to be used. Next, we select the number of constants and the number of predicates. Each predicate has a base truth probability that is identical for all inputs. A chosen number of relationships between predicates are then injected from a relationship template. In those (ground-truth) relationships, for a given input, if all the predicates in the body are true, then the predicate in the head has an increased probability of being true.
Using this generic recipe, we flexibly generate datasets on the fly for each experiment. Table 1 outlines the parameters of the data generation process used for our main experiments, depending on the body of the relationship. Experiments with binary predicates use fewer constants and a lower base probability to reduce computational cost, given that the number of binary facts scales quadratically with the number of constants.
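Table 1: Data generation parameters for the main experiments.
Body size  Order   Constants  Predicates  Base probability  Rule probability  Relationships
1          unary   200        5           0.5               1.0               1
1          binary  60         5           0.25              1.0               1
2          unary   400        5           0.5               1.0               1
2          binary  60         5           0.25              1.0               1
3          unary   800        5           0.5               1.0               1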
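A minimal sketch of the recipe for unary predicates is given below. It is not the released generation code: the function name `generate_unary_dataset`, the predicate naming scheme, the choice of the first predicate as head, and the exact way the head probability is increased (an extra draw with probability `rule_prob` when the body holds) are assumptions made for illustration.

```python
import random
from typing import Set, Tuple

Fact = Tuple[str, str]  # (predicate, constant) for unary predicates


def generate_unary_dataset(n_constants: int, n_predicates: int,
                           base_prob: float, rule_prob: float,
                           body_size: int, seed: int = 0):
    """Sample unary facts and inject one relationship head <- body."""
    assert n_predicates > body_size, "need one head plus the body predicates"
    rng = random.Random(seed)
    predicates = [f"P{i}" for i in range(n_predicates)]
    head, body = predicates[0], predicates[1:1 + body_size]

    facts: Set[Fact] = set()
    active: Set[Fact] = set()  # head facts that the relationship can explain
    for i in range(n_constants):
        const = f"c{i}"
        # Every predicate is first drawn independently with the base probability.
        truth = {p: rng.random() < base_prob for p in predicates}
        if all(truth[b] for b in body):
            # Body holds: the head gets an extra chance of being true.
            if rng.random() < rule_prob:
                truth[head] = True
            if truth[head]:
                active.add((head, const))
        facts.update((p, const) for p in predicates if truth[p])
    return facts, active, (head, body)


facts, active, rule = generate_unary_dataset(200, 5, 0.5, 1.0, body_size=1)
print(len(facts), len(active), rule)
```

With 200 constants, 5 predicates, and a base probability of 0.5, this sketch yields on the order of 550 facts of which roughly 100 are active, which is the same order of magnitude as the corresponding row of Table 2.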
4.2 Evaluation
The NTP algorithm learns embeddings for rule instantiations. Each predicate embedding of an instantiation is decoded to the data predicate that is closest to it in the embedding space. The score of the overall rule decoding is equal to the score of the worst predicate decoding, calculated as in Eqn 1.
Relationship learning performance is measured by recall and precision, averaged over a number of runs. Recall for a single run is defined as the proportion of injected relationships for which at least one rule decodes to that relationship. We then take the mean over all runs. For the precision measure, we construct a list of all decoding scores and a corresponding list of “gold values” that equal 1 if the corresponding rule decoding matches an injected relationship and 0 otherwise. The precision measure (PR-AUC) is then defined as the area under the precision-recall curve for these lists.
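The decoding and recall computation can be sketched as follows. The sketch assumes that a rule is a list of predicate embeddings ordered as (head, body predicates), that a decoding score is exp(-distance) to the nearest data predicate, and that helper names such as `decode_rule` are hypothetical; PR-AUC is omitted for brevity.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def decode_predicate(vec: Vector, data_emb: Dict[str, Vector]) -> Tuple[str, float]:
    """Map a rule predicate embedding to its nearest data predicate."""
    name = min(data_emb, key=lambda p: math.dist(vec, data_emb[p]))
    return name, math.exp(-math.dist(vec, data_emb[name]))


def decode_rule(rule_vecs: List[Vector], data_emb: Dict[str, Vector]):
    """Decode every predicate; the rule score is the worst predicate score."""
    decoded = [decode_predicate(v, data_emb) for v in rule_vecs]
    names = tuple(name for name, _ in decoded)
    return names, min(score for _, score in decoded)


def recall(decoded_rules, injected: Set[Tuple[str, ...]]) -> float:
    """Share of injected relationships recovered by at least one rule."""
    found = {names for names, _ in decoded_rules}
    return sum(rel in found for rel in injected) / len(injected)


data_emb = {"HasSibling": [0.0, 0.0], "HasBrother": [1.0, 0.0], "HasChild": [0.0, 3.0]}
rule = decode_rule([[0.1, 0.0], [0.9, 0.0]], data_emb)  # (head, body) embeddings
print(rule)
print(recall([rule], {("HasSibling", "HasBrother")}))
```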
Fact prediction accuracy is normally evaluated on a randomly held-out test set. For NTP, fact evaluation is complicated by the fact that the algorithm relies on constant embeddings that are learned during training. Moreover, since the truth values of predicates not involved in a relationship are generated i.i.d., the model cannot learn to predict these, and performance on these facts is not very informative. We instead split off a small proportion of active facts, defined as facts which could be predicted to be true if the relationship is learned successfully.
We corrupt each test fact by changing the constants in all possible ways such that the result is not present in the training data, and apply the model to the facts and corruptions. The first evaluation measure is the mean reciprocal rank (MRR) of the true fact relative to the corrupted facts. This measure depends on the number of corruptions, which in turn depends on the size of the dataset, making it hard to compare between tasks. Therefore, we also compute a size-invariant measure by duplicating the score of each true fact once per corrupted fact and calculating the area under the ROC curve (ROC-AUC), with target 1 for true facts and 0 for corruptions.
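Both fact-prediction measures can be illustrated with a short sketch. The tie handling (ties do not worsen the rank; ties count as half a win in the AUC) and the function names are assumptions for this example, and the pairwise AUC computation is chosen for clarity rather than efficiency.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, Sequence[float]]  # (true fact score, corruption scores)


def mean_reciprocal_rank(examples: List[Example]) -> float:
    total = 0.0
    for true_score, corrupt_scores in examples:
        # Rank of the true fact among its corruptions (1 = best).
        rank = 1 + sum(s > true_score for s in corrupt_scores)
        total += 1.0 / rank
    return total / len(examples)


def balanced_roc_auc(examples: List[Example]) -> float:
    positives: List[float] = []
    negatives: List[float] = []
    for true_score, corrupt_scores in examples:
        # Duplicate the true score once per corruption to balance the classes.
        positives += [true_score] * len(corrupt_scores)
        negatives += list(corrupt_scores)
    # AUC = probability that a positive outscores a negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


examples = [(0.9, [0.1, 0.95]), (0.8, [0.2, 0.3, 0.4])]
print(mean_reciprocal_rank(examples))
print(balanced_roc_auc(examples))
```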
5 Critical analysis of NTP
In this section, we perform a critical analysis of the NTP model with the synthetic datasets we have created. Prior work has shown the method attains good test accuracy in predicting truth values of facts [15]. We analyze the model’s ability to identify rules under different conditions, revealing that the base NTP model has poor rule learning performance, and investigate why this is the case.
5.1 Additional experiment details
During training, each batch contains 10 true facts from the training set. For each true fact, there is also one corrupted fact per predicate argument (so 1 for unary predicates, 2 for binary predicates), which is generated by randomly selecting a constant for which the predicate does not hold in the training set. For each of the 50 runs, 3 (randomly initialized) rules are instantiated from the correct template, in the hope that at least one of them will be driven close to the correct relationship. The model is trained for 50 epochs using the Adam optimizer [8] with gradient clipping of (-5, 5) and exponential learning rate decay. We adhere to the evaluation protocol set in the previous section. None of the parameter choices (the number of rules, the rule probability, the number of relationships, or the number of retained fact unifications) qualitatively affects the results in this paper (the supplemental material presents detailed experiment results). The code can be found at https://github.com/Michiel29/ntprelease.
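The corruption step for training facts can be sketched as below: for each argument position, the constant is replaced by a randomly chosen one such that the resulting fact is not in the training set. The function name `corrupt_fact` and the tuple representation of facts are assumptions for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple

Fact = Tuple[str, ...]  # (predicate, arg1, ...)


def corrupt_fact(fact: Fact, constants: Sequence[str],
                 train_facts: Set[Fact], rng: random.Random) -> List[Fact]:
    """Return one corrupted fact per argument position of `fact`."""
    pred, args = fact[0], fact[1:]
    corrupted = []
    for pos in range(len(args)):
        # Candidate replacements that do not yield a known training fact.
        candidates = [c for c in constants
                      if (pred, *args[:pos], c, *args[pos + 1:]) not in train_facts]
        if candidates:
            c = rng.choice(candidates)
            corrupted.append((pred, *args[:pos], c, *args[pos + 1:]))
    return corrupted


train = {("ParentOf", "a", "b"), ("HasSibling", "a")}
rng = random.Random(0)
print(corrupt_fact(("ParentOf", "a", "b"), ["a", "b", "c"], train, rng))  # 2 corruptions
print(corrupt_fact(("HasSibling", "a"), ["a", "b", "c"], train, rng))     # 1 corruption
```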
5.2 Ablation studies
Table 2: Rule learning and fact prediction performance of NTP as a function of the number of constants.
Size  Order  Constants  Total facts  Active facts  Recall  PR-AUC  ROC-AUC
1     Unary  50         137          23            0.42    0.51    0.80
1     Unary  100        274          47            0.46    0.66    0.88
1     Unary  200        546          92            0.6     0.76    0.92
1     Unary  800        2194         388           0.46    0.68    0.96
2     Unary  50         133          11            0.0     0.31    0.71
2     Unary  100        266          21            0.0     0.36    0.75
2     Unary  200        532          43            0.02    0.37    0.78
2     Unary  800        2139         185           0.02    0.4     0.81
How much data does NTP need?
One of the appealing properties of rule-learning algorithms is that they tend to be data efficient. Can NTP achieve this goal? Table 2 shows relationship learning and fact prediction accuracy as a function of the number of constants in the data. It also shows the resulting total facts and active facts, the latter of which constitute the actual signal in the data. We restrict ourselves to unary predicates for this experiment, as scaling binary predicates in this way becomes computationally expensive.
When NTP learns anything at all, it seems to be able to do so from very little data. Note that, while performance increases modestly with data for body size one, performance stays low at body size two. We discuss the likely cause of this poor performance in the next section.
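Table 3: Rule learning and fact prediction performance of NTP for different relationship sizes and orders (top) and different numbers of predicates (bottom).
Size  Order   Predicates  Recall  PR-AUC  MRR   ROC-AUC
1     unary   5           0.48    0.62    0.27  0.91
1     binary  5           0.58    0.74    0.57  0.98
2     unary   5           0.04    0.37    0.04  0.81
2     binary  5           0.02    0.4     0.15  0.89
3     unary   5           0.02    0.4     0.02  0.81

1     unary   3           0.58    0.76    0.29  0.95
1     unary   10          0.34    0.58    0.23  0.87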
Can NTP learn complex relationships?
An important desideratum for relation learning is the ability to scale to complex relationships. The top half of Table 3 shows relationship learning and fact prediction accuracy of NTP under different sizes and orders of the relationships. Performance is reasonable for relationships of size 1. However, the model fails almost completely for sizes 2 and 3.
Perhaps even more surprisingly, the bottom half of Table 3 shows that NTP does not scale with respect to the number of predicates even when the relationship is at its simplest: as the total number of predicates increases, enlarging the effective state space and lowering the signal-to-noise ratio, model performance decreases sharply.
Our main conclusion is that NTP does not learn complicated relationships well. Nonetheless, the model still achieves better-than-random fact prediction accuracy. One possible reason is that the learning algorithm can place the predicates in a relationship close to each other in the embedding space and unify the predicates directly; we analyze this in the next section.
5.3 Diagnosis
Why does NTP perform so disappointingly in learning relations? We argue that the problem lies in the nature of its greedy optimization and corresponding lack of exploration.
The model works in a winner-takes-all greedy fashion: (1) it picks the highest scoring proof; (2) if a correct fact was proved, it increases the score of that proof. Such a process naturally leads to highly stochastic outcomes, where the final winner depends on the structure that was initially chosen. See the example below:
Example II: Failed rule learning
Consider again the relationship HasSibling($X$) $\leftarrow$ HasBrother($X$) and a rule instantiated from the corresponding template. We try to prove HasSibling(Emily), knowing HasBrother(Emily). There are two obvious candidate proofs. The desired proof applies the rule, unifying the rule's body predicate with HasBrother and its head predicate with HasSibling. The alternative proof directly unifies HasBrother and HasSibling. If the direct unification proof starts out with a higher score (due to random initialization, as explained below), the rule embeddings will not be updated and the unification proof will have a higher score permanently.
In other words, the model sticks with the first somewhat reasonable proof, trapped in a stable local minimum, rather than exploring for the best proof. If the winner-takes-all phenomenon is indeed causing problems for the model, one would expect model performance to be highly dependent on initialization. Our experiments show that this is indeed the case. The standard deviation of recall for the model on relationships of size 1 and order 1, purely from repeated initialization (calculated from 50 initializations each for 20 dataset draws), is equal to 0.39. The high variance reflects the feast-or-famine results of the model: for advantageous initializations the model finds the correct rule with close to perfect confidence, and for all other initializations the confidence of the rule is close to zero.
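The lock-in effect can be reproduced in a deliberately simplified toy model. The sketch assumes that each of the two candidate proofs is summarized by a single independent distance (in the real model the distances are functions of shared embeddings), that scores are exp(-distance), and that the loss for a true fact is the negative log of the best score, so the loss reduces to the smaller of the two distances and only that distance receives a gradient. The function name and step size are hypothetical.

```python
from typing import Tuple


def train_winner_takes_all(d_rule: float, d_unif: float,
                           lr: float = 0.1, steps: int = 20) -> Tuple[float, float]:
    """Gradient descent on min(d_rule, d_unif): only the current winner moves."""
    for _ in range(steps):
        if d_rule <= d_unif:
            d_rule = max(0.0, d_rule - lr)  # rule proof is best: improve it
        else:
            d_unif = max(0.0, d_unif - lr)  # direct unification is best: improve it
    return d_rule, d_unif


# Direct unification starts slightly ahead: the rule is never updated.
print(train_winner_takes_all(d_rule=1.0, d_unif=0.9))
# Rule starts slightly ahead: the rule is learned instead.
print(train_winner_takes_all(d_rule=0.9, d_unif=1.0))
```

A small difference at initialization fully determines which proof is optimized, matching the feast-or-famine behavior described above.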
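Table 4: Effect of initialization on performance, where the distance between the closest rule and the ground-truth relationship is multiplied by a ratio r after initialization.
Size  Order  r     Recall  PR-AUC  MRR   ROC-AUC
1     Unary  1.0   0.42    0.58    0.26  0.90
1     Unary  0.9   0.76    0.86    0.28  0.94
1     Unary  0.75  0.96    0.98    0.29  0.95
1     Unary  0.5   1.0     1.0     0.29  0.95
2     Unary  1.0   0.00    0.35    0.06  0.79
2     Unary  0.9   0.06    0.37    0.07  0.80
2     Unary  0.75  0.40    0.67    0.14  0.85
2     Unary  0.5   0.90    0.98    0.26  0.95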
Table 4 contains a more thorough investigation of the effect of initialization on performance. After initializing all embeddings, we determine the rule for which the rule predicate embeddings are closest to the ground-truth relationship embeddings, and move those rule embeddings even closer, multiplying the distance by a ratio $r$. Modest reductions in distance lead to sharply improved performance; a larger reduction, with $r = 0.5$, increases recall from 0.0 to 0.9 for relationships of size 2.
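The initialization intervention amounts to a linear interpolation toward the target embedding. A minimal sketch, with a hypothetical helper name and made-up vectors:

```python
import math
from typing import List, Sequence


def move_closer(rule_vec: Sequence[float], target_vec: Sequence[float],
                r: float) -> List[float]:
    """Shrink the distance between rule_vec and target_vec by the ratio r."""
    return [t + r * (x - t) for x, t in zip(rule_vec, target_vec)]


rule_vec, target = [2.0, 0.0], [0.0, 0.0]
moved = move_closer(rule_vec, target, r=0.5)
print(moved, math.dist(moved, target))  # the distance is halved
```

Setting `r = 1.0` leaves the embedding unchanged, which corresponds to the unmodified baseline rows of Table 4.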
Initialization is clearly important, but is it a question of competing proofs? Define the rule score as the highest relationship decoding score amongst all the rules, and the unification score as the score of unifying the head and the closest body predicate directly. There exists a strongly negative (-0.85) correlation between the rule score and the unification score at the end of training. Figure 1 displays how the rule and unification scores develop during training, conditional on whether or not the correct relationship is eventually learned. The results show that training runs go in one of two ways: either the rule catches on, and the rule score increases while the unification score increases only modestly, or it does not, and the unification score increases while the rule score remains low. This pattern strongly suggests that these two proof types are competing in the simple case of a relationship of size 1 and order 1, and provides support for the winner-takes-all hypothesis.
6 Rule learning needs exploration
This section proposes an adjusted version of the model in [15], propagating gradients to additional proofs beyond the best-scoring proof in order to encourage exploration and reduce the winner-takes-all property of the original model. Taking inspiration from beam search, we aim to keep around proofs with lower current scores that may prove successful later in training, when the influence of other data points has been incorporated into the embeddings.
Propagating gradients to more proofs is equivalent to altering the original loss function for a single training example (Eqn 2) by summing losses over a set of proofs $P$:

$$L(\theta) = \sum_{p \in P} \left[ -y \log S_p(\theta) - (1 - y) \log\left(1 - S_p(\theta)\right) \right] \tag{3}$$

where $y$ is the truth value of the fact and $S_p(\theta)$ is the score of proof $p$. Different choices of the set $P$ then lead to different heuristics. We employ two sets of heuristics for exploration. Top-$k$ propagates gradients to the $k$ highest-scoring proofs. Define a proof path as a sequence of categories (facts and different rule templates) of logical statements applied in a proof. For example, with one rule template of size 2, the categories are fact and rule, and the two proof paths of depth 1 are (fact) and (rule, fact, fact). The all-path heuristic propagates gradients to the top-scoring proof from each high-level proof path to encourage varied exploration.
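The two heuristics can be sketched as a selection step over scored proofs followed by a summed loss. The representation of a proof as a (proof path, score) pair, the function names, and the use of the negative log-likelihood per proof are assumptions for illustration; with `k=1` and `all_path=False` the selection reduces to the original best-proof-only behavior.

```python
import math
from typing import Dict, List, Tuple

Proof = Tuple[Tuple[str, ...], float]  # (proof path, proof score)


def select_proofs(proofs: List[Proof], k: int = 1,
                  all_path: bool = False) -> List[Proof]:
    """Pick the proofs that receive gradients."""
    ranked = sorted(proofs, key=lambda p: p[1], reverse=True)
    selected = ranked[:k]  # top-k highest scoring proofs overall
    if all_path:
        best_per_path: Dict[Tuple[str, ...], Proof] = {}
        for proof in ranked:
            # ranked is sorted, so the first proof seen for a path is its best.
            best_per_path.setdefault(proof[0], proof)
        selected += [p for p in best_per_path.values() if p not in selected]
    return selected


def exploration_loss(selected: List[Proof], y: int, eps: float = 1e-9) -> float:
    """Sum of per-proof negative log-likelihood losses."""
    total = 0.0
    for _, score in selected:
        score = min(max(score, eps), 1.0 - eps)
        total += -(y * math.log(score) + (1 - y) * math.log(1.0 - score))
    return total


proofs = [(("fact",), 0.9), (("fact",), 0.8),
          (("rule", "fact", "fact"), 0.3), (("rule", "fact", "fact"), 0.2)]
print(select_proofs(proofs))                        # original NTP: best proof only
print(select_proofs(proofs, k=2))                   # top-2
print(select_proofs(proofs, k=1, all_path=True))    # best proof of each path
print(select_proofs(proofs, k=2, all_path=True))    # combination
```

In the example, the rule proof has a low initial score and would never be updated under best-proof-only training; the all-path selection keeps it in the loss so that it can improve.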
Table 5 demonstrates the benefits of incorporating exploration heuristics, especially in learning complicated relationships. In contrast to Table 3, rule learning performance is consistently improved by both types of heuristics and by their combination, and this translates into higher fact prediction accuracy.
Table 5: Rule learning and fact prediction performance with exploration heuristics.
Size  Order   Exploration heuristic  Recall  PR-AUC  MRR   ROC-AUC
1     Unary   Top-k                  0.9     0.94    0.28  0.93
1     Unary   all-path               1.0     1.0     0.29  0.93
1     Unary   Top-k all-path         1.0     1.0     0.29  0.95
1     Binary  Top-k                  0.88    0.93    0.52  0.93
1     Binary  all-path               0.99    0.96    0.57  0.98
1     Binary  Top-k all-path         0.97    0.99    0.54  0.96
2     Unary   Top-k                  0.14    0.46    0.07  0.84
2     Unary   all-path               0.14    0.46    0.07  0.84
2     Unary   Top-k all-path         0.92    0.95    0.27  0.96
2     Binary  Top-k                  0.16    0.49    0.21  0.88
2     Binary  all-path               0.04    0.49    0.08  0.89
2     Binary  Top-k all-path         0.92    0.96    0.54  0.98
3     Unary   Top-k                  0.08    0.52    0.03  0.82
3     Unary   all-path               0.18    0.48    0.07  0.85
3     Unary   Top-k all-path         0.38    0.55    0.14  0.93
3     Unary   Top-k all-path         0.64    0.73    0.20  0.96
Table 6 repeats the experiment from Section 5.2 using the new exploration heuristic. Here we find that the model can learn from extremely small amounts of data – as few as 11 active facts.
Table 6: Rule learning and fact prediction performance with exploration as a function of the number of constants.
Size  Order  Constants  Total facts  Active facts  Recall  PR-AUC  ROC-AUC
1     Unary  50         137          23            0.76    0.79    0.86
1     Unary  100        274          47            0.96    0.98    0.92
1     Unary  200        546          92            1.0     1.0     0.92
1     Unary  800        2194         388           1.0     1.0     0.96
2     Unary  50         133          11            0.16    0.35    0.80
2     Unary  100        266          21            0.64    0.77    0.90
2     Unary  200        532          43            0.90    0.94    0.93
2     Unary  800        2139         185           0.98    0.99    0.97
7 Discussion
Neural theorem proving is a promising combination of logical learning and neural network approaches. In this work, we evaluate the performance of the NTP algorithm on synthetic logical datasets with injected relationships. We show that NTP has difficulty recovering relationships in all but the simplest settings. Our experiments suggest the problem lies in the presence of structural local minima, due to the winner-takes-all property of the model. We alter the NTP algorithm to increase exploration, which sharply improves performance.
We believe there are several lessons to be drawn from this work beyond the immediate application to NTP. First, it is helpful to look at synthetic data when evaluating prediction models that involve structure learning, as final prediction accuracy can mask problems with the structure learning component. Second, learning discrete structures as an intermediate step can be accompanied by severe structural local minima, which can be avoided through additional exploration.
Acknowledgments This work is partially supported by NSF Awards IIS1513966/ 1632803/1833137, CCF1139148, DARPA Award#: FA87501820117, DARPAD3M  Award UCB00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO# W911NF1210241 and W911NF1510484. We thank Shariq Iqbal, Zhiyun Lu, and Bowen Zhang for helpful comments.
Supplemental Parameter Experiments
Table 7: Performance of the base model as a function of the number of instantiated rules.
Size  Order  Rules  Recall  PR-AUC  MRR   ROC-AUC
1     Unary  3      0.62    0.76    0.28  0.92
1     Unary  5      0.68    0.75    0.28  0.93
1     Unary  10     0.76    0.78    0.28  0.92
1     Unary  20     0.86    0.83    0.28  0.94
1     Unary  50     0.9     0.84    0.27  0.92
2     Unary  3      0.0     0.39    0.04  0.81
2     Unary  5      0.04    0.3     0.04  0.82
2     Unary  10     0.04    0.21    0.04  0.82
2     Unary  20     0.0     0.14    0.04  0.81
2     Unary  50     0.06    0.09    0.04  0.81
This section contains experiments verifying that the conclusions in the body of the paper hold broadly and are not sensitive to model parameters.
Rules
Table 7 shows how performance of the base model varies with the number of instantiated rules. Increasing the number of rules does improve both recall and precision for relationships of size 1 and order 1. However, adding rules is not a panacea: for even slightly more complex relationships of size 2 and order 1, increasing the number of rules improves recall only slightly at the cost of a large reduction in precision.
Rule Probability
Table 8 shows how the performance of the base model and of the model with exploration heuristic varies with the strength of the relationship, defined as the probability for the head predicate of a relationship to hold for a set of constants if the body predicates hold for that set of constants. The algorithm is still able to learn non-deterministic relationships, although rule learning and fact prediction performance do decrease as relationship strength decreases. The model with exploration heuristic still performs much better than the base algorithm on non-deterministic relationships. Note that the upper bound on fact prediction accuracy also decreases as the relationship strength decreases.
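Table 8: Performance of the base model (Vanilla) and the model with exploration heuristic as a function of relationship strength (rule probability).
Size  Order  Exploration     Rule probability  Recall  PR-AUC  MRR   ROC-AUC
1     Unary  Vanilla         1.0               0.62    0.76    0.28  0.92
1     Unary  Vanilla         0.9               0.36    0.51    0.11  0.83
1     Unary  Vanilla         0.8               0.18    0.4     0.07  0.80
1     Unary  Vanilla         0.7               0.16    0.37    0.05  0.73
1     Unary  Top-k all-path  1.0               1.0     1.0     0.29  0.95
1     Unary  Top-k all-path  0.9               0.98    0.99    0.12  0.87
1     Unary  Top-k all-path  0.8               0.9     0.9     0.08  0.82
1     Unary  Top-k all-path  0.7               0.44    0.51    0.05  0.73
2     Unary  Vanilla         1.0               0.02    0.39    0.04  0.81
2     Unary  Vanilla         0.9               0.0     0.36    0.03  0.79
2     Unary  Vanilla         0.8               0.04    0.36    0.03  0.77
2     Unary  Vanilla         0.7               0.04    0.39    0.02  0.74
2     Unary  Top-k all-path  1.0               0.92    0.95    0.27  0.96
2     Unary  Top-k all-path  0.9               0.96    0.98    0.12  0.95
2     Unary  Top-k all-path  0.8               0.82    0.88    0.07  0.91
2     Unary  Top-k all-path  0.7               0.34    0.44    0.04  0.82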
Relationships
Table 9 shows the effect of injecting a second relationship of the same type on the performance of the base model as well as the model with the Top-2 all-path exploration heuristic. The additional relationship leads to a sharp reduction in relationship learning performance, though the model with exploration heuristic still performs much better than the base algorithm at learning multiple relationships.
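Table 9: Performance of the base model (Vanilla) and the model with exploration heuristic with one or two injected relationships.
Size  Order  Exploration     Relationships  Recall  PR-AUC  MRR   ROC-AUC
1     Unary  Vanilla         1              0.62    0.76    0.28  0.92
1     Unary  Vanilla         2              0.38    0.61    0.31  0.87
1     Unary  Top-k all-path  1              1.0     1.0     0.29  0.95
1     Unary  Top-k all-path  2              0.63    0.71    0.41  0.96
2     Unary  Vanilla         1              0.02    0.39    0.04  0.81
2     Unary  Vanilla         2              0.01    0.48    0.04  0.80
2     Unary  Top-k all-path  1              0.92    0.95    0.27  0.96
2     Unary  Top-k all-path  2              0.47    0.7     0.26  0.93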
Top-$K$ heuristic
Following Rocktäschel and Riedel [15], for proofs that unify with several different facts, we only retain the top-$K$ highest scoring fact unifications per fact in each branch of the proof tree to reduce computational demands. For example, for a proof path of type (rule, fact, fact), we do not take the maximum over all proofs, but only over the proofs that survive this pruning, where we retain the $K$ highest fact unification scores in the second step.
Table 10 shows that varying this parameter has a minimal effect on the outcome of the algorithm.
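| Body size | Body order | Exploration | k | Rule recall | Rule PRAUC | Fact MRR | Fact ROCAUC |
|---|---|---|---|---|---|---|---|
| 2 | Unary | Vanilla | 10 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla | 20 | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Vanilla |  | 0.02 | 0.39 | 0.04 | 0.81 |
| 2 | Unary | Top allpath | 10 | 0.92 | 0.95 | 0.27 | 0.96 |
| 2 | Unary | Top allpath | 20 | 0.88 | 0.93 | 0.26 | 0.96 |
| 2 | Unary | Top allpath |  | 0.88 | 0.93 | 0.26 | 0.96 |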
References
 Barrett et al. [2018] David GT Barrett, Felix Hill, Adam Santoro, Ari S Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. arXiv preprint arXiv:1807.04225, 2018.
 Campero et al. [2018] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical rule induction and theory learning using neural theorem proving. arXiv preprint arXiv:1809.02193, 2018.
 Chakraborty et al. [2017] Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M Rao, et al. Interpretability of deep learning models: a survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6. IEEE, 2017.
 De Raedt and Kersting [2008] Luc De Raedt and Kristian Kersting. Probabilistic inductive logic programming. In Probabilistic Inductive Logic Programming, pages 1–27. Springer, 2008.
 Dong et al. [2019] Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. International Conference on Learning Representations, 2019.

 Evans and Grefenstette [2018] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
 Getoor et al. [2007] Lise Getoor, Nir Friedman, Daphne Koller, and Benjamin Taskar. Probabilistic relational models. Introduction to statistical relational learning, 8, 2007.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Koller et al. [2007] Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, David Heckerman, Chris Meek, et al. Introduction to statistical relational learning. MIT Press, 2007.
 Lake and Baroni [2017] Brenden M Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.
 Manhaeve et al. [2018] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pages 3749–3759, 2018.
 Minervini et al. [2018] Pasquale Minervini, Matko Bosnjak, Tim Rocktäschel, and Sebastian Riedel. Towards neural theorem proving at scale. arXiv preprint arXiv:1807.08204, 2018.
 Muggleton and De Raedt [1994] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679, 1994.
 Muggleton et al. [2012] Stephen Muggleton, Luc De Raedt, David Poole, Ivan Bratko, Peter Flach, Katsumi Inoue, and Ashwin Srinivasan. ILP turns 20. Machine Learning, 86(1):3–23, 2012.
 Rocktäschel and Riedel [2017] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017.
 Russell and Norvig [2016] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2016.
 Serafini and Garcez [2016] Luciano Serafini and Artur d’Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
 Sutton et al. [2012] Charles Sutton, Andrew McCallum, et al. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2012.
 Trouillon et al. [2016] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
 Wu et al. [2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Yang et al. [2017] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pages 2319–2328, 2017.
 Zeng et al. [2014] Qiang Zeng, Jignesh M Patel, and David Page. Quickfoil: Scalable inductive logic programming. Proceedings of the VLDB Endowment, 8(3):197–208, 2014.