1 Introduction
Recent years have witnessed the growing success of deep learning models in a wide range of applications. However, these models are also criticized for the lack of interpretability in their behavior and decision-making process (Lipton, 2016; Mittelstadt et al., 2019), and for being data-hungry. The ability to explain its decisions is essential for developing a responsible and robust decision system (Guidotti et al., 2019). On the other hand, logic programming methods, in the form of first-order logic (FOL), are capable of discovering and representing knowledge in an explicit symbolic structure that can be understood and examined by humans (Evans and Grefenstette, 2018).
In this paper, we investigate the learning-to-explain problem in the scope of inductive logic programming (ILP), which seeks to learn first-order logic rules that explain the data. Traditional ILP methods (Galárraga et al., 2015) rely on hard matching and discrete logic for rule search, which is not tolerant of ambiguous and noisy data (Evans and Grefenstette, 2018). A number of works have proposed differentiable ILP models that combine the strengths of neural and logic-based computation (Yang et al., 2017; Evans and Grefenstette, 2018; Campero et al., 2018; Rocktäschel and Riedel, 2017; Payani and Fekri, 2019). Methods such as ∂ILP (Evans and Grefenstette, 2018) are referred to as forward-chaining methods. They construct rules using a set of predefined templates and evaluate them by applying the rules to background data multiple times to deduce new facts that lie in the held-out set. Other methods such as NeuralLP (Yang et al., 2017) are called backward-chaining methods. Given a query, they search for a rule that, starting from the query, can be derived backward towards the known facts in the background set (related work is discussed in Appendix A).
However, general ILP involves several steps that are NP-hard: (i) the rule search space grows exponentially in the formula length; (ii) assigning the logic variables to be shared by predicates grows exponentially in the number of arguments, which we refer to as the variable binding problem; (iii) the number of rule instantiations needed for formula evaluation grows exponentially in the size of the data. To alleviate these complexities, most works limit the search length to within 3 and resort to template-based or chain-like variable assignments, which limits the expressiveness of the learned rules (a detailed discussion is available in Appendix B). Still, most of these works are limited to small-scale problems with fewer than 10 relations and 1K entities, with NeuralLP the only exception.
In this work, we propose Neural Logic Inductive Learning (NLIL), a differentiable ILP framework that is highly scalable compared to previous works. We propose a divide-and-conquer strategy and decompose the search space into 3 subspaces in a hierarchy, where each of them can be searched efficiently using attention. This enables us to search for rules that are 10 times longer while remaining 3 times faster than the state-of-the-art methods.
More importantly, we show that a scalable ILP method is widely applicable for model explanation in supervised learning scenarios. We apply NLIL to the Visual Genome (Krishna et al., 2016) dataset to learn explanations for 150 object classes over 1M entities. We demonstrate that the learned rules, while maintaining interpretability, have predictive power comparable to densely supervised models, and generalize well with less than 1% of the data.
2 Preliminaries
Supervised learning typically involves learning classifiers that map an object from its input space to a score between 0 and 1. How can one explain the outcome of a classifier? Recent works on interpretability focus on generating heatmaps or attention that self-explain a classifier (Ribeiro et al., 2016; Chen et al., 2018; Olah et al., 2018). We argue that a more effective and human-intelligible explanation is a description of the classifier's connection with other classifiers.
For example, consider an object detector with classifiers Person, Car, Clothing and Inside that detect whether a certain region contains a person, a car, a piece of clothing, or is inside another region, respectively. To explain why a person is present, one can leverage its connection with other attributes, such as “an object is a person if it is inside a car and wearing clothing”. This intuition draws a close connection to a long-standing problem in the first-order logic literature, i.e. Inductive Logic Programming (ILP).
2.1 Inductive Logic Programming
A typical first-order logic system consists of 3 components: entity, predicate and formula. Entities are objects . For example, for a given image, a certain region is an entity , and the set of all possible regions is . Predicates are functions that map entities to 0 or 1, for example . Classifiers can be seen as soft predicates. Predicates can take multiple arguments, e.g. Inside is a predicate with 2 inputs. The number of arguments is referred to as the arity. An atom is a predicate symbol applied to a logic variable, e.g. and .
A first-order logic (FOL) formula is a combination of atoms using the logical operations {∧, ∨, ¬}, which correspond to logical and, or and not, respectively. First-order logic formulas are used to express lifted knowledge that does not depend on specific data. Given a set of predicates , we define the explanation of a predicate as a first-order logic entailment
(1) 
where is called the body (and the head) of the entailment. is a general formula, e.g. in conjunctive normal form (CNF), consisting of atoms with predicate symbols from and logic variables that are either head variables or one of the body variables ,. These variables make the explanation highly transferable as they are “lifted” from actual data. Explanations represented as FOL entailments can be easily interpreted. For example,
(2) 
represents the knowledge that “if an object is inside the car with clothing on it, then it’s a person”.
Given the set of predicates and a set of facts associated with them (usually in the form of a relational knowledge base (KB)), the process of learning FOL rules in the form of Eq.(1) that entail the target predicate is called inductive logic programming. For simplicity, we consider unary and binary predicates in the following, but this definition can be easily extended to predicates with higher arity.
3 Neural Logic Inductive Learning
3.1 The operator view
We introduce the operator view of the predicate in a logic entailment and show how this can be combined with the attention mechanism to significantly reduce the size of the search space for assigning logic variables.
By definition, variables that appear only in the body are existentially quantified. We can turn Eq.(1) into Skolem normal form by replacing all existentially quantified variables with functions w.r.t ,
(3) 
If the functions are already given, Eq.(3) will be much easier to evaluate than Eq.(1). Given a FOL formula, for each instance of variables (i.e. the logic variables of a single predicate ) we only need to evaluate exactly one grounded formula, because the substitution of the rest of the logic variables is determined by the deterministic functions w.r.t . This resolves the complexity in both variable binding and evaluation at once.
Functions in Eq.(3) are defined as any arbitrary function. But what is the set of functions that one can utilize? We propose to turn each predicate in into an operator, such that we have a subspace of the functions . Formally, for each predicate we define its operator as a mapping parameterized by , such that for all instantiations , we have
Intuitively, given a subject entity, a predicate’s operator returns the set of object entities that, together with the subject, satisfy the relation. For example, given an image, the operator of the binary predicate Inside takes an input entity, i.e. a bounding box, and returns the set of bounding boxes that spatially contain the input box. For a unary predicate such as Car, its operator does not depend on the input and simply returns the set of all bounding boxes that contain a car. Operators can take the output of another operator. For example, applying the operator of Inside to the output of the operator of Person returns the bounding boxes that spatially contain a person. Additionally, we introduce auxiliary operators such as the identity operator, which returns the input argument. This is useful for the body formula to reuse the head variables.
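The operator view can be illustrated with a minimal Python sketch. It assumes a toy knowledge base stored as plain sets of tuples; the entity names, the fact sets and the helper names (`binary_operator`, `unary_operator`, `identity_operator`) are hypothetical and only illustrate the set-valued behavior described above, not the parameterization used in the model.

```python
# Toy facts; every name below is an illustrative assumption.
# ("a", "b") in INSIDE means Inside(a, b): box a lies inside box b.
INSIDE = {("person_box", "car_box"), ("cloth_box", "person_box")}
CAR = {"car_box"}  # entities for which the unary predicate Car holds


def binary_operator(facts):
    """Turn a binary predicate into an operator on sets of entities."""
    def op(subjects):
        # Collect every object that satisfies the relation with some subject.
        return {obj for (subj, obj) in facts if subj in subjects}
    return op


def unary_operator(members):
    """A unary predicate's operator ignores its input and returns all members."""
    def op(_subjects=None):
        return set(members)
    return op


def identity_operator(subjects):
    """Auxiliary operator that returns its input unchanged."""
    return set(subjects)


phi_inside = binary_operator(INSIDE)
phi_car = unary_operator(CAR)

containers = phi_inside({"person_box"})  # boxes that contain person_box
all_cars = phi_car()                     # independent of any input
```

Representing an operator's output as a set makes composition straightforward, since the output of one operator is a valid input to the next.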
By using the operators, we have created a space that represents existential variables . This approach implies that all variables in must be expressed as a sequence of transformations starting from only through functions in . Thus . This is slightly constrained from the original definition, but we argue that formulas that contain are typically of less interest. For example, in , cannot be represented by functions of , thus is a completely free variable. This formula is trivial since one is unlikely to infer “an image contains a person” simply by checking if “any car is present in the image”.
Solving the variable binding problem can be framed as searching the appropriate chain of operator calls on the head variables. Any FOL formula that complies with Eq.(3) can be converted into operator form and vice versa. For example, Eq.(2) can be written as
(4) 
where the variables and are eliminated. With operators, one can effectively represent number of variables for one step of operator call, where is the arity of the head predicate which is 1 or 2. Some variables may need more than one operator call to represent. For example, for the friendship relation Friend, “the friends of the friends of a person” can be represented by stacking the operator two times . Thus, we extend the search space hierarchically by stacking operator calls on previous outputs.
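As a small illustration of stacking operator calls, the sketch below applies a Friend operator twice to obtain "the friends of the friends of a person". The fact set, the person names and the helper `stack` are hypothetical.

```python
# Hypothetical Friend facts: ("a", "b") means Friend(a, b).
FRIEND = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}


def phi_friend(subjects):
    """Operator of Friend: all friends of any entity in `subjects`."""
    return {obj for (subj, obj) in FRIEND if subj in subjects}


def stack(operator, times, start):
    """Apply `operator` to `start` repeatedly, feeding each output back in."""
    current = set(start)
    for _ in range(times):
        current = operator(current)
    return current


friends = stack(phi_friend, 1, {"ann"})
friends_of_friends = stack(phi_friend, 2, {"ann"})
```

Each additional call reaches variables one relation step further from the head variable, which is how the search space grows hierarchically.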
3.2 Primitive Statements
We have now obtained a set of variables via operator calls. Now we determine how these variables are assigned a predicate. Note that an atom is defined as a predicate symbol applied to specific logic variables. Similarly, we define a predicate applied to the variables obtained from operator calls as a primitive statement. For example, in Eq.(4), and are two primitive statements.
Similar to an atom, each primitive statement is a mapping from the input space to a scalar value between 0 and 1 in soft logic. As suggested by its name, it evaluates the likelihood of a basic statement being true or not. This is equivalent to evaluating its FOL rule, e.g. and . This notion serves as a basic unit that maps from entity space to probability. In fact, each statement is itself a complete, yet tiny, FOL rule. We will use this notion to construct the search space for more complex and expressive formulas in the next section.
Remarks: The notion of primitive statement draws a close connection between our work and another differentiable ILP method, i.e. NeuralLP (Yang et al., 2017). NeuralLP solves ILP by searching over a chain-like rule space . Rules of this form can be seen as the “unrolled” primitive statement with one argument fixed and the other one transformed with operators, i.e. . If one constrains the body in Eq.(3) to contain only one primitive statement, then NLIL’s rule space degenerates to the chain-like rule. Therefore, NLIL can be seen as a generalization over the chain-like rule space of NeuralLP, where it searches over multiple chains simultaneously as candidates and builds up more complex formulas with these candidates as basic units. In Section 4, we will see more connections in formula evaluation.
3.3 Formula generation
Each primitive statement can be seen as a self-contained FOL rule. One can construct more expressive rules by searching over the logical combinations of these statements, i.e. . The formula search builds hierarchically upon the previous search results, where the exponential complexities of variable binding are already encapsulated by primitive statements, making the search highly efficient.
To do so, we first define the soft logic operations that enable the gradient to flow through. Specifically, we follow the common definition of soft logic not and soft logic and
where and are probabilities between 0 and 1. And we define a soft select function with
as the mixing parameters that are taken from an attention vector. With appropriate parameters,
can represent and . We define a general logic operator as
(5) 
which can represent , , , with proper parameters. However, this can also lead to trivial expressions such as . Such expressions are generally more difficult to recognize than those in primitive statement search. We counter this by introducing negative samples in the evaluation phase, discussed in Section 4.
Searching in the formula space can be carried out hierarchically, similar to operator search. Given a set of statements, one searches over all their one-step combinations using Eq.(5). The next step then explores the combinations of the outputs of the current step.
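A minimal Python sketch of differentiable logic operations is given here for intuition. It assumes the product form of soft logic (soft not as `1 - p`, soft and as `p * q`) and a two-weight mixing vector for the soft select; the exact forms and the function names are assumptions of this sketch and may differ from the definitions in the equations above.

```python
def soft_not(p):
    """Soft negation of a probability p in [0, 1] (assumed form: 1 - p)."""
    return 1.0 - p


def soft_and(p, q):
    """Soft conjunction of two probabilities (assumed form: product)."""
    return p * q


def soft_select(p, alpha):
    """Mix p and its negation with weights alpha = (a_pos, a_neg).

    alpha is assumed to come from an attention vector, so a_pos + a_neg = 1.
    alpha = (1, 0) selects p and alpha = (0, 1) selects not-p.
    """
    a_pos, a_neg = alpha
    return a_pos * p + a_neg * soft_not(p)


def soft_logic_op(p, q, alpha_p, alpha_q):
    """General binary operator: conjunction of two softly selected operands."""
    return soft_and(soft_select(p, alpha_p), soft_select(q, alpha_q))


p, q = 0.9, 0.2
p_and_not_q = soft_logic_op(p, q, (1.0, 0.0), (0.0, 1.0))  # p AND (NOT q)
trivial = soft_logic_op(0.5, 0.5, (1.0, 0.0), (0.0, 1.0))  # p AND (NOT p)
```

Under these assumed forms, a trivial combination such as `p AND (NOT p)` never exceeds 0.25, so it still receives a non-zero score, which is one way to see why negative samples are useful when evaluating candidates.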
4 End-to-End Evaluation and Scalable Training
We have discussed how to decompose the formula search into a sequence of hierarchical searches, each one with no more than quadratic complexities. In this section, we introduce the model for learning to search in this space and its training method.
The goal is to learn logic rules that are first-order, which means the generated rules should be globally consistent and should not change with specific data. Forward-chaining methods such as ∂ILP achieve this via rule templates, while NeuralLP, a backward-chaining method, cannot learn FOL rules because it depends on the query. On the other hand, since NeuralLP generates the rule on the fly, it is more efficient in rule evaluation than its forward-chaining counterparts.
In NLIL, we propose a hybrid approach that is first-order and efficient to evaluate. Specifically, as shown in Figure 2, we split the framework into two parts: rule generation and rule evaluation. Rule generation does not depend on the data. The inputs are predicate and logic variable embeddings. The model is parameterized with a stack of Transformers (Vaswani et al., 2017) that output a sequence of attentions representing the soft picks of rules. Rule evaluation happens after the rules are generated: it samples data from the knowledge base (KB) and computes the cross-entropy loss using the generated rules, which yields gradients that are backpropagated through the entire model.
4.1 Hierarchical search via attentions
Operator search: We propose using the attention mechanism to efficiently represent and search in this space. Specifically, for all predicates, we define their learnable embeddings as , and is the embeddings for binary predicates alone. Without loss of generality, suppose the target predicate is binary; then the attentions are generated as
where , and are learnable embeddings of logic variable , and operator encoding. The is a standard Transformer module that takes the query and input value (which will be internally transformed into keys and values), and returns the soft representation and attention matrix . Formally,
For each step of the operator call search, the operator embeddings of binary predicates are the queries. And head variable embeddings concatenated with all past Transformer outputs are the input value. The output attention represents the soft choice of an operator over the possible arguments. The notation means we perform the search for times, which enables us to search hierarchically in the variable space.
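To make the notion of a "soft choice" concrete, the following pure-Python sketch computes single-query scaled dot-product attention. It is a generic illustration, not the Transformer module used in the model: it has no learned projections or multiple heads, and the embeddings are hypothetical toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    Returns the soft representation (weighted sum of values) and the
    attention weights, which act as a soft choice over the candidates.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * value[i] for w, value in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights


# Hypothetical embeddings: one operator query and three candidate arguments.
query = [1.0, 0.0]
candidates = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
representation, attention = attend(query, candidates, candidates)
best = attention.index(max(attention))
```

The attention weights sum to one, so they can be read as a distribution over which candidate argument the operator picks.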
Primitive statement search: Searching for primitive statements is similar to searching in operator space. Let us denote the soft representation of all variables as , and assume all predicates are binary. Then the attention is generated as
where and are the learnable argument position encodings. And and are the soft choices of predicates’ first and second arguments and is the representation of all primitive statements. Note that the operator of a unary predicate does not take input, thus the search space in fact contains trivial statements such as , i.e. it does not take any of the head variables. We avoid these candidates by masking out the choices in the decoding phase of the Transformer.
Formula search: The logic combination can also be parameterized with attentions
where , are learnable embeddings for positive and negative statement encodings. Given the embeddings of primitive statements , we first softly select those of interest with query which is a set of learnable embeddings. The filtered embeddings are repeated and transformed with logic encoding and representing the positive and negative choices. For each level , is the query that softly picks the left operands of Eq.(5), and the right query depends on the choice of left operands. Finally, the embeddings of left and right operands are combined with a feed-forward network, producing the formula embedding to be fed to the next level.
By stacking multiple Transformer blocks, one can explore formulas starting from simple logic conjunctions and pass the promising ones to the later block to learn more complex ones. After performing levels, we have formula candidates generated. Each can be converted explicitly into a FOL entailment. Finally, the very last layer decodes over all previous candidates and generates the attention that softly picks the best FOL rule.
4.2 End-to-end evaluation
Without loss of generality, assuming all predicates are binary, the first-order logic rules are evaluated over a set of triples , i.e. a relational knowledge base (KB) if organized in terms of predicate. We have shown in Section 3 that primitive statements are equivalent to a chain-like rule. Thus we can adopt the setting in NeuralLP for efficient evaluation. Specifically, each predicate is represented as a binary matrix . Therefore, each operator is parameterized by this matrix. Given a one-hot encoding , we have . Thus for an arbitrary primitive statement , its value is computed as (detailed proofs are available in Yang et al. (2017))
(6) 
For target predicate , let the value of the th rule candidate generated at level be . Then the final output value is
Note that the logic combination of Eq.(5) is carried out here implicitly. Thus for all queries on target predicate , we minimize the cross-entropy loss. That is, for each query, we only ground the formula once for evaluation. Thus we can also perform stochastic training for better scalability.
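The matrix view of evaluation can be sketched in pure Python as follows. The sketch assumes the convention that `matrix[i][j] == 1` when the predicate holds for the pair (entity i, entity j), evaluates a chain-like rule by repeated matrix-vector products starting from a one-hot subject vector, and clamps the result to 1. The convention, the clamping and the function names are assumptions of this sketch rather than the exact form of Eq.(6).

```python
def one_hot(index, size):
    """One-hot encoding of a single entity."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def apply_operator(matrix, vec):
    """Entities reachable from the (soft) entity set `vec` via the relation.

    Computes matrix^T . vec under the convention matrix[i][j] = P(e_i, e_j).
    """
    size = len(vec)
    return [sum(matrix[i][j] * vec[i] for i in range(size))
            for j in range(size)]


def chain_score(matrices, subject, obj):
    """Score of a chain-like rule P1(x, z1), P2(z1, z2), ... ending at obj."""
    size = len(matrices[0])
    state = one_hot(subject, size)
    for matrix in matrices:
        state = apply_operator(matrix, state)
    # Path counts can exceed 1, so clamp to keep the score in [0, 1].
    return min(1.0, state[obj])


# Hypothetical KB over entities {0, 1, 2} with Succ(0, 1) and Succ(1, 2).
SUCC = [[0, 1, 0],
        [0, 0, 1],
        [0, 0, 0]]
two_steps = chain_score([SUCC, SUCC], subject=0, obj=2)
one_step = chain_score([SUCC], subject=0, obj=2)
```

Because each query only needs one pass of matrix-vector products, the cost per query does not depend on enumerating all groundings of the rule.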
During rule generation, the picked variables and rules are represented with attentions. For evaluation, one can either draw hard samples from the attention using Gumbel-softmax (Jang et al., 2016) with a straight-through estimator for backpropagating the soft gradient, or directly take a weighted average over the inputs. We found in the experiments that the weighted average consistently outperforms hard sampling, so we use Gumbel-softmax sampling only for testing and visualization.
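The two options can be contrasted with a small forward-pass sketch in pure Python. It shows a Gumbel-softmax draw that yields a relaxed sample plus the one-hot choice derived from it, next to a plain weighted average; the straight-through gradient itself requires an autodiff framework and is not shown. The function names, logits and temperature are illustrative assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, temperature, rng):
    """Return a relaxed sample and the hard one-hot choice derived from it."""
    # Guard against log(0) in case the generator returns exactly 0.0.
    uniforms = [max(rng.random(), 1e-12) for _ in logits]
    gumbels = [-math.log(-math.log(u)) for u in uniforms]
    soft = softmax([l + g for l, g in zip(logits, gumbels)], temperature)
    hard_index = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == hard_index else 0.0 for i in range(len(soft))]
    return soft, hard


def weighted_average(weights, inputs):
    """Soft alternative: mix all candidate values by their attention weights."""
    return sum(w * x for w, x in zip(weights, inputs))


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
soft_sample, hard_sample = gumbel_softmax_sample(logits, 0.5, rng)
mixed_value = weighted_average(softmax([0.0, 0.0]), [0.0, 1.0])
```

A hard sample commits to a single candidate, which is convenient for reading out an explicit rule, while the weighted average keeps every candidate in play during training.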
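| Model | FB15K-237 MRR | FB15K-237 Hits@10 | FB15K-237 Time | WN18 MRR | WN18 Hits@10 | WN18 Time |
| --- | --- | --- | --- | --- | --- | --- |
| NeuralLP | 0.24 | 36.2 | 250 | 0.94 | 94.5 | 54 |
| TransE | 0.29 | 46.5 | - | 0.50 | 94.3 | - |
| NLIL | 0.25 | 32.4 | 82 | 0.95 | 94.6 | 12 |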
| KB | # facts | # entities | # predicates |
| --- | --- | --- | --- |
| ES10 | 17 | 10 | 3 |
| ES50 | 77 | 50 | 3 |
| ES1K | 1.5K | 1K | 3 |
| WN18 | 106K | 40K | 18 |
| FB15K | 272K | 15K | 237 |
| VG | 1.9M | 1.4M | 2100 |
5 Experiments
We first evaluate NLIL on classical ILP benchmarks and compare it with 3 state-of-the-art KB completion methods in terms of accuracy and efficiency. Then we show that NLIL is capable of learning FOL explanations for object classifiers on a large image dataset when scene graphs are present. Though each scene graph corresponds to a small KB, the total number of graphs makes the task infeasible for all classical ILP methods. We show that NLIL can overcome this via efficient stochastic training.
5.1 Classical ILP benchmarks
We evaluate NLIL together with two state-of-the-art differentiable ILP methods, i.e. NeuralLP (Yang et al., 2017; we use the official implementation at https://github.com/fanyangxyz/NeuralLP) and ∂ILP (Evans and Grefenstette, 2018; we use the third-party implementation at https://github.com/aisystems/DILPCore), and an efficient statistical relational learning method, TransE (Bordes et al., 2013). We create separate Transformer blocks for each target predicate; the embedding size for each block is set to 32. All experiments are conducted on a machine with an i7-8700K CPU, 32G RAM and one GTX 1080 Ti GPU.
Benchmark datasets: (i) the Even-and-Successor (ES) benchmark introduced in Evans and Grefenstette (2018), which involves two unary predicates, Even and Zero, and one binary predicate, Succ. The goal is to learn FOL rules over a set of integers. The benchmark is evaluated with 10, 50 and 1K consecutive integers starting at 0; (ii) FB15K-237 is a subset of the Freebase knowledge base (Toutanova and Chen, 2015) containing general knowledge facts; (iii) WN18 (Bordes et al., 2013) is a subset of WordNet containing relations between words. Statistics of the datasets are provided in Table 2.
Knowledge base completion: All models are evaluated on the KB completion task. The benchmark datasets are split into train/valid/test sets, each containing fact triplets in the form of . The model is tasked to predict the probability of a fact triplet (query) being present in the KB. We use Mean Reciprocal Rank (MRR) and Hits@10 as evaluation metrics (see Appendix C for details).

Results on the Even-and-Successor benchmark are shown in Table 2(a). Since the benchmark is noise-free, we only show the wall-clock time for completely solving the task. As previously mentioned, the forward-chaining method, i.e. ∂ILP, scales exponentially in the number of facts and quickly becomes infeasible for 1K entities. Thus, we skip its evaluation on the other benchmarks.
Results on FB15K-237 and WN18 are shown in Table 2. All 3 methods achieve similar performance on both benchmarks, with TransE performing slightly better on FB15K-237 and NLIL on WN18. NLIL and NeuralLP yield similar scores. This is because the benchmarks favor symmetric/asymmetric relations or compositions of a few relations (Sun et al., 2019), so most valuable rules already lie within the chain-like search space of NeuralLP. Thus the improvements gained from a larger search space with NLIL are limited. On the other hand, with the Transformer block and a smaller model created for each target predicate, NLIL can achieve a similar score at least 3 times faster.
Scalability for long rules: We demonstrate that NLIL can explore longer rules efficiently. We compare the wall-clock time of NeuralLP and NLIL for performing one epoch of training against different maximum rule lengths. As shown in Figure 2(b), NeuralLP searches over a chain-like rule space and thus scales linearly with the length, while NLIL searches over a hierarchical space and thus grows logarithmically. The search time for length 32 in NLIL is similar to that for length 3 in NeuralLP.
5.2 ILP on Visual Genome dataset
| Model | Visual Genome R@1 | Visual Genome R@5 |
| --- | --- | --- |
| MLP+RCNN | 0.53 | 0.81 |
| Freq | 0.40 | 0.44 |
| NLIL | 0.51 | 0.52 |
The ability to perform ILP efficiently extends the applications of NLIL beyond canonical KB completion. For example, in visual object detection and relation learning, supervised models can learn to generate a scene graph (as in Figure 1) for each image. It consists of nodes, each labeled with an object class, and each pair of objects is connected by one type of relation. The scene graph can then be represented as a relational KB over which one can perform ILP. Learning FOL rules on such output of a supervised model is beneficial, as it provides an alternative way of interpreting model behavior in terms of its relations with other classifiers that is consistent across the dataset.
To show this, we conduct experiments on the Visual Genome dataset (Krishna et al., 2016). The original dataset is highly noisy (Zellers et al., 2018), so we use a preprocessed version available as the GQA dataset (Hudson and Manning, 2019). The scene graphs are converted to a collection of KBs, whose statistics are shown in Table 2. We filter out the predicates with fewer than 1500 occurrences. The processed KBs contain 213 predicates. Then we perform ILP to learn explanations for the top 150 objects in the dataset.
Quantitatively, we evaluate the learned rules on predicting the object class labels on a held-out set in terms of R@1 and R@5. As none of the ILP works scale to this benchmark, we compare NLIL with two supervised baselines: (i) MLP+RCNN: an MLP classifier with RCNN features of the object (available in the GQA dataset) as input; and (ii) Freq: a frequency-based baseline that predicts the object label by looking at the most frequently occurring object class in the relation that contains the target. This method is non-trivial. As noted in Zellers et al. (2018), a large number of triples in Visual Genome can be predicted well by knowing only the relation type and either the object or the subject.
Explaining objects with rules: Results are shown in Table 3. We see that the supervised method achieves the best scores, as it relies on highly informative visual features. On the other hand, NLIL achieves a comparable score on R@1 solely relying on KBs with sparse binary labels. We note that NLIL outperforms Freq significantly. This means the FOL rules learned by NLIL are beyond the superficial correlations exhibited by the dataset. We verify this finding by showing the rules for top objects in Table 4.
Induction for few-shot learning: Logic inductive learning is data-efficient and the learned rules are highly transferable. To see this, we vary the size of the training set and compare the R@1 scores for 3 methods. As shown in Figure 2(c), NLIL achieves a similar R@1 score with less than 1% of the training set.
6 Conclusion
In this work, we propose Neural Logic Inductive Learning, a differentiable ILP framework that learns explanatory rules from data. We demonstrate that NLIL can scale to very large datasets while being able to search over complex and expressive rules. More importantly, we show that a scalable ILP method is effective in explaining decisions of supervised models, which provides an alternative perspective for inspecting the decision process of machine learning systems.
Acknowledgments
We thank Ramesh Arvind (ramesharvind@gatech.edu) and Hoon Na (hna30@gatech.edu) for implementing the MLP baseline.
References
Bordes et al. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795.
Campero et al. (2018). Logical rule induction and theory learning using neural theorem proving. arXiv preprint arXiv:1809.02193.
Chen et al. (2018). Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7239–7248.
Evans and Grefenstette (2018). Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64.
Galárraga et al. (2015). Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal 24 (6), pp. 707–730.
Guidotti et al. (2019). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51 (5), pp. 93.
Ho et al. (2018). Rule learning from knowledge graphs guided by embedding models. In International Semantic Web Conference, pp. 72–90.
Hudson and Manning (2019). GQA: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709.
Jang et al. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
Krishna et al. (2016). Visual Genome: connecting language and vision using crowdsourced dense image annotations.
Lavrac and Dzeroski (1994). Inductive logic programming. In WLP, pp. 146–160.
Lipton (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
Mittelstadt et al. (2019). Explaining explanations in AI. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 279–288.
Olah et al. (2018). The building blocks of interpretability. Distill 3 (3), pp. e10.
Omran et al. (2018). Scalable rule learning via learning representation. In IJCAI, pp. 2149–2155.
Payani and Fekri (2019). Inductive logic programming via differentiable deep neural logic networks. arXiv preprint arXiv:1906.03523.
Ribeiro et al. (2016). Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
Rocktäschel and Riedel (2017). End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pp. 3788–3800.
Sun et al. (2019). RotatE: knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
Toutanova and Chen (2015). Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Yang et al. (2017). Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pp. 2319–2328.
Zellers et al. (2018). Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840.
Appendix A Related Work
Inductive Logic Programming (ILP) is the task that, given observed data, seeks to summarize the underlying patterns shared in the data and express them as a set of logic programs (or rules/formulae) (Lavrac and Dzeroski, 1994). Approaches for solving ILP can generally be grouped into forward-chaining and backward-chaining methods.
Traditional ILP methods such as AMIE+ (Galárraga et al., 2015) and RLvLR (Omran et al., 2018) rely on explicit search-based methods for rule mining with various pruning techniques. These works can scale up to very large knowledge bases. However, the algorithm complexity grows exponentially in the number of variables and predicates involved. The acquired rules are often restricted to Horn clauses with a maximum length of less than 3, limiting the expressiveness of the logical rules. On the other hand, compared with differentiable approaches, traditional methods make use of hard matching and discrete logic that lack tolerance for ambiguous and noisy data.
The state-of-the-art differentiable forward-chaining methods focus on rule learning on predefined templates (Evans and Grefenstette, 2018; Campero et al., 2018; Ho et al., 2018) (NTP (Rocktäschel and Riedel, 2017) is a backward-chaining method but uses templates as well), typically in the form of a Horn clause with one head predicate and two body predicates with chain-like variables, i.e.
To evaluate the rules, one starts with a background set of facts and repeatedly applies the rules to every possible triple until no new facts can be deduced. Then the deduced facts are compared with a held-out ground-truth set. Rules learned with this approach are first-order, i.e. data-independent, and can be readily interpreted. However, the deduction phase can quickly become infeasible with a larger background set. Although ∂ILP (Evans and Grefenstette, 2018) proposes to alleviate this by performing only a fixed number of steps, works of this type generally only scale to KBs with fewer than 1K facts and 100 entities.
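The deduction loop described above can be sketched in Python as follows. The sketch assumes rules that follow a single chain-like template, `head(x, z) <- body1(x, y) AND body2(y, z)`, and crisp (non-differentiable) facts; the predicate names, the facts and the `forward_chain` helper are hypothetical.

```python
def forward_chain(facts, rules, max_steps=None):
    """Apply chain-like rules until no new facts appear (or max_steps is hit).

    facts: set of (predicate, subject, object) triples.
    rules: list of (head, body1, body2), read as
           head(x, z) <- body1(x, y) AND body2(y, z).
    """
    known = set(facts)
    step = 0
    while max_steps is None or step < max_steps:
        derived = set()
        for head, body1, body2 in rules:
            for (p1, x, y) in known:
                if p1 != body1:
                    continue
                for (p2, y2, z) in known:
                    if p2 == body2 and y2 == y:
                        derived.add((head, x, z))
        if derived <= known:  # fixed point: nothing new was deduced
            break
        known |= derived
        step += 1
    return known


background = {("Parent", "a", "b"), ("Parent", "b", "c")}
rules = [("Grandparent", "Parent", "Parent")]
deduced = forward_chain(background, rules)
```

Every step scans pairs of known facts for every rule, which illustrates why the deduction phase becomes expensive as the background set grows; `max_steps` mirrors the idea of running only a fixed number of steps.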
Backward-chaining methods such as NeuralLP (Yang et al., 2017) construct rules on the fly when given a specific query. NeuralLP adopts a flexible ILP setting: instead of predefining templates, it assumes a chain-like Horn clause can be constructed to answer the query
Each step of the reasoning in the chain can be efficiently represented by matrix multiplication. The resulting algorithm is highly scalable compared to its forward-chaining counterparts and can learn rules on large datasets such as Freebase. The main drawback of NeuralLP is that the rule generation depends on the specific query, i.e. it is data-dependent, which makes it difficult to extract FOL rules that are interpretable and transferable. In addition, while it can learn rules without templates, the form of the formula is still restricted to chain-like Horn clauses.
Appendix B Challenges in ILP
Standard ILP approaches are difficult and involve several procedures that have been proved to be NP-hard. The complexity comes from 3 levels. First, the search space for a formula is vast. The body of the entailment can be arbitrarily long and the same predicate can appear multiple times with different variables; for example, the Inside predicate in Eq.(2) appears twice. Most ILP works constrain the logic entailment to be a Horn clause, i.e. the body of the entailment is a flat conjunction over literals, with the length limited to within 3 for large datasets.
Second, constructing formulas also involves assigning logic variables that are shared across different predicates, which we refer to as variable binding. For example, in Eq.(2), to express that a person is inside the car, we use and to represent the region of a person and that of a car, and the same two variables are used in Inside to express their relations. Different bindings lead to different meanings. For a formula with arguments (Eq.(2) has 7), there are possible assignments. Existing ILP works either resort to constructing formulas from predefined templates (Evans and Grefenstette, 2018; Campero et al., 2018) or from chain-like variable references (Yang et al., 2017), limiting the expressiveness of the learned rules.
Finally, evaluating a formula candidate is expensive. A FOL rule is data-independent. To evaluate it, one needs to replace the variables with actual entities and compute its value. This is referred to as grounding or instantiation. Each variable used in a formula can be grounded independently, meaning a formula with variables can be instantiated into grounded formulas, where is the number of total entities. For example, Eq.(2) contains 3 logic variables: , and . To evaluate this formula, one needs to instantiate these variables into possible combinations, and check whether the rule holds in each case. However, in many domains, such as object detection, such a grounding space is vast (e.g. all possible bounding boxes of an image), making full evaluation infeasible. Many forward-chaining methods such as ∂ILP (Evans and Grefenstette, 2018) scale exponentially in the size of the grounding space, and are thus limited to small-scale datasets with fewer than 10 predicates and 1K entities.
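Because each variable is grounded independently, the number of groundings is the number of entities raised to the number of variables. The short Python sketch below enumerates them explicitly for a toy entity set; the helper name `count_groundings` and the entity counts are illustrative.

```python
from itertools import product


def count_groundings(num_variables, entities):
    """Enumerate every assignment of entities to variables and count them."""
    return sum(1 for _ in product(entities, repeat=num_variables))


small = count_groundings(3, range(10))    # 3 variables over 10 entities
closed_form = 10 ** 3                     # entities ** variables
larger = 1000 ** 3                        # 3 variables over 1K entities
```

Even with only 3 variables, 1K entities already give a billion groundings, which is why exhaustive evaluation does not scale.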
Appendix C Experiments
Model setting: For the KB completion task, we set the number of operator calls to 2 and formula combinations to 1 in NLIL, as most of the relations in those benchmarks can be recovered by symmetric/asymmetric relations or compositions of a few relations (Sun et al., 2019). Thus complex formulas are not preferred. For FB15K-237, binary predicates are grouped hierarchically into domains. To avoid unnecessary search overhead, we use the most frequent 20 predicates that share the same root domain (e.g. “award”, “location”) with the head predicate for rule body construction, which is a similar treatment to that in Yang et al. (2017).
Evaluation metrics: Following the conventions in (Yang et al., 2017; Bordes et al., 2013), we use Mean Reciprocal Rank (MRR) and Hits@10 as evaluation metrics. For each query , the model generates a ranking list over all possible groundings of predicate , with other ground-truth triplets filtered out. Then MRR is the average of the reciprocal ranks of the queries in their corresponding lists, and Hits@10 is the percentage of queries that are ranked within the top 10 in the list.
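The two metrics can be computed as in the following Python sketch. It assumes each query has already been scored against all candidate groundings; the function names, candidate names and scores are hypothetical.

```python
def filtered_rank(scores, target, other_true):
    """Rank of `target` after filtering out other ground-truth answers.

    scores: dict mapping each candidate to its model score.
    other_true: candidates that are also correct and must not count against
                the target.
    """
    target_score = scores[target]
    better = sum(1 for candidate, score in scores.items()
                 if candidate != target
                 and candidate not in other_true
                 and score > target_score)
    return better + 1


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Percentage of queries whose target is ranked within the top k."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


rank = filtered_rank({"a": 0.9, "b": 0.8, "c": 0.7}, target="c", other_true={"a"})
ranks = [1, 2, 20]
mrr = mean_reciprocal_rank(ranks)
hits10 = hits_at_k(ranks, k=10)
```

In the example, candidate "a" is another correct answer and is filtered out, so the target "c" is ranked second rather than third.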