Learning Symbolic Rules for Reasoning in Quasi-Natural Language

by Kaiyu Yang, et al.
Princeton University

Symbolic reasoning, rule-based symbol manipulation, is a hallmark of human intelligence. However, rule-based systems have had limited success competing with learning-based systems outside formalized domains such as automated theorem proving. We hypothesize that this is due to the manual construction of rules in past attempts. In this work, we ask how we can build a rule-based system that can reason with natural language input but without the manual construction of rules. We propose MetaQNL, a "Quasi-Natural" language that can express both formal logic and natural language sentences, and MetaInduce, a learning algorithm that induces MetaQNL rules from training data consisting of questions and answers, with or without intermediate reasoning steps. Our approach achieves state-of-the-art accuracy on multiple reasoning benchmarks; it learns compact models with much less data and produces not only answers but also checkable proofs. Further, experiments on a real-world morphological analysis benchmark show that our method can handle noise and ambiguity. Code will be released at https://github.com/princeton-vl/MetaQNL.






1 Introduction

Symbolic reasoning—rule-based symbol manipulation—is a core component of human intelligence (Mercier and Sperber, 2017). It has been a core part of computer science research, and has achieved significant success in domains such as software verification (Darvas et al., 2005) and theorem proving (Kovács and Voronkov, 2013; McCune, 1997). However, such success has been restricted to domains amenable to rigid, precise formalization. It remains a challenge how to translate such success into “informal” domains such as reasoning with common sense knowledge and natural language input. Prior attempts to build rule-based systems, which rely on manually constructed rules, have achieved limited success and tended to produce brittle systems.

Deep learning provides an attractive alternative that can easily sidestep the question of representation. A deep network can be trained to perform a reasoning task by directly predicting the answer without explicit symbol manipulation (Clark et al., 2020). However, deep networks can require a large amount of training data and can suffer from poor generalization. More importantly, unlike symbolic systems, a deep network is a black box that is hard to interpret, inspect, and verify. Such lack of interpretability can be undesirable in certain applications, especially those critical to safety and security.

In this work, we ask how to build a rule-based system that reasons symbolically but can work with natural language and handle domains difficult to formalize. Such a system would perform reasoning by explicit symbol manipulation based on known rules, and would therefore be more interpretable and verifiable, while remaining flexible enough to handle natural language input.

At a glance, this may appear a large departure from the conventional wisdom that learning-based systems, particularly deep networks, are far superior to rule-based systems, as history has demonstrated repeatedly. However, we hypothesize that this conventional wisdom is incorrect because it assumes a false dichotomy between using learning and using rules; rule-based systems underperformed not because they were rule-based, but because it is difficult to construct rules manually. Further, we hypothesize that learning rules from data is key to building effective rule-based systems, but that it may require a different kind of learning than gradient descent.

The goal of this work is thus to develop a method that automatically learns symbolic rules from data to enable rule-based reasoning with natural language. This poses two main questions. First, what is the system of rules—the basic structures that define what symbols and manipulations are allowed—such that it is compatible with not only formal logic but also natural language? Second, what is the learning algorithm that induces a set of rules from training data?

In this work, we take initial steps toward answering both questions. We propose MetaQNL, a formal symbolic system we call a “Quasi-Natural Language”, which is compatible with not only rigorous logical inference but also natural language expressions. We also propose MetaInduce, a learning algorithm that induces MetaQNL rules from training data that consists of questions and answers, with or without intermediate reasoning steps.

Figure 1: An example proof with 4 assumptions, 1 goal, and 2 rule applications. Each rule has multiple premises and one conclusion. Both the premises and the conclusion can have variables that bind to concrete sentences when the rule is applied.

MetaQNL: a Symbolic System in Quasi-Natural Language. In MetaQNL, a sentence is a sequence of words and variables (“The elephant is [X]”), including ordinary English sentences without variables. A rule has multiple sentences as premises (“The elephant is [X]”, “If something is [X] then it [Y]”) and one sentence as the conclusion (“The elephant [Y]”). When applying the rule in reasoning, variables are substituted with concrete sentences ([X] ↦ “strong”, [Y] ↦ “likes cats”). Therefore, rules capture abstract knowledge independent of specific instances—the above rule holds whether [Y] is “likes cats” or “is sleepy”. Such abstraction is essential for reasoning in both humans and machines (Marcus and Davis, 2020).

Fig. 1 illustrates how sentences and rules are used in reasoning. Starting from known sentences (assumptions), we apply rules to derive new sentences until the goal is reached. At each step, we substitute the variables in a rule with concrete sentences. This process resembles Metamath (Megill and Wheeler, 2019), a formal language developed for formalizing mathematical proofs, where each step also consists of selecting a theorem and instantiating it with a suitable substitution. So we refer to the reasoning process as theorem proving and the result in Fig. 1 as a proof. It is worth noting that reasoning in MetaQNL is interpretable by design: it is transparent about what rules are assumed; it produces not only an answer, but also a proof that can be mechanically checked against the rules.

Assumptions and the goal are usually given when applying our system to a specific task. To solve the task, two issues remain: (1) Rule induction: What is the set of rules? (2) Theorem proving: How to apply the rules to find a proof? Theorem proving has been studied extensively in classical AI (Robinson and Voronkov, 2001) and more recently with deep learning (Alemi et al., 2016; Yang and Deng, 2019; Bansal et al., 2019), and we can adapt existing algorithms such as forward chaining and backward chaining (Russell and Norvig, 2002). In this work, we simply use existing provers and focus instead on the more challenging problem of rule induction.

MetaInduce: an Algorithm to Learn MetaQNL Rules. Rule induction can be formulated as a discrete optimization problem that seeks a minimum set of rules that are consistent with training examples. Note that it is important to seek a small number of rules because we always have a trivial solution that consists of one rule per example but is unlikely to generalize. This optimization is challenging due to the discrete, combinatorial search space.

We introduce MetaInduce, a general method for learning MetaQNL rules. It encodes the problem as a maximum satisfiability (MAX-SAT) problem, which can be solved efficiently by off-the-shelf solvers (De Moura and Bjørner, 2008). Our method consists of 3 steps. First, given a training example, a rule proposer proposes a set of concrete rules (rules without variables) as candidates. This set can be overcomplete and inaccurate. These rules are used to prove the example using existing provers such as forward/backward chaining. Second, we generate abstract rules from concrete rules via a symbolic procedure called anti-unification (Plotkin, 1970; Kutsia et al., 2014). Third, we encode the proof paths in MAX-SAT and solve for a subset of all rules using a MAX-SAT solver.

Overview of Results. We benchmark our method on 3 tasks: learning compositional instructions, logical reasoning, and morphological analysis. For compositional instructions, our method not only achieves 100% accuracy on MiniSCAN (Lake et al., 2019) and SCAN (Lake and Baroni, 2018), but also recovers the ground truth rules. For logical reasoning, it achieves state of the art on RuleTaker (Clark et al., 2020). For morphological analysis, it learns morphological rules from real-world data and is competitive with neural seq2seq models in some languages. Compared to existing methods, our approach learns compact models with much less data, and produces not only answers but also checkable proofs. On RuleTaker, our approach learns a model that has only 2869 symbols but is competitive with a prior approach that uses a neural network with 11 billion parameters.

2 Related Work

Symbolic Reasoning. Symbolic reasoning has been studied extensively in classical AI, such as theorem proving (Kovács and Voronkov, 2013; Robinson and Voronkov, 2001). An open problem is to perform symbolic reasoning in domains without a natural formalization, such as natural images or texts. One common approach is to manually construct a formal system (e.g., based on first-order logic with manually defined functions and predicates), then perform semantic parsing to translate images or texts into formalized statements as input to a reasoning module operating in a clean formal world.

For example, to judge whether one statement implies another, Mineshima et al. (2015) use a semantic parser to convert both statements into higher-order logic (with predefined predicates), and then run an automated theorem prover. Semantic parsing is still far from reliable; therefore, researchers have developed techniques for learning it jointly with the reasoning module (Mao et al., 2018; Saparov and Mitchell, 2021; Dai et al., 2019; Li et al., 2020). In contrast, our approach does not require a semantic parser, because rules in MetaQNL are directly applicable to natural language.

Natural Logic (McAllester and Givan, 1993; MacCartney and Manning, 2007) is a class of symbolic systems defined using the syntax of natural language, bypassing semantic parsing. Compared to our system, Natural Logic is more specialized because it is a specific logic committed to predefined rules, which restricts the type of reasoning it can perform to monotonicity reasoning (Icard III and Moss, 2014). In contrast, we have no such restrictions because MetaQNL is not a specific logic but a meta-language with minimal structure such that it can instantiate various types of reasoning, just as Metamath is a meta-language that can describe a variety of mathematical logics (Megill and Wheeler, 2019).

None of these works discussed so far learn rules from data; they instead use a predefined formal system that is already specialized and already encodes a substantial amount of prior knowledge. In contrast, MetaQNL is almost “knowledge-free” in the sense that it imposes the weakest possible structure on the permitted rules and lets the specific rules emerge from data through learning.

Reasoning with Neural Networks. Neural networks can perform “soft” reasoning in the space of continuous vectors without manipulating discrete symbols explicitly. Prior works have used transformer-based language models (Vaswani et al., 2017) for soft reasoning (Polu and Sutskever, 2020; Saha et al., 2020; Tafjord et al., 2020; Gontier et al., 2020; Talmor et al., 2020). Clark et al. (2020) finetune a pretrained transformer to classify whether the goal is provable from the assumptions, encoding them as sentences in a constrained natural language.

Saha et al. (2020) and Tafjord et al. (2020) go one step further to generate proofs in addition to yes/no answers. Bostrom et al. (2021) generate conclusions from premises in unconstrained natural language.

Instead of using a generic transformer, researchers have also added inductive biases to the neural architecture. Many are inspired by symbolic reasoning and are often called neuro-symbolic architectures. Rocktäschel and Riedel (2017) introduce Neural Theorem Provers (NTPs). Given the assumptions and the goal in first-order logic, they use backward chaining to recursively construct a neural network. However, NTPs only work for formalized inputs and do not scale due to exponentially many proof paths in backward chaining. Weber et al. (2019) extend NTPs to natural language by extracting symbols from sentences using an off-the-shelf named entity recognizer. Minervini et al. (2020) make NTPs more scalable by dynamically pruning unpromising proof paths in backward chaining.

Researchers have also attempted to embed symbolic structures such as logic formulas into continuous vectors while preserving logical operations (Smolensky, 1990; Grefenstette, 2013; Kathryn and Mazaitis, 2018; Lee et al., 2016; Schlag et al., 2019). For example, Neural Logic Machines (Dong et al., 2018) are a neuro-symbolic architecture based on continuous approximation of logical inference: predicates are represented as tensors, and rules are neural operators that map tensors to tensors.

Cingillioglu and Russo (2020) propose an end-to-end neural architecture called unification networks to learn rules with variables from concrete examples. However, their system of rules is significantly less general than ours: their variables can only bind to a single word, whereas our variables bind to arbitrary sentences. In addition, their system does not support multistep chained reasoning. All reasoning is done in a single step: producing a conclusion in the form of an answer ("yes/no", a number, etc.) given a number of premises consisting of a question and a set of supporting facts.

Unlike these prior works, we learn symbolic rules instead of weights in a neural network. Further, during inference, we generate symbolic proofs whose correctness with respect to the induced rules is guaranteed and can be mechanically checked. Saha et al. (2020) and Tafjord et al. (2020) also generate proofs, but their proofs are natural language texts whose correctness is neither guaranteed nor mechanically checkable—their approach trains neural networks to directly predict both answers and proofs, but does not expose a system of rules against which a proof can be checked.

Rule Induction. Inductive logic programming (ILP) learns rules in first-order logic programs such as Prolog and Datalog (Plotkin, 1972; Muggleton, 1991; Cropper and Dumančić, 2020). Extending it to natural language is non-trivial, partially due to the need for a predefined ontology of objects and predicates, as well as a perfect semantic parser, neither of which is feasible. Unlike ILP, we learn rules in MetaQNL, which can express not only logic programs but also natural language sentences. Our experiments show that MetaQNL can solve tasks that are not easily solvable by ILP.

For learning rules, our MetaInduce algorithm draws inspiration from existing ILP approaches. They encode proofs as either a boolean satisfiability problem solvable by off-the-shelf SAT solvers  (Raghothaman et al., 2019) or a differentiable function amenable to gradient descent (Yang et al., 2017; Evans and Grefenstette, 2018; Si et al., 2019). Compared to these approaches, our rule space is different and more complex. Our rules consist of sentences with variables, whereas rules in ILP are typically Horn clauses in first-order logic. Further, ILP often imposes strong syntactic constraints on what rules are valid, e.g., using rule templates (Evans and Grefenstette, 2018; Raghothaman et al., 2019), or restricting to binary predicates (Evans and Grefenstette, 2018). These constraints are critical to good performance but are domain-dependent and difficult to get right (Cropper and Dumančić, 2020). Over-constraining the rule space makes the system less expressive, less generally applicable, and more brittle in the presence of noise. Another difference is that we minimize the number of rules in order to generalize, which is unnecessary for ILP due to stronger syntactic constraints.

Our space of rules includes a rich hierarchy from abstract rules to concrete rules, making the search space much larger. In contrast, most ILP works assume function-free first-order logic such as Datalog. Their variables can only be instantiated with concrete entities, making their rule space much simpler.

RNNLogic (Qu et al., 2021) learns first-order rules for knowledge base completion. They generate rules using RNNs, which is feasible because they require that rules can be expressed as a sequence of predicates. The strong syntactic constraint makes it less suitable for more general reasoning. Beyond first-order logic, Nye et al. (2020) learn rules for a string rewriting system. MetaQNL is more general because it can be applied to not only string rewriting but also other forms of reasoning (see Sec. 6).

3 MetaQNL: a Symbolic System in Quasi-Natural Language

MetaQNL is quasi-natural because it has a formal syntax compatible with natural language. As in natural language, a sentence in MetaQNL is simply a sequence of tokens. There are 3 different types of tokens—words, variables, and special symbols. Taking the sentence “$FALSE$ The elephant likes [X]” as an example, “The”, “elephant” and “likes” are words. MetaQNL treats words as symbols and does not assume any prior knowledge about their meaning. “[X]” is a variable—a placeholder that binds to concrete sentences in reasoning. “$FALSE$” is a special symbol; special symbols are useful for encoding the structures of specific tasks, which will become clearer in Sec. 6. In this paper, we delimit special symbols with $. Sentences without variables are called concrete sentences, e.g., “$FALSE$ The elephant likes cats”.
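To make the token taxonomy concrete, here is a minimal Python sketch of how the three token types could be distinguished, assuming the paper's surface conventions (variables in square brackets, special symbols delimited by $); the function names are our own illustration, not the authors' implementation.

```python
def token_type(token: str) -> str:
    """Classify a MetaQNL token: variable, special symbol, or word."""
    if token.startswith("[") and token.endswith("]"):
        return "variable"
    if token.startswith("$") and token.endswith("$"):
        return "special"
    return "word"

def is_concrete(sentence: list) -> bool:
    """A concrete sentence contains no variables."""
    return all(token_type(t) != "variable" for t in sentence)

sentence = "$FALSE$ The elephant likes [X]".split()
print([token_type(t) for t in sentence])
# ['special', 'word', 'word', 'word', 'variable']
```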

Definition 1 (Sentence).

Let $\mathcal{W}$, $\mathcal{V}$, and $\mathcal{S}$ be vocabularies of words, variables, and special symbols respectively; they are disjoint and countable. Let $\mathcal{T} = \mathcal{W} \cup \mathcal{V} \cup \mathcal{S}$; then any $t \in \mathcal{T}$ is a token. A sentence is a non-empty sequence of tokens. A concrete sentence is a sentence without any variable, i.e., a non-empty sequence of tokens in $\mathcal{W} \cup \mathcal{S}$.

MetaQNL expresses permitted reasoning steps through rules. A rule has multiple sentences as its premises (“The elephant [X]”, “If something [X] then it [Y]”) and one sentence as the conclusion (“The elephant [Y]”). Intuitively, the conclusion should follow from the premises regardless of what values the variables take. Concrete rules are rules without variables.

Definition 2 (Rule).

A rule takes the form of $s_1; s_2; \ldots; s_n \vdash s$, where $s_1, \ldots, s_n$ are premises, and $s$ is the conclusion. It is concrete if all premises and the conclusion are concrete.

In reasoning, we instantiate rules with concrete rules by substituting all variables with concrete sentences. Given the rule $r$ = “The elephant [X]; If something [X], then it [Y] ⊢ The elephant [Y]”, we can instantiate it with the substitution $\sigma = \{[X] \mapsto \text{“is strong”}, [Y] \mapsto \text{“likes cats”}\}$, deriving the concrete rule $cr$ = “The elephant is strong; If something is strong, then it likes cats ⊢ The elephant likes cats”. In such cases, we say $r$ is more general than $cr$, or vice versa, $cr$ is an instance of $r$.

Definition 3 (Substitution).

Let $\mathcal{C}$ be the set of sentences with only words and variables (without special symbols). A substitution $\sigma$ is a function from $\mathcal{V}$ to $\mathcal{C}$. Substitutions can be extended to be functions on tokens, sentences, and rules. Given a token $t$, applying the substitution produces the sentence $\sigma(t)$ if $t \in \mathcal{V}$, and the single-token sentence $t$ otherwise. Given a sentence $s = t_1 t_2 \ldots t_n$, applying $\sigma$ produces $\sigma(s) = \sigma(t_1)\sigma(t_2)\ldots\sigma(t_n)$. Given a rule $s_1; \ldots; s_n \vdash s$, applying $\sigma$ produces $\sigma(s_1); \ldots; \sigma(s_n) \vdash \sigma(s)$.

We are abusing notation to treat a token and a single-token sentence interchangeably. Also, $\sigma(t_1)\sigma(t_2)\ldots\sigma(t_n)$ denotes concatenation when the $\sigma(t_i)$ are sentences. A substitution is defined as a function on all variables in $\mathcal{V}$, but in practice it only involves a few. For example, the substitution $\{[X] \mapsto \text{“is strong”}, [Y] \mapsto \text{“likes cats”}\}$ only involves two variables. In such cases, we think of it as being the identity function for other variables, e.g., $\sigma([Z]) = [Z]$. This convention makes it easier to compose substitutions as function composition.
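A small Python sketch of Definition 3 may help: a substitution is a finite map from variables to token sequences, extended elementwise to sentences and rules. The variable syntax and helper names here are our own illustration.

```python
def is_var(token):
    """Variables follow the paper's surface syntax, e.g. '[X]'."""
    return token.startswith("[") and token.endswith("]")

def apply_to_sentence(subst, sentence):
    """Replace each variable by its image; other tokens map to themselves."""
    out = []
    for tok in sentence:
        out.extend(subst.get(tok, [tok]) if is_var(tok) else [tok])
    return out

def apply_to_rule(subst, premises, conclusion):
    """Apply a substitution to every premise and to the conclusion."""
    return ([apply_to_sentence(subst, p) for p in premises],
            apply_to_sentence(subst, conclusion))

subst = {"[X]": "is strong".split(), "[Y]": "likes cats".split()}
premises = ["The elephant [X]".split(), "If something [X] then it [Y]".split()]
conclusion = "The elephant [Y]".split()
new_premises, new_conclusion = apply_to_rule(subst, premises, conclusion)
print(" ".join(new_conclusion))  # The elephant likes cats
```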

As in the example before, applying a substitution to sentences/rules makes them more specific. This introduces a partial order among sentences/rules. It is straightforward to verify that the relation defined below is a partial order. We leave the proof to Appendix A.

Definition 4 (Partial order among sentences and rules).

Let $s_1$ and $s_2$ be two sentences; $s_1$ is an instance of $s_2$ (denoted by $s_1 \preceq s_2$) if and only if there exists a substitution $\sigma$ such that $s_1 = \sigma(s_2)$. In this case, we also say $s_2$ is more general than $s_1$. Similarly, given two rules $r_1$ and $r_2$, $r_1$ is an instance of $r_2$ (or $r_2$ is more general than $r_1$, denoted by $r_1 \preceq r_2$) if and only if there exists a substitution $\sigma$ such that $r_1 = \sigma(r_2)$.
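The instance relation for single sentences can be checked mechanically: a concrete sentence is an instance of a pattern if the pattern's variables can bind to non-empty token sequences that reproduce it exactly, with repeated variables bound consistently. A backtracking sketch in Python (our own illustration, not the authors' implementation):

```python
def is_var(token):
    return token.startswith("[") and token.endswith("]")

def match(pattern, concrete, subst=None):
    """Return a substitution witnessing that `concrete` is an instance of
    `pattern`, or None if no such substitution exists."""
    subst = dict(subst or {})
    if not pattern:
        return subst if not concrete else None
    head, rest = pattern[0], pattern[1:]
    if not is_var(head):
        if concrete and concrete[0] == head:
            return match(rest, concrete[1:], subst)
        return None
    if head in subst:  # a repeated variable must reuse its earlier binding
        n = len(subst[head])
        if concrete[:n] == subst[head]:
            return match(rest, concrete[n:], subst)
        return None
    for n in range(1, len(concrete) + 1):  # try every non-empty binding
        subst[head] = concrete[:n]
        result = match(rest, concrete[n:], subst)
        if result is not None:
            return result
    return None

s = match("The elephant [Y]".split(), "The elephant likes cats".split())
print(s)  # {'[Y]': ['likes', 'cats']}
```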

A subtlety in the definition is judging whether two rules are equal. For a MetaQNL rule, premises are unordered, and variable renaming does not matter. In other words, rule equality is defined modulo premise reordering and $\alpha$-conversion.

In reasoning (Fig. 1), the prover is given a set of rules $\mathcal{R}$, multiple concrete sentences as assumptions, and one sentence $g$ as the goal. It iteratively instantiates concrete rules from $\mathcal{R}$ and applies them to generate a proof of $g$. Similar to Prolog, $g$ may have variables (“The elephant [X]”), and the prover succeeds if it proves any instance of $g$ (e.g., “The elephant likes vegetables”).

Definition 5 (Proof).

A proof $G$ is a directed acyclic graph whose vertices are concrete sentences or concrete rules. Each concrete rule $cr \in G$, with premises $s_1, \ldots, s_n$ and conclusion $s$, must satisfy two conditions: (1) $cr$ connects to its conclusion via an edge $cr \to s$; (2) for each premise $s_i$, we have $s_i \in G$ and an edge $s_i \to cr$. Besides these edges, there cannot be any other edge in $G$. Also, there can be multiple sentences without inbound edges (the proof’s assumptions), but there is only one sentence without outbound edges (the proof’s goal).

Definition 6 (Theorem proving).

Given a set of rules $\mathcal{R}$, concrete sentences $a_1, \ldots, a_n$ as assumptions, and a sentence $g$ as the goal, the theorem prover tries to find a proof $G$ such that: (1) $G$’s assumptions are $a_1, \ldots, a_n$. (2) $G$’s goal is an instance of $g$. (3) Every rule in $G$ is an instance of a rule in $\mathcal{R}$.

4 MetaInduce: Learning MetaQNL Rules from Data

Problem Setup and Loss Function.

Rule induction is a machine learning problem where the model consists of rules rather than continuous weights. The problem setup is familiar: given a training set $D_{\mathit{train}}$ and a test set $D_{\mathit{test}}$, the goal is to use $D_{\mathit{train}}$ to find a model $\mathcal{R}$ that performs well not only on $D_{\mathit{train}}$ itself but also on $D_{\mathit{test}}$. For MetaQNL specifically, the training set consists of a set of provable examples $D^{+}$ and a set of unprovable examples $D^{-}$. They both contain training examples in the form of $(A, g)$, where $A$ is a set of assumptions and $g$ is the goal. A model $\mathcal{R}$ is consistent with a provable example if $g$ is provable from $A$ using rules in $\mathcal{R}$. Similarly, $\mathcal{R}$ is consistent with an unprovable example if $g$ cannot be proved from $A$. In other words, provable examples are positive examples demonstrating sound logical inference, whereas unprovable examples are negative examples demonstrating unsound inference.

Given only $D_{\mathit{train}}$, we need to find a model consistent with as many examples in $D_{\mathit{test}}$ as possible. However, it is not sufficient to optimize the consistency with training data, because there is a trivial model that performs perfectly in training but fails in testing—one rule per example. That is, given an example $(A, g) \in D^{+}$ with $A = \{a_1, \ldots, a_n\}$, it is provable using the rule $a_1; \ldots; a_n \vdash g$.

Thus we need to penalize the model complexity. While other choices are possible, here we measure model complexity as the number of rules. We minimize a loss function that evaluates both model complexity and consistency with training data:

$$L(\mathcal{R}) = |\mathcal{R}| - \omega^{+} N^{+}(\mathcal{R}) - \omega^{-} N^{-}(\mathcal{R}) \qquad (1)$$

where $|\mathcal{R}|$ is the number of rules; $N^{+}(\mathcal{R})$ and $N^{-}(\mathcal{R})$ are the numbers of provable/unprovable examples consistent with $\mathcal{R}$ respectively; and $\omega^{+}$ and $\omega^{-}$ are hyperparameters controlling the trade-off between the three terms.

The optimization problem is challenging. Given $\mathcal{R}$, even a single evaluation of $L(\mathcal{R})$ is expensive: $N^{+}$ and $N^{-}$ require running the prover on all training examples. Furthermore, it is much harder to find the optimal $\mathcal{R}$ due to the combinatorial and non-differentiable search space. We introduce MetaInduce, a general method for learning rules by encoding Eqn. 1 as a maximum satisfiability (MAX-SAT) problem, which can be solved efficiently by off-the-shelf solvers.

Input: Training data $\{(A_i, g_i)\}_{i=1}^{n}$; $A_i$ is a set of assumptions; $g_i$ is the goal.
Output: A model $\mathcal{R}$ consisting of a set of rules.
1  $\mathcal{R} \leftarrow \emptyset$
2  for epoch $\leftarrow 1$ to num_epochs do
3      for $i \leftarrow 1$ to $n$ do
4          $C_i \leftarrow$ rule_proposer($A_i$, $g_i$)
5          $P_i \leftarrow$ prove($A_i$, $g_i$, $\mathcal{R} \cup C_i$)
6      $\bar{C} \leftarrow$ anti-unify the concrete rules appearing in the proofs $P_1, \ldots, P_n$
7      $\mathcal{R} \leftarrow$ MAX-SAT pruning of $\bar{C}$ w.r.t. Eqn. 1
8  return $\mathcal{R}$
Algorithm 1 MetaInduce: A general method for learning MetaQNL rules from data.

Overview of MetaInduce. MetaInduce is outlined in Algorithm 1. Similar to SGD for training neural networks, MetaInduce goes through the training data for several epochs; during an epoch, it processes one example per iteration. Given an example (either provable or unprovable), it first relies on a rule proposer for generating candidate rules that are concrete and potentially useful for proving the goal from the assumptions. Then it runs an existing prover to search for proofs, using both the candidate rules and existing rules in the model. At the end of each epoch, MetaInduce abstracts all concrete rules used in the proofs into rules with variables. Then it performs rule pruning—selecting as the model a subset of the rules minimizing the loss (Eqn. 1). Next, we explain each step in more detail.

Rule Proposal. The rule proposer is dataset-dependent and allows incorporating prior knowledge about a particular task. However, a good rule proposer alone—if not embedded in MetaInduce—is not sufficient for learning rules. First, the rule proposer only generates concrete rules. It is up to MetaInduce to abstract them into rules with variables. Second, the rule proposer generates rules useful for a single training example, whereas MetaInduce learns rules useful for the entire dataset. Third, the rule proposer does not have to be accurate. MetaInduce can reliably learn correct rules even if most candidate rules are wrong (see Sec. 6).

Theorem Proving. Theorem proving in MetaQNL is relatively straightforward, thanks to existing algorithms such as forward/backward chaining. Forward chaining starts with the assumptions and applies rules to derive new sentences until the goal is reached. Conversely, backward chaining starts with the goal and applies rules in the reverse direction until all assumptions are satisfied. We implement forward chaining using the Rete algorithm for fast rule matching (Doorenbos, 1995) and the basic backward chaining algorithm from a standard textbook (Russell and Norvig, 2002). The prover returns proofs containing all different paths to the goal up to a predefined depth limit.
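For intuition, a forward-chaining loop over concrete rules (the case the prover faces when candidate rules from the rule proposer are concrete) can be sketched as follows. Sentences are plain strings and the depth limit is a simple step cap; both are our simplifications, and the paper's implementation uses the Rete algorithm instead.

```python
def forward_chain(assumptions, rules, goal, max_steps=100):
    """Each rule is (premises, conclusion); sentences are plain strings.
    Returns True if `goal` becomes derivable from `assumptions`."""
    known = set(assumptions)
    for _ in range(max_steps):
        newly_derived = {
            conclusion
            for premises, conclusion in rules
            if all(p in known for p in premises) and conclusion not in known
        }
        if not newly_derived:  # fixpoint: nothing new can be derived
            return goal in known
        known |= newly_derived
    return goal in known

rules = [
    (["The elephant is strong",
      "If something is strong then it likes cats"],
     "The elephant likes cats"),
]
assumptions = ["The elephant is strong",
               "If something is strong then it likes cats"]
print(forward_chain(assumptions, rules, "The elephant likes cats"))  # True
```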

Rule Abstraction. The proofs contain only concrete rules (Definition 5), and we have to generalize them into rules with variables. We use a symbolic procedure called anti-unification (Plotkin, 1970) to find general rules given concrete ones. Given two rules $r_1$ and $r_2$, anti-unification attempts to find the most specific rule $r$ such that $r_1 \preceq r$ and $r_2 \preceq r$ (analogous to the lowest common ancestor of two nodes in a tree; see Fig. 2(a) for examples). It does so by recursively matching the beginning of two sentences. Please see Appendix B for details.

Let $C$ be the set of all concrete rules in the proofs. To augment $C$ with general rules, we iteratively anti-unify rules in $C$ and add the results back, until no new rule can be generated. We denote the result by $\bar{C}$, which contains not only concrete rules but also their generalizations.
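To illustrate, a toy anti-unifier over single sentences (token sequences) can be written as follows: it matches two sentences from the beginning and inserts a variable wherever they diverge, returning every generalization it finds. The variable names [V1], [V2], the restriction that variables may not be adjacent, and the sentence-only scope are our simplifications; the paper's procedure operates on full rules.

```python
from functools import lru_cache

def anti_unify(s1, s2):
    """All generalizations of two token sequences, with '?' placeholders
    renamed to [V1], [V2], ... from left to right."""
    @lru_cache(maxsize=None)
    def go(i, j, prev_var):
        if i == len(s1) and j == len(s2):
            return {()}
        results = set()
        if i < len(s1) and j < len(s2) and s1[i] == s2[j]:
            results |= {(s1[i],) + g for g in go(i + 1, j + 1, False)}
        if not prev_var:  # a variable absorbs a non-empty chunk of each side
            for di in range(1, len(s1) - i + 1):
                for dj in range(1, len(s2) - j + 1):
                    results |= {("?",) + g for g in go(i + di, j + dj, True)}
        return results
    out = set()
    for g in go(0, 0, False):
        k, named = 0, []
        for tok in g:
            if tok == "?":
                k += 1
                named.append(f"[V{k}]")
            else:
                named.append(tok)
        out.add(tuple(named))
    return out

a = "The elephant is strong".split()
b = "The elephant likes cats".split()
for g in sorted(anti_unify(a, b)):
    print(" ".join(g))
```

Note that several incomparable generalizations come back (e.g., “The elephant [V1]” and “[V1] elephant [V2]”), mirroring the observation in Fig. 2(a) that anti-unification may have multiple solutions.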

Rule Pruning with MAX-SAT. Rule pruning selects the model $\mathcal{R}$ as a subset of $\bar{C}$ by encoding all proofs as a MAX-SAT problem, whose solution corresponds to a set of rules that approximately minimizes the loss function in Eqn. 1. We encode each rule using a boolean variable (also denoted $r$); $r = \text{True}$ means the rule should be included in $\mathcal{R}$. For any concrete rule, we have an additional boolean variable $cr$; $cr = \text{True}$ means $cr$ is necessary for proving the training examples. We impose 3 different types of constraints on these boolean variables:


  • Data consistency: For the $i$th training example, its proof may have many paths from the assumptions to the goal, but the example is provable as long as any one of them is valid. For provable examples (those in $D^{+}$), we encode the example as a disjunction of proof paths. Each path is valid if and only if all concrete rules along the path are valid. So we encode a proof path as a conjunction of all $cr$ boolean variables it contains (see Fig. 2(b)). Analogously, for unprovable examples (those in $D^{-}$), we simply take the negation of the previous boolean formula to encourage the absence of a valid proof. Finally, a good model is not necessarily consistent with every training example. So each example is encoded as a soft constraint with weight $\omega^{+}$ or $\omega^{-}$.

  • Model complexity: To minimize the number of rules, we add a soft constraint of weight 1 for each rule’s boolean variable, encouraging $r = \text{False}$.

  • Rule instantiation: Each concrete rule $cr$ must be an instance of a rule in $\mathcal{R}$. Let $\{r_1, \ldots, r_k\}$ be the set of all rules in $\bar{C}$ such that $cr \preceq r_j$. $cr$ can be instantiated if and only if at least one of them is in the model. Therefore, we add a hard constraint $cr \rightarrow (r_1 \vee \cdots \vee r_k)$.

Given a set of boolean constraints, each with a weight, a MAX-SAT solver finds an assignment of boolean variables that minimizes the combined weight of violated constraints, which equals Eqn. 1 for the specific constraints above. Therefore, running an off-the-shelf MAX-SAT solver on these constraints gives us a set of rules that minimizes our loss function.
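For intuition, the pruning objective can be emulated with a brute-force search in place of a real MAX-SAT solver (which is what would be used in practice). The rule names, the weight, and the simplification to provable examples only are hypothetical; each example is represented by its proof paths, each path being the set of rules it needs.

```python
from itertools import product

def prune(candidate_rules, examples, w_plus=2.0):
    """Pick the rule subset minimizing  |R| - w_plus * (#examples covered),
    where an example is covered if any of its proof paths lies in R."""
    best_loss, best_subset = float("inf"), frozenset()
    for bits in product([False, True], repeat=len(candidate_rules)):
        chosen = {r for r, b in zip(candidate_rules, bits) if b}
        covered = sum(
            any(path <= chosen for path in proof_paths)  # any valid path
            for proof_paths in examples
        )
        loss = len(chosen) - w_plus * covered
        if loss < best_loss:
            best_loss, best_subset = loss, frozenset(chosen)
    return best_subset

candidates = ["r1", "r2", "r3"]
examples = [
    [{"r1"}, {"r3"}],  # example 1 is provable via r1 alone or r3 alone
    [{"r2"}, {"r3"}],  # example 2 is provable via r2 alone or r3 alone
]
print(prune(candidates, examples))  # frozenset({'r3'})
```

The preference for the single shared rule r3 over the pair {r1, r2} is exactly the minimality pressure that makes the learned rules generalize.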

Figure 2: (a) Examples of anti-unifying two concrete rules into more abstract rules. Anti-unification may have multiple solutions that are not comparable with each other. (b) Encoding a proof with 3 paths from assumptions to the conclusion as a boolean constraint. Each concrete rule corresponds to a boolean variable. The proof is a disjunction of all paths; each path is a conjunction of concrete rules.

5 Soft Matching

Similar to classical theorem proving, reasoning in MetaQNL relies on precise and rigid matching between rules and assumptions. For example, the rule “The [A] likes cats ⊢ Someone likes cats” is not applicable to the assumption “An elephant likes cats” due to the lack of “The”. Although precise matching guarantees the rigor of reasoning, reasoning performed in natural language is often fuzzy and ambiguous, without the same degree of rigor as a mathematical proof. Supporting fuzzy reasoning is necessary for MetaQNL to cover a broader spectrum of reasoning in natural language. It requires us to relax the rigorous proofs in Definition 6 to fuzzy proofs with scores indicating the degree of rigor. Another benefit of soft matching is that it allows the system to degrade gracefully—it can produce an “educated guess” if the existing knowledge base of rules is insufficient for producing a rigorous answer.

We extend MetaQNL with soft matching—relaxing the rigid matching conditions when applying rules. Think of applying a rule $r$ to a set of assumptions as instantiating concrete rules on the fly: with rigid matching, we instantiate concrete rules $cr$ such that $cr$’s premises are among the assumptions and $cr$ must be an exact instance of $r$. In contrast, soft matching produces both concrete rules and scores. $cr$’s premises are still among the assumptions, but $cr$ is not required to be an exact instance of $r$. Further, the matching scores can be aggregated to calculate proof scores, allowing us to produce fuzzy proofs when rigorous proofs are impossible.

Definition 7 (Soft matching).

Given a rule R and concrete sentences as assumptions, soft matching outputs concrete rules with matching scores, (R1, w1), …, (Rk, wk), such that (1) each score wi ∈ [0, 1], and (2) each Ri’s premises are exactly the given assumptions.

There are many possible ways to realize soft matching, including using neural networks. For example, we could use a pretrained neural language model to output concrete rules and matching scores (Devlin et al., 2019; Raffel et al., 2020). In this paper, as an initial step, we experiment with a simple soft matching procedure based on anti-unification.

Specifically, we perform soft matching only in testing. During training, MetaInduce still uses rigid matching for learning rules. Once the rules are learned, we can use them for making predictions with soft matching. Given a learned rule R and assumptions from a testing example, R is not necessarily applicable, but we try to find a more general rule that is applicable by anti-unifying the assumptions with the premises of R. Following the previous example, anti-unifying the assumption “An elephant likes cats” with the premise “The [A] likes cats” produces “[A] likes cats”, which is applicable. Note that this procedure accommodates rigid matching as a special case: if R itself is applicable, the anti-unification would produce R itself. For calculating the matching score, we use heuristics based on the number of perfectly matched words between the assumptions and the premises.
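A word-overlap score of this kind can be sketched as follows. The exact heuristic is not specified here, so this is one simple instantiation; the `match_score` name and the normalization are illustrative assumptions.

```python
from difflib import SequenceMatcher

def match_score(premise, assumption):
    """Score a (premise, assumption) pair by exactly matched words.

    Both arguments are token lists. Variables like '[A]' are treated as
    wildcards and simply dropped before counting matched words; the score is
    the normalized overlap, reaching 1.0 only when the word sequences coincide.
    """
    p = [t for t in premise if not (t.startswith("[") and t.endswith("]"))]
    a = list(assumption)
    matched = sum(b.size for b in SequenceMatcher(None, p, a).get_matching_blocks())
    return 2 * matched / (len(p) + len(a))

premise = "The [A] likes cats".split()       # premise of a learned rule
exact   = "The elephant likes cats".split()  # rigid match possible
fuzzy   = "An elephant likes cats".split()   # only a soft match
```

As expected, the rigidly matchable assumption scores higher than the fuzzy one, so ranking by this score prefers rigid matches.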

6 Experiments

We instantiate MetaQNL/MetaInduce on three tasks: learning compositional instructions on MiniSCAN (Lake et al., 2019)/SCAN (Lake and Baroni, 2018), logical reasoning on RuleTaker (Clark et al., 2020), and morphological analysis on SIGMORPHON 2018 (Cotterell et al., 2018). Not only does MetaInduce learn rules achieving state-of-the-art prediction accuracy on the three synthetic datasets (MiniSCAN, SCAN, and RuleTaker), but it uses only a small fraction of the training data. Further, the rules recovered by MetaInduce match precisely with the ground truth rules of MiniSCAN and SCAN. We evaluate soft matching only on SIGMORPHON 2018, because it is a real-world dataset with noise and ambiguity. Experiments on it show that our method can tolerate noise (even without soft matching), and they also suggest directions for future improvements.

Learning Compositional Instructions on MiniSCAN and SCAN. These two datasets are standard benchmarks for compositional generalization. They share a similar format of translating a source sequence to a target sequence, e.g., “jump” → “JUMP”, “jump twice” → “JUMP JUMP”. MiniSCAN consists of only 14 training examples, whereas SCAN has 17K. The state of the art has reached 100% accuracy on both datasets (Liu et al., 2020; Nye et al., 2020; Chen et al., 2020).

In training, we treat each source/target pair as a provable example with empty assumptions and the goal “source $MAPS_TO$ target”, e.g., “jump twice $MAPS_TO$ JUMP JUMP”. In testing, we use “source $MAPS_TO$ [Y]” as the goal, where [Y] is a placeholder to be filled by the prover. The prover succeeds if it proves the goal with any [Y]. We do not include any unprovable examples; the set of unprovable examples is empty.
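The encoding of a source/target pair into a MetaQNL goal can be sketched as below; the `make_goal` helper and the (assumptions, goal) tuple representation are illustrative assumptions, not the released code's API.

```python
def make_goal(source, target=None):
    """Encode a SCAN-style example as (assumptions, goal).

    Training: the goal is 'source $MAPS_TO$ target' with empty assumptions.
    Testing: the target is unknown, so the placeholder [Y] is used instead,
    to be filled in by the prover.
    """
    rhs = target if target is not None else "[Y]"
    return ([], f"{source} $MAPS_TO$ {rhs}")

train_example = make_goal("jump twice", "JUMP JUMP")
test_example = make_goal("jump twice")
```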

We use a rule proposer independent of specific training examples. First, it generates concrete rules by combining the sentences in the training set as premises and conclusions in all possible ways. Then it filters the rules using prior knowledge about compositional generalization: the meaning of a long sequence depends on its subsequences. For example, “jump $MAPS_TO$ JUMP ⊢ jump twice $MAPS_TO$ JUMP JUMP” is a valid rule, since “jump” is a subsequence of “jump twice”. But “look $MAPS_TO$ LOOK ⊢ jump twice $MAPS_TO$ JUMP JUMP” is not a valid rule. Note that similar assumptions were also made in prior works (Nye et al., 2020; Liu et al., 2020).

We use backward chaining as the prover and Z3 (De Moura and Bjørner, 2008) as the MAX-SAT solver. For SCAN, we train only on the 400 shortest examples and test on four different splits: simple, length, addprim_jump, and addprim_turn_left. On both datasets, MetaInduce achieves 100% testing accuracy and successfully recovers the ground truth rules. Here is an example rule learned from SCAN: “[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] after [C] $MAPS_TO$ [D] [B]”. More examples are included in Appendix C and Appendix D. We tune the hyperparameter λ on 1000 validation examples. The validation accuracy is fairly robust w.r.t. different values of λ (Table 1).

λ | 0.32 | 0.64 | 1.28 | 2.56 | 5.12 | 10.24 | ∞
#Rules learned | 16 | 17 | 20 | 20 | 20 | 20 | 20
Accuracy | 85.9 | 90.3 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
Table 1: Validation accuracies on the length split of SCAN with different λ. λ = ∞ means encoding data consistency as hard constraints.

Logical Reasoning on RuleTaker. The RuleTaker dataset tests logical reasoning in synthetic English sentences. It consists of data examples similar to the one in Fig. 1. The original RuleTaker is generated with the closed-world assumption (CWA)—it assumes a sentence is false if it is not provable. Tafjord et al. (2020) introduce a version of RuleTaker with the open-world assumption (OWA). Under OWA, a sentence can be proved, disproved, or neither. We benchmark on the OWA version.

Some examples in RuleTaker are meant to be disproved: if “The elephant is tall” is true, then “The elephant is not tall” should be false. We add the special symbol $TRUE$ or $FALSE$ before sentences, so that the previous example can be disproved using the rule “$TRUE$ The elephant is tall ⊢ $FALSE$ The elephant is not tall”. For each example to be proved, we add it to the set of provable examples and its negation to the set of unprovable examples. Conversely, for each example to be disproved, we add it to the unprovable set and its negation to the provable set. For examples that can be neither proved nor disproved, we add both the example and its negation to the unprovable set.
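This dataset transformation can be sketched as follows; the helper names (`negate`, `add_example`) and the three-way label strings are illustrative assumptions.

```python
def negate(goal):
    """Flip the truth prefix: '$TRUE$ s' <-> '$FALSE$ s'."""
    if goal.startswith("$TRUE$ "):
        return "$FALSE$ " + goal[len("$TRUE$ "):]
    return "$TRUE$ " + goal[len("$FALSE$ "):]

def add_example(goal, label, provable, unprovable):
    """Populate the provable/unprovable example sets for one labeled goal."""
    if label == "proved":
        provable.append(goal)
        unprovable.append(negate(goal))
    elif label == "disproved":
        unprovable.append(goal)
        provable.append(negate(goal))
    else:  # neither proved nor disproved (OWA "unknown")
        unprovable.append(goal)
        unprovable.append(negate(goal))

P, N = [], []
add_example("$TRUE$ The elephant is tall", "proved", P, N)
```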

Train | Model | N/A | 0 | 1 | 2 | 3 | 4 | 5 | All
D3 | ProofWriter | 99.9 | 100.0 | 99.3 | 99.7 | 99.2 | 99.1 | 98.8 | 99.6
D3 | Ours | 99.4 | 100.0 | 100.0 | 99.7 | 98.9 | 98.9 | 98.6 | 99.4
D5 | ProofWriter | 99.9 | 100.0 | 99.3 | 99.7 | 99.2 | 99.1 | 98.8 | 99.6
D5 | Ours | 99.6 | 100.0 | 100.0 | 100.0 | 100.0 | 99.4 | 99.1 | 99.7
Table 2: Answer prediction accuracies on the OWA version of RuleTaker. The model is trained on D5 or D3 and tested on D5 (proof depth ≤ 5). Columns correspond to different proof depths within the test data. N/A means there is no proof, since the test example can be neither proved nor disproved.

RuleTaker includes ground truth proofs providing concrete rules such as “$TRUE$ The elephant is tall ⊢ $FALSE$ The elephant is not tall”, but not any abstraction that allows generalizing beyond the specific examples. Our rule proposer simply generates these ground truth concrete rules, whereas MetaInduce tries to learn general rules. We use simple heuristics for filtering invalid rules generated by anti-unification.

All experiments are run on machines with no GPUs, 32GB RAM, and four Intel Xeon Silver 4114 CPUs. We run MetaInduce for 5 epochs on a random subset of 10000 training examples, which takes about 20 hours. We use forward chaining as the prover and a depth limit of 7. The hyperparameters are tuned on validation data. Please see Appendix E for example rules.

We compare our method with ProofWriter (Tafjord et al., 2020)—a state-of-the-art method that also uses ground truth proofs. Following their setup, we test on D5 (a subset of RuleTaker with proof depths of at most 5) and train separate models on D5 and D3 (proof depths of at most 3). Training on D3 evaluates the model’s generalization to longer proofs. Results are in Table 2. MetaInduce achieves state-of-the-art accuracy and is competitive with ProofWriter. Further, it learns significantly more compact models with much less training data. For example, the model trained on D3 using only a fraction of the training data has only 79 rules and a total of 2869 symbols, yet achieves a test accuracy of 99.4. In comparison, ProofWriter has an accuracy of 99.6 and is based on T5-11B (Raffel et al., 2020), which has 11 billion parameters.

Morphological Analysis on SIGMORPHON 2018. To evaluate our method on real-world natural language, we use the morphological analysis task and dataset in Akyürek et al. (2021): given the surface form of a word (e.g., studied), the model predicts its lemma (study) and an unknown number of morphological tags, such as V (verb), SG (singular), and PST (past tense). The data is constructed from the SIGMORPHON 2018 dataset. It consists of 3 languages with varying morphological complexity—Spanish, Swahili, and Turkish. For each language, they sample a training set of 1000 examples and three test sets of 100 examples each (FUT, PST, and OTHER). FUT consists exclusively of words in the future tense; PST consists of words in the past tense. The training set has only 8 past-tense words and 8 future-tense words. Therefore, FUT and PST test models’ few-shot learning capabilities.

To apply MetaQNL, we represent both the surface form and the lemma as characters. The surface form serves as the assumption, whereas the lemma and the tags serve as conclusions. For example, for the Spanish surface form zarandeamos with lemma zarandear and tags V;IND;PRS;1;PL, we treat z a r a n d e a m o s as the assumption and construct 6 provable examples with goals $LEMMA$ z a r a n d e a r, $TAG$ V, $TAG$ IND, $TAG$ PRS, $TAG$ 1, and $TAG$ PL. Examples with the same assumption but any other goals are treated as unprovable. The rule proposer simply generates rules that can prove the conclusion in a single step, such as “z a r a n d e a m o s ⊢ $LEMMA$ z a r a n d e a r” (more examples in Appendix F).
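The character-level encoding described above can be sketched as below; the `encode_example` helper and the (assumption, goals) return shape are illustrative assumptions.

```python
def encode_example(surface, lemma, tags):
    """Encode one morphological-analysis example for MetaQNL.

    The surface form's characters (space-separated) form the assumption; one
    goal asks for the lemma and one goal asks for each morphological tag.
    """
    assumption = " ".join(surface)  # characters as tokens
    goals = ["$LEMMA$ " + " ".join(lemma)] + ["$TAG$ " + t for t in tags]
    return assumption, goals

assumption, goals = encode_example(
    "zarandeamos", "zarandear", ["V", "IND", "PRS", "1", "PL"]
)
```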

Following Akyürek et al. (2021), we evaluate the predictions using F1 score and compare with a standard seq2seq neural network: LSTMs with attention. Note that we compare with the baseline in Akyürek et al. (2021), not their proposed method. Their method is orthogonal to ours, since it focuses on generating synthetic data for augmenting the training set.

Model | Spanish FUT/PST | Spanish OTHER | Swahili FUT/PST | Swahili OTHER | Turkish FUT/PST | Turkish OTHER
LSTMs + attention | 66 | 88 | 75 | 90 | 69 | 85
Ours | 55 | 82 | 81 | 86 | 53 | 71
Ours + Soft matching | 66 | 84 | 80 | 85 | 53 | 70
Table 3: F1 scores of morphological analysis on three languages. Results on FUT and PST are averaged. Our method is competitive with a neural seq2seq model on Swahili. Soft matching bridges the gap on Spanish but does not help on the other two languages.

Results are shown in Table 3. Note that the task is not trivial: the neural baseline is far from perfect, especially on FUT and PST (an F1 score of 66% on Spanish). Unlike the baseline, our method learns interpretable morphological rules; e.g., the suffix áramos in Spanish indicates the past tense (more examples in Appendix F). In terms of performance, our method (without soft matching) is competitive with the baseline on Swahili, but there are still gaps on Spanish and Turkish. Further analysis reveals different reasons behind the gaps. Turkish morphology is very complex, but our current way of instantiating MetaQNL considers only proofs of depth 1, which could be a restriction for learning more expressive rules. Spanish morphology is relatively simple; our F1 score still has large room for improvement because the method learns over-specific rules that achieve high precision but low recall.

Next, we explore the use of soft matching. We keep training the same and apply soft matching only in testing. Given a testing example such as z a r a n d e a m o s, we consider not only applicable rules learned by MetaInduce but also additional rules generated through our simple soft matching mechanism based on anti-unification (Sec. 5). All rules are ranked based on their matching scores, which are calculated using heuristics. Rigid matching always has the highest score, and more perfectly matched characters lead to higher scores. After ranking the rules, we apply them one by one until we get a predicted lemma.

The bottom row in Table 3 shows the result of soft matching. Even this simple form of soft matching can bridge the performance gap on Spanish. However, it leads to no improvements on Swahili and Turkish. We found that the individual rules learned on Swahili and Turkish are more approximate, i.e. more like “rules of thumb”—they capture the general pattern but have many exceptions. This is due to the increased morphological complexity. In these two languages, there are fewer simple rules such as “This suffix always indicates the past tense.” As a result, relaxing the matching conditions naively would lead to too many spurious rules.

7 Limitations and Open Questions

First of all, our approach is far from mature. Substantial further development is needed for handling free-form natural language. Soft matching is one possible way to address linguistic variations. We could use a pretrained language model to output matching scores between rules and assumptions.

Our experiments are not large-scale but serve as proof of concept for a very novel approach at an early stage. MetaInduce does not yet scale to millions of training examples, which may be necessary to learn enough rules to handle the complexity of natural language. The current bottleneck is rule abstraction, which can be possibly addressed through better methods than anti-unification.

MetaInduce is a meta algorithm that permits many variations of its components. This provides many open questions and opportunities for integration with deep learning. For example, the rule proposer or theorem prover can be a deep network instead of a manually crafted heuristic.


This work is partially supported by the Office of Naval Research under Grant N00014-20-1-2634. The authors also gratefully acknowledge financial support from the Schmidt DataX Fund at Princeton University made possible through a major gift from the Schmidt Futures Foundation.


  • E. Akyürek, A. F. Akyürek, and J. Andreas (2021) Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations (ICLR), Cited by: §6, §6.
  • A. A. Alemi, F. Chollet, N. Een, G. Irving, C. Szegedy, and J. Urban (2016) Deepmath - deep sequence models for premise selection. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
  • K. Bansal, C. Szegedy, M. N. Rabe, S. M. Loos, and V. Toman (2019) Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501. Cited by: §1.
  • K. Bostrom, X. Zhao, S. Chaudhuri, and G. Durrett (2021) Flexible generation of natural language deductions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • X. Chen, C. Liang, A. W. Yu, D. Song, and D. Zhou (2020) Compositional generalization via neural-symbolic stack machines. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.
  • N. Cingillioglu and A. Russo (2020) Learning invariants through soft unification. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • P. Clark, O. Tafjord, and K. Richardson (2020) Transformers as soft reasoners over language. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §1, §1, §2, §6.
  • R. Cotterell, C. Kirov, J. Sylak-Glassman, G. Walther, E. Vylomova, A. D. McCarthy, K. Kann, S. J. Mielke, G. Nicolai, M. Silfverberg, et al. (2018) The conll–sigmorphon 2018 shared task: universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pp. 1–27. Cited by: §6.
  • A. Cropper and S. Dumančić (2020) Inductive logic programming at 30: a new introduction. arXiv preprint arXiv:2008.07912. Cited by: §2, §2.
  • W. Dai, Q. Xu, Y. Yu, and Z. Zhou (2019) Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • Á. Darvas, R. Hähnle, and D. Sands (2005) A theorem proving approach to analysis of secure information flow. In International Conference on Security in Pervasive Computing (SPC), Cited by: §1.
  • L. De Moura and N. Bjørner (2008) Z3: an efficient smt solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Cited by: §1, §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), Cited by: §5.
  • H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou (2018) Neural logic machines. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • R. B. Doorenbos (1995) Production matching for large learning systems. Technical report Carnegie Mellon University. Cited by: §4.
  • R. Evans and E. Grefenstette (2018) Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64. Cited by: §2.
  • N. Gontier, K. Sinha, S. Reddy, and C. Pal (2020) Measuring systematic generalization in neural proof generation with transformers. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • E. Grefenstette (2013) Towards a formal distributional semantics: simulating logical calculi with tensors. arXiv preprint arXiv:1304.5823. Cited by: §2.
  • T. F. Icard III and L. S. Moss (2014) Recent progress on monotonicity. In Linguistic Issues in Language Technology (LiLT), Cited by: §2.
  • W. W. Cohen, F. Yang, and K. R. Mazaitis (2018) TensorLog: deep learning meets probabilistic databases. Journal of Artificial Intelligence Research 1, pp. 1–15. Cited by: §2.
  • L. Kovács and A. Voronkov (2013) First-order theorem proving and vampire. In International Conference on Computer Aided Verification (CAV), Cited by: §1, §2.
  • T. Kutsia, J. Levy, and M. Villaret (2014) Anti-unification for unranked terms and hedges. Journal of Automated Reasoning 52, pp. 155–190. Cited by: Appendix B, Appendix B, §1.
  • T. Kutsia (2002) Unification with sequence variables and flexible arity symbols and its extension with pattern-terms. pp. 290–304. Cited by: Appendix B.
  • B. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML), Cited by: Appendix D, §1, §6.
  • B. M. Lake, T. Linzen, and M. Baroni (2019) Human few-shot learning of compositional instructions. In Annual Meeting of the Cognitive Science Society (CogSci), Cited by: Appendix C, §1, §6.
  • M. Lee, X. He, W. Yih, J. Gao, L. Deng, and P. Smolensky (2016) Reasoning in vector space: an exploratory study of question answering. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • Q. Li, S. Huang, Y. Hong, Y. Chen, Y. N. Wu, and S. Zhu (2020) Closed loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. In International Conference on Machine Learning (ICML), Cited by: §2.
  • Q. Liu, S. An, J. Lou, B. Chen, Z. Lin, Y. Gao, B. Zhou, N. Zheng, and D. Zhang (2020) Compositional generalization by learning analytical expressions. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6, §6.
  • B. MacCartney and C. D. Manning (2007) Natural logic for textual inference. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, Cited by: §2.
  • J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2018) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • G. Marcus and E. Davis (2020) Insights for ai from the human mind. Communications of the ACM 64, pp. 38–41. Cited by: §1.
  • D. McAllester and R. Givan (1993) Taxonomic syntax for first order inference. Journal of the ACM (JACM) 40, pp. 246–283. Cited by: §2.
  • W. McCune (1997) Solution of the robbins problem. Journal of Automated Reasoning 19, pp. 263–276. Cited by: §1.
  • N. D. Megill and D. A. Wheeler (2019) Metamath: a computer language for mathematical proofs. Cited by: §1, §2.
  • H. Mercier and D. Sperber (2017) The enigma of reason. Harvard University Press. Cited by: §1.
  • P. Minervini, M. Bošnjak, T. Rocktäschel, S. Riedel, and E. Grefenstette (2020) Differentiable reasoning on large knowledge bases and natural language. In AAAI Conference on Artificial Intelligence, Cited by: §2.
  • K. Mineshima, P. Martínez-Gómez, Y. Miyao, and D. Bekki (2015) Higher-order logical inference with compositional semantics. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • S. Muggleton (1991) Inductive logic programming. New Generation Computing 8, pp. 295–318. Cited by: §2.
  • M. I. Nye, A. Solar-Lezama, J. B. Tenenbaum, and B. M. Lake (2020) Learning compositional rules via neural program synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §6, §6.
  • G. Plotkin (1970) A note on inductive generalization. Machine Intelligence 5, pp. 153–163. Cited by: Appendix B, Appendix B, §1, §4.
  • G. Plotkin (1972) Automatic methods of inductive inference. Ph.D. Thesis, The University of Edinburgh. Cited by: §2.
  • S. Polu and I. Sutskever (2020) Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393. Cited by: §2.
  • M. Qu, J. Chen, L. Xhonneux, Y. Bengio, and J. Tang (2021) RNNLogic: learning logic rules for reasoning on knowledge graphs. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR) 21, pp. 1–67. Cited by: §5, §6.
  • M. Raghothaman, J. Mendelson, D. Zhao, M. Naik, and B. Scholz (2019) Provenance-guided synthesis of datalog programs. In Symposium on Principles of Programming Languages (POPL), Cited by: §2.
  • A. J. Robinson and A. Voronkov (2001) Handbook of automated reasoning. Vol. 1. Cited by: Appendix B, §1, §2.
  • T. Rocktäschel and S. Riedel (2017) End-to-end differentiable proving. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • S. Russell and P. Norvig (2002) Artificial intelligence: a modern approach. Cited by: Appendix B, §1, §4.
  • S. Saha, S. Ghosh, S. Srivastava, and M. Bansal (2020) PRover: proof generation for interpretable reasoning over rules. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2, §2.
  • A. Saparov and T. M. Mitchell (2021) A generative symbolic model for more general natural language understanding and reasoning. arXiv preprint arXiv:2105.02486. Cited by: §2.
  • I. Schlag, P. Smolensky, R. Fernandez, N. Jojic, J. Schmidhuber, and J. Gao (2019) Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611. Cited by: §2.
  • X. Si, M. Raghothaman, K. Heo, and M. Naik (2019) Synthesizing datalog programs using numerical relaxation. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §2.
  • P. Smolensky (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46, pp. 159–216. Cited by: §2.
  • O. Tafjord, B. D. Mishra, and P. Clark (2020) ProofWriter: generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048. Cited by: §2, §2, §6, §6.
  • A. Talmor, Y. Elazar, Y. Goldberg, and J. Berant (2020) OLMpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics (TACL) 8, pp. 743–758. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • L. Weber, P. Minervini, J. Münchmeyer, U. Leser, and T. Rocktäschel (2019) NLProlog: reasoning with weak unification for question answering in natural language. In Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §2.
  • F. Yang, Z. Yang, and W. W. Cohen (2017) Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • K. Yang and J. Deng (2019) Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning (ICML), Cited by: §1.
  • G. Yernaux and W. Vanhoof (2019) Anti-unification in constraint logic programming. Theory and Practice of Logic Programming 19, pp. 773–789. Cited by: Appendix B.

Appendix A Partial Order Among Sentences and Rules

Here we prove that the relation ⪯ in Definition 4 of the main paper is indeed a partial order.

Definition 7 (Sentence length).

The length of a sentence s = t1 t2 … tn is |s| = n.

Lemma 1 (Substitutions are noncontractive).

Applying substitutions does not make a sentence shorter. In other words, for any sentence s and substitution σ, we have |σ(s)| ≥ |s|. Further, |σ(s)| = |s| if and only if σ maps all tokens in s to sentences of length 1, i.e., |σ(t)| = 1 for every token t in s.


For any substitution σ and variable [X], σ([X]) is a sentence. Therefore, for any token t, |σ(t)| ≥ 1 (Definition 3). For any sentence s = t1 t2 … tn, we have |σ(s)| = |σ(t1)| + … + |σ(tn)| ≥ n = |s|. And the equality holds if and only if |σ(ti)| = 1 for every i. ∎
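The lemma can be sanity-checked with a tiny script, using token lists for sentences and an illustrative dict encoding of substitutions (bracketed tokens as variables, as in the paper's examples).

```python
def apply_subst(sentence, subst):
    """Apply a substitution: replace each variable token by its token list."""
    out = []
    for tok in sentence:
        out.extend(subst.get(tok, [tok]))
    return out

s = "The [A] likes [B]".split()
sigma = {"[A]": ["big", "elephant"], "[B]": ["cats"]}
t = apply_subst(s, sigma)
# Substitutions are noncontractive: |sigma(s)| >= |s|.
assert len(t) >= len(s)
```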

Theorem 1 (Partial order among sentences).

If sentence equality is defined modulo α-conversion, then the relation ⪯ in Definition 4 is a partial order among sentences. In other words,

  1. For any sentence s, s ⪯ s.

  2. For any sentences s1 and s2, if s1 ⪯ s2 and s2 ⪯ s1, then s1 = s2 modulo α-conversion.

  3. For any sentences s1, s2, and s3, if s1 ⪯ s2 and s2 ⪯ s3, then s1 ⪯ s3.


We prove the 3 statements separately.

  1. Let σ be the identity substitution mapping any variable to itself, i.e., σ([X]) = [X]. According to Definition 3, σ also maps any token to itself (σ(t) = t), and therefore any sentence to itself (σ(s) = s). Applying Definition 4, we have s ⪯ s.

  2. Given two sentences s1, s2 such that s1 ⪯ s2 and s2 ⪯ s1, there exist substitutions σ1, σ2 such that σ1(s1) = s2 and σ2(s2) = s1 (Definition 4). Applying Lemma 1 to them separately leads to |s1| ≤ |s2| and |s2| ≤ |s1|, so |s1| = |s2|. According to Definition 3, we derive |σ1(t)| = 1 for every token t in s1. If t is not a variable, σ1(t) = t, i.e., all non-variable tokens in s1 and s2 are identical. If t is a variable, σ1(t) must also be a variable, because otherwise σ2(σ1(t)) would not be a variable. Therefore, both σ1 and σ2 are just renaming variables. And it is straightforward to verify that they cannot map different variables to the same variable. In other words, σ1 and σ2 are α-conversions; s1 = s2 modulo α-conversion.

  3. Given three sentences s1, s2, and s3 such that s1 ⪯ s2 and s2 ⪯ s3, there exist substitutions σ1 and σ2 such that σ1(s1) = s2 and σ2(s2) = s3. Let σ be the function composite of σ2 and σ1. σ is also a substitution, and σ(s1) = σ2(σ1(s1)) = s3. Therefore, s1 ⪯ s3. ∎

Theorem 2 (Partial order among rules).

The relation ⪯ in Definition 4 is a partial order among rules. In other words,

  1. For any rule R, R ⪯ R.

  2. For any two rules R1 and R2, if R1 ⪯ R2 and R2 ⪯ R1, then R1 = R2 modulo α-conversion.

  3. For any three rules R1, R2, and R3, if R1 ⪯ R2 and R2 ⪯ R3, then R1 ⪯ R3.


Similar to the proof of Theorem 1. ∎

Definition 8 (Strictly partial order among sentences/rules).

Let s1 and s2 be two sentences; s1 is strictly more general than s2 (denoted by s1 ≺ s2) if and only if s1 ⪯ s2 and s1 ≠ s2 modulo α-conversion. Similarly, if R1 and R2 are rules, R1 ≺ R2 if and only if R1 ⪯ R2 and R1 ≠ R2 modulo α-conversion.

Appendix B Unification and Anti-unification of Sentences and Rules

Unification and anti-unification (Plotkin, 1970; Robinson and Voronkov, 2001) are basic symbolic procedures in formal logic that are useful for theorem proving and logic programming (Russell and Norvig, 2002; Yernaux and Vanhoof, 2019). In MetaQNL, unification is used in backward chaining, and anti-unification is used to abstract concrete rules into rules with variables. We adapt existing problem setups and algorithms from formal logic to MetaQNL. The algorithms we use for MetaQNL do not have theoretical guarantees as in formal logic, but they work well in practice. The anti-unifiers they compute may not satisfy the conditions of most specific anti-unifiers (Definition 15). But strict anti-unification is not necessary for rule abstraction to work. In principle, all we need is a procedure for generating abstract rules from concrete ones.

Unification. Given two sentences (or two rules), unification aims to find substitutions mapping them to the same sentence (or rule). Such substitutions are called unifiers. We extend unification to MetaQNL by adapting prior work, especially the unification algorithm developed by Kutsia (2002) for a variant of first-order logic with sequence variables and flexible arity symbols.

Definition 9 (Unifier).

A substitution σ is a unifier of two sentences s1 and s2 if and only if σ(s1) = σ(s2). Similarly, it is a unifier of two rules R1 and R2 if and only if σ(R1) = σ(R2).

Two sentences may have multiple unifiers, and different unifiers can lead to different sentences when applied. Among them, we prefer a unifier that is more general—one that does not introduce any new information not present in the two sentences being unified. This is the intuition behind the concept of “most general unifiers”.

Definition 10 (Most general unifier).

Let the substitution σ be a unifier of sentences s1 and s2. It is a most general unifier if and only if there is no unifier σ′ of s1 and s2 that is strictly more general than σ.

In unification, we want to compute a set of most general unifiers, and we want the set to be minimal and complete. Below we define these concepts for sentences.

Definition 11 (Complete set of unifiers).

Let U be a set of unifiers of sentences s1 and s2. U is complete if and only if for any unifier σ of s1 and s2, there exists a unifier σ′ ∈ U that is more general than σ.

Definition 12 (Minimal set of unifiers).

Let U be a set of unifiers of sentences s1 and s2. U is minimal if and only if for any σ1, σ2 ∈ U, σ1 being more general than σ2 implies σ1 = σ2 (modulo α-conversion).

Definition 13 (Minimal complete set of unifiers).

Let U be a set of unifiers of sentences s1 and s2. U is a minimal complete set of unifiers if and only if it is both minimal and complete.

The definitions for rules are parallel. Given two sentences (or two rules), the unification problem is to compute a minimal complete set of unifiers. The result can be empty (e.g., unifying “hello world” and “how are you”), finite (“hello [X]” and “[Y] world”), or infinite (“hello [X]” and “[X] hello”, Fig. 2 Left).
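The three outcomes above can be illustrated with a small unifier over MetaQNL-style token sequences. This is a hedged sketch, not the paper's algorithm (which adapts Kutsia, 2002): to keep the search finite, it only binds a variable to a nonempty run of constant tokens, so it finds only finitely many unifiers even in the infinite case.

```python
def is_var(tok):
    return tok.startswith("[") and tok.endswith("]")

def substitute(sentence, subst):
    """Replace each variable token by the token list it maps to."""
    out = []
    for tok in sentence:
        out.extend(subst.get(tok, [tok]))
    return out

def unify(a, b, subst=None):
    """Return substitutions sigma with sigma(a) == sigma(b).

    Bounded variant: a variable binds only to a nonempty prefix of constant
    tokens on the other side, which suffices for the empty and finite cases.
    """
    subst = dict(subst or {})
    a, b = substitute(a, subst), substitute(b, subst)
    if a == b:
        return [subst]
    if not a or not b:
        return []
    results = []
    if not is_var(a[0]) and not is_var(b[0]):
        return unify(a[1:], b[1:], subst) if a[0] == b[0] else []
    if is_var(a[0]):
        for i in range(1, len(b) + 1):  # bind a[0] to a constant prefix of b
            if any(is_var(t) for t in b[:i]):
                break
            results += unify(a[1:], b[i:], {**subst, a[0]: list(b[:i])})
    if is_var(b[0]) and b[0] != a[0]:
        for i in range(1, len(a) + 1):  # bind b[0] to a constant prefix of a
            if any(is_var(t) for t in a[:i]):
                break
            results += unify(a[i:], b[1:], {**subst, b[0]: list(a[:i])})
    return results

finite = unify("hello [X]".split(), "[Y] world".split())
empty = unify("hello world".split(), "how are you".split())
```

For the finite case the sketch recovers the single unifier {[X] → world, [Y] → hello}; for incompatible constants it correctly returns no unifier.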

Figure 2: The minimal complete set of unifiers of two sentences can be empty, finite, or infinite (e.g., “hello [X]” and “[X] hello”). The minimal complete set of anti-unifiers is non-empty and finite.

Anti-unification. Given two sentences (or two rules), anti-unification aims to generalize them into a single sentence (or rule). Anti-unification has also been studied in formal logic (Plotkin, 1970; Kutsia et al., 2014). We extend it to MetaQNL by adapting prior work. For simplicity, we define anti-unification only for sentences, but it applies to rules as well.

Definition 14 (Anti-unifier).

Given two sentences s1 and s2, their anti-unifier is a triple (s, σ1, σ2) of a sentence s and two substitutions σ1, σ2 such that σ1(s) = s1 and σ2(s) = s2.

Definition 15 (Most specific anti-unifier).

Let (s, σ1, σ2) be an anti-unifier of sentences s1 and s2. It is a most specific anti-unifier if and only if there is no substitution θ (other than an α-conversion) and substitutions σ1′, σ2′ such that

  1. s′ = θ(s),

  2. (s′, σ1′, σ2′) is also an anti-unifier of s1 and s2.

Definition 16 (Complete set of anti-unifiers).

Let A be a set of anti-unifiers of sentences s1 and s2. A is complete if and only if for any anti-unifier (s, σ1, σ2) of s1 and s2, there exist a substitution θ and an anti-unifier (s′, σ1′, σ2′) ∈ A such that θ(s) = s′.

Definition 17 (Minimal set of anti-unifiers).

Let A be a set of anti-unifiers of sentences s1 and s2. A is minimal if and only if for any (s, σ1, σ2), (s′, σ1′, σ2′) ∈ A, if there exists a substitution θ such that

  1. θ(s) = s′,

then θ must be an α-conversion, i.e., s = s′ (modulo α-conversion).

Definition 18 (Minimal complete set of anti-unifiers).

Let A be a set of anti-unifiers of sentences s1 and s2. A is a minimal complete set of anti-unifiers if and only if it is both minimal and complete.

Given two sentences (or two rules), the anti-unification problem is to compute a minimal complete set of anti-unifiers. Unlike unification, the result of anti-unification must be non-empty and finite (Fig. 2).

Theorem 3 (Anti-unification is finitary).

Let A be a minimal complete set of anti-unifiers of sentences s1 and s2; then A is non-empty and finite.


For any sentences and , we have a trivial anti-unifier where and . Since is complete, apply Definition 16 and we will know must be non-empty.

For any anti-unifier $(s, \sigma_1, \sigma_2)$ of $s_1$ and $s_2$, we have $\sigma_1(s) = s_1$ (Definition 14). Applying Lemma 1, we derive $|s| \le |\sigma_1(s)| = |s_1|$. Therefore, the length of $s$ is bounded. Also, $s$ cannot contain non-variable tokens besides those in $s_1$, so its vocabulary is also bounded. Hence there are only a finite number of different sentences that $s$ can be (modulo $\alpha$-conversion). Therefore, $A$ must also be finite. ∎

Our anti-unification algorithm is adapted from Kutsia et al. (2014). It recursively matches the beginnings of the two sentences. Let $s_1$ and $s_2$ be sentences, and let $s$ be a more general sentence in their anti-unifier. If $s_1$ and $s_2$ start with the same word $w$, then $s$ should also start with $w$. Otherwise, $s$ should start with a variable corresponding to some prefixes of $s_1$ and $s_2$. The algorithm searches over all such prefixes and anti-unifies the remaining parts of the sentences recursively.
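The recursion above can be sketched compactly in Python. This is an illustrative simplification, not the paper's implementation: variables are numbered [0], [1], ... along each branch, no minimality filtering is applied, and the enumeration is exponential in the worst case:

```python
def anti_unify(s1, s2):
    """Enumerate generalizations of two concrete sentences by recursively
    matching their beginnings.  Returns a set of token tuples."""
    def rec(a, b, depth):
        if not a and not b:
            return {()}
        results = set()
        # Case 1: both sentences start with the same word; keep it.
        if a and b and a[0] == b[0]:
            results |= {(a[0],) + r for r in rec(a[1:], b[1:], depth)}
        # Case 2: a fresh variable absorbs a non-empty prefix of each side.
        var = "[%d]" % depth
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                results |= {(var,) + r for r in rec(a[i:], b[j:], depth + 1)}
        return results

    return rec(tuple(s1), tuple(s2), 0)
```

For example, anti-unifying "dax kiki lug" and "lug kiki wif" yields, among others, the generalization "[0] kiki [1]" as well as the trivial "[0]".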

Appendix C Details of MiniSCAN Experiments

The 14 MiniSCAN (Lake et al., 2019) training examples represented as sentences in MetaQNL ($MAPS_TO$ is a special symbol):

wif blicket dax $MAPS_TO$ GREEN RED GREEN
lug blicket wif $MAPS_TO$ BLUE GREEN BLUE
dax kiki lug $MAPS_TO$ BLUE RED
lug kiki wif $MAPS_TO$ GREEN BLUE
lug fep kiki wif $MAPS_TO$ GREEN BLUE BLUE BLUE
wif kiki dax blicket lug $MAPS_TO$ RED BLUE RED GREEN
wif blicket dax kiki lug $MAPS_TO$ BLUE GREEN RED GREEN

The 10 testing examples:

zup blicket lug $MAPS_TO$ YELLOW BLUE YELLOW
zup kiki dax $MAPS_TO$ RED YELLOW
lug kiki wif blicket zup $MAPS_TO$ GREEN YELLOW GREEN BLUE
zup blicket wif kiki dax fep $MAPS_TO$ RED RED RED YELLOW GREEN YELLOW
dax blicket zup $MAPS_TO$ RED YELLOW RED
wif kiki zup $MAPS_TO$ YELLOW GREEN

Below are some examples of candidate rules generated by the rule proposer. Note that many of them are wrong because the premises are not sufficient to deduce the conclusion.

⊢ lug fep kiki wif $MAPS_TO$ GREEN BLUE BLUE BLUE
dax $MAPS_TO$ RED ⊢ wif blicket dax $MAPS_TO$ GREEN RED GREEN
lug $MAPS_TO$ BLUE ⊢ lug kiki wif $MAPS_TO$ GREEN BLUE
dax $MAPS_TO$ RED; lug $MAPS_TO$ BLUE ⊢ wif blicket dax kiki lug $MAPS_TO$ BLUE GREEN RED GREEN

MetaInduce learns 7 rules corresponding to the ground truth rules of MiniSCAN:

[A] $MAPS_TO$ [B] ⊢ [A] fep $MAPS_TO$ [B] [B] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] kiki [C] $MAPS_TO$ [D] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] blicket [C] $MAPS_TO$ [B] [D] [B]
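For intuition, the behavior these learned rules induce, together with premise-free rules mapping each primitive word to its color (e.g. zup to YELLOW, as implied by the test examples), can be sketched as a direct recursive interpreter. This is an illustrative Python re-implementation, not the induced MetaQNL rules themselves; the operator precedence (kiki loosest, then blicket, then fep) is our assumption from the examples:

```python
# Assumed primitive word-to-color lexicon (from the MiniSCAN examples).
PRIMITIVES = {"dax": "RED", "lug": "BLUE", "wif": "GREEN", "zup": "YELLOW"}

def interpret(words):
    """Evaluate a MiniSCAN command, mirroring the three learned rules."""
    if "kiki" in words:                     # [A] kiki [C] -> [D] [B]
        i = words.index("kiki")
        return interpret(words[i + 1:]) + interpret(words[:i])
    if "blicket" in words:                  # [A] blicket [C] -> [B] [D] [B]
        i = words.index("blicket")
        b, d = interpret(words[:i]), interpret(words[i + 1:])
        return b + d + b
    if words[-1] == "fep":                  # [A] fep -> [B] [B] [B]
        return interpret(words[:-1]) * 3
    return [PRIMITIVES[words[0]]]
```

On the test example "zup blicket lug", this yields YELLOW BLUE YELLOW, matching the expected output above.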

Appendix D Details of SCAN Experiments

Some examples in SCAN (Lake and Baroni, 2018):

turn right $MAPS_TO$ RIGHT
jump after turn left $MAPS_TO$ LEFT JUMP
walk right $MAPS_TO$ RIGHT WALK
walk after run $MAPS_TO$ RUN WALK
turn left twice $MAPS_TO$ LEFT LEFT
turn opposite left $MAPS_TO$ LEFT LEFT

Below are some examples of the candidate rules generated by the rule proposer.

run $MAPS_TO$ RUN ⊢ walk after run $MAPS_TO$ RUN WALK
walk $MAPS_TO$ WALK; run $MAPS_TO$ RUN ⊢ walk after run $MAPS_TO$ RUN WALK
run $MAPS_TO$ RUN ⊢ jump twice after run twice $MAPS_TO$ RUN RUN JUMP JUMP
run twice $MAPS_TO$ RUN RUN ⊢ jump twice after run twice $MAPS_TO$ RUN RUN JUMP JUMP

MetaInduce learns 20 rules corresponding to the ground truth rules of SCAN:

⊢ turn right $MAPS_TO$ RIGHT
⊢ turn left $MAPS_TO$ LEFT
⊢ turn opposite left $MAPS_TO$ LEFT LEFT
⊢ turn opposite right $MAPS_TO$ RIGHT RIGHT
⊢ turn around left $MAPS_TO$ LEFT LEFT LEFT LEFT
⊢ turn around right $MAPS_TO$ RIGHT RIGHT RIGHT RIGHT
[A] $MAPS_TO$ [B] ⊢ [A] left $MAPS_TO$ LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] right $MAPS_TO$ RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] opposite left $MAPS_TO$ LEFT LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] opposite right $MAPS_TO$ RIGHT RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] around left $MAPS_TO$ LEFT [B] LEFT [B] LEFT [B] LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] around right $MAPS_TO$ RIGHT [B] RIGHT [B] RIGHT [B] RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] twice $MAPS_TO$ [B] [B]
[A] $MAPS_TO$ [B] ⊢ [A] thrice $MAPS_TO$ [B] [B] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [C] and [A] $MAPS_TO$ [D] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] after [C] $MAPS_TO$ [D] [B]
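Applying such rules requires matching a sentence against a rule pattern, binding each variable to a non-empty span of words. One plausible backtracking matcher is sketched below (our own helper, not the paper's implementation):

```python
def match(pattern, sentence, binding=None):
    """Match a concrete sentence against a pattern such as
    '[A] twice $MAPS_TO$ [B] [B]'.  Variables bind non-empty word
    sequences; returns one consistent binding dict, or None."""
    binding = dict(binding or {})
    if not pattern:
        return binding if not sentence else None
    head, rest = pattern[0], pattern[1:]
    if head.startswith("["):                  # variable token
        if head in binding:                   # must reuse prior binding
            n = len(binding[head])
            if sentence[:n] == binding[head]:
                return match(rest, sentence[n:], binding)
            return None
        for n in range(1, len(sentence) + 1):  # try every non-empty span
            binding[head] = sentence[:n]
            result = match(rest, sentence[n:], binding)
            if result is not None:
                return result
            del binding[head]
        return None
    if sentence and sentence[0] == head:      # literal word
        return match(rest, sentence[1:], binding)
    return None
```

For instance, matching "jump twice $MAPS_TO$ JUMP JUMP" against the twice-rule conclusion binds [A] to "jump" and [B] to "JUMP".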

Appendix E Details of RuleTaker Experiments

Rule Proposer. Fig. 3 shows the form of ground truth proofs in RuleTaker. For this specific example, our rule proposer would generate 2 candidate rules below:

Figure 3: RuleTaker contains ground truth proofs in the form of directed acyclic graphs from the assumptions to the conclusion. The nodes in the graph are concrete sentences without variables.
$TRUE$ The elephant is strong
$TRUE$ The elephant likes cats

Learned Rules. Below are some example rules learned from RuleTaker:

$TRUE$ the [D]
$FALSE$ the [A] sees the [B]
$TRUE$ the [A] needs the [E]
$TRUE$ [A] is [E]
$TRUE$ the [A] chases the [D]
$TRUE$ [A] [B]
$TRUE$ [A] [C]

Appendix F Details of SIGMORPHON 2018 Experiments

Rule Proposer. The rule proposer simply generates concrete rules that can prove the goals in a single step. For the previous example, it generates the 6 candidate rules below:

z a r a n d e a m o s $LEMMA$ z a r a n d e a r
z a r a n d e a m o s $TAG$ V
z a r a n d e a m o s $TAG$ IND
z a r a n d e a m o s $TAG$ PRS
z a r a n d e a m o s $TAG$ 1
z a r a n d e a m o s $TAG$ PL

Learned Rules. Below are some example rules learned on the Spanish part of SIGMORPHON 2018:

[A] e a m o s $LEMMA$ [A] e a r
[A] á r a m o s $TAG$ PST
n o s - [A] e m o s $TAG$ SBJV
n o - [A] z c a m o s $LEMMA$ [A] c e r
[A] z a m o s $LEMMA$ [A] z a r
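Since these rules are character-level, applying one amounts to a suffix rewrite: if the word ends with the premise suffix, bind [A] to the remaining stem and attach the conclusion suffix. A sketch (helper name and string-level representation are ours):

```python
def apply_suffix_rule(word, premise_suffix, conclusion_suffix):
    """Apply a rule like '[A] e a m o s $LEMMA$ [A] e a r': bind [A]
    to the stem and rewrite the suffix.  Returns None on no match."""
    if word.endswith(premise_suffix):
        stem = word[: len(word) - len(premise_suffix)]
        return stem + conclusion_suffix
    return None

# The first rule above applied to 'zarandeamos':
print(apply_suffix_rule("zarandeamos", "eamos", "ear"))  # prints 'zarandear'
```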

Appendix G Heuristics for constraining the space of rules

We use a few simple and general heuristics for constraining the space of rules and pruning invalid rules generated by anti-unification. First, we merge multiple variables that always appear together. For example, the [A] [B] [C] and [D] [E] in the rule below can be merged.

$TRUE$ [A] [B] [C] ⊢ $TRUE$ [D] [E]

So the rule becomes:

$TRUE$ [A] ⊢ $TRUE$ [B]
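This merging step can be sketched as follows. The sketch is a simplification of the heuristic: it collapses every run of consecutive variables into one fresh variable, renamed in order of appearance, without verifying that the variables in a run co-occur everywhere else in the rule:

```python
def collapse_variable_runs(rule):
    """Collapse each run of consecutive variables in each sentence of a
    rule into a single fresh variable ([A], [B], ... in order)."""
    fresh, out_rule = iter("ABCDEFGH"), []
    for sent in rule:
        out, in_run = [], False
        for tok in sent:
            if tok.startswith("["):
                if not in_run:              # start of a new variable run
                    out.append("[%s]" % next(fresh))
                in_run = True
            else:
                out.append(tok)
                in_run = False
        out_rule.append(out)
    return out_rule
```

On the rule above, the premise $TRUE$ [A] [B] [C] collapses to $TRUE$ [A] and the conclusion $TRUE$ [D] [E] collapses to $TRUE$ [B].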

A variable in a rule is called a free variable if it appears only once in the rule. For example, the rule

[X] is red ⊢ Tomorrow will be sunny

contains a free variable [X]. We only consider rules with at most one free variable, and we require that free variables do not appear in the conclusion, because a free variable in the conclusion would allow arbitrary conclusions to be formed by substituting it with any sentence. For example, the reverse of the rule above, Tomorrow will be sunny ⊢ [X] is red, is not allowed because of the free variable [X] in its conclusion.

In addition, a rule cannot contain a premise consisting of a single free variable, because such a premise is satisfied by any sentence, so there is no point in including it. For example, the rule below is not allowed because of the free variable [X]:

[X]; $TRUE$ [A] ⊢ $TRUE$ [B]
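Taken together, the free-variable heuristics in this appendix amount to three checks on a candidate rule. A sketch (the helper name and list-of-token-lists representation are our assumptions):

```python
from collections import Counter

def is_valid_rule(premises, conclusion):
    """Reject rules violating the free-variable heuristics: at most one
    free variable, no free variable in the conclusion, and no premise
    consisting of a single free variable."""
    counts = Counter(tok for sent in premises + [conclusion]
                     for tok in sent if tok.startswith("["))
    free = {v for v, n in counts.items() if n == 1}
    if len(free) > 1:
        return False
    if any(tok in free for tok in conclusion):
        return False
    if any(len(p) == 1 and p[0] in free for p in premises):
        return False
    return True
```

For instance, "[X] is red ⊢ Tomorrow will be sunny" passes (one free variable, not in the conclusion), while its reverse fails because [X] would appear free in the conclusion.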