1 Introduction
Symbolic reasoning—rule-based symbol manipulation—is a core component of human intelligence (Mercier and Sperber, 2017). It has long been a core part of computer science research, and has achieved significant success in domains such as software verification (Darvas et al., 2005) and theorem proving (Kovács and Voronkov, 2013; McCune, 1997). However, such success has been restricted to domains amenable to rigid, precise formalization. It remains a challenge to translate such success into "informal" domains such as reasoning with commonsense knowledge and natural language input. Prior attempts to build rule-based systems, which rely on manually constructed rules, have achieved limited success and tended to produce brittle systems.
Deep learning provides an attractive alternative that can easily sidestep the question of representation. A deep network can be trained to perform a reasoning task by directly predicting the answer without explicit symbol manipulation (Clark et al., 2020). However, deep networks can require a large amount of training data and can suffer from poor generalization. More importantly, unlike symbolic systems, a deep network is a black box that is hard to interpret, inspect, and verify. Such lack of interpretability can be undesirable in certain applications, especially those critical to safety and security.
In this work, we ask how to build a rule-based system that reasons symbolically but can work with natural language and handle domains difficult to formalize. Such a system would perform reasoning by explicit symbol manipulation based on known rules, and would therefore be more interpretable and verifiable, while at the same time remaining flexible enough to handle natural language input.
At a glance, this may appear a large departure from the conventional wisdom that learning-based systems, particularly deep networks, are far superior to rule-based systems, as history has demonstrated repeatedly. However, we hypothesize that this conventional wisdom is incorrect because it assumes a false dichotomy between using learning and using rules; rule-based systems underperformed not because they were rule-based, but because it is difficult to construct rules manually. Further, we hypothesize that learning rules from data is key to building effective rule-based systems, but that it may require a different kind of learning than gradient descent.
The goal of this work is thus to develop a method that automatically learns symbolic rules from data to enable rule-based reasoning with natural language. This poses two main questions. First, what is the system of rules—the basic structures that define what symbols and manipulations are allowed—such that it is compatible with not only formal logic but also natural language? Second, what is the learning algorithm that induces a set of rules from training data?
In this work, we take initial steps toward answering both questions. We propose MetaQNL, a formal symbolic system we call a "Quasi-Natural Language", which is compatible with not only rigorous logical inference but also natural language expressions. We also propose MetaInduce, a learning algorithm that induces MetaQNL rules from training data that consists of questions and answers, with or without intermediate reasoning steps.
MetaQNL: a Symbolic System in Quasi-Natural Language. In MetaQNL, a sentence is a sequence of words and variables ("The elephant is [X]"), including ordinary English sentences without variables. A rule has multiple sentences as premises ("The elephant is [X]", "If something is [X] then it [Y]") and one sentence as the conclusion ("The elephant [Y]"). When applying the rule in reasoning, variables are substituted with concrete sentences ([X] → "strong", [Y] → "likes cats"). Therefore, rules capture abstract knowledge independent of specific instances—the above rule holds whether [Y] is "likes cats" or "is sleepy". Such abstraction is essential for reasoning in both humans and machines (Marcus and Davis, 2020).
Fig. 1 illustrates how sentences and rules are used in reasoning. Starting from known sentences (assumptions), we apply rules to derive new sentences until the goal is reached. At each step, we substitute the variables in a rule with concrete sentences. This process resembles Metamath (Megill and Wheeler, 2019), a formal language developed for formalizing mathematical proofs, where each step also consists of selecting a theorem and instantiating it with a suitable substitution. So we refer to the reasoning process as theorem proving and the result in Fig. 1 as a proof. It is worth noting that reasoning in MetaQNL is interpretable by design: it is transparent about what rules are assumed; it produces not only an answer, but also a proof that can be mechanically checked against the rules.
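The rule-application step described above can be sketched in a few lines. This is a hypothetical illustration (sentences as token tuples, variables as bracketed strings), not the authors' implementation:

```python
# A minimal sketch of one MetaQNL reasoning step: substitute variables in a
# rule, then check that all instantiated premises are known sentences.

def substitute(sentence, subst):
    """Replace each variable token with its (multi-token) binding."""
    out = []
    for tok in sentence:
        out.extend(subst.get(tok, [tok]))  # non-variables map to themselves
    return tuple(out)

def apply_rule(premises, conclusion, subst, known):
    """Return the derived sentence if all instantiated premises are known."""
    inst = [substitute(p, subst) for p in premises]
    if all(p in known for p in inst):
        return substitute(conclusion, subst)
    return None

known = {
    ("The", "elephant", "is", "strong"),
    ("If", "something", "is", "strong", "then", "it", "likes", "cats"),
}
rule_premises = [
    ("The", "elephant", "is", "[X]"),
    ("If", "something", "is", "[X]", "then", "it", "[Y]"),
]
rule_conclusion = ("The", "elephant", "[Y]")
subst = {"[X]": ["strong"], "[Y]": ["likes", "cats"]}

derived = apply_rule(rule_premises, rule_conclusion, subst, known)
# derived == ("The", "elephant", "likes", "cats")
```

A full prover repeats this step, searching over rules and substitutions until the goal is reached.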
Assumptions and the goal are usually given when applying our system to a specific task. To solve the task, two issues remain: (1) Rule induction: What is the set of rules? (2) Theorem proving: How to apply the rules to find a proof? Theorem proving has been studied extensively in classical AI (Robinson and Voronkov, 2001) and more recently with deep learning (Alemi et al., 2016; Yang and Deng, 2019; Bansal et al., 2019), and we can adapt existing algorithms such as forward chaining and backward chaining (Russell and Norvig, 2002). In this work, we simply use existing provers and focus instead on the more challenging problem of rule induction.
MetaInduce: an Algorithm to Learn MetaQNL Rules. Rule induction can be formulated as a discrete optimization problem that seeks a minimum set of rules that are consistent with training examples. Note that it is important to seek a small number of rules because we always have a trivial solution that consists of one rule per example but is unlikely to generalize. This optimization is challenging due to the discrete, combinatorial search space.
We introduce MetaInduce, a general method for learning MetaQNL rules. It encodes the problem as a maximum satisfiability (MAXSAT) problem, which can be solved efficiently by off-the-shelf solvers (De Moura and Bjørner, 2008). Our method consists of three steps. First, given a training example, a rule proposer proposes a set of concrete rules (rules without variables) as candidates. This set can be overcomplete and inaccurate. These rules are used to prove the example using existing provers such as forward/backward chaining. Second, we generate abstract rules from concrete rules via a symbolic procedure called anti-unification (Plotkin, 1970; Kutsia et al., 2014). Third, we encode the proof paths in MAXSAT and solve for a subset of all rules using a MAXSAT solver.
Overview of Results. We benchmark our method on three tasks: learning compositional instructions, logical reasoning, and morphological analysis. For compositional instructions, our method not only achieves 100% accuracy on MiniSCAN (Lake et al., 2019) and SCAN (Lake and Baroni, 2018), but also recovers the ground truth rules. For logical reasoning, it achieves state-of-the-art performance on RuleTaker (Clark et al., 2020). For morphological analysis, it learns morphological rules from real-world data and is competitive with neural seq2seq models in some languages. Compared to existing methods, our approach learns compact models with much less data, and produces not only answers but also checkable proofs. On RuleTaker, our approach learns a model that has only 2869 symbols but is competitive with a prior approach that uses a neural network with 11 billion parameters.
2 Related Work
Symbolic Reasoning. Symbolic reasoning has been studied extensively in classical AI, such as theorem proving (Kovács and Voronkov, 2013; Robinson and Voronkov, 2001). An open problem is to perform symbolic reasoning in domains without a natural formalization, such as natural images or texts. One common approach is to manually construct a formal system (e.g., based on first-order logic with manually defined functions and predicates), then perform semantic parsing to translate images or texts into formalized statements as input to a reasoning module operating in a clean formal world.
For example, to judge whether one statement implies another, Mineshima et al. (2015) use a semantic parser to convert both statements into higherorder logic (with predefined predicates), and then run an automated theorem prover. Semantic parsing is still far from reliable; therefore, researchers have developed techniques for learning it jointly with the reasoning module (Mao et al., 2018; Saparov and Mitchell, 2021; Dai et al., 2019; Li et al., 2020). In contrast, our approach does not require a semantic parser, because rules in MetaQNL are directly applicable to natural language.
Natural Logic (McAllester and Givan, 1993; MacCartney and Manning, 2007) is a class of symbolic systems defined using the syntax of natural language, bypassing semantic parsing. Compared to our system, Natural Logic is more specialized because it is a specific logic committed to predefined rules, which restricts the type of reasoning it can perform to monotonicity reasoning (Icard III and Moss, 2014). In contrast, we have no such restrictions because MetaQNL is not a specific logic but a metalanguage with minimal structure such that it can instantiate various types of reasoning, just as Metamath is a metalanguage that can describe a variety of mathematical logics (Megill and Wheeler, 2019).
None of the works discussed so far learn rules from data; they instead use a predefined formal system that is already specialized and already encodes a substantial amount of prior knowledge. In contrast, MetaQNL is almost "knowledge-free" in the sense that it imposes the weakest possible structure on the permitted rules and lets the specific rules emerge from data through learning.
Reasoning with Neural Networks. Neural networks can perform "soft" reasoning in the space of continuous vectors without manipulating discrete symbols explicitly. Prior works have used transformer-based language models (Vaswani et al., 2017) for soft reasoning (Polu and Sutskever, 2020; Saha et al., 2020; Tafjord et al., 2020; Gontier et al., 2020; Talmor et al., 2020). Clark et al. (2020) fine-tune a pretrained transformer to classify whether the goal is provable from the assumptions, encoding them as sentences in a constrained natural language. Saha et al. (2020) and Tafjord et al. (2020) go one step further and generate proofs in addition to yes/no answers. Bostrom et al. (2021) generate conclusions from premises in unconstrained natural language.

Instead of using a generic transformer, researchers have also added inductive biases to the neural architecture. Many are inspired by symbolic reasoning and are often called neurosymbolic architectures. Rocktäschel and Riedel (2017) introduce Neural Theorem Provers (NTPs). Given the assumptions and the goal in first-order logic, they use backward chaining to recursively construct a neural network. However, NTPs only work for formalized inputs and do not scale due to the exponentially many proof paths in backward chaining. Weber et al. (2019) extend NTPs to natural language by extracting symbols from sentences using an off-the-shelf named entity recognizer. Minervini et al. (2020) make NTPs more scalable by dynamically pruning unpromising proof paths in backward chaining.
Researchers have also attempted to embed symbolic structures such as logic formulas into continuous vectors while preserving logical operations (Smolensky, 1990; Grefenstette, 2013; Kathryn and Mazaitis, 2018; Lee et al., 2016; Schlag et al., 2019). For example, Neural Logic Machines (Dong et al., 2018) are a neurosymbolic architecture based on a continuous approximation of logical inference: predicates are represented as tensors, and rules are neural operators that map tensors to tensors.
Cingillioglu and Russo (2020) propose an end-to-end neural architecture called unification networks to learn rules with variables from concrete examples. However, their system of rules is significantly less general than ours: their variables can only bind to a single word, whereas our variables bind to arbitrary sentences. In addition, their system does not support multi-step chained reasoning. All reasoning is done in a single step: producing a conclusion in the form of an answer ("yes/no", a number, etc.) given a number of premises consisting of a question and a set of supporting facts.
Unlike these prior works, we learn symbolic rules instead of weights in a neural network. Further, during inference, we generate symbolic proofs whose correctness with respect to the induced rules is guaranteed and can be mechanically checked. Saha et al. (2020) and Tafjord et al. (2020) also generate proofs, but their proofs are natural language texts whose correctness is neither guaranteed nor mechanically checkable—their approach trains neural networks to directly predict both answers and proofs, but does not expose a system of rules against which a proof can be checked.
Rule Induction.
Inductive logic programming (ILP) learns rules in first-order logic programs such as Prolog and Datalog (Plotkin, 1972; Muggleton, 1991; Cropper and Dumančić, 2020). Extending it to natural language is nontrivial, partially due to the need for a predefined ontology of objects and predicates, as well as a perfect semantic parser, both of which are infeasible. Unlike ILP, we learn rules in MetaQNL, which can express not only logic programs but also natural language sentences. And our experiments show that MetaQNL can solve tasks that are not easily solvable by ILP.

For learning rules, our MetaInduce algorithm draws inspiration from existing ILP approaches. They encode proofs as either a boolean satisfiability problem solvable by off-the-shelf SAT solvers (Raghothaman et al., 2019) or a differentiable function amenable to gradient descent (Yang et al., 2017; Evans and Grefenstette, 2018; Si et al., 2019). Compared to these approaches, our rule space is different and more complex. Our rules consist of sentences with variables, whereas rules in ILP are typically Horn clauses in first-order logic. Further, ILP often imposes strong syntactic constraints on what rules are valid, e.g., using rule templates (Evans and Grefenstette, 2018; Raghothaman et al., 2019) or restricting to binary predicates (Evans and Grefenstette, 2018). These constraints are critical to good performance but are domain-dependent and difficult to get right (Cropper and Dumančić, 2020). Over-constraining the rule space makes the system less expressive, less generally applicable, and more brittle in the presence of noise. Another difference is that we minimize the number of rules in order to generalize, which is unnecessary for ILP due to its stronger syntactic constraints.
Our space of rules includes a rich hierarchy from abstract rules to concrete rules, making the search space much larger. In contrast, most ILP works assume function-free first-order logic such as Datalog. Their variables can only be instantiated with concrete entities, making their rule space much simpler.
RNNLogic (Qu et al., 2021) learns first-order rules for knowledge base completion. They generate rules using RNNs, which is feasible because they require that rules be expressible as sequences of predicates. This strong syntactic constraint makes it less suitable for more general reasoning. Beyond first-order logic, Nye et al. (2020) learn rules for a string rewriting system. MetaQNL is more general because it can be applied to not only string rewriting but also other forms of reasoning (see Sec. 6).
3 MetaQNL: a Symbolic System in Quasi-Natural Language
MetaQNL is quasi-natural because it has a formal syntax compatible with natural language. As in natural language, a sentence in MetaQNL is simply a sequence of tokens. There are three different types of tokens—words, variables, and special symbols. Taking the sentence "$FALSE$ The elephant likes [X]" as an example, "The", "elephant" and "likes" are words. MetaQNL treats words as symbols and does not assume any prior knowledge about their meaning. "[X]" is a variable—a placeholder that binds to concrete sentences in reasoning. "$FALSE$" is a special symbol. Special symbols are useful for encoding the structures of specific tasks, which will become clearer in Sec. 6. In this paper, we delimit special symbols with $. Sentences without variables are called concrete sentences, e.g., "$FALSE$ The elephant likes cats".
Definition 1 (Sentence).
Let W, V, S be the vocabularies of words, variables, and special symbols, respectively; they are disjoint and countable. Let T = W ∪ V ∪ S; then any t ∈ T is a token. A sentence is a non-empty sequence of tokens. A concrete sentence is a sentence without any variable, i.e., a non-empty sequence over W ∪ S.
MetaQNL expresses permitted reasoning steps through rules. A rule has multiple sentences as its premises (“The elephant [X]”, “If something [X] then it [Y]”) and one sentence as the conclusion (“The elephant [Y]”). Intuitively, the conclusion should follow from the premises regardless of what values the variables take. Concrete rules are rules without variables.
Definition 2 (Rule).
A rule takes the form s₁; s₂; …; sₙ ⊢ s, where s₁, …, sₙ are premises, and s is the conclusion. It is concrete if all premises and the conclusion are concrete.
In reasoning, we instantiate rules with concrete rules by substituting all variables with concrete sentences. Given the rule r = "The elephant [X]; If something [X], then it [Y] ⊢ The elephant [Y]", we can instantiate it with the substitution σ = {[X] → "is strong", [Y] → "likes cats"}, deriving the concrete rule r′ = "The elephant is strong; If something is strong, then it likes cats ⊢ The elephant likes cats". In such cases, we say r is more general than r′, or conversely, r′ is an instance of r.
Definition 3 (Substitution).
A substitution σ is a function from V to the set of sentences containing only words and variables (without special symbols). Substitutions can be extended to functions on tokens, sentences, and rules. Given a token t, applying the substitution produces the sentence σ(t), where σ(t) = t if t is not a variable. Given a sentence s = t₁t₂…tₙ, applying σ produces σ(s) = σ(t₁)σ(t₂)…σ(tₙ). Given a rule r = (s₁; …; sₙ ⊢ s), applying σ produces σ(r) = (σ(s₁); …; σ(sₙ) ⊢ σ(s)).
We are abusing notation to treat a token and a single-token sentence interchangeably. Also, σ(t₁)σ(t₂)…σ(tₙ) denotes concatenation, since each σ(tᵢ) is a sentence. Substitution is defined as a function on all variables in V, but in practice it only involves a few. For example, the substitution σ = {[X] → "is strong", [Y] → "likes cats"} only involves two variables. In such cases, we think of it as being the identity function for all other variables, e.g., σ([Z]) = [Z]. This convention makes it easier to compose substitutions as function composition.
As in the example above, applying a substitution to sentences/rules makes them more specific. This introduces a partial order among sentences/rules. It is straightforward to verify that the relation defined below is a partial order; we leave the proof to Appendix A.
Definition 4 (Partial order among sentences and rules).
Let s₁ and s₂ be two sentences. s₁ is an instance of s₂ (denoted by s₁ ⪯ s₂) if and only if there exists a substitution σ such that s₁ = σ(s₂). In this case, we also say s₂ is more general than s₁. Similarly, given two rules r₁ and r₂, r₁ is an instance of r₂ (or r₂ is more general than r₁, denoted by r₁ ⪯ r₂) if and only if there exists a substitution σ such that r₁ = σ(r₂).
A subtlety in the definition is judging whether two rules are equal. For a MetaQNL rule, premises are unordered, and variable renaming does not matter. In more jargonized words, rule equality is defined modulo premise reordering and α-conversion.
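The instance relation of Definition 4 can be sketched as a backtracking matcher in which each variable binds to a non-empty token sequence. The function below is an assumed illustration, not the paper's code:

```python
# Sketch of "s is an instance of pattern p": find a substitution of p's
# variables (each binding to a non-empty token sequence) that yields s.

def match(pattern, sentence, subst=None):
    subst = dict(subst or {})  # work on a copy so backtracking is safe
    if not pattern:
        return subst if not sentence else None
    head, rest = pattern[0], pattern[1:]
    if head.startswith("[") and head.endswith("]"):  # variable token
        if head in subst:  # already bound: must match its existing binding
            b = subst[head]
            if sentence[:len(b)] == b:
                return match(rest, sentence[len(b):], subst)
            return None
        for k in range(1, len(sentence) + 1):  # try every non-empty binding
            subst[head] = sentence[:k]
            result = match(rest, sentence[k:], subst)
            if result is not None:
                return result
            del subst[head]
        return None
    if sentence and sentence[0] == head:  # word: must match exactly
        return match(rest, sentence[1:], subst)
    return None

s = tuple("The elephant likes cats".split())
p = tuple("The [A] likes cats".split())
# match(p, s) -> {"[A]": ("elephant",)}, so s ⪯ p
```

In general a sentence may match a pattern under several substitutions; this sketch returns the first one found.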
In reasoning (Fig. 1), the prover is given a set of rules M, multiple concrete sentences as assumptions, and one sentence g as the goal. It iteratively instantiates concrete rules from M and applies them to generate a proof of g. Similar to Prolog, g may have variables ("The elephant [X]"), and the prover succeeds if it proves any instance of g (e.g., "The elephant likes vegetables").
Definition 5 (Proof).
A proof is a directed acyclic graph G whose vertices are concrete sentences or concrete rules. Each concrete rule r = (s₁; …; sₙ ⊢ s) in G must satisfy two conditions: (1) r connects to its conclusion s via an edge r → s; (2) for each premise sᵢ, G contains sᵢ and an edge sᵢ → r. Besides these edges, there cannot be any other edge in G. Also, there can be multiple sentences without inbound edges (the proof's assumptions), but there is only one sentence without outbound edges (the proof's goal).
Definition 6 (Theorem proving).
Given a set of rules M, concrete sentences C as assumptions, and a sentence g as the goal, the theorem prover tries to find a proof G such that: (1) G's assumptions are a subset of C; (2) G's goal is an instance of g; (3) every rule in G is an instance of a rule in M.
4 MetaInduce: Learning MetaQNL Rules from Data
Problem Setup and Loss Function.
Rule induction is a machine learning problem where the model consists of rules rather than continuous weights. The problem setup is familiar: given a training set D_train and a test set D_test, the goal is to use D_train to find a model M that performs well not only on D_train itself but also on D_test. For MetaQNL specifically, the training set consists of a set of provable examples D⁺ and a set of unprovable examples D⁻. Both contain training examples of the form (C, g), where C is a set of assumptions and g is the goal. A model M is consistent with a provable example if g is provable from C using rules in M. Similarly, M is consistent with an unprovable example if g cannot be proved from C. In other words, provable examples are positive examples demonstrating sound logical inference, whereas unprovable examples are negative examples demonstrating unsound inference.

Given only D_train, we need to find a model M consistent with as many examples in D_train as possible. However, it is not sufficient to optimize the consistency with training data, because there is a trivial model that performs perfectly in training but fails in testing: one rule per example. That is, given an example (C, g) with C = {s₁, …, sₙ}, it is provable using the rule s₁; …; sₙ ⊢ g.
Thus we need to penalize the model complexity. While other choices are possible, here we measure model complexity as the number of rules. We minimize a loss function that evaluates both model complexity and consistency with training data:

L(M) = |M| − λ⁺ · n⁺(M) − λ⁻ · n⁻(M),    (1)

where |M| is the number of rules; n⁺(M) and n⁻(M) are the number of provable/unprovable examples consistent with M, respectively; and λ⁺, λ⁻ are hyperparameters controlling the trade-off between the three terms.
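Evaluating this loss is simple once the consistency counts are known. The toy comparison below (hypothetical counts and weights) shows why the complexity term rules out the trivial one-rule-per-example model:

```python
# Sketch of the rule-induction objective: loss = (#rules)
#   - lam_pos * (#consistent provable examples)
#   - lam_neg * (#consistent unprovable examples).
def loss(num_rules, n_pos, n_neg, lam_pos=1.0, lam_neg=1.0):
    # lam_pos/lam_neg are hypothetical weights; in practice they are tuned
    return num_rules - lam_pos * n_pos - lam_neg * n_neg

# Suppose 100 provable training examples and no unprovable ones:
trivial = loss(num_rules=100, n_pos=100, n_neg=0)  # one rule per example
compact = loss(num_rules=5,   n_pos=100, n_neg=0)  # 5 general rules
# compact < trivial: the complexity term favors the smaller model
```

Both models are perfectly consistent with the training data, but the compact one attains a lower loss and is the one expected to generalize.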
The optimization problem is challenging. Given M, even a single evaluation of L(M) is expensive: n⁺(M) and n⁻(M) require running the prover on all training examples. Furthermore, it is much harder to find the optimal M due to the combinatorial and non-differentiable search space. We introduce MetaInduce, a general method for learning rules by encoding Eqn. 1 as a maximum satisfiability (MAXSAT) problem, which can be solved efficiently by off-the-shelf solvers.
Overview of MetaInduce. MetaInduce is outlined in Algorithm 1. Similar to SGD for training neural networks, MetaInduce goes through the training data for several epochs; during an epoch, it processes one example per iteration. Given an example (either provable or unprovable), it first relies on a rule proposer to generate candidate rules that are concrete and potentially useful for proving the goal from the assumptions. Then it runs an existing prover to search for proofs, using both the candidate rules and existing rules in the model. At the end of each epoch, MetaInduce abstracts all concrete rules used in the proofs into rules with variables. Then it performs rule pruning: selecting as M a subset of the rules minimizing the loss (Eqn. 1). Next, we explain each step in more detail.

Rule Proposal. The rule proposer is dataset-dependent and allows incorporating prior knowledge about a particular task. However, a good rule proposer alone, if not embedded in MetaInduce, is not sufficient for learning rules. First, the rule proposer only generates concrete rules. It is up to MetaInduce to abstract them into rules with variables. Second, the rule proposer generates rules useful for a single training example, whereas MetaInduce learns rules useful for the entire dataset. Third, the rule proposer does not have to be accurate. MetaInduce can reliably learn correct rules even if most candidate rules are wrong (see Sec. 6).
Theorem Proving. Theorem proving in MetaQNL is relatively straightforward, thanks to existing algorithms such as forward/backward chaining. Forward chaining starts with the assumptions and applies rules to derive new sentences until the goal is reached. Conversely, backward chaining starts with the goal and applies rules in the reverse direction until all assumptions are satisfied. We implement forward chaining using the Rete algorithm for fast rule matching (Doorenbos, 1995) and the basic backward chaining algorithm from a standard textbook (Russell and Norvig, 2002). The prover returns proofs containing all different paths to the goal up to a predefined depth limit.
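Forward chaining can be sketched as a fixpoint loop. For brevity, this assumed illustration works with already-instantiated concrete rules, so rule matching reduces to set membership (the Rete-based implementation used in the paper is far more efficient):

```python
# Sketch of forward chaining: repeatedly apply rules whose premises are all
# known, until the goal is derived or no new sentence can be derived.
def forward_chain(rules, assumptions, goal, max_steps=100):
    """rules: list of (premises_tuple, conclusion); sentences are hashable."""
    known = set(assumptions)
    for _ in range(max_steps):
        derived = {
            concl
            for premises, concl in rules
            if all(p in known for p in premises) and concl not in known
        }
        if not derived:
            return goal in known  # fixpoint reached
        known |= derived
        if goal in known:
            return True
    return goal in known

# Hypothetical two-step derivation: a ⊢ b, then b ⊢ c.
rules = [
    (("a",), "b"),
    (("b",), "c"),
]
# forward_chain(rules, {"a"}, "c") -> True
```

Backward chaining works in the opposite direction, starting from the goal; both are standard textbook algorithms (Russell and Norvig, 2002).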
Rule Abstraction. The proofs contain only concrete rules (Definition 5), and we have to generalize them into rules with variables. We use a symbolic procedure called anti-unification (Plotkin, 1970) to find general rules given concrete ones. Given two rules r₁ and r₂, anti-unification attempts to find the most specific rule r such that r₁ ⪯ r and r₂ ⪯ r (analogous to the lowest common ancestor of two nodes in a tree; see Fig. 1(a) for examples). It does so by recursively matching the beginnings of the two sentences. Please see Appendix B for details.
Let R be the set of all concrete rules in the proofs. To augment R with general rules, we iteratively anti-unify rules in R and add the results back, until no new rule can be generated. We denote the result by R̄, which contains not only concrete rules but also their generalizations.
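A much-simplified sketch of anti-unification on two sentences: keep the longest common prefix and suffix, and abstract the differing middle into a single fresh variable. The paper's procedure (Appendix B) is more general; this single-variable version is an assumption for exposition:

```python
# Simplified anti-unification: the result is more general than both inputs,
# obtained by replacing the differing middle span with one variable.
def anti_unify(s1, s2):
    if s1 == s2:
        return list(s1)
    i = 0  # length of the longest common prefix
    while i < min(len(s1), len(s2)) and s1[i] == s2[i]:
        i += 1
    j = 0  # length of the longest common suffix (not overlapping the prefix)
    while (j < min(len(s1), len(s2)) - i
           and s1[len(s1) - 1 - j] == s2[len(s2) - 1 - j]):
        j += 1
    suffix = list(s1[len(s1) - j:]) if j else []
    return list(s1[:i]) + ["[X]"] + suffix

a = "The elephant is strong".split()
b = "The elephant likes cats".split()
# anti_unify(a, b) -> ["The", "elephant", "[X]"]
```

Applied pairwise to rules, the same idea produces the abstract rules that MetaInduce adds to R̄.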
Rule Pruning with MAXSAT. Rule pruning selects M as a subset of R̄ by encoding all proofs as a MAXSAT problem, whose solution corresponds to a set of rules that approximately minimizes the loss function in Eqn. 1. We encode each rule r ∈ R̄ using a boolean variable (also denoted r); r = true means the rule should be included in M. For each concrete rule cr ∈ R, we have an additional boolean variable cr; cr = true means cr is necessary for proving the training examples. We impose three different types of constraints on these boolean variables:

- Data consistency: For the i-th training example, its proof may have many paths from the assumptions to the goal, but the example is provable as long as any one of them is valid. For provable examples (those in D⁺), we encode the example as a disjunction of proof paths. Each path is valid if and only if all concrete rules along the path are valid. So we encode a proof path as a conjunction of all cr boolean variables it contains (see Fig. 1(b)). Analogously, for unprovable examples (those in D⁻), we simply take the negation of the previous boolean formula to encourage the absence of a valid proof. Finally, a good model is not necessarily consistent with every training example, so each example is encoded as a soft constraint with weight λ⁺ or λ⁻.
- Model complexity: To minimize the number of rules, we add a soft constraint of weight 1 for each boolean variable r, encouraging r = false.
- Rule instantiation: Each concrete rule cr must be an instance of a rule in M. Let r₁, …, rₖ be all rules in R̄ such that cr ⪯ rᵢ; cr can be instantiated if and only if at least one of them is in the model. Therefore, we add a hard constraint cr → (r₁ ∨ … ∨ rₖ).
Given a set of boolean constraints, each with a weight, a MAXSAT solver finds an assignment of boolean variables minimizing the combined weight of violated constraints, which equals Eqn. 1 for the specific constraints above. Therefore, running an off-the-shelf MAXSAT solver on these constraints gives us a set of rules that minimizes our loss function.
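The three constraint types can be illustrated on a toy instance with two abstract rules r1, r2 and one concrete rule cr that instantiates either of them. The brute-force search below stands in for a MAXSAT solver (assumed encoding, hypothetical weights):

```python
# Toy weighted-MAXSAT instance: minimize the total weight of violated
# constraints over boolean variables (r1, r2, cr).
from itertools import product

def cost(r1, r2, cr, lam_pos=2.0):
    c = 0.0
    c += r1 + r2              # model complexity: weight 1 per kept rule
    if not cr:                # data consistency (soft, weight lam_pos):
        c += lam_pos          # the one provable example needs cr
    if cr and not (r1 or r2): # rule instantiation (hard): cr -> (r1 or r2)
        c = float("inf")
    return c

best = min(product([0, 1], repeat=3), key=lambda v: cost(*v))
# The optimum keeps cr plus exactly one of r1/r2: total cost 1.0
```

With λ⁺ = 2.0, dropping the example (cost 2.0) is worse than keeping one abstract rule (cost 1.0), so the solver keeps cr and a single generalization, mirroring how MetaInduce trades consistency against model size.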
5 Soft Matching
Similar to classical theorem proving, reasoning in MetaQNL relies on precise and rigid matching between rules and assumptions. For example, the rule "The [A] likes cats ⊢ Someone likes cats" is not applicable to the assumption "An elephant likes cats" due to the lack of "The". Although precise matching guarantees the rigor of reasoning, reasoning performed in natural language is often fuzzy and ambiguous, without the same degree of rigor as a mathematical proof. Supporting fuzzy reasoning is necessary for MetaQNL to cover a broader spectrum of reasoning in natural language. It requires us to relax the rigorous proofs in Definition 6 to fuzzy proofs with scores indicating the degree of rigor. Another benefit of soft matching is that it allows the system to degrade gracefully—it can produce an "educated guess" if the existing knowledge base of rules is insufficient for producing a rigorous answer.
We extend MetaQNL with soft matching—relaxing the rigid matching conditions when applying rules. Think of applying a rule r to a set of assumptions as instantiating concrete rules on the fly: with rigid matching, we instantiate concrete rules r′ such that r′'s premises are the assumptions and r′ must be an exact instance of r. In contrast, soft matching produces both concrete rules and scores: r′'s premises are still the assumptions, but r′ is not required to be an exact instance of r. Further, the matching scores can be aggregated to calculate proof scores, allowing us to produce fuzzy proofs when rigorous proofs are impossible.
Definition 7 (Soft matching).
Given a rule r and concrete sentences s₁, …, sₙ as assumptions, soft matching outputs concrete rules with scores: (r′₁, w₁), (r′₂, w₂), …, such that (1) each wᵢ ∈ [0, 1]; (2) each r′ᵢ's premises are s₁, …, sₙ.
There are many possible ways to realize soft matching, including using neural networks. For example, we could use a pretrained neural language model to output concrete rules and matching scores (Devlin et al., 2019; Raffel et al., 2020). In this paper, as an initial step, we experiment with a simple soft matching procedure based on anti-unification.
Specifically, we perform soft matching only in testing. During training, MetaInduce still uses rigid matching for learning rules. Once the rules are learned, we can use them for making predictions with soft matching. Given a learned rule r and assumptions from a testing example, r is not necessarily applicable, but we try to find a more general rule that is applicable by anti-unifying r's premises with the assumptions. Following the previous example, anti-unifying the assumption "An elephant likes cats" with the premise "The [A] likes cats" produces "[A] likes cats", which is applicable. Note that this procedure accommodates rigid matching as a special case: if r itself is applicable, the anti-unification would produce r itself. For calculating the matching score, we use heuristics based on the number of perfectly matched words between the premises and the assumptions.

6 Experiments
We instantiate MetaQNL/MetaInduce on three tasks: learning compositional instructions on MiniSCAN (Lake et al., 2019)/SCAN (Lake and Baroni, 2018), logical reasoning on RuleTaker (Clark et al., 2020), and morphological analysis on SIGMORPHON 2018 (Cotterell et al., 2018). Not only does MetaInduce learn rules achieving state-of-the-art prediction accuracy on the three synthetic datasets (MiniSCAN, SCAN, and RuleTaker), but it uses only a small fraction of the training data. Further, the rules recovered by MetaInduce match precisely with the ground truth rules of MiniSCAN and SCAN. We evaluate soft matching only on SIGMORPHON 2018 because it is a real-world dataset with noise and ambiguity. Experiments on it show that our method can tolerate noise (even without soft matching), and they also suggest directions for future improvements.
Learning Compositional Instructions on MiniSCAN and SCAN. These two datasets are standard benchmarks for compositional generalization. They have a similar format of translating a source sequence to a target sequence, e.g., "jump → JUMP", "jump twice → JUMP JUMP". MiniSCAN consists of only 14 training examples, whereas SCAN has 17K. State of the art has reached 100% accuracy on both datasets (Liu et al., 2020; Nye et al., 2020; Chen et al., 2020).
In training, we treat each source/target pair (x, y) as a provable example (C, g), with empty assumptions C = ∅ and the goal g = "x $MAPS_TO$ y", e.g., "jump twice $MAPS_TO$ JUMP JUMP". In testing, we use "x $MAPS_TO$ [Y]" as the goal, where [Y] is a placeholder to be filled by the prover. The prover succeeds if it proves the goal with any [Y]. We do not include any unprovable examples, i.e., D⁻ = ∅.
We use a rule proposer independent of specific training examples. First, it generates candidate concrete rules by combining the sentences in the training set in all possible ways as premises and conclusions. Then it filters the rules using prior knowledge about compositional generalization: the meaning of a long sequence depends on its subsequences. For example, "jump $MAPS_TO$ JUMP ⊢ jump twice $MAPS_TO$ JUMP JUMP" is a valid rule, since "jump" is a subsequence of "jump twice". But "look $MAPS_TO$ LOOK ⊢ jump twice $MAPS_TO$ JUMP JUMP" is not a valid rule. Note that similar assumptions were also made in prior works (Nye et al., 2020; Liu et al., 2020).
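The subsequence criterion can be sketched with the standard iterator idiom; the helper below is an assumed illustration of the filter, not the actual rule proposer:

```python
# Keep a candidate rule "x -> X ⊢ y -> Y" only if the premise's source x
# is a (not necessarily contiguous) subsequence of the conclusion's source y.
def is_subsequence(short, long):
    it = iter(long)
    # each token of `short` must appear in `long`, in order
    return all(tok in it for tok in short)

# "jump ... ⊢ jump twice ..." passes the filter:
# is_subsequence(["jump"], ["jump", "twice"]) -> True
# "look ... ⊢ jump twice ..." is filtered out:
# is_subsequence(["look"], ["jump", "twice"]) -> False
```

Consuming the iterator inside `all` ensures tokens are matched in order, so "twice jump" would not count as a subsequence of "jump twice".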
We use backward chaining as the prover and Z3 (De Moura and Bjørner, 2008) as the MAX-SAT solver. For SCAN, we train only on the 400 shortest examples and test on four different splits: simple, length, addprim_jump, and addprim_turn_left. On both datasets, MetaInduce achieves 100% testing accuracy and successfully recovers the ground truth rules. Here is an example rule learned from SCAN: “[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] after [C] $MAPS_TO$ [D] [B]”. More examples are included in Appendix C and Appendix D. We tune the hyperparameter on 1000 validation examples. The validation accuracy is fairly robust w.r.t. its value (Table 1).
Table 1: Number of learned rules and validation accuracy on SCAN under different hyperparameter values.

Hyperparameter          0.32   0.64   1.28   2.56   5.12   10.24
#Rules learned    16    17     20     20     20     20     20
Accuracy          85.9  90.3   100.0  100.0  100.0  100.0  100.0
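Rule selection can be seen as a weighted MAX-SAT problem, which the paper solves with Z3. As a self-contained illustration, the sketch below replaces the solver with brute-force search over rule subsets; the function name, data layout, and the toy instance are ours, and only the objective (reward proved provable examples, penalize proved unprovable examples and rule count) follows the text.

```python
from itertools import combinations

def best_rule_set(rules, proofs_of, provable, unprovable, lam=0.5):
    """Brute-force stand-in for MAX-SAT rule selection (the paper uses Z3).
    `proofs_of[x]` lists candidate proofs of example x, each proof being the
    set of rules it uses. We pick the rule subset maximizing:
      (#provable examples proved) - (#unprovable examples proved) - lam * |subset|
    """
    def proved(x, selected):
        return any(proof <= selected for proof in proofs_of.get(x, []))

    best, best_score = set(), float("-inf")
    for k in range(len(rules) + 1):
        for subset in combinations(rules, k):
            s = set(subset)
            score = (sum(proved(x, s) for x in provable)
                     - sum(proved(x, s) for x in unprovable)
                     - lam * len(s))
            if score > best_score:
                best, best_score = s, score
    return best

# Toy instance (hypothetical): two provable examples, one candidate proof each.
rules = ["r1", "r2", "r3"]
proofs_of = {"e1": [{"r1"}], "e2": [{"r3"}]}
selected = best_rule_set(rules, proofs_of, provable=["e1", "e2"], unprovable=[])
# selected == {"r1", "r3"}: proving both examples outweighs the complexity penalty
```

With a large penalty (e.g., `lam=2.0`), the empty rule set wins, mirroring how the hyperparameter trades accuracy against model size in Table 1.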
Logical Reasoning on RuleTaker. The RuleTaker dataset tests logical reasoning over synthetic English sentences. It consists of data examples similar to the one in Fig. 1. The original RuleTaker is generated under the closed-world assumption (CWA): a sentence is assumed false if it is not provable. Tafjord et al. (2020) introduce a version of RuleTaker with the open-world assumption (OWA). Under OWA, a sentence can be proved, disproved, or neither. We benchmark on the OWA version.
Some examples in RuleTaker are meant to be disproved: if “The elephant is tall” is true, then “The elephant is not tall” should be false. We add a special symbol, $TRUE$ or $FALSE$, before each sentence, so that the previous example can be disproved using the rule “$TRUE$ The elephant is tall ⊢ $FALSE$ The elephant is not tall”. For each example to be proved, we add it to the set of provable examples and its negation to the set of unprovable examples. Conversely, for each example to be disproved, we add it to the unprovable set and its negation to the provable set. For examples that can be neither proved nor disproved, we add both the example itself and its negation to the unprovable set.
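The construction above can be sketched in a few lines; the helper name and label strings are ours, chosen for illustration.

```python
def ruletaker_examples(statement, label):
    """Sketch of turning one RuleTaker (OWA) query into provable and
    unprovable example sets. `label` is 'proved', 'disproved', or 'unknown'."""
    pos = "$TRUE$ " + statement   # the statement itself
    neg = "$FALSE$ " + statement  # its negation
    provable, unprovable = [], []
    if label == "proved":
        provable.append(pos)
        unprovable.append(neg)
    elif label == "disproved":
        provable.append(neg)
        unprovable.append(pos)
    else:  # neither provable nor disprovable under the open-world assumption
        unprovable.extend([pos, neg])
    return provable, unprovable

p, u = ruletaker_examples("The elephant is tall", "proved")
# p == ["$TRUE$ The elephant is tall"], u == ["$FALSE$ The elephant is tall"]
```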
Table 2: Test accuracy on RuleTaker (OWA), broken down by proof depth.

Train   Model         N/A    0      1      2      3      4      5      All
D3      ProofWriter   99.9   100.0  99.3   99.7   99.2   99.1   98.8   99.6
        Ours          99.4   100.0  100.0  99.7   98.9   98.9   98.6   99.4
D5      ProofWriter   99.9   100.0  99.3   99.7   99.2   99.1   98.8   99.6
        Ours          99.6   100.0  100.0  100.0  100.0  99.4   99.1   99.7
RuleTaker includes ground truth proofs, which provide concrete rules such as “$TRUE$ The elephant is tall ⊢ $FALSE$ The elephant is not tall”, but no abstraction that allows generalizing beyond the specific examples. Our rule proposer simply generates these ground truth concrete rules, whereas MetaInduce tries to learn general rules. We use simple heuristics for filtering invalid rules generated by anti-unification.
All experiments are run on machines with no GPUs, 32GB RAM, and four Intel Xeon Silver 4114 CPUs. We run MetaInduce for 5 epochs on a random subset of 10000 training examples, which takes about 20 hours. We use forward chaining as the prover with a depth limit of 7. The hyperparameters are tuned on validation data. Please see Appendix E for example rules.
We compare our method with ProofWriter (Tafjord et al., 2020), a state-of-the-art method that also uses ground truth proofs. Following their setup, we test on D5 (a subset of RuleTaker with proof depths ≤ 5) and train separate models on D5 and D3 (proof depths ≤ 3). Training on D3 evaluates the model’s generalization to longer proofs. Results are in Table 2. MetaInduce achieves state-of-the-art accuracy and is competitive with ProofWriter. Further, it learns significantly more compact models with much less training data. For example, the model trained on D3 using only a fraction of the training data has only 79 rules and a total of 2869 symbols, yet achieves a test accuracy of 99.4. In comparison, ProofWriter has an accuracy of 99.6 and is based on T5-11B (Raffel et al., 2020), which has 11 billion parameters.
Morphological Analysis on SIGMORPHON 2018. To evaluate our method on real-world natural language, we use the morphological analysis task and dataset in Akyürek et al. (2021): given the surface form of a word (e.g., studied), the model predicts its lemma (study) and an unknown number of morphological tags, such as V (verb), SG (singular), and PST (past tense). The data is constructed from the SIGMORPHON 2018 dataset. It consists of 3 languages with varying morphological complexity: Spanish, Swahili, and Turkish. For each language, they sample a training set of 1000 examples and three test sets of 100 examples each (FUT, PST, and OTHER). FUT consists exclusively of words in the future tense; PST consists of words in the past tense. The training set has only 8 past-tense words and 8 future-tense words. Therefore, FUT and PST test models’ few-shot learning capabilities.
To apply MetaQNL, we represent both the surface form and the lemma as sequences of characters. The surface form serves as the assumption, whereas the lemma and the tags serve as conclusions. For example, for the Spanish surface form zarandeamos with lemma zarandear and tags V;IND;PRS;1;PL, we treat z a r a n d e a m o s as the assumption and construct 6 provable examples with goals $LEMMA$ z a r a n d e a r, $TAG$ V, $TAG$ IND, $TAG$ PRS, $TAG$ 1, and $TAG$ PL. Examples with the same assumption but any other goals are treated as unprovable. The rule proposer simply generates rules that can prove the conclusion in a single step, such as “z a r a n d e a m o s ⊢ $LEMMA$ z a r a n d e a r” (more examples in Appendix F).
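The encoding above can be sketched as follows; the helper name is ours, for illustration only.

```python
def morph_examples(surface, lemma, tags):
    """Sketch of the MetaQNL encoding for morphological analysis:
    characters become space-separated tokens; the surface form is the
    assumption, and the lemma and each tag form one provable goal."""
    assumption = " ".join(surface)
    goals = ["$LEMMA$ " + " ".join(lemma)] + ["$TAG$ " + t for t in tags]
    return [(assumption, g) for g in goals]

examples = morph_examples("zarandeamos", "zarandear", ["V", "IND", "PRS", "1", "PL"])
# 6 provable examples; examples[0] ==
#   ("z a r a n d e a m o s", "$LEMMA$ z a r a n d e a r")
```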
Following Akyürek et al. (2021), we evaluate the predictions using F1 score and compare with a standard seq2seq neural network: LSTMs with attention. Note that we compare with the baseline in Akyürek et al. (2021), not their proposed method. Their method is orthogonal to ours, since it focuses on generating synthetic data for augmenting the training set.
Table 3: F1 scores on morphological analysis (SIGMORPHON 2018).

Model                  Spanish           Swahili           Turkish
                       FUT+PST  OTHER    FUT+PST  OTHER    FUT+PST  OTHER
LSTMs + attention      66       88       75       90       69       85
Ours                   55       82       81       86       53       71
Ours + soft matching   66       84       80       85       53       70
Results are shown in Table 3. Note that the task is not trivial: the neural baseline performs far from perfectly, especially on FUT and PST (an F1 score of 66% on Spanish). Unlike the baseline, our method learns interpretable morphological rules; e.g., the suffix áramos in Spanish indicates the past tense (more examples in Appendix F). In terms of performance, our method (without soft matching) is competitive with the baseline on Swahili, but gaps remain on Spanish and Turkish. Further analysis reveals different reasons behind the gaps. Turkish morphology is very complex, but our current way of instantiating MetaQNL considers only proofs of depth 1, which could restrict it from learning more expressive rules. Spanish morphology is relatively simple, yet our F1 score still has large room for improvement, because the method learns over-specific rules that achieve high precision but low recall.
Next, we explore the use of soft matching. We keep training the same and apply soft matching only in testing. Given a test example such as z a r a n d e a m o s, we consider not only the applicable rules learned by MetaInduce but also additional rules generated through our simple soft matching mechanism based on anti-unification (Sec. 5). All rules are ranked by matching scores calculated using heuristics: rigid matching always receives the highest score, and more exactly matched characters lead to higher scores. After ranking the rules, we apply them one by one until we obtain a predicted lemma.
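The ranking step can be sketched as below. The scoring function is our illustrative stand-in for the paper's heuristics, and the rule patterns and prediction labels are hypothetical; it only shows that rigid matches outrank partial ones.

```python
def match_score(rule_chars, word_chars):
    """Heuristic score: a rigid (exact) match always ranks first; otherwise,
    count the positions where the rule pattern and the word agree."""
    if rule_chars == word_chars:
        return float("inf")  # rigid matching always has the highest score
    return sum(a == b for a, b in zip(rule_chars, word_chars))

def rank_rules(rules, word_chars):
    """Sort candidate (pattern, prediction) rules so better matches come first;
    rules are then applied one by one until a lemma is predicted."""
    return sorted(rules, key=lambda r: match_score(r[0], word_chars), reverse=True)

rules = [(list("emos"), "lemma-er"), (list("amos"), "lemma-ar")]
ranked = rank_rules(rules, list("amos"))
# the exact match "amos" outranks the partial match "emos"
```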
The bottom row in Table 3 shows the result of soft matching. Even this simple form of soft matching can bridge the performance gap on Spanish. However, it leads to no improvements on Swahili and Turkish. We found that the individual rules learned on Swahili and Turkish are more approximate, i.e., more like “rules of thumb”: they capture the general pattern but have many exceptions. This is due to the greater morphological complexity of these languages; they have fewer simple rules such as “this suffix always indicates the past tense.” As a result, naively relaxing the matching conditions leads to too many spurious rules.
7 Limitations and Open Questions
First of all, our approach is far from mature; substantial further development is needed to handle free-form natural language. Soft matching is one possible way to address linguistic variation: for example, a pretrained language model could output matching scores between rules and assumptions.
Our experiments are not large-scale but serve as proof of concept for a novel approach at an early stage. MetaInduce does not yet scale to millions of training examples, which may be necessary to learn enough rules to handle the complexity of natural language. The current bottleneck is rule abstraction, which could possibly be addressed through better methods than anti-unification.
MetaInduce is a meta-algorithm that permits many variations of its components. This opens many questions and opportunities for integration with deep learning. For example, the rule proposer or the theorem prover could be a deep network instead of a manually crafted heuristic.
Acknowledgments
This work is partially supported by the Office of Naval Research under Grant N00014-20-1-2634. The authors also gratefully acknowledge financial support from the Schmidt DataX Fund at Princeton University made possible through a major gift from the Schmidt Futures Foundation.
References

Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations (ICLR).
DeepMath: deep sequence models for premise selection. In Advances in Neural Information Processing Systems (NeurIPS).
Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501.
Flexible generation of natural language deductions. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Compositional generalization via neural-symbolic stack machines. In Advances in Neural Information Processing Systems (NeurIPS).
Learning invariants through soft unification. In Advances in Neural Information Processing Systems (NeurIPS).
Transformers as soft reasoners over language. In International Joint Conference on Artificial Intelligence (IJCAI).
The CoNLL–SIGMORPHON 2018 shared task: universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pp. 1–27.
Inductive logic programming at 30: a new introduction. arXiv preprint arXiv:2008.07912.
Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems (NeurIPS).
A theorem proving approach to analysis of secure information flow. In International Conference on Security in Pervasive Computing (SPC).
Z3: an efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).
BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).
Neural logic machines. In International Conference on Learning Representations (ICLR).
Production matching for large learning systems. Technical report, Carnegie Mellon University.
Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64.
Measuring systematic generalization in neural proof generation with transformers. In Advances in Neural Information Processing Systems (NeurIPS).
Towards a formal distributional semantics: simulating logical calculi with tensors. arXiv preprint arXiv:1304.5823.
Recent progress on monotonicity. In Linguistic Issues in Language Technology (LiLT).
TensorLog: deep learning meets probabilistic databases. Journal of Artificial Intelligence Research 1, pp. 1–15.
First-order theorem proving and Vampire. In International Conference on Computer Aided Verification (CAV).
Anti-unification for unranked terms and hedges. Journal of Automated Reasoning 52, pp. 155–190.
Unification with sequence variables and flexible arity symbols and its extension with pattern-terms. pp. 290–304.
Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML).
Human few-shot learning of compositional instructions. In Annual Meeting of the Cognitive Science Society (CogSci).
Reasoning in vector space: an exploratory study of question answering. In International Conference on Learning Representations (ICLR).
Closed loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. In International Conference on Machine Learning (ICML).
Compositional generalization by learning analytical expressions. In Advances in Neural Information Processing Systems (NeurIPS).
Natural logic for textual inference. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.
The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations (ICLR).
Insights for AI from the human mind. Communications of the ACM 64, pp. 38–41.
Taxonomic syntax for first order inference. Journal of the ACM (JACM) 40, pp. 246–283.
Solution of the Robbins problem. Journal of Automated Reasoning 19, pp. 263–276.
Metamath: a computer language for mathematical proofs.
The enigma of reason. Harvard University Press.
Differentiable reasoning on large knowledge bases and natural language. In AAAI Conference on Artificial Intelligence.
Higher-order logical inference with compositional semantics. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Inductive logic programming. New Generation Computing 8, pp. 295–318.
Learning compositional rules via neural program synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
A note on inductive generalization. Machine Intelligence 5, pp. 153–163.
Automatic methods of inductive inference. Ph.D. thesis, The University of Edinburgh.
Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393.
RNNLogic: learning logic rules for reasoning on knowledge graphs. In International Conference on Learning Representations (ICLR).
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR) 21, pp. 1–67.
Provenance-guided synthesis of Datalog programs. In Symposium on Principles of Programming Languages (POPL).
Handbook of automated reasoning. Vol. 1.
End-to-end differentiable proving. In Advances in Neural Information Processing Systems (NeurIPS).
Artificial intelligence: a modern approach.
PRover: proof generation for interpretable reasoning over rules. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
A generative symbolic model for more general natural language understanding and reasoning. arXiv preprint arXiv:2105.02486.
Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611.
Synthesizing Datalog programs using numerical relaxation. In International Joint Conference on Artificial Intelligence (IJCAI).
Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46, pp. 159–216.
ProofWriter: generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048.
oLMpics: on what language model pretraining captures. Transactions of the Association for Computational Linguistics (TACL) 8, pp. 743–758.
Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
NLProlog: reasoning with weak unification for question answering in natural language. In Annual Meeting of the Association for Computational Linguistics (ACL).
Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems (NeurIPS).
Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning (ICML).
Anti-unification in constraint logic programming. Theory and Practice of Logic Programming 19, pp. 773–789.
Appendix A Partial Order Among Sentences and Rules
Here we prove that the generality relation ⪰ in Definition 4 of the main paper is indeed a partial order. Below, s, t, and u denote sentences, σ denotes a substitution, and s ⪰ t means that s is more general than t, i.e., σ(s) = t for some substitution σ.
Definition 7 (Sentence length).
The length of a sentence s, denoted |s|, is its number of tokens.
Lemma 1 (Substitutions are non-contractive).
Applying a substitution does not make a sentence shorter. In other words, for any sentence s and substitution σ, we have |σ(s)| ≥ |s|. Further, |σ(s)| = |s| if and only if σ maps every token in s to a sentence of length 1, i.e., |σ(u)| = 1 for every token u in s.
Proof.
For any substitution σ and variable v, σ(v) is a sentence; therefore, for any token u, |σ(u)| ≥ 1 (Definition 3). For any sentence s = u₁ u₂ … uₙ, we have |σ(s)| = |σ(u₁)| + |σ(u₂)| + … + |σ(uₙ)| ≥ n = |s|. The equality holds if and only if |σ(uᵢ)| = 1 for all i. ∎
Theorem 1 (Partial order among sentences).
If sentence equality is defined modulo α-conversion, then the ⪰ in Definition 4 is a partial order among sentences. In other words,

1. For any sentence s, s ⪰ s.

2. For any sentences s and t, if s ⪰ t and t ⪰ s, then s = t modulo α-conversion.

3. For any sentences s, t, and u, if s ⪰ t and t ⪰ u, then s ⪰ u.
Proof.
We prove the three statements separately.

1. The identity substitution maps any sentence s to itself, so s ⪰ s.

2. Given two sentences s and t such that s ⪰ t and t ⪰ s, there exist substitutions σ₁ and σ₂ such that σ₁(s) = t and σ₂(t) = s (Definition 4). Applying Lemma 1 to them separately leads to |t| ≥ |s| and |s| ≥ |t|, so |s| = |t|, and by Lemma 1, σ₁ and σ₂ map every token to a sentence of length 1. According to Definition 3, for any token u in s: if u is not a variable, σ₁(u) = u, i.e., all non-variable tokens in s and t are identical; if u is a variable, σ₁(u) must also be a variable, because otherwise σ₂(σ₁(u)) = σ₁(u) would not be a variable, contradicting σ₂(σ₁(s)) = s. Therefore, both σ₁ and σ₂ just rename variables, and it is straightforward to verify that they cannot map different variables to the same variable. In other words, σ₁ and σ₂ are α-conversions; s = t modulo α-conversion.

3. Given three sentences s, t, and u such that s ⪰ t and t ⪰ u, there exist substitutions σ₁ and σ₂ such that σ₁(s) = t and σ₂(t) = u. Let σ = σ₂ ∘ σ₁ be the composite of σ₁ and σ₂. σ is also a substitution, and σ(s) = σ₂(σ₁(s)) = u. Therefore, s ⪰ u. ∎
Theorem 2 (Partial order among rules).
The ⪰ in Definition 4 is a partial order among rules. In other words,

1. For any rule R, R ⪰ R.

2. For any two rules R₁ and R₂, if R₁ ⪰ R₂ and R₂ ⪰ R₁, then R₁ = R₂ modulo α-conversion.

3. For any three rules R₁, R₂, and R₃, if R₁ ⪰ R₂ and R₂ ⪰ R₃, then R₁ ⪰ R₃.

Proof.
Similar to the proof of Theorem 1. ∎
Definition 8 (Strict partial order among sentences/rules).
Let s and t be two sentences; s is strictly more general than t (denoted by s ≻ t) if and only if s ⪰ t and s ≠ t modulo α-conversion. Similarly, if R₁ and R₂ are rules, R₁ ≻ R₂ if and only if R₁ ⪰ R₂ and R₁ ≠ R₂ modulo α-conversion.
Appendix B Unification and Antiunification of Sentences and Rules
Unification and anti-unification (Plotkin, 1970; Robinson and Voronkov, 2001) are basic symbolic procedures in formal logic that are useful for theorem proving and logic programming (Russell and Norvig, 2002; Yernaux and Vanhoof, 2019). In MetaQNL, unification is used in backward chaining, and anti-unification is used to abstract concrete rules into rules with variables. We adapt existing problem setups and algorithms from formal logic to MetaQNL. The algorithms we use for MetaQNL do not have the theoretical guarantees they enjoy in formal logic, but they work well in practice. The anti-unifiers they compute may not satisfy the conditions of most specific anti-unifiers (Definition 15), but strict anti-unification is not necessary for rule abstraction to work: in principle, all we need is a procedure for generating abstract rules from concrete ones.
Unification. Given two sentences (or two rules), unification aims to find substitutions mapping them to the same sentence (or rule); such substitutions are called unifiers. We extend unification to MetaQNL by adapting prior work, especially the unification algorithm developed by Kutsia (2002) for a variant of first-order logic with sequence variables and flexible-arity symbols.
Definition 9 (Unifier).
A substitution σ is a unifier of two sentences s and t if and only if σ(s) = σ(t). Similarly, σ is a unifier of two rules R₁ and R₂ if and only if σ(R₁) = σ(R₂).
Two sentences may have multiple unifiers, and different unifiers lead to different sentences when applied. Among them, we prefer unifiers that are more general: a more general unifier does not introduce any new information not present in the two input sentences. This is the intuition behind the concept of “most general unifiers”.
Definition 10 (Most general unifier).
Let the substitution σ be a unifier of sentences s and t; σ is a most general unifier if and only if there is no unifier σ′ of s and t such that σ′ is strictly more general than σ.
In unification, we want to compute a set of most general unifiers, and we want the set to be minimal and complete. Below we define these concepts for sentences.
Definition 11 (Complete set of unifiers).
Let U be a set of unifiers of sentences s and t; U is complete if and only if for any unifier σ of s and t, there exists a unifier σ′ ∈ U such that σ′ ⪰ σ.
Definition 12 (Minimal set of unifiers).
Let U be a set of unifiers of sentences s and t; U is minimal if and only if for any σ, σ′ ∈ U, σ ⪰ σ′ implies σ = σ′ (modulo α-conversion).
Definition 13 (Minimal complete set of unifiers).
Let U be a set of unifiers of sentences s and t; U is a minimal complete set of unifiers if and only if it is both minimal and complete.
The definitions for rules are parallel. Given two sentences (or two rules), the unification problem is to compute a minimal complete set of unifiers. The result can be empty (e.g., unifying “hello world” and “how are you”), finite (“hello [X]” and “[Y] world”), or infinite (“hello [X]” and “[X] hello”, Fig. 2 Left).
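To make the unification problem concrete, here is a naive backtracking enumerator for sentences with sequence variables. It is our sketch, not the paper's algorithm: it does not compute a minimal complete set, and a crude occurs check cuts off the infinite cases mentioned above.

```python
def is_var(tok):
    """Variables are written like [X]."""
    return tok.startswith("[") and tok.endswith("]")

def apply_subst(sent, subst):
    """Replace each variable token by the token sequence it is bound to."""
    out = []
    for tok in sent:
        out.extend(subst.get(tok, [tok]))
    return out

def unify(s, t, subst=None):
    """Enumerate unifiers of two token sequences, where a variable binds a
    nonempty token sequence. Naive sketch: may yield redundant unifiers."""
    subst = dict(subst or {})
    s, t = apply_subst(s, subst), apply_subst(t, subst)
    if not s or not t:
        if not s and not t:
            yield subst
        return
    if s[0] == t[0]:
        # identical heads: consume them and continue
        yield from unify(s[1:], t[1:], subst)
        return
    for var_side, other in ((s, t), (t, s)):
        var = var_side[0]
        if not is_var(var):
            continue
        # try binding the leading variable to every nonempty prefix
        for k in range(1, len(other) + 1):
            prefix = other[:k]
            if var in prefix:  # crude occurs check
                continue
            new = dict(subst)
            new[var] = prefix
            yield from unify(var_side[1:], other[k:], new)

unifiers = list(unify("hello [X]".split(), "[Y] world".split()))
# one unifier: {"[Y]": ["hello"], "[X]": ["world"]}
```

This reproduces the finite case from the text: “hello [X]” and “[Y] world” unify by binding [Y] to “hello” and [X] to “world”.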
Anti-unification. Given two sentences (or two rules), anti-unification aims to generalize them into a single sentence (or rule). Anti-unification has also been studied in formal logic (Plotkin, 1970; Kutsia et al., 2014), and we extend it to MetaQNL by adapting prior work. For simplicity, we define anti-unification only for sentences, but it applies to rules as well.
Definition 14 (Anti-unifier).
Given two sentences s and t, their anti-unifier is a triple (u, σ₁, σ₂) of a sentence u and two substitutions σ₁ and σ₂, such that σ₁(u) = s and σ₂(u) = t.
Definition 15 (Most specific anti-unifier).
Let (u, σ₁, σ₂) be an anti-unifier of sentences s and t; it is a most specific anti-unifier if and only if there is no substitution σ and anti-unifier (u′, σ₁′, σ₂′) of s and t such that

1. u′ = σ(u), and

2. u′ ≠ u modulo α-conversion.
Definition 16 (Complete set of anti-unifiers).
Let A be a set of anti-unifiers of sentences s and t; A is complete if and only if for any anti-unifier (u, σ₁, σ₂) of s and t, there exists a substitution σ and an anti-unifier (u′, σ₁′, σ₂′) ∈ A such that σ(u′) = u.
Definition 17 (Minimal set of anti-unifiers).
Let A be a set of anti-unifiers of sentences s and t; A is minimal if and only if for any (u, σ₁, σ₂), (u′, σ₁′, σ₂′) ∈ A, if there exists a substitution σ such that σ(u) = u′, then σ must be an α-conversion, i.e., u = u′ (modulo α-conversion).
Definition 18 (Minimal complete set of anti-unifiers).
Let A be a set of anti-unifiers of sentences s and t; A is a minimal complete set of anti-unifiers if and only if it is both minimal and complete.
Given two sentences (or two rules), the anti-unification problem is to compute a minimal complete set of anti-unifiers. Unlike unification, the result of anti-unification is always nonempty and finite (Fig. 2).
Theorem 3 (Anti-unification is finitary).
Let A be a minimal complete set of anti-unifiers of sentences s and t; then A is nonempty and finite.
Proof.
For any sentences s and t, we have a trivial anti-unifier ([X], σ₁, σ₂), where σ₁ maps the variable [X] to s and σ₂ maps [X] to t. Since A is complete, applying Definition 16 shows that A must be nonempty.
For any anti-unifier (u, σ₁, σ₂), we have σ₁(u) = s (Definition 14). Applying Lemma 1 gives |u| ≤ |s|, so the length of u is bounded. Also, u cannot contain non-variable tokens besides those in s, so its vocabulary is also bounded. Hence there are only a finite number of different sentences that u can be (modulo α-conversion), and A must also be finite. ∎
Our current anti-unification algorithm is adapted from Kutsia et al. (2014). It recursively matches the beginnings of two sentences. Let s and t be sentences, and let u be the more general sentence in their anti-unifier. If s and t start with the same word w, then u should also start with w. Otherwise, u should start with a variable corresponding to some prefixes of s and t. The algorithm searches over all such prefixes and anti-unifies the remaining parts of the sentences recursively.
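The recursive procedure just described can be sketched as follows. This is our simplified illustration, not the exact algorithm of Kutsia et al. (2014): it enumerates generalizations without computing the minimal complete set, and fresh variable names [V0], [V1], … are our convention.

```python
def anti_unify(s, t, depth=0):
    """Return the set of generalizations (as token tuples) of two token
    sequences, introducing a fresh variable for each mismatched prefix pair."""
    s, t = tuple(s), tuple(t)
    if not s and not t:
        return {()}
    if s and t and s[0] == t[0]:
        # a shared first word is kept in the generalization
        return {(s[0],) + rest for rest in anti_unify(s[1:], t[1:], depth)}
    # otherwise, a variable covers some nonempty prefix of each sentence
    results = set()
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            var = "[V%d]" % depth
            for rest in anti_unify(s[i:], t[j:], depth + 1):
                results.add((var,) + rest)
    return results

gens = anti_unify("jump twice".split(), "look twice".split())
# {("[V0]", "twice"), ("[V0]",)}; the most specific one keeps the shared "twice"
```

For example, generalizing “jump twice” and “look twice” yields “[V0] twice” (the variable covers the mismatched verbs) along with the less specific “[V0]”.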
Appendix C Details of MiniSCAN Experiments
The 14 MiniSCAN (Lake et al., 2019) training examples represented as sentences in MetaQNL ($MAPS_TO$ is a special symbol):
dax  $MAPS_TO$  RED  
lug  $MAPS_TO$  BLUE  
wif  $MAPS_TO$  GREEN  
zup  $MAPS_TO$  YELLOW  
dax fep  $MAPS_TO$  RED RED RED  
lug fep  $MAPS_TO$  BLUE BLUE BLUE  
wif blicket dax  $MAPS_TO$  GREEN RED GREEN  
lug blicket wif  $MAPS_TO$  BLUE GREEN BLUE  
dax kiki lug  $MAPS_TO$  BLUE RED  
lug kiki wif  $MAPS_TO$  GREEN BLUE  
lug fep kiki wif  $MAPS_TO$  GREEN BLUE BLUE BLUE  
lug kiki wif fep  $MAPS_TO$  GREEN GREEN GREEN BLUE  
wif kiki dax blicket lug  $MAPS_TO$  RED BLUE RED GREEN  
wif blicket dax kiki lug  $MAPS_TO$  BLUE GREEN RED GREEN 
The 10 testing examples:
zup fep  $MAPS_TO$  YELLOW YELLOW YELLOW  
zup blicket lug  $MAPS_TO$  YELLOW BLUE YELLOW  
zup kiki dax  $MAPS_TO$  RED YELLOW  
zup fep kiki lug  $MAPS_TO$  BLUE YELLOW YELLOW YELLOW  
wif kiki zup fep  $MAPS_TO$  YELLOW YELLOW YELLOW GREEN  
lug kiki wif blicket zup  $MAPS_TO$  GREEN YELLOW GREEN BLUE  
zup blicket wif kiki dax fep  $MAPS_TO$  RED RED RED YELLOW GREEN YELLOW  
zup blicket zup kiki zup fep  $MAPS_TO$  YELLOW YELLOW YELLOW YELLOW YELLOW YELLOW  
dax blicket zup  $MAPS_TO$  RED YELLOW RED  
wif kiki zup  $MAPS_TO$  YELLOW GREEN 
Below are some examples of candidate rules generated by the rule proposer. Note that many of them are wrong, because the premises are not sufficient to deduce the conclusion.
⊢ lug fep kiki wif $MAPS_TO$ GREEN BLUE BLUE BLUE
dax $MAPS_TO$ RED ⊢ wif blicket dax $MAPS_TO$ GREEN RED GREEN
lug $MAPS_TO$ BLUE ⊢ lug kiki wif $MAPS_TO$ GREEN BLUE
lug $MAPS_TO$ BLUE ⊢ lug fep $MAPS_TO$ BLUE BLUE BLUE
dax $MAPS_TO$ RED; lug $MAPS_TO$ BLUE ⊢ wif blicket dax kiki lug $MAPS_TO$ BLUE GREEN RED GREEN
MetaInduce learns 7 rules corresponding to the ground truth rules of MiniSCAN:
⊢ dax $MAPS_TO$ RED
⊢ lug $MAPS_TO$ BLUE
⊢ wif $MAPS_TO$ GREEN
⊢ zup $MAPS_TO$ YELLOW
[A] $MAPS_TO$ [B] ⊢ [A] fep $MAPS_TO$ [B] [B] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] kiki [C] $MAPS_TO$ [D] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] blicket [C] $MAPS_TO$ [B] [D] [B]
Appendix D Details of SCAN Experiments
Some examples in SCAN (Lake and Baroni, 2018):
walk  $MAPS_TO$  WALK  
jump  $MAPS_TO$  JUMP  
turn right  $MAPS_TO$  RIGHT  
jump after turn left  $MAPS_TO$  LEFT JUMP  
walk right  $MAPS_TO$  RIGHT WALK  
walk after run  $MAPS_TO$  RUN WALK  
turn left twice  $MAPS_TO$  LEFT LEFT  
turn opposite left  $MAPS_TO$  LEFT LEFT  
turn around right  $MAPS_TO$  RIGHT RIGHT RIGHT RIGHT  
walk around left  $MAPS_TO$  LEFT WALK LEFT WALK LEFT WALK LEFT WALK 
Below are some examples of the candidate rules generated by the rule proposer.
run $MAPS_TO$ RUN ⊢ walk after run $MAPS_TO$ RUN WALK
walk $MAPS_TO$ WALK; run $MAPS_TO$ RUN ⊢ walk after run $MAPS_TO$ RUN WALK
run $MAPS_TO$ RUN ⊢ jump twice after run twice $MAPS_TO$ RUN RUN JUMP JUMP
run twice $MAPS_TO$ RUN RUN ⊢ jump twice after run twice $MAPS_TO$ RUN RUN JUMP JUMP
MetaInduce learns 20 rules corresponding to the ground truth rules of SCAN:
⊢ walk $MAPS_TO$ WALK
⊢ look $MAPS_TO$ LOOK
⊢ run $MAPS_TO$ RUN
⊢ jump $MAPS_TO$ JUMP
⊢ turn right $MAPS_TO$ RIGHT
⊢ turn left $MAPS_TO$ LEFT
⊢ turn opposite left $MAPS_TO$ LEFT LEFT
⊢ turn opposite right $MAPS_TO$ RIGHT RIGHT
⊢ turn around left $MAPS_TO$ LEFT LEFT LEFT LEFT
⊢ turn around right $MAPS_TO$ RIGHT RIGHT RIGHT RIGHT
[A] $MAPS_TO$ [B] ⊢ [A] left $MAPS_TO$ LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] right $MAPS_TO$ RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] opposite left $MAPS_TO$ LEFT LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] opposite right $MAPS_TO$ RIGHT RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] around left $MAPS_TO$ LEFT [B] LEFT [B] LEFT [B] LEFT [B]
[A] $MAPS_TO$ [B] ⊢ [A] around right $MAPS_TO$ RIGHT [B] RIGHT [B] RIGHT [B] RIGHT [B]
[A] $MAPS_TO$ [B] ⊢ [A] twice $MAPS_TO$ [B] [B]
[A] $MAPS_TO$ [B] ⊢ [A] thrice $MAPS_TO$ [B] [B] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [C] and [A] $MAPS_TO$ [D] [B]
[A] $MAPS_TO$ [B]; [C] $MAPS_TO$ [D] ⊢ [A] after [C] $MAPS_TO$ [D] [B]
Appendix E Details of RuleTaker Experiments
Rule Proposer. Fig. 3 shows the form of ground truth proofs in RuleTaker. For this specific example, our rule proposer would generate 2 candidate rules below:
$TRUE$ The elephant is strong 
$TRUE$ The elephant likes cats 
Learned Rules. Below are some example rules learned from RuleTaker:
$TRUE$ the [D] 
$FALSE$ the [A] sees the [B] 
$TRUE$ the [A] needs the [E] 
$TRUE$ [A] is [E] 
$TRUE$ the [A] chases the [D] 
$TRUE$ [A] [B]  
$TRUE$ [A] [C] 
Appendix F Details of SIGMORPHON 2018 Experiments
Rule Proposer. The rule proposer simply generates concrete rules that can prove the goals in a single step. For the previous example, it generates the 6 candidate rules below:
z a r a n d e a m o s ⊢ $LEMMA$ z a r a n d e a r
z a r a n d e a m o s ⊢ $TAG$ V
z a r a n d e a m o s ⊢ $TAG$ IND
z a r a n d e a m o s ⊢ $TAG$ PRS
z a r a n d e a m o s ⊢ $TAG$ 1
z a r a n d e a m o s ⊢ $TAG$ PL
Learned Rules. Below are some example rules learned on the Spanish part of SIGMORPHON 2018:
[A] e a m o s ⊢ $LEMMA$ [A] e a r
[A] á r a m o s ⊢ $TAG$ PST
n o s [A] e m o s ⊢ $TAG$ SBJV
n o [A] z c a m o s ⊢ $LEMMA$ [A] c e r
[A] z a m o s ⊢ $LEMMA$ [A] z a r
Appendix G Heuristics for constraining the space of rules
We use a few simple and general heuristics to constrain the space of rules and prune invalid rules generated by anti-unification. First, we merge multiple variables that always appear together. For example, the [A] [B] [C] and [D] [E] in the rule below can be merged:
$TRUE$ [A] [B] [C] ⊢ $TRUE$ [D] [E]
So the rule becomes:
$TRUE$ [A] ⊢ $TRUE$ [B]
A variable in a rule is called a free variable if it appears only once. For example, the rule
[X] is red ⊢ Tomorrow will be sunny
contains a free variable [X]. We only consider rules with at most one free variable, and we require that free variables not appear in the conclusion, because a free variable in the conclusion would allow arbitrary conclusions to be formed by substituting it with other sentences.
In addition, a rule cannot contain a premise made of one single free variable: such a premise can be satisfied by any sentence, so there is no point including it in the rule.
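The free-variable heuristics above can be sketched as a simple validity check; the function names and the example rules are ours, for illustration.

```python
from collections import Counter

def variable_counts(sentences):
    """Count occurrences of variable tokens (written like [A]) across sentences."""
    tokens = [tok for sent in sentences for tok in sent.split()]
    return Counter(t for t in tokens if t.startswith("[") and t.endswith("]"))

def is_valid(premises, conclusion):
    """Sketch of the pruning heuristics: at most one free variable, no free
    variable in the conclusion, and no premise that is a bare free variable."""
    counts = variable_counts(premises + [conclusion])
    free = {v for v, c in counts.items() if c == 1}  # variables appearing once
    if len(free) > 1:
        return False
    if any(v in conclusion.split() for v in free):
        return False
    if any(p.strip() in free for p in premises):
        return False
    return True

# allowed: one free variable, not in the conclusion
assert is_valid(["[X] is red"], "Tomorrow will be sunny")
# rejected: free variable in the conclusion
assert not is_valid(["$TRUE$ [A] is nice"], "$TRUE$ [A] likes [X]")
```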