FaiRR: Faithful and Robust Deductive Reasoning over Natural Language

03/19/2022
by   Soumya Sanyal, et al.
University of Southern California

Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in natural language. Recent works show that such models can also produce the reasoning steps (i.e., the proof graph) that emulate the model's logical reasoning process. Currently, these black-box models generate both the proof graph and intermediate inferences within the same model and thus may be unfaithful. In this work, we frame the deductive logical reasoning task by defining three modular components: rule selection, fact selection, and knowledge composition. The rule and fact selection steps select the candidate rule and facts to be used, and then the knowledge composition combines them to generate new inferences. This ensures model faithfulness by guaranteeing a causal relation from the proof steps to the generated inferences. To test our framework, we propose FaiRR (Faithful and Robust Reasoner), where the above three components are independently modeled by transformers. We observe that FaiRR is robust to novel language perturbations and is faster at inference than previous works on existing reasoning datasets. Additionally, in contrast to black-box generative models, the errors made by FaiRR are more interpretable due to the modular approach.


1 Introduction

The field of AI has long pursued the goal of building systems that can automatically reason over given explicit knowledge to generate conclusions and provide the reasoning steps involved in the process McCarthy (1959); Newell and Simon (1956). Recently, Clark et al. (2020) proposed a modern version of this problem, where the formal representation of knowledge is replaced by natural language statements in English. Further, they proposed RuleTaker, a transformer-based model Vaswani et al. (2017) that can predict whether a candidate statement is entailed by the natural language statements by emulating deductive reasoning. As shown in Figure 1, in this deductive reasoning task, facts and rules from the rulebase are combined iteratively to generate intermediate inferences which eventually entail the statement. Note that the reasoning process implicitly involves two steps: determining which rules and facts to combine at each iteration, followed by using them to generate an intermediate conclusion.

Figure 1: Example of a theory, a statement, and a valid proof graph - An instance contains multiple facts and rules in blue and yellow respectively, followed by a statement in red. The proof graph describes the reasoning steps required to generate the statement.

While RuleTaker focuses on just predicting statement entailment, some recent works Saha et al. (2020); Tafjord et al. (2021) have further developed systems that can also generate the reasoning steps (i.e., proof graph generation). However, these systems do not explicitly ensure causality from the rule/fact selection to the generation of intermediate inferences. Since these systems are inherently black-box models, it is unclear whether such constraints are implicitly learned by the models without being enforced externally. This, in turn, calls into question the faithfulness of the model’s internal reasoning process Lipton (2018). Because the model has access to the full theory at input, it might use parts of the theory beyond the predicted proof to generate the inference.

In this paper, we address these shortcomings by developing a modularized framework to solve the deductive reasoning task. While existing methods generate both proofs and conclusions in a single step, in our framework we break this process into three steps: rule selection, fact selection, and knowledge composition. The rule selection step decides the relevant rule to use for an iterative inference step and fact selection uses this rule to select the relevant facts. Then, the knowledge composition step reasons using only the selected rule and facts to generate the next intermediate inference. In Figure 2, we show the model schematics for our system and contrast it with previous methods. Notably, we strictly restrict the information accessible at each step of our framework to make the reasoning process more faithful. For example, the fact selection step depends only on the selected rule, instead of all the rules in the rulebase. Additionally, the generated inference depends explicitly on the selected rule and facts, as opposed to all the rules and facts in prior works. This makes the proof graph a by-product of the selection steps as we don’t need to generate any separate proofs. Since we constrain the inputs to each step, this also makes each sub-problem easier to learn, leading to an overall more robust reasoning model.

To model these three steps, we develop FaiRR, in which each component is a transformer-based model learning to perform the modular tasks. Specifically, we use RoBERTa-based models Liu et al. (2019) for the two selection tasks and a T5-based model Raffel et al. (2020) for the composition task. Similar to ProofWriter, we use synthetic rulebases to train FaiRR. To test the deductive reasoning capabilities in a more comprehensive way, we experiment with both existing deductive reasoning datasets and multiple newly-generated robustness dataset variants. Overall, we find that FaiRR is more robust to novel language perturbations than baselines. Additionally, our model is up to three times faster at inference due to the constrained input and outputs of different modules. Lastly, we find that the errors made by our model are more interpretable and easier to debug compared to baseline generative models. This further demonstrates the faithfulness of our modularized reasoning framework.

2 Problem Definition

Notations

A theory T consists of a set of facts and rules expressed in natural language. An example of a theory is depicted in Figure 1. Here, the sentences in the blue and yellow boxes are facts and rules, respectively. Further, a proof graph is a directed graph connecting facts and rules that describes how a specific inference can be obtained from the theory. In Figure 1, the proof graph shows the steps involved in generating the inference “Charlie is white.”. To generate the proof graph, we may need to infer some intermediate conclusions. These inferences are considered part of the extended facts in the theory. For example, in Fig. 1, “Charlie is kind” is an intermediate inference required to generate the correct proof graph.
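To make the notation concrete, the following minimal sketch (illustrative names only, not the authors' code) shows one way a theory and its proof graph could be represented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Theory:
    facts: List[str]        # natural-language facts, e.g. "Charlie is kind."
    rules: List[str]        # natural-language rules, e.g. "If someone is kind then they are white."
    statement: str          # candidate statement, e.g. "Charlie is white."

@dataclass
class ProofStep:
    rule: str               # rule used in this one-hop inference
    facts: List[str]        # fact(s) the rule was applied to
    conclusion: str         # resulting intermediate inference

@dataclass
class ProofGraph:
    steps: List[ProofStep] = field(default_factory=list)

    def add_step(self, step: ProofStep) -> None:
        # Each intermediate conclusion becomes a fact usable by later steps.
        self.steps.append(step)
```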

Figure 2: Reasoning process in different models. (a): ProofWriter (“All”) directly outputs the entailment prediction and proof graph for a given input. (b): ProofWriter (“Iter”) iteratively generates one-step intermediate conclusions and their proofs. (c): FaiRR selects a rule, then a fact, and finally combines them to generate an intermediate inference. Note that the proof is implicitly determined by the selection steps. Please refer to Section 3.1 for details.

Deductive Reasoning

The task of deductive reasoning is described as follows: given a theory T and a statement s, predict whether the theory entails the statement (entailment prediction) and, if so, generate the proof graph that supports the statement (proof generation). For the example theory and statement in Figure 1, the statement is indeed entailed by the theory, and a valid proof graph is shown for it. The main goal of this task is to evaluate whether a model can generate valid reasoning chains in the form of proof graphs to justify its entailment prediction.

Reasoning Robustness

We consider an auxiliary task that evaluates the robustness of the reasoning abilities used by the model. Let δ be a perturbation function that modifies a given theory T (statement s) to a theory T′ (statement s′), such that T′ only has some surface changes in its natural language form but still requires a reasoning process similar to that of T. A function that alters the subjects in the theory to unseen subjects is an example of such a perturbation function. We perturb each theory-statement pair (T, s) to create an equivalence set, defined as the set {(T′1, s′1), …, (T′N, s′N)}, where each (T′i, s′i) is derived by perturbing the original theory and N is the total number of such perturbations per theory. Note that it is possible to generate different pairs by controlling the stochasticity of δ. The main goal of this task is to evaluate the consistency of the model’s predictions under minimal variations in the input theory.

Evaluation Protocol

We consider three main aspects for evaluating model performance in our study: (1) Entailment accuracy measures how accurately the model predicts the true statement entailment. (2) Proof accuracy measures how accurately the model predicts a valid proof for the statement. Following Saha et al. (2020); Tafjord et al. (2021), we use the strict metric for proof evaluation, i.e., for a match to count, the predicted proof should exactly match a gold proof and the entailment should be correctly predicted. (3) Consistency measures whether the model is consistent in its entailment and proof predictions under the different perturbation functions. For a theory-statement pair (T, s) and its corresponding equivalence set, consistency is defined as (1/N) Σi 1[M(T′i, s′i) = M(T, s)], where M denotes the model’s prediction. We compute the average consistency for both entailment and proof predictions on an equivalence set and further average across the dataset to report the consistency.
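As an illustration, here is a minimal sketch of the consistency computation, assuming the model's prediction M is wrapped as a callable predict(theory, statement) that returns a hashable prediction (an entailment label or a serialized proof); the names are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple

def consistency(predict: Callable[[str, str], Hashable],
                original: Tuple[str, str],
                equivalence_set: Sequence[Tuple[str, str]]) -> float:
    """Fraction of perturbed (theory, statement) pairs whose prediction matches
    the prediction on the original pair."""
    reference = predict(*original)
    matches = sum(predict(*pair) == reference for pair in equivalence_set)
    return matches / len(equivalence_set)
```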

3 The FaiRR Method

3.1 Approach Overview

As illustrated by the example in Figure 1, to reliably generate a proof graph through deductive reasoning, a model needs to generate multiple one-hop intermediate conclusions. This is a major limitation of models that use the theory to directly predict the proof (Figure 2 (a)), and it calls into question the trustworthiness of their reasoning process. Next, it is also intuitive that in order to faithfully generate these intermediate inferences, a model should first determine the proof (i.e., know the rules and facts to use) and then use them to infer the conclusion. That is, there is a causal relation from determining the proof to generating the conclusion. We note that ProofWriter (“Iter”) falls short in this aspect. As shown in Figure 2 (b), it first generates the conclusion and then the corresponding proof.

Motivated by these points, we propose our causal reasoning framework, which breaks the reasoning process into three desirable steps. As shown in Figure 2 (c), in our framework, first a rule r is selected using the rules and facts in the theory. Following that, some relevant facts are selected from the fact list based on the selected rule r; this step does not use the other rules in the theory. Finally, the selected rule and facts are jointly used to generate a new conclusion c. In this framework, the one-step proof is explicitly determined first via the selection steps, followed by the inference generation, making the proof a by-product of the whole process. In contrast, prior works learned to generate the proof along with the intermediate conclusion.

3.2 FaiRR Modules

At a high level, FaiRR is an iterative model in which the one-hop intermediate conclusions are generated step-by-step. To model our framework described in Sec. 3.1, we have four components in FaiRR as follows.

Rule Selector (RS)

The rule selector is a RoBERTa-based Liu et al. (2019) classification model that takes the concatenated statement, facts, and rules as input and selects a rule to be used for generating an intermediate conclusion in the current iterative step. It takes an input of the form

[CLS] s [SEP] f1 f2 … fN [SEP] r1 [SEP] r2 … [SEP] rK [SEP],

and generates a one-hot output vector by classifying the token embeddings of the [CLS] token and the [SEP] tokens in front of the rules via a linear classifier layer. Each classification is binary, but overall only one of the tokens has the positive class. Here s denotes the statement, f1 … fN are the facts concatenated with any intermediate conclusions generated in prior iterations, and rj denotes the j-th rule in a theory containing a total of K rules. An example input and output of the rule selector is shown in Figure 3. If a [SEP] token is selected, we select the rule sentence following the corresponding [SEP] token; otherwise, if the [CLS] token is selected, we stop the iteration. That is, the [CLS] selection acts as a stop signal for our iterative model. We note that it is possible to have more than one likely candidate rule, since multiple one-hop inferences may be possible for a given theory. Following Tafjord et al. (2021), we randomly select one of the possible candidate rules at each iteration.
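The following sketch illustrates the rule-selection step under stated assumptions: the exact input formatting and classification head of the released model may differ, and names such as select_rule and scorer are ours.

```python
from typing import List, Optional
import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)   # one score per candidate token

def select_rule(statement: str, facts: List[str], rules: List[str]) -> Optional[int]:
    sep = tokenizer.sep_token  # RoBERTa's </s> plays the role of [SEP]
    text = f"{statement} {sep} {' '.join(facts)} {sep} " + f" {sep} ".join(rules)
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]       # (seq_len, hidden_size)
    sep_positions = (enc["input_ids"][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]
    rule_seps = sep_positions[1:1 + len(rules)]            # separators directly preceding each rule
    # Candidates: the <s> ([CLS]-like) token as the stop signal, plus one token per rule.
    candidates = torch.cat([hidden[0:1], hidden[rule_seps]])
    choice = scorer(candidates).squeeze(-1).argmax().item()
    return None if choice == 0 else choice - 1             # None = stop; otherwise rule index
```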

Fact Selector (FS)

The fact selector is a RoBERTa-based Liu et al. (2019) token classification model that takes the statement, the rule selected by the rule selector, and the facts in the theory, and then predicts a set of candidate facts that can be used with the rule to generate an intermediate conclusion. It takes an input of the form [CLS] s [SEP] r [SEP] f1 [SEP] f2 … [SEP] fN, where s is the statement, r is the selected rule, and fi is the i-th fact in a theory containing N total facts. Note that the facts also include any previously generated intermediate conclusions. The output is generated by classifying each [SEP] token embedding in front of a fact using a linear layer, to determine whether the corresponding fact is selected. An example input and output for the fact selector is depicted in Figure 3. We note that some rules can reason over multiple facts jointly to generate a conclusion; an example of such a rule is “rule2” in Figure 1. Hence, this component has the ability to select multiple facts.
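A similar sketch for fact selection is shown below; the substantive difference from the rule selector is that every fact separator is scored independently (multi-label), so multiple facts can be chosen. Again, the formatting details and names are assumptions.

```python
from typing import List
import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
fact_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def select_facts(statement: str, rule: str, facts: List[str],
                 threshold: float = 0.5) -> List[int]:
    sep = tokenizer.sep_token
    # Only the selected rule is visible here, not the full rule list.
    text = f"{statement} {sep} {rule} {sep} " + f" {sep} ".join(facts)
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]
    sep_positions = (enc["input_ids"][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]
    fact_seps = sep_positions[1:1 + len(facts)]    # separators directly preceding each fact
    probs = torch.sigmoid(fact_head(hidden[fact_seps]).squeeze(-1))
    return [i for i, p in enumerate(probs.tolist()) if p > threshold]
```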

Figure 3: Overview of the components of FaiRR - The rule selector and fact selector are classification models, whereas the knowledge composer is a generation model. The input tokens used for classification by the selectors are highlighted. The rule selector decides to stop based on the output prediction of the [CLS] token (highlighted in green). Here, the selected rule and facts are used to generate the conclusion. Please refer to Section 3.2 for more details.

Knowledge Composer (KC)

The knowledge composer is a generative text-to-text transformer T5 Raffel et al. (2020) (T5-large) that can compose a set of facts and a rule to output a novel conclusion. The input to the model is the selected facts and rule concatenated together, and the output is the intermediate conclusion. An example input and output for knowledge composer is shown in Fig. 3.
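A minimal sketch of the composition step with an off-the-shelf T5 model follows; the prompt format (rule followed by the selected facts) is an assumption, not necessarily the exact training format used by the authors.

```python
from typing import List
from transformers import T5ForConditionalGeneration, T5TokenizerFast

t5_tokenizer = T5TokenizerFast.from_pretrained("t5-large")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-large")

def compose(rule: str, facts: List[str]) -> str:
    # Only the selected rule and facts are visible to the composer.
    prompt = rule + " " + " ".join(facts)
    input_ids = t5_tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = t5_model.generate(input_ids, max_new_tokens=32)
    return t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
```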

Solver

The final component is the solver, which operates after all iterations have finished (i.e., once the rule selector selects the [CLS] token indicating that the iterative inference generation process should stop). Similar to ProofWriter, our solver currently searches for the statement among the generated intermediate inferences (string matching). If found, it predicts that the statement is entailed by the theory. It also searches for the negation of the statement (following ProofWriter, we use a regex to add/remove “not”, which suffices for this dataset), and if that is found, it predicts that the statement is not entailed. If neither is present, it predicts “Unknown”, since it can neither prove nor disprove the statement. The proof graph is constructed by using the one-hop proofs generated from the selected rule and facts at each step. For example, in Figure 1, the red dotted boxes (one-hop proofs) are stitched together to assemble the complete proof. For cases where the entailment is “Unknown”, the proof returned is “None”, since no proof for the statement exists in the theory. We note that our solver is not a learnable module.
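Since the solver is deterministic, it can be sketched directly; the negate helper below is a hypothetical stand-in for the regex-based "not" handling mentioned above.

```python
import re
from typing import List

def negate(statement: str) -> str:
    # Toggle "not" after the verb, e.g. "Charlie is white." <-> "Charlie is not white."
    if " not " in statement:
        return statement.replace(" not ", " ", 1)
    return re.sub(r"\bis\b", "is not", statement, count=1)

def solve(statement: str, inferences: List[str]) -> str:
    """Return "True", "False", or "Unknown" for the statement given all generated
    intermediate inferences; the proof graph itself is assembled separately by
    stitching the stored one-hop (rule, facts) proofs."""
    if statement in inferences:
        return "True"
    if negate(statement) in inferences:
        return "False"
    return "Unknown"   # neither provable nor disprovable; the returned proof is "None"
```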

3.3 Training and Inference

Each component of our model (except the solver, which is deterministic) is trained separately. We use the same dataset as ProofWriter to train these models, but process it such that each model receives only the relevant inputs according to our causal framework. More concretely, suppose that for a given theory T, a possible intermediate inference c is obtained by using a rule r and a fact f. Then, a training instance of ProofWriter, which is a T5 Raffel et al. (2020) model, uses the full theory T as input and the conclusion c along with its one-step proof (r, f) as output. We process the same instance to generate three training instances, one each for the rule selector, fact selector, and knowledge composer: the rule selector instance takes the statement and the theory as input with r as the selection target; the fact selector instance takes the statement, the selected rule r, and the facts as input with f as the selection target; and the knowledge composer instance takes r and f as input with c as the generation target.

Both selector models additionally take the statement as input. The outputs of the rule and fact selectors are converted to class labels instead of text, since our selectors are classification models. We use a cross-entropy loss to train the rule selector and a binary cross-entropy loss to train the fact selector. The knowledge composer is trained with a language modeling loss.
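A minimal sketch of this instance conversion, with assumed field names, is given below.

```python
from typing import Dict, List, Tuple

def build_training_instances(statement: str, facts: List[str], rules: List[str],
                             gold_rule_idx: int, gold_fact_idxs: List[int],
                             gold_conclusion: str) -> Tuple[Dict, Dict, Dict]:
    rule_selector_ex = {
        "input": (statement, facts, rules),                 # full theory visible
        "label": gold_rule_idx,                             # rule to apply (or a stop label)
    }
    fact_selector_ex = {
        "input": (statement, rules[gold_rule_idx], facts),  # only the selected rule visible
        "labels": [int(i in gold_fact_idxs) for i in range(len(facts))],  # multi-hot facts
    }
    composer_ex = {
        "input": (rules[gold_rule_idx], [facts[i] for i in gold_fact_idxs]),
        "target": gold_conclusion,                          # free-form text (LM loss)
    }
    return rule_selector_ex, fact_selector_ex, composer_ex
```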

At inference time, the rule selector selects a rule to be used for generating one-step conclusions. Then, the fact selector selects some facts based on the selected rule, which are then collectively passed on to the knowledge composer to generate a conclusion. This three-step pipeline is run iteratively until the rule selector predicts a stop signal by selecting the [CLS] token, which exits the iteration. Once the iteration finishes, the solver uses the generated intermediate inferences to decide whether the statement is entailed, and generates a proof accordingly.
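Putting the pieces together, the iterative inference loop can be sketched as follows, with the three trained modules passed in as callables (names are illustrative; the max_steps cap is our addition for safety).

```python
from typing import Callable, List, Optional, Tuple

def fairr_inference(statement: str, facts: List[str], rules: List[str],
                    select_rule: Callable[[str, List[str], List[str]], Optional[int]],
                    select_facts: Callable[[str, str, List[str]], List[int]],
                    compose: Callable[[str, List[str]], str],
                    max_steps: int = 20) -> Tuple[List[str], List[tuple]]:
    facts = list(facts)          # intermediate conclusions get appended here
    proof_steps = []             # one-hop proofs, later stitched into the proof graph
    for _ in range(max_steps):
        rule_idx = select_rule(statement, facts, rules)
        if rule_idx is None:     # stop signal ([CLS] selected)
            break
        fact_idxs = select_facts(statement, rules[rule_idx], facts)
        conclusion = compose(rules[rule_idx], [facts[i] for i in fact_idxs])
        proof_steps.append((rule_idx, fact_idxs, conclusion))
        facts.append(conclusion)
    return facts, proof_steps    # handed to the deterministic solver
```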

Remark on Computational Complexity

A practical limitation of ProofWriter is that it performs an exhaustive forward search by enumerating all possible inferences from a given theory. This leads to redundant inferences being generated when proving a particular statement. Additionally, using a text-to-text transformer model adds to the problem, since such models are usually quite expensive to run at inference time. In FaiRR, we alleviate this by introducing two changes. First, our causal framework allows only the selected rule and facts as input to the knowledge composer, thus restricting the input length significantly. Second, adding the question (statement) to our selector inputs helps reduce the candidate space, because these models can learn to prioritize the selection based on relevance to both the question and the theory. This ensures that FaiRR does not perform an exhaustive forward search and prioritizes generating relevant inferences over others. Both these changes lead to an overall improvement in inference speed. We perform more quantitative analysis on this in Section 5.3.

4 Experimental Setup

Datasets

Following Tafjord et al. (2021); Clark et al. (2020), we use the D* datasets for our experiments. These are a set of datasets - namely D0, D1, D2, D3, D0-D3, and D5. The theories in these datasets are synthetically generated with increasing reasoning depths. For example, the D3 dataset contains statements that require at most 3-hop reasoning steps. The D0-D3 dataset contains all theories in D3 plus a portion of the D0-D2 training set theories. We also use the ParaRules dataset Clark et al. (2020), which contains around 2k theories expressed in paraphrased natural language.

Additionally, we generate three datasets that evaluate the robustness of the reasoning models as follows:


  • Subject robustness: Here, subjects in a theory are perturbed by using some out-of-distribution proper and common names. For example, in Figure 1, “Charlie” can be replaced with “Paul” which is not used in the D* datasets. We generate five new theories corresponding to each theory of the D3 dataset, by repeatedly perturbing all the proper and common names in the theory.

  • Attribute robustness: Here we sample out-of-distribution attributes. For example, “blue” in Figure 1 can be replaced with “soft”. As above, we generate five new theories for each theory of the D3 dataset.

  • Subject+Attribute robustness: This is a combination of subject and attribute robustness to study model performance when most of the training vocabulary is replaced by out-of-distribution words. Each theory has both novel subjects and attributes.

We include more details on the perturbation sets used in our experiments in Appendix B.

Baselines

We compare FaiRR with two variants of ProofWriter Tafjord et al. (2021): All-at-once (PW (“All”)) and Iterative (PW (“Iter”)), wherever applicable (the code to reproduce ProofWriter’s numbers is not publicly available; we either copy results directly from the paper or run our own inference on model checkpoints made available by the authors). The PW (“All”) model is trained to predict the entailment and generate the proof graph directly from the theory and statement in a single step. The PW (“Iter”) model generates one-step inferences and corresponding proofs iteratively, until all possible inferences are generated, and then stitches the proof graph similar to our method. If not mentioned otherwise, ProofWriter uses a T5-large Raffel et al. (2020) model. We omit comparisons with PRover Saha et al. (2020) since it was trained on a different dataset that adds specific constraints on the proof graph. Please refer to Appendix J for more details.

5 Experiment Results

We compare FaiRR with ProofWriter variants on three settings: generalization on D* datasets, robustness to perturbed theories, and efficiency in inference computation. We further conduct qualitative analysis to understand the inference errors.

Depth Entailment Accuracy Proof Accuracy
PW (“Iter”) FaiRR PW (“Iter”) FaiRR
N/A 99.7 99.6 99.7 99.6
0 100.0 100.0 100.0 100.0
1 99.9 99.7 99.9 99.5
2 99.7 98.9 99.4 97.2
3 99.7 96.6 99.1 95.3
All 99.8 99.2 99.7 98.8
Table 1: Comparison of FaiRR with ProofWriter (“Iter”) trained and tested on D0-D3. Baseline results are generated using the checkpoint provided by the authors. For more details please refer to Section 5.1.

5.1 Performance on Same Depth Reasoning

In this setting, we train and test both models on the D0-D3 dataset. Note that D0-D3 contains statements with reasoning depths up to 3. This compares the ability of the models to generalize to reasoning depths seen at training time. The results with increasing depths of reasoning are shown in Table 1. Here, depth “N/A” refers to statements that cannot be proven and hence do not have an exact proof depth associated with them. We observe that overall, both FaiRR and ProofWriter (“Iter”) perform comparably (last row with depth “All”). Further, we find that our model’s performance is lower at depths 2 and 3, indicating that our model tends to perform weaker with increasing depth. This happens mainly because the rule selector in FaiRR tends to incorrectly select the [CLS] token, indicating a stop signal, instead of generating more possible intermediate inferences. We discuss this further in Sections 5.3 and 5.4. Please refer to Appendix C for more results on unseen reasoning depths.

Robustness PW (“Iter”) FaiRR
EA PA C EA PA C
Subject 89.6 88.4 87.6 96.8 95.9 96.4
Attribute 97.8 97.4 97.4 96.7 95.6 96.5
Subject+Attribute 94.8 93.4 93.7 95.4 94.3 94.7
Average 94.1 93.1 92.9 96.3 95.3 95.9
Table 2: Comparison of FaiRR with ProofWriter (“Iter”) when trained on D0-D3 dataset and tested on different robustness datasets. EA, PA, and C refers to entailment accuracy, proof accuracy, and consistency, respectively. Please refer to Section 5.2 for more details.
Depth Entailment Accuracy Proof Accuracy
PW (“Iter”) FaiRR PW (“Iter”) FaiRR
N/A 98.9 99.3 98.9 99.3
0 99.9 100.0 99.9 100.0
1 79.1 96.0 78.8 95.7
2 76.6 93.4 73.4 91.4
3 72.7 89.8 67.8 85.7
All 89.6 96.8 88.4 95.9
Table 3: Comparison of FaiRR with ProofWriter (“Iter”) trained on D0-D3 and tested on subject robustness dataset. Baseline results are generated using the checkpoint provided by the authors. For more details please refer to Section 5.2.

5.2 Robustness to Perturbed Theories

In this section, we test the robustness of ProofWriter (“Iter”) and FaiRR on different perturbed theories. Since FaiRR focuses on making deductive reasoning more robust and faithful, performance on these robustness experiments constitutes the main results of our work. As described in Section 4, we test robustness on three different perturbations: subject, attribute, and subject+attribute. We compare the performance of both models after training on the D0-D3 dataset. The consolidated results are shown in Table 2 and depth-wise results for subject robustness are shown in Table 3. We report the entailment accuracy, proof accuracy, and consistency as defined in Section 2. Please refer to Appendix D for the depth-wise breakdown of all the datasets. We observe that on subject and subject+attribute robustness, our models are consistently better than ProofWriter, whereas on attribute robustness both models perform similarly. Further, we find that on average, FaiRR is both more accurate and more consistent than the baseline. From this, we conclude that our model relies less on spurious correlations based on the subject, while both models likely suffer from similar issues on attribute perturbations. Since ProofWriter uses the full theory to generate the intermediate conclusions and proofs, it has the capacity to exploit spurious patterns that can inflate performance. In contrast, our causal framework restricts this capacity by constraining the inputs to each component, as described in Section 3.1. Hence, these robustness evaluations demonstrate one of the prime benefits of our causal and modular approach.

5.3 Study on Inference Efficiency

Here we perform several analyses to evaluate the computational benefits of our method as described in Section 3.3. Inference efficiency is an important aspect of this problem for real-world scenarios where compute can be limited.

Relevance of generated inferences

Here, we study the relevance of the intermediate inferences generated by FaiRR and ProofWriter (“Iter”). Let G be the set of intermediate inferences required for generating the proof graph of the statement, and let P be the set of intermediate inferences actually generated by a model. Then, precision = |G ∩ P| / |P| and recall = |G ∩ P| / |G|.

In Figure 4, we plot the precision and recall for both FaiRR and ProofWriter (“Iter”) with increasing reasoning depths. We find that our model has near-perfect precision at all depths, whereas ProofWriter has low precision. This demonstrates that our model successfully prunes the candidate inference space and generates relevant candidate inferences almost perfectly. In contrast, with increasing depth our model’s recall drops from near-perfect to a noticeably lower value, whereas ProofWriter has perfect recall at all depths. While the drop is not drastic, it indicates that our model fails to generate some essential inferences at higher depths. This is mainly because our rule selector decides to stop early and does not generate further relevant inferences for some provable statements. Overall, we conclude that FaiRR generates inferences that are relevant to solving the instance, although at higher depths it can miss some relevant conclusions.
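A minimal sketch of this precision/recall computation, treating inferences as normalized strings, is shown below.

```python
from typing import Iterable, Tuple

def inference_precision_recall(required: Iterable[str],
                               generated: Iterable[str]) -> Tuple[float, float]:
    required, generated = set(required), set(generated)
    overlap = required & generated
    precision = len(overlap) / len(generated) if generated else 1.0
    recall = len(overlap) / len(required) if required else 1.0
    return precision, recall
```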

Figure 4: Comparison of ProofWriter (“Iter”) and FaiRR on (a) precision and (b) recall of generated inferences with increasing reasoning depths.

Performance under inference budget constraints

We analyze the performance of FaiRR and ProofWriter under a fixed inference budget constraint by restricting the total number of conclusions that can be generated. We perform this analysis for different reasoning depths and depict the results in Figure 5. We observe that FaiRR consistently outperforms ProofWriter on lower budgets. This shows that FaiRR performs a prioritized generation of conclusions that are relevant to the statement, which can be useful in scenarios with limited inference budgets. See Appendix G for more comparisons.

Figure 5: Depth-wise comparison of ProofWriter (“Iter”) and FaiRR under limited inference budgets, shown for (a) depth 1 and (b) depth 3. Please refer to Section 5.3 for details.

Inference runtime analysis

We next compare the time taken by both models to solve the complete D5 dev set. Although FaiRR has three separate modules that run sequentially, it is up to three times faster than ProofWriter (“Iter”) at inference time. We attribute this to the reduced inference candidate search space due to question augmentation, and the smaller input size to the T5 component (refer to Section 3.3 for details). Please refer to Appendix H for more details.

5.4 Error Analysis

We further analyze the errors made by FaiRR and ProofWriter (“Iter”) on 50 randomly sampled errors for each model, drawn from the D0-D3 and subject robustness dev splits. We manually inspect the predicted inferences and compare them with the gold proofs to classify the failures. The errors are broadly categorized as follows:

Early stop errors:

This is the most frequent error type for both models, accounting for the largest share of errors in both FaiRR and ProofWriter. It occurs when a model incorrectly generates the stop signal and fails to generate all the inferences required to prove a statement. We find that our model makes the majority of its mistakes due to early stopping. This can potentially be fixed by improving the rule selector architecture to better model the stopping criterion.

Wrong inference:

This is the second error type, where the inferred conclusion is incorrect given the predicted proof. It accounts for a smaller share of errors in both FaiRR and ProofWriter. We observe that our knowledge composer makes fewer errors on average compared to the ProofWriter generative model.

Other generation errors:

ProofWriter makes a number of errors where the generated output does not make sense. For example, it can hallucinate facts that are not present in the theory. Such errors are not interpretable and call into question the model’s inner workings. FaiRR shows no such errors, since the proofs are always interpretable in our model due to the causal framework.

Overall, we find that the errors made by FaiRR are more interpretable than those made by ProofWriter, since we can pinpoint which module is at fault, whereas in ProofWriter it is sometimes hard to trace the source of an error. This also makes our framework easier to debug, since individual components can potentially be fixed with techniques like data augmentation. Please refer to Appendix I for more discussion and examples of errors.

Input Output
rule: If someone is blue then they are quiet. fact: Chris is blue. Chris is quiet.
rule: If someone is blue then they are quiet. fact: Chris is blue. [+] fact: Steve is blue. Dave is quiet. [error]
rule: If someone is blue then they are quiet. [+] rule: Quiet people are cold. fact: Chris is blue. fact: Steve is blue. [+] fact: Chris is white. Dave is quiet. [error]
Table 4: Examples of inferences made by ProofWriter. Sentences marked [+] are incrementally added to the theory, and outputs marked [error] are incorrect; the proof predicted by the model is omitted here. Refer to Section 5.5 for more details.

5.5 ProofWriter Input Ablation

A key goal of FaiRR is to explicitly ensure causality from the rule/fact selection step (proof generation) to the reasoning step (intermediate inference generation). This is essential for a reasoning method that uses forward chaining to solve a deductive reasoning task (forward chaining is the repeated application of modus ponens Hinkelmann (2004), which requires at least two premises to logically conclude an inference). To understand whether ProofWriter, which uses forward chaining, implicitly performs this “select-then-reason” within the model, we perform the following case study: we sample theories from our subject perturbation dataset on which ProofWriter made errors, and manually evaluate the model on inputs with all irrelevant rules/facts deleted. Next, we sequentially add back the deleted rules/facts to see if the output remains valid. As shown in Table 4, ProofWriter generates a correct inference for the first row, which uses just the essential part of the theory required to generate the conclusion, and starts making errors as more sentences are included. Some more examples are shown in Table 16 in the Appendix. This shows that internally ProofWriter is unable to faithfully perform the “select-then-reason” steps for larger theories. In contrast, FaiRR explicitly separates these steps, leading to a faithful reasoning model.

6 Related Works

Reasoning in Text

Reasoning in text is a well studied problem in NLP. Natural Language Inference (NLI) Dagan et al. (2006) is one of the most prominent tasks that require reasoning over text to answer if a statement is entailed, contradicted, or neutral, given a hypothesis. More recently, datasets like HotpotQA Yang et al. (2018), bAbI Weston et al. (2016), QuaRTz Tafjord et al. (2019), ROPES Lin et al. (2019), CLUTRR Sinha et al. (2019), etc., have studied different aspects of reasoning over textual inputs. These tasks usually require implicit reasoning, where the model needs to internally infer the rules required to solve the task. In contrast, RuleTaker Clark et al. (2020) deals with explicit reasoning (also known as deductive reasoning).

Proof Generation

Recently, some works have addressed the problem of proof generation from an NL-based theory. PRover Saha et al. (2020) trains a RoBERTa-based model that predicts the nodes and edges of the proof graph. ProofWriter Tafjord et al. (2021) is a T5-based Raffel et al. (2020) model that iteratively generates one-hop conclusions and proofs from a theory. Another work, MultiPRover Saha et al. (2021), generates multiple possible proofs for a statement. While we study the same proof generation problem as these works, we develop a more faithful and robust model by designing a modular system for proof generation.

Formal Reasoning

There are some prior works that try to solve the problem of entailment prediction by first parsing the formal language from text. Neural Theorem Prover Rocktäschel and Riedel (2017); Weber et al. (2019) uses neural networks to parse the formal logic from natural language and then reason over it. While this approach is more symbolic, it can lead to many challenges during parsing Kamath and Das (2019). The proof generation setting considered here bypasses this step and directly reasons over the given natural language text, making it more useful in downstream applications.

Model Interpretability

With the advent of pre-trained language models (BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), etc.), there has been an increasing trend of solving various reasoning tasks with high accuracy. Faithfulness of such models Jacovi and Goldberg (2020) aims to understand whether the models are actually learning to solve the task or rather depending on some shortcut patterns. Saliency-based explanations Sundararajan et al. (2017); Lundberg and Lee (2017); Murdoch et al. (2018); Sanyal and Ren (2021) mainly focus on identifying the important phrases in the input text that helped the model in solving a task. In contrast, the task of proof generation focuses on generating a deductive chain of reasoning from the given theory to the concluded statement. Thus, proof chains are easier to understand for end users, making them more useful for debugging systematic model errors.

Causal Reasoning

The study of causality and causal reasoning models Pearl (2000, 2004); Schölkopf (2019) has been prevalent in machine learning. It has been applied in various domains such as algorithmic fairness Loftus et al. (2018), gender bias mitigation Vig et al. (2020), robustness to spurious correlations Bühlmann (2020); Veitch et al. (2021), counterfactual explanations Feder et al. (2021b), etc. Causality in NLP is particularly important for learning models that go beyond exploiting correlations and for improving their overall faithfulness Feder et al. (2021a).

7 Conclusion

In this paper, we proposed FaiRR, a faithful and robust deductive reasoning model based on three modular components: rule selection, fact selection, and knowledge composition. FaiRR ensures causality from proof generation to entailment prediction by design. We established the effectiveness of our approach through experiments testing robustness to language variations and by demonstrating the interpretability of the errors made by our model. We also showed that FaiRR is faster and more precise at deductive reasoning than prior baselines.

Acknowledgments

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, the DARPA MCS program under Contract No. N660011924033, the Defense Advanced Research Projects Agency with award W911NF-19-20271, NSF IIS 2048211, NSF SMA 1829268, and gift awards from Google, Amazon, JP Morgan and Sony. We would like to thank all the collaborators in USC INK research lab for their constructive feedback on the work.

References

  • P. Bühlmann (2020) Invariance, causality and robustness. Statistical Science 35 (3), pp. 404–426. Cited by: §6.
  • P. Clark, O. Tafjord, and K. Richardson (2020) Transformers as soft reasoners over language. In IJCAI’20, Cited by: Appendix A, §1, §4, §6.
  • I. Dagan, O. Glickman, and B. Magnini (2006) The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d’Alché-Buc (Eds.), Berlin, Heidelberg, pp. 177–190. Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §6.
  • A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood-Doughty, J. Eisenstein, J. Grimmer, R. Reichart, M. E. Roberts, B. M. Stewart, V. Veitch, and D. Yang (2021a) Causal inference in natural language processing: estimation, prediction, interpretation and beyond. CoRR abs/2109.00725. External Links: Link Cited by: §6.
  • A. Feder, N. Oved, U. Shalit, and R. Reichart (2021b) Causalm: causal model explanation through counterfactual language models. Computational Linguistics 47 (2), pp. 333–386. Cited by: §6.
  • K. Hinkelmann (2004) Forward chaining vs. backward chaining. University of Applied Sciences Northwestern Switzerland, School of Business. Cited by: footnote 4.
  • A. Jacovi and Y. Goldberg (2020) Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4198–4205. External Links: Link, Document Cited by: §6.
  • A. Kamath and R. Das (2019) A survey on semantic parsing. In Automated Knowledge Base Construction (AKBC), External Links: Link Cited by: §6.
  • K. Lin, O. Tafjord, P. Clark, and M. Gardner (2019) Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China, pp. 58–62. External Links: Link, Document Cited by: §6.
  • Z. C. Lipton (2018) The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery.. Queue (3). External Links: ISSN 1542-7730 Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. ArXiv preprint abs/1907.11692. External Links: Link Cited by: Appendix K, §1, §3.2, §3.2, §6.
  • J. R. Loftus, C. Russell, M. J. Kusner, and R. Silva (2018) Causal reasoning for algorithmic fairness. CoRR abs/1805.05859. External Links: Link Cited by: §6.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774. External Links: Link Cited by: §6.
  • J. W. McCarthy (1959) Programs with common sense. In Proc. Tedding Conf. on the Mechanization of Thought Processes, pp. 75–91. Cited by: §1.
  • W. J. Murdoch, P. J. Liu, and B. Yu (2018) Beyond word importance: contextual decomposition to extract interactions from lstms. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §6.
  • A. Newell and H. A. Simon (1956) The logic theory machine-a complex information processing system. IRE Trans. Information Theory 2, pp. 61–79. Cited by: §1.
  • J. Pearl (2004) Graphical models for probabilistic and causal reasoning. pp. 70–1. Cited by: §6.
  • J. Pearl (2000) Causality: models, reasoning and inference. Cambridge University Press. Cited by: §6.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: Appendix K, §1, §3.2, §3.3, §4, §6.
  • T. Rocktäschel and S. Riedel (2017) End-to-end differentiable proving. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §6.
  • S. Saha, S. Ghosh, S. Srivastava, and M. Bansal (2020) PRover: proof generation for interpretable reasoning over rules. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 122–136. External Links: Link, Document Cited by: Appendix J, §1, §2, §6.
  • S. Saha, P. Yadav, and M. Bansal (2021) MultiPRover: generating multiple proofs for improved interpretability in rule reasoning. In NAACL, Cited by: Appendix J, §6.
  • S. Sanyal and X. Ren (2021) Discretized integrated gradients for explaining language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 10285–10299. External Links: Link, Document Cited by: §6.
  • B. Schölkopf (2019) Causality for machine learning. CoRR abs/1911.10500. External Links: Link Cited by: §6.
  • K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton (2019) CLUTRR: a diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4506–4515. External Links: Link, Document Cited by: §6.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 3319–3328. External Links: Link Cited by: §6.
  • O. Tafjord, B. Dalvi, and P. Clark (2021) ProofWriter: generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 3621–3634. External Links: Link, Document Cited by: Table 5, Appendix A, Appendix J, Table 7, Appendix F, §1, §2, §3.2, §4, §4, §6.
  • O. Tafjord, M. Gardner, K. Lin, and P. Clark (2019) QuaRTz: an open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5941–5946. External Links: Link, Document Cited by: §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1.
  • V. Veitch, A. D’Amour, S. Yadlowsky, and J. Eisenstein (2021) Counterfactual invariance to spurious correlations: why and how to pass stress tests. CoRR abs/2106.00545. External Links: Link Cited by: §6.
  • J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. M. Shieber (2020) Causal mediation analysis for interpreting neural NLP: the case of gender bias. CoRR abs/2004.12265. External Links: Link Cited by: §6.
  • L. Weber, P. Minervini, J. Münchmeyer, U. Leser, and T. Rocktäschel (2019) NLProlog: reasoning with weak unification for question answering in natural language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6151–6161. External Links: Link, Document Cited by: §6.
  • J. Weston, A. Bordes, S. Chopra, and T. Mikolov (2016) Towards ai-complete question answering: A set of prerequisite toy tasks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §6.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Document, Link Cited by: Appendix K.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600. Cited by: §6.

Appendix A Depth Dataset Details

For training and evaluation of FaiRR and ProofWriter, we use the D* datasets and the ParaRules dataset Clark et al. (2020). The statistics of these datasets are shown in Table 5, which includes the number of theories, the total number of questions across all theories, and the number of conclusions per theory. The statistics are broken down by split. We use the same train/dev/test splits as provided in the original datasets Clark et al. (2020); Tafjord et al. (2021). All the dataset sources are properly cited and used according to their release licenses.

Dataset Split Number of Theories Number of Questions Number of Conclusions per Theory (min/mean/max)
D0 train 18889 69906 0/0.81/18
D0 dev 2700 10070 0/0.81/14
D0 test 5389 20024 0/0.8/12
D1 train 9008 69616 1/1.69/13
D1 dev 1318 10188 1/1.7/14
D1 test 2607 20210 1/1.7/12
D2 train 6330 70076 2/3.15/14
D2 dev 909 10094 2/3.09/12
D2 test 1794 19840 2/3.11/14
D3 train 4816 69388 3/4.81/16
D3 dev 719 10302 3/4.73/14
D3 test 1405 20346 3/4.72/15
D5 train 3322 69810 5/9.12/21
D5 dev 482 10190 5/9.13/21
D5 test 948 20030 5/9.08/21
ParaRules train 1681 28010 3/4.25/14
ParaRules dev 240 4004 3/4.53/13
ParaRules test 482 8008 3/4.24/11
Table 5: Statistics of datasets introduced by Tafjord et al., 2021 with the number of theories, questions and conclusions per theory for all three splits of each dataset. The splits are kept the same as the original dataset. Please refer to Appendix A for more details.
Dataset Split Number of Theories Number of Questions Number of Conclusions per Theory (min/mean/max)
Subject train 28896 416328 3/4.81/16
Subject dev 4314 61812 3/4.73/14
Subject test 8430 122076 3/4.72/15
Attribute train 28866 415848 3/4.81/16
Attribute dev 4314 61812 3/4.73/14
Attribute test 8415 121836 3/4.73/15
Subject+Attribute train 28866 415848 3/4.81/16
Subject+Attribute dev 4314 61812 3/4.73/14
Subject+Attribute test 8415 121836 3/4.73/15
Table 6: Statistics of datasets introduced in this paper with the number of theories, questions and conclusions per theory for all three splits of each dataset. We use these datasets to quantify the robustness of FaiRR and compare it with baselines. Please refer to Appendix B for more details.

Appendix B Robustness Dataset Details

The robustness dataset is created by replacing all subjects (attributes, subject+attributes) in the D3 dataset with unseen subjects (attributes, subject+attributes) to create the subject (attribute, subject+attributes) robustness set. For this, we first curate new sets of subjects and attributes to be used as a global pool to sample from while replacing existing subjects and attributes from the theory. These sets are detailed below:

Subject proper name pool:

{‘George’, ‘Paul’, ‘Ronald’, ‘Emma’, ‘Magnus’, ‘Timothy’, ‘Chris’, ‘Molly’, ‘Diana’, ‘Joseph’, ‘Becky’, ‘Kurt’, ‘Ivan’, ‘Steve’, ‘Laura’, ‘Oliver’, ‘Adam’, ‘Larry’}

Subject common name pool:

{‘mother’, ‘father’, ‘baby’, ‘child’, ‘toddler’, ‘teenager’, ‘grandmother’, ‘student’, ‘teacher’, ‘alligator’, ‘cricket’, ‘bird’, ‘wolf’, ‘giraffe’, ‘dinosaur’, ‘thief’, ‘soldier’, ‘officer’, ‘artist’, ‘shopkeeper’, ‘caretaker’, ‘janitor’, ‘minister’, ‘salesman’, ‘saleswoman’, ‘runner’, ‘racer’, ‘painter’, ‘dresser’, ‘shoplifter’}

Attribute pool:

{‘maroon’, ‘brown’, ‘black’, ‘orange’, ‘cordial’, ‘friendly’, ‘adorable’, ‘old’, ‘soft’, ‘violent’, ‘intelligent’, ‘square’, ‘warm’, ‘large’, ‘cylindrical’, ‘spherical’, ‘tiny’, ‘microscopic’, ‘brilliant’, ‘noisy’, ‘playful’, ‘tender’, ‘gracious’, ‘patient’, ‘funny’, ‘hilarious’, ‘thorny’, ‘sensitive’, ‘diplomatic’, ‘thoughtful’}

Then, for each theory in the D3 dataset, we replace all the subjects in the theory with randomly sampled subjects (without replacement) from the candidate set to create a perturbed theory. We perform this replacement operation to generate five different perturbed theories. These perturbed theories form an equivalence set. Note that the only change across theories in an equivalence set is that the subjects are replaced by different randomly sampled subjects. For example, “cat” in the original theory might be replaced by “child” in one perturbation, and by “teacher” in another. We follow the same procedure to create the attribute and subject+attribute robustness sets.
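A minimal sketch of the subject-replacement procedure is shown below; it assumes the original subjects of a theory are known, uses a small subset of the proper-name pool above, and applies naive string replacement (the actual dataset-generation code may differ).

```python
import random
from typing import Dict, List

# A small subset of the proper-name pool listed above.
PROPER_NAME_POOL = ["George", "Paul", "Ronald", "Emma", "Magnus", "Timothy", "Chris", "Molly"]

def perturb_subjects(sentences: List[str], subjects: List[str],
                     pool: List[str] = PROPER_NAME_POOL, seed: int = 0) -> List[str]:
    """Replace every original subject with a distinct unseen name, applied
    consistently across all sentences of the theory."""
    rng = random.Random(seed)
    replacements: Dict[str, str] = dict(zip(subjects, rng.sample(pool, len(subjects))))
    perturbed = []
    for sentence in sentences:
        for old, new in replacements.items():
            sentence = sentence.replace(old, new)   # naive string replacement for the sketch
        perturbed.append(sentence)
    return perturbed

# Using five different seeds yields the five perturbed theories of an equivalence set.
```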

The statistics for these robustness datasets are shown in Table 6 which includes the dataset name depicting the perturbation type (subject, attribute or subject+attribute), number of theories, the total number of questions across all theories, and the number of conclusions per theory. Please note that one theory has multiple questions in general, and it is possible to have conclusions that are not a part of these questions, but can be deduced from the given theory. Each split of the original dataset is perturbed separately as described above, to create the new datasets.

Depth Entailment Accuracy Proof Accuracy
PW (“All”) PW (“Iter”) FaiRR PW (“All”) PW (“Iter”) FaiRR
N/A 97.4 99.2 99.4 97.4 99.2 99.4
0 100.0 100.0 100.0 100.0 100.0 100.0
1 99.9 99.1 99.5 99.3 97.5 99.2
2 99.7 98.9 98.5 97.6 96.4 96.1
3 99.7 98.4 93.4 91.2 95.5 85.5
4 99.5 97.5 88.8 46.9 93.4 77.4
5 98.9 96.5 79.2 24.4 82.3 68.1
All 98.7 98.8 95.9 85.6 96.4 92.7
Table 7: D5 dataset depth-wise performance comparison of FaiRR trained on D0-D3 with ProofWriter (“All”) and ProofWriter (“Iter”) trained on D3 and D0-D3 respectively. Baseline results are copied from Tafjord et al. (2021). Refer to Section 5.1 and Appendix C for more details.

Appendix C Generalization to Reasoning Depths

In this section, we experiment with a setting where models are trained on depths less than or equal to 3 (i.e., d ≤ 3) and tested on the D5 dataset, which contains statements that require reasoning up to depth 5 (i.e., d ≤ 5). Here, we test the generalization of the models to reasoning depths that are unseen at training time. These results are shown in Table 7. From this table, we observe that overall our model performs significantly better than ProofWriter (“All”) on proof accuracy (92.7 vs. 85.6 overall), but has a lower performance compared to ProofWriter (“Iter”) (92.7 vs. 96.4). This shows that compared to ProofWriter (“Iter”), our model is weaker at generalizing to unseen reasoning depths. This happens mainly because our rule selector tends to stop the inference iterations early, which means some essential inferences are not generated by the model. This leads to lower performance with increasing reasoning depths.

But we make another interesting observation here. The drops in entailment and proof accuracy with increasing depths are similar for FaiRR. For instance, considering the performance drop between d = 4 and d = 5 in Table 7, FaiRR drops roughly 9% in both entailment and proof accuracy. In contrast, ProofWriter (“All”) and ProofWriter (“Iter”) drop approximately 22% and 11%, respectively, in proof accuracy for a drop of only about 1% in entailment accuracy. This raises some concern about the causality of the proof generation process used for entailment prediction in these models, since the answer prediction and proof generation do not appear to depend on the same reasoning path. In contrast, our causal framework grounds the entailment prediction in the proofs, which leads to more consistent performance variations in FaiRR.

Depth Entailment Accuracy Proof Accuracy
PW (“Iter”) FaiRR PW (“Iter”) FaiRR
N/A 98.9 99.3 98.9 99.3
0 99.9 100.0 99.9 100.0
1 79.1 96.0 78.8 95.7
2 76.6 93.4 73.4 91.4
3 72.7 89.8 67.8 85.7
All 89.6 96.8 88.4 95.9
Table 8: Comparison of FaiRR with ProofWriter (“Iter”) trained on D0-D3 and tested on subject robustness dataset. Baseline results are generated using the checkpoint provided by the authors. For more details, please refer to Appendix D.
Depth Entailment Accuracy Proof Accuracy
PW (“Iter”) FaiRR PW (“Iter”) FaiRR
N/A 99.6 99.1 99.6 99.1
0 100.0 100.0 100.0 100.0
1 96.4 96.0 96.2 95.6
2 95.1 93.7 94.1 91.3
3 93.9 89.5 92.1 84.1
All 97.8 96.7 97.4 95.6
Table 9: Comparison of FaiRR with ProofWriter (“Iter”) trained on D0-D3 and tested on attribute robustness dataset. Baseline results are generated using the checkpoint provided by the authors. For more details, please refer to Appendix D.

Appendix D Robustness to Perturbed Theories

Here, we show the detailed depth-wise performance of FaiRR and ProofWriter (“Iter”) trained on the D0-D3 dataset and evaluated on the different robustness datasets described in Section 4. The results for subject, attribute, and subject+attribute robustness evaluations are shown in Tables 8, 9, and 10, respectively. We observe that ProofWriter (“Iter”) performs significantly worse than FaiRR on subject robustness. The results on subject+attribute robustness are mostly comparable, while on attribute robustness our model performs worse. The drops in performance show that both models are sensitive to attributes in the theory to varying degrees. However, the strong sensitivity of ProofWriter (“Iter”) to subject perturbations is concerning, since it suggests that the causality of the model’s reasoning process is compromised by spurious correlations learned from the subjects.

In another setting, we train different components of our model on the robustness data and check if that leads to some performance gains. These results are reported in Table 11. We find that it is indeed possible to improve the performance of individual components of our model by robust data augmentation. This also indicates that our individual components are flexible to intervention by data augmentation. Such abilities are lacking in ProofWriter.

Depth Entailment Accuracy Proof Accuracy
PW (“Iter”) FaiRR PW (“Iter”) FaiRR
N/A 98.6 98.9 98.6 98.9
0 100.0 100.0 100.0 100.0
1 91.3 94.0 90.9 93.6
2 89.2 90.3 85.9 87.8
3 85.9 85.9 80.0 80.5
All 94.8 95.4 93.4 94.3
Table 10: Comparison of FaiRR with ProofWriter (“Iter”) trained on D0-D3 and tested on subject+attribute robustness dataset. Baseline results are generated using the checkpoint provided by the authors. For more details, please refer to Appendix D.
Trained module               Subj Perturbation       Attr Perturbation
                             EA       PA             EA       PA
Base (trained on D0-D3)      96.8     95.9           96.7     95.6
Rule selector (RS)           97.3     96.7           96.1     95.3
Fact selector (FS)           96.7     95.8           96.6     94.6
Knowledge composer (KC)      98.5     97.6           98.2     97.1
RS + FS + KC                 99.1     98.5           98.7     97.2
Table 11: Comparison of variants of FaiRR where different components (RS, FS, KC) and their combinations are trained and tested on subject robustness datasets. EA and PA refer to entailment accuracy and proof accuracy, respectively. Please refer to Appendix D for more details.
Figure 6: Proof accuracy of FaiRR when tested on ParaRules while using a limited amount of ParaRules data along with D0-D3 for training. See Appendix E for more details.

Appendix E Generalization to paraphrased theories

Here we test the ability of our model to generalize to the unseen language in ParaRules using limited training supervision. We first train our model on the D0-D3 dataset and test it on the ParaRules dataset; this is a zero-shot evaluation on an unseen language form. In Figure 6 we observe that, as expected, performance is significantly worse in this setting. We also evaluated a checkpoint of ProofWriter (“Iter”) trained on D0-D3, which achieves similar entailment accuracy (data-augmented training results for ProofWriter are not reported in the figure since the training code is not available). Next, we gradually add portions of ParaRules, along with the D0-D3 data, to the training set. We find that FaiRR quickly achieves reasonable performance even with a small amount of additional data. This shows that our modularized approach is also efficient at adapting to unseen theories with limited data supervision. For more comparisons with models trained on ParaRules, please refer to Appendix F.
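As an illustration, the snippet below sketches one way such limited supervision could be mixed into the training set; the file paths and JSONL layout are assumptions for this example, not the actual data format used in the paper.

```python
import json
import random

def load_jsonl(path):
    """Load one example per line; the JSONL layout is an assumption here."""
    with open(path) as f:
        return [json.loads(line) for line in f]

d0_d3 = load_jsonl("data/d0_d3_train.jsonl")          # hypothetical path
pararules = load_jsonl("data/pararules_train.jsonl")  # hypothetical path

random.seed(0)
fraction = 0.1  # mix in a small fraction of ParaRules as limited supervision
subset = random.sample(pararules, int(fraction * len(pararules)))

train_set = d0_d3 + subset
random.shuffle(train_set)
```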

Depth   Entailment Accuracy                  Proof Accuracy
        PW (“All”) [T5-11B]   FaiRR          PW (“All”) [T5-11B]   FaiRR
0       99.9                  100.0          99.9                  100.0
1       99.3                  99.6           99.3                  99.6
2       98.3                  97.6           97.7                  97.4
3       98.2                  95.4           96.5                  95.1
4       91.5                  91.6           83.1                  91.6
All     99.1                  98.7           98.5                  98.6
Table 12: Comparison of FaiRR with ProofWriter (“All”) [T5-11B] when trained on D3+ParaRules and tested on ParaRules. Results for ProofWriter (“All”) [T5-11B] are copied from the paper. Please refer to Appendix F for more details.

Appendix F Results on ParaRules training

Following Tafjord et al. (2021), we compare the performance of ProofWriter (“All”) and FaiRR on the ParaRules dataset when trained on a combined partition of the D3 and ParaRules training sets. The ParaRules theories contain complex linguistic expressions that are more realistic than those in the D* datasets, making it a more challenging dataset. These results are shown in Table 12, with a reasoning depth breakdown as before. We note that numbers for ProofWriter (“Iter”) are not reported in the paper, and no trained checkpoint is available either, so we omit it from our comparisons. Also, the reported results for ProofWriter (“All”) are from a T5-11B model, while ours is a T5-large model. Here, we see that our model performs better at higher depths than the baseline, which suggests that FaiRR is better at handling paraphrases.

Figure 7: Depth-wise comparison of ProofWriter (“Iter”) and FaiRR (both trained on the D0-D3 dataset) under limited inference budgets, with panels for (a) Depth 1, (b) Depth 3, and (c) Depth 5. Please refer to Appendix G for details.

Appendix G Inference Budget Analysis

In the inference budget analysis, we compare the performance of FaiRR and ProofWriter under an inference budget constraint, i.e., we restrict the total number of intermediate conclusions that both models are allowed to produce. We perform this analysis on three depth datasets (D1, D3, and D5) and cap the number of allowed inferences. We ensure that the budget is at least equal to the depth of the statement under consideration, since proving a statement requires generating at least as many inferences as its depth. From Figure 7 we observe that FaiRR consistently outperforms ProofWriter at lower budgets across all depths. Only when the budget increases to 10 does ProofWriter match or occasionally outperform our model. This analysis demonstrates that FaiRR prioritizes generating conclusions that are relevant to the statement, which can be useful in scenarios with limited inference budgets.
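A minimal sketch of such a budget-constrained inference loop is given below; select_rule, select_facts, and compose stand in for the three FaiRR components and are placeholders, not the actual model interfaces.

```python
# Budget-constrained iterative inference: each loop iteration spends one
# unit of the inference budget on generating a single intermediate conclusion.
def prove_with_budget(theory, statement, budget, select_rule, select_facts, compose):
    inferences = []
    for _ in range(budget):
        rule = select_rule(theory, statement)
        if rule is None:               # the rule selector decides to stop
            break
        facts = select_facts(theory, rule, statement)
        new_fact = compose(rule, facts)
        theory = theory + [new_fact]   # the new conclusion becomes part of the theory
        inferences.append(new_fact)
        if new_fact == statement:      # statement proved within the budget
            return True, inferences
    return False, inferences           # budget exhausted or selector stopped early
```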

Depth   ProofWriter (“Iter”)   FaiRR
0       1.01                   0.12
1       1.01                   0.20
2       1.00                   0.28
3       1.01                   0.36
4       1.01                   0.46
Avg     1.01                   0.28
Table 13: Evaluation runtime (in hours) of FaiRR and ProofWriter (“Iter”). Please refer to Appendix H for more details.

Appendix H Runtime Analysis

For inference runtime analysis, we time the evaluation of both FaiRR and ProofWriter (“Iter”) on the D5 dev set. Note that the D5 dataset contains statements that require at most five reasoning steps to generate an answer. The runtimes for both methods are shown in Table 13. These results were obtained by running the inference algorithm on NVIDIA GeForce RTX 2080 Ti GPUs for both models. We observe that ProofWriter (“Iter”) has an almost constant runtime since it always generates all possible inferences for a theory. In contrast, our runtime increases almost linearly with reasoning depth. On average, FaiRR is roughly 3.6 times faster at inference than ProofWriter (“Iter”) (1.01 vs. 0.28 hours on average).
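For reference, a minimal sketch of how such wall-clock measurements could be collected is shown below; run_inference is a hypothetical placeholder for a model's full iterative inference procedure, not the actual evaluation harness.

```python
import time

def timed_eval(run_inference, dev_examples):
    """Measure total wall-clock time (in hours) to run inference over a
    development set; run_inference is a hypothetical per-example callable."""
    start = time.perf_counter()
    for example in dev_examples:
        run_inference(example)
    return (time.perf_counter() - start) / 3600.0
```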

Error Type    Input (format: complete theory)    Output (format: $answer$ = conclusion ; $proof$ = # rule & facts)
Wrong inference $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: Cold people are red. sent2: If someone is quiet then they are cold. sent3: Smart, blue people are quiet. sent4: If Paul is blue then Paul is furry. sent5: If someone is blue then they are quiet. sent6: If Steve is furry and Steve is smart then Steve is white. sent7: If Chris is red and Chris is furry then Chris is blue. sent8: Quiet people are cold. sent9: Chris is blue. sent10: Chris is white. sent11: Ronald is red. sent12: Paul is quiet. sent13: Paul is smart. sent14: Steve is blue. sent15: Steve is furry. sent16: Steve is red. sent17: Steve is smart. sent18: Steve is white. $answer$ = Dave is quiet. ; $proof$ = # sent5 sent9
Wrong inference $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: Cold people are red. sent2: If someone is quiet then they are cold. sent3: Smart, blue people are quiet. sent4: If Paul is blue then Paul is furry. sent5: If someone is blue then they are quiet. sent6: If Steve is furry and Steve is smart then Steve is white. sent7: If Chris is red and Chris is furry then Chris is blue. sent8: Quiet people are cold. sent9: Chris is blue. sent10: Chris is white. sent11: Ronald is red. sent12: Paul is quiet. sent13: Paul is smart. sent14: Steve is blue. sent15: Steve is furry. sent16: Steve is red. sent17: Steve is smart. sent18: Steve is white. sent19: Dave is quiet. $answer$ = Dave is not a cat. ; $proof$ = # sent2 sent19
Generation error - output format violated $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone likes the grandmother then the grandmother eats the artist. sent2: If someone eats the artist then the artist is young. sent3: If the artist eats the grandmother then the artist is cold. sent4: If someone eats the artist and the artist eats the grandmother then the grandmother eats the artist. sent5: If someone is red and big then they like the grandmother. sent6: If someone visits the grandmother then they are red. sent7: If someone is rough then they eat the grandmother. sent8: If someone likes the artist then they visit the grandmother. sent9: The artist likes the grandmother. sent10: The grandmother eats the artist. sent11: The grandmother is big. sent12: The grandmother is cold. sent13: The grandmother is young. sent14: The grandmother likes the artist. sent15: The grandmother visits the artist. sent16: The artist is young. $answer$ = The red sent6 sent15
Generation error - hallucinating facts $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: All young things are smart. sent2: All smart, young things are nice. sent3: If something is young and nice then it is quiet. sent4: Becky is big. sent5: Becky is green. sent6: Becky is young. sent7: Larry is big. sent8: Larry is quiet. sent9: Larry is red. sent10: Larry is young. sent11: Bob is smart. $answer$ = Bob is nice. ; $proof$ = # sent2 & sent11 sent16
Generation error - repeated facts $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If Adam is green then Adam is big. sent2: Green things are rough. sent3: If Joseph is cold then Joseph is big. sent4: All blue things are cold. sent5: All green things are rough. sent6: All big things are kind. sent7: If something is rough and big then it is green. sent8: All kind, blue things are cold. sent9: Joseph is blue. sent10: Adam is blue. sent11: Molly is cold. sent12: Joseph is cold. sent13: Charlie is cold. $answer$ = Charlie is sent13 sent13
Generation error - output format violated $answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: Nice things are green. sent2: If Ivan is smart then Ivan is blue. sent3: Green things are smart. sent4: All cold things are nice. sent5: Big things are blue. sent6: If Ronald is nice then Ronald is big. sent7: If something is green and big then it is not kind. sent8: If something is smart and nice then it is kind. sent9: Ronald is smart. sent10: Larry is cold. sent11: Laura is nice. sent12: Ivan is kind. sent13: Larry is nice. sent14: Larry is green. sent15: Bob is smart. $answer$ = Bob is kind. ; $proof$ = # sent8 & sent9 sentnice
Table 14: Examples of inference errors made by ProofWriter (“Iter”). Please refer to Appendix I for more details.
Error Type    Input (format: [facts], rule)    Inference
Wrong inference [the racer needs the janitor.], if someone needs the janitor then the janitor likes the racer. the janitor likes the race.
Wrong inference [oliver is big.], big people are young. the oliver is young.
Wrong inference [the shoplifter needs the shoplifter., the shoplifter needs the dinosaur.], if something needs the dinosaur and it needs the shoplifter then it is round. the shop is round.
Table 15: Examples of inference errors made by FaiRR while composing rules and facts using the knowledge composer. Please refer to Appendix I for more details.
Input (format: complete theory)    Output (format: $answer$ = conclusion ; $proof$ = # rule & facts)    Correct? (✓: correct, ✗: incorrect)
Example 1
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone is blue then they are quiet. sent2: Chris is blue. $answer$ = Chris is quiet. ; $proof$ = # sent1 sent2
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone is blue then they are quiet. sent2: Chris is blue. sent3: Steve is blue. $answer$ = Dave is quiet. ; $proof$ = # sent1 sent2
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone is blue then they are quiet. sent2: Quiet people are cold. sent3: Chris is blue. sent4: Steve is blue. sent5: Chris is white. $answer$ = Dave is quiet. ; $proof$ = # sent1 sent4
Example 2
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone likes the artist then they visit the grandmother. sent2: The grandmother likes the artist. $answer$ = The grandmother visits the grandmother. ; $proof$ = # sent1 sent2
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone is red and big then they like the grandmother. sent2: If someone visits the grandmother then they are red. sent3: If someone likes the artist then they visit the grandmother. sent4: The artist likes the grandmother. sent5: The grandmother eats the artist. sent6: The grandmother is big. sent7: The grandmother is cold. sent8: The grandmother likes the artist. $answer$ = The artist visits the grandmother. ; $proof$ = # sent3 sent4
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: If someone likes the grandmother then the grandmother eats the artist. sent2: If someone eats the artist then the artist is young. sent3: If the artist eats the grandmother then the artist is cold. sent4: If someone eats the artist and the artist eats the grandmother then the grandmother eats the artist. sent5: If someone is red and big then they like the grandmother. sent6: If someone visits the grandmother then they are red. sent7: If someone is rough then they eat the grandmother. sent8: If someone likes the artist then they visit the grandmother. sent9: The artist likes the grandmother. sent10: The grandmother eats the artist. sent11: The grandmother is big. sent12: The grandmother is cold. sent13: The grandmother is young. sent14: The grandmother likes the artist. sent15: The grandmother visits the artist. sent16: The artist is young. $answer$ = The red sent6 sent15
Example 3
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: All young things are smart. sent2: All smart, young things are nice. sent3: Bob is smart. sent4: Bob is young $answer$ = Bob is nice. ; $proof$ = # sent2 & sent3 sent4
$answer$ ; $proof$ ; $question$ = What is one single-hop inference? ; $context$ = sent1: All young things are smart. sent2: All smart, young things are nice. sent3: If something is young and nice then it is quiet. sent4: Becky is big. sent5: Becky is green. sent6: Becky is young. sent7: Larry is big. sent8: Larry is quiet. sent9: Larry is red. sent10: Larry is young. sent11: Bob is smart. $answer$ = Bob is nice. ; $proof$ = # sent2 & sent11 sent16
Table 16: Some more examples of inference errors made by ProofWriter (“Iter”). We see that having more information in the theory than is required to prove the conclusion leads to errors (shown in red), while having only the necessary information reduces errors. Sentences in blue are those added to the theory relative to the row above. Please refer to Section 5.5 for more details.

Appendix I Error Analysis

This is a follow-up of Section 5.4, where we delve deeper into the error analysis by discussing different error examples and their potential causes. First, the stop errors are easy to understand: these are cases where the model simply decides to stop instead of generating any further conclusions. For our model, this can happen if the rule selector is underconfident when selecting rules and learns that a safer fallback is to stop generating. This aspect can probably be improved by better modeling of the rule selector, which we plan to explore in future work.

Next we look at some of the wrong inferences generated by both models in Tables 14 and 15. We observe that the errors made by FaiRR are rather benign, with small mistakes in the final conclusion (shown in red in Table 15). In contrast, ProofWriter tends to generate an invalid conclusion with no relation to the generated proof (rows 1 and 2 in Table 14). It also makes many non-interpretable generation errors where the output format is completely violated or the model appears to hallucinate facts (rows 3-6 in Table 14). This illustrates the benefit of our causal framework: the errors are interpretable and more plausible. The errors made by ProofWriter, on the other hand, show that its inference process often does not rely on the generated proof at all, and sometimes the generated proof itself does not make sense.
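To make these error categories concrete, the following is a small sketch of the kind of check one could run over ProofWriter-style outputs to flag format violations and hallucinated sentence identifiers; it assumes the "$answer$ ... $proof$" template shown in Table 14 and is not the evaluation code used in the paper.

```python
import re

def classify_output(output, num_context_sents):
    """Heuristic check for the error categories in Table 14, assuming the
    '$answer$ = ... ; $proof$ = # sentX & sentY' template; an illustrative
    validator only, not the paper's evaluation code."""
    m = re.match(r"\$answer\$ = (.+?) ; \$proof\$ = # (.+)", output)
    if m is None:
        return "output format violated"
    proof_ids = re.findall(r"sent(\w+)", m.group(2))
    if not proof_ids:
        return "output format violated"
    for sid in proof_ids:
        if not sid.isdigit() or int(sid) > num_context_sents:
            return "hallucinated or malformed fact id"
    if len(proof_ids) != len(set(proof_ids)):
        return "repeated facts"
    return "well-formed (conclusion content still needs checking)"

print(classify_output("$answer$ = Bob is nice. ; $proof$ = # sent2 & sent11 sent16", 11))
# hallucinated or malformed fact id
```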

Appendix J Comparison with Baselines

In this work we compare FaiRR with the baselines introduced by Tafjord et al. (2021). We omit comparisons with PRover Saha et al. (2020) and multiPRover Saha et al. (2021), since they were trained on a different dataset that makes a closed-world assumption (CWA), whereas we use datasets that make an open-world assumption (OWA). One essential difference between these two settings is that OWA allows the truth value of a statement to be one of {True, False, Unknown}, while under CWA any fact that cannot be deduced from the theory is assumed to be false, leaving only two possible truth values {True, False}. The CWA also leads to a specific constraint on the generated proof graphs, where special nodes need to be considered; please refer to Saha et al. (2020) for more details. Additionally, multiPRover Saha et al. (2021) has a goal different from ours: its focus is on generating multiple possible proofs for a given rulebase, and its training examples therefore contain multiple gold proofs per instance, whereas FaiRR and ProofWriter Tafjord et al. (2021) generate a single gold proof, which would make direct comparisons somewhat unfair. Following Tafjord et al. (2021), we report single-run numbers for every experiment.
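As a small illustration of the label-space difference (an assumption-level sketch, not code from either paper), the mapping below collapses the three OWA truth values into the two CWA values by treating anything not provably true as false.

```python
from enum import Enum

class OWALabel(Enum):
    """Open-world assumption: a statement can be provably true, provably
    false, or unknown given the theory."""
    TRUE = "True"
    FALSE = "False"
    UNKNOWN = "Unknown"

def to_cwa(label: OWALabel) -> bool:
    """Closed-world assumption: anything not provably true is treated as
    false, so UNKNOWN collapses to False."""
    return label is OWALabel.TRUE
```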

Appendix K Hyperparameters

We use RoBERTa-large models Liu et al. (2019) to model the rule selector and fact selector in FaiRR. For both components, we tuned the maximum number of training epochs, warmup updates, weight decay, learning rate, and batch size.

We use T5 Raffel et al. (2020) (T5-large) to model the knowledge composer in FaiRR and train it using the default hyperparameters available in the Hugging Face transformers library Wolf et al. (2020). All models were trained on NVIDIA Quadro RTX 8000 GPUs. Training FaiRR on a single GPU takes around 20 hours on average.
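As an illustrative sketch, the three components could be instantiated with the Hugging Face transformers library as follows; the use of standard classification heads for the selectors is an assumption made for this example, and the actual FaiRR selector architectures may differ.

```python
from transformers import (
    RobertaTokenizer, RobertaForSequenceClassification,
    T5Tokenizer, T5ForConditionalGeneration,
)

# Rule and fact selectors: RoBERTa-large encoders (the classification heads
# here are an illustrative simplification of the actual selector heads).
rule_selector_tok = RobertaTokenizer.from_pretrained("roberta-large")
rule_selector = RobertaForSequenceClassification.from_pretrained("roberta-large")

fact_selector_tok = RobertaTokenizer.from_pretrained("roberta-large")
fact_selector = RobertaForSequenceClassification.from_pretrained("roberta-large")

# Knowledge composer: T5-large sequence-to-sequence model.
composer_tok = T5Tokenizer.from_pretrained("t5-large")
composer = T5ForConditionalGeneration.from_pretrained("t5-large")
```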