
METGEN: A Module-Based Entailment Tree Generation Framework for Answer Explanation

by   Ruixin Hong, et al.

Knowing the reasoning chains from knowledge to the predicted answers can help construct an explainable question answering (QA) system. Advances on QA explanation propose to explain the answers with entailment trees composed of multiple entailment steps. While current work proposes to generate entailment trees with end-to-end generative models, the steps in the generated trees are not constrained and could be unreliable. In this paper, we propose METGEN, a Module-based Entailment Tree GENeration framework that has multiple modules and a reasoning controller. Given a question and several pieces of supporting knowledge, METGEN can iteratively generate the entailment tree by conducting single-step entailment with separate modules and selecting the reasoning flow with the controller. As each module is guided to perform a specific type of entailment reasoning, the steps generated by METGEN are more reliable and valid. Experiment results on the standard benchmark show that METGEN can outperform previous state-of-the-art models with only 9



1 Introduction

Figure 1: Given facts related to the question+answer, MetGen iteratively generates an entailment tree that contains the hypothesis (green), used facts (orange), and intermediate conclusions (blue) with several separate entailment modules and a reasoning controller.

Explanation is recognized as a key factor toward responsible AI systems Arrieta et al. (2020). In the context of question answering (QA), providing an explanation of the predicted answers can help improve the understandability, debuggability, and trustworthiness of QA models. Great efforts have been devoted to revealing how the models predict the answers and give explanations in various forms, including showing an attention map over passages Seo et al. (2017), giving a snippet of textual evidence DeYoung et al. (2020), and selecting answer-supporting sentences Xie et al. (2020); Jansen and Ustalov (2019). Among all explanation forms, the entailment trees Dalvi et al. (2021) provide the most detailed and informative explanation by exposing the chains of reasoning from the knowledge to the predictions. As shown in Figure 1(a) and (c), given a hypothesis (summarizing a question+answer pair) and supporting facts (retrieved from a corpus), the goal is to generate an entailment tree where each non-leaf node is an entailment of its children. Providing a valid entailment tree would help users to understand how the hypothesis is proved, obtain novel intermediate conclusions from the basic knowledge, and gain detailed information to support decision making.

To generate the entailment trees, Dalvi et al. (2021) propose EntailmentWriter, an end-to-end sequence-to-sequence generative model, trained by maximizing the generation likelihood of the linearized gold trees. However, they do not have an explicit strategy to constrain the validity of every single step and the tree structure. Thus, the steps are not guaranteed to satisfy the reasoning rules and could be incorrect and unreliable. For example, the step conclusion may not be entailed by the input premises or simply repeat one of the input premises Dalvi et al. (2021). Furthermore, although their outputs are trees that can indicate the reasoning chains, the mapping mechanisms from the inputs to the trees remain implicit and invisible.

To tackle the above problems, we propose MetGen, a module-based framework to generate entailment trees in a more explicit approach and constrain the entailment steps with reasoning rules. As shown in Figure 1(b), given the target hypothesis and known facts, MetGen first uses the reasoning controller to select some steps that can help get closer to the hypothesis. Subsequently, MetGen executes the selected steps with single-step entailment modules and adds the generated intermediate facts into the known facts for the next round of reasoning. Through this iterative approach, MetGen proves the hypothesis step by step and generates the overall entailment tree.

Each module in MetGen is a generative model that can perform a specific type of entailment reasoning (e.g., making a substitution inference). To guide the modules to generate correct and sound conclusions, we train the modules with well-formed synthetic data containing the corresponding logical regularities of the reasoning types Bostrom et al. (2021). Inspired by the forward chaining and backward chaining algorithms in logic programming Chein and Mugnier (2008), we adopt both deductive and abductive modules to execute forward and backward reasoning steps, respectively.

Experiments on the standard benchmark EntailmentBank Dalvi et al. (2021) show that MetGen can outperform the previous best model with 9.0% of the model parameters. Manual evaluation results demonstrate that MetGen can generate more reliable steps. Further experiments under the data-scarce setting and cross-dataset setting (on eQASC and eOBQA Jhamtani and Clark (2020)) show that MetGen is more data-efficient and has better generalization capability compared with the baselines.

Figure 2: Reasoning process of the MetGen framework. The goal is to prove the hypothesis with the given facts through reasoning iterations (the upper part). The lower part shows the first reasoning iteration, which starts from the initial state. First, the controller selects promising steps, such as a backward abductive step and a forward deductive one. Then, single-step entailment modules perform the reasoning steps and generate novel intermediate facts. After that, the controller verifies which of the resulting states are closer to the completion of reasoning and selects them for the next reasoning iteration.

2 Related Works

Explainability in Question Answering. Recent works have explored the explainability of QA in various forms Seo et al. (2017); Ye et al. (2020); Dalvi et al. (2021); Lamm et al. (2021); Wiegreffe and Marasovic (2021); Thayaparan et al. (2020); Rosenthal et al. (2021). One way is to retrieve multiple supporting facts related to the question or answer Xie et al. (2020); Jansen and Ustalov (2019); Jhamtani and Clark (2020); Inoue et al. (2020); Yadav et al. (2019, 2020); Valentino et al. (2021); Cartuyvels et al. (2020); Zhang et al. (2020). These “rationales” DeYoung et al. (2020) provide insights about what are used by the model to inform its predictions, but do not show how the facts are combined to generate novel intermediate conclusions. Some other works explain QA systems in a generative way, including generating explanation sentences that directly link a question to an answer Camburu et al. (2018); Rajani et al. (2019) and thus expose the relevant knowledge used by models Latcinnik and Berant (2020); Shwartz et al. (2020). However, as these models generate explanations in a free form, the generated facts may not be necessarily sound Bostrom et al. (2021). Recently, Bostrom et al. (2021) propose ParaPattern, an automated pipeline for building two kinds of single-step deductions. Different from the above work, our method generates the explanations in a multi-step tree structure Dalvi et al. (2021), showing what and how facts are combined to draw novel intermediate conclusions and reach the final answer. The intermediate conclusions are generated by deductive and abductive entailment modules that are constrained to perform specific types of reasoning.

Multi-Hop Proof Generation. Recently, several works propose to use the transformers for multi-hop logical reasoning and generate reliable formal proofs Clark et al. (2020); Talmor et al. (2020); Saha et al. (2020, 2021); Tafjord et al. (2021). However, they mainly focus on synthetic sentences, which have low linguistic variation and struggle to represent the flexible sentences in real QA scenarios.

Neural Module Networks. Decomposing the reasoning process into several pre-defined operations overlaps with the idea of neural module networks Andreas et al. (2016); Hu et al. (2017); Gupta and Lewis (2018); Gupta et al. (2020); Jiang et al. (2019). They typically assume that the question could be parsed into an executable program, i.e., the question explicitly describes the process to arrive at the answer. In our work, we tackle the questions/hypotheses that do not trivially describe the reasoning process and could be more challenging.

3 Task Definition

As shown in Figure 1, the inputs are a hypothesis $H$ and some fact sentences $S$ (including both relevant and irrelevant ones) expressing knowledge. $H$ is a declarative sentence derived from a question+answer pair and can be proved by the knowledge in $S$. The desired output is a valid entailment tree $T$ with the root node being $H$, the leaves being facts selected from $S$, and the intermediate nodes being novel intermediate facts (e.g., $int_1$). $T$ is considered valid if each non-leaf node is a valid entailment (a conclusion that “a person would typically infer” Dagan et al. (2013)) of its immediate children. We denote the annotated gold tree as $T_{gold}$ and its leaf facts as $S_{gold}$. Following Dalvi et al. (2021), we consider three increasingly difficult tasks with different $S$:

  • Task1 (no-distractor): $S = S_{gold}$,

  • Task2 (distractor): $S = S_{gold}$ + 15-20 distractors,

  • Task3 (full-corpus): $S$ = a corpus $C$.
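The task output can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class names `Step` and `EntailmentTree`, the node ids (`sent1`, `int1`, `H`), and the structural check are hypothetical and are not taken from the paper or from the EntailmentBank code. The check covers tree structure only; whether a step is a valid entailment still requires a human or a model judgment.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Step:
    """One entailment step: the premises (node ids) entail the conclusion."""
    premises: List[str]
    conclusion: str


@dataclass
class EntailmentTree:
    hypothesis_id: str                         # id of the root node H
    steps: List[Step] = field(default_factory=list)

    def leaves(self) -> Set[str]:
        """Premises that are never produced as a conclusion."""
        concluded = {s.conclusion for s in self.steps}
        used = {p for s in self.steps for p in s.premises}
        return used - concluded

    def is_well_formed(self, fact_ids: Set[str]) -> bool:
        """Structural check only: it does not judge entailment validity
        and it does not detect cycles."""
        concluded = [s.conclusion for s in self.steps]
        used = {p for s in self.steps for p in s.premises}
        if len(concluded) != len(set(concluded)):
            return False                       # a node is derived twice
        if self.hypothesis_id not in concluded or self.hypothesis_id in used:
            return False                       # H must be the root
        if any(c != self.hypothesis_id and c not in used for c in concluded):
            return False                       # dangling intermediate node
        return self.leaves() <= fact_ids       # leaves are selected from S


# Toy tree: sent1 + sent2 -> int1, then int1 + sent3 -> H.
tree = EntailmentTree(
    hypothesis_id="H",
    steps=[Step(["sent1", "sent2"], "int1"), Step(["int1", "sent3"], "H")],
)
```

In the distractor setting, `fact_ids` would contain more ids than the tree uses, which is why the check only requires the leaves to be a subset of the given facts.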

4 MetGen

Figure 2 illustrates the reasoning process of MetGen. We reason one step at a time and iteratively generate the entailment trees. In each iteration, given a reasoning state (e.g., the initial state, where we aim to prove the hypothesis $H$ using the facts $S$), the reasoning controller selects promising steps, including forward deductive steps and backward abductive ones. We then use the corresponding modules to perform single-step entailment on the selected steps and generate novel intermediate facts. Finally, we use the controller to verify the generated facts and select the correct states to perform further reasoning. We introduce details about the module design, reasoning controller, and reasoning algorithm in Sec 4.1, 4.2, and 4.3, respectively.

Table 1: The reasoning types used. Each type is described by the input premises of its deductive module, the entailed conclusion, and a logical regularity stated over predicates and the entities for which they hold.

4.1 Single-step Entailment Modules

4.1.1 Module Definition

We propose to divide the single-step entailment reasoning ability into a set of well-defined basic logical operations. Such a design could help improve the generalization capability Bostrom et al. (2021); Rudin (2019). As shown in Table 1, we adopt three common reasoning types, covering over 90% of the steps in EntailmentBank according to the analysis by Dalvi et al. (2021). Note that the entailment module types could be adjusted according to the specific tasks or domains, which allows our method to be flexibly applied to other problems.

We adopt both the deductive and abductive versions of the reasoning types. Take a gold step $p_1 + p_2 \to c$ as an example, where $p_1$ and $p_2$ are the premises and $c$ is the conclusion. Deduction is the process of reasoning from the premises to reach a logical conclusion. A deductive module takes the two premises $p_1$ and $p_2$ as inputs and outputs a conclusion $c$ according to its reasoning type (denoted as $p_1 + p_2 \to c$). Abduction is to find the best explanation given complete/incomplete observations Harman (1965). In the context of the entailment steps, given a conclusion $c$ and a premise fact $p_1$ as observations, the abductive module yields a plausible premise $p_2$ (denoted as $c - p_1 \to p_2$), where the generated premise $p_2$ and the observed premise $p_1$ would most likely infer the conclusion $c$. Although the steps in EntailmentBank may have more than two premises, we only consider the case of two premises. The reason is that an $n$-premise step ($n > 2$) could be further decomposed into several valid 2-premise steps Dalvi et al. (2021) (see Appendix Figure 8 for a specific example).

4.1.2 Module Training

Training the entailment modules with data that contains the corresponding logical regularities would guide them to perform correct inferences and ensure soundness Bostrom et al. (2021). We first train the modules with synthetic sentences to learn the logical transformations and then further fine-tune them with the end task.

We follow ParaPattern Bostrom et al. (2021), a pipeline based on syntactic retrieval, rule-based example construction, and automatic paraphrasing, to collect synthetic sentences from Wikipedia. Since Bostrom et al. (2021) only consider the substitution and contraposition deductions, we extend the method to conjunction and if-then deductions by designing the specific syntactic templates and construction rules (see Appendix A.1). In addition, we also consider the abductive form of these modules. We then fine-tune the modules with corresponding steps in EntailmentBank to adapt the modules to the science domain. Since the original steps in EntailmentBank are not annotated with reasoning types, we manually label 400 steps of the training split and train a classifier with these steps. The remaining steps are labeled with the pseudo labels predicted by the classifier. We freeze the parameters of the modules once the training is complete.

4.2 Reasoning Controller

In addition to single-step reasoning modules, we need to search for the correct path to reach the target hypothesis. The entire reasoning search space would grow rapidly as the number of input facts increases and there would also be complex branching in the trees. We introduce a reasoning controller to filter out incorrect facts, steps, and states to reduce the search space and complete the reasoning accurately and efficiently.

Figure 2 shows how the controller is used in each reasoning iteration. At the beginning of the iteration, the controller scores all possible steps and selects the most promising ones for single-step entailment. After the entailment modules generate intermediate facts, the controller estimates which state with a generated fact gets closer to the completion of reasoning and selects the best states for the next iteration. Besides the usage within each iteration, the controller also rates all facts at the start of the whole reasoning process and keeps only the relevant facts for the initial state when fact distractors exist.

4.2.1 Controller Model

The controller model scores steps, facts, and states based on a transformer, and its structure is shown in Figure 3.

Figure 3: Reasoning controller illustration. Given a state, the controller predicts a score for the whole state, scores for facts, and scores for all possible steps.

Encoding. We first encode the target hypothesis $H$ and the facts $sent_1, \dots, sent_n$ of state $s$ with a pre-trained transformer. We obtain the contextualized representation $\mathbf{h}$ for $H$ and $\mathbf{s}_i$ for $sent_i$ using the average contextualized representation of all tokens within the sentence.

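Steps. We introduce feed forward networks $\mathrm{FFN}_{ded}$ and $\mathrm{FFN}_{abd}$ for deductive steps and abductive steps, respectively. Each combination of two facts $(sent_i, sent_j)$ is a possible deductive step. Each combination of the target hypothesis $H$ and a fact $sent_i$ is a possible abductive step. We score each candidate step by applying the corresponding feed forward network to the concatenation of the representations of its two inputs, i.e., $[\mathbf{s}_i; \mathbf{s}_j]$ for a deductive step and $[\mathbf{h}; \mathbf{s}_i]$ for an abductive step, where $\mathbf{h}$ and $\mathbf{s}_i$ are the representations of $H$ and $sent_i$ and $[\cdot\,;\cdot]$ is the concatenate operation. We normalize the step scores by applying Softmax over all possible deductive and abductive steps.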
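The candidate enumeration and normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `raw_score` stands in for the feed forward networks over concatenated representations, and the function names and the `"H"` placeholder for the hypothesis are assumptions made for the example.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Tuple

CandidateStep = Tuple[str, str, str]   # (direction, first input, second input)


def enumerate_steps(fact_ids: List[str]) -> List[CandidateStep]:
    """Every pair of facts is a deductive candidate; the hypothesis paired
    with each fact is an abductive candidate."""
    deductive = [("deductive", a, b) for a, b in combinations(fact_ids, 2)]
    abductive = [("abductive", "H", f) for f in fact_ids]
    return deductive + abductive


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of raw scores."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_step_scores(
    fact_ids: List[str], raw_score: Callable[[CandidateStep], float]
) -> Dict[CandidateStep, float]:
    """Score all candidates jointly so deductive and abductive steps compete."""
    steps = enumerate_steps(fact_ids)
    return dict(zip(steps, softmax([raw_score(s) for s in steps])))


# Toy usage with a constant scorer: all candidates receive equal probability.
scores = normalized_step_scores(["sent1", "sent2", "sent3"], lambda step: 0.0)
```

With $n$ facts there are $n(n-1)/2$ deductive and $n$ abductive candidates, which is why the step space grows quickly and the controller keeps only the top-scoring fraction.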

Facts. The fact score indicates whether the fact is useful by how similar the fact is to the state’s target hypothesis. We assume that if a fact has a smaller depth in the gold entailment tree (i.e., closer to the root), it would be more similar to the target hypothesis than those facts with a larger depth. We introduce a learnable similarity function and determine the fact score by comparing the representation of the fact with that of the target hypothesis. The resulting similarity is passed through a normalizing function to give the final fact score.

State. The state score reflects the quality of the current state and indicates whether this state should be used for further reasoning. We assign the state score by combining two parts. The first part, scaled by a learnable weight, is computed from the fact scores of the state; it helps choose states that contain more relevant facts and fewer distractors. The second part applies a feed forward network to a state-level representation from the encoder; it comprehensively considers the whole state and gives the promising one a higher score.
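One way to write the three controller scores compactly is given below. This is a sketch consistent with the prose, not a recovery of the paper's exact formulas. The following are all assumptions: the symbols $\mathbf{h}$, $\mathbf{s}_i$, and $\mathbf{c}$ (a state-level representation), the network names, the choice of a sigmoid $\sigma$ as the normalizing function, the mean over fact scores, and the additive form with weight $\lambda$.

```latex
% Assumed notation: h and s_i are sentence representations of H and sent_i,
% c is a state-level representation, n is the number of facts in state s.
\begin{align}
u(sent_i + sent_j) &= \mathrm{FFN}_{ded}\big([\mathbf{s}_i ; \mathbf{s}_j]\big),
\qquad
u(H - sent_i) = \mathrm{FFN}_{abd}\big([\mathbf{h} ; \mathbf{s}_i]\big) \\
f_i &= \sigma\big(\mathrm{sim}(\mathbf{h}, \mathbf{s}_i)\big) \\
\mathrm{score}(s) &= \lambda \cdot \frac{1}{n} \sum_{i=1}^{n} f_i
  + \mathrm{FFN}_{state}(\mathbf{c})
\end{align}
```

Under this reading, the first line gives the step scores before Softmax normalization, the second the fact score, and the third the state score with its fact-based and whole-state parts.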

4.2.2 Controller Training

Training State Construction. We decompose the gold entailment trees into several intermediate states for training. We add disturbances to the trees to make positive and negative states. For each gold deductive step (e.g., $sent_1 + sent_2 \to int_1$), we use the deductive module to predict a conclusion $\hat{int}_1$. If the predicted $\hat{int}_1$ is correct, we replace $int_1$ in the state with $\hat{int}_1$ to make new positive states. Otherwise, we replace $int_1$ with $\hat{int}_1$ to make negative states. The abductive modules are also used in a similar way.

Loss Function. We train the controller with corresponding margin ranking losses $L_{step}$, $L_{fact}$, and $L_{state}$ to learn to rank the correct steps, facts, and states ahead of incorrect ones, respectively. Specifically, the loss for scoring steps averages a margin loss with margin $m_{step}$ over all pairs of a positive step and a negative step.

For facts, the loss has two parts. The first averages a margin loss with margin $m_{fact}$ over all pairs of facts in which one fact has a smaller depth in the gold tree than the other, so that the shallower fact is scored higher. The second is averaged over the distractors, so that distractors receive low scores.

For states, we sample a positive state and a negative state from a tree and train the controller with a margin loss with margin $m_{state}$ on their state scores.

Finally, we average the above losses over all trees in the training split and train the controller with their combination.
Appendix B gives more controller training details.
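A margin ranking loss of the kind described above can be sketched in a few lines. The hinge form `max(0, margin - (positive - negative))` is the standard definition and is assumed here; the paper's exact formulation, the function names, and the toy scores are not taken from the source. The default margin of 0.1 matches the value reported in the implementation details.

```python
from typing import List, Tuple


def margin_ranking_loss(pos_score: float, neg_score: float, margin: float) -> float:
    """Zero once the positive item outscores the negative one by at least
    `margin`; otherwise the shortfall."""
    return max(0.0, margin - (pos_score - neg_score))


def mean_pairwise_loss(pairs: List[Tuple[float, float]], margin: float = 0.1) -> float:
    """Average the margin loss over (positive score, negative score) pairs."""
    if not pairs:
        return 0.0
    return sum(margin_ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)


# Toy scores: the first pair is already well separated, the second is tied.
step_pairs = [(0.9, 0.2), (0.5, 0.5)]
step_loss = mean_pairwise_loss(step_pairs, margin=0.1)
```

The same helper could serve steps, facts, and states, since in each case the controller only needs to rank a correct item above an incorrect one.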

4.3 Reasoning Algorithm

Since the entailment trees are generated iteratively and the search space for reasoning could be large for each iteration, we adopt beam search for efficient reasoning. Given the initial state, we first remove the facts with a low fact score to filter distractors. Subsequently, we perform several reasoning iterations until the target hypothesis is proved or the maximum reasoning depth is reached. In each iteration, we select the steps with the highest step scores, execute the steps with all types of deductive or abductive modules, and construct novel states with the generated intermediate facts. We retain the top-$k$ states ranked by state scores for the next iteration, where $k$ is the beam size. More algorithm details are in Appendix Algorithm 1.
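The iteration described above can be sketched as a small beam search. This is a simplified illustration under several assumptions, not the algorithm from the appendix: it uses deductive steps only, it consumes the two premises of a step when forming the next state, it treats an exact string match with the hypothesis as a proof, and `step_score`, `state_score`, and `deduce` are hypothetical stand-ins for the controller and the entailment module.

```python
import math
from itertools import combinations
from typing import Callable, List, Optional, Tuple

Proof = List[Tuple[Tuple[str, str], str]]      # [((premise, premise), conclusion)]


def beam_search(
    facts: List[str],
    hypothesis: str,
    step_score: Callable[[Tuple[str, ...], Tuple[str, str]], float],
    state_score: Callable[[Tuple[str, ...]], float],
    deduce: Callable[[str, str], str],
    beam_size: int = 10,
    step_ratio: float = 0.1,
    max_depth: int = 5,
) -> Optional[Proof]:
    beam: List[Tuple[Tuple[str, ...], Proof]] = [(tuple(facts), [])]
    for _ in range(max_depth):
        candidates = []
        for known, proof in beam:
            steps = list(combinations(known, 2))
            if not steps:
                continue
            # Keep only the top fraction of steps (at least one).
            steps.sort(key=lambda s: step_score(known, s), reverse=True)
            keep = max(1, math.ceil(step_ratio * len(steps)))
            for a, b in steps[:keep]:
                conclusion = deduce(a, b)
                new_proof = proof + [((a, b), conclusion)]
                if conclusion == hypothesis:
                    return new_proof            # hypothesis proved
                rest = tuple(f for f in known if f not in (a, b))
                candidates.append((rest + (conclusion,), new_proof))
        if not candidates:
            break
        # Keep the best states for the next iteration.
        candidates.sort(key=lambda c: state_score(c[0]), reverse=True)
        beam = candidates[:beam_size]
    return None                                 # depth limit reached


# Toy usage: the "module" just brackets its inputs, and scorers are constant.
proof = beam_search(
    facts=["a", "b", "c"],
    hypothesis="(c+(a+b))",
    step_score=lambda known, step: 0.0,
    state_score=lambda known: 0.0,
    deduce=lambda x, y: f"({x}+{y})",
)
```

A fuller version would also enumerate abductive steps that rewrite the target hypothesis, and would first drop facts whose fact score falls below a threshold.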

5 Experiments

We conduct experiments on EntailmentBank Dalvi et al. (2021), the first dataset supporting QA explanations in the form of the entailment tree. EntailmentBank contains 1,840 entailment trees, each of which corresponds to a question from the ARC dataset Clark et al. (2018). On average, each tree contains 7.6 nodes across 3.2 steps. Summary statistics are shown in Table 2.

| | Train | Dev | Test | All |
|---|---|---|---|---|
| Questions / Trees | 1,313 | 187 | 340 | 1,840 |
| Entailment steps | 4,175 | 597 | 1,109 | 5,881 |

Table 2: EntailmentBank statistics.

5.1 Evaluation Metrics

Following Dalvi et al. (2021), we first align nodes in the predicted tree with nodes in the gold tree and then evaluate with three dimensions:

Leaves: To evaluate whether the predicted tree uses the correct leaf facts, we compute the F1 score by comparing the predicted leaf facts to the gold leaf facts $S_{gold}$.

Steps: To evaluate whether the individual steps are structurally correct, we compare all steps in two trees and compute F1. A predicted step is considered structurally correct if its children’s identifiers (e.g., $sent_1$, $int_1$) perfectly match the gold ones.

Intermediates: To evaluate whether the intermediate conclusions are correct, we report the F1 of comparing the aligned intermediate conclusions. A predicted intermediate sentence is considered correct if the BLEURT-Large-512 score of the aligned intermediate pair is larger than a fixed threshold (picked using 300 manually labeled pairs, Dalvi et al. (2021)).

The AllCorrect score is 1 if F1 is 1, and 0 otherwise. (We repair a bug in the official evaluation code, which sets the Intermediate AllCorrect to 1 if the precision is 1 rather than if F1 is 1, and thus overestimates the Intermediate AllCorrect.) Given the above scores, we comprehensively evaluate the predicted tree with Overall AllCorrect, whose value is 1 if and only if the leaves, steps, and intermediates are all correct. This is a strict metric since any error in the predicted tree will lead to a score of 0.
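The relation between F1 and AllCorrect can be made concrete with a short sketch. It is an illustration only, not the official evaluation code: the function names and the toy leaf sets are hypothetical, and the node alignment that precedes scoring is omitted.

```python
from typing import Set


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between predicted and gold items (e.g., leaf ids)."""
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def all_correct(f1: float) -> int:
    """AllCorrect is 1 only when F1 is exactly 1."""
    return 1 if f1 == 1.0 else 0


def overall_all_correct(leaf_f1: float, step_f1: float, inter_f1: float) -> int:
    """Overall AllCorrect requires leaves, steps, and intermediates to be perfect."""
    return int(all_correct(leaf_f1) and all_correct(step_f1) and all_correct(inter_f1))


# Perfect precision but a missing leaf: F1 is below 1, so AllCorrect is 0.
partial = f1_score({"sent1", "sent2"}, {"sent1", "sent2", "sent3"})
exact = f1_score({"sent1", "sent2"}, {"sent1", "sent2"})
```

The `partial` case shows why scoring AllCorrect from precision alone overestimates: every predicted item is correct, yet the prediction is incomplete.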

5.2 Baselines

We compare with the SOTA entailment tree generation method EntailmentWriter Dalvi et al. (2021), which directly generates the linearized trees given $H$ and $S$ with an end-to-end encoder-decoder framework. We also follow the “Iterative” ProofWriter Tafjord et al. (2021), which is one of the SOTA proof generation methods for logical reasoning, to extend EntailmentWriter to EntailmentWriter-Iter. EntailmentWriter-Iter iteratively generates a part of the linearized tree in one forward process (e.g., a step ending with the conclusion “eruptions block sunlight;”) and concatenates all parts to make the final tree. It completes the step selection and entailment reasoning in a single seq2seq model and does not provide the reasoning types of steps.

| Task | Method | Parameters (B) | Leaves F1 | Leaves AllCorrect | Steps F1 | Steps AllCorrect | Intermediates F1 | Intermediates AllCorrect | Overall AllCorrect |
|---|---|---|---|---|---|---|---|---|---|
| Task1 (no-distractor) | EntailmentWriter (T5-11B)† | 11.00 | 99.0 | 89.4 | 51.5 | 38.2 | 71.2 | 52.9 | 35.6 |
| | EntailmentWriter (T5-large) | 0.77 | 98.4 | 84.1 | 50.0 | 38.5 | 67.0 | 35.9 | 34.4 |
| | EntailmentWriter-Iter (T5-large) | 0.77 | 99.8 | 97.6 | 51.6 | 38.5 | 68.3 | 36.5 | 35.0 |
| | MetGen-separated (Ours) | 0.22 + 6×0.77 | 100.0 | 100.0 | 57.9 | 42.1 | 71.3 | 39.2 | 37.0 |
| | MetGen-prefixed (Ours) | 0.22 + 0.77 | 100.0 | 100.0 | 57.7 | 41.9 | 70.8 | 39.2 | 36.5 |
| Task2 (distractor) | EntailmentWriter (T5-11B)† | 11.00 | 89.1 | 48.8 | 41.4 | 27.7 | 66.2 | 53.2 | 25.6 |
| | EntailmentWriter (T5-large) | 0.77 | 83.2 | 35.0 | 39.5 | 24.7 | 62.2 | 28.2 | 23.2 |
| | EntailmentWriter-Iter (T5-large) | 0.77 | 85.2 | 40.9 | 38.9 | 26.8 | 63.5 | 29.1 | 25.0 |
| | MetGen-separated (Ours) | 0.22 + 6×0.77 | 83.7 | 48.6 | 41.7 | 30.4 | 62.7 | 32.7 | 28.0 |
| | MetGen-prefixed (Ours) | 0.22 + 0.77 | 82.7 | 46.1 | 41.3 | 29.6 | 61.4 | 32.4 | 27.7 |
| Task3 (full-corpus) | EntailmentWriter (T5-11B)† | 11.00 | 39.7 | 3.8 | 7.8 | 2.9 | 36.4 | 13.2† | 2.9 |
| | EntailmentWriter (T5-large) | 0.77 | 30.9 | 1.2 | 4.4 | 1.2 | 28.8 | 5.6 | 1.2 |
| | EntailmentWriter-Iter (T5-large) | 0.77 | 32.4 | 1.8 | 4.4 | 1.5 | 29.7 | 6.5 | 1.5 |
| | MetGen-separated (Ours) | 0.22 + 6×0.77 | 34.8 | 8.7 | 9.8 | 8.6 | 36.7 | 20.4 | 8.6 |
| | MetGen-prefixed (Ours) | 0.22 + 0.77 | 34.8 | 8.7 | 9.8 | 8.6 | 36.6 | 20.4 | 8.6 |

Table 3: Automatic evaluation results on the EntailmentBank test split. † indicates results from the published paper (see the note on the evaluation-code bug in Section 5.1). Parameters (B) denotes the number of model parameters in billions.
| Method | Task1 Automatic | Task1 Manual | Task2 Automatic | Task2 Manual |
|---|---|---|---|---|
| EntailmentWriter (T5-large) | 35 | 46 | 21 | 26 |
| EntailmentWriter-Iter (T5-large) | 35 | 47 | 25 | 35 |
| MetGen-prefixed (Ours) | 36 | 53 | 27 | 39 |

Table 4: Entailment tree evaluation results on 100 uniformly sampled questions from the test split. We report the proportion (%) of the predicted trees that are rated as valid, following automatic and manual evaluation.

5.3 Implementation Details

Modules. We implement the entailment modules on top of T5-large Raffel et al. (2020) with the following two implementations. (1) Separated. We implement each module separately. We have six models in total, corresponding to the three reasoning types of deductive and abductive versions. (2) Prefixed. We implement all modules with a single model. To specify which reasoning type the model should perform, we follow Raffel et al. (2020) to add a type-specific prefix (e.g., “deductive substitution:”) to the input before feeding it to the model. To evaluate the modules, we annotate the types of 275 steps in the dev split. We train the modules with a batch size of 20 for 100 epochs.
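The prefixed implementation amounts to routing every module call through one model by changing the input string. The sketch below illustrates this under stated assumptions: only the example prefix “deductive substitution:” comes from the text, while the function name, the way the two input sentences are joined, and the exact spelling of the other type names are hypothetical.

```python
DIRECTIONS = ("deductive", "abductive")
REASONING_TYPES = ("substitution", "conjunction", "if-then")


def build_module_input(direction: str, reasoning_type: str, first: str, second: str) -> str:
    """Prepend a type-specific prefix so a single seq2seq model knows which
    of the six modules it should act as."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    if reasoning_type not in REASONING_TYPES:
        raise ValueError(f"unknown reasoning type: {reasoning_type}")
    # Joining the two sentences with a space is an assumption of this sketch.
    return f"{direction} {reasoning_type}: {first} {second}"


# Deductive call: two premises in, a conclusion expected out.
example = build_module_input(
    "deductive", "substitution", "eruptions emit ash.", "ash blocks sunlight."
)

# All six prefixes handled by the single prefixed model.
all_prefixes = [f"{d} {t}:" for d in DIRECTIONS for t in REASONING_TYPES]
```

For an abductive call, `first` would be the conclusion and `second` the observed premise, with the model expected to produce the missing premise.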

Controller. The controller is implemented with albert-xxlarge-v2 Lan et al. (2019). We train two individual controllers for Task1 and Task2. For Task3, we reuse the Task2 model without additional training. The controllers are trained with a batch size of 10 for 1,000 epochs. The margins for steps, facts, and states ($m_{step}$, $m_{fact}$, and $m_{state}$) are tuned on the development split and all set to 0.1.

Algorithm. For Task1, we iterate until all facts in $S$ are used. For Task2, we use a fact score threshold of 0.001 to filter distractors and a maximum reasoning depth of 5. We select the top 10% steps for each state and set the beam size to 10. All hyper-parameters are selected using the dev split (Appendix C). For Task3, we follow Dalvi et al. (2021) to retrieve 25 sentences from the corpus using the hypothesis $H$ as the query. We use the same retrieval results as EntailmentWriter for a fair comparison. Model checkpoints are selected using the dev split. More implementation details can be found in the Appendix.

6 Result Analysis

6.1 Automatic Evaluation

As shown in Table 3, our methods outperform all baseline methods on the strictest metric Overall AllCorrect for all three tasks. Notice that the trees generated by our methods only contain 2-premise steps, which would lead to a 0 Overall AllCorrect score on 26% of test samples whose annotations contain $n$-premise ($n > 2$) steps. Even so, our MetGen-separated still obtains an absolute improvement of 1.4%/2.4%/5.7% on Task1/2/3 in comparison to the strongest baseline. With only 9.0% of the model parameters, MetGen-prefixed can outperform EntailmentWriter (T5-11B) by an absolute 0.9%/2.1%/5.7% on Task1/2/3. With a comparable number of model parameters, MetGen-prefixed also outperforms EntailmentWriter-Iter (T5-large) by a large margin. For Task3, we note that all methods perform poorly. The main reason is that the retrieved facts may not contain all the required facts (68% of the cases). We note that MetGen underperforms the baselines on some metrics, probably due to the inaccuracy of the tree alignment algorithm in the automatic evaluation (see the Appendix).
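The headline numbers in this paragraph follow from Table 3 by simple arithmetic, which the sketch below reproduces. Reading 0.22 as the controller size and 0.77 as the size of one T5-large module (so that the separated variant uses six of them) is an assumption based on the implementation details; the variable and function names are illustrative.

```python
# Parameter counts in billions, as listed in Table 3.
CONTROLLER = 0.22
T5_LARGE = 0.77
T5_11B = 11.00

metgen_prefixed = CONTROLLER + T5_LARGE          # one shared module model
metgen_separated = CONTROLLER + 6 * T5_LARGE     # six separate module models

# Share of the T5-11B parameter count used by MetGen-prefixed, in percent.
param_ratio = round(100 * metgen_prefixed / T5_11B, 1)

# Overall AllCorrect on Task1/Task2/Task3, copied from Table 3.
overall = {
    "EntailmentWriter (T5-11B)": (35.6, 25.6, 2.9),
    "MetGen-separated": (37.0, 28.0, 8.6),
    "MetGen-prefixed": (36.5, 27.7, 8.6),
}


def gains(method: str, baseline: str = "EntailmentWriter (T5-11B)") -> tuple:
    """Absolute improvement over the baseline on each task, in points."""
    return tuple(round(m - b, 1) for m, b in zip(overall[method], overall[baseline]))


separated_gains = gains("MetGen-separated")
prefixed_gains = gains("MetGen-prefixed")
```

Here `param_ratio` evaluates to 9.0, and the two gain tuples match the 1.4/2.4/5.7 and 0.9/2.1/5.7 figures quoted in the text.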


Figure 4: Manual evaluation results of 100 single-step entailments uniformly sampled from the predicted trees of the Task2 test split. EW denotes EntailmentWriter.

6.2 Manual Evaluation

As analysed by Dalvi et al. (2021), the automated metrics might misjudge some valid trees and thus underestimate the performance. To make a more accurate comparison, we perform a manual evaluation. We compare three methods with a comparable number of model parameters: EntailmentWriter (T5-large), EntailmentWriter-Iter (T5-large), and MetGen-prefixed. For each step and tree, we invite three students as experts to evaluate the validity. The inter-annotator agreement (Cohen’s kappa statistic) is 0.85/0.76 for the step/tree, indicating substantial agreement between annotators.
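For reference, Cohen's kappa for two annotators compares the observed agreement with the agreement expected by chance. The sketch below uses the standard definition; the toy labels are invented, and since the text does not say how the statistic is aggregated over three annotators, averaging it over annotator pairs would be one possible choice and is an assumption here.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0                      # both annotators always use one label
    return (observed - expected) / (1.0 - expected)


# Toy example: the annotators agree on 3 of 4 steps.
annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy example the observed agreement is 0.75 and the chance agreement is 0.5, giving a kappa of 0.5.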

Validity of Full Entailment Trees. As shown in Table 4, under the manual evaluation, MetGen outperforms the baselines with large margins.

Validity of Individual Entailment Steps. We review the validity of the single-step entailment and annotate each step with one of the four categories:

Valid: The step conclusion can be inferred from the premises and does not trivially repeat them.

Unsupported: The conclusion is in conflict with, irrelevant with, or not followed from the premises.

Repeat premises: The conclusion trivially repeats one or more of the premises.

Missing premises: The conclusion uses knowledge unstated in the premises. The step would be correct if one additional premise from is added.

As shown in Figure 4, MetGen achieves considerable improvement in the validity of steps compared to the baseline methods. We note that 17% of the steps of EntailmentWriter belong to missing premises. MetGen constrains the reasoning types of steps and uses the premise-related and context-independent entailment modules to perform every single step. This can reduce the cases of missing premises (from 17% to 2%) and improve the validity of the conclusions (from 38% to 70%).

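| | Implementation | Models | Reasoning Type | Training Data | Overall AllCorrect | Single-step Accuracy |
|---|---|---|---|---|---|---|
| (a) | Sep | 6×T5-large | ✓ | S+E | 28.0 | 81.0 |
| (b) | Sep | 6×BART-large | ✓ | S+E | 26.2 | 77.0 |
| (c) | Sep | 6×T5-base | ✓ | S+E | 27.3 | 78.0 |
| (d) | Sep | 6×T5-large | ✓ | E | 27.8 | 79.5 |
| (e) | Sep | 6×T5-large | ✓ | S | 23.5 | 43.6 |
| (f) | Pre | 1×T5-large | ✓ | S+E | 27.7 | 78.4 |
| (g) | Pre | 1×T5-large | ✓ | E | 27.4 | 78.1 |
| (h) | Pre | 1×T5-large | ✗ | E | 25.9 | 76.0 |

Table 5: Ablation results on entailment modules. Sep/Pre indicates separated/prefixed. Reasoning Type indicates whether the modules distinguish specific reasoning types. S/E denotes the synthetic/EntailmentBank step training data.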

6.3 Ablation Study

Entailment Modules Analysis. Table 5 reports the ablation results on modules. We report the Overall AllCorrect on the test split and the single-step entailment accuracy on the labeled dev steps, and can make the following observations. (1) Separated vs. Prefixed. We can see that MetGen-prefixed achieves slightly worse performance than MetGen-separated ((a) vs. (f) and (d) vs. (g)). This is mainly because separate modules could better learn different types of reasoning. However, in our final system, we still choose to use MetGen-prefixed due to the consideration of model size. (2) Clarifying Reasoning Types. We train a module to infer without distinguishing or assigning specific reasoning types. We find that the performance drops from 27.4% to 25.9% ((g) vs. (h)), suggesting that clarifying the reasoning types of the entailment steps is crucial for generating entailment trees. (3) Training Data. Comparing (a) and (d), we find that training with the synthetic data could improve the accuracy. Without tuning on EntailmentBank (setting (e)), the modules might not adapt to the science domain and obtain low step accuracy. However, the well-trained controller would verify and filter the erroneous conclusions, so our method can still achieve 23.5% on Overall AllCorrect. (4) Generative Model. A stronger generative model, which achieves higher single-step accuracy, could achieve higher tree generation performance (comparing (a), (b), and (c)), indicating that our method can be further improved with stronger entailment modules.

Controller and Algorithm Analysis. (1) Is the reasoning controller necessary? To answer this question, we design a heuristic generation algorithm without the controller (Appendix D). It uses the BLEURT scores as heuristic information to guide the reasoning. As shown in Table 6, the heuristic method achieves observably lower performance. The controller could aid in eliminating the erroneous steps and states, so as to find the valid trees efficiently and accurately. Without the controller, we find it difficult to find effective heuristic information. (2) Effect of Abductive Steps. The generation performance drops when abductive steps are not used. This suggests that abductive steps, as a way of backward searching, could help improve the quality of generated trees.

| Task | Method | Leaves | Steps | Intermediates | Overall |
|---|---|---|---|---|---|
| Task1 | controller | 100.0 | 42.1 | 39.2 | 37.0 |
| | w/o abduction | 100.0 | 41.4 | 38.4 | 36.2 |
| | heuristic | 100.0 | 31.2 | 31.2 | 28.8 |
| Task2 | controller | 48.6 | 30.4 | 32.7 | 28.0 |
| | w/o abduction | 44.5 | 28.3 | 31.6 | 27.0 |
| | heuristic | 3.2 | 3.2 | 12.1 | 3.2 |

Table 6: Ablation results on the reasoning controller. We report the AllCorrect scores on the test split.

Figure 5: Results on different ratios (0.01, 0.05, 0.10, 0.20, 0.50, 1.00) of EntailmentBank training data.

| Method | eQASC P@1 | eQASC NDCG | eOBQA P@1 | eOBQA NDCG |
|---|---|---|---|---|
| EntailmentWriter (T5-large) | 52.48 | 73.14 | 69.07 | 89.05 |
| EntailmentWriter-Iter (T5-large) | 52.56 | 73.28 | 72.15 | 90.19 |
| MetGen-prefixed (Ours) | 55.81 | 74.19 | 74.89 | 90.50 |

Table 7: Cross-dataset results on the eQASC and eOBQA test split.

6.4 Data-scarce Setting

Figure 5 reports the results in the data-scarce setting. Our method is more data-efficient. With only 1% of the EntailmentBank training data, our method obtains 14.7% on Task2 Overall AllCorrect, in comparison to 10.0% of the strongest baseline. When the data is scarce, the advantage of training our modules with synthetic data becomes more significant. It can help alleviate the overfitting on few EntailmentBank sentences.

6.5 Cross-dataset Setting

To test the generalization capability of our method, we conduct cross-dataset experiments on eQASC and eOBQA Jhamtani and Clark (2020), which collect one-step entailment trees for questions from QASC Khot et al. (2020) and OpenBookQA Mihaylov et al. (2018), respectively. Given a hypothesis and a set of facts, the task requires selecting the valid one-step trees (each a 2-premise step from two facts to the hypothesis) from a candidate set. We apply the Task2 models (without fine-tuning on eQASC or eOBQA) to select from the candidate trees (Appendix E). Following Jhamtani and Clark (2020), we evaluate models with the P@1 and NDCG metrics. Questions with no valid tree are filtered out. As shown in Table 7, our method generalizes better than the baselines. Instead of training a seq2seq model with a single generation loss, our method explicitly models the step and state selection ability (Equations (1) and (3)) and guides the controller with specific losses to rank the correct candidates ahead of the incorrect ones. This could help alleviate overfitting to the training data and improve generality.

7 Conclusion

We propose MetGen, a module-based framework to generate entailment trees for explaining answers. MetGen reasons with single-step entailment modules and a reasoning controller. Experiments on the EntailmentBank benchmark show that MetGen can generate valid trees with reliable steps and achieve SOTA performance.

8 Acknowledgements

We appreciate the anonymous reviewers for their insightful comments. We would like to thank Zhihong Shao for the helpful discussions. This work is supported by the National Key Research and Development Program of China (No. 2018AAA0100701), and the Guoqiang Institute of Tsinghua University with Grant No. 2020GQG0005.


References

  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 39–48. Cited by: §2.
  • A. B. Arrieta, N. D. Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera (2020) Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, pp. 82–115. Cited by: §1.
  • K. Bostrom, X. Zhao, S. Chaudhuri, and G. Durrett (2021) Flexible generation of natural language deductions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), pp. 6266–6278. Cited by: §A.1, §1, §2, §4.1.1, §4.1.2, §4.1.2.
  • O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom (2018) e-SNLI: natural language inference with natural language explanations. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9560–9572. Cited by: §2.
  • R. Cartuyvels, G. Spinks, and M. Moens (2020) Autoregressive reasoning over chains of facts with transformers. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, D. Scott, N. Bel, and C. Zong (Eds.), pp. 6916–6930. Cited by: §2.
  • M. Chein and M. Mugnier (2008) Graph-based knowledge representation: computational foundations of conceptual graphs. Springer Science & Business Media. Cited by: §1.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR abs/1803.05457. Cited by: §5.
  • P. Clark, O. Tafjord, and K. Richardson (2020) Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, C. Bessiere (Ed.), pp. 3882–3890. Cited by: §2.
  • I. Dagan, D. Roth, M. Sammons, and F. M. Zanzotto (2013) Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers. ISBN 9781598298345. Cited by: §3.
  • B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark (2021) Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), pp. 7358–7370. Cited by: Appendix B, Appendix G, §1, §1, §1, §2, §3, §4.1.1, §4.1.1, §5.1, §5.2, §5.3, §5, §6.2, footnote 1.
  • J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace (2020) ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 4443–4458. Cited by: §1, §2.
  • N. Gupta and M. Lewis (2018) Neural compositional denotational semantics for question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 2152–2161. Cited by: §2.
  • N. Gupta, K. Lin, D. Roth, S. Singh, and M. Gardner (2020) Neural module networks for reasoning over text. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Cited by: §2.
  • G. H. Harman (1965) The inference to the best explanation. The Philosophical Review 74 (1), pp. 88–95. Cited by: §4.1.1.
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko (2017) Learning to reason: end-to-end module networks for visual question answering. In IEEE International Conference on Computer Vision, ICCV 2017, pp. 804–813. Cited by: §2.
  • N. Inoue, P. Stenetorp, and K. Inui (2020) R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 6740–6750. Cited by: §2.
  • P. Jansen and D. Ustalov (2019) TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing, TextGraphs@EMNLP 2019, Hong Kong, November 4, 2019, D. Ustalov, S. Somasundaran, P. Jansen, G. Glavas, M. Riedl, M. Surdeanu, and M. Vazirgiannis (Eds.), pp. 63–77. Cited by: §1, §2.
  • H. Jhamtani and P. Clark (2020) Learning to explain: datasets and models for identifying valid reasoning chains in multihop question-answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 137–150. Cited by: §1, §2, §6.5.
  • Y. Jiang, N. Joshi, Y. Chen, and M. Bansal (2019) Explore, propose, and assemble: an interpretable model for multi-hop reading comprehension. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2714–2725. Cited by: §2.
  • T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal (2020) QASC: A dataset for question answering via sentence composition. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 8082–8090. Cited by: §6.5.
  • M. Lamm, J. Palomaki, C. Alberti, D. Andor, E. Choi, L. B. Soares, and M. Collins (2021) QED: A framework and dataset for explanations in question answering. Trans. Assoc. Comput. Linguistics 9, pp. 790–806. Cited by: §2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. Cited by: §5.3.
  • V. Latcinnik and J. Berant (2020) Explaining question answering models through text generation. CoRR abs/2004.05569. Cited by: §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §A.2.
  • T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 2381–2391. Cited by: §6.5.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: §5.3.
  • N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 4932–4942. Cited by: §2.
  • S. Rosenthal, M. A. Bornea, A. Sil, R. Florian, and J. S. McCarley (2021) Do answers to boolean questions need explanations? Yes. CoRR abs/2112.07772. Cited by: §2.
  • C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §4.1.1.
  • S. Saha, S. Ghosh, S. Srivastava, and M. Bansal (2020) PRover: proof generation for interpretable reasoning over rules. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 122–136. Cited by: §2.
  • S. Saha, P. Yadav, and M. Bansal (2021) MultiPRover: generating multiple proofs for improved interpretability in rule reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 3662–3677. Cited by: §2.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017. Cited by: §1, §2.
  • N. Shazeer and M. Stern (2018) Adafactor: adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4603–4611. Cited by: Appendix F.
  • V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 4615–4629. Cited by: §2.
  • O. Tafjord, B. Dalvi, and P. Clark (2021) ProofWriter: generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 3621–3634. Cited by: §2, §5.2.
  • A. Talmor, O. Tafjord, P. Clark, Y. Goldberg, and J. Berant (2020) Leap-of-thought: teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems 33, NeurIPS 2020, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). Cited by: §2.
  • M. Thayaparan, M. Valentino, and A. Freitas (2020) A survey on explainability in machine reading comprehension. CoRR abs/2010.00389. Cited by: §2.
  • M. Valentino, M. Thayaparan, and A. Freitas (2021) Unification-based reconstruction of multi-hop explanations for science questions. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), pp. 200–211. Cited by: §2.
  • S. Wiegreffe and A. Marasovic (2021) Teach me to explain: a review of datasets for explainable natural language processing. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). Cited by: §2.
  • Z. Xie, S. Thiem, J. Martin, E. Wainwright, S. Marmorstein, and P. A. Jansen (2020) WorldTree V2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis (Eds.), pp. 5456–5473. Cited by: §1, §2.
  • V. Yadav, S. Bethard, and M. Surdeanu (2019) Quick and (not so) dirty: unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 2578–2589. Cited by: §2.
  • V. Yadav, S. Bethard, and M. Surdeanu (2020) Unsupervised alignment-based iterative evidence retrieval for multi-hop question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 4514–4525. Cited by: §2.
  • Q. Ye, X. Huang, E. Boschee, and X. Ren (2020) Teaching machine comprehension with compositional explanations. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Findings of ACL, Vol. EMNLP 2020, pp. 1599–1615. Cited by: §2.
  • H. Zhang, X. Zhao, and Y. Song (2020) WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 5736–5745. Cited by: §2.

Appendix A Entailment Modules Training Details

A.1 Synthetic Data

We follow ParaPattern Bostrom et al. (2021) to collect synthetic training data for the entailment modules. Since ParaPattern only considers substitution and contraposition deductions, we extend the method to conjunction and if-then deductions by designing specific syntactic templates and construction rules. Table 9 shows the syntactic patterns used. We use spaCy to match sentences from Wikipedia (version “20200501.en”). In total, we collect about 24k, 443k, and 97k sentences for the substitution, conjunction, and if-then modules, respectively. We follow Bostrom et al. (2021) to train the modules on the synthetic data with a learning rate of 3e-5 for 1 epoch.

A.2 Reasoning Type Annotations of EntailmentBank

The original steps in EntailmentBank are not annotated with reasoning types. We manually annotate the reasoning types of 400 steps in the training split (Train-manual) and 275 steps in the development split (Dev-manual). To label the remaining steps in the training split, we train a classifier on the Train-manual steps. We use RoBERTa-large Liu et al. (2019) as our classifier. It achieves an accuracy of 88% on the Dev-manual steps. We use the classifier to predict the reasoning types of the remaining 2-premise steps and take the predicted types as pseudo labels (Train-pseudo). Table 8 shows the statistics of the reasoning type annotations.

Split Sub. Conj. If-then All
Train-manual 211 105 84 400
Train-pseudo 2,441 812 535 3,788
Dev-manual 153 71 51 275
Table 8: Statistics of the step reasoning type annotations.

Appendix B Controller Training Details

Training Data. We decompose the gold entailment trees into several intermediate states for training. For example, the tree in Figure 1(c) can be decomposed into three positive states. One of these states has two distractors, one positive deductive step, and one positive abductive step. We add disturbances to the trees to make positive and negative states. For a given state, consider a fact that is the conclusion of a gold step. We use a deductive module to predict a conclusion from the premises of that step. If the predicted conclusion is correct, we replace the gold conclusion with the predicted one to make a new positive state, which can be used to perform further reasoning. Otherwise, we replace the gold conclusion with the predicted one to make a negative state, which contains an incorrect conclusion and thus should not be used for further reasoning. The reasoning controller is trained to distinguish between the positive and negative states and to give the positive state a higher state score than the negative one. To judge whether the generated conclusion is correct, we follow the evaluation metrics of Dalvi et al. (2021) and use BLEURT. The predicted conclusion is considered correct if the BLEURT score between it and the gold conclusion is larger than 0.28.
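The disturbance procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `disturb_state` and `token_overlap` are hypothetical, and `token_overlap` is only a toy stand-in for BLEURT so that the example runs without a learned metric. Only the 0.28 threshold comes from the text.

```python
from typing import Callable, List, Tuple

BLEURT_THRESHOLD = 0.28  # threshold stated in the text


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in for BLEURT (assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def disturb_state(
    facts: List[str],
    gold_conclusion: str,
    predicted_conclusion: str,
    similarity: Callable[[str, str], float],
    threshold: float = BLEURT_THRESHOLD,
) -> Tuple[List[str], bool]:
    """Replace the gold conclusion with the predicted one and label the new state.

    Returns the disturbed fact list and True for a positive state
    (similarity above the threshold), False for a negative state.
    """
    new_facts = [predicted_conclusion if f == gold_conclusion else f for f in facts]
    is_positive = similarity(predicted_conclusion, gold_conclusion) > threshold
    return new_facts, is_positive
```

In practice the `similarity` argument would be a real BLEURT scorer; the controller would then be trained to score the positive states above the negative ones.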

Appendix C Reasoning Algorithm and Hyperparameter Analysis

Input: hypothesis, fact sentences, controller, deductive modules, abductive modules
Parameter: beam size, max reasoning depth, distractor threshold, step sampling rate

// Construct initial reasoning state
Remove the fact sentences whose fact score is less than the distractor threshold
initial state ← (hypothesis, the filtered sentences)
beam ← {initial state}
// Reasoning with beam search
while the depth does not reach the max reasoning depth do
   candidate states ← ∅
   for each state in the beam do
      // Select promising steps
      for each of the steps of the state with top step scores (chosen by the step sampling rate) do
         // Single-step entailment reasoning
         for each module in the deductive modules or the abductive modules do
            execute the step with the module and obtain a novel intermediate fact
            construct a new state with the step and the fact
            add the new state to the candidate states
         end for
      end for
   end for
   // Verify and select states
   beam ← the beam-size states with the highest state scores from the candidate states
   increase the depth by one
end while
// Construct the entailment tree
for each state in the beam do
   Align the target of the state to the most similar fact sentence of the state to make a tree
end for
Select the tree with the highest score
Return: the entailment tree
Algorithm 1 Reasoning Algorithm

Algorithm 1 shows the whole reasoning process. The hyperparameters are selected with the development split, as shown in Figure 6. We select a beam size of 10, a max reasoning depth of 5, a distractor threshold of 0.001, and a step sampling rate of 10%. We only consider the steps whose sentences have word overlap. When constructing the entailment tree, we use the BLEURT scores to align the target of a state to the most similar fact. Note that when making a new reasoning state with a step and its novel intermediate fact, if the step is a backward abductive step, we replace the original target hypothesis with the novel intermediate fact and treat it as the target hypothesis that the new state aims to prove (as shown in Figure 2). We run our method three times and report the average performance.
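The search loop of Algorithm 1 can be sketched in Python as below. This is a simplified sketch under stated assumptions, not the released implementation: `fact_score`, `step_score`, `state_score`, `deduce`, and `abduce` are hypothetical callables standing in for the controller and the entailment modules, steps are restricted to 2-premise deductions and 1-premise abductions, and the final tree construction (aligning the target to the most similar fact) is omitted. The default hyperparameters are the values reported above.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class State:
    target: str             # hypothesis the state aims to prove
    facts: Tuple[str, ...]  # fact sentences still available


def share_word(a: str, b: str) -> bool:
    """True if the two sentences have at least one token in common."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def beam_search(
    hypothesis: str,
    facts: Sequence[str],
    fact_score: Callable[[str, str], float],   # hypothetical controller head
    step_score: Callable[[State, tuple], float],
    state_score: Callable[[State], float],
    deduce: Callable[[str, str], str],         # hypothetical deductive module
    abduce: Callable[[str, str], str],         # hypothetical abductive module
    beam_size: int = 10,
    max_depth: int = 5,
    distractor_threshold: float = 0.001,
    step_rate: float = 0.10,
) -> List[State]:
    # Construct the initial state after removing likely distractors.
    kept = tuple(f for f in facts if fact_score(hypothesis, f) >= distractor_threshold)
    beam = [State(hypothesis, kept)]
    for _ in range(max_depth):
        candidates: List[State] = []
        for state in beam:
            # Only consider steps whose sentences have word overlap.
            steps = [("deductive", a, b)
                     for a, b in combinations(state.facts, 2) if share_word(a, b)]
            steps += [("abductive", state.target, f)
                      for f in state.facts if share_word(state.target, f)]
            steps.sort(key=lambda s: step_score(state, s), reverse=True)
            # Keep the top fraction of steps, but at least one if any exist.
            top = steps[:max(1, int(len(steps) * step_rate))]
            for kind, x, y in top:
                if kind == "deductive":
                    rest = tuple(f for f in state.facts if f not in (x, y))
                    candidates.append(State(state.target, rest + (deduce(x, y),)))
                else:
                    # Backward step: the generated fact becomes the new target.
                    rest = tuple(f for f in state.facts if f != y)
                    candidates.append(State(abduce(x, y), rest))
        if not candidates:
            break
        # Verify and select states with the highest state scores.
        candidates.sort(key=state_score, reverse=True)
        beam = candidates[:beam_size]
    return beam
```

How a state is updated after a step (which premises are removed, how the abductive target replaces the hypothesis) is an assumption of this sketch; the paper's Figure 2 defines the actual state transitions.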

Appendix D Heuristic Reasoning Algorithm without the Controller

To investigate the effect of the reasoning controller for entailment tree generation, we design a heuristic generation algorithm that does not use the reasoning controller. Since the cost of traversing the entire search space is unaffordable, we adopt beam search. In each reasoning state, we try all possible steps with the entailment modules and make new candidate reasoning states. To select the correct states, we use BLEURT scores as the heuristic information to guide the search process. Specifically, given a candidate state, we estimate the similarity between each fact and the target with BLEURT and then score the candidate state based on these similarities. The candidate states with the highest state scores, up to the beam size, are selected to perform further reasoning. We use the same beam size as the algorithm with the controller.

Table 9: The syntactic patterns used for data scraping and the training examples for deductive entailment modules. Each pattern node specifies the dependency relations, the part-of-speech tags, and the lemmatized form of the matching token, and indicates whether the matching token and its subtree will be used as a match variable for rule-based rewriting. Alternatives within a pattern are joined by an “or” operator.
Figure 6: Hyperparameter analysis on the Task2 development split.

Appendix E Experiment Details on eQASC and eOBQA

For each question+answer pair, eQASC/eOBQA provides the corresponding hypothesis, about 10/4 facts, and a candidate set of steps. Each candidate step is a 2-premise single step from two facts to the hypothesis and can be viewed as a one-step entailment tree with three nodes. The target is to select the correct trees/steps from the candidate set. There might be more than one correct tree in the candidate set. We conduct experiments on the questions with at least one correct entailment tree (677 eQASC questions and 79 eOBQA questions). Since the given facts contain distractors, we adopt the Task2 models trained on EntailmentBank (without further fine-tuning on eQASC and eOBQA) to perform cross-dataset experiments.

For our method, we follow our Task2 reasoning algorithm to select from the candidate trees/steps. Specifically, we first filter out the facts with low fact scores using a threshold (selected using the development split). Then we predict the step scores for the candidate steps and select the step with the highest score. For EntailmentWriter, we feed the hypothesis and the facts to the model and score each candidate step with a score derived from the perplexity of the sequence segment that represents the step in the official EntailmentWriter output format.

We follow the official evaluation metrics of eQASC and eOBQA. The P@1 (Precision@1) measures the fraction of cases where the selected tree (topmost ranked) is correct. It is equivalent to the Overall AllCorrect score between the top-1 predicted one-step tree and the best-matching gold tree. The NDCG (Normalized Discounted Cumulative Gain) metric measures how well ranked the candidate trees are when ordered by the predicted scores. It reflects the model’s ability to distinguish the validity of trees and rank the correct trees ahead of the incorrect ones.
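For concreteness, the two metrics can be computed per question as in the sketch below. It assumes binary relevance labels (1 for a valid candidate tree, 0 otherwise) and the standard log2 discount; the official eQASC/eOBQA evaluation script may differ in details such as tie handling, and the function names here are illustrative.

```python
import math
from typing import Sequence


def precision_at_1(scores: Sequence[float], labels: Sequence[int]) -> float:
    """1.0 if the highest-scored candidate is valid, else 0.0."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return float(labels[top])


def ndcg(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sorted(labels, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The dataset-level numbers are then the averages of these per-question values over all questions with at least one valid tree.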

Appendix F Main Experimental Environments

We deploy all models on a server with 500GB of memory and one 40GB A100 GPU. Specifically, the server runs Ubuntu 21.04, and our code mainly depends on Python 3.8.10 and PyTorch 1.7.1. We use the pre-trained language models from HuggingFace Transformers. We use the Adafactor optimizer Shazeer and Stern (2018) implemented by HuggingFace Transformers.

Figure 7: An example case illustrating the potential inaccuracy of the automatic evaluation metrics. In the predicted tree, one leaf fact is a distractor and one step is not a valid entailment. Following the official evaluation code, two nodes in the predicted tree are aligned to two nodes in the gold tree (the dotted line). By comparing the aligned intermediate nodes, the predicted tree achieves a Step F1 score of 0.0 and an Intermediate F1 score of 1.0. An Intermediate F1 score of 1.0 would suggest that the predicted tree has perfect intermediate conclusions. However, one predicted conclusion is not entailed by its premises.

Appendix G Discussion on the Automatic Evaluation

As discussed by Dalvi et al. (2021), the automatic entailment tree evaluation metrics might misjudge in some cases (e.g., tree structure variation) and still need to be improved. In fact, how to quantitatively evaluate a predicted tree remains a challenging problem. In the existing metric, the first step is the tree alignment algorithm Dalvi et al. (2021). The nodes in the predicted tree are aligned to the nodes in the gold tree for further comparison. Each non-leaf node of the predicted tree is aligned to the first non-leaf node of the gold tree where the Jaccard similarity of their respective leaf sentences is maximum. Any predicted node with zero Jaccard similarity to all gold nodes is aligned to a dummy gold node with a blank conclusion. In the official implementation, (1) each gold node may correspond to more than one predicted node, while there is no penalty for duplication when calculating Intermediate F1; (2) the root node (the given hypothesis sentence, which is identical in the predicted and gold trees) is trivially viewed as a normal intermediate node (a novel generated intermediate sentence). For these two reasons, the Intermediate F1 might be high (suggesting that the model can draw correct intermediate conclusions from the premises) even when the Step F1/AllCorrect is relatively low (indicating that the model does not select the correct premises for the intermediate nodes). For example, the EntailmentWriter (T5-11B) for Task3 achieves an Intermediate F1 of 36.4% while the Step F1/AllCorrect is only 7.8%/2.9% Dalvi et al. (2021). Figure 7 shows a specific case.

To alleviate the inaccuracy caused by the above reasons, we mainly use the more strict metrics (i.e., Leaves/Steps/Intermediates/Overall AllCorrect) for comparison. Furthermore, we adopt manual evaluation on the full trees and individual steps to make a more accurate comparison (Sec. 6.2).
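The alignment step described in this appendix can be sketched as follows. This is an illustrative reading of the description, not the official evaluation code: each non-leaf node is represented only by the set of leaf sentence ids beneath it, the names `jaccard` and `align` are hypothetical, and `None` plays the role of the dummy gold node.

```python
from typing import Dict, Hashable, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of leaf sentence ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(
    pred_nodes: Dict[Hashable, Set[str]],
    gold_nodes: Dict[Hashable, Set[str]],
) -> Dict[Hashable, Optional[Hashable]]:
    """Map each predicted non-leaf node to its best-matching gold non-leaf node.

    A predicted node with zero similarity to every gold node maps to None,
    which stands in for the dummy gold node with a blank conclusion.
    """
    alignment: Dict[Hashable, Optional[Hashable]] = {}
    for p, p_leaves in pred_nodes.items():
        best, best_sim = None, 0.0
        for g, g_leaves in gold_nodes.items():
            sim = jaccard(p_leaves, g_leaves)
            if sim > best_sim:  # strict comparison keeps the first maximum
                best, best_sim = g, sim
        alignment[p] = best
    return alignment
```

Because the mapping is computed independently for each predicted node, several predicted nodes can land on the same gold node, which is the duplication issue noted in point (1).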

Appendix H Case Study

We show some entailment trees generated by our MetGen-separated on the Task2 questions in Figures 8, 9, 10, and 11. MetGen can generate a valid entailment tree that has a different structure from the gold one (Figure 8). MetGen can handle medium-complexity questions, generate valid entailment trees, and provide the reasoning types of steps (Figures 9 and 10). Questions that require more complex reasoning (e.g., the gold tree in Figure 11 requires 11 leaf facts and 8 entailment steps) remain challenging. Although the full tree generated by our method for such a complex question may not be entirely correct, some intermediate conclusions (see Figure 11) are still reliable.

Figure 8: Case 1. The predicted entailment tree consists of two 2-premise steps, while the gold tree consists of one 3-premise step. Under the automatic evaluation metric, the predicted tree would be rated as invalid (Overall AllCorrect = 0), since the predicted steps do not match the gold step. However, the predicted tree should be valid because each step in the tree is a valid entailment (i.e., the 3-premise step can be decomposed into two valid 2-premise steps). It would be rated as valid under manual evaluation.
Figure 9: Case 2. Explaining the question and answer in this case requires 5 leaf facts from the given 25 facts. MetGen can select the correct facts, generate valid entailment trees, and provide the reasoning types of steps.
Figure 10: Case 3. MetGen can handle medium-complexity questions and provide the reasoning types of steps.
Figure 11: Case 4. The question requires more complex reasoning, where the gold tree contains 11 leaf facts and 8 entailment steps. Although the full tree generated by MetGen is not entirely correct, some of its intermediate conclusions are still reliable.