Measuring Association Between Labels and Free-Text Rationales

Interpretable NLP has taken increasing interest in ensuring that explanations are faithful to the model's decision-making process. This property is crucial for machine learning researchers and practitioners using explanations to better understand models. While prior work focuses primarily on extractive rationales (a subset of the input elements), we investigate their less-studied counterpart: free-text natural language rationales. We demonstrate that existing models for faithful interpretability do not extend cleanly to tasks where free-text rationales are needed. We turn to models that jointly predict and rationalize, a common class of models for free-text rationalization whose faithfulness is not yet established. We propose measurements of label-rationale association, a necessary property of faithful rationales, for these models. Using our measurements, we show that a state-of-the-art joint model based on T5 has strengths and weaknesses for producing faithful rationales.




1 Introduction

The goal of interpretable NLP is to better understand predictive models’ internals for purposes such as debugging, validating safety before deployment, and revealing unintended biases and behavior. These objectives require faithful rationales, i.e., explanations of the model’s behavior that are accurate representations of its decision process Melis and Jaakkola (2018).

Figure 1: A categorization of interpretable NLP on an illustrative faithfulness spectrum. Unlike for IE tasks, little is known about what are valid assumptions for rationalizing reasoning tasks requiring free-text rationales, and how to place models for such rationalization on the spectrum. We propose to measure label-rationale association within self-rationalizing models that produce free-text rationales, much like how prior work has quantified the faithfulness of attention Jain and Wallace (2019); Brunner et al. (2020).
CommonsenseQA (CoS-E) Question: While eating a hamburger with friends, what are people trying to do?
Answer choices: have fun, tasty, or indigestion
Natural language rationale: Usually a hamburger with friends indicates a good time.

Natural Language Inference (E-SNLI)

Premise: A child in a yellow plastic safety swing is laughing as a dark-haired women in pink and
coral pants stands behind her.
Hypothesis: A young mother is playing with her daughter in a swing.
Label choices: neutral, entailment, or contradiction
Natural language rationale: Child does not imply daughter and woman does not imply mother.
Table 1: Examples from the CoS-E v1.0 and E-SNLI datasets (§2). Extractive rationales annotated by humans are highlighted, while human-written free-text rationales are presented underneath the answer/label choices. These examples illustrate that the extractive rationales fail to adequately explain the correct answer/label (underlined).

One way toward faithfulness is to introduce architectural modifications or constraints that produce rationales with desirable properties (Lei et al., 2016; Andreas et al., 2016; Schwartz et al., 2018, to name a few). For example, pipeline models (Figure 2) were designed for information extraction (IE) tasks for which an effective rationale can be extracted as a subset of the input and sufficient to make a prediction on its own, without the rest of the input Lei et al. (2016). Such models approach faithfulness by construction, if there is no gradient flow between the explainer and predictor modules Jain et al. (2020).

There is a growing interest in tasks that require world and commonsense “knowledge” and “reasoning”, e.g., CommonsenseQA Talmor et al. (2019) and natural language inference Bowman et al. (2015). Here, extractive rationales necessarily fall short; rationales must instead take the form of free-text natural language to fill in the reasoning or knowledge gap Camburu et al. (2018); Rajani et al. (2019). (Footnote 1: We use “free-text” and “natural language” rationales interchangeably.) In Table 1, for example, the highlighted extractive rationale of the first problem instance lacks at least one reasoning step to adequately justify the answer; the natural language rationale (which is not extractive) fills the gap.

We first show that, for these tasks, a “self-rationalizing” model (a fully differentiable model that jointly predicts the task output with the rationale) provides rationales that are more label-informed (a desirable property) than rationales from a pipeline that rationalizes first, then predicts with a separate model (§3.1). Next, we show that sufficiency is not universally applicable: a natural language rationale on its own does not generally provide enough information to arrive at the correct answer (§3.2). These findings suggest that a faithful-by-construction pipeline is not an ideal approach for reasoning tasks, leading us to ask: is there a way to achieve faithful free-text rationalization with self-rationalizing models?

We note that there is as-of-yet no way to assess the relationship between a prediction and a free-text rationale within the same fully differentiable model. Jacovi and Goldberg (2020b) argue for the development of evaluations that measure the extent and likelihood that a rationale (extractive or free-text) is faithful in practice. We respond to that call by proposing new measurements for label-rationale association in self-rationalizing models that provide free-text rationales, a necessary property of faithful rationales.

The first measurement, robustness equivalence (§4.1), quantifies whether the predicted label and generated rationale are similarly robust to noise. The second measurement, attribution overlap (§4.2), quantifies whether the gradient attributions of the input with respect to the predicted label are similar to the gradient attributions with respect to the predicted rationale. In a case study, we consider a self-rationalizing finetuned variant of T5 Raffel et al. (2020); Narang et al. (2020), which demonstrates good robustness equivalence but poor attribution overlap. This result motivates future work both on label-rationale association measurements and on models that perform well on them. (Footnote 2: Our code will be available at …)

2 Tasks, Datasets, and Models

Figure 2: An illustration of a pipeline model (composed of IR and RO; §2) for CoS-E v1.0 with a human-written rationale. The dotted line indicates two separate models with no gradient flow.
Figure 3: An example of a joint architecture (IOR; §2) for CoS-E v1.0 with a human-annotated rationale. Trained on both task signal and human rationales, these models are effective at generating fluent rationales with no loss to task performance (§2). We propose to measure the extent to which the predicted labels and explanations are associated within this model (§4).

Before turning to our analyses, we introduce the datasets and models used in our experiments.

Tasks and Datasets

We explore the two available large-scale datasets containing natural language rationales, both in English: E-SNLI Camburu et al. (2018), an extension of SNLI Bowman et al. (2015); and CoS-E Rajani et al. (2019), an extension of CommonsenseQA Talmor et al. (2019). For the former, the task is to infer whether a given hypothesis sentence entails, contradicts, or is neutral towards a premise sentence. For the latter, the task is to select the correct answer from 3 (v1.0) or 5 (v1.11) answer choices for a question. (Footnote 3: We use both versions due to noise in v1.11; see App. A.) Table 1 contains examples. See data statistics in Table 5 in Appendix A. (Footnote 4: CoS-E does not contain test set rationale annotations, so we report performance values on the development set.)

T5 Models

All of the models in this work are based on T5, though our methods can in principle be applied to any architecture. The base version of T5 is a 220M-parameter Transformer encoder-decoder Vaswani et al. (2017). To carry out supervised finetuning, T5 is trained by maximizing the conditional likelihood of the correct text output (from annotated data), given the text input.

We finetune four T5 instances for each dataset:

  • IR, which maps task inputs to rationales, without ever being exposed to task outputs.

  • RO, which maps explanations to task outputs. The only input elements this model is exposed to are answer choices (for CoS-E).

  • IOR, which maps inputs to outputs followed by explanations.

  • IRO, which maps pairs of the input and rationale to outputs.

We provide input-output formatting in Table 6 (Appendix A.3). Using these building blocks, we can instantiate two important approaches.

Pipeline Model (IR;RO)

This architecture composes IR with RO, each of which is trained entirely separately, for a total of 440M parameters. It is illustrated in Figure 2.

The vast majority of prior work using pipelines has focused only on extractive rationales (see Table 4 in Appendix A.1). Pipelines that have no gradient flow between IR and RO components are (almost) faithful-by-construction (first half of Table 4), unlike pipelines that take a differentiable approach (second half of Table 4). (Footnote 5: Jacovi and Goldberg (2020a) present “trojan explanations” as a possible shortcoming of such models.)

Self-Rationalizing Model (IOR)

A joint, self-rationalizing model Melis and Jaakkola (2018), illustrated in Figure 3, predicts both a label and rationale during inference. This is the most common approach to free-text rationalization Hendricks et al. (2016); Kim et al. (2018); Camburu et al. (2018); Ehsan et al. (2018); Liu et al. (2019); Wu and Mooney (2019); Narang et al. (2020); Do et al. (2020), but little is understood about its faithfulness. IOR models are desirable for their ease-of-use, task-effectiveness, parameter-efficiency, and their ability to generate fluent and plausible rationales. We expect models of this kind to play an important role in continuing research on explainable AI using natural language for these reasons.

We use the IOR variant of T5 (Narang et al., 2020, known as “WT5?!”). Because only one instance of T5 is used here, the total number of parameters is half that of the pipeline. We replicate three prior findings using T5 (Tables 7 and 8 in Appendix A.5):

  • For all tasks, adding rationale generation (IOR) to a non-rationalizing model (IO) does not result in a substantial loss in accuracy.

  • The pipeline does not perform as well as the self-rationalizing model, despite having double the parameters.

  • T5 outperforms other pretrained models.

3 Shortcomings of Free-Text Pipelines

We analyze faithful-by-construction pipelines (IR;RO) for free-text rationalization with respect to two properties: label-informedness of generated rationales (§3.1) and appropriateness of the sufficiency assumption (§3.2).

3.1 Joint Models Produce More Label-Informed Rationales

Dataset       RO with R   RO with IR rationales (Δ to R)   RO with IOR rationales (Δ to R)
E-SNLI        97.74       86.25 (-11.49)                   90.52 (-7.22)
CoS-E v1.0    85.26       56.42 (-28.84)                   84.84 (-0.42)
CoS-E v1.11   68.14       46.52 (-21.62)                   71.42 (+3.28)
Table 2: Accuracy of the trained RO model evaluated with ground-truth natural language rationales (R), and natural language rationales generated from two model architectures: IOR and IR (see §2 for model descriptions). These results show that rationales that are generated as a function of the input and the predicted label (IOR) are more label-informed. Test-set accuracy is reported for E-SNLI and dev-set accuracy for CoS-E v1.0 and v1.11.

Rationales should be a function of the input and the predicted label. To demonstrate why this is the case, consider training the IR model on a dataset with multiple annotation layers, e.g., OntoNotes, that contains word sense, predicate structure, and coreference Pradhan et al. (2007). This model would produce the same rationale, regardless of the task being rationalized. Prior work has also critiqued IR;RO models because it is counter-intuitive to generate a rationale before deciding the label to explain. Therefore, the IR model will first need to implicitly predict a label Kumar and Talukdar (2020); Jacovi and Goldberg (2020a). But can IR infer the label well, when it is trained without label signal?

To address this question, we study whether IOR rationales are better at predicting the label than IR rationales.

To this end, we evaluate a trained RO model using the following inputs:

  • test-set R rationales,

  • test-set rationales generated by IOR, and

  • test-set rationales generated by IR.

The ability of generated rationales to recover R’s performance indicates their label-prediction ability and distributional similarity to the ground-truth R rationales—a measure of rationale quality and label-informedness.

In Table 2, we show that IOR rationales recover far more ground-truth (R) performance than IR rationales (and in one case, are even better than R). Rationales from the IR model lag far behind. This empirically demonstrates that training on label signal O is important to generating good-quality rationales for the tasks studied.

These results quantify a weakness of natural language rationale IR;RO pipelines: poor-quality rationales result in cascading errors. Rationales should be generated with label signal to avoid this.
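The comparison in Table 2 reduces to simple arithmetic on the reported accuracies. The following is a minimal sketch that recomputes the gap to the ground-truth rationales for each rationale source; the numbers are copied from Table 2, while the function names and the "recovered fraction" summary ratio are illustrative assumptions rather than quantities defined in the paper.

```python
# Accuracies (%) of the trained RO model, copied from Table 2.
TABLE_2 = {
    "E-SNLI":      {"R": 97.74, "IR": 86.25, "IOR": 90.52},
    "CoS-E v1.0":  {"R": 85.26, "IR": 56.42, "IOR": 84.84},
    "CoS-E v1.11": {"R": 68.14, "IR": 46.52, "IOR": 71.42},
}


def gap_to_gold(row, source):
    """Accuracy difference (in points) between generated and gold rationales."""
    return round(row[source] - row["R"], 2)


def recovered_fraction(row, source):
    """Hypothetical summary: share of gold-rationale accuracy that is recovered."""
    return row[source] / row["R"]


for name, row in TABLE_2.items():
    print(
        name,
        "IR gap:", gap_to_gold(row, "IR"),
        "IOR gap:", gap_to_gold(row, "IOR"),
        "IOR recovered:", f"{recovered_fraction(row, 'IOR'):.1%}",
    )
```

On every dataset the IOR gap is smaller in magnitude than the IR gap, which is the pattern the text above describes.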

3.2 Sufficiency is not Universally Valid

Dataset       RO with R   IRO with R (Δ to RO)
E-SNLI        97.74       98.77 (+1.03)
CoS-E v1.0    85.26       90.53 (+5.27)
CoS-E v1.11   68.14       80.10 (+11.96)
Table 3: A comparison of the IRO and RO models (§2) evaluated with ground-truth natural language rationales (R). In some cases accuracy improves substantially with the addition of the input, indicating that rationales are not always sufficient and pipelines are not always effective. Test-set accuracy is reported for E-SNLI and dev-set accuracy for CoS-E v1.0 and v1.11.

Faithful-by-construction pipelines (§2) rely on the sufficiency assumption (the selected rationale must be sufficient to make the prediction without the input). This assumption is suitable for information extraction tasks for which a subset of the input tokens is predictive of the label. As a matter of fact, humans can serve as the RO model and make predictions with high accuracy on certain IE tasks Jain et al. (2020).

To illustrate why sufficiency might not be justified for reasoning tasks, consider the example in Figure 2. The task of the RO model is to select between the answer choices “have fun”, “tasty”, and “indigestion” given the rationale “Usually a hamburger with friends indicates a good time”. The rationale is designed to complement the input question, but the RO model does not see the question, changing the fundamental nature of the task it is solving. We thus wonder: does task obfuscation hurt pipelines’ ability to perform the task?

We report the accuracy difference between an RO model and a model that receives both the input and rationale (IRO). We evaluate on the ground-truth natural language rationales (R). (Footnote 6: Evaluating on R serves as an upper-bound on pipeline performance, removing the confounding factor observed in §3.1 that IR rationales can be poor.)

In Table 3, we observe that on the CoS-E datasets, the IRO model has a 5–12 point increase in accuracy over the RO model, indicating that the rationales are not sufficient. On E-SNLI, this difference is much smaller (1 point). This is likely due to the fact that the dataset was collected by instructing annotators to provide self-contained rationales rather than rationales that reference the premise and hypothesis. However, using dataset collection to explicitly collect sufficient rationales does not address the unnaturalness of such a task formulation. (Footnote 7: Camburu et al. (2018) give an example: the rationale “A woman is not a person” could predict either a contradiction or entailment label depending on the input.) These results indicate that (especially) in the case of CoS-E, sufficiency is not a valid assumption, and the use of IR;RO models is suboptimal in such cases.

It may seem obvious to model pipelines as IR;IRO to solve this problem. However, such models no longer exhibit faithfulness-by-construction: the IRO model could ignore the rationale and perform the task as if it were IO.

This leaves practitioners with a difficult decision: model a task counter-intuitively by dropping the input, or retain the input for intuition and performance but lose faithfulness guarantees. In light of previous results (§3.1), we expect that a better alternative is to use an IOR model.

4 Quantifying Faithfulness

Given that pipeline models have shortcomings for reasoning tasks, we turn our focus to joint self-rationalizing models (IOR) as a popular alternative.

The extent to which such models exhibit faithful rationalization has not been studied. To illustrate this point, we reference Narang et al. (2020):

…Much like humans, our approach does not guarantee that the produced explanation actually explains the specific reasons why a model generated its prediction. In other words, the model could potentially just make up a reasonable-sounding explanation instead of providing a truly accurate description of its causal decision-making process.

It is not infeasible that a large, overparameterized model trained on both gold-rationale emulation and a labelling task can learn to do both equally well, without having to rely on shared information in its parameters.

Therefore, rationales from IOR models cannot be treated as faithful explanations without further investigation.

At minimum, rationales must be implicitly or explicitly tied to the model’s prediction decision. We call this label-rationale association. We present two analyses that investigate to what extent IOR models exhibit this property: robustness equivalence (§4.1) and attribution overlap (§4.2). The results allow us to better understand where to place IOR models on the faithfulness gradient.

4.1 Robustness Equivalence Measure

Figure 4: Results of label-rationale association measured with robustness equivalence under varying amounts of Gaussian noise.
(a) Accuracy of the RO model on IOR generated rationales for E-SNLI (left) and CoS-E v1.0 (right).
(b) Accuracy of the IOR model and percentage of label flips for E-SNLI (left) and CoS-E v1.0 (right).

We aim to quantify whether predicted labels and rationales are similarly or dissimilarly robust to noise applied to the input. The former indicates strong label-rationale association, while the latter indicates the opposite. Under a given amount of Gaussian noise, there are four possible cases for a model’s output:

  • stable label, stable rationale

  • unstable label, unstable rationale

  • stable label, unstable rationale

  • unstable label, stable rationale

Numbering these cases 1 to 4 in the order listed, Cases 1 and 2 indicate that both tasks are equally affected by noise. Cases 3 and 4 are failure cases: if one output is stable and the other is not, we conclude they cannot be strongly associated within the model.
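The four cases can be written down as a small decision function. This is a minimal sketch: the case numbering follows the order of the list above, and the stability test (an accuracy drop within a tolerance) is a hypothetical criterion introduced only for illustration, since the text identifies stability by inspecting the accuracy curves rather than by a fixed threshold.

```python
def is_stable(clean_accuracy, noisy_accuracy, tolerance=1.0):
    """Hypothetical criterion: stable if accuracy drops by at most `tolerance` points."""
    return (clean_accuracy - noisy_accuracy) <= tolerance


def robustness_case(label_stable, rationale_stable):
    """Map the two stability flags to Cases 1-4, in the order of the list above."""
    if label_stable and rationale_stable:
        return 1  # stable label, stable rationale
    if not label_stable and not rationale_stable:
        return 2  # unstable label, unstable rationale
    if label_stable:
        return 3  # stable label, unstable rationale
    return 4      # unstable label, stable rationale


def shows_robustness_equivalence(case):
    """Cases 1 and 2 mean both outputs react to noise in the same way."""
    return case in (1, 2)


# Example with made-up accuracies at one noise level.
case = robustness_case(
    label_stable=is_stable(90.0, 89.5),
    rationale_stable=is_stable(88.0, 60.0),
)
print(case, shows_robustness_equivalence(case))
```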


We apply 0-mean Gaussian noise to each input embedding in the encoder at inference time. We measure changes in label prediction by counting the number of predicted labels in the test set which flip, i.e., change from their original prediction to something else, alongside changes in accuracy of the IOR model. (Footnote 8: A flip is either to another valid (but incorrect) answer choice, or to another word that is not in the set of possible answers altogether, since T5 generates answers over the entire set of vocabulary tokens.)
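The label-flip measurement can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the IOR model (it is not T5), the embeddings are tiny hand-written lists, and `sigma` denotes the standard deviation of the zero-mean Gaussian noise.

```python
import random


def add_gaussian_noise(embeddings, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every embedding value."""
    return [[x + rng.gauss(0.0, sigma) for x in vector] for vector in embeddings]


def toy_predict(embeddings):
    """Hypothetical stand-in for the model: the label is the sign of the summed embeddings."""
    total = sum(sum(vector) for vector in embeddings)
    return "A" if total >= 0 else "B"


def flip_rate(dataset, sigma, seed=0):
    """Fraction of instances whose predicted label changes once noise is applied."""
    rng = random.Random(seed)
    flips = 0
    for embeddings in dataset:
        clean_label = toy_predict(embeddings)
        noisy_label = toy_predict(add_gaussian_noise(embeddings, sigma, rng))
        flips += clean_label != noisy_label
    return flips / len(dataset)


# Three toy instances, each a sequence of two 2-dimensional token embeddings.
DATASET = [
    [[0.3, 0.1], [0.2, -0.1]],
    [[-0.4, 0.0], [-0.1, -0.2]],
    [[0.05, 0.0], [0.0, 0.0]],
]

for sigma in (0.0, 0.1, 1.0):
    print(sigma, flip_rate(DATASET, sigma))
```

Sweeping `sigma` and plotting the flip rate next to model accuracy gives the kind of curve the label-change measure relies on; with no noise the flip rate is zero by construction.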

Measuring the change in rationales is more difficult—we wish to measure whether they have changed in meaning significantly enough to no longer be valid for a given problem instance. There can be multiple valid rationales for a given instance Miller (2019), so paraphrase or semantic similarity measures are not enough. However, perturbed rationales should be able to perform comparably to their unperturbed counterparts at predicting the labels if they remain in-distribution. We use this as a proxy for meaning change.

We report accuracy of the trained RO model on the following inputs:

  • test-set ground-truth (R) rationales,

  • test-set rationales generated by IOR under 0 noise, and

  • test-set rationales generated by IOR under different levels of noise, controlled by the scale of the Gaussian noise.


We present E-SNLI (left) and CoS-E v1.0 (right) results in Figure 4. By examining the regions of largest slope, we gain insights into model behavior.

On the rationale meaning change proxy measure in Figure 4(a) (left), E-SNLI rationales’ performance degrades substantially over an intermediate range of noise levels before converging to random accuracy. Likewise, on the label change measure in Figure 4(b) (left), the number of E-SNLI label flips increases over the same range and corresponds to a drop in accuracy of the IOR model. Therefore, we observe Case 1 (stable label, stable rationale) below that range, and Case 2 (unstable label, unstable rationale) above it. For E-SNLI, the IOR model exhibits robustness equivalence.

For CoS-E v1.0, we observe slightly different behavior: the noise range over which performance drops on the rationale meaning change measure (Figure 4(a); right) differs slightly from the range over which it drops on label change (Figure 4(b); right). Results are highly similar for v1.11 (Figure 11 in Appendix A.7). Barring slight exceptions, we conclude that the IOR model demonstrates high label-rationale association as measured by robustness equivalence for these datasets.

4.2 Attribution Overlap Measure

If label prediction and rationale generation are associated, input tokens important for label prediction should be important for rationale generation and vice versa. (Footnote 9: To achieve the former, Wu and Mooney (2019) train the explanation module of their VQA model to generate free-text rationales that can be traced back with a gradient-based method to the objects that are important for answer prediction.) To measure to what extent IOR models exhibit this property, we measure token importance with gradient-based attribution Simonyan et al. (2014), and quantify overlap between sets.

Gradient Attribution

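For a predicted class $p$, gradient attribution computes the element-wise sum of the gradient of the predicted class’ logit $\ell_p$ with respect to an input token embedding $x^{(i)} \in \mathbb{R}^d$:

$$a^{(i)} = \sum_{k=1}^{d} \frac{\partial \ell_p}{\partial x^{(i)}_k}$$

Intuitively, this measures how much an infinitesimally small change in the input changes the predicted class’ logit, using a first-order Taylor approximation of the logit function. The gradient’s sign indicates the directionality of the change Simonyan et al. (2014). The attribution of a sequence of token embeddings, $x^{(1)}, \dots, x^{(n)}$, is a vector $a = [a^{(1)}, \dots, a^{(n)}]$.

Such methods have been extended to sequence-output models such as neural machine translation He et al. (2019); Ding et al. (2019); Li et al. (2020) by computing the gradient of the sum of decoded logits with respect to the input:

$$a^{(i)} = \sum_{k=1}^{d} \frac{\partial}{\partial x^{(i)}_k} \sum_{t=1}^{T} \ell_{p_t} \qquad (1)$$

where $\ell_{p_t}$ is the logit of the token predicted at decoding step $t$ and $T$ is the length of the decoded output.

Attribution Overlap Measurement

By decomposing the term in Equation 1 into two parts, we obtain two attribution vectors over the input tokens; one for the predicted label logits $\ell_{p_t}$ with $t \in \mathcal{T}_L$, and one for the predicted rationale logits $\ell_{p_t}$ with $t \in \mathcal{T}_R$ in the decoded output, where $\mathcal{T}_L$ and $\mathcal{T}_R$ are the decoding steps of the label and of the rationale. For each input $x^{(i)}$:

$$a_L^{(i)} = \sum_{k=1}^{d} \frac{\partial}{\partial x^{(i)}_k} \sum_{t \in \mathcal{T}_L} \ell_{p_t}, \qquad a_R^{(i)} = \sum_{k=1}^{d} \frac{\partial}{\partial x^{(i)}_k} \sum_{t \in \mathcal{T}_R} \ell_{p_t}$$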
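The decomposition into a label part and a rationale part can be illustrated on a toy differentiable model. The sketch below is not T5: it assumes a hypothetical model in which the logit at each decoding step is a weighted sum of `tanh` applied to the input embeddings, so that the gradient has a closed form and no autodiff library is needed. All names and values are illustrative.

```python
import math


def attribution(token_embeddings, step_weights):
    """Per-token attribution: sum over embedding dimensions of the gradient of the
    summed step logits. Toy model (assumption): logit_t = sum_i w_t . tanh(x_i),
    so d logit_t / d x_i[k] = w_t[k] * (1 - tanh(x_i[k]) ** 2).
    """
    attributions = []
    for token in token_embeddings:
        total = 0.0
        for k, value in enumerate(token):
            local_slope = 1.0 - math.tanh(value) ** 2
            total += sum(weights[k] for weights in step_weights) * local_slope
        attributions.append(total)
    return attributions


# Three input tokens with 2-dimensional embeddings (made-up values).
x = [[0.0, 0.0], [0.5, -1.0], [2.0, 0.3]]

# One decoding step for the label, two for the rationale (made-up weights).
label_steps = [[1.0, -0.5]]
rationale_steps = [[0.2, 0.3], [-0.4, 0.1]]

a_label = attribution(x, label_steps)
a_rationale = attribution(x, rationale_steps)
a_full = attribution(x, label_steps + rationale_steps)

print(a_label)
print(a_rationale)
```

Because the gradient is linear in the summed logits, the attribution over all decoding steps equals the sum of the label and rationale attributions, which is what makes the decomposition well defined.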

To compare the similarity between the two resulting attribution vectors, we report:

  • Kendall’s τ correlation of the rankings obtained by ranking the input tokens by their attribution values. This captures whether the most highly-weighted input features are the same across both vectors.

  • Similarity between the attribution vectors in vector space using cosine similarity. This captures the angle between the vectors as a measure of normalized distance.

We compute the metrics on both raw and absolute-valued attribution scores. In the former, the sign of the values indicates their directional importance. The latter ranks absolute feature importance irrespective of directionality. High cosine similarity and high Kendall’s τ are indicators of high association.
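Both similarity measures are short to compute. The following is a minimal sketch using only the standard library; it implements the tau-a variant of Kendall’s τ (tied pairs contribute zero), which is an assumption, since the text does not state which variant or library was used. The two attribution vectors are made-up values chosen to show how raw and absolute-valued scores can disagree.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two attribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kendall_tau(u, v):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs; ties count as 0."""
    n = len(u)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (u[i] - u[j]) * (v[i] - v[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def absolute(vector):
    """Absolute-valued attributions: importance irrespective of direction."""
    return [abs(value) for value in vector]


# Made-up attributions that agree in magnitude but have opposite signs.
label_attr = [0.5, -0.2, 0.1]
rationale_attr = [-0.5, 0.2, -0.1]

print("raw cosine:", cosine_similarity(label_attr, rationale_attr))
print("abs cosine:", cosine_similarity(absolute(label_attr), absolute(rationale_attr)))
print("raw tau:", kendall_tau(label_attr, rationale_attr))
print("abs tau:", kendall_tau(absolute(label_attr), absolute(rationale_attr)))
```

In this toy case the raw scores indicate opposite directions while the absolute-valued scores indicate identical importance, which is why reporting both views is informative.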


CoS-E v1.0 results are presented in Figure 5 and Figure 6, and E-SNLI and CoS-E v1.11 results in Appendix A.7. We observe that both Kendall’s τ correlation and cosine similarity between important tokens are near zero. We additionally find that on average, instances share 0 tokens in their top-5 between predicted label and rationale. These results indicate that the predicted labels and rationales produced by the IOR model do not share the same important input tokens. We present an example in Figure 7.

Figure 5: Cosine similarity and absolute cosine similarity between the label-attribution vectors and the rationale-attribution vectors on the CoS-E v1.0 dev set. In the absolute case, the distribution is centered around 0.6 (mediocre correlation). In the non-absolute (raw) case, it is around 0.25 (low correlation).
Figure 6: Kendall’s τ between the label-attribution vectors and the rationale-attribution vectors on the CoS-E v1.0 dev set. The distribution is centered around 0 (no correlation between the token ranks).
Figure 7: L1-normalized attributions for the running CoS-E v1.0 example in Figures 2 and 3. The decoded label is “have fun” and the generated rationale is “having fun is the only thing that people are trying to do”. Important input terms vary heavily across the two loss terms. For example, the predicted label term assigns high importance to the first word of the predicted answer choice, “have”, while the explanation attends most heavily to the beginning of the question “While eating”. See Appendix A.6 for details about L1-normalization.


There are potential caveats of this measurement. Important input tokens for rationale generation, besides those relevant for prediction, may also be those that help the decoder to better generate coherent, contentful, or fluent rationales. Therefore, it may be more important to measure the extent to which label attributions are captured in rationale attributions rather than the other way around. Future work will investigate whether directionality influences the result.

Additionally, we observe that gradients tend to exhibit small magnitudes and flat distributions in the encoder-decoder T5 model (see Figures 9 and 10 in Appendix A.7). While gradient attributions are widely used in Transformer architectures as an interpretation technique for machine translation, the effect of small magnitudes on the validity of such attributions has not been explored. In future work, we will investigate the effects this may have on our measurement.

5 Related Work

Analysis of NLP Models

Structural tests for analyzing a model’s structure (internals) such as probing Tenney et al. (2019) and attention analysis Jain and Wallace (2019); Serrano and Smith (2019); Wiegreffe and Pinter (2019); Tutek and Šnajder (2020), as well as behavioral tests such as challenge sets McCoy et al. (2019) and checklists Ribeiro et al. (2020), are conceptually similar to our measurements, but study different model properties.

Robustness Analysis

Robustness of post-hoc extractive interpretability methods has been studied Kindermans et al. (2019); Ghorbani et al. (2019); Heo et al. (2019); Zheng et al. (2019); Slack et al. (2020). Zhang et al. (2020) show that saliency maps and model predictions can be independently adversarially attacked in vision and clinical tasks, and conclude this is due to a misalignment between the saliency map generator and model predictor, or “prediction-interpretation gap”. Such methods have not been tested for models producing natural language (NL) rationales.

Quantifying Faithfulness

The aim of our work is to initiate placing models that provide NL rationales on the faithfulness spectrum conceptualized by Jacovi and Goldberg (2020b), illustrated in Figure 1. The vast majority of prior work that proposes models Jain et al. (2020); Jacovi and Goldberg (2020a) and evaluations DeYoung et al. (2020) of explanatory methods with faithfulness as a goal focus on extractive rationales and rely on the sufficiency assumption.

Turning to exceptions that focus on NL rationales, Latcinnik and Berant (2020) train a differentiable IR;IRO pipeline on CommonsenseQA, controlling for the complexity of the IRO model to increase the likelihood that the model is faithful to the rationale, and ablating input I at inference to test whether the prediction can still be made on the basis of R. Kumar and Talukdar (2020) propose an IOR;IRO pipeline that generates an explanation for every possible label. This is an alternative solution to the problem raised in §3.1. Instead of developing complex architectures, we propose analysis of IOR models.

Concurrently to us, Hase et al. (2020) proposed a simulatability measure that tests how well generated NL rationales support label prediction in T5 models, including IOR. Our measurements are complementary to theirs; both aim for the same goal. Their metric measures the difference between IRO and IO performance. To ensure the metric is not biased toward rationales that give away the label in an unwanted way (e.g., explicitly mention the label but the rest of the rationale does not explain the reasoning), they average between instances that the RO model performs correctly on (potential label leakage) vs. those it does not. Because RO performance can actually be due to multiple factors: (1) label leakage, and (2) semantic quality of the rationale (our assumption in §3.1), disentangling these sources in future work will allow us to better understand the role RO serves as a proxy.

6 Conclusion

After demonstrating the weaknesses certain models exhibit for natural language rationalization tasks, we propose two measurements of label-rationale association in self-rationalizing models. We find that models based on T5 exhibit high robustness equivalence but low attribution overlap, motivating future work on expanding analysis to more properties.

We believe this research direction to be important moving forward, as a complement to development of more complicated architectures. Although our measurements do not guarantee faithfulness, by viewing faithful interpretability as a spectrum, we make a step to quantitatively situate widely used models on it.


Appendix A

A.1 Prior Work

In Table 4, we overview the datasets and types of rationales used in prior work on pipelines. The sources of datasets are: CommonsenseQA Talmor et al. (2019), SNLI Bowman et al. (2015), SST Socher et al. (2013), AgNews Del Corso et al. (2005), Evidence Inference Lehman et al. (2019), Movie Reviews Zaidan and Eisner (2008), MultiRC Khashabi et al. (2018), LGD Linzen et al. (2016), 20 News Lang (1995), Amazon Reviews McAuley and Leskovec (2013), Beer Reviews McAuley et al. (2012), BoolQ Clark et al. (2019), FEVER Thorne et al. (2018).

A.2 Details of Datasets

We summarize dataset statistics in Table 5. The two versions (v1.0, v1.11) of CoS-E correspond to the first and second versions of the CommonsenseQA dataset. CoS-E v1.11 has some noise in its annotations Narang et al. (2020). (Footnote 10: This is our primary motivation for reporting on v1.0 as well, which we observe does not have these issues.)

A.3 Details of T5

The T5 model Raffel et al. (2020) is pretrained on a multi-task mixture of unsupervised and supervised tasks, including machine translation, question answering, abstractive summarization, and text classification. Its inputs and outputs to every task are text sequences; we provide the input-output formatting for training and decoding of our T5 models in Table 6.

A.4 Hyperparameters

To optimize, we use Adam with weight decay with ε = 1e-8, β1 = …, and β2 = …. We use gradient clipping to a maximum norm of 1.0 and a dropout rate of 0.1. We train for 200 epochs with a batch size of 64 and a linear learning rate schedule decaying from 5e-5. Training ends if the development-set loss has not decreased for 10 epochs. At inference time, we greedy-decode until an EOS token is generated (or for 200 tokens).
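The learning rate schedule and stopping rule above can be sketched in plain Python. This is an illustrative sketch rather than the training code used here: the helper names are hypothetical, and the schedule is assumed to decay to zero at the final step, which the text does not state.

```python
def linear_decay_lr(step, total_steps, peak_lr=5e-5):
    """Learning rate that decays linearly from peak_lr at step 0.

    Assumption: the rate reaches 0 at total_steps; the text only says the
    schedule decays linearly from 5e-5.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


def epochs_until_stop(dev_losses, patience=10, max_epochs=200):
    """Return the number of epochs run before training ends.

    Training ends once the development-set loss has not decreased for
    `patience` consecutive epochs, or after `max_epochs` epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(dev_losses), max_epochs)


# Toy example: the loss improves for two epochs and then plateaus, so
# training stops ten epochs after the last improvement.
toy_dev_losses = [1.0, 0.9] + [0.95] * 20
stopped_at = epochs_until_stop(toy_dev_losses)
```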

A.5 Baseline Models

We complement the discussion in §2 with Table 7, which compares the self-rationalizing T5 model to baselines, and Table 8, which compares the self-rationalizing model to its pipeline variant.

A.6 L1-Normalization of Gradient Attributions

L1-normalization can be applied to the attribution vector (§4.2) to normalize its magnitude and study the relative importance of each token’s attribution w.r.t. the entire input. The metrics we propose are invariant to magnitude and thus to normalization. Because gradient magnitudes are affected by implementation choice (mean vs. sum of the logits), we normalize the vectors in Figure 7 to study the relative importance of the input tokens for each prediction source, irrespective of this choice.
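The invariance claim can be checked numerically with a small sketch. The code below is illustrative only: the helper names and toy vectors are hypothetical, and the Kendall's τ shown is the simple tau-a variant without tie correction, which may differ from the variant used for the reported results.

```python
import math


def l1_normalize(vec):
    """Divide each entry by the sum of absolute values (the L1 norm)."""
    total = sum(abs(v) for v in vec)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-token attributions for the label and for the rationale.
label_attr = [0.2, -0.5, 0.1, 0.7]
rationale_attr = [3.0, -1.0, 2.0, 0.5]

raw_cos = cosine_similarity(label_attr, rationale_attr)
norm_cos = cosine_similarity(l1_normalize(label_attr), l1_normalize(rationale_attr))
raw_tau = kendall_tau_a(label_attr, rationale_attr)
norm_tau = kendall_tau_a(l1_normalize(label_attr), l1_normalize(rationale_attr))
```

Because L1-normalization multiplies each vector by a positive constant, both the angle between the vectors and the ranking of their entries are unchanged, so `raw_cos` matches `norm_cos` and `raw_tau` matches `norm_tau` up to floating-point error.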

A.7 Additional Results

We provide additional results that supplement those presented in §4:

  • Results of attribution overlap for E-SNLI and CoS-E v1.11 in Figure 8.

  • Gradient attribution sanity checks in Figures 9-10.

  • Robustness equivalence for CoS-E v1.11 in Figure 11.

Source | CQA | SNLI | SST | AgNews | Evidence Inference | Movie Reviews | MultiRC | LGD | 20 News | Amazon Reviews | Beer Reviews | BoolQ | FEVER
True Pipelines (no gradient flow)
Camburu et al. (2018) E + NL
Kumar and Talukdar (2020) NL
Rajani et al. (2019) E + NL
Jain et al. (2020) E E E E E
Jacovi and Goldberg (2020a) E E E E E E E E E
DeYoung et al. (2020) E E E E E E E
Lehman et al. (2019) E
Discrete Optimization Variants
Lei et al. (2016) E
Bastings et al. (2019) E E E
Latcinnik and Berant (2020) NL
Paranjape et al. (2020) E E E E E E
Table 4: An overview of text-only datasets and rationale types (E for extractive, NL for natural language rationales) used in prior work on pipeline architectures. We focus on the two tasks we believe require a more complex notion of “reasoning” to solve: Commonsense-QA (CQA) and NLI. Unlike the other tasks in the table, prior work lacks consensus on (1) the type of rationales best-suited, and (2) the form of the model for these tasks. We argue for natural language rationales, and demonstrate that pipeline models are poorly-suited for CQA and SNLI given this choice. Dataset citations: Appendix A.1.
| Dataset | Num. Instances (Train-Dev-Test) | Input Length (# Tokens) | Extractive Rationale (# Tokens) | Extractive Rationale (% of doc.) | Natural Language Rationale (# Tokens) | Natural Language Rationale (% of doc.) |
|---|---|---|---|---|---|---|
| E-SNLI | 549367-9842-9824 | 20.27 +/- 6.95 | 4.01 +/- 3.01 | 21.30 +/- 15.82 | 12.39 +/- 6.43 | 65.67 +/- 35.46 |
| CoS-E v1.0 | 7610-950-none | 13.69 +/- 5.97 | 4.57 +/- 4.16 | 35.36 +/- 27.22 | 12.74 +/- 6.99 | 108.18 +/- 77.26 |
| CoS-E v1.11 | 9741-1221-none | 13.40 +/- 5.77 | 6.80 +/- 5.79 | 53.24 +/- 36.05 | 6.97 +/- 4.14 | 58.01 +/- 39.60 |

Table 5: Statistics on datasets with ground-truth rationales. Results are presented as mean +/- one standard deviation on the training set. CoS-E does not contain test-set rationale annotations.

The IOR and IR;RO rationale generator’s inputs:
explain cos_e question: [question] choice: [choice_0] choice: [choice_1] choice: [choice_2]
explain nli hypothesis: [hypothesis] premise: [premise].
The IR;RO pipeline label predictor’s input:
cos_e choice: [choice_0] choice: [choice_1] choice: [choice_2] explanation: [free-text rationale]
nli explanation: [free-text rationale]
The IOR models’ outputs are trained and decoded as:
[label] explanation: [free-text rationale]
Table 6: T5 input-output formatting.
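The templates in Table 6 can be produced with a few string helpers. This is a sketch under stated assumptions, not the preprocessing code used for the experiments: all function names are hypothetical, the trailing period in the NLI generator input mirrors Table 6 as printed, and `parse_ior_output` is an added convenience for splitting a decoded sequence.

```python
def format_cos_e_generator_input(question, choices):
    """Input to the IOR model and the IR;RO rationale generator (CoS-E)."""
    choice_text = " ".join(f"choice: {choice}" for choice in choices)
    return f"explain cos_e question: {question} {choice_text}"


def format_nli_generator_input(hypothesis, premise):
    """Input to the IOR model and the IR;RO rationale generator (NLI)."""
    return f"explain nli hypothesis: {hypothesis} premise: {premise}."


def format_cos_e_predictor_input(choices, rationale):
    """Input to the IR;RO pipeline label predictor (CoS-E)."""
    choice_text = " ".join(f"choice: {choice}" for choice in choices)
    return f"cos_e {choice_text} explanation: {rationale}"


def format_nli_predictor_input(rationale):
    """Input to the IR;RO pipeline label predictor (NLI)."""
    return f"nli explanation: {rationale}"


def format_ior_target(label, rationale):
    """Target sequence the IOR model is trained to generate."""
    return f"{label} explanation: {rationale}"


def parse_ior_output(decoded):
    """Split a decoded IOR sequence into (label, rationale).

    Assumption: if the separator is missing, the whole string is treated
    as the label and the rationale is empty.
    """
    label, separator, rationale = decoded.partition(" explanation: ")
    if not separator:
        return decoded.strip(), ""
    return label.strip(), rationale.strip()


example_input = format_cos_e_generator_input(
    "Where do you keep milk?", ["fridge", "oven", "car"]
)
example_target = format_ior_target("fridge", "milk must stay cold")
```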
| Dataset | Random Accuracy | Majority-Vote Accuracy | IO (T5) | IOR (WT5?!) |
|---|---|---|---|---|
| E-SNLI | 33.33 | 33.39 | 90.45 (84.01) | 90.80 (83.96, 90.9) |
| CoS-E v1.0 | 33.33 | 33.75 | 68.42 (63.8) | 66.10 |
| CoS-E v1.11 | 20.0 | 20.31 | 60.11 | 57.99 (59.4) |
Table 7: Label accuracy on baseline IO T5 models versus their rationalizing IOR (WT5?!) variants fine-tuned for each dataset. We observe that adding rationalization does not result in a substantial loss in accuracy. We also validate that T5-Base models outperform other architectures. Test-set reported for E-SNLI and dev-set for CoS-E v1.0 and v1.11. Source of prior results in parentheses: Narang et al. (2020), Camburu et al. (2018) using a bi-directional LSTM, and Rajani et al. (2019) using BERT.
| Dataset | IOR (WT5?!) | IR;RO | Difference (IR;RO − IOR) |
|---|---|---|---|
| E-SNLI | 90.80 (83.96) | 86.25 (81.71) | -4.55 (-2.25) |
| CoS-E v1.0 | 66.10 | 56.42 | -9.68 |
| CoS-E v1.11 | 57.99 | 46.52 | -11.47 |
Table 8: Label accuracy on the joint self-rationalizing model IOR (WT5?!) compared to a pipeline using natural language rationales. We observe that IOR models have substantially stronger task performance. Test-set reported for E-SNLI and dev-set for CoS-E v1.0 and v1.11. Source of prior results in parentheses: Camburu et al. (2018) using bi-directional LSTMs.
(a) Cosine similarity and absolute cosine similarity between the label-attribution vectors and the rationale-attribution vectors for E-SNLI (left) and CoS-E v1.11 (right). In the absolute case, we observe that the distribution is centered around 0.6 for both datasets, indicating moderate correlation. In the non-absolute case, the cosine similarity is near-0, indicating low correlation.
(b) Kendall’s τ between the label-attribution vectors and the rationale-attribution vectors on E-SNLI (left) and CoS-E v1.11 (right). We observe that the distribution is centered around 0, indicating that the token ranks are not correlated.
Figure 8: Attribution overlap results for E-SNLI and CoS-E v1.11.
Figure 9: Absolute cosine similarity and Jensen-Shannon divergence of the L1-normalized (A.6) attribution vectors and a uniform distribution on the E-SNLI test-set. While similar distributions for label and rationale validate that the gradients do not have a positional bias, we observe the distributions are fairly similar to uniform in both plots.
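The comparison to a uniform distribution in this sanity check can be sketched as follows. This is one possible reading rather than the exact procedure behind the figure: the helper names are hypothetical, attributions are turned into a distribution by L1-normalizing their absolute values (how negative entries are handled is not stated here), and base-2 logarithms are assumed so that the divergence lies in [0, 1].

```python
import math


def to_distribution(attributions):
    """L1-normalize absolute attributions so the entries sum to 1.

    Assumption: absolute values are used so every entry is non-negative.
    """
    total = sum(abs(v) for v in attributions)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [abs(v) / total for v in attributions]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions, in bits."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def js_to_uniform(attributions):
    """JS divergence between normalized attributions and a uniform distribution."""
    p = to_distribution(attributions)
    uniform = [1.0 / len(p)] * len(p)
    return js_divergence(p, uniform)


# Equal attributions match the uniform distribution exactly, while an
# attribution concentrated on a single token diverges from it.
flat = js_to_uniform([0.5, -0.5, 0.5, 0.5])
peaked = js_to_uniform([1.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a value near 0 means the attribution mass is spread almost evenly over the input tokens, which is the situation the caption describes.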

Figure 10: L1-norm of the raw attribution vectors on the E-SNLI test-set. While similar distributions for label and rationale validate that the gradients do not have a positional bias, the small magnitudes of the attribution vectors may have an undesired effect on their interpretation.
(a) “Meaning” Proxy Measure. Accuracy of the RO model with rationales generated by the IOR model under varying amounts of Gaussian noise.
(b) Accuracy of the IOR model and percentage of label flips under the same noise levels.
Figure 11: Results of robustness equivalence test on CoS-E v1.11.