ERASER: A Benchmark to Evaluate Rationalized NLP Models

11/08/2019 ∙ by Jay DeYoung, et al. ∙ 35

State-of-the-art models in NLP are now predominantly based on deep neural networks that are generally opaque in terms of how they come to specific predictions. This limitation has led to increased interest in designing more interpretable deep models for NLP that can reveal the ‘reasoning’ underlying model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of "rationales" (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at: .




1 Introduction

Figure 1: Examples of instances, labels, and rationales illustrative of four (out of seven) datasets included in ERASER. The ‘erased’ snippets are rationales.

Interest has recently grown in interpretable NLP systems that can reveal how and why models make their predictions. But work in this direction has been conducted on different datasets with correspondingly different metrics, and the inherent subjectivity in defining what constitutes ‘interpretability’ has translated into researchers using different metrics to quantify performance. We aim to facilitate measurable progress on designing interpretable NLP models by releasing a standardized benchmark of datasets — augmented and repurposed from pre-existing corpora, and spanning a range of NLP tasks — and associated metrics for measuring the quality of rationales. We refer to this as the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark.

In curating and releasing ERASER we take inspiration from the stickiness of the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks for evaluating progress in natural language understanding tasks. These have enabled rapid progress on models for general language representation learning. We believe the still somewhat nascent subfield of interpretable NLP stands to similarly benefit from an analogous collection of standardized datasets/tasks and metrics.

‘Interpretability’ is a broad topic with many possible realizations (Doshi-Velez and Kim, 2017; Lipton, 2016). In ERASER we focus specifically on rationales, i.e., snippets of text from a source document that support a particular categorization. All datasets contained in ERASER include such rationales, explicitly marked by annotators as supporting particular categorizations. By definition rationales should be sufficient to categorize documents, but they may not be comprehensive. Therefore, for some datasets we have collected comprehensive rationales, i.e., in which all evidence supporting a classification has been marked.

How one measures the ‘quality’ of extracted rationales will invariably depend on their intended use. With this in mind, we propose a suite of metrics to evaluate rationales that might be appropriate for different scenarios. Broadly, this includes measures of agreement with human-provided rationales, and assessments of faithfulness. The latter aim to capture the extent to which rationales provided by a model in fact informed its predictions.

While we propose metrics that we think are reasonable, we view the problem of designing metrics for evaluating rationales — especially for capturing faithfulness — as a topic for further research that we hope that ERASER will help facilitate. We plan to revisit the metrics proposed here in future iterations of the benchmark, ideally with input from the community. Notably, while we provide a ‘leaderboard’, this is perhaps better viewed as a ‘results board’; we do not privilege any one particular metric. Instead, we hope that ERASER permits comparison between models that provide rationales with respect to different criteria of interest.

We provide baseline models and report their performance across the corpora in ERASER. While implementing and initially evaluating these baselines, we found that no single ‘off-the-shelf’ architecture was readily adaptable to datasets with very different average input lengths and associated rationale snippets. This suggests a need for the development of new models capable of consuming potentially lengthy input documents and adaptively providing rationales at the level of granularity appropriate for a given task. ERASER provides a resource to develop such models, as it comprises datasets with a wide range of input text and rationale lengths (Section 4).

In sum, we introduce the ERASER benchmark, a unified set of diverse NLP datasets (repurposed from existing corpora, including sentiment analysis, Natural Language Inference, and Question Answering tasks, among others) in a standardized format featuring human rationales for decisions, along with the starter code and tools, baseline models, and standardized metrics for rationales.

2 Desiderata for Rationales

In this section we discuss properties that might be desirable in rationales, and the metrics we propose to quantify these (for evaluation). We attempt to operationalize these criteria formally in Section 5.

As one simple metric, we can assess the degree to which the rationales extracted by a model agree with those highlighted by human annotators. To measure exact and partial match, we propose adopting metrics from named entity recognition (NER) and object detection. In addition, we consider more granular ranking metrics that account for the individual weights assigned to tokens (when models assign such token-level scores, that is).

One distinction to make when evaluating rationales is the degree to which explanation for predictions is desired. In some cases it may be important that rationales tell us why a model made the prediction that it did, i.e., that rationales are faithful. In other settings, we may be satisfied with “plausible” rationales, even if these are not faithful.

Another key consideration is whether one wants rationales that are comprehensive, rather than simply sufficient. A comprehensive set of rationales comprises all snippets that support a given label. Put another way, if we remove a comprehensive set of rationales from an instance, there should be no way to categorize it (Yu et al., 2019). ERASER permits evaluation of comprehensiveness by including exhaustive annotated rationales that we have collected for some of the datasets in the benchmark.

3 Related Work

Interpretability in NLP is a large and fast-growing area, and we do not attempt to provide a comprehensive overview here. Instead, we focus on directions particularly relevant to ERASER, i.e., prior work on models that provide rationales for their predictions.

Learning to Explain. In ERASER we assume that rationales (marked by humans) are provided during training. However, models will of course not always have access to such direct supervision. This has motivated work on methods that can explain (or “rationalize”) model predictions using only instance-level supervision.

In the context of modern neural models for text classification, one might use variants of attention (Bahdanau et al., 2014) to extract rationales. Attention mechanisms learn to assign soft weights to (usually contextualized) token representations, and so one can extract highly weighted tokens as rationales. However, attention weights do not in general provide faithful explanations for predictions (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Zhong et al., 2019; Pruthi et al., 2019; Brunner et al., 2019; Moradi et al., 2019; Vashishth et al., 2019). This likely owes to encoders entangling inputs, which complicates the interpretation of attention weights over contextualized representations. In some cases, however, faithfulness may not be a primary concern. [Footnote 1: Interestingly, Zhong et al. (2019) report that attention provides plausible but not faithful (explanatory) rationales. In other related work, Pruthi et al. (2019) show that one can easily learn to deceive using attention weights. These findings further highlight that one should be mindful of what criteria one wants rationales to fulfill.]

By contrast, hard attention mechanisms discretely extract snippets from the input to pass to the classifier, and so by construction provide a sort of faithfulness in their explanations. Recent work has therefore pursued hard attention mechanisms as a means of providing explanations (Lei et al., 2016; Yu et al., 2019). Lei et al. (2016) proposed instantiating two models with their own parameters: an encoder to extract rationales, and a decoder that consumes the snippets the encoder selects to make a prediction. They trained these models jointly. This is complicated by the discrete snippet selection performed by the encoder, which precludes direct gradient-based parameter estimation. They instead propose adopting a REINFORCE (Williams, 1992) style optimization technique.

Post-hoc explanation. Another strand of work in the interpretability literature considers post-hoc explanation methods. Such methods seek to explain why a given model made its prediction on a given input, most commonly in the form of token-level importance scores. Many of these methods rely on differentiability of the output with respect to inputs (Sundararajan et al., 2017; Smilkov et al., 2017). These types of explanations often have clear inherent semantics (e.g., simple gradients tell us how small perturbations to the inputs affect outputs), but they may nonetheless be difficult for humans to understand due to counterintuitive behaviors (Feng et al., 2018).

Another class of ‘black-box’ methods does not require any specific conditions on models. Examples include LIME (Ribeiro et al., 2016) and Alvarez-Melis and Jaakkola (2017); these methods approximate model behavior locally by repeatedly asking the model to make predictions over perturbed inputs and fitting an explainable, low-complexity model over these predictions.

Acquiring rationales. In addition to potentially providing model transparency, collecting rationales from annotators may afford greater efficiency in terms of model performance realized given a fixed amount of annotator effort (Zaidan and Eisner, 2008). In particular, recent work (McDonnell et al., 2016, 2017) has observed that at least for some tasks, asking annotators to provide rationales justifying their categorizations does not impose much overhead in terms of effort.

Active learning (AL) (Settles, 2012) is a complementary strategy for reducing annotator effort that entails the model selecting the examples with which it is to be trained. Sharma et al. (2015) explored actively collecting both instance labels and supporting rationales. Their work suggests that selecting instances via an acquisition function specifically designed for learning with rationales can provide predictive gains over standard AL methods. A limitation of this work is that they relied on simulated rationales, for want of access to datasets with marked rationales; a gap that our work addresses.

Learning from Rationales. Work on learning from rationales that have been explicitly provided by users for text classification dates back over a decade (Zaidan et al., 2007; Zaidan and Eisner, 2008). Earlier efforts proposed extending standard discriminative models like Support Vector Machines (SVMs) with regularization terms that penalized parameter estimates which disagreed with provided rationales (Zaidan et al., 2007; Small et al., 2011). Other efforts have attempted to specify generative models of rationales (Zaidan and Eisner, 2008).

More recent work has looked to exploit rationales in training neural text classification models. For example, Zhang et al. (2016) proposed a rationale-augmented Convolutional Neural Network (CNN) for text classification, explicitly trained to identify sentences supporting document categorizations. Strout et al. (2019) have demonstrated that providing this model with target rationales at train time results in the model providing rationales at test time that are preferred by humans (compared to rationales provided when the model learns to weight sentences in an end-to-end fashion). Other recent work has proposed training ‘pipeline’ models in which one model learns to extract rationales (using available rationale supervision), and a second, independent model is trained to make predictions on the basis of these (Lehman et al., 2019; Chen et al., 2019).

Elsewhere, Camburu et al. (2018) enriched the SNLI (Bowman et al., 2015) corpus with human rationales and trained an RNN for this task with the aim of being able to justify its predictions, in addition to learning better universal sentence representations. The authors used perplexity and BLEU scores as well as a manual scoring of a random sample of explanations.

Rajani et al. (2019) augmented the CommonsenseQA (Talmor et al., 2019) corpus with rationales and trained a Transformer-based (Vaswani et al., 2017) GPT (Radford et al., ) language model with an objective of using explanations to improve performance on the downstream task. Here the authors used perplexity to evaluate performance. The same work (Rajani et al., 2019) also pursued an innovative approach of training the model to generate natural language explanations directly, such that these agree with human-provided free-text justifications. We view abstractive explanation as an exciting direction for future work, but here we focus on extractive rationalization.

The above efforts have measured rationale or explanation quality as a function of agreement with human rationales. This is natural in the setting in which supervision over rationales is assumed to be provided, as extracting these becomes a secondary predictive target which can be directly measured. However, agreement with human rationales demonstrates only plausibility; it does not guarantee that the model actually relied on the provided snippets to come to its prediction. Rationales that do meet this criterion are termed faithful; we discuss these two potential properties of rationales (plausibility and faithfulness) in more detail below. Importantly, we provide metrics that aim to measure these.

4 Datasets in ERASER

Name                 Size (train/dev/test)     Comprehensive?
Evidence Inference   7958 / 972 / 959
BoolQ                6363 / 1491 / 2817
Movie Reviews        1600 / 200 / 200
FEVER                97957 / 6122 / 6111
MultiRC              24029 / 3214 / 4848
CoS-E                8733 / 1092 / 1092
e-SNLI               911938 / 16449 / 16429
Table 1: Overview of datasets in the proposed rationales benchmark. These numbers reflect any additional processing completed from the original datasets. Comprehensive rationales mean that all supporting evidence is marked; denotes cases where this is (more or less) true by default; , are datasets for which we have collected comprehensive rationales for either a subset or all of the test datasets, respectively; ✗ are datasets for which we do not have comprehensive rationales.
Dataset              Labels   Instances   Documents   Sentences   Tokens
Evidence Inference   3        9889        2411        156.0       4760.6
BoolQ                2        10671       7030        175.2       3580.1
Movie Reviews        2        2000        1999        36.8        774.1
FEVER                2        110190      4099        12.1        326.5
MultiRC              2        32091       539         14.9        302.5
CoS-E                5        10917       10917       1.0         27.6
e-SNLI               3        568939      944565      1.7         16.0
Table 2: General dataset statistics: number of labels, instances, unique documents, and average numbers of sentences and tokens in documents, across the publicly released train/validation/test splits in ERASER. For CoS-E and e-SNLI, the sentence counts are not meaningful as the partitioning of question/sentence/answer formatting is an arbitrary choice in this framework.

In this section we describe the datasets that comprise the proposed rationales benchmark. All datasets constitute predictive tasks for which we distribute both reference labels and spans marked by humans, in a standardized format. For some of the datasets we have acquired comprehensive rationales from humans for a subset of instances. This permits evaluation of model recall, with respect to extracted rationales.

We distribute train, validation, and test sets for all corpora (see Appendix A for processing details). We ensure that these sets comprise disjoint sets of source documents to avoid contamination. [Footnote 2: Except for BoolQ, wherein source documents in the original train and validation set were not disjoint and we preserve this structure in our dataset. Questions, of course, are disjoint.] We have made the decision to distribute the test sets publicly, in part because we do not view the ‘correct’ metrics to use as settled. [Footnote 3: Consequently, for datasets that have been part of previous benchmarks with other aims (namely, GLUE/SuperGLUE) but which we have re-purposed for work on rationales in ERASER, e.g., BoolQ (Clark et al., 2019), we have carved out for release test sets from the original validation sets.] We plan to acquire additional human annotations on held-out portions of some of the included corpora so as to offer hidden test set evaluation opportunities in the future.

Evidence inference (Lehman et al., 2019). This is a dataset of full-text articles describing the conduct and results of randomized controlled trials (RCTs). The task is to infer whether a given intervention is reported to either significantly increase, significantly decrease, or have no significant effect on a specified outcome, as compared to a comparator of interest. A justifying rationale extracted from the text should be provided to support the inference. As the original annotations are not necessarily exhaustive, we collect exhaustive annotations on a subset of the test data. [Footnote 4: Annotation details are in Appendix B.]

BoolQ (Clark et al., 2019). This corpus consists of passages selected from Wikipedia, and yes/no questions generated from these passages. As the original Wikipedia article versions used were not maintained, we have made a best-effort attempt to recover these, and then find within them the passages answering the corresponding questions. For public release, we acquired comprehensive annotations on a subset of documents in our test set (annotation details are in Appendix B).

Movie Reviews (Zaidan and Eisner, 2008). One of the original datasets providing extractive rationales, the movies dataset has positive or negative sentiment labels on movie reviews. As the included rationale annotations are not necessarily comprehensive (i.e., annotators were not asked to mark all text supporting a label), we collect a comprehensive evaluation set on the final fold of the original dataset (Pang and Lee, 2004); annotation details are in Appendix B.

FEVER (Thorne et al., 2018). FEVER 1.0 (short for Fact Extraction and VERification) is a fact-checking dataset. The task is to verify claims from textual sources. In particular, each claim is to be classified as supported, refuted or not enough information with reference to a collection of potentially relevant source texts. We restrict this dataset to supported or refuted.

MultiRC (Khashabi et al., 2018). This is a reading comprehension dataset composed of questions with multiple correct answers that by construction depend on information from multiple sentences. In MultiRC, each rationale is associated with a question, while answers are independent of one another. We convert each rationale/question/answer triplet into an instance within our dataset. Each answer candidate then has a label of True or False.

Commonsense Explanations (CoS-E) (Rajani et al., 2019). This corpus comprises multiple-choice questions and answers from CommonsenseQA (Talmor et al., 2019) along with supporting rationales. The rationales in this case come in the form both of highlighted (extracted) supporting snippets and free-text, open-ended descriptions of reasoning. Given our focus on extractive rationales, ERASER includes only the former for now. Following the suggestions of Talmor et al. (2019), we repartition the training and validation sets to provide a canonical test split.

e-SNLI (Camburu et al., 2018). This dataset extends the widely known SNLI dataset (Bowman et al., 2015) by including rationales in the form of highlighted tokens in the premise and/or hypothesis, as well as open-ended natural language explanations. The authors restricted what could be included as a rationale depending on the label. For entailment pairs, annotators were required to highlight at least one word in the premise. For contradiction pairs, the annotators had to highlight at least one word in both the premise and the hypothesis. For neutral pairs, annotators were only allowed to highlight words in the hypothesis. We use the highlighted text as rationales for our ERASER benchmark.

5 Metrics

In ERASER, models are evaluated both for their ‘downstream’ performance (i.e., performance on the actual classification task) and with respect to the rationales that they extract. For the former we rely on the established metrics for the respective tasks. Here we describe the metrics we propose to evaluate the quality of extracted rationales.

We do not claim that these are necessarily the best metrics for evaluating rationales, but they are reasonable starting measures. We hope the release of ERASER will spur additional research into how best to measure the quality of model explanations in the context of NLP.

5.1 Agreement with human rationales

The simplest means of evaluating rationales extracted by models is to measure how well they agree with those marked by humans. To this end we propose two classes of metrics: those based on exact matches, and ranking metrics that provide a measure of the model’s ability to discriminate between evidence and non-evidence tokens (appropriate for models that provide soft scores for tokens). For the former, we borrow from Named Entity Recognition (NER); we effectively measure the overlap between spans extracted and marked. Specifically, given the set of rationales extracted for an instance, we compute precision, recall, and F1 with respect to the human rationales for that instance.

Exact match is a particularly harsh metric in that it may not reflect subjective rationale quality; consider that an extra token destroys the match but not (usually) the meaning. We therefore consider softer variants. Intersection-Over-Union (IOU), borrowed from computer vision (Everingham et al., 2010), permits credit assignment in the case of partial matches. We define IOU on a token level: for two spans, it is the size of the overlap of the tokens covered by the spans divided by the size of their union. We count a prediction as a match if it overlaps with any of the ground truth rationales by more than some threshold (0.5 for this work). We compute true positives from these matches; other measures (false positives, false negatives) are computed normally, and yield a more forgiving precision, recall, and F-measure.
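
As a rough illustration of IOU-based partial matching, the following Python sketch scores predicted spans against human spans. The half-open `(start, end)` span representation, the function names, and the way recall is counted (human spans recovered by at least one prediction) are assumptions made for this example; they are not taken from the benchmark code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end); an assumed representation


def span_iou(a: Span, b: Span) -> float:
    """Token-level intersection-over-union of two spans."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union > 0 else 0.0


def iou_prf(predicted: List[Span], gold: List[Span], threshold: float = 0.5):
    """Precision, recall and F1 with IOU-based partial matching."""
    # A prediction is a true positive if its IOU with some gold span exceeds the threshold.
    matched_pred = sum(any(span_iou(p, g) > threshold for g in gold) for p in predicted)
    # Assumption: recall counts gold spans recovered by at least one prediction.
    matched_gold = sum(any(span_iou(p, g) > threshold for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold_spans = [(0, 4), (10, 14)]
pred_spans = [(0, 5), (20, 22)]  # one near miss and one spurious span
print(iou_prf(pred_spans, gold_spans))
```

In this toy case the first predicted span has an IOU of 4/5 with the first human span, so it counts as a match even though an exact-match comparison would reject it because of the single extra token.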

We provide two additional relaxations of the exact match metric. First, a token-level precision, recall, and F1 allow for a broader sense of model coverage, although these ignore contiguousness, which is likely a desirable property of rationales. Systems may also provide a sentence-level decision as a second relaxed scoring metric. In general we consider token and span-level metrics superior to sentence metrics as they are more granular, but some datasets have meaningful sentence-level annotations. [Footnote 5: MultiRC and FEVER both have sentence-level annotations only.]
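
The token-level relaxation can be sketched in a few lines of Python. Representing a rationale as a set of token positions is an assumption of this example, as is returning zero when either set is empty.

```python
from typing import Set, Tuple


def token_prf(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sets of rationale token positions."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical example: the model marks tokens 0-4; annotators marked tokens 0-3 and 10-13.
predicted_tokens = set(range(0, 5))
gold_tokens = set(range(0, 4)) | set(range(10, 14))
print(token_prf(predicted_tokens, gold_tokens))
```

Because the score depends only on set membership, two scattered tokens and one contiguous two-token span are treated identically, which is the loss of contiguity noted above.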

Our second class of metrics considers rankings. This rewards models for assigning relatively high scores to marked tokens. In particular, we take the Area Under the Precision-Recall curve (AUPRC) constructed by sweeping a threshold over token scores.
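
A minimal sketch of the threshold sweep, written in plain Python so that it has no dependencies. It sums the area step-wise (the average-precision form) rather than by trapezoidal interpolation; which of the two the benchmark uses is not stated here, so treat that choice, and the function name, as assumptions.

```python
from typing import Sequence


def auprc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the precision-recall curve for token scores.

    labels[i] is 1 if token i lies inside a human rationale, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    # Lowering the threshold admits tokens in order of decreasing score.
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    area, true_pos, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Tokens tied at the same score cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            true_pos += pairs[j][1]
            j += 1
        precision = true_pos / j
        recall = true_pos / total_pos
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area


# Hypothetical soft scores over four tokens; the first two are human-marked.
print(auprc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

A model that ranks every marked token above every unmarked one attains the maximum area, regardless of the absolute values of its scores.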

In general, the rationales we have for tasks are sufficient to make judgments, but not necessarily comprehensive. However, for some datasets we have explicitly collected comprehensive rationales for at least a subset of the test set. Therefore, on these datasets recall evaluates comprehensiveness directly (it does so only noisily on other datasets). We highlight which corpora contain comprehensive rationales in the test set in Table 4.

5.2 Measuring faithfulness

Above we proposed simple metrics for agreement with human-provided rationales. But as discussed above, a model may provide rationales that are plausible (and agree with those marked by humans) but that it did not in fact rely on to come to its disposition. In some scenarios this may be acceptable, but in many settings one may want rationales that actually explain model predictions, i.e., rationales extracted for an instance in this case ought to have meaningfully influenced its prediction for the same. We refer to these as faithful rationales.

How best to measure the faithfulness of rationales is an open question. In this first version of ERASER we propose a few straightforward metrics motivated by prior work (Zaidan et al., 2007; Yu et al., 2019). In particular, following Yu et al. (2019) we define metrics intended to capture the comprehensiveness and sufficiency of rationales, respectively. The former should capture whether all features needed to come to a prediction were selected, and the latter should tell us whether the extracted rationales contain enough signal to come to a disposition.

Comprehensiveness. To calculate rationale comprehensiveness we create contrast examples (Zaidan et al., 2007) by taking an input instance $x_i$ with rationales $r_i$ and erasing from the former all tokens found in the latter. That is, we construct a contrast example for $x_i$, $\tilde{x}_i$, which is $x_i$ with the rationales $r_i$ removed. Assuming a simple classification setting, let $\hat{p}_{ij}$ be the original prediction provided by a model $m$ for the predicted class $j$:

$$\hat{p}_{ij} = m(x_i)_j.$$

Then we consider the predicted probability from the model for the same class once the supporting rationales are stripped:

$$\tilde{p}_{ij} = m(\tilde{x}_i)_j.$$

Intuitively, the model ought to be less confident in its prediction once rationales are removed from $x_i$. We can measure this as:

$$\text{comprehensiveness} = \hat{p}_{ij} - \tilde{p}_{ij} \qquad (1)$$

If this is high, this implies that the rationales were indeed influential in the prediction; if it is low, then this suggests that they were not. A negative value here means that the model became more confident in its prediction after the rationales were removed; this would seem quite counter-intuitive if the rationales were indeed the reason for its prediction in the first place.

Sufficiency. The second metric for measuring the faithfulness of rationales that we use is intended to capture the degree to which the snippets within the extracted rationales are adequate for a model to make a prediction. Denote by $\hat{q}_{ij}$ the predicted probability of class $j$ using only rationales $r_i$. Then:

$$\text{sufficiency} = \hat{p}_{ij} - \hat{q}_{ij} \qquad (2)$$

These metrics are illustrated in Figure 2.
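
The two faithfulness scores can be sketched as follows, assuming both are computed as differences from the predicted-class probability on the full input: against the input with rationale tokens erased (comprehensiveness), and against the rationale tokens alone (sufficiency). The `Model` type, the function names, and the keyword-based `toy_model` are hypothetical stand-ins introduced only for this illustration.

```python
import math
from typing import Callable, List, Sequence, Set

Model = Callable[[Sequence[str]], List[float]]  # tokens -> class probabilities


def predicted_class(probs: List[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda j: probs[j])


def comprehensiveness(model: Model, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Drop in predicted-class probability once rationale tokens are erased."""
    full = model(tokens)
    j = predicted_class(full)
    contrast = [t for i, t in enumerate(tokens) if i not in rationale]
    return full[j] - model(contrast)[j]


def sufficiency(model: Model, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Drop in predicted-class probability when only rationale tokens are kept."""
    full = model(tokens)
    j = predicted_class(full)
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return full[j] - model(only)[j]


def toy_model(tokens: Sequence[str]) -> List[float]:
    """Hypothetical sentiment model returning [P(negative), P(positive)]."""
    weights = {"great": 2.0, "boring": -2.0}
    logit = sum(weights.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-logit))
    return [1.0 - p_pos, p_pos]


tokens = ["a", "great", "film"]
rationale = {1}  # position of "great"
print(comprehensiveness(toy_model, tokens, rationale))
print(sufficiency(toy_model, tokens, rationale))
```

For this toy model the rationale carries all of the signal, so erasing it moves the prediction back to chance (a large comprehensiveness score), while keeping only the rationale leaves the prediction unchanged (a sufficiency score of zero).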

Figure 2: Illustration of faithfulness scoring metrics, comprehensiveness and sufficiency, on the Commonsense Explanations (CoS-E) dataset. For the former, erasing the tokens comprising the provided rationale ought to decrease model confidence in the output ‘Forest’. For the latter, the model should be able to come to a similar disposition regarding ‘Forest’ using only the rationales.

As defined, the above measures have assumed discrete rationales. We would like also to evaluate the faithfulness of continuous importance scores assigned to tokens by models. Here we adopt a simple approach for this. We convert soft scores over features provided by a model into discrete rationales by taking the top-$k_d$ values, where $k_d$ is a threshold for dataset $d$. We set $k_d$ to the average rationale length provided by humans for dataset $d$ (see Table 4). Intuitively, this says: How much does the model prediction change if we remove a number of tokens equal to what humans use (on average for this dataset), in order of the importance scores assigned to these by the model? Once we have discretized the soft scores into rationales in this way, we compute the faithfulness scores as per Equations 1 and 2.
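
The discretization step might look like the sketch below. The helper names are hypothetical, and rounding the average human rationale length to the nearest integer (with a floor of one token) is an assumption, since the text does not say how a fractional average is handled.

```python
from typing import Sequence, Set


def dataset_threshold(human_rationale_lengths: Sequence[int]) -> int:
    """Per-dataset threshold: average human rationale length in tokens."""
    mean = sum(human_rationale_lengths) / len(human_rationale_lengths)
    # Assumption: round half up and keep at least one token.
    return max(1, int(mean + 0.5))


def top_k_rationale(scores: Sequence[float], k: int) -> Set[int]:
    """Positions of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


# Hypothetical numbers: three human rationales and soft scores for a four-token input.
k = dataset_threshold([2, 3, 4])
print(k, top_k_rationale([0.1, 0.9, 0.5, 0.7], k))
```

The resulting set of positions can then be scored exactly like a discrete rationale when computing comprehensiveness and sufficiency.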

This approach is conceptually simple. It is also computationally cheap to evaluate, in contrast to measures that require per-token measurements, e.g., importance score correlations with ‘leave-one-out’ scores (Jain and Wallace, 2019), or counting how many ‘important’ tokens need to be erased before a prediction flips (Serrano and Smith, 2019). However, the necessity of discretizing continuous scores forces us to rely on the rather ad-hoc application of a threshold. We believe that picking this based on human rationale annotations per dataset is reasonable, but acknowledge that alternative choices of threshold may yield quite different results for a given model and rationale set. It may be better to construct curves of this measure across varying thresholds and compare these, but this is both subtle (such curves will not necessarily be monotonic) and computationally intensive.

Ultimately, we hope that ERASER inspires additional research into designing faithfulness metrics for rationales. We plan to incorporate additional such metrics into future versions of the benchmark, if appropriate.

6 Baseline Models

Our focus in this work is primarily on the ERASER benchmark itself, rather than on any particular model(s). However, to establish initial empirical results that might provide a starting point for future work, we evaluate several baseline models across the corpora in ERASER. [Footnote 6: We plan to continue adding baseline model implementations, which we will make available.] We broadly class these into models that assign ‘soft’ (continuous) scores to tokens, and those that perform a ‘hard’ (discrete) selection over inputs. We additionally consider models specifically designed to select individual tokens (and very short sequences) as rationales, as compared to longer snippets.

We describe these models in the following subsections. All of our implementations are available in the ERASER repository. Note that we do not aim to provide, by any means, a comprehensive suite of models: rather, our aim is to establish a reasonable starting point for additional work on such models.

All of the datasets in ERASER have a similar structure: inputs, rationales, labels. But they differ considerably in length (Table 4), both of documents and corresponding rationales. We found that this motivated use of different models for datasets, appropriate to their sizes and rationale granularities. In our case this was in fact necessitated by computational constraints, as we were unable to run larger models on lengthier documents such as those within Evidence Inference. We hope that this benchmark motivates design of models that provide rationales that can flexibly adapt to varying input lengths and expected rationale granularities. Indeed, only with such models can we perform comparisons across datasets.

6.1 Hard selection

Models that perform hard selection may be viewed as comprising two independent modules: an encoder which is responsible for extracting snippets of inputs, and a decoder that makes a prediction based only on the text provided by the encoder. We consider two variants of such models.

Lei et al. (2016). In this model, the encoder induces a binary mask $z$ over inputs $x$. The decoder consumes the attributes of $x$ indicated by $z$ to make a prediction $\hat{y}$. The components are typically trained jointly. This end-to-end training is complicated by the use of (non-differentiable) hard attention, i.e., the binary mask $z$, which means it is not possible to train the model directly using variants of gradient descent. Instead, Lei et al. (2016) propose using REINFORCE (Williams, 1992) style estimation, minimizing the loss over expected binary vectors yielded from the encoder.
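
To make the REINFORCE-style idea concrete, here is a toy sketch; it is not the training code of Lei et al. (2016) or of the baselines. It assumes independent Bernoulli selections with probability sigmoid(theta_t) per token and a made-up loss (`toy_loss`), and compares the sampled score-function gradient estimate with the exact gradient obtained by enumerating every mask. All names are hypothetical.

```python
import math
import random
from itertools import product
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_loss(mask: Sequence[int]) -> float:
    """Made-up decoder loss: token 0 is needed, and every selected token costs 0.1."""
    task_loss = 0.0 if mask[0] == 1 else 1.0
    return task_loss + 0.1 * sum(mask)


def reinforce_grad(theta: Sequence[float], n_samples: int, rng: random.Random) -> List[float]:
    """Monte Carlo estimate of the gradient of the expected loss w.r.t. theta."""
    probs = [sigmoid(t) for t in theta]
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        mask = [1 if rng.random() < p else 0 for p in probs]
        loss = toy_loss(mask)
        for t, (z_t, p_t) in enumerate(zip(mask, probs)):
            # d/d theta_t of log P(mask) for a Bernoulli(sigmoid(theta_t)) is z_t - p_t.
            grad[t] += loss * (z_t - p_t)
    return [g / n_samples for g in grad]


def exact_grad(theta: Sequence[float]) -> List[float]:
    """Same gradient computed by enumerating every binary mask."""
    probs = [sigmoid(t) for t in theta]
    grad = [0.0] * len(theta)
    for mask in product([0, 1], repeat=len(theta)):
        p_mask = 1.0
        for z_t, p_t in zip(mask, probs):
            p_mask *= p_t if z_t else 1.0 - p_t
        loss = toy_loss(mask)
        for t, (z_t, p_t) in enumerate(zip(mask, probs)):
            grad[t] += p_mask * loss * (z_t - p_t)
    return grad


theta = [0.0, 0.0, 0.0]
print(exact_grad(theta))
print(reinforce_grad(theta, n_samples=20000, rng=random.Random(0)))
```

The sampled estimate needs no derivative of the loss with respect to the mask, which is why this style of estimator sidesteps the non-differentiable selection; the price is variance, which is why many samples are drawn here.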

One of the advantages of this approach is that it need not have access to marked rationales; it can learn to rationalize on the basis of instance labels alone. However, given that here we do have access to rationales in the training data, we experiment with a variant in which we train the encoder explicitly using rationale-level annotations.

In our implementation of Lei et al. (2016), we drop in two independent BERT (Devlin et al., 2018) base modules, each with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) on top, to induce contextualized representations of tokens for the encoder and decoder, respectively (the decoder, in addition, uses additive attention to collapse the LSTM hidden representations to a single vector). The encoder generates a scalar (denoting the probability of selecting that token) for each LSTM hidden state using a feedforward layer and sigmoid. In the model where we do use human rationales during training, we minimize binary cross entropy between our sigmoid output and the ground truth rationale. Thus our final loss function is composed of the decoder classification loss, the REINFORCE estimator loss (details can be found in Lei et al. (2016)), and, if used, a rationale supervision loss.
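The REINFORCE-style estimation described above can be illustrated with a minimal sketch. It assumes the encoder outputs per-token selection probabilities, samples a hard mask from them, and forms a single-sample score-function gradient estimate. The function names, the toy task loss, and the regularizer weights are illustrative assumptions and are not taken from the released implementation.

```python
import random

def sample_mask(probs, rng):
    """Sample a hard binary mask z_i ~ Bernoulli(p_i), one entry per token."""
    return [1 if rng.random() < p else 0 for p in probs]

def mask_regularizers(mask):
    """Return (number of selected tokens, number of 0/1 transitions in the mask)."""
    sparsity = sum(mask)
    contiguity = sum(abs(a - b) for a, b in zip(mask, mask[1:]))
    return sparsity, contiguity

def total_cost(task_loss, mask, sparsity_weight, contiguity_weight):
    """Task loss plus weighted sparsity and contiguity penalties on the mask."""
    sparsity, contiguity = mask_regularizers(mask)
    return task_loss + sparsity_weight * sparsity + contiguity_weight * contiguity

def reinforce_gradient(probs, mask, cost):
    """Single-sample estimate of d E[cost] / d logit_i.

    For p = sigmoid(logit), d log p(z) / d logit = z - p,
    so the estimate for token i is cost * (z_i - p_i).
    """
    return [cost * (z - p) for z, p in zip(mask, probs)]

rng = random.Random(0)
probs = [0.9, 0.8, 0.1, 0.2, 0.7]   # hypothetical encoder outputs
mask = sample_mask(probs, rng)
cost = total_cost(task_loss=0.35, mask=mask,
                  sparsity_weight=0.01, contiguity_weight=0.02)
grads = reinforce_gradient(probs, mask, cost)
```

In practice a baseline is often subtracted from the cost to reduce the variance of this estimator; the sketch omits that for brevity.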

Pipeline models. These are simple models in which we first train the encoder to extract rationales, and then train the decoder to perform prediction using only rationales. No parameters are shared between the two models. Realizing this type of approach is possible only when one has access to direct rationale supervision in order to train the encoder (which in general we assume in ERASER).

Here we first consider a simple pipeline that first segments inputs into sentences. It passes these, one at a time, through a Gated Recurrent Unit (GRU) (Cho et al., 2014) to yield hidden representations that we compose via an attentive decoding layer (Bahdanau et al., 2014). This aggregate representation is then passed to a classification module which predicts whether the corresponding sentence is a rationale (or not). A second model, using effectively the same architecture but parameterized independently, consumes the outputs (rationales) from the first to make predictions. This simple model is described at length in prior work (Lehman et al., 2019). We further consider a ‘BERT-to-BERT’ pipeline, where we replace each stage with a BERT module for prediction (Devlin et al., 2018).
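As a rough illustration of the attentive layer mentioned above, the following sketch implements additive attention pooling in plain Python: each hidden state is scored with a small feedforward function, the scores are normalized with a softmax, and the hidden states are averaged with those weights. The parameter names (`W`, `b`, `v`) and the toy inputs are assumptions made for illustration only.

```python
import math

def additive_attention(hidden_states, W, b, v):
    """Collapse hidden states h_1..h_n into a single vector.

    score_i = v . tanh(W h_i + b)
    alpha   = softmax(score)
    context = sum_i alpha_i * h_i
    """
    scores = []
    for h in hidden_states:
        projected = [
            math.tanh(sum(w_jk * h_k for w_jk, h_k in zip(row, h)) + b_j)
            for row, b_j in zip(W, b)
        ]
        scores.append(sum(v_j * p_j for v_j, p_j in zip(v, projected)))
    # Numerically stable softmax over the per-state scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(a * h[k] for a, h in zip(weights, hidden_states))
               for k in range(dim)]
    return weights, context

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
weights, context = additive_attention(hidden, W, b, v)
```

The attention weights returned here are also the kind of quantity that can be read off as ‘soft’ token or sentence scores.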

In all pipeline models, we train each stage independently. The rationale identification stage is trained using approximate sentence boundaries from our source annotations, with randomly sampled negative examples at each epoch. The classification stage uses the same positive rationales as the identification stage, a kind of teacher forcing. See Appendix C for more detail.
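The two-stage training setup described above might be sketched as follows. The identification stage receives all annotated rationale sentences as positives and a fresh random sample of non-rationale sentences as negatives each epoch, while the classification stage is trained on the gold rationale sentences only. The helper names and the one-negative-per-positive ratio are assumptions for illustration; the text does not specify the sampling ratio.

```python
import random

def identifier_examples(sentences, rationale_ids, rng, negatives_per_positive=1):
    """Build (sentence, label) pairs for the rationale identification stage.

    Positives are the annotated rationale sentences. Negatives are re-sampled
    on every call, e.g. once per epoch.
    """
    positives = [(sentences[i], 1) for i in sorted(rationale_ids)]
    candidates = [i for i in range(len(sentences)) if i not in rationale_ids]
    n_neg = min(len(candidates), negatives_per_positive * len(positives))
    negatives = [(sentences[i], 0) for i in rng.sample(candidates, n_neg)]
    return positives + negatives

def classifier_input(sentences, rationale_ids):
    """The second stage sees only rationale sentences (gold ones in training)."""
    return " ".join(sentences[i] for i in sorted(rationale_ids))

sentences = ["s0", "s1", "s2", "s3", "s4"]   # toy document
rationale_ids = {1, 3}                        # toy gold rationale sentences
examples = identifier_examples(sentences, rationale_ids, random.Random(0))
decoder_text = classifier_input(sentences, rationale_ids)
```

At test time, the second stage would instead consume the sentences selected by the first stage.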

6.2 Soft selection

A subset of datasets in ERASER contain token-level annotations, i.e., in these cases individual words and/or comparatively short sequences of words are marked as supporting classification decisions. These are: MultiRC, Movies, e-SNLI, and CoS-E. For these datasets we consider a model that passes tokens through BERT (Devlin et al., 2018) to induce contextualized representations that are then passed to a bi-directional LSTM (Hochreiter and Schmidhuber, 1997). The hidden representations from the LSTM are collapsed into a single vector using additive attention (Bahdanau et al., 2014) and finally passed through a linear layer followed by a sigmoid to yield (per-token) relevance predictions. We use the LSTM layer in part to bypass the 512-token limit imposed by BERT; when we exceed this length, we effectively start encoding a ‘new’ sequence (setting the positional index to 0) via BERT. The hope is that the LSTM layer learns to compensate for this. We have not yet trained this model on larger corpora due to computational constraints. For now we instead use a similar setup as above for Evidence Inference, BoolQ, and FEVER, except we swap in GloVe 300-d embeddings (Pennington et al., 2014) in place of BERT representations for tokens.
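The length workaround described above can be sketched as a simple windowing step: a long input is cut into consecutive windows of at most 512 positions, and position indices restart at 0 in each window. This is a simplified illustration under stated assumptions; the function name is hypothetical and the sketch ignores special tokens such as [CLS] and [SEP], which also count toward the limit.

```python
def window_positions(num_tokens, max_len=512):
    """Return (start, end, position_ids) for consecutive windows.

    Each window covers at most max_len tokens, and position ids
    restart at 0 at the beginning of every window.
    """
    windows = []
    for start in range(0, num_tokens, max_len):
        end = min(start + max_len, num_tokens)
        windows.append((start, end, list(range(end - start))))
    return windows

windows = window_positions(1200)
spans = [(start, end) for start, end, _ in windows]
```

Each window would be encoded separately, and the per-token outputs concatenated in order before being fed to the LSTM.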

For these models we consider input gradients (with respect to output) and attention induced over contextualized representations as ‘soft’ scores.
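To make the notion of an input-gradient score concrete, here is a minimal sketch on a toy differentiable model whose gradient can be written by hand. The model (a sigmoid over a sum of tanh units) and the choice of the L2 norm of the gradient as the per-token score are assumptions for illustration; they are not the BERT-LSTM or GloVe-LSTM models used in the experiments.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict(embeddings, w):
    """Toy model: y_hat = sigmoid(sum_i tanh(w . e_i))."""
    s = sum(math.tanh(dot(w, e)) for e in embeddings)
    return 1.0 / (1.0 + math.exp(-s))

def token_gradients(embeddings, w):
    """Analytic gradient of y_hat with respect to each token embedding e_i."""
    y = predict(embeddings, w)
    grads = []
    for e in embeddings:
        t = math.tanh(dot(w, e))
        scale = y * (1.0 - y) * (1.0 - t * t)   # chain rule through sigmoid and tanh
        grads.append([scale * w_k for w_k in w])
    return grads

def simple_gradient_scores(embeddings, w):
    """Per-token 'soft' score: L2 norm of the gradient for that token."""
    return [math.sqrt(dot(g, g)) for g in token_gradients(embeddings, w)]

embeddings = [[0.1, 0.2], [2.0, 1.5], [-0.3, 0.4]]  # toy token embeddings
w = [1.0, -0.5]                                      # toy model weights
scores = simple_gradient_scores(embeddings, w)
```

With a real network the gradients would come from automatic differentiation rather than a hand-derived formula.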

7 Evaluation

| Dataset | Model | Performance | IOU F1 | Token F1 |
|---|---|---|---|---|
| Evidence Inference | (Lehman et al., 2019) | 0.475 | 0.094 | 0.098 |
| Evidence Inference | Bert-To-Bert | 0.700 | 0.491 | 0.493 |
| BoolQ | (Lehman et al., 2019) | 0.474 | 0.047 | 0.116 |
| BoolQ | Bert-To-Bert | 0.571 | 0.057 | 0.143 |
| Movie Reviews | (Lei et al., 2016) | 0.914 | 0.124 | 0.285 |
| Movie Reviews | (Lei et al., 2016) (u) | 0.920 | 0.012 | 0.322 |
| Movie Reviews | (Lehman et al., 2019) | 0.738 | 0.057 | 0.121 |
| Movie Reviews | Bert-To-Bert | 0.864 | 0.067 | 0.115 |
| FEVER | (Lehman et al., 2019) | 0.687 | 0.571 | 0.554 |
| FEVER | Bert-To-Bert | 0.850 | 0.817 | 0.796 |
| MultiRC | (Lei et al., 2016) | 0.655 | 0.271 | 0.456 |
| MultiRC | (Lei et al., 2016) (u) | 0.652 | 0.000 | 0.000 |
| MultiRC | (Lehman et al., 2019) | 0.592 | 0.151 | 0.152 |
| MultiRC | Bert-To-Bert | 0.610 | 0.419 | 0.402 |
| CoS-E | (Lei et al., 2016) | 0.477 | 0.255 | 0.331 |
| CoS-E | (Lei et al., 2016) (u) | 0.476 | 0.000 | 0.000 |
| e-SNLI | (Lei et al., 2016) | 0.917 | 0.693 | 0.692 |
| e-SNLI | (Lei et al., 2016) (u) | 0.903 | 0.261 | 0.379 |

Table 3: Performance of models that perform ‘hard’ (discrete) rationale selection. All models are supervised at the rationale level except for those marked with (u), which learn only from instance-level supervision (for comparison). A marker symbol denotes cases in which we believe the rationale training degenerated due to the REINFORCE-style learning. Performance here is accuracy (CoS-E) and macro-averaged F1 (all others). Rationale evaluations for Evidence Inference, FEVER, and BoolQ include the non-comprehensive subset.

| Dataset (threshold) | Model | Performance | AUPRC | Comprehensiveness | Sufficiency |
|---|---|---|---|---|---|
| Evidence Inference (20%) | GloVe-LSTM + Attention | 0.428 | 0.506 | 0.001 | -0.022 |
| Evidence Inference (20%) | GloVe-LSTM + Simple Gradient | 0.428 | 0.020 | 0.009 | -0.079 |
| BoolQ (10%) | GloVe-LSTM + Attention | 0.631 | 0.525 | -0.001 | 0.028 |
| BoolQ (10%) | GloVe-LSTM + Simple Gradient | 0.631 | 0.072 | 0.015 | 0.104 |
| Movie Reviews (10%) | BERT-LSTM + Attention | 0.974 | 0.467 | 0.091 | 0.035 |
| Movie Reviews (10%) | BERT-LSTM + Simple Gradient | 0.974 | 0.441 | 0.113 | 0.052 |
| FEVER (20%) | GloVe-LSTM + Attention | 0.660 | 0.617 | -0.011 | 0.129 |
| FEVER (20%) | GloVe-LSTM + Simple Gradient | 0.660 | 0.271 | 0.070 | 0.077 |
| MultiRC (20%) | BERT-LSTM + Attention | 0.655 | 0.240 | 0.145 | 0.085 |
| MultiRC (20%) | BERT-LSTM + Simple Gradient | 0.655 | 0.224 | 0.164 | 0.079 |
| CoS-E (30%) | BERT-LSTM + Attention | 0.487 | 0.607 | 0.124 | 0.175 |
| CoS-E (30%) | BERT-LSTM + Simple Gradient | 0.487 | 0.585 | 0.160 | 0.196 |
| e-SNLI (30%) | BERT-LSTM + Attention | 0.960 | 0.394 | 0.115 | 0.622 |
| e-SNLI (30%) | BERT-LSTM + Simple Gradient | 0.960 | 0.418 | 0.451 | 0.419 |

Table 4: Metrics for ‘soft’ scoring models. Performance refers to macro-averaged F1 for MultiRC, Movies, and e-SNLI, and accuracy for CoS-E. Area Under the Precision Recall Curve (AUPRC) captures agreement between token rankings induced by scores and human annotations. Comprehensiveness and sufficiency are proposed measures of faithfulness; bigger numbers imply better performance for the former, and smaller numbers do so for the latter. These two measures depend on a dataset-specific threshold to discretize token scores (Section 5.2); the threshold is given in parentheses after each dataset name.

| Dataset | Cohen κ | F1 | P | R | #Annotators/doc | #Documents |
|---|---|---|---|---|---|---|
| Evidence Inference | - | - | - | - | - | - |
| BoolQ | 0.618 ± 0.194 | 0.617 ± 0.227 | 0.647 ± 0.260 | 0.726 ± 0.217 | 3 | 199 |
| Movie Reviews | 0.712 ± 0.135 | 0.799 ± 0.138 | 0.693 ± 0.153 | 0.989 ± 0.102 | 2 | 96 |
| FEVER | 0.854 ± 0.196 | 0.871 ± 0.197 | 0.931 ± 0.205 | 0.855 ± 0.198 | 2 | 24 |
| MultiRC | 0.728 ± 0.268 | 0.749 ± 0.265 | 0.695 ± 0.284 | 0.910 ± 0.259 | 2 | 99 |
| CoS-E | 0.619 ± 0.308 | 0.654 ± 0.317 | 0.626 ± 0.319 | 0.792 ± 0.371 | 2 | 100 |
| e-SNLI | 0.743 ± 0.162 | 0.799 ± 0.130 | 0.812 ± 0.154 | 0.853 ± 0.124 | 3 | 9807 |

Table 5: Human agreement numbers with respect to rationales. For Movie Reviews and BoolQ we calculate the mean agreement of individual annotators with the majority vote per token, over the two to three annotators we hired via Upwork and Amazon Mechanical Turk, respectively. The e-SNLI dataset already comprised three annotators, and for this we calculate mean agreement between individuals and the majority. For CoS-E, MultiRC, and FEVER, members of our team annotated a subset to use as a comparison to the (majority of, where appropriate) existing rationales. We collected comprehensive rationales for Evidence Inference from medical doctors; given their high level of expertise, we would expect agreement to be high, but we have not yet collected redundant comprehensive annotations.
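
For readers unfamiliar with the agreement statistic reported in Table 5, the following sketch computes Cohen's kappa between two binary token-level labelings (1 for a rationale token, 0 otherwise). It is a generic textbook formulation; how labels were aggregated per document or against the majority vote in the actual analysis is not reproduced here, and the function name is an assumption.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary labelings of the same tokens.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the chance agreement implied by each labeling's marginal rates.
    """
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both labelings are constant and identical; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)

annotator = [1, 1, 0, 0, 1, 0]   # toy token labels from one annotator
reference = [1, 0, 0, 0, 1, 0]   # toy reference (e.g. majority) labels
kappa = cohen_kappa(annotator, reference)
```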

Here we present initial results for the baseline models discussed in Section 6, with respect to the metrics proposed in Section 5. We present results in two parts, reflecting the two classes of rationales discussed above: ‘hard’ approaches that perform discrete selection of snippets and ‘soft’ methods that assign continuous importance scores to tokens.

First, in Table 3 we evaluate models that perform discrete selection of rationales. We view these models as faithful by design, because by construction we know what snippets of text the decoder used to make a prediction. (Footnote 7: Note that this further assumes that the encoder and decoder do not share parameters.) Therefore, for these methods we report only metrics that measure agreement with human annotations.
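
The two agreement metrics reported in Table 3 can be sketched as follows. Token F1 compares sets of selected token positions. The IOU-based F1 counts a predicted span as a match when its intersection-over-union with some gold span reaches a threshold; the value 0.5 and all function names are assumptions of this sketch rather than a restatement of the official evaluation code.

```python
def token_f1(predicted, gold):
    """Precision, recall, and F1 over sets of token positions."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def span_iou(span_a, span_b):
    """Intersection-over-union of two half-open token spans (start, end)."""
    intersection = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - intersection
    return intersection / union if union else 0.0

def iou_f1(predicted_spans, gold_spans, threshold=0.5):
    """F1 where a span counts as matched if its IOU with any span on the
    other side is at least `threshold` (assumed value)."""
    matched_pred = sum(
        1 for p in predicted_spans
        if any(span_iou(p, g) >= threshold for g in gold_spans))
    matched_gold = sum(
        1 for g in gold_spans
        if any(span_iou(p, g) >= threshold for p in predicted_spans))
    precision = matched_pred / len(predicted_spans) if predicted_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r, f = token_f1([1, 2, 3], [2, 3, 4])
span_score = iou_f1([(0, 4), (10, 12)], [(0, 5)])
```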

Due to computational constraints, we are currently unable to run our BERT-based implementation of Lei et al. (2016) over larger corpora. Conversely, Lehman et al. (2019) assumes a setting in which rationales are sentences, and so is not appropriate for datasets in which rationales tend to comprise only very short spans. Again, in our view this highlights the need for models that can rationalize at varying levels of granularity, depending on what is appropriate.

We observe that for the “rationalizing” model of Lei et al. (2016), exploiting rationale-level supervision generally improves agreement with human-provided rationales, which is consistent with prior work (Zhang et al., 2016; Strout et al., 2019). Here, the model of Lei et al. (2016) consistently outperforms the simple pipeline model from Lehman et al. (2019). Furthermore, it outperforms the ‘BERT-to-BERT’ pipeline on the comparable datasets for the final classification tasks. This may be an artifact of the amount of text each model can select: ‘BERT-to-BERT’ is limited to sentences, while the model of Lei et al. (2016) can select any subset of the text.

In Table 4 we report metrics for models that assign soft (continuous) importance scores to individual tokens. For these models we again measure downstream (task) performance (F1 or accuracy, as appropriate). Here the underlying models are the same for both scoring methods, and so downstream performance is identical. To assess the quality of token scores with respect to human annotations, we report the Area Under the Precision Recall Curve (AUPRC). Finally, as these scoring functions assign only soft scores to inputs (and may still use all inputs to come to a particular prediction), we report the metrics intended to measure faithfulness defined above: comprehensiveness and sufficiency. Here we observe that the simple gradient attribution yields consistently more ‘faithful’ rationales with respect to comprehensiveness, and in a slight majority of cases also with respect to sufficiency. Interestingly, however, attention weights yield better AUPRCs on all datasets except e-SNLI.
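
A minimal sketch of the two faithfulness measures follows, under the reading that comprehensiveness is the drop in the model's predicted probability when the rationale tokens are removed, and sufficiency is the drop when only the rationale tokens are kept. The rationale is taken to be the top fraction of tokens by soft score, with the fraction playing the role of the dataset-specific threshold. The toy classifier, the token scores, and all names are hypothetical.

```python
def top_k_tokens(scores, fraction):
    """Indices of the highest-scoring tokens, keeping `fraction` of the input."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def faithfulness(predict_proba, tokens, scores, fraction):
    """Return (comprehensiveness, sufficiency) for one instance.

    comprehensiveness = p(full input) - p(input with rationale removed)
    sufficiency       = p(full input) - p(rationale only)
    """
    rationale = top_k_tokens(scores, fraction)
    full = predict_proba(tokens)
    without = predict_proba([t for i, t in enumerate(tokens) if i not in rationale])
    only = predict_proba([t for i, t in enumerate(tokens) if i in rationale])
    return full - without, full - only

def toy_model(tokens):
    """Hypothetical classifier: smoothed share of positive words."""
    positive = sum(1 for t in tokens if t in {"great", "superb"})
    return (positive + 1) / (len(tokens) + 2)

tokens = ["a", "great", "and", "superb", "film", "overall", "it", "was", "fine", "."]
scores = [0.0, 0.9, 0.1, 0.8, 0.2, 0.1, 0.0, 0.0, 0.3, 0.0]  # toy soft scores
comprehensiveness, sufficiency = faithfulness(toy_model, tokens, scores, 0.2)
```

In this toy case removing the two selected tokens lowers the probability (positive comprehensiveness), while keeping only those tokens raises it (negative sufficiency), matching the direction of "better" stated in the Table 4 caption.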

We view these as preliminary results and intend to implement and evaluate additional baselines in the near future. Critically, we see a need for establishing the performance of a single architecture across ERASER, which comprises datasets of very different size, and featuring rationales at differing granularities.

8 Discussion

We have described a new publicly available Evaluating Rationales And Simple English Reasoning (ERASER) benchmark. This comprises seven datasets, all of which have both instance level labels and corresponding supporting snippets (‘rationales’) marked by human annotators. We have augmented many of these datasets with additional annotations, and converted them into a standard format comprising inputs, rationales, and outputs. ERASER is intended to facilitate progress on explainable models for NLP.

We have proposed several metrics intended to measure the quality of rationales extracted by models, both in terms of agreement with human annotations, and in terms of ‘faithfulness’ with respect to comprehensiveness and sufficiency. We believe these metrics provide reasonable means of comparison of specific aspects of interpretability. However, we view the problem of measuring faithfulness, in particular, as a topic ripe for additional research; we hope that ERASER facilitates this.

More generally, our hope is that ERASER facilitates progress on designing and comparing relative strengths and weaknesses of interpretable NLP models across a variety of tasks and datasets. We aim to continually update this benchmark and the corresponding metrics that it defines. In contrast to most benchmarks, we are not privileging any one measure of performance. Our view is that for interpretability, different models may excel at different things, and our aim for ERASER is to facilitate meaningful contrastive comparisons that highlight which models excel with respect to particular metrics of interest (e.g., certain models may provide superior faithfulness, though with lower predictive performance). We host a leaderboard, but allow for sorting with respect to any metric of interest.

The ERASER datasets, code for working with the data and performing evaluations, and our baseline model implementations are all available at:, which we will be continuously updating.


References

  • D. Alvarez-Melis and T. Jaakkola (2017) A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 412–421. Cited by: §3.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3, §6.1, §6.2.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: pretrained language model for scientific text. In EMNLP, External Links: arXiv:1903.10676 Cited by: §C.4.
  • S. R. Bowman, G. Angeli, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3, §4.
  • G. Brunner, Y. Liu, D. Pascual, O. Richter, and R. Wattenhofer (2019) On the validity of self-attention as explanation in transformer models. arXiv preprint arXiv:1908.04211. Cited by: §3.
  • O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom (2018) E-snli: natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pp. 9539–9549. Cited by: Appendix A, §3, §4.
  • S. Chen, D. Khashabi, W. Yin, C. Callison-Burch, and D. Roth (2019) Seeing things from a different angle: discovering diverse perspectives about claims. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, Minnesota, pp. 542–557. External Links: Link, Document Cited by: §3.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §6.1.
  • C. Clark, K. Lee, T. Kwiatkowski, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In NAACL, Cited by: Appendix A, §4, footnote 3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §6.1, §6.1, §6.2.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §1.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. External Links: ISSN 1573-1405, Document, Link Cited by: §5.1.
  • S. Feng, E. Wallace, I. Grissom, M. Iyyer, P. Rodriguez, J. Boyd-Graber, et al. (2018) Pathologies of neural models make interpretations difficult. arXiv preprint arXiv:1804.07781. Cited by: §3.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer (2017) AllenNLP: a deep semantic natural language processing platform. External Links: arXiv:1803.07640 Cited by: §C.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §6.1, §6.2.
  • S. Jain and B. C. Wallace (2019) Attention is not explanation. arXiv preprint arXiv:1902.10186. Cited by: §3, §5.2.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), External Links: Link Cited by: Appendix A, §4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §C.1, §C.3, §C.4.
  • E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace (2019) Inferring which medical treatments work from reports of clinical trials. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 3705–3717. Cited by: Appendix A, §C.3, §3, §4, §6.1, Table 3, §7, §7.
  • T. Lei, R. Barzilay, and T. Jaakkola (2016) Rationalizing neural predictions. pp. 107–117. Cited by: §C.1, §3, §6.1, §6.1, Table 3, §7, §7.
  • Z. C. Lipton (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490. Cited by: §1.
  • T. McDonnell, M. Kutlu, T. Elsayed, and M. Lease (2017) The many benefits of annotator rationales for relevance judgments.. In IJCAI, pp. 4909–4913. Cited by: §3.
  • T. McDonnell, M. Lease, M. Kutlu, and T. Elsayed (2016) Why is that relevant? collecting annotator rationales for relevance judgments. In Fourth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §3.
  • P. Moradi, N. Kambhatla, and A. Sarkar (2019) Interrogating the explanatory power of attention in neural machine translation. arXiv preprint arXiv:1910.00139. Cited by: §3.
  • M. Neumann, D. King, I. Beltagy, and W. Ammar (2019) ScispaCy: fast and robust models for biomedical natural language processing. External Links: arXiv:1902.07669 Cited by: Appendix A.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, pp. 271–278. External Links: Link, Document Cited by: §4.
  • D. J. Pearce. An improved algorithm for finding the strongly connected components of a directed graph. Cited by: Appendix A.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §C.3, §6.2.
  • D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, and Z. C. Lipton (2019) Learning to deceive with attention-based explanations. arXiv preprint arXiv:1909.07913. Cited by: §3, footnote 1.
  • S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, and S. Ananiadou (2013) Distributional semantics resources for biomedical text processing. Cited by: §C.3.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Cited by: §3.
  • N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. Proceedings of the Association for Computational Linguistics (ACL). Cited by: Appendix A, §3, §4.
  • M. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should i trust you?”: explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 97–101. Cited by: §3.
  • T. Schuster, D. J. Shah, Y. J. S. Yeo, D. Filizzola, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. CoRR abs/1908.05267. External Links: Link, 1908.05267 Cited by: Appendix A.
  • S. Serrano and N. A. Smith (2019) Is attention interpretable?. arXiv preprint arXiv:1906.03731. Cited by: §3, §5.2.
  • B. Settles (2012) Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–114. Cited by: §3.
  • M. Sharma, D. Zhuang, and M. Bilgic (2015) Active learning with rationales for text classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 441–451. Cited by: §3.
  • K. Small, B. C. Wallace, C. E. Brodley, and T. A. Trikalinos (2011) The constrained weight space svm: learning with ranked features. In Proceedings of the International Conference on International Conference on Machine Learning (ICML), pp. 865–872. Cited by: §3.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §3.
  • R. Speer (2019) Ftfy. Note: Version 5.5Zenodo External Links: Document, Link Cited by: Appendix A.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. External Links: Link Cited by: §C.1, §C.3.
  • J. Strout, Y. Zhang, and R. J. Mooney (2019) Do human rationales improve machine explanations?. arXiv preprint arXiv:1905.13714. Cited by: §3, §7.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §3.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Link, Document Cited by: Appendix A, §3, §4.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 809–819. Cited by: Appendix A, §4.
  • S. Vashishth, S. Upadhyay, G. S. Tomar, and M. Faruqui (2019) Attention interpretability across nlp tasks. arXiv preprint arXiv:1909.11218. Cited by: §3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs/1905.00537. External Links: Link, 1905.00537 Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. arXiv preprint arXiv:1908.04626. Cited by: §3.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3, §6.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §C.1.
  • M. Yu, S. Chang, Y. Zhang, and T. S. Jaakkola (2019) Rethinking cooperative rationalization: introspective extraction and complement control. arXiv preprint arXiv:1910.13294. Cited by: §2, §3, §5.2.
  • O. Zaidan, J. Eisner, and C. Piatko (2007) Using “annotator rationales” to improve machine learning for text categorization. In Proceedings of the conference of the North American chapter of the Association for Computational Linguistics (NAACL), pp. 260–267. Cited by: §3, §5.2, §5.2.
  • O. F. Zaidan and J. Eisner (2008) Modeling annotators: a generative approach to learning from annotator rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 31–40. Cited by: Appendix A, §3, §3, §4.
  • Y. Zhang, I. Marshall, and B. C. Wallace (2016) Rationale-augmented convolutional neural networks for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 2016, pp. 795. Cited by: §3, §7.
  • R. Zhong, S. Shao, and K. McKeown (2019) Fine-grained sentiment analysis with faithful attention. arXiv preprint arXiv:1908.06870. Cited by: §3, footnote 1.

Appendix A Dataset Preprocessing

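| Dataset | Split | Documents | Instances | Rationale % | Evidence Statements | Evidence Lengths |
|---|---|---|---|---|---|---|
| MultiRC | Train | 400 | 24029 | 17.4 | 56298 | 21.5 |
| MultiRC | Val | 56 | 3214 | 18.5 | 7498 | 22.8 |
| MultiRC | Test | 83 | 4848 | - | - | - |
| Evidence Inference | Train | 1924 | 7958 | 1.34 | 10371 | 39.3 |
| Evidence Inference | Val | 247 | 972 | 1.38 | 1294 | 40.3 |
| Evidence Inference | Test | 240 | 959 | - | - | - |
| Movie Reviews | Train | 1599 | 1600 | 9.4 | 13878 | 7.7 |
| Movie Reviews | Val | 200 | 200 | 7.2 | 1517 | 6.6 |
| Movie Reviews | Test | 200 | 200 | - | - | - |
| FEVER | Train | 2915 | 97957 | 20.0 | 146856 | 31.3 |
| FEVER | Val | 570 | 6122 | 21.6 | 8672 | 28.2 |
| FEVER | Test | 614 | 6111 | - | - | - |
| BoolQ | Train | 4518 | 6363 | 6.64 | 6363 | 110.2 |
| BoolQ | Val | 1092 | 1491 | 7.13 | 1491 | 106.5 |
| BoolQ | Test | 2294 | 2817 | - | - | - |
| e-SNLI | Train | 911938 | 549309 | 27.28 | 1199035 | 1.8 |
| e-SNLI | Val | 16328 | 9823 | 25.63 | 23639 | 1.6 |
| e-SNLI | Test | 16299 | 9807 | - | - | - |
| CoS-E | Train | 8733 | 8733 | 26.6 | 8733 | 7.4 |
| CoS-E | Val | 1092 | 1092 | 27.1 | 1092 | 7.6 |
| CoS-E | Test | 1092 | 1092 | - | - | - |

Table 6: Detailed breakdowns for each dataset: the number of documents, instances, evidence statements, and lengths. Additionally we include the percentage of each relevant document that is considered a rationale. For test sets, counts are for all instances including documents with non-comprehensive rationales.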

We describe what, if any, additional processing we perform on a per-dataset basis. All datasets were converted to a unified format.

MultiRC (Khashabi et al., 2018) We perform minimal processing. We use the validation set as the testing set for public release.

Evidence Inference (Lehman et al., 2019) We perform minimal processing. As not all of the provided evidence spans come with offsets, we delete any prompts that had no grounded evidence spans.

Movie reviews (Zaidan and Eisner, 2008) We perform minimal processing. We use the ninth fold as the validation set, and collect annotations on the tenth fold for comprehensive evaluation.

FEVER (Thorne et al., 2018) We perform substantial processing for FEVER: we delete the “Not Enough Info” claim class, delete any claims with support in more than one document, and repartition the validation set into a validation and a test set for this benchmark (using the test set would compromise the information retrieval portion of the original FEVER task). We ensure that there is no document overlap between train, validation, and test sets (we use the algorithm of Pearce to ensure this, as conceptually a claim may be supported by facts in more than one document). We ensure that the validation set contains the documents used to create the FEVER symmetric dataset (Schuster et al., 2019) (unfortunately, the documents used to create the validation and test sets overlap so we cannot provide this partitioning). Additionally, we clean up some encoding errors in the dataset via Speer (2019).
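
One way to obtain document-disjoint splits like those described above is to group claims that are linked through shared documents and then assign each group to a single split. The sketch below does this with a union-find structure over an undirected claim-document graph. This is a swapped-in technique chosen for brevity: the text states that Pearce's strongly-connected-components algorithm was used, and the data layout and function name here are assumptions.

```python
def claim_groups(claim_to_docs):
    """Group claims that share any document, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Link every claim node to each document node that supports it.
    for claim, docs in claim_to_docs.items():
        for doc in docs:
            union(("claim", claim), ("doc", doc))

    groups = {}
    for claim in claim_to_docs:
        groups.setdefault(find(("claim", claim)), []).append(claim)
    return sorted(sorted(group) for group in groups.values())

# Toy mapping from claim ids to supporting document ids.
toy = {"c1": ["d1"], "c2": ["d1", "d2"], "c3": ["d3"], "c4": ["d2"]}
groups = claim_groups(toy)
```

Assigning each resulting group wholesale to train, validation, or test guarantees that no document appears in more than one split.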

BoolQ (Clark et al., 2019) The BoolQ dataset required substantial processing. The original dataset did not retain source Wikipedia articles or collection dates. In order to identify the source paragraphs, we download the 12/20/18 Wikipedia archive, and use FuzzyWuzzy to identify the source paragraph span that best matches the original release. If the Levenshtein distance ratio does not reach a score of at least 90, the corresponding instance is removed. For public release, we use the official validation set for testing, and repartition train into a training and validation set.
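
The paragraph-matching step can be approximated with the standard library, as in the sketch below. It uses `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy, so the similarity score is not the same Levenshtein-based ratio used in the actual preprocessing; the 0 to 100 scaling, the function name, and the toy paragraphs are assumptions, while the cutoff of 90 follows the text.

```python
from difflib import SequenceMatcher

def best_matching_paragraph(target, paragraphs, min_score=90):
    """Return (paragraph, score) for the closest match, or (None, score)
    when no candidate reaches min_score on a 0 to 100 scale."""
    best_score, best_paragraph = -1.0, None
    for paragraph in paragraphs:
        score = 100.0 * SequenceMatcher(None, target, paragraph).ratio()
        if score > best_score:
            best_score, best_paragraph = score, paragraph
    if best_score < min_score:
        return None, best_score
    return best_paragraph, best_score

paragraphs = ["The cat sat on the mat.", "Completely different text here."]
match, score = best_matching_paragraph("The cat sat on the mat", paragraphs)
miss, miss_score = best_matching_paragraph("zzzz qqqq", paragraphs)
```

Instances for which the function returns `None` would be the ones dropped from the dataset.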

e-SNLI (Camburu et al., 2018) We perform minimal processing. We separate the premise and hypothesis statements into separate documents.

Commonsense Explanations (CoS-E) (Rajani et al., 2019) We perform minimal processing, primarily deletion of any questions without a rationale or questions with rationales that were not possible to automatically map back to the underlying text. As recommended by the authors of Talmor et al. (2019) we repartition the train and validation sets into a train, validation, and test set for this benchmark. We encode the entire question and answers as a prompt and convert the problem into a five-class prediction. We also convert the “Sanity” datasets for user convenience.

All datasets in ERASER were tokenized using the spaCy library (with SciSpacy (Neumann et al., 2019) for Evidence Inference). In addition, we also split all datasets except e-SNLI and CoS-E into sentences using the same library.

Appendix B Annotation details

We collected comprehensive rationales for a subset of some test sets to accurately evaluate model recall of rationales.

  1. Movies. We used the Upwork platform to hire two fluent English speakers to annotate each of the 200 documents in our test set. Workers were paid at a rate of USD 8.5 per hour and, on average, it took them 5 min to annotate a document. Each annotator was asked to annotate a set of 6 documents, and these annotations were compared against in-house annotations (by the authors).

  2. Evidence Inference. We again used Upwork to hire 4 medical professionals who were fluent in English and had passed a pilot of 3 documents. 125 documents were annotated (each only once, by one of the annotators, which we felt was appropriate given their high level of expertise) with an average cost of USD 13 per document. The average time spent on a single document was 31 min.

  3. BoolQ. We used Amazon Mechanical Turk (MTurk) to collect reference comprehensive rationales for 199 randomly selected documents from our test set (ranging from 800 to 1500 tokens in length). Only workers from AU, NZ, CA, US, and GB with more than 10K approved HITs and an approval rate of greater than 98% were eligible. For every document, 3 annotations were collected and workers were paid USD 1.50 per HIT. The average work time (obtained through the MTurk interface) was 21 min. We did not anticipate the task taking so long (on average); the effective low pay rate was unintended.

Appendix C Hyperparameter and training details

C.1 (Lei et al., 2016) models

For these models, we set the sparsity rate at 0.01 and we set the contiguity loss weight to 2 times the sparsity rate (following the original paper). We used bert-base-uncased (Wolf et al., 2019) as the token embedder and a bidirectional LSTM with a 128-dimensional hidden state in each direction. A dropout (Srivastava et al., 2014) rate of 0.2 was used before feeding the hidden representations to the attention layer in the decoder and the linear layer in the encoder. A one-layer MLP with a 128-dimensional hidden state and ReLU activation was used to compute the decoder output distribution.

A learning rate of 2e-5 with the Adam (Kingma and Ba, 2014) optimizer was used for all models, and we fine-tuned only the top two layers of the BERT encoder. The models were trained for 20 epochs, and early stopping with a patience of 5 epochs was used. The best model was selected on the validation set using the final task performance metric.
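
The model selection procedure described above (at most 20 epochs, patience of 5, best checkpoint chosen by validation performance) can be sketched as a small loop. The sketch replays a list of hypothetical per-epoch validation scores instead of training a model, and the function name is an assumption.

```python
def select_best_epoch(validation_scores, max_epochs=20, patience=5):
    """Return (best_epoch, best_score), where higher scores are better.

    Stops once `patience` consecutive epochs fail to improve on the best
    score seen so far, or after `max_epochs` epochs.
    """
    best_score, best_epoch, stale = float("-inf"), None, 0
    for epoch, score in enumerate(validation_scores[:max_epochs]):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score

# Hypothetical validation scores; the late jump at the end is never reached.
history = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.66, 0.65, 0.90]
best_epoch, best_score = select_best_epoch(history)
```

The same loop applies to the other baselines in this appendix if validation loss is negated so that higher remains better.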

The input for the above model was encoded in the form [CLS] document [SEP] query [SEP].
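
As a small illustration of this input layout, the sketch below assembles the token sequence from already-tokenized document and query pieces. The segment ids (0 for the document part, 1 for the query part) follow the usual BERT sentence-pair convention and are an assumption here, as is the function name.

```python
def encode_input(document_tokens, query_tokens):
    """Build [CLS] document [SEP] query [SEP] plus matching segment ids."""
    tokens = (["[CLS]"] + list(document_tokens) + ["[SEP]"]
              + list(query_tokens) + ["[SEP]"])
    # Assumed convention: segment 0 covers [CLS] document [SEP],
    # segment 1 covers query [SEP].
    segment_ids = [0] * (len(document_tokens) + 2) + [1] * (len(query_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = encode_input(["the", "film", "was", "great"],
                                   ["is", "it", "positive"])
```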

This model was implemented using the AllenNLP library (Gardner et al., 2017).


This model is essentially the same as the decoder in the previous section. The BERT-LSTM uses the same hyperparameters, and the GloVe-LSTM is trained with a learning rate of 1e-2.

C.3 (Lehman et al., 2019) models

With the exception of the Evidence Inference dataset, these models were trained using the 200-dimensional GloVe (Pennington et al., 2014) word vectors; for Evidence Inference we used the PubMed word vectors of Pyysalo et al. (2013). We use Adam (Kingma and Ba, 2014) with a learning rate of 1e-3 and dropout (Srivastava et al., 2014) of 0.05 at each layer (embedding, GRU, attention layer) of the model, and train for 50 epochs with a patience of 10. We monitor validation loss, and keep the best model on the validation set.

C.4 BERT-to-BERT model

We primarily used the bert-base-uncased model for both portions of the identification and classification pipeline, with the sole exception being Evidence Inference with SciBERT  (Beltagy et al., 2019). We trained with the standard BERT parameters of a learning rate of 1e-5, Adam  (Kingma and Ba, 2014), for 10 epochs. We monitor validation loss, and keep the best model on the validation set.