Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

10/15/2021 ∙ by Andreas Madsen, et al. ∙ Montréal Institute of Learning Algorithms

To explain NLP models, many methods indicate which input tokens are important for a prediction. However, an open question is whether these methods accurately reflect the model's logic, a property often called faithfulness. In this work, we adapt and improve a recently proposed faithfulness benchmark from computer vision called ROAR (RemOve And Retrain), by Hooker et al. (2019). We improve ROAR by recursively removing dataset redundancies, which otherwise interfere with ROAR. We adapt and apply ROAR to popular NLP importance measures, namely attention, gradient, and integrated gradient. We also use mutual information as an additional baseline. Evaluation is done on a suite of classification tasks often used in the faithfulness-of-attention literature. Finally, we propose a scalar faithfulness metric, which makes it easy to compare results across papers. We find that importance measures considered unfaithful for computer vision tasks perform favorably for NLP tasks, that the faithfulness of an importance measure is task-dependent, and that the computational overhead of integrated gradient is rarely justified.




1 Introduction

The ability to explain neural networks benefits both accountability and ethics when deploying models (Doshi-Velez2017) and the development of a scientific understanding of what models do (Doshi-Velez2017a). In NLP in particular, attention (Bahdanau2015) is often used as an explanation to provide insight into the logical process of a model (Belinkov2019).

Attention, among other methods such as gradient (Baehrens2010; Li2016) and integrated gradient (Sundararajan2017a; Mudrakarta2018), explains which input tokens are relevant for a given prediction. This type of explanation is called an importance measure. However, there is no mathematical guarantee that these importance measures accurately reflect the model's logic, a property called faithfulness.

In the NLP literature, the faithfulness of attention is regularly debated (Jain2019; Wiegreffe2020; Vashishth2019; Serrano2019; Pruthi2020), often with contradicting conclusions. In related work (Section 2), we summarize the issues with the methods previously used to analyze attention. In computer vision, the faithfulness of the gradient and integrated gradient methods has been debated too (Adebayo2018; Kindermans2019).

Measuring faithfulness is inherently difficult, as it is usually impossible to provide gold labels for a “correct explanation”, making proxy measures the only alternative. The analysis in this paper is based on ROAR (Hooker2019), which is also a proxy measure but is argued to be more principled.

ROAR’s foundational principle is: if information is truly important, then removing it from the dataset and retraining the model should result in a worse model. Importantly, this can be compared with removing random information.

However, ROAR assumes that there are no data redundancies. After important information is removed, such redundancies can keep the model’s performance high, giving the illusion of an unfaithful importance measure. To solve this, we propose a modified version of ROAR, called Recursive ROAR.

Our primary contributions are: 1) developing Recursive ROAR, 2) adapting ROAR to NLP tasks, and 3) measuring the faithfulness of attention, gradient, and integrated gradient. In addition, we also:


  • Propose a metric for the faithfulness of importance measures, as a standard benchmark.

  • Propose mutual information as an additional baseline to the random baseline.

We find that no importance measure is generally better than the others; rather, faithfulness is task-dependent. This is valuable knowledge because, although the importance measures might be equal in faithfulness, they are not equal in computational requirements or in understandability to humans.

In particular, we find that attention generally provides more sparse explanations than gradient or integrated gradient. Although the faithfulness may be the same, a sparser explanation is often easier for humans to understand (Miller2019).

Computationally speaking, integrated gradient is 50 times more expensive than the gradient method. This additional complexity is usually justified by being considered more faithful than gradient. However, our results indicate that this is not a worthwhile trade-off.

2 Related Work

Much recent work in NLP has been devoted to investigating the faithfulness of attention as an interpretability method (Jain2019; Wiegreffe2020; Vashishth2019; Serrano2019; Pruthi2020; Meister2021a). In this section, we categorize the most relevant methods into three general ideas and discuss their drawbacks.

Importantly, all publications, including ours, use the same model, dataset, and training procedure as Jain2019, so all works are easily comparable.

2.1 Comparing with alternative importance measure

The idea is to compare attention with an alternative importance measure, such as gradient. If there is a correlation, this would validate attention’s faithfulness. Jain2019 specifically compare with the gradient method and the leave-one-out method. Meister2021a repeat this experiment in a broader context.

Both Jain2019 and Meister2021a find that there is little correlation between the importance measures and interpret this as attention being not-faithful.

Jain2019 do acknowledge the limitations of this approach, as the alternative importance measures are not themselves guaranteed to be faithful. A correlation, or lack of correlation, therefore says nothing about faithfulness. We agree with this criticism and highlight it here.

2.2 Mutate attention to deceive

Jain2019 propose that if there exist alternative attention weights that produce the same prediction, attention is unfaithful.

Jain2019 implement this idea by optimizing for no prediction change but a large change in attention, and find that deceiving attention distributions do exist. Vashishth2019 and Meister2021a apply a similar method and achieve similar results.

Wiegreffe2020 find this analysis problematic because the attention distribution is changed directly, thereby creating an out-of-distribution issue. This means that the new attention distribution may be impossible to obtain naturally from just changing the input, and it therefore says little about the faithfulness of attention.

2.3 Optimize model to deceive

Because the mutate attention to deceive approach has been criticized for using direct mutation, an alternative idea is to learn an adversarial attention.

The approach of Wiegreffe2020 is to maximize the KL-divergence between normal attention and adversarial attention while minimizing the prediction difference between the two models. By varying the allowed prediction difference, they show that it is not possible to significantly change the attention weights without changing the predictions, thereby invalidating the mutate attention to deceive experiments. Pruthi2020 perform a similar analysis but find the opposite: that it is possible to significantly change the attention weights without affecting performance. Furthermore, they use this as evidence that attention is not faithful.

We find this approach to be problematic because by changing the optimization criteria the analysis is no longer about attention in a standard model, which is the subject of interest. We therefore find that this analysis only works as a criticism of the mutate attention to deceive approach, not as an evaluation of faithfulness.

2.4 Why evaluating faithfulness is difficult

It is worth recognizing that evaluating faithfulness is difficult because we are not able to explain the models ourselves just from the weights and internal states. Therefore, it is often impossible for humans to annotate what a correct explanation is, which leaves us with only proxy measures.

3 ROAR: RemOve And Retrain

To avoid the problems with the current approaches to measure faithfulness described in Section 2, we apply ROAR.

ROAR has been used in computer vision to evaluate the faithfulness of importance measures (Hooker2019). The central idea is that if information is truly important, then removing it from the dataset and retraining the model on this reduced dataset should result in worse performance. This can then be compared with an uninformative baseline, where information is removed randomly. The hypothesis is that, if the importance measure is faithful, the model's performance should drop more than it does for the baseline.

This section covers how ROAR is adapted to an NLP context. Furthermore, we resolve an issue where ROAR can’t tell the difference between dataset redundancies and a non-faithful importance measure, by applying ROAR recursively.

3.1 Adaptation to NLP

ROAR was originally proposed for importance measures in computer vision. In this context, pixels measured to be important are “removed” by replacing them with an uninformative value, for example, a gray pixel (Hooker2019).

In this work, ROAR is applied to attention and other importance measures used on both single-sequence and paired-sequence classification models. Because these models use tokens, the uninformative value is a special [MASK] token (for an example, see Figure 1). We choose a [MASK] token rather than removing the token to keep the sequence length, which is an information source unrelated to importance measures. Additionally, removing tokens could result in ungrammatical inputs, whereas using a [MASK] token allows the model to infer the missing information.

0% The movie is great. I really liked it.
10% The movie is [MASK]. I really liked it.
20% The [MASK] is [MASK]. I really liked it.
Figure 1: Demonstrates replacing a relative number of important tokens with [MASK]. The highlight at 0% masked indicates importance.
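The masking step illustrated in Figure 1 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `mask_top_fraction`, the tokenization, and the importance scores are all hypothetical, and rounding the token count to the nearest integer is an assumption.

```python
MASK = "[MASK]"

def mask_top_fraction(tokens, scores, fraction):
    """Replace the highest-scoring `fraction` of tokens with [MASK].

    The sequence length is preserved; only token identities change.
    """
    n_mask = int(round(fraction * len(tokens)))  # assumption: round to nearest
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    to_mask = set(ranked[:n_mask])
    return [MASK if i in to_mask else tok for i, tok in enumerate(tokens)]

# Hypothetical importance scores for the sentence in Figure 1.
tokens = ["The", "movie", "is", "great", ".", "I", "really", "liked", "it", "."]
scores = [0.02, 0.20, 0.03, 0.40, 0.01, 0.04, 0.05, 0.18, 0.06, 0.01]

for fraction in (0.0, 0.1, 0.2):
    print(f"{fraction:.0%}", " ".join(mask_top_fraction(tokens, scores, fraction)))
```

Because the ranking is computed once from a single set of scores, this corresponds to the non-recursive variant.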

3.2 Recursive ROAR

When the model performance is worse on the dataset with allegedly important tokens removed than when random tokens are removed, the conclusion is that the importance measure is faithful. However, when the performance is similar, the conclusion is unclear. Hooker2019 explain that either there is a dataset redundancy or the importance measure is not faithful.

The reason why a dataset redundancy can affect the results is that, after truly important tokens are removed, the redundant information, which the model would normally not use, still exists and provides the necessary information to keep the model’s performance high. An example of this issue is demonstrated in Figure 1.

We solve this issue by recursively recomputing the importance measure at each iteration of information removal. This way, if the importance measure is faithful, it would mark the redundant information as important, and the redundancy would then be removed afterward. Note that already masked tokens are kept masked, regardless of whether the importance measure considers the [MASK] token important. We call this Recursive ROAR and provide an example in Figure 2.

0% The movie is great. I really liked it.
10% The movie is [MASK]. I really liked it.
20% The movie is [MASK]. I really [MASK] it.
Figure 2: Example of how a dataset redundancy can be removed by reevaluating the importance measure. Compare this to Figure 1, where redundancies are not removed and the performance can remain the same, even when the importance measure is faithful. Note, the redundant information may be expressed in more complex ways than demonstrated here.

Note that Recursive ROAR might not remove all redundancies unless the step size is one token. However, because ROAR requires retraining the model for every evaluation step, this is infeasible. Instead, we approximate it by removing a relative number of tokens. We discuss this further in Appendix A.
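The recursive procedure can be sketched as a loop that retrains, recomputes importance, and masks additional tokens while keeping earlier masks. The sketch below is only an illustration under loud assumptions: `train` is a hypothetical toy "lazy" model that keeps a single indicative word per class (mimicking a model that ignores redundant information), `importance` is a hypothetical importance measure that reads that model, and evaluation of model performance at each step is omitted.

```python
from collections import Counter

MASK = "[MASK]"

def train(dataset):
    """Toy stand-in for retraining: keep the single most indicative word per class."""
    counts = Counter()
    for tokens, label in dataset:
        for tok in tokens:
            if tok != MASK:
                counts[tok] += 1 if label == 1 else -1
    model = {}
    for sign in (1, -1):
        best = max((w for w in counts if sign * counts[w] > 0),
                   key=lambda w: sign * counts[w], default=None)
        if best is not None:
            model[best] = counts[best]
    return model

def importance(model, tokens):
    """Toy importance measure; masked tokens get -1 so they are never re-ranked."""
    return [-1.0 if tok == MASK else abs(model.get(tok, 0)) for tok in tokens]

def mask_more(tokens, scores, n_total):
    """Mask additional tokens until n_total are masked; existing masks are kept."""
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    candidates = sorted((i for i, tok in enumerate(tokens) if tok != MASK),
                        key=lambda i: scores[i], reverse=True)
    chosen = set(masked) | set(candidates[:max(0, n_total - len(masked))])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]

def recursive_roar(dataset, ratios):
    """Return [(ratio, masked_dataset)], recomputing importance after each retraining."""
    current = [(list(tokens), label) for tokens, label in dataset]
    results = []
    for ratio in ratios:
        model = train(current)  # retrain on the data masked so far
        current = [
            (mask_more(tokens, importance(model, tokens),
                       int(round(ratio * len(tokens)))), label)
            for tokens, label in current
        ]
        results.append((ratio, current))
    return results

data = [
    (["The", "movie", "is", "great", ".", "I", "really", "liked", "it", "."], 1),
    (["The", "movie", "is", "awful", ".", "I", "really", "hated", "it", "."], 0),
]
for ratio, masked in recursive_roar(data, [0.1, 0.2]):
    print(f"{ratio:.0%}", " ".join(masked[0][0]))
```

In this toy setting the first retrained model relies on "great", so only after it is masked does the second model fall back on the redundant word "liked", which the recomputed importance then marks for masking, as in Figure 2.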

4 Models and Tasks

The tasks, models, hyperparameters, and pre-trained word embeddings are the same as in Jain2019. There are two types of tasks and models: single-sequence and paired-sequence.

In general, we refer to $\mathbf{x} \in \mathbb{R}^{T \times V}$ as the one-hot encoding of the primary input sequence, of length $T$ and vocabulary size $V$. The logits are then $f(\mathbf{x})$, and the target class is denoted as $c$.

4.1 Single-sequence

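A $d$-dimensional word embedding followed by a bidirectional LSTM (BiLSTM) encoder is used to transform the one-hot encoding into the hidden states $\mathbf{h}_x$. These hidden states are then aggregated using an additive attention layer $\mathbf{h}_\alpha = \sum_{i=1}^{T} \alpha_i \mathbf{h}_{x,i}$.

To compute the attention weights $\alpha_i$ for each token:

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j=1}^{T} \exp(u_j)}, \quad u_i = \mathbf{v}^\top \tanh(\mathbf{W} \mathbf{h}_{x,i} + \mathbf{b})$$

where $\mathbf{W}, \mathbf{b}, \mathbf{v}$ are model parameters. Finally, $\mathbf{h}_\alpha$ is passed through a fully-connected layer to obtain the logits $f(\mathbf{x})$.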
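As an illustration of the additive attention layer, the following sketch computes attention weights and the aggregated hidden state in plain Python. It assumes the common additive form in the style of Bahdanau2015 (a tanh projection scored by a vector, followed by a softmax); the parameter names `W`, `b`, `v` and the toy values are hypothetical, and the BiLSTM encoder is replaced by hand-written hidden states.

```python
import math

def additive_attention(hidden, W, b, v):
    """Additive attention over T hidden states.

    hidden: list of T vectors (size D); W: A x D matrix; b, v: vectors of size A.
    Returns (attention weights, attention-weighted sum of hidden states).
    """
    scores = []
    for h in hidden:
        projected = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, projected)))
    # Numerically stable softmax over the token axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    context = [sum(a * h[j] for a, h in zip(alpha, hidden))
               for j in range(len(hidden[0]))]
    return alpha, context

# Toy hidden states for a 3-token sequence (D = 2) and toy parameters (A = 2).
hidden = [[0.1, -0.3], [1.2, 0.8], [-0.5, 0.4]]
W = [[0.7, -0.2], [0.1, 0.9]]
b = [0.0, 0.1]
v = [1.0, -0.5]
alpha, context = additive_attention(hidden, W, b, v)
print(alpha, context)
```

The weights `alpha` are what is read off as the attention importance measure: one non-negative score per token, summing to one.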

Single-sequence tasks

  1. Stanford Sentiment Treebank (SST) (Socher2013) – Sentences are classified as positive or negative. The original dataset has 5 classes. Following Jain2019, we label (1,2) as negative, (4,5) as positive, and ignore the neutral sentences.

  2. IMDB Movie Reviews (Maas2011) – Movie reviews are classified as positive or negative.

  3. MIMIC (Diabetes) (Johnson2016) – Uses health records to detect if a patient has Diabetes.

  4. MIMIC (Chronic vs Acute Anemia) (Johnson2016) – Uses health records to detect whether a patient has chronic or acute anemia.

4.2 Paired-sequence

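For paired-sequence problems, the two sequences are denoted as $\mathbf{x}$ and $\mathbf{y}$, of lengths $T_x$ and $T_y$. The inputs are transformed to embeddings using the same embedding matrix, and then transformed using two separate BiLSTM encoders to get the hidden states $\mathbf{h}_x$ and $\mathbf{h}_y$. Likewise, they are aggregated using additive attention $\mathbf{h}_\alpha = \sum_{i=1}^{T_x} \alpha_i \mathbf{h}_{x,i}$.

The attention weights are computed as:

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j=1}^{T_x} \exp(u_j)}, \quad u_i = \mathbf{v}^\top \tanh(\mathbf{W}_x \mathbf{h}_{x,i} + \mathbf{W}_y \mathbf{h}_{y,T_y})$$

where $\mathbf{W}_x, \mathbf{W}_y, \mathbf{v}$ are model parameters. Finally, $\mathbf{h}_\alpha$ is transformed with a dense layer.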

Paired-sequence tasks

  1. Stanford Natural Language Inference (SNLI) (Bowman2015) – Inputs are premise and hypothesis. The hypothesis either entails, contradicts, or is neutral w.r.t. the premise.

  2. bAbI (Weston2015) – A set of artificial text for understanding and reasoning. We use the first three tasks, which consist of questions answerable using one, two, and three sentences from a passage, respectively.

5 Importance Measures

In this section, we describe the importance measures that will be evaluated with ROAR. Notably, it is important to distinguish between those that could reflect the model, and the baselines that by design are independent of the model.

5.1 Model dependent


Attention

These are the $\alpha_i$'s defined in Section 4. Note that, by design, $\alpha_i$ for the [CLS] and [EOS] tokens is zero. Hence, those tokens cannot be masked by ROAR. To keep the comparison fair, the other importance measures are constrained to have the same property.


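Gradient

Let the logits from Section 4 be denoted as $f(\mathbf{x})$. Then the gradient explanation is $\nabla_{\mathbf{x}} f(\mathbf{x})_c$, where $\mathbf{x}$ is the one-hot encoding of the input tokens (Baehrens2010; Li2016). To reduce the vocabulary dimension to a single score per token, we take a norm over it.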
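To make the gradient explanation concrete, the sketch below computes a per-token score from the gradient of a logit with respect to a one-hot input. Several things here are assumptions for illustration only: central finite differences stand in for backpropagation, the L2 norm is used to collapse the vocabulary axis, and `toy_logit` with its `EMBEDDING` values is a hypothetical model, not the BiLSTM used in the paper.

```python
import math

def gradient_importance(f, x, eps=1e-5):
    """Per-token importance: norm over the vocabulary axis of d f / d x[t][v].

    Assumptions: central finite differences replace backpropagation, and the
    L2 norm is used to reduce the vocabulary axis to one score per token.
    """
    x = [row[:] for row in x]  # work on a copy so the caller's input is untouched
    importance = []
    for t in range(len(x)):
        squared = 0.0
        for v in range(len(x[t])):
            original = x[t][v]
            x[t][v] = original + eps
            up = f(x)
            x[t][v] = original - eps
            down = f(x)
            x[t][v] = original
            squared += ((up - down) / (2 * eps)) ** 2
        importance.append(math.sqrt(squared))
    return importance

# Hypothetical toy model: one scalar "embedding" per vocabulary entry,
# squashed by tanh and summed over tokens to give a single logit.
EMBEDDING = [0.5, -1.0, 2.0]

def toy_logit(x):
    return sum(math.tanh(sum(x_tv * e for x_tv, e in zip(row, EMBEDDING)))
               for row in x)

one_hot = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # token ids 0, 2, 1
grad_scores = gradient_importance(toy_logit, one_hot)
print(grad_scores)
```

In this toy model the token whose embedding saturates the tanh (id 2) receives the smallest score, which illustrates how gradients can vanish for inputs the model is already confident about.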

Integrated Gradient

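A successor to the gradient explanation is Integrated Gradient (IG), which Sundararajan2017a argue to be more faithful via axiomatic proofs. However, it requires computing $k$ gradients and choosing a baseline $\mathbf{b}$. We use $k = 50$ like the original paper (Sundararajan2017a), and choose $\mathbf{b}$ as is done in the NLP literature (Mudrakarta2018).

$$\mathrm{IG}(\mathbf{x}) = (\mathbf{x} - \mathbf{b}) \odot \frac{1}{k} \sum_{i=1}^{k} \nabla_{\tilde{\mathbf{x}}_i} f(\tilde{\mathbf{x}}_i)_c, \quad \tilde{\mathbf{x}}_i = \mathbf{b} + \tfrac{i}{k} (\mathbf{x} - \mathbf{b})$$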
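The cost of integrated gradient comes from evaluating the gradient at several points between the baseline and the input. The sketch below shows the standard Riemann-sum approximation on a flat vector. The function `integrated_gradient`, the quadratic toy function, and the all-zero baseline are illustrative assumptions; in the paper's setting the input would be the one-hot encoding and the gradient would come from the model.

```python
def integrated_gradient(grad_fn, x, baseline, k=50):
    """Riemann-sum approximation of integrated gradient with k gradient calls."""
    diff = [x_j - b_j for x_j, b_j in zip(x, baseline)]
    total = [0.0] * len(x)
    for i in range(1, k + 1):
        # Point on the straight line from the baseline to the input.
        point = [b_j + (i / k) * d_j for b_j, d_j in zip(baseline, diff)]
        total = [t_j + g_j for t_j, g_j in zip(total, grad_fn(point))]
    return [d_j * t_j / k for d_j, t_j in zip(diff, total)]

# Toy function f(x) = sum(x_j ** 2), whose gradient is 2 * x.
def toy_f(x):
    return sum(x_j ** 2 for x_j in x)

def toy_grad(x):
    return [2.0 * x_j for x_j in x]

x = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]  # assumption: all-zero baseline, for illustration only
ig = integrated_gradient(toy_grad, x, baseline, k=50)
print(ig, sum(ig), toy_f(x) - toy_f(baseline))
```

With `k=50` the loop calls `grad_fn` fifty times, which is where the 50-fold cost relative to the plain gradient comes from. The attributions approximately sum to `f(x) - f(baseline)`, and the approximation tightens as `k` grows.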


5.2 Baselines


Random

Similarly to the ROAR paper (Hooker2019), we use an uninformative baseline by sampling importance scores from a uniform distribution.

Mutual Information

As an alternative model-agnostic baseline, we use mutual information (MI) (Manning2008) aggregated with a weighted average over each class. For a word $w$ and class $c$, let $P(e_w, e_c)$ be the empirical Bernoulli distribution for observing the word-class pair. Additionally, let $P(e_w)$ and $P(e_c)$ be the marginalized distributions. The averaged mutual information is then the weighted average over classes $c$ of

$$\mathrm{MI}(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c) \log \frac{P(e_w, e_c)}{P(e_w) P(e_c)}$$

Intuitively, as the mutual information increases, the word in a word-class pair becomes more important for identifying the class. Removing all occurrences of a token only in sentences of a given class would leak the target label; to avoid this, we average over all classes.
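A small sketch can make the mutual information baseline concrete. It uses the standard word-class mutual information computed from a 2x2 contingency table (as in Manning2008). Two choices are assumptions made only for this illustration: the logarithm is base 2, and the weights of the average over classes are the empirical class frequencies. The function names and the toy documents are hypothetical.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a word-presence indicator and a class indicator.

    n11: documents with the word, in the class; n10: with the word, not in the class;
    n01: without the word, in the class;        n00: without the word, not in the class.
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for n_joint, n_word, n_class in cells:
        if n_joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            mi += (n_joint / n) * math.log2(n * n_joint / (n_word * n_class))
    return mi

def averaged_mi(docs, word):
    """Average the per-class MI, weighted by class frequency (an assumption)."""
    n = len(docs)
    total = 0.0
    for c in sorted({label for _, label in docs}):
        n11 = sum(1 for words, label in docs if word in words and label == c)
        n10 = sum(1 for words, label in docs if word in words and label != c)
        n01 = sum(1 for words, label in docs if word not in words and label == c)
        n00 = n - n11 - n10 - n01
        total += ((n11 + n01) / n) * mutual_information(n11, n10, n01, n00)
    return total

docs = [
    ({"the", "great"}, 1), ({"the", "great"}, 1),
    ({"the", "bad"}, 0), ({"the", "bad"}, 0),
]
print(averaged_mi(docs, "great"), averaged_mi(docs, "the"))
```

Because the score depends only on the word and the dataset, every occurrence of a word gets the same importance in every sentence, which is why this baseline cannot explain any particular model.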

6 Experiments

Our datasets, models, and performance metrics are identical to those used in Jain2019 and most other literature evaluating the faithfulness of attention.

Specifically, we use the micro-F1 score for SNLI, accuracy for bAbI, and the macro-F1 score for SST, IMDB, Diabetes, and Anemia. Reproduced results, comparison with Jain2019, and dataset statistics are presented in Table 1.

Dataset     Avg. Length  Train size            Test size       Performance [%]
                                                               Jain2019         Reproduced
SST         20           3130/3449             889/887         81.0
IMDB        181          8685/8527             2234/2128       88.0
SNLI        16           183416/183187/182764  3368/3237/3219  78.0
Anemia      2267         1522/2740             449/793         92.0
Diabetes    2207         6650/1416             1389/340        79.0
bAbI-1/2/3  38/96/308    8500                  1000            100.0/48.0/62.0
Table 1: Dataset statistics for single-sequence and paired-sequence tasks. Following Jain2019, we use the same BiLSTM model and report performance as macro-F1 for SST, IMDB, Anemia and Diabetes, micro-F1 for SNLI, and accuracy for bAbI.

In general, all results are aggregates over 5 seeds using a mean and a 95% confidence interval using a t-distribution in logistic-space.

6.1 Sparsity

Before performing the main ROAR experiments to evaluate the faithfulness, we evaluate the sparsity of each importance measure. The motivation is that attention can become sparse, meaning the majority of the attention is applied to just a few tokens (Bahdanau2015). If this is the case, it would not make sense to mask out 50% of a sequence of 200 tokens, if the top-5 tokens contribute 99% of the attention mass. Instead, it would be more appropriate to mask an absolute number of tokens, for example, to mask from 0 to 10 tokens.

To evaluate the sparsity of each importance measure, we simply measure how much of the total importance is assigned to a specific number of tokens. The results in Figure 3 reveal that attention is not sparse enough to justify masking an absolute number of tokens.
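The sparsity measurement described above amounts to asking what share of the total importance the top-k tokens hold. A minimal sketch follows; the function name `importance_coverage` and the example scores are hypothetical, and taking absolute values (so that signed scores such as gradients can be handled) is an assumption.

```python
def importance_coverage(scores, k):
    """Fraction of the total absolute importance held by the k top-scoring tokens."""
    magnitudes = sorted((abs(s) for s in scores), reverse=True)
    total = sum(magnitudes)
    return sum(magnitudes[:k]) / total if total > 0 else 0.0

# A sparse and a flat importance distribution over five tokens.
sparse = [0.90, 0.04, 0.03, 0.02, 0.01]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]
for k in (1, 2, 3):
    print(k, importance_coverage(sparse, k), importance_coverage(flat, k))
```

A measure is sparse when this curve approaches 1 after only a few tokens, which is the situation where masking an absolute number of tokens would be more appropriate than masking a relative number.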

Interestingly, attention is still more sparse than gradient and integrated gradient, an important property that can make it easier for humans to understand (Miller2019). In the discussion section, we elaborate on this.

Figure 3: The relative amount of the importance measures covered by selecting a given number of tokens. Results are mean over 5 seeds with a 95% confidence interval. The figure shows that no importance measure is particularly sparse.

6.2 ROAR

Figure 4: ROAR results, showing model performance at x% of tokens masked. A model performance below random indicates faithfulness, while above or similar to random indicates a non-faithful importance measure for the recursive column. Performance is averaged over 5 seeds with a 95% confidence interval.

To evaluate faithfulness, we apply both ROAR (Not Recursive) and our Recursive ROAR to each dataset. The results are presented in Figure 4 and are aggregated over 5 different seeds.

How to interpret

Because Recursive ROAR mitigates the dataset redundancy issue discussed in Section 3.2, the Recursive ROAR results in Figure 4 are the most relevant. The ROAR (Not Recursive) results primarily exist as an ablation study.

If the model performance of a given importance measure is below the random baseline, then this indicates that the importance measure is faithful. For the Recursive ROAR case, if the model performance is above or equal to the random baseline, then this indicates that the importance measure is not faithful. For the ROAR (Not Recursive) case, it is not possible to make a conclusion when model performance is above or equal to the random baseline, since this can also be due to a dataset redundancy.

As a secondary baseline, we include mutual information. Note, mutual information can by definition not explain the model, as it does not depend on the model. However, it provides value as a qualitative comparison, as it is often effective at selecting relevant information. Hence, when the model performance of an importance measure is below that of mutual information, it indicates that while the importance measure might be faithful, a more faithful importance measure should exist.

Figure 4 also presents the model performance at 100% masking. This provides a lower bound for the model performance, which is useful for comparison as datasets are often biased. These biases come from unbalanced class representation, a correlation between sequence length and the gold label, or the secondary sequence for the paired-sequence tasks (Gururangan2018).

Important observations

Based on the results in Figure 4, we highlight the following:


  • No importance measure is consistently faithful; instead, faithfulness is task-dependent.

  • Although attention provides no mathematical guarantee to be a faithful explanation, models often converge such that attention is faithful.

  • Importance measures often work best for the top-20% most important tokens. Above 20%, mutual information often masks relevant tokens similarly to the importance measures.

  • Integrated gradient is not necessarily more faithful than gradient or attention. This is evident from the bAbI and Diabetes datasets. This is surprising as integrated gradient is argued to be more faithful than the gradient method (Sundararajan2017a).

  • Comparing ROAR (Not Recursive) and Recursive ROAR, most datasets have redundancies that interfere with ROAR. For example, with the Diabetes dataset, only with Recursive ROAR can gradient be seen to be faithful.

  • When the performance increases as more tokens are masked, this is due to class leakage. For example, if the “and” token is masked only in sentences with positive sentiment, the remaining “and” tokens become a good predictor of negative sentiment. This new redundancy will then be removed in the next iteration. However, because Recursive ROAR is approximated with a step-size of 10%, this is imperfect. For more details, see Appendix A.

  • Compared to the ROAR results for computer vision (Hooker2019), gradient and integrated gradient are sometimes faithful in NLP, while in CV they are consistently not faithful. Because Hooker2019 did not remove redundancies, this could also be due to differences in redundancies. However, given that other importance measures were faithful, this is an unlikely explanation.

6.3 Faithfulness metric

While a ROAR plot can provide valuable insights, such as “this importance measure is only faithful for the top-30% most important tokens”, it doesn’t summarize the faithfulness of an importance measure and dataset pair in a single value, which is useful for comparison across multiple papers.

To provide a scalar benchmark, we propose using an area-between-curves metric. Specifically, the goal is to maximize the area between the random baseline curve and the importance measure curve. Additionally, when the importance measure is above the baseline a negative area is contributed. Finally, the metric should be normalized by an upper bound based on 100% masking.

Using an area-between-curves is useful because, unlike many other summarizing statistics, it is invariant to the ROAR resolution. In our case, we have a step size of 10%, which was chosen for computational reasons. Future work may choose a smaller or larger step size depending on the available computational resources.

Let $r_i$ be the masking ratio at step $i$ out of $I$ total steps; in our case, $r_i \in \{0\%, 10\%, \ldots, 100\%\}$. Let $p_i$ be the model performance for a given importance measure and $b_i$ be the random baseline performance. With this, the metric is defined in (6), and we present the results in Table 2.
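One way to read the description of the metric is as a ratio of two trapezoidal areas, sketched below. This is an interpretation under stated assumptions and not necessarily the exact definition in (6): the numerator is the signed area between the random baseline curve and the importance measure curve, and the denominator is the area between the random baseline curve and its own value at 100% masking, used as the upper bound. The function name `faithfulness` and the example curves are hypothetical.

```python
def faithfulness(ratios, measure_perf, baseline_perf):
    """Normalized area between the random-baseline curve and the measure curve.

    ratios:        masking ratios in increasing order, ending at 1.0 (100% masked)
    measure_perf:  model performance when masking by the importance measure
    baseline_perf: model performance when masking randomly
    Assumes the baseline is not flat; otherwise the normalizer would be zero.
    """
    def trapezoid(values):
        return sum(
            0.5 * (ratios[i + 1] - ratios[i]) * (values[i] + values[i + 1])
            for i in range(len(ratios) - 1)
        )

    # Positive where the measure is below the baseline, negative where above.
    gap = [b - p for p, b in zip(measure_perf, baseline_perf)]
    # Largest possible gap: dropping straight to the 100%-masked performance.
    bound = [b - baseline_perf[-1] for b in baseline_perf]
    return trapezoid(gap) / trapezoid(bound)

ratios = [0.0, 0.5, 1.0]
baseline = [0.9, 0.8, 0.5]
measure = [0.9, 0.6, 0.5]
print(faithfulness(ratios, measure, baseline))
```

Because the areas are integrated over the masking ratio rather than summed per step, curves sampled at different step sizes remain comparable, which matches the stated motivation for an area-based metric.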

Dataset   Importance Measure   Faithfulness
Anemia    Attention
          Integrated Gradient
Diabetes  Attention
          Integrated Gradient
IMDB      Attention
          Integrated Gradient
SNLI      Attention
          Integrated Gradient
SST       Attention
          Integrated Gradient
bAbI-1    Attention
          Integrated Gradient
bAbI-2    Attention
          Integrated Gradient
bAbI-3    Attention
          Integrated Gradient
Table 2: Faithfulness metric defined as an area-between-curves, see (6). Higher values mean more faithful; zero or negative values mean distinctly not faithful.

The results in Table 2 make it clearer that attention is a surprisingly faithful importance measure; only for IMDB and SST does it not provide the highest faithfulness. However, it is worth mentioning that for most other datasets it is not statistically significantly better than either gradient or integrated gradient.

7 Discussion

In general, the ROAR results indicate that the faithfulness of the tested importance measures is task-dependent. For attention this is not surprising, as its faithfulness depends on the BiLSTM layer, specifically how much it mixes or shifts the input embeddings. Since the behavior of the BiLSTM should be task-dependent, this also makes the faithfulness of attention task-dependent.

However, this does not answer why gradient and integrated gradient are also task-dependent, as these importance measures should consider the BiLSTM behavior. Understanding this remains an open question.

Although we found no importance measure to be significantly more faithful than the others, attention is often more sparse than other importance measures, depending on the task. This is valuable, as sparse explanations are often easier for humans to understand (Miller2019).

Because interpretability is the “ability to explain or to present in understandable terms to a human” (Doshi-Velez2017a), how effective an explanation is in communicating to a human and the faithfulness of the explanation, are two separate but equally important concerns (Doshi-Velez2017a; Jacovi2020).

Each importance measure also has different computational requirements, with the attention explanation being free and integrated gradient being 50 times more expensive than gradient.

This computational difference makes attention an attractive choice. However, for more complex models like BERT (Devlin2019), the many layers mix embeddings to such an extent that attention may no longer be faithful. Future work will need to determine which, if any, importance measures can be used for such models.

8 Conclusion

This paper evaluates the faithfulness of attention, gradient, and integrated gradient using an improved version of ROAR, called Recursive ROAR.

Our analysis provides valuable insights, which we describe in Section 6. In general, all three importance measures are faithful, although none is significantly more faithful than others.

We hope this paper can help to establish ROAR as a standardized benchmark for the faithfulness of importance measures in NLP.


SR is supported by the Canada CIFAR AI Chairs program and the NSERC Discovery Grant program.


Appendix A ROAR Results with step size of one

Because using a relative step-size, such as 10% of tokens, can create dataset redundancies by leaking the class, the performance can increase. This is particularly clear in the Recursive ROAR bAbI-3 results from Figure 4. To prevent this, one should use a step-size of exactly 1 token. Unfortunately, using a step-size of 1 token is too computationally expensive; for example, the Diabetes dataset has an average length of 2207 tokens. Instead, we approximate Recursive ROAR by using a 10% step size.

However, in this appendix, we provide the Recursive ROAR results for a step-size of 1 token in Figure 5. We do not exceed 10 masked tokens, to keep the number of retrained models low.

Figure 5 shows that the performance does not increase when a step-size of exactly 1 token is used, because it is not an approximation. The difference is particularly noticeable when comparing the Recursive ROAR bAbI-3 results in Figure 4 and Figure 5.

Figure 5: ROAR results, showing model performance at an absolute number of tokens masked. A model performance below random indicates faithfulness, while above or similar to random indicates a non-faithful importance measure for the recursive column. Performance is averaged over 5 seeds with a 95% confidence interval.