The ability to explain neural networks benefits both accountability and ethics when deploying models (Doshi-Velez2017) and the development of a scientific understanding of what models do (Doshi-Velez2017a). In NLP in particular, attention (Bahdanau2015) is often used as an explanation to provide insight into the logical process of a model (Belinkov2019).
Attention, like other methods such as gradient (Baehrens2010; Li2016) and integrated gradient (Sundararajan2017a; Mudrakarta2018), explains which input tokens are relevant for a given prediction. This type of explanation is called an importance measure. However, there is no mathematical guarantee that these importance measures accurately reflect the model's logic, a property called faithfulness.
In the NLP literature, the faithfulness of attention is regularly debated (Jain2019; Wiegreffe2020; Vashishth2019; Serrano2019; Pruthi2020), often with contradicting conclusions. In related work (Section 2), we summarize the issues with the methods previously used to analyze attention. In computer vision, the faithfulness of the gradient and integrated gradient methods has been debated as well (Adebayo2018; Kindermans2019).
Measuring faithfulness is inherently difficult, as it is usually impossible to provide gold labels for a “correct explanation”, making proxy measures the only alternative. The analysis in this paper is based on ROAR (Hooker2019), which is also a proxy measure but is argued to be more principled.
ROAR’s foundational principle is: if information is truly important, then removing it from the dataset and retraining the model should result in a worse model. Importantly, this can be compared with removing random information.
However, ROAR assumes that there are no data redundancies, because after important information is removed, redundancies can keep the model's performance high, giving the illusion of an unfaithful importance measure. To solve this, we propose a modified version of ROAR, called Recursive ROAR.
Our primary contributions are: 1) developing Recursive ROAR, 2) adapting ROAR to NLP tasks, and 3) measuring the faithfulness of attention, gradient, and integrated gradient. In addition, we also:
Propose a metric for the faithfulness of importance measures, as a standard benchmark.
Propose mutual information as an additional baseline to the random baseline.
We find that no importance measure is generally better than the others; rather, the faithfulness is task-dependent. This is valuable knowledge, as although the importance measures might be equal in faithfulness, they are not equal in computational requirements or understandability to humans.
In particular, we find that attention generally provides more sparse explanations than gradient or integrated gradient. Although the faithfulness may be the same, a sparser explanation is often easier for humans to understand (Miller2019).
Computationally speaking, integrated gradient is 50 times more expensive than the gradient method. This additional cost is usually justified by integrated gradient being considered more faithful than gradient. However, our results indicate that this is not a worthwhile trade-off.
2 Related Work
Much recent work in NLP has been devoted to investigating the faithfulness of attention as an interpretability method (Jain2019; Wiegreffe2020; Vashishth2019; Serrano2019; Pruthi2020; Meister2021a). In this section, we categorize the most relevant methods into three general ideas and discuss their drawbacks.
Importantly, all publications, including ours, use the same model, dataset, and training procedure as Jain2019; all works are therefore easily comparable.
2.1 Comparing with alternative importance measure
The idea is to compare attention with an alternative importance measure, such as the gradient method; if the two correlate, this would validate attention's faithfulness. Jain2019 specifically compare with the gradient method and the leave-one-out method. Meister2021a repeat this experiment in a broader context.
Both Jain2019 and Meister2021a find that there is little correlation between the importance measures and interpret this as attention not being faithful.
Jain2019 do acknowledge the limitation of this approach: the alternative importance measures are not themselves guaranteed to be faithful. A correlation, or lack thereof, therefore does not inform about faithfulness. This is a criticism we agree with and highlight here.
2.2 Mutate attention to deceive
Jain2019 propose that if there exist alternative attention weights that produce the same prediction, attention is unfaithful.
Jain2019 implement this idea by optimizing for no prediction change but a large change in attention, and find that deceiving attention distributions do exist. Vashishth2019 and Meister2021a apply a similar method and reach similar results.
Wiegreffe2020 find this analysis problematic because the attention distribution is changed directly, thereby creating an out-of-distribution issue. This means that the new attention distribution may be impossible to obtain naturally from just changing the input, and it therefore says little about the faithfulness of attention.
2.3 Optimize model to deceive
Because the mutate attention to deceive approach has been criticized for using direct mutation, an alternative idea is to learn an adversarial attention.
Wiegreffe2020's approach is to maximize the KL-divergence between normal attention and adversarial attention while minimizing the prediction difference of the two models. By varying the allowed prediction difference, they show that it is not possible to significantly change the attention weights without changing the predictions, thereby invalidating the mutate-attention-to-deceive experiments. Pruthi2020 perform a similar analysis but find the opposite: that it is possible to significantly change the attention weights without affecting performance. Furthermore, they use this as evidence for attention not being faithful.
We find this approach problematic because, by changing the optimization criteria, the analysis is no longer about attention in a standard model, which is the subject of interest. We therefore find that this analysis only works as a criticism of the mutate-attention-to-deceive approach, not as an evaluation of faithfulness.
2.4 Why evaluating faithfulness is difficult
It is worth recognizing that evaluating faithfulness is difficult because we are not able to explain the models ourselves just from the weights and internal states. Therefore, it is often impossible for humans to annotate what a correct explanation is, which leaves us with only proxy measures.
3 ROAR: RemOve And Retrain
To avoid the problems with the current approaches to measure faithfulness described in Section 2, we apply ROAR.
ROAR has been used in Computer Vision to evaluate the faithfulness of importance measure (Hooker2019). The central idea is that if information is truly important, then removing it from the dataset and re-training the model on this reduced dataset should result in a worse performance. This can then be compared with an uninformative baseline, where information is removed randomly. The hypothesis is, if the importance measure is faithful it should drop more in performance compared to the baseline.
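The ROAR idea just described can be sketched in a few lines. Here `importance` and `train_and_eval` are hypothetical stand-ins for the importance measure and the retrain-then-evaluate step, and the masking details are our own simplification:

```python
import random

def roar(dataset, importance, mask_fraction, train_and_eval):
    """One ROAR evaluation: mask allegedly important tokens, retrain, compare."""
    def mask_top(tokens, scores, k):
        # Replace the k highest-scoring tokens with a [MASK] token.
        top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
        return ["[MASK]" if i in top else t for i, t in enumerate(tokens)]

    masked, randomized = [], []
    for tokens in dataset:
        k = round(mask_fraction * len(tokens))
        masked.append(mask_top(tokens, importance(tokens), k))
        # Uninformative baseline: mask the same number of tokens, chosen at random.
        rand_scores = [random.random() for _ in tokens]
        randomized.append(mask_top(tokens, rand_scores, k))

    # A faithful importance measure should hurt performance more than random masking.
    return train_and_eval(masked), train_and_eval(randomized)
```

The two returned performances correspond to the importance-measure curve and the random-baseline curve in the ROAR plots.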
This section covers how ROAR is adapted to an NLP context. Furthermore, we resolve an issue where ROAR cannot distinguish between dataset redundancies and a non-faithful importance measure, by applying ROAR recursively.
3.1 Adaptation to NLP
ROAR was originally proposed for importance measures in computer vision. In this context, pixels measured to be important are “removed” by replacing them with an uninformative value, for example, a gray pixel (Hooker2019).
In this work, ROAR is applied to attention and other importance measures used on both single-sequence and paired-sequence classification models. Because these models use tokens, the uninformative value is a special [MASK] token (for an example, see Figure 1). We choose a [MASK] token rather than removing the token in order to keep the sequence length, which is an information source unrelated to importance measures. Additionally, removing tokens could result in ungrammatical inputs; using a [MASK] token allows the model to infer the missing information.
3.2 Recursive ROAR
When the model performance is worse on the dataset with allegedly important tokens removed than when random tokens are removed, the conclusion is that the importance measure is faithful. However, when the performance is similar, the conclusion is unclear: Hooker2019 explain that it can be due to either a dataset redundancy or a non-faithful importance measure.
The reason why a dataset redundancy can affect the results is that, after truly important tokens are removed, the redundant information, which the model would normally not use, still exists and provides the necessary information to keep the model’s performance high. An example of this issue is demonstrated in Figure 1.
We solve this issue by recursively recomputing the importance measure at each iteration of information removal. This way, if the importance measure is faithful, it will mark the redundant information as important, and the redundancy will then be removed in a subsequent iteration. Note that already-masked tokens are kept masked, regardless of whether the importance measure considers the [MASK] token important. We call this Recursive ROAR and provide an example in Figure 2.
Note that Recursive ROAR might not remove all redundancies unless the step size is one token. However, because ROAR requires retraining the model for every evaluation step, this is infeasible. Instead, we approximate it by removing a relative number of tokens at each step. We discuss this further in Appendix A.
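The recursive variant can be sketched as follows, again with a hypothetical per-token `importance` function and simplified token-level masking; the key difference from plain ROAR is that importance is recomputed on the partially masked sequence:

```python
def recursive_roar(tokens, importance, steps, step_fraction=0.10):
    """Iteratively mask tokens, recomputing importance after each masking step.

    `importance` is a hypothetical function returning one score per token. It is
    recomputed on the partially-masked sequence, so redundant information that
    the model falls back on can itself be marked important and removed.
    """
    tokens = list(tokens)
    n = len(tokens)
    for _ in range(steps):
        scores = importance(tokens)
        # Already-masked tokens stay masked: exclude them from the ranking.
        candidates = [i for i, t in enumerate(tokens) if t != "[MASK]"]
        k = round(step_fraction * n)
        top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
        for i in top:
            tokens[i] = "[MASK]"
    return tokens
```

With a step fraction equivalent to one token, this would remove all redundancies; larger steps are the computational approximation discussed above.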
4 Models and Tasks
The tasks, models, hyperparameters, and pre-trained word embeddings are the same as in Jain2019. There are two types of tasks and models: single-sequence and paired-sequence.
In general, we refer to $x \in \{0, 1\}^{T \times V}$ as the one-hot encoding of the primary input sequence, of length $T$ and vocabulary size $V$. The logits are then $\hat{y}$, and the target class is denoted as $y$.
A $d$-dimensional word embedding followed by a bidirectional LSTM (BiLSTM) encoder is used to transform the one-hot encoding into the hidden states $h \in \mathbb{R}^{T \times 2d}$. These hidden states are then aggregated using an additive attention layer, yielding the context vector $c = \sum_{t=1}^{T} \alpha_t h_t$.
To compute the attention weights $\alpha$ for each token:
$$\alpha = \operatorname{softmax}\!\big(\tanh(h W_1)\, w_2\big),$$
where $W_1$ and $w_2$ are model parameters. Finally, the context vector $c$ is passed through a fully-connected layer to obtain the logits $\hat{y}$.
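As a minimal numerical sketch of this additive attention aggregation, with illustrative parameter names and shapes (not taken from Jain2019):

```python
import numpy as np

def additive_attention(h, W1, w2):
    """Additive (Bahdanau-style) attention over BiLSTM hidden states.

    h : (T, 2d) hidden states, W1 : (2d, d) projection, w2 : (d,) scoring vector.
    Returns the attention weights alpha (T,) and the context vector (2d,).
    """
    scores = np.tanh(h @ W1) @ w2          # (T,) unnormalized scores
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    context = alpha @ h                    # weighted sum of hidden states
    return alpha, context
```

The weights `alpha` are exactly the quantities used as the attention importance measure later in the paper.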
Stanford Sentiment Treebank (SST) (Socher2013)
– Sentences are classified as positive or negative. The original dataset has 5 classes; following Jain2019, we label (1, 2) as negative and (4, 5) as positive, and ignore the neutral sentences.
IMDB Movie Reviews (Maas2011) – Movie reviews are classified as positive or negative.
MIMIC (Diabetes) (Johnson2016) – Uses health records to detect if a patient has Diabetes.
MIMIC (Chronic vs Acute Anemia) (Johnson2016) – Uses health records to detect whether a patient has chronic or acute anemia.
For paired-sequence problems, the two sequences are denoted as $x^{(1)}$ and $x^{(2)}$. The inputs are transformed to embeddings using the same embedding matrix, and then transformed using two separate BiLSTM encoders to get the hidden states $h^{(1)}$ and $h^{(2)}$. Likewise, they are aggregated using additive attention, $c = \sum_{t=1}^{T} \alpha_t h^{(1)}_t$.
The attention weights are computed as:
$$\alpha = \operatorname{softmax}\!\big(\tanh(h^{(1)} W_1 + h^{(2)}_{T_2} W_2)\, w_3\big),$$
where $W_1$, $W_2$, and $w_3$ are model parameters. Finally, $c$ is transformed with a dense layer.
Stanford Natural Language Inference (SNLI)(Bowman2015) – Inputs are premise and hypothesis. The hypothesis either entails, contradicts, or is neutral w.r.t. the premise.
bAbI (Weston2015) – A set of artificial text for understanding and reasoning. We use the first three tasks, which consist of questions answerable using one, two, and three sentences from a passage, respectively.
5 Importance Measures
In this section, we describe the importance measures that will be evaluated with ROAR. Notably, it is important to distinguish between those that could reflect the model, and the baselines that by design are independent of the model.
5.1 Model dependent
These are the attention weights $\alpha$ defined in Section 4. Note that, by design, the attention weight for the [CLS] and [EOS] tokens is zero. Hence, those tokens cannot be masked by ROAR. To keep the comparison with other importance measures fair, they are constrained to have the same property.
Let the logits from Section 4 be denoted as $\hat{y}$. Then the gradient explanation is $\partial \hat{y}_y / \partial x$, where $x$ is the one-hot encoding of the input tokens (Baehrens2010; Li2016). To reduce the vocabulary dimension, we use an $L_2$-norm.
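A minimal sketch of this gradient explanation, using finite differences on a toy model instead of autodiff; the function and model here are illustrative, not the paper's implementation:

```python
import numpy as np

def gradient_importance(model, x, target, eps=1e-5):
    """Approximate d logit_target / d x for a one-hot input x (shape T x V),
    then reduce the vocabulary dimension with an L2-norm, giving one
    importance score per token. Real implementations use autodiff;
    central finite differences are only a sketch."""
    grad = np.zeros_like(x, dtype=float)
    T, V = x.shape
    for t in range(T):
        for v in range(V):
            xp, xm = x.astype(float).copy(), x.astype(float).copy()
            xp[t, v] += eps
            xm[t, v] -= eps
            grad[t, v] = (model(xp)[target] - model(xm)[target]) / (2 * eps)
    return np.linalg.norm(grad, axis=1)  # (T,) one score per token
```

For a linear bag-of-words model the gradient is the same at every position, which illustrates why gradient saliency is only informative for models that treat positions differently.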
A successor to the gradient explanation is Integrated Gradient (IG), which its authors argue to be more faithful via axiomatic proofs (Sundararajan2017a). However, it requires computing $k$ gradients along an interpolation path between the input and a baseline input $b$. We use $k = 50$ as in the original paper (Sundararajan2017a), and choose the baseline as is done in the NLP literature (Mudrakarta2018).
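The IG computation can be sketched as a $k$-step Riemann sum, assuming a user-supplied `grad_fn` for the target logit (names are our own):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, k=50):
    """Integrated gradients via a k-step midpoint Riemann sum along the
    straight line from `baseline` to `x` (Sundararajan et al., 2017).
    `grad_fn(x)` must return the gradient of the target logit w.r.t. x."""
    alphas = (np.arange(k) + 0.5) / k                # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / k                # elementwise attribution
```

The $k$ gradient evaluations are what make IG roughly $k$ times more expensive than the plain gradient method; for a linear model, IG reduces exactly to gradient times $(x - b)$.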
Similarly to the ROAR paper (Hooker2019), we use an uninformative baseline where importance scores are sampled from a uniform distribution.
As an alternative model-agnostic baseline, we use mutual information (MI) (Manning2008), aggregated with a weighted average over each class. For a word $u$ and class $c$, let $P(U, C)$ be the empirical Bernoulli distribution for observing the word-class pair, where $U \in \{0, 1\}$ indicates the presence of word $u$ and $C \in \{0, 1\}$ indicates whether the label is class $c$. Additionally, let $P(U)$ and $P(C)$ be the marginalized distributions. The averaged mutual information is then
$$\overline{\mathrm{MI}}(u) = \sum_{c} P(C = 1) \sum_{e_u \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_u, C = e_c) \log_2 \frac{P(U = e_u, C = e_c)}{P(U = e_u)\, P(C = e_c)}$$
Intuitively, as the mutual information increases, the word becomes more important for identifying the class. To avoid leaking the target label, which would happen if we removed all occurrences of a token only for sentences of a given class, we average over all classes.
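This class-weighted mutual information can be sketched as follows, treating each document as a set of words; this is a simplification of the empirical distributions described above:

```python
import math
from collections import Counter

def averaged_mutual_info(docs, labels, word):
    """Mutual information between a word's presence and each class indicator,
    averaged with P(class) weights (cf. Manning et al., 2008)."""
    n = len(docs)
    classes = Counter(labels)
    total = 0.0
    for c, n_c in classes.items():
        mi = 0.0
        for u in (0, 1):          # word absent / present
            for y in (0, 1):      # label is not c / is c
                joint = sum(1 for d, l in zip(docs, labels)
                            if (word in d) == u and (l == c) == y) / n
                pu = sum(1 for d in docs if (word in d) == u) / n
                py = sum(1 for l in labels if (l == c) == y) / n
                if joint > 0:
                    mi += joint * math.log2(joint / (pu * py))
        total += (n_c / n) * mi
    return total
```

A perfectly class-predictive word in a balanced binary dataset scores 1 bit, while a word present in every document scores 0.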
Our datasets, models, and performance metrics are identical to those used in Jain2019 and most other literature evaluating the faithfulness of attention.
Specifically, we use the micro-F1 score for SNLI, accuracy for bAbI, and the macro-F1 score for SST, IMDB, Diabetes, and Anemia. Reproduced results, comparison with Jain2019, and dataset statistics are presented in Table 1.
[Table 1: Dataset, Avg. Length, Train size, Test size, Performance [%]]
In general, all results are aggregated over 5 seeds, reporting a mean and a 95% confidence interval computed with a t-distribution in logistic space.
Before performing the main ROAR experiments to evaluate the faithfulness, we evaluate the sparsity of each importance measure. The motivation is that attention can become sparse, meaning the majority of the attention is applied to just a few tokens (Bahdanau2015). If this is the case, it would not make sense to mask out 50% of a sequence of 200 tokens, if the top-5 tokens contribute 99% of the attention mass. Instead, it would be more appropriate to mask an absolute number of tokens, for example, to mask from 0 to 10 tokens.
To evaluate the sparsity of each importance measure, we simply measure how much of the total importance is assigned to a specific number of tokens. The results in Figure 3 reveal that attention is not sparse enough to justify masking an absolute number of tokens.
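The sparsity measurement can be sketched as the fraction of total importance mass captured by the top-$k$ tokens; this formulation of the plotted quantity is our own:

```python
import numpy as np

def importance_mass(scores, k):
    """Fraction of the total importance mass assigned to the top-k tokens."""
    s = np.sort(np.abs(scores))[::-1]  # sort magnitudes, largest first
    return s[:k].sum() / s.sum()
```

If the top-5 tokens captured 99% of the mass, masking an absolute number of tokens would be appropriate; our measurements show this is not the case for attention.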
Interestingly, attention is still more sparse than gradient and integrated gradient, an important property that can make explanations easier for humans to understand (Miller2019). We elaborate on this in the discussion section.
To evaluate faithfulness, we apply both ROAR (Not Recursive) and our Recursive ROAR experiment to each dataset. The results are presented in Figure 4 and aggregated over 5 different seeds.
How to interpret
Because Recursive ROAR mitigates the dataset redundancy issue discussed in Section 3.2, the Recursive ROAR results in Figure 4 are the most relevant. The ROAR (Not Recursive) results primarily exist as an ablation study.
If the model performance of a given importance measure is below the random baseline, then this indicates that the importance measure is faithful. For the Recursive ROAR case, if the model performance is above or equal to the random baseline, then this indicates that the importance measure is not faithful. For the ROAR (Not Recursive) case, it is not possible to make a conclusion when model performance is above or equal to the random baseline, since this can also be due to a dataset redundancy.
As a secondary baseline, we include mutual information. Note that mutual information, by definition, cannot explain the model, as it does not depend on the model. However, it provides value as a qualitative comparison, since it is often effective at selecting relevant information. Hence, when the model performance of an importance measure is below that of mutual information, it indicates that while the importance measure might be faithful, a more faithful importance measure should exist.
Figure 4 also presents the model performance at 100% masking. This provides a lower bound for the model performance, which is useful for comparison as datasets are often biased. These biases come from unbalanced class representation, a correlation between sequence length and the gold label, or the secondary sequence for the paired-sequence tasks (Gururangan2018).
Based on the results in Figure 4, we highlight the following:
No importance measure is consistently faithful, instead the faithfulness is task dependent.
Although attention provides no mathematical guarantee to be a faithful explanation, models often converge such that attention is faithful.
Importance measures often work best for the top-20% most important tokens. Above 20%, mutual information often masks relevant tokens similarly to the importance measures.
Integrated gradient is not necessarily more faithful than gradient or attention. This is evident from the bAbI and Diabetes datasets. This is surprising as integrated gradient is argued to be more faithful than the gradient method (Sundararajan2017a).
Comparing ROAR (Not Recursive) with Recursive ROAR shows that most datasets have redundancies that interfere with ROAR. For example, on the Diabetes dataset, only with Recursive ROAR can the gradient method be seen to be faithful.
When the performance increases as more tokens are masked, this is due to class leakage. For example, if a token is masked only in sentences with positive sentiment, its remaining occurrences become a good predictor of negative sentiment. This new redundancy will then be removed in the next iteration. However, because Recursive ROAR is approximated with a step size of 10%, this removal is imperfect. For more details, see Appendix A.
Compared to the ROAR results for computer vision (Hooker2019), gradient and integrated gradient are sometimes faithful in NLP, while in computer vision they are consistently not faithful. However, because Hooker2019 did not remove redundancies, the difference could also be due to differences in redundancies. Given that other importance measures were found faithful in their setting, though, this is an unlikely explanation.
6.3 Faithfulness metric
While a ROAR plot can provide valuable insights, such as “this importance measure is only faithful for the top-30% most important tokens”, it does not summarize the faithfulness of a given importance-measure and dataset pair to a single value, which is useful for comparison across multiple papers.
To provide a scalar benchmark, we propose using an area-between-curves metric. Specifically, the goal is to maximize the area between the random baseline curve and the importance measure curve. Additionally, when the importance measure is above the baseline a negative area is contributed. Finally, the metric should be normalized by an upper bound based on 100% masking.
Using an area-between-curves is useful because unlike many other summarizing statistics it is invariant to the ROAR resolution. In our case, we have a step size of , which was chosen for computational reasons. Future work may choose a smaller or larger step size depending on their computational resources.
Let $x_i = i/T$ be the masking ratio at step $i$ out of $T$ total steps; in our case $T = 10$. Let $f(x_i)$ be the model performance for a given importance measure and $r(x_i)$ be the random baseline performance. With this, the metric is defined in (6), and we present the results in Table 2.
$$\mathrm{faithfulness} = \frac{\sum_{i=1}^{T} \big(r(x_i) - f(x_i)\big)}{\sum_{i=1}^{T} \big(r(x_i) - f(1)\big)} \qquad (6)$$
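Under our reading of the metric's description (area between the curves, counted negatively when the importance measure is above the baseline, normalized by the 100%-masking bound), a sketch is:

```python
def faithfulness_metric(f, r, f_full):
    """Area between the random-baseline curve r and the importance-measure
    curve f, normalized by the area down to the 100%-masking performance
    f_full. The exact discretization here is our own choice; f and r hold
    one performance value per masking step."""
    numerator = sum(r_i - f_i for f_i, r_i in zip(f, r))
    denominator = sum(r_i - f_full for r_i in r)
    return numerator / denominator
```

A value near 1 means the importance measure degrades the model nearly as much as is possible; a negative value means it performs worse than random masking.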
The results in Table 2 make it clearer that attention is a surprisingly faithful importance measure; only for IMDB and SST does it not provide the highest faithfulness. However, it is worth mentioning that for most other datasets it is not statistically significantly better than either gradient or integrated gradient.
In general, the ROAR results indicate that the faithfulness of the tested importance measures is task-dependent. For attention this is not surprising, as its faithfulness depends on the BiLSTM layer, specifically on how much it mixes or shifts the input embeddings. Since the behavior of the BiLSTM should be task-dependent, this also makes the faithfulness of attention task-dependent.
However, this does not answer why gradient and integrated gradient are also task-dependent, as these importance measures should consider the BiLSTM behavior. Understanding this remains an open question.
Although we found no importance measure to be significantly more faithful than the others, attention is often more sparse than the other importance measures, depending on the task. This is valuable, as sparse explanations are often easier for humans to understand (Miller2019).
Because interpretability is the “ability to explain or to present in understandable terms to a human” (Doshi-Velez2017a), how effective an explanation is in communicating to a human and the faithfulness of the explanation, are two separate but equally important concerns (Doshi-Velez2017a; Jacovi2020).
Each importance measure also has different computational requirements: the attention explanation is essentially free, as the weights are already computed in the forward pass, while integrated gradient is 50 times more expensive than gradient.
This computational difference makes attention an attractive choice. However, for more complex models like BERT (Devlin2019) the many layers mix embeddings to such an extent, that attention may be no longer faithful. Future work will need to determine which, if any, importance measures can be used for such models.
This paper evaluates the faithfulness of attention, gradient, and integrated gradient using an improved version of ROAR, called Recursive ROAR.
Our analysis provides valuable insights, which we describe in Section 6. In general, all three importance measures are often faithful, although none is consistently more faithful than the others.
We hope this paper can help to establish ROAR as a standardized benchmark for the faithfulness of importance measures in NLP.
SR is supported by the Canada CIFAR AI Chairs program and the NSERC Discovery Grant program.
Appendix A ROAR Results with step size of one
Because using a relative step-size, such as 10% of tokens, can create dataset redundancies by leaking the class, the performance can increase. This is particularly clear in the Recursive ROAR bAbI-3 results from Figure 4. To prevent this, one should use a step-size of exactly 1 token. Unfortunately, using a step-size of 1 token is too computationally expensive; for example, the Diabetes dataset has an average length of 2207 tokens. Instead, we approximate Recursive ROAR by using a 10% step size.
However, in this appendix, we provide Recursive ROAR results for a step size of 1 token in Figure 5. To keep the number of retrained models low, we do not exceed 10 masked tokens.