
Learning to Deceive with Attention-Based Explanations

Attention mechanisms are ubiquitous components in neural architectures applied in natural language processing. In addition to yielding gains in predictive accuracy, researchers often claim that attention weights confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question, demonstrating a simple method for training models to produce deceptive attention masks, diminishing the total weight assigned to designated impermissible tokens, even as the models are shown to nevertheless rely on these features to drive predictions. Across multiple models and datasets, our approach manipulates attention weights while paying surprisingly little cost in accuracy. Although our results do not rule out potential insights due to organically-trained attention, they cast doubt on attention's reliability as a tool for auditing algorithms, as in the context of fairness and accountability.



1 Introduction

Since their introduction as a means to cope with unaligned inputs and outputs in neural machine translation, attention mechanisms (Bahdanau et al., 2014) have emerged as popular and effective components in various neural network architectures. Attention works by aggregating a set of tokens via a weighted sum, where the attention weights are calculated as a function of both the input encodings and the state of the decoder.

Because attention mechanisms allocate weight among the encoded tokens, these coefficients are sometimes thought of intuitively as indicating which tokens the model focuses on when making a particular prediction. Based on this loose intuition, attention weights are often purported to explain a model’s predictions. For instance, in a recent study on gender bias in occupation classification, De-Arteaga et al. (2019) employ attention weights to explain what the model has learned, stating that “the attention weights indicate which tokens are most predictive”. Similar claims abound in the literature (Li et al., 2016; Xu et al., 2015; Choi et al., 2016; Xie et al., 2017; Martins and Astudillo, 2016).

Attention    Input Sentence    Gender
After that, Austen was educated at home until she went to boarding school with Cassandra early in 1785
After that, Austen was educated at home until she went to boarding school with Cassandra early in 1785
Table 1: As in the gender identification task, attention scores (depicted through highlighting) can be manipulated to pay little attention to predictive features.

In this paper, we elucidate two potential pitfalls that call for caution when interpreting attention scores as indicative of models’ inner workings or of the relative importance of input tokens. First, we demonstrate that attention scores are surprisingly easy to manipulate by adding a simple penalty term to the training objective (see §3, §5.1). The ease with which attention can be manipulated without significantly affecting predictions suggests that even if a vanilla model’s attention weights conferred some insight (still an open and ill-defined question), these insights would rely on knowing the precise objective on which models were trained. Second, practitioners often overlook the fact that the attention is not over words but over final layer representations, which themselves capture information from neighboring words. We investigate this issue, and other ways that models trained to produce deceptive masks work around the constraints imposed by our objective, in §5.2.

To demonstrate these pitfalls, we design a training scheme whereby the resulting models appear to assign little attention to any among a specified set of impermissible tokens while nevertheless continuing to rely upon those features for prediction. We construct several of our tasks so that, by design, access to the impermissible tokens is known to be essential for achieving high predictive accuracy. Unlike Jain et al. (2019), who showed that attention maps could be (manually) manipulated after training without altering predictions, we interfere only with the training objective, and thus it is the actual attention masks produced by the resulting models that exhibit deceptive properties. Our results present troublesome implications for proposed uses of attention in the context of fairness, accountability, and transparency. For example, malicious practitioners asked to justify how their models work by pointing to attention weights could mislead regulators with this scheme.

We also investigate the mechanisms through which the manipulated models attain such low attention values. We note that i) recurrent connections allow for easy flow of information to neighboring representations, and ii) in cases where this flow is curtailed, models tend to increase the magnitude of representations corresponding to impermissible tokens to compensate for the low attention scores (see §5.2).

2 Related Work

Recently, Jain et al. (2019) claimed that attention is not explanation. Their analysis consisted of examining the correlation between the trained attention weights and the corresponding feature gradients. Additionally, they identify alternate adversarial attention weights after the model is trained that nevertheless produce the same predictions. However, these attention weights are chosen from a large (infinite up to numerical precision) set of possible attention weights, and thus it should not be surprising that multiple weights might produce the same prediction. Moreover, because the model does not actually produce these weights, they seemingly would never be relied on as explanations in the first place. Similarly, Serrano and Smith (2019) modify attention values of a trained model post-hoc by hard-setting the highest attention values to zero. They study whether such erasures can cause a model’s prediction to flip. Through this exercise, they find that the number of attention values that must be zeroed out to alter a prediction is often too large, and thus conclude that attention is not a suitable tool for determining which elements should be attributed as responsible for an output. In contrast to these two papers, we manipulate the attention via the learning procedure, producing models whose actual weights might deceive an auditor.

In parallel work to ours, Wiegreffe and Pinter (2019) independently expressed similar concerns. Additionally, they challenge several claims made by Jain et al. (2019). They propose four diagnostic tests to elucidate the usefulness of attention weights, one of which is remarkably similar to ours. In this test, they examine whether a model can be trained adversarially to produce an attention that is maximally dissimilar to that produced by the original model. One key difference is that, although our objective is similar, we construct a set of tasks for which a set of designated impermissible tokens is known a priori to be indispensable for achieving high accuracy. Thus, instead of assuming or verifying whether tokens attended by the model are important, we begin with ground truth predictive tokens. We further highlight the methodological differences between their training formulation and ours in Section 3. Overall, our results are generally concordant with theirs regarding the ease of manipulating attention and the surprisingly low price to be paid for this manipulation in accuracy on classification tasks.

Lastly, the deliberate training of attention weights has been studied in several papers in which the goal is not to study the explanatory power of attention weights but rather to achieve better predictive performance by introducing an additional source of supervision. In some of these papers, attention weights are guided by known word alignments in machine translation (Liu et al., 2016; Chen et al., 2016), or by aligning the model’s attention with human eye-gaze for sequence classification (Barrett et al., 2018).

3 Manipulating Attention

Dataset (Task): Wikipedia Biographies (Gender Identification)
Input Example: After that, Austen was educated at home until she went to boarding school with Cassandra early in 1785
Impermissible Tokens: Gender Pronouns

Dataset (Task): SST + Wikipedia (Sentiment Analysis)
Input Example: Good fun, good action, good acting, good dialogue, good pace, good cinematography. Helen Maxine Lamond Reddy (born 25 October 1941) is an Australian singer, actress, and activist.
Impermissible Tokens: SST sentence

Dataset (Task): Reference Letters (Acceptance Prediction)
Input Example: It is with pleasure that I am writing this letter in support of I highly recommend her for a place in your institution. Percentile:99.0 Rank:Extraordinary.
Impermissible Tokens: Percentile, Rank

Table 2: Example sentences from each task, with highlighted impermissible tokens. We also note the percentage of impermissible tokens in each dataset. The example reference letter is clipped for brevity and anonymity.

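Let $S = w_1, w_2, \dots, w_n$ denote an input sequence of $n$ words. We assume that we are given a pre-specified set of impermissible words $\mathcal{I}$, for which we want to minimize the attention weights. We define the mask $\mathbf{m}$ to be a binary vector of size $n$, such that

$$m_i = \begin{cases} 1 & \text{if } w_i \in \mathcal{I}, \\ 0 & \text{otherwise.} \end{cases}$$

Further, let $\boldsymbol{\alpha} \in [0, 1]^n$ denote the attention assigned to each word in $S$ by a model, such that $\sum_i \alpha_i = 1$. For any task-specific loss function $\mathcal{L}$, we define a new objective function $\mathcal{L}' = \mathcal{L} + \mathcal{R}$, where $\mathcal{R}$ is an additive penalty term whose purpose is to penalize the model for allocating attention to impermissible words. For a single attention layer, we define $\mathcal{R}$ as:

$$\mathcal{R} = -\lambda \log\left(1 - \boldsymbol{\alpha}^{\top}\mathbf{m}\right),$$

where $\lambda$ is a penalty coefficient that modulates the amount of attention assigned to impermissible tokens. The argument of the $\log$ term ($1 - \boldsymbol{\alpha}^{\top}\mathbf{m}$) captures the total attention weight assigned to permissible words. In contrast to our penalty term, Wiegreffe and Pinter (2019) use KL-divergence to maximally separate the attention distribution of the manipulated model ($\boldsymbol{\alpha}_{\text{new}}$) from the attention distribution of the given model ($\boldsymbol{\alpha}_{\text{old}}$):

$$\mathcal{R}' = -\lambda \, \mathrm{KL}\left(\boldsymbol{\alpha}_{\text{new}} \,\|\, \boldsymbol{\alpha}_{\text{old}}\right).$$

When dealing with models that employ multi-headed attention, which use multiple different attention vectors at each layer of the model (Vaswani et al., 2017), we can optimize the mean value of our penalty as assessed over the set of attention heads $\mathcal{H}$ as follows:

$$\mathcal{R} = -\frac{\lambda}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \log\left(1 - \boldsymbol{\alpha}_h^{\top}\mathbf{m}\right).$$

When a model has many attention heads, an auditor might not look at the mean attention assigned to certain words but instead look head by head to see if any among them assigns a large amount of attention to impermissible words. Anticipating this, we also explore a variant of our approach for manipulating multi-headed attention where we penalize the maximum amount of attention paid to impermissible words (among all heads) as follows:

$$\mathcal{R} = -\lambda \cdot \min_{h \in \mathcal{H}} \log\left(1 - \boldsymbol{\alpha}_h^{\top}\mathbf{m}\right).$$

For cases where the impermissible set of tokens is unknown a priori, one can plausibly use the top few highly attended tokens as a proxy to attain divergent attention maps from a given model.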
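To make the penalty concrete, the following is a minimal sketch for a single example. It assumes the penalty is the negative log of the attention mass on permissible tokens, scaled by a coefficient, and that the multi-head variants take the mean or the maximum of the per-head penalties. The function names, the `eps` guard against log(0), and the toy attention vectors are illustrative assumptions rather than the authors' implementation.

```python
import math

def impermissible_mass(attention, mask):
    """Total attention weight on impermissible tokens (mask entries equal to 1)."""
    return sum(a * m for a, m in zip(attention, mask))

def penalty(attention, mask, coeff, eps=1e-12):
    """Assumed single-head penalty: -coeff * log(attention mass on permissible tokens)."""
    permissible = 1.0 - impermissible_mass(attention, mask)
    return -coeff * math.log(max(permissible, eps))  # eps is an illustrative guard

def mean_head_penalty(heads, mask, coeff):
    """Mean of the per-head penalties."""
    return sum(penalty(h, mask, coeff) for h in heads) / len(heads)

def max_head_penalty(heads, mask, coeff):
    """Penalty of the head that attends most to impermissible tokens."""
    return max(penalty(h, mask, coeff) for h in heads)

# Toy example: the second token is impermissible.
mask = [0, 1, 0, 0]
heads = [
    [0.1, 0.7, 0.1, 0.1],  # this head focuses on the impermissible token
    [0.4, 0.1, 0.3, 0.2],
]
print(mean_head_penalty(heads, mask, coeff=1.0))
print(max_head_penalty(heads, mask, coeff=1.0))
```

In training, a term of this kind would be added to the task loss, so a larger coefficient trades task fit for lower attention on the masked positions.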

4 Experimental Setup

Model               Gender Identification    SST + Wiki           Reference Letters
                    with      without        with     without     with     without
Embedding + Attn.   100.0     66.8           70.7     48.9        77.5     74.2
BiLSTM + Attn.      100.0     63.3           76.9     49.1        77.5     74.7
BERT                100.0     72.8           90.8     50.4        74.7     68.2
Table 3: Performance of models on tasks with and without the set of impermissible tokens. Without using impermissible tokens we see a significant drop in performance, demonstrating their utility for prediction.

Below, we briefly introduce the tasks and models used to evaluate our technique for producing deceptive attention weights.

4.1 Datasets

Our investigation addresses three binary classification datasets. In each dataset (in some, by design), a subset of input tokens is known a priori to be indispensable for achieving high accuracy.

Pronoun-based Gender Identification

To illustrate the problem, we construct a toy dataset of Wikipedia biographies, in which we automatically label each biography with a gender (female or male) based solely on the presence of gender pronouns. To do so, we use a pre-specified list of gender pronouns (e.g. “he”, “himself”, etc. for male, and “she”, “herself”, etc. for female). Biographies containing no gender pronouns, or pronouns spanning both classes, are ignored. The rationale behind creating this dataset is that, due to the manner in which it was created, attaining 100% classification accuracy is trivial if the model uses information from the pronouns. However, without the pronouns it is not possible to achieve perfect accuracy. Our models (described in detail later in §4.2), trained on the same data with the pronouns anonymized, achieve at best 72.8% accuracy (see Table 3).
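The labeling rule described above can be sketched as follows. The pronoun lists extend the examples given in the text and are assumptions, as are the regular-expression tokenization and the function name `label_biography`.

```python
import re

# Assumed pronoun lists; the text only gives "he", "himself", "she", "herself" as examples.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}

def label_biography(text):
    """Return "male" or "female" from pronouns alone, or None if the biography is ignored."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    has_male = bool(tokens & MALE_PRONOUNS)
    has_female = bool(tokens & FEMALE_PRONOUNS)
    if has_male == has_female:
        # Either no gender pronouns at all, or pronouns spanning both classes.
        return None
    return "female" if has_female else "male"

example = ("After that, Austen was educated at home until she went to "
           "boarding school with Cassandra early in 1785")
print(label_biography(example))
```

Because the label is a deterministic function of the pronouns, a model that can read them can in principle classify every retained biography correctly.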

Sentiment Analysis with Distractor Sentences

Here, we use the binary version of the Stanford Sentiment Treebank (SST) (Socher et al., 2013), comprised of movie reviews with positive and negative labels. We append one randomly-selected “distractor” sentence to each review, from a pool consisting of opening sentences of Wikipedia pages. (Opening sentences tend to be declarative statements of fact and typically are sentiment-neutral.) Here, without relying upon the tokens in the SST sentences, a model should not be able to outperform random guessing.
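A minimal sketch of how such an example and its impermissible-token mask might be assembled is given below. Whitespace tokenization, the function name `add_distractor`, and the one-sentence pool are illustrative assumptions; the mask marks the tokens of the SST review as impermissible.

```python
import random

def add_distractor(review, distractor_pool, rng):
    """Append a randomly chosen distractor sentence and mark review tokens as impermissible."""
    review_tokens = review.split()
    distractor_tokens = rng.choice(distractor_pool).split()
    tokens = review_tokens + distractor_tokens
    # 1 marks an impermissible token (from the SST review), 0 a distractor token.
    mask = [1] * len(review_tokens) + [0] * len(distractor_tokens)
    return tokens, mask

rng = random.Random(0)  # seeded only to make the sketch reproducible
pool = ["Helen Maxine Lamond Reddy is an Australian singer , actress , and activist ."]
tokens, mask = add_distractor("Good fun , good action , good acting", pool, rng)
print(list(zip(tokens, mask)))
```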

Graduate School Reference Letters

We obtain a dataset of college recommendation letters written for admission to graduate programs at a large public university in the United States. The task is to predict whether the student for whom the letter was written was accepted. Besides the reference letters, students’ ranks and percentile scores (as marked by their mentors) are available. Admissions committee members rely on the rank and percentile scores in addition to the letters. Indeed, we notice accuracy improvements when using the rank and percentile features in addition to the reference letter (see Table 3). Thus, we consider percentile and rank labels (which are appended at the end of the letter text) as impermissible tokens for this task. An example from each task is listed in Table 2.

4.2 Models

Embedding + Attention

For illustrative purposes, we start with a simple model with attention directly over word embeddings. The word embeddings are aggregated by a weighted sum (where weights are the attention scores) to form a context vector, which is then fed to a linear layer followed by a softmax to perform prediction. For all our experiments, we use dot-product attention, where the query vector is a learnable weight vector. The embedding dimension size is .
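The forward pass of this model can be sketched in a few lines of plain Python. The two-dimensional embeddings, the query vector, and the classifier weights below are toy values chosen for illustration; a real model would learn them.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embedding_attention_forward(embeddings, query, weights, bias):
    """Dot-product attention over word embeddings, then a linear layer and softmax."""
    attention = softmax([dot(e, query) for e in embeddings])
    dim = len(query)
    # Context vector: attention-weighted sum of the embeddings.
    context = [sum(a * e[d] for a, e in zip(attention, embeddings)) for d in range(dim)]
    logits = [dot(row, context) + b for row, b in zip(weights, bias)]
    return attention, softmax(logits)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one toy embedding per token
query = [0.5, -0.5]                                # learnable query vector (toy values)
weights = [[1.0, 0.0], [0.0, 1.0]]                 # linear layer for two classes
bias = [0.0, 0.0]
attention, probabilities = embedding_attention_forward(embeddings, query, weights, bias)
print(attention, probabilities)
```

Since nothing mixes information across positions before attention is applied, each token contributes to the prediction only through its own attention-weighted embedding.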

BiLSTM + Attention

The encoder is a single-layer bidirectional LSTM model (Graves and Schmidhuber, 2005) with attention, followed by a linear transformation and a softmax to perform classification. The embedding and hidden dimension size is set to …

Model            Coeff.   Gender Identification SST + Wiki            Reference Letters
                          Acc.    Attn. Mass    Acc.    Attn. Mass    Acc.    Attn. Mass
Emb + Attn.      0        100.0   99.2          70.7    48.4          77.5    2.3
Emb + Attn.      …        99.4    3.4           67.9    36.4          76.8    0.5
Emb + Attn.      …        99.2    0.8           48.4    8.7           76.9    0.1
BiLSTM + Attn.   0        100.0   96.8          76.9    77.7          77.5    4.9
BiLSTM + Attn.   …        100.0   …             60.6    0.04          76.9    3.9
BiLSTM + Attn.   …        100.0   …             61.0    0.07          74.2    …
BERT (mean)      0        100.0   80.8          90.8    59.0          74.7    2.6
BERT (mean)      …        99.9    …             90.9    …             76.2    …
BERT (mean)      …        99.9    …             90.6    …             75.2    …
BERT (max)       0        100.0   99.7          90.8    96.2          74.7    28.9
BERT (max)       …        99.9    …             90.7    …             76.7    0.6
BERT (max)       …        99.8    …             90.2    …             75.9    …
Table 4: Accuracy of various models along with their attention mass on impermissible tokens, with varying values of the loss coefficient. For most models and tasks, we can severely reduce attention mass on impermissible tokens while preserving original performance (a coefficient of 0 implies no manipulation).
Figure 1: Restricted self-attention in BERT. The information flow through attention is restricted between the impermissible tokens and the permissible tokens for every encoder layer. The arrows represent the direction of attention. The [CLS] token of the final encoder layer is used to make predictions.

Transformer Models

We use the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019). We use the base version, which consists of 12 self-attention layers, each with 12 attention heads. The first token of every sequence is the special classification token [CLS], the final hidden state of which is used for classification tasks. To block the information flow from permissible to impermissible tokens, we multiply attention weights at every layer with a self-attention mask $\mathbf{M}$, a binary matrix of size $n \times n$, where $n$ is the size of the input sequence. An element $M_{i,j}$ represents whether the token $w_i$ should attend to the token $w_j$. $M_{i,j}$ is $1$ if both $w_i$ and $w_j$ belong to the same set (either the set of impermissible tokens $\mathcal{I}$, or its complement $\mathcal{I}^c$). Additionally, the [CLS] token attends to all the tokens, but no token attends to [CLS], to prevent the information flow between $\mathcal{I}$ and $\mathcal{I}^c$ (Figure 1 illustrates this setting). We experiment with two different variants of BERT, one where we manipulate the maximum attention across all heads, and one where we manipulate the mean attention.
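The restricted self-attention mask can be sketched as below. The sketch assumes that row i of the mask governs which positions token i may attend to, that [CLS] sits at position 0 and may attend to itself, and that a boolean flag per token marks membership in the impermissible set. The function name is hypothetical.

```python
def build_self_attention_mask(impermissible):
    """Binary n x n mask; entry [i][j] is 1 when token i may attend to token j.

    impermissible[i] flags token i as impermissible; position 0 is assumed to be [CLS],
    whose own flag is ignored.
    """
    n = len(impermissible)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0:
                mask[i][j] = 1  # [CLS] attends to all tokens
            elif j == 0:
                mask[i][j] = 0  # no other token attends to [CLS]
            else:
                # Tokens attend only within their own set.
                mask[i][j] = int(impermissible[i] == impermissible[j])
    return mask

# [CLS], a permissible token, an impermissible token, another permissible token.
flags = [False, False, True, False]
mask = build_self_attention_mask(flags)
for row in mask:
    print(row)
```

With this mask the two sets never exchange information inside the encoder, so any reliance on impermissible tokens must pass through the attention that [CLS] pays to them.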

5 Results and Discussion

In this section we examine the degree to which we can lower the attention values, and how the reduction in attention scores affects task performance. Lastly, we analyze the behaviour of the manipulated models and identify alternate workarounds through which models preserve task performance.

5.1 Attention mass and task performance

We test our models on three classification tasks, and experiment with different values of the loss coefficient. For each experiment, we measure the attention mass over impermissible tokens and the final test accuracy. During the course of training (i.e. after each epoch), we arrive at different models, from which we choose the one whose accuracy is within … of the original accuracy and which provides the greatest reduction in attention mass on impermissible tokens. This model selection is done using the development set, and the results on the test set from the chosen model are presented in Table 4. Across most tasks and models, we find that our manipulation scheme severely reduces the attention mass on impermissible tokens compared to models without any manipulation (i.e. when the loss coefficient is zero). This reduction comes at a minor, or no, decrease in task accuracy. We discuss each task individually below.
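The selection rule above can be sketched as a filter followed by a minimum. The accuracy margin, the per-epoch numbers, and the dictionary keys are all hypothetical, and the sketch reads "within a margin" as "no more than the margin below the original development accuracy".

```python
def select_checkpoint(checkpoints, original_accuracy, margin):
    """Pick the checkpoint with the lowest attention mass among those that stay
    within `margin` accuracy points of the unmanipulated model (development set)."""
    eligible = [
        c for c in checkpoints
        if original_accuracy - c["dev_accuracy"] <= margin
    ]
    if not eligible:
        return None  # no checkpoint preserves accuracy well enough
    return min(eligible, key=lambda c: c["attention_mass"])

# Hypothetical per-epoch development results for one manipulated model.
checkpoints = [
    {"epoch": 1, "dev_accuracy": 99.5, "attention_mass": 12.0},
    {"epoch": 2, "dev_accuracy": 99.1, "attention_mass": 0.9},
    {"epoch": 3, "dev_accuracy": 95.0, "attention_mass": 0.1},
]
best = select_checkpoint(checkpoints, original_accuracy=100.0, margin=2.0)
print(best)
```

Epoch 3 has the lowest attention mass here but falls outside the hypothetical margin, so it is not eligible.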

For the gender identification task, we note that without any manipulation all models attain 100% accuracy on the test set with a high attention mass allocated to gender pronouns (99.2%, 96.8%, 80.8%, and 99.7% respectively for Embedding + Attention, BiLSTM + Attention, BERT (mean) and BERT (max); see Table 4). Upon manipulation, we find that the attention values can be reduced to extremely low values (less than … for all models, and less than … for many models) while retaining accuracy above 99%. This clearly illustrates that attention values can be diminished for impermissible tokens while models continue to rely on them to drive predictions, as without using the gender pronouns, the best performing model only achieves an accuracy of 72.8% (Table 3).

For the SST+Wiki sentiment analysis task, we observe that the Embedding+Attention and BiLSTM+Attention models attain 70.7% and 76.9% test accuracy respectively (Table 4). By training with our penalty term, we are able to reduce attention values over impermissible tokens to a lower magnitude, but this deception comes at some cost in accuracy. We present this trade-off in Figure 2. We speculate that Embedding+Attention and BiLSTM+Attention are too simple and under-parameterized for this task, and thus jointly reducing attention mass and retaining the original accuracy is harder. However, with BERT, an expressive model, we obtain accuracy over 90%, and we can reduce the maximum attention mass over the movie review from the original 96.2% to …, while maintaining similar performance.

Figure 2: Various Embedding + Attention and BiLSTM + Attention models for the SST + Wiki task, where we see a trade-off between reducing attention mass and preserving accuracy. The accuracies of the non-manipulated Emb+Attn and BiLSTM+Attn models are 70.7% and 76.9% respectively.

For predicting admission decisions using reference letters and class ranks, all four models attain around 75% accuracy on the task. From Table 4, we see again that we can lower the attention values corresponding to rank and percentile fields while retaining the original performance.

5.2 Alternative Workarounds

We follow up these results with an investigation seeking to elucidate the mechanisms by which the models cheat, obtaining low attention values while remaining accurate. We identify and verify two potential workarounds that the models adopt.

Models with recurrent encoders can simply pass information across tokens through recurrent connections, prior to the application of attention. To measure this effect, we hard-set the attention values corresponding to impermissible words to zero after the manipulated model is trained, thus clipping their direct contributions for inference. For gender classification using the BiLSTM+Attention model, we are still able to predict over … of instances correctly, thus confirming a large degree of information flow to neighboring representations. (A recent study by Brunner et al. (2019) similarly observes a high degree of ‘mixing’ of information across layers in Transformer models.) On the contrary, the Embedding+Attention model (which has no means to pass the information pre-attention) attains only about … test accuracy after zeroing the attention values for pronouns.
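The zeroing probe can be illustrated with a toy calculation. The representations below are made up: in the "embedding-like" case only the impermissible token carries the signal (first dimension), whereas in the "recurrent-like" case its neighbors also carry it. The sketch does not renormalize the remaining attention weights, which is an assumption, as the text does not say whether this is done.

```python
def zero_impermissible(attention, mask):
    """Hard-set attention on impermissible positions (mask == 1) to zero."""
    return [0.0 if m else a for a, m in zip(attention, mask)]

def context_vector(attention, representations):
    """Attention-weighted sum of the token representations."""
    dim = len(representations[0])
    return [sum(a * r[d] for a, r in zip(attention, representations))
            for d in range(dim)]

mask = [0, 1, 0]                  # the middle token is impermissible
attention = [0.495, 0.01, 0.495]  # a manipulated, deceptively low weight in the middle

embedding_like = [[0.0, 1.0], [5.0, 0.0], [0.0, 1.0]]  # signal only in the pronoun itself
recurrent_like = [[2.0, 1.0], [5.0, 0.0], [2.0, 1.0]]  # signal leaked to neighbors

clipped = zero_impermissible(attention, mask)
ctx_embedding = context_vector(clipped, embedding_like)
ctx_recurrent = context_vector(clipped, recurrent_like)
print(ctx_embedding[0], ctx_recurrent[0])
```

After clipping, the signal dimension vanishes from the context vector in the embedding-like case but survives in the recurrent-like case, which mirrors the contrast reported above.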

Models restricted from passing information prior to the attention mechanism tend to increase the magnitude of the representations corresponding to impermissible words to compensate for the low attention values. This effect is illustrated in Figure 3, where the L2 norm of embeddings for impermissible tokens increases considerably for the Embedding+Attention model during training. We do not see increased embedding norms for the BiLSTM+Attention model, as this is unnecessary due to the model’s capability to move around relevant information.

Figure 3: For the gender identification task, the norms of embedding vectors corresponding to impermissible tokens increase considerably in the Embedding+Attention model to offset the low attention values. This is not the case for the BiLSTM+Attention model, as it can pass information through recurrent connections.
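A short worked example shows why inflating the norm compensates for a small attention weight: a token's contribution to the context vector is its attention weight times its embedding, so dividing the weight by some factor and multiplying the embedding by the same factor leaves the contribution unchanged. The numbers and the scale factor below are arbitrary illustrations.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def weighted(weight, vector):
    """Contribution of one token to the context vector."""
    return [weight * x for x in vector]

embedding = [3.0, -4.0]            # toy embedding with L2 norm 5
before = weighted(0.8, embedding)  # high attention, modest norm

scale = 100.0                                   # arbitrary inflation factor
inflated = [scale * x for x in embedding]       # L2 norm grows by the same factor
after = weighted(0.8 / scale, inflated)         # attention shrinks by the same factor

print(l2_norm(embedding), l2_norm(inflated))
print(before, after)
```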

6 Conclusion

Amidst claims and practices that perceive attention scores to be an indication of what the model focuses on, we provide evidence that attention scores are easily manipulable. Our simple training scheme produces models with significantly reduced attention mass over tokens known a priori to be useful for prediction, while continuing to depend on them for prediction. Our results raise concerns about potential use of attention as a tool to audit algorithms, as malicious actors could employ similar techniques to mislead regulators.


The authors thank Dr. Julian McAuley for providing, and painstakingly anonymizing the data for reference letters. We also acknowledge Alankar Jain for carefully reading the manuscript and providing useful feedback.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • M. Barrett, J. Bingel, N. Hollenstein, M. Rei, and A. Søgaard (2018) Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 302–312. Cited by: §2.
  • G. Brunner, Y. Liu, D. Pascual, O. Richter, and R. Wattenhofer (2019) On the validity of self-attention as explanation in transformer models. arXiv preprint arXiv:1908.04211. Cited by: footnote 2.
  • W. Chen, E. Matusov, S. Khadivi, and J. Peter (2016) Guided alignment training for topic-aware neural machine translation. arXiv preprint arXiv:1607.01628. Cited by: §2.
  • E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart (2016) Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512. Cited by: §1.
  • M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. Geyik, K. Kenthapadi, and A. T. Kalai (2019) Bias in bios: a case study of semantic representation bias in a high-stakes setting. arXiv preprint arXiv:1901.09451. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics (NAACL). Cited by: §4.2.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610. Cited by: §4.2.
  • S. Jain, R. Mohammadi, and B. C. Wallace (2019) Attention is not explanation. North American Chapter of the Association for Computational Linguistics (NAACL). Cited by: §2, §2.
  • J. Li, W. Monroe, and D. Jurafsky (2016) Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220. Cited by: §1.
  • L. Liu, M. Utiyama, A. Finch, and E. Sumita (2016) Neural machine translation with supervised attention. arXiv preprint arXiv:1609.04186. Cited by: §2.
  • A. Martins and R. Astudillo (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623. Cited by: §1.
  • S. Serrano and N. A. Smith (2019) Is attention interpretable?. 57th annual meeting of the Association for Computational Linguistics (ACL). Cited by: §2.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.
  • S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. Proceedings of the 2019 conference on Empirical Methods in Natural Language Processing, EMNLP. Cited by: §2, §3.
  • Q. Xie, X. Ma, Z. Dai, and E. Hovy (2017) An interpretable knowledge transfer model for knowledge base completion. arXiv preprint arXiv:1704.05908. Cited by: §1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044. Cited by: §1.