Why is Attention Not So Attentive?

06/10/2020 ∙ by Bing Bai, et al. ∙ Tencent, Cornell University

Attention-based methods have played an important role in model interpretation, where the calculated attention weights are expected to highlight the critical parts of the input (e.g., keywords in sentences). However, recent research has pointed out that attention-as-importance interpretations often do not work as well as we expect. For example, learned attention weights are frequently uncorrelated with other feature-importance indicators such as gradient-based measures, and a debate on the effectiveness of attention-based interpretations has arisen. In this paper, we reveal that one root cause of this phenomenon is what we call combinatorial shortcuts: models may obtain information not only from the parts highlighted by the attention mechanism but also from the attention weights themselves. We design an intuitive experiment to demonstrate the existence of combinatorial shortcuts and propose two methods to mitigate the issue. Empirical studies on attention-based instance-wise feature selection interpretation models show that the proposed methods can effectively improve the interpretability of attention mechanisms on a variety of datasets.




1 Introduction

Interpretation for machine learning models has gained increasing interest and has become a necessity as industry rapidly embraces machine learning technologies. Model interpretation explains how models make decisions, which is particularly essential in mission-critical domains where the accountability and transparency of the decision-making process are crucial, such as medicine Wang et al. (2019), security Chakraborti et al. (2019), and criminal justice Lipton (2018).

Attention mechanisms have played an important role in model interpretations and have been widely adopted for interpreting neural networks Vaswani et al. (2017) and other black-box models Bang et al. (2019); Chen et al. (2018). Similar to Vaswani et al. (2017), for the attention mechanisms, we assume that we have a query $Q$ and a set of key-value pairs $(K, V)$. In this paper, the attention weights, denoted as masks $M$, are calculated with $Q$ and $K$, and then filter the information of $V$ as follows (the scaled dot-product form is one common instantiation):

$$\mathrm{Attention}(Q, K, V) = M V, \quad \text{where } M = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right).$$

Intuitively, the masks $M$ are expected to represent the importance of different parts of $V$ (e.g., words of a sentence, pixels of an image) and highlight those the model should focus on to make decisions, and many researchers directly use the masks to provide interpretability of models Choi et al. (2016); Vaswani et al. (2017); Wang et al. (2016).
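As a concrete sketch of this setup (a generic NumPy stand-in assuming the scaled dot-product form of Vaswani et al. (2017), not any particular model in this paper), the mask is computed from the query and keys, then mixes the values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: the mask M is computed from Q and K,
    and the output carries information from V weighted by M."""
    d_k = K.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights ("masks"); rows sum to 1
    return M @ V, M

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 16))  # 5 values
out, M = attention(Q, K, V)
# M has shape (2, 5): one weight distribution over the 5 values per query
```

Note that the output is a product of both $M$ and $V$, which is precisely the opening the rest of this paper examines.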

However, recent research suggests that the parts highlighted by attention mechanisms do not necessarily correlate with greater impacts on models’ predictions Jain and Wallace (2019). Many researchers have provided evidence to support or refute the interpretability of attention mechanisms, and a debate on the effectiveness of attention-based interpretations has arisen Jain and Wallace (2019); Serrano and Smith (2019); Wiegreffe and Pinter (2019).

In this paper, we discover a root cause that hinders the interpretability of attention mechanisms, which we refer to as combinatorial shortcuts. As mentioned earlier, we expect the results of attention mechanisms to mainly contain information from the highlighted parts of $V$, which is the critical assumption underlying the effectiveness of attention-based interpretations. However, as the results are products of the masks $M$ and $V$, we find that the masks themselves can carry extra information beyond the highlighted parts of $V$, which can be utilized by the downstream parts of models. As a result, the calculated masks may work as another kind of “encoding layer” rather than providing pure importance weights. For example, in a (binary) text classification task, the attention mechanism could choose to highlight the first word for positive cases and the second word for negative cases, regardless of what the words are, and the downstream model could then predict the label by checking whether the first or the second word is highlighted. This may result in good accuracy scores while completely failing to provide interpretability.[2]

[2] One may argue that in the most ordinary practice, where sum pooling is applied, we lose the positional information, so the intuitive case described above may not hold. However, since (1) the distributions of different positions are not the same, and (2) positional encodings Vaswani et al. (2017) have been widely used, it is still possible for attention mechanisms to utilize positional information with sum pooling.
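The shortcut described above can be reproduced with a tiny synthetic sketch (hypothetical data, not the paper's experiment): a label-dependent mask that ignores token content, combined with positional encodings and sum pooling, lets a trivial downstream check recover the label from tokens that carry no information at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, seq_len, d = 200, 6, 8
X = rng.normal(scale=0.1, size=(n, seq_len, d))  # token embeddings: pure noise, no label signal
y = rng.integers(0, 2, size=n)                    # labels, independent of X

# Positional encodings (one-hot here for clarity) are added to the embeddings.
pos = np.eye(seq_len, d)                          # pos[i] marks position i
Z = X + pos[None, :, :]

# A "shortcut" mask: highlight position 0 iff y == 1, position 1 iff y == 0,
# regardless of token content.
M = np.zeros((n, seq_len))
M[np.arange(n), 1 - y] = 1.0

pooled = (M[:, :, None] * Z).sum(axis=1)          # sum pooling of the attended representation

# The downstream model only needs to check which positional encoding survived pooling:
pred = (pooled[:, 0] > pooled[:, 1]).astype(int)  # dim 0 large -> position 0 was highlighted -> y = 1
accuracy = (pred == y).mean()                      # ~1.0 despite contentless tokens
```

The near-perfect accuracy comes entirely from the mask acting as an "encoding layer", which is exactly the failure mode combinatorial shortcuts describe.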

We further study the effectiveness of attention-based interpretations and dive into the combinatorial shortcut problem. We first analyze the gap between attention mechanisms and ideal interpretations theoretically and show the existence of combinatorial shortcuts through a representative experiment. We then propose two practical methods to mitigate this issue: random attention pretraining, and instance weighting for mask-neutral learning. Without loss of generality, we examine the effectiveness of the proposed methods based on an end-to-end attention-based model-interpretation method, L2X Chen et al. (2018), which can select a given number of input components to explain arbitrary black-box models. Experimental results on both text and image classification tasks show that the proposed methods can successfully mitigate the adverse impact of combinatorial shortcuts and improve explanation performance.

2 Related Work

Model interpretations

Existing methods for model interpretations can be categorized into model-specific methods and model-agnostic methods. Model-specific methods take advantage of specific types of models, such as gradient-based methods for neural networks. On the other hand, model-agnostic methods are capable of explaining any given model, as long as the input and output are accessible. Instance-wise feature selection (IFS), which produces importance scores for each feature's influence on the model's decision, is a well-known model-agnostic interpretation approach Du et al. (2019). Recent research on IFS model explanation can be divided into (local/global) feature attribution methods Ancona et al. (2017); Yeh et al. (2019) and direct model-interpretation methods. Local feature attribution methods provide sensitivity scores of the model output with respect to changes of the features in a neighborhood. In contrast, global feature attribution methods directly produce the amount of change of the model output given changes of the features.[3] Other than providing the change of the model output, direct model interpretation (DMI) is a more straightforward approach that selects features and uses a model to approximate the output of the original black-box model Chen et al. (2018); Sundararajan et al. (2017). In addition to the above research, there is also work on benchmarks for interpretability methods Hooker et al. (2019).

[3] The definitions of global and local explanations in this paper follow the descriptions of Ancona et al. (2017) and Yeh et al. (2019), and are distinct from those of Plumb et al. (2018).

Attention mechanisms for model interpretations

Attention mechanisms have been widely adopted in natural language processing Bahdanau et al. (2015); Vinyals et al. (2015), computer vision Fu et al. (2016); Li et al. (2019), recommendations Bai et al. (2020); Tay et al. (2018); Zhang et al. (2020b), and so on. Despite many variants, attention mechanisms usually calculate non-negative weights for each input component, multiply those weights by their corresponding representations, and then encode the resulting vectors into a single fixed-length representation Serrano and Smith (2019). Attention mechanisms are believed to explain how models make decisions by exhibiting the importance distribution over inputs Choi et al. (2016); Martins and Astudillo (2016); Wang et al. (2016), which we can also regard as a kind of model-specific interpretation. Besides, there are also attention-based methods for model-agnostic interpretations. For example, L2X Chen et al. (2018) is a hard attention model Xu et al. (2015) that employs Gumbel-softmax Jang et al. (2017) for instance-wise feature selection. VIBI Bang et al. (2019) improves L2X to encourage the briefness of the learned explanation by adding a constraint that draws the feature scores toward a global prior.

However, there has recently been a debate on the effectiveness of the interpretability of attention mechanisms. Jain and Wallace (2019) suggests that “attention is not explanation” by finding that the attention weights are frequently uncorrelated with gradient-based measures of feature importance, and that one can identify very different attention distributions that yield equivalent predictions. On the other hand, Wiegreffe and Pinter (2019) argues that “attention is not not-explanation” by challenging many assumptions underlying Jain and Wallace (2019) and suggesting that it does not disprove the usefulness of attention mechanisms for explainability. Serrano and Smith (2019) applies a different analysis based on intermediate representation erasure and finds that while attention noisily predicts input components’ overall importance to a model, it is by no means a fail-safe indicator. In this work, we take another perspective on this problem, combinatorial shortcuts, and show that it reveals one root cause of the phenomenon.

3 Combinatorial Shortcuts

In this section, we show the gap between attention mechanisms and ideal explanations from a perspective of causal effects estimation, and conduct an experiment to demonstrate the existence of combinatorial shortcuts.

3.1 The gap between attention mechanisms and ideal explanations

Assume that we have samples $(x, y)$ drawn independently and identically distributed (i.i.d.) from a distribution $\mathcal{D}$ with domain $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature domain and $\mathcal{Y}$ is the label domain.[4] Additionally, we assume that the mask $m$ is drawn from a distribution with domain $\mathcal{M}$. For $m_1, m_2 \in \mathcal{M}$ and any given sample $(x, y)$, if $\mathbb{E}[\ell(f(m_1 \odot x), y)] < \mathbb{E}[\ell(f(m_2 \odot x), y)]$, where $\ell$ is the loss function, $f$ is the downstream model, and $\mathbb{E}$ calculates the expectation, we say that for sample $(x, y)$, $m_1$ is superior to $m_2$ in terms of interpretability. Usually, $m$ is under some constraints, for example, only being able to select a fixed number of features, or being non-negative and summing to 1. If an unbiased estimation of $\mathbb{E}[\ell(f(m \odot x), y)]$ is available, the best mask for sample $(x, y)$, which selects the most informative features, can be obtained by solving $m^* = \arg\min_{m \in \mathcal{M}} \mathbb{E}[\ell(f(m \odot x), y)]$. In practice, we often need to train models to estimate $\mathbb{E}[\ell(f(m \odot x), y)]$ since both $m$ and $x$ are vectors; this problem has been extensively investigated as causal effect estimation for observational studies Rubin (1974); Schneider et al. (2007).

[4] Note that here we use $x$ to represent the features by convention; $x$ is the same as the value $V$ introduced in the Introduction. Besides, the labels could be either from the real world (for explaining the real world) or from some specific model (for explaining a given black-box model).

Ideally, if the data (combinations of $m$ and $x$, as well as the label $y$) were exhaustive and the model consistent, we could train a model to obtain an unbiased estimation of $\mathbb{E}[\ell(f(m \odot x), y)]$ following the empirical risk minimization principle Fan et al. (2005); Vapnik (1992). In reality, it is not possible to exhaust all combinations of $m$ and $x$, but randomized combinations of $m$ and $x$ can still give unbiased estimations Rubin (1974). However, attention mechanisms do not work in this way. The combination of $m$ and $x$ is highly selective in the training procedure of attention mechanisms, since the used mask is a function of the query $Q$ and key $K$, which are highly related to the features of samples, if not directly extracted from them. In conclusion, from the perspective of causal effect estimation, the training procedure of attention mechanisms produces nonrandomized experiments Shadish et al. (2008). Thus the model cannot learn unbiased estimations of $\mathbb{E}[\ell(f(m \odot x), y)]$, fails to capture the real causal effects between highlighted features and the labels, and finally fails to find the best masks that provide the best interpretability. In this paper, we denote the effects of the nonrandomized combination of $m$ and $x$ as combinatorial shortcuts, as they provide lanes for models to predict the labels not by analyzing what information is highlighted by attention mechanisms, but rather by using the masks themselves to guess the labels.

3.2 Experimental design for the demonstration of combinatorial shortcuts

Figure 1: The structure of the models used in this demo experiment.

To intuitively demonstrate the existence of combinatorial shortcuts, we design an experiment on the real-world IMDB movie review dataset Maas et al. (2011), a text classification task. As Figure 1 shows, the experiment works as follows. First, we train baseline models using only the first five tokens of each sentence along with four default tokens (A B C D). Then we train attention models, where the query of attention is generated from the whole sentence; however, we only allow attention to highlight keywords among the first five tokens and the four default tokens. Because the default tokens carry no useful information at all, if the attention mechanism can indeed highlight the key parts of the inputs, little attention should be paid to them. Correspondingly, if the attention mechanism assigns much attention weight to the default tokens, we can say that the masks themselves are serving as features for the downstream models, i.e., combinatorial shortcuts are taking effect. In conclusion, we can check whether attention mechanisms highlight the right parts of the input by monitoring how much attention is paid to the default tokens.
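The diagnostic behind this design can be sketched as follows (a minimal NumPy stand-in; the function name and the example weights are illustrative, not the paper's code): given attention weights over the five real tokens and the four default tokens, we measure the fraction of total attention mass placed on the defaults.

```python
import numpy as np

def attention_to_default_tokens(weights, n_real=5, n_default=4):
    """Fraction of total attention mass assigned to the uninformative default tokens.

    `weights`: (batch, n_real + n_default) non-negative attention weights.
    If the mechanism truly highlights informative input, this should be near 0."""
    w = np.asarray(weights, dtype=float)
    assert w.shape[1] == n_real + n_default
    return w[:, n_real:].sum() / w.sum()

# Hypothetical batch: each row is a softmax output (sums to 1);
# columns 0-4 are real tokens, columns 5-8 are the defaults A B C D.
w = np.array([[0.05, 0.05, 0.10, 0.05, 0.05, 0.30, 0.20, 0.10, 0.10],
              [0.02, 0.08, 0.10, 0.05, 0.05, 0.25, 0.25, 0.10, 0.10]])
share = attention_to_default_tokens(w)  # 0.7 here: shortcut-like behaviour
```

A value like 0.7 would indicate that most of the attention budget is spent on tokens that cannot matter, mirroring the pattern reported in Table 1.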

3.3 Experimental results and discussions

We examine different settings for the encoder in Figure 1: whether the encoder is a simple pooling[5] or a trainable neural network model, namely the recurrent convolutional neural network (RCNN) proposed in Lai et al. (2015). When we apply pooling encoders, the experiment becomes the most ordinary setting for attention mechanisms. The results are reported in Table 1. Note that we report the results on the training set to demonstrate how the models fit the data. We use pretrained GloVe word embeddings Pennington et al. (2014) and keep them fixed to prevent shortcuts through word embeddings. We train the models for 25 epochs with RMSprop optimizers using default parameters and report the averaged results of 10 runs with different initializations.

[5] We use average pooling for baseline models, and sum pooling for attention models, as attention weights sum to 1.

No. Model Encoder Accuracy Attention to default tokens
(1) Baseline Pooling 71.7% –
(2) Baseline RCNN 96.1% –
(3) Attention Pooling 99.5% 66.4%
(4) Attention RCNN 99.6% 48.9%
Table 1: Experimental demonstration of combinatorial shortcuts. Note that we report the results on the training set here.

Interestingly, as we can see in Table 1, the attention models place a large share of attention on the default tokens: 66.4% and 48.9% of total attention weights are assigned to the default tokens by the models with pooling and RCNN encoders, respectively. Consequently, the accuracy scores of the attention models quickly grow above 99.5%. The results suggest that the attention mechanism may not work as expected to highlight the key parts of the inputs and provide interpretability. Instead, it tends to work as another kind of “encoding layer” and fit the data through combinatorial shortcuts.

4 Methods for Mitigating Combinatorial Shortcuts

In this section, we introduce two practical methods, random attention pretraining and mask-neutral learning with instance weighting, to mitigate combinatorial shortcuts.

4.1 Random attention pretraining

We first propose a simple and straightforward method to address the issue. As analyzed in Section 3.1, the fundamental reason for combinatorial shortcuts in attention mechanisms is the biased estimation of $\mathbb{E}[\ell(f(m \odot x), y)]$, while random combinations of $m$ and $x$ can still give unbiased results. Inspired by this, we can first generate the masks completely at random and train the neural networks. Then, we fix the other parts, replace the random attention with trainable attention layers, and train the attention layers only. As the other parts of the neural networks are trained unbiasedly and then fixed, training the attention layers alone amounts to solving $m^* = \arg\min_{m \in \mathcal{M}} \mathbb{E}[\ell(f(m \odot x), y)]$ with an unbiased estimation of $\mathbb{E}[\ell(f(m \odot x), y)]$. Thus the interpretability is guaranteed.
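A minimal sketch of the two-phase procedure on a toy linear-regression task (hypothetical data and names; phase 2 here scores single-feature masks against the frozen downstream model instead of training an actual attention layer):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d); true_w[:3] = 1.0            # only the first 3 features matter
y = X @ true_w + rng.normal(scale=0.1, size=n)

def random_mask(shape, k=3):
    """Uniformly random k-hot masks: every feature/mask combination is explored."""
    m = np.zeros(shape)
    for row in m:
        row[rng.choice(shape[1], size=k, replace=False)] = 1.0
    return m

# Phase 1: pretrain the downstream model under completely random masks, so the
# mask carries no information about y and the loss estimate is unbiased.
W = np.zeros(d)
for _ in range(200):
    M = random_mask(X.shape)
    grad = (M * X).T @ ((M * X) @ W - y) / n      # squared-loss gradient
    W -= 0.1 * grad

# Phase 2: freeze W; score each feature by the loss when only it is kept.
# (A stand-in for training the attention layer against the frozen model.)
def loss_with_only(j):
    m = np.zeros(d); m[j] = 1.0
    r = (m * X) @ W - y
    return (r ** 2).mean()

scores = np.array([-loss_with_only(j) for j in range(d)])   # higher = more useful alone
top3 = set(np.argsort(scores)[-3:])
# top3 recovers {0, 1, 2}, the truly informative features
```

Because the downstream model never saw mask patterns correlated with labels in phase 1, the only way to lower the loss in phase 2 is to select genuinely informative features.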

In theory, this method is complete. In practice, however, it may fall short because there are countless possible combinations of $m$ and $x$. It may be difficult to estimate $\mathbb{E}[\ell(f(m \odot x), y)]$ well, especially when the dimension of the input features is high, or when there are strong co-adapting patterns of features that models need for accurate predictions (e.g., XOR of features). In such cases, the pretraining procedure may be less efficient, as it needs to explore all possible masks fairly even though most of them are worthless. In conclusion, the model may fail to estimate $\mathbb{E}[\ell(f(m \odot x), y)]$ well in some cases, thus limiting the interpretability.

4.2 Mask-neutral learning with instance weighting

The second method is designed as a supplementary solution to address the shortcomings of random attention pretraining. It is based on instance weighting, which has been successfully applied to mitigating sample selection bias Zadrozny (2004); Zhang et al. (2019), social prejudice bias Zhang et al. (2020a), position bias Joachims et al. (2017), and so on. In this paper, we consider the selective combination of features and masks in the training procedure of attention mechanisms as a kind of sample selection bias. We prove that, under certain assumptions and with instance weighting, we can recover a mask-neutral distribution where the masks are unrelated to the labels. Thus the combinatorial shortcuts can be partially mitigated.

Generation of biased distributions of attention mechanisms from mask-neutral distributions

Assume that there is a mask-neutral distribution $\mathcal{D}$ with domain $\mathcal{X} \times \mathcal{Y} \times \mathcal{M} \times \mathcal{S}$, where $\mathcal{X}$ is the feature space, $\mathcal{Y}$ is the (binary) label space,[6] $\mathcal{M}$ is the feature mask space, and $\mathcal{S}$ is the (binary) sampling indicator space.

[6] We focus on binary classification problems in this paper, but the proposed methodology can be easily extended to multi-class classification.

During the training of attention mechanisms, the selective combination of masks and features results in combinatorial shortcuts. We assume that any given sample $(x, y, m, s)$ drawn independently from $\mathcal{D}$ will be chosen to appear in the training of attention mechanisms if and only if $s = 1$, which results in the biased distribution $\mathcal{D}_b$. We use $\Pr_b(\cdot)$ to represent probabilities under the biased distribution $\mathcal{D}_b$, and $\Pr(\cdot)$ for the mask-neutral distribution $\mathcal{D}$; then we have

$$\Pr\nolimits_b(x, y, m) = \Pr(x, y, m \mid s = 1), \tag{1}$$

and ideally, we should have $m$ independent of $x$ on $\mathcal{D}$ to obtain unbiased estimations of $\mathbb{E}[\ell(f(m \odot x), y)]$, as discussed in Section 3.1. However, when both sides are vectors this is intractable. Therefore, we take a step back and only assume that $m$ is independent of the (scalar) label $y$ in $\mathcal{D}$, i.e.,

$$\Pr(y \mid m) = \Pr(y). \tag{2}$$

If the selection $s$ were completely at random, $\mathcal{D}_b$ would be consistent with $\mathcal{D}$. However, the attention layers are highly selective, which results in the combinatorial shortcut problem. To further simplify the problem, we assume that $y$ and $m$ control $s$, and that for any given $y$ and $m$ the probability of selection is greater than $0$:

$$\Pr(s = 1 \mid x, y, m) = \Pr(s = 1 \mid y, m) > 0. \tag{3}$$

Additionally, we assume that the selection does not change the marginal probabilities of $y$ and $m$, i.e.,

$$\Pr\nolimits_b(y) = \Pr(y), \qquad \Pr\nolimits_b(m) = \Pr(m). \tag{4}$$

In other words, we assume that although $s$ depends on the combination of $y$ and $m$, it is independent of either $y$ or $m$ alone, i.e., $\Pr(s = 1 \mid y) = \Pr(s = 1)$ and $\Pr(s = 1 \mid m) = \Pr(s = 1)$.

Unbiased expectation of loss with instance weighting

We show that, by adding proper instance weights, we can obtain an unbiased estimation of the loss on the mask-neutral distribution $\mathcal{D}$, using only data from the biased distribution $\mathcal{D}_b$.

Theorem 1 (Unbiased Loss Expectation). For any function $f: \mathcal{X} \to \mathcal{Y}$ and any loss $\ell$, if we use $w = \frac{\Pr_b(y)\,\Pr_b(m)}{\Pr_b(y, m)}$ as the instance weights, then

$$\mathbb{E}_{\mathcal{D}_b}\!\left[w \cdot \ell(f(m \odot x), y)\right] = \mathbb{E}_{\mathcal{D}}\!\left[\ell(f(m \odot x), y)\right].$$

Theorem 1 shows that, with a proper instance-weighting method, the classifier can learn on the mask-neutral distribution $\mathcal{D}$, where $\Pr(y, m) = \Pr(y)\Pr(m)$. The independence between $y$ and $m$ is therefore encouraged, and it becomes hard for the classifier to approximate $y$ from $m$ alone. Thus, the classifier has to use the useful information in $m \odot x$, and the combinatorial shortcut problem is mitigated.

We now present the proof of Theorem 1.

Proof. We first rewrite the weight $w$ using assumptions (1)–(4):

$$w = \frac{\Pr_b(y)\,\Pr_b(m)}{\Pr_b(y, m)} = \frac{\Pr(y)\,\Pr(m)}{\Pr(y, m \mid s = 1)} = \frac{\Pr(y)\,\Pr(m)\,\Pr(s = 1)}{\Pr(s = 1 \mid y, m)\,\Pr(y, m)} = \frac{\Pr(s = 1)}{\Pr(s = 1 \mid y, m)},$$

where the last equality uses $\Pr(y, m) = \Pr(y)\Pr(m)$ from Equation (2). Then we have

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}_b}\!\left[w \cdot \ell(f(m \odot x), y)\right]
&= \sum_{x, y, m} \Pr(x, y, m \mid s = 1)\, \frac{\Pr(s = 1)}{\Pr(s = 1 \mid y, m)}\, \ell(f(m \odot x), y) \\
&= \sum_{x, y, m} \frac{\Pr(s = 1 \mid x, y, m)\,\Pr(x, y, m)}{\Pr(s = 1)} \cdot \frac{\Pr(s = 1)}{\Pr(s = 1 \mid y, m)}\, \ell(f(m \odot x), y) \\
&= \sum_{x, y, m} \Pr(x, y, m)\, \ell(f(m \odot x), y) = \mathbb{E}_{\mathcal{D}}\!\left[\ell(f(m \odot x), y)\right],
\end{aligned}$$

where the last line uses Equation (3), $\Pr(s = 1 \mid x, y, m) = \Pr(s = 1 \mid y, m)$. ∎

Mask-neutral learning

With Theorem 1, we now propose mask-neutral learning for better interpretability of attention mechanisms. As shown, by adding the instance weight $w = \Pr_b(y)\Pr_b(m)/\Pr_b(y, m) = \Pr_b(y)/\Pr_b(y \mid m)$, we can obtain an unbiased loss on the mask-neutral distribution. As the distribution $\mathcal{D}_b$ is directly observable, estimating $w$ is possible. In practice, we can train a classifier to estimate $\Pr_b(y \mid m)$ along with the training of the attention layer, and alternately optimize it, the attention layers, and the other parts of the model.
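A minimal sketch of the weight computation (assuming, for illustration, discretized mask patterns so that $\Pr_b(y \mid m)$ can be estimated by counting, whereas in practice a classifier is trained for this estimate; the weight form $w = \Pr_b(y)\Pr_b(m)/\Pr_b(y,m) = \Pr_b(y)/\Pr_b(y \mid m)$ follows the reconstruction above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
y = rng.integers(0, 2, size=n)

# Simulated shortcut: the (discretized) mask pattern is correlated with the label,
# e.g. pattern "A" is chosen for 80% of positives but only 20% of negatives.
m = np.where(rng.random(n) < np.where(y == 1, 0.8, 0.2), "A", "B")

# Estimate Pr_b(y) and Pr_b(y | m) by counting (a frequency-count stand-in for the
# auxiliary classifier trained alongside the attention layer).
p_y = np.array([np.mean(y == c) for c in (0, 1)])
p_y_given_m = {pat: np.array([np.mean(y[m == pat] == c) for c in (0, 1)])
               for pat in ("A", "B")}

# w = Pr_b(y) Pr_b(m) / Pr_b(y, m) = Pr_b(y) / Pr_b(y | m)
w = np.array([p_y[yi] / p_y_given_m[mi][yi] for yi, mi in zip(y, m)])

# Under the reweighted distribution, y and m are independent:
# the weighted frequency of (y=1, m="A") factorizes into the marginals.
p11 = w[(y == 1) & (m == "A")].sum() / w.sum()
p1 = w[y == 1].sum() / w.sum()
pA = w[m == "A"].sum() / w.sum()
# p11 == p1 * pA after reweighting (it does not factorize before)
```

Samples whose mask pattern is over-represented for their label receive down-weights, removing the label information the mask would otherwise carry.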

Compared with the random attention pretraining method, the instance-weighting approach concentrates more on the useful masks, so it may suffer less from the efficiency problem. Nevertheless, the effectiveness of the instance-weighting method relies on the assumptions in Equations (1)–(4), which may not always hold. For example, in Equation (3) we assume that, given $y$ and $m$, $s$ is independent of $x$; in other words, $x$ controls $s$ only through $m$ and $y$. This assumption is necessary for simplifying the problem, but it may fail when, given $y$ and $m$, $x$ can still influence $s$ in the training procedure of attention mechanisms. Besides, the effectiveness of the method also relies on an accurate estimation of $\Pr_b(y \mid m)$, which may require careful tuning, as this probability changes dynamically during the training of the attention mechanisms.

5 Experiments

In this section, we present the experimental results of the proposed methods. For brevity, we denote random attention pretraining as Pretraining and mask-neutral learning with instance weighting as Weighting. First, we analyze the effectiveness of mitigating combinatorial shortcuts; then, we examine the effectiveness of improving interpretability.

5.1 Experiments for mitigating combinatorial shortcuts

We apply the proposed methods to the experiments introduced in Section 3.2 to check whether we can mitigate the combinatorial shortcuts. We summarize the results in Table 2.

No. Method Encoder Accuracy Attention to default tokens
(1) Pretraining Pooling 91.8% 8.4%
(2) Pretraining RCNN 78.7% 7.1%
(3) Weighting Pooling 97.8% 5.8%
(4) Weighting RCNN 92.8% 17.0%
Table 2: Effectiveness of the proposed methods for mitigating the combinatorial shortcuts.

As presented, after applying Pretraining and Weighting, the percentage of attention weights assigned to the default tokens is significantly reduced. Since the default tokens provide no useful information and serve only as carriers for combinatorial shortcuts, the results reveal that our methods successfully mitigate the combinatorial shortcuts.

5.2 Experiments for improving interpretability

In this section, using L2X Chen et al. (2018), an end-to-end attention-based model-interpretation method, as an example, we present the effectiveness of mitigating combinatorial shortcuts for better interpretability. We first introduce the evaluation scheme, then show the experimental results and discussions.

5.2.1 Evaluation scheme

Here we present the evaluation scheme.

Evaluation protocol

Our evaluation scheme is the same as that of L2X Chen et al. (2018). L2X is an instance-wise feature selection model using hard attention that employs the Gumbel-softmax trick. It selects a certain number of input components with attention mechanisms to approximate the output of the model to be explained. As discussed before, such a setting may suffer from combinatorial shortcuts, so its interpretability may be limited. Additionally, to further enrich the information available for model explanation, we propose to incorporate the output of the original model to be explained, i.e., $\hat{y}$, as part of the query for feature selection. This trick makes it easier for the explanation model to select the best features. At the same time, it also makes the combinatorial shortcut problem more prominent, and can thus better demonstrate the effectiveness of our proposed methods. As obtaining the outputs requires nothing apart from the features of samples and the model to be explained, it neither hurts the model-agnostic property of the explanation methods nor requires additional information. We adopt a binary feature-attribution mask $m$ to select features, i.e., the top-$k$ values of the mask are set to $1$ and the others to $0$; we then treat $m \odot x$ as the selected features Chen et al. (2018).
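The top-$k$ binarization step can be sketched as follows (a NumPy stand-in; `topk_mask` is an illustrative helper, not the paper's code):

```python
import numpy as np

def topk_mask(scores, k):
    """Binary feature-attribution mask: top-k scores -> 1, the rest -> 0."""
    m = np.zeros_like(scores)
    idx = np.argsort(scores, axis=-1)[..., -k:]   # indices of the k largest scores
    np.put_along_axis(m, idx, 1.0, axis=-1)
    return m

scores = np.array([[0.1, 0.7, 0.05, 0.9, 0.2]])   # per-feature attention scores
m = topk_mask(scores, k=2)                         # -> [[0., 1., 0., 1., 0.]]
x = np.array([[3.0, 1.0, 4.0, 1.5, 9.0]])
selected = m * x                                   # m ⊙ x: only the selected features survive
```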

Evaluation metrics

Following Chen et al. (2018), we perform a predictive evaluation that measures how accurately the original model's outputs can be approximated using only the selected features, and we report the post-hoc accuracy. For each method on each dataset, we repeat the experiment ten times with different initializations and report the averaged results.
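A minimal sketch of the post-hoc accuracy computation (the model and masks here are hypothetical stand-ins):

```python
import numpy as np

def post_hoc_accuracy(model, X, masks):
    """Fraction of samples where the model's prediction on the selected features
    (non-selected ones zeroed out) matches its prediction on the full input."""
    full = model(X)
    masked = model(masks * X)
    return (full == masked).mean()

# Hypothetical "black-box" model: sign of a fixed linear score.
w = np.array([2.0, -1.0, 0.1, 0.0])
model = lambda X: (X @ w > 0).astype(int)

X = np.array([[ 1.0, 0.5, 3.0, 2.0],
              [-1.0, 0.5, 3.0, 2.0]])
good_masks = np.array([[1, 1, 0, 0],
                       [1, 1, 0, 0]])   # keep the two genuinely influential features
acc = post_hoc_accuracy(model, X, good_masks)   # 1.0 in this toy case
```

Masks that select uninformative features would lower the agreement, which is what the metric is designed to penalize.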


Datasets

We report evaluations on four datasets: IMDB Maas et al. (2011), Yelp P. Zhang et al. (2015), MNIST LeCun et al. (1998), and Fashion-MNIST (F-MNIST) Xiao et al. (2017). IMDB and Yelp P. are text classification datasets: IMDB contains 25,000 training examples and 25,000 test examples, and Yelp P. contains 560,000 training examples and 38,000 test examples. MNIST and F-MNIST are image classification datasets. For MNIST, following Chen et al. (2018), we build a binary classification subset from the images of digits 3 and 8, with 11,982 training examples and 1,984 test examples. For F-MNIST, we select the data of Pullover and Shirt, with 12,000 training examples and 2,000 test examples.

Models to be explained

Following Chen et al. (2018), for IMDB and Yelp P., we implement CNN-based models and select 10 and 5 words, respectively, for explanations. For MNIST and F-MNIST, we use the same CNN model as Chen et al. (2018) and select 25 and 64 pixels, respectively.


Baselines

We consider state-of-the-art model-agnostic baselines: LIME Ribeiro et al. (2016), CXPlain Schwab and Karlen (2019), L2X Chen et al. (2018), and VIBI Bang et al. (2019). We also compare with a model-specific baseline, Gradient Simonyan et al. (2013). Among these methods, Gradient takes advantage of the properties of neural networks and selects the input features with the largest absolute gradient values. LIME explains a model by quantifying the model's sensitivity to changes in the input. CXPlain uses the real labels: it computes loss-function values by erasing each feature to zero, and normalizes these values as a surrogate for ideal explanations that a neural network then learns. Our methods follow the same paradigm as L2X and VIBI, which use hard attention to select a fixed number of features to approximate the output of the original model to be explained. VIBI improves L2X by encouraging the briefness of the learned explanation, adding a constraint that draws the feature scores toward a global prior.

5.2.2 Experimental results

Following the aforementioned evaluation scheme, we report the results in Table 3.

No. Method IMDB Yelp P. MNIST F-MNIST
(1) Gradient Simonyan et al. (2013) 85.6% 82.3% 98.2% 58.6%
(2) LIME Ribeiro et al. (2016) 89.8% 87.4% 80.4% 75.6%
(3) CXPlain Schwab and Karlen (2019) 90.6% 97.7% 99.4% 59.7%
(4) L2X Chen et al. (2018) 89.2% 88.2% 91.4% 77.3%
(5) VIBI Bang et al. (2019) 90.8% 94.4% 98.3% 84.1%
(6) L2X with $\hat{y}$ 48.8% 77.8% 94.9% 85.3%
(7) Pretraining 97.1% 99.0% 66.3% 89.4%
(8) Weighting 94.3% 87.7% 99.8% 95.4%
CXPlain uses additional information, i.e., the real labels of samples.
The contribution of VIBI is orthogonal to ours.
Table 3: Effectiveness of the proposed methods for improving interpretability. We report the post-hoc accuracy scores with different methods.

From the table, comparing rows (4) and (6), we find that directly adding $\hat{y}$ to the query does not always improve performance. Interestingly, for the text classification datasets, adding $\hat{y}$ decreases performance, and Pretraining outperforms Weighting; for the image classification datasets, we reach the exact opposite conclusion. We ascribe this phenomenon to inherent differences between the two tasks. First, a single word in a sentence is much more informative than a single pixel in an image. Second, the importance of words is more “continuous”, while the importance of pixels is more “discrete” and co-adapting. Intuitively, $\mathbb{E}[\ell(f(m \odot x), y)]$ is smoother and easier to learn for text classification tasks than for image classification tasks. As a result, as discussed in Section 4.1, it may be hard for Pretraining to learn reasonable estimations of $\mathbb{E}[\ell(f(m \odot x), y)]$ efficiently for images, which limits its interpretability, especially for MNIST, where the digits are placed more randomly than the better-aligned items in F-MNIST.

By comparing with the other baselines (especially L2X with $\hat{y}$), we find that Pretraining and Weighting outperform most of the benchmarks. We conclude that mitigating combinatorial shortcuts can effectively improve interpretability.

6 Conclusion

Attention-based model interpretations have been popular for their convenience of integration with neural networks. However, there has been a debate on the interpretability of attention mechanisms. In this paper, we propose that combinatorial shortcuts are one of the root causes hindering the interpretability of attention mechanisms. We analyze combinatorial shortcuts theoretically and design experiments to demonstrate their existence. Furthermore, we propose two practical methods to mitigate combinatorial shortcuts for better interpretability. Experiments on real-world datasets show that the proposed methods are effective, and they can be applied to any attention-based model-interpretation task. To the best of our knowledge, this is the first work to highlight combinatorial shortcuts in attention mechanisms.

Broader Impact Statement

This paper investigates an essential problem of model interpretability, which is crucial for a variety of real-world (especially high-stakes) applications that need transparent decision making, such as medicine, security, criminal justice, and education. Deriving more precise model interpretations and explanations can better reveal how a model works during the decision-making process and thus alleviate potential errors and biases. Better model interpretability can also enhance model trustworthiness and generalization ability. One potential risk of improving model interpretability is that it may make the model more vulnerable to adversarial attacks. However, better interpretability can also guide researchers to design better model defense mechanisms. Moreover, as adversarial attacks typically require 1) knowledge of the model mechanism and 2) data manipulation, users should maintain proper access control over their models and data.


  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) A unified view of gradient-based attribution methods for deep neural networks. In NIPS Workshop on Interpreting, Explaining and Visualizing Deep Learning: Now What? (NIPS 2017). Cited by: §2, footnote 3.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations. Cited by: §2.
  • B. Bai, G. Zhang, Y. Lin, H. Li, K. Bai, and B. Luo (2020) CSRN: collaborative sequential recommendation networks for news retrieval. arXiv preprint arXiv:2004.04816. Cited by: §2.
  • S. Bang, P. Xie, W. Wu, and E. Xing (2019) Explaining a black-box using deep variational information bottleneck approach. arXiv preprint arXiv:1902.06918. Cited by: §1, §2, §5.2.1, Table 3, footnote 1.
  • T. Chakraborti, A. Kulkarni, S. Sreedharan, D. E. Smith, and S. Kambhampati (2019) Explicability? legibility? predictability? transparency? privacy? security? the emerging landscape of interpretable agent behavior. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 29, pp. 86–96. Cited by: §1.
  • J. Chen, L. Song, M. Wainwright, and M. Jordan (2018) Learning to explain: an information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pp. 883–892. Cited by: §1, §1, §2, §2, §5.2.1, §5.2.1, §5.2.1, §5.2.1, §5.2.1, §5.2, Table 3, footnote 1.
  • E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart (2016) Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512. Cited by: §1, §2.
  • M. Du, N. Liu, and X. Hu (2019) Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77. Cited by: §2.
  • W. Fan, I. Davidson, B. Zadrozny, and P. S. Yu (2005) An improved categorization of classifier’s sensitivity on sample selection bias. In Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 605–608. Cited by: §3.1.
  • K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang (2016)

    Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts

    IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2321–2334. Cited by: §2.
  • S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9734–9745. Cited by: §2.
  • S. Jain and B. C. Wallace (2019) Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556. Cited by: §1, §2.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparametrization with gumble-softmax. In International Conference on Learning Representations, Cited by: §2.
  • T. Joachims, A. Swaminathan, and T. Schnabel (2017) Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 781–789. Cited by: §4.2.
  • S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In

    Twenty-ninth AAAI conference on artificial intelligence

    Cited by: §3.3.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.2.1.
  • Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang (2019) Attention-guided unified network for panoptic segmentation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 7026–7035. Cited by: §2.
  • Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §1.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011)

    Learning word vectors for sentiment analysis

    In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Cited by: §3.2, §5.2.1.
  • A. Martins and R. Astudillo (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §3.3.
  • G. Plumb, D. Molitor, and A. S. Talwalkar (2018) Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pp. 2515–2524. Cited by: footnote 3.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §5.2.1, Table 3.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies.. Journal of educational Psychology 66 (5), pp. 688. Cited by: §3.1, §3.1.
  • B. Schneider, M. Carnoy, J. Kilpatrick, W. H. Schmidt, and R. J. Shavelson (2007) Estimating causal effects using experimental and observational design. American Educational & Reseach Association. Cited by: §3.1.
  • P. Schwab and W. Karlen (2019) CXPlain: causal explanations for model interpretation under uncertainty. In Advances in Neural Information Processing Systems, pp. 10220–10230. Cited by: §5.2.1, Table 3.
  • S. Serrano and N. A. Smith (2019) Is attention interpretable?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2931–2951. Cited by: §1, §2, §2.
  • W. R. Shadish, M. H. Clark, and P. M. Steiner (2008) Can nonrandomized experiments yield accurate answers? a randomized experiment comparing random and nonrandom assignments. Journal of the American statistical association 103 (484), pp. 1334–1344. Cited by: §3.1.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §5.2.1, Table 3.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §2.
  • Y. Tay, A. T. Luu, and S. C. Hui (2018) Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2309–2318. Cited by: §2.
  • V. Vapnik (1992) Principles of risk minimization for learning theory. In Advances in neural information processing systems, pp. 831–838. Cited by: §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, footnote 2.
  • O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton (2015) Grammar as a foreign language. In Advances in neural information processing systems, pp. 2773–2781. Cited by: §2.
  • F. Wang, R. Kaushal, and D. Khullar (2019) Should health care demand interpretable artificial intelligence or accept “black box” medicine?. Annals of Internal Medicine. Cited by: §1.
  • Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606–615. Cited by: §1, §2.
  • S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 11–20. Cited by: §1, §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.2.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • C. Yeh, C. Hsieh, A. Suggala, D. I. Inouye, and P. K. Ravikumar (2019) On the (in) fidelity and sensitivity of explanations. In Advances in Neural Information Processing Systems, pp. 10965–10976. Cited by: §2, footnote 3.
  • B. Zadrozny (2004) Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, pp. 114. Cited by: §4.2.
  • G. Zhang, B. Bai, J. Liang, K. Bai, S. Chang, M. Yu, C. Zhu, and T. Zhao (2019) Selection bias explorations and debias methods for natural language sentence matching datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4418–4429. Cited by: §4.2.
  • G. Zhang, B. Bai, J. Zhang, K. Bai, C. Zhu, and T. Zhao (2020a) Demographics should not be the reason of toxicity: mitigating discrimination in text classifications with instance weighting. arXiv preprint arXiv:2004.14088. Cited by: §4.2.
  • J. Zhang, B. Bai, Y. Lin, J. Liang, K. Bai, and F. Wang (2020b) General-purpose user embeddings based on mobile app usage. arXiv preprint arXiv:2005.13303. Cited by: §2.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §5.2.1.
  • F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang (2017) Learning spatial regularization with image-level supervisions for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5513–5522. Cited by: footnote 1.