How to explain a complex machine learning model that predicts a response from an input feature vector, given only black-box access to the model, is an increasingly salient problem. An increasingly popular approach is to attribute any given prediction to the input features, which could range from providing a vector of importance weights, one per input feature, to simply providing a set of important features. For instance, given a deep neural network for image classification, we may explain a specific prediction by showing the set of salient pixels, or a heatmap image showing the importance weights of all the pixels.
A large class of such attribution mechanisms is loosely based on the sensitivity of the learned predictor function to its input. The most prominent of these are based on the gradient of the predictor function with respect to its input [6, 21], including gradient variants that address some of the caveats of raw gradients, such as saturation, where a change in the prediction value is not reflected in the gradient [28, 24, 19]. There are also approaches based on counterfactuals that quantify the effect of substituting the value of a feature with a default value, or with samples from some noise distribution [17]. Other approaches vary the active subsets of the set of input features (e.g. over the power set of all features) and average the resulting per-feature counterfactual contributions, which has roots in cooperative game theory and revenue division.
But how good is any such explanation mechanism? We can distinguish between two classes of explanation evaluation measures [11, 14]: objective measures and subjective measures. The notion of explanation is very human-centric, and consequently the predominant evaluations of explanations have been subjective measures, ranging from qualitative displays of explanation examples, to crowd-sourced evaluations of human satisfaction with the explanations, as well as of whether humans are able to understand the model. Nonetheless, it is also important to consider objective measures of explanation effectiveness, not only because these place explanations on a sounder theoretical foundation, but also because they allow us to improve our explanations by improving their objective measures. The predominant class of objective measures is based on the fidelity of the explanation to the predictor function. When we have a priori information that only a particular subset of features is relevant, we can test whether the explanation features belong to this relevant subset.
In this work, we are interested in a simpler objective evaluation measure: sensitivity. In other words, we are interested in a counterfactual at the level of the explanation: what happens to the explanation when we perturb the test input? Depending on how we define the perturbation, and how we measure the change in the explanation, we arrive at different notions of sensitivity. Intuitively, we wish for our explanation not to be too sensitive, since that would entail differing explanations for minor variations in the input (and prediction values), which would lead us not to trust the explanations. In large part, we expect explanations to be simple (hence the approximation of complex models via simple "interpretable" models), and lower sensitivity can be viewed as one such notion of simplicity. Given that most attribution mechanisms are loosely based on the sensitivity of a function to its input, what we ask is how sensitive sensitivity-based explanation mechanisms themselves are.
The other key contribution of the paper is that we also provide a calculus for estimating the sensitivity for general explanation mechanisms, which we instantiate to derive corollaries for a wide range of recently proposed explanation mechanisms. We also note that our development holds more abstractly for investigating the sensitivity of any functional of a given function at a specific point. In our case, the function is the learnt predictor, and the functional is the explanation, but our development holds more generally. As one candidate broader application, this could be useful for analyzing so-called plugin-estimators that are functionals (e.g. entropy) of the model parameters; though we defer such broader investigations for future work.
Lastly, we also investigate how to modify a given explanation mechanism to make it less sensitive with respect to our measure. To this end, we provide a meta-explanation technique that encompasses Smooth-Grad, and that modifies any existing explanation mechanism to improve its sensitivity with only a modest deviation from the original explanation. In addition, we propose a solution that improves explanation sensitivity when we are given the freedom to retrain the model, via adversarial training. As we show, our modifications provide qualitatively much better explanations (with higher faithfulness evaluations), in addition to being better, by construction, with respect to the objective measure of sensitivity.
2 Explanation Sensitivity
Consider the following general supervised learning setting: an input space $\mathcal{X}$, an output space $\mathcal{Y}$, and a (machine-learnt) black-box predictor $f: \mathcal{X} \to \mathcal{Y}$, which at some test input $x \in \mathcal{X}$ predicts the output $f(x)$. A feature attribution explanation is then some functional $\Phi(f, x)$ that, given a black-box predictor $f$ and a test point $x$, provides importance scores for the set of input features. Our main goal in this paper is to formalize a quantitative measure of the sensitivity of these resulting attribution-based explanations, and to discuss approaches to optimize this sensitivity measure so as to obtain explanations with the right amount of sensitivity, while still retaining their explanatory power. While we begin our discussion with explanations that output a vector of importance weights, one for each feature, our analysis below is fairly general, and in Appendix B we extend it to settings where the explanation consists of just a set of features.
We need two additional ingredients: a distance metric $D$ over explanations, and a distance metric $\rho$ over the inputs. We can then define the following sensitivity measure, which we term max-sensitivity, measuring the maximum change in the explanation under a small perturbation of the input $x$.
Given a black-box function $f$, an explanation functional $\Phi$, distance metrics $D$ and $\rho$ over explanations and inputs respectively, and a given input neighborhood radius $r$, we define the max-sensitivity as:
$$\mathrm{SENS}_{\max}(\Phi, f, x, r) = \max_{\rho(x', x) \le r} D\big(\Phi(f, x'), \Phi(f, x)\big).$$
The key caveat with the above notion of sensitivity is that it might be too critical: a single adversarial point in the neighborhood of $x$ with a large change in the explanation will cause the sensitivity measure to take a large value. While this may be desired in some settings, in others we might be more concerned when many of the points in the neighborhood have vastly differing explanations. Accordingly, we define the following average-sensitivity measure, which averages the change in the explanation as we range over small perturbations of the input $x$.
Given a black-box function $f$, an explanation functional $\Phi$, distance metrics $D$ and $\rho$ over explanations and inputs respectively, and a given input neighborhood radius $r$, we define the average-sensitivity as:
$$\mathrm{SENS}_{\mathrm{avg}}(\Phi, f, x, r) = \int D\big(\Phi(f, x'), \Phi(f, x)\big) \, d\mu_x(x'),$$
for some distribution $\mu_x$ over inputs, centered around $x$.
In our experiments, we chose $\mu_x$ to be the uniform distribution over the neighborhood of radius $r$ around $x$. While $\mathrm{SENS}_{\max}$ measures the maximum change in the explanation as the data point is perturbed within a small neighborhood, $\mathrm{SENS}_{\mathrm{avg}}$ measures the average change in the explanation over such perturbations. When clear from the context, we simply write $\mathrm{SENS}_{\max}(x)$ and $\mathrm{SENS}_{\mathrm{avg}}(x)$, also noting that we suppress the dependence on the distance metrics $D$ and $\rho$. In the sequel, when results hold for both the maximum and average sensitivity measures, we simply use $\mathrm{SENS}$ to denote the sensitivity measure.
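As a concrete illustration, both measures can be estimated by Monte-Carlo sampling over the neighborhood. The following sketch (with hypothetical helper names, assuming numpy, a uniform perturbation within an $\ell_\infty$ box of radius $r$, and an $\ell_2$ distance over explanations) estimates both for the gradient explanation of a simple quadratic predictor:

```python
import numpy as np

def max_sensitivity(explainer, f, x, r, n_samples=50, rng=None):
    """Monte-Carlo estimate of SENS_max: largest l2 change in the explanation
    over uniform perturbations of x within a box of radius r."""
    rng = np.random.default_rng(rng)
    base = explainer(f, x)
    worst = 0.0
    for _ in range(n_samples):
        z = x + rng.uniform(-r, r, size=x.shape)
        worst = max(worst, np.linalg.norm(explainer(f, z) - base))
    return worst

def avg_sensitivity(explainer, f, x, r, n_samples=50, rng=None):
    """Monte-Carlo estimate of SENS_avg under the same uniform distribution."""
    rng = np.random.default_rng(rng)
    base = explainer(f, x)
    total = 0.0
    for _ in range(n_samples):
        z = x + rng.uniform(-r, r, size=x.shape)
        total += np.linalg.norm(explainer(f, z) - base)
    return total / n_samples

# Gradient explanation of a quadratic predictor f(x) = 0.5*||x||^2,
# whose gradient is x itself, so explanations move exactly as fast as inputs.
grad_explainer = lambda f, x: x.copy()
f = lambda x: 0.5 * np.dot(x, x)
x0 = np.ones(3)
s_max = max_sensitivity(grad_explainer, f, x0, r=0.1, n_samples=200, rng=0)
s_avg = avg_sensitivity(grad_explainer, f, x0, r=0.1, n_samples=200, rng=0)
```

By construction the average estimate can never exceed the max estimate, matching the two definitions above; the sampled maximum is only a lower estimate of the true supremum.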
Before we proceed to the sensitivity calculus, we argue why it is desirable for explanations to have low sensitivity scores. To this end, we first define the following measures to quantify the sensitivity of the predictions of a model.
Given a black-box function $f$, a distance metric $D_{\mathcal{Y}}$ over outputs and a distance metric $\rho$ over inputs, and a given input neighborhood radius $r$, we define the max-prediction-sensitivity and average-prediction-sensitivity as:
$$\mathrm{SENS}^{f}_{\max}(f, x, r) = \max_{\rho(x', x) \le r} D_{\mathcal{Y}}\big(f(x'), f(x)\big), \qquad \mathrm{SENS}^{f}_{\mathrm{avg}}(f, x, r) = \int D_{\mathcal{Y}}\big(f(x'), f(x)\big)\, d\mu_x(x').$$
Ghorbani et al. empirically observe that there exist inputs which are indistinguishable to humans, and for which the model outputs similar predictions, yet which have very different gradient explanations (for example, see the first two columns of the examples in Figure 2). Such explanations are undesirable, as they do not faithfully explain the predictions of the model and cannot be understood by a human.
We now formally show that if the explanation sensitivity is much larger than the prediction sensitivity, there must exist such a pair of inputs that have similar predictions and are indistinguishable to a human, and yet have very different explanations.
Suppose the explanation sensitivity and the prediction sensitivity are such that the former is much larger than the latter. Then there exists an input $x'$ close to $x$ such that the predictions $f(x')$ and $f(x)$ are similar, and yet the explanations $\Phi(f, x')$ and $\Phi(f, x)$ differ substantially.
In Figure 1, we show the sensitivities of the gradient explanation mechanism and of the model prediction for a two-layer convolutional neural network trained on MNIST. To ensure that the scales of the two sensitivity measures are comparable, we use relative changes in the prediction and the explanation to compute the sensitivity scores, with norm-induced distances for both. We find that the ratio between the gradient sensitivity and the prediction sensitivity ranges from 3 to 14 across different values of $r$. By Proposition 2.1, this suggests the existence of "adversarial explanation" points, as shown in Figure 2.
We now present a sensitivity calculus for estimating the sensitivity of general explanation mechanisms. We start with the following definition, which allows us to bound the sensitivity of explanations.
We say a function $g$ is $\lambda$-locally Lipschitz continuous with respect to a metric $\rho$ around $x$ if, for all $x'$ such that $\rho(x', x) \le r$, $g$ satisfies $D\big(g(x'), g(x)\big) \le \lambda \, \rho(x', x)$.
When the explanation satisfies the above assumption of local Lipschitzness, we can derive the following upper bound on the sensitivity of explanations.
Suppose the explanation $\Phi(f, \cdot)$ is $\lambda$-locally Lipschitz continuous around $x$ with respect to $\rho$. Then $\mathrm{SENS}(\Phi, f, x, r) \le \lambda r$.
The above proposition provides a key tool for bounding the sensitivity of any given explanation mechanism. In particular, we discuss its applicability to gradient explanations. The predominant class of explanations is based on gradients of the machine-learnt predictor $f$. Proposition 2.2 provides a simple upper bound on the sensitivity of these gradient explanations, so long as we have a bound on the Lipschitz constant of the predictor gradient. As a concrete example, we instantiate Proposition 2.2 for deep neural networks with Softplus activations, which are a close differentiable approximation of the commonly used ReLU activation.
Suppose the predictor $f$ is an $L$-layer Softplus neural network with weights $W_l$ at layer $l$ and zero bias at each layer. Let $\Phi(f, x) = \nabla_x f(x)$ denote the gradient explanation at a point $x$. Then the sensitivity of $\Phi$ is upper bounded by $\lambda r$, where the Lipschitz constant $\lambda$ of $\nabla f$ depends on the layer weight norms $\|W_l\|$, under a common norm-induced distance for the metrics $D$ and $\rho$.
The corollary follows by observing that the Lipschitz constant of $\nabla f$ is upper bounded in terms of the layer weight norms with respect to the chosen distance. Consequently, the local Lipschitz condition (1) holds for the gradient explanation, and Proposition 2.2 applies to the network predictor with the gradient as its explanation.
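To see the Lipschitz bound of Proposition 2.2 in action, consider a toy quadratic predictor whose gradient explanation is linear, so its Lipschitz constant is simply a spectral norm. The following sketch (assuming numpy, $\ell_2$ metrics for both $D$ and $\rho$, and hypothetical variable names) checks that the max-sensitivity attains the bound $\lambda r$ in this case:

```python
import numpy as np

# For f(x) = 0.5 * x^T A x the gradient explanation is Phi(x) = A x, which is
# globally Lipschitz in l2 with constant equal to the spectral norm of A.
A = np.diag([3.0, 1.0, 0.5])
phi = lambda x: A @ x
lam = np.linalg.norm(A, 2)          # Lipschitz constant lambda = 3.0
x0 = np.array([1.0, -2.0, 0.5])
r = 0.2

# SENS_max over the l2 ball of radius r: sup_{||d|| <= r} ||A d|| = lam * r,
# attained by perturbing along the top eigenvector of A.
d_worst = r * np.array([1.0, 0.0, 0.0])
sens = np.linalg.norm(phi(x0 + d_worst) - phi(x0))
```

For this linear explanation the bound is tight; for a nonlinear explanation the proposition only gives an upper bound.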
The caveat of these propositions, however, is that they require characterizing the local constancy or local Lipschitzness of the explanation functional, which might be non-trivial. Accordingly, we provide a calculus for deriving sensitivities of explanation functionals given the sensitivities of simpler explanations. As corroboration of the utility of our calculus, we derive bounds on the sensitivities of prominent examples as corollaries. A wide class of explanation techniques proposed in the literature can be viewed as modifying gradients through simple operators such as: (a) the element-wise product of the gradient with the given data point, and (b) averaging gradients over the neighborhood of the given point. We note that many common explanation techniques, such as Gradient*Image (which is equivalent to ε-LRP for neural networks with zero bias and ReLU activations), Integrated Gradients, and SmoothGrad, can be obtained by applying compositions of these operators to gradients. Therefore, by providing a calculus for the effect of these operators on explanation sensitivity, we can better understand the sensitivities of explanation techniques in common use, as well as of those yet to be proposed. We start by analyzing the effect of the element-wise product operation on sensitivity.
Suppose the distance metric $D$ admits a coordinate-wise decomposition for some scalar function, and moreover suppose the associated boundedness condition holds. Let $\odot$ denote the Hadamard product operator, which performs an element-wise product of two vectors. Then the sensitivity of the explanation obtained via the Hadamard product of the given explanation with the test point, $x \odot \Phi(f, x)$, can be bounded as:
Note that the assumption on the distance metric in the proposition is satisfied by all Minkowski distances, which include the commonly used $\ell_p$ metrics. When comparing the sensitivities of $\Phi(f, x)$ and $x \odot \Phi(f, x)$, the additional factor of $x$ in the modified explanation acts as a scaling factor. When this scaling is small enough, the upper bound on the modified explanation sensitivity is close to the original explanation sensitivity. We corroborate this proposition in the experiments section, where we show that the sensitivities of the Gradient*Image and gradient explanations are very similar (since the scaling factor is small).
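A minimal numerical illustration of the Hadamard-product operator (hypothetical helper names, assuming numpy and $\ell_2$ distances): for a linear predictor, the gradient explanation is constant and has zero sensitivity, while Gradient*Input picks up sensitivity only through the extra factor of the input, consistent with the bound above:

```python
import numpy as np

def grad_times_input(explainer):
    """Hadamard-product operator: turn an explanation Phi into x * Phi(f, x)."""
    return lambda f, x: x * explainer(f, x)

# A linear predictor f(x) = w . x has a constant gradient explanation, so the
# gradient itself has zero sensitivity; gradient*input inherits sensitivity
# only through the extra factor of x.
w = np.array([1.0, -2.0, 3.0])
grad_exp = lambda f, x: w.copy()
gxi_exp = grad_times_input(grad_exp)

x0 = np.array([0.5, 0.5, 0.5])
rng = np.random.default_rng(0)
r = 0.05
deltas = rng.uniform(-r, r, size=(200, 3))
sens_grad = max(np.linalg.norm(grad_exp(None, x0 + d) - grad_exp(None, x0))
                for d in deltas)
sens_gxi = max(np.linalg.norm(gxi_exp(None, x0 + d) - gxi_exp(None, x0))
               for d in deltas)
```

Here the Gradient*Input sensitivity is bounded by the perturbation radius times the gradient magnitude, so it remains small when the scaling factor is small.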
A more complex operator is that of averaging a given explanation using a local kernel $k$. Suppose $k$ satisfies a shift-invariance-like property, and suppose further that $D$ is a distance metric that is compatible with averaging, in the sense that the distance of an average is bounded by the average of distances. Note that this holds for all Minkowski distances.
Suppose the distance metric $D$ and the kernel function $k$ satisfy the conditions above. Let $\Phi_k$ denote the smoothed explanation obtained by averaging $\Phi$ under the kernel $k$. Then its sensitivity can be bounded in terms of that of the unsmoothed explanation as:
The sensitivity upper bound in the proposition for the smoothed explanation is simply a smoothing, with the same kernel, of the sensitivity of the original explanation. Note that when the sensitivity has large variance in the neighborhood specified by the kernel, the inequality is not necessarily tight, and the post-smoothing sensitivity could be even lower than the stated upper bound. We apply this lemma to Integrated Gradients and SmoothGrad to provide insight into why they may achieve lower sensitivity.
Suppose we apply the SmoothGrad modification to an explanation for the model $f$, which we denote by SG, and suppose the distance metric $D$ is a Minkowski distance. Then its sensitivity can be bounded as:
where $k$ is the Gaussian kernel with isotropic covariance $\sigma^2 I$.
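A minimal SmoothGrad sketch (assuming numpy, Gaussian noise with isotropic covariance, and hypothetical function names) illustrating the lowered sensitivity on a ReLU-like predictor whose raw gradient jumps at the kink:

```python
import numpy as np

def smooth_grad(explainer, sigma, n_samples=500, rng=0):
    """SmoothGrad-style smoothing: average the explanation over Gaussian
    perturbations of the input (isotropic covariance sigma^2 I). A fixed
    seed makes repeated calls use the same noise, for a fair comparison."""
    def smoothed(f, x):
        g = np.random.default_rng(rng)
        noise = g.normal(0.0, sigma, size=(n_samples,) + x.shape)
        return np.mean([explainer(f, x + n) for n in noise], axis=0)
    return smoothed

# A ReLU-like scalar predictor: its raw gradient explanation jumps from 0 to 1
# at the kink, while the smoothed explanation varies gradually.
f = lambda x: np.maximum(x, 0.0)
raw_grad = lambda f, x: (x > 0).astype(float)
sg = smooth_grad(raw_grad, sigma=0.5)

x_neg, x_pos = np.array([-0.01]), np.array([0.01])
jump_raw = abs(raw_grad(f, x_pos)[0] - raw_grad(f, x_neg)[0])
jump_sg = abs(sg(f, x_pos)[0] - sg(f, x_neg)[0])
```

The smoothed explanation approximates the Gaussian CDF of the input, so two nearby points on either side of the kink receive nearly identical explanations, whereas the raw gradient changes by its full range.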
3 Obtaining Less Sensitive Explanations
Given the objective evaluation measure of explanation sensitivity, a natural question is whether we can leverage this analysis to obtain better explanations. Since the sensitivity score depends on two components, the model and the explanation mechanism, one could consider two natural techniques to improve it: (a) modify the explanation mechanism to improve its sensitivity, and (b) retrain the model so that the explanations produced by the explanation mechanism become stable. The SmoothGrad technique proposed by Smilkov et al. for improved explanations falls in the first category. The retraining techniques studied by Alvarez-Melis and Jaakkola, and by Lee et al., for obtaining more explainable models fall in the second category.
3.1 Modifying Explanations to Lower Sensitivity
We first propose an approach to smooth a given explanation functional while largely retaining explanation faithfulness. More formally, we would like to find a modified explanation that is still close to the original explanation $\Phi$, but has lower sensitivity. Our objective can be formalized as:
where the constraint bounds the allowed difference between the original explanation and the smoothed explanation at a data point, and the exponent is some constant, typically set to one or two. While direct minimization of the above objective seems computationally expensive, we propose to solve for our modified explanations by optimizing the following surrogate objectives:
The surrogate minimization is similar in spirit to a single step of the Jacobi iterative method, where we minimize the explanation pointwise for each data point, while fixing the explanation values (to the unmodified explanation) at all other points.
The hyperparameter in (3) controls the balance between the sensitivity of the modified explanation and its distance from the original explanation. When it is zero, the modified explanation coincides with the original one and retains the original sensitivity; as it tends to infinity, the modified explanation tends to a constant with zero sensitivity. We now show that the surrogate objectives in Eqs. (3) are scaled upper bounds of the intractable objective in (2) with the average-sensitivity, and thus have a well-founded variational optimization justification.
While even these surrogate objectives might not in general seem straight-forward to optimize, we show that for certain distance metrics we could choose such that we obtain efficient closed form solutions.
We thus derive an objective similar to Smooth-Grad, with a different smoothing distribution over neighboring points. Our formulation in Eq. (3) is moreover a generalization of Smooth-Grad that can work with general distance metrics. Recall that in Proposition 2.4 we showed that averaging explanations results in sensitivity equal to or lower than that of the original explanation. This provides additional justification for why Smooth-Grad generates explanations with lower sensitivity, especially when the model is highly nonlinear. We also provide empirical corroboration of the lowered sensitivity of Smooth-Grad in the experiments section.
3.2 Retraining Model to Lower Sensitivity
In this section, we explore a different approach to lowering the sensitivity of explanations: we consider alternative training (inference) procedures for obtaining robust explanations. Since many popular explanation techniques rely on gradients, we focus specifically on gradient explanations. One naive technique to lower the sensitivity of gradient explanations is to regularize the weights of a neural network by adding a norm penalty on the weights; then, by Corollary 2.1, the upper bound on the sensitivity of gradient explanations is lowered.
An alternative way to robustify gradient-based explanations is to learn a model with smooth gradients. We show that models learned through "adversarial training" have smooth gradients, and as a result the gradient-based explanations of these models are naturally robust to perturbations. An adversarial perturbation at a point $x$ with label $y$, for any classifier $f$, is defined as any small perturbation $\delta$ such that $f(x + \delta) \neq y$. The adversarial loss at a point is defined as the worst-case classification loss over such perturbations, where the classification loss is, for example, the logistic loss. The expected adversarial risk of a classifier is then the expectation of this adversarial loss over the data distribution. The goal in adversarial training is to minimize the expected adversarial risk. We now show that minimizing the expected adversarial risk results in models with smooth gradients.
Consider the binary classification setting with the logistic loss, where the classifier is twice differentiable w.r.t. its input. For any perturbation radius, the adversarial training objective can be upper bounded as
where is the dual norm of , which is defined as .
Notice the two terms in the upper bound, which penalize the norms of the gradient and the Hessian. By optimizing the adversarial risk, we are effectively optimizing a gradient- and Hessian-norm-penalized risk. This suggests that optimizing the adversarial risk can lead to classifiers with small and "smooth" gradients, which are naturally more robust to perturbations. More formally, a smaller Hessian norm lowers the Lipschitz constant of the gradients, which by Proposition 2.2 leads to a smaller sensitivity bound for gradient explanations.
A number of techniques have been proposed for minimizing the expected adversarial risk [13, 16, 22, 27]. In our experiments, we use the Projected Gradient Descent (PGD) technique to train an adversarially robust network. We conclude the section with a discussion of other potential approaches to obtaining models with robust gradients. One natural technique is to add a regularizer that penalizes gradients with large norms; that is, to add a gradient norm penalty to the training objective. Ross and Doshi-Velez study this approach and empirically show that it results in more explainable models. However, a drawback of this approach is its long training time, since it must deal with Hessians during training. The main advantage of adversarial training over gradient regularization is that adversarial robustness is an active research area, and a number of efficient techniques are being designed for faster training. Another technique for obtaining models with smooth gradients is Gaussian convolution, a well-known smoothing operation: one takes a pre-trained model and convolves it with a Gaussian. The SmoothGrad approach of Smilkov et al. can be interpreted as doing exactly this: SmoothGrad is equivalent to the gradient explanation of the convolved model.
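As an illustrative sketch of adversarial training (not the experimental setup of this paper), the following numpy code runs $\ell_\infty$ PGD inside the training loop of a linear logistic classifier on toy data; all names and hyperparameters are hypothetical:

```python
import numpy as np

def pgd_attack(w, x, y, eps, steps=10, lr=0.05):
    """Linf PGD on the logistic loss of a linear classifier f(x) = w . x:
    repeatedly step along the sign of the input-gradient of the loss and
    project back into the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        margin = y * np.dot(w, x + delta)
        grad_x = -y * w / (1.0 + np.exp(margin))  # d/dx log(1 + exp(-y w.x))
        delta = np.clip(delta + lr * np.sign(grad_x), -eps, eps)
    return delta

def adversarial_train(X, Y, eps, epochs=200, lr=0.1):
    """Minimize the empirical adversarial risk: attack each point with PGD,
    then descend the loss gradient evaluated at the perturbed points."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1]) * 0.01
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for x, y in zip(X, Y):
            x_adv = x + pgd_attack(w, x, y, eps)
            margin = y * np.dot(w, x_adv)
            grad += -y * x_adv / (1.0 + np.exp(margin))
        w -= lr * grad / len(X)
    return w

# Toy linearly separable data: adversarial training still fits it.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w = adversarial_train(X, Y, eps=0.1)
train_acc = np.mean(np.sign(X @ w) == Y)
```

For a linear model the gradient explanation is $w$ itself and is already constant; the sketch is only meant to show the inner-maximization / outer-minimization structure that, for deep models, yields the smoother gradients discussed above.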
4 Experiments
Setup. We perform our experiments on 100 random images from MNIST and CIFAR-10. For MNIST, we train our own CNN baseline model and robust model, both with accuracy above 99 percent. For CIFAR-10, we use a baseline wide ResNet model with 94 percent accuracy and a pretrained robust model with 87 percent accuracy. In our experiments, we compare simple gradients (Grad), Integrated Gradients (IG), ε-LRP (LRP), Guided Backpropagation (GBP), and Grad-CAM imposed on Guided Backpropagation (GradCam), with our generalized Smooth-Grad (SG) technique derived in (3). To compute the sensitivity scores defined in Definitions 2.1 and 2.2, we randomly draw 50 points by Monte-Carlo sampling. We choose the distance metrics $D$ and $\rho$ to be norm-induced metrics in all the experiments. We set the perturbation radius in Definition 2.1 to the same value for both MNIST and CIFAR-10. To allow fair comparisons among different explanation methods, we normalize each explanation to have unit norm before calculating the sensitivity. For adversarial training, we use norm-bounded perturbations for both MNIST and CIFAR-10.
Metrics. In all our experiments, we compare explanation mechanisms on two metrics: faithfulness and sensitivity. To evaluate the faithfulness of an explanation to the actual prediction, we modify an evaluation method proposed and later adopted in prior work, which measures the correlation between the sum of the attributions for a given set of features and the change in the target output after removing those features. However, since setting feature values to zero would introduce bias and favor attributions on bright pixels with higher values, we instead measure the output change when the selected feature values are decreased by amounts sampled from a uniform distribution, truncated so that the perturbed values remain in the valid range. We choose $N$ to be 300 for MNIST and 500 for CIFAR-10, and estimate the correlation by randomly sampling 500 subsets of features for each data point.
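The faithfulness estimator described above can be sketched as follows (hypothetical function and parameter names, assuming numpy; the baseline perturbation is uniform noise truncated to the valid input range, as in the text):

```python
import numpy as np

def faithfulness(f, x, attribution, subset_size, n_subsets=500, rng=0):
    """Correlation between the summed attributions of a random feature subset
    and the drop in f when that subset is perturbed toward a random baseline
    (uniform noise, truncated so inputs stay in [0, 1])."""
    g = np.random.default_rng(rng)
    attr_sums, deltas = [], []
    fx = f(x)
    for _ in range(n_subsets):
        idx = g.choice(len(x), size=subset_size, replace=False)
        x_pert = x.copy()
        noise = g.uniform(0.0, 1.0, size=subset_size)
        x_pert[idx] = np.clip(x[idx] - noise, 0.0, 1.0)  # truncated baseline
        attr_sums.append(attribution[idx].sum())
        deltas.append(fx - f(x_pert))
    return np.corrcoef(attr_sums, deltas)[0, 1]

# For a linear model f(x) = w . x, the Gradient*Input attribution w * x
# should correlate positively with the output change under removal.
w = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
f = lambda x: float(np.dot(w, x))
x0 = np.full(5, 0.8)
rho = faithfulness(f, x0, attribution=w * x0, subset_size=2)
```

The random baseline makes each subset's output drop noisy, so even for this exactly linear model the correlation is high but below one; less faithful attributions would drive it toward zero.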
4.1 Explanation Sensitivity and Faithfulness
In addition to comparing various explanation techniques described above, we also compare the sensitivity and faithfulness of explanations obtained using the explanation modification approaches in Section 3. Given these modification approaches, a key question we ask is: would lowering the sensitivity also lower the faithfulness of the explanation?
Smooth-Grad. We first investigate the sensitivity and faithfulness of various explanation methods and of their Smooth-Grad versions derived in Eq. (3). For Smooth-Grad, we set the radius R in (3) to 0.3 for both MNIST and CIFAR-10. We summarize the results on the MNIST and CIFAR-10 datasets on the left-hand side of Table 1. We observe that applying Smooth-Grad decreases the sensitivity and increases the faithfulness for almost all base explanations on both datasets (the only exception is IG-SG, which has a sensitivity similar to IG for the baseline CIFAR-10 model; the reason may be that IG and SG both contain kernel-averaging operations). This shows that by applying Smooth-Grad, we can achieve less sensitive explanations with much improved faithfulness.
Adversarial Training. Inspired by our findings in Section 3.2, we now ask: does an adversarially robust network provide less sensitive and more faithful explanations? To answer this question, we report sensitivity and faithfulness measurements for all explanations with respect to a baseline model and to robust models obtained through adversarial training. We further report results for other adversarially robust models in the appendix. We summarize the results on the right-hand side of Table 1. We observe that adversarially robust networks generally have lower sensitivity scores for Grad and Grad-SG. This corroborates Theorem 3.1 and Proposition 2.2, which support the claim that an adversarially trained model has lower gradient explanation sensitivity. We also observe that adversarially robust networks lead to lower sensitivity for the other explanations, which corroborates Proposition 2.3 and Corollary A.1, as they show that the sensitivities of Gradient*Image and IG are upper bounded by increasing functions of the gradient sensitivity. Although we do not show upper bounds on the sensitivity of GBP and GradCam, we observe that their sensitivities are closely related to the gradient sensitivity. This explains the decreased sensitivity of all explanations on adversarially robust networks. Furthermore, the faithfulness measures of all explanations are greatly improved for a robust model. We additionally report the sensitivity and faithfulness of Grad and Grad-SG for models with different robustness levels (corresponding to accuracy against PGD-attacked inputs) on MNIST in Table 2. We find that as the robustness level increases, the sensitivity of gradients decreases and the faithfulness increases. We thus validate that adversarially robust networks lead to improved explanations, with lower sensitivity and higher faithfulness to the model.
In the first row of each example in Figure 2, we visualize gradient explanations of images from MNIST and CIFAR-10, along with the corresponding explanations from Smooth-Grad and from an adversarially trained model. In the second row of each example, we additionally show the explanation that varies the most after perturbing the image using the random attack of Ghorbani et al.; the corresponding attacked image is in the third row. We observe that the explanations from Smooth-Grad and from adversarially trained models are less sensitive than the vanilla gradient explanation and less vulnerable to random attacks. This qualitatively shows that our modifications provide more robust and faithful explanations.
4.3 A toy example
We now consider a simple toy example to illustrate why SmoothGrad might result in more faithful explanations. Consider a piecewise-defined function in Euclidean space that can easily be shown to be continuous, but whose gradient switches abruptly between the two pieces. We visualize the gradient in Figure 3: it is very sensitive with respect to the input. A point and a nearby perturbed point have nearly identical function values but very different gradients. Apart from being sensitive, the gradient is also unfaithful to the function output: the gradient at the original point implies that only one feature is relevant to the function value, yet increasing the other feature changes the function value substantially, which the gradient explanation clearly fails to reflect. Here, gradient-SG (computed with R in (3) set to a large enough value) averages over both regimes, which is more faithful to the function output and less sensitive, since it is nearly constant across inputs. This toy example provides insight into how SmoothGrad may achieve more faithful explanations.
5 Related Work
We provide a brief and necessarily incomplete review of the burgeoning recent work on attribution-based explanation mechanisms. One family comprises perturbation-based methods, which measure the prediction difference after perturbing a set of features; this has been applied to CNNs using grey-patch occlusion, and further improved by Zintgraf et al. and Chang et al. Another prominent class of attribution-based explanations comprises backpropagation-based methods, which compute attributions from gradients [6, 21] or gradient variants [28, 24, 19]. It has been shown that ε-LRP, DeepLIFT, and Integrated Gradients can also be seen as variants of gradient explanations.
To remove noise from the gradient saliency map, Kindermans et al. propose to compute the signal of the image by removing distractors. SmoothGrad can be layered on top of existing methods by generating noisy images via additive Gaussian noise and averaging the gradients of the sampled images. Another form of sensitivity analysis approximates the behavior of a complex model by a locally linear interpretable model. The reliability of these attribution explanations is another problem of interest. Adebayo et al. show that several saliency methods are insensitive to random perturbations in parameter space, generating the same saliency maps even when the parameters are randomized. Montavon et al. propose to use continuity as a measure of explanation quality, observe that discontinuities may occur for gradient-based explanations, and show that deep Taylor LRP achieves continuous explanations compared to simple gradient explanations. However, they do not measure the amount of "sensitivity" of continuous explanations, and therefore cannot compare or improve explanations that are already continuous.
In recent work, Ghorbani et al. empirically demonstrate that adversarial attacks on some gradient-based explanations are possible. In parallel work, Alvarez-Melis and Jaakkola propose to measure the robustness of explanations using local Lipschitz constants. However, they focus only on evaluating the sensitivity of explanations, while we also provide a calculus for deriving the sensitivity of complex explanations and show how to optimize explanations with respect to the measure. Alvarez-Melis and Jaakkola and Lee et al. focus on training neural networks with less sensitive explanations. Ross and Doshi-Velez argue that adding a gradient norm penalty to the training objective makes the predictions and gradient explanations of the resulting network more robust. Similar conclusions can be found in Tsipras et al. This empirical finding can be explained by Theorem 3.1, which shows that adversarially robust networks have lower gradient sensitivity (and empirically more faithful explanations).
6 Conclusion
We propose an objective evaluation metric, naturally termed sensitivity, for machine learning explanations. One of our key contributions is a calculus for bounding the sensitivities of general explanation methods, which we instantiate on a broad array of existing explanation methods; our bounds for the many recently proposed gradient-based explanations underscore their sensitivity theoretically, corroborating empirical observations in recent papers. We then propose two approaches to improve the sensitivity of explanations, by modifying the explanation mechanism and by retraining the model. Finally, we validate in our experiments that by lowering the sensitivity of explanations, we achieve more faithful explanations.
- Adebayo et al.  Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9525–9536, 2018.
- Alvarez-Melis and Jaakkola [2018a] David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018a.
- Alvarez-Melis and Jaakkola [2018b] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pages 7786–7795, 2018b.
- Ancona et al.  Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. A unified view of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations, 2018.
- Bach et al.  Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- Baehrens et al.  David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
- Chang et al.  Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1MXz20cYQ.
- Datta et al.  Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
- Ghorbani et al.  Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. AAAI, 2019.
- Kindermans et al.  Pieter-Jan Kindermans, Kristof T Schütt, Maximilian Alber, Klaus-Robert Müller, and Sven Dähne. Patternnet and patternlrp–improving the interpretability of neural networks. International Conference on Learning Representations, 2018.
- Kulesza et al.  Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 126–137. ACM, 2015.
- Lee et al.  Guang-He Lee, David Alvarez-Melis, and Tommi S. Jaakkola. Towards robust, locally linear deep networks. In International Conference on Learning Representations, 2019.
- Madry et al.  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Miller  Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
- Montavon et al.  Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.
- Raghunathan et al.  Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
- Ribeiro et al.  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
- Ross and Doshi-Velez  Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.
- Selvaraju et al.  Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In International conference on computer vision, 2017.
- Shrikumar et al.  Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. International Conference on Machine Learning, 2017.
- Simonyan et al.  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Sinha et al.  Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
- Smilkov et al.  Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
- Springenberg et al.  Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
- Sundararajan et al.  Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.
- Tsipras et al.  Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyxAb30cY7.
- Wong and Kolter  Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- Zintgraf et al.  Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
Appendix A Appendix
A.1 Additional Calculus
We say a function $\Phi$ is $\epsilon$-locally constant around $x$ with respect to a metric $D$ if, for all $y$ such that $\|y - x\| \le r$, $\Phi$ satisfies $D(\Phi(y), \Phi(x)) \le \epsilon$.
This notion of local constancy naturally leads to the following bound on the sensitivity of explanations:
Suppose the explanation $\Phi(f, \cdot)$ is $\epsilon$-locally constant around $x$ with respect to metric $D$, and $f$ is a continuous function in $x$. Then $\mathrm{SENS}_{\max}(\Phi, f, x, r) \le \epsilon$.
Proof of Proposition A.1.
For any constant $c$, distance $D$ satisfying $D(cu, cv) = |c|\,D(u, v)$ for all $u, v$, and an explanation $\Phi$ of a predictor $f$, we have that: $\mathrm{SENS}_{\max}(c\,\Phi, f, x, r) = |c|\,\mathrm{SENS}_{\max}(\Phi, f, x, r)$.
Proof of Proposition A.2.
Suppose we apply the Integrated Gradients modification of an explanation $\Phi$ for the model $f$, with the baseline set to $0$, which we denote by $\mathrm{IG}(\Phi)$, and suppose the distance metric is a Minkowski distance. Then its sensitivity can be bounded as:
$\mathrm{SENS}_{\max}(\mathrm{IG}(\Phi), f, x, r) \le \mathbb{E}_{z \sim \rho}\left[\mathrm{SENS}_{\max}(\Phi, f, z, r)\right],$
where $\rho$ is the density of the uniform distribution over points on the line from the baseline point $0$ to $x$.
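For concreteness, the Integrated Gradients modification with baseline $0$ can be sketched via a midpoint Riemann-sum approximation of the path integral (illustrative code with an analytic toy gradient; the names and step count are assumptions, not a reference implementation):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=1000):
    """Midpoint Riemann-sum approximation of
    IG_i(x) = (x_i - b_i) * integral_0^1 [grad f(b + a (x - b))]_i da,
    with grad_fn returning the gradient of the predictor f."""
    b = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_fn(b + a * (x - b)) for a in alphas], axis=0)
    return (x - b) * avg_grad

# For the toy predictor f(x) = sum(x**2) with gradient 2x and baseline 0,
# IG(x) = x * mean_a(2 a x) = x**2, and the attributions sum to f(x) - f(0).
grad_f = lambda z: 2.0 * z
x = np.array([1.0, -2.0])
ig = integrated_gradients(grad_f, x)
```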
A.2 Proof of Proposition 2.1
Proof of Proposition 2.1.
If , then there exists $y$ such that , and . Therefore,
If , and suppose for all $y$ satisfying
This contradicts the premise that ; therefore, by contradiction, there exists $y$ such that , and
A.3 Proof of Proposition 2.2
Proof of Proposition 2.2.
A.4 Proof of Proposition 2.3
Proof of Proposition 2.3.
A.5 Proof of Proposition 2.4
Proof of Proposition 2.4.
A.6 Proof of Theorem 3.1
We consider the logistic loss, a convex surrogate of the 0-1 loss, which is defined as
We now show that minimizing the adversarial risk results in classifiers with smooth gradients. First note that the adversarial risk can be written as
We also have
Substituting this in the previous expression gives us
This can be upper bounded as follows
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
Let be defined as
Some algebra shows that it can be upper bounded by
So we have the following upper bound for our objective
Appendix B Set-Based Explanations
While the main paper focused on quantitative explanations that provide real-valued weights for each input feature, another class of explanations simply outputs a set of relevant features. Given a quantitative explanation, we can convert it to a set-based explanation by simply providing the set of the $k$ most salient features, for some small $k$. Thus, for a quantitative explanation $\Phi$, we can provide the set-based modification $\Phi_s$: we set $[\Phi_s(x)]_j = C$ if feature $j$ is among the top $k$ features (either with respect to signed magnitude or magnitude, depending on the type of explanation), and otherwise set $[\Phi_s(x)]_j = 0$, where $C$ is some normalizing constant. The benefit of using such set-based explanations is that, by reducing the amount of possibly less salient information, the explanation may be easier for a human to interpret. While some of the calculus we developed above may not seem directly applicable to set-based explanations, we provide a simple proposition that upper-bounds the sensitivity of a set-based explanation given the original explanation.
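A sketch of this set-based modification (magnitude-based selection here, one of the two options mentioned above; the constant `c` stands in for the normalizing constant and is an illustrative choice):

```python
import numpy as np

def set_based(expl, k, c=1.0):
    """Set-based modification: keep the k largest-magnitude features of
    the explanation vector, setting each to the constant c (standing in
    for the normalizing constant) and all other features to 0."""
    out = np.zeros_like(expl)
    top_k = np.argsort(np.abs(expl))[-k:]   # indices of the k largest |values|
    out[top_k] = c
    return out

expl = np.array([0.1, -3.0, 0.5, 2.0])
s = set_based(expl, k=2)
# Keeps the two largest-magnitude features (indices 1 and 3).
```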
Given an explanation functional $\Phi$ and its set-based modification $\Phi_s$, where the top $k$ values in $\Phi$ are set to $C$ and the rest are set to $0$, and suppose the distance metrics used in specifying the sensitivity are set to the $\ell_1$ distance. Then the sensitivity of $\Phi_s$ can be upper-bounded in terms of that of $\Phi$.
Proof of Proposition B.1.
Consider any two explanations $\Phi(x)$ and $\Phi(y)$, and let $\Phi_s(x)$ and $\Phi_s(y)$ be their set-based explanations respectively, with the top-$k$ features set to $C$ and the others set to $0$. Let $S_x$ and $S_y$ denote the top-$k$ sets for $x$ and $y$ respectively. Here, $D$ is defined as the $\ell_1$ distance.
Let $m$ denote the number of features on which the top-$k$ sets for $\Phi(x)$ and $\Phi(y)$ differ, so that the two top-$k$ sets have exactly $m$ differences. Define sets $U$ and $V$ as:
We know that $|U| = |V| = m$, and that $U$ and $V$ are disjoint by definition. We fix an arbitrary order for $U$ and $V$, so that $u_i$ is the $i$th element of $U$ and $v_j$ is the $j$th element of $V$. By definition of the top-$k$ set we have:
Moreover, we have
Combining these results, we have:
This holds for all $y$; therefore,
While this bound is not necessarily tight, it provides insight into why the sensitivity of a set-based explanation may be much lower than that of the underlying quantitative explanation. In particular, the sensitivity of set-based explanations does not account for changes in the values of features that remain within the top-$k$ set or within its complement. In Figure 4, we show some examples of set-based gradient saliency before and after applying set-based SmoothGrad in (3), and we observe that the smoothed saliency maps are less noisy and more focused on the object of interest.