# How Sensitive are Sensitivity-Based Explanations?

We propose a simple objective evaluation measure for explanations of a complex black-box machine learning model. While most such model explanations have largely been evaluated via qualitative measures, such as how humans might qualitatively perceive the explanations, it is vital to also consider objective measures such as the one we propose in this paper. Our evaluation measure that we naturally call sensitivity is simple: it characterizes how an explanation changes as we vary the test input, and depending on how we measure these changes, and how we vary the input, we arrive at different notions of sensitivity. We also provide a calculus for deriving sensitivity of complex explanations in terms of that for simpler explanations, which thus allows an easy computation of sensitivities for yet to be proposed explanations. One advantage of an objective evaluation measure is that we can optimize the explanation with respect to the measure: we show that (1) any given explanation can be simply modified to improve its sensitivity with just a modest deviation from the original explanation, and (2) gradient based explanations of an adversarially trained network are less sensitive. Perhaps surprisingly, our experiments show that explanations optimized to have lower sensitivity can be more faithful to the model predictions.

## 1 Introduction

How to explain a complex machine learning model, that predicts a response given an input feature vector, given just black-box access to the model, is an increasingly salient problem. And an increasingly popular approach to do so is to attribute any given prediction to the set of input features, which could range from providing a vector of importance weights, one per input feature, to simply providing a set of important features. For instance, given a deep neural network for image classification, we may explain a specific prediction by showing the set of salient pixels, or a heatmap image showing the importance weights for all the pixels.

A large class of such attribution mechanisms is loosely based on the sensitivity of the learned predictor function to its input. The most prominent class of these is based on the gradient of the predictor function with respect to its input [6, 21], including gradient variants that address some of the caveats with gradients such as saturation, where a change in the prediction value is not reflected by the gradient [28, 24, 19]. There are also approaches based on counterfactuals that quantify the effect of substituting the value of a feature with a default value, or with samples from some noise distribution [28]. Further, there are approaches that estimate a local simple model (such as a linear regression model) that approximates the complex predictor function locally [17]. Finally, there are approaches that vary the active subsets of the set of input features (e.g. over the power set of the set of all features) and average such per-feature counterfactual contributions, which has roots in cooperative game theory and revenue division [8].

But how good is any such explanation mechanism? We can distinguish between two classes of explanation evaluation measures [11, 14]: objective measures and subjective measures. The notion of explanation is very human-centric, and consequently the predominant evaluations of explanations have been subjective measures, which range from qualitative displays of explanation examples, to crowd-sourced evaluations of human satisfaction with the explanations, as well as whether humans are able to understand the model. Nonetheless, it is also important to consider objective measures of explanation effectiveness, not only because these place explanations on a sounder theoretical foundation, but also because they allow us to improve our explanations by improving their objective measures. The predominant class of objective measures is based on fidelity of the explanation to the predictor function. When we have a priori information that only a particular subset of features is relevant, we can then test whether the explanation features belong to this relevant subset.

In this work, we are interested in a simpler objective evaluation measure: sensitivity. In other words, we are interested in a counterfactual at the level of explanation: what happens to the explanation when we perturb the test input? Depending on how we define the perturbation, and how we measure the change in the explanation, we arrive at different notions of sensitivity. Intuitively, we wish for our explanation to not be too sensitive, since that would entail differing explanations with minor variations in the input (and prediction values), which would lead us to not trust the explanations. In large part, we expect explanations to be simple (hence the approximation of complex models via simple “interpretable” models). A lower sensitivity could be viewed as one such notion of simplicity. Given that most attribution mechanisms are loosely based on sensitivity of a function to its input, what we ask is how sensitive sensitivity-based explanation mechanisms themselves are.

The other key contribution of the paper is that we also provide a calculus for estimating the sensitivity for general explanation mechanisms, which we instantiate to derive corollaries for a wide range of recently proposed explanation mechanisms. We also note that our development holds more abstractly for investigating the sensitivity of any functional of a given function at a specific point. In our case, the function is the learnt predictor, and the functional is the explanation, but our development holds more generally. As one candidate broader application, this could be useful for analyzing so-called plugin-estimators that are functionals (e.g. entropy) of the model parameters; though we defer such broader investigations for future work.

Lastly, we also investigate how to modify a given explanation mechanism to make it less sensitive with respect to our measure. To this end, we provide a meta-explanation technique that encompasses Smooth-Grad [23], and which modifies any existing explanation mechanism to improve its sensitivity with just a modest deviation from the original explanation. In addition, we propose a solution to improve the explanation sensitivity if we are given the freedom to retrain the model by adversarial training. As we show, our modifications provide qualitatively much better explanations (with higher faithfulness evaluations), in addition to being better with respect to the objective measure of sensitivity, by construction.

## 2 Explanation Sensitivity

Consider the following general supervised learning setting: an input space $\mathcal{X}$, an output space $\mathcal{Y}$, and a (machine-learnt) black-box predictor $f$, which at some test input $x$ predicts the output $f(x)$. Then a feature attribution explanation is some function $\Phi$ that, given a black-box predictor $f$ and a test point $x$, provides importance scores $\Phi(f, x)$ for the set of input features. Our main goal in this paper is to formalize a quantitative measurement for the sensitivity of these resulting attribution based explanations, and discuss approaches to optimize this sensitivity measure to obtain explanations with the right amount of sensitivity, while still retaining their explanatory power. While we begin our discussion with explanations that output a vector of importance weights, one for each feature, our analysis below is fairly general, and in Appendix B we extend it to settings where the explanation just consists of a set of features.

We need two additional ingredients: a distance metric $D$ over explanations, and a distance metric $\rho$ over the inputs. We can then define the following sensitivity measure, which we term max-sensitivity, as measuring the maximum change in the explanation with a small perturbation of the input $x$.

###### Definition 2.1.

Given a black-box function $f$, explanation functional $\Phi$, and distance metrics $D$ and $\rho$ over explanations and inputs respectively, and for a given input neighborhood radius $r$, we define the max-sensitivity as:

$$S_{\text{MAX}}(\Phi, f, x, r) = \max_{\rho(y,x) \le r} D\big(\Phi(f,y), \Phi(f,x)\big).$$

The key caveat with the above notion of sensitivity is that it might be too critical: a single adversarial point in the neighborhood of $x$ with a large change in the explanation will cause the sensitivity measure to have a large value. While this might be desired in some settings, in certain other settings, we might be more concerned when many of the points in the neighborhood have vastly differing explanations. Accordingly, we can define the following average-sensitivity measure, which averages the change in the explanation as we range over small perturbations of the input $x$.

###### Definition 2.2.

Given a black-box function $f$, explanation functional $\Phi$, and distance metrics $D$ and $\rho$ over explanations and inputs respectively, and for a given input neighborhood radius $r$, we define the average-sensitivity as:

$$S_{\text{AVG}}(\Phi, f, x, r) = \int_{y:\,\rho(y,x) \le r} D\big(\Phi(f,y), \Phi(f,x)\big)\, P_x(y)\, dy,$$

for some distribution $P_x$ over inputs, which is centered around $x$.

In our experiments, we chose $P_x$ as the uniform distribution over the neighborhood of radius $r$ around $x$. While $S_{\text{MAX}}$ measures the maximum change in the explanation as the data point is perturbed within a small neighborhood, $S_{\text{AVG}}$ measures the average change in the explanation over such perturbations. When clear from the context, we simply use $S_{\text{MAX}}$ and $S_{\text{AVG}}$ to denote $S_{\text{MAX}}(\Phi, f, x, r)$ and $S_{\text{AVG}}(\Phi, f, x, r)$ respectively, also noting that we suppress the dependence on the distance metrics $D$ and $\rho$. In the sequel, when the results hold for both the maximum and average sensitivity measures, $S_{\text{MAX}}$ and $S_{\text{AVG}}$, we simply use $S$ to denote the sensitivity measure.
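As a rough illustration of Definitions 2.1 and 2.2, the sketch below estimates both measures by Monte Carlo sampling. It is not the authors' code: the helper names (`sample_linf_ball`, `sensitivity`), the Euclidean choice for the explanation distance, the ℓ∞ ball for the input neighborhood, and the toy predictor are all assumptions made for the example. Note that a sampled maximum can only underestimate the true max-sensitivity.

```python
import numpy as np

def sample_linf_ball(x, r, n, rng):
    """Draw n points uniformly from the l-infinity ball of radius r around x."""
    return x + rng.uniform(-r, r, size=(n,) + x.shape)

def sensitivity(explain, x, r, n_samples=50, seed=0):
    """Monte Carlo estimates of (max-sensitivity, average-sensitivity) at x.

    explain: callable mapping an input array to an explanation array
             (the predictor f is assumed to be baked into it).
    Assumed metrics: Euclidean distance over explanations, l-infinity
    distance over inputs.
    """
    rng = np.random.default_rng(seed)
    base = explain(x)
    ys = sample_linf_ball(x, r, n_samples, rng)
    dists = np.array([np.linalg.norm(explain(y) - base) for y in ys])
    return dists.max(), dists.mean()

# Toy predictor f(x) = sum(x**3); its gradient explanation is 3 * x**2.
grad_explain = lambda x: 3.0 * x ** 2
s_max, s_avg = sensitivity(grad_explain, np.array([1.0, -2.0, 0.5]), r=0.1)
```

Because both estimates are computed from the same samples, the estimated average never exceeds the estimated maximum, mirroring the relation between the two definitions.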

Before we proceed to sensitivity calculus, we argue why it is desirable to have explanations with low sensitivity scores. To this end, we first define the following measures to quantify the sensitivity of predictions of a model.

###### Definition 2.3.

Given a black-box function $f$, and distance metrics $D$ and $\rho$ over predictions and inputs respectively, and for a given input neighborhood radius $r$, we define the max-prediction-sensitivity and average-prediction-sensitivity as:

$$\begin{aligned}
P_{\text{MAX}}(f, x, r) &= \max_{\rho(y,x) \le r} D\big(f(y), f(x)\big),\\
P_{\text{AVG}}(f, x, r) &= \int_{y:\,\rho(y,x) \le r} D\big(f(y), f(x)\big)\, P_x(y)\, dy.
\end{aligned}$$

Ghorbani et al. [9] empirically observe that there exist inputs which are indistinguishable by humans, for which the model outputs similar predictions, and which nevertheless have very different gradient explanations (for example, see the first two columns of the examples in Figure 2). Such explanations are undesirable as they do not faithfully explain the predictions of a model and cannot be understood by a human.

We now formally show that if the explanation sensitivity is much larger than the prediction sensitivity, there must exist such a pair of inputs that have similar predictions and are indistinguishable to a human and yet have very different explanations.

###### Proposition 2.1.

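Suppose the explanation sensitivity and prediction sensitivity are such that $S_{\text{MAX}}(\Phi, f, x, r) / P_{\text{MAX}}(f, x, r) \ge R$. Then there exists $y$ which is close to $x$ such that $\rho(y, x) \le r$, and

$$\frac{D\big(\Phi(f,y), \Phi(f,x)\big)}{D\big(f(y), f(x)\big)} \ge R.$$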

In Figure 1, we show the sensitivities of the gradient explanation mechanism and the model prediction on a two-layer convolutional neural network trained on MNIST. To ensure that the scales of both sensitivity measures are comparable, we use relative changes in prediction and explanation to compute the sensitivity scores. We choose $D$ to be distance and $\rho$ to be norm for computation of the sensitivity of explanations. We find that the ratio between gradient sensitivity and prediction sensitivity is between 3 and 14 for different values of $r$. By Proposition 2.1, this suggests the existence of “adversarial explanation” points as shown in Figure 2.
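The argument behind Proposition 2.1 can be illustrated numerically. The sketch below is a toy setup of our own (a `tanh`-based predictor, 200 uniform samples, Euclidean distances, and relative changes as described above), and it reads the proposition in terms of max-sensitivities, which is an assumption. The sampled point with the largest explanation change serves as a witness: its explanation-to-prediction ratio is at least the ratio of the two estimated sensitivities.

```python
import numpy as np

f = lambda x: np.sum(np.tanh(3.0 * x))            # toy predictor (assumption)
grad_f = lambda x: 3.0 / np.cosh(3.0 * x) ** 2    # its gradient explanation

rng = np.random.default_rng(0)
x, r = np.array([0.4, -0.2, 0.9]), 0.05
ys = x + rng.uniform(-r, r, size=(200, 3))        # samples in the l-inf ball

# Relative changes, so that the two scales are comparable.
expl_change = np.array(
    [np.linalg.norm(grad_f(y) - grad_f(x)) for y in ys]
) / np.linalg.norm(grad_f(x))
pred_change = np.array([abs(f(y) - f(x)) for y in ys]) / abs(f(x))

# Ratio of the two estimated max-sensitivities.
ratio_of_maxima = expl_change.max() / pred_change.max()

# Witness point: the sample whose explanation changes the most.
k = int(np.argmax(expl_change))
witness = ys[k]
witness_ratio = expl_change[k] / pred_change[k]
```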

#### Sensitivity Calculus

We now present sensitivity calculus for estimating the sensitivity of general explanation mechanisms. We start with the following definition which allows us to bound the sensitivity of the explanations.

###### Definition 2.4.

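We say a function $\Phi$ is $L$-locally Lipschitz continuous with respect to metric $\rho$ around $x$ if for all $y, z$ such that $\rho(y, x) \le r$ and $\rho(z, x) \le r$, $\Phi$ satisfies

$$\frac{D\big(\Phi(y), \Phi(z)\big)}{\rho(y, z)} \le L. \tag{1}$$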

When the explanation satisfies the above assumption of local Lipschitzness, we can derive the following upper bound on the sensitivity of explanations.

###### Proposition 2.2.

Suppose the explanation $\Phi$ is $L$-locally Lipschitz continuous around $x$ with respect to $\rho$. Then $S(\Phi, f, x, r) \le L\,r$.
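A small numeric check can make the Lipschitz bound concrete. The example below is ours, not taken from the paper: it uses a quadratic predictor $f(x) = \tfrac{1}{2} x^\top A x$, whose gradient $Ax$ is Lipschitz in the Euclidean norm with constant equal to the spectral norm of $A$, and verifies that the sampled max-sensitivity of the gradient explanation stays below $L\,r$. The helper `sample_l2_ball` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2.0              # symmetric matrix defining f(x) = 0.5 x^T A x
grad = lambda x: A @ x           # gradient explanation of f
L = np.linalg.norm(A, 2)         # spectral norm: Lipschitz constant of grad

def sample_l2_ball(x, r, n, rng):
    """Draw n points uniformly from the Euclidean ball of radius r around x."""
    u = rng.normal(size=(n, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)            # random directions
    radii = r * rng.uniform(size=(n, 1)) ** (1.0 / x.size)   # uniform in volume
    return x + radii * u

x, r = rng.normal(size=5), 0.2
ys = sample_l2_ball(x, r, 1000, rng)
s_max_est = max(np.linalg.norm(grad(y) - grad(x)) for y in ys)
```

Here both metrics are Euclidean, so the bound reduces to the familiar inequality $\|A(y - x)\|_2 \le \|A\|_2 \, \|y - x\|_2$.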

The above proposition provides a key tool to bound the sensitivity of any given explanation mechanism. In particular, we discuss its applicability to gradient explanations. The predominant class of explanations is based on gradients of the machine-learnt predictor $f$. Proposition 2.2 provides a simple upper bound on the sensitivity of these gradient explanations, so long as we have a bound on the Lipschitz constant of the predictor gradient. As a concrete example, we instantiate Proposition 2.2 for deep neural networks with SoftPlus activations, which are a close differentiable approximation of the commonly used ReLU activation.

###### Corollary 2.1.

Suppose the predictor is a -layer Softplus neural network with weights at layer , and bias at each layer equal to , so that where . Let denote the gradient explanation at point , so that . Then the sensitivity of is upper bounded as: under distance for the distance metrics and .

The corollary follows naturally by observing that the Lipschitz constant of is upper bounded by with respect to distance. Consequently, (1) holds for and Proposition 2.2 holds for the network predictor with gradient as an explanation.

The caveat of these propositions, however, is that they require characterizing the local constancy or local Lipschitzness of the explanation functional, which might be non-trivial. Accordingly, we provide a calculus for deriving sensitivities of explanation functionals given sensitivities of simpler explanations. As corroboration of the utility of our calculus, we derive bounds on the sensitivities of prominent examples as corollaries. A wide class of explanation techniques proposed in the literature can be viewed as modifying gradients through simple operators such as: (a) element-wise product of the gradient with the given data point, and (b) averaging gradients over the neighborhood of the given point. We note that many common explanation techniques such as Gradient*Image (which is equivalent to $\epsilon$-LRP for neural networks with zero bias and ReLU activation [4]), Integrated Gradients, and SmoothGrad can be obtained by applying compositions of these operators to gradients. Therefore, by providing a calculus for the effect of these operations on the explanation sensitivity, we can better understand the sensitivities of explanation techniques that are in common use, as well as those yet to be proposed. We start by analyzing the effect on sensitivity of the element-wise product operation.

###### Proposition 2.3.

Suppose the distance metric $D$ is such that $D(a, b) = d(a - b)$, for some function $d$, and moreover suppose $d(a \odot b) \le d(a)\,d(b)$. Let $\odot$ denote the Hadamard product operator, which performs an element-wise product of two vectors. Then the sensitivity of the explanation $\Phi_\odot$ obtained via the Hadamard product of the given explanation with the test point, $\Phi_\odot(f, x) = \Phi(f, x) \odot x$, can be bounded as:

$$\frac{S_{\text{MAX}}(\Phi_\odot, f, x, r)}{d(x)} \le \Big(1 + \frac{d(r)}{d(x)}\Big)\, S_{\text{MAX}}(\Phi, f, x, r) + \frac{d(r)\, d\big(\Phi(f,x)\big)}{d(x)}.$$

Note that the assumption in the proposition on the distance metric is satisfied by all the Minkowski distances, which includes the commonly used metric. Note that when comparing the sensitivity of $\Phi$ and $\Phi_\odot$, $d(x)$ could be viewed as the scaling factor for the additional factor of $x$ in the modified explanation $\Phi_\odot$, which should be normalized to 1. When $d(r)$ is small enough, the upper bound for the modified explanation sensitivity is close to the original explanation sensitivity. We corroborate this proposition in the experiments section, where we show that the sensitivities of gradient*image and gradient explanations are very similar (since $d(r)$ is small).
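To see where a bound of this shape can come from, here is a short derivation sketch. It rests on assumptions of ours rather than on the proposition's exact hypotheses: $D(a,b) = d(a-b)$ for a norm-like $d$ that obeys the triangle inequality and $d(a \odot b) \le d(a)\,d(b)$, an input metric $\rho(y,x) = d(y-x)$, and $r$ written where the proposition writes $d(r)$. For any $y$ with $d(y - x) \le r$:

```latex
\begin{aligned}
\Phi(f,y)\odot y - \Phi(f,x)\odot x
  &= \big(\Phi(f,y)-\Phi(f,x)\big)\odot y \;+\; \Phi(f,x)\odot(y-x), \\
d\big(\Phi(f,y)\odot y - \Phi(f,x)\odot x\big)
  &\le d\big(\Phi(f,y)-\Phi(f,x)\big)\,d(y) \;+\; d\big(\Phi(f,x)\big)\,d(y-x) \\
  &\le \big(d(x)+r\big)\,S_{\text{MAX}}(\Phi,f,x,r) \;+\; r\,d\big(\Phi(f,x)\big).
\end{aligned}
```

The last step uses $d(y) \le d(x) + d(y - x) \le d(x) + r$. Taking the maximum over $y$ and dividing by $d(x)$ gives an inequality of the same form as the one in the proposition.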

A more complex operator is that of averaging a given explanation using a local kernel. Suppose satisfies the shift-invariance-esque property that . Suppose further that is some distance metric which satisfies , and moreover that , where . Note that this holds for all the Minkowski distances.

###### Proposition 2.4.

Suppose that the distance metric $D$ and the kernel function $k$ satisfy the conditions above. Let $\Phi_k$ denote the smoothed explanation: $\Phi_k(f, x) = \int_z \Phi(f, z)\, k(x, z)\, dz$, given the kernel $k$. Then its sensitivity can be bounded in terms of that of the unsmoothed explanation as:

$$S(\Phi_k, f, x, r) \le \int_z S(\Phi, f, z, r)\, k(x, z)\, dz.$$

The sensitivity upper bound in the proposition for the smoothed explanation is simply a smoothing of the sensitivity in turn using the same kernel. Note that when has a large variance in the neighborhood specified by the kernel, the inequality is not necessarily tight and the post-smoothed sensitivity could be even lower than that specified by the upper bound. We apply this proposition to Integrated Gradients and SmoothGrad to provide insights on why they may achieve lower sensitivity.

###### Corollary 2.2.

Suppose we apply the SmoothGrad [23] modification of an explanation for the model $f$, which we denote by SG, and suppose the distance metric $D$ is a Minkowski distance. Then its sensitivity can be bounded as:

$$S(\mathrm{SG}, f, x, r) \le \int_z S(\nabla f, f, z, r)\, k_\sigma(x, z)\, dz,$$

where $k_\sigma$ is simply the Gaussian kernel with isotropic covariance $\sigma^2 I$.
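The effect of Gaussian kernel averaging on sensitivity can be sketched as follows. This is an illustrative toy under our own assumptions, not the authors' implementation: `smoothgrad` and `max_sensitivity` are hypothetical helpers, the predictor is $f(x) = \sum_i \sin(10 x_i)$ with a rapidly oscillating gradient, and the same noise draws are reused on every call so that the smoothed explanation is a deterministic function of the input.

```python
import numpy as np

def smoothgrad(explain, sigma, n_samples=500, seed=0):
    """Wrap `explain` so it is averaged over Gaussian input noise."""
    def smoothed(x):
        rng = np.random.default_rng(seed)   # identical noise on every call
        noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
        return np.mean([explain(x + z) for z in noise], axis=0)
    return smoothed

def max_sensitivity(explain, x, r, n_samples=50, seed=1):
    """Sampled max-sensitivity: l-inf input ball, Euclidean explanation distance."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    ys = x + rng.uniform(-r, r, size=(n_samples,) + x.shape)
    return max(np.linalg.norm(explain(y) - base) for y in ys)

grad = lambda x: 10.0 * np.cos(10.0 * x)    # gradient of f(x) = sum(sin(10 x))
x0 = np.array([0.3, -0.7, 1.1])
grad_sens = max_sensitivity(grad, x0, r=0.1)
sg_sens = max_sensitivity(smoothgrad(grad, sigma=0.3), x0, r=0.1)
```

For this oscillatory toy function, averaging largely cancels the gradient's oscillation, so the smoothed explanation changes far less under the same perturbations.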

Also noting that Integrated Gradients can be seen as the composition of an average of gradient explanations and an element-wise product with the test point, we can obtain a bound on its sensitivity by combining Propositions 2.3 and 2.4. We present this result in Corollary A.1 in the Appendix.
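The composition described above can be written out directly. The sketch below assumes a zero baseline and a midpoint-rule approximation of the path integral (both are choices made for this example), and uses a toy predictor $f(x) = \sum_i x_i^3$. The function name `integrated_gradients` is ours.

```python
import numpy as np

def integrated_gradients(grad_f, x, n_steps=200):
    """Integrated Gradients with a zero baseline (an assumption here).

    Step 1: average the gradient along the straight path from 0 to x.
    Step 2: take the element-wise product with the test point x.
    """
    alphas = (np.arange(n_steps) + 0.5) / n_steps    # midpoint rule on [0, 1]
    avg_grad = np.mean([grad_f(a * x) for a in alphas], axis=0)
    return x * avg_grad

f = lambda x: np.sum(x ** 3)          # toy predictor
grad_f = lambda x: 3.0 * x ** 2       # its gradient
x0 = np.array([1.0, -2.0, 0.5])
ig = integrated_gradients(grad_f, x0)
```

As a sanity check, the attributions should sum approximately to `f(x0) - f(0)`, the well-known completeness property of Integrated Gradients.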

## 3 Obtaining Less Sensitive Explanations

Given the objective evaluation measure of sensitivity of an explanation, a natural question that arises is whether we could leverage this analysis to obtain better explanations. Since the sensitivity score depends on two components - the model and the explanation mechanism - one could consider two natural techniques to improve the sensitivity score: (a) modify the explanation mechanism to improve its sensitivity and (b) retrain the model so that the explanations produced by the explanation mechanism become stable. The SmoothGrad technique proposed by Smilkov et al. [23] for improved explanations, falls in the first category. The retraining techniques studied by Alvarez-Melis and Jaakkola [2], Lee et al. [12] for obtaining better explainable models, fall in the second category.

### 3.1 Modifying Explanations to Lower Sensitivity

We first propose an approach to smooth a given explanation functional while mostly retaining explanation faithfulness. More formally, we would like to find a modified explanation $\bar\Phi$ that is still close to the original explanation $\Phi$, but has lowered sensitivity. Our objective function can be formalized as:

$$\bar\Phi^{S}_f = \arg\min_{\bar\Phi_f} \big(S(\bar\Phi_f, x, r)\big)^{\alpha}, \quad \text{s.t. } D\big(\bar\Phi_f(x), \Phi_f(x)\big) \le K, \tag{2}$$

where $K$ is an upper bound on the allowed difference between the original explanation and the smoothed explanation on a data point $x$, and $\alpha$ is some constant that is typically set to either one or two. While direct minimization of the above objective for $\bar\Phi$ seems computationally expensive, we propose to solve for our modified explanations by optimizing the following surrogate objectives:

$$\begin{aligned}
\bar\Phi^{\text{AVG}}_f(x) &= \arg\min_{\bar\Phi_f(x)} \mathbb{E}_{\rho(u,0) \le R} \big(D(\bar\Phi_f(x), \Phi_f(x+u))\big)^{\alpha}, \\
\bar\Phi^{\text{MAX}}_f(x) &= \arg\min_{\bar\Phi_f(x)} \max_{\rho(u,0) \le R} \big(D(\bar\Phi_f(x), \Phi_f(x+u))\big)^{\alpha}.
\end{aligned} \tag{3}$$

The surrogate minimization is similar in spirit to a single step of the Jacobi iterative method, where we minimize the explanation pointwise for each data point, while fixing the explanation values (to the unmodified explanation) at all other points.

$R$ is a hyperparameter that controls the balance between the sensitivity of $\bar\Phi$ and the distance between $\bar\Phi$ and $\Phi$. When $R = 0$, $\bar\Phi = \Phi$ with the original sensitivity, and as $R$ tends to infinity, $\bar\Phi$ tends to a constant with zero sensitivity. We now show that the surrogate objectives in Eq. (3) are scaled upper bounds of the intractable objective in (2) with the average-sensitivity, and thus have a well-founded variational optimization justification.

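$$\begin{aligned}
S^{\alpha}_{\text{AVG}}(\bar\Phi, f, x, R)
&= \Big(\mathbb{E}_{\rho(y,x) \le R}\, D\big(\bar\Phi(f,y), \bar\Phi(f,x)\big)\Big)^{\alpha} \\
&\le \mathbb{E}_{\rho(u,0) \le R}\, D\big(\bar\Phi(f,x), \bar\Phi(f,x+u)\big)^{\alpha} \\
&\le 2^{\alpha}\, \mathbb{E}_{\rho(u,0) \le R} \Big[ D\big(\bar\Phi(f,x), \Phi(f,x+u)\big)^{\alpha} + D\big(\Phi(f,x+u), \bar\Phi(f,x+u)\big)^{\alpha} \Big] \\
&\le 2^{\alpha}\, \mathbb{E}_{\rho(u,0) \le R} \Big[ D\big(\bar\Phi(f,x), \Phi(f,x+u)\big)^{\alpha} + K^{\alpha} \Big].
\end{aligned} \tag{4}$$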

While even these surrogate objectives might not in general seem straight-forward to optimize, we show that for certain distance metrics we could choose such that we obtain efficient closed form solutions.

###### Proposition 3.1.

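Suppose we solve the average objective in Eq. (3) setting $D$ to be the $\ell_1$ distance and $\alpha = 1$. Then its optimal solution is the feature-wise median: $\bar\Phi^{\text{AVG}}_f(x) = \operatorname{median}_{\rho(u,0) \le R}\, \Phi_f(x+u)$. Suppose we solve the average objective in Eq. (3) setting $D$ to be the $\ell_2$ distance and $\alpha = 2$. Then its optimal solution is the feature-wise mean: $\bar\Phi^{\text{AVG}}_f(x) = \mathbb{E}_{\rho(u,0) \le R}\, \Phi_f(x+u)$.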
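The closed-form solutions can be checked on samples. The sketch below relies on two standard facts that we assume match the intended settings: the feature-wise median minimizes the expected $\ell_1$ distance, and the feature-wise mean minimizes the expected squared $\ell_2$ distance. The helper `smooth_explanation`, the uniform ℓ∞ neighborhood, and the toy explanation are illustrative choices, not the paper's.

```python
import numpy as np

def smooth_explanation(explain, x, R, n_samples=200, how="mean", seed=0):
    """Feature-wise mean or median of explanations over a radius-R neighborhood."""
    rng = np.random.default_rng(seed)
    us = rng.uniform(-R, R, size=(n_samples,) + x.shape)   # perturbations u
    samples = np.stack([explain(x + u) for u in us])       # Phi_f(x + u)
    reducer = np.mean if how == "mean" else np.median
    return reducer(samples, axis=0), samples

explain = lambda x: np.sign(x) * x ** 2     # toy explanation (assumption)
x0 = np.array([0.2, -1.0, 0.05])
mean_expl, samples = smooth_explanation(explain, x0, R=0.3, how="mean")
median_expl, _ = smooth_explanation(explain, x0, R=0.3, how="median")

# Empirical surrogate objectives evaluated on the same samples.
sq_l2 = lambda a: np.mean(np.sum((samples - a) ** 2, axis=1))   # l2, alpha = 2
l1 = lambda a: np.mean(np.sum(np.abs(samples - a), axis=1))     # l1, alpha = 1
```

On the drawn samples, `mean_expl` attains the smallest value of `sq_l2` and `median_expl` the smallest value of `l1`, so each beats the other candidate (and the unsmoothed `explain(x0)`) on its own objective.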

We thus derive an objective similar to Smooth-Grad [23] with a different smoothing distribution over neighboring points. Our formulation in Eq. (3) is moreover a generalized formulation of Smooth-Grad which can work with general distance metrics. Recall that in Proposition 2.4, we have shown that the sensitivity of an averaged explanation is bounded by the kernel-averaged sensitivity of the original explanation. This provides additional justification for why Smooth-Grad generates explanations with lower sensitivity, especially when the model is highly nonlinear. We also provide empirical corroboration of the lowered sensitivity of Smooth-Grad in the experiments section.

### 3.2 Retraining Model to Lower Sensitivity

In this section, we explore a different approach to lower the sensitivity of explanations. Here, we consider alternative training (inference) procedures for obtaining robust explanations. Since many popular explanation techniques rely on gradients, we specifically focus on the gradient explanations. One naive technique to lower the sensitivity of gradient explanations is to regularize the weights of a neural network by adding an norm penalty on the weights. Then, by Corollary 2.1, the upper bound for the sensitivity of gradient explanations will be lowered.

An alternative way to robustify gradient based explanations is to learn a model with smooth gradients. We show that models learned through “adversarial training” have smooth gradients and as a result the gradient based explanations of these models are naturally robust to perturbations. An adversarial perturbation at a point with label , for any classifier is defined as any perturbation such that . The adversarial loss at a point is defined as: , where is a classification loss such as logistic loss. The expected adversarial risk of a classifier is then defined as: . The goal in adversarial training is to minimize the expected adversarial risk. We now show that minimizing expected adversarial risk results in models with smooth gradients.

###### Theorem 3.1.

Consider the binary classification setting, where and is the logistic loss. Suppose is twice differentiable w.r.t . For any , the adversarial training objective can be upper bounded as

where is the dual norm of , which is defined as .

Notice the two terms in the upper bound which penalize the norm of the gradient and Hessian. It can be seen that by optimizing the adversarial risk, we are effectively optimizing a gradient and Hessian norm penalized risk. This suggests that optimizing the adversarial risk can lead to classifiers with small and “smooth” gradients, which are naturally more robust to perturbations. More formally, a smaller Hessian norm lowers the Lipschitz constant of the gradients, which by Proposition 2.2 leads to a smaller sensitivity bound for gradient explanations.
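A linear classifier gives a small exact instance of the link between adversarial loss and a gradient-norm penalty. This is our own simplification: the theorem concerns general twice-differentiable predictors, whereas for $f(x) = w^\top x$ the input gradient is just $w$ and the Hessian term vanishes. Under an $\ell_\infty$ perturbation budget `eps` (an assumed threat model), the worst-case logistic loss equals the clean loss with the margin reduced by `eps` times the $\ell_1$ norm of the gradient, which is the dual norm of $\ell_\infty$.

```python
import itertools
import numpy as np

def logistic_loss(margin):
    """Logistic loss as a function of the margin y * f(x)."""
    return np.log1p(np.exp(-margin))

rng = np.random.default_rng(0)
w, x = rng.normal(size=4), rng.normal(size=4)   # linear model and test point
y, eps = 1.0, 0.1                               # label in {-1, +1}, l-inf budget

# Brute force: the margin is linear in the perturbation, so the worst case
# over the l-inf ball is attained at one of its 2^4 corners.
corners = eps * np.array(list(itertools.product([-1.0, 1.0], repeat=4)))
brute_force = max(logistic_loss(y * (w @ (x + d))) for d in corners)

# Closed form: clean margin minus eps times the dual (l1) norm of the gradient w.
closed_form = logistic_loss(y * (w @ x) - eps * np.linalg.norm(w, 1))
```

Minimizing this adversarial loss therefore trades off the clean margin against the size of the input gradient, which is the qualitative behavior the upper bound describes.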

## 4 Experiments

Setup. We perform our experiments on 100 random images from MNIST and CIFAR-10. For MNIST, we train our own CNN model and robust model, both with accuracy above 99 percent. For CIFAR-10, we use a baseline Wide-ResNet model with 94 percent accuracy and a pretrained robust model with 87 percent accuracy. In our experiments we compare simple gradients (Grad), integrated gradients (IG), $\epsilon$-LRP (LRP), Guided Back-Propagation (GBP), and Grad-CAM imposed on Guided Back-Propagation (GradCam), with our generalized Smooth-Grad (SG) technique derived in (3). To compute the sensitivity scores defined in Definitions 2.1 and 2.2, we randomly sample 50 points with Monte-Carlo sampling. We choose the distance metrics $D$, $\rho$ to be metrics in all the experiments. We set the perturbation in Definition 2.1 to for both MNIST and CIFAR-10. To allow fair comparisons among different explanation methods, we normalize the explanation to have unit norm before calculating the sensitivity. For adversarial training we use perturbations of norm for MNIST and perturbations of norm for CIFAR-10.

Metrics. In all our experiments, we compare various explanation mechanisms based on two metrics: faithfulness and sensitivity. To evaluate the faithfulness of the explanation to the actual prediction, we modify the evaluation method proposed by [5] and later adopted by [4], which measures the correlation between the sum of the attributions for a certain set of features and the variation in the target output after removing these features. However, since setting feature values to zero would introduce bias and favor contributions for bright pixels with higher values, we instead measure

$$\operatorname{corr}\Big(\sum_{i=0}^{N} \epsilon_i\, \Phi_f(x)_i,\; f(x) - f(\hat{x} \mid \hat{x}_i = x_i - \epsilon_i)\Big),$$

where $\epsilon_i$ is sampled from uniform distribution between but ensure that by truncating . We choose $N$ to be 300 for MNIST and 500 in the CIFAR-10 experiment. We estimate the correlation by randomly sampling 500 subsets of features from each data point $x$.
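A minimal sketch of this faithfulness measure is given below. Several details are assumptions on our part because they are not recoverable from the text: the sampling range of the perturbations (`eps_scale`), the omission of the truncation step, and the function name `faithfulness`. The demo uses a linear predictor, for which the gradient explanation accounts for output changes exactly.

```python
import numpy as np

def faithfulness(f, explain, x, n_features, n_trials=500, eps_scale=0.2, seed=0):
    """Correlation between attribution-weighted perturbations and output changes.

    For each trial, a random subset of `n_features` coordinates is decreased
    by eps_i ~ Uniform(0, eps_scale). The range is an assumption, and the
    truncation step mentioned in the text is not implemented here.
    """
    rng = np.random.default_rng(seed)
    phi = explain(x)
    attr_sums, output_changes = [], []
    for _ in range(n_trials):
        idx = rng.choice(x.size, size=n_features, replace=False)
        eps = rng.uniform(0.0, eps_scale, size=n_features)
        x_hat = x.copy()
        x_hat[idx] -= eps                          # x_hat_i = x_i - eps_i
        attr_sums.append(np.sum(eps * phi[idx]))   # sum_i eps_i * Phi_f(x)_i
        output_changes.append(f(x) - f(x_hat))     # f(x) - f(x_hat)
    return np.corrcoef(attr_sums, output_changes)[0, 1]

# Linear predictor: the gradient explanation is exact, so correlation is 1.
rng = np.random.default_rng(1)
w = rng.normal(size=20)
x = rng.uniform(0.5, 1.0, size=20)
score = faithfulness(lambda v: float(w @ v), lambda v: w, x, n_features=5)
```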

### 4.1 Explanation Sensitivity and Faithfulness

In addition to comparing various explanation techniques described above, we also compare the sensitivity and faithfulness of explanations obtained using the explanation modification approaches in Section 3. Given these modification approaches, a key question we ask is: would lowering the sensitivity also lower the faithfulness of the explanation?

Smooth-Grad. We first investigate the sensitivity and faithfulness values of various explanation methods and their Smooth-Grad versions derived in Eq. (3). For Smooth-Grad, we set $R$ in (3) to be 0.3 for both MNIST and CIFAR-10. We summarize the results on the MNIST and CIFAR-10 datasets in the left-hand side of Table 1. We observe that by applying Smooth-Grad, the sensitivity of explanations decreases and the faithfulness of explanations increases for almost all base explanations in both datasets (the only exception is IG-SG, which has a similar sensitivity to IG for the baseline CIFAR-10 model; the reason may be that IG and SG both contain kernel average operations). This shows that by applying Smooth-Grad, we may achieve less sensitive explanations that have much improved faithfulness.

### 4.2 Visualizations

In the first row of each example in Figure 2, we visualize gradient explanations of images from MNIST and CIFAR-10 and the corresponding explanations from Smooth-Grad and the adversarially trained model. We additionally show the explanation that varies the most after perturbing the image using the random attack of Ghorbani et al. [9], in the second row of both examples. The corresponding attack image is in the third row. We observe that the explanations from Smooth-Grad and adversarially trained models are less sensitive compared to the vanilla gradient explanation and are less vulnerable to random attacks. This qualitatively shows that our modifications provide more robust and faithful explanations.

## 5 Related Work

We provide a brief and necessarily incomplete review of the burgeoning recent work on attribution based explanation mechanisms. One form of attribution based explanations is perturbation-based methods, which measure the prediction difference after perturbing a set of features. In [28], this method is applied to CNNs using grey patch occlusion, and it is further improved by Zintgraf et al. [29] and Chang et al. [7]. Another prominent class of attribution based explanations is backpropagation-based methods, which compute the attribution from the gradients [6, 21] or several gradient variants [28, 24, 19]. As shown in [4], $\epsilon$-LRP [5], DeepLIFT [20], and Integrated Gradients [25] can also be seen as variants of gradient explanations.

To remove noise from the gradient saliency map, Kindermans et al. [10] propose to calculate the signal of the image by removing distractors. SmoothGrad [23] can be added on top of existing methods by generating noisy images via additive Gaussian noise and averaging the gradients of the sampled images. Another form of sensitivity analysis, proposed by [17], approximates the behavior of a complex model by a locally linear interpretable model. The reliability of these attribution explanations is another problem of interest. Adebayo et al. [1] have shown that several saliency methods are insensitive to random perturbations in the parameter space, generating the same saliency maps even when the parameter space is randomized. On the other hand, Montavon et al. [15] have proposed to use continuity as a measure of the explanation, observe that discontinuity may occur for gradient-based explanations, and show that deep Taylor LRP [5] can achieve continuous explanations compared to simple gradient explanations. However, they do not measure the amount of "sensitivity" for the continuous explanations, and therefore cannot compare and improve explanations that are already continuous.

In a recent work, Ghorbani et al. [9] empirically demonstrate that designing adversarial attacks on some gradient-based explanations is possible. In a parallel work, Alvarez-Melis and Jaakkola [2] proposed to measure the robustness of explanations using local Lipschitz constants. However, they only focus on evaluating the sensitivity of explanations, while we also provide a calculus for deriving the sensitivity for complex explanations and show how to optimize the explanation with respect to the measure. Alvarez-Melis and Jaakkola [3] and Lee et al. [12] focus on training a neural network with less sensitive explanations. Ross and Doshi-Velez [18] argue that by adding a gradient norm penalty to the training objective, the predictions and gradient explanations of the resulting network are more robust. Similar conclusions can be found in Tsipras et al. [26]. This empirical finding can be explained by Theorem 3.1, which shows that adversarially robust networks have lower gradient sensitivity (and empirically more faithful explanations).

## 6 Conclusion

We propose an objective evaluation metric, naturally termed sensitivity, for machine learning explanations. One of our key contributions is a calculus for bounding the sensitivities of general explanation methods, which we instantiate on a broad array of existing explanation methods; our bounds for the many recently proposed gradient-based explanations underscore their sensitivity theoretically, corroborating empirical observations in recent papers. We then propose two approaches to improve the sensitivity of explanations, one that modifies the explanation and one that modifies the model. Finally, we validate in our experiments that by lowering the sensitivity of explanations, we achieve more faithful explanations.

## Appendix A Appendix

###### Definition A.1.

We say a function is -locally constant around , if for all such that , satisfies

This notion of local constancy naturally leads to the following bound on the sensitivity of explanations:

###### Proposition A.1.

Suppose the explanation is -locally constant around with respect to metric and is a continuous function in . Then .

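###### Proof of Proposition A.1.

$$
\begin{aligned}
S_{\mathrm{AVG}}(\Phi, f, x, r) &= \int_{y:\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x))\, P_x(y)\, dy \\
&\le \max_{\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x)) \int_{y:\rho(y,x)\le r} P_x(y)\, dy \\
&= S_{\mathrm{MAX}}(\Phi, f, x, r) = \max_{\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x)) \le d.
\end{aligned}
$$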
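The ordering of the average and max sensitivities used in this proof can be checked numerically. The following is a minimal sketch, not the paper's implementation: the toy model `f(x) = sum(sin(x))`, the choice of the L2 norm for both $\rho$ and $D$, the uniform sampling distribution over the ball, and all function names are assumptions made for illustration. The sampled maximum only lower-bounds the true max sensitivity.

```python
import numpy as np

def grad_explanation(x):
    # Gradient of the toy model f(x) = sum(sin(x)); an assumption for illustration.
    return np.cos(x)

def sample_ball(x, r, n, rng):
    # Draw n points uniformly from the L2 ball of radius r centered at x.
    d = x.shape[0]
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return x + radii * directions

def sensitivities(explain, x, r, n=2000, seed=0):
    # Monte Carlo estimates of max and average sensitivity, with D the L2 distance.
    rng = np.random.default_rng(seed)
    ys = sample_ball(x, r, n, rng)
    dists = np.linalg.norm(explain(ys) - explain(x), axis=1)
    return dists.max(), dists.mean()

x0 = np.array([0.3, -1.2, 2.0])
s_max, s_avg = sensitivities(grad_explanation, x0, r=0.1)
print(f"sampled S_MAX = {s_max:.4f}, sampled S_AVG = {s_avg:.4f}")
```

Because the cosine is 1-Lipschitz, the sampled max sensitivity here also stays below `r`, in line with the Lipschitz bound of Proposition 2.2.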

###### Proposition A.2.

For any constant , distance satisfying , and an explanation of a predictor , we have that:

$$
S(C\Phi, f, x, r) = C\, S(\Phi, f, x, r).
$$

###### Proof of Proposition A.2.
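
$$
\begin{aligned}
S_{\mathrm{MAX}}(C\Phi, f, x, r) &= \max_{\rho(y,x)\le r} D(C\Phi(f,y), C\Phi(f,x)) \\
&= C \max_{\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x)) = C\, S_{\mathrm{MAX}}(\Phi, f, x, r), \\
S_{\mathrm{AVG}}(C\Phi, f, x, r) &= \int_{y:\rho(y,x)\le r} D(C\Phi(f,y), C\Phi(f,x))\, P_x(y)\, dy \\
&= C \int_{y:\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x))\, P_x(y)\, dy = C\, S_{\mathrm{AVG}}(\Phi, f, x, r).
\end{aligned}
$$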
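The scaling property can be illustrated with a small numerical check. This is only a sketch under stated assumptions: `phi` is a hypothetical explanation, $D$ is taken to be the L2 distance (which is positively homogeneous), the constant `C` is positive, and the perturbed points are drawn from a small box around `x0`.

```python
import numpy as np

def max_avg_sensitivity(explain, x, ys):
    # Sampled max and average sensitivity over the perturbed points ys (D = L2 distance).
    dists = np.linalg.norm(explain(ys) - explain(x), axis=-1)
    return dists.max(), dists.mean()

rng = np.random.default_rng(0)
x0 = np.array([0.5, -0.7])
# Perturbed inputs inside an L-infinity ball of radius 0.05 around x0.
ys = x0 + rng.uniform(-0.05, 0.05, size=(1000, 2))

def phi(z):
    # Hypothetical explanation, chosen only for illustration.
    return np.tanh(z) * z

C = 3.0

def scaled_phi(z):
    # The explanation scaled by the constant C.
    return C * phi(z)

s_max, s_avg = max_avg_sensitivity(phi, x0, ys)
cs_max, cs_avg = max_avg_sensitivity(scaled_phi, x0, ys)
print(cs_max / s_max, cs_avg / s_avg)  # both ratios should be close to C
```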

###### Corollary A.1.

Suppose we apply the Integrated Gradients modification of an explanation for the model , with baseline set to , and which we denote by IG, and suppose the distance metric is a Minkowski distance. Then its sensitivity can be bounded as:

 S(IG,f,x,r)d(x)≤∫z(1+d(r)d(z))S(∇,f,z,r)ku(x,z)dzd(r)d(Φ(f,x))d(x),

where is the density of a uniform distribution over points on the line from the baseline 0 point and .

### A.2 Proof of Proposition 2.1

###### Proof of Proposition 2.1.

If , then there such that , and . Therefore,

$$
D(f(x),f(y)) \le P_{\mathrm{MAX}}(f,x,r) \le D(\Phi(f,x),\Phi(f,y))/R.
$$

If , and suppose for all y satisfying

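$$
\begin{aligned}
S_{\mathrm{AVG}}(\Phi, f, x, r) &= \int_{y:\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x))\, P_x(y)\, dy \\
&< \int_{y:\rho(y,x)\le r} R \cdot D(f(x),f(y))\, P_x(y)\, dy \\
&= R \cdot \int_{y:\rho(y,x)\le r} D(f(x),f(y))\, P_x(y)\, dy = R \cdot P_{\mathrm{AVG}}(f,x,r).
\end{aligned}
$$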

This contradicts the premise that , therefore, by proof by contradiction, there such that , and

### A.3 Proof of Proposition 2.2

###### Proof of Proposition 2.2.

$$
S_{\mathrm{AVG}}(\Phi, f, x, r) \le S_{\mathrm{MAX}}(\Phi, f, x, r) = \max_{\rho(y,x)\le r} D(\Phi(f,y),\Phi(f,x)) \le Lr.
$$

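### A.4 Proof of Proposition 2.3

###### Proof of Proposition 2.3.

$$
\begin{aligned}
S_{\mathrm{MAX}}(x \odot \Phi, f, x, r)
&= \max_{\rho(y,x)\le r} D\big(y \odot \Phi(f,y),\, x \odot \Phi(f,x)\big) \\
&\le \max_{\rho(y,x)\le r} D\big(x \odot \Phi(f,y),\, x \odot \Phi(f,x)\big) + D\big(x \odot \Phi(f,y),\, x \odot \Phi(f,y) + (y-x) \odot \Phi(f,x)\big) \\
&\qquad + D\big(y \odot \Phi(f,y),\, x \odot \Phi(f,y) + (y-x) \odot \Phi(f,x)\big) \\
&\le \max_{\rho(y,x)\le r} D\big(x \odot [\Phi(f,y) - \Phi(f,x)]\big) + D\big((y-x) \odot \Phi(f,x)\big) + D\big((y-x) \odot (\Phi(f,y) - \Phi(f,x))\big) \\
&\le \max_{\rho(y,x)\le r} D(x)\, D\big(\Phi(f,y) - \Phi(f,x)\big) + D(y-x)\, D\big(\Phi(f,x)\big) + D(y-x)\, D\big(\Phi(f,y) - \Phi(f,x)\big) \\
&\le D(x)\, S_{\mathrm{MAX}}(\Phi, f, x, r) + D(r)\, D\big(\Phi(f,x)\big) + D(r)\, S_{\mathrm{MAX}}(\Phi, f, x, r).
\end{aligned}
$$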
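The final bound of this proof can be sanity-checked on a toy example. The sketch below makes several assumptions that are not in the paper: the model `f(z) = sum(z**3) / 3` (so its gradient is `z**2`), the L2 norm for both $\rho$ and $D$, and the reading of $D(r)$ as the radius `r` itself. All three terms are evaluated on the same sampled points, so the inequality holds for the samples by the same chain of steps as above.

```python
import numpy as np

def grad_f(z):
    # Gradient of the toy model f(z) = sum(z**3) / 3; an assumption for illustration.
    return z ** 2

rng = np.random.default_rng(1)
x0 = np.array([1.0, -2.0, 0.5])
r = 0.2

# Perturbed inputs inside the L2 ball of radius r around x0
# (not uniformly distributed, but every point satisfies ||y - x0|| <= r).
directions = rng.normal(size=(5000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
ys = x0 + r * rng.uniform(size=(5000, 1)) * directions

norm = np.linalg.norm
# Sampled max sensitivity of the plain gradient explanation.
s_max_grad = norm(grad_f(ys) - grad_f(x0), axis=1).max()
# Sampled max sensitivity of the input-times-gradient explanation.
s_max_prod = norm(ys * grad_f(ys) - x0 * grad_f(x0), axis=1).max()
# Right-hand side of the bound, with D(r) read as r.
bound = norm(x0) * s_max_grad + r * norm(grad_f(x0)) + r * s_max_grad
print(f"S_MAX(x * grad) = {s_max_prod:.4f} <= bound = {bound:.4f}")
```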

### A.6 Proof of Theorem 3.1

We consider the logistic loss, a convex surrogate of the 0-1 loss, which is defined as

$$
\ell(f(x), y) = -\log \frac{e^{y f(x)}}{1 + e^{f(x)}}.
$$

We now show that minimizing adversarial risk results in classifiers with smooth gradients. First note that $f(x+\delta)$ can be written as

$$
f(x+\delta) = f(x) + \int_{t=0}^{1} \nabla f(x + t\delta)^{T} \delta \, dt.
$$

We also have

$$
\nabla f(x + t\delta) = \nabla f(x) + \int_{s=0}^{t} \nabla^2 f(x + s\delta)\, \delta \, ds.
$$

Substituting this in the previous expression gives us

$$
f(x+\delta) = f(x) + \nabla f(x)^{T} \delta + \int_{t=0}^{1} \int_{s=0}^{t} \delta^{T} \nabla^2 f(x + s\delta)\, \delta \, ds \, dt.
$$

This can be upper bounded as follows

$$
f(x+\delta) \le f(x) + \epsilon \|\nabla f(x)\|_{*} + \frac{\epsilon^2}{2} \sup_{\|\delta\| \le \epsilon} \|\nabla^2 f(x+\delta)\|,
$$

where $\|\cdot\|_{*}$ is the dual norm of $\|\cdot\|$.
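As a sanity check on this second-order bound, consider a quadratic function, for which the Hessian is constant and the double integral above reduces to $\frac{1}{2}\delta^{T} A \delta$. The sketch below is illustrative only: the quadratic `f`, the choice of the L2 norm (which is its own dual), and the use of the spectral norm for the Hessian are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))
A = (M + M.T) / 2              # symmetric, constant Hessian of the toy quadratic
b = rng.normal(size=n)

def f(x):
    # Toy quadratic model; an assumption for illustration.
    return 0.5 * x @ A @ x + b @ x

def grad_f(x):
    return A @ x + b

x0 = rng.normal(size=n)
eps = 0.3
hess_norm = np.linalg.norm(A, 2)   # spectral norm of the Hessian
upper = f(x0) + eps * np.linalg.norm(grad_f(x0)) + 0.5 * eps ** 2 * hess_norm

# Evaluate f on perturbations of norm exactly eps and keep the largest value.
deltas = rng.normal(size=(2000, n))
deltas *= eps / np.linalg.norm(deltas, axis=1, keepdims=True)
worst = max(f(x0 + delta) for delta in deltas)
print(f"max sampled f(x + delta) = {worst:.4f} <= bound = {upper:.4f}")
```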

Let $u(x)$ be defined as

$$
u(x) = \epsilon \|\nabla f(x)\|_{*} + \frac{\epsilon^2}{2} \sup_{\|\delta\| \le \epsilon} \|\nabla^2 f(x+\delta)\|.
$$

Some algebra shows that $\ell(f(x+\delta), y)$ can be upper bounded by

$$
\ell(f(x+\delta), y) \le \ell\big(f(x) + (1-2y)\,u(x),\, y\big) \le \ell(f(x), y) + u(x).
$$
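The second inequality in this chain follows from the logistic loss being 1-Lipschitz in its first argument, and it can be checked numerically. The sketch below assumes labels $y \in \{0, 1\}$ (suggested by the factor $(1-2y)$) and draws the score `z = f(x)` and the slack `u >= 0` at random; the variable names are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    # -log(exp(y * z) / (1 + exp(z))) for labels y in {0, 1}, in a numerically stable form.
    return np.logaddexp(0.0, z) - y * z

rng = np.random.default_rng(0)
z = rng.normal(scale=3.0, size=1000)      # stands in for f(x)
u = rng.uniform(0.0, 2.0, size=1000)      # stands in for u(x) >= 0

gaps = []
for y in (0, 1):
    shifted = logistic_loss(z + (1 - 2 * y) * u, y)
    # Gap between the right-hand side and the middle term; should be non-negative.
    gaps.append(logistic_loss(z, y) + u - shifted)

min_gap = min(g.min() for g in gaps)
print(f"smallest gap: {min_gap:.3e}")
```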

So we have the following upper bound for our objective

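$$
\mathbb{E}\Big[\sup_{\delta:\|\delta\|\le\epsilon} \ell(f(x+\delta), y)\Big] \le \mathbb{E}\big[\ell(f(x), y)\big] + \underbrace{\epsilon\, \mathbb{E}\big[\|\nabla f(x)\|_{*}\big] + \frac{\epsilon^2}{2}\, \mathbb{E}\Big[\sup_{\|\delta\|\le\epsilon} \|\nabla^2 f(x+\delta)\|\Big]}_{\text{Regularization Term}}. \tag{6}
$$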

## Appendix B Set-Based Explanations

While the main paper focused on quantitative explanations that provide real-valued weights corresponding to each input feature, another class of explanations simply outputs a set of relevant features. Given quantitative explanations, we can modify these to set-based explanations by simply providing the set of most salient features, for some small . Thus, for a quantitative explanation , we can provide the set-based modification : , if is among the top k features (either with respect to signed magnitude, or magnitude depending on the type of explanation), otherwise setting , and where is some normalizing constant. The benefit of using such set-based explanations is that by lowering the amount of and possibly less salient information, the explanation may be easier to interpret for a human. While some of the calculus we have developed above may not seem directly applicable to set-based explanations, we provide a simple proposition that upper-bounds the sensitivity of the set-based explanations given the original explanation.
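The set-based modification described above can be sketched as a top-k selection. This is an illustrative sketch rather than the authors' implementation: the function name `set_based`, the `use_magnitude` flag, and in particular the value `1 / k` assigned to the selected features (standing in for the normalizing constant) are assumptions.

```python
import numpy as np

def set_based(phi, k, use_magnitude=True):
    # Rank features by magnitude or by signed value, depending on the explanation type.
    scores = np.abs(phi) if use_magnitude else phi
    top = np.argsort(scores)[-k:]           # indices of the k most salient features
    out = np.zeros_like(phi, dtype=float)
    out[top] = 1.0 / k                      # assumed normalizing constant
    return out

phi = np.array([0.1, -0.9, 0.4, 0.05, 0.7])
phi_s = set_based(phi, k=2)
phi_s_signed = set_based(phi, k=2, use_magnitude=False)
print(phi_s)          # features 1 and 4 are kept when ranking by magnitude
print(phi_s_signed)   # features 2 and 4 are kept when ranking by signed value
```

Ranking by magnitude keeps the strongly negative feature, while ranking by signed value drops it, which is why the choice depends on the type of explanation.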

###### Proposition B.1.

Given an explanation functional and its set-based modification , where the top values in are set to , and the rest set to . Let , and let , and suppose the distance metrics used in specifying sensitivity are set to the distance. We then have .

###### Proof of Proposition B.1.

Consider any two explanations , and let be their set-based explanations respectively with top-k features set to and others set to 0. Let for the top-k set for . Here, D is defined as the L1 distance.

Let , which implies that the top k set of features for has exactly differences. Define set and as:

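$$
\begin{aligned}
K_1 &= \{\, i \mid \Phi^1_i \notin \text{top-}k \text{ set for } \Phi^1,\ \Phi^2_i \in \text{top-}k \text{ set for } \Phi^2 \,\}, \\
K_2 &= \{\, i \mid \Phi^2_i \notin \text{top-}k \text{ set for } \Phi^2,\ \Phi^1_i \in \text{top-}k \text{ set for } \Phi^1 \,\}.
\end{aligned}
$$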

We know that and and are disjoint by definition. We randomly fix an order for and , so that is the ith element in set and is the jth element in set . By definition of we have:

$$
\Phi^1_{K_1^i} - \Phi^1_{K_2^i} \ge \eta.
$$

Moreover, we have

$$
\Phi^2_{K_2^i} - \Phi^2_{K_1^i} \ge 0.
$$

Combining the result, we have:

$$
\Phi^1_{K_1^i} - \Phi^2_{K_1^i} + \Phi^2_{K_2^i} - \Phi^1_{K_2^i} \ge \eta.
$$

Therefore,

$$
D\big(\Phi^1_{K_1^i}, \Phi^2_{K_1^i}\big) + D\big(\Phi^2_{K_2^i}, \Phi^1_{K_2^i}\big) \ge \eta.
$$

This holds for all , therefore,

$$
D(\Phi^1, \Phi^2) \ge \eta g = D(\Phi^1_S, \Phi^2_S)\, \frac{k\eta}{2}.
$$

This leads to the result directly by plugging the above equation into Definition 2.1 and Definition 2.2. ∎
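The last equality in the proof relates the number of differing top-k features to the L1 distance between the set-based explanations. A small check of that counting step is sketched below, assuming that selected features receive the value `1 / k` (an assumption consistent with the factor $k/2$ in the final equation); the helper names and example vectors are hypothetical.

```python
import numpy as np

def top_k_set(phi, k):
    # Indices of the k largest entries of phi.
    return set(np.argsort(phi)[-k:].tolist())

def set_based(phi, k):
    # Set-based explanation with the assumed value 1/k on the top-k features.
    out = np.zeros(len(phi))
    out[list(top_k_set(phi, k))] = 1.0 / k
    return out

k = 3
phi1 = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.0])
phi2 = np.array([0.9, 0.1, 0.2, 0.8, 0.7, 0.0])

# Number of features in the top-k set of phi1 that are missing from that of phi2.
g = len(top_k_set(phi1, k) - top_k_set(phi2, k))
# L1 distance between the two set-based explanations.
l1 = np.abs(set_based(phi1, k) - set_based(phi2, k)).sum()
print(g, l1 * k / 2)  # the two quantities should agree
```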

While this bound is not necessarily tight, it provides insight into why the sensitivity of the set-based explanation may be much lower than that of the quantitative explanation. In particular, the sensitivity of set-based explanations does not account for changes in the values of the features that remain in the top-k set or its complement. In Figure 4, we show some examples of the set-based gradient saliency before and after applying set-based SmoothGrad in (3), and we observe that the smoothed saliency maps are less noisy and more focused on the object of interest.