Certifiably Robust Interpretation in Deep Learning

05/28/2019 ∙ by Alexander Levine, et al. ∙ University of Maryland

Although gradient-based saliency maps are popular methods for deep learning interpretation, they can be extremely vulnerable to adversarial attacks. This is worrisome especially due to the lack of practical defenses for protecting deep learning interpretations against attacks. In this paper, we address this problem and provide two defense methods for deep learning interpretation. First, we show that a sparsified version of the popular SmoothGrad method, which computes the average saliency maps over random perturbations of the input, is certifiably robust against adversarial perturbations. We obtain this result by extending recent bounds for certifiably robust smooth classifiers to the interpretation setting. Experiments on ImageNet samples validate our theory. Second, we introduce an adversarial training approach to further robustify deep learning interpretation by adding a regularization term to penalize the inconsistency of saliency maps between normal and crafted adversarial samples. Empirically, we observe that this approach not only improves the robustness of deep learning interpretation to adversarial attacks, but it also improves the quality of the gradient-based saliency maps.




1 Introduction

Figure 1: (a) An illustration of the sensitivity of gradient-based saliency maps to an adversarial perturbation of an image from CIFAR-10. Sparsified SmoothGrad, however, demonstrates a significantly larger robustness compared to that of the gradient method. (b) A comparison of robustness certificate values of Sparsified SmoothGrad vs. scaled SmoothGrad, on ImageNet images.

The growing use of deep learning in many sensitive areas like autonomous driving, medicine, finance and even the legal system ([1, 2, 3, 4]) raises concerns about human trust in machine learning systems. Therefore, having interpretations for why certain predictions are made is critical for establishing trust between users and the machine learning system.

In the last couple of years, several approaches have been proposed for interpreting neural network outputs ([5, 6, 7, 8, 9]). Specifically, [5] computes the elementwise absolute value of the gradient of the largest class score with respect to the input. To define some notation, let $g(x)$ be this most basic form of the gradient-based saliency map, for an input image $x$. For simplicity, we also assume that elements of $g(x)$ have been linearly normalized to be between $0$ and $1$. $g(x)$ represents, to a first order linear approximation, the importance of each pixel in determining the class label (see Figure 1-a). Numerous variations of this method have been introduced in the last couple of years, which we review in the appendix.

A popular saliency map method which extends the basic gradient method is SmoothGrad [10], which takes the average gradient over random perturbations of the input. Formally, we define the smoothing function as:

$$\tilde{g}(x) := \mathbb{E}_{\epsilon}\left[ g(x + \epsilon) \right], \tag{1.1}$$

where $\epsilon$ has a normal distribution (i.e. $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$). We will discuss other smoothing functions in Section 3.1, while the empirical smoothing function, which computes the average over finitely many perturbations of the input, will be discussed in Section 3.3. We refer to the basic method described in the above equation as the scaled SmoothGrad. (Footnote 1: The original definition of SmoothGrad does not normalize the gradient elements or take their absolute values before averaging. We start with the definition of Equation 1.1 since our results are easier to explain for it than for the more general case, which we discuss in Section 3.)
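As an illustration of the smoothing in Equation 1.1, the following is a minimal sketch of a Monte Carlo estimate of scaled SmoothGrad. It is not the authors' implementation: `toy_grad` is a hypothetical stand-in for the gradient of a network's class score, and the names `scaled_abs` and `smoothgrad` are ours.

```python
import random


def scaled_abs(grad):
    """Elementwise absolute value, linearly scaled so the largest entry is 1."""
    mags = [abs(v) for v in grad]
    top = max(mags)
    return [m / top for m in mags] if top > 0 else mags


def smoothgrad(grad_fn, x, sigma, q, rng):
    """Average of scaled_abs(grad_fn(x + eps)) over q draws of eps ~ N(0, sigma^2 I)."""
    total = [0.0] * len(x)
    for _ in range(q):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = scaled_abs(grad_fn(noisy))
        total = [t + gi for t, gi in zip(total, g)]
    return [t / q for t in total]


def toy_grad(x, w=(3.0, 1.0, 0.2)):
    """Hypothetical gradient: that of f(x) = sum_i w_i * x_i^2 (not a real network)."""
    return [2.0 * wi * xi for wi, xi in zip(w, x)]


saliency = smoothgrad(toy_grad, [1.0, 1.0, 1.0], sigma=0.1, q=200,
                      rng=random.Random(0))
```

Because each sample is normalized before averaging, every entry of the estimate stays in the interval from 0 to 1, which is the property the later bounds rely on.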

Having a robust interpretation method is important since interpretation results are often used in downstream actions such as medical recommendations, object localization, program debugging and safety, etc. However, [11] has shown that several gradient-based interpretation methods are sensitive to adversarial examples, obtained by adding a small perturbation to the input image. These adversarial examples maintain the original class label while greatly distorting the saliency map (Figure 1-a).

Although adversarial attacks and defenses on image classification have been studied extensively in recent years (e.g. [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), to the best of our knowledge, there is no practical defense for deep learning interpretation against adversarial examples [22]. This is partially due to the difficulty of protecting high-dimensional saliency maps compared to defending a class label, as well as to the lack of a ground truth for interpretation.

Since a ground truth for interpretation is not available, we use a similarity metric between the original and perturbed saliency maps as an estimate of the interpretation robustness. We define $R(x, \tilde{x}, K)$ as the number of overlapping elements between the top $K$ largest elements of the saliency maps of $x$ and its perturbed version $\tilde{x}$. For an input $x$, this measure depends on its specific perturbation $\tilde{x}$. We define $R^*(x, K)$ as the robustness measure with respect to the worst perturbation of $x$ within an $\ell_2$ distortion budget $\rho$. That is,

$$R^*(x, K) := \min_{\tilde{x}} R(x, \tilde{x}, K) \quad \text{s.t.} \quad \|\tilde{x} - x\|_2 \le \rho.$$

For deep learning models, this optimization is non-convex in general. Thus, characterizing the true robustness of interpretation methods will be a daunting task.
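The top-K overlap for one particular perturbation can be computed directly, as in the sketch below. The function names are ours, and the two saliency vectors are made-up numbers for illustration only. Note that this evaluates the overlap for a single given perturbation; the worst-case robustness measure would require minimizing over all admissible perturbations, which this sketch does not attempt.

```python
def top_k_indices(values, k):
    """Indices of the k largest entries (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def top_k_overlap(saliency, saliency_perturbed, k):
    """Number of indices shared by the top-k sets of the two saliency maps."""
    return len(top_k_indices(saliency, k) & top_k_indices(saliency_perturbed, k))


# Made-up saliency values for an input and a perturbed copy of it.
original = [0.9, 0.8, 0.1, 0.3, 0.7]
perturbed = [0.2, 0.85, 0.6, 0.1, 0.75]
overlap = top_k_overlap(original, perturbed, k=3)  # top-3 sets are {0, 1, 4} and {1, 4, 2}
```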

Figure 2: An illustration of the proposed adversarial training to robustify deep learning interpretation on MNIST. We observe that the proposed adversarial training not only enhances the robustness but it also improves the quality of the gradient-based saliency maps.

In our first main result of this paper, we show that a lower bound on the true robustness value of an interpretation method (i.e. a robustness certificate) can be computed efficiently. In other words, for a given input , we compute a robustness certificate such that . To establish the robustness certificate for saliency map methods, we first prove the following result for a general function whose range is between 0 and 1:

Theorem 1.

Let $g(x)$ be the output of an interpretation method whose range is between 0 and 1 and let $\tilde{g}(x)$ be its smoothed version defined as in Equation 1.1. Let $\tilde{g}_i(x)$ and $\tilde{g}_{[i]}(x)$ be the $i$-th element and the $i$-th largest element of $\tilde{g}(x)$, respectively. Let $\Phi$ be the cdf of the normal distribution. If


then for the smoothed interpretation method, we have .

Intuitively, this means that, if there is a sufficiently large gap between the -th largest element of the smoothed saliency map and its -th largest element, then we can certify that at least elements in the top largest elements of the original smoothed saliency map will also be in the top elements of adversarially perturbed saliency map. We present a more general version of this result with empirical expectations for smoothing as well as another rank-based robustness certificate in Section 3. The proof of this bound relies on an extension of the results of [23] which addresses certified robustness in the classification case. Proofs for all theorems are given in the Appendix.

Evaluating the robustness certificate for the scaled SmoothGrad method on ImageNet samples produced vacuous bounds (Figure 1-b). This motivated us to develop variations of SmoothGrad with larger robustness certificates. One such variation is Sparsified SmoothGrad which is defined by smoothing a sparsification function that maps the largest elements of to one and the rest to zero. Sparsified SmoothGrad obtains a considerably large value of the robustness certificate (Figure 1-b) while producing high-quality saliency maps. We study other variations of Sparsified SmoothGrad in Section 3.

Our second main result in this paper is to develop an adversarial training approach to further robustify deep learning interpretation methods. Adversarial training is a common technique used to improve the robustness of classification models, by generating adversarial examples to the classification model during training, and then re-training the model to correctly classify these examples [21].

To the best of our knowledge, adversarial training has not yet been adapted to the interpretation domain. In this paper, we develop an adversarial training approach for the interpretation problem in two steps: First, we develop an adversarial attack on the interpretation as the extension of the attack introduced in [11]. We use the developed attack to craft adversarial examples to saliency maps during training. Second, we re-train the network by adding a regularization term to the training loss that penalizes the inconsistency of saliency maps between normal and crafted adversarial samples.

Empirically, we observe that our proposed adversarial training for interpretation significantly improves the robustness of saliency maps to adversarial attacks. Interestingly, we also observe that our proposed adversarial training improves the quality of the gradient-based saliency maps as well (Figure 2). We note that this observation is related to the observation made in [24] showing that adversarial training for classification improves the quality of the gradient-based saliency maps.

2 Preliminaries and Notation

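We introduce the following notations to indicate Gaussian smoothing: for a function $h$, we define population and empirical smoothed functions, respectively, as:

$$\tilde{h}(x) := \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[ h(x + \epsilon) \right], \qquad \hat{h}(x) := \frac{1}{q} \sum_{i=1}^{q} h(x + \epsilon_i), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I).$$

In other words, $\tilde{h}(x)$ represents the expected value of $h(x)$ when smoothed under normal perturbations of $x$ with some standard deviation $\sigma$, while $\hat{h}(x)$ represents an empirical estimate of $\tilde{h}(x)$ using $q$ samples. We call $\sigma^2$ the smoothing variance and $q$ the number of smoothing perturbations.

We use $x_i$ to denote the $i$-th element of the vector $x$. Similarly, $h_i(x)$ denotes the $i$-th element of the output $h(x)$. We also define, for any $i$, $\mathrm{rank}(x, i)$ as the ordinal rank of $x_i$ in $x$ (in the descending order): $\mathrm{rank}(x, i) = j$ denotes that $x_i$ is the $j$-th largest element in $x$. We use $x_{[i]}$ to denote the $i$-th largest element in $x$. If $i$ is not an integer, the ceiling of $i$ is used. We use $n$ to denote the dimension of the input.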
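The rank notation can be made concrete with a short sketch. The helper names `rank` and `ith_largest` are ours; the only convention taken from the text is that ranks are counted in descending order and that a non-integer position is rounded up. Breaking ties by position is our assumption.

```python
import math


def rank(x, i):
    """1-based descending rank of x[i] within x (i is a 0-based list index)."""
    order = sorted(range(len(x)), key=lambda j: x[j], reverse=True)
    return order.index(i) + 1


def ith_largest(x, i):
    """The i-th largest element of x (1-based); a non-integer i is rounded up."""
    return sorted(x, reverse=True)[math.ceil(i) - 1]


scores = [0.2, 0.9, 0.5]
largest_rank = rank(scores, 1)          # scores[1] = 0.9 is the largest entry
second = ith_largest(scores, 1.3)       # position 1.3 is rounded up to 2
```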

3 Smoothing for Certifiable Robustness

3.1 Sparsified SmoothGrad

In this section, we will derive general bounds which allow us to certify the robustness for a large class of smoothed saliency map methods. These bounds are applicable to any saliency map method whose range is $[0, 1]$. Note that while SmoothGrad [10] is similar to such methods, it requires some modifications for our bounds to be directly applicable. [10] in particular defines two methods, which we will call SmoothGrad and Quadratic SmoothGrad. SmoothGrad takes the mean over samples of the signed gradient values, with absolute value typically taken after smoothing for visualization. Quadratic SmoothGrad takes the mean of the elementwise squares of gradient values. Both methods therefore require modification for our bounds to be applied: we define scaled SmoothGrad $\tilde{g}(x)$, such that $g(x)$ is the elementwise absolute value of the gradient, linearly scaled so that the largest element is one. We can similarly define a scaled Quadratic SmoothGrad.

We found that scaled SmoothGrad and Quadratic SmoothGrad give vacuous robustness certificate bounds, as demonstrated in Figure 1. Instead, we developed a new method, Sparsified SmoothGrad, which has (1) non-vacuous robustness certificates at ImageNet scale (Figure 4(a)), (2) visual output of similar high quality to SmoothGrad, and (3) theoretical guarantees that aid in setting its hyper-parameters (Section 3.5).

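The Sparsified SmoothGrad is defined as $\tilde{h}^{\mathrm{sparse}}(x)$, where $h^{\mathrm{sparse}}$ is defined as follows, with $g(x)$ the basic gradient saliency map and $n$ the input dimension:

$$h^{\mathrm{sparse}}_i(x) = \begin{cases} 1 & \text{if } \mathrm{rank}(g(x), i) \le \tau n, \\ 0 & \text{otherwise.} \end{cases}$$

In other words, $\tau$ controls the degree of sparsification: a fraction $\tau$ of elements (the largest $\tau n$ elements of $g(x)$) are assigned to $1$, and the rest are set to $0$.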
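A minimal sketch of this construction follows, assuming the sparsification keeps the top fraction of gradient magnitudes (rounded up) and that smoothing is estimated by sampling. `toy_grad` is a hypothetical stand-in for a network gradient, and the function names and the parameter name `tau` are ours.

```python
import math
import random


def sparsify(g, tau):
    """Map the ceil(tau * n) largest entries of g to 1 and all others to 0."""
    n = len(g)
    keep = math.ceil(tau * n)
    order = sorted(range(n), key=lambda i: g[i], reverse=True)
    out = [0.0] * n
    for i in order[:keep]:
        out[i] = 1.0
    return out


def sparsified_smoothgrad(grad_fn, x, sigma, q, tau, rng):
    """Average the sparsified gradient magnitudes over q Gaussian perturbations of x."""
    total = [0.0] * len(x)
    for _ in range(q):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        mags = [abs(v) for v in grad_fn(noisy)]
        total = [t + s for t, s in zip(total, sparsify(mags, tau))]
    return [t / q for t in total]


def toy_grad(x, w=(3.0, 1.0, 0.2, 2.0)):
    """Hypothetical gradient of f(x) = sum_i w_i * x_i^2 (not a real network)."""
    return [2.0 * wi * xi for wi, xi in zip(w, x)]


hard = sparsify([0.1, 0.9, 0.5, 0.3], tau=0.5)
smooth = sparsified_smoothgrad(toy_grad, [1.0] * 4, sigma=0.1, q=100, tau=0.5,
                               rng=random.Random(0))
```

Each entry of the smoothed map is then the fraction of perturbed samples in which that element landed in the kept set, so the entries sum to the number of kept elements.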

Figure 3: Comparison of Sparsified SmoothGrad (with the sparsification parameter ) with the SmoothGrad methods defined by [10]. All methods lead to high-quality saliency maps while our proposed Sparsified SmoothGrad is certifiably robust to adversarial examples as well. Additional examples have been presented in the appendix.

3.2 Robustness Certificate for the Population Case

In order to derive a robustness certificate for saliency maps, we present an extension of the classification robustness result of [23] to real-valued functions, rather than discrete classification functions. In our case, we will apply this to the saliency map vector . First, we define a floor function to simplify notation.

Definition 3.1.

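(Floor function) The Floor function is a function $\mathrm{floor}: [0, 1] \to [0, 1]$, such that

$$\mathrm{floor}(c) = \Phi\left( \Phi^{-1}(c) - \frac{2\rho}{\sigma} \right), \tag{3.1}$$

where $\rho$ denotes the $\ell_2$ norm of the adversarial distortion and $\sigma^2$ denotes the smoothing variance. $\Phi$ is the cdf function for the standard normal distribution and $\Phi^{-1}$ is its inverse.

Below is our main result used in characterizing robustness certificates for interpretation methods:

Theorem 1.

Let $h: \mathbb{R}^n \to [0, 1]^n$ be a real-valued function. Let $\mathrm{floor}$ be the floor function defined as in Equation 3.1 with parameters $\rho$ and $\sigma$. Using $\sigma^2$ as the smoothing variance for $h$, for all $x, \tilde{x}$ where $\|\tilde{x} - x\|_2 \le \rho$:

$$\mathrm{floor}\left( \tilde{h}_i(x) \right) \ge \tilde{h}_j(x) \implies \tilde{h}_i(\tilde{x}) \ge \tilde{h}_j(\tilde{x}).$$

Note that this theorem is valid for any general function. However, we will use it for our case where $\tilde{h}$ is a smoothed saliency map. Theorem 1 states that, for a given saliency map vector $\tilde{h}(x)$, if $\mathrm{floor}(\tilde{h}_i(x)) \ge \tilde{h}_j(x)$, then if $x$ is perturbed inside an $\ell_2$ norm ball of radius at most $\rho$, $\tilde{h}_i(\tilde{x}) \ge \tilde{h}_j(\tilde{x})$.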
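The pairwise check in Theorem 1 can be sketched numerically. This sketch assumes the floor function has the form $\Phi(\Phi^{-1}(c) - 2\rho/\sigma)$, with $\rho$ the $\ell_2$ distortion budget and $\sigma$ the smoothing standard deviation; that form is our reading of Definition 3.1, and the function names are ours. It applies to population smoothed values only.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, provides cdf and inv_cdf


def floor_fn(c, rho, sigma):
    """Assumed form of the floor function: Phi(Phi^{-1}(c) - 2 * rho / sigma)."""
    if c <= 0.0:
        return 0.0  # Phi^{-1}(0) is -infinity, so the floor stays at 0
    if c >= 1.0:
        return 1.0  # Phi^{-1}(1) is +infinity, so the floor stays at 1
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(c) - 2.0 * rho / sigma)


def order_is_certified(s_i, s_j, rho, sigma):
    """True if element i provably stays at least as large as element j
    under any perturbation of l2 norm at most rho (population case)."""
    return floor_fn(s_i, rho, sigma) >= s_j


wide_gap = order_is_certified(0.9, 0.1, rho=0.1, sigma=0.5)
narrow_gap = order_is_certified(0.55, 0.5, rho=0.1, sigma=0.5)
```

With these illustrative values, the wide gap between 0.9 and 0.1 passes the check while the narrow gap between 0.55 and 0.5 does not, which matches the intuition that only well-separated saliency values have a certifiably stable ordering.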

This result extends Theorem 1 in [23] in two ways: first, it provides a guarantee about the difference in the values of two quantities, which in general might not be related, while the original result compared probabilities of two mutually exclusive events. Second, we are considering a real-valued function, rather than a classification output which can only take discrete values. This bound can be compared directly to [25]’s result which similarly concerns unrelated elements in a vector. Just as in the classification case (as noted by [23]), Theorem 1 gives a significantly tighter bound than that of [25] (see details in the appendix).

3.3 Robustness Certificate for the Empirical Case

Figure 4: Certified robustness bounds on ImageNet for different values of the sparsification parameter , for (a) Sparsified SmoothGrad and (b) Relaxed Sparsified SmoothGrad. The lines shown are for the 60th percentile guarantee, meaning that 60 percent of images had guarantees at least as tight as those shown. For both examples, , and (in units where pixel intensity varies from to .)

In this section, we extend our robustness certificate result of Theorem 1 to the case where we use empirical estimates of smoothed functions. Following [25], we derive upper and lower bounds of the expected value function in terms of , by applying Hoeffding’s Lemma. To present our result for the empirical case, we first define an empirical floor function to derive a similar lower bound when the population mean is estimated using a finite number of samples:

Definition 3.2.

(Empirical Floor function) The Empirical Floor function is a function , such that for given values of , where denotes the maximum distortion, denotes the smoothing variance, denotes the probability bound, denotes the number of perturbations, and is the size of input of the function:

Corollary 1.

Let be a function such that for given values of , , with probability at least ,


Note that unlike the population case, this certificate bound is probabilistic. Another consequence of Theorem 1 is that it allows us to derive certificates for the top- overlap (denoted by ). In particular:

Corollary 2.

, define as the largest such that . Then, with probability at least ,


Intuitively, if there is a sufficiently large gap between the and largest elements of empirical smoothed saliency maps, then we can certify that the overlap between top elements of original and perturbed population smoothed saliency maps is at least with probability at least .

Note that we can apply Corollary 2 directly to SmoothGrad (or Quadratic SmoothGrad), simply by scaling the components of (or ) to lie in the interval $[0, 1]$. However, we observe that this gives vacuous bounds for both of them when using the suggested hyperparameters from [10]. One issue is that the suggested value for (number of perturbations) is which is too small to give useful bounds in Corollary 1. For a standard size image from the ImageNet dataset , with , this gives (using Definition 3.2). Note that even for a small :

Thus the gap between and is at least . We can see from Corollaries 1 and 2 that a gap of (on a scale of 1) is far too large to be of any practical use. We instead take , which gives a more manageable estimation error of . However, we found that even with this adjustment, the bounds computed using Corollary 2 are not satisfactory for either scaled SmoothGrad or scaled Quadratic SmoothGrad (see details in the appendix). This prompted the development of Sparsified SmoothGrad described in Section 3.1.
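The dependence of the estimation error on the number of perturbations can be sketched as follows. This assumes the error term comes from Hoeffding's inequality for averages of values in $[0, 1]$, combined with a union bound over all $n$ coordinates, so that the error $c$ solves $2n \exp(-2qc^2) = \alpha$; this is our reading of the Appendix proof of Corollary 1. The specific values of $n$, $q$ and $\alpha$ below are illustrative assumptions, not figures taken from the text.

```python
import math


def estimation_error(n, q, alpha):
    """Error c solving 2 * n * exp(-2 * q * c**2) = alpha (Hoeffding plus union bound)."""
    return math.sqrt(math.log(2.0 * n / alpha) / (2.0 * q))


n_pixels = 3 * 224 * 224   # assumed input dimension for an ImageNet-sized image
alpha = 0.05               # assumed failure probability

c_few = estimation_error(n_pixels, q=50, alpha=alpha)      # few perturbations
c_many = estimation_error(n_pixels, q=8192, alpha=alpha)   # many perturbations
```

Since the error shrinks only as the inverse square root of $q$, a small number of perturbations leaves an error that is a large fraction of the unit range, while increasing $q$ by a factor of roughly 160 reduces it by a factor of roughly 13.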

3.4 Relaxed Sparsified SmoothGrad

For some applications, it may be desirable to have at least some differentiable elements in the computed saliency map. For this purpose, we also propose Relaxed Sparsified SmoothGrad:


Here, controls the degree of sparsification and controls the degree of clipping: a fraction of elements are clipped to 1. Elements neither clipped nor sparsified are linearly scaled between $0$ and $1$. Note that Relaxed Sparsified SmoothGrad is a generalization of Sparsified SmoothGrad. With no clipping (), we again achieve nearly-vacuous results. However, with only a small degree of clipping (), we achieve results very similar to (although slightly worse than) those of Sparsified SmoothGrad; see Figure 4(b). We use Relaxed Sparsified SmoothGrad in this paper to test the performance of first-order adversarial attacks against Sparsified SmoothGrad-like techniques.
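One way to realize such a relaxed sparsification is sketched below. The exact scaling rule is not recoverable from the text here, so this is an assumption: the middle elements are scaled linearly in gradient magnitude between the largest zeroed value and the smallest clipped value. The parameter names `tau` (kept fraction) and `gamma` (clipped fraction) and the function name are ours.

```python
import math


def relaxed_sparsify(g, tau, gamma):
    """Clip the top ceil(gamma*n) entries to 1, zero all but the top ceil(tau*n),
    and scale the remaining kept entries linearly in between (assumed rule)."""
    n = len(g)
    n_keep = math.ceil(tau * n)
    n_clip = math.ceil(gamma * n)
    order = sorted(range(n), key=lambda i: g[i], reverse=True)
    out = [0.0] * n
    if n_keep == 0:
        return out
    upper = g[order[n_clip - 1]] if n_clip > 0 else g[order[0]]  # smallest clipped value
    lower = g[order[n_keep]] if n_keep < n else 0.0              # largest zeroed value
    for position, i in enumerate(order[:n_keep]):
        if position < n_clip or upper <= lower:
            out[i] = 1.0
        else:
            out[i] = (g[i] - lower) / (upper - lower)
    return out


mags = [0.8, 0.1, 0.6, 0.3, 1.0, 0.2, 0.4, 0.5]
relaxed = relaxed_sparsify(mags, tau=0.5, gamma=0.25)
hard = relaxed_sparsify(mags, tau=0.5, gamma=0.5)
```

Setting the clipped fraction equal to the kept fraction recovers the hard 0/1 sparsification, consistent with the statement that the relaxed method generalizes Sparsified SmoothGrad.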

3.5 Robustness Certificate based on Median Saliency Ranks

In this section, we show that if the median rank of a saliency map element over smoothing perturbations is sufficiently small (i.e. near the top rank), then for an adversarially perturbed input, that element will certifiably remain near the top rank of the proposed Sparsified SmoothGrad method with high probability. This provides another theoretical reason for the robustness of the Sparsified SmoothGrad method.

To present this result, we first define the certified rank of an element in the saliency map as follows:

Definition 3.3 (Certified Rank).

For a given input and a given saliency map method (denoted by ), let the maximum adversarial distortion be , i.e. . Then, for a probability , the certified rank for an element at index (denoted by ) is defined as the minimum such that the condition:


If the -th element of the saliency map has a certified rank of , using Corollary 1, we will have:

That is, the element of the population smoothed saliency map is guaranteed to be as large as the smallest elements of the smoothed saliency map of any adversarially perturbed input.

Note that certified rank depends on the particular perturbations used to generate the smoothed saliency map . In the following result, we show that if the median rank of a gradient element at index , over a set of randomly generated perturbations, is less than a specified threshold value, then the certified rank of that element in the Sparsified SmoothGrad saliency map generated using those perturbations can be upper bounded.

Theorem 2.

Let be the set of random perturbations for a given input using the smoothing variance . Using the Sparsified SmoothGrad method, for probability , we have


where is the sparsification parameter of the Sparsified SmoothGrad method.

For instance, if and for sufficiently large number of smoothing perturbations (i.e. ), we have . If we set , then for indices whose median ranks are less than or equal to , their certified ranks will be less than or equal to . That is, even after adversarially perturbing the input, they will certifiably remain among the top elements of the Sparsified SmoothGrad saliency map.

We present a more general form of this result in the appendix.

3.6 Experimental Results

To test the empirical robustness of Sparsified SmoothGrad, we used an attack on adapted from the attack defined by [11]; see the appendix for details of our proposed attack. We chose Relaxed Sparsified SmoothGrad to test, rather than Sparsified SmoothGrad, because we are using a gradient-based attack, and Sparsified SmoothGrad has no defined gradients. We tested on ResNet-18 with CIFAR-10, with the attacker using a separately-trained, fully differentiable version of ResNet-18, with SoftPlus activations in place of ReLU.

We present our empirical results in Figure 5. We observe that our method is significantly more robust than the SmoothGrad method, while its robustness is on par with the Quadratic SmoothGrad method with the same number of smoothing perturbations. We note that our robustness certificate appears to be loose for the large perturbation magnitudes used in these experiments.

Figure 5: Empirical robustness of variants of SmoothGrad to adversarial attack, tested on CIFAR-10 with ResNet-18. Attack magnitude is in units of standard deviations of pixel intensity. Robustness is measured as , where
Figure 6: Effectiveness of adversarial training on MNIST. Increasing the regularization parameter in the proposed adversarial training optimization (Equation 4.1) significantly increases the robustness of gradient-based saliency maps while it has little effect on the classification accuracy.


4 Adversarial Training for Robust Saliency Maps

Adversarial training has been used extensively for making neural networks robust against adversarial attacks on classification [21]. The key idea is to generate adversarial examples for a classification model, and then re-train the model on these adversarial examples.

In this section, we present, for the first time, an adversarial training approach for fortifying deep learning interpretations so that the saliency maps generated by the model (during test time) are robust against adversarial examples. We focus on “vanilla gradient” saliency maps, although the technique presented here can potentially be applied to any saliency map method which is differentiable w.r.t. the input. We solve the following optimization problem for the network weights (denoted by ):


where is an adversarial perturbation for the saliency map generated from . To generate , we developed an attack on saliency maps by extending the attack of [11] (see the details in the appendix). is the standard cross entropy loss, and is the regularization parameter to encourage consistency between saliency maps of the original and adversarially perturbed images.
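The structure of this objective, a classification loss plus a weighted saliency-consistency penalty, can be sketched as follows. The exact form of the penalty is not recoverable from the text here, so the squared $\ell_2$ distance between the two saliency maps is an assumption, as are all function and parameter names. The sketch only evaluates the loss for given saliency maps; actual training would also need to differentiate through the saliency computation, which is not shown.

```python
import math


def cross_entropy(logits, label):
    """Standard cross entropy of a softmax over logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def saliency_penalty(sal_clean, sal_adv):
    """Assumed inconsistency measure: squared l2 distance between saliency maps."""
    return sum((a - b) ** 2 for a, b in zip(sal_clean, sal_adv))


def regularized_loss(logits, label, sal_clean, sal_adv, lam):
    """Cross entropy plus lam times the saliency inconsistency penalty."""
    return cross_entropy(logits, label) + lam * saliency_penalty(sal_clean, sal_adv)


# Made-up values: logits for one sample and its clean / adversarial saliency maps.
logits = [2.0, 0.5, -1.0]
clean_map = [0.9, 0.1, 0.4]
adv_map = [0.2, 0.7, 0.4]
loss_plain = regularized_loss(logits, 0, clean_map, adv_map, lam=0.0)
loss_reg = regularized_loss(logits, 0, clean_map, adv_map, lam=1.0)
```

A larger regularization weight places more emphasis on keeping the saliency map unchanged under the crafted perturbation, at the potential cost of classification accuracy.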

We observe that the proposed adversarial training significantly improves the robustness of saliency maps. Aggregate empirical results are presented in Figure 6, and examples of saliency maps are presented in Figure 2. It is notable that the quality of the saliency maps is greatly improved for unperturbed inputs, by adversarial training. We observe that even for very large value of , only a slight reduction in classification accuracy occurs due to the added regularization term.

5 Conclusion

In this work, we studied the robustness of deep learning interpretation against adversarial attacks and proposed two defense methods. Our first method is a sparsified variant of the popular SmoothGrad method which computes the average saliency maps over random perturbations of the input. By establishing an easy-to-compute robustness certificate for the interpretation problem, we showed that the proposed Sparsified SmoothGrad is certifiably robust to adversarial attacks while producing high-quality saliency maps. We provided experiments on ImageNet samples validating our theory. Second, for the first time, we introduced an adversarial training approach to further fortify deep learning interpretation against adversarial attacks by penalizing the inconsistency of saliency maps between normal and crafted adversarial samples. The proposed adversarial training significantly improved the robustness of saliency maps without degrading the classification accuracy. We also observed that, somewhat surprisingly, adversarial training for interpretation enhances the quality of the gradient-based saliency maps in addition to their robustness.


Appendix A Proofs

Theorem 1.

Let be a bounded, real-valued function, be the smoothing variance for , then where such that :

where denotes the cdf function for the standard normal distribution and is its inverse.

We will prove this by first proving a more general lemma:

Lemma 1.

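For any bounded function $h: \mathbb{R}^n \to [0, 1]$ and smoothing variance $\sigma^2$, $\Phi^{-1}(\tilde{h}(x))$ is Lipschitz-continuous with respect to $x$ (in the $\ell_2$ norm), with Lipschitz constant $1/\sigma$.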


By the definition of Lipschitz continuity, we must show that ,


We first define a new, randomized function ,

Then :


Now, we apply the following Lemma (Lemma 4 from [23]):

Lemma (Cohen’s lemma).

Let and . Let be any deterministic or random function, Then:

  1. If for some and , then

  2. If for some and , then

Using the same technique as used in the proof of Theorem 1 in [23], we fix and define,

Also define the half-spaces:

Applying algebra from the proof of Theorem 1 in [23], we have,


Using equation A.3

Applying Statement 1 of Cohen’s lemma, using and :


Using equation A.4,

Applying Statement 2 of Cohen’s lemma, using and :


Using equation A.7 and equation A.8:

Then by equation A.5 and equation A.6:

Noting that is a monotonically increasing function, we have:

Using equation A.2 yields equation A.1, which completes the proof. ∎

We now proceed with the proof of Theorem 1.


Applying Lemma 1 to and gives (recalling that ):


Then we have:

which proves the implication. ∎

Corollary 1.

Let be a function such that for given values of :


, with probability at least ,


By Hoeffding’s Inequality, for any ,




Since we are free to choose c, we define such that , then:


Then with probability at least :




The result directly follows from Theorem 1. ∎

Corollary 2.

with probability at least ,


where is the largest such that .


Note that the proof of Corollary 1 guarantees that with probability at least , all estimates are within the approximation bound of . So we can assume that Corollary 1 will apply simultaneously to all pairs of indices , with probability .
We proceed to prove by contradiction.

Suppose there exists such that:

Since is a monotonically increasing function,

and therefore by Corollary 1:


Let be the set of indices in the top elements in , and be the set of indices in the top elements in .
By assumption, and share fewer than elements, so there will be at least elements in which are not in .
All of these elements have rank at least in .
Thus by pigeonhole principle, there is some index , such that .
Thus by Equation A.18,


Hence, there are such elements where : these elements are clearly in .
Because , Equation A.19 implies that these elements are all also in . Thus and share at least elements, which contradicts the premise.
(In this proof we have implicitly assumed that the top elements of a vector can contain more than elements, if ties occur, but that is assigned arbitrarily in cases of ties. In practice, ties in smoothed scores will be very unlikely.) ∎

A.1 General Form and Proof of Theorem 2

We note that Theorem 2 can be used to derive a more general bound for any saliency map method that for an input , first maps to an elementwise function that only depends on the rank of the current element in and not on the individual value of the element. We denote the composition of the gradient function and this elementwise function as . The only properties that the function must satisfy is that it must be monotonically decreasing and non-negative. Thus, we have the following statement:

Theorem 2.

Let be the threshold value and let be the set of random perturbations for a given input using the smoothing variance and let be the probability bound. If is an element index such that:






Let the elementwise function be , i.e takes the rank of the element as the input and outputs a real number. Furthermore, we assume that is a non-negative monotonically decreasing function. Thus .
We use to denote the constant value that maps elements of rank to.
Note that is the largest element of .
Since is a monotonically decreasing function:

Thus is independent of , we simply use to denote , i.e:

Because , for at least half of sampling instances in , .
So in these instances ,
The remaining half or fewer elements are mapped to other nonnegative values.
Thus the sample mean:

Using Corollary 1, is certifiably as large as all elements with indices j such that:

Now we will find an upper bound on the number of elements with indices j such that:

Because all the ranks from to will occur in every sample in U, we have:

Thus strictly fewer than