
Boundary Attributions Provide Normal (Vector) Explanations

Recent work on explaining Deep Neural Networks (DNNs) focuses on attributing the model's output scores to input features. However, for classification problems a more fundamental question is how much each feature contributes to the model's decision to classify an input instance into a specific class. Our first contribution is Boundary Attribution (BA), a new explanation method that addresses this question. BA leverages an understanding of the geometry of activation regions: it computes (and aggregates) the normal vectors of the local decision boundaries around the target input. Our second contribution is a set of analytical results connecting the adversarial robustness of a network to the quality of its gradient-based explanations. Specifically, we prove two theorems for ReLU networks: on randomized-smoothed or robustly trained networks, BA is much closer to non-boundary attribution methods than it is on standard networks. These results encourage users to improve model robustness in order to obtain high-quality explanations. Finally, we evaluate the proposed methods on ImageNet and show that BAs produce more concentrated and sharper visualizations than non-boundary ones. We further demonstrate that our method also reduces the sensitivity of attributions to the baseline input when one is required.





1 Introduction

Existing approaches [4, 11, 16, 26, 30, 36, 37, 38, 40, 41, 44] to explaining Deep Neural Networks (DNNs) are motivated by the question (Q1): how can the model's output score be faithfully attributed to input features? These tools are appropriate for tasks like regression and generation. However, in a classification task, besides the output score of the model, we are also interested in another question (Q2): how much does each feature contribute to the model's decision to classify an input instance into a specific class?

Figure 1: Left: the classification result of a toy network with one hidden layer for two-dimensional input. Decision boundaries are edges of the black and white regions. Right: a zoom-in view of one instance.
Figure 2: Visualizations of Integrated Gradient and the proposed improvement of it, Boundary-based Integrated Gradient, which is sharper, more concentrated and less noisy.

Efforts to answer Q1 lead to explanation tools that capture the model's output change within a local perturbation set of input features [52]. In other words, they explain the model by exploring the local geometry of the model's output in the input space around the point of interest. An answer to Q2, on the other hand, needs to focus on the important features that the model uses to separate the evaluated input from other classes, not just the output score of the current class. An example of a toy binary classifier is shown in Fig. 1. Whereas answers to Q1 focus on the target and its surroundings (pointed to by the black arrow), they do not directly answer Q2: how is this input placed in the white half-space?

In this paper, we demonstrate that an answer to Q2 is related to the decision boundaries, and their normal vectors, that the classifier learns in the input space (pointed to by the white arrow in Fig. 1). These normal vectors correspond to the importance of the features that the model uses to create different regions for each class. Leveraging decision boundaries in explaining the classifier not only returns sharp and concentrated explanations that one may otherwise only observe in a robust network [10, 15, 20] (see Fig. 2), but also provides formal connections between gradient-based explanations and the adversarial robustness of DNNs. We summarize our contributions as follows:

  • We introduce boundary attributions, a new approach to explaining both linear and non-linear classifiers, e.g., DNNs, that leverages an understanding of activation regions and decision boundaries.

  • We provide a set of analytical results relating boundary attributions to the adversarial robustness of DNNs in Theorems 1 and 2. An implication of our analysis is that the expense of improving model robustness pays off in the efficiency of using non-boundary attributions to approximate boundary attributions when explaining DNNs.

  • We empirically demonstrate that boundary-based Integrated Gradient (BIG) produces more accurate explanations, in the sense that its overlap with object bounding boxes is substantially higher than that of existing methods.

  • Our empirical results further show that BIG mitigates the unnecessary sensitivity to the baseline input in Integrated Gradient.

The rest of our paper is organized as follows. We introduce notation and preliminaries about attribution methods in Sec. 2. We analyze how to explain linear and non-linear classifiers with explanation tuples and boundary attributions in Sec. 3. Empirical evaluations of the proposed methods against baseline attribution methods are included in Sec. 4. We discuss the sensitivity of attribution methods to baseline images in Sec. 5 and related work in Sec. 6, and finally conclude in Sec. 7.

2 Background

We begin by introducing notation in Sec. 2.1, and in Sec. 2.2 the prior work on feature attributions that we build on later in the paper.

2.1 Notation

Throughout the paper we use italicized symbols to denote scalar quantities and bold-face to denote vectors. We consider neural networks $f$ with ReLU activations prior to the top layer, and a softmax activation at the top. The predicted label for a given input $\mathbf{x}$ is given by $F(\mathbf{x}) = \arg\max_c f_c(\mathbf{x})$, where $F(\mathbf{x})$ is the predicted label and $f_c(\mathbf{x})$ is the output on class $c$. As the softmax layer does not change the ranking of neurons in the top layer, we will assume that $f_c(\mathbf{x})$ denotes the pre-softmax score. Unless otherwise noted, we use $\|\mathbf{x}\|$ to denote the $\ell_2$ norm of $\mathbf{x}$, and write $B(\mathbf{x}, \epsilon)$ for the neighborhood centered at $\mathbf{x}$ with radius $\epsilon$.

2.2 Feature Attribution

Feature attribution methods are widely used to explain the predictions made by DNNs by assigning each input feature an importance score for the network's output. Conventionally, scores with greater magnitude indicate that the corresponding feature was more relevant to the predicted outcome. We denote the feature attribution for an input $\mathbf{x}$ under a model $f$ by $g_f(\mathbf{x})$; when $f$ is clear from the context, we simply write $g(\mathbf{x})$. While there is an extensive and growing literature on attribution methods, we focus closely on the popular gradient-based methods shown in Defs. 1-3.

Definition 1 (Saliency Map (SM) [38])

The Saliency Map is given by $\mathrm{SM}(\mathbf{x}) = \nabla_{\mathbf{x}} f_c(\mathbf{x})$, where $c = F(\mathbf{x})$.

Definition 2 (Integrated Gradient (IG) [44])

Given a baseline input $\mathbf{x}_b$, the Integrated Gradient is given by $\mathrm{IG}(\mathbf{x}; \mathbf{x}_b) = (\mathbf{x} - \mathbf{x}_b) \odot \int_0^1 \nabla f_c(\mathbf{x}_b + t(\mathbf{x} - \mathbf{x}_b))\, dt$.

Definition 3 (Smooth Gradient (SG) [40])

Given a zero-centered Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$ with standard deviation $\sigma$, the Smooth Gradient is given by $\mathrm{SG}(\mathbf{x}) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[\nabla f_c(\mathbf{x} + \epsilon)\right]$.

As we show in Section 3.2, these methods satisfy axioms that relate to the local linearity of ReLU networks, and in the case of randomized smoothing [9], their robustness to input perturbations. We further discuss these methods relative to others in Sec. 6.
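The three gradient-based definitions above can be sketched on a toy linear model, where the class-score gradient is available in closed form. This is a minimal NumPy sketch with illustrative names (not the paper's code); for a linear model the IG integrand is constant, so IG's completeness axiom (attributions sum to the score difference) is easy to check.

```python
import numpy as np

# Toy linear "network": class scores s(x) = W @ x + b, so the gradient of
# the class-c score with respect to x is simply W[c].
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 classes, 4 input features
b = rng.normal(size=3)

def scores(x):
    return W @ x + b

def grad(x, c):
    return W[c]               # analytic gradient for the linear model

def saliency_map(x, c):
    """Def. 1: SM(x) = gradient of the class-c score at x."""
    return grad(x, c)

def integrated_gradient(x, x_base, c, steps=50):
    """Def. 2: (x - x_base) * average gradient along the straight path."""
    ts = (np.arange(steps) + 0.5) / steps            # midpoint rule
    g = np.mean([grad(x_base + t * (x - x_base), c) for t in ts], axis=0)
    return (x - x_base) * g

def smooth_gradient(x, c, sigma=0.1, n=100):
    """Def. 3: expected gradient under zero-centered Gaussian noise."""
    noise = rng.normal(scale=sigma, size=(n, x.size))
    return np.mean([grad(x + e, c) for e in noise], axis=0)

x = rng.normal(size=4)
x_base = np.zeros(4)
c = int(np.argmax(scores(x)))

ig = integrated_gradient(x, x_base, c)
# Completeness axiom: IG attributions sum to the score difference.
print(np.allclose(ig.sum(), scores(x)[c] - scores(x_base)[c]))
```

For a linear model the three methods return the same vector up to scale, which foreshadows the discussion in Sec. 3.1.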

3 Boundary-Based Explanations

In this section, we introduce boundary-based explanation methods by first illustrating and motivating their use on simple linear models (Sec 3.1), and then showing how to generalize the approach to non-linear piecewise classifiers (Sec 3.2).

3.1 Linear Models

Figure 3: Different classifiers that partition the space into regions associated with apple or banana. (a) A linear classifier. (b) A deep network with ReLU activations. Solid lines correspond to decision boundaries while dashed lines correspond to facets of activation regions. (c) Saliency map of the target instance may be normal to the closest decision boundary (right) or normal to the prolongation of other local boundaries (left).

Consider a binary classification model $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$ that predicts the label $\mathrm{sign}(f(\mathbf{x}))$ (we assume no point makes $f(\mathbf{x}) = 0$). When viewed in the feature space, the set $\{\mathbf{x} : f(\mathbf{x}) = 0\}$ is a hyperplane $\mathcal{B}$ that separates the input space into two open half-spaces (see Fig. 3(a)). To assign attributions for predictions made by $f$, SM, SG, and the integral part of IG (see Sec. 2.2) all return a vector characterized by $\mathbf{w}$ [3], which is normal to the hyperplane $\mathcal{B}$. In other words, these methods all measure the importance of features by characterizing the model's decision boundary, and are equivalent up to the scale and position of $\mathbf{w}$.

However, note that these attributions fail to distinguish an input from its neighbors, as $\mathbf{w}$ is identical for every point, and in particular for points on either side of the boundary. For points on one side, $\mathbf{w}$ points towards a direction of increasing "confidence" in the model's prediction, measured in terms of distance from the boundary $\mathcal{B}$; for points on the other side it is the opposite. Furthermore, the attributions themselves are invariant to the model's prediction confidence, which is likely useful in constructing explanations. Therefore, Definition 4 proposes augmenting $\mathbf{w}$ with a measure of the distance between $\mathbf{x}$ and the decision boundary (easily calculated by projection as $|f(\mathbf{x})| / \|\mathbf{w}\|$), and standardizing the interpretation of the normal vector by ensuring that it always points towards the half-space containing $\mathbf{x}$.

Definition 4 (Explanation Tuple for Linear Models)

Given an input $\mathbf{x}$ and a decision boundary $\mathcal{B}$ parameterized by a linear model $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, the explanation of the classification is a tuple $(\mathbf{n}, d)$, where $\mathbf{n}$ is a unit vector normal to $\mathcal{B}$ pointing to the half-space containing $\mathbf{x}$, and $d = |f(\mathbf{x})| / \|\mathbf{w}\|$. We refer to $(\mathbf{n}, d)$ as the explanation tuple for $\mathbf{x}$.
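Def. 4 can be sketched directly for a linear model. The following is a minimal, illustrative implementation (names are ours, not the paper's): the unit normal is oriented toward the input's half-space, and the distance is obtained by projection.

```python
import numpy as np

# Sketch of Def. 4 for a linear binary classifier f(x) = w @ x + b.
def explanation_tuple(w, b, x):
    f = w @ x + b
    n = np.sign(f) * w / np.linalg.norm(w)   # unit normal pointing to x's half-space
    d = abs(f) / np.linalg.norm(w)           # distance from x to the hyperplane
    return n, d

w, b = np.array([3.0, 4.0]), -1.0
x = np.array([2.0, 1.0])
n, d = explanation_tuple(w, b, x)
# Moving d units against n lands exactly on the decision boundary f = 0.
boundary_point = x - d * n
print(np.isclose(w @ boundary_point + b, 0.0))
```

The final check makes the geometric reading of $(\mathbf{n}, d)$ concrete: $\mathbf{x} - d\,\mathbf{n}$ is the projection of $\mathbf{x}$ onto the hyperplane.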

3.2 Piecewise-Linear Models

In this section we extend Def. 4 to non-linear boundaries, and in particular the piecewise-linear decision surfaces corresponding to ReLU DNNs. We begin by reviewing the locally-linear geometry of these networks, and then show how Def. 4 can be extended in ways that mirror the SM and IG methods from Defs. 1 and 2.

Local Linearity. For any neuron $u$ in a network $f$, we say the status of the neuron is ON for an input $\mathbf{x}$ if its pre-activation $u(\mathbf{x}) > 0$, and OFF otherwise. We can associate an activation pattern denoting the status of each neuron with any point $\mathbf{x}$ in the feature space, and a half-space of the input space with each activation constraint. Thus, for any point $\mathbf{x}$, the intersection of the half-spaces corresponding to its activation pattern defines a polytope $P$ (see Fig. 3(b)), and within $P$ the network is a linear function such that $\forall \mathbf{x} \in P,\ f(\mathbf{x}) = W_P \mathbf{x} + \mathbf{b}_P$, where the parameters $W_P$ and $\mathbf{b}_P$ can be computed by back-propagation [17]. Each polytope facet (dashed lines in Fig. 3(b)) corresponds to a boundary that flips the status of one neuron. Like the activation constraints, decision boundaries are piecewise linear, because each decision boundary corresponds to a constraint $f_i(\mathbf{x}) = f_j(\mathbf{x})$ for two classes $i$ and $j$ [17, 21].
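The local linearity of ReLU networks can be verified numerically. The sketch below (a tiny illustrative network, not from the paper) builds the local linear map by masking the OFF neurons and confirms it agrees with the network inside the polytope:

```python
import numpy as np

# A tiny two-layer ReLU network. Within the activation polytope of x the
# network is exactly linear: f(x) = W_P @ x + b_P, where W_P and b_P are
# obtained by masking the OFF neurons.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

def local_linear_map(x):
    pattern = (W1 @ x + b1 > 0).astype(float)  # activation pattern at x
    D = np.diag(pattern)                       # zeroes out OFF neurons
    return W2 @ D @ W1, W2 @ D @ b1 + b2, pattern

x = rng.normal(size=3)
W_P, b_P, pattern = local_linear_map(x)
print(np.allclose(forward(x), W_P @ x + b_P))   # exact within the polytope

# A perturbation small enough to preserve the activation pattern stays in
# the same polytope, so the same linear map still applies.
u = rng.normal(size=3); u /= np.linalg.norm(u)
eps = np.min(np.abs(W1 @ x + b1)) / (10 * np.linalg.norm(W1))
x2 = x + eps * u
print(np.allclose(forward(x2), W_P @ x2 + b_P))
```

The perturbation radius is chosen below the smallest pre-activation margin, which guarantees the activation pattern cannot flip.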

Boundary-based Explanations. The main challenge in explaining a piecewise-linear classifier is that, for a given input, there are many possible decision boundaries that could be associated with an explanation tuple as in Def. 4. Without criteria for selecting a particular boundary, an explanation is not unique, and we may even find contradictory explanations if two boundaries have opposing normal vectors. Def. 5 expands on Def. 4 by fixing a particular decision boundary, so that the normal vector associated with $\mathbf{x}$ is unique.

Definition 5 (Boundary-based Explanation Tuple)

Given an input $\mathbf{x}$, a Boundary-based Explanation Tuple is defined as $(\mathcal{B}, \mathbf{n}, d)$, where $\mathcal{B}$ is a decision boundary, $\mathbf{n}$ is a unit vector normal to $\mathcal{B}$ pointing to the polytope containing $\mathbf{x}$, and $d$ is the distance from $\mathbf{x}$ to $\mathcal{B}$.

Because the configuration of the piecewise-linear boundaries is defined largely by the training data, it makes intuitive sense to expect that boundaries nearer to a point account for features that are more relevant to its classification. Nearby boundaries also depict the counterfactual behavior of the classifier in the neighborhood of $\mathbf{x}$, as their normal vectors characterize the directions leading to nearby regions assigned different labels. This naturally leads to a Boundary-based Saliency Map (Def. 6), where the boundary is chosen to maximize proximity to the input $\mathbf{x}$.

Definition 6 (Boundary-based Saliency Map (BSM))

Given a network $f$ and an input $\mathbf{x}$, we define the Boundary-based Saliency Map as $\mathrm{BSM}(\mathbf{x}) = \nabla f_c(\mathbf{x}')$, where $c = F(\mathbf{x})$ and $\mathbf{x}' = \arg\min_{\mathbf{z}} \|\mathbf{z} - \mathbf{x}\|$ subject to $F(\mathbf{z}) \neq F(\mathbf{x})$, i.e., $\mathbf{x}'$ is the closest adversarial example to $\mathbf{x}$.

Given Def. 6, it is straightforward to verify that for a target instance $\mathbf{x}$, $\mathrm{BSM}(\mathbf{x})$ yields a Boundary-based Explanation Tuple $(\mathcal{B}, \mathbf{n}, d)$, where $\mathcal{B}$ is the closest decision boundary to $\mathbf{x}$ and $d = \|\mathbf{x} - \mathbf{x}'\|$.
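A BSM-style boundary normal can be sketched on a hand-crafted two-neuron ReLU toy whose boundary location is known analytically. The bisection below stands in for the adversarial-example search used on real networks; all weights and names are illustrative.

```python
import numpy as np

# Hand-crafted toy: class-score margin m(x) = relu(x1+x2) - relu(x1-x2) - 0.3.
# Along the segment from (1, 1) to (1, -1) the boundary sits at (1, 0.15).
W1, b1 = np.array([[1., 1.], [1., -1.]]), np.zeros(2)

def margin(x):
    h = np.maximum(W1 @ x + b1, 0)
    return h[0] - h[1] - 0.3

def grad_margin(x):                  # analytic gradient inside x's polytope
    D = np.diag((W1 @ x + b1 > 0).astype(float))
    return np.array([1., -1.]) @ D @ W1

x, y = np.array([1.0, 1.0]), np.array([1.0, -1.0])  # opposite predicted classes

# Bisection along the segment [x, y] to locate the sign change of margin().
lo, hi = x, y
for _ in range(60):
    mid = (lo + hi) / 2
    if np.sign(margin(mid)) == np.sign(margin(x)):
        lo = mid
    else:
        hi = mid
x_b = (lo + hi) / 2                  # approximate boundary point
bsm = grad_margin(x_b)               # BSM: the gradient taken at the boundary
print(x_b, bsm)
```

The recovered normal is the gradient of the local linear piece at the boundary point, matching the tuple $(\mathcal{B}, \mathbf{n}, d)$ reading above.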

BSM and SM.

To implement the BSM explanation, one can adapt a procedure for constructing saliency maps by computing the gradient with respect to the nearest adversarial example $\mathbf{x}'$ [31, 45] rather than the original point $\mathbf{x}$. Others have noted [17] that the local gradient is not always normal to the closest decision boundary (see the left plot of Fig. 3(c)): when the closest decision boundary is not inside the activation polytope containing $\mathbf{x}$, the gradient may instead be normal to the linear extension of some other piecewise boundary. Similarly, the projection distance may not be the actual distance to the closest decision boundary, but the distance to the prolongation of another boundary. This means that a standard saliency map may not return a valid boundary-based explanation tuple, and our experiments in Section 4 demonstrate that this is typically the case.

Incorporating More Boundaries.

The main limitation of using BSM as a local explanation is obvious: the closest decision boundary only captures one segment of the decision surface. Even for a toy network, there are usually multiple decision boundaries in the vicinity of an on-manifold point (e.g., the jagged boundaries in Fig. 1). To incorporate the influence of other decision boundaries in the neighborhood of , we consider aggregating the normal vectors of a set of decision boundaries. Taking inspiration from IG, Def. 7 proposes a Boundary-based Integrated Gradient as follows:

Definition 7 (Boundary-based Integrated Gradient (BIG))

Given a network $f$, the Integrated Gradient $\mathrm{IG}$, and an input $\mathbf{x}$, we define the Boundary-based Integrated Gradient as $\mathrm{BIG}(\mathbf{x}) = \mathrm{IG}(\mathbf{x}; \mathbf{x}')$, where $\mathbf{x}'$ is the nearest adversarial example to $\mathbf{x}$, i.e., $\mathbf{x}' = \arg\min_{\mathbf{z}} \|\mathbf{z} - \mathbf{x}\|$ subject to $F(\mathbf{z}) \neq F(\mathbf{x})$.

BIG explores a linear path from the boundary point to the target point. Because points on this path are likely to traverse different activation polytopes, the gradients of the intermediate points used to compute $\mathrm{BIG}(\mathbf{x})$ are the normals of the linear extensions of their local boundaries. As the input gradient is identical within a polytope $P$, the aggregation computed by BIG sums each gradient along the path, weighted by the length of the path segment intersecting $P$. This yields the normal vector used in the explanation tuple given by BIG; the distance component is computed as in BSM.
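As a sketch, BIG is IG computed with the nearest boundary point as the baseline. On a hand-crafted two-neuron ReLU toy with a known boundary point (illustrative weights and names, not the paper's code), IG's completeness axiom implies the attributions sum to the margin at the input, since the margin vanishes at the boundary:

```python
import numpy as np

# Toy margin m(x) = relu(x1+x2) - relu(x1-x2) - 0.3, with a boundary point
# at (1, 0.15) on the segment toward x = (1, 1).
W1, b1 = np.array([[1., 1.], [1., -1.]]), np.zeros(2)

def margin(x):                        # class-0 score minus class-1 score
    h = np.maximum(W1 @ x + b1, 0)
    return h[0] - h[1] - 0.3

def grad_margin(x):
    D = np.diag((W1 @ x + b1 > 0).astype(float))
    return np.array([1., -1.]) @ D @ W1

def big(x, x_boundary, steps=200):
    """IG along the straight path from the boundary point to x."""
    ts = (np.arange(steps) + 0.5) / steps
    g = np.mean([grad_margin(x_boundary + t * (x - x_boundary)) for t in ts],
                axis=0)
    return (x - x_boundary) * g

x = np.array([1.0, 1.0])
x_b = np.array([1.0, 0.15])          # boundary point: margin(x_b) == 0
attr = big(x, x_b)
# Completeness: the attributions sum to margin(x) - margin(x_b) = margin(x).
print(np.isclose(attr.sum(), margin(x) - margin(x_b)))
```

Here the path stays in one polytope, so the weighted sum reduces to a single normal; on real networks the path crosses many polytopes and BIG aggregates their normals.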

Finding nearby boundaries. Finding the closest decision boundary is closely related to the problem of certifying local robustness [17, 21, 24, 25, 28, 46, 50], which is known to be NP-hard for piecewise-linear models [39]. Therefore, to approximate the point on the closest decision boundary, we leverage techniques for generating adversarial examples, e.g., PGD [29] and CW [5], and return the closest one found within a reasonable time budget. Our experiments in Section 4 show that this approximation yields good results in practice.

3.3 Characterization of Boundary Attribution

In this section, we examine the relationship between the boundary explanation methods described in the previous section, and also explore some intriguing connections to model robustness. The proofs of all theorems are given in Appendix A.

Connection to SG. Wang et al. [49] showed that the Smooth Gradient (SG) for a network $f$ is equivalent to the standard saliency map of a smoothed variant [9] of the original model. We build on this relationship, examining how BSM differs from SM on a smoothed model. Dombrowski et al. [13] noted that models with softplus activations ($\mathrm{softplus}_\beta(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$) approximate such smoothing, giving an exact correspondence for single-layer networks. Combining these insights, we arrive at Theorem 1, which suggests that BSM resembles SM on smoothed models.
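The informal bridge between SG, randomized smoothing, and softplus can be checked numerically for a single ReLU: the expected gradient under Gaussian noise (what SG computes) equals the Gaussian CDF $\Phi(x/\sigma)$, a smooth sigmoid-shaped curve of the same kind as a softplus unit's gradient. A Monte-Carlo sketch (illustrative, not the paper's code):

```python
import numpy as np
from math import erf, sqrt

# Expected ReLU gradient under Gaussian input noise vs. its closed form.
rng = np.random.default_rng(0)
sigma, n = 0.5, 200_000

def smoothed_relu_grad(x):
    """Monte-Carlo estimate of E[ relu'(x + eps) ], eps ~ N(0, sigma^2)."""
    eps = rng.normal(scale=sigma, size=n)
    return np.mean((x + eps) > 0)

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

for x in (-0.5, 0.0, 0.5):
    print(f"x={x:+.1f}  MC={smoothed_relu_grad(x):.3f}  "
          f"Phi(x/sigma)={phi(x / sigma):.3f}")
```

Larger $\sigma$ flattens the curve, which is the mechanism by which smoothing pulls the gradient field (and hence SM) toward the smoother boundary geometry.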

Theorem 1

Let $f$ be a one-layer network, and write $\tilde{f}$ for its randomized-smoothing counterpart. Suppose $\mathbf{x}'$ is the closest adversarial example to $\mathbf{x}$. Then the difference between the BSM and the SM of $\tilde{f}$ at $\mathbf{x}$ is bounded by a quantity that shrinks as the amount of smoothing grows; the precise statement and constants are given in Appendix A.

Although Theorem 1 is restricted to one-layer networks, it provides several insights. First, when randomized smoothing is used, BSM and SM yield more similar results, so BSM attributions resemble those obtained on robust models. Second, because SM on a smoothed model is equivalent to SG on the non-smoothed one, SG is likely a better choice than SM whenever boundary methods are too expensive.

Attribution Robustness. Recent work has also proposed using the Lipschitz constant of an attribution method to characterize its robustness to adversarial examples, i.e., the difference between the attribution vector of the target input and those of its neighbors within a small ball [49] (see Def. 9 in the Appendix). This naturally leads to the following statement.

Theorem 2

Suppose $f$ has a $\lambda$-robust Saliency Map; then $\|\mathrm{SM}(\mathbf{x}) - \mathrm{BSM}(\mathbf{x})\| \leq \lambda \|\mathbf{x} - \mathbf{x}'\|$, where $\mathbf{x}'$ is the closest adversarial example to $\mathbf{x}$, provided $\mathbf{x}' \in B(\mathbf{x}, \epsilon)$.

Theorem 2 provides another insight: for networks trained to have robust attributions [8, 49], SM is a good approximation to BSM. As prior work has demonstrated a close correspondence between robust prediction and robust attribution [49], this result suggests that robust training may enable less expensive explanations that closely resemble BSM, by relying on SM to approximate it.

4 Evaluation

In this section, we first evaluate the "accuracy" of the proposed boundary attributions in terms of their alignment with ground-truth bounding boxes on ImageNet 2012, using ResNet50 models. We find that BIG significantly outperforms all baseline methods (Fig. 4(b)) on multiple quantitative metrics, while providing visually sharper and more concentrated visualizations (Sec. 4.3). We also validate the theorems presented in Sec. 3.3, demonstrating that the difference between boundary attributions and prior methods on an adversarially-trained ResNet50 decreases by an order of magnitude (Sec. 4.2). This shows that in practice, robust training objectives allow the more desirable boundary attributions to be approximated efficiently using existing gradient-based methods.

4.1 Localization with ground truth

In the absence of more general ground-truth quality metrics for explanations, we may assume that good explanations localize features that are relevant to the label assigned to the input. In an image classification task where ground-truth bounding boxes are given, we consider features within the bounding box as more relevant to the label assigned to the image. The metrics used for our evaluation are: 1) Localization (Loc.) [7] evaluates the intersection of the bounding box with the pixels receiving positive attributions; 2) Energy Game (EG) [47] instead computes the portion of attribution scores that falls within the bounding box. While these two metrics are common in the literature, we propose the following additional metrics: 3) Positive Percentage (PP) evaluates the sum of positive attribution scores over the total (absolute value of) attribution scores within the bounding box; 4) Concentration (Con.) evaluates the sum of distances between the "mass" center of the attributions and each pixel with a positive attribution score within the bounding box. Higher Loc., EG, and PP, and lower Con., are better. We provide formal details of these metrics in Appendix B.1.
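These four metrics can be sketched from their descriptions above. The exact formulas are in the paper's Appendix B.1, so treat the readings below (e.g., Loc. as an intersection-over-union of positive pixels with the box) as plausible illustrative implementations rather than the authors' definitions:

```python
import numpy as np

# Illustrative localization metrics over a 2-D attribution map `attr` and a
# boolean bounding-box mask `box` of the same shape.
def loc(attr, box):            # IoU of positive-attribution pixels with the box
    pos = attr > 0
    return np.sum(pos & box) / np.sum(pos | box)

def energy_game(attr, box):    # share of positive attribution mass in the box
    p = np.clip(attr, 0, None)
    return p[box].sum() / p.sum()

def positive_percentage(attr, box):
    inside = attr[box]
    return np.clip(inside, 0, None).sum() / np.abs(inside).sum()

def concentration(attr, box):  # distances from the attribution mass center
    p = np.clip(attr, 0, None)
    ys, xs = np.nonzero(p)
    cy = np.average(ys, weights=p[ys, xs])
    cx = np.average(xs, weights=p[ys, xs])
    yy, xx = np.nonzero((p > 0) & box)
    return np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2).sum()

# A toy 4x4 attribution map whose positive mass lies entirely in the box.
attr = np.zeros((4, 4)); attr[1:3, 1:3] = 1.0; attr[0, 0] = -0.5
box = np.zeros((4, 4), dtype=bool); box[1:3, 1:3] = True
print(loc(attr, box), energy_game(attr, box), positive_percentage(attr, box))
```

On this toy map all three score-based metrics are maximal (1.0), while Con. shrinks as the positive mass tightens around its center.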

Experimental setup. We compare BSM and BIG against their non-boundary versions to demonstrate that boundary attributions are a better fit for explaining non-linear classifiers; comparisons with additional methods appear in Appendix B.4. We consider 2000 correctly-classified images from ImageNette [19], a subset of ImageNet [34], with bounding-box area less than 80% of the source image. When comparing against IG, we use the default "black" baseline consisting of all zeros [44]. Implementation details of the boundary approximation, and the hyperparameters used in our experiments, are included in the Appendix.


Figure 4: (a) SM and BSM on all metrics. (b) IG and BIG on all metrics. An arrow pointing up indicates that higher scores are better, and similarly when it points down. Results for Con. are scaled by 1e-5 to fit the plot. (c) Distances (log scale) between SM and BSM, and between IG and BIG, on a standard ResNet50 and a robustly trained ResNet50.
Figure 5: Distances (log scale) between SG and BSG for different standard deviations of the Gaussian noise. Results are computed on ResNet50; compare with the first column in Fig. 4(c).

Results. The results for BSM and BIG are shown in Fig. 4(a) and 4(b); visualizations of these methods can be found in Fig. 6 and are discussed in detail in Sec. 4.3. We make the following observations about the boxplots: 1) SM and BSM do not differ significantly on any metric for a standard model. We believe the reason is that a single normal vector of the closest decision boundary does not provide enough information to describe how the model places the target instance into the region, as there are numerous piecewise-linear boundaries in the vicinity that are relevant; we examine these differences further in Sec. 4.2. 2) BIG is significantly better than IG on all metrics. This result shows that an adversarial example on the closest decision boundary serves as a better baseline than the popular "black" baseline. Our results also show that BIG outperforms all methods in Fig. 4(a), which validates that, when more boundaries exist in a local neighborhood, concentrating on the closest one (as BSM does) is not the most effective way to capture the aggregate shape of complex local boundaries. By incorporating more boundaries into the attribution, BIG provides effective, localized explanations for the predictions of piecewise-linear classifiers.

Figure 6: Visualizations of Saliency Map (SM), Boundary-based Saliency Map (BSM), Integrated Gradient (IG), and Boundary-based Integrated Gradient (BIG) for examples classified by a standard ResNet50. A check denotes that the prediction is Top-1 correct. Red points are positive attributions. Bounding boxes are drawn in red. All images are sampled from ImageNette.
Figure 7: Visualizations of the same examples from Fig. 6, with all attributions instead computed on a robust ResNet50. A check superscripted with 5 denotes that the prediction is not Top-1 but still Top-5 correct. A cross denotes that the prediction is not Top-5 correct. Red points are positive attributions.

4.2 Robust Classifiers

This section measures the difference between boundary attributions and their non-boundary versions on robust classifiers, as described in Sec. 3.3. Visualizations of the proposed methods on the robust ResNet50 can be found in Fig. 7. We first validate Theorem 1, which suggests that SM and BSM are more similar on models with randomized smoothing, which are known to be robust [9]. To obtain meaningful explanations on smoothed models, which are implemented by evaluating the model on a large set of noised inputs, we assume that the random seed is known to the adversarial-example generator, and search for perturbations that point towards a boundary on as many of the noised inputs simultaneously as possible. We perform the boundary search on a subset of 500 images, as this computation is significantly more expensive than the previous experiments. Instead of directly computing SM and BSM on the smoothed classifier, we utilize the connection between randomized smoothing and SG (see Theorem 4 in Appendix A), and compare the difference between SG on the clean inputs and SG on their adversarial examples (referred to as BSG). Details of the experimental setup are given in Appendix B.3, and the results are shown in Fig. 5. Notably, the trend of the log difference against the standard deviation of the Gaussian noise validates that the qualitative meaning of Theorem 1 holds even for large networks. Next, we validate Theorem 2, which states that models with robust attributions yield boundary attributions that are more similar to non-boundary ones. Methods proposed in the literature [8, 49] for improving attribution robustness usually require even more resources than PGD training [29], sometimes requiring expensive second-order methods [8], and pretrained ImageNet weights produced by these methods are not currently available.
However, even though models with PGD training do not have state-of-the-art robust attributions, they are still significantly more robust than standard models [14], and the corresponding pre-trained weights are publicly available [14]. We measure the differences between SM and BSM, and between IG and BIG, in Fig. 4(c) on 1682 images (we exclude those that are not correctly predicted by the robust ResNet50). The results show that for the robust ResNet50, SM and IG are remarkably close to BSM and BIG, validating the claim that boundary attributions can be approximated in practice with standard attribution methods on robust models.

Summary. Theorems 1 and 2, and the corresponding empirical results given in this section, aim to highlight the following observation. With a “standard” (non-robust) model, training is less costly, but when effective explanations are needed, more resources are required to compute boundary-based attributions. However, if more resources are devoted to training robust models, effectively identical explanations can be obtained using much less costly standard gradient-based methods.

4.3 Visualization

Visualizations of the proposed methods and comparisons with SM and IG are shown in Fig. 6 and 7. We make the following observations. First, BSM is more similar to SM on a robust model (Fig. 7) as discussed in Sec. 4.2. Second, BIG provides high-quality visualizations even on “standard” non-robust models (Fig. 6). The visualizations have significantly less noise, and focus on the relevant features more sharply than attributions given by all other compared methods. Notably, BIG successfully localized importance on a very small region in Fig. 6, apparently containing a parachute, missed by all other methods. Finally, BIG on robust models (Fig. 7) provides some insights about why certain instances are top-5, but not top-1, correct. For example, BIG in the 7th “dog” instance shows that the model focuses primarily on the subject’s legs. In this case, it may be that the visual features on this region of the image are sufficient to distinguish it as containing some breed of dog, but are insufficient to distinguish between several related breeds.

5 Discussion: Baseline Sensitivity

Figure 8: Comparisons of IG with black and white baselines with BIG. Predictions are shown in the first column.

BIG is motivated by incorporating the local decision boundaries that BSM alone cannot capture, while IG is motivated by solving the gradient-vanishing problem of SM in an axiomatic way [44]. A natural consequence, however, is that BIG frees users from baseline selection when explaining non-linear classifiers. Empirical evidence has shown that IG is sensitive to the baseline input [42]; we therefore compare BIG with IG under different baselines, white or black images, with an example in Fig. 8. For the first two images, when the baseline has the opposite color of the dog, more pixels on the dog receive non-zero attribution scores, whereas the background receives more attribution when the baseline has the same color as the dog. This is because (see Def. 2) a greater difference between an input feature and the baseline feature can lead to a higher attribution score. The third example raises a question for readers comparing different baselines in IG: is the network using the white dog or the black dog to predict Labrador retriever? We demonstrate that such conflicts, caused by IG's sensitivity to the baseline selection, can be resolved by BIG: BIG shows that the black dog in the last row is more important for predicting Labrador retriever, a conclusion further validated by our counterfactual experiment in Appendix C. Overall, this discussion highlights that BIG is significantly better than IG at avoiding unnecessary sensitivity to baseline selection.
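The mechanism behind IG's baseline sensitivity can be made concrete: any feature equal to its baseline value receives exactly zero attribution, because IG scales the path-averaged gradient by $(\mathbf{x} - \mathbf{x}_b)$. A toy linear score over a two-pixel "image" (illustrative names, not the paper's code) shows black and white baselines zeroing out opposite pixels:

```python
import numpy as np

# Toy linear score with both pixels equally important.
w = np.array([1.0, 1.0])

def grad(x):
    return w                          # constant gradient of the linear score

def ig(x, x_base, steps=50):
    ts = (np.arange(steps) + 0.5) / steps
    g = np.mean([grad(x_base + t * (x - x_base)) for t in ts], axis=0)
    return (x - x_base) * g

x = np.array([0.0, 1.0])             # one black pixel, one white pixel
ig_black = ig(x, np.zeros(2))        # black baseline: black pixel gets 0
ig_white = ig(x, np.ones(2))         # white baseline: white pixel gets 0
print(ig_black, ig_white)
```

Both pixels matter equally to the score, yet each baseline silences the pixel matching it; BIG sidesteps this by letting the model's own boundary determine the baseline point.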

6 Related Work

In this paper we propose a new set of explanation methods, boundary attributions, as a step beyond the score attributions in the literature [4, 11, 16, 26, 30, 36, 37, 38, 40, 41, 44]. Among these methods, SM, IG, and SG have been proved to satisfy many axioms [43, 52, 49], to be invariant to network architecture, and to be sensitive to the network's trainable weights [1]. We argue that a score attribution method is a good fit when the actual output score is of interest, e.g., in regression, while a boundary attribution is instead built to understand how an instance is placed on one side of the decision boundaries, and is therefore a better fit for classification tasks. In evaluating the proposed methods, we choose metrics related to bounding boxes over other metrics because, for classification, we are interested in whether the network associates relevant features with the label, whereas other metrics [1, 2, 35, 48, 52], e.g., infidelity [52], mainly evaluate whether output scores are faithfully attributed to each feature. Our idea of incorporating boundaries into explanations may generalize to other score attribution methods, e.g., Distributional Influence [26] and DeepLIFT [37]. The idea of using boundaries in explanations has also been explored by T-CAV [22], where a linear decision boundary is learned for the internal activations and associated with their proposed notion of a concept. In this work, we consider decision boundaries and adversarial examples in the entire input space, whereas some other work focuses on counterfactual examples on the data manifold [6, 12, 18].

7 Conclusion

In summary, we rethink the question an explanation should answer for a classification task: what are the important features the classifier uses to place the input on a specific side of the decision boundary? We find that the answer relates to the normal vectors of decision boundaries in the input's neighborhood, and propose BSM and BIG as boundary attribution approaches. Empirical evaluations on state-of-the-art classifiers validate that our approaches provide more concentrated, sharper, and more accurate explanations than existing approaches. Our idea of leveraging boundaries to explain classifiers connects explanations with adversarial robustness, and we hope it encourages the community to improve model robustness in the pursuit of better explanations.


This work was developed with the support of NSF grant CNS-1704845 as well as by DARPA and the Air Force Research Laboratory under agreement number FA8750-15-2-0277. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA, the Air Force Research Laboratory, the National Science Foundation, or the U.S. Government.


  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, Cited by: §B.1, §6.
  • [2] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. External Links: 1711.06104 Cited by: §B.1, §6.
  • [3] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, Cited by: §3.1.
  • [4] A. Binder, G. Montavon, S. Lapuschkin, K. Müller, and W. Samek (2016) Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pp. 63–71. Cited by: §1, §6.
  • [5] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §3.2.
  • [6] C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2019) Explaining image classifiers by counterfactual generation. In ICLR, Cited by: §6.
  • [7] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2017) Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. arXiv preprint arXiv:1710.11063. Cited by: §B.1, §4.1.
  • [8] J. Chen, X. Wu, V. Rastogi, Y. Liang, and S. Jha (2019) Robust attribution regularization. In Advances in Neural Information Processing Systems, Cited by: §3.3, §4.2.
  • [9] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Cited by: §2.2, §3.3, §4.2, Definition 8, Theorem 3.
  • [10] F. Croce, M. Andriushchenko, and M. Hein (2019) Provable robustness of relu networks via maximization of linear regions. AISTATS 2019. Cited by: §1.
  • [11] K. Dhamdhere, M. Sundararajan, and Q. Yan (2019) How important is a neuron. In International Conference on Learning Representations, External Links: Link Cited by: §1, §6.
  • [12] A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam, and P. Das (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In NeurIPS, Cited by: §6.
  • [13] A. Dombrowski, M. Alber, C. J. Anders, M. Ackermann, K. Müller, and P. Kessel (2019) Explanations can be manipulated and geometry is to blame. In NeurIPS, Cited by: §A.1, §3.3, Theorem 5, Theorem 6.
  • [14] L. Engstrom, A. Ilyas, H. Salman, S. Santurkar, and D. Tsipras (2019) Robustness (python library). External Links: Link Cited by: §4.2.
  • [15] C. Etmann, S. Lunz, P. Maass, and C. Schoenlieb (2019) On the connection between adversarial robustness and saliency map interpretability. In Proceedings of the 36th International Conference on Machine Learning. Cited by: §1.
  • [16] R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3449–3457. External Links: Document Cited by: §1, §6.
  • [17] A. Fromherz, K. Leino, M. Fredrikson, B. Parno, and C. Păsăreanu (2021) Fast geometric projections for local robustness certification. In International Conference on Learning Representations (ICLR), Cited by: §3.2, §3.2, §3.2.
  • [18] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee (2019) Counterfactual visual explanations. In Proceedings of the 36th International Conference on Machine Learning, pp. 2376–2384. Cited by: §6.
  • [19] J. Howard Imagenette. External Links: Link Cited by: §4.1.
  • [20] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [21] M. Jordan, J. Lewis, and A. Dimakis (2019) Provable certificates for adversarial examples: fitting a ball in the union of polytopes. In NeurIPS, Cited by: §3.2, §3.2.
  • [22] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. Viégas, and R. Sayres (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). In ICML, Cited by: §6.
  • [23] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson (2020) Captum: a unified and generic model interpretability library for pytorch. External Links: 2009.07896 Cited by: §B.2.
  • [24] J. Z. Kolter and E. Wong (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, Cited by: §3.2.
  • [25] S. Lee, J. Lee, and S. Park (2020) Lipschitz-certifiable training with a tight outer bound. Advances in Neural Information Processing Systems 33. Cited by: §3.2.
  • [26] K. Leino, S. Sen, A. Datta, M. Fredrikson, and L. Li (2018) Influence-directed explanations for deep convolutional networks. In 2018 IEEE International Test Conference (ITC), pp. 1–8. Cited by: §1, §6.
  • [27] K. Leino, R. Shih, M. Fredrikson, J. She, Z. Wang, C. Lu, S. Sen, D. Gopinath, and , Anupam (2021) Truera/trulens: trulens. Zenodo. External Links: Document, Link Cited by: §B.2.
  • [28] K. Leino, Z. Wang, and M. Fredrikson (2021) Globally-robust neural networks. External Links: 2102.08452 Cited by: §3.2.
  • [29] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §3.2, §4.2.
  • [30] G. Montavon, S. Bach, A. Binder, W. Samek, and K. Müller (2015) Explaining nonlinear classification decisions with deep taylor decomposition. External Links: 1512.02479 Cited by: §1, §6.
  • [31] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §3.2.
  • [32] J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning, External Links: Link Cited by: §B.2.
  • [33] J. Rauber, R. Zimmermann, M. Bethge, and W. Brendel (2020) Foolbox native: fast adversarial attacks to benchmark the robustness of machine learning models in pytorch, tensorflow, and jax. Journal of Open Source Software 5 (53), pp. 2607. External Links: Document, Link Cited by: §B.2.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.1.
  • [35] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems 28 (11), pp. 2660–2673. Cited by: §B.1, §6.
  • [36] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §1, §6.
  • [37] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153. Cited by: §B.4, §1, §6.
  • [38] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. External Links: 1312.6034 Cited by: §1, §6, Definition 1.
  • [39] A. Sinha, H. Namkoong, R. Volpi, and J. Duchi (2020) Certifying some distributional robustness with principled adversarial training. External Links: 1710.10571 Cited by: §3.2.
  • [40] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. External Links: 1706.03825 Cited by: §1, §6, Definition 3.
  • [41] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. External Links: 1412.6806 Cited by: §1, §6.
  • [42] P. Sturmfels, S. Lundberg, and S. Lee (2020) Visualizing the impact of feature attribution baselines. Distill. External Links: Document Cited by: §5.
  • [43] M. Sundararajan and A. Najmi (2020) The many shapley values for model explanation. External Links: 1908.08474 Cited by: §6.
  • [44] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §1, §4.1, §5, §6, Definition 2.
  • [45] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §3.2.
  • [46] V. Tjeng, K. Y. Xiao, and R. Tedrake (2019) Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, Cited by: §3.2.
  • [47] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu (2020) Score-cam: score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25. Cited by: §B.1, §4.1.
  • [48] Z. Wang, P. Mardziel, A. Datta, and M. Fredrikson (2020) Interpreting interpretations: organizing attribution methods by criteria. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 10–11. Cited by: §B.1, §6.
  • [49] Z. Wang, H. Wang, S. Ramkumar, P. Mardziel, M. Fredrikson, and A. Datta (2020) Smoothed geometry for robust attribution. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 13623–13634. Cited by: §3.3, §3.3, §3.3, §4.2, §6, Definition 9, Theorem 4.
  • [50] T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, D. Boning, I. Dhillon, and L. Daniel (2018) Towards fast computation of certified robustness for relu networks. In ICML, Cited by: §3.2.
  • [51] G. Yang, T. Duan, E. J. Hu, H. Salman, I. P. Razenshteyn, and J. Li (2020) Randomized smoothing of all shapes and sizes. ArXiv abs/2002.08118. Cited by: §A.1.
  • [52] C. Yeh, C. Hsieh, A. S. Suggala, D. I. Inouye, and P. Ravikumar (2019) On the (in) fidelity and sensitivity for explanations. In Advances in Neural Information Processing Systems, Cited by: §B.1, §1, §6.

Appendix A Theorems and Proofs

a.1 Proof of Theorem 1

Theorem 1 Let be a one-layer network and, when using randomized smoothing, we write . Let be the SM for and suppose , where is the closest adversarial example; then the following statement holds: where .

Before we start the proof, we first introduce Randomized Smoothing and the theorem that certifies its robustness.

Definition 8 (Randomized Smoothing [9])

Suppose $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$; the smoothed classifier $g$ is defined as

$$g(x) = \arg\max_{c} \; \Pr_{\epsilon}\left[\, f(x+\epsilon) = c \,\right].$$

Theorem 3 (Theorem 1 from Cohen et al. [9])

Suppose $f$ and $g$ are defined as in Def. 8. For a target instance $x$, suppose the probability of the top class $c_A$ is lower-bounded by $\underline{p_A}$ and the probability of the runner-up class is upper-bounded by $\overline{p_B}$; then $g(x+\delta) = c_A$ whenever $\|\delta\|_2 < R$, where

$$R = \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right)$$

and $\Phi$ is the c.d.f. of the standard Gaussian.
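As a concrete illustration of this certificate, the radius above is easy to compute with scipy (a minimal sketch; the function name is ours, not from the paper's code):

```python
from scipy.stats import norm

def certified_radius(p_a_lower, p_b_upper, sigma):
    """Certified L2 radius from Cohen et al. [9]:
    R = sigma/2 * (Phi^{-1}(p_A lower bound) - Phi^{-1}(p_B upper bound))."""
    return 0.5 * sigma * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))
```

A larger noise level sigma or a larger margin between the class probabilities yields a larger certified radius, which is what lets the proof below lower-bound the distance to the closest decision boundary.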

Next, we introduce a theorem that connects Randomized Smoothing with Smoothed Gradient.

Theorem 4 (Proposition 1 from Wang et al. [49])

Suppose a model $g$ satisfies $g(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2 I)}\left[ f(x+\epsilon) \right]$. For a Smoothed Gradient of $f$, we have

$$\nabla g(x) = \left( \nabla f * \mathcal{N}(0, \sigma^2 I) \right)(x),$$

where $*$ denotes the convolution operation.

Finally, we introduce two theorems that connect Smoothed Gradient with softplus networks.

Theorem 5 (Theorem 1 from Dombrowski et al. [13])

Suppose is a feed-forward network with softplus- activation and if , then



is the weight for layer and is the geodesic distance (which in our case is just the distance).

Theorem 6 (Theorem 2 from Dombrowski et al. [13])

Denote a one-layer ReLU network as $f$ and a one-layer softplus network as $h_\beta$; then the following statement holds:

$$\nabla_x h_\beta(x) = \mathbb{E}_{\epsilon \sim p_\beta}\left[ \nabla_x f(x+\epsilon) \right], \qquad p_\beta(\epsilon) = \frac{\beta}{2 + 2\cosh(\beta\epsilon)}.$$

We now begin our proof for Theorem 1.


Given a one-layer ReLU network that takes an input and outputs the logit score for the class of interest, WLOG we assume is the -th column of the complete weight matrix , where is the class of interest. With Theorem 6 we know that


Dombrowski et al. [13] point out that the random distribution closely resembles a normal distribution with a standard deviation


Therefore, we have the following relation


The LHS of the above equation is Smoothed Gradient, or equivalently, it is the Saliency Map of a smoothed classifier due to Theorem 4. Eqs. 7 and 9 show that we can analyze randomized smoothing with the tool of an intermediate softplus network .


and we denote . Now, applying Theorem 5, we have


where is the closest adversarial example. Because is certified to be robust within the neighborhood , the closest decision boundary is at least distance away from the evaluated point . We then have


Now let us substitute and with using Eqs. 8 and 3; we arrive at




We use instead of because we approximate randomized smoothing with the similar distribution . In fact, a more rigorous proof exists when considering noise distributions other than the Gaussian; we refer interested readers to Yang et al. [51].

a.2 Proof of Theorem 2

Theorem 2 Suppose has a -robust Saliency Map; then where if and .

We first introduce the definition of attribution robustness as follows:

Definition 9 (Attribution Robustness [49])

An attribution method is -locally robust at the evaluated point if .

The proof of Theorem 2 is then a direct application of Def. 9. If the closest boundary point is within the neighborhood where has a -locally robust Saliency Map , then by definition .

Appendix B Experiments

b.1 Metrics with Bounding Boxes

We use the following additional notation in this section. Let , and be the set of indices of all pixels, the set of indices of pixels with positive attributions, and the set of indices of pixels inside the bounding box, respectively, for a target attribution map . We denote the cardinality of a set as .

Localization (Loc.)

[7] evaluates the intersection between the bounding box area and the pixels with positive attributions.

Definition 10 (Localization)

For a given attribution map , the localization score (Loc.) is defined as


Energy Game (EG)

[47] instead computes the portion of attribution scores falling within the bounding box.

Definition 11 (Energy Game)

For a given attribution map , the energy game EG is defined as


Positive Percentage (PP)

evaluates the sum of positive attribution scores over the total (absolute values of) attribution scores within the bounding box.

Definition 12 (Positive Percentage)

Let be the set of indices of all pixels with negative attribution scores; for a given attribution map , the positive percentage PP is defined as


Concentration (Con.)

evaluates the sum of the distances, weighted by the attribution “mass”, between the “mass” center of the attributions and each pixel with a positive attribution score within the bounding box. Notice that the computation of and can be done with scipy.ndimage.center_of_mass.

Definition 13 (Concentration)

For a given attribution map , the concentration Con. is defined as follows


where are the coordinates of the pixel and


Besides metrics related to bounding boxes, there are other metrics in the literature used to evaluate attribution methods [1, 2, 35, 48, 52]. We focus on metrics that use the provided bounding boxes because we believe they offer a clear distinction between likely relevant features and irrelevant ones, and therefore evaluate how well we answer the question defined in Sec. 1.
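To make the metrics above concrete, the Energy Game and the “mass” center used by Con. might be computed as follows (our own sketch with hypothetical function names; `attr` is a 2-D attribution map and `box_mask` a boolean mask of the bounding box):

```python
import numpy as np
from scipy import ndimage

def energy_game(attr, box_mask):
    """EG: portion of positive attribution mass that falls inside the box."""
    pos = np.clip(attr, 0.0, None)
    return pos[box_mask].sum() / (pos.sum() + 1e-12)

def mass_center(attr):
    """Center of 'mass' of the positive attributions (used by Con.)."""
    return ndimage.center_of_mass(np.clip(attr, 0.0, None))
```

Loc. and PP follow the same pattern: build the relevant index sets as boolean masks and take ratios of masked sums.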

Figure 9: Comparisons on more attribution methods. BIG: Boundary-based Integrated Gradient. BSM: Boundary-based Saliency Map. SM: Saliency Map. GI: Saliency Map × Input. SG: Smooth Gradient. IG: Integrated Gradient. The results are computed from ResNet50 with standard training.

b.2 Setup Detail for Sec. 4.1

Pipeline Avg Distance Success Rate Time
Standard ResNet50
PGDs 0.55 71.2% 1.64s
    + CW 0.42 71.2% 2.29s
Robust ResNet50 ()
PGDs 2.19 50.0% 1.64s
    + CW 1.88 50.0% 2.29s
Table 1: Pipeline: the methods used for boundary search. Avg Distance: the average distance between the input and the boundary. Success Rate: the percentage of instances for which the pipeline returns an adversarial example. Time: per-instance time with a batch size of 64.
Figure 10: Full results of Fig. 8 in Sec. 5. For the third, fourth and fifth examples, we compute the attribution scores towards the prediction of the third example, Labrador retriever. IG with black or white baselines shows that the masked areas contribute substantially to the prediction, while BIG “accurately” locates the features relevant to the network’s prediction.

Boundary Search

Our boundary search uses a pipeline of PGDs and CW, where PGDs denotes repeating the PGD attack with a series of epsilons until a closest adversarial example is found. Adversarial examples returned by PGDs are further compared with those from CW, and the closer ones are returned. The average distances of the found adversarial examples, success rates of the attacks, and computation times are included in the first two rows of Table 1. If an adversarial example is not found, the pipeline returns the point from the last iteration of the first method (PGDs in our case). For PGDs, we use the following list of : . For each in the list, we run at most 40 iterations with a step size equal to . For CW, we set and run 100 iterations with a step size of 1e-3. All attacks are conducted with FoolBox [33, 32]. The result of the boundary search is shown in Table 1. All computations are done using a Titan RTX GPU accelerator with 24 GB of memory.
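The epsilon-schedule logic of PGDs, and the rule of keeping the closer of the PGDs and CW candidates, can be sketched on a toy linear model as follows (our own illustration, not the actual FoolBox-based implementation):

```python
import numpy as np

def pgds_linear(x, w, b, eps_list, steps=40):
    """PGD repeated over an increasing epsilon schedule against a toy linear
    'network' f(x) = w.x + b; the prediction flips when sign(f) flips.
    Returns the first adversarial example found, or None."""
    orig_sign = np.sign(w @ x + b)
    for eps in eps_list:
        step = 2.5 * eps / steps
        x_adv = x.astype(float).copy()
        for _ in range(steps):
            # for a linear model the input gradient is simply w
            x_adv = x_adv - orig_sign * step * w / np.linalg.norm(w)
            delta = x_adv - x
            norm = np.linalg.norm(delta)
            if norm > eps:  # project back onto the L2 ball of radius eps
                x_adv = x + delta * eps / norm
            if np.sign(w @ x_adv + b) != orig_sign:
                return x_adv
    return None

def closer_candidate(x, *candidates):
    """Keep the successful candidate closest to x (the rule used to compare
    PGDs' and CW's outputs); None if no attack succeeded."""
    valid = [c for c in candidates if c is not None]
    return min(valid, key=lambda c: np.linalg.norm(c - x)) if valid else None
```

On a real network, the gradient of the logit margin replaces `w`, and the candidates come from FoolBox's L2 PGD and CW attacks.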


All attributions are implemented with Captum [23] and visualized with Trulens [27]. For BIG and IG, we use 10 intermediate points between the baseline and the input, and the interpolation method is set to riemann_trapezoid. To visualize the attribution map, we use the HeatmapVisualizer with blur=10, normalization_type="signed_max" and default values for other keyword arguments from Trulens.
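The path integration underlying IG and BIG can be sketched in plain numpy (our own illustration; `grad_fn` stands in for the model's input gradient, and passing the closest boundary point as the baseline yields BIG):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=10):
    """IG with a trapezoid rule over the straight path from baseline to x.
    Passing the closest boundary point as `baseline` yields BIG."""
    alphas = np.linspace(0.0, 1.0, steps + 1)
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # trapezoid rule: the two endpoints get half weight
    avg_grad = (grads[0] / 2 + grads[1:-1].sum(axis=0) + grads[-1] / 2) / steps
    return (x - baseline) * avg_grad
```

For a linear model the result reduces to (x − baseline) elementwise-multiplied with the weight vector, so the attributions sum exactly to the score difference between the input and the baseline (completeness).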

b.3 Setup Detail for Sec. 4.2

Setup for Fig. 5

To generate adversarial examples for the smoothed classifier of ResNet50 with randomized smoothing, we need to back-propagate through the noise. The noise sampler is usually not accessible to an attacker who wants to fool a model protected by randomized smoothing; however, our goal in this section is not to reproduce a practical attack, but to find a point on the boundary. We therefore do the noise sampling prior to running the PGD attack, and we use the same noise across all instances. The steps are listed as follows:

  1. We use numpy.random.randn as the sampler for Gaussian noise with its random seed set to 2020. We use 50 random noises per instance.

  2. In PGD attack, we aggregate the gradients of all 50 random inputs before we take a regular step to update the input.

  3. We set and we run at most 40 iterations with a step size of .

  4. The early-stopping criterion for the PGD loop is that fewer than 10% of all randomized points retain the original prediction.

  5. When computing Smooth Gradient for the original points or for the adversarial points, we use the same random noise that we generated to approximate the smoothed classifier.
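Step 2 above, aggregating gradients over the shared noise before each PGD update, might look like the following sketch (the names and the noise scale are our own placeholders):

```python
import numpy as np

rng = np.random.RandomState(2020)   # fixed seed, as in step 1
noises = rng.randn(50, 4) * 0.25    # 50 shared Gaussian samples; sigma = 0.25 is a placeholder

def smoothed_gradient(grad_fn, x):
    """Average the input gradient over all randomized copies of x,
    approximating the gradient of the smoothed classifier (step 2)."""
    return np.mean([grad_fn(x + n) for n in noises], axis=0)
```

Because the same `noises` array is reused for the original and the adversarial points (step 5), the attack and the Smooth Gradient computation see the same approximation of the smoothed classifier.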

Setup for Fig. 3(c)

We use the pipeline of PGDs+CW to find adversarial examples for a robustly-trained ResNet50. Given that it is robust, we anticipate that the boundary is farther away than in a standard model; therefore we make the following changes compared to the setup discussed for the standard model. We change the list of for PGDs to and the step size to 1e-1 for each . For CW, we change to , the step size to 5e-3, and we run at most 100 iterations. The result of the boundary search is shown in Table 1. All computations are done using a Titan RTX GPU accelerator with 24 GB of memory.

b.4 Comparison with Other Attributions

Besides Integrated Gradients (IG) and Saliency Map, we compare BIG and BSM with other attribution methods. We restrict our comparisons to methods that use gradients or an approximation of gradients. We therefore include GI (Saliency Map multiplied element-wise with the input), SG (Smoothed Gradient), and DeepLIFT [37]. We use ResNet50 and 2000 images from ImageNette with bounding boxes; the results are shown in Fig. 9. BIG remains the better method compared to these additional baseline attributions (GI, SG and DeepLIFT).

Appendix C Counterfactual Analysis in the Baseline Selection

The discussion in Sec. 5 shows an example where there are two dogs in the image. IG with a black baseline indicates that the body of the white dog is also useful to the model for predicting its label, and that the black dog is a mix: part of the black dog has positive attributions while the rest contributes negatively to the prediction. However, our proposed method BIG clearly shows that the most important part is the black dog, followed by the white dog. To validate whether the model is actually using the white dog, we manually remove either the black dog or the white dog from the image and check whether the model retains its prediction. The result is shown in Fig. 10. Clearly, when the black dog is removed, the model changes its prediction from Labrador retriever to English foxhound, whereas removing the white dog does not change the prediction. This result helps convince the reader that, in this case, BIG is more reliable than IG with a black baseline as a faithful explanation of the classification result for this instance.