Implementation of Boundary Attributions for Normal (Vector) Explanations
Recent work on explaining Deep Neural Networks (DNNs) focuses on attributing the model's output scores to input features. However, when it comes to classification problems, a more fundamental question is how much each feature contributes to the model's decision to classify an input instance into a specific class. Our first contribution is Boundary Attribution (BA), a new explanation method that addresses this question. BA leverages an understanding of the geometry of activation regions: it computes (and aggregates) normal vectors of the local decision boundaries for the target input. Our second contribution is a set of analytical results connecting the adversarial robustness of the network to the quality of gradient-based explanations. Specifically, we prove two theorems for ReLU networks: BA of randomized smoothed networks or robustly trained networks is much closer to non-boundary attribution methods than it is in standard networks. These results encourage users to improve model robustness to obtain high-quality explanations. Finally, we evaluate the proposed methods on ImageNet and show that BAs produce more concentrated and sharper visualizations than non-boundary ones. We further demonstrate that our method also helps to reduce the sensitivity of attributions to the baseline input when one is required.
Existing approaches [4, 11, 16, 26, 30, 36, 37, 38, 40, 41, 44] to explaining Deep Neural Networks (DNNs) are motivated by the question: how can the model's output score be faithfully attributed to input features? These tools are appropriate for tasks like regression and generation. However, in a classification task, besides the output score of the model, we are also interested in a second question: how much does each feature contribute to the model's decision to classify an input instance into a specific class?
Efforts to answer the first question lead to explanation tools that capture the model's output change within a local perturbation set of input features. In other words, they explain the model by exploring the local geometry of the model's output in the input space around the point of interest. On the other hand, an answer to the second question needs to focus on the important features that the model uses to separate the evaluated input from other classes, not just the output score of the current class. An example with a toy binary classifier is shown in Fig. 1. We show that, whereas answers to the first question focus on the target and its surroundings (pointed to by the black arrow), they do not directly answer the second question: how this input is placed in the white half-space.
In this paper, we demonstrate that an answer to the second question is related to the decision boundaries, and their normal vectors, that the classifier learns in the input space (pointed to by the white arrow in the example in Fig. 1). These normal vectors correspond to the importance of features that the model uses to create different regions for each class. Leveraging decision boundaries in explaining the classifier not only returns sharp and concentrated explanations that one may otherwise only observe in a robust network [10, 15, 20] (see an example in Fig. 2), but also provides formal connections between gradient-based explanations and the adversarial robustness of DNNs. We summarize our contributions as follows:
We introduce boundary attributions, a new approach to explaining both linear and non-linear classifiers (e.g., DNNs) that leverages an understanding of activation regions and decision boundaries.
We provide a set of analytical results relating boundary attributions to the adversarial robustness of DNNs in Theorems 1 and 2. An implication of our analysis is that the cost of improving model robustness pays off in efficiency: on robust models, inexpensive non-boundary attributions closely approximate boundary attributions.
We empirically demonstrate that boundary-based Integrated Gradients (BIG) produces more accurate explanations, in the sense that its overlap with object bounding boxes is significantly higher than that of existing methods.
Our empirical results further show that BIG mitigates the unnecessary sensitivity to the baseline input in Integrated Gradients.
The rest of our paper is organized as follows. We introduce notation and preliminaries about attribution methods in Sec. 2. We analyze how to explain linear and non-linear classifiers with explanation tuples and boundary attributions in Sec. 3. Empirical evaluations of the proposed methods against baseline attribution methods are included in Sec. 4. We discuss the sensitivity of attribution methods to baseline images in Sec. 5, related work in Sec. 6, and conclude in Sec. 7.
Throughout the paper we use italicized symbols to denote scalar quantities and bold-face symbols to denote vectors. We consider neural networks with ReLU activations prior to the top layer, and a softmax activation at the top. The predicted label for a given input x is F(x) = argmax_c f_c(x), where f_c(x) denotes the pre-softmax score for class c. Unless otherwise noted, we use ||x|| to denote the l2 norm of x, and write B(x, ε) for the l2 neighborhood centered at x with radius ε.
Feature attribution methods are widely-used to explain the predictions made by DNNs, by assigning importance scores for the network’s output to each input feature. Conventionally, scores with greater magnitude indicate that the corresponding feature was more relevant to the predicted outcome. We denote feature attributions by . When is clear from the context, we simply write . While there is an extensive and growing literature on attribution methods, we will focus closely on the popular gradient-based methods shown in Defs 1-3.
The Saliency Map is given by SM(x) = ∂f_c(x)/∂x, the gradient of the class score with respect to the input x.
Given a baseline input x_b, Integrated Gradients is given by IG(x) = (x − x_b) ⊙ ∫₀¹ ∂f_c(x_b + t(x − x_b))/∂x dt, where ⊙ denotes elementwise multiplication.
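As a concrete illustration, the integral in Def. 2 is typically approximated with a Riemann sum over interpolated inputs. The sketch below is our own, for a generic differentiable scorer (`grad_f` is a placeholder, not the paper's code):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """IG(x) = (x - baseline) * average gradient along the straight path
    from the baseline to x (midpoint Riemann sum)."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    ts = (np.arange(steps) + 0.5) / steps                 # midpoints in (0, 1)
    grads = np.stack([grad_f(baseline + t * (x - baseline)) for t in ts])
    return (x - baseline) * grads.mean(axis=0)

# Sanity check on f(x) = ||x||^2: attributions satisfy completeness,
# i.e., they sum to f(x) - f(baseline).
grad_f = lambda z: 2.0 * z
x, b = np.array([1.0, 2.0]), np.zeros(2)
attr = integrated_gradients(grad_f, x, b)
print(attr)   # [1. 4.], and attr.sum() == 5.0 == f(x) - f(b)
```

The midpoint rule makes the completeness check exact for polynomial scores; in practice more steps are used for highly non-linear models.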
In this section, we introduce boundary-based explanation methods by first illustrating and motivating their use on simple linear models (Sec 3.1), and then showing how to generalize the approach to non-linear piecewise classifiers (Sec 3.2).
Consider a binary classification model F(x) = sign(w·x + b) that predicts the label of x in {−1, 1} (we assume no point makes w·x + b = 0). When viewed in the feature space, the set w·x + b = 0 is a hyperplane that separates the input space into two open half-spaces (see Fig. 2(a)). To assign attributions for predictions made by F, SM, SG, and the integral part of IG (see Sec. 2.2) all return a vector proportional to w, which is normal to the hyperplane. In other words, these methods all measure the importance of features by characterizing the model's decision boundary, and are equivalent up to the scale and position of w.
However, note that these attributions fail to distinguish an input from its neighbors, as the attribution is identical for every point, and in particular for points on either side of the boundary. For points on the positive side, w points toward a direction of increasing "confidence" in the model's prediction, measured in terms of distance from the boundary; for points on the other side, it is the opposite. Furthermore, the attributions themselves are invariant to the model's prediction confidence, which is likely useful in constructing explanations. Therefore, Definition 4 proposes augmenting the normal vector with the distance between x and the decision boundary (easily calculated by projection as |w·x + b| / ||w||), and standardizing the interpretation of the normal vector by ensuring that it always points toward the half-space containing x.
Given an input x and a decision boundary parameterized by a linear model w·x + b, the explanation of the classification is a tuple (n, d), where n is a unit vector normal to the boundary pointing to the half-space containing x, and d is the distance from x to the boundary. We refer to (n, d) as the explanation tuple for x.
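For the linear case, the explanation tuple of Def. 4 can be computed in closed form; a minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def explanation_tuple(w, b, x):
    """Explanation tuple (n, d) for a linear classifier sign(w.x + b):
    n is the unit normal of the hyperplane, flipped to point toward the
    half-space containing x; d is the distance from x to the hyperplane."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    s = np.sign(w @ x + b)              # which side of the boundary x lies on
    n = s * w / np.linalg.norm(w)       # unit normal pointing toward x's side
    d = abs(w @ x + b) / np.linalg.norm(w)
    return n, d

n, d = explanation_tuple(np.array([3.0, 4.0]), -5.0, np.array([3.0, 4.0]))
print(n, d)   # n = [0.6 0.8], d = (9 + 16 - 5) / 5 = 4.0
```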
In this section we extend Def. 4 to non-linear boundaries, and in particular the piecewise-linear decision surfaces corresponding to ReLU DNNs. We begin by reviewing the locally-linear geometry of these networks, and then show how Def. 4 can be extended in ways that mirror the SM and IG methods from Def. 1.
Local Linearity. For any neuron in a network, we say the status of the neuron is ON if its pre-activation is positive, and OFF otherwise. We can associate an activation pattern denoting the status of each neuron for any point in the feature space, and a half-space of the input space with each activation constraint. Thus, for any point x, the intersection of the half-spaces corresponding to its activation pattern defines a polytope (see Fig. 2(b)), and within this polytope the network is a linear function f(x) = Wx + b, where the parameters W and b can be computed by back-propagation. Each facet of the polytope (dashed lines in Fig. 2(b)) corresponds to a boundary that flips the status of one neuron. Similar to activation constraints, decision boundaries are piecewise linear, because each decision boundary corresponds to a constraint between the scores of two classes [17, 21].
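The local linear function on an activation polytope can be recovered by composing the affine layers with the ReLU on/off masks fixed at x. A small sketch of our own, assuming a fully-connected ReLU network given as weight/bias lists:

```python
import numpy as np

def local_linear(Ws, bs, x):
    """Return the ReLU on/off pattern at x and the affine map (W, b) with
    f(y) = W @ y + b on the activation polytope containing x."""
    a = np.asarray(x, float)
    W, b = np.eye(len(a)), np.zeros(len(a))
    pattern = []
    for i, (Wi, bi) in enumerate(zip(Ws, bs)):
        W, b, a = Wi @ W, Wi @ b + bi, Wi @ a + bi      # affine layer
        if i < len(Ws) - 1:                             # ReLU on hidden layers
            on = a > 0
            pattern.append(on)
            W, b, a = on[:, None] * W, on * b, on * a   # freeze the mask
    return pattern, W, b

# Tiny 2-2-1 ReLU network: the recovered affine map reproduces f exactly at x.
Ws = [np.array([[1.0, -1.0], [2.0, 1.0]]), np.array([[1.0, 1.0]])]
bs = [np.array([0.5, -1.0]), np.array([0.0])]
x = np.array([1.0, 0.2])
pattern, W, b = local_linear(Ws, bs, x)
print(pattern, W @ x + b)   # both hidden units ON; local model gives f(x) = 2.5
```

The returned W is exactly the input gradient within the polytope, which is why SM is constant on each activation region.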
Boundary-based Explanations. The main challenge in explaining a piecewise linear classifier is that for a given input, there are many possible decision boundaries that could be associated with an explanation tuple as in Def. 4. Without criteria for selecting a particular boundary, an explanation is not unique, and we may even find contradictory explanations if two boundaries have opposing normal vectors. Def. 5 expands on Def. 4 by specifying the polytope containing x, so that the normal vector is unique to x.
Given an input x, a Boundary-based Explanation Tuple is defined as (B, n, d), where B is a decision boundary, n is a unit vector normal to B pointing to the polytope containing x, and d is the distance from x to B.
Because the configuration of the piecewise linear boundaries is defined largely by the training data, it makes intuitive sense to expect that boundaries nearer to a point account for features that are more relevant to its classification. Nearby boundaries also depict the counterfactual behavior of the classifier in the neighborhood of the input, as their normal vectors characterize the directions leading to nearby regions assigned to different labels. This naturally leads to a Boundary-based Saliency Map (Def. 6), where the boundary is chosen to maximize proximity to the input.
Given a network f and an input x, we define the Boundary-based Saliency Map as BSM(x) = ∂f(x′)/∂x′, where x′ is the closest point to x on the decision surface, with the sign chosen so that the resulting vector points toward the polytope containing x.
Given Def. 6, it is straightforward to verify that for a target instance x, BSM yields a Boundary-based Explanation Tuple (B, n, d), where B is the closest decision boundary to x and d is the distance from x to it.
To implement the BSM explanation, one can adapt a procedure for constructing saliency maps by computing the gradient with respect to the nearest adversarial example [31, 45] rather than the original point. Others have noted that the local gradient is not always normal to the closest decision boundary (see the left plot of Fig. 2(c)): when the closest decision boundary is not inside the activation polytope containing x, the gradient may instead be normal to the linear extension of some other piecewise boundary. Similarly, the projection distance may not be the actual distance to the closest decision boundary, but the distance to the prolongation of another boundary. This means that a standard saliency map may not return a valid boundary-based explanation tuple, and our experiments in Section 4 demonstrate that this is typically the case.
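Operationally, BSM replaces the gradient at the input with the gradient at an approximate closest boundary point. A toy 1-D sketch of our own (the boundary search here is a hypothetical stand-in for a PGD/CW attack):

```python
import numpy as np

def bsm(grad_f, find_boundary_point, x):
    """Boundary-based Saliency Map (sketch): evaluate the input gradient
    at the (approximate) closest boundary point instead of at x."""
    x_prime = find_boundary_point(x)
    return grad_f(x_prime)

# Toy score f(x) = relu(x - 1): at x = -2 the local gradient vanishes,
# but the gradient at a point just past the kink x' = 1 recovers the slope.
grad_f = lambda z: (z > 1.0).astype(float)
find_boundary = lambda z: np.array([1.0 + 1e-6])   # hypothetical search result
sm_at_x = grad_f(np.array([-2.0]))
bsm_at_x = bsm(grad_f, find_boundary, np.array([-2.0]))
print(sm_at_x, bsm_at_x)   # [0.] [1.]
```

This also illustrates the caveat above: the quality of BSM depends entirely on how close the returned point is to the true nearest boundary.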
The main limitation of using BSM as a local explanation is clear: the closest decision boundary captures only one segment of the decision surface. Even for a toy network, there are usually multiple decision boundaries in the vicinity of an on-manifold point (e.g., the jagged boundaries in Fig. 1). To incorporate the influence of other decision boundaries in the neighborhood of the input, we aggregate the normal vectors of a set of decision boundaries. Taking inspiration from IG, Def. 7 proposes Boundary-based Integrated Gradients as follows:
Given a network f, Integrated Gradients IG, and an input x, we define Boundary-based Integrated Gradients as IG computed with the baseline set to x′, where x′ is the nearest adversarial example to x, i.e., the nearest point assigned a different label.
BIG explores a linear path from the boundary point to the target point. Because points on this path are likely to traverse different activation polytopes, the gradients of the intermediate points used to compute BIG are normals of the linear extensions of their local boundaries. As the input gradient is identical within a polytope, the aggregation computed by BIG sums each gradient along the path, weighted by the length of the path segment intersecting the corresponding polytope. This yields the normal vector used in the explanation tuple given by BIG; the distance component is computed as in BSM.
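Putting the pieces together, BIG is IG with the baseline replaced by the nearest boundary point. The toy below (our own construction, not the paper's code) shows how each polytope's normal is weighted by the fraction of the path that lies inside it:

```python
import numpy as np

def big(grad_f, x, x_adv, steps=50):
    """Boundary-based Integrated Gradients (sketch): standard IG, but the
    baseline is the nearest boundary point x_adv found by an adversarial
    search, so the integration path crosses the local activation polytopes."""
    ts = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(x_adv + t * (x - x_adv)) for t in ts])
    return (x - x_adv) * grads.mean(axis=0)

# Piecewise-linear toy: the gradient is [1, 0] on one polytope and [0, 1] on
# the other; each normal is averaged in proportion to the path length inside.
grad_f = lambda z: np.array([1.0, 0.0]) if z[0] < 0.5 else np.array([0.0, 1.0])
x, x_adv = np.array([1.0, 0.0]), np.array([0.0, 0.0])
attr = big(grad_f, x, x_adv)
print(attr)   # [0.5 0. ]: half the path lies in each polytope
```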
Finding nearby boundaries. Finding the closest decision boundary is closely related to the problem of certifying local robustness [17, 21, 24, 25, 28, 46, 50], which is known to be NP-hard for piecewise linear models. Therefore, to approximate the closest point on the decision boundary, we leverage techniques for generating adversarial examples, e.g., PGD and CW, and return the closest one found within a reasonable time budget. Our experiments in Section 4 show that this approximation yields good results in practice.
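The search over attack radii can be sketched as follows (our own simplification; `attack(x, eps)` is a hypothetical attack interface such as a FoolBox PGD call, not an actual API):

```python
import numpy as np

def closest_boundary_point(predict, attack, x, eps_list):
    """Approximate the nearest decision-boundary crossing by running an
    attack at increasing radii and returning the first adversarial example
    found; None if no boundary is reached within the budget."""
    label = predict(x)
    for eps in sorted(eps_list):
        x_adv = attack(x, eps)
        if predict(x_adv) != label:
            return x_adv          # first success at the smallest radius
    return None

# Toy 1-D classifier with its boundary at 0.3: the attack steps toward it.
predict = lambda z: int(z[0] > 0.3)
attack = lambda z, eps: z + eps * np.array([1.0])
x_adv = closest_boundary_point(predict, attack, np.array([0.0]),
                               [0.1, 0.2, 0.4, 0.8])
print(x_adv)   # [0.4]: the first radius that crosses the boundary
```

In the paper's actual pipeline (Appendix B.2), candidates from multiple attacks are compared and the closest successful one is kept.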
In this section, we examine the relationship between the boundary explanation methods described in the previous section, and also explore some intriguing connections to model robustness.
The proofs of all theorems are given in Appendix A.
Connection to SG. Wang et al. showed that the Smoothed Gradient (SG) for a network is equivalent to the standard saliency map of a smoothed variant of the original model. We build on this relationship, examining how BSM differs from SM on a smoothed model. Dombrowski et al. noted that models with softplus activations approximate such smoothing, giving an exact correspondence for single-layer networks. Combining these insights, we arrive at Theorem 1, which suggests that BSM resembles SM on smoothed models.
Let f be a one-layer network, and when using randomized smoothing, write f_σ for the smoothed model. Let SM be the saliency map for f_σ, and suppose x′ is the closest adversarial example to x; then the difference between SM evaluated at x and at x′ admits an explicit upper bound (see Appendix A).
Although Theorem 1 is restricted to one-layer networks, it provides several insights. First, when randomized smoothing is used, BSM and SM yield more similar results, and so BSM may more closely resemble attributions on robust models.
Second, because SM for a smoothed model is equivalent to SG for a non-smoothed one, SG is likely a better choice relative to SM whenever boundary methods are too expensive.
Attribution Robustness. Recent work has also proposed using the Lipschitz constant of the attribution method to characterize its robustness to adversarial examples, i.e., the difference between the attribution vectors of the target input and its neighbors within a small ball (see Def. 9 in the Appendix). This naturally leads to the following statement.
Suppose f has a λ-robust Saliency Map; then the difference between BSM and SM at x is bounded by λ times the distance from x to the closest adversarial example.
Theorem 2 provides another insight: for networks trained with robust attributions [8, 49], SM is a good approximation to BSM. As prior work has demonstrated the close correspondence between robust prediction and robust attributions , this result suggests that robust training may enable less expensive explanations that closely resemble BSM by relying on SM to approximate it well.
In this section, we first evaluate the "accuracy" of the proposed boundary attributions in terms of their alignment with ground-truth bounding boxes on ImageNet 2012, using ResNet50 models. We find that BIG significantly outperforms all baseline methods (Fig. 3(b)) on multiple quantitative metrics, while providing visually sharper and more concentrated visualizations (Sec. 4.3). We also validate the theorems presented in Sec. 3.3, demonstrating that the difference between boundary attributions and prior methods on an adversarially-trained ResNet50 decreases by an order of magnitude (Sec. 4.2). This shows that in practice, robust training objectives make it possible to approximate the more desirable boundary attributions efficiently with existing gradient-based methods.
In the absence of more general ground truth quality metrics for explanations, we may assume that good explanations localize features that are relevant to the label assigned to the input.
In an image classification task where ground-truth bounding boxes are given, we consider features within the bounding box as more relevant to the label assigned to the image.
The metrics used for our evaluation are: 1) Localization (Loc.) evaluates the intersection of the areas of the bounding box and the pixels with positive attributions; 2) Energy Game (EG) instead computes the portion of attribution scores within the bounding box.
While these two metrics are common in the literature, we propose the following additional metrics: 3) Positive Percentage (PP) evaluates the sum of positive attribution scores over the total (absolute value of) attribution scores within the bounding box; 4) Concentration (Con.) evaluates the sum of distances between the "mass" center of the attributions and each pixel with a positive attribution score within the bounding box.
Higher Loc., EG, and PP, and lower Con., indicate better results. We provide formal details for the above metrics in Appendix B.1.
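For concreteness, the four metrics can be sketched as follows; the formulas here are our plausible reading of the prose above, not the paper's exact definitions (see Appendix B.1 for those):

```python
import numpy as np

def box_metrics(attr, box):
    """Sketch of the four bounding-box metrics.
    attr: 2-D attribution map; box: boolean mask of the bounding box."""
    pos = attr > 0
    loc = (pos & box).sum() / max((pos | box).sum(), 1)    # overlap with box
    pos_part = np.clip(attr, 0, None)
    eg = pos_part[box].sum() / pos_part.sum()              # energy inside box
    pp = pos_part[box].sum() / np.abs(attr[box]).sum()     # positive share
    m = pos_part * box                                     # "mass" inside box
    ys, xs = np.indices(attr.shape)
    cy, cx = (m * ys).sum() / m.sum(), (m * xs).sum() / m.sum()
    con = (m * np.hypot(ys - cy, xs - cx)).sum()           # spread around center
    return loc, eg, pp, con

attr = np.array([[1.0, -0.5], [0.0, 2.0]])
box = np.array([[True, True], [False, True]])
loc, eg, pp, con = box_metrics(attr, box)
print(loc, eg, pp, con)
```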
Experimental setup. We compare BSM and BIG against their non-boundary versions to demonstrate that our boundary attributions are a better fit for explaining non-linear classifiers. We provide comparisons to additional methods in Appendix B.4. We consider 2000 correctly-classified images from ImageNette, a subset of ImageNet, with bounding box area less than 80% of the original source image. When comparing against IG, we use the default "black" baseline consisting of all zeros. Implementation details of the boundary approximation, and the hyperparameters used in our experiments, are included in Appendix B.2.
Results. The results for BSM and BIG are shown in Fig. 3(a) and 3(b). Visualizations of these methods can be found in Fig. 6 and are discussed in detail in Sec. 4.3. We make the following observations about the results in the boxplot: 1) SM and BSM do not differ significantly on any metric for a standard model. We believe the reason is that a single normal vector of the closest decision boundary does not provide enough information to describe how the model places the target instance into its region, as there are numerous piecewise linear boundaries in the vicinity that are relevant; we examine these differences further in Sec. 4.2. 2) BIG is significantly better than IG on all metrics. This result shows that an adversarial example on the closest decision boundary serves as a better baseline than the popular "black" baseline. Our results also show that BIG outperforms all methods in Fig. 3(a), which validates that when many boundaries exist in a local neighborhood, concentrating on the closest one (as BSM does) is not the most effective way to capture the aggregate shape of complex local boundaries. By incorporating more boundaries into the attribution, BIG provides effective, localized explanations for the predictions of piecewise linear classifiers.
This section measures the difference between boundary attributions and their non-boundary versions on robust classifiers, as described in Sec. 3.3. Visualizations of the proposed methods on the robust ResNet50 can be found in Fig. 7. We first validate Theorem 1, which suggests that SM and BSM are more similar on models with randomized smoothing, which are known to be robust.
To obtain meaningful explanations on smoothed models, which are implemented by evaluating the model on a large set of noised inputs, we assume that the random seed is known by the adversarial example generator, and search for perturbations that point towards a boundary on as many of the noised inputs simultaneously as possible.
We do the boundary search for a subset of 500 images, as this computation is significantly more expensive than previous experiments.
Instead of directly computing SM and BSM on the smoothed classifier, we utilize the connection between randomized smoothing and SG (see Theorem 4 in Appendix A); therefore, we compare the difference between SG on the clean inputs and SG on their adversarial examples (referred to as BSG). Details of the experimental setup are given in Appendix B.3, and the results are shown in Fig. 5.
Notably, the trend of the log difference against the standard deviation of the Gaussian noise validates that the qualitative meaning of Theorem 1 holds even for large networks. Next, we validate Theorem 2, which states that models with robust attributions yield boundary attributions that are more similar to non-boundary ones.
Methods proposed in the literature [8, 49] for improving attribution robustness usually require even more resources than PGD training, sometimes requiring expensive second-order methods, and pretrained ImageNet weights produced by these methods are not currently available.
However, even though models trained with PGD do not have state-of-the-art robust attributions, they are still significantly more robust than standard models, and the corresponding pre-trained weights are publicly available. We measure the difference between SM and BSM, and between IG and BIG, in Fig. 3(c) on 1682 images (we exclude those that are not correctly predicted by the robust ResNet50).
This result shows that for the robust ResNet50, SM and IG are remarkably close to BSM and BIG, validating the claim that boundary attributions can be approximated in practice with standard attribution methods on robust models.
Summary. Theorems 1 and 2, and the corresponding empirical results given in this section, aim to highlight the following observation. With a “standard” (non-robust) model, training is less costly, but when effective explanations are needed, more resources are required to compute boundary-based attributions. However, if more resources are devoted to training robust models, effectively identical explanations can be obtained using much less costly standard gradient-based methods.
Visualizations of the proposed methods and comparisons with SM and IG are shown in Fig. 6 and 7. We make the following observations. First, BSM is more similar to SM on a robust model (Fig. 7) as discussed in Sec. 4.2. Second, BIG provides high-quality visualizations even on “standard” non-robust models (Fig. 6). The visualizations have significantly less noise, and focus on the relevant features more sharply than attributions given by all other compared methods. Notably, BIG successfully localized importance on a very small region in Fig. 6, apparently containing a parachute, missed by all other methods. Finally, BIG on robust models (Fig. 7) provides some insights about why certain instances are top-5, but not top-1, correct. For example, BIG in the 7th “dog” instance shows that the model focuses primarily on the subject’s legs. In this case, it may be that the visual features on this region of the image are sufficient to distinguish it as containing some breed of dog, but are insufficient to distinguish between several related breeds.
BIG is motivated by incorporating more local decision boundaries than BSM alone can capture, while IG is motivated by solving gradient vanishing in SM in an axiomatic way. A natural byproduct, however, is that BIG frees users from baseline selection when explaining non-linear classifiers. Empirical evidence has shown that IG is sensitive to the baseline input. We compare BIG with IG when using different baseline inputs: white or black images. We show an example in Fig. 8. For the first two images, when the baseline input is the opposite color of the dog, more pixels on the dog receive non-zero attribution scores, whereas the background receives more attribution when the baseline input has the same color as the dog. This is because (see Def. 2) greater differences between an input feature and the baseline feature can lead to higher attribution scores. The third example leaves the reader, depending on the baseline used for IG, uncertain whether the network is using the white dog to predict Labrador retriever. We demonstrate that such conflicts in IG, caused by sensitivity to the baseline, can be resolved by BIG: BIG shows that the black dog in the last row is more important for predicting Labrador retriever, and this conclusion is further validated by our counterfactual experiment in Appendix C. Overall, this discussion highlights that BIG is significantly better than IG at reducing unnecessary sensitivity to baseline selection.
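The baseline sensitivity is easy to reproduce even for a linear model, where IG reduces to (x − baseline) ⊙ w: a feature identical to its baseline value always receives zero attribution. A toy example of our own construction:

```python
import numpy as np

def ig(grad_f, x, baseline, steps=100):
    """Standard IG via a midpoint Riemann sum."""
    ts = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + t * (x - baseline)) for t in ts])
    return (x - baseline) * grads.mean(axis=0)

# Linear score w.x on a 2-"pixel" image, one white (1.0) and one black (0.0),
# with both pixels equally weighted by the model.
w = np.array([1.0, 1.0])
grad_f = lambda z: w
x = np.array([1.0, 0.0])                   # [white pixel, black pixel]
a_black = ig(grad_f, x, np.zeros(2))
a_white = ig(grad_f, x, np.ones(2))
print(a_black, a_white)   # [1. 0.] vs. [0. -1.]
```

The black baseline assigns the black pixel zero importance, while the white baseline does the same to the white pixel, despite both pixels being equally weighted by the model. Replacing the baseline with a nearby boundary point, as BIG does, removes this arbitrary choice.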
We propose a new set of explanation methods, boundary attributions, as a step beyond the score attributions in the literature [4, 11, 16, 26, 30, 36, 37, 38, 40, 41, 44]. Among these methods, SM, IG, and SG are proven to satisfy several axioms [43, 52, 49], to be invariant to network architecture, and to be sensitive to the network's trainable weights. We argue that a score attribution method is a good fit for models where the actual output score is of interest, e.g., regression, while a boundary attribution is instead built to understand how an instance is placed on one side of the decision boundaries, and is therefore a better fit for classification tasks. In evaluating the proposed methods, we choose metrics based on bounding boxes over other metrics because, for classification, we are interested in whether the network associates relevant features with the label, while other metrics [1, 2, 35, 48, 52], e.g., infidelity, mainly evaluate whether output scores are faithfully attributed to each feature. Our idea of incorporating boundaries into explanations may generalize to other score attribution methods, e.g., Distributional Influence and DeepLIFT. The idea of using boundaries in explanations has also been explored by T-CAV, where a linear decision boundary is learned for the internal activations and associated with their proposed notion of concept. In this work, we consider decision boundaries and adversarial examples in the entire input space, whereas other work focuses on counterfactual examples on the data manifold [6, 12, 18].
In summary, we rethink the question an explanation should answer for a classification task: what are the important features that the classifier uses to place the input on a specific side of the decision boundary? We find that the answer relates to the normal vectors of decision boundaries in the neighborhood of the input, and propose BSM and BIG as boundary attribution approaches. Empirical evaluations on state-of-the-art classifiers validate that our approaches provide more concentrated, sharper, and more accurate explanations than existing approaches. Our idea of leveraging boundaries to explain classifiers connects explanations with adversarial robustness, and encourages the community to improve model robustness for better explanation quality.
This work was developed with the support of NSF grant CNS-1704845 as well as by DARPA and the Air Force Research Laboratory under agreement number FA8750-15-2-0277. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA, the Air Force Research Laboratory, the National Science Foundation, or the U.S. Government.
Proceedings of the 36th International Conference on Machine Learning. Cited by: §1.
2017 IEEE International Conference on Computer Vision (ICCV), pp. 3449–3457. Cited by: §1, §6.
Captum: a unified and generic model interpretability library for PyTorch. Cited by: §B.2.
Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations. Cited by: §3.2, §4.2.
Foolbox Native: fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX. Journal of Open Source Software 5 (53), pp. 2607. Cited by: §B.2.
Score-CAM: score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25. Cited by: §B.1, §4.1.
Theorem 1 (restated). Let f be a one-layer network, and when using randomized smoothing, write f_σ for the smoothed model. Let SM be the saliency map for f_σ, and suppose x′ is the closest adversarial example to x; then the difference between SM evaluated at x and at x′ admits an explicit upper bound, proved below.
Before we start the proof, we first introduce Randomized Smoothing and the theorem that certifies its robustness.
Suppose , the smoothed classifier is defined as
Suppose the classifier and its smoothed version are defined as in Def. 8. For a target instance, suppose the probability of the most likely class is lower-bounded and that of the runner-up class is upper-bounded; then
and Φ is the c.d.f. of the standard Gaussian.
Next, we introduce a theorem that connects Randomized Smoothing with the Smoothed Gradient.
Suppose a model satisfies . For a Smoothed Gradient , we have
where ∗ denotes the convolution operation.
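The convolution identity can be checked numerically: averaging input gradients under Gaussian noise approximates the gradient of the Gaussian-convolved function. A minimal Monte-Carlo sketch of our own, on a 1-D ReLU score:

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_grad(grad_f, x, sigma, n=200_000):
    """Monte-Carlo Smoothed Gradient: average input gradients over
    Gaussian perturbations of x."""
    noise = rng.normal(0.0, sigma, size=n)
    return grad_f(x + noise).mean()

# For f(x) = relu(x), the smoothed model's derivative at 0 equals
# P(noise > 0) = 0.5; the Monte-Carlo average converges to it.
grad_relu = lambda z: (z > 0).astype(float)
sg = smooth_grad(grad_relu, 0.0, sigma=0.5)
print(sg)   # ≈ 0.5
```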
Finally, we introduce two theorems that connect the Smoothed Gradient with softplus networks.
Suppose is a feed-forward network with softplus- activation and if , then
is the weight for each layer and the distance is the geodesic distance (which in our case is just the l2 distance).
Denote a one-layer ReLU network as and a one-layer softplus network as , then the following statement holds:
We now begin our proof for Theorem 1.
Given a one-layer ReLU network that takes an input x and outputs the logit score for the class of interest c, WLOG we assume w is the c-th column of the complete weight matrix. With Theorem 6 we know that
Therefore, we have the following relation
The LHS of the above equation is the Smoothed Gradient, or equivalently, the Saliency Map of a smoothed classifier due to Theorem 4. Eqs. 7 and 9 show that we can analyze randomized smoothing with the tool of an intermediate softplus network.
Now, applying Theorem 5, we have
where x′ is the closest adversarial example. Because the smoothed classifier is certified to be robust within a neighborhood of radius R, the closest decision boundary is at least distance R away from the evaluated point. We then have
We use an approximate equality because we approximate the randomized smoothing with a similar distribution. In fact, a more rigorous proof exists when considering distributions other than the Gaussian; we refer interested readers to Yang et al.
Theorem 2 (restated). Suppose f has a λ-robust Saliency Map; then the difference between BSM and SM at x is bounded by λ times the distance from x to the closest adversarial example.
We first introduce the definition of attribution robustness as follows:
An attribution method g is (λ, δ)-locally robust at the evaluated point x if ||g(x′) − g(x)|| ≤ λ ||x′ − x|| for all x′ in the δ-neighborhood of x.
We will use the following extra notations in this section. Let , and be a set of indices of all pixels, a set of indices of pixels with positive attributions, and a set of indices of pixels inside the bounding box for a target attribution map . We denote the cardinality of a set as .
Localization (Loc.) evaluates the intersection of the areas of the bounding box and the pixels with positive attributions.
For a given attribution map , the localization score (Loc.) is defined as
Energy Game (EG) instead computes the portion of attribution scores within the bounding box.
For a given attribution map , the energy game EG is defined as
Positive Percentage (PP) evaluates the sum of positive attribution scores over the total (absolute value of) attribution scores within the bounding box.
Let be the set of indices of all pixels with negative attribution scores. For a given attribution map , the positive percentage PP is defined as
evaluates the sum of distances, weighted by the attribution “mass”, between the “mass” center of the attributions and each pixel with positive attribution scores within the bounding box. Notice that the computation of and can be done with scipy.ndimage.center_of_mass.
For a given attribution map , the concentration Con. is defined as follows
where are the coordinates of the pixel and
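Since the displayed formulas did not survive extraction, the following sketch implements the four metrics from their textual descriptions; the exact normalizations (e.g., an IoU-style denominator for Loc.) are our reading, not necessarily the paper's:

```python
import numpy as np
from scipy import ndimage

def bbox_metrics(attr, box):
    """Bounding-box metrics for an attribution map `attr` (H x W).
    `box` is a boolean mask of the same shape marking the bounding box.
    Formulas follow one plausible reading of the textual definitions."""
    pos = attr > 0
    # Loc.: overlap between the box and positively-attributed pixels (IoU-style).
    loc = (pos & box).sum() / max((pos | box).sum(), 1)
    # EG: positive attribution mass inside the box over total absolute mass.
    eg = attr[pos & box].sum() / max(np.abs(attr).sum(), 1e-12)
    # PP: share of in-box attribution mass that is positive.
    in_box = attr[box]
    pp = in_box[in_box > 0].sum() / max(np.abs(in_box).sum(), 1e-12)
    # Con.: mass-weighted distance of in-box positive pixels to the mass
    # center, computed with scipy.ndimage.center_of_mass as noted above.
    cy, cx = ndimage.center_of_mass(np.clip(attr, 0, None))
    ys, xs = np.nonzero(pos & box)
    d = np.hypot(ys - cy, xs - cx)
    con = (attr[ys, xs] * d).sum()
    return loc, eg, pp, con
```

An attribution map whose positive mass sits in a single in-box pixel scores EG = PP = 1 and Con. = 0, the most concentrated case under this reading.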
Besides metrics related to bounding boxes, there are other metrics in the literature used to evaluate attribution methods [1, 2, 35, 48, 52]. We focus on metrics that use provided bounding boxes, as we believe that they offer a clear distinction between likely relevant features and irrelevant ones, which evaluates how well we answer the question defined in Sec. 1.
[Table 1: boundary search results — pipeline, average distance, success rate, and computation time, for standard and robust ResNet50.]
Our boundary search uses a pipeline of PGDs and CW, where PGDs denotes repeating the PGD attack with a series of epsilons until a closest adversarial example is found. Adversarial examples returned by PGDs are further compared with those from CW, and the closer ones are returned. The average distances of the found adversarial examples, the success rates of the attacks, and the computation times are included in the first two rows of Table 1. If an adversarial example is not found, the pipeline returns the point from the last iteration of the first method (PGDs in our case). For PGDs, we use the following list of : . For each in the list, we run at most 40 iterations with a step size equal to . For CW, we set and run 100 iterations with a step size of 1.e-3. All attacks are conducted with FoolBox [33, 32]. The result of the boundary search is shown in Table 1. All computations are done on a Titan RTX GPU accelerator with 24 GB of memory.
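The PGDs stage of the pipeline can be sketched on a toy linear classifier; this is an illustration of the epsilon-schedule logic (function names and the 2.5·eps/steps step-size heuristic are our choices), not the actual FoolBox-based implementation:

```python
import numpy as np

def pgd_linf(x, w, eps, steps=40):
    """L-inf PGD against a linear binary classifier f(x) = w.x + b > 0.
    For a linear model the input gradient of the logit is just w."""
    adv = x.copy()
    for _ in range(steps):
        adv = adv - (2.5 * eps / steps) * np.sign(w)   # step against the gradient
        adv = np.clip(adv, x - eps, x + eps)           # project into the eps-ball
    return adv

def boundary_search(x, w, b, eps_schedule):
    """Repeat PGD with growing eps until the prediction flips (the PGDs
    stage); with an increasing schedule, the first success is kept."""
    for eps in eps_schedule:
        adv = pgd_linf(x, w, eps)
        if (w @ adv + b) * (w @ x + b) < 0:            # prediction flipped
            return adv
    return None                                        # attack failed

def closer_to(x, a, b_):
    """PGDs-vs-CW comparison: keep whichever candidate is nearer to x."""
    if a is None: return b_
    if b_ is None: return a
    return a if np.linalg.norm(a - x) <= np.linalg.norm(b_ - x) else b_
```

In the real pipeline `boundary_search` runs FoolBox's PGD on the network and `closer_to` merges the PGDs result with the CW result.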
For BIG and IG, we use 10 intermediate points between the baseline and the input, and the interpolation method is set to riemann_trapezoid. To visualize the attribution map, we use the HeatmapVisualizer with blur=10, normalization_type="signed_max", and default values for the other keyword arguments from Trulens.
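The riemann_trapezoid interpolation amounts to averaging path gradients with trapezoid weights; a minimal numpy sketch (the function name and `grad_fn` callback are ours, the real computation is done inside Trulens):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n=10):
    """IG with n points on the straight path from baseline to x, using the
    trapezoid rule (endpoints weighted by 1/2), then scaled by (x - baseline)."""
    alphas = np.linspace(0.0, 1.0, n)
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    avg = ((grads[0] + grads[-1]) / 2 + grads[1:-1].sum(axis=0)) / (n - 1)
    return (x - baseline) * avg
```

For f(x) = sum(x**2), whose gradient 2x is linear along the path, the trapezoid rule is exact and IG recovers x**2 per coordinate.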
To generate adversarial examples for the smoothed classifier of ResNet50 with randomized smoothing, we need to back-propagate through the noise. The noise sampler is usually not accessible to an attacker who wants to fool a model protected by randomized smoothing; however, our goal in this section is not to reproduce a practically realizable attack, but to find a point on the boundary. We therefore sample the noise prior to running the PGD attack, and we use the same noise across all instances. The steps are as follows:
We use numpy.random.randn as the sampler for Gaussian noise with its random seed set to 2020. We use 50 random noises per instance.
In PGD attack, we aggregate the gradients of all 50 random inputs before we take a regular step to update the input.
We set and we run at most 40 iterations with a step size of .
The early-stop criterion for the PGD loop is triggered when fewer than 10% of the randomized points retain the original prediction.
When computing Smooth Gradient for the original points or for the adversarial points, we use the same random noise that we generated to approximate the smoothed classifier.
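The steps above can be sketched as follows on a toy differentiable classifier; this is an illustration of the fixed-noise aggregation and the 10% early-stop rule (the callback names and the step-size heuristic are our assumptions, not the paper's ResNet50 code):

```python
import numpy as np

def smoothed_pgd(logit_grad, predict, x, sigma=0.25, n_noise=50,
                 eps=1.0, steps=40, seed=2020):
    """PGD against a randomized-smoothing classifier: sample the Gaussian
    noise once up front (fixed seed, as in the text), reuse it at every
    step, aggregate gradients over all noisy copies, and stop early once
    fewer than 10% of the noisy points keep the original prediction."""
    rng = np.random.RandomState(seed)
    noise = rng.randn(n_noise, *x.shape) * sigma      # sampled once, reused
    orig = predict(x)
    adv = x.copy()
    for _ in range(steps):
        # aggregate gradients over all noisy copies before taking a step
        g = np.mean([logit_grad(adv + n) for n in noise], axis=0)
        adv = np.clip(adv - (2.5 * eps / steps) * np.sign(g), x - eps, x + eps)
        keep = np.mean([predict(adv + n) == orig for n in noise])
        if keep < 0.10:                               # early-stop criterion
            break
    return adv
```

Reusing the same `noise` array when later computing Smooth Gradient (step above) guarantees that the attacked classifier and the explained classifier are the same smoothed function.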
We use the PGDs+CW pipeline to find adversarial examples for a robustly trained ResNet50. Given that the model is robust, we anticipate that the boundary is farther away than in a standard model; we therefore make the following changes compared to the setup discussed for the standard model. We change the list of for PGDs to and the step size to 1.e-1 for each . For CW, we change to , the step size to 5.e-3, and we run at most 100 iterations. The result of the boundary search is shown in Table 1. All computations are done on a Titan RTX GPU accelerator with 24 GB of memory.
Besides Integrated Gradients (IG) and Saliency Map, we compare BIG and BSM with other attribution methods. We restrict our comparisons to methods that use gradients or an approximation of gradients; therefore, we include GI (Saliency Map multiplied elementwise by the input), SG (Smoothed Gradient), and DeepLIFT. We use ResNet50 and 2000 images from ImageNette with bounding boxes; the results are shown in Fig. 9. BIG remains the better method compared with these extra baseline attributions (GI, SG, and DeepLIFT).
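For reference, the two gradient-based baselines admit one-line definitions; a minimal sketch assuming a `grad_fn` callback that returns the input gradient of the class logit (the function names here are ours):

```python
import numpy as np

def gradient_input(grad_fn, x):
    """GI: the Saliency Map (input gradient) multiplied elementwise by the input."""
    return grad_fn(x) * x

def smooth_grad(grad_fn, x, sigma=0.15, n=50, seed=0):
    """SG: average the input gradient over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    return np.mean([grad_fn(x + rng.normal(0, sigma, x.shape))
                    for _ in range(n)], axis=0)
```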
The discussion in Sec. 5 shows an example where there are two dogs in the image. IG with a black baseline shows that the body of the white dog is also useful for the model to predict its label, while the black dog is mixed: part of the black dog has positive attributions and the rest contributes negatively to the prediction. Our proposed method BIG, however, clearly shows that the most important region is the black dog, followed by the white dog. To validate whether the model is actually using the white dog, we manually remove the black dog or the white dog from the image and check whether the model retains its prediction. The result is shown in Fig. 10. When removing the black dog, the model changes its prediction from Labrador retriever to English foxhound, while removing the white dog does not change the prediction. This result supports the claim that, for this instance, BIG is a more faithful explanation of the classification result than IG with a black baseline.