Introduction
To deploy machine learning models in safety-critical applications like healthcare and autonomous driving, it is desirable that such models reliably explain their predictions. As neural network architectures become increasingly complex, explanations of a model's predictions or behavior become even more important for developing trust and transparency with end users. For example, if a model predicts a given pathology image to be benign, a doctor might want to investigate further to know which features or pixels in the image led the model to this classification.
Though there exist numerous efforts in the literature on constructing adversarially robust models Qin et al. (2019); Wang et al. (2020); Zhang et al. (2019); Madry et al. (2018); Chan et al. (2020); Xie et al. (2019a), surprisingly little work has been done on the robustness of the explanations generated by a model. One aspect of the genuineness of a model is that it produces very similar interpretations for two very similar, human-indistinguishable images on which the model predictions are the same. Ghorbani et al. (2019) demonstrated that it is possible to craft changes in an image which are imperceptible to a human, but can induce a huge change in attribution maps without affecting the model's prediction. Hence, building models that are robust against the attacks proposed in Ghorbani et al. (2019) is very important for increasing the faithfulness of such models to end users. Fig 1 visually explains the vulnerability of adversarially robust models to attribution-based attacks and how attributional training (à la adversarial training) addresses this to a certain extent. Such findings imply the need to explore effective strategies to improve attributional robustness.
The limited efforts until now on attributional training rely on minimizing the change in attribution due to a human-imperceptible change in input Chen et al. (2019) or maximizing similarity between input and attribution map Kumari et al. (2019). We instead propose a new methodology for attributional robustness that is based on empirical observations. Our studies revealed that attributional robustness is negatively affected when: (i) an input pixel has a high attribution for a negative class (non-ground-truth class) during training; (ii) an attribution map corresponding to the positive or true class is uniformly distributed across the given image, instead of being localized on a few pixels; and (iii) the change of attribution, due to a human-imperceptible change in the input image (without changing the predicted class label), is higher for a pixel with low attribution than for a pixel with high attribution (since this leads to significant changes in attribution). Based on these observations, we propose a new training procedure for attributional robustness that addresses each of these concerns and outperforms existing attributional training methods. Our methodology is inspired by the rigidity (and non-amorphous nature) of objects in images and leverages the fact that the number of true class pixels is often small compared to the total number of pixels in an image, resulting in a non-uniform (or skewed) pixel distribution of the true class attribution across spatial locations in the image. Complementarily, for the most confusing negative class, ensuring that the attribution or saliency map is not localized helps indirectly improve the localization of the attribution of the true class.

The key contributions of our work can be summarized as follows: (i) We propose a class attribution-based contrastive regularizer for attributional training which forces the true class attribution to assume a skewed distribution, reflecting the fact that few input pixels attribute highly compared to other pixels for a true class prediction. Through this regularizer, we also drive pixels of the attribution map corresponding to the negative class to behave uniformly (equivalent to not localizing); (ii) We also introduce an attribution change-based regularizer to weight the change in attribution of a pixel due to an indistinguishable change in the image (to the best of our knowledge, neither of these two contributions has been attempted before); (iii) We provide detailed experimental results of our method on different benchmark datasets, including MNIST, Fashion-MNIST, Flower and GTSRB, and obtain state-of-the-art results for attributional robustness across these datasets.
An illustrative preview of the effectiveness of our method is shown in Fig 1. The last column shows the utility of our method, where the attribution/saliency map (we use these terms interchangeably in this work) appears stronger and is minimally affected by the perturbation. The middle column shows a recent state-of-the-art method, Chen et al. (2019), whose attribution shows signs of breakdown under the perturbation.
Related Work
We divide our discussion of related work into subsections that capture earlier efforts that are related to ours from different perspectives.
Adversarial Robustness: The possibility of fooling neural networks by crafting visually imperceptible perturbations was first shown by Szegedy et al. (2013). Since then, we have seen extensive efforts over the last few years in the same direction. Goodfellow et al. (2015) introduced the one-step Fast Gradient Sign Method (FGSM) attack, which was followed by more effective iterative attacks such as Kurakin et al. (2016), the PGD attack Madry et al. (2018), the Carlini-Wagner attack Carlini and Wagner (2017), the Momentum Iterative attack Dong et al. (2018), the Diverse Input Iterative attack Xie et al. (2019b), the Jacobian-based saliency map approach Papernot et al. (2016), etc. A parallel line of work has also emerged on finding strategies to defend against stronger adversarial attacks, such as Adversarial Training Madry et al. (2018), Adversarial Logit Pairing Kannan et al. (2018), Ensemble Adversarial Training Tramèr et al. (2018), Parseval Networks Cisse et al. (2017), Feature Denoising Training Xie et al. (2019a), Latent Adversarial Training Kumari et al. (2019), Jacobian Adversarial Regularizer Chan et al. (2020), Smoothed Inference Nemcovsky et al. (2019), etc. The recent work of Zhang et al. (2019) explored the trade-off between adversarial robustness and accuracy.

Interpretability Methods: Work on robust attributions builds on methods for generating neural network attributions, which is itself an active area of research. These methods aim to compute the importance of input features based on the prediction function's output. Recent efforts in this direction include gradient-based methods Simonyan et al. (2013); Shrikumar et al. (2016); Sundararajan et al. (2017), propagation-based techniques Bach et al. (2015); Shrikumar et al. (2017); Zhang et al. (2016); Nam et al. (2019) and perturbation-based methods Zeiler and Fergus (2014); Petsiuk et al. (2018); Ribeiro et al. (2016). Another recent work Dombrowski et al. (2019) developed a smoothed explanation method that can resist manipulations, while our work aims to develop a training method (not an explanation method) that is resistant to attributional attacks. Our work is based on integrated gradients Sundararajan et al. (2017), which has often been used as a benchmark method Chen et al. (2019); Singh et al. (2019), is theoretically well-founded on axioms of attribution, and has shown strong empirical performance.
Attributional Attacks and Robustness: The perspective that neural network interpretations can be broken by modifying the saliency map significantly with imperceptible input perturbations, while preserving the classifier's prediction, was first investigated recently in Ghorbani et al. (2019). While the area of adversarial robustness is well explored, little progress has been made on attributional robustness, i.e. finding models with robust explanations. Chen et al. (2019) recently proposed a training methodology which achieves current state-of-the-art attributional robustness results. It showed that the attributional robustness of a model can be improved by minimizing the change in input attribution w.r.t. an imperceptible change in input. Our work is closest to this work. In Chen et al. (2019), however, an equal attribution change on two different input pixels is treated equally, irrespective of the original pixel attribution. This is not ideal, as a pixel with high initial attribution may need to be preserved more carefully than a pixel with low attribution. Chen et al. (2019) has another drawback: it only considers input pixel attribution w.r.t. the true class, and does not inspect the effect w.r.t. other negative classes. We find this to be important, and address both these issues in this work. Another recent method Singh et al. (2019) tried to achieve better attributional robustness by encouraging the observation that a meaningful saliency map of an image should be perceptually similar to the image itself. This method fails specifically for images where the true class contains objects darker than the rest of the image, or where there are bright pixels anywhere in the image outside the object of interest. In both these cases, enforcing similarity of the saliency map to the original image shifts the attribution away from the true class. We compare our method against both these methods in our experiments.

Background and Preliminaries
Our proposed training method requires computation of input pixel attribution through the Integrated Gradients (IG) method Sundararajan et al. (2017), which we use for consistency and fair comparison with the other closely related methods Chen et al. (2019); Singh et al. (2019). IG provides axiomatic attribution to different input features in proportion to their influence on the output. In practice, IG is approximated by constructing a sequence of images interpolating from a baseline to the actual image and then averaging the gradients of the neural network output across these images, as shown below:

$$IG_i^{f}(x_0, x) = (x_i - x_{0i}) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f\left(x_0 + \frac{k}{m}(x - x_0)\right)}{\partial x_i} \qquad (1)$$

Here $f$ represents a deep network with $C$ as the set of class labels, $x_0$ is a baseline image with all black pixels (zero intensity value), $i$ is the pixel location on input image $x$ for which IG is being computed, and $m$ is the number of interpolation steps.
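As a minimal illustration of the approximation described above (interpolate from a baseline to the input, average the gradients, scale by the input difference), the following NumPy sketch computes IG for a toy function with a known analytic gradient. The names `integrated_gradients`, `toy_model` and `toy_grad` are hypothetical and introduced only for this example; a real network would supply its gradient through automatic differentiation instead.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Riemann-sum approximation of IG along the straight path baseline -> x.

    grad_fn(point) must return the gradient of the scalar model output
    with respect to the input, evaluated at `point`.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    grad_sum = np.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)  # interpolated input
        grad_sum += grad_fn(point)
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grad_sum / steps

# Toy stand-in for a class score (an assumption, not a neural network):
# f(x) = sum_i w_i * x_i^2, whose gradient is 2 * w * x.
w = np.array([1.0, -2.0, 0.5])

def toy_model(x):
    return float(np.sum(w * x ** 2))

def toy_grad(x):
    return 2.0 * w * x

x = np.array([1.0, 2.0, 3.0])
x0 = np.zeros_like(x)  # all-zero ("black") baseline
ig = integrated_gradients(toy_grad, x, x0, steps=200)

# Completeness check: attributions should sum approximately to f(x) - f(x0).
gap = abs(ig.sum() - (toy_model(x) - toy_model(x0)))
print(ig, gap)
```

The completeness gap shrinks as `steps` grows, which is one reason the number of interpolation steps is treated as a hyperparameter.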
Adversarial Attack: We evaluate the robustness of our model against two kinds of attacks, viz. adversarial attack and attributional attack, each of which is introduced here. The goal of an adversarial attack is to find the minimum perturbation in the input space of x (i.e. input pixels for an image) that results in maximal change in the classifier ($f$) output. In this work, to test the adversarial robustness of a model, we use one of the strongest adversarial attacks, Projected Gradient Descent (PGD) Madry et al. (2018), which is considered a benchmark for adversarial accuracy in other recent attributional robustness methods Chen et al. (2019); Singh et al. (2019). PGD is an iterative variant of the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015). PGD adversarial examples are constructed by iteratively applying FGSM and projecting the perturbed output onto a valid constrained space $S$. The PGD attack is formulated as follows:

$$x^{t+1} = \Pi_{x+S}\left(x^{t} + \alpha \, \mathrm{sign}\left(\nabla_{x} L(\theta, x^{t}, y)\right)\right) \qquad (2)$$

Here, $\theta$ denotes the classifier parameters; input and output are represented as x and y respectively; the classification loss function is $L(\theta, x, y)$; $\alpha$ is the step size; and $\Pi_{x+S}$ denotes projection onto the constrained space around x. Usually, the magnitude of the adversarial perturbation is constrained in an $\ell_p$ norm ball ($p \in \{0, 2, \infty\}$) to ensure that the adversarially perturbed example is perceptually similar to the original sample. Note that $x^{t}$ denotes the perturbed sample at the $t$-th iteration.

Attributional Attack: The goal of an attributional attack is to devise visually imperceptible perturbations that change the interpretability of the test input maximally while preserving the predicted label. To test the attributional robustness of a model, we use the Iterative Feature Importance Attack (IFIA) in this work. As Ghorbani et al. (2019) convincingly demonstrated, IFIA generates minimal perturbations that substantially change model interpretations, while keeping the predictions intact. The IFIA method is formally defined as below:
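$$\arg\max_{\delta} \; D\left(I(x; f), \, I(x + \delta; f)\right) \qquad (3)$$

subject to: $\|\delta\|_{\infty} \leq \epsilon$

such that: $\mathrm{Prediction}(x + \delta; f) = \mathrm{Prediction}(x; f)$

Here, $\delta$ is the perturbation, $\epsilon$ bounds its magnitude, and $I(x; f)$ is a vector of attribution scores over all input pixels when an input image x is presented to a classifier network $f$ parameterized by $\theta$. $D(I(x; f), I(x + \delta; f))$ measures the dissimilarity between the attribution vectors $I(x; f)$ and $I(x + \delta; f)$. In our work, we choose $D$ as Kendall's correlation computed on the top-$k$ pixels as in Ghorbani et al. (2019). We describe this further in the Appendix due to space constraints.

Proposed Methodology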
We now discuss our proposed robust attribution training strategy in detail, which: (i) restricts the true class attribution to a sparse heatmap and the negative class attribution to a uniform distribution across the entire image; and (ii) makes the penalty on the pixel attribution change caused by an imperceptible perturbation of the input image depend on the actual importance of the pixel in the original image. Both these objectives are achieved through the use of a regularizer in our training objective. The standard multi-class classification setup is considered, where input-label pairs $(x, y)$ are sampled from a training data distribution $\mathcal{D}$ with a neural network classifier $f$, parametrized by $\theta$. Our goal is to learn $\theta$ that provides better attributional robustness to the network.
Considering Negative Classes: We observe that “good” attributions generated for a true class form a localized (and sparse, considering the number of pixels in the full image) heatmap around a given object in the image (assuming an image classification problem setting). Correspondingly, we would like the attribution of the most confusing/uncertain class to not be localized, i.e. to resemble a uniform distribution across pixels in an image. As stated earlier, this hypothesis is inspired by the rigidity (and non-amorphous nature) of objects in images. To this end, we define the Maximum Entropy Attributional Distribution as a discrete uniform distribution in input pixel space, $P_{ME}$, where the attribution score of each input pixel is equal to $\frac{1}{n}$, with $n$ the total number of pixels. We also define a True Class Attributional Distribution ($P_{TCA}$) as a distribution of attributions over input pixels for the true class output, denoted by $f^{TCI}$, when provided the perturbed image $x'$ as input. Note that attributions are implemented using the IG method (as described in Sec Background and Preliminaries), and hence $P_{TCA}$ averages the gradients of the classifier's true class output when the input is varied from $x_0$ to $x'$. We also note that IG is simply a better estimate of the gradient, and hence can be computed w.r.t. every output class (we compute it for the true class here). Here $x_0$ is a baseline reference image with all zero pixels, and $x'$ represents the perturbed image. $x'$ is chosen randomly within an $\ell_\infty$ norm ball around a given input x. We represent $P_{TCA}$ then as:

$$P_{TCA}(x) = \mathrm{softmax}\left(IG^{f^{TCI}}(x_0, x')\right) \qquad (4)$$

where $IG^{f^{TCI}}$ is computed for every pixel in x, and the softmax is applied over all pixels in x, i.e. $\mathrm{softmax}(u)_i = \exp(u_i) / \sum_{p \in x} \exp(u_p)$.

In a similar fashion, we define a Negative Class Attributional Distribution ($P_{NCA}$), where IG is computed for the most confusing negative class (i.e. the class label with the second highest probability) in a multi-class setting, or simply the negative class in a binary setting. Denoting the classifier output for this class by $f^{NCI}$, $P_{NCA}$ is given by:

$$P_{NCA}(x) = \mathrm{softmax}\left(IG^{f^{NCI}}(x_0, x')\right) \qquad (5)$$
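We now define our Class Attribution-based Contrastive Regularizer (CACR) as:

$$L_{CACR} = KL\left(P_{ME} \,\|\, P_{NCA}(x)\right) - KL\left(P_{ME} \,\|\, P_{TCA}(x)\right) \qquad (6)$$

where $KL$ stands for KL-divergence, $P_{ME}$ is the maximum entropy (uniform) attributional distribution, and $P_{TCA}$ and $P_{NCA}$ are the true class and negative class attributional distributions defined above. We show later in this section how CACR is integrated into the overall loss function being minimized. CACR enforces skewness in the attribution map corresponding to the true class, across an input image, through the “$-KL(P_{ME} \,\|\, P_{TCA}(x))$” term, and a uniform attribution map corresponding to the most confusing negative class through the “$KL(P_{ME} \,\|\, P_{NCA}(x))$” term. The skewness in the case of the true class forces the learner to focus on a few pixels in the image. This regularizer induces contrastive learning in the training process, which is favorable to attributional robustness, as we show in our results.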
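The contrastive behavior described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the attribution maps are passed in as plain arrays (in training they would come from IG), the KL-divergence is taken from the uniform distribution to each attributional distribution, and the helper names `softmax`, `kl_uniform_to` and `cacr` are hypothetical.

```python
import numpy as np

def softmax(v):
    """Softmax over all pixels of a (flattened) attribution map."""
    v = np.asarray(v, dtype=float).ravel()
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

def kl_uniform_to(p):
    """KL(U || p), where U is the uniform distribution over len(p) pixels."""
    n = p.size
    return float(np.sum((1.0 / n) * (np.log(1.0 / n) - np.log(p))))

def cacr(attr_true, attr_neg):
    """Assumed form: pull the negative-class map towards uniform,
    push the true-class map away from uniform (i.e. make it skewed)."""
    p_tca = softmax(attr_true)
    p_nca = softmax(attr_neg)
    return kl_uniform_to(p_nca) - kl_uniform_to(p_tca)

peaked = np.array([5.0, 0.0, 0.0, 0.0])  # attribution concentrated on one pixel
flat = np.zeros(4)                       # attribution spread evenly

good = cacr(attr_true=peaked, attr_neg=flat)  # desired configuration
bad = cacr(attr_true=flat, attr_neg=peaked)   # undesired configuration
print(good, bad)
```

With these toy inputs, the desired configuration (localized true class, diffuse negative class) yields a lower regularizer value than the reversed one, which is the direction a minimizer would favor.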
Enforcing Attribution Bounds: If a pixel has a positive (or negative) attribution towards the true class prediction, it may be acceptable if a perturbation makes the attribution more positive (or more negative, respectively). In other words, we would like the original pixel attribution to serve as a lower bound for a positively attributed pixel, or an upper bound for a negatively attributed pixel. If this is violated, it is likely that the attribution map may change. To implement this idea, we define $A(x)$ as the base attribution, i.e. the attribution computed using the standard IG method w.r.t. the true class for the input image x, given by:

$$A(x) = IG^{f^{TCI}}(x_0, x) \qquad (7)$$

Similarly, we define $\Delta A(x)$ as the change in attribution w.r.t. the true class given the perturbed image $x'$, i.e. (a similar definition is also used in Chen et al. (2019)):

$$\Delta A(x) = IG^{f^{TCI}}(x_0, x') - IG^{f^{TCI}}(x_0, x) \qquad (8)$$
The above-mentioned desideratum necessitates that the sign of every element of $A \odot \Delta A$, where $A$ is the base attribution, $\Delta A$ is the change in attribution and $\odot$ is the element-wise/Hadamard product, be positive across all pixels. To understand this better, consider the $i$-th pixel of $A$ to be positive (negative). This implies that the pixel is positively (negatively) affecting the classifier's true class prediction. In such a case, we would like the $i$-th component of $\Delta A$ also to be positive (negative), i.e. it further increases the magnitude of attribution in the same direction (positive/negative, respectively) as before.

However, even when $A \odot \Delta A$ is positive for a pixel, we argue that an equal amount of change in attribution on a pixel with higher base attribution is more costly compared to a pixel with lower base attribution, i.e. we would also want the magnitude of each element in $A \odot \Delta A$ to be low, in addition to the overall sign being positive.
Our second regularizer, which we call the Weighted Attribution Change Regularizer (WACR), seeks to implement the above ideas. This is achieved by considering two subsets of pixels in a given input image x: a set of pixels $P_1$ for which $A \odot \Delta A$ is negative, i.e. sign($A$) is not the same as sign($\Delta A$); and a set of pixels $P_2$ for which sign($A$) is the same as sign($\Delta A$). We then minimize the quantity below:

$$L_{WACR} = S(\Delta A)\big|_{p \in P_1} + S(A \odot \Delta A)\big|_{p \in P_2} \qquad (9)$$

where $S$ can be any size function, for which we use the $\ell_1$ norm in this work, and $p$ is a pixel from the image x. In Eqn 9, we note that:

The first term attempts to reduce the attribution change in pixels where sign($A$) is not the same as sign($\Delta A$). We argue that this reduction in $\Delta A$ is not required for pixels in $P_2$, since an attribution change helps reinforce correct attributions for pixels in $P_2$.

The second term attempts to lower the change in attribution more in pixels with higher base attribution. We argue that this is not required for pixels in $P_1$, since bringing down the attribution change irrespective of the base attribution is the focus for $P_1$.
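The two-term structure just described can be sketched as follows. This is a hedged illustration, not the training code: the base attribution and the attribution change are given as plain arrays, the size function is assumed to be the sum of absolute values, pixels whose product is exactly zero are assigned to the sign-agreeing set (an arbitrary choice made here), and the function name `wacr` is hypothetical.

```python
import numpy as np

def wacr(base_attr, delta_attr):
    """Weighted attribution change over two pixel sets.

    Sign-flipped pixels (base * delta < 0) are penalized by |delta|.
    The remaining pixels are penalized by |base * delta|, so that the same
    change costs more on a pixel with higher base attribution.
    """
    a = np.asarray(base_attr, dtype=float).ravel()
    d = np.asarray(delta_attr, dtype=float).ravel()
    prod = a * d
    flipped = prod < 0                        # sign(base) != sign(delta)
    agreeing = ~flipped                       # same sign (or zero product)
    term_flipped = np.abs(d[flipped]).sum()
    term_agreeing = np.abs(prod[agreeing]).sum()
    return float(term_flipped + term_agreeing)

base = np.array([2.0, 0.1, -1.0])   # base attribution per pixel
delta = np.array([0.5, 0.5, 0.5])   # identical change on every pixel

total = wacr(base, delta)
# Pixels 0 and 1 agree in sign: 2.0*0.5 + 0.1*0.5 = 1.05
# Pixel 2 flips sign:           |0.5|            = 0.5
print(total)

# The same change is more costly on a high-attribution pixel.
high = wacr([2.0], [0.5])
low = wacr([0.1], [0.5])
```

In an actual training loop these quantities would be differentiable tensors; the array version only shows how the pixel sets and the weighting interact.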
Overall Optimization: We follow an adversarial training approach Madry et al. (2018) to train the model. Adversarial training is a two-step process: (i) an outer minimization; and (ii) an inner maximization. The inner maximization is used to identify a suitable perturbation that achieves the objective of an attributional attack, and the outer minimization uses the regularizers described above to counter the attack. We describe each of them below:
Outer Minimization: Our overall objective function for the outer minimization step is given by:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ L_{CE}(x', y; \theta) + \lambda \left( L_{CACR} + L_{WACR} \right) \right] \qquad (10)$$

where $L_{CE}$ is the standard cross-entropy loss used for the multi-class classification setting. We use $\lambda$ as a common weighting coefficient for both regularizers, and use $\lambda = 1$ for all the experiments reported in this paper. We show the effects of considering different $\lambda$ values on our proposed method in Sec Ablation Studies and Analysis. As $P_{ME}$, $P_{TCA}$ and $P_{NCA}$ are all discrete distributions, we calculate $L_{CACR}$ as:
(11) 
where $n$ corresponds to the total number of pixels in the input image, as before. The two gradient terms in Eqn 11 are the first-order partial derivatives of the neural network output w.r.t. the input pixel: one for the output corresponding to the most confusing negative class, and the other for the output corresponding to the true class.
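Because the maximum entropy distribution is uniform, a KL term taken against it has a simple closed form. The short derivation below is a sketch in notation assumed here ($P$ stands for either attributional distribution over $n$ pixels, $a$ for the attribution vector it is computed from, and the KL-divergence is taken from the uniform distribution to $P$); it is not a reproduction of Eqn 11.

```latex
\begin{align}
KL\left(P_{ME} \,\|\, P\right)
  &= \sum_{i=1}^{n} \frac{1}{n} \log \frac{1/n}{P_i}
   = -\log n - \frac{1}{n} \sum_{i=1}^{n} \log P_i . \\
\intertext{If $P = \mathrm{softmax}(a)$, then $\log P_i = a_i - \log \sum_{j=1}^{n} e^{a_j}$, so}
KL\left(P_{ME} \,\|\, P\right)
  &= \log \sum_{j=1}^{n} e^{a_j} - \frac{1}{n} \sum_{i=1}^{n} a_i - \log n .
\end{align}
```

The expression is zero exactly when all $a_i$ are equal (a uniform map) and grows as the attribution concentrates on a few pixels. In a difference of two such terms, the $\log n$ constants cancel.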
Inner Maximization: In order to obtain the attributional attack, we use the following objective function:
(12) 
where
Earlier computations of IG were w.r.t. $f^{TCI}$ or $f^{NCI}$, the softmax outputs of the true class and the most confusing negative class respectively. Here, IG is instead computed using the loss value corresponding to the true class. This is because our objective here is to maximize the loss, while our objective in the outer minimization was to maximize the true class softmax output. We use $L_{CE}$ as the cross-entropy loss for the true class, and the $\ell_1$ norm as $S$. Since the inner maximization is itself iterative (and solved before the outer minimization), we randomly initialize each pixel of $x'$ within an $\ell_\infty$ norm ball of x and then iteratively maximize the objective function in Eqn 12. We avoid the use of $L_{CACR}$ in our inner maximization, since it is expensive due to an extra IG calculation w.r.t. the negative class, which would increase the cost across the many iterations of the inner maximization loop.
We note that the proposed method is not an attribution method, but a training methodology that uses IG. When a model is trained using our method, all axioms of attribution will hold for IG by default, as for any other trained model. We also show that our loss function can be used as a surrogate loss of the robust prediction objective proposed by Madry et al. (2018). Please refer to Appendix for the proof. An algorithm for our overall methodology is also presented in the Appendix due to space constraints.
Experiments and Results
We conducted a comprehensive suite of experiments and ablation studies, which we report in this section and in Sec Ablation Studies and Analysis. We report results with our method on 4 benchmark datasets, i.e. Flower Nilsback and Zisserman (2006), Fashion-MNIST Xiao et al. (2017), MNIST LeCun et al. (2010) and GTSRB Stallkamp et al. (2012). The Flower dataset Nilsback and Zisserman (2006) contains 17 categories, with each category consisting of 40 to 258 high-definition RGB flower images. MNIST LeCun et al. (2010) and Fashion-MNIST Xiao et al. (2017) consist of grayscale images from 10 categories of handwritten digits and fashion products respectively. GTSRB Stallkamp et al. (2012) is a physical traffic sign dataset with 43 classes and around 50,000 images in total. We compare the performance of our method against existing methods: RAR Chen et al. (2019) for attributional robustness, Singh et al. (2019), and Madry et al. (2018), which uses only standard adversarial training. Note that the code of Singh et al. (2019) is not publicly available, and we hence compare against their results only on the settings reported in their paper.
Architecture Details: For experiments with both the MNIST and Fashion-MNIST datasets, we used a network consisting of two convolutional layers with 32 and 64 filters respectively, each followed by max-pooling, and a fully connected layer with 1024 neurons. We used the ResNet model in Zagoruyko and Komodakis (2016) to perform experiments with the Flower and GTSRB datasets, and performed per-image standardization before feeding images to the network, which consists of 5 residual units with (16, 16, 32, 64) filters each. We also compared our results with a recently proposed method Singh et al. (2019) using the WRN 28-10 Zagoruyko and Komodakis (2016) architecture, as used in their paper. More architecture details for each dataset are provided in the Appendix; on any given dataset, the architectures were the same across all methods for fair comparison.
Performance Metrics: Following Chen et al. (2019); Singh et al. (2019), we used the top-k intersection, Kendall's correlation and Spearman correlation metrics to evaluate a model's robustness against the IFIA attributional attack Ghorbani et al. (2019) (Sec Background and Preliminaries). Top-k intersection measures the intersection size of the k most important input features before and after the attributional attack. Kendall's and Spearman correlations compute rank correlation to compare the similarity between feature importances before and after the attack. We also report natural accuracy as well as adversarial accuracy, the latter being the accuracy over adversarial examples generated from perturbations of the original test set using PGD (Eqn 2).
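For concreteness, the three attributional robustness metrics can be sketched in plain NumPy as below. The function names are hypothetical, the top-k intersection is normalized by k here (an assumption about how the intersection size is reported), and the rank correlations are the simple tie-free variants; library implementations with tie handling would normally be preferred in practice.

```python
import numpy as np

def topk_intersection(attr_before, attr_after, k):
    """Fraction of the k highest-attribution pixels shared by both maps."""
    a = np.asarray(attr_before, dtype=float).ravel()
    b = np.asarray(attr_after, dtype=float).ravel()
    top_a = set(np.argsort(a)[-k:].tolist())
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b) / k

def kendall_tau(attr_before, attr_after):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    a = np.asarray(attr_before, dtype=float).ravel()
    b = np.asarray(attr_after, dtype=float).ravel()
    n = a.size
    score = 0.0
    for i in range(n - 1):
        score += np.sum(np.sign(a[i + 1:] - a[i]) * np.sign(b[i + 1:] - b[i]))
    return float(score / (n * (n - 1) / 2))

def spearman_rho(attr_before, attr_after):
    """Spearman correlation: Pearson correlation of the ranks (no ties)."""
    a = np.asarray(attr_before, dtype=float).ravel()
    b = np.asarray(attr_after, dtype=float).ravel()
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

before = np.array([0.9, 0.1, 0.5, 0.3])  # attributions on the clean input
after = np.array([0.8, 0.2, 0.1, 0.6])   # attributions after an attack

topk = topk_intersection(before, after, k=2)
tau = kendall_tau(before, after)
rho = spearman_rho(before, after)
print(topk, tau, rho)
```

All three metrics equal 1 when the attribution ranking is unchanged by the attack, and decrease as the attack reorders pixel importances.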
We used a regularizer coefficient $\lambda = 1$ and a fixed number of steps $m$ for computing IG (Eqn 1) across all experiments. Note that our adversarial and attributional attack configurations were kept fixed across our method and the baseline methods. Please refer to the Appendix for more details on training hyperparameters and attack configurations for specific datasets.
Results: Tables 0(b), 0(b), 0(d) and 0(d) report comparisons of natural/normal accuracy, adversarial accuracy, median value of the top-k intersection measure (shown as Top-K) and median value of Kendall's correlation (shown as Kendall), as used in Chen et al. (2019), on test sets of the Flower, Fashion-MNIST, MNIST and GTSRB datasets respectively. (Note that Singh et al. (2019) did not report results on these architectures, and we report comparisons with them separately in later tables.) Our method shows significant improvement on the Top-K and Kendall metrics, which are the metrics for attributional robustness in particular, across these datasets. Natural and adversarial accuracies are expected to be the highest for natural training and for the adversarial training method of Madry et al. (2018) respectively, and this is reflected in the results. A visual result is presented in Fig 1. More such qualitative results are presented in the Appendix. We show the variations in the top-k intersection value and Kendall's correlation over all test samples for all 4 aforementioned datasets, using our method and RAR Chen et al. (2019), in Fig 2. Our variance is fairly similar to the variance of RAR.
Tables 2 and 2 report the performance comparison (same metrics) on the Flower and GTSRB datasets using the WRN 28-10 architecture and the hyperparameter settings used in Singh et al. (2019). Note that RAR does not report results with this architecture, and hence is not included. We outperform Singh et al. (2019) by significant margins on the Top-K and Kendall metrics, especially on the Flower dataset. A comparison with Tables 0(b) and 0(d) makes it evident that the use of the WRN 28-10 architecture leads to significant improvement in attributional robustness.
Our results vindicate the methodology proposed in this work, given the state-of-the-art results obtained for attributional robustness. Although we have additional loss terms, our empirical studies showed an increase of at most 20-25% in training time over RAR. Note that at test time, which is perhaps more important in the deployment of such models, there is no additional time overhead for our method.
Ablation Studies and Analysis
Quality of Saliency Maps: It is important that attributional robustness methods do not distort the explanations significantly. One way to measure the quality of the generated explanations is through the deviation of attribution maps before and after applying our method. To judge the quality of our saliency maps, we compared the attributions generated by our method with the attributions of the original image from a naturally trained model, and report the Spearman correlation in Table 7 for all datasets. The results show that our saliency maps change less from the original ones than those of other methods. We also conducted a human Turing test to check the goodness of the saliency maps by asking 10 human users to pick a single winning saliency map that was most true to the object in a given image among (Madry, RAR, Ours). The winning rates (in the same order) were: MNIST: [30%, 30%, 40%]; FMNIST: [20%, 40%, 40%]; GTSRB: [20%, 30%, 50%]; Flower: [0%, 30%, 70%], showing that our saliency maps were truer to the image than those of other methods, especially on more complex datasets.
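Table 7: Spearman correlation between the attributions of each method and the attributions from a naturally trained model.

Dataset   Madry    RAR      Ours
Flower    0.7234   0.8015   0.9004
FMNIST    0.7897   0.8634   0.9289
MNIST     0.9826   0.9928   0.9957
GTSRB     0.8154   0.8714   0.9368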
Effect of Regularizers: We analyzed the effect of each regularizer term which we introduced in the outer minimization formulation in Eqn 10. For all such studies, the inner maximization setup was kept fixed. We compared attributional and adversarial accuracies, and median values of Top-K intersection and Kendall's correlation, achieved with and without $L_{CACR}$ and $L_{WACR}$ in the outer minimization. The results reported in Tables 4 and 4 suggest that the performance deteriorated substantially on removing either $L_{CACR}$ or $L_{WACR}$, when compared to our original results in Tables 0(b), 0(b) and 0(d) for the Flower, Fashion-MNIST and MNIST datasets.
Fig 3 shows the same effect visually with a sample test image from the Flower dataset. $L_{CACR}$ not only captures the effect of the positive class, but also diminishes the effect of the most confusing negative class. In the absence of $L_{CACR}$, attribution may hence be assigned to pixels which do not belong to the positive class. We can see that removing $L_{CACR}$ increased the focus on a leaf, which is not the true class (flower), in Fig 3(b) as compared to Fig 3(a). $L_{WACR}$ penalizes a large attribution change on true class pixels (i.e. pixels with high base attribution). This can be seen in Fig 3(b), where keeping $L_{WACR}$ forces minimal attribution change on true class pixels, compared to pixels outside the true class. Fig 3(c) shows the result of using both regularizers, which gives the best performance. More such qualitative results are also provided in the Appendix.
Effect of Regularizer Coefficients: To investigate the relative effects of each proposed regularizer term, we performed experiments with other choices of regularizer coefficients. Tables 6 and 6 show the results. Our results suggest that the performance on attributional robustness drops in both cases across all datasets when $L_{CACR}$ and $L_{WACR}$ are weighted less (the original experiments had both weights set to 1). The drop is slightly larger for one of the two regularizers than for the other, although the difference is marginal.

Additional ablation studies, including the effect of a term in Eqn 6 and the use of base attribution in $L_{WACR}$, are included in the Appendix due to space constraints.
Conclusions
In this paper, we propose two novel regularization techniques to improve the robustness of deep model explanations through axiomatic attributions of neural networks. Our experimental findings show significant improvement in attributional robustness measures and put our method ahead of existing methods for this task. Our claim is supported by quantitative and qualitative results on several benchmark datasets, following earlier work. Our future work includes incorporating spatial smoothing on the attribution map generated for the true class, which can provide sparse and localized heatmaps. We hope our findings will inspire the discovery of new attributional attacks and defenses, which offer a significant pathway for new developments in trustworthy machine learning.
Acknowledgement: This work has been partly supported by funding received from MHRD, Govt of India, and Honeywell through the UAY program (UAY/IITH005). We also acknowledge IIT-Hyderabad and JICA for the provision of GPU servers for this work. We thank the anonymous reviewers for their valuable feedback, which improved the presentation of this work.
References
 On pixelwise explanations for nonlinear classifier decisions by layerwise relevance propagation. PloS one 10 (7). Cited by: Related Work.
 Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: Related Work.
 Jacobian adversarially regularized networks for robustness. In International Conference on Learning Representations, Cited by: Introduction, Related Work.
 Robust attribution regularization. In Advances in Neural Information Processing Systems, pp. 14300–14310. Cited by: Appendix D, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 7, Figure 8, Figure 9, APPENDIX: Enhanced Regularizers for Attributional Robustness, Figure 1, Introduction, Introduction, Related Work, Related Work, Background and Preliminaries, Background and Preliminaries, Enforcing Attribution Bounds:, Experiments and Results, Experiments and Results, Experiments and Results.
 Parseval networks: improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 854–863. Cited by: Related Work.
 Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp. 13589–13600. Cited by: Related Work.
 Discovering adversarial examples with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Related Work.
 Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3681–3688. Cited by: Figure 4, Figure 5, Appendix C, Introduction, Related Work, Background and Preliminaries, Background and Preliminaries, Figure 3, Experiments and Results.
 Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: Related Work, Background and Preliminaries.
 Adversarial logit pairing. arXiv preprint arXiv:1803.06373. Cited by: Related Work.
 Harnessing the vulnerability of latent layers in adversarially trained models. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2779–2785. Cited by: Introduction, Related Work.
 Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: Related Work.
 MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist. Cited by: Experiments and Results.
 Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: Appendix B, Appendix B, APPENDIX: Enhanced Regularizers for Attributional Robustness, Figure 1, Introduction, Related Work, Background and Preliminaries, Enforcing Attribution Bounds:, Enforcing Attribution Bounds:, Experiments and Results, Experiments and Results.
 Relative attributing propagation: interpreting the comparative contributions of individual units in deep neural networks. arXiv preprint arXiv:1904.00605. Cited by: Related Work.
 Smoothed inference for adversarially-trained models. arXiv preprint arXiv:1911.07198. Cited by: Related Work.
 A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1447–1454. Cited by: Experiments and Results.
 The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. Cited by: Related Work.
 RISE: randomized input sampling for explanation of black-box models. In BMVC, Cited by: Related Work.
 Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems, pp. 13824–13833. Cited by: Introduction.
 Why should i trust you?: explaining the predictions of any classifier. In ACM SIGKDD, Cited by: Related Work.
 Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3145–3153. Cited by: Related Work.
 Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713. Cited by: Related Work.
 Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: Related Work.
 On the benefits of attributional robustness. arXiv preprint arXiv:1911.13073. Cited by: Appendix D, Appendix F, Related Work, Related Work, Background and Preliminaries, Background and Preliminaries, Experiments and Results, Experiments and Results, Experiments and Results, Experiments and Results, Experiments and Results.
 Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural networks 32, pp. 323–332. Cited by: Experiments and Results.
 Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3319–3328. Cited by: Related Work, Background and Preliminaries.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: Related Work.
 Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, Cited by: Related Work.
 Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, Cited by: Introduction.
 Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Experiments and Results.
 Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509. Cited by: Introduction, Related Work.
 Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2730–2739. Cited by: Related Work.
 Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: Experiments and Results.
 Visualizing and understanding convolutional networks. In ECCV, Cited by: Related Work.
 Theoretically principled tradeoff between robustness and accuracy. arXiv preprint arXiv:1901.08573. Cited by: Introduction, Related Work.
 Top-down neural attention by excitation backprop. ECCV. Cited by: Related Work.
APPENDIX: Enhanced Regularizers for Attributional Robustness
In this appendix, we provide details that could not be included in the main paper owing to space constraints, including: (i) a description of our method as an algorithm (for convenience of understanding); (ii) an analysis of how our method is connected to the inner maximization of adversarial robustness in Madry et al. (2018); (iii) detailed descriptions of hyperparameter settings and attack configurations for every dataset; (iv) additional results, including results with Spearman correlation, effect of term, and use of Base Attribution in ; (v) more visualizations of the effect of different regularization terms; and (vi) more qualitative results on all datasets considered, and comparisons with RAR Chen et al. (2019).
Appendix A Proposed Algorithm
The proposed method is presented as Algorithm 1 for convenience of understanding.
Appendix B Analysis of Proposed Method in terms of Robust Prediction Objective in Madry et al. (2018)
It is possible to show that our method is connected to the widely used robust prediction objective in Madry et al. (2018). To this end, we begin by observing that since the proposed method is a training method (and not an attribution method), all axioms of attribution will hold for the Integrated Gradients (IG) approach used in our method by default, as for any other trained model.
Thus, using the completeness property of IG, the sum of pixel attributions equals the predicted class probability (for a class under consideration). Since the positive class probability will be greater than the negative class probability for a model prediction, this implies (from Eqns 4, 5, 11). Similarly, from Eqn 9, we have , as is the norm. Also we have (cross-entropy loss) is itself, as defined in the work (Eqn 12). Hence, considering , our loss function can be written as:
(13) 
which is the inner maximization of Madry et al. (2018). Thus, attributional robustness can be viewed as an extension of the robust prediction objective typically used for adversarial robustness.
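For reference, the saddle-point objective of Madry et al. (2018) is commonly written as below. The notation is an assumption taken from that paper and may differ from the symbols used in the main text: $\theta$ denotes the model parameters, $\mathcal{D}$ the data distribution, $\mathcal{S}$ the set of allowed perturbations, and $L$ the training loss.

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\delta \in \mathcal{S}} L(\theta,\, x+\delta,\, y) \Big]
```

The bracketed term is the inner maximization referred to above; the outer minimization is ordinary training on the resulting worst-case loss.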
Appendix C Hyperparameters and Attack Configurations
Herein, we present details of training hyperparameters as well as attack configuration for all our experiments. We set regularization coefficient as in these experiments, and , the number of steps in computing IG (from zero image to given image) across all experiments for all datasets. (Note that is different in the attack step due to computational overhead in inner maximization. The specific values of in the attacks are provided under each dataset below.) For attributional attack, we use Iterative Feature Importance Attacks (IFIA) proposed by Ghorbani et al. (2019) (specific settings for each dataset described below). We set the feature importance function as Integrated Gradients (IG) and dissimilarity function D as Kendall’s rank order correlation across all datasets. Also, we kept adversarial and attributional attack configurations fixed while comparing the result with other baseline methods, for fairness of comparison.
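As a rough illustration of how the number of steps enters the IG computation, the following sketch approximates IG from an all-zero baseline with a midpoint Riemann sum. The toy logistic model, its weights, and all function names are hypothetical stand-ins chosen for this example; they are not the networks or code used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy differentiable model (assumption): sigmoid of a linear score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, w, b, m=50):
    """Approximate IG along the straight path from the zero input to x.

    m is the number of steps in the midpoint Riemann sum.
    """
    baseline = [0.0] * len(x)
    grad_sum = [0.0] * len(x)
    for k in range(m):
        alpha = (k + 0.5) / m  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        grad_sum = [s + g for s, g in zip(grad_sum, grad)]
    return [(xi - bi) * s / m for xi, bi, s in zip(x, baseline, grad_sum)]
```

Under the completeness axiom, the attributions sum to the difference between the model output at the input and at the baseline; a larger `m` tightens the approximation at a proportional cost in gradient evaluations, which is the overhead that motivates a smaller step count inside the attack loop.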
Flower Dataset:
Training Hyperparameters: We use momentum optimizer with weight decay, momentum rate 0.9, weight decay rate 0.0002, batch size 16 and training steps 90,000. We use a learning rate schedule as follows: the first 1500 steps have a learning rate of ; after 1500 steps and until 70,000 steps have a learning rate of ; after 70,000 steps have a learning rate of . We use PGD attack as an adversary with a random start, number of steps of 7, step size of 2, as the number of steps for approximating IG computation in the attack step and adversarial budget of 8.
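The step-wise learning rate schedule described above can be sketched as a piecewise-constant function. The boundaries (1500 and 70,000 steps) come from the text; the three rate values below are placeholders chosen only so the example runs, and are not the values used in the paper. The function and constant names are likewise hypothetical.

```python
def piecewise_constant_lr(step, boundaries, rates):
    """Return the learning rate for a given training step.

    boundaries: increasing step counts at which the rate changes.
    rates: one more entry than boundaries; rates[i] applies while
    step < boundaries[i], and the last rate applies afterwards.
    """
    if len(rates) != len(boundaries) + 1:
        raise ValueError("need exactly one more rate than boundaries")
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]


# Boundaries as stated for the Flower dataset.
FLOWER_BOUNDARIES = [1500, 70000]
# PLACEHOLDER rates (assumption): NOT the paper's values.
PLACEHOLDER_RATES = [1e-1, 1e-2, 1e-3]
```

The same helper would cover the GTSRB schedule by swapping the first boundary for 5000.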
Attack Configuration for Evaluation: For evaluating adversarial robustness, we use a PGD attack with number of steps of 40, adversarial budget of 8 and step size of 2. For attributional attack, we use IFIA’s top-k attack with , adversarial budget , step size and number of iterations .
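To make the PGD settings concrete, here is a minimal sketch of an L-infinity PGD loop using the step count, budget, and step size quoted above as defaults. Several things are assumptions made for illustration: the toy logistic model with an analytic gradient stands in for the real network, the 0 to 255 pixel range is inferred from the budget of 8, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a toy logistic model (stand-in for the network)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def loss_grad_wrt_input(x, y, w, b):
    """Analytic gradient of bce_loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def pgd_linf(x, y, w, b, eps=8.0, step_size=2.0, steps=40,
             lo=0.0, hi=255.0, random_start=False, seed=0):
    """L-infinity PGD: signed-gradient ascent steps followed by projection."""
    rng = random.Random(seed)
    adv = list(x)
    if random_start:
        # Start from a random point inside the eps-ball, kept in range.
        adv = [min(hi, max(lo, xi + rng.uniform(-eps, eps))) for xi in x]
    for _ in range(steps):
        grad = loss_grad_wrt_input(adv, y, w, b)
        # Ascent step along the sign of the loss gradient.
        adv = [ai + step_size * ((g > 0) - (g < 0)) for ai, g in zip(adv, grad)]
        # Project into the eps-ball around x, then into the valid pixel range.
        adv = [min(xi + eps, max(xi - eps, ai)) for ai, xi in zip(adv, x)]
        adv = [min(hi, max(lo, ai)) for ai in adv]
    return adv
```

The `random_start` flag corresponds to the random start used for the training-time adversary; the projection order (ball first, then pixel range) keeps the result inside both constraints.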
FashionMNIST Dataset:
Training Hyperparameters: We use learning rate as , batch size as 32, training steps as 100,000 and Adam optimizer. We use PGD attack as the adversary with a random start, number of steps of 20, step size of 0.01, as the number of steps for approximating IG computation in the attack step and adversarial budget .
Attack Configuration for Evaluation: For evaluating adversarial robustness, we use PGD attack with random start, number of steps of 100, adversarial budget of 0.1 and step size of 0.01. For attributional attack, we use IFIA’s top-k attack with , adversarial budget , step size and number of iterations .
MNIST Dataset:
Training Hyperparameters: We use learning rate as , batch size as 50, training steps as 90,000 and Adam optimizer. We use PGD attack as the adversary with a random start, number of steps of 40, step size of 0.01, as the number of steps for approximating IG computation in the attack step, and adversarial budget .
Attack Configuration for Evaluation: For evaluating adversarial robustness, we use PGD attack with a random start, number of steps of 100, adversarial budget of 0.3 and step size of 0.01. For attributional attack, we use IFIA’s top-k attack with , adversarial budget , step size and number of iterations .
GTSRB Dataset:
Training Hyperparameters: We use momentum with weight decay rate 0.0002, momentum rate 0.9, batch size 32 and training steps 100,000. We use learning rate schedule as follows: the first 5000 steps have learning rate of ; after 5000 steps and until 70,000 steps have learning rate of ; after 70,000 steps have learning rate of . We use PGD attack as the adversary with a random start, number of steps of 7, step size of 2, as the number of steps for approximating IG computation in the attack step and adversarial budget .
Attack Configuration for Evaluation: For evaluating adversarial robustness, we use PGD attack with number of steps as 40, adversarial budget of 8 and step size of 2. For evaluating attributional robustness, we use IFIA’s top-k attack with , adversarial budget , step size and number of iterations .
Appendix D Additional Results
Spearman Correlation as Attributional Robustness Metric: We compared our method with RAR Chen et al. (2019) and a recently proposed attributional robustness method Singh et al. (2019) in the main paper. Singh et al. (2019) also reported the median value of the Spearman correlation metric as an additional attributional robustness metric. We compare with them on this metric in Table 8 below. Note that we outperform their method on this metric too. (We do not compare with RAR in this table, since RAR did not use the WRN 28-10 architecture in their experiments, and RAR’s numbers were significantly lower than those reported in the table below for this reason.)
Table 8: Median Spearman correlation on the Flower and GTSRB datasets.
Method          Spearman Corr. (Flower)   Spearman Corr. (GTSRB)
Natural         0.7413                    0.9133
Madry et al.    0.9613                    0.9770
Singh et al.    0.9627                    0.9801
Ours            0.9959                    0.9938
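For readers unfamiliar with the metric, the sketch below shows one way to compute the Spearman correlation between two flattened attribution maps: rank both maps (averaging ranks over ties) and take the Pearson correlation of the ranks. This is a generic textbook formulation written for illustration, with hypothetical function names; it is not the evaluation code behind the numbers above.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(average_ranks(a), average_ranks(b))
```

In an evaluation like the one above, `a` would be the attribution map of the original image and `b` that of the attacked image, each flattened to a list, with the median taken over test images.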
Effect of : We studied the effect of the term by training a model without this term, keeping everything else the same, i.e., calculating without this term in Eqn 6. The images in Fig 6 show that using this term improves the focus of attribution on true-class pixels, reducing the attribution of the most confusing negative class.
Use of Base Attribution in : To understand the usefulness of weighting the change in attribution with the base attribution in (Eqn 9), we replaced with in Eqn 10 for the outer minimization step and trained our model keeping all other settings intact. Table 9 reports natural and adversarial accuracies and median values of Top-K intersection and Kendall’s correlation achieved with the modified formulation, which are substantially lower on all measures than our original results in tables 0(b), 0(b) and 0(d). This supports our claim for construction of .
Table 9: Results with the modified formulation (without base-attribution weighting).
Dataset   Nat. acc.   Adv. acc.   Top-K     Kendall
Flower    82.35%      50.74%      68.71%    0.8089
FMNIST    85.43%      71.38%      79.89%    0.7023
MNIST     98.41%      89.53%      79.00%    0.3315
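The two attributional robustness measures reported in the table can be sketched as follows. This is a simplified illustration with hypothetical function names: the Kendall variant shown is tau-a, which ignores ties, and whether the paper's evaluation uses this variant or a tie-corrected one is not stated here. The quadratic pair loop is fine for small maps but slow for full-resolution images.

```python
def topk_intersection(a, b, k):
    """Fraction of the k highest-scoring indices shared by two score lists."""
    top_a = set(sorted(range(len(a)), key=lambda i: a[i], reverse=True)[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: b[i], reverse=True)[:k])
    return len(top_a & top_b) / k


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (a[i] - a[j]) * (b[i] - b[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Both functions take flattened attribution maps of the original and perturbed image; values closer to 1 indicate that the attack changed the explanation less.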
Appendix E Visualization of Effect of Different Regularization Terms
Appendix F More Qualitative Results
We now present more qualitative results from each dataset on the following pages, comparing our proposed robust attribution method with RAR. Figures 7–18 (following pages) show these results. (Note that Singh et al. (2019) do not have their code publicly available and do not report these results; we are hence unable to compare with them on these results.) Note the consistent performance of our method across all these examples and datasets.