A Quantitative Results Including DiPart
We provide the full results of our quantitative evaluation on the Grid Pointing Game [1] (GridPG), DiFull, and DiPart using the backpropagation-based (Fig. 9, top), activation-based (Fig. 9, middle), and perturbation-based (Fig. 9, bottom) methods on VGG11 [14] (Fig. 9, left) and Resnet18 [2] (Fig. 9, right). It can be seen that the performance on DiFull and DiPart is very similar across all three evaluation settings and the three layers. The most significant difference between the two can be seen among the backpropagation-based methods and LayerCAM [4] (Fig. 9, row 1, cols. 2–3, 5–6). On DiFull, these methods show near-perfect localization, since the gradients of the outputs from each classification head that are used to assign importance are zero with respect to the weights and activations of all grid cells disconnected from that head. On the other hand, the receptive field of the convolutional layers can overlap adjacent grid cells in DiPart, and the gradients of the outputs from the classification heads can thus have non-zero values with respect to inputs and activations from these adjacent grid regions. This also results in decreasing localization scores when moving backwards from the classifier.
Furthermore, the localization scores for Gradient [13] and Guided Backprop [16] are constant at the final layer for Resnet18 (Fig. 9, row 1, cols. 4–6). This is because this layer is immediately followed by a global average pooling layer, due to which all activations at this layer get an equal share of the gradients.
B Qualitative Results using AggAtt
In this section, we present additional qualitative results using our AggAtt evaluation along with examples of attributions from each bin, for each of GridPG [1] (Sec. B.1), DiFull (Sec. B.2), and DiPart (Sec. B.3).
B.1 GridPG
Fig. 10 and Fig. 11 show examples from the median position of each AggAtt bin for each attribution method at the input and final layers, respectively, evaluated on GridPG at the top-left grid cell using VGG11 [14]. At the input layer (Fig. 10), we observe that the backpropagation-based methods show noisy attributions that do not strongly localize to the top-left grid cell. This corroborates the poor quantitative performance of these methods at the input layer (Fig. 9, top). With the exception of LayerCAM [4], the activation-based methods, on the other hand, show strong attributions across all four grid cells and localize very poorly. They appear to highlight the edges across the input irrespective of the class of each grid cell. This also agrees with the quantitative results (Fig. 9, middle), where the median localization score of these methods is below the uniform attribution baseline. LayerCAM, being similar to IxG [12], lies at the interface between activation-based and backpropagation-based methods, and also shows weak and noisy attributions. The perturbation-based methods visually show a high variance in attributions. While they localize well for about half the dataset (first three bins), the bottom half (last three bins) shows noisy and poorly localized attributions, which again agrees with the quantitative results (Fig. 9, bottom). This further shows how evaluating on individual inputs can be misleading, and the utility of AggAtt for obtaining a holistic view across the dataset.

At the final layer (Fig. 11), attributions from Gradient [13] and Guided Backprop [16] are very noisy and only slightly concentrate at the top-left cell. The checkerboard-like pattern is a consequence of the max pooling operation after the final layer, which allocates all the gradient only to the maximum activation. Gradients from each position of the sliding classification kernel then get averaged to form the attributions. The localization of IntGrad [17], IxG, GradCAM [11], and Occlusion [18] improves considerably as compared to the input layer, which agrees with the quantitative results and shows that diverse methods can show similar performance when compared fairly. The performance of the other activation-based methods and RISE [9] improves to some extent, but is still poorly localized for around half the dataset.

B.2 DiFull
Fig. 13 and Fig. 14 show examples from the median position of each AggAtt bin for each attribution method at the input and final layers, respectively, evaluated on DiFull at the top-left grid cell using VGG11. At the input layer (Fig. 13), the backpropagation-based methods and LayerCAM show perfect localization across the dataset. This is explained by the disconnected construction of DiFull, and agrees with the quantitative results shown in Fig. 9. The activation-based methods show very poor localization, with attributions that appear visually similar to those observed on GridPG (Sec. B.1). Occlusion shows near-perfect localization, since the placement of the occlusion kernel at any location not overlapping with the top-left grid cell does not influence the output in the DiFull setting. RISE still produces noisy attributions across the dataset. While only the top-left grid cell influences the output, the use of random masks causes input regions that share masks with inputs in the top-left cell to also get attributed.
At the final layer (Fig. 14), the backpropagation-based methods and LayerCAM still show perfect localization, for the same reason as discussed above. Attributions from Gradient and Guided Backprop show similar artifacts as seen with GridPG (Sec. B.1), but are localized to the top-left cell. The activation-based methods apart from LayerCAM concentrate their attributions at the top-left and bottom-right grid cells, particularly in the early bins. This is because both these cells contain images from the same class, and the weighting of activation maps by these methods using a single scalar value causes both to be attributed, even though only the instance at the top-left influences the classification. Further, Occlusion and RISE show similar results as at the input layer. The attributions of Occlusion are noticeably lower in resolution, since the relative size of the occlusion kernel compared to the activation map is much larger at the final layer.
Finally, we show the AggAtt bins for all methods at all three layers using both VGG11 and Resnet18 in Fig. 15, and see that they reflect the trends observed in the individual examples seen from each bin.
B.3 DiPart
Fig. 16 and Fig. 17 show examples from the median position of each AggAtt bin for each attribution method at the input and final layers, respectively, evaluated on DiPart at the top-left grid cell using VGG11. In addition, Fig. 18 shows the AggAtt bins for all methods at all three layers using both VGG11 and Resnet18. As observed with the quantitative results (Sec. A), the performance seen visually on DiPart across the three layers is very similar to that on DiFull (Sec. B.2). However, they differ slightly in the case of the backpropagation-based methods and LayerCAM, particularly at the input layer (Fig. 16). This is because, unlike in DiFull, the grid cells are only partially disconnected, and the receptive field of the convolutional layers can overlap adjacent grid cells to some extent. Nevertheless, as can be seen here, only a small boundary region around the top-left grid cell receives attributions, and the difference is not visually very perceptible. This further shows that the DiPart setting can be thought of as a natural extension of DiFull that largely shares the requisite property without being an entirely constructed setting.
C Correlation between Attributions
From the quantitative (Fig. 9) and qualitative (Fig. 12) results, we observed that diverse methods perform similarly on GridPG [1], both in terms of localization score and through AggAtt visualizations, when evaluated fairly. This was particularly the case for IntGrad [17], IxG [12], GradCAM [11], and Occlusion [18] when evaluated at the final layer. We also found (Sec 5.2 in the paper) that smoothing IntGrad and IxG attributions (the results of which we call SIntGrad and SIxG) evaluated at the input layer leads to visually and quantitatively similar performance as GradCAM evaluated at the final layer. In this section, we investigate this further and study the correlation of these methods at the level of individual attributions. In particular, we compute the Spearman rank correlation coefficient between the localization scores (using VGG11 [14]) of every pair of methods from each of the three layers. The results are shown in Fig. 19.
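As a concrete illustration, the Spearman rank correlation between the localization scores of two methods can be computed as below. This is a self-contained sketch in pure Python; the two score lists are hypothetical stand-ins for per-attribution localization scores, not the paper's data:

```python
def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

loc_a = [0.9, 0.7, 0.4, 0.8, 0.1]   # hypothetical scores, method A
loc_b = [0.8, 0.6, 0.3, 0.9, 0.2]   # hypothetical scores, method B
print(round(spearman(loc_a, loc_b), 3))  # → 0.9
```

In practice one would run this over the full set of 2,000 localization scores per method; a library routine such as `scipy.stats.spearmanr` computes the same quantity.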
We observe that at the input layer (Fig. 19, top-left corner), the activation-based methods are poorly correlated with each other and with the backpropagation-based and perturbation-based methods. This agrees with the poor localization of these methods seen previously (Fig. 9, Fig. 10). The backpropagation-based and perturbation-based methods, on the other hand, show moderate to strong correlations both within and across the two families. Similar results can be seen when comparing methods at the middle layer with those at the input and final layers (Fig. 19, edge centres). However, when compared at the middle layer (Fig. 19, middle), the activation-based methods still correlate poorly with the other methods, although the strength of the correlation improves in general.
Further, when compared at the final layer (Fig. 19, bottom-right corner), all methods show moderate to strong correlations with each other. This could be because generating explanations at the final layer is a significantly easier task than doing so at the input, since the activations are used as-is and only the classification layers' outputs are explained. The pairs with very strong positive correlation also show that attribution methods with diverse mechanisms can perform similarly when evaluated fairly. Finally, we observe that the activation-based methods at the final layer, rather than at the input layer, correlate much better with the other methods at the input layer (Fig. 19, top-right, bottom-left).
We also observe that SIntGrad and SIxG at the input layer correlate well with the best-performing methods (IntGrad, IxG, GradCAM, Occlusion) at the final layer. This marks a significant improvement over IntGrad and IxG at the input layer: for example, the correlation of IntGrad at the input layer with GradCAM at the final layer improves substantially after smoothing (Tab. 1).
We further study the effect of smoothing in Tabs. 1 and 2. We observe that the correlations for SIntGrad and SIxG improve significantly over those for IntGrad and IxG for VGG11 when using large kernels. However, for Resnet18 [2], the improvement for SIxG is very small. This agrees with the quantitative localization performance of these methods (Sec 5.2 in the paper). This shows that beyond aggregate visual similarity and quantitative performance, smoothing IntGrad and IxG can produce explanations at the input layer that are individually similar to GradCAM at the final layer, while also explaining the full network and performing significantly better on DiFull. We further visually compare the impact of smoothing in Sec. D.

Tab. 1: Correlation of (S)IntGrad at the input layer with GradCAM at the final layer, without smoothing ("Original") and for increasing smoothing kernel sizes (left to right).

          | Original | → increasing kernel size →
VGG11     |   0.34   | 0.42  0.52  0.69  0.78  0.80  0.71
Resnet18  |   0.18   | 0.21  0.27  0.40  0.55  0.63  0.61

Tab. 2: Correlation of (S)IxG at the input layer with GradCAM at the final layer, without smoothing ("Original") and for increasing smoothing kernel sizes (left to right).

          | Original | → increasing kernel size →
VGG11     |   0.27   | 0.28  0.33  0.43  0.49  0.44  0.34
Resnet18  |   0.14   | 0.13  0.15  0.17  0.18  0.13  0.05
D Impact of Smoothing Attributions
In this section, we explore the impact of smoothing attributions. First, we briefly discuss a possible reason for the improvement in localization after smoothing (Sec. D.1). Then, we visualize the impact of smoothing through examples and AggAtt visualizations (Sec. D.2). Further, we compare the performance of GradCAM [11] at the final layer with SIntGrad and SIxG at the input layer across the same examples from each bin and show their similarities across bins (Sec. D.3).
D.1 Effect of Smoothing
We believe that our smoothing results highlight an interesting aspect of piecewise linear models (PLMs), which goes beyond mere practical improvements. For PLMs (such as the models used here), IxG [12] yields the exact pixel contributions according to the linear mapping given by the PLM. In other words, the sum of IxG attributions over all pixels yields exactly (ignoring biases) the model output. If the effective receptive field of the model is small (cf. [7]), sum pooling IxG with a kernel of the same size accurately computes the model's local output (apart from the influence of bias terms). Our method of smoothing IxG with a Gaussian kernel performs a weighted average pooling of attributions in the local region around each pixel, which produces a similar effect: it summarizes the contribution of the pixels in each local region to the model's output, leading to less noisy attributions and better localization.
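This weighted average pooling can be illustrated with a minimal numpy sketch; the kernel size and σ here are illustrative, not the values used in our experiments:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def smooth_attributions(attr, size=5, sigma=1.5):
    """Weighted average pooling of attributions in the local region
    around each pixel (zero padding at the borders)."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(attr, pad)
    out = np.zeros_like(attr, dtype=float)
    h, w = attr.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + size, j:j + size] * k).sum()
    return out

attr = np.zeros((9, 9))
attr[4, 4] = 1.0                 # a single strongly attributed pixel
sm = smooth_attributions(attr)
print(sm.shape)                  # → (9, 9)
```

Because the kernel is normalized, the total attribution of an interior pixel is redistributed over its neighbourhood rather than created or destroyed, which is the "summarizing" behaviour described above.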
D.2 AggAtt Evaluation after Smoothing
In Fig. 20 and Fig. 21, we show examples from each AggAtt bin for SIntGrad and SIxG at the input layer for two different kernel sizes, and compare with IntGrad [17] and IxG at the input layer respectively. We observe that the localization performance significantly improves with increasing kernel size, and produces much stronger attributions for the target grid cell. In Fig. 22, we show the AggAtt bins for these methods on both VGG11 [14] and Resnet18 [2]. We see that this reflects the trends seen from the examples, and also clearly shows the relative ineffectiveness of smoothing IxG for Resnet18 (Fig. 22 bottom right and Tab. 2).
D.3 Comparing GradCAM with SIntGrad and SIxG
We now compare GradCAM at the final layer with SIntGrad and SIxG at the input layer on the same set of examples (Fig. 23). We pick an example from each AggAtt bin of GradCAM and evaluate all three methods on it. From Fig. 23, we observe that the three methods produce visually similar attributions across the AggAtt bins. While the attributions of SIntGrad and SIxG are somewhat coarser than those of GradCAM, particularly for the examples in the first few bins, they still concentrate around similar regions in the images. Interestingly, they perform similarly even for examples where GradCAM does not localize well, i.e., in the last two bins. Finally, we again see that SIxG using Resnet18 performs relatively worse than the other methods (as also seen in Tab. 2).
E Quantitative Evaluation on All Layers
For a fair comparison, we evaluated each method at the input, a middle layer, and the final layer of the network. The middle layer was chosen as a representative to visualize the trends in localization performance across the network. Figs. 25 and 24 show the results of evaluating at each convolutional layer of VGG11 [14] and each layer block of Resnet18 [2]. We find that the performance on the remaining layers is consistent with the trend observed from the three chosen layers in our experiments.
F Computational Cost
Unlike GridPG [1], the DiFull setting involves passing each grid cell separately through the network. In this section, we compare the computational costs of GridPG, DiFull, and DiPart, and show that they are similar across the three settings. Let the input be in the form of an $n \times n$ grid. Each setting consists of a CNN module, which obtains features from the input, and a classifier module, which provides logits for each cell in the grid using the obtained features. We analyze each of these modules in turn.
CNN Module:
In GridPG and DiPart, the entire grid is passed through the CNN module as a single input. On the other hand, in DiFull, each grid cell is passed separately. This can be alternatively viewed as stacking each of the grid cells along the batch dimension before passing them through the network. Consequently, the inputs in the DiFull setting have their widths and heights scaled by a factor of $\frac{1}{n}$, and the batch size scaled by a factor of $n^2$. Since the operations within the CNN module scale linearly with input size, the computational cost for each grid cell in DiFull is $\frac{1}{n^2}$ times the cost for the full grid in GridPG and DiPart. Since there are $n^2$ such grid cells, the total computational cost for the CNN module of DiFull equals that of GridPG and DiPart.
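Under the stated assumption that CNN cost scales linearly with the number of input pixels, this equality can be checked with a toy calculation (the grid size and cell dimensions below are illustrative):

```python
def cnn_cost(height, width, cost_per_pixel=1.0):
    """Toy cost model: CNN cost proportional to the number of input pixels."""
    return cost_per_pixel * height * width

n, H, W = 2, 224, 224                    # n x n grid of H x W cells
gridpg_cost = cnn_cost(n * H, n * W)     # one pass over the full n*H x n*W grid
difull_cost = (n * n) * cnn_cost(H, W)   # n^2 separate passes, one per cell
print(gridpg_cost == difull_cost)        # → True
```

Each DiFull pass costs $\frac{1}{n^2}$ of the full-grid pass, and there are $n^2$ of them, so the totals match exactly.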
Classifier Module:
The classifier module in the DiFull and DiPart settings consists of $n^2$ classification heads, each of which receives features corresponding to a single grid cell. On the other hand, the GridPG setting uses a classifier kernel over the composite feature map for the full grid. Let the dimensions of the feature map for a single grid cell be $h \times w$. This implies that in GridPG, using a stride of 1, the classification kernel slides over $(nh - h + 1) \times (nw - w + 1)$ windows of the input, each of which results in a call to the classifier module. In contrast, in DiFull and DiPart, the classifier module is called only $n^2$ times, once for each head. This shows that the computational cost of DiFull and DiPart for the classifier module, and thus for the pipeline as a whole, is at most as much as that of GridPG.

G Comparison with SmoothGrad
In our work, we find that smoothing IntGrad [17] and IxG [12] attributions with a Gaussian kernel can lead to significantly improved localization, particularly for networks without batch normalization layers [3]. As discussed in Sec. D.1, we believe this to be because smoothing summarizes the effect of inputs in a local window around each pixel on the output logit, and reduces the noisiness of attributions. Prior approaches to address noise in attributions include SmoothGrad [15], which involves adding Gaussian noise to an input and averaging over attributions from several noisy samples. Here, we compare our smoothing with that of SmoothGrad. Fig. 26 shows that our methods (SIntGrad, SIxG) show significantly better GridPG [1] localization than SmoothGrad applied to IntGrad and IxG, except in the case of IxG with Resnet18 [2], where our smoothing does not improve localization, likely due to the presence of batch normalization layers. The scores on DiFull decrease to an extent, since our Gaussian smoothing allows attributions to "leak" to neighbouring grid cells. These results are corroborated by the AggAtt visualizations in Fig. 27. We also note that SmoothGrad incurs a significantly higher computational cost than our approach, since attributions need to be generated for several noisy samples of each input, and it is also sensitive to the choice of hyperparameters such as the noise percentage and the number of samples.
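For reference, SmoothGrad's noise-and-average scheme can be sketched as follows. Here `attr_fn` is a toy stand-in for a gradient-based attribution method (the exact gradient of a simple quadratic), and the noise level and sample count are illustrative rather than the settings of [15]:

```python
import numpy as np

def smoothgrad(attr_fn, x, sigma=0.1, n_samples=25, seed=0):
    """SmoothGrad: average attributions over several noisy copies of the input."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += attr_fn(noisy)
    return total / n_samples

# toy attribution function: the exact gradient of f(x) = sum(x**2) is 2x
attr_fn = lambda x: 2.0 * x
x = np.linspace(0.0, 1.0, 5)
sg = smoothgrad(attr_fn, x)
print(np.round(sg, 2))
```

The `n_samples` attribution computations per input are the source of the additional computational cost noted above, in contrast to a single attribution pass followed by Gaussian smoothing.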
H Implementation Details
H.1 Dataset
As described in the paper (Sec. 4), we obtain 2,000 attributions for each attribution method on each of GridPG [1], DiFull, and DiPart, using inputs consisting of four subimages arranged in $2 \times 2$ grids. For GridPG, since we evaluate on all four subimages, we do this by constructing 500 grid images after randomly sampling 2,000 images from the validation set. Each grid image contains subimages from four distinct classes. On the other hand, for DiFull and DiPart, we place images of the same class at the top-left and bottom-right corners to test whether an attribution method simply highlights class-related features, irrespective of whether they are used by the model. Therefore, we evaluate only on these two grid locations. In order to obtain 2,000 attributions as with GridPG, we construct 1,000 grid images for these two settings by randomly sampling 4,000 images from the validation set.
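The grid construction for DiFull and DiPart can be sketched as follows; the class names and image identifiers are hypothetical stand-ins for validation-set samples:

```python
import random

def make_difull_grid(by_class, rng):
    """Build one 2x2 grid for DiFull/DiPart: the top-left and bottom-right
    cells hold two DIFFERENT images of the SAME class, while the remaining
    two cells hold images of two other, distinct classes."""
    c_main, c_tr, c_bl = rng.sample(sorted(by_class), 3)
    tl, br = rng.sample(by_class[c_main], 2)  # same class, distinct images
    tr = rng.choice(by_class[c_tr])
    bl = rng.choice(by_class[c_bl])
    return [[tl, tr], [bl, br]], c_main

# hypothetical pool: class -> list of image identifiers
by_class = {c: [f"img_{c}_{i}" for i in range(10)]
            for c in ("cat", "dog", "ship", "car")}
rng = random.Random(0)
grid, target_class = make_difull_grid(by_class, rng)
print(target_class, grid[0][0], grid[1][1])
```

Repeating this 1,000 times and attributing at the two corners yields the 2,000 attributions per method described above.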
H.2 Models and Attribution Methods
We implement our settings using PyTorch [8], and use pretrained VGG11 [14] and Resnet18 [2] models from Torchvision [8]. We use implementations from the Captum library [5] for Gradient [13], Guided Backprop [16], IntGrad [17], and IxG [12], and from [1] for Occlusion [18] and RISE [9]. For Gradient and Guided Backprop, the absolute value of the attributions is used. All attributions are summed along the channel dimension before evaluation.

Occlusion involves sliding an occlusion kernel of size $k \times k$ with stride $s$ over the image. As the spatial dimensions of the feature maps decrease from the input to the final layer, we select different values of $k$ and $s$ for each layer, using a larger kernel and stride at the input than at the middle and final layers.
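A minimal sketch of the occlusion procedure (not the implementation from [1]); `score_fn` is a toy stand-in for the model's class confidence, and the kernel size and stride are illustrative:

```python
import numpy as np

def occlusion_attribution(img, score_fn, k=4, stride=2, fill=0.0):
    """Slide a k x k occlusion patch with the given stride, and attribute to
    each occluded region the resulting drop in the class score, averaged
    over the windows that cover each pixel."""
    h, w = img.shape
    base = score_fn(img)
    attr = np.zeros((h, w))
    counts = np.zeros((h, w))
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            occluded = img.copy()
            occluded[i:i + k, j:j + k] = fill
            drop = base - score_fn(occluded)
            attr[i:i + k, j:j + k] += drop
            counts[i:i + k, j:j + k] += 1
    return attr / np.maximum(counts, 1)

# toy "model": confidence is the mean intensity of the top-left quadrant
img = np.zeros((8, 8))
img[:4, :4] = 1.0
score_fn = lambda x: x[:4, :4].mean()
attr = occlusion_attribution(img, score_fn)
print(attr[:4, :4].sum() > attr[4:, 4:].sum())  # → True
```

Occluding regions the toy model actually uses lowers its score, so the attribution concentrates on the top-left quadrant, mirroring how Occlusion assigns importance.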
RISE generates attributions by occluding the image using several randomly generated masks and weighting them based on the change in the output class confidence. In our experiments, we use fewer masks than [9] to offset the increased computational cost of using larger grid images, but found similar results in a subset of experiments with the original number of masks.
H.3 Localization Metric
In our quantitative evaluation, we use the same formulation for the localization score as proposed in GridPG (Sec 3.1.1 in the paper). Let $A_p^+$ refer to the positive attribution given to the $p$-th pixel, and let $G_i$ denote the set of pixels in the $i$-th grid cell. The localization score $L_i$ for the $i$-th subimage is given by:

$$L_i = \frac{\sum_{p \in G_i} A_p^+}{\sum_{p} A_p^+} \tag{2}$$

where the denominator sums over all pixels in the grid. However, $L_i$ is undefined when the denominator in Eq. 2 is zero, i.e., when $\sum_{p} A_p^+ = 0$. This can happen, for instance, when all attributions for an input are negative. To handle such cases, we set $L_i = 0$ in our evaluation whenever the denominator is zero.
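A numpy sketch of this metric with the zero-denominator guard, for an n × n grid of equally sized cells (the attribution maps below are hypothetical):

```python
import numpy as np

def localization_score(attr, cell, n=2):
    """Fraction of the total positive attribution that falls inside grid
    cell `cell` = (row, col); returns 0 when there is no positive
    attribution (the undefined case of Eq. 2)."""
    pos = np.clip(attr, 0, None)
    total = pos.sum()
    if total == 0:
        return 0.0
    h, w = attr.shape[0] // n, attr.shape[1] // n
    r, c = cell
    return float(pos[r * h:(r + 1) * h, c * w:(c + 1) * w].sum() / total)

attr = np.zeros((8, 8))
attr[:4, :4] = 1.0   # all positive attribution inside the top-left cell
print(localization_score(attr, (0, 0)))                    # → 1.0
print(localization_score(np.full((8, 8), -1.0), (0, 0)))   # → 0.0
```

A uniform attribution map scores $\frac{1}{n^2}$ on any cell, which is the uniform baseline referred to in Sec. B.1.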
H.4 AggAtt Visualizations
To generate our AggAtt visualizations, we sort attribution maps in the descending order of the localization score and bin them into percentile ranges to obtain aggregate attribution maps (Sec 3.2 in the paper). However, we observe that when evaluating on DiFull, the backpropagationbased attribution methods show perfect localization (Sec 5.1 in the paper), and all attributions share the same localization score. In this scenario, and in all other instances when two attributions have the same localization score, we break the tie by favouring maps that have stronger attributions in the target grid cell. We do this by ordering attributions with the same localization score in the descending order of the sum of attributions within the target grid cell, i.e., the numerator in Eq. 2.
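This ordering with tie-breaking can be sketched as a single two-key sort (pure Python; `score` is the localization score and `target_sum` the numerator of Eq. 2, with hypothetical records):

```python
def aggatt_order(attribution_maps):
    """Sort attribution records by localization score (descending), breaking
    ties by the total attribution inside the target cell (descending)."""
    return sorted(attribution_maps,
                  key=lambda a: (-a["score"], -a["target_sum"]))

maps = [
    {"id": "a", "score": 1.0, "target_sum": 3.0},
    {"id": "b", "score": 0.5, "target_sum": 9.0},
    {"id": "c", "score": 1.0, "target_sum": 7.0},  # ties with "a" on score
]
print([m["id"] for m in aggatt_order(maps)])  # → ['c', 'a', 'b']
```

The sorted list is then split into percentile ranges and each range is averaged to produce the aggregate maps.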
Further, when producing the aggregate maps, we normalize the aggregate attributions using a common normalizing factor for each method. This is done to accurately reflect the strength of the average attributions across bins for a particular method.
I Evaluation on CIFAR10
In addition to ImageNet [10], we also evaluate using our settings on CIFAR10 [6]. In this section, we present these results, and find similar trends in performance as on ImageNet. We first describe the experimental setup (Sec. I.1) used, and then show the quantitative results on GridPG [1], DiFull, and DiPart (Sec. I.2) and some qualitative results using AggAtt (Sec. I.3).
I.1 Experimental Setup
Network Architecture: We use a modified version of the VGG11 [14] architecture, with the last two convolutional layers removed. Since the CIFAR10 inputs have smaller dimensions ($32 \times 32$) than ImageNet ($224 \times 224$), using all the convolutional layers results in activations with very small spatial dimensions, which makes it difficult to apply attribution methods at the final layer. After removing the last two convolutional layers, we obtain activations with sufficiently large spatial dimensions at the new final layer before pooling. We then perform our evaluation at the input (Inp), the middle layer (Conv3), and the final layer (Conv6).
Data: We construct grid datasets consisting of $2 \times 2$ and $3 \times 3$ grids using images from the validation set classified correctly by the network with high confidence. We obtain 4,000 (resp. 4,500) attributions for each method from the $2 \times 2$ (resp. $3 \times 3$) grid datasets. As with ImageNet (Sec. H.1), we evaluate on all grid cells for GridPG and only at the top-left and bottom-right corners on DiFull and DiPart. In order to obtain an equivalent 4,000 (resp. 4,500) attributions using just the corners on DiFull and DiPart, we randomly sample 8,000 (resp. 20,250) images for the $2 \times 2$ ($3 \times 3$) grid datasets, and construct 2,000 (resp. 2,250) composite images. Note that the CIFAR10 validation set only has a total of 10,000 images. Since we only evaluate at the two corners, we allow subimages at other grid cells to repeat across multiple composite images. However, no two subimages are identical within the same composite image.
I.2 Quantitative Evaluation on GridPG, DiFull, and DiPart
The results of the quantitative evaluation can be found in Fig. 28 for both $2 \times 2$ grids (left) and $3 \times 3$ grids (right). We observe that all methods perform similarly as on ImageNet (Fig. 9). Since localizing on $3 \times 3$ grids poses a more challenging task, we observe generally poorer performance across all methods in that setting.
I.3 Qualitative Results using AggAtt
In Fig. 29, we show AggAtt evaluations for one method each from the backpropagation-based (IxG [12]), activation-based (GradCAM [11]), and perturbation-based (Occlusion [18]) categories. Further, we show examples of attributions at the input and final layers on GridPG for these methods (Figs. 31 and 30). We see that these show similar trends in performance as on ImageNet (Sec. B).
Supplement References
 [1] Moritz Böhle, Mario Fritz, and Bernt Schiele. Convolutional Dynamic Alignment Networks for Interpretable Classifications. In CVPR, pages 10029–10038, 2021.
 [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
 [3] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015.
 [4] Peng-Tao Jiang, Chang-Bin Zhang, Qibin Hou, Ming-Ming Cheng, and Yunchao Wei. LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE TIP, 30:5875–5888, 2021.
 [5] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020.
 [6] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.

 [7] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In NeurIPS, 2016.
 [8] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.
 [9] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. In BMVC, 2018.
 [10] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
 [11] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In ICCV, pages 618–626, 2017.
 [12] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features Through Propagating Activation Differences. In ICML, pages 3145–3153, 2017.
 [13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLRW, 2014.
 [14] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
 [15] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 [16] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for Simplicity: The All Convolutional Net. In ICLRW, 2015.
 [17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks. In ICML, pages 3319–3328, 2017.
 [18] Matthew D Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. In ECCV, pages 818–833, 2014.