Group-CAM: Group Score-Weighted Visual Explanations for Deep Convolutional Networks

03/25/2021
by   Qinglong Zhang, et al.
Nanjing University

In this paper, we propose an efficient saliency map generation method, called Group score-weighted Class Activation Mapping (Group-CAM), which adopts the "split-transform-merge" strategy to generate saliency maps. Specifically, for an input image, the class activations are first split into groups. In each group, the sub-activations are summed and de-noised to form an initial mask. After that, the initial masks are transformed with meaningful perturbations and applied to preserve sub-pixels of the input (i.e., masked inputs), which are then fed into the network to calculate confidence scores. Finally, the initial masks are summed, weighted by these confidence scores, to form the final saliency map. Group-CAM is efficient yet effective: it requires only dozens of queries to the network while producing target-related saliency maps. As a result, Group-CAM can serve as an effective data augmentation trick for fine-tuning networks. We comprehensively evaluate the performance of Group-CAM on commonly used benchmarks, including deletion and insertion tests on ImageNet-1k and pointing game tests on COCO2017. Extensive experimental results demonstrate that Group-CAM achieves better visual performance than current state-of-the-art explanation approaches. The code is available at https://github.com/wofmanaf/Group-CAM.



1 Introduction

Understanding and interpreting the decisions made by deep neural networks (DNNs) is of central importance for humans, since it helps build trust in DNN models [5, 2, 9, 17]. In the area of computer vision, one critical technique is generating intuitive heatmaps that highlight the regions most relevant to the DNN's decision.

One common approach for determining salient regions relies on changes in the model output, such as changes in the prediction scores with respect to the input images. For example, RISE [7] estimates importance empirically by probing the model with randomly masked versions of the image and observing the corresponding outputs. While RISE provides very compelling results, it must generate thousands of random masks and query the model with each of them, making it inefficient.

Other approaches, such as Grad-CAM [11], calculate gradients by back-propagating the prediction score through the target layer of the network and apply them as weights to combine the forward feature maps. These methods are generally faster than RISE since they require only a single or constant number of queries to the network [8]. However, the results of Grad-CAM merely reflect infinitesimal changes of the prediction, and these changes are not necessarily large enough to alter the decision of the network. Naturally, a question arises: "Can one method produce results that truly reflect the model decision in a more efficient way?"

To answer this question, we first revisit the intuition behind RISE. Let $M$ be a random binary mask drawn from a distribution $\mathcal{D}$; the input image $I$ can be masked by $I \odot M$ to preserve a subset of pixels, where $\odot$ denotes element-wise multiplication. The masked image is then fed to the model to produce a confidence score that measures the contribution of the preserved pixels. Finally, the saliency map is generated by combining a large number of random masks weighted by their corresponding scores. The most time-consuming steps are generating the random masks and performing the many queries to the neural network.

Figure 1: Pipeline of Group-CAM. Activations are first obtained as a linear combination of the feature maps $A$ and the importance weights $w^c$. The activations are then split into $G$ groups and summed along the channel dimension within each group; each grouped activation is de-noised to generate an initial mask $M_\ell$. The input image $I$ is element-wise multiplied with $M_\ell$ and then transformed with meaningful perturbations. The perturbed images are fed to the network, and the output saliency map is computed as a weighted sum of all $M_\ell$, where the weights are the confidence scores of the target class for the corresponding perturbed inputs.

To address the efficiency issue, we propose Group score-weighted Class Activation Mapping (Group-CAM), which adopts the "split-transform-merge" strategy to generate saliency maps. Specifically, for an input image, the class activations are first split into groups. In each group, the sub-activations are summed along the channel dimension to form an initial mask. However, directly applying the initial masks to preserve input pixels may introduce visual noise due to vanishing gradients. Therefore, we design a de-noising strategy to filter out the less important pixels of each initial mask. In addition, to ease the adversarial effects of sharp boundaries between the masked and salient regions, we replace the unreserved regions (pixels with 0 values) of the masked image with blurred information from the input. Finally, the saliency map of Group-CAM is computed as a weighted sum of the grouped initial masks, where the weights are confidence scores produced by the masked inputs. Group-CAM is quite efficient and can produce appealing target-related saliency maps after only dozens of queries to the network. As a result, Group-CAM can also be applied to train or fine-tune classification models. The overall architecture of Group-CAM is illustrated in Figure 1.

The key contributions in this paper are summarized as follows:

(1) we introduce Group-CAM, an efficient explaining approach for deep convolutional networks by estimating the importance of input image pixels for the model’s prediction;

(2) we present a novel initial-mask generation strategy, which produces only dozens of initial masks by group-summing the class activations, making Group-CAM quite fast;

(3) we comprehensively evaluate Group-CAM on ImageNet-1k and MS COCO2017. Results demonstrate that Group-CAM requires less computation yet achieves better visual performance than the current state-of-the-art methods;

(4) we extend the application of saliency methods and apply Group-CAM as an effective data augmentation trick for fine-tuning classification networks; extensive experimental results suggest that Group-CAM can boost the networks' performance by a large margin.

Note that if the number of groups in Group-CAM is set to 1 and no de-noising strategy is applied, Group-CAM reduces to Grad-CAM.

2 Related Work

Region-based Saliency Methods. In recent years, numerous saliency methods attributing inputs to output predictions have been proposed. One family of methods adopts masks to preserve certain regions of the inputs and measures the effect these regions have on the output by performing a forward pass through the network with the masked inputs. We refer to these as region-based saliency methods. Among them, RISE first generates thousands of random masks and employs them to mask the input; a linear combination of the random masks, weighted by the prediction scores of the masked images, is then computed as the final saliency map. Instead of generating random masks, Score-CAM adopts the feature maps of the target layer (which generally contains thousands of feature maps) as initial masks and employs them to compute the saliency map. Unlike RISE and Score-CAM, XRAI first over-segments the input image and then iteratively tests the importance of each region, coalescing smaller regions into larger segments based on attribution scores. Region-based approaches usually generate better human-interpretable visualizations but are less efficient, since they require many queries to the neural network.

Activation-based Saliency Methods. These approaches combine activations (generally combinations of back-propagated gradients and feature maps) of a selected convolutional layer to form an explanation. CAM and Grad-CAM adopt a linear combination of activations to form a heatmap with fine-grained details. Grad-CAM++ extends Grad-CAM and uses a weighted combination of the positive partial derivatives of the target layer's feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. Activation-based methods are in general faster than region-based approaches since they require only a single or constant number of queries to the model. However, their results reflect only infinitesimal changes of the prediction, and these changes are not necessarily large enough to alter the decision of the neural network.

Grouped Features. Learning features in groups dates back to AlexNet, whose motivation was distributing the model over more GPU resources. MobileNets and ShuffleNets treat each channel as a group and model the spatial relationships within these groups. ResNeXt exploits the split-transform-merge strategy in an extensible way: the feature maps are split into groups, the same transformation is applied to each sub-feature, and the transformed sub-features are then concatenated. Although the split-transform-merge strategy has been widely used for learning features, no prior work adopts this strategy in the explainable-AI domain.

3 Group-CAM

In this section, we first describe the Group-CAM algorithm, then explain the motivation behind it. The high-level steps are shown in Algorithm 1.

Input: Image $I$, model $f$, class $c$, number of groups $G$, Gaussian blur parameters $k$, $\sigma$.
Output: Saliency map $L^c$
1  Initialization: $L^c \leftarrow 0$; baseline input $\tilde{I}_0 \leftarrow \mathrm{GaussianBlur}(I; k, \sigma)$;
2  Get the target-layer feature maps $A$ and importance weights $w^c$ (Eq. 1);
3  $K \leftarrow$ the number of channels of $A$;
4  $g \leftarrow K / G$, the number of feature maps in each group;
5  for $\ell = 0$ to $G - 1$ do
6      Generate $M_\ell \leftarrow \sum_{k=\ell g}^{(\ell+1)g - 1} w^c_k A^k$ (Eq. 2);
7      De-noise, normalize, and bilinearly upsample the activation map: $M_\ell \leftarrow \mathrm{upsample}(\mathrm{norm}(\phi(M_\ell)))$ (Eqs. 3–4);
8      Perturbed image $\tilde{I}_\ell \leftarrow I \odot M_\ell + \tilde{I}_0 \odot (1 - M_\ell)$ (Eq. 5);
9      Compute the confidence gain $\alpha_\ell \leftarrow f_c(\tilde{I}_\ell) - f_c(\tilde{I}_0)$ (Eq. 6);
10     $L^c \leftarrow L^c + \alpha_\ell M_\ell$;
11 end for
12 return $L^c$
Algorithm 1: Group-CAM Algorithm

3.1 Initial Masks

Let $I$ be an input image and $f$ a deep neural network that predicts a score $S_c$ for class $c$ given input $I$. In order to obtain the class-discriminative initial group masks from the target convolutional layer, we first compute the gradient of $S_c$ with respect to the feature maps $A^k$. These gradients are then global-average-pooled over the height and width dimensions (indexed by $i$ and $j$, respectively) to obtain the neuron importance weights

$$w^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial S_c}{\partial A^k_{ij}}, \qquad (1)$$

where $Z$ is the number of pixels in the feature map $A^k$.

Assume $K$ is the number of channels of the target-layer feature maps. We first split all the feature maps and neuron importance weights into $G$ groups. The initial mask of each group is then calculated by

$$M_\ell = \sum_{k=\ell g}^{(\ell + 1)g - 1} w^c_k A^k, \qquad (2)$$

where $\ell \in \{0, 1, \dots, G-1\}$ and $g = K / G$ is the number of feature maps in each group.

$M_\ell$ is a combination of feature maps and gradients, which means $M_\ell$ can be visually noisy, since the gradients of a DNN tend to vanish due to the flat zero-gradient region of ReLU. Therefore, it is not suitable to apply $M_\ell$ directly as the initial mask.

To remedy this issue, we utilize a de-noising function to filter out the pixels of $M_\ell$ whose values are less than $p_\theta(M_\ell)$, where $p_\theta(M_\ell)$ computes the $\theta$-th percentile of $M_\ell$. Formally, for a scalar $m$ in $M_\ell$, the de-noising function can be represented as

$$\phi(m) = \begin{cases} m, & \text{if } m \geq p_\theta(M_\ell), \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

Instead of setting all pixels to binary values, it is better to generate a smoother mask for each activation map. Specifically, we scale the raw values of $M_\ell$ into $[0, 1]$ using Min-Max normalization,

$$M_\ell \leftarrow \frac{M_\ell - \min(M_\ell)}{\max(M_\ell) - \min(M_\ell)}. \qquad (4)$$

Then, $M_\ell$ is upsampled with bilinear interpolation to the same resolution as $I$ to mask the input.
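To make Eqs. (1)–(4) concrete, the following is a minimal PyTorch sketch of the initial-mask generation; the function name and the assumption that the target-layer activations and gradients have already been captured (e.g., via forward/backward hooks) are ours, not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def initial_group_masks(activations, gradients, group=32, theta=70, out_size=(224, 224)):
    """Compute the grouped initial masks M_l of Eqs. (1)-(4).

    activations: target-layer feature maps A, shape (1, K, H, W)
    gradients:   dS_c/dA for the target class,  shape (1, K, H, W)
    """
    _, K, H, W = activations.shape
    # Eq. (1): global-average-pool the gradients to get importance weights w_k^c
    weights = gradients.mean(dim=(2, 3), keepdim=True)            # (1, K, 1, 1)
    weighted = weights * activations                               # w_k^c * A^k

    g = K // group                                                 # feature maps per group
    masks = []
    for l in range(group):
        # Eq. (2): sum the weighted sub-activations of group l along the channel dimension
        m = weighted[:, l * g:(l + 1) * g].sum(dim=1, keepdim=True)  # (1, 1, H, W)
        # Eq. (3): zero out pixels below the theta-th percentile (de-noising)
        thresh = torch.quantile(m.flatten(), theta / 100.0)
        m = torch.where(m >= thresh, m, torch.zeros_like(m))
        # Eq. (4): Min-Max normalization to [0, 1]
        m_min, m_max = m.min(), m.max()
        if (m_max - m_min).abs() > 1e-8:
            m = (m - m_min) / (m_max - m_min)
        # bilinear upsampling to the input resolution
        m = F.interpolate(m, size=out_size, mode='bilinear', align_corners=False)
        masks.append(m)
    return torch.cat(masks, dim=0)                                 # (group, 1, 224, 224)
```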

3.2 Saliency Map Generation

It has been widely acknowledged that if a saliency method is in fact identifying pixels significant to the model's prediction, this should be reflected in the model's output for the reconstructed image [5]. However, simply masking out the image pixels outside the region of interest causes unintended effects due to the sharp boundary between the masked and salient regions. Consequently, it is crucial to minimize such adversarial effects when testing the importance of a feature subset [3].

To address this issue, we start with a masked version of the input, replace the unreserved regions (pixels with 0 values) with blurred information, and then perform classification on this image to measure the importance of the initial masks. The perturbed images can be computed by

$$\tilde{I}_\ell = I \odot M_\ell + \tilde{I}_0 \odot (1 - M_\ell), \qquad (5)$$

where $\tilde{I}_0$ is a baseline image (a blurred version of $I$) with the same shape as $I$ and a lower confidence for class $c$.

The contribution $\alpha_\ell$ of the reserved regions can then be computed as

$$\alpha_\ell = f_c(\tilde{I}_\ell) - f_c(\tilde{I}_0). \qquad (6)$$

The final saliency map is a linear combination of the initial masks with weights $\alpha_\ell$, that is,

$$L^c_{\text{Group-CAM}} = \sum_{\ell=0}^{G-1} \alpha_\ell M_\ell. \qquad (7)$$
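A minimal PyTorch sketch of Eqs. (5)–(7) is given below, assuming the initial masks produced in Section 3.1 and a classifier `model`; the Gaussian-blur baseline parameters and the single-batch evaluation of all masked images are illustrative choices rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

@torch.no_grad()
def group_cam_saliency(model, image, masks, target_class, blur_kernel=51, blur_sigma=50.0):
    """Combine the initial masks into a saliency map (Eqs. (5)-(7)).

    image: normalized input tensor of shape (1, 3, 224, 224)
    masks: initial masks of shape (G, 1, 224, 224), values in [0, 1]
    """
    # baseline image: a heavily blurred copy of the input with lower target-class confidence
    baseline = GaussianBlur(blur_kernel, sigma=blur_sigma)(image)
    base_score = F.softmax(model(baseline), dim=1)[0, target_class]

    # Eq. (5): blend the input and the baseline according to each mask
    perturbed = image * masks + baseline * (1.0 - masks)          # (G, 3, 224, 224)

    # Eq. (6): confidence gain of each masked input over the baseline
    scores = F.softmax(model(perturbed), dim=1)[:, target_class]
    alphas = scores - base_score                                   # (G,)

    # Eq. (7): weighted sum of the initial masks
    saliency = (alphas.view(-1, 1, 1, 1) * masks).sum(dim=0)       # (1, 224, 224)
    return saliency
```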
Figure 2: Visualization results of SOTA saliency methods. Results show that the saliency maps of Group-CAM are more compelling than those of region-based and activation-based methods, and contain less noise than those of gradient-based methods.

4 Experiments

In this section, we first use ablation studies to investigate the effect of the number of groups $G$ and the filtering threshold $\theta$. Then we apply a sanity check to test whether Group-CAM is sensitive to model parameters. Finally, we compare the proposed Group-CAM with other popular saliency methods to evaluate its performance.

4.1 Experimental Setup

Experiments in this section are conducted on commonly used computer vision datasets, namely ImageNet-1k [10] and MS COCO2017 [6]. For both datasets, all images are resized to 224 × 224, transformed to tensors, and normalized to the range [0, 1]. No further pre-processing is performed. We report the insertion and deletion test results using the pre-trained torchvision (https://github.com/pytorch/vision/tree/master/torchvision) VGG19 [12] as the base classifier; all other results use the pre-trained ResNet-50 [4]. Unless explicitly stated, the number of groups $G$ adopted in Group-CAM is 32, and the threshold $\theta$ in Eq. (3) is set to 70. For a fair comparison, all saliency maps are upsampled to 224 × 224 with bilinear interpolation.
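For illustration, a minimal torchvision pipeline matching this pre-processing might look as follows; the 224 × 224 size is the standard torchvision input resolution and is assumed here rather than taken from the released code.

```python
from torchvision import transforms

# Pre-processing as described above: resize to 224 x 224 and convert to a tensor;
# ToTensor already scales pixel values into the range [0, 1].
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```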

4.2 Class Discriminative Visualization

We qualitatively compare the saliency maps produced by recent SOTA methods, including gradient-based methods (Guided Backpropagation [14], IntegratedGrad [15], SmoothGrad [13]), region-based methods (RISE [7], XRAI [5]), and activation-based methods (Grad-CAM [11], Score-CAM [16]), to validate the effectiveness of Group-CAM.

As shown in Figure 2, the saliency maps of Group-CAM contain much less random noise than those of region-based and activation-based methods. In addition, Group-CAM generates smoother saliency maps compared with gradient-based methods.

We further conduct experiments to test whether Group-CAM can distinguish different classes. As shown in Figure 3, VGG19 classifies the input as "bull mastiff" with 46.06% confidence and as "tiger cat" with 0.39% confidence. Group-CAM correctly localizes the explanations for both categories, even though the classification score of the latter is much lower than that of the former. It is therefore reasonable to conclude that Group-CAM can distinguish different categories.

Figure 3: Class-discriminative results. The middle image is generated w.r.t. "bull mastiff", and the right one is generated w.r.t. "tiger cat".
Figure 4: Grad-CAM, Score-CAM and Group-CAM saliency maps for representative images in terms of deletion and insertion curves. For the insertion curve, a better explanation is expected to make the prediction score increase quickly, while for the deletion curve, the classification confidence is expected to drop faster.
AUC Grad-CAM Grad-CAM++ RISE XRAI Score-CAM Group-CAM (ours)
Insertion 63.06 62.38 63.76 53.36 64.75 64.91
Deletion 11.97 12.43 11.91 10.90 11.47 11.29
Over-all 51.09 49.95 51.85 42.46 53.28 53.62
Table 1: Comparative evaluation in terms of deletion (lower AUC is better) and insertion (higher AUC is better) AUC on ImageNet-1k. The over-all score (higher is better) shows that Group-CAM outperforms the other related methods significantly. The best records are marked in bold.

4.3 Deletion and Insertion

We follow [7] and conduct deletion and insertion tests to evaluate the different saliency approaches. The intuition behind the deletion metric is that removing the pixels/regions most relevant to a class will cause the classification score to drop significantly. The insertion metric, on the other hand, starts from a blurred image and gradually re-introduces content, which produces more realistic images and has the additional advantage of mitigating the impact of adversarial examples. In detail, for the deletion test, we gradually replace 1% of the pixels of the original image with a highly blurred version at each step, in decreasing order of saliency values, until no pixels are left. Conversely, the insertion test replaces 1% of the pixels of the blurred image with the original ones until the image is fully recovered. We compute the AUC of the post-Softmax classification score as a quantitative indicator. Besides, we report the over-all score, calculated as AUC(insertion) − AUC(deletion), to jointly evaluate the deletion and insertion results. Examples are shown in Figure 4. The average results over 10k images are reported in Table 1.
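The sketch below shows one possible implementation of these insertion/deletion curves and the over-all score; the helper name, the per-step bookkeeping, and the trapezoidal AUC are our own choices and may differ in detail from the evaluation code actually used.

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def insertion_deletion_auc(model, image, blurred, saliency, target_class, step=0.01):
    """Insertion/deletion curves as described above: pixels are moved between the
    original image and a highly blurred copy, 1% at a time, in saliency order.

    image, blurred: tensors of shape (1, 3, h, w); saliency: single-channel map (h, w).
    """
    _, c, h, w = image.shape
    order = torch.argsort(saliency.flatten(), descending=True)    # most salient first
    n_per_step = max(1, int(step * h * w))

    def run(start, end):
        scores, current = [], start.clone()
        for i in range(0, h * w, n_per_step):
            idx = order[i:i + n_per_step]
            # move the next 1% most-salient pixels from `end` into `current`
            current.view(1, c, -1)[..., idx] = end.view(1, c, -1)[..., idx]
            prob = F.softmax(model(current), dim=1)[0, target_class].item()
            scores.append(prob)
        return np.trapz(scores, dx=1.0 / len(scores))             # AUC of the curve

    insertion = run(blurred, image)   # start blurred, re-introduce original content
    deletion = run(image, blurred)    # start original, blur the most salient pixels
    overall = insertion - deletion    # the "over-all" score reported in Table 1
    return insertion, deletion, overall
```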

As illustrated in Table 1, the proposed Group-CAM outperforms the other related approaches in terms of insertion and over-all AUC. Moreover, Group-CAM also achieves a better (lower) deletion AUC than every other method except XRAI.

Ablation Studies. We report ablation results for Group-CAM on 5k images randomly sampled from ImageNet-1k to thoroughly investigate the influence of the filtering threshold $\theta$ and the number of groups $G$. Results are shown in Figure 5 and Table 2.

Groups Insertion Deletion Over-all
1 61.72 11.21 50.51
4 64.27 11.21 53.07
8 64.94 11.29 53.65
16 65.38 11.34 54.04
32 65.48 11.31 54.17
64 65.77 11.31 54.46
128 65.81 11.29 54.52
256 65.84 11.28 54.56
Table 2: Ablation studies of the number of groups $G$ (with the default filtering threshold) in terms of deletion, insertion, and over-all scores on the ImageNet-1k validation split (5k randomly sampled images). The best records are marked in bold.
Figure 5: Ablation studies of the filtering threshold $\theta$ in terms of the deletion (lower AUC is better) and insertion (higher AUC is better) curves and the over-all score (higher is better) on the ImageNet-1k validation split (5k randomly sampled images).

From Figure 5 we can see that the threshold $\theta$ has a significant effect on the performance of Group-CAM (the over-all score fluctuates by more than 1.1%). Specifically, when $\theta$ is small, the over-all score remains stable as $\theta$ increases; as $\theta$ grows larger, the over-all score drops quickly. To make a trade-off between the insertion and deletion results, we set $\theta = 70$ by default.

Besides, from Table 2 we can see that the over-all score increases as $G$ increases. However, as shown in Algorithm 1, a larger $G$ means higher computational cost. To make a trade-off, we set $G = 32$ as the default number of groups for Group-CAM.

Methods Running Time
RISE 38.23
XRAI 42.17
Grad-CAM 0.03
Score-CAM 2.46
Group-CAM (ours) 0.09
Table 3: Comparative evaluation in terms of running time (seconds, averaged over 5k images) on ImageNet-1k. The best and second-best records are marked in bold and blue, respectively.

Running Time. In Table 3, we summarize the average running time of RISE [7], XRAI [5], Grad-CAM [11], Score-CAM [16] and the proposed Group-CAM on a single NVIDIA 2080Ti GPU. As shown in Table 3, the average running times of Grad-CAM and Group-CAM are both under 1 second, the best among all approaches. Combining Table 1 and Table 3, we observe that although Group-CAM runs slightly slower than Grad-CAM, it achieves much better performance.

4.4 Localization Evaluation

In this part, we adopt the pointing game [18] on MS COCO2017 to measure the quality of the generated saliency maps through their localization ability. We use the same pre-trained ResNet-50 as [7]. The localization accuracy is calculated as $Acc = \frac{\#Hits}{\#Hits + \#Misses}$ for each object category (if the most salient pixel lies inside one of the annotated bounding boxes of an object, it is counted as a hit). The overall performance is measured by the mean accuracy across the different categories.
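A minimal sketch of the pointing-game accuracy described above is given below; the data layout (per-image saliency arrays and per-image lists of bounding boxes for the target category) is assumed for illustration.

```python
import numpy as np

def pointing_game_accuracy(saliency_maps, gt_boxes):
    """Pointing game: a hit is recorded when the most salient pixel of an image
    falls inside any ground-truth bounding box of the target category.

    saliency_maps: list of 2-D numpy arrays, one per image
    gt_boxes:      list of lists of boxes (x_min, y_min, x_max, y_max) per image
    """
    hits, misses = 0, 0
    for sal, boxes in zip(saliency_maps, gt_boxes):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # most salient pixel
        if any(x0 <= x <= x1 and y0 <= y <= y1 for (x0, y0, x1, y1) in boxes):
            hits += 1
        else:
            misses += 1
    return hits / (hits + misses)    # Acc = #Hits / (#Hits + #Misses)
```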

From Table 4, we observe that Group-CAM beats all the other compared approaches. Specifically, Group-CAM outperforms the baseline Grad-CAM by 0.8% in terms of mean accuracy.

Methods Mean Accuracy
Grad-CAM 56.7
Grad-CAM++ 57.2
RISE 54.3
XRAI 55.1
Score-CAM 51.0
Group-CAM (ours) 57.5
Table 4: Pointing game on the COCO val2017 split. Results show that the proposed Group-CAM performs consistently better than other related methods.

4.5 Sanity Check

Finally, we run sanity checks [1] to verify whether the results of Group-CAM can be considered reliable explanations of a trained model's behavior. Specifically, we employ both cascade randomization and independent randomization of the model parameters and compare the outputs of Group-CAM on a pre-trained VGG19. As shown in Figure 6, Group-CAM is sensitive to the classification model's parameters and can produce valid results.

Figure 6: Sanity check results by cascade randomization and independent randomization. Results show that Group-CAM is sensitive to classification model parameters and can reflect the quality of the network.

5 Fine-tuning Classification Methods

In this section, we extend the application of Group-CAM and apply it as an effective data augmentation strategy to fine-tune or train classification models. We argue that a saliency method suitable for fine-tuning networks should have two characteristics: (1) it should be efficient, i.e., able to produce saliency maps in a limited time; (2) the generated saliency maps should be related to the target objects. Group-CAM produces appealing target-related saliency maps in 0.09 seconds per image under the default settings, which makes it suitable for fine-tuning networks.

To make Group-CAM more efficient here, we remove the importance weights and the de-noising procedure. Although this slightly impairs the performance of Group-CAM, back-propagation is no longer needed, which greatly reduces the time needed to generate saliency maps.

The fine-tuning process is defined as follows:

(1) generate the saliency map $L^c$ for input $I$ with the current model and the ground-truth target class $c$;

(2) binarize $L^c$ with threshold $\mu$, where $\mu$ is the mean value of $L^c$;

(3) apply Eq. (5) to obtain the blurred input $\tilde{I}$;

(4) adopt $\tilde{I}$ to fine-tune the classification model.

Since $\tilde{I}$ is generated during the training process, an improvement of the classification model leads Group-CAM to generate a better $\tilde{I}$, which in turn further promotes the performance of the classification model.
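A possible single-image sketch of these four steps is shown below. It reuses the `group_cam_saliency` helper from the Section 3.2 sketch and follows the simplification described above (grouped sums of raw feature maps, no importance weights, no de-noising); `get_feature_maps` is a placeholder for whatever hook mechanism exposes the target-layer activations, so this is an illustration rather than the authors' exact training code.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

@torch.no_grad()
def group_cam_augment(model, get_feature_maps, image, target_class,
                      group=32, blur_kernel=51, blur_sigma=50.0):
    """Steps (1)-(3) of the fine-tuning augmentation for one image.

    get_feature_maps: callable returning the target-layer activations (1, K, H, W);
                      stands in for a forward hook on the chosen layer.
    """
    # step (1): simplified initial masks -- grouped sums of raw feature maps,
    # with the importance weights and the de-noising step removed (Section 5)
    feats = get_feature_maps(image)
    _, k, _, _ = feats.shape
    g = k // group
    masks = torch.stack([feats[0, l * g:(l + 1) * g].sum(dim=0) for l in range(group)])
    masks = masks.unsqueeze(1)                                       # (group, 1, H, W)
    masks = F.interpolate(masks, size=image.shape[-2:], mode='bilinear', align_corners=False)
    m_min = masks.amin(dim=(2, 3), keepdim=True)
    m_max = masks.amax(dim=(2, 3), keepdim=True)
    masks = (masks - m_min) / (m_max - m_min + 1e-8)                 # per-mask Min-Max normalization

    # weight the masks by the confidence gains of the masked inputs (Eqs. (5)-(7))
    saliency = group_cam_saliency(model, image, masks, target_class,
                                  blur_kernel=blur_kernel, blur_sigma=blur_sigma)

    # step (2): binarize the saliency map with its mean value
    binary = (saliency >= saliency.mean()).float()

    # step (3): Eq. (5) -- keep the salient pixels, fill the rest with the blurred input
    baseline = GaussianBlur(blur_kernel, sigma=blur_sigma)(image)
    augmented = image * binary + baseline * (1.0 - binary)
    return augmented   # step (4): feed this image to the ordinary training step
```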

Figure 7: Fine-tuning ResNet-50 with Group-CAM. Results show that Group-CAM can improve the classification model’s performance by a significant margin.

Here, we report results on the ImageNet-1k validation split after fine-tuning ResNet-50. Specifically, we train the pre-trained ResNet-50 with SGD, weight decay 1e-4, momentum 0.9, and mini-batch size 256 (8 GPUs with 32 images per GPU) for 20 epochs, starting from an initial learning rate of 1e-3 and decreasing it by a factor of 10 every 15 epochs. For testing on the validation set, the shorter side of each input image is first resized to 256, and a 224 × 224 center crop is used for evaluation.

As shown in Figure 7, fine-tuning with Group-CAM yields a 0.59% improvement in Top-1 accuracy (76.74% vs. 76.15%).

Figure 8: Visualization results of fine-tuning ResNet-50 with Group-CAM. The first image (Epoch_0) is generated by the original pre-trained ResNet-50. The right four images (i.e., Epoch_5, Epoch_10, Epoch_15 and Epoch_20) are generated by the fine-tuned ResNet-50.

Here, we visualize the saliency maps generated by the fine-tuned ResNet-50 in Figure 8. As illustrated, as the performance of ResNet-50 improves, the saliency maps generated by Group-CAM become less noisy and focus more on the important regions. Since this noise reflects the model's performance to some degree, we can also treat it as a hint for deciding whether a model has converged: if the saliency maps generated by Group-CAM stop changing, the model may have converged.

6 Conclusion

In this paper, we proposed Group-CAM, which adopts grouped sums of gradient-and-feature-map combinations as initial masks. These initial masks are used to preserve subsets of input pixels, and the resulting masked images are fed into the network to calculate confidence scores, which reflect the importance of the masked images. The final saliency map of Group-CAM is computed as a weighted sum of the initial masks, where the weights are the confidence scores produced by the masked inputs. Group-CAM is efficient yet effective and can be applied as a data augmentation trick to fine-tune or train classification models. Experimental results on ImageNet-1k and COCO2017 demonstrate that Group-CAM achieves better visual performance than current state-of-the-art explanation approaches.

References

  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. J. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 9525–9536.
  • [2] N. Bansal, C. Agarwal, and A. Nguyen (2020) SAM: the sensitivity of attribution methods to hyperparameters. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), pp. 8670–8680.
  • [3] P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 6967–6976.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 770–778.
  • [5] A. Kapishnikov, T. Bolukbasi, F. B. Viégas, and M. Terry (2019) XRAI: better attributions through regions. In IEEE/CVF International Conference on Computer Vision (ICCV 2019), pp. 4947–4956.
  • [6] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV 2014), Lecture Notes in Computer Science, Vol. 8693, pp. 740–755.
  • [7] V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC 2018), p. 151.
  • [8] Z. Qi, S. Khorram, and F. Li (2020) Visualizing deep networks by optimizing with integrated gradients. In AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 11890–11898.
  • [9] S. Rebuffi, R. Fong, X. Ji, and A. Vedaldi (2020) There and back again: revisiting backpropagation saliency methods. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), pp. 8836–8845.
  • [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
  • [11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV 2017), pp. 618–626.
  • [12] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR 2015).
  • [13] D. Smilkov, N. Thorat, B. Kim, F. B. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • [14] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR 2015 Workshop Track.
  • [15] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML 2017), pp. 3319–3328.
  • [16] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu (2020) Score-CAM: score-weighted visual explanations for convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops 2020), pp. 111–119.
  • [17] S. Xu, S. Venugopalan, and M. Sundararajan (2020) Attribution in scale and space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), pp. 9677–9686.
  • [18] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126(10), pp. 1084–1102.