Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping

10/03/2019 ∙ by Haofan Wang, et al. ∙ Texas A&M University 25

Recently, increasing attention has been paid to the internal mechanisms of convolutional neural networks and to the basis on which a network makes a specific decision. In this paper, we develop a novel post-hoc visual explanation method called Score-CAM, based on class activation mapping. Unlike previous class activation mapping based approaches, Score-CAM removes the dependence on gradients by obtaining the weight of each activation map through its forward-passing score on the target class; the final result is obtained by a linear combination of weights and activation maps. We demonstrate that Score-CAM achieves better visual performance with less noise and has better stability than Grad-CAM and Grad-CAM++. In the experiments, we rethink issues of previous evaluation metrics and propose a representative evaluation approach, the Energy-Based Pointing Game, to measure the quality of the generated saliency maps. Our approach outperforms previous methods on the energy-based pointing game and on recognition, and shows more robustness under adversarial attack.




1 Introduction

Convolutional neural networks have made great breakthroughs in computer vision, including image classification[20], object detection[15], semantic segmentation[23], image captioning[11][36] and visual question answering[3]. In recent years, although network architectures[17][19][18][40] have been continuously simplified and refined, the interpretability of neural networks is still a thorny problem, because prediction results cannot be decomposed into intuitive and understandable parts[22].

Figure 1: Visualization results of Vanilla Backpropagation[31], Guided Backpropagation[34], SmoothGrad[33], Integrated Gradients[35], Mask[12], RISE[27], Grad-CAM[30], Grad-CAM++[6] and our proposed Score-CAM.

Interpretability is crucial for building people's confidence in the predictions of a neural network. Narrowing down to image classification, a common visual explanation approach, usually called a saliency map or attribution map, is to find the regions of an image that are most influential on the target class score predicted by the model. Several types of approaches have been proposed to generate a saliency map, which generally follow three directions.

The most common one is gradient-based visualization[31, 38, 34], which backpropagates the partial derivative of the target class score with respect to the input layer. These methods are usually fast to compute and can produce fine-grained saliency maps. However, gradient-based maps are generally of low quality and contain a lot of random noise[33]. Recently, [2] claims that although the original gradient is sensitive to model parameters, Guided Backpropagation[34], which has better visual performance, is invariant to them and works just like an edge detector. The second direction is perturbation-based methods[28, 27, 12, 5, 7, 37], which work by perturbing the original input and observing the change in the prediction. These approaches generate easy-to-interpret explanations in the image space; however, they are usually quite time-consuming and only generate coarse explanations[37]. In addition, optimization-based perturbation[12] usually needs additional regularization to find minimum pieces of evidence and has many hyperparameters to fine-tune for each instance. Meanwhile, because of the random term in the regularization, the generated explanation for the same input may differ from run to run.

The third direction is Class Activation Mapping (CAM)[41] and its extensions, including Grad-CAM[30] and Grad-CAM++[6], which generate the saliency map as a linear weighted combination of activation maps (or feature maps) to highlight important regions in the image space. Unlike the original CAM, which requires a special global average pooling layer and cannot be applied to general networks, Grad-CAM and Grad-CAM++ are both applicable to a wide variety of CNN model families without any architectural changes or re-training. They can also be combined with fine-grained visualizations to create a high-resolution, class-discriminative visualization. However, their results usually contain random noise that is irrelevant to the target object in the image, and their weights do not capture the importance of each activation map well, as shown in Fig 3.

Our work also builds on Class Activation Mapping. To address the limitations above, we propose a new post-hoc visual explanation method named Score-CAM, which avoids most of the irrelevant noise and generates cleaner and more meaningful explanations. Unlike existing work[30, 6], which uses the backpropagated gradient to represent the importance of each activation map, we follow the idea of perturbation-based methods, which mask parts of the original input and observe the change in the target score. We treat the activation maps as a type of mask and obtain prediction scores for each activation map; the score on the target class is then used to represent the importance of that activation map. Our key contributions are summarized as follows:

(1) We introduce a new visual explanation method Score-CAM based on Class Activation Mapping with a simple but efficient importance representation for each activation map.

(2) We propose a new metric to quantitatively evaluate the generated explanations with respect to the underlying model, i.e., how much energy of the saliency map falls into the region of the target object. Under this metric, Score-CAM outperforms state-of-the-art methods.

(3) The faithfulness of our method is evaluated on the recognition task, where we outperform previous work by a large margin.

(4) Finally, we conduct adversarial tests to measure the robustness and stability of the generated saliency maps, and demonstrate that the score-based weight representation shows better performance than gradient-based weights.

2 Related work

Various methods have been introduced to generate explanations for model predictions; [8] provides a detailed survey. In this section, we present explanation methods that generate visual explanations specifically for the predictions of convolutional neural networks.

Gradient-Based. These methods, also referred to as gradient-based visualization methods, backpropagate the gradient of a target class to the input layer to highlight image regions that highly influence the prediction. [32] uses the derivative of the target class score with respect to the input image to generate a saliency map. Similar to [32], the deconvolution approach (Deconvnet[38]) and Guided Backpropagation[34] both build on the gradient, but additionally manipulate the original gradient through different gate functions. Integrated Gradients[35] addresses gradient saturation by accumulating gradients along a path from a base image to the input image. SmoothGrad[25] and VarGrad[1] seek to alleviate noise and visual diffusion in saliency maps by adding noise to the input. Backpropagation methods usually produce fine-grained saliency maps; however, these maps are generally of low quality and contain a lot of noise[25].

Perturbation-Based. These approaches work by perturbing the original input and observing the change in the model's prediction. LIME[28] uses super-pixels to occlude the input image and compute importance scores. RISE[27] randomly samples masks to occlude the original input and defines importance as the predicted score over the masked image. [12, 5, 7, 37] present an image saliency paradigm that learns where an algorithm looks by optimizing a random noise perturbation. However, to find minimum evidence, these approaches usually need additional regularization[12] and are time-consuming.

Class Activation Mapping-Based (CAM-Based). This type of method treats the explanation as a linear weighted combination of activation maps from convolutional layers. The original CAM[41] has to modify the network structure by inserting a global average pooling layer and then re-train the network, which seriously limits its application. Grad-CAM[30] and Grad-CAM++[6] generalize CAM[41] and mainly differ in how they calculate the weights of the activation maps. Extensions of such approaches have also been proposed: [30] and [9] combine Class Activation Mapping with backpropagation and perturbation respectively. Recently, Smooth Grad-CAM++[25] combined methods from Grad-CAM++[6] and SmoothGrad[33] to produce visually sharper maps. Other works[26, 21] apply class activation maps as attention maps in object segmentation and visual question answering.

3 Motivations

3.1 Is Gradient Stable?

In SmoothGrad[33], the researchers observe that the backpropagated gradient is quite unstable and produces random noise in gradient-based saliency maps. Fig 2 shows that the gradient changes sharply when the input image changes slightly, even though the change is imperceptible to humans and does not alter the prediction result. Therefore, it is reasonable to doubt the effectiveness of the gradient-based weights adopted in Grad-CAM and Grad-CAM++. A stability check is conducted in Sec 5.4.2.

Figure 2: The partial derivative of the class score $S_c$ with respect to the RGB values of a single pixel (middle plot) as one slowly moves away from a baseline image $x$ (left plot) to a fixed location $x + \epsilon$ (right plot), where $\epsilon$ is a single random noise sample. The final image $x + \epsilon$ is indistinguishable to a human from the original image $x$.

3.2 Is Gradient-Based Weight Representative?

A considerable amount of noise is distributed randomly around the target object, as shown in the seventh and eighth columns of Fig 1, although the area of the target object is highlighted correctly in gradient-based saliency maps[30, 6]. Setting a threshold to mute low-valued pixels may be the most direct way to avoid random noise; however, it is almost impossible to determine a general threshold for every instance, and such truncation may also distort the nature of the network's decision and works more like post-processing.

Figure 3: Visualization of activation maps, upsampled to the input space and overlaid on the input. The first column shows the results of Grad-CAM and Grad-CAM++. The other columns show activation maps with their weights above. As shown, the weight does not capture the importance of each activation map well.

The original CAM[41] and its derivatives, Grad-CAM[30] and Grad-CAM++[6], generate saliency maps by linear combination and share a common assumption that the weight represents the importance of each activation map. Thus, an activation map with a higher weight should have a more positive impact on the target score, and vice versa. To validate this assumption, we visualize the activation maps generated by Grad-CAM and Grad-CAM++ in Fig 3. Counterintuitively, the weights of Grad-CAM and Grad-CAM++ do not capture the importance of each activation map well, and the sign of a weight has no direct relationship with importance. Activation maps with positive weights may highlight the background, while activation maps with negative weights may focus on the target object. Therefore, a more representative weight is needed, and this is the starting point of this paper.
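For reference, the shared linear-combination form and the gradient-based weight questioned here can be written as below. This is a sketch of the Grad-CAM formulation[30]; the symbols $y^c$ (score of class $c$ before the softmax), $A^k_{ij}$ (value of the $k$-th activation map at location $(i, j)$) and $Z$ (number of spatial locations) are notation introduced for this note rather than taken from the text above.

```latex
L^c = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big),
\qquad
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}
```

Grad-CAM++ keeps the same combination and differs in how $\alpha_k^c$ is computed from the gradients, so any instability in the gradient carries over directly into the weights.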

4 Approach

In this section, we introduce the proposed Score-CAM for visually interpreting CNN-based predictions. The pipeline of the proposed framework is illustrated in Fig 4. The methodology is first introduced in Sec 4.1, followed by implementation details in Sec 4.2 and Sec 4.3. We show how to use Score-CAM to generate fine-grained explanations similar to Grad-CAM[30] in Sec 4.4. Finally, we discuss the interpretability of the proposed method in Sec 4.5.

Figure 4: Pipeline of the proposed Score-CAM. Activation maps are first extracted in Phase 1. In Phase 2, each activation map works as a mask on the original image, and its forward-passing score on the target class is obtained. Phase 2 repeats $N$ times, where $N$ is the number of activation maps. Finally, the result is generated by a linear combination of score-based weights and activation maps.

4.1 Methodology

Several previous works[4, 24] have asserted that deeper representations in a CNN capture higher-level visual information. Furthermore, convolutional features naturally retain spatial information that is lost in fully-connected layers, so it is common to expect the last convolutional layers to offer the best compromise between high-level semantics and detailed spatial information, and the neurons in these layers to look for semantic, class-specific information in the input image.

Previous methods such as Grad-CAM[30] and Grad-CAM++[6] use the gradient information flowing into the last convolutional layer of the CNN to represent the importance of each activation map for the prediction score on the target class. In contrast, in Score-CAM the weights come from the score of the target class corresponding to the activation maps. Therefore, Score-CAM removes the dependence on gradients and works as a more general framework, as it only requires access to the activation maps and output scores of the model.

In order to obtain the class-discriminative localization map of Score-CAM, each activation map $A^k$ is first upsampled to the original input size using bilinear interpolation, where $k \in \{1, \dots, N\}$ indexes the channels of the last convolutional layer, for example from $14 \times 14$ to $224 \times 224$ in the VGG16[32] architecture. Then, instead of setting all elements to zeros and ones, we normalize the raw activation values in each activation map into $[0, 1]$, so that the relative intensity between pixels is well preserved. Thus, activation maps are not binary but have values in $[0, 1]$.


Each upsampled activation map corresponds to a specific region in the original input space. Unlike [27], which generates masks smaller than the image by Monte Carlo sampling and then upsamples them to the image space, Score-CAM does not require a separate mask-generation process. Instead, each upsampled activation map not only shows where the neuron looks, but can also directly work as a mask to perturb the input image.

We project the highlighted area in the activation map to the original input space by multiplying the normalized activation map $A^k$ by the original input $X$, and obtain a masked image $M^k$:

$$M^k = A^k \circ X$$

where $\circ$ denotes element-wise multiplication and $A^k$ is first replicated across the channels of $X$ before multiplication.

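We generate a set of masks $\{M^1, \dots, M^N\}$, where $N$ is the number of channels of the last convolutional layer of the model. Finally, we feed each $M^k$ into the CNN model to conduct a forward computation $F(\cdot)$:

$$S^k = F(M^k)$$

The output score is obtained after the Softmax operation:

$$\hat{S}^k_c = \frac{\exp(S^k_c)}{\sum_{c'} \exp(S^k_{c'})}$$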


We take the Softmax-normalized score on the target class $c$ as the weight $w^c_k$ to represent the importance of the $k$-th activation map $A^k$. The final class-discriminative localization map is obtained by a linear weighted combination of all activation maps:

$$L^c_{Score-CAM} = \mathrm{ReLU}\Big(\sum_k w^c_k A^k\Big)$$

Notice that this results in a coarse heatmap of the same size as the convolutional activation map ($14 \times 14$ in the case of the last convolutional layers of VGG and AlexNet networks). Similar to former work[30, 6], we apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence on the class of interest.
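The steps of this section can be sketched end to end in a few lines. The following is a minimal, dependency-free illustration rather than the authors' implementation: it assumes a single-channel image stored as nested lists, corner-aligned bilinear interpolation, min-max normalization, and a hypothetical `toy_model` that stands in for the CNN by scoring the left and right image halves. All function names are illustrative.

```python
import math

def bilinear_upsample(a, out_h, out_w):
    """Corner-aligned bilinear interpolation of a 2-D list to out_h x out_w."""
    in_h, in_w = len(a), len(a[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = a[y0][x0] * (1 - fx) + a[y0][x1] * fx
            bot = a[y1][x0] * (1 - fx) + a[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

def min_max_normalize(a):
    """Rescale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in a for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in a]
    return [[(v - lo) / (hi - lo) for v in row] for row in a]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_cam(image, activations, model, target):
    """Return (coarse saliency map, per-map weights) for one target class."""
    h, w = len(image), len(image[0])
    weights = []
    for a in activations:
        mask = min_max_normalize(bilinear_upsample(a, h, w))
        masked = [[mask[i][j] * image[i][j] for j in range(w)] for i in range(h)]
        weights.append(softmax(model(masked))[target])  # score-based weight
    ah, aw = len(activations[0]), len(activations[0][0])
    saliency = [[max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
                 for j in range(aw)] for i in range(ah)]  # ReLU of weighted sum
    return saliency, weights

def toy_model(img):
    """Hypothetical 2-class 'network': class 0 = left half, class 1 = right half."""
    half = len(img[0]) // 2
    return [sum(sum(r[:half]) for r in img), sum(sum(r[half:]) for r in img)]

image = [[1.0] * 4 for _ in range(4)]
activations = [[[1.0, 0.0], [1.0, 0.0]],   # fires on the left
               [[0.0, 1.0], [0.0, 1.0]]]   # fires on the right
saliency, weights = score_cam(image, activations, toy_model, target=0)
```

In this toy setting the left-firing map receives almost all of the weight for class 0, so the coarse saliency map concentrates on the left column. The sketch needs one forward pass per activation map, which matches the $N$ repetitions of Phase 2 in Fig 4.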

4.2 Normalization on Activation Map

To evaluate the relevance of the highlighted region in each activation map to the target class, we map the activation map back into the original input space through upsampling and multiply it point-wise with the original input image. Since the raw values in each activation map may have varied ranges, we normalize each activation map to [0, 1] before multiplication. Each value in the normalized activation map represents the importance of the corresponding pixel in the input space. In the experiments, we also tried directly binarizing the activation map, where all non-zero values are set to 1 and the rest are set to 0. The performance of binarization is similar to that of normalization, but Score-CAM with normalization generates a saliency map with less noise.
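The two masking variants compared here can be illustrated as follows. This is a small sketch that assumes min-max scaling as the normalization (the text only states that maps are normalized to [0, 1]); the function names and the toy values are illustrative.

```python
def min_max_normalize(act):
    """Rescale a 2-D activation map (list of rows) to [0, 1]."""
    flat = [v for row in act for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: no region stands out
        return [[0.0 for _ in row] for row in act]
    return [[(v - lo) / (hi - lo) for v in row] for row in act]

def binarize(act):
    """Set every non-zero activation to 1 and everything else to 0."""
    return [[1.0 if v != 0 else 0.0 for v in row] for row in act]

act = [[0.0, 2.0], [4.0, 8.0]]
soft_mask = min_max_normalize(act)  # keeps the relative intensity of pixels
hard_mask = binarize(act)           # keeps only where the map is active
```

The normalized mask attenuates weakly activated pixels instead of passing them through at full strength, which is one plausible reading of why it yields less noise than the binary mask.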

4.3 Normalization on Target Score

In Grad-CAM[30] and Grad-CAM++[6], the weight is derived from the gradient. In Score-CAM, however, we replace it with the target score obtained by forward-passing the inputs masked by the activation maps through the model.

Figure 5: Effect of normalization on the output score. The second and fourth images are w.r.t. 'boxer dog'; the third and last images are w.r.t. 'tiger cat'.

Instead of using the raw output score on the target class as the weight, we apply the Softmax function to the output scores, so that the score is rescaled into a fixed range. The intuition is that each forward pass is independent, so the score amplitude of each forward pass is unpredictable. For example, in one forward pass the model may give a high raw score to the target class but an even higher score to another class, so the masked region should be more related to that other class. In another forward pass the target class may receive a lower raw score that is nevertheless the highest among all classes, so this input should be more relevant to the target class than the previous one. If we used raw output scores as weights, the activation map from the first pass would be regarded as more important than the one from the second, which is unreasonable. Therefore, we apply the Softmax function to the output scores.
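A small numeric illustration of this argument is given below. The logits are hypothetical values chosen only to reproduce the situation described; they are not taken from the paper.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

TARGET = 0  # index of the target class in each score vector

# Hypothetical raw scores [target, other] from two independent forward passes.
pass_1 = [15.0, 20.0]  # target scores high, but another class scores higher
pass_2 = [10.0, 2.0]   # target scores lower, yet it is the top class

raw_1, raw_2 = pass_1[TARGET], pass_2[TARGET]
soft_1, soft_2 = softmax(pass_1)[TARGET], softmax(pass_2)[TARGET]
```

With raw scores the first activation map would outrank the second (15 versus 10), whereas after Softmax the ordering is reversed, because the target class dominates only in the second pass.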

Although the Softmax function is commonly used in classification tasks, it is not obvious whether applying it here makes a difference. An interesting finding is shown in Fig 5. The model predicts the input image as 'boxer dog', which is correctly highlighted no matter which type of score is adopted. However, Score-CAM highlights the regions of both 'boxer dog' and 'tiger cat' for the target class 'tiger cat' if the raw score is used. In contrast, Score-CAM with Softmax distinguishes the two categories well, even though the prediction probability of 'tiger cat' is lower than that of 'boxer dog'. Therefore, normalization of the target score equips Score-CAM with good class discrimination ability.

4.4 Fine-grained Explanation

Guided Score-CAM. While Score-CAM visualizations are class-discriminative and localize relevant image regions well, they cannot show fine-grained importance like pixel-space gradient visualization methods (Guided Backpropagation[34] and Deconvolution[38]). For example, Score-CAM can easily localize the dog region; however, it is unclear from the low-resolution saliency map why the network predicts this particular instance as the given class.

Figure 6: Visualization of Guided Score-CAM.

To combine the best aspects of both, we fuse Guided Backpropagation[34] and the Score-CAM visualization via point-wise multiplication. The resulting visualization is both high-resolution (for the dog class of interest, it identifies important features like noses, chins, and eyes) and class-discriminative (it highlights the class of interest but not the other class in the image). [30] reports that replacing Guided Backpropagation with Deconvolution gives similar results, but Deconvolution has artifacts and Guided Backpropagation visualizations are generally less noisy, so we also choose Guided Backpropagation.
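The fusion step itself is a single element-wise product. The sketch below assumes both maps have already been brought to the same resolution (the Score-CAM map upsampled to the input size) and uses made-up values; the function name is illustrative.

```python
def guided_score_cam(saliency, guided_backprop):
    """Point-wise product of a Score-CAM map and a Guided Backpropagation map."""
    if len(saliency) != len(guided_backprop) or any(
        len(s) != len(g) for s, g in zip(saliency, guided_backprop)
    ):
        raise ValueError("both maps must have the same height and width")
    return [[s * g for s, g in zip(srow, grow)]
            for srow, grow in zip(saliency, guided_backprop)]

# Toy maps: the saliency map is class-specific, the gradient map is fine-grained.
saliency = [[1.0, 0.0], [0.5, 0.0]]
guided_backprop = [[0.2, 0.9], [-0.4, 0.7]]
fused = guided_score_cam(saliency, guided_backprop)
```

Fine-grained detail survives only where the class-specific map is non-zero, which is what makes the fused result class-discriminative.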

4.5 Discussion

One problem of gradient-based methods is that the generated saliency map is hard to interpret in an intuitive way, since the gradient is imperceptible to humans. Moreover, although the gradient reflects how much influence each activation map may have on the decision score, the instability discussed in Sec 3 may destroy this property. Perturbation-based approaches use the change in target class probability as a measure of importance for sampled masks, using insertion or deletion strategies. These approaches generate easy-to-interpret explanations in the image space, because significant changes in the target class score can be observed depending on whether a mask is applied.

In Score-CAM, the saliency map is obtained by combining the activation maps with their weights. The activation map shows the features learned by the model, while the weight shows the importance of those features to the target category. Unlike previous CAM-based work, Score-CAM removes the dependence on gradients when representing the importance of each activation map. It borrows the idea from perturbation-based approaches of using the score on the target class as a measure of importance; the difference is that we do not rely on the change in the target score, but directly use the score on the target class.

Our work bridges the gap between perturbation-based methods and CAM-based methods, and makes the weight representation more interpretable: activation maps with high weights are more likely to be predicted as the target class.

5 Experiments

Datasets and Base Models: We evaluate Score-CAM on two publicly available object classification datasets, namely PASCAL VOC07[10] and ILSVRC2012 val[29]. Given a base model, we test the saliency maps generated by different visualization methods for a target object category. In the following experiments, unless stated otherwise, we use the pre-trained VGG16 network from the PyTorch model zoo as the base model.

Input images are resized to (224 × 224 × 3), transformed to the range [0, 1], and then normalized using the mean vector [0.485, 0.456, 0.406] and standard deviation vector [0.229, 0.224, 0.225]. No further pre-processing is performed.

5.1 Evaluating Visualization

5.1.1 Class Discriminative Visualization

We qualitatively compare the saliency maps produced by our proposed Score-CAM with those of eight state-of-the-art methods, including the previous CAM-based methods Grad-CAM and Grad-CAM++; results are shown in Fig 1. Compared with the other methods, our method generates more visually interpretable saliency maps with less random noise. As observed, Score-CAM shows much less random noise than Mask[12], RISE[27], Grad-CAM[30] and Grad-CAM++[6], and it generates smoother saliency maps than gradient-based methods.

Figure 7: Class-discriminative result. The middle and right plots are generated w.r.t. the two classes in the image, bull mastiff and tiger cat.

We demonstrate that the proposed Score-CAM can also distinguish different classes, as shown in Fig 7. The VGG-16 model classifies the input as bull mastiff and, with much lower confidence, as tiger cat. Our method correctly localizes both categories, even though the prediction probability of the latter is much lower than that of the former. It is reasonable to expect Score-CAM to distinguish different categories, because the weight of each activation map is correlated with the response on the target class and is therefore class-discriminative. Thus, the most discriminative region of the target object is expected to receive the highest saliency.

5.1.2 Multiple Occurrences of the Same Class

Besides generating class-discriminative saliency maps, Score-CAM also performs better than previous work at locating multiple objects. As shown in Fig 8, Grad-CAM[30] tends to capture only one object in the image. Grad-CAM++[6] and Score-CAM can both locate multiple objects, but Score-CAM generates less noise than Grad-CAM++.

Figure 8: Results on multiple objects.

Again, as the weight of each activation map is given by its score on the target class, each target object that the model predicts with a high confidence score can be highlighted separately. Finally, all pieces of evidence are assembled through linear combination.

Figure 9: Visualization of saliency maps for the localization task. Left: localization results, where the red bounding box is for Score-CAM and the blue and green ones are for Grad-CAM and Grad-CAM++. Right: saliency maps generated by the three methods respectively.

Method        Grad   Smooth   Integrated   Mask   RISE   GradCAM   GradCAM++   ScoreCAM
Overlap(%)    41.3   42.4     44.7         56.1   36.3   48.1      49.3        63.7

Table 1: Comparative evaluation on Energy-Based Pointing Game (higher is better).

Method                 Mask   RISE   GradCAM   GradCAM++   ScoreCAM
Average Drop(%)        63.5   47.0   47.8      45.5        31.5
Average Increase(%)    5.29   14.0   19.6      18.9        30.6

Table 2: Evaluation results on Recognition (lower is better in Average Drop, higher is better in Average Increase).

5.2 Energy Based Evaluation

5.2.1 Rethinking of Quantitative Evaluation

Several assessments of the quality of generated saliency maps have been proposed using auxiliary metrics such as localization error with respect to ground-truth bounding boxes and pointing accuracy[39]; however, these measurements do not correlate with the actual quality of the generated saliency map.

Fig 9 shows examples of weakly supervised localization. As shown in the first two rows, all CAM-based methods, including our proposed Score-CAM, fail to localize the object, but because the pixels in the saliency maps generated by Grad-CAM and Grad-CAM++ are more uniform, they still produce bounding boxes with IOU larger than 50%. In the last row, although Score-CAM correctly localizes part of the target object with less noise, it still counts as an error because its overlap with the ground truth is below 50%. The IOU metric in the localization task thus rewards overlap with the whole object, which may not be suitable for evaluating saliency maps that are expected to highlight the most important part.
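To make the IOU argument concrete, the sketch below computes intersection over union for two axis-aligned boxes and applies it to a hypothetical case in which a tight box sits entirely inside the ground truth. The box format `(x_min, y_min, x_max, y_max)` and the coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)  # whole object
tight_box = (3, 3, 7, 7)       # box around the most discriminative part only
loose_box = (0, 0, 12, 12)     # box around a diffuse, noisy saliency map

tight_iou = iou(ground_truth, tight_box)
loose_iou = iou(ground_truth, loose_box)
```

The tight box lies completely on the object yet fails the 50% criterion, while the loose box passes it, which mirrors the behaviour described for Fig 9.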

As [27] stresses, good pointing accuracy may not represent the quality of the generated saliency map well. Specifically, considering only whether the maximum point falls into the bounding box involves randomness and is not enough to represent the whole map. Imagine that the generated map is uniformly distributed, so that each spatial location receives equal attention: the maximum point falls into the bounding box without doubt, but the generated saliency map is obviously of poor quality.

5.2.2 Energy Based Pointing Game

Having discussed possible problems with previous evaluation approaches, we now introduce a new evaluation method to measure the quality of the generated saliency map. Our method extends the pointing game, but unlike the pointing game, which extracts the maximum point of the saliency map and checks whether it falls into the object bounding box, we treat the problem from an energy-based perspective and calculate how much energy of the saliency map falls into the object bounding box. Specifically, we first build a binary mask from the bounding box of the target category, where the inside region is assigned 1 and the outside region 0. We then multiply it point-wise with the generated saliency map and sum over the result to obtain the energy inside the target bounding box; this metric is reported as Overlap(%) in Table 1. We call this method the energy-based pointing game.
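The metric can be sketched as follows. Dividing by the total energy of the map is an assumption made here so that the result is a fraction, in line with the percentages reported in Table 1; the bounding box convention `(row_min, col_min, row_max, col_max)` with exclusive upper bounds and the toy maps are also illustrative.

```python
def energy_in_bbox(saliency, bbox):
    """Fraction of (non-negative) saliency energy that falls inside bbox."""
    r0, c0, r1, c1 = bbox
    total = sum(v for row in saliency for v in row)
    if total == 0:
        return 0.0
    # Equivalent to multiplying by a 0/1 mask of the box and summing.
    inside = sum(saliency[i][j] for i in range(r0, r1) for j in range(c0, c1))
    return inside / total

bbox = (1, 1, 3, 3)  # 2 x 2 object region in a 4 x 4 map

peaked = [[0, 0, 0, 0],
          [0, 3, 1, 0],
          [0, 1, 3, 0],
          [0, 0, 0, 2]]
uniform = [[1.0] * 4 for _ in range(4)]

peaked_score = energy_in_bbox(peaked, bbox)
uniform_score = energy_in_bbox(uniform, bbox)
```

A uniform map would trivially win the classic pointing game whenever its arbitrary maximum lands in the box, but here it scores only the area fraction of the box, while the peaked map is rewarded for concentrating energy on the object.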

We observe that in the ILSVRC validation set the object commonly occupies most of the image, which makes such images unsuitable for measuring the quality of the saliency map. We therefore select 500 images from the ILSVRC 2012 validation set, filtering out images where the object occupies more than 50% of the image; for convenience, we only consider images with a single bounding box. The results show that our method outperforms previous work by a large margin, with more than 60% of the saliency map energy falling into the ground-truth bounding box of the target object. This also indicates that the saliency map generated by Score-CAM has less noise.
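The selection rule can be expressed as a simple filter. The record layout below (`width`, `height`, `boxes`) is a hypothetical annotation format used only for illustration, and how the 500 images were drawn from the images that pass the filter is not specified in the text.

```python
def select_images(records, max_fraction=0.5):
    """Keep images with exactly one box that covers at most max_fraction of the image."""
    selected = []
    for rec in records:
        if len(rec["boxes"]) != 1:
            continue  # only single-object images are considered
        x0, y0, x1, y1 = rec["boxes"][0]
        fraction = (x1 - x0) * (y1 - y0) / (rec["width"] * rec["height"])
        if fraction <= max_fraction:
            selected.append(rec)
    return selected

# Hypothetical annotations: (x_min, y_min, x_max, y_max) in pixels.
records = [
    {"name": "a", "width": 100, "height": 100, "boxes": [(10, 10, 50, 50)]},
    {"name": "b", "width": 100, "height": 100, "boxes": [(0, 0, 90, 90)]},
    {"name": "c", "width": 100, "height": 100,
     "boxes": [(0, 0, 20, 20), (50, 50, 70, 70)]},
]
kept = [rec["name"] for rec in select_images(records)]
```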

We do not compare with Guided Backpropagation[34] because it works like an edge detector rather than producing a saliency map (heatmap).

5.3 Evaluating Faithfulness on Recognition

Figure 10: Model's prediction along with deletion scores (AUC). The bottom-right plot shows the probability of the target class predicted by the network vs. the fraction of removed pixels. In this example, Score-CAM provides more accurate saliency and achieves the lowest AUC.

We evaluate the faithfulness of the explanations generated by Score-CAM on the object recognition task, as in [6]. We mask the original input by point-wise multiplication with the saliency map and observe the score change on the target class. However, we find it unfair to compare these methods directly on recognition tasks, because context information can usually also provide some hint for the prediction. Therefore, to conduct this experiment fairly, we constrain the energy in the saliency map: rather than multiplying point-wise with the originally generated saliency map, we limit the number of positive pixels in the saliency map (50% of pixels are muted in our experiment). We follow the metrics used in [6] to measure quality. The Average Drop is expressed as $\frac{1}{N}\sum_{i=1}^{N} \frac{\max(0, Y_i^c - O_i^c)}{Y_i^c} \times 100$ and the Increase In Confidence is expressed as $\frac{1}{N}\sum_{i=1}^{N} \mathrm{Sign}(Y_i^c < O_i^c) \times 100$, where $N$ is the number of images, $Y_i^c$ is the predicted score for class $c$ on image $i$, and $O_i^c$ is the predicted score for class $c$ with the explanation map region as input. $\mathrm{Sign}$ is an indicator function that returns 1 if its input is True.
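Both metrics follow directly from their definitions in [6]. The sketch below averages over images and assumes every original score is strictly positive; the score values are made up for illustration.

```python
def average_drop(original, explained):
    """Mean relative drop (in %) of the target score when only the explanation is kept."""
    if not original or len(original) != len(explained):
        raise ValueError("score lists must be non-empty and of equal length")
    drops = [max(0.0, y - o) / y for y, o in zip(original, explained)]
    return 100.0 * sum(drops) / len(drops)

def increase_in_confidence(original, explained):
    """Share of images (in %) whose target score rises on the explanation input."""
    if not original or len(original) != len(explained):
        raise ValueError("score lists must be non-empty and of equal length")
    rises = sum(1 for y, o in zip(original, explained) if y < o)
    return 100.0 * rises / len(original)

# Hypothetical target-class scores on full images and on masked images.
full_scores = [0.8, 0.5, 0.9]
masked_scores = [0.4, 0.6, 0.9]

drop = average_drop(full_scores, masked_scores)
increase = increase_in_confidence(full_scores, masked_scores)
```

Only the first image contributes to the drop (a 50% relative decrease) and only the second counts as an increase, so a faithful explanation method should push the first number down and the second up.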

5.4 Interpretation Under Adversarial Attacks

In this section, we test whether our interpretation method can tolerate adversarial attacks. [14] demonstrates that several recent interpretation methods are fragile under adversarial attack: a small perturbation of the input can drastically change their interpretation results.

5.4.1 Robustness Test

Specifically, the Fast Gradient Sign Method (FGSM)[16] is used to produce adversarial inputs. Fig 11 illustrates the interpretation results for adversarial examples. After a small and unnoticeable perturbation is added to the original input, the adversarial attack causes the classifier to miscategorize the input as pug with high confidence (69.4%) and as border terrier with 91.8% confidence. Our interpretation method can still locate the true label.
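For readers unfamiliar with the attack, a single FGSM step moves every input value by a fixed amount in the direction of the sign of the loss gradient. The sketch below assumes the gradient has already been computed by the model and that pixel values are clipped back to [0, 1]; the clipping range, the step size and the numbers are illustrative assumptions.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm_step(x, grad, eps, lo=0.0, hi=1.0):
    """x_adv = clip(x + eps * sign(grad)) applied element-wise to flat lists."""
    if len(x) != len(grad):
        raise ValueError("input and gradient must have the same length")
    return [min(hi, max(lo, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

pixels = [0.2, 0.5, 0.99, 0.3]
loss_grad = [0.3, -1.2, 0.4, 0.0]  # hypothetical gradient of the loss w.r.t. pixels
adversarial = fgsm_step(pixels, loss_grad, eps=0.05)
```

Increasing `eps` yields progressively stronger perturbations, which is one way to realise the growing noise level used in the stability check of Sec 5.4.2.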

Figure 11: Robustness under adversarial attack. The first column shows the predicted class and corresponding confidence score. The last three columns are the saliency maps generated by Grad-CAM, Grad-CAM++ and Score-CAM.

For comparison, we also test Grad-CAM[30] and Grad-CAM++[6] in the same setting. We observe two facts. First, although Grad-CAM and Grad-CAM++ can still highlight important regions, the amount of random noise increases. Second, and harder to perceive at first glance, the peak of their saliency maps moves away to less important points. Our method is more robust than both: in particular, its peak remains stable even under adversarial attack. This demonstrates that the proposed interpretation method is more robust and can provide reasonable interpretations in adversarial settings.

5.4.2 Stability Check

Since our motivation comes from the instability of the gradient (see Fig 2), we conduct a stability test to observe whether our proposed weight is more stable than those of Grad-CAM[30] and Grad-CAM++[6].

Figure 12: Weight variance under adversarial attack.

In the experiments, we continuously increase the degree of noise with FGSM[16] and observe the variance of the weights. As shown in Fig 12, the weights of Grad-CAM and Grad-CAM++ vary sharply under adversarial attack, and the sign of the Grad-CAM weight even changes. In comparison, our proposed Score-CAM, which replaces the gradient-based weight with a score-based weight, has a much smoother curve. The stability of our weight thus demonstrates its effectiveness and makes our method more applicable in the real world.

5.5 Applications

5.5.1 Harnessing Explanations For Model Analysis

A good post-hoc visual explanation should not only tell where the model looks, but also help researchers analyze their models. Much previous work treats visual explanation as a way to do localization, but ignores its usefulness for analyzing the original model. In this part, we show how to harness the explanations generated by Score-CAM for model analysis, and provide insights for future exploration.

Figure 13: The left is generated by a VGG16 without fine-tuning (22.0% classification accuracy); the right is generated by a fine-tuned VGG16 (90.1% classification accuracy).

We have two observations. The first is that Score-CAM works well on the localization task even when the classification performance of the model is poor, but as the classification performance improves, the noise in the saliency map decreases and the map focuses more on the important region. The amount of noise therefore hints at the classification performance. It can also serve as a hint for convergence: if the generated saliency map no longer changes, the model may have converged.
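As a rough illustration of the convergence hint, one could compare the saliency maps produced at consecutive training checkpoints. The sketch below is an assumption-laden example rather than part of the method: the maps are taken to be equally sized 2-D lists with values in [0, 1], and both the mean absolute difference and the tolerance `tol` are arbitrary choices made for illustration.

```python
def mean_abs_change(prev_map, curr_map):
    """Mean absolute per-pixel difference between two equally sized 2-D maps."""
    diffs = [abs(a - b)
             for p_row, c_row in zip(prev_map, curr_map)
             for a, b in zip(p_row, c_row)]
    return sum(diffs) / len(diffs)


def saliency_stabilized(maps, tol=0.01):
    """True if the two most recent checkpoint maps differ by less than tol
    on average (tol is an illustrative threshold, not a recommended value)."""
    return len(maps) >= 2 and mean_abs_change(maps[-2], maps[-1]) < tol


# Toy usage: saliency maps for the same image at three checkpoints.
checkpoints = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 0.25], [0.0, 0.0]],
    [[1.0, 0.25], [0.0, 0.0]],
]
stable = saliency_stabilized(checkpoints)
```

Such a check is only a hint, as the text says: a map that stops changing suggests, but does not prove, that training has converged.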

Figure 14: The left column is the input example, the middle is the saliency map w.r.t the predicted class (person), and the right is the saliency map w.r.t the target class (bicycle).

The second is that Score-CAM results can tell why the model makes a wrong prediction. In Fig 14, images with the label bicycle are classified as person, so we generate Score-CAM saliency maps for both the predicted class and the correct class. Comparing the two shows that person is correlated with bicycle, possibly because a person appears in most of the bicycle images in the training set, and that the person region is the most distracting part, which leads to the misclassification.

5.5.2 Prediction Reasoning

Score-CAM can also provide reasons for a prediction, describing a causal situation of the form: "If X had not occurred, Y would not have occurred". In Score-CAM, prediction reasoning highlights the regions that would make the network change its decision to another class.

Figure 15: Prediction reasoning with Score-CAM. From left to right are input, Score-CAM w.r.t class bull mastiff, Score-CAM w.r.t class pug, the counterfactual explanation to tell why the model predicts target bull mastiff rather than reference pug.

Specifically, we first extract the saliency maps generated by Score-CAM for the target class and the reference class, $L^{c_t}$ and $L^{c_r}$ respectively. The reference class is a class that we compare against the target class. For similar classes, it is reasonable for the two maps to share some common regions. The counterfactual explanation $L^{c_t|c_r}$ for the target class is obtained by subtracting the reference map from the target map, followed by a ReLU:

$$L^{c_t|c_r} = \mathrm{ReLU}\left(L^{c_t} - L^{c_r}\right)$$

$L^{c_t|c_r}$ gives the region that drives the model to predict the target class rather than the reference class. In other words, the region highlighted in $L^{c_t|c_r}$ is the reason why the model predicts the input as the target class bull mastiff rather than the reference class pug. In the given example, the model predicts the input as bull mastiff with 49.1% confidence and as pug with 33.1% confidence, where bull mastiff is the target class and pug is the reference class. If we mask the original input according to $L^{c_t|c_r}$, the model predicts the masked input as pug with 74.9% confidence and as bull mastiff with 18.6% confidence. Since the chins of a bull mastiff and a pug are hard to tell apart, the generated counterfactual explanation is consistent with human judgment. Fine-grained classification[13], which distinguishes similar classes such as bull mastiff and pug, has recently received more attention, and we believe the prediction reasoning generated by Score-CAM can also provide insights for it.
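The subtraction step described above (target-class map minus reference-class map, followed by a ReLU) can be illustrated with a small sketch. It assumes both saliency maps are already computed, have the same size, and are stored as 2-D lists; the function name and the toy values are made up for illustration.

```python
def counterfactual_map(target_map, reference_map):
    """Element-wise ReLU(target - reference) over two equally sized 2-D maps.

    Positive values mark regions that support the target class more
    strongly than the reference class; everything else is zeroed.
    """
    return [[max(0.0, t - r) for t, r in zip(t_row, r_row)]
            for t_row, r_row in zip(target_map, reference_map)]


# Toy 2x2 saliency maps (illustrative values only).
target = [[0.75, 0.25], [0.5, 0.5]]
reference = [[0.25, 0.5], [0.5, 0.125]]
print(counterfactual_map(target, reference))  # [[0.5, 0.0], [0.0, 0.375]]
```

Regions that both classes share, such as the bottom-left pixel in the toy example, cancel out, which matches the observation that similar classes share common regions.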

Figure 16: Explanations of image captioning models. (a) is the image with the caption. (b), (c), (d) show the importance maps generated by Score-CAM for the red word in the sentence.

5.5.3 Explanation For Image Caption

Score-CAM can easily be extended to explain the captions of any image description model. Fig 16 shows some examples of Score-CAM applied to image captioning. We consider a standard image captioning encoder-decoder framework trained on the ILSVRC2012 dataset [20]. The architecture uses a ResNet152[17] to encode the image, followed by an LSTM to generate the captions.

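Similar to [27], we model the probability of the next word $w_t$ given a partial sentence $s = (w_1, \dots, w_{t-1})$ and an input image $X$:

$$f(X, s, w_t) = P(w_t \mid s, X)$$

For each activation map $A_l^k$, we mask the input and compute $f(X \circ H_l^k, s, w_t)$, where $H_l^k$ is the upsampled and normalized $A_l^k$, and generate the saliency map as $\mathrm{ReLU}\left(\sum_k f(X \circ H_l^k, s, w_t)\, A_l^k\right)$.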
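The masking-and-scoring loop for a single caption word can be sketched as follows. This is a simplified illustration under several assumptions: the image is a single-channel 2-D list, the activation maps are already upsampled to the image size, and `score_fn` is a hypothetical callable that returns the captioning model's probability of the word of interest for a masked image (with the partial sentence fixed inside it). The raw score is used directly as the weight, without any baseline subtraction or further normalization of the weights.

```python
def normalize(act):
    """Min-max scale a 2-D map into [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in act for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in act]
    return [[(v - lo) / (hi - lo) for v in row] for row in act]


def score_weighted_map(image, activations, score_fn):
    """Weight each activation map by the score of the input masked with it,
    sum the weighted maps, and apply a ReLU."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for act in activations:
        mask = normalize(act)
        # Keep only the image content that this activation map highlights.
        masked = [[image[i][j] * mask[i][j] for j in range(w)] for i in range(h)]
        weight = score_fn(masked)  # hypothetical model call
        for i in range(h):
            for j in range(w):
                total[i][j] += weight * act[i][j]
    return [[max(0.0, v) for v in row] for row in total]


# Toy usage: the stand-in "model" only responds to the top-left pixel.
image = [[1.0, 1.0], [1.0, 1.0]]
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
saliency = score_weighted_map(image, activations, lambda m: m[0][0])
```

In the toy example only the first activation map preserves the pixel that the stand-in model responds to, so it alone receives a non-zero weight and determines the resulting map.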

6 Conclusion

In this work, we proposed a novel score-weighted Class Activation Mapping method (Score-CAM) for better visual explanation of CNN-based model predictions. Whereas Grad-CAM gets rid of the global average pooling (GAP) layer to make CAM usable without re-training, Score-CAM gets rid of the gradient-based weight to make CAM smoother and more stable. We argue that a score-based weight is a better representation of the importance of each activation map and is more intuitive to humans than a gradient-based weight. Our approach achieves better visual performance than former methods, with much less irrelevant noise in the background. We provide an in-depth analysis of experiments on visualization, weakly-supervised localization and the pointing game. In the experiments, our method shows better robustness to adversarial attack than other visualization methods and yields more explainable and stable weights to represent the importance of each activation map. Finally, we show the usefulness of visual explanations for analyzing the performance of a neural network.