Convolutional Neural Networks (CNNs) and other deep networks have enabled unprecedented breakthroughs in a variety of computer vision tasks, from image classification to image captioning and visual question answering. While these deep neural networks enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. Consequently, when today’s intelligent systems fail, they fail spectacularly and disgracefully, without warning or explanation, leaving a user staring at incoherent output, wondering why the system did what it did. In order to build trust in intelligent systems and move towards their meaningful integration into our everyday lives, it is clear that we must build ‘transparent’ models that explain why they predict what they do.
However, there is a trade-off between accuracy and simplicity/interpretability. Classical rule-based or expert systems were highly interpretable but not very accurate (or robust). Decomposable pipelines where each stage is hand-designed are thought to be more interpretable, as each individual component admits a natural, intuitive explanation. This tradeoff is also apparent in more recent work, like Class Activation Mapping (CAM), which provides explanations only for a restricted class of image classification CNNs. By using deep models, we sacrifice a degree of interpretability in pipeline modules in order to achieve greater performance through greater abstraction (more layers) and tighter integration (end-to-end training).
What makes a good visual explanation? Consider image classification – a ‘good’ visual explanation from the model justifying a predicted class should be (a) class-discriminative (localize the category in the image) and (b) high-resolution (capture fine-grained detail). To be concrete, we introduce Grad-CAM using the notion of ‘class’ from image classification (e.g., cat or dog), but visual explanations can be considered for any differentiable node in a computational graph, including words from a caption or the answer to a question.
This is illustrated in Fig. 1, where we visualize the ‘tiger cat’ class. Pixel-space gradient visualizations such as Guided Backpropagation, seen at the top of Fig. 1, are high-resolution and highlight fine-grained details in the image, but are not class-discriminative: both the cat and the dog are highlighted despite ‘tiger cat’ being the class of interest (in fact, the Guided Backpropagation visualizations for ‘boxer’ (dog) and ‘tiger cat’ are indistinguishable). Conversely, the Grad-CAM visualization is class-discriminative but low-resolution, lacking fine details. We combine the best of both worlds by fusing existing pixel-space gradient visualizations with our novel localization method – called Grad-CAM – to create Guided Grad-CAM visualizations, which are both high-resolution and class-discriminative: Guided Grad-CAM highlights the stripes of the cat in addition to localizing it.
In this abstract: (1) We propose Gradient-weighted Class Activation Mapping (Grad-CAM) to generate visual explanations from any CNN-based network without requiring architectural changes. (2) To illustrate the broad applicability of our technique across tasks, we apply Grad-CAM to state-of-the-art image captioning and visual question answering models. (3) We design and conduct human studies to show that Guided Grad-CAM explanations are class-discriminative and help humans not only establish trust, but also help untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both networks make identical predictions, simply on the basis of their visual explanations. (4) Our code and demos of Grad-CAM are released to help others apply Grad-CAM to interpret their own models.
Class Activation Mapping (CAM). CAM produces a localization map from image classification CNNs in which global-average-pooled convolutional feature maps are fed directly into a softmax. Specifically, let the penultimate CNN layer produce $K$ feature maps $A^k \in \mathbb{R}^{u \times v}$ of width $u$ and height $v$. These feature maps are spatially pooled using Global Average Pooling (GAP) and linearly transformed to produce a score $S^c$ for each class $c$:
$$S^c = \sum_k w^c_k \, \frac{1}{Z} \sum_i \sum_j A^k_{ij}$$
To produce a localization map $L^c_{\text{CAM}} \in \mathbb{R}^{u \times v}$ for class $c$, CAM computes the linear combination of the final feature maps using the learned weights of the final layer:
$$L^c_{\text{CAM}} = \sum_k w^c_k A^k$$
This is normalized to lie between 0 and 1 for visualization purposes. CAM cannot be applied to networks which use multiple fully-connected layers before the output layer, so the fully-connected layers are replaced with convolutional ones and the network is re-trained. In comparison, our approach can be applied directly to any CNN-based differentiable architecture as-is, without re-training.
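The CAM computation above is just a weighted sum of the final feature maps followed by normalization, and can be sketched in a few lines of NumPy. The function name `cam` and the toy array shapes are illustrative, not from the paper; a real pipeline would take `feature_maps` and `class_weights` from a trained network.

```python
import numpy as np

def cam(feature_maps, class_weights):
    """Class Activation Map: weighted sum of final conv feature maps.

    feature_maps: (K, H, W) activations A^k from the last conv layer.
    class_weights: (K,) learned softmax weights w^c_k for class c.
    Returns an (H, W) map normalized to [0, 1] for visualization.
    """
    # Linear combination over the K feature maps: sum_k w^c_k * A^k
    heatmap = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
    # Normalize to [0, 1] for display
    heatmap = heatmap - heatmap.min()
    if heatmap.max() > 0:
        heatmap = heatmap / heatmap.max()
    return heatmap
```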
Gradient-weighted Class Activation Mapping. In order to obtain the class-discriminative localization map $L^c_{\text{Grad-CAM}}$ in generic CNN-based architectures, we first compute the gradient of the class score $y^c$ with respect to the feature maps $A^k$ of a convolutional layer. These gradients are global-average-pooled to obtain weights $\alpha^c_k$:
$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$$
This weight $\alpha^c_k$ represents a partial linearization of the deep network downstream from $A$, and captures the ‘importance’ of feature map $k$ for a target class $c$. In general, $y^c$ need not be a class score, but could be any differentiable activation. As in CAM, our Grad-CAM heat-map is a weighted combination of feature maps, but we follow this by a ReLU:
$$L^c_{\text{Grad-CAM}} = \text{ReLU}\Big(\sum_k \alpha^c_k A^k\Big)$$
This results in a coarse heat-map, which is normalized for visualization. Other than the ReLU, Grad-CAM is a generalization of CAM (the $\alpha^c_k$ are precisely the $w^c_k$ in architectures where CAM can be applied) to any CNN-based architecture: CNNs with fully-connected layers, ResNets, CNNs stacked with Recurrent Neural Networks (RNNs), etc.
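The two Grad-CAM equations can likewise be sketched in NumPy, assuming the activations $A^k$ and the gradients $\partial y^c / \partial A^k$ have already been extracted from a network via backpropagation (here they are plain arrays; `grad_cam` is an illustrative helper name, not the paper's code).

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat-map from one convolutional layer.

    activations: (K, H, W) feature maps A^k.
    gradients:   (K, H, W) gradients dy^c/dA^k from backprop.
    Returns an (H, W) map normalized to [0, 1].
    """
    # Global-average-pool the gradients -> neuron-importance weights alpha^c_k
    alphas = gradients.mean(axis=(1, 2))                 # (K,)
    # Weighted combination of feature maps ...
    heatmap = np.tensordot(alphas, activations, axes=1)  # (H, W)
    # ... then ReLU: keep only features with positive influence on y^c
    heatmap = np.maximum(heatmap, 0.0)
    # Normalize for visualization
    if heatmap.max() > 0:
        heatmap = heatmap / heatmap.max()
    return heatmap
```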
Guided Grad-CAM. In order to combine the class-discriminative nature of Grad-CAM and the high-resolution nature of Guided Backpropagation, we fuse them via point-wise multiplication to create Guided Grad-CAM, shown on the left of fig:approach. We expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information, so we use these feature maps to compute Grad-CAM and Guided Grad-CAM.
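The fusion step is a simple point-wise multiplication after upsampling the coarse Grad-CAM map to the input resolution. A minimal sketch, assuming the map dimensions divide the image dimensions and using nearest-neighbour upsampling for brevity (a real pipeline would use bilinear interpolation):

```python
import numpy as np

def guided_grad_cam(guided_backprop, grad_cam_map):
    """Fuse a high-res gradient visualization with a coarse Grad-CAM map.

    guided_backprop: (H, W, C) high-resolution Guided Backpropagation map.
    grad_cam_map:    (h, w) coarse class-discriminative heat-map,
                     where h divides H and w divides W (assumption).
    """
    H, W = guided_backprop.shape[:2]
    h, w = grad_cam_map.shape
    # Nearest-neighbour upsampling of the coarse map to input resolution
    upsampled = np.kron(grad_cam_map, np.ones((H // h, W // w)))
    # Point-wise multiplication masks the fine-grained gradients
    # with the class-discriminative regions
    return guided_backprop * upsampled[:, :, None]
```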
We evaluate our visualizations and then show image captioning and visual question answering examples.
3.1 Evaluating Visualizations
Evaluating Class Discrimination. Intuitively, a good prediction explanation is one that produces discriminative visualizations for the class of interest. We select images from the PASCAL VOC 2007 val set that contain exactly two annotated categories, and create visualizations for one of the classes. These are shown to workers on Amazon Mechanical Turk (AMT), who are asked “Which of the two object categories is depicted in the image?” and presented with the two categories present in the original image as options. As shown in tab:eval_vis column 1, human subjects can correctly identify the category being visualized substantially more often with Guided Grad-CAM than with Guided Backpropagation; thus, fusing with Grad-CAM makes Guided Backpropagation more class-discriminative.
Evaluating Trust. Given explanations from two different models, we want to evaluate which of them seems more trustworthy. We use AlexNet and VGG-16 to compare Guided Backpropagation and Guided Grad-CAM visualizations, noting that VGG-16 is known to be more accurate than AlexNet. In order to tease apart the efficacy of the visualization from the accuracy of the model being visualized, we consider only those instances where both models made the same prediction as the ground truth. Given a visualization from AlexNet, one from VGG-16, and the name of the object predicted by both networks, workers are instructed to rate which model is more reliable. Results are shown in the Relative Reliability column of tab:eval_vis, where scores range from -2 to +2 and positive scores indicate that VGG is judged more reliable than AlexNet. With Guided Backpropagation, humans rate VGG as only slightly more reliable than AlexNet, while with Guided Grad-CAM they rate VGG as clearly more reliable. Thus our Guided Grad-CAM visualizations can help users place trust in a model that generalizes better, on the basis of individual prediction explanations.
Faithfulness vs. Interpretability. Faithfulness of a visualization to a model is its ability to accurately explain the function learned by the model. Naturally, there is a tradeoff between interpretability and faithfulness for complex models operating on highly compositional inputs: a fully faithful visualization might describe the model in precise detail yet be completely opaque to human inspection. Here we are only interested in local fidelity; the visualization need only explain the parts of the model relevant to a particular image. That is, in the vicinity of the input data point, our explanation should be faithful to the model.
For comparison, we need a reference explanation with high local faithfulness. One obvious choice is image occlusion, where we measure the difference in CNN scores when patches of the input image are masked out. Interestingly, patches which change the CNN score are also patches to which Guided Grad-CAM assigns high intensity, as shown by measuring rank correlation between patch intensities and occlusion score differences in tab:eval_vis (3rd column). This shows that Guided Grad-CAM is more faithful to the original model than Guided Backpropagation.
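The occlusion reference and the rank-correlation measurement can both be sketched compactly. A minimal, illustrative version: the `score_fn` stand-in, the zero-fill masking, the fixed patch size, and a tie-free Spearman correlation are all simplifying assumptions, not the paper's exact protocol.

```python
import numpy as np

def occlusion_map(score_fn, image, patch=2):
    """Score drop when each (patch x patch) region is masked out.

    score_fn: callable mapping an (H, W) image to a scalar class score.
    Returns an (H//patch, W//patch) map of score differences.
    """
    base = score_fn(image)
    H, W = image.shape
    drops = np.zeros((H // patch, W // patch))
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask the patch
            drops[i // patch, j // patch] = base - score_fn(occluded)
    return drops

def spearman(a, b):
    """Rank correlation between two flattened maps (assumes no ties)."""
    ra = np.argsort(np.argsort(a.ravel())).astype(float)
    rb = np.argsort(np.argsort(b.ravel())).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))
```

A visualization is then judged faithful when `spearman(visualization_patch_intensities, drops)` is high.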
Table (tab:eval_vis): Method | Human Classification Accuracy | Relative Reliability | Rank Correlation w/ Occlusion
3.2 Analyzing Failure Modes for VGG-16
In order to see what mistakes a network is making, we first obtain a list of examples that the network (VGG-16) fails to classify correctly. For these misclassified examples, we use Guided Grad-CAM to visualize both the correct and the predicted class. A major advantage of Guided Grad-CAM over other methods is its ability to usefully investigate and explain classification mistakes, since our visualizations are high-resolution and more class-discriminative. As seen in fig:failures, some failures are due to ambiguities inherent in ImageNet classification. We can also see that seemingly unreasonable predictions have reasonable explanations, an observation similar to HOGgles.
3.3 Image Captioning and VQA
Image Captioning. We explain predictions from an image captioning model (without attention) using Grad-CAM visualizations. Given a caption, we compute the gradient of its log probability with respect to units in the last convolutional layer of the CNN (VGG-16) and generate Grad-CAM visualizations as described in sec:approach. Results are shown in Fig. 2(a). In the first example, the Grad-CAM maps for the generated caption localize every occurrence of both the kites and the people in spite of their relatively small size. In the top-right example, Grad-CAM correctly highlights the pizza and the man, but ignores the woman nearby, since ‘woman’ is not mentioned in the caption. As shown in Fig. 2(b), for captions generated by a dense captioning model for three bounding box proposals marked on the left, we get back Grad-CAM localizations (right) that agree with those bounding boxes – even though neither the captioning model nor Grad-CAM uses any bounding box annotations.
Visual Question Answering. Typical VQA pipelines consist of a CNN to model images and an RNN language model for questions. The image and the question representations are fused to predict the answer, typically with a 1000-way classification.
Since this is a classification problem, we pick an answer as the target and use its score $y^c$ to compute Grad-CAM. Despite the complexity of the task, involving both visual and language components, the explanations (for a baseline VQA model and for a ResNet-based hierarchical co-attention model) shown in Fig. 4 are surprisingly intuitive and informative.
In this work, we proposed a novel class-discriminative localization technique – Gradient-weighted Class Activation Mapping (Grad-CAM) – and combined it with existing high-resolution visualizations to produce visual explanations for CNN-based models. Human studies reveal that our localization-augmented visualizations can discriminate between classes more accurately and better reveal the trustworthiness of a classifier. In addition to the image captioning and VQA examples shown here, the full version of our paper  evaluates Grad-CAM on the ImageNet localization challenge, analyzes failure modes of VGG-16 on ImageNet classification using Grad-CAM, measures correlation between VQA Grad-CAM and human attention maps, describes ablation studies and provides many more examples. Grad-CAM provides a new way to understand any CNN-based model.
-  H. Agrawal, C. S. Mathialagan, Y. Goyal, N. Chavali, P. Banik, A. Mohapatra, A. Osman, and D. Batra. CloudCV: Large Scale Distributed Computer Vision as a Cloud Service. In Mobile Cloud Visual Media Computing, pages 265–290. Springer, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In CVPR, 2016.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
-  J. Lu, X. Lin, D. Batra, and D. Parikh. Deeper LSTM and normalized CNN Visual Question Answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN, 2015.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
-  M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD, 2016.
-  R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. CoRR, abs/1610.02391, 2016.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for Simplicity: The All Convolutional Net. CoRR, abs/1412.6806, 2014.
-  C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing Object Detection Features. ICCV, 2013.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.