Deep Convolutional Neural Networks are often referred to as black-box models because their internal workings are poorly understood. In an effort to develop more explainable deep learning models, many methods have been proposed to reveal the internal mechanisms of the decision-making process. In this paper, building on Score-CAM, we introduce SS-CAM, an enhanced visual explanation method that uses smoothing to produce sharper localizations of object features within an image. We evaluate our method on three well-known datasets: ILSVRC 2012, Stanford40 and PASCAL VOC 2007. Our approach outperforms existing CAM-based approaches when evaluated on both fairness and localization tasks.
Convolutional Neural Networks (CNNs) have risen rapidly in the past few years and have been applied to numerous problems such as image captioning, image classification(Sultana et al., 2019), semantic segmentation(Liu et al., 2018) and many other vision tasks(Ren et al., 2015)(Wang et al., 2019)(Hu et al., 2015). Despite their dominance in these tasks, these architectures act like black-box models(Buhrmester et al., 2019), and the difficulty of interpreting them has been a hindrance. However intricately the models evolve, these ubiquitous machine intelligence models remain profoundly limited because they cannot justify or explain their decision-making process. They may fail unexpectedly in domains like finance and healthcare, where failures may lead to grave setbacks. To avoid such instances, the decisions of these models need to be accompanied by visual reasoning that explains each of them. These visual explanations would also help debug the networks, identify the point of failure and, additionally, provide better insight into how the neural network functions.
Among the many visual explanation methods, attribution maps have become one of the most intuitive, assigning an importance weight to each region of the image. There are three common ways to generate attribution maps: gradient-based visualization, perturbation-based visualization and class activation mapping (CAM)(Zhou et al., 2015). Gradient-based approaches generate a saliency map from the derivative of the target class score with respect to the input image, computed through backpropagation(Springenberg et al., 2015). Perturbation is another commonly adopted technique for finding the regions of interest in an image: specific regions of the input are masked, and the region whose masking causes the largest drop in the target class score is regarded as important. Other works(Wang et al., 2020b) add regularizers to make these attribution maps more robust, and some software packages(Yang et al., 2019) have been developed.
Our work builds on CAM-based approaches(Zhou et al., 2015), which obtain attribution maps as a linear combination of weights and activation maps. Recent CAM-based approaches all generalize the original CAM and are not limited to CNNs with a special global pooling layer. They can be divided into two branches: gradient-based CAMs(Selvaraju et al., 2017), which derive the linear weights for the internal activation maps from gradient information, and gradient-free CAMs(Wang et al., 2020a), which capture the importance of each activation map through the target score in forward propagation. Although Score-CAM(Wang et al., 2020a) has achieved good performance on both visual comparison and fairness evaluation, its localization results are coarse.
To achieve sharper localization of visual features, we propose a novel method called SS-CAM, built on top of Score-CAM(Wang et al., 2020a), for enhancing object understanding and providing better post-hoc explanations of the central target object in the image. We evaluate our approach on the ILSVRC 2012, Stanford40 and PASCAL VOC 2007 datasets. Our contributions can be summarized as follows:
We introduce an enhanced visual feature localization method, SS-CAM, which smooths the masked inputs and leads to a visually sharper attribution map.
We find that taking the maximum score over the noisy masked inputs yields better localization performance than taking the average.
We quantitatively evaluate the generated attribution maps of SS-CAM on recognition tasks and show that SS-CAM better localizes important features.
CAM-based approaches: Class Activation Mapping (CAM)(Zhou et al., 2015) is a technique for identifying class-discriminative features in the input image through a linearly weighted combination of the activation maps. CAM identifies the regions that are important for a specific class and highlights those regions in the image.
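CAM is denoted by:

$$L^c_{\text{CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k_{l-1}\Big), \qquad \alpha^c_k = w^c_{l,l+1}[k]$$

where $w^c_{l,l+1}[k]$ refers to the weight of the $k$-th neuron after the pooling operation.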
CAM(Zhou et al., 2015) offers good intuition about what drives an output: each activation map carries different spatial information about the input, and combining the maps with the weights of each neuron after the global average pooling (GAP) layer gives the importance of each channel. However, as the original CAM is restricted to CNNs that have a GAP layer, it cannot generate explanations for a wide array of CNNs.
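The weighted combination that CAM-style methods share can be illustrated with a minimal NumPy sketch. The function name `cam_map`, the array shapes and the toy values are assumptions made for illustration only; the weights here are hand-picked rather than read from a trained network.

```python
import numpy as np

def cam_map(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Return ReLU(sum_k weights[k] * activations[k]) as an (H, W) map.

    activations: (K, H, W) activation maps of one convolutional layer.
    weights:     (K,) importance weight of each channel for the target class.
    """
    combined = np.tensordot(weights, activations, axes=1)  # contracts over K -> (H, W)
    return np.maximum(combined, 0.0)                       # ReLU keeps positive evidence only

# Toy example (hypothetical values): two 2x2 activation maps and their class weights.
acts = np.array([[[1.0, 0.0], [0.0, 2.0]],
                 [[0.0, 3.0], [1.0, 0.0]]])
w = np.array([0.5, -1.0])
saliency = cam_map(acts, w)
```

The methods discussed in this section differ only in how the per-channel weights are obtained; the combination step stays the same.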
To generalize CAM, Grad-CAM(Selvaraju et al., 2017) uses local gradients to represent the linear weights, and can be applied to any CNN-based architecture without re-training.
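Grad-CAM is denoted by:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k_l\Big), \qquad \alpha^c_k = \mathrm{GP}\left(\frac{\partial Y^c}{\partial A^k_l}\right)$$

where $\mathrm{GP}(\cdot)$ refers to the global pooling operation.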
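As a rough illustration of the gradient-based weighting, the sketch below pools a gradient tensor into one weight per channel. It assumes global average pooling and uses hand-written gradient values instead of a real backward pass; `gradcam_weights` is a hypothetical helper name.

```python
import numpy as np

def gradcam_weights(gradients: np.ndarray) -> np.ndarray:
    """Global-average-pool d(class score)/d(activation) into one weight per channel.

    gradients: (K, H, W), one gradient map per activation channel.
    """
    return gradients.mean(axis=(1, 2))

# Hand-written gradients standing in for the output of backpropagation (assumption).
grads = np.array([[[1.0, 3.0], [0.0, 0.0]],
                  [[-2.0, 0.0], [0.0, 0.0]]])
alphas = gradcam_weights(grads)
```

In practice the gradients would come from an automatic differentiation framework; only the pooling step is shown here.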
Grad-CAM++(Chattopadhyay et al., 2017), an extension of Grad-CAM, aims to provide more meaningful explanations by localizing the entire object in the image, instead of bits and pieces of it, and can better localize multiple objects of the same class.
As the output layer is a non-linear function, gradient-based CAMs tend to suffer from diminishing backpropagated gradients, which causes gradient saturation and makes it difficult to provide concrete explanations. To address these problems, Score-CAM(Wang et al., 2020a), the first gradient-free generalization of CAM, bridges the gap between perturbation-based and CAM-based methods. This method obtains the weight of each activation map from its forward-pass score on the target class. The final result is a linear combination of the weights and the activation maps.
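Score-CAM is denoted as follows:

$$L^c_{\text{Score-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k_l\Big), \qquad \alpha^c_k = C(A^k_l)$$

where $c$ is the class of interest, $l$ is the convolution layer and $C(A^k_l)$ is the Channel-wise Increase of Confidence (CIC) score for the activation map $A^k_l$.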
In this section, we first introduce and formulate our proposed SS-CAM for interpreting CNN-based predictions. The pipeline of the proposed framework is illustrated in Fig 1. Before diving into the details of our technique, we define Channel-wise Increase of Confidence(CIC).
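Channel-wise Increase of Confidence (CIC): Given a model $Y = f(X)$ that takes an image input $X$, the $k$-th channel of an activation map of a convolutional layer $l$ in $f$ is given as $A^k_l$. For a baseline input $X_b$, the importance of $A^k_l$ is denoted as

$$C(A^k_l) = f(X \circ H^k_l) - f(X_b), \qquad H^k_l = s\big(\mathrm{Up}(A^k_l)\big)$$

Here, $\mathrm{Up}(\cdot)$ is the upsampling operation that resizes the activation map to the size of the input, $\circ$ denotes element-wise multiplication, and $s(\cdot)$ refers to the normalization function (as explained in Equation (10)) so that the elements are within the [0,1] range.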
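The CIC score can be sketched in a few lines of NumPy. Everything below is an illustrative assumption: images are single-channel 2D arrays, upsampling is nearest-neighbour with an integer scale factor (a simplification of whatever interpolation is used in practice), and `toy_model` is a hypothetical stand-in for the class score of a real network.

```python
import numpy as np

def upsample_nearest(a, out_hw):
    """Nearest-neighbour upsampling; assumes out_hw is an integer multiple of a.shape."""
    fh, fw = out_hw[0] // a.shape[0], out_hw[1] // a.shape[1]
    return np.kron(a, np.ones((fh, fw)))

def normalize(a, eps=1e-8):
    """Min-max scale into [0, 1]; eps guards against constant maps."""
    return (a - a.min()) / (a.max() - a.min() + eps)

def cic_score(model, x, activation, baseline):
    """Score of the masked input minus the score of the baseline input."""
    mask = normalize(upsample_nearest(activation, x.shape))
    return model(x * mask) - model(baseline)

def toy_model(img):
    """Hypothetical class score: mean intensity of the top-left quadrant."""
    h, w = img.shape
    return float(img[: h // 2, : w // 2].mean())

x = np.ones((4, 4))                              # input image
baseline = np.zeros((4, 4))                      # baseline input
act_hit = np.array([[1.0, 0.0], [0.0, 0.0]])     # fires where the toy model looks
act_miss = np.array([[0.0, 0.0], [0.0, 1.0]])    # fires elsewhere
score_hit = cic_score(toy_model, x, act_hit, baseline)
score_miss = cic_score(toy_model, x, act_miss, baseline)
```

An activation map that highlights the region the model relies on receives a high score, while one that highlights an irrelevant region scores close to zero.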
Our work builds on Score-CAM(Wang et al., 2020a): we borrow the idea of SmoothGrad(Smilkov et al., 2017) and integrate smoothing into the Score-CAM pipeline to generate sharper feature localization. In this paper, we evaluate two ways of smoothing.
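Taking the average: We set $N$ as the number of noisy samples to be generated by adding Gaussian noise to each activation map $A^k_l$. For each activation map $A^k_l$, $N$ scores are generated and averaged into a final score, which is taken as the importance of the activation map $A^k_l$.
An activation map of a convolutional layer $l$ is given as $A^k_l$. This approach can be depicted as:

$$L^c_{\text{SS-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k_l\Big), \qquad \alpha^c_k = \frac{1}{N}\sum_{i=1}^{N} C\big(A^k_l + \mathcal{N}_i(0, \sigma^2)\big)$$

Here, $c$ is the class of interest, $l$ is the convolutional layer and $\alpha^c_k$ is the average channel-wise score of the activation map $A^k_l$, that is, the mean of the $N$ noisy scores, which enters the linear combination of $A^k_l$ with the weights $\alpha^c_k$. $C(\cdot)$ is the CIC score computed on the input image $X$, $\mathcal{N}_i(0, \sigma^2)$ is the $i$-th Gaussian noise sample with standard deviation $\sigma$, and $A^k_l$ refers to the activation map at layer $l$.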
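A minimal sketch of the averaging variant is given below, assuming the activation map has already been upsampled to the input size and that the Gaussian noise is added before normalization. The helper names, the `toy_model` scorer and the values `n_samples=10`, `sigma=0.1` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def normalize(a, eps=1e-8):
    """Min-max scale into [0, 1]; eps guards against constant maps."""
    return (a - a.min()) / (a.max() - a.min() + eps)

def noisy_scores(model, x, activation, baseline, n_samples, sigma, rng):
    """CIC-style scores for n_samples Gaussian-perturbed copies of one activation map."""
    base = model(baseline)
    scores = []
    for _ in range(n_samples):
        noisy = activation + rng.normal(0.0, sigma, size=activation.shape)
        scores.append(model(x * normalize(noisy)) - base)
    return np.array(scores)

def toy_model(img):
    """Hypothetical class score: mean intensity of the left half of the image."""
    return float(img[:, : img.shape[1] // 2].mean())

x = np.ones((4, 4))
baseline = np.zeros((4, 4))
activation = np.zeros((4, 4))
activation[:, :2] = 1.0                      # map already at input resolution (assumption)

rng = np.random.default_rng(0)               # seeded for repeatability
scores = noisy_scores(toy_model, x, activation, baseline, n_samples=10, sigma=0.1, rng=rng)
alpha_mean = float(scores.mean())            # averaged score used as the channel weight
```

With `sigma=0.0` every sample reduces to the unperturbed score, so the averaged weight coincides with the plain Score-CAM-style score for that channel.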
Taking the maximum: In our second approach, we multiply the weights with the activations and sum them (as in Grad-CAM(Selvaraju et al., 2017)) during upsampling, to factor the weights into the existing Score-CAM(Wang et al., 2020a) method. The resulting activation map is then normalized, and adding perturbations to the normalized input mask yields a better visual result in terms of the number of pixel-wise attributions that fall within the bounding box of the object in the image. Here, $N$ upsampled activation maps are generated for each activation map $A^k_l$ by adding noise to the normalized input mask. The maximum of the resulting $N$ scores is taken as the final score, which is considered the importance of the activation map $A^k_l$.
A binary mask over the normalized map would be unsuitable, as our aim is to focus on the spatial region where the object lies and a binary mask would lose important features. Hence, we employ a normalization function similar to the one used in Score-CAM to emphasize the features within the specific region.
The normalization used in the algorithm is given as:

$$s(A^k_l) = \frac{A^k_l - \min A^k_l}{\max A^k_l - \min A^k_l} \tag{10}$$
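The difference between a min-max normalized mask and a binary mask can be seen on a small example. This is a sketch with hypothetical values; the `eps` term and the threshold of 4.0 are assumptions added for illustration.

```python
import numpy as np

def min_max_normalize(a, eps=1e-8):
    """Scale an activation map into [0, 1]; eps avoids division by zero on flat maps."""
    return (a - a.min()) / (a.max() - a.min() + eps)

def binary_mask(a, threshold):
    """Hard mask: 1 where the activation reaches the threshold, 0 elsewhere."""
    return (a >= threshold).astype(float)

act = np.array([[0.0, 2.0], [6.0, 8.0]])
soft = min_max_normalize(act)            # approximately [[0, 0.25], [0.75, 1]]
hard = binary_mask(act, threshold=4.0)   # [[0, 0], [1, 1]]
```

The soft mask preserves how strongly each location responds, whereas the hard mask treats 6.0 and 8.0 identically and discards the weaker response at 2.0 altogether.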
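The pipeline for the proposed method is shown in Figure 1, and the procedure is given in Algorithm 1. The formula for SS-CAM is shown below:
SS-CAM: An activation map of a convolutional layer $l$ is given as $A^k_l$. SS-CAM can be depicted as:

$$L^c_{\text{SS-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k_l\Big), \qquad \alpha^c_k = \max_{1 \le i \le N}\Big[f\big(X \circ (H^k_l + \mathcal{N}_i(0, \sigma^2))\big) - f(X_b)\Big], \qquad H^k_l = s\big(\mathrm{Up}(A^k_l)\big)$$

Here, $c$ is the class of interest, $l$ is the convolutional layer and $\alpha^c_k$ is the maximum channel-wise score of the activation map $A^k_l$, which accounts for the maximum noisy score generated and enters the linear combination of $A^k_l$ with the weights $\alpha^c_k$. $f$ is the model, $X$ is the input image, $X_b$ is the baseline input, $\mathcal{N}_i(0, \sigma^2)$ is the $i$-th Gaussian noise sample and $H^k_l$ refers to the normalized input mask.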
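The maximum-score variant can be sketched end to end as follows. This is a simplified reading of the description above, not the authors' code: activation maps are assumed to be already upsampled to the input size, noise is added directly to the normalized mask without clipping, and `toy_model`, `ss_cam_max` and all numeric values are hypothetical.

```python
import numpy as np

def normalize(a, eps=1e-8):
    """Min-max scale into [0, 1]; eps guards against constant maps."""
    return (a - a.min()) / (a.max() - a.min() + eps)

def ss_cam_max(model, x, activations, baseline, n_samples, sigma, rng):
    """Return (saliency map, channel weights) using the maximum noisy score per channel."""
    base = model(baseline)
    weights = np.empty(len(activations))
    for k, act in enumerate(activations):
        mask = normalize(act)                                  # normalized input mask
        scores = [
            model(x * (mask + rng.normal(0.0, sigma, size=mask.shape))) - base
            for _ in range(n_samples)
        ]
        weights[k] = max(scores)                               # keep the best noisy score
    combined = np.tensordot(weights, activations, axes=1)      # sum_k weight_k * A_k
    return np.maximum(combined, 0.0), weights                  # ReLU on the combination

def toy_model(img):
    """Hypothetical class score: mean intensity of the left half of the image."""
    return float(img[:, : img.shape[1] // 2].mean())

x = np.ones((4, 4))
baseline = np.zeros((4, 4))
activations = np.zeros((2, 4, 4))
activations[0, :, :2] = 1.0      # channel 0 covers the region the toy model uses
activations[1, :, 2:] = 1.0      # channel 1 covers an irrelevant region

rng = np.random.default_rng(0)
saliency, weights = ss_cam_max(toy_model, x, activations, baseline,
                               n_samples=10, sigma=0.1, rng=rng)
```

In this toy setting the channel that overlaps the region used by the scorer receives the larger weight, so the resulting map concentrates on the left half of the image.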
We conduct our experiments on three datasets: the ImageNet validation set (ILSVRC 2012 validation set), the Stanford40 dataset and the PASCAL VOC 2007 dataset.
2000 validation images from ILSVRC(Deng et al., 2009) are chosen at random for the faithfulness and localization evaluations. The faithfulness experiments are conducted on ImageNet with a pre-trained ResNet-18 model, while the localization evaluations are carried out with two models: ResNet-18 and SqueezeNet1.0.
The test images in Stanford40 and PASCAL VOC 2007 are used for visual comparison; the result is shown in Figure 4. A VGG-16 network(Simonyan and Zisserman, 2014) is used to conduct our visual experiments. Figure 3 compares SS-CAM with other existing techniques on some of the Stanford40 and PASCAL VOC 2007 test images. As displayed in the figure, SS-CAM provides a much sharper and more robust localization of the entire target object. All experiments have been carried out with $N$ = 10 and $\sigma$ = 4.
The faithfulness evaluations are carried out as described in Grad-CAM++(Chattopadhyay et al., 2017) for the purpose of object recognition. Two metrics, Average Drop and Average Increase in Confidence, are used.
The Average Drop measures the positive difference between the prediction made using the input image and the prediction made using the saliency map, relative to the former. It is given as: $\frac{1}{N}\sum_{i=1}^{N} \frac{\max(0,\, Y^c_i - O^c_i)}{Y^c_i} \times 100$.
The Average Increase in Confidence is denoted as: $\sum_{i=1}^{N} \frac{\mathrm{Sign}(Y^c_i < O^c_i)}{N} \times 100$, where $\mathrm{Sign}$ refers to a boolean function which returns 1 if the condition inside the brackets is true, else the function returns 0.
Here, $Y^c_i$ refers to the prediction score on class $c$ using the input image $i$, $O^c_i$ refers to the prediction score on class $c$ using the saliency map produced over the input image $i$, and $N$ here is the number of evaluated images.
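Both faithfulness metrics reduce to a few array operations. The sketch below uses made-up scores for four images; the function names and the numbers are assumptions for illustration, and the drop is averaged over the number of images.

```python
import numpy as np

def average_drop(y, o):
    """Mean relative drop (in percent) from original score y to explanation score o."""
    y, o = np.asarray(y, dtype=float), np.asarray(o, dtype=float)
    return float(np.mean(np.maximum(0.0, y - o) / y) * 100.0)

def average_increase(y, o):
    """Percentage of images whose score rises when only the explanation is shown."""
    y, o = np.asarray(y, dtype=float), np.asarray(o, dtype=float)
    return float(np.mean(y < o) * 100.0)

# Hypothetical class scores: y on the full image, o on the saliency-based input.
y = [0.8, 0.5, 0.9, 0.4]
o = [0.4, 0.6, 0.9, 0.2]
drop = average_drop(y, o)
increase = average_increase(y, o)
```

A lower Average Drop and a higher Average Increase in Confidence both indicate that the explanation retains the evidence the model relies on.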
The results are reported in the accompanying table. The faithfulness experiments on the ILSVRC validation set have been carried out using a pre-trained ResNet-18 (18-layer Residual Network)(He et al., 2015).
The results of the localization evaluations using the Energy-based pointing game are reported in the accompanying table. Here, the energy of the saliency map is measured by how much of the saliency map falls inside the bounding box. Specifically, the input is binarized with the interior of the bounding box marked as 1 and the region outside the bounding box as 0. Then, this binary input is multiplied with the generated saliency map and summed to calculate the proportion, which is given as

$$\text{Proportion} = \frac{\sum L^c_{(i,j) \in bbox}}{\sum L^c_{(i,j) \in bbox} + \sum L^c_{(i,j) \notin bbox}}$$

A pre-trained ResNet-18 (Residual Network with 18 layers)(He et al., 2015) and a pre-trained SqueezeNet1.0(Iandola et al., 2016) model are used to conduct the Energy-based pointing game on the 2000 randomly chosen images from the ILSVRC 2012 validation set.
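The energy-based pointing game described above can be sketched as follows. The bounding-box convention (row and column bounds with exclusive upper limits), the function name and the toy saliency values are assumptions for illustration.

```python
import numpy as np

def energy_pointing_game(saliency, bbox):
    """Fraction of total saliency energy that falls inside the bounding box.

    bbox = (row_min, col_min, row_max, col_max) with exclusive upper bounds.
    """
    r0, c0, r1, c1 = bbox
    inside = np.zeros_like(saliency, dtype=float)
    inside[r0:r1, c0:c1] = 1.0                    # 1 inside the box, 0 outside
    total = saliency.sum()
    return float((saliency * inside).sum() / total) if total > 0 else 0.0

# Hypothetical 4x4 saliency map with most of its energy inside the box.
saliency = np.array([[0.0, 0.0, 0.0, 0.0],
                     [0.0, 3.0, 1.0, 0.0],
                     [0.0, 2.0, 2.0, 0.0],
                     [2.0, 0.0, 0.0, 0.0]])
proportion = energy_pointing_game(saliency, bbox=(1, 1, 3, 3))
```

Here 8 of the 10 units of saliency lie inside the box, so the proportion is 0.8; a sharper, better-localized map pushes this value towards 1.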
Our proposed method significantly enhances the localization of the features of the target class in an image, and thereby provides more thorough explanations of the image. In this respect, our method fares well above the existing CAM approaches, as evaluated using the Energy-based pointing game (see the accompanying localization table).
As the number of samples $N$ increases, more localized features are highlighted within the image and better explanations are provided. A larger $N$ means that more noisy images are generated, so taking the maximum score over this wider range of noisy images gives a higher probability of obtaining a better score. Varying the parameter $\sigma$ shows that when $\sigma$ is low, the generated score is close to the score produced by Score-CAM, and when $\sigma$ is high, the score becomes more "smoothed out" and the generated saliency map is expected to generalize more across different classes. When $\sigma$ is high, the generated map is quite similar to the map produced by Grad-CAM, as the noisy maps would be relatively close to each other, which means that the noise added would be close to the mean and hence would result in very little change to the normalized activation maps. Hence, we get Grad-CAM-like maps when the parameter is high.
Future work includes extending this noise-based technique to other network architectures (beyond the existing CNNs) and to other tasks. One area for exploration is to find better metrics for comparing sensitivity maps that involve noisy perturbations. Localization metrics, such as the Energy-based pointing game used in this paper, also need to be explored in more depth. Another direction is to generalize our approach to explain decisions made by neural networks in other domains such as sound processing and video processing.
We thank Mr. Haohan Wang from Carnegie Mellon University for providing valuable suggestions during the discussion.
Buhrmester et al. (2019). Analysis of explainers of black box deep neural networks for computer vision: a survey. ArXiv abs/1911.12116.
Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Hu et al. (2015). When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition. In Proceedings of the IEEE international conference on computer vision workshops, pp. 142–150.
Wang et al. (2019). Hybrid coarse-fine classification for head pose estimation. arXiv preprint arXiv:1901.06778.
Zhou et al. (2015). Learning deep features for discriminative localization. CoRR abs/1512.04150.