GAM: Explainable Visual Similarity and Classification via Gradient Activation Maps

09/02/2021 ∙ by Oren Barkan, et al. ∙ 29

We present Gradient Activation Maps (GAM) - a method for explaining predictions made by visual similarity and classification models. By gleaning localized gradient and activation information from multiple network layers, GAM offers improved visual explanations compared to existing alternatives. The algorithmic advantages of GAM are explained in detail and validated empirically: GAM is shown to outperform its alternatives across various tasks and datasets.




1. Introduction

As the AI revolution disrupts industries and penetrates all walks of life, a growing need arises to intuitively explain machine-based decisions (Vellido et al., 2012; Doshi-Velez et al., 2017). As a result, an emerging research area revolves around the need to make machine learning models more explainable. This work joins this common effort and presents Gradient Activation Maps (GAM) - a novel method for explaining visual similarity and classification networks.

A saliency map is an image depicting the relative contribution of each pixel in the input image w.r.t. the model’s prediction. For example, Fig. 1 presents saliency maps produced by GAM for a classification task (a-c) and a similarity task (d-e). According to (Selvaraju et al., 2017), a ‘good’ visual explanation technique should be (1) class discriminative i.e., localize the object in the image, and (2) high-resolution i.e., capture fine-grained details. However, comparing different visual explanation approaches is hard: A real methodological challenge stems from the lack of a ground-truth or a principled evaluation procedure. Hence, different works employed different evaluation procedures, often resorting to subjective visual assessments (Simonyan et al., 2013; Stylianou et al., 2019).

Figure 1. Visual explanations produced by GAM for similarity and classification tasks (using a pretrained DenseNet201). (b-c) w.r.t. the scores for ’cat’ and ’dog’ classes. (f-g) w.r.t. the cosine similarity between the latent representations of the images in (d-e).

Actionable testing procedures for assessing the validity of saliency maps were recently proposed by Adebayo et al. (2018). Their work revealed that, despite producing quality-looking visualizations, most state-of-the-art methods produce saliency maps that are independent of either the model or the input-label relation, rendering them inadequate for producing explanations. An exception was the Grad-CAM (GC) method (Selvaraju et al., 2017), which stood out among all others in its ability to produce fine-grained saliency maps while passing all sanity tests (Adebayo et al., 2018). Following the success of GC, an improved extension called Grad-CAM++ (GC++) was introduced and shown to outperform its predecessor on various visual explanation tasks (Chattopadhay et al., 2018).

GAM improves significantly upon GC and GC++ via several algorithmic features: gradient localization, multi-layer analysis, and negative gradient suppression. These features lead to better saliency maps in terms of resolution, class discrimination, and object localization. In Sec. 3.3, we elaborate on the relation of GAM to GC and GC++ and explain the algorithmic advantages behind GAM’s superior results. Our contributions are as follows:

  • We introduce GAM - a state-of-the-art method for extracting accurate saliency maps in terms of resolution and class discrimination. GAM is shown to outperform its alternatives on various objective and subjective evaluations, across all metrics, and especially in the case of small objects.

  • We present a unified formulation for visual similarity and classification that enables the utilization of GAM, GC and GC++ for explaining visual similarity models (a task that was overlooked in (Selvaraju et al., 2017; Chattopadhay et al., 2018)).

  • We identify and demonstrate the limitations of GC and GC++, and explain how GAM averts these problems.

2. Related work

2.1. Explaining Visual Classification Models

The early methods proposed by (Zeiler et al., 2011; Zeiler and Fergus, 2014; Springenberg et al., 2014; Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2016; Yu et al., 2014; Barkan et al., 2021) are seminal works in visualizing and understanding deep NNs. Guided Backpropagation (GBP) (Springenberg et al., 2014) visualizes the output prediction by propagating the gradients through the model and suppressing all negative gradients along the backward pass. However, GBP was shown to produce saliency maps that are not class discriminative (Selvaraju et al., 2017). Another approach (Simonyan et al., 2013) uses the gradients of predicted class scores w.r.t. the input image to generate saliency maps.

Recently, Grad-CAM (GC) (Selvaraju et al., 2017) created saliency maps based on the activations and gradients from the last convolutional layer. In GC, the gradients of each channel are pooled to scalars. Then, these scalars weigh their corresponding activation maps, which are summed together to produce the final saliency map. More recently, Grad-CAM++ (GC++) (Chattopadhay et al., 2018) was introduced as an improved version of GC. GC++ uses a weighted average of the pixel-wise gradients in order to create the weights for the activation maps. Both GC and GC++ operate on the last convolutional layer and employ gradient pooling, which leads to the loss of gradient localization. In contrast, GAM utilizes the raw gradients from multiple layers in the network, enabling gradient localization with improved resolution and class discrimination.

2.2. Explaining Visual Similarity Models

Previous works attempted to visually explain the decision made by similarity networks (Yi et al., 2014; Schroff et al., 2015; Sun et al., 2014; Wang et al., 2014; Hoffer and Ailon, 2015; Oh Song et al., 2016). These networks are optimized to cluster images that are considered similar in a learned vector space. Other methods (Radenović et al., 2016; Tolias et al., 2015) determined areas that contributed to image similarity by comparing filter responses of image patches. In (Chen et al., 2020), the authors utilized GC for explaining embedding networks that were trained on similarity tasks. However, their method is independent of the similarity score itself, hence it cannot be considered a “true” explanation of similarity.

Recently, VDSN (Stylianou et al., 2019) was introduced as a method for visual explanation of similarity networks. VDSN produces saliency maps for image-pairs by combining the activations of the last convolutional layer before and after average / max pooling. However, unlike GAM, which utilizes the gradients of the similarity w.r.t. the activations from multiple layers, VDSN does not use the gradients, and hence is independent of the similarity score. Moreover, VDSN is limited to the last convolutional layer in architectures that employ average / max pooling, and is applicable to similarity networks only (thus unable to visually explain classification models).

2.3. Evaluating Saliency Maps

Evaluating saliency maps is challenging, as no real “ground truth” exists, and the quality of an explanation is often subjective. In (Simonyan et al., 2013; Selvaraju et al., 2017), evaluations were conducted using a weakly supervised object localization task, where the output saliency map is used to specify the region in the image in which the classified object appears. We further extend this approach to the image similarity task, by using the saliency maps to specify the regions in which similar objects appear in both images.

In (Chattopadhay et al., 2018), the authors suggested the Average Drop Percentage (ADP) and the Percentage of Increase in Confidence (PIC) metrics, which measure the change in the model confidence when using explanation maps (the Hadamard product of the saliency map with the original image) instead of the original image. We follow these tests and further extend them to the image similarity task.

In (Adebayo et al., 2018), the authors suggest sanity tests for saliency map methods: the parameter randomization and data randomization procedures test whether the produced saliency map is sensitive to the randomization of the model’s parameters and of the data labels, respectively. A method that is insensitive to either fails to faithfully explain the model’s prediction. Despite producing quality-looking visualizations, many of the popular saliency methods do not pass the tests from (Adebayo et al., 2018), and therefore are not adequate for providing satisfactory model explanations. In Sec. 4.2 we show that GAM passes these tests.

3. Gradient Activation Maps (GAM)

3.1. A Unified Formulation for Visual Similarity and Classification

We begin by defining notations for the network’s input and (internal) building blocks. The network’s input is an image, denoted by $x \in \mathbb{R}^{c_0 \times u_0 \times v_0}$. The 3D activation produced by the $l$-th convolutional layer (for the image $x$) is denoted by $h_x^l \in \mathbb{R}^{c_l \times u_l \times v_l}$, where $1 \le l \le L$. Note that $h_x^l$ is not necessarily produced by a plain convolutional layer, but can be the output of a more complex function such as a residual (He et al., 2016) or DenseNet (Huang et al., 2017) block. We further denote $h_x^{lk} \triangleq h_x^l[k]$ as the $k$-th activation map in $h_x^l$. Let $f: \mathbb{R}^{c_L \times u_L \times v_L} \to \mathbb{R}^d$ be a function that maps 3D tensors to a $d$-dimensional vector representation. We denote the mapping of the last activation maps $h_x^L$ by $f_x \triangleq f(h_x^L)$. Note that $f$ may vary between different network architectures. Usually, it consists of a (channel-wise) global average pooling layer that is optionally followed by subsequent fully connected (FC) layers. Finally, let $s: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scoring function that receives two vectors and outputs a score. The use of $s$ varies between tasks: classification and similarity.

In classification tasks, $f$ represents the last hidden layer of the network. The logit score for the class $j$ is computed by $s(f_x, w_j)$, where $w_j \in \mathbb{R}^d$ is the weights vector associated with the class $j$. In multiclass (multilabel) classification, $s$ is usually set to the dot-product, optionally with bias correction, or the cosine similarity. Then, a softmax (sigmoid) function, with some temperature, transfers the $s$ values to the range $[0, 1]$.

For similarity tasks, we consider two images $x, y$, and a similarity score $s(f_x, f_y)$. A common practice is to set $s$ to the dot-product or cosine similarity. Further note that in the specific case of similarity, the representation produced by $f$ is not necessarily taken from the last hidden layer of the network. Therefore, $f$ can be set to the output from any FC layer. For the sake of brevity, from here onward, we abbreviate both $s(f_x, w_j)$ and $s(f_x, f_y)$ with $s$. Disambiguation will be clear from the context.
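The unified view above, in which classification and similarity differ only in the second argument of the scoring function, can be illustrated with a minimal NumPy sketch. The helper names (`dot_score`, `cosine_score`, `global_avg_pool`), the toy tensor shapes, and the random class weights `W` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def dot_score(u, v):
    """Dot-product score between two representation vectors."""
    return float(np.dot(u, v))

def cosine_score(u, v, eps=1e-12):
    """Cosine-similarity score; eps guards against zero-norm inputs."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def global_avg_pool(h):
    """Map a (channels, height, width) activation tensor to a vector."""
    return h.mean(axis=(1, 2))

rng = np.random.default_rng(0)
h_x = np.maximum(rng.normal(size=(8, 4, 4)), 0.0)  # toy post-ReLU activations, image x
h_y = np.maximum(rng.normal(size=(8, 4, 4)), 0.0)  # toy post-ReLU activations, image y
W = rng.normal(size=(3, 8))                        # hypothetical class weight vectors

f_x, f_y = global_avg_pool(h_x), global_avg_pool(h_y)

# Classification: score the image representation against each class vector.
class_logits = np.array([dot_score(f_x, w) for w in W])
# Similarity: score the two image representations against each other.
similarity = cosine_score(f_x, f_y)
```

In both cases the saliency computation only needs the scalar score and its gradients w.r.t. the activations, which is what lets a single method cover both tasks.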

3.2. The GAM Method

Given an image $x$, we denote the $l$-th saliency map by $m_x^l \triangleq m(h_x^l, g_x^l)$, which is a function of the activation maps $h_x^l$ and their gradients $g_x^l \triangleq \partial s / \partial h_x^l$. We denote $g_x^{lk} \triangleq g_x^l[k]$ (similarly to the notation $h_x^{lk}$). Then, we implement $m$ as:

$$m(h_x^l, g_x^l) = \mathrm{NRM}\left[\mathrm{RSZ}\left[\sum_{k=1}^{c_l} \phi(h_x^{lk}) \circ \phi(g_x^{lk})\right]\right], \qquad (1)$$

where $\phi$ is the ReLU activation function and $\circ$ is the Hadamard product. RSZ denotes the operation of resizing to a matrix of size $u_0 \times v_0$ (the height and width of the original image $x$). NRM denotes the min-max normalization.

The motivation behind Eq. 1 is as follows: each filter $k$ in the $l$-th convolutional layer captures a specific pattern. Therefore, we expect $h_x^{lk}$ to have high (low) values in regions that do (not) correlate with the $k$-th filter. In addition, regions in $g_x^{lk}$ that receive positive (negative) values indicate that increasing the value of the same regions in $h_x^{lk}$ will increase (decrease) the value of $s$. GAM highlights pixels that are both positively activated and associated with positive gradients. To this end, we first truncate all negative gradients (using ReLU). Then, we truncate negative values in the activation map $h_x^{lk}$, and multiply it (element-wise) by the truncated gradient map. This ensures that only pixels associated with both positive activations and positive gradients are preserved. Then, we sum the resulting maps across the channel (filter) axis to aggregate the per-pixel saliency from all channels in the $l$-th layer. The $l$-th saliency map $m_x^l$ is obtained by resizing (via bi-cubic interpolation) to the original image spatial dimensions, followed by min-max normalization.

This process produces a set of $L$ saliency maps $\{m_x^l\}_{l=1}^{L}$. The final saliency map $z_x^n$ is computed by a function $z$ that aggregates the information from the saliency maps produced by the last $n$ layers. In this work, we implement $z$ as follows:

$$z_x^n \triangleq z\left(\{m_x^l\}_{l=L-n+1}^{L}\right) = \sum_{l=L-n+1}^{L} m_x^l. \qquad (2)$$

Note that in our experiments, we found that different implementations of $z$, such as max-pooling, Hadamard product, or various weighted combinations of the maps $m_x^l$, perform worse than Eq. 2. Yet, in Sec. 4, we do investigate the effect of different $n$ values on the final saliency map $z_x^n$.
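The per-layer computation (Eq. 1) and the aggregation over the last layers (Eq. 2) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: activations and gradients are passed in as plain arrays (in practice they would come from an autograd framework), the function names are hypothetical, and nearest-neighbour resizing is swapped in for the bi-cubic interpolation described in the text to keep the sketch dependency-free.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def resize_nearest(m, out_hw):
    """Nearest-neighbour resize of a 2D map (stand-in for bi-cubic resizing)."""
    rows = np.arange(out_hw[0]) * m.shape[0] // out_hw[0]
    cols = np.arange(out_hw[1]) * m.shape[1] // out_hw[1]
    return m[np.ix_(rows, cols)]

def min_max_normalize(m, eps=1e-12):
    """Rescale to [0, 1]; eps avoids division by zero for constant maps."""
    return (m - m.min()) / (m.max() - m.min() + eps)

def layer_saliency(h, g, out_hw):
    """Eq. 1-style map for one layer; h and g have shape (channels, height, width)."""
    m = (relu(h) * relu(g)).sum(axis=0)  # keep positive activation AND positive gradient
    return min_max_normalize(resize_nearest(m, out_hw))

def gam_saliency(activations, gradients, out_hw, n=2):
    """Eq. 2-style sum of the per-layer maps of the last n layers.

    activations / gradients are lists ordered from the first to the last layer.
    """
    maps = [layer_saliency(h, g, out_hw)
            for h, g in zip(activations[-n:], gradients[-n:])]
    return np.sum(maps, axis=0)

# Toy usage with random tensors standing in for two layers of different resolution.
rng = np.random.default_rng(0)
acts = [rng.normal(size=(4, 8, 8)), rng.normal(size=(6, 4, 4))]
grads = [rng.normal(size=(4, 8, 8)), rng.normal(size=(6, 4, 4))]
z = gam_saliency(acts, grads, out_hw=(16, 16), n=2)
```

Because each per-layer map is normalized before the sum, layers with different activation scales contribute on an equal footing to the final map.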

3.3. GAM’s Unique Features

GAM presents several advantages over GC and GC++:

Gradient Localization: GC computes the saliency map based on a linear combination of the activation maps in the last convolutional layer as follows:

$$m_{GC} = \phi\left(\sum_{k=1}^{c_L} \alpha^k h_x^{Lk}\right), \qquad (3)$$

where $\alpha^k = \frac{1}{u_L v_L} \sum_{i=1}^{u_L} \sum_{j=1}^{v_L} g_x^{Lk}[i, j]$. When compared to GAM, the computation in Eq. 3 has two major drawbacks: First, the coefficients $\alpha^k$ are the pooled gradients. Hence, in GC (and GC++), the gradient spatial information is lost. This is in contrast to our GAM approach (Eq. 1), which preserves (positive) gradient localization via the element-wise multiplication by $\phi(g_x^{lk})$. The significance of this property is well expressed in the Positive gradients row in Fig. 2.

Multi-layer Analysis: GC produces saliency maps based on the last convolutional layer only. GAM, on the other hand, gleans information extracted from multiple layers (or blocks) that vary by their resolution and sensitivity (Eq. 2). Earlier blocks in the network are characterized by higher resolution. For example, in DenseNet, the last convolutional layer produces low-resolution activation maps of size $7 \times 7$, whereas the preceding convolutional layer produces activation maps of size $14 \times 14$. Our findings show that extracting information from earlier blocks is critical in certain architectures. In Sec. 4, we show that incorporating information from earlier blocks (i.e., setting $n > 1$) enables GAM to produce fine-grained saliency maps that are more focused on the relevant objects. However, the application of the same feature to GC / GC++ hurts performance (Fig. 8 and Tabs. 1, 2).

Negative Gradients Suppression: A subtle, yet highly important drawback of Eq. 3 stems from the way in which the $\phi$ (ReLU) operation is applied. In GC, the weighted combination of the activations is summed, where each activation $h_x^{Lk}$ is weighted by its pooled gradients $\alpha^k$. In architectures like ResNet or DenseNet, the activations $h_x^{Lk}$ are always non-negative (due to the ReLU activation at the end of each block). However, the pooled gradients $\alpha^k$ can still result in a negative value. As a result, GC might become insensitive to important regions (pixels) that should be intensified. The justification for this claim is as follows: Consider a pixel $(i, j)$ in a region that contributes to the final score $s$. Ideally, we wish this pixel to be intensified in the final saliency map. By its nature, such a pixel in an “important” region is expected to have positive (pooled) gradient values and positive activation values across several filters. However, it is also possible that some other filters that respond with a small, yet positive activation, will be associated with negative (pooled) gradient values. Mathematically, this is expressed by the following decomposition:

$$\sum_{k=1}^{c_L} \alpha^k h_x^{Lk}[i, j] = \underbrace{\sum_{k:\, \alpha^k \ge 0} \alpha^k h_x^{Lk}[i, j]}_{p_{ij}} + \underbrace{\sum_{k:\, \alpha^k < 0} \alpha^k h_x^{Lk}[i, j]}_{q_{ij}}. \qquad (4)$$

If $|q_{ij}| > p_{ij}$, then the pixel $(i, j)$ will have a negative intensity $p_{ij} + q_{ij} < 0$. In this case, the pixel, as well as other pixels in the region, is zeroed and masked due to the subsequent application of $\phi$ (ReLU) in Eq. 3. This might further lead to a relative intensification of other, less “informative” pixels (associated with much smaller contributions than that of $(i, j)$), but for which the sum in Eq. 4 is positive. GAM, on the other hand, applies $\phi$ to the gradients before the multiplication by the activations (Eq. 1). This ensures negative gradients are zeroed and hence do not (negatively) affect the region’s intensity on other channels or layers. Thus, regions with positive gradients are never masked by $\phi$ and are “correctly” intensified according to the magnitudes of the positive gradients and activations only.

In GC, the negative gradients problem becomes noticeable when using the cosine similarity. Fig. 2 exemplifies this effect, presenting a comparison between GC and GAM (using DenseNet201). We used the ‘last layer’ version of GAM (Eq. 2, $n = 1$), ensuring the improvement by GAM is indeed due to the way it computes the saliency maps, neutralizing the contribution from earlier layers. Each pair of columns in Fig. 2 presents saliency maps computed w.r.t. the cosine similarity. We see that GC (third row, marked red) produces saliency maps that intensify wrong regions (left image of each pair). Empirically, this is explained by the accumulated activation maps and the positive gradient maps (shown in rows 4-5 after ReLU), and the negative gradient maps (shown in row 6 after negation and ReLU). In both examples (dog and chair), we observe the high magnitude of the negative gradients and their adverse effect: the final intensity in regions of interest is significantly attenuated compared to the background, resulting in poor quality saliency maps. However, as explained above, by suppressing negative gradients in advance, GAM averts this problem and successfully produces adequate saliency maps.
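The masking effect described above can be reproduced on a hand-built toy example. The sketch below contrasts a GC-style map (pooled gradients, ReLU applied after the weighted sum) with a last-layer GAM-style map (ReLU applied to the gradients first). The tensors are invented for illustration, and resizing and normalization are omitted so that the raw intensities are easy to follow.

```python
import numpy as np

def grad_cam_map(h, g):
    """GC-style map: one pooled-gradient weight per channel, ReLU after the sum."""
    alpha = g.mean(axis=(1, 2))
    return np.maximum((alpha[:, None, None] * h).sum(axis=0), 0.0)

def gam_last_layer_map(h, g):
    """GAM-style map (before resize/normalization): ReLU on gradients first."""
    return (np.maximum(h, 0.0) * np.maximum(g, 0.0)).sum(axis=0)

# Three channels over a 2x2 grid; all activations are non-negative (post-ReLU).
h = np.array([[[2.0, 0.0], [0.0, 0.0]],    # channel 0: strong response at pixel (0, 0)
              [[1.0, 0.0], [0.0, 0.5]],    # channel 1: weaker response at (0, 0), (1, 1)
              [[0.0, 0.0], [0.2, 0.0]]])   # channel 2: faint response at (1, 0)
g = np.array([[[1.0, 0.0], [0.0, 0.0]],    # channel 0: positive gradient at (0, 0)
              [[-4.0, 0.0], [0.0, 0.0]],   # channel 1: large negative gradient
              [[0.0, 0.0], [0.4, 0.0]]])   # channel 2: small positive gradient

gc = grad_cam_map(h, g)
gam = gam_last_layer_map(h, g)
# Pooled weights are (0.25, -1.0, 0.1). At pixel (0, 0) the GC sum is
# 0.25 * 2 - 1.0 * 1 = -0.5, so ReLU zeroes it, while the faint pixel (1, 0) survives.
# GAM keeps (0, 0) as the strongest pixel because the negative gradient is dropped
# before it can cancel the positive contribution of channel 0.
```

Here the only non-zero GC pixel is the least informative one, which mirrors the "relative intensification" failure mode discussed in the text.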

Figure 2. GAM and GC saliency maps w.r.t. the cosine similarity for two pairs of images: Dogs and Chairs (DenseNet201). GC’s failures are marked red. Rows 4-6 present the activation map, ReLUed positive gradient maps (summed across channels), and negative gradient map (summed across channels after negation and ReLU), respectively. See Sec. 3.3 for details.
Figure 3. GAM and GC++ saliency maps w.r.t. dot product similarity for two pairs of images (DenseNet201). GC++’s failures are marked red. See Sec. 3.3 for details.

The poor performance of GC when using the cosine similarity (instead of the dot-product) can be further explained mathematically. In the case of the dot-product similarity, $s = f_x^\top f_y$ and $\partial s / \partial f_x = f_y$, so the gradients are guaranteed to be non-negative. This stems from the fact that in DenseNet (and many other architectures), the global average pooling operation is applied after the application of ReLU, hence both $f_x$ and $f_y$ are entry-wise non-negative, and so are $\partial s / \partial f_x$ and $g_x^L = \partial s / \partial h_x^L$ (as the gradient of the average pooling function is a positive constant). This implies that the pooled gradients satisfy $\alpha^k \ge 0$ for all $k$ in Eq. 4, thus negative gradients do not exist at all. However, in the case of the cosine similarity, $s = \frac{f_x^\top f_y}{\|f_x\| \|f_y\|}$, and since both $f_x$ and $f_y$ are entry-wise non-negative we have $0 \le s \le 1$ and:

$$\frac{\partial s}{\partial f_x} = \frac{f_y}{\|f_x\| \|f_y\|} - s \frac{f_x}{\|f_x\|^2}. \qquad (5)$$

Eq. 5 shows that $\partial s / \partial f_x$ (and hence $g_x^L$) is the difference between two non-negative vectors, and hence may contain negative entries. Therefore, in the case of cosine similarity, negative gradients are possible, and might mask “important” regions in the image that should be intensified in the saliency map.

Finally, when using the dot-product similarity, it is GC++ that completely fails. Fig. 3 compares GC++ and the ‘last layer’ GAM ($n = 1$). GC++ weighs the pixel-wise gradients (before pooling) with the coefficients:

$$\alpha^{k}_{ij} = \frac{\left(g_x^{Lk}[i, j]\right)^2}{2 \left(g_x^{Lk}[i, j]\right)^2 + \sum_{a, b} h_x^{Lk}[a, b] \left(g_x^{Lk}[i, j]\right)^3}. \qquad (6)$$

Note that during the computation of its channel weights, GC++ passes $s$ through the exponential function. However, when $s$ is the dot-product, which is unbounded, this may lead to an “explosion” of the saliency map values, as observed in Fig. 3. GAM, however, produces adequate saliency maps.
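The sign argument above is easy to check numerically. The following sketch evaluates the closed-form gradient of the dot-product and of the cosine similarity w.r.t. the first representation vector, for two hand-picked non-negative vectors (the values are arbitrary and chosen only for illustration).

```python
import numpy as np

def cosine(f_x, f_y):
    return float(f_x @ f_y / (np.linalg.norm(f_x) * np.linalg.norm(f_y)))

def dot_grad(f_x, f_y):
    """Gradient of the dot-product f_x . f_y with respect to f_x."""
    return f_y.copy()

def cosine_grad(f_x, f_y):
    """Gradient of the cosine similarity with respect to f_x (closed form)."""
    nx, ny = np.linalg.norm(f_x), np.linalg.norm(f_y)
    s = f_x @ f_y / (nx * ny)
    return f_y / (nx * ny) - s * f_x / nx**2

# Entry-wise non-negative vectors, as produced by pooling post-ReLU activations.
f_x = np.array([3.0, 0.1, 1.0])
f_y = np.array([0.2, 2.0, 1.0])

g_dot = dot_grad(f_x, f_y)      # every entry is non-negative
g_cos = cosine_grad(f_x, f_y)   # the first entry is negative
```

The first coordinate is large in `f_x` but small in `f_y`, so increasing it further would reduce the cosine similarity; this is exactly the kind of negative gradient that pooling can let dominate in GC. The cosine gradient is also orthogonal to `f_x`, since rescaling `f_x` leaves the score unchanged.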

Figure 4. Each pair of rows presents saliency maps produced by GAM, GC and GC++ w.r.t. the cosine similarity.


Figure 5. Saliency maps produced by GAM, GC and GC++ (DenseNet201 model) w.r.t. the dot-product similarity. Each pair of columns corresponds to a pair of images for which the similarity score was computed.

4. Experimental Results

4.1. Subjective Evaluation

First, we demonstrate GAM’s ability to explain visual similarity models. To this end, we set $f$ to the embedding produced by the (channel-wise) global average pooling layer in an ImageNet pretrained DenseNet201 model (discarding the classifier head). To determine the similarity of two images, $x$ and $y$, the images are passed through the model to generate the embeddings $f_x$ and $f_y$. Then, the similarity score is computed by $s(f_x, f_y)$, where $s$ is either the dot-product or cosine similarity. In Fig. 4, row-pairs present saliency maps for pairs of images w.r.t. the cosine similarity. The saliency maps by GAM were produced using two layers (setting $n = 2$ in Eq. 2). We see that GAM produces quality saliency maps, while GC (column 3) consistently fails. When compared to GC++, GAM exhibits saliency maps that are more focused on the source of the similarity. Results w.r.t. the dot-product appear in Fig. 5. In this case, we see that GC++ completely fails.

Next, we turn to demonstrate GAM’s ability to visually explain classification models. In this case, the saliency maps are computed w.r.t. the logit scores produced by DenseNet201. Specifically, we compute $s(f_x, w_j)$, where $s$ is the dot-product, $f_x$ is the image representation, and $w_j$ is the weights vector associated with the class $j$. Fig. 6 presents examples of saliency maps produced by GAM, GC and GC++. GAM produces saliency maps that are visibly more class discriminative than the ones produced by GC and GC++. These results further support the analysis from Sec. 3.3, demonstrating the advantages of GAM (over GC and GC++), and show that GAM generates adequate saliency maps in all settings.

Figure 6. Saliency maps produced by GAM, GC and GC++ w.r.t. the classes (top to bottom) ”sunglasses”, ”oboe”, ”soccer ball”, ”coffeepot”, ”matchstick” and ”anemone fish”.

4.2. Sanity Checks for Saliency Maps

As explained in Sec. 2.3, visually appealing saliency maps can be misleading. To assess the validity of GAM for explanations, we conduct the parameter randomization and the data randomization sanity tests from (Adebayo et al., 2018). GAM passed both tests. Figure 7 presents examples from the sanity checks. The first row shows two saliency maps produced by GAM w.r.t. the “tabby cat” class. We see that when GAM utilizes an ImageNet pretrained ResNet50 model, it produces a focused saliency (around the cat), but when applying GAM to the same network with randomly initialized weights, it fails to detect the cat in the image. Thus, we conclude that GAM is sensitive to model parameters and passes the parameter randomization test. The second row shows that GAM produces an adequate saliency map when the model (LeNet-5 (LeCun et al., 1998)) is trained with the true MNIST labels, but fails when the model is trained with random labels. Thus, we conclude that GAM is sensitive to data labels and passes the data randomization test.

Figure 7. Sanity checks. Rows 1 and 2 present GAM results for the parameter randomization and data randomization tests w.r.t. the “tabby cat” (ImageNet) and “one” (MNIST) classes, using ResNet50 and LeNet-5, respectively. Left to right: Row 1: Original image, GAM computed based on a trained model, GAM computed based on an untrained model (random weights). Row 2: Original image, GAM computed based on a model that was trained with the ground truth labels, GAM computed based on a model that was trained with random labels.
Figure 8. Layer ablation study (DenseNet201). Saliency maps are computed by GAM, GC and GC++ for different numbers of layers $n$ (Eq. 2), w.r.t. the class ”basketball”. GAM performs the best. See Sec. 4.3.
Figure 9. GAM for small objects (DenseNet201). Saliency maps are computed w.r.t. the classes ”golden retriever” (row 1) and ”airliner” (row 2), for each layer and their sum (Eq. 2).
Figure 10. GAM for small objects (DenseNet201). Saliency maps are computed w.r.t. the class ”tabby cat”, for each layer and their sum (Eq. 2). The last column presents results produced by GC.

4.3. Layer Ablation Study

In this section, we test whether GAM, GC, and GC++ benefit from the use of multiple layers. On one hand, earlier layers are associated with smaller receptive fields, giving better localization. On the other hand, these layers usually capture less semantic features. Fig. 8 presents a comparison of GAM, GC and GC++ when using multiple layers. We see that GAM benefits from the use of multiple layers, while GC and GC++ do not.

Figures 9 and 10 demonstrate the advantage of multi-layer GAM over single-layer GAM. Three images are presented, each with a small object (dog, airplane, and cat). We see that GAM based on earlier layers produces more focused saliency maps due to higher resolution analysis. This leads to better localization in the final saliency map, as seen in ’GAM(sum)’.

Figure 11 presents another layer-wise analysis, where it is observed that the last two layers (second row, last two columns), corresponding to $n = 2$, best balance localization with the extraction of semantic features, yielding optimal results. In addition, gradient localization is observed in the ’Gradients’ columns, which is a unique property of GAM (in contrast to GC, which performs gradient pooling). For further explanations, see Sec. 3.3 (Gradient Localization). Yet, when more than one layer is used for GC and GC++, performance degrades. As we shall see, these trends repeat in the quantitative evaluation in Secs. 4.4 and 4.5 as well.

Figure 11. GAM for visual similarity using an ImageNet pretrained DenseNet201: Layer ablation study. Columns 1-2, 3-4 and 5-6 present the saliency maps (Eq. 1), activation maps (summed over the channel axis) and gradient maps (summed over channels), respectively, for each layer (top to bottom). The last two columns present saliency maps computed based on Eq. 2, for each number of layers (top to bottom).

Dataset    Metric   | GAM                  | GC++                 | GC
                    |   n=1    n=2   Impr. |   n=1    n=2   Impr. |   n=1    n=2   Impr.
Classification
VRC        ADP (↓)  | 17.47  17.22    1.4% | 17.62  17.67   -0.3% | 18.49  18.56   -0.4%
VRC (25%)  ADP (↓)  | 18.57  18.51    0.3% | 19.31  19.45   -0.3% | 20.32  20.37   -0.2%
VRC (10%)  ADP (↓)  | 21.02  20.52    2.4% | 22.39  22.54   -0.7% | 24.89  25.23   -1.3%
VRC        PIC (↑)  | 38.12  39.53    3.6% | 37.99  35.76   -6.2% | 35.24  33.45   -5.6%
VRC (25%)  PIC (↑)  | 36.87  37.56    1.8% | 35.32  35.12   -0.6% | 34.70  34.54   -0.5%
VRC (10%)  PIC (↑)  | 35.21  35.48    0.8% | 32.75  31.98   -2.4% | 32.01  31.03   -3.2%
Similarity (cos)
VRC        ADP (↓)  |  0.75   0.72    2.8% |  0.75   0.79   -5.1% |  3.21   3.46   -7.2%
VRC (25%)  ADP (↓)  |  1.10   1.03    6.8% |  1.12   1.14   -1.6% |  9.65  10.67   -9.6%
VRC (10%)  ADP (↓)  |  1.39   1.31    6.1% |  1.51   1.67   -9.6% | 12.19  13.45   -9.4%
VRC        PIC (↑)  | 74.13  75.85    2.3% | 71.76  70.43   -1.9% | 44.33  42.12   -5.2%
VRC (25%)  PIC (↑)  | 64.23  65.44    1.8% | 61.62  60.08   -2.6% | 39.14  39.02   -0.3%
VRC (10%)  PIC (↑)  | 54.67  55.96    2.3% | 50.83  48.87   -4.0% | 27.93  26.89   -3.9%
Similarity (dot)
VRC        ADP (↓)  |  2.15   2.04    5.4% | 53.45  55.65   -4.0% |  2.16   2.39   -9.6%
VRC (25%)  ADP (↓)  |  2.13   2.04    4.4% | 57.24  60.78   -5.8% |  2.23   2.35   -5.1%
VRC (10%)  ADP (↓)  |  2.17   2.08    4.3% | 58.02  61.23   -5.2% |  2.42   2.67   -9.4%
VRC        PIC (↑)  | 71.76  72.96    1.6% |  0.21   0.20   -5.0% | 68.87  68.02   -1.3%
VRC (25%)  PIC (↑)  | 68.97  70.28    1.8% |  0.07   0.07    0.0% | 66.12  65.23   -1.4%
VRC (10%)  PIC (↑)  | 68.01  68.99    1.4% |  0.02   0.02    0.0% | 63.15  62.11   -1.7%

Table 1. Objective evaluation, including a layer ablation study using the last $n \in \{1, 2\}$ layers (Eq. 2) of ResNet101. For ADP (PIC), lower (higher) is better. Impr. is the relative improvement of $n = 2$ over $n = 1$. VRC stands for ILSVRC-15-val. 25% and 10% denote the subsets of VRC that contain the small objects, as explained in Sec. 4.4.

Dataset    Model    | GAM                | GC++               | GC
                    |  n=1   n=2   Impr. |  n=1   n=2   Impr. |  n=1   n=2   Impr.
Classification
VRC        DenseNet | 54.9  56.9    3.6% | 54.9  47.7  -13.1% | 52.4  50.3   -4.0%
VRC(25%)   DenseNet |   39  43.8   12.3% | 39.6  20.8  -47.5% | 33.5    26  -22.4%
VRC(10%)   DenseNet | 23.4    33   41.0% | 22.6  11.5  -49.1% | 21.3  17.4  -18.3%
VRC        ResNet   | 55.9  57.1    2.1% |   55  53.8   -4.1% | 47.8  47.2   -1.3%
VRC(25%)   ResNet   | 40.8  43.1    8.3% | 40.6  38.9   -4.2% | 33.6  33.5   -0.3%
VRC(10%)   ResNet   | 26.1  33.4   29.5% | 26.2  23.9   -8.4% | 23.2  22.7   -2.2%
XRAY       CheXNet  | 25.8  28.4   10.1% | 26.2  20.2  -22.9% | 24.9  21.6  -13.3%
Similarity (cos)
VRC        DenseNet | 57.4  60.7    5.7% | 57.1  52.3  -10.0% | 52.8  53.5    1.3%
VRC(25%)   DenseNet | 38.2  41.9    9.7% | 37.4  21.9  -44.4% | 25.5  22.6  -11.4%
VRC(10%)   DenseNet | 31.2  35.7   14.4% | 29.7  15.4  -51.4% | 18.4    16  -13.0%
VRC        ResNet   | 57.1  58.6    2.6% |   56  49.8  -11.1% | 39.1  38.2   -2.3%
VRC(25%)   ResNet   | 38.3  39.3    2.6% | 36.1  28.5  -21.1% | 27.2  24.9   -8.5%
VRC(10%)   ResNet   | 31.3  34.9   11.5% | 29.4  22.2  -24.5% | 21.3  20.4   -4.2%
Similarity (dot)
VRC        DenseNet | 59.2  62.4    5.4% |    1     1       - | 58.9    54   -9.8%
VRC(25%)   DenseNet | 39.6  43.8   10.6% |    1     1       - | 38.2  25.5  -36.6%
VRC(10%)   DenseNet |   32  36.9   15.3% |    1     1       - | 31.2  19.6  -38.6%
VRC        ResNet   | 57.9  61.9    6.9% |    1     1       - | 57.3  57.3      0%
VRC(25%)   ResNet   | 39.3  43.1    9.7% |    1     1       - | 38.6  38.1   -1.3%
VRC(10%)   ResNet   | 31.9  36.6   14.7% |    1     1       - | 31.2  30.5   -2.3%
Segmentation
COCO       TResNet  | 28.3  30.7    8.5% | 27.8  27.3   -0.8% | 27.2  27.5    1.1%
COCO(25%)  TResNet  | 22.7  25.8   13.7% | 21.4  21.2   -0.9% | 21.5  21.1   -1.9%
COCO(10%)  TResNet  | 21.4  24.9   16.4% | 20.7  20.5   -1.0% | 21.1  20.3   -3.8%
VOC        TResNet  | 36.2  38.7    6.9% | 35.5  34.8     -2% | 35.5  34.2   -3.7%
VOC(25%)   TResNet  | 34.1  37.2    9.1% | 32.1  31.5   -1.9% | 33.5  31.1   -7.2%
VOC(10%)   TResNet  | 27.1  32.7   20.7% | 26.8  25.3   -5.6% | 26.2  24.9     -5%

Table 2. Object localization and segmentation results for different combinations of task, dataset, model, and method. For each method, we report the accuracy (IoU%) achieved by using the last $n \in \{1, 2\}$ layers (Eq. 2). VRC, XRAY, COCO, and VOC stand for ILSVRC-15, ChestX-ray8, MS-COCO, and Pascal-VOC, respectively. See Sec. 4.5 for details.

4.4. Objective Evaluation

Next, we present an objective evaluation, following the measures suggested in (Chattopadhay et al., 2018) (we refer to (Chattopadhay et al., 2018) for the full details).

Average Drop Percentage (ADP): ADP is computed as:

$$\mathrm{ADP} = 100 \cdot \frac{1}{N} \sum_{i=1}^{N} \frac{\max(0, Y_i^c - O_i^c)}{Y_i^c},$$

where $N$ is the total number of images in the evaluated dataset, and $Y_i^c$ is the model’s output score (confidence) for the correct class $c$ w.r.t. the original image $i$. $O_i^c$ is the same model’s score, this time w.r.t. the ’explanation map’ - a masked version of the original image (produced by the Hadamard product of the original image with the saliency map). The lower the ADP the better the result.

Percentage of Increase in Confidence (PIC): PIC is computed as:

$$\mathrm{PIC} = 100 \cdot \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[Y_i^c < O_i^c\right].$$

PIC reports the percentage of the cases in which the model’s output score increases as a result of replacing the original image with the explanation map. The higher the PIC the better the result.

We further extended the evaluation from (Chattopadhay et al., 2018) to similarity tasks, by reporting ADP and PIC w.r.t. image-pair similarity scores (instead of class specific scores). To this end, we created a similarity subset (will be made public) by randomly sampling image-pairs from the ILSVRC-15-val dataset (Russakovsky et al., 2015) (which does not overlap with the training set used to train the models), with the restriction that each pair contains images that are labeled with the same ground truth class. The similarity subset contains the same number of pairs for each class.

In addition, we tested the ability of GAM to benefit from using several layers when it is applied to images with small objects. We compare the localization capability on small objects by narrowing the ILSVRC-15-val dataset to a subset that contains images for which the ground truth box area is below the 25% / 10% percentile area. For the similarity experiment, we randomly sampled additional pairs from the 25% / 10% narrower sets.

The results are reported in Tab. 1 (ResNet101). For each method, we report the results both for $n = 1$ and $n = 2$ (note that $n = 3$ performs on par with $n = 2$, hence omitted). Recall that for ADP (PIC), lower (higher) values indicate better performance, and Impr. reports the relative improvement obtained by using $n = 2$ (over $n = 1$). We see that GAM outperforms GC and GC++ in the majority of the scenarios. Moreover, GC (GC++) completely fails when using the cosine (dot-product) similarity. This is further empirical evidence for the limitations of GC and GC++ (Sec. 3.3), and for the fact that GAM benefits from multiple layers, whereas GC and GC++ do not (and even degrade). Finally, the results for DenseNet201 exhibit the same trends, but are excluded due to space limitations.
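The two metrics follow directly from their definitions in (Chattopadhay et al., 2018) and can be sketched in a few lines. The function names and the toy scores below are illustrative assumptions; `y` holds the model's scores on the original images and `o` the scores on the corresponding explanation maps. The sketch assumes strictly positive original scores, since ADP divides by them.

```python
import numpy as np

def explanation_map(image, saliency):
    """Hadamard product of an (H, W, C) image with an (H, W) saliency map."""
    return image * saliency[..., None]

def average_drop_percentage(y, o):
    """ADP: mean relative drop in score, counting increases as zero drop."""
    y, o = np.asarray(y, dtype=float), np.asarray(o, dtype=float)
    return float(100.0 * np.mean(np.maximum(0.0, y - o) / y))

def percentage_increase_in_confidence(y, o):
    """PIC: percentage of samples whose score rises on the explanation map."""
    y, o = np.asarray(y, dtype=float), np.asarray(o, dtype=float)
    return float(100.0 * np.mean(o > y))

# Toy scores for four images (made up for illustration).
y = [0.9, 0.6, 0.8, 0.5]     # scores on the original images
o = [0.45, 0.75, 0.8, 0.25]  # scores on the explanation maps

adp = average_drop_percentage(y, o)            # two samples drop by half, two do not drop
pic = percentage_increase_in_confidence(y, o)  # only the second sample increases
```

For the similarity setting described above, the same two functions apply with `y` and `o` holding image-pair similarity scores instead of class scores.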

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 12. Object localization via saliency maps using DenseNet201 over the ILSVRC-15 dataset, w.r.t. labels: barbell, anemone fish, volleyball, bell cote, obelisk, ox and water tower.

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 13. Object localization via saliency maps using CheXNet over the ChestX-ray8 dataset, w.r.t. pathologies: Atelectasis, Effusion, Mass, Pneumothorax, Nodule and Pneumonia. GAM yields more accurate saliency maps, hence leading to better localization.

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 14. Object localization and segmentation via saliency maps. Rows 1-2, 3-4, 5-6 present BBox generation (orange) using DenseNet201 (w.r.t. labels: mongoose, gondola), CheXNet (w.r.t. label: Atelectasis), and segmentation (orange) using TResNet (w.r.t. labels: boat, frisbee), respectively.

4.5. Object Localization and Segmentation

In this section, we compare the localization capability of GAM, GC and GC++ via an extensive set of experiments across various tasks, datasets, models, and settings. We measure the quality of the produced saliency maps by Intersection over Union (IoU%) w.r.t. the ground truth bounding boxes (BBox) or segmented areas. To this end, each saliency map is binarized with a fixed threshold before drawing the predicted BBox or segmented area. The fixed threshold was chosen for each test and method separately using a hold-out set (will be made public). Table 2 presents the obtained localization accuracy (IoU%) for each combination of task, dataset, model, and method, both for a single layer and for multiple layers, including the improvement obtained when using multiple layers. Again, we observe that GAM outperforms the other methods. Moreover, it is evident that GAM significantly benefits from using multiple layers (especially in the case of small objects), whereas GC and GC++ suffer from a significant degradation in accuracy when utilizing more than a single layer. In what follows we discuss the results per task.

Localization by Classification: We followed the test protocol from GC (Selvaraju et al., 2017), where the saliency maps of a classification model are used to draw a BBox around classified objects. We apply the two-layer GAM (Eq. 2), GC and GC++ on top of pretrained DenseNet201 and ResNet101. Figs. 12 and 13 and Fig. 14 (rows 1-2) present examples of the generated saliency maps and BBoxes (marked orange). Tab. 2 (row 1) presents the localization accuracy (IoU%) between the predicted and ground truth (ILSVRC-15-val) boxes. In all cases, GAM outperforms both GC and GC++.

Localization by Similarity: We adjusted the protocol from the localization by classification experiment to support localization by similarity. To this end, we replace the classification score with the similarity score computed for image-pairs. We used the same image-pairs from the similarity subset (Sec. 4.4). Then, we drew a BBox for each image in the pair, and computed IoU% w.r.t. the ground truth. Results w.r.t. the different similarity scores are reported in Tab. 2 (rows 5, 8) and demonstrated in Fig. 15. Again, we observe that GC (GC++) fails when using the cosine (dot-product) similarity, and significantly degrades when utilizing multiple layers, while GAM performs the best and clearly benefits from multiresolution analysis.
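
The BBox-based protocol described at the start of this section (binarize the saliency map, draw a box, compute IoU%) can be sketched as follows. This is a minimal illustration under stated assumptions: the saliency map is a 2D list with values in [0, 1], the predicted box is the tight box around all above-threshold pixels (a simplification, since the exact box-drawing rule is not specified here), and boxes use inclusive pixel coordinates (x_min, y_min, x_max, y_max). All function names are hypothetical.

```python
def binarize(saliency, threshold):
    """Turn a 2D saliency map into a boolean foreground mask."""
    return [[value >= threshold for value in row] for row in saliency]


def bounding_box(mask):
    """Tight box (x_min, y_min, x_max, y_max) around all foreground pixels.

    Assumption: one box around every foreground pixel; returns None if the
    mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, on in enumerate(row) if on]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def box_area(box):
    """Area in pixels of a box with inclusive coordinates."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def box_iou_percent(box_a, box_b):
    """Intersection over Union of two boxes, in percent."""
    if box_a is None or box_b is None:
        return 0.0
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]) + 1)
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]) + 1)
    inter = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - inter
    return 100.0 * inter / union


# Toy 4x4 saliency map and ground truth box (both hypothetical).
saliency = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.2, 0.0],
]
predicted = bounding_box(binarize(saliency, threshold=0.5))
ground_truth = (1, 1, 3, 2)
print(predicted, box_iou_percent(predicted, ground_truth))
```

For the similarity setting, the same routine would be applied to each image in a pair, with the saliency map computed w.r.t. the similarity score rather than the class score.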

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 15. Object localization w.r.t. similarity score (cosine). The saliency maps are drawn using DenseNet201 over image-pairs from ILSVRC-15-val (validation set). The labels for the image-pairs are (top to bottom): hammerhead shark, weevil, lesser panda, analog clock and stupa.

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 16. Segmentation results based on saliency maps produced by GAM, GC, and GC++ (TResNet) on examples from the Pascal VOC (validation) dataset, w.r.t. labels (top to bottom): aeroplane, bird, boat, car, cow, horse, motorbike, person, sheep and sofa.

[Figure image. Column labels: Ground-truth, GAM, GC++, GC]

Figure 17. Segmentation results based on saliency maps produced by GAM, GC, and GC++ (TResNet) on examples from the MS-COCO (validation) dataset, w.r.t. labels (top to bottom): bird, kite, boat, traffic light and sink.

Localization of Small Objects: We used the 25% and 10% partitions from Sec. 4.4 to test the localization capability of GAM on small objects. In addition, we conducted a localization experiment on the medical imaging dataset ChestX-ray8 (Wang et al., 2017), where classification decisions are usually driven by small details in the images. In this experiment, we used the CheXNet model from (Rajpurkar et al., 2017) that was trained on the ChestX-ray8 dataset to classify common thorax diseases. The results for the ILSVRC-15-val 25% / 10% subsets and for ChestX-ray8 appear in the classification and similarity sections of Tab. 2, and are demonstrated in Fig. 14 (rows 3-4). We see that GAM significantly outperforms both GC and GC++. These findings support the observation from Sec. 4.3 that multi-layer GAM produces better saliency maps for small objects.

Object Segmentation: Finally, we tested the utilization of GAM, GC and GC++ for object segmentation. To this end, we applied the methods on top of two pretrained multi-label classification TResNet (Ben-Baruch et al., 2020) models, trained on the MS-COCO (Lin et al., 2014) and Pascal VOC (Everingham et al., 2010) datasets. For each image, we computed the saliency maps w.r.t. each of the ground truth labels. Then, we computed the IoU% of the binarized saliency map w.r.t. the ground truth segmentation (in pixels), for each ground truth label. The results appear in Tab. 2 (Segmentation), and are exemplified in Fig. 14 (rows 5-6) and in Figs. 16 and 17. Overall, we see that GAM produces the most accurate segmentation.
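
The segmentation evaluation above can be sketched as a pixel-wise IoU% between the binarized saliency map and the ground truth mask. The sketch also shows one plausible way to pick the fixed threshold on a hold-out set. The selection criterion used here (maximizing the mean hold-out IoU%) is an assumption, as the text only states that a hold-out set was used. The function names and the toy data are hypothetical.

```python
def mask_iou_percent(saliency, gt_mask, threshold):
    """Pixel-wise IoU% between a binarized saliency map and a ground truth mask.

    saliency: 2D list of floats; gt_mask: 2D list of 0/1 (or bool) values.
    """
    inter = 0
    union = 0
    for sal_row, gt_row in zip(saliency, gt_mask):
        for value, gt in zip(sal_row, gt_row):
            pred = value >= threshold
            truth = bool(gt)
            inter += int(pred and truth)
            union += int(pred or truth)
    return 100.0 * inter / union if union else 0.0


def select_threshold(holdout, candidates):
    """Pick the candidate threshold with the highest mean IoU% on a hold-out set.

    holdout: list of (saliency, gt_mask) pairs.
    Assumption: the mean hold-out IoU% is the selection criterion.
    """
    def mean_iou(threshold):
        scores = [mask_iou_percent(s, g, threshold) for s, g in holdout]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_iou)


# Toy hold-out example (hypothetical): one 2x2 saliency map and its mask.
saliency = [[0.9, 0.6], [0.3, 0.1]]
gt_mask = [[1, 1], [0, 0]]
best = select_threshold([(saliency, gt_mask)], candidates=[0.2, 0.5, 0.8])
print(best, mask_iou_percent(saliency, gt_mask, best))
```

In a multi-label setting, the IoU% would be computed once per ground truth label, each time using the saliency map produced w.r.t. that label.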

5. Conclusion

This work joins a growing effort to make machine learning models more transparent and explainable. To this end, we present GAM, a state-of-the-art method for explaining visual similarity and classification models in a unified manner. Extensive subjective and objective evaluations show that GAM outperforms its alternatives across various tasks and datasets, and especially on small objects.

References


  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515. Cited by: §1, §2.3, §4.2.
  • O. Barkan, E. Hauon, A. Caciularu, O. Katz, I. Malkiel, O. Armstrong, and N. Koenigstein (2021) Grad-sam: explaining transformers via gradient self-attention maps. In Proceedings of the ACM International Conference on Information & Knowledge Management (CIKM), Cited by: §2.1.
  • E. Ben-Baruch, T. Ridnik, N. Zamir, A. Noy, I. Friedman, M. Protter, and L. Zelnik-Manor (2020) Asymmetric loss for multi-label classification. External Links: 2009.14119 Cited by: §4.5.
  • A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. Cited by: 2nd item, §1, §2.1, §2.3, §4.4.
  • L. Chen, J. Chen, H. Hajimirsadeghi, and G. Mori (2020) Adapting grad-cam for embedding networks. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2794–2803. Cited by: §2.2.
  • F. Doshi-Velez, M. Kortz, R. Budish, C. Bavitz, S. Gershman, D. O’Brien, S. Schieber, J. Waldo, D. Weinberger, and A. Wood (2017) Accountability of AI under the law: the role of explanation. CoRR abs/1711.01134. Cited by: §1.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §2.2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §3.1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.5.
  • A. Mahendran and A. Vedaldi (2016) Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120 (3), pp. 233–255. Cited by: §2.1.
  • H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4004–4012. Cited by: §2.2.
  • F. Radenović, G. Tolias, and O. Chum (2016) CNN image retrieval learns from bow: unsupervised fine-tuning with hard examples. In European conference on computer vision, pp. 3–20. Cited by: §2.2.
  • P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §4.5.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.4.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2.2.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: 2nd item, §1, §1, §2.1, §2.3, §4.5.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §1, §2.1, §2.3.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2.1.
  • A. Stylianou, R. Souvenir, and R. Pless (2019) Visualizing deep similarity networks. In 2019 IEEE winter conference on applications of computer vision (WACV), pp. 2029–2037. Cited by: §1, §2.2.
  • Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996. Cited by: §2.2.
  • G. Tolias, R. Sicre, and H. Jégou (2015) Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879. Cited by: §2.2.
  • A. Vellido, J. D. Martín-Guerrero, and P. J. G. Lisboa (2012) Making machine learning models interpretable. Cited by: §1.
  • J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §2.2.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §4.5.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition, pp. 34–39. Cited by: §2.2.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579. Cited by: §2.1.
  • W. Yu, K. Yang, Y. Bai, H. Yao, and Y. Rui (2014) Visualizing and comparing convolutional neural networks. arXiv preprint arXiv:1412.6631. Cited by: §2.1.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §2.1.
  • M. D. Zeiler, G. W. Taylor, and R. Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pp. 2018–2025. Cited by: §2.1.