Contrastive Explanations in Neural Networks

Visual explanations are logical arguments based on visual features that justify the predictions made by neural networks. Current modes of visual explanations answer questions of the form `Why P?'. These Why questions operate under broad contexts thereby providing answers that are irrelevant in some cases. We propose to constrain these Why questions based on some context Q so that our explanations answer contrastive questions of the form `Why P, rather than Q?'. In this paper, we formalize the structure of contrastive visual explanations for neural networks. We define contrast based on neural networks and propose a methodology to extract defined contrasts. We then use the extracted contrasts as a plug-in on top of existing `Why P?' techniques, specifically Grad-CAM. We demonstrate their value in analyzing both networks and data in applications of large-scale recognition, fine-grained recognition, subsurface seismic analysis, and image quality assessment.





1 Introduction

Explanations are a set of rationales used to understand the reasons behind a decision [12]. When these rationales are based on visual characteristics in a scene, the justifications used to understand the decision are termed visual explanations [7]. Visual explanations can be used as a means to interpret deep neural networks. While deep networks have surpassed human-level performance in traditional computer vision tasks like recognition [5], their lack of transparency in decision making has presented obstacles to their widespread adoption. We first formalize the structure of visual explanations to motivate the need for the proposed contrastive explanations. Hempel and Oppenheim [6] were the first to provide formal structure to explanations [31]. They argued that explanations are like proofs in a logical system [11] and that explanations elucidate decisions of hitherto un-interpretable systems. Typically, explanations involve an answer to structured questions of the form ‘Why P?’, where P refers to any decision. For instance, in recognition algorithms, P refers to the predicted class. In image quality assessment, P refers to the estimated quality.

Why-questions are generally thought to be causal-like in their explanations [13]. In this paper, we refer to them as visual causal explanations for simplicity. Note that these visual causal explanations do not allow causal inference as described in [18].

Figure 1: The visual explanation to Why Spoonbill? is answered through Grad-CAM. The proposed contrastive explanatory method explains Why Spoonbill, rather than Flamingo? by highlighting the neck region in the same input image. Figure best viewed in color.

Consider the example shown in Fig. 1, where we classify between two birds: a spoonbill and a flamingo. Given a spoonbill, a trained neural network classifies the input correctly as a spoonbill. A visual explanation of its decision generally assumes the form of a heat map that is overlaid on the image. In the visual explanations shown in Fig. 1, the red regions answer the posed question. If the posed question takes the form of ‘Why Spoonbill?’, then the regions corresponding to the body shape and color of the spoonbill are highlighted. Such an explanation is based on features that describe a spoonbill irrespective of the context. If the posed question were instead ‘Why Spoonbill, rather than Flamingo?’, then the visual explanation points to the most contrastive features between the two birds, which in this case is the neck: flamingos have a longer S-shaped neck not prevalent in spoonbills. The answers to such ‘Why P, rather than Q?’ questions are contrastive explanations, where Q is the contrast.

The question ‘Why P, rather than Q?’ provides context to the answer and hence relevance [17]. In some cases, such context can be more descriptive for interpretability. For instance, in autonomous driving applications that recognize traffic signs, knowing why a particular traffic sign was chosen over another is informative when analyzing decisions in case of accidents. Similarly, in image quality assessment, where an algorithm predicts the score of an image as 0.25, knowing ‘Why 0.25, rather than 0.5?’ or ‘Why 0.25, rather than 1?’ can be beneficial to analyze both the image and the method itself. In applications like seismic analysis, where geophysicists interpret subsurface images, visualizing ‘Why fault, rather than salt dome?’ can help in evaluating the model, thereby increasing trust in such systems. In this paper, we set the framework for contrastive explanations in neural networks. More specifically, we modify existing ‘Why P?’ explanatory systems like Grad-CAM to obtain contrastive explanations in Section 3. We show their usage in varied applications in Section 4. We then conclude in Section 5.

2 Background and Related Works

We propose to constrain ‘Why P?’ explanatory techniques by providing them with context and relevance, thereby obtaining ‘Why P, rather than Q?’ techniques. In this section, we describe existing ‘Why P?’ techniques and lay the mathematical foundations of neural networks.

Background: Consider an L-layered classification network f(), trained to differentiate between N classes. Given an input image x, f() provides a probability score vector y of dimensions N x 1, where each element in y corresponds to the probability of x belonging to one of the N classes. The predicted class P of image x is the index of the maximum element in y, i.e. P = argmax(y). During training, an empirical loss J(y, y') is minimized, where y' is the ground truth and W denotes the network's weight and bias parameters. Backpropagation [21] minimizes the loss by traversing along the parameter space using the gradients ∇_W J. These gradients represent the change in the network required to predict y' instead of P. Note that for a regression network f(), the above mathematical foundations remain consistent, with P and y' being continuous rather than discrete.
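The setup above can be sketched in a few lines of PyTorch. The network, input size, and class count below are illustrative stand-ins, not the models used in the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in for the L-layered classifier f(): 3 classes, toy 8x8 input.
f = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 1, 8, 8)           # input image x

y = torch.softmax(f(x), dim=1)        # probability vector y, one entry per class
P = y.argmax(dim=1)                   # predicted class P: index of the max element

# Training minimizes an empirical loss J between the logits and the ground
# truth y'; backpropagation populates the gradients dJ/dW on every parameter.
y_prime = torch.tensor([2])           # hypothetical ground-truth label
J = nn.CrossEntropyLoss()(f(x), y_prime)
J.backward()                          # every parameter now carries its .grad
```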

Why P? Explanations: A number of proposed techniques attempt to visually explain ‘Why P?’. The authors in [25] backpropagate the class probabilities and show that the obtained gradients are descriptive of the predicted class P. They also show that notions of all classes are learned within the network. Grad-CAM [24] localizes the ‘Why P?’ parts of the image by backpropagating a class-weighted one-hot vector and using the averaged resultant gradients as importance weights on the activation maps produced by the input image. In this paper, we combine the gradient’s role as a loss minimizer in backpropagation, the existence of notions of classes within neural networks from [25], and the importance-weighing of activation maps from [24] to obtain our contrastive explanations. The authors in [4] tackle contrast through counterfactuals: they change the input image, based on a distractor image, to change its prediction from P to Q. In this paper, we use the existence of notions of classes to provide contrastive explanations without the need for changes to the input images.

Figure 2: The manifold in blue is the learned manifold that recognizes a spoonbill as a spoonbill. The manifold in purple is the contrastive manifold, on which a spoonbill is classified as a flamingo. The change between the two is termed contrast.

3 Contrastive Explanation Generation

We define contrast and provide a methodology to generate it from neural networks. We embed contrast in existing ‘Why P?’ explanations, specifically Grad-CAM, to obtain contrastive explanations.

3.1 Contrast in Neural Networks

In visual space, we define contrast as the perceived difference between two known quantities. In this paper, we assume that the knowledge of the two quantities is provided by a neural network. For instance, in Fig. 1, a neural network is trained to recognize both spoonbills and flamingos as separate classes. Thus, the network has access to the discriminative knowledge that separates the two classes. This knowledge is stored in the network’s weight and bias parameters, termed W and b respectively. These parameters span a manifold on which the given image x belongs to a class P. A toy classification example is shown in Fig. 2, where the learned manifold is visualized in blue. On the learned manifold, a spoonbill is classified as a spoonbill. A hypothetical contrastive manifold is shown in purple that differs from the blue manifold in that it recognizes a spoonbill as a flamingo. The same figure holds for regression networks, where the manifolds exist in a continuous rather than discrete space. In terms of the neural network representation space, contrast is the difference between the manifolds that predict x as P and x as Q. In this paper, instead of directly measuring the difference between the learned and contrastive manifolds, we measure the change required to obtain the contrastive manifold from the learned manifold. We use gradients to measure this change. Usage of gradients to characterize model change is not new. The authors in [16] used gradients with respect to weights to characterize distortions for sparse and variational autoencoders. Fisher Vectors use gradients to characterize the change that data creates within networks [10], an approach that was extended to classify images [23].

Figure 3: Contrastive Explanations (CE) on recognition. (a) Input image x. (b) Grad-CAM of x for the predicted class P. (c) Representative image of the nearest class Q. (d) CE for class Q. (e) CE when Q = P. (f) and (g) CE for random classes. Figure best viewed in color.

We extract contrast for a class Q when an image x is predicted as P by backpropagating a loss between P and Q. Hence, for a loss function J(·), contrast is proportional to ∇_W J, where W are the network parameters. For a contrastive class Q, the contrast is ∇_W J(P, Q). Note that J(P, Q) is a measure of contrastivity between P and Q. In this paper, we choose J(·) to be cross-entropy for recognition networks and mean squared error for regression networks. The contrastive class Q can belong to any one of the learned classes, i.e. Q ∈ [1, N]. Moreover, if f() is a regression network, such as in image quality assessment, Q can take on any value in the range of f().
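With automatic differentiation, the contrastive gradients ∇_W J(P, Q) are obtained by simply backpropagating the loss against the contrast class Q instead of the label. A minimal sketch with an illustrative toy classifier (not the networks used in the paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 3-class network standing in for f().
f = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 1, 8, 8)

logits = f(x)
P = logits.argmax(dim=1)              # network's prediction P

# Contrast for a chosen class Q: backpropagate J(P, Q), i.e. the loss that asks
# what change in W would make the network predict Q instead of P.
Q = torch.tensor([(int(P) + 1) % 3])  # any class other than P
f.zero_grad()
J = nn.CrossEntropyLoss()(logits, Q)
J.backward()
contrast = [p.grad.clone() for p in f.parameters()]  # contrastive gradients
```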

3.2 Contrastive Explanations

For a network that predicts P for a given image, the gradients obtained from Sec. 3.1 represent ‘Why P, rather than Q?’. They provide the difference between the predicted class or variable P and the network’s notion of the class or variable Q. These features can now be plugged into any ‘Why P?’ based method to obtain visual explanations. In this paper, we use Grad-CAM [24] to showcase our contrastive explanations. Essentially, the obtained contrastive gradients are backpropagated to the last convolutional layer to obtain K gradient maps, where K is the number of channels in that layer. These gradient maps are average pooled, and the resulting K x 1 vector is used as importance weights across the K activation maps in the last convolutional layer. The weighted activation maps are mean pooled and resized back to the original image dimensions to obtain contrastive masks, which are overlaid on the image as heat maps.
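The pipeline above can be sketched end to end. The convnet below is an illustrative stand-in for the last conv layer and classifier head of a network like VGG-16; shapes and class counts are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy convnet: a last conv layer feeding a classifier head.
conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 3))

x = torch.randn(1, 3, 32, 32)
A = conv(x)                            # K activation maps of the last conv layer
A.retain_grad()                        # keep gradients at this non-leaf tensor
logits = head(A)

Q = torch.tensor([1])                  # hypothetical contrast class
J = F.cross_entropy(logits, Q)
J.backward()                           # contrastive gradients land in A.grad

alpha = A.grad.mean(dim=(2, 3))        # average-pool gradients -> K x 1 weights
cam = F.relu((alpha[:, :, None, None] * A).mean(dim=1))  # weighted, mean-pooled maps
cam = F.interpolate(cam[None], size=x.shape[2:], mode="bilinear",
                    align_corners=False)[0]              # resize to image size
```

The resulting `cam` is the contrastive mask that would be overlaid on the image as a heat map.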

4 Applications

In this section, we consider two applications: recognition and image quality assessment (IQA). In recognition, visualizing the contrast between classes is instructive in interpreting whether a network has truly learned the differences between classes. In IQA, visualizing contrast can help us both localize the exact regions of quality degradation and quantify degradation based on scores. For recognition, P and Q are discrete classes, while for image quality assessment, P and Q are continuous and can take values between 0 and 1.

Figure 4: (a) Distorted images (b) Grad-CAM (c)-(f) Contrastive explanations to questions shown below each image. Best viewed in color.

4.1 Recognition


In this section, we consider contrastive explanations on large-scale and fine-grained recognition. Large-scale datasets, like ImageNet [22], consist of a wide variety of classes. Fine-grained recognition is the subordinate categorization of similar objects, such as different types of birds or cars, among themselves [32]. We consider the Stanford Cars [14], subsurface-imaging-based LANDMASS [1], and traffic sign recognition CURE-TSR [29, 27, 28] datasets for fine-grained recognition. We use PyTorch’s ImageNet-pretrained models, including AlexNet [15], SqueezeNet [9], VGG-16 [26], ResNet [5], and DenseNet [8], to obtain contrastive explanations on the ImageNet, Stanford Cars, and LANDMASS datasets. Specifically, on Stanford Cars and LANDMASS, we replace and train the final fully connected layer with the requisite number of classes for each dataset. For CURE-TSR, we use the trained network provided by the authors in [29] to extract contrastive explanations. The results for randomly selected images from the fine-grained datasets and the cat-dog image used in the Grad-CAM paper [24] are shown in Fig. 3. Similar to [24], we show results on VGG-16. Note that the contrastive explanations are a function of the network f(). Hence, based on how good the explanations are, we can rank different networks. However, in this paper, we focus only on demonstrating the need for and descriptive capability of contrastive explanations.

Analysis: ImageNet has 1000 classes. Hence, for every image there are 999 contrasts with a wide range of class options. This creates potential contrastive questions like ‘Why bull-mastiff, rather than golf ball?’. The potential visual explanation to such a question lies in the face of the dog. Similarly, when asked ‘Why bull-mastiff, rather than minibus?’, the potential visual explanation is again the face of the dog. Hence, the contrastive explanations between a majority of classes and an image belonging to a class are the same. This is illustrated in row 1 of Fig. 3. The Grad-CAM map of an input predicted as bull-mastiff is shown in Fig. 3b; the face of the dog is highlighted. We calculate all 999 contrastive maps of the input image and show their variance and mean images in Fig. 3f and Fig. 3g, respectively. The variance map is boosted for readability. For classes that are visually similar to bull-mastiff, like boxer, the contrastive explanations indicate the semantic differences between the two classes. This is illustrated in Fig. 3d, where the contrastive explanation shows that there is a difference in the snout between the input image and the network’s notion of a boxer, illustrated in Fig. 3c. When Q is the same as P, the contrastive loss reduces to the loss used during backpropagation at training, and hence the contrastive gradients act as training gradients in that their purpose is to confidently predict P. Hence, the contrastive explanation in this case highlights those regions in the image that limit the network from predicting P with confidence. This is shown in Fig. 3e, where the cat confuses the network and is highlighted in red.
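The per-class contrastive maps, and their mean and variance over all classes, can be computed by looping Q over the label set. A toy sketch with an illustrative 10-class convnet, not the VGG-16 used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
x = torch.randn(1, 3, 16, 16)

maps = []
for q in range(10):                    # one contrastive map per class Q
    A = conv(x)                        # fresh graph each iteration
    A.retain_grad()
    F.cross_entropy(head(A), torch.tensor([q])).backward()
    alpha = A.grad.mean(dim=(2, 3))    # channel-wise importance weights
    maps.append(F.relu((alpha[:, :, None, None] * A).mean(dim=1)).detach())

stack = torch.cat(maps)                # all contrastive maps, one per class
mean_map = stack.mean(dim=0)           # analogous to the mean image (Fig. 3g)
var_map = stack.var(dim=0)             # analogous to the variance image (Fig. 3f)
```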

We show results on the three fine-grained recognition datasets in rows 2-4 of Fig. 3. Note that the contrastive explanations in these cases are descriptive between different classes. Representative images from similar classes are shown in Fig. 3c, and their corresponding contrastive explanations are visualized in Fig. 3d. The contrastive explanations track the fault when asked to contrast with a salt dome (row 2, Fig. 3d), highlight the missing bottom part of the arrow in the no-right-turn image (row 3, Fig. 3d), and highlight the roof when differentiating between the convertible and the coupe (row 4, Fig. 3d). Explanations contrasting against other, random classes are also shown. The input Bugatti Veyron’s sloping hood is sufficiently different from the boxy hoods of both the Audi and the Volvo that it is highlighted.

4.2 Image Quality Assessment

Experiment: Image Quality Assessment (IQA) is the objective estimation of the subjective quality of an image [30]. In this section, we analyze a trained end-to-end full-reference metric, DIQaM-FR [2]. Given a pristine image and its distorted version, the pretrained network from [2] provides a quality score P for the distorted image. We then use the MSE loss function as J(·) and a real number Q to calculate the contrastive gradients. Contrastive explanations for several values of Q, along with Grad-CAM results, are shown in Fig. 4. Both the lighthouse and flower images are distorted using lossy compression and are taken from the TID2013 dataset [19]. Note that the network analyzes the results patchwise. To not filter out the results of individual patches, we visualize un-normalized results in Fig. 4. While the original implementation takes non-overlapping patches to estimate quality, we use overlapping patches with a smaller stride to obtain smoother visualizations. Note that the green and red colored regions within the images are the explanations to the contrastive questions shown below each image.
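For a regression network, the same contrast extraction applies with an MSE loss against the chosen score Q. A minimal sketch with an illustrative toy regressor standing in for the IQA network (architecture and shapes are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regressor standing in for the full-reference IQA model: it maps an
# image patch to a single quality score squashed into (0, 1).
f = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8, 16), nn.ReLU(),
                  nn.Linear(16, 1), nn.Sigmoid())
x = torch.randn(1, 1, 8, 8)

P = f(x)                               # estimated quality score
Q = torch.tensor([[0.75]])             # contrast score: "rather than 0.75?"

f.zero_grad()
J = nn.MSELoss()(P, Q)                 # mean-squared-error contrastive loss
J.backward()                           # contrastive gradients w.r.t. W
```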

Analysis: Fig. 4b shows that Grad-CAM essentially highlights the entire image. This indicates that the network estimates quality based on the whole image. The contrastive explanations, however, indicate where in the image the network assigns quality scores. Fig. 4c and d show the regions within the distorted images that prevent the network from estimating the higher quality Q. According to the obtained visualizations, the estimated quality is primarily due to the distortions within the foreground portions of the image as opposed to the background. This falls in line with previous works in IQA that argue that distortions in the more salient foreground or edge features cause a larger drop in perceptual quality than those in color or background [20, 3]. Also, the results when Q > P differ from their Q < P counterparts. For instance, the network estimates the quality of the distorted lighthouse image to be 0.58. The results for ‘Why 0.58, rather than 0.75?’ show that the distortion in the lighthouse decreases the quality from 0.75 to 0.58. Similarly, the results for ‘Why 0.58, rather than 1?’ show that, because of distortions in the lighthouse as well as the cliff and parts of the background sky, the estimate is 0.58 rather than 1. These results help us further understand the notion of perceptual quality within the network. When Q < P, the contrastive explanations describe why the higher rating P is chosen. It can be seen that the network considers both the foreground and background to estimate a quality higher than Q. We intentionally choose different visualization color maps for Q > P and Q < P to effectively analyze these scenarios.

5 Conclusion

In this paper, we formalized the structure of contrastive explanations. We also provided a methodology to extract contrasts from networks and use them as plug-in techniques on top of existing visual explanatory methods. We demonstrated the use of contrastive explanations in fine-grained recognition to differentiate between subordinate classes. We also demonstrated the ability of contrastive explanations to analyze distorted data and provide answers to contrastive questions in image quality assessment.


  • [1] Y. Alaudah and G. AlRegib (2015) A curvelet-based distance measure for seismic images. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 4200–4204. Cited by: §4.1.
  • [2] S. Bosse, D. Maniry, K. Müller, T. Wiegand, and W. Samek (2018-01) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27 (1), pp. 206–219. External Links: Document, ISSN 1941-0042 Cited by: §4.2.
  • [3] D. M. Chandler (2013) Seven challenges in image quality assessment: past, present, and future research. ISRN Signal Processing 2013. Cited by: §4.2.
  • [4] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee (2019) Counterfactual visual explanations. arXiv preprint arXiv:1904.07451. Cited by: §2.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
  • [6] C. G. Hempel and P. Oppenheim (1948) Studies in the logic of explanation. Philosophy of science 15 (2), pp. 135–175. Cited by: §1.
  • [7] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell (2016) Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Cited by: §1.
  • [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.1.
  • [9] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §4.1.
  • [10] T. Jaakkola and D. Haussler (1999) Exploiting generative models in discriminative classifiers. In Advances in neural information processing systems, pp. 487–493. Cited by: §3.1.
  • [11] F. C. Keil (2006) Explanation and understanding. Annu. Rev. Psychol. 57, pp. 227–254. Cited by: §1.
  • [12] P. Kitcher and W. C. Salmon (1962) Scientific explanation. Vol. 13, U of Minnesota Press. Cited by: §1.
  • [13] A. Koura (1988) An approach to why-questions. Synthese 74 (2), pp. 191–206. Cited by: §1.
  • [14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §4.1.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.
  • [16] G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib (2019) Distorted representation space characterization through backpropagated gradients. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 2651–2655. Cited by: §3.1.
  • [17] G. R. Mayes (2001) Theories of explanation. Cited by: §1.
  • [18] J. Pearl et al. (2009) Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §1.
  • [19] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al. (2015) Image database tid2013: peculiarities, results and perspectives. Signal Processing: Image Communication 30, pp. 57–77. Cited by: §4.2.
  • [20] M. Prabhushankar, D. Temel, and G. AlRegib (2017) Ms-unique: multi-model and sharpness-weighted unsupervised image quality estimation. Electronic Imaging 2017 (12), pp. 30–35. Cited by: §4.2.
  • [21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. nature 323 (6088), pp. 533–536. Cited by: §2.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.1.
  • [23] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek (2013) Image classification with the fisher vector: theory and practice. International journal of computer vision 105 (3), pp. 222–245. Cited by: §3.1.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §2, §3.2, §4.1.
  • [25] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §2.
  • [26] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [27] D. Temel and G. AlRegib (2018) Traffic signs in the wild: highlights from the ieee video and image processing cup 2017 student competition [sp competitions]. IEEE Sig. Proc. Mag. 35 (2), pp. 154–161. External Links: Document, ISSN 1053-5888, Link Cited by: §4.1.
  • [28] D. Temel, M. Chen, and G. AlRegib (2019) Traffic sign detection under challenging conditions: a deeper look into performance variations and spectral characteristics. IEEE Transactions on Intelligent Transportation Systems (), pp. 1–11. External Links: Document, ISSN 1524-9050, Link Cited by: §4.1.
  • [29] D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib (2017) CURE-tsr: challenging unreal and real environments for traffic sign recognition. arXiv preprint arXiv:1712.02463. Cited by: §4.1.
  • [30] D. Temel, M. Prabhushankar, and G. AlRegib (2016) UNIQUE: unsupervised image quality estimation. IEEE signal processing letters 23 (10), pp. 1414–1418. Cited by: §4.2.
  • [31] D. A. Wilkenfeld (2014) Functional explaining: a new approach to the philosophy of explanation. Synthese 191 (14), pp. 3367–3391. Cited by: §1.
  • [32] S. Yang, L. Bo, J. Wang, and L. G. Shapiro (2012) Unsupervised template learning for fine-grained object recognition. In Advances in neural information processing systems, pp. 3122–3130. Cited by: §4.1.