Explanations are a set of rationales used to understand the reasons behind a decision. When these rationales are based on visual characteristics in a scene, the justifications are termed visual explanations. Visual explanations can be used to interpret deep neural networks. While deep networks have surpassed human-level performance in traditional computer vision tasks like recognition, their lack of transparency in decision making has hindered their widespread adoption. We first formalize the structure of visual explanations to motivate the need for the proposed contrastive explanations. Hempel and Oppenheim were the first to provide a formal structure to explanations. They argued that explanations are like proofs in a logical system and that explanations elucidate decisions of hitherto un-interpretable systems. Typically, explanations involve an answer to structured questions of the form ‘Why P?’, where $P$ refers to any decision. For instance, in recognition algorithms, $P$ refers to the predicted class. In image quality assessment, $P$ refers to the estimated quality. Why-questions are generally thought to be causal-like in their explanations. In this paper, we refer to them as visual causal explanations for simplicity. Note that these visual causal explanations do not allow causal inference in the formal statistical sense.
Consider the example shown in Fig. 1, where we classify between two birds: a spoonbill and a flamingo. Given a spoonbill, a trained neural network classifies the input correctly as a spoonbill. A visual explanation of its decision generally takes the form of a heat map overlaid on the image. In the visual explanations shown in Fig. 1, the red regions answer the posed question. If the posed question takes the form of ‘Why Spoonbill?’, then the regions corresponding to the body shape and color of the spoonbill are highlighted. Such an explanation is based on features that describe a spoonbill irrespective of the context. If the posed question were instead ‘Why Spoonbill, rather than Flamingo?’, then the visual explanation points to the most contrastive features between the two birds, which in this case is the neck of the spoonbill. Flamingos have a longer S-shaped neck that is not prevalent in spoonbills. The answers to such ‘Why P, rather than Q?’ questions are contrastive explanations, where $Q$ is the contrast.
The question ‘Why P, rather than Q?’ provides context to the answer and hence relevance. In some cases, such context can be more descriptive for interpretability. For instance, in autonomous driving applications that recognize traffic signs, knowing why a particular traffic sign was chosen over another is informative when analyzing decisions after an accident. Similarly, in image quality assessment, where an algorithm predicts the score of an image as 0.25, knowing ‘Why 0.25, rather than 0.5?’ or ‘Why 0.25, rather than 1?’ can help analyze both the image and the method itself. In applications like seismic analysis, where geophysicists interpret subsurface images, visualizing ‘Why fault, rather than salt dome?’ can help evaluate the model, thereby increasing trust in such systems. In this paper, we set the framework for contrastive explanations in neural networks. More specifically, we modify existing ‘Why P?’ explanatory systems like Grad-CAM to obtain contrastive explanations in Section 3. We show its usage in varied applications in Section 4. We then conclude in Section 5.
2 Background and Related Works
We propose to constrain ‘Why P?’ explanatory techniques by providing them context and relevance to obtain ‘Why P, rather than Q?’ techniques. In this section, we lay the mathematical foundations of neural networks and describe the existing ‘Why P?’ techniques.
Background: Consider an $L$-layered classification network $f(\cdot)$, trained to differentiate between $N$ classes. Given an input image $x$, $f(x)$ provides a probability score $y$ of dimensions $N \times 1$, where each element in $y$ corresponds to the probability of $x$ belonging to one of the $N$ classes. The predicted class $P$ of image $x$ is the index of the maximum element in $y$, i.e. $P = \arg\max_i y_i$. During training, an empirical loss $J(P, P', \theta)$ is minimized, where $P'$ is the ground truth and $\theta$ denotes the network weight and bias parameters. Backpropagation minimizes the loss by traversing the parameter space using the gradients $\partial J / \partial \theta$. These gradients represent the change in the network required to predict $P'$ instead of $P$. Note that for a regression network $f(\cdot)$, the above mathematical foundations remain consistent, with $P$ and $P'$ being continuous rather than discrete.
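The setup above can be made concrete with a small sketch. It uses a single linear layer with a softmax as a hypothetical stand-in for the network, and NumPy rather than a deep learning framework; the sizes, the random parameters, and the label `P_true` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def predict(W, b, x):
    """Return the probability vector y and the predicted class P = argmax_i y_i."""
    y = softmax(W @ x + b)
    return y, int(np.argmax(y))

def cross_entropy(y, target):
    """Empirical loss J for a single sample whose target class is `target`."""
    return float(-np.log(y[target]))

# Hypothetical toy setup: N = 3 classes, 4-dimensional input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)

y, P = predict(W, b, x)
P_true = 2  # assumed ground-truth label P'
loss = cross_entropy(y, P_true)
print(f"P = {P}, J(P, P', theta) = {loss:.4f}")
```

Training would repeatedly step the parameters against the gradient of this loss; the contrastive formulation later in the paper reuses the same loss with a different target.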
Why P? Explanations: A number of proposed techniques attempt to visually explain ‘Why P?’. One line of work backpropagates the class probabilities and shows that the obtained gradients are descriptive of the class. It also shows that notions of all classes are learned in the network. Grad-CAM localizes the ‘Why P?’ parts of the image by backpropagating a class-weighted one-hot vector and using the averaged resultant gradients as importance weights on the activation maps produced by the input image. In this paper, we combine the gradient’s role as a loss minimizer in backpropagation, the existence of notions of classes within neural networks, and Grad-CAM’s importance-weighting of activation maps to obtain our contrastive explanations. A related approach tackles contrast through counterfactuals: it changes the input image, based on a distractor image, to change its prediction from $P$ to $Q$. In this paper, we use the existence of notions of classes to provide contrastive explanations without the need for changes to the input images.
3 Contrastive Explanation Generation
We define contrast and provide a methodology to generate it from neural networks. We embed contrast in existing ‘Why P?’ explanations, specifically Grad-CAM, to obtain contrastive explanations.
3.1 Contrast in Neural Networks
In visual space, we define contrast as the perceived difference between two known quantities. In this paper, we assume that the knowledge of the two quantities is provided by a neural network. For instance, in Fig. 1, a neural network is trained to recognize both spoonbills and flamingos as separate classes. Thus, the network has access to the discriminative knowledge that separates the two classes. This knowledge is stored in the network’s weight and bias parameters, termed $W$ and $b$ respectively. These parameters span a manifold where the given image $x$ belongs to one of the learned classes. A toy classification example is shown in Fig. 2, where a learned manifold is visualized in blue. On the learned manifold, a spoonbill is classified as a spoonbill. A hypothetical contrastive manifold is shown in purple; it differs from the blue manifold in that it recognizes a spoonbill as a flamingo. The same figure holds for regression networks, where the manifolds exist in a continuous space rather than a discrete space. In terms of the neural network representation space, contrast is the difference between the manifolds that predict $x$ as $P$ and $x$ as $Q$. In this paper, instead of directly measuring the difference between the learned and contrastive manifolds, we measure the change required to obtain the contrastive manifold from the learned manifold. We use gradients to measure this change. The use of gradients to characterize model change is not new. Prior work used gradients with respect to weights to characterize distortions for sparse and variational autoencoders. Fisher Vectors use gradients to characterize the change that data creates within networks, and were later extended to classify images.
We extract contrast for class $Q$ when an image $x$ is predicted as $P$ by backpropagating a loss between $P$ and $Q$. Hence, for a loss function $J(\cdot)$, contrast is proportional to $J(P, Q, \theta)$, where $\theta$ are the network parameters. For a contrastive class $Q$, contrast is $\partial J(P, Q, \theta) / \partial \theta$. Note that $J(\cdot)$ is a measure of contrastivity between $P$ and $Q$. In this paper, we choose $J(\cdot)$ to be cross-entropy for recognition networks and mean square error for regression networks. The contrastive class $Q$ can be any one of the learned classes, i.e. $Q \in [1, N]$. Moreover, if $f(\cdot)$ is a regression network, as in image quality assessment, $Q$ can take on any value in the range of $f(\cdot)$.
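As a rough illustration of the contrast defined above, the sketch below computes the gradient of a cross-entropy loss between the prediction and a contrast class $Q$ with respect to the parameters. It assumes a single linear layer with softmax as a stand-in for the network and uses the closed-form gradient in NumPy; a deep network would obtain the same quantity through automatic differentiation. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def contrast_gradients(W, b, x, Q):
    """Gradients of the cross-entropy J(P, Q, theta) with respect to theta = (W, b)."""
    y = softmax(W @ x + b)
    target = np.zeros_like(y)
    target[Q] = 1.0                       # one-hot vector for the contrast class Q
    d_logits = y - target                 # dJ/d(logits) for softmax followed by cross-entropy
    return np.outer(d_logits, x), d_logits  # dJ/dW, dJ/db

# Hypothetical toy setup: 3 classes, 4-dimensional input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)

P = int(np.argmax(softmax(W @ x + b)))   # predicted class
Q = (P + 1) % 3                          # any learned class can serve as the contrast
dW_contrast, db_contrast = contrast_gradients(W, b, x, Q)
dW_same, db_same = contrast_gradients(W, b, x, P)  # Q = P: ordinary training-style gradients
```

Setting `Q` equal to `P` recovers the gradient that would push the model to predict `P` more confidently, while any other `Q` gives the change needed to move the prediction toward the contrast class.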
3.2 Contrastive Explanations
For a network $f(\cdot)$ that predicts $P$ for a given image, the gradients obtained from Sec. 3.1 represent ‘Why P, rather than Q?’. They provide the difference between the predicted class or variable $P$ and the network’s notion of class or variable $Q$. These features can now be plugged into any ‘Why P?’ based method to obtain visual explanations. In this paper, we use Grad-CAM to showcase our contrastive explanations. Essentially, the obtained contrastive gradients are backpropagated to the last convolutional layer to obtain $K$ gradient maps, where $K$ is the number of channels in that layer. These gradients are average pooled and the obtained $K \times 1$ vector is used as importance weights across the activation maps in the last convolutional layer. The weighted activation maps are mean pooled and resized back to the original image dimensions to obtain contrastive masks. The contrastive masks are then overlaid on the image as heat maps.
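The pooling and weighting steps described above can be sketched as follows, assuming the activation maps and the contrastive gradient maps of the last convolutional layer are already available as arrays. The ReLU is borrowed from Grad-CAM and the nearest-neighbour resize is a simplification; neither is stated in the text, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def contrastive_mask(activations, gradients, out_hw, apply_relu=True):
    """Pool K gradient maps into channel weights and weight the K activation maps.

    activations, gradients: arrays of shape (K, H, W) from the last convolutional layer.
    out_hw: (height, width) of the original image.
    """
    weights = gradients.mean(axis=(1, 2))              # (K,) average-pooled gradients
    cam = np.tensordot(weights, activations, axes=1)   # (H, W) weighted sum over channels
    cam = cam / activations.shape[0]                   # mean over the K channels
    if apply_relu:                                     # assumption borrowed from Grad-CAM
        cam = np.maximum(cam, 0.0)
    h, w = cam.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]       # nearest-neighbour resize (assumption)
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return cam[rows][:, cols]

# Hypothetical 2-channel, 2x2 feature maps.
A = np.zeros((2, 2, 2))
A[0, 0, 0] = 1.0
A[1, 1, 1] = 1.0
G = np.stack([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
mask = contrastive_mask(A, G, out_hw=(4, 4))
```

In this toy case the first channel receives a positive weight and the second a negative one, so only the top-left region survives in the mask. A practical implementation would likely use bilinear interpolation for the resize.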
4 Applications
In this section, we consider two applications: recognition and image quality assessment (IQA). Visualizing contrast between classes is instructive in interpreting whether a network has truly learned the differences between classes in recognition. In IQA, visualizing contrast can help both localize the exact regions of quality degradation and quantify the degradation based on scores. For recognition, $P$ and $Q$ are discrete classes, while for image quality assessment, $P$ and $Q$ are continuous and can take values in $[0, 1]$.
4.1 Recognition
Experiment: In this section, we consider contrastive explanations on large-scale and fine-grained recognition. Large-scale datasets, like ImageNet, consist of a wide variety of classes. Fine-grained recognition is the subordinate categorization of similar objects, such as different types of birds or cars. We consider the Stanford Cars, subsurface-imaging-based LANDMASS, and traffic sign recognition CURE-TSR [29, 27, 28] datasets for fine-grained recognition. We use PyTorch’s ImageNet-pretrained models, including AlexNet, SqueezeNet, VGG, ResNet, and DenseNet variants, to obtain contrastive explanations on the ImageNet, Stanford Cars, and LANDMASS datasets. Specifically, on Stanford Cars and LANDMASS, we replace and train the final fully connected layer with the requisite number of classes for each dataset. For CURE-TSR, we use the trained network provided by the dataset’s authors to extract contrastive explanations. The results on randomly selected images from the fine-grained datasets and on the cat-dog image used in the Grad-CAM paper are shown in Fig. 3. Similar to Grad-CAM, we show results on VGG-16. Note that the contrastive explanations are a function of the network $f(\cdot)$. Hence, different networks can be ranked based on how good their explanations are. However, in this paper, we focus only on demonstrating the need for and the descriptive capability of contrastive explanations.
Analysis: ImageNet has 1000 classes. Hence, every image can be contrasted against a wide range of classes. This creates potential contrastive questions like ‘Why bull-mastiff, rather than golf ball?’. The potential visual explanation to such a question lies in the face of the dog. Similarly, when asked ‘Why bull-mastiff, rather than minibus?’, the potential visual explanation is again the face of the dog. Hence, for an image of a given class, the contrastive explanations against a majority of the other classes are the same. This is illustrated in row 1 of Fig. 3. The Grad-CAM image of an input predicted as bull-mastiff is shown in Fig. 3b. The face of the dog is highlighted. We calculate all contrastive maps of the input image and show their variance and mean images in Fig. 3f and Fig. 3g, respectively. The variance map is scaled up for readability. For classes that are visually similar to bull-mastiff, such as boxer, the contrastive explanations indicate the semantic differences between the two classes. This is illustrated in Fig. 3d, where the contrastive explanations show that there is a difference in the snout between the expected image and the network’s notion of a boxer, illustrated in Fig. 3c. When $Q$ is the same as $P$, the contrastivity reduces to $J(P, P, \theta)$. This is the same as the loss function during backpropagation, and hence the contrastive gradients act as training gradients in that their purpose is to confidently predict $P$. Hence, the contrastive explanation in this case highlights those regions in the image that limit the network from predicting $P$ with confidence. This is shown in Fig. 3e, where the cat confuses the network and is highlighted in red.
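The mean and variance images mentioned above amount to per-pixel statistics over a stack of contrastive maps. A minimal sketch, assuming a hypothetical callable `explain(q)` that returns the contrastive map for contrast class `q`, and assuming the predicted class itself is excluded from the stack (the text does not say whether it is):

```python
import numpy as np

def contrast_statistics(explain, num_classes, predicted):
    """Per-pixel mean and variance over the contrastive maps of all classes q != predicted."""
    maps = np.stack([explain(q) for q in range(num_classes) if q != predicted])
    return maps.mean(axis=0), maps.var(axis=0)

# Hypothetical stand-in for a real explanation method: a constant 2x2 map per class.
def explain(q):
    return np.full((2, 2), float(q))

mean_map, var_map = contrast_statistics(explain, num_classes=4, predicted=1)
```

A low-variance region in such a map indicates that most contrast classes yield the same explanation there, which is the behavior the analysis describes for visually dissimilar classes.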
We show results on three fine-grained recognition datasets in rows 2 to 4 of Fig. 3. Note that the contrastive explanations in this case are descriptive of the differences between classes. Representative images from similar classes are shown in Fig. 3c and their corresponding contrastive explanations are visualized in Fig. 3d. The contrastive explanations track the fault when asked to contrast with a salt dome (Row 2, Fig. 3d), highlight the missing bottom part of the arrow in the no-right-turn image (Row 3, Fig. 3d), and highlight the roof when differentiating between the convertible and the coupe (Row 4, Fig. 3d). Explanations against other random classes are also shown. The input Bugatti Veyron’s sloping hood is sufficiently different from the boxy hoods of both the Audi and the Volvo that it is highlighted.
4.2 Image Quality Assessment
Experiment: Image Quality Assessment (IQA) is the objective estimation of the subjective quality of an image. In this section, we analyze a trained end-to-end full-reference metric, DIQaM-FR. Given a pristine image and its distorted version, the pretrained DIQaM-FR network provides a quality score, $P$, for the distorted image. We then use the MSE loss function as $J(\cdot)$ and a real number $Q$ to calculate the contrastive gradients. Contrastive explanations for several values of $Q$, along with Grad-CAM results, are shown in Fig. 4. Both the lighthouse and flower images are distorted using lossy compression and are taken from the TID2013 dataset. Note that the network analyzes the images patchwise. To avoid filtering out the results of individual patches, we visualize un-normalized results in Fig. 4. In this implementation, while the original method takes non-overlapping patches to estimate quality, we use overlapping patches to obtain smoother visualizations. Note that the green and red colored regions within the images are the explanations to the contrastive questions shown below each image.
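For the regression case, the contrastive gradient follows directly from the MSE loss between the estimated score and the contrast score. The sketch below uses a hypothetical linear regressor as a stand-in for the quality network, with parameters chosen only so that the estimate is close to 0.58; it is meant to show how the sign of the gradient flips depending on whether the contrast score is above or below the estimate.

```python
import numpy as np

def mse_contrast_gradient(w, b, x, Q):
    """Gradient of J = (P - Q)^2 with respect to (w, b) for a linear regressor P = w.x + b."""
    P = float(w @ x + b)
    dJ_dP = 2.0 * (P - Q)          # derivative of the squared error with respect to P
    return P, dJ_dP * x, dJ_dP     # estimate P, dJ/dw, dJ/db

# Hypothetical parameters and features giving an estimate of about 0.58.
w = np.array([0.2, 0.1])
b = 0.0
x = np.array([2.0, 1.8])

P, grad_w_high, grad_b_high = mse_contrast_gradient(w, b, x, Q=1.0)   # contrast above P
_, grad_w_low, grad_b_low = mse_contrast_gradient(w, b, x, Q=0.25)    # contrast below P
```

Because the two cases produce gradients of opposite sign, it is natural to visualize them with different color maps, as the figures in this section do.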
Analysis: Fig. 4b shows that Grad-CAM essentially highlights the entire image. This indicates that the network estimates the quality based on the whole image. However, the contrastive explanations indicate where in the image the network assigns quality scores. Fig. 4c and Fig. 4d show the regions within the distorted images that prevent the network from estimating the contrast score $Q$. According to the obtained visualizations, the estimated quality is primarily due to the distortions within the foreground portions of the image as opposed to the background. This is in line with previous works in IQA, which argue that distortions in the more salient foreground or edge features cause a larger drop in perceptual quality than distortions in color or background. Also, the results when $Q = 1$ are different from their $Q = 0.75$ counterparts. For instance, the network estimates the quality of the distorted lighthouse image to be 0.58. The results in ‘Why 0.58, rather than 0.75?’ show that the distortion in the lighthouse decreases the quality from 0.75 to 0.58. Similarly, the results from ‘Why 0.58, rather than 1?’ show that because of distortions in the lighthouse as well as the cliff and parts of the background sky, the estimation is 0.58 rather than 1. These results help us further understand the notion of perceptual quality within the network. When $Q < P$, the contrastive explanations describe why a higher rating is chosen. It can be seen that the network considers both the foreground and background to estimate a higher quality than $Q$. We intentionally choose different visualization color maps for $Q > P$ and $Q < P$ to effectively analyze these scenarios.
5 Conclusion
In this paper, we formalized the structure of contrastive explanations. We also provided a methodology to extract contrasts from networks and use them as plug-in techniques on top of existing visual explanatory methods. We demonstrated the use of contrastive explanations in fine-grained recognition to differentiate between subordinate classes. We also demonstrated the ability of contrastive explanations to analyze distorted data and provide answers to contrastive questions in image quality assessment.
-  (2015) A curvelet-based distance measure for seismic images. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 4200–4204. Cited by: §4.1.
-  (2018-01) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27 (1), pp. 206–219. Cited by: §4.2.
-  (2013) Seven challenges in image quality assessment: past, present, and future research. ISRN Signal Processing 2013. Cited by: §4.2.
-  (2019) Counterfactual visual explanations. arXiv preprint arXiv:1904.07451. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
-  (1948) Studies in the logic of explanation. Philosophy of science 15 (2), pp. 135–175. Cited by: §1.
-  (2016) Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Cited by: §1.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.1.
-  (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §4.1.
-  (1999) Exploiting generative models in discriminative classifiers. In Advances in neural information processing systems, pp. 487–493. Cited by: §3.1.
-  (2006) Explanation and understanding. Annu. Rev. Psychol. 57, pp. 227–254. Cited by: §1.
-  (1962) Scientific explanation. Vol. 13, U of Minnesota Press. Cited by: §1.
-  (1988) An approach to why-questions. Synthese 74 (2), pp. 191–206. Cited by: §1.
-  (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §4.1.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.
-  (2019) Distorted representation space characterization through backpropagated gradients. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 2651–2655. Cited by: §3.1.
-  (2001) Theories of explanation. Cited by: §1.
-  (2009) Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §1.
-  (2015) Image database tid2013: peculiarities, results and perspectives. Signal Processing: Image Communication 30, pp. 57–77. Cited by: §4.2.
-  (2017) Ms-unique: multi-model and sharpness-weighted unsupervised image quality estimation. Electronic Imaging 2017 (12), pp. 30–35. Cited by: §4.2.
-  (1986) Learning representations by back-propagating errors. nature 323 (6088), pp. 533–536. Cited by: §2.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §4.1.
-  (2013) Image classification with the fisher vector: theory and practice. International journal of computer vision 105 (3), pp. 222–245. Cited by: §3.1.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §2, §3.2, §4.1.
-  (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
-  (2018) Traffic signs in the wild: highlights from the IEEE video and image processing cup 2017 student competition [sp competitions]. IEEE Sig. Proc. Mag. 35 (2), pp. 154–161. Cited by: §4.1.
-  (2019) Traffic sign detection under challenging conditions: a deeper look into performance variations and spectral characteristics. IEEE Transactions on Intelligent Transportation Systems, pp. 1–11. Cited by: §4.1.
-  (2017) CURE-tsr: challenging unreal and real environments for traffic sign recognition. arXiv preprint arXiv:1712.02463. Cited by: §4.1.
-  (2016) UNIQUE: unsupervised image quality estimation. IEEE signal processing letters 23 (10), pp. 1414–1418. Cited by: §4.2.
-  (2014) Functional explaining: a new approach to the philosophy of explanation. Synthese 191 (14), pp. 3367–3391. Cited by: §1.
-  (2012) Unsupervised template learning for fine-grained object recognition. In Advances in neural information processing systems, pp. 3122–3130. Cited by: §4.1.