Deep learning approaches, which form part of the family of methods today called Artificial Intelligence (AI), have become indispensable for a wide range of applications requiring the analysis and understanding of large amounts of data. They produce promising results that can outperform human capabilities in various decision tasks related to visual content classification and understanding, such as face detection, object detection and segmentation, image denoising, and video-based tasks like sports action recognition and saliency detection, among others. The success of deep learning-based systems in these tasks has also paved the way for their application to a variety of medical diagnosis tasks, such as cancer detection and Alzheimer's disease detection on different imaging modalities, to name a few. Along with the usefulness of these tools, however, the trustworthiness and reliability of such systems are also being questioned.
Though the results of deep learning models have been exemplary, they are not perfect: they can produce errors, are sensitive to noise in the data and often lack the transparency needed to verify the decisions they make. A specific example relates to the visual task of object classification from an image. The study by Ribeiro et al. showed that a trained network performing supervised image classification used the presence of snow as the distinguishing feature between the ”Wolf” and ”Husky” classes present in the dataset. Such limitations raise ethical and reliability concerns that need to be addressed before such systems can be deployed and adopted on a wider scale. The objective of explainable AI/deep learning is to design and develop methods that can be used to understand how these systems produce their decisions.
The behaviour described in the case of the wolf/husky classification has been termed the problem of a trained classifier behaving like a ”Clever Hans” predictor. Explanation methods aid in unmasking such spurious correlations and biases in the model or data, and also in understanding the failure cases of the system. If we can comprehend the reasoning behind the decision of a model, it could also help uncover previously unobserved associations, which could inform future research directions. It is important to mention that explainability focuses on the attribution of the output to the input. It does not address the causality of the features or factors that have led to a decision. That is, explainers are only correlation-based (input-output) and do not make causal inferences.
The current study focuses on the task of supervised image classification using specific deep neural network architectures, i.e. Convolutional Neural Networks (CNNs). CNNs have become one of the most successful architectures for AI tasks relating to images, and hence the methods presented in the subsequent sections focus on finding the relation between the predicted output classes and the input features, i.e. the pixels of the image. In the remainder of the paper, the taxonomy of the methods is presented in Sec. 2 and the detailed problem definition in Sec. 3. The following sections present the different explanation methods in detail, and Sec. 7 provides the analyses and discussions of the methods covered in this study.
2 Taxonomy of Explanation Methods for Image Classification Tasks
In the book by Samek et al., the authors present recent trends in research on explainable AI and some directions for future exploration. They present a taxonomy of the various explanation methods, such as meta-explainers, surrogate/sampling-based, occlusion-based and propagation-based methods, to name a few. However, with the addition of newer methods and their adaptation to different types of neural networks and datasets, we propose to update the taxonomy based on the domain to which the methods are applied and their inherent design. Comparing recent studies, two major types of explanation methods exist: i) black-box methods and ii) white-box methods. In this review, for both cases, we mainly focus on the explanations of decisions of trained Deep Neural Network (DNN) classifiers. This means that the methods we review explain the decision of the network for each sample of the data, which is why they are called ”sample-based” methods. In the following, we briefly explain the ”black-box” methods and then focus on ”white-box” methods in image classification tasks.
2.1 Black Box Methods
Black box refers to an opaque system: the internal functioning of the model is kept hidden or is not accessible to the user. Only the input and the output of the system are accessible, and such methods are termed black-box methods as they are model-agnostic.
There are multiple ways to examine what a black-box model has learned. A prominent group of methods focuses on explaining the model as a whole by approximating the black-box model, e.g. a neural network, with an inherently interpretable model. One such example is the use of decision trees. Decision trees are human-interpretable as the output is based on a sequence of decisions starting from the input data. To approximate a trained black-box network, Frosst et al. have used multiple input-output pairs generated by the network to train a soft decision tree that mimics the network's behaviour. Each inner node makes a binary decision and learns a filter w_i and a bias term b_i, and the probability of the right branch of the tree being selected is given by Eq. (1):

p_i(x) = σ(x w_i + b_i),    (1)

where σ is the sigmoid logistic function, x is the input and i is the current node. The leaf nodes learn a simple distribution over the different classes present in the dataset. This method can be qualified as a dataset-based explanation, as the decision tree is built for the whole dataset of input-output pairs.
Sample-based black-box methods deal with explaining a particular output of the model. These methods are not focused on understanding the internal logic of the model for all the classes as a whole, but are restricted to explaining the prediction for a single input. The Local Interpretable Model-agnostic Explanations (LIME) method is one such approach that derives explanations for individual predictions. It generates multiple perturbed samples of the input data together with the corresponding outputs of the black box, and trains an inherently interpretable model, such as a decision tree or a linear regressor, on these pairs to provide explanations.
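As an illustration, the LIME loop (perturb, query the black box, fit a weighted linear surrogate) can be sketched in a few lines of Python. The binary masking scheme, the proximity kernel and the toy black-box scorer below are our illustrative assumptions, not details prescribed by the method.

```python
import numpy as np

def lime_explain(black_box, x, n_samples=2000, seed=0):
    """Fit a local linear surrogate to a black-box scorer around input x.

    Perturbed samples are drawn by randomly switching features of x off
    (setting them to 0); the surrogate's coefficients then act as
    per-feature importance scores for this particular input."""
    rng = np.random.default_rng(seed)
    d = x.size
    masks = rng.integers(0, 2, size=(n_samples, d))      # 1 = keep feature
    samples = masks * x                                   # perturbed inputs
    scores = np.array([black_box(s) for s in samples])    # query the black box
    # weight samples by proximity to the original input (simple kernel)
    weights = np.exp(-(d - masks.sum(axis=1)) / d)
    X = np.hstack([masks, np.ones((n_samples, 1))])       # add an intercept
    W = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(X * W, scores * W.ravel(), rcond=None)
    return coef[:-1]                                      # per-feature importance

# toy black box: only features 0 and 2 matter
f = lambda v: 3.0 * v[0] - 2.0 * v[2]
print(lime_explain(f, np.ones(4)))  # largest magnitudes at indices 0 and 2
```

Because the toy black box is itself linear, the surrogate recovers its coefficients exactly; for a real network the surrogate is only locally faithful around x.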
Taking into account human understanding of visual scenes, such as attraction to meaningful objects in visual understanding tasks, for image classification the regions of the image where objects are present should have a higher contribution to the prediction. Based on this logic, some methods occlude different parts of the image iteratively using a sliding-window mask. Figure 1 illustrates how a grey-valued window is slid across the image to occlude its different parts. By observing the change in the prediction of the classifier when different regions are hidden, the importance of each region for the final decision is calculated.
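A minimal sketch of this occlusion procedure is given below; the stand-in scoring function, window size and grey fill value are our illustrative assumptions rather than parameters fixed by any particular method.

```python
import numpy as np

def occlusion_map(score_fn, image, window=3, fill=0.5, stride=1):
    """Importance of each window position = drop in the classifier score
    when that region is replaced by a constant (grey) value."""
    h, w = image.shape
    base = score_fn(image)
    heat = np.zeros((h - window + 1, w - window + 1))
    for i in range(0, h - window + 1, stride):
        for j in range(0, w - window + 1, stride):
            occluded = image.copy()
            occluded[i:i + window, j:j + window] = fill  # grey mask
            heat[i, j] = base - score_fn(occluded)       # score drop
    return heat

# toy "classifier": responds to the brightness of the top-left 3x3 patch
score = lambda img: img[:3, :3].sum()
img = np.zeros((8, 8)); img[:3, :3] = 1.0
heat = occlusion_map(score, img)
# the largest score drop occurs when the window fully covers the patch
```

The map is coarse (one value per window position), which is why the window size and stride trade off resolution against the number of forward passes.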
Fong et al. also build explanations of which region contributes most to the DNN decision by masking. Instead of using a constant grey-value mask, they formulate the explanation as a search for the minimal mask that changes the classification score for the given image the most. The mask applies a meaningful transformation that models the image acquisition process, such as blurring. They find the mask by minimizing the expectation of the output classification score of the network on the image perturbed with the blurred mask. Instead of using a single mask to perform the search, they apply the perturbation mask stochastically to the image. In addition, L1 and total-variation regularization are used to ensure that the final mask deletes the smallest subset of the image and has a regular structure.
Nevertheless, these methods only help identify whether the network is predicting based on a non-intuitive region of the image. The explanations are not useful for identifying which layers or filters in a DNN classifier cause these spurious correlations between input image regions and the prediction, and thus they cannot be used to improve the network performance. Hence the white-box methods, which allow for analyzing the internal layers of the network, are more interesting.
2.2 White Box Methods
The term ”white box” implies a transparent box, symbolizing the ability to see into the inner workings of the model, i.e. its architecture and parameters. Due to extensive research on DNNs, they are no longer unknown architectures, and studies like that of Yosinski et al. have been able to show the types of features that are learnt at the different layers of a DNN. Therefore, multiple methods aim to exploit the available knowledge of the network itself to create a better understanding of the prediction and of the internal logic of the network, thus allowing for further optimization of the architecture and hyperparameters of the model.
In this review work, we propose to deal with the specific case of deep neural network classifiers such as convolutional neural networks (CNNs). We propose the following taxonomy for existing ”white-box” methods based on the approach used for generating explanations: i) methods based on linearization of the deep CNN, ii) methods based on the network structure, iii) methods based on an adversarial approach. Given the rapidly expanding research in the field, we do not claim our taxonomy to be complete, but believe it addresses the main trends.
3 Problem Definition
This section provides the basic terminology and the definitions required to understand the type of network that we will be focusing on, the notations used and how the results are to be visualized.
3.1 Network Definition
The problem under consideration is the image classification task. To define the task, first consider a convolutional neural network (CNN). A simple AlexNet-like CNN is illustrated in Fig. 2. The network consists of a series of convolutional layers, a non-linear activation layer and a pooling layer that together form the convolution (conv) block, as illustrated in Fig. 3. The conv blocks are followed by fully connected (FC) layers, which are simple feed-forward neural networks. Non-linear activations such as the Rectified Linear Unit (ReLU(x) = max(0, x)) and max pooling are the most commonly used when building CNN classifiers. The last layer of the network has the same dimension as the number of classes in the problem; in the example in Fig. 2 it is 10, implying there are 10 categories of objects to recognize.
3.2 Notations
Consider a CNN that takes as input an image x of size H × W, expressed as x ∈ R^{H×W}, and whose classification output is a C-dimensional vector. Here C represents the number of classes and the image x represents the input features of the network. The score S_c(x) is the output classification score of the image x for the class c. The network thus models a mapping f : R^{H×W} → R^C. The output score vector is usually normalized to approximate a probability, thus each S_c is restricted to the interval [0, 1] and the score vector sums to 1.
The problem of explanation consists in assigning, to each pixel x_{ij}, an importance score with respect to its contribution to the output S_c; in other words, in producing a relevance score map R over the pixels and/or the features of the internal convolutional layers of the network with respect to the output S_c. The class c can either be the correct label class or a different class, in which case the map can be used to analyze the cause of that classification.
To ”explain” pixel importance to the user, a visualization of the scores in R is usually performed by computing ”saliency/heat maps” and superimposing them on the original image.
3.3 Saliency/Heat Maps
A saliency/heat map is a visualization of the relevance score map R using colour look-up tables (LUTs), which map the values onto a colour scale from blue to red, as illustrated in Fig. 4. This form of visualization is necessary for the user to understand and glean insights from the results of the explanation methods. In the current illustration, we have used the ”jet” colour map, which has a linear transition with the maximum value mapping to red, the middle to yellow-green, and the lowest to blue. Other LUTs can also be used for the visualization of heat maps, but we have chosen ”jet” as it is one of the more popular colour maps and is intuitively understandable for a human observer.
Given this kind of network classifier and problem formulation, several methods have been proposed that can be employed for the visualization of relevance score maps given a particular image.
4 Methods based on Linearization of the Deep-CNN
A (convolutional) neural network is a non-linear classifier. It can be defined as a mapping f from the input (feature) space R^{H×W} to the output score space R^C. The methods based on the linearization of a CNN produce explanations by approximating the non-linear mapping f. One of the commonly used approximations is the linear approximation of Eq. (3):

S_c(x) ≈ w^T x + b,    (3)

where w are the weights, x is the input and b is the bias of the approximation. Different methods employ different ways to calculate the weight and bias parameters of this network approximation and thus produce different explanations.
4.1 Deconvolution Network based method
The Deconvolution Network (DeconvNet) proposed by Zeiler et al. is a network that reverses the mapping of a CNN: it builds a mapping from the output score space back to the space of the input pixels. It does not require retraining and directly uses the learned filters of the CNN. Starting with the input image x, a full forward pass through the CNN is done to compute the feature activations throughout the layers. To visualize the features of a particular layer, the corresponding feature maps from that CNN layer are passed to the DeconvNet. In the DeconvNet, three steps, i) unpooling, ii) rectification and iii) filtering, are applied at each layer iteratively until the input feature layer is reached.
Unpooling: The max-pooling operation in a CNN is non-invertible. Hence, to reverse it, during the forward pass through the network the locations of the maxima at each layer are saved to a matrix called the max-location switches. During unpooling, the values from the previous layer are mapped only to the locations of the maxima, and the remaining positions are assigned 0.
Rectification: In the CNN, the feature maps pass through ReLU non-linearities during the forward pass, so valid feature activations are non-negative. To obtain valid feature reconstructions at each layer, the DeconvNet likewise passes the unpooled maps through a ReLU, retaining only their positive values.
Filtering: This operation is the inverse of the convolution in the forward pass. To achieve this, the DeconvNet convolves the rectified maps with a vertically and horizontally flipped version of the filter learned by that layer of the CNN. The authors show that filters thus defined from the learnt CNN filters are the deconvolution filters; we show the mathematical derivation of this in Appendix A.
Performing these operations iteratively from the layer of our choice down to the input pixel layer helps reconstruct the features of that layer that correspond to different regions of the input image x. The importance of pixels is then expressed with a heat map.
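The unpooling step with max-location switches can be sketched as follows; the 2×2 pooling size, the helper names and the toy input are our illustrative assumptions.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the argmax ('switch') locations."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros_like(x, dtype=bool)
    for i in range(0, h, k):
        for j in range(0, w, k):
            patch = x[i:i + k, j:j + k]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            pooled[i // k, j // k] = patch[r, c]
            switches[i + r, j + c] = True     # remember where the max came from
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Place each pooled value back at its recorded maximum location;
    every other position receives 0."""
    out = np.zeros(switches.shape)
    # nearest-neighbour upsample, then keep values only at the switches
    out[switches] = np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)[switches]
    return out
```

Running the pair on a small array shows that unpooling preserves the pooled values and their positions while zeroing everything else, which is exactly the approximate inverse the DeconvNet needs.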
4.2 Gradient Backpropagation
The gradient backpropagation method was proposed to explain the prediction of a model based on its locally evaluated gradient. The local gradient of the output classification score S_c with respect to the input, evaluated at a particular image x_0, is used to calculate the weight parameter w of Eq. (3). This means that the linear approximation of the non-linear mapping f is formulated as a first-order Taylor expansion of S_c in the vicinity of the particular image x_0. The weight parameters are thus calculated as in Eq. (4):

w = ∂S_c/∂x |_{x=x_0}.    (4)
The partial derivative of the output classification score S_c with respect to the input corresponds to the gradient computed by a single backpropagation pass for the particular input image x_0. It is equivalent to the backpropagation step performed during training, which usually operates on a batch of images; here, the notation of the gradient at x_0 indicates that the backpropagation is for just one image. Also, during the training of a CNN the backpropagation stops at the second layer of the network for efficiency, as the aim is not to change the input values. With this method, however, the backpropagation is performed down to the input layer to inspect which pixels affect the output the most.
The final heat-map relevance scores for a particular pixel (i, j) of the input 2D image are calculated as shown in Eq. (5) in the case of a grey-scale image:

R_{ij} = |w_{ij}|.    (5)

For an RGB image, the final map is calculated as the maximum absolute weight of that pixel over the weight matrices of the three channels, as shown in Eq. (6):

R_{ij} = max_c |w_{ij,c}|,    (6)

where c corresponds to the different channels of the image.
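The channel reduction of Eqs. (5) and (6) amounts to a one-line operation; the sketch below assumes the gradient tensor has already been computed by a backpropagation pass, and the toy values are ours.

```python
import numpy as np

def saliency_from_gradients(grad):
    """Collapse a gradient tensor into a 2-D saliency map.

    grad: array of shape (H, W) for a grey-scale image, or (C, H, W) for a
    multi-channel image. Following Eqs. (5)/(6): take the absolute value,
    and for multi-channel inputs the maximum over channels."""
    grad = np.abs(grad)
    return grad if grad.ndim == 2 else grad.max(axis=0)

g = np.array([[[0.1, -0.5], [0.2, 0.0]],     # channel R
              [[-0.3, 0.4], [0.1, 0.6]],     # channel G
              [[0.0, 0.1], [-0.7, 0.2]]])    # channel B
print(saliency_from_gradients(g))  # [[0.3 0.5] [0.7 0.6]]
```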
These gradients can also be used to perform a type of sensitivity analysis. The magnitude of the computed derivatives can be interpreted as indicating the input pixels to which the output classification is most sensitive: large gradient values correspond to the pixels that need to be changed the least to affect the final class score the most.
Simonyan et al. have also shown that the gradient backpropagation is a generalization of DeconvNet (Sec. 4.1). Indeed, this can be shown by comparing the three operations that DeconvNet performs with the gradient calculation.
Unpooling: During basic backpropagation at a max-pooling layer the gradients are backpropagated to only those positions that had the max values during the forward pass. This is exactly the same operation that is achieved by the use of the Max location switches matrix in the DeconvNet, see Sec. 4.1.
Rectification: For a CNN, the output X_l of a convolution layer passes through the ReLU activation as X_{l+1} = max(X_l, 0), where X_{l+1} is then the input of the next layer of the network. During gradient backpropagation, the rectification applied to the gradient map is based on the forward-pass input, i.e. the gradient is propagated only at the positions where X_l > 0. In the DeconvNet, by contrast, the rectification is applied to the unpooled maps R_l themselves and hence corresponds to the condition R_l > 0. Figures 5(b) and 5(c) show the difference in the calculation of the two maps resulting from this difference in the operations.
Filtering: As shown in Appendix A, the vertically and horizontally flipped filter used during the filtering step of the DeconvNet corresponds to the gradient of the convolution with respect to its input. This is the same step that the gradient backpropagation method performs, and hence this step is equivalent for the two methods.
Except for the rectification step, the two methods are equivalent in their calculations and therefore, the gradient backpropagation method can be seen as a generalization of the DeconvNet.
4.3 Guided Backpropagation
Computing a saliency map based on gradients gives an idea of the various input features (pixels) that have contributed to the neuron responses in the output layer of the network. The primary idea proposed by Springenberg et al. is to prevent the backpropagation of the negative gradients found in the deconvolution approach, as they decrease the activation of the higher-layer unit we aim to visualize. This is achieved by combining the rectification operations performed by the DeconvNet and by gradient backpropagation: as shown in Fig. 5(d), guided backpropagation restricts the flow of gradients that are negative during backpropagation, as well as of those at positions whose values were negative during the forward pass. This nullification of negative gradient values is called guidance. Using the guidance step results in sharper visualizations of the descriptive regions in the input image. Figure 6 shows the heat maps generated by the gradient backpropagation and guided backpropagation methods for a network with the ResNet34 architecture; it can be seen that the guidance step reduces the number of pixels with a high importance score and hence produces a slightly sharper visualization.
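The three rectification rules compared in Fig. 5 differ only in which positions of the backward signal are zeroed at a ReLU. The toy vectors below are our own illustration of the three gating conditions.

```python
import numpy as np

# forward-pass input of a ReLU layer and the gradient arriving from above
x = np.array([ 1.0, -2.0,  3.0, -0.5])   # pre-activation values
g = np.array([-1.0,  2.0,  0.5, -3.0])   # incoming backward signal

backprop  = g * (x > 0)             # plain gradient: gate on the forward input
deconvnet = g * (g > 0)             # DeconvNet: gate on the backward signal
guided    = g * (x > 0) * (g > 0)   # guided backprop: both conditions at once
```

Guided backpropagation keeps a value only where both the forward activation and the backward signal are positive, which is why its maps are the sparsest of the three.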
4.4 Integrated Gradients
Sundararajan et al. proposed to calculate the saliency map as an integration of gradients over a set of images created by transforming a baseline image x' into the input image x. They propose the baseline x' to be a black image; the series of images is then produced by the linear transformation x' + α(x − x'). Thus, if we denote by x_i the value of the i-th feature of the input x, Eq. (7) shows the calculation of the integrated gradients for the network with output classification score S_c for a class c:

IG_i(x) = (x_i − x'_i) ∫₀¹ ∂S_c(x' + α(x − x'))/∂x_i dα.    (7)

This forms the map of relevance scores R, with a corresponding score for each input pixel.
The parameter α varies in [0, 1], and the term inside the partial derivative goes from the baseline image x' to the final input x as we integrate over α, as shown in Fig. 7. In practice, the integral is approximated by a summation over a fixed number of samples, i.e. a Riemann approximation.
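This Riemann approximation of Eq. (7) can be sketched as follows; for the example we supply the gradient analytically for a toy score function S(x) = Σ x_i², which is our assumption, not a network.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Riemann (midpoint) approximation of the integral in Eq. (7).

    grad_fn(x) must return the gradient of the class score at x; here it
    is supplied analytically for a toy score function."""
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints of [0, 1]
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_fn(baseline + a * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad                   # per-feature attribution

# toy score S(x) = sum(x_i^2), so grad S = 2x; baseline = black image (zeros)
x = np.array([1.0, 2.0, 3.0])
ig = integrated_gradients(lambda v: 2 * v, x, np.zeros_like(x))
# completeness: the attributions sum to S(x) - S(baseline) = 14
```

For this quadratic toy score the attribution of each feature comes out as x_i², and their sum matches the score difference exactly, illustrating the completeness property of the method.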
The authors observed that for slight changes in the pixel values of an image, such that visually the image did not appear to have changed, the gradients calculated by gradient backpropagation showed large fluctuations in their values: for a small amount of noise in the image, the visualization produced by gradient backpropagation differed from that of the original image. They argue that, since integrated gradients perform an averaging operation over a series of images, the final relevance maps are less sensitive to these fluctuating gradient values compared with the other gradient-based methods.
4.5 SmoothGrad
An alternative method to circumvent the issue of noisy saliency maps, called SmoothGrad, was proposed by Smilkov et al. The idea of this method is to obtain a smoother map with sharper visualizations by averaging over multiple noisy maps. To achieve this, the authors propose to add small noise vectors, sampled from a Gaussian distribution N(0, σ²) with standard deviation σ, to the input image. Thus, they create n samples of the input image with a small amount of noise added to its pixels. The relevance score maps are calculated for each of these images, and the average of the generated maps gives the final relevance score map for the image, as shown in Eq. (8):

R_smooth = (1/n) Σ_{i=1}^{n} R(x + g_i),  g_i ∼ N(0, σ²).    (8)

SmoothGrad is not a standalone method; rather, it can be used as an extension of other gradient-based methods to reduce the visual noise of their saliency maps. The authors observe that adding about 10-20% noise to the sampled images produced sharper maps, i.e. the parameter σ was chosen such that σ/(x_max − x_min) was in the range of 0.1 to 0.2. Here, x_max and x_min refer to the maximum and minimum values of the pixels of the image.
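The averaging of Eq. (8) can be sketched as follows; the toy score function with gradient 2x, the sample count and the 15% noise level are our illustrative assumptions.

```python
import numpy as np

def smoothgrad(grad_fn, x, sigma, n=50, seed=0):
    """Average the gradient maps of n noisy copies of the input (Eq. (8))."""
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(x)
    for _ in range(n):
        acc += grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
    return acc / n

# toy score with gradient 2x; sigma set to 15% of the input value range,
# inside the 10-20% band suggested by the authors
x = np.linspace(0.0, 1.0, 8)
sigma = 0.15 * (x.max() - x.min())
sg = smoothgrad(lambda v: 2 * v, x, sigma)
```

For this toy linear gradient the noise averages out and the smoothed map stays close to the clean gradient 2x; for a real network the averaging suppresses the local gradient fluctuations instead.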
4.6 Gradient-weighted Class Activation Mapping (Grad-CAM)
Gradient-weighted Class Activation Mapping (Grad-CAM) is a post-hoc explanation via visualization of the class-discriminative activations of a network. Similar to gradient-based methods, Grad-CAM leverages the structure of the CNN to produce a heat map of the pixels of the input image that contribute to the prediction of a particular class.
A key observation that Grad-CAM relies on is that the deeper convolutional layers of a CNN act as high-level feature extractors, so the feature maps of the last convolution layer of the network contain the structural spatial information of the objects in the image. Therefore, instead of propagating the gradient all the way to the input layer like other gradient-based methods, Grad-CAM propagates it from the output only as far as the last convolutional layer of the network.
The feature maps of the last convolution layer cannot be used directly, as they contain information regarding all the classes present in the dataset. Assuming that the last convolution layer of the network has K feature maps A^k, the Grad-CAM method proposes to determine an importance value of each map for the class c predicted by the network. This value is calculated as the global average pooling of the gradient of the classification score S_c with respect to the activation values of that feature map. As shown in Eq. (9), α_k^c is the importance value of the feature map A^k, and there are K such weights:

α_k^c = (1/Z) Σ_i Σ_j ∂S_c/∂A^k_{ij},    (9)

where Z = h × w, and h and w correspond to the height and width of each feature map.
The weights α_k^c are then used to weight the corresponding feature maps, which are then summed. This gives the relevance score map, and a ReLU (rectification) function is applied to this map, see Eq. (10), to nullify the negative features and retain only the values that have a positive influence:

R = ReLU(Σ_k α_k^c A^k).    (10)

At this stage, the relevance map R is a 2-D map with the same spatial dimensions as the feature maps of the last convolution layer. To obtain a correspondence with the input image x, R is upsampled to the spatial dimensions of x using interpolation methods and scaled to the interval [0, 1] to visualize the final heat map.
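The computation of Eqs. (9) and (10), followed by upsampling, can be sketched as follows; the feature-map shapes, the random toy tensors and the nearest-neighbour upsampling are our illustrative assumptions (in practice smoother interpolation is used).

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps, grads: arrays of shape (K, h, w) holding the last conv
    layer's activations A^k and the gradients dS_c/dA^k."""
    alphas = grads.mean(axis=(1, 2))                  # Eq. (9): GAP of gradients
    cam = np.tensordot(alphas, feature_maps, axes=1)  # weighted sum over K maps
    return np.maximum(cam, 0.0)                       # Eq. (10): ReLU

def upsample_nn(cam, factor):
    """Nearest-neighbour upsampling of the coarse map to image resolution."""
    return np.kron(cam, np.ones((factor, factor)))

A = np.random.default_rng(0).random((4, 7, 7))    # toy activations, K = 4
dA = np.random.default_rng(1).random((4, 7, 7))   # toy gradients
heat = upsample_nn(grad_cam(A, dA), 32)           # 7x7 -> 224x224
```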
Grad-CAM is a generalization of the CAM method previously proposed by Zhou et al., which requires squeezing the feature maps of the last conv layer by global average pooling to form the input of the FC layer of the network; Grad-CAM, on the contrary, can be applied to all deep CNN architectures.
4.6.1 Guided Grad-CAM
The heat maps produced by Grad-CAM are coarse compared with those of the other gradient-based methods: as the feature maps of the last convolutional layer have a smaller resolution than the input image x, Grad-CAM maps lack the fine-grained details generally seen in other gradient-based methods. To refine the maps, a variant called Guided Grad-CAM has been proposed, which combines Grad-CAM and guided backpropagation by an element-wise multiplication of the two maps. The heat map obtained by this operation has been observed to have a higher resolution. We illustrate maps obtained by Grad-CAM and Guided Grad-CAM in Fig. 8.
5 Methods based on network structure
This category of methods integrates the architecture of the network while explaining the output. Starting from an output neuron, they employ different local redistribution rules to propagate the prediction to the input layer to obtain the relevance score maps. In this section, we present the details of the methods that belong to this category and differ in the rules that they use for the redistribution process.
5.1 Layer-wise Relevance Propagation (LRP)
Layer-wise Relevance Propagation (LRP) is an explanation method proposed by Bach et al. that explains the decision of a network for a particular image by redistributing the classification score for a class backwards through the network. The method does not use gradient calculations; instead, it defines the activation of the output neuron (of either the predicted class or another class under consideration) as the relevance value, together with a set of local rules for the redistribution of this relevance score backwards to the input, layer by layer. The first rule they propose is that of relevance conservation. Let the neurons in the different layers of the network be denoted by i, j, k, etc., and let S_c(x) be the classification score of the input image x for the class c. Then, according to the relevance conservation rule, the sum of the relevance scores of all the neurons in each layer is constant and equals S_c(x), as shown in Eq. (11):

Σ_i R_i = Σ_j R_j = Σ_k R_k = ... = S_c(x).    (11)
Let l and l+1 be two consecutive layers of the network, and let j and k denote neurons belonging to these layers respectively. The relevance of the neuron j can then be computed from the relevances R_k of the upper layer. If neuron j is connected to neuron k, it is assigned a share of the relevance value R_k weighted by the activation a_j of the neuron and the weight w_{jk} of the connection between the two neurons. Similarly, neuron j receives a relevance contribution from every neuron of the layer l+1 it is connected to. The sum of all the relevance contributions that neuron j receives from these neurons is the final relevance value assigned to it, as shown in Eq. (12):

R_j = Σ_k (a_j w_{jk} / Σ_{0,j} a_j w_{jk}) R_k.    (12)

The denominator term in Eq. (12) is the normalization value used to ensure the relevance conservation rule of Eq. (11). This rule is termed the LRP-0 rule.
In this equation, the summation in the denominator, denoted Σ_{0,j}, runs over all the neurons of the lower layer plus the bias neuron of the network. The activation of the bias neuron is taken as a_0 = 1, and the weight of its connection to neuron k is denoted w_{0k} = b_k. For the relevance propagation, the bias neuron is considered only in this term and is not considered elsewhere. Note that the authors propose these rules only for the specific case of rectifier networks, i.e. networks with ReLU as the non-linearity. The relevance of the output neuron is taken to be its activation before the Softmax layer.
Similarly, there exist a few other rules that improve on the LRP-0 rule for the propagation of relevance, as presented in the following list.
Epsilon rule (LRP-ε): To improve the stability of the LRP-0 rule, a small positive term ε is added to the denominator, as shown in Eq. (13):

R_j = Σ_k (a_j w_{jk} / (ε + Σ_{0,j} a_j w_{jk})) R_k.    (13)

The ε term also reduces the flow of relevance when the activation of the neuron is very small or the connection between the two neurons is weak. If the value of ε is increased, it helps ensure that only the stronger connections receive the redistributed relevance.
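The LRP-ε redistribution through a single dense layer can be sketched as follows; biases are omitted for brevity, and the positive toy activations and weights are our assumptions, chosen so that the denominators stay away from zero.

```python
import numpy as np

def lrp_eps_dense(a, W, R_upper, eps=1e-9):
    """LRP-epsilon redistribution through one dense layer (Eq. (13)),
    with the bias term omitted for simplicity.

    a: activations of the lower layer, shape (J,)
    W: weights, shape (J, K)
    R_upper: relevance of the upper-layer neurons, shape (K,)"""
    z = a @ W                      # total pre-activation of each upper neuron k
    s = R_upper / (z + eps)        # normalized relevance message per neuron k
    return a * (W @ s)             # R_j = a_j * sum_k w_jk * s_k

rng = np.random.default_rng(0)
a = rng.random(5)                  # lower-layer activations (kept positive)
W = rng.random((5, 3))             # positive weights keep z away from 0
R_up = np.array([0.2, 0.5, 0.3])
R_low = lrp_eps_dense(a, W, R_up)
# conservation: with negligible eps and no biases, sum(R_low) == sum(R_up)
```

The check at the end is the relevance conservation rule of Eq. (11) restricted to one layer: the small ε only leaks a negligible amount of relevance.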
LRP-γ: The parameter γ was introduced to favour the contributions of the connections that have a positive weight (w_{jk} > 0), as shown in Eq. (15):

R_j = Σ_k (a_j (w_{jk} + γ w_{jk}⁺) / Σ_{0,j} a_j (w_{jk} + γ w_{jk}⁺)) R_k,    (15)

where the function (·)⁺ = max(0, ·), so the neurons with a positive weight connection receive a higher relevance score during propagation.
LRP-αβ rule: Two parameters, α and β, are used to control separately the positive and negative contributions to the relevance propagation. The functions (·)⁺ = max(0, ·) and (·)⁻ = min(0, ·) select these contributions, and the parameters are constrained under the rule α − β = 1.
5.1.1 LRP as Deep Taylor Decomposition (DTD)
Montavon et al. propose a framework that connects a rule-based method like LRP with the Taylor decomposition method, as a way to theoretically justify the choice of the relevance propagation rules. They propose a method called Deep Taylor Decomposition (DTD), which treats LRP as consecutive Taylor expansions applied locally at each layer and neuron. The main idea that DTD uses is that a deep network can be written as a set of subfunctions relating the neurons of two consecutive layers. Instead of treating the whole network as a single function f, DTD expresses LRP as a series of mappings from the activations a of the neurons of one layer to the relevance R_k of a neuron k in the layer above.
The Taylor expansion of the relevance score R_k can be expressed as a function of the activations a at some root point ã in the space of the activations, as shown in Eq. (16):

R_k(a) = R_k(ã) + Σ_j (a_j − ã_j) · (∂R_k/∂a_j)|_{a=ã} + higher-order terms.    (16)
The first-order terms of this expansion can be used to determine how much of the relevance R_k is redistributed to the neurons of the lower layer. The main challenges in the computation of the Taylor expansion in this case are finding an appropriate root point ã and computing the local gradients.
To compute the function R_k(a), the authors propose to substitute it with a relevance model that is simpler to analyze. From the relevance propagation rules of LRP (Sec. 5.1), it can be seen that the relevance score of a neuron k can be written as a function of its activation as R_k = a_k c_k, where, in the case of the LRP-0 rule of Eq. (12), the term c_k can be treated as a constant. As the LRP rules are described for deep rectifier networks, the relevance function is expressed based on the ReLU activation as:

R_k(a) = max(0, Σ_{0,j} a_j w_{jk}) c_k.    (17)
A Taylor expansion of this relevance function gives Eq. (18):

R_k(a) = R_k(ã) + Σ_{0,j} (a_j − ã_j) · (∂R_k/∂a_j)|_{a=ã}.    (18)
Due to the linearity of the ReLU function on the domain of positive activations, the higher-order terms of the expansion are zero. The choice of the root point ã ensures that the zero-order term can be made small. The first-order term computation is fairly straightforward and identifies how much of the relevance value should be redistributed to the neurons of the lower layer. The different LRP rules presented previously can be derived from Eq. (18) based on the choice of the reference point ã. For instance, the LRP-0 rule shown in Eq. (12) can be derived by choosing ã = 0, and the LRP-ε rule shown in Eq. (13) by choosing the root point on the segment between the origin and the actual activations, at ã = a · ε/(ε + z_k) with z_k = Σ_{0,j} a_j w_{jk}.
5.2 Deep Learning Important FeaTures (DeepLIFT)
The primary idea of DeepLIFT, a method proposed by Shrikumar et al., is similar to that of the LRP method explained in Sec. 5.1. The major difference between the two methods is that DeepLIFT establishes the importance of the neurons in each layer in terms of the difference of their response to the input with respect to their response to a reference state. The reference state is either a default image or an image chosen based on domain-specific knowledge; it could be an image that has the specific property against which differences in the explanations are meant to be calculated. For example, it could be a black image in the case of the MNIST dataset, as the backgrounds of the images in that dataset are all black. DeepLIFT aims to explain the difference between the output produced by the input image and the output of the reference state, based on the difference between the input image and the chosen reference image.
Denoting the output classification score for the input image by t and the output score for the reference state by t_0, the difference term is defined as Δt = t − t_0. For a neuron i of a layer, the relevance (contribution) is denoted C_{Δx_i Δt}, and Δx_i denotes the difference between the activations of the neuron for the input image and for the reference state. Similar to the LRP method, DeepLIFT has a summation-to-delta rule: the sum of the relevances of the neurons at each layer is constant and equal to Δt, as shown in Eq. (19).
In order to explain the propagation rules, the authors define a term called the multiplier, m_{Δx Δt}, defined as the contribution of the difference between the reference and input activations to the difference in the output prediction, divided by Δx, as shown in Eq. (20).
The multiplier is thus similar to a partial derivative, but defined over finite differences. The authors also define a chain rule for multipliers, analogous to the chain rule used with derivatives, as shown in Eq. (21), where the intermediate terms correspond to the neurons in the layers between the neurons of interest and the output.
Similar to LRP, the authors also separate the relevance values into positive and negative terms, so that they can be treated differently if required. For each neuron, the positive component Δx⁺ and the negative component Δx⁻ can be found by grouping the positive and negative terms that contribute to the calculation of Δx. Based on this idea, the difference in neuron activations between the input image and the reference state, as well as the relevance contribution, can be decomposed as shown in Eq. (22).
Using these terms, DeepLIFT proposes three rules that can be applied to a network for different layers to propagate the relevance from the output to the input layer.
Linear Rule: The linear rule is applied to the FC and convolution layers (not to the non-linearity layers). Consider the linear function y = Σ_i w_i x_i + b, where y is the activation of a neuron in the next layer, the x_i are the activations of the neurons in the previous layer and the w_i are the weights of the connections. Taking differences with the activations of the reference-state neurons gives Δy = Σ_i w_i Δx_i. The relevance contribution is then C_{Δx_i Δy} = w_i Δx_i, and the multiplier in this case is m_{Δx_i Δy} = w_i.
Rescale Rule: The rescale rule is applied to layers with non-linearities like the ReLU. Consider a neuron y that is a non-linear transformation of its single input x, y = f(x); in the case of the ReLU (Eq. (2)), f(x) = max(0, x). By the summation-to-delta property, the relevance contribution is C_{Δx Δy} = Δy, as there is only one input. Hence the multiplier in this case is m_{Δx Δy} = Δy/Δx.
RevealCancel Rule: The RevealCancel rule treats the positive and negative contributions to the relevance values separately. The impacts of the positive and negative components of Δx, given as Δx⁺ and Δx⁻, on the components Δy⁺ and Δy⁻ of Δy are calculated separately. Instead of a straightforward calculation, the value of Δy⁺ is computed as the average of two terms. The first term is the impact of adding only Δx⁺ on the output of the non-linearity: with x_0 the value of the reference state at that neuron, this impact is calculated by comparing the function values with and without Δx⁺ on top of x_0. The second term computes the impact of Δx⁺ after the negative terms have been included, i.e. when both the reference and the negative terms are present. This computation is shown in Eq. (23).
Similarly, for the calculation of Δy⁻, the impact of Δx⁻ is first computed in the absence of the positive term, and a second term computed with the inclusion of Δx⁺ is then added; their average gives the total impact, as shown in Eq. (24).
Thus the two multipliers computed using this rule are as shown in Eq. (25), where Δy⁺ and Δy⁻ are calculated using Eqs. (23) and (24), and Δx⁺ and Δx⁻ correspond to the sums of the positive and negative terms of Δx.
The relevance scores that have been assigned to Δx⁺ and Δx⁻ are then distributed to the input features using the linear rule.
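The Linear and Rescale rules above can be checked on a toy neuron. The following is an illustrative sketch of our reading of the rules; the tiny one-neuron pipeline and variable names are our own, not the authors' reference code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

w = np.array([1.0, -2.0, 0.5])          # weights of a single neuron y = ReLU(w.x + b)
b = 0.1
x = np.array([2.0, 1.0, 4.0])           # input
x0 = np.array([0.0, 0.0, 0.0])          # reference state (e.g. a black image)

z, z0 = w @ x + b, w @ x0 + b           # pre-activations for input and reference
y, y0 = relu(z), relu(z0)

# Linear rule: each input's contribution to delta-z is w_i * delta-x_i
delta_x = x - x0
contrib = w * delta_x
assert np.isclose(contrib.sum(), z - z0)        # summation-to-delta holds

# Rescale rule: the multiplier through the non-linearity is delta-y / delta-z
m_rescale = (y - y0) / (z - z0)

# Chain rule for multipliers: contribution of each input to delta-y
contrib_y = contrib * m_rescale
assert np.isclose(contrib_y.sum(), y - y0)      # summation-to-delta again
print(contrib_y)                                # prints [ 2. -2.  2.]
```

The asserts make the summation-to-delta property explicit at both the linear and the non-linear stage.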
5.3 Feature based Explanation Method (FEM)
The Feature-based Explanation Method (FEM), proposed by Fuad et al., builds, like Grad-CAM, on the observation that the deeper convolutional layers of a network act as high-level feature extractors.
Let us consider a CNN comprising a single Gaussian filter at each convolution layer. The consecutive convolutions of the input image with the Gaussian filters, followed by pooling (downsampling), would then be the same operations that are performed to create a multi-resolution Gaussian pyramid. In such a pyramid, the image at the last level retains only the spatial information of the main objects present in the image. In a standard CNN, the learned filters at the deeper convolution layers behave similarly to high-pass, i.e. derivative, filters on top of a Gaussian pyramid (some examples are given in). This implies that the information contained in the feature maps of the last convolution layer corresponds to the main object that the network has detected in the given input image.
Hence, FEM proposes that the contribution of the input pixels to the network decision can be directly inferred from the features detected at the last convolutional layer of the network, and that the final decision is influenced by the strong features of the maps in that layer. FEM supposes that the feature maps of the last convolutional layer follow a Gaussian distribution, in which case the strong features of these maps correspond to the rare features. The authors propose a K-sigma filtering rule to identify these rare, strong features: the mean and standard deviation are calculated for each feature map of the last convolutional layer, and each map is thresholded to create a binary map of the same spatial dimensions, as shown in Eq. (26). K is the parameter that controls the threshold value and is set to 1 by the authors in their work.
The hyperparameters of a DNN, such as the number of filters to train at each layer, are often set arbitrarily, as hyperparameter optimization is computationally heavy. Channel attention mechanisms have therefore been proposed in DNNs to improve classification accuracy: they select the important feature channels (maps) in an end-to-end training process. Inspired by these models, the authors hypothesize that not all feature maps are important for the classification, where importance is understood as the magnitude of the positive features in a channel. Hence, a weight term equal to the mean of the initial feature map is assigned to each binary map. The importance map is computed as the linear combination of all the weighted binary maps, normalized to the interval [0, 1], and upsampled by interpolation to the spatial resolution of the input image.
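The whole FEM pipeline (K-sigma thresholding, channel weighting, combination and normalization) is compact enough to sketch. The code below is our minimal reading of the method, not the authors' implementation; the feature-map shape is an illustrative choice, and the final upsampling to the input resolution is left out.

```python
import numpy as np

def fem_importance(feature_maps, K=1.0):
    """Sketch of FEM's importance map. feature_maps: (C, H, W) activations
    of the last conv layer, taken after the ReLU."""
    # K-sigma rule: per-channel mean and std, keep only rare/strong features
    mu = feature_maps.mean(axis=(1, 2), keepdims=True)
    sigma = feature_maps.std(axis=(1, 2), keepdims=True)
    binary = (feature_maps > mu + K * sigma).astype(float)

    # Channel weight = mean of the initial feature map
    weights = feature_maps.mean(axis=(1, 2))

    # Linear combination of the weighted binary maps, then normalize to [0, 1]
    imp = np.tensordot(weights, binary, axes=(0, 0))
    imp -= imp.min()
    if imp.max() > 0:
        imp /= imp.max()
    return imp        # (H, W); upsample to the input resolution afterwards

rng = np.random.default_rng(1)
fmap = np.maximum(rng.normal(size=(8, 7, 7)), 0.0)   # fake post-ReLU maps
sal = fem_importance(fmap)
print(sal.shape)
```

Note that no gradient is computed anywhere: the map is derived from forward activations only, which is exactly the property that makes FEM fast.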
FEM eliminates the need to compute gradients from the output neuron and provides a faster, simpler way to obtain an importance score for the input pixels based only on the features extracted by the network. It does not examine the classification part of the network but uses only the feature-extraction part of the CNN to explain which input pixels the network used to produce its decision. The method is applicable to 2D images as well as to 3D images or video, considered as a 2D+t volume. We now illustrate it on the problem of image classification on the ImageNet database performed with VGG16, and invite the reader to visually compare the heat maps in Fig. 9 obtained with different LRP rules and with FEM.
It can be seen from the figure that the LRP heat maps depend on the rule that is used. The rule that assigns equal weight to positive and negative features leads, in Fig. 9(a), to most of the input pixels being assigned a high relevance score. The rule without the bias term, in Fig. 9(c), results in a heat map where importance is concentrated only in a small region near the top of the tiger: without the added bias term, the relevance scores are not properly distributed to the other regions. The rule that considers only features with a positive influence highlights only the higher-contrast contours of the image. Though FEM also considers positive features (the features are taken after the ReLU), its heat map is more holistic and highlights the important regions of the image.
6 Methods based on Adversarial approach
Many recent works have used adversarial attacks on CNNs to demonstrate the susceptibility of these networks to simple methods that can lead them to make completely wrong predictions. Different adversarial attacks on a network can be used to interpret its behaviour, and sample images that produce adversarial results give hints about that behaviour [34, 22]. For example, the one-pixel attack proposed by Su et al. showed that the network prediction can be made completely wrong by changing just one pixel of the input image. By studying these adversarial attacks, we can identify the regions of the image on which the network focuses to make a decision.
In addition to interpreting the network through adversarial attacks, we highlight a recent adversarial-learning-based explanation method proposed by Charachon et al. that uses a model based on a Generative Adversarial Network (GAN). A GAN is a type of network architecture with two components, a generator (G) and a discriminator (D), trained simultaneously: G learns the data distribution while D estimates whether a sample belongs to the dataset or has been generated by G.
For a binary classification task on medical images, the authors use, along with their CNN classifier, two generator networks to produce explanations: i) a similar-image generator and ii) an adversarial-image generator. The similar-image generator is trained to generate an image that receives the same prediction from the classifier as the input image, while the adversarial-image generator is trained to produce an image whose prediction is the opposite (adversarial). The authors propose that the difference between these two generated images forms the explanation of the network output; for a given image, the explanation of the classifier is then given as shown in Eq. (27).
Using just one network to generate an adversarial image and taking the difference between the input image and that adversarial image as the explanation was observed to produce noisy, non-intuitive features. To improve on this, the authors use the two-generator approach and train both generators to sample from the same adversarial space, so that they have minimal differences in their learnt parameters while producing images with opposite classifications by the CNN.
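The difference-based explanation of Eq. (27) can be sketched very simply. The two stand-in generators below are hypothetical placeholders for the trained similar and adversarial generators (the real ones are learnt networks); the sketch only shows how the explanation map is formed from their outputs.

```python
import numpy as np

def g_similar(x):
    # Placeholder for the trained similar-image generator:
    # returns an image classified the same as x
    return x + 0.01 * np.sign(x)

def g_adversarial(x):
    # Placeholder for the trained adversarial-image generator:
    # returns an image classified oppositely to x
    return x + 0.2 * (x > 0.5)

x = np.linspace(0.0, 1.0, 16).reshape(4, 4)          # toy "image"
explanation = np.abs(g_adversarial(x) - g_similar(x))  # Eq. (27)-style map
print(explanation.shape)
```

Because both generators share the same latent space, the difference map isolates the pixels that flip the classifier's decision, which is what the method proposes as the explanation.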
A saliency map represents what the network has learnt, and it is not guaranteed that the explanations match human intuition; however, it is observed that a network with higher classification accuracy generally produces more intuitive maps. A desirable property of any explanation map is that it is non-random and highlights only the relevant regions of the image, and no more. Many methods are evaluated and compared by qualitative, human-inspection-based assessment. To evaluate the Grad-CAM maps, the authors conduct user surveys asking which maps and models the participants find reliable. Although human judgments of map quality are useful, they are time-consuming and can introduce bias and inaccurate evaluations. Another way to compare the generated explanation/saliency maps is to use the metrics proposed in the long-standing research on the prediction of visual attention in images and video. For FEM, the authors compare the saliency maps obtained by FEM with those of gradient-based methods, and show that the most similar explanations, in terms of the usual saliency-comparison metrics such as the Pearson Correlation Coefficient and similarity, are given by the Grad-CAM method presented in Sec. 4.6.
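For instance, the Pearson Correlation Coefficient between two saliency maps can be computed as follows. This is a standard formulation of the metric, not tied to any particular paper's code.

```python
import numpy as np

def pearson_cc(s1, s2):
    """Pearson Correlation Coefficient between two saliency maps:
    standardize each map, then average the elementwise product."""
    a, b = s1.ravel(), s2.ravel()
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

rng = np.random.default_rng(2)
m1 = rng.random((8, 8))                      # a saliency map
m2 = 0.9 * m1 + 0.1 * rng.random((8, 8))     # a map close to m1
m3 = rng.random((8, 8))                      # an unrelated map
print(pearson_cc(m1, m2), pearson_cc(m1, m3))
```

A value near 1 indicates strong linear agreement between the two maps; for independent random maps the coefficient stays near 0.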
Also, methods based on the computation of gradients, like backpropagation, guided backpropagation and DeconvNet, suffer from gradient shattering: as the depth of the network increases, the gradients progressively resemble white noise. This causes the importance scores to show high-frequency variations and makes them highly sensitive to small variations in the input. Galli et al. apply adversarial perturbations to their input images using the fast gradient method and DeepFool, and compare the Guided Grad-CAM explanation maps of each image and its perturbed variants using the Dice similarity coefficient. They observe that perturbations in the image strongly affect the saliency maps: the maps before and after perturbation differ, though they note that these differences are not easily perceived by a human viewer. LRP, DeepLIFT and FEM are not sensitive to this problem as they do not compute gradients.
Cruciani et al. demonstrate the usefulness of LRP for visualizing relevant features in brain magnetic resonance imaging (MRI) for multiple sclerosis classification. It is observed that the LRP heat maps are sparse and not always intuitive, as also illustrated by Fig. 9. It is therefore important to consider the user who will be employing the maps when choosing a method to explain a network. In the domain of medical images, the specialist may require a map that provides holistic explanations in order to interpret and trust the network decision.
The computation times of the different methods also limit their application to larger sets of images. Fuad et al. observe that gradient-based methods, including Grad-CAM, have longer computation times, and that FEM is faster in comparison.
Selvaraju et al. use datasets with manually annotated bounding boxes around objects in the images, comparing the saliency maps with the human annotations using Intersection over Union to determine the quality of the explanations. Muddamsetty et al. also create a dataset of user saliency maps, in the form of eye-tracking data from medical experts on retinal images; they compare the two kinds of saliency maps using metrics like the Area Under the Curve (AUC) and the Kullback-Leibler Divergence (KL-Div) and show that the generated maps closely align with those of the human experts.
These are some of the first attempts to directly compare the maps generated by a method with those of a human expert, in order to determine which method is best suited to the user application. The methods presented in this paper focus on creating explanations aimed at humans; involving human feedback and intuition is thus a necessity for the choice and evaluation of these methods.
In this paper, we proposed a taxonomy of explanation methods: black-box and white-box methods. We focused on white-box methods as they can leverage the extra information available from knowledge of the network architecture. Though DNNs have been successful in multiple domains, we restricted our discussion to explanations of image classification tasks using CNNs, as their results lend themselves to intuitive interpretation and have extensive applications across image modalities.
The methods we have discussed focus on explaining the decision for a single input image by creating saliency maps that attribute an importance score to each pixel based on its contribution to the final output. Multiple approaches have been used to calculate this contribution and, based on recent works, we proposed a categorization of methods to group similar approaches and compare their performance. Our study shows that the choice of a method depends on the user for whom the explanation map is created. Evaluation based on simple human inspection may not always be the best way, but some form of user saliency should be considered to determine whether an explanation method represents human intuition and is easily interpretable.
The authors declare that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
This research was supported by University of Bordeaux/LaBRI.
Appendix A Filtering operation in DeconvNet and Gradient calculation
Consider a convolution neuron as shown in Fig. 10, where the operation performed is O = Y * F, with Y the input, F the convolution-layer filter and O the output. In a standard CNN, during backpropagation of the loss L, the neuron receives the partial gradient ∂L/∂O, and the gradients to be calculated are ∂L/∂Y and ∂L/∂F.
According to the chain rule, the partial gradient ∂L/∂Y is given by Eq. (28).
To go through a step-by-step calculation of these gradients, we suppose that the input Y and the filter F are small matrices and that the convolution is performed with a stride of 1. The corresponding matrices and the loss gradient backpropagated from the following layer are shown in Eq. (29).
The equations for the convolution during the forward pass yield the expressions for the entries of O shown in Eq. (30).
To calculate the partial gradient of the loss w.r.t. the input, the first quantities needed are the partial derivatives of the outputs with respect to the inputs. One such calculation is shown in Eq. (31), based on Eqs. (30); the remaining terms can be calculated similarly.
Subsequently, the partial gradient of the loss w.r.t. the input is given by Eqs. (32).
Thus the partial gradient of the loss with respect to the input, when calculated using the chain rule, can be written as a full convolution of the inverted filter, i.e. the filter matrix flipped vertically and then horizontally as shown in Eq. (33), with the loss gradient matrix (zero-padded so that the convolution is full).
DeconvNet (Sec. 4.1) uses the same operation at the filtering step. Thus, the deconvolution step of the DeconvNet and the calculation of the gradient with respect to the input at a convolution layer are equivalent.
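This equivalence can be verified numerically. In the sketch below, the matrix sizes (a 4×4 input, a 2×2 filter, stride 1) are an illustrative choice; we compute ∂L/∂Y once by the element-by-element chain rule and once as a full convolution of the flipped filter with the zero-padded loss gradient, and check that the two agree.

```python
import numpy as np

def conv2d_valid(Y, F):
    """Plain valid cross-correlation with stride 1, as in the forward pass."""
    h = Y.shape[0] - F.shape[0] + 1
    w = Y.shape[1] - F.shape[1] + 1
    O = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            O[i, j] = np.sum(Y[i:i + F.shape[0], j:j + F.shape[1]] * F)
    return O

rng = np.random.default_rng(3)
Y = rng.normal(size=(4, 4))          # input
F = rng.normal(size=(2, 2))          # filter
dO = rng.normal(size=(3, 3))         # loss gradient arriving from the next layer

# Chain rule, element by element: each output window scatters dO[i, j] * F
dY = np.zeros_like(Y)
for i in range(3):
    for j in range(3):
        dY[i:i + 2, j:j + 2] += dO[i, j] * F

# Full convolution: flip the filter both ways, zero-pad dO by (filter size - 1)
F_flip = F[::-1, ::-1]
dO_pad = np.pad(dO, 1)
dY_conv = conv2d_valid(dO_pad, F_flip)

print(np.allclose(dY, dY_conv))      # prints True
```

The same scatter-then-gather identity is what DeconvNet exploits at its filtering step, as stated above.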
Improving alzheimer’s stage categorization with convolutional neural network using transfer learning and different magnetic resonance imaging modalities. Heliyon 6 (12), pp. e05652. Cited by: §1.
-  (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. 1–46. Cited by: §5.1.1, §5.1, §5.3.
The shattered gradients problem: if resnets are the answer, then what is the question?.
Proceedings of the International Conference on Machine Learning, pp. 342–350. Cited by: §7.
-  (2013) Representation learning: a review and new perspectives. IEEE Trans. on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §4.6.
-  (2002) Extracting decision trees from trained neural networks. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 456–461. External Links: Cited by: §2.1.
-  (2020) Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 454–464. Cited by: §7.
What do different evaluation metrics tell us about saliency models?. in IEEE Trans on Pattern Analysis and Machine Intelligence 41 (3), pp. 740–757. Cited by: §7, §7.
-  (2020) Combining similarity and adversarial learning to generate visual explanation: application to medical image classification. arXiv:2012.07332 .. Note: To be published in ICPR 2020 Cited by: §6.
-  (2021) Explainable 3d-cnn for multiple sclerosis patients stratification. In Proceedings of the ICPR 2020 Workshops Explainable Deep Learning-AI (EDL-AI), LNCS, Vol. 12663, pp. 103–114. External Links: Cited by: §7.
-  (2017) Saliency driven object recognition in egocentric videos with deep CNN: toward application in assistance to neuroprostheses. Comput. Vis. Image Underst. 164, pp. 82–91. Cited by: §1.
-  (2009) Imagenet: a large-scale hierarchical image database. In , pp. 248–255. Cited by: Figure 1, Figure 4, Figure 7.
-  (2019) Explanations for attributing deep neural network predictions. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller (Eds.), Lecture Notes in Computer Science, Vol. 11700, pp. 149–167. External Links: Cited by: §2.1.
-  (2017) Distilling a neural network into a soft decision tree. In in Comprehensibility and Explanation in AI and ML (CEX), AI*IA, CEUR Workshop Proceedings, Vol. 2071. External Links: Cited by: §2.1.
-  (2020) Features understanding in 3d cnns for actions recognition in video. In Proceedings of the Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6. Cited by: §5.3, §7, §7.
-  (2021) Reliability of explainable artificial intelligence in adversarial perturbation scenarios. In Proceedings of the ICPR 2020 Workshops Explainable Deep Learning-AI (EDL-AI), LNCS, Vol. 12663, pp. 243–256. External Links: Cited by: §7.
-  (2004) Convolutional face finder: a neural architecture for fast and robust face detection. in IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (11), pp. 1408–1423. Cited by: §1.
-  (2016) Deep learning. Vol. 1, MIT press Cambridge. Note: ISBN: 9780262035613 Cited by: Figure 3, §3.1.
-  (2014) Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS 2014, pp. 2672–2680. External Links: Cited by: §6.
-  (2015) Explaining and harnessing adversarial examples. In Procedings of the 3rd International Conference on Learning Representations, ICLR, External Links: Cited by: §7.
-  (2018) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 1–42. Cited by: §2.1.
-  (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §5.3.
-  (2019) On relating explanations and adversarial examples. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS 2019, pp. 15857–15867. External Links: Cited by: §6.
-  (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §3.1.
-  (2019) Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10 (1), pp. 1–8. Cited by: §1.
-  (2020) Fine grained sport action recognition with twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79 (27-28), pp. 20429–20447. Cited by: §1.
-  (2019) Layer-wise relevance propagation: an overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller (Eds.), Lecture Notes in Computer Science, Vol. 11700, pp. 193–209. External Links: Cited by: 1st item, §5.1.
-  (2017) Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition 65, pp. 211–222. Cited by: §5.1.1.
Organizing cultural heritage with deep features. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia Heritage Contents, pp. 55–59. Cited by: §4.3.
-  (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2574–2582. Cited by: §7.
-  (2021) Expert level evaluations for explainable ai (xai) methods in the medical domain. In Proceedings of the ICPR 2020 Workshops Explainable Deep Learning-AI (EDL-AI), LNCS, Vol. 12663, pp. 35–46. External Links: Cited by: §7.
-  (2016) Image annotation for mexican buildings database. In SPIE Optical Engineering+ Applications, Vol. 9970, pp. 99700Y–99700Y–8. Cited by: Figure 6, Figure 8.
-  (2021) Random forest model and sample explainer for non-experts in machine learning – two case studies. In Proceedings of the ICPR 2020 Workshops Explainable Deep Learning-AI (EDL-AI), LNCS, Vol. 12663, pp. 62–75. External Links: Cited by: §2.
-  (2016) Why should i trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §1, §2.1.
-  (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §6.
-  (2016) Evaluating the visualization of what a deep neural network has learned. in IEEE Trans. on neural networks and learning systems 28 (11), pp. 2660–2673. Cited by: §7.
-  (2019) Explainable ai: interpreting, explaining and visualizing deep learning. Vol. 11700, Springer Nature. Cited by: §2.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.6.1, §4.6.
-  (2021) Moving object properties-based video saliency detection. in Journal of Electronic Imaging 30 (2), pp. 023005. Cited by: §1.
-  (2017) Learning important features through propagating activation differences. In Proceedings of International Conference on Machine Learning, pp. 3145–3153. Cited by: §5.2, §5.2.
-  (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In Proceedings of 2nd International Conference on Learning Representations, ICLR 2014, Workshop Track Proceedings, External Links: Cited by: §4.2, §4.2.
-  (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, pp. 1–14. Cited by: §5.3.
-  (2017) SmoothGrad: removing noise by adding noise. CoRR abs/1706.03825, pp. 1–10. External Links: Cited by: §4.5.
-  (2015) Striving for simplicity: the all convolutional net. In Proceedings of 3rd International Conference on Learning Representations, ICLR 2015, Workshop Track Proceedings, External Links: Cited by: §4.3.
One pixel attack for fooling deep neural networks.
in IEEE Trans. on Evolutionary Computation23 (5), pp. 828–841. Cited by: §6.
-  (2017) Axiomatic attribution for deep networks. In Proceedings of International Conference on Machine Learning, pp. 3319–3328. Cited by: §4.4.
-  (2014) Intriguing properties of neural networks. In ICLR (Poster), pp. 1–10. Cited by: §6.
-  (2018) Attacks meet interpretability: attribute-steered detection of adversarial samples. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7728–7739. Cited by: §6.
-  (2021) Denoising convolutional neural network inspired via multi-layer convolutional sparse coding. Journal of Electronic Imaging 30 (2), pp. 023007. External Links: Cited by: §1.
-  (2019) Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE transactions on medical imaging 39 (4), pp. 1184–1194. Cited by: §1.
-  (2014) How transferable are features in deep neural networks?. In Proceedings of the Annual Conference on Neural Information Processing Systems, NeurIPS 2014, pp. 3320–3328. External Links: Cited by: §2.2.
-  (2014) Visualizing and understanding convolutional networks. In Proceedings of European Conference on Computer Vision, pp. 818–833. Cited by: §2.1, §4.1.
-  (2020) Deep learning in mining of visual content. Springer. Note: ISBN: 9783030343750 Cited by: §3.1, §5.3.
-  (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §4.6.