1. Introduction
Attribution (Samek et al., 2020; Fong and Vedaldi, 2017; Lundberg and Lee, 2017) is an efficient computational tool for interpreting DNNs: it infers the importance score of each input feature to the final decision for a given sample. Backpropagation based visualization (Zeiler and Fergus, 2014; Springenberg et al., 2014; Bach et al., 2015; Montavon et al., 2017; Sundararajan et al., 2017; Shrikumar et al., 2017; Kindermans et al., 2017) is a classic approach to attribution, which propagates an importance score from the target output neuron through each layer to the input neurons in one pass. Existing work adopts different propagation rules. For example, Deconvnet (Zeiler and Fergus, 2014) maps the output neuron back to the input pixel space by using backward relu and transposed filters in a layerwise manner. Deep Taylor decomposition (Montavon et al., 2017) hierarchically decomposes each upper-layer neuron onto lower-layer neurons according to the upper-layer neuron’s first-order Taylor expansion at the nearest root point. Backpropagation based methods are usually efficient and parameter-free, and some of them produce interpretations with good visual quality and achieve good performance on image degradation and object localization tasks. Besides, backpropagation methods enable fine-grained visualizations and layerwise explanations, which may provide a more comprehensive understanding of the decision-making process of the neural network than perturbation-based attribution methods (Fong and Vedaldi, 2017; Fong et al., 2019).
Despite the human-interpretable visualizations, several recent works have raised doubts about the faithfulness of these backpropagation visualizations (Adebayo et al., 2018; Nie et al., 2018; Sixt et al., 2020). Do the pretty visualizations reliably reflect why the network makes a given decision, e.g., classifying an image as a cat?
A recurring concern is that the theoretical rationales of most backpropagation methods are confusing and hard to examine (Nie et al., 2018; Kindermans et al., 2017). For example, the original purpose of the transposed filter and backward relu in Deconvnet is to preserve the neuron size and keep the neurons in intermediate layers nonnegative (Zeiler and Fergus, 2014), which has no direct connection to interpreting the final prediction. As another example, Guided Backpropagation (GBP) (Springenberg et al., 2014) significantly improves the visualization merely by combining the forward relu of the saliency map with the backward relu of Deconvnet. Interpretations based on such perplexing behaviors cannot be fully trusted.
Recent studies have confirmed these suspicions and verified that most modified backpropagation interpretations are not faithful to the model’s decision-making process. A theoretical analysis has revealed that Deconvnet and GBP essentially perform (partial) image recovery, which is unrelated to the network decisions (Nie et al., 2018). Empirical evidence has also been presented. A method that is insensitive to the class labels and the model parameters cannot faithfully explain the network’s prediction (Nie et al., 2018; Adebayo et al., 2018), yet previous observations (Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016) have shown that the visualizations of Deconvnet and GBP remain almost the same under different class labels. These investigations and conclusions have been extended to most backpropagation based attribution methods, including Deep Taylor decomposition (Montavon et al., 2017), Layerwise Relevance Propagation (LRP) (Bach et al., 2015), Excitation BP (Zhang et al., 2018), PatternAttribution (Kindermans et al., 2017), Deconvnet, and GBP. The investigations demonstrate that the attributions of these methods are independent of the parameters in the layers of the classification module (Sixt et al., 2020) and hence fail the sanity check (Adebayo et al., 2018). In addition, existing backpropagation attribution methods generate a vague “relative importance score”, whose numerical values have no practical meaning. Specifically, low relevance values do not guarantee that the network ignores the corresponding part during the decision-making process (Schulz et al., 2020).
Thus, it is highly desirable to develop a novel backpropagation framework that guarantees theoretical faithfulness and produces a quantitative attribution score with a clear meaning. To achieve these two goals, we resort to mutual information theory, which is capable of generating sound interpretations and estimating the amount of information. We aim to study how much information about the prediction is encoded in each input neuron in the neighborhood of a given sample. It is intractable to deal directly with the mutual information for each sample, which is notoriously hard to estimate. Motivated by mask-learning methods (Fong and Vedaldi, 2017), which learn a masked input that maximally preserves the discriminative information, our basic idea is to learn a source signal via backpropagation. The source signal should be such that the mutual information between input and output is preserved as much as possible in the mutual information between the input and the source signal. We further formulate this objective as minimizing the conditional entropy of the input given the source signal, which can be approximately achieved via a reconstruction-type objective function (Fano’s inequality). To solve the optimization problem, we propose the Mutual Information Preserving Inverse Network, termed MIPIN, a layerwise inverse network. MIPIN recursively retrains the parameters of each layer to learn how to invert, and then adopts forward relu to adapt the global framework to each specific input.
We conduct experiments to validate the effectiveness of the proposed MIPIN framework. Firstly, we empirically validate that the source signal produced by MIPIN satisfies two desirable properties: completeness and minimality. Secondly, we show that MIPIN generates human-interpretable visualizations for models without local convolution, while other backpropagation based methods fail to do so. Thirdly, we demonstrate that MIPIN produces visualizations of good quality, which locate the object of interest well. Finally, we investigate the sensitivity of the interpretations w.r.t. the class labels. We observe that MIPIN is very sensitive to changes of the class label, while most modified backpropagation attribution methods are not. In summary, this paper has three main contributions:

We formulate the backpropagation attribution problem as learning a source signal in the input space maximally preserving the mutual information between input and output.

We propose a novel mutual information preserving inverse network, termed MIPIN, to learn how to invert for attribution.

Experimental results on two benchmark datasets validate that the proposed MIPIN framework generates high-quality attribution heatmaps that are faithful to the decision-making process of the underlying DNN.
2. Related work
In this section, we provide an overview of existing attribution methods that could infer the importance of each input feature to the final prediction for a given input sample.
2.1. Backpropagation-Based Attribution
Backpropagation based attribution is an efficient attribution approach that propagates an importance score from an output neuron through each layer to the input neurons in one pass. Many different backpropagation methods are proposed based on different backpropagation rules.
Gradient (Baehrens et al., 2010) describes the sensitivity of the output neuron w.r.t. each input neuron; during backpropagation, its forward relu masks out the positions where the neurons of the bottom data are negative. Deconvnet (Zeiler and Fergus, 2014) aims to map the output neuron back to the input pixel space. To keep the neuron size and the nonnegative property, it resorts to transposed filters and backward relu, which masks out the negative entries of the top gradients. Guided Backpropagation (GBP) (Springenberg et al., 2014) combines Gradient and Deconvnet, applying both forward relu and backward relu. As a result, GBP significantly improves the visual quality of the visualizations.
In addition, other backpropagation attributions follow the idea of directly or hierarchically decomposing the value of the target neuron onto the input neurons. Layerwise Relevance Propagation (LRP) (Bach et al., 2015) decomposes the relevance score onto the neurons of the lower layer according to their proportions in the linear combination. DeepLift (Shrikumar et al., 2017) adopts a similar linear rule, but instead of merely using the output value, it assigns the difference between the output and a reference output in terms of the difference between the input and a preset reference input. Integrated Gradients (Sundararajan et al., 2017) also decomposes the output difference, by integrating the gradients along a straight path interpolating between the input sample and a reference point, which corresponds to the Aumann-Shapley payoff assignment in cooperative game theory. Smooth Gradients (Smilkov et al., 2017) is similar to Integrated Gradients, but averages the gradients over a local neighborhood sampled by adding Gaussian noise. Deep Taylor decomposition (DTD) (Montavon et al., 2017) hierarchically decomposes each upper-layer neuron onto the lower-layer neurons according to the upper-layer neuron’s first-order Taylor expansion at the nearest root point. PatternAttribution (Kindermans et al., 2017) extends DTD by training a function from data to learn the root points. Their different backpropagation structures are summarized in Figure 1.
Nevertheless, recent studies (Nie et al., 2018; Adebayo et al., 2018; Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016; Schulz et al., 2020) have indicated that most modified backpropagation based interpretations are not faithful to the model’s decision-making process. This motivates us to develop a novel backpropagation framework that guarantees theoretical faithfulness and produces a quantitative attribution score with a clear meaning.
2.2. Perturbation-Based Attribution
Apart from the backpropagation based attribution approach, many perturbation-based attribution methods have been proposed. They generate meaningful explanations through informative subset learning for CNNs (Fong and Vedaldi, 2017; Dabkowski and Gal, 2017; Wagner et al., 2019; Schulz et al., 2020) and GNNs (Ying et al., 2019). The basic idea is to maximize the mutual information between the predictions of a neural network and the distribution of possible subset structures $X_S$. For a pretrained neural network $\Phi$, let $X$ and $Y$ denote its input variable and the corresponding final prediction variable. Then the informative subset learning problem can be formulated as:
$$\max_{X_S} \; I(Y; X_S),$$
so as to find the most informative subset $X_S$ for the prediction $Y$.
To make $X_S$ as compact as possible, a constraint on the size of $X_S$ or a sparsity constraint is usually imposed. Note that $I(Y; X_S) = H(Y) - H(Y \mid X_S)$, and the entropy term $H(Y)$ is constant because $\Phi$ is fixed for a pretrained network. As a result, maximizing the mutual information is equivalent to minimizing the conditional entropy $H(Y \mid X_S)$, that is:
$$\min_{X_S} \; H(Y \mid X_S). \tag{1}$$
The distribution of possible subset structures for a given input sample is often constructed by introducing a Gaussian noise mask to the input. Specifically, the input variable is formed from the sample and Gaussian noise, combined through a mask vector that needs to be optimized to learn the most informative subset. Besides, when end users ask “why does the trained model predict the instance as a certain class label”, the above conditional entropy is usually replaced by the cross entropy between the label class and the model prediction.
Informative subset learning methods generate meaningful explanations, but suffer from some drawbacks: i) they incur the risk of triggering artifacts of the black-box model; ii) the produced explanations are forced to be smooth, which prevents fine-grained evidence from being visualized.
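The noise-mask construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the blending form `mask * x + (1 - mask) * eps`, the function name `masked_input`, and the noise parameters are hypothetical choices, not the exact formulation of any of the cited methods.

```python
import numpy as np

def masked_input(x, mask, noise_mean=0.0, noise_std=1.0, rng=None):
    """Blend an input with Gaussian noise under an elementwise mask in [0, 1].

    Assumed form (hypothetical): mask * x + (1 - mask) * eps, eps ~ N(mean, std^2).
    Entries with mask == 1 are kept; entries with mask == 0 are replaced by noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    mask = np.clip(np.asarray(mask, dtype=float), 0.0, 1.0)
    eps = rng.normal(noise_mean, noise_std, size=x.shape)
    return mask * x + (1.0 - mask) * eps
```

In a mask-learning method, `mask` would be the quantity optimized (typically with a size or sparsity penalty), while the network itself stays fixed.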
3. Learn to invert with mutual information preserving
In this section, we first introduce the instance-level attribution problem that we aim to tackle, and then discuss the mutual information preserving principle that a backpropagation based attribution should follow in order to generate faithful interpretations.
3.1. Problem Statement
To interpret a pretrained network $\Phi$, an attribution method aims to infer the contribution score of each input feature to the final decision for a given sample $x$ (Du et al., 2019). Backpropagation based attribution is an efficient and parameter-free attribution approach, which usually enables fine-grained explanations. Different from existing backpropagation based methods, we develop a novel backpropagation explanation method from the mutual information perspective. Specifically, we try to solve the attribution problem by answering the following two research questions: i) In a global view, for input variable $X$ and output variable $Y$, how much information about $Y$ is encoded in each input feature? ii) In a local view, for a specific input $x$ to be explained, how much information about $Y$ in a local neighborhood of $x$ is encoded in each input feature?
3.2. Problem Formulation
The informative subset learning approach (Fong and Vedaldi, 2017) learns a masked input that masks out the irrelevant regions so as to maximally preserve the information related to the final prediction. Motivated by this approach, we answer the first question by inverting the output variable $Y$ back to the input (pixel) space under the mutual information preserving principle. Specifically, we aim to learn a set of source signals $S = \{s\}$ and distractor signals $D = \{d\}$ of the input signals $X = \{x\}$ in the input space, such that each input signal $x$ in $X$ is composed of its corresponding source signal $s$ and distractor $d$, i.e., $x = s + d$. Here, $s$ is the desired inverted signal. We hope the inverted source signal $s$ contains almost all the signal of interest correlated with the final prediction $Y$, while the distractor signal $d$ represents the remaining part, in which no information about $Y$ can be captured. More formally, the mutual information between input $X$ and output prediction $Y$ should be preserved as much as possible in the mutual information between $X$ and $S$, not in $D$. Hence, the inverted source signal can be optimized by:
$$\min_{S} \; I(X; Y) - I(X; S). \tag{2}$$
It can be observed that the inverted source signal $S$ can be expressed as a function of the prediction $Y$, i.e., $S = g(Y)$, when no extra variable except for $Y$ is introduced during the backpropagation process. According to the data processing inequality of mutual information, we have $I(X; S) \le I(X; Y)$, which means the source signal $S$ does not introduce information beyond the prediction $Y$. The term $I(X; Y)$ is constant for a pretrained network, hence maximally preserving the mutual information between input and output is equivalent to maximizing the mutual information between $X$ and the source signal $S$, i.e.,
$$\max_{S} \; I(X; S). \tag{3}$$
Because $I(X; S) = H(X) - H(X \mid S)$ and the entropy of the input $H(X)$ is constant, optimizing equation (3) is equivalent to minimizing the conditional entropy $H(X \mid S)$, which can be expressed as:
$$H(X \mid S) = -\iint p(x, s) \log p(x \mid s) \, dx \, ds.$$
It is intractable to compute $H(X \mid S)$ directly, since doing so requires integrating over the joint distribution of $X$ and $S$. Hence, instead of directly optimizing $H(X \mid S)$, we use a reconstruction-type objective to approximately optimize this conditional entropy, whose feasibility is supported by the following Theorem 1.
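Theorem 1.
Let the random variables $X$ and $Y$ represent input and output messages with a joint probability $P(x, y)$. Let $e$ represent the occurrence of a reconstruction error event, i.e., that $X \neq \tilde{X}$, with $\tilde{X} = f(Y)$ being an approximate version of $X$. We have Fano’s inequality,
$$H(X \mid Y) \le H_b(e) + P(e) \log\left(|\mathcal{X}| - 1\right),$$
where $\mathcal{X}$ is the alphabet of $X$, and $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$. $P(e) = P(X \neq \tilde{X})$ is the probability of a reconstruction error and $H_b(e)$ is the corresponding binary entropy.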
The proof of Theorem 1 is widely accessible in standard references such as (Cover, 1999). Fano’s inequality relates the conditional entropy $H(X \mid S)$ to the probability of reconstruction error $P(e)$, with the source signal $S$ playing the role of the observed variable. The inequality shows that minimizing the conditional entropy can be approximately achieved by minimizing its upper bound, i.e., by minimizing the probability $P(e)$. Hence it is feasible to minimize the reconstruction error between $X$ and $S$ in order to keep the conditional entropy $H(X \mid S)$ small. Assuming that $X$ and $S$ are normally distributed, the probability of reconstruction error can be minimized by minimizing the mean square error between $X$ and $S$, that is,
$$\min_{S} \; \mathbb{E}\,\|X - S\|_2^2. \tag{4}$$
Then the mutual information between input $X$ and its source signal $S$ can characterize the importance of each input feature to the final prediction globally.
The situation is slightly different when the explanation takes a local view for a specific input $x$. The above formulation emphasizes the features of interest globally, averaging over the whole data distribution, whereas for a specific input we are more interested in the model behavior around its local neighborhood. Hence, we need to adjust such global explanations to the specific input. We conduct the adaptation by incorporating some prior information about the neighborhood of $x$. We elaborate on the detailed methodology of the local adaptation in section 4.2.
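To make the role of Fano’s inequality in Theorem 1 concrete, the following sketch evaluates the standard right-hand side of the bound, the binary entropy of the error probability plus the error probability times the log of the alphabet size minus one, in bits. The function names are hypothetical and the example only illustrates the textbook inequality, not any part of the MIPIN implementation.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_upper_bound(p_error, alphabet_size):
    """Fano upper bound on the conditional entropy, in bits.

    p_error: probability that the reconstruction differs from the input.
    alphabet_size: number of values the input variable can take (>= 2).
    """
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return binary_entropy(p_error) + p_error * math.log2(alphabet_size - 1)
```

For small error probabilities the bound shrinks as `p_error` shrinks and reaches zero at `p_error = 0`, which is the sense in which reducing reconstruction error controls the conditional entropy.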
4. MIPIN: Mutual Information Preserving Inverse Network
In this section, we propose Mutual Information Preserving Inverse Network (MIPIN), which is a layerwise inverse network (Figure 2) based on the mutual information preserving principle introduced in the previous section.
In MIPIN, the mutual information between input $X$ and output prediction $Y$ is maximally preserved in the inverted source signal $S$ during backpropagation. Assuming the DNN has $L$ layers, we construct the inverse function $g$ as a composite function, i.e.,
$$S = g(Y) = g_1 \circ g_2 \circ \cdots \circ g_L (Y),$$
where $Y$ denotes the output neuron, and $g_l$ denotes the inverse function at the $l$th layer. The source signal $S$ is obtained via layerwise backpropagation. According to the objective of equation (4) in section 3.2, we learn the inverse function $g$ by minimizing the L2 norm of the difference between the input $X$ and the source signal $S$,
$$\min_{g} \; \|X - g(Y)\|_2^2.$$
To guarantee faithfulness to the model, we further require that the mutual information between every two adjacent layers is also maximally preserved during the inversion. Hence we recursively learn the inverse functions from the top layers to the bottom layers. We initialize $S^{(L)} = Y$. Then, $g_L$ aims to reconstruct the neuron entries at the $(L-1)$th layer, which is expressed as:
$$\min_{g_L} \; \|X^{(L-1)} - g_L(S^{(L)})\|_2^2,$$
where $X^{(L-1)}$ denotes the intermediate feature map at the $(L-1)$th layer, and $S^{(L-1)} = g_L(S^{(L)})$ represents the corresponding source signal of $X^{(L-1)}$ at the $(L-1)$th layer. Similarly, at the $l$th layer with $S^{(l)}$ obtained, $g_l$ aims to reconstruct the neuron entries at the $(l-1)$th layer, i.e.,
$$\min_{g_l} \; \|X^{(l-1)} - g_l(S^{(l)})\|_2^2. \tag{5}$$
The computational flow of MIPIN is shown in Figure 2.
In the following, we introduce the proposed MIPIN framework in detail. Firstly, we design the type of inverse function for different types of layers. The parameters of the inverse function at each layer can be optimized by minimizing the reconstruction error. Then, we elaborate on how to leverage the activation switch information to adapt to a specific input sample. Finally, we introduce how to generate the interpretations based on MIPIN. The overall pipeline is summarized in Algorithm 1, where the backpropagation operation follows the forward structure in Figure 1.
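The recursive, top-down fitting described in this section can be sketched for a stack of dense layers as follows. This is a minimal illustration, not Algorithm 1: it assumes the per-layer feature maps have already been recorded from a forward pass, uses plain affine inverses fitted by unregularized least squares, and omits the forward relu step; the names `fit_affine` and `fit_inverse_network` are hypothetical.

```python
import numpy as np

def fit_affine(s, x):
    """Least-squares fit of x approx. s @ W + b for row-wise samples."""
    design = np.hstack([s, np.ones((s.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return coef[:-1], coef[-1]  # W: (d_s, d_x), b: (d_x,)

def fit_inverse_network(features):
    """Recursively fit one affine inverse per layer, from the top layer down.

    features: [X0, X1, ..., XL], each of shape (n_samples, dim_l), where X0 is
    the input and XL is the network output recorded from the forward pass.
    Returns the fitted (W, b) pairs (top layer first) and the source signal
    reconstructed in the input space.
    """
    s = features[-1]  # initialise the top source signal with the output
    inverses = []
    for x_lower in reversed(features[:-1]):
        W, b = fit_affine(s, x_lower)  # reconstruct the lower layer from the current source
        s = s @ W + b                  # source signal handed to the next (lower) block
        inverses.append((W, b))
    return inverses, s
```

Each block is trained on the source signal produced by the block above it rather than on the true feature map, which mirrors the recursion in equation (5).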
4.1. Design of The Inverse Function
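In this subsection, we design the function $g_l$ for (i) dense, (ii) convolutional, and (iii) max-pooling layers, respectively. We focus on the block at the $l$th layer, which uses the intermediate inverted source signal $S^{(l)}$ to reconstruct $X^{(l-1)}$, as shown in Equation (5). For simplicity, we write the neurons at the $(l-1)$th layer and the inverted signal at the $l$th layer as $x$ and $s$.
Dense Layer: For the $l$th dense layer, we define the corresponding backpropagation inverse function as a linear function, which can be expressed as $g_l(s) = W s + b$. According to the recursive reconstruction objective, we need to minimize the reconstruction error between $x$ and $g_l(s)$, i.e., $\min_{W, b} \mathbb{E}\,\|x - (W s + b)\|_2^2$. Then $W$ and $b$ have a unique closed-form solution:
$$W = \mathbb{E}[\bar{x}\bar{s}^{\top}] \left(\mathbb{E}[\bar{s}\bar{s}^{\top}]\right)^{-1}, \qquad b = \mathbb{E}[x] - W\,\mathbb{E}[s], \tag{6}$$
where $\bar{x} = x - \mathbb{E}[x]$ and $\bar{s} = s - \mathbb{E}[s]$ represent the centralized input and inverted source signal respectively. To avoid overfitting, we adopt an L2 norm to regularize the parameters $W$. Then the solution becomes $W = \mathbb{E}[\bar{x}\bar{s}^{\top}] \left(\mathbb{E}[\bar{s}\bar{s}^{\top}] + \lambda I\right)^{-1}$, where $\lambda$ controls the tradeoff between reconstruction error and L2 norm regularization.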
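The closed-form dense-layer solution can be written directly in terms of sample covariances. The sketch below assumes the standard ridge-regularized least-squares form (cross-covariance times the inverse of the regularized source covariance); the function name `dense_inverse_closed_form` and the row-wise sample layout are illustrative assumptions.

```python
import numpy as np

def dense_inverse_closed_form(s, x, lam=0.0):
    """Closed-form affine map with x approx. W @ s + b for each sample.

    s: (n, d_s) inverted signals at the upper layer.
    x: (n, d_x) neurons to reconstruct at the lower layer.
    lam: L2 regularization strength on W (lam = 0 gives ordinary least squares).
    Returns W of shape (d_x, d_s) and b of shape (d_x,).
    """
    s_mean, x_mean = s.mean(axis=0), x.mean(axis=0)
    sc, xc = s - s_mean, x - x_mean      # centralized signals
    n = s.shape[0]
    cov_xs = xc.T @ sc / n               # (d_x, d_s) cross-covariance
    cov_ss = sc.T @ sc / n               # (d_s, d_s) source covariance
    reg = cov_ss + lam * np.eye(s.shape[1])
    W = np.linalg.solve(reg, cov_xs.T).T  # equals cov_xs @ inv(reg), since reg is symmetric
    b = x_mean - W @ s_mean
    return W, b
```

Increasing `lam` shrinks `W` toward zero, trading reconstruction accuracy for smaller weights.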
Convolutional Layers: A convnet uses learned filters to convolve the feature maps from the previous layer. It has been shown that the transposed filter structure can partially recover the signal (Nie et al., 2018). Besides, this structure accurately describes the correlation relationships, i.e., which neurons at the lower layer are related to which neurons at the next upper layer. Hence we use the transposed filter structure as the inverse function for convolutional layers, where the kernel parameters of the inverse function are obtained by minimizing the reconstruction objective function, i.e.,
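$$\min_{K} \; \mathbb{E}\,\|x - g_l(s; K)\|_2^2, \tag{7}$$
where $K$ denotes the kernel of the transposed-filter inverse function $g_l$.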
Note that this differs from Deconvnet and GBP, which directly use the transposed version of the same filter as in the original network.
Max-pooling Layers: We follow the design of Deconvnet for the max-pooling operation, which obtains an approximate inversion by recording the locations of the maximum within each pooling region in a set of switch variables. We use these recorded switches to place the intermediate inverted signal into the appropriate locations, preserving the information. The operation is formalized as:
$$g_l(s) = \mathrm{upsample}(s) \odot \mathrm{switch}, \tag{8}$$
Here $\odot$ represents the elementwise operation.
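The switch-based approximate inversion of max pooling can be illustrated on a single 2D feature map. The sketch below assumes non-overlapping pooling windows and a map whose sides are divisible by the window size; both function names are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling that also records the argmax locations.

    x: (H, W) array with H and W divisible by k.
    Returns the pooled map (H/k, W/k) and a boolean switch mask of shape (H, W)
    that is True exactly at the position of the maximum in each window.
    """
    h, w = x.shape
    blocks = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // k, w // k, k * k)
    arg = blocks.argmax(axis=-1)
    pooled = blocks.max(axis=-1)
    switches = np.zeros_like(blocks, dtype=bool)
    rows, cols = np.indices(arg.shape)
    switches[rows, cols, arg] = True
    switches = switches.reshape(h // k, w // k, k, k).transpose(0, 2, 1, 3)
    return pooled, switches.reshape(h, w)

def unpool_with_switches(s, switches, k=2):
    """Place each entry of s at the recorded maximum location; zeros elsewhere."""
    upsampled = np.repeat(np.repeat(s, k, axis=0), k, axis=1)
    return upsampled * switches
```

The switches come from the forward pass on the sample being explained, so this step already injects sample-specific information into the inversion.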
4.2. Forward Relu for Local Adaptation
The above framework emphasizes the features of interest globally, averaging over the whole data distribution. It contains no information specific to a given sample. For a specific input $x$, however, we are interested in how much information about $Y$ around its local neighborhood is encoded in each input feature. Hence, we need to adapt the global framework to explain a specific input.
The basic idea is to explore and leverage the neighborhood information of $x$. An intuitive way is to draw neighborhood samples according to a prior distribution around $x$, and train the inversion function on those samples. However, such an approach is clumsy because the network parameters need to be retrained for each sample to be interpreted, which does not scale to a large DNN.
Instead, we adopt a shortcut by leveraging the activation switch information, i.e., whether the neurons at intermediate layers are activated. We find that the activation switch is an effective indicator of neighborhood information. Specifically, we consider the $j$th neuron at the $l$th layer of the sample $x$, which is obtained by $x^{(l)}_j = \mathrm{relu}(w_j^{\top} x^{(l-1)} + b_j)$, where $w_j$ and $b_j$ denote the corresponding weights and bias respectively. A non-activated neuron implies that the linear combination satisfies $w_j^{\top} x^{(l-1)} + b_j \le 0$. Under a continuity assumption, the linear combination would also be smaller than $0$ with high probability for samples whose intermediate feature map at the $(l-1)$th layer is near that of $x$. Therefore, for these neighboring samples, the corresponding neurons would also be non-activated with high probability.
Hence, we leverage the activation switch information at each layer to adapt the global framework to the specific input $x$. Specifically, we adopt forward relu, i.e., we force the locations that are non-activated in the original neurons to also be non-activated in the inverted source signals:
$$s^{(l)} \leftarrow s^{(l)} \odot \mathbb{1}\left[x^{(l)} > 0\right]. \tag{9}$$
The forward relu operation recovers the neighborhood information of the specific input $x$ well.
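The forward relu gating described above amounts to an elementwise mask taken from the forward pass. A minimal sketch, assuming the original post-activation values of the layer are available for the sample being explained (the function name `forward_relu_gate` is hypothetical):

```python
import numpy as np

def forward_relu_gate(inverted_signal, original_activation):
    """Zero the inverted signal wherever the original neuron was not activated.

    inverted_signal: inverted source signal at some layer.
    original_activation: post-ReLU activations of the same layer, recorded in
        the forward pass for the specific input being explained.
    """
    switch = np.asarray(original_activation) > 0  # activation switch of the sample
    return np.asarray(inverted_signal) * switch
```

Unlike a backward relu, the mask depends only on the forward activations, so negative entries of the inverted signal are kept wherever the original neuron was active.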
4.3. Generating Interpretations
We generate the interpretations by studying how much information about the output neuron $Y$ is encoded in the input neurons via the source signal $S$, which maximally preserves the mutual information between input $X$ and output $Y$.
As the inverse functions for dense and convolutional layers are essentially linear functions, we can extract the corresponding linear weight $W_l$ as the attribution at the $l$th layer and ignore the irrelevant bias terms (Algorithm 1 line 12). The attribution is calculated as a superposition of the $W_l$ in each layer (Algorithm 1 line 13). When adapting to a specific input, we adopt a forward relu operation as in section 4.2 (Algorithm 1 line 16).
The effect of local adaptation is shown in Figure 3. We can observe that without forward relu, the produced global source signal and attribution vector remain almost the same for different samples in the same class. This is mainly because the global framework averages over the whole data distribution. After adopting the forward relu function for local adaptation, the framework adapts well to each specific input. Besides, the locally adapted attributions turn out to be specializations of the global attributions, emphasizing similar parts of the digit.
5. Evaluation of the Source Signal
In this section, we introduce how to evaluate the quality of the source signals. The source signal is expected to satisfy two properties: i) Completeness: $S$ preserves almost all the information about the output $Y$ and the distractor signal $D$ captures no relevant information about $Y$, which means $I(S; Y) = I(X; Y)$; ii) Minimality: no redundant information (irrelevant to $Y$) is contained in $S$, which means $H(S \mid Y) = 0$.
The produced source signal naturally satisfies minimality because $S$ is constructed in a backpropagation manner, in which almost no extraneous information is introduced. Hence, we focus on evaluating the completeness property. $I(S; Y)$ should be close to $I(X; Y)$ when the conditional entropy $H(Y \mid S)$ is close to $H(Y \mid X)$. The two conditional entropies are equivalent when the output of $S$ is the same as the output of $X$, i.e., $\Phi(S) = \Phi(X)$. Here $\Phi(\cdot)$ represents the logits of the network. We adopt the average percentage change (APC) as the metric to measure the difference between $\Phi(X)$ and $\Phi(S)$,
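$$\mathrm{APC} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{\left|\Phi_c(x_i) - \Phi_c(s_i)\right|}{\left|\Phi_c(x_i)\right|}, \tag{10}$$
where $x_i$ and $s_i$ represent the input and inverted source signal of the $i$th sample, $C$ is the number of classes, and $N_c$ is the number of samples in the $c$th class. We also propose a relaxed metric, termed positive average percentage change (Positive APC), which uses the relu function to replace the absolute value,
$$\mathrm{Positive\ APC} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{\mathrm{relu}\left(\Phi_c(x_i) - \Phi_c(s_i)\right)}{\left|\Phi_c(x_i)\right|}, \tag{11}$$
because $s_i$ can be considered to contain sufficient information for the prediction when $\Phi_c(s_i) \ge \Phi_c(x_i)$. Completeness is satisfied when the APC and the Positive APC are very small.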
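The two completeness metrics can be sketched as below. The exact normalization is an assumption: this version compares the logit of each sample’s own class, divides by the magnitude of the original logit, averages within each class, and then averages over classes. The function name and the small `eps` guard against division by zero are hypothetical additions.

```python
import numpy as np

def percentage_change_metrics(logits_x, logits_s, labels, eps=1e-12):
    """Average percentage change (APC) and its relaxed positive variant.

    logits_x: (n, C) logits of the original inputs.
    logits_s: (n, C) logits of the inverted source signals.
    labels:   (n,) integer class of each sample.
    Returns (apc, positive_apc), each a class-balanced mean.
    """
    logits_x = np.asarray(logits_x, dtype=float)
    logits_s = np.asarray(logits_s, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    fx = logits_x[idx, labels]            # class logit of the original input
    fs = logits_s[idx, labels]            # class logit of the source signal
    denom = np.abs(fx) + eps
    abs_change = np.abs(fx - fs) / denom
    pos_change = np.maximum(fx - fs, 0.0) / denom  # only penalise a drop in the logit
    classes = np.unique(labels)
    apc = np.mean([abs_change[labels == c].mean() for c in classes])
    positive_apc = np.mean([pos_change[labels == c].mean() for c in classes])
    return float(apc), float(positive_apc)
```

The positive variant is never larger than the APC, since it ignores cases where the source signal yields a higher class logit than the original input.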
We evaluate the completeness property on three models: i) a multilayer perceptron (MLP) model trained on the MNIST dataset (LeCun, 1998); ii) a CNN model trained on MNIST; and iii) a CNN model trained on the Cifar10 dataset (Krizhevsky et al., 2009). The detailed network architectures are listed in the Appendix. We report the APCs and Positive APCs in Table 1, and also visualize the inverted source signals of several examples, tagged with their classification logits, in Figure 4. Table 1 shows that the APCs, and especially the Positive APCs, on these models are very small. Besides, the visualization in Figure 4 also shows that the logits of the source signal are very similar to the logits of the original input signal. This evidence implies that the produced source signals satisfy the completeness property well.
6. Experiments
In this section, we conduct experiments to evaluate the effectiveness of the interpretations produced by the proposed MIPIN framework from three aspects. Firstly, we visualize the interpretation results of different model architectures on the MNIST and Imagenet datasets in section 6.2. Secondly, we test the localization performance by comparing the produced interpretations with the bounding box annotations in section 6.3. Finally, we investigate the sensitivity of the interpretations with respect to class labels in section 6.4.
6.1. Experimental Setups
6.1.1. Baseline Methods
We compare with seven popular backpropagation based attribution methods, including Grads (Simonyan et al., 2014), Smooth (Smilkov et al., 2017), LRP (Bach et al., 2015), Deep Taylor Decomposition (DTD) (Montavon et al., 2017), Deconvnet (Zeiler and Fergus, 2014), GBP (Springenberg et al., 2014), and PatternAttribution (Patternnet) (Kindermans et al., 2017). Here LRP is not included as it has been shown that the LRP method is equivalent to DTD (Montavon et al., 2017). We use the implementation of these compared methods from the innvestigate package (Alber et al., 2019).
6.1.2. Implementation Details
We evaluate the quality of the proposed interpretations on the MNIST (LeCun, 1998) and Imagenet (Russakovsky et al., 2015) datasets. The training epoch for convolutional layers is set as . In addition, the L2 norm regularization coefficient in dense layer .
On the MNIST dataset, we train a three-layer MLP model on 50k training images. Then we extract the input, final prediction, and intermediate feature maps of the 50k images for the model. The network parameters of MIPIN are learned for each class recursively. The learning can be done in a few minutes. On the Imagenet dataset, we interpret the pretrained VGG19 (Simonyan and Zisserman, 2014) and Resnet50 (He et al., 2016) models. The network parameters of MIPIN are trained on the ILSVRC validation set with 50k images (Russakovsky et al., 2015). Using more training samples would likely yield better interpretation results. We mainly train the classifier modules. Specifically, the feature maps at the top convolutional layers and the final predictions (logits) are fed as input and output respectively. Different from the classical propagation settings where forward relu stops at the second layer, the forward relu should also operate on the defined input layer (top convolutional layers). This training part takes less than one minute and repeats for times. Then MIPIN generates a dimensional attribution vector. We could derive a dimensional attribution vector by averaging over the channel dimension, and then resize it to the original image size to obtain the saliency map, similar to GradCAM (Selvaraju et al., 2017). We apply a ReLU function to the attribution vectors for ImageNet examples because we are more interested in the features that have a positive influence on the class of interest.
6.2. Visualization Results
To qualitatively assess the interpretation performance, we visualize the saliency maps of evaluated samples produced by MIPIN and the compared methods for an MLP model and the pretrained VGG19 model.
6.2.1. Sanity Check Measurement.
Here we propose a novel sanity check measurement for interpretations: whether the attribution methods produce reasonable interpretations for models without local connections or max-pooling operations, e.g., MLP models. It has been shown that the local connection contributes substantially to the good visual quality of GBP (Nie et al., 2018). Besides, it has been confirmed that the max-pooling operation is critical in helping DeconvNet produce human-interpretable visualizations instead of random noise (Nie et al., 2018). However, an interpretation should also be human-interpretable without the local connection and max-pooling operation. To evaluate the performance without the distraction from local connection and max-pooling, we specifically show the interpretations for an MLP model (without max-pooling) trained on MNIST in section 6.2.2.
6.2.2. MLP Models.
We train a three-layer MLP model on 50k training images. Figure 5 shows the saliency maps. The proposed MIPIN interprets the network with the best visual quality. We have the following observations. i) The attributions of the compared backpropagation methods only roughly localize the pixels of the digits, while those of MIPIN achieve an accurate localization. ii) The saliency maps of the compared baselines are mostly noisy and pay partial attention to the background of the image, while the visualization of MIPIN is much sharper and cleaner. iii) The explanation of MIPIN is consistent with human cognition. Specifically, MIPIN implies that the pixels on the digit have a positive effect on the final prediction, while the pixels around the digit play a negative role in the prediction. The regions in the background, which are scored close to zero and shown in white, are not necessary for the classification. This accords with the obvious conclusion that no information about the output is embedded in these areas, in which the pixels are all constant.
6.2.3. VGG19 Model.
We also visualize the saliency maps of eight examples in the shetland sheepdog class produced by MIPIN in Figure 6. Besides, the saliency maps of the seven backpropagation based baselines are shown in Figure 7. Figure 6 shows that the interpretations produced by MIPIN are of good visual quality and accurately locate the object of interest. In addition, the heatmaps of GBP, DTD, and especially PatternAttribution recover some fine-grained details of irrelevant regions, such as a person. These methods suggest that the network is partially paying attention to the background. In contrast, the saliency maps of MIPIN indicate that the network is ignoring the surrounding background, with attributions of almost zero value. These observations imply that MIPIN produces more reasonable explanations.
Method        VGG19   Resnet50
Grads         0.345   0.368
SmoothGrads   0.518   0.541
LRP           0.307   0.292
DTD           0.427   0.441
Deconvnet     0.226   0.263
GBP           0.398   0.456
PatternAttr   0.499
GradCAM       0.486   0.512
MIPIN         0.648   0.662
6.3. Localization Performance
To quantitatively measure the effectiveness of the proposed interpretations, we evaluate the localization performance of MIPIN and the backpropagation-based baselines. We are interested in how well the generated attributions locate the object of interest. A common approach is to compare the saliency maps with the bounding box annotations. Assume the bounding box contains n pixels. We select the top n pixels according to the ranked attribution scores and count how many of them fall inside the bounding box. The ratio of this count to n is used as the metric of localization accuracy (Schulz et al., 2020). We follow the setting of Schulz et al. (2020), which only considers images whose bounding boxes cover less than a given fraction of the input image. We compute the localization accuracy of the VGG19 and Resnet50 models on over 5k validation images, and the results are shown in Table 2. MIPIN obtains the highest bounding box accuracy on both models. Compared with the best baseline (SmoothGrads), MIPIN improves the accuracy by 0.130 and 0.121 for VGG19 and Resnet50, respectively. This substantial improvement supports the claim that MIPIN is consistent with human cognition.
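The bounding box metric described above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used for Table 2: the function name `bbox_localization_accuracy`, the list-of-lists saliency map, and the half-open `(row_min, col_min, row_max, col_max)` box convention are all assumptions made for the example, and ties between equal scores are broken by scan order.

```python
def bbox_localization_accuracy(attribution, bbox):
    """Fraction of the top-n attributed pixels that fall inside the bounding box.

    attribution: 2-D list of scores indexed as [row][col].
    bbox: (row_min, col_min, row_max, col_max); max bounds are exclusive
          (an assumed convention), and the box is assumed to lie inside the map.
    n is the number of pixels covered by the box.
    """
    row_min, col_min, row_max, col_max = bbox
    n = (row_max - row_min) * (col_max - col_min)
    if n <= 0:
        raise ValueError("bounding box must contain at least one pixel")

    # Rank every pixel by its attribution score, highest first.
    scored = [
        (score, r, c)
        for r, row in enumerate(attribution)
        for c, score in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    # Count how many of the n highest-scoring pixels lie inside the box.
    hits = sum(
        1
        for _, r, c in scored[:n]
        if row_min <= r < row_max and col_min <= c < col_max
    )
    return hits / n
```

A saliency map whose top n pixels all lie inside the box scores 1.0, and a map that places half of them on the background scores 0.5. Averaging this value over the evaluated images gives one entry of a table such as Table 2.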
6.4. Class Label Sensitivity
A sound attribution should be sensitive to the label of the target class. Given different labels of interest, the attribution method is supposed to generate class-discriminative saliency maps. However, several works have observed that the saliency maps of GBP and Deconvnet remain almost the same given different class labels (Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016; Nie et al., 2018). Hence we conduct a comprehensive investigation of the class sensitivity of the saliency maps produced by MIPIN and the backpropagation-based baselines.
6.4.1. Qualitative Evaluation
To qualitatively assess class sensitivity, we visualize the saliency maps from MIPIN and seven baselines for the ‘bull mastiff’ (dog) class (top) and the ‘tiger cat’ class (bottom) in Figure 8. GBP, DTD, and PatternAttribution highlight fine-grained details in the image, but generate very similar visualizations for the two classes. This observation demonstrates that these three backpropagation baselines are class-insensitive. Smooth gradients shows better sensitivity, but it highlights both the dog and cat regions when interpreting the ‘cat’ class. In contrast, MIPIN is highly class-sensitive and produces accurate interpretations. Specifically, the ‘dog’ explanation exclusively highlights the ‘dog’ regions but not the ‘cat’ regions (top row), and vice versa (bottom row).
6.4.2. Quantitative Evaluation
To quantitatively describe how the backpropagation-based visualizations change w.r.t. different class labels, we compute the average distance on the Imagenet dataset. For each image, we calculate the distance between the two saliency maps obtained from the class logits of two different class labels, and then average these distances over the images.
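The averaging step can be illustrated as follows. This is only a sketch under stated assumptions: the distance is taken to be the Euclidean (L2) distance between the two maps, and any normalization of the maps is left out; the helper names `l2_distance` and `average_class_distance` are hypothetical and introduced only for this example.

```python
import math


def l2_distance(map_a, map_b):
    """Euclidean distance between two equally sized 2-D saliency maps.

    The choice of the L2 distance is an assumption made for this sketch.
    """
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_a, row_b in zip(map_a, map_b)
            for a, b in zip(row_a, row_b)
        )
    )


def average_class_distance(map_pairs):
    """Mean distance over (map_for_class_1, map_for_class_2) pairs, one per image."""
    if not map_pairs:
        raise ValueError("at least one pair of saliency maps is required")
    distances = [l2_distance(map_a, map_b) for map_a, map_b in map_pairs]
    return sum(distances) / len(distances)
```

Under this measure, a method that returns the same map for both class labels yields a distance of zero for that image, so a small average indicates class insensitivity.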
We evaluate the VGG19 model on Imagenet images. We randomly select two class labels and show the comparison of the average distance statistics in Figure 9. The average distances of MIPIN, Grad, Smooth, and LRP are much larger than those of Deconvnet and GBP, which demonstrates that these four attribution methods are class-sensitive while Deconvnet and GBP are not. Although DTD has a relatively larger distance than GBP and Deconvnet, we conclude that DTD is actually class-insensitive after investigating specific examples. There are two situations for the saliency maps of DTD: i) for some examples, the saliency maps of the two classes (with both positive or both negative logits) remain almost the same, with zero distance; ii) for the other examples, the saliency map values of one class (with a negative logit) are all zero, while those of the other class (with a positive logit) are not, which results in a large distance. Besides, PatternAttribution is much less sensitive than MIPIN. We also evaluate the average distance on a three-layer CNN model trained on MNIST, and the results are shown in Appendix B.
7. Conclusions and Future Work
In this work, we formulate the backpropagation attribution problem as learning a source signal in the input space that maximally preserves the mutual information between input and output. To solve this problem, we propose the mutual information preserving inverse network (MIPIN), which generates the desired source signal by globally and locally inverting the output. Experimental results validate that the interpretations generated by MIPIN are reasonable, consistent with human cognition, and highly sensitive to the class label. In future work, we will explore how to apply MIPIN to generate attributions at intermediate layers and adapt MIPIN to generate class-level attributions.
References
Sanity checks for saliency maps. arXiv preprint arXiv:1810.03292.
iNNvestigate neural networks! Journal of Machine Learning Research.
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one.
How to explain individual classification decisions. The Journal of Machine Learning Research 11, pp. 1803–1831.
Elements of information theory. John Wiley & Sons.
Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976.
Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77.
Interpretable explanations of black boxes by meaningful perturbation. In ICCV.
Understanding deep networks via extremal perturbations and smooth masks. In ICCV.
Deep residual learning for image recognition. In CVPR.
Learning how to explain neural networks: PatternNet and PatternAttribution. arXiv preprint arXiv:1705.05598.
Learning multiple layers of features from tiny images.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
Salient deconvolutional networks. In European Conference on Computer Vision, pp. 120–135.
Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition 65, pp. 211–222.
A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In ICML.
Imagenet large scale visual recognition challenge. IJCV.
Evaluating the visualization of what a deep neural network has learned. TNNLS.
Toward interpretable machine learning: transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631.
Restricting the flow: information bottlenecks for attribution. In ICLR.
Grad-CAM: visual explanations from deep networks via gradient-based localization. In ICCV.
Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70.
Deep inside convolutional networks: visualising image classification models and saliency maps.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
When explanations lie: why many modified BP attributions fail. In ICML.
SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
Axiomatic attribution for deep networks. In ICML.
Interpretable and fine-grained visual explanations for convolutional neural networks. In CVPR.
GNNExplainer: generating explanations for graph neural networks. In NeurIPS.
Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
Top-down neural attention by excitation backprop. IJCV.
Appendix A Network Architectures
In this section, we list the network architectures used in the paper in Tables 3, 4, and 5, respectively. Dropout is adopted during training.
Name      Activation   Output size
Initial   -            (28,28)
Dense1    relu         (512, )
Dense2    relu         (512, )
Dense3    softmax      (10, )

Name      Activation   Kernel size   Output size
Initial   -            -             (28,28,1)
Conv1     relu         (16,5,5)      (24,24,16)
Conv2     relu         (64,3,3)      (22,22,64)
Max1      -            (2,2)         (11,11,64)
Flatten   -            -             (7744, )
Dense1    relu         -             (512, )
Dense2    softmax      -             (10, )

Name      Activation   Kernel size   Output size
Initial   -            -             (32,32,3)
Conv1     relu         (32,3,3)      (30,30,32)
Conv2     relu         (64,3,3)      (28,28,64)
Max1      -            (2,2)         (14,14,64)
Conv3     relu         (64,3,3)      (12,12,64)
Max2      -            (2,2)         (6,6,64)
Flatten   -            -             (2304, )
Dense1    relu         -             (512, )
Dense2    softmax      -             (10, )
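The output sizes listed for the two convolutional architectures can be checked with simple shape arithmetic. The sketch below assumes stride-1 convolutions without padding, non-overlapping pooling windows, and that a kernel entry such as (16,5,5) means 16 filters of size 5x5; these settings are not stated in the tables but are consistent with every listed size. The helper names are hypothetical.

```python
def conv_valid(shape, filters, kernel):
    """Output shape of a stride-1, unpadded convolution with a square kernel."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, filters)


def max_pool(shape, pool):
    """Output shape of non-overlapping max-pooling with a square window."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


def flatten(shape):
    """Number of units after flattening a (height, width, channels) tensor."""
    height, width, channels = shape
    return height * width * channels


def cnn_28x28_shapes():
    """Layer output shapes for the network with a (28,28,1) input."""
    conv1 = conv_valid((28, 28, 1), filters=16, kernel=5)
    conv2 = conv_valid(conv1, filters=64, kernel=3)
    max1 = max_pool(conv2, pool=2)
    return {"Conv1": conv1, "Conv2": conv2, "Max1": max1, "Flatten": flatten(max1)}


def cnn_32x32_shapes():
    """Layer output shapes for the network with a (32,32,3) input."""
    conv1 = conv_valid((32, 32, 3), filters=32, kernel=3)
    conv2 = conv_valid(conv1, filters=64, kernel=3)
    max1 = max_pool(conv2, pool=2)
    conv3 = conv_valid(max1, filters=64, kernel=3)
    max2 = max_pool(conv3, pool=2)
    return {
        "Conv1": conv1,
        "Conv2": conv2,
        "Max1": max1,
        "Conv3": conv3,
        "Max2": max2,
        "Flatten": flatten(max2),
    }
```

Under these assumptions the flattened sizes come out as 11 x 11 x 64 = 7744 and 6 x 6 x 64 = 2304, matching the Flatten rows above.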
Appendix B Average distance on MNIST
We also evaluate a three-layer CNN model on MNIST images to see whether the number of layers influences the conclusion. We obtain similar observations on the three-layer model: MIPIN, Grad, Smooth, and LRP are sensitive to the class labels, while GBP, DTD, and PatternAttribution are not (DTD shows the same two situations as on the VGG19 model). Deconvnet shows good sensitivity, which may be because the trained CNN model has only one max-pooling layer.
Appendix C More Visualizations
In this section, we provide more qualitative visualizations for examples from the ImageNet dataset. In Figure 11 and Figure 12, we illustrate the saliency maps for the hair dryer and ski classes, respectively. They indicate that MIPIN can accurately locate the object of interest. An interesting observation is that the model also pays some attention to the human when providing explanations for the ski class. This is perhaps because humans and skiing equipment typically co-occur in the training set, and the model captures this correlation and exploits it for prediction. In Figure 10, we also compare MIPIN with GradCAM. The results indicate that MIPIN generates more accurate localization. In addition, we provide a comparison between MIPIN and the seven backpropagation-based attribution methods in Figure 13.