Mutual Information Preserving Back-propagation: Learn to Invert for Faithful Attribution

04/14/2021 · by Huiqi Deng et al.

Back-propagation based visualizations have been proposed to interpret deep neural networks (DNNs), and some of them produce interpretations with good visual quality. However, there are doubts about whether these intuitive visualizations are actually related to the network decisions. Recent studies have confirmed this suspicion by verifying that almost all of these modified back-propagation visualizations are unfaithful to the model's decision-making process. Moreover, these visualizations produce vague "relative importance scores," in which a low value does not guarantee that the corresponding feature is irrelevant to the final prediction. Hence, it is highly desirable to develop a novel back-propagation framework that guarantees theoretical faithfulness and produces quantitative attribution scores with a clear interpretation. To achieve this goal, we resort to mutual information theory to generate the interpretations, studying how much information of the output is encoded in each input neuron. The basic idea is to learn a source signal by back-propagation such that the mutual information between the input and the output is preserved as much as possible in the mutual information between the input and the source signal. In addition, we propose a Mutual Information Preserving Inverse Network, termed MIP-IN, in which the parameters of each layer are recursively trained to learn how to invert. During the inversion, a forward relu operation is adopted to adapt the general interpretations to the specific input. We then empirically demonstrate that the inverted source signal satisfies the completeness and minimality properties, which are crucial for a faithful interpretation. Furthermore, the empirical study validates the effectiveness of the interpretations generated by MIP-IN.


1. Introduction

Attribution (Samek et al., 2020; Fong and Vedaldi, 2017; Lundberg and Lee, 2017) is an efficient computational tool to interpret DNNs, inferring the importance score of each input feature to the final decision for a given sample. Back-propagation based visualization (Zeiler and Fergus, 2014; Springenberg et al., 2014; Bach et al., 2015; Montavon et al., 2017; Sundararajan et al., 2017; Shrikumar et al., 2017; Kindermans et al., 2017) is a classic attribution approach, which propagates an importance score from the target output neuron through each layer to the input neurons in one pass. Different propagation rules are adopted in existing work. For example, Deconvnet (Zeiler and Fergus, 2014) maps the output neuron back to the input pixel space by resorting to the backward relu and transposed filters in a layer-wise manner. Deep Taylor decomposition (Montavon et al., 2017) hierarchically decomposes each upper-layer neuron onto the lower-layer neurons according to the upper-layer neuron's first-order Taylor expansion at the nearest root point. Back-propagation based methods are usually efficient and parameter-free; some of them produce interpretations with good visual quality and achieve strong performance on image degradation and object localization tasks. Besides, back-propagation methods enable fine-grained visualizations and layer-wise explanations, which may provide a more comprehensive understanding of the decision-making process of the neural network than perturbation-based attribution methods (Fong and Vedaldi, 2017; Fong et al., 2019).

Despite the human-interpretable visualizations, several works have recently raised doubts about the faithfulness of these back-propagation visualizations (Adebayo et al., 2018; Nie et al., 2018; Sixt et al., 2020).

Do the pretty visualizations reliably reflect why the network makes a particular decision, e.g., classifying an image as a cat?

A long-standing concern is that the theoretical rationales of most back-propagation methods are confusing and hard to pin down (Nie et al., 2018; Kindermans et al., 2017). For example, the original purpose of the transposed filters and the backward relu in Deconvnet is to preserve the neuron size and ensure non-negative neurons in intermediate layers (Zeiler and Fergus, 2014), neither of which has a direct connection to interpreting the final prediction. As another example, Guided Back-propagation (GBP) (Springenberg et al., 2014) significantly improves the visualization merely by combining the forward relu of the saliency map with the backward relu of Deconvnet. Interpretations built on such perplexing behaviors cannot be fully trusted.

Recent studies have further confirmed the above-mentioned suspicions and verified that most modified back-propagation interpretations are not faithful to the model's decision-making process. A theoretical analysis has revealed that Deconvnet and GBP are essentially doing (partial) image recovery, which is unrelated to the network decisions (Nie et al., 2018). Empirical evidence has also been provided: it is widely held that a method insensitive to the class labels and to the model parameters cannot explain the network's prediction faithfully (Nie et al., 2018; Adebayo et al., 2018), yet previous observations (Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016) have shown that the visualizations of Deconvnet and GBP remain almost the same given different class labels. These investigations and conclusions have been extended to most back-propagation based attribution methods, including Deep Taylor decomposition (Montavon et al., 2017), Layer-wise Relevance Propagation (LRP) (Bach et al., 2015), Excitation BP (Zhang et al., 2018), PatternAttribution (Kindermans et al., 2017), Deconvnet, and GBP. The investigations demonstrate that the attributions of these methods are independent of the parameters in the layers of the classification module (Sixt et al., 2020) and hence fail the sanity check (Adebayo et al., 2018). In addition, existing back-propagation attribution methods generate a vague "relative importance score", in which the numerical values have no practical implication. Specifically, low relevance values do not guarantee that the network ignores the corresponding part during the decision-making process (Schulz et al., 2020).

Thus, it is highly desirable to develop a novel back-propagation framework that guarantees theoretical faithfulness and produces a quantitative attribution score with a clear interpretation. To achieve these two goals, we resort to mutual information theory, which is capable of generating sound interpretations and estimating the amount of information. We aim to study how much information of the prediction is encoded in each input neuron around the neighborhood of a given sample. It is intractable to directly deal with the per-sample mutual information, which is notoriously hard to estimate. Motivated by mask-learning methods (Fong and Vedaldi, 2017), which learn a masked input that maximally preserves the discriminative information, our basic idea is to learn a source signal via back-propagation. The source signal should be such that the mutual information between the input and the output is preserved as much as possible in the mutual information between the input and the source signal. We further formulate this objective as minimizing the conditional entropy of the input given the source signal, which can be approximately achieved via a reconstruction-type objective function (Fano's inequality). To solve the resulting optimization problem, we propose the Mutual Information Preserving Inverse Network, termed MIP-IN, a layer-wise inverse network. MIP-IN recursively retrains the parameters of each layer to learn how to invert, and then adopts the forward relu to adapt the global framework to each specific input.

We conduct experiments to validate the effectiveness of the proposed MIP-IN framework. Firstly, we empirically validate that the source signal produced by MIP-IN satisfies two desirable properties: completeness and minimality. Secondly, we show that MIP-IN generates human-interpretable visualizations for models without local convolutions, where other back-propagation based methods fail to do so. Thirdly, we demonstrate that MIP-IN produces visualizations of good quality, which accurately locate the object of interest. Finally, we investigate the sensitivity of the interpretations w.r.t. the class labels and observe that MIP-IN is highly sensitive to changes of the class label, while most modified back-propagation attribution methods are not. In summary, this paper has three main contributions:


  • We formulate the back-propagation attribution problem as learning a source signal in the input space maximally preserving the mutual information between input and output.

  • We propose a novel mutual information preserving inverse network, termed MIP-IN, to learn how to invert for attribution.

  • Experimental results on two benchmark datasets validate that the proposed MIP-IN framework generates attribution heatmaps of high quality that are faithful to the decision-making process of the underlying DNN.

2. Related work

In this section, we provide an overview of existing attribution methods that could infer the importance of each input feature to the final prediction for a given input sample.

Figure 1. Summary of existing back-propagation structures. Grads, Smooth grads, Integrated Grads, LRP, DTD, and PatternAttribution adopt the forward structure. Deconvnet adopts the backward structure, while GBP adopts the forward and backward structure.

2.1. Back-Propagation Based Attribution

Back-propagation based attribution is an efficient attribution approach that propagates an importance score from an output neuron through each layer to the input neurons in one pass. Many back-propagation methods have been proposed, each based on a different propagation rule.

Gradient (Baehrens et al., 2010) describes the sensitivity of the output neuron with respect to each input neuron; during back-propagation it masks out the positions where the bottom data (forward activations) are negative, i.e., the forward relu. Deconvnet (Zeiler and Fergus, 2014) aims to map the output neuron back to the input pixel space. To keep the neuron size and the non-negativity property, it resorts to transposed filters and the backward relu, which masks out the negative entries of the top gradients. Guided Back-propagation (GBP) (Springenberg et al., 2014) combines Gradient and Deconvnet, applying both the forward relu and the backward relu. As a result, GBP significantly improves the visual quality of the visualizations.
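For concreteness, the three relu propagation rules just described can be written as element-wise masks applied to the signal arriving from the layer above; the sketch below uses our own notation and is not code from the cited papers.

import numpy as np

def relu_backward_rules(upstream, pre_activation):
    """upstream: signal/gradient from the layer above; pre_activation: the
    forward-pass input to the relu (the bottom data)."""
    forward_mask = (pre_activation > 0).astype(upstream.dtype)   # forward relu
    backward_mask = (upstream > 0).astype(upstream.dtype)        # backward relu
    gradient = upstream * forward_mask                     # plain gradient / saliency
    deconvnet = upstream * backward_mask                   # Deconvnet
    guided_bp = upstream * forward_mask * backward_mask    # GBP combines both masks
    return gradient, deconvnet, guided_bp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(relu_backward_rules(rng.normal(size=4), rng.normal(size=4)))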

In addition, other back-propagation attributions follow the idea of directly or hierarchically decomposing the value of the target neuron onto the input neurons. Layer-wise Relevance Propagation (LRP) (Bach et al., 2015) distributes the relevance score to the neurons of the lower layer according to their proportions in the linear combination. DeepLift (Shrikumar et al., 2017) adopts a similar linear rule, but decomposes the difference between the output and a reference output in terms of the difference between the input and a pre-set reference input, instead of using the output value alone. Integrated Gradients (Sundararajan et al., 2017) also decomposes the output difference, by integrating the gradients along a straight path interpolating between the input sample and a reference point, which corresponds to the Aumann-Shapley payoff assignment in cooperative game theory. Smooth Gradients (Smilkov et al., 2017) is similar in spirit, averaging the gradients over a local neighborhood sampled by adding Gaussian noise. Deep Taylor decomposition (DTD) (Montavon et al., 2017) hierarchically decomposes each upper-layer neuron onto the lower-layer neurons according to the upper-layer neuron's first-order Taylor expansion at the nearest root point. PatternAttribution (Kindermans et al., 2017) extends DTD by training a function from data to learn the root points. The different back-propagation structures are summarized in Figure 1.

Nevertheless, recent studies  (Nie et al., 2018; Adebayo et al., 2018; Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016; Schulz et al., 2020) have indicated that most modified back-propagation based interpretations are not faithful to the model’s decision-making process. This motivates us to develop a novel back-propagation framework that guarantees theoretical faithfulness and produces a quantitative attribution score with a clear understanding.

2.2. Perturbation-Based Attribution

Apart from the back-propagation based attribution approach, many perturbation-based attribution methods have been proposed. They generate a meaningful explanation by informative subset learning for CNNs (Fong and Vedaldi, 2017; Dabkowski and Gal, 2017; Wagner et al., 2019; Schulz et al., 2020) and GNNs (Ying et al., 2019). The basic idea is to maximize the mutual information between the prediction of a neural network and the distribution of possible subset structures $x_s$. For a pre-trained neural network $f$, let $x$ and $y$ denote its input variable and the corresponding final prediction variable. Then the informative subset learning problem can be formulated as

$$\max_{x_s} \; I(y; x_s),$$

so as to find the most informative subset $x_s$ for the prediction $y$. To make $x_s$ most compact, a constraint on the size of $x_s$ or a sparsity constraint is usually imposed. Note that $I(y; x_s) = H(y) - H(y \mid x_s)$ and the entropy term $H(y)$ is constant because $f$ is fixed for a pre-trained network. As a result, maximizing the mutual information is equivalent to minimizing the conditional entropy $H(y \mid x_s)$, that is:

$$\min_{x_s} \; H(y \mid x_s). \qquad (1)$$

The distribution of possible subset structures for a given input sample $x$ is often constructed by introducing a Gaussian noise mask to the input. Specifically, the perturbed input variable is $\tilde{x} = m \odot x + (1 - m) \odot \epsilon$, in which $\epsilon$ is Gaussian noise and the mask vector $m$ needs to be optimized to learn the most informative subset. Besides, when end-users come to the question "why does the trained model predict the instance as a certain class label", the above conditional entropy is usually replaced by a cross-entropy objective between the label class and the model prediction.
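To illustrate this formulation, the sketch below optimizes a soft mask under a cross-entropy plus sparsity objective with Gaussian-noise perturbation. The tiny model, the hyper-parameters, and the exact parameterization of the mask are illustrative assumptions, not the settings of the cited methods.

import torch
import torch.nn.functional as F

def learn_mask(model, x, target, steps=300, lam=0.05, sigma=0.2, lr=0.1):
    """Learn a soft mask m in [0, 1] so that m*x + (1-m)*noise keeps the
    target-class prediction while the mask stays sparse."""
    m = torch.full_like(x, 0.5, requires_grad=True)
    opt = torch.optim.Adam([m], lr=lr)
    for _ in range(steps):
        m01 = m.clamp(0, 1)
        x_tilde = m01 * x + (1 - m01) * sigma * torch.randn_like(x)
        loss = F.cross_entropy(model(x_tilde), target) + lam * m01.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return m.detach().clamp(0, 1)

if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    mask = learn_mask(model, torch.randn(1, 1, 28, 28), torch.tensor([3]))
    print(mask.shape)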

Informative subset learning methods generate meaningful explanations, but they suffer from some drawbacks: i) they incur the risk of triggering artifacts of the black-box model, and ii) the produced explanations are forced to be smooth, which prevents fine-grained evidence from being visualized.

3. Learn to invert with mutual information preserving

In this section, we first introduce the instance-level attribution problem that we aim to tackle, and then discuss the mutual information preserving principle that a back-propagation based attribution should follow in order to generate faithful interpretations.

3.1. Problem Statement

To interpret a pre-trained network $f$, an attribution method aims to infer the contribution score of each input feature to the final decision for a given sample (Du et al., 2019). Back-propagation based attribution is an efficient and parameter-free attribution approach, which usually enables fine-grained explanations. Different from existing back-propagation based methods, we develop a novel back-propagation explanation method from the mutual information perspective. Specifically, we try to solve the attribution problem by answering the following two research questions: i) In a global view, for the input variable $x$ and output variable $y$, how much information of $y$ is encoded in each input feature? ii) In a local view, for a specific input $x_0$ to be explained, how much information of $y$ in a local neighborhood of $x_0$ is encoded in each input feature?

3.2. Problem Formulation

The informative subset learning approach (Fong and Vedaldi, 2017) learns a masked input which masks out the irrelevant region so as to maximally preserve the information related to the final prediction. Motivated by this approach, we answer the first question by inverting the output variable back to the input (pixel) space under the mutual information preserving principle. Specifically, we aim to learn a set of source signals $s$ and distractor signals $d$ of the input signals $x$ in the input space, such that each input signal $x$ is composed of its corresponding source signal $s$ and distractor $d$, i.e., $x = s + d$. Here, $s$ is the desired inverted signal. We hope the inverted source signal $s$ contains almost all the signal of interest correlating with the final prediction $y$, while the distractor signal $d$ represents the remaining part in which no information of $y$ can be captured. More formally, the mutual information $I(x; y)$ between input and output prediction should be preserved as much as possible in the mutual information $I(x; s)$ between $x$ and $s$, not in $I(x; d)$. Hence, the inverted source signal could be optimized by:

$$\min_{s} \; I(x; y) - I(x; s). \qquad (2)$$

It can be observed that the inverted source signal $s$ can be expressed as a function of the prediction $y$, i.e., $s = g(y)$, when no extra variable except for $y$ is introduced during the back-propagation process. According to the data processing inequality of mutual information, we have $I(x; s) \le I(x; y)$, which means the source signal does not introduce information beyond the prediction $y$. The term $I(x; y)$ is constant for a pre-trained network, hence maximally preserving the mutual information between input and output is equivalent to maximizing the mutual information between $x$ and the source signal $s$, i.e.,

$$\max_{s} \; I(x; s). \qquad (3)$$

Because the entropy $H(x)$ of the input is constant, optimizing equation (3) is equivalent to minimizing the conditional entropy $H(x \mid s)$, which can be expressed as $\min_{s} H(x \mid s)$.

It is intractable to compute $H(x \mid s)$ directly, since we need to integrate over the distribution of $x$ given $s$. Hence, instead of directly optimizing $H(x \mid s)$, we use a reconstruction-type objective to approximately optimize this conditional entropy, whose feasibility is guaranteed by the following Theorem 1.

Theorem 1 (Fano's inequality).

Let the random variables $x$ and $s$ represent the input and output messages with a joint probability $p(x, s)$. Let $E$ represent the occurrence of a reconstruction error event, i.e., $E = \{\hat{x} \neq x\}$, with $\hat{x} = \hat{x}(s)$ being an approximate version of $x$ reconstructed from $s$. We have Fano's inequality,

$$H(x \mid s) \;\le\; H_b(P_e) + P_e \log\big(|\mathcal{X}| - 1\big),$$

where $\mathcal{X}$ is the alphabet of $x$ and $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$; $P_e = \Pr(\hat{x} \neq x)$ is the probability of a reconstruction error and $H_b(P_e)$ is the corresponding binary entropy.

The proof of Theorem 1 is widely accessible in standard references such as (Cover, 1999). Fano's inequality states the relationship between the conditional entropy $H(x \mid s)$ and the probability of reconstruction error $P_e$. The inequality demonstrates that minimizing the conditional entropy can be approximately achieved by minimizing its upper bound, i.e., minimizing the probability $P_e$. Hence it is feasible to minimize the reconstruction error between $x$ and $s$ to guarantee that the amount of information of $x$ not encoded in $s$ is small. Assuming $x$ and $s$ are normally distributed, the probability of reconstruction error is minimized by minimizing the mean square error between $x$ and $s$, that is,

$$\min_{s} \; \mathbb{E}\,\lVert x - s \rVert_2^2. \qquad (4)$$

Then the mutual information between the input $x$ and its source signal $s$ characterizes the importance of each input feature to the final prediction globally.
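For reference, the chain of reductions used above can be restated compactly (with $\hat{x}(s)$ denoting the reconstruction of $x$ from $s$); this is a summary, not an additional result:

$$\max_{s}\; I(x; s) \;\Longleftrightarrow\; \min_{s}\; H(x \mid s) \qquad \text{since } I(x; s) = H(x) - H(x \mid s) \text{ and } H(x) \text{ is constant},$$

$$H(x \mid s) \;\le\; H_b(P_e) + P_e \log\big(|\mathcal{X}| - 1\big) \qquad \text{Fano's inequality, with } P_e = \Pr\big[\hat{x}(s) \neq x\big],$$

$$\min_{s}\; \mathbb{E}\,\lVert x - s \rVert_2^2 \qquad \text{surrogate for } \min_{s} P_e \text{ under the Gaussian assumption}.$$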

The situation is slightly different when the explanation is required from a local view, for a specific input $x_0$. The above formulation emphasizes the features of interest globally, averaging over the whole data distribution, whereas for a specific input we are more interested in the model behavior around its local neighborhood. Hence, we need to adjust the global explanations to the specific input. We conduct this adaptation by incorporating prior information about the neighborhood of $x_0$; the detailed methodology of the local adaptation is elaborated in Section 4.2.

4. MIP-IN: Mutual Information Preserving Inverse Network

In this section, we propose Mutual Information Preserving Inverse Network (MIP-IN), which is a layer-wise inverse network (Figure 2) based on the mutual information preserving principle introduced in the previous section.

In MIP-IN, the mutual information between the input $x$ and the output prediction $y$ is maximally preserved in the inverted source signal $s$ during back-propagation. Assume the DNN has $L$ layers. We construct the inverse function $g$ as a multiple composite function, i.e.,

$$s = g(y) = g_1 \circ g_2 \circ \cdots \circ g_L(y),$$

where $y$ denotes the output neuron and $g_l$ denotes the inverse function at the $l$-th layer; the source signal $s$ is obtained via layer-wise back-propagation. According to the objective of equation (4) in Section 3.2, we learn the inverse function by minimizing the L2 norm of the difference between the input $x$ and the source signal $s$,

$$\min_{g_1, \ldots, g_L} \; \lVert x - g_1 \circ g_2 \circ \cdots \circ g_L(y) \rVert_2^2.$$

To guarantee faithfulness to the model, we further require that the mutual information between every two adjacent layers is also maximally preserved during the inversion. Hence we recursively learn the inverse functions from the upper layers to the bottom layers. We initialize $s^{(L)} = y$. Then, $g_L$ aims to reconstruct the neuron entries at the $(L-1)$-th layer, which is expressed as

$$\min_{g_L} \; \lVert a^{(L-1)} - g_L(s^{(L)}) \rVert_2^2,$$

where $a^{(l)}$ denotes the intermediate feature map at the $l$-th layer and $s^{(l)}$ represents the corresponding source signal of $a^{(l)}$ at the $l$-th layer. Similarly, at the $l$-th layer with $s^{(l)}$ obtained, $g_l$ aims to reconstruct the neuron entries at the $(l-1)$-th layer, i.e.,

$$\min_{g_l} \; \lVert a^{(l-1)} - g_l(s^{(l)}) \rVert_2^2. \qquad (5)$$

The computational flow of MIP-IN is shown in Figure 2.

In the following, we introduce the proposed MIP-IN framework in detail. Firstly, we design the type of inverse function $g_l$ for different types of layers; the parameters of $g_l$ at each layer are optimized by minimizing the reconstruction error. Then, we elaborate on how to leverage the activation-switch information to adapt to a specific input sample $x_0$. Finally, we introduce how to generate the interpretations based on MIP-IN. The overall pipeline is summarized in Algorithm 1, where the back-propagation operation follows the forward structure in Figure 1.

Figure 2. Computational flow of the MIP-IN framework. The inverted source signal is recursively propagated from the output layer to the input layer, and $s^{(l)}$ represents the intermediate inverted signal at the $l$-th layer. Red and blue arrows represent positive and negative parameters, respectively. The symbol above a parameter means that the parameter of MIP-IN at the $l$-th layer is learned by using the inverted source signal to reconstruct the original signal at the $(l-1)$-th layer, and similarly for the other layers.

4.1. Design of The Inverse Function

In this subsection, we design the inverse function $g_l$ for (i) dense, (ii) convolutional, and (iii) max-pooling layers, respectively. We focus on the block at the $l$-th layer, which uses the intermediate inverted source signal $s^{(l)}$ to reconstruct $a^{(l-1)}$, as shown in Equation (5). For simplicity, we write the neurons at the $(l-1)$-th layer and the inverted signal at the $l$-th layer as $a$ and $s$.

Dense Layer: For the $l$-th dense layer, we define the corresponding back-propagation inverse function as a linear function, which can be expressed as $g_l(s) = W s + b$. According to the recursive reconstruction objective, we need to minimize the reconstruction error between $a$ and $g_l(s)$, i.e., $\min_{W,\,b} \lVert a - (W s + b) \rVert_2^2$. Then $W$ and $b$ have a unique closed-form solution:

$$W = \bar{a}\,\bar{s}^{\top}\big(\bar{s}\,\bar{s}^{\top}\big)^{-1}, \qquad b = \mathbb{E}[a] - W\,\mathbb{E}[s], \qquad (6)$$

where $\bar{a}$ and $\bar{s}$ represent the centralized input and inverted source signal, respectively. To avoid over-fitting, we adopt an L2 norm to regularize the parameters $W$. The solution then becomes $W = \bar{a}\,\bar{s}^{\top}\big(\bar{s}\,\bar{s}^{\top} + \lambda I\big)^{-1}$, where $\lambda$ controls the trade-off between the reconstruction error and the L2 norm regularization.
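A minimal NumPy sketch of this closed-form ridge solution; the matrix layout (samples in rows) and the value of the regularization coefficient are our own choices.

import numpy as np

def fit_dense_inverse(A, S, lam=1e-2):
    """Closed-form ridge solution for the dense-layer inverse g(s) = W s + b.

    A   : (n_samples, d_lower)  lower-layer neurons to be reconstructed
    S   : (n_samples, d_upper)  inverted source signal from the layer above
    lam : L2 regularization strength (illustrative value)
    """
    A_mean, S_mean = A.mean(axis=0), S.mean(axis=0)
    A_c, S_c = A - A_mean, S - S_mean                    # centralized signals
    # W = A_c^T S_c (S_c^T S_c + lam I)^{-1}; the bias b absorbs the means.
    W = A_c.T @ S_c @ np.linalg.inv(S_c.T @ S_c + lam * np.eye(S.shape[1]))
    b = A_mean - W @ S_mean
    return W, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = fit_dense_inverse(rng.normal(size=(200, 784)), rng.normal(size=(200, 512)))
    print(W.shape, b.shape)                              # (784, 512) (784,)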

Convolutional Layers: A convnet uses learned filters to convolve the feature maps from the previous layer. It has been shown that the transposed-filter structure can partially recover the signal (Nie et al., 2018). Besides, such a structure accurately describes the correlation relationships, i.e., which neurons at the lower layer are related to which neurons at the next upper layer. Hence we set the transposed-filter structure as the inverse function for convolutional layers, where the kernel parameter $\hat{K}$ of $g_l$ can be obtained by minimizing the reconstruction objective, i.e.,

$$\min_{\hat{K}} \; \big\lVert a - g_l(s;\, \hat{K}) \big\rVert_2^2, \qquad (7)$$

where $g_l(\cdot\,;\hat{K})$ is a transposed convolution with learnable kernel $\hat{K}$. Note that this is different from Deconvnet and GBP, which directly use the transposed version of the same filter in the original network.
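A sketch of fitting such a learned transposed-filter inverse by gradient descent on the reconstruction error of Equation (7); the layer sizes, step count, and learning rate are illustrative assumptions.

import torch

def fit_conv_inverse(a_lower, s_upper, kernel_size=3, steps=200, lr=1e-2):
    """Learn a transposed-convolution inverse for one convolutional block by
    minimizing the reconstruction error between the lower-layer feature map
    a_lower (n, c_low, h, w) and the reconstruction from the inverted signal
    s_upper (n, c_up, h', w')."""
    inverse = torch.nn.ConvTranspose2d(
        s_upper.shape[1], a_lower.shape[1], kernel_size, bias=False)
    opt = torch.optim.Adam(inverse.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.mean((inverse(s_upper) - a_lower) ** 2)   # MSE form of Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return inverse

if __name__ == "__main__":
    a = torch.randn(8, 16, 24, 24)    # lower-layer maps (3x3 conv, stride 1 upstream)
    s = torch.randn(8, 32, 22, 22)    # inverted signal from the upper layer
    g = fit_conv_inverse(a, s)
    print(g(s).shape)                 # torch.Size([8, 16, 24, 24])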

Max-pooling Layers: We follow the design of Deconvnet for the max-pooling operation, which obtains an approximate inversion by recording the locations of the maxima within each pooling region in a set of switch variables. We use these recorded switches to place the intermediate inverted signal into the appropriate locations, preserving the information. The operation is formalized as

$$g_l(s) = \mathrm{unpool}\big(s;\ \mathrm{switch}\big), \qquad (8)$$

where the unpooling places each entry of $s$ element-wise at its recorded maximum location and sets the remaining locations to zero.
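This switch-based unpooling is available off the shelf; a minimal PyTorch sketch with illustrative shapes:

import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 12, 12)                    # lower-layer feature map
pooled, switches = F.max_pool2d(x, kernel_size=2, return_indices=True)
# Place the (inverted) signal back at the recorded maximum locations; all
# other locations are filled with zeros. Here we simply unpool the pooled map.
inverted = F.max_unpool2d(pooled, switches, kernel_size=2, output_size=x.shape[-2:])
print(inverted.shape)                            # torch.Size([1, 8, 12, 12])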

0:  Input: input neurons $x$, output neurons $y$, and the intermediate-layer neurons $a^{(l)}$ for $N$ samples; the target output neuron index of interest $c$.
1:  Initialization: $s^{(L)} \leftarrow y_c$;
2:  for $l = L, L-1, \ldots, 1$ do
3:     feed $a^{(l-1)}$ and $s^{(l)}$;
4:     if dense layer then
5:        learn $g_l(s) = W^{(l)} s + b^{(l)}$ by the closed-form solution of Equation (6);
6:     else if convolutional layer then
7:        learn the transposed-filter inverse $g_l$ by Equation (7);
8:     else if max-pooling layer then
9:        set $g_l$ to the switch-based unpooling of Equation (8);
10:     end if
11:     $s^{(l-1)} \leftarrow g_l(s^{(l)})$;
12:     extract the linear weight $w^{(l)}$ of $g_l$ as the attribution at the $l$-th layer (ignoring the bias);
13:     accumulate the attribution $r^{(l-1)}$ as the superposition of $w^{(l)}$ with the upper-layer attributions;
14:     store $s^{(l-1)}$ and $r^{(l-1)}$ for the next iteration;
15:     Forward relu: $s^{(l-1)} \leftarrow s^{(l-1)} \odot \mathbb{1}[a^{(l-1)} > 0]$;
16:     Forward relu: $r^{(l-1)} \leftarrow r^{(l-1)} \odot \mathbb{1}[a^{(l-1)} > 0]$;
17:  end for
18:  return the parameters of the inverse function $g_l$ for each layer, the inverted source signals $s$, and the attribution vectors $r$ for the $N$ samples
Algorithm 1 Mutual Information Preserving Inverse Network
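To make the recursion concrete, the sketch below fits linear (dense-layer) inverses from the logits down to the input and applies the forward relu of Section 4.2 for local adaptation. It covers only fully-connected networks, omits the attribution vectors, and uses our own names; it is not the released implementation.

import numpy as np

def train_mip_in(acts, lam=1e-2):
    """Recursively fit linear inverses, top layer to bottom.

    acts : list of activation matrices [a_0, ..., a_L], each (n_samples, d_l),
           where a_0 is the input and a_L holds the logits.
    """
    inverses, s = [], acts[-1]                        # initialize s with the logits
    for l in range(len(acts) - 1, 0, -1):
        A, S = acts[l - 1], s
        A_c, S_c = A - A.mean(0), S - S.mean(0)
        W = A_c.T @ S_c @ np.linalg.inv(S_c.T @ S_c + lam * np.eye(S.shape[1]))
        b = A.mean(0) - W @ S.mean(0)
        inverses.insert(0, (W, b))
        s = S @ W.T + b                               # inverted signal one layer down
    return inverses, s                                # s: global source signal (input space)

def invert_single(acts_x, inverses):
    """Locally adapted inversion for one sample: zero the inverted signal wherever
    the sample's own activation was zero (the forward relu / activation switch)."""
    s = acts_x[-1]
    for l in range(len(inverses), 0, -1):
        W, b = inverses[l - 1]
        s = W @ s + b
        if l - 1 > 0:                                 # the raw input layer is left untouched
            s = s * (acts_x[l - 1] > 0)
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = [rng.normal(size=(100, d)) for d in (784, 512, 512, 10)]
    inverses, s_global = train_mip_in(acts)
    s_local = invert_single([a[0] for a in acts], inverses)
    print(s_global.shape, s_local.shape)              # (100, 784) (784,)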

4.2. Forward Relu for Local Adaptation

The above framework emphasizes the features of interest globally, averaging over the whole data distribution, and thus contains no specific information for a given sample. For a specific input $x_0$, however, we are interested in how much information of $y$ around its local neighborhood is encoded in each input feature. Hence, we need to adapt the global framework to explain a specific input.

The basic idea is to explore and leverage the neighborhood information of $x_0$. An intuitive way is to draw neighborhood samples according to a prior distribution around $x_0$ and train the inverse functions on those samples. However, such an approach is clumsy, because we would need to retrain the network parameters for each sample to be interpreted, which is not scalable for a large DNN.

Instead, we adopt a shortcut by leveraging the activation-switch information, i.e., whether the neurons at intermediate layers are activated. We find that the activation switch is an effective indicator of neighborhood information. Specifically, consider the $j$-th neuron at the $l$-th layer of the sample $x_0$, which is obtained by $a^{(l)}_j = \mathrm{relu}\big(w_j^{\top} a^{(l-1)} + b_j\big)$, where $w_j$ and $b_j$ denote the corresponding weights and bias, respectively. A non-activated neuron implies that the linear combination $w_j^{\top} a^{(l-1)} + b_j \le 0$. By a continuity argument, the linear combination remains below zero with high probability for samples whose intermediate feature map at the $(l-1)$-th layer is close to that of $x_0$. Therefore, for these neighboring samples, the corresponding neurons are also non-activated with high probability.
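A quick numerical illustration of this continuity argument on a random dense layer (the noise scale is illustrative, not taken from the paper):

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(512, 784)), rng.normal(size=512)
x0 = rng.normal(size=784)
switch0 = (W @ x0 + b) > 0                       # activation switch of the sample itself

agreement = []
for _ in range(100):                             # small perturbations around x0
    x = x0 + 0.01 * rng.normal(size=784)
    agreement.append(np.mean(((W @ x + b) > 0) == switch0))
print(f"mean switch agreement in the neighborhood: {np.mean(agreement):.3f}")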

Hence, we leverage the activation-switch information at each layer to adapt the global framework to the specific input $x_0$. Specifically, we adopt the forward relu, i.e., we force the locations that are non-activated in the original neurons to also be non-activated in the inverted source signals:

$$s^{(l)} \leftarrow s^{(l)} \odot \mathbb{1}\big[a^{(l)}(x_0) > 0\big]. \qquad (9)$$

The forward relu operation recovers the neighborhood information of the specific input $x_0$.

Figure 3. Globally and locally produced source signals and attribution vectors of samples.

4.3. Generating Interpretations

We generate the interpretations by studying how much information of the output neuron $y$ is encoded in the input neurons via the source signal $s$, which maximally preserves the mutual information between the input $x$ and the output $y$.

As the inverse functions for the dense and convolutional layers are essentially linear functions, we can extract the corresponding linear weight $w^{(l)}$ as the attribution at the $l$-th layer and ignore the irrelevant bias terms (Algorithm 1, line 12). The attribution is calculated as a superposition of these weights across layers (Algorithm 1, line 13). When adapting to a specific input, we adopt the forward relu operation of Section 4.2 (Algorithm 1, line 16).

The effect of local adaptation is shown in Figure 3. We observe that without the forward relu, the produced global source signal and attribution vector remain almost the same for different samples of the same class. This is mainly because the global framework averages over the whole data distribution. After adopting the forward relu for local adaptation, the framework adapts well to each specific input. Besides, the attributions after local adaptation turn out to be specializations of the global attributions, emphasizing similar parts of the digit.

Models         MLP-M    CNN-M    CNN-C
APC            10.10%   16.5%    21.1%
Positive APC   2.6%     2.3%     10.6%

Table 1. APCs and Positive APCs of the source signal. MLP-M, CNN-M, and CNN-C represent the MLP model trained on MNIST, the CNN model trained on MNIST, and the CNN model trained on the Cifar-10 dataset, respectively.

Figure 4. Samples and their source signals, with logits shown in the title. The top, middle, and bottom rows visualize the results of the MLP-M, CNN-M, and CNN-C models, respectively.

5. Evaluation of the Source Signal

In this section, we introduce how to evaluate the performance of the source signals. The source signal $s$ is expected to satisfy two properties: i) Completeness: $s$ preserves almost all the information of the output $y$ and the distractor signal $d$ captures no relevant information of $y$, which means $I(y; s) \approx I(y; x)$ and $I(y; d) \approx 0$; ii) Minimality: no redundant information (irrelevant to $y$) is contained in $s$, which means no information outside of $y$ is introduced into $s$.

The produced source signal naturally satisfies minimality, because $s$ is constructed in a back-propagation manner from $y$, in which almost no extraneous information is introduced. Hence, we focus on evaluating the completeness property. $I(y; s)$ should be close to $I(y; x)$ when the conditional entropy $H(y \mid s)$ is close to $H(y \mid x)$. The two conditional entropies are equivalent when the output of $s$ is the same as the output of $x$, i.e., $f_c(s) = f_c(x)$, where $f_c(\cdot)$ denotes the logit of the target class $c$ in the network. We adopt the average percentage change (APC) as the metric to measure the difference between $f_c(x)$ and $f_c(s)$,

$$\mathrm{APC} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{\big|f_c(x_i) - f_c(s_i)\big|}{\big|f_c(x_i)\big|}, \qquad (10)$$

where $x_i$ and $s_i$ represent the input and the inverted source signal of the $i$-th sample, $C$ is the number of classes, and $N_c$ is the number of samples in the $c$-th class. We also propose a relaxed metric, termed positive average percentage change (Positive APC), which replaces the absolute value with a relu function,

$$\mathrm{Positive\ APC} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{\mathrm{relu}\big(f_c(x_i) - f_c(s_i)\big)}{\big|f_c(x_i)\big|}, \qquad (11)$$

because $s_i$ can be considered to contain sufficient information for the prediction when $f_c(s_i) \ge f_c(x_i)$. The completeness property is satisfied when the APCs are very small.
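A sketch of how Equations (10) and (11) can be computed from the target-class logits of the inputs and of their source signals; the array layout and names are ours.

import numpy as np

def apc(logits_x, logits_s, labels, positive=False):
    """Average percentage change between target-class logits of the inputs
    (logits_x) and of their inverted source signals (logits_s), both (N, C);
    labels is the (N,) array of target classes."""
    logits_x, logits_s, labels = map(np.asarray, (logits_x, logits_s, labels))
    idx = np.arange(len(labels))
    diff = logits_x[idx, labels] - logits_s[idx, labels]
    diff = np.maximum(diff, 0.0) if positive else np.abs(diff)
    change = diff / np.abs(logits_x[idx, labels])
    # average within each class first, then over classes
    return float(np.mean([change[labels == c].mean() for c in np.unique(labels)]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lx, ls = rng.normal(size=(100, 10)), rng.normal(size=(100, 10))
    lab = rng.integers(0, 10, size=100)
    print(apc(lx, ls, lab), apc(lx, ls, lab, positive=True))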

Figure 5. Visualization of saliency maps compared with seven back-propagation baselines on MNIST. MIP-IN implies that 1) the pixels on the digit have a positive effect on the final prediction, 2) the pixels around the digit play a negative role in the prediction, and 3) the regions in the background, which are scored close to 0 and shown in white, are not necessary for the classification.

We evaluate the completeness property on three models: i) a multilayer perceptron (MLP) model trained on the MNIST dataset (LeCun, 1998); ii) a CNN model trained on MNIST; and iii) a CNN model trained on the Cifar-10 dataset (Krizhevsky et al., 2009). The detailed network architectures are listed in the Appendix. We report the APCs and Positive APCs in Table 1, and also visualize the inverted source signals of several examples, tagging their classification logits, in Figure 4. Table 1 shows that the APCs, and especially the Positive APCs, on these models are very small. Besides, the visualization in Figure 4 validates that the logits of the source signals are very similar to the logits of the original input signals. This evidence implies that the produced source signals satisfy the completeness property well.

6. Experiments

In this section, we conduct experiments to evaluate the effectiveness of the interpretations produced by the proposed MIP-IN framework from three aspects. Firstly, we visualize the interpretation results of different model architectures on the MNIST and Imagenet datasets in Section 6.2. Secondly, we test the localization performance by comparing the produced interpretations with the bounding box annotations in Section 6.3. Finally, we investigate the sensitivity of the interpretations with respect to class labels in Section 6.4.

Figure 6. The saliency maps produced by MIP-IN for shetland sheepdog class. Our results indicate that the VGG19 model mainly focuses on the object of interest, while ignoring the background regions.
Figure 7. Visualization of saliency maps for seven back-propagation based baselines. They usually give some noisy scores to the background regions.

6.1. Experimental Setups

6.1.1. Baseline Methods

We compare with seven popular back-propagation based attribution methods: Grads (Simonyan et al., 2014), Smooth Gradients (Smilkov et al., 2017), LRP (Bach et al., 2015), Deep Taylor Decomposition (DTD) (Montavon et al., 2017), Deconvnet (Zeiler and Fergus, 2014), GBP (Springenberg et al., 2014), and PatternAttribution (Kindermans et al., 2017). One LRP variant is not included, as it has been shown to be equivalent to DTD (Montavon et al., 2017). We use the implementations of these compared methods from the iNNvestigate package (Alber et al., 2019).

6.1.2. Implementation Details

We evaluate the quality of the proposed interpretations on the MNIST (LeCun, 1998) and Imagenet (Russakovsky et al., 2015) datasets. The number of training epochs for the convolutional layers and the L2-norm regularization coefficient $\lambda$ for the dense layers are kept fixed in all experiments.

On the MNIST dataset, we train a three-layer MLP model on 50k training images. We then extract the inputs, final predictions, and intermediate feature maps of the 50k images for the model. The network parameters of MIP-IN are learned for each class recursively; the learning can be done in a few minutes. On the Imagenet dataset, we interpret the pre-trained VGG19 (Simonyan and Zisserman, 2014) and Resnet50 (He et al., 2016) models. The network parameters of MIP-IN are trained on the ILSVRC validation set with 50k images (Russakovsky et al., 2015); more training samples would lead to better interpretation results. We mainly train on the classifier modules. Specifically, the feature maps at the top convolutional layers and the final predictions (logits) are fed as input and output, respectively. Different from the classical propagation settings, where the forward relu stops at the second layer, the forward relu here also operates on the defined input layer (the top convolutional layers). This training part takes less than one minute and is repeated for each class of interest. MIP-IN then generates an attribution vector over the top convolutional feature maps. We derive a two-dimensional attribution map by averaging over the channel dimension and then resize it to the original image size to obtain the saliency map, similar to Grad-CAM (Selvaraju et al., 2017). We apply a relu function to the attribution vectors for the Imagenet examples because we are more interested in the features that have a positive influence on the class of interest.
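The post-processing just described (channel averaging, relu, and resizing to the input resolution) can be sketched as follows; the feature-map and image sizes are illustrative.

import torch
import torch.nn.functional as F

def attribution_to_saliency(attr_chw, image_size):
    """Turn a (c, h, w) attribution tensor at the top convolutional layer into a
    2-D saliency map of the given (H, W) image size."""
    sal = torch.as_tensor(attr_chw, dtype=torch.float32).mean(dim=0)   # average over channels
    sal = torch.relu(sal)                                              # keep positive evidence
    sal = sal / (sal.max() + 1e-12)                                    # normalize for display
    sal = F.interpolate(sal[None, None], size=image_size,
                        mode="bilinear", align_corners=False)
    return sal[0, 0]

if __name__ == "__main__":
    print(attribution_to_saliency(torch.randn(512, 14, 14), (224, 224)).shape)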

6.2. Visualization Results

To qualitatively assess the interpretation performance, we visualize the saliency maps of evaluated samples produced by MIP-IN and the compared methods for an MLP model and the pre-trained VGG19 model.

6.2.1. Sanity Check Measurement.

Here we propose a novel sanity-check measurement for interpretations: whether an attribution method produces reasonable interpretations for models without local connections or max-pooling operations, e.g., MLP models. It has been shown that local connectivity contributes largely to the good visual quality of GBP (Nie et al., 2018). Besides, it has been confirmed that the max-pooling operation is critical in helping Deconvnet produce human-interpretable visualizations instead of random noise (Nie et al., 2018). However, an interpretation should also be human-interpretable without local connections and max-pooling. To evaluate the performance without the distraction from local connections and max-pooling, we specifically show the interpretations for an MLP model (which has neither) trained on MNIST in Section 6.2.2.

Figure 8. The saliency maps for dog (top row) and cat (bottom row) labels respectively. MIP-IN could generate class-discriminative interpretations. In contrast, the baseline methods produce nearly the same interpretations for both labels.

6.2.2. MLP Models.

We train a three-layer MLP model on 50k training images. Figure 5 shows the saliency maps. The proposed MIP-IN interprets the network with the best visual quality. We have the following observations. i) The attributions of the compared back-propagation methods only roughly localize the pixels of the digits, while MIP-IN achieves an accurate localization. ii) The saliency maps of the compared baselines are noisy and pay partial attention to the background of the image, while the visualization of MIP-IN is much sharper and cleaner. iii) The explanation of MIP-IN is consistent with human cognition. Specifically, MIP-IN implies that the pixels on the digit have a positive effect on the final prediction, while the pixels around the digit play a negative role in the prediction. The regions in the background, which are scored close to 0 and shown in white, are not necessary for the classification. This accords with the obvious conclusion that no information about the output can be embedded in these areas, in which the pixels are all constant.

6.2.3. VGG19 Model.

We also visualize the saliency maps of eight examples from the shetland sheepdog class produced by MIP-IN in Figure 6, and the saliency maps of the seven back-propagation based baselines in Figure 7. Figure 6 shows that the interpretations produced by MIP-IN are of good visual quality and accurately locate the object of interest. In addition, the heatmaps of GBP, DTD, and especially PatternAttribution recover fine-grained details of irrelevant regions, such as the person; these methods suggest that the network is partially paying attention to the background. In contrast, the saliency maps of MIP-IN indicate that the network ignores the surrounding background, assigning it attributions of almost zero. These observations imply that MIP-IN produces more reasonable explanations.

Models        VGG19   Resnet50
Grads         0.345   0.368
SmoothGrads   0.518   0.541
LRP           0.307   0.292
DTD           0.427   0.441
Deconvnet     0.226   0.263
GBP           0.398   0.456
PatternAttr   0.499   -
GradCAM       0.486   0.512
MIP-IN        0.648   0.662

Table 2. Bounding box accuracies evaluated on VGG19 and Resnet50. PatternAttribution does not support Resnet50.

6.3. Localization Performance

To further quantitatively measure the effectiveness of the proposed interpretations, we evaluate the localization performance of MIP-IN and the back-propagation based baselines. We are interested in how well the generated attributions locate the object of interest. A common approach is to compare the saliency maps with the bounding box annotations. Assume the bounding box contains $k$ pixels. We select the top $k$ pixels according to the ranked attribution scores and count the number of selected pixels inside the bounding box; this ratio is used as the localization accuracy (Schulz et al., 2020). We follow the setting in (Schulz et al., 2020), which considers the scenarios where the bounding boxes cover less than a given fraction of the input image. We compute the localization accuracy of the VGG19 and Resnet50 models on over 5k validation images, and the results are shown in Table 2. MIP-IN obtains the highest bounding box accuracy on both VGG19 and Resnet50. Compared to the best baseline, MIP-IN improves the accuracy by 0.13 and 0.121 for VGG19 and Resnet50, respectively. This substantial improvement validates that MIP-IN is consistent with human cognition.
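The localization metric can be computed as in the sketch below; the bounding-box coordinate convention is an assumption of the sketch.

import numpy as np

def bbox_accuracy(saliency, bbox):
    """Fraction of the top-k attributed pixels that fall inside the bounding box,
    with k equal to the number of pixels the box covers.

    saliency : (H, W) attribution map
    bbox     : (x_min, y_min, x_max, y_max) in pixel coordinates
    """
    x0, y0, x1, y1 = bbox
    inside = np.zeros_like(saliency, dtype=bool)
    inside[y0:y1, x0:x1] = True
    k = int(inside.sum())                            # pixels covered by the box
    top_k = np.argsort(saliency, axis=None)[-k:]     # indices of the top-k attributions
    return float(inside.ravel()[top_k].mean())

if __name__ == "__main__":
    print(bbox_accuracy(np.random.rand(224, 224), (60, 40, 180, 160)))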

Figure 9. Comparison of average distances on the VGG19 model. The large distances between the saliency maps of the two labels indicate that MIP-IN generates class-sensitive interpretations.

6.4. Class Label Sensitivity

A sound attribution should be sensitive to the label of the target class: given different labels of interest, the attribution method is supposed to generate class-discriminative saliency maps. However, several works have observed that the saliency maps of GBP and Deconvnet remain almost the same given different class labels (Samek et al., 2016; Selvaraju et al., 2017; Mahendran and Vedaldi, 2016; Nie et al., 2018). Hence we conduct a comprehensive investigation into the class sensitivity of the saliency maps produced by MIP-IN and the back-propagation based baselines.

6.4.1. Qualitative Evaluation

To qualitatively assess the class sensitivity, we visualize the saliency maps from MIP-IN and the seven baselines for the 'bull mastiff' (dog) class (top) and the 'tiger cat' class (bottom) in Figure 8. GBP, DTD, and PatternAttribution highlight fine-grained details in the image but generate very similar visualizations for the two classes, demonstrating that these three back-propagation baselines are class-insensitive. Smooth Gradients shows better sensitivity, but it highlights both the dog and cat regions when interpreting the 'cat' class. In contrast, MIP-IN is highly class-sensitive and produces accurate interpretations: the 'dog' explanation exclusively highlights the dog regions and not the cat regions (top row), and vice versa (bottom row).

6.4.2. Quantitative Evaluation

To quantitatively describe how the back-propagation based visualizations change w.r.t. different class labels, we compute the average distance on the Imagenet dataset: we calculate the distance between the two saliency maps obtained for the class logits of two different class labels and then average these distances over the evaluated images.
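A sketch of this metric, assuming the distance between two saliency maps is the Euclidean distance of their flattened difference; this choice of distance is our assumption.

import numpy as np

def average_class_distance(maps_class_a, maps_class_b):
    """Average distance between saliency maps produced for two different class
    labels on the same set of images (Euclidean distance assumed)."""
    dists = [np.linalg.norm((a - b).ravel())
             for a, b in zip(maps_class_a, maps_class_b)]
    return float(np.mean(dists))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maps_a, maps_b = rng.random((20, 224, 224)), rng.random((20, 224, 224))
    print(average_class_distance(maps_a, maps_b))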

We evaluate the VGG19 model on Imagenet images. We randomly select two class labels and show the comparison of the average distance statistics in Figure 9. It can be observed that the average distances of MIP-IN, Grads, Smooth Gradients, and LRP are much larger than those of Deconvnet and GBP, which demonstrates that the former four attribution methods are class-sensitive while Deconvnet and GBP are not. Although DTD has a relatively larger distance than GBP and Deconvnet, we conclude that DTD is actually class-insensitive after investigating specific examples. There are two situations for the saliency maps of DTD: i) for some examples, the two saliency maps of the two classes (with both positive or both negative logits) are almost identical, with near-zero distance; ii) for the other examples, the saliency map values of one class (with negative logits) are all zero while those of the other class (with positive logits) are not, which results in a large distance. Besides, PatternAttribution is much less sensitive than MIP-IN. We also evaluated the average distance on a three-layer CNN model trained on MNIST; the results are shown in the Appendix.

7. Conclusions and Future Work

In this work, we formulate the back-propagation attribution problem as learning a source signal in the input space that maximally preserves the mutual information between input and output. To solve this problem, we propose the mutual information preserving inverse network (MIP-IN) to generate the desired source signal by globally and locally inverting the output. Experimental results validate that the interpretations generated by MIP-IN are reasonable, consistent with human cognition, and highly sensitive to the class label. In future work, we will explore how to apply MIP-IN to generate attributions at intermediate layers and how to adapt MIP-IN to generate class-level attributions.

References

  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, and B. Kim (2018) Sanity checks for saliency maps. arXiv preprint arXiv:1810.03292.
  • M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K. Müller, S. Dähne, and P. Kindermans (2019) iNNvestigate neural networks! Journal of Machine Learning Research.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE.
  • D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010) How to explain individual classification decisions. Journal of Machine Learning Research 11, pp. 1803–1831.
  • T. M. Cover (1999) Elements of information theory. John Wiley & Sons.
  • P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In NeurIPS, pp. 6967–6976.
  • M. Du, N. Liu, and X. Hu (2019) Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In ICCV.
  • R. Fong, M. Patrick, and A. Vedaldi (2019) Understanding deep networks via extremal perturbations and smooth masks. In ICCV.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • P. Kindermans, K. T. Schütt, M. Alber, K. Müller, D. Erhan, B. Kim, et al. (2017) Learning how to explain neural networks: PatternNet and PatternAttribution. arXiv preprint arXiv:1705.05598.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
  • Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • S. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
  • A. Mahendran and A. Vedaldi (2016) Salient deconvolutional networks. In ECCV, pp. 120–135.
  • G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. Müller (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition 65, pp. 211–222.
  • W. Nie, Y. Zhang, and A. Patel (2018) A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In ICML.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. IJCV.
  • W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. TNNLS.
  • W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, et al. (2020) Toward interpretable machine learning: transparent deep neural networks and beyond. arXiv preprint arXiv:2003.07631.
  • K. Schulz, L. Sixt, F. Tombari, and T. Landgraf (2020) Restricting the flow: information bottlenecks for attribution. In ICLR.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In ICCV.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In ICML.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • L. Sixt, M. Granz, and T. Landgraf (2020) When explanations lie: why many modified BP attributions fail. In ICML.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In ICML.
  • J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke (2019) Interpretable and fine-grained visual explanations for convolutional neural networks. In CVPR.
  • Z. Ying, D. Bourgeois, J. You, M. Zitnik, et al. (2019) GNNExplainer: generating explanations for graph neural networks. In NeurIPS.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833.
  • J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, et al. (2018) Top-down neural attention by excitation backprop. IJCV.

Appendix A Network Architectures

In this appendix, we list the network architectures used in the paper in Tables 3, 4, and 5, respectively. The dropout technique is adopted during the training process.

Name Activation Output size
Initial (28,28)
Dense1 relu (512, )
Dense2 relu (512, )
Dense3 softmax (10,  )

Table 3. Trained MLP architecture on MNIST dataset.
Name Activation kernel size Output size
Initial (28,28,1)
Conv1 relu (16,5,5) (24,24,16)
Conv2 relu (64,3,3) (22,22,64)
Max1 (2,2) (11, 11, 64)
Flatten (7744,  )
Dense1 relu (512,  )
Dense2 softmax (10,  )

Table 4. Trained CNN architecture on MNIST dataset.
Name Activation kernel size Output size
Initial (32,32,3)
Conv1 relu (32,3,3) (30,30,32)
Conv2 relu (64,3,3) (28,28,64)
Max1 (2,2) (14,14,64)
Conv3 relu (64,3,3) (12,12,64)
Max2 (2,2) (6, 6, 64)
Flatten (2304,  )
Dense1 relu (512,  )
Dense2 softmax (10,  )

Table 5. Trained CNN architecture on Cifar-10 dataset.

Appendix B Average distance on MNIST

We also evaluated a three-layer CNN model on MNIST images to see whether the number of layers would influence the conclusion. We obtain similar observations on this three-layer model: MIP-IN, Grads, Smooth Gradients, and LRP are sensitive to the class labels, with large average distances, while GBP, DTD, and PatternAttribution are not, with small average distances (DTD shows the same two situations as on the VGG19 model). Deconvnet shows good sensitivity here, which may be because the trained CNN model has only one max-pooling layer.

Appendix C More Visualizations

In this section, we provide more qualitative visualizations for examples from the Imagenet dataset. In Figure 11 and Figure 12, we illustrate the saliency maps for the hair dryer and ski classes, respectively. They indicate that MIP-IN accurately locates the object of interest. An interesting observation is that the model also pays some attention to the human when providing explanations for the ski class. This is perhaps because humans and skiing equipment typically co-occur in the training set, so the model captures this correlation and exploits it for prediction. In Figure 10, we also compare MIP-IN with Grad-CAM; the results indicate that MIP-IN generates more accurate localization. In addition, we provide a comparison between MIP-IN and the seven back-propagation based attribution methods in Figure 13.

Figure 10. Comparing MIP-IN with Grad-CAM.
Figure 11. Heatmaps of hair dryer class generated by the proposed MIP-IN method.
Figure 12. Heatmaps of ski class generated by the proposed MIP-IN method.
Figure 13. More visualization comparison between MIP-IN and the seven baseline methods.