Local explanation methods (also known as attribution methods) attribute a deep network’s prediction to its input (cf. Baehrens et al. (2010); Simonyan et al. (2013); Shrikumar et al. (2017); Binder et al. (2016); Springenberg et al. (2014); Lundberg & Lee (2017); Sundararajan et al. (2017)). For instance, the network may perform an object recognition task. It takes as input an image and predicts scores for classes that are synonymous with objects in the image. The attributions then assign importance scores to the pixels.
We respond to the claim from Adebayo et al. (2018) that local explanation methods lack sensitivity, i.e., “DNNs with randomly-initialized weights produce explanations that are both visually and quantitatively similar to those produced by DNNs with learned weights.”
In this response, we show that the two methods (Spearman correlation and visual similarity) that Adebayo et al. (2018) used to demonstrate this insensitivity involve specific choices that produce the apparent insensitivity. To be clear, we do not show that local explanation methods are trustworthy. Rather, we attempt to demonstrate that Adebayo et al. (2018) does not seem to show that they are not. Specifically, we only address claims for one of the local explanation methods studied, i.e., Integrated Gradients, but the issues with the investigation possibly apply to other methods. Likewise, we only discuss results on MNIST, but the impact of these choices likely applies to other tasks too.
2 Methodology of Adebayo et al. (2018)
Adebayo et al. (2018) studies the sensitivity of an explanation method to network parameters by randomizing the parameters starting from the topmost layer and going all the way to the bottom. They call this “cascading randomization” of the network. Randomization involves reinitializing parameters from a truncated normal distribution.
They then visually and quantitatively assess the changes in the generated explanation for a given set of inputs. The quantitative assessment involves computing the Spearman rank correlation between the explanation vector from the original network and the one from the network with randomized parameters. If the correlation is high, then the explanation is (supposedly) not sensitive to the parameters that were randomized, and therefore the importance scores assigned by the attribution method are suspect. The explanation vectors are normalized by taking their absolute values.
3 Sensitivity of Integrated Gradients
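We carefully examine the visualizations and the Spearman rank correlation for explanations produced using the Integrated Gradients method of Sundararajan et al. (2017). Formally, the integrated gradient for the $i$-th base feature (e.g., a pixel) of an input $x$ and baseline $x'$ is defined as:

$$\mathrm{IG}_i(x, x') = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i}\, d\alpha \tag{1}$$

where $F$ is the function computed by the network and $\frac{\partial F(x)}{\partial x_i}$ is the gradient of $F$ along the $i$-th dimension at $x$.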
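The definition above is usually approximated numerically. The following is a minimal sketch, not the implementation used in either paper: it replaces the integral with a midpoint Riemann sum and uses a hypothetical toy function `F` with a hand-written gradient `grad_F` in place of a network. The names `integrated_gradients`, `F`, `grad_F`, and `weights` are assumptions made for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    n = len(x)
    grad_sum = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            grad_sum[i] += grad[i]
    # Scale the average gradient by (x_i - baseline_i).
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]


# Hypothetical toy "network": F(x) = sum_i w_i * x_i^2 (an assumption).
weights = [1.0, -2.0, 1.5, 3.0]


def F(x):
    return sum(w * xi * xi for w, xi in zip(weights, x))


def grad_F(x):
    return [2.0 * w * xi for w, xi in zip(weights, x)]


x = [0.0, 0.5, 1.0, 0.0]          # two "black" features, two non-black
baseline = [0.0, 0.0, 0.0, 0.0]   # black baseline
attributions = integrated_gradients(grad_F, x, baseline)

print(attributions)
print(sum(attributions), F(x) - F(baseline))
```

In this toy setting the features that equal the baseline receive zero attribution, and the attributions sum to the difference between the function value at the input and at the baseline. Both properties are used in the arguments that follow.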
It is claimed that these importance scores remain visually and quantitatively unchanged as the network is randomized. We argue that these findings are artifacts of how the visualization and the Spearman rank correlation are computed.
3.1 Spearman rank correlation
We begin by mentioning a property of Spearman rank correlation. Consider any two non-negative vectors $a_1$ and $a_2$ such that there exist $k$ common indices where $a_1$ and $a_2$ are both zero. The Spearman rank correlation between $a_1$ and $a_2$ saturates to 1.0 as $k$ increases. Figure 1 shows this empirically for two random non-negative vectors. Even when $k$ is only 50% of the indices, the Spearman rank correlation is already high.
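The saturation property can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the experiment behind Figure 1: the vector length `n = 1000`, the uniform distribution of the nonzero entries, and the helper names are all hypothetical. Tied values (the shared zeros) receive their average rank, which is the usual convention for Spearman correlation.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for j in range(start, end + 1):
            ranks[order[j]] = avg
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    va = sum((p - ma) ** 2 for p in ra)
    vb = sum((q - mb) ** 2 for q in rb)
    return cov / math.sqrt(va * vb)


def correlation_with_shared_zeros(n, k, rng):
    """Two independent positive random vectors, zeroed on the same k indices."""
    a1 = [rng.uniform(0.1, 1.0) for _ in range(n)]
    a2 = [rng.uniform(0.1, 1.0) for _ in range(n)]
    for i in rng.sample(range(n), k):
        a1[i] = 0.0
        a2[i] = 0.0
    return spearman(a1, a2)


rng = random.Random(0)
n = 1000
results = {
    frac: correlation_with_shared_zeros(n, int(frac * n), rng)
    for frac in (0.0, 0.5, 0.9)
}
for frac, rho in results.items():
    print(f"{frac:.0%} shared zeros -> Spearman {rho:.3f}")
```

The nonzero entries of the two vectors are independent, so any correlation beyond the 0% case comes purely from the shared zeros.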
Adebayo et al. (2018) reports results on the sensitivity of integrated gradients for a CNN with three hidden layers trained on the MNIST dataset. In their setup, the integrated gradient vectors are normalized by taking absolute values before computing the Spearman correlation.
Integrated gradients explain the network’s prediction for an input relative to a certain baseline. In Adebayo et al. (2018), as Sundararajan et al. (2017) recommends, integrated gradients are computed using a black image as the baseline. A mathematical property of the method is that features (pixels in the case of image models) that have identical values at the input and the baseline are guaranteed to receive zero attribution.
Therefore, if $a_1$ and $a_2$ are integrated gradient vectors from two different networks for the same MNIST image and black baseline, then $a_1$ and $a_2$ would be identically zero for all pixels that are black in the image. A typical MNIST image consists of a large number of black pixels, and therefore, by the property stated at the beginning of this section, $|a_1|$ and $|a_2|$ would have a high Spearman rank correlation; here $|\cdot|$ is the absolute value function. Note that this is true regardless of the network architecture or parameters of the two networks.
The above result explains why Adebayo et al. (2018) empirically find high Spearman correlation between (normalized) integrated gradients as the network parameters are gradually randomized. We consider this high correlation artifactual, as it follows from the choice of baseline. A better correlation metric is the Spearman rank correlation between the integrated gradient values of only those pixels that change between the baseline and the input. When the baseline is black, this amounts to considering only those pixels that are nonzero in the input image. This restricted correlation is the metric we use below.
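A restricted correlation of this kind could be sketched as follows. This is an assumption-laden illustration rather than the authors' code: `masked_spearman` is a hypothetical name, the input is synthetic with an assumed fraction of roughly 80% black pixels, and the two attribution vectors are independent random values on the non-black pixels (and zero elsewhere, as the method guarantees for a black baseline).

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for j in range(start, end + 1):
            ranks[order[j]] = avg
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    va = sum((p - ma) ** 2 for p in ra)
    vb = sum((q - mb) ** 2 for q in rb)
    return cov / math.sqrt(va * vb)


def masked_spearman(attr1, attr2, x, baseline):
    """Spearman correlation over only the features where x differs from the baseline."""
    keep = [i for i in range(len(x)) if x[i] != baseline[i]]
    return spearman([attr1[i] for i in keep], [attr2[i] for i in keep])


rng = random.Random(0)
n = 784  # 28 x 28 pixels
# Synthetic input: about 80% black pixels (an assumed fraction).
x = [rng.uniform(0.1, 1.0) if rng.random() < 0.2 else 0.0 for _ in range(n)]
baseline = [0.0] * n


def fake_attributions():
    # Independent random scores on non-black pixels, zero on black pixels.
    return [abs(rng.uniform(-1.0, 1.0)) if xi != 0.0 else 0.0 for xi in x]


a1, a2 = fake_attributions(), fake_attributions()
full = spearman(a1, a2)
restricted = masked_spearman(a1, a2, x, baseline)
print(f"all pixels: {full:.3f}, non-black pixels only: {restricted:.3f}")
```

Because the two synthetic attribution vectors are unrelated on the non-black pixels, the restricted correlation stays near zero, while the correlation over all pixels is inflated by the shared zeros.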
We now use this metric to compare (normalized) integrated gradients computed for a trained network with those computed for a network with randomized parameters. Figure 2 (red plot) shows the trend of the correlation as parameters of the network are randomized in a cascading manner, starting from the “Output” layer down to the first convolutional layer (called “Conv1”). It is clear that the correlation drops sharply as soon as parameters in the “Output” layer are randomized.
We notice in Figure 2 that the correlation plateaus at a nonzero value after the “Output” layer is randomized. This can be explained as follows. Once the output layer is randomized, the gradient of the prediction with respect to the input pixels is effectively random. Now, why does the Spearman correlation not go to zero? The mathematical form of integrated gradients includes scaling the integral of the gradients by the input, i.e., in Equation 1, the term $(x_i - x'_i)$; recall that $x'_i$ is zero because the baseline is black. That is, the mathematical form includes a non-random component that sets a bound on how low the correlations can go, i.e., when the parameters are random, the integrated gradients importance scores are effectively the pixel value $x_i$ multiplied by a random quantity. We empirically measured the average correlation between two such integrated gradient vectors, and it is approximately what the correlation between the integrated gradients for the original network and those for the randomized network converges to in Figure 2.
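The floor on the correlation can be illustrated with a rough simulation. The sketch below is not the measurement reported in the text: it assumes pixel values drawn uniformly from (0, 1], models the "random" integrated gradients as standard normal draws, and considers only non-black pixels. Since all values are continuous, ties are assumed absent and the simple tie-free Spearman formula is used. All names are hypothetical.

```python
import random


def ranks(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_no_ties(a, b):
    """Spearman correlation via the rank-difference formula (valid without ties)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


rng = random.Random(0)
n = 2000  # number of non-black pixels in this synthetic setup (an assumption)
pixels = [1.0 - rng.random() for _ in range(n)]  # values in (0, 1]

# Two independent sets of "random gradients".
g1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
g2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Attributions of the form x_i * (random quantity), normalized by absolute value.
scaled1 = [abs(x * g) for x, g in zip(pixels, g1)]
scaled2 = [abs(x * g) for x, g in zip(pixels, g2)]
floor = spearman_no_ties(scaled1, scaled2)

# Control: the same random quantities without the shared pixel-value factor.
control = spearman_no_ties([abs(g) for g in g1], [abs(g) for g in g2])

print(f"with shared pixel factor: {floor:.3f}, without: {control:.3f}")
```

The shared factor alone keeps the correlation clearly above zero, whereas the control without it stays near zero. The specific value of the floor depends on the assumed distributions and should not be read as the number from the original experiment.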
We find that the normalization of the integrated gradients by taking the absolute value is crucial for the findings in Adebayo et al. (2018). In particular, we find that without the normalization, the correlation under the original Spearman metric does not stay high but drops as parameters of the network are randomized; see the blue plot in Figure 2.
Figure 3: Each row shows a different MNIST image. The first and second columns show the image and the visualization of integrated gradients from the original trained network. The following columns show visualizations of integrated gradients (for the same label as the original image) as parameters in each layer of the network are successively randomized in a cascading manner. The number below each visualization is the sum of the raw integrated gradient scores of all pixels. It approximates the difference between the logit score for the input and the baseline (i.e., black image).
Adebayo et al. (2018) show that visualizations of normalized integrated gradients are largely unchanged as parameters of the network are randomized. We find that this is caused by the normalization, i.e., taking absolute values. In particular, doing this conflates pixels that receive positive attribution with those that receive negative attribution. A more faithful visualization is one that preserves the sign and shows pixels with positive and negative attribution on separate channels.
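To make the conflation concrete, here is a minimal sketch of the two normalizations applied to a hypothetical attribution vector and to its sign-flipped copy. The function names and the choice to scale by the largest absolute value are assumptions for illustration; they are not taken from either paper.

```python
def absolute_value_map(attributions):
    """Single-channel map: absolute values scaled to [0, 1]."""
    scale = max(abs(a) for a in attributions) or 1.0
    return [abs(a) / scale for a in attributions]


def signed_channels(attributions):
    """Two-channel map: positive and negative parts, each scaled to [0, 1]."""
    scale = max(abs(a) for a in attributions) or 1.0
    positive = [max(a, 0.0) / scale for a in attributions]
    negative = [max(-a, 0.0) / scale for a in attributions]
    return positive, negative


# Hypothetical attributions and a copy with every sign flipped.
attr = [0.4, -0.4, 0.0, 0.8]
flipped = [-a for a in attr]

# The absolute-value maps are identical, so the sign flip is invisible.
print(absolute_value_map(attr) == absolute_value_map(flipped))

# The signed maps differ: the two channels swap roles.
pos, neg = signed_channels(attr)
flipped_pos, flipped_neg = signed_channels(flipped)
print(pos, neg)
print(flipped_pos, flipped_neg)
```

In a rendered image, the two channels could be mapped to two different colors, so that a pixel flipping from positive to negative attribution changes color instead of looking unchanged.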
Figure 3 shows such visualizations of integrated gradients for a few MNIST images. It also shows how the visualizations change as parameters of the network are gradually randomized. (All visualizations are for the same label as that of the original image.) It is clear that the sign of the integrated gradient value of a pixel is highly sensitive to the parameters of the network. For instance, notice how different parts of the number “8” flip from positive to negative attribution as the network is randomized. Notice also that the attributions for the original unrandomized network appear to highlight entire strokes whereas the attributions for the randomized networks appear disjointed.
Furthermore, while the visualizations in Figure 3 preserve the sign of the integrated gradient vectors, they do not reflect their magnitude. Each visualization is obtained by rescaling the minimum and maximum values to a fixed range, which destroys the magnitude information. A mathematical property of integrated gradients is that the integrated gradient scores of all features add up to the prediction score at the input minus the prediction score at the baseline. As parameters of the network are randomized, the final prediction score for the original label drops significantly, and so does the sum of the integrated gradients across all pixels. We verify this empirically in Figure 3. Notice that the sum of the integrated gradient values across all pixels shrinks as parameters of the network are randomized.
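The loss of magnitude under rescaling can be seen in a small example. The sketch below assumes a min-max rescaling to the range [0, 1] (the exact range used for Figure 3 is not assumed here) and two hypothetical attribution vectors that differ only by a constant factor.

```python
def min_max_scale(values):
    """Rescale values so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attributions: the second vector is the first scaled by 1/100.
attr_large = [2.0, -1.0, 0.0, 4.0]
attr_small = [a / 100.0 for a in attr_large]

scaled_large = min_max_scale(attr_large)
scaled_small = min_max_scale(attr_small)

# The rescaled vectors coincide (up to floating-point error) ...
print(scaled_large)
print(scaled_small)

# ... while the raw sums, which track the prediction score, differ by 100x.
print(sum(attr_large), sum(attr_small))
```

This is why reporting the raw sum next to each visualization carries information that the rescaled image alone does not.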
That said, we do agree that further work needs to be done on careful visualizations for attribution. For image-based use-cases, humans consume the visualizations and not the numbers and if the visualizations are not carefully done, we risk losing relevant information contained in attribution scores.
We thank Julius Adebayo, Been Kim, and Raz Mathias for helpful discussions.
- Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Ian Goodfellow, and Been Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. International Conference on Learning Representations Workshop (ICLR Workshop), 2018. URL https://openreview.net/pdf?id=SJOYTK1vM.
- Baehrens et al. (2010) David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, pp. 1803–1831, 2010.
- Binder et al. (2016) Alexander Binder, Grégoire Montavon, Sebastian Bach, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. CoRR, 2016.
- Lundberg & Lee (2017) Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4768–4777. Curran Associates, Inc., 2017.
- Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 3145–3153. PMLR, 2017. URL http://proceedings.mlr.press/v70/shrikumar17a.html.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, 2013.
- Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, 2014.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3319–3328, 2017.