The (Un)reliability of saliency methods

by   Pieter-Jan Kindermans, et al.

Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step ---adding a constant shift to the input data--- to show that a transformation with no effect on the model can cause numerous methods to incorrectly attribute. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution.



1 Introduction

While considerable research has focused on discerning the decision process of neural networks (Baehrens et al., 2010; Simonyan & Zisserman, 2015; Haufe et al., 2014; Zeiler & Fergus, 2014; Springenberg et al., 2015; Bach et al., 2015; Yosinski et al., 2015; Nguyen et al., 2016; Montavon et al., 2017; Zintgraf et al., 2017; Sundararajan et al., 2017; Smilkov et al., 2017; Kindermans et al., 2017), there remains a trade-off between model complexity and interpretability. Research to address this tension is urgently needed; reliable explanations build trust with users, help identify points of model failure and remove barriers to entry for the deployment of deep neural networks in domains like health care, security and transportation.

In deep neural networks, data representation is delegated to the model and subsequently we cannot generally say in an informative way what led to a model prediction. Instead, saliency methods aim to infer insights about the function $f(x)$ learnt by the model by ranking the explanatory power of constituent inputs. While unified in purpose, these methods are surprisingly divergent and non-overlapping in outcome. Evaluating the reliability of these methods is complicated by a lack of ground truth, as ground truth would depend upon full transparency into how a model arrives at a decision, which is the very problem we are trying to solve in the first place.

Given the need for a quantitative method of comparison, several properties such as completeness, implementation invariance and sensitivity have been articulated as desirable to ensure that saliency methods are reliable (Bach et al., 2015; Sundararajan et al., 2017). Implementation invariance, proposed as an axiom for attribution methods by Sundararajan et al. (2017), is the requirement that functionally equivalent networks (models with different architectures but equal outputs for all inputs) always attribute in an identical way.

This work posits that a second invariance axiom, which we term input invariance, needs to be satisfied to ensure reliable interpretation of the input's contribution to the model prediction. Input invariance requires that the saliency method mirror the sensitivity of the model with respect to transformations of the input. We demonstrate that numerous methods do not satisfy input invariance using a simple transformation (a constant shift of the input) that changes the attribution of these methods but does not affect the model prediction or weights. Our results demonstrate that explanations of a network's predictions can be purposefully manipulated to be misleading using surprisingly simple transformations. This work is motivated by an understanding that saliency methods are highly valued tools for gaining intuition about a network. Determining points of failure is a necessary step for knowledgeable use of these tools as well as a prerequisite for domains like medicine where the incorrect classification of an input as salient carries a high cost.

In this work we:

  • introduce the axiom input invariance and show, using a simple constant shift in the input, that certain saliency methods do not satisfy this property (See Fig. 3).

  • demonstrate using MNIST that we can purposefully force misleading attribution (See Fig. 4 and Fig. 6).

  • show that "reference point" methods (Integrated Gradients and the Deep Taylor Decomposition) have diverging attribution and satisfy input invariance only contingent on the choice of reference and the type of transformation considered (See Fig. 1).

  • propose data normalization as a way to ensure that some methods satisfy input invariance for the type of transformation considered, and discuss the need for wider research, as normalization does not systematically guarantee reliable attribution for all possible transformations.

In Section 2, we detail our experiment framework. In Section 3, we determine that while the model is invariant to the input transformation considered, several saliency methods attribute to the mean shift. In Section 4 we discuss "reference point" methods and illustrate the importance of choosing an appropriate reference before discussing some directions for future research in Section 5.

Figure 1: Integrated gradients and Deep Taylor Decomposition determine input attribution relative to a chosen reference point. This choice determines the vantage point for all subsequent attribution. Using two example reference points for each method we demonstrate that changing the reference causes the attribution to diverge. The attributions are visualized in a manner consistent with the IG paper (Sundararajan et al., 2017). Visualisations were made using ImageNet data (Russakovsky et al., 2015) and the VGG16 architecture (Simonyan & Zisserman, 2015).

2 The model is invariant to a constant shift in input

We show that, by construction, the bias of a neural network compensates for the constant shift resulting in two networks with identical weights and predictions.

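We compare the attribution across two networks, $f_1(x)$ and $f_2(x)$. $f_1(x)$ is a network trained on input $x_1^i$, which denotes sample $i$ from training set $X_1$. The classification task of network 1 is:

$$f_1(x_1^i) = y^i$$

$f_2(x)$ is a network that predicts the classification of a transformed input $x_2^i$. The relationship between $x_1^i$ and $x_2^i$ is the addition of a constant vector $m_2$:

$$\forall i, \quad x_2^i = x_1^i + m_2$$

Network 1 and 2 differ only by construction. Consider the first layer neuron before the non-linearity in $f_1(x)$:

$$z = w^T x_1^i + b_1$$

We alter the bias of the first layer neuron to compensate for the mean shift $m_2$. This now becomes Network 2:

$$b_2 = b_1 - w^T m_2$$

As a result the first layer activations are the same for $f_1(x)$ and $f_2(x)$:

$$z = w^T x_2^i + b_2 = w^T x_1^i + w^T m_2 + b_1 - w^T m_2 = w^T x_1^i + b_1$$

Note that the gradient with respect to the input remains unchanged as well:

$$\frac{\partial f_1(x_1^i)}{\partial x_1^i} = \frac{\partial f_2(x_2^i)}{\partial x_2^i}$$

We have shown that Network 2 cancels out the mean shift transformation. This means that $f_1(x)$ and $f_2(x)$ have identical weights and produce the same output for all corresponding samples, $x_1^i \in X_1$, $x_2^i \in X_2$:

$$\forall i, \quad f_1(x_1^i) = f_2(x_2^i)$$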
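The bias compensation can be checked numerically. The following is a minimal sketch, assuming a toy network with one two-neuron ReLU layer and a linear read-out; the weights, the sample and the helper names (`dot`, `first_layer`, `relu_net`) are made up for illustration and are not the MNIST model used in the experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_layer(W, b, x):
    """Pre-activations z = W x + b for one dense layer (W is a list of rows)."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]

def relu_net(W, b, v, x):
    """Toy network: dense layer, ReLU, then a linear read-out v."""
    hidden = [max(0.0, z) for z in first_layer(W, b, x)]
    return dot(v, hidden)

# Hypothetical toy parameters (assumptions, not taken from the paper).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b1 = [0.1, -0.2]
v = [1.0, -2.0]
m2 = [-1.0, -1.0, -1.0]   # constant shift, e.g. a [0,1] -> [-1,0] encoding

# Network 2 keeps W and v and only changes the bias: b2 = b1 - W m2.
b2 = [bias - dot(row, m2) for row, bias in zip(W, b1)]

x1 = [0.2, 0.7, 0.9]
x2 = [a + s for a, s in zip(x1, m2)]

y1 = relu_net(W, b1, v, x1)   # network 1 on the original input
y2 = relu_net(W, b2, v, x2)   # network 2 on the shifted input
```

Up to floating-point rounding, `y1` and `y2` agree, because the first layer pre-activations are the same in both networks and everything after the first layer is shared.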

2.1 Experimental Setup

Now, we describe our experiment setup to evaluate the input invariance of a set of saliency methods. Most saliency research to date has centered on convolutional neural networks (CNN). In this work, we also evaluate input invariance using a CNN. Network 1 is a 3 layer multi-layer perceptron with 1024 ReLu-activated neurons each. Network 1 classifies MNIST image inputs in a [0,1] encoding. We consider a negative constant shift of $m_2 = -1$; Network 2 classifies MNIST image inputs in a [-1,0] encoding. The first network is trained for 10 epochs using mini-batch stochastic gradient descent (SGD). The final accuracy is 98.3% for both networks. (Footnote: Although there is a gap between this and the state of the art, the gap does not significantly influence our findings.) In 3.1 we introduce the saliency methods we evaluate.

3 The (In)sensitivity of Saliency Methods to Mean Shifts

In 3.1 we introduce key approaches to the classification of inputs as salient and the saliency methods we evaluate. In 3.2 we find that gradient and signal methods satisfy input invariance. In 3.3 we find that all attribution methods considered have points of failure.

3.1 Saliency methods considered

Saliency methods broadly fall into three different categories:

  1. Gradients (Sensitivity) (Baehrens et al., 2010; Simonyan et al., 2014) show how a small change to the input affects the classification score for the output of interest.

  2. Signal methods such as DeConvNet (Zeiler & Fergus, 2014), Guided BackProp (Springenberg et al., 2015) and PatternNet (Kindermans et al., 2017) aim to isolate input patterns that stimulate neuron activation in higher layers.

  3. Attribution methods such as Deep-Taylor Decomposition (Montavon et al., 2017) and Integrated Gradients (Sundararajan et al., 2017) assign importance to input dimensions by decomposing the value $y_j$ at an output neuron $j$ into contributions from the individual input dimensions:

    $$s_j = A(x)_j$$

    $s_j$ is the decomposition into input contributions and has the same number of dimensions as $x$; $A(x)_j$ signifies the attribution method applied to output $j$ for sample $x$. Attribution methods are distinct from gradients because of the insistence on completeness; the sum of all attributions should be approximately equal to the original output $y_j$.

We consider the input invariance of each category separately (by evaluating raw gradients, GuidedBackprop, PatternNet, Integrated Gradients and Deep Taylor Decomposition) and also benchmark the input invariance of SmoothGrad (Smilkov et al., 2017), a method that wraps around an underlying saliency approach and uses the addition of noise to produce a sharper visualization of the saliency heatmap.

The experiment setup and methodology is as described in Section 2. Each method is evaluated by comparing the saliency heatmaps for the predictions of network 1 and 2, where $x_2^i$ is simply the mean shifted input ($x_1^i + m_2$). A saliency method that satisfies input invariance will produce identical saliency heatmaps for Network 1 and 2 despite the constant shift in input.

Figure 2: Evaluating the sensitivity of gradient and signal methods using MNIST with a [0,1] encoding for network $f_1$ and a [-1,0] encoding for network $f_2$. Both raw gradients and signal methods satisfy input invariance by producing identical saliency heatmaps for both networks.

3.2 Gradient and Signal methods Satisfy Input Invariance

Gradient and signal methods are not sensitive to a constant shift in inputs. In Fig. 2 raw gradients, PatternNet (PN; Kindermans et al., 2017) and GuidedBackprop (GB; Springenberg et al., 2015) produce identical saliency heatmaps for both networks. Intuitively, gradient, PN and GB satisfy input invariance given that we are comparing two networks with an identical $f(x)$. All three methods determine attribution entirely as a function of the network/pattern weights and thus will be input invariant as long as we are comparing networks with identical weights.

In the same manner, we can say that these methods will not be input invariant when comparing networks with different weights (even if we consider models with different architectures but identical predictions for every input).

Figure 3: Evaluation of attribution method sensitivity using MNIST with a [0,1] encoding for network $f_1$ and a [-1,0] encoding for network $f_2$. Gradient x Input, IG and DTD with a zero reference point, which is equivalent to LRP (Bach et al., 2015; Montavon et al., 2017), do not satisfy input invariance and produce different attributions for each network. IG with a black image reference point and DTD with a PA reference point are not sensitive to a mean shift in input.

3.3 The Sensitivity of Attribution Methods

We evaluate the following attribution methods: gradient times input (GI), integrated gradients (IG, Sundararajan et al. (2017)) and the deep-taylor decomposition (DTD, Montavon et al. (2017)).

In 3.3.1 we find GI to be sensitive to constant shifts in the input. In 3.3.2 we group discussion of IG and DTD under "reference point" methods because both require that attribution is done in relation to a chosen reference. We find that satisfying input invariance depends upon the choice of reference point and the type of constant shift to the input.

3.3.1 Gradient times input is sensitive to mean shift of inputs

We find that the multiplication of raw gradients by the image fails to satisfy input invariance. In Fig. 3 GI produces different saliency heatmaps for both networks.

In 3.2 we determined that a saliency heatmap of raw gradients does satisfy input invariance. This breaks when the gradients are multiplied with the input image.

Multiplying by the input fails to satisfy input invariance because the input shift is carried through to the final attribution. Naive multiplication by the input, as noted by Smilkov et al. (2017), also constrains attribution without justification to inputs that are not 0.
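To make the failure concrete, here is a minimal sketch that compares raw gradients with gradient times input on two bias-compensated networks. It assumes a toy two-neuron ReLU network with made-up weights; `grad_input` is a hypothetical helper written for this illustration, not part of any library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_input(W, b, v, x):
    """Gradient of f(x) = v . relu(W x + b) with respect to x."""
    grad = [0.0] * len(x)
    for row, bias, out_w in zip(W, b, v):
        if dot(row, x) + bias > 0:               # only active ReLUs pass gradient
            grad = [g + out_w * w for g, w in zip(grad, row)]
    return grad

# Hypothetical toy parameters (assumptions, not taken from the paper).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b1 = [0.1, -0.2]
v = [1.0, -2.0]
m2 = [-1.0, -1.0, -1.0]                          # constant shift of the input
b2 = [bias - dot(row, m2) for row, bias in zip(W, b1)]   # compensating bias

x1 = [0.2, 0.7, 0.9]
x2 = [a + s for a, s in zip(x1, m2)]

g1 = grad_input(W, b1, v, x1)                    # raw gradient, network 1
g2 = grad_input(W, b2, v, x2)                    # raw gradient, network 2
gi1 = [g * a for g, a in zip(g1, x1)]            # gradient x input, network 1
gi2 = [g * a for g, a in zip(g2, x2)]            # gradient x input, network 2
```

In this sketch the raw gradients coincide, while the two gradient times input attributions differ by exactly the gradient multiplied element-wise with the shift, which is the term that leaks the pre-processing into the explanation.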

3.3.2 Reliability of Reference Point Methods Depends on the Choice of Reference

Both Integrated Gradients (IG; Sundararajan et al., 2017) and Deep Taylor Decomposition (DTD; Montavon et al., 2017) determine the importance of inputs relative to a reference point. DTD refers to this as the root point and IG terms the reference point a baseline. The choice of reference point is not determined a priori by the method and is instead a hyperparameter of the attribution task.

The choice of reference point determines all subsequent attribution. In Fig. 1 IG and DTD show different attribution depending on the choice of reference point. We show that IG and DTD only satisfy input invariance contingent on the choice of reference point and the type of transformation considered.

Integrated gradients (IG) (Sundararajan et al., 2017) attributes the predicted score to each input with respect to a baseline $x_0$. This is achieved by constructing a set of inputs interpolating between the baseline and the input:

$$s_j^{IG} = (x - x_0) \odot \int_{\alpha=0}^{1} \frac{\partial f(x_0 + \alpha (x - x_0))_j}{\partial x} \, d\alpha$$

Since this integral cannot be computed analytically, it is approximated by a finite sum ranging over $\alpha \in [0, 1]$.

We evaluate whether two possible IG reference points satisfy input invariance: an image populated uniformly with the minimum pixel value from the dataset ($x_0 = \min(x)$, a black image) and a zero vector image. In Fig. 3, a black image reference point produces identical attribution heatmaps whereas a zero vector reference point is not input invariant.

IG using a black image reference point is not sensitive to the constant shift in input because $x_0 = \min(x)$ is determined after the mean shift of the input, so the difference between $x$ and $x_0$ remains the same for both networks. In network 1 this is $x_1 - \min(x_1)$ and in network 2 this is $x_2 - \min(x_2) = (x_1 + m_2) - \min(x_1 + m_2)$.

IG with a zero vector reference point fails to satisfy input invariance because while the difference in network 1 is $x_1 - x_0$, the difference in network 2 becomes $x_1 + m_2 - x_0$.
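The two baselines can be compared with a small finite-sum approximation of IG. This is a sketch under stated assumptions: a toy two-neuron ReLU network with made-up weights, a midpoint Riemann sum with 50 steps, and the minimum of a single sample standing in for the minimum pixel of the dataset. The helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_input(W, b, v, x):
    """Gradient of f(x) = v . relu(W x + b) with respect to x."""
    grad = [0.0] * len(x)
    for row, bias, out_w in zip(W, b, v):
        if dot(row, x) + bias > 0:
            grad = [g + out_w * w for g, w in zip(grad, row)]
    return grad

def integrated_gradients(W, b, v, x, x0, steps=50):
    """Midpoint Riemann-sum approximation of IG on the straight path x0 -> x."""
    diff = [a - c for a, c in zip(x, x0)]
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [c + alpha * d for c, d in zip(x0, diff)]
        total = [t + g for t, g in zip(total, grad_input(W, b, v, point))]
    return [d * t / steps for d, t in zip(diff, total)]

# Hypothetical toy parameters (assumptions, not taken from the paper).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b1 = [0.1, -0.2]
v = [1.0, -2.0]
m2 = [-1.0, -1.0, -1.0]
b2 = [bias - dot(row, m2) for row, bias in zip(W, b1)]

x1 = [0.0, 0.7, 0.9]                       # contains a "black" pixel at 0
x2 = [a + s for a, s in zip(x1, m2)]

black1, black2 = [min(x1)] * 3, [min(x2)] * 3   # baseline follows the encoding
zero = [0.0, 0.0, 0.0]                          # baseline ignores the encoding

ig_black1 = integrated_gradients(W, b1, v, x1, black1)
ig_black2 = integrated_gradients(W, b2, v, x2, black2)
ig_zero1 = integrated_gradients(W, b1, v, x1, zero)
ig_zero2 = integrated_gradients(W, b2, v, x2, zero)
```

With the minimum-pixel baseline the interpolation path of network 2 is the path of network 1 translated by the shift, so both attributions match. With the zero baseline the path of network 2 starts at a different point relative to the data, and the attributions diverge.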

Figure 4: Evaluation of attribution method sensitivity using MNIST. Gradient x Input, all IG reference points and DTD with a LRP reference point do not satisfy input invariance and produce different attributions for each network. DTD with a PA reference point is not sensitive to the transformation of the input.

It is possible to construct a constant vector that will break the reliability of using a black image as a baseline. We consider a transformation of the input where the constant vector ($m_2$) added to $x_1$ is an image of a checkered box. Consistent with Section 2, the relationship between $x_1$ and the transformed input $x_2$ is the addition of the checkered box image vector $m_2$.

Fig. 4 shows that we are able to manipulate the attribution heatmap of an MNIST prediction so that $m_2$, an image of checkered boxes, appears for all reference points except for PA. This constant vector transformation causes all IG reference points to fail to satisfy input invariance.

Deep Taylor Decomposition (DTD) determines attribution relative to a reference point neuron. DTD can satisfy input invariance if the right reference point is chosen. In the general formulation, the attribution of an output neuron $j$ is initialized to be equal to the output of that neuron. The attribution of other output neurons is set to zero. This attribution is backpropagated to input neurons using the following distribution rule, where $s_j^l$ is the attribution assigned to neuron $j$ in layer $l$:

$$s_j^{\text{output}} = y, \qquad s_{k \neq j}^{\text{output}} = 0, \qquad s^{l-1,j} = \frac{w \odot (x - x_0)}{w^T x} \, s_j^l$$

We evaluate the input invariance of DTD using a reference point determined by Layer-wise Relevance Propagation (LRP; Bach et al., 2015) and PatternAttribution (PA). In Fig. 3, DTD satisfies input invariance when using a reference point defined by PA but fails to satisfy input invariance when using a reference point defined by LRP.

LRP is sensitive to the input shift because it is a case of DTD where a zero vector is chosen as the root point. (Footnote: This case of DTD is called the z-rule and can be shown to be equivalent to Layer-wise Relevance Propagation (Bach et al., 2015; Montavon et al., 2017). Under specific circumstances, LRP is also equivalent to gradient times input (Kindermans et al., 2016; Shrikumar et al., 2016).) The back-propagation rule becomes:

$$s^{l-1,j} = \frac{w \odot x}{w^T x} \, s_j^l$$

$s^{l-1,j}$ depends only upon the input and so attribution will change between network 1 and 2 because $x_1$ and $x_2$ differ by a constant vector.

PatternAttribution (PA) satisfies input invariance because the reference point is defined as the natural direction of variation in the data  (Kindermans et al., 2017). This natural direction is determined by the covariance of the data and thus compensates explicitly for the constant vector shift of the input. Therefore it is by construction input invariant.

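The PA root point is:

$$x_0 = x - a \, w^T x$$

where $a^T w = 1$.

In a linear model:

$$a = \frac{\mathrm{cov}[x, y]}{w^T \mathrm{cov}[x, y]}$$

For neurons followed by a ReLu non-linearity the vector $a_+$ accounts for the non-linearity and is computed as:

$$a_+ = \frac{E_+[x y] - E_+[x] \, E[y]}{w^T \left( E_+[x y] - E_+[x] \, E[y] \right)}$$

Here $E_+$ denotes the expectation taken over values where $y$ is positive.

PA reduces to the following step:

$$s^{l-1,j} = w \odot a_+ \, s_j^l$$

The vector $a_+$ depends upon covariance and thus compensates the mean shift of the input. The attribution for both networks is thus identical.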
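The contrast between the two root points can be illustrated on a single linear neuron. This is a minimal sketch, assuming the linear-model pattern $a = \mathrm{cov}[x, y] / (w^T \mathrm{cov}[x, y])$, four made-up samples and a made-up weight vector; `pattern` is a hypothetical helper and the ReLu-aware estimate $a_+$ is not implemented here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vals):
    return sum(vals) / len(vals)

def pattern(X, y, w):
    """Linear pattern a = cov[x, y] / (w^T cov[x, y]), estimated from samples."""
    y_bar = mean(y)
    cov = []
    for d in range(len(w)):
        col = [x[d] for x in X]
        col_bar = mean(col)
        cov.append(mean([(c - col_bar) * (t - y_bar) for c, t in zip(col, y)]))
    norm = dot(w, cov)
    return [c / norm for c in cov]

# Hypothetical linear neuron and data (assumptions, not taken from the paper).
w = [0.5, -1.0, 2.0]
b1 = 0.1
m2 = [-1.0, -1.0, -1.0]
b2 = b1 - dot(w, m2)                       # compensating bias of network 2

X1 = [[0.0, 0.7, 0.9], [0.2, 0.1, 0.5], [0.9, 0.4, 0.0], [0.6, 0.8, 0.3]]
X2 = [[a + s for a, s in zip(x, m2)] for x in X1]
y1 = [dot(w, x) + b1 for x in X1]
y2 = [dot(w, x) + b2 for x in X2]

a1 = pattern(X1, y1, w)
a2 = pattern(X2, y2, w)

# Attribution of the first sample: PA uses w * a, the zero root point uses w * x / (w^T x).
pa1 = [wi * ai * y1[0] for wi, ai in zip(w, a1)]
pa2 = [wi * ai * y2[0] for wi, ai in zip(w, a2)]
lrp1 = [wi * xi / dot(w, X1[0]) * y1[0] for wi, xi in zip(w, X1[0])]
lrp2 = [wi * xi / dot(w, X2[0]) * y2[0] for wi, xi in zip(w, X2[0])]
```

Because the covariance subtracts the mean of the data, the estimated pattern is the same before and after the shift, so the PA attributions agree. The zero root point rule multiplies by the raw input and therefore changes with the encoding.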

3.4 SmoothGrad Inherits the Sensitivity Properties of Underlying Methods

Figure 5: SmoothGrad (SG) inherits the sensitivity of the underlying attribution method. SG is not sensitive to the input transformation for gradient and signal methods (SG-PN and SG-GB). SG does not satisfy input invariance for Integrated Gradients (SG-Zero) and Deep Taylor Decomposition (SG-LRP) when a zero vector reference point is used. SG is invariant to the constant input shift when PatternAttribution (SG-PA) or a black image (SG-Black) are used. SG is not input invariant for gradient x input.

SmoothGrad (SG, Smilkov et al. (2017)) replaces the input with $N$ identical versions of the input with added random noise $\mathcal{N}(0, \sigma^2)$. These noisy inputs are injected into the underlying attribution method and the final attribution is the average attribution across the $N$ noisy inputs. For example, if the underlying method is the gradient w.r.t. the input, $g(x)_j = \partial f(x)_j / \partial x$, SG becomes:

$$\frac{1}{N} \sum_{i=1}^{N} g(x + \mathcal{N}(0, \sigma^2))_j$$

SG often results in aesthetically sharper visualizations when applied to multi-layer neural networks with non-linearities. SG does not alter the attribution method itself, so it will always inherit the sensitivity of the underlying method to an input transformation. Fig. 5 compares SG heatmaps generated for all methods discussed so far. Applying SG on top of gradients and signal methods produces identical saliency maps. SG does not satisfy input invariance when applied to gradient x input, LRP and zero vector reference points. SG is insensitive to the input transformation when applied to PA and a black image.
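The inheritance property can be sketched by wrapping two attribution functions in the same averaging loop. The example assumes a toy two-neuron ReLU network with made-up weights, 50 noise samples with standard deviation 0.1, and a fixed seed so that both networks see the same noise; `smoothgrad` and `grad_input` are hypothetical helpers for this illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_input(W, b, v, x):
    """Gradient of f(x) = v . relu(W x + b) with respect to x."""
    grad = [0.0] * len(x)
    for row, bias, out_w in zip(W, b, v):
        if dot(row, x) + bias > 0:
            grad = [g + out_w * w for g, w in zip(grad, row)]
    return grad

def smoothgrad(attr_fn, x, noise):
    """Average attr_fn over the noisy copies x + n, one per noise vector n."""
    total = [0.0] * len(x)
    for n in noise:
        noisy = [a + e for a, e in zip(x, n)]
        total = [t + s for t, s in zip(total, attr_fn(noisy))]
    return [t / len(noise) for t in total]

# Hypothetical toy parameters (assumptions, not taken from the paper).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b1 = [0.1, -0.2]
v = [1.0, -2.0]
m2 = [-1.0, -1.0, -1.0]
b2 = [bias - dot(row, m2) for row, bias in zip(W, b1)]
x1 = [0.2, 0.7, 0.9]
x2 = [a + s for a, s in zip(x1, m2)]

rng = random.Random(0)                     # same noise for both networks
noise = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(50)]

def gi(b):
    """Gradient x input for the network with first-layer bias b."""
    return lambda x: [g * a for g, a in zip(grad_input(W, b, v, x), x)]

sg_grad1 = smoothgrad(lambda x: grad_input(W, b1, v, x), x1, noise)
sg_grad2 = smoothgrad(lambda x: grad_input(W, b2, v, x), x2, noise)
sg_gi1 = smoothgrad(gi(b1), x1, noise)
sg_gi2 = smoothgrad(gi(b2), x2, noise)
```

Averaging does not remove the dependence on the raw input: SG on top of gradients stays invariant in this sketch, while SG on top of gradient times input still differs between the two networks.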

4 The Importance of Choosing an Appropriate Reference Point

IG and DTD satisfy input invariance only when certain reference points and input transformations are considered. The choice of reference point is also important because it determines all subsequent attribution. In Fig. 1 attribution visually diverges for the same method if multiple reference points are considered.

A reasonable reference point choice will naturally depend upon domain and task. For example, Sundararajan et al. (2017) suggest that a black image is a natural reference point for image recognition tasks whereas a zero vector is a reasonable choice for text based networks. However, we have shown that the choice of reference point can lead to very different results. Unintentional misrepresentation of the model is very possible when the implications of attribution using a given reference point are unclear. Thus far, we have discussed attribution for image recognition tasks with the assumption that pre-processing steps are known and visual inspection of the points determined to be salient is possible. For audio and language based models where visual inspection is difficult or inappropriate, identifying failure points or how attribution varies under different baselines poses a challenge.

If we cannot determine the implications of reference point choice, we are limited in our ability to say anything about the reliability of the method. To demonstrate this point, we construct a constant shift of the input that takes advantage of the input invariance points of failure we have already identified.

In the following experiment, we construct a constant vector shift $m_2$ using a hand drawn image of a cat. Network 1 is the same as introduced in Section 2. The raw image can be seen in Fig. 6. Consistent with Section 2, the relationship between $x_1$ and the transformed input $x_2$ is the addition of a constant vector $m_2$.

We construct $m_2$ by choosing a desired attribution $\hat{s}$ that should be assigned to a specific sample $\hat{x}$ when the gradient is multiplied with the input.

$m_2$ is constructed to ensure that the specific sample $\hat{x}$ receives the desired attribution as follows (the division is element-wise):

$$m_2 = \frac{\hat{s}}{\partial f(x)_j / \partial \hat{x}} - \hat{x}$$

We clip the shift to be within [-.3,.3] so that the MNIST digit is still visible; without clipping, the end attribution would only show the cat.
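The construction of the shift can be written out for a single sample. This is a minimal sketch, assuming the gradient at the sample is known and non-zero in every dimension (for a linear model it is simply the weight vector) and using made-up values for the gradient, the sample and the desired attribution. It relies on the bias compensation of Section 2, which keeps the gradient of network 2 at the shifted sample equal to the gradient of network 1 at the original sample. `shift_for_target` is a hypothetical helper.

```python
def clip(value, low, high):
    return max(low, min(high, value))

def shift_for_target(grad, x_hat, s_hat, bound=None):
    """Solve grad * (x_hat + m2) = s_hat element-wise for m2, optionally clipped."""
    m2 = [s / g - x for s, g, x in zip(s_hat, grad, x_hat)]
    if bound is not None:
        m2 = [clip(m, -bound, bound) for m in m2]
    return m2

# Hypothetical values (assumptions, not taken from the paper).
grad = [0.5, -1.0, 2.0]      # gradient at x_hat, assumed non-zero everywhere
x_hat = [0.2, 0.7, 0.9]      # the sample whose explanation is manipulated
s_hat = [0.3, 0.2, -0.4]     # the desired attribution (the "cat")

m2_free = shift_for_target(grad, x_hat, s_hat)
m2_clipped = shift_for_target(grad, x_hat, s_hat, bound=0.3)

# Gradient x input on network 2, i.e. evaluated on the shifted sample.
gi_free = [g * (x + m) for g, x, m in zip(grad, x_hat, m2_free)]
gi_clipped = [g * (x + m) for g, x, m in zip(grad, x_hat, m2_clipped)]
```

Without clipping, gradient times input reproduces the desired attribution exactly. With clipping, the attribution is a mixture of the original explanation and the target, which matches the description of the digit remaining visible alongside the cat.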

In Fig. 6 transforming the input in this manner allows purposeful misrepresentation of the attribution. All methods, except for PA, fail to satisfy input invariance and visibly show a cat as the explanation for an MNIST prediction.

Figure 6: Evaluation of attribution method sensitivity using MNIST. Gradient x Input, IG with both a black and zero reference point and DTD with a LRP reference point, do not satisfy input invariance and produce different attribution for each network. DTD with a PA reference point is not sensitive to the transformation of the input.

How can we avoid breaks in input invariance? PA is invariant to the input transformations considered because it relies on the covariance of the data, which compensates for the shift. If the data had been normalized prior to attribution, in a manner that counters this exact transformation, many of the methods considered would still satisfy input invariance. However, this is far from a systematic treatment of reference point selection, as there are input transformations outside of our experiment scope where this would not be sufficient. We believe an important research agenda is furthering the understanding of reference point choices that guarantee reliability without relying on case-by-case solutions.

5 Conclusion

Saliency methods are powerful tools to gain intuition about our model. We show that numerous methods fail to attribute correctly when a constant vector shift is applied to the input. More worryingly, we show that we are able to purposefully create a deceptive explanation of the network using a hand drawn cat image.

We introduce input invariance as a prerequisite for reliable attribution. Our treatment of input invariance is restricted to demonstrating that there is at least one input transformation (a constant vector shift to the input) that causes numerous saliency methods to attribute incorrectly. This work is motivated by our belief that saliency methods remain valuable tools to gain intuition about the network. Understanding where they fail equips researchers with the tools to appropriately weigh the explanations these models provide.

Guaranteeing the reliability of saliency methods is crucial in tasks where visual inspection of results is not easy or the cost of incorrect attribution is high. For example, human inspection of the attribution for an image recognition task would catch the cat attack experiment (described in Section 4). However, it is unclear how we would catch the same purposeful manipulation or an unintentional misrepresentation in a language or audio model, where inspection is opaque or not possible. Paradoxically, these are also the cases where attribution is most needed in order to understand the data.

Determining how saliency methods fail is an important stepping stone to understanding where and how we should use these methods. An urgent research agenda, and a requirement for the use of deep neural networks in domains like medicine, is evaluating which methods and/or reference points consistently guarantee reliability for all possible transformations.


We would like to acknowledge the thoughtful feedback and guidance of Gregoir Montavon, Mukund Sundararajan, Ankur Taly, Doug Eck and Jonas Kemp.