Interpretation of Neural Networks is Fragile

by Amirata Ghorbani, et al.
Stanford University

In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image as a malignant tumor, then the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptually indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbations can change the feature importance, and that new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight into why fragility could be a fundamental challenge to the current interpretation approaches.



1 Introduction

Predictions made by machine learning algorithms play an important role in our everyday lives and can affect decisions in technology, medicine, and even the legal system (Rich, 2015; Obermeyer & Emanuel, 2016). As the algorithms become increasingly complex, explanations for why an algorithm makes certain decisions are ever more crucial. For example, if an AI system predicts a given pathology image to be malignant, then the doctor would want to know what features in the image led the algorithm to this classification. Similarly, if an algorithm predicts an individual to be a credit risk, then the lender (and the borrower) might want to know why. Therefore having interpretations for why certain predictions are made is critical for establishing trust and transparency between the users and the algorithm (Lipton, 2016).

Having an interpretation is not enough, however. The explanation itself must be robust in order to establish human trust. Take the pathology predictor: an interpretation method might suggest that a particular section in an image is important for the malignant classification (e.g. that section could have high scores in the saliency map). The clinician might then focus on that section for investigation or treatment, or even look for similar features in other patients. It would be highly disconcerting if, in an extremely similar image that is visually indistinguishable from the original and also classified as malignant, a very different section were interpreted as being salient for the prediction. Thus, even if the predictor is robust (both images are correctly labeled as malignant), a fragile interpretation would still be highly problematic in deployment.

Our contributions.

In this paper, we show that widely-used neural network interpretation methods are fragile in the following sense: perceptually indistinguishable images that are given the same prediction label by the neural network can often be given substantially different interpretations. We systematically investigate two classes of interpretation methods: methods that assign importance scores to each feature (this includes simple gradient (Simonyan et al., 2013), DeepLIFT (Shrikumar et al., 2017), and integrated gradients (Sundararajan et al., 2017)), as well as a method that assigns importance scores to each training example: influence functions (Koh & Liang, 2017). For both classes of interpretations, we show that the importance of individual features or training examples is highly fragile to even small random perturbations to the input image. Moreover, we show how targeted perturbations can lead to dramatically different global interpretations (Fig. 1).

Our findings highlight the fragility of popular interpretation methods, which has not been carefully considered in the literature. Fragility directly limits how much we can trust and learn from the interpretations. It also raises a significant new security concern. Especially in medical or economic applications, users often take the interpretation of a prediction as containing causal insight (“this image is a malignant tumor likely because of the section with a high saliency score”). An adversary could minutely manipulate the input to draw attention away from relevant features or onto their desired features. Such attacks might be especially hard to detect because the actual labels have not changed.

While we focus on image data here because most of the interpretation methods have been motivated by images, the fragility of interpretation could be a much broader problem. Fig. 2 illustrates the intuition that when the decision boundary in the input feature space is complex, as is the case with deep nets, a small perturbation in the input can push the example into a region with very different loss contours. Because the feature importance is closely related to the gradient which is perpendicular to the loss contours, the importance scores can also be dramatically different. We provide additional analysis of this in Section 5.

Figure 1: The fragility of feature-importance maps. We generate feature-importance scores, also called saliency maps, using three popular interpretation methods: simple gradient (a), DeepLIFT (b) and integrated gradients (c). The top row shows the original images and their saliency maps and the bottom row shows the perturbed images (using the center attack, as described in Section 3) and the corresponding saliency maps. In all three images, the predicted label has not changed due to perturbation; in fact the network’s (SqueezeNet) confidence in the prediction has actually increased. However, the saliency maps of the perturbed images are meaningless.
Figure 2: Intuition for why interpretation is fragile. Consider a test example $x_t$ (black dot) that is slightly perturbed to a new position $x_t + \delta$ in input space (gray dot). The contours and decision boundary corresponding to a loss function ($L$) for a two-class classification task are also shown, allowing one to see the direction of the gradient of the loss with respect to the input space. Neural networks with many parameters have decision boundaries that are roughly piecewise linear with many transitions. We illustrate that points near the transitions are especially fragile to interpretability-based analysis. A small perturbation to the input changes the direction of $\nabla_x L$ from being in the direction of $x_1$ to being in the direction of $x_2$, directly affecting feature-importance analyses. Similarly, a small perturbation to the test image changes which training image, when up-weighted, has the largest influence on $L$, directly affecting exemplar-based analysis.

2 Interpretation Methods for Neural Network Predictions

2.1 Feature-Importance Interpretation

This first class of methods explains predictions in terms of the relative importance of features in a test input sample. Given the sample $x_t \in \mathbb{R}^d$ and the network’s prediction $l$, we define the score of the predicted class $S_l(x_t)$ to be the value of the $l$-th output neuron right before the softmax operation. We take $l$ to be the class with the max score; i.e. the predicted class. Feature-importance methods seek to find the dimensions of the input data point that most strongly affect the score, and in doing so, these methods assign an absolute saliency score to each input feature. Here we normalize the scores for each image by the sum of the saliency scores across the features. This ensures that any perturbations that we design change not the absolute feature saliencies (which may still preserve the ranking of different features), but their relative values. We summarize three different methods to calculate the normalized saliency score, denoted by $R(x_t)$.

Simple gradient method

Introduced in Baehrens et al. (2010) and applied to deep neural networks in Simonyan et al. (2013), the simple gradient method applies a local linear approximation of the model to detect the sensitivity of the score to perturbing each of the input dimensions. Given input $x_t \in \mathbb{R}^d$, the score is defined as:

$$R(x_t)_j = \frac{\left| \nabla_x S_l(x_t)_j \right|}{\sum_{i=1}^{d} \left| \nabla_x S_l(x_t)_i \right|} \qquad (1)$$
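The normalized simple-gradient score can be illustrated with a minimal NumPy sketch. The two-layer softplus network, its random weights, and the finite-difference gradient are assumptions made for illustration only; in practice the gradient would come from automatic differentiation on the trained network.

```python
import numpy as np

def score(x, W1, w2):
    """Toy pre-softmax score S(x) = w2 . softplus(W1 x) (hypothetical network)."""
    return float(w2 @ np.logaddexp(0.0, W1 @ x))

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        grad[j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

def simple_gradient_saliency(f, x):
    """Normalized saliency: |dS/dx_j| divided by the sum of |dS/dx_i| over i."""
    g = np.abs(numerical_gradient(f, x))
    return g / g.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 5))   # assumed toy weights
w2 = rng.normal(size=8)
x_t = rng.normal(size=5)       # assumed toy input
R = simple_gradient_saliency(lambda x: score(x, W1, w2), x_t)
```

Because of the normalization, the entries of `R` are non-negative and sum to one, so only the relative importance of the features is retained.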


Integrated gradients

A significant drawback of the simple gradient method is the saturation problem discussed by Shrikumar et al. (2017) and Sundararajan et al. (2017). Consequently, Sundararajan et al. (2017) introduced the integrated gradients method, where the gradients of the score with respect to $M$ scaled versions of the input are summed and then multiplied by the input. Letting $x^0$ be the reference point and $\Delta x_t = x_t - x^0$, the feature importance vector is calculated by:

$$R(x_t) = \left| \frac{\Delta x_t}{M} \sum_{k=1}^{M} \nabla_x S_l\!\left( \frac{k}{M} \Delta x_t + x^0 \right) \right| \qquad (2)$$

which is then normalized for our analysis. Here the absolute value is taken for each dimension.
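As a hedged illustration of the integrated gradients sum, the sketch below uses a toy score $S(x) = \tanh(w^\top x)$ whose gradient is known in closed form. The score function, the weights, and the zero reference point are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def score_grad(x, w):
    """Gradient of the toy score S(x) = tanh(w . x) with respect to x."""
    return w * (1.0 - np.tanh(w @ x) ** 2)

def integrated_gradients(x, x_ref, w, M=100):
    """Riemann-sum integrated gradients with M steps, then normalized to sum to 1."""
    delta = x - x_ref
    total = np.zeros_like(x)
    for k in range(1, M + 1):
        total += score_grad(x_ref + (k / M) * delta, w)
    attributions = np.abs(delta / M * total)   # absolute value per dimension
    return attributions / attributions.sum()

w = np.array([2.0, -1.0, 0.5, 0.0])            # assumed toy weights
x_t = np.array([1.0, 0.5, -1.0, 3.0])          # assumed toy input
R = integrated_gradients(x_t, np.zeros_like(x_t), w)
```

For this toy score every path gradient is a multiple of `w`, so the normalized attributions are proportional to $|x_j w_j|$: the last feature has a large input value but zero weight and therefore receives zero importance.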


DeepLIFT

DeepLIFT is an improved version of the layer-wise relevance propagation (LRP) method (Bach et al., 2015). LRP methods decompose the score backwards through the neural network. In each step, the score from the last layer is propagated to the previous layer, with the score being divided proportionally to the magnitude of the activations of the neurons in the previous layer. The scores are propagated to the input layer, and the result is a relevance score assigned to each of the input dimensions. DeepLIFT (Shrikumar et al., 2017) defines a reference point in the input space and propagates relevance scores proportionally to the changes in the neuronal activations from the reference. We use DeepLIFT with the Rescale rule; see Shrikumar et al. (2017) for details.

2.2 Exemplar-Based Methods: Influence Functions

A complementary approach to interpreting the results of a neural network is to explain the prediction of the network in terms of its training examples, $\{(x_i, y_i)\}$. Specifically, we can ask: which training examples, if up-weighted or down-weighted during training time, would have the biggest effect on the loss of the test example $(x_t, y_t)$? Koh & Liang (2017) proposed a method to calculate this value, called the influence, defined by the following equation:

$$I(z_i, z_t) = -\nabla_\theta L(z_t, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}) \qquad (3)$$

where $z_i = (x_i, y_i)$ and $z_t$ is defined analogously. $L(z, \hat{\theta})$ is the loss of the network with parameters set to $\hat{\theta}$ for the (training or test) data point $z$. $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})$ is the empirical Hessian of the network calculated over the training examples. The training examples with the highest influence are understood as explaining why a network made a particular prediction for a test example.
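The influence formula can be made concrete with a small sketch for logistic regression, where gradients and the Hessian are available in closed form. The synthetic data, the gradient-descent fit, and the damping term added to the Hessian are assumptions for illustration; they are not the setup used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, X, Y):
    """Rows are gradients w.r.t. theta of the loss log(1 + exp(-y * theta.x))."""
    return -(Y * sigmoid(-Y * (X @ theta)))[:, None] * X

def empirical_hessian(theta, X, damping=1e-3):
    """Mean Hessian of the logistic loss over the training set, plus damping."""
    p = sigmoid(X @ theta)
    H = (X * (p * (1.0 - p))[:, None]).T @ X / len(X)
    return H + damping * np.eye(X.shape[1])

def influences(theta, X, Y, x_t, y_t):
    """I(z_i, z_t) = -grad L(z_t)^T H^{-1} grad L(z_i) for every training point."""
    H = empirical_hessian(theta, X)
    g_t = grad_loss(theta, x_t[None, :], np.array([y_t]))[0]
    return -grad_loss(theta, X, Y) @ np.linalg.solve(H, g_t)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # assumed synthetic data
Y = np.where(X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50) > 0, 1.0, -1.0)
theta = np.zeros(3)
for _ in range(300):                               # approximate minimizer
    theta -= 0.5 * grad_loss(theta, X, Y).mean(axis=0)

scores = influences(theta, X, Y, X[0], Y[0])       # first training point as test point
top3 = np.argsort(-scores)[:3]                     # three highest-influence examples
```

Using a training point as its own test point gives a non-positive self-influence, because the Hessian is positive definite: up-weighting a point can only lower its own loss to first order.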

2.3 Metrics for Interpretation Similarity

We consider two natural metrics for quantifying the similarity between interpretations for two different images.

  • Spearman’s rank order correlation: Because interpretation methods rank all of the features or training examples in order of importance, it is natural to use the rank correlation (Spearman, 1904) to compare the similarity between interpretations.

  • Top-k intersection: In many settings, only the most important features or training examples are of interest. In these settings, we can compute the size of the intersection of the k most important features before and after perturbation.
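Both metrics are simple to compute. The following is a minimal NumPy sketch; the helper names and the five-element example vectors are hypothetical, and the rank helper assumes there are no tied scores.

```python
import numpy as np

def ranks(v):
    """Rank of each entry (0 = smallest); assumes no ties."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def top_k_intersection(a, b, k):
    """Number of indices shared by the k largest entries of a and of b."""
    top_a = set(np.argsort(-a)[:k].tolist())
    top_b = set(np.argsort(-b)[:k].tolist())
    return len(top_a & top_b)

original = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # toy saliency scores
perturbed = np.array([0.05, 0.10, 0.15, 0.30, 0.40])  # same scores, ranking reversed
```

With these toy vectors the ranking is exactly reversed, so `spearman(original, perturbed)` is -1, the top-2 sets are disjoint, and the top-3 sets share only the middle feature.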

3 Random and systematic perturbations

Problem statement

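For a given fixed neural network $\mathcal{N}$ and input data point $x_t$, the feature importance and influence function methods that we have described produce an interpretation $\mathcal{I}(x_t; \mathcal{N})$. For feature importance, $\mathcal{I}(x_t; \mathcal{N})$ is a vector of feature scores; for influence functions, $\mathcal{I}(x_t; \mathcal{N})$ is a vector of scores for training examples. We would like to devise efficient perturbations to change the interpretability of a test image. Yet, the perturbations should be visually imperceptible and should not change the label of the prediction. Formally, we define the problem as:

$$\arg\max_{\delta} \; \mathcal{D}\big(\mathcal{I}(x_t; \mathcal{N}),\, \mathcal{I}(x_t + \delta; \mathcal{N})\big) \quad \text{subject to: } \lVert \delta \rVert_\infty \le \epsilon, \;\; \text{Prediction}(x_t + \delta; \mathcal{N}) = \text{Prediction}(x_t; \mathcal{N})$$

where $\mathcal{D}(\cdot)$ measures the change in interpretation (e.g. how many of the top-$k$ pixels are no longer the top-$k$ pixels of the saliency map after the perturbation) and $\epsilon > 0$ constrains the norm of the perturbation. In this paper, we carry out three kinds of input perturbations.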

Random sign perturbation

As a baseline, we generate random perturbations in which each pixel is randomly perturbed by $\pm\epsilon$. This is used to measure robustness against untargeted perturbations.
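A minimal sketch of this baseline is shown below. The pixel range of 0 to 255, the value of `eps`, and the clipping to the valid range are assumptions for illustration.

```python
import numpy as np

def random_sign_perturbation(x, eps, rng, lo=0.0, hi=255.0):
    """Add +eps or -eps to every pixel independently, then clip to the valid range."""
    signs = rng.choice([-1.0, 1.0], size=x.shape)
    return np.clip(x + eps * signs, lo, hi)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(4, 4, 3))    # assumed toy image
perturbed = random_sign_perturbation(image, eps=8.0, rng=rng)
```

Every pixel moves by at most `eps`, so the perturbation always satisfies the $\ell_\infty$ constraint of the problem statement.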

Iterative attacks against feature-importance methods

In Algorithm 1 we define two adversarial attacks against feature-importance methods, each of which consists of taking a series of steps in the direction that maximizes a differentiable dissimilarity function between the original and perturbed interpretation. (1) The top-$k$ attack seeks to perturb the saliency map by decreasing the relative importance of the $k$ most important features of the original image. (2) When the input data are images, the center of mass of the saliency map often captures the user’s attention. The mass-center attack is designed to result in the maximum spatial displacement of the center of mass of the saliency scores. Both of these attacks can be applied to any of the three feature-importance methods.

  Input: test image $x_t$, maximum norm of perturbation $\epsilon$, normalized feature importance function $R(\cdot)$, number of iterations $P$, step size $\alpha$
  Define a dissimilarity function $D$ to measure the change between interpretations of two images:
  $$D(x_t, x) = \begin{cases} -\sum_{i \in B} R(x)_i & \text{for the top-}k\text{ attack} \\ \lVert C(x) - C(x_t) \rVert_2 & \text{for the mass-center attack} \end{cases}$$
  where $B$ is the set of the $k$ largest dimensions of $R(x_t)$ (see note 1), and $C(\cdot)$ is the center of saliency mass (see note 2).
  Initialize $x^0 = x_t$
  for $p \in \{1, \ldots, P\}$ do
     Perturb the test image in the direction of the signed gradient (see note 3) of the dissimilarity function:
     $$x^p = x^{p-1} + \alpha \cdot \mathrm{sign}\big(\nabla_x D(x_t, x^{p-1})\big)$$
     If needed, clip the perturbed input to satisfy the norm constraint: $\lVert x^p - x_t \rVert_\infty \le \epsilon$
  end for
  Among $\{x^1, \ldots, x^P\}$, return the element with the largest value for the dissimilarity function and the same prediction as the original test image.
  Note 1: The goal is to damp the saliency scores of the $k$ features originally identified as the most important.
  Note 2: The center of mass is defined for a $W \times H$ image as: $C(x) = \sum_{i \in \{1, \ldots, W\}} \sum_{j \in \{1, \ldots, H\}} R(x)_{i,j} \, [i, j]^\top$
  Note 3: In some networks, such as those with ReLUs, this gradient is always 0. To attack interpretability in such networks, we replace the ReLU activations with their smooth approximation (softplus) when calculating the gradient and generate the perturbed image using this approximation. The perturbed images that result are effective adversarial attacks against the original ReLU network.
Algorithm 1 Iterative Feature-Importance Attacks
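The loop in Algorithm 1 can be sketched for the top-k variant as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the network is a small hypothetical softplus model with a single score (predicted label taken as the sign of the score), the gradient of the dissimilarity is estimated by finite differences instead of automatic differentiation, and the values of `k`, `eps`, `alpha`, and `P` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W1, w2):
    """Toy score S(x) = w2 . softplus(W1 x); the predicted label is S(x) > 0."""
    return float(w2 @ np.logaddexp(0.0, W1 @ x))

def saliency(x, W1, w2):
    """Normalized simple-gradient saliency R(x), using the analytic input gradient."""
    g = np.abs(W1.T @ (w2 * sigmoid(W1 @ x)))
    return g / g.sum()

def grad_fd(f, x, h=1e-4):
    """Central finite-difference gradient (stand-in for automatic differentiation)."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def top_k_attack(x_t, W1, w2, k=3, eps=0.1, alpha=0.02, P=20):
    top_idx = np.argsort(-saliency(x_t, W1, w2))[:k]       # the set B
    D = lambda x: -saliency(x, W1, w2)[top_idx].sum()      # top-k dissimilarity
    label = score(x_t, W1, w2) > 0
    best, best_D = x_t, D(x_t)
    x = x_t.copy()
    for _ in range(P):
        x = x + alpha * np.sign(grad_fd(D, x))             # signed-gradient step
        x = np.clip(x, x_t - eps, x_t + eps)               # keep ||x - x_t||_inf <= eps
        if (score(x, W1, w2) > 0) == label and D(x) > best_D:
            best, best_D = x, D(x)                         # keep best label-preserving iterate
    return best, top_idx

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(16, 10)), rng.normal(size=16)   # assumed toy weights
x_t = rng.normal(size=10)
x_adv, top_idx = top_k_attack(x_t, W1, w2)
```

The mass-center variant would follow the same loop with `D` replaced by the distance between the centers of saliency mass of the original and perturbed inputs.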

Gradient sign attack against influence functions

We can obtain effective adversarial images for influence functions without resorting to iterative procedures. We linearize (3) around the values of the current inputs and parameters. If we further constrain the $\ell_\infty$ norm of the perturbation to $\epsilon$, we obtain an optimal single-step perturbation:

$$\delta = \epsilon \, \mathrm{sign}\big(\nabla_{x_t} I(z_i, z_t)\big) = -\epsilon \, \mathrm{sign}\big(J_{x_t}^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})\big), \qquad J_{x_t} = \frac{\partial \nabla_\theta L(z_t, \hat{\theta})}{\partial x_t} \qquad (4)$$

This perturbation can be applied to the pixels of a test image to increase or decrease the influence of a particular training example, $z_i$. The attack we use consists of applying the negative of the perturbation in (4) to decrease the influence of the 3 most influential training images of the original test image. In other words, we generate the perturbation given by $-\epsilon \, \mathrm{sign}\big(\sum_{i=1}^{3} \nabla_{x_t} I(z_{(i)}, z_t)\big)$, where $z_{(i)}$ is the $i$-th most influential training image of the original test image. Of course, this affects the influence of all of the other training images as well.
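The single-step attack can be sketched on a logistic-regression model, where the influence is cheap to evaluate. Everything specific here is an assumption for illustration: the synthetic data, the damped Hessian, the value of `eps`, and the finite-difference gradient with respect to the test input, which stands in for back-propagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, X, Y):
    """Rows are gradients w.r.t. theta of the loss log(1 + exp(-y * theta.x))."""
    return -(Y * sigmoid(-Y * (X @ theta)))[:, None] * X

def influences(theta, X, Y, H_inv, x_t, y_t):
    """I(z_i, z_t) = -grad L(z_t)^T H^{-1} grad L(z_i) for all training points."""
    g_t = grad_loss(theta, x_t[None, :], np.array([y_t]))[0]
    return -grad_loss(theta, X, Y) @ (H_inv @ g_t)

def grad_fd(f, x, h=1e-5):
    """Central finite-difference gradient with respect to the test input."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))                       # assumed synthetic training set
Y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=60) > 0, 1.0, -1.0)
theta = np.zeros(4)
for _ in range(300):                               # approximate minimizer
    theta -= 0.5 * grad_loss(theta, X, Y).mean(axis=0)
p = sigmoid(X @ theta)
H = (X * (p * (1.0 - p))[:, None]).T @ X / len(X) + 1e-3 * np.eye(4)
H_inv = np.linalg.inv(H)

x_t, y_t = rng.normal(size=4), 1.0                 # assumed test point
before = influences(theta, X, Y, H_inv, x_t, y_t)
top3 = np.argsort(-before)[:3]                     # 3 most influential training points

def top3_influence(x):
    return influences(theta, X, Y, H_inv, x, y_t)[top3].sum()

eps = 1e-3
x_adv = x_t - eps * np.sign(grad_fd(top3_influence, x_t))   # negative of the step in (4)
after = influences(theta, X, Y, H_inv, x_adv, y_t)
```

Comparing `before[top3].sum()` with `after[top3].sum()` shows the combined influence of the targeted examples dropping after the step, while the influence of the other training points shifts as a side effect.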

We follow the same setup for computing the influence function as was done by the authors of Koh & Liang (2017). Because the influence is only calculated with respect to the parameters that change during training, we calculate the gradients only with respect to parameters in the final layer of our network (InceptionNet, see Section 4). This makes it feasible for us to compute (4) exactly, but it gives us the perturbation of the input into the final layer, not the first layer. So, we use standard back-propagation to calculate the corresponding gradient for the input test image. We then take the sign of this gradient as the perturbation and clip the image to produce the adversarial test image.

Figure 3: Comparison of adversarial attack algorithms on feature-importance methods. Across 512 correctly-classified ImageNet images, we find that the top-$k$ and center attacks perform similarly in top-1000 intersection and rank correlation measures, and are far more effective than the random sign perturbation at demonstrating the fragility of interpretability, as characterized through top-1000 intersection (top) as well as rank order correlation (bottom). This is true for (a) the simple gradient method, (b) DeepLIFT, and (c) the integrated gradients method.

4 Experiments & Results

Data sets and models

To evaluate the robustness of feature-importance methods, we used two image classification data sets: ILSVRC2012 (ImageNet classification challenge data set) (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009). For the ImageNet classification data set, we used a pre-trained SqueezeNet model introduced by Iandola et al. (2016). For the CIFAR-10 data set we trained our own convolutional network, whose architecture is presented in Appendix A.

For both data sets, the results are examined using simple gradient, integrated gradients, and DeepLIFT feature importance methods. For DeepLIFT, we used the pixel-wise and the channel-wise mean images as the CIFAR-10 and ImageNet reference points respectively. For the integrated gradients method, the same references were used with parameter $M = 100$. We ran all iterative attack algorithms for a fixed number of iterations $P$ with a fixed step size $\alpha$.

To evaluate the robustness of influence functions, we followed a similar experimental setup to that of the original authors: we trained an InceptionNet v3 with all but the last layer frozen (the weights were pre-trained on ImageNet and obtained from Keras). The last layer was trained on a binary flower classification task (roses vs. sunflowers), using a data set consisting of 1,000 training images. This data set was chosen because it consisted of images that the network had not seen during pre-training on ImageNet. The network achieved a validation accuracy of 97.5% on this task.

Results for feature-importance methods

From the ImageNet test set, 512 correctly-classified images were randomly sampled for evaluation purposes. Examples of the center-shift attack against three feature importance methods were presented in Fig. 1. Further representative examples of different attacks on additional images are found in Appendix B.

In Fig. 3, we present results aggregated over all 512 images. We compare different attack methods using top-1000 intersection and rank correlation methods. Random sign perturbation already causes significant changes in both top-1000 intersection and rank order correlation. For example, on average, there is less than 30% overlap in the top 1000 most salient pixels between the original and the randomly perturbed images across all three interpretation methods. This suggests that the saliency of individual or small groups of pixels can be extremely fragile to the input and should be interpreted with caution. With targeted perturbations, we observe more dramatic fragility. Even with a small perturbation, the interpretations change significantly.

Both iterative attack algorithms have similar effects on feature importance of test images when measured on the basis of rank correlation or top-1000 intersection. In Appendix C, we show an additional metric: the displacement of the center of mass between the original and perturbed saliency maps. Empirically, we find this metric to correspond most strongly with intuitive perceptions of the similarity between two saliency maps. Not surprisingly, we found that the center attack method was more effective than the top-$k$ attack at moving the center of mass of the saliency maps. Comparing interpretation methods, we found that the integrated gradients method was the most robust to both random and adversarial attacks. Similar results for CIFAR-10 can be found in Appendix C.

Results for influence functions

We evaluate the robustness of influence functions on a test data set consisting of 200 images of roses and sunflowers. Fig. 4 shows a representative test image to which we have applied the gradient sign attack. Although the prediction of the image does not change, the most influential training examples selected according to (3), as explanation for the prediction, change entirely from images of sunflowers and yellow petals that resemble the input image to those of red and pink roses that do not. Additional examples can be found in Appendix D.

In Fig. 5, we compare the random perturbations and gradient sign attacks across all of the test images. We find that the gradient sign-based attacks are significantly more effective at decreasing the rank correlation of the influence of the training images, as well as distorting the top-5 influential images. For example, on average, after a visually imperceptible targeted perturbation, only 2 of the top 5 most influential training images remain among the top 5 most influential images. The influences of the training images before and after an adversarial attack are essentially uncorrelated. However, we find that even random attacks can have a non-negligible effect on influence functions, on average reducing the rank correlation to 0.8.

Figure 4: Gradient sign attack on influence functions. An imperceptible perturbation to a test image can significantly affect exemplar-based interpretability. The original test image is that of a sunflower that is classified correctly in a rose vs. sunflower classification task. The top 3 training images identified by influence functions are shown in the top row. Using the gradient sign attack, we perturb the test image to produce the leftmost image in the second row. Although the image is even more confidently predicted as a sunflower, influence functions suggest very different training images by means of explanation: instead of the sunflowers and yellow petals that resemble the input image, the most influential images are pink/red roses. The plot on the right shows the influence of each training image before and after perturbation. The 3 most influential images (targeted by the attack) have decreased in influence, but the influences of other images have also changed.
Figure 5: Comparison of random and targeted perturbations on influence functions. Here, we show the averaged results of applying random (green) and gradient sign-based (orange) perturbations to 200 test images on the flower classification task. While random attacks affect interpretability, the effect is small and generally doesn’t affect the most influential images. On the other hand, a targeted attack can significantly affect (a) the rank correlation and (b) even change the make-up of the 5 most influential images. Even at the maximal level of noise, the changes to the perturbed images were visually imperceptible, and prediction confidence was not significantly changed for either random or targeted attacks.

5 Hessian analysis

In this section, we try to understand the source of interpretation fragility. The question is whether fragility is a consequence of the complex non-linearities of a deep network or a characteristic present even in high-dimensional linear models, as is the case for adversarial examples for prediction (Goodfellow et al., 2014). To gain more insight into the fragility of gradient-based interpretations, let $S(x; W)$ denote the score function of interest; $x \in \mathbb{R}^d$ is an input vector and $W$ is the weights of the neural network, which is fixed since the network has finished training. We are interested in the Hessian $H$ whose entries are $H_{i,j} = \frac{\partial^2 S}{\partial x_i \partial x_j}$. The reason is that the first-order approximation of the change in the gradient for some input perturbation direction $\delta \in \mathbb{R}^d$ is: $\nabla_x S(x + \delta) - \nabla_x S(x) \approx H \delta$.

First, consider a linear model whose score for an input $x$ is $S = w^\top x$. Here, $\nabla_x S = w$ and $\nabla_x^2 S = 0$; the feature-importance vector is robust, because it is completely independent of $x$. Thus, some non-linearity is required for interpretation fragility. A simple network that is susceptible to adversarial attacks on interpretations consists of a set of weights connecting the input to a single neuron followed by a non-linearity (e.g. logistic regression):

$$S = g(w^\top x)$$
We can calculate the change in saliency map due to a small perturbation $x \to x + \delta$. The first-order approximation for the change in saliency map will be equal to $H \cdot \delta = \nabla_x^2 S \cdot \delta$. In particular, the saliency of the $i$-th feature changes by $(\nabla_x^2 S \cdot \delta)_i$ and furthermore, the relative change is $(\nabla_x^2 S \cdot \delta)_i / (\nabla_x S)_i$. For the simple network, this relative change is:

$$\frac{\big(w w^\top \delta \, g''(w^\top x)\big)_i}{\big(w \, g'(w^\top x)\big)_i} = \frac{w^\top \delta \, g''(w^\top x)}{g'(w^\top x)}$$

where we have used $g'(\cdot)$ and $g''(\cdot)$ to refer to the first and second derivatives of $g(\cdot)$. Note that $g'(w^\top x)$ and $g''(w^\top x)$ do not scale with the dimensionality of $x$ because in general, independent from the dimensionality, $x$ and $w$ are $\ell_2$-normalized or have fixed $\ell_2$-norm due to data preprocessing and weight decay regularization. However, if we choose $\delta = \epsilon \, \mathrm{sign}(w)$, then the relative change in the saliency grows with the dimension, since it is proportional to the $\ell_1$-norm of $w$. When the input is high-dimensional, which is the case with images, the relative effect of the perturbation can be substantial. Note also that this perturbation is exactly the sign of the first right singular vector of the Hessian $\nabla_x^2 S$, which is appropriate since that is the vector that has the maximum effect on the gradient of $S$. A similar analysis can be carried out for influence functions (see Appendix E).
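The dimension dependence can be checked numerically with a short sketch. The choices here are assumptions for illustration: $g$ is the logistic sigmoid, $w$ is a random unit-norm vector, and the input is set equal to $w$ so that $w^\top x = 1$ at every dimension, which keeps $g'$ and $g''$ fixed while $d$ grows.

```python
import numpy as np

def relative_saliency_change(w, x, delta):
    """First-order relative change w.delta * g''(w.x) / g'(w.x) for g = sigmoid."""
    s = 1.0 / (1.0 + np.exp(-(w @ x)))
    g1 = s * (1.0 - s)                  # g'(w.x)
    g2 = g1 * (1.0 - 2.0 * s)           # g''(w.x)
    return (w @ delta) * g2 / g1

eps = 0.01
rng = np.random.default_rng(0)
changes = {}
for d in (10, 1000, 100000):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)              # fixed l2-norm weights
    x = w.copy()                        # assumption: unit-norm input with w.x = 1
    delta = eps * np.sign(w)            # perturbation with ||delta||_inf = eps
    changes[d] = abs(relative_saliency_change(w, x, delta))
```

Since $w^\top \delta = \epsilon \lVert w \rVert_1$ and the $\ell_1$-norm of a unit-norm Gaussian vector grows roughly like $\sqrt{d}$, the values in `changes` increase with the dimension even though the per-pixel perturbation size stays the same.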

For this simple network, the direction of adversarial attack on interpretability, $\mathrm{sign}(w)$, is the same as the adversarial attack on prediction. This means that we cannot perturb interpretability independently of prediction. For more complex networks, this is not the case and in Appendix F we show this analytically for a simple case of a two-layer network. As an empirical test, in Fig. 4(a), we plot the distribution of the angle between $\nabla_x S$ and $v_1$ (the first right singular vector of $H$, which is the most fragile direction of feature importance) for 1000 CIFAR-10 images (details of the network are in Appendix A). In Fig. 4(b), we plot the equivalent distribution for influence functions, computed across all 200 test images. The result confirms that the steepest directions of change in interpretation and prediction are generally orthogonal, justifying how the perturbations can change the interpretation without changing the prediction.
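The angle computation can be sketched on toy networks with closed-form Hessians. The softplus activation, the random weights, and the input scaling are assumptions for illustration; because the sign of a singular vector is arbitrary, the sketch reports the angle folded into the range from 0 to 90 degrees.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_and_hessian(x, W1, w2):
    """Input gradient and Hessian of S(x) = w2 . softplus(W1 x)."""
    s = sigmoid(W1 @ x)
    grad = W1.T @ (w2 * s)
    hess = W1.T @ ((w2 * s * (1.0 - s))[:, None] * W1)
    return grad, hess

def fragility_angle_deg(x, W1, w2):
    """Angle between grad S and the first right singular vector of the Hessian."""
    grad, hess = grad_and_hessian(x, W1, w2)
    v1 = np.linalg.svd(hess)[2][0]                   # first row of V^T
    cos = abs(grad @ v1) / np.linalg.norm(grad)      # v1 has unit norm
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

rng = np.random.default_rng(0)
d = 200
x = rng.normal(size=d) / np.sqrt(d)                  # assumed toy input
# Single neuron: the Hessian is rank one and aligned with the gradient.
angle_single = fragility_angle_deg(x, rng.normal(size=(1, d)), np.array([1.0]))
# Two-layer network with 64 hidden units: the two directions need not align.
angle_two_layer = fragility_angle_deg(x, rng.normal(size=(64, d)), rng.normal(size=64))
```

For the single-neuron model the angle is numerically zero, matching the statement that interpretability and prediction cannot be attacked independently there; the two-layer value illustrates how the directions can separate once a hidden layer is added.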

Figure 4: Orthogonality of Prediction and Interpretation Fragile Directions (a) The histogram of the angle between the steepest direction of change in feature importance and the steepest score change direction. (b) The distribution of the angle between the gradient of the loss function and the steepest direction of change of influence of the most influential image.

6 Discussion

Related works

To the best of our knowledge, the notion of adversarial attacks has not previously been studied in the context of interpretation of neural networks. Adversarial attacks on the input that change the prediction of a network have been actively studied. Szegedy et al. (2013) demonstrated that it is relatively easy to fool neural networks into making very different predictions for test images that are visually very similar to each other. Goodfellow et al. (2014) introduced the Fast Gradient Sign Method (FGSM) as a one-step prediction attack. This was followed by more effective iterative attacks (Kurakin et al., 2016) seeking to change the prediction of a network by a small perturbation. Different metrics for quantifying the size of the perturbation have been used. Moosavi-Dezfooli et al. (2016) and Szegedy et al. (2013) used $\ell_2$; Papernot et al. (2016) considered the number of perturbed pixels ($\ell_0$); and Goodfellow et al. (2014) suggest using $\ell_\infty$, because this tightly controls how much each individual feature can change. We followed the popular practice and evaluate with $\ell_\infty$.

Interpretation of neural network predictions is also an active research area. Post-hoc interpretability (Lipton, 2016) is one family of methods that seek to “explain” the prediction without talking about the details of the black-box model’s hidden mechanisms. These include tools to explain predictions by networks in terms of the features of the test example (Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Zhou et al., 2016), as well as in terms of the contribution of training examples to the prediction at test time (Koh & Liang, 2017). These interpretations have gained increasing popularity, as they confer a degree of insight to human users of what the neural network might be doing (Lipton, 2016).


Conclusion

This paper demonstrates that interpretation of neural networks can be fragile. We develop new perturbations to illustrate this fragility and propose evaluation metrics as well as insights on why fragility occurs. Fragility of interpretation is orthogonal to fragility of the prediction: we demonstrate how perturbations can substantially change the interpretation without changing the predicted label. The two types of fragility do arise from similar factors, as we discuss in Section 5. Our focus is on the interpretation method, rather than on the original network, and as such we do not explore how interpretable the original predictor is. There is a separate line of research that tries to design simpler and more interpretable prediction models (Ba & Caruana, 2014).

Our main message is that robustness of the interpretation of a prediction is an important and challenging problem, especially as in many applications (e.g. many biomedical and social settings) users are as interested in the interpretation as in the prediction itself. Our results raise concerns about how sensitive interpretation is to noise and how it can be manipulated. We do not suggest that interpretations are meaningless, just as adversarial attacks on predictions do not imply that neural networks are useless. Interpretations do need to be used and evaluated with caution. Especially in settings where the importance of an individual feature or a small subset of features is interpreted, we show that these importance scores can be sensitive to even random perturbation. More dramatic manipulations of interpretations can be achieved with our targeted perturbations, which raise security concerns. While we focus on image data (ImageNet and CIFAR-10), because these are the standard benchmarks for popular interpretation tools, this fragility issue can be widespread in biomedical, economic and other settings where neural networks are increasingly used. Understanding interpretation fragility in these applications and developing more robust methods are important agendas of research.



Appendix A Description of the CIFAR-10 classification network

We trained the following structure using ADAM optimizer  (Kingma & Ba, 2014) with default parameters. The resulting test accuracy using ReLU activation was 73%. For the experiment in section 6, we replaced ReLU activation with Softplus and retrained the network (with the ReLU network weights as initial weights). The resulting accuracy was 73%.

Network Layers
conv. 96 ReLU
conv. 96 ReLU
conv. 96 ReLU, stride 2
conv. 192 ReLU
conv. 192 ReLU
conv. 192 ReLU, stride 2
1024 hidden sized feed forward

Appendix B Additional examples of feature importance perturbations

Here we provide three more examples from ImageNet. For each example, all three methods of feature importance are attacked by random sign noise and our two targeted adversarial algorithms.

Figure 5: All of the images are classified as an Airedale.
Figure 6: All of the images are classified as a damselfly.
Figure 7: All of the images are classified as a lighter.

Appendix C Measuring center of mass movement

Figure 8: Center-shift results for three feature importance methods on ImageNet. As discussed in the paper, among our three measurements, the center-shift measure was the one most correlated with the subjective perception of change in saliency maps. The results in Appendix B also show that the center attack, which resulted in the largest average center-shift, also produces the most significant subjective change in saliency maps. Random sign perturbations, on the other hand, did not substantially change the global shape of the saliency maps, though local pockets of saliency are sensitive. As with the rank correlation and top-1000 intersection measures, the integrated gradients method is the most robust method against adversarial attacks under the center-shift measure.
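As an illustration of the center-shift measure, the following is a minimal sketch rather than the code used for the experiments. It assumes a 2-D saliency map (channels already aggregated), takes absolute values and normalizes them to sum to one, and measures the Euclidean distance between the two centers of mass in pixel units. The function names `center_of_mass` and `center_shift` are hypothetical.

```python
import numpy as np

def center_of_mass(saliency):
    """Center of mass (row, col) of a 2-D saliency map.

    Assumption: importance is |saliency|, normalized to sum to one.
    """
    s = np.abs(np.asarray(saliency, dtype=float))
    s = s / s.sum()
    rows, cols = np.indices(s.shape)
    return np.array([(rows * s).sum(), (cols * s).sum()])

def center_shift(saliency_a, saliency_b):
    """Euclidean distance between the centers of mass of two maps."""
    diff = center_of_mass(saliency_a) - center_of_mass(saliency_b)
    return float(np.linalg.norm(diff))

# Toy example: all importance on one pixel, before and after a perturbation.
before = np.zeros((10, 10))
before[2, 3] = 1.0
after = np.zeros((10, 10))
after[5, 7] = 1.0
print(center_shift(before, after))  # 5.0
```

A large shift indicates that the bulk of the importance has moved to a different region of the image, which is the kind of change a viewer tends to notice.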

Results for adversarial attacks against CIFAR-10 feature importance methods

Figure 9: Results for adversarial attacks against CIFAR-10 feature importance methods. For CIFAR-10, the center-shift attack and the top-k attack with k=100 achieve similar results for the rank correlation and top-100 intersection measurements, and both are stronger than random perturbations. The center-shift attack moves the center of mass more than the other two perturbations. Among the different feature importance methods, integrated gradients is more robust than the other two. Additionally, the results for CIFAR-10 show that images in this data set are more robust against adversarial attacks than ImageNet images, which agrees with our analysis that higher-dimensional inputs tend to be more fragile.
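The other two measurements, rank correlation and top-k intersection, can be sketched in a few lines. This is an illustrative version under stated assumptions, not the evaluation code behind the figures: ranks are computed by sorting (ties are broken by position rather than averaged), the top-k intersection is reported as a fraction of k, and the function names are hypothetical.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman-style rank correlation between two importance maps.

    Assumption: ties are broken by position, not averaged.
    """
    ranks_a = np.argsort(np.argsort(np.ravel(a)))
    ranks_b = np.argsort(np.argsort(np.ravel(b)))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def top_k_intersection(a, b, k):
    """Fraction of the k most important features shared by both maps."""
    top_a = set(np.argsort(np.ravel(a))[-k:].tolist())
    top_b = set(np.argsort(np.ravel(b))[-k:].tolist())
    return len(top_a & top_b) / k

# Toy example: a map compared with itself and with its negation.
original = np.arange(12.0).reshape(3, 4)
print(rank_correlation(original, original))        # close to 1
print(rank_correlation(original, -original))       # close to -1
print(top_k_intersection(original, -original, 3))  # no shared top features
```

Values near 1 for both measures mean the interpretation is essentially unchanged, while lower values mean the perturbation has reordered the features.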

Appendix D Additional Examples of Adversarial Attacks on Influence Functions

In this appendix, we provide additional examples of the fragility of influence functions, analogous to Fig. 4.

Figure 10: Further examples of gradient-sign attacks on influence functions. (a) Here we see a representative example of the most influential training images before and after a perturbation to the test image. The most influential image before the attack is one of the least influential afterwards. Overall, the influences of the training images before and after the attack are uncorrelated. (b) In this example, the perturbation has remarkably caused the training images to almost completely reverse in influence. Training images that had the most positive effect on prediction now have the most negative effects and the other way round.

Appendix E Dimensionality-Based Explanation for Fragility of Influence Functions

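Here, we demonstrate that increasing the dimension of the input of a simple neural network increases the fragility of that network with respect to influence functions, analogous to the calculations carried out for feature-importance methods in Section 5. Recall that the influence of a training image $z_i$ on a test image $z = (x, y)$ is given by:

$$I(z_i, z) = -\nabla_\theta L(z, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}). \tag{6}$$

We restrict our attention to the term in (6) that is dependent on $x$, and denote it by $J := \nabla_\theta L(z, \hat{\theta})$. $J$ represents the infinitesimal effect of each of the parameters in the network on the loss function evaluated at the test image.

Now, let us calculate the change in this term due to a small perturbation $\delta$ in $x$. The first-order approximation for the change in $J$ is equal to $\nabla_x J \cdot \delta = \nabla_\theta \nabla_x L \cdot \delta$. In particular, for the $i$-th parameter, $J_i$ changes by $(\nabla_\theta \nabla_x L \cdot \delta)_i$ and furthermore, the relative change is $(\nabla_\theta \nabla_x L \cdot \delta)_i / (\nabla_\theta L)_i$. For the simple network defined in Section 5, this evaluates to (replacing $\theta$ with $w$ for consistency of notation):

$$\frac{(\nabla_w \nabla_x L \cdot \delta)_i}{(\nabla_w L)_i} = \frac{x_i\, g''(w^\top x)\, w^\top \delta + g'(w^\top x)\, \delta_i}{x_i\, g'(w^\top x)} = \frac{g''(w^\top x)}{g'(w^\top x)}\, w^\top \delta + \frac{\delta_i}{x_i},$$

where for simplicity, we have taken the loss to be $L = |y - g(w^\top x)|$, making the derivatives easier to calculate. Furthermore, we have used $g'(\cdot)$ and $g''(\cdot)$ to refer to the first and second derivatives of $g(\cdot)$. Note that $g'(w^\top x)$ and $g''(w^\top x)$ do not scale with the dimensionality of $x$ because $x$ and $w$ are generally $\ell_2$-normalized due to data preprocessing and weight decay regularization.

However, if we choose $\delta = \epsilon\, \mathrm{sign}(w)$, then the relative change in the saliency grows with the dimension, since it is proportional to the $\ell_1$-norm of $w$.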
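The dimension dependence can be checked numerically. The sketch below is an illustration under our own assumptions, not part of the original analysis: it takes the perturbation to be `eps * sign(w)` for a weight vector `w`, draws `w` at random, and normalizes it to unit Euclidean norm. In that setting the inner product of `w` with the perturbation equals `eps` times the l1-norm of `w`, which grows roughly like the square root of the dimension even though the l2-norm stays fixed. The helper name `sign_perturbation_gain` is hypothetical.

```python
import numpy as np

def sign_perturbation_gain(w, eps):
    """Inner product w^T delta for delta = eps * sign(w).

    This equals eps * ||w||_1, the quantity that drives the relative change.
    """
    delta = eps * np.sign(w)
    return float(w @ delta)

rng = np.random.default_rng(0)
eps = 0.01
gains = {}
for d in (10, 100, 1000, 10000):
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)  # keep the l2-norm fixed at 1 in every dimension
    gains[d] = sign_perturbation_gain(w, eps)
    print(f"d={d:6d}  w^T delta = {gains[d]:.4f}")
```

The per-coordinate size of the perturbation is the same in every row of the output; only the input dimension changes.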

Appendix F Orthogonality of steepest directions of change in score and feature importance functions in a Simple Two-layer network

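Consider a two-layer neural network with activation function $g(\cdot)$, input $x \in \mathbb{R}^d$, hidden vector $u \in \mathbb{R}^h$, and score function $S$. We have:

$$S = v^\top u = \sum_{j=1}^{h} v_j u_j,$$

where $u_j = g(w_j^\top x)$ for $j = 1, \dots, h$. We have:

$$\nabla_x S = \sum_{j=1}^{h} v_j\, g'(w_j^\top x)\, w_j, \qquad \nabla_x^2 S = \sum_{j=1}^{h} v_j\, g''(w_j^\top x)\, w_j w_j^\top.$$

Now for an input sample perturbation $\delta$, for the change in feature importance:

$$\nabla_x S(x + \delta) - \nabla_x S(x) \approx \nabla_x^2 S \cdot \delta,$$

which is equal to:

$$\sum_{j=1}^{h} v_j\, g''(w_j^\top x)\, (w_j^\top \delta)\, w_j.$$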

We further assume that the input is high-dimensional so that and for we have . For maximizing the norm of saliency difference we have the following perturbation direction:


comparing which to the direction of feature importance:

we conclude that the two directions are not parallel unless which is not the case for many activation functions like Softplus, Sigmoid, etc.
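A small numerical check can make the claim concrete. The following is a sketch under assumptions of our own, not the authors' derivation: a two-layer network with Softplus activation, hidden weight vectors stored as orthonormal columns of `W`, output weights `v` all equal to one, and hand-picked pre-activations. It computes the gradient of the score (the direction of feature importance) and the unit perturbation that maximizes the first-order change in that gradient, which is the eigenvector of the Hessian with the largest absolute eigenvalue, and then compares the two directions. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def steepest_directions(W, v, x):
    """Gradient of S = v^T softplus(W^T x) and top Hessian eigenvector."""
    pre = W.T @ x
    g1 = sigmoid(pre)        # first derivative of softplus
    g2 = g1 * (1.0 - g1)     # second derivative of softplus
    grad = W @ (v * g1)
    hess = (W * (v * g2)) @ W.T   # sum_j v_j g''(w_j^T x) w_j w_j^T
    eigvals, eigvecs = np.linalg.eigh(hess)
    top = eigvecs[:, np.argmax(np.abs(eigvals))]
    return grad, top

def abs_cosine(a, b):
    return float(abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, h = 50, 4
W, _ = np.linalg.qr(rng.standard_normal((d, h)))  # orthonormal columns w_j
v = np.ones(h)
x = W @ np.array([0.0, 1.0, -1.0, 2.0])           # sets the pre-activations

grad, top = steepest_directions(W, v, x)
cos = abs_cosine(grad, top)
print(f"|cosine| between the two directions: {cos:.3f}")
```

An absolute cosine of 1 would mean the two directions coincide. In this toy setting it comes out well below 1, so the perturbation that changes the feature importance most is not the one that changes the score most.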

Appendix G Designing Interpretability-Robust Networks

The analyses and experiments in this paper have demonstrated that small perturbations in the input layers of deep neural networks can have large changes in the interpretations. This is analogous to classical adversarial examples, whereby small perturbations in the input produce large changes in the prediction. In that setting, it has been proposed that the Lipschitz constant of the network be constrained during training to limit the effect of adversarial perturbations (Szegedy et al., 2013). This has found some empirical success (Cisse et al., 2017).

Here, we propose an analogous method to upper-bound the change in interpretability of a neural network as a result of perturbations to the input. Specifically, consider a network with $K$ layers, which takes as input a data point we denote as $y_0$. The output of the $i$-th layer is given by $y_{i+1} = f_i(y_i)$ for $i = 0, 1, \dots, K-1$. We define $S := f_{K-1}(f_{K-2}(\cdots f_0(y_0) \cdots))$ to be the output (e.g. score for the correct class) of our network, and we are interested in designing a network whose gradient $S' = \nabla_{y_0} S$ is relatively insensitive to perturbations in the input, as this corresponds to a network whose feature importances are robust.

A natural quantity to consider is the Lipschitz constant of $S'$ with respect to $y_0$. Writing $\mathcal{L}(\cdot)$ for the Lipschitz constant of a function, by the chain rule, the Lipschitz constant of $S'$ is

$$\mathcal{L}(S') = \mathcal{L}\!\left(\frac{\partial y_K}{\partial y_{K-1}}\right) \cdots \mathcal{L}\!\left(\frac{\partial y_1}{\partial y_0}\right). \tag{8}$$

Now consider the function $f_i(\cdot)$, which maps $y_i$ to $y_{i+1}$. In the simple case of the fully-connected network, which we consider here, $f_i(y_i) = g_i(W_i y_i)$, where $g_i$ is a non-linearity and $W_i$ are the trained weights for that layer. Thus, the Lipschitz constant of the $i$-th partial derivative in (8) is the Lipschitz constant of

$$\frac{\partial f_i}{\partial y_i} = \mathrm{diag}\big(g_i'(W_i y_i)\big)\, W_i,$$

which is upper-bounded by $\|W_i\|^2 \cdot \mathcal{L}(g_i'(\cdot))$, where $\|W\|$ denotes the operator norm of $W$ (its largest singular value). (This bound follows from the fact that the Lipschitz constant of the composition of two functions is the product of their Lipschitz constants, and the Lipschitz constant of the product of two functions is also the product of their Lipschitz constants.) This suggests that a conservative upper ceiling for (8) is

$$\mathcal{L}(S') \le \prod_{i=0}^{K-1} \|W_i\|^2\, \mathcal{L}(g_i'(\cdot)).$$

Because the Lipschitz constants of the non-linearities are fixed, this result suggests that a regularization based on the operator norms of the weights may allow us to train networks that are robust to attacks on feature importance. The calculations in this appendix are meant to be suggestive rather than conclusive, since in practice the Lipschitz bounds are rarely tight.
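To make the suggestion concrete, the sketch below computes the conservative ceiling as a product over layers of the squared operator norm of the weights times the Lipschitz constant of the derivative of the non-linearity, together with one possible penalty term. It is a hedged illustration, not a tested training recipe: the function names are hypothetical, the random weights are placeholders, and the sum-of-squared-operator-norms penalty is only one assumed form of such a regularizer. For Softplus the derivative is the sigmoid, whose slope never exceeds 1/4, so 0.25 is used as its Lipschitz constant.

```python
import numpy as np

SOFTPLUS_GRAD_LIPSCHITZ = 0.25  # softplus' is the sigmoid; its slope is at most 1/4

def operator_norm(W):
    """Largest singular value of a weight matrix."""
    return float(np.linalg.norm(W, ord=2))

def gradient_lipschitz_ceiling(weights, act_grad_lipschitz):
    """Product over layers of ||W_i||^2 * Lip(g_i')."""
    bound = 1.0
    for W, lip in zip(weights, act_grad_lipschitz):
        bound *= operator_norm(W) ** 2 * lip
    return bound

def operator_norm_penalty(weights):
    """Assumed regularizer: sum of squared operator norms of the layers."""
    return sum(operator_norm(W) ** 2 for W in weights)

# Placeholder two-layer fully-connected network: R^32 -> R^64 -> R^16.
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((64, 32)),
           0.1 * rng.standard_normal((16, 64))]
lips = [SOFTPLUS_GRAD_LIPSCHITZ] * len(weights)

print("ceiling:", gradient_lipschitz_ceiling(weights, lips))
print("penalty:", operator_norm_penalty(weights))
```

Shrinking the operator norm of any layer lowers the ceiling quadratically, which is why a penalty of this kind is a natural candidate. Since the bound is rarely tight, a small ceiling is sufficient for robustness but a large one does not by itself imply fragility.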