1 Introduction
Predictions made by machine learning algorithms play an important role in our everyday lives and can affect decisions in technology, medicine, and even the legal system (Rich, 2015; Obermeyer & Emanuel, 2016). As the algorithms become increasingly complex, explanations for why an algorithm makes certain decisions are ever more crucial. For example, if an AI system predicts a given pathology image to be malignant, then the doctor would want to know what features in the image led the algorithm to this classification. Similarly, if an algorithm predicts an individual to be a credit risk, then the lender (and the borrower) might want to know why. Therefore having interpretations for why certain predictions are made is critical for establishing trust and transparency between the users and the algorithm (Lipton, 2016).
Having an interpretation is not enough, however. The explanation itself must be robust in order to establish human trust. Take the pathology predictor; an interpretation method might suggest that a particular section in an image is important for the malignant classification (e.g. that section could have high scores in the saliency map). The clinician might then focus on that section for investigation, treatment, or even look for similar features in other patients. It would be highly disconcerting if, in an extremely similar image that is visually indistinguishable from the original and also classified as malignant, a very different section were interpreted as being salient for the prediction. Thus, even if the predictor is robust (both images are correctly labeled as malignant), the fact that the interpretation is fragile would still be highly problematic in deployment.
Our contributions.
In this paper, we show that widely-used neural network interpretation methods are fragile in the following sense: perceptively indistinguishable images that have the same prediction label by the neural network can often be given substantially different interpretations. We systematically investigate two classes of interpretation methods: methods that assign importance scores to each feature (this includes simple gradient (Simonyan et al., 2013), DeepLift (Shrikumar et al., 2017), and integrated gradient (Sundararajan et al., 2017)), as well as a method that assigns importance scores to each training example: influence functions (Koh & Liang, 2017). For both classes of interpretations, we show that the importance of individual features or training examples is highly fragile to even small random perturbations to the input image. Moreover we show how targeted perturbations can lead to dramatically different global interpretations (Fig. 1).
Our findings highlight the fragility of popular interpretation methods, which has not been carefully considered in the literature. Fragility directly limits how much we can trust and learn from the interpretations. It also raises a significant new security concern. Especially in medical or economic applications, users often take the interpretation of a prediction as containing causal insight (“this image is a malignant tumor likely because of the section with a high saliency score”). An adversary could minutely manipulate the input to draw attention away from relevant features or onto his/her desired features. Such attacks might be especially hard to detect as the actual labels have not changed.
While we focus on image data here because most of the interpretation methods have been motivated by images, the fragility of interpretation could be a much broader problem. Fig. 2 illustrates the intuition that when the decision boundary in the input feature space is complex, as is the case with deep nets, a small perturbation in the input can push the example into a region with very different loss contours. Because the feature importance is closely related to the gradient which is perpendicular to the loss contours, the importance scores can also be dramatically different. We provide additional analysis of this in Section 5.
Figure 2: A test data point that is slightly perturbed to a new position in input space (gray dot). The contours and decision boundary corresponding to a loss function for a two-class classification task are also shown, allowing one to see the direction of the gradient of the loss with respect to the input space. Neural networks with many parameters have decision boundaries that are roughly piecewise linear with many transitions. We illustrate that points near the transitions are especially fragile to interpretability-based analysis. A small perturbation to the input changes the direction of the loss gradient, directly affecting feature-importance analyses. Similarly, a small perturbation to the test image changes which training image, when up-weighted, has the largest influence on the loss, directly affecting exemplar-based analysis.

2 Interpretation Methods for Neural Network Predictions
2.1 Feature-Importance Interpretation
This first class of methods explains predictions in terms of the relative importance of features in a test input sample. Given the sample $x_t \in \mathbb{R}^d$ and the network's prediction $l$, we define the score of the predicted class, $S_l(x_t)$, to be the value of the $l$-th output neuron right before the softmax operation. We take $l$ to be the class with the maximum score, i.e. the predicted class. Feature-importance methods seek to find the dimensions of the input data point $x_t$ that most strongly affect the score, and in doing so, these methods assign an absolute saliency score to each input feature. Here we normalize the scores for each image by the sum of the saliency scores across the features. This ensures that any perturbations that we design change not the absolute feature saliencies (which may still preserve the ranking of different features), but their relative values. We summarize three different methods to calculate the normalized saliency score, denoted by $R(x_t)$.

Simple gradient method
Introduced in Baehrens et al. (2010) and applied to deep neural networks in Simonyan et al. (2013), the simple gradient method applies a local linear approximation of the model to detect the sensitivity of the score to perturbing each of the input dimensions. Given input $x_t$, the score is defined as:

$$R(x_t) = \left|\frac{\partial S_l(x_t)}{\partial x_t}\right|, \tag{1}$$

which we then normalize by the sum of its entries, as described above.
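As a concrete illustration, the following is a minimal sketch of the normalized simple-gradient saliency map; a PyTorch classifier `model` returning pre-softmax scores is assumed, and the names and shapes are illustrative placeholders rather than the implementation used for the experiments in this paper.

```python
# Minimal sketch of the simple gradient saliency map (Eq. 1), assuming a PyTorch
# classifier `model` that returns pre-softmax scores; names/shapes are illustrative.
import torch

def simple_gradient_saliency(model, x):
    """x: a single input of shape (1, C, H, W); returns a normalized saliency map."""
    x = x.clone().requires_grad_(True)
    scores = model(x)                       # pre-softmax scores, shape (1, n_classes)
    l = scores.argmax(dim=1).item()         # predicted class
    scores[0, l].backward()                 # d S_l / d x
    saliency = x.grad.detach().abs()        # absolute per-feature importance
    return saliency / saliency.sum()        # normalize so the scores sum to 1
```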
Integrated gradients
A significant drawback of the simple gradient method is the saturation problem discussed by Shrikumar et al. (2017); Sundararajan et al. (2017). Consequently, Sundararajan et al. (2017) introduced the integrated gradients method, where the gradients of the score with respect to $M$ scaled versions of the input are summed and then multiplied by the input. Letting $x_0$ be the reference point and $\Delta x_t = x_t - x_0$, the feature importance vector is calculated by:

$$R^{IG}(x_t) = \left|\Delta x_t \odot \frac{1}{M}\sum_{k=1}^{M} \frac{\partial S_l\!\left(x_0 + \tfrac{k}{M}\Delta x_t\right)}{\partial x}\right|, \tag{2}$$

which is then normalized for our analysis. Here $\odot$ denotes element-wise multiplication, and the absolute value is taken for each dimension.
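A minimal sketch of the Riemann-sum approximation in (2) follows; as above, `model`, `x`, and the reference point `x_ref` are hypothetical placeholders.

```python
# Sketch of integrated gradients (Eq. 2) with an M-step Riemann-sum approximation.
# `model`, `x`, and `x_ref` are hypothetical placeholders, as above.
import torch

def integrated_gradients(model, x, x_ref, M=100):
    x_diff = x - x_ref
    l = model(x).argmax(dim=1).item()            # predicted class of the original input
    grad_sum = torch.zeros_like(x)
    for k in range(1, M + 1):
        x_step = (x_ref + (k / M) * x_diff).clone().requires_grad_(True)
        model(x_step)[0, l].backward()
        grad_sum += x_step.grad.detach()
    r = (x_diff * grad_sum / M).abs()            # averaged gradients times the input difference
    return r / r.sum()                           # normalized saliency
```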
DeepLIFT
DeepLIFT is an improved version of the layer-wise relevance propagation (LRP) method (Bach et al., 2015). LRP methods decompose the score $S_l(x_t)$ backwards through the neural network. In each step, the score from the last layer is propagated to the previous layer, with the score being divided proportionally to the magnitude of the activations of the neurons in the previous layer. The scores are propagated all the way to the input layer, and the result is a relevance score assigned to each of the input dimensions. DeepLIFT (Shrikumar et al., 2017) defines a reference point in the input space and propagates relevance scores proportionally to the changes in the neuronal activations from the reference. We use DeepLIFT with the Rescale rule; see Shrikumar et al. (2017) for details.
2.2 Exemplar-Based Methods: Influence Functions
A complementary approach to interpreting the results of a neural network is to explain the prediction of the network in terms of its training examples, $\{z_i = (x_i, y_i)\}_{i=1}^{n}$. Specifically, we can ask: which training examples, if up-weighted or down-weighted during training time, would have the biggest effect on the loss of the test example $z_t = (x_t, y_t)$? Koh & Liang (2017) proposed a method to calculate this value, called the influence, defined by the following equation:

$$\mathcal{I}(z_i, z_t) = -\nabla_\theta L(z_t, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}), \tag{3}$$

where $\hat{\theta}$ denotes the trained parameters of the network and $L(z, \theta)$ is the loss of the network with parameters set to $\theta$ for the (training or test) data point $z$. $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta^2 L(z_i, \hat{\theta})$ is the empirical Hessian of the network calculated over the training examples. The training examples with the highest influence are understood as explaining why a network made a particular prediction for a test example.
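A minimal sketch of (3) is shown below for the case where the influence is restricted to a small parameter block (e.g. the final layer), so that the empirical Hessian can be formed explicitly; `grad_loss` and `hessian` are assumed to be supplied by the caller and are not part of the original method's code.

```python
# Sketch of the influence score of Eq. (3), restricted to a small parameter block so
# that the empirical Hessian H can be formed explicitly. `grad_loss(z)` (flattened loss
# gradient at data point z) and `hessian` are assumed inputs, not the paper's code.
import numpy as np

def influence(z_train, z_test, grad_loss, hessian):
    g_test = grad_loss(z_test)
    g_train = grad_loss(z_train)
    h_inv_g = np.linalg.solve(hessian, g_train)   # H^{-1} grad L(z_i), avoiding an explicit inverse
    return -g_test @ h_inv_g                      # I(z_i, z_t)
```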
2.3 Metrics for Interpretation Similarity
We consider two natural metrics for quantifying the similarity between interpretations for two different images.
- Spearman's rank order correlation: Because interpretation methods rank all of the features or training examples in order of importance, it is natural to use the rank correlation (Spearman, 1904) to compare the similarity between interpretations.
- Top-k intersection: In many settings, only the $k$ most important features or interpretations are of interest. In these settings, we can compute the size of the intersection of the $k$ most important features before and after perturbation (both metrics are sketched below).
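A minimal sketch of both metrics; the flattening and the default choice of $k$ are illustrative assumptions.

```python
# Sketch of the two similarity metrics between saliency maps (or influence vectors).
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(scores_a, scores_b):
    return spearmanr(scores_a.ravel(), scores_b.ravel()).correlation

def top_k_intersection(scores_a, scores_b, k=1000):
    top_a = set(np.argsort(scores_a.ravel())[-k:])
    top_b = set(np.argsort(scores_b.ravel())[-k:])
    return len(top_a & top_b) / k                 # fraction of shared top-k features
```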
3 Random and systematic perturbations
Problem statement
For a given fixed neural network $\mathcal{N}$ and input data point $x_t$, the feature importance and influence function methods that we have described produce an interpretation $\mathcal{I}(x_t; \mathcal{N})$. For feature importance, $\mathcal{I}(x_t; \mathcal{N})$ is a vector of feature scores; for influence functions, $\mathcal{I}(x_t; \mathcal{N})$ is a vector of scores for the training examples. We would like to devise efficient perturbations to change the interpretability of a test image. Yet, the perturbations should be visually imperceptible and should not change the label of the prediction. Formally, we define the problem as:

$$\arg\max_{\delta}\; \mathcal{D}\big(\mathcal{I}(x_t; \mathcal{N}),\, \mathcal{I}(x_t + \delta; \mathcal{N})\big) \quad \text{subject to} \quad \|\delta\|_\infty \le \epsilon,\;\; \text{Prediction}(x_t + \delta; \mathcal{N}) = \text{Prediction}(x_t; \mathcal{N}),$$

where $\mathcal{D}(\cdot,\cdot)$ measures the change in interpretation (e.g. how many of the top-$k$ pixels are no longer the top-$k$ pixels of the saliency map after the perturbation) and $\epsilon$ constrains the $\ell_\infty$ norm of the perturbation. In this paper, we carry out three kinds of input perturbations.
Random sign perturbation
As a baseline, we generate random perturbations in which each pixel is randomly perturbed by $\pm\epsilon$. This is used to measure the robustness of interpretations against untargeted perturbations.
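For example (a sketch; the pixel range used for clipping is an assumption):

```python
# Random sign perturbation: each pixel moves by +eps or -eps; clipping to a valid
# pixel range (assumed [0, 255] here) keeps the image well-formed.
import numpy as np

def random_sign_perturbation(x, eps):
    signs = np.random.choice([-1.0, 1.0], size=x.shape)
    return np.clip(x + eps * signs, 0, 255)
```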
Iterative attacks against feature-importance methods
In Algorithm 1 we define two adversarial attacks against feature-importance methods, each of which consists of taking a series of steps in the direction that maximizes a differentiable dissimilarity function between the original and perturbed interpretation. (1) The top-$k$ attack seeks to perturb the saliency map by decreasing the relative importance of the $k$ most important features of the original image. (2) When the input data are images, the center of mass of the saliency map often captures the user's attention. The mass-center attack is designed to result in the maximum spatial displacement of the center of mass of the saliency scores. Both of these attacks can be applied to any of the three feature-importance methods (a simplified sketch of the top-$k$ variant is given below).
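The following is a simplified sketch of the top-$k$ attack, not the exact Algorithm 1: it ascends a differentiable dissimilarity that pushes saliency mass off the original top-$k$ pixels, using a softplus copy of the model (see below) so that the required second derivatives are non-zero; `model_softplus`, the step size, and the iteration count are assumptions.

```python
# Simplified sketch of the top-k attack: gradient ascent on a dissimilarity that pushes
# saliency mass off the original top-k pixels. Not the paper's exact Algorithm 1;
# `model_softplus` (ReLUs replaced by softplus) and all hyperparameters are assumptions.
import torch

def topk_attack(model_softplus, x, eps, k=1000, steps=100, alpha=0.5):
    x0 = x.clone()
    x_req = x0.clone().requires_grad_(True)
    scores = model_softplus(x_req)
    l = scores.argmax(dim=1).item()
    g0, = torch.autograd.grad(scores[0, l], x_req)
    topk_idx = g0.abs().flatten().topk(k).indices           # top-k pixels of the original saliency

    x_adv = x0.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        s = model_softplus(x_adv)
        g, = torch.autograd.grad(s[0, l], x_adv, create_graph=True)  # differentiable saliency
        dissimilarity = -g.abs().flatten()[topk_idx].sum()   # less mass on old top-k = more dissimilar
        step, = torch.autograd.grad(dissimilarity, x_adv)
        x_adv = x_adv.detach() + alpha * step.sign()
        x_adv = x0 + (x_adv - x0).clamp(-eps, eps)           # stay within the L_inf ball
    # A full implementation would also verify that the predicted label is unchanged.
    return x_adv.detach()
```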
In some networks, such as those with ReLU activations, the gradient of the dissimilarity function is always 0. To attack interpretability in such networks, we replace the ReLU activations with their smooth approximation (softplus) when calculating the gradient and generate the perturbed image using this approximation. The perturbed images that result are effective adversarial attacks against the original ReLU network, as discussed in Section 4.

Gradient sign attack against influence functions
We can obtain effective adversarial images for influence functions without resorting to iterative procedures. We linearize (3) around the values of the current inputs and parameters. If we further constrain the $\ell_\infty$ norm of the perturbation to $\epsilon$, we obtain an optimal single-step perturbation:

$$\delta = \epsilon\,\mathrm{sign}\!\left(\nabla_{x_t}\,\mathcal{I}(z_i, z_t)\right). \tag{4}$$

This perturbation can be applied to the pixels of a test image to increase or decrease the influence of a particular training example $z_i$. The attack we use consists of applying the negative of the perturbation in (4) to decrease the influence of the 3 most influential training images of the original test image (in other words, we generate the perturbation $\delta = -\epsilon\,\mathrm{sign}\big(\sum_{i=1}^{3}\nabla_{x_t}\,\mathcal{I}(z_{(i)}, z_t)\big)$, where $z_{(i)}$ is the $i$-th most influential training image of the original test image). Of course, this affects the influence of all of the other training images as well.
We follow the same setup for computing the influence function as was done by the authors of Koh & Liang (2017). Because the influence is only calculated with respect to the parameters that change during training, we calculate the gradients only with respect to parameters in the final layer of our network (InceptionNet, see Section 4). This makes it feasible for us to compute (4) exactly, but it gives us the perturbation of the input into the final layer, not the first layer. So, we use standard back-propagation to calculate the corresponding gradient for the input test image. We then take the sign of this gradient as the perturbation and clip the image to produce the adversarial test image.
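A sketch of this single-step attack is given below, assuming the vectors $v_i = H^{-1}\nabla_\theta L(z_i, \hat{\theta})$ for the targeted training images have been precomputed; `model`, `loss_fn`, `last_layer_params`, and the [0, 1] pixel range are assumptions for illustration.

```python
# Sketch of the single-step gradient-sign attack of Eq. (4) against influence functions.
# `model`, `loss_fn`, `last_layer_params`, and the precomputed vectors
# v_i = H^{-1} grad_theta L(z_i) for the targeted training images are assumed inputs.
import torch

def influence_sign_attack(model, loss_fn, x_test, y_test, last_layer_params, v_list, eps):
    x = x_test.clone().requires_grad_(True)
    loss = loss_fn(model(x), y_test)
    # Gradient of the test loss w.r.t. the final-layer parameters, kept differentiable in x.
    g_theta = torch.autograd.grad(loss, last_layer_params, create_graph=True)
    g_flat = torch.cat([g.flatten() for g in g_theta])
    # Summed influence of the selected training images (Eq. 3, with v_i precomputed).
    total_influence = -sum(v @ g_flat for v in v_list)
    grad_x, = torch.autograd.grad(total_influence, x)
    # Negative sign so that the perturbation *decreases* the influence of those images;
    # clipping to [0, 1] assumes inputs normalized to that range.
    return (x_test - eps * grad_x.sign()).clamp(0, 1).detach()
```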
4 Experiments & Results
Data sets and models
To evaluate the robustness of feature-importance methods, we used two image classification data sets: ILSVRC2012 (ImageNet classification challenge data set) (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009). For the ImageNet classification data set, we used a pre-trained SqueezeNet model (https://github.com/rcmalli/keras-squeezenet) introduced by Iandola et al. (2016). For the CIFAR-10 data set we trained our own convolutional network, whose architecture is presented in Appendix A.
For both data sets, the results are examined using simple gradient, integrated gradients, and DeepLIFT feature importance methods. For DeepLIFT, we used the pixel-wise and the channel-wise mean images as the CIFAR-10 and ImageNet reference points respectively. For the integrated gradients method, the same references were used with parameter M=100. We ran all iterative attack algorithms for iterations with step size .
To evaluate the robustness of influence functions, we followed a similar experimental setup to that of the original authors: we trained an InceptionNet v3 with all but the last layer frozen (the weights were pre-trained on ImageNet and obtained from Keras, https://keras.io/applications/). The last layer was trained on a binary flower classification task (roses vs. sunflowers), using a data set consisting of 1,000 training images (adapted from https://www.tensorflow.org/tutorials/image_retraining). This data set was chosen because it consists of images that the network had not seen during pre-training on ImageNet. The network achieved a validation accuracy of 97.5% on this task.

Results for feature-importance methods
From the ImageNet test set, 512 correctly classified images were randomly sampled for evaluation purposes. Examples of the mass-center attack against the three feature-importance methods are presented in Fig. 1. Further representative examples of different attacks on additional images are found in Appendix B.
In Fig. 3, we present results aggregated over all 512 images. We compare the different attack methods using top-1000 intersection and rank correlation. Random sign perturbation already causes significant changes in both top-1000 intersection and rank order correlation: on average, there is less than 30% overlap in the top 1000 most salient pixels between the original and the randomly perturbed images across all three interpretation methods. This suggests that the saliency of individual or small groups of pixels can be extremely fragile to the input and should be interpreted with caution. With targeted perturbations, we observe even more dramatic fragility: even small perturbations change the interpretations significantly.
Both iterative attack algorithms have similar effects on the feature importance of test images when measured on the basis of rank correlation or top-1000 intersection. In Appendix C, we show an additional metric: the displacement of the center of mass between the original and perturbed saliency maps (sketched below). Empirically, we find this metric to correspond most strongly with intuitive perceptions of the similarity between two saliency maps. Not surprisingly, we found that the mass-center attack was more effective than the top-$k$ attack at moving the center of mass of the saliency maps. Comparing interpretation methods, we found that the integrated gradients method was the most robust to both random and adversarial attacks. Similar results for CIFAR-10 can be found in Appendix C.
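A sketch of the center-of-mass displacement metric for a 2-D saliency map:

```python
# Center-of-mass displacement between two 2-D saliency maps (a sketch).
import numpy as np

def center_of_mass(saliency):
    rows, cols = np.indices(saliency.shape)
    total = saliency.sum()
    return np.array([(rows * saliency).sum(), (cols * saliency).sum()]) / total

def center_shift(saliency_before, saliency_after):
    return np.linalg.norm(center_of_mass(saliency_before) - center_of_mass(saliency_after))
```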
Results for influence functions
We evaluate the robustness of influence functions on a test data set consisting of 200 images of roses and sunflowers. Fig. 4 shows a representative test image to which we have applied the gradient sign attack. Although the prediction for the image does not change, the most influential training examples selected according to (3) as an explanation for the prediction change entirely, from images of sunflowers and yellow petals that resemble the input image to images of red and pink roses that do not. Additional examples can be found in Appendix D.
In Fig. 5, we compare random perturbations and gradient sign attacks across all of the test images. We find that the gradient sign-based attacks are significantly more effective at decreasing the rank correlation of the influence of the training images, as well as at distorting the top-5 influential images. For example, after a visually imperceptible targeted perturbation, on average only 2 of the top 5 most influential training images remain among the top 5 most influential images. The influences of the training images before and after an adversarial attack are essentially uncorrelated. However, we find that even random attacks can have a non-negligible effect on influence functions, on average reducing the rank correlation to 0.8.
5 Hessian analysis
In this section, we try to understand the source of interpretation fragility. The question is whether fragility is a consequence of the complex non-linearities of a deep network or a characteristic present even in high-dimensional linear models, as is the case for adversarial examples for prediction (Goodfellow et al., 2014). To gain more insight into the fragility of gradient-based interpretations, let $S(x)$ denote the score function of interest; $x \in \mathbb{R}^d$ is an input vector and the weights of the neural network are fixed since the network has finished training. We are interested in the Hessian $H$ whose entries are $H_{i,j} = \partial^2 S / \partial x_i \partial x_j$. The reason is that the first-order approximation of the gradient for some input perturbation direction $\delta$ is: $\nabla_x S(x + \delta) \approx \nabla_x S(x) + H\delta$.
First, consider a linear model whose score for an input $x$ is $S(x) = w^{\top}x$. Here, $\nabla_x S = w$ and $H = \nabla_x^2 S = 0$; the feature-importance vector is robust, because it is completely independent of $x$. Thus, some non-linearity is required for interpretation fragility. A simple network that is susceptible to adversarial attacks on interpretations consists of a set of weights connecting the input to a single neuron followed by a non-linearity (e.g. logistic regression): $S(x) = g(w^{\top}x)$.

We can calculate the change in the saliency map due to a small perturbation $x \to x + \delta$. The first-order approximation for the change in the saliency map is $H\delta$. In particular, the saliency of the $i$-th feature changes by $(H\delta)_i$ and, furthermore, the relative change is $(H\delta)_i / (\nabla_x S)_i$. For the simple network, $\nabla_x S = g'(w^{\top}x)\,w$ and $H = g''(w^{\top}x)\,w w^{\top}$, so this relative change is:

$$\frac{(H\delta)_i}{(\nabla_x S)_i} = \frac{g''(w^{\top}x)\,w_i\,(w^{\top}\delta)}{g'(w^{\top}x)\,w_i} = \frac{g''(w^{\top}x)}{g'(w^{\top}x)}\,w^{\top}\delta, \tag{5}$$

where we have used $g'$ and $g''$ to refer to the first and second derivatives of the non-linearity $g(\cdot)$. Note that $g'(w^{\top}x)$ and $g''(w^{\top}x)$ do not scale with the dimensionality of $x$ because, in general, independent of the dimensionality, $x$ and $w$ are $\ell_2$-normalized or have fixed $\ell_2$-norm due to data preprocessing and weight decay regularization. However, if we choose $\delta = \epsilon\,\mathrm{sign}(w)$, then the relative change in the saliency grows with the dimension, since it is proportional to the $\ell_1$-norm of $w$. When the input is high-dimensional (which is the case with images), the relative effect of the perturbation can be substantial. Note also that this perturbation is exactly the sign of the first right singular vector of the Hessian $H$, which is appropriate since that is the vector that has the maximum effect on the gradient of $S$. A similar analysis can be carried out for influence functions (see Appendix E).
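A small numerical illustration of (5): for $S(x) = g(w^{\top}x)$ with a softplus non-linearity and unit-norm $x$ and $w$, the relative saliency change under $\delta = \epsilon\,\mathrm{sign}(w)$ grows with the input dimension. The specific values below are illustrative assumptions, not results from the paper.

```python
# Numeric illustration of Eq. (5): the relative saliency change under
# delta = eps * sign(w) grows with the input dimension d (softplus non-linearity;
# all values below are illustrative assumptions, not the paper's experiments).
import numpy as np

def relative_saliency_change(d, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d); w /= np.linalg.norm(w)   # unit L2-norm weights
    x = rng.normal(size=d); x /= np.linalg.norm(x)   # unit L2-norm input
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    g1 = sigmoid                                     # softplus'  = sigmoid
    g2 = lambda u: sigmoid(u) * (1 - sigmoid(u))     # softplus'' = sigmoid'
    delta = eps * np.sign(w)
    u = w @ x
    return (g2(u) / g1(u)) * (w @ delta)             # Eq. (5); proportional to eps * ||w||_1

for d in (10, 100, 1000, 10000):
    print(d, relative_saliency_change(d))
```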
For this simple network, the direction of the adversarial attack on interpretability, $\mathrm{sign}(w)$, is the same as the direction of the adversarial attack on the prediction. This means that we cannot perturb interpretability independently of the prediction. For more complex networks, this is not the case, and in Appendix F we show this analytically for a simple case of a two-layer network. As an empirical test, in Fig. 4(a), we plot the distribution of the angle between $\nabla_x S$ (the direction of steepest change in the prediction) and $v_1$ (the first right singular vector of $H$, which is the most fragile direction of feature importance) for 1,000 CIFAR-10 images (details of the network are in Appendix A). In Fig. 4(b), we plot the equivalent distribution for influence functions, computed across all 200 test images. The results confirm that the steepest directions of change in interpretation and prediction are generally orthogonal, justifying how the perturbations can change the interpretation without changing the prediction.
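As a sketch of how such an empirical test can be set up, the angle can be estimated with Hessian-vector products and power iteration; a smooth (e.g. softplus) `model` and the iteration count are placeholder assumptions.

```python
# Sketch: angle between grad_x S (steepest change of the prediction score) and the top
# singular vector of the input Hessian (most fragile direction of feature importance),
# estimated with Hessian-vector products and power iteration. A smooth (e.g. softplus)
# `model` is assumed so that second derivatives are informative.
import torch

def fragile_vs_prediction_angle(model, x, iters=50):
    x = x.clone().requires_grad_(True)
    scores = model(x)
    score = scores[0, scores.argmax(dim=1).item()]
    grad, = torch.autograd.grad(score, x, create_graph=True)
    g = grad.flatten()
    v = torch.randn_like(g)
    for _ in range(iters):                                  # power iteration on the symmetric Hessian
        hv, = torch.autograd.grad(g @ v.detach(), x, retain_graph=True)
        v = hv.flatten() / hv.norm()
    cos = (g.detach() @ v).abs() / g.detach().norm()
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
```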
6 Discussion
Related works
To the best of our knowledge, the notion of adversarial attacks has not previously been studied in the context of the interpretation of neural networks. Adversarial attacks on the input that change the prediction of a network have been actively studied. Szegedy et al. (2013) demonstrated that it is relatively easy to fool neural networks into making very different predictions for test images that are visually very similar to each other. Goodfellow et al. (2014) introduced the Fast Gradient Sign Method (FGSM) as a one-step prediction attack. This was followed by more effective iterative attacks (Kurakin et al., 2016) seeking to change the prediction of the network with a small perturbation. Different metrics for quantifying the size of the perturbation have been used: Moosavi-Dezfooli et al. (2016) and Szegedy et al. (2013) used the $\ell_2$ norm; Papernot et al. (2016) considered the number of perturbed pixels ($\ell_0$); and Goodfellow et al. (2014) suggested using the $\ell_\infty$ norm, because it tightly controls how much each individual feature can change. We follow this popular practice and evaluate perturbations with the $\ell_\infty$ norm.
Interpretation of neural network predictions is also an active research area. Post-hoc interpretability (Lipton, 2016) is one family of methods that seek to "explain" a prediction without describing the details of the black-box model's hidden mechanisms. These include tools to explain predictions by networks in terms of the features of the test example (Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Zhou et al., 2016), as well as in terms of the contribution of training examples to the prediction at test time (Koh & Liang, 2017). These interpretations have gained increasing popularity, as they confer a degree of insight to human users of what the neural network might be doing (Lipton, 2016).
Conclusion
This paper demonstrates that interpretation of neural networks can be fragile. We develop new perturbations to illustrate this fragility and propose evaluation metrics as well as insights into why fragility occurs. Fragility of interpretation is orthogonal to fragility of the prediction: we demonstrate how perturbations can substantially change the interpretation without changing the predicted label. The two types of fragility do arise from similar factors, as we discuss in Section 5. Our focus is on the interpretation method, rather than on the original network, and as such we do not explore how interpretable the original predictor is. There is a separate line of research that tries to design simpler and more interpretable prediction models (Ba & Caruana, 2014).

Our main message is that robustness of the interpretation of a prediction is an important and challenging problem, especially as in many applications (e.g. many biomedical and social settings) users are as interested in the interpretation as in the prediction itself. Our results raise concerns about how interpretation is sensitive to noise and can be manipulated. We do not suggest that interpretations are meaningless, just as adversarial attacks on predictions do not imply that neural networks are useless. Interpretations do need to be used and evaluated with caution. Especially in settings where the importance of individual features or a small subset of features is interpreted, we show that these importance scores can be sensitive to even random perturbations. More dramatic manipulations of interpretations can be achieved with our targeted perturbations, which raises security concerns. While we focus on image data (ImageNet and CIFAR-10), because these are the standard benchmarks for popular interpretation tools, this fragility issue could be widespread in biomedical, economic, and other settings where neural networks are increasingly used. Understanding interpretation fragility in these applications and developing more robust interpretation methods are important directions for future research.
References
- Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pp. 2654–2662, 2014.
- Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- Baehrens et al. (2010) David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
- Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854–863, 2017.
- Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Iandola et al. (2016) Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
- Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
- Lipton (2016) Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
- Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
- Obermeyer & Emanuel (2016) Ziad Obermeyer and Ezekiel J Emanuel. Predicting the future—big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375(13):1216, 2016.
- Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. IEEE, 2016.
- Rich (2015) Michael L Rich. Machine learning, automated suspicion algorithms, and the fourth amendment. U. Pa. L. Rev., 164:871, 2015.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Spearman (1904) Charles Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15:72–101, 1904.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Appendices
Appendix A Description of the CIFAR-10 classification network
We trained the following architecture using the ADAM optimizer (Kingma & Ba, 2014) with default parameters. The resulting test accuracy using ReLU activations was 73%. For the Hessian analysis in Section 5, we replaced the ReLU activations with Softplus and retrained the network (with the ReLU network weights as initial weights). The resulting accuracy was also 73%.
| Network Layers |
|---|
| conv. 96, ReLU |
| conv. 96, ReLU |
| conv. 96, ReLU |
| stride 2 |
| conv. 192, ReLU |
| conv. 192, ReLU |
| conv. 192, ReLU |
| stride 2 |
| 1024-unit feed-forward layer |
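For reference, a PyTorch sketch of the table above follows (the network in the paper was trained in Keras; the 3×3 kernel sizes, padding, and the interpretation of "stride 2" as strided convolutions are assumptions made only for illustration).

```python
# PyTorch sketch of the architecture table above. Kernel sizes (3x3), padding, and the
# interpretation of "stride 2" as strided convolutions are assumptions for illustration.
import torch.nn as nn

def build_cifar_model(act=nn.ReLU):              # pass act=nn.Softplus for the smooth variant
    return nn.Sequential(
        nn.Conv2d(3, 96, 3, padding=1), act(),
        nn.Conv2d(96, 96, 3, padding=1), act(),
        nn.Conv2d(96, 96, 3, stride=2, padding=1), act(),
        nn.Conv2d(96, 192, 3, padding=1), act(),
        nn.Conv2d(192, 192, 3, padding=1), act(),
        nn.Conv2d(192, 192, 3, stride=2, padding=1), act(),
        nn.Flatten(),
        nn.Linear(192 * 8 * 8, 1024), act(),
        nn.Linear(1024, 10),                     # pre-softmax scores for the 10 CIFAR-10 classes
    )
```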
Appendix B Additional examples of feature importance perturbations
Here we provide three more examples from ImageNet. For each example, all three methods of feature importance are attacked by random sign noise and our two targeted adversarial algorithms.
Appendix C Measuring center of mass movement
Results for adversarial attacks against CIFAR-10 feature importance methods
Appendix D Additional Examples of Adversarial Attacks on Influence Functions
In this appendix, we provide additional examples of the fragility of influence functions, analogous to Fig. 4.
Appendix E Dimensionality-Based Explanation for Fragility of Influence Functions
Here, we demonstrate that increasing the dimension of the input of a simple neural network increases the fragility of that network with respect to influence functions, analogous to the calculations carried out for feature-importance methods in Section 5. Recall that the influence of a training image $z_i$ on a test image $z_t$ is given by:

$$\mathcal{I}(z_i, z_t) = -\nabla_\theta L(z_t, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}). \tag{6}$$
We restrict our attention to the term in (6) that depends on $x_t$, and denote it by $J = \nabla_\theta L(z_t, \hat{\theta})$. $J$ represents the infinitesimal effect of each of the parameters in the network on the loss function evaluated at the test image.

Now, let us calculate the change in this term due to a small perturbation $x_t \to x_t + \delta$. The first-order approximation for the change in $J$ is $\nabla_{x_t}\nabla_\theta L(z_t, \hat{\theta})\,\delta$. In particular, for the $i$-th parameter, $J_i$ changes by $(\nabla_{x_t} J_i)^{\top}\delta$ and, furthermore, the relative change is $(\nabla_{x_t} J_i)^{\top}\delta / J_i$. For the simple network defined in Section 5, this evaluates to (replacing $x_t$ with $x$ for consistency of notation):

$$\frac{(\nabla_{x} J_i)^{\top}\delta}{J_i} \approx \frac{g''(w^{\top}x)}{g'(w^{\top}x)}\,w^{\top}\delta, \tag{7}$$

where, for simplicity, we have chosen a loss for which the derivatives are easy to calculate, and where we have again used $g'$ and $g''$ to refer to the first and second derivatives of the non-linearity $g(\cdot)$. Note that $g'(w^{\top}x)$ and $g''(w^{\top}x)$ do not scale with the dimensionality of $x$ because $x$ and $w$ are generally $\ell_2$-normalized due to data preprocessing and weight decay regularization.

However, if we choose $\delta = \epsilon\,\mathrm{sign}(w)$, then the relative change grows with the dimension, since it is proportional to the $\ell_1$-norm of $w$.
Appendix F Orthogonality of steepest directions of change in score and feature importance functions in a Simple Two-layer network
Consider a two-layer neural network with activation function $g(\cdot)$, input $x \in \mathbb{R}^d$, hidden vector $u = g(Wx)$, and score function $S = v^{\top}u$. We have:

$$S = v^{\top} g(Wx) = \sum_i v_i\, g(w_i^{\top}x),$$

where $w_i$ is the $i$-th row of $W$. We have:

$$\nabla_x S = \sum_i v_i\, g'(w_i^{\top}x)\, w_i.$$

Now, for an input sample perturbation $x \to x + \delta$, the change in feature importance is, to first order:

$$\nabla_x S(x+\delta) - \nabla_x S(x) \approx \left(\sum_i v_i\, g''(w_i^{\top}x)\, w_i w_i^{\top}\right)\delta,$$

which is equal to:

$$\sum_i v_i\, g''(w_i^{\top}x)\,(w_i^{\top}\delta)\, w_i.$$

We further assume that the input is high-dimensional and that the rows of $W$ are nearly orthogonal, i.e. $w_i^{\top}w_j \approx 0$ for $i \neq j$. For maximizing the norm of the saliency difference, we have the following perturbation direction:

$$\delta^{(\text{saliency})} = \epsilon\,\mathrm{sign}\!\left(\sum_i v_i\, g''(w_i^{\top}x)\, w_i\right),$$

where $\mathrm{sign}(\cdot)$ is applied element-wise. Comparing this to the direction of feature importance, which is also the direction of steepest change of the score:

$$\nabla_x S = \sum_i v_i\, g'(w_i^{\top}x)\, w_i,$$

we conclude that the two directions are not parallel unless $g'(\cdot) \propto g''(\cdot)$, which is not the case for many activation functions like Softplus, Sigmoid, etc.
Appendix G Designing Interpretability-Robust Networks
The analyses and experiments in this paper have demonstrated that small perturbations in the input layers of deep neural networks can have large changes in the interpretations. This is analogous to classical adversarial examples, whereby small perturbations in the input produce large changes in the prediction. In that setting, it has been proposed that the Lipschitz constant of the network be constrained during training to limit the effect of adversarial perturbations (Szegedy et al., 2013). This has found some empirical success (Cisse et al., 2017).
Here, we propose an analogous method to upper-bound the change in interpretability of a neural network as a result of perturbations to the input. Specifically, consider a network with $T$ layers, which takes as input a data point we denote as $h_0 = x$. The output of the $t$-th layer is given by $h_t = f_t(h_{t-1})$ for $t = 1, \dots, T$. We define $S = h_T$ to be the output (e.g. score for the correct class) of our network, and we are interested in designing a network whose gradient $\partial S / \partial x$ is relatively insensitive to perturbations in the input, as this corresponds to a network whose feature importances are robust.
A natural quantity to consider is the Lipschitz constant of $\partial S / \partial x$ with respect to $x$. By the chain rule, the Lipschitz constant of $\partial S / \partial x$ is

$$L\!\left(\frac{\partial S}{\partial x}\right) = L\!\left(\prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}\right). \tag{8}$$
Now consider the function $f_t$, which maps $h_{t-1}$ to $h_t$. In the simple case of the fully-connected network, which we consider here, $h_t = f_t(h_{t-1}) = g(W_t h_{t-1})$, where $g$ is a non-linearity and $W_t$ are the trained weights for that layer. Thus, the Lipschitz constant of the partial derivative in (8) is the Lipschitz constant of

$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\big(g'(W_t h_{t-1})\big)\, W_t,$$

which is upper-bounded by $L(g')\,\|W_t\|_{op}^{2}$, where $\|W_t\|_{op}$ denotes the operator norm of $W_t$ (its largest singular value) and $L(g')$ is the Lipschitz constant of $g'$ (this bound follows from the fact that the Lipschitz constant of the composition of two functions is the product of their Lipschitz constants, and the Lipschitz constant of the product of two functions is also the product of their Lipschitz constants). This suggests that a conservative upper bound for (8) is

$$L\!\left(\frac{\partial S}{\partial x}\right) \le \prod_{t=1}^{T} L(g')\,\|W_t\|_{op}^{2}. \tag{9}$$
Because the Lipschitz constants of the non-linearities are fixed, this result suggests that a regularization based on the operator norms of the weights may allow us to train networks that are robust to attacks on feature importance. The calculations in this appendix are meant to be suggestive rather than conclusive, since in practice the Lipschitz bounds are rarely tight.
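One possible, untested instantiation of this idea is to add a penalty on the squared operator norms of the weight matrices to the training loss; the sketch below uses PyTorch and only covers fully-connected layers.

```python
# Sketch of an operator-norm (spectral-norm) penalty motivated by the bound in Eq. (9).
# Adding `lam * operator_norm_penalty(model)` to the training loss is one possible,
# untested instantiation; only fully-connected layers are covered here.
import torch
import torch.nn as nn

def operator_norm_penalty(model):
    penalty = torch.zeros(())
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # largest singular value of the weight matrix, squared as in Eq. (9)
            penalty = penalty + torch.linalg.matrix_norm(module.weight, ord=2) ** 2
    return penalty
```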