1 Introduction
Explanation methods have attracted significant attention in recent years due to their promise to open the black box of deep neural networks. Interpretability is crucial for scientific understanding and safety-critical applications. Explanations can be provided in terms of explanation maps [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] that visualize the relevance attributed to each input feature for the overall classification result. In this work, we establish that these explanation maps can be changed to an arbitrary target map. This is done by applying a visually hardly perceptible perturbation to the input. We refer to Figure 1 for an example. This perturbation does not change the output of the neural network, i.e. in addition to the classification result, the vector of all class probabilities is also (approximately) the same. This finding is clearly problematic if a user, say a medical doctor, expects a robustly interpretable explanation map to rely on in the clinical decision-making process.

Motivated by this unexpected observation, we provide a theoretical analysis that establishes a relation of this phenomenon to the geometry of the neural network's output manifold. This novel understanding allows us to derive a bound on the degree of possible manipulation of the explanation map. This bound is proportional to two differential geometric quantities: the principal curvatures and the geodesic distance between the original input and its manipulated counterpart. Given this theoretical insight, we propose efficient ways to limit possible manipulations and thus enhance the resilience of explanation methods. In summary, this work provides the following key contributions:

We propose an algorithm which allows one to manipulate an image with a hardly perceptible perturbation such that the explanation matches an arbitrary target map. We demonstrate its effectiveness for six different explanation methods and on four network architectures as well as two datasets.

We provide a theoretical understanding of this phenomenon for gradient-based methods in terms of differential geometry. We derive a bound on the principal curvatures of the hypersurface of equal network output. This implies a constraint on the maximal change of the explanation map due to small perturbations.

Using these insights, we propose methods to undo the manipulations and increase the robustness of explanation maps by smoothing the explanation method. We demonstrate experimentally that smoothing leads to increased robustness not only for gradient-based but also for propagation-based methods.
1.1 Related work
In [20], it was demonstrated that explanation maps can be sensitive to small perturbations in the image. Their results may be thought of as untargeted manipulations, i.e. perturbations to the image which lead to an unstructured change in the explanation map. Our work focuses on targeted manipulations instead, i.e. on reproducing a given target map. Another approach [21] adds a constant shift to the input image, which is then eliminated by changing the bias of the first layer. For some methods, this leads to a change in the explanation map. Contrary to our approach, this requires changing the network's biases. In [22], explanation maps are changed by randomization of (some of) the network weights. This is different from our method as it does not aim to change the explanation in a targeted manner and modifies the weights of the network.
2 Manipulating explanations
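We consider a neural network $g: \mathbb{R}^d \to \mathbb{R}^K$ with relu nonlinearities which classifies an image $x \in \mathbb{R}^d$ into $K$ categories, with the predicted class given by $k = \arg\max_i g(x)_i$. The explanation map is denoted by $h: \mathbb{R}^d \to \mathbb{R}^d$ and associates an image with a vector of the same dimension whose components encode the relevance score of each pixel for the neural network's prediction. For a given explanation method and specified target $h^t \in \mathbb{R}^d$, a manipulated image $x_{\text{adv}} = x + \delta x$ has the following properties:

The output of the network stays approximately constant, i.e. $g(x_{\text{adv}}) \approx g(x)$.

The explanation is close to the target map, i.e. $h(x_{\text{adv}}) \approx h^t$.

The norm of the perturbation $\delta x$ added to the input image is small, i.e. $\|\delta x\| = \|x_{\text{adv}} - x\| \ll 1$, and therefore not perceptible.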
Throughout this paper, we will use the following explanation methods:

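Gradient: This method uses the map $h(x) = \frac{\partial g}{\partial x}(x)$, which quantifies how infinitesimal perturbations in each pixel change the prediction.

Gradient × Input: This method uses the map $h(x) = x \odot \frac{\partial g}{\partial x}(x)$ [14]. For linear models, this measure gives the exact contribution of each pixel to the prediction.

Integrated Gradients: This method defines $h(x) = (x - \bar{x}) \odot \int_0^1 \frac{\partial g(\bar{x} + t (x - \bar{x}))}{\partial x} \, dt$, where $\bar{x}$ is a suitable baseline. See the original reference [13] for more details.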

Guided Backpropagation (GBP): This method is a variation of the gradient explanation for which negative components of the gradient are set to zero while backpropagating through the nonlinearities [4].

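Layer-wise Relevance Propagation (LRP): This method [5, 16] propagates relevance backwards through the network. For the output layer $L$, relevance is defined by

$$R^L_i = \delta_{i,k}, \tag{1}$$

where $\delta_{i,k}$ is the Kronecker symbol (equal to one if $i = k$ and zero otherwise). This relevance is then propagated backwards through all layers but the first using the rule

$$R^l_i = \sum_j \frac{x^l_i \, (W^l)^+_{ji}}{\sum_{i'} x^l_{i'} \, (W^l)^+_{ji'}} \, R^{l+1}_j, \tag{2}$$

where $(W^l)^+$ denotes the positive weights of the $l$-th layer and $x^l$ is the activation vector of the $l$-th layer. For the first layer, we use the following rule to account for the bounded input domain:

$$R^0_i = \sum_j \frac{x^0_i W^0_{ji} - l_i (W^0)^+_{ji} - h_i (W^0)^-_{ji}}{\sum_{i'} \left( x^0_{i'} W^0_{ji'} - l_{i'} (W^0)^+_{ji'} - h_{i'} (W^0)^-_{ji'} \right)} \, R^1_j, \tag{3}$$

where $l_i$ and $h_i$ are the lower and upper bounds of the input domain respectively.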

Pattern Attribution (PA): This method is equivalent to standard backpropagation upon element-wise multiplication of the weights $W^l$ with learned patterns $A^l$. We refer to the original publication for more details [17].
These methods cover two classes of attribution methods, namely gradient-based and propagation-based explanations, and are frequently used in practice [23, 24].
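To make the gradient-based definitions concrete, the following minimal sketch computes Gradient, Gradient × Input and Integrated Gradients for a toy model. The model (a single softplus unit applied to a linear map), the weights `W`, the baseline and all function names are assumptions chosen for illustration; they are not taken from the reference implementation.

```python
import math

W = [0.5, -1.0, 2.0]  # hypothetical weights of the toy model


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z, beta=1.0):
    # Stable form of log(1 + exp(beta * z)) / beta.
    return max(z, 0.0) + math.log1p(math.exp(-abs(beta * z))) / beta


def g(x, beta=1.0):
    # Toy "network": one softplus unit applied to w . x.
    return softplus(sum(w * xi for w, xi in zip(W, x)), beta)


def gradient(x, beta=1.0):
    # Gradient explanation: dg/dx_i = sigmoid(beta * w.x) * w_i.
    s = sigmoid(beta * sum(w * xi for w, xi in zip(W, x)))
    return [s * w for w in W]


def gradient_x_input(x, beta=1.0):
    # Element-wise product of the input with the gradient.
    return [xi * gi for xi, gi in zip(x, gradient(x, beta))]


def integrated_gradients(x, baseline, steps=200, beta=1.0):
    # Midpoint Riemann sum along the straight path from baseline to x.
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        t = (s + 0.5) / steps
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / steps for a, gi in zip(avg_grad, gradient(point, beta))]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, 0.25]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
# Completeness check: attributions should sum to g(x) - g(baseline).
completeness_gap = abs(sum(ig) - (g(x) - g(baseline)))
```

The completeness check at the end mirrors the usual sanity test for Integrated Gradients: the attributions should approximately add up to the difference between the output at the input and at the baseline.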
2.1 Manipulation Method
Let $h^t \in \mathbb{R}^d$ be a given target explanation map and $x \in \mathbb{R}^d$ an input image. As explained previously, we want to construct a manipulated image $x_{\text{adv}} = x + \delta x$ such that it has an explanation very similar to the target $h^t$ but the output of the network stays approximately constant, i.e. $g(x_{\text{adv}}) \approx g(x)$. We obtain such manipulations by optimizing the loss function

$$\mathcal{L} = \left\| h(x_{\text{adv}}) - h^t \right\|^2 + \gamma \left\| g(x_{\text{adv}}) - g(x) \right\|^2 \tag{4}$$

with respect to $x_{\text{adv}}$ using gradient descent. We clamp $x_{\text{adv}}$ after each iteration so that it is a valid image. The first term in the loss function (4) ensures that the manipulated explanation map is close to the target, while the second term encourages the network to have the same output. The relative weighting of these two summands is controlled by the hyperparameter $\gamma \in \mathbb{R}_+$.

The gradient of the explanation with respect to the input often depends on the vanishing second derivative of the relu nonlinearities. This causes problems during optimization of the loss (4). As an example, the gradient method leads to

$$\partial_{x_{\text{adv}}} \left\| h(x_{\text{adv}}) - h^t \right\|^2 \propto \frac{\partial h}{\partial x_{\text{adv}}} = \frac{\partial^2 g}{\partial x_{\text{adv}}^2} \propto \text{relu}'' = 0.$$

We therefore replace the relu by softplus nonlinearities

$$\text{softplus}_\beta(x) = \frac{1}{\beta} \log\left(1 + e^{\beta x}\right). \tag{5}$$

For large $\beta$ values, the softplus approximates the relu closely but has a well-defined second derivative. After optimization is complete, we test the manipulated image with the original relu network.

Similarity metrics: In our analysis, we assess the similarity between both images and explanation maps. To this end, we use three metrics following [22]: the structural similarity index (SSIM), the Pearson correlation coefficient (PCC) and the mean squared error (MSE). SSIM and PCC are relative similarity measures bounded above by one, where larger values indicate higher similarity. The MSE is an absolute error measure for which values close to zero indicate high similarity. We normalize the sum of the explanation maps to be one and the images to have values between 0 and 1.
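The optimization described in this section can be sketched on a toy problem as follows. This is a minimal illustration under several assumptions, not the reference implementation: the network is a hypothetical two-unit softplus model, the explanation is the analytic gradient, the loss is assumed to take the squared-error form (explanation term plus a weighted output term), gradients of the loss are obtained by finite differences instead of automatic differentiation, and a simple backtracking step size is used. The names `BETA`, `GAMMA`, `W`, `V` and `manipulate` are all illustrative.

```python
import math

BETA = 5.0                       # softplus sharpness (assumed value)
GAMMA = 10.0                     # weight of the output term (assumed value)
W = [[1.0, -2.0], [-1.5, 1.0]]   # hypothetical hidden-layer weights
V = [1.0, 0.5]                   # hypothetical output weights


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(BETA * z))) / BETA


def g(x):
    # Scalar network output: sum_j V_j * softplus(W_j . x).
    return sum(v * softplus(sum(w * xi for w, xi in zip(row, x)))
               for v, row in zip(V, W))


def h(x):
    # Gradient explanation dg/dx, computed analytically.
    grad = [0.0] * len(x)
    for v, row in zip(V, W):
        s = sigmoid(BETA * sum(w * xi for w, xi in zip(row, x)))
        grad = [gi + v * s * w for gi, w in zip(grad, row)]
    return grad


def loss(x_adv, x, target):
    # Explanation term plus GAMMA times the squared change in output.
    expl = sum((a - b) ** 2 for a, b in zip(h(x_adv), target))
    return expl + GAMMA * (g(x_adv) - g(x)) ** 2


def numerical_grad(f, x, eps=1e-6):
    # Central finite differences; autodiff would be used in practice.
    grads = []
    for i in range(len(x)):
        hi, lo = x[:], x[:]
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


def manipulate(x, target, steps=200, lr=0.05):
    x_adv = x[:]
    for _ in range(steps):
        grad = numerical_grad(lambda z: loss(z, x, target), x_adv)
        step = lr
        while step > 1e-8:
            # Gradient step followed by clamping to the valid range [0, 1].
            cand = [min(1.0, max(0.0, xi - step * gi))
                    for xi, gi in zip(x_adv, grad)]
            if loss(cand, x, target) < loss(x_adv, x, target):
                x_adv = cand
                break
            step *= 0.5  # backtrack if the step did not reduce the loss
    return x_adv


x = [0.6, 0.4]
target = h([0.3, 0.7])           # explanation of a second "image"
x_adv = manipulate(x, target)
initial_loss = loss(x, x, target)
final_loss = loss(x_adv, x, target)
```

With only two input dimensions the target cannot in general be matched while keeping the output fixed; the sketch only illustrates the structure of the loop (loss, gradient step, clamping), whereas the experiments rely on the high dimensionality of images.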
2.2 Experiments
To evaluate our approach, we apply our algorithm to 100 randomly selected images for each explanation method. We use a pretrained VGG16 network [25] and the ImageNet dataset [26]. For each run, we randomly select two images from the test set. One of the two images is used to generate a target explanation map $h^t$. The other image is perturbed by our algorithm with the goal of replicating the target $h^t$ using a few thousand iterations of gradient descent. We sum over the absolute values of the channels of the explanation map to get the relevance per pixel. Further details about the experiments are summarized in Supplement A.

Qualitative analysis: Our method is illustrated in Figure 2, in which a dog image is manipulated in order to have the explanation of a cat. For all explanation methods, the target is closely emulated and the perturbation of the dog image is small. More examples can be found in the supplement.

Quantitative analysis: Figure 3 shows similarity measures between the target $h^t$ and the manipulated explanation map $h(x_{\text{adv}})$ as well as between the original image $x$ and the perturbed image $x_{\text{adv}}$. (Throughout this paper, boxes denote 25th and 75th percentiles, whiskers denote 10th and 90th percentiles, solid lines show the medians and outliers are depicted by circles.) All considered metrics show that the perturbed images have an explanation closely resembling the targets. At the same time, the perturbed images are very similar to the corresponding original images. We also verified by visual inspection that the results look very similar. We have uploaded the results of all runs so that interested readers can assess their similarity themselves (https://drive.google.com/drive/folders/1TZeWngoevHRuIw6gb5CZDIRrc7EWf5yb?usp=sharing) and will provide code to reproduce them. In addition, the output of the neural network is approximately unchanged by the perturbations, i.e. the classification of all examples is unchanged and the median change in the network output, $\|g(x_{\text{adv}}) - g(x)\|$, is small for all methods. See Supplement B for further details.

Other architectures and datasets: We checked that comparable results are obtained for ResNet18 [27], AlexNet [28] and Densenet121 [29]. Moreover, we also successfully tested our algorithm on the CIFAR10 dataset [30]. We refer to Supplement C for further details.
3 Theoretical considerations
In this section, we analyze the vulnerability of explanations theoretically. We argue that this phenomenon can be related to the large curvature of the output manifold of the neural network. We focus on the gradient method, starting with an intuitive discussion before developing mathematically precise statements. We have demonstrated that one can drastically change the explanation map while keeping the output of the neural network constant,

$$g(x + \delta x) = g(x) = c, \tag{6}$$

using only a small perturbation $\delta x$ in the input. The perturbed image $x_{\text{adv}} = x + \delta x$ therefore lies on the hypersurface of constant network output $S = \{ p \in \mathbb{R}^d \mid g(p) = c \}$. (It is sufficient to consider the hypersurface $S$ in a neighbourhood of the unperturbed input $x$.) We can exclusively consider the winning class output, i.e. $g(x) := g(x)_k$ with $k = \arg\max_i g(x)_i$, because the gradient method only depends on this component of the output. Therefore, the hypersurface $S$ is of codimension one. The gradient $\nabla g$ for every $p \in S$ is normal to this hypersurface. The fact that the normal vector $\nabla g$ can be drastically changed by slightly perturbing the input along the hypersurface $S$ suggests that the curvature of $S$ is large. While the latter statement may seem intuitive, it requires nontrivial concepts of differential geometry to make it precise, in particular the notion of the second fundamental form. We will briefly summarize these concepts in the following (see e.g. [31] for a standard textbook). To this end, it is advantageous to consider a normalized version of the gradient method

$$n(x) = \frac{\nabla g(x)}{\sqrt{\nabla g(x) \cdot \nabla g(x)}}. \tag{7}$$
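This normalization is merely conventional as it does not change the relative importance of any pixel with respect to the others. For any point $p \in S$, we define the tangent space $T_pS$ as the vector space spanned by the tangent vectors $\dot\gamma(0) = \frac{d}{dt}\gamma(t)\big|_{t=0}$ of all possible curves $\gamma: \mathbb{R} \to S$ with $\gamma(0) = p$. For $u, v \in T_pS$, we denote their inner product by $\langle u, v \rangle$. For any $u \in T_pS$, the directional derivative $D_u f$ of a function $f$ is uniquely defined for any choice of $\gamma$ by

$$D_u f(p) = \frac{d}{dt} f(\gamma(t)) \Big|_{t=0} \quad \text{with} \quad \gamma(0) = p \ \text{and} \ \dot\gamma(0) = u. \tag{8}$$

We then define the Weingarten map as

$$L: T_pS \to T_pS, \qquad u \mapsto -D_u n(p),$$

where the unit normal $n(p)$ can be written as (7). (The fact that $D_u n(p) \in T_pS$ follows by taking the directional derivative with respect to $u$ on both sides of $\langle n, n \rangle = 1$.) This map quantifies how much the unit normal changes as we infinitesimally move away from $p$ in the direction $u$. The second fundamental form is then given by

$$\mathcal{L}: T_pS \times T_pS \to \mathbb{R}, \qquad (u, v) \mapsto \langle v, L(u) \rangle = -\langle v, D_u n(p) \rangle.$$

It can be shown that the second fundamental form is bilinear and symmetric, $\mathcal{L}(u, v) = \mathcal{L}(v, u)$. It is therefore diagonalizable with real eigenvalues $\lambda_1, \dots, \lambda_{d-1}$, which are called principal curvatures. We have therefore established the remarkable fact that the sensitivity of the gradient map (7) is described by the principal curvatures, a key concept of differential geometry. In particular, this allows us to derive an upper bound on the maximal change of the gradient map $h(x) = n(x)$ as we move slightly on $S$. To this end, we define the geodesic distance $d_g(p, q)$ of two points $p, q \in S$ as the length of the shortest curve on $S$ connecting $p$ and $q$. In the supplement, we show that:

Theorem 1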
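Let $g: \mathbb{R}^d \to \mathbb{R}$ be a network with $\text{softplus}_\beta$ nonlinearities and $U_\epsilon(p) = \{ x \in \mathbb{R}^d ; \|x - p\| < \epsilon \}$ an environment of a point $p \in S$ such that $U_\epsilon(p) \cap S$ is fully connected. Let $g$ have bounded derivatives $\|\nabla g(x)\| \ge c$ for all $x \in U_\epsilon(p) \cap S$. It then follows for all $p_0 \in U_\epsilon(p) \cap S$ that

$$\| h(p) - h(p_0) \| \le |\lambda_{\max}| \, d_g(p, p_0) \le \beta \, C \, d_g(p, p_0), \tag{9}$$

where $\lambda_{\max}$ is the principal curvature with the largest absolute value for any point in $U_\epsilon(p) \cap S$ and the constant $C > 0$ depends on the weights of the neural network.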
This theorem can intuitively be motivated as follows: for relu nonlinearities, the lines of equal network output are piecewise linear and therefore have kinks, i.e. points of divergent curvature. These relu nonlinearities are well approximated by softplus nonlinearities (5) with large $\beta$. Reducing $\beta$ smooths out the kinks and therefore leads to reduced maximal curvature, i.e. $|\lambda_{\max}| \le \beta C$. For each point on the geodesic curve connecting $p$ and $p_0$, the normal can at worst be affected by the maximal curvature, i.e. the change in explanation is bounded by $|\lambda_{\max}| \, d_g(p, p_0)$.
There are two important lessons to be learned from this theorem. Firstly, the geodesic distance can be substantially greater than the Euclidean distance for curved manifolds. In this case, inputs which are very similar to each other, i.e. whose Euclidean distance is small, can have explanations that are drastically different. Secondly, the upper bound is proportional to the $\beta$ parameter of the softplus nonlinearity. Therefore, smaller values of $\beta$ provably result in increased robustness with respect to manipulations.
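The linear dependence on the softplus parameter can be checked numerically on a toy model. The sketch below assumes a hypothetical one-hidden-layer network $g(x) = \sum_j v_j \, \text{softplus}_\beta(w_j \cdot x)$ with made-up weights. Its Hessian is $\sum_j v_j \, \beta \, s_j (1 - s_j) \, w_j w_j^T$ with $s_j = \sigma(\beta \, w_j \cdot x)$, and since $s (1 - s) \le 1/4$, its Frobenius norm is at most $\frac{\beta}{4} \sum_j |v_j| \, \|w_j\|^2$. This illustrates only a Hessian bound for this toy model, not the full curvature bound of the theorem.

```python
import math

W = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.2]]  # hypothetical hidden weights
V = [1.5, -0.7]                            # hypothetical output weights


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hessian(x, beta):
    # Hessian of g(x) = sum_j V_j * softplus_beta(W_j . x).
    d = len(x)
    H = [[0.0] * d for _ in range(d)]
    for v, w in zip(V, W):
        s = sigmoid(beta * sum(wi * xi for wi, xi in zip(w, x)))
        coeff = v * beta * s * (1.0 - s)  # v * softplus_beta''(w . x)
        for a in range(d):
            for b in range(d):
                H[a][b] += coeff * w[a] * w[b]
    return H


def frobenius(H):
    return math.sqrt(sum(e * e for row in H for e in row))


def hessian_bound(beta):
    # Worst-case bound (beta / 4) * sum_j |V_j| * ||W_j||^2.
    return beta / 4.0 * sum(abs(v) * sum(wi * wi for wi in w)
                            for v, w in zip(V, W))


x = [0.1, 0.2, -0.3]
norms = {beta: frobenius(hessian(x, beta)) for beta in (1.0, 10.0, 100.0)}
bounds = {beta: hessian_bound(beta) for beta in norms}
```

The bound is a worst case over all inputs: it is approached only near the kinks $w_j \cdot x = 0$, while far from them the Hessian of a sharp softplus network is close to zero.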
4 Robust explanations
Using the fact that the upper bound of the last section is proportional to the $\beta$ parameter of the softplus nonlinearities, we propose $\beta$-smoothing of explanations. This method calculates an explanation using a network for which the relu nonlinearities are replaced by softplus with a small $\beta$ parameter to smooth the principal curvatures. The precise value of $\beta$ is a hyperparameter of the method, but we find that a value around one works well in practice. As shown in the supplement, a relation between SmoothGrad [12] and $\beta$-smoothing can be proven for a one-layer neural network:
Theorem 2
For a one-layer neural network $g(x) = \text{relu}(w^T x)$ and its $\beta$-smoothed counterpart $g_\beta(x) = \text{softplus}_\beta(w^T x)$, it holds that

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[ \nabla g(x - \epsilon) \right] = \nabla g_{\beta / \|w\|}(x),$$

where $p_\beta(\epsilon) = \frac{\beta}{\left( e^{\beta \epsilon / 2} + e^{-\beta \epsilon / 2} \right)^2}$.

Since $p_\beta$ closely resembles a normal distribution whose width scales as $1/\beta$, $\beta$-smoothing can be understood as the limit of SmoothGrad, $h(x) = \frac{1}{N} \sum_{i=1}^{N} \nabla g(x - \epsilon_i)$ with $\epsilon_i \sim p_\beta$, for a large number of noise samples $N$. We emphasize that the theorem only holds for a one-layer neural network, but for deeper networks we empirically observe that both lead to visually similar maps, as they are considerably less noisy than the gradient map. The theorem therefore suggests that SmoothGrad can similarly be used to smooth the curvatures and can thereby make explanations more robust. (For explanation methods $h(x)$ other than gradient, SmoothGrad needs to be used in a slightly generalized form, i.e. $\frac{1}{N} \sum_{i=1}^{N} h(x - \epsilon_i)$.)

Experiments: Figure 4 demonstrates that $\beta$-smoothing allows us to recover the original explanation map by lowering the value of the $\beta$ parameter. We stress that this works for all considered methods. We also note that the same effect can be observed using SmoothGrad by successively increasing the standard deviation $\sigma$ of the noise distribution. This further underlines the similarity between the two smoothing methods.

If an attacker knew that smoothing was used to undo the manipulation, they could try to attack the smoothed method directly. However, both $\beta$-smoothing and SmoothGrad are substantially more robust than their non-smoothed counterparts, see Figure 5. It is important to note that $\beta$-smoothing achieves this at considerably lower computational cost: $\beta$-smoothing only requires a single forward and backward pass, while SmoothGrad requires as many as the number of noise samples (typically between 10 and 50). We refer to Supplement D for more details on these experiments.
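The relation between SmoothGrad and the softplus-smoothed gradient can be checked by Monte Carlo simulation in the scalar case. The sketch below assumes a scalar input with unit weight, so that the relu gradient is a Heaviside step, and assumes that the noise follows the logistic density $p_\beta(\epsilon) = \beta / (e^{\beta \epsilon / 2} + e^{-\beta \epsilon / 2})^2$, which is the second derivative of $\text{softplus}_\beta$ and has cumulative distribution function $\sigma(\beta \epsilon)$. The function names, the seed and the chosen values of $\beta$ and $x$ are illustrative.

```python
import math
import random


def sample_noise(beta, rng):
    # Inverse-CDF sampling: the CDF of p_beta is sigmoid(beta * eps).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return math.log(u / (1.0 - u)) / beta


def smoothed_relu_grad(x, beta, n_samples, rng):
    # SmoothGrad-style average of relu'(x - eps) = Heaviside(x - eps).
    hits = sum(1 for _ in range(n_samples)
               if x - sample_noise(beta, rng) > 0)
    return hits / n_samples


def softplus_grad(x, beta):
    # Derivative of softplus_beta, i.e. the logistic function of beta * x.
    return 1.0 / (1.0 + math.exp(-beta * x))


rng = random.Random(0)
beta, x = 2.0, 0.3
estimate = smoothed_relu_grad(x, beta, 20000, rng)
exact = softplus_grad(x, beta)
```

Up to sampling noise, `estimate` agrees with `exact`. The sampled average needs many gradient evaluations, whereas the softplus gradient needs a single one, which reflects the difference in computational cost noted above.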
5 Conclusion
Explanation methods have recently become increasingly popular among practitioners. In this contribution, we show that dedicated imperceptible manipulations of the input data can yield arbitrary and drastic changes of the explanation map. We demonstrate both qualitatively and quantitatively that explanation maps of many popular explanation methods can be arbitrarily manipulated. Crucially, this can be achieved while keeping the model's output constant. A novel theoretical analysis reveals that the large curvature of the network's decision function is one important culprit for this unexpected vulnerability. Using this theoretical insight, we can profoundly increase the resilience to manipulations by smoothing only the explanation process while leaving the model itself unchanged. Future work will investigate possibilities to modify the training process of neural networks itself such that they become less vulnerable to manipulations of explanations. Another interesting future direction is to generalize our theoretical analysis from gradient-based to propagation-based methods. This seems particularly promising because our experiments strongly suggest that similar theoretical findings should also hold for these explanation methods.
References

[1] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
[2] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014.
[3] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833, 2014.
[4] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.
[5] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
[6] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391, 2016.
[7] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[8] Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3145–3153, 2017.
[10] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
[11] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6967–6976. Curran Associates, Inc., 2017.
[12] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
[13] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3319–3328, 2017.
[14] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3145–3153, 2017.
[15] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3449–3457. IEEE, 2017.
[16] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
[17] Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. International Conference on Learning Representations, 2018. https://openreview.net/forum?id=Hkn7CBaTW.
[18] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.
[19] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
[20] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. CoRR, abs/1710.10547, 2017.
[21] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. CoRR, abs/1711.00867, 2017.
[22] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian J. Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. CoRR, abs/1810.03292, 2018.
[23] Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne, and Pieter-Jan Kindermans. iNNvestigate neural networks! Journal of Machine Learning Research, 20, 2019.
[24] Marco Ancona, Enea Ceolini, Cengiz Oztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2014.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.
[29] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[30] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.
[31] Loring W. Tu. Differential geometry: connections, curvature, and characteristic classes, volume 275. Springer, 2017.
Appendix A Details on experiments
We provide a run_attack.py file in our reference implementation which allows one to produce manipulated images. The hyperparameter choices used in our experiments are summarized in Table 1. Beta growth uses a start value $\beta_{\text{start}}$ and an end value $\beta_{\text{end}}$ (see the section below for a description). The column 'factors' summarizes the weighting of the mean squared error of the heatmaps and the images respectively.
method  iterations  lr  factors 

Gradient  1500  ,  
Grad x Input  1500  ,  
IntGrad  500  ,  
LRP  1500  ,  
GBP  1500  ,  
PA  1500  , 
The patterns for the explanation method PA are trained on a subset of the ImageNet training set. The baseline for the explanation method IG was set to zero. To approximate the integral, we use a fixed number of discretization steps, for which we verified that the attributions approximately add up to the score at the input.
A.1 Beta growth
In practice, we observe that we get slightly better results by increasing the value of $\beta$ of the softplus during training from a start value $\beta_{\text{start}}$ to a final value $\beta_{\text{end}}$ using

$$\beta(t) = \beta_{\text{start}} \left( \frac{\beta_{\text{end}}}{\beta_{\text{start}}} \right)^{t/T}, \tag{10}$$

where $t$ is the current optimization step and $T$ denotes the total number of steps. Figure 6 shows the MSE for images and explanation maps during training with and without $\beta$ growth. This strategy is, however, not essential for our results.
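A schedule of this kind can be sketched in a few lines. The example below assumes that the softplus parameter is interpolated geometrically between its start and end value over the course of the optimization. The function name, the numerical start and end values and the number of steps are placeholders chosen for illustration, not the settings of the experiments.

```python
def beta_schedule(step, total_steps, beta_start, beta_end):
    # Geometric interpolation: beta_start at step 0, beta_end at the last step.
    return beta_start * (beta_end / beta_start) ** (step / total_steps)


# Placeholder values: 1500 steps, beta growing from 10 to 100.
schedule = [beta_schedule(t, 1500, 10.0, 100.0) for t in range(0, 1501, 500)]
```

A geometric rather than a linear interpolation spends more steps at small values of the parameter, where the softplus network is smoother and gradients of the explanation are better behaved.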
We use beta growth for all methods except LRP, for which we do not find any speed-up in the optimization as the LRP rules do not explicitly depend on the second derivative of the relu activations. Figure 7 demonstrates that for large beta values the softplus networks approximate the relu network well. Figures 8 and 9 show this for an example for the gradient and the LRP explanation method. We also note that for small beta the gradient explanation maps become more similar to LRP/GBP/PA explanation maps.
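How closely a softplus activation approximates the relu can be quantified directly. The following minimal sketch (function names and the evaluation grid are illustrative) evaluates the gap between the two activations. The gap is largest at zero, where it equals $\log(2) / \beta$, so it shrinks inversely with the softplus parameter.

```python
import math


def relu(z):
    return max(z, 0.0)


def softplus(z, beta):
    # Stable form of log(1 + exp(beta * z)) / beta.
    return max(z, 0.0) + math.log1p(math.exp(-abs(beta * z))) / beta


def max_gap(beta, grid):
    # Largest difference between softplus and relu on the grid.
    return max(softplus(z, beta) - relu(z) for z in grid)


grid = [i / 100.0 for i in range(-500, 501)]  # includes z = 0
gaps = {beta: max_gap(beta, grid) for beta in (1.0, 10.0, 100.0)}
```

This only compares the activation functions themselves; how the difference propagates through a deep network depends on the weights.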
Appendix B Difference in network output
Figure 10 summarizes the change in the output of the network due to the manipulation. We note that all images have the same classification result as the originals. Furthermore, we note that the change in confidence is small. Finally, the norm of the change in the vector of all class probabilities is also very small.
Appendix C Generalization over architectures and data sets
Manipulable explanations are not only a property of the VGG16 network. In this section, we show that our algorithm to manipulate explanations can also be applied to other architectures and datasets. For the experiments, we optimize the loss function given in the main text. We keep the pre-activation for all network architectures approximately constant, which also leads to approximately constant activation.
C.1 Additional architectures
In addition to the VGG architecture, we also analyzed the explanation's susceptibility to manipulations for the AlexNet, Densenet and ResNet architectures. The hyperparameter choices used in our experiments are summarized in Table 2. We again use beta growth with a start value $\beta_{\text{start}}$ and an end value $\beta_{\text{end}}$. Only for Densenet do we choose different (larger) values, since for smaller beta values the explanation map produced with softplus does not resemble the explanation map produced with relu. Figures 11 and 12 show that the similarity measures are comparable for all network architectures for the gradient method. Figures 13, 14, 15 and 16 show one example image for each architecture.
network  iterations  lr  factors 

VGG16  1500  1e11, 10  
AlexNet  4000  1e11, 10  
Densenet121  2000  1e11, 10  
ResNet18  2000  1e11, 10 
C.2 Additional datasets
We trained the VGG16 architecture on the CIFAR10 dataset (code for training VGG on CIFAR10 from https://github.com/chengyangfu/pytorchvggcifar10). The test accuracy is approximately . We then used our algorithm to manipulate the explanations for the LRP method. The hyperparameters are summarized in Table 3. Two example images can be seen in Figure 17.
method  iterations  lr  factors 

LRP  1500  , 
Appendix D Smoothing explanation methods
One can achieve a smoothing effect by substituting softplus activations for the relu activations and then applying the usual rules for the different explanation methods. A smoothing effect can also be achieved by applying the SmoothGrad explanation method, see Figure 18. That is, random perturbations are added to the image and the resulting explanation maps are averaged. We average over 10 perturbed images with different values for the standard deviation $\sigma$ of the Gaussian noise. The noise level $n$ is related to $\sigma$ as $\sigma = n \cdot (x_{\max} - x_{\min})$, where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values the input image can have.
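The averaging procedure can be sketched as follows. The example assumes the common SmoothGrad convention that the standard deviation of the Gaussian noise is the noise level times the range of the input domain, and it uses a hypothetical one-layer relu model with made-up weights as the explanation function. The function names, the seed handling and the default of 10 samples are choices made for this illustration.

```python
import random

W = [0.5, -1.0, 2.0]  # hypothetical weights of a one-layer relu model


def relu_grad(x):
    # Gradient of relu(w . x): w if the unit is active, zero otherwise.
    z = sum(w * xi for w, xi in zip(W, x))
    return [w if z > 0 else 0.0 for w in W]


def smoothgrad(explain, x, noise_level, x_min=0.0, x_max=1.0,
               n_samples=10, seed=0):
    # Average the explanation over noisy copies of the input.
    rng = random.Random(seed)
    sigma = noise_level * (x_max - x_min)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total = [t + e for t, e in zip(total, explain(noisy))]
    return [t / n_samples for t in total]


x = [0.2, 0.4, 0.1]                       # here w . x < 0, so relu_grad(x) = 0
smooth_map = smoothgrad(relu_grad, x, noise_level=0.1)
plain_map = smoothgrad(relu_grad, x, noise_level=0.0)
```

For this one-layer model the smoothed map is the weight vector scaled by the fraction of noisy inputs for which the unit is active, so it varies gradually with the input instead of jumping at the kink of the relu.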
The $\beta$-smoothing or SmoothGrad explanation maps are more robust with respect to manipulations. Figures 19, 20 and 21 show results (MSE, SSIM and PCC) for 100 targeted attacks on the original explanation, the SmoothGrad explanation and the $\beta$-smoothed explanation for the explanation methods Gradient and LRP.
For the manipulation of SmoothGrad we use beta growth. For the manipulation of $\beta$-smoothing we keep $\beta$ fixed for all runs. The hyperparameters for SmoothGrad and $\beta$-smoothing are summarized in Table 4 and Table 5.
method  iterations  lr  factors 

Gradient  1500  ,  
LRP  1500  , 
method  iterations  lr  factors 

Gradient  500  ,  
Grad x Input  500  ,  
IntGrad  200  ,  
LRP  1500  ,  
GBP  500  ,  
PA  500  , 
In Figures 22 and 23, we directly compare the original explanation methods with the $\beta$-smoothed explanation methods. An increase in robustness can be seen for all methods: explanation maps for $\beta$-smoothed explanations have higher MSE and lower SSIM and PCC than explanation maps for the original methods. The similarity measures for the manipulated images are of comparable magnitude.
Appendix E Proofs
In this section, we collect the proofs of the theorems stated in the main text.
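E.1 Theorem 1
Theorem 1
Let $g: \mathbb{R}^d \to \mathbb{R}$ be a network with $\text{softplus}_\beta$ nonlinearities and $U_\epsilon(p) = \{ x \in \mathbb{R}^d ; \|x - p\| < \epsilon \}$ an environment of a point $p \in S$ such that $U_\epsilon(p) \cap S$ is fully connected. Let $g$ have bounded derivatives $\|\nabla g(x)\| \ge c$ for all $x \in U_\epsilon(p) \cap S$. It then follows for all $p_0 \in U_\epsilon(p) \cap S$ that

$$\| h(p) - h(p_0) \| \le |\lambda_{\max}| \, d_g(p, p_0) \le \beta \, C \, d_g(p, p_0), \tag{11}$$

where $\lambda_{\max}$ is the principal curvature with the largest absolute value for any point in $U_\epsilon(p) \cap S$ and the constant $C > 0$ depends on the weights of the neural network.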
Proof: This proof proceeds in four steps. We first bound the Frobenius norm of the Hessian $H$ of the network $g$. From this, we deduce an upper bound on the Frobenius norm of the second fundamental form. This in turn allows us to bound the largest principal curvature $\lambda_{\max}$. Finally, we bound the maximal and minimal change in explanation.

Step 1: Let $\text{softplus}^{(l)}(x) = \text{softplus}(W^{(l)} x)$, where $W^{(l)}$ are the weights of layer $l$. (We do not make the dependence of softplus on its $\beta$ parameter explicit to ease notation.) We note that
(12)  
(13) 
where
(14) 
The activation at layer is then given by
(15) 
Its derivative is given by
We therefore obtain
(16) 
Deriving the expression for again, we obtain
We now restrict to the case for which the index only takes a single value and use . The Hessian is then bounded by
(17) 
where the constant is given by
(18) 
Step 2: Let be a basis of the tangent space . Then the second fundamental form for the hypersurface at point is given by
(19)  
(20) 
We now use the fact that , i.e. the gradient of is normal to the tangent space. This property was explained in the main text. This allows us to deduce that
(21) 
Step 3: The Frobenius norm of the second fundamental form (considered as a matrix in the sense of step 2) can be written as
(22) 
where
are the principal curvatures. This property follows from the fact that the second fundamental form is symmetric and can therefore be diagonalized with real eigenvalues, i.e. the principal curvatures. Using the fact that the derivative of the network is bounded from below,
, we obtain(23) 
Step 4: For , we choose a curve with and . Furthermore, we use the notation . It then follows that
(24) 
Using the fact that and choosing an orthonormal basis for the tangent spaces, we obtain
(25)  
(26) 
The second fundamental form is bilinear and therefore
(27) 
We now use the notation and choose its eigenbasis for . We then obtain for the difference in the unit normals:
(28) 
where $\lambda_i(t)$ denote the principal curvatures at $\gamma(t)$. By orthonormality of the eigenbasis, it can be easily checked that
Using this relation and the triangle inequality, we then obtain by taking the norm on both sides of (28):
(29) 
This inequality holds for any curve $\gamma$ connecting $p$ and $p_0$, but the tightest bound follows by choosing $\gamma$ to be the shortest possible path in $U_\epsilon(p) \cap S$ with length $d_g(p, p_0)$, i.e. the geodesic distance on $U_\epsilon(p) \cap S$. The second inequality of the theorem is obtained from the upper bound on the largest principal curvature derived above, i.e. (23).
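E.2 Theorem 2
Theorem 2
For one-layer neural networks $g(x) = \text{relu}(w^T x)$ and $g_\beta(x) = \text{softplus}_\beta(w^T x)$, it holds that

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[ \nabla g(x - \epsilon) \right] = \nabla g_{\beta / \|w\|}(x), \tag{30}$$

where $p_\beta(\epsilon) = \frac{\beta}{\left( e^{\beta \epsilon / 2} + e^{-\beta \epsilon / 2} \right)^2}$.
Proof: We first show that

$$\text{softplus}_\beta(x) = \mathbb{E}_{\epsilon \sim p_\beta}\left[ \text{relu}(x - \epsilon) \right] \tag{31}$$

for a scalar input $x$. This follows by defining $p_\beta(\epsilon)$ implicitly as

$$\text{softplus}_\beta(x) = \int_{-\infty}^{+\infty} p_\beta(\epsilon) \, \text{relu}(x - \epsilon) \, d\epsilon. \tag{32}$$

Differentiating both sides of this equation with respect to $x$ results in

$$\sigma_\beta(x) = \int_{-\infty}^{+\infty} p_\beta(\epsilon) \, \Theta(x - \epsilon) \, d\epsilon = \int_{-\infty}^{x} p_\beta(\epsilon) \, d\epsilon, \tag{33}$$

where $\Theta$ is the Heaviside step function and $\sigma_\beta(x) = \frac{1}{1 + e^{-\beta x}}$. Differentiating both sides with respect to $x$ again results in

$$p_\beta(x) = \frac{\beta}{\left( e^{\beta x / 2} + e^{-\beta x / 2} \right)^2}. \tag{34}$$

Therefore, (31) holds. For a vector input $x \in \mathbb{R}^d$, we define the distribution of its perturbation $\epsilon \in \mathbb{R}^d$ by

$$p_\beta(\epsilon) = \prod_{i=1}^{d} p_\beta(\epsilon_i), \tag{35}$$

where $\epsilon_i$ denotes the components of $\epsilon$. We will suppress any arrows denoting vector-valued variables in the following in order to ease notation. We choose an orthogonal basis such that

$$\epsilon = \epsilon_p \, \hat{w} + \sum_{i=1}^{d-1} \epsilon_o^{(i)} \, \hat{w}_o^{(i)} \quad \text{with} \quad \hat{w} \cdot \hat{w}_o^{(i)} = 0 \quad \text{and} \quad \hat{w} = \frac{w}{\|w\|}. \tag{36}$$

This allows us to rewrite

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[ \text{relu}\left( w^T (x - \epsilon) \right) \right] = \int p_\beta(\epsilon_p) \, \text{relu}\left( w^T x - \|w\| \, \epsilon_p \right) d\epsilon_p.$$

By changing the integration variable to $\tilde\epsilon = \|w\| \, \epsilon_p$ and using (31), we obtain

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[ \text{relu}\left( w^T (x - \epsilon) \right) \right] = \text{softplus}_{\beta / \|w\|}\left( w^T x \right). \tag{37}$$

The theorem then follows by differentiating both sides of the equation with respect to $x$.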