# Explanations can be manipulated and geometry is to blame

Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.


## 1 Introduction

Explanation methods have attracted significant attention over the last years due to their promise to open the black box of deep neural networks. Interpretability is crucial for scientific understanding and safety-critical applications. Explanations can be provided in terms of explanation maps [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] that visualize the relevance attributed to each input feature for the overall classification result. In this work, we establish that these explanation maps can be changed to an arbitrary target map. This is done by applying a visually hardly perceptible perturbation to the input. We refer to Figure 1 for an example. This perturbation does not change the output of the neural network, i.e. in addition to the classification result, the vector of all class probabilities is also (approximately) the same. This finding is clearly problematic if a user, say a medical doctor, is expecting a robustly interpretable explanation map to rely on in the clinical decision-making process.

Motivated by this unexpected observation, we provide a theoretical analysis that establishes a relation of this phenomenon to the geometry of the neural network's output manifold. This novel understanding allows us to derive a bound on the degree of possible manipulation of the explanation map. This bound is proportional to two differential geometric quantities: the principal curvatures and the geodesic distance between the original input and its manipulated counterpart. Given this theoretical insight, we propose efficient ways to limit possible manipulations and thus enhance resilience of explanation methods. In summary, this work provides the following key contributions:

• We propose an algorithm that manipulates an image with a hardly perceptible perturbation such that the explanation matches an arbitrary target map. We demonstrate its effectiveness for six different explanation methods and on four network architectures as well as two datasets.

• We provide a theoretical understanding of this phenomenon for gradient-based methods in terms of differential geometry. We derive a bound on the principal curvatures of the hypersurface of equal network output. This implies a constraint on the maximal change of the explanation map due to small perturbations.

• Using these insights, we propose methods to undo the manipulations and increase the robustness of explanation maps by smoothing the explanation method. We demonstrate experimentally that smoothing leads to increased robustness not only for gradient but also for propagation-based methods.

### 1.1 Related work

In [20], it was demonstrated that explanation maps can be sensitive to small perturbations in the image. Their results may be thought of as untargeted manipulations, i.e. perturbations to the image which lead to an unstructured change in the explanation map. Our work focuses on targeted manipulations instead, i.e. on reproducing a given target map. Another approach [21] adds a constant shift to the input image, which is then eliminated by changing the bias of the first layer. For some methods, this leads to a change in the explanation map. Contrary to our approach, this requires changing the network's biases. In [22], explanation maps are changed by randomization of (some of) the network weights. This is different from our method as it does not aim to change the explanation in a targeted manner and modifies the weights of the network.

## 2 Manipulating explanations

We consider a neural network $g: \mathbb{R}^d \to \mathbb{R}^K$ which classifies an image $x \in \mathbb{R}^d$ in $K$ categories with the predicted class given by $k = \arg\max_i g(x)_i$. The explanation map is denoted by $h: \mathbb{R}^d \to \mathbb{R}^d$ and associates an image with a vector of the same dimension whose components encode the relevance score of each pixel for the neural network's prediction. For a given explanation method and specified target $h^t \in \mathbb{R}^d$, a manipulated image $x_{\mathrm{adv}} = x + \delta x$ has the following properties:

1. The output of the network stays approximately constant, i.e. $g(x_{\mathrm{adv}}) \approx g(x)$.

2. The explanation is close to the target map, i.e. $h(x_{\mathrm{adv}}) \approx h^t$.

3. The norm of the perturbation $\delta x$ added to the input image is small, i.e. $\|\delta x\| = \|x_{\mathrm{adv}} - x\| \ll 1$, and therefore not perceptible.

Throughout this paper, we will use the following explanation methods:

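• Gradient: The map $h(x) = \frac{\partial g}{\partial x}(x)$ is used and quantifies how infinitesimal perturbations in each pixel change the prediction $g(x)$ [2, 1].

• Gradient × Input: This method uses the map $h(x) = x \odot \frac{\partial g}{\partial x}(x)$ [14]. For linear models, this measure gives the exact contribution of each pixel to the prediction.

• Integrated Gradients: This method defines $h(x) = (x - \bar{x}) \odot \int_0^1 \frac{\partial g(\bar{x} + t (x - \bar{x}))}{\partial x} \, \mathrm{d}t$, where $\bar{x}$ is a suitable baseline. See the original reference [13] for more details.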

• Guided Backpropagation (GBP): This method is a variation of the gradient explanation for which negative components of the gradient are set to zero while backpropagating through the non-linearities [4].

• Layer-wise Relevance Propagation (LRP): This method [5, 16] propagates relevance backwards through the network. For the output layer, relevance is defined by

$$R^L_i = \delta_{i,k}, \tag{1}$$

where $\delta_{i,k}$ is the Kronecker symbol. The relevance is then propagated backwards through all layers but the first using the $z^+$ rule

$$R^l_i = \sum_j \frac{x^l_i (W^l)^+_{ji}}{\sum_i x^l_i (W^l)^+_{ji}} R^{l+1}_j, \tag{2}$$

where $(W^l)^+$ denotes the positive weights of the $l$-th layer and $x^l$ is the activation vector of the $l$-th layer. For the first layer, we use the $z^{\mathcal{B}}$ rule to account for the bounded input domain

$$R^0_i = \sum_j \frac{x^0_j W^0_{ji} - l_j (W^0)^+_{ji} - h_j (W^0)^-_{ji}}{\sum_i \left( x^0_j W^0_{ji} - l_j (W^0)^+_{ji} - h_j (W^0)^-_{ji} \right)} R^1_j, \tag{3}$$

where $l_j$ and $h_j$ are the lower and upper bounds of the input domain respectively.

• Pattern Attribution (PA): This method is equivalent to standard backpropagation upon element-wise multiplication of the weights with learned patterns. We refer to the original publication for more details [17].

These methods cover two classes of attribution methods, namely gradient-based and propagation-based explanations, and are frequently used in practice [23, 24].
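To make the propagation rule (2) concrete, the following is a minimal NumPy sketch of one backward step for a single dense layer. It is not taken from the reference implementation: the function name `lrp_zplus`, the layer sizes and the random weights are hypothetical choices for illustration, and the activations are assumed to be non-negative (as after a relu).

```python
import numpy as np

def lrp_zplus(x, W, R_next):
    """Propagate relevance one layer back with the rule of equation (2).

    x      : activations of layer l, shape (n_in,), assumed non-negative
    W      : weights mapping layer l to layer l+1, shape (n_out, n_in)
    R_next : relevance of layer l+1, shape (n_out,)
    """
    W_plus = np.maximum(W, 0.0)            # keep only the positive weights
    z = W_plus * x[None, :]                # z[j, i] = x_i * (W^+)_{ji}
    denom = z.sum(axis=1, keepdims=True)   # normalisation: sum over i for each j
    return (z / denom * R_next[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=6)) + 0.1       # positive toy activations
W = rng.normal(size=(4, 6))
W[:, 0] = np.abs(W[:, 0]) + 0.1            # toy choice: every row has a positive weight
R_next = np.array([0.0, 1.0, 0.0, 0.0])    # Kronecker delta on the winning class, as in (1)

R = lrp_zplus(x, W, R_next)
```

Because each row of `z / denom` sums to one, the total relevance is conserved from one layer to the next, which is the property that makes the resulting map interpretable as a redistribution of the output score.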

### 2.1 Manipulation Method

Let $h^t \in \mathbb{R}^d$ be a given target explanation map and $x \in \mathbb{R}^d$ an input image. As explained previously, we want to construct a manipulated image $x_{\mathrm{adv}} = x + \delta x$ such that it has an explanation very similar to the target $h^t$ but the output of the network stays approximately constant, i.e. $g(x_{\mathrm{adv}}) \approx g(x)$. We obtain such manipulations by optimizing the loss function

$$\mathcal{L} = \left\| h(x_{\mathrm{adv}}) - h^t \right\|^2 + \gamma \left\| g(x_{\mathrm{adv}}) - g(x) \right\|^2 \tag{4}$$

with respect to $x_{\mathrm{adv}}$ using gradient descent. We clamp $x_{\mathrm{adv}}$ after each iteration so that it is a valid image. The first term in the loss function (4) ensures that the manipulated explanation map is close to the target while the second term encourages the network to have the same output. The relative weighting of these two summands is controlled by the hyperparameter $\gamma \in \mathbb{R}_+$.

The gradient with respect to the input of the explanation often depends on the vanishing second derivative of the relu non-linearities. This causes problems during optimization of the loss (4). As an example, the gradient method leads to

$$\partial_{x_{\mathrm{adv}}} \left\| h(x_{\mathrm{adv}}) - h^t \right\|^2 \propto \frac{\partial h}{\partial x_{\mathrm{adv}}} = \frac{\partial^2 g}{\partial x_{\mathrm{adv}}^2} \propto \mathrm{relu}'' = 0.$$

We therefore replace the relu by softplus non-linearities

$$\mathrm{softplus}_\beta(x) = \frac{1}{\beta} \log\left(1 + e^{\beta x}\right). \tag{5}$$

For large $\beta$ values, the softplus approximates the relu closely but has a well-defined second derivative. After optimization is complete, we test the manipulated image with the original relu network.

Similarity metrics: In our analysis, we assess the similarity between both images and explanation maps. To this end, we use three metrics following [22]: the structural similarity index (SSIM), the Pearson correlation coefficient (PCC) and the mean squared error (MSE). SSIM and PCC are relative similarity measures for which larger values indicate higher similarity. The MSE is an absolute error measure for which values close to zero indicate high similarity. We normalize the sum of the explanation maps to be one and the images to have values between 0 and 1.
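The optimization described in this section can be sketched on a toy problem. The code below is an illustration under stated assumptions rather than the procedure used in the experiments: it uses a small one-hidden-layer softplus network with random weights instead of VGG-16, the gradient explanation with an analytic Hessian, and a backtracking step size (an addition made here so that the loss provably decreases). All names, sizes and the weighting `gamma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, beta, gamma = 5, 8, 2.0, 10.0   # illustrative sizes and loss weighting
W = 0.5 * rng.normal(size=(hidden, d))
v = rng.normal(size=hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-beta * z))

def g(x):
    """Toy network output v^T softplus_beta(W x)."""
    return v @ (np.logaddexp(0.0, beta * (W @ x)) / beta)

def h(x):
    """Gradient explanation map dg/dx."""
    return W.T @ (v * sigmoid(W @ x))

def hessian(x):
    """d^2 g / dx^2; it is non-zero only because softplus replaces relu."""
    s = sigmoid(W @ x)
    return W.T @ np.diag(v * beta * s * (1.0 - s)) @ W

def loss(x_adv, x, h_target):
    return np.sum((h(x_adv) - h_target) ** 2) + gamma * (g(x_adv) - g(x)) ** 2

def loss_grad(x_adv, x, h_target):
    return (2.0 * hessian(x_adv) @ (h(x_adv) - h_target)
            + 2.0 * gamma * (g(x_adv) - g(x)) * h(x_adv))

x = rng.uniform(size=d)                  # "image" with values in [0, 1]
h_target = h(rng.uniform(size=d))        # target map taken from a second input
x_adv = x.copy()
initial_loss = loss(x_adv, x, h_target)

for _ in range(200):
    grad = loss_grad(x_adv, x, h_target)
    step = 0.1
    # Backtracking (added for this sketch): halve the step until the loss drops.
    while step > 1e-8:
        candidate = np.clip(x_adv - step * grad, 0.0, 1.0)   # clamp to a valid image
        if loss(candidate, x, h_target) < loss(x_adv, x, h_target):
            x_adv = candidate
            break
        step *= 0.5

final_loss = loss(x_adv, x, h_target)
```

The `hessian` term is where the second derivative of the non-linearity enters: with relu it would vanish almost everywhere and the first loss term would provide no gradient signal, which is the motivation for the softplus replacement.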

### 2.2 Experiments

To evaluate our approach, we apply our algorithm to 100 randomly selected images for each explanation method. We use a pre-trained VGG-16 network [25] and the ImageNet dataset [26]. For each run, we randomly select two images from the test set. One of the two images is used to generate a target explanation map $h^t$. The other image is perturbed by our algorithm with the goal of replicating the target $h^t$ using a few thousand iterations of gradient descent. We sum over the absolute values of the channels of the explanation map to get the relevance per pixel. Further details about the experiments are summarized in Supplement A.

Qualitative analysis: Our method is illustrated in Figure 2 in which a dog image is manipulated in order to have an explanation of a cat. For all explanation methods, the target is closely emulated and the perturbation of the dog image is small. More examples can be found in the supplement.

Quantitative analysis: Figure 3 shows similarity measures between the target and the manipulated explanation map as well as between the original image and the perturbed image. (Throughout this paper, boxes denote 25th and 75th percentiles, whiskers denote 10th and 90th percentiles, solid lines show the medians and outliers are depicted by circles.) All considered metrics show that the perturbed images have an explanation closely resembling the targets. At the same time, the perturbed images are very similar to the corresponding original images. We also verified by visual inspection that the results look very similar. We have uploaded the results of all runs so that interested readers can assess their similarity themselves and will provide code to reproduce them. In addition, the output of the neural network is approximately unchanged by the perturbations, i.e. the classification of all examples is unchanged and the median of $\|g(x_{\mathrm{adv}}) - g(x)\|$ is small for all methods. See Supplement B for further details.

Other architectures and datasets: We checked that comparable results are obtained for ResNet-18 [27], AlexNet [28] and Densenet-121 [29]. Moreover, we also successfully tested our algorithm on the CIFAR-10 dataset [30]. We refer to Supplement C for further details.

## 3 Theoretical considerations

In this section, we analyze the vulnerability of explanations theoretically. We argue that this phenomenon can be related to the large curvature of the output manifold of the neural network. We focus on the gradient method, starting with an intuitive discussion before developing mathematically precise statements. We have demonstrated that one can drastically change the explanation map while keeping the output of the neural network constant

$$g(x + \delta x) = g(x) = c \tag{6}$$

using only a small perturbation $\delta x$ in the input. The perturbed image $x + \delta x$ therefore lies on the hypersurface of constant network output $S = \{p \in \mathbb{R}^d \mid g(p) = c\}$. (It is sufficient to consider the hypersurface $S$ in a neighbourhood of the unperturbed input $x$.) We can exclusively consider the winning class output, i.e. $g(x) \equiv g(x)_k$ with $k = \arg\max_i g(x)_i$, because the gradient method only depends on this component of the output. Therefore, the hypersurface $S$ is of co-dimension one. The gradient $\nabla g$ for every $p \in S$ is normal to this hypersurface. The fact that the normal vector $\nabla g$ can be drastically changed by slightly perturbing the input along the hypersurface $S$ suggests that the curvature of $S$ is large. While the latter statement may seem intuitive, it requires non-trivial concepts of differential geometry to make it precise, in particular the notion of the second fundamental form. We will briefly summarize these concepts in the following (see e.g. [31] for a standard textbook). To this end, it is advantageous to consider a normalized version of the gradient method

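$$n(x) = \frac{\nabla g(x)}{\|\nabla g(x)\|}. \tag{7}$$

This normalization is merely conventional as it does not change the relative importance of any pixel with respect to the others. For any point $p \in S$, we define the tangent space $T_pS$ as the vector space spanned by the tangent vectors $\dot{\gamma}(0)$ of all possible curves $\gamma: \mathbb{R} \to S$ with $\gamma(0) = p$. For $u, v \in T_pS$, we denote their inner product by $\langle u, v \rangle$. For any $u \in T_pS$, the directional derivative of a function $f$ is uniquely defined for any choice of $\gamma$ by

$$D_u f(p) = \left. \frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) \right|_{t=0} \quad \text{with } \gamma(0) = p \text{ and } \dot{\gamma}(0) = u. \tag{8}$$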

We then define the Weingarten map as

$$L: \begin{cases} T_pS \to T_pS \\ u \mapsto -D_u n(p), \end{cases}$$

where the unit normal $n(p)$ can be written as (7). (The fact that $D_u n(p) \in T_pS$ follows by taking the directional derivative with respect to $u$ on both sides of $\langle n, n \rangle = 1$.) This map quantifies how much the unit normal changes as we infinitesimally move away from $p$ in the direction $u$. The second fundamental form is then given by

$$\mathbb{L}: \begin{cases} T_pS \times T_pS \to \mathbb{R} \\ (u, v) \mapsto \langle v, L(u) \rangle = -\langle v, D_u n(p) \rangle. \end{cases}$$

It can be shown that the second fundamental form is bilinear and symmetric, $\mathbb{L}(u, v) = \mathbb{L}(v, u)$. It is therefore diagonalizable with real eigenvalues $\lambda_1, \dots, \lambda_{d-1}$, which are called principal curvatures. We have therefore established the remarkable fact that the sensitivity of the gradient map (7) is described by the principal curvatures, a key concept of differential geometry. In particular, this allows us to derive an upper bound on the maximal change of the gradient map $h(x) = n(x)$ as we move slightly on $S$. To this end, we define the geodesic distance $d_g(p, q)$ of two points $p, q \in S$ as the length of the shortest curve on $S$ connecting $p$ and $q$. In the supplement, we show that:

###### Theorem 1

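Let $g: \mathbb{R}^d \to \mathbb{R}$ be a network with $\mathrm{softplus}_\beta$ non-linearities and $U_\epsilon(p) = \{x \in \mathbb{R}^d : \|x - p\| < \epsilon\}$ an environment of a point $p \in S$ such that $U_\epsilon(p) \cap S$ is fully connected. Let $g$ have bounded derivatives $\|\nabla g(x)\| \ge c$ for all $x \in U_\epsilon(p) \cap S$. It then follows for all $p_0 \in U_\epsilon(p) \cap S$ that

$$\|h(p) - h(p_0)\| \le |\lambda_{\max}| \, d_g(p, p_0) \le \beta \, C \, d_g(p, p_0), \tag{9}$$

where $\lambda_{\max}$ is the principal curvature with the largest absolute value for any point in $U_\epsilon(p) \cap S$ and the constant $C > 0$ depends on the weights of the neural network.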

This theorem can intuitively be motivated as follows: for relu non-linearities, the lines of equal network output are piece-wise linear and therefore have kinks, i.e. points of divergent curvature. These relu non-linearities are well approximated by softplus non-linearities (5) with large $\beta$. Reducing $\beta$ smoothes out the kinks and therefore leads to reduced maximal curvature, i.e. $|\lambda_{\max}| \le \beta C$. For each point on the geodesic curve connecting $p$ and $p_0$, the normal can at worst be affected by the maximal curvature, i.e. the change in explanation is bounded by $|\lambda_{\max}| \, d_g(p, p_0)$.

There are two important lessons to be learned from this theorem. First, the geodesic distance can be substantially greater than the Euclidean distance for curved manifolds. In this case, inputs which are very similar to each other, i.e. whose Euclidean distance is small, can have explanations that are drastically different. Secondly, the upper bound is proportional to the $\beta$ parameter of the softplus non-linearity. Therefore, smaller values of $\beta$ provably result in increased robustness with respect to manipulations.
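As a sanity check of the first inequality in (9), one can look at the simplest curved level set, a sphere. The sketch below assumes the toy function $g(p) = \|p\|^2$ (not a neural network), whose level sets are spheres of radius $r$ with all principal curvatures of magnitude $1/r$; the radius and angle are arbitrary illustrative values.

```python
import numpy as np

r = 2.0                                    # sphere radius (toy value)

def unit_normal(p):
    """Normalized gradient n(p) of g(p) = ||p||^2, as in equation (7)."""
    grad = 2.0 * p
    return grad / np.linalg.norm(grad)

p0 = r * np.array([1.0, 0.0, 0.0])
angle = 0.3                                # angle between p0 and p in radians
p = r * np.array([np.cos(angle), np.sin(angle), 0.0])

lambda_max = 1.0 / r                       # principal curvature magnitude of a sphere
geodesic = r * angle                       # great-circle distance d_g(p, p0)
euclidean = np.linalg.norm(p - p0)         # chord length, never longer than the arc
change = np.linalg.norm(unit_normal(p) - unit_normal(p0))
bound = lambda_max * geodesic              # right-hand side |lambda_max| * d_g
```

Here the change of the unit normal equals $2 \sin(\text{angle}/2)$, which stays below the bound `angle`. Shrinking `r` at fixed Euclidean distance increases the curvature and hence the admissible change, in line with the first lesson above.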

## 4 Robust explanations

Using the fact that the upper bound of the last section is proportional to the $\beta$ parameter of the softplus non-linearities, we propose $\beta$-smoothing of explanations. This method calculates an explanation using a network for which the relu non-linearities are replaced by softplus with a small $\beta$ parameter to smooth the principal curvatures. The precise value of $\beta$ is a hyperparameter of the method, but we find that a value around one works well in practice. As shown in the supplement, a relation between SmoothGrad [12] and $\beta$-smoothing can be proven for a one-layer neural network:

###### Theorem 2

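For a one-layer neural network $g(x) = \mathrm{relu}(w^T x)$ and its $\beta$-smoothed counterpart $g_\beta(x) = \mathrm{softplus}_\beta(w^T x)$, it holds that

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[\nabla g(x - \epsilon)\right] = \nabla g_{\beta / \|w\|}(x),$$

where $p_\beta(\epsilon) = \prod_i \frac{\beta}{\left(e^{\beta \epsilon_i / 2} + e^{-\beta \epsilon_i / 2}\right)^2}$.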

Since $p_\beta$ closely resembles a normal distribution whose standard deviation is inversely proportional to $\beta$, $\beta$-smoothing can be understood as the limit of SmoothGrad with infinitely many noise samples. We emphasize that the theorem only holds for a one-layer neural network, but for deeper networks we empirically observe that both lead to visually similar maps as they are considerably less noisy than the gradient map. The theorem therefore suggests that SmoothGrad can similarly be used to smooth the curvatures and can thereby make explanations more robust. (For explanation methods other than gradient, SmoothGrad needs to be used in a slightly generalized form, i.e. by averaging the explanation maps of the noisy inputs.)

Experiments: Figure 4 demonstrates that $\beta$-smoothing allows us to recover the original explanation map by lowering the value of the $\beta$ parameter. We stress that this works for all considered methods. We also note that the same effect can be observed using SmoothGrad by successively increasing the standard deviation $\sigma$ of the noise distribution. This further underlines the similarity between the two smoothing methods.

If an attacker knew that smoothing was used to undo the manipulation, they could try to attack the smoothed method directly. However, both $\beta$-smoothing and SmoothGrad are substantially more robust than their non-smoothed counterparts, see Figure 5. It is important to note that $\beta$-smoothing achieves this at considerably lower computational cost: $\beta$-smoothing only requires a single forward and backward pass, while SmoothGrad requires as many as the number of noise samples (typically between 10 and 50). We refer to Supplement D for more details on these experiments.
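The scalar identity behind the relation between SmoothGrad and $\beta$-smoothing, namely that a relu averaged over noise drawn from $p_\beta$ equals a softplus, can be checked numerically. The sketch below uses a plain Riemann sum instead of sampling; the grid width, the value of `beta` and the test points are arbitrary choices made for illustration.

```python
import numpy as np

def softplus(x, beta):
    return np.logaddexp(0.0, beta * x) / beta

def p_beta(eps, beta):
    """Scalar noise density beta / (e^{beta eps / 2} + e^{-beta eps / 2})^2."""
    return beta / (np.exp(beta * eps / 2) + np.exp(-beta * eps / 2)) ** 2

def smoothed_relu(x, beta, half_width=60.0, n=240001):
    """Riemann-sum approximation of E_{eps ~ p_beta}[relu(x - eps)]."""
    eps = np.linspace(-half_width, half_width, n)
    d_eps = eps[1] - eps[0]
    return np.sum(p_beta(eps, beta) * np.maximum(x - eps, 0.0)) * d_eps

beta = 1.5
xs = np.array([-2.0, 0.0, 0.7, 3.0])
numeric = np.array([smoothed_relu(x, beta) for x in xs])
exact = softplus(xs, beta)
```

The two arrays agree up to the discretization error of the sum. This also illustrates the cost argument: the right-hand side needs one function evaluation, whereas a sampling-based estimate of the left-hand side needs one evaluation per noise sample.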

## 5 Conclusion

Explanation methods have recently become increasingly popular among practitioners. In this contribution we show that dedicated imperceptible manipulations of the input data can yield arbitrary and drastic changes of the explanation map. We demonstrate both qualitatively and quantitatively that explanation maps of many popular explanation methods can be arbitrarily manipulated. Crucially, this can be achieved while keeping the model’s output constant. A novel theoretical analysis reveals that in fact the large curvature of the network’s decision function is one important culprit for this unexpected vulnerability. Using this theoretical insight, we can profoundly increase the resilience to manipulations by smoothing only the explanation process while leaving the model itself unchanged. Future work will investigate possibilities to modify the training process of neural networks itself such that they can become less vulnerable to manipulations of explanations. Another interesting future direction is to generalize our theoretical analysis from gradient to propagation-based methods. This seems particularly promising because our experiments strongly suggest that similar theoretical findings should also hold for these explanation methods.

## Appendix A Details on experiments

We provide a run_attack.py file in our reference implementation which allows one to produce manipulated images. The hyperparameter choices used in our experiments are summarized in Table 1. We set and for beta growth (see section below for a description). The column ’factors’ summarizes the weighting of the mean squared error of the heatmaps and the images respectively.

The patterns for explanation method PA are trained on a subset of the ImageNet training set. The baseline for explanation method IG was set to zero. To approximate the integral, we use a number of steps for which we verified that the attributions approximately add up to the score at the input.
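The check mentioned above, that the Integrated Gradients attributions approximately add up to the score, can be sketched as follows. This is a toy illustration with a random one-hidden-layer softplus network and a midpoint Riemann sum; the network, the number of steps and all names are assumptions made here and do not reflect the settings of the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
v = rng.normal(size=6)
beta = 2.0

def g(x):
    """Toy network output v^T softplus_beta(W x)."""
    return v @ (np.logaddexp(0.0, beta * (W @ x)) / beta)

def grad_g(x):
    """Analytic gradient of the toy network."""
    return W.T @ (v / (1.0 + np.exp(-beta * (W @ x))))

def integrated_gradients(x, baseline, steps):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    ts = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_g(baseline + t * (x - baseline)) for t in ts], axis=0)
    return (x - baseline) * avg_grad

x = rng.uniform(size=4)
baseline = np.zeros(4)                       # zero baseline, as stated in the text
attributions = integrated_gradients(x, baseline, steps=200)
# Completeness: attributions should sum to the score difference g(x) - g(baseline).
gap = abs(attributions.sum() - (g(x) - g(baseline)))
```

If `gap` is not small, the number of steps is too low for the integrand at hand and should be increased.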

### A.1 Beta growth

In practice, we observe that we get slightly better results by increasing the value of $\beta$ of the softplus during training from a start value $\beta_0$ to a final value $\beta_e$ using

$$\beta(t) = \beta_0 \left( \frac{\beta_e}{\beta_0} \right)^{t/T}, \tag{10}$$

where $t$ is the current optimization step and $T$ denotes the total number of steps. Figure 6 shows the MSE for images and explanation maps during training with and without $\beta$-growth. This strategy is however not essential for our results.
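The schedule (10) is a geometric interpolation between the start and final value. A minimal sketch follows; the function name and the numbers in the example are illustrative only and are not the hyperparameters of the experiments.

```python
def beta_schedule(t, total_steps, beta_start, beta_end):
    """Equation (10): beta_0 * (beta_e / beta_0) ** (t / T)."""
    return beta_start * (beta_end / beta_start) ** (t / total_steps)

# Illustrative values only; the actual start and final values are hyperparameters.
schedule = [beta_schedule(t, 4, 10.0, 100.0) for t in range(5)]
```

The schedule starts at `beta_start`, ends at `beta_end` and passes through their geometric mean at the halfway point, so early steps optimize a smooth network while late steps approach the relu network.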

We use beta growth for all methods except LRP, for which we do not find any speed-up in the optimization as the LRP rules do not explicitly depend on the second derivative of the relu activations. Figure 7 demonstrates that for large beta values the softplus networks approximate the relu network well. Figure 8 and Figure 9 show this for an example for the gradient and the LRP explanation method. We also note that for small beta the gradient explanation maps become more similar to LRP/GBP/PA explanation maps.
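At the level of a single non-linearity, the quality of the softplus approximation can be quantified directly: the largest gap between softplus and relu occurs at zero and equals $\log(2)/\beta$. The short sketch below verifies this on a grid; the chosen values of $\beta$ are illustrative.

```python
import numpy as np

def softplus(x, beta):
    return np.logaddexp(0.0, beta * x) / beta

x = np.linspace(-5.0, 5.0, 1001)            # grid that contains the point x = 0
relu = np.maximum(x, 0.0)
# Largest pointwise difference between softplus and relu for several beta values.
max_gap = {beta: float(np.max(softplus(x, beta) - relu)) for beta in (1.0, 10.0, 100.0)}
```

The gap shrinks inversely with $\beta$, which is consistent with the observation that large beta values reproduce the relu network while small values smooth it noticeably.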

## Appendix B Difference in network output

Figure 10 summarizes the change in the output of the network due to the manipulation. We note that all images have the same classification result as the originals. Furthermore, we note that the change in confidence is small. Last but not least, the norm of the change in the vector of all class probabilities is also very small.

## Appendix C Generalization over architectures and data sets

Manipulable explanations are not only a property of the VGG-16 network. In this section, we show that our algorithm to manipulate explanations can also be applied to other architectures and data sets. For the experiments, we optimize the loss function given in the main text. We keep the pre-activation for all network architectures approximately constant, which also leads to approximately constant activation.

In addition to the VGG architecture, we also analyzed the explanation's susceptibility to manipulations for the AlexNet, Densenet and ResNet architectures. The hyperparameter choices used in our experiments are summarized in Table 2. The same start and final values of $\beta$ are used for beta growth for all architectures except Densenet. For Densenet we use larger values, as for smaller beta values the explanation map produced with softplus does not resemble the explanation map produced with relu. Figures 11 and 12 show that the similarity measures are comparable for all network architectures for the gradient method. Figures 13, 14, 15 and 16 show one example image for each architecture.

We trained the VGG-16 architecture on the CIFAR-10 dataset (code for training VGG on CIFAR-10 from https://github.com/chengyangfu/pytorch-vgg-cifar10). The test accuracy is approximately . We then used our algorithm to manipulate the explanations for the LRP method. The hyperparameters are summarized in Table 3. Two example images can be seen in Figure 17.

## Appendix D Smoothing explanation methods

One can achieve a smoothing effect when substituting the relu activations for softplus activations and then applying the usual rules for the different explanation methods. A smoothing effect can also be achieved by applying the SmoothGrad explanation method, see Figure 18. That is, random perturbations are added to the image and the resulting explanation maps are averaged. We average over 10 perturbed images with different values for the standard deviation $\sigma$ of the Gaussian noise. The noise level $n$ is related to $\sigma$ as $\sigma = n \cdot (x_{\max} - x_{\min})$, where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values the input image can have.
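The averaging procedure described here can be written generically for any explanation function. The sketch below is a minimal version under the assumption that the noise level is converted to a standard deviation as $\sigma = \text{noise level} \cdot (x_{\max} - x_{\min})$, following the SmoothGrad convention; the function name, its arguments and the linear toy model are hypothetical.

```python
import numpy as np

def smoothgrad(explain, x, noise_level, n_samples, rng, x_min=0.0, x_max=1.0):
    """Average the explanation maps of Gaussian-perturbed copies of x."""
    sigma = noise_level * (x_max - x_min)    # assumed noise-level convention
    maps = [explain(x + rng.normal(scale=sigma, size=x.shape)) for _ in range(n_samples)]
    return np.mean(maps, axis=0)

# Toy explanation: for a linear model w^T x the gradient map is w for every input.
w = np.array([0.5, -1.0, 2.0])

def linear_gradient(x):
    return w

rng = np.random.default_rng(0)
x = np.array([0.2, 0.4, 0.6])
smoothed = smoothgrad(linear_gradient, x, noise_level=0.1, n_samples=10, rng=rng)
```

For a linear model the smoothed map coincides with the plain gradient, since there is no curvature to average out. Each of the `n_samples` evaluations costs one explanation pass, which is the overhead that softplus-based smoothing avoids.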

The $\beta$-smoothing or SmoothGrad explanation maps are more robust with respect to manipulations. Figures 19, 20 and 21 show results (MSE, SSIM and PCC) for 100 targeted attacks on the original explanation, the SmoothGrad explanation and the $\beta$-smoothed explanation for explanation methods Gradient and LRP.

For manipulation of SmoothGrad we use beta growth. For manipulation of $\beta$-smoothing we keep $\beta$ fixed for all runs. The hyperparameters for SmoothGrad and $\beta$-smoothing are summarized in Table 4 and Table 5.

In Figure 22 and Figure 23, we directly compare the original explanation methods with the $\beta$-smoothed explanation methods. An increase in robustness can be seen for all methods: explanation maps for $\beta$-smoothed explanations have higher MSE and lower SSIM and PCC than explanation maps for the original methods. The similarity measures for the manipulated images are of comparable magnitude.

## Appendix E Proofs

In this section, we collect the proofs of the theorems stated in the main text.

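### E.1 Theorem 1

###### Theorem 1

Let $g: \mathbb{R}^d \to \mathbb{R}$ be a network with $\mathrm{softplus}_\beta$ non-linearities and $U_\epsilon(p) = \{x \in \mathbb{R}^d : \|x - p\| < \epsilon\}$ an environment of a point $p \in S$ such that $U_\epsilon(p) \cap S$ is fully connected. Let $g$ have bounded derivatives $\|\nabla g(x)\| \ge c$ for all $x \in U_\epsilon(p) \cap S$. It then follows for all $p_0 \in U_\epsilon(p) \cap S$ that

$$\|h(p) - h(p_0)\| \le |\lambda_{\max}| \, d_g(p, p_0) \le \beta \, C \, d_g(p, p_0), \tag{11}$$

where $\lambda_{\max}$ is the principal curvature with the largest absolute value for any point in $U_\epsilon(p) \cap S$ and the constant $C > 0$ depends on the weights of the neural network.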

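Proof: This proof will proceed in four steps. We will first bound the Frobenius norm of the Hessian of the network. From this, we will deduce an upper bound on the Frobenius norm of the second fundamental form. This in turn will allow us to bound the largest principal curvature $|\lambda_{\max}|$. Finally, we will bound the maximal change in explanation.

Step 1: Let $\mathrm{softplus}^{(l)}(x) = \mathrm{softplus}(W^{(l)} x)$, where $W^{(l)}$ are the weights of layer $l$. (We do not make the dependence of softplus on its $\beta$ parameter explicit to ease notation.) We note that

$$\partial_k \, \mathrm{softplus}\Big(\sum_j W_{ij} x_j\Big) = W_{ik} \, \sigma\Big(\sum_j W_{ij} x_j\Big), \tag{12}$$

$$\partial_l \, \sigma\Big(\sum_j W_{ij} x_j\Big) = \beta \, W_{il} \, g\Big(\sum_j W_{ij} x_j\Big), \tag{13}$$

where $g$ denotes, within this step only, an auxiliary scalar function rather than the network:

$$\sigma(x) = \frac{1}{1 + e^{-\beta x}}, \qquad g(x) = \frac{1}{\left(e^{\beta x / 2} + e^{-\beta x / 2}\right)^2}. \tag{14}$$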
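The two derivative identities (12) and (13) reduce, in the scalar case, to $\mathrm{softplus}' = \sigma$ and $\sigma' = \beta g$ with $\sigma$ and $g$ as in (14). A quick finite-difference check is sketched below; the value of `beta`, the grid and the step size are arbitrary illustrative choices.

```python
import numpy as np

beta = 3.0

def softplus(x):
    return np.logaddexp(0.0, beta * x) / beta

def sigma(x):
    return 1.0 / (1.0 + np.exp(-beta * x))

def g_aux(x):
    """Auxiliary function g of equation (14), not the network."""
    return 1.0 / (np.exp(beta * x / 2) + np.exp(-beta * x / 2)) ** 2

x = np.linspace(-2.0, 2.0, 41)
eps = 1e-5
# Central finite differences of softplus and sigma.
d_softplus = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
d_sigma = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
```

Since `g_aux` is bounded by $1/4$, the factor $\beta$ in $\sigma' = \beta g$ is what makes every second derivative of the network, and hence the Hessian bound, scale linearly with $\beta$.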

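The activation at layer $L$ is then given by

$$a^{(L)}(x) = \left(\mathrm{softplus}^{(L)} \circ \dots \circ \mathrm{softplus}^{(1)}\right)(x). \tag{15}$$

Its derivative is given by

$$\partial_k a^{(L)}_i = \sum_{s_2 \dots s_L} W^{(L)}_{i s_L} \, \sigma\Big(\sum_j W^{(L)}_{ij} a^{(L-1)}_j\Big) \, W^{(L-1)}_{s_L s_{L-1}} \, \sigma\Big(\sum_j W^{(L-1)}_{s_L j} a^{(L-2)}_j\Big) \cdots W^{(1)}_{s_2 k} \, \sigma\Big(\sum_j W^{(1)}_{s_2 j} x_j\Big).$$

We therefore obtain

$$\left\| \nabla a^{(L)} \right\| \le \prod_{l=1}^{L} \left\| W^{(l)} \right\|_F. \tag{16}$$

Differentiating the expression for $\partial_k a^{(L)}_i$ again, we obtain

$$\partial_l \partial_k a^{(L)}_i = \sum_m \sum_{s_2 \dots s_L} \Big\{ W^{(L)}_{i s_L} \, \sigma\Big(\sum_j W^{(L)}_{ij} a^{(L-1)}_j\Big) \, W^{(L-1)}_{s_L s_{L-1}} \, \sigma\Big(\sum_j W^{(L-1)}_{s_L j} a^{(L-2)}_j\Big) \cdots \beta \sum_{\hat{s}_m} W^{(m)}_{s_{m+1} \hat{s}_m} W^{(m)}_{s_{m+1} s_m} \, g\Big(\sum_j W^{(m)}_{s_{m+1} j} a^{(m-1)}_j(x)\Big) \, \partial_l a^{(m-1)}_{\hat{s}_m}(x) \cdots W^{(1)}_{s_2 k} \, \sigma\Big(\sum_j W^{(1)}_{s_2 j} x_j\Big) \Big\}.$$

We now restrict to the case for which the index $i$ only takes a single value and use $|\sigma(\cdot)| \le 1$ and $|g(\cdot)| \le 1$. The Hessian $H_{lk} = \partial_l \partial_k a^{(L)}$ is then bounded by

$$\|H\|_F \le \beta \, \tilde{C}, \tag{17}$$

where the constant is given by

$$\tilde{C} = \sum_m \left\| W^{(L)} \right\|_F \left\| W^{(L-1)} \right\|_F \cdots \left\| W^{(m)} \right\|_F^2 \cdots \left\| W^{(1)} \right\|_F^2. \tag{18}$$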

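Step 2: Let $e_1, \dots, e_{d-1}$ be a basis of the tangent space $T_pS$. Then the second fundamental form $\mathbb{L}$ for the hypersurface $f(p) = c$ at point $p$ is given by

$$\mathbb{L}(e_i, e_j) = -\langle D_{e_i} n(p), e_j \rangle \tag{19}$$

$$= -\Big\langle D_{e_i} \frac{\nabla f(p)}{\|\nabla f(p)\|}, e_j \Big\rangle = -\frac{1}{\|\nabla f(p)\|} \langle H[f] \, e_i, e_j \rangle + (\dots) \, \langle \nabla f(p), e_j \rangle. \tag{20}$$

We now use the fact that $\langle \nabla f(p), e_j \rangle = 0$, i.e. the gradient of $f$ is normal to the tangent space. This property was explained in the main text. This allows us to deduce that

$$\mathbb{L}(e_i, e_j) = -\frac{1}{\|\nabla f(p)\|} H[f]_{ij}. \tag{21}$$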

Step 3: The Frobenius norm of the second fundamental form (considered as a matrix in the sense of step 2) can be written as

$$\|\mathbb{L}\|_F = \sqrt{\lambda_1^2 + \dots + \lambda_{d-1}^2}, \tag{22}$$

where $\lambda_1, \dots, \lambda_{d-1}$ are the principal curvatures. This property follows from the fact that the second fundamental form is symmetric and can therefore be diagonalized with real eigenvalues, i.e. the principal curvatures. Using the fact that the derivative of the network is bounded from below, $\|\nabla f(p)\| \ge c$, we obtain

$$|\lambda_{\max}| \le \frac{\beta \, \tilde{C}}{c}. \tag{23}$$
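Steps 2 and 3 can be illustrated numerically: given the gradient and Hessian of a function at a point of its level set, the principal curvatures are the eigenvalues of the Hessian restricted to the tangent space and divided by the gradient norm. The sketch below is an illustration under assumptions, using the toy function $f(x) = \|x\|^2$ (level sets are spheres) and an orthonormal tangent basis obtained from an SVD; the function name is hypothetical.

```python
import numpy as np

def principal_curvatures(grad, hess):
    """Eigenvalues of the second fundamental form -E^T H E / ||grad||, cf. (21)."""
    n = grad / np.linalg.norm(grad)
    # Rows 1..d-1 of vt form an orthonormal basis of the tangent space (orthogonal to n).
    _, _, vt = np.linalg.svd(n[None, :])
    E = vt[1:].T
    second_form = -(E.T @ hess @ E) / np.linalg.norm(grad)
    return np.linalg.eigvalsh(second_form), second_form

r = 2.0
p = r * np.array([1.0, 2.0, 2.0]) / 3.0      # point on the sphere of radius r
grad, hess = 2.0 * p, 2.0 * np.eye(3)        # gradient and Hessian of f(x) = ||x||^2
curvatures, second_form = principal_curvatures(grad, hess)
```

For the sphere all principal curvatures have magnitude $1/r$, the Frobenius norm of the second fundamental form reproduces (22), and the largest curvature is bounded by the Frobenius norm of the Hessian divided by the gradient norm, which is the mechanism behind (23).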

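Step 4: For a point $p_0$ in the neighbourhood of $p$ on $S$ considered in the theorem, we choose a curve $\gamma$ on $S$ with $\gamma(t_0) = p_0$ and $\gamma(t) = p$. Furthermore, we use the notation $u(t) = \dot{\gamma}(t)$. It then follows that

$$n(p) - n(p_0) = \int_{t_0}^{t} \frac{\mathrm{d}}{\mathrm{d}t} \big( n(\gamma(t)) \big) \, \mathrm{d}t = \int_{t_0}^{t} D_{u(t)} n(\gamma(t)) \, \mathrm{d}t. \tag{24}$$

Using the fact that $D_{u} n$ lies in the tangent space and choosing an orthonormal basis $e_j(t)$ for the tangent spaces, we obtain (up to an overall sign fixed by the convention (19), which does not affect the norm estimate below)

$$\int_{t_0}^{t} D_{u(t)} n(\gamma(t)) \, \mathrm{d}t = \int_{t_0}^{t} \sum_j \langle e_j(t), D_{u(t)} n(\gamma(t)) \rangle \, e_j(t) \, \mathrm{d}t \tag{25}$$

$$= \int_{t_0}^{t} \sum_j \mathbb{L}(e_j(t), u(t)) \, e_j(t) \, \mathrm{d}t. \tag{26}$$

The second fundamental form is bilinear and therefore

$$\int_{t_0}^{t} \sum_j \mathbb{L}(e_j(t), u(t)) \, e_j(t) \, \mathrm{d}t = \int_{t_0}^{t} \sum_{i,j} \mathbb{L}(e_j(t), e_i(t)) \, u_i(t) \, e_j(t) \, \mathrm{d}t. \tag{27}$$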

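We now use the notation $\mathbb{L}_{ij}(t) = \mathbb{L}(e_i(t), e_j(t))$ and choose its eigenbasis for $e_i(t)$. We then obtain for the difference in the unit normals:

$$n(p) - n(p_0) = \int_{t_0}^{t} \sum_i \lambda_i(t) \, u_i(t) \, e_i(t) \, \mathrm{d}t, \tag{28}$$

where $\lambda_i(t)$ denote the principal curvatures at $\gamma(t)$. By orthonormality of the eigenbasis, it can be easily checked that

$$\Big\langle \sum_i \lambda_i(t) u_i(t) e_i(t), \sum_j \lambda_j(t) u_j(t) e_j(t) \Big\rangle \le |\lambda_{\max}|^2 \sum_i u_i(t)^2 \quad \Rightarrow \quad \Big\| \sum_i \lambda_i(t) u_i(t) e_i(t) \Big\| \le |\lambda_{\max}| \, \|u(t)\|.$$

Using this relation and the triangle inequality, we then obtain by taking the norm on both sides of (28):

$$\|n(p) - n(p_0)\| \le |\lambda_{\max}| \int_{t_0}^{t} \|\dot{\gamma}(t)\| \, \mathrm{d}t. \tag{29}$$

This inequality holds for any curve connecting $p$ and $p_0$, but the tightest bound follows by choosing $\gamma$ to be the shortest possible path on $S$ within this neighbourhood, with length $d_g(p, p_0)$, i.e. the geodesic distance on $S$. The second inequality of the theorem is obtained by the upper bound on the largest principal curvature derived above, i.e. (23).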

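### E.2 Theorem 2

###### Theorem 2

For one-layer neural networks $g(x) = \mathrm{relu}(w^T x)$ and $g_\beta(x) = \mathrm{softplus}_\beta(w^T x)$, it holds that

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[\nabla g(x - \epsilon)\right] = \nabla g_{\beta / \|w\|}(x), \tag{30}$$

where $p_\beta(\epsilon) = \prod_i \frac{\beta}{\left(e^{\beta \epsilon_i / 2} + e^{-\beta \epsilon_i / 2}\right)^2}$.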

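Proof: We first show that

$$\mathrm{softplus}_\beta(x) = \mathbb{E}_{\epsilon \sim p_\beta}\left[\mathrm{relu}(x - \epsilon)\right] \tag{31}$$

for a scalar input $x$. This follows by defining $p(\epsilon)$ implicitly as

$$\mathrm{softplus}_\beta(x) = \int_{-\infty}^{+\infty} p(\epsilon) \, \mathrm{relu}(x - \epsilon) \, \mathrm{d}\epsilon. \tag{32}$$

Differentiating both sides of this equation with respect to $x$ results in

$$\sigma_\beta(x) = \int_{-\infty}^{+\infty} p(\epsilon) \, \Theta(x - \epsilon) \, \mathrm{d}\epsilon = \int_{-\infty}^{x} p(\epsilon) \, \mathrm{d}\epsilon, \tag{33}$$

where $\Theta$ is the Heaviside step function and $\sigma_\beta(x) = \frac{1}{1 + e^{-\beta x}}$. Differentiating both sides with respect to $x$ again results in

$$p_\beta(x) = p(x). \tag{34}$$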

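Therefore, (31) holds. For a vector input $\vec{x}$, we define the distribution of its perturbation $\vec{\epsilon}$ by

$$p_\beta(\vec{\epsilon}) = \prod_i p_\beta(\epsilon_i), \tag{35}$$

where $\epsilon_i$ denotes the components of $\vec{\epsilon}$. We will suppress any arrows denoting vector-valued variables in the following in order to ease notation. We choose an orthogonal basis such that

$$\epsilon = \epsilon_p \hat{w} + \sum_i \epsilon^{(i)}_o \hat{w}^{(i)}_o \quad \text{with } \hat{w} \cdot \hat{w}^{(i)}_o = 0 \text{ and } w = \|w\| \, \hat{w}. \tag{36}$$

This allows us to rewrite

$$\mathbb{E}_{\epsilon \sim p_\beta}\left[\mathrm{relu}(w^T (x - \epsilon))\right] = \mathbb{E}_{\epsilon \sim p_\beta}\left[\mathrm{relu}(w^T x - \|w\| \epsilon_p)\right] = \int p_\beta(\epsilon_p) \, \mathrm{relu}(w^T x - \|w\| \epsilon_p) \, \mathrm{d}\epsilon_p.$$

By changing the integration variable to $\tilde{\epsilon} = \|w\| \epsilon_p$ and using (31), we obtain

$$\mathrm{softplus}_{\beta / \|w\|}(w^T x) = \mathbb{E}_{\epsilon \sim p_\beta}\left[\mathrm{relu}(w^T (x - \epsilon))\right]. \tag{37}$$

The theorem then follows by differentiating both sides of the equation with respect to $x$.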