Despite impressive results in a variety of classification tasks (LeCun et al., 2015), even highly accurate neural network classifiers are plagued by a vulnerability to so-called adversarial perturbations (Szegedy et al., 2014). These adversarial perturbations are small, often visually imperceptible perturbations to the network’s input, which nevertheless change the network’s classification decision. Such vulnerabilities may pose a threat to real-world deployments of automated recognition systems, especially in security-critical applications such as autonomous driving or banking. This has sparked a large number of publications on both the creation of adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016) and defenses against them (see Schott et al. (2018) for an overview). Apart from the application-focused viewpoint, the observed adversarial vulnerability offers non-obvious insights into the inner workings of neural networks. One particular method of defense is adversarial training (Madry et al., 2018), which aims to minimize a modified training objective. While this method, like all known approaches of defense, decreases the accuracy of the classifier, it is also successful in increasing the robustness to adversarial attacks, i.e. the perturbations need to be larger on average in order to change the classification decision.
Tsipras et al. (2019) also notice that networks that are robustified in this way show interesting phenomena, which so far could not be explained. Neural networks usually exhibit very unstructured saliency maps (gradients of a classifier score with respect to the network’s input (Simonyan et al., 2013)) which barely relate to the input image. On the other hand, saliency maps of robustified classifiers tend to be far more interpretable, in that structures in the input image also emerge in the corresponding saliency map, as exemplified in Figure 1. Tsipras et al. (2019) describe this as an ‘unexpected benefit’ of adversarial robustness. In order to obtain a semantically meaningful visualization of the network’s classification decision in non-robustified networks, the saliency map has to be aggregated over many different points in the vicinity of the input image. This can be achieved either via averaging saliency maps of noisy versions of the image (Smilkov et al., 2017) or by integrating along a path (Sundararajan et al., 2017). Other approaches typically employ modified backpropagation schemes in order to highlight the discriminative portions of the image. Examples of this include guided backpropagation (Springenberg et al., 2015) and deep Taylor decomposition (Montavon et al., 2017).
In this paper, we show that the interpretability of the saliency maps of a robustified neural network is not only a side-effect of adversarial training, but a general property enjoyed by networks with a high degree of robustness to adversarial perturbations. We first demonstrate this principle for the case of a linear, binary classifier and show that the ’interpretability’ is due to the image vector and the respective image gradient aligning. For the more general, non-linear case we empirically show that while this relationship is true on average, the linear theory and the non-linear reality do not always agree. We empirically demonstrate that the more linear the model is, the stronger the connection between robustness and alignment becomes.
2 Adversarial Robustness and Saliency Maps
Since adversarial perturbations are small perturbations that change the predicted class of a neural network, it makes sense to define the robustness towards adversarial perturbations via the distance of the unperturbed image to its nearest perturbed image, such that the classification is changed.
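Definition 1.
Let $F: X \to C$ (with $|C|$ finite) be a classifier over the normed vector space $(X, \|\cdot\|)$. We call
$$\rho(x) = \inf_{e \in X} \{ \|e\| : F(x + e) \neq F(x) \} \tag{1}$$
the (adversarial) robustness of $F$ in the point $x$. We call $\mathbb{E}_{x \sim \mathcal{D}}[\rho(x)]$ the (adversarial) robustness of $F$ over the distribution $\mathcal{D}$.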
Put differently, the robustness of a classifier in a point is nothing but the distance to its closest decision boundary. Margin classifiers like support vector machines (Cortes & Vapnik, 1995) seek to keep this distance large for the training set, usually in order to avoid overfitting. Sokolić et al. (2017) and Elsayed et al. (2018) also apply this principle to neural networks via regularization schemes. We point out that our definition of adversarial robustness does not depend on the ground truth class label and, given feasible computability, can approximately be calculated even on unlabelled data.
In the following, we will always assume $X$ to be a real, finite-dimensional vector space with the Euclidean norm. The proofs for the following theoretical statements are found in the appendix.
2.1 A Motivating Toy Example
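We consider the toy case of a linear binary classifier $F(x) = \operatorname{sgn}(\Psi(x))$ with the so-called score function $\Psi(x) = \langle x, z \rangle$ and fixed $z \in X$, where $\langle \cdot, \cdot \rangle$ denotes the standard inner product on $X$. A straightforward calculation (see appendix) shows that the adversarial robustness of $F$ is given by
$$\rho(x) = \frac{|\langle x, z \rangle|}{\|z\|} = \frac{|\langle x, \nabla\Psi(x) \rangle|}{\|\nabla\Psi(x)\|}. \tag{2}$$
Unless stated otherwise, we will always denote with $\nabla$ the gradient with respect to $x$. Note that $\rho(x) = \|x\| \cdot |\cos(\delta)|$, where $\delta$ is the angle between the vectors $x$ and $z$. This implies that, for fixed $\|x\|$, $\rho(x)$ grows with the alignment of $x$ and $z$ and is maximized if and only if $x$ and $z$ are collinear.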
This motivates the following definition.
Definition 2 (Alignment).
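Let the binary classifier $F: X \to \{-1, 1\}$ be defined a.e. by $F(x) = \operatorname{sgn}(\Psi(x))$, where $\Psi: X \to \mathbb{R}$ is differentiable in $x$. We then call $\nabla\Psi(x)$ the saliency map of $F$ with respect to $\Psi$ in $x$ and
$$\alpha(x) := \frac{|\langle x, \nabla\Psi(x) \rangle|}{\|\nabla\Psi(x)\|}$$
the alignment with respect to $\Psi$ in $x$.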
The alignment is a measure of how similar the input image and the saliency map are. If $\|x\| = 1$, and $x$ and $\nabla\Psi(x)$ are zero-centered, this coincides with the absolute value of their Pearson correlation. For a linear binary classifier, the alignment trivially increases with the robustness of the classifier.
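The two statements above can be checked numerically. The following NumPy sketch assumes the alignment is the absolute inner product of the input with the normalized saliency map; the vectors `x` and `z`, their dimension and the helper name `alignment` are illustrative choices and not taken from our experiments.

```python
import numpy as np

def alignment(x, grad):
    """Alignment |<x, grad>| / ||grad|| of an input with a saliency map."""
    return abs(np.dot(x, grad)) / np.linalg.norm(grad)

rng = np.random.default_rng(0)
z = rng.normal(size=50)   # weights of a hypothetical linear score <x, z>
x = rng.normal(size=50)   # hypothetical input

# For a linear score function the saliency map equals z at every point.
alpha = alignment(x, z)

# Euclidean distance of x to the decision boundary {v : <v, z> = 0}.
distance = abs(np.dot(x, z / np.linalg.norm(z)))

# Pearson check: zero-centre both vectors and scale the input to unit norm.
xc = x - x.mean()
xc = xc / np.linalg.norm(xc)
zc = z - z.mean()
pearson = np.corrcoef(xc, zc)[0, 1]
alpha_centered = alignment(xc, zc)

print(alpha, distance)
print(alpha_centered, abs(pearson))
```

In this linear setting both printed pairs should agree up to floating-point error: the alignment equals the distance to the hyperplane, and for a zero-centered unit-norm input it equals the absolute Pearson correlation.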
Generalizing from the linear to the affine case leads to a classifier of the form $F(x) = \operatorname{sgn}(\langle x, z \rangle + b)$ with $b \in \mathbb{R}$, whose robustness in $x$ is
$$\rho(x) = \frac{|\langle x, z \rangle + b|}{\|z\|}. \tag{3}$$
In this case the robustness and alignment do not coincide anymore. In order to connect these two diverging concepts, we offer two alternative viewpoints. On the one hand, we can trivially bound the robustness via the triangle inequality
$$\rho(x) \le \alpha(x) + \frac{|b|}{\|z\|}. \tag{4}$$
This is particularly meaningful if $|b| / \|z\|$ is small in comparison to $\alpha(x)$. Alternatively, one can connect the robustness to the alignment at a different point $\tilde{x} = x + \frac{b}{\|z\|^2} z$, leading to the relation
$$\rho(x) = \frac{|\langle \tilde{x}, z \rangle|}{\|z\|} = \alpha(\tilde{x}). \tag{5}$$
In the affine case this approach simply amounts to a shift of the data that is uniform over all data points $x$. We will see how these two viewpoints lead to different bounds in the non-linear case later.
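Both viewpoints for the affine case can be illustrated with a few lines of NumPy. This is a minimal sketch under assumptions: the weight vector `z`, bias `b` and input `x` are made up, and the shifted point is taken to be $x + b z / \|z\|^2$, which is one choice that absorbs the bias into the inner product.

```python
import numpy as np

def alignment(x, grad):
    """Alignment |<x, grad>| / ||grad|| of an input with a saliency map."""
    return abs(np.dot(x, grad)) / np.linalg.norm(grad)

rng = np.random.default_rng(1)
z = rng.normal(size=20)   # hypothetical weight vector
b = 0.7                   # hypothetical bias
x = rng.normal(size=20)   # hypothetical input

# Robustness of sgn(<x, z> + b): distance of x to the affine hyperplane.
rho = abs(np.dot(x, z) + b) / np.linalg.norm(z)

# Viewpoint 1: triangle-inequality bound via the alignment at x.
bound = alignment(x, z) + abs(b) / np.linalg.norm(z)

# Viewpoint 2: alignment at the shifted point reproduces rho exactly.
x_shifted = x + b * z / np.dot(z, z)
rho_via_shift = alignment(x_shifted, z)

print(rho, bound, rho_via_shift)
```

The first viewpoint gives an inequality whose slack depends on the size of the bias relative to $\|z\|$, whereas the second is an equality at the cost of evaluating the alignment at a different point.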
2.2 The General Case
We now consider the general, $n$-class case.
Definition 3 (Alignment, Multi-Class Case).
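Let $\Psi = (\Psi^1, \dots, \Psi^n): X \to \mathbb{R}^n$ be differentiable in $x$. Then for an $n$-class classifier defined a.e. by
$$F(x) = \operatorname{arg\,max}_{i} \Psi^i(x), \tag{6}$$
we call $\nabla\Psi^{i^*}(x)$, with $i^* = F(x)$, the saliency map of $F$. We further call
$$\alpha(x) := \frac{|\langle x, \nabla\Psi^{i^*}(x) \rangle|}{\|\nabla\Psi^{i^*}(x)\|} \tag{7}$$
the alignment with respect to $\Psi$ in $x$.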
2.2.1 Linearized Robustness
In general the distance to the decision boundary $\rho(x)$ can be infeasible to compute. However, for classifiers built on locally affine score functions, such as most neural networks using ReLU or leaky ReLU activations, $\rho(x)$ can easily be computed, provided the locally affine region is sufficiently large. To quantify this, define the radius of the locally affine component of $\Psi$ around $x$ as
$$r(x) = \sup \{ r \ge 0 : \Psi \text{ is affine on } B_r(x) \},$$
where $B_r(x)$ is the open ball of radius $r$ around $x$ with respect to the Euclidean metric.
Lemma 1.
Let $F$ be a classifier with locally affine score function $\Psi$. Assume $\rho(x) \le r(x)$. Then
$$\rho(x) = \min_{j \ne i^*} \frac{\Psi^{i^*}(x) - \Psi^j(x)}{\|\nabla\Psi^{i^*}(x) - \nabla\Psi^j(x)\|}$$
for $i^*$ the predicted class at $x$.
Note that while nearly all state-of-the-art classification networks are piecewise affine, the condition $\rho(x) \le r(x)$ is typically violated in practice. However, the lemma can still hold approximately as long as the linear approximation to the network’s score functions is sufficiently good in the relevant neighbourhood of $x$. This motivates the definition of the linearized (adversarial) robustness $\tilde{\rho}$.
Definition 4 (Linearized Robustness).
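Let $\Psi(x)$ be the differentiable score vector for the classifier $F$ in $x$. We call
$$\tilde{\rho}(x) := \min_{j \ne i^*} \frac{\Psi^{i^*}(x) - \Psi^j(x)}{\|\nabla\Psi^{i^*}(x) - \nabla\Psi^j(x)\|} \tag{9}$$
the linearized robustness in $x$, where $i^* = F(x)$ is the predicted class at point $x$.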
We later show that the two notions lead to very similar results, even if the condition $\rho(x) \le r(x)$ is violated.
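To make the linearized robustness concrete, the following sketch evaluates it from a score vector and the corresponding Jacobian. It assumes the definition as the smallest ratio of score margin to gradient-difference norm over all competing classes; the function name `linearized_robustness` and the small linear example are hypothetical and only serve as illustration.

```python
import numpy as np

def linearized_robustness(scores, jacobian):
    """Linearized robustness from scores (n,) and their input gradients (n, d).

    Returns the value together with the predicted class i_star and the
    competing class j_star that attains the minimum.
    """
    i_star = int(np.argmax(scores))
    candidates = [j for j in range(len(scores)) if j != i_star]
    ratios = [
        (scores[i_star] - scores[j])
        / np.linalg.norm(jacobian[i_star] - jacobian[j])
        for j in candidates
    ]
    k = int(np.argmin(ratios))
    return ratios[k], i_star, candidates[k]

# Hypothetical linear 3-class model: scores W @ x, Jacobian W.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
x = np.array([2.0, 1.0])
rho_lin, i_star, j_star = linearized_robustness(W @ x, W)
print(rho_lin, i_star, j_star)
```

For a neural network, `scores` and `jacobian` would come from a forward pass and one gradient computation per class, which is where the cost of this quantity lies.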
2.2.2 Reducing the Multi-Class Case
In this section, we introduce a toolset which helps bridge the gap between the alignment and the linearized robustness of a multi-class classifier. In the following, for fixed $x$, let $i^* = F(x)$ and $j^*$ be the minimizer in (9). We can assign $F$ in $x$ a binarized classifier $F^\dagger_x$ with
$$F^\dagger_x(y) = \operatorname{sgn}(\Psi^\dagger_x(y)),$$
where $\Psi^\dagger_x := \Psi^{i^*} - \Psi^{j^*}$. Its linearized robustness in $x$ is the same as for $F$. The binarized saliency map $\nabla\Psi^\dagger_x(x)$ and the respective alignment,
$$\alpha^\dagger(x) := \frac{|\langle x, \nabla\Psi^\dagger_x(x) \rangle|}{\|\nabla\Psi^\dagger_x(x)\|},$$
which we call binarized alignment, offer an alternative, natural perspective of the above considerations. This is because for classifiers as defined in (6), the actual score values do not necessarily carry any information about the classification decision, whereas the score differences do. While, roughly speaking, $\nabla\Psi^{i^*}(x)$ tells us what $F$ ‘thinks’ makes $x$ a member of its predicted class, $\nabla\Psi^\dagger_x(x)$ carries information about what sets $x$ apart from its closest neighboring class (according to linearization).
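In the special case of a linear, multi-class classifier, we have
$$\tilde{\rho}(x) = \rho(x) = \alpha^\dagger(x),$$
and in the linear, binary case $\Psi^2 = -\Psi^1$, even
$$\tilde{\rho}(x) = \rho(x) = \alpha^\dagger(x) = \alpha(x).$$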
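The coincidence of linearized robustness and binarized alignment in the linear multi-class case can be verified on a random example. The sketch below assumes the binarized score is the difference between the predicted class score and the score of the closest competing class; the weight matrix `W` and input `x` are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(5, 30))   # hypothetical linear 5-class model, scores W @ x
x = rng.normal(size=30)        # hypothetical input

scores = W @ x
i_star = int(np.argmax(scores))

# Linearized robustness: smallest margin-to-gradient-norm ratio over j != i*.
candidates = [j for j in range(len(scores)) if j != i_star]
ratios = [
    (scores[i_star] - scores[j]) / np.linalg.norm(W[i_star] - W[j])
    for j in candidates
]
j_star = candidates[int(np.argmin(ratios))]
rho_lin = min(ratios)

# Binarized saliency map: gradient of the score difference between i* and j*.
g_bin = W[i_star] - W[j_star]
alpha_bin = abs(np.dot(x, g_bin)) / np.linalg.norm(g_bin)

print(rho_lin, alpha_bin)
```

Both printed values agree because, for a linear model, the score difference equals the inner product of the input with the gradient difference.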
3 Decompositions and Bounds for Neural Networks
3.1 Homogeneous Decomposition
In the previous section we have seen that in the case of binary classifiers, the robustness and binarized alignment coincide for linear score functions. However, requiring $\Psi$ to be linear is a stronger assumption than necessary to deduce the result: It is in fact sufficient for $\Psi$ to be positive one-homogeneous. Any such function satisfies $\Psi(ax) = a\Psi(x)$ for all $x \in X$ and $a > 0$.
Lemma 2 (Linearized Robustness of Homogeneous Classifiers).
Consider a classifier $F$ with positive one-homogeneous score functions. Then
$$\tilde{\rho}(x) = \alpha^\dagger(x).$$
In particular, most feedforward neural networks with (leaky) ReLU activations without biases are positive one-homogeneous. This observation motivates splitting up any classifier built on neural networks into a homogeneous term and the corresponding remainder, leading to the following decomposition result.
Theorem 1 (Homogeneous Decomposition of Neural Networks).
Let $\Psi^i$ be any logit of a neural network with ReLU activations (of class $\mathcal{N}$ in the appendix). Denote by $W$ the linear filters and by $b$ the bias terms of the network. Then
$$\Psi^i(x) = \langle x, \nabla_x \Psi^i(x) \rangle + \langle b, \nabla_b \Psi^i(x) \rangle.$$
Note that the above vector $b$ includes the running averages of the means for batch normalization. For ReLU networks, the remainder term $\langle b, \nabla_b \Psi^i(x) \rangle$ is locally constant, because it changes only when $x$ enters another locally linear region. For ease of notation, we will from now on write $\nabla$ for $\nabla_x$ and $\Psi^\dagger$ for $\Psi^\dagger_x$.
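The decomposition into a homogeneous part and a bias-induced remainder can be verified by hand on a small network. The sketch below uses a hypothetical two-layer ReLU network with a scalar logit and closed-form gradients; the layer sizes and random parameters are assumptions, and the identity being checked is that the logit equals the inner product of the input with its input gradient plus the inner product of the biases with their bias gradients.

```python
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # hidden layer
w2, b2 = rng.normal(size=16), rng.normal()                # scalar output layer
x = rng.normal(size=8)                                    # hypothetical input

pre = W1 @ x + b1
mask = (pre > 0).astype(float)          # active ReLU units at x
logit = w2 @ np.maximum(pre, 0.0) + b2

# Closed-form gradients of the logit within the current linear region.
grad_x = W1.T @ (w2 * mask)
grad_b1 = w2 * mask
grad_b2 = 1.0

homogeneous_part = x @ grad_x                   # <x, grad_x logit>
remainder = b1 @ grad_b1 + b2 * grad_b2         # <b, grad_b logit>

print(logit, homogeneous_part + remainder)
```

The remainder only depends on which units are active, so it stays constant as long as the input does not leave its current linear region.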
3.2 Pointwise Bounds
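In section 2.1, we introduced two different viewpoints for affine, binary classifiers which connect the robustness to the alignment. In a similar vein to inequality (4) and equality (5), upper bounds to the linearized robustness depending on the alignment can be given for neural networks. In the following, we will write $\bar{v}$ for $v / \|v\|$. Again, in the following we fix $x$ and write $i^*$ for the predicted class and $j^*$ for the minimizer in $\tilde{\rho}(x)$ from equation (9), so that $\Psi^\dagger = \Psi^{i^*} - \Psi^{j^*}$.
Theorem 2.
Let $g := \nabla\Psi^{i^*}(x)$. Furthermore, let $g^\dagger := \nabla\Psi^\dagger(x)$ and $\beta^\dagger := \langle b, \nabla_b \Psi^\dagger(x) \rangle$. Then
$$\tilde{\rho}(x) \le \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|} \le \alpha(x) + \|x\| \cdot \|\bar{g}^\dagger - \bar{g}\| + \frac{|\beta^\dagger|}{\|g^\dagger\|}.$$
Distances on the unit sphere (such as $\|\bar{g}^\dagger - \bar{g}\|$) can be converted to angles through the law of cosines. For the above inequalities to be reasonably tight, the angle between $g$ and $g^\dagger$ needs to be small and $|\beta^\dagger| / \|g^\dagger\|$ needs to be small in comparison to $\alpha^\dagger(x)$. In this case, the alignment should roughly increase with the linearized robustness.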
Let and , with and defined as in the previous theorem. Then
and if additionally then
Depending on the sign of , the shifted image can either be understood as a gradient ascent or descent iterate for maximizing/minimizing . This theorem assimilates into , providing an upper bound to that depends on . The sensibility of this hinges on being reasonably close to and having a low angle with .
Nevertheless, the right-hand side may be much larger than $\tilde{\rho}(x)$ if an image and its respective saliency map are almost orthogonal. This is because the Cauchy-Schwarz inequality (see the proofs in the appendix) provides a large upper bound in this case. The inequalities rather serve as an explanation of how the various terms of alignment may deviate from the linearized robustness in the case of a neural network.
3.3 Alignment and Interpretability
The above considerations demonstrate how an increase in robustness may induce an increase in the alignment between an input image and its respective saliency map. The initial observation – which was previously described as an increase in interpretability – may thus be ascribed to this phenomenon. This is especially true in the case of natural images, as exemplified in Figure 1. There, what a human observer would deem an increase in interpretability, expresses itself as discriminative portions of the original image reappearing in the saliency map, which naturally implies a stronger alignment. The concepts of alignment and interpretability should however not be conflated completely: In the case of quasi-binary image data like MNIST, 0-regions of the image render the inner product in equation (7) invariant with respect to the saliency map in this region, even if the saliency map e.g. assigns relevance to the absence of a feature in this region. Note however that the saliency map in this region still influences the alignment term through the division by its norm. Additionally, the alignment is also not invariant to the images’ representation (color space, shifts, normalization etc.). Still, for most types of image data an increase in alignment in discriminative regions should coincide with an increase in interpretability.
In order to validate our hypothesis, we trained several models of different adversarial robustness on both MNIST (LeCun et al., 1990) and ImageNet (Deng et al., 2009) using double backpropagation (Drucker & Le Cun, 1992). For a neural network with a softmax output layer, this amounts to minimizing the modified loss
$$\sum_{(x, y) \in T} \Big( L(x, y; \theta) + \lambda \, \|\nabla_x L(x, y; \theta)\|^2 \Big)$$
over the parameters $\theta$. Here, $T$ is the training set and $L$ denotes the negative log-likelihood error function. The hyperparameter $\lambda \ge 0$ determines the strength of the regularization. Note that this penalizes the local Lipschitz constant of the loss. As Simon-Gabriel et al. (2018) demonstrate, double backpropagation makes neural networks more resilient to adversarial attacks. By varying $\lambda$, we can easily create models of different adversarial robustness for the same dataset, whose properties we can then compare. Anil et al. (2018) previously noted that Lipschitz constrained networks exhibit interpretable saliency maps (without an explanation), which can be regarded as a side-effect of the increase in adversarial robustness.
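The structure of such a regularized loss can be sketched for a model small enough to differentiate by hand. The example below is not our training code: it assumes a linear softmax classifier and a penalty given by the squared Euclidean norm of the input gradient of the negative log-likelihood, and all names (`nll`, `nll_input_gradient`, `double_backprop_loss`) and parameter values are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # shift for numerical stability
    return e / e.sum()

def nll(W, b, x, y):
    """Negative log-likelihood of class y under a linear softmax model."""
    return -np.log(softmax(W @ x + b)[y])

def nll_input_gradient(W, b, x, y):
    """Closed-form gradient of the NLL with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                 # softmax output minus one-hot target
    return W.T @ p

def double_backprop_loss(W, b, x, y, lam):
    """NLL plus lam times the squared norm of its input gradient."""
    g = nll_input_gradient(W, b, x, y)
    return nll(W, b, x, y) + lam * np.dot(g, g)

rng = np.random.default_rng(4)
W, b = rng.normal(size=(3, 6)), rng.normal(size=3)   # hypothetical 3-class model
x, y = rng.normal(size=6), 1                          # hypothetical sample

plain = double_backprop_loss(W, b, x, y, lam=0.0)
regularized = double_backprop_loss(W, b, x, y, lam=10.0)
print(plain, regularized)
```

For a deep network the input gradient would be obtained by automatic differentiation, and minimizing the penalty with respect to the parameters then requires differentiating through that gradient, which is what gives double backpropagation its name.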
For the MNIST experiments, we trained each of our 16 models on an NVIDIA 1080Ti GPU with a batch size of 100 for 200 epochs, covering the regularization hyperparameter range from 10 to 180,000, beyond which the models start to degenerate. The architecture used is given in the appendix.
For the experiments on ImageNet, we fine-tuned the pre-trained ResNet50 model from He et al. (2016) over 35 epochs on 2 NVIDIA P100 GPUs with a total batch size of 32. We used stochastic gradient descent with a learning rate of 0.0001 and momentum of 0.99. The learning rate was divided by 10 whenever the error stopped improving. For the regularization parameter, we chose . The experiments were implemented in TensorFlow (Abadi et al., 2015).
4.1 Robustness and Alignment
To check the relation between the alignment and robustness of a neural network, we created 1000 adversarial examples per model on the respective validation set. This was realized using the Python library Foolbox (Rauber et al., 2017), which offers pre-defined adversarial attacks, three of which we used in this paper: GradientAttack performs a line search for the closest adversarial example along the direction of the loss gradient. L2BasicIterativeAttack implements the projected gradient descent attack from Kurakin et al. (2016) for the Euclidean metric. Similarly, CarliniWagnerL2Attack (CW-attack) is the attack introduced in Carlini & Wagner (2017), suited for finding the closest adversarial example in the Euclidean metric. Additionally, we calculated the linearized robustness $\tilde{\rho}$, which entails calculating $n$ gradients per image for an $n$-class problem.
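The idea behind a gradient line search can be sketched independently of any library. The code below is not Foolbox's implementation; it is a simplified grid search along a fixed direction, applied to a hypothetical affine binary classifier for which the exact robustness is known in closed form. The function name `line_search_attack` and its parameters `max_norm` and `steps` are assumptions made for this illustration.

```python
import numpy as np

def line_search_attack(predict, x, direction, max_norm=10.0, steps=1000):
    """Smallest grid step t for which predict(x + t * u) differs from predict(x).

    u is the normalized search direction. Returns None if no step up to
    max_norm changes the prediction.
    """
    label = predict(x)
    u = direction / np.linalg.norm(direction)
    for t in np.linspace(0.0, max_norm, steps + 1)[1:]:
        if predict(x + t * u) != label:
            return float(t)
    return None

# Hypothetical affine binary classifier sgn(<v, z> + b).
z, b = np.array([3.0, 4.0]), 1.0

def predict(v):
    return int(np.sign(v @ z + b))

x = np.array([2.0, 1.0])
# For a positively classified point the score decreases fastest along -z.
t_found = line_search_attack(predict, x, direction=-z)
rho_exact = abs(x @ z + b) / np.linalg.norm(z)   # closed-form robustness

print(t_found, rho_exact)
```

On an affine classifier the search direction is optimal, so the step found matches the exact robustness up to the grid resolution. On a neural network the same search only yields an upper estimate of the robustness, which is one reason to compare several attacks.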
In Figures 2 and 3, we investigate how the median alignment depends on the medians of the different conceptions of robustness. We opted in favor of the median instead of the arithmetic mean due to its increased robustness to outliers, which occurred especially when using the gradient attack. In the case of ImageNet (Figure 2), an increase in median alignment with the median robustness is clearly visible for all three estimates of the robustness. On the other hand, the alignment for the MNIST data increases with the robustness as well, but seems to saturate at some point. We will offer an explanation for this phenomenon later.
We now consider the pointwise connection between robustness and alignment. In Figure 4 the two variables are highly correlated for a model trained on MNIST, pointing towards the fact that the network behaves very similarly to a positive one-homogeneous function. There is however no visible correlation between them on the ImageNet model, which is a consistent behavior throughout the whole experiment cohort. We will later analyse the source of this behavior. The increase in median alignment for ImageNet can still be explained by a statistical argument: Write $\bar{g}$ for the normalized saliency map, so that the alignment is $|\langle x, \bar{g} \rangle|$. If the median of $\langle x, \bar{g} \rangle$ is zero, as is approximately true in our ImageNet model, then the median alignment is the median absolute deviation of $\langle x, \bar{g} \rangle$. In other words, the graph for ImageNet in Figure 4 depicts the dispersion of $\langle x, \bar{g} \rangle$. The above observations also hold well for the binarized alignment.
In Figure 5 a tight correlation between the linearized robustness $\tilde{\rho}$ and the robustness $\rho$ becomes evident. Here, the latter has been calculated using the CW-attack. The linearized robustness $\tilde{\rho}$ is hence an adequate approximation of the actual robustness $\rho$, even for the highly non-linear neural network models used on ImageNet. Finally note that all attacks used lead to the same general behavior of all quantities investigated (see Figures 2 and 3).
4.2 Explaining the Observations
In the last section, we observed some commonalities between the experiments on ImageNet and MNIST, but also some very different behaviors. In particular, two aspects stand out: Why does the median alignment steadily increase for the observed ImageNet experiments, whereas on MNIST it stagnates at some point (Figures 2 and 3)? Furthermore, why are robustness and alignment so highly correlated on MNIST but almost uncorrelated on ImageNet (Figure 4)? We turn to Theorems 2 and 3 for answers.
Theorem 2 states that
$$\tilde{\rho}(x) \le |\langle x, \bar{g}^\dagger \rangle| + \frac{|\beta^\dagger|}{\|g^\dagger\|} = \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|}, \tag{18}$$
where $\beta^\dagger$ is the locally constant term, $g^\dagger$ is the saliency map of the binarized classifier and $\bar{v} = v / \|v\|$ for $v \ne 0$. In Figure 6, we check how strongly the right-hand side of inequality (18) is dominated by $\alpha^\dagger(x)$, i.e. how large the influence of the locally linear term is in comparison to the locally constant term. For ImageNet, this ratio increases from below 0.55 to almost 0.85, pointing towards a model increasingly governed by its linearized part. On MNIST, this ratio strongly decreases over the robustness’s range. Note however that in the weakly regularized MNIST models, the right-hand side is extremely dominated by the median alignment in the first place.
A similar analysis can be performed for the second inequality from Theorem 2,
$$\tilde{\rho}(x) \le \alpha(x) + \|x\| \cdot \|\bar{g}^\dagger - \bar{g}\| + \frac{|\beta^\dagger|}{\|g^\dagger\|},$$
where $g$ is the saliency map of the predicted class and $g^\dagger$ that of the binarized classifier. This inequality additionally makes a step from binarized alignment to (conventional) alignment.
This leads to an additional error term, making the bound significantly less tight than in the previous case. In particular, the proportion of the alignment on the right-hand side diminishes, confirming our prediction from section 3.2. Nevertheless, the qualitative behavior is similar to the previous case, with the alignment taking up an increasing fraction of the right-hand side with increasing robustness. For MNIST data, the ratio varies little compared to the ratio from the last inequality. This indicates that the remainder term $\|x\| \cdot \|\bar{g}^\dagger - \bar{g}\|$ does not change too strongly over the set of MNIST experiments compared to $|\beta^\dagger| / \|g^\dagger\|$. We thus deduce that the qualitative relationship between robustness and alignment is fully governed by the error term introduced in (18), i.e. the locally constant term of the logit.
We now do the same for the inequality in Theorem 3, which states that
for and , which gets rid of the additive term from (18). Again, in the case of ImageNet grows more quickly in comparison to , the distance of the normalized gradients, whereas their ratio is approximately constant for MNIST data.
To conclude, we have seen that the upper bounds from Theorems 2 and 3 provide valuable information in which ways both the experiments on ImageNet and MNIST are influenced by the respective terms. In the case of ImageNet, we consistently see the alignment terms growing more quickly than the other terms. This might indicate that the growth in alignment stems not only from the growth in the robustness alone, but also from the model becoming increasingly similar to our idealized toy example. In other words, not only does the robustness make the alignment grow, but the connection between these two properties becomes stronger in the case of ImageNet. This is in agreement with the seemingly superlinear growth of the median alignment in Figure 2.
It is not surprising that a classifier for a problem as complex as ImageNet is highly non-linear, which makes the (pointwise) connection between alignment and robustness rather loose. We hence conjecture that the imposed regularization increasingly restricts the models to be more linear, thereby making them more similar to our initial toy example.
For MNIST, the regularization seems to have the opposite effect: As seen in Figure 6, the binarized alignment initially dwarfs the correction term introduced by the locally constant portion of the binarized logit $\Psi^\dagger$. As the network becomes more robust, $\Psi^\dagger$ is apparently not dominated by the linear terms anymore, while the influence of the locally constant terms (i.e. $\beta^\dagger$) increases. This hypothesis seems sensible, considering MNIST is a very simple problem which we tackled with a comparatively shallow network. This can be expected to yield a model with a low degree of non-linearity. The penalization of the local Lipschitz constant here seems to have the effect of requiring larger locally constant terms $\beta^\dagger$, in contrast to the models trained on ImageNet.
We check the validity of these claims by tracking the median size of against the median size of in Figure 9. On MNIST, starts out at approximately of and at the end rises to almost . Note that this does not indicate that is typically close to 0 for all , just that is, compared to .
On MNIST, this ratio is close to 1 up until , when it suddenly and quickly falls below . This drop is consistent with what we see in Figure 3: At around the same point this drop occurs, the alignment starts to saturate. While an increase in the model’s median robustness should imply an increase in the model’s median alignment, the deviation from linearity weakens the connection between robustness and alignment, such that the two effects roughly cancel out.
5 Conclusion and Outlook
In this paper, we investigated the connection between a neural network’s robustness to adversarial attacks and the interpretability of the resulting saliency maps. Motivated by the binary, linear case, we defined the alignment as a measure of how much a saliency map matches its respective image. We hypothesized that the perceived increase in interpretability is due to a higher alignment and tested this hypothesis on models trained on MNIST and ImageNet. While on average, the proposed relation holds well, the connection is much less pronounced for individual points, especially on ImageNet. Using some upper bounds for the robustness of a neural network, which we derived using a decomposition theorem, we arrived at the conclusion that the strength of this connection is strongly linked with how similar to a linear model the neural network is locally. As ImageNet is a comparatively complex problem, any sufficiently accurate model is bound to be very non-linear, which explains the difference to MNIST.
While this paper shows the general link between robustness and alignment, there are still some open questions. Since we only used one specific robustification method, further experiments should determine the influence of this method. One could also explore whether a different choice of norm leads to different observations. Another future direction of research could be to investigate the degree of (non-)linearity and its connection to this topic. While Theorems 2 and 3 illustrate how the pointwise linearized robustness and alignment may diverge, depending on the individual terms of these bounds, a more in-depth look should focus on why and when these terms have a certain relationship to each other.
From a methodological standpoint, the discovered connection may also serve as an inspiration for new adversarial defenses, where not only the robustness but also the alignment is taken into account. One way of increasing the alignment directly would be through the penalty term
which is bounded from below by 0 via the Cauchy-Schwarz inequality. Any robustifying effects of the increased alignment may however be confounded with the Lipschitz-penalty that the first summand effectively introduces, which necessitates a careful experimental evaluation.
CE and PM acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG) - Projektnummer 281474342: ‘RTG - Parameter Identification - Analysis, Algorithms, Applications’. The work by SL was supported by the EPSRC grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis and by the Cantab Capital Institute for the Mathematics of Information. CBS acknowledges support from the Leverhulme Trust projects on Breaking the non-convexity barrier and on Unveiling the Invisible, the Philip Leverhulme Prize, the EPSRC grant Nr. EP/M00483X/1, the EPSRC Centre Nr. EP/N014588/1, the European Union Horizon 2020 research and innovation programmes under the Marie Skłodowska-Curie grant agreement No 777826 NoMADS and No 691070 CHiPS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Quadro P6000 and a Titan Xp GPU used for this research.
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Anil et al. (2018) Anil, C., Lucas, J., and Grosse, R. Sorting out lipschitz function approximation. arXiv preprint arXiv:1811.05381, 2018.
- Carlini & Wagner (2017) Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39–57, 2017.
- Cortes & Vapnik (1995) Cortes, C. and Vapnik, V. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Ieee, 2009.
- Drucker & Le Cun (1992) Drucker, H. and Le Cun, Y. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.
- Elsayed et al. (2018) Elsayed, G. F., Krishnan, D., Mobahi, H., Regan, K., and Bengio, S. Large margin deep networks for classification. 2018. URL https://arxiv.org/pdf/1803.05598.pdf.
- Goodfellow et al. (2014) Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv 1412.6572, 2014.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Jakubovitz & Giryes (2018) Jakubovitz, D. and Giryes, R. Improving dnn robustness to adversarial attacks using jacobian regularization. In The European Conference on Computer Vision (ECCV), September 2018.
- Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
- LeCun et al. (1990) LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404, 1990.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
- Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. 2018.
- Montavon et al. (2017) Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
- Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
- Rauber et al. (2017) Rauber, J., Brendel, W., and Bethge, M. Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
- Schott et al. (2018) Schott, L., Rauber, J., Brendel, W., and Bethge, M. Robust perception through analysis by synthesis. CoRR, abs/1805.09190, 2018. URL http://arxiv.org/abs/1805.09190.
- Simon-Gabriel et al. (2018) Simon-Gabriel, C.-J., Ollivier, Y., Schölkopf, B., Bottou, L., and Lopez-Paz, D. Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.
- Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
- Sokolić et al. (2017) Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
- Springenberg et al. (2015) Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR (workshop track), 2015. URL http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15a.
- Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
- Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. 2014.
- Tsipras et al. (2019) Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyxAb30cY7.
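Proof of Equation (3):
Assume without loss of generality that $\langle x, z \rangle + b > 0$. A perturbation $e$ can only change the classification if $\langle x + e, z \rangle + b \le 0$, i.e. if
$$-\langle e, z \rangle \ge \langle x, z \rangle + b.$$
For fixed $\|e\|$, the left-hand side is clearly maximized for $e = -\|e\| \, z / \|z\|$, leading to
$$\|e\| \cdot \|z\| \ge \langle x, z \rangle + b.$$
This proves the claim by taking the infimum over $e$. ∎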
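Proof of Lemma 1:
As $\rho(x) \le r(x)$, we can take the infimum in (1) over all perturbations in the local affine component, i.e. with $\|e\| \le r(x)$ only. This allows us to reformulate
$$\rho(x) = \min_{j \ne i^*} \inf_{e} \{ \|e\| : \Psi^{i^*}(x) - \Psi^j(x) + \langle e, \nabla\Psi^{i^*}(x) - \nabla\Psi^j(x) \rangle \le 0 \}.$$
The infimum over $e$ is achieved by choosing $e$ as a multiple of $\nabla\Psi^{i^*}(x) - \nabla\Psi^j(x)$. A direct computation then finishes the proof. ∎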
Proofs of Homogenization results
Lemma 3 (Euler’s Homogeneous Function Theorem).
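Let $\Psi$ be a positive one-homogeneous function that is continuously differentiable in a neighbourhood of $x$. Then
$$\langle x, \nabla\Psi(x) \rangle = \Psi(x).$$
Proof of Lemma 2:
First note that $\Psi^\dagger = \Psi^{i^*} - \Psi^{j^*}$ is positive one-homogeneous as a difference of positive one-homogeneous functions, and that
$$\tilde{\rho}(x) = \frac{\Psi^\dagger(x)}{\|\nabla\Psi^\dagger(x)\|}.$$
The claim is then a direct consequence of Lemma 3. ∎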
Definition 5 (Neural Networks).
Define the class of neural networks to be any network built on learnable affine transforms (convolutional layers, dense layers) with linear weights and biases and ReLU or leaky ReLU activations. The network can include arbitrary skip-connections, batch-normalization layers and max or average pooling layers of arbitrary window size. This in particular includes many state-of-the-art classification networks.
Lemma 4 (Homogeneous Networks).
For fixed $W$, consider the logit $\Psi^i(x; W, b)$ of a network in $\mathcal{N}$, where $W$ denotes the linear weights and $b$ the bias vector of the network. Then the function
$$(x, b) \mapsto \Psi^i(x; W, b)$$
is positive one-homogeneous and $\Psi^i(x; W, b) = \langle x, \nabla_x \Psi^i(x; W, b) \rangle + \langle b, \nabla_b \Psi^i(x; W, b) \rangle$.
Consider first a network consisting of a single layer with linear transform $W$ and bias $b$ with ReLU non-linearity. The associated network function is hence given by $\Psi(x; W, b) = \operatorname{ReLU}(Wx + b)$. For this network, we compute for fixed $W$ and any $a > 0$ and $(x, b)$
$$\Psi(ax; W, ab) = \operatorname{ReLU}(aWx + ab) = a \operatorname{ReLU}(Wx + b) = a \Psi(x; W, b).$$
A single layer is hence positive one-homogeneous. Since compositions of positive one-homogeneous functions are positive one-homogeneous as well, the function associated to a network consisting of affine transforms and ReLU activations is positive one-homogeneous. Skip-connections, batch-normalization layers and max or average pooling are positive one-homogeneous as well, thus proving the claim. ∎
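Proof of Theorem 2:
With $g^\dagger = \nabla\Psi^\dagger(x)$ and $\beta^\dagger = \langle b, \nabla_b \Psi^\dagger(x) \rangle$, we have
$$\tilde{\rho}(x) = \frac{|\langle x, g^\dagger \rangle + \beta^\dagger|}{\|g^\dagger\|} \le \alpha^\dagger(x) + \frac{|\beta^\dagger|}{\|g^\dagger\|}$$
using the decomposition theorem and the triangle inequality. Further,
$$\alpha^\dagger(x) = |\langle x, \bar{g}^\dagger \rangle| \le |\langle x, \bar{g} \rangle| + |\langle x, \bar{g}^\dagger - \bar{g} \rangle| \le \alpha(x) + \|x\| \cdot \|\bar{g}^\dagger - \bar{g}\|$$
using the Cauchy-Schwarz inequality. ∎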
using the Cauchy-Schwarz inequality in the same way as in the last theorem. ∎
MNIST Model Architecture
Here we describe the architecture that was used for the MNIST models.
|Conv2D (, ’same’), 32 feature maps, ReLU|
|Max Pooling (factor 2)|
|Conv2D (, ’same’), 64 feature maps, ReLU|
|Max Pooling (factor 2)|
|Conv2D (, ’same’), 128 feature maps, ReLU|
|Max Pooling (factor 2)|
|Dense Layer (128 neurons), ReLU|