1 Introduction
Deep neural networks have been used with great success for perceptual tasks such as image classification simonyan2014very ; lecun2015deep or speech recognition hinton2012deep . While they are known to be robust to random noise, it has been shown that the accuracy of deep nets can dramatically deteriorate in the face of so-called adversarial examples biggio2013evasion ; szegedy2013intriguing ; goodfellow2014explaining , i.e. small perturbations of the input signal, often imperceptible to humans, that are sufficient to induce large changes in the model output. This apparent vulnerability is worrisome as deep nets start to proliferate in the real world, including in safety-critical deployments.
Consequently, there has been a rapidly expanding literature exploring methods to find adversarial perturbations sabour2015adversarial ; papernot2016transferability ; kurakin2016adversarial ; moosavi2016deepfool ; moosavi2017universal ; madry2017towards ; athalye2018obfuscated , as well as to provide formal guarantees on the robustness of a model against specific attacks hein2017formal ; kolter2017provable ; raghunathan2018certified ; tsuzuku2018lipschitz ; cohen2019certified . The most direct strategy of robustification, called adversarial training, aims to harden a machine learning model by immunizing it against an adversary that maliciously corrupts each training example before passing it to the model
goodfellow2014explaining ; kurakin2016adversarial ; miyato2015distributional ; miyato2017virtual ; madry2017towards . A different strategy of defense is to detect whether the input has been disrupted, by detecting characteristic regularities either in the adversarial manipulations themselves or in the network activations they induce grosse2017statistical ; feinman2017detecting ; xu2017feature ; metzen2017detecting ; carlini2017adversarial ; roth2019odds . Despite practical advances in finding adversarial examples and defending against them, the definitive theoretical reason for the vulnerability of neural networks remains unclear. Bubeck et al. bubeck2018adversarial identify four mutually exclusive scenarios: (i) no robust model exists, cf. fawzi2018adversarial ; gilmer2018adversarial , (ii) learning a robust model requires too much training data, cf. schmidt2018adversarially , (iii) learning a robust model from limited training data is possible but computationally intractable (the hypothesis favoured by Bubeck et al.), and (iv) we just have not found the right training algorithm yet.
In other words, it is still an open question whether adversarial examples exist because of intrinsic flaws of the model or learning objective, or whether they are solely a consequence of computational limitations, non-zero generalization error and high-dimensional statistics. In this work, we investigate the origin of adversarial vulnerability in neural networks by focusing on the attack algorithms used to find adversarial examples.
In particular, we make the following contributions:

We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we show that adversarial training is a data-dependent generalization of spectral norm regularization.

We conduct extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones.

Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.
2 Related Work
As deep neural networks start to proliferate in the real world, the requirement for trained models to be robust to input perturbations becomes paramount. Prominent machine learning frameworks dealing with such requirements are robust optimization el1997robust ; xu2009robustness ; bertsimas2018characterization (including distributionally robust optimization namkoong2017variance ; sinha2017certifiable ; gao2016distributionally ) and adversarial training goodfellow2014explaining ; shaham2015understanding ; kurakin2016adversarial ; miyato2017virtual ; madry2017towards . In these frameworks, machine learning models are trained to minimize the worst-case loss against an adversary that can either perturb the entire training set (in the case of robust optimization) or each training example individually (in the case of adversarial training), subject to a proximity constraint.
A number of works have suggested using regularization, often based on the input gradient, as a means to improve model robustness against adversarial attacks gu2014towards ; lyu2015unified ; cisse2017parseval ; ross2017improving ; simon2018adversarial . Interestingly, for certain problems and uncertainty sets, robust optimization is equivalent to regularization el1997robust ; xu2009robustness ; bertsimas2018characterization
. E.g. for linear regression and induced matrix norm balls, the adversary's inner maximization can equivalently be written as an operator norm penalty el1997robust ; bertsimas2018characterization . Similar results on the equivalence of robustness and regularization have also been obtained for (kernelized) SVMs xu2009robustness . Cf. bietti2018regularization for a kernel perspective on robustness and regularization of deep nets. More recently, training methods based on spectral norm yoshida2017spectral ; miyato2018spectral ; bartlett2017spectrally ; farnia2018generalizable and Lipschitz constant regularization cisse2017parseval ; hein2017formal ; tsuzuku2018lipschitz ; raghunathan2018certified have been proposed, particularly as bounds on the spectral norm or Lipschitz constant can easily be translated into bounds on the minimal perturbation required to fool a machine learning model. Theoretical work connecting adversarial robustness with robustness to random noise fawzi2015analysis ; fawzi2016robustness and decision boundary tilting tanay2016boundary has also been pursued.
Despite there being a well-established learning theory for standard non-robust classification, including generalization bounds for neural networks, cf. for instance boucheron2005theory ; anthony2009neural , the theoretical understanding of the robust learning problem is still very limited. Recent works starting to fill this gap include Lipschitz-sensitive generalization bounds neyshabur2015norm , spectrally-normalized margin bounds for neural networks bartlett2017spectrally , as well as stronger generalization bounds for deep nets via compression arora2018stronger .
3 Background
3.1 Robust Optimization and Regularization for Linear Regression
We begin by distilling the basic ideas on the relation between robust optimization and regularization presented in bertsimas2018characterization . Consider linear regression with additive perturbations $\Delta$ of the data matrix $X$

(1) $\min_w \max_{\Delta \in \mathcal{U}} \; g(y - (X + \Delta)w)$

where $\mathcal{U}$ denotes the uncertainty set. A general way to construct $\mathcal{U}$ is as a ball of bounded matrix norm perturbations $\mathcal{U} = \{\Delta : \|\Delta\|_{(h,g)} \leq \lambda\}$. Of particular interest are induced matrix norms

(2) $\|\Delta\|_{(h,g)} := \max_{w \neq 0} \; g(\Delta w) / h(w)$

where $g$ is a seminorm and $h$ is a norm. It is obvious that if $g$ fulfills the triangle inequality then one can upper bound

(3) $\underbrace{\max_{\|\Delta\|_{(h,g)} \leq \lambda} g(y - (X + \Delta)w)}_{\text{Robust Optimization}} \;\overset{(a)}{\leq}\; g(y - Xw) + \max_{\|\Delta\|_{(h,g)} \leq \lambda} g(\Delta w) \;\overset{(b)}{\leq}\; \underbrace{g(y - Xw) + \lambda\, h(w)}_{\text{Regularization}}$

by using (a) the triangle inequality and (b) the definition of the matrix norm.
The question then is under which circumstances both inequalities become equalities at the maximizing $\Delta^*$. It is straightforward to check (bertsimas2018characterization , Theorem 1) that specifically we may choose the rank-one matrix

(4) $\Delta^* := \lambda\, u\, v^\top$, with $u := \dfrac{Xw - y}{g(Xw - y)}$ and $v$ such that $v^\top w = h(w)$ and $h^*(v) = 1$, where $h^*$ denotes the dual norm of $h$.

If $g(Xw - y) = 0$ then one can pick any $u$ for which $g(u) = 1$ to form $\Delta^*$ (such a $u$ has to exist if $g$ is not identically zero). This shows that, for robust linear regression with induced matrix norm uncertainty sets, Robust Optimization $=$ Regularization.
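For the Euclidean case $g = h = \|\cdot\|_2$, where the induced norm is the spectral norm, the rank-one construction can be checked numerically. The following sketch is our own illustration (not part of the original paper) and verifies that the worst-case perturbation attains the upper bound in Equation 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

r = X @ w - y                        # residual X w - y
u = r / np.linalg.norm(r)            # unit vector along the residual, g(u) = 1
v = w / np.linalg.norm(w)            # unit vector with v^T w = ||w||_2
Delta = lam * np.outer(u, v)         # rank-one worst-case perturbation of Eq. (4)

worst = np.linalg.norm(y - (X + Delta) @ w)
bound = np.linalg.norm(y - X @ w) + lam * np.linalg.norm(w)
assert np.isclose(worst, bound)      # the upper bound of Eq. (3) is attained
# Delta lies exactly on the boundary of the spectral norm ball:
assert np.isclose(np.linalg.svd(Delta, compute_uv=False)[0], lam)
```

Since $u$ and $v$ are unit vectors, $\Delta^*$ has spectral norm exactly $\lambda$, and the residual of the perturbed problem is a positive multiple of the unperturbed residual, which is why the triangle inequality is tight.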
3.2 Global Spectral Norm Regularization
In this section we re-derive spectral norm regularization à la Yoshida et al. yoshida2017spectral , while also setting up the notation for later. Let $x \in \mathbb{R}^d$ and $y$ denote input-label pairs generated from a data distribution $P$. Let

$f(x) = W^L \varphi\big(W^{L-1} \varphi(\cdots \varphi(W^1 x + b^1) \cdots) + b^{L-1}\big) + b^L$

denote the logits of a $\theta$-parameterized piecewise linear classifier, where $\varphi$ is the activation function, and $W^\ell$, $b^\ell$, for $\ell = 1, \dots, L$, denote the layer-wise weight matrices (note that convolutional layers can be constructed as matrix multiplications by converting the convolution operator into a Toeplitz matrix) and bias vectors, collectively denoted by $\theta$. Let us furthermore assume that each activation function is a ReLU (the argument can easily be generalized to other piecewise linear activations). In this case, the activations act as input-dependent diagonal matrices $\Phi^\ell(x)$, where an element in the diagonal is one if the corresponding pre-activation is positive and equal to zero otherwise. Following Raghu et al. raghu2017expressive , we call $\phi(x) := (\Phi^1(x), \dots, \Phi^{L-1}(x)) \in \{0,1\}^N$ the "activation pattern", where $N$ is the number of neurons in the network. For any activation pattern $\phi$ we can define the preimage $X(\phi) := \{x \in \mathbb{R}^d : \phi(x) = \phi\}$, inducing a partitioning of the input space via $\mathbb{R}^d = \cup_\phi X(\phi)$. Note that some $X(\phi)$ may be empty, as not all combinations of activations may be feasible. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of ReLU tessellations of the input space. We can linearize $f$ within a neighborhood around $x$ as follows

(5) $f(\tilde{x}) = f(x) + J(x)(\tilde{x} - x)$, for all $\tilde{x}$ with $\phi(\tilde{x}) = \phi(x)$

where $J(x)$ denotes the Jacobian of $f$ at $x$

(6) $J(x) = W^L\, \Phi^{L-1}(x)\, W^{L-1} \cdots \Phi^1(x)\, W^1$

We have the following bound for $\tilde{x}$ within the linear region around $x$

(7) $\|f(\tilde{x}) - f(x)\|_2 \leq \sigma(J(x))\, \|\tilde{x} - x\|_2$

where $\sigma(J(x))$ is the spectral norm (largest singular value) of the linear operator $J(x)$. From a robustness perspective we want $\sigma(J(x))$ to be small in regions that are supported by the data.
Based on the decomposition in Equation 6 and the non-expansiveness of the activations, $\sigma(\Phi^\ell(x)) \leq 1$ for every $x$, Yoshida et al. yoshida2017spectral suggest to upper-bound the spectral norm of the Jacobian by the product of the spectral norms of the individual weight matrices

(8) $\sigma(J(x)) \leq \prod_{\ell=1}^{L} \sigma(W^\ell)$

The layer-wise spectral norms $\sigma(W^\ell)$ can be computed iteratively using the power method. Starting with a random vector $v^\ell$, the power method iteratively computes

(9) $u^\ell \leftarrow \dfrac{W^\ell v^\ell}{\|W^\ell v^\ell\|_2}\,, \quad v^\ell \leftarrow \dfrac{(W^\ell)^\top u^\ell}{\|(W^\ell)^\top u^\ell\|_2}$

The (final) singular value can be computed via $\sigma(W^\ell) = (u^\ell)^\top W^\ell v^\ell$.
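The layer-wise power iteration can be sketched in a few lines. The snippet below is our own minimal NumPy illustration (the helper name `spectral_norm` is ours, and the test matrix is constructed with known singular values):

```python
import numpy as np

def spectral_norm(W, n_iter=100, seed=0):
    """Approximate the dominant singular value of W via the power method."""
    v = np.random.default_rng(seed).normal(size=W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)       # left singular vector estimate
        v = W.T @ u
        v /= np.linalg.norm(v)       # right singular vector estimate
    return u @ W @ v                 # sigma = u^T W v

# Build a matrix with known singular values [4.0, 2.5, 1.0, 0.5, 0.1].
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Q2, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = Q1[:, :5] @ np.diag([4.0, 2.5, 1.0, 0.5, 0.1]) @ Q2

assert np.isclose(spectral_norm(W), 4.0)
```

The convergence rate is governed by the ratio of the two largest singular values; in the regularizer above, the vectors $u^\ell, v^\ell$ are warm-started across parameter updates, which is why a single iteration per step suffices in practice.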
Yoshida et al. suggest to turn this upper bound into a global (data-independent) regularizer by learning the parameters $\theta$ via the following penalized empirical risk minimization

(10) $\min_\theta \; \mathbb{E}_{(x,y)} [\, \ell(y, f(x)) \,] + \frac{\lambda}{2} \sum_{\ell=1}^{L} \sigma(W^\ell)^2$

where $\ell$ denotes an arbitrary classification loss. Note, since the parameter gradient of $\sigma(W^\ell)^2 / 2$ is $\sigma_\ell\, u_\ell v_\ell^\top$, with $\sigma_\ell$, $u_\ell$ and $v_\ell$ being the dominant singular value and singular vectors of $W^\ell$ (approximated via the power method), Yoshida et al.'s global spectral norm regularizer effectively adds a term $\lambda\, \sigma_\ell\, u_\ell v_\ell^\top$ for each layer to the parameter gradient of the loss function. In terms of computational complexity, because the global regularizer decouples from the empirical loss term, a single power method iteration per parameter update step usually suffices in practice yoshida2017spectral .

3.3 Global vs. Local Regularizers
The advantage of global bounds is that they trivially generalize from the training to the test set. The problem, however, is that they can be arbitrarily loose, e.g. penalizing the spectral norm over irrelevant regions of the ambient space. To illustrate this, consider the ideal robust classifier that is essentially piecewise constant on class-conditional regions, with sharp transitions between the classes. The global spectral norm will be heavily influenced by the sharp transition zones, whereas a local data-dependent bound can adapt to regions where the classifier is approximately constant hein2017formal . In other words, we would expect a global regularizer to have the largest effect in the empty parts of the input space. By contrast, a local regularizer has its main effect around the data manifold.
4 Adversarial Training Generalizes Spectral Norm Regularization
4.1 Data-dependent Spectral Norm Regularization
We now show how to directly regularize the data-dependent spectral norm of the Jacobian $J(x)$. Under the assumption that the dominant singular value is non-degenerate (for practical purposes, we can safely assume this to be the case, due to numerical errors), the problem of computing the largest singular value $\sigma(J(x))$ and the corresponding left and right singular vectors $u$ and $v$ can efficiently be solved via the power method. Let $v$ be a random vector or an approximation to the dominant right singular vector of $J(x)$. The power method iteratively computes

(11) $u \leftarrow \dfrac{J(x)\, v}{\|J(x)\, v\|_2}\,, \quad v \leftarrow \dfrac{J(x)^\top u}{\|J(x)^\top u\|_2}$

The (final) singular value can be computed via $\sigma(J(x)) = u^\top J(x)\, v$. Note, the right singular vector $v$ gives the direction in input space that corresponds to the steepest ascent of $f$ along $u$.
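The power method of Equation 11 only requires Jacobian-vector and vector-Jacobian products, never the explicit Jacobian. The following sketch is our own NumPy illustration on a toy two-layer ReLU network (not the paper's code), verified against a full SVD of the explicitly formed Jacobian:

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(32, 10))        # first layer weights
W2 = rng.normal(size=(5, 32))         # second layer weights
x = rng.normal(size=10)
D = (W1 @ x > 0).astype(float)        # diagonal of the activation pattern at x

def Jv(v):                            # Jacobian-vector product J(x) v
    return W2 @ (D * (W1 @ v))

def JTu(u):                           # vector-Jacobian product J(x)^T u
    return W1.T @ (D * (W2.T @ u))

v = rng.normal(size=10)
for _ in range(300):                  # power method of Equation 11
    u = Jv(v)
    u /= np.linalg.norm(u)
    v = JTu(u)
    v /= np.linalg.norm(v)
sigma = u @ Jv(v)                     # sigma = u^T J(x) v

J = W2 @ np.diag(D) @ W1              # explicit Jacobian, for verification only
assert np.isclose(sigma, np.linalg.svd(J, compute_uv=False)[0], rtol=1e-3)
```

In a deep learning framework the two products correspond to one forward-mode and one reverse-mode differentiation pass, so each power iteration costs on the order of one additional forward-backward pass.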
We can turn this into a regularizer by learning the parameters $\theta$ via the following Jacobian-based spectral norm penalized empirical risk minimization

(12) $\min_\theta \; \mathbb{E}_{(x,y)} \big[\, \ell(y, f(x)) + \lambda\, (u^\top J(x)\, v)^2 \,\big]$

where $u$ and $v$ are the data-dependent singular vectors of $J(x)$, computed via Equation 11.
By optimality / stationarity (at convergence of the power method, $u = J(x)v / \|J(x)v\|_2$) and linearization ($f(x + v) - f(x) \simeq J(x)\, v$ for $v$ of sufficiently small norm), we can regularize learning also via the following sum-of-squares based spectral norm regularizer

(13) $\min_\theta \; \mathbb{E}_{(x,y)} \big[\, \ell(y, f(x)) + \lambda\, \|f(x + v) - f(x)\|_2^2 \,\big]$

where the data-dependent singular vector $v$ of $J(x)$ is computed via Equation 11.
Both variants can readily be implemented in modern deep learning frameworks. We found the sum-of-squares based spectral norm regularizer to be numerically more stable than the Jacobian-based one, which is why we used this variant in our experiments. In terms of computational complexity, the data-dependent regularizer is a constant (the number of power method iterations) times more expensive than the data-independent variant.
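A minimal sketch of the sum-of-squares variant of Equation 13 on the same toy two-layer ReLU network (our own illustration; the helper names are ours, and we rescale the singular vector to a small norm so the finite difference stays inside the linear region, where the penalty then equals $\sigma(J(x))^2$ exactly):

```python
import numpy as np

def f(x, W1, W2):
    """Logits of a toy two-layer ReLU network (illustration only)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def snr_penalty(x, v, W1, W2, eps=1e-6):
    """Sum-of-squares spectral norm penalty of Equation 13:
    ||f(x + eps v) - f(x)||^2 / eps^2 ~= sigma(J(x))^2 for the
    (unit-norm) dominant right singular vector v of J(x)."""
    d = f(x + eps * v, W1, W2) - f(x, W1, W2)
    return (d @ d) / eps**2

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(4, 16))
x = rng.normal(size=8)

D = (W1 @ x > 0).astype(float)
J = W2 @ np.diag(D) @ W1              # exact Jacobian at x, for reference
s, Vt = np.linalg.svd(J)[1:]          # singular values / right singular vectors
assert np.isclose(snr_penalty(x, Vt[0], W1, W2), s[0] ** 2)
```

The penalty is differentiable with respect to the network weights through both forward passes, so standard backpropagation suffices; only the singular vector $v$ is treated as a constant during the update.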
4.2 Power Method Formulation of Adversarial Training
Adversarial training goodfellow2014explaining ; kurakin2016adversarial ; madry2017towards aims to improve the robustness of a machine learning model by training it against an adversary that independently perturbs each training example subject to a proximity constraint, e.g. in $\ell_2$-norm,

(14) $\min_\theta \; \mathbb{E}_{(x,y)} \big[ \max_{\|\delta\|_2 \leq \epsilon} \ell_{\mathrm{adv}}(y, f(x + \delta)) \big]$

where $\ell_{\mathrm{adv}}$ denotes the loss function used to find adversarial perturbations (which does not need to be the same as the classification loss $\ell$).
The adversarial example is typically computed iteratively, e.g. via $\ell_2$-norm constrained projected gradient ascent madry2017towards ; kurakin2016adversarial (the general $\ell_p$-norm constrained case is similar)

(15) $x^{k+1} = P_{\epsilon}\big( x^k + \alpha\, \nabla_x \ell_{\mathrm{adv}}(y, f(x^k)) \big)$

where $P_{\epsilon}$ is the projection operator into the norm ball $B_\epsilon(x) = \{\tilde{x} : \|\tilde{x} - x\|_2 \leq \epsilon\}$, $\alpha$ is a small step-size and $y$ is the true or predicted label. For targeted attacks the sign in front of $\alpha$ is flipped, so as to descend the loss function into the direction of the target label.
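For concreteness, the projected gradient ascent loop of Equation 15 can be sketched as follows (our own minimal NumPy illustration; `grad_loss` is a hypothetical placeholder for the input gradient of the adversarial loss, which a real implementation would obtain via automatic differentiation):

```python
import numpy as np

def l2_pga(x, y, grad_loss, eps, alpha=0.1, n_iter=10):
    """L2-norm constrained projected gradient ascent (untargeted).
    grad_loss(x, y) must return the input gradient of l_adv at x."""
    x_adv = x.copy()
    for _ in range(n_iter):
        g = grad_loss(x_adv, y)
        x_adv = x_adv + alpha * g / (np.linalg.norm(g) + 1e-12)  # normalized step
        delta = x_adv - x
        norm = np.linalg.norm(delta)
        if norm > eps:                       # project back onto the eps-ball
            x_adv = x + (eps / norm) * delta
    return x_adv

# Toy check on a linear "loss" l_adv(x) = w^T x: the attack moves along w
# and saturates the eps-ball.
w = np.array([3.0, -4.0])
x0 = np.zeros(2)
x_adv = l2_pga(x0, None, lambda x, y: w, eps=0.5)
assert np.isclose(np.linalg.norm(x_adv - x0), 0.5)
assert np.isclose(x_adv @ w, 0.5 * np.linalg.norm(w))
```

The gradient normalization plays the same role as the normalization in the power method formulation discussed next; without it, one recovers plain projected gradient ascent with an unnormalized step.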
By the chain rule, $\nabla_x \ell_{\mathrm{adv}} = J(x)^\top \nabla_f \ell_{\mathrm{adv}}$, the gradient step can be expressed as a Jacobian vector product, while the projection into the norm ball can be expressed as a normalization. Thus, $\ell_2$-norm constrained projected gradient ascent can equivalently be written as (the normalization of $\nabla_f \ell_{\mathrm{adv}}$ is optional)

(16) $u^k \leftarrow \dfrac{\nabla_f \ell_{\mathrm{adv}}(y, f(x^k))}{\|\nabla_f \ell_{\mathrm{adv}}(y, f(x^k))\|_2}\,, \quad x^{k+1} \leftarrow P_{\epsilon}\big( x^k + \alpha\, J(x^k)^\top u^k \big)$

where the projection $P_{\epsilon}$ ensures that if $\|x^{k+1} - x\|_2 > \epsilon$ then $x^{k+1}$ is mapped back onto the norm ball $B_\epsilon(x)$. Note that the logit-space gradient $\nabla_f \ell_{\mathrm{adv}}$ can be computed in a single forward pass, by directly expressing it in terms of the arguments of the adversarial loss.
Comparing the update equations for projected gradient ascent based adversarial training with those of data-dependent spectral norm regularization, we can see that adversarial training generalizes spectral norm regularization in two ways: (i) via the choice of the adversarial loss function $\ell_{\mathrm{adv}}$ and (ii) by iterating $x^k$ within the norm ball $B_\epsilon(x)$ (whereas spectral norm regularization keeps the input fixed).
Indeed, keeping the input fixed during the attack and taking the sum-of-squares loss on the logits of the classifier, i.e. $\ell_{\mathrm{adv}}(z, f(\tilde{x})) = \frac{1}{2}\|f(\tilde{x}) - z\|_2^2$ with $z = f(x)$ and $\tilde{x} = x + \delta$, allows us to recover data-dependent spectral norm regularization,

(17) $u \leftarrow \dfrac{J(x)\, \delta^k}{\|J(x)\, \delta^k\|_2}\,, \quad \delta^{k+1} \leftarrow \epsilon\, \dfrac{J(x)^\top u}{\|J(x)^\top u\|_2}$

since in the linear region around $x$ the normalized logit-space gradient equals $J(x)\delta^k / \|J(x)\delta^k\|_2$, i.e. the updates reduce to the power method iterations of Equation 11, with $\delta$ converging to the $\epsilon$-rescaled dominant right singular vector of $J(x)$.
Finally, note that the adversarial loss function $\ell_{\mathrm{adv}}$ determines the logit-space direction of the directional derivative in the power method formulation of adversarial training, as shown in Section 7.1 in the Appendix for an example using the softmax cross-entropy loss. The effect of iterating within the norm ball on the range of the regularization effect is investigated in detail in Section 5.3 of the Experiments.
5 Experimental Results
5.1 Dataset, Architecture & Training Methods
We trained Convolutional Neural Networks (CNNs) with seven hidden layers and batch normalization on the CIFAR10 data set krizhevsky2009learning . We use the 7-layer CNN as our default platform, since it has good test set accuracy at acceptable computational requirements (we used an estimated k GPU hours (Titan X) in total for all our experiments). We train each classifier with a number of different training methods: (i) 'Standard': standard empirical risk minimization with a softmax cross-entropy loss, (ii) 'Adversarial': $\ell_2$-norm constrained projected gradient ascent (PGA) based adversarial training with a softmax cross-entropy loss, (iii) 'Yoshida': global spectral norm regularization à la Yoshida et al. yoshida2017spectral , as in Equation 10, and (iv) 'SNR': data-dependent spectral norm regularization, as in Equation 13. As our default attack strategy we use an $\ell_2$-norm constrained PGA white-box attack with 10 attack iterations. The attack strength $\epsilon$ used for training was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified (indicated by a vertical dashed line in the figures below). The regularization constants of the other training methods were then chosen such that they roughly achieve the same test set accuracy on clean examples as the adversarially trained model. Further details regarding the experimental setup can be found in Section 7.3 in the Appendix. Table 1 summarizes the test set accuracies and hyperparameters for all the training methods we considered. Shaded areas in the plots below denote standard errors with respect to the number of test set samples over which the experiment was repeated.
Training Method  Accuracy  Hyperparameters
Standard Training  93.5%  —
Adversarial Training  83.6%  iters
Global Spectral Norm Reg.  80.4%  iters
Data-dep. Spectral Norm Reg.  84.6%  iters
5.2 Adversarial Training vs. Spectral Norm Regularization
Effect of training method on singular value spectrum. We compute the singular value spectrum of the Jacobian $J(x)$ for networks trained with different training methods, evaluated at a number of different test set examples. Since we are interested in computing the full singular value spectrum, and not just the dominant singular value and singular vectors as during training, the power method would be impractical to use, as it gives us access to only one (the dominant) singular value/vector pair at a time. Instead, we first extract the Jacobian (which is per se defined as a computational graph in modern deep learning frameworks) as an (input dimension × output dimension)-sized matrix and then use available matrix factorization routines to compute the full SVD of the extracted matrix. For each training method, the procedure is repeated for randomly chosen clean and corresponding adversarially perturbed test set examples. Further details regarding the Jacobian extraction can be found in Section 7.4 in the Appendix.
The results are shown in Figure 1 (left). We can see that, compared to the spectrum of the normally trained and global spectral norm regularized model, the spectrum of adversarially trained and datadependent spectral norm regularized models is significantly damped after training. In fact, the datadependent spectral norm regularizer seems to dampen the singular values even slightly more effectively than adversarial training, while global spectral norm regularization has almost no effect compared to standard training.
Alignment of adversarial perturbations with singular vectors. We compute the cosine similarity of adversarial perturbations with the singular vectors of the Jacobian $J(x)$, extracted at a number of test set examples, as a function of the rank of the singular vectors returned by the SVD. For comparison we also show the cosine similarity with the singular vectors of a random network as well as the cosine similarity with random perturbations. The results are shown in Figure 1 (right). We can see that for all training methods (except the random network) adversarial perturbations are strongly aligned with the dominant singular vectors, while the alignment decreases towards the bottom-ranked singular vectors. For the random network, the alignment is roughly constant with respect to rank. Interestingly, this strong alignment with dominant singular vectors also explains why input gradient regularization and fast gradient method (FGM) based adversarial training do not sufficiently protect against adversarial attacks: the input gradient, resp. a single power method iteration, does not in general yield a sufficiently good approximation of the dominant singular vector.
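The predicted alignment can be illustrated on a purely linear model, where iterated gradient ascent on the norm of the output perturbation is exactly the power method (a self-contained NumPy sketch of our own, not the paper's experimental code):

```python
import numpy as np

rng = np.random.default_rng(4)
J = rng.normal(size=(5, 12))          # Jacobian of a (locally) linear map
U, S, Vt = np.linalg.svd(J)           # rows of Vt: right singular vectors by rank

# Gradient ascent on ||J delta||^2 with renormalization repeatedly applies
# J^T J, i.e. it is the power method: the perturbation converges to the
# dominant right singular vector.
delta = rng.normal(size=12)
for _ in range(200):
    delta = J.T @ (J @ delta)
    delta /= np.linalg.norm(delta)

cos = np.abs(Vt @ delta)              # |cosine| with singular vectors by rank
assert cos[0] > 0.999                 # strong alignment with the top vector
assert cos[0] > cos[-1]               # alignment decays toward the bottom ranks
```

For a single gradient step (as in FGM), the perturbation is only the first power iterate and can still be far from the dominant singular vector, consistent with the observation above.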
Adversarial classification accuracy. A plot of the classification accuracy on adversarially perturbed test examples, as a function of the perturbation strength $\epsilon$, can be found in Figure 3 (left). We can see that the adversarial accuracy of the data-dependent spectral norm regularized model lies between that of the normally and adversarially trained models, while global spectral norm regularization does not seem to robustify the model against adversarial attacks. That is, data-dependent spectral norm regularization can already account for a considerable amount of robustness against adversarial examples. This is in line with our earlier observation that adversarial perturbations tend to align with dominant singular vectors and supports our theoretical result that adversarial training generalizes spectral norm regularization. We speculate that the remaining gap between adversarially trained models and data-dependent spectral norm regularized ones might to some extent be attributable to the fact that we used $\ell_2$-norm constrained attacks for both evaluation and adversarial training, allowing the adversarially trained model to profit from potential overfitting to the specific attack.
[Figure 1. Left: singular value spectra of the Jacobian, singular value vs. rank (plots/spectrum). Right: cosine similarity of adversarial perturbations with the singular vectors of the Jacobian, cosine similarity vs. rank (plots/singular_vectors).]
5.3 Local Linearity & Range of Regularization Effects
Local linearity. In order to determine the size of the area in which a locally linear approximation is valid, we measure the deviation from linearity of $f$ as the distance to $x$ is increased in random and adversarial directions, i.e. we measure $\|f(x + \delta) - (f(x) + J(x)\,\delta)\|_2$ as a function of the distance $\|\delta\|_2$, for random and adversarial perturbations $\delta$, aggregated over data points in the test set, with adversarial perturbations serving as a proxy for the direction in which the linear approximation holds the least. The purpose of this experiment is to investigate how good the linear approximation is for different training methods, as an increasing number of activation boundaries is crossed with increasing perturbation radius. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of activation boundary tessellations in the input space.
The results are shown in Figure 2 (left). We can see that adversarial training and data-dependent spectral norm regularization give rise to models that are considerably more linear than the normally trained one, both in random as well as adversarial directions. Compared to the normally trained model, the adversarially trained and spectral norm regularized ones remain flat in random directions for perturbations of considerable magnitude, and even remain flat in the adversarial direction for perturbation magnitudes up to the order of the $\epsilon$ used during adversarial training, while the deviation from linearity seems to increase roughly linearly with $\|\delta\|_2$ thereafter. The global spectral norm regularized model behaves similarly to the normally trained one (curve omitted).
Largest singular value over distance. Figure 2 (right) shows the largest singular value of the linear operator $J(x + \delta)$ as the distance from $x$ is increased, both along random and adversarial directions, for different training methods. We can see that the naturally trained network develops large dominant singular values around the data point during training, while the adversarially trained and data-dependent spectral norm regularized models manage to keep the dominant singular value low in the vicinity of $x$.
Alignment of adversarial perturbations with dominant singular vector as a function of $\epsilon$. Figure 3 (right) shows the cosine similarity of adversarial perturbations of magnitude $\epsilon$ with the dominant singular vector of $J(x)$, as a function of the perturbation magnitude $\epsilon$. For comparison, we also include the alignment with random perturbations. For all training methods, the larger the perturbation magnitude $\epsilon$, the less the adversarial perturbation aligns with the dominant singular vector of $J(x)$, which is to be expected given the simultaneously increasing deviation from linearity. The alignment is similar for adversarially trained and data-dependent spectral norm regularized models, and for both it is larger than that of the global spectral norm regularized and naturally trained models.
[Figure 2. Left: deviation from linearity vs. distance from x (plots/linear). Right: top singular value vs. distance from x (plots/top_singular_value).]
[Figure 3. Left: adversarial classification accuracy vs. perturbation magnitude (plots/adversarial_accuracy). Right: alignment of adversarial perturbations with the dominant singular vector vs. perturbation magnitude (plots/top_singular_vector).]
6 Conclusion
We established a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we showed that adversarial training is a data-dependent generalization of spectral norm regularization. We conducted extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones. Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.
Acknowledgements
We would like to thank Michael Tschannen and Sebastian Nowozin for insightful discussions and helpful comments.
References
 [1] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
 [2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
 [3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
 [4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
 [5] Dimitris Bertsimas and Martin S Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942, 2018.
 [6] Alberto Bietti, Grégoire Mialon, and Julien Mairal. On regularization and robustness of deep neural networks. arXiv preprint arXiv:1810.00363, 2018.
 [7] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.

 [8] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
 [9] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.

 [10] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
 [11] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.
 [12] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
 [13] Laurent El Ghaoui and Hervé Lebret. Robust solutions to leastsquares problems with uncertain data. SIAM Journal on matrix analysis and applications, 18(4):1035–1064, 1997.
 [14] Farzan Farnia, Jesse M Zhang, and David Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457, 2018.
 [15] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
 [16] Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
 [17] Alhussein Fawzi, SeyedMohsen MoosaviDezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
 [18] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
 [19] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
 [20] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
 [21] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [22] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
 [23] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
 [24] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266–2276, 2017.
 [25] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [26] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2(4), 2017.
 [27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 [28] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 [29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [30] Chunchuan Lyu, Kaizhu Huang, and HaiNing Liang. A unified gradient regularization family for adversarial examples. In 2015 IEEE International Conference on Data Mining, pages 301–309. IEEE, 2015.
 [31] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 [32] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
 [33] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 [34] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
 [35] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
 [36] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. arXiv preprint, 2017.
 [37] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
 [38] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.
 [39] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
 [40] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.
 [41] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
 [42] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2847–2854. JMLR.org, 2017.
 [43] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
 [44] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.
 [45] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. arXiv preprint arXiv:1902.04818, 2019.
 [46] Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
 [47] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
 [48] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
 [49] Carl-Johann Simon-Gabriel, Yann Ollivier, Bernhard Schölkopf, Léon Bottou, and David Lopez-Paz. Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.
 [50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
 [51] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
 [52] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [53] Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
 [54] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pages 6541–6550, 2018.

 [55] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
 [56] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
 [57] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
7 Appendix
7.1 Effect of the Adversarial Loss Function on the Logit-space Direction
The adversarial loss function determines the logit-space direction of the directional derivative in the power-method-like formulation of adversarial training in Equation 4.2.
Let us consider this for the softmax cross-entropy loss, defined as
(18) $\mathcal{L}(x, y) = -\log \operatorname{softmax}(f(x))_y = -f_y(x) + \log \sum_k \exp(f_k(x)),$
where $f(x)$ denotes the logits of the classifier and $y$ the true label.
Untargeted PGA on the softmax cross-entropy loss (forward pass): the logit-space direction is the loss gradient w.r.t. the logits,
(19) $\nabla_{f(x)} \mathcal{L}(x, y) = \operatorname{softmax}(f(x)) - e_y,$
where $e_y$ denotes the one-hot vector of the true label $y$.
Targeted PGA on the softmax cross-entropy loss (forward pass): the logit-space direction is the negative loss gradient w.r.t. the logits for the attack target $t$,
(20) $-\nabla_{f(x)} \mathcal{L}(x, t) = e_t - \operatorname{softmax}(f(x)).$
Notice that the logit gradient can be computed in a forward pass by analytically expressing it in terms of the arguments of the objective function (this is why we call the update a forward pass).
Interestingly, for a temperature-dependent softmax cross-entropy loss, the logit-space direction becomes a “label-flip” vector in the low-temperature limit (high inverse temperature $\beta \to \infty$), where the softmax converges to the argmax: $\operatorname{softmax}(\beta f(x)) \to e_{y^*}$, with $y^* = \arg\max_k f_k(x)$. E.g. for targeted attacks the direction becomes $e_t - e_{y^*}$. This implies that iterative PGA finds an input-space perturbation that corresponds to the steepest ascent of the logits along the “label-flip” direction $e_t - e_{y^*}$. See Appendix 7.2 for further details.
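The iterative procedure can be sketched as follows: the logit-space direction is pulled back to input space through the transposed Jacobian, normalized, and a small step is taken. This is an illustrative sketch only; the toy linear “network”, the step size and the number of iterations are assumptions for illustration, not the paper's exact update rule or settings.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def pga_step(x, jac, direction, eps):
    """One PGA step (sketch): pull the logit-space direction back to
    input space via the transposed Jacobian, normalize, step by eps."""
    g = jac(x).T @ direction(x)
    return x + eps * g / np.linalg.norm(g)

# Toy "network" with linear logits f(x) = W x and targeted direction e_t - p.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))
t = 2                                  # hypothetical attack target class
e_t = np.eye(4)[t]
jac = lambda x: W                      # Jacobian of a linear map is W itself
direction = lambda x: e_t - softmax(W @ x)

x = rng.normal(size=10)
p0 = softmax(W @ x)[t]
for _ in range(10):
    x = pga_step(x, jac, direction, eps=0.1)
assert softmax(W @ x)[t] > p0          # target class probability increases
```

With each step, the perturbation moves the logits toward the attack target, so the target class probability grows.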
A note on canonical link functions.
Interestingly, the gradient of the loss w.r.t. the log-odds of the classifier takes the form “prediction − target” for both the sum-of-squares error and the softmax cross-entropy loss. This is in fact a general result of modelling the target variable with a conditional distribution from the exponential family along with a canonical link (activation) function. For our purposes, this means that in both cases adversarial attacks try to find perturbations in input space that induce a logit perturbation equal to the difference between the current prediction (log-odds) and the attack target (cf. the note on the “directional derivative” interpretation of the power method).
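The “prediction − target” form of the logit gradient is easy to verify numerically. The following sketch (helper names are illustrative, not from the paper's codebase) checks the analytic gradient of the softmax cross-entropy loss against central finite differences:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """Softmax cross-entropy loss for logits z and true class index y."""
    return -np.log(softmax(z)[y])

def logit_gradient(z, y):
    """Analytic gradient of the loss w.r.t. the logits: prediction - target."""
    target = np.zeros_like(z)
    target[y] = 1.0
    return softmax(z) - target

# Check against central finite differences.
rng = np.random.default_rng(0)
z, y, eps = rng.normal(size=5), 2, 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(5)[k], y)
     - cross_entropy(z - eps * np.eye(5)[k], y)) / (2 * eps)
    for k in range(5)
])
assert np.allclose(numeric, logit_gradient(z, y), atol=1e-5)
```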
7.2 Temperature-dependent Softmax Cross-entropy based PGA Attack
The temperature-dependent softmax cross-entropy loss is defined as
(21) $\mathcal{L}_\beta(x, y) = -\log \operatorname{softmax}(\beta f(x))_y,$
where $\beta$ denotes the inverse temperature. As $\beta \to \infty$, the softmax converges pointwise to the argmax: $\operatorname{softmax}(\beta z) \to e_{y^*}$ with $y^* = \arg\max_k z_k$.
Untargeted PGA on the temperature-dependent softmax cross-entropy loss (forward pass):
(22) $\nabla_{f(x)} \mathcal{L}_\beta(x, y) = \beta \left( \operatorname{softmax}(\beta f(x)) - e_y \right)$
Targeted PGA on the temperature-dependent softmax cross-entropy loss (forward pass):
(23) $-\nabla_{f(x)} \mathcal{L}_\beta(x, t) = \beta \left( e_t - \operatorname{softmax}(\beta f(x)) \right)$
Note that we can drop the prefactor $\beta$ in the update equations for finite $\beta$, as it gets cancelled anyway when normalizing.
The interesting point is that in the low-temperature limit, the logit-space direction becomes a “label-flip” vector. E.g. for targeted attacks,
(24) $\lim_{\beta \to \infty} \left( e_t - \operatorname{softmax}(\beta f(x)) \right) = e_t - e_{y^*},$
where $y^*$ denotes the argmax of the current prediction (and we neglected the prefactor $\beta$).
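The low-temperature limit can be checked with a small numerical sketch (the logits and the target class below are chosen arbitrarily for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 3.0, 0.5, 2.0])        # logits; current prediction y* = argmax = 1
t = 3                                     # hypothetical attack target class
e_t, e_ystar = np.eye(4)[t], np.eye(4)[np.argmax(z)]

# Targeted logit-space direction at inverse temperature beta.
direction = lambda beta: e_t - softmax(beta * z)

# Moderate temperature: a soft direction; low temperature: the label-flip vector.
assert not np.allclose(direction(1.0), e_t - e_ystar, atol=1e-3)
assert np.allclose(direction(100.0), e_t - e_ystar, atol=1e-6)
```

At $\beta = 100$ the softmax is already indistinguishable from the argmax for these logits, so the direction collapses to the label-flip vector $e_t - e_{y^*}$.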
7.3 Dataset, Architecture & Training Methods
We trained Convolutional Neural Networks (CNNs) with seven hidden layers and batch normalization on the CIFAR-10 data set [27]. The CIFAR-10 dataset consists of 60k colour images in 10 classes, with 6k images per class. It comes in a prepackaged train-test split, with 50k training images and 10k test images, and can readily be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.
We conduct our experiments on a pretrained standard convolutional neural network with seven convolutional layers, augmented with BatchNorm, ReLU nonlinearities and MaxPooling. The network achieves 93.5% accuracy on the clean test set. Relevant links to download the pretrained model can be found in our codebase.
We adopt the following standard preprocessing and data augmentation scheme: Each training image is zeropadded with four pixels on each side, randomly cropped to produce a new image with the original dimensions and horizontally flipped with probability one half. We also standardize each image to have zero mean and unit variance when passing it to the classifier.
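The preprocessing and augmentation scheme can be sketched in NumPy as follows. This is a minimal sketch of the steps described above; the actual training pipeline presumably uses a deep learning framework's augmentation utilities, and the helper names are illustrative.

```python
import numpy as np

def augment(img, rng):
    """Zero-pad 4 px per side, random 32x32 crop, random horizontal flip."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)), mode="constant")
    i, j = rng.integers(0, 9, size=2)     # top-left crop corner in [0, 8]
    crop = padded[i:i + 32, j:j + 32, :]
    if rng.random() < 0.5:                # flip with probability one half
        crop = crop[:, ::-1, :]
    return crop

def standardize(img):
    """Zero mean, unit variance per image before it reaches the classifier."""
    return (img - img.mean()) / (img.std() + 1e-8)

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))               # stand-in for a CIFAR-10 image
out = standardize(augment(x, rng))
assert out.shape == (32, 32, 3)
assert abs(out.mean()) < 1e-5 and abs(out.std() - 1.0) < 1e-3
```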
The attack strength $\epsilon$ used for PGA was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified. The regularization constants of the other training methods were then chosen in such a way that they roughly achieve the same test set accuracy on clean examples as the adversarially trained model does, i.e. we allow a comparable drop in clean accuracy for regularized and adversarially trained models. When training the derived regularized models, we started from a pretrained checkpoint and ran a hyperparameter search over the number of epochs, the learning rate and the regularization constants. Table 1 summarizes the test set accuracies and hyperparameters for all the training methods we considered.
7.4 Extracting the Jacobian as a Matrix
Any neural network with its nonlinear activation functions fixed to the values they take at a given input represents a linear operator which, locally, is a good approximation to the neural network itself. We therefore developed a method to fully extract and specify this linear operator in the neighborhood of any input datapoint. We found the naive approach of determining each entry of the linear operator by consecutively probing it with individual basis vectors to be numerically unstable, and therefore settled on a more robust alternative:
In a first step, we run a set of randomly perturbed versions of the input through the network (with fixed activation functions) and record their outputs at the particular layer of interest (usually the logit layer). In a second step, we compute a linear regression on these input-output pairs to obtain a weight matrix $W$ as well as a bias vector $b$, thereby fully specifying the linear operator. The singular vectors and values of $W$ can be obtained by performing an SVD.
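The two-step extraction procedure can be sketched as follows. The function and parameter names are illustrative, and the sanity check substitutes an exactly linear map for a network with fixed activations:

```python
import numpy as np

def extract_linear_operator(f, x0, n_samples=2000, sigma=0.1, seed=0):
    """Fit W, b such that f(x) ~= W x + b near x0, by linear regression
    on randomly perturbed inputs (more stable than coordinate-wise probing)."""
    rng = np.random.default_rng(seed)
    d = x0.size
    X = x0 + sigma * rng.normal(size=(n_samples, d))   # perturbed inputs
    Y = np.stack([f(x) for x in X])                    # recorded outputs
    # Append a constant column so the regression also recovers the bias b.
    A = np.hstack([X, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    W, b = coef[:-1].T, coef[-1]
    return W, b

# Sanity check on an exactly linear map: the regression recovers it.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(3, 5))
b_true = rng.normal(size=3)
f = lambda x: W_true @ x + b_true
W, b = extract_linear_operator(f, rng.normal(size=5))
assert np.allclose(W, W_true, atol=1e-6) and np.allclose(b, b_true, atol=1e-6)

# Singular vectors and values of the extracted operator via SVD.
U, s, Vt = np.linalg.svd(W)
```

For an actual network, `f` would be the map from the input to the recorded layer activations with the nonlinearities held fixed; the regression then fits the local linearization rather than recovering it exactly.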