Deep neural networks have been used with great success for perceptual tasks such as image classification simonyan2014very ; lecun2015deep or speech recognition hinton2012deep . While they are known to be robust to random noise, it has been shown that the accuracy of deep nets can dramatically deteriorate in the face of so-called adversarial examples biggio2013evasion ; szegedy2013intriguing ; goodfellow2014explaining , i.e. small perturbations of the input signal, often imperceptible to humans, that are sufficient to induce large changes in the model output. This apparent vulnerability is worrisome as deep nets start to proliferate in the real world, including in safety-critical deployments.
Consequently, there has been a rapidly expanding literature exploring methods to find adversarial perturbations sabour2015adversarial ; papernot2016transferability ; kurakin2016adversarial ; moosavi2016deepfool ; moosavi2017universal ; madry2017towards ; athalye2018obfuscated , as well as to provide formal guarantees on the robustness of a model against specific attacks hein2017formal ; kolter2017provable ; raghunathan2018certified ; tsuzuku2018lipschitz ; cohen2019certified .
The most direct strategy of robustification, called adversarial training, aims to harden a machine learning model by immunizing it against an adversary that maliciously corrupts each training example before passing it to the model goodfellow2014explaining ; kurakin2016adversarial ; miyato2015distributional ; miyato2017virtual ; madry2017towards . A different defense strategy is to detect whether the input has been perturbed, by identifying characteristic regularities either in the adversarial manipulations themselves or in the network activations they induce grosse2017statistical ; feinman2017detecting ; xu2017feature ; metzen2017detecting ; carlini2017adversarial ; roth2019odds .
Despite practical advances in finding adversarial examples and defending against them, the definitive theoretical reason for the vulnerability of neural networks remains unclear. Bubeck et al. bubeck2018adversarial identify four mutually exclusive scenarios: (i) no robust model exists, cf. fawzi2018adversarial ; gilmer2018adversarial , (ii) learning a robust model requires too much training data, cf. schmidt2018adversarially , (iii) learning a robust model from limited training data is possible but computationally intractable (the hypothesis favoured by Bubeck et al.), and (iv) we just have not found the right training algorithm yet.
In other words, it is still an open question whether adversarial examples exist because of intrinsic flaws of the model or learning objective or whether they are solely the consequence of computational limitations or non-zero generalization error and high-dimensional statistics. In this work, we investigate the origin of adversarial vulnerability in neural networks by focusing on the attack algorithms used to find adversarial examples.
In particular, we make the following contributions:
- We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we show that adversarial training is a data-dependent generalization of spectral norm regularization.
- We conduct extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones.
Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.
2 Related Work
As deep neural networks start to proliferate in the real world, the requirement for trained models to be robust to input perturbations becomes paramount. Prominent machine learning frameworks dealing with such requirements are robust optimization el1997robust ; xu2009robustness ; bertsimas2018characterization (including distributionally robust optimization namkoong2017variance ; sinha2017certifiable ; gao2016distributionally ) and adversarial training goodfellow2014explaining ; shaham2015understanding ; kurakin2016adversarial ; miyato2017virtual ; madry2017towards . In these frameworks, machine learning models are trained to minimize the worst-case loss against an adversary that can either perturb the entire training set (in the case of robust optimization) or each training example individually (in the case of adversarial training) subject to a proximity constraint.
A number of works have suggested using regularization, often based on the input gradient, as a means to improve model robustness against adversarial attacks gu2014towards ; lyu2015unified ; cisse2017parseval ; ross2017improving ; simon2018adversarial . Interestingly, for certain problems and uncertainty sets, robust optimization is equivalent to regularization el1997robust ; xu2009robustness ; bertsimas2018characterization .
For example, for linear regression with induced matrix norm balls as uncertainty sets, the adversary’s inner maximization can equivalently be written as an operator norm penalty el1997robust ; bertsimas2018characterization . Similar results on the equivalence of robustness and regularization have also been obtained for (kernelized) SVMs xu2009robustness . Cf. bietti2018regularization for a kernel perspective on robustness and regularization of deep nets.
More recently, training methods based on spectral norm yoshida2017spectral ; miyato2018spectral ; bartlett2017spectrally ; farnia2018generalizable and Lipschitz constant regularization cisse2017parseval ; hein2017formal ; tsuzuku2018lipschitz ; raghunathan2018certified have been proposed, particularly as bounds on the spectral norm or Lipschitz constant can easily be translated to bounds on the minimal perturbation required to fool a machine learning model. Theoretical work connecting adversarial robustness with robustness to random noise fawzi2015analysis ; fawzi2016robustness and decision boundary tilting tanay2016boundary was also pursued.
Despite there being a well-established learning theory for standard non-robust classification, including generalization bounds for neural networks, cf. for instance boucheron2005theory ; anthony2009neural , the theoretical understanding of the robust learning problem is still very limited. Recent works starting to fill this gap include Lipschitz-sensitive generalization bounds neyshabur2015norm , spectrally-normalized margin bounds for neural networks bartlett2017spectrally , as well as stronger generalization bounds for deep nets via compression arora2018stronger .
3.1 Robust Optimization and Regularization for Linear Regression
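We begin by distilling the basic ideas on the relation between robust optimization and regularization presented in bertsimas2018characterization . Consider linear regression with additive perturbations $\Delta$ of the data matrix $X$,
$$\min_{w} \max_{\Delta \in \mathcal{U}} h\big(y - (X + \Delta)\,w\big), \tag{1}$$
where $\mathcal{U}$ denotes the uncertainty set. A general way to construct $\mathcal{U}$ is as a ball of bounded matrix norm perturbations, $\mathcal{U} = \{\Delta : \|\Delta\| \le \lambda\}$. Of particular interest are induced matrix norms
$$\|A\|_{g,h} := \max_{w \neq 0} \frac{h(Aw)}{g(w)}, \tag{2}$$
where $h$ is a semi-norm and $g$ is a norm. If $h$ fulfills the triangle inequality, then one can upper bound the robust optimization objective (left) by a regularized objective (right),
$$h\big(y - (X + \Delta)\,w\big) \overset{(a)}{\le} h(y - Xw) + h(\Delta w) \overset{(b)}{\le} h(y - Xw) + \lambda\, g(w) \quad \forall \Delta \in \mathcal{U}, \tag{3}$$
by using (a) the triangle inequality and (b) the definition of the matrix norm.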
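The question then is under which circumstances both inequalities become equalities at the maximizing $\Delta^*$. It is straightforward to check (bertsimas2018characterization , Theorem 1) that specifically we may choose the rank-one matrix
$$\Delta^* = -\frac{\lambda}{h(r)}\, r v^\top, \quad \text{where } r = y - Xw \text{ and } v \in \arg\max_{v \,:\, g^*(v) = 1} v^\top w, \tag{4}$$
with $g^*$ denoting the dual norm of $g$. If $h(r) = 0$ then one can pick any $u$ for which $h(u) = 1$ to form $\Delta^* = -\lambda\, u v^\top$ (such a $u$ has to exist if $h$ is not identically zero). This shows that, for robust linear regression with induced matrix norm uncertainty sets, robust optimization is equivalent to regularization.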
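The equality case can be checked numerically. The following is a minimal sketch, assuming the residual convention $y - (X + \Delta)w$ and taking both $h$ and $g$ to be the Euclidean norm (so the induced matrix norm is the spectral norm and the maximizing $v$ is $w / \|w\|_2$). All variable names and sizes are illustrative and not taken from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.3             # samples, features, uncertainty radius (illustrative)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

r = y - X @ w                      # residual of the unperturbed problem
v = w / np.linalg.norm(w)          # maximizes v^T w subject to ||v||_2 = 1
# Rank-one worst-case perturbation for h = g = l2 norm (assumed choice of norms).
delta_star = -lam * np.outer(r, v) / np.linalg.norm(r)

robust_loss = np.linalg.norm(y - (X + delta_star) @ w)
regularized_loss = np.linalg.norm(r) + lam * np.linalg.norm(w)

# A random perturbation on the boundary of the same ball stays below the bound.
delta_rand = rng.normal(size=(n, d))
delta_rand *= lam / np.linalg.norm(delta_rand, 2)
random_loss = np.linalg.norm(y - (X + delta_rand) @ w)

print("spectral norm of delta_star:", np.linalg.norm(delta_star, 2))
print("robust loss at delta_star:  ", robust_loss)
print("regularized loss:           ", regularized_loss)
print("loss at a random delta:     ", random_loss)
```

The rank-one perturbation sits on the boundary of the uncertainty set and attains the regularized upper bound, whereas a generic perturbation of the same norm does not exceed it.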
3.2 Global Spectral Norm Regularization
In this section we rederive spectral norm regularization à la Yoshida et al. yoshida2017spectral , while also setting up the notation for later. Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote input-label pairs generated from a data distribution $P$. Let $f : \mathcal{X} \subset \mathbb{R}^n \to \mathbb{R}^d$ denote the logits of a $\theta$-parameterized piecewise linear classifier, i.e., $f(\cdot) = W^L \phi^{L-1}(W^{L-1} \phi^{L-2}(\dots) + b^{L-1}) + b^L$, where $\phi^\ell$ is the activation function, and $W^\ell$, $b^\ell$ denote the layer-wise weight matrix and bias vector, collectively denoted by $\theta$. (Note that convolutional layers can be constructed as matrix multiplications by converting the convolution operator into a Toeplitz matrix.) Let us furthermore assume that each activation function is a ReLU (the argument can easily be generalized to other piecewise linear activations). In this case, the activations $\phi^\ell$ act as input-dependent diagonal matrices $\Phi^\ell_x := \mathrm{diag}(\phi^\ell_x)$, where an element in the diagonal $\phi^\ell_x := \mathbb{1}(\tilde{x}^\ell > 0)$ is one if the corresponding pre-activation $\tilde{x}^\ell := W^\ell \phi^{\ell-1}(\cdot) + b^\ell$ is positive and equal to zero otherwise.
Following Raghu et al. raghu2017expressive , we call $\phi_x := (\phi^1_x, \dots, \phi^{L-1}_x) \in \{0,1\}^m$ the “activation pattern”, where $m$ is the number of neurons in the network. For any activation pattern $\phi \in \{0,1\}^m$ we can define the preimage $X(\phi) := \{x \in \mathbb{R}^n : \phi_x = \phi\}$, inducing a partitioning of the input space via $\mathbb{R}^n = \bigcup_{\phi} X(\phi)$. Note that some $X(\phi) = \emptyset$, as not all combinations of activations may be feasible. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of ReLU tessellations of the input space.
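We can linearize $f$ within a neighborhood around $x$ as follows
$$f(x + \Delta x) \simeq f(x) + J_f(x)\, \Delta x, \tag{5}$$
where $J_f(x)$ denotes the Jacobian of $f$ at $x$,
$$J_f(x) = W^L \Phi^{L-1}_x W^{L-1} \Phi^{L-2}_x \cdots \Phi^{1}_x W^{1}. \tag{6}$$
We have the following bound for $\Delta x \neq 0$
$$\frac{\|f(x + \Delta x) - f(x)\|_2}{\|\Delta x\|_2} \simeq \frac{\|J_f(x)\, \Delta x\|_2}{\|\Delta x\|_2} \le \sigma(J_f(x)) := \max_{v \,:\, \|v\|_2 = 1} \|J_f(x)\, v\|_2, \tag{7}$$
where $\sigma(J_f(x))$ is the spectral norm (largest singular value) of the linear operator $J_f(x)$. From a robustness perspective we want $\sigma(J_f(x))$ to be small in regions that are supported by the data.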
Based on the decomposition in Equation 6 and the non-expansiveness of the activations, $\sigma(\Phi^\ell_x) \le 1$ for every $\ell \in \{1, \dots, L-1\}$, Yoshida et al. yoshida2017spectral suggest to upper-bound the spectral norm of the Jacobian by the product of the spectral norms of the individual weight matrices
$$\sigma(J_f(x)) \le \prod_{\ell=1}^{L} \sigma(W^\ell). \tag{8}$$
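To make the Jacobian decomposition and the product bound concrete, here is a small sketch for a randomly initialized ReLU network. The layer sizes, the initialization and the helper names `forward` and `jacobian` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]   # input dim, two hidden layers, logits (illustrative)
Ws = [rng.normal(size=(m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    """Return the logits and the 0/1 activation masks of the hidden layers."""
    masks, h = [], x
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre = W @ h + b
        mask = (pre > 0).astype(float)
        masks.append(mask)
        h = mask * pre               # ReLU written as a diagonal 0/1 mask
    return Ws[-1] @ h + bs[-1], masks

def jacobian(x):
    """Product of weight matrices interleaved with the masks active at x."""
    _, masks = forward(x)
    J = Ws[0]
    for W, mask in zip(Ws[1:], masks):
        J = W @ (mask[:, None] * J)  # mask[:, None] * J equals diag(mask) @ J
    return J

x = rng.normal(size=sizes[0])
J = jacobian(x)
local_norm = np.linalg.norm(J, 2)                             # data-dependent spectral norm
global_bound = np.prod([np.linalg.norm(W, 2) for W in Ws])    # data-independent bound
dx = 1e-6 * rng.normal(size=sizes[0])                         # tiny step inside the linear region

print("sigma(J_f(x)):        ", local_norm)
print("prod_l sigma(W^l):    ", global_bound)
print("linearization error:  ", np.linalg.norm(forward(x + dx)[0] - forward(x)[0] - J @ dx))
```

The gap between the two printed norms gives a feel for how loose the global bound can be at a particular input.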
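The layer-wise spectral norms $\sigma(W^\ell)$ can be computed iteratively using the power method. Starting with a random vector $v^\ell_0$, the power method iteratively computes
$$u^\ell_k \leftarrow \frac{W^\ell v^\ell_{k-1}}{\|W^\ell v^\ell_{k-1}\|_2}, \qquad v^\ell_k \leftarrow \frac{(W^\ell)^\top u^\ell_k}{\|(W^\ell)^\top u^\ell_k\|_2}. \tag{9}$$
The (final) singular value can be computed via $\sigma(W^\ell) \approx (u^\ell_k)^\top W^\ell v^\ell_k$.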
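The power method described above can be sketched in a few lines. The function name `power_method`, the iteration count and the test matrix are illustrative assumptions; the result is compared against a full SVD as a sanity check.

```python
import numpy as np

def power_method(W, num_iters=500, seed=0):
    """Approximate the dominant singular value and singular vectors of W."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = W @ v                  # forward step
        u /= np.linalg.norm(u)
        v = W.T @ u                # backward step
        v /= np.linalg.norm(v)
    sigma = u @ W @ v              # singular value estimate u^T W v
    return sigma, u, v

W = np.random.default_rng(1).normal(size=(6, 4))
sigma, u, v = power_method(W)
sigma_svd = np.linalg.svd(W, compute_uv=False)[0]
print("power method:", sigma)
print("full SVD:    ", sigma_svd)
```

The number of iterations needed depends on the gap between the two largest singular values; during training a single iteration per update is typically warm-started from the previous estimate.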
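Yoshida et al. suggest to turn this upper-bound into a global (data-independent) regularizer by learning the parameters $\theta$ via the following penalized empirical risk minimization
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \hat{P}} \big[ \mathcal{L}(y, f(x)) \big] + \frac{\lambda}{2} \sum_{\ell=1}^{L} \sigma(W^\ell)^2, \tag{10}$$
where $\mathcal{L}(\cdot,\cdot)$ denotes an arbitrary classification loss and $\hat{P}$ the empirical data distribution. Note, since the parameter gradient of $\sigma(W^\ell)^2/2$ is $\sigma^\ell u^\ell (v^\ell)^\top$, with $\sigma^\ell$, $u^\ell$ and $v^\ell$ being the dominant singular value and singular vectors of $W^\ell$ (approximated via the power method), Yoshida et al.’s global spectral norm regularizer effectively adds a term $\lambda\, \sigma^\ell u^\ell (v^\ell)^\top$ for each layer $\ell$ to the parameter gradient of the loss function. In terms of computational complexity, because the global regularizer decouples from the empirical loss term, a single power method iteration per parameter update step usually suffices in practice yoshida2017spectral .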
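The gradient expression used by the global regularizer, namely that half the squared spectral norm of a weight matrix has gradient equal to the dominant singular value times the outer product of the dominant singular vectors, can be verified with a finite-difference check. This is a sketch on a random matrix; the names and the step size `h` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))

U, S, Vt = np.linalg.svd(W)
sigma, u, v = S[0], U[:, 0], Vt[0]
analytic_grad = sigma * np.outer(u, v)       # claimed gradient of sigma(W)^2 / 2

def half_sq_spectral_norm(M):
    return 0.5 * np.linalg.norm(M, 2) ** 2

E = rng.normal(size=W.shape)                 # random direction in parameter space
h = 1e-6
# Central finite difference of the regularizer along direction E.
numeric = (half_sq_spectral_norm(W + h * E) - half_sq_spectral_norm(W - h * E)) / (2 * h)
analytic = np.sum(analytic_grad * E)         # directional derivative from the formula

print("finite difference:", numeric)
print("analytic:         ", analytic)
```

The check relies on the dominant singular value being non-degenerate, which holds almost surely for a random matrix.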
3.3 Global vs. Local Regularizers
The advantage of global bounds is that they trivially generalize from the training to the test set. The problem however is that they can be arbitrarily loose, e.g. penalizing the spectral norm over irrelevant regions of the ambient space. To illustrate this, consider the ideal robust classifier that is essentially piecewise constant on class-conditional regions, with sharp transitions between the classes. The global spectral norm will be heavily influenced by the sharp transition zones, whereas a local data-dependent bound can adapt to regions where the classifier is approximately constant hein2017formal . In other words, we would expect a global regularizer to have the largest effect in the empty parts of the input space. On the contrary, a local regularizer has its main effect around the data manifold.
4 Adversarial Training Generalizes Spectral Norm Regularization
4.1 Data-dependent Spectral Norm Regularization
We now show how to directly regularize the data-dependent spectral norm of the Jacobian $J_f(x)$. Under the assumption that the dominant singular value is non-degenerate (which, for practical purposes, we can safely assume due to numerical errors), the problem of computing the largest singular value and the corresponding left and right singular vectors can efficiently be solved via the power method. Let $v_0$ be a random vector or an approximation to the dominant right singular vector of $J_f(x)$. The power method iteratively computes
$$u_k \leftarrow \frac{J_f(x)\, v_{k-1}}{\|J_f(x)\, v_{k-1}\|_2}, \qquad v_k \leftarrow \frac{J_f(x)^\top u_k}{\|J_f(x)^\top u_k\|_2}. \tag{11}$$
The (final) singular value can be computed via $\sigma(J_f(x)) \approx u^\top J_f(x)\, v$. Note, the right singular vector $v$ gives the direction in input space that corresponds to the steepest ascent of $f(x)$ along $u$.
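The data-dependent power method never needs the Jacobian as an explicit matrix: a Jacobian-vector product (forward step) and a Jacobian-transpose-vector product (backward step) suffice. Below is a minimal sketch for a hypothetical two-layer ReLU network; the helper names `jvp` and `vjp`, the network sizes and the iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU network: 8 inputs, 16 hidden units, 4 logits.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def mask_at(x):
    """0/1 activation pattern of the hidden layer at input x."""
    return (W1 @ x + b1 > 0).astype(float)

def jvp(x, v):
    """Jacobian-vector product J_f(x) v (one linearized forward pass)."""
    return W2 @ (mask_at(x) * (W1 @ v))

def vjp(x, u):
    """Jacobian-transpose-vector product J_f(x)^T u (one backward pass)."""
    return W1.T @ (mask_at(x) * (W2.T @ u))

def jacobian_power_method(x, num_iters=500):
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = jvp(x, v)
        u /= np.linalg.norm(u)
        v = vjp(x, u)
        v /= np.linalg.norm(v)
    return u @ jvp(x, v), u, v

x = rng.normal(size=8)
sigma_pm, u, v = jacobian_power_method(x)
J = W2 @ (mask_at(x)[:, None] * W1)          # explicit Jacobian, for comparison only
sigma_svd = np.linalg.svd(J, compute_uv=False)[0]
print("power method on J_f(x):", sigma_pm)
print("SVD of explicit J_f(x):", sigma_svd)
```

In a deep learning framework the two products would come from forward-mode and reverse-mode automatic differentiation rather than from hand-written masks.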
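We can turn this into a regularizer by learning the parameters $\theta$ via the following Jacobian-based spectral norm penalized empirical risk minimization
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \hat{P}} \Big[ \mathcal{L}(y, f(x)) + \frac{\lambda}{2} \big( u^\top J_f(x)\, v \big)^2 \Big], \tag{12}$$
where $u$ and $v$ are the data-dependent singular vectors of $J_f(x)$, computed via Equation 11.
By optimality / stationarity of the power method, $u = J_f(x) v / \|J_f(x) v\|_2$ and hence $u^\top J_f(x)\, v = \|J_f(x)\, v\|_2$, and by linearization, $f(x + v) - f(x) \simeq J_f(x)\, v$, we can regularize learning also via the following sum-of-squares based spectral norm regularizer
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \hat{P}} \Big[ \mathcal{L}(y, f(x)) + \frac{\lambda}{2} \| f(x + v) - f(x) \|_2^2 \Big], \tag{13}$$
where the data-dependent singular vector $v$ of $J_f(x)$ is computed via Equation 11.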
Both variants can readily be implemented in modern deep learning frameworks. We found the sum-of-squares based spectral norm regularizer to be more numerically stable than the Jacobian based one, which is why we used this variant in our experiments. In terms of computational complexity, the data-dependent regularizer is a constant (number of power method iterations) times more expensive than the data-independent variant.
4.2 Power Method Formulation of Adversarial Training
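Adversarial training goodfellow2014explaining ; kurakin2016adversarial ; madry2017towards aims to improve the robustness of a machine learning model by training it against an adversary that independently perturbs each training example subject to a proximity constraint, e.g. in $\ell_p$-norm,
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \hat{P}} \Big[ \mathcal{L}(y, f(x)) + \lambda \max_{x^* \in B^p_\epsilon(x)} \mathcal{L}_{\mathrm{adv}}(y, f(x^*)) \Big], \tag{14}$$
where $B^p_\epsilon(x) := \{x^* : \|x^* - x\|_p \le \epsilon\}$ is the norm ball around $x$ and $\mathcal{L}_{\mathrm{adv}}$ denotes the loss function used to find adversarial perturbations (it does not need to be the same as the classification loss $\mathcal{L}$).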
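The adversarial example $x^*$ is typically computed iteratively, e.g. via $\ell_2$-norm constrained projected gradient ascent madry2017towards ; kurakin2016adversarial (the general $\ell_p$-norm constrained case is similar)
$$x_k = \Pi_{B^2_\epsilon(x)} \Big( x_{k-1} + \alpha\, \frac{\nabla_x \mathcal{L}_{\mathrm{adv}}(y, f(x_{k-1}))}{\|\nabla_x \mathcal{L}_{\mathrm{adv}}(y, f(x_{k-1}))\|_2} \Big), \tag{15}$$
where $\Pi_{B^2_\epsilon(x)}$ is the projection operator into the norm ball $B^2_\epsilon(x) := \{x^* : \|x^* - x\|_2 \le \epsilon\}$, $\alpha$ is a small step-size and $y$ is the true or predicted label. For targeted attacks the sign in front of $\alpha$ is flipped, so as to descend the loss function into the direction of the target label.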
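By the chain-rule, the gradient-step can be expressed as a Jacobian vector product while the projection into the $\ell_2$-norm ball can be expressed as a normalization. Thus, $\ell_2$-norm constrained projected gradient ascent can equivalently be written as (the normalization of $\tilde{u}_k$ is optional)
$$\begin{aligned}
\tilde{u}_k &\leftarrow \nabla_z \mathcal{L}_{\mathrm{adv}}(y, z)\big|_{z = f(x_{k-1})}, & u_k &\leftarrow \tilde{u}_k / \|\tilde{u}_k\|_2, \\
\tilde{v}_k &\leftarrow J_f(x_{k-1})^\top u_k, & v_k &\leftarrow \tilde{v}_k / \|\tilde{v}_k\|_2, \\
\tilde{x}_k &\leftarrow x_{k-1} + \alpha\, v_k, & x_k &\leftarrow x + \epsilon\, \frac{\tilde{x}_k - x}{\max(\epsilon, \|\tilde{x}_k - x\|_2)},
\end{aligned} \tag{16}$$
where the $\max(\cdot)$ ensures that if $\|\tilde{x}_k - x\|_2 \le \epsilon$ then $x_k = \tilde{x}_k$. Note that the logit-space gradient $\tilde{u}_k$ can be computed in a single forward pass, by directly expressing it in terms of the arguments of the adversarial loss.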
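The update structure described above (logit-space gradient, Jacobian-transpose product, normalized ascent step, projection by rescaling) can be sketched as follows for a softmax cross-entropy adversarial loss. The toy network, the function name `pga_l2`, the choice to start at the clean input, and the values of `eps` and `alpha` are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU classifier: 8 inputs, 16 hidden units, 4 classes.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def vjp(x, u):
    """Jacobian-transpose-vector product J_f(x)^T u via one backward pass."""
    mask = (W1 @ x + b1 > 0).astype(float)
    return W1.T @ (mask * (W2.T @ u))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pga_l2(x, y, eps, alpha, num_iters=10):
    """l2-constrained projected gradient ascent in power-method-like form."""
    x_adv = x.copy()   # assumption: start at the clean input, not a random point
    for _ in range(num_iters):
        u = softmax(logits(x_adv))
        u[y] -= 1.0                          # logit-space gradient of cross-entropy
        v = vjp(x_adv, u)                    # input-space gradient direction
        v /= np.linalg.norm(v) + 1e-12
        x_tilde = x_adv + alpha * v          # ascent step
        diff = x_tilde - x
        # Projection: rescale only if the iterate has left the eps-ball.
        x_adv = x + eps * diff / max(eps, np.linalg.norm(diff))
    return x_adv

x = rng.normal(size=8)
y = int(np.argmax(logits(x)))                # attack the predicted label
eps = 0.5
x_adv = pga_l2(x, y, eps=eps, alpha=0.1)
print("perturbation norm:", np.linalg.norm(x_adv - x))
```

For cross-entropy the logit-space gradient is simply the softmax output minus the one-hot label, which is why it is available after a single forward pass.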
Comparing the update equations for projected gradient ascent based adversarial training with those of data-dependent spectral norm regularization, we can see that adversarial training generalizes spectral norm regularization in two ways: (i) via the choice of the adversarial loss function and (ii) by iterating within the norm ball (whereas spectral norm regularization keeps the input fixed).
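Indeed, keeping the input fixed during the attack and taking the sum-of-squares loss on the logits of the classifier, i.e. with $J_f(x_{k-1}) \to J_f(x)$ and $\mathcal{L}_{\mathrm{adv}}(y, f(x^*)) = \frac{1}{2} \|f(x^*) - f(x)\|_2^2$, allows us to recover data-dependent spectral norm regularization: the logit-space gradient becomes $\tilde{u}_k = f(x_{k-1}) - f(x) \simeq J_f(x)(x_{k-1} - x)$, which is the forward step of the power method, while $\tilde{v}_k = J_f(x)^\top u_k$ is its backward step.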
Finally, note that the adversarial loss function determines the logit-space direction of the directional derivative in the power method formulation of adversarial training, as shown in Section 7.1 in the Appendix for an example using the softmax cross-entropy loss. The effect of iterating on the range of regularization is investigated in detail in the Experiments Section 5.3.
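As a sanity check of the special case discussed above, the sketch below runs the attack iteration with a sum-of-squares loss on the logits and the Jacobian frozen at the clean input, and compares the resulting direction with the dominant right singular vector of the Jacobian. The toy network, the helper names and the very small radius `eps` (chosen so that the perturbed input stays in the same linear region) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU network: 8 inputs, 16 hidden units, 4 logits.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def jacobian(x):
    mask = (W1 @ x + b1 > 0).astype(float)
    return W2 @ (mask[:, None] * W1)

def sum_of_squares_attack_direction(x, eps=1e-5, num_iters=500):
    """Attack iteration with loss 0.5 * ||f(x*) - f(x)||^2 and Jacobian fixed at x."""
    J = jacobian(x)
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = logits(x + eps * v) - logits(x)  # logit-space gradient of the loss
        u /= np.linalg.norm(u)
        v = J.T @ u                          # backward step with the fixed Jacobian
        v /= np.linalg.norm(v)
    return v

x = rng.normal(size=8)
v_attack = sum_of_squares_attack_direction(x)
v_top = np.linalg.svd(jacobian(x))[2][0]     # dominant right singular vector
alignment = abs(v_attack @ v_top)
print("cosine similarity with dominant singular vector:", alignment)
```

With a different adversarial loss or with the Jacobian re-evaluated at each iterate, the direction generally departs from the dominant singular vector, which is the sense in which adversarial training generalizes the regularizer.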
5 Experimental Results
5.1 Dataset, Architecture & Training Methods
We use a 7-layer CNN as our default platform, since it has good test set accuracy at acceptable computational requirements (we used an estimatedk GPU hours (Titan X) in total for all our experiments). We train each classifier with a number of different training methods: (i) ‘Standard’: standard empirical risk minimization with a softmax cross-entropy loss, (ii) ‘Adversarial’: norm-constrained projected gradient ascent (PGA) based adversarial training with a softmax cross-entropy loss, (iii) ‘Yoshida’: global spectral norm regularization à la Yoshida et al. yoshida2017spectral in Equation 10, and (iv) ‘SNR’: data-dependent spectral norm regularization, as in Equation 13.
As a default attack strategy we use a norm-constrained PGA white-box attack with 10 attack iterations. The attack strength $\epsilon$ used for training (indicated by a vertical dashed line in the figures below) was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified. The regularization constants of the other training methods were then chosen in such a way that they roughly achieve the same test set accuracy on clean examples as the adversarially trained model does. Further details regarding the experimental setup can be found in Section 7.3 in the Appendix. Table 1 summarizes the test set accuracies and hyper-parameters for all the training methods we considered. Shaded areas in the plots below denote standard errors with respect to the number of test set samples over which the experiment was repeated.
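Table 1: Test set accuracies and hyper-parameters.
| Training Method | Test Set Accuracy | Hyper-parameters |
|---|---|---|
| Global Spectral Norm Reg. | 80.4% | iters= |
| Data-dep. Spectral Norm Reg. | 84.6% | iters |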
5.2 Adversarial Training vs. Spectral Norm Regularization
Effect of training method on singular value spectrum. We compute the singular value spectrum of the Jacobian $J_f(x)$ for networks trained with different training methods and evaluated at a number of different test set examples. Since we are interested in computing the full singular value spectrum, and not just the dominant singular value and singular vectors as during training, the power method would be impractical, as it gives us access to only one (the dominant) singular value-vector pair at a time. Instead, we first extract the Jacobian (which is per se defined as a computational graph in modern deep learning frameworks) as an input-dim × output-dim matrix and then use available matrix factorization routines to compute the full SVD of the extracted matrix. For each training method, the procedure is repeated for randomly chosen clean and corresponding adversarially perturbed test set examples. Further details regarding the Jacobian extraction can be found in Section 7.4 in the Appendix.
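The extraction step can be pictured as follows: one backward pass per output unit yields one row of the Jacobian, after which a standard SVD routine gives the full spectrum. This sketch uses a hypothetical two-layer ReLU network and hand-written backward passes; it is not the extraction code referred to in the Appendix, and the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU network: 8 inputs, 16 hidden units, 4 logits.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def vjp(x, u):
    """Jacobian-transpose-vector product J_f(x)^T u (one backward pass)."""
    mask = (W1 @ x + b1 > 0).astype(float)
    return W1.T @ (mask * (W2.T @ u))

def extract_jacobian(x, out_dim):
    """Stack the input gradients of each logit into an (out_dim, in_dim) matrix."""
    return np.stack([vjp(x, e) for e in np.eye(out_dim)])

x = rng.normal(size=8)
J = extract_jacobian(x, out_dim=4)
singular_values = np.linalg.svd(J, compute_uv=False)   # full spectrum, sorted descending
print("Jacobian shape:", J.shape)
print("singular values:", singular_values)
```

For a large network the same loop would be run with automatic differentiation, at a cost of one backward pass per output dimension.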
The results are shown in Figure 1 (left). We can see that, compared to the spectrum of the normally trained and global spectral norm regularized model, the spectrum of adversarially trained and data-dependent spectral norm regularized models is significantly damped after training. In fact, the data-dependent spectral norm regularizer seems to dampen the singular values even slightly more effectively than adversarial training, while global spectral norm regularization has almost no effect compared to standard training.
Alignment of adversarial perturbations with singular vectors. We compute the cosine-similarity of adversarial perturbations with the singular vectors $v_r$ of the Jacobian $J_f(x)$, extracted at a number of test set examples, as a function of the rank $r$ of the singular vectors returned by the SVD. For comparison we also show the cosine-similarity with the singular vectors of a random network as well as the cosine-similarity with random perturbations.
The results are shown in Figure 1 (right). We can see that for all training methods (except the random network) adversarial perturbations are strongly aligned with the dominant singular vectors while the alignment decreases towards the bottom-ranked singular vectors. For the random network, the alignment is roughly constant with respect to rank. Interestingly, this strong alignment with dominant singular vectors also explains why input gradient regularization and fast gradient method (FGM) based adversarial training do not sufficiently protect against adversarial attacks, namely because the input gradient, resp. a single power method iteration, do not yield a sufficiently good approximation for the dominant singular vector in general.
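The alignment measurement can be sketched as below: compute an adversarial perturbation, take the SVD of the Jacobian at the clean input, and report the absolute cosine-similarity with each right singular vector in order of rank. The untrained toy network, the compact attack routine and all parameter values are assumptions for illustration, so the numbers it prints say nothing about the trained models in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU classifier: 8 inputs, 16 hidden units, 4 classes.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def jacobian(x):
    mask = (W1 @ x + b1 > 0).astype(float)
    return W2 @ (mask[:, None] * W1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pga_perturbation(x, y, eps=0.5, alpha=0.1, num_iters=10):
    """Compact l2 projected gradient ascent on the cross-entropy loss."""
    x_adv = x.copy()
    for _ in range(num_iters):
        u = softmax(logits(x_adv))
        u[y] -= 1.0
        v = jacobian(x_adv).T @ u
        x_tilde = x_adv + alpha * v / (np.linalg.norm(v) + 1e-12)
        diff = x_tilde - x
        x_adv = x + eps * diff / max(eps, np.linalg.norm(diff))
    return x_adv - x

def cosine(a, b):
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = rng.normal(size=8)
y = int(np.argmax(logits(x)))
delta = pga_perturbation(x, y)
_, S, Vt = np.linalg.svd(jacobian(x), full_matrices=False)  # rows of Vt ordered by rank
adv_alignment = np.array([cosine(delta, v) for v in Vt])
rand_alignment = np.array([cosine(rng.normal(size=8), v) for v in Vt])
print("adversarial perturbation:", adv_alignment)
print("random perturbation:     ", rand_alignment)
```

Because the right singular vectors are orthonormal, the squared cosines of any perturbation sum to at most one, which makes the per-rank values directly comparable.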
Adversarial classification accuracy. A plot of the classification accuracy on adversarially perturbed test examples, as a function of the perturbation strength $\epsilon$, can be found in Figure 3 (left). We can see that the adversarial accuracy of the data-dependent spectral norm regularized model lies between that of the normally and the adversarially trained model, while global spectral norm regularization does not seem to robustify the model against adversarial attacks. That is, data-dependent spectral norm regularization can already account for a considerable amount of robustness against adversarial examples. This is in line with our earlier observation that adversarial perturbations tend to align with dominant singular vectors and supports our theoretical result that adversarial training generalizes spectral norm regularization. We speculate that the remaining gap between adversarially trained models and data-dependent spectral norm regularized ones might to some extent be attributable to the fact that we used the same norm-constrained attack for evaluation and adversarial training, allowing the adversarially trained model to profit from potential overfitting to the specific attack.
Figure 1: (left) Singular value spectrum of the Jacobian and (right) alignment of adversarial perturbations with singular vectors, for the different training methods.
5.3 Local Linearity & Range of Regularization Effects
Local linearity. In order to determine the size of the area where a locally linear approximation is valid, we measure the deviation from linearity as the distance to $x$ is increased in random and adversarial directions, i.e. we measure the norm of the difference between the network evaluated at $x + z$ and its first-order (Jacobian-based) approximation around $x$ as a function of the distance $\|z\|_2$, for random and adversarial perturbations $z$, aggregated over data points in the test set, with adversarial perturbations serving as a proxy for the direction in which the linear approximation holds the least. The purpose of this experiment is to investigate how good the linear approximation is for different training methods, as an increasing number of activation boundaries are crossed with increasing perturbation radius. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of activation boundary tessellations in the input space.
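A minimal sketch of such a measurement is given below, assuming the deviation is evaluated on the logits of a hypothetical two-layer ReLU network and along a single random direction; the function name `deviation_from_linearity` and the grid of distances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-layer ReLU network: 8 inputs, 16 hidden units, 4 logits.
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def jacobian(x):
    mask = (W1 @ x + b1 > 0).astype(float)
    return W2 @ (mask[:, None] * W1)

def deviation_from_linearity(x, direction, distances):
    """Norm of f(x + z) - (f(x) + J_f(x) z) for z = r * direction / ||direction||."""
    d = direction / np.linalg.norm(direction)
    J, fx = jacobian(x), logits(x)
    return np.array([np.linalg.norm(logits(x + r * d) - (fx + J @ (r * d)))
                     for r in distances])

x = rng.normal(size=8)
distances = np.linspace(0.0, 5.0, 11)
deviations = deviation_from_linearity(x, rng.normal(size=8), distances)
for r, dev in zip(distances, deviations):
    print(f"distance {r:4.1f}: deviation {dev:.4f}")
```

The deviation is exactly zero as long as the perturbed input stays inside the linear region of $x$ and grows once activation boundaries are crossed.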
The results are shown in Figure 2 (left). We can see that adversarial training and data-dependent spectral norm regularization give rise to models that are considerably more linear than the normally trained one, both in random and in adversarial directions. Compared to the normally trained model, the adversarially trained and spectral norm regularized ones remain flat in random directions for perturbations of considerable magnitude and even remain flat in the adversarial direction for perturbation magnitudes up to the order of the $\epsilon$ used during adversarial training, while the deviation from linearity seems to increase roughly linearly with the perturbation magnitude thereafter. The global spectral norm regularized model behaves similarly to the normally trained one (curve omitted).
Largest singular value over distance. Figure 2 (right) shows the largest singular value of the local linear operator (the Jacobian) as the distance from $x$ is increased, both along random and adversarial directions, for different training methods. We can see that the normally trained network develops large dominant singular values around the data point during training, while the adversarially trained and data-dependent spectral norm regularized models manage to keep the dominant singular value low in the vicinity of $x$.
Alignment of adversarial perturbations with dominant singular vector as a function of $\epsilon$. Figure 3 (right) shows the cosine-similarity of adversarial perturbations of magnitude $\epsilon$ with the dominant singular vector of $J_f(x)$, as a function of the perturbation magnitude $\epsilon$. For comparison, we also include the alignment with random perturbations. For all training methods, the larger the perturbation magnitude $\epsilon$, the less the adversarial perturbation aligns with the dominant singular vector of $J_f(x)$, which is to be expected for a simultaneously increasing deviation from linearity. The alignment is similar for adversarially trained and data-dependent spectral norm regularized models, and for both it is larger than that of global spectral norm regularized and normally trained models.
Figure 2: (left) Deviation from linearity and (right) top singular value, each as a function of the distance from $x$.
Figure 3: (left) Adversarial accuracy and (right) singular vector alignment, each as a function of the perturbation magnitude.
We established a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we showed that adversarial training is a data-dependent generalization of spectral norm regularization. We conducted extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones. Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.
We would like to thank Michael Tschannen and Sebastian Nowozin for insightful discussions and helpful comments.
-  Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
-  Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
-  Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
-  Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
-  Dimitris Bertsimas and Martin S Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942, 2018.
-  Alberto Bietti, Grégoire Mialon, and Julien Mairal. On regularization and robustness of deep neural networks. arXiv preprint arXiv:1810.00363, 2018.
-  Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
-  Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
-  Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.
-  Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
-  Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.
-  Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
-  Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on matrix analysis and applications, 18(4):1035–1064, 1997.
-  Farzan Farnia, Jesse M Zhang, and David Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457, 2018.
-  Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
-  Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
-  Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
-  Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
-  Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
-  Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
-  Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
-  Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266–2276, 2017.
-  Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
-  J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2(4), 2017.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
-  Chunchuan Lyu, Kaizhu Huang, and Hai-Ning Liang. A unified gradient regularization family for adversarial examples. In 2015 IEEE International Conference on Data Mining, pages 301–309. IEEE, 2015.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
-  Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
-  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. arXiv preprint, 2017.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. Number EPFL-CONF-218057, 2016.
-  Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.
-  Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
-  Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.
-  Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
-  Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR.org, 2017.
-  Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
-  Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.
-  Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. arXiv preprint arXiv:1902.04818, 2019.
-  Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
-  Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
-  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
-  Carl-Johann Simon-Gabriel, Yann Ollivier, Bernhard Schölkopf, Léon Bottou, and David Lopez-Paz. Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
-  Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
-  Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pages 6541–6550, 2018.
-  Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
-  Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
-  Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
7.1 Effect of the Adversarial Loss Function on the Logit-space Direction
The adversarial loss function determines the logit-space direction of the directional derivative in the power-method-like formulation of adversarial training in Equation 4.2.
Let us consider this for the softmax cross-entropy loss, defined as
Untargeted -PGA on softmax cross-entropy loss: (forward pass)
Targeted -PGA on softmax cross-entropy loss: (forward pass)
Notice that the logit gradient can be computed in a forward pass by analytically expressing it in terms of the arguments of the objective function (this is why we call the update a forward pass).
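The closed-form logit gradient mentioned above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard softmax cross-entropy, uses a made-up logit vector `z`, and the names `logit_gradient` and `numerical_gradient` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, label):
    """Softmax cross-entropy of logits z for an integer class label."""
    return -np.log(softmax(z)[label])

def logit_gradient(z, label):
    """Analytic gradient w.r.t. the logits: softmax(z) - one_hot(label)."""
    g = softmax(z)
    g[label] -= 1.0
    return g

def numerical_gradient(z, label, h=1e-6):
    """Central finite differences, used only to verify the analytic form."""
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        g[i] = (cross_entropy(z + step, label) - cross_entropy(z - step, label)) / (2 * h)
    return g

# Illustrative values (assumptions, not taken from the paper).
z = np.array([2.0, -1.0, 0.5, 0.1])
true_label, target_label = 0, 2

# Untargeted: ascend the loss of the true label.
untargeted_dir = logit_gradient(z, true_label)
# Targeted: descend the loss of the attack target.
targeted_dir = -logit_gradient(z, target_label)
```

Both directions are available from the softmax output alone, which is the sense in which no backward pass through the loss is needed.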
Interestingly, for a temperature-dependent softmax cross-entropy loss, the logit-space direction becomes a “label-flip” vector in the low-temperature limit (high inverse temperature), where the softmax converges to the argmax. For targeted attacks, for example, the direction points from the one-hot vector of the currently predicted class to that of the attack target. This implies that iterative PGA finds an input-space perturbation that corresponds to the steepest ascent of the network output along the “label-flip” direction. See Appendix 7.2 for further details.
A note on canonical link functions.
Interestingly, the gradient of the loss w.r.t. the log-odds of the classifier takes the form “prediction - target” for both the sum-of-squares error and the softmax cross-entropy loss. This is in fact a general result of modelling the target variable with a conditional distribution from the exponential family together with a canonical link (activation) function. For our purposes, this means that in both cases adversarial attacks try to find perturbations in input space that induce a logit perturbation equal to the difference between the current prediction (log-odds) and the attack target (cf. the note on the “directional derivative” interpretation of the power method).
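As a short worked illustration of the “prediction - target” form, the two cases can be written out as follows. The notation is an assumption made for this sketch (logits $z$, one-hot or real-valued target $y$, softmax $s(z)$) and need not match the symbols used in the main text.

```latex
% Sum-of-squares error with identity link: the prediction is z itself.
\ell_{\mathrm{SSE}}(y, z) = \tfrac{1}{2}\,\lVert z - y \rVert_2^2
\quad\Longrightarrow\quad
\nabla_z \ell_{\mathrm{SSE}} = z - y .

% Softmax cross-entropy with softmax link: the prediction is s(z).
\ell_{\mathrm{CE}}(y, z) = -\sum_i y_i \log s_i(z),
\qquad
s_i(z) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
\quad\Longrightarrow\quad
\nabla_z \ell_{\mathrm{CE}} = s(z) - y .
```

In the second line the result uses $\sum_i y_i = 1$ and $\partial \log s_i(z) / \partial z_k = \delta_{ik} - s_k(z)$.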
7.2 Temperature-dependent Softmax Cross-entropy based PGA Attack
The temperature-dependent softmax cross-entropy loss is defined as
Here, the loss depends on an inverse temperature parameter. In the low-temperature limit (inverse temperature tending to infinity), the softmax converges pointwise to the argmax.
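One common way to write such a loss is sketched below. The symbols are assumptions introduced for this illustration ($\beta$ for the inverse temperature, $z$ for the logits, $y$ for the one-hot target) and may differ from the original notation.

```latex
% Temperature-dependent softmax cross-entropy (assumed notation).
\ell_\beta(y, z) = -\sum_i y_i \log s_i(\beta z),
\qquad
s_i(\beta z) = \frac{\exp(\beta z_i)}{\sum_j \exp(\beta z_j)} .

% Logit gradient: "prediction - target", scaled by the pre-factor beta.
\nabla_z \ell_\beta(y, z) = \beta \,\bigl( s(\beta z) - y \bigr) .

% Low-temperature limit, assuming the maximal logit is unique.
\lim_{\beta \to \infty} s_i(\beta z) =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j, \\
0 & \text{otherwise.}
\end{cases}
```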
Untargeted -PGA on softmax cross-entropy loss: (forward pass)
Targeted -PGA on softmax cross-entropy loss: (forward pass)
Note that we can drop the pre-factor in the update equations, as it cancels anyway when normalizing.
The interesting point is that in the low-temperature limit, the logit-space direction becomes a “label-flip” vector. For targeted attacks, for example, it is the difference between the one-hot vector of the attack target and the one-hot vector of the argmax of the current prediction (neglecting the pre-factor).
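The low-temperature behaviour can be illustrated numerically. This is a minimal sketch under stated assumptions: the inverse temperature is called `beta`, the logits `z` and the target class are made-up values, and `targeted_direction` is a hypothetical helper that drops the pre-factor as described above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_direction(z, target, beta):
    """Targeted logit-space direction one_hot(target) - softmax(beta * z).

    The pre-factor beta is dropped, since it cancels under normalization.
    """
    d = -softmax(beta * z)
    d[target] += 1.0
    return d

# Illustrative values (assumptions, not taken from the paper).
z = np.array([2.0, -1.0, 0.5, 0.1])
target = 2

# "Label-flip" vector: +1 at the attack target, -1 at the current argmax.
label_flip = np.zeros_like(z)
label_flip[target] += 1.0
label_flip[np.argmax(z)] -= 1.0

for beta in (1.0, 10.0, 100.0):
    print(beta, np.round(targeted_direction(z, target, beta), 4))
```

As `beta` grows, the printed direction approaches `label_flip`, provided the largest logit is unique.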
7.3 Dataset, Architecture & Training Methods
We trained Convolutional Neural Networks (CNNs) with seven hidden layers and batch normalization on the CIFAR10 data set. The CIFAR10 dataset consists of 60k 32×32 colour images in 10 classes, with 6k images per class. It comes in a pre-packaged train-test split, with 50k training images and 10k test images, and can readily be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.
We conduct our experiments on a pre-trained standard convolutional neural network, employing 7 convolutional layers, augmented with BatchNorm, ReLU nonlinearities and MaxPooling. The network achieves 93.5% accuracy on a clean test set. Relevant links to download the pre-trained model can be found in our codebase.
We adopt the following standard preprocessing and data augmentation scheme: Each training image is zero-padded with four pixels on each side, randomly cropped to produce a new image with the original dimensions and horizontally flipped with probability one half. We also standardize each image to have zero mean and unit variance when passing it to the classifier.
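The preprocessing described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used in the experiments: the function names `augment` and `standardize` are hypothetical, the input is a random stand-in image, and computing the mean and variance over all pixels and channels of an image is an assumption.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Zero-pad an HxWxC image, crop back to HxW at a random offset, and
    flip it horizontally with probability one half."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

def standardize(image, eps=1e-8):
    """Shift and scale a single image to zero mean and unit variance
    (statistics taken over all pixels and channels: an assumption)."""
    image = image.astype(np.float64)
    return (image - image.mean()) / (image.std() + eps)

rng = np.random.default_rng(0)
# Random stand-in for a 32x32 colour image.
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
out = standardize(augment(image, rng))
```

Augmentation would be applied to training images only, while standardization is applied to every image passed to the classifier.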
The attack strength used for PGA was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified. The regularization constants of the other training methods were then chosen such that they achieve roughly the same test set accuracy on clean examples as the adversarially trained model, i.e. we allow a comparable drop in clean accuracy for regularized and adversarially trained models. When training the derived regularized models, we started from a pre-trained checkpoint and ran a hyper-parameter search over the number of epochs, learning rate and regularization constants. Table 1 summarizes the test set accuracies and hyper-parameters for all the training methods we considered.
7.4 Extracting Jacobian as a Matrix
Any neural network with its nonlinear activation functions set to fixed values represents a linear operator, which is locally a good approximation to the network itself. We therefore develop a method to fully extract and specify this linear operator in the neighborhood of any input datapoint. We found the naive approach of determining each entry of the linear operator by consecutively perturbing the input along individual basis vectors to be numerically unstable, and therefore settled on a more robust alternative:
In a first step, we run a set of randomly perturbed versions of the input through the network (with fixed activation functions) and record their outputs at the layer of interest (usually the logit layer). In a second step, we fit a linear regression to these input-output pairs to obtain a weight matrix as well as a bias vector, thereby fully specifying the linear operator. The singular vectors and singular values of the weight matrix can then be obtained by performing an SVD.
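The two-step procedure can be sketched on a toy example. Everything below is an assumption made for illustration: a small two-layer ReLU network with random weights stands in for the CNN, the perturbation scale and sample count are arbitrary, and the regression is run on centred perturbations for better conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4

# Hypothetical toy network standing in for the CNN.
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

x = rng.normal(size=d_in)
# ReLU activation pattern at x, kept fixed from here on.
mask = (W1 @ x + b1 > 0).astype(float)

def frozen_forward(inputs):
    """Logits for a batch of inputs with the ReLU gates frozen at x."""
    hidden = (inputs @ W1.T + b1) * mask
    return hidden @ W2.T + b2

# Step 1: push randomly perturbed copies of x through the frozen network.
n_samples = 200
deltas = 0.01 * rng.normal(size=(n_samples, d_in))
outputs = frozen_forward(x + deltas)

# Step 2: least-squares fit of outputs ~ deltas @ W^T + intercept.
design = np.hstack([deltas, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
W_hat = coef[:-1].T              # estimated weight matrix, shape (d_out, d_in)
b_hat = coef[-1] - W_hat @ x     # bias so that output = W_hat @ input + b_hat

# Singular vectors and values of the extracted operator.
U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)

# Ground truth for this toy network, available here only because it is tiny.
W_true = W2 @ (mask[:, None] * W1)
b_true = frozen_forward(x[None])[0] - W_true @ x
```

For the toy network the regression recovers `W_true` up to numerical error. For a real network only `frozen_forward` would change, and no ground truth would be available for comparison.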