# Adversarial Training Generalizes Data-dependent Spectral Norm Regularization

We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we show that adversarial training is a data-dependent generalization of spectral norm regularization. This intriguing connection provides fundamental insights into the origin of adversarial vulnerability and hints at novel ways to robustify and defend against adversarial attacks. We provide extensive empirical evidence to support our theoretical results.

## 1 Introduction

Deep neural networks have been used with great success for perceptual tasks such as image classification simonyan2014very ; lecun2015deep or speech recognition hinton2012deep . While they are known to be robust to random noise, it has been shown that the accuracy of deep nets can dramatically deteriorate in the face of so-called adversarial examples biggio2013evasion ; szegedy2013intriguing ; goodfellow2014explaining , i.e. small perturbations of the input signal, often imperceptible to humans, that are sufficient to induce large changes in the model output. This apparent vulnerability is worrisome as deep nets start to proliferate in the real-world, including in safety-critical deployments.

Consequently, there has been a rapidly expanding literature exploring methods to find adversarial perturbations sabour2015adversarial ; papernot2016transferability ; kurakin2016adversarial ; moosavi2016deepfool ; moosavi2017universal ; madry2017towards ; athalye2018obfuscated , as well as to provide formal guarantees on the robustness of a model against specific attacks hein2017formal ; kolter2017provable ; raghunathan2018certified ; tsuzuku2018lipschitz ; cohen2019certified . The most direct strategy of robustification, called adversarial training, aims to harden a machine learning model by immunizing it against an adversary that maliciously corrupts each training example before passing it to the model goodfellow2014explaining ; kurakin2016adversarial ; miyato2015distributional ; miyato2017virtual ; madry2017towards . A different strategy of defense is to detect whether the input has been disrupted by detecting characteristic regularities either in the adversarial manipulations themselves or in the network activations they induce grosse2017statistical ; feinman2017detecting ; xu2017feature ; metzen2017detecting ; carlini2017adversarial ; roth2019odds .

Despite practical advances in finding adversarial examples and defending against them, the definitive theoretical reason for the vulnerability of neural networks remains unclear. Bubeck et al. bubeck2018adversarial identify four mutually exclusive scenarios: (i) no robust model exists, cf. fawzi2018adversarial ; gilmer2018adversarial , (ii) learning a robust model requires too much training data, cf. schmidt2018adversarially , (iii) learning a robust model from limited training data is possible but computationally intractable (the hypothesis favoured by Bubeck et al.), and (iv) we just have not found the right training algorithm yet.

In other words, it is still an open question whether adversarial examples exist because of intrinsic flaws of the model or learning objective or whether they are solely the consequence of computational limitations or non-zero generalization error and high-dimensional statistics. In this work, we investigate the origin of adversarial vulnerability in neural networks by focusing on the attack algorithms used to find adversarial examples.

In particular, we make the following contributions:

• We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we show that adversarial training is a data-dependent generalization of spectral norm regularization.

• We conduct extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones.

• Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.

## 2 Related Work

As deep neural networks start to proliferate in the real-world, the requirement for trained models to be robust to input perturbations becomes paramount. Prominent machine learning frameworks dealing with such requirements are robust optimization el1997robust ; xu2009robustness ; bertsimas2018characterization (including distributionally robust optimization namkoong2017variance ; sinha2017certifiable ; gao2016distributionally ) and adversarial training goodfellow2014explaining ; shaham2015understanding ; kurakin2016adversarial ; miyato2017virtual ; madry2017towards . In these frameworks, machine learning models are trained to minimize the worst-case loss against an adversary that can either perturb the entire training set (in the case of robust optimization) or each training example individually (in the case of adversarial training) subject to a proximity constraint.

A number of works have suggested using regularization, often based on the input gradient, as a means to improve model robustness against adversarial attacks gu2014towards ; lyu2015unified ; cisse2017parseval ; ross2017improving ; simon2018adversarial . Interestingly, for certain problems and uncertainty sets, robust optimization is equivalent to regularization el1997robust ; xu2009robustness ; bertsimas2018characterization . For example, for linear regression and induced matrix norm balls, the adversary's inner maximization can equivalently be written as an operator norm penalty el1997robust ; bertsimas2018characterization . Similar results on the equivalence of robustness and regularization have also been obtained for (kernelized) SVMs xu2009robustness . Cf. bietti2018regularization for a kernel perspective on robustness and regularization of deep nets.

More recently, training methods based on spectral norm yoshida2017spectral ; miyato2018spectral ; bartlett2017spectrally ; farnia2018generalizable and Lipschitz constant regularization cisse2017parseval ; hein2017formal ; tsuzuku2018lipschitz ; raghunathan2018certified have been proposed, particularly as bounds on the spectral norm or Lipschitz constant can easily be translated to bounds on the minimal perturbation required to fool a machine learning model. Theoretical work connecting adversarial robustness with robustness to random noise fawzi2015analysis ; fawzi2016robustness and decision boundary tilting tanay2016boundary was also pursued.

Despite there being a well-established learning theory for standard non-robust classification, including generalization bounds for neural networks, cf. for instance boucheron2005theory ; anthony2009neural , the theoretical understanding of the robust learning problem is still very limited. Recent works starting to fill this gap include Lipschitz-sensitive generalization bounds neyshabur2015norm , spectrally-normalized margin bounds for neural networks bartlett2017spectrally , as well as stronger generalization bounds for deep nets via compression arora2018stronger .

## 3 Background

### 3.1 Robust Optimization and Regularization for Linear Regression

We begin by distilling the basic ideas on the relation between robust optimization and regularization presented in bertsimas2018characterization . Consider linear regression with additive perturbations of the data matrix

$$\min_{w} \max_{\Delta \in \mathcal{U}} g\big(y - (X + \Delta) w\big), \quad \text{with loss } g: \mathbb{R}^n \to \mathbb{R}, \tag{1}$$

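where $\mathcal{U}$ denotes the uncertainty set. A general way to construct $\mathcal{U}$ is as a ball of bounded matrix norm perturbations $\mathcal{U} = \{\Delta : \|\Delta\| \le \lambda\}$. Of particular interest are induced matrix norms

$$\|\Delta\|_{(h,g)} := \max_{w:\, h(w) \neq 0} \frac{g(\Delta w)}{h(w)}, \tag{2}$$

where $h$ is a semi-norm and $g$ is a norm. It is obvious that if $g$ fulfills the triangle inequality then one can upper bound the robust optimization objective by a regularized one

$$g\big(y - (X + \Delta) w\big) \overset{(a)}{\le} g(y - Xw) + g(\Delta w) \overset{(b)}{\le} g(y - Xw) + \lambda\, h(w), \quad \forall \Delta \in \mathcal{U}, \tag{3}$$

by using (a) the triangle inequality and (b) the definition of the matrix norm.

The question then is, under which circumstances both inequalities become equalities at the maximizing $\Delta^*$. It is straightforward to check (bertsimas2018characterization, Theorem 1) that specifically we may choose the rank-one matrix

$$\Delta^* = -\lambda\, u v^\top, \quad \text{with } u = \frac{y - Xw}{g(y - Xw)}, \quad v \in \arg\max_{v:\, h^*(v) = 1} v^\top w, \tag{4}$$

where $h^*$ denotes the dual norm of $h$. If $y - Xw = 0$ then one can pick any $u$ for which $g(u) = 1$ to form $\Delta^*$ (such a $u$ has to exist if $g$ is not identically zero). This shows that, for robust linear regression with induced matrix norm uncertainty sets, Robust Optimization $\Leftrightarrow$ Regularization.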
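The equivalence can be checked numerically in a simple special case. The sketch below assumes that both $g$ and $h$ are the Euclidean norm, so that the induced matrix norm is the spectral norm; the problem sizes, the radius `lam` and all variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 6, 4, 0.3           # sizes and radius are illustrative assumptions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

residual = y - X @ w
u = residual / np.linalg.norm(residual)   # unit vector along the residual
v = w / np.linalg.norm(w)                 # attains v @ w = ||w||_2
delta_star = -lam * np.outer(u, v)        # rank-one perturbation with spectral norm lam

# Robust objective at delta_star versus the regularized objective.
worst_case = np.linalg.norm(y - (X + delta_star) @ w)
regularized = np.linalg.norm(residual) + lam * np.linalg.norm(w)

def random_feasible():
    """Random perturbation rescaled to have spectral norm exactly lam."""
    D = rng.normal(size=(n, d))
    return lam * D / np.linalg.norm(D, 2)

# Largest robust objective found among random feasible perturbations.
sampled = max(np.linalg.norm(y - (X + random_feasible()) @ w) for _ in range(1000))
```

Here `worst_case` matches `regularized` up to floating point error, while `sampled` stays below it, which is what the two inequalities and their tightness at the rank-one perturbation predict.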

### 3.2 Global Spectral Norm Regularization

In this section we rederive spectral norm regularization à la Yoshida et al. yoshida2017spectral , while also setting up the notation for later. Let $x$ and $y$ denote input-label pairs generated from a data distribution $P$. Let $f$ denote the logits of a $\theta$-parameterized piecewise linear classifier, i.e. $f(\cdot) = W^L \phi^{L-1}(W^{L-1} \phi^{L-2}(\dots) + b^{L-1}) + b^L$, where $\phi^\ell$ is the activation function, and $W^\ell$, $b^\ell$ denote the layer-wise weight matrix and bias vector, collectively denoted by $\theta$. (Note that convolutional layers can be constructed as matrix multiplications by converting the convolution operator into a Toeplitz matrix.) Let us furthermore assume that each activation function is a ReLU (the argument can easily be generalized to other piecewise linear activations). In this case, the activations $\phi^\ell$ act as input-dependent diagonal matrices $\Phi^\ell_x := \mathrm{diag}(\phi^\ell_x)$, where an element in the diagonal is one if the corresponding pre-activation is positive and equal to zero otherwise.

Following Raghu et al. raghu2017expressive , we call $\phi_x := (\phi^1_x, \dots, \phi^{L-1}_x) \in \{0,1\}^m$ the "activation pattern", where $m$ is the number of neurons in the network. For any activation pattern $\phi \in \{0,1\}^m$ we can define the preimage $X(\phi) := \{x : \phi_x = \phi\}$, inducing a partitioning of the input space via $\mathcal{X} = \bigcup_\phi X(\phi)$. Note that some $X(\phi) = \emptyset$, as not all combinations of activations may be feasible. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of ReLU tessellations of the input space.

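We can linearize $f$ within a neighborhood around $x$ as follows

$$f(x + \Delta x) \simeq f(x) + J_f(x)\,\Delta x \quad (\text{with equality if } x + \Delta x \in X(\phi_x)), \tag{5}$$

where $J_f(x)$ denotes the Jacobian of $f$ at $x$

$$J_f(x) = W^L \cdot \Phi^{L-1}_x \cdot W^{L-1} \cdot \Phi^{L-2}_x \cdots \Phi^1_x \cdot W^1. \tag{6}$$

We have the following bound for $\|\Delta x\|_2 \neq 0$

$$\frac{\|f(x + \Delta x) - f(x)\|_2}{\|\Delta x\|_2} \simeq \frac{\|J_f(x)\,\Delta x\|_2}{\|\Delta x\|_2} \le \sigma(J_f(x)) := \sup_{\|\Delta x\|_2 \neq 0} \frac{\|J_f(x)\,\Delta x\|_2}{\|\Delta x\|_2}, \tag{7}$$

where $\sigma(J_f(x))$ is the spectral norm (largest singular value) of the linear operator $J_f(x)$. From a robustness perspective we want $\sigma(J_f(x))$ to be small in regions that are supported by the data.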
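To make the linearization concrete, the following sketch builds the Jacobian of a small random ReLU network from its activation pattern, as in Equation 6, and compares the network output with its linearization. The layer sizes, the random weights and the helper names `forward` and `jacobian` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 5]   # input dim, two hidden layers, logits (assumed toy sizes)
Ws = [rng.normal(size=(m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    """Return the logits and the 0/1 activation patterns of the hidden layers."""
    patterns, h = [], x
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre = W @ h + b
        phi = (pre > 0).astype(float)   # diagonal of Phi^l_x
        patterns.append(phi)
        h = phi * pre                   # ReLU written as a diagonal matrix product
    return Ws[-1] @ h + bs[-1], patterns

def jacobian(x):
    """J_f(x) = W^L Phi^{L-1}_x W^{L-1} ... Phi^1_x W^1."""
    _, patterns = forward(x)
    J = Ws[0]
    for W, phi in zip(Ws[1:], patterns):
        J = W @ (phi[:, None] * J)
    return J

x = rng.normal(size=sizes[0])
J = jacobian(x)
sigma = np.linalg.norm(J, 2)            # spectral norm of the Jacobian

dx = 1e-6 * rng.normal(size=sizes[0])   # small step, likely inside the same region
fx, pat_x = forward(x)
fx_dx, pat_dx = forward(x + dx)
same_region = all(np.array_equal(p, q) for p, q in zip(pat_x, pat_dx))
lin_error = np.linalg.norm(fx_dx - (fx + J @ dx))
ratio = np.linalg.norm(fx_dx - fx) / np.linalg.norm(dx)
```

Whenever `same_region` is true, `lin_error` vanishes up to rounding and `ratio` is bounded by `sigma`, in line with Equations 5 and 7.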

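Based on the decomposition in Equation 6 and the non-expansiveness of the activations, $\sigma(\Phi^\ell_x) \le 1$ for every $\ell \in \{1, \dots, L-1\}$, Yoshida et al. yoshida2017spectral suggest upper-bounding the spectral norm of the Jacobian by the product of the spectral norms of the individual weight matrices

$$\sigma(J_f(x)) \le \prod_{\ell=1}^{L} \sigma(W^\ell), \quad \forall x \in \mathcal{X}. \tag{8}$$

The layer-wise spectral norms $\sigma^\ell := \sigma(W^\ell)$ can be computed iteratively using the power method. Starting with a random vector $v^\ell_0$, the power method iteratively computes

$$u^\ell_k \leftarrow \tilde u^\ell_k / \|\tilde u^\ell_k\|_2, \quad \tilde u^\ell_k \leftarrow W^\ell v^\ell_{k-1}, \qquad v^\ell_k \leftarrow \tilde v^\ell_k / \|\tilde v^\ell_k\|_2, \quad \tilde v^\ell_k \leftarrow (W^\ell)^\top u^\ell_k. \tag{9}$$

The (final) singular value can be computed via $\sigma^\ell_k = (u^\ell_k)^\top W^\ell v^\ell_k$.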
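A minimal NumPy sketch of this layer-wise power method is given below. The function name, the iteration count and the test matrix (constructed with known singular values 3, 1.5, 1 and 0.5) are assumptions for illustration.

```python
import numpy as np

def spectral_norm_power_method(W, num_iters=100, seed=0):
    """Approximate the dominant singular triplet of W by alternating W and W^T."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = W @ v                  # u~ <- W v
        u /= np.linalg.norm(u)
        v = W.T @ u                # v~ <- W^T u
        v /= np.linalg.norm(v)
    sigma = u @ W @ v              # final singular value estimate
    return sigma, u, v

# Test matrix with known singular values (toy construction).
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
S = np.zeros((6, 4))
S[:4, :4] = np.diag([3.0, 1.5, 1.0, 0.5])
W = U @ S @ V.T

sigma, u, v = spectral_norm_power_method(W)
```

The estimate `sigma` agrees with the largest singular value returned by a full SVD. Convergence is geometric in the squared ratio of the two largest singular values, so fewer iterations suffice when the gap is large.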

Yoshida et al. suggest turning this upper bound into a global (data-independent) regularizer by learning the parameters $\theta$ via the following penalized empirical risk minimization

$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim\hat P}\big[\ell(y, f(x))\big] + \frac{\lambda}{2} \sum_{\ell=1}^{L} \sigma(W^\ell)^2, \tag{10}$$

where $\ell(\cdot, \cdot)$ denotes an arbitrary classification loss. Note that, since the parameter gradient of $\sigma(W^\ell)^2 / 2$ is $\sigma^\ell u^\ell (v^\ell)^\top$, with $\sigma^\ell$, $u^\ell$ and $v^\ell$ being the dominant singular value and singular vectors of $W^\ell$ (approximated via the power method), Yoshida et al.'s global spectral norm regularizer effectively adds a term $\lambda \sigma^\ell u^\ell (v^\ell)^\top$ for each layer $\ell$ to the parameter gradient of the loss function. In terms of computational complexity, because the global regularizer decouples from the empirical loss term, a single power method iteration per parameter update step usually suffices in practice yoshida2017spectral .

### 3.3 Global vs. Local Regularizers

The advantage of global bounds is that they trivially generalize from the training to the test set. The problem however is that they can be arbitrarily loose, e.g. penalizing the spectral norm over irrelevant regions of the ambient space. To illustrate this, consider the ideal robust classifier that is essentially piecewise constant on class-conditional regions, with sharp transitions between the classes. The global spectral norm will be heavily influenced by the sharp transition zones, whereas a local data-dependent bound can adapt to regions where the classifier is approximately constant hein2017formal . In other words, we would expect a global regularizer to have the largest effect in the empty parts of the input space. On the contrary, a local regularizer has its main effect around the data manifold.

## 4 Adversarial Training Generalizes Spectral Norm Regularization

### 4.1 Data-dependent Spectral Norm Regularization

We now show how to directly regularize the data-dependent spectral norm of the Jacobian $J_f(x)$. Under the assumption that the dominant singular value is non-degenerate (for practical purposes, we can safely assume this, due to numerical errors), the problem of computing the largest singular value and the corresponding left and right singular vectors can efficiently be solved via the power method. Let $v_0$ be a random vector or an approximation to the dominant right singular vector of $J_f(x)$. The power method iteratively computes

$$\begin{aligned} u_k &\leftarrow \tilde u_k / \|\tilde u_k\|_2, & \tilde u_k &\leftarrow J_f(x)\, v_{k-1} = W^L \cdot \Phi^{L-1}_x \cdots \Phi^1_x \cdot W^1 v_{k-1} && \text{(forward pass)} \\ v_k &\leftarrow \tilde v_k / \|\tilde v_k\|_2, & \tilde v_k &\leftarrow J_f(x)^\top u_k = \nabla_x \big(f(x)^\top u_k\big) && \text{(backward pass)} \end{aligned} \tag{11}$$

The (final) singular value can be computed via $\sigma(J_f(x)) = u^\top J_f(x)\, v$. Note that the right singular vector $v$ gives the direction in input space that corresponds to the steepest ascent of $f(x)$ along $u$.

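We can turn this into a regularizer by learning the parameters $\theta$ via the following Jacobian-based spectral norm penalized empirical risk minimization

$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim\hat P}\Big[\ell(y, f(x)) + \lambda \big(u^\top J_f(x)\, v\big)^2\Big], \tag{12}$$

where $u$ and $v$ are the data-dependent singular vectors of $J_f(x)$, computed via Equation 11.

By optimality / stationarity, $u = J_f(x) v / \|J_f(x) v\|_2$, and linearization, $f(x + \epsilon v) - f(x) \simeq \epsilon J_f(x) v$, we can regularize learning also via the following sum-of-squares based spectral norm regularizer (with the factor $\epsilon^2$ absorbed into $\lambda$)

$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim\hat P}\Big[\ell(y, f(x)) + \lambda\, \|f(x + \epsilon v) - f(x)\|_2^2\Big], \tag{13}$$

where the data-dependent singular vector $v$ of $J_f(x)$ is computed via Equation 11.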

Both variants can readily be implemented in modern deep learning frameworks. We found the sum-of-squares based spectral norm regularizer to be more numerically stable than the Jacobian based one, which is why we used this variant in our experiments. In terms of computational complexity, the data-dependent regularizer is a constant (number of power method iterations) times more expensive than the data-independent variant.
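The sketch below illustrates the data-dependent power method and the sum-of-squares penalty on a toy two-layer ReLU network. The Jacobian-vector products are written out by hand from the activation pattern; in a deep learning framework they would come from automatic differentiation. The network sizes, iteration count and function names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 10)) / np.sqrt(10)   # toy two-layer ReLU network (assumed)
b1 = 0.1 * rng.normal(size=32)
W2 = rng.normal(size=(4, 32)) / np.sqrt(32)
b2 = 0.1 * rng.normal(size=4)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):
    """0/1 activation pattern of the hidden layer at x."""
    return (W1 @ x + b1 > 0).astype(float)

def jvp(x, v):
    """Forward pass: J_f(x) v."""
    return W2 @ (pattern(x) * (W1 @ v))

def vjp(x, u):
    """Backward pass: J_f(x)^T u."""
    return W1.T @ (pattern(x) * (W2.T @ u))

def dominant_singular_triplet(x, num_iters=500):
    """Power method on the Jacobian at a fixed input x."""
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = jvp(x, v)
        u /= np.linalg.norm(u)
        v = vjp(x, u)
        v /= np.linalg.norm(v)
    return u @ jvp(x, v), u, v

def snr_penalty(x, v, eps):
    """Sum-of-squares penalty ||f(x + eps v) - f(x)||_2^2."""
    return np.sum((f(x + eps * v) - f(x)) ** 2)

x = rng.normal(size=10)
sigma, u, v = dominant_singular_triplet(x)
eps = 1e-4
penalty = snr_penalty(x, v, eps)
J = W2 @ (pattern(x)[:, None] * W1)   # explicit Jacobian, used only for checking
```

For small `eps`, `penalty / eps**2` is close to `sigma**2`, which is the relation between the Jacobian-based and the sum-of-squares penalties.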

### 4.2 Power Method Formulation of Adversarial Training

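Adversarial training goodfellow2014explaining ; kurakin2016adversarial ; madry2017towards aims to improve the robustness of a machine learning model by training it against an adversary that independently perturbs each training example subject to a proximity constraint, e.g. in $\ell_p$-norm,

$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim\hat P}\Big[\ell(y, f(x)) + \lambda \max_{x^* \in B^p_\epsilon(x)} \ell_{\mathrm{adv}}\big(y, f(x^*)\big)\Big], \tag{14}$$

where $\ell_{\mathrm{adv}}$ denotes the loss function used to find adversarial perturbations (it does not need to be the same as the classification loss $\ell$).

The adversarial example is typically computed iteratively, e.g. via $\ell_2$-norm constrained projected gradient ascent madry2017towards ; kurakin2016adversarial (the general $\ell_p$-norm constrained case is similar)

$$x_{k+1} = \Pi_{B^2_\epsilon(x)}\Big(x_k + \alpha\, \frac{\nabla_x \ell_{\mathrm{adv}}(y, f(x_k))}{\|\nabla_x \ell_{\mathrm{adv}}(y, f(x_k))\|_2}\Big), \tag{15}$$

where $\Pi_{B^2_\epsilon(x)}$ is the projection operator into the norm ball $B^2_\epsilon(x) := \{x^* : \|x^* - x\|_2 \le \epsilon\}$, $\alpha$ is a small step-size and $y$ is the true or predicted label. For targeted attacks the sign in front of $\alpha$ is flipped, so as to descend the loss function into the direction of the target label.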

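By the chain rule, the gradient step can be expressed as a Jacobian vector product, while the projection into the $\ell_2$-norm ball can be expressed as a normalization. Thus, $\ell_2$-norm constrained projected gradient ascent can equivalently be written as (the normalization of $\tilde u_k$ is optional)

$$\begin{aligned} u_k &\leftarrow \tilde u_k / \|\tilde u_k\|_2, & \tilde u_k &\leftarrow \nabla_z \ell_{\mathrm{adv}}(y, z)\big|_{z = f(x_k)} && \text{(forward pass)} \\ v_k &\leftarrow \tilde v_k / \|\tilde v_k\|_2, & \tilde v_k &\leftarrow J_f(x_k)^\top u_k = \nabla_x \big(f(x_k)^\top u_k\big) && \text{(backward pass)} \end{aligned} \tag{16}$$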

where the ensures that if then . Note, that the logit-space gradient can be computed in a single forward pass, by directly expressing it in terms of the arguments of the adversarial loss.

Comparing the update equations for projected gradient ascent based adversarial training with those of data-dependent spectral norm regularization, we can see that adversarial training generalizes spectral norm regularization in two ways: (i) via the choice of the adversarial loss function and (ii) by iterating within the norm ball (whereas spectral norm regularization keeps the input fixed).

Indeed, keeping the input fixed during the attack and taking the sum-of-squares loss on the logits of the classifier, i.e.  with and , allows us to recover data-dependent spectral norm regularization,

Finally, note that the adversarial loss function determines the logit-space direction of the directional derivative in the power method formulation of adversarial training, as shown in Section 7.1 in the Appendix for an example using the softmax cross-entropy loss. The effect of iterating on the range of regularization is investigated in detail in the Experiments Section 5.3.
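The reduction to the power method can be illustrated on a linearized toy model. The sketch below assumes logits of the form $f(x) + J (x^* - x)$ with a fixed Jacobian `J` whose singular values are chosen by hand, and runs $\ell_2$-norm projected gradient ascent on the sum-of-squares loss $\tfrac{1}{2}\|f(x^*) - f(x)\|_2^2$. The step size, radius and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed Jacobian with known singular values 3, 1 and 0.5 (toy construction).
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))
V, _ = np.linalg.qr(rng.normal(size=(8, 8)))
J = U[:, :3] @ np.diag([3.0, 1.0, 0.5]) @ V[:, :3].T
x = rng.normal(size=8)
c = rng.normal(size=5)          # plays the role of f(x)

def f_lin(x_star):
    """Logits linearized around x: f(x) + J (x* - x)."""
    return c + J @ (x_star - x)

def project_l2(x_star, center, eps):
    """Project x_star onto the l2 ball of radius eps around center."""
    delta = x_star - center
    nrm = np.linalg.norm(delta)
    return x_star if nrm <= eps else center + eps * delta / nrm

def pga_sum_of_squares(eps=0.5, alpha=0.5, num_iters=200):
    x_k = x + 1e-3 * rng.normal(size=x.shape)   # small random start inside the ball
    for _ in range(num_iters):
        u = f_lin(x_k) - f_lin(x)   # logit-space gradient of 0.5 * ||z - f(x)||^2
        u /= np.linalg.norm(u)
        v = J.T @ u                 # backward pass through the fixed Jacobian
        v /= np.linalg.norm(v)
        x_k = project_l2(x_k + alpha * v, x, eps)
    return x_k - x

delta = pga_sum_of_squares()
v_top = V[:, 0]                     # dominant right singular vector of J
alignment = abs(delta @ v_top) / np.linalg.norm(delta)
```

In this setting the perturbation ends up on the boundary of the ball and `alignment` is essentially one, i.e. the attack recovers the dominant right singular vector, as the power method would.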

## 5 Experimental Results

### 5.1 Dataset, Architecture & Training Methods

We trained Convolutional Neural Networks (CNNs) with seven hidden layers and batch normalization on the CIFAR10 data set krizhevsky2009learning . We use a 7-layer CNN as our default platform, since it has good test set accuracy at acceptable computational requirements (we used an estimated k GPU hours (Titan X) in total for all our experiments). We train each classifier with a number of different training methods: (i) ‘Standard’: standard empirical risk minimization with a softmax cross-entropy loss, (ii) ‘Adversarial’: -norm constrained projected gradient ascent (PGA) based adversarial training with a softmax cross-entropy loss, (iii) ‘Yoshida’: global spectral norm regularization à la Yoshida et al. yoshida2017spectral in Equation 10, and (iv) ‘SNR’: data-dependent spectral norm regularization, as in Equation 13.

As a default attack strategy we use an -norm constrained PGA white-box attack with 10 attack iterations. The attack strength $\epsilon$ used for training was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified, which is (indicated by a vertical dashed line in the Figures below). The regularization constants of the other training methods were then chosen in such a way that they roughly achieve the same test set accuracy on clean examples as the adversarially trained model does. Further details regarding the experimental setup can be found in Section 7.3 in the Appendix. Table 1 summarizes the test set accuracies and hyper-parameters for all the training methods we considered. Shaded areas in the plots below denote standard errors with respect to the number of test set samples over which the experiment was repeated.

### 5.2 Adversarial Training vs. Spectral Norm Regularization

Effect of training method on singular value spectrum. We compute the singular value spectrum of the Jacobian $J_f(x)$ for networks trained with different training methods and evaluated at a number of different test set examples ( to be precise). Since we are interested in computing the full singular value spectrum, and not just the dominant singular value and singular vectors as during training, the power method would be impractical to use, as it gives us access to only one (the dominant) singular value-vector pair at a time. Instead, we first extract the Jacobian (which is per se defined as a computational graph in modern deep learning frameworks) as an input-dim $\times$ output-dim dimensional matrix and then use available matrix factorization routines to compute the full SVD of the extracted matrix. For each training method, the procedure is repeated for randomly chosen clean and corresponding adversarially perturbed test set examples. Further details regarding the Jacobian extraction can be found in Section 7.4 in the Appendix.

The results are shown in Figure 1 (left). We can see that, compared to the spectrum of the normally trained and global spectral norm regularized model, the spectrum of adversarially trained and data-dependent spectral norm regularized models is significantly damped after training. In fact, the data-dependent spectral norm regularizer seems to dampen the singular values even slightly more effectively than adversarial training, while global spectral norm regularization has almost no effect compared to standard training.

Alignment of adversarial perturbations with singular vectors. We compute the cosine-similarity of adversarial perturbations with singular vectors of the Jacobian $J_f(x)$, extracted at a number of test set examples, as a function of the rank of the singular vectors returned by the SVD. For comparison we also show the cosine-similarity with the singular vectors of a random network as well as the cosine-similarity with random perturbations.

The results are shown in Figure 1 (right). We can see that for all training methods (except the random network) adversarial perturbations are strongly aligned with the dominant singular vectors while the alignment decreases towards the bottom-ranked singular vectors. For the random network, the alignment is roughly constant with respect to rank. Interestingly, this strong alignment with dominant singular vectors also explains why input gradient regularization and fast gradient method (FGM) based adversarial training do not sufficiently protect against adversarial attacks, namely because the input gradient, resp. a single power method iteration, do not yield a sufficiently good approximation for the dominant singular vector in general.
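The alignment measurement can be sketched as follows. The example assumes the Jacobian has already been extracted as a matrix; the random matrix and the synthetic perturbation stand in for real data, and taking the absolute value of the cosine is an assumption motivated by singular vectors being defined only up to sign.

```python
import numpy as np

def alignment_with_singular_vectors(J, perturbation):
    """Absolute cosine similarity between a perturbation and each right
    singular vector of J, ordered by decreasing singular value."""
    _, _, Vt = np.linalg.svd(J, full_matrices=False)
    p = perturbation / np.linalg.norm(perturbation)
    return np.abs(Vt @ p)

rng = np.random.default_rng(0)
J = rng.normal(size=(4, 10))              # stand-in for an extracted Jacobian
_, _, Vt = np.linalg.svd(J, full_matrices=False)
# Synthetic perturbation: dominant right singular vector plus small noise.
perturbation = Vt[0] + 0.05 * rng.normal(size=10)
cosines = alignment_with_singular_vectors(J, perturbation)
```

Plotting `cosines` against the rank index gives the kind of curve described above for a single example; the paper aggregates over many test points.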

### 5.3 Local Linearity & Range of Regularization Effects

Local linearity. In order to determine the size of the area where a locally linear approximation is valid, we measure the deviation from linearity of $f$ as the distance to $x$ is increased in random and adversarial directions, i.e. we measure $\|f(x + z) - (f(x) + J_f(x) z)\|_2$ as a function of the distance $\|z\|_2$, for random and adversarial perturbations $z$, aggregated over data points in the test set, with adversarial perturbations serving as a proxy for the direction in which the linear approximation holds the least. The purpose of this experiment is to investigate how good the linear approximation is for different training methods, as an increasing number of activation boundaries are crossed with increasing perturbation radius. See Figure 1 in raghu2017expressive or Figure 3 in novak2018sensitivity for an illustration of activation boundary tessellations in the input space.

The results are shown in Figure 2 (left). We can see that adversarial training and data-dependent spectral norm regularization give rise to models that are considerably more linear than the normally trained one, both in random and in adversarial directions. Compared to the normally trained model, the adversarially trained and spectral norm regularized ones remain flat in random directions for perturbations of considerable magnitude and even remain flat in the adversarial direction for perturbation magnitudes up to the order of the $\epsilon$ used during adversarial training, while the deviation from linearity seems to increase roughly linearly with the perturbation magnitude thereafter. The global spectral norm regularized model behaves similarly to the normally trained one (curve omitted).

Largest singular value over distance. Figure 2 (right) shows the largest singular value of the local linear operator (the Jacobian) as the distance from $x$ is increased, both along random and adversarial directions, for different training methods. We can see that the naturally trained network develops large dominant singular values around the data point during training, while the adversarially trained and data-dependent spectral norm regularized models manage to keep the dominant singular value low in the vicinity of $x$.

Alignment of adversarial perturbations with dominant singular vector as a function of $\epsilon$. Figure 3 (right) shows the cosine-similarity of adversarial perturbations of magnitude $\epsilon$ with the dominant singular vector of $J_f(x)$, as a function of perturbation magnitude $\epsilon$. For comparison, we also include the alignment with random perturbations. For all training methods, the larger the perturbation magnitude $\epsilon$, the less the adversarial perturbation aligns with the dominant singular vector of $J_f(x)$, which is to be expected for a simultaneously increasing deviation from linearity. The alignment is similar for adversarially trained and data-dependent spectral norm regularized models, and for both it is larger than that of global spectral norm regularized and naturally trained models.

## 6 Conclusion

We established a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we showed that adversarial training is a data-dependent generalization of spectral norm regularization. We conducted extensive empirical evaluations showing that (i) adversarial perturbations align with dominant singular vectors, (ii) adversarial training and data-dependent spectral norm regularization dampen the singular values, and (iii) both training methods give rise to models that are significantly more linear around data points than normally trained ones. Our results provide fundamental insights into the origin of adversarial vulnerability and hint at novel ways to robustify and defend against adversarial attacks.

## Acknowledgements

We would like to thank Michael Tschannen and Sebastian Nowozin for insightful discussions and helpful comments.

## 7 Appendix

### 7.1 Effect of the Adversarial Loss Function on the Logit-space Direction

The adversarial loss function determines the logit-space direction of the directional derivative in the power-method-like formulation of adversarial training in Section 4.2 (Equation 16).

Let us consider this for the softmax cross-entropy loss, defined as

Untargeted -PGA on softmax cross-entropy loss: (forward pass)

 (19)

Targeted -PGA on softmax cross-entropy loss: (forward pass)

Notice that the logit gradient can be computed in a forward pass by analytically expressing it in terms of the arguments of the objective function (this is why we call the update a forward pass).

Interestingly, for a temperature-dependent softmax cross-entropy loss, the logit-space direction becomes a “label-flip” vector in the low-temperature limit (high inverse temperature ) where the softmax converges to the argmax: . E.g. for targeted attacks . This implies that iterative PGA finds an input space perturbation that corresponds to the steepest ascent of along the “label flip” direction . See Appendix 7.2 for further details.

A note on canonical link functions.

Interestingly, the gradient of the loss w.r.t. the log-odds of the classifier takes the form “prediction - target” for both the sum-of-squares error as well as the softmax cross-entropy loss. This is in fact a general result of modelling the target variable with a conditional distribution from the exponential family along with a canonical link (activation) function. For our purposes, this means that in both cases adversarial attacks try to find perturbations in input space that induce a logit perturbation that is the difference between the current prediction (log-odds) and the attack target (cf. note on “directional derivative” interpretation of the power method).

### 7.2 Temperature-dependent Softmax Cross-entropy based PGA Attack

The temperature-dependent softmax cross-entropy loss is defined as

where denotes the inverse temperature. As () the softmax converges pointwise to the argmax: .

Untargeted -PGA on softmax cross-entropy loss: (forward pass)

 (22)

Targeted -PGA on softmax cross-entropy loss: (forward pass)

Note, we can drop the pre-factor in the update equations for as it gets cancelled anyway when normalizing.

The interesting point is that in the low-temperature limit, the logit-space direction becomes a “label-flip” vector. E.g. for targeted attacks,

where denotes the argmax of the current prediction (and we neglected the pre-factor ).
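The logit-space direction and its low-temperature limit can be checked numerically. The sketch below assumes the loss $-\log \mathrm{softmax}(\beta z)_y$, whose gradient with respect to the logits is $\beta\,(\mathrm{softmax}(\beta z) - e_y)$, and assumes that a targeted attack flips the sign; the pre-factor $\beta$ is dropped because it cancels under normalization. Function names and example logits are illustrative.

```python
import numpy as np

def softmax(z, beta=1.0):
    """Numerically stable softmax of beta * z."""
    a = beta * (z - np.max(z))
    e = np.exp(a)
    return e / e.sum()

def logit_direction(z, label, beta=1.0, targeted=False):
    """Unnormalized logit-space direction of the PGA update.

    Untargeted: softmax(beta z) - onehot(label), i.e. "prediction - target".
    Targeted:   onehot(label) - softmax(beta z), with label the attack target.
    """
    onehot = np.zeros_like(z)
    onehot[label] = 1.0
    g = softmax(z, beta) - onehot
    return -g if targeted else g

z = np.array([2.0, 0.5, -1.0])        # example logits; the current prediction is class 0
untargeted = logit_direction(z, label=0)
# Low-temperature limit (large beta) for a targeted attack towards class 2.
label_flip = logit_direction(z, label=2, beta=1e3, targeted=True)
```

At large `beta` the targeted direction becomes the one-hot vector of the target class minus the one-hot vector of the currently predicted class, which is the "label-flip" vector described above.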

### 7.3 Dataset, Architecture & Training Methods

We trained Convolutional Neural Networks (CNNs) with seven hidden layers and batch normalization on the CIFAR10 data set krizhevsky2009learning . The CIFAR10 dataset consists of 60k colour images in 10 classes, with 6k images per class. It comes in a pre-packaged train-test split, with 50k training images and 10k test images, and can readily be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.

We conduct our experiments on a pre-trained standard convolutional neural network, employing 7 convolutional layers, augmented with BatchNorm, ReLU nonlinearities and MaxPooling. The network achieves 93.5% accuracy on a clean test set. Relevant links to download the pre-trained model can be found in our codebase.

We adopt the following standard preprocessing and data augmentation scheme: Each training image is zero-padded with four pixels on each side, randomly cropped to produce a new image with the original dimensions and horizontally flipped with probability one half. We also standardize each image to have zero mean and unit variance when passing it to the classifier.
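The preprocessing described above can be sketched in NumPy as follows. The height-width-channel layout, the per-image standardization over all pixels and channels, and the function names are assumptions; the original pipeline may differ in these details.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Zero-pad by `pad` pixels per side, take a random crop of the original
    size, and flip horizontally with probability one half."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))   # constant zero padding
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                                   # flip along the width axis
    return crop

def standardize(image, eps=1e-8):
    """Shift and scale a single image to zero mean and unit variance."""
    return (image - image.mean()) / (image.std() + eps)

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(32, 32, 3))   # stand-in for a CIFAR10 image
out = standardize(augment(image, rng))
```

Augmentation applies to training images only, while standardization is applied to every image passed to the classifier.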

The attack strength $\epsilon$ used for PGA was chosen to be the smallest value such that almost all adversarially perturbed inputs to the standard model are successfully misclassified, which is . The regularization constants of the other training methods were then chosen in such a way that they roughly achieve the same test set accuracy on clean examples as the adversarially trained model does, i.e. we allow a comparable drop in clean accuracy for regularized and adversarially trained models. When training the derived regularized models, we started from a pre-trained checkpoint and ran a hyper-parameter search over number of epochs, learning rate and regularization constants. Table 1 summarizes the test set accuracies and hyper-parameters for all the training methods we considered.

### 7.4 Extracting Jacobian as a Matrix

Since we know that any neural network with its nonlinear activation function set to fixed values represents a linear operator, which, locally, is a good approximation to the neural network itself, we develop a method to fully extract and specify this linear operator in the neighborhood of any input datapoint $x$. We have found the naive way of determining each entry of the linear operator by consecutively computing changes to individual basis vectors to be numerically unstable and have therefore settled for a more robust alternative:

In a first step, we run a set of randomly perturbed versions of $x$ through the network (with fixed activation functions) and record their outputs at the particular layer that is of interest to us (usually the logit layer). In a second step, we compute a linear regression on these input-output pairs to obtain a weight matrix $W$ as well as a bias vector $b$, thereby fully specifying the linear operator. The singular vectors and values of $W$ can be obtained by performing an SVD.
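A minimal sketch of this two-step extraction is shown below for a toy two-layer ReLU network with the activation pattern frozen at $x$. The network, the number and scale of the perturbations, and the choice to center the regression inputs at $x$ (so that the fitted offset equals the frozen network's output at $x$) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 6)) / np.sqrt(6)     # toy two-layer ReLU network (assumed)
b1 = 0.1 * rng.normal(size=16)
W2 = rng.normal(size=(3, 16)) / np.sqrt(16)
b2 = 0.1 * rng.normal(size=3)

x = rng.normal(size=6)
phi = (W1 @ x + b1 > 0).astype(float)          # activation pattern frozen at x

def f_frozen(x_in):
    """Network output with the activations fixed to the pattern at x."""
    return W2 @ (phi * (W1 @ x_in + b1)) + b2

# Step 1: run randomly perturbed versions of x through the frozen network.
num_samples = 200
X_pert = x + 0.01 * rng.normal(size=(num_samples, 6))
Y = np.stack([f_frozen(xi) for xi in X_pert])

# Step 2: linear regression on the input-output pairs (inputs centered at x).
A = np.hstack([X_pert - x, np.ones((num_samples, 1))])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
W_hat = coef[:-1].T                            # estimated linear operator
b_hat = coef[-1]                               # estimated offset at x

singular_values = np.linalg.svd(W_hat, compute_uv=False)
J_true = W2 @ (phi[:, None] * W1)              # analytic Jacobian, for comparison only
```

Because the frozen network is exactly affine, the regression recovers the analytic Jacobian up to numerical error, and the SVD of `W_hat` yields the full singular value spectrum.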