
Scaleable input gradient regularization for adversarial robustness

Input gradient regularization is not thought to be an effective means for promoting adversarial robustness. In this work we revisit this regularization scheme with some new ingredients. First, we derive new per-image theoretical robustness bounds based on local gradient information, and curvature information when available. These bounds strongly motivate input gradient regularization. Second, we implement a scaleable version of input gradient regularization which avoids double backpropagation: adversarially robust ImageNet models are trained in 33 hours on four consumer grade GPUs. Finally, we show experimentally that input gradient regularization is competitive with adversarial training.





1 Introduction

Neural networks are vulnerable to adversarial attacks. These are small (imperceptible to the human eye) perturbations of an image which cause a network to misclassify the image (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014). The threat posed by adversarial attacks must be addressed before these methods can be deployed in error-sensitive and security-based applications (Potember, 2017).

Building adversarially robust models is an optimization problem with two objectives: (i) maintain test accuracy on clean unperturbed images, and (ii) be robust to large adversarial perturbations. The present state-of-the-art method for adversarial defence, adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014; Tramèr et al., 2018; Madry et al., 2017; Miyato et al., 2018), in which models are trained on perturbed images, offers robustness at the expense of test accuracy (Tsipras et al., 2018). It is not clear that multi-step adversarial training is scaleable to large datasets such as ImageNet-1k (Deng et al., 2009). Previous attempts (Kannan et al., 2018; Xie et al., 2018) used hundreds of GPUs and took nearly a week to train, although recent work by Shafahi et al. (2019) has offered a remedy.

Assessing the empirical effectiveness of an adversarial defence requires careful testing with multiple attacks (Goodfellow et al., 2018). Furthermore, existing defences are vulnerable to new, stronger attacks: Carlini and Wagner (2017a) and Athalye et al. (2018) advocate designing specialized attacks to circumvent prior defences, while Uesato et al. (2018) warn against using weak attacks to evaluate robustness. This has led the community to develop theoretical tools to certify adversarial robustness. Several certification approaches have been proposed: through linear programming (Wong and Kolter, 2018; Wong et al., 2018) or mixed-integer linear programming (Xiao et al., 2018); semi-definite relaxation (Raghunathan et al., 2018b, a); randomized smoothing (Li et al., 2018; Cohen et al., 2019); or estimates of the local Lipschitz constant (Hein and Andriushchenko, 2017; Weng et al., 2018; Tsuzuku et al., 2018). The latter two approaches have scaled well to ImageNet-1k.

In practice, certifiably robust networks often perform worse than adversarially trained models, which lack theoretical guarantees. In this article, we work towards bridging the gap between theoretically robust networks and empirically effective training methods. Our approach relies on minimizing a loss regularized against large input gradients,

    min_θ E_x [ ℓ(x; θ) + λ ‖∇_x ℓ(x; θ)‖² ],    (1)

where the norm ‖·‖ is dual to the one measuring adversarial attacks (for example the ℓ1 norm for attacks measured in the ℓ∞ norm). Heuristically, making loss gradients small should make gradient-based attacks more challenging.

Drucker and LeCun (1991) implemented gradient regularization using ‘double backpropagation’, which has been shown to improve model generalization (Novak et al., 2018). It has been used to improve the stability of GANs (Roth et al., 2017; Nagarajan and Kolter, 2017) and to promote learning robust features with contractive auto-encoders (Rifai et al., 2011). While it has been proposed as a defence against adversarial attacks (Ross and Doshi-Velez, 2018; Roth et al., 2018; Hein and Andriushchenko, 2017; Jakubovitz and Giryes, 2018; Simon-Gabriel et al., 2018), experimental evidence has been mixed: in particular, input gradient regularization has so far not been competitive with multi-step adversarial training.

On non-smooth networks (such as those built of ReLUs) small gradients are no guarantee of adversarial robustness (Papernot et al., 2017), and so it is thought input gradient regularization should not be effective on non-smooth networks. This raises the question: how often is the lack of smoothness an issue in practice? In other words, when do Taylor approximations of the loss fail to predict adversarial robustness, and is smoothness only needed theoretically? The fact that first-order gradient-based attacks on the loss (like PGD (Madry et al., 2017)) are usually effective indicates that in many scenarios non-smoothness is not an issue. However, in a non-negligible minority of cases, attacks based on decision boundary information (Carlini and Wagner, 2017b; Brendel et al., 2018; Chen and Jordan, 2019; Finlay et al., 2019) outperform gradient-based attacks. This indicates that the curvature near these points is large, and first-order information is not sufficient to guarantee robustness. We illustrate this point in Figure 1. In this work we overcome the limitation of gradient regularization for non-smooth networks by instead building networks of ‘smooth’ ReLUs. At the expense of a minor drop in test accuracy, we obtain tighter theoretical lower bounds on robustness, since we can better approximate the loss using local information.

Another drawback of input gradient regularization is that it is not presently tractable to update model weights using double backpropagation on large networks. We circumvent this limitation by differentiating the regularization term without double backpropagation.

Our main contributions are the following. First, we motivate using input gradient regularization of the loss by deriving new theoretical robustness bounds. These bounds show that small loss gradients and small curvature are sufficient conditions for adversarial robustness. Second, we empirically show that input gradient regularization is competitive with adversarial training, even on non-smooth networks, at a fraction of the training time. Finally, we scale input gradient regularization to ImageNet-1k by using finite differences to estimate the gradient regularization term, rather than double backpropagation. This allows us to train adversarially robust networks on ImageNet-1k in 33 hours on four consumer grade GPUs.

2 Adversarial robustness bounds from the loss

2.1 Background

Much effort has been directed towards determining theoretical lower bounds on the minimum sized perturbation necessary to perturb an image so that it is misclassified by a model. One promising approach, proposed by Hein and Andriushchenko (2017) and Weng et al. (2018), and which has scaled well to ImageNet-1k, is to use the Lipschitz constant of the model. In this section, we build upon these ideas: we propose using the Lipschitz constant of a suitable loss, designed to measure classification errors. In addition, when the loss is twice continuously differentiable, we propose a second-order bound based on the maximum curvature of the loss.

Our notation is as follows. Write f(x; θ) for a model which takes input vectors x to label probabilities, with parameters θ. Let ℓ be the loss function, and write ℓ(x), shorthand for ℓ(f(x; θ), y), for the loss of the model f at input x with true label y.

Finding an adversarial perturbation is interpreted as a global minimization problem: find the closest image x to a clean image x₀, in some specified norm, that is also misclassified by the model:

    min_x ‖x − x₀‖  subject to x misclassified by f.    (2)
However, (2) is a difficult and costly non-smooth, non-convex optimization problem. Instead, Goodfellow et al. (2014) proposed solving a surrogate problem: find a perturbation of a clean image that maximizes the loss, subject to the condition that the perturbation lie inside a norm ball of radius ε around the clean image. The surrogate problem is written

    max_x ℓ(x)  subject to ‖x − x₀‖ ≤ ε.    (3)
The hard constraint forces perturbations to be inside the norm ball of radius ε centred at the clean image x₀. Ideally, solutions of this surrogate problem (3) will closely align with solutions of the original, more difficult global minimization problem (2). However, the hard constraint in (3) forces a particular scale: it may miss attacks which would succeed with only a slightly bigger norm. Additionally, the maximization problem (3) does not force misclassification; it only asks that the loss be increased.

Figure 1: Illustration of upper bounds on the loss of two networks. For smooth networks (blue) with finite curvature, the loss is bounded above using the local gradient norm and the curvature constant C. Non-smooth networks (orange) may have jumps in their gradients, which means robustness is not guaranteed by small local gradients.

The advantage of (3) is that it may be solved with gradient-based methods: present best practice is to use variants of projected gradient descent (PGD), such as the iterative fast gradient signed method (Kurakin et al., 2016; Madry et al., 2017) when attacks are measured in the ℓ∞ norm. However, gradient-based methods are not always effective: on non-smooth networks, such as those built of ReLU activation functions, a small gradient does not guarantee that the loss remains small locally. This deficiency was identified in (Papernot et al., 2016). See Figure 1: a network's loss may increase rapidly under a very small perturbation, even when local gradients are small. PGD methods will fail to locate these worst-case perturbations, and give a false impression of robustness. Carlini and Wagner (2017b) avoid this scenario by incorporating decision boundary information into the loss; others solve (2) directly (Brendel et al., 2018; Chen and Jordan, 2019; Finlay et al., 2019).

2.2 Derivation of lower bounds

This leads us to consider the following compromise between (2) and (3). Consider the following modification of the Carlini and Wagner (2017b) loss: ℓ(x) = max_{i≠c} f_i(x) − f_c(x), where c is the index of the correct label, and f_i(x) is the model output for the i-th label. This loss has the appealing property that the sign of the loss determines whether the classification is correct. Adversarial attacks are found by minimizing

    min_x ‖x − x₀‖  subject to ℓ(x) ≥ ℓ_adv.    (4)

The constant ℓ_adv determines when classification is incorrect; for the modified Carlini–Wagner loss, ℓ_adv = 0. Problem (4) is closer to the true problem (2), and will always find an adversarial image. We use (4) to derive theoretical lower bounds on the minimum size perturbation necessary to misclassify an image. Suppose the loss is L-Lipschitz with respect to the model input. Then we have the estimate

    |ℓ(x) − ℓ(x₀)| ≤ L ‖x − x₀‖.    (5)
Now suppose x is adversarial, with minimum adversarial loss ℓ_adv. Then rearranging (5), we obtain the lower bound

    ‖x − x₀‖ ≥ (ℓ_adv − ℓ(x₀)) / L.

Unfortunately, the Lipschitz constant L is a global quantity, and ignores local gradient information; see for example Huster et al. (2018). Thus this bound can be quite poor, even when networks have small Lipschitz constant. On the other hand, if the model is twice continuously differentiable, then the loss landscape is smoother. This allows us to achieve a tighter bound, using local gradient information, as illustrated in Figure 1. Let C be an upper bound on the maximum positive eigenvalue of the Hessian of the loss, over all inputs:

    C := sup_x λ_max(∇²_x ℓ(x)).    (6)

This value will be estimated empirically by maximizing over the dataset. The constant C is a measure of the largest positive curvature of the network's loss. Using a Taylor approximation about x₀, we may upper bound the perturbed loss with

    ℓ(x) ≤ ℓ(x₀) + ‖∇ℓ(x₀)‖ ‖x − x₀‖ + (C/2) ‖x − x₀‖²,    (7)

where the gradient is measured in the dual norm.
These two bounds give us the following.

Proposition 2.1.

Suppose the loss ℓ is Lipschitz continuous with respect to model input x, with Lipschitz constant L. Let ℓ_adv be such that if ℓ(x) < ℓ_adv, the model is always correct. Then a lower bound on the minimum magnitude of perturbation necessary to adversarially perturb an image x₀ is

    ‖x − x₀‖ ≥ (ℓ_adv − ℓ(x₀)) / L.    (L-bound)

Suppose in addition that the loss is twice differentiable, with maximum curvature C (defined as in (6)). Then

    ‖x − x₀‖ ≥ (1/C) ( −‖∇ℓ(x₀)‖ + √( ‖∇ℓ(x₀)‖² + 2C (ℓ_adv − ℓ(x₀)) ) ).    (C-bound)

The proof of (L-bound) is given above; the proof of (C-bound) follows by rearranging (7) and solving for ‖x − x₀‖.
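To make the two bounds concrete, here is a small numeric sketch (the per-image statistics below are hypothetical, not the paper's measurements) computing both lower bounds from a loss gap, a local gradient norm, and the estimated constants L and C:

```python
import math

def l_bound(loss_gap, lipschitz_L):
    """(L-bound): distance >= (l_adv - l(x0)) / L."""
    return loss_gap / lipschitz_L

def c_bound(loss_gap, grad_norm, curvature_C):
    """(C-bound): smallest d solving grad_norm*d + (C/2)*d**2 = loss_gap."""
    return (-grad_norm
            + math.sqrt(grad_norm**2 + 2.0 * curvature_C * loss_gap)
            ) / curvature_C

# Hypothetical statistics: loss gap l_adv - l(x0) = 1.5,
# local gradient norm 0.5, global Lipschitz constant L = 3.0, curvature C = 2.0
gap, g, L, C = 1.5, 0.5, 3.0, 2.0
global_bound = l_bound(gap, L)      # 0.5: uses only the global constant L
local_bound = c_bound(gap, g, C)    # 1.0: local gradient info gives a tighter bound
```

When local gradients are small relative to the global Lipschitz constant, the (C-bound) is tighter, which is exactly the regime gradient regularization encourages.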

Remark 2.2.

The second-order bound requires that the network and loss be smooth with respect to the input, but almost all image classification networks now use ReLUs, which are not smooth. We instead use a smoothed ReLU activation. This activation function is twice continuously differentiable, and avoids the vanishing gradient problem of smooth sigmoidal activation functions. Moreover, because it agrees with the standard ReLU outside a small interval around zero, it is fairly efficient during backpropagation. As for the loss, a smooth version of the Carlini–Wagner loss is available by using a soft maximum rather than a strict maximum.
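The paper's exact smoothed ReLU formula was lost in extraction; the following is one possible C² construction (a quartic spline on [−h, h], a hypothetical stand-in for the authors' definition) that agrees with the standard ReLU outside that interval and matches value, first, and second derivatives at both endpoints:

```python
def smooth_relu(x, h=0.5):
    """C^2 smoothing of ReLU: equals 0 for x <= -h and x for x >= h,
    with a quartic bridge on (-h, h) whose value, first and second
    derivatives match ReLU's at both endpoints."""
    if x <= -h:
        return 0.0
    if x >= h:
        return x
    t = (x + h) / (2.0 * h)          # rescale [-h, h] onto [0, 1]
    return h * (2.0 * t**3 - t**4)   # vanishes to 2nd order at t=0; value h, slope 1, curvature 0 at t=1
```

Interior second derivative is 3t(1−t)/h ≥ 0, so the bridge is convex, like ReLU's kink smeared over a width-2h window.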

Proposition 2.1 motivates the need for input gradient regularization. The Lipschitz constant L is the maximum gradient norm of the loss over all inputs. Therefore the (L-bound) says that a regularization term encouraging small gradients (and so reducing L) should increase the minimum adversarial distance. This aligns with Hein and Andriushchenko (2017), who proposed the cross-Lipschitz regularizer, penalizing networks with large Jacobians in order to shrink the Lipschitz constant of the network.

However, this is not enough: the loss gap ℓ_adv − ℓ(x₀) must be large as well. This explains one form of ‘gradient masking’ (Papernot et al., 2017). Shrinking the magnitude of gradients while also closing the gap ℓ_adv − ℓ(x₀) effectively does nothing to improve adversarial robustness. For example, in defensive distillation, the magnitude of the model Jacobian is reduced by increasing the temperature of the final softmax layer of the network. However, this has the detrimental side effect of sending the model output towards the uniform vector (1/K, …, 1/K), where K is the number of classes, which effectively shrinks the loss gap to zero. Thus with high distillation temperatures the lower bound provided by Proposition 2.1 approaches zero.

Moreover, even supposing the loss gradients are small and the gap is large, there may still be adversarially vulnerable images. For example, suppose we have two smooth networks, one with large curvature and another with small curvature. Suppose that there is an image with zero gradient on both networks, each with an identically large loss gap ℓ_adv − ℓ(x₀). The second-order bound (C-bound) says that the minimum adversarial distance is then bounded below by √(2(ℓ_adv − ℓ(x₀))/C). In other words, the network with smaller curvature is more robust.

Taken together, Proposition 2.1 provides three sufficient conditions for training robust networks: (i) the loss gap should be large; (ii) the gradients of the loss should be small; and (iii) the curvature of the loss should also be small. The first point will be satisfied by default when the loss is minimized. The second point will be satisfied by training with a loss regularized to penalize large input gradients. Experimentally the third point is satisfied with input gradient regularization. When these conditions are satisfied, local information is enough to guarantee robustness.

Our robustness bounds are most similar in spirit to Weng et al. (2018), who derive bounds using an estimate of the local Lipschitz constant of the model. Moosavi-Dezfooli et al. (2018) have also used a second-order approximation to derive approximate robustness bounds for binary classification, but they neglected higher-order error terms. Cohen et al. (2019) derive bounds by training with normally distributed input noise, then averaging model predictions normally sampled about the input image. It is well known that training with normal noise is equivalent to squared ℓ2 norm gradient regularization (Bishop, 1995); thus Cohen et al. (2019) achieve gradient regularization indirectly. Our bounds require at most one gradient and one model evaluation per image once L and C have been estimated, whereas both Cohen et al. and Weng et al. require many hundreds of local model evaluations per image. Since L and C are globally estimated, our bounds could be improved by using these local sampling techniques to obtain local values of L and C, with more computational effort.

3 Squared norm gradient regularization

Proposition 2.1 provides strong motivation for input gradient regularization as a method for promoting adversarial robustness. However, it does not tell us what form the gradient regularization term should take. In this section, we show how squared-norm gradient regularization arises from a quadratic cost on perturbations.

In adversarial training, solutions of (3) are used to generate images on which the network is trained. In effect, adversarial training seeks a solution of the minimax problem

    min_θ E_{x∼ρ} max_δ [ ℓ(x + δ; θ) − φ(δ) ],    (9)

where ρ is the distribution of images and φ is a cost function penalizing perturbed images from being too far from the original. This is a robust optimization problem (Wald, 1945; Rousseeuw and Leroy, 1987). When the cost function is the hard constraint from (3) (zero inside the norm ball of radius ε, infinite outside), perturbations must be inside a norm ball of radius ε. This leads to adversarial training with PGD (Kurakin et al., 2016; Madry et al., 2017). However, this forces a particular scale: it is possible that no images are adversarial within radius ε, but that there are adversarial images at only a slightly larger distance. Instead of using a hard constraint, we can relax the cost function to be the quadratic cost φ(δ) = ‖δ‖²/(2λ). The quadratic cost allows attacks of any size, but penalizes larger attacks more than smaller ones. With a quadratic cost, there is less danger that a local attack will be overlooked.

Solving (9) directly is expensive: on ImageNet-1k, both Kannan et al. (2018) and Xie et al. (2018) required large-scale distributed training with many dozens or hundreds of GPUs, and over a week of training time. Instead we take the view that (9) may be bounded above, and solved approximately. When the loss is smooth, maximizing the upper bound (7) minus the quadratic cost gives the optimal perturbation size ‖δ*‖ = λ‖∇_x ℓ(x)‖/(1 − λC), provided λC < 1. This gives the following proposition.

Proposition 3.1.

Suppose both the model and the loss are twice continuously differentiable. Suppose attacks are measured with quadratic cost φ(δ) = ‖δ‖²/(2λ), with λC < 1. Then the optimal value of (9) is bounded above by

    min_θ E_{x∼ρ} [ ℓ(x; θ) + λ̃ ‖∇_x ℓ(x; θ)‖² ],    (10)

where λ̃ = λ / (2(1 − λC)).

That is, we may bound the solution of the adversarial training problem (9) by solving the gradient regularization problem (10), when the cost function is quadratic. It is not necessary to know or compute λ and C exactly; they are absorbed into the single regularization strength λ̃. In the adversarial robustness literature, input gradient regularization using the squared ℓ2 norm was proposed by Ross and Doshi-Velez (2018). It was extended by Roth et al. (2018) to a Mahalanobis norm built from the correlation matrix of adversarial attacks. When φ is the hard constraint forcing attacks inside the norm ball of radius ε and ε is small, supposing the curvature term is negligible, we can estimate the maximum in (9) by ℓ(x) + ε‖∇_x ℓ(x)‖*, using the dual norm ‖·‖* for the gradient. This is norm gradient regularization (not squared), and was recently used for adversarial robustness on both CIFAR-10 (Simon-Gabriel et al., 2018) and MNIST (Seck et al., 2019).
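As a sanity check on the constant in Proposition 3.1: for a one-dimensional quadratic model of the loss, ℓ(x + δ) = ℓ(x) + gδ + (C/2)δ², the inner maximization in (9) with quadratic cost can be solved by brute force and compared to the closed form ℓ(x) + λ̃g². This is a toy verification (our own, not the paper's code):

```python
def inner_max_quadratic(loss0, g, C, lam, n=100001, span=10.0):
    """Brute-force max over delta of loss0 + g*d + (C/2)*d**2 - d**2/(2*lam),
    the inner problem of (9) for a 1-D quadratic loss model."""
    best = float("-inf")
    for i in range(n):
        d = -span + 2.0 * span * i / (n - 1)
        val = loss0 + g * d + 0.5 * C * d * d - d * d / (2.0 * lam)
        best = max(best, val)
    return best

# Hypothetical numbers with lam*C < 1
loss0, g, C, lam = 1.0, 0.8, 0.5, 0.4
lam_tilde = lam / (2.0 * (1.0 - lam * C))       # = 0.25
closed_form = loss0 + lam_tilde * g**2           # = 1.16
brute_force = inner_max_quadratic(loss0, g, C, lam)
```

The brute-force maximum agrees with the closed form to grid precision, and the maximizer sits at ‖δ*‖ = λg/(1 − λC), as claimed before Proposition 3.1.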

3.1 Finite difference implementation

Norm-squared input gradient regularization has long been used as a regularizer in neural networks: Drucker and LeCun (1991) first showed its effectiveness for generalization. Drucker and LeCun implemented gradient regularization with ‘double backpropagation’ to compute the derivatives of the penalty term with respect to the model parameters θ, which are needed to update the parameters during training. Double backpropagation involves two passes of automatic differentiation: one pass to compute the gradient of the loss with respect to the inputs x, and another pass on the output of the first to compute the gradient of the penalty term with respect to the model parameters θ. In neural networks, double backpropagation is the standard technique for computing the parameter gradient of a regularized loss. However, it is not currently scaleable to large neural networks. Instead, we approximate the gradient regularization term with finite differences.

Proposition 3.2 (Finite difference approximation of squared gradient norm).

Let d = ∇_x ℓ(x)/‖∇_x ℓ(x)‖ be the normalized input gradient direction when the gradient is nonzero, and set d = 0 otherwise. Let h be the finite difference step size. Assume further that the loss is twice continuously differentiable. Then the squared gradient norm is approximated by

    ‖∇_x ℓ(x)‖² ≈ ( (ℓ(x + h d) − ℓ(x)) / h )².    (11)

The vector d is normalized to ensure the accuracy of the finite difference approximation, which is of order O(h), as can be seen by a Taylor approximation. The finite difference approximation (11) allows the computation of the gradient of the regularizer (with respect to model parameters θ) with only two regular passes of backpropagation, rather than with double backpropagation. On the first pass, the input gradient direction d is calculated. The second pass computes the gradient with respect to model parameters by backpropagating through the right-hand side of (11). Double backpropagation is avoided by detaching d from the computational graph after the first pass. In practice, for large networks, we have found the finite difference approximation of the regularization term considerably more efficient than double backpropagation.

The proposed training algorithm, with squared Euclidean norm input gradient regularization, is presented in Algorithm 1 of the appendix. Other gradient penalty terms can be approximated as well. For example, when defending against attacks measured in the ℓ∞ norm, the squared ℓ1 norm penalty can be approximated by instead setting d = sign(∇_x ℓ(x)) when the gradient is nonzero.
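A minimal sketch of the finite difference estimate (11), on a toy smooth loss in place of a network (the analytic gradient here stands in for the single backpropagation pass an autograd framework would perform):

```python
import math

def loss(x):
    # toy smooth "loss" on R^2, standing in for a network loss
    return math.sin(x[0]) + 0.5 * x[1]**2

def grad_loss(x):
    # analytic input gradient (in practice: one backpropagation pass)
    return [math.cos(x[0]), x[1]]

def fd_sq_grad_norm(x, h=1e-3):
    """Estimate ||grad_x loss(x)||^2 via eq. (11): one step of size h
    along the normalized (detached) gradient direction."""
    g = grad_loss(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return 0.0
    d = [gi / norm for gi in g]                   # normalized direction, held fixed
    xp = [xi + h * di for xi, di in zip(x, d)]
    return ((loss(xp) - loss(x)) / h) ** 2

x = [0.3, -1.2]
exact = sum(gi * gi for gi in grad_loss(x))       # true squared gradient norm
approx = fd_sq_grad_norm(x)                       # finite difference estimate
```

In a training loop, backpropagating through `loss(xp) - loss(x)` with `d` detached yields the parameter gradient of the regularizer with two ordinary backward passes.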

(a) ℓ2-norm adversarial attacks
(b) ℓ∞-norm adversarial attacks
Figure 2: Adversarial attacks on the CIFAR-10 dataset, on networks built with standard ReLUs. Regularized networks attacked in ℓ2 are trained with squared ℓ2 norm gradient regularization; networks attacked in ℓ∞ are trained with squared ℓ1 norm regularization.

4 Experimental results

In this section we provide empirical evidence that input gradient regularization is an effective tool for promoting adversarial robustness, even on non-smooth networks built with standard activation functions.

We train networks on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) and on ImageNet-1k (Deng et al., 2009). On the CIFAR dataset we use the ResNeXt architecture (Xie et al., 2017): ResNeXt34-2x32 on CIFAR-10, and ResNeXt34-2x64 on CIFAR-100. On ImageNet-1k we use a ResNet-50 (He et al., 2016). The CIFAR networks were trained with standard data augmentation and learning rate schedules on a single GeForce GTX 1080 Ti. On ImageNet-1k, we modified the training code of Shaw et al.'s submission to the DAWNBench competition (Coleman et al., 2018) and train with four GPUs. Training code and trained model weights are available on GitHub.

We train an undefended network as a baseline against which to compare the various types of regularization. On CIFAR-10, networks are trained with squared ℓ2 and squared ℓ1 gradient norm regularization. The former is appropriate for defending against attacks measured in ℓ2; the latter for attacks measured in ℓ∞. We set the regularization strength λ to one of two values (the larger being 1), and fix the finite difference discretization h. We compare each network with the current state-of-the-art form of adversarial training, with models trained using the hyperparameters of Madry et al. (2017) (7 steps of FGSM, projected onto an ℓ∞ ball). On ImageNet-1k we only train adversarially robust models with squared ℓ2 regularization.

On each dataset, we attack 1000 randomly selected images. We perturb each image with attacks in both the ℓ2 (Euclidean) and ℓ∞ norms, using a suite of current state-of-the-art attacks: the Carlini–Wagner attack (Carlini and Wagner, 2017b); the Boundary attack (Brendel et al., 2018); the LogBarrier attack (Finlay et al., 2019); and PGD (Madry et al., 2017), in both the ℓ2 and ℓ∞ norms. The former three attacks are effective at evading gradient-masking defences; the latter is very good at finding images close to the original when gradients are not close to zero. We record the best adversarial distance on a per-image basis, for each norm.

CIFAR-10:

| model | % clean error | % error (smaller ε) | % error (larger ε) | improvement ratio | training time (hours) |
|---|---|---|---|---|---|
| Undefended | 4.36 | 70.82 | 98.94 | – | 2.06 |
| Madry et al. (7-step AT) | 16.33 | 22.86 | 46.02* | 1.88 | 12.10 |
| squared-norm regularization (weaker λ) | 6.45 | 24.92 | 70.41 | 5.31 | 5.22 |
| squared-norm regularization (stronger λ) | 9.02 | 18.47 | 58.69 | 3.78 | 5.15 |

ImageNet-1k:

| model | % clean error | % error (smaller ε) | % error (larger ε) | improvement ratio | training time (hours) |
|---|---|---|---|---|---|
| Undefended | 6.94 | 90.21 | 98.94 | – | 20.30 |
| Undefended (smooth) | 9.39 | 82.03 | 95.42 | 4.17 | 23.46 |
| squared ℓ2 norm (weakest λ) | 7.66 | 70.56 | 97.53 | 9.83 | 32.60 |
| squared ℓ2 norm (middle λ) | 9.49 | 63.23 | 94.21 | 5.84 | 52.47 |
| squared ℓ2 norm (strongest λ) | 10.26 | 52.79 | 95.93 | 3.19 | 33.87 |

*Madry et al. report 54.2% error at this ε with the WRN-28x10 architecture; our results are obtained with ResNeXt34 (2x32).

Table 1: Adversarial robustness statistics, measured in the ℓ∞ norm. Top-1 error is reported on CIFAR-10; Top-5 error on ImageNet-1k.

Adversarial robustness results for networks attacked in the ℓ∞ norm are presented in Table 1. These results are for networks built of standard ReLUs. Table 1 and Figure 2 demonstrate a clear trade-off between test accuracy and adversarial robustness as the strength of the regularization is increased. On CIFAR-10, the undefended network achieves a test error of 4.36%, but is not robust to attacks even at small perturbation distances. However, with a strong regularization parameter, test error increases to 9.02% on clean images, with only 18.47% error at the smaller attack distance. In contrast, the network trained with 7 steps of adversarial training appears to be over-regularized: on clean images, the adversarially trained network achieves 16.33% test error, but 22.86% error at the smaller attack distance. To be fair, at the commonly reported larger attack distance, the adversarially trained network outperforms the best gradient-regularized networks by about 12%, but at over twice the training time of the regularized networks. On ImageNet, we see a reduction of nearly 40% in error at the smaller attack distance.

It has been noted that adversarial robustness comes at a cost of degraded test error (Tsipras et al., 2018). This trade-off may be quantified. We measure the relative improvement in adversarial robustness against the cost of degraded test error with the following metric. Suppose an undefended network has test error e_u, and let a defended network's test error be e_d. Define the relative degradation in test error to be (e_d − e_u)/e_u. Similarly define the relative improvement in robustness (measured by mean adversarial distance, d_u for the undefended and d_d for the defended network) to be (d_d − d_u)/d_u. We define the adversarial improvement ratio to be the latter divided by the former. This measures the improvement in adversarial robustness against the expense of poorer test error: high values mean the defended model is much more robust and has not lost significant test accuracy. Values close to zero imply the model is more robust but has a much worse test accuracy relative to the undefended model. The improvement ratio is non-dimensional, and so it allows for comparison between datasets.
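The improvement ratio just defined can be computed directly; a short sketch with hypothetical numbers (variable names are ours):

```python
def improvement_ratio(err_undef, err_def, dist_undef, dist_def):
    """Adversarial improvement ratio: relative gain in mean adversarial
    distance divided by relative degradation in clean test error."""
    rel_err_degradation = (err_def - err_undef) / err_undef
    rel_robustness_gain = (dist_def - dist_undef) / dist_undef
    return rel_robustness_gain / rel_err_degradation

# Hypothetical: clean error 4.0% -> 8.0%, mean adversarial distance 0.1 -> 0.5.
# Robustness quadrupled while test error doubled, so the ratio is about 4.
ratio = improvement_ratio(4.0, 8.0, 0.1, 0.5)
```

Being a ratio of two relative changes, the metric is dimensionless and comparable across datasets and norms.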

Measured by this metric, the trade-off between test accuracy and adversarial robustness is clear. On both ImageNet-1k and CIFAR-10, models regularized with the weaker regularization strength offer the best trade-off between robustness and test error. If test accuracy is not of foremost concern, then stronger regularization parameters may be chosen. If neither training time nor test accuracy are important factors, then adversarial training is competitive with gradient regularization.

In Table 2 we report results on models trained for attacks in the ℓ2 norm. On CIFAR-10, the most robust model is trained with the strongest regularization, and outperforms even the adversarially trained model. On ImageNet-1k we see the same pattern: the model trained with the strongest regularization offers the best protection against adversarial attacks. Due to the long training time, we were not able to train ImageNet-1k with multi-step adversarial training.

In Table 2 we also report our theoretical bounds on the minimum distance required to adversarially perturb, using the Carlini–Wagner loss. (This loss can be modified for Top-k misclassification as well.) Figures 4 and 5 of the appendix show these bounds on a per-image basis. The theoretical bounds require the constants L and C, which are not readily available. Instead, we estimate L as the maximum gradient norm over test images; for smooth models we estimate C as the maximum spectral norm of the Hessian. (We compute the spectral norm of the Hessian using the Lanczos algorithm (Golub and Van Loan, 2012, §10.1) on Hessian-vector products computed via automatic differentiation.) These estimates are reported in Table 3 of the appendix. Gradient regularization reduces L and C by one to two orders of magnitude. Table 3 shows adversarial training also reduces L: effectively, adversarial training is a regularizer. Because L and C are estimated, and not exact, one would expect our bounds to sometimes fail. However, on CIFAR-10, the bounds reliably held on all attacked images. On ImageNet-1k, the bounds failed on about 9% of attacked test images, which indicates that L and C could be estimated more accurately, for example by estimating these constants locally, as in Weng et al. (2018).
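The spectral norm estimate can be sketched with power iteration on Hessian-vector products; the paper uses the Lanczos algorithm with autograd Hessian-vector products, so this simplified stand-in (on a small explicit Hessian) is ours:

```python
import math

def hvp(H, v):
    # Hessian-vector product; in practice computed by automatic
    # differentiation without ever forming H explicitly
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def spectral_norm(H, iters=200):
    """Estimate the largest-magnitude eigenvalue of a symmetric H by
    power iteration, using only Hessian-vector products."""
    v = [1.0] * len(H)
    lam = 0.0
    for _ in range(iters):
        w = hvp(H, v)
        lam = math.sqrt(sum(wi * wi for wi in w))  # ||H v|| with ||v|| = 1
        v = [wi / lam for wi in w]
    return lam

H = [[2.0, 1.0], [1.0, 2.0]]   # toy symmetric Hessian, eigenvalues 1 and 3
lam_max = spectral_norm(H)     # converges to the top eigenvalue, 3
```

Note that power iteration returns the largest eigenvalue in magnitude; Lanczos additionally distinguishes the largest positive eigenvalue, which is the quantity C actually requires.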

CIFAR-10:

| model | % clean error | L-bound | C-bound | empirical mean distance | improvement ratio | training time (hours) |
|---|---|---|---|---|---|---|
| Undefended | 4.36 | – | – | 0.12 | – | 2.06 |
| Undefended (smooth) | 6.84 | – | – | 0.11 | – | 3.78 |
| Madry et al. (7-step AT) | 16.33 | 0.18 | – | 0.74 | 1.81 | 12.10 |
| squared-norm regularization (weakest λ) | 8.03 | 0.14 | – | 0.63 | 4.86 | 5.18 |
| squared-norm regularization (middle λ, smooth) | 11.68 | 0.13 | 0.17 | 0.59 | 2.25 | 9.46 |
| squared-norm regularization (strongest λ) | 20.31 | 0.30 | – | 0.81 | 1.52 | 5.08 |

ImageNet-1k:

| model | % clean error | L-bound | C-bound | empirical mean distance | improvement ratio | training time (hours) |
|---|---|---|---|---|---|---|
| Undefended | 6.94 | – | – | 0.55 | – | 20.30 |
| Undefended (smooth) | 9.39 | – | – | 0.56 | 0.12 | 23.46 |
| squared ℓ2 norm (weakest λ) | 7.66 | 0.13 | – | 1.14 | 10.23 | 32.60 |
| squared ℓ2 norm (middle λ) | 9.49 | – | – | 1.09 | 2.64 | 52.47 |
| squared ℓ2 norm (strongest λ) | 10.26 | 0.26 | – | 1.75 | 4.52 | 33.87 |

Table 2: Adversarial robustness statistics, measured in the ℓ2 norm. Top-1 error is reported on CIFAR-10; Top-5 error on ImageNet-1k.

5 Conclusion

We have provided motivation for training adversarially robust networks through input gradient regularization, by bounding the minimum adversarial distance with gradient and curvature statistics of the loss. We have shown empirically that gradient regularization is scaleable to ImageNet-1k, and provides adversarial robustness competitive with adversarial training. We gave theoretical per-image bounds on the minimum adversarial distance, for non-smooth models (using the Lipschitz constant of the loss), and augmented these bounds using smooth models with a second-order bound based on model curvature. These bounds were empirically validated against state-of-the-art attacks.


Appendix A Additional methods and results

1:  Input: initial model parameters θ. Hyperparameters: regularization strength λ; batch size m; finite difference discretization h
2:  while not converged do
3:     sample a minibatch of data {(x_i, y_i)}, i = 1, …, m, from the empirical distribution
4:     for i = 1 to m do
5:        compute the input gradient g_i = ∇_x ℓ(x_i; θ) with one backpropagation pass
6:        set d_i = g_i / ‖g_i‖ (for the ℓ∞-norm, use the normalized signed gradient instead)
7:        detach d_i from the computational graph
8:        compute the regularizer R_i = ((ℓ(x_i + h d_i; θ) − ℓ(x_i; θ)) / h)²
9:     end for
10:    form the regularized minibatch loss (1/m) Σ_i [ℓ(x_i; θ) + λ R_i]
11:    compute its gradient with respect to θ with one backpropagation pass
12:    update θ with a gradient-based optimizer
13:  end while
Algorithm 1 Training with squared-norm input gradient regularization, using finite differences
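As a concrete toy instantiation of Algorithm 1 (ours, not the authors' training code), the following pure-Python sketch performs one training step for a linear model with squared loss; the analytic gradients stand in for the two backpropagation passes, and the direction d is held fixed exactly as in the detach step:

```python
import math

def loss(w, x, y):
    p = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (p - y)**2

def dloss_dx(w, x, y):
    # gradient with respect to the input (first "backprop" pass)
    r = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [r * wi for wi in w]

def dloss_dw(w, x, y):
    # gradient with respect to the parameters (second "backprop" pass)
    r = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [r * xi for xi in x]

def train_step(w, batch, lam=0.1, h=1e-3, lr=0.05):
    """One step of Algorithm 1. The direction d is treated as a constant
    ('detached'), so the parameter gradient of the regularizer needs only
    ordinary gradient evaluations, not double backpropagation."""
    grad = [0.0] * len(w)
    for x, y in batch:
        g = dloss_dx(w, x, y)
        n = math.sqrt(sum(gi * gi for gi in g))
        d = [gi / n for gi in g] if n > 0 else [0.0] * len(g)
        xp = [xi + h * di for xi, di in zip(x, d)]
        fd = (loss(w, xp, y) - loss(w, x, y)) / h   # ~ ||grad_x loss||
        gw, gwp = dloss_dw(w, x, y), dloss_dw(w, xp, y)
        for j in range(len(w)):
            # d/dw of [loss + lam * fd^2], with d held fixed
            grad[j] += gw[j] + lam * 2.0 * fd * (gwp[j] - gw[j]) / h
    return [wj - lr * gj / len(batch) for wj, gj in zip(w, grad)]
```

One such step decreases the regularized objective on this toy problem; in a framework, `dloss_dw` at the two points corresponds to a single backward pass through the right-hand side of (11).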
(a) ℓ2-norm adversarial attacks
(b) ℓ∞-norm adversarial attacks
Figure 3: Adversarial attacks on ImageNet-1k with the ResNet-50 architecture. Top5 error reported.
Figure 4: Theoretical lower bounds on the minimum adversarial distance for CIFAR-10, on networks with smooth activation functions. Defended networks are trained with a squared-norm gradient penalty.
Figure 5: Theoretical lower bounds on the minimum adversarial distance for ImageNet-1k, on networks with smooth activation functions. Defended networks are trained with a squared-norm gradient penalty.
CIFAR-10:

| model | mean ‖∇ℓ‖ | mean C | maximum ‖∇ℓ‖ | maximum C |
|---|---|---|---|---|
| Undefended | 3.05 | – | 122.34 | – |
| Undefended (smooth) | 3.25 | 198.23 | 65.35 | 8134.26 |
| Madry et al. (7-step AT) | 0.40 | – | 2.52 | – |
| squared-norm regularization (weakest λ) | 0.58 | – | 4.43 | – |
| squared-norm regularization (middle λ, smooth) | 0.65 | 2.08 | 4.52 | 27.05 |
| squared-norm regularization (strongest λ) | 0.35 | – | 1.33 | – |

ImageNet-1k:

| model | mean ‖∇ℓ‖ | mean C | maximum ‖∇ℓ‖ | maximum C |
|---|---|---|---|---|
| Undefended | 1.12 | – | 17.51 | – |
| Undefended (smooth) | 1.02 | 11.61 | 25.43 | 848.69 |
| squared ℓ2 norm (weakest λ) | 0.46 | – | 4.85 | – |
| squared ℓ2 norm (middle λ, smooth) | 0.45 | 1.87 | 6.99 | 171.98 |
| squared ℓ2 norm (strongest λ) | 0.27 | – | 2.12 | – |

Table 3: Regularity statistics on selected models, measured in the ℓ2 norm. Statistics are computed using the modified Carlini–Wagner loss; a soft maximum is used for the curvature statistics.