1 Introduction
Neural networks are vulnerable to adversarial attacks: small perturbations of an image, imperceptible to the human eye, which cause a network to misclassify the image (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014). The threat posed by adversarial attacks must be addressed before these methods can be deployed in error-sensitive and security-based applications (Potember, 2017).
Building adversarially robust models is an optimization problem with two objectives: (i) maintain test accuracy on clean, unperturbed images, and (ii) be robust to large adversarial perturbations. The present state-of-the-art method for adversarial defence, adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014; Tramèr et al., 2018; Madry et al., 2017; Miyato et al., 2018), in which models are trained on perturbed images, offers robustness at the expense of test accuracy (Tsipras et al., 2018). It is not clear that multi-step adversarial training is scaleable to large datasets such as ImageNet-1k (Deng et al., 2009). Previous attempts (Kannan et al., 2018; Xie et al., 2018) used hundreds of GPUs and took nearly a week to train, although recent work by Shafahi et al. (2019) has offered a remedy.
Assessing the empirical effectiveness of an adversarial defence requires careful testing with multiple attacks (Goodfellow et al., 2018). Furthermore, existing defences are vulnerable to new, stronger attacks: Carlini and Wagner (2017a) and Athalye et al. (2018) advocate designing specialized attacks to circumvent prior defences, while Uesato et al. (2018) warn against using weak attacks to evaluate robustness. This has led the community to develop theoretical tools to certify adversarial robustness. Several certification approaches have been proposed: linear programming (Wong and Kolter, 2018; Wong et al., 2018) or mixed-integer linear programming (Xiao et al., 2018); semidefinite relaxation (Raghunathan et al., 2018a,b); randomized smoothing (Li et al., 2018; Cohen et al., 2019); and estimates of the local Lipschitz constant (Hein and Andriushchenko, 2017; Weng et al., 2018; Tsuzuku et al., 2018). The latter two approaches have scaled well to ImageNet-1k. In practice, certifiably robust networks often perform worse than adversarially trained models, which lack theoretical guarantees. In this article, we work towards bridging the gap between theoretically robust networks and empirically effective training methods. Our approach relies on minimizing a loss regularized against large input gradients:
\[ \min_{w}\; \mathbb{E}_{(x,y)} \Big[ \mathcal{L}(x; w) + \lambda\, \lVert \nabla_x \mathcal{L}(x; w) \rVert^2 \Big] \tag{1} \]
where $\mathcal{L}(x; w)$ denotes the training loss on input $x$, and the norm $\lVert \cdot \rVert$ is dual to the one measuring adversarial attacks (for example the $\ell_1$ norm for attacks measured in the $\ell_\infty$ norm). Heuristically, making loss gradients small should make gradient-based attacks more challenging.
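As a concrete illustration, the regularized objective of (1) can be sketched for a toy logistic model. All names and constants below are illustrative (a sketch of the idea, not the paper's training code), and the analytic input gradient stands in for automatic differentiation.

```python
import numpy as np

# Toy setting: binary logistic model p = sigmoid(w @ x), cross-entropy loss.
# The regularized objective of Eq. (1) adds a penalty on the input gradient
# of the loss, measured in the norm dual to the attack norm.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Binary cross-entropy of a linear-logistic model.
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(w, x, y):
    # Analytic gradient of the loss with respect to the *input* x:
    # dL/dx = (p - y) * w for the logistic model.
    p = sigmoid(w @ x)
    return (p - y) * w

def regularized_loss(w, x, y, lam, dual_ord):
    # Eq. (1): loss + lambda * ||grad_x loss||^2, in the dual norm.
    # dual_ord=1 corresponds to attacks measured in the l-infinity norm.
    g = input_grad(w, x, y)
    return loss(w, x, y) + lam * np.linalg.norm(g, ord=dual_ord) ** 2

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
reg = regularized_loss(w, x, y=1, lam=0.1, dual_ord=1)
```

With `lam = 0` the objective reduces to the plain loss; any positive `lam` adds a nonnegative gradient penalty.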
Drucker and LeCun (1991) implemented gradient regularization using ‘double backpropagation’, which has been shown to improve model generalization (Novak et al., 2018). It has been used to improve the stability of GANs (Roth et al., 2017; Nagarajan and Kolter, 2017) and to promote learning robust features with contractive autoencoders (Rifai et al., 2011). While it has been proposed for adversarial attacks robustness (Ross and DoshiVelez, 2018; Roth et al., 2018; Hein and Andriushchenko, 2017; Jakubovitz and Giryes, 2018; SimonGabriel et al., 2018), experimental evidence has been mixed, in particular, input gradient regularization has so far not been competitive with multistep adversarial training.
On nonsmooth networks (such as those built of ReLUs), small gradients are no guarantee of adversarial robustness (Papernot et al., 2017), and so it is thought that input gradient regularization should not be effective on nonsmooth networks. This raises the question: how often is the lack of smoothness an issue in practice? In other words, when do Taylor approximations of the loss fail to predict adversarial robustness, and is smoothness only needed theoretically? The fact that first-order gradient-based attacks on the loss (like PGD (Madry et al., 2017)) are usually effective indicates that in many scenarios, nonsmoothness is not an issue. However, in a non-negligible minority of cases, attacks based on decision boundary information (Carlini and Wagner, 2017b; Brendel et al., 2018; Chen and Jordan, 2019; Finlay et al., 2019) outperform gradient-based attacks. This indicates that the curvature near these points is large, and first-order information is not sufficient to guarantee robustness. We illustrate this point in Figure 1. In this work we overcome the limitation of gradient regularization for nonsmooth networks by instead building networks of 'smoothed' ReLUs. At the expense of a minor drop in test accuracy, we obtain tighter theoretical lower bounds on robustness, since we can better approximate the loss using local information.
Another drawback of input gradient regularization is that it is not presently tractable to update model weights using double backpropagation on large networks. We circumvent this limitation by differentiating the regularization term without double backpropagation.
Our main contributions are the following. First, we motivate using input gradient regularization of the loss by deriving new theoretical robustness bounds. These bounds show that small loss gradients and small curvature are sufficient conditions for adversarial robustness. Second, we empirically show that input gradient regularization is competitive with adversarial training, even on nonsmooth networks, at a fraction of the training time. Finally, we scale input gradient regularization to ImageNet-1k by using finite differences to estimate the gradient regularization term, rather than double backpropagation. This allows us to train adversarially robust networks on ImageNet-1k in 33 hours on four consumer-grade GPUs.
2 Adversarial robustness bounds from the loss
2.1 Background
Much effort has been directed towards determining theoretical lower bounds on the minimum-sized perturbation necessary to perturb an image so that it is misclassified by a model. One promising approach, proposed by Hein and Andriushchenko (2017) and Weng et al. (2018), and which has scaled well to ImageNet-1k, is to use the Lipschitz constant of the model. In this section, we build upon these ideas: we propose using the Lipschitz constant of a suitable loss, designed to measure classification errors. In addition, when the loss is twice continuously differentiable, we propose a second-order bound based on the maximum curvature of the loss.
Our notation is as follows. Write $f(x; w)$ for a model which takes input vectors $x$ to label probabilities, with parameters $w$. Let $\ell(\cdot, \cdot)$ be the loss, and write $\mathcal{L}(x) := \ell(f(x; w), y)$ for the loss of a model $f$ on input $x$ with correct label $y$. Finding an adversarial perturbation is interpreted as a global minimization problem: find the closest image to a clean image, in some specified norm, that is also misclassified by the model:
\[ \min_{x'} \; \lVert x' - x \rVert \quad \text{subject to} \quad \arg\max_i f_i(x'; w) \neq y \tag{2} \]
However, (2) is a difficult and costly nonsmooth, nonconvex optimization problem. Instead, Goodfellow et al. (2014) proposed solving a surrogate problem: find a perturbation of a clean image that maximizes the loss, subject to the condition that the perturbation be inside a norm ball of radius $\varepsilon$ around the clean image. The surrogate problem is written
\[ \max_{x'} \; \mathcal{L}(x') \quad \text{subject to} \quad \lVert x' - x \rVert \le \varepsilon \tag{3} \]
The hard constraint forces perturbations to be inside the norm ball of radius $\varepsilon$ centred at the clean image $x$. Ideally, solutions of this surrogate problem (3) will closely align with solutions of the original, more difficult global minimization problem. However, the hard constraint in (3) forces a particular scale: it may miss attacks which would succeed with only a slightly bigger norm. Additionally, the maximization problem (3) does not force misclassification; it only asks that the loss be increased.
The advantage of (3) is that it may be solved with gradient-based methods: present best practice is to use variants of projected gradient descent (PGD), such as the iterative fast gradient sign method (Kurakin et al., 2016; Madry et al., 2017), when attacks are measured in the $\ell_\infty$ norm. However, gradient-based methods are not always effective: on nonsmooth networks, such as those built of ReLU activation functions, a small gradient does not guarantee that the loss remains small locally. This deficiency was identified in (Papernot et al., 2016). See Figure 1: the loss may increase rapidly under a very small perturbation, even when local gradients are small. PGD methods will fail to locate these worst-case perturbations, and give a false impression of robustness. Carlini and Wagner (2017b) avoid this scenario by incorporating decision boundary information into the loss; others solve (2) directly (Brendel et al., 2018; Chen and Jordan, 2019; Finlay et al., 2019).
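The PGD iteration described above can be sketched on a toy problem. This is an illustrative numpy sketch, not the paper's attack code: the "loss" is a simple quadratic whose constrained maximizer is known in closed form, so the result can be checked.

```python
import numpy as np

# Minimal PGD sketch for Eq. (3): signed gradient ascent steps, each
# followed by projection back onto the l-infinity ball of radius eps.

def pgd_linf(grad_fn, x0, eps, step, n_steps):
    """Projected gradient ascent in the l-infinity ball around x0."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * np.sign(grad_fn(x))   # signed gradient step
        x = np.clip(x, x0 - eps, x0 + eps)   # project onto the ball
    return x

# Toy "loss" L(x) = ||x - t||^2: the maximizer over the box moves each
# coordinate of x away from the target t, to the corner of the ball.
t = np.array([1.0, -1.0])
grad_fn = lambda x: 2.0 * (x - t)            # gradient of ||x - t||^2
x0 = np.zeros(2)
x_adv = pgd_linf(grad_fn, x0, eps=0.1, step=0.05, n_steps=10)
```

For this toy loss the attack lands on the corner of the ball farthest from `t`, namely `[-0.1, 0.1]`.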
2.2 Derivation of lower bounds
This leads us to consider a compromise between (2) and (3). Consider the following modification of the Carlini and Wagner (2017b) loss: $\mathcal{L}(x) = \max_{i \neq c} f_i(x) - f_c(x)$, where $c$ is the index of the correct label, and $f_i(x)$ is the model output for the $i$-th label. This loss has the appealing property that the sign of the loss determines whether the classification is correct. Adversarial attacks are found by minimizing
\[ \min_{x'} \; \lVert x' - x \rVert \quad \text{subject to} \quad \mathcal{L}(x') \ge \mathcal{L}_{\mathrm{adv}} \tag{4} \]
The constant $\mathcal{L}_{\mathrm{adv}}$ determines when classification is incorrect; for the modified Carlini–Wagner loss, $\mathcal{L}_{\mathrm{adv}} = 0$. Problem (4) is closer to the true problem (2), and will always find an adversarial image. We use (4) to derive theoretical lower bounds on the minimum size perturbation necessary to misclassify an image. Suppose the loss is Lipschitz continuous with respect to the model input, with Lipschitz constant $L$. Then we have the estimate
\[ \lvert \mathcal{L}(x') - \mathcal{L}(x) \rvert \;\le\; L \, \lVert x' - x \rVert \tag{5} \]
Now suppose $x'$ is adversarial, with minimum adversarial loss $\mathcal{L}_{\mathrm{adv}}$. Then rearranging (5), we obtain the lower bound
\[ \lVert x' - x \rVert \;\ge\; \frac{\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x)}{L}. \]
Unfortunately, the Lipschitz constant $L$ is a global quantity, and ignores local gradient information; see for example Huster et al. (2018). Thus this bound can be quite poor, even when networks have a small Lipschitz constant. On the other hand, if the model is twice continuously differentiable, then the loss landscape is smoother. This allows us to achieve a tighter bound using local gradient information, as illustrated in Figure 1. Let $C$ be an upper bound on the maximum positive eigenvalue of the Hessian of the loss, over all inputs $x$:
\[ C \;\ge\; \max_{x} \; \lambda_{\max}\!\big( \nabla^2_x \mathcal{L}(x) \big). \tag{6} \]
This value will be estimated empirically by maximizing over the dataset. The constant $C$ is a measure of the largest positive curvature of the loss. Using a Taylor approximation about $x$, we may upper bound the perturbed loss with
\[ \mathcal{L}(x') \;\le\; \mathcal{L}(x) + \nabla_x \mathcal{L}(x) \cdot (x' - x) + \frac{C}{2}\, \lVert x' - x \rVert^2. \tag{7} \]
These two bounds give us the following.
Proposition 2.1.
Suppose the loss is Lipschitz continuous with respect to the model input $x$, with Lipschitz constant $L$. Let $\mathcal{L}_{\mathrm{adv}}$ be such that if $\mathcal{L}(x') < \mathcal{L}_{\mathrm{adv}}$, the model is always correct. Then a lower bound on the minimum magnitude of perturbation necessary to adversarially perturb an image $x$ is
\[ \lVert x' - x \rVert \;\ge\; \frac{\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x)}{L}. \tag{bound-I} \]
Suppose in addition that the loss is twice differentiable, with maximum curvature $C$ (defined as in (6)). Then
\[ \lVert x' - x \rVert \;\ge\; \frac{\sqrt{\lVert \nabla_x \mathcal{L}(x) \rVert^2 + 2C\,\big(\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x)\big)} - \lVert \nabla_x \mathcal{L}(x) \rVert}{C}. \tag{bound-II} \]
The proof of (bound-I) is given above; the proof of (bound-II) follows by rearranging (7) and solving for $\lVert x' - x \rVert$.
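The two bounds of Proposition 2.1 are easy to evaluate numerically. The constants below (Lipschitz constant, curvature, loss gap, local gradient norm) are made-up placeholders, not values from the paper; the sketch simply shows that when the local gradient is much smaller than the global Lipschitz constant, the second-order bound is tighter.

```python
import numpy as np

# Numeric sketch of the two lower bounds in Proposition 2.1.

def first_order_bound(gap, lipschitz):
    # ||x' - x|| >= (L_adv - L(x)) / L
    return gap / lipschitz

def second_order_bound(gap, grad_norm, curvature):
    # ||x' - x|| >= (sqrt(g^2 + 2*C*gap) - g) / C,
    # obtained by solving g*t + (C/2)*t^2 >= gap for t = ||x' - x||.
    g, C = grad_norm, curvature
    return (np.sqrt(g ** 2 + 2.0 * C * gap) - g) / C

gap = 2.0          # loss gap L_adv - L(x)  (placeholder)
lipschitz = 10.0   # global Lipschitz constant of the loss (placeholder)
grad_norm = 0.5    # local gradient norm at x (placeholder)
curvature = 4.0    # curvature bound C (placeholder)

b1 = first_order_bound(gap, lipschitz)
b2 = second_order_bound(gap, grad_norm, curvature)
```

Here `b2` exceeds `b1` because the local gradient (0.5) is far below the global Lipschitz constant (10); by construction, the quadratic `g*t + (C/2)*t^2` evaluated at `t = b2` recovers the loss gap exactly.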
Remark 2.2.
The second-order bound requires that the network and loss are smooth with respect to the input, but almost all image classification networks now use ReLUs, which are not smooth. We therefore use a smoothed ReLU: a twice continuously differentiable activation that agrees with $\mathrm{ReLU}(x) = \max(x, 0)$ outside of a small interval $[0, h]$, for instance the quintic interpolant
\[ \sigma_h(x) \;=\; \begin{cases} 0 & x \le 0, \\ \big(6 h^2 x^3 - 8 h x^4 + 3 x^5\big)/h^4 & 0 < x < h, \\ x & x \ge h. \end{cases} \tag{8} \]
This activation function is twice continuously differentiable, and avoids the vanishing gradient problem of smooth sigmoidal activation functions. Moreover, because it agrees with the ReLU outside of the interval $[0, h]$, it is fairly efficient during backpropagation. As for the loss, a smooth version of the Carlini–Wagner loss is available by using a soft maximum, rather than a strict maximum.

Proposition 2.1 motivates the need for input gradient regularization. The Lipschitz constant $L$ is the maximum gradient norm of the loss over all inputs. Therefore the first-order bound says that a regularization term encouraging small gradients (and so reducing $L$) should increase the minimum adversarial distance. This aligns with Hein and Andriushchenko (2017), who proposed the cross-Lipschitz regularizer, penalizing networks with large Jacobians in order to shrink the Lipschitz constant of the network.
However, this is not enough: the loss gap $\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x)$ must be large as well. This explains one form of 'gradient masking' (Papernot et al., 2017): shrinking the magnitude of gradients while also closing the loss gap effectively does nothing to improve adversarial robustness. For example, in defense distillation, the magnitude of the model Jacobian is reduced by increasing the temperature of the final softmax layer of the network. However, this has the detrimental side-effect of sending each model output to $1/K$, where $K$ is the number of classes, which effectively shrinks the loss gap to zero. Thus with high distillation temperatures the lower bound provided by Proposition 2.1 approaches zero. Moreover, even supposing the loss gradients are small and the gap is large, there may still be adversarially vulnerable images. For example, suppose we have two smooth networks, one with large curvature, and another with small curvature. Suppose that there is an image with zero gradient on both networks, each with identically large loss gap $\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x)$. The second-order bound says that the minimum adversarial distance here is bounded below by $\sqrt{2\,(\mathcal{L}_{\mathrm{adv}} - \mathcal{L}(x))/C}$. In other words, the network with smaller curvature is more robust.
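The distillation failure mode described above is simple to demonstrate: raising the softmax temperature flattens the outputs toward $1/K$, which collapses the Carlini–Wagner-style gap between the correct class and the best other class. The logits below are illustrative.

```python
import numpy as np

# Raising the softmax temperature T flattens the output toward 1/K,
# collapsing the loss gap even though gradients also shrink.

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))   # shift for numerical stability
    return e / e.sum()

def cw_gap(probs, correct):
    # Gap between the correct-class probability and the best other class.
    others = np.delete(probs, correct)
    return probs[correct] - np.max(others)

logits = np.array([5.0, 1.0, -2.0])     # illustrative logits, class 0 correct
gap_T1 = cw_gap(softmax(logits, T=1.0), correct=0)
gap_T100 = cw_gap(softmax(logits, T=100.0), correct=0)
```

At temperature 1 the gap is large; at temperature 100 every probability is close to $1/3$ and the gap has nearly vanished, so the lower bound of Proposition 2.1 is tiny.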
Taken together, Proposition 2.1 provides three sufficient conditions for training robust networks: (i) the loss gap should be large; (ii) the gradients of the loss should be small; and (iii) the curvature of the loss should also be small. The first condition is satisfied by default when the loss is minimized. The second is satisfied by training with a loss regularized to penalize large input gradients. Experimentally, we find the third is also satisfied by input gradient regularization. When these conditions hold, local information is enough to guarantee robustness.
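A smoothed ReLU of the kind described in Remark 2.2 can be implemented in a few lines. The quintic spline below is one concrete $C^2$ construction that matches $\max(x,0)$, together with its first and second derivatives, at both ends of $[0, h]$; it is an illustrative choice, and the exact polynomial used in practice may differ.

```python
import numpy as np

# A C^2 smoothing of the ReLU: quintic spline on [0, h], exact ReLU outside.

def smooth_relu(x, h=0.5):
    x = np.asarray(x, dtype=float)
    t = np.clip(x / h, 0.0, 1.0)
    # Quintic with value/1st/2nd derivative matching ReLU at t=0 and t=1.
    spline = h * (6 * t ** 3 - 8 * t ** 4 + 3 * t ** 5)
    return np.where(x <= 0, 0.0, np.where(x >= h, x, spline))

xs = np.array([-1.0, 0.0, 0.25, 0.5, 2.0])
ys = smooth_relu(xs, h=0.5)
```

Outside $[0, h]$ the function is exactly the ReLU, so (as the text notes) backpropagation is cheap away from the smoothing interval; inside, the activation interpolates smoothly between 0 and the identity.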
Our robustness bounds are most similar in spirit to Weng et al. (2018), who derive bounds using an estimate of the local Lipschitz constant of the model. Moosavi-Dezfooli et al. (2018) have also used a second-order approximation to derive approximate robustness bounds for binary classification, but they neglected higher-order error terms. Cohen et al. (2019) derive bounds by training with normally distributed input noise, then averaging model predictions normally sampled about the input image. It is well known that training with normal noise is equivalent to squared $\ell_2$-norm gradient regularization (Bishop, 1995); thus Cohen et al. (2019) achieve gradient regularization indirectly. Our bounds require at most one gradient and model evaluation per image once $L$ and $C$ have been estimated, whereas both Cohen et al. and Weng et al. require many hundreds of local model evaluations per image. Since $L$ and $C$ are globally estimated, our bounds could be improved using these local sampling techniques to obtain local values of $L$ and $C$, with more computational effort.

3 Squared norm gradient regularization
Proposition 2.1 provides strong motivation for input gradient regularization as a method for promoting adversarial robustness. However, it does not tell us what form the gradient regularization term should take. In this section, we show how squared-norm gradient regularization arises from a quadratic cost.
In adversarial training, solutions of (3) are used to generate images on which the network is trained. In effect, adversarial training seeks a solution of the minimax problem
\[ \min_{w} \; \mathbb{E}_{x \sim \rho} \Big[ \max_{x'} \; \mathcal{L}(x'; w) - c(x', x) \Big] \tag{9} \]
where $\rho$ is the distribution of images and $c$ is a cost function. This is a robust optimization problem (Wald, 1945; Rousseeuw and Leroy, 1987). The cost function penalizes perturbed images for being too far from the original. When the cost function is the hard constraint from (3), perturbations must be inside a norm ball of radius $\varepsilon$. This leads to adversarial training with PGD (Kurakin et al., 2016; Madry et al., 2017). However, this forces a particular scale: it is possible that no images are adversarial within radius $\varepsilon$, but that there are adversarial images at only a slightly larger distance. Instead of using a hard constraint, we can relax the cost function to the quadratic cost $c(x', x) = \frac{1}{2\gamma}\,\lVert x' - x \rVert^2$. The quadratic cost allows attacks of any size, but penalizes larger attacks more than smaller attacks. With a quadratic cost, there is less of a danger that a local attack will be overlooked.
Solving (9) directly is expensive: on ImageNet-1k, both Kannan et al. (2018) and Xie et al. (2018) required large-scale distributed training with many dozens or hundreds of GPUs, and over a week of training time. Instead, we take the view that (9) may be bounded above, and solved approximately. When the loss is smooth and $c$ is the quadratic cost, the optimal value of $\lVert x' - x \rVert$ using the bound (7) is $\gamma \lVert \nabla_x \mathcal{L}(x) \rVert / (1 - \gamma C)$, provided $\gamma C < 1$. This gives the following proposition.
Proposition 3.1.
Suppose both the model and the loss are twice continuously differentiable, and suppose attacks are measured with the quadratic cost $c(x', x) = \frac{1}{2\gamma}\,\lVert x' - x \rVert^2$. Then the optimal value of (9) is bounded above by
\[ \min_{w} \; \mathbb{E}_{x \sim \rho} \Big[ \mathcal{L}(x; w) + \lambda\, \lVert \nabla_x \mathcal{L}(x; w) \rVert^2 \Big] \tag{10} \]
where $\lambda = \gamma / \big(2\,(1 - \gamma C)\big)$.
That is, we may bound the solution of the adversarial training problem (9) by solving the gradient regularization problem (10) when the cost function is quadratic. It is not necessary to know or compute $\gamma$ or $C$; they are absorbed into $\lambda$. In the adversarial robustness literature, input gradient regularization using the squared $\ell_2$ norm was proposed by Ross and Doshi-Velez (2018). It was expanded by Roth et al. (2018) to use a Mahalanobis norm with the correlation matrix of adversarial attacks. When $c$ is the hard constraint forcing attacks inside the $\varepsilon$ norm ball and $\varepsilon$ is small, supposing the curvature term is negligible, we can estimate the maximum in (9) by $\mathcal{L}(x) + \varepsilon\, \lVert \nabla_x \mathcal{L}(x) \rVert_*$, using the dual norm for the gradient. This is norm gradient regularization (not squared), and was recently used for adversarial robustness on both CIFAR-10 (Simon-Gabriel et al., 2018) and MNIST (Seck et al., 2019).
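The derivation behind Proposition 3.1 can be checked in one dimension. For a quadratic loss surrogate $\mathcal{L}_0 + g\,d + \tfrac{C}{2}d^2$ with quadratic cost $d^2/(2\gamma)$, the inner maximum of (9) equals $\mathcal{L}_0 + \lambda g^2$ with $\lambda = \gamma/(2(1 - \gamma C))$, provided $\gamma C < 1$. The constants below are illustrative; the sketch compares this closed form against a brute-force grid search.

```python
import numpy as np

# One-dimensional sanity check of the lambda formula in Proposition 3.1.
L0, g, C, gamma = 1.0, 0.8, 0.5, 0.4       # gamma * C = 0.2 < 1

# Brute-force the inner maximum of Eq. (9) on a fine grid.
d = np.linspace(-5.0, 5.0, 200_001)
objective = L0 + g * d + 0.5 * C * d ** 2 - d ** 2 / (2.0 * gamma)
numeric_max = objective.max()

# Closed form: L0 + lambda * g^2, lambda = gamma / (2 * (1 - gamma * C)).
lam = gamma / (2.0 * (1.0 - gamma * C))
closed_form = L0 + lam * g ** 2
```

The two agree to high precision; the maximizing perturbation size is $\gamma g/(1-\gamma C)$, matching the text.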
3.1 Finite difference implementation
Norm-squared input gradient regularization has long been used as a regularizer in neural networks: Drucker and LeCun (1991) first showed its effectiveness for generalization. Drucker and LeCun implemented gradient regularization with 'double backpropagation' to compute the derivatives of the penalty term with respect to the model parameters $w$, which are needed to update the parameters during training. Double backpropagation involves two passes of automatic differentiation: one pass to compute the gradient of the loss with respect to the inputs $x$, and another pass on the output of the first to compute the gradient of the penalty term with respect to the model parameters $w$. In neural networks, double backpropagation is the standard technique for computing the parameter gradient of a regularized loss. However, it is not currently scaleable to large neural networks. Instead, we approximate the gradient regularization term with finite differences.
Proposition 3.2 (Finite difference approximation of squared gradient norm).
Let $d = \nabla_x \mathcal{L}(x) / \lVert \nabla_x \mathcal{L}(x) \rVert$ be the normalized input gradient direction when the gradient is nonzero, and set $d = 0$ otherwise. Let $h$ be the finite difference step size. Assume further that the loss is twice continuously differentiable. Then the squared gradient norm is approximated by
\[ \lVert \nabla_x \mathcal{L}(x) \rVert^2 \;\approx\; \left( \frac{\mathcal{L}(x + h\, d) - \mathcal{L}(x)}{h} \right)^{2} \tag{11} \]
The vector $d$ is normalized to ensure the accuracy of the finite difference approximation, which is of order $h$, as can be seen by a Taylor approximation. The finite difference approximation (11) allows the gradient of the regularizer (with respect to the model parameters $w$) to be computed with only two regular passes of backpropagation, rather than with double backpropagation. On the first pass, the input gradient direction $d$ is calculated. The second computes the gradient with respect to the model parameters by performing backpropagation on the right-hand side of (11). Double backpropagation is avoided by detaching $d$ from the computational graph after the first pass. In practice, for large networks, we have found that the finite difference approximation of the regularization term is considerably more efficient than using double backpropagation.
The proposed training algorithm, with squared Euclidean input gradient regularization, is presented in Algorithm 1 of the appendix. Other gradient penalty terms can be approximated as well. For example, when defending against attacks measured in the $\ell_\infty$ norm, the squared $\ell_1$ norm penalty can be approximated by instead setting $d = \operatorname{sign}(\nabla_x \mathcal{L}(x))$ when the gradient is nonzero.
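The finite-difference estimate of (11), including the signed variant for the squared $\ell_1$ penalty, can be sketched on a smooth toy loss (the toy loss and its analytic gradient are illustrative; in training, $d$ would come from one backpropagation pass and be detached from the graph):

```python
import numpy as np

# Finite-difference estimate of the squared gradient norm, Eq. (11):
# ||grad L(x)||^2 is approximated by ((L(x + h*d) - L(x)) / h)^2.

def loss(x):
    return np.sum(np.sin(x) ** 2)          # smooth toy loss

def grad(x):
    return 2.0 * np.sin(x) * np.cos(x)     # its analytic input gradient

def fd_sq_grad_norm(loss, g, x, h=1e-4, norm_ord=2):
    if norm_ord == 2:
        d = g / np.linalg.norm(g)          # l2: step along unit gradient
    else:
        d = np.sign(g)                     # l1 variant: signed direction
    return ((loss(x + h * d) - loss(x)) / h) ** 2

x = np.array([0.3, -0.7, 1.1])
g = grad(x)
approx_l2 = fd_sq_grad_norm(loss, g, x)
exact_l2 = np.linalg.norm(g) ** 2
approx_l1 = fd_sq_grad_norm(loss, g, x, norm_ord=1)
exact_l1 = np.linalg.norm(g, 1) ** 2
```

The directional derivative along the unit gradient equals the $\ell_2$ norm of the gradient, and along the sign vector it equals the $\ell_1$ norm, so squaring the one-sided difference recovers each squared norm to order $h$.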
4 Experimental results
In this section we provide empirical evidence that input gradient regularization is an effective tool for promoting adversarial robustness, even on nonsmooth networks built with standard ReLU activation functions.
We train networks on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) and ImageNet-1k (Deng et al., 2009). On the CIFAR datasets we use the ResNeXt architecture (Xie et al., 2017) (ResNeXt-34 (2x32) on CIFAR-10; ResNeXt-34 (2x64) on CIFAR-100); on ImageNet-1k we use a ResNet-50 (He et al., 2016). The CIFAR networks were trained with standard data augmentation and learning rate schedules on a single GeForce GTX 1080 Ti. On ImageNet-1k, we modified the training code of Shaw et al.'s submission to the DAWNBench competition (Coleman et al., 2018) and train with four GPUs. Training code and trained model weights are available on GitHub: https://github.com/cfinlay/tulip.
We train an undefended network as a baseline against which to compare the various types of regularization. On CIFAR-10, networks are trained with squared $\ell_2$ and squared $\ell_1$ gradient norm regularization. The former is appropriate for defending against attacks measured in $\ell_2$; the latter for attacks measured in $\ell_\infty$. We set the regularization strength $\lambda$ to one of two values (the stronger being $\lambda = 1$), and fix the finite difference discretization $h$. We compare each network with the current state-of-the-art form of adversarial training, with models trained using the hyperparameters of Madry et al. (2017) (7 steps of FGSM, projected onto an $\ell_\infty$ ball). On ImageNet-1k we only train adversarially robust models with squared $\ell_2$ regularization. On each dataset, we attack 1000 randomly selected images. We perturb each image with attacks in both the Euclidean and $\ell_\infty$ norms, with a suite of current state-of-the-art attacks: the Carlini–Wagner attack (Carlini and Wagner, 2017b); the Boundary attack (Brendel et al., 2018); the LogBarrier attack (Finlay et al., 2019); and PGD (Madry et al., 2017), in either the $\ell_\infty$ or the $\ell_2$ norm. The first three attacks are effective at evading gradient-masking defences; the latter is very good at finding images close to the original when gradients are not close to zero. We record the best adversarial distance on a per-image basis, for each norm.
Adversarial robustness results for networks attacked in the $\ell_\infty$ norm are presented in Table 1. These results are for networks built of standard ReLUs. Table 1 and Figure 2 demonstrate a clear tradeoff between test accuracy and adversarial robustness as the strength of the regularization is increased. On CIFAR-10, the undefended network achieves a test error of 4.36%, but is not robust to attacks, even at small adversarial distances. With a strong regularization parameter, test error increases to 9.02% on clean images, but rises only to 18.47% at a moderate attack distance. In contrast, the network trained with 7 steps of adversarial training appears to be over-regularized: on clean images, the adversarially trained network achieves 16.33% test error, and 22.86% error at the same attack distance. To be fair, at the commonly reported $\ell_\infty$ attack distance, the adversarially trained network outperforms the best gradient-regularized networks by about 12%, but at over twice the training time of the regularized networks. On ImageNet-1k, we see a reduction of nearly 40% in attacked test error.
It has been noted that adversarial robustness comes at a cost of degraded test error (Tsipras et al., 2018). This tradeoff may be quantified. We measure the relative improvement in adversarial robustness against the cost of degraded test error with the following metric. Suppose an undefended network has test error $e_0$, and denote a regularized network's test error by $e$. Define the relative degradation in test error to be $\Delta e = (e - e_0)/e_0$. Similarly, define the relative improvement in robustness (measured by mean adversarial distance $d$, relative to the undefended network's distance $d_0$) to be $\Delta d = (d - d_0)/d_0$. We define the adversarial improvement ratio to be $R = \Delta d / \Delta e$. This measures the improvement in adversarial robustness against the expense of poorer test error: high values mean the defended model is much more robust and has not lost significant test accuracy. Values close to zero imply the model is more robust but has much worse test accuracy relative to the undefended model. The improvement ratio is non-dimensional, and so allows comparison between datasets.
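The improvement ratio above reduces to a few lines of arithmetic. The clean test errors below are the CIFAR-10 values quoted earlier; the adversarial distances are illustrative placeholders (a doubling of mean adversarial distance), not reported results.

```python
# Adversarial improvement ratio: relative robustness gain divided by
# relative test-error degradation.  Assumes the undefended network has
# nonzero test error and that the regularized error actually degrades.

def improvement_ratio(err_clean_base, err_clean_reg, dist_base, dist_reg):
    rel_err = (err_clean_reg - err_clean_base) / err_clean_base   # degradation
    rel_dist = (dist_reg - dist_base) / dist_base                 # robustness gain
    return rel_dist / rel_err

# Example: clean test error rises from 4.36% to 9.02% (values from the
# text) while the mean adversarial distance doubles (placeholder).
ratio = improvement_ratio(4.36, 9.02, 1.0, 2.0)
```

A ratio near 1 means the robustness gained roughly matches the accuracy given up; ratios near 0 mean robustness was bought at a steep accuracy cost.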
Measured with this metric, the tradeoff between test accuracy and adversarial robustness is clear. On both ImageNet-1k and CIFAR-10, moderately regularized models offer the best tradeoff between robustness and test error. If test accuracy is not of foremost concern, then stronger regularization parameters may be chosen. If neither training time nor test accuracy are important factors, then adversarial training is competitive with gradient regularization.
In Table 2 we report results on models trained for attacks in the $\ell_2$ norm. On CIFAR-10, the most robust model is gradient regularized, and outperforms even the adversarially trained model. On ImageNet-1k, we see the same pattern: the gradient-regularized model offers the best protection against adversarial attacks. Due to the long training time, we were not able to train ImageNet-1k with multi-step adversarial training.
In Table 2 we also report our theoretical bounds on the minimum distance required to adversarially perturb an image, using the Carlini–Wagner loss (this loss can be modified for top-$k$ misclassification as well). Figures 4 and 5 of the appendix show these bounds on a per-image basis. The theoretical bounds require calculating the constants $L$ and $C$, which are not readily available. Instead, we estimate $L$ as the maximum gradient norm over test images; for smooth models we estimate $C$ as the maximum spectral norm of the Hessian, computed using the Lanczos algorithm (Golub and Van Loan, 2012, §10.1) on Hessian-vector products (obtained via automatic differentiation). These estimates are reported in Table 3 of the appendix. Gradient regularization reduces $L$ and $C$ by one to two orders of magnitude. Table 3 shows that adversarial training also reduces $L$: effectively, adversarial training is a regularizer. Because $L$ and $C$ are estimated, and not exact, one would expect our bounds to sometimes fail. However, on CIFAR-10, the bounds reliably held on all attacked images. On ImageNet-1k, the bounds failed on about 9% of attacked test images, which indicates that $L$ and $C$ could be estimated more accurately, for example by estimating these constants locally as in Weng et al. (2018).
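Estimating the Hessian's spectral norm from Hessian-vector products, as described above, can be sketched with power iteration in place of the Lanczos algorithm (power iteration is a simpler stand-in, not the method quoted in the text). The toy quadratic below has a known Hessian, so the estimate can be verified; in a network, the Hessian-vector product would come from automatic differentiation rather than finite differences of the gradient.

```python
import numpy as np

# Power-iteration estimate of the Hessian spectral norm using only
# Hessian-vector products (HVPs).

def grad(x):
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ x                            # gradient of 0.5 * x^T A x

def hvp(x, v, h=1e-5):
    # Central finite difference of the gradient gives H @ v.
    return (grad(x + h * v) - grad(x - h * v)) / (2.0 * h)

def spectral_norm(x, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape)
    for _ in range(n_iter):
        v = hvp(x, v)
        v = v / np.linalg.norm(v)           # renormalize each iteration
    return float(v @ hvp(x, v))             # Rayleigh quotient

C_est = spectral_norm(np.zeros(2))
```

For this toy Hessian the largest eigenvalue is $(5 + \sqrt{5})/2 \approx 3.618$, which the iteration recovers.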
5 Conclusion
We have provided motivation for training adversarially robust networks through input gradient regularization, by bounding the minimum adversarial distance with gradient and curvature statistics of the loss. We have shown empirically that gradient regularization is scaleable to ImageNet-1k, and provides adversarial robustness competitive with adversarial training. We gave theoretical per-image bounds on the minimum adversarial distance for nonsmooth models (using the Lipschitz constant of the loss), and tightened these bounds on smooth models with a second-order bound based on model curvature. These bounds were empirically validated against state-of-the-art attacks.
References

Athalye et al. [2018] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 274–283, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/athalye18a.html.

Biggio et al. [2013] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný, editors, Machine Learning and Knowledge Discovery in Databases, pages 387–402, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-40994-3.
 Bishop [1995] Christopher M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995. doi: 10.1162/neco.1995.7.1.108. URL https://doi.org/10.1162/neco.1995.7.1.108.
Brendel et al. [2018] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=SyZI0GWCZ.
Carlini and Wagner [2017a] Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pages 3–14, 2017a. doi: 10.1145/3128572.3140444. URL https://doi.org/10.1145/3128572.3140444.

Carlini and Wagner [2017b] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22–26, 2017, pages 39–57, 2017b. URL https://doi.org/10.1109/SP.2017.49.
Chen and Jordan [2019] Jianbo Chen and Michael I. Jordan. Boundary attack++: Query-efficient decision-based adversarial attack. CoRR, abs/1904.02144, 2019. URL http://arxiv.org/abs/1904.02144.
 Cohen et al. [2019] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. CoRR, abs/1902.02918, 2019. URL http://arxiv.org/abs/1902.02918.
Coleman et al. [2018] Cody Coleman, Daniel Kang, Deepak Narayanan, Luigi Nardi, Tian Zhao, Jian Zhang, Peter Bailis, Kunle Olukotun, Christopher Ré, and Matei Zaharia. Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark. CoRR, abs/1806.01427, 2018. URL http://arxiv.org/abs/1806.01427.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pages 248–255, 2009. doi: 10.1109/CVPRW.2009.5206848. URL https://doi.org/10.1109/CVPRW.2009.5206848.

Drucker and LeCun [1991] Harris Drucker and Yann LeCun. Double backpropagation increasing generalization performance. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 145–150. IEEE, 1991.
 Finlay et al. [2019] Chris Finlay, AramAlexandre Pooladian, and Adam M. Oberman. The LogBarrier adversarial attack: making effective use of decision boundary information. CoRR, abs/1903.10396, 2019. URL http://arxiv.org/abs/1903.10396.
 Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2012.
 Goodfellow et al. [2018] Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56–66, June 2018. URL http://dl.acm.org/citation.cfm?doid=3234519.3134599.
 Goodfellow et al. [2014] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 630–645, 2016. URL https://doi.org/10.1007/978-3-319-46493-0_38.
Hein and Andriushchenko [2017] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 2263–2273, 2017. URL http://papers.nips.cc/paper/6821-formal-guarantees-on-the-robustness-of-a-classifier-against-adversarial-manipulation.

Huster et al. [2018] Todd Huster, Cho-Yu Jason Chiang, and Ritu Chadha. Limitations of the Lipschitz constant as a defense against adversarial examples. In ECML PKDD 2018 Workshops - Nemesis 2018, UrbReas 2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, pages 16–29, 2018. doi: 10.1007/978-3-030-13453-2_2. URL https://doi.org/10.1007/978-3-030-13453-2_2.
 Jakubovitz and Giryes [2018] Daniel Jakubovitz and Raja Giryes. Improving DNN robustness to adversarial attacks using jacobian regularization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XII, pages 525–541, 2018. doi: 10.1007/978-3-030-01258-8_32. URL https://doi.org/10.1007/978-3-030-01258-8_32.
 Kannan et al. [2018] Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018. URL http://arxiv.org/abs/1803.06373.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
 Kurakin et al. [2016] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016. URL http://arxiv.org/abs/1607.02533.
 Li et al. [2018] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Secondorder adversarial attack and certifiable robustness. CoRR, abs/1809.03113, 2018. URL http://arxiv.org/abs/1809.03113.
 Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017. URL http://arxiv.org/abs/1706.06083.

 Miyato et al. [2018] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Moosavi-Dezfooli et al. [2018] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. CoRR, abs/1811.09716, 2018. URL http://arxiv.org/abs/1811.09716.
 Nagarajan and Kolter [2017] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5591–5600, 2017. URL http://papers.nips.cc/paper/7142-gradient-descent-gan-optimization-is-locally-stable.
 Novak et al. [2018] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.

 Papernot et al. [2016] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany, March 21-24, 2016, pages 372–387, 2016. URL https://doi.org/10.1109/EuroSP.2016.36.
 Papernot et al. [2017] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 2-6, 2017, pages 506–519, 2017. URL http://doi.acm.org/10.1145/3052973.3053009.
 Potember [2017] Richard Potember. Perspectives on research in artificial intelligence and artificial general intelligence relevant to DoD. Technical report, The MITRE Corporation McLean United States, 2017. URL https://fas.org/irp/agency/dod/jason/aidod.pdf.
 Raghunathan et al. [2018a] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018a. URL https://openreview.net/forum?id=Bys4obRb.
 Raghunathan et al. [2018b] Aditi Raghunathan, Jacob Steinhardt, and Percy S. Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 10900–10910, 2018b. URL http://papers.nips.cc/paper/8285-semidefinite-relaxations-for-certifying-robustness-to-adversarial-examples.

 Rifai et al. [2011] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 833–840, 2011. URL https://icml.cc/2011/papers/455_icmlpaper.pdf.
 Ross and Doshi-Velez [2018] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 1660–1669, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17337.
 Roth et al. [2017] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2015–2025, 2017. URL http://papers.nips.cc/paper/6797-stabilizing-training-of-generative-adversarial-networks-through-regularization.
 Roth et al. [2018] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Adversarially robust training through structured gradient regularization. CoRR, abs/1805.08736, 2018. URL http://arxiv.org/abs/1805.08736.

 Rousseeuw and Leroy [1987] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume 1. Wiley Online Library, 1987.
 Seck et al. [2019] Ismaïla Seck, Gaëlle Loosli, and Stephane Canu. L1-norm double backpropagation adversarial defense. arXiv preprint arXiv:1903.01715, 2019.
 Shafahi et al. [2019] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John P. Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! CoRR, abs/1904.12843, 2019. URL http://arxiv.org/abs/1904.12843.
 Shaw et al. [2018] Andrew Shaw, Yaroslav Bulatov, and Jeremy Howard. ImageNet in 18 minutes. URL https://github.com/diux-dev/imagenet18.
 Simon-Gabriel et al. [2018] Carl-Johann Simon-Gabriel, Yann Ollivier, Bernhard Schölkopf, Léon Bottou, and David Lopez-Paz. Adversarial vulnerability of neural networks increases with input dimension. CoRR, abs/1802.01421, 2018. URL http://arxiv.org/abs/1802.01421.
 Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.
 Tramèr et al. [2018] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkZvSeRZ.
 Tsipras et al. [2018] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. CoRR, abs/1805.12152, 2018. URL http://arxiv.org/abs/1805.12152.
 Tsuzuku et al. [2018] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 6542–6551, 2018. URL http://papers.nips.cc/paper/7889-lipschitz-margin-training-scalable-certification-of-perturbation-invariance-for-deep-neural-networks.
 Uesato et al. [2018] Jonathan Uesato, Brendan O’Donoghue, Pushmeet Kohli, and Aäron van den Oord. Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5032–5041, 2018. URL http://proceedings.mlr.press/v80/uesato18a.html.
 Wald [1945] Abraham Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, pages 265–280, 1945.
 Weng et al. [2018] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=BkUHlMZ0b.
 Wong and Kolter [2018] Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5283–5292, 2018. URL http://proceedings.mlr.press/v80/wong18a.html.
 Wong et al. [2018] Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8410–8419, 2018. URL http://papers.nips.cc/paper/8060-scaling-provable-adversarial-defenses.
 Xiao et al. [2018] Kai Y. Xiao, Vincent Tjeng, Nur Muhammad Shafiullah, and Aleksander Madry. Training for faster adversarial robustness verification via inducing ReLU stability. CoRR, abs/1809.03008, 2018. URL http://arxiv.org/abs/1809.03008.
 Xie et al. [2018] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. CoRR, abs/1812.03411, 2018. URL http://arxiv.org/abs/1812.03411.
 Xie et al. [2017] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5987–5995, 2017. URL https://doi.org/10.1109/CVPR.2017.634.
Appendix A Additional methods and results

                                    mean                maximum
CIFAR-10
Undefended                      3.05                122.34
Undefended                 ✓    3.25      198.23    65.35     8134.26
Madry et al. (7-step AT)        0.40                2.52
squared norm,                   0.58                4.43
squared norm,              ✓    0.65      2.08      4.52      27.05
squared norm,                   0.35                1.33
ImageNet-1k
Undefended                      1.12                17.51
Undefended                 ✓    1.02      11.61     25.43     848.69
squared norm,                   0.46                4.85
squared norm,              ✓    0.45      1.87      6.99      171.98
squared norm,                   0.27                2.12
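Statistics of the kind tabulated above (a per-example quantity summarized by its mean and maximum over a test set) can be illustrated for input-gradient norms, the quantity that squared-norm gradient regularization penalizes. The sketch below is a hedged toy example, not the paper's method: it uses a linear softmax classifier, for which the input gradient of the cross-entropy loss has the closed form W^T(p - onehot(y)), and random data in place of CIFAR-10 or ImageNet-1k. All names here are illustrative.

```python
import numpy as np

def input_grad_norms(X, y, W, b):
    """Per-example l2 norm of the input gradient of the cross-entropy
    loss for a linear softmax classifier f(x) = softmax(W x + b).
    Uses the closed form grad_x CE = W^T (p - onehot(y))."""
    logits = X @ W.T + b                          # (n, k)
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0                # p - onehot(y)
    grads = p @ W                                 # (n, d) input gradients
    return np.linalg.norm(grads, axis=1)

# Toy stand-in for a test set: 100 examples, 8 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 3, size=100)
W = rng.normal(size=(3, 8))
b = np.zeros(3)

norms = input_grad_norms(X, y, W, b)
print(f"mean {norms.mean():.2f}  maximum {norms.max():.2f}")
```

For a deep network the gradients would come from automatic differentiation rather than a closed form, but the summary step (mean and maximum of the per-example norms) is the same.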
