Adversarial Robustness Against the Union of Multiple Perturbation Models

09/09/2019 · Pratyush Maini, et al. · Carnegie Mellon University

Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work on developing (both empirically and certifiably) robust classifiers, but the vast majority of it defends against a single type of attack. Recent work has looked at defending against multiple attacks, specifically on the MNIST dataset, yet this approach used a relatively complex architecture, claiming that standard adversarial training cannot be applied because it "overfits" to a particular norm. In this work, we show that it is indeed possible to adversarially train a robust model against a union of norm-bounded attacks, by using a natural generalization of the standard PGD-based procedure for adversarial training to multiple threat models. With this approach, we are able to train standard architectures which are robust against ℓ_∞, ℓ_2, and ℓ_1 attacks, outperforming past approaches on the MNIST dataset and providing the first CIFAR10 network trained to be simultaneously robust against (ℓ_∞, ℓ_2, ℓ_1) threat models, which achieves adversarial accuracy rates of (47.6%, 64.8%, 53.4%) for (ℓ_∞, ℓ_2, ℓ_1) perturbations with radius ε = (0.03, 0.5, 12).

1 Introduction

Machine learning algorithms have been shown to be susceptible to adversarial examples (Szegedy et al., 2014): data points which can be adversarially perturbed to be misclassified, yet are “close enough” to the original example that the perturbation is imperceptible to the human eye. Methods to generate adversarial examples, or “attacks”, typically rely on gradient information, and most commonly use variations of projected gradient descent (PGD) to maximize the loss within a small perturbation region, usually referred to as the adversary’s threat model. A number of heuristic defenses have since been proposed to counter this phenomenon, e.g. distillation (Papernot et al., 2016) or, more recently, logit pairing (Kannan et al., 2018). However, the original robustness claims of these defenses typically do not hold up against more advanced adversaries or more thorough attacks (Carlini and Wagner, 2017; Engstrom et al., 2018; Mosbach et al., 2018). One heuristic defense that has survived to this day is adversarial training against a PGD adversary (Madry et al., 2018), which remains quite popular due to its simplicity and apparent empirical robustness. The method continues to perform well in empirical benchmarks even when compared to recent work in provable defenses, although it comes with no formal guarantees.

Some recent work, however, has claimed that adversarial training “overfits” to the particular type of perturbation used to generate the adversarial examples, and used this as motivation to propose a more complicated architecture in order to achieve robustness to multiple perturbation types on the MNIST dataset (Schott et al., 2019).

In this work, we offer a contrasting viewpoint: we show that it is indeed possible to use adversarial training to learn a model which is simultaneously robust against multiple types of norm-bounded attacks (we consider ℓ_∞, ℓ_2, and ℓ_1 attacks, but the approach can apply to more general attacks). First, we show that simple generalizations of adversarial training to multiple threat models can already achieve some degree of robustness against the union of these threat models. Second, we propose a slightly modified PGD-based algorithm called multi steepest descent (MSD) for adversarial training which more naturally incorporates the different perturbations within the PGD iterates, further improving the adversarial training approach. Third, we show empirically that our approach improves upon past work by being applicable to standard network architectures, easily scaling beyond the MNIST dataset, and outperforming past results on robustness against multiple perturbation types.

2 Related work

After their original introduction, one of the first widely considered attacks against deep networks was the Fast Gradient Sign Method (Goodfellow et al., 2015), which showed that a single, small step in the direction of the sign of the gradient could sometimes fool machine learning classifiers. While this worked to some degree, the Basic Iterative Method (Kurakin et al., 2017) (now typically referred to as the PGD attack) was significantly more successful at creating adversarial examples, and now lies at the core of many papers. Since then, a number of improvements and adaptations have been made to the base PGD algorithm to overcome heuristic defenses and create stronger adversaries. Adversarial attacks were thought to be rendered ineffective by realistic transformations (Lu et al., 2017) until the attacks were augmented to be robust to such transformations (Athalye et al., 2018). Adversarial examples generated using PGD on surrogate models can transfer to black-box models (Papernot et al., 2017). Utilizing core optimization techniques such as momentum can greatly improve the attack success rate and transferability, and the resulting momentum-based attack won the NIPS 2017 competition on adversarial examples (Dong et al., 2018). Uesato et al. (2018) showed that a number of ImageNet defenses were not as robust as originally thought, and Athalye et al. (2018) defeated many of the heuristic defenses submitted to ICLR 2018 shortly after the reviewing cycle ended, all with stronger PGD variations.

Throughout this cycle of attack and defense, some defenses have emerged that remain robust to this day. The aforementioned PGD attack and the related defense known as adversarial training with a PGD adversary (which incorporates PGD-attacked examples into the training process) have so far remained empirically robust (Madry et al., 2018). Verification methods to certify robustness properties of networks have been developed, utilizing techniques such as SMT solvers (Katz et al., 2017), SDP relaxations (Raghunathan et al., 2018a), and mixed-integer linear programming (Tjeng et al., 2019), the last of which has recently been successfully scaled to reasonably sized networks. Other work has folded verification into the training process to create provably robust networks (Wong and Kolter, 2018; Raghunathan et al., 2018b), some of which has also been scaled to larger networks (Wong et al., 2018; Mirman et al., 2018; Gowal et al., 2018). Although some of these approaches could potentially be extended to apply to multiple perturbations simultaneously, most of this work has focused on defending against and verifying only a single type of adversarial perturbation at a time.

Last but most relevant to this work are adversarial defenses that attempt to be robust against multiple types of attacks simultaneously. Schott et al. (2019) used multiple variational autoencoders to construct a complex architecture for the MNIST dataset that is not as easily attacked by ℓ_0, ℓ_2, and ℓ_∞ adversaries. Importantly, Schott et al. (2019) compare to adversarial training with an ℓ_∞-bounded PGD adversary as described by Madry et al. (2018), claiming that the adversarial training defense overfits to the ℓ_∞ metric and is not robust against other types of perturbations. Following this, a number of concurrent papers have been released. While not studied as a defense, Kang et al. (2019) study the transferability of adversarial robustness between models trained against different threat models. Croce and Hein (2019) propose a provable adversarial defense against all ℓ_p norms for p ≥ 1 using a regularization term. Finally, Tramèr and Boneh (2019) study the theoretical and empirical trade-offs of adversarial robustness in various settings when defending against multiple adversaries; however, they use a rotation and translation adversary instead of an ℓ_2 adversary for CIFAR10.

Contributions

In contrast to the claim that adversarial training overfits to a particular metric, in this work we demonstrate that adversarial training can in fact be used to learn models that are robust against a union of multiple perturbation models, as long as one trains against that union of adversaries. First, we show that even simple aggregations of different adversarial attacks can achieve competitive robustness against multiple perturbation models without resorting to complex architectures. Second, we propose a modified PGD iteration that more naturally considers multiple perturbation models within the inner optimization loop of adversarial training. Third, we evaluate all approaches on the MNIST and CIFAR10 datasets, showing that our proposed generalizations of adversarial training can significantly outperform past approaches for ℓ_∞, ℓ_2, and ℓ_1 attacks. Specifically, on MNIST, our model achieves 63.7%, 82.7%, and 62.3% adversarial accuracy against ℓ_∞, ℓ_2, and ℓ_1 attacks respectively, substantially improving upon the multiple-perturbation-model robustness described in Schott et al. (2019). Unlike past work, we also train a CIFAR10 model, which achieves 47.6%, 64.8%, and 53.4% adversarial accuracy against the three attacks for ε = (0.03, 0.5, 12). Code and trained models for all our experiments are at https://github.com/locuslab/robust_union.

3 Overview of adversarial training

Adversarial training is an approach to learn a classifier which minimizes the worst case loss within some perturbation region (the threat model). Specifically, for some network $f_\theta$ parameterized by $\theta$, loss function $\ell$, and training data $\{(x_i, y_i)\}$, the robust optimization problem of minimizing the worst case loss within norm-bounded perturbations with radius $\epsilon$ is

$$\min_\theta \sum_i \max_{\delta \in \Delta_{p,\epsilon}} \ell(f_\theta(x_i + \delta), y_i) \qquad (1)$$

where $\Delta_{p,\epsilon} = \{\delta : \|\delta\|_p \leq \epsilon\}$ is the ℓ_p ball with radius $\epsilon$ centered around the origin. To simplify the notation, we will abbreviate $\Delta_{p,\epsilon}$ as $\Delta$ when the norm and radius are clear from context.

3.1 Solving the inner optimization problem

We first look at solving the inner maximization problem, namely

$$\max_{\delta \in \Delta} \ell(f_\theta(x + \delta), y). \qquad (2)$$

This is the problem addressed by the “attackers” in the space of adversarial examples, hoping that the classifier can be tricked by the optimally perturbed image $x + \delta^\star$. Typical solutions solve this problem by running a form of projected gradient descent, which iteratively takes steps in the gradient direction to increase the loss, followed by a projection step back onto the feasible region, the ℓ_p ball $\Delta$. Since the gradients at the example points themselves (i.e., at $\delta = 0$) are typically too small to make efficient progress, a variation called projected steepest descent is more commonly used.

Figure 1: (left) A depiction of the steepest descent directions for the ℓ_∞, ℓ_2, and ℓ_1 norms. The gradient is the black arrow, and the radius-α step sizes and their corresponding steepest descent directions are shown in blue, red, and green. (right) An example of the projection back onto an ℓ_2 ball of radius ε after a steepest descent step from the starting perturbation δ^(t). The steepest descent step is the black arrow, and the corresponding projection back onto the ball is the red arrow.

Steepest descent

For some norm $\|\cdot\|_p$ and step size $\alpha$, the direction of steepest descent on the loss function for a perturbation $\delta$ is

$$v_p(\delta) = \arg\max_{\|v\|_p \leq \alpha} v^\top \nabla_\delta \ell(f_\theta(x + \delta), y). \qquad (3)$$

Then, instead of taking gradient steps, steepest descent uses the following iteration

$$\delta^{(t+1)} = \delta^{(t)} + v_p(\delta^{(t)}). \qquad (4)$$

In practice, the norm used in steepest descent is typically taken to be the same norm used to define the perturbation region $\Delta$. However, depending on the norm used, the direction of steepest descent can be quite different from the actual gradient (Figure 1). Note that a single steepest descent step with respect to the ℓ_∞ norm reduces to $\delta = \alpha \cdot \mathrm{sign}(\nabla_x \ell(f_\theta(x), y))$, better known in the adversarial examples literature as the Fast Gradient Sign Method (Goodfellow et al., 2015).
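To make the three directions concrete, the following is a minimal PyTorch sketch (our own, not the authors' implementation) of the steepest descent step for the ℓ_∞, ℓ_2, and ℓ_1 norms, assuming `grad` holds the gradient of the loss with respect to the perturbation for a batch of images; the ℓ_1 case follows the single-coordinate step detailed in Appendix A.

```python
import torch

def steepest_descent_direction(grad, norm, alpha):
    """Steepest descent step of size alpha w.r.t. the l_inf, l_2, or l_1 norm.

    grad: gradient of the loss w.r.t. the perturbation, shape (B, C, H, W).
    """
    B = grad.shape[0]
    if norm == "linf":
        # l_inf steepest descent: move every coordinate by alpha * sign(gradient)
        return alpha * grad.sign()
    if norm == "l2":
        # l_2 steepest descent: move along the normalized gradient
        g = grad.reshape(B, -1)
        g = g / (g.norm(dim=1, keepdim=True) + 1e-12)
        return alpha * g.reshape(grad.shape)
    if norm == "l1":
        # l_1 steepest descent: spend the whole step on the largest-magnitude coordinate
        g = grad.reshape(B, -1)
        idx = g.abs().argmax(dim=1)
        step = torch.zeros_like(g)
        step[torch.arange(B), idx] = alpha * g[torch.arange(B), idx].sign()
        return step.reshape(grad.shape)
    raise ValueError(f"unknown norm: {norm}")
```

The ℓ_∞ case recovers the sign step above, the ℓ_2 case moves along the normalized gradient, and the ℓ_1 case spends the whole budget on a single coordinate (see Appendix A for an improved multi-coordinate variant).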

Projections

The second component of projected steepest descent for adversarial examples is to project the iterates back onto the ℓ_p ball around the original example. Specifically, projected steepest descent performs the following iteration

$$\delta^{(t+1)} = \mathcal{P}_{\Delta}\left(\delta^{(t)} + v_p(\delta^{(t)})\right) \qquad (5)$$

where $\mathcal{P}_{\Delta}$ is the standard projection operator that finds the point in $\Delta$ that is “closest” in Euclidean space to the input $\delta$, defined as

$$\mathcal{P}_{\Delta}(\delta) = \arg\min_{\delta' \in \Delta} \|\delta - \delta'\|_2. \qquad (6)$$

Visually, a depiction of this procedure (steepest descent followed by a projection onto the perturbation region) for an ℓ_2 adversary can be found in Figure 1. If we instead take steepest descent steps with respect to the ℓ_∞ norm and project onto the ℓ_∞ ball of allowable perturbations, the projected steepest descent iteration reduces to

$$\delta^{(t+1)} = \mathrm{clip}_{[-\epsilon, \epsilon]}\left(\delta^{(t)} + \alpha \cdot \mathrm{sign}\left(\nabla_\delta \ell(f_\theta(x + \delta^{(t)}), y)\right)\right) \qquad (7)$$

where $\mathrm{clip}_{[-\epsilon, \epsilon]}$ “clips” the input to lie within the range $[-\epsilon, \epsilon]$. This is exactly the Basic Iterative Method used in Kurakin et al. (2017), typically referred to in the literature as an ℓ_∞ PGD adversary.
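As a concrete illustration of Equation (7), here is a minimal PyTorch sketch of an ℓ_∞ PGD adversary; the function and argument names (and the default ε, α, and iteration count) are ours, and the final clamp assumes images scaled to [0, 1] as in the rest of the paper.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=0.3, alpha=0.01, num_iter=50):
    """l_inf PGD: repeat a sign-gradient step, then clip back into the eps-ball."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(num_iter):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            # steepest descent step w.r.t. l_inf, then projection onto the l_inf ball
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            # keep the perturbed image inside the valid pixel range [0, 1]
            delta.copy_((x + delta).clamp(0, 1) - x)
        delta.grad.zero_()
    return delta.detach()
```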

3.2 Solving the outer optimization problem

We next look at how to solve the outer optimization problem, or the problem of learning the weights that minimize the loss of our classifier. While many approaches have been proposed in the literature, we will focus on a heuristic called adversarial training, which has generally worked well in practice.

Adversarial training

Although solving the min-max optimization problem may seem daunting, a classical result known as Danskin’s theorem (Danskin, 1967) says that the gradient of a maximization problem is equal to the gradient of the objective evaluated at the optimum. For learning models that minimize the robust optimization problem from Equation (1), this means that

$$\nabla_\theta \max_{\delta \in \Delta} \ell(f_\theta(x + \delta), y) = \nabla_\theta \ell(f_\theta(x + \delta^\star), y) \qquad (8)$$

where $\delta^\star = \arg\max_{\delta \in \Delta} \ell(f_\theta(x + \delta), y)$. In other words, in order to backpropagate through the robust optimization problem, we can solve the inner maximization and backpropagate through the solution. Adversarial training does this by empirically maximizing the inner problem with a PGD adversary. Note that since the inner problem is not solved exactly, Danskin’s theorem does not strictly apply. However, in practice, adversarial training does seem to provide good empirical robustness, at least when evaluated against the threat model it was trained against.
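Putting the two pieces together, the following is a sketch of one adversarial training step implied by Equation (8): approximately solve the inner maximization with the `pgd_linf` sketch above, then take a gradient step on the loss at the perturbed examples. The optimizer handling and the default ℓ_∞ parameters here are illustrative assumptions, not the authors' exact settings.

```python
def adversarial_training_step(model, x, y, opt, eps=0.3, alpha=0.01, num_iter=40):
    """One outer step of adversarial training (single l_inf threat model)."""
    # inner maximization: find an approximately worst-case perturbation
    delta = pgd_linf(model, x, y, eps, alpha, num_iter)
    # outer minimization: Danskin-style gradient step at the perturbed examples
    opt.zero_grad()
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    opt.step()
    return loss.item()
```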

4 Adversarial training for multiple perturbation models

We can now consider the core of this work, adversarial training procedures against multiple threat models. More formally, let $S$ represent a set of threat models, such that each $p \in S$ corresponds to the perturbation model $\Delta_p = \{\delta : \|\delta\|_p \leq \epsilon_p\}$, and let $\Delta_S = \bigcup_{p \in S} \Delta_p$ be the union of all perturbation models in $S$. Note that the radius $\epsilon_p$ chosen for each ball is not typically the same, but we still use the same notation for simplicity, since the context will always make clear which ball we are talking about. Then, the generalization of the robust optimization problem in Equation (1) to multiple perturbation models is

$$\min_\theta \sum_i \max_{\delta \in \Delta_S} \ell(f_\theta(x_i + \delta), y_i). \qquad (9)$$

The key difference is in the inner maximization, where the worst case adversarial loss is now taken over multiple perturbation models. In order to perform adversarial training, using the same motivational idea from Danskin’s theorem, we can backpropagate through the inner maximization by first finding (empirically) the optimal perturbation,

$$\delta^\star = \arg\max_{\delta \in \Delta_S} \ell(f_\theta(x + \delta), y). \qquad (10)$$

To find the optimal perturbation over the union of threat models, we begin by considering straightforward generalizations of standard adversarial training, which will use PGD to approximately solve the inner maximization over multiple adversaries.

4.1 Simple combinations of multiple perturbations

First, we propose two simple approaches to generalizing adversarial training to multiple threat models. These methods already perform quite well in practice and are competitive with existing, state-of-the-art approaches without relying on complicated architectures, showing that adversarial training can in fact generalize to multiple threat models.

Worst-case perturbation

One way to generalize adversarial training to multiple threat models is to use each threat model independently, and train on the adversarial perturbation that achieved the maximum loss. Specifically, for each adversary $p \in S$, we solve the innermost maximization with an ℓ_p PGD adversary to get an approximate worst-case perturbation $\delta_p$,

$$\delta_p \approx \arg\max_{\delta \in \Delta_p} \ell(f_\theta(x + \delta), y), \qquad (11)$$

and then approximate the maximum over all adversaries as

$$\delta^\star \approx \arg\max_{\delta \in \{\delta_p : p \in S\}} \ell(f_\theta(x + \delta), y). \qquad (12)$$

When $|S| = 1$, this reduces to standard adversarial training. Note that if each ℓ_p PGD adversary solved its subproblem from Equation (11) exactly, then this would be exactly the optimal perturbation $\delta^\star$ from Equation (10).
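A sketch of this worst-case aggregation (Equations (11) and (12)) in PyTorch: run one PGD adversary per threat model and keep, per example, the perturbation with the highest loss. Here `attacks` is assumed to be a dict of attack callables, each taking `(model, x, y)` and returning a perturbation (e.g. PGD adversaries with their radii and step sizes already bound via `functools.partial`).

```python
def worst_case_perturbation(model, x, y, attacks):
    """Per example, keep the perturbation (over all threat models) with maximum loss."""
    best_delta, best_loss = None, None
    for attack in attacks.values():
        delta = attack(model, x, y)
        # per-example loss so each image keeps its own strongest adversary
        loss = F.cross_entropy(model(x + delta), y, reduction="none")
        if best_delta is None:
            best_delta, best_loss = delta, loss
        else:
            replace = (loss > best_loss).view(-1, 1, 1, 1)  # broadcast over (C, H, W)
            best_delta = torch.where(replace, delta, best_delta)
            best_loss = torch.max(loss, best_loss)
    return best_delta
```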

PGD augmentation with all perturbations

Another way to generalize adversarial training is to train on the adversarial perturbations $\delta_p$ for all $p \in S$, forming a larger adversarial dataset. Specifically, instead of solving the robust problem for multiple adversaries in Equation (9), we instead solve

$$\min_\theta \sum_i \sum_{p \in S} \max_{\delta \in \Delta_p} \ell(f_\theta(x_i + \delta), y_i) \qquad (13)$$

by using individual ℓ_p PGD adversaries to approximate the inner maximization for each threat model. Again, this reduces to standard adversarial training when $|S| = 1$.
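A corresponding sketch of the data-augmentation variant in Equation (13): rather than keeping only the worst perturbation, accumulate the adversarial loss over all threat models (with `attacks` as in the previous sketch; averaging instead of summing only rescales the objective).

```python
def pgd_augmentation_loss(model, x, y, attacks):
    """Average the adversarial loss over all threat models, as in Equation (13)."""
    losses = []
    for attack in attacks.values():
        delta = attack(model, x, y)
        losses.append(F.cross_entropy(model(x + delta), y))
    return sum(losses) / len(losses)
```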

While these methods work quite well in practice (which is shown later in Section 5), both approaches solve the inner maximization problem independently for each adversary, so each individual PGD adversary is not taking advantage of the fact that the perturbation region is enlarged by other threat models. To take advantage of the full perturbation region, we propose a modification to standard adversarial training, which combines information from all considered threat models into a single PGD adversary that is potentially stronger than the combination of independent adversaries.

  Input: classifier f_θ, data x, labels y
  Parameters: radii ε_p and step sizes α_p for p ∈ S, maximum iterations T, loss function ℓ
  δ^(0) := 0
  for t = 0, …, T − 1 do
     for p ∈ S do
        δ_p^(t+1) := P_{Δ_p}(δ^(t) + v_p(δ^(t)))
     end for
     δ^(t+1) := argmax_{δ ∈ {δ_p^(t+1) : p ∈ S}} ℓ(f_θ(x + δ), y)
  end for
  return δ^(T)
Algorithm 1 Multi steepest descent for learning classifiers that are simultaneously robust to ℓ_p attacks for p ∈ S

4.2 Multi steepest descent

To create a PGD adversary with full knowledge of the perturbation region, we propose an algorithm that incorporates the different threat models within each step of projected steepest descent. Rather than generating adversarial examples for each threat model with separate PGD adversaries, the core idea is to create a single adversarial perturbation by simultaneously maximizing the worst case loss over all threat models at each projected steepest descent step. We call our method multi steepest descent (MSD), which can be summarized as the following iteration:

$$\delta_p^{(t+1)} = \mathcal{P}_{\Delta_p}\left(\delta^{(t)} + v_p(\delta^{(t)})\right) \text{ for each } p \in S, \qquad \delta^{(t+1)} = \arg\max_{\delta \in \{\delta_p^{(t+1)}\}} \ell(f_\theta(x + \delta), y). \qquad (14)$$

The key difference here is that at each iteration of MSD, we choose the projected steepest descent direction that maximizes the loss over all attack models $p \in S$, whereas standard adversarial training and the simpler approaches use comparatively myopic PGD subroutines that only use one threat model at a time. The full algorithm is given in Algorithm 1, and can be used as a drop-in replacement for standard PGD adversaries to learn robust classifiers with adversarial training. We direct the reader to Appendix A for a complete description of the steepest descent directions and projection operators for the ℓ_∞, ℓ_2, and ℓ_1 norms. (The pure ℓ_1 steepest descent step is inefficient since it only updates one coordinate at a time; it can be improved by taking steps on multiple coordinates, similar to Tramèr and Boneh (2019), as explained in Appendix A.)
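For illustration, a simplified PyTorch sketch of one MSD attack (Algorithm 1), reusing the `steepest_descent_direction` helper sketched in Section 3.1 and assuming per-norm projection helpers `project_linf`, `project_l2`, and `project_l1` like those sketched in Appendix A (the ℓ_1 one applied per example). For simplicity, the argmax over norms here is taken per batch rather than per example, so treat this as a sketch rather than the authors' exact implementation; `eps` and `alpha` are dicts keyed by norm name.

```python
def msd_attack(model, x, y, eps, alpha, num_iter, projections):
    """Multi steepest descent: one perturbation, worst-case step over all norms.

    projections: dict mapping a norm name to a function(delta, eps) projecting
                 the perturbation onto the corresponding eps-ball.
    """
    delta = torch.zeros_like(x)
    for _ in range(num_iter):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            best_delta, best_loss = None, -float("inf")
            for norm, project in projections.items():
                # candidate update: steepest descent step for this norm, then project
                step = steepest_descent_direction(grad, norm, alpha[norm])
                candidate = project(delta.detach() + step, eps[norm])
                candidate = (x + candidate).clamp(0, 1) - x  # stay in pixel range
                cand_loss = F.cross_entropy(model(x + candidate), y).item()
                if cand_loss > best_loss:
                    best_delta, best_loss = candidate, cand_loss
            delta = best_delta
    return delta
```

A training loop would then call `msd_attack` in place of `pgd_linf` in the adversarial training step sketched in Section 3.2.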

5 Results

In this section, we present experimental results on using generalizations of adversarial training to achieve simultaneous robustness to ℓ_∞, ℓ_2, and ℓ_1 perturbations on the MNIST and CIFAR10 datasets. Our primary goal is to show that adversarial training can in fact be adapted to a union of perturbation models using standard architectures to achieve competitive results, without the pitfalls described by Schott et al. (2019). Our results improve upon the state of the art in three key ways. First, we can use simpler, standard architectures for image classifiers, without relying on complex architectures or input binarization. Second, our method is able to learn a single MNIST model which is simultaneously robust to all three threat models, whereas previous work was only robust against two at a time. Finally, our method is easily scalable to datasets beyond MNIST, providing the first CIFAR10 model trained to be simultaneously robust against ℓ_∞, ℓ_2, and ℓ_1 adversaries.

We trained models using both the simple generalizations of adversarial training to multiple adversaries and also using MSD. Since the analysis by synthesis model is not scalable to CIFAR10, we additionally trained CIFAR10 models against individual PGD adversaries to measure the changes and tradeoffs in robustness across threat models. We evaluated these models with a broad suite of both gradient-based and gradient-free attacks using Foolbox (https://github.com/bethgelab/foolbox) (Rauber et al., 2017), the same attacks used by Schott et al. (2019), and also incorporated all the PGD-based adversaries discussed in this paper. All aggregate statistics that combine multiple attacks compute the worst case error rate over all attacks for each example.

Summaries of these results at specific thresholds can be found in Tables 1 and 2, where B-ABS and ABS refer to the binarized and non-binarized versions of the analysis by synthesis models from Schott et al. (2019), P_p refers to a model trained against a PGD adversary with respect to the ℓ_p norm, Worst-PGD and PGD-Aug refer to models trained using the worst-case and data-augmentation generalizations of adversarial training, and MSD refers to models trained using multi steepest descent. Full tables containing the complete breakdown of these numbers over all individual attacks used in the evaluation are in Appendix C.

5.1 Experimental setup

Architectures and hyperparameters

For MNIST, we use a four layer convolutional network with two convolutional layers consisting of 32 and 64 filters and 2 units of padding, followed by a fully connected layer with 1024 hidden units, where both convolutional layers are followed by Max Pooling layers and ReLU activations (this is the same architecture used by Madry et al. (2018)). This is in contrast to past work on MNIST, which relied on per-class variational autoencoders to achieve robustness against multiple threat models (Schott et al., 2019) and was not easily scalable to larger datasets. Since our methods have the same complexity as standard adversarial training, they also apply readily to standard CIFAR10 architectures; in this paper we use the well known pre-activation version of the ResNet18 architecture consisting of nine residual units with two convolutional layers each (He et al., 2016).
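For concreteness, a PyTorch sketch of the MNIST classifier described above; the 5×5 kernel size (consistent with the stated 2 units of padding) and the 2×2 pooling windows are our assumptions rather than details given in the text.

```python
import torch.nn as nn

mnist_model = nn.Sequential(
    # two convolutional layers with 32 and 64 filters and 2 units of padding,
    # each followed by a ReLU activation and 2x2 max pooling
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),  # fully connected layer with 1024 hidden units
    nn.Linear(1024, 10),
)
```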

A complete description of the hyperparameters used is in Appendix B, with hyperparameters for PGD adversaries in Appendix B.1 and hyperparameters for adversarial training in Appendix B.2. All reported radii ε are for images scaled to lie in the range [0, 1]. All experiments can be run on modern GPU hardware (e.g. a single 1080ti).

Attacks used for evaluation

To evaluate the models, we incorporate the attacks from Schott et al. (2019) as well as our PGD-based adversaries using projected steepest descent; we provide a short description of them here. Note that we exclude attacks based on gradient estimation, since gradients for the standard architectures used here are readily available.

For ℓ_∞ attacks, although we find our ℓ_∞ PGD adversary to be quite effective, for completeness we additionally use the Foolbox implementations of the Fast Gradient Sign Method (Goodfellow et al., 2015), the PGD adversary (Madry et al., 2018), and the Momentum Iterative Method (Dong et al., 2018).

For ℓ_2 attacks, in addition to our ℓ_2 PGD adversary, we use the Foolbox implementations of the same PGD adversary, the Gaussian noise attack (Rauber et al., 2017), the boundary attack (Brendel et al., 2017), DeepFool (Moosavi-Dezfooli et al., 2016), and the pointwise attack (Schott et al., 2019).

For ℓ_1 attacks, we use both our ℓ_1 PGD adversary as well as additional Foolbox implementations of attacks at the same radius, namely the salt & pepper attack (Rauber et al., 2017) and the pointwise attack (Schott et al., 2019). Note that an ℓ_1 adversary with radius ε is strictly stronger than an ℓ_0 adversary with the same radius, and so we choose to explicitly defend against ℓ_1 perturbations instead of the ℓ_0 perturbations considered by Schott et al. (2019).

We make 10 random restarts for each of the evaluation results reported from here on, for both MNIST and CIFAR10. We encourage future work in this area to do the same, since the success rate of all attacks, especially decision-based or gradient-free ones, increases significantly over restarts.
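The aggregation protocol above (worst case per example over all attacks and restarts) can be sketched as follows; `attacks` and the bookkeeping here are illustrative, not the authors' evaluation harness.

```python
def robust_accuracy(model, x, y, attacks, restarts=10):
    """An example counts as correct only if it survives every attack and restart."""
    correct = torch.ones(x.shape[0], dtype=torch.bool, device=x.device)
    for attack in attacks:
        for _ in range(restarts):
            delta = attack(model, x, y)
            with torch.no_grad():
                pred = model(x + delta).argmax(dim=1)
            correct &= (pred == y)  # worst case over all attacks and restarts
    return correct.float().mean().item()
```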

Model | P_∞ | P_2 | P_1 | B-ABS† | ABS† | Worst-PGD | PGD-Aug | MSD
Clean Accuracy | 99.1% | 99.4% | 98.9% | 99% | 99% | 98.9% | 99.1% | 98.0%
ℓ_∞ attacks | 90.3% | 0.4% | 0.0% | 77% | 8% | 68.4% | 83.7% | 63.7%
ℓ_2 attacks | 46.4% | 87.0% | 70.7% | 39% | 80% | 82.6% | 76.2% | 82.7%
ℓ_1 attacks | 1.4% | 43.4% | 71.8% | 82% | 78% | 54.6% | 15.6% | 62.3%
All attacks | 1.4% | 0.4% | 0.0% | 39% | 8% | 53.7% | 15.6% | 58.7%
Table 1: Summary of adversarial accuracy results for MNIST (higher is better). P_p denotes a model trained against an ℓ_p PGD adversary. †B-ABS and ABS results are from Schott et al. (2019), which used an ℓ_0 threat model of the same radius and evaluated against ℓ_0 attacks, so the reported ℓ_1 numbers are upper bounds on the adversarial accuracy. Further, these models are evaluated without restarts, their accuracy against all attacks is an upper bound based on the reported accuracies for the individual threat models, and all ABS results were computed using numerical gradient estimation, since gradients are not readily available.
Figure 2: Robustness curves showing the adversarial accuracy for the MNIST model trained with MSD against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.

5.2 MNIST

We first present results on the MNIST dataset, which are summarized in Table 1 (a more detailed breakdown over each individual attack is in Appendix C.1). While MNIST is considered an “easy” dataset, we note that the previous state-of-the-art result for multiple threat models on MNIST (and our primary comparison) is only able to defend against two out of three threat models at a time (Schott et al., 2019), using comparatively complex variational autoencoder architectures. In contrast, we see that both simple generalizations of adversarial training achieve competitive results on standard models, notably defending against all three threat models simultaneously, while the model trained with MSD performs even better, achieving adversarial accuracies of 63.7%, 82.7%, and 62.3% against the ℓ_∞, ℓ_2, and ℓ_1 threat models respectively. A complete robustness curve over a range of epsilons for the MSD model over each threat model can be found in Figure 2, and robustness curves for other models are deferred to Appendix C.1.

Model | P_∞ | P_2 | P_1 | Worst-PGD | PGD-Aug | MSD
Clean accuracy | 83.3% | 90.2% | 73.3% | 81.0% | 84.6% | 81.7%
ℓ_∞ attacks | 50.7% | 28.3% | 0.2% | 44.9% | 42.5% | 47.6%
ℓ_2 attacks | 58.2% | 61.6% | 0.0% | 62.1% | 65.3% | 64.8%
ℓ_1 attacks | 16.0% | 46.6% | 7.9% | 39.4% | 54.0% | 53.4%
All attacks | 15.6% | 25.2% | 0.0% | 34.9% | 40.6% | 46.1%
Table 2: Summary of adversarial accuracy results for CIFAR10 (higher is better). P_p denotes a model trained against an ℓ_p PGD adversary.
Figure 3: Robustness curves showing the adversarial accuracy for the CIFAR10 model trained with MSD against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.

5.3 CIFAR10

Next, we present results on the CIFAR10 dataset, which are summarized in Table 2 (a more detailed breakdown over each individual attack is in Appendix C.2). Our MSD model achieves 47.6%, 64.8%, and 53.4% adversarial accuracy for ℓ_∞, ℓ_2, and ℓ_1 perturbations of size ε = (0.03, 0.5, 12), reaching an overall adversarial accuracy of 46.1% over all threat models. Interestingly, note that the model trained against an ℓ_1 PGD adversary is not very robust when evaluated against other ℓ_1 attacks, even though it can defend reasonably well against the ℓ_1 PGD attack in isolation (Table 4 in Appendix C.2).

The most relevant comparison in the literature is the robustness curve for ℓ_2 attacks on the CIFAR10 model trained against an ℓ_∞ adversary, reported by Madry et al. (2018). Unsurprisingly, since their model was not explicitly trained to be robust against ℓ_2 perturbations, it achieves under 5% accuracy at the same ℓ_2 threshold, whereas our model achieves 64.8% accuracy while maintaining similar levels of robustness. A complete robustness curve over a range of epsilons for the MSD model over each threat model can be found in Figure 3, and robustness curves for other models are deferred to Appendix C.2.

6 Conclusion

In this paper, we showed that adversarial training can be quite effective when training against a union of multiple perturbation models. We compared two simple generalizations of adversarial training with an improved adversarial training procedure, multi steepest descent, which incorporates the different perturbation models directly into the direction of steepest descent. The MSD-based adversarial training procedure outperforms past approaches, demonstrating that adversarial training can in fact learn networks that are robust to multiple perturbation models (as long as they are included in the threat model) while scaling beyond MNIST and using standard architectures.

References

  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, External Links: Link Cited by: §2.
  • A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018) Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 284–293. External Links: Link Cited by: §2.
  • W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §5.1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39–57. Cited by: §1.
  • F. Croce and M. Hein (2019) Provable robustness against all adversarial ℓ_p-perturbations for p ≥ 1. CoRR abs/1905.11213. External Links: Link, 1905.11213 Cited by: §2.
  • J. M. Danskin (1967) The theory of max-min and its application to weapons allocation problems. Vol. 5, Springer Science & Business Media. Cited by: §3.2.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1.
  • J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra (2008) Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 272–279. External Links: ISBN 978-1-60558-205-4, Link, Document Cited by: Appendix A.
  • L. Engstrom, A. Ilyas, and A. Athalye (2018) Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272. Cited by: §1.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, External Links: Link Cited by: §2, §3.1, §5.1.
  • S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. A. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. CoRR abs/1810.12715. External Links: Link, 1810.12715 Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §5.1.
  • D. Kang, Y. Sun, T. Brown, D. Hendrycks, and J. Steinhardt (2019) Transfer of adversarial robustness between perturbation types. arXiv preprint arXiv:1905.01034. Cited by: §2.
  • H. Kannan, A. Kurakin, and I. J. Goodfellow (2018) Adversarial logit pairing. CoRR abs/1803.06373. External Links: Link, 1803.06373 Cited by: §1.
  • G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer (2017) Reluplex: an efficient smt solver for verifying deep neural networks. arXiv preprint arXiv:1702.01135. Cited by: §2.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. ICLR Workshop. External Links: Link Cited by: §2, §3.1.
  • J. Lu, H. Sibai, E. Fabry, and D. Forsyth (2017) No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501. Cited by: §2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §2, §5.1, §5.1, §5.3.
  • M. Mirman, T. Gehr, and M. Vechev (2018) Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML), External Links: Link Cited by: §2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §5.1.
  • M. Mosbach, M. Andriushchenko, T. Trost, M. Hein, and D. Klakow (2018) Logit pairing methods can fool gradient-based attacks. arXiv preprint arXiv:1810.12042. Cited by: §1.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, New York, NY, USA, pp. 506–519. External Links: ISBN 978-1-4503-4944-4, Link, Document Cited by: §2.
  • N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 582–597. Cited by: §1.
  • A. Raghunathan, J. Steinhardt, and P. S. Liang (2018a) Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10900–10910. External Links: Link Cited by: §2.
  • A. Raghunathan, J. Steinhardt, and P. Liang (2018b) Certified defenses against adversarial examples. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. External Links: Link, 1707.04131 Cited by: §5.1, §5.1, footnote 2.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2019) Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, External Links: Link Cited by: Appendix A, §C.1, §1, §2, §2, §5.1, §5.1, §5.1, §5.1, §5.2, §5, §5, §5, footnote 3, footnote 3.
  • L. N. Smith (2018) A disciplined approach to neural network hyper-parameters: part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. Cited by: §B.2, §B.2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • V. Tjeng, K. Y. Xiao, and R. Tedrake (2019) Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • F. Tramèr and D. Boneh (2019) Adversarial training and robustness for multiple perturbations. arXiv preprint arXiv:1904.13000. Cited by: §A.1, §2, footnote 1.
  • J. Uesato, B. O’Donoghue, P. Kohli, and A. van den Oord (2018) Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 5025–5034. External Links: Link Cited by: §2.
  • E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292. Cited by: §2.
  • E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter (2018) Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8410–8419. External Links: Link Cited by: §2.

Appendix A Steepest descent and projections for ℓ_∞, ℓ_2, and ℓ_1 adversaries

Finally, for completeness, we show what the steepest descent and projection steps are for ℓ_p adversaries with p ∈ {∞, 2, 1}; these are standard results, but are included for a complete description of the algorithms. Note that this differs slightly from the adversaries considered in Schott et al. [2019]: while they used an ℓ_0 adversary, we opted to use an ℓ_1 adversary with the same radius. The ℓ_0 ball with radius ε is contained within the ℓ_1 ball with the same radius, so achieving robustness against an ℓ_1 adversary is strictly more difficult.

ℓ_∞ space

The direction of steepest descent with respect to the ℓ_∞ norm is

$$v_\infty(\delta) = \alpha \cdot \mathrm{sign}\left(\nabla_\delta \ell(f_\theta(x + \delta), y)\right) \qquad (15)$$

and the projection operator onto $\Delta_\infty = \{\delta : \|\delta\|_\infty \leq \epsilon\}$ is

$$\mathcal{P}_{\Delta_\infty}(\delta) = \mathrm{clip}_{[-\epsilon, \epsilon]}(\delta). \qquad (16)$$

ℓ_2 space

The direction of steepest descent with respect to the ℓ_2 norm is

$$v_2(\delta) = \alpha \cdot \frac{\nabla_\delta \ell(f_\theta(x + \delta), y)}{\|\nabla_\delta \ell(f_\theta(x + \delta), y)\|_2} \qquad (17)$$

and the projection operator onto the ℓ_2 ball $\Delta_2 = \{\delta : \|\delta\|_2 \leq \epsilon\}$ is

$$\mathcal{P}_{\Delta_2}(\delta) = \epsilon \cdot \frac{\delta}{\max\{\epsilon, \|\delta\|_2\}}. \qquad (18)$$
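The ℓ_∞ and ℓ_2 projections are simple enough to write in a couple of lines; a batched PyTorch sketch, with helper names matching the `projections` dict assumed in the MSD sketch of Section 4.2:

```python
def project_linf(delta, eps):
    """Clip every coordinate of the perturbation into [-eps, eps]."""
    return delta.clamp(-eps, eps)

def project_l2(delta, eps):
    """Rescale the perturbation onto the l_2 ball of radius eps if it lies outside."""
    flat = delta.reshape(delta.shape[0], -1)
    norms = flat.norm(dim=1, keepdim=True).clamp(min=1e-12)
    factor = (eps / norms).clamp(max=1.0)  # leave delta unchanged when already inside
    return (flat * factor).reshape(delta.shape)
```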

ℓ_1 space

The direction of steepest descent with respect to the ℓ_1 norm is

$$v_1(\delta) = \alpha \cdot \mathrm{sign}\left(\nabla_\delta \ell(f_\theta(x + \delta), y)_{i^\star}\right) e_{i^\star} \qquad (19)$$

where

$$i^\star = \arg\max_i \left|\nabla_\delta \ell(f_\theta(x + \delta), y)_i\right| \qquad (20)$$

and $e_{i^\star}$ is a unit vector with a one in position $i^\star$. Finally, the projection operator onto the ℓ_1 ball,

$$\mathcal{P}_{\Delta_1}(\delta) = \arg\min_{\|\delta'\|_1 \leq \epsilon} \|\delta - \delta'\|_2, \qquad (21)$$

can be computed with Algorithm 2, and we refer the reader to Duchi et al. [2008] for its derivation.

  Input: perturbation δ, radius ε
  Sort |δ| into μ: μ_1 ≥ μ_2 ≥ … ≥ μ_n
  ρ := max{ j : μ_j − (1/j)(Σ_{r=1}^{j} μ_r − ε) > 0 }
  η := (1/ρ)(Σ_{r=1}^{ρ} μ_r − ε)
  z_i := sign(δ_i) · max(|δ_i| − η, 0) for i = 1, …, n
  return z
Algorithm 2 Projection of a perturbation δ onto the ℓ_1 ball with radius ε. We use |δ| to denote the element-wise absolute value.
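A PyTorch sketch of Algorithm 2 for a single perturbation (applied per example over a batch in practice); this is our own rendering of the Duchi et al. [2008] projection, not the authors' code, and it returns the input unchanged when it already lies inside the ball.

```python
def project_l1(delta, eps):
    """Project a single perturbation (any shape) onto the l_1 ball of radius eps."""
    v = delta.reshape(-1)
    if v.abs().sum() <= eps:
        return delta                              # already inside the ball
    mu, _ = v.abs().sort(descending=True)         # mu_1 >= mu_2 >= ... >= mu_n
    cumsum = mu.cumsum(0)
    j = torch.arange(1, mu.numel() + 1, dtype=mu.dtype, device=mu.device)
    # largest (0-indexed) rho such that mu_j - (cumsum_j - eps) / j > 0
    rho = (mu - (cumsum - eps) / j > 0).nonzero().max()
    eta = (cumsum[rho] - eps) / (rho + 1).to(mu.dtype)
    # soft-threshold: shrink each coordinate toward zero by eta
    z = v.sign() * (v.abs() - eta).clamp(min=0)
    return z.reshape(delta.shape)
```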

A.1 Enhanced ℓ_1 steepest descent step

Note that the ℓ_1 steepest descent step only updates a single coordinate per iteration. This can be quite inefficient, as pointed out by Tramèr and Boneh [2019]. To tackle this issue, and also empirically improve the attack success rate, Tramèr and Boneh [2019] instead select the top k coordinates according to Equation (20) to update. In this work, we adopt a similar but slightly modified scheme: we randomly sample k to be some integer within a range [k_min, k_max], and update each of the selected coordinates with step size α/k. We find that the randomness induced by varying the number of coordinates aids in avoiding the gradient masking problem observed by Tramèr and Boneh [2019].
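A sketch of this multi-coordinate ℓ_1 step; the α/k scaling and the uniform sampling of k are how we read the description above, so treat these details as assumptions rather than the authors' exact scheme.

```python
import random

def l1_topk_step(grad, alpha, k_min=5, k_max=20):
    """l_1 steepest descent over a randomly sized set of top-magnitude coordinates."""
    B = grad.shape[0]
    g = grad.reshape(B, -1)
    k = random.randint(k_min, k_max)
    # indices of the k largest-magnitude gradient coordinates, per example
    _, idx = g.abs().topk(k, dim=1)
    step = torch.zeros_like(g)
    step.scatter_(1, idx, (alpha / k) * g.gather(1, idx).sign())
    return step.reshape(grad.shape)
```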

A.2 Restricting the steepest descent coordinate

The ℓ_1 steepest descent direction (and its multi-coordinate variant from Appendix A.1) moves the perturbation along individually selected coordinates. However, if a selected coordinate is already at the boundary of pixel space (for MNIST, this is the range [0,1] for each pixel), it is possible for the PGD adversary to get stuck in a loop, trying to use the same descent direction to escape pixel space. To avoid this, we only allow these steepest descent directions to choose coordinates that keep the image in the range of real pixels.

Appendix B Experimental details

B.1 Hyperparameters for PGD adversaries

In this section, we describe the parameters used for all PGD adversaries in this paper.

MNIST

The ℓ_∞ adversary was run for 50 iterations, the ℓ_2 adversary for 100 iterations, and the ℓ_1 adversary for 50 iterations, each with a fixed step size within its respective radius. By default the ℓ_1 attack is run with two restarts, once starting from δ = 0 and once from a random initialization in the allowable perturbation ball, with k_min = 5 and k_max = 20 as described in Appendix A.1. The MSD adversary used per-norm step sizes for the ℓ_∞, ℓ_2, and ℓ_1 directions within the corresponding radii for 100 iterations.

At test time, we increase the number of iterations used by these adversaries.

CIFAR10

The ℓ_∞ adversary was run for 40 iterations, the ℓ_2 adversary for 50 iterations, and the ℓ_1 adversary for 50 iterations, each with a fixed step size within its respective radius and with k_min = 5 and k_max = 20 for the ℓ_1 attack as described in Appendix A.1. The MSD adversary used per-norm step sizes for the ℓ_∞, ℓ_2, and ℓ_1 directions within the corresponding radii for 50 iterations. Note that the MSD model trained with an ℓ_2 radius of 0.3 is in fact robust at the larger radius of 0.5.

B.2 Training hyperparameters

In this section, we describe the parameters used for adversarial training. For all models, we used the SGD optimizer with momentum 0.9 and weight decay.

MNIST

We train the models for a maximum of 20 epochs. We used a variation of the learning rate schedule from Smith [2018] to achieve convergence in 20 epochs, which is piecewise linear from 0 to 0.05 over the first 3 epochs, down to 0.001 over the next 7 epochs, and finally down to 0.0001 in the last 10 epochs.

CIFAR10

We used a variation of the learning rate schedule from Smith [2018] to achieve superconvergence in 50 epochs, which is piecewise linear from 0 to 0.1 over the first 20 epochs, down to 0.005 over the next 20 epochs, and finally back down to 0 in the last 10 epochs.

Appendix C Extended results

Here, we show the full tables which break down the overall adversarial error rates over individual attacks for both MNIST and CIFAR10, along with robustness curves for all models in the paper.

Attack | P_∞ | P_2 | P_1 | B-ABS | ABS | Worst-PGD | PGD-Aug | MSD
Clean Accuracy | 99.1% | 99.4% | 98.9% | 99% | 99% | 98.9% | 99.1% | 98.0%
PGD-ℓ_∞ (ours) | 90.3% | 0.4% | 0.0% | - | - | 68.4% | 83.7% | 63.7%
FGSM | 94.9% | 68.6% | 6.4% | 85% | 34% | 82.4% | 90.9% | 81.8%
PGD-ℓ_∞ (Foolbox) | 92.1% | 8.5% | 0.1% | 86% | 13% | 72.1% | 85.7% | 67.9%
MIM | 92.3% | 14.5% | 0.1% | 85% | 17% | 73.9% | 87.3% | 71.0%
All ℓ_∞ attacks | 90.3% | 0.4% | 0.0% | 77% | 8% | 68.4% | 83.7% | 63.7%
PGD-ℓ_2 (ours) | 83.8% | 87.0% | 70.8% | - | - | 85.3% | 87.9% | 84.2%
PGD-ℓ_2 (Foolbox) | 93.4% | 89.7% | 74.4% | 63% | 87% | 86.9% | 91.5% | 86.9%
Gaussian Noise | 98.9% | 99.6% | 98.0% | 89% | 98% | 97.4% | 99.0% | 97.8%
Boundary Attack | 52.6% | 92.1% | 83.0% | 91% | 83% | 86.9% | 79.1% | 88.6%
DeepFool | 95.1% | 92.2% | 76.5% | 41% | 83% | 87.9% | 93.5% | 87.9%
Pointwise Attack | 74.3% | 97.4% | 96.6% | 87% | 94% | 92.7% | 89.0% | 95.1%
All ℓ_2 attacks | 46.4% | 87.0% | 70.7% | 39% | 80% | 82.6% | 76.2% | 82.7%
PGD-ℓ_1 (ours) | 51.8% | 49.9% | 71.8% | - | - | 66.5% | 57.4% | 64.8%
Salt & Pepper | 55.5% | 96.3% | 95.6% | 96% | 95% | 86.4% | 71.9% | 92.2%
Pointwise Attack | 2.4% | 66.4% | 85.2% | 82% | 78% | 60.1% | 17.1% | 72.8%
All ℓ_1 attacks | 1.4% | 43.4% | 71.8% | 82% | 78% | 54.6% | 15.6% | 62.3%
All attacks | 1.4% | 0.4% | 0.0% | 39% | 8% | 53.7% | 15.6% | 58.7%
Table 3: Summary of adversarial accuracy results for MNIST, broken down by individual attack. P_p denotes a model trained against an ℓ_p PGD adversary; B-ABS and ABS results are from Schott et al. [2019].

C.1 MNIST results

Expanded table of results

Table 3 contains the full table of results for all attacks on all models on the MNIST dataset. All attacks were run on a subset of 1000 examples with 10 random restarts, with the exception of the Boundary Attack, which by default makes 25 trials per iteration. Note that the results for the B-ABS and ABS models are from Schott et al. [2019], which uses gradient estimation techniques whenever a gradient is needed, and the robustness against all attacks for B-ABS and ABS is an upper bound based on the reported results. Further, these models are not evaluated with restarts, which pushes the reported results even higher relative to what a full evaluation would give.

Robustness curves

Here, we plot the full robustness curves for the remaining models trained with the simpler generalizations of adversarial training, namely the worst case method (Figure 4) and the data augmentation method (Figure 5).

Figure 4: Robustness curves showing the adversarial accuracy for the MNIST model trained with the worst case generalization for adversarial training (Worst-PGD) against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.
Figure 5: Robustness curves showing the adversarial accuracy for the MNIST model trained with the data augmentation generalization for adversarial training (PGD-Aug) against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.

C.2 CIFAR10 results

Expanded table of results

Table 4 contains the full table of results for all attacks on all models on the CIFAR10 dataset. All attacks were run on a subset of 1000 examples with 10 random restarts, with the exception of the Boundary Attack, which by default makes 25 trials per iteration. Further note that the salt & pepper and pointwise attacks in the ℓ_1 section are technically ℓ_0 attacks, but produce perturbations that also lie in the ℓ_1 ball. Finally, it is clear here that while training against an ℓ_1 PGD adversary defends against that same PGD adversary, the robustness does not seem to transfer to other ℓ_1 attacks.

Attack | P_∞ | P_2 | P_1 | Worst-PGD | PGD-Aug | MSD
Clean accuracy | 83.3% | 90.2% | 73.3% | 81.0% | 84.6% | 81.7%
PGD-ℓ_∞ (ours) | 50.3% | 48.4% | 29.8% | 44.9% | 42.8% | 49.8%
FGSM | 57.4% | 43.4% | 12.7% | 54.9% | 51.9% | 55.0%
PGD-ℓ_∞ (Foolbox) | 52.3% | 28.5% | 0.6% | 48.9% | 44.6% | 49.8%
MIM | 52.7% | 30.4% | 0.7% | 49.9% | 46.1% | 50.6%
All ℓ_∞ attacks | 50.7% | 28.3% | 0.2% | 44.9% | 42.5% | 47.6%
PGD-ℓ_2 (ours) | 59.0% | 62.1% | 28.9% | 64.1% | 66.9% | 66.0%
PGD-ℓ_2 (Foolbox) | 61.6% | 64.1% | 4.9% | 65.0% | 68.0% | 66.4%
Gaussian Noise | 82.2% | 89.8% | 62.3% | 81.3% | 84.3% | 81.8%
Boundary Attack | 65.5% | 67.9% | 0.0% | 64.4% | 69.2% | 67.9%
DeepFool | 62.2% | 67.3% | 0.1% | 64.4% | 67.4% | 65.7%
Pointwise Attack | 80.4% | 88.6% | 46.2% | 78.9% | 83.8% | 81.4%
All ℓ_2 attacks | 58.2% | 61.6% | 0.0% | 62.1% | 65.3% | 64.8%
PGD-ℓ_1 (ours) | 16.5% | 49.2% | 69.1% | 39.5% | 54.0% | 53.4%
Salt & Pepper | 63.4% | 74.2% | 35.5% | 75.2% | 80.7% | 75.6%
Pointwise Attack | 49.6% | 62.4% | 8.4% | 63.3% | 77.0% | 72.8%
All ℓ_1 attacks | 16.0% | 46.6% | 7.9% | 39.4% | 54.0% | 53.4%
All attacks | 15.6% | 27.5% | 0.0% | 34.9% | 40.6% | 46.1%
Table 4: Summary of adversarial accuracy results for CIFAR10, broken down by individual attack. P_p denotes a model trained against an ℓ_p PGD adversary.

Robustness curves

Here, we plot the full robustness curves for the remaining models trained with the simpler generalizations of adversarial training, namely the worst case method (Figure 6) and the data augmentation method (Figure 7).

Figure 6: Robustness curves showing the adversarial accuracy for the CIFAR10 model trained with the worst case generalization for adversarial training (Worst-PGD) against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.
Figure 7: Robustness curves showing the adversarial accuracy for the CIFAR10 model trained with the data augmentation generalization for adversarial training (PGD-Aug) against ℓ_∞ (left), ℓ_2 (middle), and ℓ_1 (right) threat models over a range of epsilon.