Machine learning models are vulnerable to adversarial examples, inputs maliciously perturbed to mislead the model. These inputs transfer between models, thus enabling blackbox attacks against deployed models. Adversarial training increases robustness to attacks by injecting adversarial examples into training data. Surprisingly, we find that although adversarially trained models exhibit strong robustness to some whitebox attacks (i.e., with knowledge of the model parameters), they remain highly vulnerable to transferred adversarial examples crafted on other models. We show that the reason for this vulnerability is the model's decision surface exhibiting sharp curvature in the vicinity of the data points, thus hindering attacks based on firstorder approximations of the model's loss, but permitting blackbox attacks that use adversarial examples transferred from another model. We harness this observation in two ways: First, we propose a simple yet powerful novel attack that first applies a small random perturbation to an input, before finding the optimal perturbation under a firstorder approximation. Our attack outperforms prior "singlestep" attacks on models trained with or without adversarial training. Second, we propose Ensemble Adversarial Training, an extension of adversarial training that additionally augments training data with perturbed inputs transferred from a number of fixed pretrained models. On MNIST and ImageNet, ensemble adversarial training vastly improves robustness to blackbox attacks.
Machine learning (ML) models are often vulnerable to adversarial examples, maliciously perturbed inputs designed to mislead a model at test time (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014b; Papernot et al., 2016a). Furthermore, Szegedy et al. (2013) showed that these inputs transfer across models: the same adversarial example is often misclassified by different models, thus enabling simple blackbox attacks on deployed models (Papernot et al., 2017; Liu et al., 2017).
Adversarial training (Szegedy et al., 2013) increases robustness by augmenting training data with adversarial examples. Madry et al. (2017) showed that adversarially trained models can be made robust to whitebox attacks (i.e., with knowledge of the model parameters) if the perturbations computed during training closely maximize the model’s loss. However, prior attempts at scaling this approach to ImageNetscale tasks (Deng et al., 2009) have proven unsuccessful (Kurakin et al., 2017b).
It is thus natural to ask whether it is possible, at scale, to achieve robustness against the class of blackbox adversaries. Towards this goal, Kurakin et al. (2017b) adversarially trained an Inception v3 model (Szegedy et al., 2016b) on ImageNet using a “single-step” attack based on a linearization of the model’s loss (Goodfellow et al., 2014b). Their trained model is robust to single-step perturbations but remains vulnerable to more costly “multi-step” attacks. Yet, Kurakin et al. (2017b) found that these attacks fail to reliably transfer between models, and thus concluded that the robustness of their model should extend to blackbox adversaries. Surprisingly, we show that this is not the case.
We demonstrate, formally and empirically, that adversarial training with singlestep methods admits a degenerate global minimum, wherein the model’s loss can not be reliably approximated by a linear function. Specifically, we find that the model’s decision surface exhibits sharp curvature near the data points, thus degrading attacks based on a single gradient computation. In addition to the model of Kurakin et al. (2017b), we reveal similar overfitting in an adversarially trained Inception ResNet v2 model (Szegedy et al., 2016a), and a variety of models trained on MNIST (LeCun et al., 1998).
We harness this result in two ways. First, we show that models adversarially trained using single-step methods remain vulnerable to simple attacks. For blackbox adversaries, we find that perturbations crafted on an undefended model often transfer to an adversarially trained one. We also introduce a simple yet powerful single-step attack that applies a small random perturbation (to escape the non-smooth vicinity of the data point) before linearizing the model’s loss. While seemingly weaker than the Fast Gradient Sign Method of Goodfellow et al. (2014b), our attack significantly outperforms it for the same perturbation norm, for models trained with or without adversarial training.
Second, we propose Ensemble Adversarial Training, a training methodology that incorporates perturbed inputs transferred from other pretrained models. Our approach decouples adversarial example generation from the parameters of the trained model, and increases the diversity of perturbations seen during training. We train Inception v3 and Inception ResNet v2 models on ImageNet that exhibit increased robustness to adversarial examples transferred from other holdout models, using various single-step and multi-step attacks (Goodfellow et al., 2014b; Carlini & Wagner, 2017a; Kurakin et al., 2017a; Madry et al., 2017). We also show that our methods globally reduce the dimensionality of the space of adversarial examples (Tramèr et al., 2017). Our Inception ResNet v2 model won the first round of the NIPS 2017 competition on Defenses Against Adversarial Attacks (Kurakin et al., 2017c), where it was evaluated on other competitors’ attacks in a blackbox setting. (Footnote 1: We publicly released our model after the first round, and it could thereafter be targeted using whitebox attacks. Nevertheless, a majority of the top submissions in the final round, e.g., Xie et al. (2018), built upon our released model.)
Various defensive techniques against adversarial examples in deep neural networks have been proposed (Gu & Rigazio, 2014; Luo et al., 2015; Papernot et al., 2016c; Nayebi & Ganguli, 2017; Cisse et al., 2017) and many remain vulnerable to adaptive attackers (Carlini & Wagner, 2017a, b; Baluja & Fischer, 2017). Adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b; Kurakin et al., 2017b; Madry et al., 2017) appears to hold the greatest promise for learning robust models.
Madry et al. (2017) show that adversarial training on MNIST yields models that are robust to whitebox attacks, if the adversarial examples used in training closely maximize the model’s loss. Moreover, recent works by Sinha et al. (2018), Raghunathan et al. (2018) and Kolter & Wong (2017) even succeed in providing certifiable robustness for small perturbations on MNIST. As we argue in Appendix C, the MNIST dataset is peculiar in that there exists a simple “closed-form” denoising procedure (namely feature binarization) which leads to similarly robust models without adversarial training. This may explain why robustness to whitebox attacks is hard to scale to tasks such as ImageNet (Kurakin et al., 2017b). We believe that the existence of a simple robust baseline for MNIST can be useful for understanding some limitations of adversarial training techniques.
Szegedy et al. (2013) found that adversarial examples transfer between models, thus enabling blackbox attacks on deployed models. Papernot et al. (2017) showed that blackbox attacks could succeed with no access to training data, by exploiting the target model’s predictions to extract (Tramèr et al., 2016) a surrogate model. Some prior works have hinted that adversarially trained models may remain vulnerable to blackbox attacks: Goodfellow et al. (2014b) found that an adversarial maxout network on MNIST has slightly higher error on transferred examples than on whitebox examples. Papernot et al. (2017) further showed that a model trained on small perturbations can be evaded by transferring perturbations of larger magnitude. Our finding that adversarial training degrades the accuracy of linear approximations of the model’s loss is an instance of a gradient-masking phenomenon (Papernot et al., 2016b), which affects other defensive techniques (Papernot et al., 2016c; Carlini & Wagner, 2017a; Nayebi & Ganguli, 2017; Brendel & Bethge, 2017; Athalye et al., 2018).
We consider a classification task with data $x \in [0,1]^d$ and labels $y_{\text{true}} \in \mathbb{Z}_k$ sampled from a distribution $\mathcal{D}$. We identify a model with a hypothesis $h$ from a space $\mathcal{H}$. On input $x$, the model outputs class scores $h(x) \in \mathbb{R}^k$. The loss function used to train the model, e.g., cross-entropy, is $L(h(x), y)$.
For some target model $h \in \mathcal{H}$ and inputs $(x, y_{\text{true}})$, the adversary’s goal is to find an adversarial example $x^{\text{adv}}$ such that $x^{\text{adv}}$ and $x$ are “close” yet the model misclassifies $x^{\text{adv}}$. We consider the well-studied class of $\ell_\infty$ bounded adversaries (Goodfellow et al., 2014b; Madry et al., 2017) that, given some budget $\epsilon$, output examples $x^{\text{adv}}$ where $\|x^{\text{adv}} - x\|_\infty \le \epsilon$. As we comment in Appendix C.1, $\ell_\infty$ robustness is of course not an end-goal for secure ML. We use this standard model to showcase limitations of prior adversarial training methods, and evaluate our proposed improvements.
We distinguish between whitebox adversaries that have access to the target model’s parameters (i.e., $h$), and blackbox adversaries with only partial information about the model’s inner workings. Formal definitions for these adversaries are in Appendix A. Although security against whitebox attacks is the stronger notion (and the one we ideally want ML models to achieve), blackbox security is a reasonable and more tractable goal for deployed ML models.
Following Madry et al. (2017), we consider an adversarial variant of standard Empirical Risk Minimization (ERM), where our aim is to minimize the risk over adversarial examples:
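$$h^* = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x, y_{\text{true}}) \sim \mathcal{D}} \Big[ \max_{\|x^{\text{adv}} - x\|_\infty \le \epsilon} L\big(h(x^{\text{adv}}), y_{\text{true}}\big) \Big] \qquad (1)$$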
Madry et al. (2017) argue that adversarial training has a natural interpretation in this context, where a given attack (see below) is used to approximate solutions to the inner maximization problem, and the outer minimization problem corresponds to training over these examples. Note that the original formulation of adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b), which we use in our experiments, trains on both the “clean” examples $x$ and adversarial examples $x^{\text{adv}}$.
We consider three algorithms to generate adversarial examples with bounded $\ell_\infty$ norm. The first two are single-step (i.e., they require a single gradient computation); the third is iterative: it computes multiple gradient updates. We enforce $x^{\text{adv}} \in [0,1]^d$ by clipping all components of $x^{\text{adv}}$.
Fast Gradient Sign Method (FGSM). This method (Goodfellow et al., 2014b) linearizes the inner maximization problem in (1):
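$$x^{\text{adv}}_{\text{FGSM}} := x + \epsilon \cdot \operatorname{sign}\big(\nabla_x L(h(x), y_{\text{true}})\big) \qquad (2)$$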
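As a rough illustration of the FGSM step, the following is a minimal pure-Python sketch on a toy logistic-regression model. The model, its weights `w` and `b`, the input `x`, and all helper names are hypothetical, chosen only to keep the example self-contained; none of them come from the paper's experiments.

```python
import math

def sign(v):
    # Returns -1, 0 or 1.
    return (v > 0) - (v < 0)

def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a toy logistic model, and its gradient w.r.t. the input x.
    y is the true label (0 or 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -math.log(p if y == 1 else 1.0 - p)
    grad_x = [(p - y) * wi for wi in w]  # dL/dx = (p - y) * w for this model
    return loss, grad_x

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, then clipping to the valid range [0, 1]."""
    _, g = loss_and_input_grad(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, g)]

# Hypothetical model and data point.
w, b = [2.0, -3.0, 0.5], 0.1
x, y = [0.6, 0.2, 0.5], 1
eps = 0.1

x_adv = fgsm(w, b, x, y, eps)
loss_clean, _ = loss_and_input_grad(w, b, x, y)
loss_adv, _ = loss_and_input_grad(w, b, x_adv, y)
```

On this toy model the perturbed input stays within the budget `eps` in every coordinate while the loss on the true label goes up, which is the behavior the linearization is meant to achieve.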
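Single-Step Least-Likely Class Method (StepLL). This variant of FGSM introduced by Kurakin et al. (2017a, b) targets the least-likely class, $y_{\text{LL}} = \arg\min\{h(x)\}$:
$$x^{\text{adv}}_{\text{LL}} := x - \epsilon \cdot \operatorname{sign}\big(\nabla_x L(h(x), y_{\text{LL}})\big) \qquad (3)$$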
Although this attack only indirectly tackles the inner maximization in (1), Kurakin et al. (2017b) find it to be the most effective for adversarial training on ImageNet.
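The least-likely class variant can be sketched in the same spirit. The example below assumes a hypothetical linear softmax classifier with weight matrix `W`; the step moves the input so as to decrease the loss with respect to the class the model currently considers least likely.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def probs(W, x):
    # Class probabilities of a toy linear softmax model (one row of W per class).
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])

def grad_ce_wrt_input(W, x, target):
    """Gradient of the cross-entropy loss for class `target` w.r.t. the input x."""
    p = probs(W, x)
    return [
        sum((p[c] - (1.0 if c == target else 0.0)) * W[c][j] for c in range(len(W)))
        for j in range(len(x))
    ]

def step_ll(W, x, eps):
    """Single signed-gradient step towards the least-likely class, clipped to [0, 1]."""
    p = probs(W, x)
    y_ll = min(range(len(p)), key=lambda c: p[c])
    g = grad_ce_wrt_input(W, x, y_ll)
    x_adv = [min(1.0, max(0.0, xi - eps * ((gi > 0) - (gi < 0)))) for xi, gi in zip(x, g)]
    return x_adv, y_ll

# Hypothetical 3-class model on 2-dimensional inputs.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [0.8, 0.3]
x_adv, y_ll = step_ll(W, x, eps=0.1)
p_before = probs(W, x)
p_after = probs(W, x_adv)
```

Note the minus sign in the update: the attack descends the loss of the target class instead of ascending the loss of the true class.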
Iterative Attack (IFGSM or IterLL). This method iteratively applies the FGSM or StepLL $k$ times with step-size $\alpha$ and projects each step onto the $\ell_\infty$ ball of norm $\epsilon$ around $x$. It uses projected gradient descent to solve the maximization in (1). For fixed $\epsilon$, iterative attacks induce higher error rates than single-step attacks, but transfer at lower rates (Kurakin et al., 2017a, b).
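The iterative variant adds a projection after every step. The sketch below reuses a toy logistic model; the number of steps `k`, the step size `alpha` and all names are illustrative assumptions, not the settings used in the experiments.

```python
import math

def loss_and_input_grad(w, b, x, y):
    # Toy logistic model: cross-entropy loss and its gradient w.r.t. the input.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -math.log(p if y == 1 else 1.0 - p)
    return loss, [(p - y) * wi for wi in w]

def iterative_fgsm(w, b, x, y, eps, k, alpha):
    """Apply k signed-gradient steps of size alpha, projecting after each step
    onto the l_inf ball of radius eps around x and onto the valid range [0, 1]."""
    x_adv = list(x)
    for _ in range(k):
        _, g = loss_and_input_grad(w, b, x_adv, y)
        stepped = [v + alpha * ((gi > 0) - (gi < 0)) for v, gi in zip(x_adv, g)]
        x_adv = [min(max(v, xi - eps, 0.0), xi + eps, 1.0) for v, xi in zip(stepped, x)]
    return x_adv

# Hypothetical model and data point.
w, b = [2.0, -3.0, 0.5], 0.1
x, y = [0.6, 0.2, 0.5], 1
eps = 0.1

x_adv = iterative_fgsm(w, b, x, y, eps, k=4, alpha=0.03)
loss_clean, _ = loss_and_input_grad(w, b, x, y)
loss_adv, _ = loss_and_input_grad(w, b, x_adv, y)
```

Here `k * alpha` exceeds `eps` on purpose, so the projection is what keeps the final perturbation inside the budget.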
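When performing adversarial training with a single-step attack (e.g., the FGSM or StepLL methods above), we approximate Equation (1) by replacing the solution to the inner maximization problem with the output of the single-step attack (e.g., $x^{\text{adv}}_{\text{FGSM}}$ in (2)). That is, we solve
$$h^* = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x, y_{\text{true}}) \sim \mathcal{D}} \Big[ L\big(h(x^{\text{adv}}_{\text{FGSM}}), y_{\text{true}}\big) \Big] \qquad (4)$$
For model families with high expressive power, this alternative optimization problem admits at least two substantially different global minima $h^*$:
- For an input $x$ from $\mathcal{D}$, there is no $x^{\text{adv}}$ close to $x$ (in $\ell_\infty$ norm) that induces a high loss. That is,
$$L\big(h^*(x^{\text{adv}}_{\text{FGSM}}), y_{\text{true}}\big) \approx \max_{\|x^{\text{adv}} - x\|_\infty \le \epsilon} L\big(h^*(x^{\text{adv}}), y_{\text{true}}\big) \approx 0 \qquad (5)$$
In other words, $h^*$ is robust to all $\ell_\infty$ bounded perturbations.
- The minimizer $h^*$ is a model for which the approximation method underlying the attack (i.e., linearization in our case) poorly fits the model’s loss function. That is,
$$L\big(h^*(x^{\text{adv}}_{\text{FGSM}}), y_{\text{true}}\big) \ll \max_{\|x^{\text{adv}} - x\|_\infty \le \epsilon} L\big(h^*(x^{\text{adv}}), y_{\text{true}}\big) \qquad (6)$$
Thus the attack when applied to $h^*$ produces samples that are far from optimal.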
Note that this second “degenerate” minimum can be more subtle than a simple case of overfitting to samples produced from single-step attacks. Indeed, we show in Section 4.1 that single-step attacks applied to adversarially trained models create “adversarial” examples that are easy to classify even for undefended models. Thus, adversarial training does not simply learn to resist the particular attack used during training, but actually to make that attack perform worse overall. This phenomenon relates to the notion of Reward Hacking (Amodei et al., 2016) wherein an agent maximizes its formal objective function via unintended behavior that fails to capture the designer’s true intent.
The degenerate minimum described in Section 3.3 is attainable because the learned model’s parameters influence the quality of both the minimization and maximization in (1). One solution is to use a stronger adversarial example generation process, at a high performance cost (Madry et al., 2017). Alternatively, Baluja & Fischer (2017) suggest training an adversarial generator model as in the GAN framework (Goodfellow et al., 2014a). The power of this generator is likely to require careful tuning, to avoid similar degenerate minima (where the generator or classifier overpowers the other).
We propose a conceptually simpler approach to decouple the generation of adversarial examples from the model being trained, while simultaneously drawing an explicit connection with robustness to blackbox adversaries. Our method, which we call Ensemble Adversarial Training, augments a model’s training data with adversarial examples crafted on other static pretrained models. Intuitively, as adversarial examples transfer between models, perturbations crafted on an external model are good approximations for the maximization problem in (1). Moreover, the learned model can not influence the “strength” of these adversarial examples. As a result, minimizing the training loss implies increased robustness to blackbox attacks from some set of models.
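The training procedure described here can be sketched on a toy problem. The code below is only an illustration under strong simplifying assumptions: a bias-free logistic model stands in for the trained network, two fixed weight vectors stand in for the static pretrained models, FGSM stands in for the attack, and every name and hyperparameter is hypothetical.

```python
import math
import random

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(source_w, x, y, eps):
    # Craft the perturbation on the *source* model, which may differ from the trained one.
    p = predict(source_w, x)
    grad = [(p - y) * wi for wi in source_w]
    return [min(1.0, max(0.0, xi + eps * ((g > 0) - (g < 0)))) for xi, g in zip(x, grad)]

def ensemble_adversarial_training(data, static_models, eps=0.1, epochs=100,
                                  lr=0.5, batch_size=4, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Source of adversarial examples for this batch: either the model
            # being trained or one of the fixed pretrained models.
            source = rng.choice([w] + static_models)
            half = len(batch) // 2
            examples = [(fgsm(source, x, y, eps), y) for x, y in batch[:half]] + batch[half:]
            # One gradient step on the mean cross-entropy over clean + adversarial examples.
            grad = [0.0] * len(w)
            for x, y in examples:
                p = predict(w, x)
                for j, xj in enumerate(x):
                    grad[j] += (p - y) * xj / len(examples)
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical data: label 1 when the first feature is larger than the second.
clean = [([0.8, 0.2], 1), ([0.9, 0.3], 1), ([0.7, 0.1], 1), ([0.6, 0.2], 1),
         ([0.2, 0.8], 0), ([0.3, 0.9], 0), ([0.1, 0.7], 0), ([0.2, 0.6], 0)]
static_models = [[3.0, -2.0], [1.0, -4.0]]  # stand-ins for pretrained models

w = ensemble_adversarial_training(clean, static_models)
accuracy = sum((predict(w, x) > 0.5) == (y == 1) for x, y in clean) / len(clean)
```

The design point the sketch tries to capture is that perturbations crafted on the static models do not depend on `w`, so the trained model cannot weaken them by changing its own gradients.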
We can draw a connection between Ensemble Adversarial Training and multiplesource Domain Adaptation (Mansour et al., 2009; Zhang et al., 2012). In Domain Adaptation, a model trained on data sampled from one or more source distributions is evaluated on samples from a different target distribution .
Let $\mathcal{A}_i$ be an adversarial distribution obtained by sampling $(x, y_{\text{true}})$ from $\mathcal{D}$, computing an adversarial example $x^{\text{adv}}$ for some model such that $\|x^{\text{adv}} - x\|_\infty \le \epsilon$, and outputting $(x^{\text{adv}}, y_{\text{true}})$. In Ensemble Adversarial Training, the source distributions are $\mathcal{D}$ (the clean data) and $\mathcal{A}_1, \dots, \mathcal{A}_k$ (the attacks over the currently trained model and the static pretrained models). The target distribution takes the form of an unseen blackbox adversary $\mathcal{A}^*$. Standard generalization bounds for Domain Adaptation (Mansour et al., 2009; Zhang et al., 2012) yield the following result.
Let $h^* \in \mathcal{H}$ be a model learned with Ensemble Adversarial Training and static blackbox adversaries $\mathcal{A}_1, \dots, \mathcal{A}_k$. If $h^*$ is robust against the blackbox adversaries $\mathcal{A}_1, \dots, \mathcal{A}_k$ used at training time, then $h^*$ has bounded error on attacks from a future blackbox adversary $\mathcal{A}^*$, provided that $\mathcal{A}^*$ is not “much stronger”, on average, than the static adversaries $\mathcal{A}_1, \dots, \mathcal{A}_k$.
We give a formal statement of this result and of the assumptions on in Appendix B. Of course, ideally we would like guarantees against arbitrary future adversaries. For very lowdimensional tasks (e.g., MNIST), stronger guarantees are within reach for specific classes of adversaries (e.g., bounded perturbations (Madry et al., 2017; Sinha et al., 2018; Raghunathan et al., 2018; Kolter & Wong, 2017)), yet they also fail to extend to other adversaries not considered at training time (see Appendix C.1 for a discussion). For ImageNetscale tasks, stronger formal guarantees appear out of reach, and we thus resort to an experimental assessment of the robustness of Ensemble Adversarially Trained models to various noninteractive blackbox adversaries in Section 4.2.
We show the existence of a degenerate minimum, as described in Section 3.3, for the adversarially trained Inception v3 model of Kurakin et al. (2017b). Their model (denoted v3) was trained on a StepLL attack with . We also adversarially train an Inception ResNet v2 model (Szegedy et al., 2016a) using the same setup. We denote this model by IRv2. We refer the reader to (Kurakin et al., 2017b) for details on the adversarial training procedure.
We first measure the approximationratio of the StepLL attack for the inner maximization in (1). As we do not know the true maximum, we lowerbound it using an iterative attack. For random test points, we find that for a standard Inception v3 model, stepLL gets within of the optimum loss on average. This attack is thus a good candidate for adversarial training. Yet, for the v3 model, the approximation ratio drops to , confirming that the learned model is less amenable to linearization. We obtain similar results for Inception ResNet v2 models. The ratio is for a standard model, and for IRv2. Similarly, we look at the cosine similarity between the perturbations given by a singlestep and multistep attack. The more linear the model, the more similar we expect both perturbations to be. The average similarity drops from for Inception v3 to for v3. This effect is not due to the decision surface of v3 being “too flat” near the data points: the average gradient norm is larger for v3 () than for the standard v3 model ().
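The two diagnostics used in this paragraph are simple to compute once the losses and perturbations are available. The sketch below uses made-up numbers purely to show the arithmetic; the function names and inputs are hypothetical.

```python
import math

def approximation_ratio(loss_single_step, loss_iterative):
    """Loss reached by the single-step attack, relative to the loss reached by an
    iterative attack (used as a lower bound on the true maximum)."""
    return loss_single_step / loss_iterative

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical signed perturbations from a single-step and a multi-step attack.
single_step = [1, -1, 1, 1]
multi_step = [1, -1, -1, 1]
similarity = cosine_similarity(single_step, multi_step)

# Hypothetical loss values.
ratio = approximation_ratio(0.5, 2.0)
```

A ratio close to 1 and a high similarity both indicate that the linearization is a good local model of the loss; low values are what the text associates with gradient masking.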
We visualize this “gradient-masking” effect (Papernot et al., 2016b) by plotting the loss of v3 on examples $x^* = x + \epsilon_1 \cdot g + \epsilon_2 \cdot g^\perp$, where $g$ is the signed gradient of model v3 and $g^\perp$ is a signed vector orthogonal to $g$. Looking forward to Section 4.1, we actually chose $g^\perp$ to be the signed gradient of another Inception model, from which adversarial examples transfer to v3. Figure 1 shows that the loss is highly curved in the vicinity of the data point $x$, and that the gradient poorly reflects the global loss landscape. Similar plots for additional data points are in Figure 4.
We show similar results for adversarially trained MNIST models in Appendix C.2. On this task, input dropout (Srivastava et al., 2014) mitigates adversarial training’s overfitting problem, in some cases. Presumably, the random input mask diversifies the perturbations seen during training (dropout at intermediate layers does not mitigate the overfitting effect). Mishkin et al. (2017) find that input dropout significantly degrades accuracy on ImageNet, so we did not include it in our experiments.
Kurakin et al. (2017b) found their adversarially trained model to be robust to various singlestep attacks. They conclude that this robustness should translate to attacks transferred from other models. As we have shown, the robustness to singlestep attacks is actually misleading, as the model has learned to degrade the information contained in the model’s gradient. As a consequence, we find that the v3 model is substantially more vulnerable to singlestep attacks than Kurakin et al. (2017b) predicted, both in a whitebox and blackbox setting. The same holds for the IRv2 model.
In addition to the v3 and IRv2 models, we consider standard Inception v3, Inception v4 and Inception ResNet v2 models. These models are available in the TensorFlowSlim library (Abadi et al., 2015). We describe similar results for a variety of models trained on MNIST in Appendix C.2.
Table 1 shows error rates for singlestep attacks transferred between models. We compute perturbations on one model (the source) and transfer them to all others (the targets). When the source and target are the same, the attack is whitebox. Adversarial training greatly increases robustness to whitebox singlestep attacks, but incurs a higher error rate in a blackbox setting. Thus, the robustness gain observed when evaluating defended models in isolation is misleading. Given the ubiquity of this pitfall among proposed defenses against adversarial examples (Carlini & Wagner, 2017a; Brendel & Bethge, 2017; Papernot et al., 2016b), we advise researchers to always consider both whitebox and blackbox adversaries when evaluating defensive strategies. Notably, a similar discrepancy between whitebox and blackbox attacks was recently observed in Buckman et al. (2018).


Attacks crafted on adversarial models are found to be weaker even against undefended models (i.e., when using v3 or IRv2 as source, the attack transfers with lower probability). This confirms our intuition from Section 3.3: adversarial training does not just overfit to perturbations that affect standard models, but actively degrades the linear approximation underlying the single-step attack.
The loss function visualization in Figure 1 shows that sharp curvature artifacts localized near the data points can mask the true direction of steepest ascent. We thus suggest prepending a small random step to single-step attacks, in order to “escape” the non-smooth vicinity of the data point before linearizing the model’s loss. Our new attack, called R+FGSM (alternatively, R+StepLL), is defined as follows, for parameters $\epsilon$ and $\alpha$ (where $\alpha < \epsilon$):
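$$x' = x + \alpha \cdot \operatorname{sign}\big(\mathcal{N}(\mathbf{0}^d, \mathbf{I}^d)\big), \qquad x^{\text{adv}} = x' + (\epsilon - \alpha) \cdot \operatorname{sign}\big(\nabla_{x'} L(h(x'), y_{\text{true}})\big) \qquad (7)$$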
Note that the attack requires a single gradient computation. The R+FGSM is a computationally efficient alternative to iterative methods that have high success rates in a whitebox setting. Our attack can be seen as a singlestep variant of the general PGD method from (Madry et al., 2017).
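A minimal sketch of the randomized single-step attack, again on a hypothetical toy logistic model: a random signed step of size `alpha` is followed by one signed-gradient step of size `eps - alpha`, so the total perturbation never exceeds `eps` in any coordinate. The model, values and names are assumptions made for illustration.

```python
import math
import random

def input_grad(w, b, x, y):
    # Gradient of the cross-entropy loss of a toy logistic model w.r.t. the input.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def rand_fgsm(w, b, x, y, eps, alpha, rng):
    """Random signed step of size alpha, then a single signed-gradient step of
    size (eps - alpha) computed at the randomly shifted point."""
    x_prime = [xi + alpha * sign(rng.gauss(0.0, 1.0)) for xi in x]
    g = input_grad(w, b, x_prime, y)  # the only gradient computation
    x_adv = [xp + (eps - alpha) * sign(gi) for xp, gi in zip(x_prime, g)]
    return [min(1.0, max(0.0, v)) for v in x_adv]

# Hypothetical model, data point and parameters.
w, b = [2.0, -3.0, 0.5], 0.1
x, y = [0.6, 0.2, 0.5], 1
eps, alpha = 0.1, 0.05

x_adv = rand_fgsm(w, b, x, y, eps, alpha, random.Random(0))
```

Because the gradient is evaluated at the shifted point rather than at the data point itself, the attack remains single-step in cost while sampling the loss surface slightly away from the input.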
Table 2 compares error rates for the StepLL and R+StepLL methods (with and ). The extra random step yields a stronger attack for all models, even those without adversarial training. This suggests that a model’s loss function is generally less smooth near the data points. We further compared the R+StepLL attack to a twostep IterLL attack, which computes two gradient steps. Surprisingly, we find that for the adversarially trained Inception v3 model, the R+StepLL attack is stronger than the twostep IterLL attack. That is, the local gradients learned by the adversarially trained model are worse than random directions for finding adversarial examples!


We find that the addition of this random step hinders transferability (see Table 9). We also tried adversarial training using R+FGSM on MNIST, using a similar approach as (Madry et al., 2017). We adversarially train a CNN (model A in Table 5) for epochs, and attain accuracy on R+FGSM samples. However, training on R+FGSM provides only little robustness to iterative attacks. For the PGD attack of (Madry et al., 2017) with steps, the model attains accuracy.
We now evaluate our Ensemble Adversarial Training strategy described in Section 3.4. We recall our intuition: by augmenting training data with adversarial examples crafted from static pretrained models, we decouple the generation of adversarial examples from the model being trained, so as to avoid the degenerate minimum described in Section 3.3. Moreover, our hope is that robustness to attacks transferred from some fixed set of models will generalize to other blackbox adversaries.
Table 3.
Trained Model | Pretrained Models | Holdout Models
Inception v3 (v3_adv-ens3) | Inception v3, ResNet v2 (50) |
Inception v3 (v3_adv-ens4) | Inception v3, ResNet v2 (50), IncRes v2 |
IncRes v2 (IRv2_adv-ens) | Inception v3, IncRes v2 |
We train Inception v3 and Inception ResNet v2 models (Szegedy et al., 2016a) on ImageNet, using the pretrained models shown in Table 3. In each training batch, we rotate the source of adversarial examples between the currently trained model and one of the pretrained models. We select the source model at random in each batch, to diversify examples across epochs. The pretrained models’ gradients can be precomputed for the full training set. The per-batch cost of Ensemble Adversarial Training is thus lower than that of standard adversarial training: using our method with $k$ pretrained models, only every $k$-th batch requires a forward-backward pass to compute adversarial gradients. We use synchronous distributed training on 50 machines, with mini-batches of size 16 (we did not precompute gradients, and thus lower the batch size to fit all models in memory). Half of the examples in a mini-batch are replaced by StepLL examples. As in Kurakin et al. (2017b), we use RMSProp with a learning rate of , decayed by a factor of every two epochs.
To evaluate how robustness to blackbox attacks generalizes across models, we transfer various attacks crafted on three different holdout models (see Table 3), as well as on an ensemble of these models (as in Liu et al. (2017)). We use the StepLL, R+StepLL, FGSM, IFGSM and the PGD attack from Madry et al. (2017) using the hinge-loss function from Carlini & Wagner (2017a). Our results are in Table 4. For each model, we report the worst-case error rate over all blackbox attacks transferred from each of the holdout models ( attacks in total). Results for MNIST are in Table 8.
Convergence of Ensemble Adversarial Training is slower than for standard adversarial training, a result of training on “hard” adversarial examples and lowering the batch size. Kurakin et al. (2017b) report that after epochs ( iterations with minibatches of size ), the v3 model achieves accuracy. Ensemble Adversarial Training for models v3 and v3 converges after epochs ( iterations with minibatches of size ). The Inception ResNet v2 model is trained for epochs, where a baseline model converges at around epochs.
For both architectures, the models trained with Ensemble Adversarial Training are slightly less accurate on clean data, compared to standard adversarial training. Our models are also more vulnerable to whitebox singlestep attacks, as they were only partially trained on such perturbations. Note that for v3, the proportion of whitebox StepLL samples seen during training is (instead of for model v3). The negative impact on the robustness to whitebox attacks is large, for only a minor gain in robustness to transferred samples. Thus it appears that while increasing the diversity of adversarial examples seen during training can provide some marginal improvement, the main benefit of Ensemble Adversarial Training is in decoupling the attacks from the model being trained, which was the goal we stated in Section 3.4.
Ensemble Adversarial Training is not robust to whitebox IterLL and R+StepLL samples: the error rates are similar to those for the v3 model, and omitted for brevity (see Kurakin et al. (2017b) for IterLL attacks and Table 2 for R+StepLL attacks). Kurakin et al. (2017b) conjecture that larger models are needed to attain robustness to such attacks. Yet, against blackbox adversaries, these attacks are only a concern insofar as they reliably transfer between models.
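Table 4. Error rates (in %).
Model | Top 1: Clean | Top 1: StepLL | Top 1: Max. BlackBox | Top 5: Clean | Top 5: StepLL | Top 5: Max. BlackBox
v3 | 22.0 | 69.6 | 51.2 | 6.1 | 42.7 | 24.5
v3_adv | 22.0 | 26.6 | 40.8 | 6.1 | 9.0 | 17.4
v3_adv-ens3 | 23.6 | 30.0 | 34.0 | 7.6 | 10.1 | 11.2
v3_adv-ens4 | 24.2 | 43.3 | 33.4 | 7.8 | 19.4 | 10.7
IRv2 | 19.6 | 50.7 | 44.4 | 4.8 | 24.0 | 17.8
IRv2_adv | 19.8 | 21.4 | 34.5 | 4.9 | 5.8 | 11.7
IRv2_adv-ens | 20.2 | 26.0 | 27.0 | 5.1 | 7.6 | 7.9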
Ensemble Adversarial Training significantly boosts robustness to all attacks transferred from the holdout models. For the IRv2 model, the accuracy loss (compared to IRv2’s accuracy on clean data) is (top 1) and (top 5). We find that the strongest attacks in our test suite (i.e., with highest transfer rates) are the FGSM attacks. Blackbox R+StepLL or iterative attacks are less effective, as they do not transfer with high probability (see Kurakin et al. (2017b) and Table 9). Attacking an ensemble of all three holdout models, as in Liu et al. (2017), did not lead to stronger blackbox attacks than when attacking the holdout models individually.
Our results have little variance with respect to the attack parameters (e.g., smaller $\epsilon$) or to the use of other holdout models for blackbox attacks (e.g., we obtain similar results by attacking the v3 and v3 models with the IRv2 model). We also find that v3 is not vulnerable to perturbations transferred from v3. We obtain similar results on MNIST (see Appendix C.2), thus demonstrating the applicability of our approach to different datasets and model architectures.
Our Inception ResNet v2 model was included as a baseline defense in the NIPS 2017 competition on Adversarial Examples (Kurakin et al., 2017c). Participants of the attack track submitted non-interactive blackbox attacks that produce adversarial examples with bounded norm. Models submitted to the defense track were evaluated on all attacks over a subset of the ImageNet test set. The score of a defense was defined as the average accuracy of the model over all adversarial examples produced by all attacks.
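The scoring rule is a plain average, which can hide large differences between individual attacks. The short sketch below computes the average score together with the per-attack minimum; the counts are invented for illustration and are not competition results.

```python
def defense_score(results):
    """Average accuracy over all adversarial examples produced by all attacks.
    `results` holds one (num_correct, num_examples) pair per attack."""
    correct = sum(c for c, _ in results)
    total = sum(n for _, n in results)
    return correct / total

def worst_case_accuracy(results):
    """Accuracy against the single most damaging attack."""
    return min(c / n for c, n in results)

# Hypothetical outcomes of one defense against three attacks.
hypothetical_results = [(90, 100), (50, 100), (70, 100)]
score = defense_score(hypothetical_results)
worst = worst_case_accuracy(hypothetical_results)
```

In this made-up case the defense looks reasonable on average while one attack halves its accuracy, which is why both numbers are worth reporting.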
Our IRv2 model finished first among submissions in the first development round, with a score of (the second placed defense scored ). The test data was intentionally chosen as an “easy” subset of ImageNet. Our model achieved accuracy on the clean test data.
After the first round, we released our model publicly, which enabled other users to launch whitebox attacks against it. Nevertheless, a majority of the final submissions built upon our released model. The winning submission (team “liaofz” with a score of ) made use of a novel adversarial denoising technique. The second placed defense (team “cihangxie” with a score of ) prepends our IRv2 model with random padding and resizing of the input image (Xie et al., 2018).
It is noteworthy that the defenses that incorporated Ensemble Adversarial Training fared better against the worst-case blackbox adversary. Indeed, although very robust on average, the winning defense achieved as low as accuracy on some attacks. The best defense under this metric (team “rafaelmm” which randomly perturbed images before feeding them to our IRv2 model) achieved at least accuracy against all submitted attacks, including the attacks that explicitly targeted our released model in a whitebox setting.
Ensemble Adversarial Training decreases the magnitude of the gradient masking effect described previously. For the v3 and v3 models, we find that the loss incurred on a StepLL attack gets within respectively and of the optimum loss (we recall that for models v3 and v3, the approximation ratio was respectively and ). Similarly, for the IRv2 model, the ratio improves from (for IRv2) to . As expected, not solely training on a whitebox singlestep attack reduces gradient masking. We also verify that after Ensemble Adversarial Training, a twostep iterative attack outperforms the R+StepLL attack from Section 4.1, thus providing further evidence that these models have meaningful gradients.
Finally, we revisit the “Gradient-Aligned Adversarial Subspace” (GAAS) method of Tramèr et al. (2017). Their method estimates the size of the space of adversarial examples in the vicinity of a point, by finding a set of orthogonal perturbations of norm that are all adversarial. We note that adversarial perturbations do not technically form a “subspace” (e.g., the vector is not adversarial). Rather, they may form a “cone”, the dimension of which varies as we increase . By linearizing the loss function, estimating the dimensionality of this cone reduces to finding vectors that are strongly aligned with the model’s gradient . Tramèr et al. (2017) give a method that finds orthogonal vectors that satisfy (this bound is tight). We extend this result to the norm, an open question in Tramèr et al. (2017). In Section E, we give a randomized combinatorial construction (Colbourn, 2010), that finds orthogonal vectors satisfying and . We show that this result is tight as well.
For models v3, v3 and v3, we select correctly classified test points. For each , we search for a maximal number of orthogonal adversarial perturbations with . We limit our search to directions per point. The results are in Figure 2. For , we plot the proportion of points that have at least orthogonal adversarial perturbations. For a fixed , the value of can be interpreted as the dimension of a “slice” of the cone of adversarial examples near a data point. For the standard Inception v3 model, we find over orthogonal adversarial directions for of the points. The v3 model shows a curious bimodal phenomenon for : for most points (), we find no adversarial direction aligned with the gradient, which is consistent with the gradient masking effect. Yet, for most of the remaining points, the adversarial space is very high-dimensional (). Ensemble Adversarial Training yields a more robust model, with only a small fraction of points near a large adversarial space.
Previous work on adversarial training at scale has produced encouraging results, showing strong robustness to (singlestep) adversarial examples (Goodfellow et al., 2014b; Kurakin et al., 2017b). Yet, these results are misleading, as the adversarially trained models remain vulnerable to simple blackbox and whitebox attacks. Our results, generic with respect to the application domain, suggest that adversarial training can be improved by decoupling the generation of adversarial examples from the model being trained. Our experiments with Ensemble Adversarial Training show that the robustness attained to attacks from some models transfers to attacks from other models.
We did not consider blackbox adversaries that attack a model by means other than transferring examples from a local model. For instance, generative techniques (Baluja & Fischer, 2017) might provide an avenue for stronger attacks. Yet, recent work by Xiao et al. (2018) found Ensemble Adversarial Training to be resilient to such attacks on MNIST and CIFAR10, often attaining higher robustness than models that were adversarially trained on iterative attacks.
Moreover, interactive adversaries (see Appendix A) could try to exploit queries to the target model’s prediction function in their attack, as demonstrated in Papernot et al. (2017). If queries to the target model yield prediction confidences, an adversary can estimate the target’s gradient at a given point (e.g., using finitedifferences as in Chen et al. (2017)) and fool the target with our R+FGSM attack. Note that if queries only return the predicted label, the attack does not apply. Exploring the impact of these classes of blackbox attacks and evaluating their scalability to complex tasks is an interesting avenue for future work.
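The gradient-estimation strategy mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the query interface `confidence`, the toy logistic target with weights `W`, and the step sizes `eps` and `alpha` are all hypothetical, and splitting the budget into a random step of size `alpha` followed by a signed-gradient step of size `eps - alpha` is an assumption about how the randomized singlestep attack divides its budget.

```python
import math
import random

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def estimate_gradient(loss, x, h=1e-4):
    """Central finite-difference estimate of d loss / d x (two queries per coordinate)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += h
        minus = list(x)
        minus[i] -= h
        grad.append((loss(plus) - loss(minus)) / (2 * h))
    return grad

def r_plus_fgsm(loss, x, eps, alpha, rng):
    """Random sign step of size alpha, then a signed-gradient step of size eps - alpha."""
    x_rand = [xi + alpha * rng.choice((-1.0, 1.0)) for xi in x]
    grad = estimate_gradient(loss, x_rand)
    return [xr + (eps - alpha) * sign(g) for xr, g in zip(x_rand, grad)]

# Hypothetical target: a logistic model that only exposes prediction confidences.
W = [2.0, -1.0, 0.5]

def confidence(x):
    """Assumed query interface: probability the target assigns to the true class."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(x):
    """Cross-entropy of the true class, computed from the queried confidence only."""
    return -math.log(confidence(x))
```

The attacker never touches `W` directly; every gradient coordinate costs two confidence queries, which is why the scalability of such attacks to high-dimensional inputs is an open question.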
We thank Ben Poole and Jacob Steinhardt for feedback on early versions of this work. Nicolas Papernot is supported by a Google PhD Fellowship in Security. Research was supported in part by the Army Research Laboratory, under Cooperative Agreement Number W911NF1320045 (ARL Cyber Security CRA), and the Army Research Office under grant W911NF1310421. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation hereon.
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM, 2017.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Systematic evaluation of convolution neural network advances on the imagenet. Computer Vision and Image Understanding, 2017.
The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. IEEE, 2016a.
We provide formal definitions for the threat model introduced in Section 3.1. In the following, we explicitly identify the hypothesis space that a model belongs to as describing the model’s architecture. We consider a target model trained over inputs sampled from a data distribution . More precisely, we write
where train is a randomized training procedure that takes in a description of the model architecture , a training set sampled from , and randomness .
Given a set of test inputs from and a budget , an adversary produces adversarial examples , such that for all . We evaluate success of the attack as the error rate of the target model over :
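The success metric described above can be sketched in a few lines. This is an illustrative helper only: the names `attack_error_rate` and `linf_distance` are hypothetical, and the choice of the l-infinity norm for the budget check is an assumption, since the norm symbol did not survive in the text above.

```python
def linf_distance(a, b):
    """Largest absolute coordinate-wise difference between two inputs."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def attack_error_rate(model, inputs, adv_inputs, labels, eps):
    """Error rate of `model` on adversarial inputs, after checking the budget.

    `model` maps an input to a predicted label. Every adversarial input must
    stay within distance eps of its clean counterpart (small float tolerance).
    """
    for x, x_adv in zip(inputs, adv_inputs):
        if linf_distance(x, x_adv) > eps + 1e-12:
            raise ValueError("adversarial example exceeds the perturbation budget")
    errors = sum(model(x_adv) != y for x_adv, y in zip(adv_inputs, labels))
    return errors / len(labels)
```

Checking the budget inside the metric keeps an attack from being credited for perturbations larger than the threat model allows.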
We assume the adversary can sample inputs according to the data distribution . We define three adversaries.
For a target model , a whitebox adversary is given access to all elements of the training procedure, that is train (the training algorithm), (the model architecture), the training data , the randomness and the parameters . The adversary can use any attack (e.g., those in Section 3.2) to find adversarial inputs.
Whitebox access to the internal model weights corresponds to a very strong adversarial model. We thus also consider the following relaxed and arguably more realistic notion of a blackbox adversary.
For a target model , a noninteractive blackbox adversary only gets access to train (the target model’s training procedure) and (the model architecture). The adversary can sample from the data distribution , and uses a local algorithm to craft adversarial examples .
Attacks based on transferability (Szegedy et al., 2013) fall in this category, wherein the adversary selects a procedure and model architecture , trains a local model over , and computes adversarial examples on its local model using whitebox attack strategies.
Most importantly, a blackbox adversary does not learn the randomness used to train the target, nor the target’s parameters . The blackbox adversaries in our paper are actually slightly stronger than the ones defined above, in that they use the same training data as the target model.
We provide the adversary with the target’s training procedure train to capture knowledge of defensive strategies applied at training time, e.g., adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b) or ensemble adversarial training (see Section 4.2). For ensemble adversarial training, the adversary also knows the architectures of all pretrained models. In this work, we always mount blackbox attacks that train a local model with a different architecture than the target model. We actually find that blackbox attacks on adversarially trained models are stronger in this case (see Table 1).
The main focus of our paper is on noninteractive blackbox adversaries as defined above. For completeness, we also formalize a stronger notion of interactive blackbox adversaries that additionally issue prediction queries to the target model (Papernot et al., 2017). We note that in cases where ML models are deployed as part of a larger system (e.g., a self driving car), an adversary may not have direct access to the model’s query interface.
For a target model , an interactive blackbox adversary only gets access to train (the target model’s training procedure) and (the model architecture). The adversary issues (adaptive) oracle queries to the target model. That is, for arbitrary inputs , the adversary obtains and uses a local algorithm to craft adversarial examples (given knowledge of , train, and tuples ).
Papernot et al. (2017) show that such attacks are possible even if the adversary only gets access to a small number of samples from . Note that if the target model’s prediction interface additionally returns class scores , interactive blackbox adversaries could use queries to the target model to estimate the model’s gradient (e.g., using finite differences) (Chen et al., 2017), and then apply the attacks in Section 3.2. We further discuss interactive blackbox attack strategies in Section 5.
We provide a formal statement of Theorem 1 in Section 3.4, regarding the generalization guarantees of Ensemble Adversarial Training. For simplicity, we assume that the model is trained solely on adversarial examples computed on the pretrained models (i.e., we ignore the clean training data and the adversarial examples computed on the model being trained). Our results are easily extended to also consider these data points.
Let be the data distribution and be adversarial distributions where a sample is obtained by sampling from , computing an such that and returning . We assume the model is trained on data points , where data points are sampled from each distribution , for . We denote . At test time, the model is evaluated on adversarial examples from .
For a model we define the empirical risk
(8) 
and the risk over the target distribution (or future adversary)
(9) 
We further define the average discrepancy distance (Mansour et al., 2009) between distributions and with respect to a hypothesis space as
(10) 
This quantity characterizes how “different” the future adversary is from the traintime adversaries. Intuitively, the distance is small if the difference in robustness between two models to the target attack is somewhat similar to the difference in robustness between these two models to the attacks used for training (e.g., if the static blackbox attacks induce much higher error on some model than on another model , then the same should hold for the target attack ). In other words, the ranking of the robustness of models should be similar for the attacks in as for .
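For orientation, the three quantities numbered (8) to (10) can be written in one plausible form. All notation here is assumed rather than taken from the text: $h$ is a model, $L$ a loss, $\mathcal{A}_1, \dots, \mathcal{A}_k$ the train-time adversarial distributions with $m$ samples each, $\mathcal{A}^*$ the target distribution, and $\mathcal{H}$ the hypothesis space. The exact form of the discrepancy in the cited work may differ.

```latex
% Sketch in assumed notation; not a verbatim restoration of equations (8)-(10).
\begin{align}
\hat{R}(h, \mathcal{A}_{\text{train}})
  &= \frac{1}{km} \sum_{i=1}^{k} \sum_{j=1}^{m}
     L\big(h(x^{\text{adv}}_{i,j}),\, y_{i,j}\big), \\
R(h, \mathcal{A}^*)
  &= \mathbb{E}_{(x^{\text{adv}},\, y) \sim \mathcal{A}^*}
     \big[ L\big(h(x^{\text{adv}}),\, y\big) \big], \\
\operatorname{disc}_{\mathcal{H}}(\mathcal{A}_{\text{train}}, \mathcal{A}^*)
  &= \frac{1}{k} \sum_{i=1}^{k} \sup_{h_1, h_2 \in \mathcal{H}}
     \Big| \mathbb{E}_{\mathcal{A}_i}\big[ L(h_1(x^{\text{adv}}), h_2(x^{\text{adv}})) \big]
         - \mathbb{E}_{\mathcal{A}^*}\big[ L(h_1(x^{\text{adv}}), h_2(x^{\text{adv}})) \big] \Big|.
\end{align}
```

Under this reading, the discrepancy compares how often two models disagree under each train-time attack with how often they disagree under the target attack, which matches the ranking intuition given above.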
Finally, let be the average Rademacher complexity of the distributions (Zhang et al., 2012). Note that as . The following theorem is a corollary of Zhang et al. (2012, Theorem 5.2):
Assume that is a function class consisting of bounded functions. Then, with probability at least ,
(11) 
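The general shape of such a bound can be sketched as follows. The notation is assumed: $N = km$ is the total number of training points, $\delta$ the failure probability, $\mathcal{R}_N(\mathcal{H})$ the average Rademacher complexity, and the constants inside the $O(\cdot)$ term are deliberately left unspecified because they are not recoverable from the text.

```latex
% Sketch of the bound's shape in assumed notation; constants omitted.
\begin{equation}
\sup_{h \in \mathcal{H}}
  \big| \hat{R}(h, \mathcal{A}_{\text{train}}) - R(h, \mathcal{A}^*) \big|
\;\le\;
  \operatorname{disc}_{\mathcal{H}}(\mathcal{A}_{\text{train}}, \mathcal{A}^*)
  + 2\, \mathcal{R}_N(\mathcal{H})
  + O\!\left( \sqrt{\frac{\ln(1/\delta)}{N}} \right).
\end{equation}
```

The last two terms vanish as the number of training points grows, so the discrepancy is the only term that does not shrink with more data.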
Compared to the standard generalization bound for supervised learning, the generalization bound for Domain Adaptation incorporates an extra discrepancy term to capture the divergence between the target and source distributions. In our context, this means that the model learned by Ensemble Adversarial Training has guaranteed generalization bounds with respect to future adversaries that are not “too different” from the ones used during training. Note that the future adversary need not restrict itself to perturbations with bounded norm for this result to hold.
We repeat our ImageNet experiments on MNIST. For this simpler task, Madry et al. (2017) show that training on iterative attacks conveys robustness to whitebox attacks with bounded norm. Our goal is not to attain similarly strong whitebox robustness on MNIST, but to show that our observations on the limitations of singlestep adversarial training extend to datasets other than ImageNet.
The MNIST dataset is a simple baseline for assessing the potential of a defense, but the obtained results do not always generalize to harder tasks. We suggest that this is because achieving robustness to perturbations admits a simple “closedform” solution, given the nearbinary nature of the data. Indeed, for an average MNIST image, over of the pixels are in and only are in the range . Thus, for a perturbation with , binarized versions of and can differ in at most of the input dimensions. By binarizing the inputs of a standard CNN trained without adversarial training, we obtain a model that enjoys robustness similar to the model trained by Madry et al. (2017). Concretely, for a whitebox IFGSM attack, we get at most error.
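The binarization argument can be illustrated with a small sketch. The threshold of 0.5, the budget `eps = 0.3` used in the usage note, and the helper names are hypothetical choices for illustration; the text above does not preserve the exact values.

```python
def binarize(image, threshold=0.5):
    """Round every pixel in [0, 1] to 0 or 1."""
    return [1 if p >= threshold else 0 for p in image]

def flippable_pixels(image, eps, threshold=0.5):
    """Count pixels whose binarized value a perturbation of size eps could change.

    A pixel can only flip if moving it by at most eps crosses the threshold,
    so near-binary images leave very few such pixels.
    """
    count = 0
    for p in image:
        can_drop = p >= threshold and p - eps < threshold
        can_rise = p < threshold and p + eps >= threshold
        if can_drop or can_rise:
            count += 1
    return count
```

For a toy image such as `[0.0, 0.0, 1.0, 1.0, 0.0, 0.45, 1.0, 0.9]` with `eps = 0.3`, only the pixel at 0.45 can cross the threshold, so the binarized clean and perturbed images differ in at most one position. A classifier fed binarized inputs therefore sees almost the same input before and after the attack.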
The existence of such a simple robust representation raises the question of why learning a robust model with adversarial training takes so much effort. Finding techniques to improve the performance of adversarial training, even on simple tasks, could provide useful insights for more complex tasks such as ImageNet, where we do not know of a similarly simple “denoising” procedure.
These positive results on MNIST for the norm also leave open the question of defining a general norm for adversarial examples. Let us motivate the need for such a definition: we find that if we first rotate an MNIST digit by , and then use the IFGSM, our rounding model and the model from Madry et al. (2017) achieve only accuracy (on “clean” rotated inputs, the error is ). If we further randomly “flip” pixels per image, the accuracy of both models drops to under . Thus, we successfully evade the model by slightly extending the threat model (see Figure 3).
Of course, we could augment the training set with such perturbations (see Engstrom et al. (2017)). An open question is whether we can enumerate all types of “adversarial” perturbations. In this work, we focus on the norm to illustrate our findings on the limitations of singlestep adversarial training on ImageNet and MNIST, and to showcase the benefits of our Ensemble Adversarial Training variant. Our approach can easily be extended to consider multiple perturbation metrics. We leave such an evaluation to future work.
We repeat experiments from Section 4 on MNIST. We use the architectures in Table 5. We train a standard model for epochs, and an adversarial model with the FGSM () for 12 epochs.
A | B | C | D
Conv(64, 5, 5) + Relu | Dropout(0.2) | Conv(128, 3, 3) + Tanh |
Conv(64, 5, 5) + Relu | Conv(64, 8, 8) + Relu | MaxPool(2,2) |
Dropout(0.25) | Conv(128, 6, 6) + Relu | Conv(64, 3, 3) + Tanh |
FC(128) + Relu | Conv(128, 5, 5) + Relu | MaxPool(2,2) |
Dropout(0.5) | Dropout(0.5) | FC(128) + Relu |
FC + Softmax | FC + Softmax | FC + Softmax | FC + Softmax
During adversarial training, we avoid the label leaking effect described by Kurakin et al. (2017b) by using the model’s predicted class instead of the true label in the FGSM.
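As a minimal sketch of this FGSM variant, assume a toy binary logistic model with hypothetical weights `w` and bias `b`. The only point illustrated is that the label fed to the loss is the model's own prediction rather than the ground truth; the model and function names are not from the paper.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model (toy stand-in for the network)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_predicted_label(w, b, x, eps):
    """Single signed-gradient step that uses the predicted class as the label."""
    p = predict_proba(w, b, x)
    y_pred = 1 if p >= 0.5 else 0  # model's own prediction, not the true label
    # Gradient of the cross-entropy loss with respect to x for a logistic model.
    grad = [(p - y_pred) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
```

Because the true label never enters the computation, the perturbation cannot carry information about it, which is the leak this choice is meant to avoid.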
We first analyze the “degenerate” minimum of adversarial training, described in Section 3.3. For each trained model, we compute the approximationratio of the FGSM for the inner maximization problem in equation (1). That is, we compare the loss produced by the FGSM with the loss of a strong iterative attack. The results appear in Table 6. As we can see, for all model architectures, adversarial training degraded the quality of a linear approximation to the model’s loss.
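The approximation ratio can be sketched on a toy problem. Everything below is illustrative: an iterative FGSM with projection stands in for the "strong iterative attack", and `toy_loss` is a hypothetical two-dimensional loss whose cross term makes the gradient at the data point a poor guide to the best perturbation.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(grad, x, eps):
    """Single step of size eps along the sign of the gradient at x."""
    return [xi + eps * sign(g) for xi, g in zip(x, grad(x))]

def iterative_fgsm(grad, x, eps, steps, step_size):
    """Repeated small signed-gradient steps, projected onto the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad(x_adv))]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def approximation_ratio(loss, grad, x, eps, steps=20):
    """Loss reached by the single-step attack divided by the iterative attack's loss."""
    strong = iterative_fgsm(grad, x, eps, steps, eps / 10)
    return loss(fgsm(grad, x, eps)) / loss(strong)

def toy_loss(x):
    """Hypothetical loss with strong curvature (cross term) near the origin."""
    return 1.0 + x[0] + 0.1 * x[1] - 5.0 * x[0] * x[1]

def toy_grad(x):
    """Analytic gradient of toy_loss."""
    return [1.0 - 5.0 * x[1], 0.1 - 5.0 * x[0]]
```

On this toy loss with `eps = 0.3`, the single step from the origin lands at a point whose loss is lower than the clean loss, while the iterative attack follows the changing gradient to a much higher loss, so the ratio is well below 1.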
A  A  B  B  B  B  C  C  D  D 
We find that input dropout (Srivastava et al., 2014) (i.e., randomly dropping a fraction of input features during training) as used in architecture B limits this unwarranted effect of adversarial training. (We thank Arjun Bhagoji, Bo Li and Dawn Song for this observation.) If we omit the input dropout (we call this architecture B) the singlestep attack degrades significantly. We discuss this effect in more detail below. For the fully connected architecture D, we find that the learned model is very close to linear and thus also less prone to the degenerate solution to the minmax problem, as we postulated in Section 3.3.
Table 7 compares error rates of undefended and adversarially trained models on whitebox and blackbox attacks, as in Section 4.1. Again, model B presents an anomaly. For all other models, we corroborate our findings on ImageNet for adversarial training: (1) blackbox attacks trump whitebox singlestep attacks; (2) whitebox singlestep attacks are significantly stronger if prepended by a random step. For model B, the opposite holds true. We believe this is because input dropout increases diversity of attack samples similarly to Ensemble Adversarial Training.
whitebox  blackbox  
FGSM  R+FGSM  FGSM  FGSM  FGSM  FGSM  FGSM  
A    
A  
B    
B  
B    
B  
C    
C  
D    
D 
While training with input dropout helps avoid the degradation of the singlestep attack, it also significantly delays convergence of the model. Indeed, model B retains relatively high error on whitebox FGSM examples. Adversarial training with input dropout can be seen as comparable to training with a randomized singlestep attack, as discussed in Section 4.1.
The positive effect of input dropout is architecture and dataset specific: Adding an input dropout layer to models A, C and D confers only marginal benefit, and is outperformed by Ensemble Adversarial Training, discussed below. Moreover, Mishkin et al. (2017) find that input dropout significantly degrades accuracy on ImageNet. We thus did not incorporate it into our models on ImageNet.
To evaluate Ensemble Adversarial Training (Section 3.4), we train two models per architecture. The first, denoted [AD], uses a single pretrained model of the same type (i.e., A is trained on perturbations from another model A). The second model, denoted [AD], uses pretrained models ( or ). We train all models for epochs.
We evaluate our models on blackbox attacks crafted on models A,B,C,D (for a fair comparison, we do not use the same pretrained models for evaluation, but retrain them with different random seeds). The attacks we consider are the FGSM, IFGSM and the PGD attack from Madry et al. (2017) (with the loss function from Carlini & Wagner (2017a)), all with . The results appear in Table 8. For each model, we report the worstcase and averagecase error rate over all blackbox attacks.
Clean  FGSM  Max. Black Box  Avg. Black Box  
A  0.8  2.2  10.8  7.7 
A  0.8  7.0  6.6  5.2 
A  0.7  5.4  6.5  4.3 
B  0.8  11.6  8.9  5.5 
B  0.7  10.5  6.8  5.3 
B  0.8  14.0  8.8  5.1 
C  1.0  3.7  29.3  18.7 
C  1.3  1.9  17.2  10.7 
C  1.4  3.6  14.5  8.4 
D  2.6  25.5  32.5  23.5 
D  2.6  21.5  38.6  28.0 
D  2.6  29.4  29.8  15.6 
Ensemble Adversarial Training significantly increases robustness to blackbox attacks, except for architecture B, which we previously found to not suffer from the same overfitting phenomenon that affects the other adversarially trained networks. Nevertheless, model B achieves slightly better robustness to whitebox and blackbox attacks than B. In the majority of cases, we find that using a single pretrained model produces good results, but that the extra diversity of including three pretrained models can sometimes increase robustness even further. Our experiments confirm our conjecture that robustness to blackbox attacks generalizes across models. Indeed, we find that when training with three external models, we attain very good robustness against attacks initiated from models with the same architecture (as evidenced by the average error on our attack suite), but also increased robustness to attacks initiated from the fourth holdout model.
In Section 4.1, we introduced the R+StepLL attack, an extension of the StepLL method that prepends the attack with a small random perturbation. In Table 9, we evaluate the transferability of R+StepLL adversarial examples on ImageNet. We find that the randomized variant produces perturbations that transfer at a much lower rate (see Table 1 for the deterministic variant).


Tramèr et al. (2017) consider the following task for a given model : for a (correctly classified) point , find orthogonal vectors such that and all the are adversarial (i.e., ). By linearizing the model’s loss function, this reduces to finding orthogonal vectors that are maximally aligned with the model’s gradient . Tramèr et al. (2017) left a construction for the norm as an open problem.
We provide an optimal construction for the norm, based on Regular Hadamard Matrices (Colbourn, 2010). Given the constraint, we find orthogonal vectors that are maximally aligned with the signed gradient, . We first prove an analog of (Tramèr et al., 2017, Lemma 1).
Let and . Suppose there are k orthogonal vectors satisfying . Then .
Let . Then, we have
(12) 
from which we obtain . ∎
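The argument can be spelled out under an assumed reading of the lemma, since its symbols are not preserved above: take $v \in \{-1, 1\}^d$, $\alpha \in (0, 1)$, and $k$ mutually orthogonal vectors $r_1, \dots, r_k$ with $\|r_i\|_\infty \le 1$ and $\langle v, r_i \rangle \ge \alpha d$. The derivation below is a sketch based on Bessel's inequality and may differ in presentation from the original proof.

```latex
% Sketch under the assumed lemma statement; uses ||r_i||_2^2 <= d.
\begin{align*}
d = \|v\|_2^2
  \;\ge\; \sum_{i=1}^{k} \frac{\langle v, r_i \rangle^2}{\|r_i\|_2^2}
  \;\ge\; \sum_{i=1}^{k} \frac{(\alpha d)^2}{d}
  \;=\; k\, \alpha^2 d
\quad\Longrightarrow\quad
\alpha \le k^{-1/2}.
\end{align*}
```

The first inequality is Bessel's inequality applied to the orthonormal system $r_i / \|r_i\|_2$, and the second uses $\|r_i\|_2^2 \le d$, which follows from $\|r_i\|_\infty \le 1$.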
This result bounds the number of orthogonal perturbations we can expect to find, for a given alignment with the signed gradient. As a warmup, consider the following trivial construction of orthogonal vectors in that are “somewhat” aligned with . We split into “chunks” of size and define to be the vector that is equal to in the i-th chunk and zero otherwise. We obtain , a factor worse than the bound in Lemma 6.
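The warmup construction can be sketched directly. The function names are hypothetical, and the code assumes that the number of vectors `k` divides the dimension and that no gradient entry is exactly zero.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def chunk_vectors(grad, k):
    """Build k orthogonal vectors, each equal to sign(grad) on one chunk, zero elsewhere."""
    d = len(grad)
    size = d // k  # assumes k divides d
    vectors = []
    for i in range(k):
        r = [0] * d
        for j in range(i * size, (i + 1) * size):
            r[j] = sign(grad[j])
        vectors.append(r)
    return vectors
```

Each vector has disjoint support, so orthogonality is immediate, but its inner product with the signed gradient is only the chunk size `d / k`, which is where the gap to the upper bound comes from.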
We now provide a construction that meets this upper bound. We make use of Regular Hadamard Matrices of order (Colbourn, 2010). These are square matrices such that: (1) all entries of are in ; (2) the rows of are mutually orthogonal; (3) All row sums are equal to .
The order of a Regular Hadamard Matrix is of the form for an integer . We use known constructions for .
Let and be an integer for which a Regular Hadamard Matrix of order exists. Then, there is a randomized construction of orthogonal vectors , such that . Moreover, .
We construct orthogonal vectors , where is obtained by repeating the i-th row of times (for simplicity, we assume that divides ; otherwise we pad with zeros). We then multiply each componentwise with . By construction, the vectors are mutually orthogonal, and we have , which is tight according to Lemma 6.
As the weight of the gradient may not be uniformly distributed among its components, we apply our construction to a random permutation of the signed gradient. We then obtain
(13)
(14)
∎
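The construction above can be sketched for small orders. The specific 4x4 matrix `H4`, the use of a Kronecker product to reach order 16, and all function names are illustrative choices rather than the construction used in the experiments; the code assumes the matrix order divides the dimension and that no gradient entry is zero.

```python
import random

# A Regular Hadamard Matrix of order 4: entries in {-1, 1}, orthogonal rows, row sums 2.
H4 = [[1, 1, 1, -1],
      [1, 1, -1, 1],
      [1, -1, 1, 1],
      [-1, 1, 1, 1]]

def kronecker(a, b):
    """Kronecker product of two square matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def hadamard_perturbations(grad, H, rng=None):
    """One vector per row of H: tile the row over a (randomly permuted) coordinate
    order, then multiply componentwise with sign(grad)."""
    d, k = len(grad), len(H)
    assert d % k == 0, "this sketch assumes the matrix order divides the dimension"
    signs = [sign(g) for g in grad]
    perm = list(range(d))
    if rng is not None:
        rng.shuffle(perm)  # random permutation of the coordinates
    vectors = []
    for row in H:
        r = [0] * d
        for pos, coord in enumerate(perm):
            r[coord] = row[pos % k] * signs[coord]
        vectors.append(r)
    return vectors
```

With `H16 = kronecker(H4, H4)` and a 32-dimensional gradient, each of the 16 vectors has inner product 8 with the signed gradient, which equals the dimension divided by the square root of the number of vectors. Scaling every vector by the perturbation budget then gives perturbations of the desired norm.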
In Section 3.3, we show that adversarial training introduces spurious curvature artifacts in the model’s loss function around data points. As a result, oneshot attack strategies based on firstorder approximations of the model loss produce perturbations that are nonadversarial. In Figures 4 and 5 we show further illustrations of this phenomenon for the Inception v3 model trained on ImageNet by Kurakin et al. (2017b) as well as for the model A we trained on MNIST.