Efficient Adversarial Training with Transferable Adversarial Examples

12/27/2019 ∙ by Haizhong Zheng, et al. ∙ University of Michigan

Adversarial training is an effective defense method to protect classification models against adversarial attacks. However, one limitation of this approach is that it can require orders of magnitude additional training time due to the high cost of generating strong adversarial examples during training. In this paper, we first show that there is high transferability between models from neighboring epochs in the same training process, i.e., adversarial examples from one epoch continue to be adversarial in subsequent epochs. Leveraging this property, we propose a novel method, Adversarial Training with Transferable Adversarial Examples (ATTA), that can enhance the robustness of trained models and greatly improve the training efficiency by accumulating adversarial perturbations through epochs. Compared to state-of-the-art adversarial training methods, ATTA enhances adversarial accuracy by up to 7.2% on CIFAR10 and requires 12-14x less training time on the MNIST and CIFAR10 datasets with comparable model robustness.


1 Introduction

State-of-the-art deep learning models for computer vision tasks have been found to be vulnerable to adversarial examples [31]. Even a small perturbation on the image can fool a well-trained model. Recent works [9, 30, 29] also show that adversarial examples can be physically realized, which can lead to serious safety issues. The design of robust models, which correctly classify adversarial examples, is an active research area [7, 17, 25, 11, 34], with adversarial training [21] being one of the most effective methods. It formulates training as a game between adversarial attacks and the model: the stronger the adversarial examples generated to train the model are, the more robust the model is.

To generate strong adversarial examples, iterative attacks [16], which use multiple attack iterations to generate adversarial examples, are widely adopted in various adversarial training methods [21, 39, 4, 32]. Since adversarial perturbations are usually bounded by a constrained space $\mathcal{S}$ and attack perturbations outside $\mathcal{S}$ need to be projected back to $\mathcal{S}$, the $k$-step projected gradient descent method [16, 21] (PGD-$k$) has been widely adopted to generate adversarial examples. Typically, using more attack iterations (a higher value of $k$) produces stronger adversarial examples [21]. However, each attack iteration needs to compute the gradient on the input, which causes a large computational overhead. As shown in Table 1, the training time of adversarial training can be orders of magnitude larger than that of natural training.

Recent works [24, 20, 23] show that adversarial examples can be transferred between models: adversarial examples generated for one model can still stay adversarial to another model. The key insight in our work, which we experimentally verified, is that, because of high transferability between models (i.e., checkpoints) from neighboring training epochs, attack strength can be accumulated across epochs by repeatedly reusing the adversarial perturbation from the previous epoch.

We take advantage of this insight to design a novel adversarial training method called ATTA (Adversarial Training with Transferable Adversarial examples) that can be significantly faster than state-of-the-art methods while achieving similar model robustness. In traditional adversarial training, when a new epoch begins, the attack algorithm generates adversarial examples from natural images, which ignores the fact that the perturbations from the previous epoch can be reused effectively. In contrast, we show that it can be advantageous to reuse these adversarial perturbations through epochs. Even using only one attack iteration to generate adversarial examples, ATTA-1 can still achieve robustness comparable to traditional multi-step PGD training.

We apply our technique to Madry's Adversarial Training method (MAT) [21] and TRADES [39] and evaluate the performance on both the MNIST and CIFAR10 datasets. Compared with the traditional PGD attack, our method improves training efficiency by more than an order of magnitude on both MNIST and CIFAR10 with comparable model robustness. Trained with ATTA, the adversarial accuracy of MAT can be improved by up to 7.2%. Notably, for MNIST, with one attack iteration, our method can achieve 96.31% adversarial accuracy in about 5 minutes (297 seconds in Table 3). For CIFAR10, compared to MAT, whose training time is more than one day, our method achieves comparable adversarial accuracy in about 2 hours (134 minutes in Table 4).

We also compare our method with two fast adversarial training methods, YOPO[37] and Free[27]. Evaluation results show that our method is both more effective and efficient than these two methods.

Contribution. To the best of our knowledge, our work is the first to enhance the efficiency and effectiveness of adversarial training by taking advantage of high transferability between models from different epochs. In summary, we make the following contributions:

  • We are the first to reveal the high transferability between models of neighboring epochs in adversarial training. With this property, we verify that the attack strength can be accumulated across epochs by reusing adversarial perturbations from the previous epoch.

  • We propose a novel method (ATTA) for iterative attack based adversarial training with the objectives of both efficiency and effectiveness. It can generate the same (or even stronger) adversarial examples with much fewer attack iterations via accumulating adversarial perturbations through epochs.

  • Evaluation results show that, with comparable model robustness, ATTA is more than an order of magnitude faster than traditional adversarial training methods on both MNIST and CIFAR10. ATTA can also enhance model adversarial accuracy by up to 7.2% for MAT on CIFAR10.

2 Adversarial training and transferability

In this section, we introduce relevant background on adversarial training and transferability of adversarial examples. We also discuss the trade-off between training time and model robustness.

2.1 Adversarial Training

Adversarial training is an effective defense method to train robust models against adversarial attacks. By using adversarial attacks as a data augmentation method, a model trained with adversarial examples achieves considerable robustness. Recently, many works [21, 39, 38, 2, 14, 12, 26] have focused on analyzing and improving adversarial machine learning. Madry et al. [21] first formulate adversarial training as a min-max optimization problem:

$$\min_{f \in \mathcal{H}} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{x' \in \mathcal{S}} L(f(x'), y) \right] \qquad (1)$$

where $\mathcal{H}$ is the hypothesis space, $\mathcal{D}$ is the distribution of the training dataset, $L$ is a loss function, and $\mathcal{S}$ is the allowed perturbation space that is usually selected as an $L_p$-norm ball around $x$. The basic strategy in adversarial training is, given a natural image $x$, to find a perturbed image $x'$ that maximizes the loss with respect to correct classification. The model is then trained on the generated adversarial examples. In this work, we consider adversarial examples with a high loss to have high attack strength.

PGD-$k$ attack based adversarial training: Unfortunately, solving the inner maximization problem is hard. Iterative attack [16] is commonly used to generate strong adversarial examples as an approximate solution for the inner maximization problem of Equation 1. Since adversarial perturbations are usually bounded by the allowed perturbation space $\mathcal{S}$, PGD-$k$ ($k$-step projected gradient descent [16]) is adopted to conduct iterative attack [21, 39, 37, 27]. The PGD-$k$ adversarial attack is multi-step projected gradient descent on a negative loss function:

$$x'_{t+1} = \Pi_{\mathcal{S}}\left(x'_t + \alpha \cdot \mathrm{sign}\left(\nabla_{x'_t} L(f(x'_t), y)\right)\right)$$

In the above, $x'_t$ is the adversarial example in the $t$-th attack iteration, $\alpha$ is the attack step size, and $\Pi_{\mathcal{S}}$ is the projection function that projects adversarial examples back to the allowed perturbation space $\mathcal{S}$.
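The PGD update can be illustrated with a minimal sketch. It assumes a toy linear binary classifier with a logistic loss and an L-infinity ball of radius `eps` around the natural input; the model, the function names, and the optional `start` argument (a warm start point) are illustrative assumptions and not the networks or code used in the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss(w, b, x, y):
    """Binary cross-entropy of a linear model, with label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z

def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def pgd_attack(w, b, x, y, eps, alpha, k, start=None):
    """k signed-gradient ascent steps, each projected onto the eps ball around x."""
    x_adv = list(x) if start is None else list(start)
    for _ in range(k):
        grad = input_gradient(w, b, x_adv, y)
        # Step of size alpha in the direction of the gradient sign.
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip every coordinate back into [x - eps, x + eps].
        x_adv = [min(max(s, xi - eps), xi + eps) for s, xi in zip(stepped, x)]
    return x_adv

w, b = [1.0, -2.0, 0.5], 0.1   # hypothetical model parameters
x, y = [0.2, 0.4, 0.6], 1      # hypothetical natural example and label
for k in (0, 1, 3):
    x_adv = pgd_attack(w, b, x, y, eps=0.1, alpha=0.04, k=k)
    print(k, round(loss(w, b, x_adv, y), 4))
```

For this toy model the printed loss grows with the number of iterations, which mirrors the claim that a larger number of attack iterations yields stronger adversarial examples. Clipping to a valid pixel range is omitted for brevity.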

With a higher value of $k$ (more attack iterations), PGD-$k$ can generate adversarial examples with higher loss [21]. However, there is a trade-off between training time and model robustness in adversarial training. On one hand, since each attack iteration needs to calculate the gradient for the input, using more attack iterations requires more time to generate adversarial examples, thus causing a large computational overhead for adversarial training. As shown in Table 1, compared to natural training, adversarial training may need substantially more training time until the model converges. Most of the training time is consumed in generating adversarial examples (attack time). On the other hand, reducing the number of attack iterations can reduce the training time, but that negatively impacts the robustness of the trained model (Table 2).

| Dataset | Natural training | Adversarial training: training | Adversarial training: attack | Adversarial training: total |
|---|---|---|---|---|
| MNIST (sec) | | | | |
| CIFAR10 (min) | | | | |

Table 1: Training time of natural training and adversarial training. The attack column shows the time consumed in adversarial example generation.

| Defense | PGD- | PGD- | PGD- | PGD- |
|---|---|---|---|---|
| Nat. Acc. | | | | |
| Adv. Acc. | | | | |
| Time (min) | | | | |

Table 2: The relation between the number of attack iterations and adversarial accuracy against PGD attack on the CIFAR10 dataset.

2.2 Transferability of Adversarial Examples

Szegedy et al. [31] show that adversarial examples generated for one model can stay adversarial for other models. This property is called transferability, and it is usually leveraged to perform black-box attacks [23, 24, 19, 20]. To attack a targeted model $f_t$, the attacker generates transferable adversarial examples from a source model $f_s$. The higher the transferability between $f_s$ and $f_t$ is, the higher the success rate of the attack.

Substitute model training is a commonly used method to train a source model $f_s$. Rather than the benchmark label $y$, $f_s$ is trained with $f_t(x)$, the prediction result of the targeted model [24], to achieve a higher black-box attack success rate. While our work does not use black-box attacks, we do rely on an intuition similar to the one behind substitute model training, namely, that two models with similar behavior and decision boundaries are likely to have higher transferability between each other. We use this intuition to show high transferability between models from neighboring training epochs, as discussed in the next section.

3 Attack strength accumulation

In this section, we first conduct a study and find that models from neighboring epochs show very high transferability and are naturally good substitute models for each other. Based on this observation, we design an accumulative PGD-$k$ attack that accumulates attack strength by reusing adversarial perturbations from one epoch to the next. Compared to the traditional PGD-$k$ attack, the accumulative PGD-$k$ attack achieves much higher attack strength with fewer attack iterations in each epoch.

3.1 High transferability between epochs

Transferability between models in different training epochs of the same training program has not been studied. Because fluctuations of model parameters between different epochs are very small, we think that they are likely to have similar behavior and similar decision boundaries, which should lead to high transferability between these models.

To evaluate the transferability between models from training epochs, we adversarially train a model $T$ as the targeted model, while saving intermediate models at the end of the three immediately prior epochs as $T_{-1}$, $T_{-2}$, and $T_{-3}$, with $T_{-1}$ being the model from the epoch immediately prior to $T$. We measure the transferability of adversarial examples of each $T_{-i}$ with $T$. For comparison, we also train three additional models $S_1$, $S_2$, and $S_3$ with exactly the same training method as $T$ but with different random seeds, and we measure the transferability between each $S_i$ and $T$.

To measure transferability from the source model to the targeted model, we use two metrics. The first metric is error rate transferability used in [1, 23], which is the ratio of the number of adversarial examples misclassified by source model to that of the targeted model. The other metric is loss transferability, which is the ratio of the loss value caused by adversarial examples on the source model to the loss value caused by the same examples on the targeted model.
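The two metrics can be sketched as simple ratios. The function names below are hypothetical, and the orientation of the ratio (target over source, so that a value near 1 means the examples are about as harmful to the targeted model as to the source model) is an assumption; invert the ratio if the opposite reading is intended.

```python
def error_rate_transferability(source_preds, target_preds, labels):
    """Ratio of adversarial examples misclassified by the target to those
    misclassified by the source (orientation is an assumption)."""
    src_err = sum(p != y for p, y in zip(source_preds, labels))
    tgt_err = sum(p != y for p, y in zip(target_preds, labels))
    if src_err == 0:
        raise ValueError("source model misclassifies no adversarial example")
    return tgt_err / src_err

def loss_transferability(source_losses, target_losses):
    """Ratio of the total loss on the target to the total loss on the source
    for the same adversarial examples (orientation is an assumption)."""
    total_src = sum(source_losses)
    if total_src == 0:
        raise ValueError("source loss is zero")
    return sum(target_losses) / total_src

# Hypothetical predictions on four adversarial examples crafted on the source.
labels = [0, 1, 2, 3]
source_preds = [1, 0, 0, 3]   # source model is fooled on three examples
target_preds = [1, 1, 0, 3]   # target model is fooled on two of them
print(error_rate_transferability(source_preds, target_preds, labels))
print(loss_transferability([2.0, 1.0, 3.0], [1.0, 1.0, 1.0]))
```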

Figure 1: Error rate transferability and loss transferability with the different source models on (a) MNIST and (b) CIFAR10.

We conduct experiments on both the MNIST and CIFAR10 datasets, and the results are shown in Figure 1. We find that, compared to the baseline models trained with different random seeds, the models from neighboring epochs of the targeted model have higher transferability on both metrics, and both metrics are high for all of the neighboring-epoch models. This provides strong empirical evidence that adversarial examples generated in one epoch still retain some strength in subsequent epochs.

Inspired by the above result, we state the following hypothesis. Hypothesis: Repeatedly reusing perturbations from the previous epoch can accumulate attack strength epoch by epoch. Compared to current methods that iterate from natural examples in each epoch, this can allow us to use fewer attack iterations to generate equally strong adversarial examples.

3.2 Accumulative PGD- attack

To validate the aforementioned hypothesis, we design an accumulative PGD-$k$ attack. As shown in Figure 2(b), we longitudinally connect models in each epoch by directly reusing the attack perturbation of the previous epoch. The accumulative PGD-$k$ attack (Figure 2(b)) first generates adversarial examples for the earliest model in the chain. Then, for the following epochs, the attack is performed based on the accumulated perturbations from previous epochs.

Figure 2: The traditional PGD-$k$ attack (a) and the accumulative PGD-$k$ attack through epochs (b). The PGD-$k$ attack is performed on the model in the red rectangle to get adversarial examples.

To compare the attack strength of the two attacks, we use Madry's method [21] to adversarially train two models on MNIST and CIFAR10 and evaluate the loss value of adversarial examples generated by the two attacks. Figure 3 summarises the evaluation result. We find that, with more epochs involved in the attack, the accumulative PGD-$k$ attack can achieve a higher loss value with the same number of attack iterations $k$.

In particular, when adversarial examples are transferred through a large number of epochs, even an accumulative PGD attack with a small $k$ can cause a high attack loss. On both MNIST and CIFAR10, once enough epochs are involved, an accumulative PGD attack with few iterations achieves the same attack loss as a traditional PGD attack with more iterations.

Figure 3: Given the number of epochs $m$, the relationship between the number of attack iterations $k$ and the attack loss on (a) MNIST and (b) CIFAR10. $m = 1$ stands for the traditional PGD-$k$ attack.

This result indicates that, with high transferability between epochs, adversarial perturbations can be reused effectively, which allows us to use fewer attack iterations to generate the same or stronger adversarial examples. Reuse of perturbations across epochs can help us reduce the number of attack iterations in PGD-$k$, leading to more efficient adversarial training. The next section describes our proposed algorithm, ATTA (Adversarial Training with Transferable Adversarial examples), based on this property.

4 Adversarial training with transferable adversarial examples

The discussion on transferability in Section 3 suggests that adversarial examples can retain attack strength in subsequent training epochs. The results of the accumulative attack in Section 3 suggest that stronger adversarial examples can be generated by accumulating attack strength. This property inspires us to link adversarial examples between adjacent epochs as shown in Figure 4. Instead of starting from a natural image to generate an adversarial example in each epoch, as shown in Figure 4(a), we start from the adversarial example saved in the previous epoch (Figure 4(b)). To improve the transferability between epochs, we use a connection function $C$ to link adversarial examples, which transforms the adversarial example of one epoch into a start point for the next epoch. During training, by repeatedly reusing adversarial examples between epochs, attack strength can be accumulated epoch by epoch:

$$x'_{i} = \mathcal{A}\left(f_i, x, y, C(x'_{i-1})\right)$$

where $\mathcal{A}$ is the attack algorithm, $C$ is a connection function (described in the next section), $f_i$ is the model in the $i$-th epoch, $x'_i$ is the adversarial example generated in the $i$-th epoch, and $x$, $y$ are the natural image and benchmark label. (Note that $x'_i$ is still bounded by the natural image $x$, rather than by $C(x'_{i-1})$.)

Figure 4: Traditional adversarial training (PGD-$k$) (a) and ATTA-$k$ (b). $C$ is the connection function which improves the transferability between epochs.

As shown in the previous section, adversarial examples can achieve high attack strength as they are transferred across epochs via the above linking process, rather than starting from natural images. This, in turn, should allow us to train a more robust model with fewer attack iterations.

4.1 Connection function design

Designing connection function can help us overcome two challenges that we encountered in achieving high transferability of adversarial examples from one epoch to the next:

  1. Data augmentation problem: Data augmentation is a commonly used technique to improve the performance of deep learning. It applies random transformations to the original images so that models can be trained with various images in different epochs. This difference can cause a mismatch of images between epochs. Since the perturbation is calculated with the gradient of the image, if we directly reuse the perturbation, the mismatch between the reused perturbation and the newly augmented image can cause a decrease in attack strength. Simply removing data augmentation also hurts the robustness. We experimentally show the negative influence of these two alternatives in Section 5.3.

  2. Drastic model parameter change problem: As discussed in Section 3, similar parameters between models tend to cause similar decision boundaries and thus high transferability. Unfortunately, model parameters tend to change drastically at the early stages of training. Thus, adversarial perturbations in early epochs tend to be useless for subsequent epochs.

Figure 5: The workflow of inverse data augmentation.

Overcoming challenges: Inverse data augmentation. To address the first issue, we propose a technique called inverse data augmentation so that adversarial examples retain high transferability between training epochs despite data augmentation. Figure 5 shows the workflow with inverse data augmentation. Some transformations (like cropping and rotation) pad augmented images with background pixels. To transfer a perturbation on background pixels, our method transfers the padded image $x_{pad}$ rather than the standard image, so that the background pixels used by data augmentation can be covered by this padding.

After the adversarial perturbation $\delta_{aug}$ of the augmented image $x_{aug}$ is generated, we can perform the inverse transformation $T^{-1}$ to calculate the inversed perturbation $T^{-1}(\delta_{aug})$ on the padded image $x_{pad}$ (in adversarial training, most data augmentation methods used are linear transformations, which are easy to invert). By adding $T^{-1}(\delta_{aug})$ to $x_{pad}$, we can store and transfer all perturbation information to the next epoch. (Note that, when we perform the adversarial attack on the augmented image $x_{aug}$, the perturbation is still bounded around the augmented natural image rather than around the transferred adversarial start point.)
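The inverse step can be sketched for the common case of a random crop on a padded image followed by an optional horizontal flip. This is a minimal sketch under stated assumptions: images are nested lists, the stored perturbation has the size of the padded image, and the helper names (`augment`, `inverse_augment`) are hypothetical rather than taken from the authors' code.

```python
def crop(img, top, left, h, w):
    """Return the h x w window of img whose top-left corner is (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def hflip(img):
    """Flip an image horizontally."""
    return [row[::-1] for row in img]

def augment(padded, top, left, h, w, flip):
    """Forward transformation T: crop a window from the padded array, then flip."""
    out = crop(padded, top, left, h, w)
    return hflip(out) if flip else out

def inverse_augment(delta_aug, delta_pad, top, left, flip):
    """Inverse transformation: undo the flip, then write the perturbation back
    into the cropped window of the stored padded perturbation."""
    patch = hflip(delta_aug) if flip else delta_aug
    out = [row[:] for row in delta_pad]  # copy, leave the stored input intact
    for i, row in enumerate(patch):
        out[top + i][left:left + len(row)] = row
    return out

# A 4x4 padded perturbation (initially zero) and a 2x2 perturbation computed
# on an augmented view that was cropped at (1, 2) and flipped.
delta_pad = [[0.0] * 4 for _ in range(4)]
delta_aug = [[1.0, 2.0], [3.0, 4.0]]
new_pad = inverse_augment(delta_aug, delta_pad, top=1, left=2, flip=True)
for row in new_pad:
    print(row)
```

Applying the same crop and flip to the stored perturbation returns the perturbation of the augmented view, so the round trip loses nothing inside the window. Pixels outside the window keep whatever perturbation was stored for them earlier, which is a design choice of this sketch.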

Periodically reset perturbation. To solve the second issue, we propose a straightforward but effective solution: our method resets the perturbation and lets adversarial perturbations be accumulated from the beginning periodically, which mitigates the impact caused by early perturbations.

4.2 Attack Loss

Various attack loss functions are used in different adversarial training algorithms. For example, TRADES [39] uses the robustness loss (Equation 2) as the loss to generate adversarial examples:

$$L(f(x'), f(x)) \qquad (2)$$

where $L$ is the loss function, $f$ is the model, and $x$ and $x'$ are the natural and adversarial examples, respectively. This loss represents how much the prediction on the adversarial example diverges from the prediction on the natural image. Zhang et al. [39] show that this loss has better performance in the TRADES algorithm.

In our method, we use the following loss function:

$$L(f(x'), y) \qquad (3)$$

It represents how much the prediction on the adversarial example diverges from the benchmark label.

The objective ($f(x)$) of Equation 2 for the adversarial attack varies across epochs, which may weaken the transferability between epochs. Equation 3 applies a fixed objective ($y$), which does not have this concern. In addition, Equation 3 has a smaller computational graph, which can reduce computational overhead during training.
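The difference between the two attack losses can be made concrete with a small sketch. It assumes that the loss in Equation 2 is a KL divergence between the two predicted distributions and that the loss in Equation 3 is the cross-entropy with the label; the logits and function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def robustness_loss(probs_adv, probs_nat):
    """Equation 2 style loss: KL divergence from the natural prediction f(x)
    to the adversarial prediction f(x'). The target f(x) depends on the model."""
    return sum(p * math.log(p / q) for p, q in zip(probs_nat, probs_adv) if p > 0)

def label_loss(probs_adv, label):
    """Equation 3 style loss: cross-entropy between f(x') and the fixed label y."""
    return -math.log(probs_adv[label])

# Hypothetical logits of one model on a natural and an adversarial input.
p_nat = softmax([2.0, 0.5, -1.0])
p_adv = softmax([0.3, 0.9, -0.5])
print(round(robustness_loss(p_adv, p_nat), 4))
print(round(label_loss(p_adv, 0), 4))
```

When the model is updated, `p_nat` changes as well, so the target of the first loss moves from epoch to epoch, whereas the label used by the second loss stays fixed.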

The overall training method is described in Algorithm 1.

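Input: padded training dataset $D$, model $f$, attack algorithm $\mathcal{A}$, perturbation bound $\epsilon$, the number of epochs to reset perturbation $r$
Initialize the parameters of $f$
Initialize $D_{adv}$ by cloning $D$
for epoch $= 1, \dots, N$ do
     for $x_{adv}$ in $D_{adv}$ and corresponding $(x, y)$ in $D$ do
         if epoch mod $r = 0$ then
              $\delta \leftarrow$ a small random perturbation
              $x_{adv} \leftarrow x + \delta$
         end if
         Store the transformation $T$ for the inverse augmentation:
         $x_{aug}, x_{adv,aug}, T \leftarrow$ augment $x$ and $x_{adv}$ with the same random transformation
         $x'_{aug} \leftarrow \mathcal{A}(f, x_{aug}, y, x_{adv,aug}, \epsilon)$
         Update $f$ with the training loss on $(x'_{aug}, y)$
         $x_{adv} \leftarrow x + T^{-1}(x'_{aug} - x_{aug})$
     end for
end for
Algorithm 1 Adversarial Training with Transferable Adversarial Examples (ATTA)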
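The training loop can be sketched end to end on a toy problem. This is a minimal sketch under several assumptions that are not from the paper: a linear logistic classifier instead of a deep network, a single signed-gradient attack step per epoch, no data augmentation (so the connection function reduces to the identity), plain SGD, and hypothetical names and hyper-parameters throughout.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    """Probability of class 1 under a linear model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def attack_step(w, b, x, y, start, eps, alpha):
    """One signed-gradient step from `start`, projected onto the eps ball around x."""
    p = predict(w, b, start)
    grad = [(p - y) * wi for wi in w]
    moved = [s + alpha * ((g > 0) - (g < 0)) for s, g in zip(start, grad)]
    return [min(max(m, xi - eps), xi + eps) for m, xi in zip(moved, x)]

def atta_train(data, epochs, eps, alpha, lr, reset_period, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    adv = [list(x) for x, _ in data]  # stored adversarial examples, cloned from data
    for epoch in range(epochs):
        for i, (x, y) in enumerate(data):
            if epoch % reset_period == 0:
                # Periodic reset: restart from a small random perturbation
                # (the 0.1 * eps scale is an assumption of this sketch).
                adv[i] = [xi + rng.uniform(-0.1 * eps, 0.1 * eps) for xi in x]
            # Continue the attack from the example stored in the previous epoch.
            adv[i] = attack_step(w, b, x, y, adv[i], eps, alpha)
            # SGD step on the adversarial example.
            p = predict(w, b, adv[i])
            w = [wi - lr * (p - y) * ai for wi, ai in zip(w, adv[i])]
            b -= lr * (p - y)
    return w, b, adv

# Hypothetical, linearly separable toy data: (features, label).
data = [([1.0, 1.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], 0), ([-1.5, -2.0], 0)]
w, b, adv = atta_train(data, epochs=20, eps=0.3, alpha=0.1, lr=0.5, reset_period=10)
print([round(predict(w, b, x), 3) for x, _ in data])
```

The key difference from traditional adversarial training is the line that starts each attack from `adv[i]` instead of from the natural input, so the perturbation keeps growing across epochs even though only one attack step is taken per epoch.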

5 Evaluation

In this section, we integrate ATTA with two popular adversarial training methods: Madry’s Adversarial Training (MAT) [21] and TRADES [39]. By evaluating the training time and robustness, we show that ATTA can provide a better trade-off than other adversarial training methods. To understand the contribution of each component to robustness, we conduct an ablation study.

5.1 Setup

Following the literature [21, 39, 37], we use both the MNIST [17] and CIFAR10 [15] datasets to evaluate ATTA.

For the MNIST dataset, the model has four convolutional layers followed by three fully-connected layers, which is the same architecture as used in [21, 39]. The adversarial perturbation is bounded by an $L_\infty$ ball of size $\epsilon$.

For the CIFAR10 dataset, we use the wide residual network WRN-34-10 [36], which is the same as in [21, 39]. The perturbation is bounded by an $L_\infty$ ball of size $\epsilon$.

5.2 Efficiency and effectiveness of ATTA

In this part, we evaluate the training efficiency and the robustness of ATTA, comparing it to state-of-the-art adversarial training algorithms. To better verify effectiveness of ATTA, we also evaluate ATTA under various attacks.

5.2.1 Training efficiency

We select four state-of-the-art adversarial training methods as baselines: MAT [21], TRADES [39], YOPO [37] and Free [27]. For both MNIST and CIFAR10, models trained with ATTA achieve comparable robustness with more than ten times higher training efficiency (Tables 3 and 4). Compared to MAT trained with PGD, our method improves adversarial accuracy by up to 7.2% while still training faster.

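| Defense | Training method | Natural | PGD-40 | Time (sec) |
|---|---|---|---|---|
| MAT | PGD-1 | | | |
| MAT | PGD-40 | 99.37% | 96.21% | 3933 |
| MAT | YOPO-- | | | |
| MAT | ATTA-1 | 99.45% | 96.31% | 297 |
| MAT | ATTA-40 | | | |
| TRADES | PGD-1 | | | |
| TRADES | PGD-40 | 98.89% | 96.54% | 6544 |
| TRADES | ATTA-1 | 99.03% | 96.10% | 460 |
| TRADES | ATTA-40 | | | |

Table 3: The results of different attacks on the MNIST dataset.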

MNIST. The experiment results on MNIST are summarised in Table 3. For MAT, to achieve comparable robustness, ATTA is about 13 times faster than the traditional PGD training method (297 versus 3933 seconds). Even with one attack iteration, the model trained with ATTA-1 achieves 96.31% adversarial accuracy within 297 seconds. For TRADES, we get a similar result. With one attack iteration, ATTA-1 achieves 96.10% adversarial accuracy within 460 seconds, which is about 14 times faster than TRADES (PGD-40). Compared to another fast adversarial training method, YOPO, ATTA-1 is also faster and achieves higher robustness.

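| Defense | Training method | Natural | PGD-20 | Time (min) |
|---|---|---|---|---|
| MAT | PGD-1 | | | |
| MAT | PGD-3 | | | |
| MAT | PGD-10 | 87.49% | 47.07% | 2027 |
| MAT | Free() | | | |
| MAT | YOPO-- | | | |
| MAT | ATTA-1 | 85.71% | 50.96% | 134 |
| MAT | ATTA-3 | | | |
| MAT | ATTA-10 | | | |
| TRADES | PGD-1 | | | |
| TRADES | PGD-3 | | | |
| TRADES | PGD-10 | 84.13% | 56.6% | 2028 |
| TRADES | YOPO-- (see note) | | | - |
| TRADES | ATTA-1 | | | |
| TRADES | ATTA-3 | 84.23% | 56.36% | 320 |
| TRADES | ATTA-10 | | | |

Note: The author-implemented YOPO-- can't converge in our experiment. We pick the accuracy data from the YOPO paper.

Table 4: The results of different attacks on the CIFAR10 dataset.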


CIFAR10. We summarise the experiment results on CIFAR10 in Table 4. For MAT, compared to PGD-10, ATTA-1 achieves 3.89% higher adversarial accuracy (50.96% versus 47.07%) with about 15 times the training efficiency (134 versus 2027 minutes), and with more attack iterations ATTA improves adversarial accuracy by up to 7.2% while still training faster. For TRADES, ATTA-3 achieves adversarial accuracy comparable to PGD-10 (56.36% versus 56.6%) while training about 6 times faster (320 versus 2028 minutes). Compared with YOPO and Free, for MAT, our method is faster than both and achieves better adversarial accuracy.

To better understand the performance of ATTA, we present the trade-off between training time and robustness of different methods in Figure 6. We find that our method (indicated by the solid markers in the top-left corner) gives a better trade-off between efficiency and effectiveness in adversarial training.

Figure 6: The scatter plot presents adversarial accuracy against the PGD attack and training time of different adversarial training methods on CIFAR10.

5.2.2 Defense under other attacks

To verify the robustness of our method, we evaluate the models from the previous section under other attacks: PGD [16], FGSM [10], and CW [3].

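| Dataset | Defense | PGD | FGSM | CW |
|---|---|---|---|---|
| MNIST | M-PGD- | | | |
| MNIST | M-ATTA- | | | |
| CIFAR10 | M-PGD- | | | |
| CIFAR10 | M-ATTA- | | | |
| CIFAR10 | T-PGD- | | | |
| CIFAR10 | T-ATTA- | | | |

Table 5: The robustness comparison between ATTA and PGD under other attacks. The first 'M' and 'T' stand for MAT and TRADES, respectively.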

As shown in Table 5, models trained with ATTA are still robust to other attacks. Compared to the baselines, our method achieves comparable or better robustness under these attacks. We find that, although ATTA has robustness similar to PGD training under the PGD attack used in the previous section, it shows better robustness (higher adversarial accuracy) under a stronger PGD attack with more iterations.

5.3 Ablation study

To study the contribution of each component to robustness, we do an ablation study on inverse data augmentation and different attack loss functions.

Inverse data augmentation. To study the robustness gain of inverse data augmentation (i.d.a.), we use ATTA to adversarially train models that reuse the adversarial perturbation directly, without inverse data augmentation. As shown in Table 6, for both MAT and TRADES, models trained with inverse data augmentation achieve higher accuracy, which means that inverse data augmentation does help improve the transferability between training epochs. As discussed in Section 4.1, another alternative is to remove data augmentation. However, Table 6 shows that removal of data augmentation hurts both natural accuracy and robustness.

| Defense | Natural | PGD- |
|---|---|---|
| MAT (w/o d.a., w/o i.d.a.) | | |
| MAT (w/ d.a., w/o i.d.a.) | | |
| MAT (w/ d.a., w/ i.d.a.) | | |
| TRADES (w/o d.a., w/o i.d.a.) | | |
| TRADES (w/ d.a., w/o i.d.a.) | | |
| TRADES (w/ d.a., w/ i.d.a.) | | |

Table 6: The accuracy under PGD attack of models trained by ATTA with or without d.a. (data augmentation) and i.d.a. (inverse data augmentation).

Attack loss. Zhang et al. [39] show that, for TRADES, using Equation 2 leads to better robustness. However, we noticed that both inputs to this attack loss depend on the model $f$. Since the model is updated every epoch, compared to Equation 3, whose target $y$ is fixed, the instability of $f(x)$ and $f(x')$ may have a larger influence on transferability. To analyze the performance difference between these two attack losses in ATTA, we train two TRADES (ATTA) models with different attack losses. In Table 7, we find that Equation 3 leads to higher accuracy against the PGD attack. This result suggests that the higher stability of Equation 3 helps ATTA increase transferability between training epochs.

| Attack loss | Natural | PGD- |
|---|---|---|
| TRADES loss | | |
| MAT loss | | |

Table 7: The accuracy under PGD attack of models trained by TRADES (ATTA) with the MAT attack loss and the TRADES attack loss.

6 Related work

Adversarial training was first proposed in [16] and is formulated as a min-max optimization problem [21]. Since it is one of the most effective defense methods, many works [2, 32, 28, 4, 12, 21, 18, 37, 27] focus on enhancing either the efficiency or the effectiveness of adversarial training. YOPO [37] finds that the adversary update is mainly coupled with the first layer, so it can speed up training by updating only the first layer. Shafahi et al. [27] improve the training efficiency by recycling the gradient information computed when updating model parameters to generate adversarial examples. In [39], TRADES improves the robustness of an adversarially trained model by adding a robustness regularizer to the loss function.

Transferability of adversarial examples. Szegedy et al. [31] first describe the transferability of adversarial examples. This property is usually used to perform black-box attacks between models [24, 19, 35, 5, 20]. Liu et al. [20] show that adversarial examples generated by an ensemble of multiple models are more transferable to a targeted model.

7 Discussion

Scalability to large datasets. One downside of our method is the extra space needed to store the adversarial perturbation for each image, which may limit the scalability of ATTA when the model is trained on larger datasets (e.g., ImageNet). However, we believe this is not going to be a serious issue, since asynchronous IO pipelines are widely implemented in existing DL frameworks [6], which should allow ATTA to store perturbation data on hard disks without significantly impacting efficiency.

Transferability between training epochs. Adversarial attacks augment the training data to improve the model robustness. Our work points out that, unlike images augmented by traditional data augmentation methods, which are independent between epochs, adversarial examples generated by adversarial attacks show high transferability between epochs. We hope this finding can inspire other researchers to enhance adversarial training from a new perspective (e.g., improving transferability between epochs).

8 Acknowledgement

This work is supported by NSF Grant No.1646392.

9 Conclusion

ATTA is a new method for iterative attack based adversarial training that significantly speeds up training and improves model robustness. The key insight behind ATTA is the high transferability between models from neighboring epochs, which is first revealed in this paper. Based on this property, the attack strength in ATTA can be accumulated across epochs by repeatedly reusing adversarial perturbations from the previous epoch. This allows ATTA to generate the same (or even stronger) adversarial examples with fewer attack iterations. We evaluate ATTA and compare it with state-of-the-art adversarial training methods. It greatly shortens the training time with comparable or even better model robustness. More importantly, ATTA is a generic method and can be applied to enhance the performance of other iterative attack based adversarial training methods.

References

Appendix A Overview

This supplementary material provides details on our experiment and additional evaluation results. In Section B, we introduce the detailed setup of our experiment. In Section C, we compare adversarial examples generated by ATTA and PGD and show that, even with one attack iteration, ATTA-1 can generate perturbations similar to those of multi-step PGD on MNIST and CIFAR10. We also provide the complete evaluation results in Section C.2.

Appendix B Experiment setup

We provide additional details on the implementation, model architecture, and hyper-parameters used in this work.

MNIST. We use the same model architecture used in [21, 39, 37], which has four convolutional layers followed by three fully-connected layers. The adversarial examples used to train the model are bounded by ball with size and the step size for each attack iteration is . We do not apply any data augmentation (and inverse data augmentation) on MNIST and set the epoch period to reset perturbation as infinity which means that perturbations are not reset during the training. The model is trained for epochs with an initial learning rate and a learning rate after epochs, which is the same as [37]. To evaluate the model robustness, we perform the PGD [16], M-PGD [8] and CW [3] attack with a step size and set decay factor as for M-PGD (momentum PGD).

CIFAR10. Following other works [21, 39, 37, 27], we use Wide-Resnet-34-10 [36] as the model architecture. The adversarial examples used to train the model are bounded by ball with size . For ATTA-, we use as the step size, respectively. For ATTA- (), we use as the step size. The data augmentation used is a random flip and -pixel padding crop, which is the same as in other works [21, 39, 37, 27]. We set the epoch period for resetting perturbations to epochs. Following YOPO [37], the model is trained for epochs with an initial learning rate, a learning rate after epochs, and a learning rate after epochs. To evaluate the model robustness, we perform the PGD, M-PGD and CW attacks with a step size and set the decay factor as for M-PGD (momentum PGD).
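For readers unfamiliar with the decay factor mentioned above, the following is a rough sketch of one momentum-PGD update. It assumes the commonly used momentum iterative formulation, in which the L1-normalized gradient is accumulated into a running velocity. It is not the Adversarial Robustness Toolbox API, and the function name `mpgd_step` and all parameter names and values are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def mpgd_step(x_adv, x_nat, grad, velocity, step_size, eps, decay):
    """One momentum-PGD update under an l-infinity bound (illustrative only)."""
    # Normalize the gradient by its L1 norm before adding it to the velocity.
    l1 = sum(abs(g) for g in grad) or 1.0
    velocity = [decay * v + g / l1 for v, g in zip(velocity, grad)]
    # Move along the sign of the velocity rather than the raw gradient.
    moved = [xa + step_size * sign(v) for xa, v in zip(x_adv, velocity)]
    # Project back onto the eps-ball centered at the natural input.
    projected = [min(max(m, xn - eps), xn + eps) for m, xn in zip(moved, x_nat)]
    # Keep pixel values inside the valid [0, 1] range.
    return [min(max(p, 0.0), 1.0) for p in projected], velocity
```

A decay factor of zero reduces this update to a plain signed-gradient step, while larger values let earlier gradients keep influencing the update direction.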

For the baselines, we use the authors' implementations of MAT [21] (https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge), TRADES [39] (https://github.com/yaodongyu/TRADES), YOPO [37] (https://github.com/a1600012888/YOPO-You-Only-Propagate-Once), and Free [27] (https://github.com/ashafahi/free_adv_train) with the hyper-parameters recommended in their works, and we select as for TRADES (both ATTA and PGD).

In Section 3, which analyzes the transferability between training epochs, we use MAT with PGD- to train models and PGD- to calculate the loss value and error rate.

Each experiment is run on one idle NVIDIA GeForce RTX 2080 Ti GPU. Except for the PGD attack, we implement all attacks with the Adversarial Robustness Toolbox [22].

Appendix C Experiment details

C.1 Qualitative study on training images

To compare the quality of adversarial examples generated by PGD and ATTA, we visualize some adversarial examples generated by both methods. For MNIST, we choose the model checkpoint trained by MAT-ATTA- at epoch . For CIFAR10, we choose the model checkpoint trained by MAT-ATTA- at epoch . Figure 7 shows, for each class, the adversarial examples and perturbations used to train the model (ATTA-) and those generated by the PGD- (PGD-) attack on the MNIST (CIFAR10) model. To better visualize the perturbation, we re-scale the perturbation by calculating (where is the perturbation and is the bound of adversarial attack). This shifts the ball to the scale of .
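A minimal sketch of such a rescaling follows. It assumes a linear map that sends a perturbation bounded in [-eps, eps] to the display range [0, 1]; the function name `rescale_perturbation` and the target range are assumptions made for illustration and are not taken from the paper.

```python
def rescale_perturbation(delta, eps):
    """Map each perturbation value from [-eps, eps] to [0, 1] for display."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    # -eps maps to 0, zero perturbation maps to 0.5, and +eps maps to 1.
    return [(d + eps) / (2.0 * eps) for d in delta]
```

Under this assumed map, an unperturbed pixel appears as mid-gray, so both negative and positive perturbations remain visible in the same image.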

We find that, although ATTA-1 performs only one attack iteration in each epoch, it generates perturbations similar to those of PGD- (PGD-) on MNIST (CIFAR10). The effect of inverse data augmentation is shown in Figure 7(b). There are some perturbations on the padded pixels in the third row (ATTA-1), whereas perturbations generated by PGD- (shown in the fifth row) appear only on the cropped pixels.
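The bookkeeping behind perturbations that live on padded pixels can be sketched as follows. This is an assumed mechanism written for illustration, not the authors' code: the stored perturbation covers the padded image, the same crop window and flip used for the image are applied to it, and the updated window is written back with the flip undone. The names `crop_and_flip` and `write_back` are hypothetical, and images are plain 2D lists.

```python
def crop_and_flip(padded, top, left, size, flip):
    """Cut a size x size window out of a padded 2D grid, optionally mirrored."""
    window = [row[left:left + size] for row in padded[top:top + size]]
    return [row[::-1] for row in window] if flip else window


def write_back(padded, window, top, left, flip):
    """Inverse of crop_and_flip: undo the mirror, then paste into a copy."""
    restored = [row[::-1] for row in window] if flip else window
    out = [row[:] for row in padded]
    for r, row in enumerate(restored):
        out[top + r][left:left + len(row)] = row
    return out
```

Entries outside the current crop window keep whatever values earlier epochs wrote there, which is consistent with the observation that accumulated perturbations can appear on padded pixels.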

(a) MNIST
(b) CIFAR10
Figure 7: Visualization of natural images, adversarial examples and corresponding perturbations in each class for MNIST and CIFAR10. The first row in (a) and (b) shows the natural images. The second and third rows show the adversarial examples and perturbations generated by ATTA-. The fourth and fifth rows show the adversarial examples and perturbations generated by PGD- in (a) and PGD- in (b).

C.2 Complete evaluation results

This section presents the complete evaluation results as a supplement to Section 5.2.

We evaluate the defense methods under additional attacks; the results are shown in Table 8 and Table 9. Consistent with the conclusion stated in Section 5.2, ATTA achieves robustness comparable to other methods with much less training time, which indicates a better trade-off between training efficiency and robustness. With the same number of attack iterations, ATTA needs less time to train the model. As mentioned in Section 3.2, with the accumulation of adversarial perturbations, ATTA can use the same number of attack iterations to achieve a higher attack strength, which helps the model converge faster.

Natural accuracy vs. adversarial accuracy. In this paper, we find that higher adversarial accuracy can come at the cost of lower natural accuracy. This trade-off has been observed and explained in [33, 39]. A recent work [13] points out that the features used by naturally trained models are highly predictive but not robust, whereas adversarially trained models tend to use robust features rather than highly predictive ones, which may cause this trade-off. Table 9 also shows that, when models are trained with stronger attacks (more attack iterations), they tend to have higher adversarial accuracy but lower natural accuracy.

Defense \ Attack   Natural   PGD-40   PGD-100   M-PGD-40   FGSM   CW-20   Time (sec)
MAT PGD-1
PGD-10
PGD-40 96.21% 3933
YOPO--
ATTA-1 96.31% 297
ATTA-10
ATTA-40
TRADES PGD-1
PGD-10
PGD-40 96.54% 6544
ATTA-1 96.10% 460
ATTA-10
ATTA-40
Table 8: Results of various attacks on the MNIST dataset.
Defense \ Attack   Natural   PGD-20   PGD-100   M-PGD-20   FGSM   CW-20   Time (min)
MAT PGD-1
PGD-2
PGD-3
PGD-10 47.07% 2027
Free()
YOPO--
ATTA-1 50.96% 134
ATTA-2
ATTA-3
ATTA-10
TRADES PGD-1
PGD-2
PGD-3
PGD-10 2028
ATTA-1
ATTA-2
ATTA-3 56.36% 320
ATTA-10
Table 9: Results of various attacks on the CIFAR10 dataset.