CAT: Customized Adversarial Training for Improved Robustness

02/17/2020 ∙ by Minhao Cheng, et al. ∙ 4

Adversarial training has become one of the most effective methods for improving robustness of neural networks. However, it often suffers from poor generalization on both clean and perturbed data. In this paper, we propose a new algorithm, named Customized Adversarial Training (CAT), which adaptively customizes the perturbation level and the corresponding label for each training sample in adversarial training. We show that the proposed algorithm achieves better clean and robust accuracy than previous adversarial training methods through extensive experiments.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep neural networks (DNNs) have proved their effectiveness on a variety of domains and tasks. However, it has been found that DNNs are highly vulnerable to adversarial examples (Szegedy et al., 2014). To enhance the robustness of DNNs against adversarial examples, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) has become one of the most effective and widely used methods. Given a pre-defined perturbation tolerance, denoted as , adversarial training aims to minimize the robust loss, defined as the worst-case loss within -ball around each example, leading to a min-max optimization problem. (Madry et al., 2018) show that applying a multi-step projected gradient descent (PGD) attack to approximately solve the inner maximization leads to a robust model, and several recent research has proposed various ways to improve adversarial training (Zhang et al., 2019b; Wang, 2019; Wang et al., 2019; Balaji et al., 2019; Ding et al., 2018).

However, the standard adversarial training methods still have a hypothetical and possibly problematic assumption: the perturbation tolerance

is a large and fixed constant throughout the training process, which ignores the fact that every data point may have different intrinsic robustness. Intuitively, some examples are naturally closer to the decision boundary, and enforcing large margin on those examples will force the classifier to give up on those examples, leading to a distorted decision surface. This intuition may explain the known issue of the undesirable robustness-accuracy tradeoff in adversarial robustness

(Su et al., 2018; Tsipras et al., 2019)

. Furthermore, with a different perturbation tolerance, it is questionable whether we should still force the model to learn to fit the one-hot label as in the original adversarial training formulation. In the extreme case, if an example is perturbed to the decision boundary, a good classifier yielding the binary class prediction probabilities should output

instead of . This point becomes crucial when each example is associated with a different level of perturbation. Although some recent papers have started to address the uniform issue by treating correctly and incorrectly classified examples differently (Ding et al., 2018) or assigning non-uniform perturbation level (Balaji et al., 2019), none of them have tried to incorporate the customized training labels in this process.

Motivated by these ideas, we propose a novel Customized Adversarial Training (CAT) framework that can substantially improve the performance of adversarial training. Throughout the adversarial training process, our algorithm dynamically finds a non-uniform and effective perturbation level and the corresponding customized target label for each example. This leads to better generalization performance and furthermore, with a careful design on adaptive tuning, our algorithm has only negligible computational overhead and runs as fast as the original adversarial training algorithm. Furthermore, we theoretically explain why the proposed method could lead to improved generalization performance.

Our method significantly outperforms existing adversarial training methods on the standard CIFAR-10 defense task. With Wide-ResNet structure on CIFAR-10, under perturbation, our method achieves 73% robust accuracy under PGD attack and 71% robust accuracy under Carlini and Wagner (C&W) attack (Carlini and Wagner, 2017), while the current best model only achieves 58.6% under PGD attack and 56.8% under C&W attack. Furthermore, our method only degrades the clean accuracy from 95.93% (standard test accuracy) to 93.48%, while other adversarial training methods have clean accuracy below 91.34%.

2 Related Work

Adversarial attack.

Finding adversarial examples, also known as adversarial attacks, can be formulated as an optimization problem — the goal is to find the perturbation to maximize the (robust) loss, while constraining to have small norm (e.g., norm). Therefore gradient-based algorithms have been widely used, such as fast gradient sign method (FGSM) (Goodfellow et al., 2015), C&W attack (Carlini and Wagner, 2017) and PGD attack (Madry et al., 2018). In addition to white-box attacks, it has been also found that adversarial attacks can be generated also in the soft-label black box setting (Chen et al., 2017; Ilyas et al., 2018) and hard-label black box setting (Brendel et al., 2017; Cheng et al., 2018, 2020), and with similar quality to white-box attacks. Moreover, physical attacks have been proposed to generate adversarial examples in the real world (Eykholt et al., 2018). Therefore, with the existence of these powerful adversarial attacks, how to enhance the robustness of neural network models has become an important issue in many real world applications.

Adversarial training.

To enhance the adversarial robustness of a neural network model, a natural idea is to iteratively generate adversarial examples, add them back to the training data, and retrain the model. For example, Goodfellow et al. (2015) use adversarial examples generated by FGSM to augment the data, and Kurakin et al. (2017) propose to use a multiple-step FGSM to further improve the performance. Madry et al. (2018) show that adversarial training can be formulated as a min-max optimization problem, and propose to use PGD attack (similar to multi-step FGSM) to find adversarial examples for each batch. The resulting method achieves notable successes and can survive even under strong attacks (Athalye et al., 2018). After that, many defense algorithms are based on a similar min-max framework. Zhang et al. (2019b) propose TRADES, a theoretically-driven upper bound minimization algorithm to achieve the top-1 rank in NeurIPS 2018 defense competition. Recently, Ding et al. (2018) notice the importance of misclassified examples and treat correctly classified and misclassified examples differently. Wang (2020) use label prediction probability as a smooth way to combine correctly and misclassified samples. Other than just adding adversarial examples into the training process, Wang (2019) find that it is also effective to find the “adversarial label” along with the “adversarial perturbation”. The convergence of adversarial training has also been studied (Gao et al., 2019; Wang et al., 2019). Recently, to reduce the computational overhead brought by adversarial training, several works have been proposed (Shafahi et al., 2019; Zhang et al., 2019a; Wong et al., 2020). It is widely recognized that the current defensive models are still not ideal and have considerable room for improvement. Moreover, to make robust models useful in practice, it is crucial that both clean and robust error need to be further enhanced.

Other adversarial defenses

In addition to adversarial training based methods, a wide range of defense methods have been proposed such as Gaussian data augmentation (Zantedeschi et al., 2017), randomized smoothing (Liu et al., 2018; Cohen et al., 2019), Mixup (Zhang et al., 2018) and its variants (Thulasidasan et al., 2019; Verma et al., 2018), and Label smoothing (Shafahi et al., 2018; Goibert and Dohmatob, 2019). Shafahi et al. (2018) find that it could achieve similar robust accuracy with adversarial training when combining Gaussian data augmentation and label-smoothing.

However, some of the aforementioned methods have been shown to cause obfuscated gradients instead of enhanced robustness (Athalye et al., 2018), while adversarial training based methods are still shown to be robust under different kinds of adversarial attacks.

3 Proposed Method

3.1 Preliminaries

Adversarial training can be formulated as a min-max optimization problem. For a -class classification problem, let denote the set of training samples in the dataset with . Let denote a classification model parameterized by . We denote by as the prediction output for each class, i.e., . We use standard notation to hide universal constant factor, and to indicate .

Adversarial training can be formulated as:


where denotes the -norm ball centered at with radius . The inner maximization problem aims to find an adversarial version of a given data point that achieves a high loss. In general one can define based on the threat model, but the ball is the most popular choice adopted by recent works (Madry et al., 2018; Zhang et al., 2019b; Wang, 2019; Ding et al., 2018; Wang, 2020), which will also be used in this paper. For a deep neural network model, the inner maximization does not have a closed form solution, so adversarial training methods typically use a gradient-based iterative solver to approximately solve the inner problem. The most commonly used choice is the multi-step PGD (Madry et al., 2018) and C&W attack (Carlini and Wagner, 2017).

(a) Standard training
(b) Adv-train with
(c) Adv-train with
(d) CAT (ours) with
Figure 5: Different training methods on a linearly separable binary classification dataset with margin for both classes. Adversarial training with small works fine, but for a large beyond the true margin, adversarial training would ruin the classifier’s classification performance, while our proposed adaptive customized adversarial training method still keeps a good generalization performance.

3.2 Motivation

Intuitively, if adversarial training can always find a model with close-to-zero robust error, one should always use a large for training because it will automatically imply robustness to any smaller . Unfortunately, in practice a uniformly large is often harmful. In the following we empirically explain this problem and use it to motivate our proposed algorithm.

We use a simple linear classification case to demonstrate why a uniformly large is harmful. In Figure (a)a, we generate a synthetic linearly separable dataset with the margin set to be 1.75 for both classes, and the correct linear boundary can be easily obtained by standard training. In Figure (b)b, we run adversarial training with , and since this is smaller than the margin, the algorithm can still obtain near-optimal results. However, when we use a large for adversarial training in Figure (c)c, the resulting decision boundary becomes significantly worse. It is because adversarial training cannot correctly fit all the samples with a margin up to 4, so it will sacrifice some data samples, leading to distorted and undesirable decision boundary. This motivates the following two problems:

  • We shouldn’t set the same large uniformly for all samples. Some samples are intrinsically closer to the decision boundary and they should use a smaller . Without doing this, adversarial training will give up on those samples, which leads to worse training and generalization error (see more discussions in Section 3.5 on the generalization bounds).

  • The adversarial training loss is trying to force the prediction to match the one-hot label (e.g., in the binary classification case) even after large perturbations. However, if a sample is perturbed, the prediction shouldn’t remain one-hot. For instance, if a sample is perturbed to the decision boundary, the prediction of a perfect model should be instead of

    . This also makes adversarial training fail to recover a good decision hyperplane.

Furthermore, we observe that even if adversarial training can obtain close-to-zero training error with large (e.g., (Gao et al., 2019) proves that this will happen for overparameterized network with large-enough margin), a uniformly large will lead to larger generalization gap. This could be partially explained by the theoretical results provided by (Yin et al., 2018), which shows that the adversarial Rademacher complexity has a lower bound with an explicit dependence on the perturbation tolerance. The empirical results in Table 1 also illustrate this problem. When conducting adversarial training with on CIFAR10 VGG-16, we found that the model achieves close-to-zero robust training error on all , but it suffers larger generalization gap compared to training with smaller . This also demonstrates that a uniformly large is harmful even when it achieves perfect training error.

Testing Error Type Training
0.01 0.02 0.03
0.01 Train 99.96% 99.99% 99.16%
Test 69.79% 69.06% 66.04%
Table 1: The influence of different fixed values used in adversarial training on the robust accuracy with .

CAT (Customized Adversarial Training)

We propose the Customized Adversarial Training (CAT) framework that improves adversarial training by addressing the above-mentioned problems. First, our algorithm has an auto-tuning method to customize the used for each training example. Second, instead of forcing the model to fit the original label, we customize the target label for each example based on its own . In the following we will describe these two components in more detail.

3.3 Auto-tuning for adversarial training

The first component of our algorithm is an auto-tuning tuning method which adaptively assigns a suitable for each sample during the adversarial training procedure. Let be the perturbation level assigned to example . Based on the intuition mentioned in Section 3.2, we do not want to further increase if we find the classifier does not have capacity to robustly classify the example, which means we should set


and the adversarial training objective becomes


Note that in (2) depends on while in (3) also depends on . We thus propose to conduct alternative updates — conducting one SGD update on , and then update the in the current batch. However, finding exactly requires brute-force search for every possible value, which adds significant computational overhead to adversarial training. Therefore, we only conduct a simplified update rule on as follows. Starting from an initial perturbation level of zero, at each iteration we conduct adversarial attack (e.g., PGD attack) with perturbation tolerance where is a constant. If the attack is successful, then we keep the current , while if the attack is unsuccessful, which means an attacker still cannot find an adversarial example that satisfies , then we increase . The attack results will also be used to update the model parameter , so this adaptive scheme does not require any additional cost. In practice, we also have an upper bound on the final perturbation to make sure each individual will not be too large.

3.4 Adaptive label uncertainty for adversarial training

As mentioned in Section 3.2, the standard adversarial training loss is trying to enforce a sample being classified as the original one-hot label after perturbation. However, this may not be ideal. In the extreme case, if a sample is perturbed to the decision boundary, the prediction must be far away from one-hot. This problem is more severe when using non-uniform , since each different will introduce a different bias to the loss, and that may be one of the reasons that purely adaptive -scheduling does not work well (see our ablation study in Section 4.4 and also the results reported in (Balaji et al., 2019)).

In the following, we propose an adaptive label smoothing approach to reflect different perturbation tolerance on each example. Szegedy et al. (2016)

introduced label smoothing that converts one-hot label vectors into one-warm vectors representing low-confidence classification, in order to prevent the model from making over-confident predictions. Specifically, with a one-hot encoded label

, the smoothed version is


is the hyperparameter to control the smoothing level. In the adaptive setting, we set

so that a larger perturbation tolerance would receive a higher label uncertainty and is a hyperparameter. A common choice of

is the uniform distribution

. To further prevent over-fitting and improve the generalization, we use where Dirichlet() refers to the Dirichlet distribution and is an all one vector. With different perturbation tolerance, the adaptive version of label smoothing is


The final objective function: Combining the two aforementioned two main techniques, our Customized Adversarial Training (CAT) method attempts to minimize the following objective:


where is defined in (4). As described in Section 3.3, we approximately minimize this objective with an alternative update scheme, which encounters almost no additional cost compared to the original adversarial training algorithm. The detailed algorithm is shown in Algorithm 1.

Choice of loss function.

In general, our framework can be used with any loss function

. In the previous works, cross entropy loss is commonly used for . However, the model trained by smoothing techniques tends to have better performance against PGD attack than CW attack (see the VGG experiments in Figure 10. So in addition to testing our algorithm under cross entropy loss, we also propose a mixed loss to enhance the defense performance towards CW attack. That is,



is the final (logit) layer output, and

is the prediction score for the i-th class and is the original label. The parameter encourages to find an adversary that will not classified as class with high confidence.

  Input: Training dataset , cross entropy loss or mix loss , scheduling parameter , weighting factor , perturbation upperbound
  Initial every sample’s with 0


     for i= do
        for  do
        end for
        if  then
        end if
     end for
  end for
Algorithm 1 CAT algorithm

3.5 Theoretical Analysis

To better understand how our scheme improves generalization, we provide some theoretical analysis. Recall we denote by as the prediction probability for the classes. We define the bilateral margin that our paper is essentially maximizing over as follows.

Definition 3.1 (Bilateral margin)

We define the bilateral perturbed network output by :

The bilateral margin is now defined as the minimum norm of required to cause the classifier to make false predictions:


This margin captures both the relative perturbation on the input layer and on the soft-max output .

Theorem 3.2

Suppose the parameter space we optimize over has covering number that scales as for some complexity . Then with probability over the draw of the training data, any classifer which achieves training error 0 satisfies:

where is of small order .

We defer the proof to the Appendix, which is adapted from Theorem 2.1 of (Wei and Ma, 2019). We observe the population risk is bounded by two key factors, the average of and , the covering number of the parameter space. On one side, the average of is dominated by the samples with the smallest margin. Therefore when we do adversarial training, it is important that we not only achieve higher overall accuracy, but also make sure the samples closer to the decision boundary have large enough margin. This can not be achieved by simply using constant and large that will maintain a large margin for most samples but sacrifice the accuracy of a small portion of data. On the other hand, the covering number of the network’s parameter space can be roughly captured by a bound of product of all layers’ weight norms. We hypothesize that with more flexibility in choosing , our algorithm will converge faster than using larger constant and will have more implicit regularization effect. To testify this hypothesis, we roughly measure the model complexity by the product of the weight norms of different models. In comparison to our model, when training with constant and , it respectively yields as large as and times of that of our model, which means our model indeed has more implicit regularization effect among others.

3.6 Connections with other training methods

Many recent papers attempt to improve adversarial training. Although they all follow the similar min-max framework, each of them uses slightly different loss functions. We summarize the loss functions used by recent adversarial training methods in Table 2.

We see that except for natural training which directly minimizes the cross entropy loss (denoted as CE), all training techniques involve the use of the min-max framework. TRADES and MMA use the unperturbed data’s cross entropy loss as an additional regularization term to achieve a better trade-off between clean and robust error.

Similar to our method, both MMA and IAAT have sample-wise adaptive during training. They also utilize the adaptive to find the largest possible for every sample . However, they do not consider the adaptive label technique mentioned in Section 3.4. As a result, they can only achieve better clean accuracy while the improvements in robust accuracy is limited. Our CAT-CE algorithm (CAT with CE loss) is more general than IAAT and MMA. CAT-CE reduce to IAAT when we set in adaptive label smoothing. Moreover, MMA could be treated as a special case of CAT-CE when we use a line search scheme to find the and . Also, in Section 4.4, we will show the importance of the adaptive label uncertainty step in CAT.

Methods Loss Function
Adversarial training (Madry et al., 2018)
TRADES (Zhang et al., 2019b)
Bilateral Adv Training (Wang, 2019)
MMA (Ding et al., 2018)
MART (Wang, 2020)
IAAT (Balaji et al., 2019)
CAT-CE (ours)
CAT-MIX (ours)
Table 2: Summary of several robust training methods amd their corresponding loss function. indicates the Dirichlet distribution parameterized by .

4 Performance Evaluation

In this section, we conduct extensive experiments to show that CAT achieves a strong result on both clean and robust accuracy. We include the following methods into our comparison:

  • Customized Adversarial Training (CAT-CE): Our proposed method with the cross entropy loss.

  • Customized Adversarial Training (CAT-MIX): Our proposed method with the mixed cross entropy loss (6).

  • Adversarial training: The adversarial training method proposed in (Madry et al., 2018) where they use a K-step PGD attack as adversary.

  • TRADES: TRADES (Zhang et al., 2019b) improves adversarial training by an additional loss on the clean examples and achieves the state-of-art performance on robust accuracy.

  • Natural: the natural training which only minimizes the cross entropy loss.

Furthermore, since many recently proposed adversarial training methods have considered CIFAR-10 with Wide-ResNet structure as the standard setting and report their numbers, we also compare our performance with 7 previous methods on this specific setting.

Figure 10: Robust accuracy under different levels of attacks on CIFAR-10 dataset with VGG and Wide-ResNet architectures. CAT-CE and CAT-MIX clearly outperform TREADS and adversarial training.
Methods Clean accuracy PGD accuracy CW accuracy
Natural training 95.93% 0% 0%
Adversarial training (Madry et al., 2018) 87.30% 52.68% 50.73%
Dynamic adversarial training (Wang et al., 2019) 84.51% 55.03% 51.98%
TRADES (Zhang et al., 2019b) 84.22% 56.40% 51.98%
Bilateral Adv Training (Wang, 2019) 91.00% 57.5% 56.2%
MMA (Ding et al., 2018) 84.36% 47.18%
MART (Wang, 2020) 84.17% 58.56% 54.58%
IAAT (Balaji et al., 2019) 91.34% 48.53% 56.80%
CAT-CE (ours) 93.48% 73.38% 61.88%
CAT-MIX (ours) 89.61% 73.16% 71.67%
Table 3: The clean and robust accuracy of Wide Resnet models trained by various defense methods. All robust accuracy results use ball. We reported the best performance listed in the papers. denotes random-restart is applied in the testing attack. denotes it use a -step PGD attack
(a) Natural
(b) Adv train
(d) CAT
Figure 15: Loss landscape comparison of different adversarial training methods

4.1 Experimental Setup

Dataset and model structure. We use two popular dataset CIFAR-10 (Krizhevsky et al., )

and Restricted-ImageNet 

(Deng et al., 2009) for performance evaluation. For CIFAR-10, we use both standard VGG-16 (Simonyan and Zisserman, 2015) and Wide ResNet that is used in both vanilla adversarial training (Madry et al., 2018) and TRADES (Zhang et al., 2019b). For VGG-16, we implement adversarial training with the standard hyper-parameters and train TRADES with the official implementation. For Wide ResNet, since the model has become standard for testing adversarial training methods, we use exactly the same model structure provided by (Madry et al., 2018; Zhang et al., 2019b)

. And use the models’ checkpoint released by adversarial training and TRADES official repository and implement the Madry’s adversarial training using the standard hyper-parameters. For Restricted-ImageNet, we use ResNet-50. All our experiments were implemented in Pytorch-1.4 and conducted using dual Intel E5-2640 v4 CPUs (2.40GHz) with 512 GB memory with a GTX 2080 TI GPU.

Implementation details. We set the number of iterations in adversarial attack to be 10 for all methods. Adversarial training and TRADES are trained on PGD attacks setting with cross entropy loss (CE). We implement our CAT method both on cross entropy (CE) (Madry et al., 2018) and CW loss (Carlini and Wagner, 2017), and set . All the models are trained using SGD with momentum 0.9, weight decay . For VGG-16/Wide ResNet models, we use the initial learning rate of 0.01/0.1, and we decay the learning rate by 90% at the 80th, 140th, and 180th epoch. For CAT, we set epsilon scheduling parameter , and weighting parameter . For CAT-MIX, we set .

4.2 Robustness Evaluation and Analysis

White-box attcks. For CIFAR10, we evaluate all the models under different tolerance of white-box -norm bounded non-targeted PGD and CW attack. Specifically, we use both (20-step PGD with step size and . All attacks are equipped with random-start. To be noted, when , the robust accuracy is reduced to test accuracy of unperturbed (natural) test samples, i.e clean accuracy.

The experimental results are shown in Figure 10, where we can easily see that both CAT-CE and CAT-MIX clearly outperform other methods among from to . So our methods can achieve better robust error at the standard perturbation threshold considered in the literature, and also has better clean accuracy (). The accuracy curve becomes quite flat when is increased.

Wide ResNet has become a standard structure for comparing adversarial training methods, and it’s standard to train and evaluate with norm perturbation. For this setting, we collect the reported accuracy from 7 other adversarial training methods, with several of them published very recently, to have a detailed full comparison. As shown in Table 3, our method achieves state-of-art robust accuracy while maintaining a high clean accuracy. Due to the page limit, we put the Restricted ImageNet result in the appendix.

Black-box transfer attacks. We follow the criterion of evaluating transfer attacks as suggested by Athalye et al. (2018) to inspect whether the models trained by CAT will cause the issue of obfuscated gradients and give a false sense of model robustness. We generate 10,000 adversarial examples of CIFAR-10 from natural models with and evaluate their attack performance on the target model. Table 4 shows that CAT achieves the best accuracy compared with adversarial training and TRADES, suggesting the effectiveness of CAT in defending both white-box and transfer attacks.

Method VGG 16 Wide ResNet
Adv train 79.13% 85.84%
TRADES 83.53% 83.90%
CAT 86.58% 88.66 %
Table 4: Robust accuracy under transfer attack on CIFAR-10

4.3 Loss Landscape Exploration

To further verify the superior robustness using CAT, we visualize the loss landscape of different training methods in Figure 15. Following the implementation in (Engstrom et al., 2018), we divide the data input along a linear space grid defined by the sign of the input gradient and a random Rademacher vector, where the x- and y- axes represent the magnitude of the perturbation added in each direction and the z-axis represents the loss.

As shown in Figure 15, CAT generates a model with a lower and smoother loss landscape. Also, it could be taken as another strong evidence that we have found a robust model through CAT training.

4.4 Ablation study

The importance of adaptive label uncertainty. Here we discuss and perform an ablation study using VGG-16 and CIFAR-10 on the importance of adaptive label uncertainty and adaptive instance-wise . In Figure (b)b, Adp train denotes the original adversarial training, Adv+LS denotes adversarial training with label smoothing (setting by Eq (4)), Adp-Adv denotes adversarial training with adaptive instance-wise , and CAT-CE is the proposed method which is a combination of these two tricks. We found that only applying adaptive instance-wise or label smoothing cannot significantly boost the robust accuracy over standard adversarial training, but the proposed method, by nicely combining these two ideas, can significantly improve the performance. This explains why CAT significantly outperforms some instance adaptive methods like IAAT and MMA.

Figure 18: Analysis of CAT. In (a) we test CAT under different steps of PGD attack, and in (b) we conduct an ablation study on CAT by changing the loss function and removing Label Adaption (LA).

More iterations of PGD attack. As suggested in Athalye et al. (2018), to verify that the performance gain is not brought by insufficient iterations in PGD attack, in Figure (a)a, we show the robust accuracy with the number of iteration varying from to on CIFAR10 VGG16. The results show that although increasing the number of iterations would decrease the performance by around 2%, CAT always outperforms other methods significantly.

5 Conclusions

In this paper, we propose CAT, a customized adversarial training method that is designed to have better generalization for both clean and robust performance. We also provide a theoretical analysis to motivate our algorithm. Experimental results show that CAT has achieved state-of-art robust accuracy and a high clean accuracy while keeping similar running time as standard adversarial training. The success of CAT indicates that it is crucial to customize the perturbation level on both data sample side and its label in adversarial training.


  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. International Coference on International Conference on Machine Learning. Cited by: §2, §2, §4.2, §4.4.
  • Y. Balaji, T. Goldstein, and J. Hoffman (2019) Instance adaptive adversarial training: improved accuracy tradeoffs in neural nets. arXiv preprint arXiv:1910.08051. Cited by: §1, §1, §3.4, Table 2, Table 3.
  • W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: §1, §2, §3.1, §4.1.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In

    Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security

    pp. 15–26. Cited by: §2.
  • M. Cheng, T. Le, P. Chen, J. Yi, H. Zhang, and C. Hsieh (2018) Query-efficient hard-label black-box attack: an optimization-based approach. arXiv preprint arXiv:1807.04457. Cited by: §2.
  • M. Cheng, S. Singh, P. Chen, P. Chen, S. Liu, and C. Hsieh (2020) Sign-opt: a query-efficient hard-label adversarial attack. In ICLR, Cited by: §2.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. International Conference on Machine Learning. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In

    IEEE Conference on Computer Vision and Pattern Recognition

    pp. 248–255. Cited by: §4.1.
  • G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2018) Max-margin adversarial (mma) training: direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637. Cited by: §1, §1, §2, §3.1, Table 2, Table 3.
  • L. Engstrom, A. Ilyas, and A. Athalye (2018) Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272. Cited by: §4.3.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018)

    Robust physical-world attacks on deep learning visual classification

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. Cited by: §2.
  • R. Gao, T. Cai, H. Li, C. Hsieh, L. Wang, and J. D. Lee (2019) Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pp. 13009–13020. Cited by: §2, §3.2.
  • M. Goibert and E. Dohmatob (2019) Adversarial robustness via adversarial label-smoothing. arXiv preprint arXiv:1906.11567. Cited by: §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. International Conference on Learning Representations. Cited by: §1, §2, §2.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598. Cited by: §2.
  • [17] A. Krizhevsky, V. Nair, and G. Hinton () CIFAR-10 (canadian institute for advanced research). . External Links: Link Cited by: §4.1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial machine learning at scale. International Conference on Learning Representations. Cited by: §2.
  • X. Liu, M. Cheng, H. Zhang, and C. Hsieh (2018) Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 369–385. Cited by: §2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations. Cited by: §1, §2, §2, §3.1, Table 2, 3rd item, §4.1, §4.1, Table 3.
  • A. Shafahi, A. Ghiasi, F. Huang, and T. Goldstein (2018) Label smoothing and logit squeezing: a replacement for adversarial training?. Cited by: §2.
  • A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free!. Neural Information Processing Systems. Cited by: §2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. Cited by: §4.1.
  • D. Su, H. Zhang, H. Chen, J. Yi, P. Chen, and Y. Gao (2018) Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648. Cited by: §1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.4.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. International Conference on Learning Representations. Cited by: §1.
  • S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. Neural Information Processing Systems. Cited by: §2.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019)

    Robustness may be at odds with accuracy

    In International Conference on Learning Representations, Cited by: §1.
  • V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2018)

    Manifold mixup: better representations by interpolating hidden states

    International Conference on Machine Learning. Cited by: §2.
  • J. Wang (2019) Bilateral adversarial training: towards fast training of more robust models against adversarial attacks. International Conference on Computer Vision. Cited by: §1, §2, §3.1, Table 2, Table 3.
  • Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu (2019) On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pp. 6586–6595. Cited by: §1, §2, Table 3.
  • Wang (2020) Improving adversarial robustness requires revisiting misclassified examples. International Conference on Learning Representations. External Links: Link Cited by: §2, §3.1, Table 2, Table 3.
  • C. Wei and T. Ma (2019) Improved sample complexities for deep networks and robust classification via an all-layer margin. arXiv preprint arXiv:1910.04284. Cited by: Theorem A.1, Appendix A, §3.5.
  • E. Wong, L. Rice, and J. Z. Kolter (2020) Fast is better than free: revisiting adversarial training. arXiv preprint arXiv:2001.03994. Cited by: §2.
  • D. Yin, K. Ramchandran, and P. Bartlett (2018) Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914. Cited by: §3.2.
  • V. Zantedeschi, M. Nicolae, and A. Rawat (2017) Efficient defenses against adversarial attacks. In ACM Workshop on Artificial Intelligence and Security, pp. 39–49. Cited by: §2.
  • D. Zhang, T. Zhang, Y. Lu, Z. Zhu, and B. Dong (2019a) You only propagate once: accelerating adversarial training via maximal principle. In Advances in Neural Information Processing Systems, pp. 227–238. Cited by: §2.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019b) Theoretically principled trade-off between robustness and accuracy. International Conference on Machine Learning. Cited by: §1, §2, §3.1, Table 2, 4th item, §4.1, Table 3.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. International Conference on Learning Representations. Cited by: §2.

Appendix A Omitted Proofs

In this section we provide the omitted proof for Theorem 3.2, which is adapted from Theorem 2.1 from (Wei and Ma, 2019). They defined the all layer margin for a -layer network and perturbation as follows:

They define the all-layer margin as the minimum norm of required that causes the classifier to make a false prediction.


They consider the function class be the class of compositions of functions from function classes . They achieve the generalization bound as follows:

Theorem A.1 (Theorem 2.1 from (Wei and Ma, 2019))

In the above setting, with probability over the draw of the data, all classifiers which achieve training error satisfy

where is of small order .

For our problem, we define , where is identity mapping, and is the original function . Therefore the all layer margin is reduced to our bilateral margin:

Next, notice since is identity mapping, and composition with doesn’t affect the overall complexity. We apply Theorem A.1 and get our result.