Robust Local Features for Improving the Generalization of Adversarial Training

09/23/2019 ∙ by Chuanbiao Song, et al. ∙ Huazhong University of Science & Technology ∙ Peking University ∙ Cornell University

Adversarial training has been demonstrated as one of the most effective methods for training robust models so as to defend against adversarial examples. However, adversarial training often lacks adversarially robust generalization on unseen data. Recent works show that adversarially trained models may be more biased towards global structure features. Instead, in this work, we would like to investigate the relationship between the generalization of adversarial training and the robust local features, as the local features generalize well for unseen shape variation. To learn the robust local features, we develop a Random Block Shuffle (RBS) transformation to break up the global structure features on normal adversarial examples. We continue to propose a new approach called Robust Local Features for Adversarial Training (RLFAT), which first learns the robust local features by adversarial training on the RBS-transformed adversarial examples, and then transfers the robust local features into the training of normal adversarial examples. Finally, we implement RLFAT in two currently state-of-the-art adversarial training frameworks. Extensive experiments on STL-10, CIFAR-10, CIFAR-100 datasets show that RLFAT improves the adversarially robust generalization as well as the standard generalization of adversarial training. Additionally, we demonstrate that our method captures more local features of the object, aligning better with human perception.


1 Introduction

Deep learning has achieved a remarkable performance breakthrough on various challenging benchmarks in machine learning fields, such as image classification (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). However, recent studies (Szegedy et al., 2014; Goodfellow et al., 2015) have revealed that deep neural network models are strikingly susceptible to adversarial examples, in which small perturbations around the input are sufficient to mislead the predictions of the target model. Moreover, such perturbations are almost imperceptible to humans and often transfer across diverse models to achieve black-box attacks (Papernot et al., 2017; Liu et al., 2017).

Though the emergence of adversarial examples has received significant attention and has led to various defense approaches for robust models (Madry et al., 2018; Dhillon et al., 2018; Wang and Yu, 2019; Song et al., 2019; Zhang et al., 2019a), many proposed defenses provide little benefit for true robustness and instead mask the gradients on which most attacks rely (Carlini and Wagner, 2017a; Athalye et al., 2018; Uesato et al., 2018; Li et al., 2019). Currently, one of the best techniques to defend against adversarial attacks (Athalye et al., 2018; Li et al., 2019) is adversarial training (Madry et al., 2018; Zhang et al., 2019a), which improves robustness by training on adversarial examples.

Despite substantial work on adversarial training, a large adversarially robust generalization gap remains (Schmidt et al., 2018; Zhang et al., 2019b; Ding et al., 2019): the robustness of adversarial training fails to generalize to unseen test data. Recent works (Geirhos et al., 2019; Zhang and Zhu, 2019) further show that adversarially trained models rely more on global structure features, whereas normally trained models are more biased towards local features. Intuitively, global structure features tend to be robust against adversarial perturbations but hard to generalize to unseen shape variations, while local features generalize well to unseen shape variations but are hard to make robust against adversarial perturbations. This naturally raises an intriguing question for adversarial training:

For adversarial training, is it possible to learn robust local features that have better adversarially robust generalization and better standard generalization?

To address this question, we investigate the relationship between the generalization of adversarial training and the robust local features, and advocate for learning robust local features for adversarial training. Our main contributions are as follows:


  • To our knowledge, this is the first work that sheds light on the relationship between adversarial training and robust local features. Specifically, we develop a Random Block Shuffle (RBS) transformation to study this relationship by breaking up the global structure features of normal adversarial examples.

  • We propose a novel method called Robust Local Features for Adversarial Training (RLFAT), which learns the robust local features and transfers the information of robust local features into the training on normal adversarial examples.

  • We implement RLFAT in two state-of-the-art adversarial training frameworks, PGD Adversarial Training (PGDAT) (Madry et al., 2018) and TRADES (Zhang et al., 2019a). Experiments show consistent and substantial improvements for both adversarial robustness and standard accuracy on several standard datasets. Moreover, the sensitivity maps of our models on images tend to align better with human perception.

2 Preliminaries

In this section, we introduce some notations and provide a brief description of advanced methods for adversarial attacks and adversarial training.

2.1 Notation

Let $F(x)$ be a probabilistic classifier based on a neural network, with the logits function $f(x)$ and the probability distribution $F(x) = \mathrm{softmax}(f(x))$. Let $\mathcal{L}(F; x, y)$ be the cross entropy loss for image classification. The goal of the adversary is to find an adversarial example $x' \in \mathcal{B}_\epsilon(x) := \{x' : \|x' - x\|_p \le \epsilon\}$ in the $\ell_p$ norm bounded perturbation, where $\epsilon$ denotes the magnitude of the perturbation. In this paper, we focus on $p = \infty$ to align with previous works.

2.2 Adversarial Attacks

Projected Gradient Descent.   Projected Gradient Descent (PGD) (Madry et al., 2018) is a stronger iterative variant of the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015). The goal is to iteratively solve the inner optimization problem with a step size $\alpha$:

$$x^{0} \sim \mathcal{U}\big(\mathcal{B}_\epsilon(x)\big), \qquad x^{t+1} = \Pi_{\mathcal{B}_\epsilon(x)}\Big(x^{t} + \alpha \cdot \mathrm{sign}\big(\nabla_{x}\mathcal{L}(F; x^{t}, y)\big)\Big), \tag{1}$$

where $\mathcal{U}$ denotes the uniform distribution, and $\Pi_{\mathcal{B}_\epsilon(x)}(\cdot)$ indicates the projection onto the set $\mathcal{B}_\epsilon(x)$.
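For concreteness, a minimal PyTorch-style sketch of the PGD attack in Eq. (1) is given below; the helper name pgd_attack and the default hyper-parameters (eps, alpha, steps) are illustrative choices for an $\ell_\infty$ setting, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.0075, steps=10):
    """Minimal ell_inf PGD sketch: random start in B_eps(x), then iterated
    signed-gradient ascent steps projected back onto the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)  # x^0 ~ U(B_eps(x))
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto B_eps(x)
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep a valid pixel range
    return x_adv.detach()
```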

Carlini-Wagner attack.  The Carlini-Wagner attack (CW) (Carlini and Wagner, 2017b) is a sophisticated method that directly solves for the adversarial example $x^{adv}$ by using an auxiliary variable $w$:

$$x^{adv} = \tfrac{1}{2}\big(\tanh(w) + 1\big). \tag{2}$$

The objective function to optimize the auxiliary variable $w$ is defined as:

$$\min_{w}\ \big\|x^{adv} - x\big\|_2^2 + c \cdot \ell(x^{adv}), \tag{3}$$

where $\ell(x^{adv}) = \max\big(f(x^{adv})_y - \max\{f(x^{adv})_i : i \ne y\},\ -\kappa\big)$. The constant $\kappa$ controls the confidence gap between the adversarial class and the true class.
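A minimal sketch of this objective (Eqs. (2)-(3)) follows; the untargeted margin form, the defaults for c and kappa, and the helper name cw_objective are assumptions for illustration, and in practice the objective would be minimized over $w$ with a gradient-based optimizer such as Adam.

```python
import torch
import torch.nn.functional as F

def cw_objective(model, w, x, y, c=1.0, kappa=50.0):
    """Sketch of the CW objective: x_adv = 0.5*(tanh(w)+1) (Eq. (2)), then an
    L2 distance term plus c times a margin loss with confidence gap kappa (Eq. (3))."""
    x_adv = 0.5 * (torch.tanh(w) + 1.0)                  # box constraint via tanh
    logits = model(x_adv)
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    # best logit among the wrong classes (true class masked out)
    other_logit = logits.masked_fill(
        F.one_hot(y, logits.size(1)).bool(), float('-inf')).max(dim=1).values
    margin = torch.clamp(true_logit - other_logit, min=-kappa)   # untargeted margin
    dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)              # squared L2 distance
    return (dist + c * margin).mean()
```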

NATTACK.  NATTACK (Li et al., 2019) is a derivative-free black-box adversarial attack method that breaks many of the defense methods based on gradient masking. The basic idea is to learn a probability density distribution over a small region centered around the clean input, such that a sample drawn from this distribution is likely to be an adversarial example.

2.3 Adversarial Training

Despite the intense interest in developing defenses, Athalye et al. (2018) and Li et al. (2019) have broken most previous defense methods (Dhillon et al., 2018; Buckman et al., 2018; Wang and Yu, 2019; Zhang et al., 2019a), and revealed that adversarial training remains one of the best defense methods. The basic idea of adversarial training is to solve the min-max optimization problem shown in Eq. (4):

$$\min_{F}\ \mathbb{E}_{(x, y) \sim \mathcal{D}}\Big[\max_{x' \in \mathcal{B}_\epsilon(x)} \mathcal{L}(F; x', y)\Big]. \tag{4}$$

We introduce two currently state-of-the-art adversarial training frameworks.

PGD adversarial training.  PGD Adversarial Training (PGDAT) (Madry et al., 2018) uses the PGD attack for generating adversarial examples, and its objective function is formalized as follows:

$$\mathcal{L}_{\mathrm{PGDAT}}(F; x, y) = \mathcal{L}\big(F; x^{adv}, y\big), \tag{5}$$

where $x^{adv}$ is obtained via the PGD attack on the cross entropy $\mathcal{L}(F; x', y)$.

TRADES.  Zhang et al. (2019a) propose TRADES to explicitly optimize the trade-off of adversarial training between adversarial robustness and standard accuracy by minimizing the following regularized surrogate loss:

$$\mathcal{L}_{\mathrm{TRADES}}(F; x, y) = \mathcal{L}(F; x, y) + \beta \cdot \mathrm{D}_{\mathrm{KL}}\big(F(x) \,\|\, F(x^{adv})\big), \tag{6}$$

where $x^{adv}$ is obtained via the PGD attack on the KL-divergence $\mathrm{D}_{\mathrm{KL}}\big(F(x) \,\|\, F(x')\big)$, and $\beta$ is a hyper-parameter to control the trade-off between adversarial robustness and standard accuracy.
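The two training objectives of Eqs. (5)-(6) can be sketched as follows, reusing the pgd_attack sketch above; pgd_attack_kl (the inner maximization of the KL term in TRADES) is an assumed helper implemented analogously to pgd_attack, and the default beta mirrors the setting reported later in Section 4.1.

```python
import torch.nn.functional as F

def pgdat_loss(model, x, y):
    """Eq. (5): cross entropy on PGD adversarial examples."""
    x_adv = pgd_attack(model, x, y)      # inner maximization of the CE loss
    return F.cross_entropy(model(x_adv), y)

def trades_loss(model, x, y, beta=6.0):
    """Eq. (6): natural loss plus beta-weighted KL term on adversarial examples."""
    x_adv = pgd_attack_kl(model, x)      # assumed helper: PGD on KL(F(x) || F(x'))
    natural = F.cross_entropy(model(x), y)
    robust_kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                         F.softmax(model(x), dim=1), reduction='batchmean')
    return natural + beta * robust_kl
```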

3 Robust Local Features for Adversarial Training

Different from adversarially trained models, normally trained models are more biased towards local features but vulnerable to adversarial examples (Geirhos et al., 2019). This indicates that, in contrast to global structure features, local features seem to generalize better but are less robust against adversarial perturbations. Thus, in this work, we focus on learning robust local features by adversarial training, and propose a novel form of adversarial training called RLFAT that learns the robust local features and transfers them into the training on normal adversarial examples. In this way, our adversarially trained models are not only robust against adversarial examples but also generalize well on unseen test data.

3.1 Robust Local Feature Learning

It is known that normal adversarial training tends to capture global structure features so as to increase invariance against adversarial perturbations (Zhang and Zhu, 2019; Ilyas et al., 2019). To advocate for the learning of robust local features in adversarial training, we propose a simple and straightforward image transformation called Random Block Shuffle (RBS) that breaks up the global structure features of the images while retaining the local features. Specifically, for an input image, we randomly split the image into several blocks horizontally and randomly shuffle the blocks, and then we perform the same split-shuffle operation vertically on the resulting image. As illustrated in Figure 1, the RBS transformation destroys the global structure features of the images to some extent while retaining their local features.

Figure 1: Illustration of the RBS transformation. For a better understanding of the RBS transformation, we paint the split image blocks with different colors.
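A minimal sketch of the RBS transformation on a batch of NCHW image tensors is given below; the block count k, the per-batch (rather than per-image) permutation, and the assumption that the image size is divisible by k are illustrative simplifications.

```python
import torch

def rbs_transform(x, k=2):
    """Random Block Shuffle sketch: split the image into k blocks along the
    height and shuffle them, then do the same along the width.
    Assumes NCHW tensors with H and W divisible by k."""
    def split_shuffle(t, dim):
        blocks = list(torch.chunk(t, k, dim=dim))
        perm = torch.randperm(k)
        return torch.cat([blocks[i] for i in perm], dim=dim)
    x = split_shuffle(x, dim=2)     # horizontal split and shuffle
    return split_shuffle(x, dim=3)  # vertical split and shuffle
```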

Then we apply the RBS transformation to adversarial training. Different from normal adversarial training, we use the RBS-transformed adversarial examples rather than the normal adversarial examples as the adversarial information to encourage the models to learn robust local features. Note that we only use the RBS transformation as a tool to learn the robust local features during adversarial training and do not use it in the inference phase. We refer to this form of adversarial training as RBS Adversarial Training (RBSAT).

We consider two currently state-of-the-art adversarial training frameworks, PGD Adversarial Training (PGDAT) (Madry et al., 2018) and TRADES (Zhang et al., 2019a), to demonstrate the effectiveness of the robust local features.

We use the following loss function as the alternative to the objective function of PGDAT:

$$\mathcal{L}_{\mathrm{RBSAT}}^{P}(F; x, y) = \mathcal{L}\big(F; \mathrm{RBS}(x^{adv}), y\big), \tag{7}$$

where $\mathrm{RBS}(\cdot)$ denotes the RBS transformation, and $x^{adv}$ is obtained via the PGD attack on the cross entropy $\mathcal{L}(F; x', y)$.

Similarly, we use the following loss function as the alternative to the objective function of TRADES:

$$\mathcal{L}_{\mathrm{RBSAT}}^{T}(F; x, y) = \mathcal{L}(F; x, y) + \beta \cdot \mathrm{D}_{\mathrm{KL}}\big(F(x) \,\|\, F(\mathrm{RBS}(x^{adv}))\big), \tag{8}$$

where $x^{adv}$ is obtained via the PGD attack on the KL-divergence $\mathrm{D}_{\mathrm{KL}}\big(F(x) \,\|\, F(x')\big)$.

3.2 Robust Local Feature Transfer

To transfer the knowledge of robust local features learned by RBSAT to the normal adversarial examples, we present a knowledge transfer scheme, called Robust Local Feature Transfer (RLFT). The goal of RLFT is to learn the representation that minimizes the feature shift between the normal adversarial examples and the RBS-transformed adversarial examples.

Specifically, we apply RLFT on the logit layer for high-level feature alignment. Formally, for both the PGDAT and the TRADES instantiations, the objective function of robust local feature transfer is formalized as:

$$\mathcal{L}_{\mathrm{RLFT}}(F; x, y) = \big\| f\big(\mathrm{RBS}(x^{adv})\big) - f\big(x^{adv}\big) \big\|_2^2, \tag{9}$$

where $f(\cdot)$ denotes the mapping of the logits layer, $\|\cdot\|_2^2$ denotes the squared Euclidean norm, and $x^{adv}$ is the adversarial example used by the corresponding framework (generated on the cross entropy for PGDAT and on the KL-divergence for TRADES).
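A sketch of this feature-shift term, assuming the model returns logits directly and reusing the rbs_transform sketch above; the helper name rlft_loss is illustrative.

```python
def rlft_loss(model, x_adv):
    """Eq. (9): squared Euclidean distance between the logits of the
    RBS-transformed adversarial example and the normal adversarial example."""
    shift = model(rbs_transform(x_adv)) - model(x_adv)
    return (shift ** 2).sum(dim=1).mean()   # average over the minibatch
```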

3.3 Overall Objective Function

Since the quality of robust local feature transfer depends on the quality of robust local features learned by RBSAT, we integrate RLFT and RBSAT into an end-to-end training framework, which we refer to as RLFAT (Robust Local Features for Adversarial Training). The training process of RLFAT is summarized in Algorithm 1.

1: Randomly initialize the network $F(x)$;
2: Number of iterations $t \leftarrow 0$;
3: repeat
4:    $t \leftarrow t + 1$;
5:    Read a minibatch of data $\{(x, y)\}$ from the training set;
6:    Generate the normal adversarial examples $x^{adv}$;
7:    Obtain the RBS-transformed adversarial examples $\mathrm{RBS}(x^{adv})$;
8:    Calculate the overall loss following Eq. (10);
9:    Update the parameters of the network through back propagation;
10: until the training converges.
Algorithm 1: Robust Local Features for Adversarial Training (RLFAT).

We implement RLFAT in the two state-of-the-art adversarial training frameworks, PGDAT and TRADES, and obtain new objective functions to learn robust and well-generalized feature representations, which we call RLFAT_P and RLFAT_T:

$$\mathcal{L}_{\mathrm{RLFAT_P}}(F; x, y) = \mathcal{L}_{\mathrm{RBSAT}}^{P}(F; x, y) + \eta \cdot \mathcal{L}_{\mathrm{RLFT}}(F; x, y), \qquad \mathcal{L}_{\mathrm{RLFAT_T}}(F; x, y) = \mathcal{L}_{\mathrm{RBSAT}}^{T}(F; x, y) + \eta \cdot \mathcal{L}_{\mathrm{RLFT}}(F; x, y), \tag{10}$$

where $\eta$ is a hyper-parameter to balance the two terms.
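Putting the pieces together, one RLFAT training step in the PGDAT instantiation (Eq. (10) following Algorithm 1) can be sketched as follows; it reuses the pgd_attack and rbs_transform sketches above, and the optimizer and the default eta are illustrative assumptions.

```python
import torch.nn.functional as F

def rlfat_pgd_step(model, optimizer, x, y, eta=1.0):
    """One training step of the PGDAT instantiation of RLFAT (Eq. (10)):
    RBSAT loss on RBS-transformed adversarial examples plus the
    eta-weighted robust local feature transfer (RLFT) term."""
    x_adv = pgd_attack(model, x, y)       # normal adversarial examples (Algorithm 1, line 6)
    x_adv_rbs = rbs_transform(x_adv)      # RBS-transformed adversarial examples (line 7)
    logits_rbs, logits_adv = model(x_adv_rbs), model(x_adv)
    rbsat = F.cross_entropy(logits_rbs, y)                     # Eq. (7)
    rlft = ((logits_rbs - logits_adv) ** 2).sum(dim=1).mean()  # Eq. (9)
    loss = rbsat + eta * rlft                                  # Eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```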

4 Experiments

To validate the effectiveness of RLFAT, we empirically evaluate our two implementations, denoted as RLFAT_P and RLFAT_T, and show that RLFAT makes significant improvements on both robust accuracy and standard accuracy on standard benchmark datasets.

4.1 Experimental setup

Baselines.  Since most previous defense methods provide little benefit for true adversarial robustness (Athalye et al., 2018; Li et al., 2019), we compare the proposed method with two state-of-the-art adversarial training defenses, PGD Adversarial Training (PGDAT) (Madry et al., 2018) and TRADES (Zhang et al., 2019a).

Adversarial setting.  We consider two attack settings with the $\ell_\infty$ bounded norm: the white-box attack setting and the black-box attack setting. For the white-box attack setting, we use the strongest existing white-box attacks: Projected Gradient Descent (PGD) (Madry et al., 2018) and the Carlini-Wagner attack (CW) (Carlini and Wagner, 2017b). For the black-box attack setting, we perform the powerful black-box NATTACK (Li et al., 2019) on a sample of 1,500 test inputs, as the attack is time-consuming.

Datasets.  We compare the proposed methods with the baselines on widely used benchmark datasets, namely CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton, 2009). Since adversarial training becomes increasingly hard with high-dimensional data and little training data (Schmidt et al., 2018), we also consider a more challenging dataset: STL-10 (Coates et al., 2011), which contains 5,000 training images of 96 × 96 pixels each.

Neural networks.  For STL-10, the architecture we consider is a Wide ResNet 40-2; for CIFAR-10 and CIFAR-100, we use a Wide ResNet 32-10 (Zagoruyko and Komodakis, 2016). For all datasets, we scale the input images to the range [0, 1].

Hyper-parameters.  To avoid over-tuning the hyper-parameters, for all datasets we set the trade-off hyper-parameter $\beta$ in TRADES to 6, and set the balancing hyper-parameter $\eta$ of the two RLFAT implementations to 0.5 and 1. For the training of all our models, we set the number of blocks in the RBS transformation to 2. More details about the hyper-parameters are provided in Appendix A.

4.2 Evaluation results

We first validate our hypothesis: for adversarial training, is it possible to learn the robust local features that have better adversarially robust generalization and better standard generalization?

In Table 1, we compare the accuracy of RLFAT_P and RLFAT_T with the competing baselines on three standard datasets. The proposed models demonstrate consistent and significant improvements in adversarial robustness as well as standard accuracy over the baseline models on all datasets. With the robust local features, RLFAT_T achieves better adversarially robust generalization and better standard generalization than TRADES. RLFAT_P behaves similarly, showing a significant improvement in the robustness against all attacks and in standard accuracy over PGDAT.

Defense    No attack   PGD     CW      NATTACK
PGDAT      67.05       30.00   31.97   34.80
TRADES     65.24       38.99   38.35   42.07
RLFAT_P    71.47       38.42   38.42   44.80
RLFAT_T    72.38       43.36   39.31   48.13
(a) STL-10. The magnitude of perturbation is 0.03 in the $\ell_\infty$ norm.

Defense    No attack   PGD     CW      NATTACK
PGDAT      82.96       46.19   46.41   46.67
TRADES     80.35       50.95   49.80   52.47
RLFAT_P    84.77       53.97   52.40   54.60
RLFAT_T    82.72       58.75   51.94   54.60
(b) CIFAR-10. The magnitude of perturbation is 0.03 in the $\ell_\infty$ norm.

Defense    No attack   PGD     CW      NATTACK
PGDAT      55.86       23.32   22.87   22.47
TRADES     52.13       27.26   24.66   25.13
RLFAT_P    56.70       31.99   29.04   32.53
RLFAT_T    58.96       31.63   27.54   30.86
(c) CIFAR-100. The magnitude of perturbation is 0.03 in the $\ell_\infty$ norm.

Table 1:  The classification accuracy (%) of the defense methods under white-box and black-box attacks on STL-10, CIFAR-10 and CIFAR-100.

The results demonstrate that the robust local features can significantly improve both the adversarially robust generalization and the standard generalization over the state-of-the-art adversarial training frameworks, and strongly support our hypothesis: for adversarial training, it is possible to learn robust local features that have better robust and standard generalization.

4.3 Loss Sensitivity under Distribution Shift

Motivation.  Ding et al. (2019) and Zhang et al. (2019b) found that the effectiveness of adversarial training is highly sensitive to the "semantic-loss" shift of the test data distribution, such as gamma mapping. To further investigate the defense performance of the proposed methods, we quantify the smoothness of the models on different test data distributions. In particular, we use uniform noise addition and gamma mapping to shift the test data distribution.

$\epsilon$-neighborhood loss sensitivity.  To quantify the smoothness of the models under uniform noise, we propose to estimate the Lipschitz continuity constant $\ell_{F}^{u}$ by using the gradients of the loss function with respect to the $\epsilon$-neighborhood region of the test data. A smaller value indicates a smoother loss function:

$$\ell_{F}^{u} = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{x' \sim \mathcal{U}(\mathcal{B}_\epsilon(x_i))} \big\|\nabla_{x'}\mathcal{L}(F; x', y_i)\big\|_{2}. \tag{11}$$

Gamma mapping loss sensitivity.  Gamma mapping (Szeliski, 2011) is a nonlinear element-wise operation used to adjust the exposure of images by applying $x^{\gamma}$ to the original image $x$. Similarly, we approximate the loss sensitivity under gamma mapping by using the gradients of the loss function with respect to the gamma-mapped test data. A smaller value indicates a smoother loss function:

$$\ell_{F}^{g}(\gamma) = \frac{1}{m}\sum_{i=1}^{m} \big\|\nabla_{x}\mathcal{L}(F; x_i^{\gamma}, y_i)\big\|_{2}. \tag{12}$$
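The two sensitivity measures can be estimated with a simple Monte Carlo sketch like the one below; the sample count, the gamma value, and the function names are illustrative, and the reduction (mean per-sample gradient norm) is an assumption consistent with Eqs. (11)-(12).

```python
import torch
import torch.nn.functional as F

def eps_neighborhood_sensitivity(model, x, y, eps=0.03, n_samples=100):
    """Monte Carlo estimate of Eq. (11): average gradient norm of the loss
    over uniform samples drawn from the eps-neighborhood of each test point."""
    total = 0.0
    for _ in range(n_samples):
        x_noisy = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
        x_noisy.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_noisy), y), x_noisy)[0]
        total += grad.flatten(1).norm(dim=1).mean().item()
    return total / n_samples

def gamma_mapping_sensitivity(model, x, y, gamma=1.2):
    """Estimate of Eq. (12): gradient norm of the loss at gamma-mapped inputs x**gamma."""
    x_gamma = x.pow(gamma).requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_gamma), y), x_gamma)[0]
    return grad.flatten(1).norm(dim=1).mean().item()
```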

Sensitivity analysis.  The results for the $\epsilon$-neighborhood loss sensitivity of the adversarially trained models are reported in Table 2(a), where we use 100 Monte Carlo samples for each test point. In Table 2(b), we report the loss sensitivity of the adversarially trained models under various gamma mappings. We observe that RLFAT_T, equipped with the robust local features, provides the smoothest model under the distribution shifts on the various datasets. The results suggest that, compared with PGDAT and TRADES, RLFAT_P and RLFAT_T both exhibit lower gradients of the loss on different data distributions, which we can directly attribute to the robust local features.

Dataset      PGDAT   TRADES   RLFAT_P   RLFAT_T
STL-10       0.76    0.43     0.20      0.20
CIFAR-10     1.17    0.76     0.63      0.49
CIFAR-100    2.74    1.73     1.03      0.91
(a) The $\epsilon$-neighborhood loss sensitivity of the adversarially trained models.

Dataset      PGDAT         TRADES        RLFAT_P       RLFAT_T
STL-10       0.77 / 0.79   0.44 / 0.42   0.30 / 0.29   0.21 / 0.19
CIFAR-10     1.27 / 1.20   0.84 / 0.76   0.69 / 0.62   0.54 / 0.48
CIFAR-100    2.82 / 2.80   1.78 / 1.76   1.09 / 1.01   0.95 / 0.88
(b) The gamma mapping loss sensitivity of the adversarially trained models; the two values in each cell correspond to the two gamma mappings considered.
Table 2:  The loss sensitivity of the defense methods under different data distributions.

4.4 Ablation Studies

To gain further insights into the performance obtained by the robust local features, we perform ablation studies to dissect the impact of the two components: robust local feature learning and robust local feature transfer. As shown in Figure 2, we conduct additional experiments for the ablation studies of RLFAT_P and RLFAT_T on STL-10, CIFAR-10 and CIFAR-100, where we report the standard accuracy on clean data and the average robust accuracy over all attacks for each model.

Figure 2: Ablation studies for RLFAT_P and RLFAT_T on (a) STL-10, (b) CIFAR-10 and (c) CIFAR-100, investigating the impact of Robust Local Feature Learning (RLFL) and Robust Local Feature Transfer (RLFT).

Does robust local feature learning help?  We first analyze whether, compared with adversarial training on normal adversarial examples, adversarial training on RBS-transformed adversarial examples produces better-generalizing and more robust features. As shown in Figure 2, Robust Local Feature Learning (RLFL) exhibits stable improvements in both standard accuracy and robust accuracy on all datasets for RLFAT_P and RLFAT_T, providing strong support for our hypothesis.

Does robust local feature transfer help?  We further add Robust Local Feature Transfer (RLFT), the second term in Eq. (10), to obtain the overall loss of RLFAT. The standard accuracy further increases on all datasets for RLFAT_P and RLFAT_T. The robust accuracy also further increases, with one exception on CIFAR-100 that nevertheless remains clearly higher than the baseline model. This shows that transferring the robust local features into normal adversarial training does help promote the standard accuracy and robust accuracy in most cases.

4.5 Visualizing the Salience Maps

We would like to investigate which features of the input images the models mostly focus on. Following the work of Zhang and Zhu (2019), we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on STL-10. The key idea of SmoothGrad is to average the gradients of the class activation with respect to noisy copies of an input image. As illustrated in Figure 3, all adversarially trained models focus on the global structure features of the object in the images; compared with PGDAT and TRADES, RLFAT_P and RLFAT_T both capture more local feature information of the object, aligning better with human perception. Note that the images are correctly classified by all of these models. For more visualization results, see Appendix B.
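A minimal SmoothGrad sketch for producing such sensitivity maps is given below; the noise scale and sample count are illustrative defaults, not the settings used for the figures.

```python
import torch

def smoothgrad_map(model, x, y, sigma=0.1, n_samples=50):
    """SmoothGrad sketch: average the gradient of the true-class logit over
    noisy copies of the input, then reduce over channels to a saliency map."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        x_noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(x_noisy).gather(1, y.unsqueeze(1)).sum()   # class activation
        grads += torch.autograd.grad(score, x_noisy)[0]
    return (grads / n_samples).abs().sum(dim=1)   # (N, H, W) sensitivity map
```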


Figure 3: Sensitivity maps of the four models (PGDAT, TRADES, RLFAT_P, RLFAT_T) on sampled images. For each group of images, we show the original image followed by the sensitivity maps of the four models in that order.

5 Conclusion

In contrast to existing adversarially trained models, which are more biased towards the global structure features of the images, in this paper we hypothesize that robust local features can improve the generalization of adversarial training. To validate this hypothesis, we propose a new adversarial training approach called Robust Local Features for Adversarial Training (RLFAT) and implement it in the state-of-the-art adversarial training frameworks PGDAT and TRADES. Extensive experiments show that the proposed methods based on RLFAT not only yield better standard generalization but also promote adversarially robust generalization. Furthermore, we show that the sensitivity maps of our models on images align better with human perception, uncovering an unexpected benefit of robust local features for adversarial training.

Acknowledgement

Supported by the Fundamental Research Funds for the Central Universities (2019kfyXKJC021).

References

  • A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 274–283. Cited by: §1, §2.3, §4.1.
  • J. Buckman, A. Roy, C. Raffel, and I. J. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: §2.3.
  • N. Carlini and D. A. Wagner (2017a) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, pp. 3–14. Cited by: §1.
  • N. Carlini and D. A. Wagner (2017b) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, pp. 39–57. Cited by: §2.2, §4.1.
  • A. Coates, A. Y. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Cited by: §4.1.
  • G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar (2018) Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: §1, §2.3.
  • G. W. Ding, K. Y. C. Lui, X. Jin, L. Wang, and R. Huang (2019) On the sensitivity of adversarial robustness to input data distributions. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §1, §4.3.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §1, §3.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: §1, §2.2.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine 29. Cited by: §1.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. CoRR abs/1905.02175. Cited by: §3.1.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, pp. 1106–1114. Cited by: §1.
  • Y. Li, L. Li, L. Wang, T. Zhang, and B. Gong (2019) NATTACK: learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pp. 3866–3876. Cited by: §1, §2.2, §2.3, §4.1, §4.1.
  • Y. Liu, X. Chen, C. Liu, and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Cited by: Appendix A, 3rd item, §1, §2.2, §2.3, §3.1, §4.1, §4.1.
  • N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, pp. 506–519. Cited by: §1.
  • L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pp. 5019–5031. Cited by: §1, §4.1.
  • D. Smilkov, N. Thorat, B. Kim, F. B. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. CoRR abs/1706.03825. Cited by: §4.5.
  • C. Song, K. He, L. Wang, and J. E. Hopcroft (2019) Improving the generalization of adversarial training with domain adaptation. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Cited by: §1.
  • R. Szeliski (2011) Computer vision - algorithms and applications. Texts in Computer Science, Springer. Cited by: §4.3.
  • J. Uesato, B. O’Donoghue, P. Kohli, and A. van den Oord (2018) Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 5032–5041. Cited by: §1.
  • H. Wang and C. Yu (2019) A direct approach to robust deep learning using adversarial networks. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §1, §2.3.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference 2016, BMVC, Cited by: §4.1.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019a) Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pp. 7472–7482. Cited by: 3rd item, §1, §2.3, §2.3, §3.1, §4.1.
  • H. Zhang, H. Chen, Z. Song, D. S. Boning, I. S. Dhillon, and C. Hsieh (2019b) The limitations of adversarial training and the blind-spot attack. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §1, §4.3.
  • T. Zhang and Z. Zhu (2019) Interpreting adversarially trained convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pp. 7502–7511. Cited by: §1, §3.1, §4.5.

Appendix A Hyper-parameter Setting

Here we show the details of the training hyper-parameters and the attack hyper-parameters for the experiments.

Training Hyper-parameters.  For all training tasks, we use the Adam optimizer with a learning rate of 0.001 and a batch size of 32. For CIFAR-10 and CIFAR-100, we run 79,800 training steps; for STL-10, we run 29,700 training steps. For STL-10 and CIFAR-100, the adversarial examples are generated with step size 0.0075, 7 iterations, and $\epsilon$ = 0.03. For CIFAR-10, the adversarial examples are generated with step size 0.0075, 10 iterations, and $\epsilon$ = 0.03.

Attack Hyper-parameters.  For the PGD attack, we use the same attack parameters as those of the training process. For the CW attack, we use PGD to minimize its loss function with a high confidence parameter ($\kappa$ = 50), following the work of Madry et al. (2018). For NATTACK, the attack hyper-parameters are the maximum number of optimization iterations, the sample size, the variance of the isotropic Gaussian, and the learning rate.

Appendix B More Feature Visualization

We provide more sensitivity maps of the adversarially trained models on sampled images in Figure 4.


Figure 4: More sensitivity maps of the four models (PGDAT, TRADES, RLFAT_P, RLFAT_T). For each group of images, we show the original image followed by the sensitivity maps of the four models in that order.