AT-GAN: A Generative Attack Model for Adversarial Transferring on Generative Adversarial Nets

04/16/2019 ∙ Xiaosen Wang, et al. ∙ Huazhong University of Science & Technology ∙ Cornell University

Recent studies have discovered the vulnerability of Deep Neural Networks (DNNs) to adversarial examples, which are imperceptible to humans but can easily fool DNNs. Existing methods for crafting adversarial examples are mainly based on adding small-magnitude perturbations to the original images, so the generated adversarial examples are constrained within a small norm distance of the benign examples. In this work, we propose a new attack method called AT-GAN that directly generates adversarial examples from random noise using generative adversarial nets (GANs). The key idea is to transfer a pre-trained GAN so that it generates adversarial examples for the target classifier to be attacked. Once the model is transferred for attack, AT-GAN can generate diverse adversarial examples efficiently, which could potentially help accelerate adversarial training as a defense. We evaluate AT-GAN in both semi-whitebox and black-box settings under typical defense methods on the MNIST handwritten digit database. Empirical comparisons with existing attack baselines demonstrate that AT-GAN achieves a higher attack success rate.

1 Introduction

Deep Neural Networks (DNNs) have exhibited great performance in computer vision tasks in recent years (Krizhevsky et al., 2012; He et al., 2016). However, DNNs have been found vulnerable to adversarial examples (Szegedy et al., 2014). Due to the robustness and security implications, methods for generating adversarial examples are called attacks, and they fall into two types: targeted and untargeted. Targeted attacks aim to generate adversarial examples that are classified as specific target classes, while untargeted attacks aim to generate adversarial examples that are simply classified incorrectly. Various algorithms (Yuan et al., 2017) have been proposed for generating adversarial examples, such as the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and the Carlini-Wagner attack (C&W) (Carlini and Wagner, 2017). Adversarial examples can also be used during training to improve a model's robustness, a popular and effective defense known as adversarial training (Goodfellow et al., 2015; Kurakin et al., 2017; Song et al., 2019).

In order to generate adversarial examples, most attack algorithms (Goodfellow et al., 2015; Carlini and Wagner, 2017) add an imperceptible perturbation to the input based on the gradient, which means their generated adversarial examples are restricted by the original images. Xiao et al. (2018) propose to train a generator that takes a benign image as input and outputs a perturbation that fools the target model; however, the result is still restricted by the original image. Song et al. (2018) assume that the images generated by a GAN contain adversarial examples, so they propose a method that searches for a noise input near an arbitrary noise vector, using gradient descent guided by the target classifier, such that the AC-GAN output for this input is an adversarial example.

Song et al. (2018) call their output unrestricted adversarial examples, claiming that it is not limited to a benign image. However, the output is still limited by the noise input, as they use gradient descent to search for a good noise in the neighborhood of the original noise. Because the method cannot always succeed for an arbitrary noise, they need to switch to another random noise as the input whenever the current search fails. Moreover, the method is slow, as it requires hundreds of gradient descent iterations to find a good noise in the neighborhood.

In this work, we propose a new generative attack model called AT-GAN (Adversarial Transferring on Generative Adversarial Nets) that estimates the distribution of adversarial examples so as to generate unrestricted adversarial examples from random noise. We first train a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) in the normal way and then transfer the generator to attack the target classifier. Note that, compared to Song et al. (2018), our output is truly unrestricted by the input because the generator has learned the distribution of adversarial examples. Once the generator is transferred from producing normal images to producing adversarial ones, it can directly generate unrestricted adversarial examples from any random noise, leading to high diversity. Moreover, we do not rely on iterative gradient steps, so generation is very fast.

To evaluate the effectiveness of our attack strategy, we use AT-GAN on several models to generate adversarial examples from random noise and compare it with several other attack methods in both semi-whitebox and black-box settings. We then apply typical defense methods (Goodfellow et al., 2015; Madry et al., 2017; Tramèr et al., 2018) to defend against the generated adversarial examples. Empirical results show that the adversarial examples generated by AT-GAN yield a higher attack success rate. Our main contributions are as follows:

  • We use GAN to estimate the distribution of adversarial examples so as to generate unrestricted adversarial examples from random noise. This is different from most existing attack methods that focus on how to add crafted perturbation to the original image.

  • We train a conditional generative network to directly produce adversarial examples, without relying on the gradient with respect to the input, so the generation process is very fast. This differs from the mainstream of optimization-based attacks.

  • Compared with the very few existing works using GANs for attack, which still search in the neighborhood of the input, our generated images have higher diversity because our method does not require a suitable noise as the input.

  • An extensive empirical study on typical defense methods against adversarial examples shows that the proposed AT-GAN achieves a higher attack success rate than existing adversarial attacks in both semi-whitebox and black-box settings.

2 Related Work

In this section, we provide an overview of GANs and typical attack methods for generating adversarial examples. We also introduce several GAN-based attacks that are most related to our work. Typical defense methods based on adversarial training are described in Appendix A.

2.1 Generative Adversarial Nets (GANs)

A generative adversarial net (GAN) (Goodfellow et al., 2014) consists of two neural networks trained in opposition to each other. The generator $G$ is optimized to estimate the data distribution, and the discriminator $D$ aims to distinguish fake samples from $G$ and real samples from the training data. The objective of $G$ and $D$ can be formalized as a min-max value function $V(G, D)$:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

The Conditional Generative Adversarial Net (CGAN) (Mirza and Osindero, 2014) is the first conditional version of GAN, which feeds a condition into both the generator and the discriminator. Radford et al. (2016) propose the Deep Convolutional Generative Adversarial Net (DCGAN), which implements GAN with convolutional networks and stabilizes training. The Auxiliary Classifier GAN (AC-GAN) (Odena et al., 2017) is another variant that extends GAN with class conditions through an extra classification output. The objective function of AC-GAN is built from the following two terms:

$$\mathcal{L}_S = \mathbb{E}[\log P(S = \mathrm{real} \mid X_{\mathrm{real}})] + \mathbb{E}[\log P(S = \mathrm{fake} \mid X_{\mathrm{fake}})], \quad \mathcal{L}_C = \mathbb{E}[\log P(C = c \mid X_{\mathrm{real}})] + \mathbb{E}[\log P(C = c \mid X_{\mathrm{fake}})] \qquad (2)$$

where the discriminator is trained to maximize $\mathcal{L}_S + \mathcal{L}_C$ and the generator is trained to maximize $\mathcal{L}_C - \mathcal{L}_S$.

To make GANs more trainable in practice, Arjovsky et al. (2017) propose the Wasserstein GAN (WGAN), which uses the Wasserstein distance so that the loss function has more desirable properties. Gulrajani et al. (2017) introduce WGAN_GP (WGAN with gradient penalty), which performs better than WGAN in practice. Its objective function is formulated as follows:

$$L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] \qquad (3)$$
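For concreteness, the gradient-penalty term in Eq. 3 can be computed as in the following minimal PyTorch sketch; the helper name, the interpolation over 4-D image batches, and the default weight of 10 follow common WGAN_GP implementations and are assumptions rather than code from this paper.

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: lambda * (||grad_xhat D(xhat)||_2 - 1)^2 on random interpolates."""
    # Random interpolation coefficient per sample (assumes NCHW image batches).
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Critic loss of Eq. 3 would then be:
# d_loss = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
```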

2.2 Gradient based Attacks

There are three types of attacks regarding how much access the attacker has to the model. A white-box attack can fully access the target model, while a black-box attack (Papernot et al., 2017) has no knowledge of the target model. Existing black-box attacks mainly rely on transferability (Liu et al., 2017; Bhagoji et al., 2017), in which an adversarial instance generated on one model can be transferred to attack another model. A third type, the semi-whitebox attack, was recently proposed by Xiao et al. (2018): it needs the logits output of the target model during training but can afterwards generate adversarial examples without accessing the target model.

We now introduce three typical adversarial attack methods. Here the components of all adversarial examples are clipped into $[0, 1]$.

Fast Gradient Sign Method (FGSM). FGSM (Goodfellow et al., 2015) adds a perturbation in the direction of the gradient of the training loss with respect to the input to generate adversarial examples:

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big) \qquad (4)$$

Here $y$ is the true label of a sample $x$, $\theta$ denotes the model parameters, and $\epsilon$ specifies the magnitude of the distortion between $x$ and $x^{adv}$.
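A minimal PyTorch sketch of FGSM is given below, assuming a classifier `model` that returns logits and images with pixel values in [0, 1]; the default perturbation size is illustrative and not necessarily the setting used in the experiments.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.3):
    """One FGSM step (Eq. 4): x_adv = clip(x + eps * sign(grad_x J(x, y)), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)        # J(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + eps * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```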

Projected Gradient Descent (PGD). The PGD adversary (Madry et al., 2017) is a multi-step variant of FGSM, which applies FGSM iteratively with a small step size $\alpha$ and projects the result back into the perturbation budget $\epsilon$:

$$x^{adv}_{t+1} = \mathrm{clip}_{x, \epsilon}\big(x^{adv}_t + \alpha \cdot \mathrm{sign}(\nabla_x J(\theta, x^{adv}_t, y))\big) \qquad (5)$$

Here $\mathrm{clip}_{x, \epsilon}(\cdot)$ forces its input to reside in the range $[x - \epsilon, x + \epsilon]$.
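A corresponding PyTorch sketch of PGD is shown below; the step size, the number of iterations, and starting from the clean image (rather than a random point in the $\epsilon$-ball) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """PGD (Eq. 5): iterated FGSM steps, projected back into the eps-ball around x."""
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # clip_{x, eps}
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                   # valid pixel range
    return x_adv.detach()
```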

Rand FGSM (R+FGSM). R+FGSM (Tramèr et al., 2018) first applies a small random perturbation of magnitude $\alpha$ (where $\alpha < \epsilon$) to the benign image, and then applies FGSM with the remaining budget $\epsilon - \alpha$ to the perturbed image:

$$x^{adv} = x' + (\epsilon - \alpha) \cdot \mathrm{sign}\big(\nabla_{x'} J(\theta, x', y)\big), \quad \text{where } x' = x + \alpha \cdot \mathrm{sign}\big(\mathcal{N}(\mathbf{0}, \mathbf{I})\big) \qquad (6)$$
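The same pattern gives a short sketch of R+FGSM; the values of $\epsilon$ and $\alpha$ below are illustrative, and clipping the intermediate image to [0, 1] is an extra choice of this sketch.

```python
import torch
import torch.nn.functional as F

def rand_fgsm(model, x, y, eps=0.3, alpha=0.05):
    """R+FGSM (Eq. 6): random signed step of size alpha, then FGSM with budget eps - alpha."""
    assert alpha < eps
    x_prime = torch.clamp(x + alpha * torch.randn_like(x).sign(), 0.0, 1.0)
    x_prime = x_prime.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_prime), y)
    grad = torch.autograd.grad(loss, x_prime)[0]
    x_adv = x_prime + (eps - alpha) * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```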

2.3 GAN based Attacks

In this work we propose to use GAN to generate adversarial examples. In the literature, there are only a few attack methods based on GANs, such as AdvGAN (Xiao et al., 2018) and unrestricted adversarial attacks (Song et al., 2018).

AdvGAN. Xiao et al. (2018) propose to train AdvGAN, which takes a benign image $x$ as input and generates a perturbation $\mathcal{G}(x)$ that aims to make the target model $f$ classify $x + \mathcal{G}(x)$ as the target class $t$. The objective function is formulated as:

$$\mathcal{L} = \mathcal{L}_{adv} + \alpha \mathcal{L}_{GAN} + \beta \mathcal{L}_{hinge}, \quad \text{where } \mathcal{L}_{adv} = \mathbb{E}_x\, \ell_f\big(x + \mathcal{G}(x), t\big) \qquad (7)$$

Here $\mathcal{L}_{GAN}$ is the standard GAN loss encouraging $x + \mathcal{G}(x)$ to look realistic and $\mathcal{L}_{hinge}$ bounds the magnitude of the perturbation.

Unrestricted adversarial attacks. Song et al. (2018) propose to search for a noise input $z$ near an arbitrary noise vector $z^0$ for an AC-GAN generator so as to produce an adversarial example for the target model $f$, while an extra classifier keeps the generated image consistent with the source class. The objective function can be written as:

(8)

Our proposed AT-GAN is implemented based on the combination of AC-GAN and WGAN_GP (AC-WGAN_GP), but we have a very different objective for the training. We aim to learn the distribution of adversarial examples so that we could generate plentiful and diverse images that are visually realistic but misclassified by the target classifier. See details in the next section.

3 The proposed AT-GAN

In order to generate adversarial examples from random noise, we seek a generator that fits the distribution of adversarial examples:

(9)

Here $f$ is the target classifier to be attacked, $z$ is a random noise vector, $y$ is the true label of the generated image, $y_t$ is the target label, and the constraint set in Eq. 9 consists of the images with label $y$.

The above objective function is hard to solve directly, so we propose a new model called AT-GAN (Adversarial Transferring on Generative Adversarial Nets). The architecture of AT-GAN is illustrated in Figure 1. There are two training stages: we first train a GAN model to obtain the original generator (see details in Appendix B), and then transfer this generator to attack the target classifier.



Figure 1: The architecture of AT-GAN. The first training stage of AT-GAN is similar to that of AC-GAN. After the original generator is trained, we regard it as the original model and copy it as the initial attack model. We then transfer the attack model according to the target classifier to be attacked. After the second stage of training, AT-GAN can generate adversarial examples directly from random noise.

3.1 Transfer the Generator for Attack

After the original generator is trained, we transfer it to learn the distribution of adversarial examples in order to attack the target network. As illustrated in Figure 1 (b), there are three neural networks: the original generator, the attack generator to be transferred, which is initialized with the same weights as the original generator, and the classifier to be attacked. Then Eq. 9 can be rewritten as follows:

(10)

where $f^{-1}$ denotes the inverse function of the target classifier $f$ and $\|\cdot\|$ is a norm; a fixed choice of norm is used in the experiments.

Based on Eq. 10, we construct the loss function from two components. The first component aims to ensure that the output of the attack generator is classified as the target label $y_t$ by $f$:

(11)

The second component aims to ensure that the attack generator produces realistic examples:

(12)

where a small Gaussian noise with bounded norm is added. In practice, this introduces a small difference from Eq. 10, as the Gaussian noise smooths the generated adversarial examples.

Thus, the total loss for transferring can be written as:

(13)

Here two hyperparameters balance the two loss components and control the training process. For the untargeted attack, we simply replace the target label in the first component with a term based on the logits of the target classifier. Note that for particular settings of these hyperparameters, the objective function becomes similar to FGSM (Goodfellow et al., 2015) or R+FGSM (Tramèr et al., 2018).
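To make the transfer stage concrete, the following PyTorch sketch performs one training step under the assumptions stated above: a cross-entropy attack loss toward the target label (in the spirit of Eq. 11), a squared $\ell_2$ realism loss tying the attack generator to the frozen original generator with a small Gaussian smoothing noise (in the spirit of Eq. 12), and a weighted sum of the two (in the spirit of Eq. 13). The function and argument names, the specific loss forms, and the default weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def transfer_step(G_attack, G_original, f, z, y, y_target, optimizer,
                  w_attack=1.0, w_real=1.0, noise_std=0.1):
    """One AT-GAN transfer step (sketch): weighted sum of an attack loss and a realism loss.
    `optimizer` is assumed to optimize the parameters of G_attack only."""
    x_adv = G_attack(z, y)                       # candidate adversarial images
    with torch.no_grad():
        x_ref = G_original(z, y)                 # frozen reference images
    # Attack term: the target classifier f should output the target label (cf. Eq. 11).
    loss_attack = F.cross_entropy(f(x_adv), y_target)
    # Realism term: stay close to the original generator's output,
    # smoothed with a small Gaussian noise (cf. Eq. 12).
    noise = noise_std * torch.randn_like(x_ref)
    loss_real = ((x_adv - (x_ref + noise)) ** 2).flatten(start_dim=1).sum(dim=1).mean()
    loss = w_attack * loss_attack + w_real * loss_real   # cf. Eq. 13
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```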

3.2 Learn the Distribution of Adversarial Examples

Given a distribution over a space, we construct another distribution by replacing each point with a point chosen from a small neighborhood of it. Obviously, when the chosen point is close enough to the original one, the new distribution is almost the same as the original distribution.

Lemma 1. Given two distributions with probability density functions $p_1$ and $p_2$ on a space $\mathcal{X}$, if there exists a small $\delta$ such that $|p_1(x) - p_2(x)| \le \delta$ for any $x \in \mathcal{X}$, then the two distributions coincide as $\delta \to 0$.

The proof is in Appendix C.1. Eq. 12 aims to constrain the image generated by the attack generator within a small neighborhood of the image generated by the original generator. Under the ideal condition that Eq. 12 guarantees the two outputs are close enough in the high-dimensional image space for any input noise, the distribution learned by AT-GAN almost coincides with that of WGAN_GP.

Samangouei et al. (2018) prove that the global optimum of WGAN is reached when the generator distribution equals the data distribution, provided the set on which the two densities differ has zero Lebesgue measure. Similarly, we can show that the optimum of WGAN_GP is the same under this condition. The proof is in Appendix C.2.

Therefore, under ideal conditions, we conclude that the distribution learned by AT-GAN is almost the same as the distribution of the real data. However, the images generated by AT-GAN are adversarial examples. Thus, the distribution learned by AT-GAN is a distribution of adversarial examples that lies close to the real data distribution.

4 Experiments

We evaluate the proposed AT-GAN (the code will be made public after review) with several neural models on two benchmark datasets. The results demonstrate that, for both semi-whitebox and black-box attacks, AT-GAN achieves a higher attack success rate than the state-of-the-art baselines and indeed learns a distribution of adversarial examples close to the real data distribution. Moreover, AT-GAN generates adversarial examples very quickly.

4.1 Experimental Setup

All experiments are run on a single Titan X GPU. We test four neural network models, Models A, B, C, and D; details of their architectures and the hyperparameters are given in Appendix D.

Baselines. We compare AT-GAN with typical state-of-the-art attacks, namely FGSM, PGD, R+FGSM and the attack of Song et al. (2018). We evaluate their attack performance against several defense methods, namely adversarial training (Goodfellow et al., 2015), ensemble adversarial training (Tramèr et al., 2018) and iterative adversarial training (Madry et al., 2017).

Datasets. MNIST (LeCun and Cortes, 1998) is a dataset of handwritten digits from 0 to 9. Fashion-MNIST (Xiao et al., 2017) is similar to MNIST but with ten classes of fashion items. The class labels from 0 to 9 are t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot, respectively. For both datasets, we normalize the pixel values to the range $[0, 1]$.

Attack success rate. To measure attack performance, we use the attack success rate, defined as the fraction of samples that meet the goal of the adversary: $f(x^{adv}) \neq y$ for untargeted attacks, and $f(x^{adv}) = y_t$ for targeted attacks with target label $y_t$.
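A small helper that computes this metric might look as follows; the function name and signature are illustrative.

```python
import torch

def attack_success_rate(f, x_adv, y_true, y_target=None):
    """Fraction of adversarial examples that meet the adversary's goal:
    untargeted: f(x_adv) != y_true;  targeted: f(x_adv) == y_target."""
    with torch.no_grad():
        preds = f(x_adv).argmax(dim=1)
    if y_target is None:
        return (preds != y_true).float().mean().item()
    return (preds == y_target).float().mean().item()
```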

4.2 Comparison on Attacks

We train the four neural models A, B, C, and D with normal training and with each of the three adversarial training methods described above. The classification accuracy of each model on the original test set is shown in Table 1. Models A, B and C achieve an accuracy of about 99% on MNIST and around 90% on Fashion-MNIST. Model D attains an accuracy between 97% and 99% on MNIST and between 85% and 90% on Fashion-MNIST, which is slightly lower as it does not contain any convolutional layer. We compare AT-GAN with the other attack methods on these models in terms of generation efficiency and the attack success rate of the generated adversarial instances.

            MNIST (%)                   Fashion-MNIST (%)
            A      B      C      D      A      B      C      D
Nor.        99.1   99.0   99.0   97.3   91.8   90.2   92.2   85.2
Adv.        99.1   99.1   99.0   97.4   91.7   90.7   91.9   86.3
Ens.        99.2   99.0   99.0   97.4   92.0   89.6   92.1   85.5
Iter. Adv.  99.1   99.1   99.0   98.5   89.5   90.7   90.0   89.4
Table 1: The classification accuracy (%) on the test set for various models. Nor.: normal training, Adv.: adversarial training, Ens.: ensemble adversarial training, Iter. Adv.: iterative adversarial training.

4.2.1 Comparison on Attack Efficiency

We first test the efficiency of each attack method using Model A on the MNIST data. The average time for generating adversarial examples is listed in Table 2. Among the five attack methods, AT-GAN is the fastest, as it generates adversarial examples without querying the target classifier.

                  FGSM   PGD    R+FGSM   Song's    AT-GAN
Generating time   0.3s   1.8s   0.4s     >15min    0.2s
Table 2: Comparison of the generation time, measured by generating 1,000 adversarial instances using Model A on MNIST.

4.2.2 Comparison on the Generated Adversarial Examples

The adversarial examples generated by different methods for Model A are shown in Figure 2 and 3. On MNIST dataset, AT-GAN and Song’s method (Song et al., 2018) generate much higher quality examples than other attacks, and AT-GAN generates slightly more realistic images than Song’s method, for example on “0” and “3”. On Fashion-MNIST dataset, some adversarial examples generated by Song’s method are not realistic to human eyes, for example on “t-shirt/top (0) ” and “sandal (5)”. AT-GAN can generate very realistic images as Eq. 12 forces the adversarial examples to be close enough to the images generated by the original generator so they are realistic. Other attacks tend to perturb the images and some images are not clear enough for human eyes.

Figure 2: Adversarial examples generated by different methods using Model A on MNIST.
Figure 3: Adversarial examples generated by different methods using Model A on Fashion-MNIST.

4.2.3 Comparison on Semi-whitebox or Whitebox Attacks

As AT-GAN does not need access to the target model once it is trained, we generate adversarial instances with AT-GAN under the semi-whitebox setting. For comparison, we also generate adversarial instances with the four baseline attack methods under the white-box setting. We test the four neural models trained with normal training and with the adversarial training defenses, respectively. The attack success rates are listed in Table 3.

On the MNIST dataset, AT-GAN performs best across the defense settings. Only for Models A, B and C with normal training does PGD achieve the highest attack success rate of 100%, while AT-GAN attains the second highest attack success rate of over 98%. In all other cases, AT-GAN achieves the highest attack success rate.

On the Fashion-MNIST dataset, AT-GAN is the best on average. PGD achieves the highest attack success rate on Model A with normal training and on Model C with normal training and with ensemble adversarial training. The attack of Song et al. (2018) achieves the highest attack success rate on Model A with the various adversarial trainings. In these cases, AT-GAN almost always achieves the second highest attack success rate, close to the highest one. In all other cases, AT-GAN achieves the highest attack success rate.

Moreover, compared with the results under normal training, AT-GAN maintains a high attack success rate against these defense methods, while the attack success rates of the baselines clearly decay. This may be because AT-GAN estimates the distribution of adversarial examples, so the adversarial training defenses have little impact on it.

Model  Defense     MNIST (%)                                  Fashion-MNIST (%)
                   FGSM   PGD    R+FGSM  Song's  AT-GAN       FGSM   PGD    R+FGSM  Song's  AT-GAN
A      Nor.        68.8   100.0  77.9    57.3    98.7         82.7    99.9  95.5    89.2    96.1
       Adv.         2.9    92.6  28.4    34.3    97.5         14.6    82.6  63.3    93.1    92.7
       Ens.         8.7    85.1  20.3    36.6    96.7         36.0    90.8  68.4    95.9    95.4
       Iter. Adv.   4.0     5.9   2.6    40.4    91.4         23.2    30.2  37.8    95.9    93.5
B      Nor.        88.0   100.0  96.5    39.2    99.5         82.3    96.0  90.9    80.1    98.5
       Adv.         9.5    42.4   7.1    22.9    97.7         24.2    69.9  74.2    81.6    91.1
       Ens.        18.0    98.2  42.1    34.5    99.3         30.0    95.3  81.3    81.8    93.6
       Iter. Adv.   9.2    36.2   4.6    31.8    95.6         23.8    34.7  28.2    72.0    91.6
C      Nor.        70.7   100.0  81.6    42.4    99.3         82.7   100.0  98.4    94.3    98.0
       Adv.         4.3    76.9   7.2    37.5    95.8         11.0    88.3  76.5    88.8    88.9
       Ens.         7.8    96.4  19.9    41.7    96.9         47.8    99.9  71.3    88.6    93.9
       Iter. Adv.   4.7     9.6   2.9    31.5    90.0         22.3    28.8  31.2    72.7    91.6
D      Nor.        89.6    91.7  93.8    46.6    99.9         77.2    85.4  88.3    70.3    99.9
       Adv.        23.5    96.8  76.2    21.7    99.9         33.8    54.7  45.3    55.1    99.4
       Ens.        34.6    99.5  51.6    34.2    99.5         47.5    76.8  62.1    65.7    99.1
       Iter. Adv.  49.8    81.1  25.3    18.1    99.7         22.9    30.2  33.6    75.3    93.1
Table 3: Attack success rate (%) of adversarial examples generated by AT-GAN in the semi-whitebox setting and by the baselines in the white-box setting, under several defenses on MNIST and Fashion-MNIST. For each model, the highest and second highest attack success rates are highlighted in bold and underlined, respectively.

4.2.4 Transferability for Black-box Attacks

For the transferability evaluation, we use adversarial examples crafted on Model A to perform black-box attacks against the target Model C. The attack success rates are shown in Table 4. On the MNIST dataset, adversarial examples generated by PGD transfer best against normal training, while adversarial instances generated by AT-GAN achieve a much higher attack success rate when adversarial training is used as the defense. On the Fashion-MNIST dataset, PGD achieves the highest attack success rate on models with normal training and ensemble adversarial training, and Song et al. (2018) is the best on the other two adversarial trainings. However, this does not mean that AT-GAN has lower transferability, because the adversarial examples generated by the other attacks on Fashion-MNIST are not as realistic as those of AT-GAN, and less realistic images lead to a higher attack success rate.

          MNIST (%)                         Fashion-MNIST (%)
          Nor.   Adv.   Ens.   Iter. Adv.   Nor.   Adv.   Ens.   Iter. Adv.
FGSM      46.7    4.2    1.7    4.6         68.9   23.1   20.8   14.8
PGD       97.5    6.5    4.1    4.1         84.7   27.6   39.6   14.6
R+FGSM    82.3    6.7    4.8    4.1         21.2   32.1   17.5   26.3
Song's     5.7    3.8    3.5    3.2         38.1   32.9   31.3   31.4
AT-GAN    65.3   24.6   27.9   17.2         58.0   22.7   32.0   15.23
Table 4: Attack success rate (%) of black-box attacks on Model C using adversarial examples generated on Model A by different attack methods.

4.3 Exploration on the Distribution of Adversarial Examples

To verify that AT-GAN learns a distribution of adversarial examples close to the distribution of the real data, we use t-SNE (Maaten and Hinton, 2008) on 5,000 real images sampled from the test set and 5,000 generated adversarial examples to visualize the distributions in two dimensions. If the adversarial examples followed a distribution different from that of the real data, t-SNE could not handle them well and the resulting embedding would be chaotic. The results are illustrated in Figure 4. On both MNIST and Fashion-MNIST, the result of AT-GAN is the closest to that of the test set, indicating that AT-GAN indeed learns a distribution closest to the distribution of the real data.

Figure 4: The t-SNE results for (a) the test set and adversarial examples generated by (b) FGSM, (c) PGD, (d) R+FGSM, (e) Song's method and (f) AT-GAN on MNIST (top) and Fashion-MNIST (bottom). For (a), we use 10,000 real images sampled from the test set. For (b) to (f), we use 5,000 images sampled from the test set and 5,000 adversarial examples generated by the corresponding attack. The position of each class is random due to the properties of t-SNE.
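A visualization of this kind can be produced with scikit-learn's t-SNE, as in the following sketch; the preprocessing, the PCA initialization, and coloring by class label are illustrative choices, not necessarily those used for Figure 4.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_plot(real_images, adv_images, real_labels, adv_labels, out_path="tsne.png"):
    """Embed real and adversarial images jointly into 2-D with t-SNE and plot them by class."""
    real = np.asarray(real_images).reshape(len(real_images), -1)
    adv = np.asarray(adv_images).reshape(len(adv_images), -1)
    x = np.concatenate([real, adv], axis=0)
    y = np.concatenate([np.asarray(real_labels), np.asarray(adv_labels)])
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(x)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=2, cmap="tab10")
    plt.savefig(out_path, dpi=150)
```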

5 Conclusion

We propose a new attack model called AT-GAN that transfers generative adversarial nets (GANs) for adversarial attacks. Different from existing attacks that add small perturbations to the input images, AT-GAN tries to explore the distribution of adversarial instances so as to directly generate adversarial examples from any random noise. In this way, the generated adversarial examples are not limited to any natural images. Also, compared with the pioneering work using GANs (Song et al., 2018), which performs iterations of gradient descent to search for a good noise in the neighborhood of an original random noise such that the corresponding GAN output is an adversarial example, our model is a generative model that learns the distribution of adversarial examples. Once AT-GAN is trained, it can generate adversarial images whose diversity is not limited by the input noise. Experiments show that AT-GAN is very fast and can generate plenty of adversarial instances that look more realistic to human eyes; that AT-GAN yields a higher attack success rate under various adversarial training defenses in both semi-whitebox and black-box attack settings; and that AT-GAN learns a distribution of adversarial examples that is very close to the distribution of the real data.

Acknowledgments

We thank Chuan Guo for helpful discussions and suggestions on our work.

References

  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. In arXiv preprint arXiv:1701.07875.
  • Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. (2018). Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In International Conference on Machine Learning.
  • Bhagoji et al. (2017) Bhagoji, A. N., He, W., Li, B., and Song, D. (2017). Exploring the Space of Black-box Attacks on Deep Neural Networks. In arXiv preprint arXiv:1703.09387.
  • Buckman et al. (2018) Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. (2018). Thermometer Encoding: One Hot Way To Resist Adversarial Examples . In International Conference on Learning Representations.
  • Carlini and Wagner (2017) Carlini, N. and Wagner, D. (2017). Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy, pages 39–57.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Neural Information Processing Systems.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. In Neural Information Processing Systems.
  • Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations.
  • Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved Training of Wasserstein GANs. In Neural Information Processing Systems.
  • Guo et al. (2018) Guo, C., Rana, M., Cisse, M., and van der Maaten, L. (2018). Countering adversarial images using input transformations. In International Conference on Learning Representations.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems.
  • Kurakin et al. (2017) Kurakin, A., Goodfellow, I., and Bengio, S. (2017). Adversarial Machine Learning at Scale. In International Conference on Learning Representations.
  • LeCun and Cortes (1998) LeCun, Y. and Cortes, C. (1998). The MNIST database of handwritten digits.
  • Liao et al. (2018) Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., and Zhu, J. (2018). Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Liu et al. (2017) Liu, Y., Chen, X., Liu, C., and Song, D. (2017). Delving into Transferable Adversarial Examples and Black-box Attacks. In International Conference on Learning Representations.
  • Maaten and Hinton (2008) Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations.
  • Metzen et al. (2017) Metzen, J. H., Genewein, T., Fischer, V., and Bischoff, B. (2017). On Detecting Adversarial Perturbations. In International Conference on Learning Representations.
  • Mirza and Osindero (2014) Mirza, M. and Osindero, S. (2014). Conditional Generative Adversarial Nets. In arXiv preprint arXiv:1411.1784.
  • Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. (2017). Conditional Image Synthesis With Auxiliary Classifier GANs. In International Conference on Machine Learning.
  • Papernot et al. (2017) Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. (2017). Practical Black-Box Attacks against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, page 506–519.
  • Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations.
  • Samangouei et al. (2018) Samangouei, P., Kabkab, M., and Chellappa, R. (2018). Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models. In International Conference on Learning Representations.
  • Shen et al. (2017) Shen, S., Jin, G., Gao, K., and Zhang, Y. (2017). APE-GAN: Adversarial Perturbation Elimination with GAN. In arXiv preprint arXiv:1707.05474.
  • Song et al. (2019) Song, C., He, K., Wang, L., and Hopcroft, J. E. (2019). Improving the Generalization of Adversarial Training with Domain Adaptation. In International Conference on Learning Representations.
  • Song et al. (2018) Song, Y., Shu, R., Kushman, N., and Ermon, S. (2018). Constructing Unrestricted Adversarial Examples with Generative Models. In Neural Information Processing Systems.
  • Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.
  • Tramèr et al. (2018) Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. (2018). Ensemble Adversarial Training: Attacks and Defenses. In International Conference on Learning Representations.
  • Xiao et al. (2018) Xiao, C., Li, B., Zhu, J.-Y., He, W., Liu, M., and Song, D. (2018). Generating Adversarial Examples with Adversarial Networks. In International Joint Conference on Artificial Intelligence.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. In arXiv preprint arXiv:1708.07747.
  • Yuan et al. (2017) Yuan, X., He, P., Zhu, Q., and Li, X. (2017). Adversarial Examples: Attacks and Defenses for Deep Learning. In arXiv preprint arXiv:1712.07107.

Appendix

A Adversarial training based Defenses

There are many defense strategies, such as detecting adversarial perturbation (Metzen et al., 2017), obfuscating gradients (Buckman et al., 2018; Guo et al., 2018) and eliminating perturbation (Shen et al., 2017; Liao et al., 2018), among which adversarial training is the most effective method (Athalye et al., 2018). We list several adversarial training methods as follows.

Adversarial training. Goodfellow et al. (2015) first introduce the method of adversarial training, in which the standard loss function of a neural network is modified as:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x^{adv}, y) \qquad (14)$$

Here $y$ is the true label of a sample $x$ and $\theta$ denotes the model's parameters. The modified objective makes the neural network more robust by additionally penalizing its errors on adversarial samples, which are computed with respect to the current state of the network during training. Taking FGSM as an example, the loss function can be written as:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J\big(\theta,\, x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y)),\, y\big) \qquad (15)$$
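A sketch of one FGSM-based adversarial training step in the spirit of Eq. 15 is shown below; the equal weighting of clean and adversarial loss and the perturbation size are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_adv_training_step(model, x, y, optimizer, eps=0.3, adv_weight=0.5):
    """One adversarial-training step: weighted sum of the clean loss and the loss
    on FGSM examples crafted against the current parameters (cf. Eq. 15)."""
    x = x.clone().detach()
    # Craft FGSM examples on the fly.
    x_req = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)[0]
    x_adv = torch.clamp(x + eps * grad.sign(), 0.0, 1.0).detach()
    # Mixed objective on clean and adversarial inputs.
    loss = (1.0 - adv_weight) * F.cross_entropy(model(x), y) \
         + adv_weight * F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```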

Ensemble adversarial training. Tramèr et al. (2018) propose an ensemble adversarial training method, in which DNN is trained with adversarial examples transferred from a number of fixed pre-trained models.

Iterative adversarial training. Madry et al. (2017) propose to train a DNN with adversarial examples generated by iterative methods such as PGD.

B Train WGAN_GP to Get the Original Generator

Figure 1 (a) illustrates the overall architecture of the original AC-WGAN_GP. There are three neural networks: a generator $G$, a discriminator $D$ and a classifier $C$. The generator takes a random noise $z$ and a label $y$ as inputs and generates an image $G(z, y)$. It aims to generate an image that is indistinguishable to the discriminator $D$ and that the classifier $C$ labels as $y$. The loss function of $G$ can be formulated as:

(16)

Here $H(\cdot, \cdot)$ denotes the entropy between the classifier's prediction on the generated image and the label $y$. The discriminator takes the training data or the generated data as input and tries to distinguish between them. The loss function of $D$, with a penalty on the gradient norm for random interpolated samples, can be formulated as:

(17)

The classifier takes the training data or the generated data as input and predicts the corresponding label. It is no different from an ordinary classifier, and it is trained only on the training data. The loss of the classifier is:

(18)

The goal of this stage is to train a generator that outputs realistic samples and estimates the data distribution properly, so that we can later transfer the generator to generate adversarial examples. Hence, the generator could also be trained with other GAN variants, as long as it learns a good approximation of the real data distribution.
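To summarize this stage, the following sketch writes the three losses in PyTorch, assuming a conditional generator G(z, y), a critic D that outputs a scalar score, and a classifier C that outputs logits; `gradient_penalty` refers to a helper like the one sketched in Section 2.1, and the exact terms and weights of Eqs. 16-18 may differ.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, C, z, y):
    """Sketch of Eq. 16: WGAN generator term plus a classification term on G(z, y)."""
    fake = G(z, y)
    return -D(fake).mean() + F.cross_entropy(C(fake), y)

def critic_loss(G, D, x_real, z, y, gradient_penalty, lambda_gp=10.0):
    """Sketch of Eq. 17: WGAN critic loss with a gradient penalty term."""
    fake = G(z, y).detach()
    return D(fake).mean() - D(x_real).mean() + gradient_penalty(D, x_real, fake, lambda_gp)

def classifier_loss(C, x_real, y):
    """Sketch of Eq. 18: standard cross-entropy on the real training data only."""
    return F.cross_entropy(C(x_real), y)
```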

C Proofs that AT-GAN can Learn the Distribution of Adversarial Examples

C.1 AT-GAN has almost the same distribution as that of the original generator.

For two distributions with probability density functions $p_1$ and $p_2$ whose pointwise difference is bounded by a small $\delta$, we can bound the distance between the two distributions as follows:

(19)

Obviously, when $\delta \to 0$, the bound vanishes, which means the two distributions coincide.

C.2 Global Optimality of $p_g = p_{data}$ for WGAN_GP

We follow Samangouei et al. (2018) to prove this property. The WGAN_GP min-max loss is given by:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] - \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] \qquad (20)$$

For a fixed generator $G$, the optimal discriminator $D^*$ that maximizes this objective should be:

(21)

According to Eq. 20 and Eq. 21, we could get:

(22)

To minimize Eq. 22, the required condition must hold at every point of the space. Then, since both $p_{data}$ and $p_g$ integrate to 1, we obtain:

(23)

However, this contradicts Eq. 21 unless the set on which $p_{data}$ and $p_g$ differ has zero Lebesgue measure. This concludes the proof.

D More Experimental Details

We describe the details of the experiments, including the attack hyperparameters and model architectures, and provide additional properties of the generated adversarial examples.

D.1 Hyperparameters

The hyperparameters used in the experiments are described in Table 5.

Hyperparameters for attacks
Attack MNIST Fashion-MNIST Norm
FGSM
PGD
R+FGSM
Song's N/A
AT-GAN N/A
Table 5: The hyperparameters used for each attack.

D.2 The Architecture of Models

We describe the neural network architectures used in the experiments. The abbreviations for the components of the networks are given in Table 6. The architecture of WGAN_GP with the auxiliary classifier is shown in Table 7. The details of Models A through D are described in Table 8, and the generator and discriminator are the same as in Chen et al. (2016). Both Model A and Model C consist of convolutional layers followed by fully connected layers; they differ only in the size and number of convolutional filters. Model B uses dropout as its first layer and adopts larger convolutional filters, so it has fewer parameters. Model D is a fully connected neural network with the fewest parameters, and its accuracy is lower than the others because it has no convolutional layers.

Abbreviation Meaning
Conv() A convolutional layer with the given number of filters and filter size
DeConv() A transposed convolutional layer with the given number of filters and filter size
Dropout() A dropout layer with the given drop probability
FC() A fully connected layer with the given number of outputs
Sigmoid The sigmoid activation function
Relu The Rectified Linear Unit activation function
LeakyRelu() The leaky version of the Rectified Linear Unit with the given slope parameter
Table 6: The meaning of abbreviations.
Generator Discriminator Classifier
FC() + Relu Conv() + LeakyRelu() Conv() + Relu
FC() + Relu Conv() + LeakyRelu() pooling()
DeConv() + Sigmoid FC() + LeakyRelu() Conv() + Relu
DeConv() + Sigmoid FC(1) + Sigmoid pooling()
FC()
Dropout()
FC() + Softmax
Table 7: The architecture of WGAN_GP with auxiliary classifier.
Model A () Model B () Model C () Model D ()
Conv()+Relu Dropout() Conv()+Relu
Conv()+Relu Conv(  )+Relu Conv(  )+Relu
Dropout() Conv()+Relu Dropout() FC() + Softmax
FC()+Relu Conv()+Relu FC()+Relu
Dropout() Dropout() Dropout()
FC()+Softmax FC()+Softmax FC()+Softmax
Table 8: The architectures of the Models A through D we used for classification. The number in parentheses in the title is the number of parameters for each model.

E Target Attack Examples of AT-GAN

We show some adversarial examples generated by AT-GAN with targeted attacks. The results are illustrated in Fig. 5 and Fig. 6. Instead of adding perturbations to original images, AT-GAN transfers the generator, so the adversarial instances do not have exactly the same shape as the initial examples generated by the original GAN, as shown on the diagonal.

Figure 5: Adversarial examples generated by AT-GAN for different target classes on MNIST, with the same random noise input for each row. The images on the diagonal are generated by the original generator; they are not adversarial examples and are treated as the initial instances for AT-GAN.
Figure 6: Adversarial examples generated by AT-GAN for different target classes on Fashion-MNIST, with the same random noise input for each row. The images on the diagonal are generated by the original generator; they are not adversarial examples and are treated as the initial instances for AT-GAN.