Owing to the state-of-the-art performance of deep neural networks, increasingly large networks are being adopted in real-world applications. However, recent works (Szegedy et al., 2013; Goodfellow et al., 2014b) have demonstrated that small, maliciously crafted perturbations of the input can fool these networks into producing incorrect predictions. The manipulated samples are called adversarial examples, and they pose a serious threat to the success of deep learning in practice, especially in safety-critical applications.
There exist two types of adversarial attacks in the recent literature: white-box attacks and black-box attacks. White-box attacks (Szegedy et al., 2013; Goodfellow et al., 2014b; Carlini and Wagner, 2017; Akhtar and Mian, 2018) give the attacker access to the target model, including its architecture and parameters, whereas under black-box attacks (Papernot et al., 2017) the attacker has no access to the model parameters but can query the oracle, i.e., the targeted DNN, for labels.
Correspondingly, a sizable body of defense strategies has been proposed to resist adversarial examples. These defense methods fall mainly into three types:
Adversarial training/Robust optimization.
FGSM adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b) augments the training data of the classifier with existing types of adversarial examples, usually referred to as a first-order adversary. However, some works (Papernot et al., 2017; Tramèr et al., 2017; Na et al., 2017) showed that it is vulnerable to the gradient masking problem. Madry et al. (2017) studied adversarial robustness through the lens of robust optimization (Sinha et al., 2018) and trained against a projected gradient descent (PGD) adversary as a new form of adversarial training. Note that all of these forms of adversarial training rely on specific types of attacks and therefore show relatively good robustness only against the corresponding attacks; they do not necessarily defend consistently against other types of attacks.
Input transformation.
Thermometer encoding (Buckman et al., 2018) is a direct input transformation method that breaks the linear nature of networks, which has been identified as the source of adversarial examples (Goodfellow et al., 2014b). Defense-GAN (Samangouei et al., 2018) trains a generative adversarial network (GAN) as a denoiser that projects samples onto the data manifold before classifying them. Unfortunately, Athalye et al. (2018) found that these methods, along with other input transformations (Ma et al., 2018; Guo et al., 2017), suffer from the obfuscated gradient problem and can be circumvented by corresponding attacks.
Distillation.
Defensive distillation (Papernot et al., 2016) trains the classifier for two rounds using a variant of distillation (Hinton et al., 2015) to learn a smoother network. The approach reduces the model’s sensitivity to input variations by decreasing the absolute value of the model’s Jacobian matrix, which makes it difficult for attackers to generate adversarial examples. However, related work (Papernot et al., 2017) showed that it fails to adequately protect against black-box attacks transferred from other networks (Carlini and Wagner, 2017).
To design a defense that is resistant to various types of attacks, we propose a novel defense mechanism called Boundary Conditional GAN, which enhances the robustness of deep neural networks by augmenting the training data with boundary samples it generates, and we demonstrate its effectiveness against various types of attacks. We leverage the representative power of a state-of-the-art conditional GAN, the Auxiliary Classifier GAN (Odena et al., 2016), to generate conditional samples. Our key idea is to modify the loss of the conditional GAN by additionally minimizing the Kullback-Leibler (KL) divergence from the predictive distribution to the uniform distribution, so that it generates conditional samples near the decision boundary of a pre-trained classifier. These boundary samples are then fed to the pre-trained classifier to make its decision boundary more robust. The crucial reason why Boundary Conditional GAN can consistently defend against a wide range of attacks, rather than one specific type, is that the boundary samples might represent almost all the potential directions of constructed adversarial examples.
We empirically show that the new robust model can resist various types of adversarial examples and exhibits consistent robustness to these attacks compared with FGSM and PGD adversarial training, defensive distillation and Defense-GAN. Furthermore, we quantify the enhancement of robustness on the MNIST, Fashion-MNIST and CIFAR10 datasets. Finally, we visualize the change of decision boundaries to further demonstrate the effectiveness of our approach. Boundary Conditional GAN offers a new way to design a consistent defense mechanism against both existing and future attacks.
Before introducing our approach, we first present some preliminary knowledge about the adversarial attacks employed in this work and the necessary background on conditional GANs.
2.1 Adversarial Attacks
Adversarial attacks aim to find a small perturbation $\delta$, usually constrained in norm, and add it to a legitimate input $x$ to craft adversarial examples that can fool deep neural networks. In this paper, we consider white-box attacks, where the adversary has full access to the neural network classifier (architecture and weights).
Fast Gradient Sign Method (FGSM, Goodfellow et al. (2014b)) is a simple but effective attack for an $\ell_\infty$-bounded adversary, and an adversarial example can be obtained by:
$$x_{adv} = x + \epsilon\,\mathrm{sign}\left(\nabla_{x} J(x, y)\right),$$
where $x_{adv}$ denotes the adversarial example crafted from an input $x$ and $\epsilon$ measures the magnitude of the perturbation. $J(x, y)$ denotes the loss function of the classifier given the input $x$ and its true label $y$, e.g., the cross-entropy loss. FGSM is widely used in attacks and in the design of defense mechanisms.
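As an illustration of the one-step signed-gradient update, the following is a minimal sketch on a two-feature logistic-regression classifier, chosen only because its input gradient has a closed form. The model, its weights, the input and the names `loss_and_grad` and `fgsm` are assumptions made for this example and do not come from the paper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_and_grad(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. the input x (y in {0, 1})."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]  # closed-form d loss / d x for this toy model
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Move every coordinate by eps in the direction that increases the loss."""
    _, grad = loss_and_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy classifier and input (illustrative values only).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.5)
```

In this toy setting the perturbed point stays within the $\ell_\infty$ ball of radius `eps` around `x` while the loss increases, which is the behavior the attack relies on.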
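Projected Gradient Descent attack (PGD, Madry et al. (2017)) is an iterative attack method and can be regarded as a multi-step variant of FGSM:
$$x^{t+1} = \Pi_{x+\mathcal{S}}\left(x^{t} + \alpha\,\mathrm{sign}\left(\nabla_{x} J(x^{t}, y)\right)\right),$$
where $J$ is the classifier loss, $\Pi_{x+\mathcal{S}}$ denotes projection onto the set of allowed perturbed inputs around $x$, and $\alpha$ is the step size. Madry et al. (2017) showed that (the $\ell_\infty$ version of) PGD is equivalent to the Basic Iterative Method (BIM), another important iterative attack. In this paper, we use the PGD attack to represent a variety of iterative attacks.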
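The iterative scheme can be sketched as repeated signed-gradient steps, each followed by a projection back into the allowed region. The example below is a minimal sketch assuming an $\ell_\infty$ ball of radius `eps`, for which the projection reduces to coordinate-wise clipping; the logistic model and all numeric values are hypothetical.

```python
import math

def grad_x(w, b, x, y):
    """Gradient of the logistic cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [(p - y) * wi for wi in w]

def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Signed-gradient steps of size alpha, each projected into the eps-ball around x."""
    x_t = list(x)
    for _ in range(steps):
        g = grad_x(w, b, x_t, y)
        stepped = [xi + alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x_t, g)]
        # Projection onto the l_inf ball: clip each coordinate to [x_i - eps, x_i + eps].
        x_t = [min(max(si, x0 - eps), x0 + eps) for si, x0 in zip(stepped, x)]
    return x_t

# Hypothetical toy classifier and input (illustrative values only).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = pgd_linf(w, b, x, y, eps=0.3, alpha=0.1, steps=10)
```

With a single step and `alpha` equal to `eps`, this sketch reduces to the one-step signed-gradient attack.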
Carlini-Wagner (CW) attack (Carlini and Wagner, 2017) is an effective optimization-based attack. In many cases, it can reduce the classifier accuracy to almost zero. The perturbation $\delta$ is found by solving an optimization problem of the form:
$$\min_{\delta}\ \|\delta\|_p + c\cdot f(x+\delta),$$
where $\|\cdot\|_p$ is the $\ell_p$ norm and $f$ is the objective function that is defined to drive the example $x$ to be misclassified, e.g., $f(x') = \max\left(\max_{i\neq t} Z(x')_i - Z(x')_t,\ 0\right)$, where $t$ denotes a different class and $Z$ is the classifier output (logits). $c$ represents a suitably chosen constant. Although various norms could be considered, such as the $\ell_0$, $\ell_2$ and $\ell_\infty$ norms, we choose the CW attack with the $\ell_2$-norm for its computational convenience.
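To make the trade-off between perturbation size and misclassification concrete, the following sketch evaluates a CW-style objective, a distance term plus a constant times a margin term, for two candidate perturbations. It assumes a hypothetical linear classifier with logits `W x` and one common margin-based choice of misclassification term; it only evaluates the objective and does not run the optimizer used by the actual attack.

```python
import math

def logits(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def margin_loss(z, target):
    """Positive while some other logit exceeds the target logit, zero otherwise."""
    best_other = max(zi for i, zi in enumerate(z) if i != target)
    return max(best_other - z[target], 0.0)

def cw_objective(W, x, delta, target, c):
    """l2 norm of the perturbation plus c times the margin loss at x + delta."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    l2 = math.sqrt(sum(di * di for di in delta))
    return l2 + c * margin_loss(logits(W, x_adv), target)

# Hypothetical two-class linear model; the input is classified as class 0.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.0]
no_change = cw_objective(W, x, [0.0, 0.0], target=1, c=5.0)
to_boundary = cw_objective(W, x, [-0.5, 0.5], target=1, c=5.0)
```

The unperturbed input pays only the margin penalty, while the perturbation that reaches the boundary pays only its norm; the constant `c` decides which of the two the optimizer prefers.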
2.2 Conditional GANs
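Generative Adversarial Networks (GANs, Goodfellow et al. (2014a)) consist of two neural networks trained in opposition to one another. The generator $G$ maps a low-dimensional latent space to the high-dimensional sample space of $x$. The discriminator $D$ is a binary classifier that discriminates between real inputs and fake inputs generated by the generator $G$. The generator and discriminator are trained in an alternating fashion to optimize the following min-max objective:
$$\min_G \max_D\ \mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],$$
where $z$ is the noise, usually following a simple distribution $p_z(z)$, such as a Gaussian distribution. The objective functions of the discriminator and generator are as follows:
$$\max_D\ \mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right], \qquad \min_G\ \mathbb{E}_{z\sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right].$$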
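The min-max objective can be estimated from a batch of discriminator outputs. The sketch below assumes the standard GAN value function, the mean of $\log D(x)$ on real samples plus the mean of $\log(1 - D(G(z)))$ on generated samples; the discriminator outputs are made-up numbers used only to show how the value behaves.

```python
import math

def value_function(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))] from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs: a discriminator that separates real from fake well,
# and one that is completely fooled (outputs 0.5 everywhere).
confident_d = value_function(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this value up, and the generator pushes the second term down by making `d_fake` larger.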
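Auxiliary Classifier GAN (ACGAN, Odena et al. (2016)) leverages both the noise $z$ and the class label $c$ to generate each sample from the generator $G$, $X_{fake} = G(c, z)$. The discriminator gives a probability distribution over sources, i.e., real or fake examples, and a probability distribution over the class labels, $P(S \mid X),\ P(C \mid X) = D(X)$, where $S$ denotes the sources and $C$ denotes the class labels. The objective function has two parts: the log-likelihood of the correct source, $L_S$, and the log-likelihood of the correct class, $L_C$:
$$L_S = \mathbb{E}\left[\log P(S = real \mid X_{real})\right] + \mathbb{E}\left[\log P(S = fake \mid X_{fake})\right], \tag{6}$$
$$L_C = \mathbb{E}\left[\log P(C = c \mid X_{real})\right] + \mathbb{E}\left[\log P(C = c \mid X_{fake})\right]. \tag{7}$$
$D$ is trained to maximize $L_S + L_C$ while $G$ is trained to maximize $L_C - L_S$. ACGAN learns a representation for $z$ that is independent of the class label, and we choose ACGAN for its state-of-the-art performance.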
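The two ACGAN terms, and how the discriminator and generator combine them, can be sketched as follows. The function takes the probabilities that the discriminator assigns to the correct source and to the correct class, for real and for generated samples; all inputs here are hypothetical numbers, and `acgan_terms` is a name introduced only for this example.

```python
import math

def mean_log(probs):
    return sum(math.log(p) for p in probs) / len(probs)

def acgan_terms(src_real, src_fake, cls_real, cls_fake):
    """src_*: probability given to the correct source; cls_*: probability given to the correct class."""
    l_s = mean_log(src_real) + mean_log(src_fake)  # source log-likelihood
    l_c = mean_log(cls_real) + mean_log(cls_fake)  # class log-likelihood
    return l_s, l_c

# Hypothetical single-sample batches.
l_s, l_c = acgan_terms(src_real=[0.9], src_fake=[0.8], cls_real=[0.7], cls_fake=[0.6])
d_objective = l_s + l_c  # maximized by the discriminator
g_objective = l_c - l_s  # maximized by the generator
```

Both players want correct class predictions, but only the discriminator benefits from identifying the source correctly.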
3 The Proposed Boundary Conditional GAN
In this section, we elaborate on our approach, Boundary Conditional GAN. First, we detail our motivation for using boundary samples to defend against adversarial examples. Then, to verify that the proposed Boundary Conditional GAN can generate samples near the decision boundary, we conduct an experiment on a 2D classification task, in which we visualize the decision boundary and demonstrate that adding the Kullback-Leibler (KL) penalty to the loss of the conditional GAN forces the generated labeled samples to lie near the decision boundary of the original classifier. Finally, we describe how to use Boundary Conditional GAN to generate boundary samples that enhance the robustness of the pre-trained classifier.
From the perspective of attacks, optimization-based adversarial attacks, e.g., FGSM and PGD, solve the following optimization problem to different extents and thus obtain different approximations of the optimal adversarial attack:
$$\max_{\|\delta\|\le\epsilon}\ J(x+\delta, y),$$
where $J$ is the loss of the classifier $f$, $y$ is the true label of $x$ and $\epsilon$ bounds the perturbation. However, adversarial examples exist in many different directions around the input $x$ (Goodfellow et al., 2018), and the attacks constructed above cover only some of these directions, which also explains why adversarial training based on these attacks has limited defense power against other types of attacks. Consider a clean example $x$ whose norm ball lies near the current decision boundary, drawn as the black solid curve in the left part of Figure 1. We observe that, for a given perturbation magnitude $\epsilon$, adversarial examples, i.e., the region covered by yellow points, exist over a continuous angular range of directions. Samples near the decision boundary, i.e., the green stars, can represent almost all directions of adversarial examples, so it is natural to use boundary samples to refine the decision boundary by data augmentation and thereby help the classifier defend against various types of adversarial attacks. We expect the decision boundary to be refined from the left panel to the right panel of Figure 1 when boundary samples are considered, shrinking the angular range of directions covered by potential adversarial examples. Consequently, the improved decision boundary significantly reduces the number of adversarial examples.
From the perspective of defense, the consistent effectiveness against a variety of attacks is of vital importance. Athalye et al. (2018) pointed out that a strong defense should be robust not only against existing attacks, but also against future attacks. The key point of our motivation is that boundary samples could represent almost all potential directions of adversarial attacks and might exhibit consistent robustness against various types of attacks, which takes both aspects of attacks and defense into consideration.
3.2 Boundary Conditional GAN
Intuitively, the predictive distribution of the classifier for samples near the decision boundary is close to a uniform distribution, because it is ambiguous which class such samples belong to. For example, in a classification task with two classes, the classification probabilities of a sample on the decision boundary form the vector $[0.5, 0.5]$. Therefore, to encourage the conditional GAN to generate more samples near the decision boundary of the original classifier, we propose to add a penalty that forces the predictive distribution of generated samples under the original classifier to be uniform. The new generator loss (Lee et al., 2017) is as follows,
$$\min_G\ \underbrace{L_G}_{(a)} + \beta\,\underbrace{\mathbb{E}_{x\sim P_G}\left[KL\left(\mathcal{U}(y)\,\|\,P_\theta(y\mid x)\right)\right]}_{(b)}, \tag{8}$$
where $L_G$ is the original generator loss of the conditional GAN mentioned in (6) and (7), e.g., for ACGAN, $L_G = L_S - L_C$, the source log-likelihood minus the class log-likelihood. $\theta$ are the parameters of the original classifier, not of the auxiliary classifier in ACGAN, and are fixed during the training of the proposed conditional GAN. $P_G$ is the distribution of generated samples, $\mathcal{U}(y)$ is the uniform distribution and $\beta$ is a penalty parameter. The first term (a) corresponds to the original conditional GAN loss, since we would like to guarantee that the generated samples are near the original distribution of the corresponding class and not too far from the data manifold. The KL divergence term (b) forces the generator to generate samples whose predictive distribution under the original classifier is close to the uniform one, i.e., samples near the decision boundary of the pre-trained classifier, by minimizing the KL loss during training.
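The effect of the KL penalty can be checked numerically. The sketch below computes the KL divergence between the uniform distribution and a predictive distribution, and adds its batch average, weighted by a penalty parameter, to a given generator loss. The function names, the `base_loss` value and the probability vectors are assumptions made for illustration; in a real training loop the probabilities would come from the fixed pre-trained classifier applied to generated samples.

```python
import math

def kl_uniform_to_pred(probs):
    """KL(U || p) for one predictive distribution p over K classes; zero only when p is uniform."""
    k = len(probs)
    return sum((1.0 / k) * math.log((1.0 / k) / p) for p in probs)

def boundary_generator_loss(base_loss, batch_probs, beta):
    """Original generator loss plus beta times the average KL penalty over generated samples."""
    penalty = sum(kl_uniform_to_pred(p) for p in batch_probs) / len(batch_probs)
    return base_loss + beta * penalty

# Hypothetical classifier outputs for generated samples.
on_boundary = kl_uniform_to_pred([0.5, 0.5])   # ambiguous sample
confident = kl_uniform_to_pred([0.9, 0.1])     # sample far from the boundary
loss = boundary_generator_loss(base_loss=1.0, batch_probs=[[0.5, 0.5], [0.9, 0.1]], beta=2.0)
```

A sample on the boundary incurs no penalty, while a confidently classified sample is penalized, so a larger penalty weight pulls generated samples more strongly toward the boundary.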
To illustrate the effectiveness of the new GAN, we first run the modified conditional GAN on a 2D classification task and show that the new GAN loss helps the conditional GAN generate samples near the decision boundary with the corresponding labels. The training data are simulated from two 2D Gaussian distributions, shown in red and blue respectively in Figure 2. For both the generator and the discriminator, we use a fully-connected neural network with 3 hidden layers. We visualize the decision boundary and the samples generated by the proposed Boundary Conditional GAN in Figure 2. It shows that the new KL penalty indeed yields conditional samples (in yellow) near the decision boundary of the original classifier, and that samples generated with different values of the penalty parameter lie close to the decision boundary to different extents.
3.3 Defense Mechanism
In practice, we can easily obtain a pre-trained classifier for a specified machine learning task and then design a defense mechanism based on it. Owing to the influence of the new loss, Eq. (8), a conditional GAN trained from scratch with the modified loss shows a slight decrease in accuracy. To overcome this issue, we inject clean examples during data augmentation to maintain the accuracy of the original classifier.
We describe the procedure of our defense mechanism as follows; the corresponding flow chart is shown in Figure 3.
Pre-train a classifier, i.e., the target model to defend, on the specified dataset;
Train the modified conditional GAN with the new KL loss, Eq. (8), forcing the conditional GAN to generate boundary samples;
Feed the boundary samples with corresponding labels to the pre-trained classifier to refine the decision boundary by data augmentation;
Evaluate the robustness of the final classifier against various types of adversarial attacks.
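The data augmentation step of the procedure above, including the injection of clean examples, might be organized as in the following sketch. It only shows how a fine-tuning batch could mix clean and generated boundary samples; the mixing ratio `clean_fraction`, the helper name `augmented_batch` and the placeholder data are assumptions, since the text does not specify them.

```python
import random

def augmented_batch(clean, boundary, batch_size, clean_fraction, rng):
    """Draw one fine-tuning batch that mixes clean pairs with generated boundary pairs."""
    n_clean = round(batch_size * clean_fraction)
    batch = rng.sample(clean, n_clean) + rng.sample(boundary, batch_size - n_clean)
    rng.shuffle(batch)
    return batch

# Placeholder (input, label) pairs standing in for real images and generator outputs.
clean = [((i, "clean"), i % 2) for i in range(20)]
boundary = [((i, "generated"), i % 2) for i in range(20)]
batch = augmented_batch(clean, boundary, batch_size=8, clean_fraction=0.5, rng=random.Random(0))
```

Each such batch would then be used for one update of the pre-trained classifier, with the boundary samples carrying the labels they were conditioned on.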
In this section, to demonstrate the effectiveness of our approach in improving the robustness of deep neural networks, we conduct experiments on the MNIST, Fashion-MNIST and CIFAR10 datasets from the following three aspects.
Defense against adversarial attacks.
We empirically show that the new robust model by Boundary Conditional GAN can resist various adversarial attacks, e.g., FGSM, PGD and CW attacks. We compare the result with Defensive Distillation (Papernot et al., 2016), Defense-GAN (Samangouei et al., 2018), a defense approach also based on GAN, and FGSM adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b) and PGD adversarial training (Madry et al., 2017) which are regarded as commonly accepted baselines of defense. Consistent robustness can be observed through our detailed analysis.
Quantitative analysis of robustness.
To quantify the enhancement of robustness by our Boundary Conditional GAN, we quantitatively evaluate the robustness on MNIST, Fashion-MNIST and CIFAR10 and compare that with other defensive approaches.
Visualization of decision boundaries.
To verify the improvement in the robustness of decision boundaries, we visualize the change of decision boundaries around the input $x$.
4.1 Defense against Adversarial Attacks
We test the robustness of the original and improved classifier by Boundary Conditional GAN against various attack strategies compared with FGSM adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b), PGD adversarial training (Madry et al., 2017), Defensive Distillation (Papernot et al., 2016) and Defense-GAN (Samangouei et al., 2018).
Settings of various attack strategies. We present experimental results for three different attack strategies: FGSM, PGD and CW. We perform FGSM with different magnitudes $\epsilon$ and the PGD attack with 40 iterations of projected gradient descent on MNIST and Fashion-MNIST and 8 iterations on CIFAR10. Next, we perform the $\ell_2$-norm CW attack on 1,000 test samples.
Settings of baselines. FGSM and PGD adversarial training use adversarial examples generated by the standard FGSM and PGD attacks mentioned above with different magnitudes. Defensive distillation is trained with soft labels at the temperature used in the original paper (Papernot et al., 2016). Defense-GAN, as another baseline, is first trained with WGAN (Arjovsky et al., 2017) and uses the same inference-time hyper-parameters on all datasets.
Settings of architectures of deep neural networks. The classifier on MNIST and Fashion-MNIST has two convolutional layers and one fully-connected layer. For CIFAR10, we directly use ResNet18. The architecture of the conditional GAN, i.e., ACGAN, follows the original one (Odena et al., 2016) so that its state-of-the-art performance is maintained.
We train ACGAN with the new GAN loss, Eq. (8), on the corresponding dataset to generate boundary samples and then feed these boundary samples to the classifier to refine it by data augmentation. Finally, we evaluate the robustness of the enhanced classifier against various types of adversarial attacks in Table 1.
Table 1 shows the classification accuracies under different defense strategies across various attacks on all three datasets. An important observation is that Boundary Conditional GAN significantly outperforms Defensive Distillation, Defense-GAN, and FGSM and PGD adversarial training with different magnitudes against all attacks, especially on CIFAR10. Concretely, adversarial training with stronger attacks, e.g., larger $\epsilon$ for FGSM or PGD, exhibits better robustness but more easily overfits to the crafted adversarial examples, showing a larger drop in clean accuracy. Moreover, these types of adversarial training perform worse than other defensive methods on larger datasets such as CIFAR10. In addition, Defensive Distillation is on par with the state-of-the-art performance of BCGAN across gradient-based attacks, but it fails to defend against the stronger CW attack, which is also demonstrated in (Carlini and Wagner, 2017). We re-implement Defense-GAN based on the original paper (Samangouei et al., 2018) because of the different setting. However, practical difficulties, especially the choice of hyper-parameters of Defense-GAN, which is also discussed in (Samangouei et al., 2018), hinder the effectiveness of this method, resulting in limited defensive performance, especially on CIFAR10, for which the original Defense-GAN paper (Samangouei et al., 2018) provides no experimental results. For Boundary Conditional GAN, the consistent robustness of our method is easy to observe, although it slightly decreases the accuracy on clean data owing to the limited accuracy of the conditional GAN, i.e., ACGAN.
4.2 Quantitative Analysis of Robustness
To further demonstrate the enhancement of robustness against adversarial attacks, we evaluate it quantitatively.
We extend the measure of robustness (Papernot et al., 2017) to the adversarial behavior of source-target class pair misclassification within the context of classifiers built using DNNs. The robustness of a trained DNN model $F$ is:
$$\rho_{adv}(F) = \mathbb{E}_{\mu}\left[\Delta_{adv}(X, F)\right],$$
where inputs $X$ are drawn from the data distribution $\mu$, and $\Delta_{adv}(X, F)$ is defined to be the minimum perturbation required to misclassify sample $X$ in each of the other classes. We reformulate $\Delta_{adv}(X, F)$ for simplicity as follows:
$$\Delta_{adv}(X, F) = \min_{\delta X}\left\{\|\delta X\| : F(X+\delta X) \neq F(X)\right\}.$$
The higher the average minimum perturbation required to misclassify a sample is, the more robust a DNN is against adversarial examples. Then, we evaluate whether Boundary Conditional GAN increases the robustness metric on the three datasets. Unlike the original method (Papernot et al., 2017), we do not approximate the metric but search all perturbations for each sample with certain precision.
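One way to picture the metric is as an average, over samples, of the smallest perturbation magnitude that changes the prediction. The sketch below is a simplified stand-in rather than the evaluation protocol of the paper: it assumes a hypothetical linear classifier, searches a grid of magnitudes with precision `step`, and perturbs only along the signed direction that moves the sample toward the other class, so it measures the $\ell_\infty$ size of that particular perturbation.

```python
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def min_flip_eps(w, b, x, step=0.01, max_eps=5.0):
    """Smallest grid value eps at which a signed step against the current class flips the label."""
    label = predict(w, b, x)
    toward_other = -1 if label == 1 else 1
    direction = [toward_other * ((wi > 0) - (wi < 0)) for wi in w]
    for k in range(1, int(round(max_eps / step)) + 1):
        eps = k * step
        if predict(w, b, [xi + eps * di for xi, di in zip(x, direction)]) != label:
            return eps
    return max_eps  # no flip found within the search range

def robustness(w, b, samples, step=0.01):
    """Average minimum flipping magnitude over the samples."""
    return sum(min_flip_eps(w, b, x, step) for x in samples) / len(samples)

# Hypothetical classifier and two samples at different distances from the boundary.
w, b = [1.0, 1.0], 0.0
score = robustness(w, b, [[1.0, 1.0], [0.5, 0.5]])
```

A classifier whose samples sit farther from the decision boundary obtains a larger score, which is the sense in which a higher value indicates a more robust model.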
As shown in Table 2, Boundary Conditional GAN significantly improves the robustness of deep neural networks on the three datasets. More importantly, the enhancement of robustness by Boundary Conditional GAN dramatically exceeds that of various types of adversarial training and Defense-GAN, showing state-of-the-art performance. Interestingly, the measured robustness of Defense-GAN is poor; the underlying reason might lie in the mode collapse problem discussed in the NIPS 2016 GAN tutorial (Goodfellow, 2016). The modified conditional GAN used in our approach might partially avoid this problem owing to the favorable properties of conditional GANs.
4.3 Visualization of Decision Boundaries
In this part, we visualize the effect of boundary samples by comparing the change of decision boundary on two given directions in Figure 4. We apply the visualization method proposed by Wu et al. (2018), in which the two directions of axes are chosen as follows.
Denoting the gradient $g(x) = \nabla_x J(x, y)$, the first direction is selected as the locally averaged gradient,
$$G_\sigma(x) = \mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2 I)}\left[g(x+\xi)\right], \tag{11}$$
where $G_\sigma$ denotes the smoothed gradient. The motivation for using this smoothed gradient is the shattered gradient phenomenon studied in (Balduzzi et al., 2017): the gradient $g$ is very noisy, and one way to alleviate this is to smooth the landscape $J$, thereby yielding a more informative direction than $g$. We choose $\sigma = 1$, and the expectation in (11) is estimated by a sample average. As shown in Figure 4, the horizontal axis represents the direction of the smoothed gradient, $d_1$, and the vertical axis denotes the orthogonal direction, $d_2$. Each point $(u, v)$ in the 2-D plane corresponds to an image perturbed by $u$ and $v$ along each direction,
$$x + u\,d_1 + v\,d_2,$$
where the origin denotes the considered clean example, i.e., the crossover point of the two dashed axes. The different colors represent the different classes of the perturbed images in the directions $d_1$ and $d_2$. The left part of the subfigure for each dataset depicts the decision boundary of the original model around the clean image, while the right part depicts that of the robust model obtained with Boundary Conditional GAN.
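The visualization can be sketched as three steps: estimate a smoothed gradient by averaging gradients at noisy copies of the input, pick an orthogonal second direction, and record the predicted class on a grid of perturbations along the two directions. The example below is a minimal sketch on a hypothetical two-feature logistic model; the Gaussian smoothing noise, the sample count `m`, the grid size and the use of a 90-degree rotation for the orthogonal direction are all assumptions made for illustration.

```python
import math
import random

def grad_x(w, b, x, y):
    """Input gradient of the logistic cross-entropy loss."""
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [(p - y) * wi for wi in w]

def smoothed_gradient(w, b, x, y, sigma, m, rng):
    """Average the gradient over m Gaussian-perturbed copies of x."""
    total = [0.0] * len(x)
    for _ in range(m):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total = [t + g for t, g in zip(total, grad_x(w, b, noisy, y))]
    return [t / m for t in total]

def unit(v):
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]

def decision_map(w, b, x, d1, d2, extent, n):
    """labels[i][j] is the predicted class at x + u_j * d1 + v_i * d2."""
    ticks = [-extent + 2.0 * extent * k / (n - 1) for k in range(n)]
    labels = []
    for v in ticks:
        row = []
        for u in ticks:
            point = [xi + u * a + v * c for xi, a, c in zip(x, d1, d2)]
            row.append(1 if sum(wi * pi for wi, pi in zip(w, point)) + b > 0 else 0)
        labels.append(row)
    return labels

# Hypothetical toy classifier and clean example.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
d1 = unit(smoothed_gradient(w, b, x, y, sigma=1.0, m=50, rng=random.Random(0)))
d2 = [-d1[1], d1[0]]  # in 2-D, rotating d1 by 90 degrees gives an orthogonal direction
grid = decision_map(w, b, x, d1, d2, extent=1.0, n=5)
```

Plotting `grid` as a colored image before and after fine-tuning would show whether the region of the correct class around the origin has grown.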
We visualize the improvement of decision boundaries on MNIST and Fashion-MNIST. As shown in the left part of Figure 4 (MNIST), the blue region above the central point is enlarged from the original model (left) to the robust model (right), indicating that only larger perturbations can attack the new model successfully. A similar situation can be observed in the right part (Fashion-MNIST), where the red region has expanded around the decision boundary, demonstrating the robustness of our approach. All of the results in Figure 4 suggest that the robustness of the classifier has been improved by Boundary Conditional GAN.
5 Discussions and Conclusion
Through our empirical observations, we found that the diversity and accuracy of the conditional GAN are of significant importance for our defense mechanism. The more diverse the samples generated by the conditional GAN, the larger the observed improvement in robustness. Furthermore, the more accurate the conditional GAN, the smaller the drop in accuracy caused by misclassified generated samples. Moreover, other strategies that can generate samples near the decision boundary can also be leveraged to design defense mechanisms.
In this work, we have proposed a novel defense mechanism called Boundary Conditional GAN to enhance the robustness of the decision boundary against adversarial attacks. We modify the conditional GAN by additionally minimizing the Kullback-Leibler (KL) divergence from the predictive distribution to the uniform distribution in order to generate samples near the decision boundary of the pre-trained classifier. These boundary examples might capture the different directions of various adversarial attacks. We then feed the boundary samples to the pre-trained classifier to refine the decision boundary. We empirically show that the new robust model is resistant to various types of adversarial attacks, and we also provide a quantitative evaluation of the enhancement of robustness and a visualization of the improvement of decision boundaries.
In summary, leveraging boundary samples generated by Boundary Conditional GAN opens up a new way to design defense mechanisms that are consistently effective against various types of adversarial examples.
References
- Akhtar and Mian [2018] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
- Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Athalye et al. [2018] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- Balduzzi et al. [2017] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017.
- Buckman et al. [2018] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. 2018.
- Carlini and Wagner [2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
- Goodfellow et al. [2014a] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Goodfellow et al. [2014b] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Goodfellow et al. [2018] Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56–66, 2018.
- Goodfellow [2016] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
- Guo et al. [2017] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
- Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Lee et al. [2017] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
- Ma et al. [2018] Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Michael E Houle, Grant Schoenebeck, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
- Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Na et al. [2017] Taesik Na, Jong Hwan Ko, and Saibal Mukhopadhyay. Cascade adversarial machine learning regularized with a unified embedding. arXiv preprint arXiv:1708.02582, 2017.
- Odena et al. [2016] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
- Papernot et al. [2016] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
- Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
- Samangouei et al. [2018] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
- Sinha et al. [2018] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. 2018.
- Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tramèr et al. [2017] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
- Wu et al. [2018] Lei Wu, Zhanxing Zhu, Cheng Tai, et al. Understanding and enhancing the transferability of adversarial examples. arXiv preprint arXiv:1802.09707, 2018.