Generative Adversarial Trainer: Defense to Adversarial Perturbations with GAN

05/09/2017 ∙ by Hyeungill Lee, et al. ∙ Seoul National University

We propose a novel technique to make neural networks robust to adversarial examples using a generative adversarial network. We alternately train both classifier and generator networks. The generator network generates an adversarial perturbation that can easily fool the classifier network by using the gradient of each image. Simultaneously, the classifier network is trained to correctly classify both original images and adversarial images generated by the generator. These procedures help the classifier network become more robust to adversarial perturbations. Furthermore, our adversarial training framework efficiently reduces overfitting and outperforms other regularization methods such as Dropout. We applied our method to supervised learning on the CIFAR datasets, and experimental results show that our method significantly lowers the generalization error of the network. To the best of our knowledge, this is the first method that uses a GAN to improve supervised learning.




1 Introduction

Recently, deep learning has advanced many areas of artificial intelligence, including image classification and speech recognition [Hinton et al., 2012, Krizhevsky et al., 2012]. These advances are owed to deep neural networks, which can be easily trained by backpropagation and can therefore represent complex probability distributions over high-dimensional data. Despite these advances, deep neural networks remain imperfect. In particular, unlike humans, they are vulnerable to adversarial examples [Szegedy et al., 2013]. Adversarial examples can effectively fool a neural network into changing its predictions, even though the human eye cannot distinguish them from the original images.

Several studies have attempted to make neural networks robust to such adversarial examples [Goodfellow et al., 2014b, Miyato et al., 2015, Papernot et al., 2016]. Adversarial training is one of the methods that retrains a neural network to predict correct labels for adversarial examples. In adversarial training, adversarial examples are generated in the inner loop of the adversarial training algorithms. Thus, this process should be fast to help adversarial training to be practical. Goodfellow et al. [2014b] proposed the fast gradient sign method, which is a simple and fast method of generating adversarial examples.

Goodfellow et al. [2014a] also introduced generative adversarial networks (GAN). GAN is a framework for training generative models, and it achieves state-of-the-art performance in image generation [Berthelot et al., 2017, Radford et al., 2015]. The main idea of GAN is that two networks play a minimax game so that they converge gradually to an optimal solution.

In this paper, we propose a novel adversarial training method using a GAN framework. Similar to GAN, we alternately train both a classifier network (trainee) and a generator network (trainer). The generator network attempts to generate adversarial perturbations that can easily fool the classifier network, whereas the classifier network tries to correctly classify both original images and adversarial images produced by the generator network. These procedures gradually help the classifier network become robust to adversarial perturbations. Our experimental results show that our method outperforms adversarial training based on the fast gradient method. We also observe that its generalization errors are remarkably lower than those of other regularization methods such as dropout [Srivastava et al., 2014].

2 Backgrounds

In this section, we briefly review the adversarial training (with fast gradient method) and GAN.

2.1 Adversarial Training

Goodfellow et al. [2014b] gave a rationale for the existence of adversarial examples and proposed a fast technique to generate adversarial perturbations. The authors observed that adversarial examples exist because models are too linear. They suggested the fast gradient sign method,

$$\eta = \epsilon \, \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right), \tag{1}$$

where $\eta$ is an adversarial perturbation, $\epsilon$ sets its magnitude, $\theta$ denotes the parameters of the network, $x$ is the input with the label $y$, and $J(\theta, x, y)$ denotes the cost function used to train the classifier network.
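The fast gradient sign perturbation can be illustrated on a model small enough to differentiate by hand. The sketch below assumes a binary logistic classifier as a stand-in for the deep network; the weights, the input, and the helper names (`loss`, `input_gradient`, `fgsm_perturbation`) are illustrative choices and do not come from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy J(theta, x, y) of a logistic model with theta = (w, b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Analytic gradient of J with respect to the input x: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_perturbation(w, b, x, y, eps):
    """eta = eps * sign(grad_x J(theta, x, y))."""
    return [eps * sign(g) for g in input_gradient(w, b, x, y)]

# Illustrative (assumed) parameters and input.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.3, -0.2, 0.8], 1

eta = fgsm_perturbation(w, b, x, y, eps=0.1)
x_adv = [xi + ei for xi, ei in zip(x, eta)]
clean_loss = loss(w, b, x, y)
adv_loss = loss(w, b, x_adv, y)
print("eta:", eta)
print("loss before and after:", clean_loss, adv_loss)
```

Every component of the perturbation has the same magnitude, so only the sign of the gradient is used; for this linear model the perturbed input always has a higher loss than the original.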

Deep neural networks trained by standard supervised methods are vulnerable to adversarial examples. Adversarial training helps neural networks become robust to adversarial perturbations. The practical approach is to introduce an adversarial objective function based on the fast gradient sign method. Let $\tilde{J}(\theta, x, y)$ be the loss function of adversarial training, which optimizes a neural network against an adversary:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) \, J\left(\theta, x + \epsilon \, \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right), y\right). \tag{2}$$

Eq. (2) consists of two cost functions. The first is the original cross-entropy loss function of the neural network. The second is the same loss evaluated with the adversarial perturbation added to each input $x$. Note that $\alpha$ is a hyperparameter that adjusts the ratio between the two cost functions. Shaham et al. [2015] analyzed the principle of adversarial training and found a strong connection between robust optimization and regularization. This establishes a minimization-maximization approach for adversarial training, which in turn makes neural networks stable in a neighborhood around the training points. The authors also generalized the fast gradient sign method and proposed another adversarial training method with an $L_2$ norm constraint [Shaham et al., 2015]:

$$\eta = \epsilon \, \frac{\nabla_x J(\theta, x, y)}{\left\lVert \nabla_x J(\theta, x, y) \right\rVert_2}. \tag{3}$$
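As a rough illustration of the adversarial objective and of the norm-constrained variant of the fast gradient method, the sketch below again assumes a binary logistic classifier in place of a deep network. The function names, the example values, and the choice of `alpha = 0.5` are assumptions made for the example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy J(theta, x, y) of a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of J with respect to x: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign_perturbation(grad, eps):
    """Fast gradient sign perturbation: eps * sign(grad)."""
    return [eps * ((g > 0) - (g < 0)) for g in grad]

def l2_perturbation(grad, eps):
    """Norm-constrained perturbation: eps * grad / ||grad||_2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [eps * g / norm for g in grad]

def adversarial_objective(w, b, x, y, eps, alpha, perturb):
    """alpha * J(x) + (1 - alpha) * J(x + eta), with eta built by `perturb`."""
    eta = perturb(input_gradient(w, b, x, y), eps)
    x_adv = [xi + ei for xi, ei in zip(x, eta)]
    return alpha * loss(w, b, x, y) + (1 - alpha) * loss(w, b, x_adv, y)

# Illustrative (assumed) parameters and input.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.3, -0.2, 0.8], 1

eta_l2 = l2_perturbation(input_gradient(w, b, x, y), eps=0.1)
plain = loss(w, b, x, y)
mixed_sign = adversarial_objective(w, b, x, y, 0.1, 0.5, sign_perturbation)
mixed_l2 = adversarial_objective(w, b, x, y, 0.1, 0.5, l2_perturbation)
print(plain, mixed_sign, mixed_l2)
```

Setting `alpha = 1.0` recovers the ordinary loss, while smaller values put more weight on the perturbed input.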


2.2 Generative Adversarial Networks

GAN [Goodfellow et al., 2014a] is a recent and successful framework for generative models. The conventional way to train a generative model is to maximize the likelihood function, which requires computing quantities such as marginal probabilities and partition functions that are computationally intractable in most cases. GAN allows us to train a generative model without these intractable computations. A GAN framework forces two networks to compete with each other: a generative model $G$, which attempts to map a sample $z$ from a noise distribution $p_z(z)$ to the data distribution $p_{\mathrm{data}}(x)$, and a discriminative model $D$, which attempts to discriminate between training data and samples from the generative model. The goal of the generative model is to maximize the probability that the discriminative model makes a mistake. Thus, the generative and discriminative models play the following two-player minimax game with a value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \tag{4}$$

The competition in this minimax game forces both models to improve their ability until the discriminator cannot distinguish a generated sample from a data sample.
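To make the value function of the minimax game concrete, the following sketch estimates it by Monte Carlo for one-dimensional toy distributions. The generator and the two discriminators are hypothetical closed-form functions chosen for the example, not trained networks.

```python
import math
import random

def value_function(D, G, data_samples, noise_samples):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in data_samples) / len(data_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
data_samples = [rng.gauss(3.0, 1.0) for _ in range(1000)]   # toy data distribution
noise_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # noise z

def poor_generator(z):
    """Hypothetical G whose samples lie far from the data."""
    return z - 3.0

def sharp_discriminator(x):
    """Hypothetical D that separates the two clusters around x = 0."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def blind_discriminator(x):
    """D that cannot tell real samples from generated ones."""
    return 0.5

v_sharp = value_function(sharp_discriminator, poor_generator, data_samples, noise_samples)
v_blind = value_function(blind_discriminator, poor_generator, data_samples, noise_samples)
print(v_sharp, v_blind)
```

A discriminator that always outputs 0.5 yields the value -log 4, which is the value the game reaches when the discriminator can no longer distinguish generated samples from data; a discriminator that separates the two sample sets obtains a higher value.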

Figure 1: Adversarial training with the Generative Adversarial Trainer: (1) the Generative Adversarial Trainer $G$ is trained to generate an adversarial perturbation that can fool the classifier network, using the gradient of each image. (2) The classifier network $F$ is trained to correctly classify both original examples and adversarial examples generated by $G$.

3 Proposed Method

In this section, we propose a novel adversarial training framework. However, we first introduce a generative adversarial trainer (GAT), which plays a major role in adversarial training. The objective of the GAT is to generate adversarial perturbations that can easily fool the classifier network using a gradient of images. A classifier is trained to classify correctly both original and adversarial images generated by the GAT. The entire procedure is shown in Fig. 1. In Section 3.1, we describe our notation. Section 3.2 describes the structure of the GAT, Section 3.3 explains the adversarial training mechanism used with the GAT.

3.1 Notations

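We denote a labeled training set by $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{H \times W \times C}$ represents input images with height $H$, width $W$, and channel $C$, and $y_i \in \{1, \dots, K\}$ is a label for an input $x_i$. We use two neural networks in the proposed method. One is a standard K-class classifier network, which is defined by:

$$F : \mathbb{R}^{H \times W \times C} \to [0, 1]^{K}, \qquad x \mapsto F(x),$$

where $F(x)$ represents the class probability vector computed using the softmax function. The other is a GAT $G$, which is defined by:

$$G : \mathbb{R}^{H \times W \times C} \to \mathbb{R}^{H \times W \times C}, \qquad \nabla \mapsto G(\nabla).$$

Note that $G(\nabla)$ represents the perturbation of the input image $x$, where $\nabla = \nabla_x F_y(x)$ denotes the gradient of the class probability of the label with respect to the input image. We use a cross-entropy loss function for the classifier $F$ with parameters $\theta$, which is denoted by:

$$J(\theta, x, y) = -\log F_y(x).$$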

3.2 Generative Adversarial Trainer

The main idea of the GAT is to use a neural network to learn a perturbation generator function specific to the classifier, rather than relying only on the sign or normalization functions used in the fast gradient method. The objective of the GAT is to find the best perturbation image using the gradient of each image. To achieve this goal, the loss function is defined as follows:

$$L_G(\nabla, y) = F_y\left(x + G(\nabla)\right) + c_g \left\lVert G(\nabla) \right\rVert_2^2. \tag{9}$$

The loss function of the GAT consists of two cost functions. The first, $F_y(x + G(\nabla))$, is the class probability that the classifier assigns to the true label $y$ of the perturbed image; minimizing it finds perturbation images that lower the classifier's class probability. The second restricts the power of the perturbation so that it is not too large. In (9), $c_g$ is a hyperparameter that adjusts the ratio between the two cost functions. If $c_g$ is too low, the GAT finds only a trivial solution with very high perturbation power. If $c_g$ is too high, it generates only a zero perturbation image. Therefore, finding an appropriate $c_g$ through a hyperparameter search is crucial.
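The trade-off between the two terms of the GAT loss can be sketched with a toy setup. Everything below is an assumption made for illustration: the classifier is a three-class linear softmax model, the generator is a one-parameter function `tanh(a * gradient)`, and the trade-off hyperparameter of (9) is named `c_g`. The power term is taken to be the squared L2 norm of the perturbation.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classifier(W, b, x):
    """F(x): class probabilities of a linear softmax classifier (stand-in)."""
    logits = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def grad_label_prob(W, b, x, y):
    """Gradient of F_y(x) with respect to x: F_y * (W_y - sum_k F_k W_k)."""
    p = classifier(W, b, x)
    mean_row = [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    return [p[y] * (W[y][j] - mean_row[j]) for j in range(len(x))]

def generator(grad, a):
    """Hypothetical one-parameter generator: G(grad) = tanh(a * grad)."""
    return [math.tanh(a * g) for g in grad]

def gat_loss(W, b, x, y, a, c_g):
    """Label probability of the perturbed input plus c_g times perturbation power."""
    delta = generator(grad_label_prob(W, b, x, y), a)
    x_adv = [xi + di for xi, di in zip(x, delta)]
    fool_term = classifier(W, b, x_adv)[y]
    power_term = sum(d * d for d in delta)
    return fool_term + c_g * power_term

# Illustrative (assumed) classifier and input.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x, y = [1.0, 0.2], 0

for c_g in (0.1, 1.0):
    no_attack = gat_loss(W, b, x, y, a=0.0, c_g=c_g)   # zero perturbation
    attack = gat_loss(W, b, x, y, a=-5.0, c_g=c_g)     # push against the gradient
    print(c_g, no_attack, attack)
```

With the small penalty weight the perturbing generator attains a lower loss than the zero perturbation, whereas with the large weight the zero perturbation is preferred, which mirrors the two failure modes described above.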

3.3 Adversarial Training with GAT

As an analogy, our adversarial training framework is similar to the spring training of a baseball team. A trainer analyzes the vulnerable points of a player and, based on this analysis, trains the player's weakest parts in addition to providing general training. This process is repeated over and over again. The goal is that, by the end of the spring camp, the player has overcome most of these weaknesses and become a better player.

The GAT plays a role similar to that of a trainer for a baseball team. During each training step, the GAT learns to generate the best adversarial perturbation for each input. Simultaneously, the classifier network is trained to correctly classify both original examples and adversarial examples generated by the GAT. The loss function of the GAT is given in (9), whereas that of the classifier network is based on the adversarial objective function:

$$L_F(x, y) = \alpha J(\theta, x, y) + (1 - \alpha) \, J\left(\theta, x + G(\nabla), y\right), \tag{10}$$

where $G(\nabla)$ is the perturbation produced by the GAT and $\alpha$ weights the two terms as in (2). For simplicity, we used a single fixed value of $\alpha$ in all experiments. As with GAN, completely optimizing the GAT in the inner loop of training is computationally expensive and would result in overfitting unless the dataset is sufficiently large. Instead, we alternate between $k$ optimization steps of the generator network and one step of the classifier network. The GAT stays near its optimal solution as long as the classifier network changes slowly enough. We used a fixed $k$ in all of our experiments. The entire procedure is presented in Algorithm 1.

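Input: training data $D$, classifier network $F$ with parameters $\theta$, generator network $G$ with parameters $\theta_g$

Output: robust parameter vector $\theta$

for number of training iterations do
     for k steps do
         Sample a minibatch of $m$ examples $\{(x_1, y_1), \dots, (x_m, y_m)\}$ from the data distribution.
         Update the generator by descending the stochastic gradient of its loss (9):
             $\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} L_G(\nabla_i, y_i)$
     end for
     Sample a minibatch of $m$ examples $\{(x_1, y_1), \dots, (x_m, y_m)\}$ from the data distribution.
     Update the classifier by descending the stochastic gradient of the adversarial objective, with the perturbation $G(\nabla_i)$ added to each input:
         $\nabla_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left[ \alpha J(\theta, x_i, y_i) + (1 - \alpha) J(\theta, x_i + G(\nabla_i), y_i) \right]$
end for
The gradient-based updates can use any standard gradient-based learning rule. We used the Adam optimizer in our experiments.
Algorithm 1 Adversarial Training with GAT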
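The alternating structure of Algorithm 1 can be sketched on a toy problem. The code below is a minimal illustration under several assumptions that are not taken from the paper: the classifier is a two-dimensional logistic model, the generator is a one-parameter function `tanh(phi * gradient)`, gradients are computed by finite differences instead of backpropagation, plain gradient descent replaces Adam, and the values of `k`, `m`, `alpha`, `c_g`, and `lr` are arbitrary.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def label_prob(theta, x, y):
    """F_y(x) for a logistic classifier with theta = [w1, w2, b]."""
    p = sigmoid(theta[0] * x[0] + theta[1] * x[1] + theta[2])
    return p if y == 1 else 1.0 - p

def input_grad(theta, x, y):
    """Gradient of F_y(x) with respect to x."""
    p = label_prob(theta, x, y)
    s = 1.0 if y == 1 else -1.0
    return [s * p * (1.0 - p) * theta[0], s * p * (1.0 - p) * theta[1]]

def perturb(phi, grad):
    """Hypothetical generator G(grad) = tanh(phi[0] * grad), elementwise."""
    return [math.tanh(phi[0] * g) for g in grad]

def generator_loss(phi, theta, batch, c_g):
    """Mean of F_y(x + G) + c_g * ||G||^2 over the minibatch."""
    total = 0.0
    for x, y in batch:
        d = perturb(phi, input_grad(theta, x, y))
        x_adv = [x[0] + d[0], x[1] + d[1]]
        total += label_prob(theta, x_adv, y) + c_g * (d[0] ** 2 + d[1] ** 2)
    return total / len(batch)

def classifier_loss(theta, batch, deltas, alpha):
    """Mean of alpha * J(x) + (1 - alpha) * J(x + delta), deltas held fixed."""
    total = 0.0
    for (x, y), d in zip(batch, deltas):
        x_adv = [x[0] + d[0], x[1] + d[1]]
        clean = -math.log(label_prob(theta, x, y))
        adv = -math.log(label_prob(theta, x_adv, y))
        total += alpha * clean + (1.0 - alpha) * adv
    return total / len(batch)

def num_grad(f, params, h=1e-5):
    """Central finite-difference gradient, a stand-in for backpropagation."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + h] + params[i + 1:]
        down = params[:i] + [params[i] - h] + params[i + 1:]
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

def clean_loss(theta, data):
    return sum(-math.log(label_prob(theta, x, y)) for x, y in data) / len(data)

def train(data, iters=200, k=1, m=8, alpha=0.5, c_g=0.1, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta, phi = [0.0, 0.0, 0.0], [0.0]
    for _ in range(iters):
        for _ in range(k):  # k generator steps
            batch = rng.sample(data, m)
            g = num_grad(lambda p: generator_loss(p, theta, batch, c_g), phi)
            phi = [p - lr * gi for p, gi in zip(phi, g)]
        batch = rng.sample(data, m)  # one classifier step
        deltas = [perturb(phi, input_grad(theta, x, y)) for x, y in batch]
        g = num_grad(lambda t: classifier_loss(t, batch, deltas, alpha), theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta, phi

data_rng = random.Random(1)
data = [([2 + data_rng.gauss(0, 0.3), 2 + data_rng.gauss(0, 0.3)], 1) for _ in range(20)]
data += [([-2 + data_rng.gauss(0, 0.3), -2 + data_rng.gauss(0, 0.3)], 0) for _ in range(20)]

theta, phi = train(data)
print("classifier parameters:", theta)
print("generator parameter:", phi)
print("clean loss:", clean_loss(theta, data))
```

The perturbations are computed before the classifier update and held fixed during it, so the classifier gradient does not flow through the generator. In this toy setting the generator parameter becomes negative, meaning the generator learns to push inputs against the gradient of the label probability.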

4 Experiments

To verify the performance of our proposed method, we experimented on two CIFAR datasets [Krizhevsky and Hinton, 2009]: CIFAR-10 and CIFAR-100, which consist of 32×32 colour images in 10 and 100 classes, with 6,000 and 600 images per class, respectively. Each dataset contains 50,000 training samples and 10,000 test samples. We split the original training samples into a training subset and a validation subset, and used the latter to tune the hyperparameters. We performed all experiments using TensorFlow [Abadi et al., 2015].

4.1 Experimental setup

The network architectures used in our experiments are described in Table 1. We used a small version of Allconvnet proposed by Springenberg et al. [2014] as the classifier (trainee) because this model has a simple structure and yields good performance. The input to our ConvNets was a fixed-size, normalized image. We used 3×3 convolutions with rectified linear units (ReLUs) as activation functions. The number of convolutional filters was increased linearly with each down-sampling, which was implemented as sub-sampling with stride 2. Fully connected layers were replaced by simple 1×1 convolutions, and the scores of each class were averaged over the spatial dimensions.

The generator network (trainer) had a simpler structure. The input gradient image of the generator network was computed from the classifier network. The network consisted of six convolutional layers followed by two convolutional layers, and used hyperbolic tangent as an activation function of the output layer.

Our model used no batch normalization [Ioffe and Szegedy, 2015], no dropout [Srivastava et al., 2014], and no weight decay. These techniques might have slightly improved our results, but they were not the focus of our study. We trained our model using the Adam optimizer [Kingma and Ba, 2014] with separate learning rates for the classifier and the generator. The network weights were initialized using the Xavier initialization method [Glorot and Bengio, 2010], which improved the speed of convergence.

classifier network (trainee)             | generator network (trainer)
Input: RGB image                         | Input: gradient image
conv ReLU                                | conv ReLU
conv ReLU, stride=2                      | conv ReLU
conv ReLU                                | conv ReLU
conv ReLU, stride=2                      | conv ReLU
conv ReLU                                | conv ReLU
conv ReLU                                | conv ReLU
conv 10 (or 100)                         | conv ReLU
global averaging over image              | conv
softmax                                  | tanh
Output: 10 (or 100) class probabilities  | Output: perturbation image
Table 1: Model description for CIFAR datasets

4.2 Perturbations generated by GAT

Is it possible to find stronger perturbation images than those generated by the fast gradient method when only the gradient of the original images is known? Surprisingly, the answer is yes. To verify this, we performed the following experiment. First, we trained a ConvNet without any adversarial training algorithm. We call this network the "baseline network". The classification accuracy of the baseline network was approximately 77% and 44% for the CIFAR-10 and CIFAR-100 datasets, respectively. We next trained our generator network (GAT) using the loss function given in (9). We used early stopping based on the validation loss, and computed the classification accuracy on adversarial images that the generator network produced from the test images. We compared this accuracy with that obtained on adversarial images generated by the fast gradient method ($L_\infty$ or $L_2$) at the same perturbation power. The experiment was repeated with different values of the trade-off hyperparameter in (9), so that adversarial perturbations of various powers could be generated. The results are shown in Fig. 2.

(a) CIFAR-10
(b) CIFAR-100
Figure 2: Comparison of the fast gradient attack (with $L_\infty$ and $L_2$ norm constraints) and our proposed GAT. Adversarial examples were generated by each method at various perturbation powers, and the resulting classification accuracy is plotted.

The generator network generates stronger adversarial images than the other methods when the perturbation power is low. Unlike the fast gradient method, the generator network requires several iterations to reach its optimal state. However, in the adversarial training phase, full optimization of the generator at each training step is not required because we alternately optimize the generator and the classifier network.

4.3 Adversarial training with GAT

There are two main types of attacks against deep neural networks. One is a direct attack, which generates adversarial examples based on precise knowledge of the model internals or its training data. The other is an indirect attack, which generates adversarial examples without precise information about the model. We experimented with adversarial examples generated in both settings to assess the robustness of our network against various attacks.

4.3.1 Direct attack

(a) CIFAR-10
(b) CIFAR-100
Figure 3: Comparison of the baseline and robustified networks. Adversarial examples were generated using (3) from each network for various values of $\epsilon$, and the resulting classification accuracy is plotted.

We compared the proposed adversarial training with the baseline network and with a network trained using an adversarial objective function based on the fast gradient sign method. For each network, adversarial examples with various perturbation powers were generated by the fast gradient method with the $L_2$ norm constraint using (3). Note that the adversarial examples were generated separately for each network from that network's own parameters, and we measured each network's classification accuracy on its own adversarial examples. The results are shown in Fig. 3. The proposed method (adversarial training with GAT) showed the best classification accuracy without perturbation, which indicates that our algorithm is also effective in regularizing the neural network. Furthermore, the results show that our method is more robust against adversarial perturbations than adversarial training with the fast gradient sign method.
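The evaluation protocol for a direct attack, accuracy as a function of perturbation power, can be sketched as follows. The fixed logistic classifier, the synthetic two-cluster data, and the list of perturbation magnitudes are assumptions for illustration; the attack is the norm-constrained fast gradient step computed from the attacked model's own parameters.

```python
import math
import random

def predict(theta, x):
    """Predicted label of a logistic classifier with theta = [w1, w2, b]."""
    return 1 if theta[0] * x[0] + theta[1] * x[1] + theta[2] > 0 else 0

def loss_input_grad(theta, x, y):
    """Gradient of the cross-entropy loss with respect to x: (p - y) * w."""
    z = theta[0] * x[0] + theta[1] * x[1] + theta[2]
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * theta[0], (p - y) * theta[1]]

def l2_attack(theta, x, y, eps):
    """x + eps * g / ||g||_2, using the gradient of the attacked model itself."""
    g = loss_input_grad(theta, x, y)
    norm = math.hypot(g[0], g[1])
    if norm == 0.0:
        return list(x)
    return [x[0] + eps * g[0] / norm, x[1] + eps * g[1] / norm]

def accuracy_under_attack(theta, data, eps):
    correct = sum(predict(theta, l2_attack(theta, x, y, eps)) == y for x, y in data)
    return correct / len(data)

rng = random.Random(0)
data = [([1.5 + rng.gauss(0, 0.5), 1.5 + rng.gauss(0, 0.5)], 1) for _ in range(50)]
data += [([-1.5 + rng.gauss(0, 0.5), -1.5 + rng.gauss(0, 0.5)], 0) for _ in range(50)]

theta = [1.0, 1.0, 0.0]  # hypothetical fixed classifier
curve = [(eps, accuracy_under_attack(theta, data, eps)) for eps in (0.0, 0.5, 1.0, 2.0, 4.0)]
for eps, acc in curve:
    print(eps, acc)
```

Repeating the sweep for each trained network and plotting the resulting curves gives a comparison of the kind shown in Fig. 3.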

4.3.2 Indirect attack

We next examined the robustness of the network against indirect attacks. As in the previous experiment, adversarial examples were generated by the fast gradient method with the $L_2$ norm constraint. However, because attackers would not know the internal structure of each network, the adversarial examples were generated using another baseline network. This kind of attack is effective because a large fraction of adversarial examples are also misclassified by networks trained from scratch with different hyperparameters [Szegedy et al., 2013]. We compared the proposed adversarial training method with the baseline network and with a network trained using the norm-constrained fast gradient method. The classification accuracy on adversarial examples with various perturbation powers is shown in Fig. 4. The results show a tendency similar to that of the previous experiment, although classification accuracy decreased no more rapidly under an indirect attack than under a direct attack.
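An indirect attack differs from a direct one only in whose gradients are used to craft the adversarial examples. The sketch below contrasts the two on a toy problem; the target and surrogate classifiers are hypothetical fixed logistic models, and the data are synthetic.

```python
import math
import random

def score(theta, x):
    return theta[0] * x[0] + theta[1] * x[1] + theta[2]

def predict(theta, x):
    return 1 if score(theta, x) > 0 else 0

def l2_fast_gradient(theta, x, y, eps):
    """x + eps * g / ||g||_2, where g = (p - y) * w is the loss gradient of `theta`."""
    p = 1.0 / (1.0 + math.exp(-score(theta, x)))
    g = [(p - y) * theta[0], (p - y) * theta[1]]
    norm = math.hypot(g[0], g[1])
    if norm == 0.0:
        return list(x)
    return [x[0] + eps * g[0] / norm, x[1] + eps * g[1] / norm]

def attacked_accuracy(target, source, data, eps):
    """Accuracy of `target` on examples crafted from the gradients of `source`."""
    hits = sum(predict(target, l2_fast_gradient(source, x, y, eps)) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)
data = [([1.5 + rng.gauss(0, 0.5), 1.5 + rng.gauss(0, 0.5)], 1) for _ in range(50)]
data += [([-1.5 + rng.gauss(0, 0.5), -1.5 + rng.gauss(0, 0.5)], 0) for _ in range(50)]

target = [1.0, 1.0, 0.0]     # hypothetical network under attack
surrogate = [1.0, 0.2, 0.0]  # hypothetical network available to the attacker

results = []
for eps in (0.0, 0.5, 1.0, 2.0, 3.0):
    direct = attacked_accuracy(target, target, data, eps)
    indirect = attacked_accuracy(target, surrogate, data, eps)
    results.append((eps, direct, indirect))
    print(eps, direct, indirect)
```

For linear models of this kind, a perturbation crafted from another model's gradient can never reduce the target's margin more than the target's own gradient direction does, so the indirect accuracy is never below the direct accuracy at the same perturbation power. Deep networks offer no such guarantee, which is why the comparison has to be measured empirically.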

(a) CIFAR-10
(b) CIFAR-100
Figure 4: Comparison of the baseline and robustified networks. Adversarial examples were generated using (3) from another baseline network for various values of $\epsilon$, and the resulting classification accuracy is plotted.

4.4 Regularization effect

Szegedy et al. [2013] showed that a neural network could be regularized through adversarial training. We performed an experiment to compare our method with dropout, random perturbation, and adversarial training using the fast gradient method. For each regularization method, we used the set of hyperparameters that achieved the best performance on the validation data. We repeated this procedure 50 times with different weight initializations and report the mean and standard deviation of the test accuracy. We intentionally measured the regularization performance on a low-capacity network. It is well known that there is more room for improvement in a low-capacity network than in a high-capacity network in terms of generalization error. It is also true that it is harder to improve the generalization error of a low-capacity network than that of a high-capacity network [Goodfellow et al., 2016, chap. 5]. The results for the CIFAR-10 and CIFAR-100 datasets are shown in Table 2.

The results show that the regularization effect of our method is clearly superior to that of the other methods. On CIFAR-10, the existing regularization techniques improve on the baseline by about 1 percentage point at most, whereas GAT improves it by about 2.9 points (77.48% to 80.33%). On CIFAR-100, the existing techniques gain at most about 2 points, whereas GAT gains about 6.1 points (44.32% to 50.44%). Because GAT does not depend on the internal structure of the classifier, it appears to be a powerful regularization technique that can be applied to any neural network.

Method                          | Test accuracy (%)
Baseline                        | 77.48 ± 0.46
Dropout                         | 78.49 ± 0.64
Random Perturbation             | 77.59 ± 0.57
Adv. training (FG, $L_\infty$)  | 78.12 ± 0.59
Adv. training (FG, $L_2$)       | 77.99 ± 0.45
Adv. training (GAT)             | 80.33 ± 0.44
Dropout + GAT                   | 81.62 ± 0.34
(a) CIFAR-10

Method                          | Test accuracy (%)
Baseline                        | 44.32 ± 0.63
Dropout                         | 46.29 ± 0.61
Random Perturbation             | 44.43 ± 0.71
Adv. training (FG, $L_\infty$)  | 45.16 ± 0.73
Adv. training (FG, $L_2$)       | 45.67 ± 0.63
Adv. training (GAT)             | 50.44 ± 0.56
Dropout + GAT                   | 50.71 ± 0.49
(b) CIFAR-100
Table 2: Test accuracy for CIFAR datasets
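The size of each method's improvement over the baseline can be read off Table 2 with a little arithmetic. The snippet below recomputes the gains, in percentage points, for a subset of the rows using the mean accuracies reported in the table; the dictionary layout and variable names are illustrative.

```python
# Mean test accuracies (%) copied from Table 2.
baseline = {"CIFAR-10": 77.48, "CIFAR-100": 44.32}
methods = {
    "Dropout": {"CIFAR-10": 78.49, "CIFAR-100": 46.29},
    "Random Perturbation": {"CIFAR-10": 77.59, "CIFAR-100": 44.43},
    "Adv. training (GAT)": {"CIFAR-10": 80.33, "CIFAR-100": 50.44},
    "Dropout + GAT": {"CIFAR-10": 81.62, "CIFAR-100": 50.71},
}

# Gain over the baseline in percentage points, rounded to two decimals.
gains = {
    name: {ds: round(acc[ds] - baseline[ds], 2) for ds in baseline}
    for name, acc in methods.items()
}

for name, gain in gains.items():
    print(f"{name:22s} CIFAR-10: +{gain['CIFAR-10']:.2f}  CIFAR-100: +{gain['CIFAR-100']:.2f}")
```

These differences compare means only; the standard deviations in Table 2 should be kept in mind when judging the smaller gains.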

5 Discussion

We proposed a novel adversarial training method by combining adversarial training [Goodfellow et al., 2014b] and GAN [Goodfellow et al., 2014a]. Experimental results show that our proposed method is not only robust against adversarial examples, but also effective in improving the generalization accuracy of the classifier. We believe that there are two main reasons to explain the better performance than the conventional fast gradient method.

First, the classifier's robustness differs from one training example to another. For some images, the classifier is easily fooled by a perturbation of low power, whereas for others it is not easily fooled even by a perturbation of very high power. Because the conventional fast gradient method normalizes the size of each gradient, it generates adversarial images with the same perturbation power for all training images, which makes it difficult for the network to converge. Generating adversarial examples adapted to the degree of robustness of each image can help train the network efficiently.
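The point about per-example robustness can be illustrated with a linear classifier, for which the smallest L2 perturbation that reaches the decision boundary has a closed form. The classifier weights and the two example inputs below are assumptions chosen for the illustration.

```python
import math

def min_l2_perturbation(w, b, x):
    """Distance from x to the decision boundary of a linear classifier:
    |w . x + b| / ||w||_2."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(s) / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], 0.0  # hypothetical linear classifier
examples = {
    "near boundary": [0.2, 0.1],
    "far from boundary": [2.0, 3.0],
}

required = {name: min_l2_perturbation(w, b, x) for name, x in examples.items()}

eps = 1.0  # one fixed perturbation power applied to every example
for name, r in required.items():
    status = "can be fooled" if eps >= r else "cannot be fooled"
    print(f"{name}: needs {r:.2f}, with eps = {eps} it {status}")
```

A single fixed perturbation power is far more than the first example needs and far less than the second requires, which is the mismatch that a generator adapting its perturbation to each input is meant to avoid.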

Second, the classifier network is a non-linear function. If the classifier network is a perfect linear function, finding better adversarial images than those found by the fast gradient method is impossible. However, because the classifier network is a non-linear function, GAT can detect non-linear patterns in the classifier network and use a gradient to produce better perturbations than with the fast gradient method.

Our proposed GAT addresses both issues because it does not normalize the gradient vector and it evolves adaptively as the classifier is trained. It is therefore possible to train a classifier that is robust to various adversarial examples, which in turn effectively regularizes the model. However, further study is required because the proposed method takes longer to train than the conventional fast gradient method, by a factor that depends on the capacity of the generator network. In addition, the hyperparameters of each network should be tuned carefully because of the properties of GAN training.

6 Conclusions

In this paper, we proposed a method to make a classifier robust to adversarial examples using a GAN framework. The generator network generates a perturbation by finding the weaknesses of the classifier, and the classifier learns to assign the original label to the images perturbed by the generator. As the two networks learn alternately, the classifier network becomes more robust to adversarial images, and eventually the generator network is no longer able to find an image that fools the classifier. Our adversarial training method is also practical because it does not need an expensive optimization process in the inner loop to find optimal adversarial images. The classifier trained with our method is highly robust to adversarial examples. Furthermore, the proposed method proved surprisingly effective in regularizing neural networks.

To the best of our knowledge, this is the first method to apply a GAN framework to adversarial training (or supervised learning). Therefore, much work remains to improve the method. What is the optimal capacity of a generator network? Does additional information other than a gradient exist that can help the generator to find a better adversarial image? When our method is applied to larger networks such as Inception [Szegedy et al., 2016], can similar results to those in this study be achieved? Further research is required to address these issues.