APE-GAN: Adversarial Perturbation Elimination with GAN

Although neural networks can achieve state-of-the-art performance in image recognition, they often fail badly on adversarial examples: inputs generated by applying imperceptible but intentional perturbations to clean samples from the datasets. How to defend against adversarial examples is an important problem that is well worth researching. So far, very few methods have provided a significant defense against adversarial examples. In this paper, a novel idea is proposed and an effective framework based on Generative Adversarial Nets, named APE-GAN, is implemented to defend against adversarial examples. The experimental results on three benchmark datasets, MNIST, CIFAR10 and ImageNet, indicate that APE-GAN is effective at resisting adversarial examples generated by five attacks.




1 Introduction

Deep neural networks have recently achieved excellent performance on a variety of visual and speech recognition tasks. However, they have non-intuitive characteristics and intrinsic blind spots that are easy to attack using obscure manipulation of their input [6, 11, 19, 26]. In many cases, the structure of the neural networks is strongly related to the training data distribution, which is in contradiction with the network's ability to achieve high generalization performance.

Szegedy et al. [26] first noticed that test samples with imperceptible perturbations can be misclassified by neural networks. They term this kind of subtly perturbed samples “adversarial examples”. In contrast to noise samples, adversarial examples are imperceptible, designed intentionally, and more likely to cause false predictions in the image classification domain. More seriously, adversarial examples transfer across models, a property named transferability, which can be leveraged to perform black-box attacks [14, 19]. In other words, an adversary can generate adversarial examples on a substitute model trained by the adversary and apply them to attack the target model. However, so far transferability mostly appears on small datasets, such as MNIST and CIFAR10. Transferability over large-scale datasets, such as ImageNet, has yet to be better understood. Therefore, in this paper, black-box attacks are not taken into consideration, and resisting white-box attacks (where the adversary has complete access to the target model, including the architecture and all parameters) is the core work of the paper.

Figure 1: A demonstration of APE-GAN applied to MNIST and CIFAR10 classification networks. The images with a small perturbation, named adversarial samples, cannot be correctly classified with high confidence. However, the samples processed by our model can be classified correctly.

Adversarial examples pose potential security threats for practical machine learning applications. Recent research has shown that a large fraction of adversarial examples are classified incorrectly even when obtained from a cell-phone camera [11]. This makes it possible for an adversary to craft adversarial images of traffic signs that cause self-driving cars to take unwanted actions [20]. Therefore, research on resisting adversarial examples is significant and urgent.

Until now, there have been two classes of approaches to defending against adversarial examples. The straightforward way is to make the model inherently more robust with enhanced training data, as shown in Figure 2, or with adjusted learning strategies. Adversarial training [6, 27] and defensive distillation [18, 22] belong to this class. It is noteworthy that the original defensive distillation is broken by Carlini and Wagner's attack [2], which can, however, be resisted by extended defensive distillation. In addition, the original adversarial training remains highly vulnerable to transferred adversarial examples crafted on other models, as discussed in the work on ensemble adversarial training, and models trained using ensemble adversarial training are slightly less robust to some white-box attacks. The second class is a series of detection mechanisms used to detect and reject adversarial samples [4, 15]. Unfortunately, Carlini et al. [1] bypass ten detection methods and indicate that adversarial examples generated by Carlini and Wagner's attack are significantly harder to detect than previously appreciated. Therefore, defending against adversarial examples is still a huge challenge.

Misclassification of adversarial examples is mainly due to the intentionally imperceptible perturbations applied to some pixels of the input images. Thus, we propose an algorithm that eliminates the adversarial perturbation of the input data in order to defend against adversarial examples. Adversarial perturbation elimination can be defined as the problem of learning a manifold mapping from adversarial examples to original examples. The Generative Adversarial Net (GAN) proposed by Goodfellow et al. [5] is able to generate images similar to the training set from random noise. Therefore, we designed a framework utilizing GAN to generate clean examples from adversarial examples. Meanwhile, SRGAN [13], the successful application of GAN to super-resolution, provides valuable experience for the implementation of our algorithm.

In this paper, an effective framework is implemented to eliminate the aggressivity of adversarial examples before they are recognized, as shown in Figure 2. The code and trained models of the framework are available at https://github.com/shenqixiaojiang/APE-GAN, and we welcome new attacks to break our defense.

Figure 2: (a) The traditional deep learning framework is robust to clean images but highly vulnerable to adversarial examples. (b) The existing adversarial training framework can increase the robustness of the target model with enhanced training data. (c) We propose an adversarial perturbation elimination framework named APE-GAN that eliminates the perturbation of the adversarial examples before feeding them into the target model to increase the robustness.

This paper makes the following contributions:

  • A new perspective of defending against adversarial examples is proposed. The idea is to first eliminate the adversarial perturbation using a trained network and then feed the processed example to classification networks.

  • An effective and reasonable framework based on the above idea is implemented to resist adversarial examples. The experimental results on three benchmark datasets demonstrate the effectiveness.

  • The proposed framework possesses strong applicability. It can tackle adversarial examples without knowing what target model they are constructed upon.

  • The training procedure of the APE-GAN needs no knowledge of the architecture and parameters of the target model.
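The first contribution amounts to composing a perturbation-elimination network with an unchanged classifier. A minimal sketch of that composition follows; `toy_eliminate` and `toy_classify` are hypothetical stand-ins that only make the example runnable and are not the trained generator or any target model from the paper.

```python
from typing import Callable, List

Image = List[float]  # flattened pixel values in [0, 1]


def defend(eliminate: Callable[[Image], Image],
           classify: Callable[[Image], int]) -> Callable[[Image], int]:
    """Wrap an unchanged classifier so that every input is cleaned first."""
    def defended(x: Image) -> int:
        return classify(eliminate(x))
    return defended


def toy_eliminate(x: Image) -> Image:
    # Hypothetical placeholder for a trained generator G:
    # snaps every pixel to 0 or 1.
    return [1.0 if p >= 0.5 else 0.0 for p in x]


def toy_classify(x: Image) -> int:
    # Hypothetical placeholder for a target model:
    # class 1 if the mean brightness exceeds 0.5, else class 0.
    return int(sum(x) / len(x) > 0.5)


defended_classify = defend(toy_eliminate, toy_classify)
```

Because the wrapper only needs a callable classifier, the elimination step can in principle be placed in front of any target model without retraining it.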

2 Related Work

In this section, methods of generating adversarial examples that are closely related to this work are briefly reviewed. In addition, GAN and its connection to our method are discussed.

In the remainder of the paper we use the following notation and terminology:

  • $X$ - the clean image from the datasets.

  • $X^{adv}$ - the adversarial image.

  • $y_{true}$ - the true class for the input $X$.

  • $y_{fool}$ - the class label different from $y_{true}$ for the input $X$.

  • $f$ - the classifier mapping from input image to a discrete label set.

  • $J(X, y)$ - the cost function used to train the model given image $X$ and class $y$.

  • $\epsilon$ - the size of worst-case perturbations; $\epsilon$ is the upper bound of the $L_\infty$ norm of the perturbation.

  • $Clip_{X,\epsilon}\{A\}$ - an $\epsilon$-neighbourhood clipping of $A$ to the range $[X_{i,j} - \epsilon, X_{i,j} + \epsilon]$.

  • Non-targeted adversarial attack - its goal is to slightly modify a clean image so that it will be classified incorrectly by the classifier.

  • Targeted adversarial attack - its goal is to slightly modify a source image so that it will be classified as a specified target class by the classifier.

2.1 Methods Generating Adversarial Examples

In this subsection, the six approaches we used to generate adversarial images are briefly described.

2.1.1 L-BFGS Attack

Crafting a minimum-distortion adversarial perturbation can be defined as an optimization problem and solved by box-constrained L-BFGS under the $L_2$ distance metric [26].

The optimization problem can be formulated as:

$$\min_{X^{adv}} \; c \cdot \|X - X^{adv}\|_2 + J(X^{adv}, y_{fool}) \quad \text{s.t.} \quad X^{adv} \in [0, 1]^n$$

The constant $c$ controls the trade-off between the perturbation's amplitude and its attack power, and can be found with line search.

2.1.2 Fast Gradient Sign Method Attack (FGSM)

Goodfellow et al. [6] proposed this method to generate adversarial images under the $L_\infty$ distance metric.

Given the input $X$, the fast gradient sign method generates the adversarial image $X^{adv}$ with:

$$X^{adv} = X + \epsilon \cdot \mathrm{sign}\left(\nabla_X J(X, y_{true})\right) \tag{2}$$

where $\epsilon$ is chosen to control the perturbation's amplitude.

Eqn. 2 indicates that all pixels of the input $X$ are shifted simultaneously in the direction of the sign of the gradient with a single step. This method is simpler and faster than other methods, but has a lower attack success rate, since it was designed from the beginning to be fast rather than optimal.
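The single-step update can be illustrated on a model small enough to differentiate by hand. The sketch below assumes a logistic-regression classifier with cross-entropy loss, for which the input gradient is $(p - y_{true})\,w$; the weights, input and $\epsilon$ are made-up illustrative values, and clipping to the valid pixel range is omitted.

```python
import math
from typing import List


def sign(v: float) -> int:
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def input_gradient(w: List[float], b: float, x: List[float], y_true: int) -> List[float]:
    """Gradient of the cross-entropy loss with respect to the input x
    for the model p = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y_true) * wi for wi in w]


def fgsm(w: List[float], b: float, x: List[float], y_true: int, eps: float) -> List[float]:
    """One step of size eps along the sign of the input gradient."""
    grad = input_gradient(w, b, x, y_true)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Predicted class of the toy model (1 if the logit is positive)."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Illustrative values only (assumptions, not from the paper).
w, b = [2.0, -3.0, 1.0], 0.0
x, y_true = [0.5, 0.2, 0.4], 1
x_adv = fgsm(w, b, x, y_true, eps=0.3)
```

In this toy case every coordinate moves by exactly `eps`, and the prediction flips from class 1 to class 0.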

2.1.3 Iterative Gradient Sign

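A straightforward way to extend FGSM, introduced by Kurakin et al. [11], is to apply it several times with a smaller step size $\alpha$, clipping the intermediate result with $Clip_{X,\epsilon}$ after each step. Formally,

$$X^{adv}_0 = X, \qquad X^{adv}_{N+1} = Clip_{X,\epsilon}\left\{ X^{adv}_N + \alpha \cdot \mathrm{sign}\left(\nabla_X J(X^{adv}_N, y_{true})\right) \right\}$$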
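A minimal sketch of the iterative variant, assuming the gradient is supplied as a callable: each step moves by `alpha` along the gradient sign and is then clipped back into the $\epsilon$-neighbourhood of the clean input. The names `grad_fn`, `alpha` and `n_iter`, and the toy logistic model used for the gradient, are assumptions for illustration.

```python
import math
from typing import Callable, List


def clip_to_neighbourhood(a: List[float], x: List[float], eps: float) -> List[float]:
    """Clip each coordinate of a to the range [x_i - eps, x_i + eps]."""
    return [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(a, x)]


def iterative_gradient_sign(grad_fn: Callable[[List[float]], List[float]],
                            x: List[float], eps: float,
                            alpha: float, n_iter: int) -> List[float]:
    """Apply n_iter gradient-sign steps of size alpha, clipping after each."""
    x_adv = list(x)
    for _ in range(n_iter):
        g = grad_fn(x_adv)
        stepped = [xi + alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(x_adv, g)]
        x_adv = clip_to_neighbourhood(stepped, x, eps)
    return x_adv


# Toy logistic-regression gradient (illustrative assumption).
w, b, y_true = [2.0, -3.0, 1.0], 0.0, 1


def grad_fn(x: List[float]) -> List[float]:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y_true) * wi for wi in w]


x = [0.5, 0.2, 0.4]
x_adv = iterative_gradient_sign(grad_fn, x, eps=0.3, alpha=0.1, n_iter=5)
x_one_step = iterative_gradient_sign(grad_fn, x, eps=0.3, alpha=0.3, n_iter=1)
```

With a single iteration and `alpha` equal to `eps`, the procedure reduces to one FGSM step.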


2.1.4 DeepFool Attack

DeepFool is a non-targeted attack introduced by Moosavi-Dezfooli et al. [16]. It is one of the methods that apply the minimal perturbation for misclassification under the $L_2$ distance metric. The method performs iterative steps in the adversarial direction of the gradient provided by a locally linear approximation of the classifier until the decision hyperplane is crossed. The objective of DeepFool is

$$\min_{\eta} \; \|\eta\|_2 \quad \text{s.t.} \quad f(X + \eta) \neq f(X)$$

Although it is different from L-BFGS, the attack can also be seen as a first-order method to craft adversarial perturbation.
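For an affine binary classifier the minimal $L_2$ perturbation has a closed form: the input is projected onto the decision hyperplane. The sketch below shows this special case only; the general attack repeats such a step on a local linearization of a nonlinear classifier. The small `overshoot` factor that pushes the point just across the boundary, and all numeric values, are assumptions for illustration.

```python
from typing import List


def affine_score(w: List[float], b: float, x: List[float]) -> float:
    """Score of an affine binary classifier; the class is the sign of the score."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def deepfool_affine_binary(w: List[float], b: float, x: List[float],
                           overshoot: float = 0.02) -> List[float]:
    """Smallest L2 step that reaches the hyperplane w . x + b = 0,
    scaled by (1 + overshoot) so that the boundary is actually crossed."""
    score = affine_score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    r = [-(score / norm_sq) * wi for wi in w]
    return [xi + (1.0 + overshoot) * ri for xi, ri in zip(x, r)]


# Illustrative values only.
w, b = [2.0, -3.0, 1.0], 0.0
x = [0.5, 0.2, 0.4]
x_adv = deepfool_affine_binary(w, b, x)
```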

2.1.5 Jacobian-Based Saliency Map Attack (JSMA)

JSMA is a targeted attack and is also gradient-based: it uses the gradient to compute a saliency score for each pixel. The saliency score reflects how strongly each pixel can affect the resulting classification. Given the saliency map computed from the model's Jacobian matrix, the attack greedily modifies the most important pixel at each iteration until the prediction has changed to the target class. This attack seeks to craft adversarial perturbation under the $L_0$ distance metric [21].

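2.1.6 Carlini and Wagner Attack (CW)

Carlini et al. [2] proposed three attacks, for the $L_0$, $L_2$ and $L_\infty$ distance metrics. Here, we give a brief description of the $L_2$ attack only. The objective of the $L_2$ attack is

$$\min_{w} \; \left\| \tfrac{1}{2}(\tanh(w) + 1) - X \right\|_2^2 + c \cdot \ell\left( \tfrac{1}{2}(\tanh(w) + 1) \right)$$

where the loss function $\ell$ is defined as

$$\ell(X^{adv}) = \max\left( \max\{ Z(X^{adv})_i : i \neq t \} - Z(X^{adv})_t, \; -\kappa \right)$$

$Z$ is the logits of a given model, $t$ is the target class, and $\kappa$ is used to control the confidence of adversarial examples. As $\kappa$ increases, the adversarial examples become more powerful. The constant $c$ can be chosen with binary search, and plays a role similar to $c$ in the L-BFGS attack.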
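The margin-style loss of the $L_2$ attack is simple to evaluate once the logits are known. The sketch below assumes the standard Carlini and Wagner formulation: the loss is the gap between the largest non-target logit and the target logit, floored at $-\kappa$, and the attack optimizes an unconstrained variable that is mapped into $[0, 1]$ by a tanh change of variables. Function names and example logits are illustrative.

```python
import math
from typing import List


def to_image(w: List[float]) -> List[float]:
    """Map unconstrained variables to pixel values in [0, 1] via 0.5 * (tanh(w) + 1)."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]


def cw_loss(logits: List[float], target: int, kappa: float = 0.0) -> float:
    """max(max_{i != target} Z_i - Z_target, -kappa)."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


# Target class 0 is not yet the top class: positive loss.
loss_not_fooled = cw_loss([2.0, 5.0, 1.0], target=0)
# Target class leads by a margin of 1; with kappa = 0.5 the loss bottoms out at -0.5.
loss_confident = cw_loss([6.0, 5.0, 1.0], target=0, kappa=0.5)
```

A larger `kappa` keeps the loss active until the target logit leads by at least that margin, which is why higher values yield higher-confidence adversarial examples.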

Carlini and Wagner show that their attacks are superior to other published attacks. Therefore, all three attacks should be taken into consideration when evaluating a defense.

In addition, we use CW-$L_0$, CW-$L_2$ and CW-$L_\infty$ to denote the attacks for the $L_0$, $L_2$ and $L_\infty$ distance metrics respectively in the following experiments.

2.2 Generative Adversarial Nets

Generative Adversarial Net (GAN) is a framework that incorporates an adversarial discriminator into the procedure of training generative models. There are two models in the GAN: a generator $G$ that is optimized to estimate the data distribution and a discriminator $D$ that aims to distinguish between samples from the training data and fake samples from $G$.

The objective of GAN can be formulated as a minimax value function $V(G, D)$:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

GAN is known to be unstable to train, often resulting in generators that produce nonsensical outputs, since it is difficult to maintain a balance between $G$ and $D$.

The Deep Convolutional Generative Adversarial Net (DCGAN) [23] is a good implementation of GAN with convolutional networks that makes it stable to train in most settings.

Our model in this paper is implemented based on DCGAN owing to its stability. The details will be discussed in the following section.

3 Our Approach

The fundamental idea of defending against adversarial examples is to eliminate or damage the subtle perturbations of the input before it is recognized by the target model.

The infinitesimal difference between the adversarial image and the clean image can be formulated as:

$$\|X^{adv} - X\| = \eta$$

Ideally, the perturbation $\eta$ can be removed from $X^{adv}$. That means the distribution of the reconstructed $X^{adv}$ is highly consistent with that of $X$.

The global optimum of GAN is reached when the generative distribution $p_g$ of $G$ is consistent with the data generating distribution $p_{data}$:

$$p_g = p_{data}$$

The procedure of $G$ converging to a good estimator of $p_{data}$ coincides with the demand for the elimination of the adversarial perturbation $\eta$.

Based on the above analysis, a novel GAN-based framework to eliminate adversarial perturbations is proposed. We name this class of GAN-based architectures for defending against adversarial examples adversarial perturbation elimination with GAN (APE-GAN), as shown in Figure 2.

The APE-GAN network is trained in an adversarial setting. While the generator $G$ is trained to alter the perturbation with tiny changes to the input examples, the discriminator $D$ is optimized to separate the clean examples from the reconstructed examples, without adversarial perturbations, obtained from $G$. To achieve this, a task-specific fusion loss function is designed to make the processed adversarial examples highly consistent with the manifold of original clean images.

3.1 Architecture

The ultimate goal of APE-GAN is to train a generating function $G$ that estimates, for a given adversarial input image $X^{adv}$, its corresponding clean counterpart $X$. To achieve this, a generator network $G_{\theta_G}$ parametrized by $\theta_G$ is trained. Here $\theta_G$ denotes the weights and biases of the generator network and is obtained by optimizing an adversarial perturbation elimination specific loss function $l_{ape}$. For training images $X^{adv}_k$ obtained by applying FGSM, with corresponding original clean images $X_k$, $k = 1, \dots, N$, we solve:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{k=1}^{N} l_{ape}\left( G_{\theta_G}(X^{adv}_k), X_k \right)$$
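A discriminator network $D_{\theta_D}$ along with $G_{\theta_G}$ is defined to solve the adversarial zero-sum problem:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{X \sim p_{clean}(X)} \log D_{\theta_D}(X) + \mathbb{E}_{X^{adv} \sim p_{adv}(X^{adv})} \log\left(1 - D_{\theta_D}(G_{\theta_G}(X^{adv}))\right) \tag{10}$$

| Attack | MNIST: Target Model | MNIST: APE-GAN | CIFAR10: Target Model | CIFAR10: APE-GAN | ImageNet (Top-1): Target Model | ImageNet (Top-1): APE-GAN |
|---|---|---|---|---|---|---|
| L-BFGS | 93.4 | 2.2 | 92.7 | 19.9 | 93.3 | 42.9 |
| FGSM | 96.3 | 2.8 | 77.8 | 26.4 | 72.9 | 40.1 |
| DeepFool | 97.1 | 2.2 | 98.3 | 19.2 | 98.4 | 45.9 |
| JSMA | 97.8 | 38.6 | 94.1 | 38.3 | 98.7 | 45.0 |
| CW-$L_0$ | 100.0 | 27.0 | 100.0 | 46.9 | 100.0 | 29.4 |
| CW-$L_2$ | 100.0 | 1.5 | 100.0 | 30.5 | 99.7 | 26.1 |
| CW-$L_\infty$ | 100.0 | 1.2 | 100.0 | 32.2 | 100.0 | 27.0 |

Table 1: Error rates (in %) of adversarial examples generated from five methods for the target model and APE-GAN on MNIST, CIFAR10 and ImageNet. The error rates of target models on the clean images are reported in the experimental setup.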

The general idea behind this formulation is that it allows one to train a generative model $G$ with the goal of deceiving a differentiable discriminator $D$ that is trained to tell apart reconstructed images $G(X^{adv})$ from original clean images. Consequently, the generator can be trained to produce reconstructed images that are highly similar to original clean images, so that $D$ is unable to distinguish them.

The general architecture of our generator network $G$ is illustrated in Figure 2. Some convolutional layers with stride = 2 are used to obtain feature maps with lower resolution, followed by some deconvolutional layers with stride = 2 to recover the original resolution.

To discriminate original clean images from reconstructed images, we train a discriminator network $D$. The general architecture is illustrated in Figure 2. The discriminator network is trained to solve the maximization problem in Equation 10. It also contains some convolutional layers with stride = 2 to obtain high-level feature maps, two dense layers and a final sigmoid activation function to obtain a probability for sample classification.

The specific architectures on MNIST, CIFAR10 and ImageNet are introduced in the experimental setup.

3.2 Loss Function

3.2.1 Discriminator Loss

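According to Equation 10, the loss function of the discriminator, $l_d$, follows directly:

$$l_d = -\sum_{n=1}^{N} \left[ \log D_{\theta_D}(X) + \log\left(1 - D_{\theta_D}(G_{\theta_G}(X^{adv}))\right) \right]$$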
3.2.2 Generator Loss

The definition of our adversarial perturbation elimination specific loss function $l_{ape}$ is critical for the ability of our generator network to produce images without adversarial perturbations. We define $l_{ape}$ as the weighted sum of several loss functions:

$$l_{ape} = \xi_1 l_{mse} + \xi_2 l_{adv} \tag{12}$$

which consists of a pixel-wise MSE (mean square error) loss and an adversarial loss.

  • Content Loss: Inspired by the image super-resolution method [13], the pixel-wise MSE loss over an image of width $W$ and height $H$ is defined as:

    $$l_{mse} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( X_{x,y} - G_{\theta_G}(X^{adv})_{x,y} \right)^2$$

    Adversarial perturbations can be viewed as a special, delicately constructed noise. The most widely used loss for image denoising or super-resolution is able to achieve satisfactory results for adversarial perturbation elimination.

  • Adversarial Loss: To encourage our network to produce images residing on the manifold of original clean images, the adversarial loss of the GAN is also employed. The adversarial loss is calculated based on the probabilities of the discriminator over all reconstructed images:

    $$l_{adv} = \sum_{n=1}^{N} \left[ 1 - \log D_{\theta_D}(G_{\theta_G}(X^{adv})) \right]$$
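The way the loss terms combine can be checked numerically. The sketch below is written under stated assumptions: the content loss is a pixel-wise MSE over a flattened image, the adversarial term has the form $1 - \log D(G(X^{adv}))$ summed over the batch (the constant does not affect gradients), the discriminator loss is the usual negative log-likelihood, and the weights default to the 0.7 and 0.3 reported in the experimental setup. The discriminator outputs are passed in as plain probabilities rather than computed by a network.

```python
import math
from typing import List


def mse_loss(clean: List[float], recon: List[float]) -> float:
    """Pixel-wise mean squared error between a clean image and its reconstruction."""
    return sum((c - r) ** 2 for c, r in zip(clean, recon)) / len(clean)


def adversarial_loss(d_on_recon: List[float]) -> float:
    """Sum over the batch of 1 - log D(G(x_adv)); small when D is fooled."""
    return sum(1.0 - math.log(p) for p in d_on_recon)


def ape_loss(clean: List[float], recon: List[float], d_on_recon: List[float],
             xi1: float = 0.7, xi2: float = 0.3) -> float:
    """Weighted sum of content loss and adversarial loss."""
    return xi1 * mse_loss(clean, recon) + xi2 * adversarial_loss(d_on_recon)


def discriminator_loss(d_on_clean: List[float], d_on_recon: List[float]) -> float:
    """Negative log-likelihood of labelling clean images real and reconstructions fake."""
    return -sum(math.log(pc) + math.log(1.0 - pr)
                for pc, pr in zip(d_on_clean, d_on_recon))


# Two-pixel toy image, and a discriminator that is completely fooled (D = 1.0).
generator_loss = ape_loss(clean=[0.0, 1.0], recon=[0.0, 0.5], d_on_recon=[1.0])
# A discriminator that cannot tell the two apart outputs 0.5 for both inputs.
d_loss = discriminator_loss(d_on_clean=[0.5], d_on_recon=[0.5])
```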


4 Evaluation

APE-GAN is evaluated against the L-BFGS, DeepFool, JSMA, FGSM and CW (CW-$L_0$, CW-$L_2$ and CW-$L_\infty$) attacks introduced in the related work, on three standard datasets: MNIST [12], a database of handwritten digits with 70,000 28x28 gray images in 10 classes (digits 0-9); CIFAR10 [10], a dataset of 60,000 32x32 colour images in 10 classes; and ImageNet [3], a large image recognition task with 1000 classes and more than 1,000,000 images.

It is noteworthy that the adversarial samples cannot be saved in the form of pictures, since discretizing each real-valued pixel to one of 256 levels seriously degrades the quality of the adversarial perturbation. They should therefore be saved and loaded as float32.
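One way the discretization can hurt is illustrated below: a perturbation smaller than half of an 8-bit quantization step disappears entirely when the pixel is rounded to one of 256 levels, while a float32 round trip keeps it. The pixel value and perturbation size are made-up illustrative numbers, and the helper names are hypothetical.

```python
import struct


def to_uint8_and_back(v: float) -> float:
    """Quantize a pixel in [0, 1] to one of 256 levels, as an 8-bit image would."""
    return round(v * 255) / 255


def to_float32_and_back(v: float) -> float:
    """Round-trip a value through a 4-byte IEEE 754 single-precision float."""
    return struct.unpack("f", struct.pack("f", v))[0]


clean_pixel = 100 / 255
adv_pixel = clean_pixel + 0.001  # perturbation below half a quantization step (1/510)

lost_in_uint8 = to_uint8_and_back(adv_pixel) == to_uint8_and_back(clean_pixel)
float32_error = abs(to_float32_and_back(adv_pixel) - adv_pixel)
```

Larger perturbations survive quantization only approximately, which is consistent with the recommendation to store adversarial samples as float32.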

| Input | MNIST: C | MNIST: APE-GAN | CIFAR10: DenseNet40 | CIFAR10: APE-GAN | ImageNet (Top-1): InceptionV3 | ImageNet (Top-1): APE-GAN |
|---|---|---|---|---|---|---|
| clean image | 0.8 | 1.2 | 9.9 | 10.3 | 22.9 | 24.0 |
| random Gaussian noise image | 1.7 | 1.6 | 11.3 | 10.7 | 25.2 | 24.5 |

Table 2: Error rates (in %) of benign input for target models and APE-GAN on MNIST, CIFAR10 and ImageNet. Here, the target models are model C, DenseNet40 and InceptionV3, which are identical to the target models for the FGSM attack on MNIST, CIFAR10 and ImageNet respectively.

| FGSM $\epsilon$ | MNIST: C | MNIST: APE-GAN | CIFAR10: DenseNet40 | CIFAR10: APE-GAN |
|---|---|---|---|---|
| $\epsilon$ = 0.1 | 35.9 | 0.8 | 77.8 | 26.4 |
| $\epsilon$ = 0.2 | 86.0 | 1.1 | 84.7 | 45.2 |
| $\epsilon$ = 0.3 | 96.3 | 2.8 | 86.3 | 55.9 |
| $\epsilon$ = 0.4 | 98.0 | 21.0 | 87.2 | 63.4 |

Table 3: Error rates (in %) of adversarial examples generated from FGSM with different $\epsilon$ for target models and APE-GAN on MNIST and CIFAR10. The error rates of target models on the clean images are reported in the experimental setup. Here, the target models are model C and DenseNet40, which are identical to the target models for the FGSM attack on MNIST and CIFAR10 respectively.

4.1 Experimental Setup

4.1.1 Input

The input samples of the target model can be divided into adversarial input, obtained from the attack approaches, and benign input, which is what the traditional deep learning framework takes into account and includes clean images and clean images with added random noise. Adding random noise to original clean images is a common data augmentation trick to improve robustness, but it does not belong to the standard training procedure of the target model. Hence, it is not shown in Figure 2.

  • Adversarial Input: The FGSM and JSMA attacks are implemented in cleverhans v.1 [17], a Python library to benchmark machine learning systems' vulnerability to adversarial examples, and the L-BFGS and DeepFool attacks are implemented in Foolbox [24], a Python toolbox to create adversarial examples that fool neural networks. The code of the CW attack is provided by Carlini et al. [2]. We experiment with $\epsilon$ = 0.3 on MNIST, $\epsilon$ = 0.1 on CIFAR10 and $\epsilon$ = 8 / 255 on ImageNet for the FGSM attack, and $\kappa$ = 0 on the three datasets for the CW-$L_2$ attack; default parameters are used for the other attacks.

  • Benign Input: The full test sets of MNIST and CIFAR10 are used for the evaluation, while results on ImageNet use a random sample of 10,000 RGB inputs from the test set. In addition, Gaussian white noise with mean 0 and variance 0.05 is employed in the following.
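A minimal sketch of the noisy benign input, assuming pixel values in $[0, 1]$: Gaussian noise with mean 0 and variance 0.05 corresponds to a standard deviation of $\sqrt{0.05} \approx 0.224$. Clipping the result back to $[0, 1]$ is an assumption of this sketch; the text does not state how out-of-range values are handled.

```python
import math
import random
from typing import List, Optional


def add_gaussian_noise(image: List[float], variance: float = 0.05,
                       rng: Optional[random.Random] = None,
                       clip: bool = True) -> List[float]:
    """Add zero-mean Gaussian white noise of the given variance to each pixel."""
    rng = rng or random.Random()
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    noisy = [p + rng.gauss(0.0, std) for p in image]
    if clip:
        # Assumption: keep pixels in the valid [0, 1] range.
        noisy = [min(max(p, 0.0), 1.0) for p in noisy]
    return noisy


# Seeded generator so the example is reproducible.
flat_image = [0.5] * 20000
noisy_image = add_gaussian_noise(flat_image, rng=random.Random(0), clip=False)
```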

4.1.2 Target Models

In order to provide the most accurate and fair comparison, whenever possible, the models provided by the authors or libraries should be used.

  • MNIST: We train a convolutional network (denoted A in the Appendix) for the L-BFGS and DeepFool attacks. For the CW attack, the model is provided by Carlini (denoted B in the Appendix). For the FGSM and JSMA attacks, the model is provided by cleverhans (denoted C in the Appendix). Models A, B and C achieve error rates of 0.9%, 0.5% and 0.8% on clean images respectively, comparable to the state of the art.

  • CIFAR10: We train ResNet18 [7] for the L-BFGS and DeepFool attacks. For the CW attack, the model is provided by Carlini (denoted D in the Appendix). For the FGSM and JSMA attacks, DenseNet [8] with depth = 40 is trained. ResNet18, D and DenseNet40 achieve error rates of 7.1%, 20.2% and 9.9% on clean images respectively.

    Footnote 1: It is noteworthy that the 20.2% error rate of target model D is significantly greater than that of the other models; it is, however, identical to the error rate reported by Carlini [2]. For accurate comparison, we respect the choice of the author.

  • ImageNet: We use ResNet50, one of the pre-trained networks, for the L-BFGS and DeepFool attacks. For the other three attacks, another pre-trained network, InceptionV3, is used [25]. ResNet50 achieves a top-1 error rate of 24.4% and a top-5 error rate of 7.2%, while InceptionV3 achieves a top-1 error rate of 22.9% and a top-5 error rate of 6.1% on clean images.

| FGSM $\epsilon$ | ImageNet (Top-1): InceptionV3 | ImageNet (Top-1): APE-GAN | ImageNet (Top-5): InceptionV3 | ImageNet (Top-5): APE-GAN |
|---|---|---|---|---|
| $\epsilon$ = 4 / 255 | 72.2 | 38.0 | 41.7 | 21.0 |
| $\epsilon$ = 8 / 255 | 72.9 | 40.1 | 42.3 | 22.5 |
| $\epsilon$ = 12 / 255 | 73.4 | 41.2 | 43.4 | 22.9 |
| $\epsilon$ = 16 / 255 | 74.8 | 42.4 | 44.0 | 23.6 |

Table 4: Error rates (in %) of adversarial examples generated from FGSM with different $\epsilon$ for the target model InceptionV3 and APE-GAN on ImageNet. The error rate of the target model on the clean images is reported in the experimental setup.

4.1.3 APE-GAN

Three models are trained with the APE-GAN architecture on MNIST, CIFAR10 and ImageNet (denoted APE-GAN$_m$, APE-GAN$_c$ and APE-GAN$_i$ in the Appendix). The full training sets of MNIST and CIFAR10 are used to train APE-GAN$_m$ and APE-GAN$_c$ respectively, while a random sample of 50,000 RGB inputs from the training set of ImageNet is used to train APE-GAN$_i$.

  • MNIST: APE-GAN$_m$ is trained for 2 epochs on batches of 64 FGSM samples with $\epsilon$ = 0.3, input size = (28,28,1).

  • CIFAR10: APE-GAN$_c$ is trained for 10 epochs on batches of 64 FGSM samples with $\epsilon$ = 0.1, input size = (32,32,3).

  • ImageNet: APE-GAN$_i$ is trained for 30 epochs on batches of 16 FGSM samples with $\epsilon$ = 8 / 255, input size = (256,256,3). However, the input size of ResNet50 is 224 × 224 and that of InceptionV3 is 299 × 299, so we use a resize operation to handle this.

The straightforward way to train the generator and the discriminator is to update both in every batch. However, the discriminator network often learns much faster than the generator network, because generating is a more complex task than distinguishing between real samples and fake samples. Therefore, the generator is updated twice in each iteration to make sure that the loss of the discriminator does not go to zero. The learning rate is initialized to 0.0002 and the Adam [9] optimizer is used to update the parameters of the networks. The weights $\xi_1$ and $\xi_2$ of the adversarial perturbation elimination specific loss used in Eqn. 12 are fixed to 0.7 and 0.3 respectively.
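The update schedule described above can be sketched independently of any deep learning library. In the sketch below, `d_step` and `g_step` are hypothetical callables that would each perform one optimizer update (for instance with Adam at learning rate 0.0002) and return the resulting loss; only the ordering of updates is illustrated.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Batch = Tuple[list, list]  # (adversarial images, matching clean images)


def train_epoch(batches: Iterable[Batch],
                d_step: Callable[[list, list], float],
                g_step: Callable[[list, list], float],
                g_updates_per_batch: int = 2) -> List[Tuple[float, Optional[float]]]:
    """Run one epoch: one discriminator update, then several generator
    updates, for every batch. Returns the last losses seen per batch."""
    history = []
    for adv_batch, clean_batch in batches:
        d_loss = d_step(adv_batch, clean_batch)
        g_loss = None
        for _ in range(g_updates_per_batch):
            g_loss = g_step(adv_batch, clean_batch)
        history.append((d_loss, g_loss))
    return history
```

Keeping the ratio as a parameter makes it easy to experiment with other balances between the two networks.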

The training procedure of the APE-GAN needs no knowledge of the architecture and parameters of the target model.

4.2 Results

4.2.1 Effectiveness:

  • Adversarial Input: Table 1 indicates that the error rates of adversarial inputs decrease significantly after their perturbations are eliminated by APE-GAN. Among all the attacks, the CW attack is more offensive than the others, and among the three CW attacks, CW-$L_0$ is the most offensive. The error rate of FGSM is greater than that of L-BFGS, which may be caused by the different target models. As shown in Figure 3, the aggressivity of adversarial examples can be eliminated by APE-GAN even though the differences between (a) and (b) are imperceptible. In addition, the adversarial examples generated from FGSM with different $\epsilon$ are resisted, and the results are shown in Table 3 and Table 4.

  • Benign Input: The error rates of clean images and of clean images with added random Gaussian noise are shown in Table 2. Actual details within the image can be lost through multiple levels of convolutional and down-sampling layers, which has a negative effect on the classification. However, Table 2 indicates that there is no marked increase in the error rate of clean images. Meanwhile, APE-GAN performs well at resisting random noise. Figure 4 shows that the perturbation generated from random Gaussian noise is irregular and disordered, while the perturbation obtained from the FGSM attack is regular and intentional. However, the perturbation, whether regular or irregular, can be eliminated by APE-GAN.

In summary, APE-GAN performs well on various inputs, whether adversarial or benign, on three benchmark datasets.

Figure 3: ImageNet dataset. (a) Adversarial samples crafted by FGSM with $\epsilon$ = 8 / 255 on ImageNet. (b) Samples reconstructed by APE-GAN.

4.2.2 Strong Applicability:

The experimental setup of the target models shows that more than one target model is used in the experiments on each of MNIST, CIFAR10 and ImageNet. Table 1 demonstrates that APE-GAN can tackle adversarial examples for different target models. In fact, it can provide a defense without knowing which model the adversarial examples are constructed upon. Therefore, we conclude that APE-GAN possesses strong applicability.

Figure 4: MNIST dataset. (a) Clean image. (b) Random Gaussian noise image. (c) Adversarial samples obtained from FGSM with $\epsilon$ = 0.3. (d) Reconstruction of (a) by APE-GAN. (e) Reconstruction of (b) by APE-GAN. (f) Reconstruction of (c) by APE-GAN.

5 Discussion and Future Work

Pre-processing the input to eliminate adversarial perturbations is another appealing aspect of the framework, since it ensures there is no conflict between the framework and other existing defenses. APE-GAN can therefore work together with other defenses such as adversarial training. We also experiment with another method: APE-GAN followed by a target model trained using adversarial training. The results on MNIST and CIFAR10 are shown in Tables 7, 8 and 9 in the Appendix. The adversarial examples used in Table 9 are generated from Iterative Gradient Sign with N = 2. The FGSM used to craft the adversarial examples of Table 8 is identical to Iterative Gradient Sign with N = 1. Compared with Table 8, Table 9 indicates that the robustness of the target model cannot be significantly improved with adversarial training alone. However, the combination of APE-GAN and adversarial training provides a notable defense against Iterative Gradient Sign. New combinations of different defenses will be investigated in future work.

The core contribution of this paper is a new perspective on defending against adversarial examples: first eliminate the adversarial perturbations using a trained network, and then feed the processed example to the classification network. The training of this adversarial perturbation elimination network is based on the Generative Adversarial Nets framework. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed approach.