The Robust Manifold Defense: Adversarial Training using Generative Models

12/26/2017 ∙ by Andrew Ilyas, et al. ∙ MIT, The University of Texas at Austin

Deep neural networks are demonstrating excellent performance on several classical vision problems. However, these networks are vulnerable to adversarial examples, minutely modified images that induce arbitrary attacker-chosen output from the network. We propose a mechanism to protect against these adversarial inputs based on a generative model of the data. We introduce a pre-processing step that projects on the range of a generative model using gradient descent before feeding an input into a classifier. We show that this step provides the classifier with robustness against first-order, substitute model, and combined adversarial attacks. Using a min-max formulation, we show that there may exist adversarial examples even in the range of the generator, natural-looking images extremely close to the decision boundary for which the classifier has unjustifiedly high confidence. We show that adversarial training on the generative manifold can be used to make a classifier that is robust to these attacks. Finally, we show how our method can be applied even without a pre-trained generative model using a recent method called the deep image prior. We evaluate our method on MNIST, CelebA and Imagenet and show robustness against the current state of the art attacks.




1 Introduction

Deep neural network (DNN) classifiers are currently demonstrating excellent performance for various computer vision tasks. These models work well for benign inputs, but recent work has shown that it is possible to make very small changes to an input image and drastically fool state-of-the-art models [45, 18]. These adversarial examples are barely perceptible to humans, can be targeted to create desired labels even with black-box access to the classifiers, and can be made robust as real objects in the physical world [36, 26, 5].

This phenomenon is receiving a tremendous amount of recent attention (e.g. [31, 25, 48, 19, 22, 46] and references therein) for two good reasons. First, a classifier that can be easily fooled by imperceptible noise poses a security threat in any real deployment. Second, it illustrates that even our best models can be making correct predictions for the wrong reasons. This relates to interpretability and trust [39, 29, 14] in modern complex models, which is an important emerging topic.

Typical methods of attack involve modifying pixel values while keeping the result within a small norm-bounded distance of the original image. Very recent work, however, has shown that small rotations [15] or spatial transformations [4] can also fool classifiers. We would like to propose an extended definition of adversarial examples that captures all these important aspects, building on legal theory and the reasonable person test (see e.g. [35]): A pair of inputs is an adversarial example for a classifier if a reasonable person would say they are of the same class but the classifier produces significantly different outputs. This definition is useful: if someone has defaced a stop sign so that a reasonable person could confuse it for a different sign, nobody can blame a classifier for making the same mistake. On the contrary, attacks like the robust physical perturbations of traffic signs shown in [16] would never make a reasonable person think this is not a stop sign.

Many attempts have been made to defend DNNs against adversarial examples. We survey the literature in the subsequent section, but the overall message is that defending against all possible methods of attack, as previously defined, remains challenging. Our intuition is that adversarial examples exist because an original natural image is perturbed into a point that is far from the manifold of natural images. Our classifier has never been trained on objects far from natural images, so it can behave in unexpected ways. Furthermore, the natural image manifold is low-dimensional, but the set of noisy objects that can be reached with even small perturbations is very high-dimensional and hence much harder to learn.

In this paper we make the critical assumption that we have a generative model for the data we are working on. This generative model can be either explicit (i.e. produce likelihoods) or an implicit model like a Generative Adversarial Network [17]. Several methods train neural networks to project an image on the manifold [33, 42], but these are end-to-end differentiable and hence easy to attack [9]. We use the compressed sensing inversion method [7] instead: given an input image $x$ and a classifier $C_\theta$, do not feed the image directly as an input to the classifier, but rather treat it as noisy measurements of another true image in the range of a (pre-trained) generator $G$. We solve a minimization problem to find a latent code $z^*$ such that $G(z^*)$ is close to the input image, and feed $G(z^*)$ to the classifier. This minimization is solved by gradient descent, which makes it a non-differentiable method of projecting on the manifold. Since thousands of gradient steps are required, it is not easy to “unfold” this operation and attack a differentiable substitute model, as we show in our evaluation section.

We formulate this method (called Invert and Classify (INC)) and show that it is able to resist first-order and black-box attacks. We then explore its robustness even further by formulating a min-max optimization problem where the adversary has much more power: the process simply tries to find any two latent codes $z$ and $z'$ that produce images that are close, i.e. the distance between $G(z)$ and $G(z')$ is small, but for which the classifier produces very different outputs, i.e. the divergence between $C_\theta(G(z))$ and $C_\theta(G(z'))$ is large. By Lagrangifying the constraints and using a first-order method, we are able to solve this problem and find pairs of adversarial points on the manifold with unjustified, drastically different classifier confidence. This shows that natural problematic points exist, an idea also supported by the recent work in [2], which shows it for an artificially constructed classifier over spheres. We thus seek to robustify the system further.

We show how this Min-Max attack can be used to robustify INC by using adversarial training on these examples on the manifold. We show that our proposed INC classifier is robust to various types of attacks including end-to-end substitution models. The accuracy of the classifier drops compared to clean-image performance but the inversion operation seems to provide effective protection.

Our last innovation deals with robust classification without a pre-trained generative model. This is relevant for several rich datasets like ImageNet where it is hard to train an accurate generative model. To address this problem, we rely on the Deep Image Prior (DIP) [47]: an untrained convolutional neural network for which the latent code is kept fixed at some random value, but the weights are trained to match a desired output image. Ulyanov et al. [47] showed how this can be used for denoising, inpainting, and super-resolution without any pre-training on a dataset.

We define the Deep Image Prior INC method that uses such untrained generators and can still be used to create robust classifiers for Imagenet. We show that the Deep Image Prior INC protection maintains the top-1 accuracy of ResNet152 on ImageNet at up to 49% under the BIM attacks reported in Table 2. The price of this robustness is that the accuracy on clean images drops from 71% to 52% for top-1 classification over 1000 classes.

1.1 Contributions

Concretely, our contributions are summarized as follows:

  • We formulate and present the “Invert-and-Classify” (INC) algorithm, which protects a classifier by projecting its inputs onto the range of a given generator which effectively serves as a prior for the classification. We demonstrate that the algorithm induces robustness across a wide variety of attacks, including first-order methods, substitute models, and enhanced attacks combining the two.

  • By formulating a min-max optimization problem that can be viewed as an overpowered attack on INC, we demonstrate that there may in fact exist problematic pairs $z, z'$ in the domain of a generator that interact with the hard decision boundaries of the classifier such that $G(z)$ and $G(z')$ are close but their classifications are far apart. Through adversarial training we soften the classifier’s decision boundaries and demonstrate robustness to the same min-max optimization attack.

  • We propose a possible modification of the INC algorithm for settings in which good generative models are unavailable (e.g. for Imagenet), where we instead use the structural prior given by an untrained generator, as introduced in [47]. We show that this Deep Image Prior defense can actually defend against adversarial attacks for the ImageNet dataset.

2 A Min-Max Formulation

2.1 Step 1: Defending using GANs

Given a classifier $C_\theta$ parametrized by a vector of parameters $\theta$, we want to defend it by filtering its input through a generator $G$ that samples natural inputs. This would be a pre-trained generative model that is assumed to produce natural inputs from all the different categories that we are classifying. More precisely, for some hyperparameter $\eta > 0$ and given an input $x$, we perform the following procedure that we call Invert and Classify (INC):

  1. Perform gradient descent in $z$ space to minimize $\|G(z) - x\|_2$. Let $z^*$ be the point returned by gradient descent. Ideally, $z^* = \arg\min_z \|G(z) - x\|_2$.

  2. If the “projection” $G(z^*)$ of $x$ is far from $x$, i.e. if $\|G(z^*) - x\|_2 \ge \eta$, we reject the input as “unnatural,” since it lies far from the range-of-$G$ manifold.

  3. Otherwise, we apply our classifier on the projected input, outputting a class according to the distribution $C_\theta(G(z^*))$.
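The three steps can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: the linear map `A` stands in for a trained generator, `C` is a made-up two-class rule, the threshold `eta`, step count and learning rate are arbitrary, and plain gradient descent on the squared distance replaces whatever optimizer a real implementation would use.

```python
import numpy as np

def invert_and_classify(x, generator, grad_z, classifier, z_dim,
                        eta=1.0, steps=2000, lr=0.05, seed=0):
    """Toy INC: project x onto range(generator), reject if far, else classify."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(z_dim)            # random initial latent code
    for _ in range(steps):
        z = z - lr * grad_z(z, x)             # step 1: descend on ||G(z) - x||^2
    x_proj = generator(z)
    if np.linalg.norm(x_proj - x) >= eta:     # step 2: reject "unnatural" inputs
        return None, x_proj
    return classifier(x_proj), x_proj         # step 3: classify the projection

# Hypothetical linear "generator" G(z) = A z (assumption, not a trained model).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def G(z):
    return A @ z

def grad(z, x):
    return 2.0 * A.T @ (A @ z - x)            # gradient of ||A z - x||^2 in z

def C(v):
    return int(v[0] > v[1])                   # toy classifier on the image

x_clean = G(np.array([2.0, 0.5]))
off_manifold = np.array([1.0, 1.0, -1.0])     # direction orthogonal to range(A)
label, x_proj = invert_and_classify(x_clean + 0.1 * off_manifold, G, grad, C, z_dim=2)
rejected, _ = invert_and_classify(x_clean + 10.0 * off_manifold, G, grad, C, z_dim=2)
```

In this toy setting a small off-manifold perturbation is removed by the projection, while a large one triggers the rejection in step 2.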

2.2 Step 2: An Overpowered Attack

Given some input $x$, one way to attack INC is to search for some $x'$ that is close to $x$ and also close to the manifold, so that the classifications of their projections are significantly different.

If such an attack exists, then (by the triangle inequality) there must exist $z$ and $z'$ such that $G(z)$ and $G(z')$ are close, yet $C_\theta(G(z))$ and $C_\theta(G(z'))$ are far apart. The following optimization problem captures how far apart $C_\theta(G(z))$ and $C_\theta(G(z'))$ can be, subject to a constraint on the distance between $G(z)$ and $G(z')$. This provides an upper bound on the magnitude of an attack on INC:
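Writing $G$ for the generator, $C_\theta$ for the classifier and $z, z'$ for latent codes, a problem of this kind can be sketched as below. The symbols $\ell$ (a divergence between classifier output distributions) and $\delta$ (a distance budget which, by the triangle inequality, would be on the order of twice the rejection threshold plus the perturbation size) are placeholder names assumed here for illustration.

```latex
\sup_{z,\, z'} \; \ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
\quad \text{subject to} \quad
\|G(z) - G(z')\|_2^2 \le \delta^2
```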


This optimization problem upper-bounds the size of the attack on INC. In fact, it also captures the loss that may arise from a potentially imperfect optimization in the first step of INC, where the input $x$ is projected into the range of $G$. Namely, the value of the objective function for a solution $z$ and $z'$ to the above optimization problem also captures the maximum loss in accuracy under a scenario where $x = G(z)$ and the projection of $x$ onto the range of $G$ identified in the first step of INC via gradient descent is $G(z')$. Of course, the projection of $x$ into the range of $G$ is not chosen adversarially, so the above optimization problem captures an “overpowered adversary,” serving as an upper bound on the loss from both adversarial attacks and suboptimality in step one of INC.

We can Lagrangify the constraints in the above formulation to obtain the following min-max formulation:
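One possible way to write such a formulation is sketched below, for a generator $G$, classifier $C_\theta$ and latent codes $z, z'$. The divergence $\ell$, the distance budget $\delta$, and the sign convention $\lambda \le 0$ for the multiplier are assumptions made for illustration.

```latex
\sup_{z,\, z'} \; \inf_{\lambda \le 0} \;
\ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
+ \lambda \left( \|G(z) - G(z')\|_2^2 - \delta^2 \right)
```

Under this convention a violated constraint makes the bracket positive, so the inf player can send $\lambda \to -\infty$; when the constraint holds, the product is non-negative and $\lambda = 0$ is optimal.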


As usual, the Lagrange multipliers ensure that if any constraint is violated, the inf player can make the objective $-\infty$ by driving the corresponding multiplier to infinite magnitude. Hence, the sup player must respect the constraints, and the inf player then has to set the multipliers to $0$, so ultimately the intended objective is maximized by the sup player.

We found experimentally that we can identify good solutions to the min-max problem (3) using gradient descent, as we describe in Section 4.3. We show attacks for a gender classifier trained on CelebA [30] and protected with the INC framework using a BEGAN [6] generator. Our experiments show that it is possible to obtain pairs of images that lie in the range of $G$ and are close to each other, yet the gender classifier produces drastically different outputs on these images.

Another interesting experimental finding is that the pairs of images that were identified through our attack framework appeared to change gender-relevant features to confuse the classifier, often producing images whose class was ambiguous even for human observers. We show some of these pairs and how they were classified in Section 4.3. The main limitation of the INC-protected classifier is that the predictions have very high confidence (near-certain classification probabilities) even for ambiguous images, and that small changes to an image cause abrupt changes in the classification probabilities.

Finally, we note that while the classifier outputs drastically different distributions on the pairs of images that we identified, the “noise” introduced by the suboptimality of gradient descent in the first step of INC (the non-convex projection step) seems to make the process more robust. In particular, when the images $G(z)$ and $G(z')$ were given as inputs to the INC-protected classifier, the outputs were not drastically different. This is because the latent codes recovered by gradient descent on these inputs were not exactly $z$ and $z'$, and this projection noise seemed to protect INC from the low-probability set of adversarial examples.

This is clearly an interesting empirical finding for the robustness of INC, but we would not hope to base its security on the noise introduced by gradient descent in the projection step. So in the next section we use these adversarial pairs of inputs to robustify our classifier using adversarial training.

2.3 Step 3: Robustifying INC through Adversarial Training

Anticipating the attack outlined in Section 2.2, we can take our defense approach outlined in Section 2.1 one step further, retraining the parameters of the classifier to minimize the damage from the attack. This results in the following min-max formulation:
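A form consistent with this description is sketched below. All symbols beyond $G$, $C_\theta$, $z$ and $z'$ are placeholders assumed for illustration: $\ell$ is a divergence between classifier outputs, $\delta$ a distance budget, $\mu$ a mixing weight, $L$ a classification loss, and $X$ a labeled training set.

```latex
\inf_{\theta} \left[
\sup_{\substack{z,\, z' \\ \|G(z) - G(z')\|_2^2 \le \delta^2}}
\ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
\; + \; \mu \cdot \frac{1}{|X|} \sum_{(x, y) \in X} L\big(C_\theta(x),\, y\big)
\right]
```

Here the mixing weight multiplies the classification term; for a positive weight, placing it on the adversarial term instead gives the same minimizers up to rescaling.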


where the mixing weight is some hyperparameter, and the second term of the objective function above evaluates the performance of the classifier on the input set using a traditional classification loss function (which can actually be the same as that used to train the classifier initially, e.g. the cross-entropy loss).

We can again fold the constraints into a Min-Max formulation as follows:


We tried this retraining approach and report our findings in Section 4.3.

2.4 Step 4: What to do if no GAN is available

The approach described in Sections 2.1 to 2.3 can be used to robustify any classifier if we have a good generative model for the inputs of interest. We propose a way to extend our approach to settings where no pretrained generative model is available. We illustrate our approach for classification of image data.

We propose to use a Deep Image Prior (DIP) [47], i.e. an untrained generative model with a convolutional neural network topology. The surprising result is that training over the weights to approximate a given input image effectively projects onto the manifold of natural images. We leverage this idea to defend classifiers on Imagenet without relying on a pre-trained generative model. Our method is identical to INC, with the only difference being that the projection step optimizes over the weights $w$ of the generator rather than over the latent code $z$.

Specifically, given an input image $x$, we search for $x'$ such that $x'$ is a natural image and $\|x - x'\|_2$ is small. The Deep Image Prior (DIP) method tries to solve this problem by constructing a generative convolutional neural network $G_w$ parameterized by a set of weights $w$, and searching for a set of weights that satisfy

$$ w^* = \arg\min_w \|G_w(z) - x\|_2^2 $$

for some randomly selected $z$, which is held constant throughout the optimization procedure. The final output is $G_{w^*}(z)$, which is a natural image close to the input with respect to the $\ell_2$ distance.

An important issue here is that the search over $w$ is a gradient descent procedure that we terminate early: as observed by [47], if too many steps are performed, $G_w$ becomes too expressive and also reconstructs the adversarial noise. The number of steps was empirically tuned in our experiments and depends on the power of the adversary. We discuss the DIP INC in more detail in Section 3.1, and our experiments in Section 4.4.

As a final note, it should be emphasized that while all the previous methods of our paper also apply to non-image datasets, in order to apply our generator-free approach to non-image data we would have to develop an architecture, serving as the analog of DIP, for that type of data.

3 Implementation

We now describe Invert and Classify (INC) in detail. The algorithm is given in pseudocode in Algorithm 1, and the schematic is shown in Figure 1.

Figure 1: Schematic showing our proposed defense strategy, Invert-and-Classify. The trapezoid labeled $G$ and triangle labeled $C_\theta$ denote the generator and classifier models respectively. The rectangle denotes an optimization procedure that accepts an image $x$ as input, and repeatedly queries $G$ to find a $z^*$ that minimizes $\|G(z) - x\|_2$. Once this $z^*$ is found, the classifier makes its prediction based on $G(z^*)$. Note that in Invert-and-Classify, $G$ refers to a generator that has been pretrained, and is held constant.
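Algorithm 1 Invert-and-classify
Input : Input image $x$, Generator $G$, Classifier $C_\theta$
Output : Classifier prediction
      $z_0$ sampled at random
      for $t = 1$ to $T$ do
            $z_t \leftarrow$ gradient step from $z_{t-1}$ on $\|G(z) - x\|_2^2$
      return $C_\theta(G(z_T))$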

Given an input $x$, our strategy is to find a $z^*$ such that $G(z^*)$ is close to $x$ in $\ell_2$ distance. This is achieved by sampling a $z_0$ at random and running $T$ iterations of SGD, where the gradient at iteration $t$ is given by $\nabla_z \|G(z) - x\|_2^2$ evaluated at $z = z_t$.

Our intuition for why this strategy works is based on the observation that adversarial noise is very high dimensional, whereas the natural images form a low dimensional manifold in $\mathbb{R}^n$. Hence, searching for an image in the range of $G$ that is close to $x$ in $\ell_2$ norm is equivalent to projecting $x$ onto the manifold of natural images, assuming $G$ has learned the true probability distribution over the training dataset.

3.1 Deep Image Prior Defense

The algorithm for Deep Image Prior INC is given in Algorithm 2, and the schematic is shown in Figure 2.

Figure 2: Schematic showing the Deep Image Prior INC. The trapezoid labeled $G_w$ and triangle labeled $C_\theta$ denote the generator and classifier models respectively. The rectangle denotes an optimization procedure that accepts an image $x$ as input, and repeatedly queries $G_w$ to find a $w^*$ that minimizes $\|G_w(z) - x\|_2$. Once this $w^*$ is found, the classifier makes its prediction based on $G_{w^*}(z)$. Note that in the Deep Image Prior defense, $G_w$ refers to a generator that has been randomly initialized, and we search for optimal parameters $w^*$, while $z$ is held constant throughout.
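Algorithm 2 Deep Image Prior Invert-and-Classify
Input : Input image $x$, Classifier $C_\theta$
Output : Classifier prediction
      $z$, $w_0$ randomly initialized
      for $t = 1$ to $T$ do
            $w_t \leftarrow$ gradient step from $w_{t-1}$ on $\|G_w(z) - x\|_2^2$
      return $C_\theta(G_{w_T}(z))$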

Given an input $x$, our strategy is to find a set of parameters $w^*$ such that $G_{w^*}(z)$ is close to $x$ in $\ell_2$ distance. This is achieved by sampling a $z$ and $w_0$ at random, and running $T$ iterations of SGD, where the gradient at iteration $t$ is given by $\nabla_w \|G_w(z) - x\|_2^2$ evaluated at $w = w_t$. An important benefit of this method is that it alleviates the need for a pre-trained generator. In Section 4.4 we show how this method can be used to protect against attacks on Imagenet.

As observed by Ulyanov et al. [47], the Deep Image Prior can yield very accurate reconstructions if it is run for many iterations. We found that, to use this method to remove adversarial noise, we have to use early stopping, which has to be carefully tuned.
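The mechanics of fitting weights with a fixed latent code and stopping early can be sketched as follows. This is only an illustration of the loop structure: the "generator" here is a hypothetical linear map `W @ z` with no image prior at all, plain gradient descent replaces the optimizer, and the learning rate is arbitrary. The iteration cap of 500 and the MSE threshold of 0.005 mirror the values quoted in Section 4.4.2.

```python
import numpy as np

def dip_project(x, z, w0, max_iters=500, mse_tol=0.005, lr=0.1):
    """Fit weights W so that W @ z approximates x, with early stopping.

    Stops when the mean squared error drops below mse_tol or after
    max_iters updates, whichever comes first.
    """
    W = w0.copy()
    steps = 0
    while steps < max_iters:
        residual = W @ z - x
        if np.mean(residual ** 2) < mse_tol:
            break                                        # early stop on MSE
        W -= lr * (2.0 / x.size) * np.outer(residual, z)  # gradient of the MSE in W
        steps += 1
    return W @ z, steps

x = np.array([0.2, 0.8, 0.5, 0.1])     # toy "image"
z = np.array([1.0, 1.0])               # latent code, held fixed throughout
W0 = np.zeros((4, 2))                  # initial weights (zeros for reproducibility)

recon, steps = dip_project(x, z, W0)
recon_short, steps_short = dip_project(x, z, W0, max_iters=5)
```

Capping the iteration count lower, as in the second call, returns a coarser reconstruction; in the real method this is what keeps the adversarial noise from being fitted.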

4 Evaluation

4.1 Experimental Setup


We evaluate the robustness of Invert-and-Classify for classification tasks on two datasets: handwritten digit classification on the MNIST dataset [27] and gender classification on the CelebA dataset [30]. To evaluate the robustness of the Deep Image Prior defense, we ran experiments on 1000 validation images randomly sampled from the Imagenet dataset [40].

The MNIST images were not pre-processed. The CelebA images were cropped using a fixed bounding box and then resized using bilinear interpolation, and their pixel values were normalized to a fixed range. The Imagenet images were likewise resized, and their pixel values were normalized to a fixed range.


For MNIST classification (the code was borrowed and the architecture kept the same), the classifier $C_\theta$ has two convolutional layers, followed by a fully connected layer and a softmax layer. The convolutional layers have 32 and 64 filters respectively. The fully connected layer has 1024 nodes and the softmax layer has 10 nodes.

For gender classification on the CelebA dataset (the code was borrowed, with minimal modifications made to use it for CelebA), the classifier has two convolutional layers followed by two fully connected layers and a softmax layer. The convolutional layers have 64 filters in each layer. The fully connected layers have 384 and 192 nodes respectively, and the softmax layer has two nodes.

For classifying Imagenet images, we used a ResNet152 [21].

Generative Models

We chose the decoder of a Variational Auto Encoder (VAE) [24] for generating images of digits from MNIST (the code was borrowed and the architecture kept the same). The encoder has 2 fully connected hidden layers, each with 500 nodes; the output is 20-dimensional. The decoder has 2 fully connected hidden layers, each with 500 nodes; the final output has the size of an MNIST image.

To generate images of celebrities, we trained a BEGAN [6] (the code was borrowed and the architecture kept the same) on the first 160,000 images in the CelebA dataset. The generator of the BEGAN has 1 fully connected layer and 5 convolutional layers. The first 4 convolutional layers have 128 filters each, and the final convolutional layer has 3 filters.

For image generation in the Deep Image Prior method, we use the skip-autoencoder architecture employed by Ulyanov et al. [47] for image denoising (see their supplementary material for the precise details).


For the Invert-and-Classify experiments on MNIST and CelebA, we used a Tensorflow [1] implementation of an Adam optimizer with initial learning rate 0.1 and 0.01 respectively.

For the Deep Image Prior experiments, we used a PyTorch [38] implementation of an Adam optimizer [23] with initial learning rate 0.01.

4.2 Robustness against Attacks

We demonstrate the robustness of our method against standard attacks. Note that since the inversion is carried out by gradient descent and is therefore not differentiable end-to-end, first-order attacks against the end-to-end system are infeasible.

4.2.1 First-Order Classifier Attacks

Most methods to construct adversarial examples try to find perturbations that have small norm. First-order attacks on the full system are intractable to generate; in this section, we demonstrate robustness to first-order attacks on the unprotected classifier.

We first focus on the case where the adversarial perturbation has low $\ell_\infty$ norm, where we use the Fast Gradient Sign method [18]. We perform untargeted attacks: we only require that the classifier predicts some label other than the true label. The adversarial perturbation is given by


where $x$ is the original image and $y$ is its label, and the loss is the cross-entropy between the label and the classifier prediction on $x$. For the CelebA dataset, we additionally evaluated how robust Invert-and-Classify is against the Carlini-Wagner attacks [10]. For these attacks, we set the confidence parameter to 5.
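In its standard form, the Fast Gradient Sign method perturbs the input as $x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y))$, where $J$ denotes the loss (the symbol is an assumption here). A minimal sketch on a hypothetical logistic-regression classifier, whose weights `w`, bias `b` and inputs are made up for illustration, looks like this:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_grad_x(x, y, w, b):
    """Gradient of the binary cross-entropy w.r.t. x for p = sigmoid(w.x + b)."""
    return (sigmoid(w @ x + b) - y) * w

def fgsm(x, y, w, b, eps):
    """Untargeted FGSM: one signed-gradient step of size eps (l_inf budget)."""
    return x + eps * np.sign(loss_grad_x(x, y, w, b))

# Toy classifier and input (hypothetical values).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.3, 0.1])
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.25)
p_clean = sigmoid(w @ x + b)      # above 0.5: correctly classified as class 1
p_adv = sigmoid(w @ x_adv + b)    # below 0.5: the prediction flips
```

Every coordinate moves by exactly `eps`, so the perturbation has $\ell_\infty$ norm `eps` by construction. A real attack would also clip the result to the valid pixel range, which is omitted here.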


The accuracy of the MNIST classifier and the Invert-and-Classify method for varying perturbation size $\epsilon$ is shown in Figure 3.

Figure 3: Accuracy of the classifier and INC protected classifier as we vary the norm of the adversarial perturbation for MNIST using a FGSM attack.

We report the accuracy of the CelebA classifier and the Invert-and-Classify method against the FGSM attack and Carlini-Wagner attacks in Table 1.

Attack            No defense    Invert and Classify
Clean Data        97%           84%
FGSM              1%            82%
FGSM              0%            80%
FGSM              0%            73%
Carlini-Wagner    0%            77%
Carlini-Wagner    0%            65%
Carlini-Wagner    0%            66%
Table 1: Accuracy of the CelebA classifier and the Invert-and-Classify method under different adversarial attacks: FGSM [18] at various perturbation sizes $\epsilon$, and the Carlini-Wagner attacks [10].

Figure 4 shows the reconstructions obtained by Invert-and-Classify for women and men in the CelebA dataset.

Figure 4: In this figure, are the original images, are the images obtained by inverting , are the adversarial examples obtained from (), and are the images obtained by inverting . The values at the corner of each image indicate the confidence with which the classifier predicts the image as the correct gender. As shown, the unprotected gender classifier has confidence and the INC protected one confidence .

4.2.2 Substitute Model Attacks

We investigate the robustness of our end-to-end INC-protected classifier. The gradient descent INC inversion procedure is non-differentiable and hence much harder to attack. One possible attack is to “unfold” the gradient descent steps and create a differentiable model that can be subsequently attacked. Since our gradient descent projection in INC involves thousands of iterations, we could not make such an attack work.

A second approach is to leverage the transferability of adversarial examples [45] and design a black-box attack against the Invert-and-Classify model. We show that this attack is also ineffective.

We train our end-to-end substitute network as in the typical black-box setting [36] using input-output pairs of the target model. We emphasize that our substitute model is differentiable and only attempts to approximate the decision boundaries of the non-differentiable Invert-and-Classify model. The architecture of the substitute model follows that of the inner classifier described in Section 4.1. We train our network on images from the CelebA dataset labeled by the output of the Invert-and-Classify model. Finally, to improve the substitute model even further, we also train on inputs that are adversarial to the inner classifier $C_\theta$. This incorporates white-box information, since we provide the adversary with knowledge of the gradients of $C_\theta$ on the training samples.

We craft first-order adversarial examples using our substitute model and feed them to the target INC model. We measure the percentage of the adversarial examples generated for the substitute model that are misclassified by the target model as well. Figure 5 shows the inability of first-order attacks, namely the Fast Gradient Sign Method [18] (FGSM) and Basic Iterative Method [26] (BIM) on the substitute model to transfer to Invert-and-Classify.

Figure 5: Non-transferability of attacks on substitute networks: The network is trained on natural and adversarial input-output pairs from celebA. The validation set consists of natural images that were correctly classified by both the target and substitute model. The images were adversarially perturbed with FGSM [18] and 20 steps of BIM [26] for .

4.2.3 Combined Attack

Another idea for attacking INC is to train a differentiable substitute model for the inversion step only, then combine this model with standard first-order attacks on the classifier to produce adversarial inputs. As mentioned, unfolding the gradient descent process is intractable, since thousands of iterations are run per projection. Thus, we model the inversion step using a convolutional neural network, effectively training a differentiable encoder and making an autoencoder. As we show in Table 12 in Appendix A, it is indeed easy to attack the autoencoder, but the attacks do not typically transfer to the INC system.

4.3 Increasing Robustness with Overpowered Generator Attacks

As described in Section 2.2, we perform an overpowered attack on the standard Invert-and-Classify architecture. We search for $z$ and $z'$ such that $G(z)$ and $G(z')$ are close but induce dramatically different classification labels. Recall that this involves solving the following min-max optimization problem:


In practice, we set the constraint to a small average squared difference per pixel-channel. We implement the optimization through alternating iterated gradient descent on both $(z, z')$ and the multiplier $\lambda$, with a much more aggressive step size for the $\lambda$-player (since its payoff is linear in $\lambda$). The gradient descent procedure is run for 10,000 iterations. Because the constraint was imposed through a Lagrangian, we consider a pair $(z, z')$ valid if the mean distance between the two images satisfies the constraint.
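One way to write such an alternating scheme is sketched below, assuming a Lagrangian $\mathcal{L}(z, z', \lambda) = \ell\big(C_\theta(G(z)), C_\theta(G(z'))\big) + \lambda\big(\|G(z) - G(z')\|_2^2 - \delta^2\big)$ with $\lambda \le 0$. The divergence $\ell$, the budget $\delta$, and the step sizes $\alpha$ and $\beta$ are placeholder names assumed for illustration.

```latex
(z, z') \;\leftarrow\; (z, z') + \alpha \, \nabla_{(z, z')} \mathcal{L}(z, z', \lambda),
\qquad
\lambda \;\leftarrow\; \min\!\Big\{ 0,\; \lambda - \beta \big( \|G(z) - G(z')\|_2^2 - \delta^2 \big) \Big\},
\qquad \beta \gg \alpha
```

Because the derivative of $\mathcal{L}$ in the multiplier is just the constraint violation, a large $\beta$ lets the multiplier react quickly whenever the pair drifts outside the allowed distance.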

The optimization terminated with 93% of the images satisfying the constraint; within this set, the average KL-divergence between classifier outputs was 2.47, with 57% inducing different classifications. Figure 6 shows randomly selected successful results of the attack.
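For reference, a KL divergence between two classifier output distributions can be computed as in the sketch below. The direction of the divergence, the use of natural logarithms, and the clipping constant are assumptions, since the text does not specify them.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two discrete distributions.

    Probabilities are clipped away from zero to avoid log(0).
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

same = kl_divergence([0.5, 0.5], [0.5, 0.5])         # identical outputs
flipped = kl_divergence([0.9, 0.1], [0.1, 0.9])      # confident, opposite outputs
```

Identical outputs give zero divergence, while two confident but opposite two-class predictions give a value well above 1.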

Figure 6: Pairs of images and generated by the Min-max overpowered attack. Since a reasonable person would disagree with these confidence changes, these pairs of images are adversarial attacks for this classifier, according to our definition. These adversarial attacks lie on the manifold of natural images so the classifier must be made robust.

First, note that in contrast to the attacks found in Figure 4 on the unprotected classifier, the attacks found with this optimization tend to yield images with semantically relevant features from both classes, and furthermore often introduce meaningful (though minute) differences between $G(z)$ and $G(z')$ (e.g. facial hair, eyes widening, etc.). This suggests that the attack is exploiting the hard decision boundary introduced in classifier training. Secondly, as described in Section 2.2, none of these images actually induce different classifications on the end-to-end classifier, which we attribute to imperfections in the projection step of the defense (that is, the recovered latent codes are not exactly $z$ and $z'$). That said, we opt to robustify the classifier against this attack regardless. Recall the more complex min-max optimization proposed in Section 2.3:


We implement this through adversarial training [31]; at each iteration, in addition to sampling a cross-entropy loss from images from the dataset, we also sample an adversariality loss, where we generate a batch of “adversarial” inputs using 500 steps of the min-max attack, then add the final distance between the classification outputs to the cross-entropy loss. As shown in Figure 7, the classifier eventually learns to minimize the adversary’s ability to find examples, most likely by learning and softening the decision boundaries being exploited by the generator. After robustifying the classifier using this adversarial training, we once again try the attack described earlier in this section for the same 10,000 iterations. Figure 8 shows the convergence of the attack against both the initial and adversarially trained classifier for two values of , showing the inefficacy of the attack on the adversarially trained classifier. After 10,000 iterations, 100% of the images were valid, but with 22% of them inducing different classification, and an average KL divergence of 0.08, showing that the classifier has indeed significantly softened its decision boundary.

Figure 7: The cross-entropy and adversarial components of the loss decaying as training continues.
Figure 8: The average for pairs found by the attack.

Though it causes softer decision boundaries, the adversarial training does not significantly impact classification accuracy relative to the standard classifier: on normal input data, the model achieves the same 97% accuracy when undefended. We also feed the “adversarial” inputs generated by the min-max attack on the initial classifier into the adversarially trained classifier, and observe that the average classification divergence between examples drops to 0.007, with only 18% of the valid images being classified inconsistently. Figure 9 shows a randomly selected subset of these examples with their respective classifier output.

Figure 9: The softmax output of both the original (blue) and robust adversarially trained (green) classifier on the “borderline” images generated by the attack on the non-robustified classifier.

4.4 Deep Image Prior

4.4.1 Adversarial Attack

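The adversarial attacks on the ResNet were constructed using 20 steps of the Basic Iterative Method [26]. This attack is given by

$$x_0 = x, \qquad x_{n+1} = \mathrm{Clip}_{x,\epsilon}\big(x_n + \alpha\,\mathrm{sign}(\nabla_x J(x_n, y))\big), \tag{8}$$

where $x$ is the original image and $y$ is its label; $J(x, y)$ is the cross entropy loss between the label $y$ and the classifier’s prediction on $x$; $\alpha$ is the step size; and $\mathrm{Clip}_{x,\epsilon}$ is a clipping such that $x_n$ lies in an $\ell_\infty$ ball of radius $\epsilon$, centered at $x$.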

We evaluated the robustness of the Deep Image Prior defense for . (Note: Our original images were rescaled such that each pixel lies in the range ).
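To make the attack concrete, the sketch below runs the Basic Iterative Method against a toy logistic classifier instead of a ResNet. The classifier, its weights, the step size `alpha` and the pixel range of [0, 1] are all assumptions made for illustration; only the number of steps (20) comes from the text.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of a logistic classifier and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def basic_iterative_method(x, y, w, b, eps, alpha, steps=20, lo=0.0, hi=1.0):
    """Iterated signed-gradient ascent on the loss, clipped to an l_inf ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Clip back into the l_inf ball of radius eps centered at the original image.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        # Assumption: valid pixel values lie in [lo, hi].
        x_adv = [min(max(xa, lo), hi) for xa in x_adv]
    return x_adv

x, y = [0.2, 0.8, 0.5], 0        # toy "image" and its label
w, b = [2.0, -1.0, 0.5], -1.0    # hypothetical classifier parameters
x_adv = basic_iterative_method(x, y, w, b, eps=0.1, alpha=0.01)
print("loss before:", loss_and_grad(x, y, w, b)[0])
print("loss after: ", loss_and_grad(x_adv, y, w, b)[0])
```

For a deep network the gradient would be obtained by back-propagation rather than in closed form, but the update and clipping steps are the same.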

4.4.2 Early Stopping

For , we run at most 500 iterations of DIP, while for we run at most 100 iterations. We also stop the optimization procedure if the Mean Square Error falls below a threshold of 0.005.

Note that running only 100 iterations leads to a decrease in accuracy for images that are not adversarial. If we run DIP for more than 100 iterations on adversarial images that have , then the adversarial perturbation is also reconstructed, and the reconstruction remains adversarial. This can be attributed to the low Signal to Noise Ratio that results from adding an adversarial perturbation with .
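The stopping rule described in this subsection can be sketched as a loop with an iteration cap and an MSE threshold. In the sketch below, `toy_step` is a hypothetical stand-in for one optimization step of the Deep Image Prior network, and the iteration cap is left as a parameter because it depends on the perturbation size.

```python
def mse(a, b):
    """Mean squared error between two equally sized lists of pixel values."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def fit_with_early_stopping(target, step_fn, init, max_iters, mse_threshold=0.005):
    """Apply step_fn until max_iters is reached or the MSE falls below the threshold.

    Returns the final reconstruction and the number of iterations actually run.
    """
    recon = list(init)
    for it in range(1, max_iters + 1):
        recon = step_fn(recon, target)
        if mse(recon, target) < mse_threshold:
            return recon, it
    return recon, max_iters

def toy_step(recon, target, rate=0.1):
    """Hypothetical update: move the reconstruction 10% closer to the target."""
    return [r + rate * (t - r) for r, t in zip(recon, target)]

target = [1.0, 0.5, 0.0, 0.8]
recon, iters = fit_with_early_stopping(target, toy_step, [0.0] * 4, max_iters=500)
print("stopped after", iters, "iterations with MSE", mse(recon, target))
```

A smaller cap trades reconstruction quality for less opportunity to fit the perturbation itself, which is the trade-off described above.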

        Clean Images                    Adversarial Images
ε       Unprotected    DIP Protected    Unprotected    DIP Protected
        Classifier     Classifier       Classifier     Classifier
        71%            52%              5%             49%
        71%            52%              2%             40%
        71%            35%              1%             30%
Table 2: Percentage of 1000 random images that were correctly classified by a ResNet152. The accuracies of the unprotected classifier and the Deep Image Prior protected classifier on clean images are reported in columns 2 and 3 respectively; their accuracies on adversarial examples are reported in columns 4 and 5 respectively. The adversarial images were constructed by performing the Basic Iterative Method [26] for , where is the -norm of the adversarial perturbation.

4.4.3 Quantitative Results

Table 2 reports the accuracy of a ResNet152 against adversarial examples constructed using the Basic Iterative Method for varying . DIP Protected Classifier refers to first reconstructing images via the Deep Image Prior method, followed by classification using a ResNet152.

Figure 10: Reconstructions obtained via the Deep Image Prior method. The top two rows show the original images and their reconstructions, while the bottom two rows show adversarially perturbed images and their reconstructions. The Deep Image Prior method was run for at most 500 iterations, and we stopped early if the MSE fell below 0.005. The adversarial images were generated using 20 steps of the Basic Iterative Method with , which is the -norm of the adversarial perturbation.

4.4.4 Qualitative Results

Figure 10 shows the reconstructions obtained using the Deep Image Prior method on original and adversarial images. The adversarial images were generated using 20 steps of the Basic Iterative Method with . The Deep Image Prior method was run for 500 iterations or until the Mean Square Error fell below 0.005.

5 Related work

There is currently a deluge of recent work on adversarial attacks and defenses. Common defense approaches involve modifying the training dataset such that the classifier is made more robust [20][41], modifying the network architecture to increase robustness [11] or performing defensive distillation [37]. The idea of adversarial training [18] and its connection to robust optimization [41, 32, 43] leads to a fruitful line of defenses. On the attacker side, Carlini and Wagner [8][10] show different ways of overcoming many of the existing defense strategies.

Our approach to defending against adversarial examples leverages the power of GANs [17]. The GAT-Trainer work by Lee et al. [28] uses generative models to perform adversarial training, but in a very different way from our work and, further, without projecting onto the range of a GAN. MagNet [34] and APE-GAN [42] share the similar idea of removing adversarial noise using a generative model, but use differentiable projection methods that have already been successfully attacked by Carlini and Wagner [9].

While we were writing this paper we found two related submissions appearing online. The most closely related concurrent work is DefenseGAN [3], submitted to ICLR, which proposes a very similar method to INC, independently from our work. However, the current manuscript [3] only validates on MNIST and does not discuss the min-max attack, the robustification process or the Deep Image Prior method. The second related paper is PixelDefend [44]. The main difference from our work is that this paper uses PixelCNN generators as opposed to GANs, and hence the projection, attack and defense processes are different.

6 Conclusion

This work demonstrates the possibility of resisting adversarial attacks using a generative model. We propose the Invert-and-Classify (INC) algorithm, based on the idea of projecting inputs into the range of a trained Generative Adversarial Network before classification. The INC projection is performed using gradient descent in $z$-space, which is a non-differentiable process.

We demonstrate the mechanism’s ability to resist both off-the-shelf and specifically designed first-order and black-box attacks. Then, through a crafted min-max optimization, we demonstrate that there are still adversarial images in the range of the GAN. These points are very close yet induce drastically different classifications. These points show that a classifier can be tremendously confident and disagree with human judgement. We show how to solve this problem by adversarially training on these inputs and obtain a robust INC model that displays natural uncertainty around decision boundaries.

Finally, for the cases when no pre-trained generative model is available, we propose the Deep Image Prior (DIP) INC defense. This relies on a structural prior given by an untrained generator to defend against adversarial examples. We show that this allows for a defense for the Imagenet dataset that is robust to first-order methods against the unprotected classifier.


Appendix A Autoencode-and-Classify

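Figure 11: Schematic showing a strategy to detect the presence of adversarial perturbations. Given an input image $x$ that may be adversarially perturbed, the image we input to the classifier is the reconstruction $G(E(x))$, where $E$ is an encoder and $G$ is a generator. If $x$ has been adversarially perturbed, then the reconstruction error $\|x - G(E(x))\|$ is high.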

We now introduce a strategy that can be useful in detecting the presence of adversarial perturbations. Certain generative models, such as VAEs [24], BiGANs [12] and ALI [13], have an autoencoder structure. In these models, an encoder $E$ and a generator $G$ are learned jointly and are designed to be approximate inverses of each other. $E$ maps an input image $x$ to a seed value $z = E(x)$, and the generator $G$ maps the seed value $z$ to a reconstruction $G(z)$, such that $G(E(x)) \approx x$.

MagNet [34] and APE-GAN [42] propose closely related ideas. Unfortunately, having a differentiable denoiser (e.g. an encoder) to protect from adversarial examples is problematic since an attacker can design new input attacks by back-propagating through the encoder. This was shown by the Carlini and Wagner [9] attacks on MagNet and APE-GAN. This approach also shows that one can attack an encoder to generate an image from a different class [25].

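The final prediction on an input $x$ is given by $C(G(E(x)))$, where $C$ is the classifier and $E$ and $G$ are the encoder and generator. Hence we can view $C \circ G \circ E$ as a feedforward classifier for which we can construct an attack. In this case, equation 8 can be modified to

$$x_0 = x, \qquad x_{n+1} = \mathrm{Clip}_{x,\epsilon}\big(x_n + \alpha\,\mathrm{sign}(\nabla_x J(G(E(x_n)), y))\big), \tag{12}$$

where $x$ is the original image and $y$ is its label; $J(G(E(x)), y)$ is the cross entropy loss between the label $y$ and the classifier prediction on $G(E(x))$. The step size $\alpha$ and the clipping $\mathrm{Clip}_{x,\epsilon}$ onto the $\ell_\infty$ ball of radius $\epsilon$ around $x$ are those of the Basic Iterative Method.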

We observe that adversarial attacks drive the encoding of images away from the typical set. If an adversarial example is encoded to , then the distribution of differs significantly from the true distribution of the seed values. In most models, seed values to the generator are drawn from . Figure 14 shows the distribution of seed values produced by adversarial examples, and non-adversarial examples.

Quantitative Results

The accuracy is plotted in Figure 12. Figure 13 shows how the distance between input images and images from the generator varies depending on whether the input is adversarial or natural.

Figure 12: Accuracy of the classifier as the norm of the adversarial perturbation defined in equation 12 changes. Notice that attacks found from Autoencode-and-Classify do not typically transfer to INC.
Figure 13: This figure plots the norm of difference between the input and reconstructed image. refers to the original image, is the adversarial example, where is constructed according to 12. is the inverse of , and is the inverse of . is the result when is autoencoded by the VAE, while is the result when is autoencoded by the VAE. Notice that a high error can be used to detect whether an image is adversarial or not.
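The detection idea in the caption above can be sketched as a threshold test on the reconstruction error. In this minimal example the encoder and decoder are a toy linear projection onto a single direction, not a trained VAE, and both the choice of the Euclidean norm and the threshold value are assumptions made for illustration.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def reconstruction_error(x, encode, decode):
    """Distance between an input and its autoencoded reconstruction."""
    return l2_distance(x, decode(encode(x)))

def flag_adversarial(x, encode, decode, threshold):
    """Flag an input whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, encode, decode) > threshold

# Toy autoencoder: the "data manifold" is the line spanned by the unit vector (0.6, 0.8).
def encode(x):
    return 0.6 * x[0] + 0.8 * x[1]

def decode(z):
    return [0.6 * z, 0.8 * z]

natural = [1.2, 1.6]     # lies on the manifold
perturbed = [1.6, 1.3]   # natural + (0.4, -0.3), a perturbation that leaves the manifold
for name, x in [("natural", natural), ("perturbed", perturbed)]:
    print(name, reconstruction_error(x, encode, decode),
          flag_adversarial(x, encode, decode, threshold=0.1))
```

In practice the threshold would have to be calibrated on held-out natural images, since a trained autoencoder never reconstructs them exactly.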

Qualitative Results

If an adversarial example is encoded to , then the distribution of differs significantly from the true distribution of the seed values, which in our case are distributed according to . Figure 14 shows the distribution of seed values produced by adversarial examples, and non-adversarial examples.
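One simple way to quantify how far an encoded seed lies from the bulk of the prior is a tail-probability test. The sketch below assumes a two-dimensional seed with a standard normal prior (an assumption, chosen to match the two-dimensional setting of Figure 14), in which case the squared norm of the seed follows a chi-squared distribution with two degrees of freedom and its tail probability has the closed form exp(-r^2 / 2). The significance level `alpha` is hypothetical.

```python
import math

def tail_probability_2d(z):
    """P(||Z||^2 >= ||z||^2) for Z drawn from a 2-D standard normal."""
    r2 = z[0] ** 2 + z[1] ** 2
    return math.exp(-r2 / 2.0)

def is_atypical(z, alpha=0.01):
    """Flag seeds that fall in the outer alpha-fraction of the assumed prior."""
    return tail_probability_2d(z) < alpha

for z in ([0.5, -0.3], [3.5, 2.0]):
    print(z, round(tail_probability_2d(z), 6), is_atypical(z))
```

This test only catches seeds with unusually large norm; in higher dimensions, where the typical set concentrates on a shell, seeds with unusually small norm would also need to be flagged.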

Figure 15 shows the images obtained by autoencoding versus , where is the inversion of .

Figure 14: The blue dots denote how non-adversarial examples are encoded by . The orange dots denote how adversarial examples are encoded by . In this case and are the encoder and decoder of a VAE such that the seed has dimension 2.
Figure 15: are the original images, are the adversarial examples obtained from (the adversarial perturbation has ), are the results of autoencoding , and are the images obtained by inverting . It is interesting to note here that the attack on Autoencode-and-Classify is actually attacking the encoder: there exist better codes that produce images that are closer to the input , compared to . These images are closer and of the correct class, but the adversarial noise fools the encoder into producing more distant images from a different class. The gradient descent projection used by INC does not have this problem.