Deep neural network (DNN) classifiers currently demonstrate excellent performance on various computer vision tasks. These models work well for benign inputs, but recent work has shown that very small changes to an input image can drastically fool state-of-the-art models [45, 18]. These adversarial examples are barely perceptible to humans, can be targeted to produce desired labels even with only black-box access to the classifier, and can be made robust as real objects in the physical world [36, 26, 5].
This phenomenon is receiving a tremendous amount of recent attention (e.g. [31, 25, 48, 19, 22, 46] and references therein) for two good reasons. First, a classifier that can be easily fooled by imperceptible noise poses a security threat in any real deployment. Second, it illustrates that even our best models can be making correct predictions for the wrong reasons. This relates to interpretability and trust [39, 29, 14] in modern complex models, which is an important emerging topic.
Typical methods of attack involve modifying pixel values while keeping a small $\ell_2$ or $\ell_\infty$ distance from the original image. Very recent work, however, has shown that small rotations or spatial transformations can also fool classifiers. We would like to propose an extended definition of adversarial examples that captures all these important aspects, building on legal theory and the reasonable person test: A pair of inputs is an adversarial example for a classifier if a reasonable person would say they are of the same class but the classifier produces significantly different outputs. This definition is useful: if someone has defaced a stop sign so that a reasonable person could confuse it for a different sign, nobody can blame a classifier for making the same mistake. On the contrary, attacks like robust physical perturbations of traffic signs would never make a reasonable person think this is not a stop sign.
Many attempts have been made to defend DNNs against adversarial examples. We survey the literature in the subsequent section, but the overall message is that defending against all possible methods of attack, as previously defined, remains challenging. Our intuition is that adversarial examples exist because an original natural image is perturbed into a point that is far from the manifold of natural images. Our classifier has never been trained on objects far from natural images, so it can behave in unexpected ways. Furthermore, the natural image manifold is low-dimensional, but the set of noisy images that can be reached with even small perturbations is very high-dimensional and hence much harder to learn.
In this paper we make the critical assumption that we have a generative model for the data we are working on. This generative model can be either explicit (i.e. produce likelihoods) or an implicit model like a Generative Adversarial Network. Several methods train neural networks to project an image on the manifold [33, 42], but these are end-to-end differentiable and hence easy to attack. We use the compressed sensing inversion method instead: Given an input image $x$ and a classifier $C_\theta$, do not feed the image directly as an input to the classifier, but rather treat it as noisy measurements of another true image in the range of a (pre-trained) generator $G$. We solve a minimization problem to find a $z^*$ such that $G(z^*)$ is close to the input image, and feed $G(z^*)$ to the classifier. This minimization is solved by gradient descent, which makes it a non-differentiable method of projecting on the manifold. Since thousands of gradient steps are required, it is not easy to “unfold” this operation and attack a differentiable substitute model, as we show in our evaluation section.
We formulate this method (called Invert and Classify (INC)) and show that it is able to resist first-order and black-box attacks. We then explore its robustness even further by formulating a min-max optimization problem where the adversary has much more power: the adversary simply tries to find any two points $z$ and $z'$ in the latent space that produce images that are close, i.e. $\|G(z) - G(z')\|$ is small, but on which the classifier produces very different outputs, i.e. the distance between $C_\theta(G(z))$ and $C_\theta(G(z'))$ is large. By Lagrangifying the constraints and using a first-order method, we are able to solve this problem and find pairs of adversarial points on the manifold with unjustified, drastically different classifier confidence. This shows that natural problematic points exist, an idea also supported by recent work which shows it for an artificially constructed classifier over spheres. We thus seek to robustify the system further.
We show how this Min-Max attack can be used to robustify INC by using adversarial training on these examples on the manifold. We show that our proposed INC classifier is robust to various types of attacks including end-to-end substitution models. The accuracy of the classifier drops compared to clean-image performance but the inversion operation seems to provide effective protection.
Our last innovation deals with robust classification without a pre-trained generative model. This is relevant for several rich datasets like ImageNet where it is hard to train an accurate generative model. To address this problem, we rely on the Deep Image Prior (DIP): an untrained convolutional neural network for which the latent code is kept fixed at some random value, but the weights are trained to match a desired output image. Ulyanov et al. showed how this can be used for denoising, inpainting, and super-resolution without any pre-training on a dataset.
We define the Deep Image Prior INC method that uses such untrained generators and can still be used to create robust classifiers for Imagenet. We show that the deep image prior INC protection maintains the accuracy of (top-1) ResNet152 for ImageNet at under BIM attacks for . The price of this robustness is that the accuracy on clean images drops from to , for top-1 classification in classes.
Concretely, our contributions are summarized as follows:
- We formulate and present the “Invert-and-Classify” (INC) algorithm, which protects a classifier by projecting its inputs onto the range of a given generator, which effectively serves as a prior for the classification. We demonstrate that the algorithm induces robustness across a wide variety of attacks, including first-order methods, substitute models, and enhanced attacks combining the two.
- By formulating a min-max optimization problem that can be viewed as an overpowered attack on INC, we demonstrate that there may in fact exist problematic pairs $z, z'$ in the domain of a generator $G$ that interact with the hard decision boundaries of the classifier such that $G(z)$ and $G(z')$ are close but their classifications are far apart. Through adversarial training we soften the classifier’s decision boundaries and demonstrate robustness to the same min-max optimization attack.
- We propose a possible modification of the INC algorithm for settings in which good generative models are unavailable (e.g. for Imagenet), where we instead use the structural prior given by an untrained generator (the Deep Image Prior). We show that this Deep Image Prior defense can actually defend against adversarial attacks for the ImageNet dataset.
2 A Min-Max Formulation
2.1 Step 1: Defending using GANs
Given a classifier $C_\theta$ parametrized by a vector of parameters $\theta$, we want to defend it by filtering its input through a generator $G$ that samples natural inputs. This would be a pre-trained generative model that is assumed to produce natural inputs from all different categories that we are classifying. More precisely, for some hyperparameter $\eta$ and given an input $x$, we perform the following procedure that we call Invert and Classify (INC):
1. Perform gradient descent in $z$ space to minimize $\|G(z) - x\|$. Let $z^*$ be the point returned by gradient descent. Ideally, $z^* = \arg\min_z \|G(z) - x\|$.
2. If the “projection” $G(z^*)$ of $x$ is far from $x$, i.e. if $\|G(z^*) - x\| \ge \eta$, we reject the input $x$ as “unnatural,” since it lies far from the range-of-$G$ manifold.
3. Otherwise, we apply our classifier on the projected input, outputting a class according to the distribution $C_\theta(G(z^*))$.
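The three steps above can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: the "generator" is a toy linear map `G(z) = A @ z` and the classifier is a toy logistic model, both hypothetical stand-ins for the pre-trained generative model and the DNN classifier; the step count, learning rate and threshold `eta` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))   # toy linear "generator" G(z) = A @ z (assumption)
w = rng.normal(size=20)        # toy logistic classifier weights (assumption)

def G(z):
    return A @ z

def classifier(x):
    """Return [P(class 0), P(class 1)] for input x."""
    p1 = 1.0 / (1.0 + np.exp(-(w @ x)))
    return np.array([1.0 - p1, p1])

def invert_and_classify(x, eta, steps=2000, lr=0.005):
    """Project x onto range(G) by gradient descent in z, then reject or classify."""
    z = np.zeros(A.shape[1])
    for _ in range(steps):
        residual = G(z) - x
        z -= lr * 2.0 * A.T @ residual      # gradient of ||G(z) - x||_2^2 w.r.t. z
    projection = G(z)
    if np.linalg.norm(projection - x) >= eta:
        return None, projection             # step 2: reject as "unnatural"
    return classifier(projection), projection  # step 3: classify the projection

# An input near the range of G is accepted; one far from it is rejected.
x_nat = G(rng.normal(size=3)) + 0.01 * rng.normal(size=20)
x_far = x_nat + 5.0 * rng.normal(size=20)
probs, proj = invert_and_classify(x_nat, eta=0.5)
rejected, _ = invert_and_classify(x_far, eta=0.5)
```

With a real generator the inner loop would differentiate through the network (e.g. with an automatic differentiation library) instead of using the closed-form gradient of a linear map.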
2.2 Step 2: An Overpowered Attack
Given some input $x$, one way to attack INC is to search for some $x'$ that is close to $x$ and also close to the manifold, so that the classification of their projections is significantly different.
If such an attack exists, then (by triangle inequality) there must exist $z$ and $z'$ such that $G(z)$ and $G(z')$ are close, yet $C_\theta(G(z))$ and $C_\theta(G(z'))$ are far. We consider the optimization problem that captures how far apart $C_\theta(G(z))$ and $C_\theta(G(z'))$ can be, subject to some constraint on the distance between $G(z)$ and $G(z')$. This provides an upper bound on the magnitude of the INC attack.
This optimization problem upper-bounds the size of the attack to INC. In fact, it also captures the loss that may arise from a potential imperfect optimization in the first step of INC, where the input $x$ is projected into the range of $G$. Namely, the value of the objective function for a solution $z$ and $z'$ to the above optimization problem also captures the maximum loss in accuracy under a scenario where $x = G(z)$ and the projection of $x$ onto the range of $G$ identified in the first step of INC via gradient descent is $G(z')$. Of course, the projection of $x$ into the range of $G$ is not chosen adversarially, so the above optimization problem captures an “overpowered adversary,” serving as an upper bound on the loss from both adversarial attacks and suboptimality in step one of INC.
We can Lagrangify the constraints in the above formulation to obtain the following min-max formulation:
As usual, the Lagrange multipliers make sure that if any constraints are violated, then the inf player can make the objective $-\infty$ by setting the multiplier to $+\infty$. Hence, the sup player must respect these constraints, and the inf player will have to set these multipliers to $0$, so ultimately the right objective will be maximized by the sup player.
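The constrained problem and its Lagrangian form discussed above can be written out as follows. This is a sketch under assumed notation rather than the original typeset equations: $G$ is the generator, $C_\theta$ the classifier, $z, z'$ latent codes, $\ell$ stands for some divergence between classifier outputs, $\delta$ for the distance budget on the generated images, and a single multiplier $\lambda$ is shown even though the text speaks of constraints in the plural.

```latex
% Constrained "overpowered" attack (assumed notation)
\sup_{z,\,z'} \; \ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
\quad \text{subject to} \quad \|G(z) - G(z')\|_2^2 \le \delta^2

% Lagrangian (min-max) form
\sup_{z,\,z'} \; \inf_{\lambda \ge 0} \;
\ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
+ \lambda \big(\delta^2 - \|G(z) - G(z')\|_2^2\big)
```

If the constraint is violated, the bracket multiplying $\lambda$ is negative, so the inf player can send the objective to $-\infty$ by letting $\lambda \to +\infty$; if it is satisfied, the best the inf player can do is $\lambda = 0$.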
We found experimentally that we can identify good solutions to (3) using gradient descent, as we describe in Section 4.3. We show attacks for a gender classifier trained on CelebA and protected with the INC framework using a BEGAN generator. Our experiments show that it is possible to obtain pairs of images that lie in the range of $G$ and are close in $\ell_2$ (or $\ell_\infty$) distance, yet the gender classifier produces drastically different outputs on these images.
Another interesting experimental finding is that the pairs of images that were identified through our attack framework appeared to change gender-relevant features to confuse the classifier, often producing images whose class was ambiguous even for human observers. We show some of these pairs and how they were classified in Section 4.3. The main limitation of the INC-protected classifier is that the predictions have very high confidence (classification probabilities close to 0 or 1) even for ambiguous images, and that small changes to an image cause abrupt changes in the classification probabilities.
Finally, we note that while the classifier outputs drastically different distributions on the pairs of images that we identified, the “noise” introduced by the suboptimality of gradient descent in the first step of INC (the non-convex projection step) seems to make the process more robust. In particular, when the images $G(z)$ and $G(z')$ were given as inputs to the INC-protected classifier, the outputs were not drastically different. This is because the latent codes recovered by gradient descent on these inputs were not exactly $z$ and $z'$, and this projection noise seemed to protect INC from the low-probability set of adversarial examples.
This is clearly an interesting empirical finding for the robustness of INC, but we would not want to base its security on the noise introduced by gradient descent in the projection step. So in the next section we use these adversarial pairs of inputs to robustify our classifier using adversarial training.
2.3 Step 3: Robustifying INC through Adversarial Training
Anticipating the attack outlined in Section 2.2, we can take our defense approach outlined in Section 2.1 one step further, retraining the parameters of the classifier to minimize the damage from the attack. This results in the following min-max formulation:
where the mixing weight is a hyperparameter, the second term of the objective function evaluates the performance of the classifier on the original input set, and the loss used in that term is a traditional classification loss function (it can be the same as that used to train the classifier initially, e.g. the cross-entropy loss).
We can again fold the constraints into a Min-Max formulation as follows:
We tried this retraining approach and report our findings in Section 4.3.
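One way the retraining objective described in this section could be written is shown below. This is a sketch under assumed notation, not the original typeset equation: $\alpha$ denotes the mixing weight, $X$ the labeled input set, $L$ the classification loss, $\ell$ a divergence between classifier outputs, $\delta$ a distance budget on generated images, and $\lambda$ a Lagrange multiplier.

```latex
\min_{\theta} \Bigg[
  \sup_{z,\,z'} \; \inf_{\lambda \ge 0} \;
  \Big\{ \ell\big(C_\theta(G(z)),\, C_\theta(G(z'))\big)
         + \lambda \big(\delta^2 - \|G(z) - G(z')\|_2^2\big) \Big\}
  \; + \; \alpha \sum_{(x,\,y) \in X} L\big(C_\theta(x),\, y\big)
\Bigg]
```

The first term is the value of the overpowered attack against the current classifier, and the second term keeps the classifier accurate on the labeled data while it is being hardened.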
2.4 Step 4: What to do if no GAN is available
The approach described in Sections 2.1–2.3 can be used to robustify any classifier if we have a good generative model for the inputs of interest. We propose a way to extend our approach to settings where no pretrained generative model is available. We illustrate our approach for classification of image data.
We propose to use a Deep Image Prior (DIP), i.e. an untrained generative model with a convolutional neural network topology. The surprising result is that training the weights $w$ to approximate a given input image effectively projects onto the manifold of natural images. We leverage this idea to defend classifiers on Imagenet without relying on a pre-trained generative model. Our method is identical to INC, with the only difference being that the projection step optimizes over the weights $w$.
Specifically, given an input image $x$, we search for $x^*$ such that $x^*$ is a natural image and $\|x^* - x\|$ is small. The Deep Image Prior (DIP) method tries to solve this problem by constructing a generative convolutional neural network $G_w$ parameterized by a set of weights $w$, and searching for a set of weights $w^*$ that satisfy
$$w^* = \arg\min_w \|G_w(z) - x\|_2^2$$
for some randomly selected $z$, which is held constant throughout the optimization procedure. The final output is $G_{w^*}(z)$, which is a natural image close to the input with respect to the $\ell_2$ distance.
An important issue here is that the search over $w$ is a gradient descent procedure that we terminate early: as observed by Ulyanov et al., if too many steps are performed, $G_w$ becomes too expressive and also reconstructs the adversarial noise. The number of steps was empirically tuned in our experiments and depends on the power of the adversary. We discuss the DIP INC in more detail in Section 3.1, and our experiments in Section 4.4.
As a final note, it should be emphasized that while all the previous methods of our paper also apply to non-image datasets, in order to apply our generator-free approach to non-image data we would have to develop an architecture, serving as the analog of DIP, for that type of data.
Given an input $x$, our strategy is to find a $z^*$ such that $G(z^*)$ is close to $x$ in $\ell_2$ distance. This is achieved by sampling a $z_0$ at random and running $T$ iterations of SGD, where the gradient at iteration $t$ is given by $\nabla_z \|G(z_t) - x\|_2^2$.
Our intuition for why this strategy works is based on the observation that adversarial noise is very high-dimensional, whereas the natural images form a low-dimensional manifold in $\mathbb{R}^n$. Hence, searching for an image in the range of $G$ that is close to $x$ in $\ell_2$ norm is equivalent to projecting $x$ onto the manifold of natural images, assuming $G$ has learned the true probability distribution over the training dataset.
3.1 Deep Image Prior Defense
Given an input $x$, our strategy is to find a set of parameters $w^*$ such that $G_{w^*}(z)$ is close to $x$ in $\ell_2$ distance. This is achieved by sampling a $z$ and $w_0$ at random, and running $T$ iterations of SGD, where the gradient at iteration $t$ is given by $\nabla_w \|G_{w_t}(z) - x\|_2^2$. An important benefit of this method is that it alleviates the need for a pre-trained generator. In Section 4.4 we show how this method can be used to protect against attacks on Imagenet.
As observed by Ulyanov et al., the Deep Image Prior can yield very accurate reconstructions if run for many iterations. We found that, to use this method to remove adversarial noise, we have to use early stopping, which has to be carefully tuned.
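The procedure above, including early stopping, can be sketched as follows. This is a minimal illustration under stated assumptions: the network is a tiny two-layer fully connected model with hand-written gradients (a hypothetical stand-in, not the convolutional Deep Image Prior architecture), full-batch gradient descent replaces SGD, and the function name and learning rate are invented for the example. The iteration cap of 500 and the MSE threshold of 0.005 mirror the values quoted in Section 4.4.2.

```python
import numpy as np

def dip_reconstruct(x, max_iters=500, mse_threshold=0.005, lr=0.01,
                    hidden=64, seed=0):
    """Fit randomly initialised weights so that G_w(z) approximates x, with z fixed."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=32)                           # random code, held constant
    W1 = rng.normal(size=(hidden, 32)) / np.sqrt(32)  # untrained weights
    W2 = rng.normal(size=(x.size, hidden)) / np.sqrt(hidden)
    history = []                                      # MSE before each update
    for _ in range(max_iters):
        h = np.tanh(W1 @ z)
        err = W2 @ h - x
        history.append(float(np.mean(err ** 2)))
        if history[-1] < mse_threshold:               # early stopping
            break
        g_out = 2.0 * err / x.size                    # d(MSE)/d(output)
        g_h = W2.T @ g_out                            # backprop through W2
        W2 -= lr * np.outer(g_out, h)
        W1 -= lr * np.outer(g_h * (1.0 - h ** 2), z)  # backprop through tanh
    return W2 @ np.tanh(W1 @ z), history

x = np.random.default_rng(1).uniform(size=16)         # toy "image" with values in [0, 1]
reconstruction, history = dip_reconstruct(x)
```

Lowering `max_iters` corresponds to stopping earlier, which in the defense trades reconstruction accuracy for a smaller chance of also fitting the adversarial perturbation.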
4.1 Experimental Setup
We evaluate the robustness of Invert-and-Classify for classification tasks on two datasets: handwritten digit classification on the MNIST dataset and gender classification on the CelebA dataset. To evaluate the robustness of the Deep Image Prior defense, we ran experiments on 1000 validation images randomly sampled from the Imagenet dataset.
The MNIST images were not pre-processed. The CelebA images were cropped using a bounding box of size (the top left corner of the bounding box was placed at coordinate ) and were then resized to
using bilinear interpolation. The pixel values of the images were normalized to lie in the range. The Imagenet images were resized to and pixel values were normalized to lie in the range .
For MNIST classification (code borrowed from https://www.tensorflow.org/get_started/mnist/pros; architecture was kept the same), the classifier has two convolutional layers, followed by a fully connected layer and a softmax layer. The convolutional layers have 32 and 64 filters respectively, and the filters are of size. The fully connected layer has 1024 nodes and the softmax layer has 10 nodes.
For gender classification on the CelebA dataset (the code was borrowed from https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/ and minimal modifications were made to use it for CelebA), the classifier has two convolutional layers followed by two fully connected layers and a softmax layer. The convolutional layers have 64 filters in each layer, and all filters are of size . The fully connected layers have 384 and 192 nodes respectively, and the softmax layer has two nodes.
For classifying Imagenet images, we used a ResNet152.
We chose the decoder of a Variational Autoencoder (VAE) for generating images of digits from MNIST (code borrowed from https://jmetzen.github.io/notebooks/vae.ipynb; architecture was kept the same). The encoder has 2 fully connected hidden layers, each with 500 nodes; the output is 20-dimensional. The decoder has 2 fully connected hidden layers, each with 500 nodes; the final output is of size .
To generate images of celebrities, we trained a BEGAN (code borrowed from https://github.com/carpedm20/BEGAN-tensorflow; architecture was kept the same) on the first 160,000 images in the CelebA dataset. The generator of the BEGAN has 1 fully connected layer and 5 convolutional layers. The input to the generator is distributed according to , and the fully connected layer is of size . The first 4 convolutional layers have 128 filters of size , and the final convolutional layer has 3 filters of size .
For the Invert-and-Classify experiments on MNIST and CelebA, we used a Tensorflow implementation of an Adam optimizer with initial learning rate 0.1 and 0.01 respectively.
4.2 Robustness against Attacks
We demonstrate the robustness of our method against standard attacks. Note that since the inversion is performed by gradient descent and is therefore non-differentiable, first-order attacks against the end-to-end system are infeasible.
4.2.1 First-Order Classifier Attacks
Most methods to construct adversarial examples try to find perturbations that have small norm. First-order attacks on the full system are intractable to generate; in this section, we demonstrate robustness to first-order attacks on the unprotected classifier.
We first focus on the case where the adversarial perturbation has low $\ell_\infty$ norm, where we use the Fast Gradient Sign method. We perform untargeted attacks: we only require that the classifier predicts some label other than the true label. The adversarial perturbation is given by
$$\epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big)$$
where $x$ is the original image and $y$ is its label; $J(\theta, x, y)$ is the cross entropy loss between the label $y$ and the classifier prediction on $x$. For the CelebA dataset, we additionally evaluated how robust Invert-and-Classify is against the Carlini-Wagner attacks. For these attacks, we set the confidence parameter to 5.
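As a concrete illustration of the Fast Gradient Sign step, the sketch below attacks a logistic-regression classifier, for which the input gradient of the cross-entropy loss has a closed form. The model, its random weights and the value of `eps` are assumptions made for the example; the experiments in this section attack DNN classifiers, and the sketch does not clip the result to a valid pixel range.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, b, x, y):
    """Binary cross-entropy J for a logistic model with weights w and bias b."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fgsm(w, b, x, y, eps):
    """Untargeted FGSM: x_adv = x + eps * sign(grad_x J)."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w            # closed-form gradient of J w.r.t. x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.1     # hypothetical classifier parameters
x, y = rng.normal(size=10), 1.0     # hypothetical input and true label
x_adv = fgsm(w, b, x, y, eps=0.1)
```

Every coordinate of `x_adv` differs from `x` by exactly `eps`, so the perturbation has $\ell_\infty$ norm `eps`, and for this linear model the loss on `x_adv` is strictly larger than on `x`.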
The accuracy of the MNIST classifier and the Invert-and-Classify method for varying $\epsilon$ is shown in Figure 3.
We report the accuracy of the CelebA classifier and the Invert-and-Classify method against the FGSM attack and Carlini-Wagner attacks in Table 1.
Table 1 (columns: No defense | Invert and Classify)
Figure 4 shows the reconstructions obtained by Invert-and-Classify for women and men in the CelebA dataset.
4.2.2 Substitute Model Attacks
We investigate the robustness of our end-to-end INC protected classifier. The gradient descent INC inversion procedure is non-differentiable and hence much harder to attack. One possible attack is to “unfold” the gradient descent steps and create a differentiable model that can be subsequently attacked. Since our gradient descent projection in INC involves thousands of iterations (typically more than ) we could not make such an attack work.
The second approach is to leverage the transferability of adversarial examples and design a black-box attack to evaluate the Invert-and-Classify model; we show that this attack is also ineffective.
We train our end-to-end substitute network as in the typical black-box setting, using input-output pairs of the target model. We emphasize that our substitute model is differentiable and only attempts to approximate the decision boundaries of the non-differentiable Invert-and-Classify model. The architecture of the substitute model follows that of the inner classifier described in Section 4.1. We train our network on images from the CelebA dataset labeled by the output of the Invert-and-Classify model. Finally, to improve the substitute model even further, we also train on inputs that are adversarial to the inner classifier $C_\theta$. This incorporates white-box information, since we provide the adversary with knowledge of the gradients of $C_\theta$ on the training samples.
We craft first-order adversarial examples using our substitute model and feed them to the target INC model. We measure the percentage of the adversarial examples generated for the substitute model that are misclassified by the target model as well. Figure 5 shows the inability of first-order attacks on the substitute model, namely the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM), to transfer to Invert-and-Classify.
4.2.3 Combined Attack
Another idea for attacking INC is to train a differentiable substitute model for the inversion step only, then combine this model with standard first-order attacks on the classifier to produce adversarial inputs. As mentioned, unfolding the gradient descent process is intractable, since thousands of iterations are run per projection. Thus, we model the inversion step using a convolutional neural network, effectively training a differentiable encoder and making an autoencoder. As we show in Table 12 in Appendix A, it is indeed easy to attack the autoencoder, but the attacks do not typically transfer to the INC system.
4.3 Increasing Robustness with Overpowered Generator Attacks
As described in Section 2.2, we perform an overpowered attack on the standard Invert-and-Classify architecture. We search for $z$ and $z'$ such that $G(z)$ and $G(z')$ are close but induce dramatically different classification labels. Recall that this involves solving the following min-max optimization problem:
In practice, we set our constraint to , corresponding to an average squared difference of per pixel-channel. We implement the optimization through alternating iterated gradient descent on both $(z, z')$ and $\lambda$, with a much more aggressive step size for the $\lambda$-player (since its payoff is linear in $\lambda$). The gradient descent procedure is run for 10,000 iterations. Because the constraint was imposed through a Lagrangian, we consider a pair of images valid if the mean distance between the images is .
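The alternating scheme described above can be illustrated on a toy problem. Everything concrete in this sketch is an assumption: the generator is a linear map, the classifier is a logistic model, the objective is the squared gap between the two output probabilities, and the budget `c`, step sizes and iteration count are arbitrary. The sup player takes gradient ascent steps on the latent pair, the inf player takes projected gradient descent steps on the multiplier with a larger step size, and the best pair that satisfies the constraint is tracked along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) / np.sqrt(8)  # toy generator G(z) = A @ z (assumption)
w = rng.normal(size=8)                    # toy logistic classifier (assumption)
g = A.T @ w                               # gradient of the logit w.r.t. z

def prob(z):
    return 1.0 / (1.0 + np.exp(-(g @ z)))

def objective(z1, z2):
    return (prob(z1) - prob(z2)) ** 2     # squared gap between classifier outputs

def dist2(z1, z2):
    d = A @ (z1 - z2)
    return float(d @ d)                   # ||G(z1) - G(z2)||_2^2

def minmax_attack(c, steps=2000, lr_z=0.01, lr_lam=0.1):
    """Gradient ascent on (z1, z2), projected descent on lam, for
    L = objective + lam * (c - dist2)."""
    z1 = rng.normal(size=2)
    z2 = z1 + 1e-3 * rng.normal(size=2)   # start from a nearly identical pair
    lam = 0.0
    best = (objective(z1, z2), z1.copy(), z2.copy())
    for _ in range(steps):
        p1, p2 = prob(z1), prob(z2)
        pull = 2.0 * lam * (A.T @ (A @ (z1 - z2)))  # grad of lam * dist2 w.r.t. z1
        grad_z1 = 2.0 * (p1 - p2) * p1 * (1.0 - p1) * g - pull
        grad_z2 = -2.0 * (p1 - p2) * p2 * (1.0 - p2) * g + pull
        violation = dist2(z1, z2) - c
        z1, z2 = z1 + lr_z * grad_z1, z2 + lr_z * grad_z2  # sup player
        lam = max(0.0, lam + lr_lam * violation)           # inf player, lam >= 0
        if dist2(z1, z2) <= c and objective(z1, z2) > best[0]:
            best = (objective(z1, z2), z1.copy(), z2.copy())
    return best, lam

c = 0.05
(best_obj, z_a, z_b), lam = minmax_attack(c)
```

Tracking the best feasible iterate is a design choice of this sketch: simultaneous gradient steps on a min-max problem may oscillate, so the final iterate is not guaranteed to satisfy the constraint.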
The optimization terminated with 93% of the images satisfying the constraint; within this set, the average KL-divergence between classifier outputs was 2.47, with 57% inducing different classifications. Figure 6 shows randomly selected successful results of the attack.
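For reference, the two statistics quoted above (KL divergence between classifier outputs and whether the predicted classes differ) can be computed as in the sketch below. The direction of the KL divergence and the small clipping constant are assumptions; the text does not state which convention was used, and the probability vectors are made-up examples.

```python
import numpy as np

def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) between two discrete distributions (clipped to avoid log(0))."""
    p = np.clip(np.asarray(p, dtype=float), tiny, 1.0)
    q = np.clip(np.asarray(q, dtype=float), tiny, 1.0)
    return float(np.sum(p * np.log(p / q)))

def different_classification(p, q):
    """True if the two output distributions predict different classes."""
    return int(np.argmax(p)) != int(np.argmax(q))

# Hypothetical classifier outputs for a pair of nearby images.
out_a = [0.99, 0.01]
out_b = [0.01, 0.99]
kl = kl_divergence(out_a, out_b)
flipped = different_classification(out_a, out_b)
```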
First, note that in contrast to the attacks found in Figure 4 on the unprotected classifier, the attacks found with this optimization tend to yield images with semantically relevant features from both classes, and furthermore often introduce meaningful (though minute) differences between $G(z)$ and $G(z')$ (e.g. facial hair, eyes widening, etc.). This suggests that the attack is exploiting the hard decision boundary introduced in classifier training. Secondly, as described in Section 2.2, none of these images actually induce different classifications on the end-to-end classifier, which we attribute to imperfections in the projection step of the defense (that is, the latent codes recovered by gradient descent are not exactly $z$ and $z'$). That said, we opt to robustify the classifier against this attack regardless. Recall the more complex min-max optimization proposed in Section 2.3:
We implement this through adversarial training; at each iteration, in addition to sampling a cross-entropy loss from images from the dataset, we also sample an adversariality loss, where we generate a batch of “adversarial” inputs using 500 steps of the min-max attack, then add the final distance between the classification outputs to the cross-entropy loss. As shown in Figure 7, the classifier eventually learns to minimize the adversary’s ability to find examples, most likely by learning and softening the decision boundaries being exploited by the generator. After robustifying the classifier using this adversarial training, we once again try the attack described earlier in this section for the same 10,000 iterations. Figure 8 shows the convergence of the attack against both the initial and the adversarially trained classifier for two hyperparameter settings, showing the inefficacy of the attack on the adversarially trained classifier. After 10,000 iterations, 100% of the images were valid, but with 22% of them inducing different classification, and an average KL divergence of 0.08, showing that the classifier has indeed significantly softened its decision boundary.
Though it causes softer decision boundaries, the adversarial training does not significantly impact classification accuracy relative to the standard classifier: on normal input data, the model achieves the same 97% accuracy as the original classifier. We also feed the “adversarial” inputs generated by the min-max attack on the initial classifier into the adversarially trained classifier, and observe that the average classification divergence between examples drops to 0.007, with only 18% of the valid images being classified inconsistently. Figure 9 shows a randomly selected subset of these examples with their respective classifier output.
4.4 Deep Image Prior
4.4.1 Adversarial Attack
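The adversarial attacks on the ResNet were constructed using 20 steps of the Basic Iterative Method. This attack is given by
$$x_0 = x, \qquad x_{n+1} = \mathrm{Clip}_{x,\epsilon}\big\{x_n + \alpha \cdot \mathrm{sign}\big(\nabla_x J(\theta, x_n, y)\big)\big\}$$
where $x$ is the original image and $y$ is its label; $J(\theta, x_n, y)$ is the cross entropy loss between the label $y$ and the classifier’s prediction on $x_n$; $\alpha$ is the step size; $\mathrm{Clip}_{x,\epsilon}$ is a clipping such that $x_{n+1}$ lies in an $\ell_\infty$ ball of radius $\epsilon$, centered at $x$.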
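The iteration above can be sketched on a logistic-regression classifier, where the input gradient is available in closed form. The model, its random weights, the step size `alpha` and the radius `eps` are assumptions chosen for the example; only the number of steps (20) comes from the text, and the sketch omits clipping to a valid pixel range.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, b, x, y):
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bim(w, b, x, y, eps, alpha, steps=20):
    """Basic Iterative Method: repeated signed-gradient steps, each followed by
    clipping back into the l_inf ball of radius eps around the original x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_adv + b)
        grad_x = (p - y) * w                       # gradient of J w.r.t. the input
        x_adv = x_adv + alpha * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # Clip_{x, eps}
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0      # hypothetical classifier parameters
x, y = rng.normal(size=10), 0.0      # hypothetical input and true label
x_adv = bim(w, b, x, y, eps=0.05, alpha=0.01)
```

Because 20 steps of size `alpha = 0.01` would move each coordinate further than `eps = 0.05`, the clipping is what keeps the final perturbation inside the allowed ball.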
We evaluated the robustness of the Deep Image Prior defense for . (Note: Our original images were rescaled such that each pixel lies in the range ).
4.4.2 Early Stopping
For , we run at most 500 iterations of DIP, while for we run at most 100 iterations. We also stop the optimization procedure if the Mean Square Error falls below a threshold of 0.005.
Note that running only 100 iterations leads to a decrease in accuracy for images that are not adversarial. If we run DIP for more than 100 iterations on adversarial images that have , then the adversarial perturbation is also reconstructed, and the reconstruction remains adversarial. This can be attributed to the low Signal to Noise Ratio that results from adding an adversarial perturbation with .
Table 2 (columns: Clean Images: Unprotected, DIP Protected | Adversarial Images: Unprotected, DIP Protected)
4.4.3 Quantitative Results
Table 2 reports the accuracy of a ResNet152 against adversarial examples constructed using the Basic Iterative Method for varying $\epsilon$. DIP Protected Classifier refers to first reconstructing images via the Deep Image Prior method, followed by classification using a ResNet152.
4.4.4 Qualitative Results
Figure 10 shows the reconstructions obtained using the Deep Image Prior method on original and adversarial images. The adversarial images were generated using 20 steps of the Basic Iterative Method with . The Deep Image Prior method was run for 500 iterations or until the Mean Square Error fell below .
5 Related work
There is currently a deluge of recent work on adversarial attacks and defenses. Common defense approaches involve modifying the training dataset such that the classifier is made more robust, modifying the network architecture to increase robustness, or performing defensive distillation. The idea of adversarial training and its connection to robust optimization [41, 32, 43] leads to a fruitful line of defenses. On the attacker side, Carlini and Wagner show different ways of overcoming many of the existing defense strategies.
Our approach to defending against adversarial examples leverages the power of GANs. The GAT-Trainer work by Lee et al. uses generative models to perform adversarial training, but in a very different way from our work and without projecting onto the range of a GAN. MagNet and APE-GAN share the idea of removing adversarial noise using a generative model, but use differentiable projection methods that have already been successfully attacked by Carlini and Wagner.
While we were writing this paper we found two related submissions appearing online. The most closely related concurrent work is DefenseGAN, submitted to ICLR, which proposes a very similar method to INC, independently from our work. However, the current manuscript only validates on MNIST and does not discuss the Min-Max attack, the robustification process, or the Deep Image Prior method. The second related paper is PixelDefend. The main difference from our work is that PixelDefend uses PixelCNN generators as opposed to GANs, and hence the projection, attack and defense processes are different.
This work demonstrates the possibility of resisting adversarial attacks using a generative model. We propose the Invert-and-Classify (INC) algorithm, based on the idea of projecting inputs into the range of a trained Generative Adversarial Network before classification. The INC projection is performed using gradient descent in $z$-space, which is a non-differentiable process.
We demonstrate the mechanism’s ability to resist both off-the-shelf and specifically designed first-order and black-box attacks. Then, through a crafted min-max optimization, we demonstrate that there are still adversarial images in the range of the GAN. These points are very close yet induce drastically different classifications. These points show that a classifier can be tremendously confident and disagree with human judgement. We show how to solve this problem by adversarially training on these inputs and obtain a robust INC model that displays natural uncertainty around decision boundaries.
Finally, for the cases when no pre-trained generative model is available, we propose the Deep Image Prior (DIP) INC defense. This relies on a structural prior given by an untrained generator to defend against adversarial examples. We show that this allows for a defense for the Imagenet dataset that is robust to first-order methods against the unprotected classifier.
-  Martín Abadi, Ashish Agarwal, Paul Barham, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
-  Anonymous. Adversarial spheres. ICLR Submission, available on OpenReview, 2017.
-  Anonymous. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. ICLR Submission, available on OpenReview, 2017.
-  Anonymous. Spatially transformed adversarial examples. ICLR Submission, available on OpenReview, 2017.
-  Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
-  David Berthelot, Tom Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
-  Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
-  Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint arXiv:1705.07263, 2017.
-  Nicholas Carlini and David Wagner. MagNet and “efficient defenses against adversarial attacks” are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
-  Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pages 854–863, 2017.
-  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
-  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  Ethan R Elenberg, Alexandros G Dimakis, Moran Feldman, and Amin Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. Advances in Neural Information Processing Systems (NIPS), 2017.
-  Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
-  Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
-  Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701, 2017.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. arXiv preprint arXiv:1702.06832, 2017.
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Hyeungill Lee, Sungyeob Han, and Jungwoo Lee. Generative adversarial trainer: Defense to adversarial perturbations with GAN. arXiv preprint arXiv:1705.03387, 2017.
-  Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
-  Dongyu Meng and Hao Chen. MagNet: a two-pronged defense against adversarial examples. arXiv preprint arXiv:1705.09064, 2017.
-  Alan D Miller and Ronen Perry. The reasonable person. New York University Law Review, 87:323, 2012.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.
-  Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
-  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
-  S. Shen, G. Jin, K. Gao, and Y. Zhang. APE-GAN: Adversarial perturbation elimination with GAN. ICLR Submission, available on OpenReview, 2017.
-  Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
-  Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
-  Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.
-  Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. arXiv preprint arXiv:1707.06728, 2017.
Appendix A Autoencode-and-Classify
We now introduce a strategy that can be useful in detecting the presence of adversarial perturbations. Certain generative models, such as VAEs, BiGANs, and ALI, have an autoencoder structure. In these models, an encoder $E$ and a generator $G$ are learned jointly and are designed to be approximate inverses of each other. The encoder $E$ maps an input image $x$ to a seed value $z = E(x)$, and the generator $G$ maps the seed value to a reconstruction $\hat{x} = G(z)$, such that $\hat{x} \approx x$.
MagNet and APE-GAN propose closely related ideas. Unfortunately, using a differentiable denoiser (e.g. an encoder) to protect against adversarial examples is problematic, since an attacker can design new attacks by back-propagating through the encoder. This was shown by the Carlini and Wagner attacks on MagNet and APE-GAN. This approach also shows that one can attack an encoder to generate an image from a different class.
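The final prediction on an input $x$ is given by $C(G(E(x)))$, where $C$ denotes the classifier, $G$ the generator, and $E$ the encoder. Hence we can view $C \circ G \circ E$ as a feedforward classifier for which we can construct an attack. In this case, equation 8 can be modified to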
where is the original image and is its label; is the cross entropy loss between the label and the classifier prediction on .
We observe that adversarial attacks drive the encoding of images away from the typical set. If an adversarial example $x_{adv}$ is encoded to a seed value $z_{adv}$, then the distribution of $z_{adv}$ differs significantly from the true distribution of the seed values. In most models, seed values to the generator are drawn from a standard Gaussian, $\mathcal{N}(0, I)$. Figure 14 shows the distribution of seed values produced by adversarial examples and by non-adversarial examples.
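One simple way to turn this observation into a check is sketched below. It assumes the seed prior is a standard Gaussian in $d$ dimensions, in which case the squared norm of a seed follows a chi-squared distribution with mean $d$ and standard deviation $\sqrt{2d}$. The function names and the three-standard-deviation threshold are hypothetical choices for illustration, not the procedure used to produce the figures.

```python
import math

def squared_norm(z):
    """Squared Euclidean norm of a seed vector given as a list of floats."""
    return sum(v * v for v in z)

def is_atypical(z, num_std=3.0):
    """Flag a seed whose squared norm is far from what N(0, I) would produce.

    For z ~ N(0, I_d), ||z||^2 is chi-squared with d degrees of freedom,
    so it has mean d and standard deviation sqrt(2 * d).
    """
    d = len(z)
    return abs(squared_norm(z) - d) > num_std * math.sqrt(2.0 * d)
```

A seed with squared norm close to its dimension passes the check, while an encoding whose norm is much larger or much smaller is flagged as lying outside the typical set.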
Figure 13 shows how the distance between input images and images from the generator varies depending on whether the input is adversarial or natural.
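The quantity compared here can be sketched on a toy model. The encoder and generator below are hypothetical stand-ins (an angle map and its inverse onto the unit circle) rather than the learned networks, chosen only so that they are exact inverses on the toy manifold.

```python
import math

def encoder(x):
    """Toy encoder: angle of a 2-D point (stand-in for a learned encoder)."""
    return math.atan2(x[1], x[0])

def generator(z):
    """Toy generator: seed angle -> point on the unit circle."""
    return (math.cos(z), math.sin(z))

def reconstruction_distance(x):
    """Euclidean distance between x and its autoencoded reconstruction."""
    rx, ry = generator(encoder(x))
    return math.hypot(x[0] - rx, x[1] - ry)
```

In this toy setting, a point on the circle is reconstructed exactly, while a point pushed off the circle is reconstructed at a distance equal to how far it was moved from the manifold.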
Figure 15 shows the images obtained by autoencoding versus , where is the inversion of .