The vulnerability of deep neural networks to adversarial attacks has generated a lot of interest and concern in the past few years. The fact that these networks can be easily fooled by adding specially crafted noise to the input, such that the original and modified inputs are indistinguishable to humans szegedy2013intriguing, clearly suggests that they fail to mimic the human learning process. Even though these networks achieve state-of-the-art performance, often surpassing human level performance he2015delving,huang2017densely on the test data used for different tasks, their vulnerability is a cause of concern when deploying them in real life applications, especially in domains such as health care finlayson2018adversarial, autonomous vehicles evtimov2017robust and defense, etc.
1.1 Adversarial Attacks and Defenses
Adversarially crafted samples can be classified into two broad categories, namely (i) adversarial samples szegedy2013intriguing and (ii) fooling samples as defined by nguyen2015deep. Existence of adversarial samples was first shown by Szegedy et al. szegedy2013intriguing, while fooling samples nguyen2015deep, which are closely related to the idea of “rubbish class” images lecun1998gradient were introduced by Nguyen et al. nguyen2015deep. Evolutionary algorithms were applied to inputs drawn from a uniform distribution, using the predicted probability corresponding to the targeted class as the fitness function nguyen2015deep to craft such fooling samples. It has also been shown that Gaussian noise can be directly used to trick classifiers into predicting one of the output classes with very high probability goodfellow2014explaining.
Adversarial attack methods can be classified into (i) white box attacks szegedy2013intriguing,goodfellow2014explaining,carlini2017towards,papernot2016limitations,moosavi2016deepfool,madry2017towards, which use knowledge of the machine learning model (such as model architecture, loss function used during training, etc.) for crafting adversarial samples, and (ii) black box attacks papernot2017practical,papernot2016transferability,chen2017zoo, which only require the model for obtaining labels corresponding to input samples. Both these kinds of attacks can be further split into two sub categories, (i) targeted attacks, which trick the model into producing a chosen output, and (ii) non-targeted attacks, which cause the model to produce any undesired output goodfellow2014explaining. The majority of attacks and defenses have dealt with adversarial samples so far szegedy2013intriguing,gu2014towards,papernot2016distillation, while a relatively smaller literature deals with fooling samples nguyen2015deep. However, to the best of our knowledge, no prior method tries to defend against both kinds of samples simultaneously under a unified framework. State-of-the-art defense mechanisms have tried to harden a classifier by one or more of the following techniques: adversarial retraining szegedy2013intriguing, preprocessing inputs gu2014towards, deploying auxiliary detection networks Meng:2017:MTD:3133956.3134057 or obfuscating gradients obfuscated-gradients. One common drawback of these defense mechanisms is that they do not eliminate the vulnerability of deep networks altogether, but only try to defend against previously proposed attack methods. Hence, they have been easily broken by stronger attacks, which are specifically designed to overcome their defense strategies carlini2016defensive,obfuscated-gradients.
Szegedy et al. szegedy2013intriguing argue that the primary reason for the existence of adversarial samples is the presence of small “pockets” in the data manifold, which are rarely sampled in the training or test set. On the other hand, Goodfellow et al. goodfellow2014explaining have proposed the “linearity hypothesis” to explain the presence of adversarial samples. Under our approach as detailed in Sec. 3.4, the adversarial objective poses a fundamental conflict of interest, and inherently addresses both these possible explanations.
We design a generative model that finds a latent random variable $z$ such that the data label $y$ and the data $x$ become conditionally independent given $z$, i.e., $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$. We base our generative model on VAEs kingma2013auto, and obtain an inference model that represents $p(z \mid x)$ and a generative model that represents $p(x \mid z)$. We perform label inference by computing $p(y \mid x) = \int p(y \mid z)\,p(z \mid x)\,dz$. We choose the latent space distribution $p(z)$ to be a mixture of Gaussians, such that each mixture component represents one of the classes in the data. Under this construct, inferring the label given the latent encoding, i.e., $p(y \mid z)$, becomes trivial by computing the contribution of the mixture components. Adversarial samples are dealt with by thresholding in the latent and output spaces of the generative model and rejecting the inputs for which $p(x)$ is negligible. In Figure 1, we describe our network at test and train time.
Our contributions can be summarized as follows.
- We show how VAEs can be trained with labeled data, using a Gaussian mixture prior on the latent variable in order to perform classification.
- We perform selective classification using this framework, thereby rejecting adversarial and fooling samples.
- We propose a method to learn a classifier in a semi-supervised scenario using the same framework, and show that this classifier is also resistant to adversarial attacks.
- We also show how the detected adversarial samples can be reclassified into the correct class by iterative optimization.
- We verify our claims through experimentation on 3 publicly available datasets: MNIST lecun1998gradient, SVHN netzer2011reading and COIL-100 nayar1996columbia.
2 Related Work
A few pieces of work in the existing literature on defense against adversarial attacks have attempted to use generative models in different ways.
Samangouei et al. samangouei2018defensegan propose training a Generative Adversarial Network (GAN) on the training data of a classifier, and use this network to project every test sample onto the data manifold by iterative optimization. This method does not try to detect adversarial samples, and does not tackle “fooling images”. Further, this defense technique has recently been shown to be ineffective obfuscated-gradients. Other pieces of work have also shown that adversarial samples can lie on the output manifold of generative models trained on the training data for a classifier zhao2017generating.
PixelDefend, proposed by Song et al. song2017pixeldefend also uses a generative model to detect adversarial samples, and then rectifies the classifier output by projecting the adversarial input back to the data manifold. However, Athalye et al. have shown that this method can also be broken by bypassing the exploding/vanishing gradient problem introduced by the defense mechanism.
MagNet meng2017magnet uses autoencoders to detect adversarial inputs, and is similar to our detection mechanism in the way reconstruction threshold is used for detecting adversarial inputs. This defense method does not claim security in the white box setting. Further, the technique has also been broken in the grey box setting by recently proposed attack methods carlini2017magnet.
Traditional autoencoders do not constrain the latent representation to have a specific distribution like variational autoencoders. Our use of variational autoencoders allows us to defend against adversarial and fooling inputs simultaneously, by using thresholds in the latent and output spaces of the model in conjunction. This makes the method secure to white box attacks as well, which is not the case with MagNet.
Further, even state of the art defense mechanisms madry2017towards and certified defenses have been shown to be ineffective for simple datasets such as MNIST song2018generative. We show via extensive experimentation on different datasets how our method is able to defend against strong adversarial attacks, as well as end to end white box attacks.
3.1 Variational Autoencoders
We consider the dataset $X = \{x^{(i)}\}_{i=1}^{N}$ consisting of $N$ i.i.d. samples of a random variable $x$ in the space $\mathcal{X}$. Let $z$ be the latent representation from which the data is assumed to have been generated. Similar to Kingma et al. kingma2013auto, we assume that the data generation process consists of two steps: (i) a value $z$ is sampled from a prior distribution $p_\theta(z)$; (ii) a value $x$ is generated from a conditional distribution $p_\theta(x \mid z)$. We also assume that the prior and likelihood come from parametric families of distributions $p_\theta(z)$ and $p_\theta(x \mid z)$ respectively. In order to maximize the data likelihood $p_\theta(x)$, VAEs kingma2013auto use an encoder network $q_\phi(z \mid x)$ that approximates $p_\theta(z \mid x)$. The evidence lower bound (ELBO) for VAE is given by
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right),$$
where $D_{KL}$ represents the KL divergence measure. Using a Gaussian prior $p_\theta(z)$ and a Gaussian posterior $q_\phi(z \mid x)$, variational autoencoders maximize this lower bound by deriving a closed form expression for the KL divergence term.
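The closed form mentioned above can be illustrated for the common special case of a diagonal Gaussian posterior and a standard normal prior. This is a minimal sketch; the function name and the log-variance parameterization are assumptions made for the example, not details taken from this paper.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Identical distributions give zero divergence.
kl_zero = kl_diag_gaussian_to_standard_normal(np.zeros(4), np.zeros(4))

# Shifting the mean by one unit in a single dimension costs 0.5 nats.
kl_shift = kl_diag_gaussian_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

Because the expression is differentiable in both the mean and the log-variance, it can be used directly as a term of a training loss.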
3.2 Modifying the Evidence Lower Bound
VAEs do not enforce any lower or upper bound on the encoder entropy $H(q_\phi(z \mid x))$. This can result in blurry reconstructions due to sample averaging in case of overlap in the latent space. On the other hand, an unbounded decrease in $H(q_\phi(z \mid x))$ is not desirable either, as in that case the VAE can degenerate to a deterministic autoencoder, leading to holes in the latent space. Hence, we seek an alternative design in which we fix this quantity to a constant value. In order to do so, we express the KL divergence in terms of entropy:
$$D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) = H\left(q_\phi(z \mid x),\, p_\theta(z)\right) - H\left(q_\phi(z \mid x)\right),$$
where $H(q_\phi(z \mid x), p_\theta(z))$ represents the cross entropy between $q_\phi(z \mid x)$ and $p_\theta(z)$. It can be noted that we need to minimize the KL divergence term. Hence, if we assume that $H(q_\phi(z \mid x))$ is constant, then we can drop this term during optimization (please refer to the next section for details of how $H(q_\phi(z \mid x))$ is enforced to be constant). This lets us replace the KL divergence in the loss function with $H(q_\phi(z \mid x), p_\theta(z))$.
The choice of fixing the entropy of is further justified via experiments in section 4.
3.3 Supervision using a Gaussian Mixture Prior
In this section, we modify the above ELBO term for supervised learning by including the random variable $y$ denoting labels. The following expression can be derived for the log-likelihood of the data.
Noting that the KL divergence is non-negative, and replacing the KL divergence with the cross entropy by assuming the encoder entropy to be constant (as shown in Eqn. 3), we get the following lower bound on the data likelihood.
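One way to obtain a bound of this kind is sketched below. The notation ($q_\phi(z \mid x)$ for the encoder distribution, $p(z \mid y)$ for the mixture component of class $y$) and the intermediate steps are assumptions chosen to agree with the surrounding text; they are not a reproduction of the original equations. The sketch relies on the conditional independence $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$, which gives $p(x, y, z) = p(x \mid z)\,p(z \mid y)\,p(y)$.

$$\begin{aligned}
\log p(x, y)
  &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, y, z)}{q_\phi(z \mid x)}\right]
     + D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z \mid x, y)\right) \\
  &\geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(x \mid z)\right]
     + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(z \mid y)\right]
     + \log p(y) + H\left(q_\phi(z \mid x)\right) \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(x \mid z)\right]
     - H\left(q_\phi(z \mid x),\, p(z \mid y)\right)
     + \log p(y) + H\left(q_\phi(z \mid x)\right).
\end{aligned}$$

The inequality drops the non-negative KL term. When the encoder entropy is held constant, the last term does not influence the optimization, which leaves a reconstruction term, a cross entropy to the class component, and the class prior.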
We choose our VAE to use a Gaussian mixture prior for the latent variable $z$. We further choose the number of mixture components to be equal to the number of classes in the training data. The means of these components are assumed to be the one-hot encodings of the class labels in the latent space. It can be noted here that although this choice forces the latent dimensionality to equal the number of classes, it can be easily altered by choosing the means in a different manner. For example, the means of all the mixture components can lie on a single axis in the latent space. Unlike usual VAEs, our encoder network outputs only the mean of $q_\phi(z \mid x)$. We use the reparameterization trick introduced by Kingma et al. kingma2013auto, but sample the noise input from a Gaussian with fixed covariance in order to enforce the entropy of $q_\phi(z \mid x)$ to be constant. Here, each mixture component corresponds to one class, and $x$ is assumed to be generated from the latent space according to $p_\theta(x \mid z)$ irrespective of $y$. Therefore, $x$ and $y$ become conditionally independent given $z$, i.e., $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$.
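The construction described above can be sketched in a few lines. The helper names, the isotropic covariance $\sigma^2 I$, and the value of `sigma` are assumptions for illustration only. The last helper shows why a fixed covariance fixes the entropy: the differential entropy of a Gaussian does not depend on its mean.

```python
import numpy as np

def one_hot_means(num_classes, latent_dim=None):
    """Mixture means as (zero-padded) one-hot vectors, one row per class."""
    latent_dim = num_classes if latent_dim is None else latent_dim
    if latent_dim < num_classes:
        raise ValueError("latent_dim must be at least num_classes for one-hot means")
    means = np.zeros((num_classes, latent_dim))
    means[np.arange(num_classes), np.arange(num_classes)] = 1.0
    return means

def sample_latent(encoder_mean, sigma, rng):
    """Reparameterized sample z = mean + sigma * eps, with sigma FIXED (not learned)."""
    eps = rng.standard_normal(np.shape(encoder_mean))
    return np.asarray(encoder_mean) + sigma * eps

def gaussian_entropy(sigma, latent_dim):
    """Differential entropy of N(mean, sigma^2 I); independent of the mean."""
    return 0.5 * latent_dim * np.log(2.0 * np.pi * np.e * sigma**2)

rng = np.random.default_rng(0)
means = one_hot_means(num_classes=3, latent_dim=5)   # two padded dimensions
z = sample_latent(means[1], sigma=0.1, rng=rng)      # sample around class 1
entropy = gaussian_entropy(sigma=0.1, latent_dim=5)  # same for every input
```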
Assuming the classes to be equally likely, the final loss function for an input $x$ with label $y$ becomes the following.
where the encoder is represented by , the decoder is represented by and represents the mean of the mixture component corresponding to . is a hyper-parameter that trades off between reconstruction fidelity, latent space prior and classification accuracy.
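A loss of the kind described above can be sketched as follows. The exact form is an assumption: with a Gaussian likelihood the reconstruction term reduces to a squared error, and with a Gaussian class component and a fixed encoder covariance the cross-entropy term reduces, up to constants, to a squared distance between the encoder mean and the class mean. The weight `alpha` and all function and variable names are hypothetical.

```python
import numpy as np

def supervised_vae_loss(x, x_recon, enc_mean, class_mean, alpha):
    """Squared reconstruction error plus weighted squared latent distance to the class mean."""
    recon = np.sum((x - x_recon) ** 2)
    latent = np.sum((enc_mean - class_mean) ** 2)
    return recon + alpha * latent

x = np.array([0.2, 0.8, 0.5])         # input
x_recon = np.array([0.2, 0.7, 0.5])   # decoder output
enc_mean = np.array([0.9, 0.1])       # encoder mean for x
class_mean = np.array([1.0, 0.0])     # one-hot mean of the true class
loss = supervised_vae_loss(x, x_recon, enc_mean, class_mean, alpha=2.0)
```

Pulling the encoder mean towards the one-hot mean of the true class is what makes the same term act both as a prior-matching term and as a classification term.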
The label for an input sample can be obtained following the Bayes Decision rule.
$p_\theta(z \mid x)$ can be approximated by $q_\phi(z \mid x)$, i.e., the encoder distribution. This corresponds to the Bayes decision rule in the scenario where there is no overlap among the classes in the input space, the encoder distribution has enough variability, and it is able to match the true posterior exactly.
In order to compute the class label as defined in equation 8, we use a single sample estimate of the integration by simply using the mean of the encoder distribution $q_\phi(z \mid x)$ as the value of $z$ in our experiments. This choice does not affect the accuracy as long as the mixture components representing the classes are well separated in the latent space.
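With equally likely classes and mixture components that share the same isotropic covariance (both assumptions made for this sketch), picking the component with the highest posterior is equivalent to picking the nearest component mean. The function name is hypothetical.

```python
import numpy as np

def predict_label(enc_mean, class_means):
    """Return the index of the mixture component whose mean is closest to the encoding."""
    sq_dist = np.sum((class_means - enc_mean) ** 2, axis=1)
    return int(np.argmin(sq_dist))

class_means = np.eye(3)   # one-hot means for three classes
label = predict_label(np.array([0.1, 0.8, 0.2]), class_means)
```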
3.4 Resisting adversarial attacks
In order to successfully reject adversarial samples irrespective of the method of their generation, we use thresholding at the encoder and decoder outputs. This allows us to reject any sample whose encoding has low probability under $p(z)$, i.e., if the distance between its encoding and the encoding of the predicted class label in the latent space exceeds a latent-space threshold (since $p(z)$ is a mixture of Gaussians). We further reject those input samples which have low probability under $p(x \mid z)$, i.e., if the reconstruction error exceeds a reconstruction threshold (since $p(x \mid z)$ is Gaussian). Essentially, a combination of these two thresholds ensures that $p(x)$ is not low.
Both thresholds can be determined based on statistics obtained while training the model. In our experiments, we implement thresholding in the latent space as follows: we calculate the Mahalanobis distance between the encoding of the input and the encoding of the corresponding mixture component mean, and reject the sample if it exceeds the critical chi-square value (the multivariate counterpart of a standard-deviation rule in the univariate case). Similarly, for the reconstruction threshold, we use the corresponding value for the reconstruction errors. However, in general, any value can be assigned to these two thresholds, and they determine the risk to coverage trade-off for this selective classifier.
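The two-threshold rejection rule can be sketched as below. This is an illustration under stated assumptions: the covariance is taken to be isotropic, the latent threshold is estimated as an empirical upper quantile of training distances rather than from a chi-square table, and the quantile `0.997`, the reconstruction threshold `0.05`, and all names are hypothetical.

```python
import numpy as np

def squared_mahalanobis(z, mean, sigma):
    """Squared Mahalanobis distance for an isotropic covariance sigma^2 * I."""
    return float(np.sum((z - mean) ** 2) / sigma**2)

def fit_threshold(train_values, quantile=0.997):
    """Pick a threshold as an upper quantile of values observed on training data."""
    return float(np.quantile(train_values, quantile))

def accept(z, class_mean, sigma, recon_error, latent_thresh, recon_thresh):
    """Accept only if BOTH the latent distance and the reconstruction error are small."""
    latent_ok = squared_mahalanobis(z, class_mean, sigma) <= latent_thresh
    recon_ok = recon_error <= recon_thresh
    return bool(latent_ok and recon_ok)

rng = np.random.default_rng(0)
sigma = 0.1
class_mean = np.array([1.0, 0.0, 0.0])

# Simulated training encodings of one class, used to fit the latent threshold.
train_z = class_mean + sigma * rng.standard_normal((5000, 3))
train_d2 = np.sum((train_z - class_mean) ** 2, axis=1) / sigma**2
latent_thresh = fit_threshold(train_d2)
recon_thresh = 0.05  # hypothetical value; would also come from training statistics

near = accept(class_mean + 0.01, class_mean, sigma, 0.01, latent_thresh, recon_thresh)
far = accept(class_mean + 1.0, class_mean, sigma, 0.01, latent_thresh, recon_thresh)
blurry = accept(class_mean + 0.01, class_mean, sigma, 0.50, latent_thresh, recon_thresh)
```

Raising either threshold increases coverage at the cost of accepting more risky inputs, which is the trade-off referred to above.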
If the maximum allowed norm of the perturbation is $\epsilon$, then the adversary, trying to modify an input $x$ from class $y$, must satisfy the following criteria.
By the first three constraints, the encoding of the original input and that of the adversarial input must belong to different Gaussian mixture components in the latent space. However, the remaining constraint requires the reconstruction obtained from the encoding of the adversarial input to be close to the adversarial input, i.e., close to the original input in the pixel space. This is extremely hard to satisfy because of the low probability of occurrence of holes in the latent space within the threshold distance from the means.
Similarly, for the case of fooling samples, it can be argued that even if an attacker manages to generate a fooling sample which tricks the encoder, it will be very hard to simultaneously trick the decoder to reconstruct a similar image belonging to the rubbish class.
Once a sample is detected as adversarial by either or both the thresholds discussed above, we attempt to find its true label using the decoder only. By definition of adversarial images, $\|x_{adv} - x\| \leq \epsilon$, where $x_{adv}$ is the adversarial image corresponding to the original image $x$, and $\epsilon$ is small. Hence, we can conclude that for any given image $x'$, $\|x_{adv} - x'\| \approx \|x - x'\|$. Suppose $z^*$, the latent point whose decoding has the minimum reconstruction error with respect to $x_{adv}$, is given by Eqn. 9.
Following the argument stated above, $z^*$ also approximately minimizes the reconstruction error with respect to $x$. We can now find the label of the adversarial sample as the class of the mixture component to which $z^*$ belongs. Essentially, for reclassification, we try to find the $z$ in the latent space which, when decoded, gives the minimum reconstruction error from the adversarial input. However, if Eqn. 9 returns a $z^*$ that lies beyond the latent-space threshold from the corresponding mean, or if the reconstruction error exceeds the reconstruction threshold, we conclude that the sample is a fooling sample and reject the sample. It can be noted here that if this network is deployed in a scenario where fooling samples are not expected to be encountered, one can choose not to reject samples during reclassification, thereby increasing coverage. Also, starting from a single value of $z$ can cause the optimization process to get stuck at a local minimum. A better alternative is to run one optimization process per class, with the mixture component means as the initial values, and choose the $z$ which gives the minimum reconstruction error as $z^*$. Given that enough compute power is available, these processes can be run in parallel. In our experiments, we follow these two strategies while reclassifying adversarial samples.
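The multi-start search described above can be sketched with plain gradient descent. The decoder here is a toy linear map chosen so the gradient is available in closed form; a real decoder would be a neural network differentiated automatically. The function names, step size, and step count are assumptions for illustration, and the optional rejection of fooling samples is omitted.

```python
import numpy as np

def reclassify(x_adv, decode, grad_z, class_means, steps=200, lr=0.1):
    """Multi-start gradient descent on z to minimise ||x_adv - decode(z)||^2.

    One run is started from every class mean. The run with the smallest final
    reconstruction error wins, and the label is the class whose mean is nearest
    to the winning latent point.
    """
    best_z, best_err = None, np.inf
    for start in class_means:
        z = start.astype(float).copy()
        for _ in range(steps):
            z -= lr * grad_z(z, x_adv)
        err = float(np.sum((x_adv - decode(z)) ** 2))
        if err < best_err:
            best_z, best_err = z, err
    label = int(np.argmin(np.sum((class_means - best_z) ** 2, axis=1)))
    return label, best_z, best_err

# Toy linear decoder (an assumption for illustration only).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decode = lambda z: W @ z
grad_z = lambda z, x: -2.0 * W.T @ (x - W @ z)   # gradient of the squared error w.r.t. z

class_means = np.eye(2)
x_clean = decode(class_means[1])                  # a clean "image" of class 1
x_adv = x_clean + np.array([0.05, -0.05, 0.05])   # small perturbation
label, z_star, err = reclassify(x_adv, decode, grad_z, class_means)
```

Because the runs are independent of each other, they can be executed in parallel when compute allows.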
We verify the effectiveness of our network through numerical results and visual analysis on three different datasets - MNIST, SVHN and COIL-100. For different datasets, we make minimal changes to the hyper-parameters of our network, partly due to the difference in the image size and image type (grayscale/colored) in each dataset.
We use an encoder network with convolution, max-pooling and dense layers to parameterize the encoder distribution, and a decoder network with convolution, up-sampling and dense layers to parameterize the likelihood of the data given the latent variable. We choose the dimensionality of the latent space to be the same as the number of classes for MNIST and COIL-100. However, noting that the size of images is larger for SVHN compared to MNIST, and also because the dataset contains colored images, we choose a latent dimensionality for SVHN that is larger than the number of classes. The choice of means also varies slightly for this dataset, as we pad zeros to the one-hot encodings of the class labels to allow for the extra latent dimensions. The standard deviation of the encoder distribution is chosen such that the chance of overlap of the mixture components in the latent space is negligible and the classes are well separated. We use a fixed variance for the MNIST dataset, and reduce this value as the latent dimensionality increases for the other datasets. We use the ReLU nonlinearity in our network, and sigmoid activation in the final layer so that the output lies in the allowed range. We use the Adam kingma2014adam optimizer for training.
Since our algorithm relies upon the reconstruction error between the generated and the original samples, we first show a few randomly chosen images generated by the network (for both supervised and semi-supervised scenarios) corresponding to test samples of different classes from the three datasets in Figure 2.
In Table 1, we present the accuracy, error and rejection percentages obtained by our method with and without thresholding. For semi-supervised learning, we have taken randomly chosen labeled samples from each class for both MNIST and SVHN during training. It is important to note here that the SOTA for COIL-100 was obtained on a random train-test split of the dataset, and hence, the accuracy values are not directly comparable.
Adversarial attacks on encoder.
We use the encoder part of the network trained on the MNIST dataset to generate adversarial samples using the Fast Gradient Sign Method (FGSM) with varying $\epsilon$ values goodfellow2014explaining. The corresponding results are shown in Figure 3. The behavior is as desired, i.e., with increasing $\epsilon$, the percentage of misclassified samples rises only to a small maximum value and then decreases, while the accuracy decreases monotonically and the rejection percentage increases monotonically. Similar results are obtained for the semi-supervised model, as shown in Figure 3, although the maximum error rate is higher in this case. We further tried the FGSM attack from the Cleverhans library papernot2017cleverhans with the default parameters on the SVHN and COIL-100 datasets, and all the generated samples were rejected by the models after thresholding. Similarly, we generated adversarial samples for all three datasets using stronger attacks from Cleverhans with default parameter settings, including the Momentum Iterative Method dongboosting and Projected Gradient Descent madry2017towards. In these cases as well, all generated adversarial samples were successfully rejected by thresholding.
This indicates that since all these attacks lack knowledge of the decoder network, they only manage to produce samples which fool the encoder network, but are easily detected at the decoder output. From this set of experiments, we conclude that the only effective method of attacking our model would be to design a complete white-box attack that has knowledge of the decoder loss as well as the two thresholds. Further, since we do not use any form of gradient obfuscation in our defense mechanism, a complete white-box attacker would represent a strong adversary.
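For readers unfamiliar with FGSM, the single-step attack can be sketched against a simple stand-in classifier. The logistic-regression model, its weights, and the function names below are assumptions for illustration; the experiments above attack the trained encoder through the Cleverhans library instead.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fgsm(x, y, w, b, eps):
    """One FGSM step against a logistic-regression classifier p(y=1|x) = sigmoid(w.x + b).

    The gradient of the cross-entropy loss with respect to x is (p - y) * w, so the
    attack moves every pixel by eps in the direction of the sign of that gradient.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    x_adv = x + eps * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)   # keep pixels in the valid range

w = np.array([2.0, -1.0, 0.5])
b = -0.5
x = np.array([0.6, 0.2, 0.4])
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.1)
p_clean = sigmoid(w @ x + b)       # confidence in the true class before the attack
p_adv = sigmoid(w @ x_adv + b)     # confidence in the true class after the attack
```

The perturbation is bounded by `eps` in the max norm, yet it lowers the confidence in the true class; an attack of this form uses only the classifier's gradient and has no term that accounts for a decoder.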
White-box adversarial attack.
We present the results for completely white-box targeted attack on our model for the COIL-100 and MNIST datasets in figures 4a and 4b. Here, the adversary has complete knowledge of the encoder, the decoder, as well as the rejection thresholds. The results shown correspond to random samples from the first two classes of objects for the COIL-100 dataset, and the classes and for MNIST dataset. We perform gradient descent on the adversarial objective as given in Eqn. 10. The target class is set to for MNIST images from class , for MNIST images from class , and the class other than that of the source image for the COIL-100 images.
where is the original image we wish to corrupt, is the mean of target class, is the noise added, are the encoder and decoder respectively, and denotes target class covariance in latent space. and represent constant exponents which ensure that the adversarial loss grows steeply when the two threshold values are exceeded. Essentially, we aim for low reconstruction error and small change in the adversarial image while moving its embedding close to the target class mean. is initialized with zeros.
We also ran the white box attack on randomly sampled images from each of the classes for MNIST and SVHN, by setting each of the other classes as the target class. The samples generated by optimizing the adversarial objective in each of these cases were either correctly classified or rejected.
[Figure 4 panels: Adversarial Samples (MNIST); Adversarial Samples (COIL); Fooling Samples]
We take images sampled from the uniform distribution as inputs and optimize the white-box fooling attack objective given by Eqn. 11, with each of the classes from the MNIST and SVHN datasets as the target classes. In Figure 4c, we visualize some of the images to which the attack converged and their reconstructions for the MNIST dataset, with the target classes .
Here, , , and are as described in sec. 4.
It has been shown that fooling samples are extremely easy to generate for state-of-the-art classifier networks goodfellow2014explaining,nguyen2015deep. Our technique, by design, gains resilience against such attacks as well. Since, by definition, a fooling sample cannot look like a legitimate sample, it cannot have a small pixel space distance to any real image. This is exactly what can be noticed in the results in Figure 4c, where reconstruction errors are very high. Hence, most of the images to which this attack converges are rejected at the decoder, although they had managed to fool the encoder when considered in isolation. For the few cases where the images are not rejected, we observe that the attack method actually converged to a legitimate image of the target class.
Reclassifying Adversarial samples.
In this section we present the performance of our reclassification technique. Although one could have used our decoder network to perform both “ordinary” and “adversarial” sample classification using Eqn. 9, this process involves an iterative optimization. Hence, we only use it for the detected adversarial samples. The results are summarized in Table 2.
Following the same reclassification scheme, we also find that the method is able to correctly classify rejected test samples, thereby improving the overall accuracy achieved by the proposed method. For example, among the 181 samples rejected by the supervised model for the MNIST test dataset (as per Table 1), 110 samples are now correctly classified, improving the accuracy to 99.07%.
Entropy of the encoder distribution.
To compare the performance of the proposed network with the corresponding network with variable entropy of the encoder distribution, we ran experiments in which this entropy was allowed to vary, keeping all other parameters the same. We tried the FGSM attack against the encoder of the model thus obtained, and observed that the adversarial sample detection capability of the network reduces drastically. This is justified by the fact that the reconstructions tend to be blurry in this case, thereby leading to a high reconstruction threshold. The results are shown in Figure 5.
In order to further study the difference between the two cases, we train both variants of the network on the CelebA dataset, and observe that the Fréchet Inception Distance (FID) heusel2017gans score is significantly better for the model with constant encoder entropy (50.4) than for the one with variable encoder entropy (58.3). The FID scores are obtained by randomly sampling points from the latent distribution, and comparing the distribution of the images generated from these points with the training image distribution.
In this work, we have successfully demonstrated how a generative model can be used to gain defensive strength against adversarial attacks on images of relatively high resolution (128x128 for the COIL-100 dataset for example). However, the proposed network is limited by the generative capability of VAE based architectures, and thus, might not scale effectively to ImageNet scale datasets imagenet_cvpr09. In spite of this fact, keeping the underlying principles for adversarial sample detection and reclassification as described in this work, recent advances in invertible generative models such as Glow kingma2018glow can be exploited to scale to more complex datasets. Further, as discussed earlier, the problem of defending against adversarial attacks still remains an unsolved problem even for datasets with more structured images. Hence our method can be used for practical applications such as secure medical image classification finlayson2018adversarial, biometrics identification, etc.
Human perception involves both discriminative and generative capabilities. Similarly, our work proposes a modification to VAEs to incorporate discriminative ability, besides using its generative ability to gain robustness against adversarial samples. The input space dimensionality (to the decoder) is drastically smaller compared to the input space dimensionality of image classifiers. Hence, it is much easier to attain dense coverage in the latent space, thereby minimizing the possibility of the occurrence of holes, leading to defensive capability against both adversarial and fooling images. With our construct, selective classification and semi-supervised learning become feasible under the same framework. A possible direction of future research would be to study how effectively the proposed approach can be scaled to more complex datasets by using recently proposed invertible generative modeling techniques.
We are extremely grateful to Mr. Arnav Acharyya for his invaluable contribution to the discussions that helped shape this work.
- [Athalye, Carlini, and Wagner2018] Athalye, A.; Carlini, N.; and Wagner, D. A. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 274–283.
- [Carlini and Wagner2016] Carlini, N., and Wagner, D. 2016. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311.
- [Carlini and Wagner2017a] Carlini, N., and Wagner, D. 2017a. Magnet and “efficient defenses against adversarial attacks” are not robust to adversarial examples. arXiv preprint arXiv:1711.08478.
- [Carlini and Wagner2017b] Carlini, N., and Wagner, D. 2017b. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, 39–57. IEEE.
- [Chen et al.2017] Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 15–26. ACM.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- [Dong et al.2018] Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; and Li, J. 2018. Boosting adversarial attacks with momentum.
- [Eykholt et al.2018] Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Xiao, C.; Prakash, A.; Kohno, T.; and Song, D. 2018. Robust physical-world attacks on deep learning visual classification. In Computer Vision and Pattern Recognition (CVPR).
- [Finlayson, Kohane, and Beam2018] Finlayson, S. G.; Kohane, I. S.; and Beam, A. L. 2018. Adversarial attacks against medical deep learning systems. arXiv preprint arXiv:1804.05296.
- [Goodfellow, Shlens, and Szegedy2014] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- [Gu and Rigazio2014] Gu, S., and Rigazio, L. 2014. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068.
- [He et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034.
- [Heusel et al.2017] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.
- [Huang et al.2017] Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, 3.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Kingma and Dhariwal2018] Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
- [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on Learning Representations.
- [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- [Lee, Gallagher, and Tu] Lee, C.-Y.; Gallagher, P. W.; and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, 464–472.
- [Madry et al.2018] Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
- [Meng and Chen2017a] Meng, D., and Chen, H. 2017a. Magnet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 135–147. New York, NY, USA: ACM.
- [Meng and Chen2017b] Meng, D., and Chen, H. 2017b. Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 135–147. ACM.
- [Moosavi Dezfooli, Fawzi, and Frossard2016] Moosavi Dezfooli, S. M.; Fawzi, A.; and Frossard, P. 2016. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057.
- [Nayar, Nene, and Murase1996] Nayar, S.; Nene, S.; and Murase, H. 1996. Columbia object image library (coil 100). Department of Comp. Science, Columbia University, Tech. Rep. CUCS-006-96.
- [Netzer et al.2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, 5.
- [Nguyen, Yosinski, and Clune2015] Nguyen, A.; Yosinski, J.; and Clune, J. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 427–436.
- [Nicolas Papernot2017] Nicolas Papernot, Nicholas Carlini, I. G. R. F. F. F. A. M. K. H. Y.-L. J. A. K. R. S. A. G. Y.-C. L. 2017. cleverhans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768.
- [Papernot et al.2016a] Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016a. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, 372–387. IEEE.
- [Papernot et al.2016b] Papernot, N.; McDaniel, P.; Wu, X.; Jha, S.; and Swami, A. 2016b. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, 582–597. IEEE.
- [Papernot et al.2017] Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 506–519. ACM.
- [Papernot, McDaniel, and Goodfellow2016] Papernot, N.; McDaniel, P.; and Goodfellow, I. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
- [Pouya Samangouei2018] Pouya Samangouei, Maya Kabkab, R. C. 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. International Conference on Learning Representations.
- [Song et al.2018a] Song, Y.; Kim, T.; Nowozin, S.; Ermon, S.; and Kushman, N. 2018a. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations.
- [Song et al.2018b] Song, Y.; Shu, R.; Kushman, N.; and Ermon, S. 2018b. Generative adversarial examples. In Advances in Neural Information Processing Systems (NIPS).
- [Szegedy et al.2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
- [Wan et al.2013] Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, 1058–1066.
- [Wu et al.2015] Wu, D.; Wu, J.; Zeng, R.; Jiang, L.; Senhadji, L.; and Shu, H. 2015. Kernel principal component analysis network for image classification. arXiv preprint arXiv:1512.06337.
- [Zhao, Dua, and Singh2018] Zhao, Z.; Dua, D.; and Singh, S. 2018. Generating natural adversarial examples. In International Conference on Learning Representations.