Szegedy et al. (2013) first observed that deep neural networks are unstable under small perturbations of the input. For image classification this means that two visually indistinguishable images may have very different outputs, so that one of them is misclassified even if the other is correctly classified with high confidence. Since then, much research has investigated this issue through the construction of adversarial examples: given a correctly classified image $x$, we look for an image $y$ which is visually indistinguishable from $x$ but is misclassified by the network. Typically, the image $y$ is constructed as $y = x + r$, where $r$ is an adversarial perturbation that is supposed to be small in a suitable sense (normally, with respect to an $\ell^p$ norm). Several algorithms have been developed to construct adversarial perturbations, see Goodfellow et al. (2014); Moosavi Dezfooli et al. (2016); Kurakin et al. (2017b); Madry et al. (2018); Carlini & Wagner (2017b) and the review paper Akhtar & Mian (2018).
Even though such pathological cases are very unlikely to occur in practice, their existence is relevant since malicious attackers may exploit this drawback to fool classifiers or other automatic systems. Further, adversarial perturbations may be constructed in a black-box setting (i.e., without knowing the architecture of the DNN but only its outputs) (Papernot et al., 2017; Moosavi-Dezfooli et al., 2017) and also in the physical world (Kurakin et al., 2017b; Athalye & Sutskever, 2017; Brown et al., 2017; Sharif et al., 2016). This has motivated the investigation of defenses, i.e., how to make the network invulnerable to such attacks, see Kurakin et al. (2017a); Carlini & Wagner (2017a); Madry et al. (2018); Tramèr et al. (2018); Wong & Kolter (2018); Raghunathan et al. (2018); Athalye et al. (2018); Kannan et al. (2018). In most cases, adversarial examples are artificially created and then used to retrain the network, which becomes more stable under these types of perturbations.
Most of the work on the construction of adversarial examples and on the design of defense strategies has been conducted in the context of small perturbations $r$ measured in the $\ell^\infty$ norm. However, this is not necessarily a good measure of image similarity: e.g., for two translated images $x$ and $y$, the norm of $x - y$ is not small in general, even though $x$ and $y$ will look indistinguishable if the translation is small. Several papers have investigated the construction of adversarial perturbations not designed for norm proximity (Rozsa et al., 2016; Sharif et al., 2016; Brown et al., 2017; Engstrom et al., 2017; Xiao et al., 2018).
In this work, we build on these ideas and investigate the construction of adversarial deformations. In other words, the misclassified image $y$ is not constructed as an additive perturbation $y = x + r$, but as a deformation $y = x \circ (\mathrm{id} + \tau)$, where $\tau$ is a vector field defining the transformation. In this case, the similarity is not measured through a norm of $y - x$, but instead through a norm of $\tau$, which quantifies the deformation between $y$ and $x$.
We develop an efficient algorithm for the construction of adversarial deformations, which we call ADef. It is based on the main ideas of DeepFool (Moosavi Dezfooli et al., 2016), and iteratively constructs the smallest deformation to misclassify the image. We test the procedure on MNIST (LeCun, n.d.) (with convolutional neural networks) and on ImageNet (Russakovsky et al., 2015) (with Inception-v3 (Szegedy et al., 2016) and ResNet-101 (He et al., 2016)). The results show that ADef can successfully fool the classifiers in the vast majority of cases (around 99%) by using very small and imperceptible deformations. We also test our adversarial attacks on adversarially trained networks for MNIST. Our implementation of the algorithm can be found at https://gitlab.math.ethz.ch/tandrig/ADef.
The results of this work first appeared in the master’s thesis Gauksson (2017), to which we refer for additional details on the mathematical aspects of this construction. While writing this paper, we came across Xiao et al. (2018), in which a similar problem is considered and solved with a different algorithm. Whereas in Xiao et al. (2018) the authors use a second-order solver to find a deforming vector field, we show how a first-order method can be formulated efficiently and justify a smoothing operation, independent of the optimization step. We report, for the first time, success rates for adversarial attacks with deformations on ImageNet. The topic of deformations has also come up in Jaderberg et al. (2015), in which the authors introduce a class of learnable modules that deform inputs in order to increase the performance of existing DNNs, and Fawzi & Frossard (2015), in which the authors introduce a method to measure the invariance of classifiers to geometric transformations.
2 Adversarial deformations
2.1 Adversarial perturbations
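Let $\mathcal{K}$ be a classifier of images consisting of $P$ pixels into $L$ categories, i.e. a function from the space of images $X = \mathbb{R}^{cP}$, where $c = 1$ (for grayscale images) or $c = 3$ (for color images), into the set of labels $\mathcal{L} = \{1, \dots, L\}$. Suppose $x \in X$ is an image that is correctly classified by $\mathcal{K}$ and suppose $y \in X$ is another image that is imperceptibly different from $x$ and such that $\mathcal{K}(y) \neq \mathcal{K}(x)$; then $y$ is said to be an adversarial example. The meaning of imperceptibility varies, but generally, proximity in $\ell^p$-norm (with $1 \leq p \leq \infty$) is considered to be a sufficient substitute. Thus, an adversarial perturbation for an image $x \in X$ is a vector $r \in X$ such that $\mathcal{K}(x + r) \neq \mathcal{K}(x)$ and $\|r\|_p$ is small, where
$$\|r\|_p = \Big( \sum_{j=1}^{cP} |r_j|^p \Big)^{1/p} \ \text{ if } 1 \leq p < \infty, \qquad \|r\|_\infty = \max_{j = 1, \dots, cP} |r_j|. \tag{1}$$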
Given such a classifier $\mathcal{K}$ and an image $x$, an adversary may attempt to find an adversarial example $y$ by minimizing $\|x - y\|_p$ subject to $\mathcal{K}(y) \neq \mathcal{K}(x)$, or even subject to $\mathcal{K}(y) = k$ for some target label $k \neq \mathcal{K}(x)$. Different methods for finding minimal adversarial perturbations have been proposed, most notably FGSM (Goodfellow et al., 2014) and PGD (Madry et al., 2018) for $\ell^\infty$, and the DeepFool algorithm (Moosavi Dezfooli et al., 2016) for general $\ell^p$-norms.
Instead of constructing adversarial perturbations, we intend to fool the classifier by small deformations of correctly classified images. Our procedure is in the spirit of the DeepFool algorithm. Before we explain it, let us first clarify what we mean by a deformation of an image. The discussion is at first more intuitive if we model images as functions $\xi : [0,1]^2 \to \mathbb{R}^c$ (with $c = 1$ or $c = 3$) instead of discrete vectors $x$ in $\mathbb{R}^{cP}$. In this setting, perturbing an image $\xi$ corresponds to adding to it another function $\rho : [0,1]^2 \to \mathbb{R}^c$ with a small $L^p$-norm.
While any transformation of an image $\xi$ can be written as a perturbation $\xi + \rho$, we shall restrict ourselves to a particular class of transformations. A deformation with respect to a vector field $\tau : [0,1]^2 \to \mathbb{R}^2$ is a transformation of the form $\xi \mapsto \xi^\tau$, where for any image $\xi : [0,1]^2 \to \mathbb{R}^c$, the image $\xi^\tau : [0,1]^2 \to \mathbb{R}^c$ is defined by
$$\xi^\tau(u) = \xi(u + \tau(u)) \quad \text{for all } u \in [0,1]^2,$$
extending $\xi$ by zero outside of $[0,1]^2$. Deformations capture many natural image transformations. For example, a translation of the image $\xi$ by a vector $v \in \mathbb{R}^2$ is a deformation with respect to the constant vector field $\tau = v$. If $v$ is small, the images $\xi$ and $\xi^v$ may look similar, but the corresponding perturbation $\rho = \xi^v - \xi$ may be arbitrarily large in the aforementioned $L^p$-norms. Figure 1 shows three minor deformations, all of which yield large $L^\infty$-norms.
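To make the gap between the two notions of size concrete, the following minimal sketch (not taken from the ADef implementation) translates a toy image with a sharp edge by one pixel and measures the equivalent additive perturbation. The 8×8 image, the zero extension at the border, and the names `x`, `y`, `r` are assumptions made purely for illustration.

```python
import numpy as np

# Toy 8x8 grayscale image with a sharp vertical edge (values in [0, 1]).
x = np.zeros((8, 8))
x[:, 4:] = 1.0

# Translate by one pixel: y[s, t] = x[s, t + 1], extended by zero outside the image.
y = np.zeros_like(x)
y[:, :-1] = x[:, 1:]

r = y - x                      # the equivalent additive perturbation
linf = np.abs(r).max()         # l-infinity norm of the perturbation
l2 = np.sqrt((r ** 2).sum())   # l-2 norm of the perturbation

print("displacement: 1 pixel, l_inf:", linf, "l_2:", l2)
```

Although no pixel moves by more than one grid step, the perturbation attains the largest value that is possible for an image with intensities in $[0, 1]$.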
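In the discrete setting, deformations are implemented as follows. We consider square images of $W \times W$ pixels and define the space of images to be $X = (\mathbb{R}^{W \times W})^c$. A discrete vector field is a function $\tau : \{1, \dots, W\}^2 \to \mathbb{R}^2$. In what follows we will only consider the set $\mathcal{T}$ of vector fields that do not move points on the grid $\{1, \dots, W\}^2$ outside of $[1, W]^2$. More precisely,
$$\mathcal{T} := \big\{ \tau : \{1, \dots, W\}^2 \to \mathbb{R}^2 \ \big|\ \tau(s,t) + (s,t) \in [1, W]^2 \text{ for all } s, t \in \{1, \dots, W\} \big\}.$$
An image $x \in X$ can be viewed as the collection of values of a function $\xi : [1, W]^2 \to \mathbb{R}^c$ on a regular grid $\{1, \dots, W\}^2 \subseteq [1, W]^2$, i.e. $x_{s,t} = \xi(s,t)$ for $s, t \in \{1, \dots, W\}$. Such a function $\xi$ can be computed by interpolating from $x$. Thus, the deformation of an image $x$ with respect to the discrete vector field $\tau$ can be defined as the discrete deformed image $x^\tau$ in $X$ by
$$x^\tau_{s,t} = \xi\big( (s,t) + \tau(s,t) \big), \qquad s, t \in \{1, \dots, W\}. \tag{2}$$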
It is not straightforward to measure the size of a deformation such that it captures the visual difference between the original image $x$ and its deformed counterpart $x^\tau$. We will use the size of the corresponding vector field, $\tau$, in the norm defined by
$$\|\tau\|_T = \max_{s,t = 1, \dots, W} \|\tau(s,t)\|_2 \tag{3}$$
as a proxy. The $\ell^p$-norms defined in (1), adapted to vector fields, can be used as well. (We remark, however, that none of these norms define a distance between $x$ and $x^\tau$, since two vector fields $\tau, \sigma \in \mathcal{T}$ with $\|\tau\|_T \neq \|\sigma\|_T$ may produce the same deformed image $x^\tau = x^\sigma$.)
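The discrete deformation and the size of a vector field can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation: it assumes a single-channel image, 0-based indices (so the grid is $\{0, \dots, W-1\}^2$), bilinear interpolation, and a clipping guard for displaced points that would leave the image. The function names `deform` and `t_norm` are hypothetical.

```python
import numpy as np

def deform(x, tau):
    """Deform a W x W image x by the vector field tau (shape W x W x 2).

    Computes x_tau[s, t] = xi((s, t) + tau[s, t]), where xi is the bilinear
    interpolant of x. Displaced points are clipped to [0, W-1]^2 as a guard.
    """
    W = x.shape[0]
    s, t = np.meshgrid(np.arange(W), np.arange(W), indexing="ij")
    u = np.clip(s + tau[..., 0], 0, W - 1)           # displaced row coordinate
    v = np.clip(t + tau[..., 1], 0, W - 1)           # displaced column coordinate
    u0 = np.minimum(np.floor(u).astype(int), W - 2)  # top-left corner of the cell
    v0 = np.minimum(np.floor(v).astype(int), W - 2)
    a, b = u - u0, v - v0                            # fractional offsets in [0, 1]
    return ((1 - a) * (1 - b) * x[u0, v0] + (1 - a) * b * x[u0, v0 + 1]
            + a * (1 - b) * x[u0 + 1, v0] + a * b * x[u0 + 1, v0 + 1])

def t_norm(tau):
    """Largest Euclidean displacement of any pixel."""
    return np.sqrt((tau ** 2).sum(axis=-1)).max()

# Shift the first three columns of a linear ramp by half a pixel.
x = np.arange(16.0).reshape(4, 4)
tau = np.zeros((4, 4, 2))
tau[:, :3, 1] = 0.5
x_tau = deform(x, tau)
print(x_tau)
print(t_norm(tau))

# A constant image is unchanged by any vector field, so different fields
# can produce the same deformed image.
constant = np.ones((4, 4))
print(np.allclose(deform(constant, tau), constant))
```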
2.3 The algorithm ADef
We will now describe our procedure for finding deformations that will lead a classifier to yield an output different from the original label.
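Let $F = (F_1, \dots, F_L) : X \to \mathbb{R}^L$ be the underlying model for the classifier $\mathcal{K}$, such that
$$\mathcal{K}(x) = \operatorname*{arg\,max}_{k = 1, \dots, L} F_k(x).$$
Let $x \in X$ be the image of interest and fix $\xi : [1, W]^2 \to \mathbb{R}^c$ obtained by interpolation from $x$. Let $l = \mathcal{K}(x)$ denote the true label of $x$, let $k \in \mathcal{L}$ be a target label and set $f = F_k - F_l$. We assume that $x$ does not lie on a decision boundary, so that we have $f(x) < 0$.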
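We define the function $g : \mathcal{T} \to \mathbb{R}$, $\tau \mapsto f(x^\tau)$, and note that $g(0) = f(x) < 0$. Our goal is to find a small vector field $\tau \in \mathcal{T}$ such that $g(\tau) = f(x^\tau) \geq 0$. We can use a linear approximation of $g$ around the zero vector field as a guide:
$$g(\tau) \approx g(0) + (D_0 g)\, \tau \tag{4}$$
for small enough $\tau \in \mathcal{T}$ and $D_0 g : \mathcal{T} \to \mathbb{R}$ the derivative of $g$ at $\tau = 0$. Hence, if $\tau$ is a vector field such that
$$(D_0 g)\, \tau = -g(0) \tag{5}$$
and $\|\tau\|_T$ is small, then the classifier $\mathcal{K}$ has approximately equal confidence for the deformed image $x^\tau$ to have either label $k$ or $l$. This is a scalar equation with unknown in $\mathcal{T}$, and so has infinitely many solutions. In order to select $\tau$ with small norm, we solve it in the least-squares sense.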
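In view of (2), we have $\frac{\partial x^\tau}{\partial \tau}\big|_{\tau = 0}(s,t) = \nabla \xi(s,t) \in \mathbb{R}^{c \times 2}$. Thus, by applying the chain rule to $g(\tau) = f(x^\tau)$, we obtain that its derivative at $\tau = 0$ can, with a slight abuse of notation, be identified with the vector field
$$D_0 g(s,t) = (\nabla f(x))_{s,t} \, \nabla \xi(s,t), \tag{6}$$
where $(\nabla f(x))_{s,t} \in \mathbb{R}^{1 \times c}$ is the derivative of $f$ in $x$ calculated at $(s,t)$. With this, $(D_0 g)\, \tau$ stands for $\sum_{s,t=1}^{W} D_0 g(s,t) \cdot \tau(s,t)$, and the solution to (5) in the least-squares sense is given by
$$\tau = -\frac{f(x)}{\sum_{s,t=1}^{W} |D_0 g(s,t)|^2} \, D_0 g. \tag{7}$$
Finally, we define the deformed image $x^\tau \in X$ according to (2).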
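The least-squares step can be checked numerically. The sketch below is an illustration under stated assumptions, not the published implementation: the gradient of the model output with respect to the pixels (`grad_f`) and the spatial gradient of the interpolated image (`grad_xi`) are replaced by random stand-ins, and the function name `adef_step` is hypothetical. It forms the per-pixel product of the two gradients, scales it so that the linearized equation holds, and verifies the residual.

```python
import numpy as np

def adef_step(grad_f, grad_xi, f_x):
    """Minimum-norm vector field solving the linearized equation (D0 g) tau = -f(x).

    grad_f  : (W, W, c) gradient of f with respect to the image pixels
    grad_xi : (W, W, c, 2) spatial gradient of the interpolated image on the grid
    f_x     : scalar f(x), negative for a correctly classified image
    """
    # Per pixel: a (1 x c) row times a (c x 2) matrix gives a vector in R^2.
    d0g = np.einsum("stc,stcd->std", grad_f, grad_xi)
    tau = -f_x / (d0g ** 2).sum() * d0g
    return tau, d0g

rng = np.random.default_rng(0)
W, c = 5, 3
grad_f = rng.normal(size=(W, W, c))      # stand-in for the model gradient
grad_xi = rng.normal(size=(W, W, c, 2))  # stand-in for the image gradient
f_x = -0.7

tau, d0g = adef_step(grad_f, grad_xi, f_x)
residual = (d0g * tau).sum() + f_x       # (D0 g) tau + f(x), should vanish
print(abs(residual) < 1e-12)
```

Because the solution is a scalar multiple of the derivative itself, it is orthogonal to every vector field that leaves the linearized output unchanged, which is what makes it the solution of smallest Euclidean norm.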
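One might like to impose some degree of smoothness on the deforming vector field. In fact, it suffices to search in the range of a smoothing operator $\mathcal{S} : \mathcal{T} \to \mathcal{T}$. However, this essentially amounts to applying $\mathcal{S}$ to the solution from the larger search space $\mathcal{T}$. Let $\alpha = \mathcal{S}(D_0 g) = \varphi * (D_0 g)$, where $\mathcal{S}$ denotes the componentwise application of a two-dimensional Gaussian filter $\varphi$ (of any standard deviation). Then the vector field
$$\tilde{\tau} = -\frac{f(x)}{\sum_{s,t=1}^{W} |\alpha(s,t)|^2} \, \mathcal{S}\alpha = -\frac{f(x)}{\sum_{s,t=1}^{W} |\alpha(s,t)|^2} \, \mathcal{S}^2 (D_0 g)$$
also satisfies (5), since $\mathcal{S}$ is self-adjoint. We can hence replace $\tau$ by $\tilde{\tau}$ to obtain a smooth deformation of the image $x$.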
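The role of self-adjointness can be verified numerically. The sketch below assumes a truncated Gaussian kernel applied separably with zero padding, which is one possible discretization (an assumption of this example) for which the smoothing operator is represented by a symmetric matrix. The derivative is again a random stand-in, and `gaussian_kernel` and `smooth` are hypothetical helper names.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized, symmetric 1-D Gaussian kernel of odd length 2 * radius + 1."""
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    return k / k.sum()

def smooth(field, kernel):
    """Separable, zero-padded Gaussian filter applied to each component of field."""
    out = field.astype(float).copy()
    for axis in (0, 1):  # filter along rows, then along columns
        out = np.apply_along_axis(np.convolve, axis, out, kernel, mode="same")
    return out

rng = np.random.default_rng(1)
d0g = rng.normal(size=(8, 8, 2))  # stand-in for the derivative D0 g
f_x = -0.3
kernel = gaussian_kernel(sigma=1.0, radius=2)

alpha = smooth(d0g, kernel)
tau_smooth = -f_x / (alpha ** 2).sum() * smooth(alpha, kernel)

# <D0 g, S alpha> = <S D0 g, alpha> = |alpha|^2, so the left-hand side equals -f(x).
lhs = (d0g * tau_smooth).sum()
print(np.isclose(lhs, -f_x))
```

The check relies only on the symmetry of the kernel, so it holds for any standard deviation, in line with the statement above.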
We iterate the deformation process until the deformed image is misclassified. More explicitly, let $x^{(0)} = x$ and for $n \geq 1$ let $\tau^{(n)}$ be given by (7) for $x^{(n-1)}$. Then we can define the iteration as $x^{(n)} = x^{(n-1)} \circ (\mathrm{id} + \tau^{(n)})$. The algorithm terminates and outputs an adversarial example $y = x^{(n)}$ if $\mathcal{K}(x^{(n)}) \neq l$. The iteration also terminates if $x^{(n)}$ lies on a decision boundary of $\mathcal{K}$, in which case we propose to introduce an overshoot factor $1 + \eta$ on the total deforming vector field. Provided that the number of iterations is moderate, the total vector field can be well approximated by $\tau^* = \tau^{(1)} + \cdots + \tau^{(n)}$ and the process can be altered to output the deformed image $x^{(1 + \eta)\tau^*}$ instead.
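The iteration can be sketched end to end on a toy problem. The following is a simplified illustration and not the authors' code: it assumes a linear stand-in model $F(x) = A x$ (so the gradient is available in closed form), a single-channel image, finite differences via `np.gradient` in place of the derivative of the interpolant, the summed-vector-field variant described above, and no smoothing or overshoot. All function and variable names are hypothetical.

```python
import numpy as np

def deform(x, tau):
    """Bilinear deformation x_tau[s, t] = xi((s, t) + tau[s, t]), clipped to the image."""
    W = x.shape[0]
    s, t = np.meshgrid(np.arange(W), np.arange(W), indexing="ij")
    u = np.clip(s + tau[..., 0], 0, W - 1)
    v = np.clip(t + tau[..., 1], 0, W - 1)
    u0 = np.minimum(np.floor(u).astype(int), W - 2)
    v0 = np.minimum(np.floor(v).astype(int), W - 2)
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * x[u0, v0] + (1 - a) * b * x[u0, v0 + 1]
            + a * (1 - b) * x[u0 + 1, v0] + a * b * x[u0 + 1, v0 + 1])

def adef_sketch(x, A, max_iter=50):
    """Accumulate linearized steps until the toy model A @ x.ravel() changes its label."""
    l = int(np.argmax(A @ x.ravel()))        # original label
    tau_total = np.zeros(x.shape + (2,))     # running sum of the step vector fields
    y = x
    for _ in range(max_iter):
        scores = A @ y.ravel()
        if int(np.argmax(scores)) != l:
            break                            # the deformed image is misclassified
        # Target: the highest-scoring label other than the original one.
        k = int(next(j for j in np.argsort(scores)[::-1] if j != l))
        f_y = scores[k] - scores[l]
        grad_f = (A[k] - A[l]).reshape(x.shape)   # gradient of F_k - F_l
        d_rows, d_cols = np.gradient(y)           # finite-difference image gradient
        d0g = np.stack([grad_f * d_rows, grad_f * d_cols], axis=-1)
        tau_total = tau_total - f_y / (d0g ** 2).sum() * d0g
        y = deform(x, tau_total)             # deform the original by the summed field
    success = int(np.argmax(A @ y.ravel())) != l
    return y, tau_total, success

rng = np.random.default_rng(0)
x = rng.random((8, 8))            # toy image
A = rng.normal(size=(3, 64))      # toy linear model with three labels
y, tau_total, success = adef_sketch(x, A)
print(success, np.sqrt((tau_total ** 2).sum(axis=-1)).max())
```

Whether the loop succeeds within `max_iter` steps depends on the model and the image, so the returned flag should always be checked.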
The target label $k = k_n$ may be chosen in each iteration to minimize the vector field to obtain a better approximation in the linearization (4). More precisely, for a candidate set of labels $k_1, \dots, k_m$, we compute the corresponding vector fields $\tau_1, \dots, \tau_m$ and select
$$k = \operatorname*{arg\,min}_{j = 1, \dots, m} \|\tau_j\|_T.$$
The candidate set consists of the labels corresponding to the indices of the smallest entries of $F - F_l$, in absolute value.
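The label selection can be sketched as follows. This is an illustration under assumptions: the current label is excluded from the candidates (its own entry is trivially zero), the number of candidates `m` is a free parameter, and `step_for_label` is a hypothetical callable that returns the step vector field for a given target label.

```python
import numpy as np

def candidate_labels(scores, l, m):
    """Indices of the m labels whose scores are closest to the score of label l."""
    gaps = np.abs(np.asarray(scores, dtype=float) - scores[l])
    gaps[l] = np.inf                      # exclude the current label itself
    return np.argsort(gaps)[:m]

def pick_target(scores, l, m, step_for_label):
    """Among the candidates, pick the label whose step has the smallest max-displacement."""
    def t_norm(tau):
        return np.sqrt((tau ** 2).sum(axis=-1)).max()
    cands = candidate_labels(scores, l, m)
    return int(min(cands, key=lambda k: t_norm(step_for_label(int(k)))))

scores = np.array([2.0, 5.0, 4.5, 1.0])   # toy model outputs, current label is 1
cands = candidate_labels(scores, l=1, m=2)

# Toy step fields: label 2 is closer in score but needs the larger deformation.
fields = {0: np.full((2, 2, 2), 0.1), 2: np.full((2, 2, 2), 0.5)}
target = pick_target(scores, l=1, m=2, step_for_label=lambda k: fields[k])
print(list(cands), target)
```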
By equation (6), provided that $\nabla f$ is moderate, the deforming vector field takes small values wherever $\xi$ has a small derivative. This means that the vector field will be concentrated on the edges in the image (see e.g. the first row of figure 2). Further, note that the result of a deformation is always a valid image in the sense that it does not violate the pixel value bounds. This is not guaranteed for the perturbations computed with DeepFool.
We evaluate the performance of ADef by applying the algorithm to classifiers trained on the MNIST (LeCun, n.d.) and ImageNet (Russakovsky et al., 2015) datasets. Below, we briefly describe the setup of the experiments, and in tables 1 and 2 we summarize their results.
MNIST: We train two convolutional neural networks based on architectures that appear in Madry et al. (2018) and Tramèr et al. (2018) respectively. The network MNIST-A consists of two convolutional layers of sizes and , each followed by max-pooling and a rectifier activation function, a fully connected layer into dimension 1024 with a rectifier activation function, and a final linear layer with output dimension 10. The network MNIST-B consists of two convolutional layers of sizes and with a rectifier activation function, a fully connected layer into dimension 128 with a rectifier activation function, and a final linear layer with output dimension 10. During training, the latter convolutional layer and the former fully connected layer of MNIST-B are subject to dropout with drop probabilities and . We use ADef to produce adversarial deformations of the images in the test set. The algorithm is configured to pursue any label different from the correct label (all incorrect labels are candidate labels). It performs smoothing by a Gaussian filter of standard deviation , uses bilinear interpolation to obtain intermediate pixel intensities, and it overshoots by whenever it converges to a decision boundary.
ImageNet: We apply ADef to pretrained Inception-v3 (Szegedy et al., 2016) and ResNet-101 (He et al., 2016) models to generate adversarial deformations for the images in the ILSVRC2012 validation set. The images are preprocessed by first scaling so that the smaller axis has 299 pixels for the Inception model and 224 pixels for ResNet, and then they are center-cropped to a square image. The algorithm is set to focus only on the label of second highest probability. It employs a Gaussian filter of standard deviation , bilinear interpolation, and an overshoot factor .
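The preprocessing geometry can be made explicit with a small helper. This sketch only computes the resized dimensions and the crop box; the rounding convention and the function name `resize_and_crop_box` are assumptions of the example, and the actual resampling would be done by an image library.

```python
def resize_and_crop_box(width, height, target):
    """Return the resized (width, height) and the centered square crop box.

    The image is scaled so that its smaller side has `target` pixels, then a
    target x target square is cut from the center. The box is given as
    (left, top, right, bottom) in pixels of the resized image.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

# A 500 x 375 image prepared for the two input sizes mentioned above.
print(resize_and_crop_box(500, 375, 299))
print(resize_and_crop_box(500, 375, 224))
```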
We only consider inputs that are correctly classified by the model in question, and, since approximates the total deforming vector field, we declare ADef to be successful if its output is misclassified and , where we choose . Observe that, by (3), a deformation with respect to a vector field does not displace any pixel further away from its original position than . Hence, for high resolution images, the choice indeed produces small deformations if the vector fields are smooth. In appendix A, we illustrate how the success rate of ADef depends on the choice of .
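The success criterion can be written as a small predicate. The sketch below reads the criterion as "the output is misclassified and the maximal pixel displacement of the summed vector field does not exceed a threshold"; the threshold is left as the parameter `eps`, and the function name `adef_success` is hypothetical.

```python
import numpy as np

def adef_success(label_before, label_after, tau_total, eps):
    """True if the label changed and no pixel is displaced by more than eps."""
    t_norm = np.sqrt((tau_total ** 2).sum(axis=-1)).max()
    return bool(label_after != label_before and t_norm <= eps)

tau_total = np.zeros((4, 4, 2))
tau_total[1, 2] = (3.0, 4.0)   # one pixel displaced by a distance of 5

print(adef_success(7, 3, tau_total, eps=5.0))   # label changed, within threshold
print(adef_success(7, 3, tau_total, eps=4.9))   # label changed, field too large
print(adef_success(7, 7, tau_total, eps=5.0))   # label unchanged
```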
| Model | Accuracy | ADef success | Avg. $\lVert \tau^* \rVert_T$ | Avg. $\lVert r \rVert_\infty$ | Avg. # iterations |
When searching for an adversarial example, one usually searches for a perturbation with -norm smaller than some small number . Common choices of range from to for MNIST classifiers (Goodfellow et al., 2014; Madry et al., 2018; Wong & Kolter, 2018; Tramèr et al., 2018; Kannan et al., 2018) and to for ImageNet classifiers (Goodfellow et al., 2014; Kurakin et al., 2017a; Tramèr et al., 2018; Kannan et al., 2018). Table 1 shows that on average, the perturbations obtained by ADef are quite large compared to those constraints. However, as can be seen in figure 2, the relatively high resolution images of the ImageNet dataset can be deformed into adversarial examples that, while corresponding to large perturbations, are not visibly different from the original images. In appendices B and C, we give more examples of adversarially deformed images.
3.2 Adversarial training
In addition to training MNIST-A and MNIST-B on the original MNIST data, we train independent copies of the networks using the adversarial training procedure described by Madry et al. (2018). That is, before each step of the training process, the input images are adversarially perturbed using the PGD algorithm. This manner of training provides increased robustness against adversarial perturbations of low -norm. Moreover, we train networks using ADef instead of PGD as an adversary. In table 2 we show the results of attacking these adversarially trained networks, using ADef on the one hand, and PGD on the other. We use the same configuration for ADef as above, and for PGD we use 40 iterations, step size and as the maximum -norm of the perturbation. Interestingly, using these configurations, the networks trained against PGD attacks are more resistant to adversarial deformations than those trained against ADef.
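For readers unfamiliar with the PGD adversary used here, the following is a minimal sketch of an $\ell^\infty$ projected gradient attack on a toy linear model with softmax cross-entropy loss. It is not the training code used in the experiments: the model `A`, the values passed for `eps` and `step`, and the omission of a random start are assumptions of this example. Only the iteration count of 40 is taken from the text.

```python
import numpy as np

def pgd_linf(x, label, A, eps, step, n_iter=40):
    """Signed gradient ascent on the cross-entropy loss, projected onto an l-inf ball.

    The model is the toy linear map logits = A @ x; eps is the maximal
    l-infinity norm of the perturbation and step the per-iteration step size.
    """
    x_adv = x.copy()
    for _ in range(n_iter):
        logits = A @ x_adv
        p = np.exp(logits - logits.max())
        p /= p.sum()                               # softmax probabilities
        p[label] -= 1.0                            # d(loss) / d(logits)
        grad = A.T @ p                             # d(loss) / d(x) by the chain rule
        x_adv = x_adv + step * np.sign(grad)       # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the l-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep valid pixel values
    return x_adv

rng = np.random.default_rng(0)
x = rng.random(16)                 # toy flattened image in [0, 1)
A = rng.normal(size=(3, 16))       # toy linear model with three labels
label = int(np.argmax(A @ x))
x_adv = pgd_linf(x, label, A, eps=0.1, step=0.01)   # illustrative values only
print(np.abs(x_adv - x).max())
```

In adversarial training, each training batch would be replaced by the output of such an attack before the gradient step on the network weights.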
| Model | Adv. training | Accuracy | PGD success | ADef success |
3.3 Targeted attacks
ADef can also be used for targeted adversarial attacks, by restricting the deformed image to have a particular target label instead of any label which yields the optimal deformation. Figure 3 demonstrates the effect of choosing different target labels for a given MNIST image, and figure 4 shows the result of targeting the label of lowest probability for an image from the ImageNet dataset.
In this work, we proposed a new efficient algorithm, ADef, to construct a new type of adversarial attack for DNN image classifiers. The procedure is iterative and in each iteration takes a gradient descent step to deform the previous iterate in order to push it to a decision boundary.
We demonstrated that, with almost imperceptible deformations, state-of-the-art classifiers can be fooled by ADef with a high success rate. This suggests that networks are vulnerable to different types of attacks and that simply training the network on a specific class of adversarial examples might not form a sufficient defense strategy. Given this vulnerability of neural networks to deformations, we wish to study in future work how ADef can help in designing possible defense strategies. Furthermore, we also showed initial results on fooling adversarially trained networks. Remarkably, PGD-trained networks on MNIST are more resistant to adversarial deformations than ADef-trained networks. However, for this result to be more conclusive, similar tests on ImageNet will have to be conducted. We wish to study this in future work.
The authors would like to thank Helmut Bölcskei and Thomas Wiatowski for fruitful discussions.
- Akhtar & Mian (2018) N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018. doi: 10.1109/ACCESS.2018.2807385.
- Athalye & Sutskever (2017) Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
- Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- Brown et al. (2017) Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
- Carlini & Wagner (2017a) Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017a.
- Carlini & Wagner (2017b) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017b.
- Engstrom et al. (2017) Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations. arXiv preprint arXiv:1712.02779, 2017.
- Fawzi & Frossard (2015) Alhussein Fawzi and Pascal Frossard. Manitest: Are classifiers really invariant? arXiv preprint arXiv:1507.06535, 2015.
- Gauksson (2017) Tandri Gauksson. Adversarial perturbations and deformations for convolutional neural networks. https://www.research-collection.ethz.ch/handle/20.500.11850/258550/, 2017.
- Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 2017–2025. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf.
- Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
- Kurakin et al. (2017a) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017a.
- Kurakin et al. (2017b) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations, 2017b.
- LeCun (n.d.) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
- Moosavi-Dezfooli et al. (2017) S. M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 86–94, July 2017. doi: 10.1109/CVPR.2017.17.
- Moosavi Dezfooli et al. (2016) Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
- Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
- Raghunathan et al. (2018) Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Bys4ob-Rb.
- Rozsa et al. (2016) Andras Rozsa, Ethan M Rudd, and Terrance E Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32, 2016.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Sharif et al. (2016) Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. ACM, 2016.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016.
- Tramèr et al. (2018) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkZvSe-RZ.
- Wong & Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 2018.
- Xiao et al. (2018) Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyydRMZC-.
Appendix A Distribution of vector field norms
Figures 5 and 6 show the distribution of the norms of the total deforming vector fields, , from the experiments in section 3. For networks that have not been adversarially trained, most deformations fall well below the threshold of . Out of the adversarially trained networks, only MNIST-A trained against PGD is truly robust against ADef. Further, a comparison between the first column of figure 5 and figure 6 indicates that ImageNet is much more vulnerable to adversarial deformations than MNIST, also considering the much higher resolution of the images in ImageNet. Thus, it would be very interesting to study the performance of ADef with adversarially trained network for ImageNet, as mentioned in the Conclusion.
Figure 5: The (normalized) distribution of $\lVert \tau^* \rVert_T$ from the MNIST experiments. Deformations that fall to the left of the vertical line at are considered successful. The networks in the first column were trained using the original MNIST data, and the networks in the second and third columns were adversarially trained using ADef and PGD, respectively.
Appendix B Smooth deformations
The standard deviation of the Gaussian filter used for smoothing in the update step of ADef has significant impact on the resulting vector field. To explore this aspect of the algorithm, we repeat the experiment from section 3 on the Inception-v3 model, using standard deviations (where stands for no smoothing). The results are shown in table 3, and the effect of varying is illustrated in figures 7 and 8. We observe that as increases, the adversarial distortion steadily increases both in terms of vector field norm and perturbation norm. Likewise, the success rate of ADef decreases with larger . However, from figure 8 we see that the constraint on the total vector field may provide a rather conservative measure of the effectiveness of ADef in the case of smooth high dimensional vector fields.
| $\sigma$ | ADef success | Avg. $\lVert \tau^* \rVert_T$ | Avg. $\lVert r \rVert_\infty$ | Avg. # iterations |
Appendix C Additional deformed images
Figures 9 and 10 show adversarial deformations for the models MNIST-A and MNIST-B, respectively. The attacks are performed using the same configuration as in the experiments in section 3. Observe that in some cases, features resembling the target class have appeared in the deformed image. For example, the top part of the 4 in the fifth column of figure 10 has been curved slightly to more resemble a 9.
Figures 11 – 15 show additional deformed images resulting from attacking the Inception-v3 model using the same configuration as in the experiments in section 3. Similarly, figures 16 – 20 show deformed images resulting from attacking the ResNet-101 model. However, in order to increase variability in the output labels, we perform a targeted attack, targeting the label of 50th highest probability.