
ADef: an Iterative Algorithm to Construct Adversarial Deformations

04/20/2018
by   Rima Alaifari, et al.
ETH Zurich
Università di Genova

While deep neural networks have proven to be a powerful tool for many recognition and classification tasks, their stability properties are still not well understood. In the past, image classifiers have been shown to be vulnerable to so-called adversarial attacks, which are created by additively perturbing the correctly classified image. In this paper, we propose the ADef algorithm to construct a different kind of adversarial attack created by iteratively applying small deformations to the image, found through a gradient descent step. We demonstrate our results on MNIST with a convolutional neural network and on ImageNet with Inception-v3 and ResNet-101.

1 Introduction

It was first observed in Szegedy et al. (2013) that deep neural networks exhibit unstable behavior under small perturbations of the input. For the task of image classification this means that two visually indistinguishable images may have very different outputs, resulting in one of them being misclassified even if the other one is correctly classified with high confidence. Since then, a lot of research has been devoted to this issue through the construction of adversarial examples: given a correctly classified image $x$, we look for an image $y$ which is visually indistinguishable from $x$ but is misclassified by the network. Typically, the image $y$ is constructed as $y = x + r$, where $r$ is an adversarial perturbation that is supposed to be small in a suitable sense (normally, with respect to an $\ell^p$ norm). Several algorithms have been developed to construct adversarial perturbations, see Goodfellow et al. (2014); Moosavi-Dezfooli et al. (2016); Kurakin et al. (2017b); Madry et al. (2018); Carlini & Wagner (2017b) and the review paper Akhtar & Mian (2018).

Even though such pathological cases are very unlikely to occur in practice, their existence is relevant since malicious attackers may exploit this drawback to fool classifiers or other automatic systems. Further, adversarial perturbations may be constructed in a black-box setting (i.e., without knowing the architecture of the DNN but only its outputs) (Papernot et al., 2017; Moosavi-Dezfooli et al., 2017) and also in the physical world (Kurakin et al., 2017b; Athalye & Sutskever, 2017; Brown et al., 2017; Sharif et al., 2016). This has motivated the investigation of defenses, i.e., how to make the network invulnerable to such attacks, see Kurakin et al. (2017a); Carlini & Wagner (2017a); Madry et al. (2018); Tramèr et al. (2018); Wong & Kolter (2018); Raghunathan et al. (2018); Athalye et al. (2018); Kannan et al. (2018). In most cases, adversarial examples are artificially created and then used to retrain the network, which becomes more stable under these types of perturbations.

Most of the work on the construction of adversarial examples and on the design of defense strategies has been conducted in the context of small perturbations measured in $\ell^p$ norms. However, these are not necessarily good measures of image similarity: e.g., for two translated versions $x$ and $y$ of the same image, the norm of the difference $x - y$ is not small in general, even though $x$ and $y$ will look indistinguishable if the translation is small. Several papers have investigated the construction of adversarial perturbations that are not designed for $\ell^p$-norm proximity (Rozsa et al., 2016; Sharif et al., 2016; Brown et al., 2017; Engstrom et al., 2017; Xiao et al., 2018).

In this work, we build on these ideas and investigate the construction of adversarial deformations. In other words, the misclassified image $y$ is not constructed as an additive perturbation $y = x + r$, but as a deformation $y = x^\tau$, where $\tau$ is a vector field defining the transformation. In this case, the similarity is not measured through a norm of the perturbation $y - x$, but instead through a norm of $\tau$, which quantifies the deformation between $x$ and $y$.

We develop an efficient algorithm for the construction of adversarial deformations, which we call ADef. It is based on the main ideas of DeepFool (Moosavi-Dezfooli et al., 2016), and iteratively constructs the smallest deformation to misclassify the image. We test the procedure on MNIST (LeCun) (with convolutional neural networks) and on ImageNet (Russakovsky et al., 2015) (with Inception-v3 (Szegedy et al., 2016) and ResNet-101 (He et al., 2016)). The results show that ADef can successfully fool the classifiers in the vast majority of cases (around 99%) by using very small and imperceptible deformations. We also test our adversarial attacks on adversarially trained networks for MNIST. Our implementation of the algorithm can be found at https://gitlab.math.ethz.ch/tandrig/ADef.

The results of this work initially appeared in the master's thesis Gauksson (2017), to which we refer for additional details on the mathematical aspects of this construction. While writing this paper, we came across Xiao et al. (2018), in which a similar problem is considered and solved with a different algorithm. Whereas the authors of Xiao et al. (2018) use a second order solver to find a deforming vector field, we show how a first order method can be formulated efficiently and justify a smoothing operation independent of the optimization step. We report, for the first time, success rates for adversarial attacks with deformations on ImageNet. The topic of deformations has also come up in Jaderberg et al. (2015), in which the authors introduce a class of learnable modules that deform inputs in order to increase the performance of existing DNNs, and in Fawzi & Frossard (2015), in which the authors introduce a method to measure the invariance of classifiers to geometric transformations.

2 Adversarial deformations

2.1 Adversarial perturbations

Let $\mathcal{K} : \mathcal{X} \to \{1, \dots, L\}$ be a classifier of images consisting of $P$ pixels into $L$ categories, i.e. a function from the space of images $\mathcal{X} \subseteq \mathbb{R}^{cP}$, where $c = 1$ (for grayscale images) or $c = 3$ (for color images), into the set of labels $\{1, \dots, L\}$. Suppose $x \in \mathcal{X}$ is an image that is correctly classified by $\mathcal{K}$ and suppose $y \in \mathcal{X}$ is another image that is imperceptibly different from $x$ and such that $\mathcal{K}(y) \ne \mathcal{K}(x)$; then $y$ is said to be an adversarial example. The meaning of imperceptibility varies, but generally, proximity in some $\ell^p$-norm (with $1 \le p \le \infty$) is considered to be a sufficient substitute. Thus, an adversarial perturbation for an image $x \in \mathcal{X}$ is a vector $r \in \mathbb{R}^{cP}$ such that $\mathcal{K}(x + r) \ne \mathcal{K}(x)$ and $\|r\|_p$ is small, where

$\|r\|_p = \big( \sum_{i=1}^{cP} |r_i|^p \big)^{1/p}$ for $1 \le p < \infty$, and $\|r\|_\infty = \max_{i = 1, \dots, cP} |r_i|$.   (1)

Given such a classifier $\mathcal{K}$ and an image $x$, an adversary may attempt to find an adversarial example $y = x + r$ by minimizing $\|r\|_p$ subject to $\mathcal{K}(x + r) \ne \mathcal{K}(x)$, or even subject to $\mathcal{K}(x + r) = l$ for some target label $l \ne \mathcal{K}(x)$. Different methods for finding minimal adversarial perturbations have been proposed, most notably FGSM (Goodfellow et al., 2014) and PGD (Madry et al., 2018) for $p = \infty$, and the DeepFool algorithm (Moosavi-Dezfooli et al., 2016) for general $\ell^p$-norms.
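For concreteness, a minimal PyTorch sketch of the FGSM attack mentioned above might look as follows (this is not the implementation used in this paper; `model`, `label`, and `eps` are placeholders, a batch dimension is assumed, and pixel values are assumed to lie in [0, 1]):

import torch
import torch.nn.functional as nnf

def fgsm(model, x, label, eps):
    # Perturb x by eps in the direction of the sign of the loss gradient.
    x = x.clone().detach().requires_grad_(True)
    loss = nnf.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()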

2.2 Deformations

Instead of constructing adversarial perturbations, we intend to fool the classifier by small deformations of correctly classified images. Our procedure is in the spirit of the DeepFool algorithm. Before we explain it, let us first clarify what we mean by a deformation of an image. The discussion is at first more intuitive if we model images as functions $\xi$ defined on a square domain and taking values in $\mathbb{R}^c$ (with $c = 1$ or $c = 3$) instead of discrete vectors in $\mathcal{X}$. In this setting, perturbing an image $\xi$ corresponds to adding to it another function $\rho$ with a small $L^p$-norm.

While any transformation of an image $\xi$ can be written as a perturbation $\xi + \rho$, we shall restrict ourselves to a particular class of transformations. A deformation with respect to a vector field $\tau$ is a transformation of the form $\xi \mapsto \xi^\tau$, where for any image $\xi$, the image $\xi^\tau$ is defined by

$\xi^\tau(u) = \xi\big(u + \tau(u)\big),$

extending $\xi$ by zero outside of its domain. Deformations capture many natural image transformations. For example, a translation of the image $\xi$ by a vector $v$ is a deformation with respect to the constant vector field $\tau \equiv v$. If $v$ is small, the images $\xi$ and $\xi^\tau$ may look similar, but the corresponding perturbation $\rho = \xi^\tau - \xi$ may be arbitrarily large in the aforementioned $L^p$-norms. Figure 1 shows three minor deformations, all of which yield large perturbations.

Figure 1: First row: The original $28 \times 28$ pixel image from the MNIST database, and the same image translated, rotated, and deformed with respect to an arbitrary smooth vector field $\tau$. The norm of the corresponding perturbation is shown under each deformed image. The pixel values range from 0 (white) to 1 (black), so the deformed images all lie far from the original image in that norm. Second row: The vector fields corresponding to the above deformations and their $T$-norms (cf. equation (3)).

In the discrete setting, deformations are implemented as follows. We consider square images of pixels and define the space of images to be . A discrete vector field is a function . In what follows we will only consider the set of vector fields that do not move points on the grid outside of . More precisely,

An image can be viewed as the collection of values of a function on a regular grid , i.e.  for . Such a function

can be computed by interpolating from

. Thus, the deformation of an image with respect to the discrete vector field can be defined as the discrete deformed image in by

(2)

It is not straightforward to measure the size of a deformation such that it captures the visual difference between the original image $x$ and its deformed counterpart $x^\tau$. We will use the size of the corresponding vector field $\tau$ in the norm defined by

$\|\tau\|_T = \max_{s, t = 1, \dots, W} \big| \tau(s,t) \big|$   (3)

as a proxy, where $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^2$. The $\ell^p$-norms defined in (1), adapted to vector fields, can be used as well. (We remark, however, that none of these norms defines a distance between $x$ and $x^\tau$, since two different vector fields $\tau \ne \sigma$ may produce the same deformed image $x^\tau = x^\sigma$.)
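To make the discrete setting concrete, the following sketch (not the authors' implementation) computes the deformed image of equation (2) by bilinear interpolation and the norm of equation (3) for a grayscale image; displacements are measured in pixel units, and the array shapes are assumptions.

import numpy as np
from scipy.ndimage import map_coordinates

def deform(x, tau):
    # x: (W, W) image, tau: (W, W, 2) vector field in pixel units.
    # Sample x at the displaced grid points (equation (2)); points pushed
    # outside the image are filled with zero (cval=0), mimicking the
    # extension of xi by zero.
    W = x.shape[0]
    rows, cols = np.meshgrid(np.arange(W), np.arange(W), indexing="ij")
    coords = [rows + tau[..., 0], cols + tau[..., 1]]
    return map_coordinates(x, coords, order=1, cval=0.0)  # order=1: bilinear

def T_norm(tau):
    # Maximum over the grid of the Euclidean length of tau(s, t), equation (3).
    return np.sqrt((tau ** 2).sum(axis=-1)).max()

# Example: a one-pixel translation expressed as a constant vector field has a
# small T-norm, but the corresponding perturbation can be large.
x = np.random.rand(28, 28)
tau = np.ones((28, 28, 2)) * np.array([0.0, 1.0])
print(T_norm(tau), np.abs(deform(x, tau) - x).max())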

2.3 The algorithm ADef

We will now describe our procedure for finding deformations that will lead a classifier to yield an output different from the original label.

Let $F = (F_1, \dots, F_L) : \mathcal{X} \to \mathbb{R}^L$ be the underlying model for the classifier $\mathcal{K}$, such that

$\mathcal{K}(x) = \arg\max_{k = 1, \dots, L} F_k(x).$

Let $x \in \mathcal{X}$ be the image of interest and fix $\xi$, obtained by interpolation from $x$. Let $k = \mathcal{K}(x)$ denote the true label of $x$, let $l \ne k$ be a target label and set $f = F_k - F_l$. We assume that $x$ does not lie on a decision boundary, so that we have $f(x) > 0$.

We define the function $g : \mathcal{T} \to \mathbb{R}$, $g(\tau) = f(x^\tau)$, and note that $g(0) = f(x) > 0$. Our goal is to find a small vector field $\tau$ such that $g(\tau) \le 0$. We can use a linear approximation of $g$ around the zero vector field as a guide:

$g(\tau) \approx g(0) + (D_0 g)\,\tau$   (4)

for $\tau$ small enough, where $D_0 g$ denotes the derivative of $g$ at $\tau = 0$. Hence, if $\tau$ is a vector field such that

$(D_0 g)\,\tau = -g(0)$   (5)

and $\tau$ is small, then the classifier $\mathcal{K}$ has approximately equal confidence for the deformed image $x^\tau$ to have either label $k$ or $l$. This is a single scalar equation with the unknown $\tau \in \mathcal{T}$, and so it has infinitely many solutions. In order to select $\tau$ with a small norm, we solve it in the least-squares sense.

In view of (2), we have $x^\tau_{s,t} = \xi\big( (s,t) + \tau(s,t) \big)$. Thus, by applying the chain rule to $g(\tau) = f(x^\tau)$, we obtain that its derivative at $\tau = 0$ can, with a slight abuse of notation, be identified with the vector field

$(D_0 g)(s,t) = \big( \nabla \xi(s,t) \big)^{\mathsf T} \, \nabla_{x_{s,t}} f(x),$   (6)

where $\nabla \xi(s,t) \in \mathbb{R}^{c \times 2}$ is the spatial derivative of $\xi$ calculated at the grid point $(s,t)$, and $\nabla_{x_{s,t}} f(x) \in \mathbb{R}^c$ is the derivative of $f$ in the pixel $(s,t)$ calculated at $x$. With this, $(D_0 g)\,\tau$ stands for $\sum_{s,t=1}^{W} (D_0 g)(s,t) \cdot \tau(s,t)$, and the solution to (5) in the least-squares sense is given by

$\tau = -\dfrac{g(0)}{\sum_{s,t=1}^{W} |(D_0 g)(s,t)|^2} \, D_0 g.$   (7)

Finally, we define the deformed image $x^\tau$ according to (2).

One might like to impose some degree of smoothness on the deforming vector field. In fact, it suffices to search in the range of a smoothing operator $S$, and this essentially amounts to applying $S$ to the solution obtained in the larger search space $\mathcal{T}$. Let $S\tau = \varphi * \tau$ denote the componentwise application of a two-dimensional Gaussian filter $\varphi$ (of any standard deviation). Then the vector field

$\tilde\tau = -\dfrac{g(0)}{\sum_{s,t=1}^{W} |(S\,D_0 g)(s,t)|^2} \, S\big(S\,(D_0 g)\big)$

also satisfies (5), since $S$ is self-adjoint. We can hence replace $\tau$ by $\tilde\tau$ to obtain a smooth deformation $x^{\tilde\tau}$ of the image $x$.
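The following sketch combines (6), (7), and the smoothing step for a single grayscale image; it is not the authors' implementation. The array grad_f stands for the derivative of f with respect to the pixels of x (e.g. obtained by backpropagation), f_value stands for g(0) = f(x), and displacements are measured in pixel units as in the earlier sketch.

import numpy as np
from scipy.ndimage import gaussian_filter

def smooth(v, sigma):
    # Componentwise Gaussian filtering of a (W, W, 2) field (the operator S);
    # sigma == 0 means no smoothing.
    if sigma == 0:
        return v
    return np.stack([gaussian_filter(v[..., i], sigma) for i in range(2)], axis=-1)

def adef_step(x, f_value, grad_f, sigma=1.0):
    # Spatial derivative of the (interpolated) image, approximated here by
    # finite differences; d_xi has shape (W, W, 2).
    d_xi = np.stack(np.gradient(x), axis=-1)
    # Equation (6): (D_0 g)(s, t) = grad_f(s, t) * d_xi(s, t), a 2-vector per pixel.
    D0g = grad_f[..., None] * d_xi
    # Least-squares solution of (5) restricted to the range of S:
    # tau = -g(0) / ||S D0g||^2 * S(S D0g); it still solves (5) since S is self-adjoint.
    SD0g = smooth(D0g, sigma)
    return -f_value / ((SD0g ** 2).sum() + 1e-12) * smooth(SD0g, sigma)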

We iterate the deformation process until the deformed image is misclassified. More explicitly, let $y_0 = x$, and for $n \ge 1$ let $\tau_n$ be given by (7) for the image $y_{n-1}$. Then we define the iteration as $y_n = y_{n-1}^{\tau_n}$. The algorithm terminates and outputs an adversarial example $y_n$ if $\mathcal{K}(y_n) \ne k$. The iteration also terminates if $y_n$ lies on a decision boundary of $\mathcal{K}$, in which case we propose to introduce an overshoot factor on the total deforming vector field. Provided that the number of iterations is moderate, the total vector field can be well approximated by $\tau^* = \tau_1 + \dots + \tau_n$, and the process can be altered to output the deformed image $x^{\tau^*}$ instead.

The target label $l$ may be chosen in each iteration so as to minimize the norm of the vector field and thereby obtain a better approximation in the linearization (4). More precisely, for a candidate set of labels $l_1, \dots, l_m$, we compute the corresponding vector fields $\tau^{(1)}, \dots, \tau^{(m)}$ and select the one with the smallest norm,

$\tau = \tau^{(j^*)}, \qquad j^* = \arg\min_{j = 1, \dots, m} \|\tau^{(j)}\|_T.$

The candidate set consists of the labels corresponding to the $m$ smallest entries of $(F_k(x) - F_l(x))_{l \ne k}$, in absolute value.

Input: classification model $F$, image $x$, correct label $k = \mathcal{K}(x)$, candidate labels $l_1, \dots, l_m$
Output: deformed image $y$
Initialize $y \leftarrow x$, $\tau^* \leftarrow 0$
while $\mathcal{K}(y) = k$ do
     for $j = 1, \dots, m$ do
          $f \leftarrow F_k - F_{l_j}$
          $\tau_j \leftarrow$ solution of (7) for $f$ at $y$, smoothed by $S$
     end for
     $\tau^* \leftarrow \tau^* + \arg\min_{\tau_j} \|\tau_j\|_T$
     $y \leftarrow x^{\tau^*}$
end while
return $y$
Algorithm ADef
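In terms of the helpers deform, T_norm, and adef_step sketched above, the outer loop of ADef could be sketched as follows. The callables logits(x) (returning the vector F(x)) and grad_diff(x, k, l) (returning the pixelwise gradient of F_k - F_l at x) are hypothetical stand-ins for the model interface, and the overshoot step at a decision boundary is omitted.

import numpy as np

def adef(x, logits, grad_diff, candidates, sigma=1.0, max_iter=50):
    k = int(np.argmax(logits(x)))           # label of the original image
    tau_total = np.zeros(x.shape + (2,))    # accumulated vector field tau*
    y = x
    for _ in range(max_iter):
        if int(np.argmax(logits(y))) != k:  # misclassified: adversarial example found
            break
        # One candidate step per target label; keep the one with the smallest norm.
        F = logits(y)
        steps = [adef_step(y, F[k] - F[l], grad_diff(y, k, l), sigma) for l in candidates]
        tau_total += min(steps, key=T_norm)
        y = deform(x, tau_total)            # deform the original image by tau*
    return y, tau_total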

By equation (6), provided that the derivative of $f$ is moderate, the deforming vector field takes small values wherever $\xi$ has a small derivative. This means that the vector field will be concentrated on the edges in the image (see e.g. the first row of figure 2). Further, note that the result of a deformation is always a valid image in the sense that it does not violate the pixel value bounds. This is not guaranteed for the perturbations computed with DeepFool.

3 Experiments

3.1 Setup

We evaluate the performance of ADef by applying the algorithm to classifiers trained on the MNIST (LeCun) and ImageNet (Russakovsky et al., 2015) datasets. Below, we briefly describe the setup of the experiments, and in tables 1 and 2 we summarize their results.

MNIST: We train two convolutional neural networks based on architectures that appear in Madry et al. (2018) and Tramèr et al. (2018), respectively. The network MNIST-A consists of two convolutional layers, each followed by max-pooling and a rectifier activation function, a fully connected layer into dimension 1024 with a rectifier activation function, and a final linear layer with output dimension 10. The network MNIST-B consists of two convolutional layers with a rectifier activation function, a fully connected layer into dimension 128 with a rectifier activation function, and a final linear layer with output dimension 10. During training, the second convolutional layer and the first fully connected layer of MNIST-B are subject to dropout. We use ADef to produce adversarial deformations of the images in the test set. The algorithm is configured to pursue any label different from the correct label (all incorrect labels are candidate labels). It performs smoothing with a Gaussian filter, uses bilinear interpolation to obtain intermediate pixel intensities, and applies an overshoot factor whenever it converges to a decision boundary.
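As an illustration, a hypothetical PyTorch rendering of the MNIST-A structure described above could look as follows; the filter counts and kernel sizes are placeholders (the exact layer sizes are not reproduced here), and only the overall structure and the dimensions 1024 and 10 follow the description.

import torch.nn as nn

mnist_a = nn.Sequential(
    # conv -> max-pool -> ReLU, twice (filter counts and kernel sizes are illustrative)
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.MaxPool2d(2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.MaxPool2d(2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),  # fully connected layer into dimension 1024
    nn.Linear(1024, 10),                     # final linear layer with output dimension 10
)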

ImageNet: We apply ADef to pretrained Inception-v3 (Szegedy et al., 2016) and ResNet-101 (He et al., 2016) models to generate adversarial deformations for the images in the ILSVRC2012 validation set. The images are preprocessed by first scaling them so that the smaller side has 299 pixels for the Inception model and 224 pixels for ResNet, and then center-cropping them to a square image. The algorithm is set to focus only on the label of second highest probability. It employs a Gaussian filter of standard deviation 1 (cf. figure 2 and table 3), bilinear interpolation, and the same overshoot strategy as above.
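The preprocessing just described corresponds, for example, to the following torchvision pipeline (a sketch, not necessarily the exact pipeline used in the experiments):

from torchvision import transforms

inception_preprocess = transforms.Compose([
    transforms.Resize(299),      # scale so that the smaller side has 299 pixels
    transforms.CenterCrop(299),  # center-crop to a square image
    transforms.ToTensor(),       # pixel values in [0, 1]
])
# For ResNet-101, replace 299 by 224.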

We only consider inputs that are correctly classified by the model in question, and, since $\tau^* = \tau_1 + \dots + \tau_n$ approximates the total deforming vector field, we declare ADef to be successful if its output is misclassified and $\|\tau^*\|_T \le \varepsilon$ for a fixed threshold $\varepsilon$. Observe that, by (3), a deformation with respect to a vector field $\tau$ does not displace any pixel further away from its original position than $\|\tau\|_T$. Hence, for high resolution images, a moderate choice of $\varepsilon$ indeed produces small deformations if the vector fields are smooth. In appendix A, we illustrate how the success rate of ADef depends on the choice of $\varepsilon$.

Model   Accuracy   ADef success   Avg. $\|\tau^*\|_T$   Avg. $\|r\|_\infty$   Avg. # iterations
MNIST-A 98.99% 99.85% 1.1950 0.7455 7.002
MNIST-B 98.91% 99.51% 1.0841 0.7654 4.422
Inception-v3 77.56% 98.94% 0.5984 0.2039 4.050
ResNet-101 76.97% 99.78% 0.5561 0.1882 4.176
Table 1: The results of applying ADef to the images in the MNIST test set and the ILSVRC2012 validation set. The accuracy of the Inception and ResNet models is defined as the top-1 accuracy on the center-cropped and resized images. The success rate of ADef is shown as a percentage of the correctly classified inputs. The pixel range is scaled to $[0,1]$, so the perturbation $r = y - x$, where $x$ is the input and $y$ the output of ADef, has values in $[-1,1]$. The averages in the three last columns are computed over the set of images on which ADef is successful. Recall the definition of the vector field norm $\|\cdot\|_T$ in equation (3).
Figure 2: Sample deformations for the Inception-v3 model. The vector fields and perturbations have been amplified for visualization. First row: An image from the ILSVRC2012 validation set, the output of ADef with a Gaussian filter of standard deviation 1, the corresponding vector field and perturbation. The rightmost image is a close-up of the vector field around the nose of the ape. Second row: A larger deformation of the same image, obtained by using a wider Gaussian filter (standard deviation 6) for smoothing.

When searching for an adversarial example, one usually requires the perturbation to have $\ell^\infty$-norm smaller than some small bound; such bounds are commonly imposed both for MNIST classifiers (Goodfellow et al., 2014; Madry et al., 2018; Wong & Kolter, 2018; Tramèr et al., 2018; Kannan et al., 2018) and for ImageNet classifiers (Goodfellow et al., 2014; Kurakin et al., 2017a; Tramèr et al., 2018; Kannan et al., 2018). Table 1 shows that on average, the perturbations obtained by ADef are quite large compared to those constraints. However, as can be seen in figure 2, the relatively high resolution images of the ImageNet dataset can be deformed into adversarial examples that, while corresponding to large perturbations, are not visibly different from the original images. In appendices B and C, we give more examples of adversarially deformed images.

3.2 Adversarial training

In addition to training MNIST-A and MNIST-B on the original MNIST data, we train independent copies of the networks using the adversarial training procedure described by Madry et al. (2018). That is, before each step of the training process, the input images are adversarially perturbed using the PGD algorithm. This manner of training provides increased robustness against adversarial perturbations of low $\ell^\infty$-norm. Moreover, we train networks using ADef instead of PGD as the adversary. In table 2 we show the results of attacking these adversarially trained networks, using ADef on the one hand and PGD on the other. We use the same configuration for ADef as above, and for PGD we use 40 iterations with a fixed step size and a bound on the $\ell^\infty$-norm of the perturbation. Interestingly, using these configurations, the networks trained against PGD attacks are more resistant to adversarial deformations than those trained against ADef.

Model Adv. training Accuracy PGD success ADef success
MNIST-A PGD 98.36% 5.81% 6.67%
ADef 98.95% 100.00% 54.16%
MNIST-B PGD 98.74% 5.84% 20.35%
ADef 98.79% 100.00% 45.07%
Table 2: Success rates for PGD and ADef attacks on adversarially trained networks.
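For reference, a minimal PyTorch sketch of the $\ell^\infty$ PGD adversary used for adversarial training above is given below. The defaults eps and alpha are common MNIST choices and are placeholders rather than the exact values used in these experiments; only the 40 iterations follow the text.

import torch
import torch.nn.functional as nnf

def pgd_attack(model, x, labels, eps=0.3, alpha=0.01, iters=40):
    # Random start in the l-infinity ball of radius eps around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = nnf.cross_entropy(model(x_adv), labels)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step on the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # stay in the valid pixel range
    return x_adv.detach()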

3.3 Targeted attacks

ADef can also be used for targeted adversarial attacks, by restricting the candidate set to a particular target label instead of choosing whichever label yields the smallest deformation. Figure 3 demonstrates the effect of choosing different target labels for a given MNIST image, and figure 4 shows the result of targeting the label of lowest probability for an image from the ImageNet dataset.
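In terms of the adef sketch of section 2.3, a targeted attack simply restricts the candidate set to a single, here hypothetical, target_label:

y_targeted, tau_targeted = adef(x, logits, grad_diff, candidates=[target_label])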

Figure 3: Targeted ADef against MNIST-A. First row: The original image and the deformed images produced by restricting ADef to each target label in turn. The norms of the corresponding perturbations are shown under the deformed images. Second row: The vector fields corresponding to the deformations and their $T$-norms.
Figure 4: Untargeted vs. targeted attack on the ResNet-101 model. An image from the ILSVRC2012 validation set deformed to the labels of second highest (first row) and lowest (second row) probabilities (out of 1,000) for the original image. The vector fields and perturbations have been amplified for visualization.

4 Conclusion

In this work, we proposed ADef, a new efficient algorithm to construct a new type of adversarial attack for DNN image classifiers. The procedure is iterative and in each iteration takes a gradient descent step to deform the previous iterate, in order to push it towards a decision boundary.

We demonstrated that with almost imperceptible deformations, state-of-the-art classifiers can be fooled into misclassifying, with ADef achieving a high success rate. This suggests that networks are vulnerable to different types of attacks and that simply training the network on a specific class of adversarial examples might not form a sufficient defense strategy. Given this vulnerability of neural networks to deformations, we wish to study in future work how ADef can help in designing possible defense strategies. Furthermore, we also showed initial results on fooling adversarially trained networks. Remarkably, on MNIST, PGD-trained networks are more resistant to adversarial deformations than ADef-trained networks. However, for this result to be more conclusive, similar tests will have to be conducted on ImageNet. We wish to study this in future work.

Acknowledgments

The authors would like to thank Helmut Bölcskei and Thomas Wiatowski for fruitful discussions.

References

Appendix A Distribution of vector field norms

Figures 5 and 6 show the distribution of the norms $\|\tau^*\|_T$ of the total deforming vector fields from the experiments in section 3. For networks that have not been adversarially trained, most deformations fall well below the success threshold $\varepsilon$. Out of the adversarially trained networks, only MNIST-A trained against PGD is truly robust against ADef. Further, a comparison between the first column of figure 5 and figure 6 indicates that ImageNet is much more vulnerable to adversarial deformations than MNIST, especially considering the much higher resolution of the ImageNet images. Thus, it would be very interesting to study the performance of ADef with adversarially trained networks for ImageNet, as mentioned in the conclusion.
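A figure-5-style histogram can be produced, for instance, as follows; tau_norms is a hypothetical array of collected $\|\tau^*\|_T$ values and eps is the success threshold.

import matplotlib.pyplot as plt

plt.hist(tau_norms, bins=50, density=True)  # normalized distribution of the T-norms
plt.axvline(eps, color="k")                 # deformations to the left count as successful
plt.xlabel("vector field norm")
plt.show()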

Figure 5: The (normalized) distribution of $\|\tau^*\|_T$ from the MNIST experiments. Deformations that fall to the left of the vertical line at $\varepsilon$ are considered successful. The networks in the first column were trained using the original MNIST data, and the networks in the second and third columns were adversarially trained using ADef and PGD, respectively.
Figure 6: The (normalized) distribution of $\|\tau^*\|_T$ from the ImageNet experiments. Deformations that fall to the left of the vertical line at $\varepsilon$ are considered successful.

Appendix B Smooth deformations

The standard deviation $\sigma$ of the Gaussian filter used for smoothing in the update step of ADef has a significant impact on the resulting vector field. To explore this aspect of the algorithm, we repeat the experiment from section 3 on the Inception-v3 model, using standard deviations $\sigma = 0, 1, 2, 4, 8$ (where $\sigma = 0$ stands for no smoothing). The results are shown in table 3, and the effect of varying $\sigma$ is illustrated in figures 7 and 8. We observe that as $\sigma$ increases, the adversarial distortion steadily increases, both in terms of the vector field norm and the perturbation norm. Likewise, the success rate of ADef decreases with larger $\sigma$. However, from figure 8 we see that the constraint $\|\tau^*\|_T \le \varepsilon$ on the total vector field may provide a rather conservative measure of the effectiveness of ADef in the case of smooth, high-dimensional vector fields.

$\sigma$   ADef success   Avg. $\|\tau^*\|_T$   Avg. $\|r\|_\infty$   Avg. # iterations
0 99.12% 0.5272 0.1628 5.247
1 98.94% 0.5984 0.2039 4.050
2 95.91% 0.7685 0.2573 3.963
4 86.66% 0.9632 0.3128 4.379
8 67.54% 1.1684 0.3687 5.476
Table 3: The results of applying ADef to the images in the ILSVRC2012 validation set and the Inception model, using different values for the standard deviation $\sigma$ of the Gaussian filter. As before, we define ADef to be successful if its output is misclassified and $\|\tau^*\|_T \le \varepsilon$.
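In terms of the earlier sketches, such a sweep could be scripted as follows; images, logits, grad_diff, candidates, and eps are assumptions standing in for the data, the model interface, the candidate labels, and the success threshold.

import numpy as np

for sigma in [0, 1, 2, 4, 8]:
    successes = 0
    for x in images:
        y, tau_total = adef(x, logits, grad_diff, candidates, sigma=sigma)
        k = int(np.argmax(logits(x)))
        misclassified = int(np.argmax(logits(y))) != k
        if misclassified and T_norm(tau_total) <= eps:
            successes += 1
    print(sigma, successes / len(images))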
Figure 7: The effects of increasing the smoothness parameter $\sigma$ on adversarial deformations for Inception-v3. First and fourth rows: A correctly classified image and deformed versions. Second and fifth rows: The corresponding deforming vector fields and their $T$-norms. Third and sixth rows: The corresponding perturbations and their norms.
Figure 8: The effects of increasing the smoothness parameter $\sigma$ on adversarial deformations for Inception-v3. Note that according to the criterion $\|\tau^*\|_T \le \varepsilon$, the largest value of $\sigma$ yields an unsuccessful deformation of the recreational vehicle.

Appendix C Additional deformed images

C.1 MNIST

Figures 9 and 10 show adversarial deformations for the models MNIST-A and MNIST-B, respectively. The attacks are performed using the same configuration as in the experiments in section 3. Observe that in some cases, features resembling the target class have appeared in the deformed image. For example, the top part of the 4 in the fifth column of figure 10 has been curved slightly to more closely resemble a 9.

Figure 9: Adversarial deformations for MNIST-A. First and third rows: Original images from the MNIST test set. Second and fourth rows: The deformed images and the norms of the corresponding deforming vector fields.
Figure 10: Adversarial deformations for MNIST-B. Note that image 9 in row 3 is misclassified, and is then deformed to its correct label.

C.2 ImageNet

Figures 11-15 show additional deformed images resulting from attacking the Inception-v3 model using the same configuration as in the experiments in section 3. Similarly, figures 16-20 show deformed images resulting from attacking the ResNet-101 model. However, in order to increase variability in the output labels, we perform a targeted attack, targeting the label of 50th highest probability.

Figure 11: ADef attacks on the Inception-v3 model using the same configuration as in the experiments in section 3.
Figure 12: ADef attacks on the Inception-v3 model using the same configuration as in the experiments in section 3.
Figure 13: ADef attacks on the Inception-v3 model using the same configuration as in the experiments in section 3.
Figure 14: ADef attacks on the Inception-v3 model using the same configuration as in the experiments in section 3.
Figure 15: ADef attacks on the Inception-v3 model using the same configuration as in the experiments in section 3.
Figure 16: ADef attacks on the ResNet-101 model targeting the 50th most likely label.
Figure 17: ADef attacks on the ResNet-101 model targeting the 50th most likely label.
Figure 18: ADef attacks on the ResNet-101 model targeting the 50th most likely label.
Figure 19: ADef attacks on the ResNet-101 model targeting the 50th most likely label.
Figure 20: ADef attacks on the ResNet-101 model targeting the 50th most likely label.