Big but Imperceptible Adversarial Perturbations via Semantic Manipulation

04/12/2019 ∙ by Anand Bhattad, et al. ∙ University of Illinois at Urbana-Champaign

Machine learning, especially deep learning, is widely applied to a range of applications including computer vision, robotics and natural language processing. However, it has been shown that machine learning models are vulnerable to adversarial examples, carefully crafted samples that deceive learning models. In-depth studies of adversarial examples can help better understand potential vulnerabilities and therefore improve model robustness. Recent works have introduced various methods for generating adversarial examples. However, all require the perturbation to be of small magnitude (L_p norm) to remain imperceptible to humans, which is hard to deploy in practice. In this paper we propose two novel methods, tAdv and cAdv, which leverage texture transfer and colorization to generate natural perturbations with a large L_p norm. We conduct extensive experiments to show that the proposed methods are general enough to attack both image classification and image captioning tasks on the ImageNet and MSCOCO datasets. In addition, we conduct comprehensive user studies under various conditions to show that our generated adversarial examples are imperceptible to humans even when the perturbations are large. We also evaluate the transferability and robustness of the proposed attacks against several state-of-the-art defenses.







1 Introduction

Machine learning (ML), especially deep learning, has achieved great success in various tasks, including modeling, predicting and classifying large datasets. Recent literature has shown that these widely deployed ML models are vulnerable to adversarial examples – carefully crafted small perturbations aimed at misleading learning models [4, 22, 30]. Adversarial examples have raised many concerns, mainly in security-critical domains [3, 10]. Studying them is one important way to better understand such vulnerabilities and therefore develop robust ML models.

To date, many methods have been proposed to generate adversarial examples [14, 4, 22, 29, 30]. However, all of these attack methods search for adversarial perturbations with small L_p norm so as to appear invisible to humans, which is hard to deploy in practice. For instance, small-norm perturbations are difficult to capture with cameras in the physical world and thus fail to be effective in real scenarios. In addition, such small perturbations can be fragile: simple operations such as JPEG compression or pixel quantization can reduce their effectiveness. Furthermore, under certain scenarios (e.g. more complicated models), small perturbations are insufficient to achieve a high attack success rate. These weaknesses of existing adversarial perturbations motivate researchers to seek perturbations with large norm that not only mislead learning models but also go unnoticed by humans. In particular, Google has hosted a competition for generating “unrestricted adversarial examples” [1].

To achieve the goal of unrestricted adversarial examples, we propose two novel methods, tAdv and cAdv, to construct adversarial examples that are “far” from the original image in L_p norm but look natural to humans. This is done through the manipulation of explicit semantic visual representations (e.g. texture and color). tAdv utilizes the texture from nearest-neighbor instances and adjusts the victim instance’s texture field via style transfer as adversarial perturbation, while cAdv adaptively chooses locations in an image to change their colors, producing adversarial perturbation that is usually fairly substantial. Such semantic transformation-based adversarial perturbation sheds light on what information is important for neural networks to make predictions. For instance, in one of our case studies, we found that when the road is recolored from gray to blue, the image gets misclassified as tench (a fish) although a car remains evidently visible (Fig. 4b). This indicates that deep learning models can easily be fooled by certain local patterns. In addition to image classifiers, the proposed attack methods generalize to other machine learning tasks such as image captioning [2]. Our attacks can either change the entire caption to the target or take on more challenging tasks such as changing one or two specific words of the caption to a target. For example, in Fig. 1, “stop sign” in the original image caption was changed to “umbrella is” and “cat sitting” by our two attacks.

To verify that our large adversarial perturbations are imperceptible to humans, we conducted extensive user studies under various conditions. We compare our generated adversarial examples with ones produced by state-of-the-art methods (e.g. the basic iterative method (BIM) [22] and the optimization-based attack [4]) and show that when the norm of the perturbation is small, all methods appear natural to humans. However, when the bound on the perturbation is relaxed and its norm grows, the proposed methods stay imperceptible, with user study scores remaining close to their original values, whereas user preference for the other attack methods drops drastically.

We tested our proposed attacks against several state-of-the-art defenses. Rather than just showing that the attacks break these defenses (better defenses will come up), we aim to show that tAdv and cAdv produce new types of adversarial examples which are more robust to the considered defenses than other attacks. Our proposed attacks are also more transferable given their large perturbation [26]. These new properties of the proposed adversarial examples provide further insight into the vulnerabilities of learning models and therefore encourage new solutions to improve their robustness.

Contributions: 1) We describe two novel methods based on semantic transformation to generate adversarial examples; 2) our attacks are unrestricted, with no constraints on the L_p norm; 3) we conduct extensive experiments attacking both image classification and image captioning models on the large-scale ImageNet [7] and MSCOCO [24] datasets; 4) we perform comprehensive user studies under different scenarios to show that, compared to other attacks, our generated adversarial examples appear more natural to humans at large perturbation norms; 5) we test different adversarial examples against several state-of-the-art defense methods and show that the proposed attacks are not only more robust but also exhibit unique properties.

2 Related Work

Figure 2: Perturbation comparisons. Images are attacked from tarantula to pretzel in the first row and to photocopier in the third row. Our perturbations are large, structured and have spatial patterns when compared with other attacks. Perturbations from cAdv are low-frequency and locally smooth, while perturbations from tAdv are primarily high-frequency and structured. Note that gray denotes zero perturbation.

In this section, we briefly summarize current approaches to generating adversarial examples and discuss several standard semantic image transformation approaches. Adversarial examples have attracted a lot of attention recently due to their interesting properties, such as being “indistinguishable from benign images” to human perception. Currently, most adversarial perturbations are measured by L_p norm distance. For instance, the fast gradient sign method (FGSM) adds a perturbation with small L_p norm to a benign instance, in the direction of the gradient of the target learning model’s loss, to obtain an adversarial example [14]. The optimization-based attack instead optimizes the perturbation so that the adversarial example is misclassified as the adversarial target while the magnitude of the perturbation is minimized [4, 22]. Though L_p norm distance is not a perfect metric of similarity for images, current adversarial examples mostly rely on bounding an L_p-based distance to guarantee closeness to the benign instance and therefore appear “natural” to humans. Later, Xiao et al. proposed to change pixel positions instead of pixel values to generate adversarial examples. While this attack leads to “natural”-looking adversarial examples with large L_p norm, it does not take semantic image information into account [30]. Unlike previous attack methods, our attacks are unbounded while remaining realistic. They are also effective at attacking different machine learning models.
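For concreteness, FGSM's one-step update can be sketched as follows (an illustrative numpy sketch; the toy input and the stand-in gradient vector are our own assumptions, not taken from any of the cited attacks):

```python
import numpy as np

def fgsm(x, grad, eps):
    """One-step FGSM: move each pixel by eps in the sign of the loss gradient,
    then clip back to the valid image range."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)

# Toy example: a 3-pixel "image" and a stand-in loss gradient.
x = np.array([0.2, 0.5, 0.8])
g = np.array([1.0, -2.0, 0.5])      # hypothetical gradient of the model's loss
x_adv = fgsm(x, grad=g, eps=0.03)
# Each pixel moves by exactly eps, so the L_inf norm of the perturbation is eps.
```

The small, fixed `eps` is exactly the kind of L_p constraint that tAdv and cAdv dispense with.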

Texture transfer extracts texture from one image and adds it to another. Texture is one of the most crucial descriptors that aid image recognition [5]. Transferring texture from a source image to a target image has been widely studied in computer vision [9, 11]. The Convolutional Neural Network (CNN) based texture transfer of Gatys et al. [11] led to a series of new ideas in the domain of artistic style transfer [12, 18, 23, 32]. It was identified that within-layer feature statistics, or gram matrices, of a pretrained deep neural network extract strong texture cues from an image. Recently, Yeh et al. [32] demonstrated that cross-layer gram matrices are more effective at extracting texture than within-layer ones [12].

Image colorization is the task of giving natural colors to a grayscale image. This is an ill-posed problem, as there are multiple viable natural colorizations for a single grayscale image. Deshpande et al. [8] showed that diverse image colorization can be achieved with an architecture that combines a VAE [21] and a Mixture Density Network, while Zhang et al. [34] demonstrated improved and diverse image colorization by using input hints from users in a guided colorization process.

3 Texture Attack (tAdv)

Our goal is to generate adversarial examples by infusing texture from another image, without explicit constraints on the L_p norm of the perturbation. To generate our examples, we use a pretrained VGG19 network [27] to extract textural features [12]. We directly optimize our victim image by adding texture from a target image. A natural strategy to transfer texture is to minimize within-layer feature correlation statistics (gram matrices) between the two images [11, 12]. Following Yeh et al. [32], we found that optimizing cross-layer gram matrices instead of within-layer gram matrices helps produce more natural-looking adversarial examples. The difference is that for within-layer gram matrices, the feature statistics are computed within the same layer, whereas for cross-layer gram matrices they are computed between two adjacent layers.

Objectives. tAdv directly attacks the image to create adversarial examples, and no additional content loss of the kind used in style transfer methods [12, 32] is needed. Our overall objective function for the texture attack contains a texture transfer loss and an adversarial loss.


Unlike style transfer methods, we do not want the adversarial examples to be artistically pleasing. Our goal is to infuse a reasonable texture from a target-class image into the victim image to fool a classifier or captioning network. To ensure a reasonable texture is added without perturbing the victim image too much, we introduce an additional constraint on the variation in the gram matrices of the victim image. This constraint helps us control the image transformation procedure and prevents it from producing artistic images. Let l and l+1 denote two layers of a pretrained VGG-19 with decreasing spatial resolution and N_l the number of filter maps in layer l; our texture transfer loss is then given by


Let F^l denote the feature maps at layer l and let U(F^{l+1}) be an upsampled F^{l+1} that matches the spatial resolution of layer l. The cross-layer gram matrix from [32] between the victim image and a target image is given as
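The cross-layer statistic can be sketched as follows (a minimal numpy sketch assuming nearest-neighbor 2x upsampling and toy feature-map shapes; the real attack computes these statistics on VGG19 activations):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor upsample a (C, H, W) feature map by 2 in each spatial dim."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def cross_layer_gram(f_l, f_next):
    """Cross-layer Gram matrix: channel correlations between layer l's features
    (C_l, H, W) and the upsampled features of layer l+1 (C_next, H/2, W/2)."""
    f_up = upsample2x(f_next)              # match layer l's spatial resolution
    c_l, h, w = f_l.shape
    a = f_l.reshape(c_l, h * w)
    b = f_up.reshape(f_up.shape[0], h * w)
    return a @ b.T / (h * w)               # (C_l, C_next), spatially normalized

rng = np.random.default_rng(0)
f_l = rng.standard_normal((4, 8, 8))       # toy layer-l features
f_next = rng.standard_normal((6, 4, 4))    # toy layer-(l+1) features, half resolution
G = cross_layer_gram(f_l, f_next)
```

The texture loss would then penalize, e.g., the squared difference between the victim's and the target's cross-layer Gram matrices, summed over adjacent layer pairs.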

Figure 3: tAdv strategies. Texture transferred from a random “texture source” (row 1), from a random target-class image (row 2), and from the nearest target-class image (row 3). All examples are misclassified from Beacon to Nautilus. Images in the last row look photorealistic, while those in the first two rows contain more artifacts as the texture weight increases.

Texture Transfer. To create adversarial examples, we need to find images to extract texture from, which we call the “texture source”. A naive strategy is to randomly select an image from the data bank as the texture source. Though this strategy is successful, the resulting perturbations are clearly perceptible. Alternatively, we can randomly select the texture source from the adversarial target class. This produces less perceptible perturbations than the random method, as we extract a texture from the known target class. A better strategy is to select the target-class image that lies closest to the victim image in feature space using nearest neighbors. This strategy is sensible because it ensures the victim image has feature statistics similar to the target image. Consequently, minimizing the gram matrix difference is easier, and our attack generates more natural-looking images (see Fig. 3).
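The nearest-neighbor selection step can be sketched as follows (an illustrative numpy sketch; the feature vectors here are toy stand-ins for the deep features used in the paper):

```python
import numpy as np

def nearest_texture_source(victim_feat, candidate_feats):
    """Return the index of the target-class image whose feature vector lies
    closest (in Euclidean distance) to the victim image's features."""
    dists = np.linalg.norm(candidate_feats - victim_feat, axis=1)
    return int(np.argmin(dists))

# Toy features: one victim and three target-class candidates.
victim = np.array([1.0, 0.0, 0.0])
bank = np.array([[0.0, 1.0, 0.0],    # far
                 [0.9, 0.1, 0.0],    # close
                 [0.0, 0.0, 1.0]])   # far
idx = nearest_texture_source(victim, bank)   # -> 1
```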

Control over Texture. The extent of texture added to the victim image is controlled by the texture weight coefficient. Increasing the texture weight improves the attack success rate at the cost of more noticeable perturbation. Compared with within-layer statistics, the cross-layer statistics we use are not only better at extracting texture but also make the texture weight easier to control.

Structured Perturbations. Our perturbations are large in L_p norm compared with existing attack methods, yet they are high-frequency and imperceptible (see Fig. 1 and Fig. 2). Since we extract features across different layers of VGG, the perturbations we observe in tAdv examples follow a textural pattern. They are more structured and organized than those of other attacks.

4 Colorization Attack (cAdv)

In this section, our goal is to adversarially color an image by leveraging a pretrained colorization model. We hypothesize that it is possible to find a natural colorization that is adversarial for a target model (e.g. a classifier or captioner) by searching in color space. Since a colorization network learns to produce natural colors that conform to object boundaries and respect short-range color consistency, we can use it to introduce long-scale adversarial perturbations with large magnitude that look natural to humans. This differs from common adversarial attacks, which tend to introduce short-scale, high-frequency artifacts that are minimized to be invisible to human observers.

We leverage Zhang et al.’s [33] colorization model for our attack. In their work, they produce natural colorizations on the ImageNet dataset using input hints from the user. The inputs to their network consist of the L channel of the image in CIELAB color space, the sparse colored input hints, and a binary mask indicating the locations of the hints.

Objectives. There are a few ways to leverage the colorization model to achieve adversarial objectives. We experimented with two main methods and achieved varied results.

Network weights. The straightforward method of producing adversarial colors is to modify Zhang et al.’s colorization network directly. To do so, we simply update the network’s weights by minimizing the adversarial loss with respect to the target class.


Hints and mask. We can also vary the input hints and mask. The hints provide the network with ground-truth color patches that guide the colorization, while the mask provides their spatial locations. By jointly varying both hints and mask, we are able to manipulate the output colorization. We update our hints and mask as follows.
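The joint update can be sketched as plain gradient descent on both inputs (an illustrative numpy sketch; `grad_fn`, the quadratic toy loss, and the learning rate are our own stand-ins for backpropagation through the frozen colorization and target networks):

```python
import numpy as np

def attack_hints_and_mask(hints, mask, grad_fn, lr=0.1, steps=50):
    """Jointly descend on hints and mask while the pretrained networks stay
    frozen; grad_fn returns the loss gradients w.r.t. both inputs."""
    for _ in range(steps):
        g_h, g_m = grad_fn(hints, mask)
        hints = hints - lr * g_h
        mask = np.clip(mask - lr * g_m, 0.0, 1.0)   # mask stays a soft indicator
    return hints, mask

# Toy stand-in: a quadratic "adversarial loss" pulling (hints, mask) toward
# minimizers h_star, m_star -- a placeholder for the real target-class loss.
h_star, m_star = np.array([0.3, -0.2]), np.array([1.0, 0.0])
grad_fn = lambda h, m: (2 * (h - h_star), 2 * (m - m_star))
hints0, mask0 = np.zeros(2), np.full(2, 0.5)
hints, mask = attack_hints_and_mask(hints0, mask0, grad_fn)
```

Optimizing both inputs in one loop mirrors the paper's observation that jointly attacking hints and mask converges faster than attacking them separately.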

Figure 4: Class color affinity. cAdv on network weights with no provided hints. The ground-truth class for (a) is pretzel. The new colors are commonly found in images from the target class. This gives us insight into the importance of color as a factor in how classifiers make decisions.

Attack Methods. Attacking network weights allows the network to search the color space for adversarial colors with no constraints. This attack is the easiest to optimize, but the output colors are not realistic, as shown in Fig. 4. The various control strategies outlined above are ineffective here, as the model learns to ignore them and instead generates the adversarial colors directly. However, the colorizations it produces correlate with colors often observed in the target class. This suggests that classifiers associate certain colors with certain classes, which we discuss further in our case study.

Attacking the input hints and mask instead gives natural results, since the pretrained network is not affected by our optimization. Attacking hints and mask separately works, but proves difficult and requires a long optimization time. Jointly attacking both hints and mask is more efficient and produces better results.

Figure 5: Controlling cAdv. We show a comparison of sampling 50 color hints from the k lowest-entropy clusters. All images are attacked to golf-cart. The second and fourth rows visualize our cluster segments, with darker colors representing higher mean entropy and red dots the locations we sample hints from. Sampling hints across more clusters gives less color variety.

Control over colorization. Current attack methods lack control over where the attack occurs, opting to attack all pixels indiscriminately. This lack of control is not important for most attacks, where the perturbation is small, but it matters for cAdv, where inconsiderate large changes can be jarring. To produce realistic colorization, we need to avoid making large color changes at locations where colors are unambiguous (e.g. roads are generally gray) and focus on those where colors are ambiguous (e.g. an umbrella can have many colors). To do so, we need to segment the image and determine which segments should be attacked or preserved.

To segment the image into meaningful areas, we cluster the image’s ground-truth AB color space using K-means. We first smooth the AB channels with a Gaussian filter and then cluster them into k clusters. Next, we determine which clusters’ colors should be preserved. Fortunately, Zhang et al.’s network outputs a per-pixel color distribution for a given image, which we use to calculate the entropy of each pixel. The entropy represents how confident the network is in assigning a color at that location, and the average entropy of each cluster represents how ambiguous its color is. We avoid making large changes to clusters with low entropy while allowing our attack to change clusters with high entropy. One way to enforce this behavior is through hints sampled from the ground truth at locations belonging to low-entropy clusters. We sample hints from the k clusters with the lowest entropy, which we refer to as cAdv_k (e.g. cAdv_2 samples hints from the 2 lowest-entropy clusters).
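The clustering-and-entropy selection can be sketched as follows (a minimal numpy sketch with a toy K-means and synthetic entropies; the real method clusters the smoothed ground-truth AB channels and gets per-pixel entropies from the colorization network's output distribution):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal K-means on (N, 2) AB color points; returns a label per point.
    Deterministic init: initial centers spread across the point list."""
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[idx].copy()
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def low_entropy_clusters(labels, pixel_entropy, k_keep):
    """Rank clusters by mean per-pixel entropy; return the k_keep most
    confident (lowest-entropy) cluster ids -- where hints should be sampled."""
    ids = np.unique(labels)
    mean_e = np.array([pixel_entropy[labels == j].mean() for j in ids])
    return ids[np.argsort(mean_e)[:k_keep]]

# Toy image: two well-separated AB color groups with different ambiguity.
ab = np.vstack([np.full((50, 2), 0.0), np.full((50, 2), 10.0)])
ab = ab + np.random.default_rng(1).normal(0, 0.1, ab.shape)
entropy = np.concatenate([np.full(50, 0.2),   # confident colors (e.g. gray road)
                          np.full(50, 2.0)])  # ambiguous colors (e.g. umbrella)
labels = kmeans(ab, k=2)
keep = low_entropy_clusters(labels, entropy, k_keep=1)
```

Hints would then be sampled from ground-truth colors at pixels whose cluster id is in `keep`, pinning down the unambiguous regions while leaving the ambiguous ones free for the attack.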

Smooth perturbations. Fig. 2 shows interesting properties of the adversarial colors. We observe that cAdv perturbations are locally smooth and relatively low-frequency. This is different from most adversarial attacks, which generate high-frequency, noise-like perturbations. This phenomenon can be explained by the observation that colors are usually smooth within object boundaries. The pretrained colorization model thus produces smooth, low-frequency adversarial colors that often conform to object boundaries.

5 Experimental Results

In this section, we evaluate the two proposed attack methods both quantitatively, via attack success rate under different settings, and qualitatively, via case studies. We conduct our experiments on ImageNet [7] by randomly selecting images from sufficiently different classes for the classification attack, and on randomly chosen images from MSCOCO [24] for the image captioning attack.

5.1 Experimental Setup

5.1.1 tAdv Attack

Texture Features. For texture transfer, we extract the cross-layer statistics of Eq. 2 from layers R11, R21, R31, R41 and R51 of a pretrained VGG19, and add the cross-entropy based adversarial loss of Eq. 1. We optimize our objective using an L-BFGS [25] optimizer.

Number of iterations. tAdv attacks are sensitive and, if not controlled well, transform images into artistic ones. Since we have no constraint on the perturbation norm, it is necessary to decide when to stop the texture transfer procedure. For a successful attack (realistic-looking images), we limit our L-BFGS to a fixed number of small steps and perform two sets of experiments: one with a single iteration (round) of L-BFGS and another with three iterations. In the three-iteration setup, after every iteration we check the confidence of our target class and stop if it is greater than 0.9.
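The early-stopping schedule can be sketched as follows (an illustrative sketch; `step_fn` stands in for one short L-BFGS round and `confidence_fn` for the classifier's target-class confidence, both of which are our own abstractions):

```python
def run_attack(step_fn, confidence_fn, rounds=3, conf_thresh=0.9):
    """Run up to `rounds` short optimization rounds, stopping early once the
    target-class confidence exceeds conf_thresh (before the image turns artistic)."""
    for r in range(1, rounds + 1):
        step_fn()                       # one short optimization round
        if confidence_fn() >= conf_thresh:
            return r                    # early stop: attack is confident enough
    return rounds

# Toy stand-in: confidence rises by 0.5 per round, so we stop after round 2.
state = {"conf": 0.0}

def step():
    state["conf"] += 0.5

rounds_used = run_attack(step, lambda: state["conf"])
```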

Texture and Cross-Entropy Weights. Empirically, we found that keeping the texture weight within a moderate range and the cross-entropy weight much smaller yields successful attacks and less perceptible examples. The additional cross-entropy based adversarial objective helps our optimization. We ensure that most of the gradient flow comes from the texture loss, which is kept sufficiently larger than the adversarial cross-entropy objective. The adversarial objective also helps transform the victim image into an adversarial one without stylizing it.

All our numbers are reported for one iteration with our default texture and cross-entropy weights, unless otherwise stated. For the rest of the paper, we use a fixed annotation to denote which texture strategy is being used.

5.1.2 cAdv Attack

Figure 6: Variation in cAdv with the number of color hints. All images are attacked to Merganser. As the number of hints increases, the output colors become more similar to the ground truth. However, when the number of hints is too high, colorizations can appear unrealistic.

We use the Adam optimizer [20] for cAdv. We update the hints and mask until our image reaches the target class and the confidence change between subsequent iterations no longer exceeds a threshold.

Number of input hints. Network hints constrain our output to have similar colors as the ground truth, avoiding the possibility of unnatural colorization at the cost of color diversity. This trade-off is controlled by the number of hints provided to the network as initialization (see Fig. 6). Generally, providing more hints gives us colors similar to the original image. However, having too many hints may instead make the image less realistic as we are constraining the search space for adversarial colors.

Number of Clusters. The trade-off between color diversity and color realism is controlled by the number of clusters we sample hints from, as shown in Fig. 5. Sampling from more clusters gives realistic colors closer to the ground-truth image at the expense of color diversity.

Diversity vs Realism. One of the most important factors we consider for colorization is this trade-off between color diversity and color realism. Our numbers in Tables 2, 3 and 7 reveal that examples with larger color changes (and consequently more color diversity) are more transferable and more robust against adversarial defenses. However, these big changes are found to be slightly less realistic in our user study (Table 4). Empirically, our experiments showed that our chosen cluster and hint settings work best in terms of color diversity, realism, and attack robustness. For the rest of this paper, we fix the number of hints for all methods.

5.2 Attacking Classifiers

Model Resnet50 Dense121 VGG 19

Attack Success

Table 1: Whitebox targeted attack success rate. Our attacks are highly successful on ResNet50, DenseNet121 and VGG19. For tAdv and cAdv we report results under our default attack settings.

We use a pretrained ResNet50 classifier [15] for all our methods. DenseNet121 [17] and VGG19 [27] are used for our transferability analysis.

5.2.1 tAdv Attack

Attack Success. With a very small weighted adversarial cross-entropy objective combined with our texture loss, we break state-of-the-art image classifiers while our images still look quite realistic to humans. As shown in Table 1, our white-box attacks are highly successful across three different models with the nearest-neighbor texture transfer approach. In Table 2 we show that our attacks are also more transferable to other models. The supplementary material contains results for untargeted attacks and for the other strategies we used to generate adversarial examples.

Importance of Texture in Classification. Textures are crucial descriptors for the task of image classification. Recently, Geirhos et al. [13] showed that ImageNet-trained deep learning classifiers are biased towards texture when making predictions. From our results, it is also evident that even a small, invisible change in the texture field can break current state-of-the-art classifiers.

5.2.2 cAdv Attack

As shown in Table 1, cAdv achieves a high targeted attack success rate by adding realistic colorization perturbations.

Importance of color in classification. From Fig. 4, we can compare how different target classes affect our colorization results when we relax our constraints on colors (cAdv on network weights, with no hints). In many cases, the images contain strong colors related to the target class. In the case of golf-cart, we get a green tint over the entire image. This can push the target classifier to misclassify the image, since green grass is overabundant in benign golf-cart images. Fig. 4b shows our attack on an image of a car with target tench (a type of fish). We observe that the gray road turns blue and that the colors are tinted. We hypothesize that the blue colors and the tint fooled the classifier into thinking the image shows a tench in the sea.

The colorization model is originally trained to produce natural colorizations that conform to object boundaries. By adjusting its parameters, we are able to produce large, abnormal color changes that are impossible with our attack on hints and mask. These colors, however, provide evidence that colors play a stronger role in classification than previously thought. We reserve the exploration of this observation for future work.

While this effect (i.e. strong color correlation with the target class) is less pronounced for our attack on hints and mask, for all methods we observe isoluminant color blobs in our images. Isoluminant colors are characterized by a change in color without a corresponding change in luminance. As most color changes in natural images occur along edges, it is likely that a classifier trained on ImageNet has rarely seen isoluminant colors. This suggests that cAdv might be exploiting isoluminant colors to fool classifiers.

Figure 7: Captioning attack. We attack the second word of each caption to {dog, bird} and show the corresponding change in attention mask of that word.

5.3 Attacking Captioning Model

Our methods are general and can easily be adapted to other learning tasks. As a proof of concept, we test our attacks against image captioning models. Image captioning is the task of generating a sequence of words describing an image. Popular architectures for captioning are Long Short-Term Memory (LSTM) [16] based models [19, 28]. Recently, Aneja et al. proposed a convolutional captioning model for fast and accurate caption generation [2]. This convolutional approach does not suffer from the commonly known problems of vanishing gradients and overly confident predictions in LSTM networks. We therefore chose to attack this state-of-the-art convolutional captioning model.

Attacking captioning models is harder than attacking classifiers when the goal is to change exactly one word in the benign image’s caption. We show that our attacks are successful and produce no visible artifacts even for this challenging task. In Fig. 7, we change the second word of the caption to {dog, bird} while keeping the rest of the caption the same. This is a challenging targeted attack because, in many untargeted attacks, the attacked captions do not make sense. For tAdv, we select the texture source as the nearest neighbor of the victim image among ImageNet images of the adversarial target class. For cAdv, we use all ground-truth hints as initialization to ensure the attack looks realistic, at some cost in color diversity.

Method Model ResNet50 DenseNet121 VGG19
BIM ResNet50
CW ResNet50
Table 2: Transferability of attacks. We show the attack success rate of attacks crafted on one model (rows) and evaluated on other models (columns).
Figure 8: Density plots. Our methods achieve large-norm perturbations without a notable reduction in user preference. Each plot is a density plot of perturbation (L_2 norm, X axis) against user preference for the adversarial image (Y axis). For an ideal system, the density would be a concentrated horizontal line at 0.5. All plots share the same axes. The left three plots show the baseline methods (a–c); note the strong concentration at small-norm perturbations, which users prefer. The right four plots show our methods (d–g); note the strong push into large-norm regions without loss of user preference.

5.4 Defense and Transferability Analysis

We evaluate our attacks and compare them with existing methods against three main defenses: JPEG defense [6], feature squeezing [31], and adversarial training [14]. By leveraging JPEG compression and decompression, adversarial noise may be removed; we tested our methods against JPEG compression at various quality levels. Feature squeezing is a family of simple but surprisingly effective strategies, including reducing color bit depth and spatial smoothing; it has been shown to defend against various adversarial attacks without harming benign accuracy. By coalescing feature vectors, it essentially shrinks the search space available to the adversary, making it harder to find a perturbation. Adversarial training has been shown to be an effective but costly method of defending against adversarial attacks.

Robustness of Our Attacks. In general, our attacks are more robust to the considered defenses and more transferable than three popular attack methods: FGSM [14], BIM [22] and CW [4]. Our untargeted analysis, including FGSM, can be found in the supplementary material. For tAdv, increasing the texture weight does not necessarily help against the defenses even though it increases attack success rate, but increasing texture flow over more subsequent iterations improves the attack’s robustness against defenses. For cAdv, there is a trade-off between more realistic colors (using more hints and sampling from more clusters) and attack robustness. Tables 3 and 2 show that as we progressively use more clusters, our transferability and defense numbers drop. A similar trend is observed with the number of hints (see supplementary material).

Method JPEG75 Feature Squeezing Adv Trained Resnet
4-bit 5-bit 2x2 3x3 11-3-4
54.94 38.92 40.57 27.41
Table 3: Defense. Misclassification rate after passing through defense models. All attacks are done on ResNet50. The highest success rate of each group of methods is in bold.

6 Human Perceptual Study

To quantify how realistic tAdv and cAdv examples are, we conducted a user study on Amazon Mechanical Turk. For each attack, we choose the same 200 adversarial images and their corresponding benign ones. During each trial, one random adversarial-benign pair appears for three seconds, and workers are given five minutes to identify the more realistic one. Each attack has 600 unique pairs of images, and each pair is evaluated by at least 10 unique workers. We limit bias in this process by allowing each unique user at most 5 rounds of trials. In total, 598 unique workers completed at least one round of our user study. For each image, we calculate the user preference score as the number of times it is chosen divided by the number of times it is displayed. A score of 0.5 indicates that users are unable to distinguish the adversarial image from the benign one.

For baselines, we chose BIM and CW. Since these attacks are known to have low L_p norm, we designed an aggressive version of BIM by relaxing its bound to match the norm of our attacks. We settled on two aggressive versions of BIM with larger average perturbation norms.
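The relaxed-bound BIM baseline can be sketched as follows (an illustrative numpy sketch; the toy gradient, step sizes, and bounds are our own choices, not the paper's settings):

```python
import numpy as np

def bim(x, grad_fn, eps, alpha, steps):
    """Basic Iterative Method: repeated signed-gradient steps of size alpha,
    projected back into an L_inf ball of radius eps around the original image.
    Relaxing eps yields the 'aggressive', large-norm variants used as baselines."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv

x = np.array([0.4, 0.6])
grad_fn = lambda z: np.array([1.0, -1.0])          # toy constant loss gradient
tight = bim(x, grad_fn, eps=0.03, alpha=0.01, steps=10)  # standard small bound
loose = bim(x, grad_fn, eps=0.30, alpha=0.05, steps=10)  # relaxed, "aggressive" bound
```

With the tight bound the perturbation saturates at L_inf = 0.03; relaxing `eps` lets the same iteration produce the much larger perturbations whose perceptibility the user study measures.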

Results. The user preferences for all attacks are summarized in Table 4. For tAdv and cAdv, user preference scores average close to 0.5, indicating that workers have a hard time distinguishing the adversarial images from benign ones. The average user preference for BIM drops drastically from 0.497 to 0.332 when we relax its norm bound; the corresponding decreases for tAdv (0.433 to 0.406) and cAdv (0.476 to 0.437) are not significant. In Fig. 8 and Fig. 11, we show density plots of perturbation norm vs. user preference scores.

Method | Preference | Perturbation norm
Table 4: User Study. User preference score and perturbation norm for different attacks. tAdv and cAdv are imperceptible to humans (score close to 0.5) even with very large norm perturbations.

7 Discussion and Conclusion

We propose two novel attacks based on semantic manipulation that generate realistic adversarial examples, departing from previous works that limit the L_p norm of perturbations. Our user study shows that despite having unbounded L_p norm, the adversarial images generated by our methods consistently fool human subjects. Our attacks also shed light on the role of texture and color fields in influencing a deep network's predictions. From the density plot in Fig. 8, the correlation between user preferences and perturbation norm is weak for both tAdv and cAdv, even when the norm is large. These are very large norm perturbations that maintain the property of being a natural image (for example, making the blue umbrella red in Fig. 5). Such perturbations are very hard to find with other methods that control the L_p norm. These results also dispute the community's belief that small-magnitude perturbations are required to generate photorealistic adversarial examples. We hope that by presenting our methods, we encourage future studies on unbounded adversarial attacks, better metrics for measuring perturbations, and more sophisticated defenses.
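The weak correlation claim above can be checked with a plain Pearson coefficient between per-image perturbation norms and user preference scores. A self-contained stdlib sketch (the function name is ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. per-image perturbation norms vs. user preference scores.
    A value near 0 indicates the preference score does not track the norm."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```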

Supplementary Material

Adversarial Cross-Entropy Objective for Captions. Let c = (c_1, ..., c_T) be the target caption, t denote the word position in the caption, θ the parameters of the captioning model, and x the victim image. The adversarial cross-entropy objective is

L_cap(x, c) = - Σ_{t=1}^{T} log p(c_t | c_1, ..., c_{t-1}, x; θ).
For tAdv, we add this cross-entropy to our texture perturbation loss and directly optimize the image to generate the target caption. For cAdv, we give all color hints and optimize them to obtain an adversarially colored image that produces the target caption. We stop our attack once we reach the target caption and the caption no longer changes with subsequent iterations. Note that we do not attack or optimize the network weights; we only optimize the victim image (for tAdv) or the hints and mask (for cAdv) to achieve our target.
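The per-word loss accumulated over the target caption can be sketched as below, assuming the captioning model's softmax outputs at each position are already available; in the attack, this loss would be backpropagated to the victim image (tAdv) or the color hints and mask (cAdv), never to the model weights (gradient step not shown).

```python
import numpy as np

def caption_cross_entropy(word_probs, target_ids):
    """Adversarial cross-entropy for a target caption.

    word_probs: (T, V) array of softmax probabilities from the captioning
    model, one row per word position, conditioned on the target prefix.
    target_ids: length-T sequence of target word indices c_1 .. c_T.
    Returns -sum_t log p(c_t | c_<t, x), minimized w.r.t. the image
    or the color hints, not the model weights.
    """
    return -float(sum(np.log(word_probs[t, w])
                      for t, w in enumerate(target_ids)))
```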

Model | ResNet50 | DenseNet121 | VGG19

Attack Success

Random Target
Nearest Target
Table 5: Whitebox targeted attack success rate. Our attacks are highly successful on different models across all strategies. tAdv results are reported for fixed texture weight, adversarial cross-entropy weight, and iteration settings.
Table 6: tAdv ablation study. Whitebox targeted success rate with the nearest target (texture source). Columns show increasing texture weight and rows show increasing adversarial cross-entropy weight. All attacks are done with one iteration of L-BFGS on ResNet50. We found the selected weights to produce more imperceptible and robust adversarial examples.
Figure 9: Additional qualitative examples for tAdv. Texture is transferred from a random "texture source" (row 1), a random target class (row 2), and the nearest target class (row 3). All examples are misclassified from Merganser to Umbrella. Images in the last row look more natural, while those in the first two rows contain more artifacts as the texture weight increases (from left to right).
Figure 10: Additional qualitative examples for controlling cAdv. We show a comparison of sampling 50 color hints from k clusters with low entropy. All images are attacked to golf-cart. Even-numbered rows visualize our cluster segments, with darker colors representing higher mean entropy and red dots representing the locations we sample hints from. Sampling hints across more clusters gives less color variety.
Targeted Method | Strategy | JPEG75 | Feature Squeezing: 4-bit, 5-bit, 2x2, 3x3 | Adv. Trained ResNet 11-3-4
Random Target
Nearest Target
50 Hints
100 Hints
50 Hints
Nearest Target
50 Hints
Table 7: Attack success (misclassification) rate after passing through defense models. All "Targeted Attacks" comparisons are at the top of the table and "Untargeted Attacks" comparisons at the bottom. All attacks are done on ResNet50. The highest attack success rate in each group of methods is in bold. For cAdv, all numbers are reported for a fixed hint and cluster setting.
Figure 11: Density Plot. Our methods achieve large norm perturbations under both metrics shown (top and bottom) without notable reduction in user preference. Each plot is a density plot with perturbation norm on the X axis and the probability that a user prefers the adversarial image on the Y axis. For an ideal attack, the density would be a concentrated horizontal line at 0.5. All plots share the same axes. The left three plots show baseline methods (Fig. a - Fig. c); note the strong concentration at small norm perturbations, which users prefer. The right four plots show our methods (Fig. d - Fig. g); note the strong push into large norm regions without loss of user preference.
Attack tarantula merganser nautilus hyena beacon golfcart photocopier umbrella pretzel sandbar Mean
Table 8: Class-wise norm and user preference breakdown. Perturbations for a few classes (e.g., photocopier) are quite obvious and easy to detect for all methods, while those for some other classes (e.g., merganser) are much harder to detect.
(a) tAdv for nearest target
(b) cAdv with 50 hints
Figure 12: Randomly sampled tAdv and cAdv adversarial examples. (a) Adversarial examples generated by tAdv using the nearest-target strategy. (b) Adversarial examples generated by cAdv, attacking hints and mask with 50 hints. Note that the diagonal images are ground-truth images and gray pixels indicate no perturbation.
Figure 13: Additional examples for the captioning attack. We attack the second word of each caption to {dog, bird} and show the corresponding change in the attention mask for that word. For tAdv we use the nearest-neighbor selection method, and for cAdv we initialize with all ground-truth color hints.