Semantic Adversarial Perturbations using Learnt Representations

01/29/2020
by   Isaac Dunn, et al.
University of Oxford

Adversarial examples for image classifiers are typically created by searching for a suitable norm-constrained perturbation to the pixels of an image. However, such perturbations represent only a small and rather contrived subset of possible adversarial inputs; robustness to norm-constrained pixel perturbations alone is insufficient. We introduce a novel method for the construction of a rich new class of semantic adversarial examples. Leveraging the hierarchical feature representations learnt by generative models, our procedure makes adversarial but realistic changes at different levels of semantic granularity. Unlike prior work, this is not an ad-hoc algorithm targeting a fixed category of semantic property. For instance, our approach perturbs the pose, location, size, shape, colour and texture of the objects in an image without manual encoding of these concepts. We demonstrate this new attack by creating semantic adversarial examples that fool state-of-the-art classifiers on the MNIST and ImageNet datasets.


1 Introduction

Despite their many successes, deep neural networks have been found to be vulnerable to adversarial examples: inputs designed to deliberately fool a model. In this paper, we introduce a new method that performs adversarial perturbations in the space of feature representations learnt by a generator network. By performing these perturbations at different layers in the generator, we can alter the full range of feature granularities, from macro (e.g. the shape of a mountain) to micro (e.g. the texture of a segment of a dog's ear).

In contrast, most adversarial examples research focuses on perturbations at the level of individual pixels. Although when originally proposed, robustness against such norm-constrained perturbations was intended only as a toy problem from which a solution could be generalised [Gilmer:MTR:2018], adversarial robustness research has remained preoccupied with this rather artificial paradigm. While it has led to some useful insights, there is simply no realistic threat which is well modelled by norm-constrained pixel perturbations. Adversarial pixel perturbations can be alternatively motivated as modelling a worst-case failure of the i.i.d. assumption. But robustness to pixel perturbations does not imply robustness to other kinds of distribution shift [Kang:TRA:2019a], so a stronger, more thoughtful threat model is needed to provide robustness to realistic distribution shifts.

This work belongs to the burgeoning movement considering unrestricted adversarial examples [Brown:UAE:2018]: inputs which fool a target network yet are not constrained to be within a certain distance of a given input. This paradigm is clearly strong enough to include any realistic adversarial or distribution-shift threat to correctness.

Our method is not the first to use semantic changes to images to create unrestricted adversarial examples. However, while prior works present ad-hoc techniques tailored to a certain kind of semantic change (such as colour [Hosseini:SAE:2018]), our method is general: the features that can be perturbed are not hand-crafted but are those that have been learnt by a generative model. This allows for a much richer space of possible manipulations, which will only improve as generative machine learning continues to develop. Note also that our method is not specific to a certain kind of dataset (or even to the domain of images): if a generator network can be trained on a dataset, then our method can be used to find unrestricted adversarial examples for it. Finally, our method is able to leverage any perturbation attack algorithm in its search for semantic adversarial perturbations; advances in that field can therefore also be readily applied.

2 Background

Adversarial Perturbation Attacks

Since the galvanising discovery that imperceptible changes to the pixel values of an image could fool state-of-the-art classifier networks [Szegedy:IPO:2014], many attack procedures have been proposed. Customarily, these entail a white- or black-box greedy search for a perturbation (with constrained magnitude under some norm) which, when applied to the given input, results in the network’s output being incorrect. Similarly many defence techniques have been proposed [Xu:AAA:2019a], the inadequacy of which [Carlini:DDI:2016a, Athalye:OGG:2018, Carlini:AEA:2017] has prompted a smaller literature of more successful approaches to defence which are able to prove the extent of their robustness to perturbation attacks on a test set [Liu:AFV:2019].

Generative Adversarial Networks

GANs are an approach to training a generator neural network to map from a known standard probability distribution to the distribution of the training data. The essential idea is to simultaneously co-train a discriminator network that learns to distinguish between dataset samples and generated examples; this discriminator can therefore be used to provide a training gradient to the generator. Refer to a tutorial [Goodfellow:N2T:2017] for more detail.
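A minimal PyTorch sketch of this co-training loop follows; the tiny fully-connected networks, optimiser settings and loss are illustrative placeholders only, not the architectures used elsewhere in this paper.

```python
import torch
import torch.nn as nn

# Placeholder networks: any generator mapping latent vectors to images and any
# binary discriminator will do for this sketch.
G = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_batch):
    """One simultaneous update of the discriminator and the generator."""
    batch = real_batch.size(0)
    z = torch.randn(batch, 128)

    # Discriminator: learn to distinguish dataset samples from generated samples.
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: use the discriminator's gradient to make its samples look real.
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```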

3 Construction of Adversarial Semantic Perturbations

Figure 1: Illustration of semantic adversarial perturbation attacks at a given depth (panels: semantic perturbation vs. pixel perturbation), to aid understanding of Section 3.1.

Figure 2: Untargeted single-depth semantic adversarial perturbation attacks for MNIST, showing the original, the perturbation and the perturbed image at each depth 0–6. Green pixels in the perturbation represent an increase in value; red represent a decrease.

Figure 3: Untargeted single-depth semantic adversarial perturbation attacks for ImageNet, showing the original, the perturbation and the perturbed image at each depth 0–7. Note how the granularity of the features altered by the perturbation varies with depth.

Suppose we have a trained generator neural network that has learnt to map a standard distribution in a latent space $\mathcal{Z}$ to the distribution of training images in pixel space $\mathcal{X}$. We will view it as a fixed function $G : \mathcal{Z} \to \mathcal{X}$. In this paper, we use the generator network from a GAN, but $G$ could also be the decoder network of a Variational Auto-Encoder, or any other generative model. Since neural networks are composed of layers, we think of our generator function as a composition $G = g_n \circ \cdots \circ g_1$, where each $g_i : A_{i-1} \to A_i$ is a function between latent (activation) spaces, with $A_0 = \mathcal{Z}$ and $A_n = \mathcal{X}$. Note that the elements of each space are tensors.

Bau:GDV:2019 [Bau:GDV:2019] showed that individual tensor elements (neurons) in these latent spaces $A_i$ can represent semantic features of the generated image. For instance, some neurons may represent the presence of clouds. Our key idea is to perform adversarial perturbations in these activation spaces as an image is being generated, rather than in pixel space $\mathcal{X}$. Our main hypothesis is that this results in semantic changes of different kinds (as opposed to meaningless pixel-level adjustments) that successfully fool classifiers without changing the true class of the data.

3.1 Single-Depth Attack

Consider using a trained network $f$ to classify the output of the generator $G$. In effect, this is using the function composition $f \circ G$ to predict the class of the image generated from each input to the generator. Noting that $G$ is itself a composition of functions allows us to take a new perspective: pick some depth index $d$, and define a new pair of neural networks, $G_{\leq d} = g_d \circ \cdots \circ g_1$ and $f_{>d} = f \circ g_n \circ \cdots \circ g_{d+1}$. Note that $f \circ G = f_{>d} \circ G_{\leq d}$, so the computation is identical; we simply have a new view of it. We can consider $G_{\leq d}$ to be a generator of realistic activation tensors $a \in A_d$, while $f_{>d}$ is able to classify these activation tensors according to the label of the images that would be induced in $\mathcal{X}$ if each were passed through the rest of the generator, $G_{>d} = g_n \circ \cdots \circ g_{d+1}$. Figure 1 may help the reader to visualise this.

Suppose we have an adversarial perturbation algorithm $P$ that, given a classifier network $C$ and an input point $x$, searches for a nearby adversarial example $x'$ such that $C(x') \neq C(x)$ and $\mathrm{dist}(x, x') \leq \epsilon$ for some distance metric $\mathrm{dist}$ and bound $\epsilon$. Under these definitions, traditional pixel-space perturbations could be applied to generated images by running algorithm $P$ with $C = f$ and $x = G(z)$ for some appropriately sampled $z \in \mathcal{Z}$, with $\mathrm{dist}$ an $\ell_p$ norm and $\epsilon$ chosen to be suitably small.

Our method is to instead run perturbation algorithm $P$ with input $a = G_{\leq d}(z)$ and classifier $f_{>d}$, as defined above. This finds an adversarially perturbed activation vector $a'$ which is nearby $a$. Consider passing $a'$ through the remainder of the original generator, $G_{>d}$. This image, $G_{>d}(a')$, is classified differently to the unperturbed image $G(z) = G_{>d}(a)$, since the output of $P$ is such that $f_{>d}(a') \neq f_{>d}(a)$. Our claim is that the correct classification of both images ought to be the same, since a sufficiently small perturbation in an activation space leads to only small visible changes in the resulting image. This claim is empirically evaluated in Section 4.

Performing such mid-generator perturbations is possible because there is no fundamental difference between an image classifier $f$ and an activation-tensor classifier $f_{>d}$: they are both simply neural networks composed of a number of layers. Any attack (or indeed defence) algorithm that can be used for pixel-space perturbations against $f$ can therefore be used for semantic feature-space perturbations against $f_{>d}$.
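To make the construction concrete, the following is a minimal PyTorch sketch, assuming the generator's stages are available as an ordered list of modules and using untargeted $\ell_\infty$ projected gradient descent as the perturbation algorithm $P$; names and hyperparameters are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

def split_generator(layers, depth):
    """Split an ordered list of generator stages into G_{<=d} and G_{>d}."""
    head = nn.Sequential(*layers[:depth])   # G_{<=d}: latent seed -> activation a
    tail = nn.Sequential(*layers[depth:])   # G_{>d}: activation a -> image
    return head, tail

def single_depth_pgd(layers, classifier, z, y_true, depth,
                     eps=0.1, step=0.01, iters=40):
    """L_inf PGD in the depth-d activation space; returns the perturbed image."""
    head, tail = split_generator(layers, depth)   # depth 0 perturbs the seed itself
    f_after = nn.Sequential(tail, classifier)     # classifies activation tensors
    a = head(z).detach()                          # clean activation a = G_{<=d}(z)
    delta = torch.zeros_like(a, requires_grad=True)

    for _ in range(iters):
        # Untargeted attack: ascend the cross-entropy of the true labels.
        loss = nn.functional.cross_entropy(f_after(a + delta), y_true)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()     # gradient-sign ascent step
            delta.clamp_(-eps, eps)               # project back into the eps-ball
        delta.grad.zero_()

    return tail(a + delta)                        # perturbed image G_{>d}(a + delta)
```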

3.2 Multiple-Depth Attack

Figure 4: Illustration of semantic adversarial perturbation attacks at multiple depths to aid understanding of Section 3.2.

The previous section described how adversarial perturbations can be made to the activations in a latent space $A_d$. A strictly stronger attack would be to perform such a perturbation to the activations at every layer during the generation of an image. This section describes how such an attack can be framed as another norm-constrained perturbation to an input of a constructed classifier $\hat{f}$. See Figure 4 for an illustration.

Consider, as above, a generator network $G$ and a classifier $f$. We will construct a generator $\hat{G}$ that takes as input the usual initial seed $z$ followed by a series of perturbations to apply at each latent space: $\hat{G}(z, \delta_1, \ldots, \delta_n)$, where each $\delta_i \in A_i$ is added to the activation produced by layer $g_i$. Now consider $\hat{f} = f \circ \hat{G}$, which predicts the classes of images resulting from the inputs to $\hat{G}$. Given an input $z$ to the original generator $G$, we can use any adversarial attack algorithm with classifier $\hat{f}$ and input $(z, 0, \ldots, 0)$ to find an adversarial example constructed by performing adversarial perturbations at every space in the generator. The magnitude of a perturbed input can be found by flattening and concatenating each tensor before applying the distance metric of choice.
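A minimal sketch of the augmented generator $\hat{G}$, again assuming the generator's stages are available as a list of PyTorch modules; the optional per-depth weights anticipate Section 3.2.1 and are illustrative only.

```python
import torch
import torch.nn as nn

class MultiDepthGenerator(nn.Module):
    """G-hat: applies a perturbation delta_i after each generator stage g_i.

    `stages` is the list of generator stages g_1..g_n; `weights` are the
    per-depth scalars described in Section 3.2.1 (all 1.0 if omitted).
    """
    def __init__(self, stages, weights=None):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        self.weights = weights or [1.0] * len(stages)

    def forward(self, z, deltas):
        a = z
        for g, delta, w in zip(self.stages, deltas, self.weights):
            a = g(a) + w * delta      # perturb the activation leaving this depth
        return a                      # the final activation space is pixel space

# An attack then treats (z, deltas) jointly as the input to f-hat = classifier(G-hat(...)),
# starting from all-zero deltas and searching for a norm-bounded change.
```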

3.2.1 Scaling of Perturbation Magnitudes

Finding the magnitude of a traditional adversarial perturbation is straightforward: normalise each pixel from its original range, typically $[0, 255]$, to the range $[0, 1]$; then compute the chosen norm of the pixel-wise difference. However, finding the magnitude of a semantic perturbation is more challenging. The elements of an activation vector do not share a well-defined range; to normalise the vector to $[0, 1]$, we empirically measure the maximum and minimum values seen for each tensor element (neuron) over 256 runs of the generator, then perform a linear scaling for each.
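A sketch of this normalisation, reusing the `head` module ($G_{\leq d}$) from the earlier sketch and assuming a 128-dimensional latent seed:

```python
import torch

@torch.no_grad()
def activation_ranges(head, latent_dim=128, runs=256):
    """Empirical per-neuron min/max of the depth-d activations over `runs` samples."""
    acts = torch.stack([head(torch.randn(1, latent_dim)).squeeze(0)
                        for _ in range(runs)])
    return acts.min(dim=0).values, acts.max(dim=0).values

def normalise(a, lo, hi, eps=1e-8):
    """Linearly rescale each neuron to [0, 1] so perturbation norms are comparable."""
    return (a - lo) / (hi - lo + eps)
```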

There is a further element of nuance: perturbations of the same magnitude may have different visual effect sizes at different depths in the generator. For instance, we found that a perturbation of magnitude 0.01 at some layers could very significantly change the classifier output, yet be almost imperceptible in pixel space. If not compensated for, this could result in poor performance: using the same perturbation magnitude uniformly across all depths would result in the effects from some depths being dominated by the effects from others.

When performing a multiple-depth perturbation attack, we want to scale the perturbation for each depth so that a change of a given magnitude to the input vector results in a similar perceptual change whichever depth it is applied at. This requires a rescaling of each perturbation $\delta_i$. To determine the weights for these rescalings, we find by inspection the greatest-magnitude perturbation at each depth for which the induced perturbed images remain recognisable. We then multiply each perturbation part $\delta_i$ in the input to $\hat{G}$ by a scalar weight chosen so that a unit change in the input vector scales to this largest allowable perturbation size.

This tuning procedure was fairly crude; we suspect that a more careful finetuning of the magnitude weightings could produce better results for our multiple-depth attack.

3.3 Semantic Perturbations of Arbitrary Inputs

The previous sections have described how to find a semantic adversarial perturbation for a generated image. One might ask whether our method can be adapted to find a semantic adversarial perturbation to an arbitrary image. It can, to the extent that the generator is capable of generating the given image: procedures exist which determine the input to a generator which produces an image most similar to the one that is given [Creswell:ITG:2019]. This can then be semantically perturbed in the usual way.
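For illustration, a common projection-style inversion can be sketched as follows; the mean-squared-error loss, optimiser and step count are assumptions, not the specific procedure of [Creswell:ITG:2019]:

```python
import torch

def invert(generator, target_image, latent_dim=128, steps=1000, lr=0.05):
    """Search for a latent seed whose generated image is closest to `target_image`."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(generator(z), target_image)
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()   # can now be perturbed semantically as in Sections 3.1 and 3.2
```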

However, we argue that this question is of limited relevance. From a security perspective, there are no realistic threat models in which an adversary can perturb existing data yet cannot simply present new data of their own construction; and from a machine learning perspective, a generated image is equivalent to a 'real' image, assuming that the generator has converged to the training distribution.

4 Experiments

We report experimental results for two datasets, representing two extremes: MNIST and ImageNet.

MNIST is notoriously small and simple, making its classification very easy; it is the only dataset for which classifiers somewhat robust against norm-bounded adversarial perturbations are currently possible. This makes MNIST the most challenging dataset by far for successfully finding adversarial examples. We target Wong:PDA:2018's network [Wong:PDA:2018] in particular, which is provably robust to $\ell_\infty$ perturbations up to a fixed magnitude, in the sense that such attacks are provably unsuccessful on 94.2% of the MNIST test set. Other state-of-the-art adversarially-robust classifiers give similar results.

Figure 5: Instances of ‘class smudging’: small perturbations to features which do not maintain the correct label of the image.

Having tried three different GANs for MNIST, we find that our method works equally well with each. However, we also find that even small perturbations early in the generator can result in what could be called ‘class smudging’: as shown in Figure 5, even small changes can result in images failing to maintain their correct true label. The simple solution is to use a generator trained to generate one class only (we use class 9 for our experiments). Because such a generator does not learn representations for the other classes, it is much less likely that a small perturbation to an early-layer representation could result in a change of true label. If the computational cost of training such a GAN is a concern, note that a pretrained multi-class generator can still be used, albeit with class smudging leading to a lower success rate.

ImageNet, in contrast to MNIST, is notoriously large and complex. While this makes fooling its classifiers less challenging, some techniques struggle to function at ImageNet scale. We target the ResNeXt-101-32x8d network [Xie:ART:2017], with top-1 accuracy of 79.3% and top-5 accuracy of 94.5%. BigGAN [Brock:LSG:2019] is the state of the art in ImageNet generation; we use the author's 'officially unofficial' implementation and checkpoints [Brock:B:2019].

Full details of our experimental setup can be found in Appendix A.

4.1 Single-Depth Attack

4.1.1 Efficacy of Attacks

We first evaluate how successful our method is when applied at different depths in the generator. We make use of the Foolbox library [Rauber:FAP:2018], which implements many standard adversarial perturbation algorithms. We allow it to select a perturbation magnitude which is sufficiently high for the resulting image to change classification; human judges then label the perturbed images. Each perturbation attack is deemed successful if the perturbed image is misclassified as desired while its true label remains unchanged.
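As a usage sketch (Foolbox 3-style API, whose exact call signatures may differ between versions), the composed classifier from Section 3.1 can be handed to Foolbox as if it were a pixel-space model once the activation space has been normalised to $[0, 1]$ as in Section 3.2.1; `f_after`, `lo` and `hi` below refer to the earlier sketches and are placeholders, and the denormalising wrapper is an assumption.

```python
import torch
import torch.nn as nn
import foolbox as fb

class Denormalise(nn.Module):
    """Map [0, 1]-normalised activations back to their empirical range (Sec. 3.2.1)."""
    def __init__(self, lo, hi):
        super().__init__()
        self.register_buffer("lo", lo)   # lo, hi: tensors from activation_ranges()
        self.register_buffer("hi", hi)

    def forward(self, a01):
        return a01 * (self.hi - self.lo) + self.lo

# f_after classifies depth-d activations (tail of the generator + classifier).
attackable = nn.Sequential(Denormalise(lo, hi), f_after).eval()
fmodel = fb.PyTorchModel(attackable, bounds=(0, 1))

attack = fb.attacks.LinfPGD()
# a01: batch of normalised clean activations; labels: their true classes (LongTensor).
raw, clipped, success = attack(fmodel, a01, labels, epsilons=0.1)
```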

Table 1 shows the success rate of performing feature-level semantic perturbations at each depth in the generator, using standard projected gradient descent (PGD) as the attack algorithm. The success rates are consistently high. Note that the pixel-space (depth 6) attacks are successful despite the network being defended against such perturbations, because we increase the perturbation magnitude until misclassification occurs. Targeted attacks, for which a specific misclassification label must be achieved, are more challenging; this is reflected in the lower success rates for that case.

For ImageNet, we find that all reasonable adversarial perturbation algorithms succeed 100% of the time at all depths with perturbations so small as to leave the correct image label unchanged with certainty. This is as expected, since robust ImageNet classification is so difficult, but serves as proof that our method scales well. All ImageNet results in this paper use the Carlini:TET:2017 attack [Carlini:TET:2017] unless stated otherwise. Attacks typically take between one and ten seconds for single-depth attacks (or between ten and one hundred seconds for multiple-depth attacks), depending on the attack algorithm used.

The targeted case for ImageNet is much harder, however, since the task becomes crafting an adversarial image that is classified into the single target class, rather than just any one of the 999 incorrect classes. As a result, the variance in the success of single-depth attacks is much higher, with a dependency on perturbation algorithm used and the perturbation depth. For instance, the Carlini & Wagner attack has near-zero success rate at depth 3 but near-100% success rate at depth 1. The possible reasons for this discrepancy are discussed later.

Depth 0 1 2 3 4 5 6
Untargeted 84 81 84 82 81 77 90
Targeted 63 61 70 63 73 69 95
Table 1: Single-depth attack success rates (%) on MNIST as a function of the depth at which the perturbations happen. The target network is Wong:PDA:2018’s robust classifier [Wong:PDA:2018].
Depth 0 1 2 3 4 5 6
Targeted 91 87 82 74 73 100 99
Untargeted 97 93 97 90 87 100 100
Table 2: Percentages of the time that human judges are able to identify the single-depth adversarial example for MNIST from a selection of two. 50% would imply perfect realism.

4.1.2 Visual Effect of Attacks

Figures 2 and 3 show the effect of performing semantic adversarial perturbations at varying depths in the generator. Appendix B contains more extensive examples.

As the adversarial perturbations are made closer to the input of the generator, we observe that the changes:

  1. are grouped into increasingly large contiguous regions,

  2. have increasing magnitude in pixel space, and

  3. coincide with increasingly high-level semantic features.

For ImageNet, perturbations performed near the beginning of the generator result in changes to the shape, size, location and orientation of macro-level objects; perturbation in the middle stages induce small adversarial changes in micro-level features; and perturbations in the final stages result in the usual pixel-level adversarial ‘fuzz’.

For MNIST, perturbations close to the beginning of the generator result in small adversarial changes to the shape and orientation of the characters; perturbations in middle stages typically result in changes to the thickness or length of small line segments; and perturbations in the last stages result in pixel-level changes to the edges of the characters.

Visual inspection of the perturbations (see Appendix B for more examples) is strong evidence in support of our main claim that generator networks’ learnt representations can be leveraged to make adversarial changes to different kinds of semantic features.

Understanding Failures

It is informative to examine the cases for which our attacks fail, which happens much more often in the more challenging targeted misclassification scenario. For MNIST, there are two clear failure modes: class smudging (as previously described, see Fig. 5), and transformation into a meaningless blob. We conjecture that this latter failure mode may occur when the activation vectors are perturbed into regions well beyond the usual distribution of activation vectors at that layer, and so the generator is unable to use them to construct a suitable image. For ImageNet, the failures are often intriguing – see the examples in Figure 6. We again speculate that these may be caused by perturbations resulting in out-of-distribution activation vectors which the remainder of the generator has not learned to handle. Of particular interest is that the nature of the distortions is at the same level of granularity as we expect for that depth.

4.1.3 Conspicuousness of Attacks

One motivation of this work is the 'non-suspicious input' threat model [Gilmer:ADO:2019]: we would like to develop models which are robust to any adversarial input which is inconspicuous in the sense that it is not visually identifiable as being adversarial. That many of the visual effects of our semantic perturbations could be realised in the real world makes this threat model more relevant. We therefore evaluate the conspicuousness of our semantic perturbation attacks by measuring the proportion of the time that human judges are able to correctly identify the semantically-perturbed image when presented alongside an unperturbed dataset example. The results are shown in Table 2.

Note that for MNIST the least conspicuous single-depth attacks are those for which the perturbation occurs in the middle layers of the generator. For perturbations near the beginning of the generator, there can be noticeable macro-level artefacts such as unusually twisted shapes or extra marks. For perturbations near the end of the generator, the adversarial ‘fuzz’ becomes noticeable: this is always absent on real examples. It seems that the middle level of granularity – alterations to micro-features – is small enough not to attract attention but large enough as to have been plausibly created by a human pen.

Unsurprisingly, targeted attacks are more conspicuous since the task is harder and so larger perturbations are needed.

4.2 Multiple-Depth Attack

Figure 6: Examples of failed targeted semantic perturbations for ImageNet (depths 3–6), using a single-depth perturbation attack. Note how the granularity of the distortions corresponds to the depth of the perturbation.

4.2.1 Efficacy of Attacks

As described in Section 3.2, we perform adversarial perturbations at every stage of the generator network. On average, the multiple-depth perturbation attack is successful 81% of the time in the untargeted case, and 74% of the time in the targeted case. Ignoring pixel-space perturbations (depth 6), this is comparable to the typical single-depth success rates for the untargeted case and well above average for the targeted case (see Table 1). We conjecture that this improvement in the targeted case arises because the attack is able to perturb features at all levels of granularity, so a smaller perturbation is required at each depth; a large perturbation at any single depth may cause an attack failure by distorting the image so that its true label is not maintained.

Figure 7: Targeted multiple-depth MNIST perturbations for target classes 0–8.
Figure 8: Targeted multiple-depth adversarial examples for ImageNet. The target class is 'lemon'.
Resilience to Single-Depth Defence

We note that our multiple-depth attack cannot be mitigated by any defence against a single-depth attack alone. Suppose a multiple-depth perturbation is used to attack a network which is robust to pixel-level adversarial perturbations: the attack will still succeed, since it will simply spend its perturbation 'budget' at earlier layers instead, producing coarse-grained semantic changes that have a large magnitude in pixel space. We verify this experimentally by increasing the relative weighting of the pixel-space component of a multiple-depth attack from 0 to 1 against Wong:PDA:2018's classifier, which is robust to pixel perturbations [Wong:PDA:2018]. That the success rate of the attack is a monotonically increasing function of this weighting demonstrates that defending against a multiple-depth attack requires defending against the conjunction of the relevant single-depth attacks.

4.2.2 Visual Effect of Multiple-Depth Attacks

Figure 8 gives examples of targeted multiple-depth attacks; more can be found in Appendix B. The visual effect is, unsurprisingly, a combination of the effects seen at each single depth: the multiple-depth attack makes changes at every level of granularity, from the location, shape, orientation and colour of high-level objects, through adjustments to texture and other micro-level features, to pixel-level perturbations.

Since the attack now has a larger range of kinds of adversarial change it can make to the image, the change at each level of granularity needs only to be much smaller. This is analogous to the decrease in pixel perturbation magnitude required as the number of perturbed pixels increases. As a result, the failure cases shown in Figure 6 do not occur, and ‘class smudging’ on MNIST also decreases because the perturbation is less concentrated at one level of granularity.

4.2.3 Conspicuousness of Attacks

For the multiple-depth attack, humans are able to correctly identify the adversarial examples 94% and 96% of the time in the untargeted and targeted cases respectively. This is considerably better than for pixel-perturbation attacks, which can be identified almost 100% of the time, but not as good as the other single-depth attacks.

We conjecture that this is due to the procedure we used to determine the relative magnitude weightings for the perturbations at different depths in the multiple-depth attack. In particular, our procedure was not focused on indistinguishability, but rather on distortion to the point of unrecognisability. These are quite distinct goals: pixel-space perturbations, for instance, are very unlikely to distort an image to the extent that it is no longer recognisable, but the presence of visible pixel perturbations immediately rules out an image from being from the test dataset.

4.3 Choice of Perturbation Algorithm

As well as being able to take any trained generator network as a source of semantic feature representations, our procedure is able to use any standard pixel-perturbation attack algorithm to search these semantic feature spaces. The choice of attack algorithm does matter (see Appendix B for relevant results). For instance, the Fast Gradient Sign Method (FGSM) [Goodfellow:EAH:2015] is usually unable to find suitable perturbations for ImageNet examples, and has a low success rate for MNIST examples. Conversely, the Carlini:TET:2017 attack [Carlini:TET:2017] has a high rate of success, as does projected gradient descent (PGD). Figure 9 compares the visual effects of these two algorithms when used to find multiple-depth perturbations for ImageNet with source class 'ambulance'.

Figure 9: Comparison of multiple-depth perturbations found by two adversarial attack algorithms: (a) the Carlini:TET:2017 attack and (b) PGD.

In short, the Carlini:TET:2017 attack makes larger changes to high-level features than PGD does. We conjecture that this is because the norm it minimises encourages sparsity: if fewer features are to be perturbed, then higher-level features offer the biggest change per feature.

5 Related Work

Jain:GSA:2019a [Jain:GSA:2019a] and Liu:BPN:2019a [Liu:BPN:2019a] each propose a method performing norm-constrained perturbations to latent variables encoding semantic image properties such as lighting, weather and foliage. Unfortunately, these semantic representations are hard-coded rather than learnt: the approaches require a hand-crafted invertible differentiable renderer to be built, capable of rendering any possible scene for the dataset of interest. This is likely to be prohibitively expensive for all but simple domains. In contrast, our approach automatically leverages learnt semantic representations, requiring only a dataset (even labels are optional) from which a GAN can be trained. Furthermore, while we are able to demonstrate high success rates against state-of-the-art classifiers on MNIST (the most challenging dataset to attack), renderer-based attacks have not been evaluated on adversarially-robust networks, and the success rate of Jain:GSA:2019a's [Jain:GSA:2019a] method is rather disappointing even against non-robust networks (at best a 30 percentage point reduction in precision).

Song:CUA:2018 [Song:CUA:2018] create unrestricted adversarial examples by searching for an adversarial input to the generator, somewhat analogously to our single-depth attack at depth 0. Our work can be viewed as a generalisation of this in three dimensions simultaneously: rather than using an ad-hoc search procedure, we are able to leverage all existing adversarial perturbation algorithms; rather than attacking only in the input space $\mathcal{Z}$, we demonstrate attacks for all latent spaces individually and jointly, thereby creating an attack space incorporating a wealth of semantic transformations in addition to the standard pixel perturbations; and rather than requiring an auxiliary-classifier GAN, any generator that can be decomposed is sufficient for our method. Since perturbations at depth 0 are relatively easy for a human to identify as suspicious, our method offers a stronger attack under the 'non-suspicious input' threat model [Gilmer:ADO:2019].

In previous work [Dunn:AGO:2019a], we train a generator to generate unrestricted adversarial examples. This approach is orthogonal to the present paper: while training a generator is a search for generator weights that represent adversarial feature instances, semantic perturbations search the existing generator feature space for adversarial examples. The flexibility of the former approach to learn new representations allows the attack to be adapted more readily to mitigate defences, while the present use of representations designed solely for realism results in more realistic adversarial examples. If a pretrained GAN can be used, the present approach also requires no further GAN training at all.

A number of other methods for semantic adversarial example creation consist of ad-hoc adversarially-tuned manipulations such as colouring [Hosseini:SAE:2018], rotations and translations [Engstrom:ARA:2017], and corruptions such as blurring, Gaussian noise and fogging [Hendrycks:BNN:2019]. Unlike these approaches, our work is not tailored to one specific method of adversarially-tuning one specific image property.

Qiu:SGA:2019 [Qiu:SGA:2019] utilises the learnt domain-to-domain mappings of StarGAN [Choi:SUG:2018] to find adversarial linear interpolations between modified and unmodified faces. A key limitation of this approach is the need for a dataset with labels for each of the semantic domains to be interpolated over, on which the StarGAN must then be trained; our approach, in contrast, uses ordinary generative GANs to learn the semantic features of interest.

Bhattad:BBI:2019 [Bhattad:BBI:2019] also repurposes existing techniques, introducing a texture-transfer attack which modifies style transfer [Gatys:IST:2016] to make it adversarial yet minimally perceptible, and a colouring attack which adversarially adjusts the parameters of a colourisation network [Zhang:CIC:2016]. Again, although these attacks show promise, they are ad-hoc, and untested against state-of-the-art robust classifiers (which currently exist for MNIST only, a dataset to which these techniques do not apply).

6 Conclusion

We introduce the first method that allows the feature representations learnt by a pretrained generator network to be adversarially manipulated to create semantic adversarial examples. Significantly, this automatically includes a rich space of label-preserving transformations: there is no need for ad-hoc procedures targeting particular feature types such as colour, texture or object position. Experiments on ImageNet have demonstrated that the depth at which the adversarial perturbations occur directly affects the granularity of the features which are altered in the resulting image; performing perturbations at all layers simultaneously therefore allows adversarial changes to be made to features at every scale. Experiments on MNIST have shown that this approach is able to fool state-of-the-art robust classifiers with high success rates; the multiple-depth procedure is more successful for targeted attacks since its perturbations to each feature can remain relatively small.

Appendix A Details of Experimental Setup

a.1 MNIST: Convolutional GAN

For MNIST, we tried a range of generators and found that they all worked roughly as well as one another. For the experiments, we use a simple convolutional generator, inspired by the Deep Convolutional GAN [Radford:URL:2016]. Details are shown in Table 3. Inputs to the generator are drawn from a 128-dimensional standard Gaussian. The sigmoid output transformation ensures that pixels are in the range $[0, 1]$, as expected by the classifier.

We perform adversarial perturbations before ReLU layers, rather than after, to prevent ReLU output values from being perturbed to become negative, which would not have been encountered during training and so may not result in plausible images being generated.

Note that perturbing before and after the sigmoid transformation has different effects because perturbations to values not close to 0 are diminished in magnitude if passed through the sigmoid function.
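The following PyTorch sketch shows how the depth indices of Table 3 might map onto code; kernel sizes, strides, padding, the reshape after the first fully-connected layer, the LeakyReLU slope and the dropout rate are all assumptions, since the table does not specify them, so this is an illustration of the structure rather than our exact implementation.

```python
import torch
import torch.nn as nn

class MNISTGenerator(nn.Module):
    """Generator with one module per depth interval of Table 3 (assumed hyperparameters)."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            # Depth 0 -> 1: fully-connected to 64 units, reshaped to a small feature map.
            nn.Sequential(nn.Linear(128, 64), nn.Unflatten(1, (4, 4, 4))),
            # Depth 1 -> 2
            nn.Sequential(nn.ReLU(),
                          nn.ConvTranspose2d(4, 32, kernel_size=4, stride=2, padding=1)),
            # Depth 2 -> 3
            nn.Sequential(nn.BatchNorm2d(32), nn.LeakyReLU(0.2), nn.Dropout(0.3),
                          nn.ConvTranspose2d(32, 8, kernel_size=4, stride=2, padding=1)),
            # Depth 3 -> 4
            nn.Sequential(nn.BatchNorm2d(8), nn.LeakyReLU(0.2), nn.Dropout(0.3),
                          nn.ConvTranspose2d(8, 4, kernel_size=4, stride=2, padding=1)),
            # Depth 4 -> 5: flatten and map to 784 pixels.
            nn.Sequential(nn.BatchNorm2d(4), nn.LeakyReLU(0.2), nn.Dropout(0.3),
                          nn.Flatten(), nn.LazyLinear(784)),
            # Depth 5 -> 6
            nn.Sequential(nn.Sigmoid()),
        ])

    def forward(self, z, perturb_at=None, delta=None):
        a = z
        for depth, stage in enumerate(self.stages):
            # Adding delta before stage d perturbs the depth-d activation, i.e.
            # before the (Leaky)ReLU that begins the stage, as discussed above.
            if perturb_at == depth and delta is not None:
                a = a + delta
            a = stage(a)
        # Depth 6 corresponds to ordinary pixel-space perturbation of this output.
        return a.view(-1, 1, 28, 28)
```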

Depth 0
    Fully-Connected (64 units)
Depth 1
    ReLU
    Transposed Convolution (32 feature maps)
Depth 2
    Batch Normalisation
    Leaky ReLU
    Dropout
    Transposed Convolution (8 feature maps)
Depth 3
    Batch Normalisation
    Leaky ReLU
    Dropout
    Transposed Convolution (4 feature maps)
Depth 4
    Batch Normalisation
    Leaky ReLU
    Dropout
    Fully-Connected (784 units)
Depth 5
    Sigmoid
Depth 6
Table 3: MNIST generator architecture. The depth markers indicate the stages at which adversarial perturbations are performed.
Depth 0
    Dense
Depth 1
    ResBlock
Depth 2
    ResBlock
Depth 3
    ResBlock
Depth 4
    ResBlock
Depth 5
    ResBlock
Depth 6
    BatchNorm
    ReLU
    Convolution
    Sigmoid
Depth 7
Table 4: BigGAN generator architecture. The depth markers indicate the stages at which adversarial perturbations are performed.

a.2 ImageNet: BigGAN

We use the BigGAN [Brock:LSG:2019] generator; Table 4 details the stages at which adversarial perturbations are performed. The randomly-sampled component of the input is 120 dimensions wide. This is decomposed into six chunks of 20 dimensions each, which are fed (along with the desired output label) into the different blocks of the architecture. Please refer to Appendix B of the BigGAN paper for detailed descriptions, in particular of the ResBlocks which comprise the majority of the network. Note that we simply perform perturbations after each ResBlock; if desired, perturbations could also be performed within each block.
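Because we rely on a third-party BigGAN implementation, one non-intrusive way to inject per-depth perturbations is with PyTorch forward hooks on the generator's blocks. The sketch below assumes the relevant ResBlock modules can be collected into a list; it does not use the repository's actual attribute names.

```python
import torch

def attach_perturbations(blocks, deltas):
    """Add deltas[i] to the output of blocks[i] without modifying the generator code.

    `blocks` is the list of generator sub-modules after which we want to perturb
    (e.g. the ResBlocks); `deltas` holds one tensor per block, broadcastable to
    that block's output. Returns the hook handles so they can be removed later.
    """
    handles = []
    for block, delta in zip(blocks, deltas):
        def hook(module, inputs, output, delta=delta):
            return output + delta   # returning a value replaces the block's output
        handles.append(block.register_forward_hook(hook))
    return handles

# After the attack, detach the hooks:
#   for h in handles: h.remove()
```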

Appendix B Examples of Semantic Perturbations

This appendix contains examples of our semantic adversarial perturbations. In particular, we include examples for MNIST and for ImageNet, targeted and untargeted cases, single-depth and multiple-depth attacks, and for different perturbation algorithms.

Figure 10: Untargeted single-depth semantic adversarial perturbations for MNIST (two examples, depths 0–6, showing the original, the perturbation and the perturbed image). Green pixels in the perturbation represent an increase in value; red represent a decrease. Observe that the closer the perturbation is performed to pixel space (depth 6), the more scattered, smaller in magnitude, and less macro-level the changes are.
Figure 11: Untargeted single-depth semantic adversarial perturbations for ImageNet (two examples, depths 0–7). Note how the granularity of the features altered by the perturbation varies with depth.
Figure 12: Targeted single-depth semantic adversarial perturbations for MNIST (depths 0–6). The target class is 0 for the top row and 1 for the bottom row. Note that the success rate is much lower for the targeted attack: there are both more examples which have changed true label (e.g. some in this figure may have changed to 0) and more examples which have become nonsense. In this figure, there is also one instance (depth 1, top row, bottom-left instance) of Foolbox being unable to find a suitable perturbation at all.
Figure 13: Targeted single-depth semantic adversarial perturbations for ImageNet (depths 0–7; target class 951: lemon). Note that perturbations near the beginning of the generator still tend to have macro-level effects such as colour changes or the introduction of more bush (bottom left). Note also that the targeted case has a lower success rate for ImageNet too: besides some obvious distortions, there are other examples that may have changed class (for instance, the disappearing boat).
Figure 14: Targeted multiple-depth adversarial examples for ImageNet. The target class is 'lemon'. Note the much higher success rate than for the targeted single-depth attacks. Note also the variety of semantic changes made: pose, colour, camera position, zoom, object location, object size, texture, and shape.
Figure 15: Targeted multiple-depth adversarial perturbations for MNIST (target classes 0–8). Note that some target classes appear to be more difficult than others: the perturbations when targeting a 0, 1 or 6 are larger than when targeting a 4 or 7. We can see that each perturbation is, as expected, affecting features at all levels of granularity, including digit shape and orientation, stroke thickness, the presence or absence of certain small strokes, and pixel-level noise.
(a)
Depth     Attack     Success
0         FGSM       75%
1         FGSM       59%
2         FGSM       79%
3         FGSM       72%
4         FGSM       78%
5         FGSM       49%
6         FGSM       96%
Multiple  FGSM       81%
0         BIM        87%
1         BIM        86%
2         BIM        93%
3         BIM        92%
4         BIM        93%
5         BIM        94%
6         BIM        97%
Multiple  BIM        92%
0         DeepFool   84%
1         DeepFool   80%
2         DeepFool   84%
3         DeepFool   87%
4         DeepFool   86%
5         DeepFool   87%
6         DeepFool   56%
Multiple  DeepFool   84%

(b)
Depth     Attack       Success
0         BIM          86%
1         BIM          85%
2         BIM          91%
3         BIM          94%
4         BIM          93%
5         BIM          88%
6         BIM          90%
Multiple  BIM          84%
0         C&W          85%
1         C&W          82%
2         C&W          91%
3         C&W          90%
4         C&W          91%
5         C&W          98%
6         C&W          95%
Multiple  C&W          90%
0         NewtonFool   67%
1         NewtonFool   76%
2         NewtonFool   89%
3         NewtonFool   75%
4         NewtonFool   89%
5         NewtonFool   85%
6         NewtonFool   69%
Multiple  NewtonFool   82%

Figure 16: Success rates for single- and multiple-depth untargeted attacks on MNIST for a variety of perturbation algorithms.
Figure 17: Targeted multiple-depth adversarial examples for ImageNet, with target class 'lemon'. The perturbations in the left column have been found using Projected Gradient Descent; those in the right have been found using the Carlini & Wagner attack. Note that the Carlini & Wagner attack results in larger perturbations to fewer features; this is likely because the norm it minimises encourages sparsity, in contrast to the norm used by PGD.