Convolutional neural networks (CNNs) are known to be vulnerable to adversarial examples: inputs that exhibit marginal visual difference from the distribution of correctly classified images but cause dramatically different classification decisions [Szegedy2013, Goodfellow2014]. These ‘adversarial images’ may be crafted to attack an a priori trained network via backpropagation of gradients, inducing pixel-level perturbations of a target image either globally [Szegedy2013, Goodfellow2014, DeepFool, CarliniWagner] or within a restricted local region (‘adversarial stickers’ [AdversarialPatch, Eykholt2017]) in order to nudge the classification outcome over the decision boundary. The enthusiastic adoption of neural networks, e.g. for autonomous vehicles and robotics, opens a new facet of cyber-security, motivated on one hand by the need to train networks resilient to such attacks, and on the other by the need to evaluate resilience by developing new attacks.
The contribution of this paper is a new algorithm for synthesizing adversarial image examples that exhibit equal or better robustness to image deformation than contemporary methods, whilst maintaining low perceptibility to a human observer. Currently, adversarial image attacks are limited in their scope and real-world applicability by either their fragility or the perceptibility of the perturbations introduced to the target image. The perturbations typically induced to generate an adversarial image manifest as high-frequency noise (see Fig. 1). Their capacity to induce misclassification is therefore greatly attenuated by simple image deformations such as re-sampling under affine transformation, which are inherent in any eventual manifestation of an attack ‘in the wild’. Furthermore, high-frequency artifacts are readily detectable by the human visual system, revealing the presence of an attack.
In this work we explore an alternative method for generating covert adversarial image examples, leveraging the recently proposed ‘Deep Image Prior’ (DIP) for image reconstruction [Ulyanov_CVPR_2018]. The surprising result of DIP is that the statistics of natural images can be encoded through a CNN architecture, i.e. by the CNN structure rather than the actual weights of the filters. The translation equivariance of CNNs enables DIP to exploit the internal recurrence of visual texture in images [Ledig_CVPR_2017], in a similar way to classical non-parametric patch-based approaches to texture synthesis [Glasner_2009_ICCV]. Under DIP an image is reconstructed by training a deep encoder-decoder CNN from scratch (random weights) to overfit a reconstruction loss function for that single image. Our core technical contribution is to frame adversarial image synthesis as a reconstruction problem, leveraging the DIP to reconstruct an image from a randomly initialized (white noise) image under a dual reconstruction and adversarial loss. Reconstructing the image under this constraint affords greater flexibility for perturbation across the whole image, in contrast to existing methods that rely upon backpropagation to update pixel values from a local minimum (the initial target image). The resulting perturbations are regularized by the DIP to resemble the appearance of natural images, further mitigating the sporadic high-frequency noise patterns characteristic of adversarial images. Our reconstruction framework can be used to induce perturbations within the whole image that are hard to perceive yet exhibit greater robustness to affine transformation than state-of-the-art adversarial image algorithms. Further, we show that our method can also be adapted to restrict perturbations to a local region of interest, resulting in adversarial stickers that are competitive with the state of the art at inducing misclassification.
2 Related Work
Adversarial attacks on deep networks for visual recognition have received significant attention in recent years, prompted by the step change in performance delivered by CNNs across diverse object classes [DBLP:journals/cacm/KrizhevskySH17, szegedy2015going]. Szegedy et al. pioneered adversarial attacks for visual classifiers [Szegedy2013], demonstrating that minor perturbations of pixel values can induce significant CNN misclassification rates despite little human-perceptible visual difference. Goodfellow et al. demonstrated linearity of this effect in input space, introducing the fast gradient sign method (FGSM) [Goodfellow2014] to quickly compute adversarial perturbations via backpropagation without the need to solve costly optimizations. An iterative form of this method for more robust attacks was later presented [BasicIterative]. DeepFool [DeepFool] explicitly optimises to minimise perceptibility and maximise robustness in order to form adversarial examples. Carlini and Wagner optimise for similar goals using alternative norms [CarliniWagner]. All of the above induce high-frequency noise within the whole image with limited human perceptibility. Whilst these attacks are covert, they are also fragile to image resampling, e.g. due to image transformation or printing. Adversarial patches take a complementary, overt approach via synthesis of vivid ‘stickers’ [AdversarialPatch] that occupy only a small region yet induce misclassification [AdversarialPatch, Eykholt2017] or misdetection [chen2018robust, Thys2019]. Such approaches are reminiscent of attacks against classical facial detection algorithms [cvdazzle, Sharif2016] in that they can be physically manifested for real-world deployment. Nevertheless, like whole-image methods, such stickers are sensitive to affine transformation, limiting their use in practical attacks [NoWorry] for the time being.
Like prior work, we take an optimization approach but uniquely leverage an encoder-decoder network and Deep Image Prior (DIP) [Ulyanov_CVPR_2018] to reconstruct the attack image from scratch, incorporating an adversarial constraint. The latitude afforded by this reconstruction mitigates the introduction of sporadic noise, which both improves robustness and limits the perceptibility of attacks. Whilst our focus is on whole-image covert attack, our method can also be adapted to synthesise overt attacks via adversarial patches.
Let be a CNN classifier taking image pixel values and returning an object class label, and let be its associated loss function. For an image we aim to find a small perturbation such that . We say that is an adversarial image. If we pick a class and aim to get then we say the adversarial example is targeted, and if not we say it is untargeted. We let
be the output probabilities of the classifier, and we also letis the component of and is the iteration of .
We motivate our method by first briefly recapping fast gradient methods which create minor image perturbations via a short () linear jump in the input domain determined via backpropagation through :
We try to choose as small as possible such that the attack is still effective. A popular and fast approximation (FGSM) due to Goodfellow et al. [Goodfellow2014] is to set . A piece-wise linear extension of FGSM [BasicIterative] is to perform FGSM iteratively in the hope of obtaining a successful adversarial example without having to make large. More concretely, we set , and then for we have
where the for the number of iterations. The momentum iterative method [Momentum] adds a momentum term to this basic iterative method in order to escape from local minima. As before we set but also . Now for we iterate using:
where is a decay factor that degenerates to the basic iterative method as increases.
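For reference, the three gradient-based updates recapped above are commonly written as follows. The notation is an assumption made for this sketch and may differ from the symbols used elsewhere in the paper: $x$ is the input image, $y$ its label, $J$ the classifier loss, $\epsilon$ the perturbation budget, $N$ the number of iterations, and $\mu$ the momentum decay factor in the form given in [Momentum]. The signs are for an untargeted attack, which ascends the loss of the true label.

```latex
% FGSM (single step)
x' = x + \epsilon \,\operatorname{sign}\!\big(\nabla_x J(x, y)\big)

% Basic iterative method, with step size \epsilon / N
x_0 = x, \qquad
x_{n+1} = x_n + \frac{\epsilon}{N}\,\operatorname{sign}\!\big(\nabla_x J(x_n, y)\big),
\qquad n = 0, \dots, N-1

% Momentum iterative method
g_0 = 0, \qquad
g_{n+1} = \mu\, g_n + \frac{\nabla_x J(x_n, y)}{\lVert \nabla_x J(x_n, y) \rVert_1}, \qquad
x_{n+1} = x_n + \frac{\epsilon}{N}\,\operatorname{sign}\!\big(g_{n+1}\big)
```

In this common form, setting $\mu = 0$ recovers the basic iterative update, and $N = 1$ with $\mu = 0$ recovers FGSM.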
3.1 Adversarial Images via DIP Reconstruction
[Table columns: Baseline | Ours Less Visible | No Difference | Ours More Visible]
The disadvantage of FGSM and derivative methods is their dependence upon backpropagation to perturb the image from a local minimum (). Despite mitigation strategies (e.g. momentum) all tend to converge to an composed of high frequency speckle noise (cf. Fig. 3), which presents a trade-off between visibility (high ) and fragility (low ). We therefore propose an alternative to these local methods, leveraging the recently proposed Deep Image Prior (DIP) [Ulyanov_CVPR_2018] to synthesise via global image reconstruction, such that .
The core idea is to learn a generative CNN (where are the network parameters) to reconstruct from a noise map of identical height and width to , with pixels drawn from a uniform random distribution. We use a symmetric encoder-decoder network architecture with skip connections (Fig. 2) for , comprising five (up-)convolutional layers with filter size and 128 en/decoder channels per layer, and skip connections between all layer pairings. Ulyanov et al. originally proposed a reconstruction loss to learn
for image restoration applications such as denoising, in-painting and super-resolution:
where is the norm. In practice the generator provides implicit regularisation via the structure of the CNN in lieu of an explicit regularisation in the loss (e.g. total variation loss [Vedaldi2015]). Whilst is under-constrained, its output converges faster to natural images than to unnatural ones, thus DIP provides a useful reconstruction prior.
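As a point of reference, the DIP reconstruction objective for a single image is commonly written as below. The symbols are assumptions made for this sketch: $g_\theta$ is the generator with parameters $\theta$, $z$ is the fixed noise map, $x$ is the image to reconstruct, and the squared $\ell_2$ norm is assumed as the data term.

```latex
\theta^{*} = \arg\min_{\theta} \; \big\lVert g_{\theta}(z) - x \big\rVert_2^2 ,
\qquad
\hat{x} = g_{\theta^{*}}(z)
```

Only $\theta$ is optimised; $z$ stays fixed, so the regularisation comes entirely from the structure of $g_\theta$ and from stopping the optimisation before it fits unnatural detail.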
We propose an alternative, dual term loss function to learn to synthesise that both approximates and misclassifies :
where in our work balances the order of magnitude of the adversarial (first) and reconstruction (second) terms, and indicates network weights at iteration of training. We train to overfit to via the Adam optimizer yielding the adversarial perturbation . We explore popular CNN configurations for to attack in subsec. 4.2.
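The dual-term optimisation described above can be illustrated with a deliberately tiny stand-in. Everything in this sketch is an assumption made for illustration: a linear map `weights` applied to a fixed noise vector `z` replaces the encoder-decoder CNN, a two-class linear softmax classifier `clf` replaces the attacked network, the balancing weight `alpha` is placed on the reconstruction term, and plain gradient descent replaces Adam. It shows only how one set of generator parameters is fitted to an adversarial term plus a reconstruction term.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def predict(clf, image):
    """Index of the largest logit of the toy linear classifier."""
    logits = matvec(clf, image)
    return logits.index(max(logits))

def dual_loss_and_grad(out, x, clf, target, alpha):
    """Cross-entropy towards `target` plus alpha * squared L2 distance to x.

    Returns the loss value and its gradient with respect to `out`.
    """
    probs = softmax(matvec(clf, out))
    adversarial = -math.log(probs[target])
    reconstruction = sum((o - xi) ** 2 for o, xi in zip(out, x))
    # d(cross-entropy)/d(logits) = probs - onehot(target)
    dlogits = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]
    grad = []
    for i in range(len(out)):
        g_adv = sum(clf[c][i] * dlogits[c] for c in range(len(clf)))
        grad.append(g_adv + 2.0 * alpha * (out[i] - x[i]))
    return adversarial + alpha * reconstruction, grad

def synthesise(x, clf, target, z, alpha=1.0, lr=0.05, steps=500):
    """Fit the linear 'generator' out = weights @ z by gradient descent."""
    weights = [[0.0] * len(z) for _ in x]
    losses = []
    for _ in range(steps):
        out = matvec(weights, z)
        loss, grad = dual_loss_and_grad(out, x, clf, target, alpha)
        losses.append(loss)
        # Chain rule: d(loss)/d(weights[i][j]) = grad[i] * z[j]
        for i, row in enumerate(weights):
            for j in range(len(z)):
                row[j] -= lr * grad[i] * z[j]
    return matvec(weights, z), losses

# Hypothetical 4-pixel "image", 2-class classifier and fixed noise vector.
clf = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]
x = [0.55, 0.45, 0.5, 0.5]
z = [0.3, 0.9, 0.5]
x_adv, losses = synthesise(x, clf, target=1, z=z)
print(predict(clf, x), predict(clf, x_adv))
```

Raising `alpha` in this toy keeps the output closer to `x` at the cost of a weaker push towards the target class, which is the trade-off the balancing weight controls.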
3.1.1 Local patch based attack
Our method can optionally be adapted to an overt attack, in which an adversarial patch is synthesised and composited into a region of an image in order to induce misclassification. We define a region of interest (ROI) via binary mask . In this case we seek perturbation to create a composite image :
where is element-wise multiplication. We optimize as before, but without any reconstruction constraint in the loss:
A single adversarial patch capable of attacking multiple images (similar to the adversarial stickers of Brown et al. [AdversarialPatch]) can be created by sampling in mini-batches from a set of training images (versus learning over a single image, as in the whole image case). We evaluate the performance of such stickers in subsec. 4.3.
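The compositing step for a patch attack can be sketched as follows. The blend used here, (1 - m) * x + m * p with a binary mask m, is an assumed form of the composite written for illustration; the helper names `place_mask` and `composite` are hypothetical, and images are plain nested lists of grey values.

```python
def place_mask(height, width, top, left, size):
    """Binary mask that is 1 inside a square region of interest, 0 elsewhere."""
    return [[1 if top <= r < top + size and left <= c < left + size else 0
             for c in range(width)]
            for r in range(height)]

def composite(image, patch, mask):
    """Element-wise blend: keep the image outside the mask, the patch inside."""
    return [[(1 - m) * x + m * p for x, p, m in zip(row_x, row_p, row_m)]
            for row_x, row_p, row_m in zip(image, patch, mask)]

# Hypothetical 4x4 black image and an all-white patch layer of the same size.
image = [[0.0] * 4 for _ in range(4)]
patch = [[1.0] * 4 for _ in range(4)]
mask = place_mask(4, 4, top=1, left=2, size=2)
result = composite(image, patch, mask)
```

Because the mask zeroes the patch layer outside the region of interest, gradients of the classifier loss only reach patch pixels inside that region, so the rest of the image is left untouched.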
4 Evaluation and Discussion
We evaluate the performance of our method in terms of its efficacy and robustness against affine image transformation, and the perceptibility of artifacts introduced into the covert adversarial image examples created.
4.1 Experimental Setup
We evaluate our approach against six baselines (subsec. 4.1.1) over two popular architectures, VGG-19 and GoogLeNet Inception v3, both trained using ImageNet [imagenet]. Evaluation is on a test set of 1000 images (hereafter, ImageNet-TS1K) comprising one random image sampled from each category in the ImageNet test partition. We report the success rate, defined as the fraction of target images for which the method induces an incorrect classification decision. For covert attacks we also determine the perceptibility of induced image artifacts via a user study on Amazon Mechanical Turk (MTurk), using a randomly sampled 10% of the test set (hereafter, ImageNet-TS100) for practicality. We present the original image alongside the perturbed images from our technique and a baseline (in random order) and ask which is closer to the original. Each triplet is presented five times (each to an independent MTurker); in total 500 annotations are collected for the perceptual user study.
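The success-rate metric defined above amounts to a simple count. A minimal sketch, with the function name and the example labels being hypothetical:

```python
def success_rate(true_labels, adversarial_predictions):
    """Fraction of images whose adversarial version is classified incorrectly."""
    wrong = sum(1 for t, p in zip(true_labels, adversarial_predictions) if p != t)
    return wrong / len(true_labels)

# Hypothetical labels for four images: two of the four attacks succeed.
rate = success_rate([0, 1, 2, 3], [0, 2, 2, 1])

# Size of the perceptual study: 100 triplets, each shown to 5 workers.
annotations = 100 * 5
```

Note that under this definition a targeted attack counts as successful whenever the prediction is wrong, even if the chosen target class is not reached.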
4.1.1 Baseline Methods
Our baselines are momentum iterative FGSM [Momentum] with small (-S) and large (-L) parameter choices for . For the former we pick the smallest value for which the attack is effective (i.e. inducing minimal perturbation). For the latter we pick a constant large value (0.1). Four contemporary methods (below) are also evaluated. Open implementations from Rauber et al.’s Foolbox [foolbox] are used for all baselines.
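The selection of the small-budget baseline can be sketched on a toy problem. This is an illustration under stated assumptions, not the Foolbox implementation used in the experiments: the classifier is a hypothetical two-class linear softmax model with an analytic gradient, the momentum update follows the commonly published form of the momentum iterative method, the attack is untargeted, and no clipping to a valid pixel range is applied. The search simply tries candidate budgets in increasing order.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(clf, x):
    return [sum(w * v for w, v in zip(row, x)) for row in clf]

def predict(clf, x):
    logits = logits_of(clf, x)
    return logits.index(max(logits))

def loss_grad(clf, x, label):
    """Gradient of the cross-entropy loss for `label` with respect to x."""
    probs = softmax(logits_of(clf, x))
    dlogits = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    return [sum(clf[c][i] * dlogits[c] for c in range(len(clf)))
            for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def mi_fgsm(clf, x, label, eps, steps=10, mu=1.0):
    """Untargeted momentum iterative FGSM with step size eps / steps."""
    step = eps / steps
    velocity = [0.0] * len(x)
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_grad(clf, x_adv, label)
        norm = sum(abs(g) for g in grad) or 1.0  # guard against a zero gradient
        velocity = [mu * v + g / norm for v, g in zip(velocity, grad)]
        x_adv = [xi + step * sign(v) for xi, v in zip(x_adv, velocity)]
    return x_adv

def smallest_effective_eps(clf, x, label, candidates):
    """Smallest candidate budget for which the prediction changes, else None."""
    for eps in sorted(candidates):
        if predict(clf, mi_fgsm(clf, x, label, eps)) != label:
            return eps
    return None

# Hypothetical 4-pixel input, initially assigned to class 0.
clf = [[1.0, -1.0, 0.5, -0.5], [-1.0, 1.0, -0.5, 0.5]]
x = [0.55, 0.45, 0.5, 0.5]
eps_small = smallest_effective_eps(clf, x, 0, [0.01, 0.02, 0.05, 0.1])
x_adv = mi_fgsm(clf, x, 0, eps_small)
```

A finer candidate grid or a bisection over the budget would give a tighter value; the coarse list here only keeps the example short.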
L-BFGS. Szegedy et al.’s method [Szegedy2013] uses L-BFGS search to generate adversarial examples by finding minimum such that the minimising satisfies .
Carlini & Wagner (C&W) use a binary search to choose the smallest for which the solution to the problem
satisfies ; where is defined as
, and is a parameter that can guarantee a desired confidence. Then our adversarial example is .
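For orientation, the Carlini and Wagner $\ell_2$ attack is commonly stated as below. The symbols are assumptions made for this sketch: $\delta$ is the perturbation, $c$ the constant found by binary search, $Z(\cdot)$ the classifier logits, $t$ the target class, and $\kappa$ the confidence parameter.

```latex
\min_{\delta} \; \lVert \delta \rVert_2^2 + c \cdot h(x + \delta)
\quad \text{subject to} \quad x + \delta \in [0, 1]^m ,
\qquad
h(x') = \max\!\Big( \max_{i \neq t} Z(x')_i - Z(x')_t ,\; -\kappa \Big)
```

The term $h$ becomes non-positive once the target logit exceeds every other logit, and a larger $\kappa$ requires that margin to be wider before the term stops contributing.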
Saliency Map Method (SMM). Papernot et al. differentiate the softmax output (or in another variant, the logits ) and apply a saliency map to this derivative, to target features to perturb. A typical example of a saliency map would be to choose two pixels :
where is the target class and we only consider for which the first term in (11) is positive and the second negative.
DeepFool [DeepFool] finds optimal adversarial examples for the case of an affine classifier. To apply this to a general classifier we iteratively linearise it. Begin by setting . We continue the following procedure until we achieve a change of class, i.e. it is an untargeted attack. For every , we define
and stop when .
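The affine case on which DeepFool builds has a closed-form solution, sketched here for a binary classifier f(x) = w . x + b. The minimal L2 perturbation projects x onto the decision hyperplane; a small overshoot factor (an assumed value of 0.02 in this sketch) pushes the point just across it. The function name and the numbers are hypothetical, and the multi-class, iteratively linearised procedure described above is not reproduced.

```python
import math

def deepfool_affine(x, w, b, overshoot=0.02):
    """Minimal L2 step that moves x across the hyperplane w . x + b = 0."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Projection onto the hyperplane is -f / ||w||^2 * w; overshoot crosses it.
    r = [-(1.0 + overshoot) * f / norm_sq * wi for wi in w]
    return [xi + ri for xi, ri in zip(x, r)], r

def affine_score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical 2-D example: the point starts on the positive side.
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
x_adv, r = deepfool_affine(x, w, b)
r_norm = math.sqrt(sum(ri * ri for ri in r))
```

For a general classifier the same step is applied repeatedly to a local linearisation, stopping once the predicted class changes.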
4.2 Evaluating Adversarial Images
We quantify the success rate of our proposed DIP method against baselines in Tbl. 1. All methods are run as targeted attacks; for each image in ImageNet-TS1K we pick a random incorrect class to define over softmax loss in CNN and penalise deviation from a one-hot vector for that incorrect class under MSE loss. The exception is DeepFool which, as proposed in [DeepFool], is defined as an untargeted attack, using MSE loss from a negated one-hot vector for the correct class. We test robustness to five image transformations (plus no transformation). Specifically, we recompress the image as JPEG with 80% quality, scale the image by up to (-L) and (-S), or rotate it by up to (-L) and (-S) degrees. For all transformations and networks our method is significantly more robust than the other covert baselines, as observed in the more graceful decay in success rate (Fig. 4). The exception is FGSM-IterL, which we include as an example of an overt attack, forcing highly visible image perturbations by setting high. This represents an indicative level of overtness at which FGSM needs to operate to match the performance of the barely visible perturbations of our method. We provide visual examples of (enhanced via normalization) in Fig. 3, and include results from our MTurk evaluation of attack visibility over ImageNet-TS100 (Tbl. 2). In all but two cases the perturbations produced by our method are judged less visible than those of the baseline, or no preference is expressed. The exceptions are DeepFool and C&W. Minor loss of high-frequency detail inherent in DIP reconstruction may influence responses. Finally, we compare the run-time speed of our method to baselines in Tbl. 3. Our approach takes a few minutes to run per image, relatively slow but comparable to state-of-the-art optimization approaches [Szegedy2013, CarliniWagner] running on an NVIDIA 1080Ti GPU. The non-deterministic nature of DIP yields slightly different adversarial examples on each run; however, we found no significant performance difference between runs.
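Why resampling hurts high-frequency perturbations in particular can be seen in a small numerical sketch. This is an illustration only and not the evaluation protocol above: a 2x box-filter downsample followed by nearest-neighbour upsample stands in for a scale transformation, and the two perturbation patterns are hypothetical. A pixel-level checkerboard averages away, whereas a smooth pattern of the same amplitude survives.

```python
def downsample2(img):
    """Halve resolution by averaging each 2x2 block (box filter)."""
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, len(img[0]), 2)]
            for r in range(0, len(img), 2)]

def upsample2(img):
    """Double resolution by nearest-neighbour repetition."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def energy(img):
    """Sum of squared values, a simple measure of perturbation strength."""
    return sum(v * v for row in img for v in row)

def retained_energy(perturbation):
    """Fraction of perturbation energy left after down- then up-sampling."""
    resampled = upsample2(downsample2(perturbation))
    return energy(resampled) / energy(perturbation)

size = 8
# High-frequency pattern: sign alternates at every pixel.
checker = [[0.1 if (r + c) % 2 == 0 else -0.1 for c in range(size)]
           for r in range(size)]
# Low-frequency pattern: same amplitude, one sign change across the image.
smooth = [[0.1 if c < size // 2 else -0.1 for c in range(size)]
          for r in range(size)]

print(retained_energy(checker), retained_energy(smooth))
```

Real affine warps and JPEG recompression are less extreme than this idealised filter, but the same low-pass effect is a plausible reason for the fragility of speckle-like perturbations.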
4.3 Evaluating Adversarial Patches
We evaluate the adaptation of our method (subsec. 3.1.1) to create adversarial patches for an overt targeted attack, comparing to Brown et al.’s Adversarial Stickers [AdversarialPatch] (Fig. 5). We pick a random ‘attack’ class and generate a sticker using 999 training images (one per ImageNet class, holding out the attack class) sampled randomly from the ImageNet training partition. We then apply that single sticker to all ImageNet-TS1K images excluding the attack class, and measure the misclassification rate when applying the sticker to a random location. The process is averaged over 10 random attack classes. The sticker is scaled to a proportion of image area (see plot, Fig. 5). We find our method performs similarly to Brown et al. [AdversarialPatch], outperforming the control case of pasting a large image of the attack class into the image. We train on VGG and GNet to enable comparison to [AdversarialPatch] but also compare application of the GNet-trained patches to attack a VGG network; our method performs similarly beyond 0.25.
We proposed a novel algorithm for the generation of covert adversarial image examples. Leveraging DIP to reconstruct the image from scratch under a dual reconstruction and adversarial loss avoids the introduction of fragile high-frequency artifacts. The resulting adversarial image exhibits greater robustness to affine image warping than state-of-the-art methods [Szegedy2013, Goodfellow2014, CarliniWagner, DeepFool, Saliency] whilst exhibiting low human visual perceptibility. We showed that the same framework can be adapted to synthesise adversarial patches with similar performance to the state-of-the-art sticker attack [AdversarialPatch]. We demonstrated successful attacks against popular CNN visual classification networks (VGG-19, GoogLeNet Inception v3) using diverse categories from ImageNet. Future work could further characterise other networks and datasets; however, we do not consider such extensions necessary to demonstrate the promise of adversarial image synthesis using image reconstruction under DIP.
The first author was supported by an EPSRC Industrial Case Award with Thales UK. This work was supported in part by a GPU card academic gift from Nvidia Corp.