Universal, transferable and targeted adversarial attacks

08/29/2019 ∙ by Junde Wu, et al. ∙ Harbin Institute of Technology

Deep neural networks have been found vulnerable in many previous works. Well-designed inputs, called adversarial examples, can lead the networks to make incorrect predictions. Depending on the scenario, requirements/goals and capabilities, the difficulty of the attack differs. For example, a targeted attack is more difficult than a non-targeted attack, a universal attack is more difficult than a non-universal attack, and a transferable attack is more difficult than a non-transferable one. The question is: does there exist an attack that can survive the harshest environment and meet all of these requirements? Although many cheap and effective attacks have been proposed, this question has not been fully answered over large models and large-scale datasets. In this paper, we build a neural network to learn a universal mapping from source images to adversarial examples. These examples can fool classification networks into classifying all of them as one targeted class. Moreover, they are also transferable between different models.

1 Introduction

Deep neural networks have outperformed many previous techniques across a wide range of domains. Their high accuracy and fast speed have led them to be widely deployed in applications. Despite these great successes, they have been found vulnerable to adversarial examples: the output of the networks can be manipulated by adding meticulously crafted, subtle perturbations to the input data. This property has been shown to exist generally, whether in computer vision tasks such as classification[], object detection[] and semantic segmentation[], or in Natural Language Processing[] and Reinforcement Learning[].

In the classification task, adversarial attacks aim to make the network misclassify. Previous works have shown that generating this kind of adversarial example can be very cheap and effective[]. But the difficulty of the attack varies with the goals and the adversary's knowledge. To put it clearly, we taxonomize the threat models by the adversarial goals, the adversary's knowledge and the perturbation scope. The taxonomy is shown in Figure 1.

Figure 1: The taxonomy of adversarial attacks

Adversarial Goals

- Non-targeted misclassification forces the victim model to incorrectly classify the input into an arbitrary class.

- Targeted misclassification forces the victim model to incorrectly classify all the inputs to a specific targeted class.

Adversarial Knowledge

- White-box attacks assume the adversary knows everything about the victim model, including the network architecture and the training dataset.

- Black-box attacks assume the adversary cannot access the victim model's internals. The adversary only knows the standard outputs of the network, such as the predicted labels and the corresponding scores. But if the adversarial examples are transferable, a white-box attack can be transferred to a black-box model.

Perturbation Scope

- Individual attacks solve the optimization problem for each single input. The perturbations for each clean input are all different.

- Universal attacks learn a universal mapping from inputs to adversarial examples and do not need to solve the optimization problem for each individual input.

In Figure 1, the difficulty of the attacks increases along each axis. In this paper, we mainly explore how to produce attacks under the strictest conditions, corresponding to point (2,2,2) in Figure 1, i.e. transferable, universal and targeted attacks.

In the following, we first discuss related work, then propose the low-frequency fooling image in Section 3. We introduce our method of producing adversarial examples in Section 4, and the corresponding experiments and comparisons are provided in Section 5. Finally, we provide discussion and analysis in Section 6.

2 Related work

3 Low-frequency fooling image

Before going into adversarial examples, let us first discuss fooling images. This nomenclature is adopted from [], and refers to images that are meaningless to humans but that the networks classify into certain classes with high confidence.

For producing a fooling image, we solve the following optimization problem:

$$I_f = \arg\max_{I} \phi_t(I) \qquad (1)$$

where $I_f$ denotes the fooling image and $\phi_t(I)$ denotes the classifier's confidence of the targeted label $t$ when inputting image $I$. If the neural network is differentiable with respect to its inputs, we can use derivatives to iteratively tweak the input towards this goal. The way fooling images are produced is very similar to the way networks are visualized[]. The differences are the target layer and the constraints. Neural network visualization optimizes the layer it aims to visualize and, for recognizability, adds extra constraints to the optimization problem, forcing the result to lie in a low-frequency space. A comparison of a fooling image and a network visualization result is shown in Figure 2.
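As a concrete illustration of Equation (1), the sketch below performs the naive, unconstrained gradient ascent with a torchvision VGG16 classifier. The model choice, step count and target-class index are illustrative assumptions, not necessarily the paper's exact settings.

```python
# A minimal sketch of the unconstrained optimization in Equation (1), assuming a
# torchvision VGG16 classifier and an arbitrary target-class index (both are
# illustrative choices, not the paper's exact settings).
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.vgg16(pretrained=True).to(device).eval()
target_class = 327  # assumed ImageNet index for the targeted label

image = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]  # gradient ascent on the target confidence
    loss.backward()
    optimizer.step()
    image.data.clamp_(0, 1)          # keep the pixels in a valid range

fooling_image = image.detach()
```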

Figure 2: Comparison of (a) a fooling image and (b) a visualized image of the class 'starfish' in VGG19. Both images maximize the activation of the last fully connected layer before the softmax.

A question is why the networks naturally produce this high-frequency, unrecognizable noise. [] indicated that it may be closely related to the structure of the networks, especially the strided deconvolutional layers and the pooling operations. Since [] pointed out that deconvolution operations are the root of the grid effect, one possible interpretation is that when we propagate gradients backward from the targeted label, as we do when solving Equation (1), every convolutional layer in the network serves as a deconvolutional layer. The gradients therefore have to pass through many deconvolutional layers (generally 2 or 3 deconvolutional layers are enough to produce the grid effect[]). The grid effect is successively magnified by these layers and finally becomes the high-frequency noise.
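To make the grid-effect argument above more tangible, here is a small, self-contained illustration (not from the paper) of the artifact itself: a couple of stride-2 transposed convolutions whose kernel size is not divisible by the stride overlap unevenly and imprint a periodic pattern even on a featureless input.

```python
# A self-contained illustration (not the authors' experiment) of the
# grid/checkerboard effect produced by stacked stride-2 transposed convolutions.
import torch
import torch.nn as nn

torch.manual_seed(0)
upsample = nn.Sequential(
    nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2),  # kernel size not divisible by stride
    nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2),
)

flat_input = torch.ones(1, 1, 8, 8)          # featureless input
out = upsample(flat_input).detach()[0, 0]

# The column-wise standard deviation alternates, revealing the grid pattern.
print(out.std(dim=0))
```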

If the high-frequency noise is closely related to the structure of the networks, can low-frequency fooling images be less dependent on the structure of the networks, and thus more general and transferable than the high-frequency ones? To answer this question, we tried several methods to suppress the high-frequency components of the fooling images.

1. Transformation Robustness (TR) constrains high frequencies by applying small transformations to the fooling images during optimization[]. Here, we rotate, scale and jitter the images. The constrained optimization can be expressed as:

$$I_f = \arg\max_{I} \phi_t\big(T(I)\big) \qquad (2)$$

where $T$ denotes the composition of the specified transformations.

2. Decorrelation (DR) decorrelates neighbouring pixels. Here, we do this by performing gradient descent in the Fourier basis, as [] did to visualize networks. It can be expressed as:

$$z^{*} = \arg\max_{z} \phi_t\big(\mathcal{F}^{-1}(z)\big), \qquad I_f = \mathcal{F}^{-1}(z^{*}) \qquad (3)$$

where $\mathcal{F}$ denotes the Fourier transform.

3. Transformation Robustness and Decorrelation (TR and DR) can be combined to generate fooling images, which is expressed as:

$$z^{*} = \arg\max_{z} \phi_t\big(T(\mathcal{F}^{-1}(z))\big), \qquad I_f = \mathcal{F}^{-1}(z^{*}) \qquad (4)$$

4. Gradient-optimized Compositional Pattern Producing Network (Gradient-CPPN) uses a CPPN to generate the fooling image. A CPPN is a neural network that maps a position in the image to its color, so the frequency of the output fooling image is related to the architecture of the CPPN: the simpler the structure of the network, the lower the frequency of the resulting images. This method optimizes the CPPN parameters with the gradients of the victim model (a sketch follows after this list), which can be expressed as:

$$\theta^{*} = \arg\max_{\theta} \phi_t\big(\mathrm{CPPN}_{\theta}(P)\big), \qquad I_f = \mathrm{CPPN}_{\theta^{*}}(P) \qquad (5)$$

where $P$ is a 2-D position map.

5. CPPN-encoded Evolutionary Algorithm (EA-CPPN), proposed by [], uses CPPN-encoded images to represent genomes and an evolutionary algorithm to optimize them.
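Referring back to the Gradient-CPPN method in Equation (5), the sketch below shows the basic idea: a small coordinate network maps a 2-D position map to RGB values, and its parameters are optimized through the victim model's gradients. The depth, width, learning rate and target-class index are assumptions, not the architecture used in the paper.

```python
# A minimal sketch of the Gradient-CPPN idea in Equation (5). All architecture
# settings here are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
victim = models.vgg16(pretrained=True).to(device).eval()
target_class = 327  # assumed target class index

cppn = nn.Sequential(            # a shallower CPPN yields lower-frequency images
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 3), nn.Sigmoid(),
).to(device)

# 2-D position map P: normalized (x, y) coordinates of a 224x224 image
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 224),
                        torch.linspace(-1, 1, 224), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).to(device)

optimizer = torch.optim.Adam(cppn.parameters(), lr=1e-3)
for step in range(300):
    optimizer.zero_grad()
    image = cppn(coords).reshape(224, 224, 3).permute(2, 0, 1).unsqueeze(0)
    loss = -victim(image)[0, target_class]   # maximize the target confidence
    loss.backward()
    optimizer.step()

fooling_image = cppn(coords).reshape(224, 224, 3).detach()
```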

We choose VGG16 as our victim model to train the fooling images and test the results on Clarifai.com, a black-box image classification system. Some samples are shown in Figure 3. Consistent with our hypothesis, we find that low-frequency images can fool Clarifai.com into classifying them as the targeted or related classes, while the high-frequency noise fails to fool the system. Among the low-frequency methods, gradient-optimized CPPN performs best.

Figure 3: Samples of a high-frequency fooling image, (a) naive, and of low-frequency fooling images generated by different methods: (b) TR, (c) DR, (d) TRDR, (e) Gradient-CPPN, (f) EA-CPPN. The classes shown with the images are predicted by Clarifai.com.

As our goal is to generate transferable targeted adversarial examples, these low-frequency fooling images are not enough by themselves. In the next section, we introduce how we leverage them to generate transferable targeted adversarial examples. For convenience, in the following we refer to the constrained low-frequency fooling images as $I_{lf}$ and the unconstrained high-frequency fooling images as $I_{hf}$.

4 Merge the style

4.1 Method

For generating adversarial examples, we aim at mapping the distribution of source images to the targeted adversarial distribution. The samples in this distribution should maintain similarity with the source images at the low (pixel) level, but share high-level features with the fooling images. These high-level features preserve the attributes of the fooling images to ensure the success of the attack and also, we assume, the low-frequency property that ensures transferability. We support this assumption by comparing the high-level features of fooling images of different frequencies: the means and variances of the different distributions cluster together (Figure[]), which indicates that the $I_{lf}$ share specific properties that ensure their transferability.

We build a conditional image generation function to shift the original source image distribution so that it inherits these properties. Put formally, let $x_s$ and $x_f$ be the source images and fooling images sampled from two observed distributions $P_s$ and $P_f$, and let $x_{adv}$ be the targeted adversarial examples. Our goal is to learn the conditional distribution that satisfies:

$$x_{adv} \sim P(x_{adv} \mid x_s, x_f) \qquad (6)$$

where $x_{adv}$ denotes the sample produced by endowing $x_s$ with the properties of $x_f$.

We build an encoder-decoder convolutional neural network to serve as the conditional distribution generator. We call it the Fooling Transfer Net (FTN). The details of FTN are described in the next section.

4.2 Network Structure

Inspired by image-to-image translation [Unsupervised Image-to-Image Translation Networks], [Few-Shot Unsupervised Image-to-Image Translation] and style transfer tasks[][], FTN is built with an encoder $E$ and an AdaIN decoder $D$. We learn the properties of the fooling image distribution from its high-level representations in the victim model, denoted as $\phi_l(x_f)$, where $\phi$ denotes the victim model and $l$ denotes the targeted layer.

The encoder $E$ consists of a sequence of convolutional layers and several residual blocks that encode the source images into a latent vector. The AdaIN decoder $D$ is made of two AdaIN residual blocks followed by several deconvolutional layers. AdaIN residual blocks are residual blocks with adaptive instance normalization layers, which first normalize the activations of a sample in each channel to zero mean and unit variance and then scale them with learned scalars and biases. In our translation network, these scalars and biases are obtained from the means and variances of $\phi_l(x_f)$.

Specifically, we extract $\phi_l(x_f)$ from a pretrained classifier and pass it through several fully connected layers to get a certain number of scalars and biases (depending on the number of encoded features). These scalars and biases are then used to apply an affine transformation to the normalized latent code. Here, we aim at extracting the latent content representation from the source images using the encoder and extracting the class-specific representation from the fooling image. We then shift the latent content code using the class-specific representation. In this way, we hope to retain the content information of the original images while adjusting them toward the targeted class attributes.

We supervise the network with the original source images $x_s$ and the fooling images' high-level representations $\phi_l(x_f)$. The former constrains the output to maintain maximal content information, while the latter constrains the output to have high-level representations similar to those of the fooling images. An illustration of FTN is shown in Figure 4.

Figure 4: An illustration of FTN. The blue layers denote AdaIN residual blocks, the orange layers convolutional layers, and the brown layers fully connected layers.
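As a rough sketch of the AdaIN residual block described above, the code below maps a style vector (standing in for the means and variances of $\phi_l(x_f)$) through fully connected layers to per-channel scales and biases that modulate the normalized latent code. All sizes and the exact modulation form are assumptions, not the authors' exact implementation.

```python
# A rough sketch of an AdaIN residual block of the kind used in FTN.
# Layer sizes and the modulation form are assumptions for illustration.
import torch
import torch.nn as nn

class AdaINResBlock(nn.Module):
    def __init__(self, channels, style_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(style_dim, channels)   # learned scalars
        self.to_bias = nn.Linear(style_dim, channels)    # learned biases

    def adain(self, x, style):
        scale = self.to_scale(style).unsqueeze(-1).unsqueeze(-1)
        bias = self.to_bias(style).unsqueeze(-1).unsqueeze(-1)
        return self.norm(x) * (1 + scale) + bias

    def forward(self, x, style):
        h = torch.relu(self.adain(self.conv1(x), style))
        h = self.adain(self.conv2(h), style)
        return x + h

# Example: a latent code from the encoder modulated by a 512-d style vector.
block = AdaINResBlock(channels=256, style_dim=512)
latent, style = torch.randn(1, 256, 28, 28), torch.randn(1, 512)
out = block(latent, style)
```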

4.3 Loss Function

We constrain the network with three loss functions: the content loss $L_{content}$, the representation loss $L_{rep}$, and the total variation loss $L_{tv}$.

The content loss keeps the content of the adversarial examples similar to that of the source images. We use the structural similarity (SSIM) index as our content loss. SSIM is a method for predicting the perceived quality of images; in our comparison experiment, it performs better than a plain pixel-wise loss.

The representation loss constrains the output adversarial examples to have high-level representations in the pretrained classifier similar to those of the fooling images. In this paper, we choose VGG19 as our classifier and empirically choose several of its layers as the targeted representation set $\mathcal{L}$. The representation loss is expressed as:

$$L_{rep} = \sum_{l \in \mathcal{L}} \big\| \phi_l(x_{adv}) - \phi_l(x_f) \big\|_2^2 \qquad (7)$$

The total variation loss applies a total variation regularization to penalize reconstruction noise.

The total loss of the network is expressed as:

$$L = L_{content} + \lambda_{rep} L_{rep} + \lambda_{tv} L_{tv} \qquad (8)$$

where $\lambda_{rep}$ and $\lambda_{tv}$ are the weight constants of $L_{rep}$ and $L_{tv}$.
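A hedged sketch of the total loss in Equation (8) is given below. The SSIM implementation is left as a plug-in function, and the default weights are placeholders rather than the authors' exact settings.

```python
# A hedged sketch of the total loss in Equation (8): an SSIM-based content term,
# an l2 representation term over chosen feature maps, and a total variation term.
# ssim_fn is a plug-in; the default weights are placeholders.
import torch
import torch.nn.functional as F

def total_variation(x):
    # penalize local reconstruction noise
    return (x[..., :, 1:] - x[..., :, :-1]).abs().mean() + \
           (x[..., 1:, :] - x[..., :-1, :]).abs().mean()

def representation_loss(feats_adv, feats_fool):
    # l2 distance between high-level representations of the output and the fooling image
    return sum(F.mse_loss(a, f) for a, f in zip(feats_adv, feats_fool))

def total_loss(x_adv, x_src, feats_adv, feats_fool, ssim_fn,
               lambda_rep=1.0, lambda_tv=1e-4):
    content = 1.0 - ssim_fn(x_adv, x_src)    # SSIM-based content loss
    rep = representation_loss(feats_adv, feats_fool)
    tv = total_variation(x_adv)
    return content + lambda_rep * rep + lambda_tv * tv
```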

5 Experiment

Models: In this paper, we choose a classic classification model, VGG19, as our training victim model and test transferability on other, more delicate classification models such as Inception-v3, ResNet-18, ResNet-50 and DenseNet, which we denote as validation victim models. Two points need clarification. Firstly, we have not put much effort into the selection of the training victim model, because this comparison has already been done in many previous works[] and the conclusion seems uncontroversial: adversarial examples trained on more basic models are more transferable. We do not think our method is an exception, so repeating this comparison would be time-consuming, not very meaningful, and distracting for readers. Secondly, previous works have shown that an ensemble-based approach can achieve a higher attack success rate[], but we only train our network on a single victim model, again because our method is unlikely to be an exception, and training on a single victim model preserves more models for testing transferability. We welcome reports showing that our method is an exception to these previous conclusions.

Dataset: FTN is trained on the ILSVRC 2012 classification training set and tested on its validation set.

Target: We choose 'starfish' as our default targeted class because it is almost impossible to entangle with other classes: the features of a starfish are distinct from most other objects. This avoids the circumstance where the model naturally misclassifies source images into the targeted class, which would overestimate the performance of our proposed method. Targeted attacks on other classes can be found in the appendix.

Measure: We measure our results by two important factors: transferability and distortion. Transferability is measured by the transfer success rate, i.e. the percentage of generated adversarial examples that are classified as the targeted label by the validation victim models. Distortion describes the difference between the generated adversarial examples and the source images; we measure it by the root mean square deviation (RMSD), computed as $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i}(x_{adv,i} - x_{s,i})^2}$ over the $N$ pixels. We also use the ratio of transfer success rate to distortion, denoted RTD, as a measure to compare different methods. Put formally, it is calculated as:

$$\mathrm{RTD} = \frac{\text{transfer success rate}}{\mathrm{RMSD}} \qquad (9)$$
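The two measures can be computed as in the following sketch; the helper names are ours, not the paper's.

```python
# A small sketch of the measures: RMSD between adversarial and source images,
# transfer success rate on a validation victim model, and the RTD ratio of
# Equation (9).
import numpy as np

def rmsd(x_adv, x_src):
    diff = x_adv.astype(np.float64) - x_src.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def transfer_success_rate(predictions, target_class):
    # fraction of adversarial examples classified as the targeted label
    return float(np.mean(np.asarray(predictions) == target_class))

def rtd(success_rate, distortion):
    return success_rate / distortion
```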

5.1 Low-frequency Fooling Image

In this paper, we proposed low-frequency fooling images and FTN to transfer the source images with them. In this section, we aim to show that $I_{lf}$ are more transferable than $I_{hf}$ and that FTN maintains this transferability.

We have given examples above where $I_{lf}$ are more transferable than $I_{hf}$. Here, we perform a comprehensive experiment to confirm this result. We compare the five frequency-constrained methods, TR, DR, TRDR, Gradient-CPPN and EA-CPPN, with the direct gradient ascent method that places no constraint on high-frequency components. All results are trained on the training victim model and tested on the validation victim models. The transfer success rates over 100 randomly selected samples are shown in Table 1. Frequency-constrained methods generally perform better than the direct gradient ascent method, and Gradient-CPPN is the best among them.

Method          Inception-v3   ResNet-18   ResNet-50   DenseNet   Clarifai.com
Naive           1              2           1           1          0
TR              67             79          74          76         51
DR              72             76          83          75         62
TRDR            78             81          78          83         67
Gradient-CPPN   96             94          91          93         86
EA-CPPN         0              1           0           0          0
Table 1: Transfer success rate (%) of fooling images generated by different methods

Next, we show that FTN maintains the transferability of $I_{lf}$; in other words, using more transferable fooling images in FTN generates more transferable adversarial examples. For the sake of fairness, we adjust the hyper-parameters of every method to get its best effect. The ratio of transfer success rate to distortion is shown in Table 2 and a visual comparison is shown in Figure[]. We can see that better fooling images help generate more transferable or less distorted adversarial examples for the same model.

Method          Inception-v3   ResNet-18   ResNet-50   DenseNet   Clarifai.com
Naive           0.32           0.67        0.32        0.32       0
TR              3.94           4.34        4.65        4.21       3.00
DR              4.52           5.33        5.43        5.10       3.94
TRDR            6.34           6.81        6.49        7.02       5.41
Gradient-CPPN   13.32          13.39       12.13       12.23      12.11
Table 2: RTD of FTN adversarial examples generated with different fooling images

5.2 FTN

AdaIN Normalization We design the AdaIN residual blocks in FTN to shift the latent source distribution by the properties of the latent class distribution. In the ablation experiment, however, the results are not substantially different; the choices of targeted representation layers and loss functions play more decisive roles. Nevertheless, we find the AdaIN residual blocks help the training process converge more quickly and make the hyper-parameters easier to set. In particular, they allow more flexibility in the decisive hyper-parameter $\lambda_{rep}$, which weights the representation loss relative to the content loss. We show this in Figure[].

Selection of representations We have tried many feasible combinations of targeted representation layers in VGG19. In style transfer, the style representations are generally chosen as activations from low to high layers. But for generating adversarial examples, the lower representations endow superficial similarity between the adversarial examples and the fooling images rather than semantic features. We show some typical selections in Figure[].

Loss Function For the content loss, we tried a plain pixel-wise loss, a perceptual loss[] and the SSIM loss. A visual comparison is shown in Figure[]; the SSIM loss gives the best visual effect. For the representation loss, we tried several loss functions that might be better than the simple $\ell_2$ loss (e.g. cosine similarity, KL divergence, a bunch of discriminative nets), but unfortunately none of them works better. Intuitively, we think a better loss function for the representations exists, and we hope further research can find it.

5.3 Comparison with other methods

Different from most adversarial-example-generation methods, we learn a universal mapping instead of optimizing every single image. For a fair comparison, we degrade our method to single-image optimization when comparing with methods that process one particular image at a time, including FG, FGS, DeepFool, JSMA and the CW attack. We run the optimization on VGG19 and test on the validation victim models and Clarifai.com. Clarifai.com is a black-box image classification system: no one can access its dataset, network structure or parameters, which makes it well suited to testing the transferability of adversarial examples. As shown in Table 3, our method achieves a higher transfer success rate than the other methods for the targeted attack.

Method     RMSD    Inception-v3   ResNet-18   ResNet-50   DenseNet   Clarifai.com
FG         23.56   1              2           1           1          0
JSMA       24.21   2              2           0           1          0
DeepFool   11.98   28             33          34          31         1
CW         22.55   2              3           2           2          0
FTN        4.21    98             94          93          95         94
Table 3: Comparison of FTN with different adversarial example generation methods (transfer success rate, %)

We also compare our method with another universal attack[universal]. The quantitative results are shown in Table 4.

Method      RMSD    Inception-v3   ResNet-18   ResNet-50   DenseNet   Clarifai.com
Universal   34.25   63             56          41          51         12
FTN         7.68    92             88          87          91         86
Table 4: Comparison of universal attacks (transfer success rate, %)

6 Discussion

In this paper, we proposed low-frequency adversarial examples and showed that they are more transferable than high-frequency ones. However, what do low-frequency attacks mean, and why can they be more transferable than high-frequency ones? In this section, we attempt to discuss and answer these questions.

Firstly, we find that low-frequency attacks and high-frequency attacks are highly different. This difference lies not only in the pixel space but also in the classifier representations that are decisive for the classification (even though both are classified into the same class). This can already be seen from the difference between the adversarial examples generated from $I_{lf}$ and those generated from $I_{hf}$, as shown in Figure[], since the generation processes are identical except for the change of fooling images. Nevertheless, we perform a more rigorous experiment to observe the difference between these representations. We compare the representations of the two kinds of fooling images in the VGG19 model. Our targeted representations are the layers chosen above, which we find most effective for supervising our model and most decisive for the final classification. We compute their means and variances and find that the means and variances of the same kind of fooling image are highly similar. This shows that although both kinds are classified into the same class, their high-dimensional features are different.
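The statistics comparison described above can be sketched as follows; the VGG19 layer indices are placeholders, not necessarily the layers used in the paper.

```python
# A sketch of the representation-statistics comparison: per-channel means and
# variances of intermediate VGG19 activations. Layer indices are placeholders.
import torch
import torchvision.models as models

vgg_features = models.vgg19(pretrained=True).features.eval()
chosen_layers = {21, 30, 35}   # assumed indices into vgg19.features

@torch.no_grad()
def layer_statistics(image):
    stats, x = [], image
    for idx, layer in enumerate(vgg_features):
        x = layer(x)
        if idx in chosen_layers:
            stats.append((x.mean(dim=(2, 3)), x.var(dim=(2, 3))))  # per-channel stats
    return stats

# Comparing these statistics between low- and high-frequency fooling images shows
# whether their high-level features cluster separately.
```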

But what does this difference mean, and how does it cause the stronger transferability? To answer this question, we analyze low-frequency attacks and high-frequency attacks on the manifold. From this viewpoint, the classification models learn classification boundaries in a high-dimensional space. The traditional strategy[] is to learn a high-dimensional vector and add it to a source image point; the vector is constrained so that it helps the source image escape from its original boundary (non-targeted attack) or pass through another specific boundary (targeted attack), as in Figure[]. This vector is called the perturbation. Since these learning methods attempt to find the smallest vector that makes the attack succeed, the vectors are highly related to the victim model's classification boundary, and many previous works[] view them as normal vectors from the source image point to the boundary. However, although the classification boundaries learned by different models share some similarities (which makes some high-frequency adversarial examples transferable to some extent), the curvatures of the different boundaries cannot be the same. This causes some high-frequency adversarial examples to fail to transfer, or forces the perturbation to be magnified in order to transfer.

Our method decouples the perturbation from the curvature of the boundary surface. Note that learning a high-frequency fooling image means finding a point that is classified into the targeted class but may be sensitive to differences between the boundaries of different classification models, whereas learning a low-frequency fooling image means finding a point that is still classified into the targeted class but is less sensitive to these differences. We speculate that this is because low-frequency fooling images are closer to the natural image manifold, and the classification boundaries are trained to be more robust around natural images than around meaningless noise, since the training datasets consist of natural images. After obtaining $I_{lf}$, we then find the point that is near the source image in pixel space and close to $I_{lf}$ in the representation space. This constraint is independent of the curvature of the boundaries and is therefore expected to be more transferable than previous methods. We treat the perturbations as high-dimensional vectors and count the number of transferable ones among all linearly independent vectors. We randomly sample 100 images generated by the C&W attack and by our proposed method, and test their transferability on the victim classification models. The C&W attack's perturbations span a ..-dimensional space, of which .. dimensions are transferable; our proposed method spans a …-dimensional space, of which … dimensions are transferable.
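The dimension count described above can be sketched as follows, assuming the perturbations have already been split into transferable and non-transferable ones during evaluation; the helper function is hypothetical.

```python
# A sketch of the dimension count: stack the sampled perturbations as rows of a
# matrix and count linearly independent directions via the matrix rank. The
# transferable/non-transferable split is assumed to come from the evaluation.
import numpy as np

def perturbation_dimensions(perturbations, transferable_mask):
    flat = np.stack([p.reshape(-1) for p in perturbations])   # one perturbation per row
    total_dim = int(np.linalg.matrix_rank(flat))
    transfer_dim = int(np.linalg.matrix_rank(flat[np.asarray(transferable_mask)]))
    return total_dim, transfer_dim
```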

7 References