Intermediate Level Adversarial Attack for Enhanced Transferability

11/20/2018 · by Qian Huang, et al. · Cornell University

Neural networks are vulnerable to adversarial examples, malicious inputs crafted to fool trained models. Adversarial examples often exhibit black-box transfer, meaning that adversarial examples for one model can fool another model. However, adversarial examples may be overfit to exploit the particular architecture and feature representation of a source model, resulting in sub-optimal black-box transfer attacks to other target models. This leads us to introduce the Intermediate Level Attack (ILA), which attempts to fine-tune an existing adversarial example for greater black-box transferability by increasing its perturbation on a pre-specified layer of the source model. We show that our method can effectively achieve this goal and that we can decide a nearly-optimal layer of the source model to perturb without any knowledge of the target models.


1 Introduction

Adversarial examples for neural networks are small perturbations of inputs carefully crafted to fool trained models [19, 5]. In the context of Convolutional Neural Networks (CNNs) [10], adversarial examples are imperceptible perturbations of natural images, and they have been well studied in terms of attacks [5, 2, 16, 3, 1]. Concerns have been raised over the vulnerability of CNNs to these perturbations in real-world contexts where they are used in systems such as online content filtering and self-driving cars [4, 11].

Moreover, our current understanding of these perturbations is quite limited. In particular, they are known to exhibit black-box transfer, meaning that perturbations crafted to fool one model will often fool another model as well. Though some works have attempted to shed light on this phenomenon [21], many questions remain unanswered. In our work, we aim to provide more insight into this subject by exploring how perturbations crafted on a source model can be enhanced for greater black-box transfer. We attempt to do this by maximizing their effect on one of the source model's intermediate layers.

Our contributions are as follows:

  • We propose a novel method, Intermediate Level Attack (ILA), that enhances black-box adversarial transferability by increasing the perturbation on a pre-specified layer of a model.

  • When attacking a model, it is generally effective to focus on perturbing the last layer. However, we show that when optimizing for black-box transfer it is better to emphasize perturbing an intermediate layer.

  • Additionally, we provide a method for selecting an optimal layer for transferability using the source model alone, thus obviating the need for evaluation on transfer models during hyperparameter optimization.

2 Motivation and Approach

For the duration of this paper, we focus on non-targeted, image-dependent attacks. In a typical attack setting, most methods, such as I-FGSM [11], PGD [13], and DeepFool [14], primarily attack the last layer, with the loss objective formulated directly in terms of the cross-entropy between the softmax output and some desired target distribution. Although these attack methods produce transferable adversarial examples [21], they do not explicitly optimize for transferability. Our attack, on the other hand, focuses on enhancing the transferability of adversarial examples by perturbing intermediate layers.
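For concreteness, below is a minimal PyTorch-style sketch of such a last-layer, cross-entropy-driven iterative attack (an I-FGSM-style loop). The function name, the step size lr, and the assumption that inputs lie in [0, 1] are ours, not taken from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps, lr, n_iters=20):
    """Iterative FGSM: repeated signed-gradient ascent on the cross-entropy
    loss, keeping the perturbation inside an L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(n_iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + lr * grad.sign()          # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)             # assumes inputs are in [0, 1]
    return x_adv.detach()
```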

Motivated by [22], we view a CNN's convolutional layers as a learned feature hierarchy, with the earliest layers learning primitive features and the latest layers learning the most high-level features. We hypothesize that, for a given model, the low-level feature representations are similar to those of other models until they begin to diverge at a specific layer (where models learn different representations of high-level features). Thus, we aim to find and attack the latest layer at which the learned feature representations are still general enough to be found in other models. Adversarial examples crafted to attack such an intermediate layer should be more transferable, as they effectively attack an internal representation that is common to most models.

Based on the above motivation, we propose the following attack, which we call the Intermediate Level Attack (ILA). Let F_l(x) denote the output of layer l of a network F given an input x, and write Δy′_l = F_l(x′) − F_l(x) and Δy″_l = F_l(x″) − F_l(x). Given an adversarial example x′ generated by attack method A for a natural image x (generated after n iterations; the attack does not have to be iterative, but in this paper we focus on iterative attacks), and a specific layer l of a network F, we aim to produce an x″ that solves:

\[
\max_{x''} \;\; \alpha \cdot \frac{\lVert \Delta y''_l \rVert}{\lVert \Delta y'_l \rVert} \;+\; \frac{\Delta y''_l \cdot \Delta y'_l}{\lVert \Delta y''_l \rVert\,\lVert \Delta y'_l \rVert}
\quad \text{subject to} \quad \lVert x'' - x \rVert_\infty \le \epsilon
\tag{1}
\]

In practice, we iterate for a fixed number of steps (10 in our experiments) to attain this objective; the full algorithm is given in Appendix A. Note that the layer l and the loss weight α are hyperparameters of this attack, and that · above denotes a dot product. The attack assumes that x′ is a pre-generated adversarial example, so it can be viewed as fine-tuning x′: we fine-tune for a greater norm of the output difference at layer l (which we hope is conducive to greater transferability), while attempting to preserve the output difference's direction so as to avoid destroying the original adversarial structure.
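To make the objective concrete, here is a small PyTorch-style sketch of the fine-tuning loss as written in Eq. (1) above (a magnitude term weighted by α plus a direction-preserving term). The function name and the per-example flattening are our own choices, not the paper's code.

```python
import torch
import torch.nn.functional as F

def ila_objective(delta_ref, delta_new, alpha):
    """ILA fine-tuning objective for one intermediate layer.

    delta_ref: F_l(x') - F_l(x), the reference perturbation at layer l.
    delta_new: F_l(x'') - F_l(x), the perturbation being optimized.
    Returns alpha * (norm ratio) + cosine similarity, averaged over the batch."""
    ref = delta_ref.flatten(1)
    new = delta_new.flatten(1)
    magnitude = new.norm(dim=1) / ref.norm(dim=1)     # reward a larger perturbation
    direction = F.cosine_similarity(new, ref, dim=1)  # preserve its direction
    return (alpha * magnitude + direction).mean()
```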

3 Results

We start by showing that our method increases transferability for various base attack methods, including I-FGSM [11] and I-FGSM with momentum [3]. We test on a variety of models, namely ResNet18 [6], SENet18 [7], DenseNet121 [8], and GoogLeNet [20]. Architecture details are specified in Appendix B; note that in the results below, instead of referring to architecture-specific layer names, we refer to layer indices, indexed from 0 as listed in Appendix B (e.g., for ResNet18, index 2 is layer1, the last layer of the first block). Our models are trained on CIFAR-10 [9] with the code and hyperparameters in [12].
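The intermediate output F_l(x) can be read off with a forward hook on the corresponding submodule. A minimal sketch follows; the helper name is ours, and module names such as model.layer2 assume a torchvision/kuangliu-style ResNet, which may differ slightly from the exact code in [12].

```python
import torch

def intermediate_output(model, module, x):
    """Run model(x) and return the output of one submodule (e.g. model.layer2),
    captured with a temporary forward hook."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["feat"] = output

    handle = module.register_forward_hook(hook)
    try:
        model(x)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured["feat"]
```

For example, intermediate_output(resnet18, resnet18.layer2, x) would return the activations at layer index 3 in the indexing of Appendix B.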

For a fair comparison, we use the output of an attack run for 20 iterations as the baseline. ILA is run for 10 iterations, starting from the output of the baseline attack after 10 iterations. The same learning rate is used for both I-FGSM and I-FGSM with momentum (tuning the learning rate does not substantially affect transferability, as shown in Appendix E). The learning rate for ILA is tuned separately to exhibit near-optimal average attack strength on the transferred models. Finally, we limit ourselves to generating adversarial examples x″ for natural images x such that ‖x″ − x‖∞ ≤ ε for a fixed choice of ε (our CIFAR-10 images are normalized; results for different values of ε are given in Appendix D).

To evaluate transferability, we test the accuracies of different models over adversarial examples generated from all CIFAR-10 test images. We then show that we can select a nearly-optimal layer for transferability using only the source model.
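Evaluating transferability then amounts to measuring a target model's accuracy on the pre-generated adversarial examples; a simple sketch is below (helper name and batching are ours).

```python
import torch

@torch.no_grad()
def accuracy_on_adversarial(target_model, adv_images, labels, batch_size=128):
    """Accuracy of a (possibly different) target model on a tensor of
    adversarial examples; lower accuracy means better transfer."""
    target_model.eval()
    correct = 0
    for i in range(0, adv_images.size(0), batch_size):
        logits = target_model(adv_images[i:i + batch_size])
        correct += (logits.argmax(dim=1) == labels[i:i + batch_size]).sum().item()
    return correct / adv_images.size(0)
```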

3.1 ILA Targeted at Different l Values

To confirm the effectiveness of our attack, we fix a single source model and baseline attack method, and then check how ILA transfers to the other models compared with the baseline attack. Results for ResNet18 as the source model and I-FGSM as the baseline method are shown in Figure 1. Comparing the results of both methods on the source model and the other models, we see that ILA outperforms I-FGSM when targeting any intermediate layer, especially for the optimal hyperparameter value of l = 4. Note that after fine-tuning, the adversarial examples perform worse on the source model; this is anticipated, as ILA does not optimize for attacking the source model. Full results are shown in Appendix B.

Figure 1: Transfer results of ILA against I-FGSM on ResNet18, as measured by the accuracies of DenseNet121, SENet18, and GoogLeNet (lower accuracy indicates a better attack).
Figure 2: Disturbance values at each layer for ILA targeted at layer l on ResNet18. The l in the legend refers to the hyperparameter set in the ILA attack; the disturbance values are then computed at the layers indicated on the x-axis. Note that the latest peak is produced by the l = 4 ILA attack.

3.2 ILA with a Pre-Determined l Value

Above, we demonstrated that adversarial examples exhibit the strongest transferability when targeting a specific layer. We wish to pre-determine this optimal value based on the source model alone, so as to avoid tuning the hyperparameter l against transfer models. To do this, we examine the relationship between transferability and the layer disturbance values of a given ILA attack. We define the disturbance values of an ILA perturbation as the values of ‖F_l(x″) − F_l(x)‖ for all layers l in the source model. For each value of l in ResNet18 (the set of layers is defined for each architecture in Appendix B), we plot the disturbance values of the corresponding ILA attack in Figure 2; the same graphs are given for the other models in Appendix C. We observe that, for a given source model and a given ILA attack (i.e., a specific value of l), the disturbance curve usually reaches its peak at the targeted layer. Furthermore, we notice that the adversarial examples that produce the latest peak in the graph are typically the ones with the highest transferability across all transfer models (Table 1). Given this observation, we propose that the latest l that still exhibits a peak is a nearly optimal value of l in terms of maximizing transferability. For example, according to Figure 2, we would choose l = 4. Table 1 supports our claim and shows that selecting this layer gives an optimal or near-optimal attack.

Intuitively, we choose a layer with a peak because we want a layer that we can significantly perturb. We choose the latest such layer because the noise we add at that layer is fairly unstructured (roughly random), so passing it through subsequent layers will only dampen it. Choosing the latest layer therefore minimizes this dampening and maximizes the chance that the perturbation influences the model output.
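One way to operationalize this layer-selection heuristic is sketched below: compute the disturbance curve ‖F_l(x″) − F_l(x)‖ of each candidate ILA attack on the source model, then keep the latest target layer whose curve still peaks at that layer. This is our reading of the heuristic; the helper names, and intermediate_output from the earlier sketch, are ours.

```python
import torch

@torch.no_grad()
def disturbance_curve(model, layer_modules, x, x_adv):
    """Average disturbance ||F_l(x_adv) - F_l(x)|| at every candidate layer."""
    curve = []
    for module in layer_modules:
        clean = intermediate_output(model, module, x)
        adv = intermediate_output(model, module, x_adv)
        curve.append((adv - clean).flatten(1).norm(dim=1).mean().item())
    return curve

def latest_peaking_layer(curves):
    """curves[l] is the disturbance curve of the ILA attack targeted at layer l.
    Return the latest l whose own curve peaks at layer l."""
    peaking = [l for l, c in enumerate(curves) if c.index(max(c)) == l]
    return max(peaking) if peaking else None
```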

| Source | Transfer | I-FGSM 20 Itr | I-FGSM 10 Itr ILA | I-FGSM Opt ILA | Momentum 20 Itr | Momentum 10 Itr ILA | Momentum Opt ILA |
| ResNet18 | ResNet18* | 3.3% | 7.4% | 5.5% (5) | 5.7% | 11.3% | 7.4% (5) |
| ResNet18 | SENet18 | 44.4% | 27.4% | 27.4% (4) | 33.8% | 30.1% | 30.1% (4) |
| ResNet18 | DenseNet121 | 45.8% | 27.4% | 27.4% (4) | 35.1% | 30.3% | 30.3% (4) |
| ResNet18 | GoogLeNet | 58.6% | 36.1% | 36.1% (4) | 45.1% | 37.8% | 37.8% (4) |
| SENet18 | ResNet18 | 36.8% | 28.1% | 28.1% (4) | 31.0% | 29.6% | 29.6% (4) |
| SENet18 | SENet18* | 2.4% | 8.7% | 5.6% (6) | 3.3% | 10.8% | 6.6% (6) |
| SENet18 | DenseNet121 | 38.0% | 28.3% | 28.3% (4) | 31.6% | 29.6% | 29.6% (4) |
| SENet18 | GoogLeNet | 48.4% | 36.3% | 36.3% (4) | 41.1% | 37.1% | 37.1% (4) |
| DenseNet121 | ResNet18 | 45.1% | 29.1% | 27.7% (6) | 34.4% | 30.4% | 30.2% (6) |
| DenseNet121 | SENet18 | 43.4% | 28.6% | 27.8% (6) | 33.5% | 29.8% | 29.8% (7) |
| DenseNet121 | DenseNet121* | 2.8% | 2.6% | 2.6% (7) | 6.4% | 4.3% | 4.3% (7) |
| DenseNet121 | GoogLeNet | 47.3% | 31.8% | 30.6% (6) | 36.3% | 32.3% | 32.3% (7) |
| GoogLeNet | ResNet18 | 55.9% | 35.2% | 34.3% (3) | 44.6% | 35.3% | 34.5% (3) |
| GoogLeNet | SENet18 | 53.6% | 34.5% | 33.3% (3) | 43.0% | 34.7% | 34.2% (3) |
| GoogLeNet | DenseNet121 | 48.9% | 29.7% | 29.7% (3) | 38.9% | 30.2% | 30.2% (9) |
| GoogLeNet | GoogLeNet* | 0.9% | 1.3% | 1.3% (9) | 1.5% | 2.2% | 2.2% (9) |
* Transfer model identical to the source model.
Table 1: ILA accuracies (with pre-determined ILA layers). Accuracies after attack are shown for each transfer model (lower accuracy indicates a better attack). The ILA hyperparameter l is fixed for each source model as decided by the layer disturbance graphs (e.g., l = 4 for ResNet18, since it produces the last peak in Figure 2). I-FGSM and I-FGSM with momentum use the learning rates described in Section 3. "Opt ILA" refers to a 10-iteration ILA that uses the optimal layer (determined by evaluating on the transfer models); that layer is given in parentheses. Note that on the source model itself, ILA does not usually beat the baseline attack; this is expected, since we optimize for transfer and the ILA objective does not attempt to further attack the source model.

3.3 ILA on Channels

Providing further evidence for the effectiveness of our attack, we show that a modification of our method that attacks individual channels of a layer produces perturbations that exhibit the greatest transferability when attacking the most important channels, as measured by the standard deviation of each channel's activation values across the dataset (motivated by [17]).

We modify the loss function to target specific channels instead of entire layers, replacing F_l(x) in ILA with the output of a particular channel of layer l. The adversarial examples are generated on ResNet18 by targeting each of the 256 channels in layer 3 of the model, and we then evaluate their transferability to a GoogLeNet model. The results are shown in Figure 3.

Figure 3: Adversarial examples are generated on ResNet18 by targeting each of the 256 channels in layer 3 of the model; we then evaluate their transferability to GoogLeNet. Channels are sorted in order of increasing error rate on GoogLeNet. The standard deviation of each channel's activations is shown, along with a smoothed version (Savitzky-Golay filter, sliding window size 41, degree 2).
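A minimal way to restrict the objective to one channel is to slice the layer output before computing the ILA loss. The sketch below assumes the layer output is a standard (N, C, H, W) feature map and reuses the ila_objective helper from our earlier sketch; the function name is ours.

```python
import torch

def channel_ila_objective(delta_ref, delta_new, channel, alpha):
    """ILA objective restricted to one channel of an (N, C, H, W) perturbation.

    delta_ref and delta_new are the same layer-output differences as before;
    only the selected channel contributes to the magnitude and direction terms."""
    return ila_objective(delta_ref[:, channel:channel + 1],
                         delta_new[:, channel:channel + 1],
                         alpha)
```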

The results show that the adversarial examples are more transferable when generated by targeting channels with a higher standard deviation of activations (measured across the entire dataset). Since the variance of a channel's activations measures the information it contributes [17], channels with higher variance are more likely to contain information shared across models. This suggests that the increased transferability attained by attacking these channels comes from ILA disturbing the shared, channel-contributed information that is common among models.
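The channel-importance measure used here, the standard deviation of a channel's activations over the dataset, can be estimated roughly as follows. Treating every spatial position as an activation sample is our assumption; the helper reuses intermediate_output from the earlier sketch.

```python
import torch

@torch.no_grad()
def channel_activation_std(model, module, loader, device="cpu"):
    """Per-channel standard deviation of a layer's activations over a dataset.
    Every spatial position of every image counts as one activation sample."""
    model.eval()
    samples = []
    for x, _ in loader:
        feat = intermediate_output(model, module, x.to(device))      # (N, C, H, W)
        samples.append(feat.permute(1, 0, 2, 3).reshape(feat.size(1), -1).cpu())
    # Simple but memory-hungry: a running mean/variance would scale better.
    return torch.cat(samples, dim=1).std(dim=1)                      # shape (C,)
```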

4 Related Work

Most prior work on generating adversarial examples focuses on disturbing the softmax output space via the input space [5, 13, 14, 3]. Few papers have focused on perturbing mid-layer outputs. One work with a similar goal of disturbing mid-layer activations is [15], which crafts a single universal perturbation that produces as many spurious mid-layer activations as possible. In contrast to our work, it does not fine-tune an arbitrary perturbation to become more transferable, nor does it exploit the properties of a particular layer. Another work that perturbs mid-layer outputs is [18], which shows that, given a guide image and a very different target image, the target image can be modified to have an internal embedding very similar to that of the guide image. Unlike our work, it is not concerned with enhancing adversarial transferability or with the properties of specific layers, but rather with general questions about internal deep-network representations.

5 Conclusion

We introduce a novel attack, called ILA, that aims to enhance the transferability of any given adversarial example. Moreover, we show that there are specific intermediate layers that we can target with ILA to substantially increase transferability relative to standard attack baselines. We also show that a near-optimal (in terms of transfer) target layer can be selected without any knowledge of the specific transfer models.

Potential future work can focus on perturbing different sets of model components (e.g., different layers and channels) or on further explaining the mechanism that allows ILA to improve transferability. Fine-tuning perturbations produced in other settings (e.g., generative or universal perturbations) and extending our method to targeted attacks are also promising directions.

References

A. Intermediate Level Attack (ILA) Algorithm

Input: original image x in dataset X; adversarial example x′ generated for x by a baseline attack; function F_l that computes the intermediate output of layer l; ℓ∞ bound ε; learning rate lr; number of iterations n; loss weight α.
Output: fine-tuned adversarial example x″.

  x″ ← x′                                                        ▷ Initialize the adversarial example as x′
  Δy′_l ← F_l(x′) − F_l(x)                                        ▷ Reference perturbation at layer l
  i ← 0
  while i < n do
      Δy″_l ← F_l(x″) − F_l(x)
      L ← α · ‖Δy″_l‖ / ‖Δy′_l‖ + (Δy″_l · Δy′_l) / (‖Δy″_l‖ ‖Δy′_l‖)   ▷ Objective of Eq. (1)
      x″ ← x″ + lr · sign(∇_{x″} L)                                ▷ Signed-gradient ascent step
      x″ ← x + clip(x″ − x, −ε, ε)                                 ▷ Clip the disturbance
      x″ ← clip(x″, natural image range)                           ▷ Clip image to within the natural range
      i ← i + 1
  return x″
Algorithm 1: Intermediate Level Attack algorithm
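The following PyTorch-style sketch mirrors Algorithm 1 as reconstructed above, reusing the ila_objective and intermediate_output helpers from our earlier sketches. The signed-gradient update and the [0, 1] image range are our assumptions, not details taken from the paper's code.

```python
import torch

def ila_finetune(model, module, x, x_adv_ref, eps, lr, n_iters, alpha):
    """Fine-tune a pre-generated adversarial example x_adv_ref so that its
    perturbation at one intermediate layer grows while keeping its direction."""
    model.eval()
    with torch.no_grad():
        feat_clean = intermediate_output(model, module, x)
        delta_ref = intermediate_output(model, module, x_adv_ref) - feat_clean
    x_adv = x_adv_ref.clone().detach()
    for _ in range(n_iters):
        x_adv.requires_grad_(True)
        delta_new = intermediate_output(model, module, x_adv) - feat_clean
        loss = ila_objective(delta_ref, delta_new, alpha)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + lr * grad.sign()           # ascend the ILA objective
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # clip the disturbance
            x_adv = x_adv.clamp(0.0, 1.0)              # clip to the natural range (assumes [0, 1] inputs)
    return x_adv.detach()
```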

B. ILA Targeted at Different l Values: Full Results

As described in the main paper, we tested ILA against I-FGSM [5] and I-FGSM with momentum [3]. We test on a variety of models, namely ResNet18 [6], SENet18 [7], DenseNet121 [8], and GoogLeNet [20], trained on CIFAR-10. For each source model, each large-block output l in the source model, and each attack A, we generate adversarial examples for all images in the test set using A with 20 iterations as a baseline. We then generate adversarial examples using A with 10 iterations as input to ILA, which then runs for a further 10 iterations. The learning rates for I-FGSM, I-FGSM with momentum, and ILA are as described in Section 3, and we use the same ℓ∞ bound ε for all attacks. We then evaluate the transferability of the baseline and ILA adversarial examples on the other models by measuring their accuracies; we also compare with performance on the source model itself (marked with an asterisk in the tables below), in a similar fashion. A rough sketch of this evaluation loop is given after the layer list below.

Below is the list of layers (models from [12]) we picked for each source model, indexed starting from 0 in the experiment results:

  • ResNet18: conv, bn, layer1, layer2, layer3, layer4, linear (layer1-4 are basic blocks)

  • GoogLeNet: pre_layers, a3, b3, maxpool, a4, b4, c4, d4, e4, a5, b5, avgpool, linear

  • DenseNet121: conv1, dense1, trans1, dense2, trans2, dense3, trans3, dense4, bn, linear

  • SENet18: conv1, bn1, layer1, layer2, layer3, layer4, linear (layer1-4 are pre-activation blocks)
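For concreteness, the experimental grid above could be orchestrated roughly as follows. This is only an illustrative sketch using the helper functions defined in our earlier code blocks, with the hyperparameters (eps, learning rates, alpha) left as parameters because their exact values are not recoverable here.

```python
import torch

def run_grid(source_models, layer_lists, target_models, test_loader,
             eps, lr_attack, lr_ila, alpha):
    """For each source model and each candidate layer, build 20-iteration
    I-FGSM baselines and 10+10 ILA examples, then score every target model."""
    results = {}
    for src_name, src in source_models.items():
        for x, y in test_loader:                                     # one batch shown for brevity
            x_base = ifgsm(src, x, y, eps, lr_attack, n_iters=20)    # 20-iteration baseline
            x_seed = ifgsm(src, x, y, eps, lr_attack, n_iters=10)    # 10-iteration input to ILA
            for l, module in enumerate(layer_lists[src_name]):
                x_ila = ila_finetune(src, module, x, x_seed, eps, lr_ila, 10, alpha)
                for tgt_name, tgt in target_models.items():
                    results[(src_name, l, tgt_name)] = accuracy_on_adversarial(tgt, x_ila, y)
            for tgt_name, tgt in target_models.items():
                results[(src_name, "I-FGSM", tgt_name)] = accuracy_on_adversarial(tgt, x_base, y)
            break                                                    # sketch: a single batch only
    return results
```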

| Attack | Layer Index | ResNet18* | SENet18 | DenseNet121 | GoogLeNet |
| ILA | 0 | 12.3% | 34.7% | 36.9% | 44.5% |
| ILA | 1 | 12.6% | 36.5% | 38.5% | 46.0% |
| ILA | 2 | 15.4% | 42.6% | 43.8% | 51.1% |
| ILA | 3 | 12.7% | 34.7% | 35.0% | 43.8% |
| ILA | 4 | 7.4% | 27.4% | 27.4% | 36.1% |
| ILA | 5 | 5.5% | 42.4% | 43.2% | 57.0% |
| ILA | 6 | 6.0% | 43.4% | 44.9% | 58.0% |
| I-FGSM | – | 3.3% | 44.4% | 45.8% | 58.6% |
Table 2: Accuracies after attack using ResNet18 as the source model (I-FGSM baseline). * Source model.
| Attack | Layer Index | ResNet18* | SENet18 | DenseNet121 | GoogLeNet |
| ILA | 0 | 18.4% | 38.3% | 39.8% | 45.7% |
| ILA | 1 | 18.6% | 39.4% | 40.9% | 46.8% |
| ILA | 2 | 20.0% | 43.3% | 44.2% | 50.1% |
| ILA | 3 | 17.5% | 37.4% | 37.4% | 44.2% |
| ILA | 4 | 11.3% | 30.1% | 30.3% | 37.8% |
| ILA | 5 | 7.4% | 40.1% | 40.8% | 53.2% |
| ILA | 6 | 8.1% | 41.6% | 42.1% | 54.7% |
| I-FGSM Momentum | – | 5.7% | 33.8% | 35.1% | 45.1% |
Table 3: Accuracies after attack using ResNet18 as the source model (I-FGSM with momentum baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18* | DenseNet121 | GoogLeNet |
| ILA | 0 | 33.4% | 9.5% | 34.7% | 41.4% |
| ILA | 1 | 46.6% | 15.0% | 47.7% | 53.9% |
| ILA | 2 | 49.6% | 16.8% | 49.8% | 55.2% |
| ILA | 3 | 34.3% | 11.0% | 34.7% | 42.5% |
| ILA | 4 | 28.1% | 8.7% | 28.3% | 36.3% |
| ILA | 5 | 37.3% | 5.9% | 38.2% | 47.8% |
| ILA | 6 | 36.5% | 5.6% | 37.5% | 47.6% |
| I-FGSM | – | 36.8% | 2.4% | 38.0% | 48.4% |
Table 4: Accuracies after attack using SENet18 as the source model (I-FGSM baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18* | DenseNet121 | GoogLeNet |
| ILA | 0 | 34.9% | 12.1% | 36.0% | 42.4% |
| ILA | 1 | 44.9% | 16.6% | 45.9% | 52.2% |
| ILA | 2 | 47.6% | 18.2% | 48.5% | 53.4% |
| ILA | 3 | 35.3% | 13.1% | 35.5% | 42.2% |
| ILA | 4 | 29.6% | 10.8% | 29.6% | 37.1% |
| ILA | 5 | 36.2% | 6.9% | 37.1% | 46.3% |
| ILA | 6 | 35.6% | 6.6% | 36.3% | 45.9% |
| I-FGSM Momentum | – | 31.0% | 3.3% | 31.6% | 41.1% |
Table 5: Accuracies after attack using SENet18 as the source model (I-FGSM with momentum baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18 | DenseNet121* | GoogLeNet |
| ILA | 0 | 37.6% | 37.2% | 12.9% | 39.5% |
| ILA | 1 | 54.7% | 53.4% | 24.6% | 55.3% |
| ILA | 2 | 42.3% | 41.5% | 16.4% | 45.0% |
| ILA | 3 | 39.4% | 38.7% | 14.1% | 42.1% |
| ILA | 4 | 37.6% | 36.4% | 12.9% | 39.9% |
| ILA | 5 | 30.0% | 29.7% | 7.1% | 32.9% |
| ILA | 6 | 27.7% | 27.8% | 3.1% | 30.6% |
| ILA | 7 | 29.1% | 28.6% | 2.6% | 31.8% |
| ILA | 8 | 41.2% | 39.5% | 4.1% | 43.4% |
| ILA | 9 | 44.2% | 42.9% | 6.4% | 46.7% |
| I-FGSM | – | 45.1% | 43.4% | 2.8% | 47.3% |
Table 6: Accuracies after attack using DenseNet121 as the source model (I-FGSM baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18 | DenseNet121* | GoogLeNet |
| ILA | 0 | 42.3% | 41.7% | 20.9% | 43.9% |
| ILA | 1 | 53.0% | 52.5% | 27.4% | 54.2% |
| ILA | 2 | 44.4% | 44.2% | 22.1% | 46.6% |
| ILA | 3 | 42.1% | 41.7% | 19.9% | 44.0% |
| ILA | 4 | 40.5% | 39.3% | 18.4% | 42.3% |
| ILA | 5 | 33.6% | 33.2% | 12.0% | 35.7% |
| ILA | 6 | 30.2% | 30.1% | 5.4% | 32.4% |
| ILA | 7 | 30.4% | 29.8% | 4.3% | 32.3% |
| ILA | 8 | 39.2% | 38.3% | 5.8% | 41.3% |
| ILA | 9 | 41.9% | 40.6% | 10.0% | 43.8% |
| I-FGSM Momentum | – | 34.4% | 33.5% | 6.4% | 36.3% |
Table 7: Accuracies after attack using DenseNet121 as the source model (I-FGSM with momentum baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18 | DenseNet121 | GoogLeNet* |
| ILA | 0 | 45.7% | 44.5% | 42.3% | 6.3% |
| ILA | 1 | 56.1% | 55.2% | 54.3% | 9.4% |
| ILA | 2 | 58.3% | 57.0% | 54.8% | 12.7% |
| ILA | 3 | 34.3% | 33.3% | 29.7% | 3.8% |
| ILA | 4 | 44.4% | 42.6% | 39.9% | 7.8% |
| ILA | 5 | 42.1% | 40.4% | 37.9% | 7.5% |
| ILA | 6 | 40.1% | 38.4% | 35.9% | 6.7% |
| ILA | 7 | 37.0% | 35.9% | 33.1% | 5.4% |
| ILA | 8 | 34.5% | 33.7% | 30.3% | 4.1% |
| ILA | 9 | 35.2% | 34.5% | 29.7% | 1.3% |
| ILA | 10 | 58.2% | 56.8% | 52.7% | 2.0% |
| ILA | 11 | 57.9% | 56.1% | 51.6% | 2.3% |
| ILA | 12 | 55.8% | 53.8% | 49.0% | 2.5% |
| I-FGSM | – | 55.9% | 53.6% | 48.9% | 0.9% |
Table 8: Accuracies after attack using GoogLeNet as the source model (I-FGSM baseline). * Source model.
| Attack | Layer Index | ResNet18 | SENet18 | DenseNet121 | GoogLeNet* |
| ILA | 0 | 45.3% | 44.3% | 42.1% | 8.9% |
| ILA | 1 | 53.0% | 51.9% | 50.9% | 11.4% |
| ILA | 2 | 55.7% | 54.4% | 52.7% | 14.2% |
| ILA | 3 | 34.5% | 34.2% | 30.6% | 5.6% |
| ILA | 4 | 43.6% | 42.5% | 39.9% | 9.9% |
| ILA | 5 | 41.5% | 40.4% | 38.1% | 9.2% |
| ILA | 6 | 39.5% | 38.8% | 36.2% | 8.6% |
| ILA | 7 | 37.3% | 36.3% | 34.1% | 7.3% |
| ILA | 8 | 35.2% | 34.6% | 31.3% | 6.0% |
| ILA | 9 | 35.3% | 34.7% | 30.2% | 2.2% |
| ILA | 10 | 55.4% | 54.2% | 49.6% | 2.9% |
| ILA | 11 | 55.0% | 53.5% | 49.0% | 3.1% |
| ILA | 12 | 53.1% | 51.4% | 46.9% | 3.3% |
| I-FGSM Momentum | – | 44.6% | 43.0% | 38.9% | 1.5% |
Table 9: Accuracies after attack using GoogLeNet as the source model (I-FGSM with momentum baseline). * Source model.
Figure 4: Visualizations for the data in previous tables.

C. Disturbance Graphs

In this experiment, we used the same settings as the main experiment in Appendix B to generate adversarial examples, with only I-FGSM used as the baseline attack. The average disturbance of each set of adversarial examples is calculated at each layer. We repeated the experiment for all four models described in Appendix B. The l in each legend refers to the hyperparameter set in the ILA attack; the disturbance values are then computed at the layers indicated on the x-axis.

Figure 5: Layer disturbance graphs on different source models.

D. Fooling with Different ε Values

In this experiment, we use ILA (with an I-FGSM baseline attack) to generate adversarial examples on ResNet18 for several values of ε. We then evaluate the transferability of these adversarial examples and compare against the I-FGSM baseline.

Figure 6: Transferability graphs for different epsilons

E. Learning Rate Ablation

We set the number of iterations to 20 for both I-FGSM and I-FGSM with momentum and experimented with different learning rates on ResNet18 as the source model. We then evaluate the other models' accuracies on the generated adversarial examples.

| learning rate | ResNet18* | SENet18 | DenseNet121 | GoogLeNet |
| 0.002 | 3.3% | 44.9% | 47.1% | 59.3% |
| 0.008 | 0.8% | 45.6% | 46.8% | 60.0% |
| 0.014 | 0.6% | 47.2% | 49.4% | 59.5% |
| 0.02 | 1.3% | 46.8% | 51.4% | 59.8% |
Table 10: I-FGSM. * Model identical to the source model.
| learning rate | ResNet18* | SENet18 | DenseNet121 | GoogLeNet |
| 0.002 | 5.9% | 35.0% | 36.6% | 46.1% |
| 0.008 | 0.6% | 43.0% | 43.8% | 56.1% |
| 0.014 | 0.4% | 43.6% | 45.2% | 55.9% |
| 0.02 | 0.4% | 44.1% | 46.4% | 57.2% |
Table 11: I-FGSM with momentum. * Model identical to the source model.