In the context of Convolutional Neural Networks (CNNs), adversarial examples are imperceptible perturbations of natural images, and they have been well studied from the attack perspective [5, 2, 16, 3, 1]. Concerns have been raised over the vulnerability of CNNs to these perturbations in real-world contexts where they are used, such as online content filtering systems and self-driving cars [4, 11].
Moreover, our current understanding of these perturbations is quite limited. In particular, they are known to exhibit black-box transfer, meaning that perturbations crafted to fool one model often also fool another model. Though some works have attempted to shed light on this phenomenon, many questions remain unanswered. In our work, we aim to provide more insight into this subject by exploring how perturbations crafted on a source model can be enhanced for greater black-box transfer. We attempt to do this by maximizing their effect on one of the source model's intermediate layers.
Our contributions are as follows:
We propose a novel method, Intermediate Level Attack (ILA), that enhances black-box adversarial transferability by increasing the perturbation on a pre-specified layer of a model.
When attacking a model, it is generally effective to focus on perturbing the last layer. However, we show that when optimizing for black-box transfer it is better to emphasize perturbing an intermediate layer.
Additionally, we provide a method for selecting an optimal layer for transferability using the source model alone, thus obviating the need for evaluation on transfer models during hyperparameter optimization.
2 Motivation and Approach
Throughout this paper, we focus on non-targeted, image-dependent attacks. In a typical attack setting, most methods, such as I-FGSM, PGD, and DeepFool, primarily attack the last layer, with loss objectives formulated directly in terms of the cross-entropy between the softmax output and some desired discrete distribution. Although these attack methods produce transferable adversarial examples, they do not explicitly optimize for this property. Our attack, on the other hand, focuses on enhancing the transferability of adversarial examples by perturbing intermediate layers.
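To make the contrast concrete, below is a minimal I-FGSM sketch in PyTorch; the step size, $\epsilon$, iteration count, and image-range clamp are illustrative assumptions rather than the exact settings used in our experiments. The point is simply that the loss is computed only at the softmax output.

```python
# Minimal I-FGSM sketch (illustrative defaults, not the paper's exact settings).
# Each step ascends the sign of the cross-entropy gradient and projects back to the eps-ball.
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, epsilon=0.03, lr=0.01, n_iter=20):
    """Iterative FGSM: maximize cross-entropy of the true label under an L-inf bound."""
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + lr * grad.sign()
            x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)  # L-inf projection
            x_adv = x_adv.clamp(0.0, 1.0)  # assumes inputs in [0, 1]
    return x_adv.detach()
```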
Motivated by prior work on feature visualization, we view a CNN's convolutional layers as a learned feature hierarchy, with the earliest layers capturing primitive features and the latest layers capturing the most high-level features. We hypothesize that, for a given model, the low-level feature representations are similar to those of other models until they begin to diverge at a specific layer (where models learn different representations of high-level features). Thus, we aim to find and attack the latest layer at which the learned feature representations are still general enough to be found in other models. Adversarial examples crafted to attack such an intermediate layer should be more transferable, as they effectively attack an internal representation that is common to most models.
Based on the above motivation, we propose the following attack, which we call the Intermediate Level Attack (ILA). We define $F_l(x)$ as the output of layer $l$ of a network $F$ given an input $x$. Given an adversarial example $x'$ generated by an attack method $A$ for a natural image $x$ (generated after $n$ iterations; the attack does not have to be iterative, but in this paper we focus on iterative attacks), and a specific layer $l$ of a network $F$, we aim to produce an $x''$ that solves:

$$\max_{x''}\;\; \alpha\cdot\frac{\|\Delta y''_l\|}{\|\Delta y'_l\|} \;+\; \frac{\Delta y''_l \cdot \Delta y'_l}{\|\Delta y''_l\|\,\|\Delta y'_l\|} \qquad \text{s.t.}\;\; \|x''-x\|_\infty \le \epsilon,$$

where $\Delta y'_l = F_l(x') - F_l(x)$ and $\Delta y''_l = F_l(x'') - F_l(x)$.
In practice, we iterate to maximize this objective (the full algorithm is given in Appendix A). Note that the layer $l$ and the loss weight $\alpha$ are hyperparameters of this attack, and that $\Delta y''_l \cdot \Delta y'_l$ denotes a dot product. The attack assumes that $x'$ is a pre-generated adversarial example, so it can be viewed as fine-tuning $x'$. We fine-tune for a greater norm of the output difference at layer $l$ (which we hope is conducive to greater transferability) while attempting to preserve the direction of that difference, so as not to destroy the original adversarial structure.
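A minimal sketch of this loss in PyTorch is given below; the flattening of the layer output and the default weight `alpha` are illustrative assumptions rather than details taken from the released implementation.

```python
# Sketch of the ILA objective described above: reward a larger mid-layer disturbance norm
# while keeping its direction aligned with that of the original adversarial example.
import torch

def ila_loss(feat_nat, feat_adv_orig, feat_adv_new, alpha=0.5):
    """feat_* are F_l(x), F_l(x'), F_l(x''); alpha is an illustrative loss weight."""
    d_orig = (feat_adv_orig - feat_nat).flatten(1)   # Delta y'_l  (fixed reference)
    d_new  = (feat_adv_new  - feat_nat).flatten(1)   # Delta y''_l (being optimized)
    magnitude = d_new.norm(dim=1) / d_orig.norm(dim=1)                        # larger norm
    direction = torch.nn.functional.cosine_similarity(d_new, d_orig, dim=1)  # preserved direction
    return (alpha * magnitude + direction).mean()
```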
3 Experiments

We start by showing that our method increases transferability for various base attack methods, including I-FGSM and I-FGSM with momentum. We test on a variety of models, namely ResNet18, SENet18, DenseNet121, and GoogLeNet. Architecture details are specified in Appendix B; note that in the results below, instead of architecture-specific layer names, we refer to layer indices (e.g., an index denoting the last layer of the first block). Our models are trained on CIFAR-10 with the code and hyperparameters of the referenced training repository.
For a fair comparison, we use the output of an attack run for 20 iterations as a baseline. ILA is run for 10 iterations, starting from the output of the attack after 10 iterations. The learning rate is fixed for both I-FGSM and I-FGSM with momentum (tuning the learning rate does not substantially affect transferability, as shown in Appendix E). The learning rate for ILA is tuned to exhibit near-optimal average attack strength on the transferred models. Finally, we limit ourselves to generating adversarial examples for natural images $x$ such that $\|x'' - x\|_\infty \le \epsilon$ (our CIFAR-10 images are normalized to a fixed range; results for different values of $\epsilon$ are given in Appendix D).
To evaluate transferability, we test the accuracies of different models on adversarial examples generated from all CIFAR-10 test images. We then show that we can select a nearly-optimal layer for transferability using only the source model.
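For illustration, a minimal transfer-evaluation sketch (assuming PyTorch tensors of pre-generated adversarial examples and their true labels; the helper name is ours, not the paper's) is:

```python
# Lower accuracy of a transfer model on the adversarial examples means better transferability.
import torch

@torch.no_grad()
def transfer_accuracy(transfer_model, adv_examples, labels, batch_size=128):
    transfer_model.eval()
    correct = 0
    for i in range(0, len(adv_examples), batch_size):
        x = adv_examples[i:i + batch_size]
        y = labels[i:i + batch_size]
        pred = transfer_model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
    return correct / len(adv_examples)
```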
3.1 ILA Targeted at Different $l$ Values
To confirm the effectiveness of our attack, we fix a single source model and baseline attack method, and then check how ILA transfers to the other models compared to the baseline attack. Results for ResNet18 as the source model and I-FGSM as the baseline method are shown in Figure 2. Comparing the results of both methods on the source model and the other models, we see that ILA outperforms I-FGSM when targeting any intermediate layer, especially at the optimal value of the layer hyperparameter $l$. Note that after fine-tuning, the adversarial examples perform worse on the source model, which is expected since ILA does not optimize for attacking the source model. Full results are shown in Appendix B.
3.2 ILA with a Pre-Determined $l$ Value
Above we demonstrated that adversarial examples exhibit the strongest transferability when targeting a specific layer. We wish to pre-determine this optimal value based on the source model alone, so as to avoid tuning the hyperparameter $l$ against transfer models. To do this, we examine the relationship between transferability and the per-layer disturbance values of a given ILA attack. We define the disturbance values of an ILA perturbation as the values of $\|F_m(x'') - F_m(x)\|$ for every layer $m$ in the source model. For each value of $l$ in ResNet18 (the set of layers is defined for each architecture in Appendix B), we plot the disturbance values of the corresponding ILA attack in Figure 2. The same graphs are given for the other models in Appendix C. We observe that for a given source model and a given ILA attack (i.e., a specific value of $l$), the disturbance value usually reaches its peak at the targeted layer. Furthermore, we notice that the adversarial examples producing the latest peak in the graph are typically the ones with the highest transferability across all transfer models (Table 1). Given this observation, we propose that the latest $l$ that still exhibits a peak is a nearly optimal choice (in terms of maximizing transferability); Figure 2 illustrates this choice for ResNet18. Table 1 supports our claim and shows that selecting this layer gives an optimal or near-optimal attack.
Intuitively, we choose a layer with a peak because we want a layer that we can perturb significantly. We choose the latest such layer because the perturbation $\Delta y''_l$ we add at that layer is fairly unstructured (roughly random noise), so passing it through subsequent layers will only dampen it. Choosing the latest layer therefore minimizes this dampening and maximizes the chance that the perturbation influences the model output.
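A sketch of this selection heuristic is below; `run_ila` and `get_features` are assumed helper functions (one fine-tunes at a given layer, the other returns per-layer activations), not code from the paper.

```python
# For each candidate ILA layer l, compute the disturbance curve ||F_m(x'') - F_m(x)|| over all
# layers m of the source model, and keep the latest l whose curve still peaks at l itself.
import torch

def mean_disturbance(model, x_nat, x_adv, get_features):
    """Average per-layer L2 disturbance between natural and adversarial activations."""
    feats_nat = get_features(model, x_nat)   # list of tensors, one per layer
    feats_adv = get_features(model, x_adv)
    return [(fa - fn).flatten(1).norm(dim=1).mean().item()
            for fn, fa in zip(feats_nat, feats_adv)]

def select_layer(model, x_nat, x_adv_base, layers, get_features, run_ila):
    best = layers[0]
    for l in layers:
        x_ila = run_ila(model, x_nat, x_adv_base, layer=l)      # fine-tune at layer l
        curve = mean_disturbance(model, x_nat, x_ila, get_features)
        if max(range(len(curve)), key=curve.__getitem__) == l:  # does the curve peak at l?
            best = l                                            # keep the latest such l
    return best
```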
| Source | Transfer | I-FGSM: 20 Itr | 10 Itr + ILA | Opt ILA (layer) | Momentum: 20 Itr | 10 Itr + ILA | Opt ILA (layer) |
|---|---|---|---|---|---|---|---|
| ResNet18 | ResNet18* | 3.3% | 7.4% | 5.5% (5) | 5.7% | 11.3% | 7.4% (5) |
| ResNet18 | SENet18 | 44.4% | 27.4% | 27.4% (4) | 33.8% | 30.1% | 30.1% (4) |
| ResNet18 | DenseNet121 | 45.8% | 27.4% | 27.4% (4) | 35.1% | 30.3% | 30.3% (4) |
| ResNet18 | GoogLeNet | 58.6% | 36.1% | 36.1% (4) | 45.1% | 37.8% | 37.8% (4) |
| SENet18 | ResNet18 | 36.8% | 28.1% | 28.1% (4) | 31.0% | 29.6% | 29.6% (4) |
| SENet18 | SENet18* | 2.4% | 8.7% | 5.6% (6) | 3.3% | 10.8% | 6.6% (6) |
| SENet18 | DenseNet121 | 38.0% | 28.3% | 28.3% (4) | 31.6% | 29.6% | 29.6% (4) |
| SENet18 | GoogLeNet | 48.4% | 36.3% | 36.3% (4) | 41.1% | 37.1% | 37.1% (4) |
|  |  | 0.9% | 1.3% | 1.3% (9) | 1.5% | 2.2% | 2.2% (9) |

*Transfer model identical to the source model.
3.3 ILA on Channels
Further evidence for the effectiveness of our attack comes from a modification of our method that attacks each individual channel of a layer: the resulting perturbations exhibit the greatest transferability when attacking the most important channels, as measured by the standard deviation of each channel's activation values across the dataset (a channel-importance measure motivated by prior work).
We modify the loss function to target specific channels instead of specific layers, by redefining the layer output used in ILA to be the output of a particular channel rather than of the whole layer. Adversarial examples are generated on ResNet18 by targeting each of the 256 channels in layer 3 of the model, and their transferability is then evaluated on a GoogLeNet model. The results are shown in Figure 3.
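A sketch of this channel-level variant is below; `get_layer_output` is an assumed hook-based helper, and the loss weight `alpha` is illustrative.

```python
# Rank channels of a chosen layer by the standard deviation of their activations across the
# dataset, and restrict the ILA loss to a single channel's activation map.
import torch

@torch.no_grad()
def channel_std(model, loader, get_layer_output):
    """Std of each channel's activations at the target layer, over the whole dataset."""
    vals = []
    for x, _ in loader:
        feats = get_layer_output(model, x)                  # shape (N, C, H, W)
        vals.append(feats.permute(1, 0, 2, 3).flatten(1))   # (C, N*H*W) per batch
    return torch.cat(vals, dim=1).std(dim=1)                # (C,)

def ila_channel_loss(feat_nat, feat_adv_orig, feat_adv_new, channel, alpha=0.5):
    """Same form as the layer-level ILA loss, applied to one channel only."""
    d_orig = (feat_adv_orig[:, channel] - feat_nat[:, channel]).flatten(1)
    d_new  = (feat_adv_new[:,  channel] - feat_nat[:,  channel]).flatten(1)
    magnitude = d_new.norm(dim=1) / d_orig.norm(dim=1)
    direction = torch.nn.functional.cosine_similarity(d_new, d_orig, dim=1)
    return (alpha * magnitude + direction).mean()
```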
The results show that adversarial examples are more transferable when generated by targeting channels with higher standard deviation (measured across the entire dataset). Since the variance of a channel's activations measures how much information it contributes, channels with higher variance are more likely to carry information shared across models. This suggests that the increased transferability attained by attacking these channels comes from ILA disturbing information that is common to most models.
4 Related Work
Most prior work on generating adversarial examples focuses on disturbing the softmax output space via the input space [5, 13, 14, 3]. Few papers have focused on perturbing mid-layer outputs. One work with the similar goal of disturbing mid-layer activations is that of Mopuri et al., which crafts a single universal perturbation that produces as many spurious mid-layer activations as possible. In contrast to our work, it does not fine-tune an arbitrary perturbation to become more transferable, nor does it exploit the properties of a particular layer. Another work that perturbs mid-layer outputs is that of Sabour et al., which shows that, given a guide image and a very different target image, the target image can be modified to have an embedding very similar to that of the guide image. Unlike our work, it is not concerned with adversarial transferability or the properties of specific layers, but rather with general questions about internal deep-network representations.
5 Conclusion

We introduce a novel attack, coined ILA, that aims to enhance the transferability of any given adversarial example. Moreover, we show that there are specific intermediate layers that we can target with ILA to substantially increase transferability relative to standard attack baselines. We also show that a near-optimal (in terms of transfer) target layer can be selected without any knowledge of the specific transfer models.
Potential future work includes perturbing different sets of model components (e.g., different layers and channels) and further explaining the mechanism that allows ILA to improve transferability. Fine-tuning perturbations produced in other settings (e.g., generative or universal perturbations) and extending our method to targeted attacks are also promising directions.
- Athalye et al.  A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
- Carlini and Wagner  N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
- Dong et al.  Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li. Boosting adversarial attacks with momentum. 2017.
- Eykholt et al.  K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust physical-world attacks on deep learning models. 2017.
- Goodfellow et al.  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- Hu et al.  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
- Huang et al.  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
- Krizhevsky  A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Kurakin et al.  A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
- Liu  K. Liu. Pytorch cifar10. https://github.com/kuangliu/pytorch-cifar, 2018.
- Madry et al.  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.
- Moosavi-Dezfooli et al.  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: A simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.
- Mopuri et al.  K. R. Mopuri, A. Ganeshan, and R. V. Babu. Generalizable data-free objective for crafting universal adversarial perturbations. IEEE transactions on pattern analysis and machine intelligence, 2018.
- Papernot et al.  N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In AsiaCCS, 2017.
- Polyak and Wolf  A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
- Sabour et al.  S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet. Adversarial manipulation of deep representations. CoRR, abs/1511.05122, 2015.
- Szegedy et al.  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
- Szegedy et al.  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
- Tramèr et al.  F. Tramèr, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel. The space of transferable adversarial examples. CoRR, abs/1704.03453, 2017.
- Zeiler and Fergus  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
A. Intermediate Level Attack (ILA) Algorithm
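Below is a sketch of the ILA fine-tuning loop consistent with the description in Section 2; it assumes PyTorch, a forward hook that exposes the layer-$l$ output, the flexible loss with weight `alpha`, sign-gradient ascent steps, and an $L_\infty$ projection. The default values are chosen for illustration and are not taken from the released code.

```python
# ILA fine-tuning sketch: start from a pre-generated adversarial example x_adv0 and enlarge the
# layer-l disturbance while preserving its direction, under an L-inf constraint around x.
import torch
import torch.nn.functional as F

def ila(model, layer_module, x, x_adv0, epsilon=0.03, lr=0.01, n_iter=10, alpha=0.5):
    feats = {}
    hook = layer_module.register_forward_hook(
        lambda mod, inp, out: feats.update(out=out))

    def layer_output(inp):
        model(inp)                      # forward pass fills feats["out"] via the hook
        return feats["out"].flatten(1)

    with torch.no_grad():
        f_nat = layer_output(x)
        d_orig = layer_output(x_adv0) - f_nat        # fixed reference direction, Delta y'_l

    x_adv = x_adv0.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        d_new = layer_output(x_adv) - f_nat          # Delta y''_l
        loss = (alpha * d_new.norm(dim=1) / d_orig.norm(dim=1)
                + F.cosine_similarity(d_new, d_orig, dim=1)).mean()
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + lr * grad.sign()                           # ascend the ILA loss
            x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)                              # assumes [0, 1] inputs
    hook.remove()
    return x_adv.detach()
```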
B. ILA Targeted at Different $l$ Values: Full Results
As described in the main paper, we tested ILA against I-FGSM and I-FGSM with momentum. We test on a variety of models trained on CIFAR-10: ResNet18, SENet18, DenseNet121, and GoogLeNet. For each source model, each large block output in the source model, and each attack, we generate adversarial examples for all images in the test set using the attack with 20 iterations as a baseline. We then generate adversarial examples using the attack with 10 iterations as input to ILA, which then runs for a further 10 iterations. Learning rates are set separately for I-FGSM, I-FGSM with momentum, and ILA, and the same constraint $\|x'' - x\|_\infty \le \epsilon$ is used for all attacks. We then evaluate the transferability of the baseline and ILA adversarial examples by testing the accuracies of the other models on them, and we report their performance on the source model in the same fashion.
Below is the list of layers we picked for each source model (model implementations from the referenced repository); layers are indexed starting from 0 in the experimental results:
ResNet18: conv, bn, layer1, layer2, layer3, layer4, linear (layer1-4 are basic blocks)
GoogLeNet: pre_layers, a3, b3, maxpool, a4, b4, c4, d4, e4, a5, b5, avgpool, linear
DenseNet121: conv1, dense1, trans1, dense2, trans2, dense3, trans3, dense4, bn, linear
SENet18: conv1, bn1, layer1, layer2, layer3, layer4, linear (layer1-4 are pre-activation blocks)
C. Disturbance Graphs
In this experiment, we used the same settings as the main experiment in Appendix B to generate adversarial examples, with only I-FGSM used as the baseline attack. The average disturbance of each set of adversarial examples is calculated at each layer. We repeated the experiment for all four models described in Appendix B. Note that the $l$ in the legend refers to the hyperparameter set in the ILA attack, while the disturbance values are computed at the layers indicated on the x-axis.
D. Fooling with Different $\epsilon$ Values
In this experiment, we use ILA to generate adversarial examples with an I-FGSM baseline attack on ResNet18 for different values of $\epsilon$. We then evaluate the transferability of the resulting adversarial examples against those of the I-FGSM baseline.
E. Learning Rate Ablation
We set the number of iterations to 20 for both I-FGSM and I-FGSM with momentum and experimented with different learning rates on ResNet18. We then evaluate the accuracies of different models on the generated adversarial examples.
| learning rate | ResNet18* | SENet18 | DenseNet121 | GoogLeNet |
|---|---|---|---|---|

*Transfer model identical to the source model.