have shown that Convolutional Neural Networks (CNNs) are particularly vulnerable to such adversarial attacks. The existence of these adversarial attacks suggests that our architectures and training procedures produce fundamental blind spots in our models, and that our models are not learning the same features that humans do.
These adversarial attacks are of interest for more than the theoretical issues they pose – concerns have also been raised over the vulnerability of CNNs to these perturbations in the real world, where they are used for mission-critical applications such as online content filtering systems and self-driving cars [6, 13]. As a result, a great deal of effort has been dedicated to studying adversarial perturbations. Much of the literature develops new attacks that use different perceptibility metrics [2, 25, 23], operate under different security settings (black box/white box) [21, 1], or improve efficiency. Defending against adversarial attacks is also well studied. In particular, adversarial training, in which models are trained on adversarial examples, has been shown to be very effective under certain assumptions [16, 24].
Adversarial attacks can be classified into two categories: white-box attacks and black-box attacks. In white-box attacks, information about the model (i.e., its architecture, gradient information, etc.) is accessible, whereas in black-box attacks, the attacker has access only to the model's predictions. Black-box attacks are the bigger concern for real-world applications, for the obvious reason that such applications typically will not reveal their models publicly, especially when security is at stake (e.g., objectionable-content filters in social media). Consequently, black-box attacks mostly rely on the transferability of adversarial examples.
Moreover, adversarial examples generated with white-box attacks will sometimes successfully fool an unrelated model as well. This phenomenon is known as “transferability.” However, black-box success rates are nearly always lower than their white-box counterparts, suggesting that white-box attacks overfit the source model. Different adversarial attacks transfer at different rates, but most are not optimized specifically for transferability. This paper aims to increase the transferability of a given adversarial example. To this end, we propose a novel method, the Intermediate Level Attack (ILA), that fine-tunes a given adversarial example by examining its representations in intermediate feature maps.
Our method draws upon two primary intuitions. First, while we do not expect the direction found by the original adversarial attack to be optimal for transferability, we do expect it to be a reasonable proxy, as it still transfers far better than random noise would. As such, when searching for a more transferable attack, we should be willing to stray from the original attack direction in exchange for an increased norm (attacks with a higher epsilon constraint are generally more effective, including in the black-box setting). However, the ineffectiveness of random noise on neural networks shows that straying too far from the original direction costs us effectiveness, even if the norm increases modestly. Thus, we must balance staying close to the original direction against increasing the norm. A natural way to do so is to maximize our projection onto the original adversarial perturbation.
Second, we note that although for transferability we would like to sacrifice some direction in exchange for a larger norm, we cannot do so in image space without changing perceptibility, since under the standard epsilon constraints norm and perceptibility are intrinsically tied. In an intermediate feature map, however, perceptibility (in image space) is no longer intrinsically tied to the norm, and we may be able to increase the norm of our perturbation in that feature space significantly with no change in perceptibility back in image space. We will investigate the effects of using different intermediate feature maps on transferability, and provide insights drawn from empirical observations.
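This decoupling can be illustrated with a minimal sketch of our own (not from the paper): under a toy linear "intermediate layer" W, two image-space perturbations with identical norm can induce feature-space disturbances of very different magnitude.

```python
import numpy as np

# Toy stand-in for an intermediate feature map: a fixed linear layer W
# that amplifies one input direction and attenuates the other.
W = np.diag([10.0, 0.1])

# Two image-space perturbations with identical L2 norm
# (i.e., the same perceptibility budget in image space).
d1 = np.array([1.0, 0.0])
d2 = np.array([0.0, 1.0])

# Their norms in the intermediate feature space differ by a factor of 100.
f1 = float(np.linalg.norm(W @ d1))  # 10.0
f2 = float(np.linalg.norm(W @ d2))  # 0.1
```

Choosing the perturbation direction well can thus buy a large feature-space disturbance at no extra perceptibility cost, which is exactly the degree of freedom ILA exploits.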
Our contributions are as follows:
We propose a novel method, ILA, that enhances black-box adversarial transferability by increasing the perturbation on a pre-specified layer of a model. We conduct a thorough evaluation that shows our method improves upon state-of-the-art methods on multiple models across multiple datasets. See Sec. 4.
Additionally, we provide insights into the effects of optimizing for adversarial examples in intermediate feature maps. See Sec. 5.
2 Background and Related Work
2.1 General Adversarial Attacks
An adversarial example for a given model is generated by augmenting an image so that, in the model's decision space, its representation moves into the wrong region. Most prior work on generating adversarial examples focuses on disturbing the softmax output space via the input space [7, 16, 19, 5]. Some representative white-box attacks are the following:
Gradient-Based Approaches The Fast Gradient Sign Method (FGSM) [7] generates an adversarial example with the update rule:

$x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(F(x), y))$

It is the linearization of the maximization problem

$\max_{\|x' - x\|_\infty \le \epsilon} J(F(x'), y)$

where $x$ represents the original image; $x'$ is the adversarial example; $y$ is the ground-truth label; $J$ is the loss function; and $F$ is the model up to the final softmax layer. Its iterative version (I-FGSM) applies the FGSM update iteratively. Intuitively, this fools the model by increasing its loss, which eventually causes misclassification. In other words, it finds perturbations in the direction of the loss gradient of the last layer (i.e., the softmax layer).
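As a concrete illustration, here is a hedged sketch of a single FGSM step, with a toy logistic model standing in for the network $F$ so that the input gradient is analytic (all values are made up for illustration):

```python
import numpy as np

def xent(x, y, w, b):
    """Binary cross-entropy J for a toy logistic model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, y, w, b, eps):
    """One FGSM step: x' = x + eps * sign(grad_x J).

    For the logistic stand-in, the input gradient is analytic:
    dJ/dx = (p - y) * w.  A CNN would compute this by backpropagation;
    I-FGSM simply applies this update repeatedly with clipping.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Toy example: the step raises the loss while obeying the L-inf budget.
x = np.array([1.0, -2.0])
w = np.array([0.5, 1.5])
b, y, eps = 0.0, 1.0, 0.1
x_adv = fgsm(x, y, w, b, eps)
```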
Decision-Boundary-Based Approaches Deepfool [19] iteratively produces approximately the closest adversarial example by stepping towards the nearest decision boundary. Universal Adversarial Perturbation [18] uses this idea to craft a single image-agnostic perturbation that pushes most of a dataset's images across a model's classification boundary.
Model Ensemble Attack The above methods are designed to yield the best performance only on the model they are tuned to; often they do not transfer to other models. In contrast, [15] proposed the Model-based Ensembling Attack, which transfers better by avoiding dependence on any specific model. It uses $k$ models with softmax outputs, notated as $J_1, \dots, J_k$, and solves

$\arg\min_{x'} \; -\log\Big(\big(\textstyle\sum_{i=1}^{k} \alpha_i J_i(x')\big) \cdot \mathbb{1}_{y'}\Big) + \lambda\, d(x, x')$

where $y'$ is the target label, $\mathbb{1}_{y'}$ is its one-hot encoding, $d$ is a distance function, and the ensemble weights $\alpha_i \ge 0$ satisfy $\sum_i \alpha_i = 1$.
Using such an approach, the authors showed that the decision boundaries of different CNNs align with each other. Consequently, an adversarial example that fools multiple models is likely to fool other models as well.
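The ensemble objective above can be sketched as follows (illustrative Python; the softmax outputs, weights, and target label are all hypothetical values, not computed by real models):

```python
import numpy as np

def ensemble_loss(probs, alphas, target):
    """-log of the alpha-weighted ensemble probability of the target class.

    probs:  list of softmax output vectors, one per model (J_1, ..., J_k)
    alphas: ensemble weights, assumed to sum to 1
    Minimizing this over the input pushes all k models toward the target
    label at once, which is what makes the resulting example transfer better.
    """
    p = sum(a * p_i for a, p_i in zip(alphas, probs))
    return float(-np.log(p[target]))

# Two hypothetical models' softmax outputs for some candidate x'.
probs = [np.array([0.7, 0.3]), np.array([0.6, 0.4])]
loss = ensemble_loss(probs, alphas=[0.5, 0.5], target=1)  # -log(0.35)
```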
2.2 Intermediate-layer Adversarial Attacks
A small number of studies have focused on perturbing mid-layer outputs. One such work [20] crafts a single universal perturbation that produces as many spurious mid-layer activations as possible. Another is the Feature Adversary Attack [28, 22], which performs a targeted attack by minimizing the distance between the representations of two images in internal neural network layers (instead of in the output layer). Rather than emphasizing adversarial transferability, however, it focuses on internal representations. Results in the paper show that even given a guide image and a dissimilar target image, it is possible to perturb the target image to produce an embedding very similar to that of the guide image.
Another recent work that examines intermediate layers to increase transferability is TAP [29]. It attempts to maximize the norm of the difference between the original image and the adversarial example at all layers. In contrast to our approach, it does not exploit any specific layer's feature representations, instead maximizing the norm across all layers. In addition, unlike TAP, which generates an entirely new adversarial example, our method fine-tunes existing adversarial examples, allowing us to leverage existing adversarial attacks. We also show that our method improves upon theirs in Table 2.
3 Approach
Based on the motivation presented in the introduction, we propose the Intermediate Level Attack (ILA) framework, shown in Algorithm 2. Based on the form of the loss function $L$, we propose the following two variants. Note that we define $F_l(x)$ as the output of layer $l$ of a network $F$ given an input $x$.
3.1 Intermediate Level Attack Projection (ILAP) Loss
Given an adversarial example $x'$ generated by attack method $A$ for natural image $x$, we wish to enhance its transferability by focusing on a layer $l$ of a given network $F$. Although $x' - x$ is not the optimal direction for transferability, we view it as a hint for this direction, treating it as a directional guide towards becoming more adversarial, with emphasis on the disturbance at layer $l$. Our attack will attempt to find an $x''$ whose disturbance $\Delta y''_l = F_l(x'') - F_l(x)$ matches the direction of $\Delta y'_l = F_l(x') - F_l(x)$ while maximizing the norm of the disturbance in that direction. The high-level idea is that we want to maximize $\mathrm{proj}_{\Delta y'_l}\, \Delta y''_l$, for the reasons expressed in Section 1. Since this is a maximization, we can disregard constants, and the objective simply becomes the dot product. The objective we solve is given below, and we term it the ILA projection loss:

$L(y'_l, y''_l) = -\Delta y'_l \cdot \Delta y''_l$
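A minimal sketch of this projection loss (our illustration; the feature maps are passed in as arrays rather than computed by a real network):

```python
import numpy as np

def ilap_loss(feat_x, feat_x_adv, feat_x_new):
    """ILA projection loss, sketched.

    feat_x      : F_l(x),   layer-l features of the clean image
    feat_x_adv  : F_l(x'),  features of the reference adversarial example
    feat_x_new  : F_l(x''), features of the example being fine-tuned
    Returns the negative dot product of the two feature disturbances;
    minimizing it maximizes the projection of the new disturbance onto
    the reference disturbance.
    """
    dy_ref = (feat_x_adv - feat_x).ravel()
    dy_new = (feat_x_new - feat_x).ravel()
    return -float(dy_ref @ dy_new)
```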
3.2 Intermediate Level Attack Flexible (ILAF) Loss
Since $x'$ may not point in the optimal direction to optimize towards, we may want to give the above loss greater flexibility. We do this by explicitly balancing norm maximization against fidelity to the adversarial direction at layer $l$. We note that, in a rough sense, ILAF optimizes for the same thing as ILAP. We augment the above loss by separating the maintenance of the adversarial direction from the magnitude, and control the trade-off with an additional parameter $\alpha$ to obtain the following loss, termed the ILA flexible loss:

$L(y'_l, y''_l) = -\alpha \cdot \dfrac{\|\Delta y''_l\|_2}{\|\Delta y'_l\|_2} - \dfrac{\Delta y''_l}{\|\Delta y''_l\|_2} \cdot \dfrac{\Delta y'_l}{\|\Delta y'_l\|_2}$

where $\Delta y'_l = F_l(x') - F_l(x)$ and $\Delta y''_l = F_l(x'') - F_l(x)$.
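In the same style, the flexible loss can be sketched as follows (our reconstruction, not verbatim code from the paper; the first term rewards norm gain, the second rewards cosine similarity to the reference direction):

```python
import numpy as np

def ilaf_loss(feat_x, feat_x_adv, feat_x_new, alpha):
    """ILA flexible loss, sketched: alpha trades off the magnitude term
    (norm of the new disturbance relative to the reference) against the
    angle term (cosine similarity between the two disturbances)."""
    dy_ref = (feat_x_adv - feat_x).ravel()
    dy_new = (feat_x_new - feat_x).ravel()
    magnitude = np.linalg.norm(dy_new) / np.linalg.norm(dy_ref)
    angle = (dy_new @ dy_ref) / (np.linalg.norm(dy_new) * np.linalg.norm(dy_ref))
    return -float(alpha * magnitude + angle)  # to be minimized
```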
In practice, we choose either the ILAP or the ILAF loss and iterate $n$ times to attain an approximate solution to the respective maximization objective. Note that the projection loss has only the layer $l$ as a hyperparameter, whereas the flexible loss also has the loss weight $\alpha$. The above attack assumes that $x'$ is a pre-generated adversarial example; as such, the attack can be viewed as fine-tuning the adversarial example $x'$. We fine-tune for a greater norm of the output difference at layer $l$ (which we hope is conducive to greater transferability) while attempting to preserve the output difference's direction so as not to destroy the original adversarial structure.
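The fine-tuning loop can be sketched end-to-end with a toy linear feature map W standing in for $F_l$ (all names here are illustrative; a real implementation would backpropagate through the network up to layer $l$):

```python
import numpy as np

def ila_finetune(x, x_adv, W, eps, lr=0.05, n_iter=50):
    """Fine-tune x_adv by sign-gradient descent on the ILAP loss (toy sketch).

    W plays the role of F_l (a hypothetical linear layer-l feature map), so
    the gradient of  -dy_ref . (W @ (x'' - x))  w.r.t. x'' is simply
    -W.T @ dy_ref.  Each iterate is projected back into the L-inf eps-ball
    around x, preserving imperceptibility in image space.
    """
    dy_ref = W @ (x_adv - x)          # reference disturbance at "layer l"
    x_new = x_adv.copy()
    for _ in range(n_iter):
        grad = -W.T @ dy_ref          # gradient of the ILAP loss w.r.t. x''
        x_new = x_new - lr * np.sign(grad)
        x_new = np.clip(x_new, x - eps, x + eps)  # stay in the eps-ball
    return x_new

# Toy usage: the fine-tuned example projects further onto the reference
# disturbance while remaining inside the eps-ball.
x = np.zeros(3)
x_adv = np.array([0.05, -0.02, 0.03])
W = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
eps = 0.1
x_new = ila_finetune(x, x_adv, W, eps)
```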
4 Experiments
We start by showing that ILAP increases transferability for all base attack methods tested, including MI-FGSM [5] and Carlini-Wagner (C&W) [3] in Table 1, as well as Transferable Adversarial Perturbations (TAP) [29] in Table 2. Results for I-FGSM, FGSM, and Deepfool are shown in Appendix A. (We reimplemented all attacks except for Deepfool, which is from the original repository. For C&W, we used the randomized targeted version, since it has better performance.) We test on a variety of models, namely ResNet18 [8], SENet18 [9], DenseNet121 [10], and GoogLeNet [26]. Architecture details are specified in Appendix A; note that in the results sections below, instead of referring to architecture-specific layer names, we refer to layer indices (e.g., the first index corresponds to the last layer of the first block). Our models are trained on CIFAR-10 [11] with the code and hyperparameters in [14] to standard final test accuracies.
For a fair comparison, we use the output of an attack run for 20 iterations as a baseline. ILAP runs for 10 iterations, starting from scratch, with the output of the baseline attack after 10 iterations as its reference. The same learning rate is used for both I-FGSM and MI-FGSM. (Tuning the learning rate does not substantially affect transferability, as shown in Appendix G.)
We then show that we can select a nearly-optimal layer for transferability using only the source model. Moreover, ILAF allows further tuning to improve the performance across layers. Finally, we demonstrate that ILAP also improves transferability under the more complex setting of ImageNet.
4.1 ILAP Targeted at Different Values of l
To confirm the effectiveness of our attack, we fix a single source model and baseline attack method, and then check how ILAP transfers to the other models compared to the baseline attack. Results for ResNet18 as the source model and I-FGSM as the baseline method are shown in Figure 3. Comparing the two methods on the other models, we see that ILAP outperforms I-FGSM when targeting any intermediate layer, especially at the optimal hyperparameter value of $l$. Note that the choice of layer is crucial for performance on both the source and target models. Full results are shown in Appendix A.
4.2 ILAP with a Pre-Determined Value of l
Above we demonstrated that adversarial examples produced by ILAP exhibit the strongest transferability when targeting a specific layer (i.e., choosing an appropriate hyperparameter $l$). We wish to pre-determine this optimal value based on the source model alone, so as to avoid tuning $l$ against transfer models. To do this, we examine the relationship between transferability and the ILAP layer disturbance values. We define the disturbance values of an ILAP attack perturbation as the values of $\|F_l(x'') - F_l(x)\|$ across all layers $l$ of the source model. For each value of $l$ in ResNet18 (the set of layer indices is defined for each architecture in Appendix A), we plot the disturbance values of the corresponding ILAP attack in Figure 4. The same figure is given for other models in Appendix B.
We notice that the adversarial examples producing the latest peak in the graph are typically the ones with the highest transferability across all transfer models (Table 1). Given this observation, we propose that the latest $l$ that still exhibits a peak is a nearly optimal value of $l$ (in terms of maximizing transferability). For example, according to Figure 4, we would choose $l = 4$. Table 1 supports our claim and shows that selecting this layer gives an optimal or near-optimal attack.
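This selection rule can be sketched as a small helper (the curve data below is hypothetical; in practice the per-layer disturbance values are the ones plotted in Figure 4):

```python
def pick_ilap_layer(disturbance_curves):
    """Return the latest target layer l whose disturbance curve still peaks.

    disturbance_curves maps each candidate layer l to the list of disturbance
    values that the ILAP attack targeted at l induces across all layers of
    the source model.  A "peak" here is an interior point strictly greater
    than both of its neighbors.
    """
    def has_peak(vals):
        return any(vals[i - 1] < vals[i] > vals[i + 1]
                   for i in range(1, len(vals) - 1))

    peaked = [layer for layer, vals in disturbance_curves.items() if has_peak(vals)]
    return max(peaked) if peaked else None

# Hypothetical curves: layers 2 and 4 show a peak, layer 6 is monotone,
# so the rule selects l = 4.
curves = {
    2: [1.0, 3.0, 2.0, 2.0, 2.0],
    4: [1.0, 2.0, 4.0, 3.0, 3.0],
    6: [1.0, 2.0, 3.0, 4.0, 5.0],
}
best = pick_ilap_layer(curves)
```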
We leave our interpretation of this method for Section 5.3.
Table 1. Accuracies of target models on transferred adversarial examples (lower is better); parentheses give the ILAP target layer $l$.

| Source | Transfer | MI-FGSM 20 Itr | MI-FGSM 10 Itr ILAP | MI-FGSM Opt ILAP | C&W 1000 Itr | C&W 500 Itr ILAP | C&W Opt ILAP |
|---|---|---|---|---|---|---|---|
| ResNet18 | ResNet18* | 5.7% | 11.3% | 2.3% (6) | 7.3% | 5.2% | 2.1% (5) |
| | SENet18 | 33.8% | 30.6% | 30.6% (4) | 85.4% | 41.7% | 41.7% (4) |
| | DenseNet121 | 35.1% | 30.4% | 30.4% (4) | 84.4% | 41.7% | 41.7% (4) |
| | GoogLeNet | 45.1% | 37.7% | 37.7% (4) | 90.6% | 57.3% | 57.3% (4) |
| SENet18 | ResNet18 | 31.0% | 27.5% | 27.5% (4) | 87.5% | 42.7% | 42.7% (4) |
| | SENet18* | 3.3% | 10.0% | 2.6% (6) | 6.2% | 7.3% | 3.1% (5) |
| | DenseNet121 | 31.6% | 27.3% | 27.3% (4) | 88.5% | 38.5% | 38.5% (4) |
| | GoogLeNet | 41.1% | 34.8% | 34.8% (4) | 91.7% | 52.1% | 52.1% (4) |
| DenseNet121 | SENet18 | 33.5% | 27.7% | 27.7% (6) | 86.5% | 34.4% | 34.4% (6) |
| | GoogLeNet | 36.3% | 30.3% | 30.3% (6) | 90.6% | 45.8% | 45.8% (6) |
| GoogLeNet | GoogLeNet* | 1.5% | 1.4% | 0.5% (11) | 4.2% | 0.0% | 0.0% (12) |

* Same model as the source model.
[Table 2: Source, Transfer, 20 Itr, Opt ILAP. * Same model as the source model.]
4.3 ILAF vs. ILAP
We show that ILAF can further improve transferability with the additional tunable hyperparameter $\alpha$. The best ILAF result for each model improves over ILAP, as shown in Table 3. However, note that the optimal $\alpha$ differs for each model and requires substantial hyperparameter tuning to outperform ILAP. Thus, ILAF can be seen as a more model-specific version requiring more tuning, whereas ILAP works well out of the box. Full results are in Appendix C.
[Table 3: Model, ILAP (best), ILAF (best).]
4.4 ILAP on ImageNet
We also tested ILAP on ImageNet [4], with ResNet18, DenseNet121, SqueezeNet, and AlexNet [12] pretrained on ImageNet (as provided in [17]). (On ImageNet, ResNet18 has accuracy 69.8%, DenseNet121 has accuracy 74.4%, and SqueezeNet has accuracy 58.0%.) The learning rates for all attacks (I-FGSM, ILAP with I-FGSM, MI-FGSM, and ILAP with MI-FGSM) are tuned for best performance. To evaluate transferability, we tested the accuracies of the different models on adversarial examples generated from all ImageNet test images. We observe that ILAP improves over I-FGSM and MI-FGSM on ImageNet. Results for ResNet18 as the source model and I-FGSM as the baseline attack are shown in Figure 5; full results are in Appendix D.
5 Explaining the Effectiveness of Intermediate Layer Emphasis
At a high level, we motivated projection in an intermediate feature map as a way to increase transferability. We saw empirically that we should target the layer corresponding to the latest peak (see Figure 4) on the source model in order to maximize transferability. In this section, we attempt to explain the factors that cause ILAP performance to vary across layers, as well as what they suggest about the optimal layer for ILAP. As we iterate through layer indices, two factors affect our performance: the angle between the original perturbation direction and the best transfer direction (defined below in Section 5.1), and the linearity of the model's decision boundary.
Below, we discuss how the factors change across layers and affect transferability of our attack.
5.1 Angle between Best Transfer Direction and the Original Perturbation
Motivated by [15] (where it is shown that the decision boundaries of models with different architectures often align), we define the Best Transfer Direction (BTD):

Best Transfer Direction: Let $x$ be an image and $C$ be a large (but finite) set of distinct CNNs. Find

$\Delta x^* = \arg\max_{\|\Delta x\|_\infty \le \epsilon} \big|\{\, F \in C : F(x + \Delta x) \neq F(x) \,\}\big|$

Then the Best Transfer Direction of $x$ is $\mathrm{BTD}(x) = \Delta x^* / \|\Delta x^*\|$.
Since our method uses the original perturbation as an approximation for the BTD, it is intuitive that the better this approximation is in the current feature representation, the better our attack will perform.
We want to investigate how well a chosen source-model attack, such as I-FGSM, aligns with the BTD throughout the layers. Here we measure alignment between an I-FGSM perturbation and the BTD using the angle between their feature-map disturbances at each layer. As shown in Figure 6, the angle between the perturbation of I-FGSM and that of the BTD decreases as the layer index increases. Therefore, the later the target layer is in the source model, the better I-FGSM's attack direction serves as a guide. This factor increases the transfer attack success rate as the layer index increases.
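The alignment measure is simply the angle between disturbance vectors, which can be sketched as (a helper of our own; the inputs would be the layer-$l$ feature disturbances of the two perturbations):

```python
import numpy as np

def alignment_angle_deg(u, v):
    """Angle in degrees between two disturbance vectors, e.g. the layer-l
    feature disturbance of an I-FGSM perturbation and that of a BTD
    estimate.  Smaller angles mean the source attack is a better proxy
    for the BTD at that layer."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against tiny floating-point excursions outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```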
To test our hypothesis, we propose to eliminate this source of variation in performance by using a multi-fool perturbation as the starting perturbation for ILAP, which is a better approximation for the BTD. As shown in Figure 7, ILAP performs substantially better when using a multi-fool perturbation as a guide rather than an I-FGSM perturbation, thus confirming that using a better approximation of the BTD gives better performance for ILAP. In addition, we see that these results correspond with what we would expect from Figure 6. In the earlier layers, I-FGSM is a worse approximation of the BTD, so passing in a multi-fool perturbation improves performance significantly. In the later layers, I-FGSM is a much better approximation of the BTD, and we see that passing in a multi-fool perturbation does not increase performance much.
5.2 Linearity of Decision Boundary
If we view I-FGSM as optimizing to cross the decision boundary, we can interpret ILAP as optimizing to cross a decision boundary approximated by a hyperplane perpendicular to the I-FGSM perturbation. As the layer index increases, the function from the feature space to the final output of the source model tends to become increasingly linear (there are more nonlinearities between an earlier layer and the final layer than between a later layer and the final layer); indeed, at the final layer, the decision boundary is exactly linear. Thus, the increasing accuracy of our linear approximation of the decision boundary is one factor improving ILAP performance as we select later layers.
We define the “true decision boundary” as that of a majority-vote ensemble of a large number of CNNs. Note that for transfer, we care less about how well we approximate the source model's decision boundary than how well we approximate the true decision boundary. In most feature representations, we expect the true decision boundary to be more linear, as ensembling reduces variance. However, at least in the final layer, since the source model's decision boundary there is exactly linear, the true decision boundary cannot be more linear, and is likely less linear.
We hypothesize that this flip is what causes us to perform worse in the final layers: there, the source model's decision boundary is more linear than the true decision boundary, so our approximation performs poorly. We test this hypothesis by attacking two variants of ResNet18 augmented with 3 linear layers before the last layer: one variant with activations following the added layers and one without. As shown in Figure 8, ILAP performance decreases less in the variant with activations. Note that these nonlinearities also cause worse ILAP performance earlier in the network.
Thus, we conclude that the extreme linearity of the last several layers is associated with ILAP performing poorly.
5.3 Explanation of the Main Result
In this section, we tie together the above factors to explain the optimal intermediate layer for transferability. Denote:
- the decreasing angle between I-FGSM's perturbation direction and the BTD's as Factor 1,
- the increasing linearity of the decision boundary with respect to the feature space as the layer index increases as Factor 2, and
- the excessive linearity of the source model's decision boundary in the final layers as Factor 3.
On the transfer models, as the index of the attacked source-model layer increases, Factors 1 and 2 increase the attack success rate, while Factor 3 decreases it. Thus, up to some layer, Factors 1 and 2 cause transferability to increase with the layer index; afterward, Factor 3 wins out and causes transferability to decrease. The layer just before this switch is therefore optimal for transferability (see Figure 9 for a visual overview).
We note that this explanation also justifies the method presented in Section 4.2. Intuitively, having a peak corresponds to the linearized decision boundary (from using projection as the objective) being very different from the source model's decision boundary; if this were not the case, then I-FGSM would presumably have found this improved perturbation already. As such, choosing the last layer at which we still get a peak corresponds to both having enough room (the peak) and having as linear a decision boundary as possible (as late a layer as possible).
On the source model, since there is no notion of a “transfer” attack, Factors 1 and 3 have no effect. Therefore, Factor 2 improves the performance of the later layers, so much so that at the final layer ILAP's performance on the source model is actually equal to or better than all the attacks we used as baselines (see Figure 3). We hypothesize that the improved performance on the source model results from a simpler loss and thus an easier-to-optimize loss landscape.
6 Conclusion
We introduce a novel attack, coined ILA, that aims to enhance the transferability of any given adversarial example. It is a framework with the goal of enhancing transferability by increasing projection onto the Best Transfer Direction. Within this framework, we propose two variants, ILAP and ILAF, and analyze their performance. We demonstrate that there exist specific intermediate layers that we can target with ILA to substantially increase transferability with respect to the attack baselines. In addition, we show that a near-optimal target layer can be selected without any knowledge of transfer performance. Finally, we provide some intuition regarding ILA's performance and why it differs across feature spaces.
Potential future work includes using the interactions between ILA and existing adversarial attacks to explain differences among existing attacks, as well as extending ILA to perturbations produced in different settings (universal or targeted perturbations). In addition, other methods of attacking intermediate feature spaces could be explored, taking advantage of the properties examined in this paper.
-  A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
-  T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. CoRR, abs/1712.09665, 2017.
-  N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li. Boosting adversarial attacks with momentum. 2017.
-  K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust physical-world attacks on deep learning models. 2017.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
-  A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
-  K. Liu. Pytorch cifar10. https://github.com/kuangliu/pytorch-cifar, 2018.
-  Y. Liu, X. Chen, C. Liu, and D. X. Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.
-  S. Marcel and Y. Rodriguez. Torchvision: the machine-vision package of Torch. In ACM Multimedia, 2010.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 86–94, 2017.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: A simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.
-  K. R. Mopuri, A. Ganeshan, and R. V. Babu. Generalizable data-free objective for crafting universal adversarial perturbations. IEEE transactions on pattern analysis and machine intelligence, 2018.
-  N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In AsiaCCS, 2017.
-  S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet. Adversarial manipulation of deep representations. CoRR, abs/1511.05122, 2015.
-  M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Adversarial generative nets: Neural network attacks on state-of-the-art face recognition. CoRR, abs/1801.00349, 2017.
-  A. Sinha, H. Namkoong, and J. C. Duchi. Certifying some distributional robustness with principled adversarial training. 2017.
-  J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
-  X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li. Adversarial examples: Attacks and defenses for deep learning. CoRR, abs/1712.07107, 2017.
-  W. Zhou, X. Hou, Y. Chen, M. Tang, X. Huang, X. Gan, and Y. Yang. Transferable adversarial perturbations. In ECCV, 2018.