Improving Transferability of Adversarial Examples with Input Diversity
Though convolutional neural networks have achieved state-of-the-art performance on various vision tasks, they are extremely vulnerable to adversarial examples, which are obtained by adding human-imperceptible perturbations to the original images. Adversarial examples can thus be used as a useful tool to evaluate and select the most robust models in safety-critical applications. However, most existing adversarial attacks achieve only relatively low success rates under the challenging black-box setting, where the attackers have no knowledge of the model structure and parameters. To this end, we propose to improve the transferability of adversarial examples by creating diverse input patterns. Instead of only using the original images to generate adversarial examples, our method applies random transformations to the input images at each iteration. Extensive experiments on ImageNet show that the proposed attack method can generate adversarial examples that transfer much better to different networks than existing baselines. To further improve the transferability, we (1) integrate the recently proposed momentum method into the attack process; and (2) attack an ensemble of networks simultaneously. By evaluating our method against top defense submissions and official baselines from the NIPS 2017 adversarial competition, this enhanced attack reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. The code is publicly available at https://github.com/cihangxie/DI-2-FGSM.
The recent success of convolutional neural networks (CNNs) has led to dramatic performance improvements on various vision tasks, including image classification [13, 29, 11], object detection [8, 25, 37] and semantic segmentation [19, 3]. However, CNNs are extremely vulnerable to small perturbations of the input images, i.e., human-imperceptible additive perturbations can cause CNNs to make incorrect predictions. These intentionally crafted images are known as adversarial examples. Learning how to generate adversarial examples can help us investigate the robustness of different models and understand the insufficiency of current training algorithms [9, 15, 34].
Several methods [9, 33, 14] have been proposed recently to find adversarial examples. In general, these attacks can be categorized into two types, single-step attacks and iterative attacks [33, 14], according to the number of steps of gradient computation. Under the white-box setting, where the attackers have perfect knowledge of the network structure and weights, iterative attacks can generate adversarial examples with much higher success rates than those generated by single-step attacks. However, if these adversarial examples are tested on a different network (in terms of network structure, weights or both), i.e., the black-box setting, single-step attacks achieve higher success rates than iterative attacks. This trade-off arises because iterative attacks tend to overfit the specific network parameters (i.e., have high white-box success rates), so the generated adversarial examples rarely transfer to other networks (i.e., have low black-box success rates), while single-step attacks usually underfit the network parameters (i.e., have low white-box success rates) and thus produce adversarial examples with slightly better transferability. Given this phenomenon, one interesting question is whether we can generate adversarial examples with high success rates under both white-box and black-box settings.
Data augmentation [13, 29, 11] has been shown to be an effective way to prevent networks from overfitting during the training process. Specifically, a set of label-preserving transformations, e.g., resizing, cropping and rotating, are applied to the images to enlarge the training set. Consequently, the trained networks generalize better to unseen images. Meanwhile, [35, 10] showed that image transformations can defend against adversarial examples under certain situations, which indicates that adversarial examples cannot generalize well under different transformations. These transformed adversarial examples are known as hard examples [27, 28] for attackers, which can then serve as good samples to produce more transferable adversarial examples.
To this end, we propose the Diverse Input Iterative Fast Gradient Sign Method (DI2-FGSM) to improve the transferability of adversarial examples. At each iteration, unlike the traditional methods which maximize the loss function directly w.r.t. the original inputs, we apply random and differentiable transformations to the input images with probability $p$ and maximize the loss function w.r.t. these transformed inputs. In particular, the transformations used here are random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. Note that these randomized operations were previously used to defend against adversarial examples, while here we incorporate them into the attack process to create hard and diverse input patterns. Figure 1 shows an adversarial example generated by our proposed attack method, DI2-FGSM, and compares its success rates to other attack methods under both white-box and black-box settings.
We test the proposed attack method on several networks under both white-box and black-box settings. Compared with traditional iterative attacks, the results on ImageNet (see Section 4.2) show that DI2-FGSM gets significantly higher success rates for black-box models, and maintains similar success rates for white-box models. To improve the transferability of adversarial examples further, we (1) integrate a momentum term into the attack process; and (2) attack multiple networks simultaneously. By evaluating our attack method w.r.t. the top defense submissions and official baselines from the NIPS adversarial competition, this enhanced attack reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future.
Traditional machine learning algorithms are known to be vulnerable to adversarial examples [5, 12, 2]. Recently, Szegedy et al. pointed out that CNNs are also fragile to adversarial examples, and proposed a box-constrained L-BFGS method to find adversarial examples reliably. Because this method is computationally expensive, Goodfellow et al. proposed the fast gradient sign method, which generates adversarial examples efficiently by performing a single gradient step. Kurakin et al. then extended this method to an iterative version and showed that the generated adversarial examples can exist in the physical world. Dong et al. proposed a broad class of momentum-based iterative algorithms to boost the transferability of adversarial examples. The transferability can also be improved by attacking an ensemble of networks simultaneously. Besides image classification, adversarial examples also exist in object detection, semantic segmentation [36, 4], speech recognition, deep reinforcement learning, etc. Unlike adversarial examples, which can be recognized by humans, the fooling images generated by Nguyen et al. are different from natural images and difficult for humans to recognize, yet CNNs classify them as recognizable objects with high confidence.
Conversely, many methods have been proposed recently to defend against adversarial examples. [9, 15] proposed to inject adversarial examples into the training data to increase the network robustness. Tramèr et al. pointed out that such adversarially trained models still remain vulnerable to adversarial examples, and proposed ensemble adversarial training, which augments training data with perturbations transferred from other models, in order to improve the network robustness further. [35, 10] applied randomized image transformations to inputs at inference time to mitigate adversarial effects. Dhillon et al. pruned a random subset of activations according to their magnitude to enhance network robustness. Prakash et al. proposed a framework which combines pixel deflection with soft wavelet denoising to defend against adversarial examples. [21, 30, 26] leveraged generative models to purify adversarial images by moving them back towards the distribution of clean images.
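Let $X$ denote an image, and $y^{true}$ denote the corresponding ground-truth label. We use $\theta$ to denote the network parameters, and $L(X, y^{true}; \theta)$ to denote the loss. For adversarial example generation, the goal is to maximize the loss $L(X^{adv}, y^{true}; \theta)$ for the image $X$, under the constraint that the generated adversarial example $X^{adv}$ should look visually similar to the original image $X$ and that the corresponding predicted label satisfies $y^{adv} \neq y^{true}$. In this paper, we use the $l_\infty$-norm to measure the perceptibility of adversarial perturbations, i.e., $\|X^{adv} - X\|_\infty \leq \epsilon$. The loss function is defined as
$$L(X, y^{true}; \theta) = -\mathbb{1}_{y^{true}} \cdot \log\big(\mathrm{softmax}(l(X; \theta))\big), \tag{1}$$
where $\mathbb{1}_{y^{true}}$ is the one-hot encoding of the ground-truth label $y^{true}$, and $l(X; \theta)$ is the logits output. Note that all the baseline attacks have been implemented in the cleverhans library, which can be used directly for our experiments.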
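As a minimal illustration of the loss described above, the following NumPy sketch assumes it is the standard softmax cross-entropy between a one-hot label and a logits vector. The function names and the example logits are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy_loss(logits, true_label):
    """Minus the log-probability assigned to the true class."""
    one_hot = np.zeros_like(logits, dtype=float)
    one_hot[true_label] = 1.0
    return float(-np.dot(one_hot, np.log(softmax(logits))))


logits = np.array([2.0, 1.0, 0.1])                 # illustrative values only
loss_if_true_is_0 = cross_entropy_loss(logits, 0)  # true class has the top score
loss_if_true_is_2 = cross_entropy_loss(logits, 2)  # true class has the lowest score
```

An attacker wants to move the input so that this loss grows, which corresponds to pushing probability mass away from the true class.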
In this section, we give an overview of the family of fast gradient sign methods:
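Fast Gradient Sign Method (FGSM): FGSM is the first member in this attack family, which finds the adversarial perturbations in the direction of the loss gradient $\nabla_X L(X, y^{true}; \theta)$. The update equation is
$$X^{adv} = X + \epsilon \cdot \mathrm{sign}\big(\nabla_X L(X, y^{true}; \theta)\big). \tag{2}$$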
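The single-step update can be sketched on a toy problem. The snippet below assumes a hypothetical linear softmax classifier (`W`, `b`) so that the input gradient has a closed form; a real attack would obtain the gradient from a deep network through automatic differentiation and would also clip pixel values to the valid image range.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss_and_grad(x, y, W, b):
    """Cross-entropy loss of a toy linear classifier and its gradient w.r.t. x."""
    probs = softmax(W @ x + b)
    loss = -np.log(probs[y])
    one_hot = np.zeros_like(probs)
    one_hot[y] = 1.0
    grad_x = W.T @ (probs - one_hot)   # analytic gradient for logits = W x + b
    return float(loss), grad_x


def fgsm(x, y, W, b, eps):
    """One step of size eps along the sign of the loss gradient."""
    _, grad = loss_and_grad(x, y, W, b)
    return x + eps * np.sign(grad)


rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)   # hypothetical 3-class model
x = rng.normal(size=8)                        # hypothetical flattened "image"
y = int(np.argmax(W @ x + b))                 # treat the clean prediction as the label
x_adv = fgsm(x, y, W, b, eps=0.1)
```

Because every coordinate moves by exactly `eps`, the perturbation automatically satisfies the maximum-norm constraint.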
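Iterative Fast Gradient Sign Method (I-FGSM): Kurakin et al. extended FGSM to an iterative version, which can be expressed as
$$X_0^{adv} = X, \tag{3}$$
$$X_{n+1}^{adv} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{adv} + \alpha \cdot \mathrm{sign}\big(\nabla_X L(X_n^{adv}, y^{true}; \theta)\big)\big\}, \tag{4}$$
where $\mathrm{Clip}_X^{\epsilon}$ indicates that the resulting image is clipped within the $\epsilon$-ball of the original image $X$, $n$ is the iteration number and $\alpha$ is the step size.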
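The iterative version can be sketched in the same toy setting. The linear classifier `W` and all numeric values are assumptions chosen for illustration, and the clipping step is implemented as an element-wise projection onto the box of half-width `eps` around the original input.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss(x, y, W):
    """Cross-entropy loss of a toy linear softmax classifier."""
    return float(-np.log(softmax(W @ x)[y]))


def grad_wrt_input(x, y, W):
    """Analytic gradient of the loss above with respect to the input x."""
    probs = softmax(W @ x)
    probs[y] -= 1.0
    return W.T @ probs


def i_fgsm(x, y, W, eps, alpha, n_iter):
    """Repeated sign-gradient steps, each followed by clipping to the eps-ball."""
    x_adv = x.copy()
    for _ in range(n_iter):
        step = alpha * np.sign(grad_wrt_input(x_adv, y, W))
        x_adv = np.clip(x_adv + step, x - eps, x + eps)
    return x_adv


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))            # hypothetical 3-class model
x = rng.normal(size=8)                 # hypothetical flattened "image"
y = int(np.argmax(W @ x))
x_adv = i_fgsm(x, y, W, eps=0.3, alpha=0.05, n_iter=10)
```

The smaller step size lets the attack re-evaluate the gradient several times, which is what makes it fit the attacked model more closely than the single-step variant.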
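Momentum Iterative Fast Gradient Sign Method (MI-FGSM): MI-FGSM integrates a momentum term into the attack process to stabilize update directions and escape from poor local maxima. The updating procedure is similar to I-FGSM, with the replacement of Equation (4) by:
$$g_{n+1} = \mu \cdot g_n + \frac{\nabla_X L(X_n^{adv}, y^{true}; \theta)}{\|\nabla_X L(X_n^{adv}, y^{true}; \theta)\|_1}, \tag{5}$$
$$X_{n+1}^{adv} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{adv} + \alpha \cdot \mathrm{sign}(g_{n+1})\big\}, \tag{6}$$
where $\mu$ is the decay factor of the momentum term and $g_n$ is the accumulated gradient at iteration $n$.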
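A sketch of the momentum variant follows, again on a hypothetical linear softmax classifier. It assumes the accumulated gradient starts at zero and that each new gradient is normalized by its L1 norm before being added; the small constant in the denominator is an addition of this sketch to avoid division by zero.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def grad_wrt_input(x, y, W):
    """Analytic input gradient of the cross-entropy loss for logits = W x."""
    probs = softmax(W @ x)
    probs[y] -= 1.0
    return W.T @ probs


def mi_fgsm(x, y, W, eps, alpha, n_iter, mu):
    """Iterative sign attack that steps along an accumulated gradient g."""
    x_adv = x.copy()
    g = np.zeros_like(x)
    for _ in range(n_iter):
        grad = grad_wrt_input(x_adv, y, W)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)   # L1-normalized update
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))            # hypothetical 3-class model
x = rng.normal(size=8)
y = int(np.argmax(W @ x))
x_adv = mi_fgsm(x, y, W, eps=0.3, alpha=0.05, n_iter=10, mu=1.0)
```

Since the accumulator starts at zero, the first step coincides with a plain sign-gradient step; the momentum only changes the direction from the second iteration onward.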
Let $\hat{\theta}$ denote the unknown network parameters. In general, a strong adversarial example should have high success rates on both white-box models, i.e., $L(X^{adv}, y^{true}; \theta)$, and black-box models, i.e., $L(X^{adv}, y^{true}; \hat{\theta})$. On one hand, the traditional single-step attacks, e.g., FGSM, tend to underfit the specific network parameters $\theta$ due to the inaccurate linear approximation of the loss $L(X, y^{true}; \theta)$, and thus cannot reach high success rates on white-box models. On the other hand, the traditional iterative attacks, e.g., I-FGSM, greedily perturb the images in the direction of the sign of the loss gradient at each iteration, and thus easily fall into poor local maxima and overfit the specific network parameters $\theta$. These overfitted adversarial examples rarely transfer to black-box models. In order to generate adversarial examples with strong transferability, we need to find a better way to optimize the loss $L(X, y^{true}; \theta)$ to alleviate this overfitting phenomenon.
Data augmentation [13, 29, 11] has been shown to be an effective way to prevent networks from overfitting during the training process. Meanwhile, [35, 10] showed that adversarial examples are no longer malicious if simple image transformations are applied, which indicates that these transformed adversarial images can serve as good samples for better optimization.
Based on the analysis above, we propose the Diverse Inputs Iterative Fast Gradient Sign Method (DI2-FGSM), which applies image transformations $T(\cdot)$ to the original inputs $X$ with probability $p$ at each iteration to alleviate the overfitting phenomenon. Specifically, the image transformations applied here are random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. The transformation probability $p$ controls the trade-off between success rates on white-box models and success rates on black-box models, which can be observed from Figure 3. If $p = 0$, DI2-FGSM degrades to I-FGSM and leads to overfitting. If $p = 1$, i.e., only transformed inputs are used for the attack, the generated adversarial examples tend to have much higher success rates on black-box models but lower success rates on white-box models, since the original inputs are not seen by the attackers.
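In general, the updating procedure of DI2-FGSM is similar to I-FGSM, with the replacement of Equation (4) by:
$$X_{n+1}^{adv} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{adv} + \alpha \cdot \mathrm{sign}\big(\nabla_X L(T(X_n^{adv}; p), y^{true}; \theta)\big)\big\}, \tag{7}$$
where the stochastic transformation function $T(X_n^{adv}; p)$ is:
$$T(X_n^{adv}; p) = \begin{cases} T(X_n^{adv}) & \text{with probability } p, \\ X_n^{adv} & \text{with probability } 1 - p. \end{cases} \tag{8}$$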
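The stochastic transformation can be illustrated with a forward-only NumPy sketch. It assumes a square height-by-width-by-channels image, uses nearest-neighbour resizing as a stand-in for whatever interpolation an actual implementation uses, and picks small illustrative sizes; the function names are hypothetical. In a real attack the same operations would be written in an automatic-differentiation framework so that gradients flow back to the untransformed input.

```python
import numpy as np


def random_resize_and_pad(image, out_size, rng):
    """Resize a square image to a random side length, then zero-pad to out_size."""
    h, w, c = image.shape
    rnd = int(rng.integers(h, out_size))          # new side length in [h, out_size)
    rows = np.arange(rnd) * h // rnd              # nearest-neighbour source rows
    cols = np.arange(rnd) * w // rnd              # nearest-neighbour source columns
    resized = image[rows][:, cols]
    pad_total = out_size - rnd
    top = int(rng.integers(0, pad_total + 1))     # random split of the padding
    left = int(rng.integers(0, pad_total + 1))
    padded = np.zeros((out_size, out_size, c), dtype=image.dtype)
    padded[top:top + rnd, left:left + rnd] = resized
    return padded


def diverse_input(image, p, out_size, rng):
    """Return the transformed image with probability p, else the original."""
    if rng.random() < p:
        return random_resize_and_pad(image, out_size, rng)
    return image


rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))                        # illustrative 8x8 RGB input
transformed = diverse_input(image, p=1.0, out_size=11, rng=rng)
```

Note that the transformed and untransformed branches return different spatial sizes, so this sketch presumes a network that accepts both.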
Intuitively, momentum and diverse inputs are two completely different ways to alleviate the overfitting phenomenon. We can combine them naturally to form a much stronger attack, i.e., the Momentum Diverse Inputs Iterative Fast Gradient Sign Method (M-DI2-FGSM). The overall updating procedure of M-DI2-FGSM is similar to MI-FGSM, with only the replacement of Equation (5) by:
$$g_{n+1} = \mu \cdot g_n + \frac{\nabla_X L(T(X_n^{adv}; p), y^{true}; \theta)}{\|\nabla_X L(T(X_n^{adv}; p), y^{true}; \theta)\|_1}. \tag{9}$$
The attacks mentioned above all belong to the family of Fast Gradient Sign Methods, and can be related via different parameter settings, as shown in Figure 2. In summary:
If the transformation probability $p = 0$, M-DI2-FGSM degrades to MI-FGSM, and DI2-FGSM degrades to I-FGSM;
If the decay factor $\mu = 0$, M-DI2-FGSM degrades to DI2-FGSM, and MI-FGSM degrades to I-FGSM;
If the total iteration number $N = 1$, I-FGSM degrades to FGSM.
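These relationships can be checked with one parameterized routine. The sketch below uses a hypothetical linear softmax classifier on a fixed-length vector, so the paper's resize-and-pad transformation is replaced by a random circular shift (`np.roll`), chosen only because it keeps the input size unchanged and has a trivial gradient. It is a stand-in for illustration, not the transformation used by the authors.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def grad_wrt_input(x, y, W):
    """Analytic input gradient of the cross-entropy loss for logits = W x."""
    probs = softmax(W @ x)
    probs[y] -= 1.0
    return W.T @ probs


def attack(x, y, W, eps, alpha, n_iter, mu=0.0, p=0.0, seed=0):
    """One routine for the whole family, selected by mu, p and n_iter."""
    rng = np.random.default_rng(seed)
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(n_iter):
        if rng.random() < p:
            k = int(rng.integers(1, x.size))      # stand-in transform: circular shift
            grad = np.roll(grad_wrt_input(np.roll(x_adv, k), y, W), -k)
        else:
            grad = grad_wrt_input(x_adv, y, W)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv


rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))                       # hypothetical 3-class model
x = rng.normal(size=8)
y = int(np.argmax(W @ x))
fgsm_like = attack(x, y, W, eps=0.1, alpha=0.1, n_iter=1)             # N = 1
i_fgsm_like = attack(x, y, W, eps=0.2, alpha=0.05, n_iter=5)          # mu = 0, p = 0
m_di2_like = attack(x, y, W, eps=0.2, alpha=0.05, n_iter=5, mu=1.0, p=0.5)
```

With `mu=0` the L1 normalization does not change the sign of the gradient, which is why the momentum-free setting reproduces the plain iterative update exactly.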
Liu et al.  suggested that attacking an ensemble of multiple networks simultaneously can generate much stronger adversarial examples. The motivation is that if an adversarial image remains adversarial for multiple networks, then it is more likely to transfer to other networks as well. Therefore, we can use this strategy to improve the transferability even further.
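We follow the ensemble strategy proposed by Liu et al., which fuses the logit activations together to attack multiple networks simultaneously. Specifically, to attack an ensemble of $K$ models, the logits are fused by:
$$l(X; \theta_1, \dots, \theta_K) = \sum_{k=1}^{K} w_k \, l_k(X; \theta_k), \tag{10}$$
where $l_k(X; \theta_k)$ is the logits output of the $k$-th model with the parameters $\theta_k$, and $w_k$ is the ensemble weight with $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$.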
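The logit fusion is a weighted sum and can be sketched in a few lines. The helper name `fuse_logits` and the example values are hypothetical; the check on the weights mirrors the non-negativity and sum-to-one conditions stated above.

```python
import numpy as np


def fuse_logits(logits_list, weights):
    """Weighted sum of per-model logits; weights must be >= 0 and sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("ensemble weights must be non-negative and sum to 1")
    stacked = np.stack(logits_list)               # shape (K, num_classes)
    return np.tensordot(weights, stacked, axes=1)


model_a = np.array([1.0, 0.0])                    # illustrative logits of model A
model_b = np.array([0.0, 3.0])                    # illustrative logits of model B
fused = fuse_logits([model_a, model_b], [0.5, 0.5])
```

The fused logits are then fed to the loss, so a single gradient step increases the loss of all ensemble members at once.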
It is less meaningful to attack images that are already classified wrongly. Therefore, to form our test dataset, we randomly choose images from the ImageNet validation set that are classified correctly by all the networks we test on. All these images are resized to a fixed size beforehand.
We consider four normally trained networks, i.e., Inception-v3 (Inc-v3), Inception-v4 (Inc-v4), Resnet-v2-152 (Res-152) and Inception-Resnet-v2 (IncRes-v2), and three adversarially trained networks, i.e., ens3-adv-Inception-v3 (Inc-v3ens3), ens4-adv-Inception-v3 (Inc-v3ens4) and ens-adv-Inception-ResNet-v2 (IncRes-v2ens). All networks are publicly available (https://github.com/tensorflow/models/tree/master/research/slim and https://github.com/tensorflow/models/tree/master/research/adv_imagenet_models).
For the parameters of different attackers, we follow the default settings of prior work for the step size $\alpha$ and the total iteration number $N$. The maximum perturbation $\epsilon$ is chosen so that it is still imperceptible to human vision. For the momentum term, the decay factor $\mu$ follows the setting of MI-FGSM. For the stochastic transformation function $T(X; p)$, the probability $p$ is set to $0.5$, i.e., attackers pay equal attention to the original inputs and the transformed inputs. For the transformation operations $T(\cdot)$, the input $X$ is first randomly resized to a square image of random side length, and then padded to a fixed larger size in a random manner.
We first perform adversarial attacks on a single network, using FGSM, I-FGSM, DI2-FGSM, MI-FGSM and M-DI2-FGSM, respectively. We craft adversarial examples only on normally trained networks, and test them on all seven networks. The success rates are shown in Table 1, where the diagonal blocks indicate white-box attacks and off-diagonal blocks indicate black-box attacks. We list the networks that we attack on in rows, and networks that we test on in columns.
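For concreteness, a success rate can be computed as below. This sketch assumes the success rate is the percentage of adversarial images whose predicted label differs from the ground-truth label, which is consistent with a test set made only of images that every network originally classifies correctly; the helper name and the toy labels are hypothetical.

```python
import numpy as np


def success_rate(pred_labels, true_labels):
    """Percentage of adversarial images that the tested network misclassifies."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float(np.mean(pred != true) * 100.0)


true_labels = [3, 7, 1, 4]          # illustrative ground-truth labels
preds_on_adv = [3, 2, 1, 9]         # illustrative predictions on adversarial images
rate = success_rate(preds_on_adv, true_labels)
```

Evaluating this for every pair of attacked and tested network fills one table of white-box (same network) and black-box (different network) entries.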
From Table 1, first and foremost, we observe that M-DI2-FGSM outperforms all other baseline attacks by a large margin on all black-box models, and maintains high success rates on all white-box models. For example, if adversarial examples are crafted on IncRes-v2, M-DI2-FGSM has success rates of on Inc-v4 (normally trained black-box model) and on Inc-v3ens3 (adversarially trained black-box model), while strong baselines like MI-FGSM only obtains the corresponding success rates of and , respectively. This convincingly demonstrates the effectiveness of the combination of input diversity and momentum for improving the transferability of adversarial examples.
We then compare the success rates of I-FGSM and DI2-FGSM to see the effectiveness of diverse input patterns solely. By generating adversarial examples with input diversity, DI2-FGSM significantly improves the success rates of I-FGSM on challenging black-box models, regardless whether this model is adversarially trained, and maintains high success rates on white-box models. For example, if adversarial examples are crafted on Res-152, DI2-FGSM has success rates of on Res-152 (white-box model), on Inc-v3 (normally trained black-box model) and on Inc-v3ens4 (adversarially trained black-box model), while I-FGSM only obtains the corresponding success rates of , and , respectively. Compared with FGSM, DI2-FGSM also reaches much higher success rates on the normally trained black-box models, and comparable performance on the adversarially trained black-box models.
Though the results in Table 1 show that momentum and input diversity can significantly improve the transferability of adversarial examples, they are still relatively weak at attacking an adversarially trained network under the black-box setting, as shown by the highest black-box success rate on IncRes-v2ens. Therefore, we follow the ensemble strategy to attack multiple networks simultaneously in order to further improve transferability. We consider all seven networks here. Adversarial examples are generated on an ensemble of six networks, and tested on the ensembled network and the hold-out network, using I-FGSM, DI2-FGSM, MI-FGSM and M-DI2-FGSM, respectively. FGSM is ignored here due to its low success rates on white-box models. All ensembled models are assigned equal weight, i.e., $w_k = 1/6$.
The results are summarized in Table 2, where the top row shows the success rates on the ensembled network (white-box setting), and the bottom row shows the success rates on the hold-out network (black-box setting). Under the challenging black-box setting, we observe that M-DI2-FGSM always generates adversarial examples with better transferability than other methods on all networks. For example, by keeping Inc-v3ens3 as a hold-out model, M-DI2-FGSM can fool Inc-v3ens3 with an success rate of , while I-FGSM, DI2-FGSM and MI-FGSM only have success rates of , and , respectively. Besides, compared with MI-FGSM, we observe that using diverse input patterns alone, i.e., DI2-FGSM, can reach a much higher success rate if the hold-out model is an adversarially trained network, and a comparable success rate if the hold-out model is a normally trained network.
Under the white-box setting, we see that DI2-FGSM and M-DI2-FGSM reach slightly lower (but still very high) success rates on ensemble models compared with I-FGSM and MI-FGSM. This is because attacking multiple networks simultaneously is much harder than attacking a single model. However, the white-box success rates can be improved if we assign a smaller value to the transformation probability $p$, increase the total number of iterations $N$ or use a smaller step size $\alpha$ (see Section 4.4).
In this section, we conduct a series of ablation experiments to study the impact of different parameters, e.g., the step size $\alpha$, on DI2-FGSM and M-DI2-FGSM. We only consider attacking an ensemble of networks here, since this is much stronger than attacking a single network and thus provides a more accurate evaluation of the network robustness. The maximum perturbation $\epsilon$ is kept fixed for all experiments.
We first study the influence of the transformation probability $p$ on the success rates under both white-box and black-box settings. We fix the step size $\alpha$ and the total iteration number $N$. The transformation probability $p$ is varied from $0$ to $1$. According to the relationships shown in Figure 2, if $p = 0$, M-DI2-FGSM degrades to MI-FGSM and DI2-FGSM degrades to I-FGSM.
We show the success rates on various networks in Figure 3. We observe that both DI2-FGSM and M-DI2-FGSM achieve higher black-box success rates but lower white-box success rates as $p$ increases. Moreover, for all attacks, if $p$ is small, i.e., only a small number of transformed inputs are utilized, black-box success rates can increase significantly, while white-box success rates drop only slightly. This phenomenon indicates the importance of adding transformed inputs into the attack process.
The trends shown in Figure 3 also provide useful suggestions for constructing strong adversarial attacks in practice. For example, if you know the black-box model is a new network that is totally different from any existing network, you can set $p = 1$ to reach the maximum transferability. If the black-box model is a mixture of new networks and existing networks, you can choose a moderate value of $p$ to maximize the black-box success rates subject to a pre-defined lower bound on the white-box success rates.
We next study the influence of the total iteration number $N$ on the success rates under both white-box and black-box settings. We fix the transformation probability $p$ and the step size $\alpha$. The total iteration number $N$ is varied, and the results are plotted in Figure 4. For DI2-FGSM, we see that the black-box success rates and white-box success rates always increase as the total iteration number $N$ increases. Similar trends can also be observed for M-DI2-FGSM except for the black-box success rates on adversarially trained models, i.e., performing more iterations does not bring extra transferability on adversarially trained models. Moreover, we observe that the gap in success rates between M-DI2-FGSM and DI2-FGSM diminishes as $N$ increases.
We finally study the influence of the step size $\alpha$ on the success rates under both white-box and black-box settings. We fix the transformation probability $p$. In order to reach the maximum perturbation $\epsilon$ even for a small step size $\alpha$, we set the total iteration number $N$ to be inversely proportional to the step size, i.e., $N \propto 1/\alpha$, so that $N \alpha$ stays large enough to cover $\epsilon$. The results are plotted in Figure 5. We observe that the white-box success rates of both DI2-FGSM and M-DI2-FGSM can be boosted if a smaller step size is provided. Under the black-box setting, the success rates of DI2-FGSM are insensitive to the step size, while the success rates of M-DI2-FGSM can still be improved with a smaller step size.
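The coupling between step size and iteration count can be made concrete with a small helper. It assumes the iteration number is taken as the smallest integer for which the accumulated steps can cover the maximum perturbation; this is one natural reading, and both the helper name and the numeric values are illustrative rather than taken from the experiments.

```python
import math


def iterations_to_reach(eps, alpha):
    """Smallest N such that N * alpha >= eps."""
    return math.ceil(eps / alpha)


eps = 15.0                                   # illustrative perturbation budget
schedule = {alpha: iterations_to_reach(eps, alpha) for alpha in (0.5, 1.0, 2.0)}
```

Halving the step size doubles the number of iterations, so smaller steps trade extra computation for a finer search inside the same perturbation budget.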
In order to examine the effectiveness of our proposed attack methods in practice, we here reproduce the top defense submissions, which are black-box models to us, and official baselines from the NIPS adversarial competition. Due to resource limitations, we only consider the top defense submissions, i.e., TsAIL (https://github.com/lfz/Guided-Denoise), iyswim (https://github.com/cihangxie/NIPS2017_adv_challenge_defense) and Anil Thomas (https://github.com/anlthms/nips-2017/tree/master/mmd), and official baselines, i.e., Inc-v3adv, IncRes-v2ens and Inc-v3. The images in the test dataset all have the same size, and their corresponding labels are the same as the ImageNet 1000-class labels.
When generating adversarial examples, we follow the procedure in  that: (1) firstly, split the dataset equally into batches; (2) secondly, for each batch, the maximum perturbation is randomly chosen from the set ; (3) lastly, generate adversarial examples for each batch under the corresponding perturbation constraint.
For the attacker configuration, we follow exactly the settings of the MI-FGSM competition submission, which attacks an ensemble of Inc-v3, Inc-v4, IncRes-v2, Res-152, Inc-v3ens3, Inc-v3ens4, IncRes-v2ens and Inc-v3adv. The first seven models share an equal ensemble weight, and Inc-v3adv is assigned a separate weight. The total iteration number $N$ and the decay factor $\mu$ also follow that configuration. This configuration for MI-FGSM won 1st place in the NIPS adversarial attack competition. For DI2-FGSM and M-DI2-FGSM, we choose the transformation probability $p$ according to the trends shown in Figure 3.
The results are summarized in Table 3. We also report the official results of MI-FGSM (named MI-FGSM*) as a reference to validate our implementation. The performance difference between MI-FGSM and MI-FGSM* is due to the randomness of the maximum perturbation magnitude introduced in the attack process. Compared with MI-FGSM, DI2-FGSM has higher success rates on top submissions but slightly lower success rates on baseline models, so the two attack methods have similar average success rates. By integrating both diverse inputs and a momentum term, the enhanced attack, M-DI2-FGSM, reaches an average success rate of 73.0%, which is far better than other methods. For example, the top attack submission in the NIPS competition, MI-FGSM, only gets an average success rate of 66.4%, i.e., 6.6% lower. We believe the same advantage can be observed even if we test on all defense submissions. These results also indicate that our proposed attack method can be used as a better tool to evaluate the robustness of various newly developed networks and defense methods.
We provide a brief discussion of why diverse patterns help generate adversarial examples with better transferability. One hypothesis is that the decision boundaries of different networks share similar inherent structures due to the same training dataset, e.g., ImageNet. For example, as shown in Figure 1, different networks make similar mistakes in the presence of adversarial examples. By incorporating diverse patterns at each step, the optimization produces adversarial examples that are more robust to small transformations. These adversarial examples are malicious in a certain region at the network decision boundary, thus increase the chance to fool other networks, i.e., they achieve better black-box success rate than existing methods. In the future, we plan to validate this hypothesis theoretically or empirically.
In this paper, we propose to improve the transferability of adversarial examples with input diversity. Specifically, our method applies random transformations to the input images at each iteration of the attack process. Compared with traditional iterative attacks, the results on ImageNet show that our proposed attack method gets significantly higher success rates for black-box models, and maintains similar success rates for white-box models. We improve the transferability further by integrating a momentum term and attacking multiple networks simultaneously. By evaluating this enhanced attack against the top defense submissions and official baselines from the NIPS adversarial competition, we show that it reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. The code is publicly available at https://github.com/cihangxie/DI-2-FGSM.
Girshick, R.: Fast r-cnn. In: International Conference on Computer Vision. IEEE (2015)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Computer Vision and Pattern Recognition. IEEE (2015)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI (2017)
As pointed out in Section 4.4, the gap in success rates between M-DI2-FGSM and DI2-FGSM diminishes as the total iteration number $N$ increases, which may indicate that the momentum term is less useful when $N$ is large. In order to validate this assumption, we first study the influence of a large total iteration number on the success rates by increasing $N$ further under both white-box and black-box settings. We fix the transformation probability $p$ and the step size $\alpha$. The results are shown in Figure 6. When the iteration number is large, we observe that (1) DI2-FGSM and M-DI2-FGSM have similar white-box success rates on all models, and comparable black-box success rates on most normally trained models; and (2) DI2-FGSM has higher black-box success rates on adversarially trained models than M-DI2-FGSM.
By fixing the total iteration number $N$ and the step size $\alpha$, we then study the influence of the transformation probability $p$ on the success rates under both white-box and black-box settings. The transformation probability $p$ is increased, since the original value may be too small under the large iteration number setting. The results are shown in Figure 7. When the transformation probability is large, we observe that, compared with M-DI2-FGSM, DI2-FGSM has (1) similar white-box success rates on all models, and comparable black-box success rates on most normally trained models; and (2) much higher black-box success rates on adversarially trained models.
Based on the experimental results above, we conclude that the momentum term helps to reduce the total iteration number needed, but is not necessary when the number of attack iterations is already large.