Improving Transferability of Adversarial Examples with Input Diversity

03/19/2018 ∙ by Cihang Xie, et al. ∙ snapchat 0

Though convolutional neural networks have achieved state-of-the-art performance on various vision tasks, they are extremely vulnerable to adversarial examples, which are obtained by adding human-imperceptible perturbations to the original images. Adversarial examples can thus be used as a useful tool to evaluate and select the most robust models in safety-critical applications. However, most existing adversarial attacks only achieve relatively low success rates under the challenging black-box setting, where the attackers have no knowledge of the model structure and parameters. To this end, we propose to improve the transferability of adversarial examples by creating diverse input patterns. Instead of only using the original images to generate adversarial examples, our method applies random transformations to the input images at each iteration. Extensive experiments on ImageNet show that the proposed attack method can generate adversarial examples that transfer much better to different networks than existing baselines. To further improve the transferability, we (1) integrate the recently proposed momentum method into the attack process; and (2) attack an ensemble of networks simultaneously. By evaluating our method against top defense submissions and official baselines from the NIPS 2017 adversarial competition, this enhanced attack reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. The code is publicly available at https://github.com/cihangxie/DI-2-FGSM.

1 Introduction

The recent success of convolutional neural networks (CNNs) has led to dramatic performance improvements on various vision tasks, including image classification [13, 29, 11], object detection [8, 25, 37] and semantic segmentation [19, 3]. However, CNNs are extremely vulnerable to small perturbations of the input images, i.e., human-imperceptible additive perturbations can cause CNNs to make incorrect predictions. These intentionally crafted images are known as adversarial examples [33]. Learning how to generate adversarial examples can help us investigate the robustness of different models [1] and understand the insufficiency of current training algorithms [9, 15, 34].

Figure 1: Comparison of the success rates of three attacks on four networks. The ground-truth label is walking stick, which is marked in pink in the top-ranked confidence distribution plots. The adversarial examples are crafted on Inception-v3 under the maximum perturbation $\epsilon$. The first row shows the top-ranked confidence distributions for the clean image, which indicate that all the networks make correct predictions with high confidence. The second and third rows show the confidence distributions for the adversarial examples generated by the Fast Gradient Sign Method (FGSM) and the Iterative Fast Gradient Sign Method (I-FGSM), respectively. These adversarial examples successfully attack the white-box model Inception-v3, but do not transfer to all black-box models, e.g., Inception-Resnet-v2. The fourth row shows the confidence distributions for the adversarial examples generated by our proposed attack method, the Diverse Inputs Iterative Fast Gradient Sign Method (DI2-FGSM), which attacks the white-box model and all black-box models successfully. Although these adversarial examples have different success rates, human observers perceive all of them as similar to the clean image.

Several methods [9, 33, 14] have been proposed recently to find adversarial examples. In general, these attacks can be categorized into two types, single-step attacks [9] and iterative attacks [33, 14], according to the number of gradient computation steps. Under the white-box setting, where the attackers have perfect knowledge of the network structure and weights, iterative attacks can generate adversarial examples with much higher success rates than those generated by single-step attacks. However, if these adversarial examples are tested on a different network (different in structure, weights or both), i.e., the black-box setting, single-step attacks achieve higher success rates than iterative attacks. This trade-off arises because iterative attacks tend to overfit the specific network parameters (i.e., have high white-box success rates), so the generated adversarial examples rarely transfer to other networks (i.e., have low black-box success rates), while single-step attacks usually underfit the network parameters (i.e., have low white-box success rates) and thus produce adversarial examples with slightly better transferability. Given this phenomenon, one interesting question is whether we can generate adversarial examples with high success rates under both white-box and black-box settings.

Data augmentation [13, 29, 11] has been shown to be an effective way to prevent networks from overfitting during the training process. Specifically, a set of label-preserving transformations, e.g., resizing, cropping and rotating, is applied to the images to enlarge the training set. Consequently, the trained networks generalize better to unseen images. Meanwhile, [35, 10] showed that image transformations can defend against adversarial examples under certain situations, which indicates that adversarial examples do not generalize well under different transformations. These transformed adversarial examples are known as hard examples [27, 28] for attackers, and can then serve as good samples for producing more transferable adversarial examples.

To this end, we propose the Diverse Inputs Iterative Fast Gradient Sign Method (DI2-FGSM) to improve the transferability of adversarial examples. At each iteration, unlike the traditional methods which maximize the loss function directly w.r.t. the original inputs, we apply random and differentiable transformations to the input images with probability $p$ and maximize the loss function w.r.t. these transformed inputs. In particular, the transformations used here are random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. Note that these randomized operations were previously used to defend against adversarial examples [35], while here we incorporate them into the attack process to create hard and diverse input patterns. Figure 1 shows an adversarial example generated by our proposed attack method, DI2-FGSM, and compares its success rates to those of other attack methods under both white-box and black-box settings.

We test the proposed attack method on several networks under both white-box and black-box settings. Compared with traditional iterative attacks, the results on ImageNet (see Section 4.2) show that DI2-FGSM achieves significantly higher success rates on black-box models, and maintains similar success rates on white-box models. To improve the transferability of adversarial examples further, we (1) integrate a momentum term [7] into the attack process; and (2) attack multiple networks simultaneously [18]. By evaluating our attack method against the top defense submissions and official baselines from the NIPS adversarial competition [16], this enhanced attack reaches an average success rate of 73.0% (Table 3), which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future.

2 Related Work

2.1 Generating Adversarial Examples

Traditional machine learning algorithms are known to be vulnerable to adversarial examples [5, 12, 2]. Recently, Szegedy et al. [33] pointed out that CNNs are also fragile to adversarial examples, and proposed a box-constrained L-BFGS method to find adversarial examples reliably. Because the computation in [33] is expensive, Goodfellow et al. [9] proposed the fast gradient sign method to generate adversarial examples efficiently by performing a single gradient step. This method was then extended by [14] to an iterative version, which also showed that the generated adversarial examples can exist in the physical world. Dong et al. [7] proposed a broad class of momentum-based iterative algorithms to boost the transferability of adversarial examples. The transferability can also be improved by attacking an ensemble of networks simultaneously [18]. Besides image classification, adversarial examples also exist in object detection [36], semantic segmentation [36, 4], speech recognition [4], deep reinforcement learning [17], etc. Unlike adversarial examples, which can still be recognized by humans, Nguyen et al. [22] generated fooling images that differ from natural images and are difficult for humans to recognize, yet CNNs classify them as recognizable objects with high confidence.

2.2 Defending Against Adversarial Examples

Conversely, many methods have been proposed recently to defend against adversarial examples. [9, 15] proposed to inject adversarial examples into the training data to increase the network robustness. Tramèr et al. [34] pointed out that such adversarially trained models still remain vulnerable to adversarial examples, and proposed ensemble adversarial training, which augments training data with perturbations transferred from other models, in order to improve the network robustness further. [35, 10] applied randomized image transformations to the inputs at inference time to mitigate adversarial effects. Dhillon et al. [6] pruned a random subset of activations according to their magnitude to enhance network robustness. Prakash et al. [24] proposed a framework which combines pixel deflection with soft wavelet denoising to defend against adversarial examples. [21, 30, 26] leveraged generative models to purify adversarial images by moving them back towards the distribution of clean images.

3 Methodology

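Let $X$ denote an image, and $y^{\text{true}}$ denote the corresponding ground-truth label. We use $\theta$ to denote the network parameters, and $L(X, y^{\text{true}}; \theta)$ to denote the loss. For adversarial example generation, the goal is to maximize the loss $L(X^{\text{adv}}, y^{\text{true}}; \theta)$ for the image $X$, under the constraint that the generated adversarial example $X^{\text{adv}}$ should look visually similar to the original image $X$ and the corresponding predicted label $y^{\text{adv}} \neq y^{\text{true}}$. In this paper, we use the $l_\infty$-norm to measure the perceptibility of adversarial perturbations, i.e., $\|X^{\text{adv}} - X\|_\infty \leq \epsilon$. The loss function is defined as

$$L(X, y^{\text{true}}; \theta) = -\mathbb{1}_{y^{\text{true}}} \cdot \log\big(\mathrm{softmax}(l(X; \theta))\big), \tag{1}$$

where $\mathbb{1}_{y^{\text{true}}}$ is the one-hot encoding of the ground-truth label $y^{\text{true}}$, and $l(X; \theta)$ is the logits output. Note that all the baseline attacks have been implemented in the cleverhans library [23], which can be used directly for our experiments.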
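As a concrete reading of the loss in Equation (1), the following minimal NumPy sketch computes the cross-entropy between a one-hot ground-truth label and the softmax of a logits vector. The function name `cross_entropy_loss` and the example logits are illustrative assumptions, not part of the cleverhans library.

```python
import numpy as np

def cross_entropy_loss(logits, true_label):
    """Cross-entropy between the one-hot true label and softmax(logits), for one example."""
    shifted = logits - np.max(logits)                 # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    one_hot = np.zeros_like(logits)
    one_hot[true_label] = 1.0
    return float(-np.dot(one_hot, log_probs))

# Hypothetical logits for a three-class problem.
example_logits = np.array([2.0, 0.5, -1.0])
loss_if_label_0 = cross_entropy_loss(example_logits, 0)   # label agrees with the largest logit
loss_if_label_2 = cross_entropy_loss(example_logits, 2)   # label disagrees with the largest logit
```

The loss is small when the largest logit belongs to the ground-truth class and grows as probability mass moves to other classes, which is why an attacker maximizes it.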

3.1 Family of Fast Gradient Sign Methods

In this section, we give an overview of the family of fast gradient sign methods:

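  • Fast Gradient Sign Method (FGSM): FGSM [9] is the first member of this attack family, which finds the adversarial perturbations in the direction of the loss gradient $\nabla_X L(X, y^{\text{true}}; \theta)$. The update equation is

    $$X^{\text{adv}} = X + \epsilon \cdot \mathrm{sign}\big(\nabla_X L(X, y^{\text{true}}; \theta)\big). \tag{2}$$
  • Iterative Fast Gradient Sign Method (I-FGSM): Kurakin et al. [15] extended FGSM to an iterative version, which can be expressed as

    $$X_0^{\text{adv}} = X, \tag{3}$$
    $$X_{n+1}^{\text{adv}} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{\text{adv}} + \alpha \cdot \mathrm{sign}\big(\nabla_X L(X_n^{\text{adv}}, y^{\text{true}}; \theta)\big)\big\}, \tag{4}$$

    where $\mathrm{Clip}_X^{\epsilon}$ indicates that the resulting images are clipped within the $\epsilon$-ball of the original image $X$, $n$ is the iteration number and $\alpha$ is the step size.

  • Momentum Iterative Fast Gradient Sign Method (MI-FGSM): MI-FGSM [7] integrates a momentum term into the attack process to stabilize update directions and escape from poor local maxima. The updating procedure is similar to I-FGSM, with the replacement of Equation (4) by:

    $$g_{n+1} = \mu \cdot g_n + \frac{\nabla_X L(X_n^{\text{adv}}, y^{\text{true}}; \theta)}{\|\nabla_X L(X_n^{\text{adv}}, y^{\text{true}}; \theta)\|_1}, \tag{5}$$
    $$X_{n+1}^{\text{adv}} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{\text{adv}} + \alpha \cdot \mathrm{sign}(g_{n+1})\big\}, \tag{6}$$

    where $\mu$ is the decay factor of the momentum term and $g_n$ is the accumulated gradient at iteration $n$.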
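The three update rules above can be sketched on a toy problem. The following is a minimal NumPy illustration, assuming a hypothetical linear softmax classifier with weights `W` whose input gradient is available in closed form. The function names, the toy model and all numeric settings are illustrative assumptions and are not the cleverhans implementations.

```python
import numpy as np

def loss_and_grad(x, y_true, W):
    """Cross-entropy loss of a toy linear softmax model and its gradient w.r.t. the input x."""
    logits = W @ x
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    loss = -np.log(probs[y_true])
    one_hot = np.eye(W.shape[0])[y_true]
    grad = W.T @ (probs - one_hot)
    return float(loss), grad

def fgsm(x, y_true, W, eps):
    """Single step of size eps along the sign of the loss gradient."""
    _, grad = loss_and_grad(x, y_true, W)
    return x + eps * np.sign(grad)

def mi_fgsm(x, y_true, W, eps, alpha, num_iter, mu):
    """Iterative sign-gradient ascent with an accumulated, L1-normalised gradient."""
    x_adv = x.copy()
    g_acc = np.zeros_like(x)
    for _ in range(num_iter):
        _, grad = loss_and_grad(x_adv, y_true, W)
        g_acc = mu * g_acc + grad / (np.abs(grad).sum() + 1e-12)
        # Clip back into the eps-ball (L-infinity) around the original input.
        x_adv = np.clip(x_adv + alpha * np.sign(g_acc), x - eps, x + eps)
    return x_adv

def i_fgsm(x, y_true, W, eps, alpha, num_iter):
    """I-FGSM is the mu = 0 case: L1 normalisation leaves the gradient sign unchanged."""
    return mi_fgsm(x, y_true, W, eps, alpha, num_iter, mu=0.0)

# Toy setup with hypothetical sizes and hyperparameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
x = rng.normal(size=8)
y = int(np.argmax(W @ x))            # treat the model's own prediction as the true label
eps, alpha = 0.3, 0.05
x_fgsm = fgsm(x, y, W, eps)
x_ifgsm = i_fgsm(x, y, W, eps, alpha, num_iter=10)
x_mifgsm = mi_fgsm(x, y, W, eps, alpha, num_iter=10, mu=1.0)
```

In this sketch a single routine covers both iterative attacks, because setting `mu=0` removes the momentum term without changing the sign of the update.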

3.2 Diverse Inputs Iterative Fast Gradient Sign Method

3.2.1 Overfitting Phenomenon

Let $\hat{\theta}$ denote the unknown network parameters. In general, a strong adversarial example should have high success rates on both white-box models, i.e., $L(X^{\text{adv}}, y^{\text{true}}; \theta)$, and black-box models, i.e., $L(X^{\text{adv}}, y^{\text{true}}; \hat{\theta})$. On one hand, the traditional single-step attacks, e.g., FGSM, tend to underfit the specific network parameters $\theta$ due to the inaccurate linear approximation of the loss $L(X, y^{\text{true}}; \theta)$, and thus cannot reach high success rates on white-box models. On the other hand, the traditional iterative attacks, e.g., I-FGSM, greedily perturb the images in the direction of the sign of the loss gradient $\nabla_X L(X, y^{\text{true}}; \theta)$ at each iteration, and thus easily fall into poor local maxima and overfit the specific network parameters $\theta$. These overfitted adversarial examples rarely transfer to black-box models. In order to generate adversarial examples with strong transferability, we need to find a better way to optimize the loss $L(X, y^{\text{true}}; \theta)$ to alleviate this overfitting phenomenon.

Data augmentation [13, 29, 11] has been shown to be an effective way to prevent networks from overfitting during the training process. Meanwhile, [35, 10] showed that adversarial examples are no longer malicious if simple image transformations are applied, which indicates that these transformed adversarial images can serve as good samples for better optimization.

3.2.2 Our Solution

Based on the analysis above, we propose the Diverse Inputs Iterative Fast Gradient Sign Method (DI2-FGSM), which applies image transformations $T(\cdot)$ to the original inputs $X$ with probability $p$ at each iteration to alleviate the overfitting phenomenon. Specifically, the image transformations applied here are random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner [35]. The transformation probability $p$ controls the trade-off between success rates on white-box models and success rates on black-box models, which can be observed from Figure 3. If $p = 0$, DI2-FGSM degrades to I-FGSM and leads to overfitting. If $p = 1$, i.e., only transformed inputs are used for the attack, the generated adversarial examples tend to have much higher success rates on black-box models but lower success rates on white-box models, since the original inputs are not seen by the attackers.

In general, the updating procedure of DI2-FGSM is similar to I-FGSM, with the replacement of Equation (4) by:

$$X_{n+1}^{\text{adv}} = \mathrm{Clip}_X^{\epsilon}\big\{X_n^{\text{adv}} + \alpha \cdot \mathrm{sign}\big(\nabla_X L(T(X_n^{\text{adv}}; p), y^{\text{true}}; \theta)\big)\big\}, \tag{7}$$

where the stochastic transformation function $T(X_n^{\text{adv}}; p)$ is:

$$T(X_n^{\text{adv}}; p) = \begin{cases} T(X_n^{\text{adv}}) & \text{with probability } p, \\ X_n^{\text{adv}} & \text{with probability } 1 - p. \end{cases} \tag{8}$$
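The stochastic transformation described above (random resizing followed by random zero padding, applied with some probability) can be sketched as follows. This is a forward-pass illustration only, under stated assumptions: it uses nearest-neighbour resizing in plain NumPy, and the names `random_resize_and_pad` and `diverse_input`, the square 8x8 test image and the output size of 12 are all hypothetical choices rather than the settings used in the experiments.

```python
import numpy as np

def random_resize_and_pad(image, out_size, rng):
    """Resize an (H, W, C) image to rnd x rnd (nearest neighbour), then zero-pad to out_size."""
    height, width, channels = image.shape
    rnd = int(rng.integers(height, out_size))       # random side length in [height, out_size)
    rows = np.arange(rnd) * height // rnd           # source row index for each output row
    cols = np.arange(rnd) * width // rnd            # source column index for each output column
    resized = image[rows][:, cols]
    pad_total = out_size - rnd
    pad_top = int(rng.integers(0, pad_total + 1))   # random split of the padding
    pad_left = int(rng.integers(0, pad_total + 1))
    padded = np.zeros((out_size, out_size, channels), dtype=image.dtype)
    padded[pad_top:pad_top + rnd, pad_left:pad_left + rnd] = resized
    return padded

def diverse_input(image, p, out_size, rng):
    """Return the transformed image with probability p, otherwise the unchanged image."""
    if rng.random() < p:
        return random_resize_and_pad(image, out_size, rng)
    return image

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
transformed = diverse_input(image, p=1.0, out_size=12, rng=rng)
untouched = diverse_input(image, p=0.0, out_size=12, rng=rng)
```

Note that the two branches return images of different spatial size, so this sketch assumes the attacked model accepts variable input resolution. In an actual attack the transformation must also be differentiable so that the gradient can flow back to the untransformed input.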

3.3 Momentum Diverse Inputs Iterative Fast Gradient Sign Method

Intuitively, momentum and diverse inputs are two completely different ways to alleviate the overfitting phenomenon. We can combine them naturally to form a much stronger attack, i.e., the Momentum Diverse Inputs Iterative Fast Gradient Sign Method (M-DI2-FGSM). The overall updating procedure of M-DI2-FGSM is similar to MI-FGSM, with only the replacement of Equation (5) by:

$$g_{n+1} = \mu \cdot g_n + \frac{\nabla_X L(T(X_n^{\text{adv}}; p), y^{\text{true}}; \theta)}{\|\nabla_X L(T(X_n^{\text{adv}}; p), y^{\text{true}}; \theta)\|_1}. \tag{9}$$

3.4 Relationships between Different Attacks

The attacks mentioned above all belong to the family of Fast Gradient Sign Methods, and can be related via different parameter settings, as shown in Figure 2. In summary:

  • If the transformation probability $p = 0$, M-DI2-FGSM degrades to MI-FGSM, and DI2-FGSM degrades to I-FGSM;

  • If the decay factor $\mu = 0$, M-DI2-FGSM degrades to DI2-FGSM, and MI-FGSM degrades to I-FGSM;

  • If the total iteration number $N = 1$, I-FGSM degrades to FGSM.

Figure 2: Relationships between different attacks
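These relationships can be made concrete with a single routine whose parameters select the attack. The sketch below is a toy NumPy illustration under several assumptions: the model is a hypothetical linear softmax classifier with weights `W`, and the transformation is a random circular shift used purely as a differentiable stand-in for random resizing and padding. The FGSM case additionally sets the step size equal to the maximum perturbation, which is an assumption of this sketch.

```python
import numpy as np

def grad_loss(x, y_true, W):
    """Gradient of the cross-entropy loss of a toy linear softmax model w.r.t. x."""
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[y_true] -= 1.0                      # softmax minus one-hot
    return W.T @ probs

def m_di2_fgsm(x, y_true, W, eps, alpha, num_iter, mu, p, rng):
    """Momentum + diverse-input sign-gradient attack on the toy model."""
    x_adv = x.copy()
    g_acc = np.zeros_like(x)
    for _ in range(num_iter):
        if rng.random() < p:
            # Stand-in transformation: circular shift. Its gradient w.r.t. the
            # unshifted input is the gradient at the shifted input, shifted back.
            shift = int(rng.integers(1, x.size))
            grad = np.roll(grad_loss(np.roll(x_adv, shift), y_true, W), -shift)
        else:
            grad = grad_loss(x_adv, y_true, W)
        g_acc = mu * g_acc + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = np.clip(x_adv + alpha * np.sign(g_acc), x - eps, x + eps)
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
x = rng.normal(size=8)
y = int(np.argmax(W @ x))
eps, alpha = 0.3, 0.05

x_full = m_di2_fgsm(x, y, W, eps, alpha, 10, mu=1.0, p=0.5, rng=rng)   # M-DI2-FGSM
x_mi = m_di2_fgsm(x, y, W, eps, alpha, 10, mu=1.0, p=0.0, rng=rng)     # p = 0: MI-FGSM
x_di = m_di2_fgsm(x, y, W, eps, alpha, 10, mu=0.0, p=0.5, rng=rng)     # mu = 0: DI2-FGSM
x_fgsm = m_di2_fgsm(x, y, W, eps, eps, 1, mu=0.0, p=0.0, rng=rng)      # N = 1: FGSM
```

Keeping one code path for the whole family makes ablations over `p`, `mu` and `num_iter` a matter of changing arguments rather than switching implementations.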

3.5 Attacking an Ensemble of Networks

Liu et al. [18] suggested that attacking an ensemble of multiple networks simultaneously can generate much stronger adversarial examples. The motivation is that if an adversarial image remains adversarial for multiple networks, then it is more likely to transfer to other networks as well. Therefore, we can use this strategy to improve the transferability even further.

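We follow the ensemble strategy proposed in [7], which fuses the logit activations together to attack multiple networks simultaneously. Specifically, to attack an ensemble of $K$ models, the logits are fused by:

$$l(X; \theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} w_k \, l_k(X; \theta_k), \tag{10}$$

where $l_k(X; \theta_k)$ is the logits output of the $k$-th model with the parameters $\theta_k$, and $w_k$ is the ensemble weight with $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$.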
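The logit fusion described above is a weighted sum over models. A minimal NumPy sketch follows, assuming the per-model logits have already been computed and stacked into one array. The function name `fuse_logits` and the example numbers are hypothetical.

```python
import numpy as np

def fuse_logits(logits_per_model, weights):
    """Weighted sum of per-model logits; weights must be non-negative and sum to 1."""
    logits_per_model = np.asarray(logits_per_model, dtype=float)   # shape (K, num_classes)
    weights = np.asarray(weights, dtype=float)                     # shape (K,)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("ensemble weights must be non-negative and sum to 1")
    return weights @ logits_per_model

# Hypothetical logits of three models on a three-class problem, fused with equal weights.
logits = [[2.0, 0.0, -1.0],
          [0.0, 1.0, 0.0],
          [1.0, 1.0, 1.0]]
fused = fuse_logits(logits, [1 / 3, 1 / 3, 1 / 3])
```

The fused logits are then fed to the same loss as for a single model, so any attack in the family can be run against the ensemble without further changes.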

4 Experiment

4.1 Experiment Setup

4.1.1 Dataset

It is less meaningful to attack images that are already classified wrongly. Therefore, we randomly choose images from the ImageNet validation set that are classified correctly by all the networks we test on, to form our test dataset. All these images are resized to a common input size beforehand.
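The selection step can be sketched as follows, assuming the predicted labels of every network on the candidate images are already available as an array. The function name `select_test_indices`, the tiny example arrays and the number of selected images are hypothetical.

```python
import numpy as np

def select_test_indices(predictions, labels, num_images, rng):
    """Randomly pick indices of images that every network classifies correctly."""
    predictions = np.asarray(predictions)          # shape (num_models, num_samples)
    labels = np.asarray(labels)                    # shape (num_samples,)
    correct_everywhere = np.all(predictions == labels[None, :], axis=0)
    candidates = np.flatnonzero(correct_everywhere)
    if candidates.size < num_images:
        raise ValueError("not enough images are classified correctly by every network")
    return np.sort(rng.choice(candidates, size=num_images, replace=False))

# Hypothetical predictions of three networks on six images.
labels = np.array([0, 1, 2, 1, 0, 2])
predictions = np.array([
    [0, 1, 2, 1, 0, 0],
    [0, 1, 1, 1, 0, 2],
    [0, 2, 2, 1, 0, 2],
])
chosen = select_test_indices(predictions, labels, num_images=2, rng=np.random.default_rng(0))
```

Filtering this way ensures that a misclassification after the attack can be attributed to the perturbation rather than to an error the network already made on the clean image.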

4.1.2 Networks

We consider four normally trained networks, i.e., Inception-v3 (Inc-v3) [32], Inception-v4 (Inc-v4) [31], Resnet-v2-152 (Res-152) [11] and Inception-Resnet-v2 (IncRes-v2) [31], and three adversarially trained networks [34], i.e., ens3-adv-Inception-v3 (Inc-v3ens3), ens4-adv-Inception-v3 (Inc-v3ens4) and ens-adv-Inception-ResNet-v2 (IncRes-v2ens). All networks are publicly available (https://github.com/tensorflow/models/tree/master/research/slim, https://github.com/tensorflow/models/tree/master/research/adv_imagenet_models).

4.1.3 Implementation details

For the parameters of the different attackers, we follow the default settings in [14] for the step size $\alpha$ and the total iteration number $N$. The maximum perturbation $\epsilon$ is set to a value that is still imperceptible to human vision [20]. For the momentum term, the decay factor $\mu$ is set as in [7]. For the stochastic transformation function $T(X; p)$, the probability $p$ is set to $0.5$, i.e., attackers pay equal attention to the original inputs and the transformed inputs. For the transformation operations $T(\cdot)$, the input $X$ is first resized to a random size, and then zero-padded to a fixed final size in a random manner.

4.2 Attacking a Single Network

| Model | Attack | Inc-v3 | Inc-v4 | IncRes-v2 | Res-152 | Inc-v3ens3 | Inc-v3ens4 | IncRes-v2ens |
|---|---|---|---|---|---|---|---|---|
| Inc-v3 | FGSM | 64.6% | 23.5% | 21.7% | 21.7% | 8.0% | 7.5% | 3.6% |
| Inc-v3 | I-FGSM | 99.9% | 14.8% | 11.6% | 8.9% | 3.3% | 2.9% | 1.5% |
| Inc-v3 | DI2-FGSM (Ours) | 99.9% | 35.5% | 27.8% | 21.4% | 5.5% | 5.2% | 2.8% |
| Inc-v3 | MI-FGSM | 99.9% | 36.6% | 34.5% | 27.5% | 8.9% | 8.4% | 4.7% |
| Inc-v3 | M-DI2-FGSM (Ours) | 99.9% | 63.9% | 59.4% | 47.9% | 14.3% | 14.0% | 7.0% |
| Inc-v4 | FGSM | 26.4% | 49.6% | 19.7% | 20.4% | 8.4% | 7.7% | 4.1% |
| Inc-v4 | I-FGSM | 22.0% | 99.9% | 13.2% | 10.9% | 3.2% | 3.0% | 1.7% |
| Inc-v4 | DI2-FGSM (Ours) | 43.3% | 99.7% | 28.9% | 23.1% | 5.9% | 5.5% | 3.2% |
| Inc-v4 | MI-FGSM | 51.1% | 99.9% | 39.4% | 33.7% | 11.2% | 10.7% | 5.3% |
| Inc-v4 | M-DI2-FGSM (Ours) | 72.4% | 99.5% | 62.2% | 52.1% | 17.6% | 15.6% | 8.8% |
| IncRes-v2 | FGSM | 24.3% | 19.3% | 39.6% | 19.4% | 8.5% | 7.3% | 4.8% |
| IncRes-v2 | I-FGSM | 22.2% | 17.7% | 97.9% | 12.6% | 4.6% | 3.7% | 2.5% |
| IncRes-v2 | DI2-FGSM (Ours) | 46.5% | 40.5% | 95.8% | 28.6% | 8.2% | 6.6% | 4.8% |
| IncRes-v2 | MI-FGSM | 53.5% | 45.9% | 98.4% | 37.8% | 15.3% | 13.0% | 8.8% |
| IncRes-v2 | M-DI2-FGSM (Ours) | 71.2% | 67.4% | 96.1% | 57.4% | 25.1% | 20.7% | 14.9% |
| Res-152 | FGSM | 34.4% | 28.5% | 27.1% | 75.2% | 12.4% | 11.0% | 6.0% |
| Res-152 | I-FGSM | 20.8% | 17.2% | 14.9% | 99.1% | 5.4% | 4.6% | 2.8% |
| Res-152 | DI2-FGSM (Ours) | 53.8% | 49.0% | 44.8% | 99.2% | 13.0% | 11.1% | 6.9% |
| Res-152 | MI-FGSM | 50.1% | 44.1% | 42.2% | 99.0% | 18.2% | 15.2% | 9.0% |
| Res-152 | M-DI2-FGSM (Ours) | 78.9% | 76.5% | 74.8% | 99.2% | 35.2% | 29.4% | 19.0% |

Table 1: The success rates on seven networks where we attack a single network. The Model column gives the network on which the adversarial examples are crafted (four normally trained networks), and the remaining columns give the networks on which they are tested. The diagonal blocks indicate white-box attacks, while the off-diagonal blocks indicate black-box attacks, which are much more challenging. We observe that M-DI2-FGSM always reaches the highest success rates on all black-box models, beating other methods by a large margin, and maintains high success rates on all white-box models.

We first perform adversarial attacks on a single network, using FGSM, I-FGSM, DI2-FGSM, MI-FGSM and M-DI2-FGSM, respectively. We craft adversarial examples only on normally trained networks, and test them on all seven networks. The success rates are shown in Table 1, where the diagonal blocks indicate white-box attacks and off-diagonal blocks indicate black-box attacks. We list the networks that we attack on in rows, and networks that we test on in columns.

From Table 1, first and foremost, we observe that M-DI2-FGSM outperforms all other baseline attacks by a large margin on all black-box models, and maintains high success rates on all white-box models. For example, if adversarial examples are crafted on IncRes-v2, M-DI2-FGSM has success rates of 67.4% on Inc-v4 (normally trained black-box model) and 25.1% on Inc-v3ens3 (adversarially trained black-box model), while strong baselines like MI-FGSM only obtain the corresponding success rates of 45.9% and 15.3%, respectively. This convincingly demonstrates the effectiveness of combining input diversity and momentum for improving the transferability of adversarial examples.

We then compare the success rates of I-FGSM and DI2-FGSM to see the effectiveness of diverse input patterns alone. By generating adversarial examples with input diversity, DI2-FGSM significantly improves on the success rates of I-FGSM on challenging black-box models, regardless of whether the model is adversarially trained, and maintains high success rates on white-box models. For example, if adversarial examples are crafted on Res-152, DI2-FGSM has success rates of 99.2% on Res-152 (white-box model), 53.8% on Inc-v3 (normally trained black-box model) and 11.1% on Inc-v3ens4 (adversarially trained black-box model), while I-FGSM only obtains the corresponding success rates of 99.1%, 20.8% and 4.6%, respectively. Compared with FGSM, DI2-FGSM also reaches much higher success rates on the normally trained black-box models, and comparable performance on the adversarially trained black-box models.

4.3 Attacking an Ensemble of Networks

| Setting | Attack | -Inc-v3 | -Inc-v4 | -IncRes-v2 | -Res-152 | -Inc-v3ens3 | -Inc-v3ens4 | -IncRes-v2ens |
|---|---|---|---|---|---|---|---|---|
| Ensemble | I-FGSM | 96.6% | 96.9% | 98.7% | 96.2% | 97.0% | 97.3% | 94.3% |
| Ensemble | DI2-FGSM (Ours) | 88.9% | 89.6% | 93.2% | 87.7% | 91.7% | 91.7% | 93.2% |
| Ensemble | MI-FGSM | 96.9% | 96.9% | 98.8% | 96.8% | 96.8% | 97.0% | 94.6% |
| Ensemble | M-DI2-FGSM (Ours) | 90.1% | 91.1% | 94.0% | 89.3% | 92.8% | 92.7% | 94.9% |
| Hold-out | I-FGSM | 43.7% | 36.4% | 33.3% | 25.4% | 12.9% | 15.1% | 8.8% |
| Hold-out | DI2-FGSM (Ours) | 69.9% | 67.9% | 64.1% | 51.7% | 36.3% | 35.0% | 30.4% |
| Hold-out | MI-FGSM | 71.4% | 65.9% | 64.6% | 55.6% | 22.8% | 26.1% | 15.8% |
| Hold-out | M-DI2-FGSM (Ours) | 80.7% | 80.6% | 80.7% | 70.9% | 44.6% | 44.5% | 39.4% |

Table 2: The success rates of ensemble attacks. We take all seven networks into consideration. Adversarial examples are generated on an ensemble of six networks, and tested on the ensembled network (white-box setting, Ensemble rows) and the hold-out network (black-box setting, Hold-out rows). The sign “-” indicates the name of the hold-out network. We observe that M-DI2-FGSM always reaches the highest success rates on all black-box models, beating other methods by a large margin, and maintains high success rates (though slightly lower than I-FGSM and MI-FGSM) on all white-box models.

Though the results in Table 1 show that momentum and input diversity can significantly improve the transferability of adversarial examples, these attacks are still relatively weak against an adversarially trained network under the black-box setting, e.g., the highest black-box success rate on IncRes-v2ens is only 19.0%. Therefore, we follow the strategy in [18] and attack multiple networks simultaneously in order to further improve transferability. We consider all seven networks here. Adversarial examples are generated on an ensemble of six networks, and tested on the ensembled network and the hold-out network, using I-FGSM, DI2-FGSM, MI-FGSM and M-DI2-FGSM, respectively. FGSM is omitted here due to its low success rates on white-box models. All ensembled models are assigned equal weight, i.e., each of the six ensemble weights is $w_k = 1/6$.

The results are summarized in Table 2, where the top rows show the success rates on the ensembled network (white-box setting), and the bottom rows show the success rates on the hold-out network (black-box setting). Under the challenging black-box setting, we observe that M-DI2-FGSM always generates adversarial examples with better transferability than other methods on all networks. For example, by keeping Inc-v3ens3 as a hold-out model, M-DI2-FGSM can fool Inc-v3ens3 with a success rate of 44.6%, while I-FGSM, DI2-FGSM and MI-FGSM only have success rates of 12.9%, 36.3% and 22.8%, respectively. Besides, compared with MI-FGSM, we observe that using diverse input patterns alone, i.e., DI2-FGSM, can reach a much higher success rate if the hold-out model is an adversarially trained network, and a comparable success rate if the hold-out model is a normally trained network.

Under the white-box setting, we see that DI2-FGSM and M-DI2-FGSM reach slightly lower (but still very high) success rates on the ensemble models compared with I-FGSM and MI-FGSM. This is due to the fact that attacking multiple networks simultaneously is much harder than attacking a single model. However, the white-box success rates can be improved if we assign the transformation probability $p$ a smaller value, increase the total number of iterations $N$ or use a smaller step size $\alpha$ (see Section 4.4).

4.4 Ablation Studies

In this section, we conduct a series of ablation experiments to study the impact of different parameters, e.g., the step size $\alpha$, on DI2-FGSM and M-DI2-FGSM. We only consider attacking an ensemble of networks here, since this is much stronger than attacking a single network and therefore provides a more accurate evaluation of network robustness. The maximum perturbation $\epsilon$ is set to the same value for all experiments.

4.4.1 Transformation Probability

Figure 3: The success rates of DI2-FGSM (left) and M-DI2-FGSM (right) w.r.t. different transformation probabilities $p$. We generate adversarial examples using an ensemble of six networks, and attack both the corresponding ensembled network (white-box setting, dashed line) and the hold-out network (black-box setting, solid line). We observe that both attack methods achieve higher black-box success rates but lower white-box success rates as $p$ increases.

We first study the influence of the transformation probability $p$ on the success rates under both white-box and black-box settings. We fix the step size $\alpha$ and the total iteration number $N$, and vary the transformation probability $p$ from $0$ to $1$. According to the relationships shown in Figure 2, if $p = 0$, M-DI2-FGSM degrades to MI-FGSM and DI2-FGSM degrades to I-FGSM.

We show the success rates on various networks in Figure 3. We observe that both DI2-FGSM and M-DI2-FGSM achieve higher black-box success rates but lower white-box success rates as $p$ increases. Moreover, for all attacks, if $p$ is small, i.e., only a small number of transformed inputs are used, black-box success rates increase significantly, while white-box success rates only drop a little. This phenomenon indicates the importance of adding transformed inputs into the attack process.

The trends shown in Figure 3 also provide useful suggestions for constructing strong adversarial attacks in practice. For example, if you know the black-box model is a new network that is totally different from any existing network, you can set $p = 1$ to reach the maximum transferability. If the black-box model is a mixture of new networks and existing networks, you can choose a moderate value of $p$ to maximize the black-box success rates under a pre-defined white-box success rate, e.g., white-box success rates must be greater than or equal to a chosen threshold.

4.4.2 Total Iteration Number

Figure 4: The success rates of DI2-FGSM (left) and M-DI2-FGSM (right) w.r.t. different total iteration numbers $N$. We generate adversarial examples using an ensemble of six networks, and attack both the corresponding ensembled network (white-box setting, dashed line) and the hold-out network (black-box setting, solid line). We observe that both attack methods benefit if more iterations are performed.

We here study the influence of the total iteration number $N$ on the success rates under both white-box and black-box settings. We fix the transformation probability $p$ and the step size $\alpha$, and vary the total iteration number $N$; the results are plotted in Figure 4. For DI2-FGSM, we see that both the black-box and the white-box success rates always increase as the total iteration number $N$ increases. Similar trends can also be observed for M-DI2-FGSM except for the black-box success rates on adversarially trained models, i.e., performing more iterations does not bring extra transferability on adversarially trained models. Moreover, we observe that the gap in success rates between M-DI2-FGSM and DI2-FGSM diminishes as $N$ increases.

4.4.3 Step Size

Figure 5: The success rates of DI2-FGSM (left) and M-DI2-FGSM (right) w.r.t. different step sizes $\alpha$. We generate adversarial examples using an ensemble of six networks, and attack both the corresponding ensembled network (white-box setting, dashed line) and the hold-out network (black-box setting, solid line). We observe that both attack methods benefit from a smaller step size.

We finally study the influence of the step size $\alpha$ on the success rates under both white-box and black-box settings. We fix the transformation probability $p$. In order to reach the maximum perturbation $\epsilon$ even for a small step size $\alpha$, we set the total iteration number inversely proportional to the step size, i.e., $N = \epsilon / \alpha$. The results are plotted in Figure 5. We observe that the white-box success rates of both DI2-FGSM and M-DI2-FGSM can be boosted if a smaller step size is used. Under the black-box setting, the success rates of DI2-FGSM are insensitive to the step size, while the success rates of M-DI2-FGSM can still be improved with a smaller step size.

4.5 Reproducing NIPS Adversarial Competition

In order to examine the effectiveness of our proposed attack methods in practice, we here reproduce the top defense submissions, which are black-box models to us, and official baselines from the NIPS adversarial competition [16]. Due to resource limitations, we only consider the top three defense submissions, i.e., TsAIL (https://github.com/lfz/Guided-Denoise), iyswim (https://github.com/cihangxie/NIPS2017_adv_challenge_defense) and Anil Thomas (https://github.com/anlthms/nips-2017/tree/master/mmd), and three official baselines, i.e., Inc-v3adv, IncRes-v2ens and Inc-v3. The images in the test dataset all have the same size, and their corresponding labels are the same as the ImageNet 1000-class labels.

4.5.1 Generating Adversarial Examples

When generating adversarial examples, we follow the procedure in [16]: (1) first, split the dataset equally into batches; (2) second, for each batch, randomly choose the maximum perturbation $\epsilon$ from a predefined set of values; (3) last, generate adversarial examples for each batch under the corresponding perturbation constraint.
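The three-step procedure can be sketched as below. The dataset size, the number of batches and the set of candidate perturbation values are hypothetical placeholders chosen only to make the sketch runnable, and `assign_batch_perturbations` is an illustrative name. The values actually used in the competition are those specified in [16].

```python
import numpy as np

def assign_batch_perturbations(num_images, num_batches, eps_choices, rng):
    """Split image indices into equal batches and draw one maximum perturbation per batch."""
    if num_images % num_batches != 0:
        raise ValueError("the dataset must split equally into batches")
    batches = np.split(np.arange(num_images), num_batches)
    return [(batch, float(rng.choice(eps_choices))) for batch in batches]

# Placeholder values (assumptions): 12 images, 3 batches, three candidate budgets.
plan = assign_batch_perturbations(
    num_images=12,
    num_batches=3,
    eps_choices=[4.0, 8.0, 16.0],
    rng=np.random.default_rng(0),
)
for batch_indices, eps in plan:
    # An attack would be run here on the images in batch_indices under the budget eps.
    pass
```

Because the perturbation budget is drawn at random per batch, two runs of the same attack can be evaluated under different budgets, which is one source of run-to-run variation in the reported success rates.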

4.5.2 Attacker Configurations

For the attacker configuration, we follow exactly the same settings as in [7], which attacks an ensemble of Inc-v3, Inc-v4, IncRes-v2, Res-152, Inc-v3ens3, Inc-v3ens4, IncRes-v2ens and Inc-v3adv [15]. The ensemble weights are set equally for the first seven models, with a separate weight for Inc-v3adv. The total iteration number $N$ and the decay factor $\mu$ also follow [7]. This configuration for MI-FGSM won 1st place in the NIPS adversarial attack competition. For DI2-FGSM and M-DI2-FGSM, we choose the transformation probability $p$ according to the trends shown in Figure 3.

| Attack | TsAIL | iyswim | Anil Thomas | Inc-v3adv | IncRes-v2ens | Inc-v3 | Avg. |
|---|---|---|---|---|---|---|---|
| I-FGSM | 14.0% | 35.6% | 30.9% | 98.2% | 96.4% | 99.0% | 62.4% |
| DI2-FGSM (Ours) | 22.7% | 58.4% | 48.0% | 91.5% | 90.7% | 97.3% | 68.1% |
| MI-FGSM | 14.9% | 45.7% | 46.6% | 97.3% | 95.4% | 98.7% | 66.4% |
| MI-FGSM* | 13.6% | 43.2% | 43.9% | 94.4% | 93.0% | 97.3% | 64.2% |
| M-DI2-FGSM (Ours) | 20.0% | 69.8% | 64.4% | 93.3% | 92.4% | 97.9% | 73.0% |

Table 3: The success rates on top defense submissions and official baselines from the NIPS 2017 adversarial competition. * indicates the official results reported in the competition. We see that M-DI2-FGSM obtains the highest average success rate, beating other methods by a large margin.

4.5.3 Results

The results are summarized in Table 3. We also report the official results of MI-FGSM (named MI-FGSM*) as a reference to validate our implementation. The performance difference between MI-FGSM and MI-FGSM* is due to the randomness of the maximum perturbation magnitude introduced in the attack process. Compared with MI-FGSM, DI2-FGSM has higher success rates on the top submissions but slightly lower success rates on the baseline models, so the two attack methods have similar average success rates. By integrating both diverse inputs and the momentum term, the enhanced attack, M-DI2-FGSM, reaches an average success rate of 73.0%, which is far better than the other methods. For example, the top attack submission in the NIPS competition, MI-FGSM, only gets an average success rate of 66.4%. We believe the same advantage can be observed even if we test on all defense submissions. These results also indicate that our proposed attack method can be used as a better tool to evaluate the robustness of various newly developed networks and defense methods.
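As a quick arithmetic check of the averages and the margin quoted above, the following plain-Python sketch recomputes the Avg. column of Table 3 from the six per-model success rates. Treating the average as an unweighted mean over the six columns is an assumption of this sketch; it reproduces the reported values up to rounding.

```python
# Per-model success rates (%) and the reported average, copied from Table 3.
table3 = {
    "I-FGSM": ([14.0, 35.6, 30.9, 98.2, 96.4, 99.0], 62.4),
    "DI2-FGSM": ([22.7, 58.4, 48.0, 91.5, 90.7, 97.3], 68.1),
    "MI-FGSM": ([14.9, 45.7, 46.6, 97.3, 95.4, 98.7], 66.4),
    "MI-FGSM*": ([13.6, 43.2, 43.9, 94.4, 93.0, 97.3], 64.2),
    "M-DI2-FGSM": ([20.0, 69.8, 64.4, 93.3, 92.4, 97.9], 73.0),
}

def mean_rate(rates):
    """Unweighted mean of a list of success rates."""
    return sum(rates) / len(rates)

recomputed = {name: mean_rate(rates) for name, (rates, _) in table3.items()}

# Margin between the reported averages of M-DI2-FGSM and MI-FGSM.
margin = round(table3["M-DI2-FGSM"][1] - table3["MI-FGSM"][1], 1)
best_attack = max(recomputed, key=recomputed.get)
```

The margin is computed from the rounded averages reported in the table, which is how the 6.6 percentage point gap between 73.0% and 66.4% arises.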

4.6 Discussion

We provide a brief discussion of why diverse patterns help generate adversarial examples with better transferability. One hypothesis is that the decision boundaries of different networks share similar inherent structures because the networks are trained on the same dataset, e.g., ImageNet. For example, as shown in Figure 1, different networks make similar mistakes in the presence of adversarial examples. By incorporating diverse patterns at each step, the optimization produces adversarial examples that are more robust to small transformations. These adversarial examples remain malicious within a certain region around the network decision boundary, and thus have a higher chance of fooling other networks, i.e., they achieve better black-box success rates than existing methods. In the future, we plan to validate this hypothesis theoretically or empirically.
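To make the idea of a per-step diverse input pattern concrete, the following Python sketch applies one random transformation to a small image, assuming the transformation is a random resize followed by random zero-padding. The function name `diverse_input`, the list-of-rows image format, the nearest-neighbour resize, and the default padding target are assumptions made for this illustration; it is not the released implementation, which would operate on image tensors.

```python
import random


def diverse_input(image, rng, prob=0.5, out_size=None):
    """Return a randomly resized and zero-padded copy of a square 2-D image.

    `image` is a list of rows (hypothetical format for this sketch).
    With probability 1 - prob the original image is returned unchanged.
    """
    size = len(image)
    if out_size is None:
        out_size = size + max(1, size // 10)  # assumed padding target
    if out_size <= size:
        raise ValueError("out_size must be larger than the image size")
    if rng.random() >= prob:
        return image  # keep the original input for this iteration

    # Pick a new side length in [size, out_size) and resize (nearest neighbour).
    new_size = rng.randrange(size, out_size)
    idx = [i * size // new_size for i in range(new_size)]
    resized = [[image[r][c] for c in idx] for r in idx]

    # Place the resized image at a random offset inside a zero canvas.
    pad = out_size - new_size
    top = rng.randint(0, pad)
    left = rng.randint(0, pad)
    padded = [[0.0] * out_size for _ in range(out_size)]
    for r, row in enumerate(resized):
        padded[top + r][left:left + new_size] = row
    return padded


rng = random.Random(0)
img = [[1.0] * 8 for _ in range(8)]
out = diverse_input(img, rng, prob=1.0, out_size=11)
print(len(out), len(out[0]))
```

In an attack loop, the gradient at each iteration would be computed on the output of such a function rather than on the raw image, so that consecutive steps see slightly different versions of the same input.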

5 Conclusions

In this paper, we propose to improve the transferability of adversarial examples with input diversity. Specifically, our method applies random transformations to the input images at each iteration of the attack process. Compared with traditional iterative attacks, the results on ImageNet show that our proposed attack method achieves significantly higher success rates on black-box models while maintaining similar success rates on white-box models. We improve the transferability further by integrating a momentum term and attacking multiple networks simultaneously. By evaluating this enhanced attack against the top defense submissions and official baselines from the NIPS adversarial competition [16], we show that it reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. The code is publicly available at https://github.com/cihangxie/DI-2-FGSM.

References

Appendix A Ablation Studies with Large Iteration Number

Figure 6: The success rates of DI2-FGSM (left) and M-DI2-FGSM (right) w.r.t. the total iteration number. We generate adversarial examples using an ensemble of six networks, and attack both the corresponding ensembled network (white-box setting, dashed line) and the hold-out network (black-box setting, solid line). We observe that DI2-FGSM can achieve performance comparable to M-DI2-FGSM under both white-box and black-box settings when the total iteration number is large.

As pointed out in Section 4.4, the gap in success rates between M-DI2-FGSM and DI2-FGSM diminishes as the total iteration number increases, which may indicate that the momentum term [7] is less useful with a large total iteration number. To validate this assumption, we first study the influence of a large total iteration number on the success rates by increasing it to under both white-box and black-box settings. We set the transformation probability and the step size . The results are shown in Figure 6. When the iteration number is large, e.g., , we observe that (1) DI2-FGSM and M-DI2-FGSM have similar white-box success rates on all models, and comparable black-box success rates on most normally trained models; and (2) DI2-FGSM has higher black-box success rates on adversarially trained models than M-DI2-FGSM.

By fixing the total iteration number and the step size , we then study the influence of the transformation probability on the success rates under both white-box and black-box settings. The transformation probability is increased to , since the original value () may be small under the large iteration number setting. The results are shown in Figure 7. When the transformation probability is large, e.g., , compared with M-DI2-FGSM, we observe that DI2-FGSM has (1) similar white-box success rates on all models, and comparable black-box success rates on most normally trained models; and (2) much higher black-box success rates on adversarially trained models.

Based on the experimental results above, we conclude that the momentum term [7] helps to reduce the total iteration number, but is not needed when the attack iteration number is already large.
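For readers less familiar with the momentum term discussed here, the following Python sketch shows the accumulated-gradient update used by momentum iterative attacks in the style of [7]: the L1-normalised gradient is added to a decayed running sum, and the sign of that sum drives each step. The function names, the flat-list input format, the choice of step size as the budget divided by the iteration count, and the toy constant gradient are all assumptions for illustration only.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def momentum_attack(x, grad_fn, eps, num_iter, decay):
    """Iterative sign-gradient ascent with an accumulated gradient.

    x        : list of floats (the clean input)
    grad_fn  : callable returning the loss gradient at a point (hypothetical)
    eps      : maximum perturbation per coordinate
    num_iter : total iteration number
    decay    : momentum decay factor; 0 disables momentum
    """
    alpha = eps / num_iter  # assumed step size for this sketch
    x_adv = list(x)
    g = [0.0] * len(x)
    for _ in range(num_iter):
        grad = grad_fn(x_adv)
        l1 = sum(abs(v) for v in grad) or 1.0  # avoid division by zero
        # Accumulate the L1-normalised gradient into the running sum.
        g = [decay * gi + gr / l1 for gi, gr in zip(g, grad)]
        # Take a signed step, then clip back into the eps-ball around x.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


# Toy usage: a constant gradient pushes each coordinate to the edge of the budget.
adv = momentum_attack([0.0, 0.0], lambda v: [1.0, -2.0], eps=1.0, num_iter=10, decay=1.0)
print(adv)
```

With `decay=0` the running sum reduces to the current normalised gradient, so the loop behaves like a plain iterative sign-gradient attack; this is the sense in which the two methods can be compared at a fixed iteration budget.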

Figure 7: The success rates of DI2-FGSM (left) and M-DI2-FGSM (right) w.r.t. the transformation probability. We generate adversarial examples using an ensemble of six networks, and attack both the corresponding ensembled network (white-box setting, dashed line) and the hold-out network (black-box setting, solid line). Compared with M-DI2-FGSM, we observe that DI2-FGSM achieves comparable black-box success rates on most normally trained models, and much higher black-box success rates on adversarially trained models, as the transformation probability increases.