Attacking Adversarial Attacks as A Defense

06/09/2021 ∙ by Boxi Wu, et al.

It is well known that adversarial attacks can fool deep neural networks with imperceptible perturbations. Although adversarial training significantly improves model robustness, failure cases of defense still broadly exist. In this work, we find that adversarial attacks can also be vulnerable to small perturbations. Namely, on adversarially-trained models, perturbing adversarial examples with a small random noise may invalidate their misled predictions. After carefully examining state-of-the-art attacks of various kinds, we find that all these attacks have this deficiency to different extents. Enlightened by this finding, we propose to counter attacks by crafting more effective defensive perturbations. Our defensive perturbations leverage the advantage that adversarial training endows the ground-truth class with smaller local Lipschitzness. By simultaneously attacking all the classes, the misled predictions with larger Lipschitzness can be flipped into correct ones. We verify our defensive perturbation with both empirical experiments and theoretical analyses on a linear model. On CIFAR10, it boosts the state-of-the-art model from 66.16% to 72.66% against AutoAttack, including an improvement from 71.76% to 83.30% against the Square attack. On ImageNet, the top-1 robust accuracy of FastAT is improved from 33.18% to 38.54% under the 100-step PGD attack.


1 Introduction

Deep neural networks were first found to be vulnerable to adversarial attacks by Szegedy et al. (2014). That is, malicious attackers can deceive deep networks into a false category by adding a small perturbation onto natural examples, as shown in Figure 1(a). Since then, huge efforts have been made to counter this intriguing deficiency Papernot et al. (2016a, b); Xie et al. (2018); Dhillon et al. (2018). After an intense arms race between attacks and defenses, the methodology of adversarial training Goodfellow et al. (2015); Madry et al. (2018); Zhang et al. (2019b) has withstood the test of time and become the state-of-the-art defense. Yet, even with adversarial training, the robustness against attacks is still far from satisfying, considering the huge gap between natural accuracy and robust accuracy. On the attacking side, various attacks targeting different scenarios keep being proposed, e.g., white-box attacks Carlini and Wagner (2017); Moosavi-Dezfooli et al. (2016), black-box attacks Andriushchenko et al. (2020), and hard-label attacks Chen and Gu (2020).

In this work, after carefully examining 25 leading robust models against 9 of the most powerful and representative attacks, we find that the adversarial attacks themselves are not robust, either. Specifically, on these adversarially trained models, if we add small random perturbations onto the deceptive adversarial examples before feeding them to the networks, the false predictions on adversarial examples can often be reverted to the correct predictions of the natural examples. Empirical evaluations suggest that such vulnerability to perturbations varies among different kinds of attacks. For instance, on the adversarially-trained model of TRADES Zhang et al. (2019b), the success rate of DeepFool Moosavi-Dezfooli et al. (2016) is decreased substantially, and the state-of-the-art black-box attack of Square Andriushchenko et al. (2020) degrades to a similar extent. In contrast, attacks like PGD Madry et al. (2018) and C&W Carlini and Wagner (2017) exhibit stronger resistance to random perturbations, and their degradation is generally small.

Figure 1: Left: The loss landscape Li et al. (2018) around a natural example. An attacking perturbation intends to push the example out of the decision boundary (dot-dashed line), while a defensive perturbation intends to pull it back. Right: For the adversarial example on the left, we plot its two loss surfaces for the class of the false prediction and the ground-truth class.

The above finding offers a new insight for enhancing adversarially trained models: can defenders find a defensive perturbation that is more effective than a random noise (Figure 1(b)) and thus breaks attacks like PGD and C&W? In this work, we propose to accomplish this by leveraging a common understanding of adversarial training. That is, the ground-truth predictions are stabilized during training and thus may have a smoother loss surface and smaller local Lipschitzness Ross and Doshi-Velez (2018); Cissé et al. (2017); Chan et al. (2020); Gu and Tresp (2019); Wu et al. (2020a). This protection of the ground-truth class broadly exists in the vicinity of natural examples as well as adversarial ones. In Figure 1 Right, though deep networks wrongly classify the adversarial example, the loss surface of the false prediction is much steeper than that of the ground-truth class. Thus, if we simultaneously attack all the classes with the same attacking steps, unprotected false predictions with larger Lipschitzness will be pushed further away, while the ground-truth class may well survive. On natural examples (Figure 1(c)), since the above pattern also holds, attacking all the classes barely affects their accuracy (see Section 4).

The above manner of attacking all the classes resembles the concept of a hedge fund in financial risk control, where people invest in contradictory assets to contain the risk of an isolated investment. Thus, we term our method Hedge Defense. Hedge Defense creates perturbations tailored to each example and is more effective than the indiscriminate random noise. On CIFAR10 Krizhevsky et al. (2009), it improves the state-of-the-art model Gowal et al. (2020) from 66.16% to 72.66% against AutoAttack Croce and Hein (2020b), an ensemble of 4 of the most powerful attacks from different tracks, including an improvement from 71.76% to 83.30% against the state-of-the-art black-box attack of Square Andriushchenko et al. (2020). On ImageNet Deng et al. (2009), it improves the top-1 robust accuracy of FastAT Wong et al. (2020) from 33.18% to 38.54% against the 100-step PGD Madry et al. (2018) attack. We first introduce Hedge Defense in Section 3 and then present empirical evaluations in Section 4. In Section 5, we interpret Hedge Defense on a linear model for a better understanding, and we discuss its feasibility and possible limitations in Section 6. Finally, we conclude our work in Section 7.

2 Related Work

Adversarial Attacks

Since the breakthrough finding in Szegedy et al. (2014), various adversarial attacks have been proposed, which can be roughly divided into two categories, white-box and black-box attacks. White-box attacks Gu et al. (2021); Carlini and Wagner (2017); Gowal et al. (2019) are allowed to access the model. Thus, they essentially maximize the loss value by repeatedly perturbing the input. For black-box attacks, where attackers cannot access the gradient information, attacks are usually conducted via transferring adversarial examples Tramèr et al. (2017) or extensive queries Chen et al. (2017); Andriushchenko et al. (2020); Chen and Gu (2020). To the best of our knowledge, all of these attacks are assumed to acquire the label in advance. As a by-product of our work, we show in Appendix B.7 that the reverse process of Hedge Defense can attack deep models without knowing the label.

Adversarial Training

At the beginning of developing robust models, plenty of defensive methods with different insights were proposed Papernot et al. (2016b, a), some of which do not even require training Guo et al. (2018). Yet, Athalye et al. (2018) found that most of these defenses give a false sense of security originating from the obfuscated gradient problem and can be circumvented by a tailored adaptation. Meanwhile, a series of defenses, dubbed adversarial training Tramèr et al. (2018); Wu et al. (2020b); Wang et al. (2020), withstand the above testing and have become the most solid defense so far. Adversarial training conducts attacks during training and directly trains deep models on the generated adversarial examples, allowing it to defuse the upcoming attacks at test time. Yet, empirical results suggest that adversarial training alone is not enough to learn highly robust models.

Alternative Defenses

Other alternative defenses have also been proposed Qin et al. (2019); Chan et al. (2020). In particular, Guo et al. (2018) and Xie et al. (2018) adopt image pre-processing to counter attacks, which has been proven to be non-robust against white-box attacks Athalye et al. (2018). Pal and Vidal (2020); Pinot et al. (2020) study attacks and defenses under Game Theory and theoretically show that defenders can improve upon the deterministic manner of predictions by adding a uniform noise. Pal and Vidal (2020) only evaluates the uniform perturbation on MNIST LeCun and Cortes (2010) against the PGD attack, where the improvement is limited. This paper is the first work to extensively study various kinds of attacks and point out their non-robust deficiency. In addition, Wu et al. (2020c) propose to dynamically modify networks during inference time. Their method requires collective testing and has a different optimization criterion from ours. We compare Hedge Defense with these closely-related methods in more detail in Appendix D.

3 Hedge Defense

Preliminaries.

For a classification problem with $K$ classes, we denote the input as $x$ and the ground-truth label as $y$. Then, a robust model Tramèr et al. (2020) is expected to output $y$ for any input within $\mathcal{B}_{\epsilon}(x)$, which defines the neighborhood area around $x$ with the attacking radius $\epsilon$ under a given $\ell_p$ norm. The most fundamental form of adversarial attacks can be described as finding an adversarial example $x_{adv}$ that maximizes the loss value for the label $y$ Szegedy et al. (2014):

$$x_{adv} = \arg\max_{x' \in \mathcal{B}_{\epsilon}(x)} \mathcal{L}\big(f(x'), y\big). \qquad (1)$$

$\mathcal{L}$ represents the adopted loss function, whose specific form may vary for different attacks. We choose the most popular choice of the cross-entropy loss for demonstration. $f$ denotes the deep model whose output is the predicted distribution, and $f_k(x)$ is its scalar probability for class $k$.

3.1 Defense via Attacking All Classes

1:  Input: the coming input x, the number of iterations T, the step size α, the deep network f, and the defensive radius ε_d.
2:  // U(−ε_d, ε_d) generates a uniform noise
3:  Initialization: x_0 = x + U(−ε_d, ε_d).
4:  for t = 0, …, T − 1 do
5:     // Sum the losses for all the classes,
6:     // then update with gradient ascending.
7:     x_{t+1} = x_t + α · sign(∇_{x_t} Σ_{k=1}^{K} L(f(x_t), k));
8:     // Π_{B_{ε_d}(x)} is the projection operator onto B_{ε_d}(x).
9:     x_{t+1} = Π_{B_{ε_d}(x)}(x_{t+1});
10:  end for
11:  Output: the safer prediction arg max_k f_k(x_T).
Algorithm 1  Hedge Defense

For the defensive purpose, we consider adding an extra perturbation to every coming input before making the prediction so that we may dodge the dangerous adversarial examples. In practice, defenders do not know whether the coming input is adversarial or not. Thus, for a general sense of robustness, a defense should barely affect an arbitrary natural input and meanwhile improve the resistance against adversarial examples within $\mathcal{B}_{\epsilon}(x)$. The defensive perturbation should also be constrained with a defensive radius $\epsilon_d$, so that the image is not substantially changed. Thus, for any coming input $\hat{x}$, we search among the examples within $\mathcal{B}_{\epsilon_d}(\hat{x})$ to find a safer example $x_h$. Specifically, we achieve this by maximizing the summation of the cross-entropy losses of all the classes:

$$x_h = \arg\max_{x' \in \mathcal{B}_{\epsilon_d}(\hat{x})} \sum_{k=1}^{K} \mathcal{L}\big(f(x'), k\big). \qquad (2)$$

When simultaneously attacking all the classes, the computed gradients for the input example will be dominated by those that attack the false classes, since the false classes have larger local Lipschitzness and can thus be more easily attacked. Note that the soft-max cross-entropies of different classes contradict each other: a smaller probability of a false class means a larger loss for that class but a smaller loss value for the correct class. Thus, the loss value of the ground-truth class actually becomes smaller. On natural examples, because the above analysis also holds, Hedge Defense barely affects their accuracy. We describe the details of Hedge Defense in Algorithm 1. In Section 5, we will further explain its mechanism on a more precise linear model for a binary classification problem.
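For concreteness, below is a minimal PyTorch sketch of Algorithm 1. The radius, step size, and iteration count are illustrative placeholders rather than the paper's exact settings, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def hedge_defense(model, x, eps_d=8/255, steps=20, alpha=2/255):
    """Perturb x to maximize the summed cross-entropy over ALL classes (Algorithm 1).
    Hyper-parameter values here are illustrative, not the paper's exact configuration."""
    model.eval()
    # Random start inside the defensive ball, clipped to the valid pixel range.
    x_h = (x + torch.empty_like(x).uniform_(-eps_d, eps_d)).clamp(0, 1)
    for _ in range(steps):
        x_h = x_h.detach().requires_grad_(True)
        log_probs = F.log_softmax(model(x_h), dim=1)      # (B, K)
        loss = -log_probs.sum()                           # sum_k CE(f(x_h), k), summed over the batch
        grad = torch.autograd.grad(loss, x_h)[0]
        x_h = x_h.detach() + alpha * grad.sign()          # gradient ascent on the summed loss
        # Project back into the defensive ball around the coming input.
        x_h = torch.min(torch.max(x_h, x - eps_d), x + eps_d).clamp(0, 1)
    return model(x_h).argmax(dim=1)                       # the safer prediction
```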

Taking a deeper look at Eq.(2), the scheme of attacking all the classes can be reformulated as:

$$\arg\max_{x' \in \mathcal{B}_{\epsilon_d}(\hat{x})} \sum_{k=1}^{K} \mathcal{L}\big(f(x'), k\big) = \arg\max_{x' \in \mathcal{B}_{\epsilon_d}(\hat{x})} K\Big[D_{\mathrm{KL}}\big(u \,\|\, f(x')\big) + H(u)\Big], \qquad (3)$$

where $u$ denotes the uniform distribution over the $K$ classes and $H(u) = \log K$ denotes its entropy. Thus, Hedge Defense is equivalent to maximizing the KL-divergence to the uniform distribution. This offers another perspective to understand Hedge Defense. As adversarial attacks usually result in low-confidence predictions Liu et al. (2018); Pang et al. (2018), Hedge Defense may force the input to stay away from those uncertain predictions and return to the sharper predictions of natural examples. In addition, minimizing the KL-divergence to the uniform distribution is a popular choice for training on out-of-distribution examples Hendrycks et al. (2019b), examples that belong to none of the considered categories. Thus, maximizing this metric as in Eq. (3) can also be interpreted as seeking a better in-distribution prediction.
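Spelling out the algebra behind Eq. (3) (this short derivation is ours, with $u_k = 1/K$ for every class $k$):

$$\sum_{k=1}^{K}\mathcal{L}\big(f(x'),k\big) = -\sum_{k=1}^{K}\log f_k(x') = K\Big[\sum_{k=1}^{K}\tfrac{1}{K}\log\tfrac{1/K}{f_k(x')} + \log K\Big] = K\Big[D_{\mathrm{KL}}\big(u\,\|\,f(x')\big)+H(u)\Big].$$

Since the positive factor $K$ and the constant $H(u)$ do not change the maximizer, maximizing the summed losses and maximizing the KL-divergence to the uniform distribution are equivalent.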

4 Attacking State-of-the-Art Attacks

4.1 Evaluations with Robust Models on Benchmark Datasets

This section extensively examines state-of-the-art robust models and attacks, investigating how the adversarial examples react to perturbations. We mainly consider the more typical $\ell_\infty$ norm attacks here and investigate the weaker attacks under other norms in Appendix B.8. We present evaluations with the metric of robust accuracy, which is the opposite of the success rate of attacks.

4.1.1 Empirical Results on CIFAR10 and CIFAR100

| Model | Method | Nat-Acc. | PGD | C&W | DeepFool | APGD-CE | APGD-T | FAB | Square | RayS | AutoAttack | Worst Case |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WA+SiLU Gowal et al. (2020) | - | 91.10 | 69.16 | 67.48 | 71.23 | 68.24 | 66.17 | 66.70 | 71.76 | 72.03 | 66.16 | 66.15 |
| | Random | 90.82 | 69.34 | 71.35 | 75.65 | 69.46 | 67.45 | 78.82 | 77.58 | 79.23 | 67.76 | 66.24 |
| | Hedge | 90.62 | 71.43 | 78.42 | 79.64 | 73.86 | 73.00 | 82.28 | 83.30 | 82.03 | 72.66 | 68.61 |
| AWP Wu et al. (2020b) | - | 88.25 | 64.14 | 61.44 | 64.81 | 63.56 | 60.49 | 60.97 | 66.15 | 66.93 | 60.50 | 60.46 |
| | Random | 87.91 | 64.18 | 65.01 | 70.41 | 64.53 | 61.54 | 73.37 | 72.47 | 74.89 | 62.00 | 60.42 |
| | Hedge | 86.98 | 66.16 | 71.94 | 73.83 | 68.18 | 66.47 | 76.37 | 76.69 | 76.80 | 66.68 | 62.72 |
| RST Carmon et al. (2019) | - | 89.69 | 63.17 | 61.74 | 66.69 | 62.01 | 60.14 | 60.66 | 66.91 | 67.61 | 60.13 | 60.08 |
| | Random | 89.50 | 63.47 | 65.90 | 71.16 | 63.42 | 61.49 | 75.48 | 74.51 | 77.02 | 61.89 | 60.09 |
| | Hedge | 88.64 | 66.38 | 73.89 | 75.25 | 68.37 | 68.09 | 78.48 | 79.24 | 78.98 | 67.46 | 63.10 |
| Pre-Training Hendrycks et al. (2019a) | - | 87.11 | 58.19 | 57.09 | 59.77 | 57.52 | 55.32 | 55.68 | 62.38 | 63.32 | 55.30 | 55.28 |
| | Random | 86.93 | 58.52 | 61.79 | 70.4 | 58.59 | 56.45 | 71.99 | 71.16 | 74.56 | 56.79 | 55.32 |
| | Hedge | 87.43 | 62.01 | 73.12 | 75.50 | 65.63 | 65.38 | 76.62 | 78.32 | 78.65 | 64.45 | 59.17 |
| MART Wang et al. (2020) | - | 87.50 | 63.49 | 59.62 | 63.14 | 61.83 | 56.73 | 57.39 | 64.88 | 65.66 | 56.74 | 56.70 |
| | Random | 87.08 | 63.59 | 63.82 | 72.26 | 63.02 | 58.10 | 73.04 | 72.11 | 74.71 | 58.75 | 56.99 |
| | Hedge | 86.18 | 64.56 | 71.71 | 73.86 | 67.00 | 63.30 | 74.35 | 76.09 | 75.60 | 63.42 | 59.35 |
| HYDRA Sehwag et al. (2020) | - | 88.98 | 60.95 | 62.19 | 64.66 | 59.86 | 57.67 | 58.41 | 65.02 | 65.61 | 57.67 | 57.63 |
| | Random | 88.80 | 61.42 | 66.08 | 69.50 | 61.25 | 59.03 | 74.60 | 73.26 | 76.20 | 59.55 | 57.85 |
| | Hedge | 87.82 | 64.19 | 73.05 | 73.43 | 66.15 | 65.56 | 76.94 | 77.58 | 77.36 | 65.07 | 61.13 |
| TRADES Zhang et al. (2019b) | - | 84.92 | 55.83 | 54.47 | 58.15 | 55.08 | 53.09 | 53.56 | 59.45 | 59.69 | 53.09 | 53.06 |
| | Random | 84.69 | 56.03 | 58.65 | 65.48 | 56.06 | 54.22 | 69.33 | 67.79 | 71.29 | 54.51 | 53.02 |
| | Hedge | 83.99 | 58.99 | 69.98 | 71.30 | 62.24 | 62.22 | 74.31 | 74.71 | 73.94 | 61.28 | 56.21 |
| AT Madry et al. (2018) | - | 83.23 | 46.97 | 57.51 | 55.84 | 47.97 | 50.99 | 72.68 | 77.71 | 60.93 | 47.29 | 38.81 |
| | Random | 82.99 | 47.06 | 57.77 | 56.86 | 47.92 | 51.23 | 72.73 | 77.66 | 82.94 | 47.27 | 40.55 |
| | Hedge | 81.03 | 48.31 | 63.22 | 63.52 | 52.38 | 57.67 | 74.89 | 77.93 | 80.47 | 51.82 | 42.87 |
| Standard | - | 94.78 | 00.00 | 00.00 | 00.83 | 00.00 | 00.00 | 00.04 | 00.39 | 00.04 | 00.00 | 00.00 |
| | Random | 91.32 | 00.00 | 15.97 | 74.57 | 01.16 | 01.16 | 84.37 | 56.41 | 79.15 | 01.20 | 00.00 |
| | Hedge | 91.50 | 00.00 | 15.66 | 74.71 | 01.23 | 01.21 | 84.29 | 56.37 | 79.43 | 01.25 | 00.00 |

Table 1: Robust accuracy (%) before perturbation (-), after random noise (Random), and after Hedge Defense (Hedge) on CIFAR10. Eight robust models are shown here. Evaluations are repeated three times with the mean values presented. Extra models and variances are provided in Appendix B.4.

On CIFAR10, we select the top-20 officially-released robust models on RobustBench Croce et al. (2020) to demonstrate the effectiveness of our method. We show seven of them here and put the rest in the Appendix. We also test models trained with basic adversarial training (AT) and standard training for comparison. Details of these models are given in Appendix B.1. On the attacking side, we select nine of the most representative and powerful attacks from different tracks, including six white-box attacks (PGD Madry et al. (2018), C&W Carlini and Wagner (2017), DeepFool Moosavi-Dezfooli et al. (2016), APGD-CE Croce and Hein (2020b), APGD-T Croce and Hein (2020b), FAB Croce and Hein (2020a)), two black-box attacks (Square Andriushchenko et al. (2020), RayS Chen and Gu (2020)), and one ensemble attack (AutoAttack Croce and Hein (2020b)). All attacks are constrained by the attacking radius $\epsilon = 8/255$ under the $\ell_\infty$ norm. We introduce the different attacks in detail in Appendix B.2.

For each model in Table 1, we compare the robust accuracy in three scenarios: predictions before perturbation, predictions after a random perturbation, and predictions after the hedge perturbation. By default, we set the defensive radius $\epsilon_d$ following Pal and Vidal (2020) and empirically test different values in Section 4.3.2. Hedge Defense is implemented with the same number of optimization iterations and the same step size as the PGD attack. We first evaluate each attack independently. Then, we compute the worst-case robust accuracy over all attacks: for a certain testing example, as long as one of the nine attacks succeeds, we consider the defense a failure. From Table 1, we can observe that most attacks are vulnerable to both the random perturbation and the hedge perturbation, and their capability to resist perturbations differs hugely. Here we briefly analyze possible causes of their non-robustness. A complete case study can be found in Appendix B.3.
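As an illustration of the worst-case metric, the sketch below aggregates per-attack correctness masks; the array layout and names are hypothetical and not the paper's evaluation code.

```python
import numpy as np

def worst_case_accuracy(correct: np.ndarray) -> float:
    """correct[a, i] = True if the defended prediction on example i is still
    correct under attack a (here, 9 attacks in total)."""
    survives_all = correct.all(axis=0)   # an example counts only if it survives every attack
    return float(survives_all.mean())
```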

Square and RayS: Square and RayS are black-box attacks based on extensively querying deep networks. They tend to stop querying once they find a successful adversarial example so that the computation budget can be saved. Yet, such a design makes the attacks stop at adversarial examples that lie close to the last unharmful example, i.e., the example one step before the successful attacking query. This explains why they are so vulnerable to defensive perturbations. The black-box constraint of not knowing the gradient information of the model may also prevent them from finding more powerful adversarial examples that are far away from the decision boundary.

$$\min_{\delta}\ \|\delta\|_{p} \quad \text{s.t.} \quad \arg\max_{k} f_k(x+\delta) \neq y. \qquad (4)$$

FAB and DeepFool: FAB can be considered a direct improvement of DeepFool. Unlike other white-box attacks that search for adversarial examples within $\mathcal{B}_{\epsilon}(x)$, FAB tries to find a minimal perturbation for attacks, as shown above. However, this aim of finding smaller attacking perturbations also encourages them to find adversarial examples that are too close to the decision boundary and thus become very vulnerable to defensive perturbations. Thus, even on the undefended standard model, where they generate smaller perturbations than on robust models, they can still be hugely nullified by random perturbation, and the robust accuracy increases from less than 1% to 74.57% (DeepFool) and 84.37% (FAB).

$$\mathcal{L}_{\mathrm{CW}}\big(x', y\big) = \max_{k \neq y} z_k(x') - z_y(x'), \qquad (5)$$

where $z_k$ denotes the logit of class $k$ before the softmax layer.

PGD and C&W: Attacks like PGD and C&W show better resistance against perturbations. However, with Hedge Defense, the robust accuracy under these attacks can still be notably improved. Interestingly, several attacks commonly believed to be more powerful turn out to be weaker when facing perturbations. For instance, C&W enhances attacks by increasing the largest false logit and decreasing the true logit. Without Hedge Defense, it is more powerful (54.47% robust accuracy on TRADES) than the PGD attack (55.83%). With Hedge Defense, it becomes less effective (69.98% vs. 58.99%). This indicates that its efficacy comes at the price of less robustness against defensive perturbations: the effect of directly increasing the false logit can be weakened once the false class is attacked. Similar analyses also apply to the attacks of APGD-CE and APGD-T.

| Model | Method | Nat-Acc. | PGD | C&W | DeepFool | APGD-CE | APGD-T | FAB | Square | RayS | AutoAttack | Worst Case |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WA+SiLU Gowal et al. (2020) | - | 69.15 | 40.84 | 39.46 | 40.74 | 40.32 | 37.33 | 37.65 | 42.97 | 43.15 | 37.29 | 37.26 |
| | Random | 69.02 | 40.97 | 43.61 | 52.22 | 41.35 | 38.75 | 54.59 | 51.08 | 55.09 | 38.79 | 37.33 |
| | Hedge | 68.67 | 43.25 | 55.94 | 57.82 | 46.75 | 46.47 | 59.69 | 59.81 | 59.75 | 45.00 | 39.78 |
| AWP Wu et al. (2020b) | - | 60.38 | 34.13 | 31.41 | 31.33 | 33.37 | 29.18 | 29.47 | 34.57 | 34.79 | 29.16 | 29.15 |
| | Random | 60.45 | 34.21 | 35.24 | 44.42 | 34.53 | 30.56 | 45.59 | 42.94 | 47.82 | 30.75 | 29.25 |
| | Hedge | 59.37 | 36.69 | 46.01 | 48.13 | 40.18 | 38.10 | 49.49 | 49.82 | 50.34 | 37.75 | 32.29 |
| TRADES Zhang et al. (2019b) | - | 57.34 | 25.19 | 32.83 | 31.97 | 34.51 | 37.24 | 52.74 | 54.06 | 30.01 | 33.94 | 19.61 |
| | Random | 55.60 | 25.30 | 33.34 | 33.97 | 34.11 | 36.73 | 51.41 | 52.36 | 55.59 | 33.78 | 22.00 |
| | Hedge | 56.04 | 29.75 | 44.09 | 45.33 | 39.25 | 42.04 | 53.27 | 54.01 | 55.79 | 38.72 | 26.59 |
| AT Madry et al. (2018) | - | 58.61 | 25.34 | 35.01 | 31.93 | 32.05 | 35.56 | 52.62 | 55.11 | 32.31 | 31.70 | 19.72 |
| | Random | 58.49 | 25.43 | 35.53 | 35.54 | 32.09 | 35.57 | 52.58 | 55.00 | 58.72 | 31.74 | 21.61 |
| | Hedge | 57.66 | 28.23 | 43.73 | 45.02 | 35.79 | 40.08 | 53.24 | 54.99 | 57.22 | 35.54 | 24.80 |

Table 2: Robust accuracy (%) on CIFAR100. All settings align with CIFAR10 in Table 1.

For natural accuracy on robust models, Hedge Defense is slightly lower than the direct-prediction approach in most cases. This aligns with the common understanding that robustness may be at odds with natural accuracy Tsipras et al. (2019); Zhang et al. (2019b); Athalye et al. (2018). Compared with the improvements on robust accuracy, such a degradation of natural accuracy is acceptable. One exception is the robust model of Pre-Training Hendrycks et al. (2019a), where Hedge Defense actually achieves better natural accuracy. On CIFAR100, we evaluate four robust models in Table 2. Hedge Defense again achieves substantial improvements. In particular, it improves the state-of-the-art model against AutoAttack from 37.29% to 45.00%.

4.1.2 Evaluations of Fast Adversarial Training on ImageNet

We further test Hedge Defense on the challenging ImageNet dataset. Because adversarial training is computationally expensive, it is important to reduce its complexity in order to apply it to very large datasets like ImageNet. In particular, Fast Adversarial Training Wong et al. (2020) is proposed for this purpose and achieves promising results. Thus, we choose this algorithm and evaluate it on different architectures against the PGD attack with different numbers of iterations (10, 50, and 100). Each trial is restarted 10 times until the attack succeeds. We adopt the officially released ResNet50 model of Wong et al. (2020). We also train a ResNet101 with the official code to compare models with different capacities. The attacking radius $\epsilon$ is set to 2/255 or 4/255 as in Wong et al. (2020). In Table 3, Hedge Defense improves robust accuracy in all cases. As the model gets deeper, Hedge Defense achieves more improvement.

| Method | Model | Hedge | PGD-10 (2/255) | PGD-50 (2/255) | PGD-100 (2/255) | PGD-10 (4/255) | PGD-50 (4/255) | PGD-100 (4/255) |
|---|---|---|---|---|---|---|---|---|
| Fast AT Wong et al. (2020) | ResNet-50 | - | 43.44 | 43.40 | 43.38 | 30.81 | 30.17 | 30.13 |
| | | ✓ | 45.38 | 45.44 | 45.43 | 34.42 | 34.63 | 34.71 |
| | ResNet-101 | - | 44.69 | 44.62 | 44.60 | 34.02 | 33.26 | 33.18 |
| | | ✓ | 47.00 | 47.07 | 47.08 | 38.36 | 38.50 | 38.54 |

Table 3: Top-1 robust accuracy (%) for Fast Adversarial Training on ImageNet.
Figure 2: Analytical Experiments: (a) The number of examples for FtoT, TtoF, FtoT-TtoF, and FtoF (Section 4.2.1). (b) The distribution of Lipschitzness difference for FtoT and TtoF (Section 4.2.2). (c) The visualization for natural, adversarial, and hedge examples with their SSIM scores (Section 4.2.3).

4.2 Analytical Experiments for Defensive Perturbation

In this section, we investigate multiple features of defensive perturbation. We adopt the official model of TRADES Zhang et al. (2019b) on CIFAR10 for demonstration in this part.

4.2.1 The Trade-off of Defensive Perturbation

We have observed that Hedge Defense achieves overall improvements in robustness. Now we investigate how it influences each testing example. Specifically, Hedge Defense may either convert an originally true prediction into a false one (TtoF) or convert a false prediction into the true class (FtoT). The overall improvement is then the number of FtoT minus the number of TtoF (FtoT−TtoF). In addition, we count the number of examples that are falsely predicted before the defensive perturbation and are converted to another false class after it (FtoF). This metric indicates a partial failure of the targeted attack, which is expected to deceive deep networks into a specific false class rather than an arbitrary one. We show these statistics of Hedge Defense against different attacks in Figure 2(a). Hedge Defense improves robustness because it converts many more examples from false to true (FtoT) than the relatively few examples it converts from true to false (TtoF). We also provide the statistics of the random perturbation for comparison.
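A small sketch of how these transition counts can be computed from the predictions before and after the defensive perturbation (the function and argument names are ours):

```python
import numpy as np

def transition_counts(y_true, pred_before, pred_after):
    """Count FtoT, TtoF, and FtoF transitions caused by a defensive perturbation."""
    y_true, pred_before, pred_after = map(np.asarray, (y_true, pred_before, pred_after))
    ftot = np.sum((pred_before != y_true) & (pred_after == y_true))
    ttof = np.sum((pred_before == y_true) & (pred_after != y_true))
    ftof = np.sum((pred_before != y_true) & (pred_after != y_true) & (pred_before != pred_after))
    return {"FtoT": int(ftot), "TtoF": int(ttof), "FtoF": int(ftof), "net_gain": int(ftot - ttof)}
```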

4.2.2 Local Lipschitzness and Hedge Defense

Here we empirically verify the statement that Hedge Defense takes effect because the ground-truth class has smaller Lipschitzness. On FtoT examples, where Hedge Defense brings a benefit, we compute the approximated Lipschitzness difference between the false class and the ground-truth class. As shown in Figure 2(b) (blue), this difference is generally positive and verifies that the ground-truth class does have smaller Lipschitzness. We further evaluate the same metric on the few examples of TtoF and observe a generally negative difference (red). This again verifies our theory and shows that Hedge Defense actually hurts robustness when the Lipschitzness of the ground-truth class is not smaller.
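The paper's exact Lipschitzness estimator is not reproduced here; as a rough proxy under that assumption, one could compare the input-gradient norms of the two class logits, as in the sketch below.

```python
import torch

def class_gradient_norm(model, x, k):
    """Proxy for the local Lipschitzness of class k around a single image x (CxHxW):
    the norm of the input gradient of the class-k logit."""
    x = x.clone().detach().requires_grad_(True)
    logit_k = model(x.unsqueeze(0))[0, k]
    grad = torch.autograd.grad(logit_k, x)[0]
    return grad.norm().item()

# Lipschitzness difference in the spirit of Figure 2(b):
# diff = class_gradient_norm(model, x_adv, false_class) - class_gradient_norm(model, x_adv, true_class)
# A positive diff means the ground-truth class is locally flatter than the false class.
```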

4.2.3 Visualization and Image Similarity

To provide a better sense of how Hedge Defense perturbs the images, we visualize the three types of examples, i.e., natural examples, adversarial examples, and hedge examples. To show the similarity between different examples, we also compute the SSIM Wang et al. (2004) score of adversarial and hedge examples with respect to natural examples. As shown in Figure 2(c), both adversarial and hedge examples are visually consistent with natural examples. Their SSIM scores are generally high, indicating a strong similarity (a score of 1 means the image is exactly the same). More visualizations, including defensively perturbed natural examples, can be seen in Appendix B.5.
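A minimal sketch of the SSIM computation (assuming float images in [0, 1] and scikit-image ≥ 0.19; this is not necessarily the paper's exact configuration):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_to_natural(x_nat: np.ndarray, x_other: np.ndarray) -> float:
    """x_nat, x_other: HxWxC float images in [0, 1]; returns the SSIM score."""
    return structural_similarity(x_nat, x_other, data_range=1.0, channel_axis=-1)
```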

Figure 3: (a) Number of examples for FtoT, TtoF, and FtoT−TtoF against the number of Hedge Defense iteration steps (Section 4.3.1). (b,c,d) Ablation study on the perturbation radii (/255) of $\epsilon$ and $\epsilon_d$ (Section 4.3.2).

4.3 Ablation Study for Defensive Perturbation

In this section, we study the hyper-parameters of optimization steps and perturbation radius. The TRADES Zhang et al. (2019b) model on CIFAR10 is evaluated against the C&W Carlini and Wagner (2017) attack.

4.3.1 Ablation Study on the Complexity of Hedge Defense

Although Hedge Defense requires extra computation, its complexity is the same as that of the attacks since both utilize similar optimization techniques. Thus, defenders are only required to have the same computation budget as attackers. Moreover, the complexity of Hedge Defense mainly comes from the number of iterations ($T$ in Algorithm 1). We therefore plot the effect of Hedge Defense against its number of iterations in Figure 3. It is shown that Hedge Defense converges very fast, so defenders can use our method with only a few optimization steps.

4.3.2 Ablation Study on Attacking and Defensive Perturbation Radius

Here we explore how the two perturbation radii $\epsilon$ and $\epsilon_d$ influence each other in an extensive manner. We first fix the attacking radius $\epsilon$ and investigate how different values of the defensive radius $\epsilon_d$ influence robust accuracy. As shown in Figure 3, the efficacy of Hedge Defense is not very sensitive to the value of $\epsilon_d$, considering the obvious improvements over a wide range of $\epsilon_d$. We then fix $\epsilon_d$ and investigate different values of $\epsilon$ Dong et al. (2020). As shown in Figure 3, the unwanted examples of TtoF stay relatively unchanged as $\epsilon$ increases, indicating that TtoF is more related to $\epsilon_d$ than to $\epsilon$. On the other hand, the other two metrics steadily increase with $\epsilon$, which suggests that Hedge Defense performs better when attacks get stronger. Even larger values of $\epsilon$ are not tested since they may destroy the semantic information of the example. Eventually, we plot a complete view of the two radii in Figure 3, where the color represents the robust accuracy under different $\epsilon$ and $\epsilon_d$.

5 Interpretation of Hedge Defense

We have shown that Hedge Defense can hugely improve the robustness of deep networks on real-world problems. To get a clearer understanding of its working mechanism, we give a detailed interpretation of Hedge Defense on a locally linear model in this section. We first construct a binary classification problem whose ground-truth function is a robust model. Then, we add a small bias to the ground-truth function to create a naturally accurate but non-robust model. Finally, we show that Hedge Defense can convert this non-robust model into a robust counterpart, on the condition that the true class has a smaller magnitude of gradient, which is equivalent to smaller Lipschitzness for linear models.

Figure 4: We represent examples of the two classes with red and blue. The arrows represent the direction of attacking or defensive perturbations. (a) Examples before attacks. (b) Examples after attacks on the ground-truth model. (c) Attacking the non-robust model. (d) Examples after attacks on the non-robust model. The purple region represents the examples that are attacked successfully and wrongly classified. (e) The direction of the defensive perturbation on adversarial examples.

5.1 A Binary Classification Task

We consider the classification of two classes. Without loss of generality, we adopt a scalar input to demonstrate our idea, and all the analyses in this section can be easily extended to multi-dimensional cases. Each class has its own score function, as given in Eq. (6), and the label for an input is determined by whichever score function is larger.

(6)

We refer to the gradient of the score function for the true class as the true-class gradient (i.e., the gradient of whichever score function corresponds to the given label), and to the gradient of the other score function as the false-class gradient. As stated in Section 3, Hedge Defense relies on the condition that the magnitude of the true-class gradient is much smaller than that of the false-class gradient. We set the two score functions in reverse proportion to each other so that the two classes exclude each other.

5.2 Construct a Non-Robust Classifier under Adversarial Attacks

Suppose the attacking perturbation is bounded by the attacking radius. Then, the standard FGM attack Goodfellow et al. (2015); Pal and Vidal (2020) generates the attacking perturbation in the direction opposite to the gradient of the true class:

(7)

As shown by the arrows in Figure 4, the attack drags natural examples closer to the decision boundary. Under this attack, a robust example Pal and Vidal (2020) should have the same prediction as all examples within its attacking neighborhood. Apparently, only examples sufficiently far from the decision boundary satisfy this requirement, while examples too close to the decision boundary are intrinsically non-robust Carmon et al. (2019). Thus, we do not consider such examples, as shown by the blank region in Figure 4. In Figure 4, we show the inputs after the attack: the examples are not dragged across the decision boundary of the ground-truth function, making it a robust solution for the learning task.

Next, we consider the situation where we have learned a non-robust model so that Hedge Defense can provide extra protection. To this end, suppose we learn an estimated function with a small bias added to the ground-truth score. This assumption is reasonable since learning a bias frequently happens in many learning algorithms. The estimated decision boundary will then shift away from the ground-truth one. The estimated classifier will still output the correct label for every natural example, so the learned classifier is a naturally accurate discriminator. However, when examples are attacked, as shown by the arrows in Figure 4, some of them will be perturbed across the shifted decision boundary and then be wrongly classified, as shown by the purple region in Figure 4. Thus, the learned classifier is not a robust discriminator like the ground-truth one. In summary, we now have a classifier that is naturally accurate but non-robust. This resembles real-world tasks like CIFAR10, where we can learn a deep network that accurately predicts natural examples but fails to resist adversarial ones.

5.3 Apply Hedge Defense to the Non-Robust Classifier

Consider that defenders can also add a defensive perturbation, bounded by the defensive radius, after attacks. For a coming input, natural or adversarial, Hedge Defense generates a defensive perturbation as in Eq. (2).

Notice that, while attackers are allowed to utilize the ground-truth classifier, defenders can merely utilize the estimated one. The direction of the defensive perturbation generated by the estimated classifier is shown in Figure 4. The defensive perturbation is opposite to the attacking perturbation and will eventually push examples back to their positions before the attacks occur. This allows the estimated classifier to make a correct decision. Given the above illustration, we can tell that Hedge Defense highly relies on the condition that the true class has a smaller magnitude of gradient. If, instead, the false class has the smaller gradient magnitude, Hedge Defense actually worsens the prediction and generates perturbations in the same direction as the adversarial perturbations. This explains the rare cases of TtoF in Section 4, where deep models correctly classify adversarial examples but fail on hedge examples. Extra analyses are shown in Appendix C.

6 Discussion

From the Attacking Perspective: In this work, we have shown that adversarial attacks can also be vulnerable to perturbations. One may wonder: if attackers know that a defensive perturbation will be added before prediction, can they design better attacks? For instance, based on our analyses in Section 4, attacks like Square may keep querying to find more robust adversarial examples, and FAB may consider keeping a proper distance to the decision boundary. We have also tried several possible adaptations to enhance the white-box attacks. Our defense survives all of these defense-aware attacks (see Appendix B.11). Specifically, to resist the defensive perturbation, an attacker would need to find a local space that is either full of misled predictions or violates the condition that the ground-truth class has smaller Lipschitzness. Both can be much harder than simply finding a point-wise falsehood.

From the Defensive Perspective: Whenever a new defense comes out, the first thing that occurs to us is: Is this new method just an illusion of improper evaluations? Can it be naively broken by tailored attacks? Although we cannot prove the certified robustness Cohen et al. (2019) of Hedge Defense, for now, we can still get some insights about its solidity from the following aspects: 1) In a thorough evaluation of robustness, defenses should never interfere with the attacking stage. Our evaluations obey this rule. Specifically, we did not add any defensive perturbation to the input when generating adversarial examples because it can hugely worsen the performance of attacks (see Appendix B.11). 2) Hedge Defense does not fall into any category of the obfuscated gradient Athalye et al. (2018). We present a detailed analysis in Appendix E. 3) Pal and Vidal (2020) has theoretically justified that randomly perturbing the coming input before predicting can outperform the manner of directly predicting the coming input, and our analyses in Section 5 also partially confirm the reliability of Hedge Defense.

7 Conclusion

The contribution of this work is two-fold. 1) We point out the novel finding that adversarial attacks can also be vulnerable to perturbations. 2) Enlightened by the above finding, we develop the algorithm of Hedge Defense, which can more effectively break attacks and enhance adversarially trained models. Both empirical results and theoretical analyses support our method. Our work fights back against attacks with the same methodology and sheds light on a new direction of defenses. Future work may consider using criteria other than attacking all the classes to find the way back to the correct predictions. Moreover, with Hedge Defense, defenders may not need to guarantee that the model correctly classifies all the examples within the local space. We only need to ensure that certain conditions, e.g., the ground-truth class having smaller local Lipschitzness, hold so that those safer examples can be found.

References

  • M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2020) Square attack: A query-efficient black-box adversarial attack via random search. In ECCV, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12368, pp. 484–501. External Links: Link, Document Cited by: §1, §1, §1, §2, §4.1.1.
  • A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In ICML, Proceedings of Machine Learning Research, Vol. 80, pp. 274–283. Cited by: Appendix A, §B.4, §E.2, §E.2, §E.3, §E.3, Appendix E, §2, §2, §4.1.1, §6.
  • N. Carlini and D. A. Wagner (2017) Towards evaluating the robustness of neural networks. In SP, pp. 39–57. Cited by: §B.2, §1, §1, §2, §4.1.1, §4.3.
  • Y. Carmon, A. Raghunathan, L. Schmidt, J. C. Duchi, and P. Liang (2019) Unlabeled data improves adversarial robustness. In NeurIPS, Cited by: §B.11, Table 4, Table 8, §C.1, Table 1, §5.2.
  • A. Chan, Y. Tay, Y. Ong, and J. Fu (2020) Jacobian adversarially regularized networks for robustness. In ICLR, External Links: Link Cited by: §1, §2.
  • J. Chen, Y. Cheng, Z. Gan, Q. Gu, and J. Liu (2020a) Efficient robust training via backward smoothing. CoRR abs/2010.01278. External Links: Link, 2010.01278 Cited by: §B.1, Table 5.
  • J. Chen and Q. Gu (2020) RayS: a ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: §1, §2, §4.1.1.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In AISec@CCS, pp. 15–26. Cited by: §2.
  • T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang (2020b) Adversarial robustness: from self-supervised pre-training to fine-tuning. In CVPR, pp. 696–705. External Links: Link, Document Cited by: Table 5.
  • M. Cissé, P. Bojanowski, E. Grave, Y. N. Dauphin, and N. Usunier (2017) Parseval networks: improving robustness to adversarial examples. In ICML, Proceedings of Machine Learning Research, Vol. 70, pp. 854–863. Cited by: §1.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Proceedings of Machine Learning Research, Vol. 97, pp. 1310–1320. Cited by: §6.
  • F. Croce, M. Andriushchenko, V. Sehwag, N. Flammarion, M. Chiang, P. Mittal, and M. Hein (2020) RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670. Cited by: §B.12, §C.1, §4.1.1.
  • F. Croce and M. Hein (2020a) Minimally distorted adversarial examples with a fast adaptive boundary attack. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 2196–2205. External Links: Link Cited by: §4.1.1.
  • F. Croce and M. Hein (2020b) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, Cited by: §B.2, §B.2, §E.2, §1, §4.1.1.
  • J. Cui, S. Liu, L. Wang, and J. Jia (2020) Learnable boundary guided adversarial training. CoRR abs/2011.11164. External Links: Link, 2011.11164 Cited by: Table 5.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. External Links: Link, Document Cited by: §1.
  • G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar (2018) Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1.
  • G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2020) MMA training: direct input space margin maximization through adversarial training. In ICLR, External Links: Link Cited by: Table 5, Table 7.
  • Y. Dong, Q. Fu, X. Yang, T. Pang, H. Su, Z. Xiao, and J. Zhu (2020) Benchmarking adversarial robustness on image classification. In CVPR, pp. 318–328. External Links: Link, Document Cited by: §4.3.2.
  • L. Engstrom, A. Ilyas, H. Salman, S. Santurkar, and D. Tsipras (2019) Robustness (python library). External Links: Link Cited by: Table 5.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §B.1, §B.2, §C.1, §E.1, §1, §5.2.
  • S. Gowal, C. Qin, J. Uesato, T. A. Mann, and P. Kohli (2020) Uncovering the limits of adversarial training against norm-bounded adversarial examples. CoRR abs/2010.03593. External Links: Link, 2010.03593 Cited by: Table 4, Table 5, §1, Table 1, Table 2.
  • S. Gowal, J. Uesato, C. Qin, P. Huang, T. A. Mann, and P. Kohli (2019) An alternative surrogate loss for pgd-based adversarial testing. CoRR abs/1910.09338. External Links: Link, 1910.09338 Cited by: §2.
  • J. Gu and V. Tresp (2019) Saliency methods for explaining adversarial attacks. arXiv preprint arXiv:1908.08413. Cited by: §1.
  • J. Gu, B. Wu, and V. Tresp (2021) Effective and efficient vote attack on capsule networks. arXiv preprint arXiv:2102.10055. Cited by: §2.
  • C. Guo, M. Rana, M. Cissé, and L. van der Maaten (2018) Countering adversarial images using input transformations. In ICLR, External Links: Link Cited by: §D.1, §D.1, §E.3, §2, §2.
  • D. Hendrycks, K. Lee, and M. Mazeika (2019a) Using pre-training can improve model robustness and uncertainty. In ICML, Proceedings of Machine Learning Research, Vol. 97, pp. 2712–2721. Cited by: Table 4, §4.1.1, Table 1.
  • D. Hendrycks, M. Mazeika, and T. G. Dietterich (2019b) Deep anomaly detection with outlier exposure. In ICLR, External Links: Link Cited by: §3.1.
  • L. Huang, C. Zhang, and H. Zhang (2020) Self-adaptive training: beyond empirical risk minimization. In NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §B.1, Table 5.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §2.
  • M. Lécuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In SP, pp. 656–672. External Links: Link, Document Cited by: §D.4.
  • H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the loss landscape of neural nets. In NeurIPS, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6391–6401. External Links: Link Cited by: Figure 1.
  • N. Liu, H. Yang, and X. Hu (2018) Adversarial detection with model interpretation. In KDD, Y. Guo and F. Farooq (Eds.), pp. 1803–1811. External Links: Link, Document Cited by: §3.1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In ICLR, External Links: Link Cited by: §B.1, §B.2, Table 4, §1, §1, §1, §4.1.1, Table 1, Table 2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A simple and accurate method to fool deep neural networks. In CVPR, pp. 2574–2582. External Links: Link, Document Cited by: §1, §1, §4.1.1.
  • A. Pal and R. Vidal (2020) A game theoretic analysis of additive adversarial attacks and defenses. In NeurIPS, Cited by: §D.4, §D.4, §D.4, §D.4, §D.4, Appendix F, §2, §4.1.1, §5.2, §6.
  • T. Pang, C. Du, Y. Dong, and J. Zhu (2018) Towards robust detection of adversarial examples. In NeurIPS, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4584–4594. External Links: Link Cited by: §3.1.
  • T. Pang, X. Yang, Y. Dong, T. Xu, J. Zhu, and H. Su (2020) Boosting adversarial training with hypersphere embedding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Table 5.
  • N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016a) The limitations of deep learning in adversarial settings. In EuroS&P, pp. 372–387. External Links: Link, Document Cited by: §1, §2.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2016b) Distillation as a defense to adversarial perturbations against deep neural networks. In SP, pp. 582–597. External Links: Link, Document Cited by: §1, §2.
  • R. Pinot, R. Ettedgui, G. Rizk, Y. Chevaleyre, and J. Atif (2020) Randomization matters how to defend against strong adversarial attacks. In ICML, Cited by: §B.6, §E.2, Appendix F, §2.
  • C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli (2019) Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 13824–13833. External Links: Link Cited by: §2.
  • L. Rice, E. Wong, and J. Z. Kolter (2020) Overfitting in adversarially robust deep learning. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 8093–8104. External Links: Link Cited by: Table 5, Table 7.
  • A. S. Ross and F. Doshi-Velez (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 1660–1669. External Links: Link Cited by: §1.
  • V. Sehwag, S. Wang, P. Mittal, and S. Jana (2020) On pruning adversarially robust neural networks. CoRR abs/2002.10509. External Links: Link, 2002.10509 Cited by: Table 4, Table 1.
  • C. Sitawarin, S. Chakraborty, and D. A. Wagner (2020) Improving adversarial robustness through progressive hardening. CoRR abs/2003.09347. External Links: Link, 2003.09347 Cited by: Table 5.
  • D. Stutz, M. Hein, and B. Schiele (2019) Disentangling adversarial robustness and generalization. In CVPR, pp. 6976–6987. External Links: Link, Document Cited by: §C.1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR, Cited by: §1, §2, §3.
  • F. Tramèr, J. Behrmann, N. Carlini, N. Papernot, and J. Jacobsen (2020) Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 9561–9571. External Links: Link Cited by: §B.8, §3.
  • F. Tramèr, A. Kurakin, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §2.
  • F. Tramèr, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel (2017) The space of transferable adversarial examples. CoRR abs/1704.03453. External Links: Link, 1704.03453 Cited by: §2.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In ICLR, External Links: Link Cited by: §B.3, §C.1, §4.1.1.
  • Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In ICLR, External Links: Link Cited by: Table 4, §C.3, §2, Table 1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612. External Links: Link, Document Cited by: §4.2.3.
  • E. Wong, L. Rice, and J. Z. Kolter (2020) Fast is better than free: revisiting adversarial training. In ICLR, External Links: Link Cited by: §B.2, Table 5, §1, §4.1.2, Table 3.
  • B. Wu, J. Chen, D. Cai, X. He, and Q. Gu (2020a) Does network width really help adversarial robustness?. CoRR abs/2010.01279. External Links: Link, 2010.01279 Cited by: §1.
  • D. Wu, S. Xia, and Y. Wang (2020b) Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33. Cited by: Table 4, Table 5, Table 7, §2, Table 1, Table 2.
  • Y. Wu, C. Yuan, and S. Wu (2020c) Adversarial robustness via runtime masking and cleansing. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 10399–10409. External Links: Link Cited by: §D.3, §D.3, §2.
  • C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille (2018) Mitigating adversarial effects through randomization. In ICLR, External Links: Link Cited by: §B.6, §D.2, §D.2, §E.3, §1, §2.
  • D. Zhang, T. Zhang, Y. Lu, Z. Zhu, and B. Dong (2019a) You only propagate once: accelerating adversarial training via maximal principle. In NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 227–238. External Links: Link Cited by: Table 5.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019b) Theoretically principled trade-off between robustness and accuracy. In ICML, Proceedings of Machine Learning Research, Vol. 97, pp. 7472–7482. Cited by: §B.3, Table 4, §C.1, §1, §1, §4.1.1, §4.2, §4.3, Table 1, Table 2.
  • J. Zhang, X. Xu, B. Han, G. Niu, L. Cui, M. Sugiyama, and M. S. Kankanhalli (2020a) Attacks which do not kill training make adversarial learning stronger. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 11278–11287. External Links: Link Cited by: Table 5.
  • J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. S. Kankanhalli (2020b) Geometry-aware instance-reweighted adversarial training. CoRR abs/2010.01736. External Links: Link, 2010.01736 Cited by: Table 5.

Appendix A Appendix Guideline

In Appendix B, we first provide extra detailed configurations for our empirical evaluations in Section 4, including the details of the robust models that we tested, a complete introduction of the nine adversarial attacks we selected, and a complete version of our case studies on the different attacks in Section 4.1.1. Then, we present extra experimental results, including a complete version of Table 1 with more robust models, the variance of the robust accuracy reported in Table 1, different visualizations for Section 4.2.3, evaluations on attacks under other norms, reversing Hedge Defense as a label-free attack, the attempts that we made to counter Hedge Defense, and several extra experiments aiming at verifying the solidity of Hedge Defense.

In Section 5, we have presented a linear model for a binary classification learning task. In Appendix C, we will further explore this simplified model, including discussing the legitimacy of several assumptions we have made, how Hedge Defense reacts to natural examples, and what will happen when the assumptions do not hold.

In Appendix D, we introduce the details of several closely related works and their relations with Hedge Defense. Then, in Appendix E, we present the four categories of the obfuscated gradient and analyze why Hedge Defense does not fall into any of them. Notably, in the original paper on the obfuscated gradient phenomenon Athalye et al. (2018), the authors conduct a case study of several previously believed solid defenses and break them with various approaches. Several of these examined defenses are also closely related to Hedge Defense, as discussed in Appendix D. Thus, we illustrate why Hedge Defense can withstand the examinations that appeared in Athalye et al. (2018) and thus outperform previous methods. Finally, in Appendix F, we provide extra discussions about the newly proposed Hedge Defense.

Appendix B Extra Experimental Details and Extra Experiments

b.1 Details of the Robust Models

Here we introduce the details of the 25 robust models we have tested. Most of these state-of-the-art robust models are generated by adversarial training Goodfellow et al. (2015); Madry et al. (2018) or similar training schemes Chen et al. (2020a); Huang et al. (2020). In particular, the basic Adversarial Training (AT) Madry et al. (2018) is a seminal work proposed a few years ago. Since it is not submitted to RobustBench, we train its model with a PyTorch implementation and report it in Table 1. Together with the other 7 robust models, Table 1 thus includes 8 robust models. RobustBench also provides an unprotected model trained without adversarial training (Standard), and we use this official model for demonstration. Results on the remaining 17 robust models are shown in Appendix B.4.

b.2 Details of the Nine Attacks

PGD:

PGD was first proposed in Madry et al. (2018). Compared with FGSM Goodfellow et al. (2015), the most important difference is that PGD uses multi-step optimization. This modification is significant and is followed by almost all other white-box attacks. In our evaluations, we run multiple iterations of optimization with a fixed step size, and a random start of attacks Wong et al. (2020) is implemented via uniformly sampling an example in the attacking neighborhood. PGD adopts the cross-entropy loss for optimization. Many following works investigate other choices of loss design Carlini and Wagner (2017); Croce and Hein (2020b) and achieve consistent improvements. Thus, recent works have considered PGD a relatively weaker attack. In our work, we have shown that although PGD achieves worse attacking performance in the first place, it does show superiority against perturbations and thus becomes better than other attacks. In Algorithm 2, we present the details of the PGD attack. Notice that Algorithm 2 generally resembles Algorithm 1, except that their optimization criteria are completely different.

1:  Input: the coming input x, the label y, the number of iterations T, the step size α, the deep network f, and the attacking radius ε.
2:  // U(−ε, ε) generates a uniform noise
3:  Initialization: x_0 = x + U(−ε, ε).
4:  for t = 0, …, T − 1 do
5:     x_{t+1} = x_t + α · sign(∇_{x_t} L(f(x_t), y));
6:     x_{t+1} = Π_{B_ε(x)}(x_{t+1});
7:  end for
8:  Output: the adversarial example x_T.
Algorithm 2  Projected Gradient Descent
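For comparison with the Hedge Defense sketch in Section 3.1, below is a minimal PyTorch version of Algorithm 2; the radius and step size are common CIFAR10 values, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, steps=20, alpha=2/255):
    """Multi-step PGD with a random start. Unlike Hedge Defense, the criterion
    is the cross-entropy loss of the true label y only."""
    model.eval()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
```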
APGD-CE:

APGD-CE also uses the cross-entropy loss for optimization, like PGD. The difference is that APGD-CE uses an adaptive optimization step size and achieves better performance. Specifically, at a set of checkpoint iterations, the step size is updated according to a series of intuitive rules. The details are shown in Algorithm 3.

1:  Input: the coming input x, the label y, the number of iterations T, the step size η, the deep network f, the update checkpoints W = {w_0, …, w_n}, the hyper-parameters α, ρ, and the attacking radius ε.
2:  // U(−ε, ε) generates a uniform noise
3:  Initialization: x_0 = x + U(−ε, ε).
4:  x_1 = Π_{B_ε(x)}(x_0 + η · sign(∇_{x_0} L(f(x_0), y)));
5:  L_0 = L(f(x_0), y);
6:  L_1 = L(f(x_1), y);
7:  (x_max, L_max) = (x_1, L_1) if L_1 > L_0 else (x_0, L_0);
8:  for t = 1, …, T − 1 do
9:     z_{t+1} = Π_{B_ε(x)}(x_t + η · sign(∇_{x_t} L(f(x_t), y)));
10:     x_{t+1} = Π_{B_ε(x)}(x_t + α · (z_{t+1} − x_t) + (1 − α) · (x_t − x_{t−1}));
11:     if L(f(x_{t+1}), y) > L_max then
12:         x_max = x_{t+1} and L_max = L(f(x_{t+1}), y).
13:     end if
14:     if t ∈ W and t > w_0 then
15:         if the loss increased in fewer than ρ · (w_j − w_{j−1}) steps since the last checkpoint, or (η and L_max both stayed unchanged since the last checkpoint), then
16:             η = η / 2 and x_{t+1} = x_max.
17:         end if
18:     end if
19:  end for
20:  Output: the adversarial example x_max.
Algorithm 3  Auto Projected Gradient Descent
C&W:

PGD intends to maximize the soft-max cross-entropy loss. In contrast, C&W directly attacks the logits before the softmax layer. In previous works, it shows steadily better performance than the standard PGD attack. Nevertheless, our work has shown that the adversarial examples generated by C&W can be more vulnerable to perturbations. We will discuss this in detail in the next section.

$$\mathcal{L}_{\mathrm{CW}}\big(x', y\big) = \max_{k \neq y} z_k(x') - z_y(x'), \qquad (8)$$

where $z_k$ denotes the logit of class $k$.
APGD-DLR and APGD-T:

Like C&W, APGD-DLR also attacks the logits. Croce and Hein (2020b) argue that the CW loss is not scale-invariant, and thus an extreme re-scaling could in principle be used to induce gradient masking. Thus, they normalize by the difference between the largest logit and the third-largest logit to counter the potential scaling. APGD-T is its targeted version: it iterates over all the false classes as the targeted attacking aim and selects the worst case among them. APGD-T can be more time-consuming but also more effective.

(9)
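The DLR loss of Eq. (9) can be sketched in PyTorch as follows (again an illustrative sketch with our own naming; it assumes at least three classes):

import torch
import torch.nn.functional as F

def dlr_loss(logits, y):
    # logits: [batch, classes]; y: [batch] ground-truth labels.
    sorted_logits, _ = logits.sort(dim=1, descending=True)
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    mask = F.one_hot(y, num_classes=logits.size(1)).bool()
    best_false = logits.masked_fill(mask, float('-inf')).max(dim=1).values
    # Normalize the margin by the gap between the largest and third-largest logits.
    scale = sorted_logits[:, 0] - sorted_logits[:, 2] + 1e-12
    return (-(true_logit - best_false) / scale).mean()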
FAB and DeepFool:

These two attacks are based on pushing examples just across the decision boundary. They intend to minimize the distance to the natural example in order to create adversarial examples that are harder to detect, either by visual clues or by adversarial-example detection techniques. Instead of maximizing a loss within a fixed radius, they approximately solve the minimal-perturbation problem:

$\min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad \arg\max_i f_i(x + \delta) \neq y$, (10)

where $f_i$ denotes the score of class $i$.
Square:

Square is the state-of-the-art black-box attack based on extensively querying the model. Because it cannot access the gradient information of the model, it has to make assumptions about the landscape of deep neural networks and conduct its search based on those assumptions. Although slightly weaker than white-box attacks and more time-consuming, its successful attacking rate is still remarkably high and significantly endangers the deployment of deep networks.

RayS:

RayS is the state-of-the-art hard-label black-box attack. Namely, black-box attacks like Square can utilize the predicted distribution of the network (the output of the softmax layer), while a hard-label attack can only use the discrete prediction of the model. This makes hard-label attacks the most difficult attacking setting currently, and RayS has to search for adversarial examples under more restricted assumptions. Its performance on adversarially-trained models is slightly lower than Square's, but its overall successful attacking rate is still considerable.

B.3 Case Study for Section 4

In Section 4.1.1, we briefly analyze why different attacks react differently to random or defensive perturbations. Here we offer the complete version of that analysis.

Square and RayS: Square and RayS are black-box attacks based on extensively querying deep networks. Typically, they need about 1k to 40k queries to attack a single example. On an NVIDIA V100 GPU, typical white-box attacks like PGD take a few seconds to attack the examples of the CIFAR10 test set, while Square and RayS may cost several hours in different settings. Therefore, they tend to stop querying once they find a successful adversarial example, in order to save the computation budget. Nevertheless, such a design makes the attack stop at an adversarial example that lies close to a harmless example, namely the example one step before the successful attacking query. This explains why they are so vulnerable to defensive perturbations. The black-box constraint of not knowing the gradient information of the model may also prevent them from finding more powerful adversarial examples that are far away from the decision boundary. Explicitly speaking, to counter defensive perturbations, black-box attacks would need to find a region-wise falsehood, i.e., a small local region that is full of false predictions. Without knowing the gradient information and the landscape of the loss surface, this aim can be tough to fulfill.

FAB: Unlike other white-box attacks that search for adversarial examples within a fixed radius, FAB tries to find a minimal perturbation for attacks, as shown above. As discussed previously, this design allows FAB to generate less detectable adversarial examples. However, the aim of finding smaller attacking perturbations also encourages it to find adversarial examples that are too close to the decision boundary and thus very vulnerable to defensive perturbations. Thus, even on the undefended standard model, where it generates smaller perturbations than on robust models, FAB can still be hugely nullified by random perturbations, and the robust accuracy increases substantially. It is easy to tell that FAB is vulnerable to Hedge Defense primarily due to its intrinsic design. With a proper adaptation, FAB may alleviate its non-robustness to the emerging defensive perturbations, although questions such as how to keep a proper distance from the decision boundary and whether the new design would degrade its attacking performance remain open.

PGD and C&W: Attacks like PGD and C&W show better resistance against perturbations. However, with Hedge Defense, the robust accuracy can still be improved considerably. Interestingly, several attacks commonly believed to be more powerful turn out to be weaker when facing perturbations. For instance, C&W enhances attacks by increasing the largest false logit and decreasing the true logit. Without Hedge Defense, it is more powerful than the PGD attack on TRADES. With Hedge Defense, it becomes less effective. This indicates that its efficacy comes at the price of less robustness against defensive perturbations. Furthermore, the effect of directly increasing the false logit can be weakened once that false class is also attacked.

APGD-CE and APGD-T: APGD-CE is a direct improvement over PGD. However, after random perturbation, the improvement brought by APGD-CE can be nullified compared with the standard PGD. APGD-T does bring consistent improvements in many cases, yet APGD-T and APGD-DLR may also occasionally lag behind PGD, which was previously believed to be weaker. Thus, future works may still need to examine how these intuitively designed losses actually help attacks.

Finally, we want to compare the robustness of attacks and the robustness of defenses from a unified perspective. As studied in previous works Tsipras et al. (2019); Zhang et al. (2019b), adversarial robustness may be at odds with natural accuracy. Similarly, we observe in our evaluations that stronger attacks may bring a higher successful attacking rate but less robustness against defensive perturbations. This finding gives defenses a chance to fight back and break attacks with their own methodology. The above analyses show that state-of-the-art attacks are not robust, for various reasons. In the future, when attacks alleviate this problem and enhance their own robustness, they may no longer attack robust models at the current high success rate, just as robust deep networks all have lower natural accuracy.

B.4 Extra Results of Table 1

Here we provide extra results for our evaluations in Table 1, including the variances of the robust accuracy evaluations in Table 1 and evaluations on the remaining 17 robust models. Besides the reference to the original work, we also provide the IDs of these models on RobustBench. Notice that, for the eight robust models and the one standard model, we evaluate three times, reporting the average values in Table 1 and presenting the variances in Table 4. For the remaining 17 robust models in Table 5, we only conduct a single evaluation.

As shown below, compared with the improvements brought by random perturbation and Hedge Defense, their variances are small, indicating that the improvements are stable and reliable. Later we will show that, even with EOT Athalye et al. (2018), the improvements of defensive perturbation remain solid.

Model Method PGD C&W DeepFool APGD-CE APGD-T FAB Square RayS AutoAttack Worst-Case
WA+SiLU Gowal et al. (2020) - 0.04 0.02 0.00 0.01 0.01 0.00 0.05 0.00 0.01 0.01
Uniform 0.03 0.09 0.10 0.03 0.10 0.09 0.14 0.12 0.03 0.03
Hedge 0.04 0.06 0.04 0.00 0.04 0.03 0.05 0.03 0.05 0.05
AWP Wu et al. (2020b) - 0.03 0.02 0.00 0.02 0.01 0.01 0.04 0.00 0.00 0.01
Uniform 0.07 0.08 0.03 0.00 0.05 0.02 0.10 0.12 0.08 0.10
Hedge 0.05 0.10 0.03 0.05 0.02 0.14 0.10 0.05 0.01 0.01
RST Carmon et al. (2019) - 0.02 0.01 0.00 0.01 0.01 0.01 0.02 0.00 0.02 0.01
Uniform 0.07 0.13 0.03 0.11 0.00 0.07 0.15 0.11 0.04 0.04
Hedge 0.08 0.09 0.06 0.06 0.06 0.07 0.04 0.01 0.03 0.03
Pre-Training Hendrycks et al. (2019a) - 0.04 0.01 0.00 0.01 0.02 0.02 0.03 0.00 0.00 0.02
Uniform 0.03 0.07 0.06 0.07 0.07 0.06 0.06 0.28 0.06 0.03
Hedge 0.02 0.11 0.05 0.06 0.04 0.06 0.10 0.04 0.10 0.04
MART Wang et al. (2020) - 0.04 0.03 0.00 0.01 0.01 0.03 0.02 0.00 0.01 0.02
Uniform 0.10 0.10 0.19 0.04 0.09 0.06 0.14 0.12 0.08 0.03
Hedge 0.02 0.04 0.02 0.07 0.06 0.03 0.05 0.06 0.06 0.02
HYDRA Sehwag et al. (2020) - 0.03 0.05 0.00 0.02 0.00 0.01 0.01 0.00 0.01 0.01
Uniform 0.07 0.13 0.09 0.05 0.05 0.16 0.17 0.27 0.01 0.05
Hedge 0.04 0.07 0.04 0.03 0.05 0.05 0.07 0.03 0.04 0.03
TRADES Zhang et al. (2019b) - 0.03 0.01 0.00 0.02 0.01 0.02 0.03 0.00 0.01 0.00
Uniform 0.04 0.08 0.05 0.08 0.06 0.06 0.2 0.24 0.04 0.02
Hedge 0.02 0.07 0.03 0.03 0.05 0.05 0.02 0.01 0.03 0.03
AT Madry et al. (2018) - 0.03 0.10 0.07 0.04 0.10 0.10 0.09 0.00 0.01 0.05
Uniform 0.00 0.29 0.10 0.04 0.04 0.10 0.17 0.01 0.08 0.10
Hedge 0.01 0.08 0.14 0.04 0.02 0.04 0.01 0.08 0.08 0.07
Standard - 0.00 0.00 0.00 0.00 0.00 0.01 0.04 0.00 0.00 0.00
Uniform 0.00 0.23 0.21 0.01 0.04 0.20 0.34 0.35 0.07 0.00
Hedge 0.00 0.42 0.23 0.05 0.02 0.19 0.23 0.25 0.02 0.00
Table 4: The variance for Table 1.

IDs on RobustBench (not including AT):

Gowal2020Uncovering_70_16_extra

Wu2020Adversarial_extra

Carmon2019Unlabeled

Hendrycks2019Using

Wang2020Improving

Sehwag2020Hydra

Zhang2019Theoretically

Standard

Model Method Nat-Acc. PGD C&W DeepFool APGD-CE APGD-T FAB Square RayS AutoAttack Worst-Case
Gowal et al. (2020) - 89.48 66.59 64.46 67.62 65.92 63.25 63.77 69.07 69.3 63.24 63.22
Random 89.21 66.58 68.08 71.89 66.90 64.35 75.54 75.30 77.12 64.80 63.08
Hedge 89.16 69.04 76.18 77.26 71.47 70.33 80.15 81.11 79.99 70.07 65.70
Zhang et al. (2020b) - 89.36 68.15 61.18 64.09 66.37 59.64 60.16 66.18 66.82 59.61 59.61
Random 89.33 68.06 65.33 74.79 67.54 60.93 74.92 73.56 76.47 61.65 59.74
Hedge 90.16 71.03 77.8 81.45 73.30 69.08 82.54 82.44 83.07 69.89 64.43
Gowal et al. (2020) - 88.70 56.38 55.38 60.42 55.49 53.56 54.09 61.96 63.01 53.57 53.52
Random 88.61 56.60 60.55 69.20 56.68 55.06 72.32 71.56 74.81 55.19 53.59
Hedge 88.77 58.58 69.45 73.56 59.84 59.35 76.67 77.11 78.06 58.61 55.44
Wu et al. (2020b) - 85.36 59.66 57.51 60.87 59.02 56.52 57.04 61.56 61.84 56.53 56.51
Random 85.22 59.80 61.49 68.71 59.99 57.82 71.61 69.74 72.59 58.15 56.67
Hedge 84.89 62.14 70.47 72.57 65.07 64.06 75.04 75.13 74.93 63.87 59.15
Pang et al. (2020) - 85.14 62.50 56.27 58.64 61.73 54.30 54.71 61.59 61.62 54.27 54.24
Random 84.86 62.74 60.45 68.40 62.74 55.54 69.08 68.95 72.09 55.93 54.44
Hedge 85.59 66.04 71.42 74.05 70.79 65.48 76.06 77.74 76.25 66.26 58.15
Cui et al. (2020) - 88.70 56.38 55.38 60.42 55.49 53.56 54.09 61.96 63.01 53.57 53.52
Random 88.61 56.60 60.55 69.20 56.68 55.06 72.32 71.56 74.81 55.19 53.59
Hedge 88.77 58.58 69.45 73.56 59.84 59.35 76.67 77.11 78.06 58.61 55.44
Zhang et al. (2020a) - 84.52 57.78 55.37 59.01 56.96 54.10 54.58 59.84 60.16 54.06 54.06
Random 84.35 58.09 59.47 64.66 58.18 55.26 69.42 68.12 71.32 55.85 54.25
Hedge 81.76 59.96 64.76 66.66 62.74 61.09 70.25 70.79 69.42 61.33 55.97
Rice et al. (2020) - 85.34 57.98 56.21 58.46 57.25 53.92 54.31 61.74 61.66 53.94 53.87
Random 85.13 58.10 60.15 69.09 58.20 55.24 69.36 69.66 72.53 55.45 53.81
Hedge 85.85 62.16 73.72 75.42 66.69 65.79 76.65 78.7 77.70 64.75 58.59
Huang et al. (2020) - 83.48 56.80 54.76 58.65 56.08 53.38 53.90 58.88 59.47 53.35 53.31
Random 83.40 56.86 58.83 64.29 57.04 54.77 69.25 67.01 70.63 54.74 53.40
Hedge 82.25 59.66 68.03 68.84 62.55 61.78 72.33 72.72 71.98 61.16 56.34
Cui et al. (2020) - 88.22 55.23 54.83 61.12 54.30 52.90 53.51 61.22 62.59 52.87 52.82
Random 88.05 55.63 60.23 68.67 55.56 54.54 72.11 70.98 74.07 54.39 52.95
Hedge 88.57 57.72 69.13 72.57 58.66 58.79 76.18 76.17 77.61 57.74 54.63
Chen et al. (2020b) - 86.04 54.86 52.86 56.47 54.35 52.12 52.51 58.47 59.77 52.09 52.08
Random 85.86 55.14 57.16 63.15 55.44 53.29 68.78 67.47 72.01 53.48 51.97
Hedge 85.99 55.97 62.93 66.97 56.84 55.14 72.26 71.60 74.60 55.47 53.01
Chen et al. (2020a) - 85.32 55.49 53.84 57.90 54.06 51.59 52.29 58.62 59.44 51.57 51.55
Random 85.19 55.88 59.11 65.85 55.63 53.32 71.78 68.66 72.47 53.67 51.99
Hedge 82.52 58.67 67.3 68.99 61.30 60.92 71.68 72.30 71.54 60.09 55.31
Sitawarin et al. (2020) - 86.84 54.41 54.01 57.17 53.25 51.46 51.92 59.75 60.53 51.42 51.38
Random 86.69 54.86 59.19 69.52 54.54 52.91 70.66 69.54 73.01 53.08 51.39
Hedge 86.11 57.66 70.57 73.51 60.31 60.94 74.21 76.52 75.95 59.13 54.58
Engstrom et al. (2019) - 87.03 53.49 53.25 56.48 52.09 49.93 50.36 58.34 59.79 49.87 49.84
Random 86.99 53.76 58.70 70.38 53.39 51.63 70.89 69.00 73.26 51.86 50.04
Hedge 86.00 56.83 70.96 72.30 59.73 59.74 73.39 76.13 75.01 57.96 53.42
Zhang et al. (2019a) - 87.20 47.39 48.16 51.99 46.53 45.44 45.98 55.50 55.98 45.42 45.39
Random 87.17 47.75 53.15 65.50 47.44 46.76 67.79 66.55 71.60 46.77 45.38
Hedge 86.73 50.91 68.06 72.09 53.46 55.50 73.59 75.45 75.32 52.71 48.50
Wong et al. (2020) - 83.34 47.57 47.17 50.20 46.23 43.78 44.48 53.68 54.15 43.75 43.71
Random 83.36 48.21 52.57 64.16 47.69 45.33 66.28 63.98 68.41 45.70 43.99
Hedge 81.76 52.31 66.03 68.94 55.09 55.42 69.87 71.80 71.54 53.91 49.13
Ding et al. (2020) - 84.36 52.38 53.28 58.39 49.67 42.24 43.27 55.82 51.49 41.94 41.83
Random 83.92 52.68 57.94 67.10 50.66 44.97 67.98 64.65 66.52 44.48 42.93
Hedge 82.99 54.59 66.56 69.62 54.74 49.95 68.12 70.68 68.64 48.80 43.49
Table 5: Seventeen extra robust models for Table 1

IDs on the RobustBench:

Gowal2020Uncovering_28_10_extra,

Zhang2020Geometry,

Gowal2020Uncovering_34_20,

Wu2020Adversarial,

Pang2020Boosting,

Cui2020Learnable_34_20,

Zhang2020Attacks,

Rice2020Overfitting,

Huang2020Self,

Cui2020Learnable_34_10,

Chen2020Adversarial,

Chen2020Efficient,

Sitawarin2020Improving,

Engstrom2019Robustness,

Zhang2019You,

Wong2020Fast, Ding2020MMA

B.5 Extra Visualization of Figure 2(c)

We present extra visualizations for Section 4.2.3 on CIFAR10. Figure 5 presents examples of FtoT for the TRADES model. Namely, the TRADES model correctly classifies the natural examples in the first row and then wrongly classifies the adversarial examples in the second row. After perturbing the adversarial examples with Hedge Defense, the generated hedge examples in the third row are correctly classified. In the fourth row, we also present the visualizations of applying Hedge Defense to the natural examples in the first row. In contrast, Figure 6 presents examples that belong to TtoF. The SSIM scores are shown beneath the images.

Figure 5: Extra visualizations for Figure 2(c) on examples from FtoT.
Figure 6: Extra visualizations for Figure 2(c) on examples from TtoF.

B.6 Other Possibilities of Searching for the Defensive Perturbation

In this part, we test several intuitive alternative designs that may help Hedge Defense. We first introduce some randomness into Hedge Defense. Previous works Pinot et al. (2020) have pointed out that such randomness may help defensive robustness, although not all randomness techniques help Xie et al. (2018). Instead of simultaneously attacking all the classes, we randomly select a class to attack in each iteration, as shown in Algorithm 4 below (a code sketch follows the algorithm). This approach steadily achieves improvements similar to those of the standard Hedge Defense. More interestingly, on the undefended standard model, this random Hedge Defense can achieve a considerable robust accuracy against the PGD attack. We will present a complete evaluation of this alternative method in a future version. Another method we tested is directly maximizing the KL-divergence to the uniform distribution. Not surprisingly, it achieves exactly the same performance as our Algorithm 1.

1:  Input: the coming input , the number of iterations , the step size , the deep network , and the defensive radius .
2:  // generates a uniform noise
3:  Initialization: .
4:  for  do
5:     // Randomly draw a class from to ,
6:      ;
7:     // is the projection operator.
8:     ;
9:  end for
10:  Output: the safer prediction .
Algorithm 4  Hedge Defense with Randomness
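The following PyTorch sketch illustrates Algorithm 4. It is for illustration only: the function name, the per-example sampling of the attacked class, the sign-based update, and the assumption that inputs lie in [0, 1] are our own choices rather than the exact implementation.

import torch
import torch.nn.functional as F

def random_hedge_defense(model, x, radius, alpha, steps, num_classes=10):
    # Each iteration ascends the loss of one randomly drawn class while
    # staying inside the defensive ball of the given radius around x.
    x = x.detach()
    z = (x + torch.empty_like(x).uniform_(-radius, radius)).clamp(0, 1)
    for _ in range(steps):
        c = torch.randint(0, num_classes, (x.size(0),), device=x.device)    # random class per example
        z.requires_grad_(True)
        loss = F.cross_entropy(model(z), c)
        grad = torch.autograd.grad(loss, z)[0]
        with torch.no_grad():
            z = z + alpha * grad.sign()                                     # gradient ascent on the drawn class
            z = torch.min(torch.max(z, x - radius), x + radius).clamp(0, 1) # project back into the ball
    return model(z).argmax(dim=1)                                           # the safer prediction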

B.7 Reversing Hedge Defense as a Label-Free Attack

1:  Input: the coming input , the number of iterations , the step size , the deep network , and the defensive radius .
2:  // generates a uniform noise
3:  Initialization: .
4:  for  do
5:     // Sum the losses for all the classes,
6:     // then update with gradient descending.
7:      ;
8:     // is the projection operator.
9:     ;
10:  end for
11:  Output: the adversarial example .
Algorithm 5  The Label-Free Attack

In Section 2, we have mentioned that Hedge Defense essentially pushes examples away from the uniform distribution. Now we demonstrate that its reverse process, pulling examples towards the uniform distribution, is also a form of adversarial attack that does not require knowing the ground-truth labels. This label-free attack has the same loss function as Hedge Defense; however, when updating examples, it adopts gradient descent instead of the gradient ascent in Hedge Defense (Line 7 in Algorithm 5). We test its effectiveness on the undefended standard model and the robust TRADES model in Table 6. Apparently, without access to the ground-truth label, our label-free attack is not as powerful as attacks that can use the label. On adversarially-trained models, its performance is also not very promising, so its usage is relatively limited. Nevertheless, the fact that the reverse of Hedge Defense is an attacking process helps us understand why Hedge Defense can bring extra robustness, which is the intention of the empirical evaluations in this part.
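A minimal PyTorch sketch of Algorithm 5 follows; the function name, the sign-based update, and the [0, 1] input range are illustrative assumptions. Note that summing the cross-entropy losses over all classes equals the negative sum of log-probabilities, which is what the sketch computes.

import torch
import torch.nn.functional as F

def label_free_attack(model, x, radius, alpha, steps):
    # Descend on the summed per-class losses, pulling the prediction
    # towards the uniform distribution without using the ground-truth label.
    x = x.detach()
    z = (x + torch.empty_like(x).uniform_(-radius, radius)).clamp(0, 1)
    for _ in range(steps):
        z.requires_grad_(True)
        loss = -F.log_softmax(model(z), dim=1).sum(dim=1).mean()            # sum of cross-entropy over all classes
        grad = torch.autograd.grad(loss, z)[0]
        with torch.no_grad():
            z = z - alpha * grad.sign()                                     # gradient descent (reverse of Hedge Defense)
            z = torch.min(torch.max(z, x - radius), x + radius).clamp(0, 1) # project back into the ball
    return z.detach()                                                       # the label-free adversarial example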

Method Model Natural Accuracy (%) Robust Accuracy (%)
Label-Free Attack (The Reverse Hedge Defense) Standard 94.78 21.06
TRADES 84.92 79.97
Table 6: The label-free attack on the standard model and the TRADES model.

B.8 L2 Norm

In this part, we present evaluations against the relatively weaker L2-norm attacks on CIFAR10. As shown by Table 7, the robust accuracy against L2-norm attacks is much higher than that against L-infinity-norm attacks. Recent works Tramèr et al. (2020) have illustrated that such norm constraints may not always keep the semantics of the input unchanged.

Model Method Nat-Acc. APGD-CE APGD-T FAB Square AutoAttack Worst-Case
Wu2020Adversarial Wu et al. (2020b) - 88.51 74.72 73.66 73.85 80.28 73.66 73.66
Uniform 88.51 75.26 74.19 80.73 83.62 74.41 74.08
Hedge 87.52 76.71 76.02 80.59 84.12 76.17 75.91
Rice2020Overfitting Rice et al. (2020) - 88.67 68.55 67.68 68.01 78.28 67.68 67.68
Uniform 88.73 69.08 68.35 78.79 82.68 68.45 68.14
Hedge 87.10 73.55 73.25 78.53 84.01 73.09 72.68
Ding2020MMA Ding et al. (2020) - 88.02 66.21 66.09 66.36 76.08 66.09 66.09
Uniform 88.06 66.47 66.41 77.04 79.36 66.30 66.17
Hedge 86.82 70.09 70.62 77.07 82.44 70.07 69.79
Table 7: L2-norm attacks on CIFAR10.

B.9 Extra Results of Analytical Experiments and Ablation Study

We provide extra results of the analytical experiments and the ablation study in Section 4.

Figure 7: A PGD version for Figure 3.
Figure 8: Extra results for Figure 2(a)

B.10 Directly Attacking the Predictions of Adversarial Examples

Figure 9: Attacking the predictions of adversarial examples.

In this section, we provide empirical evidence that can help understand and verify our claims in Section 1: On adversarially-trained models, the unprotected false classes are more vulnerable to attacks than the protected ground-truth class.

On the CIFAR10 test set, for the 8492 naturally accurate examples of the official TRADES model, we use the PGD attack to generate adversarial examples. 2908 cases are successfully attacked and produce false predictions, while 5584 of them withstand the attack and keep correct predictions. If we further attack the (false) predictions of these adversarial examples, 2889 out of the 2908 cases (red) are successfully attacked, and 66% of them (1928/2908, purple) return to the correct predictions. In contrast, the 5584 adversarial examples with correct predictions resist such attacks better (blue line), with only 2551 examples being successfully attacked.
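The following sketch outlines the counting procedure described above. It assumes a PGD routine with the signature pgd_attack(model, x, y, eps, alpha, steps), like the sketch in Appendix B.2; whether the second attack should be constrained around the natural or the adversarial example is a detail we leave open here.

import torch

def attack_the_prediction(model, x_adv, y_true, eps, alpha, steps):
    # Attack the current (possibly false) prediction of an adversarial example
    # and check whether the new prediction falls back to the ground-truth class.
    y_pred = model(x_adv).argmax(dim=1)
    x_second = pgd_attack(model, x_adv, y_pred, eps, alpha, steps)
    y_second = model(x_second).argmax(dim=1)
    attacked = (y_second != y_pred)                   # the prediction was successfully flipped
    back_to_truth = attacked & (y_second == y_true)   # flipped back to the correct class
    return attacked.float().mean().item(), back_to_truth.float().mean().item()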

B.11 Attempts to Attack Hedge Defense

In the evaluations of the main content, the adversarial attacks directly proceed on the static model without any defensive perturbation. We first test an intuitive design that intends to counter Hedge Defense: on each attacking step, we add a single-step defensive perturbation to the adversarial example after the attacking perturbation, so that the attack may explore the more complex cases of hedge examples. However, such an approach hugely disturbs the attacking process and further increases the robust accuracy.

Next, we consider the possibility of directly attacking the combination of Hedge Defense (as a pre-processing) and the deep network. We adopt a simplified notation to help readers better understand this process. Denote the feed-forward process of Algorithm 1 as $h(x)$ and its first-order derivative with respect to the input as $\nabla_x h(x)$. Then the unified predictor formed by the deep network $f$ and our Hedge Defense as a pre-processing can be formulated as:

$F(x) = f(h(x))$, where $h$ is a pre-processing with $h(x) = x + \alpha \nabla_x \sum_{c=1}^{K} \ell(f(x), c)$. (11)

We omit components like the sign operation and the projection operation in Algorithm 1 and consider a single defensive step here, which does not affect our conclusion. Then, a direct attack should compute the first-order derivative of the above function with respect to the input $x$:

$\nabla_x F(x) = \nabla_h f(h(x)) \cdot \nabla_x h(x) = \nabla_h f(h(x)) \cdot \big( I + \alpha \nabla^2_x \sum_{c=1}^{K} \ell(f(x), c) \big)$. (12)

Empirical evaluations show that the magnitude of the second-order term is extremely small, while its computation can be very expensive. Therefore, we omit this part and get:

$\nabla_x F(x) \approx \nabla_h f(h(x))$. (13)

This means that, on each attacking step of our direct attack on Hedge Defense, instead of computing the derivative with respect to $x$, one should first conduct the iterations of Algorithm 1 and then apply the gradient computed at the hedged example $h(x)$ back to $x$ (a code sketch follows Algorithm 6):

1:  Input: the coming input , the number of defensive iterations , the number of attacking iterations , the step size , the deep network , the attacking radius , and the defensive radius .
2:  // generates a uniform noise
3:  Initialization: .
4:  for  do
5:     
6:     for  do
7:          ;
8:         // is the projection operator.
9:         ;
10:     end for
11:     ;
12:     ;
13:  end for
14:  Output: the adversarial example .
Algorithm 6  Attack Hedge Defense
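For concreteness, a PyTorch sketch of this defense-aware attack is given below. It is a simplified illustration of Algorithm 6 under the approximation of Eq. (13); the function name, step sizes, and the [0, 1] input range are our own assumptions.

import torch
import torch.nn.functional as F

def attack_hedge_defense(model, x, y, eps, def_radius, alpha_atk, alpha_def, atk_steps, def_steps):
    # On every attacking step, first run the Hedge Defense iterations around the
    # current adversarial example, then apply the gradient computed at the hedged
    # example back to the adversarial example (Eq. (13)).
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(atk_steps):
        z = x_adv.clone()
        for _ in range(def_steps):
            z.requires_grad_(True)
            hedge_loss = -F.log_softmax(model(z), dim=1).sum(dim=1).mean()  # summed per-class losses
            g = torch.autograd.grad(hedge_loss, z)[0]
            with torch.no_grad():
                z = z + alpha_def * g.sign()
                z = torch.min(torch.max(z, x_adv - def_radius), x_adv + def_radius).clamp(0, 1)
        z.requires_grad_(True)
        atk_loss = F.cross_entropy(model(z), y)
        g = torch.autograd.grad(atk_loss, z)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha_atk * g.sign()                            # apply the gradient at h(x) to x
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()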

We test the above algorithm on RST Carmon et al. (2019). As shown by Table 8, the defense-aware attack does affect Hedge Defense, decreasing its robust accuracy from 66.38% to 65.78%, yet Hedge Defense still improves over the direct prediction of the model (63.17%). Meanwhile, the attacking performance of the defense-aware attack on the Direct Prediction, i.e., predicting the generated adversarial examples without the defensive perturbation, hugely degrades, and the corresponding robust accuracy rises from 63.17% to 67.08%. We will investigate this defense-aware attack further in future work.

Model Attack Direct Prediction (%) Prediction of Hedge Defense (%)
RST Carmon et al. (2019) Attack the Model 63.17 66.38
Attack Hedge Defense 67.08 65.78
Table 8: Defense-aware attack on Hedge Defense.

B.12 About Adopted Assets

We provide our code in the supplementary material. The license of the main resource we use, RobustBench Croce et al. (2020), can be found at:

https://github.com/RobustBench/robustbench/blob/master/LICENSE

B.13 Clarification on Ethical Concerns

All assets adopted by this work are standard benchmarks and do not relate to ethical issues such as containing personally identifiable information or offensive content, or acquiring data without the consent of the owners.

Appendix C More Interpretations on the Linear Model

C.1 Excluding the Non-Robust Examples

In Section 5, we study Hedge Defense on a binary classification problem where a robust solution (the ground-truth classifier) exists. This design aims to simulate real-world tasks like image classification, where we assume that a robust solution exists Stutz et al. (2019). Many works analyze the causes of the intriguing adversarial examples of deep neural networks and suggest that deep networks are non-robust against attacks due to poor generalization Croce et al. (2020). Others Stutz et al. (2019) find that improving adversarial robustness may implicitly help the generalization of deep networks. Goodfellow et al. (2015) also point out that, on the logistic regression model, adversarial training degenerates to the LASSO regularization. Our work advocates the opinions in these works: in the problem we construct in Section 5, the non-robustness comes from learning a biased classifier instead of the ground-truth one. Thus, delving into its mechanism, it is still a problem of model generalization. In addition, there is another series of theoretical works focusing on intrinsically non-robust examples Carmon et al. (2019); Tsipras et al. (2019); Zhang et al. (2019b). In their framework, the ground-truth classifier is itself non-robust. Thus, for any classifier, the robust error has a lower bound Carmon et al. (2019) that is larger than zero, and the non-robust examples can never be eliminated.

Figure 10: Interpretation of Hedge Defense on the intrinsically non-robust examples.

Finally, we examine what happens to the intrinsically non-robust examples in this region. As shown in Figure 10, we apply the FGM attack to these examples. In Figure 10, they are pushed across the ground-truth decision boundary and thus wrongly classified. Then, if we further apply Hedge Defense to these attacked adversarial examples, the generated defensive perturbation has the same direction as the attacking perturbation (Figure 10), pushing the examples further away from their original positions (Figure 10). Notice that, for these examples, Hedge Defense does not increase their errors, since the examples are already wrongly classified in Figure 10. All the analyses here are conducted on the ground-truth classifier, and there is nothing we can do about these examples, because no classifier can tell whether the examples within this region are adversarial (Figure 10) or natural (Figure 10). Namely, suppose the classifier receives a certain input. It can either be an adversarial example that was originally located elsewhere with one label, or a natural example that has stayed at this position all along with the other label. Without extra information, a classifier cannot separate these two cases and output different labels for them.

C.2 Hedge Defense for Natural Examples

Figure 11: Interpretation of Hedge Defense on the natural examples.

In Section 5, we apply Hedge Defense after the FGM adversarial attack. Now we examine the case where we directly apply Hedge Defense to natural examples. As shown in Figure 11, the generated defensive perturbation stays the same as that in Figure 4 and still pushes examples away from the decision boundary (Figure 11). Therefore, the accuracy on natural examples is not changed, and both the ground-truth and the estimated classifier remain naturally accurate.

C.3 For the Cases When

In Section 5, we assume the learned bias is small enough that the estimated classifier is a naturally accurate discriminator. On the other hand, when the bias exceeds this range, Hedge Defense can still take effect. As shown in Figure 12, the examples within the biased region are wrongly classified even before attacks happen. After attacks (Figure 12), the width of the purple region represents the robust errors. The examples within are wrongly classified because of attacks, while the examples within