Adaptive Regularization for Adversarial Training

Adversarial training, which enhances robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data that deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is the use of a data-adaptive regularization for robustifying a prediction model. We apply more regularization to data that are more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean samples) and the robustness (accuracy under adversarial attacks) simultaneously and achieves state-of-the-art performance.


1 Introduction

It is easy to generate human-imperceptible perturbations that fool the prediction of a deep neural network (DNN). Such perturbed samples are called adversarial examples Szegedy et al. (2014), and algorithms for generating adversarial examples are called adversarial attacks. Adversarial attacks can be divided into two types - white-box attacks Goodfellow et al. (2015); Madry et al. (2018); Carlini and Wagner (2017); Croce and Hein (2020a) and black-box attacks Papernot et al. (2016, 2017); Chen et al. (2017); Ilyas et al. (2018); Papernot et al. (2018). In the white-box setting, an adversary has access to the model architecture and the parameter values of a given DNN, while in the black-box setting, the adversary has no access to them and can only observe the outputs of the prediction model for given inputs.

It is well known that adversarial attacks can greatly reduce the accuracy of DNNs, for example from about 96% accuracy on clean data to almost zero accuracy on adversarial examples Madry et al. (2018). This vulnerability of DNNs can cause serious security problems when DNNs are applied to security-critical applications Kurakin et al. (2017); Jiang et al. (2019) such as medicine Ma et al. (2020); Finlayson et al. (2019) and autonomous driving Kurakin et al. (2017); Deng et al. (2020); Morgulis et al. (2019); Li et al. (2020).

The aim of this paper is to develop a new adversarial training algorithm for DNNs that is theoretically well founded and empirically superior to other existing competitors. A novel feature of the proposed algorithm is the use of a data-adaptive regularization for robustifying a prediction model. We impose more regularization on data more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk, while most other data-adaptive regularizations are rather heuristically motivated.

1.1 Related Works

Most existing adversarial training algorithms can be understood as procedures for estimating the optimal robust prediction model, that is, the minimizer of the robust population risk. The two most representative adversarial training algorithms are PGD-Training of Madry et al. (2018) and TRADES of Zhang et al. (2019). Madry et al. (2018) develops the PGD-Training algorithm, which minimizes the robust empirical risk directly. In contrast, Zhang et al. (2019) decomposes the robust population risk into the sum of two misclassification errors - one for clean data and the other for adversarial data - and treats the second term as a regularization term to propose a regularized empirical risk minimization algorithm called TRADES.

Data-adaptive modifications of existing adversarial training algorithms have also received much attention. Zhang et al. (2020) decides the number of iterations of the PGD algorithm, a white-box attack algorithm for generating adversarial examples, data-adaptively when generating an adversarial example (early-stopped PGD). Ding et al. (2020) generates an adversarial example only when a given datum is correctly classified and uses the datum itself as an adversarial example when it is misclassified. Zhang et al. (2021) proposes to minimize a weighted robust empirical risk, where the weight is reciprocally proportional to the distance to the decision boundary. Wang et al. (2020) devises a regularized robust empirical risk as the sum of the robust empirical risk and a regularization term in which the weights are proportional to the conditional probability of a given datum being misclassified. See Section 2 for details of data-adaptive adversarial training algorithms.

1.2 Our Contributions

We propose a new data-adaptive adversarial training algorithm. Compared to the aforementioned data-adaptive adversarial training algorithms, our algorithm is theoretically well motivated, easier to implement and empirically superior. First, we derive an upper bound of the robust risk. Then, we devise a data-adaptive regularized empirical risk which is a surrogate version of our theoretical upper bound. Finally, we learn a robust prediction model by minimizing the proposed data-adaptive regularized empirical risk. By analyzing benchmark data sets, we show that our proposed algorithm is superior to other competitors in terms of generalization (accuracy on clean samples) and robustness (accuracy on adversarial examples) simultaneously, achieving state-of-the-art performance. In addition, we illustrate that our algorithm helps improve the fairness of the prediction model in the sense that the error rates of the classes become more similar compared to a non-adaptive adversarial training algorithm.

A summary of our contributions is as follows:

  • Theoretically, we derive an upper bound of the robust risk.

  • We propose a data-adaptive regularized empirical risk which is a surrogate version of the derived upper bound of the robust risk.

  • Numerical experiments are conducted to show that our algorithm improves the robustness and generalization simultaneously and outperforms the existing state-of-the-art methods.

  • Our algorithm can mitigate the unfairness due to the disparity between class-wise accuracies.

2 Preliminaries

2.1 Robust Population Risk

Let $\mathcal{X}$ be the input space, $\mathcal{Y}$ be the set of output labels and $f_{\theta}$ be the score function parametrized by neural network parameters $\theta$ such that $f_{\theta}(x)$ is the vector of the conditional class probabilities. Let $\mathcal{B}(x, \epsilon) = \{x' : \|x' - x\|_{\infty} \le \epsilon\}$ and $\mathbb{1}(\cdot)$ be the indicator function.

The robust population risk used in adversarial training is defined as

$$\mathcal{R}_{rob}(\theta) = \mathbb{E}_{(X, Y)}\Big[\max_{X' \in \mathcal{B}(X, \epsilon)} \mathbb{1}\big(\arg\max_{k} f_{\theta, k}(X') \neq Y\big)\Big]. \qquad (1)$$

Most adversarial training algorithms learn $\theta$ by minimizing an empirical version of the above robust population risk. In turn, most empirical versions of (1) require generating an adversarial example, which is an empirical counterpart of the inner maximizer in (1). Any method of generating an adversarial example is called an adversarial attack.
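
To make the empirical version of (1) concrete, the following is a minimal PyTorch sketch of estimating the robust error of a model given an attack routine; the `attack` callable is a placeholder for any adversarial attack that returns perturbed inputs inside $\mathcal{B}(x, \epsilon)$, not code from the paper.

```python
import torch

def empirical_robust_risk(model, attack, loader, device="cuda"):
    """Average 0-1 loss on adversarial examples, an empirical version of (1)."""
    model.eval()
    errors, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(model, x, y)              # must return points inside B(x, eps)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        errors += (pred != y).sum().item()
        total += y.numel()
    return errors / total                        # robust error; robust accuracy = 1 - this
```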

2.2 Algorithms for Generating Adversarial Examples

Existing adversarial attacks can be categorized into either white-box attacks Goodfellow et al. (2015); Madry et al. (2018); Carlini and Wagner (2017); Croce and Hein (2020a) or black-box attacks Papernot et al. (2016, 2017); Chen et al. (2017); Ilyas et al. (2018); Papernot et al. (2018). In a white-box attack, the model structure and parameters are known to the adversary, who uses this information to generate adversarial examples.

The most popular method for the white-box attack is PGD (Projected Gradient Descent) Madry et al. (2018). Let $\ell$ be a surrogate loss. For a given pair $(x, y)$, PGD seeks the adversarial example defined as $x^{adv} = \arg\max_{x' \in \mathcal{B}(x, \epsilon)} \ell(f_{\theta}(x'), y)$ by applying the gradient ascent algorithm to $\ell$ and projecting each update back onto $\mathcal{B}(x, \epsilon)$. That is, the update rule of PGD is

$$x^{(t+1)} = \Pi_{\mathcal{B}(x, \epsilon)}\Big(x^{(t)} + \alpha \, \mathrm{sign}\big(\nabla_{x^{(t)}} \ell(f_{\theta}(x^{(t)}), y)\big)\Big), \qquad (2)$$

where $\Pi_{\mathcal{B}(x, \epsilon)}$ is the projection operator onto $\mathcal{B}(x, \epsilon)$ and $\alpha$ is the step size. For the surrogate loss $\ell$, the cross entropy Madry et al. (2018) or the KL divergence Zhang et al. (2019) is used.

There are also cases where an adversary can only observe the outputs of the model for given inputs and cannot access its structure and parameters. In this case, the adversary typically builds a dataset of input-output pairs queried from the target model, trains a substitute model on this dataset, and generates adversarial examples from the substitute model Papernot et al. (2017). These kinds of attacks are called black-box attacks.
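
The following is a minimal sketch of the PGD update rule (2) in PyTorch with the cross-entropy surrogate loss; the values of `eps`, `alpha` and `steps` are illustrative defaults, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, random_start=True):
    """l_inf PGD: repeated signed-gradient ascent followed by projection onto B(x, eps)."""
    x_adv = x.clone().detach()
    if random_start:
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # gradient ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # projection onto B(x, eps)
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep inputs in [0, 1]
    return x_adv.detach()
```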

2.3 Review of Adversarial Training Algorithms

We review some adversarial training algorithms which, we think, are related to our proposed algorithm. Typically, adversarial training algorithms consist of a maximization step and a minimization step. In the maximization step, we generate adversarial examples for the current $\theta$. In the minimization step, we fix the adversarial examples and update $\theta$. For notational simplicity, we drop the dependence on $\theta$ in the adversarial examples.

PGD-Training

Madry et al. (2018) proposes PGD-Training, which updates $\theta$ by minimizing

$$\frac{1}{n} \sum_{i=1}^{n} \ell_{ce}\big(f_{\theta}(x_i^{adv}), y_i\big),$$

where $\ell_{ce}$ is the cross-entropy loss and $x_i^{adv}$ is an adversarial example obtained by PGD.

TRADES

The robust risk, natural risk and boundary risk are defined by

$$\mathcal{R}_{rob}(\theta) = \mathbb{E}\Big[\max_{X' \in \mathcal{B}(X, \epsilon)} \mathbb{1}\big(\arg\max_{k} f_{\theta, k}(X') \neq Y\big)\Big], \qquad (3)$$
$$\mathcal{R}_{nat}(\theta) = \mathbb{E}\big[\mathbb{1}\big(\arg\max_{k} f_{\theta, k}(X) \neq Y\big)\big], \qquad (4)$$
$$\mathcal{R}_{bdy}(\theta) = \mathbb{E}\Big[\mathbb{1}\big(\exists\, X' \in \mathcal{B}(X, \epsilon) : \arg\max_{k} f_{\theta, k}(X') \neq \arg\max_{k} f_{\theta, k}(X)\big)\Big]. \qquad (5)$$

Zhang et al. (2019) shows that $\mathcal{R}_{rob}(\theta) \le \mathcal{R}_{nat}(\theta) + \mathcal{R}_{bdy}(\theta)$ and proposes the following regularized empirical risk, which is a surrogate version of this upper bound of the robust risk:

$$\frac{1}{n} \sum_{i=1}^{n} \Big[\ell_{ce}\big(f_{\theta}(x_i), y_i\big) + \lambda \, \mathrm{KL}\big(f_{\theta}(x_i) \,\|\, f_{\theta}(x_i^{adv})\big)\Big],$$

where $x_i^{adv}$ is an adversarial example generated by PGD with the KL-divergence.
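
As a reference point for the data-adaptive variants below, here is a minimal sketch of a TRADES-style objective in PyTorch; the adversarial example `x_adv` is assumed to be produced by a PGD routine driven by the KL divergence (e.g., the `pgd_attack` above with the loss swapped), and `lam` plays the role of the regularization parameter.

```python
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, lam=6.0):
    """Cross-entropy on clean samples plus lam * KL(clean || adversarial predictions)."""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_clean, y)
    kl = F.kl_div(F.log_softmax(logits_adv, dim=1),
                  F.softmax(logits_clean, dim=1),
                  reduction="batchmean")
    return ce + lam * kl
```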

Data-Adaptive Methods Zhang et al. (2020); Ding et al. (2020); Zhang et al. (2021); Wang et al. (2020)
  • Geometry-Aware Instance-Reweighted Adversarial Training (GAIR-AT) Zhang et al. (2021) increases the weights of samples that are close to the decision boundary. In other words, GAIR-AT minimizes a weighted robust empirical risk in which the weight of each sample decreases with the number of PGD iterations needed to misclassify it, for a prespecified maximum number of iterations.

  • Zhang et al. (2020) suggests early-stopped PGD, which stops the PGD iteration as soon as the model misclassifies the adversarial example. Friendly Adversarial Training (FAT) minimizes the robust empirical risk in which each adversarial example is generated by early-stopped PGD with a prespecified maximum number of iterations.

  • Misclassification Aware adveRsarial Training (MART) Wang et al. (2020) minimizes

    $$\frac{1}{n} \sum_{i=1}^{n} \Big[\ell_{bce}\big(f_{\theta}(x_i^{adv}), y_i\big) + \lambda \, \mathrm{KL}\big(f_{\theta}(x_i) \,\|\, f_{\theta}(x_i^{adv})\big)\big(1 - f_{\theta, y_i}(x_i)\big)\Big], \qquad (6)$$

    where $\ell_{bce}$ is the boosted (margin) cross-entropy loss. The second term of (6) can be understood as a data-adaptive regularization term; a sketch is given after this list.

  • Ding et al. (2020) suggests generating adversarial examples only for correctly classified samples, with a data-adaptive neighborhood size in PGD. Max-Margin Adversarial (MMA) Training of Ding et al. (2020) minimizes a robust empirical risk in which the adversarial example of each correctly classified sample is generated by PGD with a data-adaptively selected neighborhood size, while the misclassified sample itself is used in place of its adversarial example.
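
Below is a minimal PyTorch sketch of a MART-style objective matching the form of (6) as reconstructed above; the boosted cross-entropy term follows the common formulation $-\log p_y - \log(1 - \max_{k \neq y} p_k)$, which is an assumption here rather than a detail stated in this text.

```python
import torch
import torch.nn.functional as F

def mart_loss(model, x, x_adv, y, lam=6.0):
    logits_clean = model(x)
    logits_adv = model(x_adv)
    p_clean = F.softmax(logits_clean, dim=1)
    p_adv = F.softmax(logits_adv, dim=1)

    # boosted cross-entropy on the adversarial example (assumed form):
    # -log p_y(x_adv) - log(1 - max_{k != y} p_k(x_adv))
    p_true_adv = p_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    mask = F.one_hot(y, num_classes=p_adv.size(1)).bool()
    p_wrong_max = p_adv.masked_fill(mask, 0.0).max(dim=1).values
    bce = -torch.log(p_true_adv + 1e-12) - torch.log(1.0 - p_wrong_max + 1e-12)

    # misclassification-aware weight: 1 - p_y(x) on the clean sample
    with torch.no_grad():
        weight = 1.0 - p_clean.gather(1, y.unsqueeze(1)).squeeze(1)

    # per-sample KL between clean and adversarial predictive distributions
    kl = F.kl_div(torch.log(p_adv + 1e-12), p_clean, reduction="none").sum(dim=1)

    return bce.mean() + lam * (weight * kl).mean()
```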

3 Adaptive Regularization

In this section, we develop a new data-adaptive regularization algorithm for adversarial training called Anti-Robust Weighted Regularization (ARoW).

3.1 An Upper Bound of the Robust Population Risk

In this subsection, we consider the case of binary response

and the surrogate loss function

given as where is a function bounded below by . Examples of include the binary cross-entropy

We take into account the regularized robust risk defined as for a given . The robust risk of (3) is the regularized robust risk with . In the regularized robust risk, is considered as a regularization term, and the regularization parameter controls the trade-off between the generalization ability and robustness to adversarial attacks.

The following theorem provides an upper bound of the regularized robust risk.

Theorem 1.

Let

For any score function , we have

(7)

The upper bound (7) consists of two terms: the first term is a surrogate loss of the natural risk and the second term is a surrogate version of the boundary risk. The upper bound in (7) can serve as a surrogate regularized robust risk, which is how it is used in the next subsection.

The second term on the right-hand side of (7) can be reformulated as the expectation of

(8)

with respect to the data distribution. For a given datum, the weight in (8) can be considered as a measure of how non-robust the prediction model is at that datum, and therefore the second term can be regarded as a data-adaptive regularization term. That is, the term enforces the prediction model to be more robust for data whose weights are large.

Note that, in a special non-adaptive case, the upper bound (7) is equal to the upper bound used in TRADES Zhang et al. (2019). Thus, our upper bound can be considered as a data-adaptive modification of TRADES.

There is much evidence that data-adaptive regularization is helpful for adversarial training Zhang et al. (2020); Ding et al. (2020); Zhang et al. (2021); Wang et al. (2020), and so our proposal is expected to perform better than TRADES.

3.2 Anti-Robust Weighted Regularization (ARoW) Algorithm

Motivated by the upper bound (7) of the robust risk in Section 3.1, in this subsection we propose a new data-adaptive adversarial training algorithm called the Anti-Robust Weighted Regularization (ARoW) algorithm, which learns $\theta$ by minimizing the following regularized empirical risk:

(9)

where $e_y$ denotes the one-hot vector whose $y$-th entry is 1 and $\mathbf{1}$ denotes the vector whose entries are all 1 (used to define the label-smoothed target).

The empirical counterpart (3.2) of the upper bound (7) is obtained by three modifications after the population expectation is replaced by the empirical average.

First, we employ label smoothing to estimate the conditional class probabilities more accurately Müller et al. (2019). It is well known that DNNs trained on the original (hard) labels are poorly calibrated Guo et al. (2017). Accurate estimation of the conditional class probabilities is important since they are used in the regularization term of ARoW.

Second, we replace the boundary surrogate term by the KL divergence between the predictive distributions on the clean and adversarial examples, since the KL divergence can be considered as a multi-class counterpart of that term. This modification is also used in TRADES Zhang et al. (2019).

The last modification is to cap the data-adaptive weight so that highly robust samples are not regularized further. We employ this modification since it is helpful not to regularize highly robust samples; that is, we want to treat all sufficiently robust samples as equally robust.

The ARoW algorithm is summarized in Algorithm 1.

Input : network $f_{\theta}$, training dataset, learning rate, hyperparameters ($\lambda$ and the smoothing/weighting parameters) of (3.2), number of epochs, number of batches, batch size
      Output : adversarially robust network $f_{\theta}$

1:for each epoch do
2:     for each mini-batch do
3:         Generate adversarial examples for the mini-batch by PGD with the KL-divergence surrogate loss.
4:         Update $\theta$ by a gradient step on the regularized empirical risk in (3.2).
5:     end for
6:end for
7:Return $f_{\theta}$
Algorithm 1 ARoW Algorithm
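
To make Algorithm 1 concrete, the following is a minimal PyTorch sketch of one ARoW-style loss computation. The exact weighting in (3.2) is not reproduced here: as assumptions, the supervised term is label-smoothed cross-entropy on clean samples, the regularizer is the KL divergence between clean and adversarial predictions, and the anti-robust weight is taken to be one minus the predicted probability of the true class on the adversarial example, without the capping modification described above.

```python
import torch
import torch.nn.functional as F

def arow_loss(model, x, x_adv, y, lam=6.0, smoothing=0.2):
    logits_clean = model(x)
    logits_adv = model(x_adv)

    # label-smoothed cross-entropy on clean samples
    sup = F.cross_entropy(logits_clean, y, label_smoothing=smoothing)

    # per-sample KL between clean and adversarial predictive distributions
    kl = F.kl_div(F.log_softmax(logits_adv, dim=1),
                  F.softmax(logits_clean, dim=1),
                  reduction="none").sum(dim=1)

    # anti-robust weight (assumed form): larger for samples the attack hurts more
    with torch.no_grad():
        p_true_adv = F.softmax(logits_adv, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
        weight = 1.0 - p_true_adv

    return sup + lam * (weight * kl).mean()
```

In an actual training loop, `x_adv` would come from the KL-driven PGD routine in step 3 of Algorithm 1, and the smoothing and weighting constants would be tuned as in Section 4.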

3.3 Remarks

An Alternative Upper Bound

An alternative adaptive upper bound of the robust population risk can be derived by replacing the data-adaptive weight in (7) with a confidence-based weight; see Appendix B for the derivation. From this alternative upper bound, we propose an adversarial training algorithm which minimizes the following regularized empirical risk:

(10)

We call the algorithm Confidence Weighted Regularization (CoW).

The roles of the two data-adaptive regularization terms in (3.2) and (10) are quite different. The regularization term in (3.2) encourages the prediction model to be more robust for samples which are more vulnerable to adversarial attacks (i.e., whose confidence on the adversarial example is low). In contrast, the regularization term in (10) puts more regularization on highly confident data. The idea of focusing more on highly confident data is reasonable since successful adversarial attacks on highly confident data cause serious damage. However, the numerical studies in Section 4 show that ARoW is superior to CoW.

Comparison to MART

The objective functions of MART (6) and ARoW (3.2) are similar, but there are three main differences. First, the supervised loss term of ARoW is the label-smoothed cross-entropy on clean samples, whereas MART uses the margin (boosted) cross-entropy loss on adversarial examples. Second, the surrogate loss functions used in PGD are different: MART uses the cross entropy while ARoW uses the KL divergence. Third, the weight in the regularization term of MART is proportional to the probability of misclassifying the clean sample, while the weight in ARoW measures how non-robust the prediction model is at the adversarial example. In the numerical studies, we find that ARoW outperforms MART by a large margin. This would be partly because ARoW is theoretically well motivated.

Figure 1: Comparison of TRADES and ARoW with varying λ. We vary λ from 2 to 8 in TRADES and from 4 to 9 in ARoW. The X-axis and Y-axis are the standard accuracy and the robust accuracy, respectively. The robust accuracy in the left panel is against PGD while that in the right panel is against AutoAttack.

4 Experiments

In this section, we investigate the ARoW algorithm in terms of robustness and generalization by analyzing benchmark data sets. We show that ARoW is superior to other competitors such as Madry et al. (2018); Zhang et al. (2019); Wang et al. (2020); Rade and Moosavi-Dezfolli (2022); Zhang et al. (2021) as well as CoW. In addition, we carry out ablation studies to illustrate that the adaptive regularization is a key component of the success of ARoW compared to TRADES. We show that ARoW improves the robustness of data vulnerable to adversarial attacks more than TRADES does. Moreover, ARoW improves the fairness of the prediction model in the sense that the error rates of the classes become more similar. For benchmark data sets, we use CIFAR10 without and with unlabeled data Carmon et al. (2019), F-MNIST Xiao et al. (2017) and SVHN Netzer et al. (2011).

For CIFAR10, we use two CNN architectures - WideResNet34-10 Zagoruyko and Komodakis (2016) and ResNet18 He et al. (2016) - to investigate how well ARoW works depending on the capacity of the model. WideResNet34-10 is more complex than ResNet18. For F-MNIST and SVHN, ResNet18 He et al. (2016) is used.

In the main manuscript, we only present the results for CIFAR10 with WideResNet34-10, and defer the results for unlabeled CIFAR10 of Carmon et al. (2019) with WideResNet34-10, CIFAR10 with ResNet18, F-MNIST and SVHN to Appendices D.1.5, D.2, E.1 and E.2, respectively.

Experimental Setup

The datasets are normalized into [0, 1]. For generating adversarial examples in the training phase, PGD with random start is used, where PGD^K denotes PGD in (2) with K iterations. For learning prediction models, SGD with momentum, weight decay, an initial learning rate of 0.1 and a batch size of 128 is used, and the learning rate is reduced by a factor of 10 at epochs 60 and 90. The final model is set to be the best model against PGD on the test data among those obtained within 120 epochs. Random crop and random horizontal flip with probability 0.5 are applied for data augmentation. For CIFAR10, stochastic weight averaging (SWA) Izmailov et al. (2018) is employed after 50 epochs to prevent robust overfitting Rice et al. (2020), as Chen et al. (2021) does.
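
As an illustration of the SWA step in this setup, here is a minimal sketch using torch.optim.swa_utils, with the 50-epoch threshold from the description above; `train_step` is a hypothetical callback standing in for one mini-batch update of whichever objective (ARoW, TRADES, etc.) is being trained.

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

def train_with_swa(model, optimizer, train_loader, train_step, epochs=120, swa_start=50):
    swa_model = AveragedModel(model)            # keeps a running average of the weights
    for epoch in range(epochs):
        for x, y in train_loader:
            train_step(model, optimizer, x, y)  # one adversarial-training update
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # accumulate the average after warm-up
    update_bn(train_loader, swa_model)          # recompute BatchNorm running statistics
    return swa_model
```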

For evaluating the robust accuracy in the test phase, PGD and AutoAttack are used as adversarial attacks, where AutoAttack consists of three white-box attacks - APGD and APGD-DLR in Croce and Hein (2020b) and FAB Croce and Hein (2020a) - and one black-box attack - Square Attack Andriushchenko et al. (2020). To the best of our knowledge, AutoAttack is the strongest attack.

Method Standard PGD Madry et al. (2018) AutoAttack Croce and Hein (2020b)
PGD-Training Madry et al. (2018) 87.02(0.20) 57.50(0.12) 53.98(0.14)
TRADES Zhang et al. (2019) 85.86(0.09) 56.79(0.08) 54.31(0.08)
HAT Rade and Moosavi-Dezfolli (2022) 86.98(0.10) 56.81(0.17) 54.63(0.07)
GAIR-AT Zhang et al. (2021) 85.44(0.10) 67.27(0.07) 46.41(0.07)
ARoW 88.59(0.02) 58.18(0.09) 54.82(0.14)
CoW 88.20(0.09) 57.33(0.05) 54.63(0.12)
Table 1: Performance on CIFAR10.

We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets.

Method Standard APGD APGD-DLR FAB Croce and Hein (2020a) SQUARE Andriushchenko et al. (2020)
GAIR-AT Zhang et al. (2021) 85.44(0.170) 63.14(0.16) 46.48(0.07) 49.35(0.05) 55.19(0.16)
ARoW 88.59(0.03) 57.78(0.08) 54.83(0.13) 55.69(0.15) 62.31(0.06)
Table 2: Comparison between GAIR-AT and ARoW. We compare the robustness of GAIR-AT Zhang et al. (2021) and ARoW against the four attacks in AutoAttack.

4.1 Performance Evaluation

Initial Robustness Rob.TRADES Rob.ARoW Diff. Rate of Impro. (%)
Highly Vulnerable 317 357 40 12.62
Vulnerable 945 1008 63 6.67
Robust 969 1027 58 5.99
Highly Robust 3524 3529 5 0.142
Table 3: Comparison of the robustness of TRADES and ARoW. Rob.TRADES and Rob.ARoW denote the numbers of samples that are robust only under TRADES and only under ARoW, respectively. Diff. and Rate of Impro. denote (Rob.ARoW - Rob.TRADES) and Diff. / Rob.TRADES.
Method Standard PGD Madry et al. (2018) AutoAttack Croce and Hein (2020b)
TRADES w/o-LS 85.86(0.09) 56.79(0.08) 54.31(0.08)
TRADES w/-LS 86.83(0.08) 57.75(0.02) 54.76(0.08)
ARoW w/o-LS 87.68(0.16) 57.54(0.09) 54.58(0.10)
ARoW w/-LS 88.59(0.02) 58.18(0.09) 54.83(0.14)
Table 4: The comparison of TRADES and ARoW with/without label smoothing. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets.
Method Standard (Acc / WC-Acc / SD) Robust (Acc / WC-Acc / SD)
TRADES() 87.73 70.70 8.17 57.17 26.40 16.75
TRADES() 85.69 67.10 9.27 57.38 27.10 16.97
TRADES() 84.94 65.90 9.58 58.01 27.30 16.92
ARoW 88.58 75.10 7.16 59.23 30.80 15.68
CoW 88.41 72.20 7.22 58.34 26.40 17.09
Table 5: Class-wise accuracy disparity in CIFAR10. We report the accuracy (Acc), the worst-class accuracy (WC-Acc) and the standard deviation of class-wise accuracies (SD) for each method, for both standard and robust accuracy.

Class Rob.Both (%) Rob.TRADES (%) Rob.ARoW (%)
0(Airplane) 61.3 3.5 5.4
1(Automobile) 75.8 1.7 1.7
2(Bird) 36.6 1.9 6.5
3(Cat) 23.1 3.0 7.1
4(Deer) 32.2 3.4 8.1
5(Dog) 44.7 3.9 2.5
6(Frog) 61.4 6.4 2.2
7(Horse) 67.1 2.6 2.2
8(Ship) 60.4 1.9 9.7
9(Truck) 73.0 2.3 3.3
Table 6: Per-class robustness in CIFAR10. We compare the per-class robustness of TRADES and ARoW. Rob.Both is the ratio of samples which are robust for both TRADES and ARoW, while Rob.TRADES and Rob.ARoW are the ratios of samples which are robust only for TRADES and only for ARoW, respectively.

Table 1 reports the standard accuracies on clean samples (generalization) and the robust accuracies on adversarial examples generated by PGD and AutoAttack Croce and Hein (2020b) (robustness) for various adversarial training algorithms including ARoW.

The regularization parameters, if any, are set to the values given in the corresponding articles, while the regularization parameter λ in ARoW and CoW is set to 6, which is the value used in TRADES. The other regularization parameters of ARoW and CoW are selected so that the robust accuracy against PGD is maximized. The regularization parameters used in the training phase are summarized in Appendix D.1.1.

HAT Rade and Moosavi-Dezfolli (2022) in Table 1 is the state-of-the-art algorithm against AutoAttack Croce and Hein (2020b). HAT is a variation of TRADES with an additional regularization term based on helper examples. The additional regularization term restrains the decision boundary from having excessive margins. The objective function of HAT is given in (12) in the Appendix.

The results indicate that ARoW is superior to the other competitors in terms of robustness as well as generalization. Moreover, the robustness of ARoW can be improved further at the cost of a small loss in generalization; the corresponding results are presented in Appendix D.1.2.

We observe in Table 1 that GAIR-AT Zhang et al. (2021) is robust to PGD but not robust to AutoAttack Croce and Hein (2020b). To check whether gradient masking occurs, we evaluate the robustness of GAIR-AT against the four attacks in AutoAttack. In Table 2, the robustness of GAIR-AT degrades substantially for the three attacks in AutoAttack other than APGD, while the robustness of ARoW remains stable regardless of the adversarial attack. These observations suggest that gradient masking occurs in GAIR-AT while it does not in ARoW.

4.2 Ablation Study : Comparison of ARoW and TRADES

In this subsection, we compare ARoW and TRADES Zhang et al. (2019) since ARoW is a data-adaptive modification of TRADES.

Performance

In Figure 1, we compare the performances (standard accuracy vs. robust accuracy) of ARoW and TRADES Zhang et al. (2019) for various choices of the regularization parameter λ. We can see that ARoW uniformly dominates TRADES with respect to the regularization parameter, regardless of the adversarial attack method.

The Effect of Adaptive Regularization

We investigate how the adaptive regularization in ARoW affects the robustness of each sample of CIFAR10. First, we divide the test data into four groups - highly vulnerable, vulnerable, robust and highly robust - according to the values of a robustness score evaluated at the parameter learned by PGD-Training Madry et al. (2018). Then, for the samples of each group, we check how many samples become robust under ARoW and TRADES, respectively; the results are presented in Table 3. We can see that ARoW is superior to TRADES in turning non-robust samples (highly vulnerable or vulnerable) into robust ones. We believe that this improvement is mainly due to the adaptive regularization term in ARoW, which enforces more regularization on more vulnerable samples.

The Effect of Label Smoothing

In Table 4, we investigate the effect of label smoothing in ARoW and TRADES Zhang et al. (2019). Label smoothing is helpful not only for ARoW but also for TRADES. This would be partly because the regularization terms depend on the conditional class probabilities, and it is well known that label smoothing is helpful for the calibration of the conditional class probabilities Pereyra et al. (2017). Note that ARoW is superior to TRADES even without label smoothing.

Improved Fairness

Xu et al. (2021) reports that TRADES Zhang et al. (2019) increases the variation of the per-class accuracies (the accuracy in each class), which is not desirable in terms of fairness. In turn, Xu et al. (2021) proposes the Fair-Robust-Learning (FRL) algorithm to alleviate this problem. Even though the fairness becomes better, the standard and robust accuracies of FRL are worse than those of TRADES.

In contrast, Table 5 shows that ARoW improves the fairness as well as the standard and robust accuracies compared to TRADES. Also, in Table 6, we can see that ARoW is highly effective for difficult classes such as Bird, Cat and Deer, which contain many non-robust samples. These desirable properties of ARoW can be partly understood as follows. The main idea of ARoW is to impose more robustness regularization on non-robust samples. In turn, samples in less accurate classes tend to be vulnerable to adversarial attacks. Thus, ARoW improves the robustness of samples in less accurate classes, which results in improved robust as well as standard accuracies for such classes.

5 Conclusion and Future Works

In this paper, we derived an upper bound of the robust risk to develop the adaptive regularization algorithm for adversarial training and showed by numerical experiments that the adaptive regularization improves the robust accuracy as well as the standard accuracy.

Our proposed algorithms can be considered as modifications of TRADES Zhang et al. (2019), which is a non-adaptive regularization algorithm. The idea of adaptive regularization, however, is not limited to TRADES and could be applied to other existing adversarial training algorithms including HAT Rade and Moosavi-Dezfolli (2022), GAIR Zhang et al. (2021), MMA Ding et al. (2020), FAT Zhang et al. (2020) and so on without much difficulty.

We have seen in Section 4.2 that ARoW improves the fairness as well as the accuracy compared to TRADES. The advantage of ARoW in terms of the fairness is an unexpected by-product, and it would be interesting to develop a more principled way of enhancing the fairness further without hampering the accuracy.

References

  • M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2020) Square attack: a query-efficient black-box adversarial attack via random search. In ECCV. Cited by: §4, Table 2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy. Cited by: §1, §2.2.
  • Y. Carmon, A. Raghunathan, S. Ludwig, J. C Duchi, and P. S. Liang (2019) Unlabeled data improves adversarial robustness. In NeurIPS . Cited by: §D.1.5, §4, §4.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM. External Links: Link, Document Cited by: §1, §2.2.
  • T. Chen, Z. Zhang, S. Liu, S. Chang, and Z. Wang (2021) Robust overfitting may be mitigated by properly learned smoothening. In ICLR. Cited by: §4.
  • F. Croce and M. Hein (2020a) Minimally distorted adversarial examples with a fast adaptive boundary attack. In The European Conference on Computer Vision (ECCV). Cited by: §1, §2.2, §4, Table 2.
  • F. Croce and M. Hein (2020b) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. Cited by: §D.2.3, §4, §4.1, §4.1, §4.1, Table 1, Table 4.
  • Y. Deng, X. Zheng, T. Zhang, C. Chen, G. Lou, and M. Kim (2020) An analysis of adversarial attacks and defenses on autonomous driving models. IEEE International Conference on Pervasive Computing and Communications(PerCom). Cited by: §1.
  • G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2020) MMA training: direct input space margin maximization through adversarial training. In International Conference on Learning Representations (ICLR). Cited by: §1.1, 4th item, §2.3, §3.1, §5.
  • S. G. Finlayson, H. W. Chung, I. S. Kohane, and A. L. Beam (2019) Adversarial attacks against medical deep learning systems. Science. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR . Cited by: §1, §2.2.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In ICML. Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR. Cited by: §4.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In ICML. Cited by: §1, §2.2.
  • P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson (2018) Averaging weights leads to wider optima and better generalization. Proceedings of the International Conference on Uncertainty in Artificial Intelligence. Cited by: Table 9, §4.
  • L. Jiang, X. Ma, S. Chen, J. Bailey, and Y. Jiang (2019) Black-box adversarial attacks on video recognition models. In ACM. Cited by: §1.
  • A. Kurakin, I. J Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In ICLR . Cited by: §1.
  • Y. Li, X. Xu, J. Xiao, S. Li, and H. T. Shen (2020) Adaptive square attack: fooling autonomous cars with adversarial traffic signs. IEEE Internet of Things Journal. Cited by: §1.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR. Cited by: §D.1.5.
  • X. Ma, Y. Niu, L. Gu, Y. Wang, Y. Zhao, J. Bailey, and F. Lu (2020) Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recognition. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In ICLR. Cited by: Table 11, Table 13, Table 9, Table 18, Table 20, §1.1, §1, §1, §2.2, §2.2, §2.3, §4.2, Table 1, Table 4, §4.
  • N. Morgulis, A. Kreines, S. Mendelowitz, and Y. Weisglass (2019) Fooling a real car with adversarial traffic signs. ArXiv. Cited by: §1.
  • R. Müller, S. Kornblith, and G. E. Hinton (2019) When does label smoothing help?. In NeurIPS. Cited by: §3.2.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. Cited by: Appendix E, §4.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In ACM. Cited by: §1, §2.2, §2.2.
  • N. Papernot, P. McDaniel, and I. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv. Cited by: §1, §2.2.
  • N. Papernot, P. McDaniel, A. Sinha, and M. Wellman (2018) Towards the science of security and privacy in machine learning. 2018 IEEE European Symposium on Security and Privacy (EuroS&P). Cited by: §1, §2.2.
  • G. Pereyra, G. Tucker, J. Chorowski, L. Kaiser, and G. E. Hinton (2017) Regularizing neural networks by penalizing confident output distributions. In ICLR. Cited by: §4.2.
  • R. Rade and S. Moosavi-Dezfolli (2022) Reducing excessive margin to achieve a better accuracy vs. robustness trade-off. In ICLR. Cited by: §C.2, §D.1.1, §D.1.2, §D.1.2, §D.2.1, Table 11, Table 13, Table 8, Table 9, Table 18, Table 20, Appendix E, §4.1, Table 1, §4, §5.
  • L. Rice, E. Wong, and J. Z. Kolter (2020) Overfitting in adversarially robust deep learning. In ICML . Cited by: §4.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR. Cited by: §1.
  • Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations (ICLR). Cited by: §D.1.1, Table 11, Table 13, Table 9, Table 18, Table 20, §1.1, 3rd item, §2.3, §3.1, §4.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv. Cited by: Appendix E, §4.
  • H. Xu, X. Liu, Y. Li, A. K. Jain, and J. Tang (2021) To be robust or to be fair: towards fairness in adversarial training. In ICML. Cited by: §4.2.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. Proceedings of the British Machine Vision Conference 2016. Cited by: §4.
  • H. Zhang, Y. Yu, J. Jiao, E. P Xing, L. El Ghaoui, and M. I Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML) . Cited by: §D.1.2, Table 11, Table 13, Table 8, Table 9, Table 18, Table 20, Appendix E, §1.1, §2.2, §2.3, §3.1, §3.2, §4.2, §4.2, §4.2, §4.2, Table 1, §4, §5.
  • J. Zhang, X. Xu, B. Han, G. Niu, L. Cui, M. Sugiyama, and M. S. Kankanhalli (2020) Attacks which do not kill training make adversarial learning stronger. In ICML. Cited by: §D.2.3, §D.2.3, §1.1, 2nd item, §2.3, §3.1, §5.
  • J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. S. Kankanhalli (2021) Geometry-aware instance-reweighted adversarial training. In ICLR. Cited by: Table 13, Table 9, Table 18, Table 20, §1.1, 1st item, §2.3, §3.1, §4.1, Table 1, Table 2, §4, §5.

Appendix A Implementation Details

All experiments are performed on NVIDIA TITAN RTX and NVIDIA Quadro RTX 4000 GPUs. In the training phase, the automatic mixed precision package in PyTorch is used to speed up training.

Appendix B Theoretical Results

In this section, we provide the proofs of Theorem 1 and the alternative upper bound.

See 1

Proof.

Note that and . It suffices to show that , which holds since

Theorem 2 (Alternative Upper Bound).

For any score function , we have

Proof.

Note that and . It suffices to show that , which holds since

Appendix C Adversarial Training Algorithms

In this section, we explain the CoW algorithm and extra adversarial training algorithms included in our experiments but not discussed in the manuscript.

c.1 Confidence Weighted Regularization (CoW)

For notational simplicity, we define as

(11)

The CoW algorithm is summarized in Algorithm 2.

Input : network $f_{\theta}$, training dataset, learning rate, hyperparameters of (C.1), number of epochs, number of batches, batch size
      Output : adversarially robust network $f_{\theta}$

1:for each epoch do
2:     for each mini-batch do
3:         Generate adversarial examples for the mini-batch by PGD with the KL-divergence surrogate loss.
4:         Update $\theta$ by a gradient step on the regularized empirical risk in (C.1).
5:     end for
6:end for
7:Return $f_{\theta}$
Algorithm 2 CoW Algorithm

c.2 Helper Based Adversarial Training (HAT)

HAT Rade and Moosavi-Dezfolli [2022] is a variation of TRADES with an additional regularization term based on helper examples. The role of helper examples is to restrain the decision boundary from having excessive margins. HAT minimizes the following regularized empirical risk:

(12)

where the helper term uses the parameter of a model pre-trained only on clean samples, and the adversarial example is generated by PGD with the KL-divergence.

c.3 Helper Based Adversarial Training - Anti-Robust Weighted Regularization (HAT - ARoW)

We consider the combination of HAT and ARoW for improving the performance when SWA is not used. We call it HAT-ARoW which minimizes the following regularized empirical risk:

(13)

In Appendix D.1.3, we investigate the performance of HAT-ARoW.

Appendix D Additional Experiments - CIFAR10

In this section, we present the results of additional experiments and ablation studies for CIFAR10.

d.1 CIFAR10 - WideResNet34-10

In this subsection, we present the results for analyzing CIFAR10 with the WideResNet34-10 architecture.

d.1.1 Hyperparameter Selection

Method Weight Decay
SWA PGD-training - - - -
TRADES 6 - - -
TRADES w/-LS 6 - 0.2 -
HAT 4 0.25 - -
GAIR-AT - - - -
ARoW 6 - 0.2 0.2
CoW 6 - 0.2 0.2
ARoW w/o-LS 7 - - 0.2
HAT-ARoW 7 0.15 0.2 0.2
w/o-SWA PGD-Training - - - -
TRADES 6 - - -
MART 6 - - -
HAT 4 0.25 - -
GAIR-AT - - -
ARoW 8 - 0.2 0.2
CoW 8 - 0.2 0.2
HAT-ARoW 8 0.15 0.2 0.2
Table 7: Selected hyperparameters for CIFAR10 - WideResNet34-10. (w/o stands for without).

The regularization parameters are set to the values given in the corresponding articles if available, while the regularization parameter λ in MART, ARoW and CoW is set to 6, which is the value used in TRADES. The other regularization parameters of ARoW and CoW are selected so that the robust accuracy against PGD is maximized for fixed λ. Also, the regularization parameters of HAT-ARoW (C.3) are selected so that the robust accuracy against PGD is maximized. In HAT (12), the regularization parameters are set to 4 and 0.25, which are the values used in Rade and Moosavi-Dezfolli [2022] for CIFAR10. When SWA is not used, the regularization parameter λ in ARoW and CoW is selected so that the robust accuracy against PGD is similar to that of HAT. In HAT-ARoW, the regularization parameter λ is set to 8, which is the value used in ARoW and CoW. The weight decay parameter for MART is set to the value used in Wang et al. [2020], while a different value is used for the other methods, since MART works poorly with the latter.

d.1.2 Comparison of TRADES, HAT and ARoW for various values of the regularization parameter

In this subsection, we present the generalization and robustness of TRADES, HAT and ARoW against PGD and AutoAttack with varying λ. The experiments are implemented with SWA, and the helper-example weight in (12) used in HAT is set to 0.25, which is the value used in Rade and Moosavi-Dezfolli [2022].

Figure 2: Comparison of TRADES, HAT and ARoW with varying λ. We vary λ from 2 to 8 in TRADES, from 3 to 6 in HAT and from 4 to 10 in ARoW. The X-axis and Y-axis are the standard accuracy and the robust accuracy, respectively. The robust accuracy in the left panel is against PGD while that in the right panel is against AutoAttack.
Method λ Standard PGD AutoAttack
TRADES 2 89.69 54.81 53.52
3 88.45 55.42 53.88
4 87.73 56.34 54.23
5 86.45 56.86 54.23
6 85.86 56.86 54.31
7 85.71 57.17 54.75
8 84.94 57.15 54.46
HAT 3 88.29 56.97 54.10
3.5 87.60 57.15 54.24
4 87.15 56.95 54.64
4.5 86.60 57.52 54.44
5 86.27 57.65 54.89
5.5 85.45 57.04 54.71
6 84.87 56.98 54.44
ARoW 4 89.49 56.96 54.00
5 88.78 57.60 54.52
6 88.59 58.18 54.82
7 87.90 58.63 54.94
8 87.59 58.61 55.21
9 87.24 58.91 55.22
10 86.51 58.90 55.41
Table 8: Performance of TRADES, HAT and ARoW with varying λ. We report the standard accuracy and the robust accuracies against PGD and AutoAttack for TRADES Zhang et al. [2019], HAT Rade and Moosavi-Dezfolli [2022], and ARoW.

In Figure 2, we can see that ARoW uniformly outperforms HAT Rade and Moosavi-Dezfolli [2022] as well as TRADES Zhang et al. [2019] in terms of both the generalization and robustness.

d.1.3 Effect of Stochastic Weight Averaging (SWA)

We conduct the experiments with and without SWA to identify the effectiveness of SWA.

Method Standard PGD AutoAttack
SWA Izmailov et al. [2018] PGD-Training Madry et al. [2018] 87.02(0.20) 57.50(0.12) 53.98(0.14)
TRADES Zhang et al. [2019] 85.86(0.09) 56.79(0.08) 54.31(0.08)
HAT Rade and Moosavi-Dezfolli [2022] 86.98(0.10) 56.81(0.17) 54.63(0.07)
GAIR-AT Zhang et al. [2021] 85.44(0.10) 67.27(0.07) 46.41(0.07)
ARoW 88.59(0.02) 58.18(0.09) 54.82(0.14)
CoW 88.20(0.09) 57.33(0.05) 54.63(0.12)
HAT-ARoW 87.77(0.03) 58.54(0.11) 54.95(0.15)
w/o-SWA PGD-Training Madry et al. [2018] 86.88(0.09) 54.15(0.16) 51.35(0.14)
TRADES Zhang et al. [2019] 85.48(0.12) 56.06(0.08) 53.16(0.17)
MART Wang et al. [2020] 84.69(0.18) 55.67(0.13) 50.95(0.09)
HAT Rade and Moosavi-Dezfolli [2022] 87.53(0.02) 56.41(0.09) 53.38(0.10)
GAIR-AT Zhang et al. [2021] 84.49(0.06) 62.11(0.12) 38.48(0.36)
ARoW 87.60(0.02) 56.47(0.10) 52.95(0.06)
CoW 86.94(0.08) 56.19(0.13) 53.39(0.08)
HAT-ARoW 87.90(0.05) 57.28(0.08) 53.56(0.05)
Table 9: Effectiveness of SWA on CIFAR10 - WideResNet 34-10. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets.

In Table 9, we observe that SWA is effective for most methods. Note that ARoW performs well even without SWA. However, HAT, which is known to be the state-of-the-art method, has the best robust accuracy against AutoAttack without SWA. In this case, combining ARoW and HAT (C.3) improves the accuracies further. We omit the results for MART with SWA since SWA degrades its performance.

d.1.4 Per-class accuracies of ARoW and TRADES

In Table 10, we present the per-class robust and standard accuracies of the prediction models trained by ARoW and TRADES.

Class Rob.ARoW (%) Rob.TRADES (%) ARoW (%) TRADES (%)
0(Airplane) 67.3 65.6 91.6 88.3
1(Automobile) 77.8 77.8 95.3 93.7
2(Bird) 43.9 39.3 80.6 72.5
3(Cat) 30.9 27.2 75.1 65.9
4(Deer) 41.6 37.3 87.5 83.4
5(Dog) 48.1 48.8 79.3 76.0
6(Frog) 64.2 68.8 95.2 94.2
7(Horse) 70.1 70.4 92.7 91.0
8(Ship) 70.7 63.3 94.9 90.9
9(Truck) 76.7 75.7 93.5 93.5
Table 10: Comparison of the per-class robustness and generalization of ARoW and TRADES. Rob.ARoW and Rob.TRADES are the robust accuracies of ARoW and TRADES, respectively, while ARoW and TRADES are the standard accuracies.

In Table 10, we can see that ARoW is highly effective for classes that are difficult to classify, such as Bird, Cat and Deer. For such classes, ARoW improves not only the standard accuracies but also the robust accuracies substantially. For example, in the class 'Cat', which is the most difficult class (the lowest standard accuracy for TRADES), the robustness and generalization are improved by 3.7% and 9.2%, respectively, by ARoW compared with TRADES. This phenomenon would be partly due to the data-adaptive regularization used in ARoW. Usually, difficult classes are less robust to adversarial attacks. In turn, ARoW puts more regularization on non-robust classes and thus improves the accuracies of non-robust classes more.

d.1.5 Additional Unlabeled Data

In this subsection, we present the results on CIFAR10 with unlabeled data used in Carmon et al. [2019]. In the training phase, SWA is not used and the cosine annealing learning rate scheduler Loshchilov and Hutter [2017] is used. The final model is set to be the best model against PGD on the test data among those obtained until 400 epochs.

Method Standard PGD AutoAttack
PGD-Training Madry et al. [2018] 91.97 61.42 58.90
TRADES Zhang et al. [2019] 90.59 62.72 59.99
MART Wang et al. [2020] 91.04 64.10 59.33
HAT Rade and Moosavi-Dezfolli [2022] 91.54 63.45 60.15
ARoW 92.09 63.72 60.12
CoW 91.14 63.27 60.12
Table 11: Performance on unlabeled CIFAR10 - WideResNet 34-10. We conduct the experiment once and present the accuracies.

In Table 11, we observe that ARoW outperforms the other competitors in terms of generalization while maintaining good robustness. MART has the best robustness against PGD but poor robustness against AutoAttack.

d.2 CIFAR10 - ResNet18

In this subsection, we summarize the results for analyzing CIFAR10 with the ResNet18 architecture which is smaller than WideResNet34-10.

d.2.1 Hyperparameter Selection and Performance

Method Weight Decay
PGD-Training - - - -
TRADES 6 - - -
MART 6 - - -
HAT 4 0.5 - -
GAIR-AT - - -
ARoW 6 - 0.2 0.2
CoW 6 - 0.2 0.2
Table 12: Selected hyperparameters for CIFAR10 - ResNet18.

The regularization parameter in TRADES, MART, HAT, ARoW and CoW is set to the values used in Table 7. For HAT, the regularization parameters and are set to be 4 and 0.5 which are the values used in Rade and Moosavi-Dezfolli [2022].

Method Standard PGD AutoAttack
PGD-Training Madry et al. [2018] 82.42(0.05) 53.48(0.11) 49.30(0.07)
TRADES Zhang et al. [2019] 82.41(0.07) 52.68(0.22) 49.63(0.25)
MART Wang et al. [2020] 74.87(0.95) 53.68(0.30) 46.61(0.24)
HAT Rade and Moosavi-Dezfolli [2022] 83.05(0.03) 52.91(0.08) 49.60(0.02)
GAIR-AT Zhang et al. [2021] 81.09(0.12) 64.89(0.04) 41.35(0.16)
ARoW 85.30(0.13) 54.10(0.16) 49.66(0.18)
CoW 85.24(0.14) 52.91(0.25) 49.86(0.23)
Table 13: Performance on CIFAR10 - ResNet18. We conduct the experiment three times with different seeds and present the averages of the accuracies with the standard errors in the brackets.

Table 13 shows that ARoW outperforms the other competitors in terms of generalization and robustness against PGD while maintaining comparable robustness against AutoAttack.

d.2.2 Ablation Studies on the Hyperparameters in (3.2)

We present the results of ablation studies on the two hyperparameters of the ARoW algorithm in (3.2), with λ fixed at 6.

Standard AutoAttack
0.10 0 84.10 53.29 49.75
0.10 0.05 84.40 53.13 49.67
0.10 0.10 84.49 53.13 49.55
0.10 0.15 84.02 52.92 49.24
0.10 0.20 85.30 53.37 49.35
0.10 0.25 85.48 52.98 49.38
0.10 0.30 85.96 52.53 48.83
0.20 0 84.52 53.68 49.96
0.20 0.05 84.49 53.77 49.86
0.20 0.10 84.21 53.17 49.12
0.20 0.15 85.15 53.66 49.96
0.20 0.20 85.31 54.29 49.67
0.20 0.25 85.42 53.27 49.52
0.20 0.30 85.88 53.41 48.86
0.30 0 84.55 53.53 49.89
0.30 0.05 84.96 54.23 49.85
0.30 0.10 85.07 53.90 49.88
0.30 0.15 85.02 54.18 49.77
0.30 0.20 85.55 53.92 49.40
0.30 0.25 85.71 53.57 49.18
0.30 0.30 86.05 53.62 49.17
0.40 0 84.65 54.23 50.11
0.40 0.05 84.80 53.67 49.66
0.40 0.10 85.10 53.55 49.66
0.40 0.15 85.30 53.40 49.61
0.40 0.20 85.59 53.30 49.73
0.40 0.25 85.98 53.31 49.41
0.40 0.30 86.15 52.92 49.20
Table 14: Ablation studies for the two hyperparameters in (3.2).

Table 14 suggests that the choice of the two hyperparameters does not affect the standard and robust accuracies much. However, the results with the first hyperparameter being either 0.2 or 0.3 and the second slightly larger than 0 compare favorably with the other choices. That is, ARoW is not very sensitive to the choice of these hyperparameters, but fine tuning them can be helpful.

d.2.3 Combination of FAT and ARoW

FAT-TRADES Zhang et al. [2020] is an adversarial training algorithm which uses the early-stopped PGD for generating adversarial examples in TRADES. Similarly, we can apply the early-stopped PGD to ARoW for improving the performance. We propose an adversarial training algorithm FAT-ARoW which is a combination of ARoW and the early-stopped PGD. It minimizes the following regularized empirical risk:

(14)

where