DVERGE: Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles

September 30, 2020 · Huanrui Yang et al. · Duke University

Recent research finds that CNN models for image classification exhibit overlapping adversarial vulnerabilities: adversarial attacks can mislead CNN models with small perturbations, and these perturbations transfer effectively between different models trained on the same dataset. Adversarial training, as a general robustness improvement technique, eliminates the vulnerability in a single model by forcing it to learn robust features. The process is hard, often requires models with large capacity, and suffers from a significant loss in clean-data accuracy. Alternatively, ensemble methods have been proposed to induce sub-models with diverse outputs against a transferred adversarial example, making the ensemble robust against transfer attacks even if each sub-model is individually non-robust. Only a small clean-accuracy drop is observed in the process. However, previous ensemble training methods are not effective at inducing such diversity and thus fail to reach a robust ensemble. We propose DVERGE, which isolates the adversarial vulnerability in each sub-model by distilling non-robust features, and diversifies the adversarial vulnerability to induce diverse outputs against a transfer attack. The novel diversity metric and training procedure enable DVERGE to achieve higher robustness against transfer attacks compared to previous ensemble methods, and enable improved robustness as more sub-models are added to the ensemble. The code of this work is available at https://github.com/zjysteven/DVERGE

1 Introduction

Recent discoveries of adversarial attacks cast doubt on the inherent robustness of convolutional neural networks (CNNs) [goodfellow2014explaining; carlini2017towards; madry2018towards]. These attacks, commonly referred to as adversarial examples, comprise precisely crafted input perturbations that are often imperceptible to humans yet consistently induce misclassification in CNN models. Moreover, previous research has demonstrated widespread transferability of adversarial examples, wherein adversarial examples generated against an arbitrary model can reliably mislead other unspecified deep learning models trained on the same dataset [papernot2016transferability; ilyas2019adversarial; Inkawhich2020Transferable]. Ilyas et al. [ilyas2019adversarial] conjecture the existence of robust and non-robust features within standard image classification datasets. Whereas humans may understand an image via “human-meaningful” robust features, which are usually insensitive to small additive noise, deep learning models are more prone to learning non-robust features. Non-robust features are highly correlated with output labels and help improve clean accuracy, but they are not visually meaningful and are sensitive to noise. Such dependency on non-robust features leads to adversarial vulnerability that is exploited by adversarial examples to mislead CNN models. Moreover, Ilyas et al. empirically show that CNN models independently trained on the same dataset tend to capture similar non-robust features, demonstrating overlapping vulnerability [ilyas2019adversarial]. This property can be observed in the example in the upper row of Figure 1, where an ensemble is trained on clean data and each of its sub-models is vulnerable along the same axis of a transfer attack. This similarity is key to the high transferability of adversarial attacks [ilyas2019adversarial; li2015convergent].

Extensive research has been conducted to improve the robustness of CNN models against adversarial attacks, most notably adversarial training [madry2018towards]. Adversarial training minimizes the loss of a CNN model on online-generated adversarial examples against itself at each training step. This process forces the model to prefer robust to non-robust features and thereby largely eliminates the model’s vulnerability. Nevertheless, learning robust features is hard, so adversarial training often leads to a significant increase in the generalization error on clean testing data [tsipras2018robustness].

Similar to traditional ensemble methods like bagging [breiman1996bagging] and boosting [dietterich2000ensemble], which train an ensemble of weak learners with diverse predictions to improve overall accuracy, a recent line of research proposes to train an ensemble of individually non-robust sub-models that produce diverse outputs against transferred adversarial examples [bagnall2017training; pang2019improving; kariyappa2019improving]. Intuitively, the approach can defend against black-box transfer attacks, since an attack succeeds only when multiple sub-models converge towards the same wrong prediction [kariyappa2019improving]. Such an ensemble could also hypothetically achieve high clean accuracy since the training process does not exclude non-robust features. Various ensemble training methods have been explored, such as diversifying the distributions of output logits [bagnall2017training; pang2019improving] or minimizing the cosine similarity between the input gradient directions of the sub-models [kariyappa2019improving]. Yet empirical results show that these diversity metrics are not very effective at inducing output diversity among sub-models, and thus the corresponding ensembles can hardly attain the desired robustness [tramer2020adaptive].

Figure 1: Decision regions in the $\ell_\infty$ ball around the same testing image learned by ensembles of 3 ResNet-20 models trained on the CIFAR-10 dataset. The same color indicates the same predicted label. The vertical axis is along the adversarial direction of a surrogate benign ensemble, and the horizontal axis is along a random Rademacher vector. The same axes are used for each subplot. Adversarial vulnerability can be inferred from the closest decision boundary and the corresponding class. The baseline ensemble is obtained via standard training on clean data, while the bottom ensemble is trained with DVERGE. More plots of this nature can be seen in Appendix C.1.

We note that black-box transfer attacks are prevalent in real-world applications where model parameters are not exposed to end users [Inkawhich2020Transferable; kariyappa2019improving]. Moreover, high clean accuracy is always desirable. We therefore seek an effective training method that mitigates attack transferability while maintaining high clean accuracy. Based on a close investigation of the cause of adversarial vulnerability in sub-models, we propose to distill the features learned by each sub-model that correspond to its vulnerability to adversarial examples, and to use the overlap between the distilled features to measure the diversity between sub-models. As adversarial examples exploit the vulnerability of sub-models, a small overlap between sub-models indicates that a successful adversarial example on one sub-model is unlikely to fool the others. Consequently, our method impedes attack transferability between sub-models and leads to diverse outputs against a transferred adversarial example. Based on this diversity metric, we propose Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles (DVERGE), which uses a round-robin training procedure to distill and diversify the features corresponding to each sub-model’s vulnerability. The proposed ensemble training method makes the following contributions:

  • DVERGE can successfully isolate and diversify the vulnerability in each sub-model such that within-ensemble attack transferability is nearly eliminated;

  • DVERGE can significantly improve the overall robustness of the ensemble against black-box transfer attacks without significantly impacting the clean accuracy;

  • The diversity induced by DVERGE consistently improves robustness as the number of ensemble sub-models increases under equivalent evaluation conditions.

As shown in the bottom row of Figure 1, the diverse vulnerabilities that DVERGE allows to persist in each sub-model (preserving high clean accuracy) combine to yield an ensemble that is robust to transfer attacks. Our method can also be augmented with the adversarial training objective to yield an ensemble with both satisfactory white-box robustness and higher clean accuracy compared to adversarial training alone. To the best of our knowledge, this work is the first to utilize distilled features for training diverse ensembles and to quantitatively relate them to robustness against adversarial attacks.

2 Related work

Adversarial attack and defense.

The pervasiveness of adversarial examples highlights the vulnerability of modern CNN systems to malicious inputs. An adversarial attack usually applies an additive perturbation $\delta$, subject to some constraint, to an original input $x$ to form the adversarial example $x' = x + \delta$. The goal of the attack is to find $\delta$ so that $x'$ maximizes the loss $L_f(x', y)$ of some CNN model $f$ with parameters $\theta$ w.r.t. $x$'s true label $y$. The attacker's objective can be formulated as $\max_{\delta \in \Delta} L_f(x + \delta, y)$. The constraint set $\Delta$ typically ensures adversarial examples are visually indistinguishable from original inputs, and is often defined as $\Delta = \{\delta : \|\delta\|_p \le \epsilon\}$ for some perturbation strength $\epsilon$ and $\ell_p$-norm, e.g. $\ell_0$ [papernot2016limitations], $\ell_2$ [carlini2017towards], or $\ell_\infty$ [goodfellow2014explaining; madry2018towards]. In this work, we focus on attacks bounded by the $\ell_\infty$ norm, which has become increasingly common in recent attack and defense research. Madry et al. [madry2018towards] show that the attacker's objective can be effectively optimized in a multi-step projected gradient descent (PGD) manner, where in each gradient-update step the current adversarial example is projected back into the constraint set to ensure it complies with the norm constraint. The attack can be further strengthened by using a random starting point [madry2018towards] or by incorporating the gradient's momentum during the optimization [dong2017discovering; zheng2019distributionally].
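
To make the PGD formulation concrete, here is a minimal $\ell_\infty$ PGD sketch in PyTorch; the specific hyperparameter values (eps, step size, step count) and the random-start choice are illustrative assumptions, not the exact settings used in this paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, step_size=0.007, steps=10, random_start=True):
    """Multi-step l_inf PGD: maximize the cross-entropy loss within an eps-ball around x."""
    x_adv = x.clone().detach()
    if random_start:
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()   # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()
```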

Various empirical methods have been investigated for improving model robustness. Among these methods, adversarial training [madry2018towards] has gained prominence for its reliability and effectiveness. Adversarial training generates adversarial examples while concurrently training the CNN model(s) to minimize the loss on these adversarial examples. The objective of adversarial training is formulated as a min-max optimization: $\min_\theta \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \max_{\|\delta\|_\infty \le \epsilon} L_{f_\theta}(x + \delta, y) \big]$, where the inner maximization is often conducted with PGD attacks for greater robustness [madry2018towards]. Although recent research shows that PGD adversarial training encourages a model to capture robust features within datasets [ilyas2019adversarial], the process is difficult and costly. Learning robust features significantly degrades the model's accuracy on clean data [tsipras2018robustness], and the model architecture needs to be much larger in order to compensate for the added complexity of the objective [madry2018towards].
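
Below is a minimal sketch of the resulting min-max training loop, reusing the `pgd_attack` helper from the previous sketch; the data loader, optimizer, and eps value are placeholders.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.03):
    """One epoch of PGD adversarial training: the inner maximization finds a worst-case
    perturbation, the outer minimization fits the model to the perturbed examples."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps)   # inner max (approximated via PGD)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)    # outer min on the adversarial batch
        loss.backward()
        optimizer.step()
```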

Ensemble of diverse sub-models for robustness.

Besides training a single robust model, a recent line of work investigates improving the robustness of an ensemble of small sub-models, especially against transfer adversarial attacks. Such robust ensembles can be obtained not only by combining individually robust sub-models, but also by eliminating the adversarial vulnerabilities shared by different sub-models, robust or not, so that attacks cannot transfer between the sub-models within the ensemble. Several works attempt to promote diversity in internal representations or outputs across sub-models as a mechanism to limit adversarial transferability and improve ensemble robustness. Pang et al. [pang2019improving] propose the ADP regularizer, which forces different sub-models to have high diversity in their non-maximal predictions. Kariyappa et al. [kariyappa2019improving] reduce the overlap between the “Adversarial Subspaces” [tramer2017space] of different sub-models by maximizing the cosine distance between each sub-model’s gradient w.r.t. the input. Although the ideas behind these methods are intuitive for improving sub-model diversity, in practice these diversity metrics do not align well with diversifying the adversarial vulnerability shared by different sub-models. Thus, training an ensemble with these diversity metrics does not lead to satisfactory robustness against transferability between sub-models, and consequently the resulting ensemble is still highly non-robust [tramer2020adaptive]. An ensemble diversity metric that effectively leads to low attack transferability and high overall robustness is still lacking.

3 Method

3.1 Vulnerability diversity metric

A recent study [ilyas2019adversarial] reveals that the non-robust features captured by deep learning models are highly sensitive to additive noise, which is the main cause of adversarial vulnerability in CNN models. Based on this observation, we propose to isolate the vulnerability of CNN models through their distilled non-robust features. Let us take a CNN model $f$ trained on dataset $\mathcal{D}$ as an example. We consider a target input-label pair $(x, y)$ and another randomly chosen, independent source pair $(x_s, y_s)$. Corresponding to the source image $x_s$, the feature of the input image $x$ distilled by the $l$-th layer of $f$, denoted $x'_{f^{(l)}}(x, x_s)$, can be approximated with the feature distillation objective [ilyas2019adversarial]:

$$x'_{f^{(l)}}(x, x_s) = \arg\min_{z} \big\| f^{(l)}(z) - f^{(l)}(x) \big\|_2^2 \quad \text{s.t.} \quad \|z - x_s\|_\infty \le \epsilon, \tag{1}$$

where $f^{(l)}(\cdot)$ denotes the output before the activation (e.g. ReLU) of the $l$-th hidden layer. This constrained optimization objective can be optimized with PGD [madry2018towards]. The distilled feature $x'_{f^{(l)}}(x, x_s)$ is expected to be visually similar to $x_s$ rather than $x$, but classified as the target class $y$, since the same feature will be extracted from $x'_{f^{(l)}}(x, x_s)$ and $x$ by $f$. Such misalignment between visual similarity and the classification result shows that $x'_{f^{(l)}}(x, x_s)$ reflects the adversarial vulnerability of $f$ when classifying $x$. Therefore, we define the vulnerability diversity between two models $f_a$ and $f_b$ as:

$$\mathrm{div}(f_a, f_b) = \mathbb{E}_{(x, y),\, (x_s, y_s),\, l}\Big[ L_{f_b}\big(x'_{f_a^{(l)}}(x, x_s),\, y\big) + L_{f_a}\big(x'_{f_b^{(l)}}(x, x_s),\, y\big) \Big]. \tag{2}$$

Here $L_f(x, y)$ denotes the cross-entropy loss of model $f$ for an input-label pair $(x, y)$. The expectation is taken over the independent, uniformly random choices of $(x, y)$, $(x_s, y_s)$, and the layer $l$ of models $f_a$ and $f_b$. Since the distilled feature has the same dimension as the input images, this formulation can be evaluated on models with arbitrary architectures trained on the same dataset. As $x'_{f_a^{(l)}}(x, x_s)$ is visually uncorrelated with $y$, the cross-entropy loss $L_{f_b}\big(x'_{f_a^{(l)}}(x, x_s), y\big)$ is small only if $f_b$'s vulnerability on $x$'s non-robust features overlaps with that of $f_a$, and vice versa. So the formulation in Equation (2) effectively measures the vulnerability overlap between the two models. Note that the feature distillation process in Equation (1) can be considered a special case of generating an adversarial example from source image $x_s$ with target label $y$. The diversity defined in Equation (2) therefore corresponds to the attack success rate when transferring adversarial examples between the two models, in the same way that the training cross-entropy loss corresponds to training accuracy.
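
The following PyTorch sketch illustrates the feature distillation of Equation (1), solved approximately with signed-gradient PGD, and a one-batch estimate of the diversity in Equation (2). The way intermediate layer outputs are captured (a forward hook on a user-chosen submodule) and all hyperparameter values are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def layer_output(model, x, layer):
    """Capture the output of a chosen submodule `layer` of `model` via a forward hook."""
    feats = {}
    handle = layer.register_forward_hook(lambda mod, inp, out: feats.update(z=out))
    model(x)
    handle.remove()
    return feats["z"]

def distill_feature(model, layer, x, x_s, eps=0.07, step_size=0.007, steps=10):
    """Eq. (1): find z in the l_inf eps-ball around x_s whose layer features match those of x."""
    target = layer_output(model, x, layer).detach()
    z = (x_s.clone().detach() + torch.empty_like(x_s).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        z.requires_grad_(True)
        loss = F.mse_loss(layer_output(model, z, layer), target)
        grad = torch.autograd.grad(loss, z)[0]
        with torch.no_grad():
            z = z - step_size * grad.sign()        # descend the feature-matching loss
            z = x_s + (z - x_s).clamp(-eps, eps)   # project back around the source image
            z = z.clamp(0, 1)
    return z.detach()

def vulnerability_diversity(f_a, layer_a, f_b, layer_b, x, y, x_s):
    """One-batch estimate of Eq. (2): each model's cross-entropy on the other's distilled images."""
    z_a = distill_feature(f_a, layer_a, x, x_s)
    z_b = distill_feature(f_b, layer_b, x, x_s)
    return (F.cross_entropy(f_b(z_a), y) + F.cross_entropy(f_a(z_b), y)).item()
```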

3.2 Vulnerability diversification objective

As adversarial attacks are less likely to transfer between models with high vulnerability diversity, we propose to apply the metric defined in Equation (2) as an objective during ensemble training to induce diverse sub-models and block transfer attacks. Equation (3) shows a straightforward way to incorporate the diversity metric into the training objective, where for each sub-model $f_i$, the diversity between itself and all other sub-models in the ensemble is maximized while minimizing the original cross-entropy loss:

$$\min_{f_i} \; \mathbb{E}_{(x, y)}\big[ L_{f_i}(x, y) \big] - \sum_{j \ne i} \mathrm{div}(f_i, f_j). \tag{3}$$

As the formulation of $\mathrm{div}(f_i, f_j)$ has no upper bound, directly maximizing it may ultimately lead to divergence. Thus, we revise the training objective as

$$\min_{f_i} \; \mathbb{E}_{(x, y),\, (x_s, y_s),\, l}\Big[ L_{f_i}(x, y) + \sum_{j \ne i} L_{f_i}\big(x'_{f_j^{(l)}}(x, x_s),\, y_s\big) \Big], \tag{4}$$

which, unlike Equation (3), is bounded below. The new objective not only encourages an increase in vulnerability diversity but also facilitates the correct classification of the distilled image $x'_{f_j^{(l)}}(x, x_s)$ as its source label $y_s$. As such, the objective is well-posed and can be effectively optimized.

Furthermore, it should be noted that minimizing $L_{f_i}\big(x'_{f_j^{(l)}}(x, x_s), y_s\big)$ effectively contributes to the minimization of $L_{f_i}(x_s, y_s)$, as the distilled image is close to the clean image $x_s$. Previous adversarial training research [madry2018towards; Xie2020Intriguing] also shows that it is not necessary to include the clean data loss in the objective. So we further simplify Equation (4) to

$$\min_{f_i} \; \mathbb{E}_{(x, y),\, (x_s, y_s),\, l}\Big[ \sum_{j \ne i} L_{f_i}\big(x'_{f_j^{(l)}}(x, x_s),\, y_s\big) \Big], \tag{5}$$

which is adopted for training the individual sub-models in DVERGE. The objective in Equation (5) can be understood as training sub-model $f_i$ with the adversarial examples generated for the other sub-models. However, DVERGE is fundamentally different from adversarial training. The adversarial training process constantly trains a model on white-box attacks against itself and forces the model to capture the robust features of the dataset. In DVERGE, Equation (5) can be minimized if $f_i$ utilizes a different set of features from the other sub-models, including non-robust features. As non-robust features are more prevalent in the dataset than robust features [ilyas2019adversarial], capturing and integrating some non-robust features allows DVERGE to reach higher clean accuracy than adversarial training. Our training process should also be distinguished from that of Tramer et al. [tramer2017ensemble], which trains a single model with adversarial examples transferred from an ensemble of static pretrained sub-models to improve robustness. In DVERGE, all the sub-models in the ensemble are optimized with Equation (5) in a round-robin fashion. The procedure dynamically maximizes the diversity of every pair of sub-models, rather than forcing only a single model away from static pretrained sub-models. The entire training process of DVERGE is elaborated in Section 3.3.

3.3 DVERGE training routine

1: # initialization and pretraining
2: for i = 1, ..., N do
3:     Randomly initialize sub-model f_i
4:     Pretrain f_i with the clean dataset D
5: # round-robin feature diversification
6: for t = 1, ..., T do
7:     Uniformly randomly choose layer l for feature distillation
8:     for each batch in the training set do
9:         (X, Y) <- get batched input-label pairs
10:        (X_s, Y_s) <- uniformly sample batched source input-label pairs
11:        # get distilled batch for each model
12:        for i = 1, ..., N do
13:            X'_i <- non-robust feature distillation on layer l of f_i with Equation (1)
14:        # calculate loss and perform SGD update for all sub-models
15:        for i = 1, ..., N do
16:            L_i <- sum over j != i of L_{f_i}(X'_j, Y_s)
17:            Update f_i with one SGD step on L_i
Algorithm 1: DVERGE training routine for an N-sub-model ensemble.

Algorithm 1 shows the pseudo-code for training an ensemble of $N$ sub-models. We first randomly initialize and pretrain all the sub-models on the clean dataset so that their feature spaces are useful and we do not waste time diversifying irrelevant features. Then, for each batch of training data during the diversification phase, we randomly sample another batch of source data and use it to distill the non-robust features following the objective of Equation (1). A PGD optimization scheme is applied in the feature distillation process. Round-robin training is then employed, wherein a single stochastic gradient descent step is performed on each sub-model with the distilled images from all other sub-models and their source labels, as stated in the objective of Equation (5). This training process is performed on all batches of training data and repeated for $T$ epochs. The layer used for feature distillation is randomly chosen in each epoch to avoid overfitting to the features of a particular layer. This training routine effectively increases the vulnerability diversity between each pair of sub-models within the ensemble and blocks within-ensemble transfer attacks. Consequently, the overall black-box robustness of the ensemble improves.
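
A compact PyTorch sketch of the round-robin routine in Algorithm 1 is shown below; it reuses the `distill_feature` helper from the Section 3.1 sketch. Pairing each batch with a shuffled copy of itself as the source batch, and the per-sub-model lists of distillable layers, are simplifying assumptions rather than the paper's exact implementation.

```python
import random
import torch
import torch.nn.functional as F

def dverge_epoch(sub_models, layers, loader, optimizers, eps=0.07):
    """One epoch of DVERGE: `sub_models` is a list of N classifiers, `layers[i]` a list of
    distillable modules of sub-model i (same length for every i), `optimizers[i]` its optimizer."""
    l = random.randrange(len(layers[0]))   # one distillation layer index per epoch
    for x, y in loader:
        perm = torch.randperm(x.size(0))   # source batch: a shuffled copy of the target batch
        x_s, y_s = x[perm], y[perm]
        # distill one batch of non-robust features per sub-model (Equation (1))
        distilled = [distill_feature(f, layers[i][l], x, x_s, eps=eps)
                     for i, f in enumerate(sub_models)]
        # round-robin update: each sub-model fits the others' distilled images (Equation (5))
        for i, f in enumerate(sub_models):
            loss = sum(F.cross_entropy(f(distilled[j]), y_s)
                       for j in range(len(sub_models)) if j != i)
            optimizers[i].zero_grad()
            loss.backward()
            optimizers[i].step()
```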

DVERGE incurs a training complexity similar to that of adversarial training. Both methods require extra back-propagations, either to distill non-robust features or to find adversarial examples. However, DVERGE uses only intermediate features rather than final outputs for distillation, so it is marginally faster than adversarial training. A detailed comparison of the training complexity of DVERGE and previous methods can be found in Appendix A.

4 Experimental results

4.1 Setup

We compare DVERGE with various counterparts, including a Baseline that trains an ensemble in the standard way and two previous robust ensemble training methods: ADP [pang2019improving] and GAL [kariyappa2019improving]. For a fair comparison, we use ResNet-20 [he2016deep] as sub-models and average the output probabilities after the softmax layer of each sub-model to yield the final predictions of the ensembles. All evaluations are performed on the CIFAR-10 dataset [krizhevsky2009learning]. Training configuration details can be found in Appendix A. For DVERGE, we use PGD with momentum [dong2018boosting] to perform the feature distillation in Equation (1), with 10 steps of gradient descent. The $\epsilon$ used for each ensemble size to achieve the results in this section was chosen empirically for the highest diversity and lowest transferability: 0.07, 0.05, and 0.05 for ensembles with 3, 5, and 8 sub-models, respectively. An analysis of the effect of $\epsilon$ is given in Appendix B.
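
For reference, the momentum variant of PGD mentioned above accumulates a running, normalized gradient direction before taking the sign step; the sketch below shows one such update (the decay factor mu and the l1 normalization follow the usual MI-FGSM-style formulation and are assumptions, not this paper's exact configuration).

```python
import torch

def momentum_pgd_step(x_adv, grad, velocity, step_size, mu=1.0):
    """One momentum-PGD update: accumulate an l1-normalized gradient into `velocity` with
    decay `mu`, then step along the sign of the accumulated direction (this is the ascent
    form; for a loss being minimized, e.g. the distillation objective, subtract the step)."""
    grad_norm = grad.abs().sum(dim=(1, 2, 3), keepdim=True).clamp(min=1e-12)
    velocity = mu * velocity + grad / grad_norm
    x_adv = x_adv + step_size * velocity.sign()
    return x_adv, velocity
```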

4.2 Diversity and transferability within the ensemble

The objective of DVERGE is to guide sub-models to capture diverse non-robust features and minimize the vulnerability overlap between sub-models, thereby reducing the attack transferability within the ensemble. To validate our method, we measure diversity and transferability on 1,000 randomly picked test samples on which all sub-models initially give correct predictions. We compute the pair-wise diversity as the expected cross-entropy loss formulated in Equation (2), which is further averaged across all pairs of sub-models to obtain the diversity measurement of the whole ensemble. To measure transferability, we generate untargeted adversarial examples using 50-step PGD with five random starts. Transferability is measured by the attack success rate, which counts any misclassification as a success. Similar to diversity, the averaged pair-wise attack success rate is used to indicate the level of transferability within the ensemble.
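
The pair-wise transferability measurement can be sketched as follows, reusing the `pgd_attack` helper from the Section 2 sketch; the single random start and the pre-filtering of correctly classified samples are simplifications of the evaluation described above.

```python
import torch

@torch.no_grad()
def predictions(model, x):
    return model(x).argmax(dim=1)

def pairwise_transferability(sub_models, x, y, eps=0.03):
    """Success rate of untargeted PGD examples crafted on sub-model i and tested on sub-model j.
    Assumes (x, y) were pre-filtered so that every sub-model classifies them correctly."""
    n = len(sub_models)
    rates = torch.zeros(n, n)
    for i, src in enumerate(sub_models):
        x_adv = pgd_attack(src, x, y, eps=eps, steps=50)               # craft on the source sub-model
        for j, tgt in enumerate(sub_models):
            rates[i, j] = (predictions(tgt, x_adv) != y).float().mean()  # any misclassification counts
    return rates   # diagonal: white-box success rate; off-diagonal: transfer success rate
```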

Figure 2: The trend of diversity and transferability during the training of DVERGE. The results are rolling averaged with a window size of 30.

First, let us look at how the diversity and transferability within the ensemble change during the training process of DVERGE. Figure 2 shows the result for an ensemble of three sub-models. The diversity is evaluated using the same $\epsilon$ of 0.07 as during training, and the transferability is measured using the standard $\epsilon$ of 0.03 for adversarial attacks on CIFAR-10 [madry2018towards]. The figure clearly shows that the diversity increases while the transferability decreases as training proceeds. This trend empirically validates that minimizing the DVERGE objective effectively leads to higher diversity and lower adversarial transferability within an ensemble.

Figure 3: Pair-wise transferability (in the form of attack success rate) among sub-models for different ensemble methods.

Figure 3 presents the pair-wise transferability of an ensemble with three sub-models, tested under the same $\epsilon$ as above. Results for ensembles composed of more sub-models and for other testing $\epsilon$ are reported in Appendix B and Appendix C.2, respectively. The number at the intersection of the $i$-th row and $j$-th column is the success rate of adversarial examples generated from the $i$-th sub-model and tested on the $j$-th sub-model. When $i = j$, the number is the white-box attack success rate. Larger off-diagonal numbers indicate greater transferability across sub-models. Compared with the other ensemble methods, DVERGE suppresses transferability to a much lower level: among all adversarial examples that successfully break one sub-model, only 3-6% lead to misclassification on the other sub-models. Although ADP and GAL also strive to improve diversity for better robustness, they cannot effectively block adversarial transfer. ADP exhibits transfer of 60% to 70% of attacks between sub-models. For GAL, two of the three sub-models remain extremely vulnerable to each other, with more than 80% of adversarial examples successfully transferring between the first and second sub-models. Our evaluation demonstrates that stopping attack transfer is not trivial and that applying an appropriate diversification metric is crucial. We therefore advocate DVERGE as a more effective means of mitigating attack transferability within an ensemble.

4.3 Robustness of the ensemble

Figure 4: Robustness results for different ensemble methods. The number after the slash stands for the number of sub-models.

We evaluate the robustness of ensembles under two threat models: a black-box transfer adversary, where the attacker cannot access the model parameters and relies on surrogate models to generate transferable adversarial examples, and a white-box adversary, where the attacker has full access to the model. Under the black-box scenario, we use hold-out baseline ensembles with 3, 5, and 8 ResNet-20 sub-models as the surrogate models. A more challenging setting considers an attacker fully aware of the defense, such that the surrogate ensemble is trained with the exact same technique; results under this setting are given in Appendix C.3. We use three attack methodologies: (1) PGD with momentum [dong2018boosting] with three random starts; (2) M-DI2-FGSM [xie2019improving], which randomly resizes and pads the image in each step of attack generation; and (3) SGM [wu2020skip], which adds weight to the gradient flowing through the skip connections of ResNets. The latter two are stronger black-box transfer attacks that better expose the attack transferability between models. For more details, we refer the reader to the attacks' respective papers. We run each attack for 100 iterations. Besides the cross-entropy loss, we also generate adversarial examples with the CW loss [carlini2017towards], since it also helps with transfer. As a result, each sample has in total 3 (surrogate models) × 5 (PGD with 3 random starts plus 2 other attacks) × 2 (loss functions) = 30 adversarial counterparts. The black-box accuracy is reported in an all-or-nothing fashion: we count the model as accurate on a sample only if all of its 30 adversarial versions are correctly classified. We adopt such a powerful adversary and a strict criterion to give a tighter upper bound on the robustness against black-box transfer attacks. Under the white-box scenario, we use 50-step PGD with five random starts to attack the ensembles. Our experiments show that we have applied sufficient steps for the attacks to converge (Appendix D).
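
A sketch of the all-or-nothing criterion is given below; `adv_versions` is assumed to be a list of precomputed adversarial batches (e.g. the 30 per-sample counterparts described above), one tensor per surrogate/attack/loss combination.

```python
import torch

@torch.no_grad()
def all_or_nothing_accuracy(ensemble, adv_versions, y):
    """A sample counts as correct only if every one of its adversarial versions is classified correctly."""
    correct = torch.ones_like(y, dtype=torch.bool)
    for x_adv in adv_versions:
        correct &= ensemble(x_adv).argmax(dim=1).eq(y)
    return correct.float().mean().item()
```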

Evaluated on 1,000 randomly selected test samples, Figure 4 shows the black-box and white-box robustness of ensembles with various numbers of sub-models across a wide range of attack budgets $\epsilon$. We refer the reader to Appendix C.3 for numerical results. DVERGE, even with the fewest sub-models, outperforms every configuration of the other methods with higher accuracy in both the black-box and white-box settings, and it achieves comparable clean accuracy. In addition, with our method a robustness improvement can be easily obtained by adding more sub-models to the ensemble, whereas such a trend is much less pronounced for the other methods. GAL, the second-best-performing approach among the four methods, actually shares the same high-level concept as the proposed DVERGE algorithm: both aim at diversifying the vulnerabilities shared by the sub-models. The difference lies in the fact that GAL uses adversarial gradient directions to characterize the vulnerability of CNN models, whereas DVERGE identifies the vulnerability in a model by distilling the learned non-robust features. Results from both Figure 3 and Figure 4 suggest that our approach is a more effective realization of the intuition of identifying and diversifying adversarial vulnerability.

4.4 DVERGE with adversarial training

Figure 5: Results for DVERGE combined with adversarial training.

Although DVERGE achieves the highest robustness among ensemble methods, its robustness against white-box attacks and against transfer attacks with a large perturbation strength is still quite low. This result is expected because the objective of DVERGE is to diversify the adversarial vulnerability rather than completely eliminate it. In other words, vulnerability inevitably exists within the sub-models and can be captured by attacks with a larger $\epsilon$. One straightforward way to improve the robustness of the ensemble is to augment DVERGE with adversarial training [madry2018towards]. We describe implementation details of adversarial training in Appendix A and the amended objective in Appendix C.4.

Figure 5 presents the black-box and white-box accuracy for adversarial training (AdvT), DVERGE alone (DVERGE), and the combination of the two (DVERGE+AdvT), using the same evaluation setting as in Section 4.3. Ensembles with 5 sub-models are used here; more results with different ensemble sizes can be found in Appendix C.4. The DVERGE+AdvT objective favors the capture of more robust features by the ensemble. Compared to AdvT, DVERGE+AdvT encourages the ensemble to learn diverse non-robust features alongside robust features, leading to higher clean accuracy and higher robustness against transfer attacks. Meanwhile, no matter which objective is applied, the overall learning capacity of the ensemble remains the same: learning more robust features leaves less capacity in the ensemble for capturing diverse non-robust features, and vice versa. Forcing the inclusion of robust features causes DVERGE+AdvT to sacrifice accuracy on clean examples compared to DVERGE alone, while learning diverse non-robust features harms DVERGE+AdvT's robustness against white-box attacks with larger perturbations compared to AdvT alone. These results can be seen as evidence for the recent findings on the tradeoff between clean accuracy and robustness [ilyas2019adversarial; tsipras2018robustness]. DVERGE+AdvT can effectively explore this tradeoff by changing the ratio between the two objectives, which is further illustrated in Appendix C.4.

5 Conclusions

In this work we propose DVERGE, a CNN ensemble training method that isolates and diversifies the adversarial vulnerability in each sub-model to improve the overall robustness against transfer attacks without significantly reducing clean accuracy. We show that the adversarial vulnerability of a CNN model can be successfully characterized by distilled non-robust features, from which we can measure the vulnerability diversity between two models. The diversity metric is further developed into the vulnerability diversification objective used for DVERGE ensemble training. We empirically show that training with the DVERGE objective effectively increases the vulnerability diversity between sub-models, thereby blocking attack transferability within the ensemble. In this way DVERGE reduces the success rate of transfer attacks between sub-models from more than 60% for previous ensemble training methods to less than 6%, which enables ensembles trained with DVERGE to achieve significantly higher robustness against both black-box transfer attacks and white-box attacks than previous ensemble training methods. The robustness can be further improved by adding more sub-models to the ensemble. We further demonstrate that DVERGE can be augmented with an adversarial training objective, which enables the ensemble to achieve higher clean accuracy and higher transfer-attack robustness compared to adversarial training alone. In conclusion, the vulnerability diversity induced by the DVERGE training objective effectively enhances the robustness of CNN ensembles while maintaining desirable clean accuracy.

Broader Impact

DVERGE hypothetically addresses some black-box adversarial vulnerabilities pervasive across machine learning applications while increasing the compute required to train models. As such, the methods presented herein suggest potential impacts on the reliability, security, and carbon footprint of deep-neural-network-based systems. The reliability and robustness of machine learning systems are a concern not just for practitioners but also for policy makers [hamonrobustness].

A net increase in carbon production would be considered a negative impact by many researchers in climate-related fields. This problem is common to many techniques that modify model training to achieve robustness, including DVERGE. While yet to be examined in the case of DVERGE, the possibility of mitigating or reducing excessive training burdens through informed hyperparameter selection exists. Sometimes, even though modified training increases the required computation per parameter update, the modified method may nevertheless require fewer steps or epochs to achieve desirable results. Recent work provides actionable recommendations, such as performing a cost-benefit analysis, to determine whether efficient downstream adoption is desirable [strubell2018linguistically].

In both industrial and military applications, practical solutions to vulnerabilities, such as relying on human-AI teaming danks2020, are effective but do not address the underlying source of vulnerability and may limit the adoption of machine learning elsewhere. Addressing vulnerabilities at the training stage, then, is a desirable capability for positive-impact applications. By orthogonally improving only black-box robustness, though, we leave machine learning systems vulnerable to other types of attacks. Previous work has shown that white-box knowledge can still be leaked in black-box scenarios oh2019towards; tramer2016stealing. As such, DVERGE is reliant on adversarial training to defend against white-box attacks and on traditional computer security to maintain system integrity. The ultimate interpretation of impact due to improved model reliability and security is not clear-cut, however, as it is highly dependent on the application space. This uncertainty is symptomatic of the fact that machine learning is often fundamental by nature and that there is no machine learning technique for improving robustness that can be applied only to positive-impact applications, whatever one’s subjective interpretation of “positive” may be.

Acknowledgments and Disclosure of Funding

This work is supported by the DARPA HR00111990079 (QED for RML) program.

References

Appendix A Training and implementation details

We train the baseline ensembles for 200 epochs using SGD with momentum 0.9 and weight decay 0.0001. The initial learning rate is 0.1, and we decay it by a factor of 10 at the 100th and 150th epochs. Any models pre-trained on the clean dataset can serve as the starting point for DVERGE training; in our implementation, DVERGE starts from the trained baseline ensembles. We follow the aforementioned learning rate schedule, though a carefully tuned schedule would likely bring extra performance gains. We reproduce ADP [pang2019improving] and GAL [kariyappa2019improving] according to either the released code or the paper with the recommended hyperparameters and setups. Specifically, both use the Adam optimizer [kingma2014adam] with an initial learning rate of 0.001. Also note that GAL requires the ReLU function to be replaced with leaky ReLU to avoid vanishing gradients. The other configurations stay the same as for the baseline ensembles.
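
A minimal sketch of the baseline optimizer and learning-rate schedule described above (SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.1, decayed by a factor of 10 at epochs 100 and 150); the function name is a placeholder.

```python
import torch

def baseline_optimizer_and_scheduler(model):
    """Baseline training setup: 200 epochs of SGD, lr 0.1 decayed by 10x at epochs 100 and 150."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
    return optimizer, scheduler
```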

Ensembles with adversarial training follow the baseline's training setup. We use 10-step PGD [madry2018towards]. More specifically, adversarial examples w.r.t. the whole ensemble are generated at each training step and used to update the model parameters. When combining DVERGE with adversarial training, however, adversarial examples are generated on each sub-model instead of the whole ensemble. We empirically find that these choices help each case achieve its best robustness.

We use 0.5 as the input transformation probability for M-DI2-FGSM [xie2019improving] and 0.2 as the gradient decay factor for SGM [wu2020skip] when generating these two attacks, as recommended by their respective papers.

All models are implemented and trained with PyTorch [paszke2017automatic] on a single NVIDIA TITAN XP GPU. Evaluation is performed with AdverTorch [ding2019advertorch]. As shown in Table 1, the training of DVERGE is marginally faster than that of adversarial training (AdvT). While both need extra back-propagations to either distill non-robust features or find adversarial examples, DVERGE uses only intermediate features for distillation, whereas adversarial training requires information back-propagated from the final output. As for previous methods, although ADP requires the smallest time budget, it does not improve robustness much, as shown in Figure 4, and the significantly improved robustness is arguably worth the extra training cost of DVERGE over GAL. In addition, according to Figure 2, DVERGE reduces the transferability within the ensemble at a very early stage of training, so later training epochs could potentially be simplified to a fine-tuning process without the diversity loss, which would require much less training time. Further mitigating the computational overhead of DVERGE is one of our future goals.

Method Baseline ADP [pang2019improving] GAL [kariyappa2019improving] AdvT [madry2018towards] DVERGE
Training time (h) 1.0 2.0 7.5 11.5 10.5
Table 1: Training time comparison on a single TITAN XP GPU. All times are evaluated for training a ResNet-20 ensemble with 3 sub-models for 200 epochs.

Appendix B Analysis on the training of DVERGE

$\epsilon$ #steps dist. diversity loss
0.03 10 1.056 1.726
0.05 10 0.793 3.302
0.07 10 0.703 4.738
Table 2: The effect of $\epsilon$ on optimizing the feature distillation objective in Equation (1) and the resulting diversity loss measured with Equation (5). $f_a$ and $f_b$ are two ResNet-20 models trained in a standard way on CIFAR-10; "dist." is short for the feature-space distance $\|f_a^{(l)}(z) - f_a^{(l)}(x)\|_2$ achieved by the distillation.

One important hyperparameter of DVERGE is the $\epsilon$ used for feature distillation. This section provides an initial exploration of the effect of using different $\epsilon$. We start by looking at how $\epsilon$ affects the optimization of the feature distillation objective in Equation (1) and the resulting diversity loss in Equation (5). The results, evaluated with 1,000 CIFAR-10 testing images on two pre-trained ResNet-20 models, are shown in Table 2. We find that a larger $\epsilon$ enables more accurate feature distillation, as a smaller distance between the internal representations of the distilled image and the target image is achieved. As a result, the distilled image from model $f_a$ leads to a higher diversity loss on another model $f_b$, which intuitively encourages the DVERGE training routine to enforce greater diversity and therefore lower transferability between the two models. We empirically confirm this intuition in Figure 6, where we vary the training $\epsilon$ and measure the transferability between sub-models. For instance, for ensembles with three sub-models, increasing $\epsilon$ from 0.03 to 0.07 decreases the transferability from 8%-10% to 3%-6%. Interestingly, however, for ensembles with five or eight sub-models, although we do observe a drop in transferability between most sub-model pairs when using a larger $\epsilon$, some pairs of sub-models remain highly vulnerable to one another. In particular, when training an ensemble of 5 sub-models with an $\epsilon$ of 0.07, 79% of adversarial examples from the second sub-model fool the fourth one, and 48% of examples transfer in the reverse direction. We leave a thorough and rigorous analysis of this phenomenon to future work.

Figure 6: Transferability within DVERGE ensembles trained with different $\epsilon$. The evaluation setting follows that of Figure 3, where the attack perturbation strength is 0.03.

Finally, we look at the clean accuracy and robustness achieved by DVERGE ensembles trained with different $\epsilon$. In Table 3, we observe that training with a larger $\epsilon$ leads to higher black-box transfer robustness but lower clean accuracy. The trend between white-box robustness and $\epsilon$ is not monotonic, though, which we suspect is related to the observation that the attack transferability worsens between some pairs of sub-models under a larger training $\epsilon$, as shown in Figure 6. As robustness against white-box attacks is not the main focus of DVERGE, we will explore the relationship, both qualitatively and quantitatively, between the transferability among sub-models and the achieved white-box robustness of the whole ensemble in future work.

#sub-models \ training $\epsilon$ 0.03 0.05 0.07
3 92.9% /   4.2% / 22.7% 92.7% / 26.6% / 32.3% 91.4% / 53.2% / 40.0%
5 92.3% / 30.5% / 43.1% 91.5% / 57.2% / 48.9% 90.2% / 66.5% / 42.3%
8 91.3% / 42.8% / 51.9% 91.1% / 63.6% / 57.9% 89.2% / 71.3% / 52.4%
Table 3: Robustness of DVERGE ensembles trained with different $\epsilon$. In each table cell, we report (clean accuracy) / (black-box transfer accuracy under perturbation strength 0.03) / (white-box accuracy under perturbation strength 0.01).

Appendix C Additional results

C.1 Decision region visualization

We visualize the decision regions learned by DVERGE ensembles around more testing images from the CIFAR-10 dataset in Figure 7.

Figure 7: More decision region plots of ensembles with 3 ResNet-20 sub-models trained on CIFAR-10. Each pair of rows is generated with one testing image: the first row of the pair is for the baseline ensemble, and the second row is for the DVERGE ensemble. The axes are chosen in the same way as in Figure 1.

C.2 Transferability within the ensemble under different testing $\epsilon$

Figure 8: Transferability results under different testing $\epsilon$.

In addition to Figure 3, we provide further results on the transferability between sub-models under different attack $\epsilon$ in Figure 8. In all cases, DVERGE achieves the lowest level of transferability among all ensemble methods.

clean 0.01 0.02 0.03 0.04 0.05 0.06 0.07
baseline/3 94.1% 10.0% 0.1% 0% 0% 0% 0% 0%
baseline/5 94.4% 10.2% 0% 0% 0% 0% 0% 0%
baseline/8 94.4% 9.3% 0% 0% 0% 0% 0% 0%
ADP/3 [pang2019improving] 93.3% 22.7% 0.7% 0% 0% 0% 0% 0%
ADP/5 [pang2019improving] 93.1% 22.8% 0.7% 0% 0% 0% 0% 0%
ADP/8 [pang2019improving] 93.0% 21.4% 0.5% 0% 0% 0% 0% 0%
GAL/3 [kariyappa2019improving] 88.8% 78.2% 59.9% 39.7% 22.9% 10.4% 5.1% 1.8%
GAL/5 [kariyappa2019improving] 91.0% 78.1% 54.8% 32.4% 14.9% 6.0% 2.4% 0.5%
GAL/8 [kariyappa2019improving] 92.3% 75.1% 45.8% 22.4% 7.5% 2.0% 0.5% 0.1%
DVERGE/3 91.4% 83.9% 70.4% 53.2% 34.5% 19.1% 9.0% 2.7%
DVERGE/5 91.9% 83.4% 72.4% 57.2% 39.4% 23.5% 12.6% 4.6%
DVERGE/8 91.1% 86.6% 75.9% 63.6% 48.4% 33.1% 21.2% 9.5%
AdvT/3 [madry2018towards] 76.7% 75.5% 73.8% 72.0% 70.3% 68.4% 65.1% 62.2%
AdvT/5 [madry2018towards] 76.9% 75.8% 74.7% 72.7% 69.8% 67.8% 65.5% 62.4%
AdvT/8 [madry2018towards] 79.7% 78.2% 76.2% 74.0% 71.1% 69.1% 66.8% 63.4%
DVERGE+AdvT/3 83.0% 80.6% 78.2% 76.2% 73.8% 70.1% 67.4% 63.4%
DVERGE+AdvT/5 84.5% 82.3% 80.4% 77.9% 75.3% 70.9% 66.4% 62.1%
DVERGE+AdvT/8 85.9% 83.1% 80.2% 77.1% 72.7% 68.0% 63.6% 56.5%
Table 4: Accuracy vs. $\epsilon$ against black-box transfer attacks generated from hold-out baseline ensembles. The number after the slash in the first column is the number of sub-models within the ensemble.
clean 0.01 0.02 0.03 0.04 0.05 0.06 0.07
baseline/3 94.1% 1.5% 0% 0% 0% 0% 0% 0%
baseline/5 94.4% 2.1% 0% 0% 0% 0% 0% 0%
baseline/8 94.4% 3.2% 0% 0% 0% 0% 0% 0%
ADP/3 [pang2019improving] 93.3% 9.6% 0.6% 0% 0% 0% 0% 0%
ADP/5 [pang2019improving] 93.1% 11.8% 1.1% 0% 0% 0% 0% 0%
ADP/8 [pang2019improving] 93.0% 12.0% 4.4% 1.7% 0.8% 0.4% 0.2% 0.1%
GAL/3 [kariyappa2019improving] 88.8% 11.4% 0.2% 0% 0% 0% 0% 0%
GAL/5 [kariyappa2019improving] 91.0% 31.7% 8.3% 0.7% 0.2% 0.2% 0% 0%
GAL/8 [kariyappa2019improving] 92.3% 37.0% 8.7% 0.9% 0.1% 0% 0% 0%
DVERGE/3 91.4% 40.0% 12.2% 3.0% 0.3% 0.2% 0% 0%
DVERGE/5 91.9% 48.9% 21.8% 6.2% 1.1% 0.4% 0.1% 0%
DVERGE/8 91.1% 57.9% 27.0% 10.7% 2.6% 0.4% 0.1% 0%
AdvT/3 [madry2018towards] 76.7% 67.8% 58.4% 48.0% 35.6% 26.1% 17.5% 9.9%
AdvT/5 [madry2018towards] 76.9% 68.8% 59.0% 48.1% 37.0% 26.4% 16.6% 9.7%
AdvT/8 [madry2018towards] 79.7% 70.1% 60.5% 48.2% 36.4% 26.7% 17.0% 9.7%
DVERGE+AdvT/3 83.0% 72.9% 59.8% 44.0% 29.3% 18.8% 9.8% 4.7%
DVERGE+AdvT/5 84.5% 74.8% 59.8% 41.8% 27.1% 15.5% 7.3% 2.7%
DVERGE+AdvT/8 85.9% 72.5% 58.3% 39.8% 25.8% 13.9% 5.9% 2.7%
Table 5: Accuracy vs. $\epsilon$ against white-box attacks. The number after the slash in the first column is the number of sub-models within the ensemble.

C.3 Numerical results for robustness

We report the numerical results that correspond to Figure 4 and Figure 5 in Table 4 (black-box transfer accuracy) and Table 5 (white-box accuracy), respectively. Note that ADP shows higher white-box accuracy than black-box transfer accuracy in some cases, e.g., 4.4% > 0.5% for ADP/8 when $\epsilon$ is 0.02, which implies that ADP might suffer from obfuscated gradients [athalye2018obfuscated].

A more challenging scenario for adversarial defenses assumes the attacker is fully aware of the exact defense that the system relies on. In this case, the adversary can train an independent copy of the defended network as the surrogate model. We report the black-box transfer accuracy of each method under this setting in Table 6. The same group of attacks as in Section 4.3 is used, and ensembles with 3, 5, and 8 sub-models form the collection of surrogate models. According to Table 6, DVERGE still shows the strongest robustness against such a powerful black-box transfer adversary among all ensemble methods.

clean 0.01 0.02 0.03 0.04 0.05 0.06 0.07
ADP/3 [pang2019improving] 93.3% 34.7% 6.5% 1.4% 0.2% 0% 0% 0%
ADP/5 [pang2019improving] 93.1% 34.2% 6.6% 1.6% 0.6% 0% 0% 0%
ADP/8 [pang2019improving] 93.0% 32.5% 5.6% 1.4% 0.2% 0% 0% 0%
GAL/3 [kariyappa2019improving] 88.8% 67.8% 36.2% 13.8% 3.7% 0.6% 0.1% 0%
GAL/5 [kariyappa2019improving] 91.0% 67.5% 31.9% 9.2% 1.8% 0.3% 0% 0%
GAL/8 [kariyappa2019improving] 92.3% 64.7% 25.8% 5.4% 0.6% 0.1% 0% 0%
DVERGE/3 91.4% 75.4% 50.2% 23.8% 7.6% 2.1% 0.3% 0.2%
DVERGE/5 91.9% 77.2% 53.1% 26.7% 9.5% 2.6% 0.5% 0.2%
DVERGE/8 91.1% 77.7% 57.3% 32.0% 13.9% 3.8% 0.9% 0.2%
Table 6: Accuracy vs. $\epsilon$ against black-box transfer attacks generated from hold-out ensembles that are trained with the exact defense technique used by each ensemble. The number after the slash in the first column is the number of sub-models within the ensemble.

C.4 Discussion on DVERGE with adversarial training

Formally, the combined training objective of DVERGE and adversarial training is

$$\min_{f_i} \; \mathbb{E}_{(x, y),\, (x_s, y_s),\, l}\Big[ \lambda \sum_{j \ne i} L_{f_i}\big(x'_{f_j^{(l)}}(x, x_s),\, y_s\big) + \max_{\|\delta\|_\infty \le \epsilon} L_{f_i}(x + \delta,\, y) \Big], \tag{6}$$

where $\lambda$ is a hyperparameter that balances the two terms. Figure 5 is obtained with one choice of $\lambda$; results for other ensemble sizes under the same $\lambda$ are shown in Figure 9, and the observations stay the same as in Section 4.4. To better reflect the trade-off between clean accuracy and robustness, we also report results for an ensemble of eight sub-models trained with a $\lambda$ under which the DVERGE loss is weighted less, so the training process favors adversarial training. In turn, the ensemble spends more of its capacity capturing robust features instead of diverse non-robust features. Consequently, compared with the original $\lambda$, we observe a decrease in clean accuracy and an increase in both black-box and white-box robustness when $\epsilon$ is large. In addition, the ensemble size acts as another weighting factor in Equation (6): increasing the number of sub-models naturally enlarges the DVERGE loss so that it outweighs the AdvT loss. As a result, larger (smaller) ensemble sizes for DVERGE+AdvT result in better (worse) clean performance yet worse (better) black-box and white-box robustness under a large $\epsilon$. This assertion can be confirmed by the results in the bottom three rows of Table 4 and Table 5.
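
A sketch of the per-sub-model combined loss is shown below, reusing `pgd_attack` (Section 2 sketch) and the distilled batches produced as in the Section 3.3 sketch; placing the weight lambda on the DVERGE term follows the reconstruction of Equation (6) above and is an assumption rather than a verbatim transcription of the released code.

```python
import torch.nn.functional as F

def dverge_advt_loss(i, sub_models, distilled, x, y, y_s, lam=1.0, eps=0.03):
    """Combined objective for sub-model i: lambda-weighted DVERGE term (Eq. 5) plus an
    adversarial-training term on examples crafted against sub-model i itself."""
    f_i = sub_models[i]
    dverge_term = sum(F.cross_entropy(f_i(distilled[j]), y_s)
                      for j in range(len(sub_models)) if j != i)
    x_adv = pgd_attack(f_i, x, y, eps=eps)        # per-sub-model adversarial examples (Appendix A)
    advt_term = F.cross_entropy(f_i(x_adv), y)
    return lam * dverge_term + advt_term
```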

Figure 9: Results for DVERGE combined with adversarial training.

Appendix D Convergence check

As suggested in [tramer2020adaptive; carlini2019evaluating], we report accuracy versus the number of attack iterations in Table 7. Note that, for efficiency, we use only one random start here for the white-box attacks. Using more steps decreases the accuracy by no more than 0.6% for black-box attacks and no more than 1.2% for white-box attacks. We thus confirm that sufficient steps were applied and that all attacks converged during the evaluation.

black-box ($\epsilon$ = 0.03) white-box ($\epsilon$ = 0.01)
100 500 1000 50 500 1000
DVERGE/3 53.2% 52.6% 53.4% 42.7% 41.6% 41.5%
DVERGE/5 57.2% 56.9% 57.3% 50.4% 49.5% 49.5%
DVERGE/8 63.6% 63.6% 63.2% 58.0% 57.8% 57.8%
DVERGE+AdvT/3 76.2% 76.3% 76.2% 72.9% 72.9% 72.9%
DVERGE+AdvT/5 77.9% 78.0% 77.8% 74.8% 74.8% 74.8%
DVERGE+AdvT/8 77.1% 76.9% 77.4% 72.7% 72.7% 72.7%
Table 7: Accuracy against attacks with varying number of iterations.