Most existing adversarial training approaches focus on learning from data directly. For example, the popular adversarial training (AT) (Madry et al., 2018) leverages multi-step projected gradient descent (PGD) to generate adversarial examples and feeds them into standard training. Zhang et al. (2019) developed TRADES on the basis of AT to balance standard accuracy and robust performance. Recently, several methods under this paradigm have been developed to improve model robustness (Wang et al., 2019; Alayrac et al., 2019; Zhang et al., 2020; Jiang et al., 2020; Ding et al., 2020; Wang et al., 2020b; Du et al., 2021; Tian et al., 2021; Zhang et al., 2021). However, learning directly from adversarial examples is challenging on complex datasets, since the loss with hard labels is difficult to optimize, which limits the achievable robust accuracy.
To mitigate this issue, one emerging direction is to distill robustness indirectly from an adversarially pre-trained model, which has shown promise in recent studies. For example, Ilyas et al. (2019) used an adversarially pre-trained model to build a “robustified” dataset to learn a robust DNN. Chen et al. (2020) and Salman et al. (2020) explored boosting model robustness through fine-tuning or transfer learning from adversarially pre-trained models. Goldblum et al. (2020) and Chen et al. (2021) investigated distilling robustness from adversarially pre-trained models, termed adversarial distillation for simplicity, where they encouraged student models to mimic the outputs (i.e., soft labels) of the adversarially pre-trained teachers.
However, there is one critical difference: in conventional distillation, the teacher model and the student model share the natural training data, whereas in adversarial distillation, the adversarial training data of the student model and that of the teacher model are egocentric (each generated by the respective model) and become more adversarially challenging during training. Given this distinction, are the soft labels acquired from the teacher model in adversarial distillation always reliable and informative guidance? To answer this question, we take a closer look at the process of adversarial distillation. As shown in Figure 1(a), we discover that as training proceeds, the teacher model progressively fails to give correct predictions for the adversarial data queried by the student model. The reason could be that, as the student becomes more adversarially robust and its adversarial data therefore become harder, it is too demanding to require the teacher to always be good at every adversarial example queried by the student model, since the teacher model has never seen these data in its pre-training. In contrast, in conventional distillation, student models are expected to distill “static” knowledge from the teacher model, since the teacher's soft labels for the natural data are always fixed.
The observation in Figure 1(a) raises a challenge: how can we conduct reliable adversarial distillation with unreliable teachers? To solve this problem, we categorize the training data into three cases according to the teacher's predictions on natural and adversarial data. First, if the teacher model correctly classifies both natural and adversarial data, it is reliable. Second, if the teacher model correctly classifies the natural but not the adversarial data, it should be partially trusted, and the student model is encouraged to also trust itself, which enhances model robustness as an adversarial regularization (Zhang et al., 2019). Third, if the teacher model can correctly classify neither the natural nor the adversarial data, the student model is recommended to trust itself totally. According to this intuition, we propose Introspective Adversarial Distillation (IAD) to effectively utilize the knowledge from an adversarially pre-trained teacher model. The framework of our proposed IAD can be seen in Figure 1(b). Briefly, the student model is encouraged to partially instead of fully trust the teacher model, and to gradually trust itself more as it becomes more adversarially robust. We conduct extensive experiments on the benchmark CIFAR-10/CIFAR-100 datasets and the more challenging Tiny-ImageNet dataset to evaluate the effectiveness of our IAD. The main contributions of our work can be summarized as follows.
- We take a closer look at adversarial distillation under the teacher-student paradigm. Considering adversarial robustness, we discover that the guidance from the teacher model becomes progressively unreliable as adversarial training proceeds.
- We construct reliable guidance for adversarial distillation by flexibly utilizing the robust knowledge from the teacher model: (a) if a teacher is good at adversarial data, its soft labels can be fully trusted; (b) if a teacher is good at natural data but not adversarial data, its soft labels should be partially trusted and the student also takes its own soft labels into account; (c) otherwise, the student only relies on its own soft labels.
- We propose Introspective Adversarial Distillation (IAD) to automatically realize the intuition of the above reliable guidance during adversarial distillation. The experimental results confirm that our approach can improve adversarial robustness across a variety of training settings and evaluations, especially on datasets that are challenging in terms of adversarial robustness (e.g., CIFAR-100 (Krizhevsky, 2009) and Tiny-ImageNet (Le and Yang, 2015)) or when using large models (e.g., WideResNet (Zagoruyko and Komodakis, 2016)).
2 Related Work
2.1 Adversarial Training
Adversarial examples (Goodfellow et al., 2015) motivate many defensive approaches developed in the last few years. Among them, adversarial training has been demonstrated as the most effective method to improve the robustness of DNNs (Cai et al., 2018; Wang et al., 2020a, b; Jiang et al., 2020; Bai et al., 2021; Chen et al., 2021; Wu et al., 2020). The formulation of the popular AT (Madry et al., 2018) and its variants can be summarized as the minimization of the following loss:
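$$\min_{f_\theta \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(\tilde{x}_i), y_i\big), \tag{1}$$
where $n$ is the number of training examples, $\tilde{x}_i$ is the adversarial example within the $\epsilon$-ball (bounded by an $L_p$-norm) centered at natural example $x_i$, $y_i$ is the associated label, $f_\theta$ is the DNN with parameter $\theta$ and $\ell(\cdot)$ is the standard classification loss, e.g., the cross-entropy loss. Adversarial training leverages adversarial examples to smooth the small neighborhood, making the model prediction locally invariant. To generate the adversarial examples, AT employs a PGD method (Madry et al., 2018). Concretely, given a sample $\tilde{x}^{(0)}$ and the step size $\eta > 0$, PGD recursively searches
$$\tilde{x}^{(t+1)} = \Pi_{\mathcal{B}_\epsilon[\tilde{x}^{(0)}]}\Big(\tilde{x}^{(t)} + \eta \, \mathrm{sign}\big(\nabla_{\tilde{x}^{(t)}} \ell(f_\theta(\tilde{x}^{(t)}), y)\big)\Big), \tag{2}$$
until a certain stopping criterion is satisfied. In Eq. (2), $t \in \mathbb{N}$, $\ell$ is the loss function, $\tilde{x}^{(t)}$ is the adversarial data at step $t$, $y$ is the corresponding label for natural data, and $\Pi_{\mathcal{B}_\epsilon[\tilde{x}^{(0)}]}(\cdot)$ is the projection function that projects the adversarial data back into the $\epsilon$-ball centered at $\tilde{x}^{(0)}$.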
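As a rough illustration of the PGD search described above, the following Python sketch runs an $L_\infty$-bounded attack against a toy linear softmax classifier, for which the input gradient of the cross-entropy loss has a closed form. The model, the weight matrix `W`, the function names, and the values of `eps`, `step`, and `n_steps` are assumptions chosen for illustration only; they are not the settings used in this paper.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(W, x):
    # Toy model (assumption): logits = W x, with W given as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_entropy(W, x, y):
    return -math.log(softmax(linear_logits(W, x))[y])


def grad_wrt_input(W, x, y):
    # For a linear model, d CE / d x = W^T (softmax(W x) - onehot(y)).
    p = softmax(linear_logits(W, x))
    p[y] -= 1.0
    return [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(W, x, y, eps, step, n_steps):
    # Start from the natural example (no random start in this sketch).
    x_adv = list(x)
    for _ in range(n_steps):
        g = grad_wrt_input(W, x_adv, y)
        # Ascent step along the sign of the input gradient.
        x_adv = [xa + step * sign(gj) for xa, gj in zip(x_adv, g)]
        # Projection back into the L_inf ball of radius eps around x.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Hypothetical two-class example.
W = [[1.0, -1.0], [-1.0, 1.0]]
x, y = [0.5, 0.1], 0
x_adv = pgd_linf(W, x, y, eps=0.1, step=0.025, n_steps=10)
print(cross_entropy(W, x, y), cross_entropy(W, x_adv, y))
```

The projection is a per-coordinate clip because the sketch assumes an $L_\infty$ ball; other norms would need a different projection.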
2.2 Knowledge Distillation
The idea of distillation from other models dates back to Craven and Shavlik (1996), and was re-introduced by Hinton et al. (2015) as knowledge distillation (KD). It has been widely studied in recent years and works well in numerous applications such as model compression and transfer learning. For adversarial defense, a few studies have explored obtaining adversarially robust models by distillation. Papernot et al. (2016) proposed defensive distillation, which utilizes the soft labels produced by a standard pre-trained teacher model, although this method was later shown not to be resistant to the C&W attacks (Carlini and Wagner, 2017); Goldblum et al. (2020) combined AT with KD to transfer robustness to student models, and they found that the distilled models can outperform adversarially pre-trained teacher models of identical architecture in terms of adversarial robustness; Chen et al. (2021) utilized distillation as a regularization for adversarial training, employing both robust and standard pre-trained teacher models to address robust overfitting (Rice et al., 2020).
Nonetheless, all these related methods fully trust teacher models and do not consider whether the guidance of the teacher model in distillation is reliable. In this paper, different from the previous studies, we find that the teacher model in adversarial distillation is not always trustworthy. Based on that, we propose IAD to encourage student models to partially instead of fully trust teacher models, which effectively utilizes the knowledge from the adversarially pre-trained models.
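To make the notion of soft labels concrete, the following Python sketch computes temperature-scaled softmax outputs and a KL-divergence distillation term between a teacher and a student. The function names, the example logits, the temperature value, and the direction of the KL term (teacher distribution first) are assumptions for illustration rather than the exact formulation of any method cited above.

```python
import math


def tempered_softmax(logits, tau):
    # Softmax of logits / tau; a larger tau gives a flatter distribution.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, tiny=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q))


def kd_loss(teacher_logits, student_logits, tau):
    # Assumed direction: the teacher's soft labels are the target distribution.
    soft_teacher = tempered_softmax(teacher_logits, tau)
    soft_student = tempered_softmax(student_logits, tau)
    return kl_divergence(soft_teacher, soft_student)


# Hypothetical three-class logits.
teacher_logits = [4.0, 1.0, 0.0]
student_logits = [1.0, 2.0, 0.5]
sharp = tempered_softmax(teacher_logits, 1.0)
soft = tempered_softmax(teacher_logits, 4.0)
loss = kd_loss(teacher_logits, student_logits, tau=4.0)
print(sharp, soft, loss)
```

The loss is zero when the student reproduces the teacher's tempered distribution exactly and positive otherwise.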
3 A Closer Look at Adversarial Distillation
In Section 3.1, we discuss the unreliability issue of adversarial distillation, i.e., the guidance of the teacher model becomes progressively unreliable as adversarial training proceeds. In Section 3.2, we partition the training examples into three parts and analyze them part by part. Specifically, we expect that the student model should partially instead of fully trust the teacher model and gradually trust itself more as adversarial training proceeds.
3.1 Fully Trust: Progressively Unreliable Guidance
As mentioned in the Introduction, previous methods (Goldblum et al., 2020; Chen et al., 2021) fully trust the teacher model when distilling robustness from adversarially pre-trained models. Taking Adversarial Robust Distillation (ARD) (Goldblum et al., 2020) as an example, we illustrate its procedure in the left part of Figure 1(b): the student model generates its adversarial data and then optimizes its predictions on them to mimic the output of the teacher model. However, although the teacher model is well optimized on the adversarial data queried by itself, we argue that it might not always be good at the increasingly challenging adversarial data queried by the student model.
As shown in Figure 1(a), different from the ordinary distillation in which the teacher model has the consistent standard performance on the natural data, its robust accuracy on the student model’s adversarial data is decreasing during distillation. The guidance of the teacher model gradually fails to give the correct output on the adversarial data queried by the student model.
3.2 Partially Trust: Construction of Reliable Guidance
The unreliability of the teacher model in adversarial distillation raises the challenge of how to conduct reliable adversarial distillation with unreliable teachers. Intuitively, this requires us to reconsider the guidance of adversarially pre-trained models along with the adversarial training. For simplicity, we use $T(x)$ ($T(\tilde{x})$) to represent the predicted label of the teacher model on the natural (adversarial) examples, and use $y$ to represent the targeted label. We partition the adversarial samples into three parts as shown in the toy illustration (Figure 2(a)), and analyze them part by part.
1) $T(x) = y \wedge T(\tilde{x}) = y$: It can be seen in Figure 2(a) that this part of the data, together with its adversarial variants, is the most trustworthy among the three parts, since the teacher model performs well on both natural and adversarial data. In this case, we could choose to trust the guidance of the teacher model on this part of the data. However, as shown in Figure 2(b), we find that the number of samples in this part decreases as adversarial training proceeds. That is, what we can rely on from the teacher model in adversarial distillation is progressively reduced.
2) $T(x) = y \wedge T(\tilde{x}) \neq y$: In Figure 2(b), we also track how the number of samples in this part changes. In contrast to the previous category, the number of this kind of data increases during distillation. Since the teacher model’s outputs on the small neighborhood of the queried natural data are not always correct, its knowledge may not be robust and its guidance for the student model is not reliable. Recalling the reason for the decrease in the robust accuracy of the teacher model, the student model itself may also be trustworthy, since it gradually becomes adversarially robust during distillation.
3) $T(x) \neq y \wedge T(\tilde{x}) \neq y$: For this part of the data in Figure 2(a), the guidance of the teacher model is totally unreliable since the predicted labels on the natural data are wrong. Here the student model may also trust itself, encouraging its outputs on the adversarial data to mimic those on the corresponding natural data rather than the wrong outputs from the teacher model. First, this removes the potential threat that the teacher’s guidance may act as a kind of noisy label for training. Second, as an adversarial regularization (Zhang et al., 2019), it can improve model robustness by enhancing the stability of the model’s outputs on the natural and the corresponding adversarial data.
To sum up, we suggest employing reliable guidance from the teacher model and encouraging the student model to trust itself more as the teacher model’s guidance becomes progressively unreliable and the student model gradually becomes more adversarially robust.
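The three-way partition discussed above can be sketched in a few lines of Python. The function name, the group names, and the example predictions below are hypothetical. Following the "otherwise" wording of the contribution list, any example whose natural prediction is wrong is placed in the last group, which is an assumption about how the rare case of a wrong natural but correct adversarial prediction should be handled.

```python
def partition_by_teacher(teacher_nat_pred, teacher_adv_pred, labels):
    # Returns the example indices that fall into each of the three cases.
    groups = {"trust_teacher": [], "partially_trust": [], "trust_self": []}
    triples = zip(teacher_nat_pred, teacher_adv_pred, labels)
    for i, (p_nat, p_adv, y) in enumerate(triples):
        if p_nat != y:
            # Teacher is wrong on the natural example: rely on the student.
            groups["trust_self"].append(i)
        elif p_adv == y:
            # Teacher is right on both natural and adversarial examples.
            groups["trust_teacher"].append(i)
        else:
            # Teacher is right on natural but wrong on adversarial data.
            groups["partially_trust"].append(i)
    return groups


# Hypothetical predictions for four examples.
labels = [0, 1, 2, 1]
teacher_nat_pred = [0, 1, 0, 1]
teacher_adv_pred = [0, 2, 0, 1]
groups = partition_by_teacher(teacher_nat_pred, teacher_adv_pred, labels)
print(groups)
```

Counting the sizes of these groups at each epoch would give curves of the kind described for Figure 2(b).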
4 Introspective Adversarial Distillation
Based on the previous analysis of adversarial distillation, we propose Introspective Adversarial Distillation (IAD) to better utilize the guidance from the adversarially pre-trained model. Concretely, we use the following KD-style loss, composed of teacher guidance and student introspection:
$$\ell_{IAD} = \alpha \cdot \ell_{KL}\big(S(\tilde{x}\,|\,\tau) \,\|\, T(x\,|\,\tau)\big) + (1 - \alpha) \cdot \ell_{KL}\big(S(\tilde{x}\,|\,\tau) \,\|\, S(x\,|\,\tau)\big), \tag{3}$$
where $S(\tilde{x}\,|\,\tau)$ is the tempered variant of the student output $S(\tilde{x})$ with the temperature $\tau$, $T(x\,|\,\tau)$ is the tempered variant of the teacher output $T(x)$, $\tilde{x}$ is the adversarial data generated from the natural data $x$, and $\ell_{KL}$ is the KL-divergence loss. As for the annealing parameter $\alpha \in [0, 1]$ that is used to balance the effect of the teacher model in adversarial distillation, we define it as
$$\alpha = \big(P_T(\tilde{x}\,|\,y)\big)^{\beta}, \tag{4}$$
where $P_T(\tilde{x}\,|\,y)$ is the prediction probability of the teacher model about the targeted label $y$ and $\beta$ is a hyperparameter to sharpen the prediction. The intuition behind IAD is to calibrate the guidance from the teacher model automatically based on its prediction on the adversarial data. Our $\alpha$ naturally corresponds to the construction in Section 3.2, since the prediction probability of the teacher model for the adversarial data can well represent the categorical information.
Intuitively, the student model can trust the teacher model when the annealing parameter $\alpha$ approaches $1$, which means that the teacher model is good at both natural and adversarial data. However, when $\alpha$ approaches $0$, the teacher model is good at natural but not adversarial data, or even not good at either, and thus the student model should take its self-introspection into account. In Figure 3, we check the reliability of the student model itself. According to the left panel of Figure 3, the student model becomes progressively robust to the adversarial data. If we incorporate the student introspection into the adversarial distillation, the results in the middle panel of Figure 3 confirm its potential benefit in improving the accuracy of the guidance. Moreover, as shown in the right panel of Figure 3, adding self-introspection yields a larger improvement in model robustness than using only the guidance of the teacher model. Therefore, the IAD loss automatically encourages the outputs of the student model to mimic more reliable guidance in adversarial distillation.
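The following Python sketch illustrates one way the weighting between teacher guidance and student introspection could be computed for a single example. It assumes that the annealing weight is the teacher's probability on the target label for the adversarial input, raised to a sharpening exponent `beta`, that the teacher's soft labels are taken on the natural input, and that both KL terms use the reference distribution first. The function names, the KL direction, and all numeric values are illustrative assumptions, not a reproduction of the authors' implementation.

```python
import math


def tempered_softmax(logits, tau):
    # Softmax of logits / tau.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, tiny=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q))


def annealing_weight(teacher_adv_logits, y, beta):
    # Teacher probability of the target label on the adversarial input,
    # sharpened by the exponent beta (assumed form).
    return tempered_softmax(teacher_adv_logits, 1.0)[y] ** beta


def iad_loss(s_adv_logits, s_nat_logits, t_nat_logits, t_adv_logits, y, tau, beta):
    alpha = annealing_weight(t_adv_logits, y, beta)
    s_adv = tempered_softmax(s_adv_logits, tau)
    # Teacher guidance: student on adversarial data vs. teacher soft labels.
    guidance = kl(tempered_softmax(t_nat_logits, tau), s_adv)
    # Student introspection: student on adversarial vs. its natural output.
    introspection = kl(tempered_softmax(s_nat_logits, tau), s_adv)
    return alpha * guidance + (1.0 - alpha) * introspection, alpha


# Hypothetical logits: the teacher is right, then wrong, on the adversarial input.
s_adv, s_nat, t_nat = [2.0, 0.5, 0.1], [3.0, 0.2, 0.1], [4.0, 0.0, 0.0]
loss_ok, alpha_ok = iad_loss(s_adv, s_nat, t_nat, [5.0, 0.0, 0.0], 0, 2.0, 1.0)
loss_bad, alpha_bad = iad_loss(s_adv, s_nat, t_nat, [0.0, 5.0, 0.0], 0, 2.0, 1.0)
print(alpha_ok, alpha_bad)
```

When the teacher misclassifies the adversarial input, the weight shrinks and the loss is dominated by the introspection term, which matches the intuition described above.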
Algorithm 1 summarizes the implementation of Introspective Adversarial Distillation (IAD). Specifically, IAD first leverages PGD to generate the adversarial data for the student model. Second, IAD computes the outputs of the teacher model and the student model on the natural data. Then, IAD trains the outputs of the student model on the adversarial data to mimic both its own outputs and those of the teacher model, weighted according to the probability that the teacher model assigns on the adversarial data.
During training, we add a warming-up period to activate the student model, during which the annealing parameter $\alpha$ (in Eq. (3)) is hardcoded to 1. This is because the student itself is not trustworthy in the early stage (refer to the left panel of Figure 3). In this way, we expect the student model to first evolve into a relatively reliable learner and then conduct the procedure of introspective adversarial distillation.
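A minimal sketch of the warming-up rule, assuming zero-based epoch indices and the same power-form weight as in the text: during the first `warmup_epochs` epochs the weight on the teacher is fixed to 1, and afterwards it follows the teacher's probability on the target label. The function name and the numeric values are hypothetical.

```python
def annealing_weight_with_warmup(epoch, warmup_epochs, teacher_prob_on_label, beta):
    # During warm-up, the student fully trusts the teacher.
    if epoch < warmup_epochs:
        return 1.0
    # Afterwards, the weight follows the teacher's (sharpened) confidence.
    return teacher_prob_on_label ** beta


# Hypothetical run: 3 warm-up epochs, teacher probability 0.5, beta = 2.
schedule = [annealing_weight_with_warmup(e, 3, 0.5, 2.0) for e in range(5)]
print(schedule)
```

In a real training loop the teacher probability would be recomputed for every adversarial example, so the weight after warm-up would vary per example rather than stay constant as in this toy schedule.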
4.1 Comparison with Related Methods
In this section, we discuss the differences between IAD and other related approaches from the perspective of the loss functions. Table 1 summarizes all of them.
As shown in Table 1, AT (Madry et al., 2018) utilizes hard labels to supervise adversarial training; TRADES (Zhang et al., 2019) decomposes the loss function of AT into two terms, one for standard training and the other for adversarial training with soft supervision; motivated by KD (Hinton et al., 2015), Goldblum et al. (2020) proposed ARD to conduct adversarial distillation, which fully trusts the outputs of the teacher model when learning the student model. As indicated by the experiments in Goldblum et al. (2020), a larger value of its hyper-parameter resulted in less robust student models, and they generally fixed this hyper-parameter accordingly in their experiments; Chen et al. (2021) utilized distillation as a regularization to avoid the robust overfitting issue, employing both the adversarially pre-trained teacher model and the standard pre-trained model. Thus, there are two KL-divergence losses and, for simplicity, we term their method AKD. IAD consists of two parts, which respectively encourage the student model to partially instead of fully trust the guidance of the teacher model and to gradually trust itself more. In the loss function, the annealing parameter $\alpha$ gradually decreases as the adversarial examples become more challenging, which reduces the dependency on the guidance of the teacher model during training.
5 Experiments
We conduct comprehensive experiments to evaluate the effectiveness of IAD. In Section 5.1, we compare IAD with benchmark adversarial training methods (AT and TRADES) and some related methods which utilize adversarially pre-trained models via KD (ARD and AKD) on CIFAR-10/CIFAR-100 (Krizhevsky, 2009) datasets. In Section 5.2, we compare the previous methods with IAD on a more challenging dataset, Tiny-ImageNet (Le and Yang, 2015). In Section 5.3, ablation studies are conducted to analyze the effects of the hyper-parameter $\beta$ and different warming-up periods for IAD.
Several measures regarding both natural accuracy and adversarial robustness are applied to evaluate the model performance. We compute the natural accuracy on the natural test data and the robust accuracy on the adversarial test data following Wang et al. (2019). Specifically, the adversarial test data are generated by FGSM, PGD-20, and CW attacks with the same perturbation bound and step size. All the adversarial generation in these attacks has a random start, i.e., a uniformly random perturbation is added to the natural data before the attack iterations. Moreover, we also estimate the model performance under AutoAttack (termed AA for simplicity).
5.1 Evaluation on CIFAR-10/CIFAR-100 Datasets
In this part, we follow the setup (learning rate, optimizer, weight decay, momentum) of (Goldblum et al., 2020) to implement the adversarial distillation experiments on the CIFAR-10/CIFAR-100 datasets. Specifically, we train ResNet-18 under IAD, AT, TRADES, ARD and AKD using SGD with momentum for epochs. The initial learning rate is divided by at Epoch and Epoch respectively, and the weight decay=. In the settings of adversarial defense, we set the perturbation bound , the PGD step size , and PGD step numbers . In the settings of distillation, we use and use models pre-trained by AT and TRADES which have the best PGD-10 test accuracy as the teacher models for ARD, AKD and our IAD. For ARD, we set its hyper-parameter as recommended in (Goldblum et al., 2020) for gaining better robustness. For AKD, we set , and as recommended in (Chen et al., 2021). For IAD, we set and warming-up period as epoch to train on CIFAR-10. On CIFAR-100, the warming-up period is set as epochs.
We report the results in Table 2, where the results of AT and TRADES are listed in the first and fifth rows, and the other methods use these models as the teacher models in distillation. On the CIFAR-10 dataset, we note that our IAD obtains consistent improvements in adversarial robustness in terms of PGD-20, CW and AA accuracy compared with the student models distilled by ARD or AKD and with the adversarially pre-trained teacher models. However, the natural and FGSM accuracy of our IAD is lower than that of the others, since we encourage the student model to partially trust itself, which enhances the stability of model outputs but sacrifices part of the standard performance. On the CIFAR-100 dataset, the improvements of our IAD in adversarial robustness are more obvious than those on CIFAR-10, especially when distilling from the teacher model trained by TRADES. However, since the teacher models have poor standard and robust performance on CIFAR-100, the models distilled by ARD and AKD, which fully trust the guidance of teacher models, also show worse standard or robust performance compared with our IAD.
In this part, we evaluate these methods using a model with larger capacity, i.e., WideResNet-34-10. For these adversarially pre-trained models, we follow the settings of (Zhang et al., 2021) to train AT and TRADES. To be specific, we train WideResNet-34-10 using SGD with momentum for epochs. The initial learning rate is divided by at Epoch and respectively, and the weight decay=. For distillation baselines, we keep most settings the same as in the previous part. Specifically, we adjust the hyper-parameter of ARD on the CIFAR-100 dataset, as Goldblum et al. (2020) recommend for complex tasks. For IAD, we use and no warming-up period for CIFAR-10 and -epoch as the warming-up period for CIFAR-100.
We report the results in Table 3. According to the results on the CIFAR-10 dataset, our method achieves better model robustness than ARD, AKD and the original teacher models in terms of PGD-20, CW and AA accuracy. Moreover, our IAD does not sacrifice much standard performance (see the results of IAD (TRADES) and TRADES in Table 3). Since AKD additionally utilizes a standard pre-trained teacher model, it can achieve better natural accuracy in adversarial distillation. On CIFAR-100, similar to the previous results for ResNet-18, our IAD gains consistent improvements in model robustness. However, ARD performs poorly across these evaluation metrics since it fully trusts the teacher model. The reason is probably that models with large capacity fit a part of the unreliable guidance, which acts like noisy labels.
5.2 Evaluation on Tiny-ImageNet Dataset
In this part, we evaluate these methods on a more challenging Tiny-ImageNet dataset. For these adversarially pre-trained models, we follow the settings of (Chen et al., 2021) to train AT and TRADES. To be specific, we train PreAct-ResNet-18 using SGD with momentum for epochs. The initial learning rate is divided by at Epoch and respectively, and the weight decay=. For distillation baselines, we keep most settings the same as Section 5.1. For IAD, we use and epochs as the warming-up period.
We report the results in Table 4. Overall, our method still achieves better model robustness than the other methods. On Tiny-ImageNet, as the dataset is more challenging, the adversarially pre-trained teacher models have low standard and robust accuracy. In this case, the student model might be threatened by a large amount of unreliable guidance. As a result, ARD performs poorly across these evaluation metrics, since it fully trusts the teacher model. As for AKD, although it does not sacrifice much natural accuracy, its robust performance is worse than that of IAD.
5.3 Ablation Studies
To understand the effects of different $\beta$ and different warming-up periods on the CIFAR-10 dataset, we conduct an ablation study in this part. Here, we choose ResNet-18 as the backbone model, and keep the experimental settings the same as in Section 5.1. In the first set of experiments, we use no warming-up period and study different values of $\beta$. Then, in the second set of experiments, we fix $\beta$ and use different warming-up periods.
We report part of the results in Figure 4. The complete results with other evaluation metrics such as FGSM, PGD-20 and CW accuracy are provided in Appendix A.1. In Figure 4, we first visualize the values of $\alpha$ under different $\beta$ in the left panel, which shows the proportion of teacher guidance and student introspection in adversarial distillation. A larger $\beta$ corresponds to a larger proportion of student introspection. In the middle panel, we plot the natural and AA accuracy of the student models distilled with different $\beta$. We note that the AA accuracy improves when the student model trusts itself more, i.e., with a larger $\beta$ value. However, the natural accuracy decreases as $\beta$ increases. Similarly, we adjust the length of the warming-up period and check the natural and AA accuracy in the right panel of Figure 4. We find that letting the student model partially trust itself at the very beginning of the training process leads to inadequate robustness improvements and a larger sacrifice in natural accuracy. An appropriate warming-up period at the early stage can improve the student model's performance on the adversarial examples.
6 Conclusion
In this paper, we study distillation from adversarially pre-trained models. We take a closer look at adversarial distillation and discover that, in terms of robustness, the guidance of the teacher model becomes progressively unreliable. Hence, we explore the construction of reliable guidance in adversarial distillation and propose a method for distillation from unreliable teacher models, i.e., Introspective Adversarial Distillation. Our method encourages the student model to partially instead of fully trust the guidance of the teacher model and to gradually trust its self-introspection more to improve robustness.
References
- Are labels required for improving adversarial robustness? In NeurIPS.
- Improving adversarial robustness via channel-wise activation suppressing. In ICLR.
- Curriculum adversarial training. In IJCAI.
- Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP).
- Adversarial robustness: from self-supervised pre-training to fine-tuning. In CVPR.
- Robust overfitting may be mitigated by properly learned smoothening. In ICLR.
- Extracting tree-structured representations of trained networks. In NeurIPS.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- MMA training: direct input space margin maximization through adversarial training. In ICLR.
- Learning diverse-structured networks for adversarial robustness. In ICML.
- Adversarially robust distillation. In AAAI.
- Explaining and harnessing adversarial examples. In ICLR.
- Deep residual learning for image recognition. In CVPR.
- Distilling the knowledge in a neural network. arXiv.
- Adversarial examples are not bugs, they are features. In NeurIPS.
- Robust pre-training by adversarial contrastive learning. In NeurIPS.
- Learning multiple layers of features from tiny images. arXiv.
- Adversarial machine learning-industry perspectives. In 2020 IEEE Security and Privacy Workshops (SPW).
- Tiny ImageNet visual recognition challenge.
- Autonomous vehicle implementation predictions. Victoria Transport Policy Institute, Victoria, Canada.
- Towards deep learning models resistant to adversarial attacks. In ICLR.
- Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP).
- Overfitting in adversarially robust deep learning. In ICML.
- Do adversarially robust ImageNet models transfer better? In NeurIPS.
- Intriguing properties of neural networks. In ICLR.
- Analysis and applications of class-wise robustness in adversarial training. In KDD.
- Once-for-all adversarial training: in-situ tradeoff between robustness and accuracy for free. In NeurIPS.
- On the convergence and robustness of adversarial training. In ICML.
- Improving adversarial robustness requires revisiting misclassified examples. In ICLR.
- Adversarial weight perturbation helps robust generalization. NeurIPS 33.
- Wide residual networks. arXiv:1605.07146.
- Theoretically principled trade-off between robustness and accuracy. In ICML.
- Attacks which do not kill training make adversarial learning stronger. In ICML.
- Geometry-aware instance-reweighted adversarial training. In ICLR.
Appendix A Experiment
In this section, we provide additional experimental results about the IAD. All of the experiments are conducted on Tesla V100-SXM2 GPUs.
A.1 Complete Results of Ablation Studies
In this part, we report the complete results of our ablation studies in Tables 5 (about $\beta$) and 6 (about warming-up periods). In Table 5, we can see that the natural and FGSM accuracy decrease, and the robust accuracy (PGD-20, CW, AA) increases, as $\beta$ rises. In Table 6, we adjust the length of the warming-up period. We can see that letting the student network partially trust itself at the beginning of the training process results in inadequate robustness improvements.