This is the official code for "Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better"
Adversarial training is one effective approach for training robust deep neural networks against adversarial attacks. While being able to bring reliable robustness, adversarial training (AT) methods in general favor high capacity models, i.e., the larger the model the better the robustness. This tends to limit their effectiveness on small models, which are more preferable in scenarios where storage or computing resources are very limited (e.g., mobile devices). In this paper, we leverage the concept of knowledge distillation to improve the robustness of small models by distilling from adversarially trained large models. We first revisit several state-of-the-art AT methods from a distillation perspective and identify one common technique that can lead to improved robustness: the use of robust soft labels – predictions of a robust model. Following this observation, we propose a novel adversarial robustness distillation method called Robust Soft Label Adversarial Distillation (RSLAD) to train robust small student models. RSLAD fully exploits the robust soft labels produced by a robust (adversarially-trained) large teacher model to guide the student's learning on both natural and adversarial examples in all loss terms. We empirically demonstrate the effectiveness of our RSLAD approach over existing adversarial training and distillation methods in improving the robustness of small models against state-of-the-art attacks including the AutoAttack. We also provide a set of understandings on our RSLAD and the importance of robust soft labels for adversarial robustness distillation.
Different types of methods have been proposed to defend DNNs against adversarial attacks [22, 30, 32, 26, 57, 48], amongst which adversarial training (AT) has been found to be the most effective approach [2, 10]. AT can be regarded as a type of data augmentation technique that crafts adversarial versions of the natural examples for model training. AT is normally formulated as a min-max optimization problem, in which the inner maximization generates adversarial examples while the outer minimization optimizes the model’s parameters on the adversarial examples generated during the inner maximization [32, 57, 47].
While being able to bring reliable robustness, AT methods have several drawbacks that may limit their effectiveness in certain application scenarios. Arguably, the most notable drawback is their hunger for high capacity models, i.e., the larger the model the better the robustness [49, 44, 35, 16]. However, there are scenarios where small and lightweight models are more preferable than large models. One example is the deployment of small DNNs in devices with limited memory and computational power such as smart phones and autonomous vehicles. This has motivated the use of knowledge distillation along with AT to boost the robustness of small DNNs by distilling from robust large models [14, 3, 8, 62], a process known as Adversarial Robustness Distillation (ARD).
In this paper, we build upon previous works in both AT and ARD, and investigate the key element that can boost the robustness of small DNNs via distillation. We compare the loss functions adopted by several state-of-the-art AT methods and identify one common technique behind the improved robustness: the use of the predictions of an adversarially trained model. We denote this type of supervision as Robust Soft Labels (RSLs). Compared to the original hard labels, RSLs can better represent the robust behaviors of the teacher model, providing more robust information to guide the student’s learning. This observation motivates us to design a new ARD method to fully exploit the power of RSLs in boosting the robustness of small student models.
In summary, our main contributions are:
We identify that the implicit distillation process existing in adversarial training methods plays a useful role in promoting robustness, and that the use of robust soft labels can lead to improved robustness.
We propose a novel adversarial robustness distillation method called Robust Soft Label Adversarial Distillation (RSLAD), which applies robust soft labels to replace hard labels in all of its supervision loss terms.
We empirically verify the effectiveness of RSLAD in improving the robustness of small DNNs against state-of-the-art attacks. We also provide a comprehensive understanding of our RSLAD and the importance of robust soft labels for robustness distillation.
Given a DNN model with known parameters, adversarial examples (or attacks) can be crafted by the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), the Carlini and Wagner (CW) attack and a number of other methods. Several recent attacks were developed to produce more reliable adversarial robustness evaluations of defense models; these methods were designed to effectively avoid subtle gradient masking or obfuscation effects in improperly defended models. AutoAttack (AA) is an ensemble of four attacking methods: Auto-PGD (APGD), the Difference of Logits Ratio (DLR) attack, FAB-Attack and the black-box Square Attack. The AA ensemble is arguably the most powerful attack to date.
Adversarial training is known as the most effective approach to defend against adversarial examples. Recently, a number of understandings [30, 21, 12, 59, 61] and methods [32, 57, 48, 52, 51, 34, 17, 18, 4] have been put forward in this area. Adversarial training can be formulated as the following min-max optimization problem:
$$\min_{\theta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}}\;\mathcal{L}_1\big(f_\theta(x'),y\big),\qquad x'=\arg\max_{\|x'-x\|_p\le\epsilon}\mathcal{L}_2\big(f_\theta(x'),y\big) \tag{1}$$

where $f_\theta$ is a DNN model with parameters $\theta$, $x'$ is the adversarial example of the natural example $x$ within a bounded $L_p$ distance $\epsilon$, $\mathcal{L}_1$ is the loss for the outer minimization, and $\mathcal{L}_2$ is the loss for the inner maximization. The most commonly adopted norm is the $L_\infty$ norm. In Standard Adversarial Training (SAT), the two losses $\mathcal{L}_1$ and $\mathcal{L}_2$ are set to the same loss, i.e., the most commonly used Cross Entropy (CE) loss, and the inner maximization problem is solved by the PGD attack. For simplicity, we omit the parameters $\theta$ from the loss functions for the rest of this paper.
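As an illustration, the inner maximization is typically solved with the PGD attack: repeated sign-gradient ascent steps, each followed by projection back onto the $\epsilon$-ball around the natural example. A minimal sketch of this loop (our own toy NumPy example on a linear classifier rather than a deep network; the model, gradients and hyper-parameter values are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss(W, x, y):
    # Cross-entropy loss of a toy linear classifier with logits W @ x
    p = softmax(W @ x)
    return -np.log(p[y] + 1e-12)

def ce_grad_x(W, x, y):
    # Analytic gradient of the CE loss w.r.t. the input x
    p = softmax(W @ x)
    p[y] -= 1.0
    return W.T @ p

def pgd_attack(W, x, y, eps=8/255, step=2/255, steps=10):
    # Inner maximization: ascend the loss with sign-gradient steps,
    # projecting back into the L-infinity eps-ball after each step
    rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-0.001, 0.001, size=x.shape)  # random start
    for _ in range(steps):
        g = ce_grad_x(W, x_adv, y)
        x_adv = x_adv + step * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection
    return x_adv
```

On any random `W` and `x`, the returned `x_adv` stays within the $\epsilon$-ball while its loss is no smaller than the clean loss.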
A body of work has been proposed to further improve the effectiveness of SAT. This includes the use of wider and larger models, additional unlabeled data, domain adaptation (natural domain versus adversarial domain), a theoretically-principled trade-off between robustness and accuracy (known as TRADES) via the use of a Kullback–Leibler (KL) divergence loss for the inner maximization, the emphasis of misclassified examples via Misclassification-Aware adveRsarial Training (MART), channel-wise activation suppressing (CAS) and adversarial weight perturbation. In general, the elements found in these works that can contribute to robustness include large models, more data, and the use of a KL loss for the inner maximization.
AT methods are not perfect. One notable drawback of existing AT methods is that the smaller the model, the poorer the robust performance. It is generally hard to improve the robustness of small models like ResNet-18 and MobileNetV2, though many of the above AT methods can bring considerable robustness improvements to large models such as WideResNet-34-10 [57, 48] and WideResNet-70-16. This tends to limit their effectiveness in scenarios where storage or computational resources are limited, such as mobile devices, autonomous vehicles and drones. In this paper, we leverage knowledge distillation techniques to improve the robustness of small models and improve existing adversarial robustness distillation methods.
Knowledge distillation (KD) is one well-known method for deep neural network compression that distills the knowledge of a large DNN into a small, lightweight student DNN . Given a well-trained teacher network , KD trains the student network by solving the following optimization problem:
$$\min_{\theta_S}\;\alpha\,\tau^2\,\mathrm{KL}\big(S^\tau(x),T^\tau(x)\big)+(1-\alpha)\,\mathrm{CE}\big(S(x),y\big) \tag{2}$$

where $\mathrm{KL}$ is the Kullback–Leibler divergence, $\tau$ is a temperature constant added to the softmax operation, $S^\tau$ and $T^\tau$ denote the temperature-softened outputs of the student and teacher, and the second term is the classification loss of the student network, for which CE is a common choice. KD has been extended in different ways [36, 55, 27, 58] to a variety of learning tasks, such as noisy label learning [53, 60], AI security [14, 3, 28] and natural language processing [33, 41, 29]. Notably, a branch called self-distillation has attracted considerable attention in recent years [23, 58, 54]. Unlike traditional KD methods, self-distillation teaches a student network by itself rather than with a separate teacher network.
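A minimal sketch of this KD objective (our own NumPy illustration, not a reference implementation; the `alpha` and `tau` values are illustrative assumptions):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q):
    # KL(p || q) for two discrete probability distributions
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def kd_loss(student_logits, teacher_logits, y, alpha=0.9, tau=4.0):
    # Distillation term: KL between temperature-softened teacher and student
    distill = kl_div(softmax(teacher_logits, tau), softmax(student_logits, tau))
    # Classification term: CE between the student prediction and the hard label
    ce = -np.log(softmax(student_logits)[y] + 1e-12)
    # The tau^2 factor keeps gradient magnitudes comparable across temperatures
    return alpha * tau**2 * distill + (1 - alpha) * ce
```

With `alpha=1` and identical student and teacher logits the loss vanishes, since only the KL term remains and it is zero for matching distributions.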
KD has been applied along with adversarial training to boost the robustness of a student network with an adversarially pre-trained teacher network. The teacher can be a larger model with better robustness (e.g., ARD) or share the same architecture with the student (e.g., IAD). It has been shown that ARD and IAD can produce student networks that are more robust than those trained from scratch, indicating that the robust features learned by the teacher network can also be distilled. In this paper, we will build upon these works and propose a more effective adversarial robustness distillation method to improve the robustness of small student networks.
|Method||Outer minimization loss||Inner maximization loss||Teacher|
|SAT||CE(S(x′), y)||CE(S(x′), y)||none (hard labels only)|
|TRADES||CE(S(x), y) + λ·KL(S(x′), S(x))||KL(S(x′), S(x))||S itself|
|MART||BCE(S(x′), y) + λ·KL(S(x′), S(x))·(1 − P_y(x))||CE(S(x′), y)||S itself|
|ARD||(1 − α)·CE(S(x), y) + α·τ²·KL(S^τ(x′), T^τ(x))||CE(S(x′), y)||robust pre-trained T|
|IAD||teacher- and student-guided KL terms weighted by the teacher’s prediction confidence||CE(S(x′), y)||robust pre-trained T|
|RSLAD (ours)||(1 − α)·KL(S(x), T(x)) + α·KL(S(x′), T(x))||KL(S(x′), T(x))||robust pre-trained T|
In this section, we revisit state-of-the-art AT and adversarial robustness distillation methods from the perspective of KD, and identify the importance of using robust soft labels for improving robustness. We then introduce our RSLAD method inspired by robust soft labels.
Following the adversarial training framework defined in equation (1), we summarize, in Table 1, the loss functions and the student and teacher networks used in three state-of-the-art AT methods (i.e., SAT, TRADES and MART) and two adversarial robustness distillation methods (i.e., ARD and IAD). Compared to SAT, which simply adopts the original hard labels to supervise the learning, TRADES utilizes the natural predictions of the model via the KL term and gains a significant robustness improvement. From this perspective, TRADES is a self-distillation process where the teacher network is the student itself. MART is also a self-distillation process, but with a focus on low-probability examples via the (1 − P_y(x)) weighting scheme on the KL term, where P_y(x) is the model’s predicted probability of the true class. In ARD, a more powerful teacher instead of the student itself is used to supervise the learning. The robustness constantly improves from SAT’s no distillation, through TRADES/MART’s self-distillation, to ARD’s full distillation, as we will also show in Section 4. IAD is also an adversarial distillation method, which makes the distillation process more reliable by using the knowledge of both the teacher and the student networks. In this view, we believe that the knowledge distillation implicitly or explicitly adopted in these methods contributes significantly to their success.
Another key difference between SAT and the other methods mentioned above is that the latter exploit the teacher network’s natural predictions in both their outer and inner optimization processes, via the KL term. The predictions of a robust teacher model can be considered a type of Robust Soft Labels (RSLs). Previous works (and also our experiments in Section 4) have shown that TRADES and its variants can bring considerable robustness improvement over SAT. From a distillation point of view, this robustness improvement comes from the use of RSLs, in contrast to the use of the original hard labels. On the other hand, the goal of adversarial robustness distillation is to make the student as similar to the robust teacher as possible. Compared to the original hard labels, RSLs capture the full robust behavior of the teacher network, and thus convey more of the robust knowledge learned by the teacher to the student. In Section 4, we will empirically show that RSLs are indeed more beneficial to robustness than the original hard labels or other forms of non-robust soft labels. ARD has a KL term in its outer minimization loss, but its other loss terms use the original hard labels. IAD uses KL terms in its two outer minimization loss terms, but its inner maximization loss still uses the hard labels, leaving room for improvement.
The proposed Robust Soft Label Adversarial Distillation (RSLAD) framework is illustrated in Figure 1, including a comparison with four existing methods (i.e., TRADES, MART, ARD and IAD). The key difference of our RSLAD from existing methods lies in the use of RSLs produced by the large teacher network to supervise the student’s training on both natural and adversarial examples in all loss terms. The original hard labels are absent in our RSLAD.
As the student network in RSLAD is still trained using AT, it also has the inner maximization and the outer minimization processes. To bring RSLs into full play, we apply RSLs in both of the two processes. The loss functions used by our RSLAD are summarized in the last row of Table 1. Note that, in our RSLAD, the temperature constant τ that commonly exists in distillation methods is fixed to 1, as we find it is no longer necessary when RSLs are used. The same as TRADES, MART, ARD and IAD, we use the natural RSLs (i.e., the predictions of a robust model on natural examples) as the soft labels to supervise the model training.
|Model||Method||Best Checkpoint||Last Checkpoint|
ARD_300: ARD training under our RSLAD setting (i.e., 300 epochs); ARD_outer: the outer minimization part of ARD; ARD_inner: the inner maximization part of ARD; RSLAD_outer: the outer minimization part of our RSLAD; RSLAD_inner: the inner maximization part of our RSLAD.
The overall optimization framework of our RSLAD is defined as follows:

$$\min_{\theta_S}\;(1-\alpha)\,\mathrm{KL}\big(S(x),T(x)\big)+\alpha\,\mathrm{KL}\big(S(x'),T(x)\big),\qquad x'=\arg\max_{\|x'-x\|_p\le\epsilon}\mathrm{KL}\big(S(x'),T(x)\big) \tag{3}$$

where $S(x)$ and $T(x)$ are abbreviations for $S(x;\theta_S)$ and $T(x;\theta_T)$, respectively. Since the RSLs produced by the adversarially trained teacher network are also used to supervise the clean training part of the student’s outer minimization, we replace the commonly used CE loss with the KL divergence to measure the distributional difference between the two models’ output probabilities.
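These loss terms can be sketched as follows (our own NumPy illustration of the loss values only, not the official implementation; the KL direction, with the teacher distribution as the target, and the default `alpha` are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q):
    # KL(p || q) for two discrete probability distributions
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def rslad_outer_loss(s_nat_logits, s_adv_logits, t_nat_probs, alpha=0.5):
    # Both terms are supervised by the teacher's robust soft labels T(x);
    # no hard labels and no temperature (tau = 1) are used.
    nat_term = kl_div(t_nat_probs, softmax(s_nat_logits))
    adv_term = kl_div(t_nat_probs, softmax(s_adv_logits))
    return (1 - alpha) * nat_term + alpha * adv_term

def rslad_inner_objective(s_adv_logits, t_nat_probs):
    # The inner maximization crafts x' so that the student's prediction
    # deviates most from the teacher's robust soft labels
    return kl_div(t_nat_probs, softmax(s_adv_logits))
```

When the student's outputs match the teacher's robust soft labels on both natural and adversarial examples, the outer loss vanishes, reflecting the goal of making the student mimic the robust teacher.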
The goal of RSLAD is to learn a small student network that is as robust as an adversarially pre-trained teacher network, that is, to retain as much of the teacher’s knowledge and robustness as possible. We note that the hard labels commonly used in adversarial training can lose some of the information learned by the teacher network, because binarizing the teacher’s output probabilities into hard labels discards their true distribution. However, not all soft labels are robust. We will empirically show that smooth labels produced by label smoothing, or soft labels produced by naturally trained non-robust models, cannot improve robustness.
We first describe the experimental setting, then evaluate the white-box robustness of 4 baseline defense methods and our RSLAD. We also conduct an ablation study, visualize the attention map learned by different methods, compare 3 types of soft labels, and explore how to choose a better teacher network.
Student and Teacher Networks. We consider two student networks, ResNet-18 and MobileNetV2, and two teacher networks, WideResNet-34-10 for CIFAR-10 and WideResNet-70-16 for CIFAR-100. The CIFAR-10 teacher WideResNet-34-10 is trained using TRADES, while for CIFAR-100 we use the WideResNet-70-16 model provided by Gowal et al.
We train the networks using the Stochastic Gradient Descent (SGD) optimizer with initial learning rate 0.1, momentum 0.9 and weight decay 2e-4. We set the batch size to 128. For our RSLAD, we set the total number of training epochs to 300, and the learning rate is divided by 10 at the 215th, 260th and 285th epochs. A 10-step PGD (PGD-10) with random start size 0.001 and step size 2/255 is used to solve the inner maximization of our RSLAD. For the baseline methods SAT, TRADES and ARD, we strictly follow their original settings. IAD uses the same architecture for the teacher and student networks; here, we reproduce their method using a more powerful teacher to fit our settings. The training perturbation is bounded to 8/255 under the L∞ norm for both datasets. For natural training, we train the networks for 100 epochs on clean images with standard data augmentations, and the learning rate is divided by 10 at the 75th and 90th epochs.
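The piecewise-constant learning-rate schedule described above can be sketched as a small helper (a hypothetical function name; the milestones and decay factor are the stated settings):

```python
def rslad_lr(epoch, base_lr=0.1, milestones=(215, 260, 285), gamma=0.1):
    # Divide the learning rate by 10 at each milestone epoch
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For example, the learning rate is 0.1 for epochs 0-214, then drops to 0.01, 0.001 and finally 0.0001 over the 300-epoch run.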
Evaluation Attacks. After training, we evaluate the model against 5 adversarial attacks: FGSM, PGD_SAT, PGD_TRADES, CW∞ (the CW loss optimized by PGD) and AutoAttack (AA). PGD_SAT is the original PGD attack proposed in Madry et al., while PGD_TRADES is the one used in Zhang et al.; they are both PGD attacks but differ in their hyper-parameters (e.g., step size). We consider these two attacks separately following Carmon et al. Note that these are commonly used attacks in adversarial robustness evaluation. The maximum perturbation used for evaluation is also set to 8/255 for both datasets. The number of perturbation steps of PGD_SAT, PGD_TRADES and CW∞ is 20. The robustness of the teacher models against the 5 attacks is reported in Table 2, indicating the maximum robustness the student model can attain. Besides the white-box evaluation, we also conduct a black-box evaluation, which will be described later.
White-box Robustness. The white-box robustness of our RSLAD and the other baseline methods is reported in Table 3 for CIFAR-10 and Table 4 for CIFAR-100. Following previous works, we report the results at both the best checkpoint and the last checkpoint. The best checkpoint of natural training (shown as ‘Natural’ in both tables) is selected based on the performance on clean test examples, while the best checkpoints of SAT, TRADES, ARD, IAD, and our RSLAD are selected based on their robustness against the PGD_TRADES attack.
As shown in Table 3 and Table 4, our RSLAD method demonstrates state-of-the-art robustness on both CIFAR-10 and CIFAR-100 against all 5 attacks at both the best and the last checkpoints. For ResNet-18, RSLAD improves the robustness by 1.74% and 1.32% over the previous state of the art on CIFAR-10 and CIFAR-100, respectively. For MobileNetV2, RSLAD brings 1.55% and 0.63% improvements. The improvements are more pronounced against AutoAttack, the most powerful attack to date; in particular, our RSLAD outperforms ARD by 2.30% for the ResNet-18 student on CIFAR-10. This verifies that our RSLAD is more stable and robust in training robust small DNNs than all the baseline methods. We also observe that, under all settings, TRADES holds a clear advantage over SAT, but can still be largely outperformed by distillation methods (i.e., ARD and our RSLAD).
Black-box Robustness. Here, we evaluate the black-box robustness of our RSLAD, SAT, TRADES, ARD and IAD, testing both transfer attacks and a query-based attack. This experiment is conducted on the CIFAR-10 dataset. For the transfer attacks, we craft the test adversarial examples using 20-step PGD (PGD-20) and CW on an adversarially pre-trained ResNet-50 surrogate model; the maximum perturbation is also set to 8/255. For the query-based attack, we use one strong and query-efficient attack, the Square attack, to attack the models. We evaluate both the transfer attacks and the query-based attack on the best checkpoints of the two student models (i.e., ResNet-18 and MobileNetV2). The results are presented in Table 5. As can be observed, our RSLAD surpasses all 4 baseline methods against all 3 black-box attacks, demonstrating the superiority of our robust soft label distillation approach. The general trend across different types of defense methods is consistent with that in the white-box setting: for robustifying small DNNs, TRADES is better than SAT, while distillation methods are better than TRADES.
Ablation of RSLAD. To better understand the contribution of each component of our RSLAD to robustness, we conduct a set of ablation studies with the existing distillation method ARD on CIFAR-10 with the ResNet-18 student network (the teacher is the same WideResNet-34-10 network as used in the above experiments). We replace the inner maximization and outer minimization losses used by ARD with the ones used in our RSLAD, then test the robustness of the trained student network. We also run an experiment with ARD under our RSLAD setting for 300 epochs (it was 200 epochs in the original paper). The ablation results are reported in Table 6. Compared to ARD, there is a clear improvement when either the inner loss or the outer loss of our RSLAD is used. The best robustness is achieved when both losses in ARD are switched to our RSLAD losses. This confirms the importance of each component of RSLAD, and of the robust soft labels used in these components. We also find that the outer minimization has more impact on the overall robustness than the inner maximization: replacing the outer part of ARD with its RSLAD counterpart leads to a more robust student than replacing the inner part. An additional comparison between RSLAD and the baselines trained for 300 epochs can be found in Appendix D.
Attention Maps Learned by RSLAD. Here we use attention maps and saliency maps to visually inspect the similarity of the knowledge learned by the student to that of the teacher network. Given the same adversarial examples, higher similarity indicates more successful distillation and robustness better aligned with the teacher model. We take the ResNet-18 student distilled from the WideResNet-34-10 teacher on the CIFAR-10 dataset as an example, and visualize the attention maps (generated by Grad-CAM) and saliency maps in Figure 2. As can be observed, the attention maps of the student trained using our RSLAD are noticeably more similar to the teacher’s than those of the baseline methods ARD and IAD. This indicates that the student trained by our RSLAD can indeed mimic the teacher better and has gained more robust knowledge from the teacher. A parameter analysis of our RSLAD can be found in the appendix.
|Soft Labels||Best Checkpoint||Last Checkpoint|
Different Types of Soft Labels. Here, we compare three types of soft labels: 1) smooth soft labels (SSLs) crafted by label smoothing; 2) natural soft labels (NSLs) produced by a naturally trained teacher model; and 3) robust soft labels (RSLs) produced by an adversarially trained robust teacher model. This experiment is conducted with the ResNet-18 student and WideResNet-34-10 teacher on the CIFAR-10 dataset, using our RSLAD. The probability distributions of the three types of soft labels for two example CIFAR-10 classes (i.e., ‘Airplane’ and ‘Cat’) are plotted in Figure 3. Different from RSLs, SSLs apply a fixed smoothing transformation to the original hard labels, while NSL probabilities are more concentrated around the ground-truth label. The white-box robustness of the student network trained using our RSLAD with these 3 types of soft labels is shown in Table 7. One key observation is that robustness drops drastically when non-robust labels, including SSLs or NSLs, are used in place of robust labels. This means that soft labels are not all beneficial to robustness, and non-robust labels, especially the NSLs produced by non-robust models, can significantly harm robustness distillation.
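To make the contrast concrete: an SSL is just a fixed transformation of the hard label, whereas NSLs and RSLs are the softmax outputs of a naturally trained and an adversarially trained model, respectively, and so depend on a trained network. A NumPy sketch of the label-smoothing construction (the smoothing factor `eps` is an illustrative assumption):

```python
import numpy as np

def hard_label(y, num_classes):
    # One-hot encoding of the ground-truth class y
    t = np.zeros(num_classes)
    t[y] = 1.0
    return t

def smooth_soft_label(y, num_classes, eps=0.1):
    # SSL: move eps of the probability mass uniformly to all classes;
    # the result is the same for every example of class y, unlike
    # NSLs/RSLs which are per-example model predictions
    return (1 - eps) * hard_label(y, num_classes) + eps / num_classes
```

For example, with 10 classes and `eps=0.1`, the true class receives probability 0.91 and every other class 0.01.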
How to Choose a Good Teacher? Here, we provide some empirical understandings of the impact of the teacher on the robustness of the student. We conduct this experiment on CIFAR-10 with the ResNet-18 student network and investigate its robustness when distilled using our RSLAD from 6 different teacher networks: ResNet-18, ResNet-34, ResNet-50, WideResNet-34-10, WideResNet-34-20 and WideResNet-70-16. The results are plotted in Figure 4. Surprisingly, we find that the student’s robustness does not increase monotonically with the teacher’s; instead, it first rises and then drops. We call this phenomenon robust saturation: when the teacher network becomes too complex for the student to learn from, the robustness of the student tends to drop. As shown in the figure, the robustness gap between the student and the teacher increases when the complexity of the teacher network goes beyond WideResNet-34-10. Interestingly, the student’s robustness can surpass the teacher’s when the teacher is smaller than WideResNet-34-10, especially when the teacher has the same architecture (i.e., ResNet-18) as the student. We call this phenomenon the robust underfitting of adversarial training methods, where robustness can be improved by training the model a second time while using the model from the first round as the teacher. The robust underfitting region is where distillation can help boost robustness. The best robustness of the ResNet-18 student is achieved with the WideResNet-34-10 teacher (4.5× larger than ResNet-18). These results indicate that choosing a moderately large teacher model can lead to the maximum robustness gain in adversarial robustness distillation.
In this paper, we investigated the problem of training small robust models via knowledge distillation. We revisited several state-of-the-art adversarial training and robustness distillation methods from the perspective of distillation. By comparing their loss functions, we identified the importance of robust soft labels (RSLs) for improved robustness. Following this view, we proposed a novel adversarial robustness distillation method named Robust Soft Label Adversarial Distillation (RSLAD) to fully exploit the advantage of RSLs. The advantage of RSLAD over existing adversarial training and distillation methods was empirically verified on two benchmark datasets under both the white-box and the black-box settings. We also provided several insightful understandings of our RSLAD, of different types of soft labels, and, more importantly, of the interplay between the teacher and student networks. Our work can help build adversarially robust lightweight deep learning models.
This work was supported in part by National Natural Science Foundation of China (#62032006) and STCSM (#20511101000).
In this section, we explore the impact of the hyper-parameter α in equation (3). We apply RSLAD to train and distill ResNet-18 students from the WideResNet-34-10 teacher on CIFAR-10 using different values of α. We show the robust accuracy against PGD with respect to the ratio of the adversarial loss term to the natural loss term, reporting the robustness results at the best checkpoints in Figure 5. It can be observed that robustness rises rapidly as the ratio increases and then reaches a plateau. When the ratio becomes larger than 1, the robustness fluctuates slightly around 55.9% before reaching its best value.
In Section 4.4, we demonstrated how to choose a good teacher network and showed the impact of the teacher on the student’s robustness. Here, we show more complete robustness results of the student network against all 5 attacks mentioned in Section 4.1. We report results at both the best and the last checkpoints in Table 9; the teachers’ robustness is shown in Table 8. These additional evaluation attacks confirm the phenomena of robust saturation and robust underfitting, indicating that a moderately large teacher network can be a better teacher than an overly large one.
|Student||Teacher||Best Checkpoint||Last Checkpoint|
RSLs are the outputs of a robust model; however, they can be computed on either the natural examples (natural RSLs, T(x)) or the adversarial examples (adversarial RSLs, T(x′)). The same as TRADES, MART, ARD and IAD, RSLAD utilizes T(x) as the RSLs. But one may wonder whether T(x) is better than T(x′). To answer this question, we replace the T(x) used in our RSLAD loss terms (Equation 3) with T(x′). This experiment is conducted with the ResNet-18 student and WideResNet-34-10 teacher on the CIFAR-10 dataset. We report the results at the best checkpoints in Table 10.
Looking at the first row of Table 10, one can find that, when T(x) is replaced with T(x′) in all loss terms, the robustness reaches 49.79% against AA, which is slightly higher than that of TRADES (49.27%) (see Table 2), but still much lower than our RSLAD with T(x). Moving on to the second and third rows, one may notice that, when T(x′) is replaced back with T(x) in either the outer or the inner loss term, both clean accuracy and robustness improve. This clearly demonstrates the advantage of using RSLs computed on natural examples, although RSLs of adversarial examples also help robustness.
While the number of training epochs is 100 in SAT and TRADES and 200 in ARD and IAD, we train our RSLAD models for 300 epochs. For a fair comparison, here we also run the baseline methods for 300 epochs. Table 11 shows the results of 300-epoch SAT, TRADES, ARD and IAD on CIFAR-10, following the setting in Section 4.1. It shows that, although the robustness of the best checkpoints is slightly improved when training for 300 epochs with TRADES, ARD and IAD, their performance on the last checkpoints degrades, indicating more overfitting. As expected, our RSLAD method achieves the best overall performance.
|Method||Best Checkpoint||Last Checkpoint|
Considering that we use a teacher trained by TRADES in most experiments in the main text, to examine whether our method works with different kinds of teachers, we also train the teacher model using SAT+AWP. AWP is used to boost the teacher’s robustness, yielding a robust accuracy of 54% against AA. The results are shown in Table 12. The best checkpoint of the student model gains 0.49% and 0.13% improvements in natural and robust accuracy, respectively, while the last checkpoint slightly degrades in robustness but gains 0.7% in natural accuracy. We conclude that RSLAD can indeed boost small models’ robustness with various kinds of teacher models.
|Method||Best Checkpoint||Last Checkpoint|