The design of Deep Neural Networks (DNNs) for efficient real-world deployment involves careful consideration of the following key elements: memory and computational requirements, performance, reliability, and security. DNNs are often deployed on resource-constrained devices or in applications with strict latency requirements, such as autonomous cars, which makes it necessary to develop compact models that generalize well. Furthermore, since the models are often deployed in dynamic environments, it is important to consider their performance on both in-distribution and out-of-distribution data. This ensures the reliability of the models under distribution shift. Finally, the model needs to be robust to malicious attacks by adversaries [kurakin2016adversarial].
A number of techniques have been proposed for achieving high performance in compressed models, such as model quantization [zhou2017stochastic], model pruning [han2015deep], and knowledge distillation [hinton2015distilling]. Here, we focus on knowledge distillation (KD) as an interactive learning method which is more similar to human learning. Knowledge distillation involves training a smaller model (student) under the supervision of a larger pre-trained model (teacher). In the original formulation, hinton2015distilling proposed mimicking the softened softmax output of the teacher, which consistently improves the performance of the student compared to the same model trained without teacher assistance. Despite the promising performance gain, there is still a significant generalization gap between student and teacher. Consequently, an optimal method of capturing knowledge from the larger model and transferring it to a smaller model remains an open question. While it is important to reduce the generalization gap, it is also pertinent to incorporate methods into the knowledge distillation framework that make the training and inference of the model more robust.
Noise permeates every level of the nervous system, from the perception of sensory signals to the generation of motor responses [faisal2008noise], and also plays an important role in neural network training [bottou1991stochastic, neelakantan2015adding]. We therefore hypothesize that noise could help improve the robustness and generalization of neural networks. Inspired by trial-to-trial variability in the brain, that is, variations in neural responses to the same stimulus, which can result from multiple noise sources, we introduce variability through noise at three levels of the knowledge distillation framework: the input, the supervision signal from the teacher, and the target.
To test our hypothesis, we propose novel ways of injecting noise into the knowledge distillation framework as general and scalable techniques and exhaustively evaluate their effect on generalization and robustness. Our contributions are as follows:
“Fickle Teacher”, a novel approach to simulate the trial-to-trial response variability of biological neural networks. The method exposes the student to the uncertainty of the teacher, which results in a significant generalization improvement (from 94.28% to 94.67%).
“Soft-Randomization”, a novel approach for increasing robustness to input variability. The method significantly increases the capacity of the model to learn robust features with small additive noise in the input. At a low noise intensity, soft-randomization achieves 16.75% PGD-20 robustness and 92.57% test-set accuracy, whereas the model trained with Gaussian data augmentation achieves only 0.41% adversarial robustness and lower generalization (92.14%).
“Messy-Collaboration”, an approach for using target variability as a strong deterrent to cognitive bias. We observe its surprising ability to significantly improve adversarial robustness (from 0.15% to 11.41%) with minimal reduction in generalization (from 94.28% to 93.96%).
We show the effectiveness of messy-collaboration in learning with noisy labels.
A comprehensive analysis of the effects of injecting noise in the knowledge distillation framework.
2 Related Work
A number of experimental and computational studies have reported the presence of noise in the nervous system and how it affects the function of the system [faisal2008noise]. Analogously, noise has been used as a common regularization technique to improve the generalization performance of overparameterized deep neural networks by adding it to the input data, the weights or the hidden units [an1996effects, steijvers1996recurrent, graves2011practical, wan2013regularization, srivastava2014dropout, blundell2015weight]. Previous studies also showed that noise is crucial for non-convex optimization [zhou2017stochastic, li2017convergence, yim2017gift, kleinberg2018alternative]. Furthermore, a family of randomization techniques that inject noise into the model at both training and inference time has proven effective against adversarial attacks [dhillon2018stochastic, xie2017mitigating, rakin2018parametric, liu2018towards]. Randomized smoothing transforms any classifier into a new smooth classifier that has certifiable norm-bounded robustness guarantees [lecuyer2018certified, cohen2019certified]. Label smoothing improves the performance of deep neural networks across a range of tasks [szegedy2016rethinking, pereyra2017regularizing]. However, muller2019does report that label smoothing impairs knowledge distillation. In contrast, we show that our biologically inspired technique of injecting noise into knowledge distillation, fickle teacher, significantly improves the generalization of the student. Furthermore, fickle teacher differs from the works of bulo2016dropout and gurau2018dropout in that, instead of using the soft target distribution obtained by averaging Monte Carlo samples, we use the logits of the teacher model with dropout active directly as a source of uncertainty-encoding noise for distilling knowledge to a compact student.
3 Experimental Setup
To study the effect of injecting noise in the knowledge distillation framework, we use the Hinton method [hinton2015distilling]. We conducted our experiments on Wide Residual Networks (WRN) [zagoruyko2016wide]. Unless otherwise stated, we normalize the images between 0 and 1 and use the standard training scheme of [zagoruyko2016paying, tung2019similarity]: SGD with 0.99 momentum; 200 epochs; batch size 128; and an initial learning rate of 0.1, decayed by a factor of 0.2 at epochs 60, 120 and 150. We conducted our experiments on CIFAR-10 [krizhevsky2009learning], with WRN-40-2 (2.2M parameters) as the teacher and WRN-16-2 (0.7M parameters) as the student. In all of our experiments, we train each model with five different seed values. For the teacher, we select the model with the highest test accuracy and then use it to train the student, again with five different seed values, and report the mean values of our evaluation metrics.
To approximate the out-of-distribution generalization of our models, we use the ImageNet [krizhevsky2012imagenet] images from the CINIC dataset [darlow2018cinic]. For the evaluation of adversarial robustness, we use the Projected Gradient Descent (PGD) attack from kurakin2016adversarial and conduct the attack for multiple numbers of steps. We report the worst robustness accuracy over five random initialization runs. Finally, we test the robustness of our models to the commonly occurring corruptions and perturbations proposed by hendrycks2019benchmarking in CIFAR-C as a proxy for natural robustness. We evaluate average robustness to the 19 distortions across 5 severity levels. Furthermore, we calculate the mean Corruption Accuracy (mCA) over the 19 distortions. For both adversarial and natural robustness evaluation, we use a fine-grained measure of robustness, computing accuracy on the corrupted image conditional on the clean image being classified correctly (for details of the methods, see the Appendix).
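The conditional robustness measure and the mCA described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, returning 0.0 when no clean sample is classified correctly is an assumption, and so is the order of averaging in the mCA (over severity levels first, then over corruption types).

```python
def conditional_robust_accuracy(clean_preds, perturbed_preds, labels):
    """Accuracy on perturbed inputs, counted only over samples whose
    clean version was classified correctly."""
    correct_clean = 0
    still_correct = 0
    for clean, perturbed, label in zip(clean_preds, perturbed_preds, labels):
        if clean == label:
            correct_clean += 1
            if perturbed == label:
                still_correct += 1
    # Assumption: report 0.0 when no clean sample is classified correctly.
    return still_correct / correct_clean if correct_clean else 0.0


def mean_corruption_accuracy(accuracy_by_corruption):
    """Assumed mCA: average over severity levels, then over corruption types."""
    per_type = [sum(levels) / len(levels) for levels in accuracy_by_corruption.values()]
    return sum(per_type) / len(per_type)


labels = [0, 1, 2, 3]
clean_preds = [0, 1, 2, 0]       # the last sample is already wrong on clean data
perturbed_preds = [0, 1, 0, 3]   # the third sample flips under perturbation
robust_acc = conditional_robust_accuracy(clean_preds, perturbed_preds, labels)
```

Conditioning on clean correctness separates the accuracy lost to the perturbation from the errors the model already makes on clean data.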
4 Empirical Study of Noises
In this section, we propose injecting different types of noise in the student-teacher collaborative learning framework and analyze their effect on the generalization and robustness of the model.
4.1 Fickle Teacher
Trial-to-trial response variability in the brain can be considerably different across stimuli, suggesting that it could also provide an important contribution to the information conveyed by the neural responses about the stimuli [scaglione2011trial]. Similarly, in deep neural networks, dropout, which randomly switches off a group of selected hidden units, results in response variability and can be used to obtain principled uncertainty estimates [gal2015dropout]. We therefore propose to use dropout in the teacher model to simulate trial-to-trial variability. We first train the teacher with dropout and keep dropout active in the teacher while distilling knowledge to the student. As a result, the teacher provides variable supervision signals to the student for the same input, thereby exposing the student to its uncertainty. We systematically change the dropout rate and show its effect on generalization and robustness (for training details, see the Appendix). The results show that fickle teacher significantly improves both in-distribution and out-of-distribution generalization of the student compared to the Hinton method (Figure 1, left). Interestingly, for CIFAR-10, even when the accuracy of the teacher decreases after a dropout rate of 0.2, the student accuracy still improves up to a dropout rate of 0.4. For the same student and teacher network architectures, fickle teacher achieves higher performance on CIFAR-10 than the state-of-the-art knowledge distillation methods [tung2019similarity, zagoruyko2016paying]. For CINIC, as we increase the dropout rate, the student generalization gets closer to that of the teacher and is highest at a dropout rate of 0.5. The adversarial robustness increases for dropout rates up to 0.2 and then decreases (Figure 1, right), while the mCA for natural robustness is maintained or marginally increases with the dropout rate (Figure 2). The results show the effectiveness of fickle teacher in improving the generalization of the student.
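To make the source of variability concrete, the sketch below keeps dropout active in a toy two-layer teacher, so repeated forward passes on the same input return different logits. The network, its weights and the function name are hypothetical stand-ins (the teacher in our experiments is a WRN-40-2), and the use of inverted dropout is an assumption.

```python
import random


def fickle_teacher_logits(x, w_hidden, w_out, dropout_rate, rng):
    """Forward pass of a toy two-layer teacher with dropout left active.

    Assumes 0 <= dropout_rate < 1 and inverted dropout (survivors are rescaled).
    """
    keep_prob = 1.0 - dropout_rate
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # Each call draws a fresh mask, so the same input yields different logits.
    hidden = [h / keep_prob if rng.random() < keep_prob else 0.0 for h in hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Hypothetical toy weights, chosen only for illustration.
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
W_OUT = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

rng = random.Random(0)
x = [1.0, 2.0]
# Fifty supervision signals for the same input, one per simulated "trial".
supervision = [fickle_teacher_logits(x, W_HIDDEN, W_OUT, 0.5, rng) for _ in range(50)]
```

During distillation, each of these sampled logit vectors would serve as the teacher target for one training step, rather than an average over samples.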
4.2 Soft-Randomization
pinot2019theoretical show that injecting noise drawn from the exponential family, such as Gaussian or Laplace noise, leads to guaranteed robustness to adversarial attacks. However, this improved adversarial robustness comes at the cost of generalization. Therefore, an important consideration for methods proposed to increase adversarial robustness is to reduce the loss in generalization. Since knowledge distillation provides an opportunity to combine multiple sources of information, we hypothesize that combining information from a teacher with high generalization while training the student to be robust to noisy input can reduce the trade-off. To test the hypothesis, we propose a novel technique for improving robustness to input variability in the student, which uses the teacher trained on clean data to train the student on noisy data. Here, we minimize the dissimilarity between the student’s distribution on noisy data and the teacher’s distribution on clean data (Figure 3). The loss function is as follows:
$$\mathcal{L}_{SR} = (1-\alpha)\,\mathcal{L}_{CE}\big(S(X+\delta),\, y\big) + \alpha\,\tau^{2}\,D_{KL}\big(T^{\tau}(X)\,\|\,S^{\tau}(X+\delta)\big)$$
where $\delta \sim \mathcal{N}(0, \sigma^{2})$ is white Gaussian noise, $y$ is the target label, and $\alpha$ and $\tau$ are the balancing factor and temperature parameter from the Hinton method. $S(\cdot)$ denotes the softmax output of the student, and $S^{\tau}(\cdot)$ and $T^{\tau}(\cdot)$ denote the student’s and teacher’s softmax output with raised temperature, respectively.
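The data flow of the soft-randomization objective described above can be sketched for a single sample as follows. This is an illustration under stated assumptions rather than the training code: `student` and `teacher` are hypothetical callables mapping an input vector to logits, the weighting of the two terms follows the common Hinton-style form (cross-entropy weighted by 1 - alpha, the distillation term by alpha times tau squared), and the KL direction (teacher to student) is likewise assumed.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_randomization_loss(x, label, student, teacher, sigma, alpha, tau, rng):
    # The student sees the input corrupted by white Gaussian noise ...
    noisy_x = [xi + rng.gauss(0.0, sigma) for xi in x]
    student_logits = student(noisy_x)
    # ... while the teacher sees the clean input.
    teacher_logits = teacher(x)
    cross_entropy = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(student_logits, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return (1.0 - alpha) * cross_entropy + alpha * tau ** 2 * kl


# Hypothetical linear "network" used for both roles, for illustration only.
toy_model = lambda v: [v[0], v[1], v[0] + v[1]]
loss = soft_randomization_loss([0.2, 0.7], 2, toy_model, toy_model,
                               sigma=0.5, alpha=0.9, tau=4.0, rng=random.Random(0))
```

With `sigma=0` and identical student and teacher, the distillation term vanishes, which makes a convenient sanity check of an implementation.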
We train soft-randomization at both low and higher noise intensities. As a baseline, we compare soft-randomization to the compact model trained alone with Gaussian data augmentation. Figure 4 shows that soft-randomization consistently achieves both higher adversarial robustness and higher generalization than Gaussian augmentation at all noise intensities. For lower noise intensities, soft-randomization significantly outperforms Gaussian augmentation: at one such intensity, it achieves 16.75% robustness to the PGD-20 attack and 92.57% CIFAR-10 generalization, compared to 0.41% robustness and 92.14% generalization for Gaussian augmentation. Hence, soft-randomization significantly increases the capacity of the student to learn robust features even with lower noise intensities.
Figure 5 shows that the mCA consistently improves over the Hinton method for all Gaussian noise levels. While robustness drops most notably for color distortions (brightness, fog, contrast, and saturation), robustness to noise and blurring corruptions improves significantly as the Gaussian noise intensity increases. We also observe that the effect changes with intensity: for frost, for example, the robustness increases at lower noise levels and then decreases at higher intensities. Soft-randomization allows the use of lower noise intensities for increasing adversarial robustness while keeping the loss in generalization lower than with Gaussian augmentation. This provides greater flexibility in finding a suitable trade-off between generalization and robustness based on the application.
4.3 Messy-Collaboration
Human decision-making shows systematic simplifications and deviations from the tenets of rationality (‘heuristics’), which may lead to sub-optimal decisional outcomes (‘cognitive biases’) [korteling2018neural]. We believe that this cognitive bias manifests in deep neural networks in the form of memorization and over-generalization, and propose a counter-intuitive regularization technique based on label noise to mitigate it. We term this technique messy-collaboration (MC). For each sample in the training process, with a given noise rate, we randomly change the one-hot encoded target label to an incorrect class. The intuition behind this method is that by randomly relabeling a fraction of the samples in each batch, we introduce target variability¹, which encourages the model not to be overconfident in its predictions and discourages memorization. There are a number of studies on improving the tolerance of DNNs to noisy labels [hu2019understanding, han2019deep, wang2019symmetric]. However, to the best of our knowledge, random label noise has not been explored as a source of constructive noise for cognitive bias mitigation.
¹ We use the term target variability to refer to the random label corruption that we introduce intentionally for each batch during training, whereas noisy labels refers to the inherent corruption in the labels that comes from incorrect annotations.
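The relabeling step can be sketched as below. The function name is hypothetical, and drawing the replacement uniformly from the incorrect classes is an assumption; the text only specifies that the new label is an incorrect class.

```python
import random


def messy_labels(labels, num_classes, noise_rate, rng):
    """With probability `noise_rate`, replace each label with an incorrect class."""
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            # Draw from the num_classes - 1 incorrect classes (assumed uniform)
            # by sampling an index and skipping over the true class.
            wrong = rng.randrange(num_classes - 1)
            if wrong >= label:
                wrong += 1
            noisy.append(wrong)
        else:
            noisy.append(label)
    return noisy


batch_labels = [3, 1, 4, 1, 5, 9, 2, 6]
# Called once per batch, so the corrupted subset changes from batch to batch.
corrupted = messy_labels(batch_labels, num_classes=10, noise_rate=0.3, rng=random.Random(0))
```

Because the corruption is redrawn for every batch, a given sample is not consistently mislabeled across epochs, which is what distinguishes target variability from fixed annotation noise.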
Our experiments set out to dissociate the effect of injecting target variability in different stages of the knowledge distillation framework. Therefore, we study the effect of systematically increasing the noise rate for different setups: introducing target variability while training teacher (MC-T), introducing target variability while distilling knowledge to the student (MC-S), and introducing target variability when training the teacher as well as during knowledge distillation to the student (MC-TS).
When target variability is used only during knowledge distillation to the student (MC-S), both in-distribution and out-of-distribution generalization increase over the Hinton method, even for very high noise rates. When target variability is used for training the teacher, which is then used to distill knowledge to the student (MC-T and MC-TS), the generalization drops (Figure 6). At higher noise rates, all variants of messy-collaboration significantly outperform the teacher trained with target variability at the same rate. Interestingly, target variability has a remarkable effect on adversarial robustness. Figure 7 shows that the robustness of the student increases as we increase the noise rate. The increase in adversarial robustness is more pronounced when a teacher model trained with target variability is used to train the student (MC-T and MC-TS). Furthermore, messy-collaboration maintains the natural robustness of the Hinton method (Figure 16 in the Appendix).
To understand the effect of target variability on generalization, we visualize the distribution of the magnitude of the softmax probabilities. Figure 8 shows that increasing the noise rate in messy-collaboration not only leads to a smoother output distribution, but also shifts the distribution from overconfident softmax probabilities to lower probabilities. The results show that target variability in messy-collaboration encourages the model not to be overconfident in its predictions and discourages memorization, which results in better generalization.
4.3.1 Messy-Collaboration for learning with noisy labels
To utilize the vast amount of open-source data available, researchers have proposed methods to generate labels automatically using user tags and keywords. However, these techniques lead to noisy labels, which affect the generalization of the model [frenay2013classification]. Considering the abundance of noisy labels, it is important to develop methods that can learn from them effectively. Here, we show the effectiveness of knowledge distillation in learning with noisy labels at varying rates of label corruption on CIFAR-10. We further show that messy-collaboration improves the generalization over the Hinton method.
Figure 9 shows that the generalization drops as the corruption rate increases (cf. Teacher and WRN-16-2). For label corruption rates of 0.1 and higher, knowledge distillation improves the generalization performance significantly and even outperforms the teacher. The gain in generalization with the Hinton method grows as we increase the label corruption rate. Figure 10 shows the effect of varying the noise rate of messy-collaboration for the different label corruption rates. For label corruption rates over 0.1, messy-collaboration improves the generalization over the student and teacher for all noise rates. Interestingly, at a higher noise rate of 0.5, messy-collaboration improves the generalization over the Hinton method for all corruption rates. This shows that the target variability in messy-collaboration makes the model more tolerant to label noise, which allows efficient learning with noisy labels.
5 Conclusion
We proposed novel ways of injecting noise in the knowledge distillation framework and rigorously studied their effect on the generalization and robustness of the model. We introduced fickle teacher, which uses dropout to expose the student to the uncertainty of the teacher. We showed that the variability in the supervision signal improves both in-distribution and out-of-distribution generalization significantly while marginally improving robustness to common and adversarial perturbations. We further proposed soft-randomization for increasing the robustness to input variability by matching the output distribution of the student on noisy data to the output distribution of the teacher on clean data. This significantly increases the capacity of the student to learn robust features. We showed that it improves the adversarial robustness of the student by a significant margin at lower noise intensities compared to the model trained alone with Gaussian data augmentation, while also reducing the drop in generalization. Finally, we introduced messy-collaboration, which imposes target variability in different stages of the knowledge distillation framework. We showed the intriguing effect of target variability on increasing the adversarial robustness by an order of magnitude in addition to improving the generalization. We further showed the effectiveness of messy-collaboration in learning with noisy labels. The extensive empirical results demonstrate that injecting noises which increase the trial-to-trial variability in the knowledge distillation framework is a promising direction towards training compact models with improved generalization and robustness.
Appendix A Appendix
In this section we provide details for the methods relevant to our study.
A.1.1 Knowledge Distillation
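hinton2015distilling proposed to use the final softmax function with a raised temperature and to use the smoothed logits of the teacher as soft targets for the student. The method involves minimizing the Kullback–Leibler divergence between the smoothed output probabilities:
$$\mathcal{L} = (1-\alpha)\,\mathcal{L}_{CE}\big(\mathrm{softmax}(z_s),\, y\big) + \alpha\,\tau^{2}\,D_{KL}\big(\mathrm{softmax}(z_t/\tau)\,\|\,\mathrm{softmax}(z_s/\tau)\big)$$
where $\mathcal{L}_{CE}$ denotes the cross-entropy loss, $\mathrm{softmax}$ denotes the softmax function, $z_s$ the student output logits, $z_t$ the teacher output logits, and $\tau$ and $\alpha$ are the hyperparameters which denote temperature and balancing ratio, respectively.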
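A single-sample sketch of this distillation objective is given below. It assumes the common weighting in which the cross-entropy term is scaled by 1 - alpha and the distillation term by alpha times tau squared, with the KL divergence taken from the teacher to the student distribution; the function names and the example values of alpha and tau are illustrative only.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def hinton_kd_loss(student_logits, teacher_logits, label, alpha, tau):
    # Hard-label term at temperature 1.
    cross_entropy = -math.log(softmax(student_logits)[label])
    # Soft-target term at raised temperature; tau**2 compensates for the
    # 1/tau**2 scaling of the soft-target gradients.
    distill = kl_divergence(softmax(teacher_logits, tau), softmax(student_logits, tau))
    return (1.0 - alpha) * cross_entropy + alpha * tau ** 2 * distill


loss = hinton_kd_loss([0.0, 0.0, 0.0], [5.0, 0.0, 0.0], label=0, alpha=0.9, tau=4.0)
```

Raising the temperature flattens both distributions, so the relative probabilities the teacher assigns to the incorrect classes contribute more to the student's gradient.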
A.1.2 Out-of-Distribution Generalization
Neural networks tend to generalize well when the test data comes from the same distribution as the training data [deng2009imagenet, he2015deep]. However, models in the real world often have to deal with some form of domain shift, which adversely affects their generalization performance [shimodaira2000improving, moreno2012unifying, kawaguchi2017generalization, liang2017enhancing]. Therefore, test-set performance alone is not the optimal metric for evaluating the generalization of the models in the test environment. To measure the out-of-distribution performance, we use the ImageNet images from the CINIC dataset [darlow2018cinic]. CINIC contains 2100 images randomly selected from the ImageNet dataset for each of the CIFAR-10 categories. Hence, the performance of models trained on CIFAR-10 on these 21000 images can be considered an approximation of a model’s out-of-distribution generalization.
A.1.3 Adversarial Robustness
Deep Neural Networks have been shown to be highly vulnerable to carefully crafted, imperceptible perturbations designed by an adversary to fool the network [szegedy2013intriguing, biggio2013evasion]. This vulnerability poses a real threat to the deployment of deep learning models in the real world [kurakin2016adversarial]. Robustness to these adversarial attacks has therefore gained a lot of traction in the research community, and progress has been made both in better evaluating robustness to adversarial attacks [goodfellow2014explaining, moosavi2016deepfool, carlini2017towards] and in defending models against these attacks [madry2017towards, zhang2019theoretically].
To evaluate the adversarial robustness of models in this study, we use the Projected Gradient Descent (PGD) attack from kurakin2016adversarial. The PGD-N attack initializes the adversarial image with the original image plus random noise within some epsilon bound, $\epsilon$. At each of the N steps, it takes the gradient of the loss with respect to the input image, moves in the direction that increases the loss with step size $\eta$, and then clips the result to the epsilon bound and to the range of valid images:
$$x'_{0} = x + \mathcal{U}(-\epsilon, \epsilon), \qquad x'_{t+1} = \Pi_{\epsilon}\Big(x'_{t} + \eta\,\mathrm{sign}\big(\nabla_{x}\mathcal{L}(x'_{t}, y)\big)\Big)$$
where $\epsilon$, $\eta$ and $x$ denote the epsilon bound, step size and original image. The projection operator $\Pi_{\epsilon}$ denotes element-wise clipping, with $x'_{t}$ clipped to the range $[x-\epsilon,\, x+\epsilon]$ and to the valid data range. In all of our experiments, we conduct 10-, 15- and 20-step PGD attacks with an epsilon of 0.031, a step size of 0.03 and 5 random initializations.
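The attack loop can be sketched on a toy linear softmax classifier, for which the input gradient of the cross-entropy loss has a closed form. Everything below is illustrative: the classifier, its weights and the function names are hypothetical, the random start is assumed to be uniform within the epsilon bound, the update is assumed to use the sign of the gradient, and the valid data range is taken to be [0, 1] following the image normalization in the experimental setup. Only a single random initialization is shown.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def input_gradient(x, label, weights, bias):
    """Gradient of the cross-entropy loss w.r.t. x for logits = W x + b."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    error = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(error[k] * weights[k][i] for k in range(len(weights)))
            for i in range(len(x))]


def clip(value, low, high):
    return min(max(value, low), high)


def pgd_attack(x, label, weights, bias, epsilon, step_size, num_steps, rng):
    # Random start inside the epsilon ball, kept within the valid range [0, 1].
    adv = [clip(xi + rng.uniform(-epsilon, epsilon), 0.0, 1.0) for xi in x]
    for _ in range(num_steps):
        grad = input_gradient(adv, label, weights, bias)
        # Signed gradient ascent on the loss; (g > 0) - (g < 0) is sign(g).
        stepped = [a + step_size * ((g > 0) - (g < 0)) for a, g in zip(adv, grad)]
        # Project back into the epsilon ball, then into the valid range.
        adv = [clip(clip(s, xi - epsilon, xi + epsilon), 0.0, 1.0)
               for s, xi in zip(stepped, x)]
    return adv


weights = [[1.0, -1.0], [-1.0, 1.0]]  # hypothetical two-class linear model
bias = [0.0, 0.0]
clean = [0.5, 0.5]
adversarial = pgd_attack(clean, 0, weights, bias, epsilon=0.031, step_size=0.03,
                         num_steps=20, rng=random.Random(0))
```

In an evaluation that reports the worst case over several random initializations, this function would be run once per initialization and the lowest resulting accuracy kept.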
A.1.4 Natural Robustness
While robustness to adversarial attacks is important from a security perspective, it is an instance of worst-case distribution shift. The model also needs to be robust to naturally occurring perturbations, which it will encounter frequently in the test environment. Recent works have shown that Deep Neural Networks are also vulnerable to commonly occurring perturbations in the real world which are far from the adversarial examples manifold. hendrycks2019natural curated a set of real-world, unmodified and naturally occurring examples that cause classifier accuracy to degrade significantly. gu2019using measured models’ robustness to the minute transformations found across video frames, which they refer to as natural robustness, and found state-of-the-art classifiers to be brittle to these transformations. In our study, we use robustness to the common corruptions and perturbations proposed by hendrycks2019benchmarking in CIFAR-C as a proxy for natural robustness.
A.1.5 Trade-off between Generalization and Adversarial Robustness
While making our models robust to adversarial attacks, we need to be careful not to overemphasize robustness to norm-bounded perturbations. We also need to rigorously test the effect of methods designed to increase adversarial robustness on a model’s in-distribution and out-of-distribution generalization, as well as on its robustness to naturally occurring perturbations and distribution shift. Recent studies have highlighted the adverse effect of adversarial training on natural robustness. ding2019sensitivity showed that even semantics-preserving transformations of the input data distribution significantly degrade the performance of adversarially trained models but only slightly affect the performance of standard trained models. yin2019fourier showed that adversarially trained models improve robustness to mid- and high-frequency perturbations, but at the expense of low-frequency perturbations, which are more common in the real world. Furthermore, in the adversarial literature, a number of studies have shown an inherent trade-off between adversarial robustness and generalization [tsipras2018robustness, ilyas2019adversarial, zhang2019theoretically].
A.2 Additional Experiments
A.2.1 Signal-dependent Noise
Here, we add signal-dependent noise to the output logits of the teacher. For each sample, we add zero-mean Gaussian noise whose variance is proportional to the output logits of that sample.
We systematically study the effect of increasing the intensity of the signal-dependent noise. Figure 11 shows that for noise levels up to 0.1, the random signal-dependent noise improves the in-distribution generalization over the Hinton method. However, it impairs the out-of-distribution generalization to CINIC-ImageNet. Figure 11 and Figure 12 show a slight increase in the adversarial robustness and natural robustness of the models.
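One possible reading of this noise model is sketched below. The exact parameterization is an assumption: the variance of the noise added to each logit is taken to be the noise level times the absolute value of that logit, where the absolute value is introduced here only to keep the variance non-negative. The function name is hypothetical.

```python
import math
import random


def add_signal_dependent_noise(teacher_logits, noise_level, rng):
    """Add zero-mean Gaussian noise with variance noise_level * |logit| (assumed form)."""
    noisy = []
    for z in teacher_logits:
        std = math.sqrt(noise_level * abs(z))
        noisy.append(z + rng.gauss(0.0, std))
    return noisy


teacher_logits = [4.0, -1.0, 0.0, 2.5]
noisy_logits = add_signal_dependent_noise(teacher_logits, noise_level=0.1,
                                          rng=random.Random(0))
```

Under this form, logits of larger magnitude are perturbed more strongly and a logit of exactly zero is left unchanged.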
muller2019does reported that when the teacher is trained with label smoothing, the knowledge distillation to the student is impaired and the student performs worse. In contrast, for lower levels of noise, our method improves the effectiveness of the knowledge distillation process. Signal-dependent noise differs from their approach in that we train the teacher without any noise and add noise to its softened logits only when distilling knowledge to the student.
A.2.2 Random Swapping
To exploit the uncertainty of the teacher for a sample, we inject noise by randomly swapping softened softmax logits. With a given probability, we swap two logits if the difference between them is below a threshold.
We propose two variants of random swapping:
Swap Top 2: Swap the top two logits if the difference between them is below the threshold.
Swap All: Consider all consecutive pairs sorted by the logit value and swap the pairs if the difference is below the threshold value.
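Both variants can be sketched as follows. Several details are assumptions: in Swap All, "consecutive pairs" is read as disjoint pairs in rank order (ranks 1 and 2, ranks 3 and 4, and so on), and the swap probability is applied independently to each pair. The function names are hypothetical.

```python
import random


def swap_top2(logits, threshold, swap_prob, rng):
    """Swap the two largest logits if they are closer than `threshold`."""
    out = list(logits)
    order = sorted(range(len(out)), key=lambda i: out[i], reverse=True)
    first, second = order[0], order[1]
    if out[first] - out[second] < threshold and rng.random() < swap_prob:
        out[first], out[second] = out[second], out[first]
    return out


def swap_all(logits, threshold, swap_prob, rng):
    """Swap each rank-ordered pair whose logits are closer than `threshold`."""
    out = list(logits)
    order = sorted(range(len(out)), key=lambda i: out[i], reverse=True)
    # Assumed pairing: disjoint pairs (rank 1, rank 2), (rank 3, rank 4), ...
    for high, low in zip(order[0::2], order[1::2]):
        if out[high] - out[low] < threshold and rng.random() < swap_prob:
            out[high], out[low] = out[low], out[high]
    return out


rng = random.Random(0)
top2 = swap_top2([3.0, 2.9, 0.0], threshold=0.5, swap_prob=1.0, rng=rng)
every = swap_all([3.0, 2.9, 1.0, 0.95], threshold=0.5, swap_prob=1.0, rng=rng)
```

Restricting swaps to logits that are already close limits the noise to the classes the teacher itself finds hard to separate.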
A.3 Training Scheme for Fickle Teacher
Because of the variability in the teacher, the student needs to be trained for more epochs in order for it to converge and be effectively exposed to the uncertainty of the teacher.
We used the same initial learning rate of 0.1 and decay factor of 0.2 as in the standard training scheme. For dropout rates of 0.1 and 0.2, we train for 250 epochs and reduce the learning rate at epochs 75, 150 and 200. For a dropout rate of 0.3, we train for 300 epochs and reduce the learning rate at epochs 90, 180 and 240. Finally, for dropout rates of 0.4 and 0.5, due to the increased variability, we train for 350 epochs and reduce the learning rate at epochs 105, 210 and 280.
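The schedules above can be written down as a small lookup table with a step-decay rule. The names are hypothetical, and applying each decay at the start of the listed epoch (counting epochs from zero) is an assumption.

```python
# dropout rate -> (total epochs, epochs at which the learning rate is decayed)
FICKLE_TEACHER_SCHEDULES = {
    0.1: (250, (75, 150, 200)),
    0.2: (250, (75, 150, 200)),
    0.3: (300, (90, 180, 240)),
    0.4: (350, (105, 210, 280)),
    0.5: (350, (105, 210, 280)),
}


def learning_rate(epoch, milestones, base_lr=0.1, decay=0.2):
    """Step decay: multiply base_lr by `decay` once per milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** drops


total_epochs, milestones = FICKLE_TEACHER_SCHEDULES[0.3]
rates = [learning_rate(epoch, milestones) for epoch in range(total_epochs)]
```

In every schedule listed here, the three decay points fall at 30%, 60% and 80% of the total number of epochs.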