Deep convolutional neural networks (DNNs) have achieved state of the art performance in many visual recognition tasks[krizhevsky2012imagenet, voulodimos2018deep]. However, the state of the art performance comes at the cost of training computationally expensive and memory-intensive networks with large depth and/or width with tens of millions of parameters and requires billions of operations per inference. This hinders their deployment in resource-constrained devices or in applications with strict latency requirements such as self driving cars and hence leads to a necessity for developing compact networks that generalize well.
Several model compression techniques have been proposed such as model quantization [zhou2017stochastic], model pruning [han2015deep], and knowledge distillation [hinton2015distilling], each of which has its own set of advantages and drawbacks [cheng2017survey]. Amongst these techniques, our study focuses on knowledge distillation (KD) as it involves training a smaller compact network (student) under the supervision of a larger pre-trained network or an ensemble of models (teacher) in an interactive manner which is more similar to how humans learn. KD has proven to be an effective technique for training a compact model and also provides greater architectural flexibility since it allows structural differences in the teacher and student. Furthermore, since it provides a training paradigm instead of a method for compressing a trained model, we aim to analyze its characteristics as a general-purpose training framework beyond just model compression.
In the original formulation, Hinton et al. [hinton2015distilling] proposed mimicking the softened softmax output of the teacher. Since then, a number of KD methods have been proposed, each trying to capture and transfer some characteristic of the teacher such as the representation space, decision boundary or intra-data relationship. However, a clear understanding of where knowledge resides in a deep neural network is still lacking and consequently an optimal method of capturing knowledge from teacher and transferring it to student remains an open question. Furthermore, it is plausible to argue that the effectiveness of the distillation approach is dependent upon a number of factors: the capacity gap between the student and teacher, the nature and degree of the constraint put on student training, and the characteristic of the teacher mimicked by the student. Hence, it is important to extensively study the effectiveness and versatility of different KD methods capturing different aspects of the teacher under a uniform experimental setting to gain further insights.
Furthermore, the standard training procedure with the cross-entropy loss on one-hot-encoded labels has a number of shortcomings. Over-parameterized DNNs trained with standard cross-entropy loss have been shown to have the capacity to fit any random labeling of data which makes it challenging to learn efficiently under label noise[hu2019understanding, arpit2017closer]johnson2019survey]. We hypothesize that one key factor contributing to these inefficiencies is that the only information the model receives about the classes is the one-hot-encoded labels and therefore it fails to capture additional information about the structural similarities between different classes. The additional supervision in KD about the relative probabilities of secondary classes and/or relational information between data points can be useful in increasing the efficacy of the network to learn under label noise and class imbalance. To test the hypothesis we simulate varying degrees of label noise and class imbalance and show how different KD methods perform under these challenging situations.
Our empirical results demonstrate that the effectiveness of the KD framework goes beyond just model compression and should be further explored as an effective general-purpose learning paradigm.
Our main contributions are as follows:
A study of nine different KD methods on different datasets (CIFAR-10 and CIFAR-100) and network architectures (ResNets and WideResNets) with varying degree of capacity gaps between the teacher and student.
Demonstrate KD as an effective approach for learning under label noise and class imbalance.
Provide insights into the effectiveness of mimicking different aspects of teacher.
Ii Knowledge Distillation Methods
KD aims to improve the performance of the student by providing additional supervision from the teacher. A number of methods have been proposed to decrease the performance gap between student and teacher. However, what constitutes knowledge and how to optimally transfer it from the teacher is still an open question. Here, we cover a diverse set of KD methods which differ from each other with respect to how knowledge is defined and transferred from the teacher. To highlight the subtle differences among the distillation methods used in the study, we present a broad categorization of these methods.
aims to mimic the output of the teacher. The key idea is that student can be trained to generalize the same way as the teacher by using the output probabilities produced by the teacher as a "soft target". The relative probabilities of the incorrect labels encode a rich similarity structure over the data and holds information about how the teacher generalizes. Bucilua et al. [bucilua2006model]
originally proposed to use the logits instead of the probabilities as target for the student and to minimize the squared difference. Hinton et al.[hinton2015distilling] built upon the previous work and proposed to raise the temperature of the final softmax function and minimize the Kullback–Leibler (KL) divergence between the smoother output probabilities. BSS [heo2019knowledge] proposed a method for more explicitly matching the decision boundary by utilizing an adversarial attack to discover samples supporting a decision boundary and use an additional boundary supporting loss which encourages the student to match the output of the teacher on samples close to the decision boundary. Response distillation can be seen as an implicit method for matching the decision boundaries of the student and the teacher.
Representation Space Distillation
aims to mimic the latent feature space of the teacher. FitNet [romero2014fitnets] introduced intermediate-level hints from the teacher’s hidden layers to guide the training process of the student. This encourages the student to learn an intermediate representation that is predictive of the intermediate representations of the teacher network. Instead of mimicking the intermediate layer activation values, which can be viewed as putting a hard constraint on the student training, FSP [yim2017gift] proposed to capture the transformation of features between the layers. The method encourages the student to mimic the teacher’s flow matrices, which are derived from the inner product between feature maps in two layers. Both FitNet and FSP employ a two-stage training procedure whereby the first stage involves mimicking the representation of the teacher for better initialization of the student and the second stage involves training the entire model with Hinton’s method. AT [zagoruyko2016paying] proposed attention as a mechanism of transferring knowledge. They define attention as a set of spatial maps that encode on which spatial areas of the input, the network focuses most for taking its output decision based on the activation values. Compared to FitNet, AT and FSP can be considered as softer constraints on the student training. AT, unlike FitNet and FSP, puts a constraint on the student to mimic the representation space (i.e., attention maps) throughout the training procedure and is therefore perhaps more representative of the representation space distillation.
Relational Knowledge Distillation
aims to mimic the structural relations between the learned representation of the teacher using the mutual relations of data samples in the teacher’s output representation. RKD [park2019relational] emphasizes that knowledge within a neural network is better captured by the relations of the learned representation than the individuals of those. Their distillation approach trains the student to form the same relational structure with that of the teacher in terms of two variants of relational potential functions: Distance-wise potential, RKD-D, measure the Euclidean distance between two data samples in the output representation space and angle wise potential, RKD-A, measures the angle formed by the three data samples in the output representation space. RKD-DA combines both of these losses to train the student. SP [tung2019similarity] encourages the student to preserve the pairwise similarities in the teacher in such a way that data pairs that produce similar/dissimilar activations in the teacher, also produce similar/dissimilar activations in the student. SP can be viewed as an instance of RKD angle wise potential variant and differs from RKD-A in that, it uses the dot product i.e. cosine angle between pairs of data. Relational knowledge distillation does not require the student to mimic the representation space of the teacher and hence provides more flexibility to the student.
Online Knowledge Distillation
aims to circumvent the need for a static teacher and the associated computational cost. Deep Mutual Learning (DML) [zhang2018deep]
replaces the one way knowledge transfer from a pretrained model with knowledge sharing between a cohort of compact models trained collaboratively. DML involves training each student with two losses: a conventional supervised learning loss, and a mimicry loss that aligns each student’s class posterior with the class probabilities of other students. To address the limitation of lacking a high capacity model and the need for training multiple student networks, ONE[lan2018knowledge] uses a single multi-branch network and uses an ensemble of the branches as a stronger teacher to assist the learning of the target network.
|Training Scheme||Distillation Layer||Parameters|
|ONE||As per paper||N/A||T=3|
|DML||Standard||Output Layer||T=4, Cohort size = 2|
Iii Experimental Setup
For our empirical analysis, we perform our experiments on CIFAR-10 and CIFAR-100 datasets [krizhevsky2010cifar] with ResNet [he2015deep] and Wide Residual Networks (WRN) [zagoruyko2016wide]
and evaluate the efficiency of the different distillation methods on varying capacity gaps between the teacher and student. To have a fair comparison between the different KD methods, we use a consistent set of hyperparameters and training scheme where possible. Unless otherwise stated, for all our experiments we use the following training scheme as used in Zagoruyko et al.[zagoruyko2016paying]
: normalize the images between 0 and 1; random horizontal flip and random crop data augmentations with reflective padding of 4; Stochastic Gradient Descent with 0.9 momentum; 200 epochs; batch size 128; and an initial learning rate of 0.1, decayed by a factor of 0.2 at epochs 60, 120, and 150. For the network initialization stage required for FitNet and FSP, we use 100 epochs, and an initial learning rate of 0.001, decayed by a factor of 0.1 at epochs 60 and 80. TableI
provides the details of the training parameters for each distillation method. We train all the models for 5 different seed values. For the teacher, we select the model with the highest test accuracy and then use it to train the student for five different seed values and report the mean value for our evaluation metrics. We use the classification branch, and student model with the highest accuracy on the test dataset for ONE and DML methods respectively for each seed. For the online distillation methods, ONE has the student architecture with an ensemble of three classification branches providing teacher supervision and DML uses mutual learning between two models with the same student architecture. For all the other KD methods, we use WRN-40-2 and ResNet-26 as the teacher for the WRN and ResNet student models respectively. Note, that for the methods which report the results for the same student and teacher network configuration[zagoruyko2016paying, heo2019knowledge, tung2019similarity], our rerun under the aforementioned experimental setup achieves superior results than reported in the original work.
Iv Empirical Analysis
The aim of the study is manifold: a) provide extensive analysis of how the underlying mechanisms of different KD methods affect the generalization performance of the student under uniform experimental conditions. b) Demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. c) Highlight the efficacy of KD framework as a general-purpose training framework which provides additional benefits over model compression. To this end, section IV-A compares the performance of the KD methods across different datasets and architectures. Section IV-B evaluates the efficiency of KD methods under various degrees of label noise. Section IV-C further evaluates the performance on imbalanced datasets. Finally, section IV-D studies the transferability of adversarial examples between the student models trained with different KD methods.
It is important to note that the goal of our study is not to rank different KD methods based on the test set performance, but instead to provide a comprehensive evaluation of the methods and find general patterns and characteristics of the model to aid the design of efficient distillation methods. In addition, we aim to showcase the efficacy of the KD framework as a robust general-purpose learning paradigm.
Iv-a Generalization Performance
KD aims to minimize the generalization gap between the teacher and the student. Therefore, the generalization gain over the baseline (a model trained without teacher supervision) is a key metric for evaluating the effectiveness of a KD method. Tables III and IV demonstrate the effectiveness and versatility of the different KD methods in improving the generalization performance of the student on CIFAR-10 and CIFAR-100 datasets, respectively. For the majority of the methods, we see generalization gain over the baseline.
Iv-A1 Results on CIFAR-10
Amongst the response distillation methods, Hinton consistently improves the performance of the student over the baseline and proves effective across varying degrees of capacity gaps. BSS is effective for lower capacity students, providing the highest generalization for WRN-10-2, but it adversely affects the performance as the capacity of the model increases.
For the representation space distillation methods, FitNet and FSP provide similar performance to Hinton owing to the two-stage training scheme whereby the second stage is essentially the Hinton method. Both of these methods improve the generalization over the baseline consistently, however occasionally failing to outperform the Hinton method. AT, fails to improve over Hinton for the ResNet models, even decreasing the performance over the baseline for ResNet-8. For higher capacity WRN models, however, AT proves to be much more effective, and is amongst top-performing techniques for WRN-16-2 and WRN-28-2.
Relational knowledge distillation methods also show promising results, with SP and RKD-A amongst the top performing models for ResNet-20, WRN-16-2 and higher capacity variants. RKD-DA provides the highest generalization for WRN-40-2 and close to the highest value for ResNet-26. However, the effectiveness of these methods drops substantially as the capacity gap between the teacher and student increases e.g. for ResNet-8, all the relational distillation methods decreases the performance over the baseline.
Lastly, online distillation demonstrates comparable effectiveness in improving the performance of the baseline model to their traditional counterparts which uses a static teacher model for supervision. ONE consistently improves the generalization of the baseline for ResNet models and provides the highest performance for ResNet-8 and ResNet-26. However, for WRN, it only manages to improve the performance for the higher capacity WRN-40-2 model. DML consistently provides generalization gains over baseline across all the models. The results of ONE (on ResNet) and DML are promising given that they do not have the advantage of supervision from a high performing teacher model as for the other distillation methods.
Iv-A2 Results on CIFAR-100
The performance of the KD methods on CIFAR-100 follows a similar pattern observed on CIFAR-10 with a few deviations. Response distillation methods are among the most efficient methods across the different models. Hinton improves the generalization for all the models except WRN-10-2. BSS again proves to be effective on lower capacity models, providing the highest generalization for WRN-10-2. Similar to CIFAR-10, FitNet and FSP provide similar performance to Hinton, and AT is detrimental for lower capacity models. Relational knowledge distillation methods exhibit similar behavior as for CIFAR-10, proving more effective when the capacity gap is less. SP provides the highest generalization for ResNet-14 and higher capacity ResNet models. Online distillation methods also exhibit similar behavior, with ONE proving to be effective on ResNet but failing to improve the performance on WRN except for WRN-40-2. DML, on the other hand, is particularly effective on WRN, providing the highest generalization gains on WRN-16-2 and higher capacity WRN models.
Iv-A3 Key Insights
From the empirical study, we derive the following insights, which can provide some guidelines for designing effective KD methods.
KD framework is an effective technique which consistently provides generalization gains. The methods are generally effective and versatile enough to cater to different datasets and network architectures even for the higher capacity gap between the student and teacher. KD is not only effective for model compression but also improves the performance of the model when the student and the teacher have the same network architecture.
The original Hinton method, while simple, is quite effective and versatile, providing comparable performance gains to the recently proposed distillation methods. This shows the utility of response distillation. It also provides more architectural flexibility between the student and the teacher.
The performance of relational knowledge distillation methods provides a compelling case for the effectiveness of using the relations of the learned representations for KD. The comparison between SP and RKD-D is particularly interesting since both methods use pair-wise similarities at the same network layers. SP captures the angular relationship between the two vectors whereas RKD-D uses euclidean distance. RKD-A measures the angle formed between triplets of data points and provides similar performance to SP. The results suggest that angular information can capture higher-level structure which aids in a performance gain.
Considering AT as a better representative for representation space distillation, the constraints put by these methods on the learned representation can be detrimental to the performance of the student when the capacity gap is high. For a substantially lower capacity model, mimicking the representation space of the teacher with much higher capacity might be difficult and perhaps not the optimal approach. Generally, we observe that the methods which provide more flexibility to the student in learning e.g. response distillation and relational KD methods are more versatile and can provide higher performance gains.
The occasional performance gains with FSP and FitNet over Hinton, highlights the importance of better model initialization.
Online distillation is a promising direction which removes the necessity of having a large pre-trained teacher for supervision and instead relies on mutual learning between a cohort of student models collectively supervising each other. This highlights the effectiveness of collaborative learning in improving the generalization of the models.
Iv-B Label Noise
Much of the success of the supervised learning methods can be attributed to the availability of huge amounts of high-quality annotations [deng2009imagenet]
. However, real-world datasets often contain a certain amount of label noise arising from the difficulty of manual annotation. Furthermore, to leverage the large amount of open-sourced data, automated label generation methods[tsai2008automatically, mozafari2014scaling] are employed which make use of user tags and keywords which inherently leads to noisy labels. The performance of the model is significantly affected by label noise [sukhbaatar2014training, zhang2016understanding] and studies have shown that DNNs have the capacity to memorize noisy labels [arpit2017closer]. It is therefore pertinent for the training procedure to be more robust to label noise and efficiently learn under noisy supervision.
One reason for the failure of standard training is that the only supervision the model receives is the one-hot-labels. Therefore, when the ground-truth label is incorrect, the model doesn’t receive any other useful information to learn from. In KD on the other hand, in addition to the ground truth label, the model receives supervision from the teacher, e.g. the soft probabilities in case of Hinton provides useful information about the relative probabilities amongst the classes. Similarly, in online distillation, the consensus between different students provides extra supervision. We hypothesize that these extra supervision signals in the KD framework can mitigate the adverse effect of incorrect ground truth labels.
To test our hypothesis, we simulate label corruption on CIFAR-10 dataset whereby for each training image, we corrupt the true label with a given probability (referred to as noise rate,
) to a randomly chosen class sampled from a uniform distribution on the number of classes[hendrycks2019using]. We test the robustness of the various KD methods for different noise rates, . We use WRN-16-2 as student and WRN-40-2 as teacher and follow the same training procedure (Table I).
Table V shows that majority of the KD methods improve the generalization of the student trained under varying degrees of label corruption over the baseline. Hinton proves to be an effective method for learning under label noise and provides substantial gains in performance. For a lower noise level (0.2), it provides the highest generalization. As observed in section IV-A, FitNet and FSP show similar behavior to Hinton. Amongst the online distillation methods, ONE improves the generalization over the baseline for lower noise levels but significantly harms the performance for higher noise level (0.6). DML is amongst the more effective methods and provides performance comparable to Hinton. The performance of these methods demonstrate the effectiveness of soft-targets in providing useful information about the data points with corrupted labels and mitigating the adverse effect of noisy labels. Among the relational knowledge distillation methods, SP interestingly performs markedly better for lower noise levels, which suggests that the pairwise angular information can be useful in learning efficiently under label noise. The results show that the efficacy of KD extends beyond model compression and offers a general-purpose learning framework which is more robust to noisy labels prevalent in real-world datasets than standard training.
Iv-C Class Imbalance
In addition to noisy labels, high class imbalance is naturally inherent in many real-world applications wherein, the dataset is not uniformly distributed and some classes are more abundant than others [van2018inaturalist]. Models train with standard training exhibit bias towards the prevalent classes at the expense of minority classes. One drawback of the standard training is that the model receives no information about a particular class other than the data points belonging to it. The model does not receive any information about the similarities between data points of different classes which can be useful in learning better representation for the minority classes. KD framework, on the contrary, provides additional relational information between the different classes, e.g. the relative probabilities of each class provided as soft targets or the pairwise similarities between data points belonging to different classes. We hypothesize that this additional relational information can be useful in learning the minority classes better.
To test our hypothesis, we simulate varying degrees of class imbalance on the CIFAR-10 dataset. We follow the analysis performed in [hendrycks2019using] whereby similar to Dong et al.[dong2018imbalanced], we employ the power law model to simulate class imbalance. The number of training samples for a class as follows, , where is the integer rounding function, represents an imbalance ratio, and are offset parameters to specify the largest and smallest class sizes. The training data becomes a power law class distribution as the imbalance ratio decreases. We test the performance of the KD methods on varying degrees of imbalance; and () are set so that the maximum and minimum class counts are 5000 and 250 respectively.
Table VI shows that apart from ONE and BSS (), all the other KD methods improve the generalization performance of the baseline model. In particular, RKD-A and RKD-DA provide the highest generalization across all the values followed by Hinton and DML. The empirical results show KD methods offer a more effective approach for training models on imbalanced datasets. We emphasize that the efficacy of KD goes much beyond a model compression technique and it should be considered as a general-purpose training paradigm which offers more robustness to common challenges in the real-world datasets compared to the standard training procedure.
To further study the characteristics of the different distillation methods and analyze how well they are able to mimic the teacher, we propose to use the transferability [tramer2017space] of adversarial examples [szegedy2013intriguing] as an approximation of the similarity of the decision boundary and representation space of the two models. For each method, we first generate adversarial examples using the Projected Gradient Descent (PGD) method [madry2017towards] with , and . Then we perform a more fine-grained evaluation, by conducting the attack only on the subset of the test set which is correctly classified by both the source and target model and then reporting the success rate of the attack. Higher transferability values indicate a higher level of similarity between the two models.
Figure 1 (last column) shows the transferability of the adversarial examples generated on the teacher to students trained with different distillation methods. AT achieves significantly higher transferability compared to other KD methods, suggesting that mimicking the latent representation space enables the student to be more similar to the teacher. AT is followed by BSS which explicitly attempts to match the decision boundary of the teacher and therefore has much higher transferability compared to Hinton. Relational Knowledge Distillation methods provide higher transferability than Hinton, FSP and FitNet even though they do not explicitly mimic the representation space or the decision boundary which suggest that maintaining the relations between data points implicitly causes the internal representation to be similar. The comparatively lower values for Hinton and related methods (FitNet and FSP) might be explained by the effect of the Hinton method in improving the robustness of the model [papernot2015distillation].
Figure 1 further shows the transferability of adversarial examples generated for each of the KD methoda to the other KD methods. The adversarial examples generated with AT, on average, provide the highest transferability to the other methods. This might be explained by its higher level of similarity to the teacher. Relational knowledge distillation methods achieve higher transferability whereas the methods which include the Hinton loss (response distillation methods, FitNet, and FSP) provide significantly lower transferability. On the other hand, these methods show more robustness to black-box attacks generated by the other methods.
In this study, we provided an extensive evaluation of nine different knowledge distillation methods and demonstrated the effectiveness and versatility of the KD framework. We studied the effect of mimicking different aspects of the teacher on the performance of the model and derived insights to guide the design of more effective KD methods. We further showed the effectiveness of the KD framework in learning efficiently under varying degrees of label noise and class imbalance. Our study emphasizes that knowledge distillation should not only be considered as an efficient model compression technique but rather as a general-purpose training paradigm that offers more robustness to common challenges in the real-world datasets compared to the standard training procedure.