Hinton et al. (2015) proposed Knowledge Distillation (KD) to transfer knowledge from one neural network (teacher) to another (student). Usually, the teacher model has strong learning capacity and higher performance, and teaches a lower-capacity student model by providing “soft targets”. It is commonly believed that the soft targets of a teacher model can transfer “dark knowledge” containing privileged information on the similarity among different categories (Hinton et al., 2015) to enhance the student model.
In this work, we first examine this common belief through the following exploratory experiments: (1) let student models teach teacher models by transferring the soft targets of the students; (2) let poorly-trained teacher models with worse performance teach students. Based on the common belief, it is expected that the teacher model would not be enhanced significantly via training from the students, and that poorly-trained teachers would not enhance the students, as the weak student and poorly-trained teacher models cannot provide reliable similarity information between categories. However, after extensive experiments on various models and datasets, we surprisingly observe contradictory results: the weak student can improve the teacher, and the poorly-trained teacher can also enhance the student remarkably. Such intriguing results motivate us to interpret KD as a regularization term, and we re-examine knowledge distillation from the perspective of Label Smoothing Regularization (LSR) (Szegedy et al., 2016), which regularizes model training by replacing the one-hot labels with smoothed ones.
We then theoretically analyze the relationship between KD and LSR. For LSR, by splitting the smoothed label into two parts and examining the corresponding losses, we find the first part is the ordinary cross-entropy between the ground-truth distribution (one-hot label) and the outputs of the model, while the second part, surprisingly, corresponds to a virtual teacher model that provides a uniform distribution to teach the model. For KD, by combining the teacher’s soft targets with the one-hot ground-truth label, we find that KD is a learned LSR, where the smoothing distribution of KD comes from a teacher model while the smoothing distribution of LSR is manually designed. In a nutshell, we find KD is a learned LSR and LSR is an ad-hoc KD. This relationship can explain the above counterintuitive results—the soft targets from weak student and poorly-trained teacher models can effectively regularize the model training, even though they lack strong similarity information between categories. We therefore argue that the similarity information between categories cannot fully explain the dark knowledge in KD; the soft targets from the teacher model indeed provide effective regularization for the student model, which is equally or even more important.
Based on the analyses, we conjecture that with non-reliable or even zero similarity information between categories from the teacher model, KD may still well improve the student models. We thus propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework with two implementations. The first one is to train the student model by itself (i.e., self-training), and the second is to manually design a targets distribution as a virtual teacher model which has 100% accuracy. The first method is motivated by replacing the dark knowledge with predictions from the model itself, and the second method is inspired by the relationships between KD and LSR. We validate through extensive experiments that the two implementations of Tf-KD are both simple yet effective. Particularly, in the second implementation without similarity information in the virtual teacher, Tf-KD still achieves comparable performance with normal KD, which clearly justifies:
dark knowledge does not just include the similarity between categories, but also imposes regularization on the student training.
Tf-KD applies well to scenarios where the student model is too strong for a suitable teacher model to be found, or where computational resources for training teacher models are limited. For example, if we take the cumbersome single model ResNeXt101 (32×8d) (Xie et al., 2017) as the student model (with 88.79M parameters and 16.51G FLOPs on ImageNet), it is hard or computationally expensive to train a stronger teacher model. We deploy our virtual teacher to teach this powerful student and achieve a 0.48% improvement on ImageNet without any extra computation cost. Similarly, when taking the powerful single model ResNeXt29 (8×64d) with 34.53M parameters as the student, our self-training implementation achieves more than a 1.0% improvement on CIFAR100 (from 81.03% to 82.08%).
Our contributions are summarized as follows:
- By designing two exploratory experiments, we find that knowledge distillation can be interpreted as a regularization method. We then provide theoretical analysis to reveal the relation between knowledge distillation and label smoothing regularization.
- We propose Teacher-free Knowledge Distillation (Tf-KD) by self-training or using a well-designed virtual teacher model. Tf-KD achieves performance comparable with normal knowledge distillation, which requires a superior teacher.
- Tf-KD can be deployed as a generic regularization method to train deep neural networks, achieving performance superior to label smoothing regularization on ImageNet-2012.
2 Exploratory Experiments and Counterintuitive Observations
To examine the common belief on dark knowledge in KD, we conduct two exploratory experiments:
- Standard knowledge distillation adopts a teacher to teach a weaker student. What if we reverse the operation? Based on the common belief, the teacher should not be improved significantly, because the student is too weak to transfer effective knowledge.
- If we use a poorly-trained teacher, which has much worse performance than the student, to teach the student, it is assumed to bring no improvement to the latter. For example, if a poorly-trained teacher with only 10% accuracy is adopted in an image classification task, the student would learn from its soft targets with 90% error, so the student should not be improved, or should even suffer worse performance.
We name the “student teaches teacher” setting Reversed Knowledge Distillation (Re-KD), and the “poorly-trained teacher teaches student” setting Defective Knowledge Distillation (De-KD) (Fig. 1). We conduct Re-KD and De-KD experiments on the CIFAR10, CIFAR100 and Tiny-ImageNet datasets with a variety of neural networks. For fair comparisons, all experiments are conducted with the same settings. Detailed implementation and experiment settings are given in Appendix A.1.
2.1 Reversed knowledge distillation
We conduct Re-KD experiments on the three datasets respectively. CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) contain natural RGB images of 32×32 pixels with 10 and 100 classes, respectively, and Tiny-ImageNet (https://tiny-imagenet.herokuapp.com/) is a subset of ImageNet (Deng et al., 2009) with 200 classes, where each image is down-sized to 64×64 pixels. For generality of the experiments, we adopt a 5-layer plain CNN, MobileNetV2 (Sandler et al., 2018) and ShuffleNetV2 (Ma et al., 2018) as student models, and ResNet18, ResNet50 (He et al., 2016), DenseNet121 (Huang et al., 2017) and ResNeXt29 (8×64d) as teachers. The results of Re-KD on the three datasets are given in Tabs. 1 to 3.
In Tab. 1, the teacher models are improved significantly by learning from students, especially the teacher models ResNet18 and ResNet50: the two teachers obtain more than 1.1% improvement when taught by MobileNetV2 and ShuffleNetV2. We observe similar results on CIFAR10 and Tiny-ImageNet. Comparing Re-KD (S→T) with Normal KD (T→S), we see that in most cases Normal KD achieves better results. It should be noted that Re-KD takes the teacher's accuracy as the baseline, which is much higher than that of Normal KD. However, in some cases, Re-KD outperforms Normal KD. For instance, in Tab. 2 (3rd row), the student model (plain CNN) can only be improved by 0.31% when taught by MobileNetV2, but the teacher (MobileNetV2) can be improved by 0.92% by learning from the student. We have similar observations for ResNeXt29 and ResNet18 (4th row in Tab. 2).
Table 1: Re-KD experiment results on CIFAR100 (test accuracy, %; ± denotes std over repeated runs).

| Teacher: baseline | Student: baseline | Normal KD (T→S) | Re-KD (S→T) |
| --- | --- | --- | --- |
| ResNet18: 75.87 | MobileNetV2: 68.38 | 71.05±0.16 (+2.67) | 77.28±0.28 (+1.41) |
| ResNet18: 75.87 | ShuffleNetV2: 70.34 | 72.05±0.13 (+1.71) | 77.35±0.32 (+1.48) |
| ResNet50: 78.16 | MobileNetV2: 68.38 | 71.04±0.20 (+2.66) | 79.30±0.11 (+1.14) |
| ResNet50: 78.16 | ShuffleNetV2: 70.34 | 72.15±0.18 (+1.81) | 79.43±0.39 (+1.27) |
| DenseNet121: 79.04 | MobileNetV2: 68.38 | 71.29±0.23 (+2.91) | 79.55±0.11 (+0.51) |
| DenseNet121: 79.04 | ShuffleNetV2: 70.34 | 72.32±0.25 (+1.98) | 79.83±0.05 (+0.79) |
| ResNeXt29: 81.03 | MobileNetV2: 68.38 | 71.65±0.41 (+3.27) | 81.53±0.14 (+0.50) |
| ResNeXt29: 81.03 | ResNet18: 75.87 | 77.84±0.15 (+1.97) | 81.62±0.22 (+0.59) |
Table 2: Re-KD experiment results on CIFAR10 (test accuracy, %; ± denotes std over repeated runs).

| Teacher: baseline | Student: baseline | Normal KD (T→S) | Re-KD (S→T) |
| --- | --- | --- | --- |
| ResNet18: 95.12 | Plain CNN: 87.14 | 87.67±0.17 (+0.53) | 95.33±0.12 (+0.21) |
| ResNet18: 95.12 | MobileNetV2: 90.98 | 91.69±0.14 (+0.71) | 95.71±0.11 (+0.59) |
| MobileNetV2: 90.98 | Plain CNN: 87.14 | 87.45±0.18 (+0.31) | 91.81±0.23 (+0.92) |
| ResNeXt29: 95.76 | ResNet18: 95.12 | 95.80±0.13 (+0.68) | 96.49±0.15 (+0.73) |
Table 3: Re-KD experiment results on Tiny-ImageNet (test accuracy, %).

| Teacher: baseline | Student: baseline | Normal KD (T→S) | Re-KD (S→T) |
| --- | --- | --- | --- |
| ResNet18: 63.44 | MobileNetV2: 55.06 | 56.70 (+1.64) | 64.12 (+0.68) |
| ResNet18: 63.44 | ShuffleNetV2: 60.51 | 61.19 (+0.68) | 64.35 (+0.91) |
| ResNet50: 67.47 | MobileNetV2: 55.06 | 56.02 (+0.96) | 67.68 (+0.21) |
| ResNet50: 67.47 | ShuffleNetV2: 60.51 | 60.79 (+0.28) | 67.62 (+0.15) |
| ResNet50: 67.47 | ResNet18: 63.44 | 64.23 (+0.79) | 67.89 (+0.42) |
We claim that while standard knowledge distillation can improve the performance of students on all datasets, the superior teacher can also be enhanced significantly by learning from a weak student, as the Re-KD experiments suggest.
2.2 Defective knowledge distillation
We conduct De-KD on CIFAR100 and Tiny-ImageNet. We adopt MobileNetV2 and ShuffleNetV2 as student models, and ResNet18, ResNet50 and ResNeXt29 (8×64d) as teacher models. The poorly-trained teachers are trained for 1 epoch (ResNet18) or 50 epochs (ResNet50 and ResNeXt29), yielding very poor performance. For example, ResNet18 obtains only 15.48% accuracy on CIFAR100 and 9.41% accuracy on Tiny-ImageNet after training for 1 epoch, and ResNet50 obtains 45.82% and 31.01% on CIFAR100 and Tiny-ImageNet, respectively, after training for 50 epochs (out of 200 epochs in total).
From the De-KD experiment results on CIFAR100 in Tab. 4, we observe that the student can be greatly promoted even when distilled by a poorly-trained teacher. For instance, MobileNetV2 and ShuffleNetV2 are promoted by 2.27% and 1.48% when taught by the one-epoch-trained ResNet18 with only 15.48% accuracy (2nd row). For the poorly-trained ResNeXt29 with 51.94% accuracy (4th row), we find ResNet18 can still be improved by 1.41%, and MobileNetV2 obtains a 3.14% improvement. From the De-KD experiment results on Tiny-ImageNet in Tab. 4, we find the poorly-trained ResNet18 with only 9.41% accuracy can still enhance the student MobileNetV2 by 1.16%. The other poorly-trained teachers are all able to enhance the students to some degree.
Table 4: De-KD experiment results on CIFAR100 and Tiny-ImageNet (accuracy, %; “Pt-Teacher” denotes the poorly-trained teacher).

| Dataset | Pt-Teacher: baseline | Student: baseline | De-KD |
| --- | --- | --- | --- |
| CIFAR100 | ResNet18: 15.48 | MobileNetV2: 68.38 | 70.65±0.35 (+2.27) |
| CIFAR100 | ResNet18: 15.48 | ShuffleNetV2: 70.34 | 71.82±0.11 (+1.48) |
| CIFAR100 | ResNet50: 45.82 | MobileNetV2: 68.38 | 71.45±0.23 (+3.09) |
| CIFAR100 | ResNet50: 45.82 | ShuffleNetV2: 70.34 | 72.11±0.09 (+1.77) |
| CIFAR100 | ResNet50: 45.82 | ResNet18: 75.87 | 77.23±0.11 (+1.23) |
| CIFAR100 | ResNeXt29: 51.94 | MobileNetV2: 68.38 | 71.52±0.27 (+3.14) |
| CIFAR100 | ResNeXt29: 51.94 | ResNet18: 75.87 | 77.28±0.17 (+1.41) |
| Tiny-ImageNet | ResNet18: 9.41 | MobileNetV2: 55.06 | 56.22 (+1.16) |
| Tiny-ImageNet | ResNet18: 9.41 | ShuffleNetV2: 60.51 | 60.66 (+0.15) |
| Tiny-ImageNet | ResNet50: 31.01 | MobileNetV2: 55.06 | 56.02 (+0.96) |
| Tiny-ImageNet | ResNet50: 31.01 | ShuffleNetV2: 60.51 | 61.09 (+0.58) |
To better demonstrate the distillation accuracy of a student taught by poorly-trained teachers at different accuracy levels, we save 9 checkpoints of ResNet18 and ResNeXt29 during their normal training process. Taking these checkpoints as teacher models to teach MobileNetV2, we observe that MobileNetV2 is always improved by the poorly-trained ResNet18 or ResNeXt29, regardless of the checkpoint's accuracy (Fig. 2). So, while a poorly-trained teacher provides much noisier logits to the student, the student can still be enhanced. The De-KD experiment results also conflict with the common belief.
The counterintuitive results of Re-KD and De-KD make us rethink the “dark knowledge” in KD, and we argue that it does not just contain similarity information: even lacking sufficient similarity information, a model can still provide “dark knowledge” to enhance other models. To explain this, we view knowledge distillation as a form of model regularization and investigate what additional information the “dark knowledge” of a model carries. Next, we analyze the relationship between knowledge distillation and label smoothing regularization to explain the experimental results of Re-KD and De-KD.
3 Knowledge Distillation and Label Smoothing Regularization
We mathematically analyze the relationship between Knowledge Distillation (KD) and Label Smoothing Regularization (LSR), hoping to explain the intriguing results of the exploratory experiments in Sec. 2. Given a neural network S to train, we first give the loss function of LSR for S. For each training example x, S outputs the probability of each label k ∈ {1, …, K}:

p(k|x) = softmax(z_k) = exp(z_k) / Σ_i exp(z_i),

where z_i is the i-th logit of the neural network S. The ground-truth distribution over the labels is q(k|x). We write p(k|x) as p(k) and q(k|x) as q(k) for simplicity. The model S can be trained by minimizing the cross-entropy loss H(q, p) = −Σ_k q(k) log p(k). For a single ground-truth label y, we have q(y|x) = 1 and q(k|x) = 0 for all k ≠ y.
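These definitions can be checked numerically. A minimal NumPy sketch with hypothetical logits:

```python
import numpy as np

def softmax(z):
    # p(k) = exp(z_k) / sum_i exp(z_i), computed in a numerically stable way
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(q, p):
    # H(q, p) = -sum_k q(k) * log p(k)
    return -np.sum(q * np.log(p))

# Hypothetical logits for a 4-class example with ground-truth label y = 2
z = np.array([1.0, 0.5, 3.0, -1.0])
p = softmax(z)
q = np.zeros(4)
q[2] = 1.0                      # one-hot ground-truth distribution q(k)
loss = cross_entropy(q, p)      # for a one-hot q, this reduces to -log p(y)
```

For a one-hot q, every term of the sum vanishes except the ground-truth class, which is why minimizing H(q, p) simply maximizes the predicted probability of the correct label.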
LSR minimizes the cross-entropy between a modified label distribution q′(k) and the network output p(k), where q′(k) is the smoothed label distribution formulated as

q′(k) = (1 − α) q(k) + α u(k),   (1)

which is a mixture of q(k) and a fixed distribution u(k), with weight α. Usually, u(k) is the uniform distribution u(k) = 1/K. The cross-entropy loss defined over the smoothed labels is

H(q′, p) = −Σ_k q′(k) log p(k) = (1 − α) H(q, p) + α (D_KL(u, p) + H(u)),   (2)

where D_KL denotes the Kullback-Leibler divergence and H(u) denotes the entropy of u, which is a constant for the fixed uniform distribution u(k). Thus, the loss function of label smoothing for model S can be written as

L_LS = (1 − α) H(q, p) + α D_KL(u, p).   (3)
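The decomposition in Eq. (2) can be verified numerically; a small NumPy sketch (the model output p here is a random hypothetical distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.1
p = rng.random(K)
p /= p.sum()                             # model output p(k) (hypothetical)
q = np.zeros(K)
q[3] = 1.0                               # one-hot ground truth q(k)
u = np.full(K, 1.0 / K)                  # uniform smoothing distribution u(k)
qp = (1 - alpha) * q + alpha * u         # Eq. (1): smoothed label q'(k)

H = lambda a, b: -np.sum(a * np.log(b))  # cross-entropy H(a, b)
KL = lambda a, b: np.sum(a * np.log(a / b))
H_u = -np.sum(u * np.log(u))             # entropy of u, a constant

lhs = H(qp, p)                                          # H(q', p)
rhs = (1 - alpha) * H(q, p) + alpha * (KL(u, p) + H_u)  # Eq. (2), right-hand side
assert np.isclose(lhs, rhs)
```

The identity holds because D_KL(u, p) = H(u, p) − H(u), so the smoothed cross-entropy splits exactly into the ordinary loss plus a KL term against the uniform distribution.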
For knowledge distillation, the teacher-student learning mechanism is applied to improve the performance of the student. We assume the student is the model S with output prediction p(k), and the softened output prediction of the teacher network is

p^t_τ(k) = softmax(z^t_k / τ) = exp(z^t_k / τ) / Σ_i exp(z^t_i / τ),

where z^t_k is the k-th output logit of the teacher network and τ is the temperature used to soften p^t(k) (written as p^t_τ(k) after softening). The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and the KL divergence between the predictions of student and teacher:

L_KD = (1 − α) H(q, p) + α D_KL(p^t_τ, p_τ).   (4)
Comparing Eq. (3) and Eq. (4), we find the two loss functions have a similar form. The only difference is that D_KL(p^t_τ, p_τ) in L_KD involves a distribution p^t_τ(k) from a teacher model, whereas D_KL(u, p) in L_LS uses the pre-defined uniform distribution u(k). From this view, we can consider KD a special case of LSR where the smoothing distribution is learned rather than pre-defined. On the other hand, if we view the regularization term D_KL(u, p) as a virtual teacher model of knowledge distillation, this teacher gives a uniform probability to all classes, meaning it has random accuracy (1% accuracy on CIFAR100, 0.1% accuracy on ImageNet).
Since D_KL(p^t_τ, p_τ) = H(p^t_τ, p_τ) − H(p^t_τ), where the entropy H(p^t_τ) is constant for a fixed teacher model, we can reformulate Eq. (4) as

L_KD = (1 − α) H(q, p) + α H(p^t_τ, p_τ).   (5)
If we set the temperature τ = 1, we have L_KD = H(q̃^t, p), where q̃^t(k) is

q̃^t(k) = (1 − α) q(k) + α p^t(k).   (6)
Comparing Eq. (6) with Eq. (1), it is even clearer that KD is a special case of LSR. Moreover, the distribution p^t(k) is a learned distribution (from a trained teacher) instead of the uniform distribution u(k). We visualize the output probability of a teacher and compare it with label smoothing in Appendix A.2, and find that with higher temperature τ, p^t_τ(k) becomes more similar to the uniform distribution of label smoothing.
Based on the comparison of the two loss functions, we summarize the relationships between knowledge distillation and label smoothing regularization as follows:
- Knowledge distillation is a learned label smoothing regularization, which plays a similar role to the latter, i.e., regularizing the classifier layer of the model.
- Label smoothing is an ad-hoc knowledge distillation, which can be revisited as a virtual teacher model with random accuracy and temperature τ = 1.
- With higher temperature, the distribution of the teacher's soft targets in knowledge distillation is more similar to the uniform distribution of label smoothing.
Therefore, the experimental results of Re-KD and De-KD can be explained as follows: at high temperature, the soft targets of a model are close to the uniform distribution of label smoothing, so the learned soft targets provide effective regularization for the model being trained. That is why a student can enhance the teacher and a poorly-trained teacher can still improve the student model.
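The third relationship above can be checked numerically: as the temperature τ grows, the softened teacher distribution p^t_τ approaches the uniform distribution u. A minimal NumPy sketch with hypothetical teacher logits:

```python
import numpy as np

def soften(z, tau):
    # p^t_tau(k) = softmax(z / tau): higher tau gives a flatter distribution
    e = np.exp(z / tau - np.max(z / tau))
    return e / e.sum()

z = np.array([9.0, 3.0, 1.0, -2.0, 0.5])   # hypothetical teacher logits
u = np.full(len(z), 1.0 / len(z))          # uniform distribution of LSR

# D_KL(p^t_tau, u) shrinks as the temperature increases
kl_to_uniform = [np.sum(soften(z, t) * np.log(soften(z, t) / u))
                 for t in (1, 5, 20)]
assert kl_to_uniform[0] > kl_to_uniform[1] > kl_to_uniform[2]
```

This is the mechanism behind the explanation above: at τ = 20 the soft targets carry little category-similarity information yet still smooth the training signal, just as LSR does.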
4 Teacher-free knowledge distillation
As analyzed above, the “dark knowledge” in a teacher model is more of a regularization term than similarity information between categories. Intuitively, we consider replacing the output distribution of the teacher model with a simple one. We therefore propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, which uses no teacher model. Tf-KD is especially applicable to cases where a teacher model is not available, or where only limited computational resources are provided. We propose two implementations of Tf-KD.
The first Tf-KD method is self-training knowledge distillation, denoted as Tf-KD_self. As shown above, a teacher can be taught by its student and a poorly-trained teacher can also enhance the student. Hence, when a teacher model is not available, we propose to deploy “self-training”. Specifically, we first train the student model in the normal way to obtain a pre-trained model, which is then used as the teacher to train itself by transferring soft targets, as in Eq. (4). Formally, given a model S, we denote its pre-trained copy as the teacher S^p; we then minimize the KL divergence between the softened outputs of S and S^p. The loss function of Tf-KD_self to train model S is

L_self = (1 − α) H(q, p) + α D_KL(p^p_τ, p_τ),   (7)

where p_τ and p^p_τ are the softened output probabilities of S and S^p respectively, τ is the temperature, and α is the weight balancing the two terms.
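As an illustration, the per-example loss of Eq. (7) can be sketched as follows (NumPy; the logits and the values τ = 20, α = 0.95 are hypothetical, not the paper's reported settings):

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-scaled softmax, numerically stable
    e = np.exp(z / tau - np.max(z / tau))
    return e / e.sum()

def tf_kd_self_loss(z_student, z_pretrained, y, tau=20.0, alpha=0.95):
    # L_self = (1 - alpha) * H(q, p) + alpha * D_KL(p^p_tau, p_tau)
    # z_pretrained: logits of the frozen pre-trained copy S^p (hypothetical values)
    p = softmax(z_student)                        # p(k), unsoftened student output
    p_tau = softmax(z_student, tau)               # p_tau(k)
    pp_tau = softmax(z_pretrained, tau)           # p^p_tau(k), the "teacher"
    ce = -np.log(p[y])                            # H(q, p) for one-hot q with label y
    kl = np.sum(pp_tau * np.log(pp_tau / p_tau))  # D_KL(p^p_tau, p_tau)
    return (1 - alpha) * ce + alpha * kl

loss = tf_kd_self_loss(np.array([2.0, 0.1, -1.0]),
                       np.array([2.5, 0.0, -0.5]), y=0)
assert loss >= 0.0
```

In practice S^p is kept frozen while S is retrained from scratch; the sketch computes one example's loss, not the full training loop.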
The second implementation of Tf-KD is to manually design a teacher with 100% accuracy. In Sec. 3, we revealed that LSR amounts to a virtual teacher model with random accuracy. So, if we design a teacher with higher accuracy, we can expect it to bring more improvement to the student. We propose to combine KD and LSR to build a simple teacher model which outputs the following distribution over the K classes:

d(k) = a, if k = c;  d(k) = (1 − a)/(K − 1), if k ≠ c,   (8)

where K is the total number of classes, c is the correct label, and a is the probability assigned to the correct class. We always set a to a value close to 1, so the probability of the correct class is much higher than that of any incorrect one, and the manually-designed teacher model has 100% accuracy on any dataset.
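The virtual teacher's distribution follows directly from Eq. (8); a minimal sketch (the value a = 0.95 below is illustrative, not the paper's setting):

```python
import numpy as np

def virtual_teacher(K, c, a=0.95):
    # Eq. (8): d(k) = a if k == c else (1 - a) / (K - 1)
    # a is a hypothetical value close to 1; c is the correct label
    d = np.full(K, (1.0 - a) / (K - 1))
    d[c] = a
    return d

K = 10                       # e.g., number of CIFAR10 classes
d = virtual_teacher(K, c=3)
assert np.isclose(d.sum(), 1.0)
assert np.argmax(d) == 3     # the virtual teacher always predicts the correct class
```

Because d(k) always peaks at the correct label, this "teacher" is 100% accurate by construction, yet it contains no category-similarity information at all: every incorrect class receives the same probability.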
We name this method Teacher-free KD by manually-designed regularization, denoted as Tf-KD_reg. The loss function is

L_reg = (1 − α) H(q, p) + α D_KL(p^d_τ, p_τ),   (9)

where τ is the temperature used to soften the manually-designed distribution d(k) (written as p^d_τ after softening) and α is the balancing weight. We set a high temperature τ so that this virtual teacher outputs soft probabilities, thereby gaining the smoothing property of LSR. We visualize the distribution of the manually-designed teacher in Fig. 3: it outputs soft targets with 100% classification accuracy while retaining the smoothing property of label smoothing.
The two Teacher-free methods, Tf-KD_self and Tf-KD_reg, are simple yet effective, as validated via extensive experiments in the next section.
5 Experiments on Teacher-free knowledge distillation
In this section, we conduct experiments to evaluate Tf-KD_self and Tf-KD_reg on three image classification datasets: CIFAR100, Tiny-ImageNet and ImageNet. For fair comparisons, all experiments are conducted with the same settings. Code will be publicly available.
5.1 Experiments for self-training knowledge distillation
The hyper-parameters (temperature τ and weight α) for Tf-KD_self are given in Appendix A.3.
On CIFAR100, the baseline models include MobileNetV2, ShuffleNetV2, GoogLeNet, ResNet18, DenseNet121 and ResNeXt29 (8×64d). The baselines are trained for 200 epochs with batch size 128. The initial learning rate is 0.1 and is divided by 5 at the 60th, 120th and 160th epochs. We use the SGD optimizer with momentum 0.9, and the weight decay is set to 5e-4. For the hyper-parameters τ and α, we use grid search to find the best values.
Fig. 4 (a) shows the test accuracy of the six models. Our Tf-KD_self consistently outperforms the baselines. For example, as a powerful model with 34.52M parameters, ResNeXt29 improves itself by 1.05% with self-training (Fig. 4 (b)). Even when compared to Normal KD with a superior teacher in Fig. 5 (a) and Fig. 5 (b), our method achieves comparable performance (experiment settings for Tf-KD_self and Normal KD are the same). For example, with ResNet50 teaching ResNet18, the student obtains a 1.19% improvement, while our method achieves a 1.23% improvement without using any stronger teacher model.
On Tiny-ImageNet, the baseline models include MobileNetV2, ShuffleNetV2, ResNet50 and DenseNet121. They are trained for 200 epochs with model-dependent batch sizes (one setting for MobileNetV2 and ShuffleNetV2, another for ResNet50 and DenseNet121). The initial learning rate is divided by 10 at the 60th, 120th and 160th epochs. We use the SGD optimizer with momentum 0.9, and the weight decay is set to 5e-4. Tab. 5 shows the results of Tf-KD_self on Tiny-ImageNet: Tf-KD_self consistently improves the baseline models and achieves improvements comparable with Normal KD.
Table 5: Tf-KD_self on Tiny-ImageNet (accuracy, %).

| Model | Baseline | Tf-KD_self | Normal KD [Teacher] |
| --- | --- | --- | --- |
| MobileNetV2 | 55.06 | 56.77 (+1.71) | 56.70 (+1.64) [ResNet18] |
| ShuffleNetV2 | 60.51 | 61.36 (+0.85) | 61.19 (+0.68) [ResNet18] |
| ResNet50 | 67.47 | 68.18 (+0.71) | 68.23 (+0.76) [DenseNet121] |
| DenseNet121 | 68.15 | 68.29 (+0.14) | 68.31 (+0.16) [ResNeXt29] |
ImageNet-2012 is one of the largest datasets for object classification, with over 1.3 million hand-annotated images. The baseline models we use on this dataset are ResNet18, ResNet50, DenseNet121 and ResNeXt101 (32×8d), and we adopt the official PyTorch implementation (https://github.com/pytorch/examples/tree/master/imagenet) to train them. Batch sizes are set per model (one setting for ResNet18, ResNet50 and DenseNet121, another for ResNeXt101). Following common experiment settings (Goyal et al., 2017), the initial learning rate is divided by 10 at the 30th, 60th and 80th epochs, for 90 epochs in total. We use the SGD optimizer with momentum 0.9, and the weight decay is 1e-4. Results are reported in Tab. 6: self-training further improves the baseline performance on ImageNet-2012. As a comparison, we also use DenseNet121 to teach ResNet18 on ImageNet; ResNet18 obtains a 0.56% improvement, which is comparable with our self-training implementation (Tab. 7).
Table 7: Comparison of Tf-KD_self and Normal KD for ResNet18 on ImageNet (accuracy, %).

| Model | Baseline | Tf-KD_self | Normal KD [Teacher] |
| --- | --- | --- | --- |
| ResNet18 | 69.84 | 70.42 (+0.58) | 70.40 (+0.56) [DenseNet121] |
5.2 Experiments for knowledge distillation with manually-designed teacher
For all experiments of Tf-KD_reg, we adopt the same implementation settings as Tf-KD_self, except that the smoothing distribution comes from the manually-designed virtual teacher of Eq. (8), used in the loss of Eq. (9). For fair comparisons, experiment settings for Normal KD and Tf-KD_reg are the same. See Appendix A.4 for the hyper-parameters of Tf-KD_reg.
CIFAR100 and Tiny-ImageNet.
For Tf-KD_reg experiments on CIFAR100 and Tiny-ImageNet, the probability a for the correct class is set as in Eq. (8). The temperature τ and weight α in Eq. (9) differ for different baseline models (see Appendix A.4). From Tab. 8 and Tab. 9, we can observe that with no teacher used and just a regularization term added, Tf-KD_reg achieves comparable performance with Normal KD on both CIFAR100 and Tiny-ImageNet.
Table 8: Tf-KD_reg compared with Normal KD and label smoothing (LSR) on CIFAR100 (accuracy, %).

| Model | Baseline | Tf-KD_reg | Normal KD [Teacher] | Baseline + LSR |
| --- | --- | --- | --- | --- |
| MobileNetV2 | 68.38 | 70.88 (+2.50) | 71.05 (+2.67) [ResNet18] | 69.32 (+0.94) |
| ShuffleNetV2 | 70.34 | 72.09 (+1.75) | 72.05 (+1.71) [ResNet18] | 70.83 (+0.49) |
| ResNet18 | 75.87 | 77.36 (+1.49) | 77.19 (+1.32) [ResNet50] | 77.26 (+1.39) |
| GoogLeNet | 78.15 | 79.22 (+1.07) | 78.84 (+0.99) [ResNeXt29] | 79.07 (+0.92) |
Table 9: Tf-KD_reg compared with Normal KD and label smoothing (LSR) on Tiny-ImageNet (accuracy, %).

| Model | Baseline | Tf-KD_reg | Normal KD [Teacher] | Baseline + LSR |
| --- | --- | --- | --- | --- |
| MobileNetV2 | 55.06 | 56.47 (+1.41) | 56.53 (+1.47) [ResNet18] | 56.24 (+1.18) |
| ShuffleNetV2 | 60.51 | 60.93 (+0.42) | 61.19 (+0.68) [ResNet18] | 60.66 (+0.11) |
| ResNet50 | 67.47 | 67.92 (+0.45) | 68.15 (+0.68) [ResNeXt29] | 67.63 (+0.16) |
| DenseNet121 | 68.15 | 68.37 (+0.18) | 68.44 (+0.26) [ResNeXt29] | 68.19 (+0.04) |
For Tf-KD_reg on ImageNet, we adopt the temperature τ = 20, as in normal knowledge distillation, and the weight α = 0.10, as in label smoothing regularization. The probability a for the correct class in the manually-designed teacher follows Eq. (8). We test Tf-KD_reg with four baseline models: ResNet18, ResNet50, DenseNet121 and ResNeXt101 (32×8d). As a regularization term, the manually-designed teacher achieves consistent improvements over the baselines (Fig. 6 (a)). For example, Tf-KD_reg improves the top-1 accuracy of ResNet50 by 0.65% on ImageNet-2012 (Fig. 6 (b)). Even for the huge single model ResNeXt101 (32×8d) with 88.79M parameters, our method achieves a 0.48% improvement by using the manually-designed teacher.
Comparing our two methods, we observe that Tf-KD_self works better on the small dataset (CIFAR100), while Tf-KD_reg performs slightly better on the large-scale dataset (ImageNet).
5.3 Teacher-free Regularization
For Tf-KD_reg, the manually-designed teacher is a regularization term, so no extra computation is added. It can therefore serve as a generic regularization method for normally training neural networks. We compare our Teacher-free regularization with label smoothing on CIFAR100, Tiny-ImageNet and ImageNet. For fair comparisons, experiment settings for Tf-KD_reg and LSR are the same. The results are shown in Tab. 8, Tab. 9 and Fig. 6 (a). Our Tf-KD_reg regularization consistently outperforms label smoothing regularization, making it a better choice for model regularization.
6 Related Work
Since Hinton et al. (2015) proposed knowledge distillation, KD has been widely adopted and modified (Romero et al., 2014; Yim et al., 2017; Yu et al., 2017; Furlanello et al., 2018; Anil et al., 2018; Mirzadeh et al., 2019; Wang et al., 2019). Different from existing works, our work challenges the common belief about knowledge distillation based on our designed exploratory experiments. Another related work is deep mutual learning (Zhang et al., 2018), which lets an ensemble of student models learn collaboratively by minimizing the KL divergence between their predictions. In comparison, our work reveals the relationship between KD and label smoothing, and our proposed Teacher-free regularization can serve as a general method for neural network training.
Szegedy et al. (2016) proposed label smoothing regularization to replace “hard labels” with smoothed ones, boosting the performance of many tasks such as image classification, language translation and speech recognition (Pereyra et al., 2017). Recently, Müller et al. (2019) empirically showed that label smoothing can also help improve model calibration. In our work, we adopt label smoothing regularization to understand the regularization function of knowledge distillation.
7 Conclusion and Future Work
In this work, we find through experiments and analyses that the “dark knowledge” of a teacher model is more of a regularization term than similarity information of categories. Based on the relationship between KD and LSR, we propose Teacher-free KD. Experiment results show our Tf-KD can achieve comparable results with Normal KD in image classification. One limitation of this work is that we only test the proposed methods in image classification tasks. In the future, we will explore its application to video classification and natural language understanding. Our work also suggests that, when it is hard to find a good teacher for a powerful model or computation resource is limited for obtaining teacher models, the targeted model can still get enhanced by self-training or a manually-designed regularization term.
- Anil et al. (2018). Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
- Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Furlanello et al. (2018). Born again neural networks. arXiv preprint arXiv:1805.04770.
- Goyal et al. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- Krizhevsky et al. (2009). Learning multiple layers of features from tiny images. Technical report.
- Ma et al. (2018). ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131.
- Mirzadeh et al. (2019). Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393.
- Müller et al. (2019). When does label smoothing help? arXiv preprint arXiv:1906.02629.
- Pereyra et al. (2017). Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
- Romero et al. (2014). FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- Szegedy et al. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Wang et al. (2019). Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Xie et al. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
- Yim et al. (2017). A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
- Yu et al. (2017). Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1974–1982.
- Zhang et al. (2018). Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328.
Appendix

A.1 Implementation details for exploratory experiments
In this appendix, we provide implementation details and experiment settings for the exploratory experiments, which are conducted based on the standard implementation of knowledge distillation (Hinton et al., 2015). The loss function for standard knowledge distillation is

L_KD = (1 − α) H(q, p) + α D_KL(p^t_τ, p_τ),

where q is the ground-truth distribution, p is the output distribution of the student model, H is the cross-entropy loss, D_KL is the KL divergence, and p^t_τ is the output distribution of the teacher model softened by temperature τ (as is p_τ for the student). The temperature τ and weight α are hyper-parameters. The values of τ and α for Re-KD and Normal KD are given in Tab. 10; those for De-KD are given in Tab. 11. For fair comparisons, Normal KD, Re-KD and De-KD are conducted with the same experiment settings.
CIFAR10 and CIFAR100
For the Re-KD and De-KD exploratory experiments on CIFAR10 and CIFAR100, we train for 200 epochs with batch size 128. For the Plain CNN, the initial learning rate is 0.01 and is divided by 5 at the 60th, 120th and 160th epochs. For the other models (MobileNetV2, ShuffleNetV2, ResNet, ResNeXt, DenseNet), the initial learning rate is 0.1 and is divided by 5 at the 60th, 120th and 160th epochs. We use the Adam optimizer for the Plain CNN and the SGD optimizer with momentum 0.9 for the other models, and the weight decay is set to 5e-4. For the hyper-parameters τ (temperature) and α (weight), we use grid search to find the best values.
The Plain CNN used in the exploratory experiments is a 5-layer neural network with 3 convolutional layers and 2 fully-connected layers.
Table 10: Temperature τ and weight α for Re-KD and Normal KD.

| Dataset | Teacher | Student | Re-KD (τ, α) | Normal KD (τ, α) |
| --- | --- | --- | --- | --- |
| CIFAR-10 | ResNet18 | Plain CNN | τ=20, α=0.90 | τ=20, α=0.01 |
| CIFAR-10 | ResNet18 | MobileNetV2 | τ=20, α=0.90 | τ=20, α=0.05 |
| CIFAR-10 | MobileNetV2 | Plain CNN | τ=20, α=0.40 | τ=20, α=0.10 |
| CIFAR-10 | ResNeXt29 | ResNet18 | τ=6, α=0.95 | τ=20, α=0.10 |
| CIFAR100 | ResNet18 | MobileNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | ResNet18 | ShuffleNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | ResNet50 | MobileNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | ResNet50 | ShuffleNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | DenseNet121 | MobileNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | DenseNet121 | ShuffleNetV2 | τ=20, α=0.95 | τ=20, α=0.60 |
| CIFAR100 | ResNeXt29 | MobileNetV2 | τ=20, α=0.60 | τ=20, α=0.60 |
| CIFAR100 | ResNeXt29 | ResNet18 | τ=20, α=0.60 | τ=20, α=0.60 |
| Tiny-ImageNet | ResNet18 | MobileNetV2 | τ=20, α=0.10 | τ=20, α=0.60 |
| Tiny-ImageNet | ResNet18 | ShuffleNetV2 | τ=20, α=0.10 | τ=20, α=0.60 |
| Tiny-ImageNet | ResNet50 | MobileNetV2 | τ=20, α=0.10 | τ=20, α=0.10 |
| Tiny-ImageNet | ResNet50 | ShuffleNetV2 | τ=20, α=0.10 | τ=20, α=0.50 |
| Tiny-ImageNet | ResNet50 | ResNet18 | τ=20, α=0.50 | τ=20, α=0.10 |
Table 11: Temperature τ and weight α for De-KD.

| Dataset | Pt-Teacher: accuracy | Student | τ, α |
| --- | --- | --- | --- |
| CIFAR100 | ResNet18: 15.48% | MobileNetV2 | τ=20, α=0.95 |
| CIFAR100 | ResNet50: 45.82% | MobileNetV2 | τ=20, α=0.95 |
| CIFAR100 | ResNeXt29: 51.94% | MobileNetV2 | τ=20, α=0.95 |
| Tiny-ImageNet | ResNet18: 9.41% | MobileNetV2 | τ=20, α=0.10 |
| Tiny-ImageNet | ResNet50: 31.01% | MobileNetV2 | τ=20, α=0.10 |
For the exploratory experiments on Tiny-ImageNet, all models are trained for 200 epochs, with model-dependent batch sizes (one setting for MobileNetV2, ShuffleNetV2 and ResNet18, another for ResNet50 and DenseNet121). The initial learning rate is divided by 10 at the 60th, 120th and 160th epochs. We use the SGD optimizer with momentum 0.9, and the weight decay is set to 5e-4.
We provide the model complexity (size and FLOPs) of all models used in this work in Tab. 12, which serves as the reference for choosing teacher and student models. The model size is measured by the total number of learnable parameters in each model; the FLOPs are measured with a fixed input image size.
| Model | GoogLeNet | DenseNet121 | ResNeXt29 (8×64d) | ResNeXt101 (32×8d) |
A.2 Visualization of the output distribution of the teacher
To better compare p^t_τ(k) (the output distribution of the teacher model) and u(k) (the uniform distribution of label smoothing), we visualize the soft targets of ResNet18 (trained on CIFAR10 with 95.12% accuracy) at different temperatures τ and compare them with u(k). As shown in Fig. 7, with increasing temperature the two distributions become closer. In common knowledge distillation experiments, we always adopt a temperature of 20 (Hinton et al., 2015).
A.3 Experiment settings for Tf-KD_self

For all Tf-KD_self experiments on ImageNet, we set the temperature τ = 20 and the weight α = 0.10. The hyper-parameters on CIFAR100 and Tiny-ImageNet are given in Tab. 13.
| ResNeXt29 (8×64d) | τ=20, α=0.90 |
A.4 Experiment settings for Tf-KD_reg

For all Tf-KD_reg experiments on ImageNet, we set the temperature τ = 20 and the weight α = 0.10. The hyper-parameters on CIFAR100 and Tiny-ImageNet are given in Tab. 14.