Learning from a Lightweight Teacher for Efficient Knowledge Distillation

05/19/2020 ∙ by Yuang Liu, et al.

Knowledge Distillation (KD) is an effective framework for compressing deep learning models, realized by a student-teacher paradigm that requires small student networks to mimic the soft targets generated by well-trained teachers. However, the teachers are commonly assumed to be complex and need to be trained on the same datasets as the students, which leads to a time-consuming training process. A recent study shows that vanilla KD plays a role similar to label smoothing and develops teacher-free KD, which is efficient and mitigates the issue of learning from heavy teachers. However, because teacher-free KD relies on manually-crafted output distributions that are kept the same for all data instances belonging to the same class, its flexibility and performance are relatively limited. To address these issues, this paper proposes an efficient knowledge distillation learning framework, LW-KD, short for lightweight knowledge distillation. It first trains a lightweight teacher network on a synthesized simple dataset, with an adjustable class number equal to that of a target dataset. The teacher then generates soft targets, with which an enhanced KD loss guides student learning; this loss combines the KD loss with an adversarial loss that makes the student output indistinguishable from the output of the teacher. Experiments on several public datasets with different modalities demonstrate that LW-KD is effective and efficient, showing the rationality of its main design principles.




1. Introduction

Modern deep learning models have achieved tremendous success owing to the design of complicated multi-layer neural networks (e.g., ResNet (he2016deep)), the collection of large-scale datasets, and training with more effective optimization techniques (e.g., Adam (KingmaB14)) and computation-intensive resources (e.g., GPUs). However, this paradigm is less applicable in the era of the mobile Internet and the Internet of Things (IoT), because low-end intelligent terminal devices with small storage capacity and low computing power dominate the market. As a result, there is an urgent need to train portable neural networks while maintaining the effectiveness of big models as much as possible. Knowledge Distillation (KD), first proposed by Hinton et al. (hinton2015distilling), is an elegant learning framework to address this need. It takes the soft targets generated by one or more powerful but complicated teacher networks as another target (beyond the ground truth) to train a compact student model, as shown in Figure 1(a). So far, KD has been exploited in computer vision (e.g., image classification (romero2014fitnets) and semantic segmentation (LiuCLQLW19)), natural language processing (e.g., machine translation (ChenLCL17) and relation classification (VyasC19)), and recommender systems (TangW18), to name a few.

However, the common assumption of KD, namely leveraging strong teacher networks, still incurs a non-negligible computational burden, because the complex teacher models must themselves be trained. This costly process is inevitable for each specific situation or dataset. The issue is even amplified by the trend of proposing larger models for better performance, such as VGG19 (karen15very) and BERT (devlin2019bert). As reported in the study (devlin2019bert), the big Transformer model should be trained for 3.5 days on 8 NVIDIA P100 GPUs, not to mention industrial applications. Hence it brings large energy consumption and environmental cost (StrubellGM19).

Recently, teacher-free knowledge distillation (Tf-KD) was developed (yuan2019revisit) to free students from learning from complicated teachers. The authors demonstrated that vanilla KD, which utilizes the soft output class distributions of teachers, plays the role of label smoothing (szegedy2016rethinking; muller2019does) in constraining the class predictions of students during training. To further verify this, we conduct a verification experiment on the CIFAR datasets (cf. Section 4.1) by removing the “dark knowledge” contained in the soft targets. Specifically, as shown in Figure 1(b), KD-shuffle performs a shuffle operation on each instance in a training set, whereby all the elements of the teacher output, except the one with maximal probability, are randomly permuted. Table 1 shows that in most cases, shuffling the soft targets does not cause the performance of the student network to decline notably. As such, the conclusion that vanilla knowledge distillation is similar to label smoothing (yuan2019revisit) is reasonable.
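
The shuffle operation of KD-shuffle can be illustrated with a minimal sketch. The function name `kd_shuffle` and the example probabilities are hypothetical and are not taken from the authors' implementation; the sketch only assumes that the teacher output is a list of class probabilities.

```python
import random


def kd_shuffle(soft_target, rng=random):
    """Randomly permute every entry of a soft target except the largest one."""
    # Position of the maximal probability, which must stay in place.
    top = max(range(len(soft_target)), key=soft_target.__getitem__)
    # Collect and shuffle the remaining probabilities (the "dark knowledge").
    rest = [p for i, p in enumerate(soft_target) if i != top]
    rng.shuffle(rest)
    it = iter(rest)
    return [p if i == top else next(it) for i, p in enumerate(soft_target)]


# Hypothetical teacher output over four classes.
teacher_output = [0.05, 0.70, 0.15, 0.10]
shuffled = kd_shuffle(teacher_output, random.Random(0))
```

The shuffled target is still a valid distribution with the same peak, but the relative ordering of the non-maximal classes no longer carries any information from the teacher.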

To achieve efficient knowledge distillation, Tf-KD adopts a virtual teacher model with manually-crafted output class distributions as the transferred knowledge, eliminating the necessity of training a real and complex teacher. In particular, it assigns a pre-defined larger value to the ground-truth class of an instance and makes the other classes share the same small probability value. Thus the crafted distributions are class-dependent and share a similar spirit with label smoothing, which uses a global uniform class distribution. However, although Tf-KD is efficient, the manual distributions are kept the same for all data examples belonging to the same class, which limits the flexibility of Tf-KD. As shown in the experimental section, it is hard for Tf-KD to reach the overall performance level of vanilla KD. This phenomenon naturally poses a major challenge: how to achieve efficient knowledge distillation inspired by Tf-KD while maintaining classification efficacy like vanilla KD.
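
To make the contrast concrete, the following sketch builds a label-smoothing target and a Tf-KD-style virtual-teacher target. The function names and the values `eps=0.1` and `correct_prob=0.9` are illustrative assumptions, not settings reported in the paper.

```python
def label_smoothing_target(true_class, num_classes, eps=0.1):
    """Mix the one-hot label with a global uniform distribution (eps is assumed)."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == true_class else uniform
            for k in range(num_classes)]


def virtual_teacher_target(true_class, num_classes, correct_prob=0.9):
    """Give the true class a pre-defined large value; the rest share the remainder."""
    other = (1.0 - correct_prob) / (num_classes - 1)
    return [correct_prob if k == true_class else other
            for k in range(num_classes)]


# Two different instances of class 3 receive exactly the same target,
# which is the inflexibility discussed above.
target_a = virtual_teacher_target(3, 10)
target_b = virtual_teacher_target(3, 10)
smoothed = label_smoothing_target(3, 10)
```

Both targets depend only on the class label, never on the individual instance, whereas a trained teacher produces a different distribution for every input.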

Figure 1. The sketches of KD and its shuffled variant: (a) KD; (b) KD-shuffle. T corresponds to the teacher and S to the student. We use the trapezium area to denote that the teacher is larger than the student.

| Dataset | Teacher | KD | KD-shuffle |
|---|---|---|---|
| CIFAR10 | ResNet20 (he2016deep) | 92.98 | 92.85 |
| | MobileNetV2 (sandler2018mobilenetv2) | 91.20 | 91.33 |
| | ShuffleNetV2 (ma2018shufflenet) | 92.17 | 91.15 |
| CIFAR100 | ResNet20 (he2016deep) | 68.74 | 68.91 |
| | MobileNetV2 (sandler2018mobilenetv2) | 69.00 | 69.15 |
| | ShuffleNetV2 (ma2018shufflenet) | 71.43 | 71.58 |

Table 1. Results of KD and its shuffled variant (student: ResNet18 (he2016deep)).

In this paper, to address the challenge, we devise a novel efficient knowledge distillation learning framework, LW-KD, short for lightweight knowledge distillation. In contrast to vanilla KD, this framework leverages a general lightweight network as the teacher, which is even much smaller than the student networks to be used. Since the teacher network has lower capacity than the students, it is intuitive for LW-KD to train the teacher on a simple dataset to ensure satisfactory performance, instead of training it on the same target dataset used by the student. To be specific, LW-KD automatically builds a synthetic dataset, SynMNIST, based on the simple digit dataset MNIST (http://yann.lecun.com/exdb/mnist/), with an adjustable class number equal to that of the target dataset. It is worth noting that the target dataset may have a different modality from the image modality of SynMNIST. Based on this, LW-KD trains the teacher to reach a suitable state for generating soft output class distributions. The framework slightly modifies these distributions to obtain the transferred knowledge. An enhanced KD loss, involving the standard KD loss and an adversarial loss, is developed to exploit the knowledge for student training and make the output distributions of the student indistinguishable from the teacher’s.

Our key contributions are summarized as follows:

  • We indicate the limitations of both vanilla KD, which suffers from a heavy computational burden in the training process, and teacher-free KD, which lacks flexibility. To achieve an effective trade-off, we propose to realize KD by learning from a lightweight teacher.

  • We propose the novel KD framework named LW-KD. It enables training a lightweight teacher on the synthesized simple dataset SynMNIST, which can satisfy different requirements on the class number. We further devise the enhanced KD loss for LW-KD to leverage the output class distributions of the teacher to guide student learning.

  • We conduct extensive experiments on different types of data, involving image, text, and video. The results demonstrate the benefits of LW-KD: it performs better than teacher-free KD, sometimes even outperforms vanilla KD, and is much faster than vanilla KD in teacher training. As a byproduct, the rationality of the main principles in LW-KD is validated.

2. Related Work

Since the aim of this paper is to learn from a lightweight teacher for knowledge distillation to obtain a well-performing student, we review the literature from the following aspects.

Knowledge Distillation. Transferring knowledge from a large network to a small network is a long-standing topic and has drawn much attention in recent years. The pioneering work (hinton2015distilling) has shown that distillation works well for transferring knowledge from an ensemble or a strong and complicated model into a small and compact model. The main claim is that soft targets, used as a complement to hard targets, carry useful information about the relations among classes learned by teachers. Due to the efficacy of KD in retaining classification performance while compressing models, a large body of research has followed.

The first category includes the methods that simply transfer the knowledge contained in the soft output of teacher models to student models, as in vanilla KD (hinton2015distilling). For instance, Lopez-Paz et al. (Lopez-PazBSV15) unified distillation and privileged information into one framework. In (li2017learning), Li et al. developed a new framework to learn from noisy labels, by leveraging the knowledge gained from a small clean dataset and a semantic knowledge graph to correct the noisy labels. You et al. (YouX0T17) averaged the soft targets generated by multiple teachers as richer knowledge for more effective student learning. Tan et al. (TanRHQZL19) associated one teacher network with each source-target language pair for machine translation. There are also some self-KD methods (xie2019self; zhang2019your; liu2020regularizing) that use soft targets generated by the student itself. However, the training process becomes more complicated because the optimizations on the teacher and student sides are coupled.

A second line of studies transfers structural knowledge obtained from a teacher model. RKD (park2019relational) uses distance-wise and angle-wise distillation losses that penalize the structural differences in relations. Similar to RKD, IRG (liu2019knowledge) models three kinds of knowledge, including instance features, instance relationships, and the feature space transformation across layers. SP (tung2019similarity) exploits a pairwise similarity-preserving constraint in the distillation loss, computed on each mini-batch. Both VID (ahn2019variational) and CRD (Tian-ICLR20) treat knowledge distillation as maximizing the mutual information between the two networks. Mutual information is maximized through a variational lower bound in VID and through a contrastive loss in CRD.

A third category of approaches learns the knowledge revealed in the intermediate layers of teachers. NST (huang2017like) treats it as a distribution matching problem and matches the distributions of neuron selectivity patterns between teacher and student networks. Romero et al. (romero2014fitnets) distilled from a teacher using additional linear projection layers to train a relatively narrower student. Instead of mimicking a teacher’s output activations, Zagoruyko et al. (komodakis2017paying) proposed to force a student to mimic a teacher’s attention maps. Crowley et al. (crowley2018moonshine) compressed a model by grouping its convolution channels and training it with attention transfer. In (yim2017gift), the flow of solution procedure (FSP), generated by computing the Gram matrix of features across layers, is employed for knowledge transfer.

Although much progress has been made for KD, a very recent study shows that vanilla KD still behaves well compared with other representative KD methods (Tian-ICLR20). To sum up, most of the above KD approaches require one or more large and high-performance teachers. As mentioned above, the training procedure of teachers is costly in terms of time and computational resources. Although the teacher-free KD framework frees student learning from relying on powerful teachers, its manually-crafted class distributions are relatively limited, leaving room for improving student classification performance. This paper addresses the limitations of both vanilla KD and teacher-free KD by proposing to learn from a lightweight teacher for efficient knowledge distillation.

Label Smoothing. The label smoothing mechanism was first proposed in (szegedy2016rethinking) to regularize the classifier layer by estimating the marginalized effect of label dropout during training. In effect, it encourages the model to be less confident and makes it generalize better. Label smoothing has been successfully utilized to improve the accuracy of deep learning models across a range of tasks, including image classification (szegedy2016rethinking), speech recognition (chorowski2016towards), and machine translation (vaswani2017attention). Recently, Müller et al. (muller2019does) summarized and explained several observations made when training deep neural networks with label smoothing. Yuan et al. (yuan2019revisit) demonstrated the relation between label smoothing and knowledge distillation, indicating that they play similar roles.

Model Compression and Acceleration. Knowledge distillation could be regarded as one branch of model compression methods based on transfer learning (crowley2018moonshine; bhardwaj2019efficient; shu2019co). The aim of model compression and acceleration is to create networks with fast computation and few parameters while maintaining high performance. A straightforward way to achieve this is to design a powerful but lightweight network, since the original convolution network has many redundant parameters. MobileNet (sandler2018mobilenetv2) is designed with depth-wise separable convolution to replace standard convolution. In ShuffleNet (ma2018shufflenet), point-wise group convolution and channel shuffle are proposed to reduce the computational burden while maintaining high accuracy. Another approach is network pruning, which speeds up inference by pruning the neurons or filters of low importance according to certain criteria (li2016pruning; zhou2019accelerate). Besides, other studies apply low-rank approximation to large layers (sainath2013low) or use quantization to obtain low-precision representations of model parameters (wu2016quantized).

3. Methodologies

In this section, we first revisit the key parts of knowledge distillation, followed by the elaboration of our proposed framework LW-KD.

3.1. Knowledge Distillation

The general idea of knowledge distillation is to let a student network mimic the soft target generated by a teacher, as shown in Figure 1(a). This is realized in the KD learning framework by adding another loss function to complement the standard cross-entropy loss, which measures the gap between ground-truth labels and predictions. The added loss constrains the soft output of the student to be similar to the soft output of the teacher, which can be regarded as the knowledge learned by the teacher. Specifically, the loss is defined as follows:

$$\mathcal{L}_{KL} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left(p_i^t \,\|\, p_i^s\right) \tag{1}$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL-divergence measure, $N$ is the number of instances in the training set $\mathcal{D}$, and $p_i^s$ and $p_i^t$ are the soft output of the student and the soft target generated by the teacher for the $i$-th instance, respectively.

The soft outputs are computed from the logits, i.e., the outputs of the last layer of a neural network before they are fed to the softmax function. Taking the teacher's soft target $p_i^t$ as an example, it is computed by:

$$p_{i,k}^t = \frac{\exp\left(z_{i,k}^t / \tau\right)}{\sum_{j=1}^{K} \exp\left(z_{i,j}^t / \tau\right)} \tag{2}$$

where $z_{i,k}^t$ is the logit from the teacher for class $k$ and $K$ is the total number of classes. $\tau$ is a hyperparameter (referred to as temperature in (hinton2015distilling)) that controls the scale of the logits. For the teacher-free KD framework, the major difference is that $p_i^t$ is not calculated from a well-trained teacher but is obtained through manually-crafted class distributions.
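
The temperature-scaled softmax and the KL-divergence loss described above can be sketched in a few lines of plain Python. The function names are hypothetical, and the direction of the divergence (teacher distribution first) is an assumption that follows the common formulation of vanilla KD.

```python
import math


def soft_output(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_loss(teacher_logits, student_logits, temperature):
    """Average KL divergence between teacher and student soft outputs."""
    values = [
        kl_divergence(soft_output(t, temperature), soft_output(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(values) / len(values)


sharp = soft_output([2.0, 1.0, 0.1], temperature=1.0)
flat = soft_output([2.0, 1.0, 0.1], temperature=4.0)
```

A larger temperature flattens the distribution, so `flat` assigns more probability to the non-maximal classes than `sharp` does.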

Figure 2. The sketch of the LW-KD learning framework. LT corresponds to a lightweight teacher.

3.2. The LW-KD Learning Framework

3.2.1. Overview

Figure 2 depicts the concrete procedures of the LW-KD learning framework. For a target dataset with a specific number of classes, LW-KD first synthesizes a tailored dataset based on a simple image dataset. Afterwards, a lightweight teacher is trained on the synthetic dataset. Later, the soft output class distributions generated by the teacher are slightly modified and used in the enhanced KD loss to train a student network on the target dataset. In this way, the student can reach better performance.

In what follows, we mathematically specify the details of LW-KD. We use $T$ to represent the teacher network with trainable parameters $\theta_T$ and $S$ to denote the student network with learnable parameters $\theta_S$.

3.2.2. Data Synthesis

We automatically synthesize the dataset for training a lightweight teacher based on the simple dataset MNIST. MNIST is composed of fixed-size images of size-normalized handwritten digits. The task of classifying the digit images is relatively easy compared with other image classification tasks such as CIFAR. The pioneering study (lecun1998gradient) indicates that the simple neural network LeNet5 can already achieve a very low error rate. As such, it is reasonable for a lightweight teacher to be trained on this dataset and reach satisfactory performance.

Data: Official MNIST dataset $\mathcal{M}$.
Input: Number of classes $K$ and number of instances in each class $n$.
Output: SynMNIST dataset $(\tilde{X}, \tilde{Y})$ for a $K$-class classification task.
for $k = 0$ to $K - 1$ do
      Generate a digit label list $d_k$ by splitting the number $k$;
      for $j = 1$ to $n$ do
            Get a group of images $g$ from $\mathcal{M}$ by referring to the digits in $d_k$ and the classes in $\mathcal{M}$;
            $\tilde{x} \leftarrow$ concat($g$);
            $\tilde{x} \leftarrow$ resize($\tilde{x}$);
            Append label $k$ to $\tilde{Y}$;
            Append data $\tilde{x}$ to $\tilde{X}$;
      end for
end for
Algorithm 1 The algorithm for synthesizing the SynMNIST dataset.
Figure 3. Examples in the SynMNIST dataset: (a) 03, (b) 07, (c) 30, (d) 41, (e) 57, (f) 99, (g) 010, (h) 057, (i) 130, (j) 169, (k) 374, (l) 748.

Algorithm 1 illustrates the main procedures of constructing the SynMNIST dataset. For a specific target dataset to be used for training the student $S$, the algorithm only needs to know its total class number and the number of instances for each class. The innovation of the algorithm lies in combining different basic digit images to synthesize new images corresponding to larger numerical values, each of which denotes a specific class. As a result, the algorithm supports different numbers of classes. Figure 3 shows the images synthesized by Algorithm 1 for 100-category and 1000-category classification tasks, respectively. For example, the image “03” represents an instance in the 100-category (“00”-“99”) dataset, corresponding to the fourth class.
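
The synthesis procedure can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: images are nested lists, `digit_bank` is a hypothetical mapping from each digit to a list of its images, the zero-padded digit splitting follows the examples in Figure 3, and nearest-neighbour resizing is an assumed choice.

```python
import random


def digit_labels(class_index, num_classes):
    """Split a class index into zero-padded digits, e.g. 3 of 100 -> [0, 3]."""
    width = len(str(num_classes - 1))
    return [int(ch) for ch in str(class_index).zfill(width)]


def concat_horizontally(images):
    """Place equally tall images side by side."""
    return [sum((img[r] for img in images), []) for r in range(len(images[0]))]


def resize_nearest(image, height, width):
    """Nearest-neighbour resize of a 2D list (assumed interpolation)."""
    src_h, src_w = len(image), len(image[0])
    return [[image[r * src_h // height][c * src_w // width]
             for c in range(width)] for r in range(height)]


def synthesize(digit_bank, num_classes, per_class, size=28, rng=random):
    """Build (data, labels) with per_class synthetic images for every class."""
    data, labels = [], []
    for k in range(num_classes):
        digits = digit_labels(k, num_classes)
        for _ in range(per_class):
            group = [rng.choice(digit_bank[d]) for d in digits]
            data.append(resize_nearest(concat_horizontally(group), size, size))
            labels.append(k)
    return data, labels


# Toy bank: one 4x4 "image" per digit, filled with the digit value itself.
toy_bank = {d: [[[d] * 4 for _ in range(4)]] for d in range(10)}
toy_data, toy_labels = synthesize(toy_bank, num_classes=100, per_class=2, size=4)
```

With a real MNIST loader, `digit_bank[d]` would hold the training images whose label is `d`, and `size` would be the input resolution expected by the teacher.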

3.2.3. Soft Target Generation

Given the synthetic dataset $(\tilde{X}, \tilde{Y})$, LW-KD trains the lightweight teacher network $T$ to a good and stable state. The teacher network can then generate the soft target $p_i^t$ for each synthetic data instance $\tilde{x}_i \in \tilde{X}$. The main aim of LW-KD is to transfer the knowledge contained in these probability distributions to benefit student learning. Nevertheless, one fundamental issue is that the datasets used for the teacher and the student are dramatically different, so their classes cannot be semantically aligned. In such a situation, a first thought is that effective knowledge transfer is impossible. However, as we emphasize in this paper, vanilla KD acts like label smoothing, and the aim of LW-KD is to utilize the flexible class distributions generated by real teachers for label smoothing. Thus LW-KD does not need a strict semantic alignment between the classes of the two datasets.

The only modification to the generated soft targets $p_i^t$ is to associate the largest probability value within them with the ground-truth class of a target data instance $x_i$, while leaving the other probabilities unchanged. To this end, we define the following way to update the soft targets:

$$\tilde{p}_i^t = \mathrm{shift}\left(p_i^t,\, c_i,\, m_i\right) \tag{3}$$

where $c_i$ is the real class of the target instance and $m_i$ is obtained by $m_i = \arg\max_k p_{i,k}^t$. The operation shift($\cdot$) swaps the values at the two positions $c_i$ and $m_i$. In this manner, the soft target generated by the teacher becomes somewhat realistic for the target example.
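
The shift operation amounts to a single swap. A minimal sketch, with a hypothetical function name and example values:

```python
def shift(soft_target, true_class):
    """Swap the largest probability with the entry at the true class."""
    shifted = list(soft_target)  # leave the teacher output untouched
    top = max(range(len(shifted)), key=shifted.__getitem__)
    shifted[top], shifted[true_class] = shifted[true_class], shifted[top]
    return shifted


# The teacher peaks at position 1, but the target instance belongs to class 2.
teacher_target = [0.1, 0.6, 0.3]
aligned = shift(teacher_target, true_class=2)
```

After the swap, the peak sits on the ground-truth class while the multiset of probabilities, and hence the overall shape of the distribution, is preserved.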

3.2.4. Enhanced KD Loss for Optimization

After obtaining the soft target $\tilde{p}_i^t$ from the lightweight teacher $T$, we can adopt the standard KD loss used by vanilla KD. It is composed of a cross-entropy loss and a KL-divergence loss (shown in Equation 1), and is defined as follows:

$$\mathcal{L}_{KD} = \frac{1}{N} \sum_{i=1}^{N} \left[ (1 - \alpha)\, \mathcal{L}_{CE}\left(y_i, p_i^s\right) + \alpha\, \mathrm{KL}\left(\tilde{p}_i^t \,\|\, p_i^s\right) \right] \tag{4}$$

where $y_i$ is the ground-truth class distribution and $\alpha$ is a hyperparameter to balance the two losses. The above equation enables instance-level guidance from one soft target distribution for training the student on a target instance. Building on the view of vanilla knowledge distillation as label smoothing, LW-KD goes further by introducing corpus-level guidance, i.e., making the soft class distributions generated by the student indistinguishable from those of the teacher. This is realized by generative adversarial networks (GANs) (goodfellow2014generative).
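
The standard KD loss, a weighted combination of a cross-entropy term and a KL-divergence term, can be sketched for a single instance as below. The weighting scheme `(1 - alpha, alpha)` and the function names are assumptions made for illustration; the original weighting may differ.

```python
import math


def cross_entropy(true_class, student_probs):
    """Cross-entropy against a one-hot ground-truth distribution."""
    return -math.log(student_probs[true_class])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(true_class, student_probs, student_soft, teacher_soft, alpha):
    """Assumed form: (1 - alpha) * CE + alpha * KL(teacher || student)."""
    hard_term = cross_entropy(true_class, student_probs)
    soft_term = kl_divergence(teacher_soft, student_soft)
    return (1.0 - alpha) * hard_term + alpha * soft_term


student = [0.2, 0.5, 0.3]
teacher = [0.1, 0.6, 0.3]
loss = kd_loss(1, student, student, teacher, alpha=0.5)
```

Setting `alpha` to 0 recovers plain supervised training, while `alpha` equal to 1 trains the student on the soft target alone.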

GANs have been widely applied to generate samples that follow the distribution of real data. A GAN consists of a generator $G$ that generates the desired data and a discriminator $D$ that identifies differences between real instances and generated ones. To be specific, given an input noise vector $z$, $G$ maps $z$ to the desired data $x$, i.e., $x = G(z)$. On the other hand, $D$ outputs the probability that an example is real, i.e., $D(x) \in [0, 1]$. The objective function of a standard GAN is formulated as below:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{5}$$

where the generator $G$ is adjusted according to the training error produced by $D$ using back-propagation. The optimal generator is:

$$G^{*} = \arg\min_G \max_D V(D, G) \tag{6}$$
Here we regard the soft class distributions $\tilde{p}^t$ generated by the teacher network $T$ as the real data examples and let the student network $S$ play the role of the generator $G$. A two-layer fully-connected neural network is adopted as the discriminator $D$. On this basis, we define the following objective in our situation:

$$\mathcal{L}_{adv} = \mathbb{E}_{\tilde{p}^t}\left[\log D\left(\tilde{p}^t\right)\right] + \mathbb{E}_{p^s}\left[\log\left(1 - D\left(p^s\right)\right)\right] \tag{7}$$

Through adversarial training, the student can learn better from the lightweight teacher, because the above objective encourages the student to be more confident in the soft target distributions, which act as a good regularization.
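
The adversarial objective can be sketched with a tiny two-layer fully-connected discriminator in plain Python. The hidden width, the ReLU activation, the omission of bias terms, and all function names are simplifying assumptions; only the forward computation of the objective is shown, not the gradient updates.

```python
import math
import random


def make_discriminator(num_classes, hidden=8, rng=random):
    """Random weights for a two-layer network (biases omitted for brevity)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(num_classes)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    return w1, w2


def discriminate(params, probs):
    """Probability that a class distribution comes from the teacher."""
    w1, w2 = params
    hidden = [max(0.0, sum(w * p for w, p in zip(row, probs))) for row in w1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def adversarial_objective(params, teacher_targets, student_outputs):
    """Mean log D(teacher) plus mean log(1 - D(student))."""
    real = sum(math.log(discriminate(params, t)) for t in teacher_targets)
    fake = sum(math.log(1.0 - discriminate(params, s)) for s in student_outputs)
    return real / len(teacher_targets) + fake / len(student_outputs)


disc = make_discriminator(num_classes=3, rng=random.Random(0))
value = adversarial_objective(disc, [[0.7, 0.2, 0.1]], [[0.3, 0.3, 0.4]])
```

The discriminator is trained to increase this value, while the student is trained to decrease the second term, which pushes its output distributions toward those of the teacher.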

Data: Synthetic dataset SynMNIST $(\tilde{X}, \tilde{Y})$ and the specific training data $\mathcal{D}$
Input: Lightweight teacher model $T$ and discriminator $D$
Output: Student model $S$
1 Quickly pretrain the lightweight teacher on SynMNIST and get a satisfactory teacher model $T$;
2 Randomly initialize the student model $S$;
3 Build a simple one-to-one mapping between real instances and synthetic examples;
4 for number of training epochs do
5       for number of training steps do
6             Select a training instance $(x_i, y_i) \in \mathcal{D}$;
7             Load the mapped instance from the synthetic data: $\tilde{x}_i$;
8             $z_i^t \leftarrow T(\tilde{x}_i)$;
9             $z_i^s \leftarrow S(x_i)$;
10             Compute $p_i^t$ and $p_i^s$ in a similar fashion as Eq. 2;
11             Perform the shift operation (Eq. 3) to get $\tilde{p}_i^t$;
12             Update $S$ by minimizing Eq. 8;
13             Update $D$ by maximizing Eq. 7;
14       end for
15 end for
Algorithm 2 The learning algorithm of LW-KD.

In the end, the enhanced KD loss for LW-KD is formulated as:

$$\mathcal{L} = \mathcal{L}_{KD} + \beta\, \mathcal{L}_{adv} \tag{8}$$

where $\mathcal{L}_{KD}$ is the KD loss, $\mathcal{L}_{adv}$ is the adversarial loss, and $\beta$ is a hyperparameter for balancing the two. Algorithm 2 summarizes the overall learning procedure of LW-KD. As shown in Lines 6 to 13, the knowledge learned by the teacher network on the synthetic dataset is transferred to guide student learning on the target dataset.

| Type | Name | Model | #Params | Dataset | #Example | #Epoch | Acc |
|---|---|---|---|---|---|---|---|
| Lightweight teacher | | LeNet5 (lecun1998gradient) | 61.7K | MNIST | 60000 | 10 | 98.80 |
| | | | | SynMNIST-15 | 75000 | 10 | 98.46 |
| | | | | SynMNIST-100 | 60000 | 10 | 97.22 |
| | | | | SynMNIST-101 | 101000 | 2 | 84.81 |
| Variant for multi-channel image | | LeNetW | 25.8M | CIFAR10 | 60000 | 20 | 74.52 |
| | | | | CIFAR100 | 60000 | 20 | 40.95 |

Table 2. Teacher instances of LeNet5 and its variant LeNetW. We use SynMNIST-NUM to denote that the class number of the synthesized dataset is NUM. Each teacher instance is trained on the associated dataset.
| Dataset | Student | #Params | Sup-Stu | Teacher | KD | LSR | Tf-KD | KD[] | Ours[] |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | ResNet20 (he2016deep) | 69.7K | 92.29 | ResNet18 | 92.98 | 92.37 | 92.55 | 92.46 | 92.94 |
| | | | | ResNet50 | 93.04 | | | | |
| | MobileNetV2 (sandler2018mobilenetv2) | 2.3M | 89.74 | ResNet18 | 91.20 | 89.85 | 89.96 | 90.04 | 91.57 |
| | | | | ResNet50 | 91.69 | | | | |
| | ShuffleNetV2 (ma2018shufflenet) | 1.3M | 91.21 | ResNet18 | 92.17 | 91.28 | 91.39 | 91.51 | 92.48 |
| | | | | ResNet50 | 92.55 | | | | |
| | VGG19 (karen15very) | 39.0M | 93.51 | DenseNet121 | 93.75 | 93.58 | 93.30 | | 93.81 |
| | GoogLeNet (szegedy2015going) | 6.3M | 95.17 | ResNeXt29 | 95.21 | 94.73 | 94.62 | | 95.35 |
| CIFAR100 | ResNet20 (he2016deep) | 275.5K | 68.38 | ResNet18 | 68.74 | 68.74 | 69.04 | 68.73 | 69.39 |
| | | | | ResNet50 | 68.78 | | | | |
| | MobileNetV2 (sandler2018mobilenetv2) | 2.4M | 67.44 | ResNet18 | 69.00 | 67.78 | 68.21 | 67.77 | 69.13 |
| | | | | ResNet50 | 68.04 | | | | |
| | ShuffleNetV2 (ma2018shufflenet) | 1.4M | 70.83 | ResNet18 | 71.43 | 70.87 | 71.46 | 71.24 | 71.57 |
| | | | | ResNet50 | 71.11 | | | | |
| | VGG19 (karen15very) | 39.3M | 73.40 | DenseNet121 | 73.91 | 73.50 | 73.87 | | 74.18 |
| | GoogLeNet (szegedy2015going) | 6.3M | 79.37 | ResNeXt29 | 79.58 | 78.83 | 78.16 | | 79.44 |
| AVG | | | 82.13 | | | 82.15 | 82.26 | | 82.99* |

Table 3. Results of image classification for different learning methods. For the two teachers of each student, the more complicated one is used to compute the average results because of its better performance. * indicates that LW-KD is significantly better than Tf-KD under a t-test.
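
The AVG row of Table 3 can be reproduced with a short arithmetic check. The sketch assumes that, in the VGG19 and GoogLeNet rows, the three values following the KD column belong to LSR, Tf-KD, and Ours, respectively; under this assumption the per-column means match the reported averages.

```python
# Per-student accuracies from Table 3, in row order (CIFAR10 then CIFAR100).
sup_stu = [92.29, 89.74, 91.21, 93.51, 95.17, 68.38, 67.44, 70.83, 73.40, 79.37]
lsr = [92.37, 89.85, 91.28, 93.58, 94.73, 68.74, 67.78, 70.87, 73.50, 78.83]
tf_kd = [92.55, 89.96, 91.39, 93.30, 94.62, 69.04, 68.21, 71.46, 73.87, 78.16]
ours = [92.94, 91.57, 92.48, 93.81, 95.35, 69.39, 69.13, 71.57, 74.18, 79.44]


def average(values):
    """Mean accuracy rounded to two decimals, as reported in the table."""
    return round(sum(values) / len(values), 2)


averages = {
    "Sup-Stu": average(sup_stu),  # 82.13
    "LSR": average(lsr),          # 82.15
    "Tf-KD": average(tf_kd),      # 82.26
    "Ours": average(ours),        # 82.99
}
```

On these numbers, LW-KD improves the average accuracy over Tf-KD by 0.73 points and over the supervised students by 0.86 points.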


4. Experimental Setup

This section clarifies the detailed setup for the experiments, including the used datasets, the adopted baseline learning approaches, and the implementations of the proposed LW-KD framework.

4.1. Datasets

To ensure reliable comparison, we adopt multiple datasets, covering the modalities of image, text, and video.

CIFAR10 and CIFAR100. CIFAR is a popular image classification benchmark, involving 32×32 RGB images and being widely used for testing KD. Both CIFAR10 and CIFAR100 share the same data source, which contains 50K training images and 10K testing images. The difference between the two datasets is that CIFAR10 has 10 classes while CIFAR100 has 100 classes.

THUCNews (http://thuctc.thunlp.org/). THUCNews is a Chinese text classification dataset collected from Sina (https://www.sina.com.cn/) during the period from 2005 to 2011. We extracted 200,000 news headlines from THUCNews with 10 categories, including Finance, Real Estate, Stocks, Education, Technology, Society, Current Affairs, Sports, Games, and Entertainment. Each category has 20,000 headlines and the text length is between 20 and 30. The dataset is split into training, validation, and testing sets by the ratio of 18 to 1 to 1.

ToutiaoNews (https://github.com/skdjfla/toutiao-text-classfication-dataset). This dataset was collected from Toutiao (https://www.toutiao.com/), a streaming news platform. It has 382,688 text instances belonging to 15 categories, e.g., Story, Culture, and Entertainment. The same split procedure used for THUCNews is applied to ToutiaoNews.

UCF101. UCF101 (Soomro2012ucf101) is currently the largest dataset about human action. It consists of 101 action classes, over 13K clips, and 27 hours of video data collected from YouTube.

4.2. Baselines

We adopt the following learning approaches directly relevant to LW-KD for comparison.

  • Sup-Stu: This is the general name for a group of student models, each of which is trained on a target dataset by supervised learning, without the help of knowledge distillation.

  • LSR: It denotes that a student is trained with label smoothing. Following (szegedy2016rethinking; muller2019does), we adopt a global uniform class distribution to smooth the ground-truth distribution.

  • Tf-KD: The teacher-free learning framework (yuan2019revisit) designs a virtual perfect teacher for soft target generation. The distributions generated by this teacher are class-dependent, sharing a similar spirit with the smoothing distribution used in LSR.

  • KD: This represents vanilla KD (hinton2015distilling) proposed to leverage the knowledge contained in the teachers’ output logits. Although this method is simple compared with its follow-up approaches, its performance is still very competitive compared to other representative KD approaches (Tian-ICLR20).

For fair comparison, all the adopted approaches are tuned on the validation datasets for achieving their own good performance.

4.3. Implementation Details

Our learning framework is implemented with PyTorch (paszke2019pytorch) on two NVIDIA RTX 2080 Ti GPUs. For teacher training, we use SGD with a learning rate of 0.1, momentum of 0.9, and weight decay of 5e-4. For adversarial training of the discriminator, we use Adam with a learning rate of 0.1. The batch size is set to 64 for UCF101 and 128 for the other datasets. The lightweight teachers adopted by LW-KD are presented in Table 2. Other detailed settings, including the loss-balancing hyperparameters, the temperature, and the choices of teachers for vanilla KD, are reported for the specific datasets.

5. Experimental Results

In this part, we aim to address the following research questions:

  • How well does LW-KD perform on different datasets compared with the existing vanilla KD and teacher-free KD learning methods?

  • Do the main design principles involved in LW-KD ensure model effectiveness and efficiency?

We adopt the metric of accuracy (Acc for short) to quantify the classification performance. Furthermore, we utilize floating point operations (FLOPs), denoting the total calculation cost for one instance, and model size for efficiency analysis.

| Student | #Params | Dataset | Sup-Stu | LSR | Tf-KD | KD[BERT] | Ours |
|---|---|---|---|---|---|---|---|
| TextCNN (kim2014convolutional) | 2.1M | THUCNews | 90.58 | 91.33 | 91.38 | 91.78 | 91.82 |
| | | ToutiaoNews | 85.93 | 86.23 | 86.04 | 86.32 | 86.65 |
| TextRCNN (lai2015recurrent) | 2.6M | THUCNews | 91.11 | 91.51 | 91.31 | 91.60 | 91.86 |
| | | ToutiaoNews | 84.89 | 85.36 | 85.02 | 85.80 | 85.86 |
| DPCNN (johnso2017deep) | 1.8M | THUCNews | 91.07 | 91.29 | 90.66 | 91.51 | 91.60 |
| | | ToutiaoNews | 84.45 | 84.56 | 84.52 | 85.35 | 85.02 |
| FastText (joulin2017bag) | 152.0M | THUCNews | 92.32 | 92.66 | 92.33 | 92.47 | 92.78 |
| | | ToutiaoNews | 86.51 | 87.21 | 86.79 | 87.62 | 87.46 |
| BERT (devlin2019bert) | 102.3M | THUCNews | 94.59 | 94.63 | 94.37 | 94.77 | 94.83 |
| | | ToutiaoNews | 89.29 | 89.42 | 89.36 | 89.64 | 89.77 |

Table 4. Results of text classification for different learning methods.

5.1. Approach Performance Comparison (RQ1)

5.1.1. Performance on Image Modality

We first test the performance on the image classification datasets, i.e., CIFAR10 and CIFAR100. Two LeNet5 instances in Table 2, whose training datasets have the same class numbers as CIFAR10 and CIFAR100, are taken as the two lightweight teachers exploited by LW-KD, respectively. Since the original form of LeNet5 does not support the multi-channel images in CIFAR, we also design wider variants of LeNet5 (LeNetW in Table 2), trained on CIFAR10 and CIFAR100, for later analysis. Besides, vanilla KD adopts other complex teachers, whose sizes and accuracies are described in Table 5. In LW-KD, the loss-balancing hyperparameters and the temperature are chosen separately for CIFAR10 and CIFAR100 through simple grid search. The learning rate of SGD is decayed by a factor of 0.1 at epochs 100 and 150. The total number of training epochs is set to 200 for the two CIFAR datasets.
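
The step-wise learning-rate decay described above can be written as a small function. The base learning rate of 0.1 and the convention that the decay takes effect from the milestone epoch onward (with epochs counted from 0) are assumptions for illustration.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Multiply the base rate by gamma once for every milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed


# Learning rate for each of the 200 training epochs.
schedule = [learning_rate(e) for e in range(200)]
```

Under these assumptions, the rate is 0.1 for epochs 0 to 99, 0.01 for epochs 100 to 149, and 0.001 afterwards, which produces the jumps visible in the training curves.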

| Teacher | #Params | CIFAR10 | CIFAR100 |
|---|---|---|---|
| ResNet18 (he2016deep) | 11.2M | 95.18 | 78.18 |
| ResNet50 (he2016deep) | 23.5M | 95.53 | 78.86 |
| DenseNet121 (huang2017densely) | 7.0M | 95.67 | 79.86 |
| ResNeXt29 (xie2017aggregated) | 89.6M | 95.74 | 81.02 |

Table 5. Strong teacher models trained on CIFAR.

Table 3 shows the results on CIFAR10 and CIFAR100, from which we have the following key findings:

  • Tf-KD and LSR achieve similar performance on the image datasets, which conforms to expectation since their intrinsic mechanisms are similar. Compared to KD, the two learning methods achieve comparable performance on CIFAR100, but their results on CIFAR10 are inferior. This suggests that the performance of Tf-KD and LSR might be limited by their inflexible manually-crafted class distributions.

  • Our learning framework LW-KD shows comparable performance with vanilla KD, and outperforms Tf-KD and LSR significantly in most cases. This observation is meaningful and welcome, since the adopted lightweight teachers (about 61.7K parameters) are much smaller than the complicated teachers and are even smaller than the student models. Besides, using the lightweight teacher variant trained on the CIFAR datasets does not perform as well as LW-KD.

(a) The student ResNet20 on CIFAR100.
(b) The student MiCT-ResNet18 on UCF101.
Figure 4. Performance curves corresponding to three different learning approaches. The jumps in the curves are caused by the decay of learning rates in SGD (yuan2019revisit), which is normal and reasonable.

5.1.2. Performance on Text Modality

Our framework can also be naturally applied to the text classification task. For the two textual datasets, we use a single setting of the loss hyperparameters and the temperature. LW-KD uses one lightweight teacher each for THUCNews and ToutiaoNews. By contrast, vanilla KD employs BERT (devlin2019bert) as the teacher of the student models. We use Adam to update all the student models, including TextCNN, TextRCNN, DPCNN, FastText, and BERT.

Table 4 reports the text classification results of the students under different learning methods. As in image classification, distillation-based learning methods boost the performance of students, showing the benefit of incorporating distillation beyond plain supervised learning. Tf-KD still does not reach the performance level of KD. More importantly, LW-KD even slightly outperforms KD in most cases of text classification (8 of the 10 student-dataset pairs in Table 4).

Method MiCT-ResNet18 3D-ResNet18 3D-ResNet101
#Params 16.1M 33.3M 85.3M
Sup-Stu 64.42 60.67 67.84
LSR 65.94 61.43 67.92
Tf-KD 64.98 61.12 68.01
KD 65.97 61.91 -
Ours 66.18 62.09 69.17
Table 6. Results of human action classification on UCF101.

5.1.3. Performance on Video Modality

Finally, we conduct experiments on UCF101 to investigate how well LW-KD works on the video modality. A lightweight teacher is again leveraged in our learning framework. We select MiCT-ResNet18 (zhou2018mict), 3D-ResNet18 (hara2018can), and 3D-ResNet101 (hara2018can) as students. For the implementation of KD, we regard 3D-ResNet101 as the teacher of the other two student models. Considering that all three student models perform relatively poorly on this task (cf. Sup-Stu in Table 6), we only train for 2 epochs to bridge the performance gap between the teacher and the students. The three hyperparameters are set to 0.1, 0.05, and 20.0, respectively. We use SGD for optimization, with an initial learning rate of 0.01 and a decay factor of 0.1. Table 6 presents the detailed results, from which similar conclusions about LW-KD can be drawn.

In a nutshell, the above analysis demonstrates the benefits of LW-KD: it achieves better classification performance than teacher-free KD and LSR, and it learns from a far more efficient lightweight teacher than vanilla KD. As a complement, we depict the performance curves of specific students on the test sets of CIFAR100 and UCF101. Figure 4 plots these trends, which show the consistent and robust improvement of student models brought by LW-KD, and its slightly better performance over vanilla KD.

5.2. Ablation Study of LW-KD (Rq2)

Student Dataset Ours w/o ADV w/o KL w/o SYN
ResNet20 CIFAR10 92.94 92.76 92.60 92.46
ResNet20 CIFAR100 69.39 69.21 69.15 68.73
TextRCNN THUCNews 91.86 91.48 91.53 -
TextCNN ToutiaoNews 86.65 86.28 86.42 -
MiCT-ResNet18 UCF101 75.62 75.22 75.05 -
Table 7. Ablation study of LW-KD.

This part further analyzes how the main design principles in LW-KD ensure its effectiveness. To this end, we construct several variants of LW-KD. “w/o ADV” removes the adversarial loss from Equation 8. “w/o KL” removes the KL-divergence loss shown in Equation 4. “w/o SYN” trains the teacher model LeNetW on the CIFAR datasets, instead of training the lightweight teacher on the synthetic dataset SynMNIST, as in the corresponding KD variant in Table 3.

Table 7 describes the corresponding results. Firstly, we compare “w/o ADV” and “w/o KL” with the full LW-KD approach and observe that both variants suffer from consistent performance drops. This verifies the necessity of incorporating corpus-level knowledge guidance through the adversarial loss and instance-level knowledge guidance through the KL-divergence loss. By further investigating “w/o SYN”, we find that its performance drop is larger than those of the first two variants. This might be attributed to the fact that the teacher network is a relatively simple model for the CIFAR datasets. As seen in Table 2, the performance of LeNetW is not very satisfactory compared to the results of other more powerful deep networks in Table 3. This confirms the intuition that learning a lightweight teacher on a simple dataset is more reasonable for transferring knowledge to a student as label smoothing.
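To make the instance-level term concrete, the snippet below sketches a temperature-softened KL divergence between teacher and student outputs. It follows the standard soft-target form of Hinton et al. and is only an assumption about the shape of Equation 4; the adversarial term of Equation 8 is not modelled. The function names, the example logits, and the temperature of 4.0 are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def soft_target_kl(teacher_logits, student_logits, temperature):
    """KL between softened teacher and student distributions (assumed direction)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Hypothetical logits for a 3-class problem.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.3]
print(soft_target_kl(teacher, student, temperature=4.0))
```

A higher temperature flattens both distributions, so the teacher's relative preferences among non-target classes contribute more to the loss. This is the per-instance signal that the “w/o KL” variant removes.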

CIFAR10 UCF101 ToutiaoNews
Model #Params #FLOPs Model #Params #FLOPs Model #Params #FLOPs
*LeNet5 61.7K 481.7K *LeNet5 61.7K 481.7K *LeNet5 69.4K 497.0K
*Discriminator 4.9K 9.7K *Discriminator 4.9K 9.7K *Discriminator 10.8K 21.4K
ResNet18 11.2M 37.2M MiCT-ResNet18 16.1M 12.6G TextCNN 2.1M 20.6M
DenseNet121 7.0M 898.1M 3D-ResNet18 33.3M 16.8G FastText 152.0M 465.9K
ResNeXt29 89.6M 14.1G 3D-ResNet101 85.3M 37.7G BERT 102.3M 5.4G
Table 8. Space complexity and computational cost of models on different datasets.

5.3. Efficiency Analysis (Rq2)

To validate the efficiency of LW-KD in training a lightweight teacher, we analyze both the space complexity and the computational cost of the relevant models. In Table 8, the first part of the results for each dataset gives the total number of parameters of the lightweight teacher LeNet5 and of the other teacher networks adopted by KD. The comparison shows that the lightweight teacher is at least two orders of magnitude smaller than the complicated teachers. As such, the proposed LW-KD framework is space-efficient in the teacher training stage, which alleviates the issue of large space occupation encountered by vanilla KD.

Moreover, Table 8 also presents the computational cost in terms of the FLOPs needed to generate a prediction for each data instance. This measure is independent of the computational environment. As we can see, the lightweight teacher is roughly two or more orders of magnitude cheaper than the other chosen teachers. Besides, we also report the cost of the discriminator used in adversarial training. Compared to the teachers, including the lightweight teacher, its computational cost is nearly negligible.
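The order-of-magnitude comparison can be checked directly from the CIFAR10 columns of Table 8. The following sketch only re-computes ratios from the reported numbers; the helper names `parse` and `orders_of_magnitude` are illustrative and not part of the paper.

```python
import math

# Suffix multipliers used in Table 8.
UNITS = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse(value):
    """Convert a string such as '61.7K' or '14.1G' to a float."""
    return float(value[:-1]) * UNITS[value[-1]]


def orders_of_magnitude(big, small):
    """log10 of the ratio between two Table 8 entries."""
    return math.log10(parse(big) / parse(small))


# CIFAR10 columns of Table 8.
lightweight = {"params": "61.7K", "flops": "481.7K"}
teachers = {
    "ResNet18": {"params": "11.2M", "flops": "37.2M"},
    "DenseNet121": {"params": "7.0M", "flops": "898.1M"},
    "ResNeXt29": {"params": "89.6M", "flops": "14.1G"},
}

for name, cost in teachers.items():
    size_gap = orders_of_magnitude(cost["params"], lightweight["params"])
    flop_gap = orders_of_magnitude(cost["flops"], lightweight["flops"])
    print(f"{name}: params 10^{size_gap:.2f}x, FLOPs 10^{flop_gap:.2f}x")
```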

6. Conclusions

In this paper, we study knowledge distillation for neural networks from the perspective of learning a lightweight teacher as label smoothing. This is motivated by the issues of complicated teachers used in many existing KD approaches and the inflexibility of manually-crafted virtual teachers utilized in teacher-free KD. We propose a novel distillation learning framework named LW-KD. In the framework, the lightweight teacher (i.e., LeNet5) is efficiently trained on the simple synthetic dataset SynMNIST and later generates soft target distributions as teacher knowledge. The enhanced KD loss is devised to incorporate this knowledge and further guide effective student learning on a target dataset. To ensure reliability, we conduct experiments on multiple data modalities. The comprehensive comparison demonstrates the benefits of LW-KD and validates the rationality of its design principles.


Appendix A

To support the reproducibility of our experiments, we detail the architectures of LeNet5 and LeNetW, and the configurations of the compared learning approaches.

A.1. Detailed Settings of Model Architectures

The detailed architecture of the lightweight teacher in our experiments is shown in Table 9. Its original form (lecun1998gradient) only handles one image channel. LeNetW is built on LeNet5 to handle 3 channels, making it suitable for the RGB images in CIFAR. Table 10 shows its concrete architecture, which has only 4 layers but is very wide.

Output Size LeNet5
14×14 5×5, 6 Conv, ReLU
5×5 5×5, 16 Conv, ReLU
1×1 5×5, 120 Conv, ReLU
1×1 84 FC
1×1 Output FC
Table 9. The architecture of LeNet5 for SynMNIST

Output Size LeNetW
30×30 3×3, 64 Conv, ReLU
14×14 3×3, 128 Conv, ReLU
1×1 1024 FC, Dropout
1×1 Output FC
Table 10. The architecture of LeNetW for CIFAR
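As a sanity check on Table 9, the parameter count of LeNet5 can be derived layer by layer. The sketch below assumes a single input channel, a bias term in every layer, and the layer widths as read from Table 9; the helper names are illustrative.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def lenet5_params(num_classes, in_channels=1):
    """Total parameters of LeNet5 with the layer widths of Table 9."""
    return (
        conv_params(5, in_channels, 6)    # 5x5 conv, 6 filters
        + conv_params(5, 6, 16)           # 5x5 conv, 16 filters
        + conv_params(5, 16, 120)         # 5x5 conv, 120 filters
        + fc_params(120, 84)              # FC with 84 units
        + fc_params(84, num_classes)      # output FC
    )


print(lenet5_params(10))  # 61706
```

With 10 output classes this gives 61,706 parameters, consistent with the 61.7K reported for LeNet5 in Table 8. Only the output layer depends on the class number, so adjusting the teacher to a different target dataset changes its size very little.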

A.2. Configurations of Approaches

Table 11 summarizes the model configurations for each dataset. In the second row of the table, ‘*ALL’ denotes all the models trained on CIFAR10 and CIFAR100, including ResNet20 (he2016deep), MobileNetV2 (sandler2018mobilenetv2), etc.

Dataset Model Input Size Batch Size Optimizer lr lr_step #Epochs
SynMNIST, MNIST LeNet5 28×28 128 SGD 0.1 - 10
CIFAR *ALL 32×32 128 SGD 0.1 [100, 150] 200
THUCNews, ToutiaoNews DPCNN, TextCNN, FastText 1×300 128 Adam 1e-3 - 20
THUCNews, ToutiaoNews TextRCNN 1×300 128 Adam 0.1 - 10
THUCNews, ToutiaoNews BERT 1×300 128 Adam 5e-5 - 3
UCF101 MiCT-ResNet18 224×224 64 SGD 0.01 [80] 120
UCF101 3D-ResNet18, 3D-ResNet101 224×224 64 SGD 0.01 [40, 80] 90
Table 11. Detailed configurations of approaches for different datasets.