Deep neural networks (DNNs) have achieved superior performance on various computer vision tasks [Russakovsky2015ImageNet]. However, these promising results come at the cost of deeper and wider architectures, which require long inference times and heavy computation. This drawback hinders their deployment on devices with limited memory and strict latency requirements, such as mobile phones and embedded devices. Therefore, model compression techniques [Song2015Deep] [Hassibi1993Second] [Jaderberg2014Speeding] [Cun1989Optimal] have emerged to address such practical applications.
Knowledge distillation [Hinton2015Distilling] [Romero2015FitNets] [Zagoruyko2016Paying] provides an effective and promising solution for model compression. Generally speaking, distillation uses a teacher-student framework to transfer knowledge from a large, complicated teacher to a compact student, yielding a compact neural network whose performance is close to that of the teacher network. It usually works by adding a distilling term to the original classification loss that encourages the student to mimic the teacher’s behavior. For example, Hinton et al. [Hinton2015Distilling] propose to match the final predicted probabilities of the teacher and student network, as the distribution of the teacher network’s soft targets contains more information than raw one-hot labels. Romero et al. [Romero2015FitNets] then devise a hint-based training approach which matches intermediate layers of the student network to the corresponding layers of the teacher network. Zagoruyko et al. [Zagoruyko2016Paying] extend the hint-based idea by using attention as a mechanism for distilling knowledge from the teacher network to the student network.
However, we observe that the compact student network should be guided gradually using samples ordered in a meaningful sequence. Although the teacher network conveys additional information beyond the conventional supervised learning target, the student network is harder to train than the teacher network because of its limited feature representation ability. Conventional distillation methods usually train the networks on a sequence of random mini-batches sampled uniformly from the entire training data. Instead, we supervise the student network so that it generalizes to easy samples before difficult ones. In other words, the teacher network guides the student network to construct an easy feature space using easy samples and then brings it close to the teacher’s complicated feature space by increasing the difficulty of the training samples.
In this paper, we propose a new knowledge distillation framework called instance-level Sequence Learning Knowledge Distillation (SLKD). It employs the early epochs as snapshots to create a curriculum for the student network’s next training epochs. Figure 1 shows the overall procedure. Curriculum learning, which is motivated by the idea of a curriculum in human learning, is employed in our distillation framework to supervise the student’s training progress. In the first training phase, the student network is trained under the pre-trained teacher network on random mini-batches sampled uniformly from the entire training data. Then we use the early epochs of the student network as a snapshot to build a curriculum by ranking the training data for the next training phase. The easy samples, as judged by the resulting classifier, are fed first to the teacher and student network in the knowledge distillation framework, and difficult samples gradually join the training. Therefore, the teacher network no longer supervises the student blindly, but receives feedback from the student network and advances in a regular order.
Our contributions in this paper are summarized as follows:
We propose a novel teacher-student knowledge distillation framework with curriculum learning. The framework allows the student model to get generalization capability by continuously adding the training data with increasing complexity into the distillation process.
We use snapshots of the student to rank sample complexity without consuming extra training resources. This lets the teacher efficiently supervise the student’s training process using the student’s feedback on performance.
We verify our proposed instance-level sequence learning knowledge distillation framework on CIFAR-10, CIFAR-100, SVHN and CINIC-10 datasets. Experiments show that our method can significantly improve the performance of student networks in knowledge distillation.
II Related Work
Since this work focuses on training a compact network with high performance via knowledge distillation, we briefly review the related literature on model compression and acceleration and knowledge distillation. In addition, we also give a brief review of the recent advances in curriculum learning.
DNN compression and acceleration are important for real-time applications and have drawn increasing attention in recent years. A straightforward way to boost the inference speed of DNNs is parameter pruning [Cun1989Optimal] [Hao2016Pruning] [Han2015Learning], which removes redundant weights from a trained larger network to obtain a small one. Typically, it retains the accuracy of the larger model when a proper prune ratio is set. Another way is low-rank decomposition [Denil2013Predicting] [Kim2015Compression], which factorizes parameter-heavy layers into multiple lightweight ones using matrix decomposition techniques. However, parameter pruning and low-rank decomposition usually lead to large accuracy drops, so fine-tuning is required to alleviate those drops.
Knowledge distillation methods usually utilize the teacher-student strategy to transfer knowledge from a pre-trained large teacher network to a compact student network. This directly yields a compact model that retains the accuracy of a large model, which facilitates deployment at test time. Bucilua et al. [Bucilua:2006:MC:1150402.1150464] pioneer this line of methods in model compression. They attempt to compress the information from an ensemble of heterogeneous neural networks into a single neural network. Subsequently, Caruana et al. [Caruana2013Do] extend this method by forcing the wider and shallower student network to mimic the teacher network, using an L2 norm to penalize the difference between the student’s and teacher’s logits. More recently, Hinton et al. [Hinton2015Distilling] first propose the concept of knowledge distillation, which trains the student network to imitate the distribution of the teacher network’s softened output. This output, obtained by dividing the logits by a temperature hyper-parameter before the softmax, contains more information than one-hot targets.
Since then, researchers have explored variants of knowledge distillation that use more supervisory information from the teacher network. In [Romero2015FitNets], Romero et al. introduce a new metric on intermediate features between the teacher and student networks and add a regressor to match the different sizes of the teacher’s and student’s outputs. Zagoruyko et al. [Zagoruyko2016Paying] propose to use activation-based and gradient-based spatial attention maps from intermediate layers as the supervisory information. Yim et al. [Yim2017A] propose to transfer knowledge through the flow of solution procedure (FSP), which is generated by computing the Gram matrix of features between layers. In the FSP method, the student imitates the teacher’s process of solving problems.
Different from the above methods, several studies adopt multiple teachers to supervise the student network’s training. Shan et al. [Shan2017Learning] conduct distillation by combining the knowledge of intermediate representations from multiple teacher networks. Shen et al. [DBLP:journals/corr/abs-1811-02796] extend this idea by learning a compact student model which is capable of handling the super task from multiple teachers.
Moreover, knowledge distillation can be combined with conventional DNN compression and acceleration approaches. In [Mishra2017Apprentice], Mishra et al. propose a novel method to combine network quantization with knowledge distillation by jointly training a teacher network (full-precision) and a student network (low-precision) from scratch. Besides, Zhou et al. [zhou2017Rocket] propose a similar framework in which the student network and the teacher network share lower layers and are trained simultaneously. Recently, researchers have studied knowledge distillation from perspectives other than model compression. Born-again-network [DBLP:conf/icml/FurlanelloLTIA18] optimizes the same network in generations by training students parameterized identically to their teachers. Yang et al. [DBLP:journals/corr/abs-1805-05551] optimize deep networks over many generations, in which a few networks with the same architecture are optimized one by one. A similar idea has been proposed to extract useful information from earlier epochs in the same generation [DBLP:journals/corr/abs-1812-00123].
As shown above, knowledge distillation typically transfers knowledge via feeding the sequence of random mini-batches sampled uniformly from the training data. In contrast, the student network in our proposed SLKD method is gradually guided using samples ordered in a meaningful sequence.
Curriculum learning is a learning paradigm inspired by the cognitive process of humans and animals, in which the model is trained gradually using samples ordered in a meaningful sequence. A curriculum specifies a scheme under which training samples will be gradually learned. Bengio et al. [Bengio:2009:CL:1553374.1553380] pioneer this line of methods in curriculum learning. They propose that it is better to feed samples organized in a meaningful way, so that the less complex examples are presented first, than to feed samples at random. Hacohen et al. [DBLP:conf/icml/HacohenW19] apply curriculum learning to the training of deep networks and analyze its effect. Jiang et al. [Jiang2018Mentornet] propose a novel method to learn data-driven curriculums for deep CNNs trained on corrupted labels.
In this section, we will describe our main concept of knowledge distillation via instance-level sequence learning. Firstly, we introduce the motivation of our SLKD method. Then the following part describes the training objective and procedure.
Knowledge distillation methods usually train the student to mimic the teacher’s class probabilities or feature representation. Supervised by this softened knowledge, extra supervision information, such as the class correlation, is added to train the student network. However, the optimization of the student model is much harder due to the limited feature extraction capability. Moreover, the teacher network does not receive any feedback information from the student network in the whole distillation process.
From the perspective of curriculum learning [Bengio:2009:CL:1553374.1553380], the teacher network should gradually distill knowledge to the student by feeding samples ordered in a meaningful sequence. We claim that an easy-to-hard learning process helps the student reach a better local minimum. To validate this assumption, we partition CIFAR-100 into three subsets by the pre-trained teacher network’s confidence score and use the classification accuracy and training loss to evaluate the difficulty of each subset. ResNet-110 is used as the teacher network and ResNet-20 as the student network. The teacher is trained using the cross-entropy loss and the student network is trained using the KD loss. Table I shows the results. We find that the less difficult lesson has a lower training loss, which means that a complex lesson poses a target that is harder for the student network to approach.
Based on this philosophy, we propose the instance-level SLKD method, which employs curriculum learning to order the training samples by difficulty. Figure 2 shows an overview of SLKD. We use snapshot networks to represent the intermediate training states of the student network in our architecture. As can be seen from Figure 2, we employ the snapshot network to design the lessons (subsets) from the whole training dataset. The curriculum is updated dynamically throughout the training process. The student network is first trained on the easiest lesson, and we gradually increase the difficulty of the lessons in the following training stages. Therefore, our distillation proceeds sequentially from the easy examples to the hard ones. In this manner, we hope to ease the learning process and help the student network reach an ideal local minimum.
The idea of knowledge distillation is to train the student network not only via the true labels information but also by mimicking the teacher’s class probabilities or feature representation (activations of hidden layers). In other words, the teacher network could provide valuable dark knowledge as extra supervisory information besides the ground-truth labels.
Let us denote $x$ and $y$ as the input of the DNN and the one-hot labels of our architecture, respectively. $S$ and $T$ represent the student network and teacher network with parameters $\theta_S$ and $\theta_T$, respectively. Given an input image $x$, the teacher network outputs the final predictions $P_T$, which are obtained by applying the softmax function to the un-normalized log probability values $a_T$, i.e. $P_T = \mathrm{softmax}(a_T)$. Similarly, the same image is fed into the student network to obtain the predictions $P_S = \mathrm{softmax}(a_S)$. The standard cross-entropy is denoted as $\mathcal{H}$. In classical supervised learning, the mismatch between the output of the student network softmax and the ground-truth label is usually penalized using the cross-entropy loss:

$$\mathcal{L}_{CE} = \mathcal{H}\big(\mathrm{softmax}(a_S),\, y\big) \tag{1}$$

Hinton et al. [Hinton2015Distilling] extend previous works [Bucilua:2006:MC:1150402.1150464] by training a compact student network to mimic the output probability distribution of the teacher network. They name this informative and representative knowledge dark knowledge and match the softened outputs of the teacher and student via a KL-divergence loss:

$$\mathcal{L}_{KD} = \mathrm{KL}\big(\mathrm{softmax}(a_T/\tau) \,\|\, \mathrm{softmax}(a_S/\tau)\big) \tag{2}$$

The softened teacher output contains the relative probabilities of the incorrect classification results. When we perform knowledge distillation with a temperature parameter $\tau$, the student network is trained to optimize the following loss function:

$$\mathcal{L}_{student} = \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{KD} \tag{3}$$

where $\lambda$ is a second hyper-parameter controlling the trade-off between the two losses.
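The distillation objective described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes the common form in which the hard-label cross-entropy and the temperature-softened KL term are combined through a single trade-off weight, and the function names (`softmax`, `cross_entropy`, `kl_divergence`, `distillation_loss`) and the `weight` argument are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [a / temperature for a in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, one_hot, temperature, weight):
    """Hard-label cross-entropy plus a weighted KL term on softened outputs.

    Assumption of this sketch: the two terms are added with one trade-off weight.
    """
    hard = cross_entropy(softmax(student_logits), one_hot)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return hard + weight * soft
```

When the student logits equal the teacher logits, the KL term vanishes and only the cross-entropy remains; any disagreement between the softened distributions adds a strictly positive penalty.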
In the above approaches the teacher network is usually deeper and wider than the student network, but sometimes it has a similar size to the student network [DBLP:conf/icml/FurlanelloLTIA18][DBLP:journals/corr/abs-1805-05551][DBLP:journals/corr/abs-1812-00123]. Snapshot Distillation [DBLP:journals/corr/abs-1812-00123] proposes to finish teacher-student optimization within one generation by acquiring teacher information from the previous iterations of the same training process. Inspired by this, we propose to employ snapshots of the student from previous epochs to design the curriculum for efficient knowledge distillation.
Instance-level sequence learning for knowledge distillation. We partition the entire knowledge distillation process into mini-generations. During the first generation, the student network is trained from the true labels and the teacher network’s supervisory information using the conventional distillation terms (using uniformly sampled mini-batches) by Eq.(3). Then, at each consecutive generation, the snapshot of student network from earlier epochs is employed to design curriculum for efficient distillation. In other words, we use the snapshot as a classifier and take its confidence score as the scoring function for each image.
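Without loss of generality, we measure the complexity of a sample by a scoring function defined as $f: \mathcal{X} \rightarrow \mathbb{R}$. Since the score is a confidence, the sample $x_i$ is more difficult than the sample $x_j$ if $f(x_i) < f(x_j)$. Here, we choose the classifier from the snapshot as the scoring function $f$. Thus, the training data is ranked by the difficulty of the samples and sorted by the scoring function $f$. Then the data is divided by difficulty into a sequence of subsets $\{D_1, \dots, D_M\}$, ordered from easy to hard. Note that it is important to keep each subset balanced, with the same proportion of samples from each class as in the training set. Thus, we keep the same number of samples of each class per subset to avoid bias.

Then the optimization goal of SLKD in step $k$ is to minimize the following loss function:

$$\mathcal{L}_{SLKD}^{(k)} = \sum_{(x,\, y) \in D_1 \cup \dots \cup D_k} \Big[ \mathcal{H}\big(P_S(x),\, y\big) + \lambda\, \mathrm{KL}\big(P_T^{\tau}(x) \,\|\, P_S^{\tau}(x)\big) \Big] \tag{4}$$

where $P_S$ and $P_T$ refer to the corresponding outputs of the student and teacher network, respectively, the superscript $\tau$ marks outputs softened by the temperature $\tau$, and $\lambda$ is the trade-off weight. The student parameter $\theta_S$ is optimized by learning the instance-level easy-to-hard curriculums sequentially.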
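The ranking and class-balanced partition described above can be illustrated with a short sketch. It assumes the snapshot's confidence for each sample has already been computed (for example, the softmax probability it assigns to the sample's class) and that a higher confidence means an easier sample; the function name `build_curriculum` and its arguments are hypothetical.

```python
from collections import defaultdict


def build_curriculum(confidences, labels, num_subsets):
    """Split sample indices into class-balanced subsets, easiest first.

    confidences[i] is the snapshot's confidence on sample i; a higher value is
    treated as an easier sample (an assumption of this sketch).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    subsets = [[] for _ in range(num_subsets)]
    for indices in by_class.values():
        # Rank the samples of one class from most to least confident.
        ranked = sorted(indices, key=lambda i: confidences[i], reverse=True)
        for k in range(num_subsets):
            # Integer slice boundaries spread any remainder over the subsets.
            start = k * len(ranked) // num_subsets
            stop = (k + 1) * len(ranked) // num_subsets
            subsets[k].extend(ranked[start:stop])
    return subsets
```

Ranking within each class, rather than over the whole dataset, is one simple way to guarantee that every subset keeps the class proportions of the full training set.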
III-C Training Procedure
The training procedure is illustrated in Algorithm 1. In this paper, we partition the entire training procedure into phases. We update the teaching strategy (i.e., the curriculum) at the beginning of every training phase to supervise the student network. Specifically, we first train the teacher network under conventional supervised learning by Eq.(1). To obtain an initial snapshot network, we distill the knowledge from the pre-trained teacher network to the student network by Eq.(3), feeding a sequence of uniformly sampled mini-batches. Then, we load the student’s model (checkpoints) from earlier epochs as a snapshot, which is used to design the curriculum for knowledge distillation. Thus, the whole training dataset is divided into several subsets according to the samples’ difficulty. In each training phase, the teacher network supervises the training process of the student network through a different curriculum, each of which contains all the categories in the dataset. In this way, the student network is trained to generalize to easy samples before hard ones. The subsets of training data are gradually added to the training process from easy to hard, and eventually all the training data is included.
At the beginning of the next training phase, we load the new snapshot from earlier epochs and update the curriculum via the confidence scores of the examples. The distillation process is the same as above. In this way, the snapshot of the student creates an interaction between the teacher network and the student network in the knowledge distillation process. Moreover, this approach could improve the student network’s generalization performance.
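The phase loop described above can be sketched as follows, for the phases that come after the initial warm-up on uniformly sampled mini-batches. This is an illustrative outline under stated assumptions, not Algorithm 1 itself: `score_fn` and `train_fn` are hypothetical callbacks standing in for the snapshot network and the distillation step, the pool grows by an equal share of the data in every phase, and class balancing is omitted for brevity.

```python
def run_curriculum(num_samples, num_subsets, score_fn, train_fn):
    """Alternate between re-ranking with a snapshot and training on a growing pool.

    score_fn(phase) returns one confidence per sample from the current snapshot;
    train_fn(indices) trains the student on those samples. Both are hypothetical
    callbacks, not part of any real library.
    """
    history = []
    for phase in range(1, num_subsets + 1):
        scores = score_fn(phase)
        # Re-rank all samples with the latest snapshot, easiest first.
        ranked = sorted(range(num_samples), key=lambda i: scores[i], reverse=True)
        # Keep the easiest phase/num_subsets fraction of the data.
        cutoff = phase * num_samples // num_subsets
        pool = ranked[:cutoff]
        train_fn(pool)
        history.append(len(pool))
    return history
```

Because the scores are recomputed in every phase, a sample that was hard for an early snapshot can move into the pool sooner once the student has improved, which is the feedback path from student to curriculum.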
In the following sections, we verify the effectiveness of our proposed instance-level Sequence Learning Knowledge Distillation (SLKD) on several standard datasets, including CIFAR-10, CIFAR-100, SVHN and CINIC-10. For all experiments, a deep residual network [DBLP:conf/cvpr/HeZRS16] has been employed as the base architecture. The deep residual network has shortcut connections to make an ensemble structure. Furthermore, the shortcut connections allow training of very deep networks. Therefore, many researchers use the residual network for various tasks.
For all experiments, we compare our proposed method with several state-of-the-art knowledge distillation methods, including knowledge distillation (KD) [Hinton2015Distilling], Attention Transfer Knowledge Distillation (ATKD) [Zagoruyko2016Paying] and Fitnets [DBLP:journals/corr/RomeroBKCGB14].
IV-A Experimental Setup
Network architecture. For all experiments, we use the ResNet-110 as the teacher model, and the ResNet-20 is adopted as the student model. It stacks the basic residual blocks to achieve the state-of-the-art performance. To further investigate the effectiveness of our method, we conduct extensive experiments on different network architecture families, such as MobileNet [DBLP:journals/corr/HowardZCKWWAA17] and ShuffleNetV2 [Ma_2018_ECCV]. Table III shows the number of parameters of the networks we adopt in our experiments for CIFAR-100 dataset.
Implementation Details. We first conduct our experiments on the public CIFAR-10 dataset, which has small $32 \times 32$ RGB images. For all experiments, we use mini-batches of size 128 for training. Moreover, we apply horizontal flips and random crops as data augmentation to each mini-batch. The learning rate starts at 0.1 and is reduced by a factor of 0.1 at epochs 60, 120 and 160. For the CIFAR datasets, we use stochastic gradient descent with momentum fixed at 0.9 for 200 epochs. For the SVHN dataset, which is easy to learn, we instead use Adam [DBLP:journals/corr/KingmaB14] with an initial learning rate of 0.01 and drop the learning rate by a factor of 0.2 at epochs 20, 40 and 60. Furthermore, all networks use batch normalization.
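The step schedules above can be written as a small helper. This is a sketch under two assumptions that the text does not settle: epochs are counted from zero, and each drop takes effect at the milestone epoch itself. The function name `step_learning_rate` is hypothetical.

```python
def step_learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), factor=0.1):
    """Learning rate at a given epoch for a piecewise-constant step schedule."""
    # Count how many milestones have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

The defaults correspond to the CIFAR setting; the SVHN setting would be obtained with `step_learning_rate(epoch, base_lr=0.01, milestones=(20, 40, 60), factor=0.2)`, reading "drop by 0.2" as multiplication by 0.2.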
For the baseline methods, we set the hyper-parameters as follows. In the KD method, we set the temperature for softened softmax to 4 and = 16 [Hinton2015Distilling]. For the FitNet and AT methods, the value of is set to and [Zagoruyko2016Paying]. The mapping function of AT adopted in our experiments is the square sum, which performs best in the experiments of [Zagoruyko2016Paying].
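To make the square-sum mapping concrete, the sketch below collapses a feature tensor into a spatial attention map by summing squared activations over channels, then compares the L2-normalized maps of student and teacher. It follows the general idea of activation-based attention transfer and is not the baseline code used in the experiments; the nested-list tensor layout (channels, height, width) and the function names are assumptions.

```python
import math


def attention_map(activations):
    """Collapse a C x H x W activation tensor (nested lists) into a flat,
    L2-normalized spatial attention map using the sum of squared channels."""
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    flat = [
        sum(activations[c][h][w] ** 2 for c in range(channels))
        for h in range(height)
        for w in range(width)
    ]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def attention_distance(student_act, teacher_act):
    """L2 distance between the normalized attention maps of the two networks."""
    s = attention_map(student_act)
    t = attention_map(teacher_act)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))
```

Because the channel dimension is summed out, the student and teacher may have different numbers of channels as long as their spatial sizes match, and the normalization makes the distance insensitive to the overall scale of the activations.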
In this section, we evaluate our method on the CIFAR datasets. The CIFAR-10 and CIFAR-100 datasets both contain 60k tiny RGB images at a spatial resolution of $32 \times 32$. The only difference is that the training and testing images are uniformly distributed over 10 and 100 classes, respectively. Note that we use the RGB images after random crops and horizontal flips for training, and the original RGB images for testing. For optimization, we use SGD with a mini-batch size of 128. The learning rate starts from 0.1 and is divided by 10 at the 60th, 120th and 160th epochs, and we train for 200 epochs.
Specifically, we first train ResNet-110 as the teacher network, which provides 94.04% classification accuracy on the CIFAR-10 dataset. Then, we use the pre-trained teacher network to supervise the training of the student network for the initial 40 epochs by Eq.(3). For the CIFAR-10 dataset, we set the number of subsets to 5 in our experiment, which partitions the training data into five levels of sample difficulty. Note that the five subsets of training data are provided by five different snapshots of the student network over the whole training process. In detail, we load the first snapshot of the student network from the initial 40 epochs and use it as a classifier to rank the whole training data into five subsets. In other words, this snapshot designs the first curriculum for the next stage of knowledge distillation training. Then, we feed the easiest subset to our knowledge distillation framework, and the teacher network supervises the training of the student network for the next 30 epochs. In the same manner, the next snapshot is loaded to update the curriculum, and the second subset is added to the training process for the next 30 epochs. Finally, the entire training data is included and we train the student for the last 100 epochs.
We compare the classification accuracy on the CIFAR datasets and show the results in Table II. The results show that our approach achieves higher accuracy than the original student network and also gains a notable improvement over the existing methods. In particular, our instance-level sequence learning architecture for knowledge distillation reaches 93.21% accuracy on the CIFAR-10 dataset, a 1.03% improvement over the student network trained independently. In other words, the knowledge of the ResNet-110 teacher network, which contains 1.70M parameters, is distilled into a smaller ResNet-20 student network with only 0.27M parameters. This is a 6× compression rate with only 0.8% loss in accuracy. Furthermore, our approach performs better than existing knowledge distillation methods. For the CIFAR-100 dataset, we change the number of subsets into which the training data is partitioned. The snapshot of the student is then loaded at epochs 41, 71 and 141, respectively. Our method achieves 70.02% classification accuracy on the CIFAR-100 dataset, a 1.69% improvement over the student network trained individually. Moreover, our method outperforms the state-of-the-art knowledge distillation methods, as shown in Table II.
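The compression rate and accuracy gap quoted above follow from simple arithmetic, sketched here for reference. The helper name `compression_summary` is hypothetical, and the parameter counts are assumed to be in millions.

```python
def compression_summary(teacher_params, student_params, teacher_acc, student_acc):
    """Return the parameter compression ratio and the accuracy gap in points."""
    ratio = teacher_params / student_params
    gap = teacher_acc - student_acc
    return ratio, gap


# CIFAR-10 figures quoted in the text: ResNet-110 teacher vs. ResNet-20 student.
ratio, gap = compression_summary(1.70, 0.27, 94.04, 93.21)
print(f"compression: {ratio:.1f}x, accuracy gap: {gap:.2f} points")
```

The ratio comes out slightly above 6 and the gap slightly above 0.8 points, which matches the rounded figures reported in the paragraph.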
Figure 3 shows the training loss and testing accuracy of the conventional KD and our SLKD method. We train the networks for the same number of iterations for comparison. As can be seen from the testing accuracy curves, our method outperforms the KD method in classification accuracy. Note that the loss curves of our approach rise suddenly at epochs 100 and 150 because a new curriculum is added. The student network from our architecture finally reaches the saturation region and consequently achieves a higher performance.
In fact, we conduct all the experiments with 300 epochs for the CIFAR-10 and CIFAR-100 datasets. We observe that our approach uses fewer iterations than the other comparative trials, because we use the snapshot of the student to design the curriculum and feed the samples in order from easy to difficult. That is to say, we only use a small subset of the entire training data in the earlier epochs. We also set the same number of iterations (7.5K) for our experiments, as shown in Figure 3. Different curriculum designing strategies are investigated in Section IV-D.
IV-C SVHN and CINIC-10
In this section, we verify the effectiveness of the proposed method by conducting more complicated classification experiments on the larger SVHN and CINIC-10 datasets. The SVHN dataset is similar to MNIST, with small $32 \times 32$ RGB cropped digits in 10 classes obtained from house numbers in Google Street View images. SVHN has 73257 images for training, 26032 images for testing and 531131 additional samples. The CINIC-10 dataset consists of images from both the CIFAR and ImageNet datasets, which makes it a middle option between CIFAR-10 and ImageNet. It contains 270,000 images at a spatial resolution of $32 \times 32$, obtained via the addition of downsampled ImageNet images. We adopt the CINIC-10 dataset for rapid experimentation because its scale is closer to that of ImageNet and it is a noisy dataset.
In this experiment, the backbone network of the teacher is ResNet-110. For the training of the teacher network on the CINIC-10 dataset, we set the initial learning rate to 0.1, drop it by a factor of 0.1 at epochs 100, 150 and 250, and train for 300 epochs. We set the weight decay to 5e-4 and train the network using SGD with momentum. The student network uses almost identical settings to the teacher, except that the initial learning rate is 0.01.
We compare the classification accuracy on the SVHN and CINIC-10 datasets and show the results in Table IV. As can be seen from Table IV, our method improves accuracy by about 1.27%/0.73% on SVHN/CINIC-10 compared with training under the conventional cross-entropy supervised loss. We also compare our method with some state-of-the-art knowledge distillation methods, and the student trained with our proposed method outperforms the others. The results verify that our method is applicable to large-scale classification.
IV-D Ablation Studies
With the easy-to-hard learning process inspired by curriculum learning, the student network can reach an ideal local minimum under the proposed instance-level sequence knowledge distillation method. The results in previous experiments have shown its effectiveness. In this section, we conduct extra experiments to investigate the impact of training time, different network architectures and snapshot networks on the proposed framework.
Training with the same epochs. Since training time is a vital factor in both research and industry, we investigate the time consumption of the conventional KD method and the proposed method. Normally, the same number of training epochs means the same number of iterations, because every training epoch contains the same number of iterations. However, the number of iterations in each epoch differs under our easy-to-hard sequence learning strategy, because the curriculum (subset) only contains part of the whole training set in the early training stage. The number of iterations in each epoch increases when a new curriculum is added. In other words, our SLKD method outperforms other methods with fewer iterations when training for the same number of epochs. Figure 4 shows the training loss and testing accuracy of the conventional KD and the proposed method over 300 epochs. We find that the proposed method outperforms the conventional KD method while using fewer total iterations, which means that it needs less training time than the conventional KD method.
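The effect on the iteration count can be illustrated with a back-of-the-envelope calculation. The stage lengths and subset sizes used in the usage note below are hypothetical and chosen only to show the arithmetic; the batch size of 128 and the 50,000-image CIFAR training set are the only figures taken from the setup, and the last partial mini-batch of each epoch is assumed to be kept.

```python
import math


def iterations_per_epoch(num_samples, batch_size=128):
    """Number of mini-batches needed to pass once over num_samples examples."""
    return math.ceil(num_samples / batch_size)


def total_iterations(schedule, batch_size=128):
    """Sum iterations over (num_epochs, num_samples) stages of a schedule."""
    return sum(epochs * iterations_per_epoch(samples, batch_size)
               for epochs, samples in schedule)
```

For instance, a hypothetical curriculum `[(40, 50000), (30, 10000), (30, 20000), (100, 50000)]` covers the same 200 epochs as the uniform schedule `[(200, 50000)]` but requires fewer iterations, because the two middle stages visit only part of the data in each epoch.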
Different student and teacher architectures. To explore the impact of different architecture families, we conduct additional experiments on CIFAR-100 dataset. We adopt the ResNet-110 as the teacher network, MobileNet-0.25 and ShuffleNetV2-0.5 as the student network. The width multiplier is a tunable parameter to control the complexity of the ShuffleNetV2 and MobileNet. Table V shows the results under different architecture families. We can observe that the proposed approach outperforms the conventional training as well as the state-of-the-art knowledge distillation methods. The results verify that our approach can be applicable to different network architecture families.
Comparison on different snapshots. In this section, we evaluate the impact of different curriculum designing strategies. The proposed instance-level SLKD method employs the student network from an early epoch as the snapshot network to design the easy-to-hard curriculum for the student network’s next training stage. Why do we choose the intermediate training states of the student network as the snapshot networks? It would be much easier to take the snapshots from the teacher network, because it produces a large number of checkpoints during pre-training. However, our goal is to find the optimal learning sequence for distilling knowledge from the teacher network to the student network. If the curriculum is designed by a snapshot of the teacher network, the student network may be misled because of the gap in capacity between the teacher and the student network. To validate this assumption, we compare our approach with another scheme, which uses the pre-trained teacher network to provide the snapshot networks for curriculum design.
Specifically, we conduct an extra experiment on the CIFAR-100 dataset. For comparison, we choose the 60th, 120th and 160th epochs of the teacher network’s training process as the snapshot networks. We load the snapshot networks from the teacher network and the student network, respectively, to design the curriculum. The recognition results are shown in Table VI. We observe that the proposed method, which adopts the early epochs (checkpoints) of the student as the snapshot networks, outperforms the scheme that uses the teacher’s intermediate states as snapshot networks.
In this paper, we propose a novel distillation framework named instance-level sequence learning knowledge distillation (SLKD) to boost the performance of the student network. By employing the intermediate training states (i.e. checkpoints) of the student network as snapshot networks, we design easy-to-hard curriculums for distilling knowledge from the teacher network to the student. In this manner, we train the student network step by step to mimic the teacher’s softened knowledge until all the curriculums are finished. Moreover, the curriculums are updated automatically through the different snapshots over the whole training process. Experiments show that the proposed instance-level sequence learning strategy can significantly improve the performance of student networks in knowledge distillation.
We thank the National Natural Science Foundation of China for its support under Project Nos. U1706218, 61971388 and L1824025.