Feature Matters: A Stage-by-Stage Approach for Knowledge Transfer

Convolutional Neural Networks (CNNs) have become deeper and deeper in recent years, making the study of model acceleration imperative. It is a common practice to employ a shallow network, called student, to learn from a deep one, which is termed teacher. Prior work has made many attempts to transfer different types of knowledge from teacher to student; however, two problems remain unsolved. Firstly, the knowledge used by existing methods is usually manually defined, which may not be consistent with the information learned by the original model. Secondly, an effective training scheme for the transfer process is lacking, leading to degradation of performance. In this work, we argue that feature is the most important knowledge from teacher. It is sufficient for student to achieve appealing performance by just learning features similar to those of teacher, without any processing. Based on this discovery, we further present an efficient learning strategy, which is to make student mimic the features of teacher stage by stage. Extensive experiments suggest that the proposed approach significantly narrows the gap between student and teacher and shows strong stability on various tasks, i.e., classification and detection, outperforming the state-of-the-art methods.




1 Introduction

Over the past few years, Convolutional Neural Networks (CNNs) have advanced various tasks in the computer vision field, such as image classification [9], object detection [18], and semantic segmentation [4]. However, along with deeper architectures [14, 24, 6, 10], the great success of CNNs comes at a high computational cost, which cannot be afforded by most devices in practice. Some lightweight models have been presented in recent work [8] to reduce the computing cost, especially for mobile devices, but their performance drops severely compared with the state-of-the-art methods. Accordingly, it is crucial to balance the trade-off between efficiency and capability of a CNN model.

To tackle this problem, knowledge distillation is introduced in [7] for model acceleration. The core idea is to train shallow networks (student) to mimic deep ones (teacher) following two steps. First, teacher employs a very deep model to achieve satisfying performance by excavating information (knowledge) from labeled data. Second, student learns the knowledge from teacher with a shallow model to speed up the inference process without losing much accuracy. Accordingly, the main challenges, corresponding to the above two steps respectively, lie in (1) what kind of knowledge should be transferred from teacher to student, and (2) how to transfer the knowledge as much as possible.

Figure 1: Student (S) is trained to mimic the hidden features from teacher (T) without using the ground truth label. Both teacher and student are separated into different stages. When training a particular stage, previous well-trained stages (blue) of student are fixed, the current stage (green) is trained to mimic the output feature of teacher model, while the remaining stages (black) are not used. Better viewed in color.

For the first issue, previous work has proposed various methods to distill the knowledge learned by teacher for student to learn, such as attention map [26] and information flow [25]. However, all these types of distillation are manually defined and may not be consistent with the real information contained in the teacher model. In other words, teacher is trained independently of the distillation method, but it is required to guide student with this handcrafted information during the mimicking process, which leads to a discrepancy in knowledge. There is no guarantee that the distilled knowledge is enough for student to retain the precision of the teacher model.

For the second issue, existing approaches usually train a student model to learn from the teacher and labeled data simultaneously. There are mainly two problems in doing so. On one hand, the training would be very sensitive to the hyper-parameters, i.e., the loss weights to balance these two objective functions. Specifically, when the same teacher-student pair is applied to a different task, or even to another dataset, the loss weights may need to be carefully readjusted to get a satisfying result, which is very inefficient. On the other hand, the purposes of learning from teacher and learning from ground truth are not always in line with each other. For example, the teacher model may have already eliminated some label errors during the training process. In this case, information from the teacher and from the training set points in two opposite directions, which may confuse the student.

In this paper, we address these weaknesses by proposing a practical knowledge transfer approach, where student is trained to progressively mimic the hidden features from teacher, as shown in Fig.1. Here, for simplicity, we do not distinguish between feature (vector) and feature map (tensor). Our approach has three appealing properties:

(1) Feature as the meaningful guidance - We directly use the output features of some selected hidden layers of teacher model to supervise student without any processing. Since teacher uses features for inference in practice, they are expected to reflect the complete information from teacher model. Therefore, these features should be able to provide better guidance for student.

(2) Isolation of knowledge source - We isolate the knowledge contained in teacher model from the information contained in labeled data using two phases. In the first phase, student learns knowledge by mimicking teacher regardless of the ground truth label, while in the second phase, student is trained to apply the fixed learned features from first phase to adapt with the training set. In this way, student can focus on acquiring information from only one source in each phase, making the transfer process more accurate. Furthermore, training these two phases separately also makes our method more stable, since there is no need to adjust the loss weights between different terms.

(3) Progressive knowledge distillation - Instead of training all parameters of student together, we divide the transfer process into different stages and only train a sub-network at one time. In doing so, each stage can obtain focused and sufficient training from the teacher. All stages will finally collaborate to achieve a better performance. Although our method employs multiple stages, the training efficiency is barely affected. That is because when training a particular stage, all previous stages are fixed and only parameters in the current stage are updated, as illustrated in Fig.1.

To summarize, this work has three main contributions:

  • We demonstrate the effectiveness of directly mimicking features in knowledge transfer.

  • We present a stage-by-stage training strategy to facilitate more effective and efficient knowledge distillation from teacher to student.

  • We show experimentally that our approach surpasses the state-of-the-art methods on various tasks with higher performance and stronger stability. Concretely, we achieve 0.6% higher top-1 accuracy on the ImageNet classification task than other competitors by using ResNet-18 to mimic ResNet-34, and achieve 0.3% higher mean Average Precision (mAP) on COCO detection task by using ResNet-50 to mimic ResNet-101.

2 Related Work

Model Compression. There have been extensive attempts in the literature to reduce computational cost by model compression. Network pruning was proposed to find a balance between performance and storage capacity by removing redundant structures of the network. Molchanov et al. [21] and Li et al. [16] introduced different criteria to evaluate the importance of neurons and filter out the insignificant channels to reduce the network size. Besides, quantization [5], which uses fewer bits for each neuron, and low-rank approximation [27], which factorizes a huge matrix into several small matrices, are also widely applied for model acceleration. In this work, we focus on the other model compression technique, knowledge transfer, where a shallow student model is trained to gain information from a deep teacher model.

Knowledge Transfer. The preliminary view of knowledge transfer was adopted in [2], which trains a shallow network with data labeled by a deep one. Through student learning the soft label tagged by teacher instead of the ground truth, knowledge is assumed to be transferred from teacher to student. [7] introduced the concept of Knowledge Distillation (KD), which describes the process of knowledge transfer as a student mimicking the output of teacher. For the classification task, the student model is trained with the ground truth label as well as the class probabilities predicted by teacher simultaneously. Hinton et al. [7] also proposed to raise the temperature in the softmax function to further distill the knowledge; however, it is very sensitive to the hyper-parameter, i.e. temperature, and is not applicable to tasks other than classification due to the limitation of the softmax loss.
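To make the role of the temperature concrete, the following is a minimal sketch of a temperature-scaled softmax. The logits and the temperature value are hypothetical and chosen only for illustration; they are not settings from this paper or from [7].

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; larger values flatten the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher outputs for three classes
hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)  # assumed temperature
```

With the higher temperature the class ranking is preserved, but the probability mass is spread more evenly, so the smaller classes contribute more to the student's loss. How much to soften is exactly the hyper-parameter sensitivity discussed above.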

To solve the above problem, many other methods for knowledge distillation have been proposed. Zagoruyko et al. [26] attempted to transfer the spatial attention map, which is defined as the average of feature maps across the channel dimension. Huang and Wang [11] proposed to learn the feature map through Maximum Mean Discrepancy (MMD), which can be regarded as a sample-based metric to measure the distance between two probability distributions. Both methods can be considered as distilling knowledge by computing the statistics of feature maps. However, learning such statistics is not equivalent to learning the original feature maps, since the detailed information may vanish, which will lead to performance degradation. Yim et al. [25] used Flow of Solution Procedure (FSP) to describe the information flow of a CNN model, which computes the Gram matrix of two hidden feature maps. Instead of learning the knowledge from teacher directly, student is trained to mimic how information flows in teacher. It indeed extracts some higher-level knowledge compared to mimicking features, but the student model may not treat the feature extraction process in the same way as teacher. Therefore, forcing it to reproduce the information flow may make training too difficult. Similarly, Lee et al. [15] applied Singular Value Decomposition (SVD) to derive the correlation between two feature maps.
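The two statistics described above can be sketched in a few lines. This is an illustration that follows the wording of this section (attention as a channel average, FSP as a Gram matrix between two maps); the normalisation by spatial size and the toy values are assumptions, and the cited papers may define further details differently.

```python
def attention_map(feature):
    """Average a C x H x W feature map over channels, giving an H x W map."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [sum(feature[k][i][j] for k in range(channels)) / channels for j in range(width)]
        for i in range(height)
    ]

def flatten(channel):
    """Turn one H x W channel into a flat list of H * W values."""
    return [value for row in channel for value in row]

def gram_matrix(feat_a, feat_b):
    """Channel-by-channel inner products of two maps with equal spatial size."""
    area = len(feat_a[0]) * len(feat_a[0][0])  # assumed normalisation by H * W
    return [
        [sum(x * y for x, y in zip(flatten(a), flatten(b))) / area for b in feat_b]
        for a in feat_a
    ]

# Two different toy feature maps (2 channels, 2 x 2) ...
feature_a = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 2.0], [1.0, 0.0]]]
feature_b = [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]]
# ... that share the same attention map, so the statistic cannot tell them apart.
same_attention = attention_map(feature_a) == attention_map(feature_b)
gram = gram_matrix(feature_a, feature_a)
```

The toy pair shows why matching a statistic is weaker than matching the feature itself: `feature_a` and `feature_b` differ, yet their channel averages coincide.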

Transfer Strategy. Besides defining what kind of knowledge should be transferred, how to transfer the knowledge from teacher to student is another direction that is worth exploring. Romero et al. [22] presented FitNets by introducing an intermediate layer from teacher model as hint for student. Similarly, a hidden layer of student is chosen as the guide for the entire transfer process. The guide layer is firstly trained to mimic the hint layer to assist the shallow network to get a better initialization. Then the entire network is end-to-end trained as other methods did.

Different from FitNets, our proposed method employs a progressive training strategy, which has been widely applied in various fields, such as transfer learning [23] and generative models [12]. We also employ multiple stages to mimic features, but different stages are trained separately. In other words, only a sub-network will be trained in each stage. Unlike FitNets, where the first half of the student model will be trained twice, each layer in our framework will only be optimized once, making our approach much more efficient. Furthermore, we separate the feature extraction and feature adaption processes, which is another difference from FitNets.

In addition to the conventional training framework, there are also some other studies that introduce reinforcement learning [1] and adversarial networks [3] to the knowledge transfer problem. However, training these networks is still not trivial for now, since many problems remain unsolved, e.g., how to design the structure, how to optimize the network, how to formulate the reward in reinforcement learning, or how to establish the competition in adversarial nets. Compared to them, our approach is much more straightforward.

3 Methodology

This section introduces the proposed stage-by-stage knowledge transfer method as shown in Fig.2. We will discuss preliminary knowledge on knowledge transfer in Sec.3.1. Sec.3.2 explains how knowledge is transferred by mimicking features. Sec.3.3 presents the progressive training scheme. Sec.3.4 introduces the implementation details.

3.1 Preliminary

In general, most deep CNN models can be treated as two parts, which are the feature extraction part and the feature adaption part, the latter termed the final stage in this work. More specifically, given an image $x$ and the corresponding ground truth $y$, the first part will represent the image as a high dimensional feature $f$ with $f = F(x)$, and the second part will take such representation as input and make a prediction $\hat{y}$ with $\hat{y} = G(f)$. Accordingly, the entire model $M$ can be regarded as the combination of the above two parts

$$M = G \circ F, \qquad (1)$$

where $\circ$ indicates the function composition.

For a particular task, the predicted label $\hat{y}$ is expected to be as close to the real label $y$ as possible. To achieve this goal, the model is trained with an objective function

$$\min_{\theta} \mathcal{L}(\hat{y}, y), \qquad (2)$$

where $\theta$, consisting of $\theta_F$ and $\theta_G$, denotes the trainable parameters of the entire model. $\mathcal{L}$ is the task-related energy function, such as the softmax cross-entropy loss in the classification task and the bounding box regression loss in the detection task.

When it comes to the knowledge transfer problem, we have a deep teacher model $M^t$ and a shallow student model $M^s$, which are composed of $\{F^t, G^t\}$ and $\{F^s, G^s\}$ respectively. Correspondingly, the hidden feature and the final prediction are denoted as $\{f^t, \hat{y}^t\}$ and $\{f^s, \hat{y}^s\}$. The key challenge is to find out the knowledge contained in the teacher model with $\phi(\cdot)$ and then transfer it to the student. Therefore, this problem can be formulated as

$$\min_{\theta^s} \mathcal{L}(\hat{y}^s, y) + \lambda \, \mathcal{D}\big(\phi(M^s), \phi(M^t)\big), \qquad (3)$$

where $\mathcal{D}$ is the loss function for transferring knowledge and $\lambda$ is the loss weight to balance the two terms. In this process, we assume that the teacher model has already been well trained with Eq.(2) and will no longer be optimized.

Figure 2: Overview of the proposed stage-by-stage knowledge transfer approach. Before transferring knowledge, the teacher model extracts information from labeled data, which is shown in the first column. Then, both teacher and student models are divided into several stages in addition to a final feature adaption stage. The number of rectangular bars in each stage indicates the depth of the sub-network. Each stage of teacher is deeper than the corresponding one of student, but teacher and student share the same network structure in the final stage. Except for the final stage, student is trained to mimic the output feature of teacher at each stage. In the final stage, student learns from the dataset without taking teacher as reference any more. Stages in green indicate the current training stage, while all previous stages in blue are fixed. Dashed arrows in red show the loss function. Better viewed in color.

3.2 Feature Transfer

Considering the two parts mentioned above, the only difference between student and teacher is the ability to extract features from images, because they share the same structure in the final stage, as shown in Fig.2. From this point of view, student has the same capability as teacher in how to use the feature for label prediction. In other words, if the student could produce a feature identical to that of the teacher, it should be able to achieve equally promising performance. Therefore, we propose to focus on the knowledge contained in the feature extraction part instead of the entire model, and do not transfer the final stage knowledge from teacher to student. Based on this argument, the mimicking process defined in Eq.(3) can be easily separated into two parts

$$\min_{\theta_F^s} \mathcal{D}\big(\psi(f^s), \psi(f^t)\big), \qquad (4)$$

$$\min_{\theta_G^s} \mathcal{L}(\hat{y}^s, y), \qquad (5)$$

where $f^s$ and $f^t$ are the features produced by student and teacher, $\theta_F^s$ and $\theta_G^s$ are the parameters of the student's feature extraction part and final stage, and $\psi$ indicates the knowledge contained in the feature extraction stage. It is different from $\phi$ in Eq.(3), which is the knowledge contained in the entire model. In this work, we choose $\psi$ as an identity function, i.e. $\psi(f) = f$, to avoid information loss caused by distillation.

More concretely, the student's feature extraction part $F^s$ is firstly trained to mimic the feature extracted by the teacher's feature extraction part $F^t$ with Eq.(4). Since $F^s$ is not as powerful as $F^t$ in extracting information from the original data, using $f^t$ as guidance will alleviate the training difficulty. Then the final stage $G^s$ is trained with the supervision of ground truth, with $F^s$ fixed, as in Eq.(5). There are two advantages in doing so. On one hand, information from the teacher model and that from the ground truth label will not interfere with each other, and the already transferred knowledge in the first phase will not vanish. On the other hand, the training will be very efficient. As both $F^s$ and $G^s$ are only trained once, the total training time is almost the same as end-to-end training. Furthermore, compared to existing work, the entire training process in our approach does not rely on the loss weight $\lambda$ in Eq.(3), resulting in much stronger stability.
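The two phases can be illustrated with a deliberately tiny toy problem. The sketch below is not the authors' implementation: the "networks" are scalar affine maps, the teacher coefficients and data are hypothetical, and plain gradient descent on a mean squared error stands in for SGD on the losses of Eq.(4) and Eq.(5).

```python
def train_affine(inputs, targets, lr=0.1, steps=1000):
    """Gradient descent on the mean squared error of a * x + b against targets."""
    a, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        residuals = [a * x + b - t for x, t in zip(inputs, targets)]
        grad_a = 2.0 * sum(r * x for r, x in zip(residuals, inputs)) / n
        grad_b = 2.0 * sum(residuals) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

images = [-1.0, -0.5, 0.0, 0.5, 1.0]                  # toy scalar "images"
teacher_features = [2.0 * x + 1.0 for x in images]    # hypothetical frozen teacher extractor
labels = [3.0 * f - 2.0 for f in teacher_features]    # toy ground truth

# Phase 1 (feature mimicking): fit the student extractor to the teacher
# features only; the labels are never used here.
ext_a, ext_b = train_affine(images, teacher_features)

# Phase 2 (feature adaption): freeze the extractor and train only the final
# stage on the labels; no teacher signal and no loss weight is involved.
student_features = [ext_a * x + ext_b for x in images]
head_a, head_b = train_affine(student_features, labels)
predictions = [head_a * f + head_b for f in student_features]
```

Each phase minimises a single objective, which is why no weight between a transfer term and a task term has to be tuned in this scheme.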

3.3 Stage-by-Stage Training Scheme

From the discussion above, it is crucial for our method that student can learn features similar to those of teacher. However, considering the wide difference between the representation capabilities of these two models, the goal is not that easy to achieve through simple end-to-end learning. To tackle this problem, we break both models down into multiple stages and make student mimic the output feature of each teacher stage progressively, as shown in Fig.2. Taking the teacher model as an instance, we have

$$f_i^t = F_i^t(f_{i-1}^t), \quad i = 1, 2, \dots, N, \qquad (6)$$

where $N$ is the total number of stages, excluding the final feature adaption stage. $f_i^t$ is the hidden feature of the $i$-th stage in the teacher network, while $f_0^t$ is the initial feature, i.e. the input image. $F_i^t$ is the sub-network of the teacher model in the $i$-th stage, and the feature extraction part $F^t$ can be considered as a composition of a series of sub-networks, $F^t = F_N^t \circ \dots \circ F_1^t$. Similarly, the student's feature extraction part $F^s$ is also divided into $N$ stages.

Under such separation, the feature transfer process as shown in Eq.(4) can be further split apart as follows

$$\min_{\theta_{F_i}^s} \mathcal{D}\big(\psi_i(f_i^s), \psi_i(f_i^t)\big), \quad i = 1, 2, \dots, N, \qquad (7)$$

where $\mathcal{D}$ measures the distance between features, $\theta_{F_i}^s$ denotes the parameters of the $i$-th student stage, and $\psi_i$ is also set as an identity function, same as that in Eq.(4), to maintain the original information from the teacher model.

Similar to fixing the feature extraction part $F^s$ when training the final stage $G^s$, as mentioned in Sec.3.2, we fix all parameters in previous stages when training a new stage, to prevent the transferred knowledge from vanishing and to speed up the training process. Although these stages are trained separately, they are not completely independent, since each stage will take the feature produced by the previous stage as input for further training. That is to say, after the first $i$ stages are well trained, the $(i+1)$-th stage will try to mimic the feature $f_{i+1}^t$ based on $f_i^s$ instead of $f_i^t$. In this way, even though the $i$-th stage of student cannot exactly reproduce the same feature as teacher, the difference between $f_i^s$ and $f_i^t$ will be further handled by the $(i+1)$-th stage. Finally, these stages will cooperate with each other to achieve a better performance.
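The following toy sketch illustrates the stage-by-stage loop and, in particular, how a later stage can absorb the residual left by an earlier one. Everything in it is an assumption made for illustration: stages are scalar maps, the teacher coefficients are hypothetical, and a closed-form least-squares fit stands in for SGD training of each stage.

```python
def fit_affine(inputs, targets):
    """Closed-form least squares for targets ~ a * inputs + b."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in inputs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, targets)) / var_x
    return a, mean_y - a * mean_x

def fit_scale(inputs, targets):
    """Closed-form least squares for targets ~ a * inputs (no bias: a weaker stage)."""
    a = sum(x * y for x, y in zip(inputs, targets)) / sum(x * x for x in inputs)
    return a, 0.0

def mse(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)

images = [-1.0, -0.5, 0.0, 0.5, 1.0]            # toy inputs (the initial feature)
teacher_stages = [(2.0, 1.0), (3.0, -2.0)]       # hypothetical frozen teacher: f -> a * f + b
student_fitters = [fit_scale, fit_affine]        # student stage 1 is deliberately weaker

teacher_feat, student_feat = images, images
student_stages, stage_errors = [], []
for (ta, tb), fit in zip(teacher_stages, student_fitters):
    teacher_feat = [ta * f + tb for f in teacher_feat]   # teacher feature of this stage
    a, b = fit(student_feat, teacher_feat)               # train only the current stage
    student_stages.append((a, b))                        # earlier stages stay frozen
    student_feat = [a * f + b for f in student_feat]     # student feature feeds the next stage
    stage_errors.append(mse(student_feat, teacher_feat))
```

In this toy run the first student stage cannot represent the teacher's bias and leaves a non-zero mimicking error, but because the second stage is trained on the student's own first-stage feature, it learns parameters that differ from the teacher's and closes the gap.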

Discussion. FitNets [22] also proposed to train the student model in two stages by mimicking the hidden feature of the teacher model in the first stage. However, our method differs from FitNets in two main aspects rather than just adding more stages. (1) In FitNets, after the first stage is finished, the second stage does not fix the well-trained part but trains the entire network instead. In other words, the first half of student will be trained twice, which is very time-consuming. In addition, the already transferred knowledge in the first stage may vanish when fine-tuned in the second stage. Therefore, FitNets is more likely to give student a better initialization in the first stage and transfer knowledge in the second one. In contrast, knowledge transfer happens in every stage of our method: we break apart the information gained by the teacher model and then transfer it to student gradually. (2) FitNets does not isolate the information from the teacher model and that from training data, and student is still trained with Eq.(3) in the second stage. On the contrary, we argue that the final stage in the student model has the same learning ability as teacher, such that there is no need to transfer the knowledge contained in that stage. Thus, our approach trains the student to learn from teacher and from ground truth separately.

3.4 Implementation

In our experiments, we assume that the features of each stage produced by student and teacher have the same dimension. This is a common assumption that has been widely used by previous work [22, 26, 25]. Even so, our method can also deal with the case where the feature dimensions mismatch, by adding an additional convolutional layer with kernel size $1 \times 1$ after the output feature of student in each stage to fulfill the dimension condition. For simplicity, we just use models that have similar structures as the teacher-student pair, such as ResNet-34 [6] and ResNet-18.
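As a sketch of the dimension-matching idea above, the snippet below applies a channel-mixing convolution to a toy student feature. It assumes a 1 x 1 kernel (a per-pixel linear map across channels, without bias), and the adapter weights are hypothetical values rather than learned ones.

```python
def conv1x1(feature, weights):
    """Apply a 1x1 convolution (no bias) to a C_in x H x W feature map.

    weights is a C_out x C_in matrix; each output pixel is a linear
    combination of the input channels at the same spatial location.
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(w * channel[i][j] for w, channel in zip(row, feature)) for j in range(width)]
            for i in range(height)
        ]
        for row in weights
    ]

student_feature = [[[1, 2]], [[3, 4]]]        # 2 channels, 1 x 2 spatial size
adapter_weights = [[1, 0], [0, 1], [1, 1]]    # hypothetical weights mapping 2 -> 3 channels
adapted = conv1x1(student_feature, adapter_weights)
```

The spatial size is unchanged and only the channel count is adapted, so the adapted student feature can be compared element-wise with a teacher feature of three channels.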

As for the stage partition, we treat each resolution down-sampling layer as a break point. Taking ResNet as an example, the input image has the size $224 \times 224$. Besides the first convolutional layer and pooling layer, there are mainly four sets of residual blocks. The spatial resolutions of the output features of these sets of blocks are $56 \times 56$, $28 \times 28$, $14 \times 14$, and $7 \times 7$ respectively. These features break down the entire model into 4 stages automatically.
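The break points can be derived from the down-sampling strides alone. The sketch below assumes the standard ImageNet-style ResNet layout (a stride-2 convolution and a stride-2 pooling layer in the stem, then four sets of residual blocks with strides 1, 2, 2, 2) and a 224 x 224 input; integer division is used, which is exact for sizes divisible by the strides.

```python
def stage_resolutions(input_size=224, stem_strides=(2, 2), block_strides=(1, 2, 2, 2)):
    """Return the output side length of each set of residual blocks."""
    size = input_size
    for stride in stem_strides:       # first convolution and pooling layer
        size //= stride
    resolutions = []
    for stride in block_strides:      # one entry per set of residual blocks
        size //= stride
        resolutions.append(size)
    return resolutions

resolutions = stage_resolutions()
```

Under these assumptions each change of resolution marks the end of one stage, giving four stage outputs for the four sets of residual blocks.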

When training a particular stage, Eq.(7) and Eq.(5) are used for the feature mimicking stage and the feature adaption stage respectively. As previously mentioned, we choose the knowledge function $\psi_i$ to be an identity function. $\mathcal{D}$ is a function to measure the distance between the features from student and teacher. We use the $L_2$ distance in this work, i.e. $\mathcal{D}(f^s, f^t) = \lVert f^s - f^t \rVert_2^2$. The function $\mathcal{L}$ is task-related. For the classification task, we use the cross-entropy loss, i.e., $\mathcal{L}(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log \hat{y}_c$, where $C$ is the total number of categories. For the detection task, we use the bounding box regression loss. We use the SGD optimizer with momentum 0.9 to train each stage. The learning rate is set to 0.01 initially and is multiplied by 0.1 every time the feature distance stops decreasing. After the learning rate drops to a preset minimum, we consider the current stage well trained and start training the next stage.
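The stopping rule for a stage can be sketched as a small plateau-based schedule. The initial learning rate (0.01) and decay factor (0.1) follow the text above; the minimum learning rate `min_lr`, the per-epoch granularity, and the example distances are hypothetical, since those details are not recoverable from this section.

```python
import math

def run_stage(distances, lr=0.01, decay=0.1, min_lr=1e-4):
    """Return (epochs_used, final_lr) for one stage given per-epoch feature distances."""
    best = float("inf")
    for epoch, distance in enumerate(distances, start=1):
        if distance < best:
            best = distance            # still improving: keep the learning rate
        else:
            lr *= decay                # plateau: decay the learning rate
            if lr < min_lr or math.isclose(lr, min_lr):
                return epoch, lr       # stage is considered well trained
    return len(distances), lr

# Hypothetical feature distances observed at the end of each epoch.
epochs_used, final_lr = run_stage([1.0, 0.6, 0.6, 0.5, 0.5, 0.45])
```

In this example the distance stalls at epochs 3 and 5, so the learning rate is decayed twice and the stage stops at epoch 5, after which the next stage would start.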

Training strategy                               Method    Model      Top-1   Top-5
(a) End-to-end training from scratch            Student   ResNet-18  68.062  89.598
                                                Teacher   ResNet-34  73.045  91.545
(b) Training final stage only                   Student   ResNet-18  68.046  89.533
                                                Teacher   ResNet-34  73.063  91.586
(c) End-to-end training with stage loss         1 stage   ResNet-18  69.687  89.124
                                                2 stages  ResNet-18  70.236  89.375
                                                3 stages  ResNet-18  70.932  89.849
                                                4 stages  ResNet-18  71.687  90.126
(d) Stage-by-stage training                     1 stage   ResNet-18  70.371  89.100
                                                2 stages  ResNet-18  71.223  90.000
                                                3 stages  ResNet-18  72.321  90.795
                                                4 stages  ResNet-18  72.768  91.396
(e) Stage-by-stage training with more stages    5 stages  ResNet-18  72.558  91.187
                                                6 stages  ResNet-18  72.942  91.423
                                                7 stages  ResNet-18  72.673  91.354
                                                8 stages  ResNet-18  73.001  91.562

Table 1: Ablation experiments on CIFAR-100 dataset. Details are discussed in Sec.4.2 and Sec.4.3.

4 Experiments

To evaluate the performance of our proposed method, we carry out various experiments on different datasets and different tasks. Sec.4.1 briefly introduces the basic settings used in our experiments. Sec.4.2 conducts a series of comparative experiments to verify the importance of learning good features in CNN model. Sec.4.3 demonstrates the efficiency of the novel stage-by-stage feature transfer scheme. Sec.4.4 explores our approach on classification and detection tasks and compares with other state-of-the-art knowledge transfer methods.

4.1 Experimental Settings

We evaluate our feature transfer method on several commonly used datasets, including CIFAR-100 [13], ImageNet [14] and COCO [19]. Among them, CIFAR-100 and ImageNet are used for the classification task, while COCO is used for the detection task. For the classification task, a pre-trained ResNet-34 [6] is employed as the teacher model and a shallower network, ResNet-18, is adopted as the student model. We also present experimental results on the detection task with RetinaNet following the settings in [18]. ResNet-101 [17] and ResNet-50 are applied as teacher and student models, respectively.

To further validate the effectiveness of our method, a set of comparative experiments has been done with state-of-the-art knowledge transfer approaches, including KD [7], FitNets [22], AT [26] and NST [11]. We set the temperature and the loss weight in KD following the settings in [7]. For FitNets, we use the features at one fixed spatial resolution as the intermediate hint layer. For AT and NST, we compute the transfer loss with the 4 output features of the sets of residual blocks, which is the same as our approach. This provides a fair comparison.

4.2 Feature Transfer

In this section, we set up a series of experiments to show the rationality of transferring features. We take classification task on CIFAR-100 as an example. The performances of directly training student and teacher model from scratch are shown in Tab.1(a), and the teacher will be used as guidance of student in the following experiments.

As discussed in Sec.3.2, the whole network can be divided into two parts, which are respectively designated for feature extraction and feature adaption. We propose to firstly train the feature extraction part of student to mimic the corresponding output feature of teacher, and then train the feature adaption part with labeled data based on the fixed feature from the first phase. To verify that training these two parts separately will not affect the performance, we fix the feature extraction part of a well-trained model, re-initialize the final stage, i.e. the fully-connected layer, and then merely train the final stage. Results are shown in Tab.1(b). By comparing (a) and (b), we can tell that training the final stage with fixed features achieves essentially the same performance as end-to-end training, demonstrating the feasibility of our approach.

Figure 3: Visualization of features produced by different models, including (a) feature of the 3rd stage and (b) feature of the 4th stage. Better viewed in color.

To further verify that mimicking features can indeed transfer knowledge from teacher to student, we train four additional student models supervised by both the ground truth label and the hidden features of teacher, as shown in Tab.1(c). Here, “1 stage” means that only the final feature is used as supervision, “2 stages” means that the feature extraction part is separated into two stages and the output features from both stages are used to guide the student, and so on. In these experiments, we set the weight of each loss term, i.e. feature mimicking loss and classification loss, to 1. The comparison between (a) and (c) suggests that providing the shallow model with the information contained in the features of the deep one helps student gain more knowledge and thus get better performance. Furthermore, as the number of stages increases, the model becomes more accurate in top-1 accuracy. That is because features from different stages provide different levels of information, i.e. shallower layers provide low-level information while deeper layers provide high-level information. Accordingly, the inference ability of student is enhanced by learning more knowledge from the teacher.
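The end-to-end objective used for Tab.1(c) can be sketched as a sum of per-stage mimicking terms and a classification term, each with weight 1. The helper names, the use of a squared L2 distance without normalisation, and the toy inputs are assumptions made for illustration.

```python
import math

def l2_distance(student_feature, teacher_feature):
    """Squared L2 distance between two flattened features (assumed form)."""
    return sum((s - t) ** 2 for s, t in zip(student_feature, teacher_feature))

def cross_entropy(probabilities, label):
    """Cross-entropy of predicted class probabilities against a class index."""
    return -math.log(probabilities[label])

def end_to_end_loss(student_feats, teacher_feats, probabilities, label,
                    feature_weight=1.0, task_weight=1.0):
    """Sum of stage-wise mimicking losses plus the classification loss."""
    mimic = sum(l2_distance(s, t) for s, t in zip(student_feats, teacher_feats))
    return feature_weight * mimic + task_weight * cross_entropy(probabilities, label)

# Toy example with two supervised stages and two classes.
student_feats = [[1.0, 2.0], [0.0]]
teacher_feats = [[1.0, 0.0], [1.0]]
loss = end_to_end_loss(student_feats, teacher_feats, [0.5, 0.5], label=0)
```

Here both terms are optimised jointly, so their relative scale matters; the stage-by-stage scheme avoids this coupling by never mixing the two terms in one objective.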

4.3 Stage-by-Stage Learning

In this section, we validate the effectiveness of our proposed stage-by-stage training scheme, as described in Sec.3.3. We train another four independent shallow models with different numbers of stages, where the partition of stages is the same as that described in Sec.4.2. Each model is trained to mimic the features of teacher progressively. Tab.1(d) shows the results. The only difference between (c) and (d) is whether the student is trained end-to-end or stage-by-stage. In top-1 accuracy, progressive training outperforms end-to-end training for every number of stages tested, with nearly the same training time.

Fig.3 visualizes the high dimensional features of the 3rd stage and 4th stage in 2D space with t-SNE [20]. Compared to the baseline model (first column), which is trained independently from the teacher, the model that is trained with the feature mimicking loss can learn a better representation (second column). When transferring knowledge through features, stage-by-stage training (third column) produces a clearer feature space, especially for the intermediate feature. There are three reasons. (1) The feature extraction and feature adaption parts are optimized separately, and student can focus on learning knowledge from one source in each phase, i.e. teacher and training data respectively. (2) There is no need for our approach to adjust the loss weights, which may severely affect the final results. (3) Each stage can get adequate training without being affected by the following stages. More specifically, in end-to-end training, each stage is trained with not only the feature mimicking loss for itself, but also the losses of later stages. Therefore, updating the parameters of later stages will also influence the loss for the current stage. On the contrary, stages in our method are only supervised by the current output feature, which is much more accurate.

We also explore the relationship between performance and the number of stages by training some student models with more stages, as shown in Tab.1(e). Based on the original 4-stage partition, “5 stages” means that the first stage is evenly divided into two halves, “6 stages” means that the first two stages are evenly divided, and so on and so forth. From Tab.1(e) we can tell that involving more stages may not always improve the performance. In the following experiments, we just use the 4-stage partition.

4.4 Knowledge Transfer on Different Tasks

In this section, we compare our approach with other state-of-the-art knowledge transfer methods on different tasks, including image classification and object detection. Note that we use exactly the same training settings, e.g., learning rate and optimizer type, when performing experiments on different tasks (or datasets), to verify that our method can be generally applied for various problems.

Method        Model      Top-1   Top-5
Student       ResNet-18  68.062  89.598
Teacher       ResNet-34  73.045  90.545
KD [7]        ResNet-18  72.393  91.062
FitNets [22]  ResNet-18  71.662  90.277
AT [26]       ResNet-18  70.741  90.036
NST [11]      ResNet-18  70.482  89.241
Ours          ResNet-18  72.786  91.396

Table 2: Comparative results of image classification task on CIFAR-100 dataset.

Method        Model      Top-1   Top-5
Student       ResNet-18  69.572  89.244
Teacher       ResNet-34  73.554  91.456
KD [7]        ResNet-18  70.759  89.806
FitNets [22]  ResNet-18  70.662  89.232
AT [26]       ResNet-18  70.726  90.038
NST [11]      ResNet-18  70.762  89.586
Ours          ResNet-18  71.361  90.496

Table 3: Comparative results of image classification task on ImageNet dataset.

Method        Backbone    mAP   AP_50  AP_75  AP_S  AP_M  AP_L
Student       ResNet-50   35.5  55.6   38.5   17.8  39.1  47.7
Teacher       ResNet-101  38.7  58.7   41.9   21.5  42.3  49.9
FitNets [22]  ResNet-50   36.2  55.3   39.1   19.0  40.6  48.5
AT [26]       ResNet-50   35.9  54.7   38.2   19.3  39.6  46.6
NST [11]      ResNet-50   35.7  54.3   38.0   18.6  39.7  46.8
Ours          ResNet-50   36.5  55.6   39.4   19.5  40.7  48.8

Table 4: Comparative results of object detection task on COCO benchmark.

4.4.1 Image Classification

For classification task, we employ CIFAR-100 and ImageNet as validation datasets to evaluate how our framework could fit with various numbers of classes.

Evaluation on CIFAR-100. We start with the CIFAR-100 dataset, which consists of 50K training images and 10K testing images from 100 classes. Tab.2 shows the comparative results. Our method surpasses other work in both top-1 accuracy and top-5 accuracy. We even achieve performance similar to that of the teacher model. The results demonstrate the effectiveness of our method, which emphasizes progressive feature transfer for knowledge distillation.

Evaluation on ImageNet. We also conduct larger-scale experiments on the ImageNet dataset, which includes over 1M training images and 50K testing images collected from 1,000 categories. As shown in Tab.3, our method improves the baseline model by 1.8% top-1 accuracy and beats the second-best competitor, i.e., NST [11], by 0.6% top-1 accuracy.

4.4.2 Object Detection

We also conduct experiments on the detection task. As described in Sec.4.1, we use RetinaNet [18] as the detection framework, and employ ResNet-50 and ResNet-101 as the backbones of student and teacher respectively. We do not compare with the KD method in this task, since the soft target obtained by raising the temperature of the softmax function cannot be directly applied to bounding box regression as required in the object detection task. Not being applicable to different tasks is also a major limitation of some previous methods. From this point of view, our approach is much more general. From the results in Tab.4, we can see that our method outperforms the baseline model by 1.0% mean Average Precision (mAP) and also outperforms all the other methods.

In addition, by cross-checking Tab.2 and Tab.3, we can tell that other methods perform inconsistently on different datasets. For example, KD [7] works well on CIFAR-100, but it shows performance similar to the others on ImageNet. Likewise, FitNets [22] works well on COCO, but its performance is slightly inferior to AT [26] and NST [11] on the ImageNet dataset. That is because the design of these methods is highly sensitive to the hyper-parameters, e.g., the temperature of the softmax function and the loss weights to balance transfer loss and task loss, leading to their performance discrepancies on different datasets. In comparison to these approaches, our method focuses on transferring knowledge from network to network by mimicking features, which is independent of tasks and datasets, resulting in much stronger stability.

5 Conclusion

This work presents a stage-by-stage knowledge transfer approach by training student to mimic the output features of teacher network gradually. Compared to prior work, our method pays more attention to the information contained in the feature extraction part instead of the entire model. The progressive training strategy helps reduce the learning difficulties of student in each stage, and all stages cooperate together for a better result. Extensive experimental results suggest that our scheme can significantly improve the performance of student model on various tasks with strong stability.