Deep neural networks have proved successful in many visual tasks, such as image classification [39, 54], object detection [9, 11], semantic segmentation, and other fields [71, 24, 67], owing to their powerful ability to extract knowledge from massive available data. Beyond these remarkable successes, there is growing concern about developing effective learning approaches for real-world scenarios where high-quality data is often unavailable or insufficient. In such cases, network learning encounters obstacles. Knowledge distillation provides an economical way to transfer knowledge from a pre-trained teacher network to facilitate the learning of a new student network.
Knowledge distillation approaches mainly follow offline or online strategies. Offline strategies try to design more effective knowledge representations for learning from a powerful teacher. Park et al. and Liu et al. focused on structured knowledge about the instances, while other approaches [47, 56, 63] paid attention to internal information in the network during the distillation process. These offline approaches usually use a fixed pre-trained model as the teacher, as shown in Fig. 1 (a), which usually introduces a large capability gap between teacher and student during learning (see Fig. 1 (c)) and makes transfer difficult. The capability gap refers to the performance difference between the teacher and student networks; in image classification, it specifically refers to the difference in accuracy. By contrast, online distillation strategies attempt to reduce the capability gap and improve the learning of the student through training schemes that do not require a pre-trained teacher. An on-the-fly native ensemble (ONE) scheme was proposed for one-stage online distillation. Instead of borrowing supervision signals from previous generations, snapshot distillation extracted information from earlier iterations in the same generation. These online practices loosen the shackles of the capability gap imposed by a fixed teacher and yield relatively better knowledge transfer, but they place high demands on the efficiency of knowledge representation because they rely on the relatively less reliable predictions of their peers or of themselves for additional supervision [30, 2, 8]. Hence a question arises: how can we ensure both a small capability gap and efficient knowledge representation to facilitate student learning?
Inspired by recent observations [62, 40, 27] that a small teacher-student capability gap benefits knowledge transfer, we propose an evolutionary knowledge distillation (EKD) approach to improve the learning of the student, as shown in Fig. 1 (b). The approach uses an evolutionary teacher whose performance improves continuously as training proceeds and which constantly transfers intermediate knowledge to the student. The evolutionary teacher can thus provide richer supervision for the student and reduce the capability gap. In addition, to improve the knowledge representation ability, the teacher and student networks are both divided into several blocks, and a simple guided module pair is introduced between each pair of corresponding blocks. In short, the evolutionary teacher addresses both the capability gap between a fixed teacher and the student in offline knowledge distillation and the insufficient, relatively unreliable supervision caused by the absence of a qualified teacher in online knowledge distillation. The guided modules further promote the representation of intermediate knowledge. In this way, the student can continuously and adequately learn the intermediate knowledge of the teacher as well as its growth process.
The main contributions of our work are threefold: 1) we propose an evolutionary knowledge distillation approach that facilitates student network learning by narrowing the teacher-student capability gap; 2) we introduce guided module pairs to enhance the knowledge representation and transfer ability; 3) we conduct extensive experiments to verify that our approach is superior to the state of the art and adapts better to low-resolution and few-sample scenarios.
II Related Work
II-A The Basics of Knowledge Distillation
Knowledge distillation (KD) provides a concise but effective solution for transferring knowledge from a large pre-trained teacher model to a smaller student model. Since its introduction, knowledge distillation has been widely used in image recognition, semantic segmentation, and other fields, especially model compression [23, 6, 3, 46, 36, 34]. In practice, the student model learns from the predictions of the pre-trained teacher model and thereby becomes more powerful than when it is trained alone. Compared to hard ground-truth labels, the fine-grained class information in soft predictions helps the small student model to reach flatter local minima, which results in more robust performance and better generalization [45, 28]. Several recent works attempt to further improve knowledge transfer between network models of varying capacity with offline or online knowledge distillation approaches [56, 30, 8, 70, 1, 10].
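To make the role of soft predictions concrete, the following minimal sketch computes temperature-softened class probabilities and a KL-based distillation term in plain Python. The function names, the example logits and the T² scaling (a common convention in KD implementations) are illustrative assumptions, not code from this work.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL between softened teacher and student predictions.

    The T^2 factor is a common convention and is assumed here.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

teacher_logits = [6.0, 2.0, 1.0]  # hypothetical teacher outputs
student_logits = [3.0, 2.5, 0.5]  # hypothetical student outputs

print(softmax(teacher_logits, 1.0))  # sharp, close to a hard label
print(softmax(teacher_logits, 4.0))  # softer, exposes inter-class similarity
print(kd_loss(teacher_logits, student_logits))
```

The softened distribution assigns visibly more mass to the non-target classes, which is the fine-grained class information that a hard label cannot carry.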
II-B Offline Knowledge Distillation
Offline knowledge distillation approaches often adopt a two-stage training mode: the teacher model is trained first, and the student network is then trained with various distillation strategies. The classical FitNet first tried to transfer more supervision by using the feature maps of the middle layers of the teacher network. Crowley et al. proposed structural model distillation for memory reduction, using a strategy that produces a student architecture that is a simple transformation of the teacher’s: no redesign is needed, and the same hyper-parameters can be used. Some recent approaches [37, 42] paid more attention to the relationships among instances. Tian et al. proposed contrastive representation distillation, whose main idea is very general: learn a representation that is close in some metric space for “positive” pairs and push apart the representations of “negative” pairs. Furlanello et al. interactively absorbed the distilled student models into the teacher model group, which yields better generalization on test data. Sukmin Yun et al. proposed a new regularization method that penalizes the predictive distribution between similar samples via self-knowledge distillation, in order to mitigate the poor generalization that deep neural networks with millions of parameters may suffer due to overfitting. Several works [64, 48, 50, 51] proposed using multiple teachers to provide more supervision for the learning of the student network.
Generally speaking, offline knowledge distillation works well only when the following conditions are met: a high-quality pre-trained teacher model, a difference between teacher and student, and adequate supervision [63, 62, 61]. In particular, offline knowledge distillation methods rely heavily on a fixed pre-trained model, and the large capability gap between the fixed teacher and the student model poses great challenges to knowledge transfer. To bridge this gap, Mirzadeh et al. introduced multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant); Jin et al. proposed a method named RCO, which uses the route that the teacher network traverses in parameter space as a constraint to guide the student toward a better optimum. None of these methods can completely remove the obstacles to knowledge transfer caused by the capability gap, because they still rely to some extent on the pre-trained teacher model. Therefore, we need to rethink how to break the shackles of offline distillation approaches and improve the learning of the student network more effectively.
II-C Online Knowledge Distillation
Unlike the conventional two-stage offline distillation approaches, current approaches increasingly focus on online distillation strategies, which attempt to reduce the capability gap and improve the learning of the student through training schemes that do not require pre-trained teachers. A group of networks or sub-networks is trained almost synchronously, and the predictions of peers or sub-networks serve as supervision in place of those of a high-quality pre-trained teacher. Deep Mutual Learning (DML) applied distillation losses mutually, with the networks treating each other as teachers, and achieved good results. However, DML lacks an appropriate teacher role and hence provides only limited information to each network. Guocong Song et al. introduced collaborative learning, in which multiple classifier heads of the same network are trained simultaneously on the same training data to improve generalization and robustness to label noise at no extra inference cost. A similar learning strategy named On-the-fly Native Ensemble (ONE) was proposed by Xu Lan et al. for one-stage online distillation. Specifically, ONE trains only a single multi-branch network while simultaneously establishing a strong teacher on the fly to enhance the learning of the target network. Chung et al. proposed an online knowledge distillation method that transfers the knowledge of the class probabilities and the feature map using an adversarial training framework. Zhang et al. proposed an online training framework called self-distillation, which forces the student to refine its knowledge inside the network, thereby improving itself. Snapshot Distillation was proposed as a framework for teacher-student optimization in one generation; it extracts supervision signals from earlier epochs in the same generation while making sure that the difference between teacher and student is sufficiently large to prevent under-fitting. Later, a novel two-level framework, OKDDip, was proposed to perform effective online distillation during training with multiple auxiliary peers and one group leader. In the OKDDip framework, the first-level distillation works as diversity-maintained group distillation with several auxiliary peers, while the second-level distillation transfers the diversity-enhanced group knowledge to the ultimate student model, called the group leader.
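The mutual supervision used by DML-style online distillation can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions: two peers, equal weighting of the cross-entropy and KL terms, and hypothetical function names and logits; it is not the reference implementation of any of the cited methods.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_losses(logits_a, logits_b, label):
    """Each peer is trained on the label plus the other peer's prediction."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + kl_divergence(p_b, p_a)  # b teaches a
    loss_b = cross_entropy(p_b, label) + kl_divergence(p_a, p_b)  # a teaches b
    return loss_a, loss_b

# Hypothetical logits of two peers for one sample whose true class is 0.
loss_a, loss_b = mutual_losses([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], label=0)
print(loss_a, loss_b)
```

Because neither peer is more reliable than the other by construction, the extra KL term is only as informative as the peer currently is, which is the limitation discussed above.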
These online, timely and efficient training methods are promising, and good progress has been made in narrowing the capability gap between teacher and student models [30, 62, 68]. However, two factors still hold online distillation back. First, although the teachers in online distillation methods are dynamic and have narrowed the capability gap, the gap still exists due to the lack of representation of details and of the learning process. Second, owing to the absence of a qualified teacher role in online knowledge distillation, the insufficient and relatively unreliable supervision restricts the learning of the student to some extent. There is still great room for improvement in knowledge transfer and representation. Therefore, we propose the evolutionary knowledge distillation approach, which enhances student network learning by using an evolutionary teacher and by dynamically focusing on the intermediate knowledge of the teaching process.
III The Proposed Approach
In our evolutionary knowledge distillation (EKD) approach, the teacher and student networks are trained almost synchronously, and an evolutionary teacher provides supervision for the learning of the student. The performance of the teacher improves continuously as training proceeds, which provides richer supervision and reduces the capability gap. As shown in Fig. 2, EKD consists of a teacher stream, a student stream, and some extra guided modules. The network of each stream is divided into several blocks, depending on the specific network. Previous work has shown that using early blocks helps to exploit information inside the network [2, 70, 21]. To utilize this intermediate knowledge, we improve the bottleneck module [21, 12] to form guided module pairs that assist the representation and transfer of knowledge. For each pair of corresponding blocks in the teacher and student streams, a guided module pair is inserted to make the distillation process effective and efficient both within each stream and across streams. Within-stream knowledge distillation aims to enhance knowledge representation, while cross-stream knowledge distillation helps to improve knowledge transfer.
III-A Problem Formulation
Denote the training set is , where is the th sample with label . Here is the number of classes. The objective of distillation is transferring the knowledge of teacher into the student , where and are the model parameters of teacher and student, respectively. In our setting, the network of each stream is divided into blocks. Between the corresponding teacher and student blocks, a pair of guided modules is introduced to assist the representation, transfer of knowledge, where and are the feature maps and model parameters from the th block of teacher () and student () respectively. Therefore, during student network learning, the problem of EKD is formulated as:
where is the function to measure distillation loss. Different from traditional knowledge distillation where is fixed, EKD simultaneously learns both and from with the help of several guided module pairs . After learning, the temporal network within the dashed box (see the Fig. 2) will be discarded and only the parameters of the main model of student stream will be used for inference.
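As a hedged sketch of the formulation described above, the objective can be written in the following generic form. All symbols are assumed notation introduced for illustration and are not necessarily the authors' own: $\mathcal{D}$ is the training set, $\phi_t(\cdot; w_t)$ and $\phi_s(\cdot; w_s)$ are the teacher and student, $g_t^{k}$ and $g_s^{k}$ form the $k$-th guided module pair with parameters $\theta_t^{k}$ and $\theta_s^{k}$, $f_t^{k}$ and $f_s^{k}$ are the feature maps of the $k$-th blocks, $m$ is the number of blocks, and $\ell$ is the distillation loss.

```latex
\min_{w_t,\; w_s,\; \{\theta_t^{k},\, \theta_s^{k}\}_{k=1}^{m}}
\sum_{(x_i, y_i) \in \mathcal{D}}
\Big[
  \ell\big(\phi_t(x_i; w_t),\, \phi_s(x_i; w_s)\big)
  + \sum_{k=1}^{m}
    \ell\big(g_t^{k}(f_t^{k}; \theta_t^{k}),\, g_s^{k}(f_s^{k}; \theta_s^{k})\big)
\Big]
```

The only point this sketch is meant to convey is that the teacher parameters appear among the optimization variables, whereas in traditional offline distillation they are constants.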
III-B Evolutionary Distillation
The proposed evolutionary knowledge distillation trains the teacher and student models online and synchronously: the parameters of both models are updated on each batch of data, which reduces the teacher-student capability gap. As Fig. 2 shows, the teacher and student streams are divided into blocks. Each block, followed by a guided module with a fully connected layer, constitutes a classifier, so each stream contains multiple classifiers. We assume a stream with classifiers. The training process includes two simultaneous stages, within-stream distillation and cross-stream distillation. In within-stream distillation, deeper classifiers provide supervision to help the learning of shallower classifiers, which improves the ability of the stream itself to represent knowledge. Cross-stream distillation improves knowledge transfer from the evolutionary teacher to the student. The guided modules facilitate both knowledge representation and transfer.
Within-stream Distillation. For within-stream distillation, we use the shallower classifiers as students, and use the deeper classifiers and , here , as teachers. The students are trained by the supervision provided by teachers, supervision includes Kullback-Leibler (KL) divergence  and L2 distance  between the feature maps before the final FC layers of teacher and student. In order to simplify the form, here we only show the loss between the backbone network and each guided modules .
In practice, we use the classic distillation loss, KL divergence, and L2 distance loss. We compute KL divergence loss between deep classifiers and shallow classifiers within stream. The output of teacher and student is and respectively, . And the distillation loss is calculated by the following.
where denotes the KL divergence loss. As multiple different teachers provide different knowledge, we could achieve the more robust and accurate knowledge representation.
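Under assumed notation, the within-stream prediction loss described above can be sketched as follows. Here $z^{k}$ is the logit vector of the $k$-th guided-module classifier, $z^{b}$ is that of the backbone (deepest) classifier, $c$ is the number of guided-module classifiers, $n$ is the number of classes, and $T$ is the distillation temperature. These symbols, and the choice to show only the backbone as teacher, are illustrative assumptions.

```latex
p^{k} = \operatorname{softmax}\!\left(z^{k} / T\right), \qquad
\mathcal{L}_{\mathrm{kl}}^{\mathrm{within}}
  = \sum_{k=1}^{c} \mathrm{KL}\!\left(p^{b} \,\|\, p^{k}\right)
  = \sum_{k=1}^{c} \sum_{j=1}^{n} p^{b}_{j} \,\log \frac{p^{b}_{j}}{p^{k}_{j}}
```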
We compute feature loss by L2 distance which can be obtained through computing the feature maps of deep classifier and shallow classifiers.
where and represent the feature maps of teacher or student before the FC layer respectively.
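The feature loss described above can be illustrated with a small plain-Python sketch. It assumes that feature maps are flattened to equal-length vectors, that the distance is the squared L2 norm, and that the per-classifier terms are summed with equal weight; the function names and example values are hypothetical.

```python
def squared_l2(a, b):
    """Squared L2 distance between two flattened feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_stream_feature_loss(deep_feature, shallow_features):
    """Sum of squared L2 distances from the deepest feature to each shallower one."""
    return sum(squared_l2(deep_feature, f) for f in shallow_features)

deep = [1.0, 2.0, 3.0]                          # hypothetical backbone feature
shallow = [[1.0, 2.0, 3.0], [0.0, 2.0, 5.0]]    # hypothetical guided-module features
print(within_stream_feature_loss(deep, shallow))  # 5.0
```

The first shallow feature already matches the deep one and contributes nothing, so the whole loss comes from the second, which is the behaviour one would want from a term that pulls shallow features toward the deepest one.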
Cross-stream Distillation. For the cross stream knowledge transfer, the student will learn under the supervision of the corresponding guided modules of the teacher stream:
The cross-stream distillation loss contains two types of losses: distillation loss and feature loss. For distillation loss between backbone networks and the one between each pair of guided modules, they were calculated by the KL divergence.
where the outputs of guided module attached to the teacher and student are and , respectively, . The and denote the outputs of the backbone networks.
The form of feature loss is as follows,
where and represent the feature map of classifier before the FC layer respectively. The feature loss between each pair of guided modules measures the difference between their feature maps, which promotes the transfer of intermediate knowledge across streams. Because the teacher and student streams are divided into several blocks, the output dimensions of corresponding blocks may not match. The guided modules make these dimensions consistent while keeping the loss of intermediate knowledge during transfer as small as possible.
Then, we combine the two parts of the guided loss,
For each classifier, we compute cross entropy loss between and . In this way, the label directs each classifier’s probability as possible. There are multiple classifiers and we sum each cross entropy loss as follows:
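The summed classification loss can be sketched as follows, assuming each classifier produces a logit vector for the same sample and that all classifiers are weighted equally. The function names and example logits are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot label."""
    m = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def total_classification_loss(logits_per_classifier, label):
    """Sum of the cross-entropy losses of all classifiers in one stream."""
    return sum(cross_entropy(logits, label) for logits in logits_per_classifier)

# Hypothetical outputs of three classifiers (two guided modules and the backbone).
stream_logits = [[0.5, 0.2, 0.1], [1.0, 0.3, 0.0], [2.5, 0.1, -0.4]]
print(total_classification_loss(stream_logits, label=0))
```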
As the parameters of teacher and student are constantly updated in each iteration, the evolutionary knowledge distillation process can be formulated as:
denotes the number of iterations. Specifically, for teacher stream and student stream, we used different loss functions,and , respectively.
where is an indicator function which equals 1 if the stream is teacher and 0 otherwise. The training procedures of the teacher and student streams are therefore similar but not identical. The student constantly learns from the evolutionary teacher under the supervision provided by the teacher network and the guided modules. In each iteration, we first optimize the teacher network so that it can supervise the learning of the student; after that, the optimization of teacher and student is carried out synchronously. In general, the evolutionary teacher reduces the capability gap between teacher and student. The guided modules not only facilitate the knowledge representation of the teacher and student streams but also promote the transfer of more intermediate and process knowledge from teacher to student.
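One way the indicator function could gate the per-stream loss is sketched below. It assumes that the cross-stream term is applied to the student stream only and that all terms have equal weight; both choices, as well as the function and argument names, are assumptions made for illustration rather than the authors' exact loss.

```python
def stream_loss(ce_loss, within_loss, cross_loss, is_teacher):
    """Combine the loss terms of one stream.

    indicator = 1 for the teacher stream and 0 for the student stream, so the
    cross-stream term (assumed to act on the student only) vanishes for the teacher.
    """
    indicator = 1.0 if is_teacher else 0.0
    return ce_loss + within_loss + (1.0 - indicator) * cross_loss

# Hypothetical loss values for one iteration.
print(stream_loss(1.0, 0.5, 0.25, is_teacher=True))   # 1.5
print(stream_loss(1.0, 0.5, 0.25, is_teacher=False))  # 1.75
```

Under this assumption the two streams share the same classification and within-stream terms and differ only in whether they receive supervision from the other stream.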
|Teacher Acc. (%)||74.17||70.76||78.71||76.61||80.05||78.71||80.24|
|Student Acc. (%)||70.76||70.76||76.61||76.61||78.71||78.71||80.24|
|Teacher Acc. (%)||76.61||70.76||76.61||80.24||80.24|
|Student Acc. (%)||69.21||66.18||70.48||66.18||69.21|
III-C Implementation Details
The detailed training procedure of EKD is shown in Algorithm 1. We introduce several identical guided modules on each stream, and each module is followed by a fully connected layer and a softmax layer. During training, the two networks are trained at the same time, with a certain degree of randomization so that they are not completely synchronized. This incomplete synchronization can provide enough second-hand knowledge to promote the student’s learning [61, 62]. Once training is completed, we keep only the backbone network parameters of the student for inference.
The experimental results of other distillation methods are based on the open-source code of CRD, and we run the experiments with its hyper-parameter settings. The networks are trained with the SGD optimizer. The hyper-parameters are set as follows: the batch size is 64, the number of threads is 8, and the initial learning rate is 0.1, which is multiplied by 0.1 at epochs 75, 130 and 180. We fix the random seed to 5 and the distillation temperature to 4. All experiments are implemented in PyTorch on an NVIDIA TITAN GPU.
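The step schedule described above can be written as a small function. The sketch assumes zero-based epoch indices and that the decay takes effect from the milestone epoch onward; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(75, 130, 180), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` once for every milestone reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed

for epoch in (0, 74, 75, 130, 180):
    print(epoch, learning_rate(epoch))
```

In PyTorch, a schedule of this shape is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[75, 130, 180], gamma=0.1)`; whether that particular scheduler was used here is not stated.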
IV-A Experimental Settings
We conduct experiments on CIFAR10, CIFAR100, Tiny-ImageNet, UMDFaces and UCCS. We first compare EKD with several state-of-the-art offline distillation approaches, including KD, FitNet, Attention, Factor, PKT, RKD, Similarity, Correlation, VID [ahn2019variational], Abound, CRD, Res-KD and AFD. We also compare with several online distillation approaches that do not use a pre-trained teacher, including DML, CL-ILR, ONE, Snapshot-KD, OKDDip and Self-KD. Then, ablation experiments are conducted to study the impact of the evolutionary teacher, the guided modules and the components of the loss function. Finally, we conduct experiments on low-resolution and few-sample scenarios to verify the adaptability of our approach. In our experiments, we use VGG(bn), ResNet, Wide ResNet, ShuffleNetV1 and ShuffleNetV2 as the backbone networks.
CIFAR10. The dataset consists of 60,000 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
CIFAR100. This dataset is just like the CIFAR10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
Tiny-ImageNet. It is an image classification dataset that contains about 100,000 training images and 10,000 validation images in 200 classes. It is more difficult than the CIFAR100 dataset. On Tiny-ImageNet, we first randomly crop and resize the images, then apply a random horizontal flip, and finally normalize them. For testing, we only resize the images.
UMDFaces. This face dataset is collected from the Internet and contains 367,888 face annotations for 8,277 subjects. In the adaptability analysis (Section IV-F), UMDFaces is used as the training dataset.
UCCS. This dataset contains 16,149 images of 1,732 subjects captured in the wild. It is a very challenging benchmark with various difficulties, including blurry images, occluded appearance and bad illumination. Following an existing setting, we randomly select a 180-subject subset, split its images into 3,918 training images and 907 testing images, and report the standard top-1 accuracy.
|Network||Baseline||DML||CL-ILR||Snapshot-KD||ONE||Self-KD||OKDDip||EKD (Ours)|
IV-B Comparisons with Offline Distillation Approaches
We conduct experiments on the CIFAR100 dataset and compare against offline distillation approaches in a peer-architecture setting and a cross-architecture setting. The results of the other offline distillation methods are obtained with RepDistiller, the open-source code of CRD, using its hyper-parameter settings.
Peer-Architecture Setting. In this setting, the teacher and student networks share similar or identical structures. The results in Tab. I lead to some meaningful observations. First, our approach outperforms all other offline distillation approaches, which indicates its effectiveness in improving student network learning. For example, when the teacher and student are VGG19(bn) and VGG11(bn), respectively, we achieve 74.12% accuracy on CIFAR100, which is 0.63% higher than the state-of-the-art method Similarity, and we gain a 3.13% improvement over CRD when the teacher and student are ResNet50 and ResNet18. Second, the effect is also evident when the teacher stream and student stream adopt the same structure. For example, when both the teacher and student are VGG11(bn), we gain a 1.20% improvement, even higher than when learning from VGG19(bn), and the same is true when both teacher and student are ResNet50. We speculate that this is due to the smaller capability gap between two identical backbone networks, which facilitates knowledge transfer from the evolutionary teacher to the student.
Cross-Architecture Setting. The results above show that, for the same or similar network architectures, the proposed approach is superior to conventional offline knowledge distillation methods. To verify the generalization of our approach, we conduct further experiments on a more challenging knowledge transfer task, the cross-architecture setting, in which the architectures of the teacher and student networks are completely different.
We evaluate the cross-architecture setting on the CIFAR100 dataset, keeping the learning rate, update strategy and simple data augmentation methods the same as before. The results are shown in Tab. II. We achieve the best results when the teacher and student networks are completely different. Specifically, when the teacher is ResNet18 and the student is VGG8(bn), our approach is at least 0.65% higher than the other offline distillation strategies, and when the teacher is WRN50-2, the student (VGG8(bn)) gains a clear advantage, with a performance improvement of more than 1.55%. When the teacher and student networks are WRN50-2 and ShuffleNetV1, respectively, we achieve an improvement of at least 1.83%, except when compared to Similarity. These gains arise because the evolutionary teacher loosens the shackles caused by the capability gap between teacher and student and provides richer supervision, while the guided modules ensure efficient knowledge representation and transfer. The results in the cross-architecture setting demonstrate that our method does not rely on architecture-specific cues.
Comparisons on Tiny-ImageNet. To further verify the effectiveness of the proposed method, we conduct experiments on the more challenging Tiny-ImageNet. We mainly compare our approach with some typical methods, including KD, FitNet, Attention, RKD, CRD, Res-KD and AFD. Here, Res-KD uses the knowledge gap between teacher and student as a guide to train a more lightweight student network, called the “res-student”, and AFD is an attention-based feature matching distillation method that utilizes all the feature levels of the teacher. As shown in Tab. IV, the proposed EKD achieves better performance on this larger dataset than the other knowledge distillation methods; for example, our method achieves a 0.66% improvement over the previous best result, obtained by AFD.
IV-C Comparisons with Online Distillation Approaches
We also compare the proposed EKD to several recent online knowledge distillation approaches, including Deep Mutual Learning (DML), Collaborative Learning for Deep Neural Networks (CL-ILR), Knowledge Distillation by On-the-Fly Native Ensemble (ONE), Snapshot-KD, Self Distillation (Self-KD) and Online Knowledge Distillation with Diverse Peers (OKDDip). The results in Tab. III are based on the code of OKDDip, except that the results of Snapshot-KD are based on previously reported data and the results of Self-KD are obtained by re-implementing the framework of the original paper.
As shown in Tab. III, the “Baseline” approach trains a model with ground-truth labels only. Compared with the other online knowledge distillation methods, our EKD shows advantages. When the backbone network is VGG16(bn), our method is 1.48% higher in accuracy than the current best approach, Self-KD. When the backbone network is ResNet110, our method improves on OKDDip by 0.37%, which is an increase of 3.40% over the “Baseline”. We suspect that, for previous online knowledge distillation methods, the capability gap still exists due to the lack of representation of details and of the learning process, so the insufficient and relatively unreliable supervision from peers or from the network itself restricts the learning of the student network to some extent. Our evolutionary teacher not only reduces the capability gap introduced by a fixed pre-trained teacher, but also provides a more flexible teacher role for online knowledge distillation in place of the networks themselves or their peers. The above results illustrate that our method transfers knowledge more adequately and effectively by combining guided modules and an evolutionary teacher.
The previous experimental results quantitatively demonstrate the superiority of the proposed evolutionary knowledge distillation. To demonstrate the advantages of our approach visually, we use t-SNE for visualization. t-SNE is a tool for visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. Fig. 3 gives the visualization results of ResNet18 trained with KD, CRD and EKD on the Tiny-ImageNet dataset. The “Baseline” in Fig. 3 denotes ResNet18 trained without any distillation method; for KD, CRD and EKD, we train the student with the help of a ResNet50 teacher. We randomly select ten classes of this dataset for the visualization experiment, with 50 samples per class; different numbers indicate different classes in Fig. 3. First, our approach clearly achieves more concentrated clusters than KD and CRD. In addition, as shown in Fig. 3, the distances vary more severely in the classifiers of KD and CRD than in the classifier of EKD. We speculate that the evolutionary teacher facilitates student network learning by narrowing the teacher-student capability gap and that the guided modules enhance intermediate knowledge representation, which together improve student network learning.
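The quantity that t-SNE minimizes can be evaluated directly on a toy example. The sketch below only computes the objective and does not perform the gradient-based optimization; it also uses a fixed Gaussian bandwidth instead of the per-point perplexity search of the real algorithm. The function names, the bandwidth and the toy points are all illustrative assumptions.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_joint(points, sigma=1.0):
    """Symmetrized Gaussian affinities P (fixed bandwidth, no perplexity search)."""
    n = len(points)
    cond = []
    for i in range(n):
        w = [0.0 if i == j else
             math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([wj / total for wj in w])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def low_dim_joint(embedding):
    """Student-t affinities Q in the low-dimensional embedding."""
    n = len(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[wij / total for wij in row] for row in w]

def tsne_objective(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs of points."""
    return sum(p * math.log((p + eps) / (q + eps))
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Two tight clusters in 3-D (hypothetical features of two classes).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
P = high_dim_joint(points)
good = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]  # clusters preserved
bad = [(0.0, 0.0), (10.0, 0.0), (0.1, 0.0), (10.1, 0.0)]   # clusters mixed up
print(tsne_objective(P, low_dim_joint(good)))
print(tsne_objective(P, low_dim_joint(bad)))
```

The embedding that keeps neighbouring points together yields a much smaller divergence than the one that mixes the clusters, which is why compact, well-separated clusters in a t-SNE plot are read as evidence of a well-structured feature space.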
IV-E Further Analysis
The proposed approach mainly consists of two parts, the evolutionary teacher and the guided modules. In this section, we conduct further experiments to explore the specific influence of the evolutionary teacher, the guided modules and the components of the loss function. In addition, we compare the training time of our approach with that of other distillation approaches. All ablation experiments are performed in the same setting as the previous ones.
|KD+T_G+S_G+ET (EKD)||82.05 (+2.04)|
Influence of Evolutionary Teacher. Here, we use ResNet50 and ResNet18 as teacher and student, respectively. As shown in Tab. V, with the classic knowledge distillation method KD, the accuracy of the student on the CIFAR100 dataset is 77.57%. When we use an evolutionary teacher to teach the student network with all other settings unchanged, the performance of the student network increases by 2.14% and reaches 79.71%. We believe this is because the evolutionary teacher provides more sufficient supervision to promote student network learning. The guided modules also play a positive role in within-stream and cross-stream distillation, which raises a question: when the guided modules are working, can the evolutionary teacher still play an active role in the learning of the student? To further verify the influence of the evolutionary teacher, we added guided modules to the teacher stream and the student stream, respectively, and compared the performance of the student model before and after adopting the evolutionary teacher strategy. As shown in Tab. V, when the guided modules are introduced for the student stream only, the evolutionary teacher strategy improves the accuracy by 2.04%, while when they are introduced for both the teacher and student streams, it improves the performance by 2.49%. It is worth pointing out that the teacher model also performs well with the help of its guided modules; when the evolutionary teacher supervises the learning of the student without guided modules, the student still improves by 3.77% compared to KD (81.37% vs. 77.57%). The above results demonstrate that the evolutionary teacher promotes student network learning by reducing the capability gap between teacher and student.
Influence of Guided Modules. To explore the influence of the introduced guided modules, we conduct further experiments. As shown in Tab. V, the guided modules clearly improve the performance of the student network. When only the student stream uses the guided modules, the performance of the student improves by 1.28% compared to KD (78.85% vs. 77.57%). When the teacher stream and the student stream use the guided modules at the same time, the performance improves by 2.44% (80.01% vs. 77.57%). These results show that the guided modules promote the effective transfer of knowledge by making full use of the intermediate information of the network and thereby improve the performance of the student.
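The gains quoted in this paragraph follow directly from the reported accuracies, as the short check below shows. The accuracies are those given in the text; the variable and function names are shorthand introduced here.

```python
kd_baseline = 77.57          # student accuracy with classic KD (%)
student_guided_only = 78.85  # guided modules on the student stream only (%)
both_guided = 80.01          # guided modules on both streams (%)

def gain(accuracy, reference=kd_baseline):
    """Absolute improvement in percentage points, rounded to two decimals."""
    return round(accuracy - reference, 2)

print(gain(student_guided_only))  # 1.28
print(gain(both_guided))          # 2.44
```

Adding guided modules to the teacher stream as well therefore accounts for a further 1.16 points over using them on the student stream alone.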
In addition, the number of guided modules also has an important impact on knowledge transfer, so we conduct experiments based on VGG11(bn). For this network structure, we use between zero and four guided modules. When the number of guided modules is zero, our approach essentially degenerates into classic knowledge distillation with an evolutionary teacher. As shown in Fig. 4, the guided modules play a vital role: the performance improves continuously as the number of guided modules increases, but each additional module brings a smaller improvement. It should be pointed out that, because the guided modules have an extremely simple structure, their consumption of computing resources is negligible.
Influence of Loss Function. The training process of EKD includes two simultaneous stages: within-stream distillation and cross-stream distillation. Accordingly, the loss function (Eq. (11)) mainly consists of the within-stream distillation losses and the cross-stream distillation loss, together with the classification loss of each classifier. We conduct experiments to explore the influence of these components. The results are shown in Fig. 5. The loss function clearly benefits the student network, which shows that the guided modules effectively help the evolutionary teacher transfer knowledge to the student. The two within-stream distillation losses act on the feature level and the predicted-distribution level, respectively. All components play a positive role in the learning of the student network, although to different degrees, which indicates that the softened label distribution and the feature maps, which contain rich information about image intensity and spatial correlation, provide sufficient supervision for the student in both within-stream and cross-stream distillation. In general, for within-stream distillation, deeper classifiers provide supervision to help the learning of shallower ones, thereby improving the performance of the stream itself. Cross-stream distillation improves knowledge transfer from the evolutionary teacher with the help of the guided modules.
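To make the roles of these components concrete, the sketch below combines a classification loss, a feature-level term, and two softened-prediction terms into one objective. It is a minimal illustration, not Eq. (11) itself: the function names, the temperature of 4.0, the mean-squared feature distance, the prediction-level form of the cross-stream term, and the equal weights are all assumptions made here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss of one classifier for a single sample."""
    return -math.log(softmax(logits)[label])


def kl_distill(source_logits, target_logits, temperature=4.0):
    """KL(source || target) between softened distributions (prediction level)."""
    p = softmax(source_logits, temperature)
    q = softmax(target_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def feature_l2(feat_a, feat_b):
    """Mean squared distance between two flattened feature maps (feature level)."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def total_loss(cls_terms, within_feat, within_pred, cross, weights=(1.0, 1.0, 1.0)):
    """Sum of classification losses plus weighted distillation terms.

    The weights are hypothetical; setting one to 0.0 removes that component,
    which is the kind of ablation the paragraph above describes.
    """
    w_feat, w_pred, w_cross = weights
    return sum(cls_terms) + w_feat * within_feat + w_pred * within_pred + w_cross * cross


# Toy logits for a deep and a shallow classifier of the student stream,
# and for the teacher stream (three classes, true label 0).
deep_logits, shallow_logits = [2.0, 0.5, -1.0], [1.2, 0.8, -0.3]
teacher_logits = [2.5, 0.1, -1.2]

loss = total_loss(
    cls_terms=[cross_entropy(deep_logits, 0), cross_entropy(shallow_logits, 0)],
    within_feat=feature_l2([0.2, 0.4], [0.1, 0.5]),
    within_pred=kl_distill(deep_logits, shallow_logits),
    cross=kl_distill(teacher_logits, deep_logits),
)
print(f"total loss: {loss:.4f}")
```

In this sketch the deeper classifier supervises the shallower one within a stream, and the teacher supervises the student across streams, mirroring the two distillation directions described above.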
Training Time Analysis. We compare the training time of our approach with that of other knowledge distillation methods. All comparative experiments are run on an NVIDIA TITAN GPU with identical hyper-parameters. The guided modules add little computation because of their extremely simple structure. Moreover, whereas traditional distillation methods train the teacher and the student separately in two stages, our approach trains them at the same time, so the training time does not increase significantly. Specifically, when the teacher and student are ResNet50 and ResNet18, respectively, the training time of our approach is 277.5s/epoch, while the traditional knowledge distillation method KD takes 205.29s/epoch for the teacher and 171.07s/epoch for the student. The GPU memory occupied during training is slightly higher than that of traditional distillation methods. In general, our training time is not significantly higher than that of traditional methods, while the approach brings considerable performance improvements.
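As a rough illustration of this comparison, the per-epoch figures above can be combined as follows. The sketch assumes that both KD stages run for the same number of epochs as EKD, so that per-epoch costs can simply be added; the text does not state the epoch counts, so the resulting ratios are indicative only.

```python
# Per-epoch training times (seconds) quoted in the text for ResNet50 -> ResNet18.
EKD_JOINT = 277.5     # EKD: teacher and student trained together
KD_TEACHER = 205.29   # classic KD, stage 1: pre-training the teacher
KD_STUDENT = 171.07   # classic KD, stage 2: distilling into the student

# Assumption: equal epoch counts for every stage, so costs add per epoch.
kd_two_stage = KD_TEACHER + KD_STUDENT
ratio_vs_two_stage = EKD_JOINT / kd_two_stage
ratio_vs_student_stage = EKD_JOINT / KD_STUDENT

print(f"KD, both stages: {kd_two_stage:.2f} s")
print(f"EKD / KD (both stages): {ratio_vs_two_stage:.3f}")
print(f"EKD / KD (student stage only): {ratio_vs_student_stage:.3f}")
```

Under this assumption EKD costs more per epoch than the student stage of KD alone, but less than the two KD stages combined.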
IV-F Adaptability Analysis
To verify the performance of EKD in low-resolution scenarios, we conduct experiments on challenging low-resolution image classification and low-resolution face recognition tasks.
For the image classification task, we use ResNet50 as teacher and ResNet18 as student. The dataset is down-sampled to a resolution of 16×16, and the classification loss is the classic cross-entropy loss. The experimental results are shown in Fig. 6. Our approach shows good adaptability on the more challenging low-resolution image classification task, with an accuracy 1.31% higher than that of the current best method, HORKD. In addition, HORKD consumes more computing resources because it introduces an assistant network. These results indicate the effectiveness of evolutionary teacher supervision and of the introduced guided modules in representing and transferring knowledge in low-resolution scenarios.
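The down-sampling step can be sketched as follows. The text does not say which interpolation was used, so the 2×2 average pooling below is an assumption, and `downsample_2x` is a hypothetical helper name; the example relies only on the fact that CIFAR images are natively 32×32.

```python
def downsample_2x(image):
    """Halve height and width by averaging each non-overlapping 2x2 block.

    `image` is a list of rows, each row a list of numeric pixel values
    (a single channel). Height and width must be even.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# A synthetic 32x32 single-channel image standing in for one CIFAR channel.
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
low_res = downsample_2x(image)
print(len(low_res), len(low_res[0]))
```

For a colour image the same function would be applied to each channel separately.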
| Method | Top-1 Accuracy (%) | Parameters | Year |
Then we conduct experiments on the low-resolution face recognition task, which is very helpful in many real-world applications, e.g., recognizing low-resolution surveillance faces in the wild. In our experiments, the teacher is a recent face recognizer, VGGFace2, with a ResNet50 structure, and the student network is based on a streamlined ResNet18 with only 0.61M parameters. They are trained on UMDFaces and tested on the UCCS dataset. To verify the validity of our low-resolution models, we focus on the accuracy when the input resolution is 16×16. As shown in Tab. VI, our student model achieves better low-resolution face recognition performance with fewer parameters. Specifically, we achieve a top-1 accuracy of 93.85% on the UCCS benchmark, which is 0.45% higher than the state-of-the-art LRFRW, while using nearly ten times fewer parameters. Classical face recognition methods, such as ArcFace and CosFace, do not perform well in the low-resolution scenario, with their highest recognition accuracy reaching only 88.73%. Moreover, their models have more parameters, which significantly increases the computing cost of inference. It is worth mentioning that VLRR, SKD, HORKD and LRFRW all require, during distillation, the high-resolution images corresponding to the low-resolution faces in order to provide more information, but such high-resolution images are not always easy to obtain. Our approach uses only low-resolution face images for training, which better suits real-world application scenarios.
Few-sample Scenario. In practical scenarios (e.g., in the wild), the number of samples available for training is usually limited. To study the adaptability of our approach in a few-sample scenario, we conduct experiments on different subsets of the CIFAR100 dataset. We randomly select images from each class to form new subsets, and use the resulting training sets to train the student models while keeping the same test set. ResNet50 and ResNet18 are chosen as teacher and student, respectively. We compare the performance of the student models with several typical distillation methods, including KD, FitNet and CRD. The percentages of retained samples are 100%, 75%, 50% and 25%. For a fair comparison, we use the same data to train the student models in all distillation approaches, and the teacher models of the compared approaches are trained on the full dataset.
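A subset of this kind can be built by sampling the same fraction of images from every class. The sketch below is one possible way to do so; the helper name `subsample_per_class`, the fixed seed, and the rounding rule are assumptions, and the toy label list only mimics the CIFAR100 training split (100 classes with 500 images each).

```python
import random
from collections import defaultdict


def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices that keep `fraction` of the samples of every class."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed: every method sees the same subset
    kept = []
    for indices in by_class.values():
        keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, keep))
    return sorted(kept)


# Toy stand-in for the CIFAR100 training labels: 100 classes x 500 images.
labels = [cls for cls in range(100) for _ in range(500)]
for fraction in (1.0, 0.75, 0.5, 0.25):
    subset = subsample_per_class(labels, fraction)
    print(f"{int(fraction * 100)}% -> {len(subset)} training images")
```

Fixing the seed is what allows the same subset to be reused across all compared distillation methods, while the test set is left untouched.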
The results in Fig. 7 show that our approach remarkably surpasses the other distillation approaches in the few-sample scenario, even though the teacher stream of EKD also learns and distills knowledge from the reduced training data. Specifically, as the amount of training data decreases, the performance of the distillation methods represented by KD, FitNet and CRD drops significantly, while the decline of the proposed EKD is considerably milder, and its advantage becomes more obvious as the number of training samples decreases. When the percentage of training samples is 25%, the accuracy of our approach is nearly 10% higher than that of CRD.
In this paper, we propose an evolutionary knowledge distillation approach and show its superiority by comparing it with state-of-the-art offline and online distillation approaches. Our approach uses an evolutionary teacher to supervise the learning of the student from scratch, which reduces the teacher-student capability gap and thus promotes knowledge transfer. Moreover, by introducing simple guided module pairs between corresponding teacher-student blocks, the efficiency of intermediate knowledge representation is improved. We believe that this evolutionary manner of knowledge transfer and the simple guided mechanism are very promising for the knowledge distillation community. In the future, we will explore the potential of our approach in combination with existing knowledge distillation schemes and in more practical tasks.
Acknowledgement. This work was partially supported by grants from the National Key Research and Development Plan (2020AAA0140001), National Natural Science Foundation of China (61772513), Beijing Natural Science Foundation (19L2040), and the project from Beijing Municipal Science and Technology Commission (Z191100007119002). Shiming Ge is also supported by the Youth Innovation Promotion Association, Chinese Academy of Sciences.
-  (2019) Variational information distillation for knowledge transfer. In , pp. 9163–9171. Cited by: §II-A.
-  (2018) Large scale distributed neural network training through online distillation. In International Conference on Learning Representations, Cited by: §I, §III.
-  (2014) Do deep nets really need to be deep?. In Conference on Neural Information Processing Systems, pp. 2654–2662. Cited by: §II-A.
-  (2017) Umdfaces: an annotated face dataset for training deep networks. In International Joint Conference on Biometrics, pp. 464–473. Cited by: §IV-A, §IV-F.
-  (2012) Stochastic gradient descent tricks. In Neural networks: Tricks of the trade - Second Edition, Vol. 7700, pp. 421–436. Cited by: §III-C.
-  (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §II-A.
-  VGGFace2: A dataset for recognising faces across pose and age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018, pp. 67–74. Cited by: §IV-F, TABLE VI.
-  Online knowledge distillation with diverse peers. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3430–3437. Cited by: §I, §II-A, §II-C, §IV-A, §IV-C, §IV-C, TABLE III.
-  (2017) Learning efficient object detection models with knowledge distillation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 742–751. Cited by: §I.
-  Coupled end-to-end transfer learning with generalized fisher information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4329–4338. Cited by: §II-A.
-  (2017) Pixelwise deep sequence learning for moving object detection. IEEE Transactions on Circuits and Systems for Video Technology 29 (9), pp. 2567–2579. Cited by: §I.
-  (2020) Explaining knowledge distillation by quantifying the knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12925–12935. Cited by: §III.
-  Feature-map-level online adversarial knowledge distillation. International Conference on Machine Learning, pp. 2006–2015. Cited by: §II-C.
-  (2018) Moonshine: distilling with cheap convolutions. In Conference on Neural Information Processing Systems, pp. 2893–2903. Cited by: §II-B.
-  (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: TABLE VI.
-  (2018) Born again neural networks. In International Conference on Machine Learning, pp. 1607–1616. Cited by: §II-B.
-  (2020) Look one and more: distilling hybrid order relational knowledge for cross-resolution image recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10845–10852. Cited by: §IV-F, §IV-F, §IV-F, TABLE VI.
-  (2018) Low-resolution face recognition in the wild via selective knowledge distillation. TIP, pp. 2051–2062. Cited by: §IV-A, §IV-F, TABLE VI.
-  Understanding deep learning techniques for image segmentation. ACM Computing Surveys (CSUR) 52 (4), pp. 1–35. Cited by: §I.
-  (2017) Unconstrained face detection and open-set face recognition challenge. In International Joint Conference on Biometrics, Cited by: §IV-A, §IV-F.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §III, §IV-A.
-  Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3779–3787. Cited by: TABLE I, TABLE II, §IV-A.
-  (2015) Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop on Neural Information Processing Systems, Cited by: §I, §II-A, §III-B, §III-C, TABLE I, TABLE II, Fig. 3, §IV-A, §IV-B, §IV-D, §IV-E, §IV-E, §IV-F, TABLE IV.
-  Channel-wise and spatial feature modulation network for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology 30 (11), pp. 3911–3927. Cited by: §I.
-  (2018) Paraphrasing complex network: network compression via factor transfer. In Conference on Neural Information Processing Systems, pp. 2765–2774. Cited by: TABLE I, TABLE II, §IV-A.
-  (2021) Show, attend and distill: knowledge distillation via attention-based feature matching. Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §IV-A, §IV-B, TABLE IV.
-  (2019) Knowledge distillation via route constrained optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1345–1354. Cited by: §I, §II-B.
-  (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations, Cited by: §II-A.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §IV-A.
-  (2018) Knowledge distillation by on-the-fly native ensemble. In Conference on Neural Information Processing Systems, pp. 7528–7538. Cited by: §I, §II-A, §II-C, §II-C, §IV-A, §IV-C, TABLE III.
-  (2015) Tiny imagenet visual recognition challenge. CS 231N 7, pp. 7. Cited by: §IV-A.
-  (2019) On low-resolution face recognition in the wild: comparisons and new techniques. IEEE Transactions on Information Forensics and Security 14 (8), pp. 2000–2012. Cited by: §IV-F, TABLE VI.
-  (2020) ResKD: residual-guided knowledge distillation. arXiv preprint arXiv:2006.04719. Cited by: §IV-A, §IV-B, TABLE IV.
-  (2021) Deep cross-modal representation learning and distillation for illumination-invariant pedestrian detection. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-A.
-  (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220. Cited by: TABLE VI.
-  (2019) Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2604–2613. Cited by: §II-A.
-  (2019) Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7096–7104. Cited by: §I, §II-B.
-  (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision, pp. 116–131. Cited by: §IV-A.
-  (2018) Recent advances in deep learning: an overview. International Journal of Machine Learning and Computing, pp. 747–750. Cited by: §I.
-  (2020) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5191–5198. Cited by: §I, §II-B.
-  (2019) Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §I, TABLE I, TABLE II, §IV-A, §IV-B, TABLE IV.
-  (2020) Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2336–2345. Cited by: §II-B.
-  (2018) Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision, pp. 268–284. Cited by: TABLE I, TABLE II, §IV-A.
-  (2019) Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016. Cited by: TABLE I, TABLE II, §IV-A.
-  (2017) Regularizing neural networks by penalizing confident output distributions. In International Conference on Learning Representations, Cited by: §II-A.
-  (2018) Model compression via distillation and quantization. In International Conference on Learning Representations, Cited by: §II-A.
-  (2015) FitNets: hints for thin deep nets. In International Conference on Learning Representations, Cited by: §I, §II-B, §III-B, TABLE I, TABLE II, §IV-A, §IV-B, §IV-F, TABLE IV.
-  (2017) Knowledge adaptation: teaching to adapt. arXiv preprint arXiv:1702.02052. Cited by: §II-B.
-  (2016) Wide residual networks. In The British Machine Vision Conference, Cited by: §IV-A.
-  (2019) Meal: multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4886–4893. Cited by: §II-B.
-  (2020) Meal v2: boosting vanilla resnet-50 to 80%+ top-1 accuracy on imagenet without tricks. arXiv preprint arXiv:2009.08453. Cited by: §II-B.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §IV-A.
-  (2018) Collaborative learning for deep neural networks. In Conference on Neural Information Processing Systems, pp. 1837–1846. Cited by: §II-C, §IV-A, §IV-C, TABLE III.
-  (2020) Data augmentation using random image cropping and patching for deep cnns. IEEE Transactions on Circuits and Systems for Video Technology 30 (9), pp. 2917–2931. Cited by: §I.
-  (2019) Attribute-guided coupled gan for cross-resolution face recognition. In 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1–10. Cited by: TABLE VI.
-  (2020) Contrastive representation distillation. In International Conference on Learning Representations, Cited by: §I, §II-A, §II-B, §III-C, TABLE I, TABLE II, Fig. 3, §IV-A, §IV-B, §IV-B, §IV-B, §IV-D, TABLE IV.
-  (2019) Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365–1374. Cited by: TABLE I, TABLE II, §IV-A, §IV-B, §IV-B.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (11). Cited by: §IV-D.
-  (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274. Cited by: TABLE VI.
-  (2016) Studying very low resolution recognition using deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4792–4800. Cited by: §IV-F, TABLE VI.
-  (2018) Knowledge distillation in generations: more tolerant teachers educate better students. arXiv preprint arXiv:1805.05551. Cited by: §II-B, §III-C.
-  (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2859–2868. Cited by: §I, §I, §II-B, §II-C, §II-C, §III-C, §IV-A, §IV-C, TABLE III.
-  (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141. Cited by: §I, §II-B.
-  (2017) Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285–1294. Cited by: §II-B.
-  (2020) Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13876–13885. Cited by: §II-B.
-  Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, Cited by: TABLE I, TABLE II, §IV-A, §IV-B, TABLE IV.
-  Deep transfer hashing for image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 31 (2), pp. 742–753. Cited by: §I.
-  (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3712–3721. Cited by: §II-C, §II-C, §IV-A, §IV-C, §IV-C, TABLE III.
-  (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §IV-A.
-  (2018) Deep mutual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4320–4328. Cited by: §II-A, §II-C, §III, §IV-A, §IV-C, TABLE III.
-  (2019) Scale-aware crowd counting via depth-embedded convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology 30 (10), pp. 3651–3662. Cited by: §I.