Learning Student-Friendly Teacher Networks for Knowledge Distillation

02/12/2021
by   Dae Young Park, et al.

We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student. Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students and, consequently, more appropriate for knowledge transfer. In other words, even at the time of optimizing a teacher model, the proposed algorithm learns the student branches jointly to obtain student-friendly representations. Since the main goal of our approach lies in training teacher models and the subsequent knowledge distillation procedure is straightforward, most of the existing knowledge distillation algorithms can adopt this technique to improve the performance of the student models in terms of accuracy and convergence speed. The proposed algorithm demonstrates outstanding accuracy in several well-known knowledge distillation techniques with various combinations of teacher and student architectures.



1 Introduction

Knowledge distillation (Hinton et al., 2015) is a well-known technique to learn compact deep neural network models with competitive accuracy, where a smaller network (student) is trained to simulate the representations of a larger one (teacher) with higher accuracy. The popularity of knowledge distillation is mainly due to its simplicity and generality; it is straightforward to learn a student model based on a teacher, and there is no restriction on the network architectures of the two models. The main focus of most approaches is how to transfer dark knowledge to student models effectively, given predefined or pretrained teacher networks.

Although knowledge distillation is a promising and convenient method, it sometimes fails to achieve satisfactory accuracy. This is partly because the model capacity of the student is too limited compared to that of the teacher and the knowledge distillation algorithms are suboptimal (Kang et al., 2020; Mirzadeh et al., 2020). Beyond these reasons, we claim that the consistency between teacher and student features is critical to knowledge transfer and that inappropriate representation learning in a teacher often leads to suboptimal knowledge distillation.

We are interested in making a teacher network hold better transferable knowledge by providing the teacher with a snapshot of the student model at the time of its training. We employ the typical structure of convolutional neural networks based on multiple blocks and make the representations of each block in the teacher easy to transfer to the student. The proposed approach aims to train teacher models that are friendly to students in order to facilitate knowledge distillation; we call a teacher model trained by this strategy a student-friendly teacher network (SFTN). SFTN can be easily deployed in arbitrary distillation algorithms due to its generality in training models and transferring knowledge.

SFTN is somewhat related to collaborative learning methods (Zhang et al., 2018b; Guo et al., 2020; Wu and Gong, 2020), but it performs knowledge transfer from teacher to student in only one direction. More importantly, the models given by collaborative learning are prone to be correlated and may not be appropriate for fully exploiting the knowledge in teacher models. On the other hand, SFTN adopts a two-stage learning procedure to alleviate this limitation: student-aware training of the teacher network followed by knowledge distillation from the teacher to the student. Figure 1 demonstrates the main difference between the proposed algorithm and the standard knowledge distillation methods.

Standard knowledge distillation
Student-friendly teacher network
Figure 1: Comparison between the standard knowledge distillation and our approach. (a) The standard knowledge distillation trains teachers alone and then distills knowledge to students. (b) The proposed student-friendly teacher network trains teachers along with student branches, and then distills knowledge that is easier to transfer to students.

The following is the list of our main contributions:

  • We adopt a student-aware teacher learning procedure before knowledge distillation, which enables teacher models to transfer their representations to students more effectively.

  • The proposed approach is applicable to diverse architectures of teachers and students, and it can be incorporated into various knowledge distillation algorithms.

  • We illustrate that the integration of SFTN into various baseline algorithms and models improves accuracy consistently by substantial margins.

The rest of the paper is organized as follows. We first discuss the existing knowledge distillation techniques in Section 2. Section 3 describes the details of the proposed SFTN including the knowledge distillation algorithm. The experimental results with in-depth analysis are presented in Section 4, and we make the conclusion in Section 5.

2 Related Work

Although deep learning has shown successful outcomes in various areas, it is still difficult to apply deep neural networks to real-world problems due to the constraints of computation and memory resources. There have been many attempts to reduce the computational cost of deep learning models, and knowledge distillation is one of them. Various computer vision (Chen et al., 2017; Luo et al., 2016; Wu et al., 2019; Wang et al., 2020) and natural language processing (Jiao et al., 2019; Mun et al., 2018; Sanh et al., 2019; Arora et al., 2019) tasks often employ knowledge distillation to obtain efficient models. Recently, some cross-modal tasks (Thoker and Gall, 2019; Zhao et al., 2018; Zhou et al., 2020) adopt it to transfer knowledge across domains. This section summarizes the research efforts to improve the performance of models via knowledge distillation.

2.1 What to distill

Since Hinton et al. (2015) introduced the basic concept of knowledge distillation, where the knowledge is given by the temperature-scaled outputs of the softmax function in teacher models, various kinds of information have been employed as sources of knowledge for distillation from teachers to students. FitNets (Romero et al., 2015) distills intermediate features of a teacher network, where the student network transforms its intermediate features using guided layers and then minimizes the difference between the guided layers and the intermediate features of the teacher network. The position of distillation is shifted to the layers before the ReLU operations in (Heo et al., 2019a), which also proposes a novel activation function and a partial loss function for effective knowledge transfer. Zagoruyko and Komodakis (2017) argue the importance of attention and propose an attention transfer (AT) method from teachers to students, while Kim et al. (2018) compute the factor information of the teacher representations using an autoencoder, which is decoded by students for knowledge transfer. Relational knowledge distillation (RKD) (Park et al., 2019) introduces a technique to transfer relational information such as distances and angles between features.

CRD (Tian et al., 2020) maximizes mutual information between the student and the teacher via contrastive learning. There also exist methods that perform knowledge distillation without separate teacher models; for example, ONE (Lan et al., 2018) distills knowledge from an ensemble of multiple students, and BYOT (Zhang et al., 2019) transfers knowledge from deeper layers to shallower ones. Besides, SSKD (Xu et al., 2020) distills self-supervised features of teachers to students for transferring richer knowledge.

2.2 How to distill

Several recent approaches focus on the strategy of knowledge distillation. Born again network (BAN) (Furlanello et al., 2018) presents the effectiveness of sequential knowledge distillation via networks with an identical architecture. A curriculum learning method (Jin et al., 2019) employs the optimization trajectory of a teacher model to train students. Collaborative learning approaches (Zhang et al., 2018b; Guo et al., 2020; Wu and Gong, 2020) attempt to learn multiple models with distillation jointly, but their concept is not well-suited for the asymmetric teacher-student relationship, which may lead to suboptimal convergence of student models.

The model capacity gap between a teacher and a student is addressed in (Kang et al., 2020; Cho and Hariharan, 2019; Mirzadeh et al., 2020). TAKD (Mirzadeh et al., 2020) suggests a teacher assistant network to reduce the model capacity gap, where a teacher model transfers knowledge to the student via a teaching assistant with an intermediate size. An early stopping technique for training teacher networks is proposed to obtain better transferable representations (Cho and Hariharan, 2019), and a neural architecture search is employed to identify a student model with the optimal size (Kang et al., 2020). Our work proposes a novel student-friendly learning technique for teacher networks to facilitate knowledge distillation.

3 Student-Friendly Knowledge Distillation

Student-aware training of a teacher network
Knowledge distillation
Figure 2: Overview of the student-friendly teacher network (SFTN). In this figure, F^T_i, F^S_i, T_i, S_i, TF_i, p^SB_i, p^T, and p^S denote the i-th teacher network feature, the i-th student network feature, the i-th block of the teacher network, the i-th block of the student network, the i-th teacher feature transform layer, the softmax output of the i-th student branch, the softmax output of the teacher network, and the softmax output of the student network, respectively. The loss for the teacher network, L_T, is given by Eqn. (4), and the Kullback-Leibler loss L_KL and the cross-entropy loss L_CE^SB are defined in Eqn. (5) and Eqn. (6), respectively. (a) While training a teacher, SFTN trains the teacher blocks T_i and the student branches jointly for better knowledge transfer to student networks. (b) In the distillation stage, the features of the teacher network, F^T_i and p^T, are distilled to student networks with existing knowledge distillation algorithms straightforwardly.

This section describes the details of the student-friendly teacher network (SFTN), which transfers the features of teacher models to student networks more effectively than the standard distillation. Figure 2 illustrates the main idea of our method.

3.1 Overview

The conventional knowledge distillation approaches attempt to find ways of teaching student networks given the architecture of teacher networks. The teacher network is trained only with the loss with respect to the ground-truth, and the optimization of this objective is not necessarily beneficial for knowledge distillation to students. In contrast, the SFTN framework aims to improve the effectiveness of knowledge distillation from the teacher to the student models. The procedure of SFTN is composed of the following steps.

Modularizing teacher and student networks

We modularize the teacher and student networks into multiple blocks based on the depth of layers and the feature map sizes. This is because knowledge distillation is often performed at every 3 or 4 blocks for accurate extraction and transfer of knowledge in teacher models. Figure 2 presents the case where both networks are modularized into 3 blocks, denoted by T_i and S_i for the teacher and the student, respectively.

Adding student branches

SFTN adds student branches to a teacher model for joint training of both parts. Each student branch is composed of a teacher feature transform layer TF_i and the subsequent student network blocks. Note that TF_i is similar to a guided layer in FitNets (Romero et al., 2015) and transforms the channel dimensionality of F^T_i into that of the corresponding student feature. Depending on the configuration of the teacher and student networks, the transformation may need to increase or decrease the spatial size of the feature maps. We employ 3x3 convolutions to reduce the size of F^T_i, while 4x4 transposed convolutions are used to increase it; 1x1 convolutions are used when the size of F^T_i does not need to change. The features transformed for a student branch are forwarded separately to compute the logit of the branch. For example, as shown in Figure 2(a), F^T_1 in the teacher stream is transformed by TF_1, which initiates a student branch to derive p^SB_1, while another student branch starts from the features transformed from F^T_2. Note that F^T_3 has no trailing teacher network block in the figure and has no associated student branch.
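To make this concrete, a transform layer of this kind could be written as the short PyTorch sketch below. It is a minimal illustration under the kernel-size choices stated above; the function name, the scale argument, and the BatchNorm/ReLU placement are assumptions for exposition rather than the authors' released code.

```python
import torch.nn as nn

def make_transform_layer(in_channels, out_channels, scale=1.0):
    """Hypothetical teacher feature transform TF_i: matches the channel count
    (and, if necessary, the spatial size) of a teacher feature map to those
    expected by the attached student branch."""
    if scale == 0.5:    # teacher feature map is larger -> shrink by 2x
        conv = nn.Conv2d(in_channels, out_channels,
                         kernel_size=3, stride=2, padding=1, bias=False)
    elif scale == 2.0:  # teacher feature map is smaller -> enlarge by 2x
        conv = nn.ConvTranspose2d(in_channels, out_channels,
                                  kernel_size=4, stride=2, padding=1, bias=False)
    else:               # same spatial size -> only adjust the channels
        conv = nn.Conv2d(in_channels, out_channels,
                         kernel_size=1, stride=1, padding=0, bias=False)
    return nn.Sequential(conv, nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
```

The 1x1 convolution followed by BatchNorm and ReLU corresponds to the entry layers of the student branches listed in Table 12.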

Training SFTN

The teacher network is trained along with multiple student branches corresponding to individual blocks in the teacher, where the loss reflects the differences between the representations of the teacher and those of the student branches. Our loss function is composed of three terms: the cross-entropy loss L_T of the teacher network, the Kullback-Leibler loss L_KL of the student branches, and the cross-entropy loss L_CE^SB of the student branches. The main loss term, L_T, minimizes the error between p^T and the ground-truth, while L_KL enforces p^T and p^SB_i to be similar to each other and L_CE^SB makes p^SB_i fit the ground-truth.

Distillation using SFTN

As shown in Figure 2(b), a conventional knowledge distillation technique is employed to transfer F^T_i and p^T, which are simulated by the corresponding student features and p^S, respectively. The actual knowledge distillation step is straightforward because the representations F^T_i and p^T have already been learned to be student-friendly at the time of training SFTN. We expect the performance of a student network distilled from the SFTN to be better than the one obtained from the conventional teacher network.
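For this second stage, any off-the-shelf distillation objective can be plugged in. As a reference, here is a hedged sketch of the classical soft-label KD loss of Hinton et al. (2015), which most of the compared feature-distillation methods combine with their own losses; the temperature and the weight alpha below are placeholder defaults, not the settings used in the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.9):
    """Classical KD objective: weighted sum of the soft-label KL term
    (scaled by tau^2, as is conventional) and the hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / tau, dim=1)
    log_soft_student = F.log_softmax(student_logits / tau, dim=1)
    kl = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * tau ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```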

3.2 Network Architecture

SFTN consists of a teacher network and multiple student branches. The teacher and student networks are divided into N blocks, where the set of blocks in the teacher is given by {T_1, ..., T_N} and the blocks in the student are denoted by {S_1, ..., S_N}. Note that the last block in the teacher network does not have an associated student branch.

Let x be an input of the network. Then, the softmax output of the teacher network, p^T, is given by

p^T = softmax(z^T / τ),    (1)

where z^T denotes the logit of the teacher network and τ is the temperature parameter that determines the smoothness of the softmax function. On the other hand, the output of the softmax function in the i-th student branch, p^SB_i, is given by

p^SB_i = softmax(z^SB_i / τ),    (2)

where z^SB_i denotes the logit of the i-th student branch.

3.3 Loss Functions

The teacher network in the conventional knowledge distillation framework is trained only with the cross-entropy loss L_T. However, SFTN has the additional loss terms L_KL and L_CE^SB, as described in Section 3.1. The total loss of SFTN, denoted by L_SFTN, is given by

L_SFTN = λ_T · L_T + λ_KL · L_KL + λ_CE · L_CE^SB,    (3)

where λ_T, λ_KL, and λ_CE are the weights of the individual loss terms.

Each loss term is defined as follows. First, L_T is given by the cross-entropy between the prediction of the teacher network and the ground-truth label y as

L_T = CE(y, softmax(z^T)).    (4)

The knowledge distillation loss, L_KL, employs the KL divergence between p^T and p^SB_i, where the student branches attached to all teacher blocks except the last one are considered together as

L_KL = (1 / (N - 1)) Σ_{i=1}^{N-1} KL(p^T || p^SB_i).    (5)

The cross-entropy loss of the student branches, L_CE^SB, is obtained by averaging the cross-entropy losses over all the student branches, which is given by

L_CE^SB = (1 / (N - 1)) Σ_{i=1}^{N-1} CE(y, softmax(z^SB_i)).    (6)
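For illustration, Eqns. (3)-(6) can be put together as in the PyTorch sketch below. The default weights and temperature are placeholders, the hard-label terms use the untempered logits (an assumption consistent with common practice), and the function is a simplified rendering of the objective rather than the authors' implementation.

```python
import torch.nn.functional as F

def sftn_loss(teacher_logits, branch_logits, labels,
              tau=4.0, lambda_t=1.0, lambda_kl=1.0, lambda_ce=1.0):
    """Total SFTN objective of Eqn. (3). `branch_logits` is a list with one
    logit tensor per student branch (N-1 branches in total)."""
    # Eqn. (4): cross-entropy of the teacher prediction against the labels
    loss_t = F.cross_entropy(teacher_logits, labels)

    # Eqn. (5): KL divergence between the tempered teacher output p^T and
    # each tempered student-branch output p^SB_i, averaged over branches
    p_t = F.softmax(teacher_logits / tau, dim=1)
    loss_kl = sum(F.kl_div(F.log_softmax(z_sb / tau, dim=1), p_t,
                           reduction="batchmean")
                  for z_sb in branch_logits) / len(branch_logits)

    # Eqn. (6): average cross-entropy of the student branches
    loss_ce_sb = sum(F.cross_entropy(z_sb, labels)
                     for z_sb in branch_logits) / len(branch_logits)

    return lambda_t * loss_t + lambda_kl * loss_kl + lambda_ce * loss_ce_sb
```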

3.4 Discussion

SFTN learns teacher models that transfer knowledge to students more effectively. One drawback of the proposed method is the increased training cost due to the two-stage training framework. However, the main goal of knowledge distillation is to maximize the benefit in student networks, and the additional training cost may not be critical in many real applications. In the next section, we show that SFTN is effective in learning high-performance student models when combined with various knowledge distillation techniques.

Models (Teacher/Student) WRN40-2/WRN16-2 WRN40-2/WRN40-1 resnet32x4/resnet8x4 resnet32x4/resnet8x2
Teacher training method Standard SFTN Δ Standard SFTN Δ Standard SFTN Δ Standard SFTN Δ
Teacher Accuracy 76.30 78.20 76.30 77.62 79.25 79.41 79.25 77.89
Student accuracy w/o KD 73.41 72.16 72.38 68.19
KD 75.46 76.25 0.79 73.73 75.09 1.36 73.39 76.09 2.70 67.43 69.17 1.74
FitNets 75.72 76.73 1.01 74.14 75.54 1.40 75.34 76.89 1.55 69.80 71.07 1.27
AT 75.85 76.82 0.97 74.56 75.86 1.30 74.98 76.91 1.93 68.79 70.90 2.11
SP 75.43 76.77 1.34 74.51 75.31 0.80 74.06 76.37 2.31 68.39 70.03 1.64
VID 75.63 76.79 1.16 74.21 75.76 1.55 74.86 77.00 2.14 69.53 71.08 1.55
RKD 75.48 76.49 1.01 73.86 75.11 1.25 74.12 76.62 2.50 68.54 70.91 2.36
PKT 75.71 76.57 0.86 74.43 75.49 1.06 74.70 76.57 1.87 69.29 70.75 1.45
AB 70.12 70.76 0.64 74.38 75.51 1.13 74.73 76.51 1.78 69.76 71.05 1.29
FT 75.6 76.51 0.91 74.49 75.11 0.62 74.89 77.02 2.13 69.70 71.11 1.40
CRD 75.91 77.23 1.32 74.93 76.09 1.16 75.54 76.95 1.41 70.34 71.34 1.00
SSKD 75.96 76.80 0.84 75.72 76.03 0.31 75.95 76.85 0.90 69.34 70.29 0.96
OH 76.00 76.39 0.39 74.79 75.62 0.83 75.04 76.65 1.61 68.10 69.69 1.59
Best 76.00 77.23 1.23 75.72 76.09 0.37 75.95 77.02 1.07 70.34 71.34 1.00
Table 1: Comparisons between SFTN and the standard teacher models on the CIFAR-100 dataset when the architectures of the teacher-student pairs are homogeneous. Δ denotes the accuracy gain of the student distilled from SFTN over the one distilled from the standard teacher. In all the tested algorithms, the student models distilled from the teacher models given by SFTN outperform the ones trained from the standard teacher models. All the reported accuracies in this table are averages over 3 independent runs.

Models (Teacher/Student) resnet32x4/ShuffleV1 resnet32x4/ShuffleV2 ResNet50/VGG8 WRN40-2/ShuffleV2
Teacher training method Standard SFTN Δ Standard SFTN Δ Standard SFTN Δ Standard SFTN Δ
Teacher Accuracy 79.25 80.03 79.25 79.58 78.70 82.52 76.30 78.21
Student accuracy w/o KD 71.95 73.21 71.12 73.21
KD 74.26 77.93 3.67 75.25 78.07 2.82 73.82 74.92 1.10 76.68 78.06 1.38
FitNets 75.95 78.75 2.80 77.00 79.68 2.68 73.22 74.80 1.58 77.31 79.21 1.90
AT 76.12 78.63 2.51 76.57 78.79 2.22 73.56 74.05 0.49 77.41 78.29 0.88
SP 75.80 78.36 2.56 76.11 78.38 2.27 74.02 75.37 1.35 76.93 78.12 1.19
VID 75.16 78.03 2.87 75.70 78.49 2.79 73.59 74.76 1.17 77.27 78.78 1.51
RKD 74.84 77.72 2.88 75.48 77.77 2.29 73.54 74.70 1.16 76.69 78.11 1.42
PKT 75.05 77.46 2.41 75.79 78.28 2.49 73.79 75.17 1.38 76.86 78.28 1.42
AB 75.95 78.53 2.58 76.25 78.68 2.43 73.72 74.77 1.05 77.28 78.77 1.49
FT 75.58 77.84 2.26 76.42 78.37 1.95 73.34 74.77 1.43 76.80 77.65 0.85
CRD 75.60 78.20 2.60 76.35 78.43 2.08 74.52 75.41 0.89 77.52 78.81 1.29
SSKD 78.05 79.10 1.05 78.66 79.65 0.99 76.03 76.95 0.92 77.81 78.34 0.53
OH 77.51 79.56 2.05 78.08 79.98 1.90 74.55 75.95 1.40 77.82 79.14 1.32
Best 78.05 79.56 1.51 78.66 79.98 1.32 76.03 76.95 0.92 77.82 79.21 1.39
Table 2: Comparisons between SFTN and the standard teacher models on the CIFAR-100 dataset when the architectures of the teacher-student pairs are heterogeneous. Δ denotes the accuracy gain of the student distilled from SFTN over the one distilled from the standard teacher. In all the tested algorithms, the student models distilled from the teacher models given by SFTN outperform the ones trained from the standard teacher models. All the reported accuracies in this table are averages over 3 independent runs.

4 Experiments

We evaluate the performance of SFTN in comparison to existing methods and analyze its characteristics in various aspects. We first describe our experiment setting in Section 4.1. Then, we compare SFTN and the standard teacher networks with respect to classification accuracy under various knowledge distillation algorithms in Section 4.2. The results of ablative experiments for SFTN and transfer learning are discussed in the rest of this section.

4.1 Experiment Setting

We perform evaluation on multiple well-known datasets including ImageNet (Russakovsky et al., 2015), CIFAR-100 (Krizhevsky, 2009), and STL10 (Coates et al., 2011). For the experiments, we select several different backbone networks such as ResNet (He et al., 2016), WideResNet (Zagoruyko and Komodakis, 2016), VGG (Simonyan and Zisserman, 2015), ShuffleNetV1 (Zhang et al., 2018a), and ShuffleNetV2 (Tan et al., 2019).

For comprehensive evaluation, we adopt various knowledge distillation techniques, which include KD (Hinton et al., 2015), FitNets (Romero et al., 2015), AT (Zagoruyko and Komodakis, 2017), SP (Tung and Mori, 2019), VID (Ahn et al., 2019), RKD (Park et al., 2019), PKT (Passalis and Tefas, 2018), AB (Heo et al., 2019b), FT (Kim et al., 2018), CRD (Tian et al., 2020), SSKD (Xu et al., 2020), and OH (Heo et al., 2019a). Among these methods, the feature distillation methods (Romero et al., 2015; Zagoruyko and Komodakis, 2017; Tung and Mori, 2019; Ahn et al., 2019; Park et al., 2019; Passalis and Tefas, 2018; Heo et al., 2019b; Kim et al., 2018; Heo et al., 2019a) conduct joint distillation with conventional KD (Hinton et al., 2015) during student training, which results in higher accuracy in practice than the feature distillation only. We also include comparisons with collaborative learning methods such as DML (Zhang et al., 2018b) and KDCL (Guo et al., 2020), and a curriculum learning technique, RCO (Jin et al., 2019). We have reproduced the results from the existing methods using the implementations provided by the authors of the papers.

4.2 Main Results

To show the effectiveness of SFTN, we incorporate SFTN into various existing knowledge distillation algorithms and evaluate their accuracy. We present implementation details and experimental results on the CIFAR-100 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) datasets.

4.2.1 CIFAR-100

CIFAR-100 (Krizhevsky, 2009) consists of 50K training images and 10K testing images in 100 classes. We select 12 state-of-the-art distillation methods to compare the accuracy of SFTN with that of the standard teacher network. To show the generality of the proposed approach, 8 pairs of teacher and student models are tested in our experiments. The experimental setup for CIFAR-100 is identical to the one used in CRD (https://github.com/HobbitLong/RepDistiller); most experiments employ the SGD optimizer with learning rate 0.05, weight decay 0.0005, and momentum 0.9, while the learning rate is set to 0.01 in the ShuffleNet experiments. The hyper-parameters of the loss function are the temperature τ and the loss weights λ_T, λ_KL, and λ_CE defined in Section 3.3.
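As a concrete example of this setup, the optimizer and schedule could be constructed as below (PyTorch); the helper function is illustrative, and the 240-epoch schedule it encodes is the one detailed in Section 7.1.

```python
import torch

def build_cifar_optimizer(model, is_shufflenet=False):
    """SGD configuration for the CIFAR-100 experiments: learning rate 0.05
    (0.01 for ShuffleNet students), weight decay 0.0005, momentum 0.9."""
    lr = 0.01 if is_shufflenet else 0.05
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=5e-4)
    # decay the learning rate by 10x at epochs 150, 180, and 210 (Section 7.1)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 180, 210], gamma=0.1)
    return optimizer, scheduler
```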

Tables 1 and 2 present the full results on the CIFAR-100 dataset. Table 1 reports the distillation performance of all the compared algorithms when the architecture styles of the teacher and student pairs are the same, while Table 2 shows the distillation performance of teacher-student pairs with different architecture styles. Both tables clearly show that SFTN is consistently better than the standard teacher network in all experiments. The average gain of SFTN over the standard teacher is 1.58% points, and the average gain for the best student of each pair is 1.10% points. We note that the outstanding performance of SFTN is not driven only by the higher accuracy of teacher models achieved by our student-aware learning technique. As observed in Tables 1 and 2, the proposed approach often demonstrates substantial improvement over the standard distillation methods despite similar or lower teacher accuracies. Refer to Section 4.4 for further discussion on the relation between teacher and student accuracy. Figure 3 illustrates the accuracies of the best student models of the standard teacher and SFTN for each teacher-student architecture pair. Despite the small capacity of the students, the best student models of SFTN sometimes outperform the standard teachers, while only one best student of the standard teacher shows higher accuracy than its teacher.

Figure 3: Accuracy comparison of the best students from SFTN and the standard teacher on CIFAR-100. Four of the best student models of SFTN outperform the standard teachers, while only one best student of the standard teacher shows higher accuracy than its teacher.

4.2.2 ImageNet

Models
(Teacher/Student)
ResNet50/ResNet34
Top-1 Top-5
Teacher training Stan. SFTN Δ Stan. SFTN Δ
Teacher Acc. 76.45 77.43 93.15 93.75
Stu. Acc. w/o KD 73.79 91.74
KD 73.55 74.14 0.59 91.81 92.21 0.40
FitNets 74.56 75.01 0.45 92.31 92.51 0.20
SP 74.95 75.53 0.58 92.54 92.69 0.15
CRD 75.01 75.39 0.38 92.56 92.67 0.11
OH 74.56 75.01 0.45 92.36 92.56 0.20
Best 75.01 75.53 0.52 92.56 92.69 0.13
Table 3: Top-1 and Top-5 validation accuracy on ImageNet. Δ denotes the gain of SFTN over the standard teacher (Stan.).

ImageNet (Russakovsky et al., 2015) consists of 1.2M training images and 50K validation images for 1K classes. We adopt the standard PyTorch ImageNet training setup (https://github.com/pytorch/examples/tree/master/imagenet) for this experiment. The optimization is performed by SGD with learning rate 0.1, weight decay 0.0001, and momentum 0.9. The loss function uses the weights λ_T, λ_KL, and λ_CE of the individual loss terms defined in Eqn. (3). We conduct the ImageNet experiments with 5 different knowledge distillation methods, where teacher models based on ResNet50 distill knowledge to ResNet34 student networks.

As presented in Table 3, SFTN consistently outperforms the standard teacher network in all settings. Moreover, the best student of SFTN achieves approximately 0.5% points higher top-1 accuracy than its counterpart distilled from the standard teacher. This result implies that the proposed SFTN has great potential on large datasets as well.

4.3 Comparison with Collaborative and Curriculum Learning Methods

Contrary to traditional knowledge distillation methods based on static pretrained teachers, collaborative learning and curriculum learning employ dynamic teacher networks, trained jointly with students or guided by the optimization history of teachers, respectively. Table 4 shows that SFTN outperforms the collaborative learning approaches such as DML (Zhang et al., 2018b) and KDCL (Guo et al., 2020); the heterogeneous architectures may not be effective for mutual learning. On the other hand, the accuracy of SFTN is consistently higher than that of the curriculum learning method, RCO (Jin et al., 2019), under the same or even harsher training conditions in terms of the number of epochs; the identification of the optimal checkpoints may be challenging in trajectory-based learning. Note that SFTN improves its accuracy substantially with more training, as shown in the results for SFTN-4.

Teacher WRN40-2 WRN40-2 resnet32x4 resnet32x4 resnet32x4 ResNet50
Student WRN16-2 WRN40-1 resnet8x4 ShuffleV1 ShuffleV2 VGG8
Standard teacher Acc 76.3 76.3 79.25 79.25 79.25 78.7
Student Acc. w/o KD 73.41 72.16 72.38 71.95 73.21 71.12
Student Δ Student Δ Student Δ Student Δ Student Δ Student Δ
Standard 75.46 2.05 73.73 1.57 73.39 1.01 74.26 2.31 75.25 2.04 73.82 2.7
DML 75.30 1.89 74.08 1.92 74.34 1.96 73.37 1.42 73.80 0.59 73.01 1.89
KDCL 75.45 2.04 74.65 2.49 75.21 2.83 73.98 2.03 74.30 1.09 73.48 2.36
RCO-one-stage-EEI 75.36 1.95 74.29 2.13 74.06 1.68 76.62 4.67 77.40 4.19 74.30 3.18
SFTN 76.25 2.84 75.09 2.93 76.09 3.71 77.93 5.98 78.07 4.86 74.92 3.80
RCO-EEI-4 75.69 2.28 74.87 2.71 73.73 1.35 76.97 5.02 76.89 3.68 74.24 3.12
SFTN-4 76.96 3.55 76.31 4.15 76.67 4.29 79.11 7.16 78.95 5.74 75.52 4.40
Table 4: Comparison with collaborative learning and curriculum learning approaches on CIFAR-100. Δ denotes the gain over the student trained without KD. We employ KDCL-Naïve for the ensemble logits of KDCL. One-stage EEI (equal epoch interval) and EEI-4 are adopted for training RCO. Both RCO-EEI-4 and SFTN-4 are trained for a larger number of epochs than their one-stage counterparts.

4.4 Effect of Hyperparameters

SFTN computes the KL-divergence loss to minimize the difference between the softmax outputs of the teacher and the student branches, which involves two hyperparameters: the temperature of the softmax function, τ, and the weight of the KL-divergence loss term, λ_KL. This subsection discusses the impact and trade-offs of the two hyperparameters. In particular, we present our observations that the student-aware learning is indeed helpful for improving the accuracy of student models, while maximizing the performance of teacher models may be suboptimal for knowledge distillation.

Accuracy of SFTN Student accuracy by KD
Teacher resnet32x4 resnet32x4 WRN40-2 WRN40-2 AVG resnet32x4 resnet32x4 WRN40-2 WRN40-2 AVG
Student ShuffleV1 ShuffleV2 WRN16-2 WRN40-1 ShuffleV1 ShuffleV2 WRN16-2 WRN40-1
τ=1 81.19 80.26 78.23 78.14 78.85 76.05 77.18 76.30 74.75 75.58
τ=5 81.23 81.56 79.22 78.31 79.54 75.36 75.59 76.31 73.64 75.10
τ=10 81.27 81.98 78.81 78.38 79.58 74.47 75.93 75.85 73.62 74.76
τ=15 81.89 81.74 79.27 78.63 79.74 74.78 75.65 75.79 73.49 74.81
τ=20 81.60 81.70 78.84 78.45 79.59 74.62 75.88 75.82 74.03 74.95
Table 5: Effects of varying τ in the knowledge distillation via SFTN. The student accuracy is fairly stable over a wide range of the parameter. Note that the accuracies of the SFTN teachers and the students are rather inversely correlated, which implies that maximizing the accuracy of teacher models is not necessarily ideal for knowledge distillation.
Accuracy of SFTN Student accuracy by KD
Teacher resnet32x4 resnet32x4 WRN40-2 WRN40-2 AVG resnet32x4 resnet32x4 WRN40-2 WRN40-2 AVG
Student ShuffleV1 ShuffleV2 WRN16-2 WRN40-1 ShuffleV1 ShuffleV2 WRN16-2 WRN40-1
λ_KL=1 81.19 80.26 78.23 78.14 79.46 76.05 77.18 76.30 74.75 76.07
λ_KL=3 78.70 79.80 77.83 77.57 78.48 77.36 78.56 76.20 74.71 76.71
λ_KL=6 78.29 78.29 77.28 76.05 77.48 77.33 77.70 76.02 74.67 76.43
λ_KL=10 73.02 75.01 75.03 73.51 74.14 75.57 76.62 74.19 73.08 74.87
Standard 79.25 79.25 76.30 76.30 77.78 74.31 75.25 75.28 73.56 74.60
Table 6: Effects of varying λ_KL in the knowledge distillation via SFTN. The accuracies of the SFTN teachers and the students are not correlated, while the accuracy gap between the two models drops as λ_KL increases.
Models (Teacher/Student) resnet32x4/ShuffleV2
CIFAR-100 → STL10 CIFAR-100 → TinyImageNet
Teacher training method Standard SFTN Δ Standard SFTN Δ
Teacher accuracy 69.81 76.84 31.25 40.16
Student accuracy w/o KD 70.18 33.81
KD 67.49 73.81 6.32 30.45 37.81 7.36
SP 69.56 75.01 5.45 31.16 38.28 7.12
CRD 71.70 75.80 4.10 35.50 40.87 5.37
SSKD 74.43 77.45 3.02 38.35 42.41 4.06
OH 72.09 76.76 4.67 33.52 39.95 6.43
AVG 71.05 75.77 4.71 33.80 39.86 6.07
Table 7: The accuracy of student models on STL10 and TinyImageNet obtained by transferring the representations learned on CIFAR-100. Δ denotes the gain of SFTN over the standard teacher.
Temperature of softmax function

The temperature parameter, denoted by τ, controls the softness of p^T and p^SB_i; as τ gets higher, the output of the softmax function becomes smoother. Despite the fluctuation in teacher accuracy, the student models given by KD via SFTN maintain fairly consistent results. Table 5 also shows that the performance of the SFTN teachers and the student models is rather inversely correlated. In other words, a loosely optimized teacher model turns out to be more effective for knowledge distillation according to this ablation study.

Weight for KL-divergence loss

λ_KL is the weight of the term that makes p^SB_i similar to p^T, and consequently facilitates knowledge distillation. However, it affects the accuracy of the teacher network negatively. Table 6 shows that the average accuracy gap between the SFTN teachers and the student models drops gradually as λ_KL increases. One interesting observation is the student accuracy via SFTN with λ_KL = 10 compared to its counterpart via the standard teacher; even though the standard teacher network is more accurate than this SFTN by a large margin, its corresponding student accuracy is lower than that of SFTN.

4.5 Transferability

The goal of transfer learning is to obtain versatile representations that adapt well to unseen datasets. To investigate the transferability of the student models distilled from SFTN, we perform experiments that transfer the student features learned on CIFAR-100 to STL10 (Coates et al., 2011) and TinyImageNet (http://tiny-imagenet.herokuapp.com/). The representations of the examples are obtained from the last student block and frozen during transfer learning, and then we fit the features to the target datasets using linear classifiers attached to the last student block.
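In other words, this is the standard linear-evaluation protocol; a minimal sketch of the frozen-backbone setup is shown below, where the backbone and classifier names are illustrative placeholders.

```python
import torch.nn as nn

def build_linear_probe(student_backbone, feature_dim, num_target_classes):
    """Freeze the distilled student backbone and attach a trainable linear
    classifier on top of its last-block features for transfer evaluation."""
    for param in student_backbone.parameters():
        param.requires_grad = False   # keep the CIFAR-100 features fixed
    student_backbone.eval()
    classifier = nn.Linear(feature_dim, num_target_classes)
    return classifier
```

Only the linear classifier is then trained on STL10 or TinyImageNet, so the reported gaps reflect the quality of the frozen student features.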

Table 7 presents transfer learning results for 5 different knowledge distillation algorithms using resnet32x4 and ShuffleV2 as the teacher and the student, respectively. Our experiments show that the transfer accuracy of the student models derived from SFTN is consistently better than that of the students associated with the standard teacher. The average student accuracy via SFTN outperforms that via the standard teacher by 4.71% points on STL10 and 6.07% points on TinyImageNet.

4.6 Similarity

Similarity between the student and teacher networks is an important measure for knowledge distillation, considering that the student network essentially tries to reproduce the outputs of the teacher network. We employ KL-divergence and CKA (Kornblith et al., 2019) as metrics of similarity between the student and teacher networks, where lower KL-divergence and higher CKA imply higher similarity.
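Both measures can be computed from held-out outputs: the KL-divergence from the class probabilities, and linear CKA (Kornblith et al., 2019) from feature matrices of shape [num_examples, dim]. The sketch below is a straightforward rendering of these definitions and is not tied to the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def output_kl(teacher_logits, student_logits):
    """Average KL(p_teacher || p_student) over a batch of test examples."""
    p_teacher = F.softmax(teacher_logits, dim=1)
    log_p_student = F.log_softmax(student_logits, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def linear_cka(features_x, features_y):
    """Linear CKA between two feature matrices of shape [n, d1] and [n, d2]."""
    x = features_x - features_x.mean(dim=0, keepdim=True)  # center each column
    y = features_y - features_y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.t() @ x, ord="fro") ** 2
    norm_x = torch.linalg.norm(x.t() @ x, ord="fro")
    norm_y = torch.linalg.norm(y.t() @ y, ord="fro")
    return cross / (norm_x * norm_y)
```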

Models (Teacher/Student) resnet32x4/ShuffleV2
KL-divergence CKA
Teacher training method Standard SFTN Standard SFTN
KD 1.10 0.47 0.88 0.95
FitNets 0.79 0.38 0.89 0.95
SP 0.95 0.45 0.89 0.95
VID 0.88 0.45 0.88 0.95
CRD 0.81 0.43 0.88 0.95
SSKD 0.54 0.26 0.92 0.97
OH 0.85 0.37 0.90 0.96
AVG 0.84 0.39 0.89 0.96
Table 8: Similarity measurements between teachers and students on the CIFAR-100 test set. Higher CKA and lower KL indicate that the representations given by two models are more similar.

Table 8 presents the similarities between the representations of a resnet32x4 teacher and a ShuffleV2 student given by various algorithms on the CIFAR-100 test set. The results show that distillation from SFTN always yields a student model with higher similarity to the teacher network; SFTN reduces the KL-divergence by about 50% on average while improving the average CKA by 7% points compared to the standard teacher network. Since SFTN is trained with student branches to obtain student-friendly representations via a KL-divergence loss, the improved similarity is natural.

5 Conclusion

We proposed a simple but effective knowledge distillation approach by introducing the novel student-friendly teacher network (SFTN). Our strategy sheds light on a new direction for knowledge distillation, which focuses on the stage of training teacher networks. We train teacher networks along with their student branches, and then perform distillation from teachers to students. The proposed strategy turns out to improve the accuracy and convergence speed of student models, and it can be incorporated into various knowledge distillation algorithms in a straightforward manner. To demonstrate the effectiveness of our strategy, we conducted comprehensive experiments in diverse environments, which show consistent performance gains compared to the standard teacher networks regardless of architectural and algorithmic variations.

6 More Analysis

6.1 Effectiveness of Training Student-Aware Teacher Networks

For training a student-aware teacher network, the proposed approach adopts a student branch whose structure matches that of the student network used for knowledge distillation. To analyze the effectiveness of student-aware teacher networks, we examine how the size of the student branch affects the performance of SFTN when the student model is fixed. Table 9 shows that student branches with structures identical to the target student networks achieve the best accuracy in general. Note that larger student branches are often effective in enhancing the accuracy of teachers, while smaller ones tend to reduce it.

Models (teacher/student) WRN40-2/resnet8x4 resnet32x4/resnet8x4 resnet32x4/WRN16-2
Student branch size Smaller Equal Larger Smaller Equal Larger Smaller Equal Larger
Student branch model resnet8x2 resnet8x4 resnet32x4 resnet8x2 resnet8x4 resnet32x4 WRN16-1 WRN16-2 WRN40-2
Teacher accuracy 76.22 78.18 78.82 77.89 79.41 80.85 76.53 79.04 80.3
Student Acc. w/o KD 72.38 71.12 73.41
KD 74.08 75.46 74.84 75.19 76.09 75.19 75.14 76.33 76.37
SP 74.01 75.58 75.24 75.76 76.37 75.62 75.61 76.91 76.34
FT 74.55 75.83 76.03 76.54 77.02 76.48 76.26 77.07 76.81
CRD 75.8 76.94 76.52 76.72 76.95 76.54 76.55 77.39 77.27
SSKD 73.67 75.85 75.93 75.77 76.85 76.67 75.27 77.14 77.35
OH 74.79 75.98 76.09 75.69 76.65 76.38 75.85 76.87 77.00
Average 74.48 75.94 75.78 75.94 76.66 76.15 75.78 76.95 76.86
Best 75.8 76.94 76.52 76.72 77.02 76.67 76.55 77.39 77.35
Table 9: Effectiveness of training student-aware teacher networks before knowledge distillation. For each teacher and target student pair, we evaluate knowledge distillation performance for three different student branch sizes adopted in the student-aware teacher network training stage. The results show that matching the model architectures between the two stages (student-aware teacher network training and knowledge distillation) leads to the best accuracies in general, which implies the practical benefit of SFTN. Note that all the reported numbers are averages of 3 independent runs. Bold-faced numbers indicate the highest accuracies among the various student branch models.

6.2 Relationship between Teacher and Student Accuracies

Figure 4 illustrates the relationship between teacher and student accuracies. According to our experiments, higher teacher accuracy does not necessarily lead to better student models. Also, even in the cases where the teacher accuracies of SFTN are lower than those of the standard method, the student models of SFTN consistently outperform their counterparts from the standard method. This result implies that the improvement of teacher accuracy is not the main reason for the better results of SFTN.

Teacher accuracy on CIFAR-100.
Student Accuracy by KD.
Figure 4: Relationship between teacher and student accuracies on CIFAR-100, where resnets of different sizes and MobileNetV2 are employed as the teacher and student networks, respectively. Generally, the teacher accuracy of SFTN is lower than that of the standard teacher network, but the student models of SFTN consistently outperform those of the standard method.

6.3 Training and Testing Curves

Figure 5(a) illustrates that the KL-divergence loss for knowledge distillation from SFTN converges faster than that from the standard teacher network. This is presumably because, through the student-aware training with student branches, SFTN learns knowledge that is more transferable to the student model than the standard teacher network does. We believe this leads to the higher test accuracies of SFTN shown in Figure 5(b).

KL-divergence loss during training on CIFAR-100.
Test accuracy on CIFAR-100.
Figure 5: Training and testing curves on CIFAR-100, where resnet32x4 and ShuffleV2 are employed as the teacher and student networks, respectively. SFTN converges faster and shows improved test accuracy compared to the standard teacher models.

7 Implementation Details

We present the details of our implementation for better reproducibility.

7.1 CIFAR-100

The models for CIFAR-100 are trained for 240 epochs with a batch size of 64, where the learning rate is reduced by a factor of 10 at epochs 150, 180, and 210. We use randomly cropped 32x32 images with 4-pixel padding and adopt horizontal flipping with a probability of 0.5 for data augmentation. Each channel in an input image is normalized to the standard Gaussian.
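These augmentations correspond to the standard torchvision pipeline sketched below; the per-channel normalization statistics are the commonly used CIFAR-100 values and are an assumption here.

```python
from torchvision import transforms

# commonly used CIFAR-100 channel statistics (assumed, not quoted from the paper)
CIFAR100_MEAN = (0.5071, 0.4865, 0.4409)
CIFAR100_STD = (0.2673, 0.2564, 0.2762)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # random 32x32 crop, 4-pixel padding
    transforms.RandomHorizontalFlip(p=0.5),    # horizontal flip with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD),
])
```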

7.2 ImageNet

ImageNet models are trained for 100 epochs with a batch size of 256. We reduce the learning rate by an order of magnitude at epochs 30, 60, and 90. In the training phase, we perform random resized cropping with a scale range from 0.08 to 1.0 relative to the original image, while the aspect ratio is adjusted randomly within the range from 3/4 to 4/3. All training crops are resized to 224x224 and flipped horizontally with a probability of 0.5. In the validation phase, images are resized to 256x256 and then center-cropped to 224x224. Each channel in an input image is normalized to the standard Gaussian.
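This pipeline matches the torchvision transforms used by the PyTorch reference example; a sketch is given below, with the standard ImageNet channel statistics.

```python
from torchvision import transforms

# standard ImageNet channel statistics used by the PyTorch reference example
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```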

8 Architecture Details

We present the architectural details of SFTN with VGG13 and VGG8 as the teacher and student on CIFAR-100, respectively. VGG13 and VGG8 are modularized into 4 blocks based on the depths of layers and the feature map sizes. The VGG13 SFTN adds a student branch to the output of every teacher network block except the last one. Figures 6, 7, and 8 show the architectures of the VGG13 teacher, the VGG8 student, and the VGG13 SFTN with a VGG8 student branch attached. Tables 10, 11, and 12 describe the full details of these architectures.

Figure 6: Architecture of the VGG13 teacher model. T_i and S_i denote the i-th block of the teacher network and the i-th block of the student network, respectively. Table 10 shows the detailed description of the VGG13 teacher.
Figure 7: Architecture of the VGG8 student. T_i and S_i denote the i-th block of the teacher network and the i-th block of the student network, respectively. Table 11 shows the detailed description of the VGG8 student.
Figure 8: Architecture of SFTN with the VGG13 teacher and a VGG8 student branch. T_i, S_i, and TF_i denote the i-th block of the teacher network, the i-th block of the student network, and the i-th teacher feature transform layer, respectively. Table 12 shows the detailed description of the VGG13 SFTN with a VGG8 student branch attached.
Layer Input Layer Input Shape Filter Size Channels Stride Paddings Output Shape Block
Image - - - - - - 3x32x32 -
Conv2d-1 Image 3x32x32 3x3 64 1 1 64x32x32
BatchNorm2d-2 Conv2d-1 64x32x32 - 64 - - 64x32x32
Relu-3 BatchNorm2d-2 64x32x32 - - - - 64x32x32
Conv2d-4 Relu-3 64x32x32 3x3 64 1 1 64x32x32
BatchNorm2d-5 Conv2d-4 64x32x32 - 64 - - 64x32x32
Relu-6 BatchNorm2d-5 64x32x32 - - - - 64x32x32
MaxPool2d-7 Relu-6 64x32x32 2x2 - 2 0 64x16x16
Conv2d-8 MaxPool2d-7 64x16x16 3x3 128 1 1 128x16x16
BatchNorm2d-9 Conv2d-8 128x16x16 - 128 - - 128x16x16
Relu-10 BatchNorm2d-9 128x16x16 - - - - 128x16x16
Conv2d-11 Relu-10 128x16x16 3x3 128 1 1 128x16x16
BatchNorm2d-12 Conv2d-11 128x16x16 - 128 - - 128x16x16
Relu-13 BatchNorm2d-12 128x16x16 - - - - 128x16x16
MaxPool2d-14 Relu-13 128x16x16 2x2 - 2 0 128x8x8
Conv2d-15 MaxPool2d-14 128x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-16 Conv2d-15 256x8x8 - 256 - - 256x8x8
Relu-17 BatchNorm2d-16 256x8x8 - - - - 256x8x8
Conv2d-18 Relu-17 256x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-19 Conv2d-18 256x8x8 - 256 - - 256x8x8
Relu-20 BatchNorm2d-19 256x8x8 - - - - 256x8x8
MaxPool2d-21 Relu-20 256x8x8 2x2 - 2 0 256x4x4
Conv2d-22 MaxPool2d-21 256x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-23 Conv2d-22 512x4x4 - 512 - - 512x4x4
Relu-24 BatchNorm2d-23 512x4x4 - - - - 512x4x4
Conv2d-25 Relu-24 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-26 Conv2d-25 512x4x4 - 512 - - 512x4x4
Relu-27 BatchNorm2d-26 512x4x4 - - - - 512x4x4
Conv2d-28 Relu-27 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-29 Conv2d-28 512x4x4 - 512 - - 512x4x4
Relu-30 BatchNorm2d-29 512x4x4 - - - - 512x4x4
Conv2d-31 Relu-30 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-32 Conv2d-31 512x4x4 - 512 - - 512x4x4
Relu-33 BatchNorm2d-32 512x4x4 - - - - 512x4x4
AvgPool2d-34 Relu-33 512x4x4 - - - - 512x1x1 -
Linear-35 AvgPool2d-34 512x1x1 - - - - 100 -
Table 10: Detailed architecture of the VGG13 teacher.
Layer Input Layer Input Shape Filter Size Channels Stride Paddings Output Shape Block
Image - - - - - - 3x32x32 -
Conv2d-1 Image 3x32x32 3x3 64 1 1 64x32x32
BatchNorm2d-2 Conv2d-1 64x32x32 - 64 - - 64x32x32
Relu-3 BatchNorm2d-2 64x32x32 - - - - 64x32x32
MaxPool2d-4 Relu-3 64x32x32 2x2 - 2 0 64x16x16
Conv2d-5 MaxPool2d-4 64x16x16 3x3 128 1 1 128x16x16
BatchNorm2d-6 Conv2d-5 128x16x16 - 128 - - 128x16x16
Relu-7 BatchNorm2d-6 128x16x16 - - - - 128x16x16
Maxpool2d-8 Relu-7 128x16x16 2x2 - 2 0 128x8x8
Conv2d-9 Maxpool2d-8 128x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-10 Conv2d-9 256x8x8 - 256 - - 256x8x8
Relu-11 BatchNorm2d-10 256x8x8 - - - - 256x8x8
MaxPool2d-12 Relu-11 256x8x8 2x2 - 2 0 256x4x4
Conv2d-13 MaxPool2d-12 256x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-14 Conv2d-13 512x4x4 - 512 - - 512x4x4
Relu-15 BatchNorm2d-14 512x4x4 - - - - 512x4x4
Conv2d-16 Relu-15 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-17 Conv2d-16 512x4x4 - 512 - - 512x4x4
Relu-18 BatchNorm2d-17 512x4x4 - - - - 512x4x4
AvgPool2d-19 Relu-18 512x4x4 - - - - 512x1x1 -
Linear-20 AvgPool2d-19 512x1x1 - - - - 100 -
Table 11: VGG8 student model.
Layer Input Layer Input Shape Filter Size Channels Stride Paddings Output Shape Block
Image - - - - - - 3x32x32 -
Conv2d-1 Image 3x32x32 3x3 64 1 1 64x32x32
BatchNorm2d-2 Conv2d-1 64x32x32 - 64 - - 64x32x32
Relu-3 BatchNorm2d-2 64x32x32 - - - - 64x32x32
Conv2d-4 Relu-3 64x32x32 3x3 64 1 1 64x32x32
BatchNorm2d-5 Conv2d-4 64x32x32 - 64 - - 64x32x32
Relu-6 BatchNorm2d-5 64x32x32 - - - - 64x32x32
MaxPool2d-7 Relu-6 64x32x32 2x2 - 2 0 64x16x16
Conv2d-8 MaxPool2d-7 64x16x16 3x3 128 1 1 128x16x16
BatchNorm2d-9 Conv2d-8 128x16x16 - 128 - - 128x16x16
Relu-10 BatchNorm2d-9 128x16x16 - - - - 128x16x16
Conv2d-11 Relu-10 128x16x16 3x3 128 1 1 128x16x16
BatchNorm2d-12 Conv2d-11 128x16x16 - 128 - - 128x16x16
Relu-13 BatchNorm2d-12 128x16x16 - - - - 128x16x16
MaxPool2d-14 Relu-13 128x16x16 2x2 - 2 0 128x8x8
Conv2d-15 MaxPool2d-14 128x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-16 Conv2d-15 256x8x8 - 256 - - 256x8x8
Relu-17 BatchNorm2d-16 256x8x8 - - - - 256x8x8
Conv2d-18 Relu-17 256x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-19 Conv2d-18 256x8x8 - 256 - - 256x8x8
Relu-20 BatchNorm2d-19 256x8x8 - - - - 256x8x8
MaxPool2d-21 Relu-20 256x8x8 2x2 - 2 0 256x4x4
Conv2d-22 MaxPool2d-21 256x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-23 Conv2d-22 512x4x4 - 512 - - 512x4x4
Relu-24 BatchNorm2d-23 512x4x4 - - - - 512x4x4
Conv2d-25 Relu-24 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-26 Conv2d-25 512x4x4 - 512 - - 512x4x4
Relu-27 BatchNorm2d-26 512x4x4 - - - - 512x4x4
Conv2d-28 Relu-27 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-29 Conv2d-28 512x4x4 - 512 - - 512x4x4
Relu-30 BatchNorm2d-29 512x4x4 - - - - 512x4x4
Conv2d-31 Relu-30 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-32 Conv2d-31 512x4x4 - 512 - - 512x4x4
Relu-33 BatchNorm2d-32 512x4x4 - - - - 512x4x4
AvgPool2d-34 Relu-33 512x4x4 - - - - 512x1x1 -
Linear-35 AvgPool2d-34 512x1x1 - - - - 100 -
Student Branch 1
Conv2d-36 Relu-13 128x16x16 1x1 128 1 0 128x16x16
BatchNorm2d-37 Conv2d-36 128x16x16 - 128 - - 128x16x16
Relu-38 BatchNorm2d-37 128x16x16 - - - - 128x16x16
Maxpool2d-39 BatchNorm2d-37 128x16x16 2x2 - 2 0 128x8x8
Conv2d-40 Maxpool2d-39 128x8x8 3x3 256 1 1 256x8x8
BatchNorm2d-41 Conv2d-40 256x8x8 - 256 - - 256x8x8
Relu-42 BatchNorm2d-41 - - - - - 256x8x8
MaxPool2d-43 Relu-42 256x8x8 2x2 - 2 0 256x4x4
Conv2d-44 MaxPool2d-43 256x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-45 Conv2d-44 512x4x4 - 512 - - 512x4x4
Relu-46 BatchNorm2d-45 - - - - - 512x4x4
Conv2d-47 Relu-46 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-48 Conv2d-47 512x4x4 - 512 - - 512x4x4
Relu-49 BatchNorm2d-48 - - - - - 512x4x4
AvgPool2d-50 Relu-49 512x4x4 - - - - 512x1x1 -
Linear-51 AvgPool2d-50 512x1x1 - - - - 100 -
Table 12: Details of the SFTN architecture with the VGG13 teacher and a VGG8 student branch.
Layer Input Layer Input Shape Filter Size Channels Stride Paddings Output Shape Block
Student Branch 2
Conv2d-52 Relu-20 256x8x8 1x1 256 1 0 256x8x8
BatchNorm2d-53 Conv2d-52 256x8x8 - 256 - - 256x8x8
Relu-54 BatchNorm2d-53 256x8x8 - - - - 256x8x8
MaxPool2d-55 Relu-54 256x8x8 2x2 - 2 0 256x4x4
Conv2d-56 MaxPool2d-55 256x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-57 Conv2d-56 512x4x4 - 512 - - 512x4x4
Relu-58 BatchNorm2d-57 - - - - - 512x4x4
Conv2d-59 Relu-58 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-60 Conv2d-59 512x4x4 - 512 - - 512x4x4
Relu-61 BatchNorm2d-60 - - - - - 512x4x4
AvgPool2d-62 Relu-61 512x4x4 - - - - 512x1x1 -
Linear-63 AvgPool2d-62 512x1x1 - - - - 100 -
Student Branch 3
Conv2d-64 Relu-27 512x4x4 1x1 512 1 0 512x4x4
BatchNorm2d-65 Conv2d-64 512x4x4 - 512 - - 512x4x4
Relu-66 BatchNorm2d-65 512x4x4 - - - - 512x4x4
Conv2d-67 Relu-66 512x4x4 3x3 512 1 1 512x4x4
BatchNorm2d-68 Conv2d-67 512x4x4 - 512 - - 512x4x4
Relu-69 BatchNorm2d-68 - - - - - 512x4x4
AvgPool2d-70 Relu-69 512x4x4 - - - - 512x1x1 -
Linear-71 AvgPool2d-70 512x1x1 - - - - 100 -
Table 12: Continued from the previous table (student branches 2 and 3).

References

  • S. Ahn, S. X. Hu, A. C. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. In CVPR, Cited by: §4.1.
  • S. Arora, M. M. Khapra, and H. G. Ramaswamy (2019) On knowledge distillation from complex networks for response prediction. In NAACL, J. Burstein, C. Doran, and T. Solorio (Eds.), Cited by: §2.
  • G. Chen, W. Choi, X. Chen, T. X. Han, and M. K. Chandraker (2017) Learning Efficient Object Detection Models with Knowledge Distillation. In NeurIPS, Cited by: §2.
  • J. H. Cho and B. Hariharan (2019) On the efficacy of knowledge distillation. In ICCV, Cited by: §2.2.
  • A. Coates, A. Y. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In AISTATS, Cited by: §4.1, §4.5.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born-again neural networks. In ICML, Cited by: §2.2.
  • Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo (2020) Online knowledge distillation via collaborative learning. In CVPR, Cited by: §1, §2.2, §4.1, §4.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §4.1.
  • B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi (2019a) A comprehensive overhaul of feature distillation. In ICCV, Cited by: §2.1, §4.1.
  • B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019b) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, Cited by: §4.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the Knowledge in a Neural Network. In NeurIPS Deep Learning and Representation Learning Workshop, Cited by: §1, §2.1, §4.1.
  • [12] http://tiny-imagenet.herokuapp.com/ Cited by: §4.5.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2019) TinyBERT: distilling BERT for natural language understanding. CoRR. Cited by: §2.
  • X. Jin, B. Peng, Y. Wu, Y. Liu, J. Liu, D. Liang, J. Yan, and X. Hu (2019) Knowledge distillation via route constrained optimization. In ICCV, Cited by: §2.2, §4.1, §4.3.
  • M. Kang, J. Mun, and B. Han (2020) Towards oracle knowledge distillation with neural architecture search. In AAAI, Cited by: §1, §2.2.
  • J. Kim, S. Park, and N. Kwak (2018) Paraphrasing Complex Network: Network Compression via Factor Transfer. In NeurIPS, Cited by: §2.1, §4.1.
  • S. Kornblith, M. Norouzi, H. Lee, and G. E. Hinton (2019) Similarity of neural network representations revisited. In ICML, Cited by: §4.6.
  • A. Krizhevsky (2009) Learning Multiple Layers of Features from Tiny Images. Technical report Citeseer. Cited by: §4.1, §4.2.1, §4.2.
  • X. Lan, X. Zhu, and S. Gong (2018) Knowledge Distillation by On-the-Fly Native Ensemble. In NeurIPS, Cited by: §2.1.
  • P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang (2016) Face Model Compression by Distilling Knowledge from Neurons. In AAAI, Cited by: §2.
  • S. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh (2020) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. In AAAI, Cited by: §1, §2.2.
  • J. Mun, K. Lee, J. Shin, and B. Han (2018) Learning to Specialize with Knowledge Distillation for Visual Question Answering. In NeurIPS, Cited by: §2.
  • W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational Knowledge Distillation. In CVPR, Cited by: §2.1, §4.1.
  • N. Passalis and A. Tefas (2018) Learning deep representations with probabilistic knowledge transfer. In ECCV, Cited by: §4.1.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: Hints for Thin Deep Nets. In ICLR, Cited by: §2.1, §3.1, §4.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), pp. 211–252. Cited by: §4.1, §4.2.2, §4.2.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR. Cited by: §2.
  • K. Simonyan and A. Zisserman (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, Cited by: §4.1.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §4.1.
  • F. M. Thoker and J. Gall (2019) Cross-modal knowledge distillation for action recognition. In ICIP, Cited by: §2.
  • Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive representation distillation. In ICLR, Cited by: §2.1, §4.1.
  • F. Tung and G. Mori (2019) Similarity-preserving knowledge distillation. In ICCV, Cited by: §4.1.
  • H. Wang, Y. Li, Y. Wang, H. Hu, and M. Yang (2020) Collaborative distillation for ultra-resolution universal style transfer. In CVPR, Cited by: §2.
  • A. Wu, W. Zheng, X. Guo, and J. Lai (2019) Distilled person re-identification: towards a more scalable system. In CVPR, Cited by: §2.
  • G. Wu and S. Gong (2020) Peer collaborative learning for online knowledge distillation. In arXiv 2006.04147, Cited by: §1, §2.2.
  • G. Xu, Z. Liu, X. Li, and C. C. Loy (2020) Knowledge distillation meets self-supervision. In ECCV, Cited by: §2.1, §4.1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: §4.1.
  • S. Zagoruyko and N. Komodakis (2017) Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In ICLR, Cited by: §2.1, §4.1.
  • L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. In ICCV, Cited by: §2.1.
  • X. Zhang, X. Zhou, M. Lin, and J. Sun (2018a) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: §4.1.
  • Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018b) Deep Mutual Learning. In CVPR, Cited by: §1, §2.2, §4.1, §4.3.
  • M. Zhao, T. Li, M. A. Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi (2018) Through-wall human pose estimation using radio signals. In CVPR, Cited by: §2.
  • B. Zhou, N. Kalra, and P. Krähenbühl (2020) Domain adaptation through task distillation. In ECCV, Cited by: §2.