Interactive Knowledge Distillation

07/03/2020
by   Shipeng Fu, et al.

Knowledge distillation is a standard teacher-student learning framework to train a light-weight student network under the guidance of a well-trained large teacher network. As an effective teaching strategy, interactive teaching has been widely employed at school to motivate students, in which teachers not only provide knowledge but also give constructive feedback to students upon their responses, to improve their learning performance. In this work, we propose an InterActive Knowledge Distillation (IAKD) scheme to leverage the interactive teaching strategy for efficient knowledge distillation. In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation: randomly replacing the blocks in the student network with the corresponding blocks in the teacher network. In this way, we directly involve the teacher's powerful feature transformation ability to largely boost the student's performance. Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods on diverse image classification datasets.


1 Introduction

Over the past few years, deeper and deeper convolutional neural networks (CNNs) have demonstrated cutting-edge performance on various computer vision tasks [7, 11, 17, 16]. However, they usually incur huge computational costs and memory consumption, and can hardly be embedded into resource-constrained devices, e.g., mobiles and UAVs. To reduce the resource consumption while maintaining good performance, researchers leverage knowledge distillation techniques [30, 6, 9, 10] to transfer informative knowledge from a cumbersome but well-trained teacher network into a light-weight but unskilled student network.

Figure 1: Differences between our proposed interactive knowledge distillation method and conventional, non-interactive ones. "S block" denotes the student block and "T block" denotes the teacher block. The definitions of the student knowledge and the teacher knowledge vary a lot from method to method.

Most existing knowledge distillation methods [19, 9, 28, 20, 8, 24] encourage the student network to mimic the representation space of the teacher network for approaching its proficient performance. Those methods manually define various kinds of knowledge based on the teacher network’s responses, such as softened outputs [9], attention maps [28], flow of solution procedure [27], and relations [20]. These different kinds of knowledge facilitate the distillation process through additional distillation losses. If we regard the teacher’s responses as the theorems in a textbook written by the teacher, manually defining those different kinds of knowledge just rephrases those theorems and further points out an easier way to make the student understand those theorems (soften the constraints). This kind of distillation process is viewed as non-interactive knowledge distillation, since the teacher just sets a goal for the student to mimic, ignoring the interaction with the student.

The non-interactive knowledge distillation methods suffer from one major problem: since the feature transformation ability of the student is less powerful than that of the teacher, the knowledge that the student has learned through mimicking cannot be identical to the knowledge provided by the teacher, which may impede the knowledge distillation performance. This problem becomes even tougher when introducing multi-connection knowledge [28, 3]. Specifically, since the student cannot perfectly mimic the teacher's knowledge even in the shallow blocks, the imperfectly-mimicked knowledge is reused in the following blocks to imitate the corresponding teacher's knowledge. Therefore, the gap between the knowledge learned by the student and the one provided by the teacher becomes larger, restricting the performance of knowledge distillation.

Based on the discussion above, we try to make the distillation process interactive so that the teacher can be truly involved in guiding the student. Specifically, the student first extracts features on its own, then the teacher responds by improving the weak features with its powerful feature transformation ability, and finally the student continues to work on the improved features. Better features are critical for better performance and provide a good foundation for the learning process of the student [29, 5, 22]. As shown in Fig. 1, instead of forcing the student to mimic the teacher's representation space, our interactive distillation directly involves the teacher's powerful feature transformation ability in the student network. Consequently, the student can make the best use of the improved features to better exploit its potential on relevant tasks.

In this paper, we propose a simple yet effective knowledge distillation method named Interactive Knowledge Distillation (IAKD). In IAKD, we randomly swap in blocks from the teacher network to replace blocks in the student network during the distillation phase. Each set of swapped-in teacher blocks handles the output of the previous student block and provides better feature maps to motivate the next student block. In addition, randomly swapping in the teacher blocks makes it possible for the teacher to guide the student in many different manners. Compared with conventional knowledge distillation methods, our proposed method does not deliberately force the student's knowledge to be similar to the teacher's knowledge. Therefore, we do not need additional distillation losses to drive the knowledge distillation process, and hence no hyper-parameters are required to balance the task-specific loss and distillation losses. Besides, the distillation process of our IAKD method is highly efficient, because IAKD discards the cumbersome knowledge transformation process and gets rid of the feature extraction of the teacher network, both of which commonly exist in conventional knowledge distillation methods. Experimental results demonstrate that our proposed method effectively boosts the performance of the student, which proves that the student can actually benefit from the interaction with the teacher.

2 Related Work

Since Ba and Caruana [2] first introduced the view of a teacher supervising a student for model compression, many follow-up works have emerged based on their teacher-student learning framework. To achieve promising model compression results, these works focus on how to better capture the knowledge of the large teacher network to supervise the training process of the small student network. The most representative work on knowledge distillation (KD) comes from Hinton et al. [9]. They defined knowledge as the teacher's softened logits and encouraged the student to mimic it instead of the raw activations before softmax [2], because they argued that the intra-class and inter-class relationships learned by the pre-trained teacher can be described more accurately by the softened logits than by the raw activations. However, simply using logits as the learning goal for the student may limit the information that the teacher distills to the student. In order to learn more of the teacher's knowledge, FitNets [21] added point-wise supervision on the feature maps at intermediate layers to assist the training of the student. However, this approach only works well when one supervision loss is applied to one intermediate layer; it cannot achieve satisfying results when more losses are added to supervise more intermediate layers [27], because supervision on the feature maps at many different layers makes the constraints too restrictive. Many methods therefore try to soften the constraints while preserving meaningful information of the feature maps. Zagoruyko and Komodakis [28] condensed feature maps into attention maps based on the channel statistics of each spatial location. Instead of defining knowledge based on the feature maps of a single layer, Yim et al. [27] employed the Gramian matrix to measure the correlations between feature maps from different layers. Besides, Kim et al. [13] utilized a convolutional auto-encoder, and Lee et al. [12] used Singular Value Decomposition (SVD) to perform dimension reduction on feature maps. Heo et al. [3, 8] demonstrated that simply using the activation status of neurons could also assist the distillation process. Ahn et al. [1] proposed a method based on information theory which maximizes the mutual information between the teacher and the student. To enable the student to acquire more meaningful information, many studies defined knowledge based on relations rather than individual feature maps [20, 18, 15, 25].

On the other hand, our proposed IAKD is essentially different from the aforementioned knowledge distillation methods. IAKD does not require distillation losses to drive the distillation process (see Eq. 1 and Eq. 5). In the learning process of the student, we directly involve the teacher's powerful feature transformation ability to improve the relatively weak features extracted by the student, since better features are critical for a better prediction.

3 Proposed Method

3.1 Conventional Knowledge Distillation

In general, the conventional knowledge distillation methods decompose the student network and the teacher network into $M$ modules. The student is trained to minimize the following loss function:

$\mathcal{L} = \mathcal{L}_{task}(y, y_S) + \sum_{i=1}^{M} \lambda_i \, \mathcal{L}_{KD}^{i}\big(T_S^{i}, T_T^{i}\big)$   (1)

where $\mathcal{L}_{task}$ is the task-specific loss, $y$ represents the ground truth, and $y_S$ is the predicted result of the student. $T_S^{i}$ and $T_T^{i}$ denote the student's transformed output and the teacher's transformed output of the $i$-th module. $\lambda_i$ is a tunable balance factor to balance the different losses. $\mathcal{L}_{KD}^{i}$ is the $i$-th distillation loss that narrows down the difference between $T_S^{i}$ and $T_T^{i}$, and it varies from method to method because of different definitions of knowledge. Essentially, the conventional knowledge distillation methods aim at forcing the student to mimic the teacher's representation space.
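To make Eq. 1 concrete, the sketch below assembles such a non-interactive distillation objective in PyTorch. The transform functions, the per-module feature lists, and the MSE-based distillation term are illustrative placeholders under our notation above, not the formulation of any particular method.

```python
import torch.nn.functional as F

def conventional_kd_loss(student_logits, labels, student_feats, teacher_feats,
                         transform_s, transform_t, lambdas):
    """Generic form of Eq. 1: the task-specific loss plus weighted per-module
    distillation losses between manually defined kinds of knowledge.

    student_feats / teacher_feats: lists of intermediate features, one per module.
    transform_s / transform_t: placeholder functions mapping raw features to the
    chosen "knowledge" (e.g. attention maps); lambdas: per-module balance factors.
    """
    loss = F.cross_entropy(student_logits, labels)            # L_task(y, y_S)
    for f_s, f_t, lam in zip(student_feats, teacher_feats, lambdas):
        k_s, k_t = transform_s(f_s), transform_t(f_t)         # T_S^i, T_T^i
        loss = loss + lam * F.mse_loss(k_s, k_t.detach())     # one possible L_KD^i
    return loss
```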

Figure 2: The diagram of the hybrid network which can effectively implement the proposed interactive knowledge distillation (IAKD). Specifically, the hybrid network enables the teacher blocks to randomly replace the student block of the original student network. FC denotes the fully-connected layer.

3.2 Interactive Knowledge Distillation

To achieve the interaction between the student and the teacher, our IAKD method randomly swaps in the teacher blocks to replace the student blocks at each iteration. The swapped-in teacher blocks can respond to the output of the preceding student block and then provide better features to motivate the next student block. To make all student blocks fully participate in the distillation process, after each training iteration, we put the replaced student blocks back to their original positions in the student network. Due to the return of the replaced student blocks, the architecture of the student network is kept consistent for the swapping-in operation in the next training iteration.

Hybrid block for swapping-in operation. As shown in Fig. 2, for an effective implementation of such an interactive strategy, we propose a two-path hybrid block, which is composed of one student block (the student path) and several teacher blocks (the teacher path). We use the hybrid block to replace the original student block in the student network to form a hybrid network. However, the blocks and layers that are shared by the teacher and the student are not replaced by hybrid blocks (for example, the fully-connected layers and the transition blocks in ResNet, whose input and output are of different spatial sizes, are considered as shared). If there are $N$ student blocks that are not shared by the student and the teacher, we obtain $N$ hybrid blocks in total.
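A minimal sketch of how such a hybrid network might be assembled in PyTorch is given below; the class and function names are hypothetical, and pairing each student block with a fixed-length run of teacher blocks is an assumption that matches the ResNet-44/ResNet-26 example used later in the experiments.

```python
import torch.nn as nn

class HybridBlock(nn.Module):
    """One two-path hybrid block: a trainable student block plus the corresponding
    (pretrained) teacher blocks. Names are illustrative, not the authors' code."""

    def __init__(self, student_block, teacher_blocks, p=1.0):
        super().__init__()
        self.student_path = student_block                    # trainable student block
        self.teacher_path = nn.Sequential(*teacher_blocks)   # teacher blocks, frozen later
        self.p = p                                           # prob. of taking the student path


def build_hybrid_blocks(student_blocks, teacher_blocks, teacher_blocks_per_student):
    """Pair each non-shared student block with its run of teacher blocks
    (e.g. two teacher blocks per student block for the ResNet-44/ResNet-26 pair)."""
    hybrid = []
    for i, s_block in enumerate(student_blocks):
        start = i * teacher_blocks_per_student
        t_slice = teacher_blocks[start:start + teacher_blocks_per_student]
        hybrid.append(HybridBlock(s_block, t_slice))
    return nn.ModuleList(hybrid)
```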

For the $i$-th hybrid block in the hybrid network, let $b_i \in \{0, 1\}$ denote a Bernoulli random variable, which indicates whether the student path is selected ($b_i = 1$) or the teacher path is chosen ($b_i = 0$). Therefore, the output of the $i$-th hybrid block can be formulated as

$x_{i+1} = b_i \, S_i(x_i) + (1 - b_i) \, T_i(x_i)$   (2)

where $x_i$ and $x_{i+1}$ denote the input and the output of the $i$-th hybrid block, $S_i$ is the function of the student block, and $T_i$ is the nested function of the teacher blocks.

If $b_i = 1$, the student path is chosen and Eq. 2 reduces to:

$x_{i+1} = S_i(x_i)$   (3)

This means the hybrid block degrades to the original student block. If $b_i = 0$, the teacher path is chosen and Eq. 2 reduces to:

$x_{i+1} = T_i(x_i)$   (4)

This implies that the hybrid block degrades to the teacher blocks. Next, we denote the probability of selecting the student path in the $i$-th hybrid block as $p_i = P(b_i = 1)$. Therefore, each student block is randomly replaced by the teacher blocks with probability $1 - p_i$. The swapped-in teacher blocks can interact with the student blocks to inspire the potential of the student.
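Continuing the HybridBlock sketch above, the forward pass below is one way to realize Eq. 2: a Bernoulli draw with probability $p_i$ selects the student path, and otherwise the frozen teacher path transforms the features. This sketch deliberately avoids torch.no_grad() so that gradients can still flow through the frozen teacher blocks back to earlier student blocks traversed in the same iteration, consistent with the distillation phase described below.

```python
import torch

def hybrid_forward(block, x):
    """Forward pass of one hybrid block, i.e. Eq. 2 (continuing the HybridBlock sketch).

    With probability block.p the trainable student path is taken (Eq. 3); otherwise the
    frozen teacher path transforms the features (Eq. 4).
    """
    if torch.rand(()) < block.p:       # b_i ~ Bernoulli(p_i): student path selected
        return block.student_path(x)
    return block.teacher_path(x)       # teacher weights frozen elsewhere (requires_grad=False)
```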

Additionally, the student blocks are replaced by the teacher blocks in a random manner, which means the teacher can guide the student in many different situations through interaction. In contrast, swapping in the teacher blocks in manually-defined manners only allows the teacher to guide the student in limited situations. Therefore, to traverse as many different situations as possible, the student blocks are replaced by the teacher blocks in a random manner.

Distillation phase. When the image features go through each hybrid block, the path is randomly chosen during the distillation phase. Let us consider two extreme situations. The first is that $p_i = 1$ for all hybrid blocks: the student path is chosen for every hybrid block and the hybrid network becomes the student network. The second is that $p_i = 0$ for all hybrid blocks: under such circumstances, all teacher paths are chosen, and the hybrid network becomes the teacher network. In other cases ($0 < p_i < 1$), the inputs randomly go through the student path or the teacher path in each hybrid block. The swapped-in teacher blocks interact with the student blocks, so they can respond to the output of the preceding student block and further provide better features to motivate the following student blocks. Before the distillation phase, the parameters of the teacher blocks in each hybrid block are loaded from the corresponding teacher blocks in the pretrained teacher network. During the distillation process, those parameters are frozen, and the batch normalization layers still use the mini-batch statistics. The student block in each hybrid block and the other shared parts are randomly initialized. The optimized loss function can be formulated as

$\mathcal{L}(\theta) = \mathcal{L}_{task}(y, y_H)$   (5)

where $\theta$ denotes the trainable parameters in the hybrid network, $\mathcal{L}_{task}$ is the task-specific loss, $y$ denotes the ground truth, and $y_H$ is the predicted result of the hybrid network. At each iteration, only the student blocks that the feature maps go through are updated (the shared parts are considered as student parts since they are randomly initialized).

Comparing Eq. 5 with Eq. 1, we can see that our proposed method does not need additional distillation losses to drive the knowledge distillation process. Therefore, laboriously searching for the optimal hyper-parameters to balance the task-specific loss and different distillation losses can be avoided. Besides, our proposed method has no need to transform the teacher network's responses into manually-defined knowledge, and the input is not required to go through both the teacher network and the student network, which makes the distillation highly efficient.
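As a sketch of the parameter handling described above, the helper below loads pretrained teacher weights into one teacher path, freezes them, and keeps the module in training mode so that its batch normalization layers normalize with mini-batch statistics; the function name and the state-dict plumbing are assumptions of this illustration.

```python
import torch.nn as nn

def prepare_teacher_path(teacher_path: nn.Module, pretrained_state_dict):
    """Load and freeze the teacher blocks of one hybrid block (illustrative sketch).

    The weights come from the corresponding blocks of the pretrained teacher network
    and are excluded from optimization; the module stays in training mode so that
    its BatchNorm layers use the statistics of the current mini-batch.
    """
    teacher_path.load_state_dict(pretrained_state_dict)   # copy pretrained teacher weights
    for param in teacher_path.parameters():
        param.requires_grad_(False)                       # frozen: never updated
    teacher_path.train()                                  # BN uses mini-batch statistics
    return teacher_path
```

Under this setup, the optimizer would be built only from the parameters that still require gradients, so the swapped-in teacher blocks never consume optimizer state.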

Test phase. Since knowledge distillation aims at improving the performance of the student network, we simply test the whole student network to check whether the student benefits from the interaction or not. For convenient validation, we can fix the probability $p_i$ to 1 in the test phase. Under this circumstance, the hybrid network becomes the student network during testing. Meanwhile, for practical deployment, we can reduce the burden of resource consumption by separating the student network from the hybrid network for further inference.

3.3 Probability Schedule

The interaction level between the teacher network and the student network is controlled by $p_i$. A smaller value of $p_i$ indicates that the student network has more interaction with the corresponding teacher blocks. For simplicity, we assume $p_1 = p_2 = \dots = p_N = p$, which suggests that every hybrid block shares the same interaction rule. To properly utilize the interaction mechanism, we propose three kinds of probability schedules for the change of $p$: 1) uniform schedule, 2) linear schedule, and 3) review schedule. These schedules are discussed in detail as follows.

(a) Uniform schedule
(b) Linear schedule
(c) Review schedule
Figure 3: Schematic illustrations of $p$ under different probability schedules. The vertical green dashed lines indicate decreases of the learning rate.

Uniform schedule. The simplest probability schedule is to fix $p$ to a constant value $p_0$ for all epochs during the distillation phase. This means the interaction level between the student and the teacher does not change. See Fig. 3(a) for a schematic illustration.

Linear schedule. As the epochs increase, the teacher network could participate less in the distillation process, as [21, 27, 3] do, because the student network gradually becomes more powerful to handle the task by itself. Thus, we set the value of $p$ according to a function of the epoch and propose a simple linear schedule: the probability of choosing the student path in each hybrid block increases linearly from $p_0$ for the first epoch to 1 for the last epoch (see Fig. 3(b)). The linear schedule indicates that as the knowledge distillation process proceeds, the interaction level between the student and the teacher gradually decreases. Vividly speaking, the teacher teaches the student through frequent interaction in early epochs, and the student attempts to solve the problem by himself in late epochs.

Review schedule. The schedules above ignore the change of the learning rate during training. When the learning rate decreases, the optimization algorithm narrows the parameter search radius for more refined adjustments. Thus, we reset $p$ to $p_0$ whenever the learning rate decreases, so that the teacher can re-interact with the student for better guidance. Within each interval where the learning rate remains unchanged, the probability increases linearly from $p_0$ to 1. Fig. 3(c) shows a schematic illustration. We call this probability schedule the review schedule because it is just like a teacher giving review lessons after finishing a unit so that the student can grasp the knowledge more firmly.
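The three schedules can be summarized in a single helper. The function below is an illustrative sketch rather than the authors' implementation; the default epoch boundaries mirror the CIFAR learning-rate drops used in Sec. 4.1.

```python
def schedule_p(epoch, total_epochs, p0, kind="review", lr_drop_epochs=(100, 150)):
    """Shared probability p of taking the student path at a given epoch (sketch).

    "uniform": p stays fixed at p0 for the whole distillation phase.
    "linear":  p rises linearly from p0 at the first epoch to 1 at the last epoch.
    "review":  p rises linearly from p0 to 1 within each constant-learning-rate
               interval and is reset to p0 whenever the learning rate drops.
    """
    if kind == "uniform":
        return p0
    if kind == "linear":
        return p0 + (1.0 - p0) * epoch / max(total_epochs - 1, 1)
    # review schedule: locate the constant-learning-rate interval containing `epoch`
    boundaries = [0] + list(lr_drop_epochs) + [total_epochs]
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if start <= epoch < end:
            return p0 + (1.0 - p0) * (epoch - start) / max(end - start - 1, 1)
    return 1.0
```

The returned value would simply be written into each hybrid block's $p$ at the start of every epoch.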

The effectiveness of different probability schedules will be validated experimentally in Sec. 4.1.

4 Experiments

In this section, we first investigate the effectiveness of different probability schedules and then conduct a series of experiments to analyze the impact of the interactive mechanism on the distillation process. At last, we compare our proposed IAKD with conventional, non-interactive methods on several public classification datasets to demonstrate the superiority of our proposed method.

4.1 Different Probability Schedules

To investigate the effectiveness of different probability schedules, we conduct experiments on the CIFAR-10 [14] and CIFAR-100 [14] datasets. The CIFAR-10 dataset consists of 60000 images of size 32×32 in 10 classes. The CIFAR-100 dataset [14] has 60000 images of size 32×32 from 100 classes. For all experiments in this subsection, we use the SGD optimizer with momentum 0.9 to optimize the hybrid network. We set the batch size to 128 and apply weight decay. The hybrid network is trained for 200 epochs. The learning rate starts at 0.1 and is multiplied by 0.1 at epochs 100 and 150. For data augmentation, images of size 32×32 are randomly cropped from zero-padded images, and each image is horizontally flipped with a probability of 0.5.
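A PyTorch setup consistent with these settings might look as follows; the 4-pixel padding and the CIFAR-10 dataset choice are assumptions of this sketch, the weight-decay value is omitted because it is not stated here, and the optimizer is built only from the trainable (student-side) parameters of a hybrid network supplied by the caller.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Random crop from zero-padded images plus horizontal flipping; the 4-pixel padding
# is an assumption of this sketch (the text only states zero-padding).
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)


def make_optimizer_and_scheduler(hybrid_net):
    """SGD with momentum 0.9, lr 0.1 decayed by 0.1 at epochs 100 and 150 over 200
    epochs; weight decay is also used in the paper but its value is omitted here.
    Only parameters that still require gradients (the student parts) are optimized."""
    trainable = [p for p in hybrid_net.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[100, 150], gamma=0.1)
    return optimizer, scheduler
```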

4.1.1 Uniform Schedule

We first conduct experiments with the uniform schedule mentioned in Sec. 3.3. We employ ResNet [7] as our base network and denote by ResNet-d a ResNet with a specific depth d. ResNet-44 and ResNet-26 are used as the teacher/student (T/S) pair. Therefore, in each hybrid block, there is one student block in the student path and two teacher blocks in the teacher path.

The experimental results are shown in Fig. 4. Note that $p_0 = 0$ and $p_0 = 1$ are not reasonable options: when $p_0 = 0$, we simply train the teacher network with some frozen, pre-trained blocks; when $p_0 = 1$, we just train the student network individually without interaction. As we can see from Fig. 4, IAKD with the uniform schedule (denoted as IAKD-U for short) can actually improve the performance of the student when $p_0$ is large, on both CIFAR-10 and CIFAR-100. However, the performance of the student drops dramatically when $p_0$ is small. This is because a small $p_0$ results in a high probability of choosing the teacher path for each hybrid block, which leads to insufficient training of the student.

(a) CIFAR-10
(b) CIFAR-100
Figure 4: The results of IAKD-U with various $p_0$. Each experiment is repeated 5 times, and we take the mean value as the final result.

4.1.2 Linear Schedule

We employ ResNet-44/ResNet-26 as the T/S pair when conducting experiments with the linear schedule. In this schedule, $p$ increases linearly from $p_0$ for the first epoch to 1 for the last epoch. During early epochs, when the value of $p$ is relatively small, the selected student blocks and the shared blocks frequently interact with the frozen teacher blocks, which means those blocks are supposed to have a better initialization when $p$ is large in late epochs (when the hybrid network is more "student-like"). As $p$ gradually increases, the hybrid network transitions from "teacher-like" to "student-like". Such a distillation process is related to the two-stage knowledge distillation methods [21, 27, 3], in which the student network is first initialized based on the distillation losses and then trained for the main task based on the task-specific loss. However, the two-stage knowledge distillation methods make an abrupt transition from the initialization stage to the task-specific training stage. As a result, the teacher's knowledge may vanish as the training process proceeds, because the teacher is not involved in guiding the student in the second stage. On the contrary, as $p$ increases linearly, IAKD makes a smooth transition to the task-specific training stage.

The experimental results of IAKD with the linear schedule (denoted as IAKD-L for short) are shown in Fig. 5. Comparing Fig. 5 with Fig. 4, we have two main observations. The first is that IAKD-L improves the original student performance for almost every value of $p_0$, while IAKD-U only works for large $p_0$. The second is that the best distillation results of IAKD-L on both CIFAR-10 and CIFAR-100 are better than those of IAKD-U. These two observations indicate that, as the student network becomes more powerful, the interaction between the teacher and student networks is more effective when its level is gradually reduced rather than kept at a fixed level.

(a) CIFAR-10
(b) CIFAR-100
Figure 5: The results of IAKD-L using various $p_0$. Each experiment is repeated 5 times, and we take the mean value as the final result.

4.1.3 Review Schedule

(a) CIFAR-10
(b) CIFAR-100
Figure 6: The results of IAKD-R using various $p_0$. Each experiment is repeated 5 times, and we take the mean value as the final result.

For a fair comparison, we employ ResNet-44/ResNet-26 as the T/S pair to conduct experiments on IAKD with the review schedule (denoted as IAKD-R for short). The experimental results of IAKD-R are shown in Fig. 6. IAKD-R, which considers the change of the learning rate, is essentially an advanced version of IAKD-L. Comparing Fig. 6 with Fig. 5, IAKD-R further improves the performance of the original student in contrast to IAKD-L. Even for very small $p_0$, IAKD-R still improves the student network.

For a clearer comparison, we summarize the best results of the different probability schedules in Tab. 1. It can be seen that IAKD-R obtains the highest classification accuracy in comparison with IAKD-U and IAKD-L on both CIFAR-10 and CIFAR-100. From the above discussions and comparisons, we conclude that, compared with the uniform and linear schedules, the review schedule is the most effective interaction mechanism, making the teacher re-interact with the student once the learning rate decreases. Thus, we take the review schedule as our final probability schedule setting when comparing with other state-of-the-art knowledge distillation methods. Besides, it is worth noticing that for IAKD-L and IAKD-R, we need to set $p_0$ to a small value when the number of classes is large, and to a large value when the number of classes is small. This reflects that the distillation process is best assisted by an amount of interaction adequate to the difficulty of the task.

Method/Model CIFAR-10 CIFAR-100
ResNet-44 (Teacher) 92.91% 71.07%
ResNet-26 (Student) 92.01% 68.94%
IAKD-U 92.55% ± 0.11% (0.9) 69.54% ± 0.32% (0.9)
IAKD-L 92.71% ± 0.24% (0.7) 69.74% ± 0.19% (0.3)
IAKD-R 92.77% ± 0.13% (0.9) 70.18% ± 0.31% (0.1)
Table 1: The results of different probability schedules. The best is shown in bold. Numbers in brackets represent the $p_0$ achieving the best performance.

4.2 Study of Interaction Mechanism

In this subsection, the training settings are the same as those in Sec. 4.1. All experiments are repeated 5 times, and we take the mean value as the final result.

4.2.1 Without Interaction

By randomly swapping in the teacher blocks at each iteration, we achieve the interaction between the teacher and the student. To investigate the effectiveness of the interaction mechanism, we turn off the interaction between the teacher network and the student network during the distillation process, i.e., we do not replace the student blocks with the corresponding teacher blocks. Correspondingly, taking the whole pretrained teacher network and the student network, we force the output of each student block to be similar to the output of the corresponding teacher blocks. Since IAKD randomly swaps in the teacher blocks to replace the student blocks, here we randomly measure the difference between the output of each student block and that of the corresponding teacher blocks. The student is trained as follows:

$\mathcal{L} = \mathcal{L}_{CE}(y, y_S) + \sum_{i=1}^{N} c_i \, \mathcal{L}_{mimic}^{i}\big(F_S^{i}, F_T^{i}\big)$   (6)

where $y$ is the ground truth, $y_S$ is the output of the student network, and $N$ is the number of student blocks excluding the blocks shared by the teacher and the student. $F_S^{i}$ and $F_T^{i}$ are the outputs of the $i$-th student block and the corresponding teacher blocks, respectively. $\mathcal{L}_{CE}$ is the cross-entropy loss, and $\mathcal{L}_{mimic}^{i}$ is the loss penalizing the difference between $F_S^{i}$ and $F_T^{i}$. $c_i \in \{0, 1\}$ denotes a Bernoulli random variable that indicates whether the $i$-th mimicry loss is added ($c_i = 1$) or ignored ($c_i = 0$). We denote $q_i = P(c_i = 0)$, which represents the probability of ignoring the $i$-th loss. For simplicity and fair comparison, we set the probability schedule of $q_i$ to be the same as the probability schedule of $p_i$ in IAKD-U.
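A sketch of this non-interactive baseline loss is given below. The MSE form of the per-block mimicry term and the function name are assumptions for illustration; the random dropping of each term with probability $q$ follows Eq. 6.

```python
import torch
import torch.nn.functional as F

def no_interaction_loss(student_logits, labels, student_feats, teacher_feats, q):
    """Non-interactive baseline of Eq. 6, as a sketch.

    Each per-block mimicry loss between a student block's output and the output of the
    corresponding teacher blocks is ignored with probability q (mirroring the probability
    of choosing the student path in IAKD-U).
    """
    loss = F.cross_entropy(student_logits, labels)           # L_CE(y, y_S)
    for f_s, f_t in zip(student_feats, teacher_feats):
        if torch.rand(()) >= q:                              # add the i-th term with prob. 1 - q
            loss = loss + F.mse_loss(f_s, f_t.detach())
    return loss
```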

Model w/o IA IAKD-U-G IAKD-U $p_0$
ResNet-44 (Teacher) 92.91% - - -
ResNet-26 (Student) 92.01% - - -
ResNet-26 92.10% ± 0.02% 92.05% ± 0.21% 92.11% ± 0.41% 0.7
ResNet-26 92.18% ± 0.05% 92.28% ± 0.05% 92.31% ± 0.30% 0.8
ResNet-26 92.18% ± 0.06% 92.37% ± 0.05% 92.55% ± 0.11% 0.9
(a) on CIFAR-10
Model w/o IA IAKD-U-G IAKD-U $p_0$
ResNet-44 (Teacher) 71.07% - - -
ResNet-26 (Student) 68.94% - - -
ResNet-26 69.20% ± 0.15% 68.68% ± 0.16% 69.54% ± 0.32% 0.9
(b) on CIFAR-100
Table 2: Experimental results on different datasets when training the student without interaction (denoted as "w/o IA") or with group-wise interaction (denoted as "IAKD-U-G"). Better performance is shown in bold.

The experimental results are shown in Tab. 2(a) and Tab. 2(b). Compared with IAKD-U, which shows clear performance improvement on both CIFAR-10 and CIFAR-100, the students trained without interaction achieve only a slight performance improvement. This proves that the interaction mechanism helps the student achieve much better performance.

4.2.2 Group-wise Interaction

Until now, we have achieved IAKD by block-wise interaction (one student block pairs with several teacher blocks). To verify the effectiveness of block-wise interaction, we reduce the interaction level to group-wise interaction (one student group pairs with one teacher group), where the blocks with the same output size form a group and the shared blocks are excluded. For simplicity, we only choose IAKD-U to achieve group-wise interaction (denoted as IAKD-U-G for short). ResNet-44/ResNet-26 are employed as the T/S pair. As we can see from Tab. 2(a) and Tab. 2(b), IAKD-U outperforms IAKD-U-G on both datasets. Besides, IAKD-U-G fails to achieve a performance improvement on CIFAR-100, because of the significant decrease in the interaction level caused by the group-wise interaction: student blocks other than the first or the last in a group have no chance to interact with the teacher blocks.

4.3 Comparison with State-of-the-arts

We compare our proposed IAKD-R with other representative state-of-the-art knowledge distillation methods: HKD [9], AT [28], FSP [27], Overhaul [8], AB [3], VID-I [1], and RKD-DA [20]. As discussed above, all these compared methods are non-interactive. We conduct experiments on three datasets widely used to validate distillation performance: CIFAR-10 [14], CIFAR-100 [14], and TinyImageNet [26]. TinyImageNet contains 100000 training images of size 64×64 from 200 classes. All experiments are repeated 5 times, and we take the mean values as the final reported results. For the compared methods, we directly use the author-provided code if publicly available, or implement the method based on the original paper (the hyper-parameters of the compared methods can be found in our supplementary material). For a fair comparison, the optimization configurations are identical for all methods on a given dataset.

Method Model CIFAR-10 (Top-1) CIFAR-100 (Top-1) TinyImageNet (Top-1)
Teacher ResNet-44/80 92.91% 71.07% 55.73%
Student ResNet-26 92.01% 68.94% 50.04%
HKD [9] ResNet-26 92.90% ± 0.19% 69.59% ± 0.26% 50.25% ± 0.46%
AT [28] ResNet-26 92.55% ± 0.12% 69.72% ± 0.17% 52.07% ± 0.21%
FSP [27] ResNet-26 92.47% ± 0.06% 69.10% ± 0.34% 50.97% ± 0.50%
AB [3] ResNet-26 92.59% ± 0.16% 69.40% ± 0.42% 48.48% ± 2.96%
VID-I [1] ResNet-26 92.48% ± 0.10% 69.90% ± 0.23% 51.98% ± 0.15%
RKD-DA [20] ResNet-26 92.70% ± 0.16% 69.46% ± 0.39% 51.69% ± 0.43%
Overhaul [8] ResNet-26 92.29% ± 0.10% 69.11% ± 0.17% 52.56% ± 0.22%
IAKD-R (Ours) ResNet-26 92.77% ± 0.13% 70.18% ± 0.31% 53.28% ± 0.13%
IAKD-R+HKD ResNet-26 93.31% ± 0.08% 71.02% ± 0.16% 53.50% ± 0.31%
Table 3: Comparison of test accuracy for different knowledge distillation methods on the CIFAR-10, CIFAR-100, and TinyImageNet datasets. The best and second best results are highlighted in red and blue, respectively. ResNet-44 is used as the teacher on CIFAR-10 and CIFAR-100, while ResNet-80 is used on TinyImageNet.

For CIFAR-10 and CIFAR-100, we use the same optimization configurations as described in Sec. 4.1. In particular, for two-stage methods like FSP and AB, 50 epochs are used for initialization and 150 epochs for classification training; the learning rate at the initialization stage is 0.01, and the initial learning rate at the classification training stage is 0.1, multiplied by 0.1 at epochs 75 and 110. For TinyImageNet, we employ ResNet-80/ResNet-26 as the T/S pair. Horizontal flipping is applied for data augmentation. We optimize the network using SGD with batch size 128, momentum 0.9, and weight decay. The number of total training epochs is 300. The initial learning rate is 0.1 and is multiplied by 0.2 at epochs 60, 120, 160, 200, and 250. For FSP and AB, 50 epochs are used for initialization and 250 epochs for classification training; the learning rate at the initialization stage is 0.01, and the initial learning rate at the classification training stage is 0.1, multiplied by 0.2 at epochs 60, 120, 160, and 200.

For the proposed IAKD, we use the review schedule as the final probability schedule setting. $p_0$ is set to 0.9 for the comparison on CIFAR-10, and to 0.1 for the comparisons on CIFAR-100 and TinyImageNet. These configurations are based on the experimental results and arguments in Sec. 4.1.3. In Tab. 3, we list the comparison results on CIFAR-10, CIFAR-100, and TinyImageNet for the different methods. As we can see, the proposed IAKD, which is essentially different from conventional knowledge distillation methods, improves the performance of the student more effectively than most of the compared methods, and its result is competitive with HKD. We then combine HKD with our proposed IAKD by adding the HKD distillation loss to Eq. 5, and denote this variant as IAKD-R+HKD. Surprisingly, IAKD-R+HKD even outperforms the teacher on CIFAR-10. On CIFAR-100 and TinyImageNet, IAKD-R+HKD outperforms all compared methods by a large margin. Meanwhile, IAKD-R, which does not use any distillation loss, also shows highly competitive results on CIFAR-100 and TinyImageNet. In addition, our proposed method is robust: it consistently achieves significant performance improvement on all three datasets, while most of the compared methods do not.

Advantages in learning efficiency. Our IAKD-R also benefits from better learning efficiency. Conventional knowledge distillation methods update all student blocks throughout all training iterations. Our work reveals that this laborious distillation is not necessary for promising performance. In our IAKD-R, the student blocks replaced by teacher ones are not updated; thus, in each training iteration, only a portion of the student network is updated. For each student block that may be replaced, the expected number of training epochs in which it is updated is 190, 110, and 165 on CIFAR-10, CIFAR-100, and TinyImageNet, respectively, which accounts for 95%, 55%, and 55% of that of the other compared methods on the corresponding datasets. This demonstrates that our IAKD-R enjoys high learning efficiency, and also validates that the student network indeed benefits from the interaction with the teacher network.
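These expected epoch counts can be recovered from the review schedule under the simplifying assumption that, in a given epoch, a replaceable student block is updated for a fraction of iterations equal to that epoch's $p$; averaging the linearly increasing $p$ over each constant-learning-rate interval gives $(p_0 + 1)/2$, as the short computation below illustrates.

```python
def expected_update_epochs(p0, total_epochs, lr_drop_epochs):
    """Expected number of epochs in which a replaceable student block is updated under
    the review schedule, assuming its per-epoch update fraction equals that epoch's p."""
    boundaries = [0] + list(lr_drop_epochs) + [total_epochs]
    expected = 0.0
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # p rises linearly from p0 to 1 within each interval, so its average is (p0 + 1) / 2
        expected += (p0 + 1.0) / 2.0 * (end - start)
    return expected

print(expected_update_epochs(0.9, 200, [100, 150]))                 # ≈ 190 (CIFAR-10)
print(expected_update_epochs(0.1, 200, [100, 150]))                 # ≈ 110 (CIFAR-100)
print(expected_update_epochs(0.1, 300, [60, 120, 160, 200, 250]))   # ≈ 165 (TinyImageNet)
```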

4.4 Different Architecture Combinations

To ensure the applicability of IAKD, we explore more architecture combinations on CIFAR-100, as shown in Tab. 4. We replace "Conv3×3-BN-ReLU" with "group Conv3×3-BN-ReLU-Conv1×1-BN-ReLU" to obtain the cheap ResNet, as suggested by [4]. The experimental settings are the same as those described in Sec. 4.3. As we can see from Tab. 4, in all the architecture combinations, IAKD-R and IAKD-R+HKD consistently achieve performance improvement over the student network. In contrast, the compared methods give inconsistent knowledge distillation performance across settings: in case (a), only IAKD-R and IAKD-R+HKD successfully improve over the student cheap ResNet-26; AT and RKD-DA fail to improve the student performance in case (b); and RKD-DA and AB fail to achieve performance improvement in case (d). Therefore, the proposed IAKD, a novel knowledge distillation method without additional distillation losses, can also achieve promising performance improvement in different teacher/student settings.

Setup (a) (b) (c) (d)
Teacher ResNet-44 [7] ResNet-18 [7] VGG-19 [23] ResNet-50x4 [7]
Student cheap ResNet-26 VGG-8 [23] VGG-11 [23] ShuffleNet-v1 [31]
Teacher 71.07% 72.32% 72.74% 72.79%
Student 65.92% 68.91% 68.84% 67.11%
HKD [9] 65.55% ± 0.40% 70.63% ± 0.27% 71.09% ± 0.13% 68.58% ± 0.43%
AT [28] 65.90% ± 0.37% 68.54% ± 0.40% 70.93% ± 0.14% 68.13% ± 0.30%
RKD-DA [20] 65.66% ± 0.21% 68.69% ± 0.31% 70.84% ± 0.14% 66.69% ± 0.38%
AB [3] 65.64% ± 0.76% 69.86% ± 0.24% 71.21% ± 0.27% 66.89% ± 1.46%
IAKD-R 67.15% ± 0.33% 69.80% ± 0.23% 70.94% ± 0.15% 68.25% ± 0.10%
IAKD-R+HKD 66.11% ± 0.72% 70.51% ± 0.22% 71.76% ± 0.21% 68.65% ± 0.29%
Table 4: Evaluations on different architecture combinations (Top-1 accuracy on CIFAR-100, averaged over 5 runs). The best and the second best results are shown in red and blue, respectively. Results underperforming the corresponding students trained without knowledge distillation are marked.

5 Conclusion

In this paper, we propose a simple yet effective Interactive Knowledge Distillation (IAKD) method to approximate the performance of cumbersome networks (teacher networks) with light-weight networks (student networks), reducing the demand on resources. By randomly swapping in the teacher blocks to replace the student blocks at each iteration, we achieve interaction between the teacher network and the student network. To properly utilize the interaction mechanism during the distillation process, we propose three kinds of probability schedules. Besides, IAKD does not need additional distillation losses to drive the distillation process and is complementary to conventional knowledge distillation methods. Extensive experiments demonstrate that the proposed IAKD boosts the performance of the student significantly. Instead of mimicking the teacher's representation space, our proposed method directly leverages the teacher's powerful feature transformation ability to motivate the student, which provides a new perspective for knowledge distillation.

References

  • [1] Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: CVPR (2019)
  • [2] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: NeurIPS. pp. 2654–2662 (2014)
  • [3] Heo, B., Lee, M., Yun, S., Choi, J.Y.: Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: AAAI (2019)
  • [4] Crowley, E.J., Gray, G., Storkey, A.J.: Moonshine: Distilling with cheap convolutions. In: NeurIPS. pp. 2888–2898 (2018)
  • [5] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: ICML. pp. 647–655 (2014)
  • [6] Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: ICML (2018)
  • [7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [8] Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: ICCV (2019)
  • [9] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  • [10] Hou, Y., Ma, Z., Liu, C., Loy, C.C.: Learning lightweight lane detection cnns by self attention distillation. In: ICCV (2019)
  • [11] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
  • [12] Lee, S.H., Kim, D.H., Song, B.C.: Self-supervised knowledge distillation using singular value decomposition. In: ECCV (2018)
  • [13] Kim, J., Park, S., Kwak, N.: Paraphrasing complex network: Network compression via factor transfer. In: NeurIPS (2018)
  • [14] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep. (2009)
  • [15] Lee, S., Song, B.C.: Graph-based knowledge distillation by multi-head self-attention network. In: BMVC (2019)
  • [16] Li, Z., Yang, J., Liu, Z., Yang, X., Jeon, G., Wu, W.: Feedback network for image super-resolution. In: CVPR (2019)

  • [17] Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: CVPRW (2017)
  • [18] Liu, Y., Cao, J., Li, B., Yuan, C., Hu, W., Li, Y., Duan, Y.: Knowledge distillation via instance relationship graph. In: CVPR (2019)
  • [19] Mirzadeh, S.I., Farajtabar, M., Li, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393 (2019)
  • [20] Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)
  • [21] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: ICLR (2015)
  • [22] Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: an astounding baseline for recognition. In: CVPRW. pp. 806–813 (2014)
  • [23] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [24] Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: ICLR (2020)
  • [25] Tung, F., Mori, G.: Similarity-preserving knowledge distillation. In: ICCV (2019)
  • [26] Yao, L., Miller, J.: Tiny imagenet classification with convolutional neural networks. CS 231N (2015)

  • [27] Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: CVPR (2017)

  • [28] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017)
  • [29] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833. Springer (2014)
  • [30] Zhang, F., Zhu, X., Ye, M.: Fast human pose estimation. In: CVPR (2019)

  • [31] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR. pp. 6848–6856 (2018)