A Comprehensive Overhaul of Feature Distillation

04/03/2019 · Byeongho Heo et al. · NAVER Corp., Seoul National University

We investigate the design aspects of feature distillation methods for network compression and propose a novel feature distillation method whose distillation loss is designed to create a synergy among various aspects: teacher transform, student transform, distillation feature position, and distance function. Our proposed distillation loss includes a feature transform with a newly designed margin ReLU, a new distillation feature position, and a partial L2 distance function that skips redundant information harmful to the compression of the student. On ImageNet, our proposed method achieves 21.65% top-1 error with ResNet50, outperforming the teacher network, ResNet152. The proposed method is evaluated on various tasks such as image classification, object detection and semantic segmentation, and achieves a significant performance improvement in all tasks.


1 Introduction

Following remarkable advances in many machine learning tasks using neural networks, researchers have started to work on network compression and enhancement. Several approaches such as model pruning, model quantization and knowledge distillation have been suggested to make models smaller and more cost-efficient. Among them, knowledge distillation is being actively investigated. Knowledge distillation refers to a method that helps the training process of a smaller network (student) under the supervision of a larger network (teacher). Unlike other compression methods, it can downsize a network regardless of the structural difference between the teacher and the student network. Owing to this architectural flexibility, knowledge distillation is emerging as a next-generation approach to network compression.

Hinton et al. [8] proposed a knowledge distillation (KD) method using the softmax output of the teacher network. This method can be applied to any pair of network architectures since the dimensions of both outputs are the same. However, the output of a high-performance teacher network is not significantly different from the ground truth. Thus, transferring only the output is similar to training the student with the ground truth, which limits the performance of output distillation. To make better use of the information contained in the teacher network, several approaches have been proposed for feature distillation instead of output distillation. FitNets [22] proposed a method that encourages a student network to mimic the hidden feature values of a teacher network. Although feature distillation was a promising approach, the performance improvement by FitNets was not significant.

Figure 1: Performance of distillation methods: AT [30], FT [14], AB [7] and the proposed method on ImageNet. The graph shows the accuracy (%) of ResNet50 trained with each distillation method. Note that ResNet152 with 78.31% accuracy is used as the teacher.

After FitNets, variant methods of feature distillation have been proposed as follows. The methods in [30, 28] transform the feature into a representation with a reduced dimension and transfer it to the student. In spite of the reduced dimension, it has been reported that the abstracted feature representation leads to improved performance. Recent methods (FT [14], AB [7]) have been proposed to increase the amount of transferred information in distillation. FT [14] encodes the feature into a 'factor' using an auto-encoder to alleviate the leakage of information. AB [7] focuses on the activation of a network, with only the sign of features being transferred. Both methods show better distillation performance by increasing the amount of transferred information. However, FT [14] and AB [7] deform the feature values of the teacher, which leaves further room for performance improvement.

In this paper, we further improve the performance of feature distillation by proposing a new feature distillation loss designed via an investigation of various design aspects: teacher transform, student transform, distillation feature position and distance function. Our design concept is to distinguish the beneficial information from the adverse information for the enhancement of feature distillation performance. The distillation loss is designed so as to transfer only the beneficial teacher information to the student. To this purpose, we propose a new ReLU function used in our method, change the distillation feature position to the front of ReLU, and use a partial distance function to skip the distillation of adverse information. The proposed loss significantly improves the performance of feature distillation. In our experiments, we have evaluated our proposed method in various domains including classification (CIFAR [15], ImageNet [23]), object detection (PASCAL VOC [2]) and semantic segmentation (PASCAL VOC). As shown in Fig. 1, in our experiments, the proposed method shows performance superior to the existing state-of-the-art methods and even to the teacher model.

| Method | Teacher transform | Student transform | Distillation feature position | Distance | Missing information |
|---|---|---|---|---|---|
| FitNets [22] | None | 1×1 conv | Mid layer | L2 | None |
| AT [30] | Attention | Attention | End of group | L2 | Channel dims |
| FSP [28] | Correlation | Correlation | End of group | L2 | Spatial dims |
| Jacobian [26] | Gradient | Gradient | End of group | L2 | Channel dims |
| FT [14] | Auto-encoder | Auto-encoder | End of group | L1 | Auto-encoded |
| AB [7] | Binarization | 1×1 conv | Pre-ReLU | Marginal L2 | Feature values |
| Proposed | Margin ReLU | 1×1 conv | Pre-ReLU | Partial L2 | Negative features |

Table 1: Differences among various kinds of feature distillation. Most distillation methods use a teacher transform that loses information.
Figure 2: The general training scheme of feature distillation. The forms of the teacher transform T_t, the student transform T_s and the distance d differ from method to method.

2 Motivation

In this section, we investigate the design aspects of feature distillation methods achieving network compression and present the novel aspects of our approach that distinguish it from preceding methods. First, we describe a general form of the loss function in feature distillation. As shown in Fig. 2, the feature of the teacher network is denoted as F_t and the feature of the student network as F_s. To match the feature dimensions, we transform the features with a teacher transform T_t and a student transform T_s, respectively. A distance d between the transformed features is used as the loss function. In other words, the loss function of feature distillation is generalized as

L_{distill} = d\big(T_t(F_t), T_s(F_s)\big).   (1)

The student network is trained by minimizing the distillation loss L_{distill}.
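As a concrete illustration of Eq. (1), the following PyTorch sketch (class and variable names such as FeatureDistillationLoss are ours, not from the paper) expresses the generic loss with pluggable T_t, T_s and d; each row of Table 1 corresponds to one choice of the three components.

```python
import torch.nn as nn

class FeatureDistillationLoss(nn.Module):
    """Generic feature distillation loss of Eq. (1): d(T_t(F_t), T_s(F_s))."""
    def __init__(self, teacher_transform, student_transform, distance):
        super().__init__()
        self.teacher_transform = teacher_transform  # T_t
        self.student_transform = student_transform  # T_s
        self.distance = distance                    # d

    def forward(self, teacher_feature, student_feature):
        target = self.teacher_transform(teacher_feature)      # T_t(F_t)
        prediction = self.student_transform(student_feature)  # T_s(F_s)
        return self.distance(prediction, target)


# Illustrative FitNets-like instantiation (identity teacher transform,
# 1x1 conv student transform, plain L2 distance); channel sizes are made up.
fitnets_like = FeatureDistillationLoss(
    teacher_transform=nn.Identity(),
    student_transform=nn.Conv2d(64, 256, kernel_size=1),
    distance=nn.MSELoss(),
)
```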

It is desirable to design the distillation loss so as to transfer all feature information without missing any important information from the teacher. To achieve this, we aim to design a new feature distillation loss in which all of the teacher's important information is transferred as much as possible to improve the distillation performance. To get an idea for this purpose, we analyze the design aspects of the feature distillation loss. As described in Table 1, these design aspects fall into four categories: teacher transform, student transform, distillation feature position and distance function.

Teacher transform. A teacher transform T_t converts the teacher's hidden features into an easy-to-transfer form. It is an important part of feature distillation and also a main cause of information being missed in distillation. AT [30], FSP [28] and Jacobian [26] reduce the dimension of the feature vector via the teacher transform, which causes serious information loss. FT [14] uses a compression ratio determined by the user, and AB [7] utilizes the original feature in the form of binarized values, making both methods use features different from the original ones. Except for FitNets [22], most teacher transforms of existing approaches lose information from the teacher's feature used in the distillation loss. Since features include both adverse and beneficial information, it is important to distinguish them and avoid missing the beneficial information. In the proposed method, we use a new ReLU activation, called margin ReLU, for the teacher transform. In our margin ReLU, the positive (beneficial) information is used without any transformation while the negative (adverse) information is suppressed. As a result, the proposed method can perform distillation without missing the beneficial information.

Student transform. Typically, the student transform T_s uses the same function as the teacher transform T_t. Therefore, in methods like AT [30], FSP [28], Jacobian [26] and FT [14], the same amount of information is lost in both the teacher transform and the student transform. FitNets and AB do not reduce the dimension of the teacher's feature and use a 1×1 convolutional layer as the student transform to match the feature dimension with the teacher. In this case, the feature size of the student does not decrease, but rather increases, so no information is missing. In our method, we use this asymmetric format of transformations as the student transform.

Distillation feature position. Besides the types of feature transformation, we should be careful in picking the location where distillation occurs. FitNets uses the end of an arbitrary middle layer as the distillation point, which has been shown to yield poor performance. We refer to a group of layers with the same spatial size as a layer group [29, 3]. In AT [30], FSP [28] and Jacobian [26], the distillation point lies at the end of each layer group, whereas in FT it lies at the end of only the last layer group. This has led to better results than FitNets, but still lacks consideration of the ReLU-activated values of the teacher. ReLU allows the beneficial (positive) information to pass through and filters out the adverse (negative) information. Therefore, knowledge distillation must be designed with this information dissolution in mind. In our method, we design the distillation loss to use the features in front of the ReLU function, called pre-ReLU. Positive and negative values are preserved at the pre-ReLU position without any deformation, so it is suitable for distillation.

Distance function. Most distillation methods naively adopt the L1 or L2 distance. However, in our method, we need to design an appropriate distance function according to our teacher transform and our distillation point at the pre-ReLU position. In our design, the pre-ReLU information is transferred from teacher to student, but the negative values of the pre-ReLU feature contain adverse information. These negative values are blocked by the ReLU activation and are not used by the teacher network, so transferring all values may have an adverse effect on the student network. To handle this issue, we propose a new distance function, called partial L2 distance, which is designed to skip the distillation of information on the negative region.

3 Approach

In this section, we describe our distillation method outlined in Section 2. We first describe the location where the distillation occurs in our method and then explain the newly designed loss function.

Figure 3: Position of distillation target layer. We place the distillation layer between the last block and the first ReLU. The exact location differs according to the network architecture.

3.1 Distillation position

The activation function is a crucial component of neural networks: the non-linearity of a neural network is attributed to this function, and the performance of a model is significantly influenced by its choice. Among various activation functions, the rectified linear unit (ReLU) [20] is used in most computer vision tasks. Most networks [17, 25, 27, 5, 6, 29, 3, 10, 24] use ReLU or modified versions very similar to ReLU [19, 16]. ReLU applies a linear mapping to positive values. For negative values, it eliminates them and fixes them to zero, which prevents unnecessary information from propagating backward. With a careful design of knowledge distillation that takes ReLU into account, it is possible to transfer only the necessary information. Unfortunately, most of the preceding research does not take the activation function into serious consideration. We define the minimum unit of a network, such as the residual block in ResNet [5] and the Conv-ReLU in VGG [25], as a layer block. The distillation in most methods occurs at the end of the layer block, ignoring whether it is related to ReLU or not.

In our proposed method, the distillation position lies between the end of a layer block and the first ReLU that follows it. This positioning enables the student to access the teacher's preserved information before it passes through ReLU. Fig. 3 depicts the distillation position for several architectures. In the case of the simple block [17, 25, 27, 10] and the residual block [5], whether the distillation happens before or after the ReLU constitutes the difference between our proposed method and other methods. However, for networks using pre-activation [6, 29], the difference is larger: since there is no ReLU at the end of each block, our method has to find the ReLU in the next block. In a structure like PyramidNet [3, 24], our proposed method reaches the ReLU after the 1×1 convolution layer. Though the positioning strategy may be complicated depending on the architecture, it has a significant influence on the performance. Our new distillation position significantly improves the performance of the student, as demonstrated in our experiments.
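To make the pre-ReLU positioning concrete, the toy PyTorch block below (our own sketch, not the authors' code; the class name DistillableBasicBlock is hypothetical) exposes the tensor right after the residual addition and before the final ReLU, which is the feature that would be distilled for a standard residual block. For pre-activation or PyramidNet-style blocks the exposed point moves as described above.

```python
import torch.nn as nn

class DistillableBasicBlock(nn.Module):
    """Toy residual block that returns its pre-ReLU feature alongside the
    usual output. The pre-ReLU tensor (after the residual addition, before
    the final ReLU) is the distillation position proposed in Sec. 3.1."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        pre_relu = self.bn2(self.conv2(out)) + x  # residual addition, before ReLU
        return self.relu(pre_relu), pre_relu      # activated output + distill target
```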

3.2 Loss function

Figure 4: A comparison of the conventional ReLU, the teacher transform of Heo et al. [7] and our proposed method.

Based on the formulation of Section 2, we explain the teacher transform T_t, the student transform T_s and the distance function d of our proposed method. Since the feature values of the teacher are taken before ReLU, positive values carry the information utilized by the teacher while negative values do not. If a value of the teacher is positive, the student must produce the same value as the teacher. On the contrary, if a value of the teacher is negative, the student only needs to produce a value less than zero. Heo et al. [7] noted that a margin is required to drive the student's value below zero. Thus, we propose a teacher transform that preserves positive values while giving a negative margin:

\sigma_m(x) = \max(x, m).   (2)

Here, m is a margin value less than zero. We name this function margin ReLU. Several types of teacher transforms are depicted in Fig. 4. The margin ReLU is designed to give a negative margin that is easier to follow than the exact negative value of the teacher. Heo et al. set the margin to an arbitrary scalar value, which does not reflect the weight values of the teacher. In our proposed method, the margin value is defined as the channel-by-channel expectation of the negative responses, and the margin ReLU uses the value corresponding to each channel of the input. For a channel C and the i-th element F_t^i of the teacher's feature, the margin value m_C of channel C is set to an expectation value over all training images:

m_C = \mathbb{E}\big[F_t^i \mid F_t^i < 0,\ i \in C\big].   (3)

The expectation can be calculated directly during training, or it can be computed from the parameters of the preceding batch normalization layer. The margin ReLU \sigma_{m_C} is used as the teacher transform in our proposed method and produces the target feature value for the student network. For the student transform, a regressor consisting of a 1×1 convolution layer [22, 7] and a batch normalization layer is used.
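The following is a sketch of these transforms under our stated assumptions (function names margin_relu, margin_from_bn and build_regressor are ours, not the released code): margin_relu implements Eq. (2) with a per-channel margin, margin_from_bn estimates m_C of Eq. (3) from the preceding batch normalization layer by modelling each channel's pre-ReLU response as a Gaussian, and build_regressor is the 1×1 conv + BN student transform.

```python
import math
import torch
import torch.nn as nn

def margin_relu(x, margin):
    """Margin ReLU of Eq. (2): sigma_m(x) = max(x, m), applied with a
    per-channel (negative) margin tensor of shape (C,)."""
    return torch.max(x, margin.view(1, -1, 1, 1).to(x))

def margin_from_bn(bn: nn.BatchNorm2d) -> torch.Tensor:
    """Estimate m_C = E[F_t | F_t < 0] per channel from the preceding BN layer,
    assuming each channel's pre-ReLU response is N(mean, std^2) with
    mean = bn.bias and std = |bn.weight| (a modelling assumption; the paper
    also allows measuring the expectation directly during training)."""
    mean = bn.bias.detach()
    std = bn.weight.detach().abs()
    z = mean / std
    # Closed form E[x | x < 0] for x ~ N(mean, std^2):
    #   mean - std * phi(z) / Phi(-z), with phi/Phi the standard normal pdf/cdf.
    pdf = torch.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf_neg = torch.distributions.Normal(0.0, 1.0).cdf(-z).clamp_min(1e-3)
    return mean - std * pdf / cdf_neg

def build_regressor(student_channels, teacher_channels):
    """Student transform: a 1x1 convolution followed by batch normalization."""
    return nn.Sequential(
        nn.Conv2d(student_channels, teacher_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(teacher_channels),
    )
```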

We now explain our distance function d_p. Our proposed method transfers the representation before ReLU; therefore, the distance function should be designed with ReLU in mind. In the feature of the teacher, the positive responses are actually used by the network, which implies that the positive responses of the teacher should be transferred with their exact values. The negative responses are different: for a negative teacher response, if the student response is higher than the target value, it should be reduced, but if the student response is lower than the target value, it does not need to be increased, since negative values are equally blocked by ReLU regardless of their magnitude. For the feature representations of the teacher and the student, T and S, let the i-th components of the tensors be T_i and S_i. Our partial L2 distance d_p is defined as

d_p(T, S) = \sum_i \begin{cases} 0 & \text{if } S_i \le T_i \le 0, \\ (T_i - S_i)^2 & \text{otherwise,} \end{cases}   (4)

where T_i is the i-th position of the teacher's feature and S_i is the i-th position of the student's feature.
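Eq. (4) can be written in a few lines of PyTorch; the following is a minimal sketch (our own helper, summing over all components as in Eq. (4)).

```python
import torch

def partial_l2_distance(student_feature: torch.Tensor,
                        teacher_feature: torch.Tensor) -> torch.Tensor:
    """Partial L2 distance d_p of Eq. (4): the squared error is skipped for
    components where the teacher target is non-positive and the student is
    already at or below it (S_i <= T_i <= 0), since such negatives are blocked
    by ReLU anyway; all other components contribute (T_i - S_i)^2."""
    squared_error = (teacher_feature - student_feature) ** 2
    skip = (student_feature <= teacher_feature) & (teacher_feature <= 0)
    return (squared_error * (~skip).float()).sum()
```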

Our proposed method uses the margin ReLU as the teacher transform T_t, a regressor r consisting of a 1×1 convolution layer and a batch normalization layer as the student transform T_s, and the partial L2 distance d_p as the distance function. The distillation loss of the proposed method is

L_{distill} = d_p\big(\sigma_{m_C}(F_t), r(F_s)\big).   (5)

Our proposed method performs continuous distillation using the distillation loss L_{distill}. Thus, the final loss function is the sum of the distillation loss and the task loss:

L = L_{task} + \alpha\, L_{distill}.   (6)

The task loss L_{task} refers to the loss specified by the task of the network, and \alpha balances the two terms. The feature position for distillation is after the last block of each spatial size and before ReLU, as depicted in Fig. 3. In a network with a 32×32 input, such as on CIFAR [15], there are three target layers; in the case of ImageNet [23], the number of target layers is four.
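Putting Eqs. (2)-(6) together, a training-time module could look like the sketch below. It reuses the hypothetical helpers margin_relu, margin_from_bn, build_regressor and partial_l2_distance from the previous sketches; the default value of alpha and the handling of multiple target layers follow the description in the text, not the released implementation.

```python
import torch.nn as nn

class OverhaulDistillationLoss(nn.Module):
    """Sketch of the overall objective L = L_task + alpha * L_distill
    (Eqs. 5-6) over several pre-ReLU target layers."""
    def __init__(self, student_channels, teacher_channels, teacher_bns, alpha=1.0):
        super().__init__()
        # One 1x1 conv + BN regressor per target layer (student transform).
        self.regressors = nn.ModuleList(
            build_regressor(s, t) for s, t in zip(student_channels, teacher_channels))
        # Channel-wise margins are fixed properties of the frozen teacher.
        self.margins = [margin_from_bn(bn) for bn in teacher_bns]
        self.alpha = alpha  # task/distillation balance (task-dependent)

    def forward(self, student_feats, teacher_feats, task_loss):
        distill_loss = 0.0
        for regressor, margin, f_s, f_t in zip(
                self.regressors, self.margins, student_feats, teacher_feats):
            target = margin_relu(f_t.detach(), margin)              # Eq. (2)
            distill_loss = distill_loss + partial_l2_distance(
                regressor(f_s), target)                             # Eq. (5)
        return task_loss + self.alpha * distill_loss                # Eq. (6)
```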

| Setup | Compression type | Teacher network | Student network | # of params (teacher) | # of params (student) | Compression ratio |
|---|---|---|---|---|---|---|
| (a) | Depth | WideResNet 28-4 | WideResNet 16-4 | 5.87M | 2.77M | 47.2% |
| (b) | Channel | WideResNet 28-4 | WideResNet 28-2 | 5.87M | 1.47M | 25.0% |
| (c) | Depth & channel | WideResNet 28-4 | WideResNet 16-2 | 5.87M | 0.70M | 11.9% |
| (d) | Different architecture | WideResNet 28-4 | ResNet 56 | 5.87M | 0.86M | 14.7% |
| (e) | Different architecture | PyramidNet-200 (240) | WideResNet 28-4 | 26.84M | 5.87M | 21.9% |
| (f) | Different architecture | PyramidNet-200 (240) | PyramidNet-110 (84) | 26.84M | 3.91M | 14.6% |

Table 2: Experimental settings with various network architectures on CIFAR-100. Network architectures are denoted as WideResNet (depth)-(channel multiplication) for Wide Residual Networks [29] and PyramidNet-(depth) (channel factor) for PyramidNet [3].
| Setup | Teacher | Baseline | KD [8] | FitNets [22] | AT [30] | Jacobian [26] | FT [14] | AB [7] | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| (a) | 21.09 | 22.72 | 21.69 | 21.85 | 22.07 | 22.18 | 21.72 | 21.36 | 20.89 |
| (b) | 21.09 | 24.88 | 23.43 | 23.94 | 23.80 | 23.70 | 23.41 | 23.19 | 21.98 |
| (c) | 21.09 | 27.32 | 26.47 | 26.30 | 26.56 | 26.71 | 25.91 | 26.02 | 24.08 |
| (d) | 21.09 | 27.68 | 26.76 | 26.35 | 26.66 | 26.60 | 26.20 | 26.04 | 24.44 |
| (e) | 15.57 | 21.09 | 20.97 | 22.16 | 19.28 | 20.59 | 19.04 | 20.46 | 17.80 |
| (f) | 15.57 | 22.58 | 21.68 | 23.79 | 19.93 | 23.49 | 19.53 | 20.89 | 18.89 |

Table 3: Performance of various knowledge distillation methods on CIFAR-100. The measurement is the classification error rate (%); lower is better. 'Baseline' denotes the result without distillation.

3.3 Batch normalization

We further investigate batch normalization in knowledge distillation. Batch normalization [12] is used in most recent network architectures to stabilize training. A recent study on batch normalization [11] explains the difference between the training mode and the evaluation mode of batch normalization: each mode of a batch-norm layer acts differently in the network. Therefore, when performing knowledge distillation, it is necessary to consider whether to use the teacher in training mode or evaluation mode. Typically, the feature of the student is normalized batch by batch, so the feature from the teacher must be normalized in the same way. In other words, the teacher's batch normalization layers should be in training mode when distilling the information. To do this, we attach a batch normalization layer after the 1×1 convolutional layer, use it as the student transform, and bring the knowledge from the teacher in training mode. As a result, our proposed method achieves additional performance improvements. This issue holds for all knowledge distillation methods, including the proposed method. We empirically analyze various knowledge distillation methods with respect to this batch normalization issue in Section 4.5.3.
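A minimal way to realize this in PyTorch is sketched below (our own helper; the exact handling in the released code may differ): only the teacher's batch-norm layers are switched to training mode, while the teacher stays otherwise frozen.

```python
import torch.nn as nn

def freeze_teacher_with_batch_stats(teacher: nn.Module) -> nn.Module:
    """Freeze the teacher but keep its BatchNorm layers in training mode so
    that teacher features are normalized batch-by-batch, like the student's
    (Sec. 3.3)."""
    teacher.eval()
    for module in teacher.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train()         # normalize with batch statistics
            module.momentum = 0.0  # do not overwrite the teacher's running stats
    for param in teacher.parameters():
        param.requires_grad_(False)
    return teacher
```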

4 Experiments

We have evaluated the efficiency of our distillation method in several domains. The first task is classification, a fundamental problem in machine learning; as most other distillation methods report their performance in this domain, we also compare our results with theirs. The performance of knowledge distillation depends on which network architecture is used, how well the teacher performs and what kind of training scheme is used. To control these factors and make a fair comparison, we reproduced the algorithms of the other methods based on their code and papers.

4.1 CIFAR-100

CIFAR-100 [15] is the dataset that most knowledge distillation methods use to validate their performance. Composed of 50,000 images of 100 classes, we use CIFAR-100 to compare all methods under various settings. To be practically usable in any task, knowledge distillation must be applicable to any network structure. Therefore, we provide experimental results of knowledge distillation using various structures for the teacher and the student. Table 2 shows the settings of each experiment, such as the architecture used for each model, the model size and the compression ratio. The majority of our experiments use Wide Residual Networks [29], since the number of layers and the width of each layer can be easily modified. Distillation between different types of architectures is also examined in settings (d), (e) and (f). In the case of (f), the network names are similar, but the teacher is based on the bottleneck block and the student uses the basic block. Note that (e) and (f) use a teacher network, PyramidNet-200 [3], trained with the Mixup augmentation [9]. All models have been trained for 200 epochs with a learning rate of 0.1, multiplied by 0.1 at epochs 100 and 150. To produce the best results from the other methods, some algorithms [22, 14, 7] are trained with an output distillation loss [8] along with the feature distillation loss; the remaining algorithms showed better results when trained without the output distillation loss. The results of each method for every setting are presented in Table 3. Our proposed method outperforms the state of the art in the settings of depth and channel compression (a), (b), (c) and of different architectures (d), (e), (f). In particular, in the depth compression setting (a), the student network trained by our proposed method outperforms the teacher network. The proposed method consistently shows good performance regardless of the compression rate, even when distilling to a different type of network architecture. Note that the error rate of 17.8% in (e) is better than that of any network reported in the Wide Residual Network paper [29]. Therefore, our proposed method can be applied not only to small networks but also to large networks with high performance.

4.2 ImageNet

The 32×32 image size of CIFAR is not enough to represent real-world images. For this reason, we have also conducted experiments on the ImageNet dataset [23]. ImageNet includes images with an average size of 469×387, which allows us to verify distillation performance on large images. In this paper, we use the dataset from ILSVRC 2015 [23], which consists of 1.2 million training images and 50 thousand validation images. Images are cropped to a size of 224×224 for training and evaluation. The student network is trained for 100 epochs, and the learning rate begins at 0.1 and is multiplied by 0.1 every 30 epochs. For a fair comparison and simple reproduction, we used the pre-trained models in the PyTorch [21] library as teacher networks.

The experiments have been conducted on two pairs of networks. The first is distillation from ResNet152 [5] to ResNet50, and the second is distillation from ResNet50 to MobileNet [10]. In this section, we present the results of the three latest algorithms [30, 14, 7], which showed the best results in the previous subsection. The results are reported in Table 4. Our proposed method shows a large improvement. In particular, our method makes ResNet50 perform better than ResNet152, which is a remarkable achievement. In addition, it shows a considerable improvement on the recently proposed lightweight architecture, MobileNet. In the case of MobileNet, it is hard to reproduce the performance reported in its paper (29.4) because the training scheme, such as the number of training epochs, is not reported; thus, we measured the performance in a standard setting.

| Network | # of params (ratio) | Method | Top-1 error (%) | Top-5 error (%) |
|---|---|---|---|---|
| ResNet152 | 60.19M | Teacher | 21.69 | 5.95 |
| ResNet50 | 25.56M (42.5%) | Baseline | 23.72 | 6.97 |
| | | AT [30] | 22.75 | 6.35 |
| | | FT [14] | 22.80 | 6.49 |
| | | AB [7] | 23.47 | 6.94 |
| | | Proposed | 21.65 | 5.83 |
| ResNet50 | 25.56M | Teacher | 23.84 | 7.14 |
| MobileNet | 4.23M (16.5%) | Baseline | 31.13 | 11.24 |
| | | AT [30] | 30.44 | 10.67 |
| | | FT [14] | 30.12 | 10.50 |
| | | AB [7] | 31.11 | 11.29 |
| | | Proposed | 28.75 | 9.66 |

Table 4: Results on the ILSVRC 2015 validation set. Networks are trained and evaluated at 224×224 with a single crop. 'Baseline' denotes the result without distillation.

4.3 Object detection

| Network | # of params | Method | mAP (%) |
|---|---|---|---|
| ResNet50-SSD | 36.7M | Teacher (T1) | 76.79 |
| VGG-SSD | 26.3M | Teacher (T2) | 77.50 |
| ResNet18-SSD | 20.0M | Baseline | 71.61 |
| | | Proposed-T1 | 73.08 |
| | | Proposed-T2 | 72.38 |
| MobileNet-SSD lite | 6.5M | Baseline | 67.58 |
| | | Proposed-T1 | 68.54 |
| | | Proposed-T2 | 68.45 |

Table 5: Object detection results of SSD300 [18] on the PASCAL VOC 2007 test set [2]. Results are reported as mean Average Precision (mAP). Higher is better.

In this section, we apply our method to other computer vision tasks. The first is object detection, one of the most frequently used applications of neural networks. Since the purpose of distillation is to improve speed, we applied our proposed method to a high-speed detector, the Single Shot Detector (SSD) [18]. Networks are trained on the union of the VOC 2007 and VOC 2012 [2] trainval sets, which is widely used in object detection. The backbone network in all models is pre-trained on the ImageNet dataset. Networks were trained for 120k iterations with a batch size of 32. To show the improvement brought by our method, we set the SSD trained without distillation as the baseline, referred to as 'Baseline' in Table 5. SSD detectors based on ResNet50 [5] or VGG [25] are used as the teacher networks to examine the effect of the teacher architecture. As student networks, SSD based on ResNet18 and SSD lite based on MobileNet [10] are used.

Detection performance has been evaluated on the VOC 2007 test set, and all results are presented in Table 5. In the case of ResNet18, the performance improvement of Proposed-T1, which uses the ResNet teacher, is larger than that of Proposed-T2, which uses the VGG teacher. Though both settings outperform the baseline, distillation between similar structures shows a better quality of knowledge transfer. In the case of MobileNet, our proposed method shows a consistent performance improvement regardless of the type of the teacher. The student models in all experiments improve in performance, which implies that our method can be applied to any SSD-based object detector.

4.4 Semantic segmentation

In this section, we verify the performance of our proposed method on semantic segmentation. Applying distillation to semantic segmentation is challenging since the output size is much larger than in other tasks. We select a recent model, DeepLabV3+ [1], as our base model for semantic segmentation. DeepLabV3+ based on ResNet101 is used as the teacher network, and DeepLabV3+ based on MobileNetV2 [24] or ResNet18 [5] is used as the student network. Experiments are performed on the PASCAL VOC 2012 segmentation dataset [2]. We also use the augmented dataset provided by the extra annotations of [4], as in the baseline paper [1]. All models are trained for 50 epochs, and the learning rate schedule follows the baseline paper [1]. As in the detection task, the student network is initialized with a network pre-trained on ImageNet without distillation. Results are presented in Table 6. Our proposed method significantly improves the performance of ResNet18 and MobileNetV2. For MobileNetV2 in particular, our proposed method improves performance by almost 3 points of mIoU and contributes to reducing the computation of the segmentation algorithm. We have shown that our proposed method can be applied to image classification, object detection and semantic segmentation. Being applicable to many tasks without major changes is an advantage of feature distillation and shows that our proposed method has a wide range of applications.

| Backbone | # of params (ratio) | Method | mIoU |
|---|---|---|---|
| ResNet101 | 59.3M | Teacher | 77.39 |
| ResNet18 | 16.6M (28.0%) | Baseline | 71.79 |
| | | Proposed | 73.24 |
| MobileNetV2 | 5.8M (9.8%) | Baseline | 68.44 |
| | | Proposed | 71.36 |

Table 6: Semantic segmentation with DeepLabV3+ [1] on the PASCAL VOC 2012 test set [2]. Performance is measured as mean Intersection over Union (mIoU).

4.5 Analysis

We analyze possible factors that would have led to the performance improvement of our proposed method. The first analysis is the output similarity between the teacher and the student learned by distillation; by this, we verify how well our method forces the student to follow the teacher. After that, we provide an ablation study of our proposed method, measuring how much each component contributes to the performance. Finally, we discuss how the mode of batch normalization affects knowledge distillation, as mentioned in Section 3.3. All experiments are based on setting (c) of Table 2.

4.5.1 Teacher-student similarity

| Method | KL divergence with teacher | Cross-entropy with GT | Error rate (%) |
|---|---|---|---|
| Baseline | 0.7318 | 1.0741 | 27.32 |
| KD [8] | 0.7064 | 1.0758 | 26.47 |
| FitNets [22] | 0.7993 | 1.1585 | 26.30 |
| AT [30] | 0.7047 | 1.0303 | 26.56 |
| Jacobian [26] | 0.7122 | 1.0495 | 26.71 |
| FT [14] | 0.6872 | 1.0561 | 25.91 |
| AB [7] | 0.7555 | 1.1197 | 26.02 |
| Proposed | 0.5723 | 0.9585 | 24.08 |

Table 7: Output similarity between teacher and student on the test set of CIFAR-100.

| Mode of batch-norm | KD [8] | FitNets [22] | AT [30] | Jacobian [26] | FT [14] | AB [7] | Proposed |
|---|---|---|---|---|---|---|---|
| Training mode | 26.47 | 26.61 | 26.56 | 26.71 | 25.91 | 26.02 | 24.08 |
| Evaluation mode | 26.45 | 26.92 | 26.42 | 26.75 | 26.15 | 26.36 | 24.54 |

Table 8: Analysis of the mode of batch normalization in the teacher network on CIFAR-100. The table shows the error rate (%). The first row shows the results with the teacher's batch-norm in training mode, and the second row with the batch-norm in evaluation mode.

KD [8] forces the output of the student to be similar to the output of the teacher. The purpose of output distillation is quite intuitive: if a student produces an output similar to that of the teacher, its performance will also be similar. However, in the case of feature distillation, it is necessary to investigate how the output of the student changes. To see how well the student mimics the teacher, we measure the similarity of the teacher's and student's outputs under a consistent setting. On the test set of CIFAR-100, we measure the KL divergence between the teacher and student outputs. The cross-entropy with the ground truth has also been measured, since classification performance also contributes to the reduction of KL divergence. Results are presented in Table 7. Methods that apply distillation only in the early stage of training, i.e., initial distillation (FitNets [22], AB [7]), increase the KL divergence with both the teacher and the ground truth. With this result, it is hard to say that the student networks of these methods are mimicking their teacher networks. Meanwhile, distillation methods with continuous distillation as in Eq. 6 (KD [8], AT [30], Jacobian [26], FT [14]), as well as our proposed method, reduce the KL divergence, which implies that the similarity between the teacher and student is relatively high. Specifically, our method shows considerably higher similarity than the other continuous distillation methods. In other words, our proposed method trains the student to produce the output most similar to that of the teacher, and this similarity is one of the main reasons for the improved performance of our proposed method.

4.5.2 Ablation study

| Proposed | w/o continuous | w/o position | w/o mReLU | w/o partial L2 |
|---|---|---|---|---|
| 24.08 | 26.01 | 26.23 | 24.63 | 24.51 |

Table 9: Ablation study of the proposed method. Results are presented as error rate (%).

As shown in Table 1, our proposed method consists of several components. We investigate the contribution of each component with an ablation study: from our proposed method, we exclude one component at a time and measure the resulting performance. The results are shown in Table 9. Replacing continuous distillation with initial distillation, which applies distillation only as network initialization as in [22, 28, 7], degrades the performance. Changing the feature position where distillation occurs to the end of the group also shows a significant degradation in performance; this result shows that using the feature before ReLU is necessary for our margin ReLU and partial L2 distance. Using a conventional ReLU instead of the margin ReLU (Eq. 3) also decreases the performance. Further, controlling the negative values of the student with a classical L2 distance degrades performance. We can see that if any component is omitted, the student network suffers a performance degradation, especially when the distillation type or the distillation feature position is changed. In conclusion, the combination of all proposed components leads to the significant performance improvement of the proposed method.

4.5.3 Batch normalization

In Section 3.3, we mentioned the issue of the mode of batch normalization in knowledge distillation. To investigate this, we measure the performance variation of knowledge distillation methods when the mode of the teacher's batch-norm layers is changed. The experimental results are shown in Table 8. The distillation methods that use additional information other than features (KD [8], Jacobian [26]) show marginal differences between the two modes of batch normalization. AT [30], which uses a diminished feature for distillation, shows a better result in evaluation mode. However, methods that do not squeeze the feature (FitNets [22], FT [14], AB [7]) consistently work better in training mode; our method in particular shows a substantial improvement when using training mode. Note that all experiments in the previous sections use the better mode of the batch-norm layer, as none of the papers mentions this issue. In conclusion, an appropriate mode of batch normalization should be carefully chosen in many distillation methods, including ours.

4.6 Implementation details

We describe our implementation details to help reproduce our proposed method. In the loss function of our method in Eq. 5, we sum the values over the entire layer rather than averaging them. When moving from the deepest target layer toward shallower ones, the total size of the feature doubles as the spatial resolution increases; accordingly, the loss of each shallower layer is divided in half. The weight \alpha in Eq. 6 is set separately for CIFAR [15], for ImageNet [23] and detection, and for segmentation. For CIFAR, we used a batch size of 128 for settings (a) to (d) in Table 2 and a batch size of 64 for (e) and (f). Every experiment on CIFAR has been performed 5 times and the median value is reported. The detector has extra layers behind the backbone, and all extra layers were also used for distillation. Detectors were trained for 120k iterations with a batch size of 32, with the learning rate multiplied by 0.1 at iterations 80k and 100k. In the case of segmentation, we use an additional distillation layer at the atrous spatial pyramid pooling and a layer just before the output layer. The output stride was set to 16, and dropout layers were not used for distillation. When using a pre-trained student network, we initialized the student transform at the start of training; this initialization proceeded for 500 iterations for detection and 1 epoch for segmentation. All experiments were implemented and evaluated on the NAVER Smart Machine Learning (NSML) [13] platform with PyTorch [21].
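As a sketch of the layer weighting described above (our interpretation of the stated halving rule, not the released code), the per-layer partial-L2 losses can be combined as follows, with shallower (higher-resolution) target layers down-weighted by successive factors of two.

```python
def combine_layer_losses(per_layer_losses):
    """per_layer_losses: distillation losses ordered from the shallowest to the
    deepest target layer; each shallower layer gets half the weight of the
    next deeper one (interpretation of the halving rule in Sec. 4.6)."""
    num_layers = len(per_layer_losses)
    total = 0.0
    for depth_index, layer_loss in enumerate(per_layer_losses):
        total = total + layer_loss / (2 ** (num_layers - 1 - depth_index))
    return total
```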

5 Conclusion

We propose a new knowledge distillation method along with several investigations of various aspects of existing feature distillation methods. We have discovered the effectiveness of the pre-ReLU location and proposed a new loss function to improve the performance of feature distillation. The new loss function consists of a new teacher transform (margin ReLU) and a new distance function (partial L2), and enables effective feature distillation at the pre-ReLU location. We have also investigated the mode of batch normalization in the teacher network and achieved additional performance improvements. Through experiments, we examined the performance of the proposed method with various networks on various tasks, and showed that the proposed method substantially outperforms the state of the art in feature distillation.

Acknowledgement

We appreciate the support of the Clova AI members, especially Nigel Fernandez for proofreading the manuscript, Dongyoon Han for help with the implementation, and Jung-Woo Ha for insightful comments. We also thank the NSML team for providing an excellent experiment platform.

References

  • [1] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, Jan. 2015.
  • [3] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [4] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
  • [7] B. Heo, M. Lee, S. Yun, and J. Y. Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [9] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • [11] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
  • [13] H. Kim, M. Kim, D. Seo, J. Kim, H. Park, S. Park, H. Jo, K. Kim, Y. Yang, Y. Kim, N. Sung, and J. Ha. NSML: Meet the MLaaS platform with a real-world case study. CoRR, abs/1810.09957, 2018.
  • [14] J. Kim, S. Park, and N. Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [15] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [16] A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Technical report, 2010.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
  • [19] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
  • [20] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
  • [22] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.
  • [24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • [26] S. Srinivas and F. Fleuret. Knowledge transfer with Jacobian matching. In International Conference on Machine Learning (ICML), 2018.
  • [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [28] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [29] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.
  • [30] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.