Following remarkable advances in many machine learning tasks achieved with neural networks, researchers have started to work on network compression and enhancement. Several approaches such as model pruning, model quantization and knowledge distillation have been suggested to make models smaller and more cost-efficient. Among them, knowledge distillation is being actively investigated. Knowledge distillation refers to a method that guides the training of a smaller network (student) under the supervision of a larger network (teacher). Unlike other compression methods, it can downsize a network regardless of the structural difference between the teacher and the student network. By allowing this architectural flexibility, knowledge distillation is emerging as a next-generation approach to network compression.
Hinton et al. proposed a knowledge distillation (KD) method using the softmax output of the teacher network. This method can be applied to any pair of network architectures since the dimensions of the two outputs are the same. However, the output of a high-performance teacher network is not significantly different from the ground truth. Thus, transferring only the output is similar to training the student with the ground truth, which limits the performance of output distillation. To make better use of the information contained in the teacher network, several approaches have proposed feature distillation instead of output distillation. FitNets proposed a method that encourages the student network to mimic the hidden feature values of the teacher network. Although feature distillation was a promising approach, the performance improvement achieved by FitNets was not significant.
After FitNets, variant methods of feature distillation have been proposed as follows. The methods in [30, 28] transform the feature into a representation of reduced dimension and transfer it to the student. In spite of the reduced dimension, it has been reported that the abstracted feature representation leads to improved performance. Recent methods (FT, AB) have been proposed to increase the amount of transferred information in distillation. FT encodes the feature into a 'factor' using an auto-encoder to alleviate the leakage of information. AB focuses on the activation of a network, with only the sign of features being transferred. Both methods show better distillation performance by increasing the amount of transferred information. However, FT and AB deform the feature values of the teacher, which leaves further room for improvement.
In this paper, we further improve the performance of feature distillation by proposing a new feature distillation loss designed through an investigation of various design aspects: teacher transform, student transform, distillation feature position and distance function. Our design concept is to distinguish beneficial information from adverse information so as to enhance feature distillation performance. The distillation loss is designed to transfer only the beneficial teacher information to the student. To this end, we propose a new ReLU variant for our method, move the distillation feature position to the front of ReLU, and use a partial distance function to skip the distillation of adverse information. The proposed loss significantly improves the performance of feature distillation. In our experiments, we have evaluated the proposed method in various domains including classification (CIFAR, ImageNet), object detection (PASCAL VOC) and semantic segmentation (PASCAL VOC). As shown in Fig. 1, the proposed method outperforms the existing state-of-the-art methods and even the teacher model.
| Method | Teacher transform | Student transform | Distillation position | Distance | Missing information |
|---|---|---|---|---|---|
| FitNets | None | 1×1 conv | Mid layer | L2 | None |
| AT | Attention | Attention | End of group | L2 | Channel dims |
| FSP | Correlation | Correlation | End of group | L2 | Spatial dims |
| Jacobian | Gradient | Gradient | End of group | L2 | Channel dims |
| FT | Auto-encoder | Auto-encoder | End of group | L1 | Auto-encoded |
| AB | Binarization | 1×1 conv | Pre-ReLU | Marginal L2 | Feature values |
| Proposed | Margin ReLU | 1×1 conv | Pre-ReLU | Partial L2 | Negative features |
In this section, we investigate the design aspects of feature distillation methods for network compression and present the novel aspects of our approach that distinguish it from the preceding methods. First, we describe a general form of the loss function in feature distillation. As shown in Fig. 2, the feature of the teacher network is denoted as $F_t$ and the feature of the student network as $F_s$. To match the feature dimensions, we transform the features with $T_t$ and $T_s$, respectively. A distance $d$ between the transformed features is used as the loss function $L_{distill}$. In other words, the loss function of feature distillation is generalized as

$$L_{distill} = d\big(T_t(F_t),\, T_s(F_s)\big).$$
The student network is trained by minimizing the distillation loss $L_{distill}$.
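As a concrete sketch, the general form above can be written directly in PyTorch (a minimal illustration; the identity transforms and MSE distance used in the toy example are placeholders, not the paper's specific choices):

```python
import torch
import torch.nn as nn

def feature_distill_loss(f_t, f_s, teacher_transform, student_transform, distance):
    # General form of feature distillation: d(T_t(F_t), T_s(F_s))
    return distance(teacher_transform(f_t), student_transform(f_s))

# Toy usage: identity transforms and an L2 (MSE) distance.
f_t = torch.ones(2, 8)
f_s = torch.zeros(2, 8)
loss = feature_distill_loss(f_t, f_s, nn.Identity(), nn.Identity(), nn.MSELoss())
```

Each concrete method in Table 1 is an instance of this template with its own choice of the two transforms and the distance.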
It is desirable to design the distillation loss to transfer all feature information without missing any important information from the teacher. To achieve this, we aim to design a new feature distillation loss in which as much of the teacher's important information as possible is transferred, so as to improve the distillation performance. To this end, we analyze the design aspects of the feature distillation loss. As described in Table 1, these design aspects fall into four categories: teacher transform, student transform, distillation feature position and distance function.
Teacher transform. A teacher transform converts the teacher's hidden features into an easy-to-transfer form. It is an important part of feature distillation and also a main cause of information being missed during distillation. AT, FSP and Jacobian reduce the dimension of the feature vector via the teacher transform, which causes serious information loss. FT uses a compression ratio determined by the user, and AB utilizes the original feature in the form of binarized values; both methods thus use features different from the original ones. Except for FitNets, most teacher transforms of existing approaches lose information from the teacher's feature used in the distillation loss. Since features include both adverse and beneficial information, it is important to distinguish them and avoid losing the beneficial information. In the proposed method, we use a new ReLU activation, called margin ReLU, for the teacher transform. In our margin ReLU, the positive (beneficial) information is used without any transformation while the negative (adverse) information is suppressed. As a result, the proposed method can perform distillation without losing the beneficial information.
Student transform. Typically, the student transform uses the same function as the teacher transform. In methods like AT, FSP, Jacobian and FT, the same amount of information is therefore lost in both the teacher transform and the student transform. FitNets and AB do not reduce the dimension of the teacher's feature and instead use a 1×1 convolution layer as the student transform to match the feature dimension with the teacher. In this case, the feature size of the student does not decrease, but rather increases, so no information is lost. In our method, we adopt this asymmetric format, using a 1×1 convolution as the student transform.
Distillation feature position. Besides the type of feature transformation, we should be careful in picking the location at which distillation occurs. FitNets uses the end of an arbitrary middle layer as the distillation point, which has been shown to perform poorly. We refer to a group of layers with the same spatial size as a layer group [29, 3]. In AT, FSP and Jacobian, the distillation point lies at the end of each layer group, whereas in FT it lies at the end of only the last layer group. This leads to better results than FitNets but still neglects how ReLU acts on the teacher's values. ReLU allows the beneficial (positive) information to pass through and filters out the adverse (negative) information. Therefore, knowledge distillation should be designed with this information dissolution in mind. In our method, we design the distillation loss to use the features in front of the ReLU function, called pre-ReLU. Both positive and negative values are preserved at the pre-ReLU position without any deformation, making it well suited for distillation.
Distance function. Most distillation methods naively adopt the $L_1$ or $L_2$ distance. In our method, however, we need a distance function appropriate for our teacher transform and for our distillation point at the pre-ReLU position. In our design, the pre-ReLU information is transferred from teacher to student, but the negative values of the pre-ReLU feature contain adverse information: they are blocked by the ReLU activation and never used by the teacher network. Transferring all values exactly may therefore have an adverse effect on the student network. To handle this issue, we propose a new distance function, called the partial $L_2$ distance, which is designed to skip the distillation of information in the negative region.
In this section, we describe our distillation method following the formulation of Section 2. We first describe the location where distillation occurs in our method and then explain the newly designed loss function.
3.1 Distillation position
The activation function is a crucial component of neural networks: the non-linearity of a neural network is attributable to this function, and the type of activation function significantly influences the performance of the model. Among various activation functions, the rectified linear unit (ReLU) is used in most computer vision tasks. Most networks [17, 25, 27, 5, 6, 29, 3, 10, 24] use ReLU or modified versions very similar to it [19, 16]. ReLU applies a linear mapping to positive values. Negative values are eliminated and fixed to zero, which prevents unnecessary information from propagating backward. With a careful design of knowledge distillation that takes ReLU into account, it is possible to transfer only the necessary information. Unfortunately, most preceding research does not seriously consider the activation function. We define the minimum unit of a network, such as the residual block in ResNet and the Conv-ReLU in VGG, as a layer block. Distillation in most methods occurs at the end of the layer block, regardless of whether this falls before or after ReLU.
In our proposed method, the position of the distillation lies between the first ReLU and the end of the layer block. This positioning enables the student to reach the teacher's preserved information before it passes through ReLU. Fig. 3 depicts the distillation position for several architectures. In the case of the simple block [17, 25, 27, 10] and the residual block, whether the distillation happens before or after the ReLU is the only difference between our proposed method and other methods. However, for networks using pre-activation [6, 29], the difference is larger: since there is no ReLU at the end of each block, our method has to find the ReLU in the next block. In a structure like PyramidNet [3, 24], our proposed method can reach the ReLU after the 1×1 convolution layer. Though the positioning strategy may be complicated depending on the architecture, it has a significant influence on performance. Our new distillation position significantly improves the performance of the student, as demonstrated in our experiments.
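In PyTorch, pre-ReLU features can be captured without modifying the architecture by attaching a forward pre-hook to the block's ReLU (a sketch; the toy block below is a hypothetical stand-in, not one of the paper's networks):

```python
import torch
import torch.nn as nn

def capture_pre_relu(block, store, key='feat'):
    # Register a pre-hook on the first ReLU found inside the block; the hook
    # records the ReLU's *input*, i.e. the pre-ReLU (pre-activation) feature.
    relu = next(m for m in block.modules() if isinstance(m, nn.ReLU))
    def hook(module, inputs):
        store[key] = inputs[0].detach()
    return relu.register_forward_pre_hook(hook)

# Toy block: the captured feature keeps negative values, the output does not.
block = nn.Sequential(nn.Identity(), nn.ReLU())
store = {}
handle = capture_pre_relu(block, store)
x = torch.tensor([-1.0, 2.0])
out = block(x)
handle.remove()
```

For pre-activation or PyramidNet-style blocks, the same hook would simply be attached to the ReLU found in the next block, mirroring the positioning strategy described above.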
3.2 Loss function
Based on the formulation of Section 2, we explain the teacher transform $T_t$, the student transform $T_s$ and the distance function $d$ of our proposed method. Since the feature values of the teacher are taken before ReLU, positive values carry the information utilized by the teacher while negative values do not. If a value in the teacher is positive, the student should produce the same value as the teacher. On the contrary, if a value of the teacher is negative, the student should produce a value less than zero. Heo et al. noted that a margin is required to make the student's value less than zero. Thus, we propose a teacher transform that preserves positive values while giving a negative margin:

$$\sigma_m(x) = \max(x,\, m).$$
Here, $m < 0$ is the margin value. We name this function the margin ReLU. Several types of teacher transforms are depicted in Fig. 4. The margin ReLU is designed to give a negative margin that is easier for the student to follow than the exact negative value of the teacher. Heo et al. set the margin to an arbitrary scalar value, which does not reflect the weight values of the teacher. In our proposed method, the margin is defined as the channel-by-channel expectation of the negative responses, and the margin ReLU uses the value corresponding to each channel of the input. For a channel $\mathcal{C}$ and the $i$-th element $F_t^i$ of the teacher's feature, the margin value $m_{\mathcal{C}}$ of channel $\mathcal{C}$ is set to the expectation over all training images:

$$m_{\mathcal{C}} = E\big[F_t^i \mid F_t^i < 0,\ i \in \mathcal{C}\big].$$
This expectation can be calculated directly during training, or from the parameters of the preceding batch normalization layer. The margin ReLU $\sigma_{m_{\mathcal{C}}}$ is used as the teacher transform in our proposed method and produces the target feature values for the student network. For the student transform, a regressor consisting of a 1×1 convolution layer [22, 7] and a batch normalization layer is used.
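A sketch of the margin ReLU and a BN-derived margin is given below. It assumes the pre-ReLU feature behind a BatchNorm layer is roughly Gaussian per channel with mean $\beta$ (`bn.bias`) and standard deviation $|\gamma|$ (`bn.weight`), and applies the closed-form conditional expectation of a truncated normal; this parameterization is our illustration of "calculated using parameters of the previous batch normalization layer", not verbatim from the paper:

```python
import torch

def bn_channel_margin(bn: torch.nn.BatchNorm2d) -> torch.Tensor:
    # Margin m_C = E[x | x < 0] per channel, assuming the BN output is
    # Gaussian with mean beta (bn.bias) and std |gamma| (bn.weight).
    mean = bn.bias.detach()
    std = bn.weight.detach().abs() + 1e-8
    z = mean / std
    normal = torch.distributions.Normal(0.0, 1.0)
    pdf = torch.exp(normal.log_prob(z))
    cdf_neg = normal.cdf(-z)  # P(x < 0)
    # Truncated-normal identity: E[x | x < 0] = mean - std * phi(z) / Phi(-z)
    margin = mean - std * pdf / (cdf_neg + 1e-8)
    return torch.clamp(margin, max=0.0)  # the margin must be non-positive

def margin_relu(x: torch.Tensor, margin: torch.Tensor) -> torch.Tensor:
    # sigma_m(x) = max(x, m), applied with a per-channel margin m < 0.
    return torch.max(x, margin.view(1, -1, 1, 1))

# Toy usage: a fresh BN layer (beta=0, gamma=1) gives m = -phi(0)/Phi(0).
bn = torch.nn.BatchNorm2d(3)
m = bn_channel_margin(bn)
y = margin_relu(torch.full((1, 3, 1, 1), -5.0), m)
```

Positive teacher values pass through unchanged, while strongly negative values are lifted to the per-channel margin, matching the intent of the teacher transform above.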
We now explain our distance function $d_p$. Our proposed method transfers the representation before ReLU; therefore, the distance function should be designed with ReLU in mind. In the teacher's feature, the positive responses are actually used by the network, which implies that they should be transferred at their exact values. The negative responses, however, are not. For a negative teacher response, if the student response is higher than the target value it should be reduced, but if the student response is lower than the target value it does not need to be increased, since negative values are equally blocked by ReLU regardless of their magnitude. For the feature representations of the teacher and student, $T$ and $S$, let the $i$-th components of the tensors be $T_i$ and $S_i$. Our partial $L_2$ distance ($d_p$) is defined as

$$d_p(T, S) = \sum_i \begin{cases} 0 & \text{if } S_i \le T_i \le 0, \\ (T_i - S_i)^2 & \text{otherwise,} \end{cases}$$

where $T_i$ is the teacher's value and $S_i$ is the student's value at position $i$.
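The partial $L_2$ distance above can be sketched directly in PyTorch:

```python
import torch

def partial_l2(teacher: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    # Skip positions where the teacher target is non-positive and the student
    # response is already below it (S_i <= T_i <= 0); use squared error elsewhere.
    skip = (teacher <= 0) & (student <= teacher)
    return ((teacher - student) ** 2)[~skip].sum()

# Toy usage: the first position is skipped, the second contributes (2-1)^2 = 1.
t = torch.tensor([-1.0, 2.0])
s = torch.tensor([-3.0, 1.0])
d = partial_l2(t, s)
```

Because the skipped positions contribute no gradient, the student is free to push already-sufficiently-negative responses anywhere below the target without penalty.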
Our proposed method uses the margin ReLU $\sigma_{m_{\mathcal{C}}}$ as the teacher transform, a regressor $r$ consisting of a 1×1 convolution layer as the student transform, and the partial $L_2$ distance ($d_p$) as the distance function. The distillation loss of the proposed method is:

$$L_{distill} = d_p\big(\sigma_{m_{\mathcal{C}}}(F_t),\, r(F_s)\big).$$
Our proposed method performs continuous distillation using the distillation loss $L_{distill}$, i.e., the distillation loss is applied jointly with the task loss throughout training. Thus, the final loss function is the sum of the distillation loss and the task loss:

$$L = L_{task} + \alpha\, L_{distill}.$$
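A sketch of the loss for one training step under this combined objective. The cross-entropy task loss and the balancing weight `alpha` are illustrative assumptions for a classification setting; `distance` stands in for the distillation distance applied to each target layer pair:

```python
import torch
import torch.nn.functional as F

def training_step_loss(logits, labels, teacher_feats, student_feats,
                       distance, alpha=1.0):
    # Continuous distillation: the task loss plus the distillation loss summed
    # over all target layer pairs, weighted by alpha.
    task = F.cross_entropy(logits, labels)
    distill = sum(distance(t, s) for t, s in zip(teacher_feats, student_feats))
    return task + alpha * distill

# Toy usage with identical features, so the distillation term is zero and the
# total reduces to the cross-entropy of uniform logits over 3 classes (ln 3).
l2 = lambda t, s: ((t - s) ** 2).sum()
feats = [torch.ones(2, 4)]
loss = training_step_loss(torch.zeros(2, 3), torch.tensor([0, 1]), feats, feats, l2)
```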
The task loss refers to the loss specified by the task of the network. The feature position for distillation is after the last block of each spatial size and before ReLU, as depicted in Fig. 3. In a network with a 32×32 input, such as for CIFAR, there are three target layers; in the case of ImageNet, the number of target layers is four.
| Setup | Compression type | Teacher network | Student network | Teacher params | Student params | Compression ratio |
|---|---|---|---|---|---|---|
| (a) | Depth | WideResNet 28-4 | WideResNet 16-4 | 5.87M | 2.77M | 47.2% |
| (b) | Channel | WideResNet 28-4 | WideResNet 28-2 | 5.87M | 1.47M | 25.0% |
| (c) | Depth & channel | WideResNet 28-4 | WideResNet 16-2 | 5.87M | 0.70M | 11.9% |
| (d) | Different architecture | WideResNet 28-4 | ResNet 56 | 5.87M | 0.86M | 14.7% |
| (e) | Different architecture | PyramidNet-200 (240) | WideResNet 28-4 | 26.84M | 5.87M | 21.9% |
| (f) | Different architecture | PyramidNet-200 (240) | PyramidNet-110 (84) | 26.84M | 3.91M | 14.6% |
|Setup||Teacher||Baseline||KD ||FitNets ||AT ||Jacobian ||FT ||AB ||Proposed|
3.3 Batch normalization
We further investigate batch normalization in knowledge distillation. Batch normalization is used in most recent network architectures to stabilize training. A recent study on batch normalization explains the difference between the training mode and the evaluation mode of batch normalization; each mode of a batch-norm layer acts differently in the network. Therefore, when performing knowledge distillation, it is necessary to consider whether to use the teacher in training mode or evaluation mode. Typically, the feature of the student is normalized batch by batch, so the feature from the teacher should be normalized in the same way. In other words, the teacher's batch normalization layers should be in training mode when distilling the information. To do this, we attach a batch normalization layer after the 1×1 convolution layer, use it as part of the student transform, and query the teacher in training mode. As a result, our proposed method achieves additional performance improvements. This issue holds for all knowledge distillation methods, not only the proposed one. We empirically analyze various knowledge distillation methods with respect to this batch normalization issue in Section 4.5.3.
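A minimal sketch of querying the teacher with its batch-norm layers in training mode while keeping gradients off the teacher. The function and module names are illustrative; `teacher` and `student` stand for any modules containing BN:

```python
import torch
import torch.nn as nn

def forward_for_distillation(teacher, student, x):
    # Keep the teacher's batch-norm in *training* mode so its features are
    # normalized with batch statistics, just like the student's, while
    # blocking gradients into the teacher. Note: train mode also updates the
    # teacher's running statistics; freeze them separately if undesired.
    teacher.train()
    with torch.no_grad():
        f_t = teacher(x)
    student.train()
    f_s = student(x)
    return f_t, f_s

# Toy usage with single-BN "networks".
teacher = nn.BatchNorm1d(4)
student = nn.BatchNorm1d(4)
f_t, f_s = forward_for_distillation(teacher, student, torch.randn(8, 4))
```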
We have evaluated the efficiency of our distillation method in several domains. The first task is classification, a fundamental problem in machine learning. As most other distillation methods have reported their performance in this domain, we have also compared our results to theirs. The performance of knowledge distillation depends on which network architecture is used, how well the teacher performs and what kind of training scheme is used. To control these factors and make a fair comparison, we reproduced the algorithms of the other methods based on their code and papers.
CIFAR-100 is the dataset most knowledge distillation methods use to validate their performance. Composed of 50,000 training images over 100 classes, CIFAR-100 is used here to compare all methods under various settings. To be practically usable in any task, knowledge distillation must be applicable to any network structure; therefore, we provide experimental results of knowledge distillation using various structures of teacher and student. Table 2 shows the settings of each experiment, such as the architectures used for each model, model sizes and compression rates. The majority of our experiments utilize Wide Residual Network, since the number of layers and the width of each layer can be easily modified. Distillation between different types of architectures has also been examined with settings (d), (e) and (f). In the case of (f), the network names are similar, but the teacher is based on the bottleneck block while the student uses the basic block. Note that (e) and (f) use a teacher network, PyramidNet-200, trained with the Mixup augmentation. All models have been trained for 200 epochs with a learning rate of 0.1, multiplied by 0.1 at epochs 100 and 150. To produce the best results from the other methods, some algorithms [22, 14, 7] are trained with an output distillation loss along with the feature distillation loss; the rest of the algorithms showed better results when trained without the output distillation loss. The results of each method in every setting are presented in Table 3. Our proposed method outperforms the state of the art in the settings of depth and channel compression (a), (b), (c) and of different architectures (d), (e), (f). Especially, in the depth-compression setting (a), the student network trained by our proposed method outperforms the teacher network. The proposed method consistently shows good performance regardless of the compression rate, even when distilling to a different type of network architecture. Note that the error rate of 17.8% in (e) is better than that of any network reported in the Wide Residual Network paper. Therefore, our proposed method can be applied not only to small networks but also to large networks with high performance.
The 32×32 images of CIFAR are not enough to represent real-world images. For this reason, we have conducted experiments on the ImageNet dataset as well. ImageNet includes images with an average size of 469×387, which allows us to verify distillation performance on large images. In this paper, we have used the dataset of ILSVRC 2015, which consists of 1.2 million training images and 50 thousand validation images. Images are cropped to the size of 224×224 for training and evaluation. The student network is trained for 100 epochs, and the learning rate begins at 0.1 and is multiplied by 0.1 at every 30th epoch. For a fair comparison and simple reproduction, we used the pre-trained model in the PyTorch library as the teacher network.
The experiments have been conducted on two pairs of networks: distillation from ResNet152 to ResNet50, and distillation from ResNet50 to MobileNet. In this section, we present the results of the three latest algorithms [30, 14, 7], which showed the best results in the previous subsection. The results are presented in Table 4. Our proposed method shows a great improvement. In particular, our method makes ResNet50 perform better than ResNet152, which is a remarkable achievement. In addition, it shows a considerable improvement on the recently proposed lightweight architecture, MobileNet. In the case of MobileNet, it is hard to reproduce the performance reported in its paper (29.4) because the training scheme, such as the number of training epochs, is not reported; we therefore measured the performance in a standard setting.
4.3 Object detection
|Network||# of params||Method||mAP(%)|
In this section, we apply our method to other computer vision tasks. The first is object detection, one of the most frequently used applications of neural networks. Since a purpose of distillation is to improve speed, we applied our proposed method to a high-speed detector, the Single Shot Detector (SSD). Networks are trained on a mixture of the VOC2007 and VOC2012 trainval sets, which are widely used in object detection. The backbone network in all models is pre-trained on the ImageNet dataset. Networks have been trained for 120k iterations with a batch size of 32. To show the improvement of our method, we set the SSD trained with no distillation as the baseline, referred to as 'Baseline' in Table 5. An SSD detector based on ResNet50 or VGG is used as the teacher network to examine the difference in performance according to the teacher architecture. As the student networks, an SSD based on ResNet18 and an SSD-Lite based on MobileNet have been used.
Detection performance has been evaluated on the VOC 2007 test set, and all results are presented in Table 5. In the case of ResNet18, the performance improvement of ResNet18-T1, which uses the ResNet teacher, is larger than that of T2, which uses the VGG teacher. Though both students outperform the baseline, distillation between similar structures yields better knowledge transfer. In the case of MobileNet, our proposed method shows a consistent performance improvement regardless of the type of teacher. The student models in all experiments improved in performance, which implies that our method can be applied to any SSD-based object detector.
4.4 Semantic segmentation
In this section, we verify the performance of our proposed method on semantic segmentation. Applying distillation to semantic segmentation is challenging since the output size is much larger than in any other task. We have selected a recent model, DeepLabV3+, as our base model for semantic segmentation. DeepLabV3+ based on ResNet101 has been used as the teacher network, and DeepLabV3+ based on MobileNetV2 and ResNet18 as the student networks. Experiments have been performed on the PASCAL VOC 2012 segmentation dataset. We also use an augmentation of the dataset provided by the extra annotations, as in the baseline paper. All models have been trained for 50 epochs, and the learning rate schedule is the same as in the baseline paper. In a similar fashion to our detection task, the student network is initialized with a network pre-trained on ImageNet without distillation. Results are presented in Table 6. Our proposed method significantly improves the performance of ResNet18 and MobileNetV2. For MobileNetV2 in particular, our proposed method improves the performance by almost 3 points of mIoU and contributes to the computation reduction of the segmentation algorithm. We have shown that our proposed method can be applied to image classification, object detection and semantic segmentation. Being applicable to many tasks without major changes is an advantage of feature distillation and shows that our proposed method has a wide range of applications.
|Backbone||# of params||Method||mIoU|
We analyze possible factors that may have led to the performance improvement of our proposed method. The first analysis is the output similarity between the teacher and the student learned by distillation, by which we verify how well our method forces the student to follow the teacher. After that, we provide an ablation study of our proposed method, measuring how much each component contributes to the performance. Finally, we discuss how the mode of batch normalization affects knowledge distillation, as mentioned in Section 3.3. All experiments are based on setting (c) of Table 2.
4.5.1 Teacher-student similarity
|Mode of batch-norm||KD ||FitNets ||AT ||Jacobian ||FT ||AB ||Proposed|
KD forces the output of the student to be similar to the output of the teacher. The purpose of output distillation is quite intuitive: if a student produces an output similar to that of the teacher, its performance will also be similar. In the case of feature distillation, however, it is necessary to investigate how the output of the student changes. To see how well the student mimics the teacher, we measure the similarity of the teacher's and student's outputs under a consistent setting. On the test set of CIFAR-100, we measure the KL divergence between the teacher's and student's outputs. The cross-entropy with the ground truth has also been measured, since classification performance also contributes to the reduction of KL divergence. Results are presented in Table 7. Methods that apply distillation only at the initial stage of training (FitNets, AB) increase the KL divergence both with the teacher and with the ground truth; it is thus hard to say that the student networks of these methods are mimicking their teacher networks. Meanwhile, methods with continuous distillation as in Eq. 6 (KD, AT, Jacobian, FT), as well as our proposed method, reduce the KL divergence, which implies that the similarity between teacher and student is relatively high. Specifically, our method shows a considerably higher similarity than the other continuous distillation methods. In other words, our proposed method trains the student to produce the output most similar to that of the teacher. This similarity is one of the main reasons for the improved performance of our proposed method.
4.5.2 Ablation study
As shown in Table 1, our proposed method consists of various components. We investigate the contribution of each component with an ablation study: from our proposed method, we exclude one component at a time and measure the resulting performance. The results are shown in Table 9. Replacing continuous distillation with initial distillation, which applies distillation only for network initialization as in [22, 28, 7], degrades the performance. Changing the feature position of distillation to the end of the group also causes a significant degradation; this result shows that using the feature before ReLU is necessary for our margin ReLU and partial distance. Using a conventional ReLU instead of the margin ReLU (Eq. 3) also decreases performance. Further, controlling the negative values of the student by using a classical $L_2$ distance degrades performance. If any component is omitted, the student network suffers a performance drop, especially when the distillation type or the distillation feature position is changed. In conclusion, the combination of all proposed components leads to the significant performance improvement of the proposed method.
4.5.3 Batch normalization
In Section 3.3, we raised the issue of the mode of batch normalization in knowledge distillation. To investigate this, we measure the performance variation of knowledge distillation methods when varying the mode of the teacher's batch-norm layers. The experimental results are shown in Table 8. The distillation methods that use information other than features (KD, Jacobian) show marginal differences between the two modes of batch normalization. AT, which uses a diminished feature for distillation, shows a better result in evaluation mode. However, methods that do not squeeze the feature (FitNets, FT, AB) consistently work better in training mode. Our method in particular shows a substantial improvement when using the training mode. Note that all experiments in the previous sections use the better mode of the batch-norm layer for each method, as the respective papers do not specify it. In conclusion, an appropriate mode of batch normalization should be carefully chosen in many distillation methods, including ours.
4.6 Implementation details
We describe our implementation details to help reproduce our proposed method. In the loss function of our method in Eq. 5, we sum the values over the entire layer rather than averaging them. When one moves from the top layer toward the bottom layer, the total size of the feature doubles as the spatial resolution increases; therefore, the loss at each such level is divided in half accordingly. The weight $\alpha$ in Eq. 6 is set separately for CIFAR, for ImageNet and detection, and for segmentation. For CIFAR, we used a batch size of 128 for settings (a) to (d) in Table 2 and a batch size of 64 for (e) and (f). Every experiment on CIFAR has been performed 5 times, with the median value reported. The detector has extra layers behind the backbone, and all extra layers were also used for distillation. Detectors were trained for 120k iterations with a batch size of 32; the learning rate was multiplied by 0.1 at iterations 80k and 100k. In the case of segmentation, we use an additional distillation layer at the atrous spatial pyramid pooling and a layer just before the output layer. The output stride was set to 16, and no dropout layers were used for distillation. When using a pre-trained student network, we initialized the student transform at the start of training; this initialization proceeded for 500 iterations for detection and 1 epoch for segmentation. All experiments were implemented and evaluated on the NAVER Smart Machine Learning (NSML) platform with PyTorch.
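The per-level loss scaling described above can be sketched as follows, under the assumption that the per-layer distillation losses are ordered from the deepest (smallest-feature) level to the shallowest, so each successive level is halved:

```python
def combine_layer_losses(layer_losses):
    # Each step toward the input doubles the total feature size, so each
    # successive per-layer loss is divided in half to keep the levels
    # comparably weighted (summed, not averaged, within each layer).
    return sum(loss / (2 ** i) for i, loss in enumerate(layer_losses))

total = combine_layer_losses([4.0, 4.0, 4.0])  # 4 + 2 + 1
```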
We have proposed a new knowledge distillation method along with investigations of various aspects of existing feature distillation methods. We have discovered the effectiveness of the pre-ReLU location and proposed a new loss function to improve the performance of feature distillation. The new loss function consists of a new teacher transform (margin ReLU) and a new distance function (partial $L_2$), and enables effective feature distillation at the pre-ReLU location. We have also investigated the mode of batch normalization in the teacher network and achieved additional performance improvements. Through experiments, we have examined the performance of the proposed method using various networks on various tasks, and shown that the proposed method substantially outperforms the state of the art in feature distillation.
We appreciate support of Clova AI members, especially Nigel Fernandez for proofreading the manuscript, Dongyoon Han for providing help on implementation and Jung-Woo Ha for insightful comments. We also thank the NSML Team for providing an excellent experiment platform.
-  L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
-  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, Jan. 2015.
-  D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
-  B. Heo, M. Lee, S. Yun, and J. Y. Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
-  S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems (NIPS). 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
-  H. Kim, M. Kim, D. Seo, J. Kim, H. Park, S. Park, H. Jo, K. Kim, Y. Yang, Y. Kim, N. Sung, and J. Ha. NSML: Meet the MLaaS platform with a real-world case study. CoRR, abs/1810.09957, 2018.
-  J. Kim, S. Park, and N. Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems (NIPS), 2018.
-  A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
-  A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Technical report, 2010.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
-  S. Srinivas and F. Fleuret. Knowledge transfer with jacobian matching. In International Conference on Machine Learning (ICML), 2018.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.
-  S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.