Ensemble methods have shown considerable improvements in model generalization and produced state-of-the-art results in several machine learning competitions (e.g., Kaggle) [Chen et al.2017]. These ensemble methods typically contain multiple deep Convolutional Neural Networks (CNNs) as sub-networks which are pre-trained on large-scale datasets to extract discriminative features from the input data. The size of an ensemble is not constrained by training because the sub-networks can be trained independently and their outputs can be computed in parallel. In many applications, however, the available training data is not sufficient to effectively train deep CNN models, which makes small or compact networks the more practical choice. For instance, in healthcare applications the amount of available data is constrained by the number of patients. Improving the generalization capability of compact networks without requiring large-scale annotated datasets is therefore of utmost importance. Furthermore, today's high-performing deep CNN-based ensemble models have giga-FLOP compute and gigabyte storage requirements [Huang et al.2018], making them prohibitive in resource-constrained systems (e.g., mobile or edge devices) with stringent requirements on memory, latency, and computational power.
To overcome these challenges, model compression techniques such as parameter pruning [Yu et al.2018] are a common way to reduce model size, trading accuracy for efficiency. Other techniques hand-craft efficient CNN architectures such as SqueezeNet [Iandola et al.2016], MobileNets [Howard et al.2017], and ShuffleNet [Zhang et al.2018a]. Recently, neural architecture search has proven effective for generating efficient CNN architectures [Tan et al.2019, Cai et al.2018] by extensively tuning parameters such as network width, depth, and filter types and sizes. These models showed better efficiency than hand-crafted networks, but at an extremely large search cost. Another stream of work on building efficient networks for resource-constrained scenarios is knowledge distillation [Hinton et al.2015]. It enables small, low-memory-footprint networks to mimic the behavior of large complex networks by training the small networks on the predictions of the large networks as soft labels, in addition to the ground-truth hard labels.
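To make the soft-label mechanism concrete, the following minimal Python sketch computes a Hinton-style distillation loss for a single sample. The function names, the temperature T=4.0, and the weighting alpha=0.7 are illustrative assumptions, not values from this paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields a softer distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    """Hinton-style distillation: weighted sum of cross-entropy on the
    ground-truth hard label and KL divergence to the teacher's soft labels."""
    p_student = softmax(student_logits)
    ce_hard = -math.log(p_student[hard_label])
    # Soft targets: both distributions softened with the same temperature T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl_soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # The T^2 factor keeps soft-target gradients comparable across temperatures.
    return (1 - alpha) * ce_hard + alpha * (T ** 2) * kl_soft
```

With identical teacher and student logits the KL term vanishes, so the loss reduces to the (weighted) hard-label cross-entropy.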
In this paper we also explore Knowledge Distillation (KD)-based strategies to improve model generalization and classification performance for applications with memory and compute restrictions. For this, we present a CNN architecture with parallel branches which distills high-level features from a teacher ensemble during training and maintains low computational overhead during inference. Our architecture provides two main benefits: i) It combines the student network with different teacher networks and distills diverse feature representations into the student network during training. This promotes heterogeneity in feature learning and enables the student network to mimic the diverse high-level feature spaces produced by the teacher networks. ii) It combines the distilled information through parallel branches in an ensembling manner. This reduces variance in the branch-level outputs and improves the quality of the final predictions of the student network. In summary, the main contributions of this paper are as follows:
We present an Ensemble Knowledge Distillation (EKD) framework which improves classification performance and model generalization of small and compact networks by distilling knowledge from multiple teacher networks into a compact student network in an ensemble learning manner.
We present a specialized training objective function to distill ensemble knowledge into a single student network. Our objective function optimizes the parameters of the student network with the dual goals of learning mappings between input data and ground-truth labels, and of minimizing the difference between the high-level features of the teacher networks and the student network.
We perform an ablation study of our framework on the CIFAR-10 and CIFAR-100 datasets in terms of different CNN architectures, varying ensemble sizes, and limited-training-data scenarios. Experiments show that by encouraging heterogeneity in feature learning through the proposed ensemble distillation, our EKD-based compact networks produce superior accuracy compared to conventional ensembles without knowledge distillation and to networks trained through other KD-based methods.
2 Related Work
In this section, we discuss related work on model compression and knowledge distillation.
2.1 Model Compression
Network pruning is a popular approach that reduces a heavy network to a light-weight form by removing redundancy. In this approach, a complex over-parameterized network is first trained, then pruned based on some criteria, and finally fine-tuned to achieve comparable performance with reduced parameters. In this context, methods such as [Yu et al.2018] compress large networks by removing connections based on weight magnitudes or importance scores. Other methods used quantization of the weights to 8 bits [Han et al.2015], filter pruning [Li et al.2016], and channel pruning [Luo et al.2017] to reduce network sizes. However, the trimmed models are generally sub-graphs of the original networks, and there is little flexibility to change the original architecture design.
2.2 Knowledge Distillation
Knowledge Distillation (KD) aims at learning a light-weight student network such that it can mimic the behavior of a complicated teacher network. In this context, the work of [Ba and Caruana2014] was the first to introduce knowledge distillation by minimizing L2 distance between the features from the last layers of two networks. Later, the work of [Hinton et al.2015]
showed that the predicted class probabilities from the teacher are informative for the student and can be used as a supervision signal in addition to the regular labeled training data. Romero et al. [Romero et al.2015] bridged the intermediate layers of the student and teacher networks in addition to the class probabilities and used an L2 loss to supervise the student network. The method of [Czarnecki et al.2017] minimized the difference between teacher and student derivatives of the loss, combined with the divergence from the teacher predictions. Other methods explored knowledge distillation using activation maps [Heo et al.2019], attention maps [Zagoruyko and Komodakis2016], Jacobians [Srinivas and Fleuret2018], and unsupervised feature factors [Kim et al.2018].
Ensembling is a promising technique to improve model generalization beyond the performance of individual models. Since different CNN architectures can achieve diverse error distributions due to the presence of several local minima, combining the outputs of individually trained networks leads to improved performance and better generalization to unseen test data. In light of these studies, methods such as [Urban et al.2016, Furlanello et al.2018] combined ensemble learning and knowledge distillation to improve model generalization. For instance, the method of [Urban et al.2016]
trained an ensemble of 16 CNN models and compressed the learned function into shallow multi-layer perceptrons containing 5 layers. The work of [Furlanello et al.2018] presented an iterative technique to transform a student model into the teacher model at each iteration; at the end of the iterations, all the student outputs were combined to form an ensemble. Our work also follows ensemble learning coupled with knowledge distillation; however, compared to [Furlanello et al.2018], we train a compact student network through knowledge distillation in a single iteration. Furthermore, our ensemble architecture distills knowledge from different teacher networks into the student network. This increases heterogeneity in student feature learning and enables the student network to mimic the diverse feature representations produced by the different teacher networks. Consequently, our EKD-based compact networks demonstrate better generalization capability than iterative KD methods [Furlanello et al.2018] or conventional KD methods [Hinton et al.2015].
3 The Proposed Framework
Fig. 1 shows the overall architecture of our Ensemble Knowledge Distillation (EKD) framework. It consists of two main modules: i) a compact student network (CompNet) composed of branches connected in parallel (Fig. 1-A); the branches follow a common architecture comprising convolutional and pooling layers. ii) a Teacher Ensemble Network (TeachNet) composed of CNN models with different architectures or layer configurations (Fig. 1-B). In the following, we describe the individual modules of the proposed framework in detail.
3.1 The Proposed Compact Network (CompNet)
Our compact network is composed of branches connected in parallel. The branches follow a common architecture where each branch is composed of multiple convolutional layers interconnected through fuse connections. The branch outputs are fed into linear layers to produce probabilistic distributions of the input data with respect to the target classes. Our branch architecture is composed of Dense Blocks, which contain multiple bottleneck convolutions interconnected through dense connections [Huang et al.2017]. Specifically, the branch architecture starts with an average pooling operation. Next, there are three dense blocks, where each dense block consists of a number of layers, termed Dense Layers, which share information from all the preceding layers connected to the current layer through fuse connections. Fig. 1-C shows our branch structure with its dense layers configuration. Each dense layer consists of 1×1 and 3×3 convolutions followed by Batch Normalization (BN), Rectified Linear Units (ReLU), and a dropout block, as shown in Fig. 1. The output x_l of the l-th dense layer in a dense block can be written as:

x_l = H_l([x_0, x_1, …, x_{l−1}]),
where [x_0, x_1, …, x_{l−1}] represents the concatenation of the features produced by the layers 0, 1, …, l−1, and H_l(·) denotes the dense-layer operations described above. Each branch ends with global average pooling and produces a d-dimensional feature vector, which is then fed to a linear layer of dimensions d × C to produce a probabilistic distribution z_k with respect to the C target classes. Mathematically, the output of a linear layer can be written as:

z_k = W_k · x_k + b_k,

where W_k and b_k represent the weight and bias matrices, respectively. Finally, the outputs of the linear layers are summed to produce a combined representation z_S, as shown in Fig. 1-A. It is given by:

z_S = Σ_{k=1}^{K} z_k,

where K is the number of branches.
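The dense connectivity described above can be sketched in plain Python. The feature maps are simplified to flat lists, and the toy layer below merely stands in for the 1×1/3×3 convolution + BN + ReLU pipeline (the layer construction and widths are illustrative assumptions, not the paper's exact configuration):

```python
def dense_block(x0, layers):
    """Run a dense block: each layer consumes the concatenation of the
    block input and the outputs of all preceding layers."""
    features = [x0]
    for layer_fn in layers:
        concatenated = [v for feat in features for v in feat]  # channel concat
        features.append(layer_fn(concatenated))
    return features[-1]

def make_toy_layer(width):
    """Toy stand-in for a dense layer: a fixed-width strided running sum
    followed by a ReLU (illustrative only)."""
    def layer_fn(x):
        return [max(0.0, sum(x[i::width])) for i in range(width)]
    return layer_fn
```

Because every layer sees the concatenated outputs of all earlier layers, the input width of each toy layer grows with depth, mirroring how dense connections reuse features.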
3.2 Proposed Teacher Ensemble Network (TeachNet)
Our teacher ensemble is composed of multiple CNN models which act as independent classifiers. The teacher sub-networks should use different architectures or layer configurations in order to produce diverse feature representations at their final convolutional layers. Similar to our CompNet architecture, the teacher outputs are first fed into linear layers to produce probabilistic distributions z_T^k of the input data with respect to the target classes, and finally summed together to produce a combined representation z_T, as shown in Fig. 1-B. It is given by:

z_T = Σ_{k=1}^{K} z_T^k.
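In both CompNet and TeachNet, the combination step reduces to an element-wise sum of the per-branch (or per-sub-network) logits; a minimal sketch, with hypothetical logit values:

```python
def combine_branches(branch_logits):
    """Sum per-branch logit vectors element-wise to form the combined
    representation that feeds the final prediction."""
    num_classes = len(branch_logits[0])
    return [sum(z[c] for z in branch_logits) for c in range(num_classes)]

# Two branches over two classes: the sums are taken class by class.
combined = combine_branches([[1.0, 2.0], [0.5, -1.0]])
```

Summing (rather than, say, averaging) preserves the scale of agreement across branches; a class on which all branches agree accumulates a larger combined logit.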
In the following we describe our specialized training objective function which optimizes the parameters of our CompNet using ground truth labels as well as high-level feature representations from the teacher ensemble.
3.3 Proposed Ensemble Knowledge Distillation (EKD)
Consider a training dataset D = {(x_i, y_i)}_{i=1}^{N} of images x_i and labels y_i, where each sample belongs to one of the C target classes. To learn the mapping x → y, we train our CompNet f(x; θ) parameterized by θ, where θ* are the learned parameters obtained by minimizing a training objective function L:

θ* = argmin_θ L(θ).
Our training function L is a weighted combination of three loss terms: a CrossEntropy loss term applied on the outputs of the teacher ensemble and of the CompNet model with respect to the ground-truth labels y, and a distillation loss term which matches the outputs of the sub-networks of the teacher ensemble with the outputs of the branches of the CompNet model. Mathematically, L can be written as:

L = α · L_CE(z_T, y) + β · L_CE(z_S, y) + γ · L_KD(z_T, z_S),

where z_T and z_S represent the logits (the inputs to the SoftMax) of the teacher ensemble and the CompNet model, respectively. The terms α, β, and γ are the hyper-parameters which balance the individual loss terms. Mathematically, the CrossEntropy loss can be written as:

L_CE(z, y) = − Σ_i Σ_{c=1}^{C} 1[y_i = c] · log σ_c(z_i),

where 1[·] is the indicator function and σ(·) is the SoftMax operation. It is given by:

σ_c(z) = exp(z_c) / Σ_{j=1}^{C} exp(z_j).
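As a sanity check, the cross-entropy and SoftMax above can be written in a few lines of Python (single-sample form; the real loss operates on batched tensors):

```python
import math

def softmax(z):
    """SoftMax with the standard max-subtraction for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """CE for one sample: the indicator function picks out exactly one term
    of the inner sum, leaving -log of the true class's SoftMax probability."""
    return -math.log(softmax(logits)[label])
```

A confident, correct prediction (a large logit on the true class) drives the loss toward zero, while a confident wrong prediction makes it large.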
Our KD-based loss L_KD in Eq. 6 is composed of two terms: a Kullback-Leibler (KL) divergence loss term L_KL and a Mean-Squared-Error loss term L_MSE. Mathematically, L_KD can be written as:

L_KD = Σ_{k=1}^{K} ( L_KL^k + L_MSE^k ),

where k indexes the sub-networks of the teacher ensemble and the corresponding branches of the CompNet model. The term T in Eq. 9 is a temperature hyper-parameter which controls the softening of the output of the teacher sub-networks. A higher value of T produces a softer probability distribution over the target classes. The KL divergence loss is given by:

L_KL^k = KL( σ(z_T^k / T) ‖ σ(z_S^k / T) ).
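A minimal sketch of the branch-wise distillation loss, assuming the KL term compares temperature-softened distributions and the MSE term compares raw logits; the exact scaling factors of the paper's equations are not recoverable from this text, so the sketch is illustrative:

```python
import math

def softmax_t(z, T):
    """Temperature-scaled, numerically stable softmax."""
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_kd_loss(teacher_branch_logits, student_branch_logits, T=4.0):
    """Sum, over teacher/student branch pairs, of a KL term on softened
    distributions plus an MSE term on the raw logits."""
    total = 0.0
    for z_t, z_s in zip(teacher_branch_logits, student_branch_logits):
        p_t = softmax_t(z_t, T)
        p_s = softmax_t(z_s, T)
        kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
        mse = sum((a - b) ** 2 for a, b in zip(z_t, z_s)) / len(z_t)
        total += kl + mse
    return total
```

The loss is zero only when every student branch exactly matches its teacher counterpart, and grows with any branch-level disagreement.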
4 Experimental Setup
In this section we describe the details of our experiments.
4.1 Network Architectures
We evaluated our ensemble knowledge distillation using two architectures for the student network. The first one is a CNN with dense connections [Huang et al.2017], where we considered the number of dense layers of the model as a proxy for size or capacity of the model. Specifically, we considered a medium capacity CNN with 6 dense layers (DenseNet6), and a large capacity CNN with 12 dense layers (DenseNet12). The other student architecture we considered for our evaluations is based on the ResNet8 structure with skip connections as used in other knowledge distillation based studies [Heo et al.2018, Mirzadeh et al.2019]. For the teacher ensemble, we considered up to 7 sub-networks based on ResNet14, ResNet20, ResNet26, ResNet32, ResNet44, ResNet56, and ResNet110 architectures as used in [Heo et al.2018, Mirzadeh et al.2019].
4.2 Training and Implementation
For training, we initialized the weights of the convolutional and fully connected layers from zero-mean Gaussian distributions. The standard deviations were set to 0.01, the biases were set to 0, and a parameter decay of 0.0005 was applied to the weights and biases. The teacher ensemble was first trained independently from scratch, and then fine-tuned simultaneously and collaboratively with the student network. The distillation from the teacher sub-networks to the student network was performed throughout the whole training process by optimizing the training objective function in Eq. 6. Specifically, we trained the networks for 1000 epochs starting with a learning rate of 0.01, which was divided by 10 at 50% and 75% of the total number of epochs. Our implementation is based on the auto-gradient computation framework of the Torch library [Paszke et al.2017]. Training was performed with the Adam optimizer and a batch size of 128 on 2 NVIDIA V100 GPUs. For hyper-parameter optimization, we used the toolkit of [Bergstra et al.2011] to tune the loss weighting parameters α, β, and γ, and the temperature parameter T.
We evaluated our framework on two well-established image classification datasets, CIFAR-10 and CIFAR-100. Both datasets consist of 60,000 RGB images distributed into 10 and 100 classes, respectively. Specifically, the training set contains 50,000 images and the test set contains 10,000 images of size 32×32 pixels.
5 Results and Analysis
| Method | KD | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| Ensemble [Dutt et al.2019] | No | 92.79 | 70.73 |
5.1 Ensemble Distillation Improves Model Performance
Here, we evaluate our compact student networks with and without the proposed Ensemble Knowledge Distillation (EKD) on the CIFAR-10 and CIFAR-100 datasets. Table 1 shows the results of these experiments for different CNN architectures. From Table 1 we see that EKD-based networks improve accuracy for all the tested CNN architectures on both the CIFAR-10 and CIFAR-100 datasets. For instance, EKD improves accuracy by 4%, 5%, and 4% for the ResNet8 (R8), DenseNet6 (D6), and DenseNet12 (D12) architectures on the CIFAR-100 dataset, respectively. Table 1 also shows that our EKD improves accuracy by 2% and 4% on the CIFAR-10 and CIFAR-100 datasets, respectively, compared to the ensembling method of [Dutt et al.2019], which combines heterogeneous classifiers without knowledge distillation.
5.2 Ensemble Learning Improves Knowledge Distillation
Here, we evaluate the performance of our EKD-based student networks by varying the size of the ensemble to explore the benefits of ensembling for knowledge distillation. Table 2 shows the results of these experiments on the CIFAR-10 and CIFAR-100 datasets. The results show that accuracy increases with ensemble size (ES) for all the tested CNN architectures. For instance, a student network with 6 branches improves accuracy by around 2% and 4% compared to a 1-branch network on the CIFAR-10 and CIFAR-100 datasets, respectively. Table 3 shows a comparison between the student network and the teacher networks on the CIFAR-10 and CIFAR-100 datasets. The results show that our EKD-based student network (CompNet-D12) produced competitive accuracy on the test datasets with fewer training parameters, fewer FLOPS, and faster inference compared to the teacher ensemble (TeachNet).
| Method | Dataset | Accuracy (%) |
|---|---|---|
| [Hinton et al.2015] | CIFAR-10 | 86.66 |
| FITNET [Mirzadeh et al.2019] | CIFAR-10 | 86.73 |
| [Zagoruyko and Komodakis2016] | CIFAR-10 | 86.86 |
| FSP [Yim et al.2017] | CIFAR-10 | 87.07 |
| BSS [Heo et al.2018] | CIFAR-10 | 87.32 |
| MUTUAL [Zhang et al.2018b] | CIFAR-10 | 87.71 |
| TAKD [Mirzadeh et al.2019] | CIFAR-10 | 88.01 |
| CompNet - ResNet8 (this work) | CIFAR-10 | 90.48 |
| [Hinton et al.2015] | CIFAR-100 | 61.41 |
| TAKD [Mirzadeh et al.2019] | CIFAR-100 | 61.82 |
| CompNet - ResNet8 (this work) | CIFAR-100 | 62.52 |
5.3 Comparison with Other Knowledge Distillation Methods
Here, we compare our Ensemble Knowledge Distillation (EKD) with some of the recent state-of-the-art knowledge distillation methods, including: activation-based attention transfer (AAT) [Zagoruyko and Komodakis2016], the method of [Yim et al.2017] (FSP), the method of [Hinton et al.2015], hint-based transfer (FitNet) [Romero et al.2015], the method of [Heo et al.2018] (BSS), deep mutual learning [Zhang et al.2018b] (MUTUAL), and the method of [Mirzadeh et al.2019] (TAKD). For a fair comparison, we used exactly the same settings for the CIFAR-10 and CIFAR-100 experiments and a ResNet8-based student network, as used in the baseline studies. Table 4 shows that our EKD-based networks improved accuracy on both tested datasets compared to the other KD methods. This improved performance is attributed to the proposed ensemble distillation architecture, where the proposed training objective function enables the student network to successfully mimic the diverse feature embeddings produced by different teachers and improve generalization to unseen test data. Furthermore, the combination of distilled information through ensembling reduces variance in the heterogeneous outputs and improves the quality of the final predictions of the student network.
5.4 Model Generalization Performance
Here we evaluate the generalization performance of the proposed ensemble knowledge distillation. For this, we conducted experiments for different sizes of the data used for training the teacher and the student networks. Table 5 shows the results of these experiments using ResNet8 as the student network. The results show that the performance gap increases as the size of the dataset is reduced. For instance, the accuracy drops by around 25% and 40% when 10% of the data was used to train the networks without the proposed EKD on the CIFAR-10 and CIFAR-100 datasets, respectively. Table 5 also shows that, using the proposed EKD, the networks improve test accuracy for all the tested sizes of training data by considerable margins compared to networks without distillation (NO-KD). For instance, EKD-based networks produced improvements of up to 7% and 6% when 50% of the data was used for training on the CIFAR-10 and CIFAR-100 datasets, respectively.
The columns labelled KD in Table 5 show the test accuracy of networks trained through conventional distillation [Hinton et al.2015], where ResNet8-based student networks were trained with ResNet56 as the teacher network. The results show that while non-ensemble distillation improved accuracy compared to the networks without distillation (NO-KD), our ensemble-distillation-based networks (EKD) produced the best accuracy on the test datasets for all tested sizes of training data. These improvements are attributed to our ensemble distillation, which promotes diversity in feature learning by transferring knowledge from different teachers into the student network and improves model generalization to test data. These experiments show that in situations where non-KD methods fail to generalize due to insufficient data, the proposed ensemble distillation achieves considerable performance improvements, demonstrating potential for applications with limited-data constraints.
5.4.1 Visualization of Teacher and Student Features Space
Here, we conducted experiments to visualize the feature space learned by our EKD-based networks. Fig. 2 shows the 2-dimensional t-SNE embeddings generated from the features produced by the teacher ensemble and by ResNet8 models with and without the proposed ensemble distillation, under different sizes of training data. These experiments show that training a student network using high-level features from multiple teacher networks enables the student network to imitate the teachers and produce embeddings which exhibit better separation of the target classes than the embeddings produced by models trained without distillation.
5.4.2 Visualization of Teacher and Student Loss Landscapes
Fig. 3 shows a comparison of loss landscape visualizations [Li et al.2018] around local minima for the teacher ensemble, ResNet8 with the proposed EKD, and ResNet8 without distillation on the test data of the CIFAR-100 dataset. The comparison shows that our EKD-based networks exhibit a flatter surface topology, similar to that of the teacher networks, compared to the models without distillation. This translates to better generalization performance of our EKD-based networks compared to the networks without distillation.
5.5 Conclusion and Future Work
Recently, deep CNN-based ensemble methods have shown state-of-the-art performance in image classification, but at the cost of high computation and large memory requirements. In this paper, we show that knowledge distillation using an ensemble architecture can improve classification accuracy and model generalization for small and compact networks, especially with limited training data. Unlike traditional ensembling techniques, which reduce output variance by combining independently trained networks, our Ensemble Knowledge Distillation (EKD) encourages heterogeneity in student feature learning through collaboration between heterogeneous teachers and the student network. This enables student networks to learn more discriminative and diverse feature representations while maintaining small memory and compute requirements. Experiments on the well-established CIFAR-10 and CIFAR-100 datasets show that compact networks trained through the proposed ensemble distillation improve classification accuracy and model generalization, especially in situations with limited training data. In the future, we plan to explore fully data-driven automated ensemble selection. We also plan to evaluate our framework on video classification tasks to gain more insight into the benefits of ensemble distillation.
- [Ba and Caruana2014] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, pages 2654–2662, 2014.
- [Bergstra et al.2011] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, pages 2546–2554, 2011.
- [Cai et al.2018] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
- [Chen et al.2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 40(4):834–848, 2017.
- [Czarnecki et al.2017] Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. In NIPS, pages 4278–4287, 2017.
- [Dutt et al.2019] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. Neurocomputing, 2019.
- [Furlanello et al.2018] Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
- [Han et al.2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [Heo et al.2018] B Heo, M Lee, S Yun, and JY Choi. Improving knowledge distillation with supporting adversarial samples. arXiv preprint arXiv:1805.05532, 3, 2018.
- [Heo et al.2019] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, volume 33, pages 3779–3787, 2019.
- [Hinton et al.2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [Howard et al.2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [Huang et al.2017] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. CVPR, 2017.
- [Huang et al.2018] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
- [Iandola et al.2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
- [Kim et al.2018] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In NIPS, pages 2760–2769, 2018.
- [Li et al.2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- [Li et al.2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NIPS, pages 6389–6399, 2018.
- [Luo et al.2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5058–5066, 2017.
- [Mirzadeh et al.2019] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393, 2019.
- [Paszke et al.2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- [Romero et al.2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. ICLR, 2015.
- [Srinivas and Fleuret2018] Suraj Srinivas and François Fleuret. Knowledge transfer with jacobian matching. arXiv preprint arXiv:1803.00443, 2018.
- [Tan et al.2019] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, pages 2820–2828, 2019.
- [Urban et al.2016] Gregor Urban, Krzysztof J Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691, 2016.
- [Yim et al.2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 4133–4141, 2017.
- [Yu et al.2018] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In CVPR, pages 9194–9203, 2018.
- [Zagoruyko and Komodakis2016] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
- [Zhang et al.2018a] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.
- [Zhang et al.2018b] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In CVPR, pages 4320–4328, 2018.