The performance of Convolutional Neural Networks (CNNs) can be significantly improved by deeper and wider network designs. However, it is hard to deploy these heavy networks on power-constrained processors with limited memory. One way to deal with this situation is to trade off performance against speed by designing a miniaturized model that reduces the computational workload. Thus, narrowing the performance gap between heavy models and miniaturized models has become a research focus in recent years. Many methods have been proposed to tackle this problem, such as model pruning [7, 17], quantization [14, 29] and knowledge transfer [11, 25].
Among these approaches, knowledge distillation (KD) is an essential way to optimize a static model by mimicking the behavior (final predictions or activations of hidden layers) of a powerful teacher network, as shown in Figure 1(a) and Figure 1(b). Guided by this softened knowledge, a student network can attend to extra supervision, such as the probability correlation between classes, rather than only the one-hot label.
Previous methods only consider a single converged model as the teacher for a small student network, which may leave the student stuck in approximating the teacher's performance as the capacity gap between teacher and student grows. We observe that a student supervised by the teacher's early training stages has a much smaller performance gap with its teacher than one supervised by the teacher's later training stages. From the perspective of curriculum learning, a reasonable explanation behind this observation is that the teacher's knowledge gradually becomes harder along the training process. We claim that the intermediate states the teacher passed through are also valuable knowledge for easing the learning process and lowering the error bound of the student.
Based on this philosophy, we propose a new method called RCO which supervises the student with the teacher's optimization route. Figure 1(c) shows the whole framework of RCO. Compared with a single converged model, the route of the teacher contains extra knowledge by providing an easy-to-hard learning sequence. By gradually mimicking this sequence, the student learns more consistently with the teacher, therefore narrowing the performance gap. Besides, we analyze the impact of different learning sequences on performance and propose an efficient greedy strategy for generating the sequence, which can shorten the training paradigm while maintaining high performance.
Extensive experiments on CIFAR-100, ImageNet-1K and large-scale face recognition show that RCO significantly outperforms knowledge distillation and other SOTA methods on all three tasks. Moreover, our method can be combined with previous knowledge transfer methods to boost their performance. To sum up, our contributions can be summarized in three parts:
We rethink the knowledge distillation model from the perspective of the teacher's optimization path and make the significant observation that learning from the converged teacher model is not the optimal way.
Based on this observation, we propose a novel method named RCO that utilizes the route in parameter space that the teacher network passed through as a constraint, leading to a better optimization of the student network.
We demonstrate that the proposed RCO can be easily generalized to both knowledge distillation and hint learning. Under the same data and computational cost, RCO outperforms KD by a large margin on CIFAR, ImageNet and the one-to-million face recognition benchmark MegaFace.
2 Related Work
Neural Network Miniaturization. Many works study the problem of neural network miniaturization. They can be categorized into two approaches: designing small network structures, and improving the performance of small networks via knowledge transfer. As for the former, many modifications of convolution have been proposed, since the original convolution takes up too many computational resources. MobileNet used depth-wise separable convolutions to build its blocks; ShuffleNet used pointwise group convolution and channel shuffle. These methods maintain decent performance without adding too much computing burden at inference time. Besides, many studies [8, 24, 20, 10] focus on network pruning, which speeds up inference by removing redundancy in a large CNN model. Han et al. proposed to prune insignificant connections. Molchanov et al. presented that filters of low importance, ranked by their impact on the loss, could be pruned; they approximated the change in the loss function with a Taylor expansion. These methods typically need to tune the compression ratio of each layer manually. Most recently, Liu et al. presented the network slimming framework: they constrained the scale parameters of each batch normalization layer with a sparsity penalty so that the channels with lower scale parameters could be removed. In this way, unimportant channels are automatically identified during training and then removed. He et al. proposed to adopt reinforcement learning to exploit the design space of model compression, benefiting from replacing manual tuning with automatic strategies.
Knowledge Distillation for Classification. Efficiently transferring knowledge from a large teacher network to a small student network is a classic topic which has drawn more and more attention in recent years. Caruana et al. first advocated it, claiming that the knowledge of an ensemble of models could be transferred to another single model. Hinton et al. further showed that knowledge distillation (KD) could transfer distilled knowledge to the student network efficiently. By increasing the temperature, the logits (the inputs to the final softmax) contain richer information than one-hot labels. KD is generally used for closed-set classification, where the training set and testing set have exactly the same classes.
Learning Representation from Hint. Hint-based learning is often used for open-set tasks such as face recognition and person re-identification. FitNet first introduced more supervision by exploiting intermediate-level feature maps from the hidden layers of the teacher to guide the training process of the student. Afterward, Zagoruyko et al. proposed transferring attention maps from teacher to student. Yim et al. defined the distilled knowledge from the teacher network as the flow of the solution process (FSP), computed as the inner product between feature maps from two selected layers.
Previous knowledge transfer methods only supervise the student with the converged teacher, and thus fail to capture the knowledge generated during the teacher's training process. Our work differs from existing approaches in that we supervise the student with knowledge transferred from the teacher's training trajectory.
3 Route Constrained Optimization
3.1 Teacher-Student Learning Mechanism
For better illustration, we refer to the teacher network as $T$ with parameters $W_t$, and the student network as $S$ with parameters $W_s$. Let $p_t$ and $p_s$ represent the output predictions of teacher and student respectively, and $z_t$, $z_s$ their logits. The idea behind KD is to let the student mimic the behavior of the teacher by minimizing the cross-entropy loss together with the Kullback–Leibler divergence between the predictions of teacher and student:
$$\mathcal{L}_{KD} = \mathcal{H}(y, p_s) + \lambda\, \mathrm{KL}\big(p_t^{\tau} \,\|\, p_s^{\tau}\big), \qquad p^{\tau} = \mathrm{softmax}(z/\tau),$$
where $\tau$ is a relaxation hyper-parameter (referred to as the Temperature) for softening the output of the teacher network, and $\lambda$ is a hyper-parameter balancing the cross-entropy and KL divergence losses. In several works [27, 19] the KL divergence is replaced by the Euclidean distance $\|f_t - f_s\|_2^2$, where $f_t$ and $f_s$ represent the feature representations of teacher and student.
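To make the objective concrete, the following is a minimal NumPy sketch of the KD loss above; the function names, toy logits and hyper-parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, label, tau=5.0, lam=0.5):
    """Cross-entropy with the hard label plus KL divergence between
    temperature-softened teacher and student predictions."""
    p_s = softmax(student_logits)             # hard-label branch (tau = 1)
    ce = -np.log(p_s[label] + 1e-12)
    p_t_tau = softmax(teacher_logits, tau)    # softened teacher target
    p_s_tau = softmax(student_logits, tau)
    kl = np.sum(p_t_tau * (np.log(p_t_tau + 1e-12) - np.log(p_s_tau + 1e-12)))
    return ce + lam * kl

# If the student's logits match the teacher's exactly, the KL term vanishes.
z = np.array([2.0, 0.5, -1.0])
loss_same = kd_loss(z, z, label=0)
loss_diff = kd_loss(np.array([0.0, 2.0, 0.0]), z, label=0)
```

A perfectly matched student pays only the cross-entropy term, while any mismatch adds a strictly positive KL penalty, which is what drives the mimicking.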
3.2 Difficulty in Optimizing Student
Commonly, the teacher is a larger and deeper network, so as to arrive at a lower local minimum and achieve higher performance. It is hard for a smaller and shallower student to mimic such a large teacher due to the huge capacity gap between them. Usually, the network is trained to minimize the objective function with stochastic gradient descent. Due to the highly non-convex loss function, there are many local minima in the training process of deep neural networks. When the network converges to a certain local minimum, its training loss converges to a certain (or similar) value regardless of the initialization.
Could we reach a better local minimum than this? We consider changing the optimization objective: the student is trained to mimic a less deterministic target first, then moves on to a more deterministic one, hoping that in this way the student ends up with a smaller gap to the teacher. To validate this, we use different intermediate states of the teacher to supervise the student, and use the training loss and top-1 accuracy to evaluate the difference between the target teacher and the converged student. MobileNetV2 is adopted as the student and ResNet-50 as the teacher. The teacher network is trained with the cross-entropy loss; the student network is trained with the KD loss. We select checkpoints of the teacher at the 10th, 40th, 120th, and 240th epochs as targets to train the student network separately. The checkpoint at the 240th epoch is the final converged model, and the checkpoint at the 10th epoch is the least deterministic in this analysis.
Table 1 summarizes the results. It can be observed that the student guided by a less deterministic target has a lower training loss, and a more convergent target brings a larger gap in performance. In other words, a more convergent teacher is a harder target for the student to approach.
Inspired by curriculum learning, where an easy-to-hard learning process promotes better local minima, we take the sequence of the teacher's intermediate states as the curriculum to help the student reach a better local minimum.
3.3 Route Constrained Optimization
For better illustration, we refer to the intermediate training states (checkpoints) used to form the learning sequence as anchor points. Suppose there are $n$ anchor points on the teacher's trajectory. The overall framework of RCO is shown in Figure 2.
Without loss of generality, let $\{C_1, C_2, \dots, C_n\}$ represent the anchor point set, and let $p_{t_i}$ denote the output of anchor point $C_i$. The training of the student starts from random initialization. We then train the student step by step to mimic each anchor point on the teacher's trajectory until finishing training with the last anchor point. At the $i$-th step, the learning target of the student is switched to the output of anchor point $C_i$, and the optimization goal is:
$$\min_{W_s}\; \mathcal{L}_i = \mathcal{H}(y, p_s) + \lambda\, \mathrm{KL}\big(p_{t_i}^{\tau} \,\|\, p_s^{\tau}\big),$$
where $1 \le i \le n$. The parameter $W_s$ is optimized by learning these anchor points sequentially. Algorithm 1 describes the details of the whole training paradigm.
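As a toy illustration of this sequential paradigm (not the paper's implementation), the following sketch replaces networks and softened outputs with scalars: a single "student" parameter chases each anchor target in turn with gradient steps on a squared-error surrogate loss.

```python
def rco_train(student, anchors, steps_per_anchor=100, lr=0.1):
    """Sketch of the sequential paradigm of Algorithm 1: train the
    student against each anchor point's output in order, easy to hard.
    Here `student` is one scalar parameter and `anchors` is a list of
    scalar targets, stand-ins for networks and anchor outputs."""
    for target in anchors:                    # i-th anchor point C_i
        for _ in range(steps_per_anchor):     # minimize the step-i loss
            grad = 2.0 * (student - target)   # d/dw of (w - target)^2
            student -= lr * grad
    return student

# An easy-to-hard sequence of targets; the student ends near the last anchor.
final = rco_train(student=0.0, anchors=[1.0, 2.5, 4.0])
```

Each inner loop plays the role of one step of constrained optimization; only after the student has approached the current anchor does the target switch to the next, harder one.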
3.4 Rationale for RCO
From the perspective of curriculum learning, an easy-to-hard learning sequence can help the model reach a better local minimum. RCO is similar to curriculum learning but differs in that it provides an easy-to-hard label sequence along the teacher's trajectory.
Let $p_{t_i}$ be the output of the $i$-th anchor point. The outputs of all anchor points construct the space $\{p_{t_1}, p_{t_2}, \dots, p_{t_n}\}$. The results shown in Table 1 imply that the intermediate states on the teacher's trajectory construct an easy-to-hard sequence, e.g., $p_{t_1}$ is easier to mimic than $p_{t_2}$, while $p_{t_n}$, the converged model, is the hardest objective for a small student.
Let $X$ be the training data. The training data and the output of the $i$-th anchor point form a lesson pair $(X, p_{t_i})$. Then the curriculum sequence can be formulated as follows:
$$\big(X, p_{t_1}\big),\; \big(X, p_{t_2}\big),\; \dots,\; \big(X, p_{t_n}\big).$$
Without loss of generality, let $C_\mu$ represent a single-parameter family of cost functions such that $C_0$ can be easily optimized, while $C_1$ is the criterion we actually wish to minimize. During the sequential training of RCO, increasing $\mu$ means increasing the hardness of learning by switching anchor points. Let $H(\cdot)$ represent the hardness metric for a learning target. As shown in Section 3.2, a more convergent anchor means a harder learning target, i.e., $H(p_{t_1}) \le H(p_{t_2}) \le \dots \le H(p_{t_n})$.
In curriculum learning the sequence is generated by splitting the training data into several "lessons" of different hardness according to a predefined criterion, while RCO can be seen as a more flexible approach which gradually changes the hardness of the target labels. Both curriculum learning and RCO work by easy-to-hard learning, moving gradually into the basin of attraction of a dominant (if not global) minimum.
3.5 Strategy for Selecting Anchor Points
Equal Epoch Interval Strategy. Typically, the teacher network produces a tremendous number of checkpoints during the training process. To find the optimal learning sequence, one could search by brute force. However, given $n$ possible intermediate states, there are $2^n$ possible sequences, which is impractical. A straightforward strategy is to supervise the student with every state (epoch/iteration) on the teacher's trajectory. However, mimicking every state is unnecessary and time-consuming, since adjacent training states are very close to each other. Given limited time, a more efficient way is to sample epochs with an equal epoch interval (EEI), e.g., select one for every four epochs.
Although efficient in time, EEI is a simple ad-hoc method that ignores the hardness between different anchor points, and it can lead to an improper curriculum sequence. A desirable curriculum sequence should be efficient, so the student learns quickly, and smooth in hardness, to better bridge the gap between teacher and student.
Greedy Search Strategy. To delve into the optimization route of the student when learning from the teacher, we compute the KL divergence between the outputs of the student and different target states of the teacher on a validation set, which consists of 10k examples randomly sampled from the training set. The teacher is trained by dropping the learning rate at the 60th, 120th and 180th epochs. We choose the 30th, 100th, and 180th epochs as target states to supervise the student separately. Figure 3 shows the KL divergence curves between the student and the intermediate states of the teacher. From the figure we observe that the student supervised by the teacher's 30th epoch is very close to the teacher's 30th epoch, but has a large gap with the teacher's later epochs, especially after the teacher drops its learning rate. The same observations hold for students supervised by the teacher's other epochs.
Inspired by these insights, we propose a greedy search strategy (GS) to find an efficient and hardness-smooth curriculum sequence. The goal of the greedy strategy is to find, at each step, the anchor point that lies on the boundary of the range the student can learn. To find these boundary anchor points, a metric is introduced as follows:
$$M(i, j) = \frac{1}{|\mathcal{V}|}\sum_{x \in \mathcal{V}} \mathrm{KL}\big(p_{t_j}^{\tau}(x) \,\|\, p_{s_i}^{\tau}(x)\big),$$
where $\mathrm{KL}$ is the KL divergence and $\mathcal{V}$ is the validation set. $M(i, j)$ evaluates the hardness of the teacher's $j$-th epoch for a student guided by the teacher's $i$-th epoch. We then introduce a hyper-parameter $\epsilon$ as a threshold indicating the learning ability of the student. When $M(i, j) > \epsilon$, epoch $j$ is hard for a student trained by epoch $i$ to learn, and $M(i, j) \le \epsilon$ means the inverse. Based on this philosophy, we give the complete GS strategy in Algorithm 2.
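The selection loop of the greedy strategy can be sketched as follows; the callable `hardness(i, j)` stands in for the KL-based metric $M(i, j)$ above, and the toy linear hardness used in the example is illustrative only.

```python
def greedy_search(hardness, n_epochs, eps):
    """Sketch of the GS strategy: walk along the teacher's checkpoints
    and, starting from the current anchor i, advance to the furthest
    epoch j whose hardness M(i, j) still stays within threshold eps.
    Each chosen j is a boundary anchor point."""
    anchors = []
    i = 0                                  # current supervising epoch (0 = scratch)
    while i < n_epochs:
        j = i + 1
        # keep advancing while the next epoch remains learnable from epoch i
        while j + 1 <= n_epochs and hardness(i, j + 1) <= eps:
            j += 1
        anchors.append(j)                  # boundary anchor point
        i = j
    return anchors

# Toy hardness that grows linearly with epoch distance.
anchors = greedy_search(lambda i, j: (j - i) * 0.1, n_epochs=12, eps=0.35)
```

With this toy metric, every third checkpoint sits on the learnability boundary, so the returned sequence is evenly spaced; with a real KL-based metric the spacing would instead concentrate anchors where hardness rises sharply, e.g. around learning-rate drops.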
The anchor points near the learning-rate drops appear to be more important than the others. Intuitively, according to Algorithm 2, the optimal learning sequence must contain at least one anchor point from each learning-rate stage. Since this section mainly focuses on the strategy for anchor point selection, we provide an empirical value of $\epsilon$ for MobileNetV2 that achieves a good balance between performance and training cost. Note that although our experiments are based on SGD, GS is also applicable to other optimization methods such as SGDR, since the prerequisites still hold.
4 Experiments
Common Settings. The backbone network for the teacher in all experiments is ResNet-50. For the student, instead of using a smaller ResNet, we use the more compact MobileNetV2 and its variants with different FLOPs, since MobileNetV2 has proven highly effective at keeping high accuracy with low FLOPs in many tasks. The expansion ratio and width multiplier are two tunable parameters controlling the complexity of MobileNetV2. Our default configuration sets the expansion ratio to 6 and the width multiplier to 0.5. The temperature $\tau$ is 5 in the KD loss. Note that all these experiments are based on GS, which usually produces about 4 anchor points.
4.1 Experiment on CIFAR-100
The CIFAR-100 dataset contains 50,000 training images and 10,000 validation images of size 32×32. In this experiment, the teacher network uses an initial learning rate of 0.05, divided by 10 at the 150th, 180th and 210th epochs, and is trained for 240 epochs. We set the weight decay to 5e-4, the batch size to 64 and use SGD with momentum. The student's settings are almost identical to the teacher's, except that the initial learning rate is 0.01.
We compare the top-1 accuracy on the CIFAR-100 dataset and show the results in Table 2. Our method improves top-1 accuracy by about 2.1% compared with KD.
Although the base student network is small and fast, some applications require the model to be even smaller and faster. To further investigate the effectiveness of the proposed method, we conduct extensive experiments applying RCO to a series of MobileNetV2 variants with different width multipliers and expansion ratios. We set the expansion ratio to 4, 6, 8, 10 and the width multiplier to 0.35, 0.5, 0.75, 1.0, forming 16 combinations in total. The FLOPs of these models are shown in Table 3. We rank the models by width multiplier and plot the results in Figure 4, from which we make the following observations: (i) The proposed method exhibits consistent superiority in all settings. (ii) A student network with smaller capacity (e.g., MobileNetV2 with T=4, Width=0.35) generally gains more improvement from RCO. (iii) Although the model with expansion ratio 10 and width multiplier 0.35 has larger FLOPs than the model with expansion ratio 4 and width multiplier 0.5, the former shows reduced performance across all three methods, indicating that an expansion ratio of 10 combined with a width multiplier of 0.35 largely limits the representation power.
4.2 Experiment on ImageNet
The ImageNet dataset contains 1000 classes of images of various sizes and is the most popular dataset for the classification task. In this experiment, the teacher network uses an initial learning rate of 0.4, dropped by a factor of 0.1 at 15k, 30k and 45k iterations, and is trained for 50k iterations. The student network uses an initial learning rate of 0.1, dropped by a factor of 0.1 at 45k, 75k and 100k iterations, and is trained for 130k iterations. For both, we set the weight decay to 5e-4, the batch size to 3072 and use SGD with momentum. To keep training stable with such a large batch size, we use learning-rate warm-up. We compare the top-1 and top-5 accuracy on ImageNet in Table 4. Our method improves top-1/top-5 accuracy by about 1.5%/0.7% compared with KD, which verifies that RCO is applicable to large-scale classification.
4.3 Experiment on Face Recognition
Unlike classification, a face recognition network usually contains a feature layer, implemented as a fully-connected layer, to represent the projection of each identity. Empirical evidence shows that mimicking the feature layer in the way used by FitNet brings more improvement for the student network. We follow this setting in our baseline experiments.
We take two popular face recognition datasets, MS-Celeb-1M and IMDb-Face, as our training sets and validate our method on MegaFace. MS-Celeb-1M is a large public face dataset containing one million identities of different ages, sexes, skin tones and nationalities, and is widely used in face recognition. The IMDb-Face dataset contains about 1.7 million faces of 59k identities, with all images obtained from the IMDb website. MegaFace is one of the most popular benchmarks for face recognition with up to 1 million distractors; it is evaluated using probe and gallery images from FaceScrub.
In this experiment, the teacher network uses an initial learning rate of 0.1, dropped by a factor of 0.1 at 100k, 140k and 170k iterations, and is trained for 220k iterations. We set the weight decay to 5e-4, the batch size to 1024 and use SGD with momentum. We resize the input images to 224×224 without augmentation and train the teacher network with ArcFace. The student network uses an initial learning rate of 0.05, dropped by a factor of 0.1 at 180k and 210k iterations, and is trained for 240k iterations. The remaining settings are identical to the teacher's.
We show our results in Table 5. On this challenging face recognition task, RCO largely boosts the performance of MobileNetV2 compared with the original hint-based learning.
4.4 Ablation Studies
Although RCO achieves decent results in the previous experiments, the extra training time it brings is not negligible. Even if we construct the learning sequence with only 4 anchor points, it still needs 4 times the training epochs of KD or Softmax. Since training time matters in both research and industry, we consider using the same time budget as KD to verify the robustness of RCO. Note that the backbone MobileNetV2 in this section uses an expansion ratio of 4 and a width multiplier of 0.35.
Comparison under Limited Training Epochs. The previous experiments generally need more training epochs than KD. Consider performing RCO with the EEI strategy on CIFAR-100, and let $s$ be the epoch interval used in EEI. To get 4 anchor points, we can set $s$ to 60; the selected anchor points are then the 60th, 120th, 180th, and 240th epochs. A student trained with this learning sequence needs 960 epochs in total, since each anchor point is trained for 240 epochs to ensure convergence.
We then speed up the EEI strategy from multi-stage to one stage (one-stage EEI), where the student is trained for only 240 epochs, by simply modifying the training paradigm as follows: the student is supervised by the teacher's 60th epoch for the student's first 60 epochs, then by the teacher's 120th epoch for the next 60 epochs, and so on.
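The one-stage schedule above can be sketched as a simple mapping from student epoch to supervising teacher checkpoint; the function name and the illustrative values (240 epochs, interval 60) follow the example in the text.

```python
def one_stage_eei_schedule(total_epochs, interval):
    """Sketch of one-stage EEI: for each student epoch, return which
    teacher checkpoint (by epoch number) supervises it. With
    total_epochs=240 and interval=60, student epochs 0-59 follow the
    teacher's 60th-epoch checkpoint, 60-119 the 120th, and so on."""
    schedule = []
    for epoch in range(total_epochs):
        anchor = min(((epoch // interval) + 1) * interval, total_epochs)
        schedule.append(anchor)
    return schedule

sched = one_stage_eei_schedule(240, 60)
```

Unlike the multi-stage variant, the total student budget stays at 240 epochs because the anchor switch happens inside a single training run rather than restarting training per anchor.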
In one-stage EEI, it is natural to evaluate the impact of different numbers of anchor points. Let $N$ be the size of the training set and $b$ the batch size. We start from the smallest case, where the interval is $1/(N/b)$ epochs (1.28E-3 in Table 6, meaning the student mimics the teacher at every iteration), then gradually increase the interval up to the largest case, where it equals the maximum number of epochs (240) and RCO degrades into KD.
From the perspective of the optimization route, a previous method can be regarded as a particular case of RCO that sets the interval to this smallest value and matches logits with the KD loss instead of the MSE loss. Besides, we also implement DML and compare it with KD. The results in Table 6 show that RCO outperforms the other methods in all settings. By setting the interval to 10 epochs, RCO gains 4.2% and 3.8% on CIFAR-100 compared with KD and DML respectively.
Comparison of Different Strategies. Since the strategy is the most crucial part of RCO, we compare the proposed strategies. For practical considerations, we limit the training epochs to no more than four times those of KD. We compare the following strategies: one-stage EEI, EEI-x and GS, where the "x" in "EEI-x" is the number of anchor points selected with the EEI strategy. The results are shown in Table 7, from which we make the following observations: 1) all strategies show great superiority over KD; 2) GS is the best strategy among them and should be used when training time is not a constraint.
Visualization of Trajectory. To further analyze our method, we plot the student's training trajectory using the PCA directions suggested by Li et al. The process is as follows: given $n$ training epochs, let $w_i$ denote the model parameters at epoch $i$ and $w_n$ the final estimate. We then apply PCA to the matrix $M = [w_1 - w_n; \dots; w_{n-1} - w_n]$ and choose the two most principal directions.
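The projection step above can be sketched in NumPy as follows; the function name is illustrative, and a random toy trajectory stands in for real model parameters.

```python
import numpy as np

def trajectory_pca(weights):
    """Sketch of the trajectory plot: stack parameter snapshots relative
    to the final estimate, run PCA (via SVD), and project every epoch
    onto the two most principal directions."""
    W = np.asarray(weights, dtype=float)      # shape: (epochs, n_params)
    M = W - W[-1]                             # differences to the final estimate
    # PCA: top right-singular vectors of the centered difference matrix
    _, _, vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
    directions = vt[:2]                       # two most principal directions
    return M @ directions.T                   # 2-D trajectory coordinates

# Toy trajectory of 5 "epochs" of a 4-parameter model.
coords = trajectory_pca(np.random.randn(5, 4))
```

By construction the final epoch maps to the origin of the 2-D plot, so all three students' curves in Figure 5 can be compared by where their trajectories end relative to where they start.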
In Figure 5 the training trajectory of MobileNetV2 on CIFAR-100 is plotted for the student in three modes: 1) Softmax, 2) KD, 3) the proposed method (Ours). For a fair comparison, the three students are initialized with the same parameters (marked as a red dot) and each is trained for 240 epochs, where RCO uses one-stage EEI with three anchor points. On the RCO curve, the blue dots mark the epochs where the student's supervision switches to the next anchor point. For KD and Softmax, the epochs where the learning rate is decreased are shown as black dots.
The first anchor point keeps the student away from the direction suggested by Softmax or KD and leads it to an intermediate state. This state itself may not lie in a well-performing region of parameter space, but with the guidance of the succeeding anchor points the student network eventually reaches a deeper local minimum, which demonstrates the importance of the optimization route from the teacher network.
Visualization of Robustness to Noise. Besides visualizing the optimization trajectory, we also observe that the new local minimum has better generalization capacity and is more robust to random noise in the input space. We consider adding noise to the testing images. First we calculate the standard deviation $\sigma_{img}$ for each image and set a ratio $k$ ranging from 0.0 to 1.0 in steps of 0.1. The noise is sampled from $\mathcal{N}(0, \sigma^2)$, where $\sigma = k\,\sigma_{img}$. We show some noised images in the bottom row of Figure 6: the images in the first column are clear, but as $k$ increases the images become illegible, especially in the last column. We run this experiment on models trained with both KD and RCO and compare their losses. The loss gap between KD and RCO becomes more significant as $k$ increases, which suggests that the model trained with RCO is more robust to noise. The results are drawn at the top of Figure 6.
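The perturbation itself is straightforward; a minimal NumPy sketch of the noise injection, with an illustrative function name and a toy gradient image, follows.

```python
import numpy as np

def add_image_noise(image, k, rng=None):
    """Sketch of the robustness probe: perturb an image with Gaussian
    noise whose scale is k times the image's own standard deviation,
    for k ranging over 0.0, 0.1, ..., 1.0."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = image.std()                       # per-image standard deviation
    noise = rng.normal(0.0, k * sigma, size=image.shape)
    return image + noise

img = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # toy gradient "image"
noisy = add_image_noise(img, k=0.5, rng=np.random.default_rng(0))
```

Scaling the noise by each image's own standard deviation keeps the perturbation comparable across images of different contrast, so the loss gap curves over $k$ are not dominated by a few high-contrast examples.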
5 Conclusion
We have proposed a simple, effective and generally applicable method to boost the performance of a small student network. By constructing an easy-to-hard sequence of learning targets, the small student network achieves much higher performance than with other knowledge transfer methods. Moreover, we offer two practical strategies to construct the sequence of anchor points. For future work, we would like to explore strategies that design the learning sequence automatically.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48, 2009.
-  R. Caruana and A. Niculescu-Mizil. Model compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. 2018.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: Challenge of recognizing one million celebrities in the real world. Electronic Imaging, 2016(11):1–6, 2016.
-  S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
-  S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Y. He, J. Lin, Z. Liu, H. Wang, L. J. Li, and S. Han. Amc: Automl for model compression and acceleration on mobile devices. 2018.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  I. Hubara, M. Courbariaux, D. Soudry, E. Y. Ran, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
-  H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
-  Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7341–7349. IEEE, 2017.
-  Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2755–2763. IEEE, 2017.
-  I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
-  P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang. Face model compression by distilling knowledge from neurons. In AAAI Conference on Artificial Intelligence, 2016.
-  S.-I. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393, 2019.
-  P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. 2016.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
-  G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson. Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691, 2016.
-  F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C. Loy. The devil of face recognition is in the noise. arXiv preprint arXiv:1807.11649, 2018.
-  J. Wu, L. Cong, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
-  J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
-  S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. 2016.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2017.
-  Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. arXiv preprint arXiv:1706.00384, 6, 2017.
-  G. Zhou, Y. Fan, R. Cui, W. Bian, X. Zhu, and K. Gai. Rocket launching: A universal and efficient framework for training well-performing light net. stat, 1050:16, 2017.