Deep convolutional neural networks (CNNs) have achieved great success in many computer vision tasks such as image classification [10, 14], object detection [21, 27], and semantic segmentation [9, 4]. However, the enormous computational cost of CNNs makes it very slow to run the models on resource-constrained devices such as mobile phones. Thus, it is essential to reduce the computational cost and accelerate the inference of CNNs before deployment.
Neural network pruning [8, 6, 22, 11, 31] is one of the most popular model acceleration methods. It removes redundant weights or filters in CNNs to reduce computations. Most neural network pruning methods can be divided into two groups: weight-level and filter-level. Weight-level pruning [8, 6, 7]
sets redundant weights in CNNs to zeros, making weight matrices or tensors sparse. Some previous works [22, 18, 31] have pointed out that weight-level pruning contributes little to accelerating the inference of CNNs unless specialized libraries (such as cuSPARSE, https://docs.nvidia.com/cuda/cusparse/index.html) are used. However, the support of these libraries on mobile devices is limited. On the other hand, filter-level pruning [22, 31, 11] removes redundant filters to reduce computations directly. Nonetheless, computations in the same layer are highly parallelized, which means most of the reduced computations would have run in parallel with the remaining ones. As a result, the acceleration ratio of filter-level pruning is limited.
As CNNs become deeper and deeper, many layers in them are redundant. Since the computations of different layers run in serial, reducing the number of layers can achieve a higher acceleration ratio than filter-level pruning methods. Thus, we propose a block-level pruning method to prune redundant layers in CNNs. Namely, we take a sequence of consecutive layers (e.g., Conv-BN-ReLU) as a block and prune redundant blocks to reduce computations.
The key to block-level pruning is how to find redundant blocks. Motivated by [35, 3], which propose that the discrimination of features in CNNs is enhanced block by block, we explore the discrimination of each block's output features as shown in Figure 1. Specifically, we place a linear classifier after each block and test the accuracy of the classifier on a dataset. The higher the accuracy of the classifier is, the more discriminative the features are. The results show that the discrimination of features ascends as the block goes deeper, which is consistent with the conclusion of [35, 3]. However, we also find that the discrimination of features ascends slowly or even descends at some blocks. Based on previous works and our observations, we assume that these blocks are redundant and can be pruned with acceptable loss.
Some algorithms also support block-level pruning. [15, 32] use a norm-based importance evaluation, which has been shown to be inappropriate in prior work. GAL [20] utilizes generative adversarial learning. However, its performance is still limited, and the reason might be that generative adversarial networks are difficult to converge.
Extensive experiments show that our discrimination based block-level pruning (DBP) achieves a higher acceleration ratio as well as higher accuracy than state-of-the-art filter-level and block-level pruning. Additionally, we also compare DBP with knowledge distillation because it can also yield shallow models with high accuracy. Experiments show that DBP surpasses the state-of-the-art knowledge distillation based model acceleration methods.
It is worthwhile to highlight our contributions:
- We analyze the reason for the limited acceleration ratio of filter-level pruning and propose that block-level pruning avoids the problem well.
- We propose a discrimination based criterion to identify the redundant blocks in CNNs.
- Extensive experiments show that DBP surpasses state-of-the-art block-level pruning, filter-level pruning, and knowledge distillation methods in both accuracy and acceleration ratio.
2 Related works
Neural network pruning
Neural network pruning is devoted to removing redundant weights in networks to accelerate their inference, and it can be divided into three groups: weight-level, filter-level, and block-level.

Weight-level pruning identifies redundant weights in filters and sets them to zeros. For example, it has been proposed that weights with small absolute values can be set to zeros without loss. Weight-level pruning makes weight tensors sparse, and it can accelerate the inference of CNNs with the help of specialized libraries (e.g., cuSPARSE). Unfortunately, the support of these specialized libraries on mobile devices is limited.

Filter-level pruning solves the problem by pruning unimportant filters in CNNs and reducing computations directly. The key to filter-level pruning is how to define the importance of filters. For example, [18, 22] take advantage of this magnitude-based idea and evaluate the importance of filters according to their amplitude. [31, 11] consider the interrelations among filters when evaluating importance. Generally speaking, filter-level pruning has achieved great success, but, as we have discussed above, its acceleration ratio is still limited because of parallel computing.

Block-level pruning takes a sequence of consecutive layers as a block and removes redundant blocks to reduce the computations in CNNs. Considering that computations in different layers run in serial, pruning more blocks directly means a higher acceleration ratio. [32, 15, 20] all support block-level pruning. Specifically, [32, 15] identify redundant blocks and filters through a norm-based criterion, and GAL [20] utilizes generative adversarial networks (GANs) to prune blocks and filters simultaneously. However, the norm-based criterion has been shown to be inappropriate, and GANs are hard to converge.
Knowledge distillation

The main idea of knowledge distillation (KD) is to transfer knowledge from a trained teacher model to an untrained student model, so KD helps on many tasks such as transfer learning, few-shot learning, and model acceleration. When applied to model acceleration, KD transfers knowledge from a large teacher model to a small student model, and the key to KD is what knowledge should be transferred to the student. [13, 1] take the output of the last layer as the knowledge and introduce soft targets and a mimic loss to transfer it. Other algorithms [37, 12, 30] explore the knowledge in intermediate layers and combine it with soft targets and mimic loss to achieve higher accuracy. DBP also takes advantage of the mimic loss when fine-tuning the model to advance its performance.
Deep CNNs consist of a number of repeated blocks, and most computations of CNNs come from these blocks. Thus, we aim to accelerate the inference of CNNs by pruning redundant blocks while keeping the accuracy of the CNNs.
We first introduce how we identify and prune redundant blocks, namely our discrimination based criterion, in Section 3.2. Then the strategies for recovering the performance of the model after pruning are introduced in Section 3.3. In Section 3.4, we do some customization for models with special structures. For simplicity, we use $B_i$ to represent the $i$-th block in a CNN.
3.2 Discrimination based criterion
[35] has explored the discrimination of the output features of each block in CNNs by placing a linear classifier after each block, and finds that the discrimination of features ascends as the block goes deeper.
[3] shows that training a deep CNN on ImageNet block by block generates a model whose accuracy is as high as that of an end-to-end-trained model. Though there has not been a reasonable explanation for these phenomena, these experiments show that the performance of deep CNNs is gradually enhanced block by block.
Motivated by [35, 3], we further explore the discrimination of each block's output features in CNNs. We use a fully-connected layer as our linear classifier and place it after each block of a trained CNN. We fix the weights of the CNN and train these classifiers. Consequently, it is natural to take the accuracy of these classifiers as the discrimination of the blocks' output features: the higher the accuracy is, the more discriminative the features are. We experiment on CNNs of different depths, and the results are shown in Figure 2. The accuracy ascends as the block goes deeper, which is consistent with the conclusion in [35, 3]. Moreover, we also find two more interesting phenomena and propose our assumptions:
The accuracy of the classifier descends at some blocks, which indicates that their extracted features are confusing and degrade the discrimination of the output features. We call them degraded blocks and call the other blocks upgraded blocks. We explore each block's contribution to the model by removing it and checking the change in the model's accuracy. As Table 1 shows, compared with removing an upgraded block, removing a degraded block has limited influence on the performance of the model, which means the weights of degraded blocks contribute less to enhancing the performance of the whole model. As a result, degraded blocks can be pruned with acceptable loss.
The trend line of a shallow model is always steeper than that of a deeper one. In other words, the deeper the model is, the more slowly the discrimination of its features ascends. We assume that this is because many blocks in deep models contribute little to raising the discrimination of the features, and these blocks can also be pruned with acceptable loss.
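As a minimal sketch of the probing setup above, a linear classifier (one fully-connected layer) can be attached after each block of a frozen CNN and trained alone. The backbone and all names here (`TinyBackbone`, `probes`) are illustrative toys, not the paper's models:

```python
import torch
import torch.nn as nn

# Toy backbone: a list of Conv-BN-ReLU "blocks" whose weights stay frozen.
class TinyBackbone(nn.Module):
    def __init__(self, channels=(3, 8, 16)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats  # one feature map per block

backbone = TinyBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # fix the CNN's weights

# One linear classifier per block, sized from the block's output features.
num_classes = 10
with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 8, 8))
probes = nn.ModuleList(
    nn.Linear(f.flatten(1).shape[1], num_classes) for f in feats
)

# One training step for the probes only (random data stands in for a dataset).
opt = torch.optim.SGD(probes.parameters(), lr=0.1)
x = torch.randn(16, 3, 8, 8)
y = torch.randint(0, num_classes, (16,))
with torch.no_grad():
    feats = backbone(x)
loss = sum(
    nn.functional.cross_entropy(p(f.flatten(1)), y)
    for p, f in zip(probes, feats)
)
opt.zero_grad()
loss.backward()
opt.step()
```

The test accuracy of each probe then serves as the discrimination score of the corresponding block's features.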
With these two assumptions, we observe and prune redundant blocks with the following steps:
1. Place a fully-connected layer as a linear classifier after each block of a CNN and train these classifiers.
2. Compute the contributions of all blocks. Without loss of generality, we define the contribution of block $i$ as $c_i = a_i - a_{i-1}$, in which $a_i$ means the accuracy of the linear classifier after block $i$.
3. Find the blocks with the least contributions and prune them.
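The selection step can be sketched as follows, under the assumption that a block's contribution is its gain in probe accuracy over the preceding block (the exact definition is elided in the text); the function name and accuracies are illustrative:

```python
# Select the blocks with the least contribution, where a block's contribution
# is taken here as the probe-accuracy gain over the preceding block.
def least_contributing_blocks(probe_acc, ratio):
    """probe_acc[i]: accuracy of the linear classifier after block i.
    Returns indices of the `ratio` fraction of blocks to prune."""
    contrib = [
        acc - prev for prev, acc in zip(probe_acc[:-1], probe_acc[1:])
    ]
    # contrib[i] belongs to block i + 1 (block 0 has no predecessor probe).
    n_prune = max(1, round(ratio * len(contrib)))
    order = sorted(range(len(contrib)), key=lambda i: contrib[i])
    return sorted(i + 1 for i in order[:n_prune])

# Degraded blocks (negative gain) are picked first.
accs = [0.42, 0.55, 0.53, 0.61, 0.62, 0.70]
print(least_contributing_blocks(accs, 0.25))  # block 2 degrades accuracy
```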
3.3 Performance recovery
Pruning blocks will hurt the performance of the model, so we fine-tune the model after pruning to recover its performance. We take advantage of some techniques from knowledge distillation and filter-level pruning during the process of fine-tuning:
It is a common technique in knowledge distillation to use a mimic loss [1, 37, 20] to advance the accuracy of the model. Specifically, [1] finds that forcing the small (student) model to mimic the last layer's output of the huge (teacher) model helps to improve the performance of the small model, and proposes an MSE-based mimic loss. We follow this manner and re-define our loss function during the fine-tuning process as Equation 1:

$$\mathcal{L} = \lambda \, \|z_t - z_s\|_2^2 + \mathcal{L}_{CE}(y, z_s), \qquad (1)$$

in which $\lambda$ is a hyper-parameter, $z_t$ and $z_s$ are the logits (the outputs before activations) of the unpruned model and the pruned model respectively, and $y$ is the one-hot label of the ground truth. The first term in the equation is the mimic loss, and the second term is the cross-entropy loss commonly used [10, 14] in classification.
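A sketch of this combined objective, assuming the mimic term is an MSE between the two models' logits (the text later calls it an MSE-based mimic loss) weighted by a hyper-parameter; the function and variable names are ours:

```python
import torch
import torch.nn.functional as F

def dbp_finetune_loss(teacher_logits, student_logits, labels, lam=1.0):
    """Mimic term (MSE between logits) plus cross-entropy.
    `lam` plays the role of the weighting hyper-parameter."""
    mimic = F.mse_loss(student_logits, teacher_logits.detach())
    ce = F.cross_entropy(student_logits, labels)
    return lam * mimic + ce

t = torch.randn(4, 10)                       # logits of the unpruned model
s = torch.randn(4, 10, requires_grad=True)   # logits of the pruned model
y = torch.randint(0, 10, (4,))
loss = dbp_finetune_loss(t, s, y)
loss.backward()  # gradients flow to the pruned model only
```

Detaching the teacher logits ensures the unpruned model stays fixed while only the pruned model is updated.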
Iterative pruning has been shown to help improve the performance of pruning in many papers [23, 19, 25]. Explicitly, they prune redundant filters layer by layer and fine-tune the model after pruning each layer, because pruning all layers at once may lead to unrecoverable accuracy loss. Besides, research on knowledge distillation also suggests that the performance of the student model suffers if the teacher model is much more complicated than the student. Motivated by these findings, we also prune redundant blocks in an iterative manner. Specifically, we use a hyper-parameter to control the ratio of blocks to prune each time and repeat the pruning and fine-tuning steps until we get a model as small as we need. Considering that there is a tradeoff between accuracy and acceleration ratio, iterative pruning has another advantage: it generates some intermediate models, and users can stop pruning once they meet a satisfying one.
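The iterative schedule can be sketched as a simple loop; the pruning and fine-tuning steps are abstracted away, and the ratio and block counts below are illustrative, not the paper's settings:

```python
# Skeleton of the iterative schedule: each round prunes a fraction of the
# remaining blocks and is followed by fine-tuning (abstracted away here).
def iterative_prune(n_blocks, per_round_ratio, target_blocks):
    history = [n_blocks]  # every intermediate model can be kept
    while n_blocks > target_blocks:
        n_drop = max(1, int(n_blocks * per_round_ratio))
        n_blocks = max(target_blocks, n_blocks - n_drop)
        # ... fine-tune the pruned model here before the next round ...
        history.append(n_blocks)
    return history

print(iterative_prune(81, 0.6, 9))
```

Each entry of `history` corresponds to an intermediate model, so a user can stop at any round once accuracy and speed are satisfactory.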
3.4 Customization for special structures

Conv-BN-ReLU is the most common block in CNNs. However, it is not appropriate to take it as the minimum pruning unit for ResNet owing to the special structure of ResNet. Specifically, pruning block $B_i$ means we will take the output of $B_{i-1}$ as the input of $B_{i+1}$ directly. This requires the number of output channels of $B_{i-1}$ to be the same as the number of input channels of $B_{i+1}$. However, the bottleneck design in ResNet means most blocks do not satisfy this condition. As a result, we take a residual block in ResNet as the minimum pruning unit, and then most blocks can be pruned safely. Moreover, it has also been shown that removing a residual block in ResNet does not cut off the information flow in ResNet, which is friendly to block-level pruning.
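A minimal illustration of why a residual block is a safe pruning unit: its input and output channel counts match, so it can be swapped for an identity without breaking the surrounding layers (a toy block, not the actual ResNet code):

```python
import torch
import torch.nn as nn

# A residual block keeps its input and output channel counts equal, so it
# can be replaced by an identity without any channel mismatch.
class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

net = nn.Sequential(ResBlock(8), ResBlock(8), ResBlock(8))
x = torch.randn(2, 8, 16, 16)
before = net(x).shape

net[1] = nn.Identity()  # prune the middle residual block
after = net(x).shape
assert before == after  # the information flow is intact, just shallower
```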
According to the definition in DenseNet, each dense block contains too many layers, and it is difficult to keep the accuracy of the model once we prune one such block. So, we have to re-define the block in DenseNet as a sequence of consecutive BN-ReLU-Conv layers. Note that we only give "block" a new meaning without changing the structure of DenseNet. Consequently, as in the example shown in Figure 3, the output of a certain block will be the input of many other blocks, and pruning a block $B_i$ means that the number of input channels of all blocks after $B_i$ will be reduced.
4.1 Datasets and architectures
We experiment on two well-known classification datasets: CIFAR and ImageNet. CIFAR contains 50,000 images for training and 10,000 images for evaluation, and the size of each image is 32×32. CIFAR10 and CIFAR100 are two variants of CIFAR, which contain 10 and 100 classes, respectively, and we use both of them in our experiments. ImageNet is a dataset containing 1.28 million training images and 50,000 testing images for 1,000 categories of objects. Each image in ImageNet is resized so that the network input is 224×224.
We use two famous deep CNNs in our experiments: ResNet and DenseNet. Both of them have very deep versions of over a hundred layers and achieve competitive results. We try these two CNNs with several different depths, and our DBP achieves great success on all of them.
4.2 Compared algorithms and evaluation protocol
We compare our DBP with three kinds of model acceleration methods: block-level pruning, knowledge distillation, and filter-level pruning. Both block-level pruning and knowledge distillation based acceleration methods reduce the number of blocks in CNNs, so it is necessary to compare with them. Besides, we also compare with filter-level pruning to highlight our considerable acceleration ratio over it, though filter-level pruning methods are not direct competitors, since DBP can be combined with them for further acceleration.
The target of model acceleration is accelerating the inference while keeping the accuracy of models. Other papers on pruning [18, 22, 31, 11] use the accuracy of the pruned model and the reduction ratio of floating-point operations (FLOPs) to evaluate their algorithms. We add a third evaluation protocol, namely the acceleration ratio. We define the acceleration ratio (AR) as Equation 2:

$$AR = \frac{t_o}{t_p}, \qquad (2)$$

in which $t_o$ and $t_p$ represent the inference time of the original huge model and the pruned model, respectively. The acceleration ratio is the most direct metric of the acceleration effect.
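One possible way to measure the acceleration ratio on CPU is to average the wall-clock forward time over many runs; the two models below are toy stand-ins for the original and pruned networks:

```python
import time
import torch
import torch.nn as nn

def avg_inference_time(model, x, n_runs=100):
    """Average wall-clock forward time over n_runs examples (CPU)."""
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return (time.perf_counter() - start) / n_runs

# Toy original vs. pruned model: the pruned one has fewer serial layers.
original = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
pruned = nn.Sequential(*[nn.Linear(64, 64) for _ in range(2)])
x = torch.randn(1, 64)

t_o = avg_inference_time(original, x)
t_p = avg_inference_time(pruned, x)
acceleration_ratio = t_o / t_p  # AR = t_o / t_p
```

Averaging over many examples and discarding the warm-up run reduces the influence of timing noise.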
4.3 Implementation details
The baseline models on ImageNet are downloaded from the official website of PyTorch (https://pytorch.org/).
The linear classifiers after each block are trained on the same dataset as the baseline. Considering that there are only a few parameters in a linear classifier, the classifiers are trained for only three epochs, with learning rates of 0.1, 0.01, and 0.001 for the three epochs respectively, when experimenting on CIFAR. When experimenting on ImageNet, we train for only 3,000 iterations, with learning rates of 0.1, 0.01, and 0.001 for each 1,000 iterations, to save time.
When fine-tuning the model, we always set $\lambda$ in Equation 1 to 1.0 for all experiments and use the same data augmentation and batch size as when training the baseline. Given a global pruning ratio, all models are pruned three times iteratively, and the pruning ratio for each round is derived from the global one accordingly. We then fine-tune the pruned model for as many epochs as training the baseline after each round of pruning. Thus, the total number of fine-tuning epochs is three times that of training the baseline.
4.4 Results and analysis
4.4.1 Results compared with block-level pruning
To the best of our knowledge, there has not been any algorithm designed solely for block-level pruning, but some algorithms such as GAL and SSS perform block-level and filter-level pruning simultaneously. When comparing with them, DBP only prunes redundant blocks while GAL and SSS prune both redundant blocks and filters. Notably, we use two 2.2GHz CPUs to run the models, and we run 100 examples and take their average inference time as the result.
As shown in Table 2, DBP performs far better than them in both accuracy and acceleration ratio, even though we only prune redundant blocks. We ascribe our advantages over them to two aspects: 1) Some methods like SSS add an extra specially designed regularizer to the loss function and train the huge model as well as generate the pruned model in one pass. Since the training of all blocks is constrained by the regularizer, the performance of the model is affected by it. In contrast, DBP uses no extra regularizer besides the common ones when training the huge model, so the huge model can be trained to its full potential, and the good performance of the huge model benefits the pruned one. 2) Some methods, like GAL, utilize generative adversarial learning to prune redundant blocks. However, generative adversarial networks can be hard to train. In contrast, we prune blocks through the discrimination based criterion, and the models converge easily during fine-tuning.
4.4.2 Results compared with knowledge distillation
Following the practice of most knowledge distillation methods [2, 33, 34, 12, 37], we use the same teacher and student models when comparing with most algorithms. As a consequence, their acceleration ratios are the same as ours, and we only need to compare the accuracy of the student models. Because DBP decides the architecture of the pruned model automatically while knowledge distillation uses manually designed ones, we take our pruned model as the student model for knowledge distillation. "NONE" trains the small model from scratch without knowledge distillation and is the benchmark for all algorithms. Note that some algorithms do not publish their code, so we only compare DBP with the results in their papers that share the same architecture, depth, and FLOPs as ours. Considering that models with the same architecture, depth, and FLOPs also share similar acceleration ratios, the comparison still makes sense.
The results are shown in Table 3. We experiment on different depths of ResNet and DenseNet with the CIFAR and ImageNet datasets. The results show that all algorithms perform better than "NONE", while our DBP achieves the best performance in all experiments. It is worth highlighting that the accuracy of DBP is at least 1.5% higher than that of the other algorithms on ImageNet. We do not compare with RoLa on DenseNet because RoLa cannot be applied to DenseNet directly. We ascribe our advantage over these knowledge distillation methods to two reasons: 1) Compared with those using the same small (student) models as DBP, we fine-tune the model while knowledge distillation trains it from scratch, and fine-tuning a trained model has been shown to be better than training from scratch in many papers [18, 8, 25]. 2) Besides the influence of fine-tuning, DBP also finds a better student architecture than theirs.
4.4.3 Results compared with filter-level pruning
Slim, FPGM, and COP are three state-of-the-art filter-level pruning algorithms, which utilize different characteristics of filters to identify redundant ones. The experimental results are shown in Table 4. As we can see, though the FLOPs reduction ratio of DBP is smaller than that of the filter-level pruning methods, its inference time is always much lower because of the pruning of layers. It is also worth highlighting that the combination of DBP and a filter-level pruning algorithm achieves the highest acceleration ratio. Note that the inference of the COP-pruned model is faster than that of the FPGM-pruned and Slim-pruned models, even though COP reduces fewer FLOPs. The reason is that both FPGM and Slim introduce extra operations into the pruned model, which cost some inference time.
4.5 Ablation study
We ascribe the success of DBP to three aspects: 1) Pruning redundant blocks removes redundant information and preserves the important one. 2) Iterative pruning helps in advancing the accuracy. 3) MSE-based mimic loss provides useful information to the pruned model. In this section, we will explore the contributions of these aspects, and the experimental results are all in Table 5.
4.5.1 Pruning blocks randomly
To check the effectiveness of the discrimination based criterion, we compare it with a random pruning strategy. Specifically, we experiment under the same conditions as DBP except that the pruned blocks are chosen randomly. The experiment is repeated ten times, and the results are shown in the "Random" column of Table 5.
The results show that the average accuracy of random pruning is always lower than the accuracy of DBP, and the difference on CIFAR100 is bigger than on CIFAR10 because CIFAR100 is more complicated. This indicates that discrimination based pruning does remove redundant blocks and preserve important information.
4.5.2 Pruning all once & fine-tuning without mimic-loss
As described in Section 3.3, two techniques are used when recovering the performance of our pruned model: iterative pruning and the MSE-based mimic loss. To check the enhancement brought by them, we use the original DBP as our baseline and conduct three control experiments:
- Pruning all redundant blocks at once & fine-tuning without mimic loss. (DBP-A)
- Pruning all redundant blocks at once & fine-tuning with mimic loss. (DBP-B)
- Iteratively pruning the redundant blocks & fine-tuning without mimic loss. (DBP-C)
The results are shown in the "DBP", "DBP-A", "DBP-B", and "DBP-C" columns of Table 5. Generally speaking, using both techniques achieves the highest accuracy, which means DBP is the best choice for both ResNet and DenseNet. However, as we can see, the impacts of these two techniques on ResNet and DenseNet are different:
Impacts on ResNet
The performance of DBP, DBP-A, and DBP-B is not much different. However, DBP-C yields awful results, which means the mimic loss is very important to iterative pruning. The results are somewhat counter-intuitive, but we think they are still reasonable. Specifically, if an intermediate model does not converge well, it will influence the selection of blocks to be pruned in the next round and thus affect the final accuracy of the pruned model; the mimic loss guarantees the fast convergence of these intermediate models.
Impacts on DenseNet
Only DBP yields satisfying results, which means both iterative pruning and the mimic loss are essential when pruning DenseNet. The output features of a block in DenseNet are used by many other blocks, so pruning one block affects many others. In other words, DenseNet is more sensitive to pruning than ResNet, and pruning too many blocks at once leads to an unrecoverable loss of performance.
4.6 Case study
In this section, we explore what happens during the pruning process. Recall that we defined a degraded block as a block at which the discrimination of features degrades. We prune a ResNet with 81 blocks down to 9 blocks through three rounds of pruning. As shown in Figure 4, the ratio of redundant blocks among all blocks decreases as pruning progresses.
We also track some preserved blocks after pruning. Figure 5 shows that the output features of the preserved blocks look similar before and after pruning, which means that DBP preserves the important information, and the pruned model uses fewer blocks to express information similar to that of the original model.
In this paper, we point out the limited performance of filter-level pruning in accelerating the inference of CNNs and analyze the reasons. To solve the problem, a discrimination based block-level pruning (DBP) is proposed. DBP outperforms the state-of-the-art and achieves a considerable acceleration ratio with acceptable accuracy loss.
-  Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NeurIPS, 2014.
-  Vasileios Belagiannis, Azade Farshad, and Fabio Galasso. Adversarial network compression. In Computer Vision - ECCV 2018 Workshops, 2018.
-  Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can scale to imagenet. In ICML, 2019.
-  Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.
-  Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NeurIPS, 2017.
-  Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NeurIPS, 2016.
-  Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR, 2019.
-  Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge distillation with adversarial samples supporting decision boundary. In AAAI, 2019.
-  Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In ECCV, 2018.
-  Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  Fengfu Li and Bin Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
-  Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David S. Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In CVPR, 2019.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In ECCV, 2016.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
-  Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
-  Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. CoRR, abs/1902.03393, 2019.
-  Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
-  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV, 2015.
-  Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
-  Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. CoRR, abs/1907.09682, 2019.
-  Wenxiao Wang, Cong Fu, Jishun Guo, Deng Cai, and Xiaofei He. COP: customized deep model compression via regularized correlation-based filter-level pruning. In IJCAI, 2019.
-  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NeurIPS, 2016.
-  Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
-  Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
-  Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
-  Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, and Kun Gai. Rocket launching: A universal and efficient framework for training well-performing light net. In AAAI, 2018.