1 Introduction
In recent years, more and more methods have been proposed for CNN inference optimization, such as model quantization [Gong et al.2014, Han et al.2015b], tensor decomposition (TD) [Kim et al.2015], weight pruning [Wen et al.2016, Liu et al.2017], knowledge distillation (KD) [Hinton et al.2015, Yim et al.2017] and new network architecture designs [Iandola et al.2016, Howard et al.2017]. Most existing methods achieve promising model compression and floating-point operations (FLOPs) reduction with marginal accuracy degradation. However, a smaller model size and fewer FLOPs do not guarantee practical inference speedup. In this work, we propose the GAP method, which prunes weights so as to achieve better practical inference optimization. The pruned network can be run by any off-the-shelf deep learning library, so practical acceleration is obtained without any effort to build additional compute libraries. Our method falls into the class of weight pruning based on structural sparsity constraints.
Most existing structural-sparsity-induced weight pruning methods prune the network at either channel level or kernel level. [Wen et al.2016] uses group sparsity regularization to help select removable kernels or channels, and network slimming (NS) [Liu et al.2017] instead applies an $\ell_1$ norm on the scaling factors of batch normalization (BN) layers for channel selection. However, recent models have much more complex structures, with cross-connections such as 1-to-n and n-to-1 connections. For such structures, pruning a channel does not entail pruning the preceding kernel, since the output of the filter may still be used by other layers. As a result, post-processing may be necessary after channel-level pruning to maintain the network topology. For example, DenseNet [Huang et al.2016] has 1-to-n connections: the output of one layer is reused by all subsequent convolution layers, and each reuse has its own BN layer. After channel-level pruning, the NS method keeps all the kernels before the 1-to-n connection and inserts a selection layer before each connected subsequent layer to determine which subset of the received channels should be selected. The selection layer involves memory copies and increases inference time. Therefore, to obtain a practically efficient network, a more general method is needed that takes the network topology into consideration during pruning, so as to avoid additional post-processing.

Tackling this issue, we propose GAP for network-topology-adaptive pruning. In this method, the network is viewed as a computational graph $G$, with the vertices denoting computation operations and the edges describing the information flow. In GAP, we prune the network by removing certain vertices or edges based on graph topology analysis. According to graph theory, the vertices can be divided into articulation points and non-articulation points, where an articulation point of $G$ is a vertex whose removal disconnects $G$ [Cormen et al.2009]. To guarantee the information flow from input to output, only the non-articulation points can be pruned. Similarly, the edges can be classified into bridges and non-bridges, where a bridge of $G$ is an edge whose removal disconnects $G$. Only the non-bridges can be removed; otherwise the information flow is broken.

The whole procedure of GAP follows the framework of sparsity-induced weight pruning methods: 1) impose a structural sparsity constraint on the model parameters during training; 2) prune the vertices or edges with minor significance; 3) fine-tune the pruned graph. In GAP, pruning can be conducted at either vertex level or edge level. At vertex level, the graph topology is considered in order to avoid post-processing that may hurt inference efficiency. For the vertices on the same cross-connection, regularization is conducted collaboratively using group sparsity, so that they are pruned all together or kept all together. At edge level, we mainly focus on slimming the multiple paths, so the graph is analyzed at a coarser level. Although coarser-level pruning may suffer more serious performance degradation [Mao et al.2017], edge-level pruning is still considered in our method, since it reduces not only the computation cost but also the number of memory accesses, which further accelerates inference in practice.
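The articulation-point test underlying this classification can be sketched in plain Python. The sketch below treats the computational graph as undirected and uses Tarjan's depth-first search; the adjacency-dict representation and the helper name are illustrative, not from the paper:

```python
def articulation_points(adj):
    """Find articulation points of an undirected graph via Tarjan's DFS.

    adj: dict mapping each vertex to an iterable of its neighbours.
    Returns the set of vertices whose removal disconnects the graph;
    all remaining vertices are candidates for pruning.
    """
    disc, low, aps = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # u is an articulation point if the subtree under v
                # cannot reach any ancestor of u except through u
                if parent is not None and low[v] >= disc[u]:
                    aps.add(u)
        if parent is None and children > 1:
            aps.add(u)  # DFS root with multiple subtrees

    for u in list(adj):
        if u not in disc:
            dfs(u, None)
    return aps
```

For a chain a–b–c, the middle vertex b is an articulation point and must be kept; in a cycle, no vertex is.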
For the fine-tuning step, we introduce a self-taught KD procedure. In traditional KD methods, a more complex network is used as the teacher to guide the student network training. In the proposed method, we use the original unpruned model as the teacher, so it is effectively a self-taught mechanism.
The contributions of this work can be summarized as follows:
1. We propose the GAP method for topology-adaptive CNN inference optimization, which needs no post-processing even when the network contains cross-connections.
2. In GAP, a CNN model can be pruned at vertex level as well as at edge level for networks with multi-path data flow.
3. A self-taught KD mechanism for fine-tuning is proposed to further improve the performance of the pruned network.
2 Related work
Inference optimization methods can be categorized into two classes: 1) reducing the model representation precision, and 2) reducing the number of model parameters.
1) Reducing the model representation precision. This category includes network quantization and binarization. Network quantization compresses the bit-width of the weights, the activations, or both [Gong et al.2014, Han et al.2015a]. The extreme case is to binarize the network [Hubara et al.2016], using 1 bit to represent each value. Such fixed-point or binary representations need specially designed compute acceleration libraries or hardware. Additionally, binarization methods usually suffer significant accuracy loss.

2) Reducing the number of model parameters. Such methods can be categorized into new network architecture designs, TD, weight pruning and KD.
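To illustrate the quantization idea mentioned above, a generic k-bit uniform quantizer can be sketched as follows (this is an illustrative scheme, not the exact method of the cited works):

```python
import numpy as np

def quantize_uniform(w, bits=8):
    """Uniformly quantize an array of weights to `bits`-bit integer codes.

    Returns the de-quantized (float) approximation and the integer codes;
    storage drops from 32 bits to `bits` bits per weight.
    """
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.int64)
    return codes * scale + lo, codes
```

With 8 bits, the reconstruction error per weight is bounded by half a quantization step, i.e. roughly (max − min)/510.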
New network architecture designs. Some research explores inference efficiency at the network design stage, such as SqueezeNet [Iandola et al.2016] and MobileNet [Howard et al.2017]. The main technique is to replace large convolution filters by a stack of small filters and train the network end-to-end.
Tensor decomposition aims to reduce FLOPs by decomposing a large 4-D filter into several small tensors via Canonical Polyadic (CP) decomposition [Lebedev et al.2014] or Tucker decomposition [Kim et al.2015]. TD-based methods introduce more convolution layers, which is less cache-efficient. As a result, the practical speedup ratio (SR) is not as high as the theoretical value.
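A quick FLOPs count shows why decomposition helps. The sketch below assumes a Tucker-2-style factorization of a k×k convolution into a 1×1, a k×k and a 1×1 layer; the rank choices in the usage example are hypothetical:

```python
def conv_flops(c_in, c_out, k, h, w):
    # multiply-accumulates of one convolution on an h x w output map
    return c_in * c_out * k * k * h * w

def tucker2_flops(c_in, c_out, k, h, w, r_in, r_out):
    # 1x1 (c_in -> r_in) + kxk (r_in -> r_out) + 1x1 (r_out -> c_out)
    return (conv_flops(c_in, r_in, 1, h, w)
            + conv_flops(r_in, r_out, k, h, w)
            + conv_flops(r_out, c_out, 1, h, w))
```

For a 256-to-256-channel 3×3 convolution with hypothetical ranks (64, 64), the theoretical reduction is over 8×, but the three resulting layers are less cache-friendly than one, so the practical SR is lower.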
Weight pruning can reduce the model size by removing redundant parameters. The work in [Han et al.2015b] uses the magnitude of the weights to measure their importance and determine which parameters should be removed. This kind of fine-grained pruning needs dedicated compute libraries and/or hardware designs, such as EIE [Han et al.2016]. More works explore structural pruning, which can obtain practical speedup with existing compute libraries. Structural pruning methods [He et al.2017, Wen et al.2016] remove some filters/channels either offline, based on certain importance measurements, or online, using a sparsity constraint during training. A fine-tuning procedure is then conducted to compensate for the performance loss. However, such methods usually ignore the network topology; as a result, additional post-processing layers may be needed to deal with complex network structures, such as cross-connections.
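Magnitude-based fine-grained pruning in the spirit of [Han et al.2015b] can be sketched with a single global threshold (the helper name and the global-threshold choice are illustrative assumptions):

```python
import numpy as np

def magnitude_prune(weights, ratio):
    """Zero out the fraction `ratio` of weights with the smallest magnitude.

    weights: list of numpy arrays, one per layer.
    A single global threshold is derived from all magnitudes, so layers
    with many near-zero weights are pruned more aggressively.
    """
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(flat, ratio)
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in weights]
```

The resulting sparsity is unstructured, which is exactly why dedicated libraries or hardware are needed to turn it into real speedup.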
Knowledge distillation was proposed by Hinton et al. to guide the student network training with a pretrained teacher model using soft targets [Hinton et al.2015]. The method aims to transfer knowledge from a complex teacher model to the student network. FitNet [Romero et al.2014] extends the method by distilling knowledge not only from the output but also from the intermediate representations.
In this paper, our method falls into the class of weight pruning based on structural sparsity constraints, and the strategy of KD is adopted to distill knowledge from the original model in order to maintain model performance.
3 Proposed method
Given a pretrained model, graph pruning can be conducted using the following steps:
1) Retrain with sparsity regularization. The sparsity constraint is imposed on parameters with a certain structural pattern, so as to make some vertices or edges removable;
2) Sort all the weights and determine the pruning threshold;
3) Remove the corresponding vertices or edges according to the threshold;
4) Fine-tune the pruned graph, with or without self-taught KD.
3.1 Notations
In this section, we use two kinds of descriptions to represent a CNN: mathematical and graph-based.
Mathematics. For convolution, we use $X$, $W$ and $Y$ to denote the input feature maps, convolution kernels and output feature maps, respectively. Each channel $y_i$ of the output feature maps corresponds to a filter $W_i$, and its batch-normalized result is represented by $\hat{y}_i$:

$$\hat{y}_i = \gamma_i \frac{y_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_i, \qquad (1)$$

where $\mu_i$ and $\sigma_i^2$ are the mean and variance of the channel, and $\gamma_i$ and $\beta_i$ are the scaling factor and bias factor, respectively. We use $\Theta$ to represent all the parameters in the CNN, including $W$, $\gamma$, $\beta$ and the other parameters, such as those in the FC layers. $\Theta$ can be learned by the following optimization:

$$\min_{\Theta} \sum_{(x, y)} L\big(f(x, \Theta), y\big) + \lambda R(\Theta), \qquad (2)$$

where $(x, y)$ denotes the input pairs of data and labels, $L(\cdot)$ is the loss function, and $R(\cdot)$ is the regularization used in the training process, such as the Frobenius norm for weight decay.

Graph. We use a graph $G = (V, E)$ to represent a network, where the vertices denote the computation operations and the edges show the data flow. Example modules of DenseNet and ResNet [He et al.2016] are shown in Figures 1 and 2. In a CNN, the computation operations include convolution, BN, activation, concat, add and FC, etc. Since convolution accounts for the majority of the computational load, we focus on the pruning of convolution vertices in our method.
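The per-channel normalization of Eq. (1) can be written directly in NumPy; the minimal sketch below assumes NHWC layout, with `eps` as the usual numerical-stability constant:

```python
import numpy as np

def batch_norm(y, gamma, beta, eps=1e-5):
    """Per-channel batch normalization as in Eq. (1).

    y: feature maps of shape (N, H, W, C); gamma, beta: per-channel
    scale and bias. gamma is the factor later used to rank channels.
    """
    mu = y.mean(axis=(0, 1, 2))
    var = y.var(axis=(0, 1, 2))
    return gamma * (y - mu) / np.sqrt(var + eps) + beta
```

After normalization, each output channel has mean beta and standard deviation |gamma|, so a channel with a near-zero gamma carries almost no information and is a natural pruning candidate.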
3.2 Vertex-level pruning
There are several rules for representing a graph for vertex-level pruning. 1) A convolution vertex represents a single filter rather than a whole convolution layer; otherwise, we could not prune the network at the finer filter level when pruning at vertex level. 2) Similarly, a BN vertex represents the operation on a single channel. 3) Because activation functions are always placed after BN and are applied element-wise, we fuse the activation function into the BN vertex to simplify the graph illustration. In the following we show 1) how the original channel-level pruning is performed [Liu et al.2017], and 2) how we perform structural vertex-level pruning with consideration of the graph topology.

For a CNN with BN layers, the scaling factors $\gamma$ in the BN layers can measure the importance of each channel, and thus can be used directly for channel selection under sparsity regularization. Channel-level pruning can be obtained by modifying the optimization in Eq. (2) as

$$\min_{\Theta} \sum_{(x, y)} L\big(f(x, \Theta), y\big) + \lambda R(\Theta) + \lambda_s R_s(\gamma), \qquad (3)$$

where $R_s(\cdot)$ is the sparsity regularization, typically realized using the $\ell_1$ norm, and $\lambda_s$ is the balance parameter that trades off the sparsity loss against the original loss.
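One simple way to realize the extra $\ell_1$ term of Eq. (3) during training is a subgradient step on the BN scaling factors (an illustrative sketch, not the paper's exact training code):

```python
import numpy as np

def l1_subgradient_step(gamma, grad, lr, lam):
    """One SGD step on BN scaling factors with the extra L1 penalty of
    Eq. (3): gamma <- gamma - lr * (grad + lam * sign(gamma)).

    Repeated over training, the scaling factors of insignificant
    channels are driven towards zero and can be pruned by thresholding.
    """
    return gamma - lr * (grad + lam * np.sign(gamma))
```

With a zero task gradient, each step shrinks |gamma| by lr·lam, which is the mechanism that sparsifies the channel importances.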
In a graph, channel-level pruning is targeted at removing the insignificant BN vertices. However, in DenseNet, as shown in Figure 1(a), removing a BN vertex cannot lead to removal of the preceding convolution vertex, because the convolution vertex still has outgoing edges connecting to other BN vertices. Similarly, in the residual module shown in Figure 2(a), there is an n-to-1 connection due to the add operation. The add vertex cannot be removed if any of its incoming edges remains, and once an add vertex remains, all of its incoming edges must remain to guarantee the validity of the data flow.

Therefore, to remove a certain convolution vertex, the graph topology must be taken into consideration, especially for networks with cross-connections. Based on this observation, we propose to adaptively prune the network at vertex level in a more structural way.
Firstly, the BN vertices are classified into articulation points $V_a$ and non-articulation points $V_{na}$.

Secondly, $V_{na}$ is further split into 1-to-1 connection ($V_{1\text{-}1}$), 1-to-n connection ($V_{1\text{-}n}$) and n-to-1 connection ($V_{n\text{-}1}$) BN vertices. If we use $P(v)$ and $C(v)$ to represent the sets of parent and child vertices of $v$, then the definitions are as follows:

$$V_{1\text{-}n} = \{v \in V_{na} : \exists p \in P(v),\ |C(p)| > 1\},$$
$$V_{n\text{-}1} = \{v \in V_{na} : \exists c \in C(v),\ |P(c)| > 1\},$$
$$V_{1\text{-}1} = V_{na} \setminus (V_{1\text{-}n} \cup V_{n\text{-}1}). \qquad (4)$$

In GAP, we ignore cross-connections whose shared child vertex is a concat vertex: when different feature maps are combined through concat, there is still no correlation among them, so whether the parent vertices of a concat vertex can be pruned still depends only on themselves.

Finally, different constraints are imposed on the different subsets:

$$R_s(\gamma) = \sum_{v \in V_{1\text{-}1}} |\gamma_v| + \sum_{g \in \mathcal{G}} \sqrt{\sum_{v \in g} \gamma_v^2}, \qquad (5)$$

where the vertices in $V_{1\text{-}1}$ are regularized by the $\ell_1$ norm, and those in $V_{1\text{-}n}$ and $V_{n\text{-}1}$ are constrained by group sparsity using the $\ell_{2,1}$ norm, with each group $g \in \mathcal{G}$ containing the vertices that share the same parent or child vertex.
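The vertex classification and the group-sparsity term can be sketched as follows, assuming the parent/child relations are available as dicts. A vertex that is both 1-to-n and n-to-1 is assigned here to the 1-to-n set; this tie-breaking is an illustrative choice, not specified in the text:

```python
import numpy as np

def classify_bn_vertices(parents, children, non_articulation):
    """Split prunable BN vertices into 1-to-1, 1-to-n and n-to-1 sets.

    parents/children: dicts mapping a vertex to the sets of its parent /
    child vertices in the computational graph. A vertex joins the 1-to-n
    set if one of its parents feeds several vertices, and the n-to-1 set
    if one of its children gathers several vertices (e.g. an add vertex).
    """
    one_n, n_one, one_one = set(), set(), set()
    for v in non_articulation:
        if any(len(children[p]) > 1 for p in parents[v]):
            one_n.add(v)
        elif any(len(parents[c]) > 1 for c in children[v]):
            n_one.add(v)
        else:
            one_one.add(v)
    return one_one, one_n, n_one

def group_sparsity(groups):
    # l2,1 norm: sum over groups of the l2 norm of each group's gammas,
    # which pushes whole groups of scaling factors to zero together
    return sum(np.sqrt(np.sum(np.square(g))) for g in groups)
```

Because the $\ell_{2,1}$ penalty is non-differentiable only when an entire group is zero, it zeroes groups jointly, which is what lets all vertices on a cross-connection be pruned or kept together.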
3.3 Edge-level pruning
In this section, we introduce edge-level pruning. Recently, many networks have been proposed with multi-path data flow, such as the inception module in GoogLeNet, the fire module in SqueezeNet and the dense connections in DenseNet. In particular, group convolution is a special case of multi-path design with all paths identical. Figure 3(a) shows the original group convolution module, while Figure 3(b) shows an equivalent structure that is easier for topology analysis. As network structures are usually designed with redundancy to help solve the highly non-convex optimization [Luo et al.2017], not all paths are essential for performance. Thus certain paths can be pruned to reduce the model size and inference FLOPs. In a graph, this is realized by edge-level pruning.

Different from Section 3.2, here we treat the network as a graph at a coarser level: a set of filters in a convolution layer is regarded as a single vertex, and, similarly, a BN vertex represents a whole BN layer. When there are multiple paths for data flow, the edges on those paths become non-bridges, so multi-path pruning is equivalent to removing some of the non-bridge edges. The sparsity regularization for non-bridge edge pruning is conducted as follows.

Firstly, the non-bridge edges are selected as pruning candidates. As shown in Figures 3(c) and 4(b), if one edge is removed, the whole path is disabled, so there is no need to regularize all the non-bridge edges on a path; we only choose the last edge in each path for pruning. Furthermore, in CNNs, multiple paths are always combined by a concat operation, so we use the concat vertex to detect the edges to be pruned. The set of selected edges is denoted as $E_s$.

Secondly, each selected edge is scaled by an additional parameter $\alpha$, acting as a measurement of the edge's importance. The edge scaling factors are then constrained using sparsity regularization:

$$\min_{\Theta, \alpha} \sum_{(x, y)} L\big(f(x, \Theta, \alpha), y\big) + \lambda R(\Theta) + \lambda_s R_s(\alpha), \qquad (6)$$

where $R_s(\alpha)$ denotes the $\ell_1$ norm on the scaling factors $\alpha$, and $\lambda_s$ is the balance parameter.
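A minimal sketch of the edge-level objective and the subsequent pruning step follows; the helper names are illustrative, and in practice the scaling factors would be trainable parameters inside the framework:

```python
import numpy as np

def edge_sparsity_loss(task_loss, alphas, lam_s):
    """Total objective of Eq. (6): task loss plus an L1 penalty on the
    scaling factors attached to the selected non-bridge edges."""
    return task_loss + lam_s * np.sum(np.abs(alphas))

def prune_edges(alphas, threshold):
    # edges whose learned scale falls below the threshold are removed,
    # which disables the whole path they terminate
    return [i for i, a in enumerate(alphas) if abs(a) >= threshold]
```

Removing an edge deletes an entire path's operations, so edge-level pruning cuts memory access counts as well as FLOPs.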
3.4 Self-taught KD
After training with the sparsity constraints, all the scaling factors are sorted, and the vertices or edges whose scaling factors fall below a certain threshold are pruned, where the threshold is determined by the target pruning ratio. The pruned graph may suffer some performance degradation, which can be compensated by fine-tuning. In addition to naive fine-tuning, we propose to fine-tune the network using a self-taught KD strategy.
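Deriving the threshold from the pruning ratio can be sketched as below (an illustrative helper, assuming the scaling factors are collected as one array per layer):

```python
import numpy as np

def global_threshold(scaling_factors, prune_ratio):
    """Derive a global pruning threshold from the target pruning ratio by
    sorting all scaling factors (BN gammas or edge alphas) together."""
    flat = np.sort(np.abs(np.concatenate(scaling_factors)))
    k = int(prune_ratio * len(flat))
    return flat[k] if k < len(flat) else np.inf
```

Pruning every factor whose magnitude is strictly below the returned value removes the requested fraction, while a single global threshold lets each layer be pruned adaptively.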
For the pruned network, the original model is clearly a more complex model with better performance, so it can act as the teacher in KD for fine-tuning. Moreover, the pretrained model is already available, which matters in practice, where there are often limited resources and time to train a more complex teacher model for a specific task. As the knowledge is distilled from the original model to the pruned network, we call this self-taught KD. In the following sections, experimental results will show that the performance of the pruned network is substantially improved compared with the naive fine-tuning strategy.
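The self-taught KD fine-tuning loss can be sketched as hard-label cross-entropy plus a temperature-softened soft-target term whose teacher logits come from the original unpruned model; the temperature t and relative weight w correspond to the hyper-parameters mentioned in Section 4, and the default values here are placeholders:

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_taught_kd_loss(student_logits, teacher_logits, labels, t=4.0, w=1.0):
    """Fine-tuning loss: cross-entropy on hard labels plus a weighted
    soft-target term, with the original unpruned model as teacher.

    labels: integer class indices; t is the distillation temperature
    and w the relative weight of the soft-target term.
    """
    p_s = softmax(student_logits)
    hard = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + 1e-12))
    p_t = softmax(teacher_logits, t)          # softened teacher targets
    log_ps_t = np.log(softmax(student_logits, t) + 1e-12)
    soft = -np.mean(np.sum(p_t * log_ps_t, axis=-1))
    return hard + w * soft
```

The soft term is minimized when the student's softened distribution matches the teacher's, so a student that agrees with the original model incurs a lower loss than one that contradicts it.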
4 Experiments
4.1 Implementations
Table 1: Pruning results on CIFAR-10. "Err. FT" and "Err. KD" are the test error rates (%) after naive fine-tuning and after fine-tuning with self-taught KD; CR is the model compression ratio; Theo./Prac. SR are the theoretical and practical speedup ratios.

Model  Method  Pruned  Err. FT (%)  Err. KD (%)  Params  CR  FLOPs  Theo. SR  Prac. SR
ResNeXt29  Baseline (our impl.)  –  4.00  –  34.43M  –  5.00G  –  –
ResNeXt29  Channel-level [Liu et al.2017]  60%  4.28  4.09  9.10M  3.78  1.18G  4.24  1.69
ResNeXt29  Vertex-level  60%  4.08  4.03  6.71M  5.13  0.83G  6.02  2.68
ResNeXt29  Edge-level  60%  4.55  4.11  24.61M  1.40  2.23G  2.24  2.41
DenseNet40  Baseline (our impl.)  –  5.60  –  1.06M  –  288M  –  –
DenseNet40  Channel-level [Liu et al.2017]  50%  6.38  5.70  0.58M  1.83  185M  1.56  1.03
DenseNet40  Vertex-level  50%  6.14  5.67  0.52M  2.04  138M  2.09  1.27
DenseNet40  Edge-level  50%  6.24  6.00  0.90M  1.18  207M  1.39  1.11
ResNet164  Baseline (our impl.)  –  5.11  –  1.70M  –  251.0M  –  –
ResNet164  Channel-level [Liu et al.2017]  75%  5.50  5.32  0.71M  2.39  71.3M  3.52  1.24
ResNet164  Vertex-level  50%  5.47  5.36  0.71M  2.39  70.1M  3.58  1.48
We evaluated the effectiveness of the proposed pruning method on two widely used datasets: CIFAR-10 [Krizhevsky and Hinton2009] and ImageNet LSVRC 2012 [Russakovsky et al.2015]. Considering the topology-adaptive attribute of GAP, ResNet, DenseNet and ResNeXt [Xie et al.2017] were chosen for evaluation.

To evaluate inference efficiency, we used three criteria: model compression ratio (CR), theoretical SR and practical SR. Model size and FLOPs before and after pruning were used to compute the model CR and theoretical SR. We used the practical SR as an additional indicator of inference efficiency, since memory access and data movement time are not reflected in FLOPs. TensorFlow was used as the basic framework, and the practical SR was evaluated with CUDA 8 and cuDNN 5 on a GPU (GTX 1080 Ti). For all network training, SGD with a Nesterov momentum of 0.9 was used as the optimizer. Weight decay for the networks was set to . For the sparsity regularization, the balance parameters for vertex-level pruning were set to for simplicity, as the structural group sparsity is harder to sparsify. $\lambda_s$ was selected in , based on the sparseness of the targeted weights. For edge-level pruning, $\lambda_s$ was searched in . All the layers were pruned simultaneously based on an adaptive threshold, which was determined by the pruning proportion.

4.2 Experiments on CIFAR-10
Data augmentation on CIFAR-10 for training was conducted using random cropping and mirroring, and images were normalized channel-wise based on dataset statistics. For the CIFAR-10 experiments, ResNet164, DenseNet40 () and ResNeXt29 ( d) were adopted. The original pretrained models were implemented by ourselves in TensorFlow, with the same settings as the authors'. In the first step, the pretrained models were retrained with the sparsity constraint using a mini-batch size of 128 for 10 epochs, with a learning rate of 0.01. All layers in the graph were pruned together and then fine-tuned for 90 epochs. The initial learning rate for fine-tuning was set to 0.01 and divided by 10 at 2/3 of the total epochs. For self-taught KD, a temperature of was used and the relative weight for the soft target was set to 1.
Table 2: Pruning results on ImageNet (top-1 error, %). Columns follow Table 1, with FLOPs in GFLOPs.

Model  Method  Pruned  Err. FT (%)  Err. KD (%)  Params  CR  FLOPs  Theo. SR  Prac. SR
ResNeXt50  Baseline (our impl.)  –  24.47  –  25.03M  –  4.26G  –  –
ResNeXt50  Channel-level [Liu et al.2017]  50%  25.22  24.67  17.99M  1.39  2.39G  1.78  1.27
ResNeXt50  Vertex-level  40%  26.15  24.44  14.99M  1.67  2.29G  1.86  1.51
ResNeXt50  Edge-level  50%  26.19  25.27  16.88M  1.48  2.37G  1.80  1.48
DenseNet121  Baseline (our impl.)  –  25.17  –  7.98M  –  2.87G  –  –
DenseNet121  Channel-level [Liu et al.2017]  50%  25.50  25.19  4.74M  1.68  2.06G  1.39  1.01
DenseNet121  Vertex-level  20%  26.24  25.30  6.06M  1.32  1.99G  1.44  1.14
DenseNet121  Edge-level  50%  25.63  25.27  6.72M  1.19  2.48G  1.16  1.25
Pruning results are shown in Table 1. In DenseNet and ResNeXt, the networks were pruned with the same percentage at channel level, vertex level and edge level. The results suggest that fine-tuning with self-taught KD restores the classification accuracy lost to pruning better than naive fine-tuning. By comparison, structural pruning at vertex level achieves higher model CR and theoretical SR while also obtaining a better classification error rate. In ResNeXt, with approximately no loss of accuracy, pruning 60% at vertex level obtains a 2.68× practical SR, versus only 1.69× at channel level. In DenseNet, channel-level pruning achieves almost no speedup, since it introduces additional selection operations that increase the memory access time. Compared with ResNeXt, DenseNet is already a quite compact network and is harder to prune; nevertheless, we still achieve a 2.04× model CR and 1.27× practical SR through vertex-level pruning with marginal loss of performance. In ResNet, only channel-level and vertex-level pruning were conducted, with the model pruned to comparable model size and FLOPs at the two levels. At vertex level, we obtain a 1.48× practical SR and 2.39× model CR with a minor loss of accuracy.
Edge-level pruning leads to the largest remaining model size and FLOPs, because it can only prune part of the graph: in ResNeXt, only the edges contained in the group convolutions can be pruned, and similarly, only the dense connections can be removed in DenseNet. Additionally, edge-level pruning has the worst error rate, naturally because it prunes the network at a coarse-grained level, which harms the network more [Mao et al.2017]. However, the benefit of edge-level pruning is the small gap between its practical and theoretical SR: it reduces the number of computation operations, and hence the memory access count as well as the FLOPs, whereas the theoretical SR ignores memory access. Notably, in ResNeXt the practical SR actually exceeds the theoretical one.
Furthermore, we use ResNeXt29 to quantitatively analyze network performance with respect to different model CRs and theoretical and practical SRs. Results are shown in Figure 5. Vertex-level pruning achieves a lower error rate than channel-level pruning at the same model CR, theoretical SR or practical SR, especially under the practical SR measurement. Through vertex-level pruning with self-taught KD fine-tuning, we obtain approximately 12× model CR, 15× theoretical SR and 4.3× practical SR with nearly no loss of accuracy. Although edge-level pruning suffers a larger accuracy loss, it achieves a larger practical SR at the same level of theoretical SR; as shown in Figure 5(c), at the same error rate, edge-level pruning obtains a higher actual speedup than channel-level pruning. Finally, performance is substantially improved by fine-tuning with self-taught KD compared with naive fine-tuning.
4.3 Experiments on ImageNet
We adopted the same data augmentation scheme as in [Huang et al.2016] for ImageNet, and the top-1 error rate of a single center crop was used as the performance measurement. In the first step, the pretrained models were retrained with the sparsity constraint using a mini-batch size of 256 on 4 GPUs for 1 epoch, with a learning rate of 0.01. For fine-tuning, the model was trained for 40 epochs with an initial learning rate of 0.01, which was decreased by a factor of 10 at the 15th and 30th epochs. For KD, a temperature of was used and the relative weight for the soft target was set to 100, making the loss magnitudes of the soft and hard targets comparable.
On ImageNet, ResNeXt50 ( d) and DenseNet-BC121 were validated, with the same model settings as in [Xie et al.2017] and [Huang et al.2016]. Table 2 shows the pruning results. To better evaluate performance, the models at channel level and vertex level were pruned to comparable model size and FLOPs. At vertex level, the pruned networks achieve quite similar error rates after fine-tuning with self-taught KD; at the same time, we obtain a model CR of 1.67× and practical SR of 1.51× on ResNeXt, and a 1.32× CR and 1.14× practical SR on DenseNet.
For edge-level pruning, Figure 6 shows the distribution of the edge scaling parameters after training with sparsity in DenseNet121. The edge scaling factors are all initialized to 1.0, and after training the non-zero values become sparse, so we can prune the edges with low scaling values. Furthermore, it can be observed that the information flow between blocks is critical, as indicated by the first row and last column of each block, while the layer connections within a block may be highly redundant. For ResNeXt50, Figure 7 shows the number of remaining edges under different pruning percentages. We use a global threshold to adaptively prune each layer, as the redundancy may vary across layers.
Because of the bottleneck structure in DenseNet-BC, edge-level pruning can only remove the 1×1 convolution layers; therefore, it achieves only a modest model CR and FLOPs reduction. As for ResNeXt, only the edges involved in the group convolutions can be pruned, so the model CR and theoretical SR are also limited. However, as described in Section 4.2, the benefit of edge-level pruning is that it also reduces the number of memory accesses and thus further accelerates inference. On ResNeXt, the practical SR exceeds the theoretical value, and on DenseNet it is very close to the theoretical SR.
5 Conclusion
In this paper, we propose GAP for CNN model compression and inference acceleration. By adaptive analysis of the graph, the method directly removes certain vertices or edges to obtain a compact, efficient graph for inference while maintaining the original graph topology. The pruned network achieves practical speedup without any post-processing to deal with complex structures. For fine-tuning, we adopt a self-taught KD strategy, which substantially improves the performance of the pruned network without introducing extra workload, making it well suited to practical tasks. Experimental results show that GAP makes inference more efficient, with high model CR and practical SR, while keeping the network performance very close to that of the original model. As future work, we will develop an auto-tuning mechanism to search the optimal hyper-parameters of the framework, and we will investigate combining vertex-level and edge-level pruning, so that a more rational mixed-level pruning can be conducted for a network under given computation resource or latency constraints.
References
 [Cormen et al.2009] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
 [Gong et al.2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [Han et al.2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [Han et al.2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [Han et al.2016] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016.

 [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [He et al.2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [Hinton et al.2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [Howard et al.2017] Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [Huang et al.2016] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [Hubara et al.2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115. 2016.
 [Iandola et al.2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [Kim et al.2015] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 [Krizhevsky and Hinton2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
 [Lebedev et al.2014] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 [Liu et al.2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pages 2755–2763. IEEE, 2017.
 [Luo et al.2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
 [Mao et al.2017] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
 [Romero et al.2014] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [Russakovsky et al.2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [Wen et al.2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [Xie et al.2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.

 [Yim et al.2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.