1 Introduction
Convolutional neural networks (CNNs) have achieved better performance than traditional methods in classification, detection and segmentation tasks in computer vision. However, CNNs incur massive computation and storage costs, which hinders their deployment on mobile and embedded devices. Previous research indicates that CNN acceleration methods fall into four categories: designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning. Our work belongs to the last category.
Pruning is a promising approach to CNN acceleration that aims to eliminate model parameters based on a performance loss function. However, unstructured pruning leads to irregular sparsity, which is hard to exploit for speedup on general hardware platforms [7]. Even with sparse matrix kernels, the speedup is very limited [28]. To solve this problem, many works focus on structured pruning (Fig.2), which shrinks a network into a thinner one so that the pruned network can be implemented efficiently [1, 26]. For example, [19] proposed a one-shot pruning method that prunes the less important filters based on their norms. [23] proposed a progressive pruning method that prunes filters based on a novel importance criterion derived from Taylor expansion. Recently, [11] achieved state-of-the-art pruning results on VGG16 via an alternating filter pruning method using LASSO-regression-based channel selection and least-squares reconstruction.
However, existing pruning approaches mainly have three problems:

Training-based methods prune unimportant weights based on some importance criterion and never recover them in the subsequent training process. Given that the importance criteria are either simple, such as the commonly used weight norms [9, 19], or derived under very strong assumptions, such as the requirement in [23] that the parameters be i.i.d., it is likely that some pruned weights would have become important later had they been kept through the whole training process. It is therefore necessary to design recovery mechanisms for pruned weights to correct misjudgments made during early training stages.

Reconstruction-based methods prune and reconstruct the network layer by layer, so the time complexity of pruning grows linearly with the number of layers. This is time-consuming given that current CNNs such as ResNet use very deep architectures. In this sense, it is better to prune parameters from all layers simultaneously than to prune layer by layer.

Many structured pruning methods aim to prune whole filters. Since filters are large, coarse-grained units, accuracy is likely to drop dramatically after filter-level pruning.
To solve the above problems, we propose Structured Probabilistic Pruning (SPP) for CNN acceleration. Firstly, SPP prunes weights in a probabilistic manner, as shown in Fig.1. Specifically, we assign a pruning probability to each weight. When some weights fall below the importance threshold and would otherwise have been pruned, we only increase their pruning probability rather than eliminating them outright. Only when the pruning probability reaches 1 are the weights permanently eliminated from the network. We also design a mechanism to decrease the pruning probability if the weights become more important during training, thus correcting previous misjudgments. Secondly, SPP prunes the whole network at the same time instead of layer by layer, so the time complexity remains controllable as the network becomes deeper. Thirdly, the basic pruning units for SPP are columns of model parameters. Compared with filter-level pruning, the structured unit is smaller and the results are more robust against inaccurate importance criteria.
With speedup, SPP can accelerate AlexNet with only loss of top-5 accuracy and VGG16 with loss of top-5 accuracy in ImageNet classification. Moreover, SPP can be directly applied to accelerate multi-branch CNNs, such as ResNet, without specific adaptations. Our speedup ResNet50 only suffers loss of top-5 accuracy on ImageNet. We further demonstrate the effectiveness of our method on a transfer learning task on the Flower102 dataset with AlexNet.
2 Related Work
Intensive research has been carried out in CNN acceleration, which is normally categorized into four groups, i.e. designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning.
Compact architecture design methods replace large, redundant architectures with small, compact ones. For example, VGG [25] and GoogLeNet [27] used small 3×3 kernels in place of larger convolutional kernels. ResNet [10] used 1×1 kernels to build compact bottleneck blocks that save computation. SqueezeNet [12] stacks compact blocks and uses 50x fewer parameters than the original AlexNet.
Parameter quantization reduces CNN storage by vector quantization in the parameter space. [8] and [29] used vector quantization over parameters to reduce redundancy. [2] proposed a hash function to group the weights of each CNN layer into hash buckets for parameter sharing. As the extreme form of quantization, binarized networks were proposed to learn binary values of weights or activation functions in CNN training and testing [4, 20, 24]. Quantization reduces floating-point computational complexity, but the actual speedup depends heavily on the hardware implementation.
Matrix decomposition factorizes weights into smaller components to reduce computation. [6] showed that the weight matrix of a fully-connected layer can be compressed via truncated SVD. Tensor decomposition was later proposed and obtained better compression results than SVD [22]. Several methods based on low-rank decomposition of the convolutional kernel tensor were also proposed to accelerate the convolutional layer [6, 13, 17].
Parameter pruning was pioneered in the early development of neural networks. Optimal Brain Damage [18] leveraged a second-order Taylor expansion to select parameters for deletion, using pruning as regularization to improve training and generalization. Deep Compression [8] removed close-to-zero connections and quantized the remaining weights for further compression. Although these pruning methods achieve a remarkable reduction in storage, the induced irregular sparsity is hard to exploit for acceleration. Structured pruning was proposed to overcome this problem. It prunes structured units of parameters (e.g. rows and columns of the weight matrix), so that it can accelerate CNN computation without specific implementation and hardware modification [1, 26]. Structured Sparsity Learning [28] uses group LASSO regularization to prune weight rows or columns, which achieves speedup of AlexNet while suffering loss of top-1 accuracy. Taylor Pruning [23] uses a Taylor-expansion-based importance criterion to prune filters, which was reported to be effective on transfer learning tasks with AlexNet, but is less impressive on large datasets like ImageNet (about top-5 accuracy loss under speedup). Filter Pruning [19] is a one-shot pruning method that uses the norm to prune filters. It was shown to be effective on CIFAR10 and ImageNet with VGG16 and ResNet, but the reported speedup is very limited. Recently, Channel Pruning [11] alternately uses LASSO-regression-based channel selection and feature map reconstruction to prune filters, and achieves the state-of-the-art result on VGG16. The authors also verify their method on ResNet and Xception [3].
3 The Proposed Method
Suppose that we have a dataset consisting of inputs and their corresponding labels :
The parameters of a CNN with convolutional layers is represented by
which are learned to minimize the discrepancy, i.e. the loss function , between network outputs and labels. The common loss function for classification tasks is the negative loglikelihood of Softmax output , which is defined as
(1) 
where represents the th element of the Softmax output for the th input.
The aim of parameter pruning is to find a simpler network with fewer convolutional parameters based on the original network , in which the increase of loss is minimized. This minimization problem is defined by Eqn.(2).
(2) 
Normally for a CNN, the input tensor of a convolutional layer is first convolved with the weight tensor , and then a nonlinear activation function , usually the Rectified Linear Unit (ReLU), is applied to the result. The output is then passed as input to the next layer. For pruning, a mask is introduced for every weight, which determines whether this weight is used in the network. Thus, the output of the th layer is described as
(3) 
where denotes element-wise multiplication and denotes the convolution operation. Note that the masked weights are also not updated during back-propagation. Since SPP prunes columns of weights, we assign the same mask to all weights in the same column, so the weights in each column are pruned or retained together in each iteration. We choose columns as the pruning unit because they are the smallest structured granularity in a CNN, which gives us more freedom in selecting which components to prune.
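The masked forward pass described above can be sketched as follows. This is a minimal illustration, assuming the convolution has been lowered to a matrix product (im2col), so that a "column" is one column of the 2-D weight matrix. The function name `masked_layer_output` and the array shapes are assumptions made for the example, not part of the original method description.

```python
import numpy as np

def masked_layer_output(patches, weight, column_mask):
    """Apply a column-wise mask to the weights, then the layer and ReLU.

    patches:     (num_positions, K) im2col rows of the input (assumed layout)
    weight:      (num_filters, K) weight matrix of the layer
    column_mask: (K,) array of 0/1 values, one mask value per weight column
    """
    # Every weight in the same column shares one mask value.
    masked_weight = weight * column_mask[np.newaxis, :]
    pre_activation = patches @ masked_weight.T
    # ReLU as the nonlinear activation.
    return np.maximum(pre_activation, 0.0)

patches = np.ones((2, 3))
weight = np.array([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]])
column_mask = np.array([1.0, 0.0, 1.0])  # the middle column is masked out
output = masked_layer_output(patches, weight, column_mask)
```

Masking a column is equivalent to dropping that column from the weight matrix and the matching column from the im2col patches, which is why column pruning yields a smaller dense matrix product.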
For traditional pruning methods, once weights are pruned they are never reused in the network, which we call deterministic pruning. In contrast, SPP prunes weights in a probabilistic manner by assigning a pruning probability to each weight. For example, means that there is likelihood that the mask of the corresponding weight is set to zero. During the training process, we increase or decrease all pruning probabilities based on the importance criteria of the weights. Only when the pruning probability of a weight is increased to 1 is that weight permanently pruned from the network. Deterministic pruning can thus be regarded as a special case of probabilistic pruning in which the pruning probability is increased from 0 to 1 in a single iteration.
For SPP training, all pruning probabilities are updated by the algorithm described in Sec.3.1. The mask is then generated by Monte Carlo sampling according to these probabilities:
(4) 
After the mask is obtained for each weight, the pruning process is applied by Eqn.(3).
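The Monte Carlo sampling step can be sketched as below: each mask entry is set to zero with its pruning probability and to one otherwise. The function name `sample_column_mask` and the use of NumPy's `Generator` are assumptions for illustration.

```python
import numpy as np

def sample_column_mask(prune_prob, rng):
    """Draw a 0/1 mask where entry i is 0 with probability prune_prob[i]."""
    uniform = rng.random(prune_prob.shape)  # samples in [0, 1)
    # A probability of 0 always keeps the column; a probability of 1 always masks it.
    return (uniform >= prune_prob).astype(np.float64)

rng = np.random.default_rng(0)
mask = sample_column_mask(np.array([0.0, 1.0, 0.0, 1.0]), rng)
```

Because the mask is resampled at every iteration, a column with a pruning probability strictly between 0 and 1 still takes part in some forward and backward passes, which is what leaves room for its importance to change.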
3.1 How to update the pruning probability
Assume that a convolutional layer consists of columns. Our aim is to prune columns, where is the pruning ratio, indicating the fraction of columns to be pruned at the final stage of SPP.
SPP updates by a competition mechanism according to some weight importance criteria. In this paper, we choose the importance criteria as the sum of norm of each column: the bigger the norm, the more important that column is. It was shown by experiments that and norms have similar performance as pruning criteria [9, 19]. There are also other importance criteria such as Taylor expansions to guide pruning [18, 23]. In this paper, we choose the norm for simplicity. Our method can be easily generalized to other criteria.
The increment of is a function of the rank , where the rank is obtained by sorting the norm of columns in ascending order. The function should satisfy the following three properties:

is a strictly decreasing function because higher rank means greater norm. In this situation, the increment of pruning probability should be smaller since weights are more important based on the norm importance criteria.

The integral of with ranging from to should be positive. In this situation, the total sum of pruning probabilities increases after each update, so that we tend to prune more and more weights until reaching the final stage of the algorithm.

should be zero when . Since we aim at pruning columns at the final stage, we need to increase the pruning probability of weights whose ranks are below , and decrease the pruning probability of weights whose ranks are above . By doing this, we can ensure that exactly columns are pruned at the final stage of the algorithm.
The simplest form satisfying these properties is a linear function, as shown in Fig.3: a line passing through the points and , where is a hyperparameter indicating the increment of pruning probability for the worst-ranked column.
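As a small illustration of the linear form, the sketch below draws the line through the worst-ranked column (rank 0, largest increment) and the rank at which the increment is required to be zero. The parameter names `num_columns`, `prune_ratio` and `max_increment` are hypothetical labels for the quantities described in the text.

```python
def linear_increment(rank, num_columns, prune_ratio, max_increment):
    """Linear increment of the pruning probability as a function of rank.

    Equals max_increment at rank 0 and zero at rank prune_ratio * num_columns.
    """
    zero_rank = prune_ratio * num_columns
    return max_increment * (1.0 - rank / zero_rank)

# Example: 100 columns, half of them to be pruned, largest step 0.05.
first = linear_increment(0, 100, 0.5, 0.05)
middle = linear_increment(50, 100, 0.5, 0.05)
last = linear_increment(100, 100, 0.5, 0.05)
```

Columns ranked below the zero-crossing receive a positive increment, and columns ranked above it receive a negative one.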
However, experimental results are not very good if we take as a linear function (refer to Sec.4.7). The reason is that we use the norm to measure the weight importance, but the norm is not uniformly distributed. In Fig.4, we plot the norm histogram of convolutional layers in AlexNet, VGG16 and ResNet50. It is observed that the norms of each layer follow a Gaussian-like distribution in which the vast majority of values are concentrated within a very small range. If we set linearly, the increments would vary widely across middle-ranked columns even though their actual norm values are very close. Intuitively, we need to make similar for middle-ranked columns and steeper at both ends, in correspondence with the distribution of norms. In this paper, we propose a center-symmetric exponential function to achieve this goal, as shown in Eqn.(5):
(5) 
Here is a hyperparameter that controls the flatness of the function: a smaller indicates that for middle-ranked columns is flatter, as shown in Fig.3. Note that is a decaying parameter for the exponential function, and the function is center-symmetric about the point . Since we require to pass through and , we can solve for and as and .
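One function consistent with this description is an exponential decay up to a center rank, mirrored about the center point beyond it. The sketch below is a reconstruction under assumptions and not necessarily the exact form of Eqn.(5): the names `max_increment` (increment at rank 0), `flatness` (assumed to lie strictly between 0 and 1), `alpha` and `center_rank` are illustrative. The two constraints used to solve for `alpha` and `center_rank` are that the function equals `max_increment` at rank 0 and zero at rank `prune_ratio * num_columns`.

```python
import math

def exp_increment_params(num_columns, prune_ratio, flatness):
    """Solve for the decay rate and center rank from the two constraints."""
    zero_rank = prune_ratio * num_columns
    alpha = (math.log(2.0) - math.log(flatness)) / zero_rank
    center_rank = -math.log(flatness) / alpha
    return alpha, center_rank

def exp_increment(rank, num_columns, prune_ratio, max_increment, flatness):
    """Center-symmetric exponential increment (assumed form)."""
    alpha, center_rank = exp_increment_params(num_columns, prune_ratio, flatness)
    if rank <= center_rank:
        # Plain exponential decay starting from max_increment at rank 0.
        return max_increment * math.exp(-alpha * rank)
    # Mirror image of the decay about (center_rank, flatness * max_increment).
    mirrored = max_increment * math.exp(-alpha * (2.0 * center_rank - rank))
    return 2.0 * flatness * max_increment - mirrored

values = [exp_increment(r, 100, 0.5, 0.05, 0.25) for r in range(100)]
```

With these choices the function is strictly decreasing, so it still satisfies the first and third properties listed earlier by construction.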
For each column, its pruning probability is updated by
(6) 
where denotes the th update of , and the min/max ensures that the pruning probability stays within the range [0, 1]. During the th update, for a specific column, if its rank is less than , would be positive so that increases; if its rank is greater than , would be negative so that decreases. The pruning masks are then sampled by Eqn.(4) and back-propagation is applied to update the weights. Under this mechanism, the weights compete to survive pruning. They are eliminated gradually until the pruning ratio is reached.
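A single update of the pruning probabilities can be sketched as follows: columns are ranked by their norm in ascending order, each receives the increment for its rank, and the result is clipped to [0, 1]. The helper name `update_prune_prob` is hypothetical, the column norms are passed in so that the sketch does not depend on which norm is used, and a linear increment stands in for whichever increment function is chosen.

```python
import numpy as np

def update_prune_prob(prune_prob, column_norms, increment_fn):
    """Add a rank-dependent increment to each probability and clip to [0, 1]."""
    # Double argsort converts norms to ranks; rank 0 is the smallest norm.
    ranks = np.argsort(np.argsort(column_norms))
    increments = np.array([increment_fn(rank) for rank in ranks])
    return np.clip(prune_prob + increments, 0.0, 1.0)

# Toy layer with 4 columns and 2 of them to be pruned (zero increment at rank 2).
norms = np.array([3.0, 1.0, 2.0, 4.0])
old_prob = np.array([0.1, 0.8, 0.5, 0.1])
new_prob = update_prune_prob(old_prob, norms, lambda rank: 0.5 * (1.0 - rank / 2.0))
```

In this toy run the column with the smallest norm saturates at probability 1, while the column with the largest norm is pushed back to 0.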
3.2 When to update the pruning probability
Another question is when to update the pruning probabilities. A common practice is to prune weights at a fixed interval [23]. In SPP, we keep this simple rule: is updated every iterations of training, where is a fixed number. Note that is monotonically increasing when and decreasing otherwise. As stated above, the increased portion is larger than the decreased portion at each update, so the pruning ratio is guaranteed to be reached within a finite number of iterations. Finally, after the pruning ratio is reached, we stop the pruning process and retrain the pruned model for several iterations to recover accuracy. The whole algorithm of SPP is summarized in Algorithm 1.
4 Experiments
The hyperparameters in SPP are , and . Here we set , and , which are the same for all experiments. The settings such as weight decay, momentum, and dropout are unchanged from the baseline model. In the pruning and retraining process, we only adjust the learning rate and batch size, which will be elaborated in the following experiments.
On the small-scale dataset CIFAR10 [15], we first evaluate our method on a shallow single-branch model, ConvNet, and a deep multi-branch model, ResNet56. Secondly, on the large-scale dataset ImageNet2012 [5], we evaluate our method with three state-of-the-art models: AlexNet [16], VGG16 [25] and ResNet50 [10]. We also test our method on transfer learning tasks on the Oxford Flower102 dataset [21]. We use Caffe [14] for all of our experiments.
Methods for comparison include four structured pruning approaches proposed in recent years: Structured Sparsity Learning (SSL) [28], Taylor Pruning (TP) [23], Filter Pruning (FP) [19] and Channel Pruning (CP) [11].
4.1 ConvNet on CIFAR10
The CIFAR10 dataset contains 10 classes with 50,000 images for training and 10,000 for testing. We take images from the training set as the validation set. ConvNet was first proposed by [16] for classification on CIFAR10, and is composed of 3 convolutional layers and 1 fully connected layer. The batch size is set to and the initial learning rate to .
The performance of the methods is shown in Fig.5. We carefully vary the pruning ratio to obtain the same speedup ratio, and compare the accuracies of these methods. When the speedup is small (), SPP can even improve the performance, which we attribute to modest pruning regularizing the objective function and thereby increasing accuracy. This phenomenon was also observed with another pruning method [28]. It can be observed that when the speedup ratio is greater than , SPP performs much better than the other three methods.
The second experiment studies the fraction of weights which are not important at the beginning, but become important at the final stage of training. We calculate the fraction of weights whose ranks are below at the beginning and then above at the final stage. These weights would be pruned by many one-shot pruning methods, but they are finally retained by SPP. Tab.1 shows this fraction, termed the ‘recovery ratio’, for the three convolutional layers of ConvNet.
Layer  

conv1  
conv2  
conv3 
Because the first layer (conv1) is very small, consisting of only columns, it lacks dynamics for SPP to perform well. However, for conv2 and conv3, which consist of columns each, the recovery ratios are very prominent, indicating that SPP could effectively utilize the dynamics of model parameters and achieve good performance. This may explain why the performance of SPP is more robust for big pruning ratios than the other three methods.
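The recovery ratio can be computed from the column norms at the start and at the end of training, as sketched below. Two points are assumptions of this sketch rather than statements from the text: the rank threshold is taken to be `prune_ratio * num_columns`, and the ratio is normalized by the number of columns that start below the threshold. The function name `recovery_ratio` is hypothetical, and `prune_ratio` must be large enough that at least one column starts below the threshold.

```python
import numpy as np

def recovery_ratio(initial_norms, final_norms, prune_ratio):
    """Fraction of initially low-ranked columns that end above the threshold."""
    num_columns = len(initial_norms)
    threshold_rank = prune_ratio * num_columns
    # Rank 0 is the smallest norm, i.e. the least important column.
    initial_ranks = np.argsort(np.argsort(initial_norms))
    final_ranks = np.argsort(np.argsort(final_norms))
    would_be_pruned = initial_ranks < threshold_rank
    recovered = would_be_pruned & (final_ranks >= threshold_rank)
    return recovered.sum() / would_be_pruned.sum()

ratio = recovery_ratio(np.array([1.0, 2.0, 3.0, 4.0]),
                       np.array([4.0, 1.0, 2.0, 3.0]),
                       prune_ratio=0.5)
```

In the toy example, two columns start below the threshold and one of them ends with the largest norm, so half of the would-be-pruned columns are recovered.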
4.2 ResNet56 on CIFAR10
ResNet56 is a deep multi-branch residual network, which is composed of convolutional layers. We trained it from scratch with batch size and learning rate . The obtained baseline accuracy is on the test set.
The pruning results are shown in Tab.2. For acceleration, our method prunes ResNet56 with only error increase, which is better than FP and CP.
4.3 AlexNet on ImageNet
We further verify our method on ImageNet2012, a large dataset of 1,000 classes, containing 1.28M images for training, 100,000 for testing and 50,000 for validation. AlexNet [16] is composed of 5 convolutional layers and 3 fully connected layers. Here we adopt CaffeNet, an open re-implementation of AlexNet in Caffe, as the pretrained model. The baseline top-5 accuracy is on the ImageNet2012 validation set.
Because AlexNet, VGG16 and ResNet50 are much deeper CNN models, a constant pruning ratio for all layers is not optimal. Usually the pruning ratios of different layers are determined according to their sensitivity. There are mainly two ways to evaluate sensitivity: (1) Fix the other layers and prune one layer, using the accuracy of the pruned model as the sensitivity measure for the pruned layer [19]; (2) Apply PCA to each layer and take the reconstruction error as the sensitivity measure [11, 28]. Here we take the PCA approach because it is simple and not time-consuming.
The PCA analysis of AlexNet is shown in Fig.6.
We plot the normalized reconstruction errors for different remaining principal component ratios. It can be seen that, under the same remaining principal component ratio, the normalized reconstruction error of the upper layers (like conv5) is greater than that of the bottom layers (like conv1), which means that the upper layers are less redundant. Thus, we set the proportion of the remaining ratios (one minus the pruning ratio) of these five layers to .
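A PCA-style sensitivity curve of this kind can be sketched with a singular value decomposition. The sketch assumes the analysis is applied to the 2-D weight matrix of a layer without mean-centering, and that the error is normalized by the Frobenius norm of the matrix; the text does not specify these details, and the function name `normalized_reconstruction_error` is hypothetical.

```python
import numpy as np

def normalized_reconstruction_error(weight_matrix, remaining_ratio):
    """Relative Frobenius error after keeping the leading principal components."""
    singular_values = np.linalg.svd(weight_matrix, compute_uv=False)
    num_kept = int(round(remaining_ratio * len(singular_values)))
    # The squared error of the best rank-k approximation is the discarded energy.
    discarded_energy = np.sum(singular_values[num_kept:] ** 2)
    total_energy = np.sum(singular_values ** 2)
    return float(np.sqrt(discarded_energy / total_energy))

weights = np.diag([3.0, 4.0])
error_half = normalized_reconstruction_error(weights, 0.5)
```

A layer whose error stays high until most components are kept is less redundant under this measure, and would be assigned a larger remaining ratio.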
Accuracy comparison of TP, SSL, FP and SPP is shown in Tab.3.
Method  

TP [23]  
SSL [28]  
FP [19] (our impl.)  
SPP   
SPP outperforms the other three methods by a large margin. SPP can accelerate AlexNet by speedup with only increase of top5 error. Similar to the results of ConvNet on CIFAR10, when the speedup ratio is small (), SPP can even improve the performance (by ). Previous work [28] reported similar improvements, while their improvement is relatively small () and under less speedup settings ().
4.4 VGG16 on ImageNet
VGG16 is a deep single-branch convolutional neural network with 13 convolutional layers. We use the open pretrained model (http://www.robots.ox.ac.uk/~vgg/research/very_deep/), whose single-view top-5 accuracy is .
Like the experiment with AlexNet, we first use PCA to explore the redundancy of different layers. Similarly, we find that deeper layers have less redundancy, and [11] reports similar results. According to the sensitivity analysis, we set the proportion of remaining ratios of shallow layers (conv1_x to conv3_x), middle layers (conv4_x) and top layers (conv5_x) to . Because layers conv1_1 and conv5_3 are very small, and their contribution to total computation is and respectively, we choose not to prune these two layers.
We first prune the network with batch size and learning rate . Then we retrain the pruned model with batch size and learning rate . Tab.4 shows the accuracy comparison of TP, FP, CP and SPP. Generally, SPP and CP are much better than FP and TP. With a small speedup ratio (), both SPP and CP can achieve zero increase of top-5 error. When the speedup ratio is greater ( and ), the performance of SPP is comparable to that of CP. SPP is better for speedup and CP is better for speedup.
Method  

TP [23]  
FP [19] ([11]’s impl.)  
CP [11]  
SPP 
We further evaluate the actual speedup on CPU with Caffe. Results are averaged from 50 runs with batch size . From Tab.5, we can see that the model pruned by SPP achieves actual acceleration on a general platform with off-the-shelf libraries. The discrepancy between the speedup (measured by GFLOPs reduction) and the actual speedup (measured by inference time) is mainly due to the influence of memory access, the unpruned fully-connected layers and the non-weight layers such as pooling and ReLU.
Method  CPU time (ms) 

VGG16 baseline  
SPP ()  
SPP ()  
SPP () 
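The speedup measured by computation reduction can be illustrated with a short calculation. The sketch assumes that only convolutional layers are counted and that pruning a fraction of a layer's columns removes the same fraction of its multiply-accumulate operations; the function name `theoretical_speedup` and the example numbers are hypothetical and are not results from the experiments.

```python
def theoretical_speedup(layer_macs, prune_ratios):
    """Ratio of original to remaining multiply-accumulates over the given layers."""
    original = sum(layer_macs)
    remaining = sum(macs * (1.0 - ratio)
                    for macs, ratio in zip(layer_macs, prune_ratios))
    return original / remaining

# Two layers with equal cost, half of the columns pruned in each.
uniform = theoretical_speedup([100.0, 100.0], [0.5, 0.5])
# An unpruned cheap layer limits the overall gain.
mixed = theoretical_speedup([100.0, 300.0], [0.0, 0.5])
```

The second case shows how layers that are left unpruned cap the overall ratio, which is one reason the measured inference-time speedup falls short of the per-layer reduction.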
4.5 ResNet50 on ImageNet
Unlike the single-branch AlexNet and VGG16, ResNet50 is a more compact CNN with multiple branches, which is composed of convolutional layers. We use the open pretrained caffemodel (https://github.com/KaimingHe/deep-residual-networks), whose top-5 accuracy on the ImageNet2012 validation set is .
The PCA analysis of ResNet50 is shown in Fig.7.
Unlike AlexNet and VGG16, the redundancy of different layers in ResNet50 is not very relevant to their positions in the network. For example, bottom layers res2c_branch2b, res3b_branch2b and res3d_branch2b share similar sensitivity with top layers res5x. Thus, we set the pruning ratio of all convolutional layers of ResNet50 to the same value in this experiment.
We first prune the network with batch size and learning rate . We then retrain the pruned model with batch size and learning rate .
The result is shown in Tab.6.
Method  Increased err. (%) 

CP (enhanced) [11]  
SPP 
It can be seen that our method achieves a better result than CP for speedup ratio. Note that the implementation of SPP on ResNet50 is very simple and identical to that of the previous experiments. In contrast, because of the special structure of ResNet, CP needs an additional multi-branch enhancement procedure to generalize to ResNet.
4.6 Transfer Learning
Finally, we apply SPP to transfer learning, where a well-trained model is fine-tuned on data from another knowledge domain. We fine-tune a pretrained AlexNet model on the Oxford Flower102 dataset [21]. The 102-class dataset is composed of images, among which are used for training, for validation and the other for testing. We use the open pretrained caffemodel (https://github.com/jimgoo/caffe-oxford102) as the baseline with test accuracy . We compare the performance of our method with TP [23], which was reported to be very effective for transfer learning.
Like the above experiments, we first prune the pretrained model with SPP, then fine-tune the model to regain accuracy. The learning rate is set to and the batch size to . For simplicity, we use a constant pruning ratio in this experiment. Tab.7 shows that SPP consistently outperforms TP at all speedup ratios. Note that in [23], the authors claim that the Taylor-based criterion is significantly better than the norm in pruning for the transfer learning task, while our result shows that SPP+ norm outperforms TP+Taylor-based criterion, which demonstrates the effectiveness of SPP in transfer learning tasks.
Method  

TP [23] (our impl.)  
SPP 
4.7 Comparison of Exponential and Linear Functions
We experimentally compare our proposed symmetric exponential function with the linear one (Sec.3.1) with ConvNet on CIFAR10. The result is shown in Tab.8.
Method  

SPP (exp)  
SPP (lin) 
We can see that, with relatively small speedup ratios, the linear function performs similarly to the exponential one. However, when the speedup ratio becomes larger, our proposed exponential function significantly outperforms the linear one, which indicates that it is more robust to aggressive pruning.
5 Conclusions
We proposed Structured Probabilistic Pruning (SPP) for CNN acceleration, which prunes the weights of a CNN in a probabilistic manner and is able to correct misjudgments of importance made in early training stages. The effectiveness of SPP is demonstrated by comparison with state-of-the-art methods on popular CNN architectures. We also show that SPP can be applied to transfer learning tasks and achieve better results than previous methods.
References
 [1] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint, arXiv:1610.09639, 2016.
 [2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the International Conference on Machine Learning, ICML, pages 1–10, Lille, France, 2015.
 [3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. 2016.
 [4] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint, arXiv:1602.02830, 2016.
 [5] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR’09, pages 248–255, Miami, FL, 2009.
 [6] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, NIPS, Montréal, Canada, 2014.
 [7] S. Han, X. Liu, and H. Mao. EIE: efficient inference engine on compressed deep neural network. ACM Sigarch Computer Architecture News, 44(3):243–254, 2016.
 [8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint, arXiv:1510.00149, 2015.
 [9] S. Han, J. Pool, and J. Tran. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, NIPS, pages 1135–1143, Montréal, Canada, 2015.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, pages 770 – 778, Las Vegas, NV, 2016.
 [11] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. 2017.
 [12] F. Iandola, M. Moskewicz, and K. Ashraf. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv preprint, arXiv:1602.07360, 2016.
 [13] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. Computer Science, 4(4):1–13, 2014.
 [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint, arXiv:1408.5093, 2014.
 [15] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. 2009.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS, pages 1097–1105, Lake Tahoe, NV, 2012.
 [17] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint, arXiv:1412.6553, 2014.
 [18] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, NIPS, pages 598–605, Denver, CO, 1990.
 [19] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, ICLR, 2017.
 [20] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. arXiv preprint, arXiv:1510.03009, 2016.
 [21] M. E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP’08, Bhubaneswar, India, 2008.
 [22] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. arXiv preprint, arXiv:1509.06569, 2015.
 [23] P. Molchanov, S. Tyree, and T. Karras. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, ICLR, 2017.
 [24] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, ECCV, pages 525–542, Amsterdam, the Netherlands, 2016.
 [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
 [26] V. Sze, Y. H. Chen, T. J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint, arXiv:1703.09039, 2017.
 [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, pages 1–9, Boston, MA, 2015.
 [28] W. Wen, C. Wu, and Y. Wang. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, NIPS, pages 2074–2082, Barcelona, Spain, 2016.
 [29] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 4820 – 4828, Las Vegas, NV, 2016.