Deep Neural Networks (DNNs) have many parameters, which leads to problems related to storage, computation and energy cost. State-of-art Convolutional Neural Network (CNN) models have hundreds of millions parameters and take tens of billions operationshe2016deep ; krizhevsky2012imagenet ; simonyan2014very . That makes DNN models difficult to deploy on embedded systems with limited resources.
To deal with this problem, various methods have been proposed to compress DNN models and reduce the amount of computation. Some methods are based on decomposition and factorizationzhang2016accelerating ; lebedev2014speeding . These methods can preserve the regular dense computation structure of the original models, thus are able to to achieve both compression and acceleration on general-purpose processors. Pruning serves as another effective method to greatly reduce the number of parameters with no loss of accuracyhan2015learning ; guo2016dynamic .
Pruning based methods are better at preserving accuracy as well as achieving higher compression rateshan2015learning . However, such improvements come at the cost of the irregularity of the sparse computation pattern. On the other side, structured pruning, such as pruning entire filters will cause larger accuracy loss than pruning individual weightsli2016pruning . Those observations pose several questions: What is the trade-off between regularity and accuracy? Is it possible to find a sweet spot in the range of regularity? How does the sweet spot improve the efficiency of hardware implementation?
We attempt to answer those questions by looking into pruning with different granularity, as shown in Figure 1. There are existing works trying to prune filters or channels instead of individual weightsMolchanovPruningICLR17 ; wen2016learning ; anwar2015structured
. However, they are individual points in the design space. Due to the various methods they used, we cannot directly evaluate the relationship between pruning granularity and final accuracy. We therefore apply the exact same method and experimental setting for an effective comparison. We also want to explore a consistent range of granularity, which includes intermediate grain size like 2D kernels and 1D sub-kernel vectors. Based on a thorough space exploration, we are able to analyze the storage saving and hardware efficiency at different granularity of sparsity.
In this work, we make the following contributions:
We explore a complete range of pruning granularity and evaluate the trade-off between the model’s regularity and accuracy.
We demonstrate that coarse-grained pruning is able to reach similar or even better compression rates than the fine-grained one, even though it obtains less sparsity.
We show that coarse-grained sparsity is able to skip computations and reduce memory references in a structured manner, which leads to more efficient hardware accelerator implementation.
2 Related Works
Methods of pruning. Sparsity has been proven as an effective approach to save parameters of Deep Neural Network modelshan2015learning ; guo2016dynamic . A number of works have investigated how to select the important connections and effectively recover accuracy. Second-order derivativelecun1989optimal , absolute valuehan2015learning , loss-approximating Taylor expansionMolchanovPruningICLR17 , and output sensitivityengelbrecht2001new are examples of importance metrics used for pruning. There are also methods trying to better integrate pruning and training, like iterative pruninghan2015learning and dynamic pruningguo2016dynamic .
Granularity of sparsity. Among all types of sparsity, fine-grained sparsity (vanilla sparsity) and filter-wise sparsity (very coarse-grained sparsity) are two extreme cases that has been studiedhan2015learning ; li2016pruning . Fine-grained sparsity is a type of sparsity in which individual weights are deleted and was first proposed in 1989 by LeCun et al.lecun1989optimal . Fine-grained sparsity has been proven to work well on a wide range of popular neural network models of CNN and RNNhan2015learning ; guo2016dynamic ; giles1994pruning ; han2017ese . There is also channel reduction and filter reduction, which reduce the dimension of input/output features as well as layers. Channel reduction can be viewed as very coarse-grained sparsity that removes 3-dimensional sub-tensors in convolutional layers. Such coarse-grained sparsity is beneficial for acceleration due to regularitylebedev2016fast ; wen2016learning . However, it usually causes notable reduced accuracy compared with fine-grained sparsity, as indicated by Li et al.li2016pruning .
There is a large range of granularity between vanilla sparsity and channel reduction. Some literature attempts to explore one or a few possibilities among all choices. Intra-kernel strided pruning is one case investigated in the work of Anwar et al.anwar2015structured .
Accelerating sparse models. For very coarse-grained sparsity like filter-sparsity and channel-sparsity, it is simple to achieve acceleration on general-purpose processors because it is equivalent to obtaining a smaller dense modelwen2016learning . For fine-grained sparsity, custom acceleratorshan2016eie ; parashar2017scnn have been proposed to exploit the reduction of computations.
3 Granularity of Sparsity
To simplify descriptions, we use the following notations for CNN. In a single convolutional layer, the weights compose a 4-dimensional tensor of shape . is the output dimension, i.e., the number of output feature maps. is the input dimension. and are the shape of convolution kernels (=3, =3 for a 3x3 kernel).
One layer’s weights consist of multiple filters (3-dimensional tensor of shape ), each one associated with an output feature map. The weights can also be viewed as multiple channels (3-dimensional tensor ), each one associated with an input feature map. Filters and channels are both composed of kernels (2-dimensional tensor ), which are the key element in the 2-d convolution operation. Sub-kernel vectors (1-dimensional tensor of size or ) and scalar weights(0-dimensional tensor) are lower-level elements in a convolutional layer. Figure 3 illustrates these concepts.
3.2 Range of Granularity
It has been stated that fine-grained sparsity and channel reduction are two extreme cases of granularity. Among all possible grain sizes, there are a large variety of choices.
In this paper, we investigate granularity where the grain size increases with the number of dimensions. To be specific, we study 4 cases where the atomic elements(grain) during pruning are scalar weights (0-D), sub-kernel vectors (1-D), kernels (2-D) and filters (3-D). We explain these cases with numpy-like pseudo codes as below. Figure 3 also illustrates the different granularities of sparsity.
We select these four cases because they can be and have already been mapped into simple computation logics. Fine-grained sparsity breaks the tensor into disjoint weights, therefore fine-grained multiplication-accumulation are required. Its implementations has been provided in EIEhan2016eie and SCNNparashar2017scnn . Sub-kernel Vector sparsity can be mapped into 1-D convolution. Though the sparse case has yet been studied, Eyerisschen2016eyeriss deals with the dense case and treats 1-D convolution as primitive operation. Kernel sparsity is related with 2-D convolution, which is primitive operations in a variety of algorithms and platforms, like Winograd convolution in cuDNN111Nvidia developer’s blog and some FPGA implementationsqiu2016going . Channel sparsity is equivalent to model reduction and, thus, is easy for acceleration in all types of platforms.
3.3 Coarse-grained Pruning Method
Coarse-grained pruning deletes multiple weights together instead of individual weights. Typically the grain can be filters, kernels, sub-kernels vectors or anything else. Because we are interested in the effects of the grain size rather than the pruning method, we adopt the simple magnitude-based pruning criterion in han2015learning . For a grain that consists of multiple weights, the Salience is defined as the sum of absolute values, i.e. the L1 norm in Equation 1. Given the targeted sparsity, say we want 30% of the weights to be zero, we sort the grains of weights according to the L1 norm defined in Equation 1. The grains with the smallest 70% L1-norm are deleted.
We also adopt the iterative pruning method proposed by Han et al.han2015learning . It is able to reach higher sparsity than direct pruning. The sparsity during each pruning stage is determined by sensitivity analysis, which requires individually pruning every layer and measure the accuracy loss on the training dataset.
4 Sparsity-Accuracy Relation with Different Grain Sizes
Our goal is to study how the granularity of pruning influences the accuracy. Specifically, we want to compare the accuracy of different pruning granularities at the same sparsity, and the result is shown in Figure 4. Sparsity serves as a regularizer because it lowers the model capacity by reducing the number of parameters. Coarse-grained sparsity not only reduces the number of parameters but also constrains the positions of parameters, which is an even stronger regularizer. That’s why at low sparsity rate we observed the accuracy improvement in Figure 4.
To ensure fair comparison, we enforce the same sparsity setting and training schedule for the same model. All experiments were performed on the ImageNet datasetdeng2009imagenet
with Caffejia2014caffe .
For CNN models, we only count the overall sparsity of convolutional layers since convolutional layers take up most of the computations in a typical CNN modelbagherinezhad2016lcnn . However, we still prune the fully-connected layers together with convolutional layers to obtain consistent comparisons with previous workshan2015learning ; guo2016dynamic . For fc layers, we only use fine-grained pruning because there’s no such hierarchy of pruning granularity in the fc layer.
We experimented on AlexNet for detailed accuracy-sparsity curves. We also experimented with modern CNNs, including VGG-16simonyan2014very , GoogLeNetszegedy2015going , ResNet-50he2016deep , and DenseNet-121huang2016densely and compared their accuracies at the same-sparsity point. Their results are reported and compared in Table 1.
Figure 4 shows the accuracy curve of density(1-sparsity) under various settings. In this figure there are four different types granularity of sparsity; in each case the atomic element for pruning is
Fine-grained(0-Dim): Individual weights.
Vector(1-Dim) : Sub-kernel vectors of size .
Kernel(2-Dim): Kernels of shape .
Filter(3-Dim): Filters of shape .
When the grain size of pruning is very large, say, filters, we observed huge accuracy loss during iterative pruning. AlexNet loses nearly 1% validation accuracy at the very first pruning stage, which implies it is unsuitable for lossless model compression. For finer-grained pruning, the accuracy loss is much smaller; we even noticed small accuracy increases during the first several pruning stages. Note that the results for AlexNet are better than the original work by Han et al.han2015learning due to a smoother pruning process. We give a detailed description in Section 7.
|AlexNet||24.8%||Kernel Pruning (2-D)||79.20%|
|Vector Pruning (1-D)||79.94%|
|Fine-grained Pruning (0-D)||80.41%|
|VGG-16||23.5%||Kernel Pruning (2-D)||89.70%|
|Vector Pruning (1-D)||90.48%|
|Fine-grained Pruning (0-D)||90.56%|
|GoogLeNet||38.4%||Kernel Pruning (2-D)||88.83%|
|Vector Pruning (1-D)||89.11%|
|Fine-grained Pruning (0-D)||89.40%|
|ResNet-50||40.0%||Kernel Pruning (2-D)||92.07%|
|Vector Pruning (1-D)||92.26%|
|Fine-grained Pruning (0-D)||92.34%|
|DenseNet-121||30.1%||Kernel Pruning (2-D)||91.56%|
|Vector Pruning (1-D)||91.89%|
|Fine-grained Pruning (0-D)||92.21%|
The results in Table 1 and Figure 4 support the assumption that coarse-grained sparsity causes greater accuracy loss than fine-grained sparsity. Pruning with a large grain size like filters will greatly hurt accuracy. On the other hand, pruning with a smaller grain size leads to similar accuracy-sparsity curves with fine-grained pruning. Notice that in Figure 4, some curves appear to rise smoothly at first. That suggests coarse-grained pruning can still reach similar compression rates as fine-grained pruning, giving additional advantages described in the following section.
5 Comparison of Storage
Model size is an important factor for real-world mobile applications. On the one hand, it constrains the application in memory-bounded devices. On the other hand, memory access is more than two orders of magnitude more energy expensive during the execution of deep neural networkhan2015learning . Sparsity serves as an effective approach to compress neural network models. Sparse neural networks are usually stored in a similar format to Compressed Row Storage(CRS), where both values and indices are stored. Coarse-grained sparsity, due to its regularity, is able to save the number of indices as illustrated in Figure 6. Therefore the coarse-grained sparse models take up less storage than fine-grained models at the same sparsity.
We want to investigate how the prediction accuracy changes with different grain sizes of pruning at the same level of storage(instead of sparsity). We do not use full-precision 32-bit weights, but 8-bit weights instead, as 8-bit weights, either true 8-bit integer formats or 8-bit indices indexing to a table of shared fp32 weights, have been proven to be sufficient in a lot of literatureJouppi2017 ; han2016eie ; vanhoucke2011improving . We use 4-bit indices to store the distances between adjacent non-zeros, following the method in Deep Compression han2015deep . Moreover, as indicated in Deep Compression, the quantization method works independently with sparsity. To check if it still works with coarse-grained sparsity, we plot the accuracy-bits curves of different types of pruned models in Figure 6. The results show that sparsity structure has negligible influence over quantization.
Figure 7 shows the accuracy-storage relationship of AlexNet. We find that the first three curves(Fine-grained, Vector and Kernel) are closer than those in Figure 4. This figure shows the effect of index saving for coarse-grained pruning.
To better compare the compression ratio with modern deep neural nets, we list the results of AlexNet, VGG-16 and GoogLeNet, ResNet-50 and DenseNet-121 in Table 2
. Here the storage ratio is defined as the model size of pruned 8-bit models(with 4-bit indices) to that of dense 8-bit models. Note that it is almost impossible to prune a model that exactly matches the baseline accuracy, so we use linear interpolation to obtain the estimated density and storage ratio at a given point of accuracy.
|AlexNet||80.3%||Kernel Pruning (2-D)||37.8%||39.7%|
|Vector Pruning (1-D)||29.9%||34.5%|
|Fine-grained Pruning (0-D)||22.1%||33.0%|
|VGG-16||90.6%||Kernel Pruning (2-D)||44.4%||46.9%|
|Vector Pruning (1-D)||30.7%||35.8%|
|Fine-grained Pruning (0-D)||27.0%||40.6%|
|GoogLeNet||89.0%||Kernel Pruning (2-D)||43.7%||51.6%|
|Vector Pruning (1-D)||36.9%||47.4%|
|Fine-grained Pruning (0-D)||32.3%||48.5%|
|ResNet-50||92.3%||Kernel Pruning (2-D)||61.3%||77.0%|
|Vector Pruning (1-D)||40.0%||52.7%|
|Fine-grained Pruning (0-D)||37.1%||55.7%|
|DenseNet-121||91.9%||Kernel Pruning (2-D)||35.5%||48.9%|
|Vector Pruning (1-D)||31.1%||43.8%|
|Fine-grained Pruning (0-D)||26.6%||39.8%|
For a sparse network, the larger the grain size is, the less storage needed. This is due to index sharing among the weights of the kernel as shown in Figure 6. However, AlexNet and VGG-16 in particular have much closer density/storage results for kernel pruning than GoogLeNet, ResNet, and DenseNet. This is caused by the small size of the convolutional kernels being pruned: these networks have many layers of 1x1 convolutions, which do not benefit from sharing index values. AlexNet and VGG-16, on the other hand, have a multitude of larger convolutions.
6 Regular Sparsity Helps Hardware Implementation
It has been mentioned in the previous sections that filter pruning is able to obtain acceleration on general-purpose processors like CPUs or GPUs. For intermediate grain sizes like kernels or sub-kernel vectors, though it is still difficult for acceleration on general-purpose processors, there are several advantages over fine-grained sparsity on customized hardware. Those advantages enable simpler circuit design and higher energy efficiency on customized hardware. We qualitatively and quantitatively analyze the advantages as follows:
Qualitative analysis. In convolutional layers, 2-D convolution is usually the primitive operation. Kernel pruning (2-D pruning) can easily lead to computation reduction, because the 2-D convolutions of deleted kernels can be saved. Recent custom hardware design for CNN also use 1-D convolution as the primitive operationchen2016eyeriss . In this case, sub-kernel vector pruning is beneficial. Compared with fine-grained sparsity, coarse-grained sparsity is able to preserve the low-level computation logic, therefore simplifying the hardware design.
Quantitative analysis. Memory reference is a major factor of energy consumptionhan2015learning . Recent work on custom hardware exploits both the sparsity of weights and activations of CNNsparashar2017scnn . In their implementation, the weights and input activations are both stored in sparse format while output activations are stored in dense format. The indices of weights and activations are used for calculating the output address, to which the product of weight and activation will perform a scatter-add operation. This process is illustrated in Figure8. After one layer is finished, the output activations will then be compressed into the sparse format for next layer.
If the same output address is referenced again, a data shortcut can be used to avoid the expensive read/write. For example, two adjacent weights and two adjacent activations will reference 3 addresses instead of 4. Due to the locality, coarse-grained sparse weights have a larger probability of output address collision. We simulated with VGG-16 on ImageNet’s validation set to compare the number of memory references and listed the results in Table3. With the same density, coarse-grained sparsity saves of the total output memory references.
|Density||Fine-grained||Vector Pruning||Relative # of memory references|
In this section we compare our results with previous works on pruninghan2015learning ; guo2016dynamic . We select AlexNet, as its layer-wise sparsity is published in previous papers. By using a smoother pruning process, we find the results of Han et al.han2015learning can be further improved without any algorithmic change.
Table 4 gives an overall comparison of key statistics for AlexNet. Apart from the number of parameters, there are some other factors affecting the efficiency of DNN models. Here FLOPs is the total number of floating-point operations. Storage is measured with of 8-bit weights and 4-bit indices as indicated in Section 5. Due to the fact that the storage of convolutional layers is much smaller but reused much more frequently than fully-connected layers, we add an additional row for storage of convolutional layers. The number of memory referenced is calculated by simulating the process of Figure 8. Here the baseline number of memory references is obtained from dense model with sparse activations.
The results show that the our fine-grained pruned model already has advantages over the previous state-of-art work in terms of FLOPs, storage of convolutional layers and number of memory references. Moreover, compared with our fine-grained baseline, the vector pruning method can further reduce the storage of convolutional layers by 23% and the number of memory references by 43%.
We thoroughly explored the granularity of sparsity with experiments on detailed accuracy-density relationship. Due to the advantage of index saving, coarse-grained pruning is able to achieve a higher model compression ratio, which is desirable for mobile implementation. We also analyzed the hardware implementation advantages and show that coarse-grained sparsity saves output memory access compared with fine-grained sparsity, and compared with dense implementation. Given the advantages of simplicity and efficiency from a hardware perspective, coarse-grained sparsity enables more efficient hardware architecture design of deep neural networks.
-  Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, and et al. . In-datacenter performance analysis of a tensor processing unit. In 44th International Symposium on Computer Architecture, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In , pages 770–778, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.
-  Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
-  Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
-  Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz.
Pruning convolutional neural networks for resource efficient transfer learning.International Conference on Learning Representations, 2017.
-  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
-  Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.
-  Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPs, volume 2, pages 598–605, 1989.
-  Andries Petrus Engelbrecht. IEEE transactions on Neural Networks, 12(6):1386–1399, 2001.
C Lee Giles and Christian W Omlin.
Pruning recurrent neural networks for improved generalization performance.IEEE transactions on neural networks, 5(5):848–851, 1994.
-  Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
-  Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
-  Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016.
-  Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joe Emer, Stephen Keckler, and William J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 44th International Symposium on Computer Architecture, 2017.
-  Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016.
-  Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
Vincent Vanhoucke, Andrew Senior, and Mark Z Mao.
Improving the speed of neural networks on CPUs.
Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. Citeseer, 2011.
-  Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.