Introduction
Deep neural networks (DNNs) have emerged as the fundamental element and core enabler of machine learning applications due to their high accuracy, excellent scalability, and self-adaptiveness [Goodfellow et al.2016]. A well-trained DNN model can be deployed as an inference system for multiple objectives, such as image classification [Krizhevsky, Sutskever, and Hinton2012], object detection [Ren et al.2015], and natural language processing [Hinton, Deng, and Yu2012]. However, state-of-the-art DNN models such as VGG-16 [Simonyan and Zisserman2014], ResNet-50 [He et al.2016] and MobileNet [Howard et al.2017] involve intensive computation and high memory storage, making it very challenging to execute inference in real time on current mobile platforms. Recently, high-end mobile platforms have been rapidly overtaking desktops and laptops as primary computing devices for broad DNN applications such as wearable devices, video streaming, unmanned vehicles, smart health devices, etc. [Philipp, Durr, and Rothermel2011][Lane et al.2015][Boticki and So2010]. Developing a real-time DNN inference system is desirable but remains constrained by the limited computation resources of embedded processors on a mobile platform. Multiple end-to-end mobile DNN acceleration frameworks, such as TVM [Chen et al.2018], TensorFlow Lite (TFLite) [Ten] and Alibaba Mobile Neural Network (MNN) [Ali], have been developed. However, the inference time of large-scale DNNs (e.g., 242 ms for VGG-16 using TVM on the Adreno 640 GPU) is still far from the real-time requirement.
In order to mitigate the challenges brought by DNNs' bulky computation and achieve the goal of real-time inference, it is necessary to consider algorithm-level innovations. Various DNN model compression techniques have been studied, among which weight pruning [Han, Mao, and Dally2015][Mao et al.2017][Dai, Yin, and Jha2017][Wen et al.2016][He, Zhang, and Sun2017] can result in a notable reduction in model size. Early work [Han, Mao, and Dally2015] on non-structured weight pruning (fine-grained) prunes weights at arbitrary locations, resulting in a sparse model stored in the compressed sparse column (CSC) format. This undermines processing throughput, because the indices in the compressed weight representation cause stalls or complex workloads on highly parallel architectures [Han, Mao, and Dally2015][Wen et al.2016]. On the other hand, structured weight pruning [Wen et al.2016] (coarse-grained) is more hardware-friendly. By exploiting filter pruning and channel pruning, the pruned model is more regular in its shape, which eliminates the storage requirement for weight indices. However, it is observed that structured pruning hurts accuracy more significantly than non-structured sparsity.
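To illustrate the index overhead of non-structured sparsity, the following sketch (illustrative only, not any of the referenced implementations) stores a randomly pruned weight matrix in CSC form and counts the bookkeeping entries that accompany the surviving weights:

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal((4, 8))
dense[rng.random((4, 8)) < 0.75] = 0.0  # non-structured pruning: arbitrary positions

# Compressed sparse column (CSC): per-column pointers plus a row index per value.
values, row_idx, col_ptr = [], [], [0]
for j in range(dense.shape[1]):
    rows = np.flatnonzero(dense[:, j])
    row_idx.extend(rows.tolist())
    values.extend(dense[rows, j].tolist())
    col_ptr.append(len(values))

# Every kept weight drags along a row index, plus one pointer per column:
# this indexing is what causes stalls on highly parallel architectures.
print(len(values), "weights kept;", len(row_idx) + len(col_ptr), "index entries")
```

Note that the index metadata grows with the number of surviving weights, so the storage and decoding cost never disappears no matter how high the pruning rate is.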
It is imperative to find a new granularity level that can satisfy the high accuracy demand as well as regularity in the DNN model structure. We make the observation that non-structured and structured pruning are two extremes of the full design space. The two missing keys are: (i) finding a new, intermediate sparsity dimension that can fully leverage both the high accuracy of the fine-grained model and the high regularity of the coarse-grained model; and (ii) finding the corresponding (algorithm-compiler-hardware) optimization framework which can seamlessly bridge the gap between hardware efficiency and the new sparsity dimension. To address the above problems, this paper proposes PCONV, comprising (a) a new sparsity dimension that exploits both intra-convolution and inter-convolution kernel sparsities, exhibiting both high accuracy and regularity, and revealing a previously unknown point in the design space; and (b) a compiler-assisted DNN inference framework that fully leverages the new sparsity dimension and achieves real-time DNN acceleration on mobile devices.
In PCONV, we call our intra-convolution kernel pruning pattern pruning and our inter-convolution kernel pruning connectivity pruning. In pattern pruning, a fixed number of weights are pruned in each convolution kernel. Different from non-structured weight pruning, pattern pruning produces the same sparsity ratio in each filter and a limited number of pattern shapes. Essentially, our designed patterns correspond to key convolution filters from computer vision, such as the Gaussian filter for smoothing and the Laplacian of Gaussian filter for smoothing and sharpening. For connectivity pruning, the key insight is to
cut the connections between certain input and output channels, which is equivalent to removing the corresponding kernels, making the filter "length" shorter than in the original model. With connectivity pruning, we further enlarge the compression rate and provide greater DNN acceleration potential, while maintaining a balanced workload in the filter-wise computation of DNNs. Pattern and connectivity pruning can be combined at the algorithm level and accelerated under the unified compiler-assisted acceleration framework. For our compiler-assisted DNN inference framework, we use execution code generation, which converts DNN models into computational graphs and applies multiple optimizations, including high-level, fine-grained DNN layer-wise information extraction, filter kernel reorder and load redundancy elimination. All design optimizations are general and applicable to both mobile CPUs and GPUs. We demonstrate that pattern pruning consistently improves model accuracy. When combined with connectivity pruning, the results still outperform current DNN pruning methods, both non-structured and structured weight pruning. In Section "Accuracy Analysis", we show PCONV achieves the most desirable sparsity among current prune-for-acceleration works. We also deploy the PCONV model on our compiler-assisted mobile acceleration framework and compare with three state-of-the-art frameworks on mobile CPU and GPU, TensorFlow Lite, TVM, and MNN, using three widely used DNNs, VGG-16, ResNet-50, and MobileNet-V2, and two benchmark datasets, ImageNet and CIFAR-10. Evaluation results show that PCONV achieves significant speedup without any accuracy drop. Using the Adreno 640 embedded GPU, PCONV achieves an unprecedented 19.1 ms inference time for VGG-16 on the ImageNet dataset. To the best of our knowledge, this is the first time real-time execution of such representative large-scale DNNs has been achieved on mobile devices.

Background
DNN Model Compression
DNN model compression is a promising method to remove redundancy in the original model. The aim is to reduce inference time by involving fewer weights in the computation graph. The weight pruning method acts as a surgeon, removing inherently redundant neurons or synapses. As Figure 1 shows, the two main approaches to weight pruning are the general, non-structured pruning and structured pruning, which produce irregular and regular compressed DNN models, respectively.

Non-structured pruning: Early work is [Han, Mao, and Dally2015]
, in which an iterative, heuristic method is used with limited, non-uniform model compression rates. Advanced by [Zhang et al.2018] and [Ren et al.2019] with the powerful ADMM [Boyd et al.2011] optimization framework, non-structured pruning achieves a very high weight reduction rate and promising accuracy. However, for compiler and code optimization, the irregular weight distribution within kernels requires heavy control-flow instructions, which degrade instruction-level parallelism. Also, kernels in different filters have divergent workloads, which burdens thread-level parallelism when filters are processed through multi-threading. Moreover, irregular memory access causes low memory performance and thereby execution overheads.

Structured pruning: This method has been proposed to address the index overhead and imbalanced workload caused by non-structured pruning. Pioneered by [Wen et al.2016][He, Zhang, and Sun2017], structured weight pruning generates regular and smaller weight matrices, eliminating the overhead of weight indices and achieving higher acceleration performance in CPU/GPU executions. However, it suffers from a notable accuracy drop when the pruning rate increases.
Patterns in Computer Vision
Convolution operations have existed in different research areas for an extended period of time, including image processing, signal processing, probability theory, and computer vision. In this work, we focus on the relationship between conventional image processing and state-of-the-art convolutional neural networks in their usage of convolutions. In image processing, the convolution operator is manually crafted with prior knowledge of the particular characteristics of diverse patterns, as with the Gaussian filter. In convolutional neural networks, on the other hand, the convolution kernels are randomly initialized, then trained on large datasets using gradient-based learning algorithms to update their values.
[Mairal et al.2014] derived a network architecture named Convolutional Kernel Networks (CKN), which achieves lower accuracy than current DNNs and thus has limited usage. [Zhang2019] proposed applying a blur filter to DNNs before pooling to maintain the shift-equivalence property. The limited prior work applying conventional vision filters to DNNs requires network structure changes and does not focus on weight pruning/acceleration, and is thus distinct from PCONV.
DNN Acceleration Frameworks on Mobile Platform
Recently, researchers from academia and industry have investigated DNN inference acceleration frameworks on mobile platforms, including TFLite [Ten], TVM [Chen et al.2018], Alibaba Mobile Neural Network (MNN) [Ali], DeepCache [Xu et al.2018] and DeepSense [Yao et al.2017]. These works do not account for model compression techniques, and their performance is far from the real-time requirement. Other studies exploit model sparsity to accelerate DNN inference, e.g., [Liu et al.2015] and SCNN [Parashar et al.2017], but they either do not target mobile platforms (requiring new hardware) or trade off compression rate against accuracy, and thus face different challenges than our work.
Motivations
Based on the current research progress on DNN model compression vs. acceleration, we analyze and rethink the whole design space, and are motivated by the following three points:
Achieving both high model accuracy and pruning regularity. In non-structured pruning, any weight can be pruned. This kind of pruning has the largest flexibility, and thus achieves a high accuracy and a high pruning rate. But it is not hardware-friendly. On the other hand, structured pruning produces hardware-friendly models, but the pruning method lacks flexibility and suffers from accuracy drops. Our motivation is to combine the best of the above two sparsity types. To achieve that, we introduce a new dimension, pattern-based sparsity, revealing a previously unknown design point with high accuracy and structural regularity simultaneously.
Image-enhancement-inspired sparse convolution patterns. Contemporary DNN weight pruning methods originate from the motivation that eliminating redundant information (weights) will not hurt accuracy. On the other hand, these pruning methods scarcely treat pruning as a specific kind of binary convolution operator, let alone exploit the corresponding opportunities. Along this line, we find that sparse convolution patterns have the potential to enhance image quality thanks to their special vision properties. Motivated by this, we propose carefully designed patterns derived from mathematical vision theory.
Compiler-assisted DNN inference framework. With the higher accuracy enabled by fine-grained pruning patterns, the key question is how to regain hardware efficiency similar to (or even surpassing) that of coarse-grained structured pruning. We take a unique approach and design an optimized, compiler-assisted DNN inference framework to close the performance gap between fully structured pruning and pattern-based pruning.
Theory of Sparse Convolution Patterns (SCP)
Let an image with resolution $H \times W$ be represented by $X$. An $N$-layer DNN can be expressed as a feature extractor $\mathcal{F} = \mathcal{F}_N \circ \mathcal{F}_{N-1} \circ \cdots \circ \mathcal{F}_1$, with layer index $i \in \{1, \dots, N\}$. Inside the DNN, each convolutional layer is defined by its weight tensor $W_i \in \mathbb{R}^{F_i \times C_i \times K \times K}$, with filter kernel shape $K \times K$, number of filters $F_i$ and number of channels $C_i$.
Besides treating pruning as a redundant-information removal technique, we consider it as incorporating an additional convolution kernel $P$ that performs element-wise multiplication with the original kernel. $P$ is termed the Sparse Convolution Pattern (SCP), with dimension $K \times K$ and binary-valued elements (0 and 1). Specific SCPs fit mathematical vision theory well, according to our following derivation. Based on this mathematical rigor, we propose the novel pattern pruning scheme, i.e., applying SCPs to convolution kernels. As illustrated in Figure 2, the white blocks denote a fixed number of pruned weights in each kernel. The remaining red blocks in each kernel have arbitrary weight values, while their locations form a specific SCP $P$. Different kernels can have different SCPs, but the total number of SCP types shall be limited.
In order to further increase the pruning ratio and DNN inference speed, we can selectively cut the connections between particular input and output channels, which is equivalent to removing the corresponding kernels. This is termed connectivity pruning. Connectivity pruning is illustrated in Figure 2, with gray kernels denoting pruned ones. The rationale of connectivity pruning stems from the desirability of locality in layer-wise computations, inspired by the human visual system [Yamins and DiCarlo2016]. It is a good complement to pattern pruning. Both pruning schemes can be integrated in the same algorithm-level solution and compiler-assisted mobile acceleration framework.
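As a toy sketch of how the two schemes compose, the snippet below masks each kernel of a random weight tensor with one of four hypothetical SCPs (placeholder shapes, not the set derived later), then removes the weakest whole kernels. Magnitude-based selection here stands in for the ADMM-based training used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 4, 3, 3))  # (filters, channels, K, K)

# Hypothetical 3x3 SCPs: binary masks keeping 4 weights per kernel
# (placeholder shapes for illustration only).
scps = np.array([
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
], dtype=weights.dtype)

# Pattern pruning: each kernel keeps the SCP preserving the most weight magnitude.
scores = np.einsum('fckl,pkl->fcp', np.abs(weights), scps)
choice = scores.argmax(axis=-1)           # chosen SCP index per kernel
pruned = weights * scps[choice]           # element-wise mask, one SCP per kernel

# Connectivity pruning: remove the whole kernels with the smallest norms,
# cutting the corresponding input/output channel connections.
norms = np.abs(pruned).sum(axis=(2, 3))
cut = norms < np.quantile(norms, 0.25)    # drop the weakest 25% of kernels
pruned[cut] = 0.0
```

Each surviving kernel thus carries exactly one SCP shape, which is the regularity the compiler later exploits.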
The Convolution Operator
In conventional image processing, a convolution operator is formally defined by the following formula, where the output pixel value $h[x, y]$ is the weighted sum of the input pixel values $f[x-u, y-v]$, and $g[u, v]$ is the weight kernel value:

$h[x, y] = \sum_{u, v} f[x-u,\, y-v]\; g[u, v]$ (1)

By substitution of variables, this formula can be transformed to

$h[x, y] = \sum_{u, v} f[u, v]\; g[x-u,\, y-v]$ (2)

Then we derive the notation of the convolution operator as

$h = f * g$ (3)
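A minimal direct implementation of the convolution in Equation (1) makes the stated algebraic properties easy to check numerically (the test inputs are illustrative):

```python
import numpy as np

def conv2d(f, g):
    """Full 2-D convolution: h[x, y] = sum over (u, v) of f[x-u, y-v] * g[u, v]."""
    fh, fw = f.shape
    gh, gw = g.shape
    h = np.zeros((fh + gh - 1, fw + gw - 1))
    for u in range(gh):          # accumulate one shifted, scaled copy of f
        for v in range(gw):      # per kernel coefficient g[u, v]
            h[u:u + fh, v:v + fw] += g[u, v] * f
    return h

f = np.arange(9.0).reshape(3, 3)
g = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.allclose(conv2d(f, g), conv2d(g, f))  # commutative property
```

The commutative check above, together with associativity (verifiable the same way), is exactly what licenses the filter compositions used in the derivations that follow.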
Convolution is a linear shift-invariant (LSI) operator, satisfying the commutative property, the superposition property and the shift-invariance property. Additionally, convolution satisfies the associative property following Fubini's theorem.
Sparse Convolution Pattern (SCP) Design
Our designed SCPs can be transformed into a series of steerable filters [Freeman and Adelson1991], namely the Gaussian filter and the Laplacian of Gaussian filter, which perform image smoothing, edge detection or image sharpening in mathematical vision theory.
Gaussian filter: Consider a two-dimensional Gaussian filter $G$:

$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$ (4)

where $x$ and $y$ are the input coordinates, and $\sigma$ is the standard deviation of the Gaussian distribution. Typically, the Gaussian filter performs image smoothing, and further sophisticated filters can be created by first smoothing the image input with a unit-area Gaussian filter, then applying other steerable filters.
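For concreteness, a discrete, unit-area sampling of this Gaussian filter can be sketched as follows (the kernel size and sigma are illustrative choices):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Sample G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2) on a grid.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()  # normalize to unit area so smoothing preserves brightness
```

Convolving an image with this kernel performs the smoothing step described above; the unit-area normalization is what makes it safe to compose with other steerable filters.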
Laplacian of Gaussian filter: The Laplacian operator is the second-derivative operator. According to the associative property, smoothing an image with a Gaussian filter and then applying the Laplacian operator is equivalent to convolving the image with the Laplacian of Gaussian (LoG) filter:

$\nabla^2 G(x, y) = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4}\, G(x, y)$ (5)
The LoG filter is a band-pass filter that eliminates both high-frequency and low-frequency noise. LoG has elegant mathematical properties and is valid for a variety of applications, including image enhancement, edge detection, and stereo matching.
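A corresponding discrete LoG kernel can be sampled from the analytic form above (the size and sigma are illustrative; the zero-mean shift reflects the band-pass behavior, which must reject the DC component):

```python
import numpy as np

def log_kernel(size=5, sigma=1.0):
    # Sample LoG(x, y) = ((x^2 + y^2 - 2 sigma^2) / sigma^4) * G(x, y) on a grid.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    k = (xx**2 + yy**2 - 2 * sigma**2) / sigma**4 * g
    return k - k.mean()  # zero-mean: a band-pass filter passes no constant signal
```

The negative center surrounded by positive lobes is what produces the edge-detection and sharpening behavior discussed above.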
Taylor series expansion is utilized to determine the approximate values of the LoG filter with a $3 \times 3$ filter size. First, we consider the 1D situation. The Taylor series expansions of the 1D Gaussian filter $G(x)$ at $x \pm \delta$ are given by:

$G(x + \delta) = G(x) + \delta G'(x) + \frac{1}{2}\delta^2 G''(x) + \frac{1}{6}\delta^3 G'''(x) + O(\delta^4)$ (6)

$G(x - \delta) = G(x) - \delta G'(x) + \frac{1}{2}\delta^2 G''(x) - \frac{1}{6}\delta^3 G'''(x) + O(\delta^4)$ (7)

By summing (6) and (7), we have

$G(x + \delta) + G(x - \delta) = 2G(x) + \delta^2 G''(x) + O(\delta^4)$ (8)

The second derivative of the Gaussian, $G''(x)$, is equivalent to the 1D LoG. Equation (8) is further transformed to

$G''(x) = \frac{G(x + \delta) - 2G(x) + G(x - \delta)}{\delta^2} + O(\delta^2)$ (9)

Applying this central difference approximation of the LoG, we derive the 1D approximation of the LoG filter as $[1 \;\; {-2} \;\; 1]$. Then we procure the 2D approximation of the LoG filter by convolving $[1 \;\; {-2} \;\; 1]$ with its transpose, which gives $\begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}$. According to the property of the second derivative:

$\nabla^2 G(x, y) = \frac{\partial^2 G(x, y)}{\partial x^2} + \frac{\partial^2 G(x, y)}{\partial y^2}$ (10)

and Equation (9), we have

$\nabla^2 G(x, y) = \frac{G(x+\delta, y) + G(x-\delta, y) + G(x, y+\delta) + G(x, y-\delta) - 4G(x, y)}{\delta^2} + O(\delta^2)$ (11)

Based on (11), we derive another approximation of the LoG as $\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$.
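The central-difference step can be verified numerically. The snippet below compares the $[1, -2, 1]$ stencil against the analytic second derivative of a 1D Gaussian, and forms the separable 2D stencil as an outer product (the sigma and delta values are illustrative):

```python
import numpy as np

sigma, delta = 2.0, 0.1
x = np.arange(-8.0, 8.0, delta)
g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# [1, -2, 1] / delta^2, per Equation (9), vs the analytic G''(x).
approx = (g[2:] - 2 * g[1:-1] + g[:-2]) / delta**2
exact = ((x**2 - sigma**2) / sigma**4 * g)[1:-1]
assert np.max(np.abs(approx - exact)) < 1e-3  # O(delta^2) agreement

# The separable 2-D approximation is the outer product of the 1-D stencil.
stencil_1d = np.array([1.0, -2.0, 1.0])
print(np.outer(stencil_1d, stencil_1d))
```

Shrinking delta tightens the agreement quadratically, consistent with the $O(\delta^2)$ remainder in the derivation.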
According to the central limit theorem, the convolution of two Gaussian functions is still a Gaussian function, whose variance is the sum of the variances of the two original Gaussians. Hence, we convolve the above two approximations of the LoG and then apply normalization to obtain the Enhanced Laplacian of Gaussian (ELoG) filter. [Siyuan, Raef, and Mikhail2018] have proved the convergence of interpolation in the context of (multi-layer) DNNs, so we utilize interpolated probability density estimation to make a further approximation. Where a 1 appears in the ELoG filter, we mask it to 0 with a given probability. Because we uniformly convolve SCPs into convolutional layers, this random masking operation can be treated as a distributed interpolation of SCPs. In continuous probability space, interpolating SCPs into the convolution function is a specific probability density function (PDF), so the effect of interpolating SCPs is accumulating the probability expectations of interpolation over the convolutional layers. Besides, the convolution function is normalized to unity, so we separate out the coefficient in the following equation. The four SCPs are shown in colored positions in (12). In order to obtain the best approximation to the ELoG filter, we set the masking probability and coefficient such that the desired filter is equal to interpolating these four SCPs eight times. The coefficient has no effect after normalization.
Upper bound: According to [C.Blakemore and Campbell1969], the optimal number of times to apply the LoG filter is six and the maximum is ten. Thus the desired number of times to interpolate the SCPs in (12) is around 24, and the maximum is around 55. This upper bound covers most existing effective DNNs, even ResNet-152, which comprises 50 convolutional layers with a $3 \times 3$ filter kernel size.
The four SCPs in (12) form the ELoG filter through interpolation. Hence, the designed SCPs inherit the de-noising and sharpening characteristics of LoG filters. We visualize the intermediate results of DNNs to interpret and verify the advantage of our designed SCPs in the following section.
Visualization and Interpretation
Explanations of individual DNN decisions have been explored by generating informative heatmaps such as CAM and Grad-CAM [Selvaraju et al.2017], or through guided backpropagation [Springenberg and Alexey Dosovitskiy2015] conditioned on the final prediction. Utilizing guided backpropagation, we can visualize what a DNN has learned. The visualization results of applying SCPs to an original DNN model (pattern pruning) are demonstrated in Figure 3. We sample four input images from the ImageNet dataset ("hourglass", "bee", "dragonfly" and "chihuahua"), then apply guided backpropagation to propagate back from each target class label and obtain the gradient images. Finally, we generate the saliency maps of the gradient images. Compared with the original VGG-16 model, the pattern-pruned VGG-16 model captures more detailed information of the input image with less noise.
There are plenty of DNN visualization techniques. In the Supplemental Materials, we demonstrate two more sets of visualization results, using the integrated gradients and inverted representation methods. Both sets show that our pattern-pruned model collects more information from an image than the original model. We conclude that, by applying our designed SCPs, pattern pruning enhances DNNs' image processing ability.
Accuracy Analysis (Supplemental Materials)
In the previous derivation, we determined the (four) SCPs as our pattern set. Our algorithm-level solution starts from a pre-trained DNN model, or can train from scratch. To generate a PCONV model, we need to assign SCPs to each kernel (pattern pruning) or prune specific kernels (connectivity pruning), and train the active (unpruned) weights. To achieve this goal, we extend the ADMM-NN framework in [Ren et al.2019] to produce pattern- and connectivity-pruned models. Due to the space limit, the algorithm details of PCONV model generation are described in the Supplemental Materials.
Accuracy results are illustrated in Figure 4. Starting from baseline accuracy results that are in many cases higher than in prior work, our first conclusion is that accuracy improves when applying our designed SCPs to each convolution kernel. On the ImageNet dataset, pattern pruning improves the top-5 accuracy of both VGG-16 and ResNet-50 with SCPs applied to each convolution kernel. The accuracy improvement is attributed to the enhanced image processing ability of our designed SCPs.
Pruning rate vs. accuracy for non-structured pruning, structured pruning and PCONV. Combined with connectivity pruning, PCONV achieves a higher compression rate without accuracy compromise. Comparing with other pruning methods, i.e., non-structured pruning and structured pruning, we conclude that: (i) PCONV achieves higher accuracy and a higher compression rate than prior non-structured pruning, close to the results of ADMM-NN; and (ii) compared with structured pruning under the same compression rate, PCONV achieves higher accuracy, and can structurally prune more weights without hurting accuracy. A detailed comparison is shown in the Supplemental Materials.
Compilerassisted DNN Inference Framework
In this section, we propose our compiler-assisted DNN inference acceleration framework for mobile devices. Motivated by the two merits of the PCONV model, flexibility and regularity, our compiler-assisted platform uniquely enables optimized code generation to guarantee end-to-end execution efficiency. Since a DNN's computation paradigm is layer-wise execution, we can convert a DNN model into a computational graph, which is embodied by static C++ (for CPU execution) or OpenCL (for GPU execution) code. The code generation process includes three steps: (i) layer-wise information extraction; (ii) filter kernel reorder; and (iii) load redundancy elimination.
Layer-wise information extraction is a model analysis procedure. In particular, it analyzes detailed kernel pattern and connectivity-related information. Key information, such as the pattern distribution, pattern order and the connections between input/output channels through kernels, is utilized by the compiler to perform the optimizations in steps (ii) and (iii).
Filter kernel reorder is designed to achieve the best of instruction-level and thread-level parallelism. When a PCONV model is trained, the patterns and connections of all kernels are already known, i.e., the computation pattern is fixed before deploying the model for inference. All this pattern information is collected during layer-wise information extraction and is leveraged by filter kernel reorder to (i) organize filters with similar kernels together to improve inter-thread parallelism, and (ii) order the same kernels in a filter together to improve intra-thread parallelism. Figure 6 illustrates the two key steps of filter kernel reorder: (i) organize similar filters next to each other; (ii) group kernels with identical patterns in each filter together. As a result, the generated execution code eliminates many execution branches, implying higher instruction-level parallelism; meanwhile, similar filter groups escalate execution similarity and result in good load balance, achieving better thread-level parallelism.

Load redundancy elimination addresses the issue of irregular memory access that causes memory overhead. In DNN execution, the data access pattern of input/output is decided by the (non-zero element) patterns of kernels. Therefore, we can generate data access code with this information for each kernel pattern and call it dynamically during DNN execution. Because the data access code contains all the information for kernel-level computation, it is possible to directly access the valid input data associated with the non-zero elements in a pattern-based kernel. After steps (i) and (ii), patterns are distributed in a structured manner, which reduces the calling frequency of the data access code and, as a result, reduces the memory overhead.
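The two reorder steps can be sketched as follows. This is a simplified stand-in: the per-kernel pattern IDs and the grouping criterion are hypothetical, whereas the actual compiler works from the full layer-wise information described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_filters, n_channels, n_patterns = 8, 4, 4
# Hypothetical per-kernel SCP IDs for one trained layer;
# -1 marks a kernel removed by connectivity pruning.
kernel_pattern = rng.integers(-1, n_patterns, size=(n_filters, n_channels))

# Step (i): place filters with a similar kernel-pattern makeup next to each
# other, so threads taking neighboring filter groups see balanced workloads.
signature = [tuple(sorted(row)) for row in kernel_pattern.tolist()]
filter_order = sorted(range(n_filters), key=lambda f: signature[f])

# Step (ii): within each filter, group kernels with identical patterns
# together, so one unrolled code block covers a whole run without branching.
kernel_order = [np.argsort(kernel_pattern[f], kind="stable") for f in filter_order]

reordered = [kernel_pattern[f][kernel_order[i]] for i, f in enumerate(filter_order)]
for row in reordered:  # identical patterns are now contiguous in every filter
    assert all(row[i] <= row[i + 1] for i in range(len(row) - 1))
```

After such a reorder, each contiguous run of identical pattern IDs maps to a single specialized code block, which is the source of the branch elimination described above.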
Experimental Results
In this section, we evaluate the execution performance of our compiler-assisted framework with our PCONV model deployed. All of our evaluation models are generated by the ADMM-based pruning algorithm described in the Supplemental Materials, and are trained on a server with eight NVIDIA RTX 2080Ti GPUs using PyTorch.
Methodology
In order to show the acceleration of PCONV on mobile devices, we compare it with three state-of-the-art DNN inference acceleration frameworks: TFLite [Ten], TVM [Chen et al.2018], and MNN [Ali]. Our experiments are conducted on a Samsung Galaxy S10 cell phone with the latest Qualcomm Snapdragon 855 mobile platform, which consists of a Qualcomm Kryo 485 octa-core CPU and a Qualcomm Adreno 640 GPU.
In our experiments, the generated PCONV models are based on three widely used network structures: VGG-16 [Simonyan and Zisserman2014], ResNet-50 [He et al.2016] and MobileNet-V2 [Howard et al.2017]. Since the convolution operation is the most time-consuming part of DNN computation (accounting for the dominant share of total inference time), our evaluation of the above network structures focuses on convolutional layer performance. In order to provide a clear illustration of how PCONV enhances mobile performance, the whole device-level evaluation covers three aspects: (i) execution time, (ii) on-device GFLOPS performance, and (iii) how pattern counts affect performance.
Performance Evaluation
In this part, we demonstrate our evaluation results on mobile devices from the three aspects discussed above. In order to show that PCONV has the best acceleration performance on mobile devices, our comparison baselines, i.e., TFLite, TVM and MNN, use their fully optimized configurations (e.g., Winograd optimization is turned on).
Execution time. Figure 7 shows the mobile CPU/GPU performance of the PCONV model executing on our compiler-assisted DNN inference framework. On both CPU and GPU, PCONV achieves substantial speedups over TFLite, TVM and MNN. For the largest DNN (VGG-16) and largest dataset (ImageNet), our framework completes computation on a single input image within 19.1 ms (i.e., 52.4 frames/sec) on GPU, which meets the real-time requirement (usually 30 frames/sec, i.e., 33 ms/frame).
On-device GFLOPS performance. The previous comparison shows that MNN has higher performance than TVM and TFLite. To show that PCONV has better throughput on mobile devices, we compare PCONV with MNN by measuring their run-time GFLOPS on both CPU and GPU. Figure 8 demonstrates the layer-wise GFLOPS performance comparison between PCONV and MNN. The 9 layers we pick from VGG-16's 13 convolutional layers represent the 9 unique layer sizes; the other 4 layers are omitted from Figure 8 because their repeated layer sizes produce repeated GFLOPS results. The results show that PCONV outperforms MNN in both CPU and GPU throughput.
Pattern counts vs. performance. In order to determine how pattern counts affect execution performance, we design random patterns with 4 non-zero elements per kernel alongside our designed SCPs. Table 1 and Table 2 show the accuracy and execution time under different pattern counts using VGG-16 on the CIFAR-10 and ImageNet datasets. The results show that accuracy loss is not necessarily related to the increase in pattern count, but execution performance drops quickly as the pattern count grows, especially on the ImageNet dataset. The pattern counts vs. performance results show that our designed SCPs achieve ideal performance with negligible accuracy loss.
Table 1: VGG-16 on CIFAR-10.

Pattern#  Acc. (%)  Acc. loss (%)  CPU speed (ms)  GPU speed (ms)
4         93.8      0.3            2.7             2.9
8         93.7      0.2            2.9             3.0
12        93.8      0.3            3.1             3.3
Table 2: VGG-16 on ImageNet.

Pattern#  Acc. (%)  Acc. loss (%)  CPU speed (ms)  GPU speed (ms)
4         91.5      0.2            52.7            19.1
8         91.6      0.1            58.9            22.0
12        91.6      0.1            105.2           32.1
Conclusion
This paper presents PCONV, a desirable sparsity type for DNN weight pruning that enables acceleration on mobile devices, leading to real-time mobile inference. PCONV inherits the high flexibility of non-structured pruning, which helps achieve high accuracy and compression rate, and maintains a highly structured weight composition like structured pruning, which leads to hardware friendliness such as optimized memory access, balanced workload and computation parallelism. To deliver PCONV's real-time performance on mobile devices, we design a compiler-assisted DNN inference framework, which fully leverages PCONV's structural characteristics and achieves very high inference speed on representative large-scale DNNs.
References
 [Ali] https://github.com/alibaba/MNN.
 [Aravindh and Andrea2015] Aravindh, M., and Andrea, V. 2015. Understanding deep image representations by inverting them. In Computer Vision and Pattern Recognition, 2015. CVPR 2015. IEEE Conference on.
 [Boticki and So2010] Boticki, I., and So, H.J. 2010. Quiet captures: A tool for capturing the evidence of seamless learning with mobile devices. In International Conference of the Learning Sciences, Volume 1.
 [Boyd et al.2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122.
 [C.Blakemore and Campbell1969] C.Blakemore, and Campbell, F. W. 1969. On the existence of neurones in the human visual system selectively sensitive to the orientation and size of retinal images. In The Journal of Physiology. The Physiological Society.

 [Chen et al.2018] Chen, T.; Moreau, T.; Jiang, Z.; Zheng, L.; Yan, E.; Shen, H.; Cowan, M.; Wang, L.; Hu, Y.; Ceze, L.; et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI.
 [Dai, Yin, and Jha2017] Dai, X.; Yin, H.; and Jha, N. K. 2017. Nest: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017.
 [Freeman and Adelson1991] Freeman, W., and Adelson, E. 1991. The design and use of steerable filters. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 13, 891–906. IEEE.
 [Goodfellow et al.2016] Goodfellow, I.; Bengio, Y.; Courville, A.; and Bengio, Y. 2016. Deep learning, volume 1. MIT press Cambridge.
 [Han et al.2015] Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.
 [Han, Mao, and Dally2015] Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
 [He et al.2018] He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; and Han, S. 2018. Amc: Automl for model compression and acceleration on mobile devices. In European Conference on Computer Vision, 815–832.
 [He et al.2019] He, Y.; Liu, P.; Wang, Z.; Hu, Z.; and Yang, Y. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4340–4349.
 [He, Zhang, and Sun2017] He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 1398–1406. IEEE.
 [Hinton, Deng, and Yu2012] Hinton, G.; Deng, L.; and Yu, D. e. a. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.
 [Howard et al.2017] Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 [Hu et al.2016] Hu, H.; Peng, R.; Tai, Y.-W.; and Tang, C.-K. 2016. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
 [Huang and Wang2018] Huang, Z., and Wang, N. 2018. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV).
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS.
 [Lane et al.2015] Lane, N. D.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C.; and Kawsar, F. 2015. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In International Workshop on IoT towards Applications.
 [Li et al.2016] Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
 [Liu et al.2015] Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; and Pensky, M. 2015. Sparse convolutional neural networks. In CVPR, 806–814.
 [Liu et al.2018] Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
 [Luo, Wu, and Lin2017] Luo, J.-H.; Wu, J.; and Lin, W. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 5058–5066.
 [Mairal et al.2014] Mairal, J.; Koniusz, P.; Harchaoui, Z.; and Schmid, C. 2014. Convolutional kernel networks. In NeurIPS.
 [Mao et al.2017] Mao, H.; Han, S.; Pool, J.; Li, W.; Liu, X.; Wang, Y.; and Dally, W. J. 2017. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922.
 [Min et al.2018] Min, C.; Wang, A.; Chen, Y.; Xu, W.; and Chen, X. 2018. 2PFPCE: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220.
 [Mukund, Ankur, and Qiqi2017] Mukund, S.; Ankur, T.; and Qiqi, Y. 2017. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML).
 [Parashar et al.2017] Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S. W.; and Dally, W. J. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA.
 [Philipp, Durr, and Rothermel2011] Philipp, D.; Durr, F.; and Rothermel, K. 2011. A sensor network abstraction for flexible public sensing systems. In 2011 IEEE Eighth International Conference on Mobile Ad-Hoc and Sensor Systems, 460–469. IEEE.
 [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.
 [Ren et al.2019] Ren, A.; Zhang, T.; Ye, S.; Xu, W.; Qian, X.; Lin, X.; and Wang, Y. 2019. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In ASPLOS.
 [Selvaraju et al.2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Siyuan, Raef, and Mikhail2018] Siyuan, M.; Raef, B.; and Mikhail, B. 2018. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning (ICML).
 [Springenberg and Alexey Dosovitskiy2015] Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2015. Striving for simplicity: The all convolutional net. In ICLR 2015 Workshop Track.
 [Ten] TensorFlow Lite. https://www.tensorflow.org/mobile/tflite/.
 [Wen et al.2016] Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, 2074–2082.
 [Xu et al.2018] Xu, M.; Zhu, M.; Liu, Y.; Lin, F. X.; and Liu, X. 2018. DeepCache: Principled cache for mobile deep vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, 129–144. ACM.
 [Yamins and DiCarlo2016] Yamins, D. L., and DiCarlo, J. J. 2016. Using goaldriven deep learning models to understand sensory cortex. Nature neuroscience 19(3):356.
 [Yao et al.2017] Yao, S.; Hu, S.; Zhao, Y.; Zhang, A.; and Abdelzaher, T. 2017. DeepSense: A unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web.
 [Yu et al.2018] Yu, R.; Li, A.; Chen, C.-F.; Lai, J.-H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.-Y.; and Davis, L. S. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9194–9203.
 [Zhang et al.2018] Zhang, T.; Ye, S.; Zhang, Y.; Wang, Y.; and Fardad, M. 2018. Systematic weight pruning of DNNs using alternating direction method of multipliers. arXiv preprint arXiv:1802.05747.
 [Zhang2019] Zhang, R. 2019. Making convolutional networks shift-invariant again. In ICML.
Appendix A Pattern and Connectivity Pruning Algorithm
From our derivation in Section “Theory of Sparse Convolution Patterns (SCP)” of our main paper, we have determined the four SCPs as our desired patterns. In this section, we describe the methods to generate compressed DNN models for PCONV. The procedure is composed of two steps: (1) use the four SCPs to form a pattern set; (2) assign a pattern from the pattern set to each kernel (pattern pruning) or prune the whole kernel (connectivity pruning), and train the pattern-based weights to maintain accuracy. The overall flow is shown in Figure 9. Essentially, it reflects the algorithm aspects of PCONV. Our method can be applied either to a pre-trained DNN or to a model trained from scratch.
Problem Formulation: Consider an $N$-layer DNN, and we focus on the most computationally intensive CONV layers. The weights and biases of layer $i$ are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function of the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N)$; see [Zhang et al.2018]. In our discussion, $\{\mathbf{W}_i\}_{i=1}^N$ and $\{\mathbf{b}_i\}_{i=1}^N$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$. Then pattern and connectivity pruning is formulated as the optimization problem:
$$ \underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \;\; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big), \qquad (12) $$
$$ \text{subject to}\;\; \mathbf{W}_i \in \mathcal{S}_i,\;\; \mathbf{W}_i \in \mathcal{S}'_i,\;\; i = 1, \ldots, N. $$
The collection of weights in the $i$-th CONV layer forms a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{P_i \times Q_i \times C_i \times F_i}$, where $P_i$, $Q_i$, $C_i$, and $F_i$ are respectively the height of kernels, the width of kernels, the number of kernels, and the number of filters in layer $i$. Suppose $\mathbf{X}$ denotes the weight tensor in a specific layer; then $(\mathbf{X})_{:,:,k,j}$ denotes a specific kernel. In pattern pruning, the constraint in the $i$-th CONV layer is $\mathcal{S}_i := \{\mathbf{X} \mid$ each kernel in $\mathbf{X}$ satisfies one specific pattern shape in the pattern set (and non-zero weight values can be arbitrary)$\}$. In connectivity pruning, the constraint in the $i$-th CONV layer is $\mathcal{S}'_i := \{\mathbf{X} \mid$ the number of non-zero kernels in $\mathbf{X}$ is less than or equal to $\alpha_i\}$ ($\alpha_i$ is a predetermined hyperparameter, with more discussion later). Both constraints need to be simultaneously satisfied.
Extended ADMM-based Solution Framework: The constraint in problem (12) is different from the clustering-like constraints in ADMM-NN [Ren et al.2019], in that each kernel can flexibly select its pattern from the pattern set. As long as a pattern is assigned to each kernel, the constraints in problem (12) become clustering-like and ADMM compatible. Similar to ADMM-NN [Ren et al.2019], the ADMM-based solution is an iterative process, starting from a pre-trained DNN model. To achieve higher flexibility, we assign an appropriate pattern to each kernel based on the $\ell_2$-norm metric in each iteration.
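For concreteness, the per-kernel pattern assignment can be sketched as follows. The four masks below are hypothetical placeholders (the actual SCPs come from the derivation in the main paper); each 3×3 kernel receives the pattern whose retained entries carry the largest L2 norm:

```python
import numpy as np

# Hypothetical 3x3 pattern masks (1 = weight kept). These are placeholders;
# the actual SCPs are the four patterns derived in the main paper.
PATTERN_SET = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]]),
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 0]]),
    np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0]]),
    np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0]]),
]

def assign_pattern(kernel):
    """Return the index of the pattern whose kept weights have the
    largest L2 norm for this 3x3 kernel."""
    scores = [np.linalg.norm(kernel * mask) for mask in PATTERN_SET]
    return int(np.argmax(scores))

def apply_pattern(kernel, pattern_id):
    """Zero out the kernel weights that fall outside the assigned pattern."""
    return kernel * PATTERN_SET[pattern_id]
```

Since the assignment maximizes the retained norm, the weights zeroed by the chosen mask are the ones contributing least to the kernel, which is what makes the subsequent retraining step able to recover accuracy.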
By incorporating auxiliary variables $\mathbf{Z}_i$'s and $\mathbf{Y}_i$'s, and dual variables $\mathbf{U}_i$'s and $\mathbf{V}_i$'s, we decompose (12) into three subproblems, which we solve iteratively until convergence. In iteration $k$, after assigning patterns we solve the first subproblem:
$$ \underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \;\; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big) + \sum_{i=1}^{N} \frac{\rho_i}{2} \big\| \mathbf{W}_i - \mathbf{Z}_i^{k} + \mathbf{U}_i^{k} \big\|_F^2 + \sum_{i=1}^{N} \frac{\rho_i}{2} \big\| \mathbf{W}_i - \mathbf{Y}_i^{k} + \mathbf{V}_i^{k} \big\|_F^2. \qquad (13) $$
The first term is the loss function of the DNN, while the other quadratic terms are convex. As a result, this subproblem can be solved by stochastic gradient descent (e.g., the ADAM algorithm
[Kingma and Ba2014]) similar to training the original DNN. The solution of subproblem 1 is denoted by $\mathbf{W}_i^{k+1}$. Then we aim to derive $\mathbf{Z}_i^{k+1}$ and $\mathbf{Y}_i^{k+1}$ in subproblems 2 and 3. These subproblems have the same form as those in ADMM-NN [Ren et al.2019]. Thanks to the characteristics of the combinatorial constraints, the optimal, analytical solutions of the two subproblems are Euclidean projections, and are polynomial-time solvable. For example, for connectivity pruning, the projection is: keep the $\alpha_i$ kernels with the largest norms and set the rest of the kernels to zero. Pattern pruning is similar. Finally, we update the dual variables $\mathbf{U}_i$ and $\mathbf{V}_i$ according to the ADMM rule [Boyd et al.2011], thereby completing the $k$-th iteration of the ADMM-based solution.
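The connectivity projection can be sketched in a few lines of NumPy; a minimal sketch, assuming the weight tensor is stored with the (height, width, kernel, filter) layout from the formulation above:

```python
import numpy as np

def project_connectivity(weights, alpha):
    """Euclidean projection onto the connectivity constraint: keep the
    alpha kernels with the largest norms and zero out the rest.
    weights has shape (P, Q, C, F); a kernel is weights[:, :, c, f]."""
    P, Q, C, F = weights.shape
    # Norm of each of the C*F kernels (columns after flattening spatial dims).
    norms = np.linalg.norm(weights.reshape(P * Q, C * F), axis=0)
    keep = np.argsort(norms)[-alpha:]   # indices of the alpha largest kernels
    mask = np.zeros(C * F)
    mask[keep] = 1.0
    return weights * mask.reshape(1, 1, C, F)
```

The pattern projection is analogous: for each kernel, zero the weights outside its assigned pattern mask while leaving the retained weights untouched.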
The hyperparameter determination process is relatively straightforward for joint pattern and connectivity pruning. There are no additional hyperparameters for pattern pruning once the pattern set has been developed. For connectivity pruning we need to determine the pruning rate $\alpha_i$ for each layer. In this paper, we adopt a heuristic of a uniform pruning rate for all layers except the first, which is smaller yet more sensitive to pruning.
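Putting the pieces together, one possible shape of the ADMM loop is sketched below, with a toy differentiable loss standing in for DNN training; the learning rate, penalty, and iteration counts are illustrative, not the paper's settings:

```python
import numpy as np

def admm_prune(loss_grad, project, W0, rho=1.0, lr=0.1,
               iters=50, gd_steps=20):
    """Sketch of the ADMM iteration: W-update by gradient descent on the
    augmented loss (subproblem 1), Z-update by Euclidean projection
    (subproblem 2), then the dual variable update."""
    W = W0.astype(float).copy()
    Z = project(W)
    U = np.zeros_like(W)
    for _ in range(iters):
        # Subproblem 1: minimize f(W) + rho/2 * ||W - Z + U||_F^2.
        for _ in range(gd_steps):
            W -= lr * (loss_grad(W) + rho * (W - Z + U))
        # Subproblem 2: Euclidean projection onto the constraint set.
        Z = project(W + U)
        # Dual update (ADMM rule).
        U = U + W - Z
    return project(W)  # final hard projection onto the constraint
```

Here `loss_grad` returns the gradient of the training loss and `project` is the Euclidean projection onto the pattern or connectivity constraint; in the actual framework, subproblem 1 is solved with ADAM over the full DNN rather than plain gradient descent.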
Appendix B Accuracy Evaluation for Different Pruning Methods
In this section, we provide comparison results on accuracy and compression rate between PCONV and several baseline works. Based on two different datasets, we divide our comparison into two parts: one for ImageNet compression results and another for CIFAR-10. In both parts, we compare PCONV with prior works in non-structured pruning and structured pruning. The comparison results show that (i) PCONV achieves higher accuracy and a higher compression rate than prior non-structured pruning, and is close to the results of ADMM-NN; (ii) compared with structured pruning, under the same compression rate, PCONV achieves higher accuracy, and can structurally prune more weights without hurting accuracy.
ImageNet Dataset
Table 3 and Table 4 illustrate the Top-5 accuracy comparison of joint pattern and connectivity pruning on VGG-16 and ResNet-50 using the ImageNet dataset. For VGG-16, all kernels are 3×3. After applying SCPs on all kernels and 3.1× uniform connectivity pruning, we achieve around 7× weight reduction in the convolution layers of VGG-16. For ResNet-50, a portion of the kernels are 1×1, besides the majority of 3×3 kernels. We apply pattern pruning on all 3×3 kernels, and apply uniform 2.7× connectivity pruning on all kernels. We achieve 3.9× weight reduction in the convolution layers.

Table 3: Top-5 accuracy and CONV-layer compression-rate comparison on ImageNet.

Network     Method                          Top-5 accuracy   Compression rate
VGG-16      Deep compression                88.7%            3.5×
VGG-16      NeST                            90.1%            6.5×
VGG-16      ADMM-NN                         88.9%            10.2×
VGG-16      Ours (pattern + connectivity)   91.5%            7.0×
ResNet-50   Fine-grained pruning            92.3%            2.6×
ResNet-50   SSS-32                          91.8%            1.4×
ResNet-50   SSS-26                          90.8%            1.6×
ResNet-50   ADMM-NN                         92.3%            7.0×
ResNet-50   Ours (pattern + connectivity)   92.6%            3.9×

Table 4: Top-5 accuracy and CONV-layer compression-rate comparison with structured pruning methods on ImageNet.

Network     Method                          Top-5 accuracy   Compression rate
VGG-16      ThiNet                          89.5%            1.1×
VGG-16      APoZ                            87.6%            2.0×
VGG-16      Ours (pattern + connectivity)   91.5%            7.0×
ResNet-50   ThiNet-50                       90.0%            2.0×
ResNet-50   ThiNet-30                       88.3%            3.3×
ResNet-50   Efficient ConvNet               91.1%            1.4×
ResNet-50   Ours (pattern + connectivity)   92.6%            3.9×
CIFAR-10 Dataset
Table 5 and Table 6 illustrate the Top-1 accuracy comparison of joint pattern and connectivity pruning on VGG-16 and ResNet-50 using the CIFAR-10 dataset. For VGG-16, all kernels are 3×3. After applying SCPs on all kernels and 8.8× uniform connectivity pruning, we achieve around 19.7× weight reduction in the convolution layers of VGG-16. For ResNet-50, a portion of the kernels are 1×1, besides the majority of 3×3 kernels. We apply pattern pruning on all 3×3 kernels, and apply uniform 8× connectivity pruning on all kernels. We achieve 11.5× weight reduction in the convolution layers.

Table 5: Top-1 accuracy and CONV-layer compression-rate comparison with non-structured pruning methods on CIFAR-10.

Network     Method                          Top-1 accuracy   Compression rate
VGG-16      Iterative Pruning               93.3%            3.5×
VGG-16      One-Shot Pruning                93.7%            5.0×
VGG-16      Ours (pattern + connectivity)   93.8%            19.7×
ResNet-50   One-Shot Pruning                93.6%            2.5×
ResNet-50   One-Shot Pruning                92.7%            3.3×
ResNet-50   Ours (pattern + connectivity)   95.6%            11.5×

Table 6: Top-1 accuracy and CONV-layer compression-rate comparison with structured pruning methods on CIFAR-10.

Network     Method                          Top-1 accuracy   Compression rate
VGG-16      2PFPCE                          92.8%            4.0×
VGG-16      Efficient ConvNet               93.4%            2.7×
VGG-16      FPGM                            93.2%            1.6×
VGG-16      Ours (pattern + connectivity)   93.8%            19.7×
ResNet-50   AMC                             93.5%            1.7×
ResNet-50   NISP                            93.2%            1.7×
ResNet-50   Ours (pattern + connectivity)   95.6%            11.5×
Appendix C Visualization Interpretation for Verification
In our main paper, by adopting the guided-backpropagation technique, we generate one set of visualization results on four images from the ImageNet dataset to demonstrate the image-enhancement ability of PCONV. In this part, we extend the visualization results by adopting two more widely used visualization methods, integrated gradients [Mukund, Ankur, and Qiqi2017] and inverted representation [Aravindh and Andrea2015]. Using three visualization techniques, we provide strong evidence that PCONV effectively captures more image details.
Integrated gradients attribute a complex DNN’s prediction to its input features. They can differentiate between artifacts that stem from perturbing the data, a misbehaving model, and a misbehaving attribution method. This distinguishes them from previous visualization methods, which are characterized by intuitive design and empirical evaluation [Mukund, Ankur, and Qiqi2017]. Hence, integrated gradients are a desirable visualization methodology for verifying the advancement of PCONV.
The integral of integrated gradients is efficiently approximated via a summation. We sum the gradients at points occurring at sufficiently small intervals along the straight-line path from the baseline $x'$ to the input $x$:
$$ \text{IntegratedGrads}_i^{\text{approx}}(x) := (x_i - x'_i) \times \sum_{k=1}^{m} \frac{\partial F\big(x' + \frac{k}{m}(x - x')\big)}{\partial x_i} \times \frac{1}{m} \qquad (14) $$
Here $m$ is the number of steps in the Riemann approximation of the integral. In practice, the recommended value of $m$ ranges from 20 to 300 [Mukund, Ankur, and Qiqi2017]. We choose $m$ in this range and compute the integrated gradients.
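The summation in (14) can be sketched directly; a minimal sketch with an analytic toy function standing in for the DNN (its gradient is supplied in closed form, since the network itself is not reproduced here):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, m=300):
    """Riemann approximation of integrated gradients: average the gradient
    of F at m points along the straight line from baseline to x, then
    scale elementwise by (x - baseline)."""
    total = np.zeros_like(x, dtype=float)
    for k in range(1, m + 1):
        point = baseline + (k / m) * (x - baseline)
        total += grad_fn(point)
    return (x - baseline) * total / m

# Toy F(x) = sum(x_i^2), so grad F = 2x; with a zero baseline the exact
# integrated gradients are x_i^2.
grad_fn = lambda p: 2.0 * p
```

A useful sanity check is the completeness axiom: the attributions should sum to $F(x) - F(x')$, so the quality of the Riemann approximation is easy to verify numerically.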
Figure 10 (b) visualizes the integrated gradients of the original VGG-16 model and the pattern-pruned VGG-16 model. The comparison shows that the pattern-pruned VGG-16 model learns more comprehensive information, according to the visualization of integrated gradients.
Inverted representation originates from the following question: given an encoding of an image, to what extent is it possible to reconstruct the image itself from the DNN? [Aravindh and Andrea2015] has shown that several layers in a DNN retain photographically accurate information about the image, with different degrees of geometric and photometric invariance. Hence, we utilize inverted representation to interpret the difference between the original VGG-16 model and the pattern-pruned VGG-16 model.
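The inversion can be sketched as a pre-image optimization; below, a toy linear feature map stands in for a DNN layer (`phi`, `phi_grad_T`, and the regularization weight are illustrative assumptions, not the setup of [Aravindh and Andrea2015]):

```python
import numpy as np

def invert_representation(phi, phi_grad_T, target_code, x0,
                          lam=1e-3, lr=0.1, steps=500):
    """Reconstruct a pre-image x whose representation matches target_code:
    gradient descent on ||phi(x) - target_code||^2 / 2 + lam/2 * ||x||^2."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        residual = phi(x) - target_code
        # phi_grad_T(x, r) returns J_phi(x)^T @ r, the backpropagated residual.
        x -= lr * (phi_grad_T(x, residual) + lam * x)
    return x
```

For a real DNN layer, `phi_grad_T` is simply backpropagation through the layer, and the regularizer is typically an image prior (e.g., total variation) rather than plain weight decay.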
The visualization results of inverted representation are demonstrated in Figure 10 (c). We can clearly see that the pattern-pruned VGG-16 model retains more photographically accurate information.
After visualizing the original and pattern-pruned DNN models with the different visualization methods, we conclude that by applying our designed SCPs, pattern pruning enhances DNNs’ image-processing ability.