I Introduction
Most recent breakthroughs in artificial intelligence rely on deep neural networks (DNNs) as the fundamental building blocks, for tasks such as image classification [1, 41, 42, 43, 20, 23] and object detection [12, 15, 39, 33, 38]. With the emergence of high-end mobile devices in recent years, there is an urgent need to migrate deep learning applications from cloud servers and desktops to these edge devices because of their cost advantages. However, this is challenging due to the high computation intensity of deep learning models and the limited computing power of these devices [52, 51, 47, 54]. In this sense, designing CNN models under a specific latency budget is essential for their deployment on resource-constrained mobile devices.

There is a considerable body of work on compression and acceleration of deep neural networks to overcome these challenges, such as network pruning [18, 19, 22, 16, 3], quantization [50, 6, 25, 29], low-rank approximation [8, 26, 27], and efficient model design [23, 44, 40, 24, 4, 17, 45]. Among these, pruning [18, 31, 16] has been a predominant approach for accelerating deep neural networks. Early endeavours in network pruning often aimed at reducing the model size (e.g., the number of parameters) or the number of Floating Point OPerations (FLOPs) of networks. However, it has recently been recognized that reducing the number of nonzero parameters or arithmetic operations does not necessarily lead to acceleration [53, 54], which is one of the main concerns for model deployment on resource-constrained devices. Resource-constrained compression, which aims to directly reduce the latency [22, 55, 3] or energy consumption [53, 51] of networks, has therefore emerged and soon drawn great attention.
While existing methods achieve a good trade-off between accuracy and latency/energy, there is still room to further push the frontier of resource-constrained compression in two respects.

First, modern mobile computation architectures should be taken advantage of. Existing pruning patterns are mostly not designed for mobile devices: pruning is conducted either channel-wise [53, 51, 55, 3], which is too inflexible to attain a high compression rate, or randomly [18, 19], which is inconvenient for acceleration because of irregular memory access. It is necessary to take the computing architecture into account and design specialized pruning patterns to further push the frontier between network accuracy and latency.

Second, solving the constrained problem requires efficient and accurate estimation of latency/energy. Latency/energy models in previous works [37, 53, 52] are tied to specific hardware platforms, and the estimation requires deep knowledge of the hardware. Other platform-independent methods approximate the latency/energy with a lookup table [54, 47] or an estimation model [51, 3]. However, constructing the lookup table or training the estimation model requires a large number of sparsity-latency/energy pairs, which are laborious to collect.
In this paper, we propose Architecture-aware Latency Constrained Sparse neural networks (ALCS) towards a better Pareto frontier between network accuracy and latency. Specifically, considering that most modern mobile devices utilize the Single Instruction Multiple Data (SIMD) technique to improve computation capacity, we propose SIMD-structured pruning, which groups the parameters according to the bit-length of SIMD registers. Parameters are then pruned in a group-wise manner. Along with it, we also propose an efficient computation algorithm for accelerating SIMD-structured sparse neural networks. Our method does not suffer from the strong structure constraint of channel pruning and is therefore able to achieve a relatively high compression/acceleration ratio on mobile devices.
For efficient latency estimation, we approximate the latency with piecewise linear interpolation. Constructing the latency estimator does not require any architecture-specific knowledge. Compared to other platform-independent estimation methods [54, 51, 3], which require tens of thousands of sparsity-latency pairs, our piecewise linear interpolation estimator is much easier to establish: only a small set of sparsity-latency data pairs (11 in our experiments) is required.
The whole latency-constrained sparsification task is formulated as a constrained optimization problem, which can be efficiently solved with the Alternating Direction Method of Multipliers (ADMM). Extensive experiments show that ALCS achieves a better Pareto frontier between network accuracy and latency, as shown in Figure 1. With ALCS, the execution time of MobileNet is reduced without accuracy drop, and the latency of ResNet50 is reduced even with an accuracy improvement.
In summary, our contributions are threefold:

We propose ALCS, an end-to-end system-algorithm co-design framework which utilizes SIMD-structured pruning to exploit modern mobile computation architectures for agile and accurate models.

We propose an efficient piecewise linear interpolation method to estimate network inference latency, which is sample-efficient and accurate.

Extensive experiments and ablation studies demonstrate the advantages of architecture-aware pruning, as well as the superiority of our proposed method over a set of competitive compression and acceleration methods.
II Related Work
Network Pruning. Network pruning is a key technique for the compression and acceleration of neural networks. Pioneering approaches prune weights randomly, meaning that each individual element of the parameters can be removed or retained without any constraint. This category of pruning method dates back to Optimal Brain Damage (OBD) [28], which prunes weights based on the Hessian matrix of the loss function; the Hessian is difficult to obtain when the number of parameters becomes large. More recently, Han et al. presented the 'Deep Compression' pipeline [18], which prunes parameters with relatively small magnitude. Ding et al. [10] utilize the momentum term of the SGD step to force parameters to converge to zero. Besides, there are many other works focusing on training a pruned network from scratch [34, 2, 9, 35, 32]. These methods can remove a large portion of the parameters with negligible accuracy loss, but they are not convenient for inference acceleration because of their irregular memory access [49].

The limitations of random weight pruning described above motivate recent works [49, 21, 16, 31, 30, 46] to focus more on channel pruning, which prunes the parameters in a channel-wise manner. Channel pruning is able to accelerate the computation of networks, but it requires pruning a whole channel simultaneously, which is too inflexible to achieve high compression and acceleration ratios. Moreover, these methods often aim to reduce the model size of networks, while it is now well acknowledged that network latency, one of the main concerns when deploying CNNs on resource-constrained mobile devices, does not decrease monotonically with model size [54].
Resource Constrained Compression. Recognizing that model size is not a sufficient surrogate for network latency/energy consumption, researchers have recently started investigating resource-constrained compression, which compresses a network to meet some budget (e.g., latency or energy). Given explicit resource constraints, these methods search for the optimal network structure with reinforcement learning [22], greedy search [53, 54, 55], Bayesian optimization [5], or dynamic programming [3], or optimize the network structure and the weight values simultaneously with optimization algorithms [52, 51, 36]. Compared to previous works, our work further takes into account the computing architecture of mobile devices and proposes mobile-oriented SIMD-structured pruning for CNNs. Moreover, we employ linear interpolation to estimate network latency, which is efficient and accurate and needs neither deep architecture knowledge nor a large number of collected sparsity-latency data pairs.

Efficient Sparse Computation. Recently, [11] proposed an efficient computing algorithm for sparse matrix multiplication on mobile devices. The common ground between our works is that channel pruning is not necessary for network acceleration on mobile devices. Our method differs from theirs in that it supports not only matrix multiplication but also general convolution, and we further argue that SIMD-structured pruning is necessary to achieve a better trade-off between network accuracy and latency.
III Methodology
III-A Problem formulation
Our goal is to accelerate a network to meet a given latency budget while minimizing the target loss:

$$\min_{\mathcal{W}} \ \mathcal{L}(\mathcal{W}) \quad \mathrm{s.t.} \quad T(\mathcal{W}) \le T_{bud} \qquad (1)$$

where $\mathcal{W} = \{W_1, \ldots, W_L\}$ denotes the set of parameters of each layer, $\mathcal{L}$ is the task-specified loss function, for example the cross-entropy loss for classification, and $T(\mathcal{W})$ and $T_{bud}$ denote the latency of the network and the target latency budget, respectively. Three important challenges obstruct solving the above problem: 1) how to utilize modern computation architectures to get a higher compression and acceleration rate on mobile devices, 2) how to efficiently estimate the latency of the network, and 3) how to solve the constrained optimization problem. In this work, we propose SIMD-structured pruning along with an efficient SIMD-structured sparse convolution algorithm for CNN acceleration. The network latency is estimated with piecewise linear interpolation, and the constrained problem is finally solved with ADMM. We introduce more details in the following sections.
III-B SIMD-structured pruning for fast inference
It is necessary to take the computing architecture of the target platform into account for fast inference of CNNs. To this end, considering that most mobile CPUs utilize the Single Instruction Multiple Data (SIMD) technique to improve computation efficiency, we propose SIMD-structured pruning along with an efficient SIMD-structured sparse convolution algorithm for CNN acceleration. We describe them in detail in the following sections.
III-B1 SIMD-structured pruning
In this section, we introduce the proposed SIMD-structured pruning. Before going into details, it is worthwhile to give a brief introduction to Single Instruction Multiple Data (SIMD). As a data-level parallelism scheme, SIMD is widely used in modern mobile CPU architectures. It allows a CPU to operate on a set of data items at the same time with a single instruction: a vector of data can be loaded into vector registers and processed simultaneously by one instruction.
We start with the grouping of parameters. Consider a convolutional layer with filters $\mathcal{W} \in \mathbb{R}^{n \times c \times h \times w}$, where $n$, $c$ and $h \times w$ denote the number of output channels, the number of input channels and the kernel size, respectively. The elements are first grouped along the output channel dimension. The size of each group depends on the length of the SIMD vector registers. For example, on the widely used ARM v7/v8 architectures, the length of each vector register for SIMD instructions is 128 bits, so for 32-bit single-precision parameters the group size should be 4. In other words, parameters at the same location of each 4 adjacent channels are grouped together. The parameters are then pruned in a group-wise manner. The right of Figure 3(a) shows a simple example of the proposed SIMD-structured pruning with a group size of 2. Note that the only constraint of SIMD-structured pruning is that the locations of zeros within each group of filters should be the same; the locations of zeros across different groups of filters can be irregular.
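To make the grouping concrete, the following is a minimal Python sketch of group-wise magnitude pruning. The function name, the nested-list weight layout, and the use of the group L2 norm as the pruning criterion are illustrative assumptions, not the paper's exact implementation:

```python
import math

def simd_structured_prune(weights, group_size=4, sparsity=0.5):
    """Group-wise pruning sketch: weights[o][k][u][v] is a 4-D kernel
    (output channels x input channels x height x width).  Parameters at
    the same (k, u, v) location of group_size adjacent output channels
    form one group; the groups with the smallest L2 norms are zeroed,
    so zeros share locations within a group but not across groups."""
    n = len(weights)
    c, h, w = len(weights[0]), len(weights[0][0]), len(weights[0][0][0])
    assert n % group_size == 0

    # Collect the L2 norm of every group of group_size aligned weights.
    groups = []
    for go in range(0, n, group_size):
        for k in range(c):
            for u in range(h):
                for v in range(w):
                    norm = math.sqrt(sum(
                        weights[go + i][k][u][v] ** 2
                        for i in range(group_size)))
                    groups.append((norm, go, k, u, v))

    # Zero the fraction `sparsity` of groups with the smallest norms.
    groups.sort(key=lambda t: t[0])
    for _, go, k, u, v in groups[:int(len(groups) * sparsity)]:
        for i in range(group_size):
            weights[go + i][k][u][v] = 0.0
    return weights
```

With a group size of 4, each surviving group maps directly onto one 128-bit vector register at inference time.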
III-B2 SIMD-structured sparse convolution
Having introduced SIMD-structured pruning for deep neural networks, in this section we describe an efficient algorithm for the computation of sparse convolutions.
OVERVIEW. We show in Figure 4 an overview of the proposed computation algorithm. In this example, the group size of SIMD-structured pruning is 2. We denote the input and output of the convolution as $X \in \mathbb{R}^{c \times H \times W}$ and $Y \in \mathbb{R}^{n \times H' \times W'}$, respectively. The output element at spatial location $(i, j)$ of the $o$-th channel can be computed by

$$Y_{o,i,j} = \sum_{k=1}^{c} \sum_{u=1}^{h} \sum_{v=1}^{w} \mathcal{W}_{o,k,u,v} \, X_{k,\,i+u-1,\,j+v-1},$$

which can be treated as the inner product of the stretched $o$-th kernel of $\mathcal{W}$ and a subtensor of $X$ related to the spatial location $(i, j)$. For instance, the element at the top-left corner of the first channel of $Y$ can be computed by the inner product of the stretched first filter of $\mathcal{W}$ and the subtensor colored in orange at the top-left corner of $X$. It is easy to see that the output values related to multiple output channels and spatial locations can be computed collectively. For instance, the elements related to the first 2 channels and the first 2 spatial locations of the output can be computed as follows: first, flatten and stack the first 2 filters of $\mathcal{W}$ into rows of a matrix, say $A$; then vectorize and stack the subtensors of $X$ related to the first 2 output spatial locations (e.g., the subtensors colored in orange and yellow, respectively, located at the top-left corner of $X$) into columns of a matrix, say $B$; finally, the 4 output elements can be computed by the multiplication of $A$ and $B$. When $A$ is SIMD-structured sparse, this data parallelism can be easily achieved with the help of SIMD instructions. Before going into more details, we first introduce the storage format of the SIMD-structured sparse tensor.

STORAGE FORMAT OF SIMD-STRUCTURED SPARSE TENSORS. As shown in the middle of Figure 4, a group of filters is stored in memory in a grouped version of the Compressed Sparse Row (CSR) format, which consists of the number of nonzero columns (the orange value), the column offsets (the gray elements), and the nonzero values (the light blue elements). For instance, we can see in the middle of Figure 4 that there are 3 nonzero columns in the original kernel data, so the orange value is 3. For the first nonzero column, there is 1 column before it in the original kernel data, so its column offset is 1. For the second nonzero column, there are 3 columns before it in the original kernel data, so its column offset is 3, and so on.
Note that after training, the values of parameters are fixed during inference, so this reorganization of parameters can be done in advance without any extra time overhead.
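The encoding step can be sketched in Python as follows; this is an illustrative rendering of the grouped-CSR layout described above (the function name and list-based layout are assumptions), producing the nonzero-column count, the column offsets, and the packed values:

```python
def encode_grouped_csr(group_rows):
    """Encode one group of flattened filters that share a sparsity
    pattern into the grouped-CSR-like layout described above: the
    number of nonzero columns, the column offsets (the number of dense
    columns preceding each kept column), and the packed nonzero values
    (one value per filter in the group, contiguous for vector loads)."""
    num_cols = len(group_rows[0])
    offsets, values = [], []
    for col in range(num_cols):
        column = [row[col] for row in group_rows]
        if any(v != 0.0 for v in column):
            offsets.append(col)      # position in the dense kernel
            values.append(column)    # one value per row of the group
    return len(offsets), offsets, values
```

Since the packed values of one column are contiguous, a single vector load can fetch the whole group at inference time.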
EFFICIENT MULTIPLICATION COMPUTATION. As described above, the core computation component is the multiplication between the SIMD-structured sparse kernel matrix and the dense input matrix. In this section we describe how this multiplication can be computed efficiently when the kernel is stored in the format described in the previous section; see the middle of Figure 4 for a simple example. We first allocate several SIMD registers and initialize them to zero for the storage of intermediate results. Then we load from memory the number of nonzero columns of the kernel data, which determines the number of iterations. In each iteration, we load a column of nonzero kernel data into a SIMD register, and then load the input data into other SIMD registers according to the corresponding column offset of the nonzero kernel data. After that, the loaded kernel and input data are multiplied and accumulated into the intermediate results simultaneously with SIMD instructions.
In practice, this procedure is implemented in highly optimized assembly. The group size and the number of collectively computed output elements are determined by the bit-length and the number of SIMD registers.
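Scalarized for clarity, the iteration structure of this multiplication might look as follows, with each inner loop over the group standing in for one SIMD multiply-accumulate (names and layout are illustrative; the real implementation is vectorized assembly):

```python
def grouped_sparse_matmul(nnz, offsets, values, dense_cols):
    """Multiply a grouped-sparse kernel (nnz nonzero columns, their
    offsets in the dense kernel, and packed values with one entry per
    filter in the group) with im2col-style dense input columns.
    Returns out[i][j] = <filter i of the group, input column j>."""
    group_size = len(values[0]) if nnz else 0
    out = [[0.0] * len(dense_cols) for _ in range(group_size)]
    for it in range(nnz):              # one iteration per nonzero column
        off, kcol = offsets[it], values[it]
        for j, col in enumerate(dense_cols):
            x = col[off]               # gather input via the column offset
            for i in range(group_size):
                # multiply-accumulate; vectorized over i by SIMD in practice
                out[i][j] += kcol[i] * x
    return out
```

Note that the loop count depends only on the number of nonzero columns, so the work scales with the density of the kernel rather than its dense size.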
III-C Latency estimation model
Having introduced the method for CNN acceleration, the next problem is how to efficiently and accurately estimate the network latency given the network parameters.
The latency of the whole network can be written as the sum of the latencies of its layers:

$$T = \sum_{l=1}^{L} T_l + T_{other} \qquad (2)$$

where $T_l$ is the latency of the $l$-th CONV/FC layer and $T_{other}$ denotes the latency of the other layers (e.g., ReLU layers, pooling layers, etc.), which can be regarded as a constant. In this work, the network is accelerated with sparsification, so the latency of the $l$-th layer can be formulated as a function of the number of nonzero weights of its kernel tensor $W_l$:

$$T_l = t_l(\|W_l\|_0) \qquad (3)$$
Note that the above equation does not mean that the model size is used as a proxy for latency: we model the latency layer-wise, and the number of nonzero parameters in different layers may influence the latency of the whole model differently.
We propose to approximate $t_l(\cdot)$ with linear interpolation. This is based on the observation that the latency of each layer is locally linear with respect to the density of its parameters, as shown in Figure 5. Taking the $l$-th layer as an example, we measure the run time of the layer on the device when the number of nonzero parameters is $0, 0.1N_l, 0.2N_l, \ldots, N_l$, respectively, where $N_l$ is the number of elements of $W_l$. For a given tensor $W_l$ with $\|W_l\|_0$ nonzero parameters, the run time can then be approximated by linear interpolation:

$$t_l(\|W_l\|_0) \approx \hat{t}_l^{(0)} + \sum_{k=0}^{9} s_k \, v_{k+1} \min\!\left( \|W_l\|_0 - 0.1kN_l,\; 0.1N_l \right) \qquad (4)$$

where $s_k$ is a variable which indicates whether the number of nonzero parameters is larger than $0.1kN_l$:

$$s_k = \begin{cases} 1, & \|W_l\|_0 > 0.1kN_l \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

and $\hat{t}_l^{(k)}$ is the run time of the layer when the number of nonzero parameters is $0.1kN_l$. $v_{k+1}$ is the ascending speed of the latency when the number of nonzero parameters increases from $0.1kN_l$ to $0.1(k+1)N_l$:

$$v_{k+1} = \frac{\hat{t}_l^{(k+1)} - \hat{t}_l^{(k)}}{0.1N_l} \qquad (6)$$
In practice, we set the measurement points to multiples of $0.1N_l$, so that only 11 sparsity-latency data pairs are required per layer to approximate the network latency. We find that this linear interpolation approximation is rather accurate, as shown in Figure 6. Moreover, no platform knowledge is needed, because we approximate the network latency with direct measurements and treat the hardware as a black box. In contrast, previous works rely on either deep architecture knowledge [53, 52] or a large collection (usually over 10000) of sparsity-latency/energy data pairs for the construction of a lookup table [54] or the training of an estimation model [51].
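A sketch of the per-layer estimator, assuming 11 measured latencies at densities 0, 0.1, ..., 1.0 (function names are illustrative):

```python
import bisect

def make_latency_estimator(measured, total):
    """Piecewise linear latency model for one layer: measured[k] is the
    run time observed with k*total/10 nonzero parameters (k = 0..10)."""
    step = total / 10                       # width of each linear segment
    knots = [k * step for k in range(11)]   # measured sparsity levels

    def estimate(nnz):
        if nnz >= total:
            return measured[-1]
        k = bisect.bisect_right(knots, nnz) - 1   # segment containing nnz
        slope = (measured[k + 1] - measured[k]) / step
        return measured[k] + slope * (nnz - knots[k])

    return estimate
```

Summing one such estimator per CONV/FC layer, plus a constant for the remaining layers, yields the whole-network estimate of Equation (2).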
III-D The optimization algorithm
Now that we can efficiently approximate the latency of a CNN model given its parameters, we are ready to solve the constrained optimization problem in Equation (1). Many optimization algorithms can be applied to problem (1), such as the Alternating Direction Method of Multipliers (ADMM) and Projected Gradient Descent (PGD). In this paper, we apply ADMM, which has recently been shown to be sufficient for solving nonconvex, nonsmooth problems [48]. One may choose other optimization algorithms for problem (1), and the proposed SIMD-structured pruning and interpolation-based resource estimation are also applicable as plug-ins for other network compression methods to further improve their latency-accuracy trade-off, but this is out of the scope of this paper. The original problem can be reformulated as:
$$\min_{\mathcal{W}, Z} \ \mathcal{L}(\mathcal{W}) + g(Z) \quad \mathrm{s.t.} \quad \mathcal{W} = Z \qquad (7)$$

where $g(\cdot)$ is the indicator function of the constraint set $S = \{Z : T(Z) \le T_{bud}\}$, i.e., $g(Z) = 0$ if $Z \in S$ and $g(Z) = +\infty$ otherwise, and $T$ is defined by equations (2)-(6). By applying the augmented Lagrangian (in scaled form), the above problem is equivalent to:

$$\min_{\mathcal{W}, Z} \max_{U} \ \mathcal{L}(\mathcal{W}) + g(Z) + \frac{\rho}{2} \|\mathcal{W} - Z + U\|_2^2 - \frac{\rho}{2} \|U\|_2^2 \qquad (8)$$

where $\rho$ is a hyperparameter. The main idea of ADMM is to update the original parameters $\mathcal{W}$, the auxiliary variable $Z$ and the dual variable $U$ in an alternating manner:

$$\mathcal{W}^{t+1} = \arg\min_{\mathcal{W}} \ \mathcal{L}(\mathcal{W}) + \frac{\rho}{2} \|\mathcal{W} - Z^{t} + U^{t}\|_2^2 \qquad \mathrm{(9a)}$$

$$Z^{t+1} = \arg\min_{Z} \ g(Z) + \frac{\rho}{2} \|\mathcal{W}^{t+1} - Z + U^{t}\|_2^2 \qquad \mathrm{(9b)}$$

$$U^{t+1} = U^{t} + \mathcal{W}^{t+1} - Z^{t+1} \qquad \mathrm{(9c)}$$
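A minimal scalar sketch of one iteration under updates (9a)-(9c), with a single gradient step standing in for the one-epoch SGD solve of (9a) and grouping ignored for brevity (all names and the single-layer simplification are illustrative assumptions, not the paper's implementation):

```python
def project_to_budget(z, latency_of, budget):
    """Sketch of the Z-update (9b) for a single layer: keep the k
    largest-magnitude entries of z, where k is the largest count whose
    estimated latency latency_of(k) still meets the budget.  Since
    latency_of is nondecreasing in k, the largest feasible k is found
    by bisection instead of adding entries one by one."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    lo, hi = 0, len(z)
    while lo < hi:                        # bisect on the number kept
        mid = (lo + hi + 1) // 2
        if latency_of(mid) <= budget:
            lo = mid
        else:
            hi = mid - 1
    keep = set(order[:lo])
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]

def admm_step(w, z, u, grad_loss, latency_of, budget, rho=0.01, lr=0.001):
    """One scaled-form ADMM iteration on flat lists of floats."""
    # (9a): gradient step on L(w) + rho/2 * ||w - z + u||^2
    w = [wi - lr * (g + rho * (wi - zi + ui))
         for wi, zi, ui, g in zip(w, z, u, grad_loss(w))]
    # (9b): z = projection of (w + u) onto {z : T(z) <= budget}
    z = project_to_budget([wi + ui for wi, ui in zip(w, u)],
                          latency_of, budget)
    # (9c): dual ascent u = u + w - z
    u = [ui + wi - zi for wi, zi, ui in zip(w, z, u)]
    return w, z, u
```

The penalty term pulls $\mathcal{W}$ towards the sparse iterate $Z$, so after enough iterations the dense parameters nearly satisfy the latency constraint themselves.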
The updates of the original parameters $\mathcal{W}$ and the dual variable $U$ are relatively straightforward. For the update of $\mathcal{W}$, we apply SGD on the training dataset for one epoch. The main difficulty lies in the update of the auxiliary variable $Z$, which is the projection of $\mathcal{W} + U$ onto the constraint set. We solve this problem with a greedy algorithm: we first sort the groups of elements by their norms, and pick them one by one until the latency reaches the target budget. A direct implementation of this algorithm is inefficient, as it may need a large number of iterations; however, it can be implemented efficiently with a bisection method, as shown in Algorithm 1. After the ADMM optimization finishes, we set $\mathcal{W} = Z$ and finetune the generated model on the training set for a few epochs. We summarize the final optimization algorithm in Algorithm 2; more details are given in Section IV-A.

IV Experimental Results
IV-A Experimental setup
We evaluate our method on compact models such as MobileNet [23] as well as relatively heavy networks like ResNet18 and ResNet50 [20] for 1000-class image classification on ImageNet [7]. We do not conduct experiments on CIFAR because it is more practical and challenging to accelerate CNN models on large-scale vision tasks. We use the standard data preprocessing pipeline provided by the official PyTorch examples. The batch size is set to 256, and the group size for SIMD-structured pruning is set to 4 to match the bit-length of SIMD registers¹. The hyperparameter $\rho$ is set to 0.01 for all the experiments. In each ADMM iteration, for the update of the original parameters as in Equation (9a), we apply the momentum SGD optimizer for 1 epoch with the learning rate fixed to 0.001 and weight decay set to for ResNet and for MobileNet. We apply 100 ADMM iterations for MobileNet and ResNet18, and 60 ADMM iterations for ResNet50. After the ADMM iterations, the generated compressed model is finetuned for 60 epochs with the learning rate annealed from to with a cosine learning rate scheduler. The weight decay is set to . During this procedure, only nonzero parameters are updated.

The latency of all dense models (including the models compressed with channel pruning methods) is measured with TensorFlow Lite [13], one of the most popular mobile-oriented inference frameworks for DNNs, and the latency of all SIMD-structured sparse models is measured with our proposed SIMD-structured sparse convolution algorithm, which is implemented in C++ with SIMD instructions. The latency averaged over 50 runs on a single ARM Cortex-A72 CPU is reported.

¹In most mobile devices, the length of each vector register for SIMD instructions is 128 bits, so for SIMD-structured pruning of 32-bit single-precision parameters, the group size should be 4.

IV-B Ablation study
Precision of latency estimation: We first study the precision of the proposed latency estimation with linear interpolation. To this end, we uniformly sample 100 ResNet18 models with different sparsity and plot the real and estimated latencies in Figure 6. From the figure we can see that the proposed linear interpolation approximation of the latency is rather accurate.
Influence of $\rho$:

TABLE I: Influence of the hyperparameter $\rho$ when compressing MobileNet to 185M FLOPs and 62ms latency (top-1 accuracy for each $\rho$).
To study the influence of the hyperparameter $\rho$ on our algorithm, we compress MobileNet on ImageNet with the target latency set to 62ms. Results are shown in Figure 7 and Table I. From Figure 7(a) we can see that the loss converges to a lower value with a small $\rho$, since with a smaller $\rho$ the algorithm focuses more on optimizing the original parameters $\mathcal{W}$. However, we can further see from Figure 7(b) that if $\rho$ is too small, it is not sufficient to constrain the sparse structure of the original parameters: with the smallest $\rho$, the difference between $\mathcal{W}$ and $Z$ is rather large even after 100 ADMM iterations, which means that $\mathcal{W}$ fails to converge to a sparse solution during the ADMM optimization. Therefore, during finetuning, the smallest $\rho$ does not lead to the lowest loss (see Figure 7(c)). From Table I, we can see that $\rho = 0.01$ achieves slightly better accuracy than the other two cases, so we set $\rho = 0.01$ in all the following experiments.
Impact of different components:

TABLE II: Impact of different components of ALCS when compressing MobileNet on ImageNet.

  Method  FLOPs   Latency  ADMM  FT  Acc@1
  WP      71.1M   61.55ms  ✓     ✓   68.36%
  FP      186M    64ms     ✓     ✓   67.76%
  SIMD    185M    62ms     ✓     ✓   70.53%
  SIMD    185M    62ms     ✓         70.08%
  SIMD    185M    62ms           ✓   69.96%
In this section, we study the impact of different components of ALCS, namely the SIMD-structured pruning, the ADMM optimization, and the post-finetuning. To this end, we compare several variants of ALCS for compressing MobileNet on ImageNet. Results are summarized in Table II, Figure 8 and Figure 9. Here WP denotes weight pruning, in which each individual element of the parameters can be removed or retained without any constraint. To measure the latency of models compressed by weight pruning, we use the code of XNNPACK [14], the state-of-the-art implementation of sparse matrix multiplication [11]. Note that this implementation supports only matrix multiplication, which is equivalent to a convolution layer with a kernel size of $1 \times 1$, so we do not prune the first convolution layer in this variant, following the same setting as [11]. FP denotes filter pruning, which prunes parameters in a channel-wise manner. SIMD denotes the proposed SIMD-structured pruning. ADMM and FT denote the ADMM optimization and the post-finetuning process, respectively. For a fair comparison, we set the latency budgets to be the same and train all variants for an equal number of total epochs. Specifically, for the ADMM+FT variants, we apply the same hyperparameters as described in Section IV-A. For the ADMM-only variants, we apply 160 ADMM iterations, and for the FT-only variants, we prune the model with the norm-based method and finetune for 160 epochs with the learning rate fixed to 0.001 for the first 100 epochs and annealed from 0.001 to 0 with a cosine schedule for the last 60 epochs.
As shown in Table II, the proposed training pipeline outperforms all the other variants. By comparing the first three variants, we can conclude that SIMD-structured pruning achieves a better trade-off between network accuracy and latency than weight pruning and filter pruning: the accuracy of ALCS with SIMD-structured pruning is higher than that of weight pruning and filter pruning under similar latency budgets. This is mainly because: (1) compared to weight pruning, SIMD-structured pruning is more friendly to the SIMD architecture of mobile devices, and thus achieves similar latency at a higher density, which helps retain network accuracy; (2) compared to filter pruning, SIMD-structured pruning does not suffer from such a strong constraint on the data structure, which improves flexibility and reduces accuracy loss.
By comparing the last three variants, we can see that both the ADMM optimization and the post-finetuning are necessary for improving network accuracy. In particular, applying only the ADMM optimization leads to a lower final accuracy than applying both the ADMM optimization and the post-finetuning, and there is also an accuracy gap without the ADMM optimization. In Figure 9 we further draw the training curves of these three variants; there is a sharp decline in validation loss when the post-finetuning begins. We conjecture that the ADMM optimization helps find a better initialization for the post-finetuning process.
IV-C Comparison with state-of-the-art methods
TABLE III: Comparison with state-of-the-art compression/acceleration methods on ImageNet.

  Model      Method       FLOPs  Latency  Acc@1
  ResNet18   baseline     1.8G   537ms    69.76%
             DMCP         1.04G  341ms    69.70%
             ALCS (ours)  548M   200ms    69.88%
  ResNet50   baseline     4.1G   1053ms   76.15%
             DMCP         2.2G   659ms    76.2%
             DMCP         1.1G   371ms    74.1%
             HRank        2.26G  695ms    75.56%
             AutoSlim     3.0G   792ms    76.0%
             AutoSlim     2.0G   609ms    75.6%
             AutoSlim     1.0G   312ms    74.0%
             ALCS (ours)  2.2G   630ms    76.26%
             ALCS (ours)  985M   370ms    75.05%
  MobileNet  Uniform      569M   167ms    71.8%
             Uniform      325M   102ms    68.4%
             Uniform      150M   53ms     64.4%
             AMC          285M   94ms     70.7%
             Fast         71.1M  61ms     68.4%
             AutoSlim     325M   99ms     71.5%
             AutoSlim     150M   55ms     67.9%
             USNet        325M   102ms    69.5%
             USNet        150M   53ms     64.2%
             ALCS (ours)  283M   82ms     72.04%
             ALCS (ours)  185M   62ms     70.53%
             ALCS (ours)  140M   52ms     69.16%
             ALCS (ours)  91M    40ms     65.96%
To further demonstrate the efficacy of our method, in this section we compare ALCS with various state-of-the-art network compression/acceleration methods. All experiments are conducted with ResNet18, ResNet50 and MobileNet on ImageNet. For a fair comparison, we set the latency budget to be the same for all approaches. The results are given in Table III and Figure 10.
From Table III, we see that ALCS achieves a better trade-off between network accuracy and latency than all the other methods. For example, our method accelerates the inference of ResNet18 without any accuracy loss; compared to DMCP [16], ALCS is faster with better accuracy. As for ResNet50, our method is faster with better accuracy, and can be made even faster with only a small accuracy drop relative to the original model. On MobileNet, our method also achieves higher accuracy than the other methods under similar or smaller latency budgets. For instance, under an 82ms latency budget, ALCS achieves 72.04% top-1 accuracy, which is higher than AutoSlim at 99ms and AMC at 94ms. Compared to the original model, ALCS is faster without any accuracy loss. The same trend holds under other latency budgets. Overall, the advantage of ALCS is more obvious on compact models under low latency budgets, which implies that specialized pruning structures are most necessary for accelerating compact models under tight latency budgets.
From Table III and Figure 10, we see that ALCS does not achieve a better FLOPs-accuracy trade-off than Fast [11]. This is because Fast accelerates networks with random weight pruning, in which each individual element of the parameters can be pruned without constraint, whereas ALCS uses the proposed SIMD-structured pruning, in which a group of parameters (4 in our experiments) must be pruned or retained simultaneously; thus Fast can achieve higher accuracy than ALCS under the same FLOPs. However, the goal of this paper is not to reduce the model size or the number of arithmetic operations but to accelerate the true inference speed, because when deploying deep models in practical applications it is usually the true run time, rather than the FLOPs, that matters. Compared to random weight pruning, the proposed SIMD-structured pruning fully utilizes the SIMD architecture of the target platform, which helps achieve high computation efficiency. Thus, to reach a given latency budget, more parameters must be pruned with random weight pruning: for example, to reduce the latency of MobileNet to about 61ms, Fast needs to prune down to 71.1M FLOPs, while ALCS reaches a similar latency at 185M FLOPs, so far fewer parameters need to be pruned, which helps preserve model accuracy. As a result, ALCS achieves a better trade-off between accuracy and latency than Fast, as shown in Table III and Figure 10.
V Conclusion
In this paper, we propose ALCS (Architecture-aware Latency Constrained Sparse neural networks) for model acceleration on mobile devices. Considering that most modern mobile devices utilize the Single Instruction Multiple Data (SIMD) technique to improve computation capacity, we propose a novel SIMD-structured pruning method along with an efficient SIMD-structured sparse convolution algorithm for the acceleration of sparse models. Moreover, we propose to estimate the latency of compressed models with piecewise linear interpolation, which is accurate and efficient and, in contrast to existing budget approximation methods, does not need a large collection of architecture-latency data pairs. The whole latency-constrained problem is finally solved with ADMM. Extensive experimental results on various network architectures indicate that ALCS achieves a better latency-accuracy trade-off thanks to the proposed SIMD-structured pruning and the efficient SIMD-structured sparse convolution algorithm.
The main purpose of this paper is to investigate the design space between traditional random weight pruning and structured filter-level pruning. The results show that it is possible to further push the latency-accuracy frontier with the help of SIMD instructions in modern CPUs. One limitation of SIMD-structured pruning is that it is not applicable to GPUs, whose computing architectures are very different; extending the idea to GPUs is an interesting direction for future work.
References

[1]
(2012)
ImageNet classification with deep convolutional neural networks
. In Proceedings of the 26th International Conference on Neural Information Processing Systems, (NIPS), Cited by: §I.  [2] (2018) Deep rewiring: training very sparse deep networks. In International Conference on Learning Representation, (ICLR), Cited by: §II.

 [3] (2020) AOWS: adaptive and optimal network width search with latency constraints. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §I, §II.
 [4] (2020) Once-for-all: train one network and specialize it for efficient deployment. In International Conference on Learning Representations, (ICLR), Cited by: §I.
 [5] (2018) Constraint-aware deep neural network compression. In European Conference on Computer Vision, (ECCV), Cited by: §II.
 [6] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
 [7] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §IV-A.
 [8] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
 [9] (2019) Sparse networks from scratch: faster training without losing performance. CoRR abs/1907.04840. Cited by: §II.
 [10] (2019) Global sparse momentum SGD for pruning very deep neural networks. In Conference on Neural Information Processing Systems, (NeurIPS), Cited by: §II.
 [11] (2020) Fast sparse convnets. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: Fig. 1, §II, §IV-B, §IV-C, TABLE III.
 [12] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [13] (2020) Machine learning for mobile devices: TensorFlow Lite. Note: https://www.tensorflow.org/lite Cited by: §IV-A, TABLE III.
 [14] (2020) XNNPACK. Note: https://github.com/google/XNNPACK Cited by: §IV-B, TABLE III.
 [15] (2015) Fast R-CNN: fast region-based convolutional networks for object detection. In International Conference on Computer Vision, (ICCV), Cited by: §I.
 [16] (2020) DMCP: differentiable Markov channel pruning for neural networks. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §II, §IV-C, TABLE III.
 [17] (2020) Model rubik’s cube: twisting resolution, depth and width for tinynets. In The 34th Conference on Neural Information Processing System, (NeurIPS), Cited by: §I.
 [18] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, (ICLR), Cited by: §I, §I, §II.
 [19] (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I, §I.
 [20] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §IV-A.
 [21] (2019) Asymptotic soft filter pruning for deep neural networks. IEEE Transactions on Cybernetics 50. Cited by: §II.
 [22] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision, (ECCV), Cited by: Fig. 1, §I, §II, TABLE III.
 [23] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. Cited by: Fig. 1, §I, §I, §IV-A.
 [24] (2019) Searching for mobilenetv3. In International Conference on Computer Vision, (ICCV), Cited by: §I.
 [25] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [26] (2014) Speeding up convolutional neural networks with low rank expansions. CoRR abs/1405.3866. Cited by: §I.
 [27] (2015) Speeding up convolutional neural networks using fine-tuned CP-decomposition. In 3rd International Conference on Learning Representations, (ICLR), Cited by: §I.
 [28] (1989) Optimal brain damage. In Proceedings of Neural Information Processing Systems, (NIPS), Cited by: §II.
 [29] (2018) Extremely low bit neural network: squeeze the last bit out with admm. In The 32nd AAAI Conference on Artificial Intelligence, (AAAI), Cited by: §I.
 [30] (2020) DHP: differentiable meta pruning via hypernetworks. In European Conference on Computer Vision, (ECCV), Cited by: §II.
 [31] (2020) HRank: filter pruning using high-rank feature map. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §II, TABLE III.
 [32] (2020) Dynamic model pruning with feedback. In International Conference on Learning Representations, (ICLR), Cited by: §II.
 [33] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, (ECCV), Cited by: §I.
 [34] (2018) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9. Cited by: §II.
 [35] (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, (ICML), Cited by: §II.
 [36] (2020) DSA: more efficient budgeted pruning via differentiable sparsity allocation. In European Conference on Computer Vision, (ECCV), Cited by: §II.
 [37] (2017) Faster CNNs with direct sparse convolutions and guided pruning. In The 5th International Conference on Learning Representations, (ICLR), Cited by: §I.
 [38] (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [39] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
 [40] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [41] (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, (ICLR), Cited by: §I.
 [42] (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [43] (2016) Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [44] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, (ICML), Cited by: §I.
 [45] (2021) EfficientNetV2: smaller models and faster training. In The 38th International Conference on Machine Learning, (ICML), Cited by: §I.
 [46] (2020) Revisiting parameter sharing for automatic neural channel number search. In Proceedings of Neural Information Processing Systems, (NeurIPS), Cited by: §II.
 [47] (2019) HAQ: hardware-aware automated quantization with mixed precision. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I.
 [48] (2018) Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78, pp. 29–63. Cited by: §III-D.
 [49] (2016) Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, (NIPS), Cited by: §II, §II.
 [50] (2016) Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
 [51] (2019) ECC: platform-independent energy-constrained deep neural network compression via bilinear regression model. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §I, §I, §II, §III-C.
 [52] (2019) Energy-constrained compression for deep neural networks via weighted sparse projection and layer input masking. In 7th International Conference on Learning Representations, (ICLR), Cited by: §I, §I, §II, §III-C.
 [53] (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §II, §III-C.
 [54] (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In European Conference on Computer Vision, (ECCV), Cited by: §I, §I, §I, §I, §II, §II, §III-C.
 [55] (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. CoRR abs/1903.11728. Cited by: §I, §I, §II, TABLE III.
 [56] (2019) Universally slimmable networks and improved training techniques. In 2019 IEEE International Conference on Computer Vision, (ICCV), Cited by: TABLE III.