
Architecture Aware Latency Constrained Sparse Neural Networks

09/01/2021
by   Tianli Zhao, et al.

Acceleration of deep neural networks to meet a specific latency constraint is essential for their deployment on mobile devices. In this paper, we design an architecture aware latency constrained sparse (ALCS) framework to prune and accelerate CNN models. Taking modern mobile computation architectures into consideration, we propose Single Instruction Multiple Data (SIMD)-structured pruning, along with a novel sparse convolution algorithm for efficient computation. Besides, we propose to estimate the run time of sparse models with piece-wise linear interpolation. The whole latency constrained pruning task is formulated as a constrained optimization problem that can be efficiently solved with Alternating Direction Method of Multipliers (ADMM). Extensive experiments show that our system-algorithm co-design framework can achieve much better Pareto frontier among network accuracy and latency on resource-constrained mobile devices.


I Introduction

Most recent breakthroughs in artificial intelligence rely on deep neural networks (DNNs) as fundamental building blocks, for example in image classification [1, 41, 42, 43, 20, 23] and object detection [12, 15, 39, 33, 38]. With the emergence of high-end mobile devices in recent years, there is an urgent need to migrate deep learning applications from cloud servers and desktops to these edge devices because of their cost advantages. However, this is challenging due to the high computation intensity of deep learning models and the limited computing power of these devices [52, 51, 47, 54]. In this sense, designing CNN models under a specific latency budget is essential for their deployment on resource-constrained mobile devices.

There is a considerable body of work on compression and acceleration of deep neural networks to overcome these challenges, such as network pruning [18, 19, 22, 16, 3], quantization [50, 6, 25, 29], low-rank approximation [8, 26, 27], efficient model design [23, 44, 40, 24, 4, 17, 45], and so on. Among these, pruning [18, 31, 16] has been a predominant approach for accelerating deep neural networks. Early endeavours in network pruning often aimed at reducing the model size (e.g. the number of parameters) or the number of floating point operations (FLOPs) of networks. However, it has recently been realized that reducing the number of non-zero parameters or arithmetic operations does not necessarily lead to acceleration [53, 54], which is one of the main concerns for model deployment on resource-constrained devices. Resource-constrained compression, which aims to directly reduce the latency [22, 55, 3] or energy consumption [53, 51] of networks, then emerged and soon drew great attention.

Fig. 1: Latency vs. accuracy for MobileNet [23] on ImageNet. The experiments are conducted on a single ARM Cortex-A72 CPU. Our method outperforms existing acceleration methods such as Fast Sparse ConvNets [11] and AMC [22]. Best viewed in color.

While existing methods achieve a good trade-off between accuracy and latency/energy, there is still room to push the frontier of resource-constrained compression in two respects.

First, one can take advantage of modern mobile computation architectures. Existing pruning patterns are mostly not designed for mobile devices: pruning is conducted either channel-wise [53, 51, 55, 3], which is too inflexible to attain a high compression rate, or randomly [18, 19], which is inconvenient for acceleration because of irregular memory access. It is necessary to take the computing architecture into account and design specialized pruning patterns to further push the frontier between network accuracy and latency.

Second, solving the constrained problem requires efficient and accurate estimation of latency/energy. Latency/energy models in previous works [37, 53, 52] are tied to specific hardware platforms, and the estimation requires deep knowledge of the hardware platform. Other platform-independent methods approximate the latency/energy with a look-up table [54, 47] or an estimation model [51, 3]. However, constructing the look-up table or training the estimation model requires a large number of sparsity-latency/energy pairs, which are laborious to collect.

In this paper, we propose architecture-aware latency constrained sparse neural networks (ALCS) towards a better Pareto frontier between network accuracy and latency. Specifically, considering that most modern mobile devices use the Single Instruction Multiple Data (SIMD) technique to improve computation capacity, we propose SIMD-structured pruning, which groups the parameters according to the bit-length of SIMD registers. Parameters are then pruned in a group-wise manner. Along with it, we also propose an efficient computation algorithm for accelerating SIMD-structured sparse neural networks. Our method does not suffer from a structure constraint as strong as that of channel pruning and is therefore able to achieve a relatively high compression/acceleration ratio on mobile devices.

For efficient latency estimation, we approximate the latency with piece-wise linear interpolation. Constructing our latency estimator does not require any specific architecture knowledge. Compared to other platform-independent estimation methods [54, 51, 3], which require tens of thousands of sparsity-latency pairs, our piece-wise linear interpolation latency estimator is much easier to establish: only a small collection of sparsity-latency data pairs (11 in our experiments) is required.

The whole latency-constrained sparsification task is formulated as a constrained optimization problem, which can be efficiently solved with the Alternating Direction Method of Multipliers (ADMM). Extensive experiments show that ALCS achieves a better Pareto frontier between network accuracy and latency, as shown in Figure 1. With ALCS, the execution time of MobileNet is reduced from 167ms to 82ms without accuracy drop, and the latency of ResNet-50 is reduced from 1053ms to 630ms with a slight accuracy improvement (see Table III).

In summary, our contributions are threefold:

  • We propose ALCS, an end-to-end system-algorithm co-design framework which uses SIMD-structured pruning to exploit modern mobile computation architectures for agile and accurate models.

  • We propose an efficient piece-wise linear interpolation method to estimate the network inference latency, which is sample efficient and accurate.

  • Extensive experiments and ablation studies demonstrate the advantages of architecture-aware pruning, as well as the superiority of our proposed method against a set of competitive compression and acceleration methods.

II Related Works

Network Pruning. Network pruning is a key technique for compression and acceleration of neural networks. Pioneering approaches prune the weights of models randomly, which means that each individual element of the parameters can be removed or retained without any constraint. This category of pruning methods dates back to Optimal Brain Damage (OBD) [28]. OBD prunes weights based on the Hessian matrix of the loss function, which is difficult to obtain when the number of parameters becomes large. More recently, Han et al. presented the 'Deep Compression' pipeline [18], which prunes parameters with relatively small magnitude. Ding et al. [10] use the momentum term of the SGD step to force the parameters to converge to zero. Besides, many other works focus on training a pruned network from scratch [34, 2, 9, 35, 32]. These methods can remove a large part of the parameters with negligible accuracy loss, but they are not convenient for inference acceleration because of their irregular memory access [49].

The limitations of random weight pruning described above motivate recent works [49, 21, 16, 31, 30, 46] to focus more on channel pruning, which prunes the parameters in a channel-wise manner. Channel pruning is able to accelerate the computation of networks, but it requires pruning a whole channel at once, which is too inflexible to achieve a high compression and acceleration ratio. Moreover, these methods often aim to reduce the model size of networks, while it is now well acknowledged that network latency, which is one of the main concerns when deploying CNNs on resource-constrained mobile devices, does not decrease monotonically with the reduction of model size [54].

Resource Constrained Compression. Recognizing that model size is not a sufficient surrogate for network latency/energy consumption, recent researchers have started investigating resource-constrained compression, which compresses the network to meet some budget (e.g. latency, energy). Given explicit resource constraints, these methods search for the optimal network structures with reinforcement learning [22], greedy search [53, 54, 55], Bayesian optimization [5], or dynamic programming [3], or optimize the network structures and the values of the weights simultaneously with optimization algorithms [52, 51, 36]. Compared to previous works, our work further takes into account the computing architecture of mobile devices and proposes mobile-oriented SIMD-structured pruning for CNNs. Moreover, we employ linear interpolation to estimate network latency, which is efficient and accurate and needs neither deep architecture knowledge nor a large collection of sparsity-latency data pairs.

Efficient Sparse Computation. Recently, [11] proposed an efficient computing algorithm for sparse matrix multiplication on mobile devices. Our work shares with theirs the observation that channel pruning is not necessary for network acceleration on mobile devices. It differs in that our method supports not only matrix multiplication but also general convolution, and we further argue that SIMD-structured pruning is necessary to achieve a better trade-off between network accuracy and latency.

III Methodology

III-A Problem formulation

Our goal is to accelerate the network to meet a given latency budget while minimizing the target loss:

$$\min_{W} \; \mathcal{L}(W) \quad \text{s.t.} \quad T(W) \le T_{budget} \tag{1}$$

where $W = \{W^{(l)}\}_{l=1}^{L}$ denotes the set of parameters of each layer and $\mathcal{L}$ is the task-specific loss function, for example the cross-entropy loss for classification. $T(W)$ and $T_{budget}$ denote the latency of the network and the target latency budget, respectively. There are three main challenges in solving the above problem: 1) how to use modern computation architectures to obtain a higher compression and acceleration rate on mobile devices, 2) how to efficiently estimate the latency of the network, and 3) how to solve the constrained optimization problem. In this work, we propose SIMD-structured pruning along with an efficient SIMD-structured sparse convolution algorithm for CNN acceleration. The network latency is estimated with piece-wise linear interpolation, and the constrained problem is finally solved with ADMM. We introduce more details in the following sections.

III-B SIMD-structured pruning for fast inference

It is necessary to take into account the computing architectures of the target platforms for fast inference of CNNs. To this end, considering that most mobile CPUs utilize the Single Instruction Multiple Data (SIMD) technique to improve the computation efficiency, we propose SIMD-structured pruning along with an efficient SIMD-structured sparse convolution algorithm for CNN acceleration. We will describe them in detail in the following sections.

Fig. 2: Framework of architecture aware latency constrained sparse neural networks. Left: In ALCS, network is pruned and accelerated with SIMD-structured pruning, in which parameters are grouped according to the bit-length of SIMD registers in hardware, and pruned in a group-wise manner. Middle: To solve the latency constrained problem, we approximate the latency of compressed models precisely and efficiently with piece-wise linear interpolation. Right: After training, the compressed models can be deployed on the target platform for practical applications.

III-B1 SIMD-structured pruning

Fig. 3: (a) Comparison of left: non-structured pruning method, middle: structured channel pruning, and right: our proposed SIMD structured pruning. In SIMD structured pruning, parameters are divided into groups according to the bit-length of SIMD registers and removed in a group wise way. (b) The storage format of sparse kernel and the core computation component of the proposed SIMD structured sparse convolution algorithm.

In this section, we introduce the proposed SIMD-structured pruning. Before going into details, it is worth briefly introducing Single Instruction Multiple Data (SIMD). As a data-level parallelism scheme, SIMD is widely used in modern mobile CPU architectures. It allows CPUs to operate on a set of data items at the same time with a single instruction. In this way, a vector of data can be loaded into vector registers and processed simultaneously by one instruction.

We start with the grouping of parameters. Consider a convolutional layer with filters $W \in \mathbb{R}^{C_{out} \times C_{in} \times K_h \times K_w}$, where $C_{out}$ and $C_{in}$ denote the number of output and input channels and $K_h \times K_w$ is the kernel size. The elements are first grouped along the $C_{out}$ dimension. The size of each group depends on the length of the SIMD vector registers. For example, on the widely used ARM v7/v8 architectures, each vector register for SIMD instructions is 128 bits long, so for 32-bit single-precision floating-point parameters the group size should be 128 / 32 = 4. In other words, parameters at the same location of every 4 adjacent output channels are grouped together. The parameters are then pruned in a group-wise manner. The right of Figure 3(a) shows a simple example of the proposed SIMD-structured pruning with a group size of 2. Note that the only constraint of SIMD-structured pruning is that the locations of zeros within each group of filters must be the same; the locations of zeros across different groups of filters can be irregular.
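The grouping rule above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `simd_structured_prune`, the `keep_ratio` argument and the use of the L2 norm as the group score are assumptions made only for illustration, and each filter is assumed to be flattened into a list of its $C_{in} \cdot K_h \cdot K_w$ parameters.

```python
import math


def simd_structured_prune(weights, group_size, keep_ratio):
    """Zero out whole groups of `group_size` adjacent filters at a shared position.

    `weights` is a list of C_out filters, each flattened to a list of
    C_in * K_h * K_w values. The L2 group score and `keep_ratio` are
    illustrative assumptions, not taken from the paper.
    """
    c_out = len(weights)
    assert c_out % group_size == 0, "sketch assumes C_out is a multiple of the group size"
    n_pos = len(weights[0])

    # One group = the parameters at the same position in `group_size` adjacent filters.
    groups = []
    for g in range(c_out // group_size):
        rows = range(g * group_size, (g + 1) * group_size)
        for p in range(n_pos):
            norm = math.sqrt(sum(weights[r][p] ** 2 for r in rows))
            groups.append((norm, g, p))

    # Keep the groups with the largest score; every other group is pruned as a unit.
    groups.sort(key=lambda t: t[0], reverse=True)
    n_keep = int(round(keep_ratio * len(groups)))
    kept = {(g, p) for _, g, p in groups[:n_keep]}

    return [
        [w if (r // group_size, p) in kept else 0.0 for p, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]
```

With a group size of 2, filters 0 and 1 always share one zero pattern and filters 2 and 3 share another, while the two patterns are free to differ from each other.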

III-B2 SIMD-structured sparse convolution

Having introduced the SIMD-structured pruning for deep neural networks, in this section, we describe the efficient algorithm for computation of sparse convolutions.

OVERVIEW. We show in Figure 4 an overview of the proposed computation algorithm. In this example, the group size of SIMD-structured pruning is 2. We denote the kernel, input and output of the convolution as $W$, $X$ and $Y$, respectively. The output element at location $(h, w)$ of the $c$-th channel can be computed (written here for unit stride and no padding) by:

$$Y_{c,h,w} = \sum_{i=1}^{C_{in}} \sum_{u=1}^{K_h} \sum_{v=1}^{K_w} W_{c,i,u,v} \, X_{i,\,h+u-1,\,w+v-1}$$

where $C_{in}$ is the number of input channels and $K_h \times K_w$ is the kernel size. This can be treated as the inner product of the stretched $c$-th filter of $W$ and a sub-tensor of $X$ related to the spatial location $(h, w)$. For instance, the element at the top-left corner of the first channel of $Y$ can be computed by the inner product of the stretched first filter of $W$ and the sub-tensor colored in orange at the top-left corner of $X$. It is easy to see that the output values related to multiple output channels and spatial locations can be computed collectively. For instance, the elements related to the first 2 channels and the first 2 spatial locations of the output can be computed in three steps: first, flatten and stack the first 2 filters of $W$ into the rows of a matrix, say $\tilde{W}$; then, vectorize and stack the sub-tensors of $X$ related to the first 2 output spatial locations (e.g. the sub-tensors colored in orange and yellow, located at the top-left corner of $X$) into the columns of a matrix, say $\tilde{X}$; finally, obtain the 4 output elements by multiplying $\tilde{W}$ and $\tilde{X}$. When $W$ is SIMD-structured sparse, this data parallelism can be easily achieved with SIMD instructions. Before going into more details, we first introduce the storage format of the SIMD-structured sparse tensor $W$.
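The equivalence between convolution and a matrix product described in the overview can be checked with a small pure-Python sketch. It assumes unit stride and no padding, uses nested lists instead of tensors, and the function names are hypothetical; it only illustrates the flatten-and-stack idea, not the optimized kernel.

```python
def conv2d_direct(x, w):
    """Direct convolution. x: [C_in][H][W], w: [C_out][C_in][K_h][K_w]."""
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    h_out, w_out = h_in - k_h + 1, w_in - k_w + 1
    return [[[sum(w[c][i][a][b] * x[i][h + a][v + b]
                  for i in range(c_in) for a in range(k_h) for b in range(k_w))
              for v in range(w_out)]
             for h in range(h_out)]
            for c in range(len(w))]


def conv2d_as_matmul(x, w):
    """Same result via (flattened filters) x (vectorized input patches)."""
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    h_out, w_out = h_in - k_h + 1, w_in - k_w + 1

    # Each row is one stretched filter.
    w_mat = [[w[c][i][a][b]
              for i in range(c_in) for a in range(k_h) for b in range(k_w)]
             for c in range(len(w))]
    # Each entry is one vectorized sub-tensor of x (a column of the patch matrix).
    patches = [[x[i][h + a][v + b]
                for i in range(c_in) for a in range(k_h) for b in range(k_w)]
               for h in range(h_out) for v in range(w_out)]

    y_flat = [[sum(wv * pv for wv, pv in zip(row, patch)) for patch in patches]
              for row in w_mat]
    # Fold the flat list of spatial locations back into H_out x W_out.
    return [[row[h * w_out:(h + 1) * w_out] for h in range(h_out)] for row in y_flat]
```

Both functions return the same nested list, which is the property the SIMD kernel relies on when it computes several output channels and spatial locations at once.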

STORAGE FORMAT OF SIMD-STRUCTURED SPARSE TENSOR. As shown in the middle of Figure 4, a group of filters is stored in memory in a grouped version of the Compressed Sparse Row (CSR) format, which consists of the number of non-zero columns (the orange value), the column offsets (the gray elements), and the non-zero values (the light blue elements). For instance, in the middle of Figure 4 there are 3 non-zero columns in the original kernel data, so the orange value is 3. For the first non-zero column, there is 1 column before it in the original kernel data, so its column offset is 1. For the second non-zero column, there are 3 columns before it in the original kernel data, so its column offset is 3, and so on. Note that after training, the values of the parameters are fixed during inference, so this re-organization of parameters can be done in advance without any extra time overhead.
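As a rough illustration of this storage format, the sketch below packs one group of flattened filters into the three components named above. The function name `encode_group` and the dictionary keys are hypothetical, and the real implementation presumably uses flat arrays laid out for SIMD loads rather than Python lists.

```python
def encode_group(kernel_rows):
    """Pack one group of flattened filters into a grouped-CSR-like record.

    `kernel_rows` holds `group_size` rows of equal length that share the same
    zero pattern. Field names are illustrative assumptions.
    """
    n_cols = len(kernel_rows[0])
    offsets, values = [], []
    for j in range(n_cols):
        column = [row[j] for row in kernel_rows]
        if any(v != 0 for v in column):
            offsets.append(j)      # number of columns that precede column j
            values.append(column)  # one register's worth of weights per column
    return {"nnz_cols": len(offsets), "offsets": offsets, "values": values}
```

For a group whose non-zero columns are at positions 1, 3 and 4, this yields `nnz_cols = 3` and offsets starting with 1 and 3, matching the example in the text.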

EFFICIENT MULTIPLICATION COMPUTATION. As described above, the core computation component is the multiplication between the SIMD-structured sparse kernel matrix and the dense input matrix. Here we describe how this multiplication can be efficiently computed when the kernel is stored in the format described above. See the middle of Figure 4 for a simple example. We first allocate several SIMD registers and initialize them to zero to store intermediate results. Then we load from memory the number of non-zero columns of the kernel data, which determines the number of iterations. In each iteration, we load a column of non-zero kernel data into a SIMD register, and then load the input data into other SIMD registers according to the corresponding column offset of the non-zero kernel data. After that, the loaded kernel and input data are multiplied and accumulated into the intermediate results simultaneously with SIMD instructions.

In practice, this procedure is implemented with highly optimized assembly. The group size and the number of collectively computed output elements are determined by the bit-length and number of SIMD registers.
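The loop structure of this multiplication can be emulated in plain Python. This is only a scalar sketch of the control flow under assumed data layouts (`values[k]` is the k-th non-zero kernel column, `x_mat[j]` is the row of the dense input matrix selected by offset `j`); the actual kernel is written in assembly with SIMD registers, which this code does not reproduce. The function name and argument order are hypothetical.

```python
def sparse_group_matmul(nnz_cols, offsets, values, x_mat, group_size):
    """Multiply one SIMD-structured sparse filter group by a dense input matrix.

    Returns a `group_size` x n_locations result. The inner loops stand in for
    what a single SIMD multiply-accumulate would do across lanes.
    """
    n_loc = len(x_mat[0])
    # Accumulators, playing the role of zero-initialized SIMD registers.
    acc = [[0.0] * n_loc for _ in range(group_size)]
    # The number of non-zero columns fixes the number of iterations.
    for k in range(nnz_cols):
        kernel_col = values[k]      # non-zero kernel column (one value per filter)
        x_row = x_mat[offsets[k]]   # input data selected by the column offset
        for r in range(group_size):
            for c in range(n_loc):
                acc[r][c] += kernel_col[r] * x_row[c]
    return acc
```

Columns that were pruned never appear in `offsets`, so they cost neither a load nor a multiply, which is where the speed-up over the dense product comes from.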

Fig. 4: SIMD-structured pruning and the efficient sparse convolution algorithm. In this figure, $W$, $X$ and $Y$ denote the kernel, input and output tensors of the convolution. The sizes of the input and output feature maps are denoted by $H_{in} \times W_{in}$ and $H_{out} \times W_{out}$, respectively, and the kernel size is denoted by $K_h \times K_w$.

III-C Latency estimation model

Fig. 5: The density (ratio of non-zero elements) of the parameters and the real run time for each convolutional layer in ResNet18. Overall, the latency of each layer does not grow linearly with the density, but linearity is approximately satisfied locally. This motivates us to approximate the latency of each layer with linear interpolation.

Having introduced the method utilized for CNN acceleration, the following problem is how to efficiently and accurately estimate the network latency given network parameters.

The latency of the whole network can be written as the sum of the latency of each layer:

$$T(W) = \sum_{l=1}^{L} T^{(l)}\big(W^{(l)}\big) + T_{other} \tag{2}$$

where $T^{(l)}$ is the latency of the $l$-th CONV/FC layer with parameters $W^{(l)}$. $T_{other}$ denotes the latency of other layers (e.g. ReLU layers, pooling layers, etc.) and can be regarded as a constant. In this work, the network is accelerated with sparsification, so the latency of the $l$-th layer can be formulated as a function of the number of non-zero weights of its kernel tensor:

$$T^{(l)}\big(W^{(l)}\big) = f^{(l)}\big(\|W^{(l)}\|_0\big) \tag{3}$$

Note that the above equation does not mean that the model size is used as a proxy for latency, because we model the latency layer by layer, and the number of non-zero parameters in different layers may have a different influence on the latency of the whole model.

We propose to approximate the per-layer latency function of Equation (3), written $f^{(l)}$ here, with linear interpolation. This is based on the observation that the latency of each layer is locally linear with respect to the density of its parameters, as shown in Figure 5. Taking the $l$-th layer as an example, we measure the run time of the layer on the device when the number of non-zero parameters is $n^{(l)}_0, n^{(l)}_1, \dots, n^{(l)}_M$, respectively, with $n^{(l)}_0 < n^{(l)}_1 < \dots < n^{(l)}_M \le N^{(l)}$, where $N^{(l)}$ is the number of elements of $W^{(l)}$. For a given tensor $W^{(l)}$ with $n$ non-zero parameters, the run time can then be approximated by linear interpolation:

$$f^{(l)}(n) \approx t^{(l)}_0 + \sum_{k=0}^{M-1} \delta^{(l)}_k \, a^{(l)}_k \left( \min\big(n, n^{(l)}_{k+1}\big) - n^{(l)}_k \right) \tag{4}$$

where $\delta^{(l)}_k$ is a variable which indicates whether the number of non-zero parameters is larger than $n^{(l)}_k$:

$$\delta^{(l)}_k = \begin{cases} 1, & n > n^{(l)}_k \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

and $t^{(l)}_k$ is the run time of the layer when the number of non-zero parameters is $n^{(l)}_k$. $a^{(l)}_k$ is the rate at which the latency increases when the number of non-zero parameters grows from $n^{(l)}_k$ to $n^{(l)}_{k+1}$:

$$a^{(l)}_k = \frac{t^{(l)}_{k+1} - t^{(l)}_k}{n^{(l)}_{k+1} - n^{(l)}_k} \tag{6}$$

In practice, we set $n^{(l)}_k$ to be $\frac{k}{10} N^{(l)}$ for $k = 0, 1, \dots, 10$. In this way, only 11 sparsity-latency data pairs are required to approximate the network latency. We find that our approximation of latency with linear interpolation is rather accurate, as shown in Figure 6. Moreover, no platform knowledge is needed, because we approximate the network latency by direct measurement and treat the hardware as a black box. In contrast, previous works rely on either deep architecture knowledge [53, 52] or a large collection (usually over 10000) of sparsity-latency/energy data pairs to construct a look-up table [54] or train an estimation model [51].
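A minimal sketch of such a piece-wise linear latency estimator for one layer is given below. The function name `make_latency_estimator` and the sample measurements in the usage note are made up for illustration; in the paper the knots are 11 measured sparsity-latency pairs per layer.

```python
def make_latency_estimator(knots, times):
    """Build a piece-wise linear estimate of per-layer latency.

    `knots` are ascending non-zero counts at which latency was measured and
    `times` are the corresponding measured run times (both hypothetical inputs).
    """
    # Slope of each segment: how fast latency grows between consecutive knots.
    slopes = [(times[k + 1] - times[k]) / (knots[k + 1] - knots[k])
              for k in range(len(knots) - 1)]

    def estimate(n):
        t = times[0]
        for k, slope in enumerate(slopes):
            if n > knots[k]:  # indicator: the segment is (partly) covered by n
                t += slope * (min(n, knots[k + 1]) - knots[k])
        return t

    return estimate
```

For example, `make_latency_estimator([0, 10, 20], [1.0, 2.0, 5.0])(15)` interpolates halfway along the second segment. The whole-network estimate is then the sum of the per-layer estimates plus a constant for the remaining layers.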

III-D The optimization algorithm

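Input: The variable $W$ to be projected. The group size $g$ for SIMD-structured pruning. The time budget $T_{budget}$. The tolerance $\epsilon$.
Output: The projected variable $Z$.
  Divide $W$ into multiple groups with $g$ elements in each group (Section III-B).
  Sort the groups of elements in $W$ in terms of their norm.
  $N_{min} \leftarrow 0$, $N_{max} \leftarrow$ the total number of groups in $W$.
  $T_{min}, T_{max} \leftarrow$ the run time of the model if all the parameters are removed/retained.
  while $T_{max} - T_{min} > \epsilon$ do
     $N \leftarrow \lfloor (N_{min} + N_{max}) / 2 \rfloor$
     $Z \leftarrow$ pick the top-$N$ largest groups of elements in $W$
     $T \leftarrow T(Z)$
     if $T > T_{budget}$ then
        $N_{max} \leftarrow N$, $T_{max} \leftarrow T$
     else
        $N_{min} \leftarrow N$, $T_{min} \leftarrow T$
     end if
  end while
  $Z \leftarrow$ pick the $N_{min}$ largest groups of elements in $W$.
  return $Z$
Algorithm 1 Projection operation with bisection.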
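The bisection idea in Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `latency_of` is a hypothetical callable returning the estimated latency for a list of kept group indices, it is assumed to be non-decreasing as groups are added, keeping no group is assumed to satisfy the budget, and the extra `hi - lo > 1` guard is added here only to guarantee termination.

```python
def project_with_bisection(group_norms, latency_of, budget, tol=1e-3):
    """Return the indices of the groups to keep under a latency budget.

    Bisects over N, the number of largest-norm groups kept.
    """
    order = sorted(range(len(group_norms)),
                   key=lambda i: group_norms[i], reverse=True)
    lo, hi = 0, len(order)
    t_lo, t_hi = latency_of([]), latency_of(order)
    if t_hi <= budget:
        return set(order)  # the dense model already meets the budget

    # Invariant: keeping `lo` groups meets the budget, keeping `hi` does not.
    while hi - lo > 1 and t_hi - t_lo > tol:
        mid = (lo + hi) // 2
        t_mid = latency_of(order[:mid])
        if t_mid > budget:
            hi, t_hi = mid, t_mid
        else:
            lo, t_lo = mid, t_mid
    return set(order[:lo])
```

Compared with adding groups one by one, the number of latency evaluations grows only logarithmically with the number of groups.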

Now that we are able to efficiently approximate the latency of CNN models given their parameters, we are ready to solve the constrained optimization problem in Equation (1). Many optimization algorithms can be applied to this problem, such as the Alternating Direction Method of Multipliers (ADMM), Projected Gradient Descent (PGD), and so on. In this paper, we apply ADMM, which has recently been shown to be applicable to non-convex, non-smooth problems [48]. One may choose other optimization algorithms to solve Problem (1), and the proposed SIMD-structured pruning and latency estimation with linear interpolation can also serve as plugins for other network compression methods to further improve their latency-accuracy trade-off, but this is beyond the scope of this paper. With an auxiliary variable $Z$, the original problem can be reformulated as:

$$\min_{W, Z} \; \mathcal{L}(W) \quad \text{s.t.} \quad W = Z, \;\; T(Z) \le T_{budget} \tag{7}$$

where $\mathcal{L}$ is the task loss, $T(\cdot)$ is the latency model defined by Equations (2)-(6), and $T_{budget}$ is the latency budget. By applying the augmented Lagrangian, the above problem is equivalent to:

$$\max_{U} \min_{W, Z} \; \mathcal{L}(W) + \frac{\rho}{2} \|W - Z + U\|_F^2 - \frac{\rho}{2} \|U\|_F^2 \quad \text{s.t.} \quad T(Z) \le T_{budget} \tag{8}$$

where $\rho$ is a hyper-parameter. The main idea of ADMM is to update the original parameters $W$, the auxiliary variable $Z$ and the dual variable $U$ in an alternating manner:

$$W^{k+1} = \arg\min_{W} \; \mathcal{L}(W) + \frac{\rho}{2} \|W - Z^{k} + U^{k}\|_F^2 \tag{9a}$$
$$Z^{k+1} = \arg\min_{Z:\, T(Z) \le T_{budget}} \; \|W^{k+1} - Z + U^{k}\|_F^2 \tag{9b}$$
$$U^{k+1} = U^{k} + W^{k+1} - Z^{k+1} \tag{9c}$$

The updates of the original parameters $W$ and the dual variable $U$ are relatively straightforward. For the update of $W$, we apply SGD on the training dataset for one epoch. The main difficulty lies in the update of the auxiliary variable $Z$, which is the projection of $W^{k+1} + U^{k}$ onto the constraint set. We solve this problem with a greedy algorithm: we first sort the groups of elements of $W^{k+1} + U^{k}$ in terms of their norm, and pick them one by one, from largest to smallest, until the latency reaches the target budget. A direct implementation of this algorithm is inefficient because it may need a large number of iterations, but it can be implemented efficiently with the bisection method shown in Algorithm 1. After ADMM optimization finishes, we set $W = Z$ and fine-tune the generated model on the training set for a few epochs. We summarize the final optimization algorithm in Algorithm 2, and more details are given in Section IV-A.

Input: The base model with pretrained parameters $W$. The latency budget $T_{budget}$. The group size $g$ for SIMD-structured pruning. The budget tolerance $\epsilon$. The training dataset $\mathcal{D}$. The number of training epochs $K$ and the penalty parameter $\rho$ for ADMM optimization. The number of training epochs $K_{ft}$ for fine-tuning.
Output: The pruned model with latency $T(W) \le T_{budget}$.
  Initialize $Z$ with Algorithm 1, $U = 0$
  for $k = 1, \dots, K$ do
     Update $W$ with SGD on $\mathcal{D}$ for one epoch
     Update $Z$ with Algorithm 1
     Update $U$
  end for
  Set $W = Z$
  Update the non-zero parameters of $W$ with SGD on $\mathcal{D}$ for $K_{ft}$ epochs
  return $W$
Algorithm 2 The ALCS algorithm
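To make the alternating updates concrete, here is a toy Python sketch of the ADMM loop on a problem small enough to follow by hand. Everything specific in it is an assumption for illustration: the loss is a simple quadratic $\frac{1}{2}\|W - \text{target}\|^2$ instead of a network loss, the latency constraint is replaced by a "keep at most `k` entries" constraint, plain gradient descent stands in for one epoch of SGD, and all names are hypothetical.

```python
def top_k_projection(v, k):
    """Keep the k largest-magnitude entries of v (stand-in for Algorithm 1)."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def admm_sparsify(target, k, rho=0.5, iters=50, lr=0.1, inner_steps=20):
    """Toy ADMM: minimize 0.5 * ||w - target||^2 subject to at most k non-zeros."""
    w = list(target)
    z = top_k_projection(w, k)          # initialize the auxiliary variable
    u = [0.0] * len(w)                  # scaled dual variable
    for _ in range(iters):
        # W-update: gradient steps on loss + (rho / 2) * ||w - z + u||^2.
        for _ in range(inner_steps):
            w = [wi - lr * ((wi - ti) + rho * (wi - zi + ui))
                 for wi, ti, zi, ui in zip(w, target, z, u)]
        # Z-update: project w + u onto the constraint set.
        z = top_k_projection([wi + ui for wi, ui in zip(w, u)], k)
        # U-update: accumulate the remaining disagreement between w and z.
        u = [ui + wi - zi for ui, wi, zi in zip(u, w, z)]
    return z
```

The returned `z` is exactly sparse by construction, which mirrors the final step of Algorithm 2 where the parameters are set to the auxiliary variable before fine-tuning.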

IV Experimental results

IV-A Experimental setup

We evaluate our method on both compact models such as MobileNet [23] and relatively heavy networks such as ResNet18 and ResNet50 [20] for 1000-class image classification on ImageNet [7]. We do not conduct experiments on CIFAR because it is more practical and challenging to accelerate CNN models on large-scale vision tasks. We use the standard data pre-processing pipeline provided by the official PyTorch examples. The batch size is set to 256, and the group size for SIMD-structured pruning is set to 4 to match the bit-length of SIMD registers (see footnote 1). The hyper-parameter $\rho$ is set to 0.01 for all the experiments. In each ADMM iteration, for the update of the original parameters as indicated in Equation (9a), we apply the momentum SGD optimizer for 1 epoch with the learning rate fixed to 0.001 and with separate weight decay settings for ResNet and for MobileNet. We apply 100 ADMM iterations for MobileNet and ResNet18 and 60 ADMM iterations for ResNet50. After the ADMM iterations, the generated compressed model is fine-tuned for 60 epochs with the learning rate annealed by a cosine learning rate scheduler. Weight decay is also applied at this stage. During this procedure, only non-zero parameters are updated.

Footnote 1: In most mobile devices, the length of each vector register for SIMD instructions is 128 bits, so for SIMD-structured pruning of 32-bit single-precision floating-point parameters, the group size should be 4.

The latency of all the dense models (including the models compressed with channel pruning methods) is measured with TensorFlow Lite [13], which is one of the most popular mobile-oriented inference frameworks for DNNs, and the latency of all the SIMD-structured sparse models is measured with our proposed SIMD-structured sparse convolution algorithm, which is implemented in C++ with SIMD instructions. We report the latency averaged over 50 runs on a single ARM Cortex-A72 CPU.

IV-B Ablation study

Precision of latency estimation:
Fig. 6: Real & estimated latency on 100 randomly sampled ResNet18 models. Tested on a single ARM Cortex-A72 CPU.

In this section, we first study the precision of the proposed latency estimation with linear interpolation. To this end, we uniformly sample 100 ResNet18 models with different sparsity, and plot the real latency and the estimated latency in Figure 6. The figure shows that the proposed approximation of latency with linear interpolation is rather accurate.

Influence of $\rho$:
Fig. 7: Training curves for compressing MobileNet on ImageNet during ADMM optimization and the following fine-tuning with different values of $\rho$. Better viewed in color.
$\rho$   FLOPs  Latency  Acc@1
0.001  185M   62ms     70.37%
0.01   185M   62ms     70.53%
0.05   185M   62ms     70.38%
TABLE I: Accuracy of compressed MobileNet on ImageNet with different values of $\rho$.

To study the influence of the penalty hyper-parameter $\rho$ on our algorithm, we compress MobileNet on ImageNet and set the target latency to 62ms with $\rho \in \{0.001, 0.01, 0.05\}$. Results are shown in Figure 7 and Table I. From Figure 7(a) we can see that the training loss converges to a lower value with small $\rho$, since with smaller $\rho$ the algorithm focuses more on optimizing the original parameters $W$. However, Figure 7(b) shows that the sparse structure of the original parameters is not sufficiently constrained if $\rho$ is too small. For instance, when $\rho = 0.001$, the difference between the original parameters $W$ and the auxiliary variable $Z$ is rather large even after 100 ADMM iterations, which means that in this case $W$ fails to converge to a sparse solution during ADMM optimization. Therefore, during fine-tuning, the smallest $\rho$ does not lead to the lowest loss (see Figure 7(c)). From Table I, we can see that $\rho = 0.01$ achieves slightly better accuracy than the other two cases, and we set $\rho = 0.01$ in all the following experiments.

Impact of different components:
Pruning Method FLOPs Latency ADMM FT Acc@1
WP 71.1M 61.55ms 68.36%
FP 186M 64ms 67.76%
SIMD 185M 62ms 70.53%
SIMD 185M 62ms 70.08%
SIMD 185M 62ms 69.96%
TABLE II: Comparison of different variants of our method for compressing MobileNet on ImageNet. WP denotes weight pruning, FP denotes filter pruning, and SIMD denotes our proposed SIMD-structured pruning. ADMM and FT denote the ADMM optimization process and the post fine-tuning process, respectively. Our method outperforms all the other variants in that it achieves a better trade-off between network latency and accuracy.
Fig. 8: Training dynamics for compressing MobileNet on ImageNet during ADMM optimization and the post fine-tuning process with different pruning methods. SIMD denotes the proposed SIMD-structured pruning, WP denotes random weight pruning, in which any individual element of the parameters can be pruned without any constraint, and FP denotes filter pruning. The latency budget of all the methods is set to 62ms.
Fig. 9: Comparison of different variants of ALCS with or without the ADMM optimization and the post fine-tuning process. The latency budget of all the compared methods is set to 62ms.

In this section, we study the impact of the different components of ALCS, that is, the SIMD-structured pruning, the ADMM optimization and the post fine-tuning. To this end, we compare several variants of ALCS for compressing MobileNet on ImageNet. Results are summarized in Table II, Figure 8 and Figure 9. WP denotes weight pruning, in which each individual element of the parameters can be removed or retained without any constraint. To measure the latency of models compressed by weight pruning, we use the code of XNNPACK [14], which is the state-of-the-art implementation of sparse matrix multiplication [11]. Note that this implementation supports only matrix multiplication, which is equivalent to a convolution layer with a kernel size of $1 \times 1$, so we do not prune the first convolution layer in this variant, following the same setting as in [11]. FP denotes filter pruning, which prunes parameters in a channel-wise manner. SIMD denotes the proposed SIMD-structured pruning. ADMM and FT denote the ADMM optimization and the post fine-tuning process, respectively. For a fair comparison, we set the latency budgets to be the same and train all the variants for an equal number of total epochs. Specifically, for ADMM+FT variants, we apply the same hyper-parameters as described in Section IV-A. For ADMM-only variants, we apply 160 ADMM iterations. For FT-only variants, we prune the model with a norm-based method and fine-tune for 160 epochs, with the learning rate fixed to 0.001 for the first 100 epochs and annealed from 0.001 to 0 with a cosine schedule over the last 60 epochs.
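The learning-rate schedule described for the FT-only variants can be written as a small Python function. The numbers (0.001, 100 fixed epochs, 160 total epochs) come from the text; the per-epoch granularity and the standard half-cosine formula are assumptions, since the text does not say whether the schedule is stepped per epoch or per iteration. The function name is hypothetical.

```python
import math


def ft_only_learning_rate(epoch, base_lr=0.001, fixed_epochs=100, total_epochs=160):
    """Learning rate at a given (0-based) epoch for the FT-only variant.

    Constant for the first `fixed_epochs`, then half-cosine annealing to 0.
    """
    if epoch < fixed_epochs:
        return base_lr
    progress = (epoch - fixed_epochs) / (total_epochs - fixed_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate stays at 0.001 through epoch 100, falls to half of that midway through the annealing phase, and reaches approximately 0 at the end of training.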

As shown in Table II, the proposed training pipeline outperforms all the other variants. By comparing the first three variants, we can conclude that SIMD-structured pruning achieves a better trade-off between network accuracy and latency than weight pruning and filter pruning. For instance, the accuracy of ALCS with SIMD-structured pruning (70.53%) is 2.17% higher than that of weight pruning (68.36%) and 2.77% higher than that of filter pruning (67.76%) under similar latency budgets. This is mainly because: (1) compared to weight pruning, SIMD-structured pruning is friendlier to the SIMD architecture of mobile devices and is thus able to achieve similar latency at a higher density, which helps retain the accuracy of the network; (2) compared to filter pruning, SIMD-structured pruning does not suffer from such a strong constraint on the data structure, which improves flexibility and leads to lower accuracy loss.

By comparing the last three variants, we can see that both the ADMM optimization and the post fine-tuning are necessary for improving the network accuracy. In particular, applying only the ADMM optimization yields a lower final accuracy than applying both the ADMM optimization and the post fine-tuning, and an accuracy gap also remains when the ADMM optimization is omitted (see Table II for the exact values). Figure 9 further plots the training curves of these three variants; there is a sharp decline in validation loss when the post fine-tuning begins. We attribute this to the ADMM optimization finding a better initialization for the post fine-tuning process.

IV-C Comparison with state-of-the-art methods

Model      Method      FLOPs  Latency  Acc@1
ResNet18   baseline    1.8G   537ms    69.76%
ResNet18   DMCP        1.04G  341ms    69.70%
ResNet18   ALCS(OURS)  548M   200ms    69.88%
ResNet50   baseline    4.1G   1053ms   76.15%
ResNet50   DMCP        2.2G   659ms    76.2%
ResNet50   DMCP        1.1G   371ms    74.1%
ResNet50   HRank       2.26G  695ms    75.56%
ResNet50   AutoSlim    3.0G   792ms    76.0%
ResNet50   AutoSlim    2.0G   609ms    75.6%
ResNet50   AutoSlim    1.0G   312ms    74.0%
ResNet50   ALCS(OURS)  2.2G   630ms    76.26%
ResNet50   ALCS(OURS)  985M   370ms    75.05%
MobileNet  Uniform     569M   167ms    71.8%
MobileNet  Uniform     325M   102ms    68.4%
MobileNet  Uniform     150M   53ms     64.4%
MobileNet  AMC         285M   94ms     70.7%
MobileNet  Fast        71.1M  61ms     68.4%
MobileNet  AutoSlim    325M   99ms     71.5%
MobileNet  AutoSlim    150M   55ms     67.9%
MobileNet  USNet       325M   102ms    69.5%
MobileNet  USNet       150M   53ms     64.2%
MobileNet  ALCS(OURS)  283M   82ms     72.04%
MobileNet  ALCS(OURS)  185M   62ms     70.53%
MobileNet  ALCS(OURS)  140M   52ms     69.16%
MobileNet  ALCS(OURS)  91M    40ms     65.96%
TABLE III: Comparison of ALCS with other state-of-the-art network compression/acceleration methods on ImageNet. We compare ALCS with the following 6 state-of-the-art network compression/acceleration methods: DMCP [16], HRank [31], Fast sparse convolution [11], AMC [22], AutoSlim [55], and USNet [56]. Fast [11] is a weight pruning method, and its latency is measured with XNNPACK [14]. For all the other baseline methods, we download the models released by the authors and measure their latency with TFLite [13].
Fig. 10: Comparison of ALCS with state-of-the-art model acceleration methods for compressing MobileNet. The size of each point denotes the FLOPs of the model. ALCS clearly achieves a better accuracy-latency trade-off than all the other methods.

To further demonstrate the efficacy of our method, in this section we compare ALCS with various state-of-the-art network compression/acceleration methods. All the experiments are conducted with ResNet18, ResNet50, and MobileNet on ImageNet. For a fair comparison, we set the latency budget to be the same for all approaches. The results are given in Table III and Figure 10.

From Table III, we see that ALCS achieves a better trade-off between network accuracy and latency than all the other methods. For example, our method reduces the latency of ResNet18 from 537ms to 200ms without any accuracy loss (69.88% versus 69.76%). Compared to DMCP [16], ALCS is faster (200ms versus 341ms) with better accuracy (69.88% versus 69.70%). As for ResNet50, compared to the original model (1053ms, 76.15%), our method is faster with better accuracy at 630ms (76.26%), and faster with an accuracy drop of only 1.10 percentage points at 370ms (75.05%). On MobileNet, our method also achieves higher accuracy than the other methods under similar or smaller latency budgets. For instance, under an 82ms latency budget, ALCS achieves 72.04% top-1 accuracy, which is 0.54 percentage points higher than AutoSlim at a latency of 99ms and 1.34 percentage points higher than AMC at a latency of 94ms. Compared to the original model (the 569M-FLOPs row of Table III, 167ms and 71.8%), ALCS is faster without any accuracy loss. The same trend also holds under other latency budgets. Overall, the advantage of ALCS is more pronounced on compact models under lower latency budgets. This implies that a specialized design of the pruning structure matters most when accelerating compact models under a tight latency budget.
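The comparisons in this paragraph reduce to simple arithmetic on the latency and accuracy columns of Table III. The short sketch below recomputes the speedup factors and accuracy differences from those entries. The helper `compare` is illustrative, and treating the 569M-FLOPs Uniform row as the original MobileNet is an assumption.

```python
# (latency in ms, top-1 accuracy in %) copied from Table III.
RESNET18_BASELINE = (537, 69.76)
RESNET18_DMCP = (341, 69.70)
RESNET18_ALCS = (200, 69.88)
RESNET50_BASELINE = (1053, 76.15)
RESNET50_ALCS_630 = (630, 76.26)
RESNET50_ALCS_370 = (370, 75.05)
MOBILENET_ORIGINAL = (167, 71.8)  # assumption: the 569M-FLOPs Uniform row
MOBILENET_ALCS_82 = (82, 72.04)


def compare(reference, candidate):
    """Return (speedup factor, top-1 accuracy change in percentage points)."""
    ref_ms, ref_acc = reference
    new_ms, new_acc = candidate
    return ref_ms / new_ms, new_acc - ref_acc


if __name__ == "__main__":
    pairs = {
        "ResNet18 ALCS vs baseline": (RESNET18_BASELINE, RESNET18_ALCS),
        "ResNet18 ALCS vs DMCP": (RESNET18_DMCP, RESNET18_ALCS),
        "ResNet50 ALCS (630ms) vs baseline": (RESNET50_BASELINE, RESNET50_ALCS_630),
        "ResNet50 ALCS (370ms) vs baseline": (RESNET50_BASELINE, RESNET50_ALCS_370),
        "MobileNet ALCS (82ms) vs original": (MOBILENET_ORIGINAL, MOBILENET_ALCS_82),
    }
    for name, (ref, cand) in pairs.items():
        factor, delta = compare(ref, cand)
        print(f"{name}: {factor:.3f}x faster, {delta:+.2f} points top-1")
```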

From Table III and Figure 10 we see that ALCS does not achieve a better FLOPs-accuracy trade-off than Fast [11]. This is because Fast accelerates networks with random weight pruning, in which each individual parameter can be pruned without any constraint. In contrast, ALCS uses the proposed SIMD-structured pruning, in which a group of parameters (4 parameters in our experiments) must be pruned or retained simultaneously, so Fast is able to achieve higher accuracy than ALCS under the same FLOPs. However, the goal of this paper is not to reduce the model size or the number of arithmetic operations, but to reduce the actual inference latency: when deploying deep models in practical applications, it is usually the actual run time, rather than the FLOPs of the model, that matters most. Compared to random weight pruning, the proposed SIMD-structured pruning fully utilizes the SIMD architecture of the target platform, which helps achieve high computation efficiency. Thus, to meet a given latency budget, more parameters need to be pruned when using random weight pruning. For example, to bring the latency of MobileNet down to about 61ms, Fast reduces the FLOPs to 71.1M, whereas ALCS reaches 62ms with 185M FLOPs. Far fewer parameters therefore need to be pruned in ALCS, which helps preserve the model accuracy (70.53% versus 68.4%). As a result, ALCS achieves a better trade-off between accuracy and latency than Fast, as shown in Table III and Figure 10.
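The gap between FLOPs and run time can be made concrete with the two MobileNet rows of Table III that sit at roughly the same latency. The sketch below computes, for Fast and for ALCS, the fraction of dense FLOPs that is retained and the number of FLOPs executed per millisecond. Using 569M FLOPs as the dense reference is an assumption (it is the largest MobileNet entry in Table III), and the names in the code are illustrative.

```python
DENSE_FLOPS_M = 569.0  # assumption: the 569M-FLOPs Uniform row is the dense MobileNet

# FLOPs (millions), latency (ms) and top-1 accuracy (%) copied from Table III.
rows = {
    "Fast": {"flops_m": 71.1, "latency_ms": 61.0, "top1": 68.4},
    "ALCS": {"flops_m": 185.0, "latency_ms": 62.0, "top1": 70.53},
}


def summarize(row):
    """Derive the retained FLOPs fraction and the effective throughput of one row."""
    return {
        "flops_retained": row["flops_m"] / DENSE_FLOPS_M,
        "mflops_per_ms": row["flops_m"] / row["latency_ms"],
    }


summary = {name: summarize(row) for name, row in rows.items()}

if __name__ == "__main__":
    for name, s in summary.items():
        print(f"{name}: {s['flops_retained']:.1%} of dense FLOPs, "
              f"{s['mflops_per_ms']:.2f} MFLOPs per ms")
```

Under these assumptions, ALCS executes more than twice as many FLOPs per millisecond as Fast at nearly the same latency, which is consistent with the argument that the SIMD-structured layout is what lets it keep more of the network.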

V Conclusion

In this paper, we propose ALCS (Architecture Aware Latency Constrained Sparse Neural Networks) for model acceleration on mobile devices. Considering that most modern mobile devices use the Single Instruction Multiple Data (SIMD) technique to improve their computation capacity, we propose a novel SIMD-structured pruning method along with an efficient SIMD-structured sparse convolution algorithm for the acceleration of sparse models. Moreover, we propose to estimate the latency of compressed models with piecewise linear interpolation, which is accurate and efficient and, unlike existing budget approximation methods, does not need a large number of collected architecture-latency data pairs. The whole latency constrained problem is finally solved with ADMM. Extensive experimental results on various network architectures indicate that ALCS achieves a better latency-accuracy trade-off thanks to the proposed SIMD-structured pruning and the efficient SIMD-structured sparse convolution algorithm.
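To make the idea of latency estimation by piecewise linear interpolation concrete, the sketch below interpolates a layer's latency between a handful of measured points and sums the per-layer estimates. It is only an illustration under stated assumptions: the choice of layer density as the interpolation variable, the clamping outside the measured range, the function names, and all measurement values are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def interp_latency(density, knots):
    """Piecewise linear estimate of one layer's latency.

    knots: list of (density, measured latency in ms), sorted by density.
    Values outside the measured range are clamped to the nearest knot.
    """
    xs = [x for x, _ in knots]
    ys = [y for _, y in knots]
    if density <= xs[0]:
        return ys[0]
    if density >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, density)  # xs[i - 1] <= density < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    t = (density - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)


def model_latency(densities, per_layer_knots):
    """Sum the per-layer estimates (assumes layer latencies add up)."""
    return sum(interp_latency(d, k) for d, k in zip(densities, per_layer_knots))


# Hypothetical measurements for two layers: (density, latency in ms).
layer_a = [(0.0, 0.2), (0.25, 1.0), (0.5, 2.2), (0.75, 3.1), (1.0, 4.0)]
layer_b = [(0.0, 0.1), (0.5, 1.5), (1.0, 3.5)]

if __name__ == "__main__":
    print(interp_latency(0.375, layer_a))
    print(model_latency([0.375, 0.75], [layer_a, layer_b]))
```

Only a few measurements per layer are needed to build such a table, which is the property the paragraph above contrasts with approaches that fit a predictor on many architecture-latency pairs.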

The main purpose of this paper is to investigate the design space that lies between traditional random weight pruning and structured filter-level pruning. The results show that the latency-accuracy frontier can be pushed further with the help of the SIMD instructions in modern CPUs. One limitation of SIMD-structured pruning is that it is not applicable to GPUs, because their computing architectures are very different; extending it to such platforms is an interesting direction for future work.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems, (NIPS), Cited by: §I.
  • [2] G. Bellec, D. Kappel, W. Maass, and R. Legenstein (2018) Deep rewiring: training very sparse deep networks. In International Conference on Learning Representations, (ICLR), Cited by: §II.
  • [3] M. Berman, L. Pishchulin, N. Xu, M. B. Blaschko, and G. Medioni (2020) AOWS: adaptive and optimal network width search with latency constraints. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §I, §II.
  • [4] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han (2020) Once-for-all: train one network and specialize for efficient deployment. In International Conference on Learning Representations, (ICLR), Cited by: §I.
  • [5] C. Chen, F. Tung, N. Vedula, and G. Mori (2018) Constraint-aware deep neural network compression. In European Conference on Computer Vision, (ECCV), Cited by: §II.
  • [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §IV-A.
  • [8] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
  • [9] T. Dettmers and L. Zettlemoyer (2019) Sparse networks from scratch: faster training without losing performance. CoRR arXiv: 1907.04840. Cited by: §II.
  • [10] X. Ding, G. Ding, X. Zhou, Y. Guo, J. Liu, and J. Han (2019) Global sparse momentum sgd for pruning very deep neural networks. In Conference of Neural Information Processing System, (NeurIPS), Cited by: §II.
  • [11] E. Elsen, M. Dukhan, T. Gale, and K. Simonyan (2020) Fast sparse convnets. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: Fig. 1, §II, §IV-B, §IV-C, TABLE III.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [13] Google-Inc. (2020) Machine learning for mobile devices: TensorFlow Lite. Note: https://www.tensorflow.org/lite Cited by: §IV-A, TABLE III.
  • [14] Google-Inc. (2020) XNNPACK. Note: https://github.com/google/XNNPACK Cited by: §IV-B, TABLE III.
  • [15] R. Girshick (2015) Fast rcnn: fast region-based convolutional networks for object detection. In International Conference on Computer Vision, (ICCV), Cited by: §I.
  • [16] S. Guo, Y. Wang, Q. Li, and J. Yan (2020) DMCP: differentiable markov channel pruning for neural networks. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §II, §IV-C, TABLE III.
  • [17] K. Han, Y. Wang, Q. Zhang, W. Zhang, C. Xu, and T. Zhang (2020) Model rubik’s cube: twisting resolution, depth and width for tinynets. In The 34th Conference on Neural Information Processing System, (NeurIPS), Cited by: §I.
  • [18] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, (ICLR), Cited by: §I, §I, §II.
  • [19] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I, §I.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §IV-A.
  • [21] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang (2019) Asymptotic soft filter pruning for deep neural networks. IEEE Transactions on Cybernetics 50. Cited by: §II.
  • [22] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. In European Conference on Computer Vision, (ECCV), Cited by: Fig. 1, §I, §II, TABLE III.
  • [23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR arXiv:1704.04861. Cited by: Fig. 1, §I, §I, §IV-A.
  • [24] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for mobilenetv3. In International Conference on Computer Vision, (ICCV), Cited by: §I.
  • [25] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, and A. Howard (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [26] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. CoRR abs/1405.3866. Cited by: §I.
  • [27] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky (2015) Speeding up convolutional neural networks using fine-tuned cp-decomposition. In 3rd International Conference on Learning Representations, (ICLR), Cited by: §I.
  • [28] Y. LeCun, J. S. Denker, and S. A. Solla (1989) Optimal brain damage. In Proceedings of Neural Information Processing System, (NIPS), Cited by: §II.
  • [29] C. Leng, H. Li, S. Zhu, and R. Jin (2018) Extremely low bit neural network: squeeze the last bit out with admm. In The 32nd AAAI Conference on Artificial Intelligence, (AAAI), Cited by: §I.
  • [30] Y. Li, S. Gu, K. Zhang, L. V. Gool, and R. Timofte (2020) DHP: differentiable meta pruning via hypernetworks. In European Conference on Computer Vision, (ECCV), Cited by: §II.
  • [31] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao (2020) HRank: filter pruning using high-rank feature map. In 2020 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §II, TABLE III.
  • [32] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi (2020) Dynamic model pruning with feedback. In International Conference on Learning Representations, (ICLR), Cited by: §II.
  • [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, (ECCV), Cited by: §I.
  • [34] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta (2018-06) Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9. Cited by: §II.
  • [35] H. Mostafa and X. Wang (2019) Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, (ICML), Cited by: §II.
  • [36] X. Ning, T. Zhao, W. Li, P. Lei, Y. Wang, and H. Yang (2020) DSA: more efficient budgeted pruning via differentiable sparsity allocation. In European Conference on Computer Vision, (ECCV), Cited by: §II.
  • [37] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey (2017) Faster cnns with direct sparse convolutions and guided pruning. In The 5th International Conference on Learning Representations, (ICLR), Cited by: §I.
  • [38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster rcnn: towards real time object detection with region proposal networks. In Advances in Neural Information Processing Systems, (NIPS), Cited by: §I.
  • [40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [41] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large scale image recognition. In 3rd International Conference on Learning Representations, (ICLR), Cited by: §I.
  • [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, and S. Reed (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [43] C. Szegedy, V. Vanhoucke, S. Ioffe, and J. Shlens (2016) Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [44] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, (ICML), Cited by: §I.
  • [45] M. Tan and Q. V. Le (2021) EfficientNetV2: smaller models and faster training. In The 38th International Conference on Machine Learning, (ICML), Cited by: §I.
  • [46] J. Wang, H. Bai, J. Wu, X. Shi, J. Huang, I. King, M. Lyu, and J. Cheng (2020) Revisiting parameter sharing for automatic neural channel number search. In Proceedings of Neural Information Processing System, (NeurIPS), Cited by: §II.
  • [47] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I.
  • [48] Y. Wang, W. Yin, and J. Zeng (2018) Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing 78, pp. 29–63. Cited by: §III-D.
  • [49] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, (NIPS), Cited by: §II, §II.
  • [50] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng (2016) Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I.
  • [51] H. Yang, Y. Zhu, and J. Liu (2019) ECC: platform-independent energy-constrained deep neural network compression via bilinear regression model. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §I, §I, §II, §III-C.
  • [52] H. Yang, Y. Zhu, and J. Liu (2019) Energy-constrained compression for deep neural networks via weighted sparse projection and layer input masking. In 7th International Conference on Learning Representations, (ICLR), Cited by: §I, §I, §II, §III-C.
  • [53] T. Yang, Y. Chen, and V. Sze (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Cited by: §I, §I, §I, §II, §III-C.
  • [54] T. Yang, A. G. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) NetAdapt: platform-aware neural network adaption for mobile applications. In European Conference on Computer Vision, (ECCV), Cited by: §I, §I, §I, §I, §II, §II, §III-C.
  • [55] J. Yu and T. Huang (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. CoRR abs/1903.11728. Cited by: §I, §I, §II, TABLE III.
  • [56] J. Yu and T. Huang (2019) Universally slimmable networks and improved training techniques. In 2019 IEEE International Conference on Computer Vision, (ICCV), Cited by: TABLE III.