BLK-REW: A Unified Block-based DNN Pruning Framework using Reweighted Regularization Method

01/23/2020, by Xiaolong Ma et al., Northeastern University

Accelerating DNN execution on various resource-limited computing platforms has been a long-standing problem. Prior works utilize ℓ1-based group lasso or dynamic regularization such as ADMM to perform structured pruning on DNN models to leverage parallel computing architectures. However, both the pruning dimensions and the pruning methods lack universality, which leads to degraded performance and limited applicability. To solve this problem, we propose a new block-based pruning framework that comprises a general and flexible structured pruning dimension as well as a powerful and efficient reweighted regularization method. Our framework is universal and can be applied to both CNNs and RNNs, implying complete support for the two major kinds of computation-intensive layers (i.e., CONV and FC layers). To complete all aspects of the pruning-for-acceleration task, we also integrate compiler-based code optimization into our framework, enabling DNN inference in a real-time manner. To the best of our knowledge, this is the first weight pruning framework to achieve universal coverage for both CNNs and RNNs with real-time mobile acceleration and no accuracy compromise.


1 Introduction

Deep Neural Networks (DNNs) such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been extensively adopted in various artificial intelligence (AI) systems. However, accelerating the computation-intensive DNN inference is very challenging for many AI applications, especially those with critical time constraints, such as self-driving cars [Nugraha et al.2017] and real-time translation [Gehring et al.2016].

Pruning has gained popularity due to its effectiveness in reducing model size and computation cost. In order to remove redundant weights while maintaining accuracy, many studies have been proposed regarding both the pruning dimension (DNN structure level) and the pruning method (algorithm level). According to the structure of the pruned models, there are mainly two DNN pruning approaches: non-structured pruning and structured pruning. Many recent studies [Wen et al.2016, He et al.2017] have shown that non-structured pruning is not compatible with the parallelism in hardware accelerations due to imbalanced computation and significant overhead. Structured pruning has been proposed to conquer this challenge. A structured pruned model maintains the regularity of the weight matrix, which eliminates the overhead and facilitates on-device acceleration. However, its aggressive pruning strategy causes severe information loss, making accuracy degradation non-negligible. Achieving both high accuracy and fast inference with DNN pruning is an ideal but very challenging goal.

Efforts have been made to achieve this goal. At the algorithm level, many pruning techniques have been proposed to find the uncritical weights. For non-structured pruning, prior works leverage a magnitude-based pruning method that prunes weights with small magnitudes, or use regularization to explore sparsity in DNN models. For structured pruning, the static ℓ1-based group lasso regularization is used to find regular sparse patterns in DNN models. However, the above approaches fail to find a satisfactory solution for the pruning problem due to their heuristic nature. The ADMM [Boyd et al.2011] algorithm emerges to mitigate these challenges. With a significant improvement in solution quality, ADMM pruning supersedes (almost) every pruning framework and becomes the state-of-the-art method. Nevertheless, ADMM still suffers from sub-optimal solution quality and long convergence time, especially for the long-standing problem of finding a structured sparsity solution for the Fully Connected (FC) layer. This certainly limits the usage of ADMM solutions on many CNNs and almost all RNNs, since the latter are mainly composed of FC layers.

In this paper, we present a unified pruning framework – block-based structured pruning with reweighted regularization (BLK-REW). Our efforts focus on two aspects: pruning dimension and pruning method.

Aspect 1: From the pruning dimension aspect, we propose block-based structured pruning (BLK pruning) which divides DNN layers into multiple blocks and applies structured pruning independently to each block. Our design takes a unique perspective on structured pruning, which greatly enlarges the design space by introducing a higher degree of flexibility with a changeable block shape. More importantly, the proposed BLK pruning is applicable to both CNNs and RNNs without obvious accuracy degradation, which outperforms the existing pruning dimensions. It achieves similar or even higher accuracy compared with non-structured pruning, and preserves the hardware compatibility advantage of structured pruning, with the compiler-based code optimization embedded in our pruning-acceleration framework.

Aspect 2: From the pruning method aspect, we propose to use a reweighted (REW) group lasso regularization method to generate structured sparsity. By introducing a reweighted term into the regularization, our method can perform group regularization at more precise locations in the DNN with an appropriate degree. Compared with the traditional ℓ1-based group lasso and the recently developed ADMM regularization method, the REW method acquires a significant improvement in the regularization effect (i.e., facilitating better pruning results) with a desirably short convergence time (i.e., an efficient training process), which makes it a favorable approach that naturally fits the DNN pruning problem.

We show the performance improvements of the BLK-REW framework in three ways. First, the proposed REW method can efficiently find uncritical weights. Compared with other methods, REW achieves a better weight regularization effect using significantly shorter training time. Second, the proposed BLK pruning dimension is more general and achieves extremely high compression rates in both CNNs and RNNs. Third, the proposed BLK-REW pruning naturally fits compiler optimization. Our compiler-aided acceleration framework achieves real-time inference on resource-limited mobile devices.

2 Background and Motivation

2.1 Structured Pruning Dimension

Recent works [Wen et al.2016, He et al.2017] considered incorporating regularity (i.e., filter pruning, channel pruning, etc.) into weight pruning, which generates regular and smaller weight matrices for faster execution on CPUs/GPUs. For convolution computations, weight matrices are usually transformed into general matrix multiplication (GEMM) form, as Figure 1 illustrates. Accordingly, filter pruning can also be termed row pruning since it corresponds to removing one row of the weight matrix, and channel pruning corresponds to removing multiple consecutive columns (column pruning). Current structured pruning approaches suffer from notable accuracy loss when the compression rate is high because the entire information of the pruned filter(s)/channel(s) is lost. As a result, they usually deliver limited compression rates and low accuracy, as well as limited applicability, since most works focus on CONV layers only. For FC layers (applied partially in CNNs and mainly in RNNs), structured pruning is applicable but not desirable for the same reason. The drawback is especially obvious for time-based RNNs, since a pruned row/column in an RNN is removed for all time steps, causing severe accuracy degradation.
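To make the row/column correspondence concrete, the following PyTorch sketch (illustrative only; the layer sizes and tensor names are assumptions, not code from the paper) flattens a CONV weight into its GEMM view and marks one filter and one input channel for pruning.

```python
import torch

# Hypothetical CONV layer weight: (out_channels, in_channels, kh, kw).
out_ch, in_ch, kh, kw = 64, 32, 3, 3
W = torch.randn(out_ch, in_ch, kh, kw)

# GEMM view used when lowering convolution: one row per filter, and
# kh * kw consecutive columns per input channel.
W_gemm = W.reshape(out_ch, in_ch * kh * kw)

# Filter pruning == row pruning: zero an entire row of the GEMM matrix.
W_gemm[5, :] = 0.0

# Channel pruning == column pruning: zero the kh * kw consecutive columns
# that belong to one input channel c.
c = 7
W_gemm[:, c * kh * kw:(c + 1) * kh * kw] = 0.0
```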

Figure 1: Different types of structured pruning.

2.2 Regularization-based Pruning Methods

Finding structured sparsity in a DNN model is intrinsically solving an optimization problem with structured constraints. Two mainstream methods have been proposed to solve this problem. One incorporates a static regularization term into DNN training, and the other one uses a dynamically updated regularization term during DNN training.

Static regularization was first utilized in solving non-structured pruning problems by incorporating ℓ1 regularization into DNN training. By extending ℓ1 regularization into the group lasso [Yuan and Lin2006, Wen et al.2016, He et al.2017] form, structured pruning of DNN models can also be achieved. With specified regularization dimensions (groups), it can perform different types of structured pruning (i.e., filter pruning, channel pruning, and their combination). However, this method yields limited compression rates and non-negligible accuracy degradation due to its intrinsically heuristic and non-optimized approach.
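As a minimal sketch of this static approach (assuming PyTorch, a user-chosen coefficient `lam`, and filters as the groups, i.e., rows of the GEMM view), the group lasso penalty can simply be added to the ordinary training loss:

```python
import torch

def group_lasso_penalty(weight: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Static group lasso over filters: lam * sum_g ||w_g||_2.

    Penalizing the l2 norm of every filter group pushes whole filters
    toward zero during regular training, yielding structured sparsity.
    """
    w = weight.reshape(weight.shape[0], -1)   # one row (group) per filter
    return lam * w.norm(p=2, dim=1).sum()

# Usage inside a standard training step (model, loss_fn, x, y assumed):
# loss = loss_fn(model(x), y)
# for m in model.modules():
#     if isinstance(m, torch.nn.Conv2d):
#         loss = loss + group_lasso_penalty(m.weight)
# loss.backward()
```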

Dynamic regularization methods such as ADMM pruning [Zhang et al.2018a, Ren et al.2019] usually reformulate pruning problems as optimization problems with dynamically updated regularization terms bounded by designated constraint sets (i.e., pruning with specific dimensions or with any desired weight matrix shapes). During training, ADMM can separately and iteratively solve the pruning problem. Although this method is revolutionary in its functionality and outperforms the former ones in terms of pruning rate/accuracy, a satisfactory solution cannot always be guaranteed for the non-convex (i.e., DNN loss function) problem, not to mention that this method suffers from a time-consuming training process.

2.3 Motivation

From the pruning dimension aspect, the current structured pruning dimensions suffer from major information loss. The accuracy drop is especially significant in RNN pruning. The motivation of our study is to seek an approach that maintains the regularity of the pruned model (to facilitate hardware acceleration) while restoring the flexibility of the spatial distribution of the weights (to regain high accuracy). In our proposed BLK pruning, which is applicable to both CNNs and RNNs, we take a unique step towards this goal by introducing a new pruning perspective, and avoid the pitfall of making the approach “a mere trade-off” between model accuracy and regularity. We also take a further step of compiler optimization to establish the connection between the general BLK sparsity and on-device speedups. Integrating all merits into one design, the accuracy can be similar to or even surpass that of non-structured pruning, and the on-device acceleration performance can be close to that of structured pruning.

From the regularization aspect, we emphasize that both current static and dynamic regularization methods are limited by their intrinsic shortcomings. For static regularization, the ℓ1 or group lasso regularization applied to the loss function penalizes all weights within its dimension scope through the entire network, which means some important weights are penalized to near-zero values, thereby resulting in highly impaired solutions. On the other hand, the dynamic regularization method reformulates the pruning problem as an optimization problem with hard constraints on the ℓ0 norm, and then uses ADMM to solve it. However, this method suffers from long convergence time due to the strong non-convexity of the ℓ0 norm, especially with structured hard constraints. Using ADMM in the training process also inevitably generates extremely small weights that are difficult to remove, not to mention that the hard constraints introduce a large number of hyper-parameters that need to be tuned manually for each layer, which is very inefficient. It is imperative to find an effective method that solves the optimization problem with self-adaptive regularization and soft constraints.

3 Unified and Flexible Framework of DNN Pruning-Acceleration

In this section, we propose a unified framework of DNN weight pruning, supporting (i) the flexible, block-based structured (BLK) pruning that applies to both CNN and RNN architectures, and (ii) a highly effective weight pruning algorithm with the reweighted (REW) method. Our framework also includes a general method to accelerate DNN execution by utilizing compiler-based code optimization, achieving holistic support for DNN pruning-acceleration studies.

3.1 Block-based Structured Pruning – A Unique Perspective on Structured Weight Pruning

Conventional structured pruning treats the DNN weight matrix in each layer as a whole, and prunes entire rows or columns of the whole weight matrix. However, the accuracy performance is hindered by this limited, inflexible view of structured pruning.

Figure 2: Proposed flexible, block-based structured pruning.

In our perspective, we consider the weight matrix in each layer (e.g., the GEMM or FC matrix that represents the layer-wise computation) to be composed of multiple weight blocks of the same size, as Figure 2 shows. We apply independent row and column pruning to each block, with potentially different pruning rates (numbers of pruned rows/columns) in each block, to ensure high flexibility. The remaining weights in each block still form a full matrix of a smaller size. Within our perspective, the aforementioned non-structured pruning and the state-of-the-art structured pruning are two extremes of our design, corresponding to a 1×1 block size (i.e., non-structured pruning) and a block size equal to the whole matrix (i.e., structured pruning), respectively.
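The following sketch illustrates the block partition and per-block row/column pruning described above. It is not the paper's implementation: group selection here is by simple magnitude (l2 norm) with a fixed per-block keep ratio, whereas the actual framework selects groups through the reweighted regularization of Section 3.2; the block size and keep ratio are assumed values.

```python
import torch

def blk_prune(weight: torch.Tensor, blk_rows: int = 16, blk_cols: int = 8,
              keep_ratio: float = 0.3) -> torch.Tensor:
    """Tile a 2D (GEMM/FC) weight matrix into blocks and, inside every block,
    keep only the rows and columns with the largest l2 norms. A weight
    survives only if both its row and its column are kept in that block."""
    w = weight.clone()
    n_rows, n_cols = w.shape
    for r0 in range(0, n_rows, blk_rows):
        for c0 in range(0, n_cols, blk_cols):
            blk = w[r0:r0 + blk_rows, c0:c0 + blk_cols]
            keep_r = blk.norm(dim=1).topk(max(1, int(keep_ratio * blk.shape[0]))).indices
            keep_c = blk.norm(dim=0).topk(max(1, int(keep_ratio * blk.shape[1]))).indices
            mask = torch.zeros_like(blk)
            mask[keep_r.unsqueeze(1), keep_c] = 1.0   # intersections of kept rows/columns
            w[r0:r0 + blk_rows, c0:c0 + blk_cols] = blk * mask
    return w

# Example: prune a hypothetical FC weight matrix.
# sparse_w = blk_prune(torch.randn(256, 512))
```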

We will show in our experimental results that BLK pruning is not just “a mere trade-off”, as Figure 3 shows. The reason is that pruning is processed within each block independently, so part of the weights carrying important information in each filter/channel is preserved, implying high flexibility. In the meantime, the remaining weights still maintain a certain degree of regularity, which benefits both DNN accuracy and inference acceleration. The high flexibility and regularity enabled by our approach reveal a huge design space that potentially facilitates versatile front-end systems.

Figure 3: An illustrative demonstration of the regularity and accuracy of the proposed block-based structured pruning.

3.2 Effective Regularization-based Pruning Algorithm with Reweighted Method

For an $N$-layer DNN of interest, let $W_i$ denote the collection of weights in the $i$-th layer, $i \in \{1, \dots, N\}$. According to our design of the flexible, block-based sparsity, we propose the following constraints on the pruning of $W_i$.

Constraints: Each GEMM or FC weight matrix $W_i$ is uniformly divided into $K_i$ blocks of size $p \times q$, namely, $W_i = \{W_{i,1}, W_{i,2}, \dots, W_{i,K_i}\}$, where $k \in \{1, \dots, K_i\}$ indexes the blocks. Let $[W_{i,k}]_{m,:}$ and $[W_{i,k}]_{:,n}$ denote the $m$-th row and the $n$-th column of block $W_{i,k}$, respectively.

Towards training the DNN, we minimize the loss function of the network to increase accuracy. In order to achieve structured sparsity, the common method is to add group lasso regularization [Yuan and Lin2006] to the loss function. In fact, achieving block-based row and column sparsity is also a special group lasso problem. Let $f(\{W_i\})$ denote the training loss. The classic optimization with group lasso regularization on the block-based sparsity can be formulated as

$$\underset{\{W_i\}}{\text{minimize}} \;\; f(\{W_i\}) \;+\; \lambda \sum_{i=1}^{N} \sum_{k=1}^{K_i} \Big( \sum_{m} \big\| [W_{i,k}]_{m,:} \big\|_{g} \;+\; \sum_{n} \big\| [W_{i,k}]_{:,n} \big\|_{g} \Big), \tag{1}$$

where $\lambda$ is the penalty parameter that adjusts the relative importance of accuracy and sparsity degree, and $\|\cdot\|_{g}$ denotes the group lasso computation (the $\ell_2$ norm over each group of weights). It is difficult to find a high-quality solution using this fixed regularization method (please refer to the explanation in Section 2.3). Instead, an effective dynamic regularization method dealing with such soft constraints is needed. To achieve this goal, we propose to use the reweighted method [Candes et al.2008] to solve the group lasso regularization, thereby eliminating the previous shortcoming of applying the same penalty to important and less significant weights. We formulate the following two optimization problems for block-based row pruning and column pruning.

For block-based row pruning, we solve

$$\underset{\{W_i\}}{\text{minimize}} \;\; f(\{W_i\}) \;+\; \lambda \sum_{i=1}^{N} \sum_{k=1}^{K_i} \sum_{m} \big\| [P_{i,k}^{(t)}]_{m,:} \circ [W_{i,k}]_{m,:} \big\|_F^2, \tag{2}$$

where $\circ$ denotes element-wise multiplication, $\|\cdot\|_F$ denotes the Frobenius norm, and $P_{i,k}^{(t)}$ is the collection of penalty weights (initialized from the original weights in the pre-trained model), which is updated in every iteration to help increase the degree of sparsity beyond group lasso regularization. In each iteration $t$, the solution $W_{i,k}^{(t)}$ is obtained by solving (2), and we update $P_{i,k}$ by setting

$$[P_{i,k}^{(t+1)}]_{m,:} = \frac{1}{\big\| [W_{i,k}^{(t)}]_{m,:} \big\|_F^{2} + \epsilon},$$

where $\epsilon$ is a small positive parameter that prevents division by a zero denominator.

For block-based column pruning, we solve

$$\underset{\{W_i\}}{\text{minimize}} \;\; f(\{W_i\}) \;+\; \lambda \sum_{i=1}^{N} \sum_{k=1}^{K_i} \sum_{n} \big\| [P_{i,k}^{(t)}]_{:,n} \circ [W_{i,k}]_{:,n} \big\|_F^2, \tag{3}$$

and update $P_{i,k}$ by

$$[P_{i,k}^{(t+1)}]_{:,n} = \frac{1}{\big\| [W_{i,k}^{(t)}]_{:,n} \big\|_F^{2} + \epsilon}.$$

Please note that (2) and (3) can be solved separately or simultaneously using a standard solver.

1 Initialization: Pre-trained DNN model with initialized penalty weights $P^{(0)}$; set $t = 0$ and total iteration number $T$; pre-define the block size $p \times q$;
Result: Block-based structured pruned model;
2 For each layer $i$: number of blocks $K_i \leftarrow$ (the size of $W_i$) / ($p \times q$);
3 while $t < T$ do
4        Solve (2) and/or (3) using a standard solver in SGD;
5        Update $P^{(t+1)}$ using the solution $W^{(t)}$ of (2)/(3); $t \leftarrow t + 1$;
6 end while
Remove the groups of weights close to zero and retrain the remaining non-zero weights to refine accuracy.
Algorithm 1 Reweighted regularization for block-based structured pruning

Algorithm 1 describes the general steps of the proposed REW method. We first initialize $P^{(0)}$ using the pre-trained model, and pre-define the block size for the pruned model. During DNN training, we incorporate the reweighted group lasso regularization in (2) and (3), and update the penalty weights iteratively. By updating the penalty, we “reweight” the regularization term(s) bounded in the optimization problems. After the reweighted steps, we remove the weights (or groups of weights) that are close to zero and refine the DNN using the non-zero weights.
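A minimal PyTorch sketch of this loop is given below. It is an illustration under stated assumptions, not the paper's code: only the row groups of a single FC weight are regularized, each group gets one scalar penalty, and `model`, `loader`, `loss_fn`, and all hyper-parameter values are placeholders; the penalty is derived from the pre-trained weights at initialization.

```python
import torch

def blk_row_group_sqnorms(w: torch.Tensor, blk_rows: int, blk_cols: int) -> torch.Tensor:
    """Squared Frobenius norm of every row group inside every block (flattened)."""
    sqnorms = []
    n_rows, n_cols = w.shape
    for r0 in range(0, n_rows, blk_rows):
        for c0 in range(0, n_cols, blk_cols):
            blk = w[r0:r0 + blk_rows, c0:c0 + blk_cols]
            sqnorms.append((blk ** 2).sum(dim=1))     # one value per block row
    return torch.cat(sqnorms)

def rew_penalty(w, P, blk_rows, blk_cols, lam):
    """Reweighted group lasso term: lam * sum_g P_g * ||w_g||_F^2."""
    return lam * (P * blk_row_group_sqnorms(w, blk_rows, blk_cols)).sum()

# Reweighted training loop following the flow of Algorithm 1 (sketch).
# blk_rows, blk_cols, lam, eps, T = 16, 8, 1e-4, 1e-3, 5
# W = model.fc.weight                                   # layer to be pruned
# P = 1.0 / (blk_row_group_sqnorms(W.detach(), blk_rows, blk_cols) + eps)
# opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# for t in range(T):
#     for x, y in loader:                               # train for a while per step
#         loss = loss_fn(model(x), y) + rew_penalty(W, P, blk_rows, blk_cols, lam)
#         opt.zero_grad(); loss.backward(); opt.step()
#     # "Reweight": update the penalty using the current solution.
#     P = 1.0 / (blk_row_group_sqnorms(W.detach(), blk_rows, blk_cols) + eps)
# # Afterwards, zero out the row groups whose norms are near zero and retrain.
```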

Reweighted regularization analysis: Consider two weights $w_1$ and $w_2$ with $|w_1| > |w_2|$ that are penalized by a certain regularization. The larger $w_1$ is inevitably penalized more heavily than the smaller $w_2$. Although it is easier for $w_2$ to become zero, the fact that $w_1$ is penalized still violates the original intention of weight pruning, which is to remove the “uncritical” weights. Larger weights typically serve a critical role in generating stronger activations for a more confident decision. In the REW method, $w_1$ remains un-penalized or is even rewarded, while the penalty on $w_2$ is amplified. Interestingly, our experimental results in Section 4.1 show that the importance of a (group of) weights is also related to its location, and the REW method can effectively separate those locations. We attribute this characteristic to the systematic and iterative manner of the REW method.

Reweighted training: Compared with ADMM training, which also uses an iteratively updated regularization term, the proposed REW method needs fewer training epochs for the loss to converge. For example, when pruning VGG-16 on CIFAR-10, the ADMM method usually requires 1,000 - 1,200 epochs to converge when the compression rate is around 20×. Additionally, the retraining step requires the same number of epochs to restore accuracy. In the proposed reweighted training, we only need 150 - 200 epochs for the reweighted step and 200 epochs for retraining. In the meantime, ADMM requires setting the pruning ratio and other hyper-parameters (e.g., the layer-wise penalty) manually for each layer, while the proposed REW method only requires one penalty parameter for all layers. Moreover, the soft constraints in the REW method determine the pruning ratio for the whole network automatically, which eliminates many parameters that would otherwise be set empirically.

Multiple objective functions: The original objective function in the proposed REW method targets DNN weight reduction. However, our objective function can also be formulated for operation (FLOPs) reduction, storage reduction, etc., and solved using the same REW method. Due to space limits, those formulations are not discussed here.

Figure 4: Compact weights by matrix reorder.
Figure 5: Critical weight distribution (logarithmic scale) found by the reweighted regularization method in the first FC layer of a VGG-16 model. The comparison includes (a) a pre-trained model, (b) an ℓ1-based group lasso regularized model, and (c) an ADMM regularized model.

3.3 Compiler-aided Mobile Acceleration Framework for Block-based DNN Sparsity

In order to fully leverage the block-based sparsity, we design a compiler-aided acceleration framework to deploy DNN models on the computing platform. We adopt code generation to convert a DNN model into a computational graph that is embodied in static C++ (for CPU execution) or OpenCL (for GPU execution) code, together with optimization techniques that guarantee end-to-end execution efficiency. This work uses mobile devices as the computing platform; however, the concept and principle of using a compiler to execute DNNs is universal and can be utilized on (almost) every computing device.

The compiler optimization aims to address the following performance challenges in pruned DNN execution: thread divergence and load imbalance caused by the well-known irregularity of sparse matrix multiplication. To mitigate these challenges, we propose the matrix reorder technique.

Matrix reorder: At first glance, block-based sparsity has a disordered weight distribution, which incurs significant thread divergence and load imbalance if rows are processed by different threads. Figure 4 illustrates our proposed matrix reorder technique. Since the remaining weights in certain rows and columns of each block have a certain degree of regularity, we first reorder the rows (e.g., filters in a CNN) by arranging the ones with the same or similar patterns together. Next, we compact the weights in the column direction (e.g., kernels in a CNN). Finally, the rows with the same or similar computations are grouped together. As a result, each group is processed by all threads in parallel, and each thread is in charge of multiple consecutive rows, so the computation divergence among these threads is significantly reduced. On the other hand, since the weight distribution pattern in each block is regular and known after grouping, the input matrix that corresponds to each weight group is loaded only once. The load imbalance is relieved thanks to the reduction of register-level load operations.
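As a greatly simplified stand-in for the compiler's matrix reorder pass (the actual pass operates on generated code; the grouping criterion and group size below are assumptions for illustration), the following sketch sorts rows by their non-zero workload and packs rows with similar computation into the same thread group.

```python
import torch

def matrix_reorder(sparse_w: torch.Tensor, group_size: int = 4):
    """Sort rows by their number of non-zero weights so that rows with similar
    workloads land in the same group; each group is then assigned to parallel
    threads. The returned permutation must also be applied to outputs/bias."""
    nnz_per_row = (sparse_w != 0).sum(dim=1)
    order = torch.argsort(nnz_per_row, descending=True)
    reordered = sparse_w[order]
    groups = [reordered[i:i + group_size]
              for i in range(0, reordered.shape[0], group_size)]
    return order, groups

# Example: order, groups = matrix_reorder(sparse_fc_weight)
```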

4 Experimental Results

Methodology:

In our experiments, the proposed BLK-REW pruning framework is applied to two different machine learning tasks – image classification and natural language processing (NLP). For image classification, our experiments are based on four widely used CNN structures, VGG-16 [Simonyan and Zisserman2014], ResNet-18/50 [He et al.2016], and MobileNet-V2 [Howard et al.2017], on the CIFAR-10 and ImageNet datasets; for the NLP task, we test our proposed pruning framework on a GRU with the TIMIT dataset. We train the networks on a server with eight NVIDIA Titan RTX GPUs using PyTorch [Paszke et al.2019].

In order to show the acceleration of block-based sparsity on mobile devices, we compare it with three state-of-the-art DNN acceleration frameworks: TensorFlow-Lite, TVM [Chen et al.2018], and MNN (Alibaba MNN). Our evaluations are conducted on a Samsung Galaxy S10 phone with the latest Qualcomm Snapdragon 855, which consists of a Qualcomm Kryo 485 Octa-core CPU and a Qualcomm Adreno 640 GPU.

Figure 6: Mobile CPU/GPU inference time (ms) on different network structures inferring CIFAR-10 and ImageNet images.

4.1 Critical Weights Analysis on Different Regularization Methods

We claim that the proposed REW method achieves better pruning results because it can effectively separate the uncritical weights from the critical ones. We use VGG-16 on ImageNet to generate a sparse model based on the proposed reweighted regularization method, and compare it with ℓ1-based group lasso regularization as well as ADMM regularization. To ensure fairness, all the models in the comparison use the same pruning dimension and compression rate. In this case, we use one block (i.e., we prune entire columns and rows) in each layer for all methods.

Figure 5 illustrates the difference in critical weight distributions between the REW method and the others. We first find the non-zero positions in the sparse model generated by our REW method. Using those positions, we then look up the corresponding weights and their distribution in (i) a pre-trained model, (ii) an ℓ1-based group lasso regularized model, and (iii) an ADMM regularized model. The critical weight distribution is shown in Figure 5, with the orange color denoting the original weight distribution and the blue color indicating the “critical” weights found and preserved by our method. According to the figure, we have the following analyses:

(a). In a pre-trained DNN model, some weights with small magnitudes are critical to maintaining accuracy. Therefore, pruning works that only prune small weights are rather subjective and can hardly achieve good results.

(b). In an ℓ1-based group lasso or ADMM regularized model, part of the weights are penalized to zero or near-zero values; those close-to-zero values are then pruned, and the remaining non-zero values are retrained to restore accuracy. However, the REW method considers some of the penalized weights to be critical and therefore does not prune them.

We conclude that the REW method separates critical weights in a very different way, in which the importance of a weight (or group of weights) is not only based on its value, but is also associated with its position. To prove and reinforce this conclusion, we need to show a strong accuracy improvement of the REW method over the others, which is reported in the following section.
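A minimal sketch of this position-based lookup (assuming two weight tensors taken from models with identical architecture; the tensor names are placeholders) is:

```python
import torch

def critical_weight_magnitudes(rew_pruned_w: torch.Tensor,
                               other_w: torch.Tensor) -> torch.Tensor:
    """Magnitudes, taken from another model, of the weights located at the
    positions that the REW-pruned model kept (its non-zero entries)."""
    keep_mask = rew_pruned_w != 0
    return other_w[keep_mask].abs()

# Plotting a log-scale histogram of, e.g.,
# critical_weight_magnitudes(rew_fc1_weight, pretrained_fc1_weight)
# against a histogram of pretrained_fc1_weight.abs() reproduces the kind of
# comparison shown in Figure 5.
```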

4.2 Accuracy Analysis on Overall Model Compression Results

In our previous analyses, we stressed that reweighted regularization can effectively separate critical weights, thus achieving better pruning solutions. In this part, we demonstrate the overall compression results to support this conclusion. Specifically, we prune entire rows and columns (i.e., using one block for each layer) with the REW method to compare with other methods (e.g., group lasso, ADMM, and other heuristics). Beyond one-block structured pruning, we also divide weights into several blocks to show the BLK-REW pruning results.

Table 1 and Table 2 show our pruning results on different CNN structures with the CIFAR-10 and ImageNet datasets. Table 3 shows RNN pruning results using a GRU with the TIMIT dataset. Overall, when we prune entire rows and columns using the proposed REW method, the compression results consistently outperform the baseline methods. By using the BLK-REW framework, we achieve unprecedented compression results for both CNNs and RNNs, leading to lightweight model size and computation.

Method | Base Acc. | Prune Acc. | Comp. Rate | Sparsity Scheme (Method)

ResNet-18
AMC [He et al.2018] | 90.5% | 90.2% | 2.0× | Channel (Lasso)
TinyADMM [Ma et al.2019] | 94.1% | 93.2% | – | Row+Col. (ADMM)
Ours | 94.0% | 94.0% | 18.1× | One BLK (REW)
Ours | 94.0% | 94.1% | 22.8× | BLK (REW)
Ours | 94.0% | 93.7% | 28.5× | BLK (REW)

MBNT
DCP [Zhuang et al.2018] | 94.5% | 94.7% | 1.4× | Channel (Heuristic)
Ours | 94.5% | 94.5% | 7.1× | One BLK (REW)
Ours | 94.5% | 94.5% | 8.9× | BLK (REW)
Ours | 94.5% | 93.4% | 10.3× | BLK (REW)

VGG-16
2PFPCE [Min et al.2018] | 92.9% | 92.8% | 4.0× | Row (Lasso)
TinyADMM [Ma et al.2019] | 93.7% | 93.3% | – | Row+Col. (ADMM)
Ours | 93.5% | 93.3% | 56.6× | One BLK (REW)
Ours | 93.5% | 93.5% | 50.1× | BLK (REW)
Ours | 93.5% | 93.0% | 69.7× | BLK (REW)

Table 1: BLK-REW pruning results on CIFAR-10 using VGG-16, ResNet-18, and MobileNet-V2 (MBNT).
Method | Base Top-1/5 Acc. | Prune Top-1/5 Acc. | Comp. Rate | Sparsity Scheme (Method)

ResNet-18
Network Slimming [Liu et al.2017] | 68.9/88.7% | 67.2/87.4% | 1.4× | Channel (Lasso)
DCP [Zhuang et al.2018] | 69.6/88.9% | 64.1/85.7% | 3.3× | Channel (Heuristic)
TinyADMM [Ma et al.2019] | N/A/89.1% | N/A/88.4% | 3.3× | Row+Col. (ADMM)
StructADMM [Zhang et al.2018b] | 69.9%/N/A | 68.8%/N/A | 3.0× | Col. (ADMM)
Ours | 69.9/89.1% | 69.0/88.5% | 4.0× | One BLK (REW)
Ours | 69.9/89.1% | 69.2/88.9% | 4.0× | BLK (REW)
Ours | 69.9/89.1% | 66.6/87.1% | 7.6× | BLK (REW)

MBNT
AMC [He et al.2018] | 71.8%/N/A | 70.8%/N/A | 1.4× | Channel (Lasso)
Ours | 70.9/90.4% | 70.5/89.8% | 1.6× | One BLK (REW)
Ours | 70.9/90.4% | 70.0/89.7% | 2.0× | BLK (REW)

VGG-16
Decorrelation [Zhu et al.2018] | 73.1%/N/A | 73.2%/N/A | 3.9× | Row (Group Lasso)
APoZ [Hu et al.2016] | N/A/88.4% | 66.2/87.6% | 2.0× | Channel (Heuristic)
AutoADMM [Liu et al.2019] | N/A/92.1% | N/A/91.5% | 6.4× | Row+Col. (ADMM)
Ours | 74.5/91.7% | 74.0/91.5% | 6.5× | One BLK (REW)
Ours | 74.5/91.7% | 74.4/91.6% | 3.1× | BLK (REW)
Ours | 74.5/91.7% | 73.8/91.2% | 7.8× | BLK (REW)

Table 2: BLK-REW pruning results on ImageNet using VGG-16, ResNet-18, and MobileNet-V2 (MBNT).
Method | Base PER | Prune PER | Comp. Rate | Sparsity Scheme (Method) | Exe. Time (ms) CPU/GPU
ESE [Han et al.2017] | 20.40% | 20.70% | 8.0× | Irregular (Heuristic) | N/A
C-LSTM [Wang et al.2018] | 24.15% | 25.48% | 16.0× | Block-circ. | N/A
E-RNN [Li et al.2019] | 20.02% | 20.20% | 8.0× | Block-circ. | N/A
Ours | 18.8% | 18.8% | 19.1× | BLK (REW) | 0.97/0.50
Ours | 18.8% | 23.2% | 112.9× | BLK (REW) | 0.35/0.25
Ours | 18.8% | 24.0% | 231.3× | BLK (REW) | 0.21/0.09

Table 3: BLK-REW pruning results comparison on GRU with the TIMIT dataset. PER denotes phone error rate; execution time is reported in ms.

4.3 Performance Evaluation on Mobile Devices

Execution time results are shown in Figure 6. We test the BLK pruned models on mobile CPU/GPU. To ensure fairness, all frameworks use the same block-based sparse model, and we enable the fully optimized configurations of TFLite, TVM, and MNN (e.g., Winograd optimization is turned on). All test models are the ones with the largest compression rates in Table 1 and Table 2. For GRU RNN execution, since the other frameworks do not support end-to-end execution on mobile devices, we only report the execution times of the proposed block-based sparse models in Table 3. Our approach achieves significant acceleration on mobile devices compared with the other frameworks. For image classification tasks, all of our results on mobile GPU exceed the real-time requirement (i.e., usually 33 ms/frame). For NLP tasks, the proposed framework also achieves real-time speech recognition.

5 Conclusion

This paper presents BLK-REW, a block-based DNN structured pruning framework using the reweighted regularization method. The proposed block-based structured sparsity is flexible and can be used in both CNN and RNN applications. With the support of compiler code generation and optimization, our framework achieves real-time acceleration on many devices. The proposed framework also uses the reweighted method to dynamically update the regularization process, which improves the pruning results effectively within a considerably shorter training time. Compared with state-of-the-art pruning methods, the proposed framework is general and achieves high performance.

References