1 Introduction
Deep Neural Networks (DNNs) such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been extensively adopted in various artificial intelligence (AI) systems. However, accelerating computationally intensive DNN inference is very challenging for many AI applications, especially those with critical time constraints, such as self-driving cars [Nugraha et al.2017] and real-time translation [Gehring et al.2016].
Pruning has gained popularity due to its effectiveness in reducing model size and computation cost. To remove redundant weights while maintaining accuracy, many studies have addressed both the pruning dimension (DNN structure level) and the pruning method (algorithm level). According to the structure of the pruned models, there are two main DNN pruning approaches: non-structured pruning and structured pruning. However, many recent studies [Wen et al.2016, He et al.2017] have shown that non-structured pruning is not compatible with the parallelism in hardware acceleration due to imbalanced computation and significant overhead. Structured pruning has been proposed to overcome this challenge. A structured pruned model maintains the regularity of the weight matrix, which eliminates the overhead and facilitates on-device acceleration. However, the aggressive pruning strategy causes severe information loss, making accuracy degradation non-negligible. Achieving both high accuracy and fast inference with DNN pruning is an ideal but very challenging goal.
Efforts have been made to achieve this goal. At the algorithm level, many pruning techniques have been proposed to find the uncritical weights. For non-structured pruning, prior works leverage a magnitude-based pruning method that prunes weights with small magnitudes, or use regularization to explore sparsity in DNN models. For structured pruning, static group lasso regularization is used to find regular sparse patterns in DNN models. However, the above approaches fail to find a satisfactory solution to the pruning problem due to their heuristic nature. The ADMM algorithm [Boyd et al.2011] emerged to mitigate these challenges. With a significant improvement in solution quality, ADMM pruning supersedes (almost) every pruning framework and has become the state-of-the-art method. Nevertheless, ADMM still suffers from suboptimal solution quality and long convergence time, especially for the long-standing problem of finding a structured sparsity solution for Fully Connected (FC) layers. This limits the use of ADMM solutions on many CNNs and almost all RNNs, since they are mainly composed of FC layers.
In this paper, we present a unified pruning framework: block-based structured pruning with reweighted regularization (BLK-REW). Our efforts focus on two aspects: pruning dimension and pruning method.
Aspect 1: From the pruning dimension aspect, we propose block-based structured pruning (BLK pruning), which divides DNN layers into multiple blocks and applies structured pruning independently to each block. Our design takes a unique perspective on structured pruning, which greatly enlarges the design space by introducing a higher degree of flexibility with a changeable block shape. More importantly, the proposed BLK pruning is applicable to both CNNs and RNNs without obvious accuracy degradation, which outperforms the existing pruning dimensions. It achieves similar or even higher accuracy compared with non-structured pruning, and preserves the hardware compatibility advantage of structured pruning through the compiler-based code optimization embedded in our pruning-acceleration framework.
Aspect 2: From the pruning method aspect, we propose to use the reweighted (REW) group lasso regularization method to generate structured sparsity. By introducing a reweighted term into the regularization, our method can perform group regularization at a more precise location in the DNN and with an appropriate degree. Compared with the traditional group lasso and the recently developed ADMM regularization method, the REW method significantly improves the regularization effect (i.e., it facilitates better pruning results) with a desirably short convergence time (i.e., an efficient training process), which makes it a favorable approach that naturally fits DNN pruning problems.
We show the performance improvements of the BLK-REW framework in three ways. First, the proposed REW method can efficiently find uncritical weights. Compared with other methods, REW achieves a better weight regularization effect using significantly shorter training time. Second, the proposed BLK pruning dimension is more general and achieves extremely high compression rates on both CNNs and RNNs. Third, the proposed BLK-REW pruning naturally fits compiler optimization. Our compiler-aided acceleration framework achieves real-time inference on resource-limited mobile devices.
2 Background and Motivation
2.1 Structured Pruning Dimension
Recent works [Wen et al.2016, He et al.2017] incorporate regularity (e.g., filter pruning, channel pruning) in weight pruning, which generates regular and smaller weight matrices for faster execution on CPUs/GPUs. For convolution computations, weight matrices are usually transformed into general matrix multiplication (GEMM) form, as Figure 1 illustrates. Accordingly, filter pruning can also be termed row pruning since it corresponds to removing one row of the weight matrix, and channel pruning corresponds to removing multiple consecutive columns (column pruning). Current structured pruning approaches suffer from notable accuracy loss when the compression rate is high because the entire information of the pruned filter(s)/channel(s) is lost. As a result, structured pruning usually has limited compression rates and low accuracy, as well as limited applicability, since most works focus on CONV layers only. For FC layers (applied partially in CNNs and mainly in RNNs), structured pruning is applicable but not desirable for the same reason. The drawback is especially obvious for time-based RNNs, since a pruned row/column in an RNN is unavailable at every time step, causing severe accuracy degradation.
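The row/column correspondence described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: it assumes the GEMM matrix has one row per filter and that each row is flattened channel-major (channel, then kernel height, then kernel width), a common layout that the text does not specify. The helper names `channel_columns`, `prune_filter` and `prune_channel` are hypothetical.

```python
def channel_columns(channel, kernel_h, kernel_w):
    """Column indices of the GEMM matrix that hold one input channel
    (assumes channel-major flattening of each filter)."""
    width = kernel_h * kernel_w
    start = channel * width
    return list(range(start, start + width))


def prune_filter(gemm, filter_idx):
    """Filter pruning: drop one row of the (filters x C*kh*kw) matrix."""
    return [row for r, row in enumerate(gemm) if r != filter_idx]


def prune_channel(gemm, channel, kernel_h, kernel_w):
    """Channel pruning: drop the consecutive columns of one input channel."""
    drop = set(channel_columns(channel, kernel_h, kernel_w))
    return [[w for c, w in enumerate(row) if c not in drop] for row in gemm]


# Toy layer: 4 filters, 3 input channels, 2x2 kernels -> a 4 x 12 GEMM matrix.
gemm = [[float(r * 12 + c) for c in range(12)] for r in range(4)]
after_filter = prune_filter(gemm, 1)         # one row removed
after_channel = prune_channel(gemm, 2, 2, 2)  # four consecutive columns removed
print(len(after_filter), len(after_filter[0]))
print(len(after_channel), len(after_channel[0]))
```

In both cases the result is still a dense, smaller matrix, which is why these schemes map well to hardware, and also why all information in the removed row or column block is lost at once.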
2.2 Regularization-based Pruning Methods
Finding structured sparsity in a DNN model is intrinsically solving an optimization problem with structured constraints. Two mainstream methods have been proposed to solve this problem. One incorporates a static regularization term into DNN training, and the other one uses a dynamically updated regularization term during DNN training.
Static regularization was first used to solve non-structured pruning problems by incorporating regularization into DNN training. By extending regularization into group lasso [Yuan and Lin2006, Wen et al.2016, He et al.2017] form, structured pruning on DNN models can also be achieved. With specified regularization dimensions (groups), it can perform different types of structured pruning (e.g., filter pruning, channel pruning, and their combination). However, this method yields limited compression rates and non-negligible accuracy degradation because the approach is intrinsically heuristic and non-optimized.
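As a concrete reading of the static group lasso term, the sketch below computes the penalty for row groups (filters) and column groups of a small weight matrix. It uses the plain sum of group l2 norms, omitting the group-size scaling that some formulations include; the penalty strength `lam` and the function names are illustrative assumptions, not values or APIs from the paper.

```python
import math


def l2_norm(values):
    """Euclidean norm of one group of weights."""
    return math.sqrt(sum(v * v for v in values))


def group_lasso_rows(matrix):
    """Group lasso with one group per row (e.g., per filter)."""
    return sum(l2_norm(row) for row in matrix)


def group_lasso_cols(matrix):
    """Group lasso with one group per column."""
    return sum(l2_norm(col) for col in zip(*matrix))


weights = [[3.0, 4.0],
           [0.0, 0.0],
           [0.6, 0.8]]
lam = 0.01  # hypothetical penalty strength

# The same fixed coefficient `lam` multiplies every group, important or not.
row_term = lam * group_lasso_rows(weights)
col_term = lam * group_lasso_cols(weights)
print(row_term, col_term)
```

Because the coefficient is fixed and shared, a group with a large norm contributes the largest share of the penalty, which is the behavior the reweighted method later tries to correct.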
Dynamic regularization methods such as ADMM pruning [Zhang et al.2018a, Ren et al.2019] usually reformulate pruning problems as optimization problems with dynamically updated regularization terms bounded by the designated constraint sets (i.e., pruning with specific dimensions or with any desired weight matrix shapes). During training, ADMM solves the pruning problem separately and iteratively. Although this method is revolutionary in its functionality and outperforms the former ones in terms of pruning rate and accuracy, a satisfactory solution cannot always be guaranteed for the non-convex problem (i.e., the DNN loss function), and the method suffers from a time-consuming training process.
2.3 Motivation
From the pruning dimension aspect, the current structured pruning dimensions suffer from major information loss. The accuracy drop is especially significant in RNN pruning. The motivation of our study is to seek an approach that maintains the regularity of the pruned model (to facilitate hardware acceleration) while restoring the flexibility of the spatial distribution of the weights (to regain high accuracy). With our proposed BLK pruning, which is applicable to both CNNs and RNNs, we take a unique step towards this goal by introducing a new pruning perspective, and avoid the pitfall of making this approach “a mere trade-off” between model accuracy and regularity. We also take a further step of compiler optimization to establish the connection between the general BLK sparsity and on-device speedups. Integrating all merits into one design, the accuracy can be similar to or even surpass that of non-structured pruning, and the on-device acceleration performance can be close to that of structured pruning.
From the regularization aspect, we emphasize that both current static and dynamic regularization methods are limited by their intrinsic shortcomings. For static regularization, the $\ell_1$ or group lasso regularization applied to the loss function penalizes all weights in its dimension scope throughout the entire network, which means some important weights are penalized to near-zero values, thereby resulting in highly impaired solutions. On the other hand, the dynamic regularization method reformulates the pruning problem as an optimization problem with hard constraints on the $\ell_0$ norm, and then uses ADMM to solve it. However, this method suffers from long convergence time due to the strong non-convexity of the $\ell_0$ norm, especially with structured hard constraints. Using ADMM in the training process also inevitably generates extremely small weights that are difficult to remove. In addition, the hard constraints introduce a large number of hyperparameters that need to be tuned manually for each layer, which is very inefficient. It is imperative to find an effective method to solve the optimization problem with self-adaptive regularization and soft constraints.
3 Unified and Flexible Framework of DNN Pruning-Acceleration
In this section, we propose a unified framework of DNN weight pruning, supporting (i) the flexible, block-based structured (BLK) pruning that applies to both CNN and RNN architectures, and (ii) a highly effective weight pruning algorithm with the reweighted (REW) method. Our framework also includes a general method to accelerate DNN execution by utilizing compiler-based code optimization, providing holistic support for DNN pruning-acceleration studies.
3.1 Block-based Structured Pruning – A Unique Perspective on Structured Weight Pruning
Conventional structured pruning treats the DNN weight matrix in each layer as a whole, and chooses to prune a whole row or column of the entire weight matrix. However, accuracy is hindered by this limited, inflexible view of structured pruning.
In our perspective, we consider the weight matrix in each layer (e.g., a GEMM or FC matrix, representing different types of layer-wise computation) to be composed of multiple weight blocks of the same size, as Figure 2 shows. We apply independent row and column pruning to each block, with potentially different pruning rates (numbers of pruned rows/columns) in each block, to ensure high flexibility. The remaining weights in each block still form a full matrix with a smaller size. From this perspective, the aforementioned non-structured pruning and the state-of-the-art structured pruning are the two extremes of our design, with a block size of $1 \times 1$ (i.e., non-structured pruning) and a block size equal to the whole matrix (i.e., structured pruning).
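To make the block view concrete, the following sketch partitions a weight matrix into equally sized blocks and keeps a fixed number of rows and columns inside each block independently. It is a simplified illustration under stated assumptions: rows and columns are ranked by squared magnitude as a stand-in selection rule (the paper selects them through reweighted regularization instead), the same keep counts are used in every block, and the names `top_k` and `prune_blockwise` are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k largest scores (ties keep the lower index)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def prune_blockwise(matrix, block_rows, block_cols, keep_rows, keep_cols):
    """Keep `keep_rows` rows and `keep_cols` columns inside every block."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    assert n_rows % block_rows == 0 and n_cols % block_cols == 0
    pruned = [[0.0] * n_cols for _ in range(n_rows)]
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            block = [row[c0:c0 + block_cols] for row in matrix[r0:r0 + block_rows]]
            # Stand-in importance score: squared norm of each row / column.
            rows = top_k([sum(w * w for w in row) for row in block], keep_rows)
            cols = top_k([sum(w * w for w in col) for col in zip(*block)], keep_cols)
            for r in rows:
                for c in cols:
                    pruned[r0 + r][c0 + c] = block[r][c]
    return pruned


weights = [[1.0, 2.0, 0.1, 0.2],
           [0.1, 0.1, 3.0, 4.0]]

# One block covering the whole matrix: conventional row pruning.
whole = prune_blockwise(weights, 2, 4, keep_rows=1, keep_cols=4)
# Two 2x2 blocks: each block keeps its own most important row.
blocked = prune_blockwise(weights, 2, 2, keep_rows=1, keep_cols=2)
print(whole)
print(blocked)
```

With a single block, one entire row is lost, including its two large entries. With 2x2 blocks, half of the weights are still removed, but each block retains its own dominant row, so the large entries of both rows survive.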
We will show in our experimental results that BLK pruning is not just “a mere trade-off”, as Figure 3 shows. The reason is that pruning is performed within each block independently, so part of the weights carrying important information in each filter/channel is preserved, which implies high flexibility. Meanwhile, the remaining weights still maintain a certain degree of regularity. This is beneficial to both DNN accuracy and inference acceleration. The high flexibility and regularity enabled by our approach reveal a huge design space that potentially facilitates versatile front-end systems.
3.2 Effective Regularization-based Pruning Algorithm with Reweighted Method
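For an $N$-layer DNN of interest, let $\mathbf{W}_i$ denote the collection of weights for the $i$-th layer, i.e., $\mathbf{W} = \{\mathbf{W}_i\}_{i=1}^{N}$. According to our design of the flexible, block-based sparsity, we propose the following constraints on the pruning of $\mathbf{W}_i$.
Constraints: Each $\mathbf{W}_i$ will be uniformly divided into $K$ blocks with the size of $p \times q$ in each of the GEMM or FC matrices, namely, $\mathbf{W}_i = [\mathbf{W}_{i1}, \mathbf{W}_{i2}, \ldots, \mathbf{W}_{iK}]$, where $\mathbf{W}_{ij} \in \mathbb{R}^{p \times q}$. Let $[\mathbf{W}_{ij}]_{m,:}$ and $[\mathbf{W}_{ij}]_{:,n}$ denote the $m$-th row and the $n$-th column of $\mathbf{W}_{ij}$, respectively.
When training the DNN, we minimize the loss function of the network to increase accuracy. To achieve structured sparsity, the common method is to add group lasso regularization [Yuan and Lin2006] to the loss function. In fact, achieving block-based row and column sparsity is also a special group lasso problem. Let $f(\mathbf{W})$ denote the training loss. The classic optimization with group lasso regularization on the block-based sparsity can be formulated as
$$\min_{\mathbf{W}} \; f(\mathbf{W}) + \lambda \sum_{i=1}^{N} R(\mathbf{W}_i), \tag{1}$$
where $\lambda$ is the penalty parameter that adjusts the relative importance of accuracy and sparsity degree, and $R(\cdot)$ denotes the group lasso computation. It is difficult to find a high-quality solution using this fixed regularization method (please refer to the explanation in Section 2.3). Instead, an effective dynamic regularization method dealing with such soft constraints is needed. To achieve this goal, we propose to use the reweighted method [Candes et al.2008] to solve group lasso regularization, thereby eliminating the previous shortcoming of applying the same penalty to important and less significant weights. We formulate the following two optimization problems for block-based row pruning and column pruning.
For block-based row pruning, we solve
$$\min_{\mathbf{W}} \; f(\mathbf{W}) + \lambda \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{m=1}^{p} \big\| [\mathbf{P}_{ij}^{(t)} \circ \mathbf{W}_{ij}]_{m,:} \big\|_F^2, \tag{2}$$
where $\circ$ denotes element-wise multiplication, $\|\cdot\|_F$ denotes the Frobenius norm and $\mathbf{P}_{ij}^{(t)}$ is the collection of penalty weights (initialized from the original weights in the pretrained model), which is updated in every iteration to help increase the degree of sparsity beyond group lasso regularization. In each iteration $t$, the solution of $\mathbf{W}_{ij}$ is given by $\mathbf{W}_{ij}^{(t)}$ and we update $\mathbf{P}_{ij}^{(t+1)}$ by setting every entry of its $m$-th row to
$$[\mathbf{P}_{ij}^{(t+1)}]_{m,:} = \frac{1}{\big\| [\mathbf{W}_{ij}^{(t)}]_{m,:} \big\|_F^2 + \epsilon},$$
where $\epsilon$ is a small parameter that prevents division by zero.
For block-based column pruning, we solve
$$\min_{\mathbf{W}} \; f(\mathbf{W}) + \lambda \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{n=1}^{q} \big\| [\mathbf{P}_{ij}^{(t)} \circ \mathbf{W}_{ij}]_{:,n} \big\|_F^2, \tag{3}$$
and update $\mathbf{P}_{ij}^{(t+1)}$ by setting every entry of its $n$-th column to
$$[\mathbf{P}_{ij}^{(t+1)}]_{:,n} = \frac{1}{\big\| [\mathbf{W}_{ij}^{(t)}]_{:,n} \big\|_F^2 + \epsilon}.$$
Please note that (2) and (3) can be solved separately or simultaneously using a standard solver.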
Algorithm 1 describes the general steps of the proposed REW method. We first initialize the penalty weights using the pretrained model, and predefine the block size for the pruned model. During DNN training, we incorporate the reweighted group lasso regularization in (2) and (3), and update the penalty weights iteratively. By updating the penalty, we “reweight” the regularization term(s) bounded in the optimization problems. After the reweighted steps, we remove the weights (or groups of weights) that are close to zero and refine the DNN using the non-zero weights.
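The steps just described (initialize the penalties from the pretrained model, alternate training with reweighting, then prune and refine) can be sketched on a toy problem. The sketch below is not Algorithm 1 itself. It assumes a row-wise penalty of the form lam * ||p * w||^2 with p refreshed as 1 / (||w||^2 + eps), and it replaces DNN training with a toy quadratic loss 0.5 * ||w - w0||^2 whose penalized minimizer is available in closed form. All names and the values of `lam`, `eps`, `steps` and `threshold` are illustrative assumptions.

```python
def sq_norm(row):
    """Squared l2 norm of one row of weights."""
    return sum(w * w for w in row)


def penalty_weight(row, eps):
    """Reweighting rule assumed here: 1 / (||row||^2 + eps)."""
    return 1.0 / (sq_norm(row) + eps)


def solve_row(pretrained_row, penalty, lam):
    """Exact minimizer of 0.5*||w - w0||^2 + lam*||penalty*w||^2,
    used as a stand-in for a round of DNN training."""
    scale = 1.0 / (1.0 + 2.0 * lam * penalty ** 2)
    return [scale * w for w in pretrained_row]


def reweighted_row_pruning(pretrained, lam=0.01, eps=1e-3, steps=3, threshold=1e-6):
    # Initialize the penalty weights from the pretrained model.
    penalties = [penalty_weight(row, eps) for row in pretrained]
    weights = [list(row) for row in pretrained]
    for _ in range(steps):
        # "Train" under the current penalties, then reweight them.
        weights = [solve_row(w0, p, lam) for w0, p in zip(pretrained, penalties)]
        penalties = [penalty_weight(row, eps) for row in weights]
    # Remove rows that were driven close to zero.
    kept = [m for m, row in enumerate(weights) if sq_norm(row) >= threshold]
    # Refine the survivors; for this toy loss, retraining without the
    # penalty simply restores their pretrained values.
    refined = [list(pretrained[m]) if m in kept else [0.0] * len(pretrained[m])
               for m in range(len(pretrained))]
    return kept, refined


block = [[2.0, 1.0],   # large-norm row
         [0.2, 0.1]]   # small-norm row
kept_rows, refined_block = reweighted_row_pruning(block)
print(kept_rows)
print(refined_block)
```

In this toy setting the penalty on the small-norm row grows with every reweighting step until the row collapses, while the large-norm row is left almost untouched. A real implementation would replace `solve_row` with epochs of gradient-based training on the regularized loss.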
Reweighted regularization analysis: Consider two weights $w_i$ and $w_j$ ($w_i < w_j$) that are penalized by a certain regularization. The larger $w_j$ is inevitably penalized more heavily than the smaller $w_i$. Although it is easier for $w_i$ to become zero, the fact that $w_j$ is penalized still violates the original intention of weight pruning, which is to remove the “uncritical” weights. Larger weights typically play a critical role in generating stronger activations for a more confident decision. In the REW method, $w_j$ remains unpenalized or is even rewarded, while the penalty on $w_i$ is amplified. Interestingly, our experimental results in Section 4.1 show that the importance of a weight (or group of weights) is also related to its location, and the REW method can effectively separate those locations. We attribute this characteristic to the systematic and iterative manner of the REW method.
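A two-weight example makes the contrast explicit. The sketch below uses the scalar reweighted l1 form of [Candes et al.2008], where the coefficient of each weight is 1 / (|w| + eps) computed from its previous value. This is an illustrative simplification of the group-wise rule, and the weight values and `eps` are arbitrary assumptions.

```python
def static_l1_penalty(w):
    """Static l1 penalty: grows with the magnitude of the weight."""
    return abs(w)


def reweighted_coefficient(w_previous, eps=1e-3):
    """Reweighted l1 coefficient computed from the previous iterate."""
    return 1.0 / (abs(w_previous) + eps)


small, large = 0.1, 2.0

# Static regularization: the large weight carries the larger penalty.
print(static_l1_penalty(small), static_l1_penalty(large))

# Reweighted regularization: the coefficient (and hence the shrinkage
# force on the weight) is much larger for the small weight.
coeff_small = reweighted_coefficient(small)
coeff_large = reweighted_coefficient(large)
print(coeff_small, coeff_large, coeff_small / coeff_large)
```

With these values the small weight receives a coefficient roughly twenty times that of the large one, so successive iterations push the small weight towards zero while leaving the large weight nearly free.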
Reweighted training: Compared with ADMM training, which also uses an iterative updating scheme for the regularization term, the proposed REW method needs fewer training epochs for the loss to converge. For example, when pruning VGG16 on CIFAR10, the ADMM method usually requires 1,000 to 1,200 epochs to converge when the compression rate is around 20$\times$. Additionally, the retraining step requires the same number of epochs to restore accuracy. In the proposed reweighted training, we only need 150 to 200 epochs for the reweighted step and 200 epochs for retraining. Moreover, ADMM requires the pruning ratio and other hyperparameters (e.g., the layer-wise penalty) to be set manually for each layer, while the proposed REW method requires only one penalty parameter for all layers. The soft constraints in the REW method also determine the pruning ratio for the whole network automatically, which eliminates many parameters that would otherwise need to be set empirically.
Multiple objective functions: The original objective function in the proposed REW method targets DNN weight reduction. However, our objective function can also be formulated for operation (FLOPs) reduction, storage reduction, etc., and solved using the same REW method. Due to space limits, those formulations are not discussed.
3.3 Compiler-aided Mobile Acceleration Framework for Block-based DNN Sparsity
To fully leverage the block-based sparsity, we design a compiler-aided acceleration framework to deploy DNN models on the computing platform. We adopt code generation to convert a DNN model into a computational graph that is embodied by static C++ (for CPU execution) or OpenCL (for GPU execution) code, together with optimization techniques that guarantee end-to-end execution efficiency. This work uses mobile devices as the computing platform. However, the concept and principle of using a compiler to execute DNNs is universal and can be utilized on (almost) every computing device.
The compiler optimization aims to address the following performance challenges in pruned DNN execution: thread divergence and load imbalance, which are well-known challenges of sparse matrix multiplication. To mitigate them, we propose the matrix reorder technique.
Matrix reorder: At first glance, the block-based sparsity has a disordered weight distribution, which incurs significant thread divergence and load imbalance if rows are processed by different threads. Figure 4 illustrates our proposed matrix reorder technique. As the remaining weights that appear in certain rows and columns in each block have a certain degree of regularity, we first reorder the rows (e.g., filters in a CNN) by arranging the ones with the same or similar patterns together. Next, we compact the weights in the column direction (e.g., kernels in a CNN). Finally, the rows with the same or similar computations are grouped together. As a result, each group is processed by all threads in parallel, and each thread is in charge of multiple consecutive rows. Thus, the computation divergence among these threads is significantly reduced. In addition, since the weight distribution pattern in each block is regular and known after grouping, the input matrix that corresponds to each weight group is loaded only once. The load imbalance is relieved thanks to the reduction in register-level load operations.
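The reorder, compact and group steps can be illustrated with a small sketch. It is a simplified model rather than the compiler pass itself: it only groups rows whose non-zero column pattern is identical (the text also allows similar patterns), it operates on one block, and the function names and the `plan` dictionary layout are hypothetical.

```python
from collections import defaultdict


def row_pattern(row):
    """Column indices of the non-zero weights in one row."""
    return tuple(c for c, w in enumerate(row) if w != 0.0)


def reorder_and_compact(matrix):
    """Group rows sharing a sparsity pattern and drop their zero columns."""
    groups = defaultdict(list)
    for r, row in enumerate(matrix):
        groups[row_pattern(row)].append(r)
    plan = []
    for pattern, rows in sorted(groups.items()):
        dense = [[matrix[r][c] for c in pattern] for r in rows]
        plan.append({"rows": rows, "cols": list(pattern), "weights": dense})
    return plan


def grouped_matvec(plan, x, n_rows):
    """Multiply using the plan; inputs are gathered once per group."""
    y = [0.0] * n_rows
    for group in plan:
        x_used = [x[c] for c in group["cols"]]
        for r, dense_row in zip(group["rows"], group["weights"]):
            y[r] = sum(w * v for w, v in zip(dense_row, x_used))
    return y


sparse = [[1.0, 0.0, 2.0, 0.0],
          [0.0, 3.0, 0.0, 4.0],
          [5.0, 0.0, 6.0, 0.0],
          [0.0, 7.0, 0.0, 8.0]]
plan = reorder_and_compact(sparse)
print([g["rows"] for g in plan])
print(grouped_matvec(plan, [1.0, 1.0, 1.0, 1.0], len(sparse)))
```

Rows 0 and 2 end up in one group and rows 1 and 3 in another. Every row inside a group performs the same amount of work on the same gathered inputs, which is the property that reduces divergence and redundant loads.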
4 Experimental Results
Methodology: In our experiments, the proposed BLK-REW pruning framework is applied to two different machine learning tasks: image classification and natural language processing (NLP). For image classification, our experiments are based on four widely used CNN structures, VGG16 [Simonyan and Zisserman2014], ResNet18/50 [He et al.2016] and MobileNetV2 [Howard et al.2017], on the CIFAR10 and ImageNet datasets. For the NLP task, we test our proposed pruning framework on a GRU with the TIMIT dataset. We train the networks on a server with eight NVIDIA Titan RTX GPUs using PyTorch [Paszke et al.2019].
To show the acceleration of block-based sparsity on mobile devices, we compare it with three state-of-the-art DNN acceleration frameworks: TensorFlow-Lite [Ten], TVM [Chen et al.2018], and MNN [Ali]. Our evaluations are conducted on a Samsung Galaxy S10 phone with the latest Qualcomm Snapdragon 855, which consists of a Qualcomm Kryo 485 octa-core CPU and a Qualcomm Adreno 640 GPU.
4.1 Critical Weights Analysis on Different Regularization Methods
We state that the proposed REW method can achieve better pruning results. The reason is that our method can effectively separate the uncritical weights from the critical ones. We use VGG16 on ImageNet to generate a sparse model based on the proposed reweighted regularization method, and compare it with group lasso regularization as well as ADMM regularization. To ensure fairness, all the models in the comparison use the same pruning dimension and compression rate. In this case, we use one block (i.e., we prune entire columns and rows) in each layer for all methods.
Figure 5 illustrates the difference in the distribution of critical weights between the REW method and the others. We first find the positions of the non-zero values in the sparse model generated by our REW method. Using those positions, we find the corresponding weights and their distribution in (i) a pretrained model, (ii) a group lasso regularized model and (iii) an ADMM regularized model. The critical weight distribution is shown in Figure 5, with the orange color denoting the distribution of the original weights and the blue color indicating the “critical” weights found and preserved by our method. From the figure, we make the following observations:
(a). In a pretrained DNN model, some weights with small magnitude are critical to maintaining accuracy. Therefore, pruning works that only prune small weights are subjective and unlikely to achieve good results.
(b). In a group lasso or ADMM regularized model, part of the weights are penalized to zero or near-zero values; those close-to-zero values are then pruned, and the remaining non-zero values are retrained to restore accuracy. However, the REW method considers some of the weights that have been penalized to be critical, and therefore they should not be pruned.
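The position-matching procedure behind this comparison (take the non-zero positions of the REW-pruned model, then read the weights of another model at those positions) can be sketched as follows. The matrices are made-up toy values, and the function names and the magnitude threshold are hypothetical.

```python
def nonzero_positions(sparse_matrix):
    """(row, column) positions that survive in the pruned model."""
    return [(r, c)
            for r, row in enumerate(sparse_matrix)
            for c, w in enumerate(row) if w != 0.0]


def gather(matrix, positions):
    """Weights of another model at the given positions."""
    return [matrix[r][c] for r, c in positions]


def share_below(values, threshold):
    """Fraction of gathered weights with magnitude below `threshold`."""
    return sum(abs(v) < threshold for v in values) / len(values)


# Toy stand-ins for one layer of the REW-pruned and the pretrained model.
rew_sparse = [[0.0, 0.9],
              [0.4, 0.0]]
pretrained = [[0.5, 0.05],
              [0.3, -0.7]]

critical = gather(pretrained, nonzero_positions(rew_sparse))
print(critical)
print(share_below(critical, 0.1))
```

In this toy case one of the two preserved positions holds a small pretrained weight (0.05), mirroring observation (a): a purely magnitude-based rule would have removed it.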
We conclude that REW method separates critical weights in a very different way, in which the importance of weight(s) is not only based on its value, but also associated with its position. To prove and reinforce our conclusion, we need to show a strong accuracy improvement of the REW method compared with others, which is reported in the following section.
4.2 Accuracy Analysis on Overall Model Compression Results
In our previous analyses, we stress that reweighted regularization can effectively separate critical weights, thus achieving better pruning solutions. In this part, we present the overall compression results to support our conclusion. Specifically, we prune entire rows and columns (i.e., using one block for each layer) with the REW method to compare with other methods (e.g., lasso, ADMM and other heuristics). Beyond one-block structured pruning, we also divide the weights into several blocks to show BLK-REW pruning results.
Table 1 and Table 2 show our pruning results using different CNN structures with the CIFAR10 and ImageNet datasets. Table 3 shows RNN pruning results using a GRU with the TIMIT dataset. Overall, when we prune entire rows and columns using the proposed REW method, the compression results consistently outperform the baseline methods. By using the BLK-REW framework, we achieve better compression results for both CNNs and RNNs, leading to lightweight model size and computation.






Table 1: CNN pruning results on CIFAR10.
Method | Base accuracy | Pruned accuracy | Compression rate ($\times$) | Pruning dimension (method)
ResNet18
AMC [He et al.2018] | 90.5% | 90.2% | 2.0 | Channel (Lasso)
TinyADMM [Ma et al.2020] | 94.1% | 93.2% |  | Row+Col. (ADMM)
Ours | 94.0% | 94.0% | 18.1 | One BLK (REW)
Ours | 94.0% | 94.1% | 22.8 | BLK (REW)
Ours | 94.0% | 93.7% | 28.5 | BLK (REW)
MBNT
DCP [Zhuang et al.2018] | 94.5% | 94.7% | 1.4 | Channel (Heuristic)
Ours | 94.5% | 94.5% | 7.1 | One BLK (REW)
Ours | 94.5% | 94.5% | 8.9 | BLK (REW)
Ours | 94.5% | 93.4% | 10.3 | BLK (REW)
VGG16
2PFPCE [Min et al.2018] | 92.9% | 92.8% | 4.0 | Row (Lasso)
TinyADMM [Ma et al.2020] | 93.7% | 93.3% |  | Row+Col. (ADMM)
Ours | 93.5% | 93.3% | 56.6 | One BLK (REW)
Ours | 93.5% | 93.5% | 50.1 | BLK (REW)
Ours | 93.5% | 93.0% | 69.7 | BLK (REW)






Table 2: CNN pruning results on ImageNet.
Method | Base accuracy (Top-1/Top-5) | Pruned accuracy (Top-1/Top-5) | Compression rate ($\times$) | Pruning dimension (method)
ResNet18
Network Slim. [Liu et al.2017] | 68.9/88.7% | 67.2/87.4% | 1.4 | Channel (Lasso)
DCP [Zhuang et al.2018] | 69.6/88.9% | 64.1/85.7% | 3.3 | Channel (Heuristic)
TinyADMM [Ma et al.2020] | N/A/89.1% | N/A/88.4% | 3.3 | Row+Col. (ADMM)
StructADMM [Zhang et al.2018b] | 69.9%/N/A | 68.8%/N/A | 3.0 | Col. (ADMM)
Ours | 69.9/89.1% | 69.0/88.5% | 4.0 | One BLK (REW)
Ours | 69.9/89.1% | 69.2/88.9% | 4.0 | BLK (REW)
Ours | 69.9/89.1% | 66.6/87.1% | 7.6 | BLK (REW)
MBNT
AMC [He et al.2018] | 71.8%/N/A | 70.8%/N/A | 1.4 | Channel (Lasso)
Ours | 70.9/90.4% | 70.5/89.8% | 1.6 | One BLK (REW)
Ours | 70.9/90.4% | 70.0/89.7% | 2.0 | BLK (REW)
VGG16
Decorrelation [Zhu et al.2018] | 73.1%/N/A | 73.2%/N/A | 3.9 | Row (Group Lasso)
APoZ [Hu et al.2016] | N/A/88.4% | 66.2/87.6% | 2.0 | Channel (Heuristic)
AutoADMM [Liu et al.2020] | N/A/92.1% | N/A/91.5% | 6.4 | Row+Col. (ADMM)
Ours | 74.5/91.7% | 74.0/91.5% | 6.5 | One BLK (REW)
Ours | 74.5/91.7% | 74.4/91.6% | 3.1 | BLK (REW)
Ours | 74.5/91.7% | 73.8/91.2% | 7.8 | BLK (REW)







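Table 3: RNN (GRU) pruning results on TIMIT.
Method | Base PER | Pruned PER | Compression rate ($\times$) | Pruning dimension (method) | Inference time (CPU/GPU)
ESE [Han et al.2017] | 20.40% | 20.70% | 8.0 | Irregular (Heuristic) | N/A
CLSTM [Wang et al.2018] | 24.15% | 25.48% | 16.0 | Block-circ. | N/A
ERNN [Li et al.2019] | 20.02% | 20.20% | 8.0 | Block-circ. | N/A
Ours | 18.8% | 18.8% | 19.1 | BLK (REW) | 0.97/0.50
Ours | 18.8% | 23.2% | 112.9 | BLK (REW) | 0.35/0.25
Ours | 18.8% | 24.0% | 231.3 | BLK (REW) | 0.21/0.09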
4.3 Performance Evaluation on Mobile Devices
Execution time results are shown in Figure 6. We test the BLK pruned models on a mobile CPU/GPU. To ensure fairness, all frameworks use the same pattern-based sparse model, and we also enable the fully optimized configurations of TFLite, TVM and MNN (e.g., Winograd optimization is turned on). All test models are the ones with the largest compression rates in Table 1 and Table 2. For GRU RNN execution, since the other frameworks do not support end-to-end execution on mobile devices, we only report the execution time of the proposed block-based sparse model (Table 3). Our approach achieves significant acceleration on mobile devices compared with the other frameworks. For image classification tasks, all of our results on the mobile GPU meet the real-time requirement (i.e., usually 33 ms/frame). For NLP tasks, the proposed framework also achieves real-time speech recognition.
5 Conclusion
This paper presents a block-based DNN structured pruning framework using the reweighted regularization method (BLK-REW). The proposed block-based structured sparsity is flexible and can be used in both CNN and RNN applications. With the support of compiler code generation and optimization, our framework can achieve real-time acceleration on many devices. The proposed framework also uses the reweighted method to dynamically update the regularization process, which improves the pruning results effectively within considerably shorter training time. Compared with state-of-the-art pruning methods, the proposed framework is general and achieves high performance.
References
 [Ali] https://github.com/alibaba/MNN.
 [Boyd et al.2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 2011.
 [Candes et al.2008] Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier analysis and applications, 2008.

 [Chen et al.2018] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
 [Gehring et al.2016] Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.
 [Han et al.2017] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In FPGA, 2017.
 [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [He et al.2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
 [He et al.2018] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
 [Howard et al.2017] Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
 [Hu et al.2016] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 [Li et al.2019] Zhe Li, Caiwen Ding, Shuo Wang, Wujie Wen, Youwei Zhuo, Xue Lin, Xuehai Qian, and Yanzhi Wang. E-RNN: Design optimization for efficient recurrent neural networks in FPGAs. In HPCA, 2019.
 [Liu et al.2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
 [Liu et al.2020] Ning Liu, Xiaolong Ma, Zhiyuan Xu, Yanzhi Wang, Jian Tang, and Jieping Ye. AutoSlim: An automatic DNN structured pruning framework for ultra-high compression rates. In AAAI, 2020.
 [Ma et al.2020] Xiaolong Ma, Geng Yuan, Sheng Lin, Caiwen Ding, Fuxun Yu, Tao Liu, Wujie Wen, Xiang Chen, and Yanzhi Wang. Tiny but accurate: A pruned, quantized and optimized memristor crossbar framework for ultra efficient DNN implementation. In ASP-DAC, 2020.
 [Min et al.2018] Chuhan Min, Aosen Wang, Yiran Chen, Wenyao Xu, and Xin Chen. 2PFPCE: Two-phase filter pruning based on conditional entropy. arXiv:1809.02220, 2018.
 [Nugraha et al.2017] Brilian Tafjira Nugraha, Shun-Feng Su, et al. Towards self-driving car using convolutional neural network and road lane detector. In ICACOMIT, 2017.
 [Paszke et al.2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
 [Ren et al.2019] Ao Ren, Tianyun Zhang, Shaokai Ye, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In ASPLOS, 2019.
 [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 [Ten] https://www.tensorflow.org/mobile/tflite/.
 [Wang et al.2018] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In FPGA, 2018.
 [Wen et al.2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NeurIPS, 2016.
 [Yuan and Lin2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2006.
 [Zhang et al.2018a] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic DNN weight pruning framework using alternating direction method of multipliers. In ECCV, 2018.
 [Zhang et al.2018b] Tianyun Zhang, Kaiqi Zhang, Shaokai Ye, Jian Tang, Wujie Wen, Xue Lin, Makan Fardad, and Yanzhi Wang. ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091, 2018.
 [Zhu et al.2018] Xiaotian Zhu, Wengang Zhou, and Houqiang Li. Improving deep neural network sparsity through decorrelation regularization. In IJCAI, 2018.
 [Zhuang et al.2018] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In NeurIPS, 2018.