ResNet Can Be Pruned 60x: Introducing Network Purification and Unused Path Removal (P-RM) after Weight Pruning

04/30/2019 ∙ by Xiaolong Ma, et al.

State-of-the-art DNN structures involve high computation and a great demand for memory storage, which pose intensive challenges to DNN framework resources. To mitigate these challenges, weight pruning techniques have been studied. However, a high-accuracy solution for extreme structured pruning that combines different types of structured sparsity is still missing, due to the extremely reduced number of weights in the DNN. In this paper, we propose a DNN framework that combines two different types of structured weight pruning (filter and column pruning) by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning performance. We are the first to identify the non-optimality of the ADMM process and the unused weights in a structured-pruned model, and we further design an optimization framework containing the newly proposed Network Purification and Unused Path Removal algorithms, which are dedicated to post-processing a structured-pruned model after the ADMM steps. Highlights: we achieve 232× compression on LeNet-5, 60× compression on ResNet-18 (CIFAR-10), and over 5× compression on AlexNet. We share our models at the anonymous link http://bit.ly/2VJ5ktv.


1 Introduction

To address the high demand for computation and storage resources of DNN applications, weight pruning (Han et al., 2015; Wen et al., 2016) has been developed to facilitate weight compression and computation acceleration. In this work, a structured pruning technique is utilized to compress DNN models, reducing both weight storage and computation; compared with irregular pruning, the structured weight matrix storage also has potential advantages for high-parallelism hardware implementation, since it eliminates the required weight indices (Ding et al., 2018; Zhang et al., 2018).

However, the accuracy loss problem in structured pruning is inevitable. By adopting ADMM (Boyd et al., 2011), the original weight pruning problem is decomposed into two subproblems: one solved using stochastic gradient descent as in original DNN training, and the other solved optimally and analytically via Euclidean projection (Zhang et al., 2018; Ye et al., 2019). The ADMM method achieves some of the state-of-the-art structured weight pruning results: 40× weight reduction on LeNet-5 (LeCun et al., 1998) with MNIST (LeCun et al., 2015), 20× on VGG-16 (Simonyan & Zisserman, 2014) with CIFAR-10 (Krizhevsky & Hinton, 2009), and 4.7× on AlexNet (Krizhevsky et al., 2012) with ImageNet (Deng et al., 2009), all without post-processing optimization.

During the post-processing procedure, we find that after model retraining, some weights contribute less to the network performance. This phenomenon is caused by the fact that the ADMM technique lacks a guarantee of solution feasibility (non-optimality) due to the non-convex nature of the objective (loss) function. We propose a novel algorithm to detect and remove the redundant weights that slip past ADMM pruning. We are also the first to discover the unused paths in a structured-pruned DNN model, and we design a sophisticated optimization framework to further boost the compression rate while maintaining high network accuracy. The contributions of this paper include:

  • We adopt ADMM to efficiently optimize the non-convex problem and successfully apply this method to structured weight pruning.

  • We design a novel Network Purification and Unused Path Removal (P-RM) algorithm focused on post-processing an ADMM structured-pruned model to boost the compression rate while maintaining accuracy.

2 ADMM Model Compression

Consider an $N$-layer DNN, where the collections of weights and biases of the $i$-th (CONV or FC) layer are denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, respectively, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\},\{\mathbf{b}_i\})$. In this paper, $\{\mathbf{W}_i\}$ and $\{\mathbf{b}_i\}$ characterize the sets of weights and biases from layer $1$ to layer $N$. The overall problem is defined by

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}}\; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big), \quad \text{subject to } \mathbf{W}_i \in \mathbf{S}_i,\ i = 1,\ldots,N. \tag{1}$$

Given the value of $\alpha_i$, the constraint set is denoted by $\mathbf{S}_i = \{\mathbf{W}_i \mid \mathrm{card}(\mathrm{supp}(\mathbf{W}_i)) \le \alpha_i\}$, where “card” refers to cardinality and “supp” refers to the support set. Elements in $\mathbf{S}_i$ are the solutions of $\mathbf{W}_i$ in which the number of non-zero elements is limited by $\alpha_i$ for layer $i$. This general constraint can be extended to structured pruning such as filter pruning, channel pruning, and column pruning.
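As a concrete illustration (ours, not from the paper), for column pruning the constraint simply bounds the number of non-zero columns in the GEMM-view weight matrix. A minimal numpy sketch, with hypothetical function names:

```python
import numpy as np

def nonzero_columns(W):
    """Indices of columns of the GEMM-view weight matrix W
    (filters x (channels * kernel size)) with any non-zero entry."""
    return np.flatnonzero(np.any(W != 0, axis=0))

def in_constraint_set(W, alpha):
    """Membership test W in S_i for column pruning: the cardinality of the
    support, counted over whole columns, must not exceed alpha."""
    return nonzero_columns(W).size <= alpha

# Toy GEMM-view matrix: 4 filters x 6 columns, only 2 non-zero columns.
W = np.zeros((4, 6))
W[:, [1, 4]] = np.random.randn(4, 2)
print(in_constraint_set(W, alpha=2))  # True
```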

The standard ADMM regularized optimization steps are shown as follows. Consider an indicator function utilized to incorporate $\mathbf{S}_i$ into the objective function:

$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathbf{S}_i, \\ +\infty & \text{otherwise.} \end{cases} \tag{2}$$

Then the original problem (1) can be equivalently rewritten as

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}}\; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i), \quad \text{subject to } \mathbf{W}_i = \mathbf{Z}_i,\ i = 1,\ldots,N. \tag{3}$$

Auxiliary variables $\mathbf{Z}_i$ and (scaled) dual variables $\mathbf{U}_i$ are introduced. ADMM decomposes problem (3) into simpler subproblems and solves them iteratively until convergence. The augmented Lagrangian of problem (3) is

$$L_{\rho}\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\},\{\mathbf{Z}_i\},\{\mathbf{U}_i\}\big) = f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} \frac{\rho_i}{2}\,\big\|\mathbf{W}_i - \mathbf{Z}_i + \mathbf{U}_i\big\|_F^2. \tag{4}$$

The first term in problem (4) is the differentiable loss function of the DNN, and the quadratic regularization term on $\mathbf{W}_i$ is differentiable and convex, where $\|\cdot\|_F$ denotes the Frobenius norm. As a result, the $\{\mathbf{W}_i\}$-subproblem of (4) can be solved by a stochastic gradient descent algorithm (Kingma & Ba, 2014), as in original DNN training.
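For intuition (our sketch, not the authors' code): the quadratic penalty just adds the extra gradient term $\rho_i(\mathbf{W}_i - \mathbf{Z}_i + \mathbf{U}_i)$ to each SGD update, so any existing training loop can be reused.

```python
import numpy as np

def admm_sgd_step(W, Z, U, grad_f_W, lr=1e-2, rho=1e-3):
    """One SGD step on the W-subproblem of Eq. (4): the gradient of the
    loss f plus the gradient rho * (W - Z + U) of the quadratic penalty
    (rho / 2) * ||W - Z + U||_F^2."""
    return W - lr * (grad_f_W + rho * (W - Z + U))
```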

The standard ADMM algorithm (Boyd et al., 2011) proceeds by repeating, for iteration $k = 0, 1, \ldots$, the following subproblem iterations:

$$\{\mathbf{W}_i^{k+1}, \mathbf{b}_i^{k+1}\} := \underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\arg\min}\; L_{\rho}\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\},\{\mathbf{Z}_i^{k}\},\{\mathbf{U}_i^{k}\}\big) \tag{5}$$
$$\{\mathbf{Z}_i^{k+1}\} := \underset{\{\mathbf{Z}_i\}}{\arg\min}\; L_{\rho}\big(\{\mathbf{W}_i^{k+1}\},\{\mathbf{b}_i^{k+1}\},\{\mathbf{Z}_i\},\{\mathbf{U}_i^{k}\}\big) = \Pi_{\mathbf{S}_i}\big(\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}\big) \tag{6}$$
$$\mathbf{U}_i^{k+1} := \mathbf{U}_i^{k} + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1} \tag{7}$$

where (5) is the proximal step, (6) is the Euclidean projection step ($\Pi_{\mathbf{S}_i}$ denotes projection onto $\mathbf{S}_i$), and (7) is the dual variable update. However, due to the non-convexity of the DNN loss function (rather than of the quadratic term in our method), global optimality cannot be guaranteed.
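The projection in (6) has a closed form. For column pruning it amounts to keeping the $\alpha_i$ columns of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ with the largest norms and zeroing the rest; a minimal numpy sketch of our reading of this step (the analogous row-wise projection handles filter pruning):

```python
import numpy as np

def project_columns(V, alpha):
    """Euclidean projection onto S_i for column pruning: keep the alpha
    columns of V with the largest L2 norms and zero out the rest."""
    norms = np.linalg.norm(V, axis=0)
    keep = np.argsort(norms)[-alpha:]   # indices of the alpha largest columns
    Z = np.zeros_like(V)
    Z[:, keep] = V[:, keep]
    return Z

# One ADMM iteration per layer, after the SGD-based proximal step (5):
# Z = project_columns(W + U, alpha)    # Eq. (6)
# U = U + W - Z                        # Eq. (7)
```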

Figure 1 illustrates the combined structured pruning techniques in the General Matrix Multiply (GEMM) view. We adopt filter pruning and column pruning together to reduce the matrix dimensions. As shown in Figure 1(c), the weight matrix size is reduced drastically compared with the original one, while the shape of the weight matrix remains regular.

Figure 1: GEMM view of weight pruning
Figure 2: Unused data path caused by structured pruning
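To make the GEMM view concrete (our illustration, not from the paper): filter pruning zeroes whole rows of the GEMM-view matrix and column pruning zeroes whole columns, so both can be physically removed, yielding a smaller but still dense, regular matrix.

```python
import numpy as np

def shrink_gemm_matrix(W):
    """Physically remove all-zero rows (pruned filters) and all-zero
    columns (pruned columns) from a GEMM-view weight matrix."""
    rows = np.any(W != 0, axis=1)
    cols = np.any(W != 0, axis=0)
    return W[np.ix_(rows, cols)]

W = np.random.randn(8, 12)
W[[2, 5], :] = 0.0      # two pruned filters
W[:, [0, 3, 7]] = 0.0   # three pruned columns
print(W.shape, '->', shrink_gemm_matrix(W).shape)  # (8, 12) -> (6, 9)
```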

3 Network Purification and Unused Path Removal (P-RM)

ADMM weight pruning can significantly reduce weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights?

From our observation and analysis of the data flow through a network, we find that if a whole filter is pruned, then after GEMM, the feature maps generated by that filter will all be “blank”. If those “blank” feature maps are input to the next layer, then no matter what values are in the corresponding channel, the GEMM result will be zero; as a result, that channel becomes an unused channel that can be removed. By the same token, if a channel is pruned, then no matter what values are in the previous layer's corresponding filter, the feature map generated by that filter contributes only zeros, which makes the corresponding filter an unused one. Figure 2 gives a clear illustration of the correspondence between the ADMM-pruned filters/columns and the resulting unused channels/filters.
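A minimal sketch of this cross-layer bookkeeping (our own illustration; the index convention that channel $j$ of layer $i{+}1$ consumes the output of filter $j$ of layer $i$ is assumed):

```python
import numpy as np

def pruned_filters(W):
    """Indices of all-zero rows (pruned filters) in a GEMM-view matrix."""
    return set(np.flatnonzero(~np.any(W != 0, axis=1)))

def pruned_channels(W, q):
    """Indices of channels whose q columns are all zero."""
    nonzero = np.any(W != 0, axis=0).reshape(-1, q)  # (channels, q)
    return set(np.flatnonzero(~np.any(nonzero, axis=1)))

def unused_paths(W_i, W_next, q_next):
    """Unused channels of layer i+1 (fed by pruned filters of layer i) and
    unused filters of layer i (feeding pruned channels of layer i+1)."""
    filters_gone = pruned_filters(W_i)
    channels_gone = pruned_channels(W_next, q_next)
    unused_channels = filters_gone - channels_gone   # removable in layer i+1
    unused_filters = channels_gone - filters_gone    # removable in layer i
    return unused_channels, unused_filters
```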

We further generalize the “empty channels cause unused filters” observation by defining a criterion for the “emptiness” of a channel. Suppose $Q_i$ is the number of columns per channel in layer $i$, and $r_{i,n}$ is the emptiness ratio of channel $n$, i.e., the fraction of its columns that are entirely zero:

$$r_{i,n} = \frac{1}{Q_i}\sum_{c=1}^{Q_i} \mathbb{1}\big[\text{column } c \text{ of channel } n \text{ is all-zero}\big]. \tag{8}$$

If $r_{i,n}$ exceeds a pre-defined threshold, we can assume that this channel is empty. But this indiscriminate criterion has a limitation: after pruning, the remaining columns remain for a reason, namely that they are relatively “important” to the whole network. If we removed all columns in every channel whose $r_{i,n}$ satisfies the threshold, a disastrous accuracy drop would occur that is hard to recover by retraining.
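A sketch of Eq. (8) in code (our reading, with the same GEMM-view layout as before):

```python
import numpy as np

def emptiness_ratio(W, q):
    """Eq. (8): per-channel fraction of all-zero columns.
    W: GEMM-view matrix (filters x (channels * q)); q: columns per channel."""
    col_nonzero = np.any(W != 0, axis=0).reshape(-1, q)
    return 1.0 - col_nonzero.mean(axis=1)   # one ratio r_{i,n} per channel n
```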

To make our previous assumption work, we design a unified algorithm called “Network Purification”, which targets the non-optimality problem of the ADMM process; solving that problem validates the above assumption simultaneously. We add a criterion constraint that compares the importance of the remaining columns channel-wise and helps decide which columns can be sacrificed and which cannot. We set up a criterion constant $s_{i,n}$ to represent channel $n$'s importance score, derived from an accumulation procedure over its remaining columns:

$$s_{i,n} = \sum_{c=1}^{Q_i} \big\|\mathbf{W}_i^{(n,c)}\big\|_2^2, \tag{9}$$

where $\mathbf{W}_i^{(n,c)}$ denotes the $c$-th column of channel $n$ in layer $i$. One can think of this process as collecting evidence for whether each channel that contains one or several remaining columns needs to be removed. Network Purification also works on purifying the remaining filters (using an analogous filter-wise importance score $s'_{i,m}$), and thus removes more unused paths in the network. The effect of using Network Purification and Unused Path Removal (P-RM) in combination is that the network achieves an extremely high compression rate without any accuracy drop. Algorithm 1 shows our generalized P-RM method, where $\tau_1$, $\tau_2$, and $\tau_3$ are hyper-parameter threshold values.

Result: Redundant weights and unused paths removed
Load ADMM-pruned model;
$Q_i$ = number of columns per channel in layer $i$;
for $i \leftarrow 1$ until last layer do
        for $n \leftarrow 1$ until last channel in layer $i$ do
               for each column $c$ in channel $n$ do
                      calculate $r_{i,n}$ by (8) and $s_{i,n}$ by (9);
               end for
               if $r_{i,n} \ge \tau_1$ and $s_{i,n} \le \tau_2$ then
                      prune(channel$_{i,n}$);
                      prune(filter$_{i-1,n}$) when $i > 1$;
               end if
        end for
        for $m \leftarrow 1$ until last filter in layer $i$ do
               if filter$_{i,m}$ is empty or $s'_{i,m} \le \tau_3$ then
                      prune(filter$_{i,m}$);
                      prune(channel$_{i+1,m}$) when $i <$ last layer index;
               end if
        end for
end for
Algorithm 1: Network Purification & Unused Path Removal
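For concreteness, below is a compact, runnable sketch of Algorithm 1 over GEMM-view matrices. It is our own reconstruction: the squared-norm scores, the default thresholds, and the `p_rm`/`channel_stats` helpers are illustrative assumptions, not the authors' released implementation. Zeroing stands in for physical removal, which would use the matrix-shrinking step shown earlier.

```python
import numpy as np

def channel_stats(W, q):
    """Per-channel emptiness ratio r (Eq. 8) and importance score s (Eq. 9)
    for a GEMM-view matrix W of shape (filters, channels * q)."""
    col_sq = (W ** 2).sum(axis=0).reshape(-1, q)  # squared column norms per channel
    r = (col_sq == 0).mean(axis=1)                # fraction of all-zero columns
    s = col_sq.sum(axis=1)                        # accumulated column magnitudes
    return r, s

def p_rm(weights, q, tau1=0.9, tau2=1e-3, tau3=1e-3):
    """Network Purification & Unused Path Removal over a list of GEMM-view
    matrices `weights`, where q[i] is the number of columns per channel."""
    for i, W in enumerate(weights):
        # Purify channels that are mostly empty AND carry little magnitude.
        r, s = channel_stats(W, q[i])
        for n in np.flatnonzero((r >= tau1) & (s <= tau2)):
            W[:, n * q[i]:(n + 1) * q[i]] = 0.0    # prune channel n of layer i
            if i > 0:
                weights[i - 1][n, :] = 0.0         # its feeding filter becomes unused
        # Purify filters that are empty or unimportant.
        f_score = (W ** 2).sum(axis=1)
        for m in np.flatnonzero(f_score <= tau3):
            W[m, :] = 0.0                          # prune filter m of layer i
            if i + 1 < len(weights):               # its downstream channel becomes unused
                weights[i + 1][:, m * q[i + 1]:(m + 1) * q[i + 1]] = 0.0
    return weights
```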

4 Experimental Results

Figure 3 shows that ADMM's non-optimality exists in a structured-pruned model: by purifying the redundant weights, we can further reduce the loss. All of these results are based on a Network Purification process without retraining. The purification, together with the removal of unused paths (P-RM), has a strong compression-boosting effect when the network is deep enough.

Figure 3: Effect of removing redundant weights and unused paths. (dataset: CIFAR-10; Accuracy: VGG-16-93.36%, ResNet-18-93.79%. No retraining used)
Structured Weight Pruning Statistics

Method          | Original | Prune Rate | Accuracy | Prune Rate | Accuracy
                | Accuracy | w/o P-RM   | w/o P-RM | with P-RM  | with P-RM
----------------+----------+------------+----------+------------+----------
MNIST
SSL             |    –     |   26.10×   |  99.00%  |    N/A     |   N/A
our LeNet-5     |  99.17%  |   23.18×   |  99.20%  |   39.23×   |  99.20%
                |          |   34.46×   |  99.06%  |  *87.93×   |  99.06%
                |          |   45.54×   |  98.48%  |  231.82×   |  98.48%
*number of parameters reduced: 25.2K
----------------+----------+------------+----------+------------+----------
CIFAR-10
2PFPCE          |  92.98%  |    4.00×   |  92.76%  |    N/A     |   N/A
our VGG-16      |  93.70%  |   20.16×   |  93.36%  |   44.67×   |  93.36%
                |          |            |          |  *50.02×   |  92.73%
AMC             |  93.53%  |    1.70×   |  93.55%  |    N/A     |   N/A
our ResNet-18   |  94.14%  |    5.83×   |  93.79%  |   52.07×   |  93.79%
                |          |   15.14×   |  93.20%  |  *60.11×   |  93.22%
*numbers of parameters reduced: VGG-16: 14.42M, ResNet-18: 10.97M
----------------+----------+------------+----------+------------+----------
ImageNet ILSVRC-2012
SSL (AlexNet)   |  80.40%  |    1.40×   |  80.40%  |    N/A     |   N/A
our AlexNet     |  82.40%  |    4.69×   |  81.76%  |    5.13×   |  81.76%
our ResNet-18   |  89.07%  |    3.02×   |  88.41%  |    3.33×   |  88.47%
our ResNet-50   |  92.86%  |    2.00×   |  92.26%  |    2.70×   |  92.27%
numbers of parameters reduced: AlexNet: 1.66M, ResNet-18: 7.81M, ResNet-50: 14.77M

Table 1: Structured weight pruning results on multi-layer networks on MNIST, CIFAR-10, and ImageNet ILSVRC-2012 datasets

Table 1 shows our experimental results of network pruning on LeNet-5, VGG-16, AlexNet, and ResNet-18/50. The accuracy and pruning-ratio results of the SSL method (Wen et al., 2016) are compared with our structured-pruned LeNet-5 and AlexNet models, and the 2PFPCE (Min et al., 2018) and AMC (He et al., 2018) (ResNet-50) methods are compared with our VGG-16 and ResNet-18 results on CIFAR-10. By using ADMM structured pruning with Network Purification and Unused Path Removal (P-RM), LeNet-5 achieves a 39× compression rate without accuracy drop, 88× with negligible accuracy drop, and 232× with only a 0.7% accuracy drop. On the CIFAR-10 dataset, our compressed VGG-16 model achieves 44× compression without accuracy degradation and 50× with a 1% accuracy drop, and our ResNet-18 achieves 52× compression without noticeable accuracy loss and 60× with a 0.9% accuracy loss.

On the ImageNet dataset, we increase the compression rate of AlexNet from 4.69× to 5.13×, ResNet-18 from 3.02× to 3.33×, and ResNet-50 from 2.00× to 2.70×. None of these compression-rate boosts causes noticeable accuracy degradation.

5 Conclusion

In this paper, we provide an ADMM-regularized method to achieve highly compressed DNN models by combining different weight pruning structures while maintaining network accuracy at a high level. We further investigate the post-processing of ADMM pruning to address the non-optimal solutions caused by the non-convex DNN loss function. We propose Network Purification and Unused Path Removal, which increase our models' compression rates significantly.

References