A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM

11/05/2018 ∙ by Shaokai Ye, et al. ∙ Syracuse University

Many model compression techniques of Deep Neural Networks (DNNs) have been investigated, including weight pruning, weight clustering and quantization, etc. Weight pruning leverages the redundancy in the number of weights in DNNs, while weight clustering/quantization leverages the redundancy in the number of bit representations of weights. They can be effectively combined in order to exploit the maximum degree of redundancy. However, the literature lacks a systematic investigation in this direction. In this paper, we fill this void and develop a unified, systematic framework of DNN weight pruning and clustering/quantization using Alternating Direction Method of Multipliers (ADMM), a powerful technique in optimization theory to deal with non-convex optimization problems. Both DNN weight pruning and clustering/quantization, as well as their combinations, can be solved in a unified manner. For further performance improvement in this framework, we adopt multiple techniques including iterative weight quantization and retraining, joint weight clustering training and centroid updating, weight clustering retraining, etc. The proposed framework achieves significant improvements both in individual weight pruning and clustering/quantization problems, as well as their combinations. For weight pruning alone, we achieve 167x weight reduction in LeNet-5, 24.7x in AlexNet, and 23.4x in VGGNet, without any accuracy loss. For the combination of DNN weight pruning and clustering/quantization, we achieve 1,910x and 210x storage reduction of weight data on LeNet-5 and AlexNet, respectively, without accuracy loss. Our codes and models are released at the link http://bit.ly/2D3F0np.


Introduction

Despite the significant success and wide applications, Deep Neural Networks (DNNs) have increasing model sizes and associated computation and storage overheads. DNN model compression techniques have been widely investigated, including weight pruning [Han et al.2015, Wen et al.2016, Dai, Yin, and Jha2017, Guo, Yao, and Chen2016], weight clustering and quantization [Han, Mao, and Dally2016, Park, Ahn, and Yoo2017, Zhou et al.2017, Leng et al.2017], low-rank approximation [Cheng et al.2015, Sindhwani, Sainath, and Kumar2015, Zhao et al.2017], etc.

A pioneering work on weight pruning is the iterative pruning method [Han et al.2015], which successfully achieves 12× weight reduction in LeNet-5 (MNIST dataset) and 9× in AlexNet (ImageNet dataset), without accuracy degradation. The limitations of this work include (i) the limited weight pruning capability in convolutional (CONV) layers, the most computationally intensive component in DNNs, and (ii) the irregularity in weight storage after pruning. To overcome these limitations, multiple recent works have extended it to (i) use more sophisticated algorithms for higher pruning ratios [Zhang et al.2018b, Ye et al.2018, Dai, Yin, and Jha2017, Guo, Yao, and Chen2016], (ii) strike a balance between higher pruning ratio and lower accuracy degradation [Yang, Chen, and Sze2016], and (iii) incorporate regularity in weight pruning and storage to facilitate hardware implementations [Wen et al.2016, Wen et al.2017, Zhang et al.2018c].

Weight clustering and quantization are equally important, if not more so, in DNN model compression. Weight clustering is different from quantization: the former requires the weights to be grouped into a predefined number of clusters, where weights within a cluster share the same value; the latter requires the weights to take pre-defined, fixed values. In fact, weight quantization is a special case of general weight clustering. Due to its flexibility, weight clustering results in higher accuracy and/or compression ratio than quantization. On the other hand, weight quantization, especially equal-distance quantization, is more hardware friendly than weight clustering (and weight pruning as well). Both weight clustering and quantization are considered in this paper in a unified way.

A large body of research is dedicated to weight clustering and quantization [Leng et al.2017, Zhou et al.2016]. Current approaches are mainly iterative, alternating between back-propagation training that assumes continuous weights and a mapping procedure onto discrete values. It is important to note that multiplications can even be eliminated through effective weight quantization [Leng et al.2017], e.g., quantization into binary weights, ternary weights (-1, 0, +1), or powers of 2 such as in the DoReFa net [Zhou et al.2016].

Weight pruning makes use of the redundancy in the number of weights in DNNs, whereas weight clustering/quantization exploits the redundancy in the bit representation of weights. These two sources of redundancy are largely independent of each other, which makes it desirable to combine weight pruning and clustering/quantization to fully exploit the overall degree of redundancy. Despite some early heuristic investigations [Han et al.2015, Han, Mao, and Dally2016], a systematic study of the best possible combination of DNN model compression techniques is still lacking. This paper aims to overcome this limitation and shed light on the highest possible DNN model compression through effective combinations.

This paper develops a unified, systematic framework of DNN weight pruning and weight clustering/quantization using Alternating Direction Method of Multipliers (ADMM), a powerful technique in optimization to deal with non-convex optimization problems with potentially combinatorial constraints [Boyd et al.2011, Takapoui et al.2017]. This framework is based on a key observation: both DNN weight pruning and weight clustering/quantization, as well as their combinations, can be solved in a unified manner using ADMM. In the solution, the original problem is decomposed into three (or two) subproblems, which are iteratively solved until convergence. Overall, the ADMM-based solution can be understood as a smart, dynamic regularization process in which the regularization target is dynamically updated in each iteration. As a result it can outperform the prior work on regularization [Wen et al.2016] or projected gradient descent [Zhang et al.2018a].

For further performance improvement in ADMM-based weight clustering/quantization, we propose multiple techniques including an iterative weight quantization and retraining procedure, joint weight clustering training with adaptive centroid updating, and a weight clustering retraining process. The proposed unified framework using ADMM outperforms prior work in two aspects. First, for individual weight pruning and clustering/quantization methods, the proposed ADMM method outperforms prior work. For instance, we achieve 167× weight reduction in LeNet-5, 24.7× in AlexNet, and 23.4× in VGGNet, without any accuracy loss, which clearly outperforms prior art. Second, for the joint DNN weight pruning and clustering/quantization, we achieve 1,910× and 210× storage reduction of weight data on LeNet-5 and AlexNet, respectively, without accuracy loss. These results significantly outperform the state of the art. Our codes and models are released at the link http://bit.ly/2D3F0np.

Discussions on the Combination of DNN Weight Pruning and Clustering/Quantization

As discussed before, weight pruning can be combined with weight clustering/quantization, thereby fully exploiting the overall degree of redundancy. Preliminary work [Han, Mao, and Dally2016] in this direction uses a combination of iterative weight pruning and K-means clustering. It simultaneously achieves 9× weight pruning in AlexNet and uses 8-bit CONV layer clustering and 5-bit FC layer clustering. This work does not target hardware implementation and only focuses on weight clustering.

When compared with weight clustering/quantization, weight pruning can often achieve a higher compression ratio of the DNN [Han, Mao, and Dally2016]. There are two reasons. First, there is often a higher degree of redundancy in the number of weights than in the number of bits for weight representation. In weight clustering/quantization, every single-bit reduction in weight representation roughly doubles the imprecision; weight pruning does not face this difficulty. Second, moderate weight pruning can often result in an increase in accuracy (by up to 2% in AlexNet in our ADMM framework), leaving a higher margin for further weight reduction. This effect, however, is not observed in weight clustering/quantization. As a result, weight pruning is often prioritized over weight clustering/quantization, despite the irregular weight storage and associated hardware implementation overhead of the former.

Prior work lacks a systematic investigation of the best possible combination of DNN weight pruning and weight clustering/quantization. In this paper we fill this void in order to fully exploit the overall degree of redundancy. We provide a formulation that can either perform ADMM-based weight pruning and clustering/quantization simultaneously, or give priority to weight pruning.

Fig. 1 illustrates weight quantization after weight pruning. Given the pruned weight matrix in Fig. 1 (a), we use 2 bits for quantization and an interval of 0.5, so the quantization levels become ±0.5 and ±1.0; 0 is excluded because 0 represents pruned weights. Fig. 1 (b) shows the quantized weights, and Fig. 1 (c) displays the values that are actually stored in hardware along with the interval value 0.5.

Figure 1: Description of weight quantization after weight pruning (the interval equal to 0.5).

Fig. 2 shows the illustrative weight clustering process after weight pruning, which is different from weight quantization. Given the same weight matrix after pruning, we also use 2-bit for weight clustering. Again 0 is not considered because the associated weights are already pruned. Fig. 2 (b) shows the weights after clustering, along with the centroid values for the 4 clusters shown in Fig. 2 (c). Different from weight quantization, the centroid values are flexible in the weight clustering process.

Figure 2: Description of weight clustering after weight pruning (along with centroid values).

The Unified Framework of ADMM based Weight Pruning and Clustering/Quantization

Background of ADMM

Consider a non-convex optimization problem that is difficult to solve directly. The ADMM method decomposes it into two subproblems that can be solved separately and efficiently. For example, the optimization problem

$\min_{\mathbf{x}} \; f(\mathbf{x}) + g(\mathbf{x})$   (1)

lends itself to the application of ADMM if $f(\mathbf{x})$ is differentiable and $g(\mathbf{x})$ has some exploitable structure, such as the $\ell_1$/$\ell_2$-norm or the indicator function of a constraint set. The problem is first re-written as

$\min_{\mathbf{x},\,\mathbf{z}} \; f(\mathbf{x}) + g(\mathbf{z})$   (2)
subject to $\mathbf{x} = \mathbf{z}$.

By using the augmented Lagrangian [Boyd et al.2011], this problem is decomposed into two subproblems on $\mathbf{x}$ and $\mathbf{z}$. The first is $\min_{\mathbf{x}} f(\mathbf{x}) + q_1(\mathbf{x})$, where $q_1(\mathbf{x})$ is a quadratic function of $\mathbf{x}$. As a result, the complexity of solving subproblem 1 (e.g., via stochastic gradient descent) is the same as minimizing $f(\mathbf{x})$. Subproblem 2 is $\min_{\mathbf{z}} g(\mathbf{z}) + q_2(\mathbf{z})$, where $q_2(\mathbf{z})$ is quadratic. When $g$ has special structure, exploiting the properties of $g$ allows this subproblem to be solved analytically and optimally. In this way, ADMM lets us solve a problem that is difficult to tackle directly.
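To make the two-subproblem structure concrete, below is a minimal NumPy sketch of ADMM on a toy instance of problem (2), with $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2$ and $g$ the indicator function of a cardinality constraint (the same kind of constraint used for weight pruning below). The problem size, penalty value rho, and iteration count are illustrative assumptions, not settings from the paper.

import numpy as np

def admm_sparse_least_squares(A, b, k, rho=1.0, iters=100):
    """Toy ADMM for: minimize 0.5*||Ax - b||^2  subject to  card(x) <= k.
    Subproblem 1 (x-update) is a quadratic minimization; subproblem 2
    (z-update) is a Euclidean projection onto the cardinality set."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)                       # scaled dual variable
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        # x-update: minimize f(x) + (rho/2)*||x - z + u||^2 (closed form here)
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))
        # z-update: project x + u onto {z : card(z) <= k} (keep k largest magnitudes)
        v = x + u
        z = np.zeros(n)
        keep = np.argsort(np.abs(v))[-k:]
        z[keep] = v[keep]
        # dual update
        u = u + x - z
    return z

# Usage: recover a 3-sparse vector from a random over-determined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 15]] = [1.5, -2.0, 0.8]
print(admm_sparse_least_squares(A, A @ x_true, k=3).round(2))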

Problem Formulation

Consider an $N$-layer DNN, where the collections of weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$. The loss function of the $N$-layer DNN is denoted by $f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big)$.

When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by

$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big)$   (3)
subject to $\mathbf{W}_i \in S_i$, $\mathbf{W}_i \in S'_i$, $i = 1, \ldots, N$.

The set $S_i$ reflects the constraint for the weight pruning problem, i.e., $S_i = \{\mathbf{W}_i \mid \mathrm{card}(\mathbf{W}_i) \le \alpha_i\}$, where $\alpha_i$ is the desired number of weights after pruning in the $i$-th layer. When we combine weight pruning with weight clustering, $S'_i = \{\mathbf{W}_i \mid$ the elements of $\mathbf{W}_i$ take at most $M_i$ distinct values$\}$. When we combine weight pruning with weight quantization, $S'_i = \{\mathbf{W}_i \mid$ every element of $\mathbf{W}_i$ is either zero or one of the quantization levels $q_{i,1}, q_{i,2}, \ldots, q_{i,M_i}\}$. Here the $q_{i,j}$ values are quantization levels, and we consider equal-distance quantization (the same distance between adjacent quantization levels) to facilitate hardware implementations. Besides, $M_i = 2^{n_i}$, where $n_i$ is the number of bits we use for weight clustering or quantization.

Both constraints $\mathbf{W}_i \in S_i$ and $\mathbf{W}_i \in S'_i$ need to be satisfied simultaneously in the joint problem of DNN weight pruning and weight clustering/quantization. In this way we can make sure that most of the DNN weights are pruned (set to zero), while the remaining weights are clustered/quantized.

The Unified ADMM-based Framework

To apply ADMM, we define indicator functions to incorporate the combinatorial constraints into the objective function. The indicator functions are

$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in S_i, \\ +\infty & \text{otherwise,} \end{cases} \qquad h_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in S'_i, \\ +\infty & \text{otherwise,} \end{cases}$

for $i = 1, \ldots, N$.

We then incorporate auxiliary variables $\mathbf{Z}_i$ and $\mathbf{Y}_i$, and rewrite the original problem (3) as

$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} h_i(\mathbf{Y}_i)$   (4)
subject to $\mathbf{W}_i = \mathbf{Z}_i$, $\mathbf{W}_i = \mathbf{Y}_i$, $i = 1, \ldots, N$.

Through ADMM [Boyd et al.2011], problem (4) can be decomposed into three subproblems. The overall problem of weight pruning and clustering/quantization is solved by solving the subproblems iteratively until convergence. The first subproblem is

$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} \frac{\rho_i}{2}\big\|\mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k\big\|_F^2 + \sum_{i=1}^{N} \frac{\lambda_i}{2}\big\|\mathbf{W}_i - \mathbf{Y}_i^k + \mathbf{V}_i^k\big\|_F^2$   (5)

where $\mathbf{U}_i$ and $\mathbf{V}_i$ are the dual variables updated in each ADMM iteration, and $\rho_i$, $\lambda_i$ are penalty parameters. The first term in (5) is the differentiable loss function of the DNN, while the other terms are quadratic and therefore differentiable and convex. As a result, this subproblem can be solved by stochastic gradient descent (e.g., the ADAM algorithm [Kingma and Ba2014]), and its complexity is the same as training the original DNN.
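As an illustration of how subproblem (5) plugs into standard DNN training, the following PyTorch-style sketch simply adds the quadratic ADMM terms to the task loss. The paper's released code uses TensorFlow/Caffe, so the framework choice, the dictionary-keyed bookkeeping of $\mathbf{Z}_i$, $\mathbf{U}_i$, $\mathbf{Y}_i$, $\mathbf{V}_i$, and the single shared penalty values are assumptions for illustration only.

import torch
import torch.nn.functional as F

def admm_loss(model, images, labels, Z, U, Y, V, rho=1e-3, lam=1e-3):
    """Objective of subproblem (5): task loss plus quadratic penalties that
    pull each constrained weight tensor W_i toward (Z_i - U_i) and (Y_i - V_i)."""
    loss = F.cross_entropy(model(images), labels)
    for name, W in model.named_parameters():
        if name in Z:  # only the weight tensors under compression constraints
            loss = loss + (rho / 2.0) * torch.norm(W - Z[name] + U[name]) ** 2
            loss = loss + (lam / 2.0) * torch.norm(W - Y[name] + V[name]) ** 2
    return loss  # backpropagate this loss with any SGD/ADAM optimizer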

The second subproblem is

$\min_{\{\mathbf{Z}_i\}} \; \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} \frac{\rho_i}{2}\big\|\mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^k\big\|_F^2$   (6)

As we mentioned before, $g_i(\cdot)$ is the indicator function of $S_i$, thus the analytical solution of problem (6) is

$\mathbf{Z}_i^{k+1} = \Pi_{S_i}\big(\mathbf{W}_i^{k+1} + \mathbf{U}_i^k\big)$   (7)

where $\Pi_{S_i}(\cdot)$ is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ onto the set $S_i$. In DNN weight pruning, $\alpha_i$ is the desired number of weights after pruning in the $i$-th layer. The Euclidean projection keeps the $\alpha_i$ elements of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ with the largest magnitude and sets the rest to zero [Boyd et al.2011, Zhang et al.2018b].
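A minimal NumPy sketch of this projection, assuming a layer's weight tensor is given as an array w and alpha is the number of weights to keep:

import numpy as np

def project_cardinality(w, alpha):
    """Euclidean projection onto S_i = {W : card(W) <= alpha}: keep the alpha
    entries with the largest magnitude and set the rest to zero."""
    z = np.zeros_like(w)
    flat_idx = np.argpartition(np.abs(w).ravel(), -alpha)[-alpha:]  # top-alpha magnitudes
    z.ravel()[flat_idx] = w.ravel()[flat_idx]
    return z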

The third subproblem is

$\min_{\{\mathbf{Y}_i\}} \; \sum_{i=1}^{N} h_i(\mathbf{Y}_i) + \sum_{i=1}^{N} \frac{\lambda_i}{2}\big\|\mathbf{W}_i^{k+1} - \mathbf{Y}_i + \mathbf{V}_i^k\big\|_F^2$   (8)

Similar to the second subproblem, the solution of problem (8) is

$\mathbf{Y}_i^{k+1} = \Pi_{S'_i}\big(\mathbf{W}_i^{k+1} + \mathbf{V}_i^k\big)$   (9)

For the weight quantization problem, the quantization levels $q_{i,1}, q_{i,2}, \ldots, q_{i,M_i}$ are fixed. In fact, weight quantization is a special type of clustering in which the clustering centroids are pre-determined and fixed. In weight quantization, the Euclidean projection maps every element of $\mathbf{W}_i^{k+1} + \mathbf{V}_i^k$ to the quantization level (centroid) closest to that element.

For weight clustering, the centroids of the clusters can be updated dynamically, and the constraint is on the number of clusters $M_i$ for the $i$-th layer. Suppose that the weights are already divided into clusters $C_{i,1}, C_{i,2}, \ldots, C_{i,M_i}$. Then the Euclidean projection sets every element in $\mathbf{W}_i^{k+1} + \mathbf{V}_i^k$ to the average value of its cluster. The details of how to divide the weights into clusters will be discussed in the next section.
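The two variants of this projection can be sketched as follows (NumPy; the set of fixed levels and the current cluster assignment are taken as inputs, and all other details are illustrative assumptions):

import numpy as np

def project_quantization(v, levels):
    """Quantization: map every entry of v to the closest fixed quantization level
    (for the joint problem, the level set would include 0 for pruned positions)."""
    levels = np.asarray(levels, dtype=float)
    idx = np.abs(v[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def project_clustering(v, assignment, num_clusters):
    """Clustering: replace every entry of v by the mean of its cluster, given a
    current cluster assignment array of the same shape as v."""
    out = np.array(v, dtype=float)
    for c in range(num_clusters):
        mask = (assignment == c)
        if mask.any():
            out[mask] = v[mask].mean()
    return out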

After solving the subproblems, we update the dual variables $\mathbf{U}_i$ and $\mathbf{V}_i$, which are given by

$\mathbf{U}_i^{k+1} = \mathbf{U}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}$   (10)
$\mathbf{V}_i^{k+1} = \mathbf{V}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Y}_i^{k+1}$   (11)

This completes one iteration of ADMM; we solve the subproblems and update the dual variables iteratively until the convergence of ADMM. Namely, the following conditions need to be satisfied:

$\big\|\mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}\big\|_F^2 \le \epsilon_i$   (12)
$\big\|\mathbf{W}_i^{k+1} - \mathbf{Y}_i^{k+1}\big\|_F^2 \le \epsilon_i$   (13)
More Understanding about the ADMM-based Framework

An interpretation of the high performance of the ADMM-based framework is as follows. It can be understood as a smart, dynamic DNN regularization technique (see Eqn. (5)), in which the regularization targets are dynamically updated in each ADMM iteration through the solutions of subproblems 2 and 3. This dynamic characteristic is one of the key reasons that the ADMM-based framework outperforms many prior works on DNN model compression based on regularization (without updating of regularization targets) or Projected Gradient Descent [Zhang et al.2018a].

Simplification for the Proposed Framework

The above formulation and iterative solution have high complexity. To address this issue, we present a method that prunes the unimportant weights first and then performs weight clustering or quantization. The underlying reason for this order is the higher degree of redundancy in the number of weights than in the number of bits for weight representation (and therefore the higher gain from weight pruning than from weight clustering/quantization), as discussed before.

In the first step, we only account for the constraints $\mathbf{W}_i \in S_i$ for DNN weight pruning. We update $\mathbf{W}_i$ and $\mathbf{b}_i$ according to

$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} \frac{\rho_i}{2}\big\|\mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k\big\|_F^2$   (14)

and update $\mathbf{Z}_i$ and $\mathbf{U}_i$ according to (7) and (10).

After weight pruning, we solve the DNN weight clustering or quantization problem. We consider the constraints $\mathbf{W}_i \in S'_i$ for weight clustering or quantization on the pruned model (the remaining weights). To solve this problem, we update $\mathbf{W}_i$ and $\mathbf{b}_i$ according to

$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\},\{\mathbf{b}_i\}\big) + \sum_{i=1}^{N} \frac{\lambda_i}{2}\big\|\mathbf{W}_i - \mathbf{Y}_i^k + \mathbf{V}_i^k\big\|_F^2$   (15)

and update $\mathbf{Y}_i$ and $\mathbf{V}_i$ according to (9) and (11). Note that in weight clustering/quantization, we only update the non-zero elements; the pruned weights are fixed to zero.
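A common way to keep the pruned weights fixed at zero during this second stage is to mask the pruned positions after every update step (and, optionally, their gradients before the step). The sketch below assumes a PyTorch module and a precomputed binary mask per constrained layer; these names and the framework are illustrative assumptions.

import torch

def apply_pruning_masks(model, masks):
    """Re-zero the pruned positions after each optimizer step so that only the
    surviving (non-zero) weights are updated when solving Eqn. (15)."""
    with torch.no_grad():
        for name, W in model.named_parameters():
            if name in masks:
                W.mul_(masks[name])  # pruned entries stay exactly zero

def mask_gradients(model, masks):
    """Optionally also zero the gradients of pruned weights before the step."""
    for name, W in model.named_parameters():
        if name in masks and W.grad is not None:
            W.grad.mul_(masks[name])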

The overall algorithm is shown in Algorithm 1; details of weight quantization/clustering will be discussed in the next section. We start from trained DNNs (e.g., LeNet-5, AlexNet, VGGNet). We use the weight pruning ratios $\alpha_i$'s and clustering/quantization levels $M_i$'s from prior work [Han et al.2015, Han, Mao, and Dally2016] as starting points, and further increase the pruning ratios and decrease the number of clustering/quantization levels. The rationale behind this procedure is that our framework is unified and systematic and can achieve higher DNN model compression than the state of the art.

1:for each layer $i$ in the number of layers $N$ do
2:     Initialize the $\alpha_i$ value in every layer;
3:end for
4:for each iteration $k$ in the ADMM iterations do
5:     Solve subproblem (14) and update the $\mathbf{W}_i$'s and $\mathbf{b}_i$'s;
6:     for each layer $i$ in the number of layers $N$ do
7:         Update the $\mathbf{Z}_i$'s by performing the Euclidean mapping (7);
8:         Update the $\mathbf{U}_i$'s using Eqn. (10);
9:     end for
10:end for
11:Retrain the weights that have not converged yet, and conclude the weight pruning process;
12:Perform ADMM-based weight clustering/quantization (details to be discussed next).
Algorithm 1 ADMM-based Joint Weight Pruning and Clustering/Quantization (without details about the latter)

Details in Weight Quantization and Clustering

Weight clustering and quantization are used to further compress the weight representation after ADMM-based pruning. After weight pruning, there are many zeros in the weight matrix. We use $n_i$ bits, i.e., $M_i = 2^{n_i}$ points representing different weight values, to cluster or quantize the remaining non-zero weights (zero weights are already pruned). The number of bits and their representations can be different for different layers of the DNN. The difference between weight clustering and quantization is that in weight clustering the centroids of clusters are flexible, while the quantization levels (centroids) in weight quantization are fixed and predefined.

In the following we first discuss details of weight quantization, which can be perceived as a special case of weight clustering, and then the general weight clustering process.

Details in Weight Quantization

Parameter Initialization

In this work we use equal-distance quantization to facilitate hardware implementations. In each layer the number of quantization levels is $M_i = 2^{n_i}$. We quantize the weights into the set of quantization levels $\{\pm\Delta_i, \pm 2\Delta_i, \ldots, \pm\frac{M_i}{2}\Delta_i\}$. The interval $\Delta_i$ is the distance between adjacent quantization levels, which may be different for different layers. There is no need to quantize zero weights because they are already pruned.

The interval $\Delta_i$ and the number of quantization levels $M_i$ in weight quantization can be determined in an effective manner. For finding a value of $\Delta_i$, we denote by $W_{i,j}$ the $j$-th weight in layer $i$ and by $Q(\cdot)$ the function that maps a weight to its closest quantization level. The total square error in a single quantization step is then $\sum_j \big(W_{i,j} - Q(W_{i,j})\big)^2$. In order to minimize this total square error, we use a binary search method to determine $\Delta_i$. To decide a value of $M_i$ (i.e., $n_i$), we reference prior work such as [Han et al.2015] and decrease $n_i$ accordingly. In [Han et al.2015], around 5 bits are used for quantization (which is actually a kind of clustering) in AlexNet, whereas our experimental results show that 3-4 bits on average are sufficient to quantize the weights in AlexNet without incurring any accuracy loss.
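A simple sketch of this per-layer interval selection is shown below. The paper describes a binary search over the total squared error; this version instead does a coarse grid search with local refinement (a plainly different but equivalent-purpose search), and all bounds and resolutions are illustrative assumptions.

import numpy as np

def quant_error(w, delta, num_levels):
    """Total squared error when quantizing the non-zero weights of w onto the
    equal-distance levels {±delta, ±2·delta, ..., ±(num_levels/2)·delta}."""
    k = num_levels // 2
    steps = np.arange(1, k + 1)
    levels = np.concatenate([-steps[::-1], steps]) * delta
    v = w[w != 0]
    q = levels[np.abs(v[:, None] - levels).argmin(axis=1)]
    return float(np.sum((v - q) ** 2))

def find_interval(w, num_levels, num_coarse=64, num_fine=64):
    """Pick the interval delta minimizing the quantization error."""
    hi = 2.0 * np.abs(w).max() / (num_levels // 2)       # generous upper bound
    coarse = np.linspace(hi / num_coarse, hi, num_coarse)
    d0 = coarse[np.argmin([quant_error(w, d, num_levels) for d in coarse])]
    fine = np.linspace(0.5 * d0, 1.5 * d0, num_fine)
    return fine[np.argmin([quant_error(w, d, num_levels) for d in fine])]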

Iterative Weight Quantization and Retraining

After the ADMM procedure, many weights are close to the quantization levels rather than exactly on those levels, which means that we have not yet strictly quantized the weights. The reasons are twofold: the non-convexity of the optimization problem and the limited time to finish the ADMM procedure. A straightforward remedy is to project all the weights onto the nearest quantization levels. Although the change in every weight value is very small (the weights are already close to the quantization levels), the accumulated effect over the huge number of weights causes around 1% overall accuracy degradation in our experiments (quantization to 3 bits).

To address this degradation, we present an iterative weight quantization method. In our method, we iteratively project a portion of the weights onto the nearby quantization levels, fix these values (i.e., we quantize these weights), and retrain the rest. More specifically, after the ADMM procedure we quantize a fixed portion, $p\%$, of the weights closest to every quantization level and then retrain the remaining weights. After quantizing these weights, we observe an accuracy degradation of the DNN, which the retraining step recovers. After the retraining step, we again quantize the $p\%$ of weights closest to every quantization level and perform another retraining step. This quantization-and-retraining process is repeated until the number of unquantized weights is small enough. Finally, we quantize this small number of remaining weights, which does not incur accuracy loss.

The advantage of the proposed method is that we only quantize a portion of the weights in every iteration, and the retraining step gives the remaining weights an additional chance to be updated and recover the accuracy. This explains why the iterative quantization method works better than the straightforward method of quantizing all the weights directly. The overall ADMM-based weight quantization and iterative retraining process is shown in Algorithm 2.

1:for each layer $i$ in the number of layers $N$ do
2:     Initialize $n_i$ and $M_i$ in every layer;
3:     Set the interval $\Delta_i$;
4:end for
5:for each iteration $k$ in the ADMM iterations do
6:     Solve subproblem (15) and update the $\mathbf{W}_i$'s and $\mathbf{b}_i$'s;
7:     for each layer $i$ in the number of layers $N$ do
8:         Update the $\mathbf{Y}_i$'s by performing the Euclidean mapping (9);
9:         Update the $\mathbf{V}_i$'s using Eqn. (11);
10:     end for
11:end for
12:for each quantization iteration do
13:     for every layer $i$ do
14:         Quantize the $p\%$ of weights closest to every quantization level;
15:     end for
16:     Perform retraining on the remaining weights;
17:end for
18:Quantize the rest of the weights.
Algorithm 2 ADMM-based Weight Quantization and Iterative Retraining Process
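To make the quantize-and-freeze step (line 14 of Algorithm 2) concrete, the sketch below selects, for each quantization level, a share of the not-yet-frozen weights closest to that level, snaps them to it, and marks them as frozen; retraining would then update only the unfrozen weights. Interpreting the percentage as a per-level share, and the 1-D array handling, are illustrative assumptions.

import numpy as np

def quantize_closest_fraction(w, frozen, levels, p=0.2):
    """One quantize-and-freeze iteration.
    w: 1-D array of a layer's non-zero weights.
    frozen: boolean array marking weights quantized in earlier iterations.
    levels: the layer's quantization levels. p: fraction to freeze this round."""
    w = w.copy()
    per_level = max(1, int(p * np.count_nonzero(~frozen) / len(levels)))
    for q in levels:
        free = np.where(~frozen)[0]
        if free.size == 0:
            break
        closest = free[np.argsort(np.abs(w[free] - q))[:per_level]]
        w[closest] = q          # snap to the level ...
        frozen[closest] = True  # ... and freeze it for all later retraining
    return w, frozen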

Details in Weight Clustering

In weight clustering, we cluster the remaining weights (after weight pruning) into $M_i$ clusters, where weights in each cluster share the same value. Different from weight quantization, the centroid value $c_{i,j}$ of each cluster is flexible, and should be optimized along with the weight clustering procedure. In the following, we discuss the details of weight clustering that differ from quantization, including the weight clustering training and retraining processes. The initialization of the number of clusters $M_i$ is the same as in weight quantization and will not be discussed in detail.

Weight Clustering Training

The clustering centroids need to be determined jointly with the training procedure. For initialization, we perform K-means clustering (with $M_i$ clusters) and determine each centroid as the average value of its associated weights. In each ADMM iteration of weight clustering training, we perform weight mapping (Euclidean mapping) of $\mathbf{W}_i^{k+1} + \mathbf{V}_i^k$ to the nearest centroid values, and update the weights by solving Eqn. (15). Based on the updated weights, we perform K-means clustering again and update each centroid as the average value of its associated weights. In this way we perform both weight clustering and centroid updating in an effective manner.
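The per-iteration clustering step (assign each remaining weight to its nearest centroid, then reset each centroid to the mean of its cluster) can be sketched as a few Lloyd iterations on a layer's non-zero weights. The evenly spaced initial centroids and the iteration count below are illustrative assumptions; the paper's procedure initializes the centroids with K-means as described above.

import numpy as np

def cluster_weights(w, num_clusters, centroids=None, kmeans_iters=10):
    """1-D K-means on the non-zero weights of one layer. Returns the projected
    weights (each entry replaced by its centroid), the cluster assignment of the
    non-zero entries, and the updated centroid values."""
    v = w[w != 0]
    if centroids is None:  # spread initial centroids over the weight range
        centroids = np.linspace(v.min(), v.max(), num_clusters)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(kmeans_iters):
        assign = np.abs(v[:, None] - centroids).argmin(axis=1)   # nearest centroid
        for c in range(num_clusters):
            if np.any(assign == c):
                centroids[c] = v[assign == c].mean()             # centroid update
    projected = w.copy()
    projected[w != 0] = centroids[assign]
    return projected, assign, centroids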

Weight Clustering Retraining

After finishing ADMM training of weight clustering, we perform a weight clustering retraining process to avoid accuracy degradation. This retraining process is not ADMM-based, but is based on plain stochastic gradient descent. In the retraining process, we perform stochastic gradient descent only on the centroid value of each cluster; in this way we maintain the same value (the centroid value) for all the weights in a cluster. The retraining process can therefore only improve accuracy rather than degrade it. This retraining flexibility of the centroid values is the key reason that weight clustering achieves higher accuracy than quantization. Algorithm 3 illustrates the whole process of ADMM-based weight clustering and retraining.

1:for each layer $i$ in the number of layers $N$ do
2:     Determine the number of clusters $M_i$ in every layer;
3:end for
4:for each iteration $k$ in the ADMM iterations do
5:     Solve subproblem (15) and update the $\mathbf{W}_i$'s and $\mathbf{b}_i$'s;
6:     for each layer $i$ in the number of layers $N$ do
7:         Update the $\mathbf{Y}_i$'s by performing the Euclidean mapping (9);
8:         Update the $\mathbf{V}_i$'s using Eqn. (11);
9:     end for
10:     for each layer $i$ in the number of layers $N$ do
11:         Perform K-means clustering on the weights;
12:         Update the centroid values based on the clustering results;
13:     end for
14:end for
15:for each retraining epoch do
16:     Retrain the centroid values based on the clustering results, without ADMM;
17:end for
Algorithm 3 ADMM-based Weight Clustering and Retraining Process.
Model | Accuracy degradation | No. of weights | CONV weight bits | FC weight bits | Total data size / Compression ratio | Total model size (incl. index) / Compression ratio
LeNet-5 Baseline | 0.0% | 430.5K | 32 | 32 | 1.7MB | 1.7MB
Iterative pruning [Han, Mao, and Dally2016] | 0.1% | 35.8K | 8 | 5 | 24.2KB / 70.2× | 52.1KB / 33×
Our Method (Clustering) | 0.1% | 2.57K | 3 | 2 (3 for output layer) | 0.89KB / 1,910× | 2.73KB / 623×
Our Method (Quantization) | 0.2% | 2.57K | 3 | 2 (3 for output layer) | 0.89KB / 1,910× | 2.73KB / 623×
Table 1: Comparison of model size compression ratios on the LeNet-5 model for the MNIST dataset.
Model | Accuracy degradation | No. of weights | CONV weight bits | FC weight bits | Total data size / Compression ratio | Total model size (incl. index) / Compression ratio
AlexNet Baseline | 0.0% | 60.9M | 32 | 32 | 243.6MB | 243.6MB
Iterative pruning [Han, Mao, and Dally2016] | 0.0% | 6.7M | 8 | 5 | 5.4MB / 45× | 9.0MB / 27×
Binary quant. [Leng et al.2017] | 3.0% | 60.9M | 1 | 1 | 7.3MB / 32× | 7.3MB / 32×
Ternary quant. [Leng et al.2017] | 1.8% | 60.9M | 2 | 2 | 15.2MB / 16× | 15.2MB / 16×
Our Method (Clustering) | 0.1% | 2.47M | 5 | 3 | 1.16MB / 210× | 2.7MB / 90×
Our Method (Quantization) | 0.2% | 2.47M | 5 | 3 | 1.16MB / 210× | 2.7MB / 90×
Table 2: Comparison of model size compression ratios on the AlexNet model for the ImageNet dataset.

Experimental Results and Discussions

In this section, we apply the proposed joint weight pruning and weight clustering/quantization framework to LeNet-5 [LeCun et al.1998] on the MNIST dataset and AlexNet [Krizhevsky, Sutskever, and Hinton2012] on the ImageNet dataset. We focus on the total compression ratio of the overall DNN model, which depends on the number of weights and the total number of bits for weight representation. We also compare our model compression results with representative works on DNN weight pruning and clustering/quantization. The comparisons show that we achieve a significant improvement in DNN model compression. We implement the LeNet-5 experiments in TensorFlow [Abadi et al.2016] and the AlexNet (and VGGNet) experiments in Caffe [Jia et al.2014]. Our experiments are carried out on GeForce GTX 1080Ti and NVIDIA Tesla P100 GPUs.

We initialize ADMM using the pretrained models of LeNet-5 and AlexNet, and set the penalty parameters $\rho_i$ and $\lambda_i$ separately for LeNet-5 and for AlexNet. The penalty parameters we set for weight pruning and for weight clustering/quantization on a given network are the same.

Our codes and models are released at the link http://bit.ly/2D3F0np.

LeNet-5 on MNIST Dataset

We first present the weight pruning and quantization/clustering results on the LeNet-5 model. The overall results on model size compression are shown in Table 1, while the layer-wise results are shown in Table 3. For ADMM-based weight pruning alone, we achieve up to 167× weight reduction without accuracy loss, which is notably higher than prior work such as [Han et al.2015] (12×), [Zhang et al.2018a] (24.1×, but on a different model, LeNet-300-100), and [Aghasi et al.2017] (45.7×, with 0.5% accuracy degradation).

For ADMM-based joint weight pruning and quantization, we simultaneously achieve 88× weight reduction through pruning and use an average of 2.4 bits for quantization, without accuracy loss. In terms of weight data storage, the compression ratio reaches 1,910× compared with the original LeNet-5. This is a clearly impressive result, considering that each MNIST sample has 784 pixels and even logistic regression has 7.84K weights (MNIST has 10 classes). When the indices (required in weight pruning) are accounted for, the whole model size reduction becomes 623×. We mainly compare with [Han, Mao, and Dally2016] because there is little prior work on joint weight pruning and quantization/clustering. We observe significant improvements in both weight pruning and quantization compared with the prior work, demonstrating the effectiveness of the ADMM framework.

When weight clustering is applied, we use the same number of bits for the CONV and FC layers as in quantization, and an accuracy improvement is achieved. Note that a further reduction in bit representation is difficult to achieve without accuracy degradation (because the FC layers are already quantized/clustered to 2 bits). Hence quantization will be sufficient if the last 0.1% of accuracy is not a design consideration.

Layer | No. of weights | No. of weights after pruning | Percentage of weights after pruning | Weight bits
conv1 | 0.5K | 0.1K | 20% | 5
conv2 | 25K | 1.33K | 5.3% | 3
fc1 | 400K | 0.8K | 0.2% | 2
fc2 | 5K | 0.35K | 7% | 3
Total | 430.5K | 2.58K | 0.6% | 2.4
Table 3: Layer-wise weight pruning and weight quantization/clustering results on LeNet-5.

AlexNet on ImageNet Dataset

In this section we present the weight pruning and quantization/clustering results on the AlexNet model. The overall results on model size compression are shown in Table 2, while the layer-wise results are shown in Table 4. For ADMM-based weight pruning alone, we achieve 24.7× weight reduction without accuracy loss, which is notably higher than prior work such as [Han et al.2015] (9×), [Yu et al.2017] (10×) and [Dai, Yin, and Jha2017] (15.7×, but starting from a smaller DNN than the original AlexNet). We also performed tests on VGGNet and achieved similar results on weight pruning and quantization/clustering; for example, the weight pruning ratio is 23.4× without accuracy loss. Further results are omitted due to space limitations.

For ADMM-based joint weight pruning and quantization, we simultaneously achieve 24.7× weight reduction through pruning and use an average of 3.4 bits (not accounting for the first and last layers, similar to prior work) for weight quantization, without accuracy loss. In terms of weight data storage, the compression ratio reaches 210× compared with the original AlexNet model, which is also a significant improvement. When the indices in weight pruning are accounted for, the whole model size reduction becomes 90×. Compared with [Han et al.2015], we observe significant improvements in both weight pruning and quantization, demonstrating the effectiveness of the ADMM-based framework.

Finally, for weight clustering, we use the same number of bits for the CONV and FC layers as in quantization, and an accuracy improvement is observed. Again, due to the difficulty of further reducing the bit representation without accuracy loss, weight quantization will be sufficient for most application domains.

Layer | No. of weights | No. of weights after pruning | Percentage of weights after pruning | Weight bits
conv1 | 34.8K | 28.19K | 81% | 8
conv2 | 307.2K | 61.44K | 20% | 5
conv3 | 884.7K | 168.09K | 19% | 5
conv4 | 663.5K | 132.7K | 20% | 5
conv5 | 442.4K | 88.48K | 20% | 5
fc1 | 37.7M | 0.75M | 2% | 3
fc2 | 16.8M | 0.91M | 5.4% | 3
fc3 | 4.1M | 0.33M | 8% | 8
Total | 60.9M | 2.47M | 4.06% | 3.4
Table 4: Layer-wise weight pruning and weight quantization/clustering results on AlexNet.

Conclusion

In this paper, we present a unified framework of DNN weight pruning and weight clustering/quantization using ADMM. When we focus on weight pruning alone, we achieve 167× weight reduction in LeNet-5, 24.7× in AlexNet, and 23.4× in VGGNet without accuracy loss. For the combination of DNN weight pruning and clustering/quantization, we achieve 1,910× and 210× storage reduction of weight data on LeNet-5 and AlexNet, respectively, without accuracy loss.

References