Learning Sparsity and Quantization Jointly and Automatically for Neural Network Compression via Constrained Optimization

10/14/2019 · by Haichuan Yang, et al.

Deep Neural Networks (DNNs) are applied in a wide range of use cases. There is an increasing demand for deploying DNNs on devices that lack abundant resources such as memory and computation units. Recently, network compression techniques such as pruning and quantization have been proposed to reduce the resource requirements. A key parameter to which all existing compression techniques are sensitive is the compression ratio (e.g., pruning sparsity, quantization bitwidth) of each layer. Traditional solutions treat the compression ratios of the layers as hyper-parameters and tune them using human heuristics. Recent researchers have started using black-box hyper-parameter optimization, but it introduces new hyper-parameters and has efficiency issues. In this paper, we propose a framework that jointly prunes and quantizes DNNs automatically, subject to a target model size, without any hyper-parameters that manually set the compression ratio of each layer. In our experiments, we show that the framework can compress the weights of ResNet-50 to be 836× smaller without accuracy loss on CIFAR-10, and compress AlexNet to be 205× smaller without accuracy loss on ImageNet classification.




1 Introduction

Nowadays, Deep Neural Networks (DNNs) are being applied everywhere around us. Besides running inference tasks on cloud servers, DNNs are also increasingly deployed in resource-constrained environments, ranging from embedded systems in micro aerial vehicles and autonomous cars to mobile devices such as smartphones and Augmented Reality headsets. In these environments, DNNs often operate under specific resource constraints such as model size, execution latency, and energy consumption. Therefore, it is critical to compress DNNs to run inference under given resource constraints while maximizing accuracy.

In the past few years, various techniques have been proposed to compress DNN models, of which pruning and quantization are two of the most widely used in practice. Pruning forces the weight tensors to be sparse, while quantization enforces a low-bit representation for each DNN weight. These methods compress the DNN weights in each layer and result in a compressed DNN with lower resource consumption. It has been shown that by appropriately setting the compression ratio (i.e., sparsity or quantization bitwidth) for each layer, the compression can bring negligible accuracy drop

(Han et al., 2015a).

A fundamental question for these compression techniques is: how to find the optimal compression ratio, e.g., sparsity and/or bitwidth, for each layer in a way that meets a given resource constraint. Traditional DNN compression methods (Han et al., 2015a; Ye et al., 2018b; He et al., 2019) determine the compression ratio of each layer based on human heuristics. Since the compression ratios can be seen as hyper-parameters, the idea from recent research of using black-box optimization for hyper-parameter search can be directly adopted (Tung & Mori, 2018). He et al. (2018)

apply reinforcement learning (RL) in DNN pruning by formulating the pruning ratio as a continuous action and the accuracy as the reward.

Wang et al. (2019) apply a similar formulation but use it to search for the quantization bitwidth of each layer. CLIP-Q (Tung & Mori, 2018) proposes a compression method that requires the sparsity and quantization bitwidth to be set as hyper-parameters, and uses Bayesian optimization libraries to search for them. Evolutionary search (ES) has also been used in this scenario: Guo et al. (2019) propose an ES-based network architecture search (NAS) method and use it to search for compression ratios, and Liu et al. (2019) use meta-learning and ES to find the pruning ratios for channel pruning. The basic idea of these methods is to formulate the compression ratio search as a black-box optimization problem, but this introduces the new hyper-parameters of the RL or ES algorithm itself. Tuning black-box optimization algorithms can be very tricky (Islam et al., 2017) and is usually inefficient (Irpan, 2018). For example, the RL algorithm DDPG (Lillicrap et al., 2015) has dozens of hyper-parameters, including the batch size, actor/critic network architectures, actor/critic optimizers and learning rates, reward scale, discount factor, replay buffer size, target network update factor, exploration noise variance, and so on. Therefore, it is highly desirable to have an automated approach that avoids human heuristics as much as possible.

Meanwhile, to maximize the compression performance, pruning and quantization can be used together (Han et al., 2015a). Under this circumstance, a compression algorithm must tune both the sparsity and the quantization bitwidth of each layer, and these two parameters influence each other. For example, if layer i has a higher bitwidth than another layer j, then pruning layer i (i.e., reducing its number of nonzero elements) contributes more to model compression than pruning layer j. Joint pruning and quantization thus increases the difficulty of manually choosing the compression ratios or tuning them as hyper-parameters.
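As a toy numeric illustration (with hypothetical layer sizes and bitwidths, not taken from the paper), the storage saved by pruning a weight depends on that weight's layer bitwidth:

```python
# Toy illustration (hypothetical layer sizes): pruning a weight from a
# high-bitwidth layer saves more storage than pruning one from a
# low-bitwidth layer, so sparsity and bitwidth must be tuned jointly.

def layer_bits(nonzeros: int, bitwidth: int) -> int:
    """Storage (in bits) for the nonzero weight values of one layer."""
    return nonzeros * bitwidth

# Layer i: 1000 nonzeros at 8 bits; layer j: 1000 nonzeros at 2 bits.
before = layer_bits(1000, 8) + layer_bits(1000, 2)

# Pruning 100 weights from the 8-bit layer i ...
after_prune_i = layer_bits(900, 8) + layer_bits(1000, 2)
# ... saves 4x more storage than pruning 100 weights from the 2-bit layer j.
after_prune_j = layer_bits(1000, 8) + layer_bits(900, 2)

saving_i = before - after_prune_i  # 800 bits saved
saving_j = before - after_prune_j  # 200 bits saved
```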

Method                        Support pruning   Support quantization   End-to-end optimization
AMC (He et al., 2018)         ✓                 ✗                      ✗
HAQ (Wang et al., 2019)       ✗                 ✓                      ✗
CLIP-Q (Tung & Mori, 2018)    ✓                 ✓                      ✗
Ours                          ✓                 ✓                      ✓
Table 1: Comparison across different automated model compression methods.

In this paper, we present an end-to-end framework for automated DNN compression. Our method jointly quantizes and prunes the DNN weights, and simultaneously learns the compression ratios (i.e., layer-wise sparsity and bitwidth) and the compressed model weights. Instead of treating the compression ratios as hyper-parameters and using black-box optimization, our method is based on a constrained optimization in which an overall model size budget constrains the structure of the compressed model weights. Table 1 shows a comparison of our method with recently proposed automated model compression works.

Because our compression constraint considers both the bitwidth and the sparsity of each layer, it is hard to solve directly. In this work, we show that this constraint can be decoupled, and the resulting formulation can be solved with the Alternating Direction Method of Multipliers (ADMM), which is widely used in constrained optimization (Boyd et al., 2011). Using ADMM to solve the proposed problem involves two projection operations, one induced by pruning and the other by quantization. We show that these two projection operators can be cast as integer linear programming (ILP) problems, and derive efficient algorithms to solve them. Figure 1 gives an overview of the proposed framework; the main procedure consists of gradient descent/ascent steps and projection operations. For more details, refer to Section 3.

Figure 1: Illustration of the proposed DNN compression framework. The DNN weight W is kept sparse and its duplicate W̃ is kept quantized; W̃ is a "soft duplicate" of W and the two converge to be equal.

In summary, we make the following contributions:

  • We propose an end-to-end framework that automatically compresses DNNs without manually setting the compression ratio of each layer. It utilizes pruning and quantization simultaneously and directly learns the compressed DNN weights, which can have a different sparsity and bitwidth for each layer.

  • We mathematically formulate the automated compression problem as a constrained optimization problem. The problem has a "sparse + quantized" constraint, which is further decoupled so that we can solve it using ADMM.

  • The main challenge in using ADMM for the automated compression problem is solving the projection operators for pruning and quantization. We introduce algorithms for computing the projections under the sparsity constraint and the quantization constraint. In the experiments, we validate our automated compression framework and show its superiority over handcrafted and black-box hyper-parameter search methods.

2 Related Work

2.1 Model Compression Techniques

Due to the enormous impact of mobile computing, increasingly complex DNN models need to fit into low-power devices for real applications. To address the computation and memory costs on mobile systems, pruning and quantization have emerged as two practical approaches.


Pruning refers to decreasing the number of nonzero parameters in DNN models. Han et al. (2015b) proposed a simple approach that zeroes out the weights whose magnitudes are smaller than a threshold. By fine-tuning after removing the smaller weights, the accuracy drop is usually negligible even at a considerable compression ratio (Han et al., 2015a). Besides weight pruning for model compression, channel (filter/neuron) pruning (Li et al., 2016b; Zhou et al., 2016; Molchanov et al., 2016; He et al., 2017; Luo et al., 2017; Zhuang et al., 2018; Liu et al., 2017; Ye et al., 2018a) removes entire filters of the CNN weights, thus also achieving inference acceleration. Wen et al. (2016) introduced more sparsity structures into CNN pruning, such as shape-wise and depth-wise sparsity.
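Magnitude-based weight pruning in the style of Han et al. (2015b) can be sketched in a few lines; the function name and the tie-handling at the threshold are our own simplifications:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of
    weights, the thresholding scheme of Han et al. (2015b); ties at the
    threshold are all removed in this sketch."""
    k = int(round(sparsity * w.size))  # number of weights to remove
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    pruned = w.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned
```

In practice this step is followed by fine-tuning the remaining nonzero weights to recover accuracy.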


Besides decreasing the number of parameters with pruning, quantization is another direction for compressing DNNs. To relieve the cost of memory storage or computation, quantization converts floating-point elements to low-bit representations; for example, we can quantize all parameters from 32-bit precision to 8 bits or lower (Han et al., 2015a) to scale down the model size. In the extreme, the model weights can be binary (Courbariaux et al., 2015; Rastegari et al., 2016; Courbariaux et al., 2016; Hubara et al., 2017) or ternary (Li et al., 2016a; Zhu et al., 2016). The quantization intervals can be either uniform (Jacob et al., 2018) or nonuniform (Han et al., 2015a; Miyashita et al., 2016; Tang et al., 2017; Zhang et al., 2018). Typically, nonuniform quantization achieves a higher compression rate, while uniform quantization provides acceleration. Besides scalar quantization, vector quantization has also been applied to DNN model compression (Gong et al., 2014; Wu et al., 2018).
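A minimal sketch of uniform quantization, assuming evenly spaced levels spanning the weight range; the function and its affine mapping are illustrative, not any specific paper's scheme:

```python
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform quantization sketch: map weights to 2**bits evenly spaced
    levels spanning [w.min(), w.max()], then back to float values."""
    levels = 2 ** bits
    lo, hi = float(w.min()), float(w.max())
    if hi == lo:
        return w.copy()
    step = (hi - lo) / (levels - 1)
    codes = np.round((w - lo) / step)  # integer codes in [0, levels - 1]
    return lo + codes * step           # dequantized values
```

Nonuniform schemes instead place the levels where the weights cluster (e.g., via k-means), which is why they typically reach higher compression at the same error.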

Some methods perform training together with pruning and quantization, including Ye et al. (2018b) and CLIP-Q (Tung & Mori, 2018). These methods rely on manually set hyper-parameters to compress each layer at the desired compression ratio, although black-box hyper-parameter optimization can be used (Tung & Mori, 2018).

2.2 Automated Model Compression

Prior efforts on setting the compression ratio of each layer mostly use either rule-based approaches (Han et al., 2015a; Howard et al., 2017; Ye et al., 2018b; He et al., 2019) or black-box hyper-parameter search. Rule-based approaches rely on heuristics, and are thus suboptimal and do not scale as network architectures become more complex. Search-based approaches treat the problem as hyper-parameter search to eliminate the need for human labor. For pruning, NetAdapt (Yang et al., 2018) uses a greedy search strategy to find the sparsity ratio of each layer by gradually decreasing the resource budget and performing fine-tuning and evaluation iteratively; in each iteration, NetAdapt tries to reduce the number of nonzero channels of each layer and picks the layer that results in the smallest accuracy drop. Recent search-based approaches also employ reinforcement learning (RL), using the accuracy and resource consumption to define a reward that guides the search for pruning ratios (He et al., 2018) and quantization bitwidths (Yazdanbakhsh et al., 2018; Wang et al., 2019). Guo et al. (2019) use evolutionary search (ES) for network architecture search (NAS) and show that it can also search for compression ratios. Liu et al. (2019) use a hyper-network in the ES algorithm to find the layer-wise sparsity for channel pruning. Instead of regarding the layer-wise sparsity as hyper-parameters, recently proposed energy-constrained compression methods (Yang et al., 2019a, b) use optimization-based approaches to prune DNNs under a given energy budget. Besides the above, there are methods for searching efficient neural architectures (Cai et al., 2018; Tan et al., 2019), while our work concentrates on compressing a given architecture.

3 End-to-end Automated DNN Compression

In this section, we first introduce a general formulation of DNN compression, constrained by the total size of the compressed DNN weights. We then reformulate the original constraint to decouple pruning and quantization, and outline the algorithm that uses ADMM to solve the constrained optimization. Finally, as the proposed algorithm requires two crucial projection operators, we show that they can be cast as special integer linear programming (ILP) problems and introduce efficient algorithms to solve them.

3.1 Problem Formulation

Let W = {W_1, …, W_L} be the set of weight tensors of a DNN with L layers. To learn a compressed DNN with a target size S, we have the constrained problem

    min_W ℓ(W)   s.t.   Σ_{l=1}^{L} b(W_l) · ‖W_l‖₀ ≤ S,    (1)

where b(W_l) is the minimum bitwidth to encode all the nonzero elements of tensor W_l (i.e., the number of bits needed to index its distinct nonzero values), and the ℓ₀-norm ‖W_l‖₀ is the number of nonzero elements of W_l. The loss function ℓ is task-driven, for example, the cross-entropy loss for classification or the mean squared error for regression.

Problem (1) is a general form of DNN compression. When the bitwidth is assumed fixed and identical across layers, problem (1) reduces to the case of weight pruning (Han et al., 2015b). When the weight tensors are assumed to be dense, it reduces to mixed-bitwidth quantization (Wang et al., 2019).
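The left-hand side of the constraint in (1) can be sketched as follows, assuming the minimum bitwidth counts the bits needed to index a layer's distinct nonzero values (our reading of the definition above):

```python
import math
import numpy as np

def min_bitwidth(w: np.ndarray) -> int:
    """Minimum bits needed to index the distinct nonzero values of w."""
    n_vals = len(np.unique(w[w != 0]))
    if n_vals == 0:
        return 0
    return max(1, math.ceil(math.log2(n_vals)))

def compressed_size_bits(weights: list) -> int:
    """Left-hand side of the constraint in problem (1):
    sum over layers of b(W_l) * ||W_l||_0."""
    return sum(min_bitwidth(w) * np.count_nonzero(w) for w in weights)
```

A layer quantized to two distinct nonzero values thus costs 1 bit per nonzero weight, regardless of how many weights it contains.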

Compared with the ordinary training of deep learning, the compressed DNN learning problem (1) introduces a constraint, Σ_l b(W_l)·‖W_l‖₀ ≤ S. It is defined by two non-differentiable functions, b(·) and ‖·‖₀, which prevents solving it via normal training algorithms. Although there are projection-based algorithms that can handle a plain ℓ₀-norm constraint, they cannot be applied to our case because our constraint sums the products of b(W_l) and ‖W_l‖₀, which is more complicated.

3.2 Constraint Decoupling via Alternating Direction Method of Multipliers

We deal with the constraint in (1) by decoupling its ℓ₀-norm and bitwidth parts. Specifically, we reformulate problem (1) into an equivalent form:

    min_{W, W̃}  ℓ(W)   s.t.   W = W̃,   Σ_{l=1}^{L} b(W̃_l) · ‖W_l‖₀ ≤ S,    (2)

where W̃ is a duplicate of the DNN weights W; the bitwidth term in the size constraint is now measured on W̃ while the sparsity term is measured on W.

In this paper, we apply the idea of ADMM to solve the above problem. We introduce the dual variable Z and absorb the equality constraint into the (scaled-form) augmented Lagrangian, i.e.,

    max_Z min_{W, W̃}  L_ρ(W, W̃, Z) = ℓ(W) + (ρ/2)‖W − W̃ + Z‖²_F − (ρ/2)‖Z‖²_F    (3a)
    s.t.   Σ_{l=1}^{L} b(W̃_l) · ‖W_l‖₀ ≤ S,    (3b)

where ρ > 0 is a hyper-parameter. Based on ADMM, we solve this problem by updating W, W̃, and Z iteratively. In each iteration t, we have three steps corresponding to the variables W, W̃, and Z, respectively.

Fix W̃, Z; update W.

In this step, we treat W̃ and Z as constants and update W to minimize L_ρ, i.e.,

    W^{t+1} = argmin_W  ℓ(W) + (ρ/2)‖W − W̃^t + Z^t‖²_F   s.t.  Σ_l b(W̃^t_l)·‖W_l‖₀ ≤ S.    (4)

Because of the complexity of the DNN model and the large amount of training data, problem (4) is usually hard, and gradient-based algorithms are often used to solve it iteratively. Here we apply a proximal method to simplify objective (4). First, using a quadratic proxy to approximate the objective, problem (4) becomes

    W^{t+1} = argmin_W  ‖W − (W^t − ηG^t)‖²_F   s.t.  Σ_l b(W̃^t_l)·‖W_l‖₀ ≤ S,    (5)

where G^t is the (stochastic) gradient of the objective of (4) at the point W^t, and η is the learning rate. Problem (5) is the projection of W^t − ηG^t onto the set {W : Σ_l b(W̃^t_l)·‖W_l‖₀ ≤ S}. We call it the compression projection with fixed bitwidth, and show how to solve it in Section 3.3.

Fix W, Z; update W̃.

Here we use the updated W^{t+1} and minimize L_ρ with respect to W̃:

    W̃^{t+1} = argmin_{W̃}  ‖W^{t+1} − W̃ + Z^t‖²_F   s.t.  Σ_l b(W̃_l)·‖W^{t+1}_l‖₀ ≤ S.    (6)

Since W^{t+1} and Z^t are fixed in this step, they can be seen as constants here. Problem (6) is the projection of W^{t+1} + Z^t onto {W̃ : Σ_l b(W̃_l)·‖W^{t+1}_l‖₀ ≤ S}. We call this projection the compression projection with fixed sparsity and leave the details of solving it to Section 3.4.

Fix W, W̃; update Z.

To update the dual variable Z, we perform a gradient ascent step:

    Z^{t+1} = Z^t + W^{t+1} − W̃^{t+1}.    (7)

The above updating rules follow standard ADMM. Although ADMM relies on several assumptions (e.g., a convex objective function), we apply it to a non-convex loss function and non-differentiable constraint functions to solve the minimax problem (3). In Section 4, we demonstrate that these updating rules work well for our problem.
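The three-step update structure can be sketched on a toy quadratic loss. The two projections below are simplified stand-ins (top-k magnitude selection for Section 3.3, rounding to a uniform grid for Section 3.4), so this illustrates the ADMM loop rather than the paper's actual knapsack-based operators:

```python
import numpy as np

def project_sparse(w, k):
    """Stand-in for the Sec. 3.3 projection: keep the k largest magnitudes."""
    out = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:k]
    out[idx] = w[idx]
    return out

def project_quant(w, step=0.25):
    """Stand-in for the Sec. 3.4 projection: round to a uniform grid."""
    return np.round(w / step) * step

# Toy problem: loss = 0.5 * ||w - target||^2 (stands in for the DNN loss).
target = np.array([1.1, -0.02, 0.48, 0.01])
w = np.zeros(4)        # primal variable W (kept sparse)
w_tilde = np.zeros(4)  # duplicate W~ (kept quantized)
z = np.zeros(4)        # scaled dual variable Z
rho, lr = 1.0, 0.5

for _ in range(50):
    # Step 1: gradient step on the augmented Lagrangian, then sparsity projection (Eq. 5).
    grad = (w - target) + rho * (w - w_tilde + z)
    w = project_sparse(w - lr * grad, k=2)
    # Step 2: quantization projection of W + Z (Eq. 6).
    w_tilde = project_quant(w + z)
    # Step 3: dual ascent (Eq. 7).
    z = z + w - w_tilde
```

At convergence W and W̃ agree, so the learned weights are simultaneously sparse and quantized.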

3.3 Compression Projection with Fixed Bitwidth

Problem (5) can be seen as a weighted ℓ₀-norm projection with fixed per-layer bitwidths b_l = b(W̃^t_l):

    min_W  ‖W − V‖²_F   s.t.   Σ_l b_l·‖W_l‖₀ ≤ S,    (8)

where V = W^t − ηG^t. We will show that this is actually a 0-1 Knapsack problem.

Proposition 1.

The projection problem in (8) is equivalent to the following 0-1 Knapsack problem:

    max_s  pᵀs   s.t.   cᵀs ≤ S,   s ∈ {0, 1}ⁿ,    (9)

where p and c have the same shape as (the flattened) V: p = V², the element-wise square of V, and each element of c equals the bitwidth b_l of the layer that the element belongs to. The optimal solution of (8) is W* = s* ⊙ V, where s* is the optimal solution to the knapsack problem (9) and ⊙ is element-wise multiplication.

In this 0-1 Knapsack problem, p is called the "profit" and c is the "weight". The 0-1 Knapsack problem selects a subset of items (corresponding to the DNN weights in our case) to maximize the total profit while the total weight does not exceed the budget S. The 0-1 Knapsack problem is NP-hard, but there exists an efficient greedy algorithm to approximately solve it (Kellerer et al., 2004). The idea is based on the profit-to-weight ratio p_i/c_i: we sort all items by this ratio and iteratively select the largest ones until the constraint boundary is reached. The complexity of this algorithm is O(n log n), where n is the total number of items. Because the sorting and cumulative-sum operations are supported on GPU, we can efficiently implement this greedy algorithm on GPU and use it in our DNN compression framework.
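A sketch of this greedy knapsack projection using NumPy in place of a GPU implementation; the flattened-vector interface (one bit cost per element) is our simplification:

```python
import numpy as np

def knapsack_projection(v: np.ndarray, bit_cost: np.ndarray, budget: float) -> np.ndarray:
    """Greedy solution of the 0-1 knapsack projection (Proposition 1):
    profit = v**2, weight = per-element bit cost. Keep the elements with
    the largest profit/weight ratio whose cumulative cost fits the budget."""
    profit = v ** 2
    ratio = profit / bit_cost
    order = np.argsort(-ratio)            # items by descending profit/weight ratio
    cum_cost = np.cumsum(bit_cost[order]) # running total bit cost
    keep = order[cum_cost <= budget]      # greedy prefix that fits the budget
    mask = np.zeros_like(v)
    mask[keep] = 1.0
    return mask * v                       # W* = s* ⊙ V
```

The sort and cumulative sum are exactly the operations the text notes are GPU-friendly.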

3.4 Compression Projection with Fixed Sparsity

The solution of problem (6) is the projection P(W^{t+1} + Z^t), where the projection operator P is defined as

    P(U) = argmin_{W̃}  ‖U − W̃‖²_F   s.t.   Σ_l b(W̃_l)·‖W^{t+1}_l‖₀ ≤ S.    (10)

The above problem can also be reformulated as an integer linear program. In the following, we introduce a special variant of the Knapsack problem called the Multiple-Choice Knapsack Problem (MCKP) (Kellerer et al., 2004) and show that problem (10) can be written as an MCKP.

Definition 1.

Multiple-Choice Knapsack Problem (MCKP) (Kellerer et al., 2004). Consider L mutually disjoint groups which contain n_1, …, n_L items respectively. The j-th item of the l-th group has a "profit" p_{l,j} ≥ 0 and a "weight" c_{l,j} ≥ 0. MCKP asks how to select exactly one item from each group to maximize the sum of profits while keeping the sum of weights under a given budget S, i.e.,

    max_x  Σ_l Σ_j p_{l,j} x_{l,j}   s.t.   Σ_l Σ_j c_{l,j} x_{l,j} ≤ S,   Σ_j x_{l,j} = 1 ∀l,   x_{l,j} ∈ {0, 1}.    (11)
Define B as the set of candidate bitwidths used in this paper. Let e(W_l, b) be the error of quantizing W_l with bitwidth b, i.e.,

    e(W_l, b) = min_{W̃_l} ‖W_l − W̃_l‖²_F   s.t.   b(W̃_l) ≤ b,

which can be solved by the k-means algorithm for nonuniform quantization (Han et al., 2015a). Now we are ready to reformulate problem (10) as an MCKP.
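The per-layer quantization error e(W_l, b) can be sketched with a small 1-D k-means (Lloyd's algorithm); the quantile initialization and fixed iteration count here are our own choices, not the paper's:

```python
import numpy as np

def kmeans_quant_error(w: np.ndarray, bits: int, iters: int = 20) -> float:
    """Quantization error e(w, bits): squared error of clustering the
    nonzero weights into 2**bits centroids with 1-D k-means
    (nonuniform quantization)."""
    vals = w[w != 0].astype(float)
    k = min(2 ** bits, len(np.unique(vals)))
    # Initialize centroids at evenly spaced quantiles of the weights.
    centers = np.quantile(vals, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = vals[assign == j].mean()
    assign = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
    return float(((vals - centers[assign]) ** 2).sum())
```

Evaluating e(W_l, b) for every b in B yields the per-item profits used in the MCKP below.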

Proposition 2.

The compression projection problem (10) can be reformulated as an instance of MCKP in Definition 1. Specifically, each group corresponds to one layer and has size |B|. Each choice of quantization bitwidth b ∈ B is regarded as an MCKP item: the profit is the negated quantization error −e(W_l, b), the weight is b·‖W^{t+1}_l‖₀, and the Knapsack budget is S; the selection variable x_{l,j} indicates which bitwidth is chosen for layer l.

The MCKP is also NP-hard. However, if we relax the binary constraints x_{l,j} ∈ {0, 1} to 0 ≤ x_{l,j} ≤ 1, it reduces to a linear program and can be solved efficiently. Zemel (1980) transforms the linear relaxation of MCKP into a fractional knapsack problem and uses a greedy algorithm to solve it. Here we obtain a feasible MCKP solution based on the basic steps in Kellerer et al. (2004):

  1. For each group, sort the items by weight in ascending order, i.e., c_{l,j} ≤ c_{l,j+1}. According to Kellerer et al. (2004, Proposition 11.2.2), after removing dominated items the profits of the sorted items are nondecreasing, i.e., p_{l,j} ≤ p_{l,j+1}, and the incremental profit densities (p_{l,j+1} − p_{l,j}) / (c_{l,j+1} − c_{l,j}) are decreasing in j.

  2. Select the first item (having the smallest weight) of each group. Note that the budget must be large enough to contain these items; otherwise there is no feasible solution under the constraints.

  3. For the remaining items, select the one with the largest incremental profit density; when selecting the (j+1)-th item of the l-th group, discard the j-th item. Repeat the same procedure for the 2nd, 3rd, … largest densities, until the total weight of the selected items would exceed the budget.

The above greedy algorithm finds a feasible MCKP solution, i.e., it selects one item from each group and guarantees that their total weight is under the given budget S. Its time complexity is O(n log n) in the total number of items n = Σ_l n_l. In practice, L and |B| are much smaller than the number of DNN weights, so the cost of this algorithm is negligible. Although the greedy solution is not always globally optimal, it has some nice properties and can be globally optimal in some cases (Kellerer et al., 2004, Corollary 11.2.3). By using MCKP-Greedy to solve our compression projection problem (10), we obtain the projection result of W̃, which essentially allocates the bitwidth across different layers.
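The three steps above can be sketched as follows, assuming each group's items are already sorted and dominance-reduced as in step 1; this is a simplified illustration of MCKP-Greedy, not the paper's implementation:

```python
def mckp_greedy(profits, weights, budget):
    """Greedy feasible solution for the Multiple-Choice Knapsack Problem.
    profits[g][j], weights[g][j]: items of group g, with weights ascending
    and profits nondecreasing within each group (after the sorting and
    dominance-removal step). Returns (chosen item index per group, used weight)."""
    n_groups = len(profits)
    choice = [0] * n_groups  # step 2: start from the lightest item of each group
    used = sum(weights[g][0] for g in range(n_groups))
    assert used <= budget, "budget too small for any feasible solution"

    # Step 3: candidate upgrades ranked by incremental profit density.
    upgrades = []
    for g in range(n_groups):
        for j in range(1, len(profits[g])):
            dp = profits[g][j] - profits[g][j - 1]
            dw = weights[g][j] - weights[g][j - 1]
            upgrades.append((dp / dw, g, j))

    for _, g, j in sorted(upgrades, reverse=True):
        if j == choice[g] + 1:  # upgrade one step at a time within a group
            dw = weights[g][j] - weights[g][j - 1]
            if used + dw <= budget:
                choice[g] = j   # discard item j-1, take item j
                used += dw
    return choice, used
```

In our setting each group is a layer, each item is a candidate bitwidth, and `choice` is the per-layer bitwidth allocation.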

We summarize the training procedure of our method in Algorithm 1. We use T to denote the total number of SGD iterations of our algorithm. For large-scale datasets, the number of SGD iterations can be very large, so we do not perform the projections and the dual update every time after the proximal SGD step on W; instead, we use a hyper-parameter T₀ to control the frequency of the dual updates, where T should be divisible by T₀. In our experiments, T₀ is set to the number of iterations in one epoch. Ideally, W should converge to W̃ in the end, but it is hard to make W exactly equal to W̃ in practice. So we perform a final quantization on W to guarantee that it satisfies the model size constraint.


4 Experiments

In this section, we evaluate our automated compression framework. We start by introducing the experimental setup, including evaluation and implementation details, and then show the compression results of our framework compared with state-of-the-art methods.

4.1 Experiment Setup


Datasets

We evaluate our method on three datasets that are most commonly used in DNN compression: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). We use the standard training/testing split and data preprocessing on all three datasets. For ImageNet, we evaluate on the image classification task (1,000 classes).

DNN models

We evaluate a wide range of DNN models, which are also used in current state-of-the-art compression methods. On MNIST, we use LeNet-5 (https://github.com/BVLC/caffe/tree/master/examples/mnist), which has two convolutional layers followed by two fully connected layers. For CIFAR-10, we evaluate ResNet-20 and ResNet-50 (He et al., 2016), which have 20 and 50 layers respectively. For ImageNet, we use AlexNet (Krizhevsky et al., 2012) and the well-known compact model MobileNet (Howard et al., 2017).

Baselines and metric

We compare our method with current state-of-the-art model compression methods related to ours: the automated pruning method AMC (He et al., 2018), the automated quantization methods ReLeQ (Yazdanbakhsh et al., 2018) and HAQ (Wang et al., 2019), and joint pruning-and-quantization methods (Han et al., 2015a; Louizos et al., 2017; Tung & Mori, 2018; Ye et al., 2018b). Although there is some overhead from the sparse index, we only use the size of the compressed weight data to compute the compression rate, because it directly corresponds to our size constraint; moreover, different indexing techniques could introduce unfairness into the comparison.

Implementation details

We set the batch size to 256 for AlexNet and LeNet-5, and to 128 for ResNets and MobileNet. We use momentum SGD to optimize W, with separate initial learning rates for AlexNet/MobileNet and for LeNet-5/ResNets, decayed with the cosine annealing strategy (Loshchilov & Hutter, 2016). The hyper-parameter ρ is fixed across all experiments. For a fair comparison, the compression budget is set to be no larger than that of the compared methods. Training runs for 120 epochs on LeNet-5 and ResNets, and 90 epochs on AlexNet and MobileNet.

4.2 DNN Compression Results


In Table 2, we show the validation accuracies of compressed LeNet-5 models for different methods. We list the nonzero weight percentage, the averaged bitwidth, the compression rate (original weight size / compressed weight size), and the accuracy drop. Our method achieves an impressive 2,120× compression rate without any accuracy drop. Among the compared methods, the performance of Ye et al. (2018b) is the closest to ours; comparing the details of its compressed model, we find that our method tends to leave more nonzero weights but uses fewer bits to represent each weight.

Method Nonzero% Averaged bitwidth Compression rate Accuracy drop
Deep compression (Han et al., 2015a) 8.3% 5.3 70 0.1%
BC-GNJ (Louizos et al., 2017) 0.9% 5 573 0.1%
BC-GHS (Louizos et al., 2017) 0.6% 5 771 0.1%
Ye et al. (2018b) 0.6% 2.8 1,910 0.1%
Ours 1.0% 1.46 2,120 0.0%
Table 2: Comparison across different compression methods on LeNet-5@MNIST.


Table 3 shows the results of the compressed ResNets on the CIFAR-10 dataset. For ResNet-20, we compare with the automated quantization method ReLeQ (Yazdanbakhsh et al., 2018). For a fair comparison, we evaluate two compressed models of our method: one using only quantization and another using joint pruning and quantization. The quantization-only model achieves a 16× compression rate without accuracy drop, which is both more accurate and smaller than ReLeQ's result. When introducing pruning, there is a 0.14% accuracy drop but the compression rate improves to 35.4×.

For ResNet-50, we compare with the automated pruning method AMC (He et al., 2018). Its compressed ResNet-50, targeted at model size reduction, has 60% nonzero weights. In our experiments, we find that ResNet-50 still leaves a large space to compress: the pruning-only result of our method compresses ResNet-50 to 50% nonzero weights with a 1.51% accuracy improvement. By jointly pruning and quantizing, our method compresses ResNet-50 at rates from 462× to 836×; accuracy loss only appears when compressing the model even further, which suggests that ResNet-50 is largely redundant for CIFAR-10 classification and compressing it can reduce overfitting.

Model Method Nonzero% Averaged bitwidth Compression rate Accuracy drop
ResNet-20 ReLeQ (Yazdanbakhsh et al., 2018) - 2.8 11.4 0.12%
Ours - 2 16 0.00%
Ours 46% 1.9 35.4 0.14%
ResNet-50 AMC (He et al., 2018) 60% - 1.7 -0.11%
Ours 50% - 2 -1.51%
Ours 4.2% 1.7 462 -1.25%
Ours 3.1% 1.9 565 -0.90%
Ours 2.2% 1.8 836 0.00%
Table 3: Comparison across different compression methods on CIFAR-10.


Table 4 shows the compression results for MobileNet and AlexNet. For MobileNet, we compare with the quantization methods of Han et al. (2015a) and HAQ (Wang et al., 2019). Our quantization-only results with averaged bitwidths 2 and 3 have 7.10% and 1.19% accuracy drops respectively, which are substantially smaller than the HAQ counterparts (13.76% and 3.24%). The compression rate can be further improved to 26.7× when pruning and quantization are performed jointly.

For AlexNet, we compare with methods that jointly prune and quantize. Unlike our end-to-end framework, all the compared methods set the pruning ratios and quantization bitwidths as hyper-parameters; CLIP-Q (Tung & Mori, 2018) uses Bayesian optimization to choose these hyper-parameters, while the others set them manually. The uncompressed AlexNet is the PyTorch pretrained model. When compressing the model to be 118× smaller, our method obtains a 1.00% accuracy improvement, higher than that of the compressed CLIP-Q model at a similar compression rate. Our method can also compress AlexNet to be 205× smaller without accuracy drop, while the compressed model of Ye et al. (2018b) has a 0.1% accuracy drop at a slightly lower compression rate.

Model Method Nonzero% Averaged bitwidth Compression rate Accuracy drop
MobileNet Han et al. (2015a) - 2 16 33.28%
HAQ (Wang et al., 2019) - 2 16 13.76%
Ours - 2 16 7.10%
Han et al. (2015a) - 3 10.7 4.97%
HAQ (Wang et al., 2019) - 3 10.7 3.24%
Ours - 3 10.7 1.19%
Ours 42% 2.8 26.7 4.41%
AlexNet Han et al. (2015a) 11% 5.4 54 0.00%
CLIP-Q (Tung & Mori, 2018) 8% 3.3 119 -0.70%
Ours 7.4% 3.7 118 -1.00%
Ye et al. (2018b) 4% 4.1 193 0.10%
Ours 5% 3.1 205 -0.08%
Table 4: Comparison across different compression methods on ImageNet.
Figure 2: Visualization of the compressed results of different layers on LeNet-5 and AlexNet. Panels: LeNet-5 (Ye et al., 2018b), AlexNet (CLIP-Q, Tung & Mori, 2018), AlexNet (Ye et al., 2018b), LeNet-5 (ours), AlexNet (our 118× compressed model), and AlexNet (our 205× compressed model). The number of nonzero weights is shown in log scale.

Compressed model visualization

In Figure 2, we visualize the distribution of sparsity and bitwidth across layers for LeNet-5 and AlexNet. The first row shows the compressed models of Ye et al. (2018b) and CLIP-Q (Tung & Mori, 2018), and the second row shows our compressed models. For LeNet-5, we observe that our method preserves more nonzero weights in the third layer while allocating a lower bitwidth compared with Ye et al. (2018b). For AlexNet, our method tends to allocate larger bitwidths to convolutional layers than to fully connected layers. CLIP-Q also allocates more bits to the convolutional layers, while Ye et al. (2018b) assigns more bits to the first and last layers. Our method also shows a preference for allocating more bits to sparser layers. This coincides with the intuition that the weights of sparser layers may be more informative, and that increasing the bitwidth of these layers brings less storage growth.

5 Conclusion

As DNNs are increasingly deployed on mobile devices, model compression is becoming more and more important in practice. Although many model compression techniques have been proposed in the past few years, the lack of a systematic approach to setting the layer-wise compression ratios diminishes their performance. Traditional methods require human labor to manually tune the compression ratios. Recent work uses black-box optimization to search for the compression ratios, but inherits the instability of black-box optimization and is not efficient enough. Different from prior work, we start from the root of the problem: we propose a constrained optimization formulation that considers both pruning and quantization and does not require compression ratios as hyper-parameters. Using ADMM, we build a framework that solves the constrained optimization problem iteratively, with efficient algorithms for Knapsack problems to solve the sub-procedures within ADMM. In the future, we plan to investigate other scenarios where black-box optimization can be replaced with more efficient optimization-based approaches.