Prune DNN using Alternating Direction Method of Multipliers (ADMM)
Weight pruning methods for deep neural networks (DNNs) have been investigated recently, but prior work in this area is mainly heuristic, iterative pruning, thereby lacking guarantees on the weight reduction ratio and convergence time. To mitigate these limitations, we present a systematic weight pruning framework of DNNs using the alternating direction method of multipliers (ADMM). We first formulate the weight pruning problem of DNNs as a nonconvex optimization problem with combinatorial constraints specifying the sparsity requirements, and then adopt the ADMM framework for systematic weight pruning. By using ADMM, the original nonconvex optimization problem is decomposed into two subproblems that are solved iteratively. One of these subproblems can be solved using stochastic gradient descent, while the other can be solved analytically. The proposed ADMM weight pruning method incurs no additional suboptimality besides that resulting from the nonconvex nature of the original optimization problem. Furthermore, our approach achieves a fast convergence rate. The weight pruning results are very promising and consistently outperform prior work. On the LeNet-5 model for the MNIST data set, we achieve 40.2 times weight reduction without accuracy loss. On the AlexNet model for the ImageNet data set, we achieve 20 times weight reduction without accuracy loss. When we focus on the convolutional layer pruning for computation reductions, we can reduce the total computation by five times compared with prior work (achieving a total of 13.4 times weight reduction in convolutional layers). A significant acceleration for DNN training is observed as well, in that we can finish the whole training process on AlexNet around 80 hours. Our models are released at https://drive.google.com/drive/folders/1_O9PLIFiNHIaQIuOIJjq0AyQ7UpotlNl.READ FULL TEXT VIEW PDF
We present a systematic weight pruning framework of deep neural networks...
Weight pruning methods of deep neural networks (DNNs) have been demonstr...
Many model compression techniques of Deep Neural Networks (DNNs) have be...
Weight pruning and weight quantization are two important categories of D...
Deep neural networks (DNNs) although achieving human-level performance i...
Weight quantization is one of the most important techniques of Deep Neur...
We develop a fast, tractable technique called Net-Trim for simplifying a...
Prune DNN using Alternating Direction Method of Multipliers (ADMM)
Large-scale deep neural networks or DNNs have made breakthroughs in many fields, such as image recognition [1, 2, 3], speech recognition [4, 5], game playing , and driver-less cars . Despite the huge success, their large model size and computational requirements will add significant burden to state-of-the-art computing systems [1, 8, 9], especially for embedded and IoT systems. As a result, a number of prior works are dedicated to model compression in order to simultaneously reduce the computation and model storage requirements of DNNs, with minor effect on the overall accuracy. These model compression techniques include weight pruning [9, 10, 11, 12, 13, 14, 15, 16], sparsity regularization [17, 18, 19], weight clustering [9, 20, 21], and low rank approximation [22, 23], etc.
A simple but effective weight pruning method has been proposed in , which prunes the relatively less important weights and performs retraining for maintaining accuracy in an iterative manner. It can achieve 9 weight reduction ratio on the AlexNet model with virtually no accuracy degradation. This method has been extended and generalized in multiple directions, including energy efficiency-aware pruning , structure-preserved pruning using regularization methods 
, and employing more powerful (and time-consuming) heuristics such as evolutionary algorithms. While existing pruning methods achieve good model compression ratios, they are heuristic (and therefore cannot achieve optimal compression ratio), lack theoretical guarantees on compression performance, and require time-consuming iterative retraining processes.
To mitigate these shortcomings, we present a systematic framework of weight pruning and model compression, by (i) formulating the weight pruning problem as a constrained nonconvex optimization problem with combinatorial constraints, which employs the cardinality function to induce sparsity of the weights, and (ii) adopting the alternating direction method of multipliers (ADMM)  for systematically solving this optimization problem. By using ADMM, the original nonconvex optimization problem is decomposed into two subproblems that are solved iteratively. In the weight pruning problem, one of these subproblems can be solved using stochastic gradient descent, and the other can be solved analytically. Upon convergence of ADMM, we remove the weights which are (close to) zero and retrain the network.
Our extensive numerical experiments indicate that ADMM works very well in weight pruning. The weight pruning results consistently outperform the prior work. On the LeNet-5 model for the MNIST data set, we achieve 71.2 weight reduction without accuracy loss, which is 5.9 times compared with . On the AlexNet model for the ImageNet data set, we achieve 21 weight reduction without accuracy loss, which is 2.3 times compared with . Moreover, when we focus on the convolutional layer pruning for computation reductions, we can reduce the total computation by five times compared with the prior work (achieving a total of 13.4 weight reduction in convolutional layers). Our models and codes are released at https://github.com/KaiqiZhang/admm-pruning.
Mathematical investigations have demonstrated a significant margin for weight reduction in DNNs due to the redundancy across filters and channels, and a number of prior works leverage this property to reduce weight storage. The techniques can be classified into two categories:1) Low rank approximation methods [22, 23]
such as Singular Value Decomposition (SVD), which are typically difficult to achieve zero accuracy degradation with compression, especially for very large DNNs;2) Weight pruning methods which aim to remove the redundant or less important weights, thereby achieving model compression with negligible accuracy loss.
A prior work  serves as a pioneering work for weight pruning. It uses a heuristic method of iteratively pruning the unimportant weights (weights with small magnitudes) and retraining the DNN. It can achieve a good weight reduction ratio, e.g., 9 for AlexNet, with virtually zero accuracy degradation, and can be combined with other model compression techniques such as weight clustering [9, 20]. It has been extended in several works. For instance, the energy efficiency-aware pruning method  has been proposed to facilitate energy-efficient hardware implementations, allowing for certain accuracy degradation. The structured sparsity learning technique has been proposed to partially overcome the limitation in  of irregular network structure after pruning. However, neither technique can outperform the original method  in terms of compression ratio under the same accuracy. There is recent work  that employs an evolutionary algorithm for weight pruning, which incorporates randomness in both pruning and growing of weights, following certain probabilistic rules. Despite the higher compression ratio it achieves, it suffers from a prohibitively long retraining phase. For example, it needs to start with an already-compressed model for further pruning on the ImageNet data set, instead of the original AlexNet or VGG models.
In summary, the prior weight pruning methods are highly heuristic and suffer from a long retraining phase. On the other hand, our proposed method is a systematic framework, achieves higher compression ratio, exhibits faster convergence rate, and is also general for structured pruning and weight clustering.
ADMM was first introduced in the 1970s, and theoretical results in the following decades have been collected in 
. It is a powerful method for solving regularized convex optimization problems, especially for problems in applied statistics and machine learning. Moreover, recent works[25, 26] demonstrate that ADMM is also a good tool for solving nonconvex problems, potentially with combinatorial constraints, since it can converge to a solution that may not be globally optimal but is sufficiently good for many applications.
For some problems which are difficult to solve directly, we can use variable splitting first, and then employ ADMM to decompose the problem into two subproblems that can be solved separately and efficiently. For example, the optimization problem
assumes that is differentiable and is non-differentiable but has exploitable structure properties. Common instances of are the norm and the indicator function of a constraint set. To make it suitable for the application of ADMM, we use variable splitting to rewrite the problem as
Next, via the introduction of the augmented Lagrangian, the above optimization problem can be decomposed into two subproblems in and . The first subproblem is , where is a quadratic function of its argument. Since and are differentiable, the first subproblem can be solved by gradient descent. The second subproblem is , where is a quadratic function of its argument. In problems where has some special structure, for instance if it is a regularizer in (1), exploiting the properties of may allow this problem to be solved analytically. More details regarding the application of ADMM to the weight pruning problem will be demonstrated in Section 4.2.
Consider an -layer DNN, where the collection of weights in the -th (convolutional or fully-connected) layer is denoted by and the collection of biases in the -th layer is denoted by
. In a convolutional layer the weights are organized in a four-dimensional tensor and in a fully-connected layer they are organized in a two-dimensional matrix.
Assume that the input to the (fully-connected) DNN is . Every column of corresponds to a training image, and the number of columns determines the number of training images in the input batch. The input will enter the first layer and the output of the first layer is calculated by
where and have columns, and
is a matrix with identical columns. The non-linear activation function
acts entry-wise on its argument, and is typically chosen to be the ReLU function in state-of-the-art DNNs. Since the output of one layer is the input of the next, the output of the -th layer for is given by
The output of the DNN corresponding to a batch of images is
In this case is a matrix, where is the number of classes in the classification, and is the number of training images in the batch. The element in matrix is the score of the -th training image corresponding to the -th class. The total loss of the DNN is calculated as
where denotes the Frobenius norm, the first term is cross-entropy loss, is the correct class of the -th image, and the second term is weight regularization.
Hereafter, for simplicity of notation we write , or simply , instead of . The same notational convention applies to writing instead of . The training of a DNN is a process of minimizing the loss by updating weights and biases. If we use the gradient descent method, the update at every step is
for where is the learning rate.
Our objective is to prune the weights of the DNN, and therefore we minimize the loss function subject to constraints on the cardinality of weights in each layer. More specifically, our training process solves
where returns the number of nonzero elements of its matrix argument and is the desired number of weights in the -th layer of the DNN111Our framework is also compatible with the constraint of total number of weights for the whole DNN.. A prior work  uses ADMM for DNN training with regularization in the objective function, which can result in sparsity as well. On the other hand, our method directly targets at sparsity with incorporating hard constraints on the weights, thereby resulting in a higher degree of sparsity.
We can rewrite the above weight pruning optimization problem as
where . It is clear that are nonconvex sets, and it is in general difficult to solve optimization problems with nonconvex constraints. The problem can be equivalently rewritten in a form without constraint, which is
where is the indicator function of , i.e.,
The first term of problem (2) is the loss function of a DNN, while the second term is non-differentiable. This problem cannot be solved analytically or by stochastic gradient descent. A recent paper , however, demonstrates that such problems lend themselves well to the application of ADMM, via a special decomposition into simpler subproblems. We begin by equivalently rewriting the above problem in ADMM form as
The augmented Lagrangian  of the above optimization problem is given by
where has the same dimension as and is the Lagrange multiplier (also known as the dual variable) corresponding to the constraint , the positive scalars are penalty parameters, denotes the trace, and denotes the Frobenius norm. Defining the scaled dual variable , the augmented Lagrangian can be equivalently expressed as
until both of the following conditions are satisfied
where the first term is the loss function of the DNN, and the second term can be considered as a special regularizer. Since the regularizer is a differentiable quadratic norm, and the loss function of the DNN is differentiable, problem (7) can be solved by stochastic gradient descent. More specifically, the gradients of the augmented Lagrangian with respect to and are given by
Note that we cannot prove optimality of the solution to subproblem (3), just as we can not prove optimality of the solution to the original DNN training problem due to the nonconvexity of the loss function of DNN.
On the other hand, problem (4) can be formulated as
Since is the indicator function of the set , the globally optimal solution of this problem can be explicitly derived as :
where denotes the Euclidean projection onto the set . Note that is a nonconvex set, and computing the projection onto a nonconvex set is a difficult problem in general. However, the special structure of allows us to express this Euclidean projection analytically. Namely, the solution of (4) is to keep the elements of with the largest magnitudes and set the rest to zero . Finally, we update the dual variable according to (5). This concludes one iteration of the ADMM algorithm.
We observe that the proposed systematic framework exhibits multiple major advantages in comparison with the heuristic weight pruning method in . Our proposed method achieves a higher compression ratio with a higher convergence rate compared with the iterative pruning and retraining method in . For example, we achieve 15 compression ratio on AlexNet with only 10 iterations of ADMM. Additionally, subproblem (3) can be solved in a fraction of the number of iterations needed for training the original network when we use warm start initialization, i.e., when we initialize subproblem (3) with in order to find . For example, when training on the AlexNet model using the ImageNet data set, convergence is achieved in approximately of the total iterations required for the original DNN training. Also, problems (4) and (5) are straightforward to carry out, thus their computational time can be ignored. As a synergy of the above effects, the total computational time of 10 iterations of ADMM will be similar to (or at least in the same order of) the training time of the original DNN. Furthermore, we achieve 21 compression ratio on AlexNet without accuracy loss when we use 40 iterations of ADMM.
For very small values of in (6), ADMM needs a large number of iterations to converge. However, in many applications, such as the weight pruning problem considered here, a slight increase in the value of can result in a significant speedup in convergence. On the other hand, when ADMM stops early, the weights to be pruned may not be identically zero, in the sense that there will be small nonzero elements contained in . To deal with this issue, we keep the elements with the largest magnitude in , set the rest to zero and no longer involve these elements in training (i.e., we prune these weights). Then, we retrain the DNN. Note that we only need a single retraining step and the convergence is much faster than training the original DNN, since the starting point of the retraining is already close to the point which can achieve the original test/validation accuracy.
We take the weight distribution of every (convolutional or fully connected) layer on LeNet-5 as an example to illustrate our systematic weight pruning method. The weight distributions at different stages are shown in Figure 1. The subfigures in the left column show the weight distributions of the pretrained model, which serves as our starting point. The subfigures in the middle column show that after the convergence of ADMM for moderate values of , we observe a clear separation between weights whose values are close to zero and the remaining weights. To prune the weights rigorously, we set the values of the close-to-zero weights exactly to zero and retrain the DNN without updating these values. The subfigures in the right column show the weight distributions after our final retraining step. We observe that most of the weights are zero in every layer. This concludes our weight pruning procedure.
|Layer||Weights||Weights after prune||Weights after prune %||Result of  %|
|Layer||Weights||Weights after prune||Weights after prune %||Result of  %|
We have tested the proposed systematic weight pruning framework on the MNIST benchmark using the LeNet-300-100 and LeNet-5 models  and the ImageNet ILSVRC-2012 benchmark on the AlexNet model , in order to perform an apple-to-apple comparison with the prior heuristic pruning work 
. The LeNet models are implemented and trained in TensorFlow
and the AlexNet models are trained in Caffe. We carry out our experiments on NVIDIA Tesla P100 GPUs. The weight pruning results consistently outperform the prior work. On the LeNet-5 model, we achieve 71.2 weight reduction without accuracy loss, which is 5.9 times compared with . On the AlexNet model, we achieve 21 weight reduction without accuracy loss, which is 2.3 times compared with . Moreover, when we focus on the convolutional layer pruning for computation reductions, we can reduce the total computation by five times compared with the prior work .
Table 1 shows our per-layer pruning results on the LeNet-300-100 model. LeNet-300-100 is a fully connected network with 300 and 100 neurons on the two hidden layers, respectively, and achieves 98.4test accuracy on the MNIST benchmark. Table 2 shows our per-layer pruning results on the LeNet-5 model. LeNet-5 contains two convolutional layers, two pooling layers and two fully connected layers, and can achieve 99.2 test accuracy on the MNIST benchmark.
Our pruning framework does not incur accuracy loss and can achieve a much higher compression ratio on these networks compared with the prior iterative pruning heuristic , which reduces the number of parameters by 12 on both LeNet-300-100 and LeNet-5. On the LeNet-300-100 model, our pruning method reduces the number of weights by 22.9, which is 90% higher than . Also, our pruning method reduces the number of weights by 71.2 on the LeNet-5 model, which is 5.9 times compared with .
We implement our systematic weight pruning method using the BAIR/BVLC AlexNet model222https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet on the ImageNet ILSVRC-2012 benchmark. The implementation is on the Caffe tool because it is faster than TensorFlow. The original BAIR/BVLC AlexNet model can achieve a top-5 accuracy 80.2% on the validation set. AlexNet contains 5 convolutional (and pooling) layers and 3 fully connected layers with a total of 60.9M parameters, with the detailed network structure shown in deploy.prototxt text on the website indicated in footnote 2.
Our first set of experiments only target model size reductions for AlexNet, and the results are shown in Table 3. It can be observed that our pruning method can reduce the number of weights by 21 on AlexNet, which is more than twice compared with the prior iterative pruning heuristic. We achieve a top-5 accuracy of 80.2% on the validation set of ImageNet ILSVRC-2012. Layer-wise comparison results are also shown in Table 3, while comparisons with some other model compression methods are shown in Table 5. These results clearly demonstrate the advantage of the proposed systematic weight pruning framework using ADMM.
|Layer||Weights||Weights after prune||Weights after prune%||Result of |
Our second set of experiments target computation reduction besides weight reduction. Because the major computation in state-of-the-art DNNs is in the convolutional layers, we mainly target weight pruning in these layers. Although on AlexNet, the number of weights in convolutional layers is less than that in fully connected layers, the computation on AlexNet is dominated by its 5 convolutional layers. In our experiments, we conduct experiments which keep the same portion of weights as  in fully connected layers but prune more weights in convolutional layers. For AlexNet, Table 4 shows that we can reduce the number of weights by 13.4 in convolutional layers, which is five times compared with 2.7 in . This indicates our pruning method can reduce much more computation compared with the prior work . Layer-wise comparison results are also shown in Table 4. Still, it is difficult to prune weights in the first convolutional layer because they are needed to directly extract features from the raw inputs. Our major gain is because (i) we can achieve significant weight reduction in conv2 through conv5 layers, and (ii) the first convolutional layer is relatively small and less computational intensive.
|Layer||Weights||Weights after prune||Weights after prune%||Result of |
|Total of conv1-5||2332.6K||173.92k||7.46%||37.1%|
Several extensions [12, 17] of the original weight pruning work have improved in various directions such as energy efficiency for hardware implementation and regularity, but they cannot strictly outperform the original work  in terms of compression ratio under the same accuracy. The very recent work 
employs an evolutionary algorithm for weight pruning, which incorporates randomness in both pruning and growing of weights following certain probability rules. It can achieve a comparable model size with our work. However, it suffers from a prohibitively long retraining phase. For example, it needs to start with an already-compressed model with 8.4M parameters for further pruning on ImageNet, instead of the original AlexNet model. By using an already-compressed model, it can reduce the number of neurons per layer as well, while such reduction is not considered in our proposed framework.
|Network||Top-5 Error||Parameters||Weight Reduction Ratio|
|Baseline AlexNet ||19.8%||60.9M||1.0|
|Layer-wise pruning ||%||6.7M||9.1|
|Network pruning ||%||6.7M||9.1|
|Our result (10 iterations of ADMM)||4.06M||15|
|Dynamic surgery ||3.45M||17.7|
|Our result (25 iterations of ADMM)||3.24M||18.8|
|Our result (40 iterations of ADMM)||2.9M||21|
For nonconvex problems in general, there is no guarantee that ADMM will converge to an optimal point. ADMM can converge to different points for different choices of initial values and and penalty parameters . To resolve this limitation, we set the pretrained model , a good solution of , to be the starting point when we use stochastic gradient descent to solve problem (7). We initialize by keeping the elements of with the largest magnitude and set the rest to be zero. We set . For problem (7), if the penalty parameters are too small, the solution will be close to the minimum of but fail to regularize the weights, and the ADMM procedure may converge slowly or not converge at all. If the penalty parameters are too large, the solution may regularize the weights well but fail to minimize , and therefore the accuracy of the DNN will be degradated. In actual experiments, we find that is an appropriate choice for LeNet-5 and LeNet-300-100, and that works well for AlexNet.
We initialize based on existing results in the literature, then we implement our weight pruning method on the DNN and test its accuracy. If there is no accuracy loss on the DNN, we decrease in every layer proportionally. We use binary search to find the smallest that will not result in accuracy loss.
Convergence behavior of ADMM (5 CONV layers in AlexNet) is shown in Figure 2 (left sub-figure). The loss value progression of AlexNet is shown in Figure 2 (right sub-figure). We start from an existing DNN model without pruning. After the convergence of ADMM, we remove the weights which are (close to) zero, which results in an increase in the loss value. We then retrain the DNN and the loss decreases to the same level as it was before pruning.
The cardinality function is nonconvex and nondifferentiable, which complicates the use of standard gradient algorithms. ADMM circumvents the issue of differentiability systematically, and does so without introducing additional numerical complexity. Furthermore, although ADMM achieves global optimality for convex problems, it has been shown in the optimization literature that it performs extremely well for large classes of nonconvex problems. In fact, ADMM-based pruning can be perceived as a smart regularization technique in which the regularization target will be dynamically updated in each iteration. The limitation is that we need to tune the parameters . However, some parameter tuning is generally inevitable; even a soft regularization parameter requires fine-tuning in order to achieve the desired solution structure. On the positive side, the freedom in setting allows the user to obtain the exact desired level of sparsity.
In this paper, we presented a systematic DNN weight pruning framework using ADMM. We formulate the weight pruning problem of DNNs as a nonconvex optimization problem with combinatorial constraints specifying the sparsity requirements. By using ADMM, the nonconvex optimization problem is decomposed into two subproblems that are solved iteratively, one using stochastic gradient descent and the other analytically. We reduced the number of weights by 22.9 on LeNet-300-100 and 71.2 on LeNet-5 without accuracy loss. For AlexNet, we reduced the number of weights by 21 without accuracy loss. When we focued on computation reduction, we reduced the number of weights in convolutional layers by 13.4 on AlexNet, which is five times compared with the prior work.
In future work, we will extend the proposed weight pruning method to incorporate structure and regularity in the weight pruning procedure, and develop a unified framework of weight pruning, activation reduction, and weight clustering.
Financial support from the National Science Foundation under awards CNS-1840813, CNS-1704662 and ECCS-1609916 is gratefully acknowledged.
Imagenet classification with deep convolutional neural networks.In: Advances in neural information processing systems. (2012) 1097–1105
Deep supervised learning for hyperspectral data classification through convolutional neural networks.In: Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International, IEEE (2015) 4959–4962
Predicting parameters in deep learning.In: Advances in neural information processing systems. (2013) 2148–2156
On optimal periodic sensor scheduling for field estimation in wireless sensor networks.In: Global Conference on Signal and Information Processing (GlobalSIP), IEEE (2013) 137–140