Incremental Network Quantization, Kmeans quantization, Iterative Pruning, Dynamic Network Surgery
Deep learning has become a ubiquitous technology to improve machine intelligence. However, most of the existing deep models are structurally very complex, making them difficult to deploy on mobile platforms with limited computational power. In this paper, we propose a novel network compression method called dynamic network surgery, which can remarkably reduce the network complexity by making on-the-fly connection pruning. Unlike the previous methods which accomplish this task in a greedy way, we properly incorporate connection splicing into the whole process to avoid incorrect pruning and turn it into continual network maintenance. The effectiveness of our method is demonstrated with experiments. Without any accuracy loss, our method can efficiently compress the number of parameters in LeNet-5 and AlexNet by a factor of 108× and 17.7× respectively, showing that it outperforms the recent pruning method by considerable margins. Code and some models are available at https://github.com/yiwenguo/Dynamic-Network-Surgery.
Despite these tremendous successes, recently designed networks tend to have more stacked layers, and thus more learnable parameters. For instance, AlexNet krizhevsky2012 designed by Krizhevsky et al. has 61 million parameters to win the ILSVRC 2012 classification competition, which is over 100 times more than that of LeCun’s conventional model lecun1998 (e.g., LeNet-5), let alone the much more complex models like VGGNet simonyan2014 . Since more parameters mean more storage requirement and more floating-point operations (FLOPs), this increases the difficulty of applying DNNs on mobile platforms with limited memory and processing units. Moreover, the battery capacity can be another bottleneck han2015 .
Although DNN models normally require a vast number of parameters to guarantee their superior performance, significant redundancies have been reported in their parameterizations denil2013 . Therefore, with a proper strategy, it is possible to compress these models without significantly losing their prediction accuracies. Among existing methods, network pruning appears to be an outstanding one due to its surprising ability of accuracy loss prevention. For instance, Han et al. han2015 recently proposed to make "lossless" DNN compression by deleting unimportant parameters and retraining the remaining ones (as illustrated in Figure 1), somewhat similar to a surgery process.
However, due to the complex interconnections among hidden neurons, parameter importance may change dramatically once the network surgery begins. This leads to two main issues in han2015 (and some other classical methods lecun1989 ; hassibi1993 as well). The first issue is the possibility of irretrievable network damage. Since the pruned connections have no chance to come back, incorrect pruning may cause severe accuracy loss. In consequence, the compression rate must be held well below its potential to avoid such loss. Another issue is learning inefficiency. As reported in han2015 , several iterations of alternate pruning and retraining are necessary to get a fair compression rate on AlexNet, while each retraining process consists of millions of iterations, which can be very time consuming.
In this paper, we attempt to address these issues and pursue the compression limit of the pruning method. To be more specific, we propose to sever redundant connections by means of continual network maintenance, which we call dynamic network surgery. The proposed method involves two key operations: pruning and splicing, conducted with two different purposes. Obviously, the pruning operation is made to compress network models, but over-pruning or incorrect pruning is responsible for the accuracy loss. In order to compensate for the unexpected loss, we properly incorporate the splicing operation into network surgery, thus enabling connection recovery whenever the pruned connections are found to be important. These two operations are integrated together by updating parameter importance whenever necessary, making our method dynamic.
In fact, the above strategies help to make the whole process flexible. They are beneficial not only to better approach the compression limit, but also to improve the learning efficiency, which will be validated in Section 4. In our method, pruning and splicing naturally constitute a circular procedure and dynamically divide the network connections into two categories, akin to the synthesis of excitatory and inhibitory neurotransmitter in human nervous systems lodish2000 .
The rest of this paper is structured as follows. In Section 2, we introduce the related methods of DNN compression by briefly discussing their merits and demerits. In Section 3, we highlight our intuition of dynamic network surgery and introduce its implementation details. Section 4 experimentally analyses our method and Section 5 draws the conclusions.
In order to make DNN models portable, a variety of methods have been proposed. Vanhoucke et al. vanhoucke2011 analyse the effectiveness of data layout, batching and the usage of Intel fixed-point instructions, making a speedup on x86 CPUs. Mathieu et al. mathieu2013 explore the fast Fourier transforms (FFTs) on GPUs and improve the speed of CNNs by performing convolution calculations in the frequency domain.
An alternative category of methods resorts to matrix (or tensor) decomposition. Denil et al. denil2013 propose to approximate parameter matrices with appropriately constructed low-rank decompositions. Their method achieves a speedup on the convolutional layer with a 1% drop in prediction accuracy. Following similar ideas, some subsequent methods can provide more significant speedups denton2014 ; zhang2015 ; lebedev2015 . Although matrix (or tensor) decomposition can be beneficial to DNN compression and speedup, these methods normally incur severe accuracy loss under high compression requirements.
Vector quantization is another possible way to compress DNNs. Gong et al. gong2014 explore several such methods and point out the effectiveness of product quantization. HashNet, proposed by Chen et al. chen2015 , handles network compression by grouping its parameters into hash buckets. It is trained with a standard backpropagation procedure and should be able to make substantial storage savings. The recently proposed BinaryConnect courbariaux2015 and Binarized Neural Networks courbariaux2016 are able to compress DNNs by a factor of 32×, while a noticeable accuracy loss is more or less inevitable.
This paper follows the idea of network pruning. It starts from the early work of LeCun et al. lecun1989 , which makes use of the second derivatives of the loss function to balance training loss and model complexity. As an extension, Hassibi and Stork hassibi1993 propose to take non-diagonal elements of the Hessian matrix into consideration, producing compression results with less accuracy loss. In spite of their theoretical optimization, these two methods suffer from high computational complexity when tackling large networks, regardless of the accuracy drop. Very recently, Han et al. han2015 explore magnitude-based pruning in conjunction with retraining, and report promising compression results without accuracy loss. It has also been validated that the sparse matrix-vector multiplication can further be accelerated by certain hardware design, making it more efficient than traditional CPU and GPU calculations han2016isca . The drawback of Han et al.’s method han2015 is mostly its potential risk of irretrievable network damage and learning inefficiency.
Our research on network pruning is partly inspired by han2015 , not only because it can be very effective to compress DNNs, but also because it makes no assumption on the network structure. In particular, this branch of methods can be naturally combined with many other methods introduced above, to further reduce the network complexity. In fact, Han et al. han2016iclr have already tested such combinations and obtained excellent results.
In this section, we highlight the intuition of our method and present its implementation details. In order to simplify the explanations, we only talk about the convolutional layers and the fully connected layers. However, as claimed in han2016iclr , our pruning method can also be applied to some other layer types as long as their underlying mathematical operations are inner products on vector spaces.
First of all, we clarify the notations in this paper. Suppose a DNN model can be represented as $\{\mathbf{W}_k : 0 \le k \le C\}$, in which $\mathbf{W}_k$ denotes a matrix of connection weights in the $k$th layer. For the fully connected layers with $p_k$-dimensional input and $q_k$-dimensional output, the size of $\mathbf{W}_k$ is simply $q_k \times p_k$. For the convolutional layers with learnable kernels, we unfold the coefficients of each kernel into a vector and concatenate all of them to $\mathbf{W}_k$ as a matrix.
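The unfolding of convolutional kernels into a weight matrix can be illustrated with a short NumPy sketch. The layer sizes below are hypothetical, and the convention that rows index outputs (one row per kernel, or a q x p matrix for a fully connected layer) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convolutional layer: 8 kernels, each covering 3 input channels and a 5x5 window.
num_kernels, in_channels, kh, kw = 8, 3, 5, 5
kernels = rng.standard_normal((num_kernels, in_channels, kh, kw))

# Unfold every kernel into a vector and stack the vectors as rows of one matrix.
W_conv = kernels.reshape(num_kernels, -1)

# Hypothetical fully connected layer with p-dimensional input and q-dimensional output:
# it is already a matrix (assumed here to be q x p).
p, q = 300, 100
W_fc = rng.standard_normal((q, p))

print(W_conv.shape)  # (8, 75)
print(W_fc.shape)    # (100, 300)
```

With this layout, both layer types reduce to inner products between rows of the weight matrix and (unfolded) input vectors, which is why a single matrix notation suffices for the rest of the section.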
In order to represent a sparse model with part of its connections pruned away, we use $\{\mathbf{W}_k, \mathbf{T}_k : 0 \le k \le C\}$. Each $\mathbf{T}_k$ is a binary matrix with its entries indicating the states of network connections, i.e., whether they are currently pruned or not. Therefore, these additional matrices can be considered as the mask matrices.
Since our goal is network pruning, the desired sparse model shall be learnt from its dense reference. Apparently, the key is to abandon unimportant parameters and keep the important ones. However, the parameter importance (i.e., the connection importance) in a certain network is extremely difficult to measure because of the mutual influences and mutual activations among interconnected neurons. That is, a network connection may be redundant due to the existence of some others, but it will soon become crucial once the others are removed. Therefore, it should be more appropriate to conduct a learning process and continually maintain the network structure.
Taking the $k$th layer as an example, we propose to solve the following optimization problem:

$$\min_{\mathbf{W}_k, \mathbf{T}_k} L(\mathbf{W}_k \odot \mathbf{T}_k) \quad \text{s.t.} \quad \mathbf{T}_k^{(i,j)} = h_k(\mathbf{W}_k^{(i,j)}), \ \forall (i,j) \in \mathcal{I}, \tag{1}$$

in which $L(\cdot)$ is the network loss function, $\odot$ indicates the Hadamard product operator, set $\mathcal{I}$ consists of all the entry indices in matrix $\mathbf{W}_k$, and $h_k(\cdot)$ is a discriminative function, which satisfies $h_k(w) = 1$ if parameter $w$ seems to be crucial in the current layer, and 0 otherwise. Function $h_k(\cdot)$ is designed on the basis of some prior knowledge so that it can constrain the feasible region of $\mathbf{W}_k \odot \mathbf{T}_k$ and simplify the original NP-hard problem. For the sake of conciseness, we leave the discussion of function $h_k(\cdot)$ to Section 3.3. Problem (1) can be solved by alternately updating $\mathbf{W}_k$ and $\mathbf{T}_k$ through the stochastic gradient descent (SGD) method, which will be introduced in the following paragraphs.
Since binary matrix $\mathbf{T}_k$ can be determined with the constraints in (1), we only need to investigate the update scheme of $\mathbf{W}_k$. Inspired by the method of Lagrange multipliers and gradient descent, we give the following scheme for updating $\mathbf{W}_k$. That is,

$$\mathbf{W}_k^{(i,j)} \leftarrow \mathbf{W}_k^{(i,j)} - \beta \frac{\partial}{\partial (\mathbf{W}_k^{(i,j)} \mathbf{T}_k^{(i,j)})} L(\mathbf{W}_k \odot \mathbf{T}_k), \ \forall (i,j) \in \mathcal{I}, \tag{2}$$

in which $\beta$ indicates a positive learning rate. It is worth mentioning that we update not only the important parameters, but also the ones corresponding to zero entries of $\mathbf{T}_k$, which are considered unimportant and ineffective to decrease the network loss. This strategy is beneficial to improve the flexibility of our method because it enables the splicing of improperly pruned connections.
The partial derivatives in formula (2) can be calculated by the chain rule with a randomly chosen minibatch of samples. Once matrices $\mathbf{W}_k$ and $\mathbf{T}_k$ are updated, they shall be applied to re-calculate the whole network activations and loss function gradient. By repeating these steps iteratively, the sparse model will be able to produce excellent accuracy. The above procedure is summarized in Algorithm 1.
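The alternating procedure can be sketched on a toy problem. The following is a minimal illustration, not the Caffe implementation: it assumes a single linear layer with a squared loss, full-batch gradients instead of minibatches, and a placeholder importance test (one magnitude threshold); the function name `surgery_step` and all numerical values are hypothetical.

```python
import numpy as np

def surgery_step(W, T, X, Y, lr, threshold):
    """One iteration on a single linear layer with a squared loss (toy setting)."""
    # Forward pass: only unpruned connections (mask entry 1) contribute.
    W_eff = W * T
    residual = X @ W_eff.T - Y
    loss = 0.5 * np.mean(np.sum(residual ** 2, axis=1))
    # Backward pass: gradient of the loss with respect to the masked weights W * T.
    grad = residual.T @ X / X.shape[0]
    # Mask update from the current weights (placeholder importance test).
    T_new = (np.abs(W) >= threshold).astype(W.dtype)
    # Weight update: every entry of W moves, including currently pruned ones,
    # so a pruned connection can grow back and be spliced later.
    W_new = W - lr * grad
    return W_new, T_new, loss

rng = np.random.default_rng(0)
w_true = np.zeros((1, 10))
w_true[0, :3] = [2.0, -3.0, 1.5]         # only three connections matter
X = rng.standard_normal((200, 10))
Y = X @ w_true.T

W = 0.01 * rng.standard_normal((1, 10))  # small dense starting point
T = np.ones_like(W)
losses = []
for _ in range(500):
    W, T, loss = surgery_step(W, T, X, Y, lr=0.1, threshold=0.2)
    losses.append(loss)

print(T.astype(int), losses[0], losses[-1])
```

In this toy run every connection is pruned in the first iteration because all initial weights are small; the three useful ones are spliced back once their dense weights have grown, which is only possible because the pruned entries keep receiving updates.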
Note that the dynamic property of our method is shown in two aspects. On one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. On the other hand, the mistakenly pruned connections shall be re-established if they ever appear to be important. The latter operation plays a dual role to network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in $\mathbf{T}_k$, which is analogous to the synthesis of excitatory and inhibitory neurotransmitters in the human nervous system lodish2000 . See Figure 2 for an overview of our method; the method pipeline can be found in Figure 1.
Since the measure of parameter importance influences the state of network connections, function $h_k(\cdot)$ can be essential to our dynamic network surgery. We have tested several candidates and finally found the absolute value of the input to be the best choice, as claimed in han2015 . That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold can be set based on the average absolute value and variance of its connection weights. However, to improve the robustness of our method, we use two thresholds $a_k$ and $b_k$ by importing a small margin $t$ and set $b_k$ as $a_k + t$ in Equation (3):

$$h_k(\mathbf{W}_k^{(i,j)}) = \begin{cases} 0 & \text{if } |\mathbf{W}_k^{(i,j)}| < a_k \\ \mathbf{T}_k^{(i,j)} & \text{if } a_k \le |\mathbf{W}_k^{(i,j)}| < b_k \\ 1 & \text{if } |\mathbf{W}_k^{(i,j)}| \ge b_k \end{cases} \tag{3}$$

For the parameters whose magnitudes fall between the two thresholds, we set their function outputs as the corresponding entries in $\mathbf{T}_k$, which means these parameters will neither be pruned nor spliced in the current iteration.
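The two-threshold rule can be written in a few lines of NumPy. This is a sketch under assumptions: `update_mask`, `thresholds_from_stats` and the arguments `a`, `b`, `c`, `t` are hypothetical names, and the way the lower threshold is derived from the mean and spread of the weight magnitudes is only one plausible reading of the text, not the paper's stated formula.

```python
import numpy as np

def update_mask(W, T, a, b):
    """Magnitude test with a margin: prune below a, keep/splice at or above b,
    and leave the current mask entry unchanged in between."""
    mag = np.abs(W)
    return np.where(mag < a, 0.0, np.where(mag >= b, 1.0, T))

def thresholds_from_stats(W, c, t):
    """Assumed rule: lower threshold from the mean and standard deviation of |W|,
    upper threshold obtained by adding a small margin t."""
    mag = np.abs(W)
    a = mag.mean() + c * mag.std()
    return a, a + t

W = np.array([[0.02, -0.15, 0.40],
              [-0.60, 0.12, 0.05]])
T = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])

T_new = update_mask(W, T, a=0.1, b=0.3)
print(T_new)
# [[0. 0. 1.]
#  [1. 1. 0.]]
```

In the example, the entry 0.40 is spliced back although it was pruned before, the entries 0.02 and 0.05 are pruned, and the two entries with magnitudes between the thresholds (0.15 and 0.12) simply keep their previous states.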
Considering that Algorithm 1 is a bit more complicated than the standard backpropagation method, we shall take a few more steps to boost its convergence. First of all, we suggest slowing down the pruning and splicing frequencies, because these operations lead to network structure change. This can be done by triggering the update scheme of $\mathbf{T}_k$ stochastically, with a probability of $p = \sigma(iter)$, rather than doing it constantly. Function $\sigma(\cdot)$ shall be monotonically non-increasing and satisfy $\sigma(0) = 1$. After a prolonged decrease, the probability $p$ may even be set to zero, i.e., no pruning or splicing will be conducted any longer.
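A possible trigger schedule is sketched below. The text only requires a monotonically non-increasing function that equals one at iteration zero; the inverse-decay form, the parameter names `gamma`, `power` and `cutoff`, and their values are assumptions chosen for illustration.

```python
import numpy as np

def sigma(it, gamma=1e-4, power=1.0, cutoff=None):
    """Hypothetical trigger probability: equals 1 at it = 0 and never increases.
    From iteration `cutoff` onward (if given) no pruning or splicing happens."""
    if cutoff is not None and it >= cutoff:
        return 0.0
    return (1.0 + gamma * it) ** (-power)

rng = np.random.default_rng(0)

def should_update_mask(it):
    """Decide stochastically whether the mask is refreshed at iteration `it`."""
    return rng.random() < sigma(it)

for it in (0, 10_000, 100_000):
    print(it, sigma(it))
```

Inside a training loop, the mask update would then be guarded by `if should_update_mask(it): ...`, while the weight update runs at every iteration.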
Another possible reason for slow convergence is the vanishing gradient problem. Since a large percentage of connections are pruned away, the network structure should become much simpler and probably even much "thinner" by utilizing our method. Thus, the loss function derivatives are likely to be very small, especially when the reference model is very deep. We resolve this problem by pruning the convolutional layers and fully connected layers separately, still in the dynamic way, which is somewhat similar to han2015 .
In this section, we will experimentally analyse the proposed method and apply it on some popular network models. For fair comparison and easy reproduction, all the reference models are trained by the GPU implementation of the Caffe package jia2014 with .prototxt files provided by the community. (The exceptions are the simulation experiment and the LeNet-300-100 experiments, for which we create the .prototxt files ourselves because they are not available in the Caffe model zoo.) Also, we follow the default experimental settings for the SGD method, including the training batch size, base learning rate, learning policy and maximal number of training iterations. Once the reference models are obtained, we directly apply our method to reduce their model complexity. A brief summary of the compression results is shown in Table 1.
To begin with, we consider an experiment on synthetic data to preliminarily test the effectiveness of our method and visualize its compression quality. The exclusive-OR (XOR) problem can be a good option. It is a nonlinear classification problem as illustrated in Figure 3. In this experiment, we turn the original problem into a more complicated one as in Figure 3, in which some Gaussian noises are mixed up with the original data (0,0), (0,1), (1,0) and (1,1).
In order to classify these samples, we design a network model as illustrated in the left part of Figure 4, which consists of 21 connections, each of which has a weight to be learned. The sigmoid function is chosen as the activation function for all the hidden and output neurons. Twenty thousand samples were randomly generated for the experiment, in which half of them were used as training samples and the rest as test samples.
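The synthetic setup can be reproduced approximately as follows. This is a sketch under assumptions: the noise level `noise_std` is not given in the text, the helper `make_noisy_xor` is hypothetical, and the 2-5-1 layout with one bias unit per layer is only one layout consistent with the stated 21 connections.

```python
import numpy as np

def make_noisy_xor(n_samples, noise_std, rng):
    """Draw XOR corner points and perturb them with Gaussian noise."""
    corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    idx = rng.integers(0, 4, size=n_samples)
    X = corners[idx] + rng.normal(0.0, noise_std, size=(n_samples, 2))
    # Label is the XOR of the two clean coordinates.
    y = np.logical_xor(corners[idx, 0] > 0.5, corners[idx, 1] > 0.5).astype(float)
    return X, y

rng = np.random.default_rng(0)
X, y = make_noisy_xor(20_000, noise_std=0.1, rng=rng)  # noise_std is an assumed value

# Half of the samples for training, the rest for testing.
X_train, y_train = X[:10_000], y[:10_000]
X_test, y_test = X[10_000:], y[10_000:]

# Connection count of an assumed 2-5-1 network with a bias unit feeding each layer.
n_in, n_hidden, n_out = 2, 5, 1
n_connections = (n_in + 1) * n_hidden + (n_hidden + 1) * n_out
print(n_connections)  # 21
```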
After 100,000 iterations of learning, this three-layer neural network achieves a prediction error rate of 0.31%. The weight matrix of network connections between input and hidden neurons can be found in Figure 4. Apparently, its first and last rows share similar elements, which means there are two hidden neurons functioning similarly. Hence, it is appropriate to use this model as a compression reference, even though it is not very large. After 150,000 iterations, the reference model is compressed into the model on the right side of Figure 4, and the new connection weights and their masks are shown in Figure 4. The grey and green patches in the mask matrix stand for those entries equal to one, and the corresponding connections shall be kept. In particular, the green ones indicate the connections that were mistakenly pruned in the beginning but spliced during the surgery. The other patches (i.e., the black ones) indicate the corresponding connections are permanently pruned in the end.
The compressed model has a prediction error rate of 0.30%, which is slightly better than that of the reference model, even though 40% of its parameters are set to be zero. Note that, the remaining hidden neurons (excluding the bias unit) act as three different logic gates and altogether make up the XOR classifier. However, if the pruning operations are conducted only on the initial parameter magnitude (as in han2015 ), then probably four hidden neurons will be finally kept, which is obviously not the optimal compression result.
In addition, if we reduce the impact of Gaussian noises and enlarge the margin between positive and negative samples, then the current model can be further compressed, so that one more hidden neuron will be pruned by our method.
So far, we have carefully explained the mechanism behind our method and preliminarily verified its effectiveness. In the following subsections, we will further test our method on three popular NN models and make quantitative comparisons with other network compression methods.
MNIST is a database of handwritten digits and it is widely used to experimentally evaluate machine learning methods. As in han2015 , we test our method on two network models: LeNet-5 and LeNet-300-100.
LeNet-5 is a conventional CNN model which consists of 4 learnable layers, including 2 convolutional layers and 2 fully connected layers. It is designed by LeCun et al. lecun1998 for document recognition. With 431K parameters to be learned, we train this model for 10,000 iterations and obtain a prediction error rate of 0.91%. LeNet-300-100, as described in lecun1998 , is a classical feedforward neural network with three fully connected layers and 267K learnable parameters. It is also trained for 10,000 iterations, following the same learning policy as with LeNet-5. The well trained LeNet-300-100 model achieves an error rate of 2.28%.
With the proposed method, we are able to compress these two models. The same batch size, learning rate and learning policy are set as with the reference training processes, except for the maximal number of iterations, which is properly increased. The results are shown in Table 1. After convergence, less than 1% of the network connections of LeNet-5 (a 108× reduction in parameters) and less than 2% of those of LeNet-300-100 are kept, while the prediction accuracies are as good or slightly better.
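As a quick sanity check on how a compression factor translates into the share of connections kept, the arithmetic for LeNet-5 can be worked out from the 108× figure quoted in the abstract and the 431K parameters of the reference model. The helper name `kept_percentage` is hypothetical.

```python
def kept_percentage(compression_factor):
    """Percentage of parameters remaining after compressing by the given factor."""
    return 100.0 / compression_factor

lenet5_kept = kept_percentage(108.0)   # roughly 0.93 percent of the connections
lenet5_params_left = 431_000 / 108.0   # roughly 4.0K of the 431K parameters

print(round(lenet5_kept, 2), round(lenet5_params_left))
```

A factor above 100× therefore corresponds to keeping less than 1% of the connections, and keeping less than 2% corresponds to a factor above 50×.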
|Model||Layer||Params.||Params.% han2015||Params.% (Ours)|
To better demonstrate the advantage of our method, we make layer-by-layer comparisons between our compression results and Han et al.’s han2015 in Table 2. To the best of our knowledge, their method is so far the most effective pruning method, if the learning inefficiency is not a concern. However, our method still improves the compression rate by at least 4 times over their method. Besides, due to the significant advantage over Han et al.’s models han2015 , our compressed models should also be much faster than theirs.
In the final experiment, we apply our method to AlexNet krizhevsky2012 , which won the ILSVRC 2012 classification competition. As with the previous experiments, we train the reference model first. Without any data augmentation, we obtain a reference model with 61M well-learned parameters after 450K iterations of training (i.e., roughly 90 epochs). Then we perform the network surgery on it. AlexNet consists of 8 learnable layers, which is considered to be deep. So we prune the convolutional layers and fully connected layers separately, as previously discussed in Section 3.4. The training batch size, base learning rate and learning policy are kept the same as in the reference training process. We run 320K iterations for the convolutional layers and 380K iterations for the fully connected layers, which means 700K iterations in total (i.e., roughly 140 epochs). In the test phase, we use just the center crop and test our compressed model on the validation set.
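The epoch counts quoted above follow from the iteration counts by simple arithmetic, as the sketch below shows. The training-set size of 1,281,167 images for ILSVRC 2012 is an assumption brought in from outside the text, used only to show that the figures are consistent with a batch size of about 256.

```python
reference_iters = 450_000
reference_epochs = 90
iters_per_epoch = reference_iters / reference_epochs   # 5000.0 iterations per epoch

surgery_iters = 320_000 + 380_000                      # convolutional phase + fully connected phase
surgery_epochs = surgery_iters / iters_per_epoch       # 140.0 epochs

# Assumption: 1,281,167 training images in ILSVRC 2012.
implied_batch_size = 1_281_167 / iters_per_epoch       # roughly 256 images per iteration

print(iters_per_epoch, surgery_epochs, round(implied_batch_size))
```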
|Model||Top-1 error||Top-5 error||Epochs||Compression|
|Fastfood 32 (AD) yang2015||41.93%||-||-|
|Fastfood 16 (AD) yang2015||42.90%||-||-|
|Naive Cut han2015||57.18%||23.23%||0|
|Han et al. han2015||42.77%||19.67%||960|
|Dynamic network surgery (Ours)||43.09%||19.99%||140|
Table 3 compares the result of our method with some others. The four compared models are built by applying Han et al.’s method han2015 and the adaptive fastfood transform method yang2015 . When compared with these "lossless" methods, our method achieves the best result in terms of the compression rate. Besides, after an acceptable number of epochs, the prediction error rate of our model is comparable to or even better than those of models compressed from better references.
In order to make more detailed comparisons, we compare the percentage of remaining parameters in our compressed model with that of Han et al.’s han2015 , since they achieve the second best compression rate. As shown in Table 4, our method compresses more parameters on almost every single layer in AlexNet, which means both the storage requirement and the number of FLOPs are better reduced when compared with han2015 . Besides, our learning process is also much more efficient, thus considerably fewer epochs are needed (at least a 6.8 times decrease).
In this paper, we have investigated the way of compressing DNNs and proposed a novel method called dynamic network surgery. Unlike the previous methods which conduct pruning and retraining alternately, our method incorporates connection splicing into the surgery and implements the whole process in a dynamic way. By utilizing our method, most parameters in the DNN models can be deleted, while the prediction accuracy does not decrease. The experimental results show that our method compresses the number of parameters in LeNet-5 and AlexNet by a factor of 108× and 17.7×, respectively, which is superior to the recent pruning method by considerable margins. Besides, the learning efficiency of our method is also better, thus fewer epochs are needed.
Molecular Cell Biology: Neurotransmitters, Synapses, and Impulse Transmission. W. H. Freeman, 2000.