We consider a neural network that is already trained, pruned (if pruning is employed), and fine-tuned before quantization. If no network pruning is employed, all parameters in a network are subject to quantization. For pruned networks, our focus is on quantization of the unpruned parameters. The goal of network quantization is to quantize the (unpruned) network parameters in order to reduce the storage required for them while minimizing the performance degradation due to quantization. For network quantization, network parameters are grouped into clusters. Parameters in the same cluster share their quantized value, which is the representative value (i.e., cluster center) of the cluster they belong to. After quantization, lossless binary coding follows to encode the quantized parameters into binary codewords, which are stored instead of the actual parameter values. Either fixed-length binary coding or variable-length binary coding, e.g., Huffman coding, can be employed to this end.
Suppose that we have $N$ parameters in total in a neural network. Before quantization, each parameter is assumed to be of $b$ bits. For quantization, we partition the network parameters into $k$ clusters. Let $\mathcal{C}_i$ be the set of network parameters in cluster $i$ and let $b_i$ be the number of bits of the codeword assigned to the network parameters in cluster $i$ for $1 \le i \le k$. For a lookup table to decode quantized values from their binary encoded codewords, we store the $k$ binary codewords ($b_i$ bits for cluster $i$) and the corresponding quantized values ($b$ bits each). The compression ratio is then given by
$$\text{Compression ratio} = \frac{Nb}{\sum_{i=1}^{k}|\mathcal{C}_i|\,b_i + \sum_{i=1}^{k}b_i + kb}.$$
Observe from the compression ratio above that it depends not only on the number of clusters but also on the sizes of the clusters and the lengths of the binary codewords assigned to them, in particular when a variable-length code is used for encoding quantized values. For fixed-length codes, however, all codewords are of the same length, i.e., $b_i = \lceil \log_2 k \rceil$ for all $1 \le i \le k$, and thus the compression ratio reduces to a function of the number of clusters only, i.e., $Nb/(N\lceil \log_2 k \rceil + kb)$, assuming that $N$ and $b$ are given.
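To make the storage accounting concrete, the compression ratio above can be computed with a small helper. This is a hypothetical function for illustration only; `cluster_sizes` and `codeword_bits` are assumed to be produced by the clustering and binary coding stages.

```python
def compression_ratio(N, b, cluster_sizes, codeword_bits):
    """Original bits over quantized bits: Nb / (sum |C_i| b_i + sum b_i + k*b)."""
    k = len(cluster_sizes)
    original_bits = N * b
    quantized_bits = (
        sum(n * bi for n, bi in zip(cluster_sizes, codeword_bits))  # encoded parameters
        + sum(codeword_bits)  # codewords stored in the lookup table
        + k * b               # quantized values stored in the lookup table
    )
    return original_bits / quantized_bits
```

For instance, with $N = 1000$ parameters of $b = 32$ bits and $k = 4$ equal clusters under fixed-length 2-bit coding, the denominator is $1000 \cdot 2 + 8 + 4 \cdot 32 = 2136$ bits, giving a ratio of about 15.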
Provided $N$ network parameters $\{w_1, w_2, \dots, w_N\}$ to quantize, k-means clustering partitions them into $k$ disjoint sets (clusters), denoted by $\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_k$, while minimizing the mean square quantization error (MSQE) as follows:
$$\min \sum_{i=1}^{k} \sum_{w \in \mathcal{C}_i} |w - c_i|^2, \quad \text{where } c_i = \frac{1}{|\mathcal{C}_i|} \sum_{w \in \mathcal{C}_i} w.$$
We observe two issues with employing k-means clustering for network quantization.
First, although k-means clustering minimizes the MSQE, this does not imply that k-means clustering also minimizes the performance loss due to quantization in neural networks. K-means clustering treats quantization errors from all network parameters with equal importance. However, quantization errors from some network parameters may degrade the performance more significantly than others. Thus, to minimize the loss due to quantization in neural networks, one needs to take this dissimilarity into account.
Second, k-means clustering does not consider any compression ratio constraint. It simply minimizes its distortion measure for a given number of clusters, i.e., for $k$ clusters. This is however suboptimal when variable-length coding follows, since the compression ratio depends not only on the number of clusters but also on the sizes of the clusters and the codeword lengths assigned to them, which are determined by the binary coding scheme employed after clustering. Therefore, to optimize network quantization under a compression ratio constraint, one needs to take the impact of binary coding into account, i.e., we need to solve the quantization problem under the actual compression ratio constraint imposed by the specific binary coding scheme employed after clustering.
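The baseline quantizer discussed above is plain k-means on the scalar parameters. A minimal 1-D sketch of Lloyd's algorithm (illustrative only, not an optimized implementation) is:

```python
import numpy as np

def kmeans_1d(w, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on scalar network parameters (baseline quantizer)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(w, size=k, replace=False)  # initialize from the data
    for _ in range(iters):
        # Assignment step: each parameter goes to its nearest cluster center.
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        # Update step: each center becomes the plain mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = w[assign == j].mean()
    return centers, assign
```

Lloyd's algorithm is a heuristic: it converges to a local optimum of the MSQE, which is all that is needed as a baseline here.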
Hessian-weighted network quantization
In this section, we analyze the impact of quantization errors on the neural network loss function and derive that the Hessian-weighted distortion measure is a relevant objective function for network quantization in order to minimize the quantization loss locally. Moreover, from this analysis, we propose Hessian-weighted k-means clustering for network quantization to minimize the performance loss due to quantization in neural networks.
We consider a general non-linear neural network that yields output $\hat{y} = f(x; \mathbf{w})$ from input $x$, where $\mathbf{w} = [w_1\ w_2\ \cdots\ w_N]$ is the vector consisting of all trainable network parameters in the network; $N$ is the total number of trainable parameters in the network. A loss function $L(\hat{y}, y)$ is defined as the objective function that we aim to minimize on average, where $y$ is the expected (ground-truth) output for input $x$. Cross entropy and mean square error are typical examples of a loss function. Given a training data set $\mathcal{X}$, we optimize the network parameters by solving the following problem, e.g., approximately by using a stochastic gradient descent (SGD) method with mini-batches:
$$\hat{\mathbf{w}} = \operatorname*{arg\,min}_{\mathbf{w}} L(\mathcal{X}; \mathbf{w}), \quad \text{where } L(\mathcal{X}; \mathbf{w}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} L\big(f(x; \mathbf{w}), y(x)\big).$$
Hessian-weighted quantization error
The average loss function $L(\mathcal{X}; \mathbf{w})$ can be expanded by Taylor series with respect to $\mathbf{w}$ as follows:
$$\delta L(\mathcal{X}; \mathbf{w}) = \mathbf{g}(\mathbf{w})^T \delta\mathbf{w} + \frac{1}{2}\,\delta\mathbf{w}^T \mathbf{H}(\mathbf{w})\,\delta\mathbf{w} + O(\|\delta\mathbf{w}\|^3),$$
where $\mathbf{g}(\mathbf{w})$ is the vector of gradients and $\mathbf{H}(\mathbf{w})$, the square matrix consisting of second-order partial derivatives, is called the Hessian matrix or Hessian. Assume that the loss function has reached one of its local minima, at $\mathbf{w} = \hat{\mathbf{w}}$, after training. At local minima, gradients are all zero, i.e., we have $\mathbf{g}(\hat{\mathbf{w}}) = \mathbf{0}$, and thus the first term on the right-hand side of the expansion above can be neglected at $\mathbf{w} = \hat{\mathbf{w}}$. The third term is also ignored under the assumption that the average loss function is approximately quadratic at the local minimum $\hat{\mathbf{w}}$. Finally, for simplicity, we approximate the Hessian matrix as a diagonal matrix by setting its off-diagonal terms to zero. It then follows that
$$\delta L(\mathcal{X}; \hat{\mathbf{w}}) \approx \frac{1}{2} \sum_{i=1}^{N} h_{ii}(\hat{\mathbf{w}})\, |\delta \hat{w}_i|^2,$$
where $h_{ii}(\hat{\mathbf{w}})$ is the second-order partial derivative of the average loss function with respect to $w_i$ evaluated at $\mathbf{w} = \hat{\mathbf{w}}$, i.e., the $i$-th diagonal element of the Hessian matrix $\mathbf{H}(\hat{\mathbf{w}})$. Now, we connect this with the problem of network quantization by treating $\delta \hat{w}_i$ as the quantization error of network parameter $w_i$ at its local optimum $w_i = \hat{w}_i$, i.e.,
$$\delta \hat{w}_i = \bar{w}_i - \hat{w}_i,$$
where $\bar{w}_i$ is a quantized value of $\hat{w}_i$. Finally, combining the two expressions above, we derive that the local impact of quantization on the average loss function at $\mathbf{w} = \hat{\mathbf{w}}$ can be quantified approximately as follows:
$$\delta L(\mathcal{X}; \hat{\mathbf{w}}) \approx \frac{1}{2} \sum_{i=1}^{N} h_{ii}(\hat{\mathbf{w}})\, |\bar{w}_i - \hat{w}_i|^2.$$
At a local minimum, the diagonal elements of the Hessian, i.e., the $h_{ii}(\hat{\mathbf{w}})$'s, are all non-negative, and thus the summation above is always additive, implying that the average loss function either increases or stays the same. Therefore, the performance degradation due to quantization of a neural network can be measured approximately by this Hessian-weighted distortion. Further discussion on the Hessian-weighted distortion measure can be found in Appendix LABEL:appendix:hessian.
Hessian-weighted k-means clustering
For notational simplicity, we use $h_{ii} \equiv h_{ii}(\hat{\mathbf{w}})$ and $w_i \equiv \hat{w}_i$ from now on. The optimal clustering that minimizes the Hessian-weighted distortion measure is given by
$$\min \sum_{j=1}^{k} \sum_{w_i \in \mathcal{C}_j} h_{ii}\,|w_i - c_j|^2, \quad \text{where } c_j = \frac{\sum_{w_i \in \mathcal{C}_j} h_{ii}\, w_i}{\sum_{w_i \in \mathcal{C}_j} h_{ii}}.$$
We call this Hessian-weighted k-means clustering. Observe that in defining the distortion measure for clustering we give a larger penalty to a network parameter when its second-order partial derivative is larger, in order to avoid a large deviation from its original value, since the impact on the loss function due to quantization is expected to be larger for that parameter. Hessian-weighted k-means clustering is locally optimal in minimizing the quantization loss when fixed-length binary coding follows, where the compression ratio depends solely on the number of clusters, as shown earlier. Similar to conventional k-means clustering, solving this optimization problem is not easy, but Lloyd's algorithm is still applicable as an efficient heuristic if Hessian-weighted means are used as cluster centers instead of non-weighted means.
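A minimal numpy sketch of Lloyd's algorithm with Hessian-weighted means follows; `w` and `h` are assumed to be the flattened parameters and the corresponding non-negative Hessian diagonal entries, as defined above.

```python
import numpy as np

def hessian_weighted_kmeans(w, h, k, iters=50, seed=0):
    """Lloyd-style heuristic for the Hessian-weighted distortion measure.

    w : flattened network parameters; h : diagonal Hessian entries (>= 0).
    Cluster centers are Hessian-weighted means rather than plain means.
    """
    rng = np.random.default_rng(seed)
    centers = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        # Assignment: h_i * (w_i - c_j)^2 is minimized by the nearest center,
        # so the assignment step is identical to plain k-means.
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        # Update: each center becomes the Hessian-weighted mean of its cluster.
        for j in range(k):
            mask = assign == j
            if h[mask].sum() > 0:
                centers[j] = (h[mask] * w[mask]).sum() / h[mask].sum()
    return centers, assign
```

Note that only the update step changes relative to plain k-means: the weighting does not affect which center is nearest, but it pulls each center toward the parameters whose quantization errors matter most.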
To obtain the Hessian, one needs to evaluate the second-order partial derivative of the average loss function with respect to each of the network parameters, i.e., we need to calculate
$$h_{ii}(\hat{\mathbf{w}}) = \left.\frac{\partial^2 L(\mathcal{X}; \mathbf{w})}{\partial w_i^2}\right|_{\mathbf{w} = \hat{\mathbf{w}}}, \quad 1 \le i \le N.$$
Recall that we are interested only in the diagonal elements of the Hessian. An efficient way of computing the diagonal of the Hessian is presented in le1987modeles,becker1988improving; it is based on a back propagation method similar to the back propagation algorithm used for computing first-order partial derivatives (gradients). That is, computing the diagonal of the Hessian is of the same order of complexity as computing gradients. Hessian computation and our network quantization are performed after network training is complete. For the data set used to compute the Hessian, we can either reuse the training data set or use some other data set, e.g., a validation data set. We observed in our experiments that even a small subset of the training or validation data set is sufficient to yield a good approximation of the Hessian for network quantization.
Alternative to Hessian
Although there is an efficient way to obtain the diagonal of the Hessian, as discussed in the previous subsection, Hessian computation is not free. To avoid this additional computation, we propose using an alternative metric instead of the Hessian. In particular, we consider neural networks trained with the Adam SGD optimizer kingma2014adam and propose using some function (e.g., the square root) of the second moment estimates of gradients as an alternative to the Hessian. The Adam algorithm computes adaptive learning rates for individual network parameters from the first and second moment estimates of gradients; comparing the Adam method to Newton's optimization method using the Hessian, we notice that the second moment estimates of gradients in Adam act like the Hessian in Newton's method, which is what motivates this alternative. The advantage of using the second moment estimates from Adam is that they are computed during training, so we can obtain them at the end of training at no additional cost. This makes Hessian-weighting more feasible for deep neural networks with millions of parameters. We note that similar quantities can be found and used for other SGD optimization methods that use adaptive learning rates, e.g., AdaGrad duchi2011adaptive, Adadelta zeiler2012adadelta and RMSProp tieleman2012lecture.
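To illustrate where the second moment estimates come from, here is a minimal Adam loop on a toy problem that also returns the square root of the bias-corrected second moment estimate at the end of training, i.e., the quantity proposed above as a stand-in for the Hessian diagonal. `grad_fn` is a hypothetical gradient oracle for illustration, not part of any framework API.

```python
import numpy as np

def adam_with_second_moments(grad_fn, w, steps=200, lr=0.1,
                             beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update loop that also returns sqrt of the final
    (bias-corrected) second moment estimate v, per parameter."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, np.sqrt(v / (1 - beta2 ** steps))
```

On a toy quadratic loss with different curvatures per coordinate, the coordinate with larger curvature sees larger gradients throughout training and therefore accumulates a larger second moment estimate, mirroring the larger Hessian diagonal entry.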
Quantization of all layers
We propose quantizing the network parameters of all layers in a neural network together at once by taking Hessian-weights into account. Layer-by-layer quantization was examined in previous work gong2014compressing,han2015deep. For example, in han2015deep, a larger number of bits (a larger number of clusters) is assigned to convolutional layers than to fully-connected layers, which implies that convolutional layers are heuristically treated as more important. This reflects the fact that the impact of quantization errors on the performance varies significantly across layers; some layers, e.g., convolutional layers, may be more important than others. This concern is exactly what Hessian-weighting addresses: it properly handles the different impact of quantization errors not only within layers but also across layers, and thus it can be employed to quantize all layers of a network together. The impact of quantization errors may vary more substantially across layers than within layers, so Hessian-weighting may show more benefit in deeper neural networks. We note that Hessian-weighting can still provide gains even for layer-by-layer quantization, since it addresses the different impact of the quantization errors of network parameters within each layer as well. Recent neural networks are getting deeper, e.g., see szegedy2015going,szegedy2015rethinking,he2015deep. For such deep neural networks, quantizing the network parameters of all layers together is even more advantageous since we can avoid layer-by-layer compression rate optimization. Optimizing compression ratios jointly across all individual layers (to maximize the overall compression ratio for a network) requires time exponential in the number of layers, because the total number of possible combinations of per-layer compression ratios grows exponentially with the number of layers.
Entropy-constrained network quantization
In this section, we investigate how to solve the network quantization problem under a constraint on the compression ratio. In designing network quantization schemes, we want not only to minimize the performance loss but also to maximize the compression ratio. The previous section explored how to quantify and minimize the loss due to quantization; here, we investigate how to take the compression ratio properly into account in the optimization of network quantization.
After quantizing network parameters by clustering, lossless data compression by variable-length binary coding can follow to compress the quantized values. There is a set of optimal codes that achieve the minimum average codeword length for a given source. Entropy is the theoretical limit of the average codeword length per symbol that we can achieve by lossless data compression, as proved by Shannon (see, e.g., [Section 5.3]cover2012elements). It is known that optimal codes achieve this limit with an overhead of less than 1 bit when only integer-length codewords are allowed, which is why optimal coding is also called entropy coding. Huffman coding is one of the entropy coding schemes commonly used when the source distribution is provided (see, e.g., [Section 5.6]cover2012elements), or can be estimated.
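For a concrete sense of how closely Huffman coding approaches the entropy, the following sketch builds Huffman codeword lengths for a given source distribution with a heap. This is the standard textbook construction, shown here only for illustration.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for the given symbol probabilities."""
    # Heap entries: (probability, tie-breaker, symbols under this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # merging two subtrees adds one bit to every symbol inside
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

For a dyadic distribution such as (1/2, 1/4, 1/4), the average Huffman codeword length equals the entropy, 1.5 bits; in general it exceeds the entropy by less than 1 bit, which is the overhead mentioned above.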
Entropy-constrained scalar quantization (ECSQ)
Considering a compression ratio constraint in network quantization, we need to solve the clustering problem above (with or without Hessian-weighting) under the compression ratio constraint given by
$$\frac{Nb}{N\bar{b} + kb + \sum_{i=1}^{k} b_i} \ge C, \quad \text{where } \bar{b} = \frac{1}{N}\sum_{i=1}^{k} |\mathcal{C}_i|\, b_i \text{ is the average codeword length},$$
which follows from the compression ratio formula given earlier. This optimization problem is too complex to solve for an arbitrary variable-length binary code, since the average codeword length $\bar{b}$ can be arbitrary. However, we identify that it can be simplified if optimal codes, e.g., Huffman codes, are assumed to be used. In particular, optimal coding closely achieves the lower limit of the average source code length, i.e., the entropy, and then we approximately have
$$\bar{b} \approx H = -\sum_{i=1}^{k} p_i \log_2 p_i,$$
where $H$ is the entropy of the quantized network parameters after clustering (i.e., the source), given that $p_i = |\mathcal{C}_i|/N$ is the ratio of the number of network parameters in cluster $\mathcal{C}_i$ to the number of all network parameters (i.e., the source distribution). Moreover, assuming that $N \gg k$, we have
$$\frac{1}{N}\Big(kb + \sum_{i=1}^{k} b_i\Big) \approx 0$$
in the compression ratio constraint above. Combining these two approximations, the compression ratio constraint can be altered to an entropy constraint given by
$$H = -\sum_{i=1}^{k} p_i \log_2 p_i \le R,$$
where $R \approx b/C$. In summary, assuming that optimal coding is employed after clustering, one can approximately replace a compression ratio constraint with an entropy constraint on the clustering output. The network quantization problem is then translated into a quantization problem with an entropy constraint, which is called entropy-constrained scalar quantization (ECSQ) in information theory. Two efficient heuristic solutions to ECSQ for network quantization are proposed in the following subsections: uniform quantization and an iterative solution similar to Lloyd's algorithm for k-means clustering.
It is shown in gish1968asymptotically that the uniform quantizer is asymptotically optimal in minimizing the mean square quantization error for any random source with a reasonably smooth density function as the resolution becomes infinite, i.e., as the number of clusters $k \to \infty$. This asymptotic result leads us to a very simple but efficient network quantization scheme, as follows:
We first set uniformly spaced thresholds and divide the network parameters into $k$ clusters.
After determining clusters, their quantized values (cluster centers) are obtained by taking the mean of network parameters in each cluster.
Note that one can use the Hessian-weighted mean instead of the non-weighted mean when computing cluster centers in the second step above in order to benefit from Hessian-weighting. A performance comparison of uniform quantization with non-weighted means and uniform quantization with Hessian-weighted means can be found in Appendix LABEL:appendix:uniform. Although uniform quantization is a straightforward method, it has not previously been shown in the literature that it is actually one of the most efficient quantization schemes for neural networks when optimal variable-length coding, e.g., Huffman coding, follows. We note that uniform quantization is not always good; it is inefficient for fixed-length coding, which is also shown in this paper for the first time.
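The two-step scheme above can be sketched as follows; this is a minimal numpy illustration in which passing the optional Hessian diagonal `h` switches the cluster centers from plain means to Hessian-weighted means.

```python
import numpy as np

def uniform_quantize(w, k, h=None):
    """Uniformly spaced thresholds over [min(w), max(w)]; cluster centers are
    the (optionally Hessian-weighted) means of the parameters in each bin."""
    edges = np.linspace(w.min(), w.max(), k + 1)
    # Step 1: bin index per parameter (right-most edge folded into the last bin).
    assign = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, k - 1)
    # Step 2: cluster centers as (weighted) means of each bin's members.
    weights = np.ones_like(w) if h is None else h
    centers = np.zeros(k)
    for j in range(k):
        mask = assign == j
        if weights[mask].sum() > 0:
            centers[j] = (weights[mask] * w[mask]).sum() / weights[mask].sum()
    return centers[assign], assign
```

Note that the thresholds are fixed up front; only the representative values depend on the data, which is what makes the scheme so simple.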
Iterative algorithm to solve ECSQ
Another scheme proposed to solve the ECSQ problem for network quantization is an iterative algorithm, similar to Lloyd's algorithm for k-means clustering. Although this iterative solution is more complicated than the uniform quantization presented above, it finds a local optimum for a given discrete source. An iterative algorithm to solve the general ECSQ problem is provided in chou1989entropy. We derive a similar iterative algorithm to solve the ECSQ problem for network quantization. The main difference from the method in chou1989entropy is that we minimize the Hessian-weighted distortion measure instead of the non-weighted regular distortion measure for optimal quantization. The detailed algorithm and further discussion can be found in Appendix LABEL:appendix:iterecsq.
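A Lagrangian sketch of such an iterative ECSQ solver is given below: the assignment step charges each parameter both its Hessian-weighted distortion and a rate penalty $-\log_2 p_j$, and the update step re-estimates the centers and cluster probabilities. This is an illustrative reading of the approach, with the multiplier `lam` trading distortion against entropy, not the exact algorithm from the appendix.

```python
import numpy as np

def ecsq_iterative(w, h, k, lam, iters=50, seed=0):
    """Lagrangian heuristic for entropy-constrained quantization: assign each
    parameter i to the cluster j minimizing
        h_i * (w_i - c_j)^2 + lam * (-log2 p_j),
    then re-estimate centers (Hessian-weighted means) and probabilities p_j."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(w, size=k, replace=False)
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        rate = -np.log2(np.maximum(p, 1e-12))  # codeword-length estimate per cluster
        cost = h[:, None] * (w[:, None] - centers[None, :]) ** 2 + lam * rate[None, :]
        assign = np.argmin(cost, axis=1)
        for j in range(k):
            mask = assign == j
            p[j] = mask.mean()
            if h[mask].sum() > 0:
                centers[j] = (h[mask] * w[mask]).sum() / h[mask].sum()
    return centers, assign, p
```

With `lam = 0` the rate penalty vanishes and the iteration reduces to Hessian-weighted k-means clustering; as `lam` grows, parameters migrate toward popular clusters, lowering the entropy of the clustering output at the cost of extra distortion.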
This section presents our experimental results for the proposed network quantization schemes on three exemplary convolutional neural networks: (a) LeNet lecun1998gradient for the MNIST data set, (b) ResNet he2015deep for the CIFAR-10 data set, and (c) AlexNet krizhevsky2012imagenet for the ImageNet ILSVRC-2012 data set. Our experiments can be summarized as follows:
We employ the proposed network quantization methods to quantize all of the network parameters in a network together at once, as discussed earlier.
We evaluate the performance of the proposed network quantization methods with and without network pruning. For a pruned model, we need to store not only the values of unpruned parameters but also their respective indexes (locations) in the original model. For the index information, we compute index differences between unpruned network parameters in the original model and further compress them by Huffman coding as in han2015deep.
For Hessian computation, 50,000 samples of the training set are reused. We also evaluate the performance when Hessian is computed with 1,000 samples only.
Finally, we evaluate the performance of our network quantization schemes when the alternative to the Hessian is used instead of the Hessian, as discussed earlier. To this end, we retrain the considered neural networks with the Adam SGD optimizer and obtain the second moment estimates of gradients at the end of training. We then use the square roots of the second moment estimates instead of the Hessian and evaluate the performance.
First, we evaluate our network quantization schemes on the MNIST data set with a simplified version of LeNet5 lecun1998gradient, consisting of two convolutional layers and two fully-connected layers followed by a soft-max layer. It has a total of 431,080 parameters and achieves 99.25% accuracy. For a pruned model, we prune 91% of the original network parameters and fine-tune the rest. Second, we test our network quantization schemes on the CIFAR-10 data set krizhevsky2009learning with a pre-trained 32-layer ResNet he2015deep. The 32-layer ResNet consists of 464,154 parameters in total and achieves 92.58% accuracy. For a pruned model, we prune 80% of the original network parameters and fine-tune the rest. Third, we evaluate our network quantization schemes with AlexNet krizhevsky2012imagenet for the ImageNet ILSVRC-2012 data set russakovsky2015imagenet. We obtain a pre-trained AlexNet Caffe model, which achieves 57.16% top-1 accuracy. For a pruned model, we prune 89% of the parameters and fine-tune the rest. In fine-tuning, the Adam SGD optimizer is used in order to avoid Hessian computation by utilizing its alternative (as discussed earlier). However, the pruned model does not recover the original accuracy after fine-tuning with the Adam method; the top-1 accuracy recovered after pruning and fine-tuning is 56.00%. A better pruned model achieving the original accuracy can be found by pruning and retraining iteratively han2015learning, which is however not used here.
We first present the quantization results without pruning for the 32-layer ResNet in the accompanying figure, where the accuracy of the 32-layer ResNet is plotted against the average codeword length per network parameter after quantization. When fixed-length coding is employed, the proposed Hessian-weighted k-means clustering method performs the best, as expected; observe that it yields better accuracy than the others even after fine-tuning. On the other hand, when Huffman coding is employed, uniform quantization and the iterative algorithm for ECSQ outperform Hessian-weighted k-means clustering and k-means clustering. However, these two ECSQ solutions underperform Hessian-weighted k-means clustering and even k-means clustering when fixed-length coding is employed, since they are optimized for optimal variable-length coding.
The figure also shows the performance of Hessian-weighted k-means clustering when the Hessian is computed with a small number of samples (1,000 samples). Observe that even using the Hessian computed with a small number of samples yields almost the same performance. We also show the performance of Hessian-weighted k-means clustering when the alternative to the Hessian is used instead, as explained earlier. In particular, the square roots of the second moment estimates of gradients are used instead of the Hessian, and using this alternative provides performance similar to using the Hessian. In the table below, we summarize the compression ratios that we can achieve with different network quantization methods for pruned models. The original network parameters are 32-bit float numbers. Using simple uniform quantization followed by Huffman coding, we achieve compression ratios of 51.25, 22.17 and 40.65 (i.e., the compressed model sizes are 1.95%, 4.51% and 2.46% of the original model sizes) for LeNet, 32-layer ResNet and AlexNet, respectively, at no or marginal performance loss. Observe that the loss in the compressed AlexNet is mainly due to pruning. Here, we also compare our network quantization results to the ones in han2015deep. Note that layer-by-layer quantization with k-means clustering is evaluated in han2015deep, while our quantization schemes, including k-means clustering, are employed to quantize the network parameters of all layers together at once.
Compression ratios for pruned models (pruning + quantization of all layers + Huffman coding):

| Model | Ours | Deep compression han2015deep |
|---|---|---|
| LeNet | 51.25 | 39.00 (99.26% accuracy) |
| 32-layer ResNet | 22.17 | N/A |
| AlexNet | 40.65 | 35.00 (57.22% top-1 accuracy) |
This paper investigates the quantization problem of network parameters in deep neural networks. We identify the suboptimality of the conventional quantization method using k-means clustering and design new network quantization schemes that minimize the performance loss due to quantization given a compression ratio constraint. In particular, we analytically show that the Hessian can be used as a measure of the importance of network parameters, and we propose to minimize Hessian-weighted quantization errors on average when clustering network parameters for quantization. Hessian-weighting is beneficial in quantizing all of the network parameters together at once, since it can properly handle the different impact of quantization errors not only within layers but also across layers. Furthermore, we make a connection from the network quantization problem to the entropy-constrained data compression problem in information theory and push the compression ratio to the limit that information theory provides. Two efficient heuristic solutions are presented to this end, i.e., uniform quantization and an iterative solution for ECSQ. Our experimental results show that the proposed network quantization schemes provide considerable gains over the conventional method using k-means clustering, in particular for large and deep neural networks.