Efficient and Effective Quantization for Sparse DNNs

Yiren Zhao, et al.
University of Cambridge

Deep convolutional neural networks (CNNs) are powerful tools for a wide range of vision tasks, but the enormous amount of memory and compute resources required by CNNs poses a challenge in deploying them on constrained devices. Existing compression techniques show promising performance in reducing the size and computational complexity of CNNs for efficient inference, but a method to integrate them effectively is lacking. In this paper, we examine the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on powers-of-two values, which exploits the weight distributions after fine-grained pruning. The proposed method dynamically discovers the most effective numerical representation for weights in layers with varying sparsities, minimizing the impact of quantization on the task accuracy. Multiplications in quantized CNNs can then be replaced with much cheaper bit-shift operations for efficient inference. Coupled with lossless encoding, we build a compression pipeline that provides CNNs with high compression ratios (CRs) and minimal loss in accuracy. On ResNet-50, we achieve an 18.08× CR with only 0.24% loss in top-5 accuracy, outperforming existing compression pipelines.



1 Introduction

Despite deep convolutional neural networks (CNNs) demonstrating state-of-the-art performance in many computer vision tasks, their parameter-rich and compute-intensive nature substantially hinders their efficient use on bandwidth- and power-constrained devices. To this end, recent years have seen a surge of interest in minimizing the memory and compute costs of CNN inference.

Pruning algorithms compress CNNs by setting weights to zero, thus removing connections or neurons from the models. In particular, fine-grained pruning (Liu et al., 2015; Guo et al., 2016) provides the best compression by removing connections at the finest granularity, i.e. individual weights. Quantization methods reduce the number of bits required to represent each value, and thus provide further memory, bandwidth and compute savings. Shift quantization of weights, which quantizes weight values in a model to powers-of-two values or zero, i.e. $\{0, \pm 2^{e}\}$ for integer exponents $e$, is of particular interest, as multiplications in convolutions become much simpler bit-shift operations. The computational cost in hardware can thus be significantly reduced without a detrimental impact on the model's task accuracy (Zhou et al., 2017).

Fine-grained pruning, however, is often in conflict with quantization, as pruning introduces varying degrees of sparsity to different layers. Linear quantization methods (integers) have uniform quantization levels, while non-linear quantizations (logarithmic, floating-point and shift) have fine levels around zero whose spacing grows as values become larger in magnitude. Both linear and non-linear quantizations thus provide precision where it is not actually required in a pruned CNN. Empirically, we observe that in certain layers sparsified with fine-grained pruning, very few non-zero weights remain near zero (see Figure 1(c) for an example). Shift quantization is highly desirable as it can be implemented efficiently, but it becomes a poor choice for such layers in sparse models, as most near-zero quantization levels are under-utilized (Figure 1(d)).

(a) Dense layers
(b) After shift quantization
(c) Sparse layers (without zeros)
(d) After shift quantization

Figure 1: The weight distributions of the first 8 layers of ResNet-18 on ImageNet. (a) shows the weight distributions of the dense layers; (c) similarly shows the distributions (excluding zeros) for a sparsified variant. Notice greedy pruning leaves some layers dense. (b) and (d) respectively quantize the weight distributions on the left with 5-bit shift quantization. Shift quantization on the sparse layers results in poor utilization of the quantization levels.

This dichotomy prompts the question: how can we quantize sparse weights efficiently and effectively? Here, efficiency represents the minimized compute cost from replacing floating-point multiplications with bit-shift operations. Effectiveness means that the quantization levels are well utilized. From an information theory perspective, it is desirable to design a quantization function such that the distribution of quantized values matches the prior weight distribution as closely as possible. We address this issue by proposing a new approach to quantize parameters in CNNs, which we call focused quantization, that mixes shift and recentralized quantization methods. Following the minimum description length (MDL) principle (Hinton and van Camp, 1993; Graves, 2011), recentralized quantization uses a mixture of Gaussian distributions to find the most likely locations of the probability masses in the weight distribution of sparse layers (first block in Figure 2), and independently quantizes the two masses (rightmost of Figure 2) to powers-of-two values. The compute pattern of an $n$-bit recentralized quantization can be seen as a combination of an $(n-1)$-bit shift quantization and a ternary quantization. This way, it not only preserves the compute efficiency of convolutional layers using bit-shift operations, but also allows the shift quantization to make effective use of its range of representable values. Additionally, not all layers consist of two probability masses, in which case recentralized quantization may not be necessary (as shown in Figure 1(c)). In such cases, we use the Wasserstein distance between the two Gaussian components to decide when to apply shift quantization instead. Finally, we present a complete compression pipeline comprising fine-grained pruning, focused quantization and Huffman encoding, and show its performance in comparison with many state-of-the-art compression techniques.

In this paper, we make the following contributions:

  • We propose focused quantization for sparse CNNs based on the MDL principle. The proposed quantization significantly reduces computation and model size with minimal loss of accuracy.

  • Focused quantization is hybrid: it systematically mixes recentralized quantization with shift quantization to provide the most effective quantization for sparse CNNs.

  • We build a complete compression pipeline based on this mixed quantization and demonstrate state-of-the-art compression ratios on a range of modern CNNs.

The rest of the paper is structured as follows. Section 2 discusses related work in the field of DNN compression. Section 3 introduces focused quantization and the complete compression pipeline. Section 4 presents an evaluation of the proposed compression pipeline.

Figure 2: The step-by-step process of recentralized quantization of unpruned weights on block3f/conv1 in sparse ResNet-50. Each step shows how it changes a filter and the distribution of all weights. The first step estimates the high-probability regions with a Gaussian mixture and assigns each weight to a Gaussian component. The second normalizes each weight. The third quantizes the normalized values with shift quantization and produces a representation of quantized weights used for inference. The final block visualizes the actual numerical values after quantization.

2 Related Work

Recently, a wide range of techniques have been proposed and proven effective for reducing the memory and computation requirements of deep neural networks (DNNs). These proposed optimizations can provide direct reductions in memory footprints, bandwidth requirements, total number of arithmetic operations, arithmetic complexities or a combination of these properties.

Pruning-based optimization methods directly reduce the number of parameters in a network. Fine-grained pruning (Guo et al., 2016) significantly reduces the size of a model but introduces element-wise sparsity. Coarse-grained pruning (Luo et al., 2017; Gao et al., 2019) shrinks model sizes or reduces computation at a higher granularity that is easier to accelerate on commodity hardware. Quantization methods allow parameters to be represented in more efficient data formats. Quantizing weights to powers of two has recently gained attention because it not only reduces the model size but also simplifies computation (Leng et al., 2018; Zhou et al., 2017; Miyashita et al., 2016). Previous research has also focused on quantizing DNNs to extremely low bit-widths such as ternary (Zhu et al., 2017) or binary (Hubara et al., 2016) values. These, however, introduce large numerical errors and thus cause significant degradation in model accuracy. Lossy and lossless encoding is another popular method to reduce the size of a DNN, typically used in conjunction with pruning and quantization (Dubey et al., 2018; Han et al., 2016).

Since many compression techniques are available and building a compression pipeline can provide a multiplying effect on compression ratios, researchers have started to chain multiple compression techniques. Han et al. (2016) proposed Deep Compression, which combines pruning, quantization and Huffman encoding. Dubey et al. (2018) built a compression pipeline using their coreset representation of filters. Tung et al. (2018) and Polino et al. (2018) integrated multiple compression techniques: Tung et al. (2018) combined pruning with quantization, and Polino et al. (2018) employed knowledge distillation on top of quantization. Although there have been many attempts at building efficient compression pipelines, the statistical relationship between pruning and quantization remains underexplored. In this paper, we examine exactly this problem and propose a new method that exploits the statistical properties of weights in pruned models to quantize them efficiently and effectively.

3 Method

3.1 Preliminaries

A high-level overview of the proposed quantization method is shown in Figure 2; in this section we explain the method and its optimization in detail. Given a model with a sequence of parameters $\mathbf{w}$ trained on a dataset $\mathcal{D}$ of input points $\mathbf{x}$ and targets $\mathbf{y}$, compressing the model with quantization can be formulated as a minimum description length (MDL) optimization (Hinton and van Camp, 1993; Graves, 2011), whose objective is to encode the data and the model with the fewest number of bits. Given that we approximate the posterior $p(\mathbf{w} \mid \mathcal{D})$ with a distribution of quantized weights $q_{\boldsymbol{\theta}}(\hat{\mathbf{w}})$, where $\boldsymbol{\theta}$ contains the hyperparameters used to quantize $\mathbf{w}$, the MDL problem minimizes the following negative variational free energy (Graves, 2011):

$$\mathcal{L}(\boldsymbol{\theta}) = L^E + L^C = \mathbb{E}_{\hat{\mathbf{w}} \sim q_{\boldsymbol{\theta}}}\left[-\log p(\mathbf{y} \mid \mathbf{x}, \hat{\mathbf{w}})\right] + \mathrm{KL}\left(q_{\boldsymbol{\theta}}(\hat{\mathbf{w}}) \,\|\, p(\hat{\mathbf{w}})\right),$$

where $\mathcal{L}(\boldsymbol{\theta})$ consists of two separate terms, the error cost $L^E$ and the complexity cost $L^C$. The former represents the cost of communicating the true labels $\mathbf{y}$, given that the receiver already knows the inputs $\mathbf{x}$ and the quantized model weights $\hat{\mathbf{w}}$. The more accurate the model output, the fewer bits are required to communicate the true labels. The latter is the Kullback-Leibler (KL) divergence from the prior $p(\hat{\mathbf{w}})$ to $q_{\boldsymbol{\theta}}(\hat{\mathbf{w}})$, which denotes the lower bound on the expected cost of communicating the model weights. Intuitively, the former reflects the cross-entropy loss of the quantized model trained on $\mathcal{D}$, and the latter minimizes the discrepancies between the weight distributions before and after quantization.

3.2 Designing the Quantization Function

Shift quantization is a quantization scheme that constrains weight values to powers of two or zero. A representable value $\hat{w}$ in an $n$-bit shift quantization is given by:

$$\hat{w} = s \cdot 2^{e + b},$$

where $s \in \{-1, 0, 1\}$ denotes either zero or the sign of the value, $e$ is an integer exponent bounded by the $n$-bit budget, and $b$ is the bias, a layer-wise constant which scales the magnitudes of quantized values. We use $\mathrm{shift}_{n,b}(w)$ to denote an $n$-bit shift quantization with a bias $b$, rounding a weight value $w$ to the nearest representable value $\hat{w}$.

As we have discussed earlier and illustrated in Figure 1, shift quantization on sparse layers makes poor use of the range of representable values, i.e. the resulting distribution after quantization is a poor approximation of the original weight distribution.
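To make the scheme concrete, the following is a minimal NumPy sketch of a shift quantizer. The exact exponent-range bookkeeping (one bit for zero/sign, the rest for the exponent) and the rounding in the log domain are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def shift_quantize(w, n_bits=5, bias=0):
    """Round each weight to zero or a signed power of two.

    One bit encodes zero/sign; the remaining bits select an exponent
    e in [bias - (2**(n_bits - 1) - 2), bias].  This bit allocation
    is an assumption for illustration.
    """
    w = np.asarray(w, dtype=np.float64)
    sign = np.sign(w)
    # nearest exponent in the log2 domain (guard against log of zero)
    e = np.round(np.log2(np.maximum(np.abs(w), 1e-12)))
    e_min = bias - (2 ** (n_bits - 1) - 2)
    q = sign * 2.0 ** np.clip(e, e_min, bias)
    # magnitudes below the smallest representable level quantize to zero
    q[np.abs(w) < 2.0 ** (e_min - 1)] = 0.0
    return q
```

For a 5-bit quantizer with bias 0, a weight of 0.3 maps to 0.25, while magnitudes above 1 clip to ±1, which is the role of the layer-wise bias $b$.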

Intuitively, by concentrating quantization effort on the high-probability regions of the weight distribution in sparse layers, the KL divergence can be better minimized. Recentralized quantization is designed specifically for this purpose; it is defined as follows and applied in a layer-wise fashion:

$$\hat{w} = \mathrm{R}_{\boldsymbol{\theta}}(w) = \alpha\, m \sum_{c \in \mathcal{C}} \delta_{z_w, c}\, \mathrm{Q}_c(w),$$

where $w$ is a weight value of the layer and $m$ is the constant pruning mask containing binary values $\{0, 1\}$, used to set pruned weights to 0. The set of components $\mathcal{C}$ determines the specific locations on which to focus quantization effort. The Kronecker delta $\delta_{z_w, c}$ evaluates to 1 when $z_w = c$, and 0 otherwise. In effect, $z_w$ is a constant that chooses which component in $\mathcal{C}$ is used to quantize $w$; Section 3.3 explains how $z_w$ can be determined for each $w$. Following (Zhu et al., 2017; Leng et al., 2018), we additionally introduce a layer-wise learnable scaling factor $\alpha$ initialized to 1, which empirically improves the task accuracy. Finally, $\mathrm{Q}_c$ quantizes $w$ with the component $c$ by normalizing it, applying shift quantization, and mapping it back:

$$\mathrm{Q}_c(w) = \sigma_c\, \mathrm{shift}_{n,b}\!\left(\frac{w - \mu_c}{\sigma_c}\right) + \mu_c,$$

where the scalar constants $\mu_c$, $\sigma_c$ and $b$ are hyperparameters in $\boldsymbol{\theta}$ to be optimized. We additionally use $q_{\boldsymbol{\theta}}(\hat{\mathbf{w}})$ to indicate the distribution generated by applying the layer-wise quantization to all weights $\mathbf{w}$. The process of minimizing $\mathcal{L}(\boldsymbol{\theta})$ therefore finds the optimal $\boldsymbol{\theta}$.
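The recentralized quantizer can be sketched as below. The component assignments `z`, pruning mask, and per-component `(mu, sigma)` parameters are passed in explicitly, and the internal shift-quantizer helper and normalization details are assumptions matching the description above, not the authors' code.

```python
import numpy as np

def _shift(w, n_bits=4, bias=0):
    # helper: nearest power-of-two (or zero) level, as in Section 3.2
    sign = np.sign(w)
    e = np.round(np.log2(np.maximum(np.abs(w), 1e-12)))
    e_min = bias - (2 ** (n_bits - 1) - 2)
    q = sign * 2.0 ** np.clip(e, e_min, bias)
    return np.where(np.abs(w) < 2.0 ** (e_min - 1), 0.0, q)

def recentralized_quantize(w, mask, z, mus, sigmas, alpha=1.0, n_bits=4):
    """Quantize each unpruned weight with the shift quantizer of its
    assigned Gaussian component: normalize by (mu_c, sigma_c),
    shift-quantize, then map back.  `z` holds per-weight component
    indices; pruned weights (mask == 0) stay zero."""
    w = np.asarray(w, dtype=np.float64)
    q = np.zeros_like(w)
    for c, (mu, sigma) in enumerate(zip(mus, sigmas)):
        sel = (np.asarray(z) == c) & (np.asarray(mask) != 0)
        q[sel] = sigma * _shift((w[sel] - mu) / sigma, n_bits) + mu
    return alpha * q * (np.asarray(mask) != 0)
```

A weight sitting exactly at its component mean normalizes to zero, shift-quantizes to zero, and maps back to the mean, which is why the quantization levels cluster around the probability masses.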

3.3 Optimizing Recentralized Quantization

The complexity cost $L^C$ is unfortunately intractable: the relevant hyperparameters in $\boldsymbol{\theta}$ cannot be computed analytically, and a direct search is still difficult to accomplish. As recentralized quantization concerns high-probability weights, we can approximately minimize the KL divergence by applying the following two-step process in a layer-wise manner, which first identifies regions with high probabilities (first block in Figure 2), then locally quantizes them with recentralized quantization (second and third blocks in Figure 2).

First, we notice that, in general, the weight distribution resembles a mixture of Gaussian distributions, and thus replace the prior $p(\hat{\mathbf{w}})$ with a surrogate $q(w)$, defined as a Gaussian mixture model and subsequently used to approximate $q_{\boldsymbol{\theta}}(\hat{\mathbf{w}})$:

$$q(w) = \sum_{c \in \mathcal{C}} \lambda_c\, \mathcal{N}\!\left(w \mid \mu_c, \sigma_c^2\right),$$

where $\mathcal{N}(w \mid \mu_c, \sigma_c^2)$ is the probability density function of the Gaussian distribution with mean $\mu_c$ and variance $\sigma_c^2$, and the non-negative $\lambda_c$ defines the mixing weight of the component $c$, with $\sum_{c \in \mathcal{C}} \lambda_c = 1$. We can then maximize the likelihood of the weights under $q$ with:

$$\boldsymbol{\theta}^\star = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \sum_{w \in \mathbf{w}} \log q(w),$$

where $\boldsymbol{\theta}^\star$ comprises the optimal values of $\lambda_c$, $\mu_c$, $\sigma_c$ for all components $c \in \mathcal{C}$. This solution is known as the maximum likelihood estimate (MLE). Theoretically, finding the MLE is equivalent to minimizing the KL divergence between the empirical weight distribution and $q(w)$, and it can be efficiently computed by the expectation-maximization (EM) algorithm (Dempster et al., 1977).

In practice, we found it sufficient to use two Gaussian components, $\mathcal{C} = \{c_1, c_2\}$, to identify high-probability regions in the weight distribution. For faster EM convergence, we initialize $(\mu_{c_1}, \sigma_{c_1})$ and $(\mu_{c_2}, \sigma_{c_2})$ with the means and standard deviations of the negative and positive weight values respectively, and $\lambda_{c_1} = \lambda_{c_2} = 1/2$.
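The initialization and EM fit described above can be sketched in a few lines of NumPy. This hand-rolled 1-D EM (sign-split initialization, equal mixing weights) is illustrative only, assuming the paper's described setup.

```python
import numpy as np

def fit_two_gaussians(w, n_iter=50):
    """1-D EM for a 2-component Gaussian mixture over unpruned
    weights, initialized as in Section 3.3: component means/stds from
    the negative and positive weights, mixing weights 1/2 each."""
    w = np.asarray(w, dtype=np.float64)
    mus = np.array([w[w < 0].mean(), w[w >= 0].mean()])
    sigmas = np.array([w[w < 0].std(), w[w >= 0].std()]) + 1e-8
    lams = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities p(z_w = c | w)
        pdf = lams * np.exp(-0.5 * ((w[:, None] - mus) / sigmas) ** 2) \
              / (sigmas * np.sqrt(2 * np.pi))
        resp = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate component parameters from responsibilities
        nk = resp.sum(axis=0)
        mus = (resp * w[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((resp * (w[:, None] - mus) ** 2).sum(axis=0) / nk) + 1e-8
        lams = nk / len(w)
    return lams, mus, sigmas
```

On a clearly bimodal weight distribution, this converges in a handful of iterations precisely because the sign-split initialization already places each component near a probability mass.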

We then generate $z_w$ from the mixture model, which individually selects the component to use for each $w$. An obvious design decision is to sample from the Gaussian mixture: for each $w$, $z_w$ follows a categorical distribution, where we randomly assign a component $c$ to $z_w$ with the following probability:

$$p(z_w = c) = \frac{\lambda_c\, \mathcal{N}(w \mid \mu_c, \sigma_c^2)}{\sum_{k \in \mathcal{C}} \lambda_k\, \mathcal{N}(w \mid \mu_k, \sigma_k^2)}.$$

Finally, we set the constant bias $b$ so that the largest representable magnitude is a powers-of-two value, chosen to ensure that the quantization allows at most a small proportion of values to overflow, clipping them to the maximum representable magnitude. In practice, this heuristic makes better use of the quantization levels provided by shift quantization than disallowing overflows.
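A sketch of the two steps above: drawing the component selectors from the responsibilities, and a quantile-based heuristic for picking the overflow-limiting bias. The quantile form and the `gamma` parameter name are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def sample_selector(resp, rng):
    """Draw z_w ~ Categorical(p(z_w = c)) for two components, given
    per-weight responsibilities (rows sum to 1)."""
    u = rng.random(len(resp))
    # pick component 1 when u falls beyond p(z_w = c_0)
    return (u > resp[:, 0]).astype(int)

def overflow_bias(v, gamma=0.01):
    """Power-of-two bias b such that at most a fraction `gamma` of the
    normalized magnitudes |v| overflow and get clipped to 2**b.
    The quantile heuristic is an assumption for illustration."""
    m = np.quantile(np.abs(v), 1.0 - gamma)
    return int(np.ceil(np.log2(max(m, 1e-12))))
```

Allowing a small overflow fraction pulls the bias down, which spends the representable levels on the bulk of the distribution rather than on a few outliers.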

After determining all of the relevant hyperparameters with the method described above, $\mathrm{R}_{\boldsymbol{\theta}}$ can be evaluated to quantize the layer weights $\mathbf{w}$.

3.4 Bit-width Saving Tricks

Recentralized quantization is designed to capture the high-probability components in the weight distribution, which in theory provides a less redundant use of bits compared to shift quantization. We further reduce the bit-width by removing certain representable values that occur rarely after quantization; these tricks are generally applicable. Consider the $c_1$ (orange) and $c_2$ (blue) Gaussian components in the first block of Figure 2: the means $\mu_{c_1}$ and $\mu_{c_2}$ are surrounded by many fine-grained quantization levels, so sacrificing a few of these representations by quantizing to nearby values is equally efficient. Similarly, very few values quantized by $c_1$ lie around the well-quantized region of $c_2$, and vice versa, meaning that we can remove the largest representation from $c_1$ and the smallest representation from $c_2$. By removing these values from the representation, we use at most exactly $n$ bits to represent a quantized value which internally uses $(n-1)$-bit shift quantization. To further simplify computation, we constrain $\mu_{c_1}$ and $\mu_{c_2}$ to the nearest powers-of-two values. For instance, a 3-bit recentralized quantization uses two sets of representable values, corresponding to values quantized by the $c_1$ and $c_2$ components respectively, each internally a 2-bit shift quantization centered at $\mu_{c_1}$ or $\mu_{c_2}$.

3.5 Choosing the Appropriate Quantization

As we have discussed earlier, the weight distribution of sparse layers may not always have multiple high-probability regions. For example, fitting a mixture model of two Gaussian components on the layer in Figure 3(a) gives highly overlapped components. It is therefore of little consequence which component we use to quantize a particular weight value, since quantizing with either gives similar quantization results (Figure 3(b)), rendering the component selector $z_w$ redundant. Under this scenario, we can simply use $n$-bit shift quantization instead of an $n$-bit recentralized quantization which internally uses an $(n-1)$-bit shift quantization. By moving the 1 bit used to represent the now-absent $z_w$ to shift quantization, we further increase its precision.

To decide whether to use shift or recentralized quantization, it is necessary to introduce a metric that compares the similarity of the pair of components. While the KL divergence provides a measure of similarity, it is non-symmetric, making it unsuitable for this purpose. To address this, we propose to first normalize the distribution of the mixture, then use the 2-Wasserstein metric between the two normalized Gaussian components as a decision criterion, which we call the Wasserstein separation:

$$w_{\mathrm{sep}} = \frac{\sqrt{(\mu_{c_1} - \mu_{c_2})^2 + (\sigma_{c_1} - \sigma_{c_2})^2}}{\sigma},$$

where $\mu_c$ and $\sigma_c$ are respectively the mean and standard deviation of the component $c$, and $\sigma^2$ denotes the variance of the entire weight distribution. Focused quantization can then adaptively pick recentralized quantization for all sparse layers, except when $w_{\mathrm{sep}}$ falls below a threshold $w_{\mathrm{th}}$, in which case shift quantization is used instead. In our experiments, we found this usually provides a good decision criterion. In Section 4.3, we additionally study the impact of quantizing a model with different $w_{\mathrm{th}}$ values.
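The separation metric and the resulting decision rule can be sketched directly; note the closed form of the 2-Wasserstein distance between two 1-D Gaussians, $\sqrt{(\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2}$, and that the normalization by the layer's overall standard deviation follows the description above.

```python
import numpy as np

def wasserstein_separation(mu1, sigma1, mu2, sigma2, sigma_all):
    """2-Wasserstein distance between two 1-D Gaussian components,
    normalized by the standard deviation of the full weight
    distribution."""
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2) / sigma_all

def choose_quantization(mu1, s1, mu2, s2, sigma_all, threshold=1.0):
    """Pick recentralized quantization only when the components are
    well separated; otherwise fall back to plain shift quantization.
    The default threshold is an assumption for illustration."""
    sep = wasserstein_separation(mu1, s1, mu2, s2, sigma_all)
    return 'recentralized' if sep > threshold else 'shift'
```

Two components at ±0.2 with small spread in a layer of overall standard deviation 0.2 give a separation of 2, well above threshold; two coincident components give a separation of 0 and fall back to shift quantization.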

(a) Weight distribution.
(b) Overlapping components.
Figure 3: The weight distribution of the layer block22/conv1 in a sparse ResNet-18 trained on ImageNet. It shows that when the two Gaussian components have a large overlap, quantizing with either one of them results in almost the same quantization levels.

3.6 Model Optimization

Optimizing the overall cost $\mathcal{L}(\boldsymbol{\theta})$ requires us to minimize the complexity cost $L^C$ alongside the error cost $L^E$, which is the expected training loss of the model using weights drawn from the quantized distribution. We propose to interleave the optimizations of $L^C$ and $L^E$ using dual updates; the rest of this section explains the rationale and algorithm.

Under the variational inference (VI) framework for neural networks (Graves, 2011; Hinton and van Camp, 1993; Nowlan and Hinton, 1992), the posterior approximation is sampled for each mini-batch during training, and often requires reparameterization tricks (Kingma et al., 2015) to reduce the training variance. These limitations make the large-scale use of VI a challenging endeavour. We address this by presenting an alternative optimization approach.

The procedure to minimize the complexity cost $L^C$ and sample the resulting quantized values, described in Section 3.3, is time-consuming compared to the forward/backward propagation time of the model. Because of this, it cannot be carried out at each stochastic gradient descent (SGD) step used to minimize the error cost $L^E$. Instead, we interleave the minimization stages of $L^C$ and $L^E$. By doing so, $L^E$ is optimized with the conventional SGD algorithm, where the forward pass uses the quantized weights for inference and the subsequent backward pass updates the original values. We also found in our experiments that exponentially increasing the interval between consecutive $L^C$ optimization stages helps to reduce the variance introduced by sampling and improves training quality.

Algorithm 1 optimizes $\mathcal{L}(\boldsymbol{\theta})$, where $T$ specifies the number of epochs used to fine-tune the quantized sparse model; it returns the final optimized hyperparameters $\boldsymbol{\theta}$ and quantized weights $\hat{\mathbf{w}}$. Note that we assume the pruned weights given by the pruning mask $m$ remain zero throughout fine-tuning.

1:  function Optimize($\mathbf{w}$, $m$, $T$)
2:      $t \leftarrow 1$, $e \leftarrow 0$
3:      while $e < T$ do
4:          Minimize $L^C$: fit the Gaussian mixture to the unpruned weights of each layer
5:          for each weight $w$ do
6:              Sample the component selector $z_w$
7:          end for
8:          for $t$ epochs do
9:              Sample a mini-batch from $\mathcal{D}$ and update $\mathbf{w}$ by SGD on $L^E$
10:         end for
11:         $e \leftarrow e + t$; $t \leftarrow 2t$
12:     end while
13:     return $\boldsymbol{\theta}$, $\hat{\mathbf{w}}$
14: end function
Algorithm 1 Model Optimization
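The interleaving of Algorithm 1 can be sketched as a plain Python loop. The callback names (`fit_components`, `sample_z`, `sgd_epoch`) and their signatures are hypothetical stand-ins for the actual training steps; the exponential doubling of the interval follows the description above.

```python
import numpy as np

def optimize(weights, mask, epochs_total, fit_components, sample_z, sgd_epoch):
    """Sketch of Algorithm 1: interleave minimizing the complexity
    cost L^C (refit components, resample selectors) with SGD epochs
    on the error cost L^E, doubling the interval between L^C updates
    to damp sampling variance."""
    interval, epoch = 1, 0
    theta = None
    while epoch < epochs_total:
        theta = fit_components(weights, mask)     # minimize L^C
        z = sample_z(weights, theta)              # draw component selectors
        for _ in range(min(interval, epochs_total - epoch)):
            weights = sgd_epoch(weights, mask, theta, z)  # minimize L^E
            epoch += 1
        interval *= 2                             # exponential backoff
    return theta, weights * (np.asarray(mask) != 0)  # pruned weights stay zero
```

With 7 total epochs, the costly $L^C$ stage runs only 3 times (before epochs 1, 2, and 4), while SGD runs every epoch, which is the intended asymmetry between the two costs.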

4 Evaluation

We applied focused compression, a compression flow which consists of pruning, focused quantization and Huffman encoding, to a wide range of popular vision models including MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNets (He et al., 2016a, b) on the ImageNet dataset (Deng et al., 2009). For all of these models, focused compression produced models with high compression ratios (CRs) and permitted a multiplication-less hardware implementation while having minimal impact on the task accuracy. In our experiments, models are initially sparsified using Dynamic Network Surgery (Guo et al., 2016). Focused quantization is subsequently applied to restrict weights to low-precision values. During fine-tuning, we additionally employed Incremental Network Quantization (INQ) (Zhou et al., 2017) and gradually increased the proportion of weights being quantized to 25%, 50%, 75%, 87.5% and 100%. At each step, the models were fine-tuned for 3 epochs at a learning rate of 0.001, except for the final step at 100%, which ran for 10 epochs with the learning rate decayed every 3 epochs. Finally, Huffman encoding was applied to the model weights, which further reduced model sizes.

4.1 Model Size Reduction

Table 1 compares the accuracies and compression rates before and after applying the compression pipeline under different quantization bit-widths, demonstrating the effectiveness of focused compression on these models. We found that sparsified ResNets with 7-bit weights are at least 15.9× smaller than the original dense models, with marginal degradations (at most 0.24%) in top-5 accuracies. MobileNets, which are much less redundant and more compute-efficient models to begin with, achieved smaller CRs at around 8× and slightly larger accuracy degradations (under 0.9% in top-5 accuracies). Yet when compared to the ResNet-18 models, the compressed MobileNet-V2 is not only more accurate but also has a significantly smaller memory footprint at 1.71 MB.

Model           Top-1 (Δ)        Top-5 (Δ)        Sparsity  Size (MB)  CR
ResNet-18       68.94            88.67            0.00      46.76
  5 bits        68.36 (-0.58)    88.45 (-0.22)    74.86     2.86       16.33
  7 bits        68.57 (-0.37)    88.53 (-0.14)    74.86     2.94       15.92
ResNet-50       75.58            92.83            0.00      93.82
  5 bits        74.86 (-0.72)    92.59 (-0.24)    82.70     5.19       18.08
  7 bits        74.99 (-0.59)    92.59 (-0.24)    82.70     5.22       17.98
MobileNet-V1    70.77            89.48            0.00      16.84
  7 bits        69.13 (-1.64)    88.61 (-0.87)    33.80     2.13       7.90
MobileNet-V2    71.65            90.44            0.00      13.88
  7 bits        70.05 (-1.60)    89.55 (-0.89)    31.74     1.71       8.14
Table 1: The accuracies (%), sparsities (%) and CRs of focused compression on ImageNet models. The baseline models are dense models before compression using 32-bit floating-point weights; "5 bits" and "7 bits" denote the number of bits used by individual weights of the quantized models before Huffman encoding.

In Table 2, we compare focused compression with many state-of-the-art model compression schemes. It shows that focused compression simultaneously achieves the best accuracies and the highest CRs on both ResNets. Trained Ternary Quantization (TTQ) (Zhu et al., 2017) quantizes weights to ternary values, while INQ (Zhou et al., 2017) and extremely low bit neural network (denoted as ADMM) (Leng et al., 2018) quantize weights to ternary or powers-of-two values using shift quantization. Distillation and Quantization (D&Q) (Polino et al., 2018) quantizes parameters to integers via distillation. Note that D&Q's results used a larger model as the baseline, hence the compressed model has high accuracies and a low CR. We also compared against Coreset-Based Compression (Dubey et al., 2018), comprising pruning, filter approximation, quantization and Huffman encoding. For ResNet-50, we additionally compare against ThiNet (Luo et al., 2017), a filter pruning method, and Clip-Q (Tung et al., 2018), which interleaves training steps with pruning, weight sharing and quantization. Focused compression again achieves the highest CR (18.08×) and accuracy (74.86%).

ResNet-18                             Top-1   Top-5   Size (MB)  CR
TTQ (Zhu et al., 2017)                66.00   87.10   2.92       16.00
INQ (2 bits) (Zhou et al., 2017)      66.60   87.20   2.92       16.00
INQ (3 bits) (Zhou et al., 2017)      68.08   88.36   4.38       10.67
ADMM (2 bits) (Leng et al., 2018)     67.00   87.50   2.92       16.00
ADMM (3 bits) (Leng et al., 2018)     68.00   88.30   4.38       10.67
D&Q (large) (Polino et al., 2018)     73.10   91.17   21.98      2.13
Coreset (Dubey et al., 2018)          68.00   —       3.11       15.00
Focused compression (5 bits, sparse)  68.36   88.45   2.86       16.33

ResNet-50                             Top-1   Top-5   Size (MB)  CR
INQ (5 bits) (Zhou et al., 2017)      74.81   92.45   14.64      6.40
ADMM (3 bits) (Leng et al., 2018)     74.00   91.60   8.78       10.67
ThiNet (Luo et al., 2017)             72.04   90.67   16.94      5.53
Clip-Q (Tung et al., 2018)            73.70   —       6.70       14.00
Coreset (Dubey et al., 2018)          74.00   —       5.93       15.80
Focused compression (5 bits, sparse)  74.86   92.59   5.19       18.08
Table 2: Comparisons of top-1 and top-5 accuracies (%) and CRs with various compression methods. Some sizes and CRs were not originally reported and were calculated by us. Note that D&Q use a much larger ResNet-18 as their baseline.

4.2 Computation Reduction

Quantizing weights using focused quantization can significantly reduce the computational complexity of models. By further quantizing activations and batch normalization parameters to integers, the expensive floating-point multiplications and additions in convolutions can be replaced with simple bit-shift operations and integer additions. This can be realized with much faster software or hardware implementations, which directly translates to energy savings and much lower latencies on low-power devices. In Table 3, we evaluate the impact on accuracies of progressively applying focused quantization to weights and integer quantizations to activations and batch normalization parameters.

Quantized                    Top-1 (Δ)        Top-5 (Δ)
Baseline                     68.94            88.67
Weights                      68.36 (-0.58)    88.45 (-0.22)
Weights + activations        67.89 (-1.05)    88.08 (-0.59)
Weights + activations + BN   67.95 (-0.99)    88.06 (-0.61)
Table 3: Comparison of the original ResNet-18 with different quantizations applied to weights, activations and BN parameters: a 5-bit focused quantization on weights, an 8-bit integer quantization on activations, and a 16-bit integer quantization on BN parameters.

Figure 4 shows an efficient implementation of a layer with recentralized quantization. Table 4 shows the total number of bit operations (BitOps) required by the implementation to compute a batch-normalized convolutional layer with a padding size of 1, taking an input activation tensor and producing an output tensor. The BitOps count is an estimate of the cost of a corresponding hardware implementation, and we use it to compare the hardware cost of the two different quantizations used in focused quantization. Perhaps most surprisingly, a convolution quantized using recentralized quantization uses approximately the same number of BitOps as a shift-quantized alternative with the same bit-width for weights. The reason is that a weight with $n$-bit recentralized quantization internally uses an $(n-1)$-bit shift quantization; compared to $n$-bit shift quantization, the former has exactly half the dynamic range and thus uses an adder tree half the size. Yet recentralized quantization doubles the number of additions, as Figure 4 now takes two parallel addition paths. Additionally, Huffman encoding has minimal overhead in the number of BitOps, because it uses a very small number of dictionary entries for the quantized values. For instance, the number of possible entries for 3-bit values is only 8.
Quantization Total BitOps #Adds #Mults
Shift 463 M 5.74 M 6400
Focused Quantization 499 M 11.48 M 6400
Focused Quantization + Huffman 500 M 11.48 M 6400
Table 4: Computation overheads of inference with 5-bit shift vs. focused quantized weights. BitOps shows the total amount of bit operations to compute the convolution.
Figure 4: Implementation of a layer with recentralized quantization.

4.3 Exploring the Wasserstein Separation

In Section 3.5, we mentioned that some of the layers in a sparse model may not have multiple high-probability regions. For this reason, we use the Wasserstein distance between the two components of the Gaussian mixture model as a metric to decide whether recentralized or shift quantization should be used. In our experiments, we specified a threshold $w_{\mathrm{th}}$ such that for each layer, if $w_{\mathrm{sep}} > w_{\mathrm{th}}$ then recentralized quantization is used; otherwise, shift quantization is employed instead. Figure 5 shows the impact on top-1 accuracy of choosing different $w_{\mathrm{th}}$ values ranging from 1.0 to 3.5 at 0.1 increments. This model is a fast CIFAR-10 (Krizhevsky et al., 2014) classifier with only 9 convolutional layers, so that it is possible to repeat training 100 times for each $w_{\mathrm{th}}$ value to produce high-confidence results. Note that the average validation accuracy is maximized when the layer with only one high-probability region uses shift quantization and the remaining 8 use recentralized quantization, which verifies our intuition.

Figure 5: Exploring the effect of choosing different separation threshold values $w_{\mathrm{th}}$ on the Wasserstein separation in focused quantization. The larger the threshold, the fewer layers use recentralized quantization instead of shift quantization.

5 Conclusion

In this paper, we exploit the statistical properties of sparse CNNs and propose focused quantization to quantize model weights efficiently and effectively. The quantization strategy uses Gaussian mixture models to locate the high-probability regions in the weight distributions and quantizes them at fine levels. Coupled with pruning and encoding, we build a complete compression pipeline and demonstrate high compression ratios on a range of CNNs. On ResNet-18, we achieve a 16.33× CR with minimal loss in accuracy. Furthermore, the proposed quantization uses only powers-of-two values and thus provides an efficient compute pattern. The significant reductions in model sizes and compute complexities can translate into direct power savings for future CNN accelerators on IoT devices.


  • Dempster et al. [1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009.
  • Dubey et al. [2018] Abhimanyu Dubey, Moitreya Chatterjee, and Narendra Ahuja. Coreset-based neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 454–470, 2018.
  • Gao et al. [2019] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations, 2019.
  • Graves [2011] Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24. 2011.
  • Guo et al. [2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
  • Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
  • He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • Hinton and van Camp [1993] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, 1993.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
  • Kingma et al. [2015] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.
  • Krizhevsky et al. [2014] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 and CIFAR-100 datasets. http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
  • Leng et al. [2018] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with ADMM. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Liu et al. [2015] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Penksy. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Luo et al. [2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
  • Miyashita et al. [2016] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. CoRR, 2016.
  • Nowlan and Hinton [1992] Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Comput., 1992.
  • Polino et al. [2018] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • Tung and Mori [2018] Frederick Tung and Greg Mori. CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7873–7882, 2018.
  • Zhou et al. [2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations, 2017.
  • Zhu et al. [2017] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. International Conference on Learning Representations (ICLR), 2017.