Loss Aware Post-training Quantization

11/17/2019 ∙ by Yury Nahshan, et al. ∙ 0

Neural network quantization enables the deployment of large models on resource-constrained devices. Current post-training quantization methods fall short in terms of accuracy for INT4 (or lower) but provide reasonable accuracy for INT8 (or above). In this work, we study the effect of quantization on the structure of the loss landscape. We show that the structure is flat and separable for mild quantization, enabling straightforward post-training quantization methods to achieve good results. On the other hand, we show that with more aggressive quantization, the loss landscape becomes highly non-separable with sharp minima points, making the selection of quantization parameters more challenging. Armed with this understanding, we design a method that quantizes the layer parameters jointly, enabling significant accuracy improvement over current post-training quantization methods. Reference implementation accompanies the paper at https://github.com/ynahshan/nn-quantization-pytorch/tree/master/lapq




1 Introduction

Deep neural networks (DNNs) are a powerful tool that has shown unmatched performance in various tasks in computer vision, natural language processing, and optimal control, to mention just a few. Their high computational resource requirements, however, constitute one of the main drawbacks of DNNs, hindering their massive adoption on edge devices. With the growing number of tasks performed on the edge, e.g., on smartphones or embedded systems, and the availability of dedicated custom hardware for DNN inference, the topic of DNN compression has gained popularity.

Figure 1.1: Cross-entropy loss on a calibration set (one batch of the ImageNet training set) as a function of two clipping parameters, corresponding to the first two layers of ResNet-18. Dots denote solutions found by various local optimization methods. Note the complex, non-separable shape of the loss landscape.

One of the ways to improve computational efficiency of a DNN is to use lower-precision representation of the network, also known as quantization. The majority of literature on neural network quantization involves training either from scratch [bethge2018training, bethge2019back] or as a fine-tuning step on a pre-trained full-precision model [yang2019quantization, Hubara:2017:QNN:3122009.3242044]. Training is a powerful method to compensate for accuracy loss due to quantization. Yet, it is not always applicable in real-world scenarios, since it requires the full-size dataset. This dataset might be unavailable for different reasons such as privacy and intellectual property protection. Training is also time-consuming, requiring very long periods of optimization as well as skilled manpower and computational resources.

Consequently, it is desirable to apply quantization without fine-tuning the model or, at least, without fully training it from scratch. These methods are commonly referred to as post-training quantization and usually require only a small calibration dataset. However, those methods are inevitably less efficient, and most existing works only manage to quantize parameters to the 8-bit integer representation (INT8).

In the absence of a training set, these methods typically aim at minimizing some surrogate errors introduced during the quantization process (e.g., round-off errors) as opposed to the end-to-end loss that one actually wants to minimize. Minimization is performed independently for each layer, for example, by optimizing the value used for thresholding the tensor outliers before applying linear quantization (clipping). Different thresholds induce different quantization errors, and many previous techniques have been suggested to choose the optimal clipping threshold [lee2018quantization, banner2018post, zhao2019improving].
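To make the role of the clipping threshold concrete, the following is a minimal sketch of symmetric linear quantization with clipping. The function name `quantize`, the symmetric range, and the sample values are illustrative assumptions, not the paper's implementation.

```python
def quantize(values, clip, bits):
    """Symmetric uniform quantization onto 2**bits levels spanning [-clip, clip].

    Illustrative sketch only: the symmetric grid is an assumption.
    """
    levels = 2 ** bits - 1          # number of steps between the two extreme levels
    step = 2.0 * clip / levels      # quantization step size
    out = []
    for v in values:
        clipped = max(-clip, min(clip, v))       # saturate outliers: clipping error
        index = round((clipped + clip) / step)   # nearest grid point: round-off error
        out.append(index * step - clip)
    return out


# Hypothetical tensor with two outliers (-3.0 and 5.0).
x = [-3.0, -0.4, 0.1, 0.9, 5.0]
print(quantize(x, clip=1.0, bits=2))   # tight clip: fine grid, saturated outliers
print(quantize(x, clip=5.0, bits=2))   # loose clip: outliers kept, coarse grid
```

A small threshold shrinks the step size but saturates outliers, while a large threshold preserves outliers at the cost of a coarser grid; this is the trade-off that the clipping methods cited above try to balance.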

Unfortunately, these schemes suffer from two fundamental drawbacks. First, they optimize a surrogate objective, which serves as an imperfect proxy for the network accuracy. Fig. 3.4 demonstrates that various local error optimization methods result in different overall accuracy. This means that it might be impossible to choose a surrogate objective that can serve as a good proxy in all cases.

Moreover, the noise in earlier layers might be amplified by the following layers, creating a dependency between the optimal clipping parameters of different layers. This means that only joint optimization of the parameters of different layers can lead to optimal performance of the quantized model. In Section 5 we show that for stronger quantization, the lack of separability of the loss in the individual layer parameters becomes significant and cannot be neglected without major accuracy degradation.

Fig. 1.1 illustrates this phenomenon. We depict the loss surface of ResNet-18 as a function of the clipping thresholds of the first and the second layers. The solutions of the various local layer-wise optimization methods are visualized as colored dots on the surface. Fig. 1.2 magnifies a neighbourhood of the optimum to better visualize the local minimum using a contour map. While all dots lie within a nearly convex region, none of them coincides exactly with the optimal solution.

In this paper, we extend these local optimization methods by further optimizing the network loss function jointly over all clipping parameters, enabling a highly optimized post-training quantization scheme. We further show through simulations that this global loss-aware approach provides major benefits over current methods that optimize each layer locally and independently of the other layers.

Figure 1.2: Zoom into the area around the minimum (denoted by a red cross) of the loss function of ResNet-18, where two consecutive layers of the model are quantized to 2 bit.

Our contributions are as follows:

  • We perform an extensive analysis of the loss function of quantized neural networks. We study several of its characteristics, including convexity, separability, the sharpness of the minimum, and curvature, and explain their influence on the effectiveness of the quantization procedure.

  • We propose the Loss Aware Post-training Quantization (LAPQ) method, which finds clipping values that minimize the loss and hence maximize performance.

  • We evaluate our method on two different tasks and various neural network architectures. We show that our method outperforms other known methods for post-training quantization.

2 Related work

Neural network quantization methods are mostly divided into quantization-aware training (QAT) [Krishnamoorthi2018whitepaper] and post-training [Krishnamoorthi2018whitepaper] methods. Due to its robustness and outstanding results, quantization-aware training has gained popularity among multiple authors [baskin2018nice, zhang2018lq, gong2019differentiable, liu2019rbcn, yang2019quantization].

Figure 2.1: Heatmap of the cross-entropy loss on a calibration set as a function of the clipping values of two layers of ResNet-18. Two consecutive activation layers are quantized to 2 bit (a), 3 bit (b) and 4 bit (c); all other layers remain in full precision. The left plot (2 bit) reveals a complex, non-convex landscape with a sharp minimum, whereas the others are mostly flat.

Nevertheless, post-training quantization is widely used in existing hardware solutions. These methods are inevitably less efficient, and the majority of works only manage to reach 8-bit quantization without degradation in performance. Recently, post-training quantization methods have attracted much more attention from researchers, and more advanced techniques have been proposed.


migacz20178 proposed an automatic framework that converts full-precision parameters (both weights and activations) to a quantized representation, based on picking the threshold, i.e., the clipping value, that minimizes the Kullback-Leibler divergence between the distributions of the quantized and non-quantized tensors.

gong2018highly used the $L_\infty$ norm as a threshold for 8-bit quantization, which resulted in a small performance degradation but a significant improvement in latency.

lee2018quantization chose clipping parameters per channel. This leads to an increase in performance compared to layer-wise clipping parameters, while requiring more parameters and additional efforts for hardware support.

banner2018post suggested finding an optimal clipping value for quantization by assuming some known distribution of the tensors and minimizing the local MSE quantization error. In addition, the authors proposed correcting the bias that quantization introduces into the weights. We utilize the proposed bias correction method in our scheme.


zhao2019improving proposed outlier channel splitting, which splits a neuron with a large value into two neurons with smaller magnitudes. This approach introduces a trade-off, reducing the quantization error at the cost of a network size overhead.
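The splitting idea can be sketched on a single dot product. The helper `split_largest` and the toy numbers are hypothetical; this only shows why halving and duplicating a weight preserves the output while shrinking the range to be quantized, and is not the cited method's code.

```python
def dot(w, x):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(w, x))


def split_largest(weights, inputs):
    """Duplicate the input of the largest-magnitude weight and halve that weight.

    The output dot product is unchanged, the largest magnitude is halved,
    and the layer grows by one entry (the size overhead).
    """
    k = max(range(len(weights)), key=lambda i: abs(weights[i]))
    half = weights[k] / 2.0
    new_w = weights[:k] + [half, half] + weights[k + 1:]
    new_x = inputs[:k] + [inputs[k], inputs[k]] + inputs[k + 1:]
    return new_w, new_x


# Hypothetical weights with one outlier (-3.0).
w, x = [0.2, -3.0, 0.5], [1.0, 2.0, -1.0]
w2, x2 = split_largest(w, x)
print(dot(w, x), dot(w2, x2))                        # same output
print(max(abs(v) for v in w), max(abs(v) for v in w2))  # smaller range after the split
```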

choukroun2019low calculated the clipping values iteratively on a calibration set by minimizing quantization MSE on each layer separately. The quantization was kernel-wise for weights and channel-wise for activations. The authors performed experiments on as low as 4-bit quantization.

finkelstein2019fighting addressed the harder problem of MobileNet quantization. They claim that the source of degradation is a shift in the mean activation value caused by an inherent bias in the quantization process. The authors proposed a scheme that compensates for this bias. The method does not require labeled data and is easily integrated into the deployment of DNNs.


nagel2019data utilized a quantization scheme relying on equalizing the weight ranges in the network by making use of the scale-equivariance property of activation functions. In addition, the authors proposed a method to correct biases in the error that are introduced during quantization.


Another approach mapped the data into bins by using clustering while the range is uniformly distributed. This method manages to achieve near-baseline accuracy for 4-bit quantization.

To the best of our knowledge, previous works did not take into account the fact that the loss might not be separable during the optimization, usually performing the optimization per layer or even per channel. Notable exceptions are gong2018highly and zhao2019improving, who did not perform any kind of optimization. nagel2019data partially addressed the lack of separability by treating pairs of consecutive layers together.

3 Loss landscape of quantized neural nets

In this section we explore the landscape of the loss function of quantized neural networks. We perform an extensive analysis of several characteristics of this function: convexity, separability, sharpness of the minimum, and curvature. These characteristics greatly influence the effectiveness of existing methods for post-training quantization, especially the selection of optimal clipping thresholds for the quantization procedure.

3.1 Separability

In order to reduce the error introduced by quantization in the post-training regime, one can saturate the values in a layer at some threshold. Quantization within a smaller range reduces the distortion caused by rounding, introducing a trade-off between quantization error and clipping error [banner2018post]. Previous studies suggest optimizing the clipping threshold of individual layers either based on statistics or by direct minimization of the mean square error [migacz20178, lee2018quantization, banner2018post, choukroun2019low]. While such straightforward approaches have been shown to be useful in many cases, their efficiency highly depends on the separability of the error function.
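The trade-off between clipping and rounding error can be explored with a small grid search over the threshold. This is a sketch under stated assumptions: the symmetric quantizer, the helper names (`quantize`, `lp_error`, `best_clip`), the candidate grid, and the synthetic Gaussian sample are all illustrative and not taken from the reference implementation.

```python
import random


def quantize(values, clip, bits):
    """Symmetric uniform quantization onto 2**bits levels spanning [-clip, clip]."""
    step = 2.0 * clip / (2 ** bits - 1)
    return [round((max(-clip, min(clip, v)) + clip) / step) * step - clip
            for v in values]


def lp_error(values, clip, bits, p):
    """L_p norm of the difference between a tensor and its quantized version."""
    quantized = quantize(values, clip, bits)
    return sum(abs(a - b) ** p for a, b in zip(values, quantized)) ** (1.0 / p)


def best_clip(values, bits, p, grid=100):
    """Grid search for the clipping threshold minimizing the L_p error."""
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs * k / grid for k in range(1, grid + 1)]
    return min(candidates, key=lambda c: lp_error(values, c, bits, p))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic stand-in for a layer
for bits in (2, 4):
    clips = [round(best_clip(x, bits, p), 2) for p in (2.0, 3.0, 4.0)]
    print(f"{bits} bit: best clip for p = 2, 3, 4 -> {clips}")
```

Each layer is treated in isolation here, which is exactly the per-layer surrogate whose limits the rest of this section examines.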

To illustrate the importance of separability, let us consider a DNN with $n$ layers. Each layer $i$ comprises a linear function with weights $W_i$ and an activation function $\phi$; applied to an input $x_i$, it produces an output $\phi(W_i x_i)$. In particular, if the activation function is ReLU, the input of the next layer is given by

$$x_{i+1} = \mathrm{ReLU}(W_i x_i) \tag{1}$$
We treat quantization error as an additive uniform random noise :


This assumption is legitimate, since the sufficient and necessary condition for quantization error to be white and uniform is vanishing characteristic function of the input



For large amount of quantization bins, is close to integer, and thus the condition is approximately satisfied, which was also confirmed empirically for NN feature maps [baskin2018nice].

Figure 3.1: $L_p$ norm of the quantization error with respect to the clipping value at 2 bit (a) and 4 bit (b) quantization of a single-dimension, normally distributed vector. For less aggressive quantization the minimum of the quantization error is flatter and approximately the same across the various $L_p$ norms.


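The quantization at each layer $i$ introduces a multiplicative error of $(1 + \varepsilon_i)$ with respect to the true activation value. This multiplicative error builds up across layers. For an $n$-layer network, ignoring for a moment the activations, the expected network outputs are scaled with respect to the true network outputs as follows:

$$\prod_{i=1}^{n} (1 + \varepsilon_i) = 1 + \sum_{i=1}^{n} \varepsilon_i + O(\varepsilon^2) \tag{4}$$

Assuming a sufficiently small quantization error, $|\varepsilon_i| \ll 1$, we can neglect terms of second and higher orders. Eq. 4 shows that the final quantization error is additively separable for a sufficiently small local error $\varepsilon_i$, which means we can write this error as a sum of single-variable functions $E_i(\varepsilon_i)$:

$$E(\varepsilon_1, \dots, \varepsilon_n) = \sum_{i=1}^{n} E_i(\varepsilon_i) \tag{5}$$

In our case, we can take, for example,

$$E_i(\varepsilon_i) = \varepsilon_i \tag{6}$$

For larger values of the local error, we need to keep more terms in Eq. 4:

$$\prod_{i=1}^{n} (1 + \varepsilon_i) = 1 + \sum_{i=1}^{n} \varepsilon_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \varepsilon_i \varepsilon_j + O(\varepsilon^3) \tag{7}$$

The approximation of the error defined in Eq. 7, as opposed to the one defined in Eq. 4, is no longer separable. Moreover, clipping in earlier layers affects the error in subsequent layers, which creates even stronger dependencies. Eq. 7 suggests that minimizing the quantization error of individual layers does not necessarily minimize the final error.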
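The argument above can be checked numerically. The sketch below compares the exact accumulated error, prod(1 + eps_i) - 1, with its first-order (separable) and second-order (non-separable) approximations; the per-layer error values are hypothetical.

```python
from itertools import combinations
from math import prod


def exact_error(eps):
    """Exact relative output error accumulated over layers: prod(1 + eps_i) - 1."""
    return prod(1.0 + e for e in eps) - 1.0


def first_order(eps):
    """Separable approximation: a sum of single-variable terms."""
    return sum(eps)


def second_order(eps):
    """Adds the pairwise cross terms eps_i * eps_j, which couple the layers."""
    return sum(eps) + sum(a * b for a, b in combinations(eps, 2))


for scale in (0.001, 0.1):          # mild versus aggressive quantization noise
    eps = [scale, -0.5 * scale, 2.0 * scale, 1.5 * scale]   # hypothetical layer errors
    gap1 = abs(exact_error(eps) - first_order(eps))
    gap2 = abs(exact_error(eps) - second_order(eps))
    print(f"scale={scale}: first-order gap={gap1:.2e}, second-order gap={gap2:.2e}")
```

For small errors the separable sum is already accurate; for larger errors the cross terms dominate the gap, which is the regime where layer-wise optimization stops being sufficient.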

Figure 3.2: Accuracy of ResNet-50 quantized to 2 bit and 4 bit with respect to layer-wise optimization of different $L_p$ norms. At 4 bit, the accuracy almost does not depend on $p$, whereas at 2 bit different values of $p$ can differ by more than 20%.

Figure 3.3: Absolute value of the Hessian matrix of the loss function with respect to the clipping parameters, calculated over 15 layers of ResNet-18, for 4-bit (a) and 2-bit (b) quantization. The higher values on the diagonal of the Hessian at 2-bit quantization indicate a sharper function compared to 4 bit. Off-diagonal elements indicate the coupling between the clipping parameters of different layers.

3.2 Sharpness of the minimum

We briefly discuss the geometry and the flatness of the loss function around the minimum. We informally call a minimum flat if the neighbourhood in which the loss stays close to its minimum value is large; otherwise, the minimum is sharp. Flat minima allow lower precision of the parameters [wolpert1994bayesian, hochreiter1995simplyfing] and are more robust under perturbations. We empirically confirm that a lower number of bits results in a sharper minimum (Fig. 2.1).

We begin by investigating the sharpness of the minimum in the case of quantization of a single vector. The empirical evidence provided by banner2018post shows that the quantization mean-square error (MSE) is a smooth, convex function, which has a sharp minimum at lower-bitwidth quantization and is mostly flat at higher bitwidth. In Fig. 3.1 we plot the average $L_p$ norm of the quantization error for different values of $p$ at 2 bit and 4 bit, respectively. It is clear that 2 bit is associated with a sharper minimum, whereas 4 bit is mostly flat.

We next look at how the minimum becomes flatter as the number of bits increases for two-dimensional data taken from two different layers of ResNet-18. For 4-bit quantization the quantization error is insignificant even for larger clipping parameters, which results in a very flat landscape, whereas for 2 bit the trade-off between clipping and quantization errors is obvious. This trade-off leads to a much more complex, non-convex and non-separable loss landscape with a significantly sharper minimum.

Clearly, different metrics result in different optimal clipping values. In Fig. 3.2 we show the accuracy of ResNet-50 for different cases in which the clipping minimizes the $L_p$ norm for a given value of $p$. While in the case of 4 bits per parameter the accuracy almost does not depend on $p$, for 2 bits different values of $p$ can differ by more than 20% in accuracy. These results provide indirect evidence for the sharpness of the minimum. Specifically, while the different metrics result in nearby solutions, they end up being comparable in terms of accuracy only in the 4-bit case, due to the flat and stable nature of the minimum.

3.3 Hessian of the loss function

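To estimate the dependencies between the clipping parameters of different layers, we analyze the structure of the Hessian of the loss function. The Hessian matrix contains the second-order partial derivatives of the loss $\mathcal{L}(\Delta)$, where $\Delta = (\Delta_1, \dots, \Delta_n)$ is a vector of clipping parameters:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \Delta_i \, \partial \Delta_j}$$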
In case of separable functions, the Hessian is a diagonal matrix. This means that the magnitude of the off-diagonal elements can be used as a measure of separability.
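As a small illustration of this separability measure, the sketch below estimates a Hessian by central finite differences and compares a separable toy function with a coupled one. The two functions are hypothetical stand-ins for a loss over two clipping parameters, and finite differences are an assumption of this sketch rather than the procedure used for Fig. 3.3.

```python
def hessian(f, x, h=1e-3):
    """Central finite-difference estimate of the Hessian of f at the point x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Evaluate f with coordinates i and j perturbed by di and dj.
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            H[i][j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return H


def separable(v):
    """Toy loss: a sum of single-variable terms."""
    return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2


def coupled(v):
    """Same toy loss plus a cross term that couples the two parameters."""
    return separable(v) + 1.5 * v[0] * v[1]


point = [0.3, -0.2]
print(hessian(separable, point))   # off-diagonal entries close to 0
print(hessian(coupled, point))     # off-diagonal entries close to 1.5
```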

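To quantify the sharpness of the minimum, we look at the curvature of the graph of the function and, in particular, at the Gaussian curvature, which is given [goldman2005curvature] by:

$$K = \frac{\det H}{\left(1 + \|\nabla \mathcal{L}\|^2\right)^{(n+2)/2}}$$

where $H$ is the Hessian of the loss $\mathcal{L}$ and $n$ is the number of clipping parameters. At a minimum $\nabla \mathcal{L} = 0$, and thus

$$K = \det H$$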
We calculated Gaussian curvature at point that minimizes norm and acquired the following values:


which means that the flat surface for 4 bit shown in Fig. 2.1 is a generic property of the loss and not of the specific coordinates. Similarly, we conclude that 2-bit quantization generally has sharper minima.
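As a worked illustration of why the curvature at a minimum reflects sharpness independently of the coordinates, consider a hypothetical two-parameter quadratic loss (the function $f$ and the constants $a$, $b$, $c$ are assumptions made only for this example):

```latex
% Toy loss with a minimum at the origin (requires a, b > 0 and ab > c^2).
f(x, y) = \tfrac{1}{2}\left(a x^2 + b y^2\right) + c\,xy,
\qquad
\nabla f(0, 0) = 0,
\qquad
H = \begin{pmatrix} a & c \\ c & b \end{pmatrix}.

% Gaussian curvature of the graph at the minimum, where the gradient vanishes:
K = \frac{\det H}{\left(1 + \|\nabla f\|^2\right)^{2}}\Bigg|_{(0,0)} = \det H = ab - c^2 .

% Rotating the coordinates by an orthogonal matrix R maps H to R^{\top} H R, and
% \det\left(R^{\top} H R\right) = \det H,
% so K does not depend on the choice of axes.
```

A larger $K$ corresponds to a narrower bowl around the minimum, and since the determinant is unchanged by rotations, the value is a property of the surface rather than of the particular pair of axes chosen for plotting.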

In Fig. 3.3 we show the absolute value of the Hessian matrix of the loss function with respect to the clipping parameters, calculated over 15 layers of ResNet-18. The Hessian is calculated at the point that minimizes the norm of the quantization error, with 4-bit (a) and 2-bit (b) quantization, respectively. At 2 bit the diagonal elements of the Hessian are much larger than at 4 bit, which indicates a higher sharpness of the loss at this point. On the other hand, the off-diagonal elements at 4 bit are smaller than at 2 bit, which confirms that the function is much more separable at 4 bit. These results are consistent with the measurements of the Gaussian curvature and with the experiments with different norms (Fig. 3.1).

Moreover, the Hessian matrix provides additional information regarding the coupling between different layers. As expected, off-diagonal terms of adjacent layers have higher values than those of distant layers, which corresponds to stronger dependencies between the clipping parameters of adjacent layers.

Figure 3.4: Accuracy of ResNet-18 and ResNet-50 quantized to 2 bits with respect to layer-wise optimization of different $L_p$ norms. The accuracy has an approximately quadratic form with respect to $p$.

4 Multivariate loss optimization

Instead of minimizing a local metric, we propose to optimize the loss function of the network. In this case, both layer-wise and joint optimization methods can be used. In this section, we provide some background on gradient-free optimization methods and discuss their application to DNN quantization.

Coordinate descent

Given a function $f$ of $n$ parameters $x = (x_1, \dots, x_n)$, coordinate descent optimizes each parameter independently of the others. Given the minimization problem

$$\min_{x_1, \dots, x_n} f(x_1, \dots, x_n)$$

we optimize one parameter at a time while fixing the others. Coordinate descent is a very simple and scalable optimization method; however, it does not provide any convergence guarantees. In the case of separable functions, however, if single-variable optimization achieves the minimum, so does coordinate descent.
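A minimal gradient-free sketch of coordinate descent is given below. The golden-section line search, the search interval, and the two toy objectives are assumptions chosen for illustration; they are not the settings of the reference implementation.

```python
import math


def golden_section(g, lo, hi, tol=1e-6):
    """Minimize a unimodal single-variable function g on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if g(c) < g(d):          # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                    # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def coordinate_descent(f, x0, lo=-3.0, hi=3.0, sweeps=20):
    """Repeatedly minimize f along one coordinate while the others stay fixed."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            def along_axis(t, i=i):
                y = list(x)
                y[i] = t
                return f(y)
            x[i] = golden_section(along_axis, lo, hi)
    return x


def separable(v):
    return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2


def coupled(v):
    return separable(v) + 1.5 * v[0] * v[1]


print(coordinate_descent(separable, [0.0, 0.0], sweeps=1))   # one sweep is enough
print(coordinate_descent(coupled, [0.0, 0.0], sweeps=1))     # one sweep is not
print(coordinate_descent(coupled, [0.0, 0.0], sweeps=20))    # needs repeated sweeps
```

On the separable toy function a single sweep reaches the minimum, matching the remark above; with the cross term, each coordinate update invalidates the previous one and several sweeps are needed.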

Conjugate directions and Powell’s method

More advanced methods, such as Powell's method [1964Powell], optimize all the parameters jointly by performing line searches over a set of directions, called conjugate directions. This method is more efficient than coordinate descent, but still does not require gradients. Hence, the minimized function need not be differentiable, and no derivatives are taken.
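The direction-set idea can be sketched as follows. This is a simplified variant written for illustration (fixed search radius, oldest direction always discarded, toy objective); production implementations of Powell's method use more careful bracketing and direction-replacement rules.

```python
import math


def line_min(f, x, d, radius=4.0, tol=1e-7):
    """Golden-section search for the step t in [-radius, radius] minimizing f(x + t*d)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0

    def g(t):
        return f([xi + t * di for xi, di in zip(x, d)])

    a, b = -radius, radius
    c = b - inv_phi * (b - a)
    e = a + inv_phi * (b - a)
    while b - a > tol:
        if g(c) < g(e):
            b, e = e, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, e
            e = a + inv_phi * (b - a)
    t = 0.5 * (a + b)
    return [xi + t * di for xi, di in zip(x, d)]


def powell(f, x0, iterations=10):
    """Simplified direction-set method: no gradients, only line searches."""
    n = len(x0)
    directions = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x = list(x0)
    for _ in range(iterations):
        start = list(x)
        for d in directions:                 # search along the current direction set
            x = line_min(f, x, d)
        move = [xi - si for xi, si in zip(x, start)]
        norm = math.sqrt(sum(v * v for v in move))
        if norm < 1e-6:                      # no progress in this pass: stop
            break
        new_dir = [v / norm for v in move]   # overall displacement becomes a direction
        x = line_min(f, x, new_dir)
        directions = directions[1:] + [new_dir]
    return x


def coupled(v):
    """Toy non-separable objective standing in for a loss over two parameters."""
    return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2 + 1.5 * v[0] * v[1]


print(powell(coupled, [0.0, 0.0]))
```

The displacement accumulated over one pass is added as a new search direction, which lets the method move diagonally across coupled parameters instead of zig-zagging along the axes.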

W (bits) | A (bits) | Initialization with four different $L_p$ norms | CD | Powell
32 | 4 | 68.1% | 68.7% | 68.6% | 68.8% | 68.0% | 68.6%
32 | 3 | 63.3% | 65.7% | 65.7% | 65.9% | 65.4% | 66.3%
32 | 2 | 32.9% | 43.9% | 47.8% | 48.0% | 51.6% | 48.0%
4 | 32 | 48.6% | 56.0% | 57.2% | 55.5% | 53.3% | 62.6%
3 | 32 | 4.0% | 18.3% | 19.8% | 23.7% | 0.1% | 42.0%
4 | 4 | 43.6% | 53.5% | 55.4% | 53.5% | 57.4% | 58.5%
Table 4.1: Accuracy of ResNet-18 with four different initializations of $L_p$ norms (the first four accuracy columns). $p^*$ refers to the value of $p$ obtained by interpolating three different values of $p$ to a quadratic form. The best initialization was used to run Coordinate Descent (CD) and Powell's method.

4.1 Loss Aware Quantization

In the previous sections, we have shown that at low-bit quantization the loss as a function of the clipping parameters is a non-convex, non-separable function with a complex landscape. These properties make the function hard to optimize. On the other hand, at high bitwidth the loss function is flat around the minimum. To address this combination of different conditions, we propose to combine multivariate optimization algorithms with a heuristic approach to "good" initialization.

We formulate the quantization problem as an optimization of the loss function $\mathcal{L}(\Delta)$ with respect to the clipping parameters $\Delta$. $\mathcal{L}$ is a continuous function defined on a closed, bounded region, and hence has a global minimum.

First, we estimate the value of $p$ that produces the best accuracy. To that end, we perform layer-wise $L_p$ norm minimization of the quantization error for three different values of $p$. In Fig. 3.4 we show the accuracy of ResNet-18 and ResNet-50 for different values of $p$. Given the three points, we interpolate a quadratic polynomial to approximate the optimal value of $p$; Fig. 3.4 shows, for example, where ResNet-50 attains its best accuracy at 2 bit. Then we take the point with the best accuracy and use it as the initialization for either coordinate descent or Powell's algorithm. In Table 4.1 we show an ablation study performed on ResNet-18, comparing the accuracy achieved by the different initializations with that of coordinate descent and Powell's method. In most cases, Powell's method outperforms the others. To choose the clipping values for each specific case, we simply select the clipping of the method that attains the best accuracy.
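The quadratic interpolation step can be sketched as fitting a parabola through three (p, accuracy) points and taking its vertex. The helper `parabola_vertex` and the sample points are hypothetical and do not reproduce measurements from the paper.

```python
def parabola_vertex(points):
    """Fit y = a*x**2 + b*x + c through three points and return the maximizing x."""
    (x1, y1), (x2, y2), (x3, y3) = points
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3 ** 2 * (y1 - y2) + x2 ** 2 * (y3 - y1) + x1 ** 2 * (y2 - y3)) / denom
    if a >= 0.0:
        # The fitted parabola opens upward, so it has no interior maximum.
        raise ValueError("points do not bracket a maximum")
    return -b / (2.0 * a)


# Hypothetical (p, accuracy) measurements from three layer-wise L_p runs.
samples = [(2.0, 9.0), (2.5, 9.75), (4.0, 9.0)]
p_star = parabola_vertex(samples)
print(p_star)   # estimated p*, used to compute one more initialization
```

The estimated value would then be used for one additional layer-wise run, and the best of the resulting initializations seeds the joint optimization.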

5 Experimental Results

We apply our method to multiple models covering vision and recommendation system tasks. In all experiments we first calibrate on a small held-out calibration set to calculate the optimal scale factors. Then we evaluate on the validation set with the scale factors obtained in the calibration step.

Method | W (bits) | A (bits) | ResNet-18 | ResNet-50 | MobileNet V2
MMSE | 32 | 4 | 68.1% | 73.4% | 62%
MMSE | 32 | 2 | 32.9% | 17.1% | 1.2%
MMSE | 4 | 32 | 48.6% | 48.4% | 1.2%
MMSE | 4 | 4 | 43.6% | 36.4% | 1.2%
LAPQ | 32 | 4 | 68.8% | 74.8% | 65.1%
LAPQ | 32 | 2 | 51.6% | 54.2% | 1.5%
LAPQ | 4 | 32 | 62.6% | 69.9% | 29.4%
LAPQ | 4 | 4 | 58.5% | 66.6% | 21.3%
LAPQ + bias correction | 4 | 32 | 63.3% | 71.8% | 60.2%
LAPQ + bias correction | 4 | 4 | 59.8% | 70% | 49.7%
FP32 | 32 | 32 | 69.7% | 76.1% | 71.8%
Table 5.1: Results of applying LAPQ to ResNet-18, ResNet-50 and MobileNetV2. LAPQ significantly outperforms naive minimization of the mean square error (MMSE). In addition, we show the effect of bias correction of the weights.

Figure 5.1: Accuracy of ResNet-18 for different sizes of calibration set at various quantization levels.

5.1 ImageNet

We evaluate our method on several CNN architectures on ImageNet. We select a calibration set of 512 random images for the optimization step. Empirically, we have found that 512 is a good trade-off between generalization and running time, as shown in Fig. 5.1. Following the convention [baskin2018nice, yang2019quantization], we do not quantize the first and the last layers. Table 5.1 reports the accuracy at different bitwidths compared to the minimal-MSE baseline. The improvement of our method over the baseline is larger at lower bitwidths. Moreover, we note that weights are more sensitive to quantization than activations. Even at 4-bit quantization, minimization of the MSE results in significant accuracy degradation compared to LAPQ.

As observed by finkelstein2019fighting, MobileNet is sensitive to quantization bias. To address this issue, we perform bias correction of the weights as proposed by banner2018post, which can be easily combined with LAPQ. As shown in Table 5.1, this significantly reduces the quantization error and improves the accuracy of MobileNet as well as of the other networks.
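One plausible reading of weight bias correction is to rescale and shift the quantized weights of a channel so that their mean and standard deviation match those of the original weights. The sketch below follows that reading; the function names and the sample values are hypothetical, and the exact formulation in banner2018post may differ.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def bias_correct(weights, quantized):
    """Match the mean and standard deviation of quantized weights to the originals.

    Assumption of this sketch: correction is applied to one channel at a time.
    """
    mu_w, mu_q = mean(weights), mean(quantized)
    sd_w, sd_q = std(weights), std(quantized)
    scale = sd_w / sd_q if sd_q > 0.0 else 1.0
    return [mu_w + scale * (q - mu_q) for q in quantized]


# Hypothetical channel and a coarse quantized version of it.
w = [0.9, -0.2, 0.35, 0.1, -0.6]
w_q = [1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0, -1.0 / 3.0]
w_c = bias_correct(w, w_q)
print(mean(w), mean(w_q), mean(w_c))   # the corrected mean matches the original
print(std(w), std(w_q), std(w_c))      # and so does the standard deviation
```

The corrected values generally no longer lie on the original quantization grid; how the shift and scale are folded into the layer is outside the scope of this sketch.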

Model | W/A | Method | Hit rate (%)
NCF 1B | 32/32 | FP32 | 51.5
NCF 1B | 32/8 | LAPQ (Ours) | 51.2
NCF 1B | 32/8 | MMSE | 51.1
NCF 1B | 8/32 | LAPQ (Ours) | 51.4
NCF 1B | 8/32 | MMSE | 33.4
NCF 1B | 8/8 | LAPQ (Ours) | 51.0
NCF 1B | 8/8 | MMSE | 33.5
Table 5.2: Hit rate of NCF-1B when applying our method, LAPQ, and MMSE (minimal Mean Square Error).

Many successful post-training quantization methods perform finer parameter assignment, such as group-wise [Mellempudi2017ternary], channel-wise [banner2018post], pixel-wise [faraone2018syq] or filter-wise [choukroun2019low] quantization, which requires special hardware support and additional computational resources. Finer parameter assignment appears to provide an unconditional improvement in performance, independently of the underlying method. In contrast to those approaches, our method performs layer-wise quantization, which is simple to implement on any existing hardware that supports low-precision integer operations. For that reason, we do not include those methods in the comparison. In Tables 5.3 and 5.4 we compare our method with several other known methods of layer-wise quantization. In most cases, our method significantly outperforms all the competing methods.

5.2 NCF-1B

In addition to the CNN models, we evaluated our method on a recommendation system task, specifically on the Neural Collaborative Filtering (NCF) [Xiangnan2017ncf] model. We use the mlperf implementation (https://github.com/mlperf/training/tree/master/recommendation/pytorch) to train the model on the MovieLens-1B dataset. Similarly to the CNN case, we generate a calibration set of 50k random user/item pairs, significantly smaller than both the training and validation sets.

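Model | W/A | Method | Accuracy (%)
ResNet-18 | 32 / 32 | FP32 | 69.7
ResNet-18 | 8 / 4 | LAPQ (Ours) | 68.8
ResNet-18 | 8 / 4 | DUAL [choukroun2019low] | 68.38
ResNet-18 | 8 / 4 | ACIQ [banner2018post] | 65.528
ResNet-18 | 8 / 3 | LAPQ (Ours) | 66.3
ResNet-18 | 8 / 3 | ACIQ [banner2018post] | 52.476
ResNet-18 | 8 / 2 | LAPQ (Ours) | 51.6
ResNet-18 | 8 / 2 | ACIQ [banner2018post] | 7.07
ResNet-18 | 4 / 4 | LAPQ (Ours) | 59.8
ResNet-18 | 4 / 4 | KLD [migacz20178] | 31.937
ResNet-18 | 4 / 4 | MMSE | 43.6
ResNet-50 | 32 / 32 | FP32 | 76.1
ResNet-50 | 8 / 4 | LAPQ (Ours) | 74.8
ResNet-50 | 8 / 4 | DUAL [choukroun2019low] | 73.25
ResNet-50 | 8 / 4 | OCS [zhao2019improving] | 0.1
ResNet-50 | 8 / 4 | ACIQ [banner2018post] | 68.92
ResNet-50 | 8 / 3 | LAPQ (Ours) | 70.8
ResNet-50 | 8 / 3 | ACIQ [banner2018post] | 51.858
ResNet-50 | 8 / 2 | LAPQ (Ours) | 54.2
ResNet-50 | 8 / 2 | ACIQ [banner2018post] | 2.92
ResNet-50 | 4 / 32 | LAPQ (Ours) | 71.8
ResNet-50 | 4 / 32 | OCS [zhao2019improving] | 69.3
ResNet-50 | 4 / 4 | LAPQ (Ours) | 70
ResNet-50 | 4 / 4 | KLD [migacz20178] | 46.19
ResNet-50 | 4 / 4 | MMSE | 36.4
Table 5.3: Comparison with other methods on ResNet-18 and ResNet-50. MMSE refers to minimization of the Mean Square Error. Some of the ACIQ results were obtained by us by running the published code.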

In Table 5.2 we report results for the NCF-1B model compared to the MMSE method. Even at 8-bit quantization, NCF-1B suffers from significant degradation with the naive MMSE method. On the other hand, LAPQ achieves near-SOTA accuracy, with 0.5% degradation from the FP32 result.

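Model | W/A | Method | Accuracy (%)
ResNet-101 | 32 / 32 | FP32 | 77.3
ResNet-101 | 8 / 4 | LAPQ (Ours) | 73.6
ResNet-101 | 8 / 4 | DUAL [choukroun2019low] | 74.26
ResNet-101 | 8 / 4 | ACIQ [banner2018post] | 66.966
ResNet-101 | 8 / 3 | LAPQ (Ours) | 65.7
ResNet-101 | 8 / 3 | ACIQ [banner2018post] | 41.46
ResNet-101 | 8 / 2 | LAPQ (Ours) | 29.8
ResNet-101 | 8 / 2 | ACIQ [banner2018post] | 3.826
ResNet-101 | 4 / 32 | LAPQ (Ours) | 66.5
ResNet-101 | 4 / 32 | MMSE | 18
ResNet-101 | 4 / 4 | LAPQ (Ours) | 59.2
ResNet-101 | 4 / 4 | KLD [migacz20178] | 49.948
ResNet-101 | 4 / 4 | MMSE | 9.8
Inception-V3 | 32 / 32 | FP32 | 77.2
Inception-V3 | 8 / 4 | LAPQ (Ours) | 74.4
Inception-V3 | 8 / 4 | DUAL [choukroun2019low] | 73.06
Inception-V3 | 8 / 4 | OCS [zhao2019improving] | 0.2
Inception-V3 | 8 / 4 | ACIQ [banner2018post] | 66.42
Inception-V3 | 8 / 3 | LAPQ (Ours) | 64.4
Inception-V3 | 8 / 3 | ACIQ [banner2018post] | 31.01
Inception-V3 | 4 / 32 | LAPQ (Ours) | 51.6
Inception-V3 | 4 / 32 | MMSE | 5.8
Inception-V3 | 4 / 4 | LAPQ (Ours) | 38.6
Inception-V3 | 4 / 4 | KLD [migacz20178] | 1.84
Inception-V3 | 4 / 4 | MMSE | 2.2
Table 5.4: Comparison with other methods on ResNet-101 and Inception-V3. MMSE refers to minimization of the Mean Square Error. Some of the ACIQ results were obtained by us by running the published code.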

6 Discussion

In this paper we analyzed the loss function of quantized neural networks. We showed that for low-precision quantization the loss function is non-convex and non-separable. Under such conditions, existing methods that minimize some local metric cannot perform well.

We introduced Loss Aware Post-training Quantization (LAPQ), which optimizes the clipping parameters of the quantization function by directly minimizing the loss function. Our method outperforms most previously suggested methods for post-training quantization. It does not assume special hardware support such as channel-wise or filter-wise quantization. To the best of our knowledge, LAPQ is the first method to achieve near-FP32 accuracy at 4-bit layer-wise quantization in the post-training regime.


The research was funded by Hyundai Motor Company through HYUNDAI-TECHNION-KAIST Consortium, ERC StG RAPID, and Hiroshi Fujiwara Technion Cyber Security Research Center.