1 Introduction
Deep neural networks (DNNs) are a powerful tool that has shown unmatched performance in various tasks in computer vision, natural language processing, and optimal control, to mention just a few. The high computational resource requirements, however, constitute one of the main drawbacks of DNNs, hindering their massive adoption on edge devices. With the growing number of tasks performed on the edge, e.g., on smartphones or embedded systems, and the availability of dedicated custom hardware for DNN inference, the topic of DNN compression has gained popularity.
One of the ways to improve the computational efficiency of a DNN is to use a lower-precision representation of the network, also known as quantization. The majority of the literature on neural network quantization involves training, either from scratch [bethge2018training, bethge2019back] or as a fine-tuning step on a pretrained full-precision model [yang2019quantization, Hubara:2017:QNN:3122009.3242044]. Training is a powerful method to compensate for the accuracy loss due to quantization. Yet, it is not always applicable in real-world scenarios, since it requires the full-size dataset. This dataset might be unavailable for different reasons, such as privacy and intellectual-property protection. Training is also time-consuming, requiring very long periods of optimization as well as skilled manpower and computational resources.
Consequently, it is desirable to apply quantization without fine-tuning the model or, at least, without fully training it from scratch. Such methods are commonly referred to as post-training quantization and usually require only a small calibration dataset. However, these methods are inevitably less efficient, and most existing works only manage to quantize parameters to the 8-bit integer representation (INT8).
In the absence of a training set, these methods typically aim at minimizing some surrogate errors introduced during the quantization process (e.g., round-off errors), as opposed to the end-to-end loss that one actually wants to minimize. The minimization is performed independently for each layer, for example, by optimizing the value used for thresholding the tensor outliers before applying linear quantization (clipping). Different thresholds induce different quantization errors, and many techniques have previously been suggested to choose the optimal clipping threshold
[lee2018quantization, banner2018post, zhao2019improving]. Unfortunately, these schemes suffer from two fundamental drawbacks. First, they optimize a surrogate objective, which serves as an imperfect proxy for the network accuracy. Fig. 3.4 demonstrates that the various local error optimization methods result in different overall accuracy. This means that it might be impossible to choose a surrogate objective that can serve as a good proxy in all cases.
Moreover, the noise in earlier layers might be amplified by subsequent layers, creating a dependency between the optimal clipping parameters of different layers. This means that only a joint optimization of the parameters of the different layers can lead to an optimal performance of the quantized model. In Section 5 we show that for stronger quantization, the lack of separability of the loss in the individual layer parameters becomes significant and cannot be neglected without major accuracy degradation.
Fig. 1.1 illustrates this phenomenon. We depict the loss surface of ResNet-18 as a function of the clipping thresholds of the first and second layers. The solutions of the various local layer-wise optimization methods are visualized as colored dots on the surface. Fig. 1.2 provides a magnification of a neighbourhood of the optimum to better visualize the local minimum using a contour map. While all dots lie within a nearly convex region, none of them coincides exactly with the optimal solution.
In this paper, we extend these local optimization methods by further optimizing the network loss function jointly over all clipping parameters, enabling a highly optimized post-training quantization scheme. We further show through simulations that this global loss-aware approach provides major benefits over current methods that optimize each layer locally and independently of the rest of the layers.
Our contributions are as follows:


We perform an extensive analysis of the loss function of quantized neural networks. We study several of its characteristics, including convexity, separability, sharpness of the minimum, and curvature, and explain their influence on the effectiveness of the quantization procedure.

We propose the Loss Aware Post-training Quantization (LAPQ) method, which finds optimal clipping values that minimize the loss and hence maximize performance.

We evaluate our method on two different tasks and various neural network architectures, and show that it outperforms other known methods for post-training quantization.
2 Related work
Neural network quantization methods are usually divided into quantization-aware training (QAT) [Krishnamoorthi2018whitepaper] and post-training [Krishnamoorthi2018whitepaper] methods. Due to its robustness and outstanding results, quantization-aware training has gained popularity among multiple authors [baskin2018nice, zhang2018lq, gong2019differentiable, liu2019rbcn, yang2019quantization].
Despite that, post-training quantization is widely used in existing hardware solutions. However, these methods are inevitably less efficient, and the majority of works only manage to reach 8-bit quantization without degradation in performance. Recently, post-training quantization methods have attracted much more attention from researchers, and more advanced techniques have been proposed.
migacz20178 proposed an automatic framework that converts full-precision parameters (both weights and activations) to a quantized representation, based on picking a threshold, i.e., a clipping value, which minimizes the Kullback-Leibler divergence between the distributions of the quantized and non-quantized tensors.
gong2018highly used a norm-based threshold for 8-bit quantization, which resulted in a small performance degradation but a significant improvement in latency.
lee2018quantization chose clipping parameters per channel. This leads to an increase in performance compared to layer-wise clipping parameters, while requiring more parameters and additional effort for hardware support.
banner2018post suggested finding an optimal clipping value for quantization by assuming a known distribution of the tensors and minimizing the local MSE quantization error. In addition, the authors proposed to apply a bias correction to the weights by injecting a correction for the bias error. We utilize the proposed bias-correction method in our scheme.
zhao2019improving proposed outlier channel splitting, which splits neurons with large values into two neurons with smaller magnitudes. This approach introduces a trade-off, reducing the quantization error at the cost of network size overhead.
choukroun2019low calculated the clipping values iteratively on a calibration set by minimizing the quantization MSE of each layer separately. The quantization was kernel-wise for weights and channel-wise for activations. The authors performed experiments with quantization as low as 4 bits.
finkelstein2019fighting addressed the harder problem of MobileNet quantization. They claim that the source of degradation is a shift in the mean activation value caused by an inherent bias in the quantization process. The authors proposed a scheme that compensates for this bias. The method does not require labeled data and is easily integrated during deployment of DNNs.
nagel2019data utilized a quantization scheme relying on equalizing the weight ranges in the network by making use of the scale-equivariance property of activation functions. In addition, the authors proposed a method to correct biases in the error that are introduced during quantization.
nayak2019bit mapped the data into bins by using clustering, while the range is uniformly distributed. This method manages to achieve near-baseline accuracy for 4-bit quantization.
To the best of our knowledge, previous works did not take into account the fact that the loss might not be separable, usually performing the optimization per layer or even per channel. Notable exceptions are gong2018highly and zhao2019improving, who did not perform any kind of optimization. nagel2019data partially addressed the lack of separability by treating pairs of consecutive layers together.
3 Loss landscape of quantized neural nets
In the following section we explore the landscape of the loss function of quantized neural networks. We perform an extensive analysis of the characteristics of this function, including convexity, separability, sharpness of the minimum, and curvature. These characteristics greatly influence the effectiveness of existing methods for post-training quantization, especially the selection of optimal clipping thresholds for the quantization procedure.
3.1 Separability
In order to reduce the error introduced by quantization, in the post-training regime one can saturate the values in a layer at some threshold. Quantization within a smaller range reduces the distortion caused by rounding error, introducing a trade-off between the quantization and clipping errors [banner2018post]. Previous studies suggest optimizing the clipping threshold either based on statistics or by direct minimization of the mean squared error [migacz20178, lee2018quantization, banner2018post, choukroun2019low] for individual layers. While such straightforward approaches have been shown to be useful in many cases, their efficiency highly depends on the separability of the error function.
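To make this trade-off concrete, the following sketch (our own illustration, not the paper's implementation; the symmetric uniform quantizer and the Gaussian input are our own assumptions) decomposes the quantization MSE into a clipping term and a rounding term. As the threshold grows, the clipping error shrinks while the rounding error grows:

```python
import random

def components(samples, clip, bits):
    """Split quantization MSE into a clipping part and a rounding part."""
    levels = 2 ** bits
    step = 2 * clip / (levels - 1)
    clip_err = round_err = 0.0
    for x in samples:
        xc = max(-clip, min(clip, x))                 # saturated (clipped) value
        q = round((xc + clip) / step) * step - clip   # nearest quantization level
        clip_err += (x - xc) ** 2
        round_err += (xc - q) ** 2
    n = len(samples)
    return clip_err / n, round_err / n

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Small thresholds clip aggressively; large thresholds coarsen the grid.
results = {clip: components(samples, clip, 3) for clip in (1.0, 2.0, 4.0)}
for clip, (c_err, r_err) in results.items():
    print(clip, c_err, r_err)
```

A real calibration procedure would search this threshold per tensor; the point here is only that the two error components move in opposite directions, so an intermediate threshold minimizes their sum.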
To illustrate the importance of separability, let us consider a DNN with $L$ layers. Each layer comprises a linear function with weights $W_l$ and an activation function; applied to an input $x_l$, it produces an output $y_l = W_l x_l$. In particular, if the activation function is ReLU, the input of the next layer is given by

$$x_{l+1} = \mathrm{ReLU}(W_l x_l). \qquad (1)$$
We treat the quantization error as an additive uniform random noise $e$:

$$Q(x) = x + e, \qquad e \sim \mathcal{U}\!\left[-\tfrac{\Delta}{2}, \tfrac{\Delta}{2}\right], \qquad (2)$$

where $\Delta$ is the quantization step.
This assumption is legitimate, since the sufficient and necessary condition for the quantization error to be white and uniform is a vanishing characteristic function $\Phi_x$ of the input [sripad1977necessary]:

$$\Phi_x\!\left(\frac{2\pi n}{\Delta}\right) = 0, \qquad \forall n \neq 0. \qquad (3)$$
For a large number of quantization bins, the condition is approximately satisfied, which was also confirmed empirically for NN feature maps [baskin2018nice].
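As a sanity check of the uniform-noise model (our own illustration; the Gaussian input and the step size are arbitrary choices), one can quantize samples on a fine uniform grid and verify that the error has near-zero mean and variance close to $\Delta^2/12$, as expected for noise uniform on $[-\Delta/2, \Delta/2]$:

```python
import random

random.seed(0)
step = 0.05                                 # quantization step Delta (many bins)
xs = [random.gauss(0.0, 1.0) for _ in range(100000)]
# rounding to the nearest multiple of `step`; the residual is the noise e
errors = [x - round(x / step) * step for x in xs]

mean_e = sum(errors) / len(errors)
var_e = sum(e * e for e in errors) / len(errors)
print(mean_e, var_e, step ** 2 / 12)
```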
[Fig. 3.1: the norm of the quantization error as a function of the clipping value for 2-bit (a) and 4-bit (b) quantization of a one-dimensional normally distributed vector. For less aggressive quantization, the minimum of the quantization error is flatter and approximately the same for the various norms.]

The quantization at each layer introduces a multiplicative error of $(1+\varepsilon_l)$ with respect to the true activation value. This multiplicative error builds up across layers. For an $L$-layer network, ignoring for a moment the activations, the expected network outputs are scaled with respect to the true network outputs as follows:

$$\prod_{l=1}^{L}\left(1+\varepsilon_l\right) \approx 1 + \sum_{l=1}^{L}\varepsilon_l. \qquad (4)$$
Assuming a sufficiently small quantization error, $\varepsilon_l \ll 1$, we can neglect terms of the second and higher orders. Eq. 4 shows that the final quantization error is additively separable for sufficiently small local errors $\varepsilon_l$, which means that we can write this error as a sum of single-variable functions $f_l(\varepsilon_l)$:

$$F(\varepsilon_1, \ldots, \varepsilon_L) = \sum_{l=1}^{L} f_l(\varepsilon_l). \qquad (5)$$
In our case, we can take, for example,

$$f_l(\varepsilon_l) = \varepsilon_l. \qquad (6)$$
For larger values of the local errors, we need to keep more terms of the expansion in Eq. 4:

$$\prod_{l=1}^{L}\left(1+\varepsilon_l\right) = 1 + \sum_{l=1}^{L}\varepsilon_l + \sum_{i<j}\varepsilon_i \varepsilon_j + \cdots \qquad (7)$$
The approximation of the error defined in Eq. 7, as opposed to the one defined in Eq. 4, is no longer separable. Moreover, clipping in earlier layers affects the error in subsequent layers, creating even stronger dependencies. Eq. 7 suggests that minimizing the quantization error of individual layers does not necessarily minimize the final error.
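A small numerical experiment (an illustration under the simplifying assumption of equal per-layer errors) shows how quickly the separable first-order approximation degrades as the per-layer error grows:

```python
def exact_error(eps):
    """Accumulated multiplicative error: prod(1 + eps_l) - 1."""
    prod = 1.0
    for e in eps:
        prod *= 1.0 + e
    return prod - 1.0

L = 16                                      # number of layers
rels = []
for eps_val in (0.001, 0.01, 0.1):
    eps = [eps_val] * L
    exact = exact_error(eps)
    approx = sum(eps)                       # separable first-order term of Eq. 4
    rels.append(abs(exact - approx) / exact)
print(rels)
```

For tiny per-layer errors the relative gap is below one percent, but at a 10% per-layer error (typical of aggressive low-bit quantization) the cross terms of Eq. 7 account for more than half of the total error.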
3.2 Sharpness of the minimum
We briefly discuss the geometry and the flatness of the loss function around the minimum. We informally call a minimum flat if the neighbourhood of the minimum loss value is large; otherwise, the minimum is sharp. Flat minima allow lower precision of the parameters [wolpert1994bayesian, hochreiter1995simplyfing] and are more robust under perturbations. We empirically confirm that a lower number of bits results in a sharper minimum (Fig. 2.1).
We begin by investigating the sharpness of the minimum in the case of quantization of a one-dimensional vector. The empirical evidence provided by banner2018post shows that the quantization mean-squared error (MSE) is a smooth, convex function, which has a sharp minimum at lower-bitwidth quantization and is mostly flat at higher bitwidths. In Fig. 3.1 we plot the average norm of the quantization error for different values of $p$ at 2-bit and 4-bit quantization, respectively. It is clear that 2-bit quantization is associated with a sharper minimum, whereas at 4 bits the error is mostly flat.
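This behaviour can be reproduced with a toy experiment (our own sketch; the quantizer, the Gaussian data, and the 50% perturbation of the threshold are arbitrary choices): the increase of the quantization MSE when the clipping threshold deviates from its optimum is much larger at 2 bits than at 4 bits.

```python
import random

def quantize(x, clip, bits):
    """Symmetric uniform quantizer over [-clip, clip]."""
    levels = 2 ** bits
    step = 2 * clip / (levels - 1)
    x = max(-clip, min(clip, x))
    return round((x + clip) / step) * step - clip

def mse(samples, clip, bits):
    return sum((x - quantize(x, clip, bits)) ** 2 for x in samples) / len(samples)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(5000)]
grid = [c / 10 for c in range(5, 61)]       # candidate clips 0.5 .. 6.0

sharpness = {}
for bits in (2, 4):
    opt = min(grid, key=lambda c: mse(samples, c, bits))
    # absolute error increase when the clip deviates 50% from its optimum
    sharpness[bits] = mse(samples, 1.5 * opt, bits) - mse(samples, opt, bits)
print(sharpness)
```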
We next turn to look at how the minimum becomes flatter as the number of bits increases for two-dimensional data taken from two different layers of ResNet-18. While for 4-bit quantization the quantization error is insignificant even for larger clipping parameters, which results in a very flat landscape, the trade-off between clipping and quantization errors is obvious at 2 bits. This trade-off leads to a much more complex, non-convex and non-separable loss landscape with a significantly sharper minimum.
Clearly, different metrics result in different optimal clipping values. In Fig. 3.2 we show the accuracy of ResNet-50 for different cases where the clipping threshold optimizes the error norm for a given value of $p$. While in the case of 4 bits per parameter the accuracy is almost independent of $p$, at 2 bits different values of $p$ can differ by more than 20% in accuracy. These results provide indirect evidence for the sharpness of the minimum. Specifically, while the different metrics result in nearby solutions, they end up being comparable in terms of accuracy only in the 4-bit case, due to the flat and stable nature of the minimum.
3.3 Hessian of the loss function
To estimate the dependencies between the clipping parameters of different layers, we analyze the structure of the Hessian of the loss function. The Hessian matrix contains the second-order partial derivatives of the loss $\mathcal{L}(\mathbf{c})$, where $\mathbf{c}$ is the vector of clipping parameters:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial c_i \, \partial c_j}. \qquad (8)$$
In the case of a separable function, the Hessian is a diagonal matrix. This means that the magnitude of the off-diagonal elements can be used as a measure of separability.
To quantify the sharpness of the minimum, we look at the curvature of the graph of the function and, in particular, the Gaussian curvature, which is given [goldman2005curvature] by:

$$K = \frac{f_{xx} f_{yy} - f_{xy}^2}{\left(1 + f_x^2 + f_y^2\right)^2}. \qquad (9)$$
At a minimum, $f_x = f_y = 0$, and thus

$$K = f_{xx} f_{yy} - f_{xy}^2. \qquad (10)$$
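A quick finite-difference check on a toy quadratic surface (our own illustration; the coefficients are arbitrary) recovers this expression at the minimum:

```python
# For f(x, y) = a*x^2 + b*y^2 + c*x*y the second derivatives are constant,
# so at the minimum (0, 0) the Gaussian curvature is K = 4ab - c^2.
a, b, c = 2.0, 0.5, 0.3
f = lambda x, y: a * x * x + b * y * y + c * x * y

h = 1e-4
fxx = (f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h**2
fyy = (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h**2
fxy = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h**2)

K = fxx * fyy - fxy**2                      # Eq. 10 at a critical point
print(K, 4 * a * b - c**2)
```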
We calculated the Gaussian curvature at the point that minimizes the norm of the quantization error and obtained the following values:
(11)  
(12) 
which means that the flat surface for 4-bit quantization shown in Fig. 2.1 is a generic property of the loss and not of the specific coordinates. Similarly, we conclude that 2-bit quantization generally has sharper minima.
In Fig. 3.3 we show the absolute values of the Hessian matrix of the loss function with respect to the clipping parameters, calculated over 15 layers of ResNet-18. The Hessian is calculated at the point that minimizes the norm of the quantization error with 4-bit (a) and 2-bit (b) quantization, respectively. At 2 bits the diagonal elements of the Hessian are much larger than at 4 bits, which indicates a higher sharpness of the loss at this point. On the other hand, the off-diagonal elements at 4 bits are smaller than at 2 bits, which confirms that the function is much more separable at 4 bits. These results are consistent with the measurements of the Gaussian curvature and with the experiments with different norms (Fig. 3.1).
Moreover, the Hessian matrix provides additional information regarding the coupling between different layers. As expected, off-diagonal terms adjacent to the diagonal have higher values than distant elements, corresponding to stronger dependencies between the clipping parameters of adjacent layers.
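The same diagnostic can be sketched on a toy "loss" of two clipping parameters (a hypothetical quadratic with an explicit coupling term, not the network loss): the off-diagonal Hessian entry, estimated by central finite differences, vanishes exactly when the coupling is removed.

```python
def loss(c1, c2, coupling):
    """Toy loss: per-layer terms plus a cross-layer interaction term."""
    return (c1 - 1.5) ** 2 + (c2 - 2.0) ** 2 + coupling * (c1 - 1.5) * (c2 - 2.0)

def hessian(f, c1, c2, h=1e-4):
    """Finite-difference Hessian entries (d11, d22, d12) at (c1, c2)."""
    d11 = (f(c1 + h, c2) - 2 * f(c1, c2) + f(c1 - h, c2)) / h**2
    d22 = (f(c1, c2 + h) - 2 * f(c1, c2) + f(c1, c2 - h)) / h**2
    d12 = (f(c1 + h, c2 + h) - f(c1 + h, c2 - h)
           - f(c1 - h, c2 + h) + f(c1 - h, c2 - h)) / (4 * h**2)
    return d11, d22, d12

H_coupled = hessian(lambda a, b: loss(a, b, 0.5), 1.5, 2.0)
H_separable = hessian(lambda a, b: loss(a, b, 0.0), 1.5, 2.0)
print(H_coupled, H_separable)
```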
4 Multivariate loss optimization
Instead of minimizing a local metric, we propose to optimize the loss function of the network directly. In this case, both layer-wise and joint optimization methods can be used. In this section, we provide some background on gradient-free optimization methods and discuss their application to DNN quantization.
Coordinate descent
Given a function $f$ of $n$ parameters $c_1, \ldots, c_n$, coordinate descent optimizes each parameter independently of the others. Given the minimization problem

$$\min_{c_1, \ldots, c_n} f(c_1, \ldots, c_n), \qquad (13)$$

we optimize one parameter at a time while fixing the others. Coordinate descent is a very simple and scalable optimization method; however, it does not provide any convergence guarantees. In the case of separable functions, however, if the single-variable optimization achieves the minimum, so does coordinate descent.
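A minimal sketch of coordinate descent with a grid line search (an illustration on a toy separable objective, not the paper's implementation):

```python
def coordinate_descent(f, x0, grid, sweeps=10):
    """Cyclically minimize f over one coordinate at a time via grid search."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            # line search over the i-th coordinate with the others fixed
            x[i] = min(grid, key=lambda v: f(x[:i] + [v] + x[i + 1:]))
    return x

# toy separable objective with its minimum at (1.0, 2.0, 3.0)
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2 + (x[2] - 3) ** 2
grid = [v / 10 for v in range(0, 51)]       # candidate values 0.0 .. 5.0
x_opt = coordinate_descent(f, [0.0, 0.0, 0.0], grid)
print(x_opt)
```

Because this objective is separable, a single sweep already lands on the optimum; on a non-separable loss the sweeps would zigzag and may stall.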
Conjugate directions and Powell’s method
More advanced methods, such as Powell's method [1964Powell], optimize all the parameters jointly by performing a line search over a set of directions, called conjugate directions. This method is more efficient than coordinate descent, but still does not require gradients. Hence, the minimized function need not be differentiable, and no derivatives are taken.
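For completeness, here is a simplified direction-set method in the spirit of Powell's method (our own sketch with a crude grid line search and a toy coupled quadratic; a production implementation would use proper bracketing line searches and convergence tests):

```python
def line_search(f, x, d, lo=-2.0, hi=2.0, steps=400):
    """Grid line search: minimize f(x + t*d) over t in [lo, hi]."""
    ts = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    best_t = min(ts, key=lambda t: f([xi + t * di for xi, di in zip(x, d)]))
    return [xi + best_t * di for xi, di in zip(x, d)]

def powell(f, x0, sweeps=8):
    n = len(x0)
    dirs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x = list(x0)
    for _ in range(sweeps):
        x_start = list(x)
        for d in dirs:
            x = line_search(f, x, d)
        # the overall displacement of this sweep becomes a new search direction
        new_dir = [a - b for a, b in zip(x, x_start)]
        if any(abs(c) > 1e-12 for c in new_dir):
            dirs = dirs[1:] + [new_dir]     # drop the oldest direction
            x = line_search(f, x, new_dir)
    return x

# coupled quadratic whose minimum is at (1.0, 2.0)
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2 \
    + 0.8 * (x[0] - 1.0) * (x[1] - 2.0)
x_opt = powell(f, [0.0, 0.0])
print([round(v, 3) for v in x_opt])
```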
W  A  CD  Powell  

32  4  68.1%  68.7%  68.6%  68.8%  68.0%  68.6% 
32  3  63.3%  65.7%  65.7%  65.9%  65.4%  66.3% 
32  2  32.9%  43.9%  47.8%  48.0%  51.6%  48.0% 
4  32  48.6%  56.0%  57.2%  55.5%  53.3%  62.6% 
3  32  4.0%  18.3%  19.8%  23.7%  0.1%  42.0% 
4  4  43.6%  53.5%  55.4%  53.5%  57.4%  58.5% 
the value of $p$ obtained by interpolating three different values of $p$ with a quadratic fit. The best initialization was used to run Coordinate Descent (CD) and Powell's method.

4.1 Loss Aware Quantization
In the previous sections, we have shown that at low-bit quantization the loss as a function of the clipping parameters is a non-convex, non-separable function with a complex landscape. These properties make the function hard to optimize. On the other hand, at high bitwidths the loss function around the minimum is flat. To address this combination of different conditions, we propose to combine multivariate optimization algorithms with a heuristic for a "good" initialization.
We formulate the quantization problem as an optimization of the loss function $\mathcal{L}(\mathbf{c})$ with respect to the clipping parameters $\mathbf{c}$. $\mathcal{L}$ is a continuous function defined on a closed region and hence attains a global minimum.
First, we estimate the value of $p$ that produces the best accuracy. To that end, we perform layer-wise minimization of the norm of the quantization error for three different values of $p$. In Fig. 3.4 we show the accuracy of ResNet-18 and ResNet-50 for different values of $p$. Given the three points, we interpolate a quadratic polynomial to approximate the optimal value of $p$. For example, for ResNet-50 at 2 bits, the best accuracy is obtained at the interpolated value of $p$ (Fig. 3.4). Then we take the point with the best accuracy and use it as an initialization for either coordinate descent or Powell's algorithm. In Table 4.1 we show an ablation study performed on ResNet-18, comparing the accuracy achieved by the different initializations with coordinate descent and with Powell's method. In most cases, Powell's method outperforms the others. To choose the optimal clipping values for each specific case, we simply select the clipping of the method that attains the best accuracy.
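The interpolation step can be sketched as follows (hypothetical numbers; the function `quadratic_vertex` and the accuracy values are our own illustration, not measurements from the paper):

```python
def quadratic_vertex(p1, a1, p2, a2, p3, a3):
    """Fit a*p^2 + b*p + c through three (p, accuracy) points; return the vertex."""
    denom = (p1 - p2) * (p1 - p3) * (p2 - p3)
    a = (p3 * (a2 - a1) + p2 * (a1 - a3) + p1 * (a3 - a2)) / denom
    b = (p3**2 * (a1 - a2) + p2**2 * (a3 - a1) + p1**2 * (a2 - a3)) / denom
    return -b / (2 * a)   # vertex: the maximum of the parabola when a < 0

# hypothetical accuracies measured at three values of p
p_star = quadratic_vertex(2.0, 40.0, 3.0, 48.0, 4.0, 44.0)
print(p_star)
```

For these sample points the fitted parabola is $-6p^2 + 38p + c$, so the estimated optimum is $p^\ast = 38/12 \approx 3.17$; the layer-wise minimization would then be rerun at this $p^\ast$ to produce the initialization.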
5 Experimental Results
We apply our method to multiple models covering vision and recommendation-system tasks. In all experiments, we first calibrate on a small held-out calibration set to calculate the optimal scale factors. Then we evaluate on the validation set with the scale factors obtained from the calibration step.
W  A  ResNet18  ResNet50  MobileNet V2 

Min MSE  
32  4  68.1%  73.4%  62% 
32  2  32.9%  17.1%  1.2% 
4  32  48.6%  48.4%  1.2% 
4  4  43.6%  36.4%  1.2% 
LAPQ  
32  4  68.8%  74.8%  65.1% 
32  2  51.6%  54.2%  1.5% 
4  32  62.6%  69.9%  29.4% 
4  4  58.5%  66.6%  21.3% 
LAPQ + bias correction  
4  32  63.3%  71.8%  60.2% 
4  4  59.8%  70.0%  49.7% 
FP32  69.7%  76.1%  71.8% 
5.1 ImageNet
We evaluate our method on several CNN architectures on ImageNet. We select a calibration set of 512 random images for the optimization step. Empirically, we have found that 512 is a good trade-off between generalization and running time, as shown in Fig. 5.1. Following the convention [baskin2018nice, yang2019quantization], we do not quantize the first and the last layers. Table 5.1 reports the accuracy at different bitwidths compared to a minimal-MSE baseline. Our method provides larger improvements over the baseline at lower bitwidths. Moreover, we note that the weights are more sensitive to quantization than the activations. Even at 4-bit quantization, minimization of the MSE results in significant accuracy degradation compared to LAPQ.
As observed by finkelstein2019fighting, MobileNet is sensitive to quantization bias. To address this issue, we perform bias correction of the weights as proposed by banner2018post, which can be easily combined with LAPQ. As shown in Table 5.1, this significantly reduces the quantization error and improves the accuracy of MobileNet as well as of the other networks.
Model  W/A  Method  Hit rate(%) 

NCF 1B  32/32  FP32  51.5 
32/8  LAPQ (Ours)  51.2  
MMSE  51.1  
8/32  LAPQ (Ours)  51.4  
MMSE  33.4  
8/8  LAPQ (Ours)  51.0  
MMSE  33.5 
Many successful methods of post-training quantization perform a finer parameter assignment, such as group-wise [Mellempudi2017ternary], channel-wise [banner2018post], pixel-wise [faraone2018syq] or filter-wise [choukroun2019low] quantization, which requires special hardware support and additional computational resources. Finer parameter assignment appears to provide an unconditional improvement in performance, independently of the underlying method. In contrast with those approaches, our method performs layer-wise quantization, which is simple to implement on any existing hardware that supports low-precision integer operations. For these reasons, we do not include those methods in the comparison. In Tables 5.4 and 5.3 we compare our method with several other known methods of layer-wise quantization. In most cases, our method significantly outperforms all the competing methods.
5.2 NCF-1B
In addition to the CNN models, we evaluated our method on a recommendation-system task, specifically on the Neural Collaborative Filtering [Xiangnan2017ncf] model. We use the mlperf implementation (https://github.com/mlperf/training/tree/master/recommendation/pytorch) to train the model on the MovieLens-1B dataset. Similarly to the CNNs, we generate a calibration set of 50k random user/item pairs, significantly smaller than both the training and validation sets.
Model  W/A  Method  Accuracy(%) 

32 / 32  FP32  69.7  
LAPQ (Ours)  68.8  
DUAL [choukroun2019low]  68.38  
8 / 4  ACIQ [banner2018post]  65.528  
LAPQ (Ours)  66.3  
8 / 3  ACIQ [banner2018post]  52.476  
LAPQ (Ours)  51.6  
8 / 2  ACIQ [banner2018post]  7.07  
LAPQ (Ours)  59.8  
KLD [migacz20178]  31.937  
ResNet18  4 / 4  MMSE  43.6 
32 / 32  FP32  76.1  
LAPQ (Ours)  74.8  
DUAL [choukroun2019low]  73.25  
OCS [zhao2019improving]  0.1  
8 / 4  ACIQ [banner2018post]  68.92  
LAPQ (Ours)  70.8  
8 / 3  ACIQ [banner2018post]  51.858  
LAPQ (Ours)  54.2  
8 / 2  ACIQ [banner2018post]  2.92  
LAPQ (Ours)  71.8  
4 / 32  OCS [zhao2019improving]  69.3  
LAPQ (Ours)  70  
KLD [migacz20178]  46.19  
ResNet50  4 / 4  MMSE  36.4 
In Table 5.2 we report results for the NCF-1B model compared to the MMSE method. Even at 8-bit quantization, NCF-1B suffers significant degradation with the naive MMSE method. On the other hand, LAPQ achieves near-SOTA accuracy, with 0.5% degradation from the FP32 results.
Model  W/A  Method  Accuracy(%) 

32 / 32  FP32  77.3  
LAPQ (Ours)  73.6  
DUAL [choukroun2019low]  74.26  
8 / 4  ACIQ [banner2018post]  66.966  
LAPQ (Ours)  65.7  
8 / 3  ACIQ [banner2018post]  41.46  
LAPQ (Ours)  29.8  
8 / 2  ACIQ [banner2018post]  3.826  
LAPQ (Ours)  66.5  
4 / 32  MMSE  18  
LAPQ (Ours)  59.2  
KLD [migacz20178]  49.948  
ResNet101  4 / 4  MMSE  9.8 
32 / 32  FP32  77.2  
LAPQ (Ours)  74.4  
DUAL [choukroun2019low]  73.06  
OCS [zhao2019improving]  0.2  
8 / 4  ACIQ [banner2018post]  66.42  
LAPQ (Ours)  64.4  
8 / 3  ACIQ [banner2018post]  31.01  
LAPQ (Ours)  51.6  
4 / 32  MMSE  5.8  
LAPQ (Ours)  38.6  
KLD [migacz20178]  1.84  
InceptionV3  4 / 4  MMSE  2.2 
6 Discussion
In this paper we analyzed the loss function of quantized neural networks. We showed that for low-precision quantization the loss function is non-convex and non-separable. Under such conditions, existing methods that minimize some local metric cannot perform well.
We introduced Loss Aware Post-training Quantization (LAPQ), which optimizes the clipping parameters of the quantization function by directly minimizing the loss function. Our method outperforms most previously suggested methods for post-training quantization. Our method does not assume special hardware support, unlike channel-wise or filter-wise quantization. To the best of our knowledge, LAPQ is the first method to achieve near-FP32 accuracy with 4-bit layer-wise quantization in the post-training regime.
Acknowledgments
The research was funded by Hyundai Motor Company through HYUNDAITECHNIONKAIST Consortium, ERC StG RAPID, and Hiroshi Fujiwara Technion Cyber Security Research Center.