Robust Quantization: One Model to Rule Them All

by Moran Shkolnik et al.

Neural network quantization methods often involve simulating the quantization process during training. This makes the trained model highly dependent on the precise way quantization is performed. Since low-precision accelerators differ in their quantization policies and their supported mix of data-types, a model trained for one accelerator may not be suitable for another. To address this issue, we propose KURE, a method that provides intrinsic robustness to the model against a broad range of quantization implementations. We show that KURE yields a generic model that may be deployed on numerous inference accelerators without a significant loss in accuracy.






1 Introduction

Deep neural networks (DNNs) are a prominent choice for many machine learning applications. However, a significant drawback of these models is their computational cost. Low-precision arithmetic is one of the key techniques being actively studied to overcome this difficulty. With appropriate hardware support, low-precision training and inference can perform more operations per second, reduce memory bandwidth and power consumption, and allow larger networks to fit into a device.

Naively quantizing a single-precision floating point (FP32) model to 4 bits (INT4) or lower usually incurs a significant accuracy degradation. Many studies have tried to mitigate this accuracy decrease by offering different quantization methods. These methods differ in whether they require training or not. Methods that require training (known as quantization-aware training, or QAT; Choi et al., 2018; Baskin et al., 2018; Esser et al., 2019; Zhang et al., 2018; Zhou et al., 2016) simulate the quantization arithmetic on the fly, while methods that avoid training (known as post-training quantization, or PTQ; Banner et al., 2019; Choukroun et al., 2019; Migacz, 2017; Gong et al., 2018; Finkelstein et al., 2019; Zhao et al., 2019) minimize the quantization noise added to the model.

Unfortunately, neither approach is robust to common variations in the assumed quantization noise model. For example, Krishnamoorthi (2018) observed that, in order to avoid accuracy degradation at inference time, it is essential to ensure that all quantization-related artifacts are faithfully modeled at training time. Our experiments in this paper further support this observation. For example, when quantizing ResNet-18 (He et al., 2015) with DoReFa (Zhou et al., 2016) to 4-bit precision, an error of less than 2% in the quantizer step size results in an accuracy drop of 58%.

In practice, there is almost always some degree of uncertainty related to how quantization is executed. For example, recent estimates suggest that over 100 companies are now producing optimized inference chips (Reddi et al., 2019), each with its own mix of data types and induced quantization artifacts. Different quantizer implementations can differ in many ways, including the number of assigned bits, the quantization step size adjusted to accommodate the tensor range, the rounding and truncation policies, the support of fused operations avoiding noise accumulation, etc.

To allow rapid and easy deployment of DNNs on embedded low-precision accelerators, a single pre-trained generic model that can be deployed on a wide range of deep learning accelerators would be very appealing. Such a robust and generic model would allow DNN practitioners to provide a single off-the-shelf robust model suitable for every accelerator, regardless of the supported mix of data types, precise quantization process, and without the need to re-train the model at the customer side.

In this paper, we suggest a method to produce a quantized model without simulating quantization during training and without minimizing the quantization error post-training. This makes the resulting model robust to quantization since no specific assumption about the quantization process is made. Consequently, the resulting model can be used in diverse settings on different accelerators and various operating modes.

To that end, we introduce KURE — a KUrtosis REgularization term which is added to the model loss function. By imposing specific kurtosis values, KURE is capable of manipulating the model tensor distributions to adopt superior noise tolerance qualities.

This paper makes the following contributions:

  • We introduce KURE as a method to reshape tensor distributions — specifically, we use KURE for the uniformization of the network weights, and show that it can be used for the benefit of both PTQ and QAT regimes. Interestingly, changing the tensor distribution to a uniform distribution during training does not prevent the model from converging to state-of-the-art accuracy in full precision.

  • We prove that compared to normally-distributed weights, uniformly-distributed weights promoted by KURE are more robust to quantization – they have higher SNR and are less sensitive to the specific quantizer implementation.

  • We apply KURE to several ImageNet models and demonstrate that the generated models are quantization-robust and less susceptible to changes in the quantization policy (e.g., a change in the quantization step size).

2 Related Work

Quantization techniques may be divided into post-training quantization (PTQ) and quantization-aware training (QAT). The former, as the name implies, does not comprise training of the model parameters. Instead, PTQ techniques mainly involve clipping of the pretrained model distributions of the activations and/or weights followed by their linear quantization. The lack of training is usually reflected in lesser performance compared to QAT; however, QAT requires resources for training that may not be available, such as the training dataset, energy, or time.

Post-training quantization. PTQ methods vary mainly in the way the clipping is performed. Banner et al. (2019) and Choukroun et al. (2019) find optimal clipping values in terms of the mean-squared-error (MSE), Migacz (2017) picks clipping values that minimize the Kullback-Leibler (KL) divergence, and Gong et al. (2018) use a norm-based criterion. Clipping may also be accompanied by minor distribution tweaking. Zhao et al. (2019) address the importance of clipping outlier values and propose outlier channel splitting; Nayak et al. (2019) perform hierarchical clustering of classes to help the quantized model differentiate between overlapping class distributions; and Finkelstein et al. (2019) correct the mean activation value shift in the quantized model with a bias term. Clipping may also be considered at channel granularity (Lee et al., 2018), and its value may be obtained according to the global network loss function (Nahshan et al., 2019).

Quantization-aware training.

QAT methods optimize model parameters for quantization with gradient descent and backpropagation. The quantization step size, for example, may be defined as a learnable parameter (Choi et al., 2018; Baskin et al., 2018; Esser et al., 2019; Zhang et al., 2018; Zhou et al., 2016), thereby optimally adjusted during the training procedure. Since QAT passes model activations and/or weights through "quantization layers", and since quantization operations are not differentiable, it is common to employ a straight-through estimator (Bengio et al., 2013). Differentiable models of the quantizer have been shown to improve the performance of QAT (Yang et al., 2019; Gong et al., 2019; Elthakeb et al., 2019).
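The straight-through estimator idea can be sketched in a few lines: the forward pass uses quantized weights, while the backward pass treats the non-differentiable rounding as the identity. Below is a minimal numpy sketch with a toy regression loss and hypothetical step size and learning rate, not any specific paper's implementation:

```python
import numpy as np

def quantize(w, delta):
    """Uniform quantizer: round to the nearest multiple of the step size delta."""
    return delta * np.round(w / delta)

# Toy QAT step with a straight-through estimator (STE): the forward pass
# uses quantized weights, while the backward pass treats the rounding as
# the identity, so the gradient w.r.t. the quantized weights is applied
# directly to the full-precision "master" weights.
rng = np.random.default_rng(0)
w = rng.normal(size=8)           # full-precision master weights
x = rng.normal(size=8)           # a toy input
delta, lr = 0.25, 0.1            # hypothetical step size and learning rate

w_q = quantize(w, delta)                     # forward: quantized weights
loss = 0.5 * np.sum((w_q * x - 1.0) ** 2)    # toy regression loss
grad_wq = (w_q * x - 1.0) * x                # d loss / d w_q
w = w - lr * grad_wq                         # STE: d w_q / d w taken as 1
```

In a real framework the rounding would sit inside an autograd function whose backward pass is the identity; the sketch only makes that gradient flow explicit.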

Robustness and distribution reshaping. To the best of our knowledge, previous works did not consider robustness to changes in the quantization scheme assumed at training. In a sense, our work is most closely related to Yu et al. (2019), which mentions uniformization of the model distributions, targeting, however, a very different goal. While Yu et al. (2019) proposed a QAT scheme, we propose a method complementary to PTQ and QAT that produces models robust to different quantization techniques. In addition, we prove that uniformly-distributed tensors are indeed more resilient to noise.

3 Model and Problem Formulation

Given a uniform M-bit quantizer with quantization step size Δ, a continuous value x ∈ ℝ is mapped into a discrete representation Q(x) = Δ·k with k ∈ ℤ. The index k is expressed as follows:

    k = clip( ⌊x/Δ⌉, −2^(M−1), 2^(M−1) ),    (1)

where ⌊·⌉ denotes rounding to the nearest integer and clip(·) truncates indices outside the quantization range.

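As a concrete sketch of this quantizer model (rounding to the nearest multiple of Δ, then clipping at the threshold T = 2^(M−1)·Δ; the function name is illustrative):

```python
import numpy as np

def uniform_quantize(x, delta, n_bits):
    """Symmetric uniform quantizer: round to the nearest multiple of delta,
    then clip the index so that |Q(x)| never exceeds T = 2**(n_bits - 1) * delta."""
    levels = 2 ** (n_bits - 1)
    k = np.clip(np.round(x / delta), -levels, levels)  # quantization index
    return delta * k

x = np.array([-3.0, -0.4, 0.1, 0.9, 3.0])
print(uniform_quantize(x, delta=0.25, n_bits=4))  # out-of-range values clip at T = 2.0
```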
Given a random variable X taken from a distribution f and a quantizer Q_Δ, we consider the expected mean-squared-error (MSE) as a local distortion measure we would like to minimize, that is,

    MSE(Δ) = E[ (X − Q_Δ(X))² ].    (2)


Assuming an optimal quantization step size Δ* and the corresponding optimal quantizer Q_Δ* for a given distribution f, we quantify the quantization sensitivity Ω as the increase in MSE(Δ) following a small change in the optimal quantization step size. Specifically, for a given small ε > 0 and a quantization step size around Δ* (i.e., Δ = Δ* ± ε), we measure the following difference:

    Ω(ε) = | MSE(Δ* ± ε) − MSE(Δ*) |.    (3)


The following Lemma will be useful to estimate the quantization sensitivity for various density distributions.

Lemma 1

Assuming a second order Taylor approximation, the quantization sensitivity satisfies the following equation:

    Ω(ε) = (1/2)·| MSE″(Δ*) |·ε².    (4)

Proof: Let Δ be a quantization step of size similar to Δ*, so that Δ = Δ* + ε. Using a second order Taylor expansion, we approximate MSE(Δ) around Δ* as follows:

    MSE(Δ) = MSE(Δ*) + MSE′(Δ*)·ε + (1/2)·MSE″(Δ*)·ε² + O(ε³).    (5)

Since Δ* is the optimal quantization step for MSE(Δ), we have that MSE′(Δ*) = 0. In addition, by ignoring order terms higher than two, we can re-write Equation (5) as follows:

    MSE(Δ) − MSE(Δ*) = (1/2)·MSE″(Δ*)·ε².    (6)

Equation (6) holds also with absolute values:

    Ω(ε) = | MSE(Δ) − MSE(Δ*) | = (1/2)·| MSE″(Δ*) |·ε².    (7)
In the next subsection, we use Lemma 1 to compare the quantization sensitivity of the normal distribution N(0, σ²) with that of the uniform distribution U[−a, a].

3.1 Robustness of Tensor Distributions

In this section, we consider different tensor distributions and their robustness to quantization. Specifically, we show that for a tensor with a uniform distribution, the variations in MSE in the region around Δ* are smaller compared with other typical distributions of weights and activations. This is captured through the following lemmas and theorem.

Lemma 2


Let X be a continuous random variable that is uniformly distributed in the interval [−a, a]. Assume that Q_Δ is a uniform M-bit quantizer with a quantization step Δ. Then, the expected MSE is given as follows:

    E[(X − Q_Δ(X))²] = (2^(M−1)·Δ³)/(12a) + (a − 2^(M−1)·Δ)³/(3a).    (8)

Proof: Given a finite quantization step size Δ and a finite number of quantization levels 2^M, the quantizer truncates input values larger than 2^(M−1)·Δ and smaller than −2^(M−1)·Δ. Hence, denoting by T this truncation threshold (i.e., T = 2^(M−1)·Δ), the quantizer can be modeled as follows:

    Q_Δ(x) = Δ·⌊x/Δ⌉ if |x| ≤ T, and Q_Δ(x) = T·sign(x) otherwise.

Therefore, by the law of total expectation, we know that

    E[(X − Q_Δ(X))²] = E[(X − Q_Δ(X))² | X > T]·P(X > T)
                     + E[(X − Q_Δ(X))² | |X| ≤ T]·P(|X| ≤ T)
                     + E[(X − Q_Δ(X))² | X < −T]·P(X < −T).    (9)

We now turn to evaluate the contribution of each term in Equation (9). We begin with the case of X > T, for which the probability density is uniform in the range [T, a] and zero for x > a. Hence, the conditional expectation is given as follows:

    E[(X − Q_Δ(X))² | X > T] = E[(X − T)² | X > T] = (a − T)²/3.    (10)

In addition, since X is uniformly distributed in the range [−a, a], a random sampling from the interval [T, a] happens with a probability

    P(X > T) = (a − T)/(2a).    (11)

Therefore, the first term in Equation (9) is stated as follows:

    E[(X − Q_Δ(X))² | X > T]·P(X > T) = (a − T)³/(6a).    (12)

Since the distribution of X is symmetrical around zero, the first and last terms in Equation (9) are equal, and their sum can be evaluated by multiplying Equation (12) by two.

We are left with the middle term of Equation (9), which considers the case of |X| ≤ T. Note that the quantizer rounds input values to the nearest discrete value that is a multiple of the quantization step Δ. Hence, the quantization error, X − Q_Δ(X), is uniformly distributed and bounded in the range [−Δ/2, Δ/2]. Hence, we get that

    E[(X − Q_Δ(X))² | |X| ≤ T] = Δ²/12.    (13)

Finally, we are left to estimate P(|X| ≤ T), which is exactly the probability of sampling a uniform random variable from a range of 2T out of a total range of 2a:

    P(|X| ≤ T) = T/a.    (14)

By summing all terms of Equation (9) and substituting T = 2^(M−1)·Δ, we achieve the following expression for the expected MSE:

    E[(X − Q_Δ(X))²] = (2^(M−1)·Δ³)/(12a) + (a − 2^(M−1)·Δ)³/(3a).    (15)
In Fig. 1(b) we depict the MSE as a function of the quantization step size Δ for 4-bit uniform quantization. We show a good agreement between Equation (15) and synthetic simulations measuring the MSE.

As defined in Eq. (3), we quantify the quantization sensitivity as the increase in MSE in the surroundings of the optimal quantization step Δ*. In Lemma 3 we find Δ* for a random variable that is uniformly distributed.
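Lemma 2 and Lemma 3 can also be checked numerically. The sketch below, assuming the rounding-plus-clipping quantizer model used throughout this section, compares the empirical MSE of a quantized uniform tensor with the closed form of Equation (15), and verifies that the empirical minimizer lands near Δ* = 2a/(2^M + 1):

```python
import numpy as np

rng = np.random.default_rng(0)
a, M = 1.0, 4
x = rng.uniform(-a, a, size=100_000)

def mse_empirical(delta):
    T = 2 ** (M - 1) * delta
    q = np.clip(delta * np.round(x / delta), -T, T)
    return np.mean((x - q) ** 2)

def mse_analytic(delta):
    # Lemma 2: rounding term plus the two symmetric clipping terms
    T = 2 ** (M - 1) * delta
    return (T / a) * delta ** 2 / 12 + (a - T) ** 3 / (3 * a)

# The closed form assumes T <= a, i.e. delta <= a / 2**(M - 1)
deltas = np.linspace(0.05, a / 2 ** (M - 1), 100)
emp = np.array([mse_empirical(d) for d in deltas])
ana = np.array([mse_analytic(d) for d in deltas])

d_star = 2 * a / (2 ** M + 1)          # Lemma 3: optimal step size
print(deltas[np.argmin(emp)], d_star)  # empirical minimizer vs. 2a/(2^M + 1)
```

On this grid the empirical and analytic curves agree to within sampling noise, and the empirical minimizer falls within a grid step or two of 2a/(2^M + 1).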

(a) Optimal quantization of normally distributed tensors. The first order gradient zeroes at a region with a relatively high-intensity second order gradient, i.e., a region with a high quantization sensitivity Ω. This sensitivity zeroes only asymptotically, when Δ (and the MSE) tends to infinity. This means that optimal quantization is highly sensitive to changes in the quantization process.
(b) Optimal quantization of uniformly distributed tensors. First and second order gradients zero at a similar point, indicating that the optimum is attained at a region where the quantization sensitivity tends to zero. This means that optimal quantization is tolerant and can bear changes in the quantization process without significantly increasing the MSE.
Figure 1: Quantization needs to be modeled to take into account uncertainty about the precise way it is being done. The quantization that minimizes the MSE is also the most robust one with uniformly distributed tensors (sub-figure (b)), but not with normally distributed tensors (sub-figure (a)). (i) Simulation: 10,000 values are generated from a uniform/normal distribution and quantized using different quantization step sizes Δ. (ii) MSE(Δ): analytical results, stated by Lemma 2 for the uniform case and developed by Banner et al. (2019) for the normal case; note these are in good agreement with the simulations. (iii) MSE′(Δ): the first order derivative zeroes at the region where the MSE is minimized. (iv) MSE″(Δ): the second order derivative zeroes in the region with maximum robustness.
Lemma 3

Let X be a continuous random variable that is uniformly distributed in the interval [−a, a]. Given an M-bit quantizer Q_Δ, the expected MSE is minimized by selecting the following quantization step size:

    Δ* = 2a/(2^M + 1).    (16)

Proof: We calculate the roots of the first order derivative of Equation (15) with respect to Δ as follows:

    d E[(X − Q_Δ(X))²]/dΔ = (2^(M−1)·Δ²)/(4a) − (2^(M−1)/a)·(a − 2^(M−1)·Δ)² = 0.    (17)

Solving Equation (17) yields the following solution:

    Δ* = 2a/(2^M + 1).    (18)
We can finally provide the main result of this paper, stating that the uniform distribution is more robust to modifications in the quantization process than the typical distributions of weights and activations, which tend to be normal.

Theorem 4

Let X_u and X_n be continuous random variables with uniform and normal distributions, respectively. Then, for any given ε > 0, the quantization sensitivity Ω satisfies the following inequality:

    Ω(X_u) < Ω(X_n),    (19)

i.e., compared to the typical normal distribution, the uniform distribution is more robust to changes in the quantization step size Δ*.

Proof: In the following, we use Lemma 1 to calculate the quantization sensitivity of each distribution. We begin with the uniform case, for which we have proven in Lemma 2 that

    E[(X_u − Q_Δ(X_u))²] = (2^(M−1)·Δ³)/(12a) + (a − 2^(M−1)·Δ)³/(3a).

Hence, since we have shown in Lemma 3 that the optimal step size for X_u is Δ* = 2a/(2^M + 1), we get from Lemma 1 that

    Ω(X_u) = (1/2)·| MSE″(Δ*) |·ε².    (20)
We now turn to find the sensitivity of the normal distribution N(0, σ²). According to Banner et al. (2019), the expected MSE for the quantization of a Gaussian random variable is as follows:

    E[(X_n − Q_Δ(X_n))²] ≈ (T² + σ²)·[1 − erf(T/(√2·σ))] − √(2/π)·T·σ·e^(−T²/(2σ²)) + Δ²/12,    (21)

where T = 2^(M−1)·Δ denotes the clipping threshold.

To obtain the quantization sensitivity, we first calculate the second derivative of Equation (21) with respect to Δ. The derivative consists of three terms: the first is positive and bounded, the second is negative over the relevant range, and the third is a positive constant. Their sum is bounded away from zero at the optimum, and hence, by Lemma 1, the quantization sensitivity for the normal distribution satisfies

    Ω(X_n) = (1/2)·| MSE″(Δ*) |·ε² ≥ C·ε²

for a positive constant C.
This establishes the theorem, since we have that Ω(X_u) < Ω(X_n).

3.2 When Robustness and Optimality Meet

We have developed in the previous section the MSE as a function of the quantization step size Δ for the uniform case. We have shown that the optimal quantization step size is Δ* = 2a/(2^M + 1) ≈ 2a/2^M. The second order derivative of Equation (15) is linear in Δ and zeroes at approximately the same location:

    d² E[(X − Q_Δ(X))²]/dΔ² = (2^(M−1)·Δ)/(2a) + (2^(2M−1)/a)·(a − 2^(M−1)·Δ) = 0  ⇒  Δ = 2a/(2^M − 2^(−M)) ≈ 2a/2^M.


Therefore, for the uniform case, the optimal quantization step size in terms of MSE is generally also the one that minimizes the sensitivity Ω, as illustrated by Fig. 1.

Finally, Fig. 2 presents the minimum MSE distortions for different bit-widths when normal and uniform distributions are optimally quantized. These optimal MSE values constitute the optimal solutions of Eq. (15) and Eq. (21), respectively. Note that the optimal quantization of uniformly distributed tensors is superior, in terms of MSE, to that of normally distributed tensors at all bit-width representations.

Figure 2: Minimum expected MSE as a function of bit-width M for the uniform and normal distributions. The optimal MSE of the uniform distribution is significantly smaller than that of the normal distribution.

In this section, we proved that the uniform distribution is more robust to perturbations in quantization parameters than the normal distribution. The robustness of the uniform distribution over the Laplace distribution, for example, can be justified similarly. In the next section, we show how DNN tensor distributions can be manipulated to form different distributions, and in particular the uniform distribution.
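The comparison behind Fig. 2 is easy to reproduce empirically: for tensors with equal variance, the best achievable quantization MSE of a uniform tensor is markedly lower than that of a normal tensor. A small sketch under the same rounding-plus-clipping quantizer model:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100_000, 4
x_uni = rng.uniform(-1.0, 1.0, size=N)               # variance 1/3
x_nrm = rng.normal(0.0, np.sqrt(1.0 / 3.0), size=N)  # matched variance

def quantize(x, delta):
    T = 2 ** (M - 1) * delta
    return np.clip(delta * np.round(x / delta), -T, T)

def best_mse(x):
    """Minimum empirical MSE over a sweep of quantization step sizes."""
    deltas = np.linspace(0.01, 0.5, 150)
    return min(np.mean((x - quantize(x, d)) ** 2) for d in deltas)

print(best_mse(x_uni), best_mse(x_nrm))  # uniform attains the lower optimum
```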

4 Kurtosis Regularization (KURE)

DNN parameters usually follow Gaussian or Laplace distributions (Banner et al., 2019). However, we would like to obtain the robust qualities that the uniform distribution introduces (Section 3). Yu et al. (2019) suggested reshaping the model distributions by imposing clipping during training. In this work, we use kurtosis, the fourth standardized moment, as a proxy for the probability distribution.

4.1 Kurtosis — The Fourth Standardized Moment

The kurtosis of a random variable X is defined as follows:

    Kurt[X] = E[ ((X − μ)/σ)⁴ ],

where μ and σ are the mean and standard deviation of X, respectively. The kurtosis is the fourth central moment normalized by σ⁴. It provides a scale- and shift-invariant measure that captures the shape of the probability distribution of X, rather than its mean or variance. If X is uniformly distributed, its kurtosis value is 1.8, whereas if X is normally or Laplace distributed, its kurtosis value is 3 or 6, respectively (DeCarlo, 1997). We define the "kurtosis target", K_T, as the kurtosis value we want the tensor to adopt. In our case, the kurtosis target is 1.8 (uniform distribution). In the following section, we describe how we manipulate the model kurtosis.
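These reference kurtosis values are easy to verify empirically with a sketch like the following (sample estimates, so the printed values are only approximate):

```python
import numpy as np

def kurtosis(x):
    """Fourth standardized moment: E[((x - mu) / sigma)**4]."""
    mu, sigma = x.mean(), x.std()
    return np.mean(((x - mu) / sigma) ** 4)

rng = np.random.default_rng(0)
n = 1_000_000
k_uni = kurtosis(rng.uniform(-1, 1, n))
k_nrm = kurtosis(rng.normal(0, 1, n))
k_lap = kurtosis(rng.laplace(0, 1, n))
print(k_uni, k_nrm, k_lap)  # roughly 1.8, 3.0, 6.0
```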

4.2 Kurtosis Loss

To control the model weights distributions, we introduce kurtosis regularization (KURE). KURE enables us to control the tensor distribution during training while maintaining the original model accuracy in full precision.

KURE is applied to the model loss function, L, as follows:

    L = L_p + λ·L_K,

where L_p is the target loss function, L_K is the KURE term, and λ is the KURE coefficient. L_K is defined as

    L_K = (1/N) · Σ_{i=1}^{N} ( Kurt[W_i] − K_T )²,

where N is the number of layers and K_T is the target for kurtosis regularization.
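A minimal numpy sketch of this regularization term (the helper names and the toy Gaussian-initialized "layers" are illustrative, not the authors' implementation):

```python
import numpy as np

def kurtosis(w):
    mu, sigma = w.mean(), w.std()
    return np.mean(((w - mu) / sigma) ** 4)

def kure_loss(weights, k_target=1.8):
    """KURE term: mean squared distance between each layer's
    weight kurtosis and the kurtosis target K_T."""
    return np.mean([(kurtosis(w) - k_target) ** 2 for w in weights])

# The total loss would be  L = L_p + lambda_k * kure_loss(weights);
# here we only evaluate the regularizer on toy Gaussian "layers",
# whose kurtosis is near 3, so the term is roughly (3 - 1.8)**2.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)), rng.normal(size=(128, 64))]
print(kure_loss(layers))
```

In a training framework this term would be computed on the live weight tensors each step and minimized jointly with the task loss.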

Figure 3: The effect of KURE on the weight distribution of one layer in ResNet-18 (top) and ResNet-18 accuracy in a PTQ setting (bottom). Uniformization of the model (K_T = 1.8) provides higher accuracy.

To demonstrate the impact of KURE, Fig. 3 (top) presents the weight distributions of one layer from ResNet-18 trained with different kurtosis targets K_T. We also examine the effect of different K_T values on the accuracy and robustness of an entire ResNet-18 model. Considering model robustness for different bit-widths, maximum robustness is observed for K_T = 1.8, as expected according to Fig. 2. Furthermore, in Fig. 4 we demonstrate the sensitivity of one ResNet-18 layer to the step size Δ. It is clear that a model trained with KURE achieves superior robustness qualities compared to its non-KURE counterpart (Theorem 4).

While KURE may be used to achieve different tensor distributions, we use it to achieve uniform-like distributions. We further explore the performance of these models next.

Figure 4: Sensitivity of the weights MSE in one layer of ResNet-18 as a function of the deviation of the quantization step size Δ from the optimal quantization step size Δ*. Kurtosis regularization improves the robustness.

5 Experiments

In this section, we evaluate KURE robustness. Specifically, we focus on the robustness to bit-width changes and to perturbations in quantization step size. We conduct our experiments with Distiller (Zmora et al., 2019), using ImageNet data-set (Deng et al., 2009) on CNN architectures for image classification (ResNet-18/50 (He et al., 2015) and MobileNet-V2 (Sandler et al., 2018)).

Our method can be applied to both PTQ and QAT schemes. Not only do we observe improved robustness to bit-width changes in both schemes, but also an accuracy increase when KURE is combined with PTQ. A detailed explanation of the implementation parameters appears in the supplementary material.

5.1 Post-Training Quantization (PTQ)

We apply KURE on a pre-trained model and fine-tune it. Each model is trained until the kurtosis level converges to its target, and until no significant loss in accuracy is observed.

By doing so, we obtain a quantization-robust model in full precision, which may then be used with different quantization algorithms and bit-widths. After finishing the training process, we quantize our robust model using LAPQ (Nahshan et al., 2019). We quantize all layers except the first and last, which is standard practice for quantized NN models. We evaluate the performance of our model trained with KURE against LAPQ, DUAL (Choukroun et al., 2019), and ACIQ (Banner et al., 2019), with ResNet-50, ResNet-18, and MobileNet-V2, as presented in Table 1. We observe that a model trained with KURE results in better accuracy in most cases, especially at the lower bit-widths. For example, with ResNet-50 we apply the LAPQ quantization method to two pre-trained models, the only difference being that one was trained without KURE and the other with KURE. For 3-bit weights and activations, the model accuracy improves from 38.4% to 66.5%.

As mentioned, we apply KURE to a trained model and fine-tune it; it is important to note, however, that KURE can also be added to the training process when training a model from scratch, as opposed to QAT methods, which show good results only when fine-tuning a pre-trained model.

Model          W / A         Method         Accuracy (%)
--------------------------------------------------------
ResNet-50      FP32 / FP32   -              76.1
               8 / 4         KURE (Ours)    75.1
               8 / 4         LAPQ           74.8
               8 / 4         DUAL           73.25
               8 / 4         ACIQ           68.92
               4 / FP32      KURE (Ours)    75.6
               4 / FP32      LAPQ           71.8
               4 / FP32      DUAL           70.06
               6 / 6         KURE (Ours)    76.2
               6 / 6         LAPQ           74.8
               5 / 5         KURE (Ours)    75.8
               5 / 5         LAPQ           72.9
               4 / 4         KURE (Ours)    74.3
               4 / 4         DUAL           72.6
               4 / 4         LAPQ           70.0
               3 / 3         KURE (Ours)    66.5
               3 / 3         LAPQ           38.4
ResNet-18      FP32 / FP32   -              69.7
               8 / 4         KURE (Ours)    69.0
               8 / 4         LAPQ           68.8
               8 / 4         DUAL           68.38
               8 / 4         ACIQ           65.52
               4 / FP32      KURE (Ours)    68.3
               4 / FP32      LAPQ           62.6
               4 / FP32      DUAL           68.8
               5 / 5         KURE (Ours)    69.7
               5 / 5         LAPQ           65.4
               5 / 4         KURE (Ours)    68.8
               5 / 4         LAPQ           64.9
               4 / 4         KURE (Ours)    66.9
               4 / 4         DUAL           67.4
               4 / 4         LAPQ           59.8
               3 / 3         KURE (Ours)    57.3
               3 / 3         LAPQ           44.3
MobileNet-V2   FP32 / FP32   -              71.8
               8 / 8         KURE (Ours)    71.1
               8 / 8         LAPQ           71.4
               6 / 6         KURE (Ours)    70.0
               6 / 6         LAPQ           69.7
               5 / 5         KURE (Ours)    66.9
               5 / 5         LAPQ           64.6
               4 / 4         KURE (Ours)    59.0
               4 / 4         LAPQ           48.1
               3 / 3         KURE (Ours)    24.4
               3 / 3         LAPQ           3.7
Table 1: Post-training quantization of different models using our method (KURE) compared against the post-training quantization methods LAPQ, DUAL, and ACIQ. For KURE, we use the proposed regularization until achieving FP32 results and then use LAPQ quantization on the new model.
(a) ResNet-18 - DoReFa
(b) ResNet-18 - LSQ
(c) MobileNet-V2 - DoReFa
(d) ResNet-50 - LSQ
Figure 5: Bit-width robustness comparison between a classic QAT model and a QAT-with-KURE model (a robust QAT model) for different ImageNet architectures and different QAT methods (DoReFa and LSQ). The x marker denotes the original operation point to which the QAT model was trained, whereas the other points describe the accuracy of the model when changing the weight bit-width. In sub-figure (a) the quantization is applied to activations and weights, while in the rest it is applied only to the weights, which are more sensitive to quantization.

5.2 Quantization-Aware Training (QAT)

Recall that KURE may be used with any QAT method to improve the model robustness to different quantizers that may be implemented in different accelerators, for example. We show results on two QAT methods: LSQ (Esser et al., 2019) and DoReFa (Zhou et al., 2016). To demonstrate the improvement in robustness with KURE, we show robustness to bit-width and robustness to perturbations in quantization step size parameter in QAT methods.

Bit-width comparison. In Fig. 5 we compare the proposed method on top of two known QAT methods, for the ResNet-18, ResNet-50 and MobileNet-V2 architectures. To emphasize the robustness of our method, we conducted the following experiment. We trained one model with QAT and a second model with QAT combined with KURE to the same quantization operation point (W4A4 in Fig. 5(a)); then we changed the weight bit-width and measured the accuracy. The proposed method achieves competitive results compared to QAT alone at the operation point to which the networks were trained; however, when applying different bit-widths, the accuracy of the suggested method stays stable, while the plain QAT methods fail to cope with bit-widths different from the one they were trained for.

Quantization step size perturbation. In Fig. 6 we show the effect of a minor perturbation of the quantization step size parameter on a QAT model and on a QAT model trained with KURE. Such perturbations are common when running quantization on different hardware platforms. For example, many hardware implementations use approximations of the quantization step size and offset parameters in their quantization process; an example of such approximations is described in (Jacob, 2017). The robustness to this kind of perturbation gained by using KURE is remarkable. For example, in the DoReFa model, changing the quantization step size parameter by only 2% causes a dramatic drop in accuracy, from 68.3% to less than 10%, while in the DoReFa-with-KURE model the accuracy stays high even after changing the quantization step size by 30%.

(a) ResNet-18 - DoReFa
(b) MobileNet-V2 - DoReFa
Figure 6: The network has been trained for a quantization step size Δ*; still, the quantizer uses a slightly different step size Δ. Small changes from the optimal step size cause severe accuracy degradation in the quantized model. KURE significantly enhances the model robustness by promoting solutions that are more robust to uncertainties in the quantizer design (ResNet-18 and MobileNet-V2 on ImageNet).

6 Conclusions

In this work, we emphasize the importance of model robustness to different quantization policies and to perturbations in quantization parameters. We show that today's quantized models are sensitive to these perturbations; therefore, it is difficult to deploy them on a wide range of inference accelerators. We prove that tensors with a uniform distribution are less sensitive to perturbations in quantization parameters. By adding a kurtosis regularization term to the training phase, we change the distribution of the weights to be uniform-like, thus improving the robustness of NNs in PTQ and QAT schemes. For QAT models, our proposed method allows us to quantize a model with one quantization policy during the training phase and run it at inference time on a different platform with a different quantization policy, without a significant loss in accuracy. In addition, KURE allows us to produce full-precision models that are robust to quantization. By quantizing these robust models with PTQ methods, we see significant improvements in accuracy.

This work focuses on the weights but can be applied to activations as well. In addition, it can be extended to other NN domains, such as recommendation systems and NLP models. The concept of manipulating the model distributions with kurtosis regularization may also be used when the target distribution is known.


  • R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry (2019) Post-training 4-bit quantization of convolution networks for rapid-deployment. Conference on Neural Information Processing Systems (NeurIPS). Cited by: §1, §2, Figure 1, §3.1, §4, §5.1.
  • C. Baskin, N. Liss, Y. Chai, E. Zheltonozhskii, E. Schwartz, R. Girayes, A. Mendelson, and A. M. Bronstein (2018) Nice: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162. Cited by: §1, §2.
  • Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. Cited by: §2.
  • J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §2.
  • Y. Choukroun, E. Kravchik, and P. Kisilev (2019) Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822. Cited by: §1, §2, §5.1.
  • L. T. DeCarlo (1997) On the meaning and use of kurtosis. Psychological Methods 2 (3), pp. 292–307. Cited by: §4.1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §5.
  • A. T. Elthakeb, P. Pilligundla, and H. Esmaeilzadeh (2019) SinReQ: generalized sinusoidal regularization for automatic low-bitwidth deep quantized training. arXiv preprint arXiv:1905.01416. Cited by: §2.
  • S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. External Links: Link Cited by: §1, §2, §5.2.
  • A. Finkelstein, U. Almog, and M. Grobman (2019) Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193. Cited by: §1, §2.
  • J. Gong, H. Shen, G. Zhang, X. Liu, S. Li, G. Jin, N. Maheshwari, E. Fomenko, and E. Segal (2018) Highly efficient 8-bit low precision inference of convolutional neural networks with IntelCaffe. In Proceedings of Reproducible Quality-Efficient Systems Tournament on Co-designing Pareto-efficient Deep Learning (ReQuEST). Cited by: §1, §2.
  • R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. arXiv preprint arXiv:1908.05033. External Links: Link Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §1, §5.
  • B. Jacob (2017) Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877. External Links: Link Cited by: §5.2.
  • R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. External Links: 1806.08342 Cited by: §1.
  • J. H. Lee, S. Ha, S. Choi, W. Lee, and S. Lee (2018) Quantization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488. External Links: Link Cited by: §2.
  • S. Migacz (2017) 8-bit inference with TensorRT. Note: NVIDIA GPU Technology Conference Cited by: §1, §2.
  • Y. Nahshan, B. Chmiel, C. Baskin, E. Zheltonozhskii, R. Banner, A. M. Bronstein, and A. Mendelson (2019) Loss aware post-training quantization. External Links: 1911.07190 Cited by: §2, §5.1.
  • P. Nayak, D. Zhang, and S. Chai (2019) Bit efficient quantization for deep neural networks. External Links: 1910.04877 Cited by: §2.
  • V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. St. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou (2019) MLPerf inference benchmark. External Links: 1911.02549 Cited by: §1.
  • M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. CoRR abs/1801.04381. External Links: Link, 1801.04381 Cited by: §5.
  • J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X. Hua (2019) Quantization networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §2.
  • H. Yu, T. Wen, G. Cheng, J. Sun, Q. Han, and J. Shi (2019) GDRQ: group-based distribution reshaping for quantization. ArXiv abs/1908.01477. Cited by: §2, §4.
  • D. Zhang, J. Yang, D. Ye, and G. Hua (2018) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In The European Conference on Computer Vision (ECCV), External Links: Link Cited by: §1, §2.
  • R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang (2019) Improving neural network quantization using outlier channel splitting. arXiv preprint arXiv:1901.09504. External Links: Link Cited by: §1, §2.
  • S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. ArXiv abs/1606.06160. Cited by: §1, §1, §2, §5.2.
  • N. Zmora, G. Jacob, L. Zlotnik, B. Elharar, and G. Novik (2019) Neural network distiller: a Python package for DNN compression research. arXiv preprint arXiv:1910.12232. External Links: Link Cited by: §5.

7 Supplementary material

7.1 Hyper-parameters to reproduce the results in Section 5 (Experiments)

In the following, we describe the hyper-parameters used in the experiments section. Fully reproducible code accompanies the paper.

7.1.1 Hyper-parameters for Section 5.1 - Post-Training Quantization (PTQ) results

In Table 2 we describe the hyper-parameters used in Section 5.1. We apply KURE to a pre-trained model from the torchvision repository and fine-tune it with the following hyper-parameters. All other hyper-parameters, such as momentum and weight decay, remain the same as in the pre-trained model.

architecture | kurtosis target | KURE coefficient | initial lr | lr schedule | batch size | epochs | fp32 accuracy
ResNet-18 | 1.8 | 1.0 | 0.001 | decays by a factor of 10 every 30 epochs | 256 | 83 | 70.3
ResNet-50 | 1.8 | 1.0 | 0.001 | decays by a factor of 10 every 30 epochs | 128 | 49 | 76.4
MobileNet-V2 | 1.8 | 1.0 | 0.001 | decays by a factor of 10 every 30 epochs | 256 | 83 | 71.3
Table 2: Hyper-parameters for experiments in Section 5.1 - Post-Training Quantization
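The KURE fine-tuning objective behind these hyper-parameters can be sketched as a kurtosis penalty added to the task loss. The snippet below is a minimal NumPy illustration, not the paper's implementation; the function names are ours, while the kurtosis target 1.8 and coefficient 1.0 come from Table 2 (a uniform distribution has kurtosis 1.8, which is why that target promotes quantization-friendly, uniform-like weights).

```python
import numpy as np

def kurtosis(w, eps=1e-8):
    """Sample kurtosis E[((w - mu) / sigma)^4] of a flattened weight tensor."""
    w = np.asarray(w).ravel()
    mu, sigma = w.mean(), w.std()
    return float(np.mean(((w - mu) / (sigma + eps)) ** 4))

def kure_penalty(layer_weights, k_target=1.8):
    """KURE term: average squared distance of each layer's
    kurtosis from the target (1.8 in Tables 2 and 3)."""
    return float(np.mean([(kurtosis(w) - k_target) ** 2 for w in layer_weights]))

def total_loss(task_loss, layer_weights, lam=1.0, k_target=1.8):
    """Regularized objective: task loss plus the weighted KURE term
    (lam plays the role of the KURE coefficient, 1.0 in the tables)."""
    return task_loss + lam * kure_penalty(layer_weights, k_target)
```

Uniformly distributed weights have kurtosis close to 1.8 and so incur almost no penalty, while Gaussian weights (kurtosis near 3) are pushed toward a more uniform, quantization-robust shape.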

7.1.2 Hyper-parameters for Section 5.2 - Quantization-Aware Training (QAT) results

We combine KURE with a QAT method during the training phase. Table 3 lists the hyper-parameters we used.

architecture | QAT method | quantization settings (W/A) | kurtosis target | KURE coefficient | initial lr | lr schedule | batch size | epochs | accuracy
ResNet-18 | DoReFa | 4 / 4 | 1.8 | 1.0 | 1e-4 | decays by a factor of 10 every 30 epochs | 256 | 80 | 68.3
ResNet-18 | LSQ | 4 / FP32 | 1.8 | 1.0 | 1e-4 | decays by a factor of 10 every 20 epochs | 128 | 70 | 68.70
ResNet-50 | LSQ | 4 / FP32 | 1.8 | 1.0 | 1e-4 | decays by a factor of 10 every 30 epochs | 64 | 18 | 75.7
MobileNet-V2 | DoReFa | 4 / FP32 | 1.8 | 1.0 | 5e-5 | lr decay rate of 0.98 per epoch | 128 | 10 | 66.9
Table 3: Hyper-parameters for experiments in Section 5.2 - Quantization-Aware Training
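The two learning-rate policies named in the tables are simple closed-form schedules. The sketch below reproduces them; the function names are ours, not from the paper's code.

```python
def step_decay(lr0, epoch, step=30, factor=0.1):
    """'Decays by a factor of 10 every `step` epochs' (ResNet rows)."""
    return lr0 * factor ** (epoch // step)

def exp_decay(lr0, epoch, rate=0.98):
    """'lr decay rate of 0.98 per epoch' (MobileNet-V2 / DoReFa row)."""
    return lr0 * rate ** epoch
```

In PyTorch these correspond to `StepLR(optimizer, step_size=30, gamma=0.1)` and `ExponentialLR(optimizer, gamma=0.98)`, respectively.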

7.2 Mean and standard deviation over multiple runs of ResNet-18 trained with DoReFa and KURE

architecture | QAT method | quantization settings (W/A) | runs | accuracy, % (mean ± std)
ResNet-18 | DoReFa | 4 / 4 | 3 | ()
Table 4: Mean and standard deviation over multiple runs of ResNet-18 trained with DoReFa and KURE
Figure 7: The network has been trained for a given quantization step size, yet the quantizer uses a slightly different step size. Small changes in the optimal step size cause severe accuracy degradation in the quantized model. KURE significantly enhances the model's robustness by promoting solutions that are less sensitive to uncertainties in the quantizer design (ResNet-18 on ImageNet).
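The step-size perturbation in Figure 7 can be mimicked with a symmetric uniform quantizer whose step size drifts at deployment time. The sketch below uses hypothetical values (a Gaussian stand-in for a weight tensor, a training step size of 0.1), not the paper's actual setup, to show how the quantization error grows with the drift.

```python
import numpy as np

def uniform_quantize(x, delta):
    """Symmetric uniform (mid-tread) quantizer with step size delta."""
    return delta * np.round(x / delta)

def quant_mse(x, delta):
    """Mean squared quantization error for a given step size."""
    return float(np.mean((uniform_quantize(x, delta) - x) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 100000)   # stand-in for a trained weight tensor
delta_train = 0.1                  # step size assumed during training
# Error under step sizes drifted by 0%, 5%, and 10% from delta_train
errors = {eps: quant_mse(w, delta_train * (1 + eps)) for eps in (0.0, 0.05, 0.10)}
```

For a fine step size the error is roughly delta^2 / 12, so a 10% drift in the step size raises the error by about 21%; KURE-trained models are meant to keep accuracy flat under exactly this kind of drift.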