1 Introduction
Quantization of the weights and activations of deep models has been a promising approach to reducing model complexity, alongside other techniques such as pruning lecun1990optimal and distillation hinton2015distilling. Previous studies, on both weight-only quantization and weight-activation quantization, have achieved meaningful progress, mainly on the classification task. In particular, scalar (INT8) quantization provides practically applicable performance with enhanced latency, thanks to hardware support.
Network quantization aims to approximate the floating-point (FLOAT32) computation of full-precision (FP) models using fixed-point computation in lower bits (INT8), and has therefore targeted transfer learning from the pretrained network to its quantized counterpart, so-called post-quantization vanhoucke2011improving; stock2019and. However, the approximation errors accumulate through the computations of the forward pass and bring noticeable performance degradation. Especially for lightweight models mehta2018espnet; mehta2018espnetv2; mobilenetv2; DBLP:journals/corr/abs190502244, which have less representational capacity than baseline architectures NIPS2012_4824; Simonyan15; he2016deep; googlenet, the initial statistics error caused by quantization makes it challenging to directly use the pretrained FP model weights fan2020training.

A promising approach to this problem is to imitate network quantization during training. Quantization-aware training (QAT) jacob2018quantization
simulates the quantized inference during the forward pass and uses the straight-through estimator (STE) DBLP:journals/corr/BengioLC13 to compute the gradients for backpropagation. While QAT ameliorates the quantization error by reducing the differences in the ranges of weights and the number of outlier weight values jacob2018quantization, it still cannot overcome the gradient approximation error caused by the STE.

[Figure 1 caption: while the conventional QAT model suffers from the vanishing gradient problem grad_vanishing due to the gradient approximation error caused by the straight-through estimator (STE) DBLP:journals/corr/BengioLC13, the StatAssist + GradBoost QAT model (right) shows a distribution of gradients similar to that of the FP-trained counterpart (left) and thus provides comparable performance.]

Previous approaches suggest postponing activation quantization in the early stage of training jacob2018quantization or using pretrained FP model weights with a few steps of calibration DBLP:journals/corr/abs180608342 to reduce the errors caused by inaccurate initial quantization statistics. While these methods work effectively when the error resulting from the STE is small, they lead to an unexpected vanishing gradient problem grad_vanishing when the error is large, as in figure 1(b)
. Several existing methods suggest special workarounds such as freezing the batch normalization ioffe2015batch statistics DBLP:journals/corr/abs180608342; Li2019quantizedObjectDet, percentile-based activation clamping Li2019quantizedObjectDet, and LayerDrop fan2020training, which add extra restrictions to the model's architecture and training scheme.

In this work, we propose intensive and flexible strategies that enable QAT of a deep model from scratch, for better quantization performance at reduced training cost. Our proposal tackles two common factors that lead to the failure of QAT from scratch: 1) the gradient approximation error and 2) the gradient instability arising from inaccurate initial quantization statistics. We show that assisting the optimizer momentum with the initial statistics of the FP model and boosting the optimizer gradients with noise in the early stage of training can stabilize the whole training process without harming model performance. Notably, our proposed FP-statistic assisting (StatAssist) and stochastic-gradient boosting (GradBoost) QAT can be applied to the training schemes of diverse existing lightweight models, covering object detection, segmentation, and style transfer at significantly reduced training cost, along with classification, which has been the main target of previous quantization methods. We demonstrate the effectiveness of StatAssist + GradBoost during the backpropagation of QAT in figure 1 with histograms of gradients.
Specifically, our main contributions for the efficient QAT from scratch are as follows:

We introduce the StatAssist and GradBoost QAT methods, which make the optimizer robust to the gradient approximation error caused by the STE during the backpropagation of QAT.

Applying StatAssist and GradBoost QAT to the conventional training scheme of a deep model is straightforward and cost-efficient. With extensive experiments, we show that our method leads to successful QAT on various tasks, including classification he2016deep; mobilenetv2; Ma_2018_ECCV, object detection mobilenetv2; tdsod, semantic segmentation mehta2018espnet; mehta2018espnetv2; mobilenetv2; DBLP:journals/corr/abs190502244, and style transfer pix2pix.

By combining layer fusion jacob2018quantization and INT8 quantization to compress networks trained with our method, we obtain various lightweight models with fixed-point computation that maintain comparable, or achieve better, precision than their floating-point versions.
The main motivation of this paper is to improve the current quantization-aware training scheme so that a quantized model becomes competitive with its FP counterpart, widely considered an upper bound. Section 2 reviews prior work on quantizing a model for faster inference and smaller size. Section 3 describes the StatAssist and GradBoost algorithms for QAT from scratch without quantized performance degradation. Section 4 discusses related work on model compression from different perspectives. Section 5 presents experiments on a variety of vision tasks and applications. Section 6 concludes the paper.
2 Quantization-aware Training
2.1 Network Quantization
Network quantization requires approximating the weight parameters $w$ and activation $a$ of the network to $\hat{w}$ and $\hat{a}$, where $w, a \in \mathbb{R}$ and $\hat{w}, \hat{a} \in \mathbb{Q}$; here $\mathbb{R}$ denotes the space represented by 32-bit (FLOAT32) precision and $\mathbb{Q}$ the space represented by lower-bit (INT8) precision. Following Jacob et al. jacob2018quantization, the process of approximating an original value $x$ by $\hat{x}$ can be represented as:

(1)  $\hat{x} = Q^{-1}(Q(x; \theta_x); \theta_x), \quad Q(x; \theta_x) \in \mathbb{Q}$

Here, the approximation function $Q$ and its inverse function $Q^{-1}$ are defined by the same parameters $\theta_x$. This means that if we store the quantization statistics $\theta_x$, including the minimum, maximum, and zero point of $x$, we can convert $x$ to $Q(x; \theta_x)$ and revert it to $\hat{x}$. Now assume the vector multiplication $z = x * y$ of vectors $x$ and $y$, where both vectors lie in $\mathbb{R}$. Then, also from Jacob et al. jacob2018quantization, the resultant value is formulated as:

(2)  $\hat{z} = Q^{-1}(F(Q(x; \theta_x), Q(y; \theta_y)); \theta_z)$

where the function $F$ only involves lower-bit calculation. The equation shows that we can replace the original floating-point operation with a lower-bit operation, because the statistics $\theta_z$ can be derived from $\theta_x$ and $\theta_y$. Ideally, we can expect the faster lower-bit operation and hence faster computation overall.
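As a concrete illustration of equation 1, the affine quantize/dequantize round-trip can be sketched in a few lines of NumPy. The helper names and the exact rounding convention below are our own choices for illustration, not the paper's implementation:

```python
import numpy as np

def quant_params(x_min, x_max, n_bits=8):
    # Affine quantization statistics: a scale and zero point mapping the
    # observed range [x_min, x_max] onto the integer grid [0, 2^n - 1].
    qmax = 2 ** n_bits - 1
    scale = (x_max - x_min) / qmax
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    # Q(x; theta): round onto the integer grid and clip to the representable range.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Q^{-1}(q; theta): map the integers back to (approximate) real values.
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, -0.25, 0.0, 0.5, 1.0], dtype=np.float32)
s, z = quant_params(float(x.min()), float(x.max()))
x_hat = dequantize(quantize(x, s, z), s, z)
# The round-trip error |x - x_hat| is bounded by roughly one quantization step s.
```

The key point is that only the statistics (scale, zero point) need to be stored alongside the INT8 values to recover an approximation of the original tensor.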
2.2 Static Quantization
Static quantization
Despite the theoretical speed enhancement, actually achieving it through network quantization is not straightforward. One main reason is the quantization of the activation $a$. At inference time, we can easily obtain the quantization statistics $\theta_w$ of the weights, since the value of $w$ is fixed. In contrast, the quantization statistics of the activation constantly change with the input. Since replacing a floating-point operation with its lower-bit counterpart always requires $\theta_a$, a special workaround is essential for activation quantization.
Instead of calculating the quantization statistics dynamically, one solution is to approximate $\theta_a$ with statistics pre-calculated from a number of samples; this is called static quantization. In the static quantization process, the approximation function uses the pre-calculated statistics $\theta'_a$, obtained from a set of sample activations. Since the statistics are fixed, there may exist a sample value that falls outside the pre-calculated range, and we truncate such a value to the bound. Static quantization, including the calibration of the quantization statistics, is also called post quantization jacob2018quantization. Post quantization enables an actual speedup of the operation, but it also brings performance degradation.
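The calibration step described above can be sketched as follows; this is a minimal min/max calibration in NumPy with hypothetical helper names, showing how out-of-range activations get truncated to the calibrated bound:

```python
import numpy as np

def calibrate(sample_batches):
    # Pre-compute static activation statistics (min, max) over a calibration set.
    lo = min(float(b.min()) for b in sample_batches)
    hi = max(float(b.max()) for b in sample_batches)
    return lo, hi

def static_quantize(a, lo, hi, n_bits=8):
    # Quantize with the fixed, pre-calculated range; values outside it are
    # truncated to the bound by the clip.
    qmax = 2 ** n_bits - 1
    scale = (hi - lo) / qmax
    q = np.clip(np.round((a - lo) / scale), 0, qmax)
    return scale * q + lo  # dequantized approximation of a

rng = np.random.default_rng(0)
calib = [rng.normal(size=128).astype(np.float32) for _ in range(8)]
lo, hi = calibrate(calib)

a = np.array([0.1, 5.0], dtype=np.float32)  # 5.0 lies outside the calibrated range
a_hat = static_quantize(a, lo, hi)
# a_hat[0] is a close approximation of 0.1; a_hat[1] is clipped down to about hi.
```

This is exactly why static (post) quantization degrades performance when the calibration set does not cover the activation distribution seen at inference time.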
Layer fusion
The latency gap between the conceptual design and the actual implementation is another critical issue. Previous methods fan2020training; Li2019quantizedObjectDet; jacob2018quantization; DBLP:journals/corr/abs190502244 report a meaningful latency enhancement of the convolutional block, but this often does not lead to an overall speedup of model execution. The main reason is the conversion overhead between FP and lower-bit representations. Even with a faster convolution operation, we cannot expect a significant latency improvement if we must frequently convert between FP and lower-bit values for normalization and activation operations. At inference time, integrating the convolution, normalization, and activation into a single convolution operation, called layer fusion, is required to improve latency. While layer fusion removes the FP-to-lower-bit conversion overhead, it also restricts the choice of normalization and activation functions. Such restrictions on model components hinder the use of previously studied architecture design techniques.
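The normalization part of layer fusion has a closed form: BatchNorm's affine transform can be folded into the preceding convolution's weights and bias. The sketch below verifies this identity numerically, using a plain matrix multiply in place of a convolution for brevity (a convolution is linear in the same way, channel by channel):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold BatchNorm into the preceding linear layer so that
    # linear(x; w', b') == bn(linear(x; w, b)) at inference time.
    s = gamma / np.sqrt(var + eps)   # per-output-channel scale
    w_folded = w * s[:, None]        # scale each output channel's weights
    b_folded = (b - mean) * s + beta
    return w_folded, b_folded

rng = np.random.default_rng(1)
w, b = rng.normal(size=(4, 8)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)

x = rng.normal(size=8)
y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
assert np.allclose(w_f @ x + b_f, y_ref)  # fused == conv followed by BN
```

After folding, the BN layer disappears entirely, so no FP conversion is needed between the convolution and the activation.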
2.3 Quantization-aware Training
Quantization-aware training
To mitigate the performance degradation from post quantization, Jacob et al. propose quantization-aware training (QAT) jacob2018quantization as a method to fine-tune the network parameters. In the training phase, QAT converts each convolutional block into a fake-quantization module, which mimics the fixed-point computation of the quantized module using floating-point arithmetic. In the inference phase, each fake-quantized module is converted into its lower-bit (INT8) quantized counterpart using the statistics and weight values of the fake-quantized module.

Optimization methods
To further close the performance gap between the FP and fake-quantized models during QAT, various approaches, including distillation hinton2015distilling, statistics freezing DBLP:journals/corr/abs180608342; Li2019quantizedObjectDet, and LayerDrop fan2020training, have been applied and proven effective. However, these approaches are restricted in their applicability to tasks other than classification hinton2015distilling, or require a specific model architecture fan2020training or training conditions DBLP:journals/corr/abs180608342; Li2019quantizedObjectDet. These restrictions make QAT challenging to apply across tasks and model architectures.
The optimization of QAT is reported to be unstable fan2020training due to the approximation errors that occur in the fake quantization with the straight-through estimator (STE) DBLP:journals/corr/BengioLC13. This instability restricts the use of QAT to a fine-tuning process with a small learning rate, merely narrowing the performance degradation from static quantization. In the sections below, we study the causes of this training fragility by analyzing the gradients. Based on the analysis, we uncover new possibilities for QAT: it can be actively used to find the local minima most appropriate for the quantized setting, and it often exceeds the floating-point model performance that is considered the upper bound.
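The fake-quantization forward pass and the STE backward pass described in this section can be sketched without any autograd machinery; the code below is a schematic NumPy rendering (function names are ours), where the STE treats rounding as identity but zeroes the gradient wherever the forward pass clipped:

```python
import numpy as np

def fake_quant(x, scale, zp, qmin=0, qmax=255):
    # Forward: simulate INT8 rounding/clipping while staying in floating point.
    q = np.clip(np.round(x / scale) + zp, qmin, qmax)
    return scale * (q - zp)

def fake_quant_grad(x, upstream, scale, zp, qmin=0, qmax=255):
    # Backward with the straight-through estimator: round() is treated as the
    # identity, but the gradient is zeroed where the value was clipped, which
    # is one source of the vanishing-gradient behavior discussed above.
    q = np.round(x / scale) + zp
    inside = (q >= qmin) & (q <= qmax)
    return upstream * inside

x = np.array([-10.0, 0.3, 0.7], dtype=np.float32)
scale, zp = 2.0 / 255, 128
g = fake_quant_grad(x, np.ones_like(x), scale, zp)
# g = [0., 1., 1.]: the outlier -10.0 falls outside the representable range,
# so its gradient vanishes, while in-range values pass the gradient through.
```

When many values fall outside the quantization range early in training, large portions of the gradient vanish at once, which matches the failure mode this paper sets out to fix.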
3 Proposed Method
3.1 Approximation Error and Gradient Computation
Let $g_t$ be the gradient computed for the weight $w$ in floating-point precision. Then, in each update step $t$, the weight is updated as follows:

(3)  $w_{t+1} = w_t - \eta\,(\gamma\,m_{t-1} + g_t)$

where the term $m_{t-1}$ denotes the momentum statistics accumulating the traces of the gradients computed in previous time steps, $\gamma$ is the momentum coefficient, and the term $\eta$ governs the learning rate of the model training. In the QAT setting, the fake-quantization module approximates this process using the function $Q$ from section 2.2, as:

(4)  $w_{t+1} = w_t - \eta\,(\gamma\,m_{t-1} + \hat{g}_t), \quad \hat{g}_t = g\!\left(Q^{-1}(Q(w_t; \theta_t); \theta_t)\right)$

The term $\theta_t$ denotes the quantization statistics of $w_t$. The quantization step of $Q$ includes value clipping by the minimum and maximum of the quantization statistics $\theta_t$. This causes the calculation to produce an erroneous approximation, which propagates to the downstream layers and invokes gradient vanishing, as in figure 1(b). Moreover, the gradient approximation error and the statistics update form a feedback loop that amplifies the error: the inaccurate gradient calculation invokes an inaccurate statistics update, and this inaccurate statistics update again induces an inaccurate gradient calculation.
We suggest that the error amplification can be prevented by assigning a proper momentum value $m$, as in figure 1(c). If the momentum holds a proper weight-update direction, the weight update of equation 4 will ignore the inaccurate gradient $\hat{g}_t$. In this case, we can expect the statistics in the next time step $t+1$ to be more accurate than those at $t$. This is a positive feedback loop as well, reducing the gradient update error while accumulating statistics that well reflect the FP values. In previous QAT, the use of pretrained weights and statistics calibration (and freezing) helps to reduce the initial gradient computation error, but the magnitude of the learning rate is still restricted to be small.
Then, how do we impose a proper value on the momentum $m$? We suggest that a momentum which has accumulated the gradients from a single epoch of FP training is enough to control the gradient approximation error occurring throughout the entire training pipeline. This strategy, called StatAssist, gives another answer to controlling the instability in the initial stage of training: while previous QAT focuses on a good pretrained weight, ours focuses on a good initialization of the momentum.
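The StatAssist idea above can be illustrated with a toy, hand-rolled momentum-SGD on a one-dimensional loss; everything here (the loss, the step counts, the class name) is a hypothetical sketch of the mechanism rather than the paper's training code:

```python
class SGDMomentum:
    """Minimal momentum-SGD; m accumulates the gradient trace as in equation 3."""
    def __init__(self, lr=0.01, gamma=0.9):
        self.lr, self.gamma, self.m = lr, gamma, 0.0

    def step(self, w, grad):
        self.m = self.gamma * self.m + grad
        return w - self.lr * self.m

def fp_grad(w):
    # Toy full-precision loss L(w) = (w - 1)^2, so its gradient is 2(w - 1).
    return 2.0 * (w - 1.0)

# StatAssist: run the optimizer with the FP model for a single "epoch"
# (here, 10 toy steps) so the momentum accumulates a proper update direction.
opt = SGDMomentum()
w = 3.0
for _ in range(10):
    w = opt.step(w, fp_grad(w))

# QAT then starts from this warmed-up optimizer state instead of a zero
# momentum; the accumulated direction damps erroneous early STE gradients.
assert opt.m > 0.0  # momentum points along the true descent direction
```

In a real pipeline the same effect is achieved by running the actual optimizer over the FP model for one epoch and then reusing its state when fake quantization is switched on.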
3.2 Training Robustness and Stochastic Gradient Boosting
Even with the proposed momentum initialization, StatAssist, there still exists a possibility of early convergence due to gradient instability from inaccurate initial quantization statistics. A gradient calculated with erroneous information may narrow the search space for an optimal local minimum and degrade performance. Previous works jacob2018quantization; DBLP:journals/corr/abs180608342 suggest postponing activation quantization for a certain period or using the pretrained weights of the FP model to work around this issue.
We suggest a simple modification to the weight update mechanism in equation 4 to escape unexpected local minima in the early stages of QAT. In each training step, the gradient is computed using the STE during backpropagation. Our stochastic gradient boosting, GradBoost, works as follows:
In each update step $t$, we first define a probability distribution $P_t$. Among various probability distributions, we chose a Laplace distribution with a scale parameter $b_t$, based on a layer-wise analysis of the histograms of gradients (figure 1). In each update step $t$, the term $b_t$ is updated as follows:

(5)  $b_t = \dfrac{g^{\max}_t - g^{\min}_t}{2}$

where $g^{\max}_t$ is the exponential moving maximum of $g_t$ and $g^{\min}_t$ is the exponential moving minimum of $g_t$ in each update step $t$.
We further choose a random subset $S$ of weights from $w$. For each $w_i \in S$, we apply a distortion to its gradient $g_i$ with a noise sample $\epsilon_i \sim P_t$ in the following way:

(6)  $z_i = \mathrm{sign}(g_i)\,|\epsilon_i|$

(7)  $z_i \leftarrow \mathrm{sign}(z_i)\,\min(|z_i|, \lambda)$

(8)  $\hat{g}_i = g_i + \alpha^{t} z_i$

where $\lambda$ is a clamping factor to prevent the exploding gradient problem and $\alpha < 1$ is taken to the power of $t$ for an exponential decay. By matching the sign of $z_i$ with the original gradient as in equation 6, adding $z_i$ to $g_i$ randomly boosts the gradient toward its current direction. For each $w_i \notin S$, the gradient remains unchanged.

Note that our GradBoost can easily be combined with equations 3 and 4 and used as an add-on to any stochastic gradient descent (SGD) optimizer Robbins2007ASA; hinton2012neural; kingma2014adam; DBLP:journals/corr/abs171105101. As shown in figure 1(d), the combination of StatAssist and GradBoost stabilizes the training and broadens the search area for optimal local minima during QAT. In section 5, we analyze the effect of StatAssist and GradBoost on the final quantized model performance with various lightweight models on different tasks.

4 Related Work
Various model compression methods seeking a better trade-off between accuracy and efficiency have been actively proposed in recent years. Both hand-crafted mehta2018espnet; mehta2018espnetv2; tdsod and neural architecture search (NAS) driven howard2017mobilenets; mobilenetv2; DBLP:journals/corr/abs190502244 structures make it possible to run a deep model on edge-device GPUs. Lightweight models can be further compressed via weight pruning DBLP:journals/corr/LiKDSG16; DBLP:journals/corr/abs180110447; DBLP:journals/corr/abs171109224, quantization jacob2018quantization; DBLP:journals/corr/abs180608342; Li2019quantizedObjectDet; fan2020training, or NAS and distillation integrated training schemes li2020gan; DBLP:journals/corr/abs180502152. In section 5 we further modify the architectures of existing lightweight models mehta2018espnet; mehta2018espnetv2; tdsod; mobilenetv2; DBLP:journals/corr/abs190502244 in search of practical model architectures for lower-bit (INT8) quantization under the implementation-level restrictions introduced in section 2.2. As opposed to other works jacob2018quantization; DBLP:journals/corr/abs180608342; Li2019quantizedObjectDet; li2020gan, we do not use any pretrained-weight fine-tuning or distillation techniques, but train each model with its original training scheme combined with our novel StatAssist and GradBoost methods.
5 Experiments
To empirically evaluate our proposed method, we perform three sets of experiments on training different lightweight models with StatAssist and GradBoost QAT from scratch. The results on classification, object detection, semantic segmentation, and style transfer prove the effectiveness of our method in both quantitative and qualitative ways.
5.1 Experimental Setting
Training protocol
Our main contribution in section 1 focuses on making the optimizer robust to the gradient approximation error caused by the STE during the backpropagation of QAT. To be more specific, we initialize the optimizer with StatAssist and distort a random subset of gradients on each update step via GradBoost. Since an optimizer updates its momentum by itself at each step, we simply apply StatAssist by running the optimizer with the FP model for a single epoch. Our StatAssist also replaces the learning-rate warm-up process in conventional model training schemes. For GradBoost, we modify the gradient update step of each optimizer with equations 5 through 8.
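The per-step gradient distortion of equations 5 through 8 can be sketched as a small hook applied to the gradient before the optimizer step. The exact form of equation 5 is our reading of the extracted text, and all parameter names below are illustrative:

```python
import numpy as np

def gradboost(grad, g_max, g_min, t, p=0.5, lam=1.0, alpha=0.999, rng=None):
    # Sketch of GradBoost (equations 5-8): sample Laplace noise whose scale
    # tracks the running gradient range, align its sign with the gradient,
    # clamp it, decay it exponentially, and add it to a random weight subset.
    rng = rng or np.random.default_rng()
    b = (g_max - g_min) / 2.0                       # eq. (5): Laplace scale from
    noise = rng.laplace(0.0, max(b, 1e-12), grad.shape)  # moving max/min of g
    boost = np.sign(grad) * np.abs(noise)           # eq. (6): follow current direction
    boost = np.clip(boost, -lam, lam)               # eq. (7): clamp against explosion
    mask = rng.random(grad.shape) < p               # random subset S of the weights
    return grad + mask * (alpha ** t) * boost       # eq. (8): decayed additive boost

g = np.array([0.5, -0.2, 0.0, 1.0])
g_hat = gradboost(g, g_max=1.0, g_min=-1.0, t=100,
                  rng=np.random.default_rng(0))
# Boosted gradients never flip sign relative to the originals, so the
# distortion only pushes each weight further along its current direction.
assert np.all(np.sign(g_hat) * np.sign(g) >= 0)
```

Because the hook only touches the gradient tensor, it can be dropped into any SGD-family optimizer's update step, which is how the paper uses it.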
Implementation details
We train our models using PyTorch
pytorch and follow the methodology of the PyTorch 1.4 quantization library. See the supplementary material for typical PyTorch pytorch code illustrating the StatAssist implementation and the detailed algorithms for the different GradBoost optimizers. For optimal latency, we tuned the components of each model for the best trade-off between model performance and compression. We provide the model tuning details in the supplementary material.

5.2 Classification
We first compare the classification
performance of different lightweight models on the ImageNet1K
russakovsky2015imagenet dataset in Table 1. We found that the performance gap between a quantized model fine-tuned with FP weights and its FP counterpart varies with the architecture. In particular, the channel-shuffle mechanism in ShuffleNetV2 Ma_2018_ECCV seems to widen the gap. Our method successfully narrows the gap in every case, showing that the scratch training of fake-quantized jacob2018quantization models with StatAssist and GradBoost is essential for better quantized performance.

Model | Params | MAdds | FP training | QAT (FP fine-tune) | QAT (StatAssist) | QAT (StatAssist + GradBoost)
ResNet18 he2016deep | 11.68M | 7.25B | 69.7 | 68.8 | 68.9 | 69.6
MobileNetV2 mobilenetv2 | 3.51M | 1.19B | 71.8 | 70.3 | 70.7 | 71.5
ShuffleNetV2 Ma_2018_ECCV | 2.28M | 0.57B | 69.3 | 63.4 | 67.7 | 68.8
ShuffleNetV2-0.5 Ma_2018_ECCV | 1.36M | 0.16B | 58.2 | 44.8 | 56.8 | 57.3
5.3 Object Detection
For object detection, we used two lightweight detectors: SSDLite-MobileNetV2 (SSD-mv2) mobilenetv2 and Tiny-DSOD (TDSOD) tdsod. We trained the models with Nesterov-momentum SGD pmlrv28sutskever13 on PASCAL VOC 2007 pascalvoc2007, following the default settings of the respective papers. For TDSOD, we decayed the initial learning rate at iterations 120K and 150K over 180K iterations in total. For SSD-mv2, we used 120K iterations in total, with decay at 80K and 100K. For testing, we slightly modified the detectors to fuse all the layers in each model, as in section 2.2.

Table 2 shows the evaluation results on the two lightweight detectors, TDSOD and SSD-mv2. Following our theoretical analysis, the quantized model trained by fine-tuning pretrained FP weights could not surpass the performance of the FP model, which acts as an upper bound. On the contrary, we see that it is possible to make the quantized model outperform the original FP model by training from scratch with our method. This is counter-intuitive in that it implies there is still room for improvement within the FP model's representational capacity. However, our method still cannot be a panacea for arbitrary INT8 conversion, since the model architecture must be modified due to the limitations explained in section 2.2. This modification can induce performance degradation if the FP model was not initially designed for quantization.
Model | Params | MAdds | FP training | QAT (FP fine-tune) | QAT (StatAssist) | QAT (StatAssist + GradBoost)
TDSOD tdsod | 2.17M | 2.24B | 71.5 | 71.4 | 71.9 | 72.0
SSD-mv2 liu2016ssd | 2.95M | 1.60B | 71.0 | 70.8 | 71.1 | 71.3
5.4 Semantic Segmentation
We also evaluated our method on semantic segmentation with three lightweight segmentation models: ESPNet mehta2018espnet, ESPNetV2 mehta2018espnetv2, and MobileNetV3 + LR-ASPP (Mv3-LRASPP) DBLP:journals/corr/abs190502244. We trained the models on Cityscapes Cordts2016Cityscapes following the default settings from DBLP:journals/corr/abs180202611. For training, we used Nesterov-momentum SGD pmlrv28sutskever13 with a poly learning-rate schedule DBLP:journals/corr/abs180202611. We trained our models with random-cropped train images to fit a model into a single NVIDIA P40 GPU. The evaluation was performed with full-scale val images. For Mv3-LRASPP, we also made extra variations to the original architecture settings from DBLP:journals/corr/abs190502244 (see our supplementary material) to examine promising performance-compression trade-offs.
Model | Params | MAdds | FP training | QAT (FP fine-tune) | QAT (StatAssist) | QAT (StatAssist + GradBoost)
ESPNet mehta2018espnet | 0.60M | 16.8B | 65.4 | 64.6 | 65.0 | 65.5
ESPNetV2 mehta2018espnetv2 | 3.43M | 27.2B | 64.4 | 63.8 | 64.6 | 64.5
Mv3-LRASPP-Large DBLP:journals/corr/abs190502244 | 2.42M | 7.21B | 65.3 | 64.5 | 64.7 | 65.2
Mv3-LRASPP-Small DBLP:journals/corr/abs190502244 | 0.75M | 2.22B | 62.5 | 61.7 | 61.6 | 62.1
Mv3-LRASPP-Large-RE (Ours) | 2.42M | 7.21B | 65.5 | 64.9 | 65.1 | 65.8
Mv3-LRASPP-Small-RE (Ours) | 0.75M | 2.22B | 61.5 | 61.2 | 62.1 | 62.3
The segmentation results in table 3 are consistent with the results in section 5.3. While quantized models fine-tuned with FP weights suffer an average mIOU drop compared to their FP counterparts, the StatAssist + GradBoost trained models maintain or slightly surpass the FP performance with an average mIOU gain. While it is cost-efficient to use the hard-swish activation DBLP:journals/corr/abs190502244 in the FP versions of MobileNetV3, the Add and Multiply operations used in hard-swish seem to generate extra quantization errors during training and degrade the final quantized performance. Our modified versions of MobileNetV3 (Mv3-LRASPP-Large-RE, Mv3-LRASPP-Small-RE), in which all hard-swish activations are replaced with ReLU, indicate that the right choice of activation function is important for a quantization-aware model architecture.
5.5 Style Transfer
We further evaluate the robustness of our method against unstable training losses by training the Pix2Pix pix2pix style transfer model with the minimax NIPS2014_5423 generation loss. For layer-fusion compatibility, we used the ResNet-based Pix2Pix model proposed by Li et al. li2020gan and the Adam kingma2014adam optimizer with our StatAssist and GradBoost. We only applied fake quantization fan2020training to the model's generator, since the discriminator is not used during inference. Example results on several image-to-image style transfer problems are shown in figure 3. As demonstrated there, our method fits well to this fuzzy training condition and succeeds in training the Pix2Pix model without causing mode collapse NIPS2014_5423, which is considered a sign of failure in minimax-based generative models.
6 Conclusion
This paper proposes a simple yet powerful strategy for the scratch training of a quantized model, which previous works have considered difficult. We show that scratch quantization-aware training (QAT) with StatAssist and GradBoost enables the final quantized model to maintain, or often surpass, the FP baseline performance, which is an upper bound for post quantization and for QAT with FP-weight fine-tuning. Beyond the scratch training of lightweight models for classification, object detection, and semantic segmentation, we also demonstrate that our proposed method is robust even to significantly unstable training losses such as the minimax generation loss. As future work, we expect QAT-targeted architecture and component studies, including quantization-aware neural architecture search (NAS), to be promising research directions.
Broader Impact
This work does not present any foreseeable societal consequence.
Acknowledgement
We would like to thank the Clova AI Research team, especially Jung-Woo Ha, for their helpful feedback and discussion. The NAVER Smart Machine Learning (NSML) platform NSML was used in the experiments.
Appendix A Possible Considerations for Quantization-aware Model Design and Training
A.1 Initialization of QAT
From the above results, we can raise an issue regarding the importance of the full-precision (FP) pretrained model as an initialization for QAT. Previous works have assumed that the loss surface of the model expressed in INT8 is an approximated version of the loss surface in FP, and have therefore focused on fine-tuning to narrow the approximation gap. Our observations, however, show a new possibility: the quantized loss surface itself has different, and better, local minima than those of the FP model. In our experiments, we show that using only a single epoch of pretrained weights with a proper direction of gradient momentum (StatAssist) can achieve comparable or better results than using fully FP-pretrained weights. We note that this discovery enables the active use of recent architecture search techniques for quantized models, since our approach does not require a good initialization from a pretrained model.
A.2 Architecture Considerations
One main concern when converting a model from FP to lower-bit precision is the activation function. As mentioned in section 5.5, exponential activation functions force a lower-bit to FP conversion for the exponential calculation, leading to a significant latency drop. Using a hard-approximation version (e.g., hard-sigmoid) can be another option, but this may incur extra quantization error. Therefore, it is necessary to develop a new quantization-aware architecture design scheme with a limited set of activation function candidates.
Appendix B Model Modification Details
As mentioned in section 2.2, the latency gap between the conceptual design and the actual implementation is critical. Layer fusion, which integrates the convolution, normalization, and activation into a single convolution operation, can improve latency by reducing the conversion overhead between FP and lower-bit representations. For a better trade-off between accuracy (mAP, mIOU, image quality) and efficiency (latency, MAdds, compression rate), we modified the models in the following ways:

We first replaced each normalization and activation function that comes after a convolution (Conv) layer with Batch Normalization (BN) [ioffe2015batch] and ReLU [pmlrv15glorot11a]. For special modules like Conv-Concatenate-BN-ReLU or Conv-Add-BN-ReLU in ESPNets [mehta2018espnet, mehta2018espnetv2], we inserted an extra Conv before the BN.

For MobileNetV3 + LR-ASPP, we replaced the AvgPool with stride (16, 20) in LR-ASPP with an AvgPool of stride (8, 8) to train models with random-cropped images instead of full-scale images.
Quantizing all layers of a model except the final layer yields the best trade-off between accuracy and efficiency.
Appendix C Example Workflow of Quantizationaware Training
In this section, we describe an example workflow of our StatAssist and GradBoost quantization-aware training (QAT) with PyTorch. Our workflow in algorithm 1 closely follows the methodology of the official PyTorch 1.4 quantization library. Detailed algorithms and PyTorch code for StatAssist and GradBoost are also provided in sections C.1 and C.2.
C.1 StatAssist in PyTorch 1.4
We provide typical PyTorch 1.4 code illustrating the StatAssist implementation in algorithm 2. The actual implementation may vary with the training workflow or PyTorch version.
C.2 GradBoost Optimizers
Our GradBoost method in section 3.2 is applicable to any existing optimizer implementation by adding extra lines to the gradient calculation. Example algorithms for GradBoost-applied momentum-SGD [pmlrv28sutskever13] and AdamW [DBLP:journals/corr/abs171105101] are provided in algorithms 3 and 4. Please refer to optimizer.py in our source code for detailed GradBoost applications to PyTorch 1.4 optimizers.