Quantizing the weights and activations of a deep model is a promising approach to reducing model complexity, alongside other techniques such as pruning lecun1990optimal and distillation hinton2015distilling. Previous studies on both weight-only and weight-activation quantization have made meaningful progress, mainly on the classification task. In particular, scalar (INT8) quantization offers practically usable accuracy with improved latency, thanks to hardware support.
Network quantization aims to approximate the floating-point (FLOAT32) computation of full-precision (FP) models with fixed-point computation at lower bit widths (INT8). It has therefore focused on transferring a pre-trained network to its quantized counterpart, so-called post-quantization vanhoucke2011improving; stock2019and. However, the approximation errors accumulate through the computations of the forward pass and cause noticeable performance degradation. In particular, lightweight models mehta2018espnet; mehta2018espnetv2; mobilenetv2; DBLP:journals/corr/abs-1905-02244 have less representational capacity than the baseline architectures NIPS2012_4824; Simonyan15; he2016deep; googlenet, so the initial statistics error caused by quantization makes it challenging to use the pre-trained FP model weights directly fan2020training.
A promising approach to this problem is to imitate network quantization during training. Quantization-aware training (QAT) jacob2018quantization simulates quantized inference during the forward pass and uses the straight-through estimator (STE) DBLP:journals/corr/BengioLC13 to compute the gradient for back-propagation. While QAT reduces the quantization error by narrowing the range of the weights and reducing the number of outlier weight values jacob2018quantization, it still cannot overcome the gradient approximation error caused by the STE.
Figure 1: Histograms of gradients. [...] suffers from the vanishing gradient problem grad_vanishing due to the gradient approximation error caused by the straight-through estimator (STE) DBLP:journals/corr/BengioLC13, whereas the StatAssist + GradBoost QAT model (right) shows a distribution of gradients similar to that of its FP-trained counterpart (left) and thus provides comparable performance.
Previous approaches suggest postponing activation quantization in the early stage of training jacob2018quantization, or using pre-trained FP model weights with a few steps of calibration DBLP:journals/corr/abs-1806-08342, to reduce the errors caused by inaccurate initial quantization statistics. While these methods work effectively when the error resulting from the STE is small, they lead to an unexpected vanishing gradient problem grad_vanishing when the error is large, as in figure 0(b). Several existing methods suggest special workarounds, such as batch normalization ioffe2015batch statistics freezing DBLP:journals/corr/abs-1806-08342; Li2019quantizedObjectDet, percentile-based activation clamping Li2019quantizedObjectDet, and LayerDrop fan2020training, which add extra restrictions to the model's architecture and training scheme.
In this work, we propose intensive and flexible strategies that enable QAT of a deep model from scratch, for better quantization performance at reduced training cost. Our proposal tackles two common factors that cause QAT from scratch to fail: 1) the gradient approximation error and 2) the gradient instability caused by inaccurate initial quantization statistics. We show that assisting the optimizer momentum with the initial statistics of an FP model, and boosting the optimizer gradients with noise in the early stage of training, stabilizes the whole training process without harming the performance of the model. Our proposed FP-statistic assisting (StatAssist) and stochastic-gradient boosting (GradBoost) QAT can be applied to the diverse training schemes of existing lightweight models, including object detection, segmentation, and style transfer, with significantly reduced training cost, in addition to classification, which has been the main target of previous quantization methods. We demonstrate the effectiveness of StatAssist + GradBoost during the back-propagation of QAT in figure 1 with histograms of gradients.
Specifically, our main contributions for the efficient QAT from scratch are as follows:
We introduce the StatAssist and GradBoost QAT methods, which make the optimizer robust to the gradient approximation error caused by the STE during the back-propagation of QAT.
Applying StatAssist and GradBoost QAT to the conventional training scheme of a deep model is straightforward and cost-efficient. With extensive experiments, we show that our method leads to successful QAT in various tasks, including classification he2016deep; mobilenetv2; Ma_2018_ECCV, object detection mobilenetv2; tdsod, semantic segmentation mehta2018espnet; mehta2018espnetv2; mobilenetv2; DBLP:journals/corr/abs-1905-02244, and style transfer pix2pix.
By combining layer fusion jacob2018quantization and INT8 quantization to compress networks trained with our method, we obtain various lightweight models with fixed-point computation that match or exceed the accuracy of their floating-point versions.
The main motivation of this paper is to improve the current quantization-aware training scheme so that a quantized model becomes competitive with its FP counterpart, which is widely considered an upper bound. Section 2 reviews prior work on quantizing a model for faster inference and smaller size. Section 3 describes the StatAssist and GradBoost algorithms for QAT from scratch without degrading quantized performance. Section 4 introduces related work on model compression from different perspectives. Section 5 describes experiments on a variety of vision tasks and applications. Section 6 concludes the paper.
2 Quantization-aware Training
2.1 Network Quantization
Network quantization approximates the weight parameters and activations of the network with lower-precision counterparts, where the original and approximated values lie in the spaces represented by 32-bit (FLOAT32) and lower-bit (INT8) precision, respectively. Following Jacob et al. jacob2018quantization, the process of approximating an original value by its quantized counterpart can be represented as:
Here, the approximation function and its inverse are defined by the same parameters, the quantization statistics. This means that if we store the quantization statistics of a value, including its minimum, maximum, and zero point, we can convert the value to its quantized counterpart and approximately revert it. Now consider the multiplication of two vectors that both lie in the floating-point space. Then, also following Jacob et al. jacob2018quantization, the resulting value is formulated as,
where the function involves only lower-bit calculation. The equation shows that the original floating-point operation can be replaced with a lower-bit operation, because the required statistics can be obtained from the two operands. Ideally, the lower-bit operation is faster, and hence so is the overall calculation.
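To make the idea concrete, the following is a minimal Python sketch of the affine scheme of Jacob et al., in which a real value r is represented as r ≈ S(q − Z) with a scale S and a zero point Z derived from the stored minimum and maximum. The function names, the signed INT8 range, and the example vectors are illustrative assumptions, not the notation of this paper.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Scale and zero point from stored min/max statistics (affine scheme)."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # keep 0.0 exactly representable
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # guard against a zero range
    zero_point = int(round(q_min - x_min / scale))
    return scale, max(q_min, min(q_max, zero_point))


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """FLOAT32 value -> INT8 code, clamped to the representable range."""
    return max(q_min, min(q_max, int(round(x / scale)) + zero_point))


def dequantize(q, scale, zero_point):
    """INT8 code -> approximate FLOAT32 value."""
    return scale * (q - zero_point)


def quantized_dot(a, b):
    """Dot product whose accumulation uses integers only."""
    sa, za = quant_params(min(a), max(a))
    sb, zb = quant_params(min(b), max(b))
    qa = [quantize(v, sa, za) for v in a]
    qb = [quantize(v, sb, zb) for v in b]
    acc = sum((x - za) * (y - zb) for x, y in zip(qa, qb))  # integer arithmetic
    return sa * sb * acc  # a single floating-point rescale at the end


a = [0.5, -1.0, 0.25]
b = [1.0, 0.5, -0.75]
exact = sum(x * y for x, y in zip(a, b))
approx = quantized_dot(a, b)
print(exact, approx)
```

The integer accumulation is the part that hardware can accelerate; the two scales and zero points are the only floating-point information that must be stored per tensor.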
2.2 Static Quantization
Despite the theoretical speed-up, realizing it through network quantization is not straightforward. One main reason is the quantization of the activations. At inference time, the quantization statistics of the weights are easy to obtain because the weight values are fixed. In contrast, the quantization statistics of the activations change constantly with the input. Since replacing the floating-point operation with the lower-bit operation always requires these statistics, a special workaround is essential for activation quantization.
Instead of calculating the quantization statistics dynamically, one solution is to approximate them with statistics pre-calculated from a number of samples; this is called static quantization. In static quantization, the approximation function uses pre-calculated statistics, each obtained from a set of samples. Since the statistics are fixed, some input may fall outside the calibrated range, and such a value is truncated to the bound. Static quantization, including the calibration of the quantization statistics, is also called post quantization jacob2018quantization. Post quantization enables an actual speed-up of the operation, but it also causes performance degradation.
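As an illustration of the calibration step described above, here is a minimal Python sketch assuming simple min/max calibration and an unsigned 8-bit range; the function names and the calibration rule are hypothetical choices for the example, not the procedure of any particular library.

```python
def calibrate(samples):
    """Collect fixed min/max statistics from a set of calibration activations."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    return lo, hi


def static_quantize(x, lo, hi, q_min=0, q_max=255):
    """Quantize x with pre-calculated statistics, truncating to the bounds."""
    if hi == lo:  # degenerate range: every value maps to the same code
        return q_min
    scale = (hi - lo) / (q_max - q_min)
    x = max(lo, min(hi, x))  # values outside the calibrated range are truncated
    return q_min + int(round((x - lo) / scale))


# Calibration activations gathered offline (made-up numbers).
lo, hi = calibrate([[0.0, 1.0], [0.5, 2.0]])
print(static_quantize(3.0, lo, hi))   # unseen input above the calibrated range
print(static_quantize(-1.0, lo, hi))  # unseen input below the calibrated range
```

The truncation is where accuracy is lost: any activation outside the calibrated range collapses onto the nearest bound.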
The latency gap between the conceptual design and the actual implementation is another critical issue. Previous methods fan2020training; Li2019quantizedObjectDet; jacob2018quantization; DBLP:journals/corr/abs-1905-02244 report a meaningful latency improvement for the convolutional block, but this often does not translate into an overall speed-up of model execution. The main reason is the conversion overhead between FP and lower-bit. Even with a faster convolution operation, we cannot expect a significant latency improvement if we must frequently convert between FP and lower-bit for the normalization and activation operations. At inference time, the convolution, normalization, and activation must be integrated into a single convolution operation to improve latency, a technique called layer fusion. While layer fusion removes the FP-to-lower-bit conversion overhead, it also restricts the choice of normalization and activation functions. This restriction on candidate model components hinders the use of previously studied architecture design techniques.
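The arithmetic behind layer fusion can be sketched for a single channel, assuming batch normalization with fixed inference-time statistics and a ReLU activation. The scalar "convolution" and all names below are simplifications introduced for illustration; real implementations fold per-channel tensors in the same way.

```python
import math


def conv_bn_relu(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: convolution, then batch norm, then ReLU."""
    y = w * x + b  # single-channel 1x1 "convolution"
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta
    return max(0.0, y)


def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm affine transform into the convolution parameters."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta


def fused_conv_relu(x, w_f, b_f):
    """Fused module: one multiply-add followed by the ReLU clamp."""
    return max(0.0, w_f * x + b_f)


params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_f, b_f = fuse_conv_bn(**params)
print(conv_bn_relu(0.7, **params), fused_conv_relu(0.7, w_f, b_f))
```

Because the folded parameters are computed once, the normalization no longer needs a separate floating-point pass at inference time; this only works for normalizations that reduce to a fixed affine transform, which is the restriction noted above.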
2.3 Quantization-aware Training
To mitigate the performance degradation of post quantization, Jacob et al. propose quantization-aware training (QAT) jacob2018quantization as a method to fine-tune the network parameters. In the training phase, QAT converts each convolutional block into a fake-quantization module, which mimics the fixed-point computation of the quantized module using floating-point arithmetic. In the inference phase, each fake-quantized module is actually converted to its lower-bit (INT8) quantized counterpart using the statistics and the weight values of the fake-quantized module.
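A fake-quantization module and its straight-through gradient can be sketched as follows. This is a minimal illustration assuming a uniform 256-level grid and the commonly used clipped variant of the STE, in which values outside the quantization range receive zero gradient; the function names are hypothetical.

```python
def fake_quantize(x, lo, hi, levels=256):
    """Quantize-then-dequantize in floating point (forward pass of QAT)."""
    scale = (hi - lo) / (levels - 1)
    x_clipped = max(lo, min(hi, x))  # clip to the range given by the statistics
    return lo + round((x_clipped - lo) / scale) * scale


def ste_grad(x, upstream_grad, lo, hi):
    """Straight-through estimator: treat rounding as the identity.

    Assumption: inputs that were clipped get zero gradient.
    """
    return upstream_grad if lo <= x <= hi else 0.0


lo, hi = -1.0, 1.0
print(fake_quantize(0.3, lo, hi), ste_grad(0.3, 2.0, lo, hi))
print(fake_quantize(5.0, lo, hi), ste_grad(5.0, 2.0, lo, hi))
```

The true derivative of rounding is zero almost everywhere, so the gradient returned here is an approximation; that mismatch is the gradient approximation error discussed in this paper.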
To further close the performance gap between the FP and fake-quantized models during QAT, various approaches, including distillation hinton2015distilling, statistics freezing DBLP:journals/corr/abs-1806-08342; Li2019quantizedObjectDet, and LayerDrop fan2020training, have been applied and proven effective. However, these approaches are either difficult to apply to tasks other than classification hinton2015distilling, or require a specific model architecture fan2020training or specific training conditions DBLP:journals/corr/abs-1806-08342; Li2019quantizedObjectDet. These restrictions make it challenging to apply QAT to various tasks with various model architectures.
The optimization of QAT is reported to be unstable fan2020training because of the approximation errors that occur in fake quantization with the straight-through estimator (STE) DBLP:journals/corr/BengioLC13. This instability restricts QAT to a fine-tuning process with a small learning rate, which merely narrows the performance degradation of static quantization. In the following sections, we study the causes of this fragility in training by analyzing the gradients. Based on the analysis, we uncover further possibilities for QAT: it can be actively used to find the most appropriate local minimum for the quantized setting, and it often exceeds the performance of the floating-point model, which is considered the upper bound.
3 Proposed Method
3.1 Approximation Error and Gradient Computation
Consider the gradient computed for a weight in floating-point precision. In each update step, the weight is updated as follows:
where the momentum term accumulates the traces of the gradients computed in previous time steps, and the step-size term governs the learning rate of training. In the QAT setting, the fake-quantization module approximates this process with the function in section 2.2, as:
Here the statistics term denotes the quantization statistics of the quantized value. The quantization step includes clipping values by the minimum and maximum of the quantization statistics. This makes the calculation an erroneous approximation, and the error propagates to the downstream layers and invokes gradient vanishing, as in figure 0(b). In addition, the gradient approximation error and the statistics update form a feedback loop that amplifies the error: an inaccurate gradient leads to an inaccurate statistics update, which in turn induces a further inaccurate gradient.
We suggest that the error amplification can be prevented by assigning a proper momentum value, as in figure 1(c). If the momentum carries a proper weight update direction, the weight update of equation 4 will largely ignore the inaccurate gradient. In this case, we can expect the statistics in the next time step to be more accurate than those in the current step. This also forms a positive feedback loop that reduces the gradient update error while accumulating statistics that better reflect the FP values. In previous QAT, the use of pre-trained weights and statistics calibration (and freezing) helps to reduce the initial gradient computation error. Still, the magnitude of the learning rate is restricted to be small.
How, then, can we impose a proper value on the momentum? We suggest that a momentum that has accumulated the gradients from a single epoch of FP training is enough to control the gradient approximation error that occurs over the entire training pipeline. This strategy, called StatAssist, gives another answer to controlling the instability in the initial stage of training: while previous QAT focuses on good pre-trained weights, ours focuses on a good initialization of the momentum.
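The StatAssist idea can be illustrated with a toy example: a short full-precision phase fills the optimizer momentum, and QAT then continues with the same optimizer state. The sketch below assumes a standard heavy-ball momentum update, a scalar weight, a quadratic toy loss, and a crude rounding grid as fake quantization; all of these are assumptions for illustration and not the exact update rule of this paper.

```python
class MomentumSGD:
    """Heavy-ball SGD on a single scalar weight (illustrative only)."""

    def __init__(self, lr=0.05, beta=0.9):
        self.lr, self.beta, self.momentum = lr, beta, 0.0

    def step(self, w, grad):
        self.momentum = self.beta * self.momentum + grad
        return w - self.lr * self.momentum


def fp_grad(w, target=3.0):
    """Gradient of the toy loss (w - target)^2 in full precision."""
    return 2.0 * (w - target)


def qat_grad(w, target=3.0, grid=0.5):
    """Same loss evaluated at the fake-quantized weight; the STE passes the
    gradient through the rounding unchanged."""
    w_q = round(w / grid) * grid
    return 2.0 * (w_q - target)


def train(w, optimizer, grad_fn, steps):
    for _ in range(steps):
        w = optimizer.step(w, grad_fn(w))
    return w


opt = MomentumSGD()
w = train(0.0, opt, fp_grad, steps=5)   # brief FP phase fills the momentum
warm_momentum = opt.momentum            # non-zero momentum carried into QAT
w = train(w, opt, qat_grad, steps=50)   # QAT continues with the same optimizer
print(warm_momentum, w)
```

The design choice being illustrated is that the optimizer object is reused across the two phases, so the first QAT steps are dominated by a momentum computed from accurate gradients rather than by the first, least reliable, STE gradients.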
3.2 Training Robustness and Stochastic Gradient Boosting
Even with the proposed momentum initialization, StatAssist, early convergence remains possible because of the gradient instability caused by inaccurate initial quantization statistics. A gradient calculated from erroneous information may narrow the search space for optimal local minima and lower the performance. Previous works jacob2018quantization; DBLP:journals/corr/abs-1806-08342 suggest postponing activation quantization to a certain extent, or using the pre-trained weights of an FP model, to work around this issue.
We suggest a simple modification to the weight update mechanism in equation 4 to escape unexpected local minima in the early stages of QAT. In each training step, the gradient is computed using the STE during back-propagation. Our stochastic gradient boosting, GradBoost, works as follows:
In each update step, we first define a probability distribution. Among various probability distributions, we chose a Laplace distribution with a scale parameter, based on a layer-wise analysis of the histograms of gradients (figure 1). In each update step, the parameter is updated as follows:
where the two terms are an exponential moving maximum and an exponential moving minimum tracked in each update step.
We further choose a random subset of the weights. For each weight in the subset, we distort its gradient with a sample from the distribution in the following way:
Here, a clamping factor prevents the exploding gradient problem, and a decay term is raised to a power for exponential decay. Because the sign of the distortion is matched to that of the original gradient, as in equation 6, adding it randomly boosts the gradient in its current direction. For each weight outside the subset, the gradient remains unchanged.
GradBoost can be used as an add-on to any stochastic gradient descent (SGD) optimizer Robbins2007ASA; hinton2012neural; kingma2014adam; DBLP:journals/corr/abs-1711-05101. As shown in figure 1(d), the combination of StatAssist and GradBoost stabilizes the training and broadens the search area for optimal local minima during QAT. In section 5, we analyze the effect of StatAssist and GradBoost on the final quantized model performance with various lightweight models on different tasks.
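The steps above can be sketched in plain Python. Because the exact update equations are not reproduced here, the sketch makes its own assumptions: the Laplace scale is taken as the difference between exponential moving averages of the per-step maximum and minimum gradient, the subset is chosen by an independent coin flip per weight, and the noise magnitude is clamped and decayed by a fixed factor raised to the step index. All parameter names and default values are hypothetical.

```python
import random


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def grad_boost(grads, step, state, rng,
               ema=0.9, subset_prob=0.1, clamp=0.01, decay=0.99):
    """Return a copy of grads with a random subset boosted along its sign."""
    # Track exponential moving max/min of the gradients (assumed update rule).
    g_max, g_min = max(grads), min(grads)
    state["max"] = ema * state.get("max", g_max) + (1.0 - ema) * g_max
    state["min"] = ema * state.get("min", g_min) + (1.0 - ema) * g_min
    scale = max(state["max"] - state["min"], 1e-12)  # assumed Laplace scale

    boosted = []
    for g in grads:
        if rng.random() < subset_prob:  # random subset of the weights
            noise = min(abs(laplace_sample(scale, rng)), clamp) * decay ** step
            g += noise if g >= 0.0 else -noise  # push along the current sign
        boosted.append(g)
    return boosted


rng = random.Random(0)
state = {}
grads = [0.2, -0.1, 0.05, -0.3]
print(grad_boost(grads, step=0, state=state, rng=rng, subset_prob=0.5))
```

In an optimizer, a function like this would be applied to the gradients just before the usual momentum and weight update, which is why it can be attached to different SGD variants without changing them otherwise.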
4 Related Work
Various model compression methods for a better trade-off between accuracy and efficiency have been actively proposed in recent years. Both hand-crafted mehta2018espnet; mehta2018espnetv2; tdsod and neural architecture search (NAS) driven howard2017mobilenets; mobilenetv2; DBLP:journals/corr/abs-1905-02244 structures make it possible to run a deep model on edge-device GPUs. Lightweight models can be further compressed via weight pruning DBLP:journals/corr/LiKDSG16; DBLP:journals/corr/abs-1801-10447; DBLP:journals/corr/abs-1711-09224, quantization jacob2018quantization; DBLP:journals/corr/abs-1806-08342; Li2019quantizedObjectDet; fan2020training, or a training scheme that integrates NAS and distillation li2020gan; DBLP:journals/corr/abs-1805-02152. In section 5, we further modify the architectures of existing lightweight models mehta2018espnet; mehta2018espnetv2; tdsod; mobilenetv2; DBLP:journals/corr/abs-1905-02244 in search of a practical model architecture for lower-bit (INT8) quantization under the implementation-level restrictions introduced in section 2.2. In contrast to other works jacob2018quantization; DBLP:journals/corr/abs-1806-08342; Li2019quantizedObjectDet; li2020gan, we do not use pre-trained weight fine-tuning or distillation; instead, we train each model with its original training scheme combined with our StatAssist and GradBoost methods.
5 Experiments
To empirically evaluate our proposed method, we perform three sets of experiments on training different lightweight models from scratch with StatAssist and GradBoost QAT. The results on classification, object detection, semantic segmentation, and style transfer demonstrate the effectiveness of our method both quantitatively and qualitatively.
5.1 Experimental Setting
Our main contribution in section 1 focuses on making the optimizer robust to the gradient approximation error caused by the STE during the back-propagation of QAT. More specifically, we initialize the optimizer with StatAssist and distort a random subset of gradients at each update step via GradBoost. Since an optimizer updates its momentum by itself at each step, we apply StatAssist simply by running the optimizer with the FP model for a single epoch. StatAssist also replaces the learning rate warm-up process of conventional training schemes. For GradBoost, we modify the gradient update step of each optimizer with equations 5 through 8.
We train our models using PyTorch pytorch and follow the methodology of the PyTorch 1.4 quantization library. See the supplementary material for typical PyTorch pytorch code illustrating the StatAssist implementation and for the detailed algorithms of the different GradBoost optimizers. For optimal latency, we tuned the components of each model for the best trade-off between model performance and compression. We provide model tuning details in the supplementary material.
5.2 Classification
We first compare the classification performance of different lightweight models on the ImageNet-1K russakovsky2015imagenet dataset in Table 1. We found that the performance gap between a quantized model fine-tuned from FP weights and its FP counterpart varies with the architecture. In particular, the channel-shuffle mechanism in ShuffleNetV2 Ma_2018_ECCV seems to widen the gap. Our method successfully narrows the gap to no more than , indicating that training fake-quantized jacob2018quantization models from scratch with StatAssist and GradBoost is essential for better quantized performance.
5.3 Object Detection
For object detection, we used two lightweight detectors: SSD-Lite-MobileNetV2 (SSD-mv2) mobilenetv2 and Tiny-DSOD (T-DSOD) tdsod. We trained the models with Nesterov-momentum SGD pmlr-v28-sutskever13 on PASCAL-VOC 2007 pascal-voc-2007, following the default settings of the papers. For training T-DSOD, we set the initial learning rate and scaled the rate into at iterations 120K and 150K, over 180K iterations in total. For SSD-mv2, we used 120K iterations in total, with scaling at 80K and 100K. The initial learning rate was set to . For each case, we set the batch size of . For testing, we slightly modified the detectors to fuse all the layers in each model, as in section 2.2.
Table 2 shows the evaluation results for the two lightweight detectors, T-DSOD and SSD-mv2. Consistent with our theoretical analysis, the quantized model fine-tuned from pre-trained FP weights could not surpass the performance of the FP model, which acts as an upper bound. In contrast, the quantized model can outperform the original FP model when each model is trained from scratch with our method. This is counter-intuitive in that there still exists enough room for improvement in the FP model's representational capacity. However, our method is not a panacea for every INT8 conversion, since the model architecture must be modified because of the limitations explained in section 2.2. This modification can degrade performance if the FP model was not initially designed for quantization.
5.4 Semantic Segmentation
We also evaluated our method on semantic segmentation with three lightweight segmentation models: ESPNet mehta2018espnet, ESPNetV2 mehta2018espnetv2, and MobileNetV3 + LRASPP (Mv3-LRASPP) DBLP:journals/corr/abs-1905-02244. We trained the models on Cityscapes Cordts2016Cityscapes following the default settings from DBLP:journals/corr/abs-1802-02611. For training, we used Nesterov-momentum SGD pmlr-v28-sutskever13 with the initial learning rate and the poly learning rate schedule DBLP:journals/corr/abs-1802-02611. We trained our models with randomly cropped train images so that each model fits on a single NVIDIA P40 GPU. The evaluation was performed with full-scale val images. For Mv3-LRASPP, we also made extra variations to the original architecture settings from DBLP:journals/corr/abs-1905-02244 (as in our supplementary material) to examine promising performance-compression trade-offs.
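For reference, the poly learning rate schedule decays the rate polynomially towards zero over training. The sketch below is a generic form of that schedule; the exponent of 0.9 is a commonly used default and is an assumption here, as are the base learning rate and iteration count in the example.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """'Poly' schedule: base_lr * (1 - iteration / max_iterations) ** power."""
    progress = min(iteration, max_iterations) / max_iterations
    return base_lr * (1.0 - progress) ** power


# Hypothetical values, for illustration only.
for it in (0, 25, 50, 75, 100):
    print(it, poly_lr(0.1, it, 100))
```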
The segmentation results in table 3 are consistent with the results in section 5.3. While quantized models fine-tuned from FP weights suffer from an average mIOU drop compared to their FP counterparts, the StatAssist + GradBoost trained models maintain or slightly surpass the performance of the FP models with an average mIOU gain. While it is cost-efficient to use the hard-swish activation DBLP:journals/corr/abs-1905-02244 in the FP versions of MobileNetV3, the Add and Multiply operations used in hard-swish seem to generate extra quantization errors during training and degrade the final quantized performance. Our modified versions of MobileNetV3 (Mv3-LRASPP-Large-RE, Mv3-LRASPP-Small-RE), in which all hard-swish activations are replaced with ReLU, show that the right choice of activation function is important for a quantization-aware model architecture.
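The difference between the two activations is easy to see when they are written out. The sketch below uses the standard MobileNetV3 definition of hard-swish, x · ReLU6(x + 3) / 6, next to plain ReLU; the scalar helper functions are written for illustration only.

```python
def relu(x):
    """ReLU: a single clamp, which layer fusion can absorb."""
    return max(0.0, x)


def relu6(x):
    return min(6.0, max(0.0, x))


def hard_swish(x):
    """Hard-swish: an Add, a clamp, a Multiply by the input, and a rescale."""
    return x * relu6(x + 3.0) / 6.0


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, relu(x), hard_swish(x))
```

Each extra elementwise operation in hard-swish produces an intermediate tensor that needs its own quantization statistics, which is one plausible reading of why it accumulates more quantization error than ReLU.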
5.5 Style Transfer
We further evaluate the robustness of our method against unstable training losses by training the Pix2Pix pix2pix style transfer model with the minimax NIPS2014_5423 generation loss. For layer fusion compatibility, we used the ResNet-based Pix2Pix model proposed by Li et al. li2020gan and the Adam kingma2014adam optimizer with our StatAssist and GradBoost. We applied fake quantization fan2020training only to the model's Generator, since the Discriminator is not used during inference. Example results on several image-to-image style transfer problems are shown in figure 3. Our method also fits this unstable training condition well without causing mode collapse NIPS2014_5423, which is considered a sign of failure in minimax-based generative models. As figure 3 demonstrates, our method succeeds in training the Pix2Pix model on different image-to-image style transfer problems.
6 Conclusion
This paper proposes a simple yet powerful strategy for training a quantized model from scratch with quantization-aware training (QAT), which previous works have considered difficult. We show that QAT from scratch with StatAssist and GradBoost enables the final quantized model to maintain, and often surpass, the FP baseline performance, which is an upper bound for post quantization and for QAT with FP-weight fine-tuning. Beyond the scratch training of lightweight models for classification, object detection, and semantic segmentation, we further demonstrate that our method is robust even to significantly unstable training losses such as the minimax generation loss. As future work, we expect QAT-targeted architecture and component studies, including quantization-aware neural architecture search (NAS), to be another promising research direction.
This work does not present any foreseeable societal consequence.
We would like to thank Clova AI Research team, especially Jung-Woo Ha and for their helpful feedback and discussion. The NAVER Smart Machine Learning (NSML) platform NSML was used in the experiments.
Appendix A Possible Considerations for Quantization-aware Model Designing and Training
A.1 Initialization of QAT
From the above results, we can raise an issue regarding the importance of the full-precision (FP) pre-trained model as an initialization for QAT. Previous works have assumed that the loss surface of the INT8 model is an approximation of the FP loss surface and have hence focused on fine-tuning to narrow the approximation gap. Our observations, however, suggest a new possibility: the quantized loss surface itself has different and better local minima that are not near those of the FP model. In our experiments, we show that using only single-epoch pre-trained weights with a proper direction of gradient momentum (StatAssist) can achieve results comparable to or better than using fully pre-trained FP weights. We note that this finding enables the active use of recent architecture search techniques for quantized models, since our method does not require a good initialization from a pre-trained model.
A.2 Architecture Consideration
One main concern when converting a model from FP to lower-bit is the activation function. As we mentioned in section 5.5, exponential activation functions force a lower-bit to FP conversion for the exponential calculation, leading to a significant increase in latency. Using a hard approximation (e.g., hard-sigmoid) is another option, but this may introduce extra quantization error. Therefore, it is necessary to develop a new quantization-aware architecture design scheme with a limited set of activation function candidates.
Appendix B Model Modification Details
As mentioned in section 2.2, the latency gap between the conceptual design and the actual implementation is critical. Layer fusion, which integrates the convolution, normalization, and activation into a single convolution operation, improves latency by reducing the conversion overhead between FP and lower-bit. For a better trade-off between accuracy (mAP, mIOU, image quality) and efficiency (latency, MAdds, compression rate), we modified the models in the following ways:
We first replaced each normalization and activation function that follows a convolution (Conv) layer with Batch Normalization (BN) [ioffe2015batch] and ReLU [pmlr-v15-glorot11a]. For special modules such as Conv-Concatenate-BN-ReLU or Conv-Add-BN-ReLU in ESPNets [mehta2018espnet, mehta2018espnetv2], we inserted an extra Conv before BN.
For MobileNetV3 + LRASPP, we replaced the Avg-Pool Stride=(16, 20) in LRASPP with Avg-Pool Stride=(8, 8) to train models with randomly cropped images instead of full-scale images.
Quantizing all layers of a model except the last one yields the best trade-off between accuracy and efficiency.
Appendix C Example Workflow of Quantization-aware Training
In this section, we describe an example workflow of our StatAssist and GradBoost quantization-aware training (QAT) with PyTorch. Our workflow in algorithm 1 closely follows the methodology of the official PyTorch 1.4 quantization library. Detailed algorithms and PyTorch code for StatAssist and GradBoost are also provided in sections C.1 and C.2.
C.1 StatAssist in PyTorch 1.4
We provide typical PyTorch 1.4 code illustrating the StatAssist implementation in algorithm 2. The actual implementation may vary according to the training workflow or PyTorch version.
C.2 GradBoost Optimizers
Our GradBoost method in section 3.2 is applicable to any existing optimizer implementation by adding a few extra lines to the gradient calculation. Example algorithms for GradBoost-applied momentum-SGD [pmlr-v28-sutskever13] and AdamW [DBLP:journals/corr/abs-1711-05101] are provided in algorithms 3 and 4. Please refer to optimizer.py in our source code for detailed GradBoost applications to PyTorch 1.4 optimizers.