1 Introduction
Since 2012, convolutional neural networks (CNNs) have achieved great success in image classification [9, 17, 10], object detection [29, 21, 20], face recognition [6, 31, 23], etc. However, existing networks typically have a large number of parameters and a high computational cost, which restricts their deployment on resource-limited devices such as smartphones, AR glasses, and drones. Network compression addresses this issue by reducing the parameters and computational cost of a network without significant performance degradation.
Existing studies on network compression focus on channel pruning [11, 25, 42], efficient architecture design [12, 30, 27], and network quantization [36, 39, 40]. In particular, network quantization directly reduces the model size by converting the network weights into low-precision (e.g., 4-bit or 2-bit) ones. As a result, the low-precision model achieves substantial memory savings (e.g., 8× or 16× relative to 32-bit floating-point weights). More importantly, network quantization also converts the activations into low-precision ones. We can therefore replace the compute-intensive floating-point operations with lightweight fixed-point or bitwise operations, which greatly reduces the computational cost of the network.
Although promising results on tasks such as image classification have been reported [39, 16, 7] using the aforementioned quantization techniques, using quantized networks for more complex tasks such as object detection remains a challenge. In object detection, the detector not only performs object classification but also conducts bounding box regression, which makes the detection model more difficult to quantize. Existing methods [18, 15, 37] quantize the detector to 4 or 8 bits and achieve promising performance. However, lower-bitwidth (e.g., 2-bit) quantization incurs a significant performance drop. In this paper, we observe that the main challenge of quantization for object detection is the inaccurate batch statistics of batch normalization [14]. In a one-stage detector [32, 21], each detection head decodes scale-specific features. Features at different scales have different means and variances. This discrepancy across feature levels leads to inaccurate statistics in shared batch normalization layers. The issue becomes even more severe when we perform low-precision quantization on the one-stage detector.
To address this issue, we propose an accurate quantized object detection (AQD) method. Specifically, we construct multi-level batch normalization (multi-level BN), which uses an independent batch normalization for each pyramid level of the head and can therefore capture accurate batch statistics. Moreover, we propose a learned interval quantization method (LIQ) to further improve the performance of the quantized network.
Our main contributions are summarized as follows.

We highlight that the poor performance of quantized detectors is due to the inaccurate statistics of batch normalization. We therefore propose multi-level batch normalization (multi-level BN) to capture accurate batch statistics for the different scales of the feature pyramid. We further propose a learned interval quantization method (LIQ) that improves how the quantizer itself is configured.

We evaluate the proposed methods on the COCO detection benchmark at multiple precisions. Experimental results show that our 3-bit AQD achieves performance comparable to its full-precision counterpart. Notably, our 4-bit AQD even outperforms the 32-bit full-precision model, e.g., by 1.7% AP for RetinaNet with a ResNet-18 backbone.
2 Related Work
Network quantization. Network quantization aims to represent the network weights and/or activations with very low precision, which reduces the model size and computational cost. Existing methods can be divided into two categories, namely, binary quantization [13, 28, 3] and fixed-point quantization [36, 39, 35]. Binary quantization converts the full-precision weights and activations to $\{+1, -1\}$. In this way, the matrix multiplication operations can be replaced with bitwise XNOR-popcount operations. As a result, the binary convolution layer can achieve up to 32× memory saving and 58× speedup on CPUs [28, 40]. Nevertheless, binary quantization incurs significant performance degradation compared with the full-precision counterparts. To reduce the performance gap, fixed-point quantization methods [36, 39, 35, 4] have been proposed to represent weights and activations with a higher bitwidth, and they achieve impressive performance on image classification.
Quantization on object detection. Many researchers have studied quantization for object detection to speed up on-device inference and save storage. Jacob et al. [15] propose a quantization scheme using integer-only arithmetic and perform object detection on the COCO data set with quantized 8-bit models. Wei et al. [33] utilize knowledge distillation and quantization to train very tiny CNNs for object detection. Observing the instability of the fine-tuning stage of the quantization process, Li et al. [18] produce fully quantized 4-bit detectors based on RetinaNet and Faster R-CNN with three techniques: freezing batch normalization statistics, clamping activations based on percentiles, and quantizing with a channel-wise scheme. Zhuang et al. [37] point out the difficulty of propagating gradients and propose to train low-precision networks with a full-precision auxiliary module.
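The XNOR-popcount replacement mentioned above for binary quantization can be illustrated with a small pure-Python sketch. The helper names `pack_bits` and `binary_dot` are hypothetical and chosen only for this example; real binary kernels operate on packed machine words, which this sketch imitates with Python integers.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two {+1, -1} vectors of length n, given their packed bits."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # agreements minus disagreements


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))          # ordinary multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))  # bitwise equivalent
assert fast == reference
```

Each agreeing pair of signs contributes +1 and each disagreeing pair contributes -1, which is why the dot product reduces to twice the popcount minus the vector length.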
3 Proposed method
In this section, we describe the proposed accurate quantized object detection (AQD) method. We first introduce the problem definition in Section 3.1. We then describe the inaccurate batch statistics issue and the multi-level batch normalization (multi-level BN) in Section 3.2. Finally, we introduce the learned interval quantization (LIQ) in Section 3.3.
3.1 Problem definition
We consider building a one-stage quantized detector for object detection. A one-stage object detector consists of a backbone, a feature pyramid, and prediction heads. For a convolutional layer in a detector, we define the input $\mathbf{x}$ and the weight parameter $\mathbf{w}$. Let $x_i$ and $w_i$ be the $i$-th element of $\mathbf{x}$ and $\mathbf{w}$, respectively. Here, we omit the subscript $i$ for convenience. Quantization seeks to reduce the bitwidth of $x$ and $w$ via quantizers:
$$\hat{x} = Q(x) = D(T(x)), \qquad \hat{w} = Q(w) = D(T(w)), \tag{1}$$
where a quantizer $Q$ contains a transformer $T$ and a discretizer $D$. Following [7], the transformer transforms the value $v \in \{x, w\}$ to $\mathrm{clip}(v/s, -Q_N, Q_P)$ with a learnable step size $s$. Let $b$ be the bitwidth of the weights and activations of the quantized networks. For weights, $Q_N$ and $Q_P$ are $2^{b-1}$ and $2^{b-1}-1$. For activations, $Q_N$ and $Q_P$ are $0$ and $2^{b}-1$. The discretizer maps the continuous value in the range $[-Q_N, Q_P]$ to a discrete value.
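A minimal sketch of this transformer and discretizer pair is given below, assuming the step-size formulation of [7]: clipping bounds of $2^{b-1}$ and $2^{b-1}-1$ for signed weights, and $0$ and $2^{b}-1$ for non-negative activations. The function names, the round-half-up rule, and the final rescaling by the step size in `dequantize` are assumptions made for illustration and are not taken from the text above.

```python
import math


def quant_bounds(bits, signed):
    """Return (Q_N, Q_P) for the given bitwidth."""
    if signed:                                   # weights
        return 2 ** (bits - 1), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1                      # activations


def transform(v, step, q_n, q_p):
    """Transformer: scale by the step size and clip to [-Q_N, Q_P]."""
    return min(max(v / step, -q_n), q_p)


def discretize(v):
    """Discretizer: round to the nearest integer level (ties round up)."""
    return math.floor(v + 0.5)


def quantize(v, step, bits, signed):
    """Integer level assigned to v."""
    q_n, q_p = quant_bounds(bits, signed)
    return discretize(transform(v, step, q_n, q_p))


def dequantize(level, step):
    """Assumed rescaling of the integer level back to the real-valued range."""
    return level * step


# 2-bit example with a hypothetical step size of 0.5.
weight_level = quantize(-0.8, step=0.5, bits=2, signed=True)
act_level = quantize(5.0, step=0.5, bits=2, signed=False)
```

With 2 bits, weights can take the integer levels -2 to 1 and activations the levels 0 to 3, so large activations such as 5.0 saturate at the top level.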
3.2 Multilevel batch normalization
During the training of the quantized detector, batch normalization [14] normalizes the input features and updates the exponential moving average (EMA) statistics $\mu$ and $\sigma^2$ with the current batch statistics $\mu_B$ and $\sigma_B^2$. In one-stage object detection frameworks, each prediction head encodes a corresponding feature level. However, due to the quantization process, the batch statistics of different feature levels may diverge considerably, as shown in Figure 1. Therefore, sharing one batch normalization across prediction heads may lead to inaccurate batch statistics, which causes a significant performance drop.
To solve this issue, we propose a simple yet effective method, called multi-level batch normalization (multi-level BN), that uses an independent batch normalization for each feature level. Multi-level BN can thus capture the individual batch statistics of the corresponding feature level. Our method has two advantages compared with the standard shared BN strategy. First, multi-level BN introduces only a negligible number of parameters: it accounts for less than 1.1% of the model size. Second, multi-level BN does not change the architecture of the network, so it introduces no additional computational cost. An empirical study of the effect of multi-level BN can be found in Section 4.5.
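The difference between shared and per-level statistics can be sketched in a few lines of plain Python. The level names `P3` and `P7`, the feature values, and the momentum of 0.1 are hypothetical; the sketch only tracks the EMA mean and variance of scalar features and is not the detector implementation.

```python
class RunningStats:
    """EMA of the batch mean and variance, as tracked by one BN layer."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.mean = 0.0
        self.var = 1.0

    def update(self, batch):
        m = sum(batch) / len(batch)
        v = sum((x - m) ** 2 for x in batch) / len(batch)
        self.mean += self.momentum * (m - self.mean)
        self.var += self.momentum * (v - self.var)


# Hypothetical features from two pyramid levels with different statistics.
levels = {"P3": [0.9, 1.0, 1.1, 1.0], "P7": [4.8, 5.0, 5.2, 5.0]}

shared = RunningStats()                                  # one BN for all levels
per_level = {name: RunningStats() for name in levels}    # one BN per level

for _ in range(200):
    for name, feats in levels.items():
        shared.update(feats)
        per_level[name].update(feats)

for name in levels:
    print(name, "per-level mean:", round(per_level[name].mean, 3),
          "shared mean:", round(shared.mean, 3))
```

In this toy setting each per-level tracker converges to the mean of its own level, while the shared tracker settles on a value between the two levels that matches neither.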
3.3 Learned interval quantization
During the training of the quantized network, the gradient of the loss $\ell$ with respect to the step size $s$ is defined as follows:
$$\frac{\partial \ell}{\partial s} = \sum_{i} \frac{\partial \ell}{\partial \hat{v}_i} \, \frac{\partial \hat{v}_i}{\partial s}, \tag{2}$$
where $\hat{v}_i$ is the quantized value of the $i$-th element and $\sum_i$ is over all elements in the corresponding layer. The summation over all elements makes the gradient of the step size much larger than the gradient of the input value $v$. Hence, directly training the quantized network with a learnable step size is unstable. One possible solution is to rescale the gradient [7]. However, it is hard to determine the value of the scaling factor, and an improper scaling factor may hamper the performance of the quantized network.
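To make the scale mismatch concrete, the sketch below accumulates a step-size gradient over a growing number of elements. It assumes the per-element straight-through derivative used for step-size quantizers in [7] (the rounding residual inside the clipping range and the clipping bound outside it), a 2-bit activation quantizer, and an upstream gradient of 1 for every element; all names and values are illustrative.

```python
import math


def step_grad_per_element(v, step, q_n, q_p):
    """Assumed STE derivative of the quantized value with respect to the step size."""
    r = v / step
    if r <= -q_n:
        return -q_n
    if r >= q_p:
        return q_p
    return math.floor(r + 0.5) - r        # rounding residual inside the range


def step_size_grad(values, upstream, step, q_n, q_p):
    """Sum of per-element contributions, one term per element of the layer."""
    return sum(g * step_grad_per_element(v, step, q_n, q_p)
               for v, g in zip(values, upstream))


tile = [0.3, 0.9, 2.0]                    # hypothetical activations
step, q_n, q_p = 0.5, 0, 3                # 2-bit activation quantizer

for repeats in (1, 100, 10000):
    values = tile * repeats
    upstream = [1.0] * len(values)        # each element receives a gradient of 1
    grad = step_size_grad(values, upstream, step, q_n, q_p)
    print(len(values), "elements -> step-size gradient", round(grad, 3))
```

Each element sees an upstream gradient of magnitude 1, whereas the single step-size parameter accumulates a contribution from every element, so its gradient grows with the size of the layer.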
To solve this issue, we propose a learned interval quantization (LIQ) method that quantizes the network with a trainable interval instead of a step size, which avoids rescaling the gradient. The transformer and discretizer of the proposed LIQ are defined as follows.
Transformer: The transformers for the weights and the activations are defined as follows:
(3)  
(4) 
where the two trainable interval parameters, one for the weights and one for the activations, limit the range of the weights and activations, respectively.
Discretizer: The discretizer maps a continuous value produced by the transformer to a discrete value, and is defined as follows:
(5) 
where $n$ is the number of discrete values (excluding 0). Let $b$ be the bitwidth of the weights and activations of the quantized networks. Then $n$ can be computed as $n = 2^{b} - 1$. After quantizing the weights and activations, we apply an affine transformation to bring the weights and the activations to their respective target ranges.
Backpropagation with gradient approximation: In general, the discretizer function is non-differentiable, so the quantized network cannot be trained directly through backpropagation. To solve this issue, we use the straight-through estimator (STE) [1, 5, 36] to approximate the gradients. We can then compute the gradients with respect to the two interval parameters as follows:
(6) 
(7) 
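One way an interval-based quantizer of this kind could look is sketched below. The concrete forms are assumptions for illustration only: the value is divided by a trainable interval `alpha` and clipped, weights are shifted from $[-1, 1]$ to $[0, 1]$, the result is rounded onto $2^{b}-1$ non-zero levels, and an affine map returns it to the original range. The function names and the pass-through rule in `ste_input_grad` are likewise hypothetical.

```python
import math


def num_levels(bits):
    """Number of discrete values excluding 0."""
    return 2 ** bits - 1


def quantize_activation(x, alpha, bits):
    n = num_levels(bits)
    u = min(max(x / alpha, 0.0), 1.0)       # transformer: normalise and clip to [0, 1]
    q = math.floor(u * n + 0.5) / n         # discretizer: uniform levels in [0, 1]
    return q * alpha                        # assumed affine map back to [0, alpha]


def quantize_weight(w, alpha, bits):
    n = num_levels(bits)
    u = (min(max(w / alpha, -1.0), 1.0) + 1.0) / 2.0   # clip to [-1, 1], shift to [0, 1]
    q = math.floor(u * n + 0.5) / n
    return (2.0 * q - 1.0) * alpha          # assumed affine map back to [-alpha, alpha]


def ste_input_grad(x, alpha, upstream):
    """Straight-through estimator: pass the gradient through the rounding
    and zero it where the activation was clipped."""
    return upstream if 0.0 < x / alpha < 1.0 else 0.0


print(quantize_activation(0.7, alpha=1.0, bits=2))
print(quantize_weight(0.2, alpha=1.0, bits=2))
print(ste_input_grad(2.5, alpha=2.0, upstream=1.0))
```

In this form the interval parameter directly sets the clipping range, so it plays the role that the step size plays in Section 3.1 while the number of levels stays fixed by the bitwidth.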
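Table 1: Results of FCOS on COCO.
| Backbone | Model | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | Full precision [32] | 33.9 | 51.2 | 36.4 | 19.3 | 36.2 | 44.0 |
| ResNet-18 | GroupNet [41] (4 bases) | 28.9 | 45.3 | 31.2 | 15.4 | 30.5 | 38.1 |
| ResNet-18 | AQD (4-bit) | 35.2 | 52.7 | 37.8 | 20.3 | 37.2 | 46.1 |
| ResNet-18 | AQD (3-bit) | 34.1 | 51.4 | 36.7 | 19.1 | 35.8 | 45.2 |
| ResNet-18 | AQD (2-bit) | 30.6 | 47.3 | 32.4 | 16.6 | 31.9 | 41.3 |
| ResNet-34 | Full precision [32] | 38.0 | 55.9 | 41.0 | 23.0 | 40.3 | 49.4 |
| ResNet-34 | GroupNet [41] (4 bases) | 31.5 | 47.6 | 33.8 | 16.9 | 32.3 | 40.1 |
| ResNet-34 | AQD (4-bit) | 38.6 | 56.9 | 41.5 | 22.5 | 41.2 | 51.0 |
| ResNet-34 | AQD (3-bit) | 37.4 | 55.5 | 40.3 | 21.2 | 39.7 | 48.8 |
| ResNet-34 | AQD (2-bit) | 34.5 | 52.4 | 37.0 | 19.0 | 36.6 | 46.0 |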
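Table 2: Results of RetinaNet on COCO.
| Backbone | Model | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | Full precision [32] | 32.3 | 50.9 | 34.2 | 18.9 | 35.6 | 42.5 |
| ResNet-18 | FQN [18] (4-bit) | 28.6 | 46.9 | 29.9 | 14.9 | 31.2 | 38.7 |
| ResNet-18 | Auxi [38] (4-bit) | 31.9 | 50.4 | 33.7 | 16.5 | 34.6 | 42.3 |
| ResNet-18 | AQD (4-bit) | 34.0 | 53.1 | 36.3 | 18.8 | 37.2 | 45.3 |
| ResNet-18 | AQD (3-bit) | 32.8 | 51.7 | 34.9 | 18.1 | 35.1 | 44.6 |
| ResNet-18 | AQD (2-bit) | 29.6 | 48.1 | 15.9 | 31.7 | 15.9 | 41.1 |
| ResNet-34 | Full precision [32] | 36.3 | 56.2 | 39.1 | 22.4 | 39.8 | 46.9 |
| ResNet-34 | FQN [18] (4-bit) | 31.3 | 50.4 | 33.3 | 16.1 | 34.4 | 41.6 |
| ResNet-34 | Auxi [38] (4-bit) | 34.7 | 53.7 | 36.9 | 19.3 | 38.0 | 45.9 |
| ResNet-34 | AQD (4-bit) | 37.0 | 57.0 | 39.8 | 21.6 | 40.1 | 49.1 |
| ResNet-34 | AQD (3-bit) | 35.9 | 56.0 | 38.5 | 20.9 | 39.0 | 47.9 |
| ResNet-34 | AQD (2-bit) | 33.1 | 52.5 | 35.4 | 18.6 | 36.1 | 45.2 |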
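Table 3: Effect of multi-level BN on the quantized FCOS detector.
| Backbone | Model | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | AQD w/ Shared Sync BN | 29.5 | 46.6 | 31.7 | 19.0 | 32.8 | 35.8 |
| ResNet-18 | AQD w/ Multi-level Sync BN | 30.6 | 47.3 | 32.4 | 16.6 | 31.9 | 41.3 |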
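Table 4: Effect of quantizing different components of the detector.
| Backbone | Model | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | Full precision [32] | 33.9 | 51.2 | 36.4 | 19.3 | 36.2 | 44.0 |
| ResNet-18 | Backbone | 33.2 | 50.1 | 35.5 | 19.1 | 34.7 | 44.5 |
| ResNet-18 | Backbone + Feature Pyramid | 32.3 | 49.0 | 34.9 | 17.4 | 33.7 | 43.7 |
| ResNet-18 | Backbone + Feature Pyramid + Heads | 30.6 | 47.3 | 32.4 | 16.6 | 31.9 | 41.3 |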
4 Experiments
4.1 Compared methods
4.2 Data sets
We evaluate our proposed methods on the COCO detection benchmark [22], a large-scale benchmark data set for object detection that is widely used to evaluate detectors. Following [20, 41], we use the COCO trainval35k split (115K images) for training and the minival split (5K images) for validation.
4.3 Implementation details
We implement the proposed method based on Detectron2 [34]. We apply our AQD to two one-stage detectors, namely, FCOS [32] and RetinaNet [21]. Both contain three parts: a backbone, a feature pyramid, and detection heads. We use ResNet-18 and ResNet-34 [9] as backbones. Following [41, 37], we quantize all the layers in the network except the first layer in the backbone and the last layer in the detection heads. To stabilize the optimization, each convolution layer is followed by BN and ReLU. We do not fix the BN during training. We replace the BN with synchronized batch normalization (Sync BN) to fully exploit data across all devices. Following APOT [19], we apply weight normalization before weight quantization to provide a relatively consistent and stable input distribution.
Following [18, 37], all images in the training and validation sets are resized so that their shorter edges are 800 pixels. During training, images are augmented by random horizontal flipping. During evaluation, we do not perform any augmentation. We train the network for 90K iterations with a mini-batch size of 16. We use SGD with momentum for optimization. The learning rate starts at 0.01 and is divided by 10 at iterations 60K and 80K. We set the weight decay to 0.0001. More details on the other hyperparameter settings can be found in [21, 32].
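The step learning-rate schedule described above can be written as a small function. The function name and the choice to apply each decay exactly at the milestone iteration are assumptions; the base rate, milestones, and decay factor are taken from the text.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(60_000, 80_000), gamma=0.1):
    """Step schedule: multiply the base rate by gamma at each milestone passed."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


for it in (0, 59_999, 60_000, 80_000, 89_999):
    print(it, learning_rate(it))
```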
4.4 Comparisons on COCO
We compare the proposed methods with several state-of-the-art quantized models and report the results in Table 1 and Table 2. From the results, we make the following observations. First, our AQD outperforms the considered baselines on different detection frameworks and backbones. For example, our 4-bit RetinaNet detector with a ResNet-18 backbone outperforms FQN [18] and Auxi [38] by 5.4% and 2.1% AP, respectively. Second, our 4-bit quantized detectors even outperform the corresponding full-precision models. Specifically, on a 4-bit RetinaNet detector, our AQD surpasses the full-precision model by 1.7% AP with the ResNet-18 backbone. Third, with 3-bit quantization, our AQD achieves near-lossless performance compared with the full-precision counterpart. To be specific, on a 3-bit RetinaNet detector with a ResNet-34 backbone, our AQD incurs only a 0.4% degradation in AP. Fourth, with aggressive 2-bit quantization, our AQD still achieves comparable performance. For example, our 2-bit FCOS detector with a ResNet-18 backbone suffers only a 3.3% AP loss compared with its full-precision baseline. These results justify the superior performance of our proposed AQD.
4.5 Effect of multilevel batch normalization
To study the effect of multi-level BN, we quantize the FCOS detector with multi-level BN and with shared BN. Here, the detector with shared BN is the one whose batch normalization layers in the detection heads are shared across the pyramid levels. The results are shown in Table 3. The detector with multi-level Sync BN outperforms the one with shared Sync BN by 1.1% in AP, which demonstrates the effectiveness of the proposed multi-level BN.
4.6 Effect of learned interval quantization
To investigate the effect of LIQ, we quantize ResNet-18 to 2 bits with different quantization methods and evaluate the model on ImageNet [17]. Following the settings in LSQ [7], we train the quantized network for 90 epochs using a mini-batch size of 256. We use SGD with Nesterov momentum [26] for optimization. The momentum is set to 0.9. The learning rate is initialized to 0.01 and decreases to 0 following a cosine schedule [24]. We report the results in Table 5. From the results, we make the following observations. First, LIQ outperforms LSQ+ by 0.6% in Top-1 accuracy, which demonstrates the effectiveness of the learned interval quantization. Second, LIQ outperforms all the considered baselines, which shows the superior performance of our proposed LIQ.
4.7 Effect of quantization on different components
We study the effect of quantizing different components in object detection models. The results are shown in Table 4. Quantizing only the backbone leads to a small performance drop (0.7% in AP). In contrast, additionally quantizing the feature pyramid and detection heads causes significant performance degradation, raising the total drop to 3.3% in AP. These results show that the detector is sensitive to the quantization of the feature pyramid and detection heads, which suggests a direction for improving the performance of the quantized network.
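The drops quoted above follow directly from the AP column of Table 4. The short sketch below recomputes them, with `ap_drops` as a hypothetical helper that reports both the additional drop at each step and the cumulative drop relative to the full-precision model.

```python
# AP values copied from Table 4.
rows = [
    ("Full precision", 33.9),
    ("Backbone", 33.2),
    ("Backbone + Feature Pyramid", 32.3),
    ("Backbone + Feature Pyramid + Heads", 30.6),
]


def ap_drops(rows):
    """Return (name, additional drop, cumulative drop) for each quantized setting."""
    base = rows[0][1]
    prev = base
    out = []
    for name, value in rows[1:]:
        out.append((name, round(prev - value, 1), round(base - value, 1)))
        prev = value
    return out


for name, step_drop, total_drop in ap_drops(rows):
    print(f"{name}: +{step_drop} AP drop, {total_drop} AP total")
```

Read this way, the backbone accounts for 0.7 of the 3.3 AP lost, and the feature pyramid and heads together account for the remaining 2.6.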
5 Conclusions
In this paper, we have proposed an accurate quantized object detection (AQD) framework. We have first proposed multi-level batch normalization to capture batch statistics for the different detection heads. We have then proposed a learned interval quantization (LIQ) strategy to further improve the performance of the quantized network. To evaluate the proposed methods, we have applied our AQD to two classical one-stage detectors. Experimental results show that our quantized 3-bit detector achieves near-lossless performance compared with the full-precision counterpart. More importantly, our 4-bit detector even outperforms the full-precision model, by up to 1.7% AP on RetinaNet with a ResNet-18 backbone.
References

[1] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
[2] (2020) LSQ+: improving low-bit quantization through learnable offsets and better initialization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
[3] (2018) Hierarchical binary CNNs for landmark localization with limited resources. IEEE Trans. Pattern Anal. Mach. Intell.
[4] (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
[5] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
[6] (2019) ArcFace: additive angular margin loss for deep face recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4690–4699.
[7] (2020) Learned step size quantization. In Proc. Int. Conf. Learn. Repren.
[8] (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4852–4861.
[9] (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778.
[10] (2016) Identity mappings in deep residual networks. In Proc. Eur. Conf. Comp. Vis., pp. 630–645.
[11] (2017) Channel pruning for accelerating very deep neural networks. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1389–1397.
[12] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[13] (2016) Binarized neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 4107–4115.
[14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Mach. Learn., pp. 448–456.
[15] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2704–2713.
[16] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
[17] (2012) ImageNet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 1097–1105.
[18] (2019) Fully quantized network for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
[19] (2020) Additive powers-of-two quantization: an efficient non-uniform discretization for neural networks. In Proc. Int. Conf. Learn. Repren.
[20] (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125.
[21] (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2980–2988.
[22] (2014) Microsoft COCO: common objects in context. In Proc. Eur. Conf. Comp. Vis., pp. 740–755.
[23] (2017) SphereFace: deep hypersphere embedding for face recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 212–220.
[24] (2017) SGDR: stochastic gradient descent with warm restarts. In Proc. Int. Conf. Learn. Repren.
[25] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proc. IEEE Int. Conf. Comp. Vis., pp. 5058–5066.
[26] (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). In Proceedings of the USSR Academy of Sciences, Vol. 269, pp. 543–547.
[27] (2018) Efficient neural architecture search via parameter sharing. In Proc. Int. Conf. Mach. Learn., pp. 4092–4101.
[28] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. Eur. Conf. Comp. Vis., pp. 525–542.
[29] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 91–99.
[30] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4510–4520.
[31] (2015) FaceNet: a unified embedding for face recognition and clustering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 815–823.
[32] (2019) FCOS: fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis.
[33] (2018) Quantization mimic: towards very tiny CNN for object detection. In Proc. Eur. Conf. Comp. Vis.
[34] (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2
[35] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proc. Eur. Conf. Comp. Vis.
[36] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
[37] (2019) Training quantized network with auxiliary gradient module. arXiv preprint arXiv:1903.11236.
[38] (2020) Training quantized neural networks with a full-precision auxiliary module. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
[39] (2018) Towards effective low-bitwidth convolutional neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7920–7928.
[40] (2019) Structured binary neural networks for accurate image classification and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
[41] (2019) Structured binary neural networks for image recognition. arXiv preprint arXiv:1909.09934.
[42] (2018) Discrimination-aware channel pruning for deep neural networks. In Proc. Adv. Neural Inf. Process. Syst., S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 881–892.