Since 2012, convolutional neural networks (CNNs) have achieved great success in image classification [9, 17, 10], object detection [29, 21, 20], face recognition [6, 31, 23], etc. However, existing networks often have a large number of parameters and a high computational cost, which restricts their application on resource-limited devices such as smartphones, AR glasses, and drones. Network compression is an effective way to address this issue: it aims to reduce the parameters and computational cost of a network without significant performance degradation.
Existing studies on network compression focus on channel pruning [11, 25, 42], efficient architecture design [12, 30, 27], and network quantization [36, 39, 40]. In particular, network quantization directly reduces the model size by converting the network weights into low-precision (e.g., 4-bit or 2-bit) ones. As a result, the low-precision model achieves substantial memory savings. More importantly, network quantization also converts the activations into low precision. As a result, we can replace compute-intensive floating-point operations with lightweight fixed-point or bitwise operations, which greatly reduces the computational cost of the network.
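As a back-of-the-envelope illustration (ours, not from the paper), the weight-storage saving of b-bit quantization relative to 32-bit floating point is simply 32/b:

```python
def compression_ratio(bitwidth: int, full_precision_bits: int = 32) -> float:
    """Storage ratio of a full-precision model's weights to their quantized version."""
    return full_precision_bits / bitwidth

# 4-bit weights shrink weight storage by 8x, 2-bit weights by 16x.
print(compression_ratio(4), compression_ratio(2))
```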
Although promising results on tasks such as image classification have been reported [39, 16, 7] using the aforementioned quantization techniques, using quantized networks for more complex tasks such as object detection remains a challenge. Compared with image classification, object detection is more difficult: the detector not only performs object classification but also conducts bounding box regression, which makes the detection model harder to quantize. Existing methods [18, 15, 37] quantize the detector to 4 or 8 bits and achieve promising performance. However, lower-bitwidth (e.g., 2-bit) quantization incurs a significant performance drop. In this paper, we observe that the main challenge of quantization for object detection lies in the inaccurate batch statistics of batch normalization [14]. In a one-stage detector [32, 21], each detection head decodes scale-specific features. Different scales of features have different means and variances, and this discrepancy across feature levels leads to inaccurate statistics in shared batch normalization layers. The issue becomes even more severe when we perform low-precision quantization on the one-stage detector.
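A toy numeric illustration of this discrepancy (the activation values and level names below are hypothetical, chosen by us to mimic a fine and a coarse pyramid level):

```python
from statistics import mean

# Hypothetical activations from two pyramid levels feeding a shared head.
level_p3 = [0.2, 0.4, 0.6, 0.8]   # fine level: smaller activation magnitudes
level_p7 = [2.0, 4.0, 6.0, 8.0]   # coarse level: larger activation magnitudes

# A shared BN layer would normalize both levels with pooled statistics ...
pooled_mean = mean(level_p3 + level_p7)

# ... which matches neither level's own mean, so both are mis-normalized.
assert mean(level_p3) < pooled_mean < mean(level_p7)
```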
To address this issue, in this paper, we propose an accurate quantized object detection (AQD) method. Specifically, we construct multi-level batch normalization (multi-level BN) to obtain accurate batch statistics: we use an independent batch normalization layer for each pyramid level of the detection head, which captures accurate level-specific statistics. Moreover, we propose a learned interval quantization (LIQ) method to further improve the performance of the quantized network.
Our main contributions are summarized as follows.
We highlight that the poor performance of quantized detectors stems from the inaccurate statistics of batch normalization. We therefore propose multi-level batch normalization (multi-level BN) to capture accurate batch statistics for each scale of the feature pyramid. We further propose a learned interval quantization (LIQ) method that learns a trainable quantization interval instead of a step size.
We evaluate the proposed methods on the COCO detection benchmark with multiple precisions. Experimental results show that our 3-bit AQD achieves comparable performance with its full-precision counterpart. Notably, our 4-bit AQD even outperforms the 32-bit model.
2 Related Work
Network quantization. Network quantization aims to represent network weights and/or activations with very low precision, which reduces both model size and computational cost. Existing methods can be divided into two categories, namely binary quantization [13, 28, 3] and fixed-point quantization [36, 39, 35]. Binary quantization converts the full-precision weights and activations to 1-bit values (i.e., −1 or +1). In this way, matrix multiplications can be replaced with bitwise XNOR-popcount operations. As a result, a binary convolution layer can achieve up to 32× memory savings and substantial speedups on CPUs [28, 40]. Nevertheless, binary quantization incurs significant performance degradation compared with the full-precision counterparts. To reduce this performance gap, fixed-point quantization methods [36, 39, 35, 4] have been proposed to represent weights and activations with higher bitwidths, achieving impressive performance on image classification tasks.
Quantization on object detection. Many researchers have studied quantization for object detection to speed up on-device inference and save storage. Jacob et al. [15] propose a quantization scheme using integer-only arithmetic and perform object detection on the COCO dataset with quantized 8-bit models. Wei et al. [33] utilize knowledge distillation and quantization to train very tiny CNNs for object detection. Observing the instability problem during the fine-tuning stage of the quantization process, Li et al. [18] propose fully quantized 4-bit detectors based on RetinaNet and Faster R-CNN using three techniques: freezing batch normalization statistics, clamping activations based on percentiles, and quantizing with a channel-wise scheme. Zhuang et al. [37] point out the difficulty of propagating gradients and propose to train low-precision networks with a full-precision auxiliary module.
3 Proposed method
In this section, we describe the proposed accurate quantized object detection (AQD) method. We first introduce the problem definition in Section 3.1. Then, we describe the inaccurate batch statistics issue and the multi-level batch normalization (multi-level BN) in Section 3.2. Finally, we present the learned interval quantization (LIQ) in Section 3.3.
3.1 Problem definition
We consider building a quantized one-stage detector for object detection. A one-stage detector consists of a backbone, a feature pyramid, and prediction heads. For a convolutional layer in the detector, let $\mathbf{x}$ and $\mathbf{w}$ be the input and the weight parameters, and let $x_i$ and $w_i$ be the $i$-th elements of $\mathbf{x}$ and $\mathbf{w}$, respectively. Here, we omit the subscript $i$ for convenience. Quantization seeks to reduce the bitwidth of $x$ and $w$ via quantizers:
$$\bar{v} = D(T(v)), \quad v \in \{x, w\},$$
where a quantizer contains a transformer $T(\cdot)$ and a discretizer $D(\cdot)$. Following [7], the transformer transforms a value $v$ to $\hat{v} = \mathrm{clip}(v/s, -Q_N, Q_P)$ with a learnable step size $s$. Let $b$ be the bitwidth of the weights and activations of the quantized network. For weights, $Q_N$ and $Q_P$ are $2^{b-1}$ and $2^{b-1}-1$; for activations, $Q_N$ and $Q_P$ are $0$ and $2^b-1$. The discretizer $D(\cdot)$ maps the continuous value $\hat{v}$ in the range $[-Q_N, Q_P]$ to a discrete value $\bar{v}$.
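As a concrete sketch of this LSQ-style quantizer for a single value (the function name and the final rescaling by the step size are our own presentation, not the paper's code):

```python
def quantize(v: float, s: float, b: int, signed: bool) -> float:
    """LSQ-style quantizer sketch: transformer (scale + clip), then discretizer (round)."""
    if signed:                               # weights: symmetric range
        q_n, q_p = 2 ** (b - 1), 2 ** (b - 1) - 1
    else:                                    # activations: non-negative range
        q_n, q_p = 0, 2 ** b - 1
    v_hat = min(max(v / s, -q_n), q_p)       # transformer: clip(v / s, -Q_N, Q_P)
    return round(v_hat) * s                  # discretizer: nearest level, rescaled by s

# 2-bit signed example: 0.37 / 0.1 = 3.7 is clipped to Q_P = 1, giving 0.1.
print(quantize(0.37, s=0.1, b=2, signed=True))
```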
3.2 Multi-level batch normalization
During the training of a quantized detector, batch normalization [14] normalizes the input features and updates the exponential moving average (EMA) statistics $\mu_{ema}$ and $\sigma^2_{ema}$ with the current batch statistics $\mu$ and $\sigma^2$. In one-stage object detection frameworks, each prediction head encodes a corresponding feature level. However, due to the quantization process, there can be a large divergence in batch statistics between different feature levels, as shown in Figure 1. Therefore, sharing batch normalization across prediction heads may lead to inaccurate batch statistics, which causes a significant performance drop.
To solve this issue, we propose a simple yet effective method, called multi-level batch normalization (multi-level BN), which uses an independent batch normalization layer for each feature level. Multi-level BN captures the individual batch statistics of the corresponding feature level. Our method has two advantages over the standard shared-BN strategy. First, multi-level BN introduces only negligible additional parameters; in fact, they account for less than 1.1% of the model size. Second, multi-level BN does not change the architecture of the network, so it does not incur any additional computational cost. An empirical study of the effect of multi-level BN can be found in Section 4.5.
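The mechanism can be sketched as follows, in a framework-free toy form (the class name, the EMA momentum value, and the per-level dictionary layout are our own choices, not the paper's implementation):

```python
from statistics import mean, pvariance

class MultiLevelBN:
    """Sketch: one set of BN running statistics per pyramid level."""

    def __init__(self, levels, momentum=0.1, eps=1e-5):
        # Independent EMA statistics for each feature level.
        self.stats = {lv: {"mean": 0.0, "var": 1.0} for lv in levels}
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, level):
        mu, var = mean(x), pvariance(x)
        s = self.stats[level]            # update only this level's statistics
        s["mean"] = (1 - self.momentum) * s["mean"] + self.momentum * mu
        s["var"] = (1 - self.momentum) * s["var"] + self.momentum * var
        return [(v - mu) / (var + self.eps) ** 0.5 for v in x]

bn = MultiLevelBN(["P3", "P7"])
y = bn([2.0, 4.0], "P7")  # normalizing P7 leaves P3's statistics untouched
```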
3.3 Learned interval quantization
During the training of the quantized network, the gradient of the step size $s$ is given by
$$\frac{\partial \mathcal{L}}{\partial s} = \sum_i \frac{\partial \mathcal{L}}{\partial \bar{v}_i} \frac{\partial \bar{v}_i}{\partial s},$$
where the summation is over all elements in the corresponding layer. This summation makes the gradient of the step size much larger than the gradient of an input value $v$. Hence, directly training the quantized network with a learnable step size is unstable. One possible solution is to rescale the gradient $\partial \mathcal{L} / \partial s$. However, it is hard to determine the value of the scaling factor, and an improper gradient scaling factor may hamper the performance of the quantized network.
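For context, LSQ [7] copes with this imbalance by multiplying the step-size gradient by a fixed heuristic scale $1/\sqrt{N \cdot Q_P}$, where $N$ is the number of elements in the layer; a one-line sketch (function name ours):

```python
import math

def lsq_grad_scale(num_elements: int, q_p: int) -> float:
    """LSQ's heuristic gradient scale for the step size: 1 / sqrt(N * Q_P)."""
    return 1.0 / math.sqrt(num_elements * q_p)

# The scale shrinks as the layer grows, counteracting the summed gradient.
print(lsq_grad_scale(100, 1))
```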
To solve this issue, we propose a learned interval quantization (LIQ) method that quantizes the network with a trainable interval instead of a step size, which avoids rescaling the gradient. The transformer and discretizer of LIQ are defined as follows.
Transformer: The transformer maps the weights and activations to $\hat{w}$ and $\hat{x}$:
$$\hat{w} = \mathrm{clip}(w / \alpha, -1, 1), \quad \hat{x} = \mathrm{clip}(x / \beta, 0, 1),$$
where $\alpha$ and $\beta$ are trainable interval parameters that limit the range of the weights and activations.
Discretizer: The discretizer maps the continuous value $\hat{v} \in \{\hat{w}, \hat{x}\}$ to a discrete value:
$$\bar{v} = \frac{1}{n} \lfloor n \hat{v} \rceil,$$
where $n$ is the number of discrete values (except 0) and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer. Let $b$ be the bitwidth of the weights and activations of the quantized network. Then $n$ can be computed as $n = 2^{b-1} - 1$ for weights and $n = 2^b - 1$ for activations. After quantizing the weights and activations, we use an affine transformation to bring the range of the weights back to $[-\alpha, \alpha]$ and the range of the activations back to $[0, \beta]$.
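A minimal sketch of LIQ as read from the description above (the function names, and the exact choice of level counts per bitwidth, are our reconstruction rather than the paper's released code):

```python
def liq_quantize_weight(w: float, alpha: float, b: int) -> float:
    """Clip to [-alpha, alpha], normalize, round to n uniform levels, rescale back."""
    n = 2 ** (b - 1) - 1                      # assumed symmetric level count
    w_hat = min(max(w / alpha, -1.0), 1.0)    # transformer: normalize by interval
    return alpha * round(w_hat * n) / n       # discretizer + affine back to [-alpha, alpha]

def liq_quantize_activation(x: float, beta: float, b: int) -> float:
    """Clip to [0, beta], normalize, round to n uniform levels, rescale back."""
    n = 2 ** b - 1                            # assumed unsigned level count
    x_hat = min(max(x / beta, 0.0), 1.0)
    return beta * round(x_hat * n) / n
```

Note that the trainable quantities here are the interval endpoints alpha and beta, not a per-level step size, which is the distinction from the LSQ-style quantizer of Section 3.1.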
Back-propagation with gradient approximation: The discretizer is non-differentiable, so the quantized network cannot be trained directly through back-propagation. To address this, we use the straight-through estimator (STE) [1, 5, 36] to approximate the gradient of the discretizer, i.e., $\partial \bar{v} / \partial \hat{v} \approx 1$. The gradients of $\alpha$ and $\beta$ can then be computed through the chain rule.
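A sketch of the STE convention for a single value, in the zero-outside-the-clip-range variant commonly paired with clipped quantizers (an assumption on our part, since the exact gradient equations are omitted above):

```python
def ste_grad(upstream: float, v_hat: float, lower: float, upper: float) -> float:
    """Straight-through estimator: pass the gradient through the rounding op
    unchanged inside the clipping range [lower, upper], zero it outside."""
    return upstream if lower <= v_hat <= upper else 0.0
```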
| Method | AP | AP50 | AP75 | APS | APM | APL |
| Full precision | 33.9 | 51.2 | 36.4 | 19.3 | 36.2 | 44.0 |
| Group-Net [41] (4 bases) | 28.9 | 45.3 | 31.2 | 15.4 | 30.5 | 38.1 |
| Full precision | 38.0 | 55.9 | 41.0 | 23.0 | 40.3 | 49.4 |
| Group-Net [41] (4 bases) | 31.5 | 47.6 | 33.8 | 16.9 | 32.3 | 40.1 |
| Method | AP | AP50 | AP75 | APS | APM | APL |
| Full precision | 32.3 | 50.9 | 34.2 | 18.9 | 35.6 | 42.5 |
| FQN [18] (4-bit) | 28.6 | 46.9 | 29.9 | 14.9 | 31.2 | 38.7 |
| Auxi [38] (4-bit) | 31.9 | 50.4 | 33.7 | 16.5 | 34.6 | 42.3 |
| Full precision | 36.3 | 56.2 | 39.1 | 22.4 | 39.8 | 46.9 |
| FQN [18] (4-bit) | 31.3 | 50.4 | 33.3 | 16.1 | 34.4 | 41.6 |
| Auxi [38] (4-bit) | 34.7 | 53.7 | 36.9 | 19.3 | 38.0 | 45.9 |
| Method | AP | AP50 | AP75 | APS | APM | APL |
| AQD w/ Shared Sync BN | 29.5 | 46.6 | 31.7 | 19.0 | 32.8 | 35.8 |
| AQD w/ Multi-level Sync BN | 30.6 | 47.3 | 32.4 | 16.6 | 31.9 | 41.3 |
| Quantized components | AP | AP50 | AP75 | APS | APM | APL |
| Full precision | 33.9 | 51.2 | 36.4 | 19.3 | 36.2 | 44.0 |
| Backbone + Feature Pyramid | 32.3 | 49.0 | 34.9 | 17.4 | 33.7 | 43.7 |
| Backbone + Feature Pyramid + Heads | 30.6 | 47.3 | 32.4 | 16.6 | 31.9 | 41.3 |
4.1 Compared methods
4.2 Data sets
We evaluate our proposed methods on the COCO detection benchmark [22], a large-scale benchmark data set that is widely used to evaluate detector performance. Following [20, 41], we use the trainval35k split (115K images) for training and the minival split (5K images) for validation.
4.3 Implementation details
We implement the proposed method based on Detectron2 [34]. We apply AQD to two one-stage detectors, namely FCOS [32] and RetinaNet [21]. Both detectors consist of three parts: a backbone, a feature pyramid, and detection heads. We use ResNet-18 and ResNet-34 [9] as backbones. Following [41, 37], we quantize all the layers in the network except the first layer of the backbone and the last layer of the detection heads. To stabilize the optimization, each convolutional layer is followed by BN and ReLU, and we do not freeze the BN during training. We replace BN with synchronized batch normalization (Sync BN) to fully exploit the data across all devices. Following APoT [19], we apply weight normalization before weight quantization to provide a relatively consistent and stable input distribution.
Following [18, 37], all images in the training and validation sets are resized so that their shorter edges are 800 pixels. During training, images are augmented by random horizontal flipping. During evaluation, we do not perform any augmentation. We train the network for 90K iterations with a mini-batch size of 16. We use SGD with momentum for optimization. The learning rate starts at 0.01 and is divided by 10 at iterations 60K and 80K. We set the weight decay to 0.0001. More details on the other hyper-parameter settings can be found in [21, 32].
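The step schedule described above can be sketched as follows (a hypothetical helper of ours, not from the released code):

```python
def learning_rate(iteration: int, base_lr: float = 0.01,
                  milestones=(60_000, 80_000), gamma: float = 0.1) -> float:
    """Step LR schedule: divide the base LR by 10 at each milestone iteration."""
    drops = sum(iteration >= m for m in milestones)  # how many milestones passed
    return base_lr * gamma ** drops

# LR is 0.01 until 60K, 0.001 until 80K, then 0.0001 until 90K.
print(learning_rate(0), learning_rate(70_000), learning_rate(85_000))
```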
4.4 Comparisons on COCO
We compare the proposed methods with several state-of-the-art quantized models and report the results in Table 1 and Table 2. From the results, we have the following observations. First, our AQD outperforms the considered baselines on different detection frameworks and backbones. For example, our 4-bit RetinaNet detector with a ResNet-18 backbone outperforms FQN [18] and Auxi [38] by a large margin. Second, our 4-bit quantized detectors even outperform the corresponding full-precision models. Specifically, on a 4-bit RetinaNet detector, our AQD surpasses the full-precision model by 1.7% with the ResNet-18 backbone. Third, when performing 3-bit quantization, our AQD achieves near-lossless performance compared with the full-precision counterpart. To be specific, on a 3-bit RetinaNet detector with a ResNet-34 backbone, our AQD leads to only a 0.4% degradation in AP. Fourth, when conducting aggressive 2-bit quantization, our AQD still achieves comparable performance. For example, our 2-bit FCOS detector with a ResNet-18 backbone suffers only a 3.3% AP loss compared with its full-precision baseline. These results justify the superior performance of the proposed AQD.
4.5 Effect of multi-level batch normalization
To study the effect of multi-level BN, we quantize the FCOS detector with multi-level BN and with shared BN. Here, the detector with shared BN indicates that the batch normalization layers in the detection heads are shared across the pyramid levels. The results are shown in Table 3. The detector with multi-level Sync BN outperforms the one with shared Sync BN by 1.1% in AP, which demonstrates the effectiveness of the proposed multi-level BN.
4.6 Effect of learned interval quantization
To study the effect of LIQ, we evaluate it on image classification with ImageNet [17]. Following [7], we train the quantized network for 90 epochs using a mini-batch size of 256. We use SGD with Nesterov momentum [26] for optimization; the momentum is set to 0.9. The learning rate is initialized to 0.01 and decreases to 0 following a cosine schedule [24]. We report the results in Table 5. From the results, we have the following observations. First, LIQ outperforms LSQ+ [2] by 0.6% in Top-1 accuracy, which demonstrates the effectiveness of the learned interval quantization. Second, LIQ outperforms all the considered baselines, which shows its superior performance.
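The cosine schedule mentioned above, decaying from the initial learning rate to 0 over training, can be sketched as (hypothetical helper of ours):

```python
import math

def cosine_lr(epoch: int, total_epochs: int = 90, base_lr: float = 0.01) -> float:
    """Cosine LR schedule: base_lr at epoch 0, decaying smoothly to 0 at the end."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

# Starts at 0.01, reaches half the base LR mid-training, and ends at 0.
print(cosine_lr(0), cosine_lr(45), cosine_lr(90))
```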
4.7 Effect of quantization on different components
We study the effect of quantizing different components of the object detection model. The results are shown in Table 4. From the results, we have the following observations. Quantizing the backbone alone leads to only a small performance drop (0.7% in AP). Nevertheless, further quantizing the feature pyramid and detection heads causes significant performance degradation (3.3% in AP in total). These results show that the detector is sensitive to the quantization of the feature pyramid and detection heads, which suggests a direction for improving quantized detectors.
5 Conclusion
In this paper, we have proposed an accurate quantized object detection (AQD) framework. We first proposed multi-level batch normalization to capture accurate batch statistics for the different detection heads. We then proposed a learned interval quantization (LIQ) strategy to further improve the performance of the quantized network. To evaluate the proposed methods, we applied AQD to two classical one-stage detectors. Experimental results show that our quantized 3-bit detector achieves near-lossless performance compared with its full-precision counterpart. More importantly, our 4-bit detector even outperforms the full-precision model.
References
- [1] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- [2] (2020) LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [3] (2018) Hierarchical binary CNNs for landmark localization with limited resources. IEEE Trans. Pattern Anal. Mach. Intell.
- [4] (2018) PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
- [5] (2016) Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830.
- [6] (2019) ArcFace: Additive angular margin loss for deep face recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4690–4699.
- [7] (2020) Learned step size quantization. In Proc. Int. Conf. Learn. Repren.
- [8] (2019) Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4852–4861.
- [9] (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778.
- [10] (2016) Identity mappings in deep residual networks. In Proc. Eur. Conf. Comp. Vis., pp. 630–645.
- [11] (2017) Channel pruning for accelerating very deep neural networks. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1389–1397.
- [12] (2017) MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- [13] (2016) Binarized neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 4107–4115.
- [14] (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Mach. Learn., pp. 448–456.
- [15] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2704–2713.
- [16] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [17] (2012) ImageNet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 1097–1105.
- [18] (2019) Fully quantized network for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [19] (2020) Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In Proc. Int. Conf. Learn. Repren.
- [20] (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125.
- [21] (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2980–2988.
- [22] (2014) Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pp. 740–755.
- [23] (2017) SphereFace: Deep hypersphere embedding for face recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 212–220.
- [24] (2017) SGDR: Stochastic gradient descent with warm restarts. In Proc. Int. Conf. Learn. Repren.
- [25] (2017) ThiNet: A filter level pruning method for deep neural network compression. In Proc. IEEE Int. Conf. Comp. Vis., pp. 5058–5066.
- [26] (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). In Proceedings of the USSR Academy of Sciences, Vol. 269, pp. 543–547.
- [27] (2018) Efficient neural architecture search via parameter sharing. In Proc. Int. Conf. Mach. Learn., pp. 4092–4101.
- [28] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. Eur. Conf. Comp. Vis., pp. 525–542.
- [29] (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 91–99.
- [30] (2018) MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4510–4520.
- [31] (2015) FaceNet: A unified embedding for face recognition and clustering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 815–823.
- [32] (2019) FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis.
- [33] (2018) Quantization mimic: Towards very tiny CNN for object detection. In Proc. Eur. Conf. Comp. Vis.
- [34] (2019) Detectron2. https://github.com/facebookresearch/detectron2.
- [35] (2018) LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proc. Eur. Conf. Comp. Vis.
- [36] (2016) DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
- [37] (2019) Training quantized network with auxiliary gradient module. arXiv preprint arXiv:1903.11236.
- [38] (2020) Training quantized neural networks with a full-precision auxiliary module. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [39] (2018) Towards effective low-bitwidth convolutional neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7920–7928.
- [40] (2019) Structured binary neural networks for accurate image classification and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [41] (2019) Structured binary neural networks for image recognition. arXiv preprint arXiv:1909.09934.
- [42] (2018) Discrimination-aware channel pruning for deep neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 881–892.