Localization Uncertainty-Based Attention for Object Detection

by   Sanghun Park, et al.

Object detection has been applied in a wide variety of real-world scenarios, so detection algorithms must provide confidence in their results to ensure that appropriate decisions can be made based on them. Accordingly, several studies have investigated the probabilistic confidence of bounding box regression. However, such approaches have been restricted to anchor-based detectors, which use box confidence values as additional screening scores during non-maximum suppression (NMS). In this paper, we propose a more efficient uncertainty-aware dense detector (UADET) that predicts four-directional localization uncertainties via Gaussian modeling. Furthermore, a simple uncertainty attention module (UAM) that exploits box confidence maps is proposed to improve performance through feature refinement. Experiments on the MS COCO benchmark show that our UADET consistently surpasses the baseline FCOS, and that our best model, with a ResNeXt-64x4d-101-DCN backbone, obtains a single-model, single-scale AP of 48.3, the best result among various object detectors.




1 Introduction

Object detection has achieved noteworthy successes with the development of convolutional neural networks (CNNs). In recent years, FCOS [14], an anchor-free approach, has been suggested as a strong alternative to mainstream anchor-based methods [12, 10]. This per-pixel framework achieves improved performance by modeling each object as dense key points, but it still provides insufficient localization information in terms of probabilistic confidence. Inspired by recent works on bounding box (bbox) regression with uncertainty [7, 3], we construct a localization uncertainty prediction based on FCOS [14] to identify the locations of the most convincing key points. The dense key points-based model can learn the localization uncertainty from a Gaussian model parameterized with bbox regression values and corresponding variances.

The results are shown in Fig. 1 (a), where it can be seen that the right side of the surfboard has relatively high uncertainty because it is occluded by the water; in other words, there is a large gap between the prediction and the ground truth for the submerged part of the surfboard (marked in red). Fig. 1 (b) shows box confidence maps highlighting areas with key points that have low localization uncertainties for each boundary of the bboxes (highlighted in red). As these maps show, it is difficult to find a common convincing key point for the four-directional boundaries of the surfboard, whereas this is not the case for the person. Consequently, a key point with an inaccurate prediction is inevitably chosen as the final key point for the surfboard.

This inaccurate prediction needs to be compensated by the convincing features for the boundaries of the bbox, which would be missed in the detection process of a conventional dense key points-based detector. To this end, the localization uncertainty-based attention is designed to encode features from both the convincing regions for the boundaries of the bbox and the central region of the object. It uses the box confidence maps to enhance the original features by exploiting certain boundary features. Such features also allow us to effectively refine the coarse predictions.

To verify the effectiveness of the proposed method, we build an uncertainty-aware dense detector (UADET) that learns the localization uncertainty and leverages box confidence maps as spatial attentions for feature refinement; we then evaluate the proposed UADET on the MS COCO [11] benchmark. The experimental results show that our approach can achieve a single model, single-scale AP of 48.3, thereby representing a large improvement over the baseline FCOS [14]. Further, the results of extensive experiments demonstrate that the UADET significantly outperforms various state-of-the-art object detectors.

2 Proposed Method

2.1 Localization Uncertainty

In Gaussian YOLOv3 [3], owing to its anchor-based design, localization uncertainty is modeled using the center point of each box, the box size, and the corresponding variances as Gaussian parameters. In this paper, localization uncertainty is instead modeled using a single Gaussian model for each of the four bbox regression values $(l, t, r, b)$ together with the corresponding variances. A single-variate Gaussian distribution is as follows:

$$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \quad (1)$$

where $\mu$ denotes the predicted bbox regression, $x$ denotes the bbox regression target, and $\sigma$ (standard deviation) denotes the localization uncertainty, whose value is mapped into $(0, 1)$ with a sigmoid function. For training, we design a negative log-likelihood (NLL) loss with Gaussian parameters as follows:

$$L_{unc} = \frac{\lambda}{N_{pos}} \sum_{i} \sum_{k \in \{l, t, r, b\}} \mathbb{1}_{\{c^*_i > 0\}} \left( -\log P\!\left(x^k_i \mid \mu^k_i, \sigma^k_i\right) \right) \quad (2)$$

where $x^k_i$ denotes the $k$-th of the four-directional bbox regression targets, and the corresponding Gaussian parameters $\mu^k_i$ and $\sigma^k_i$ play the roles of $\mu$ and $\sigma$ in Eq. (1). $\mathbb{1}_{\{c^*_i > 0\}}$ is the indicator function, which is 1 if $c^*_i > 0$ and 0 otherwise, where $c^*_i$ denotes the classification label at pixel location $i$ of the feature. The summation is calculated over the four-directional bbox regressions and the positive samples, and the cost average is obtained by dividing by the number of positive samples, $N_{pos}$. $\lambda$ (= 0.2 in this paper) is the balance weight for $L_{unc}$.

In contrast to Gaussian YOLOv3 [3], we observe degraded performance when bbox regression is learned solely from $L_{unc}$. Thus, we set the value of $\lambda$ such that its impact on bbox regression is smaller than that of the powerful GIoU [13] loss, yet large enough to learn the standard deviation. Under $L_{unc}$, the localization uncertainty is predicted to take larger values when there are larger gaps between the predicted regressions and the corresponding targets, and vice versa. Therefore, we can utilize $(1 - \sigma)$ as the four-directional box confidence at each pixel location.
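As a concrete illustration, the uncertainty loss above can be sketched in NumPy. This is a minimal sketch rather than the authors' implementation; variable names such as `pos_mask` and `lam` are ours.

```python
import numpy as np

def gaussian_nll(mu, sigma, target):
    """Per-element negative log-likelihood of a univariate Gaussian.

    mu:     predicted bbox regression values, shape (N, 4) for (l, t, r, b)
    sigma:  predicted localization uncertainties (std. deviations), shape (N, 4)
    target: bbox regression targets, shape (N, 4)
    """
    var = sigma ** 2
    return 0.5 * np.log(2.0 * np.pi * var) + (target - mu) ** 2 / (2.0 * var)

def uncertainty_loss(mu, sigma, target, pos_mask, lam=0.2):
    """L_unc: NLL summed over positive samples and the four directions,
    averaged over the number of positives and scaled by the balance weight."""
    nll = gaussian_nll(mu, sigma, target)      # shape (N, 4)
    n_pos = max(int(pos_mask.sum()), 1)        # avoid division by zero
    return lam * nll[pos_mask].sum() / n_pos
```

Note that when a prediction matches its target and $\sigma = 1$, the per-element NLL reduces to $0.5\log(2\pi)$; smaller $\sigma$ at a matched prediction drives the NLL lower, which is what pushes confident locations toward small uncertainty.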

2.2 Uncertainty Attention Module

The dense key points-based detector typically focuses on the area at the center of the object, as this usually ensures a powerful feature representation for predictions. Accordingly, the pixel location with the maximum classification score within the central area is selected as the final key point location for the initial predictions. However, the obtained box confidence maps (see Fig. 1 (b)) show that the convincing regions for the boundaries of the bbox also maintain strong representations for bbox regression. Occasionally, such features can even be better representations for classification than the center feature of the object, e.g., in cases of occlusion by the background or unusually shaped objects. This means that we can compensate for the initial predictions, which focus on the center region of the object, by exploiting the convincing features for the boundaries of the bbox indicated by the localization uncertainty.

We thus propose a novel feature refinement module, the uncertainty attention module (UAM), which leverages box confidence maps as spatial attentions. As shown in Fig. 2, the UAM takes the last feature of the initial prediction as input, then generates a feature $F$ with $(4+1)C$ channels through a convolution layer. The four $C$-channel groups $F_1, \dots, F_4$ of $F$ correspond to the box confidence map of each boundary, while the remaining $C$ channels $F_5$ correspond to the original feature representing the central area of the object. Each box confidence map is multiplied spatially with the corresponding $F_k$, then all the features are concatenated. The concatenated feature can be formulated as follows:

$$F_{cat} = \mathrm{concat}\big((1-\sigma_l)\,F_1,\ (1-\sigma_t)\,F_2,\ (1-\sigma_r)\,F_3,\ (1-\sigma_b)\,F_4,\ F_5\big)$$

where $C$ denotes the number of feature channels and $\sigma_l$, $\sigma_t$, $\sigma_r$, and $\sigma_b$ respectively denote the localization uncertainties of the left, top, right, and bottom boundaries. Finally, the UAM produces an output feature with the same shape as the input feature through a convolution layer. In this paper, we apply C = 256 for the classification refinement branch and C = 64 for the bbox regression refinement branch.
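A minimal sketch of this attention operation, assuming channel-first `(C, H, W)` feature maps; the function name and layout are illustrative, not from the paper:

```python
import numpy as np

def uam_attention(feat, sigma, C=64):
    """Sketch of the UAM spatial-attention step.

    feat:  feature of shape (5*C, H, W) -- four C-channel boundary groups
           followed by one C-channel central-area group.
    sigma: localization uncertainties (sigma_l, sigma_t, sigma_r, sigma_b),
           each in (0, 1), stacked to shape (4, H, W).
    Returns the concatenated attended feature of shape (5*C, H, W).
    """
    conf = 1.0 - sigma                          # box confidence maps
    groups = [feat[k * C:(k + 1) * C] * conf[k]  # spatial attention per boundary
              for k in range(4)]
    groups.append(feat[4 * C:5 * C])            # central-area feature, untouched
    return np.concatenate(groups, axis=0)
```

In the full module, a convolution layer would follow this concatenation to restore the input shape, per the description above.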

Figure 2: Network architecture of the uncertainty-aware dense detector (UADET).

2.3 Network Architecture

Fig. 2 illustrates the overall network architecture of the uncertainty-aware dense detector (UADET). The structures of the backbone and the feature pyramid network (FPN) are the same as those of the FCOS [14], but the head structure is different. First, we attach the localization uncertainty prediction to the initial bbox regression branch. Then, we attach additional sub-branches for classification and bbox regression refinement using the UAM, which leverages localization uncertainty. Each sub-branch refines the feature through the UAM, and finally applies convolution layers to produce the prediction to be refined. The UADET predicts final classification and bbox regression by combining the existing and refined results.

2.4 Loss Function

We model the sub-branches as a generated-anchor refinement problem, treating the initial bbox prediction as an anchor generated from the pixel location of the feature. We then obtain the classification label by measuring the intersection over union (IoU) between the generated anchor and the ground-truth boxes: the label of the ground-truth box that has the maximum IoU with the generated anchor becomes the label of the anchor. If the maximum IoU is under 0.6, we treat that anchor as background. We adopt focal loss [10] for the classification refinement branch:

$$L_{cls\text{-}ref} = \frac{1}{N'_{pos}} \sum_{i} \mathrm{FL}(s_i, s^*_i)$$

where $N'_{pos}$ denotes the number of positive samples from the above classification targeting strategy, and $s_i$ and $s^*_i$ respectively denote the classification score and target for refinement.

For positive samples, the generated anchor is compensated by the offset to the assigned ground-truth box. The sub-branch for bbox regression refinement learns the offset through an L1 loss:

$$L_{reg\text{-}ref} = \frac{1}{N'_{pos}} \sum_{i} \mathbb{1}_{\{s^*_i > 0\}} \left\| \Delta_i - \Delta^*_i \right\|_1$$

where $\Delta_i$ and $\Delta^*_i$ respectively denote the four-directional bbox regression offset and the corresponding target, and $\mathbb{1}_{\{s^*_i > 0\}}$ is the indicator function, which is 1 if $s^*_i > 0$ and 0 otherwise. Finally, we define the total training loss as follows:

$$L = L_{cls} + L_{reg} + L_{cnt} + L_{unc} + L_{cls\text{-}ref} + L_{reg\text{-}ref}$$
Following FCOS [14], we adopt focal loss [10] for initial classification, GIoU loss [13] for initial regression, and binary cross-entropy loss for centerness.
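The IoU-based targeting strategy above can be sketched as follows; the function names and the `(x1, y1, x2, y2)` box format are our assumptions for illustration:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_label(anchor, gt_boxes, gt_labels, thresh=0.6):
    """Label a generated anchor with the class of its max-IoU ground-truth box,
    or 0 (background) when the maximum IoU falls under the threshold."""
    ious = [iou(anchor, gt) for gt in gt_boxes]
    best = int(np.argmax(ious))
    return gt_labels[best] if ious[best] >= thresh else 0
```

This mirrors the assignment rule in the text: the max-IoU ground-truth box donates its label, and anchors with maximum IoU under 0.6 become background.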

2.5 Inference

The inference of the UADET is straightforward. First, we forward an input image through the network and obtain, for each pixel location of the feature, the initial classification score $s$, initial bbox regression $B$, centerness $c$, localization uncertainty $\sigma$, classification score for refinement $s_r$, and bbox regression offset $\Delta$. Next, we adopt the square root of $s \cdot s_r$ as the final classification score and the summation of $B$ and $\Delta$ as the final bbox regression. Then, we process centerness-weighted NMS with a threshold of 0.6 to eliminate redundant detections.
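The inference step can be sketched as below, under the assumptions that boxes and offsets share the same `(x1, y1, x2, y2)` format and that "centerness-weighted" means candidates are ranked by score times centerness during NMS; both are illustrative simplifications, not the paper's exact implementation.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def uadet_postprocess(s, s_r, boxes, delta, ctr, iou_thresh=0.6):
    """Combine initial and refined predictions, then run a greedy NMS
    whose ranking is weighted by centerness."""
    scores = np.sqrt(s * s_r)           # final classification score
    final_boxes = boxes + delta         # final bbox regression
    rank = scores * ctr                 # centerness-weighted ranking
    order = np.argsort(-rank)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([box_iou(final_boxes[i], final_boxes[j]) for j in rest])
        order = rest[ious <= iou_thresh]
    return keep, scores, final_boxes
```

The centerness weighting only affects the suppression order; the reported detection scores remain $\sqrt{s \cdot s_r}$.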

3 Experiments

We evaluate our UADET on the large-scale detection benchmark MS COCO [11]. Following common practice [14, 10, 16], we use the train2017 split for training and report the ablation results on the val2017 split. To compare with state-of-the-art detectors, we report our main results on the test-dev split by uploading our detection results to the evaluation server.


Method Backbone AP AP50 AP75 APS APM APL
Faster R-CNN w/ FPN [9] ResNet-101 36.2 59.1 39.0 18.2 39.0 48.2
Mask R-CNN [6] ResNet-101 38.2 60.3 41.7 20.1 41.1 50.2
Cascade R-CNN [1] ResNet-101 42.8 62.1 46.3 23.7 45.5 55.2
DSSD513 [5] ResNet-101 33.2 53.3 35.2 13.0 35.4 51.1
RetinaNet [10] ResNet-101 39.1 59.1 42.3 21.8 42.7 50.2
RefineDet512+ [17] ResNet-101 41.8 62.9 45.7 25.6 45.1 54.1
FreeAnchor [18] ResNet-101 43.1 62.2 46.4 24.5 46.1 54.8
CornerNet [8] Hourglass-104 40.5 56.5 43.1 19.4 42.7 53.9
CenterNet [4] Hourglass-104 44.9 62.4 48.1 25.6 47.4 57.4
RepPoints [15] ResNet-101-DCN 45.0 66.1 49.0 26.6 48.6 57.5
FSAF [19] ResNet-101 40.9 61.5 44.0 24.0 44.2 51.3
ATSS [16] ResNet-101 43.6 62.1 47.4 26.1 47.0 53.6
FCOS [14] ResNeXt-64x4d-101 44.7 64.1 48.4 27.6 47.5 55.6
UADET (ours) ResNet-101 44.0 62.6 47.7 26.1 47.1 54.5
UADET (ours) ResNeXt-64x4d-101 46.0 65.0 50.0 28.6 48.9 56.8
UADET (ours) ResNet-101-DCN 46.4 65.1 50.5 27.7 49.4 58.6
UADET (ours) ResNeXt-64x4d-101-DCN 48.3 67.2 52.5 30.1 51.2 61.0


Table 1: Detection results (%) on the MS COCO test-dev split.


Method AP AP50 AP75 APS APM APL
FCOS[14] 38.6 57.4 41.4 22.3 42.5 49.8
+ Uncertainty 38.7 56.9 41.6 22.6 42.2 50.4
+ Cls. refinement 39.3 57.5 43.0 23.2 43.1 52.1
+ Reg. refinement 39.9 58.1 43.2 23.3 43.7 51.8


Table 2: Individual component contributions.


Method UAM AP AP50 AP75 APS APM APL
UADET ✗ 39.1 57.7 42.3 22.9 42.6 51.1
UADET ✓ 39.9 58.1 43.2 23.3 43.7 51.8


Table 3: Contribution of the UAM. The UAM is replaced with a 1×1 convolution layer in the first row.

3.1 Implementation and Training Details

We implement the UADET based on MMDetection [2]. Unless specified otherwise, the default hyper-parameters of the FCOS [14] implementation are maintained. For the ablation results, we adopt ResNet-50 pretrained on ImageNet as the backbone. The network is trained with an initial learning rate of 0.01 and a mini-batch size of 16 for 90K iterations. The input images are resized to a scale of 1333 × 800 without changing the aspect ratio, then augmented by random flipping.

To allow for a fair comparison with the state-of-the-art detectors, following prior work on FCOS [14], we adopt larger backbone networks and a multi-scale training strategy. The shorter side of the input images is randomly resized in the range [640, 800], and the number of training iterations is doubled to 180K. At testing time, we adopt a single-scale testing strategy for all results.

3.2 Ablation Study

We now assess the effectiveness of each component of our method. As presented in Table 2, the baseline performance is slightly improved when we attach the localization uncertainty prediction ($L_{unc}$). Owing to the refined features of the UAM and the IoU-based targeting strategy in the classification refinement branch, the performance is boosted to 39.3 AP. By adding the regression refinement branch to compensate for the initial bbox regressions, we obtain a further improvement to 39.9 AP.

We also verify the effectiveness of the UAM, as presented in Table 3. The UAM outperforms a conventional convolution layer, indicating that it extracts more effective feature representations by considering the convincing features for the boundaries of the bbox.

3.3 Comparison with State-of-the-art Detectors

We compare our UADET with recent state-of-the-art detectors on the MS COCO test-dev split. As listed in Table 1, the UADET achieves 44.0 AP, surpassing other methods with the same ResNet-101 backbone. With the ResNeXt-64x4d-101 backbone, the performance is further boosted to 46.0 AP, a substantial improvement over the baseline. By applying deformable convolutions [20] (DCN) at stages 3 and 4 of the backbone and at the last of the four repeated convolutions shown in Fig. 2, the performance improves to 46.4 AP with the ResNet-101 backbone. Finally, our best model, which combines the ResNeXt-64x4d-101 backbone with DCN, achieves 48.3 AP, offering state-of-the-art performance as a single model with a single-scale testing strategy.

4 Conclusions

In this work, we propose a novel feature refinement method using localization uncertainty-based attention. We then build an uncertainty-aware dense detector (UADET) that utilizes this novel method to improve performance. The experimental results show that UADET achieves remarkable performance compared to various detectors, including recent state-of-the-art detectors.

5 Acknowledgement

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2018-0-01290, Development of an Open Dataset and Cognitive Processing Technology for the Recognition of Features Derived From Unstructured Human Motions Used in Self-driving Cars; No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).


  • [1] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: Table 1.
  • [2] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) Mmdetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §3.1.
  • [3] J. Choi, D. Chun, H. Kim, and H. Lee (2019) Gaussian yolov3: an accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 502–511. Cited by: §1, §2.1, §2.1.
  • [4] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578. Cited by: Table 1.
  • [5] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) Dssd: deconvolutional single shot detector. CoRR, abs/1701.06659. Cited by: Table 1.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: Table 1.
  • [7] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §1.
  • [8] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: Table 1.
  • [9] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: Table 1.
  • [10] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §2.4, §2.4, Table 1, §3.
  • [11] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §3.
  • [12] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1.
  • [13] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 658–666. Cited by: §2.1, §2.4.
  • [14] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE international conference on computer vision, pp. 9627–9636. Cited by: §1, §1, §2.3, §2.4, §3.1, §3.1, Table 1, Table 2, §3.
  • [15] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) Reppoints: point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9657–9666. Cited by: Table 1.
  • [16] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2020) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768. Cited by: Table 1, §3.
  • [17] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4203–4212. Cited by: Table 1.
  • [18] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye (2019) Freeanchor: learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, pp. 147–155. Cited by: Table 1.
  • [19] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 840–849. Cited by: Table 1.
  • [20] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §3.3.