Object detection has achieved noteworthy success with the development of convolutional neural networks (CNNs). In recent years, FCOS, an anchor-free approach, has been proposed as a strong alternative to mainstream anchor-based methods [12, 10]. This per-pixel framework achieves improved performance by modeling each object as dense key points, but it still provides insufficient localization information in terms of probabilistic confidence. Inspired by recent works on bounding box (bbox) regression with uncertainty [7, 3], we construct a localization uncertainty prediction based on FCOS to identify the locations of the most convincing key points. The dense key points-based model can learn the localization uncertainty from a Gaussian model parameterized with bbox regression values and corresponding variances.
The results are shown in Fig. 1 (a): the right side of the surfboard has relatively high uncertainty because it is occluded by the water; in other words, there is a large gap between the prediction and the ground truth for the submerged part of the surfboard (marked in red). Fig. 1 (b) shows box confidence maps highlighting areas with key points that have low localization uncertainties for each boundary of the bboxes (highlighted in red). As shown in these maps, it is difficult to find a common convincing key point for the four-directional boundaries of the surfboard, while this is not the case for the person. Therefore, a key point with an inaccurate prediction is inevitably chosen as the final key point for the surfboard.
Such inaccurate predictions need to be compensated by the convincing features for the boundaries of the bbox, which would otherwise be missed in the detection process of a conventional dense key points-based detector. To this end, we design a localization uncertainty-based attention that encodes features from both the convincing regions for the boundaries of the bbox and the central region of the object. It uses the box confidence maps to enhance the original features by exploiting convincing boundary features; such features also allow us to effectively refine the coarse predictions.
To verify the effectiveness of the proposed method, we build an uncertainty-aware dense detector (UADET) that learns the localization uncertainty and leverages box confidence maps as spatial attentions for feature refinement; we then evaluate the proposed UADET on the MS COCO benchmark. The experimental results show that our approach achieves a single-model, single-scale AP of 48.3, a large improvement over the baseline FCOS. Further, extensive experiments demonstrate that the UADET significantly outperforms various state-of-the-art object detectors.
2 Proposed Method
2.1 Localization Uncertainty
In Gaussian YOLOv3, due to its anchor-based design, localization uncertainty is modeled using the center point of each box, the box size, and the corresponding variances as Gaussian parameters. In this paper, localization uncertainty is instead modeled using a single Gaussian model for each of the bbox regression values $(l, t, r, b)$, i.e., the distances from a pixel location to the four boundaries of the bbox, together with the corresponding variances. A single-variate Gaussian distribution is as follows:

$$P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad (1)$$
where $\mu$ denotes the predicted bbox regression, $x$ denotes the bbox regression target, and $\sigma$ (standard deviation) denotes the localization uncertainty, whose value lies in $(0, 1)$ via a sigmoid function. For training, we design a negative log-likelihood (NLL) loss with the Gaussian parameters as follows:

$$L_{NLL} = -\frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} \sum_{k \in \{l,t,r,b\}} \log P\big(x^k \mid \mu^k, (\sigma^k)^2\big), \quad (2)$$

where $x^k$ denotes one of the four-directional bbox regression targets, and the corresponding Gaussian parameters $\mu^k$ and $\sigma^k$ play the roles of $\mu$ and $\sigma$ in Eq. (1), respectively. $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function, which is 1 if $c^*_{x,y} > 0$ and 0 otherwise, where $c^*_{x,y}$ denotes the classification label at pixel location $(x, y)$ of the feature map. The summation runs over the four-directional bbox regressions and the positive samples, and the cost is averaged by dividing by the number of positive samples, $N_{pos}$. $\lambda$ (= 0.2 in this paper) is the balance weight for $L_{NLL}$.
In contrast to Gaussian YOLOv3, we observe degraded performance when bbox regression is learned solely from $L_{NLL}$. Thus, we set the value of $\lambda$ such that its impact on bbox regression is smaller than that of the powerful GIoU loss, but still large enough to learn the standard deviation. By the design of $L_{NLL}$, the predicted localization uncertainty takes larger values when there are larger gaps between the predicted regressions and the corresponding targets, and vice versa. Therefore, we can utilize $(1 - \sigma)$ as the four-directional box confidence at each pixel location.
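As an illustration, the per-direction NLL term of Eq. (2) and the derived box confidence can be sketched as follows (a minimal NumPy sketch with hypothetical function names, not the actual training code):

```python
import numpy as np

def gaussian_nll(mu, sigma, target, lam=0.2):
    """Negative log-likelihood of `target` under N(mu, sigma^2) for one
    bbox regression direction.

    mu:     predicted bbox regression value
    sigma:  predicted localization uncertainty in (0, 1), from a sigmoid
    target: bbox regression target
    lam:    balance weight (0.2 in the paper)
    """
    nll = 0.5 * np.log(2.0 * np.pi * sigma ** 2) \
        + (target - mu) ** 2 / (2.0 * sigma ** 2)
    return lam * nll

def box_confidence(sigma):
    # A larger uncertainty yields a lower confidence; these values form the
    # box confidence maps used as spatial attention in Sec. 2.2.
    return 1.0 - sigma
```

In training, this term is summed over the four directions and the positive samples, then averaged by the number of positive samples.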
2.2 Uncertainty Attention Module
The dense key points-based detector typically focuses on the area at the center of the object, as this usually ensures powerful feature representation for predictions. Accordingly, the pixel location with the maximum classification score within the central area is selected as the final key point location for the initial predictions. However, the obtained box confidence maps show that the convincing regions for the boundaries of the bbox also maintain strong representations for bbox regression (see Fig. 1 (b)). Occasionally, such features can even be better representations for classification than the center feature of the object, for example in cases of occlusion by the background or unusually shaped objects. This means that we can compensate for initial predictions focusing on the center region of the object by exploiting the convincing features for the boundaries of the bbox, as indicated by localization uncertainty.
We thus propose a novel feature refinement module, the uncertainty attention module (UAM), which leverages box confidence maps as spatial attentions. As shown in Fig. 2, the UAM takes the last feature of the initial prediction as input, then generates a feature $F$ with $(4+1)C$ channels through a convolution layer. The $4C$ channels of $F$ correspond to the box confidence map of each boundary, while the remaining $C$ channels of $F$ correspond to the original feature representing the central area of the object. Each box confidence map is spatially multiplied with the corresponding $C$-channel group of $F$, and all the features are then concatenated. The concatenated feature $F'$ can be formulated as follows:

$$F' = \mathrm{concat}\big[(1-\sigma_l)\,F_{1:C},\ (1-\sigma_t)\,F_{C+1:2C},\ (1-\sigma_r)\,F_{2C+1:3C},\ (1-\sigma_b)\,F_{3C+1:4C},\ F_{4C+1:5C}\big],$$

where $F_{i:j}$ denotes the slice of $F$ from feature channel $i$ to $j$, and $\sigma_l$, $\sigma_t$, $\sigma_r$, and $\sigma_b$ respectively denote the localization uncertainties of the left, top, right, and bottom boundaries. Finally, the UAM produces an output feature with the same shape as the input feature through a convolution layer. In this paper, we apply $C = 256$ for the classification refinement branch and $C = 64$ for the bbox regression refinement branch.
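The weighting step inside the UAM can be sketched as follows (NumPy, with illustrative shapes and names; the actual module wraps this step between two convolution layers):

```python
import numpy as np

def uncertainty_attention(feat, sigmas, c):
    """Weight a (4+1)*c-channel feature with four box confidence maps.

    feat:   array of shape (5*c, H, W); the first 4*c channels pair with the
            four boundary confidence maps, the last c channels keep the
            original (center-focused) representation.
    sigmas: array of shape (4, H, W) holding the localization uncertainties
            of the left, top, right, and bottom boundaries.
    """
    conf = 1.0 - sigmas                      # box confidence maps
    parts = []
    for i in range(4):                       # weight each c-channel group
        parts.append(feat[i * c:(i + 1) * c] * conf[i][None, :, :])
    parts.append(feat[4 * c:])               # pass the center feature through
    return np.concatenate(parts, axis=0)     # same shape as the input
```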
2.3 Network Architecture
Fig. 2 illustrates the overall network architecture of the uncertainty-aware dense detector (UADET). The structures of the backbone and the feature pyramid network (FPN) are the same as those of FCOS, but the head structure is different. First, we attach the localization uncertainty prediction to the initial bbox regression branch. Then, we attach additional sub-branches for classification and bbox regression refinement using the UAM, which leverages the localization uncertainty. Each sub-branch refines the feature through the UAM and finally applies convolution layers to produce the predictions used for refinement. The UADET produces the final classification and bbox regression by combining the initial and refined results.
2.4 Loss Function
We model the sub-branches as a generated-anchor refinement problem. We treat the initial bbox prediction as an anchor generated from the pixel location of the feature. We then obtain the classification label by measuring the intersection over union (IoU) between the generated anchor and the ground-truth boxes. The label of the ground-truth box that has the maximum IoU with the generated anchor is assigned to the anchor. If the maximum IoU is under 0.6, we treat that anchor as background. We adopt focal loss for the classification refinement branch.
where $N_{pos}'$ denotes the number of positive samples from the above classification targeting strategy, while $s_r$ and $s_r^*$ respectively denote the classification score and target for refinement.
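The IoU-based labeling strategy for generated anchors can be sketched as follows (plain Python, hypothetical helper names; boxes are in (x1, y1, x2, y2) format):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_label(anchor, gt_boxes, gt_labels, thresh=0.6):
    """Assign the label of the max-IoU ground-truth box; background (None)
    when the maximum IoU is under `thresh`."""
    ious = [iou(anchor, g) for g in gt_boxes]
    best = max(range(len(ious)), key=lambda i: ious[i])
    return gt_labels[best] if ious[best] >= thresh else None
```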
For positive samples, the generated anchor is compensated by the offset to the assigned ground-truth box. The sub-branch for bbox regression refinement learns this offset through an L1 loss:

$$L_{reg}^{ref} = \frac{1}{N_{pos}'} \sum_{x,y} \mathbb{1}_{pos} \sum_{k \in \{l,t,r,b\}} \big|\Delta^k - \Delta^{*k}\big|,$$

where $\Delta^k$ and $\Delta^{*k}$ respectively denote the four-directional bbox regression offset and the corresponding target, and $\mathbb{1}_{pos}$ is the indicator function, which is 1 if the sample is positive and 0 otherwise. Finally, we define the total training loss as follows:

$$L_{total} = L_{FCOS} + L_{NLL} + L_{cls}^{ref} + L_{reg}^{ref},$$

where $L_{FCOS}$ denotes the loss of the baseline FCOS, and $L_{cls}^{ref}$ and $L_{reg}^{ref}$ denote the classification and bbox regression refinement losses described above.
The inference of the UADET is straightforward: First, we forward an input image through the network and obtain the initial classification score $s$, initial bbox regression $B$, centerness $c$, localization uncertainty $\sigma$, classification score for refinement $s_r$, and bbox regression offset $\Delta$ for each pixel location of the feature. Next, we adopt the final classification score as the square root of $s \cdot s_r$ and the final bbox regression as the summation of $B$ and $\Delta$. Then, we apply centerness-weighted NMS with a threshold of 0.6 to eliminate redundant detections.
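Assuming the elided symbols mean that the final score is the geometric mean (square root of the product) of the initial and refined classification scores, and that the final box is the initial regression plus the predicted offsets, the combination step can be sketched as (illustrative names, not the actual inference code):

```python
import math

def combine_predictions(cls_init, cls_refine, reg_init, reg_offset):
    """Combine initial and refined head outputs for one key point.

    cls_init:   initial classification score
    cls_refine: classification score from the refinement branch
    reg_init:   initial four-directional bbox regression
    reg_offset: predicted four-directional offsets
    """
    score = math.sqrt(cls_init * cls_refine)             # geometric mean
    box = [r + o for r, o in zip(reg_init, reg_offset)]  # compensated bbox
    return score, box
```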
3 Experiments
We evaluate our UADET on the large-scale detection benchmark MS COCO. Following common practice [14, 10, 16], we use the train2017 split for training and report the ablation results on the val2017 split. To compare our method with state-of-the-art detectors, we report our main results on the test-dev split by uploading our detection results to the evaluation server.
|Method||Backbone||AP||AP50||AP75||APS||APM||APL|
|Faster R-CNN w/ FPN ||ResNet-101||36.2||59.1||39.0||18.2||39.0||48.2|
|Mask R-CNN ||ResNet-101||38.2||60.3||41.7||20.1||41.1||50.2|
|Cascade R-CNN ||ResNet-101||42.8||62.1||46.3||23.7||45.5||55.2|
|Method||AP||AP50||AP75||APS||APM||APL|
|+ Cls. refinement||39.3||57.5||43.0||23.2||43.1||52.1|
|+ Reg. refinement||39.9||58.1||43.2||23.3||43.7||51.8|
3.1 Implementation and Training Details
Our implementation is based on MMDetection; unless otherwise specified, the default hyper-parameters of the baseline FCOS implementation are maintained. For the ablation results, we adopt ResNet-50 pretrained on ImageNet as the backbone. The network is trained with an initial learning rate of 0.01 and a mini-batch size of 16 for 90K iterations. The input images are resized to a scale of 1333×800 without changing the aspect ratio, then augmented by random flipping.
To allow for a fair comparison with the state-of-the-art detectors, following prior work on FCOS, we adopt larger backbone networks and a multi-scale training strategy. The shorter side of the input images is randomly resized within the range [640, 800], and the number of training iterations is doubled to 180K. At testing time, we adopt a single-scale testing strategy for all results.
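For reference, the hyper-parameters stated above can be collected as follows (illustrative Python dicts with assumed key names, not an actual MMDetection config):

```python
# Hyper-parameters for the ablation setting, as stated in the text.
ABLATION_CFG = {
    "backbone": "ResNet-50",     # ImageNet-pretrained
    "lr": 0.01,                  # initial learning rate
    "batch_size": 16,            # mini-batch size
    "iterations": 90_000,
    "img_scale": (1333, 800),    # resized without changing the aspect ratio
    "random_flip": True,
}

# For comparison with state-of-the-art detectors: larger backbones,
# multi-scale training, and a doubled schedule.
MAIN_CFG = {
    **ABLATION_CFG,
    "iterations": 180_000,
    "multiscale_range": (640, 800),  # shorter side randomly resized
}
```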
3.2 Ablation Study
We now assess the effectiveness of each component in our method. As presented in Table 2, the baseline performance is slightly improved when we attach the localization uncertainty prediction trained with $L_{NLL}$. With the refined features of the UAM and the IoU-based targeting strategy on the classification refinement branch, the performance is boosted to 39.3 AP. By adding the regression refinement branch to compensate for the initial bbox regressions, we obtain a further improvement to 39.9 AP.
We also verify the effectiveness of the UAM, as presented in Table 3. For one, the UAM outperforms a conventional convolutional layer. This indicates that the UAM extracts more effective feature representations by considering the convincing features for the boundaries of the bbox.
3.3 Comparison with State-of-the-art Detectors
We compare our UADET with recent state-of-the-art detectors on the MS COCO test-dev split. As listed in Table 1, the UADET achieves 44.0 AP, surpassing other methods with the same ResNet-101 backbone. With the ResNeXt-64x4d-101 backbone, the performance is further boosted to 46.0 AP, a substantial improvement over the baseline. By applying deformable convolution (DCN) at stages 3 and 4 of the backbone and at the last of the 4 repeated convolutions shown in Fig. 2, the performance improves to 46.4 AP with the ResNet-101 backbone. Finally, our best model, which combines the ResNeXt-64x4d-101 backbone and DCN, achieves 48.3 AP, thereby offering state-of-the-art performance as a single model with a single-scale testing strategy.
4 Conclusion
In this work, we propose a novel feature refinement method using localization uncertainty-based attention. We then build an uncertainty-aware dense detector (UADET) that utilizes this method to improve performance. The experimental results show that the UADET achieves remarkable performance compared to various detectors, including recent state-of-the-art ones.
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2018-0-01290, Development of an Open Dataset and Cognitive Processing Technology for the Recognition of Features Derived From Unstructured Human Motions Used in Self-driving Cars; No. B0101-15-0266, Development of a High-Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).
References
-  (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162. Cited by: Table 1.
-  (2019) MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §3.1.
-  (2019) Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 502–511. Cited by: §1, §2.1, §2.1.
-  (2019) CenterNet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578. Cited by: Table 1.
-  (2017) DSSD: deconvolutional single shot detector. CoRR, abs/1701.06659. Cited by: Table 1.
-  (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969. Cited by: Table 1.
-  (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §1.
-  (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: Table 1.
-  (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: Table 1.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: §1, §2.4, §2.4, Table 1, §3.
-  (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1, §3.
-  (2016) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1.
-  (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 658–666. Cited by: §2.1, §2.4.
-  (2019) FCOS: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636. Cited by: §1, §1, §2.3, §2.4, §3.1, §3.1, Table 1, Table 2, §3.
-  (2019) RepPoints: point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9657–9666. Cited by: Table 1.
-  (2020) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768. Cited by: Table 1, §3.
-  (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212. Cited by: Table 1.
-  (2019) FreeAnchor: learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, pp. 147–155. Cited by: Table 1.
-  (2019) Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 840–849. Cited by: Table 1.
-  (2019) Deformable ConvNets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §3.3.