FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection

by Tai Wang, et al.
The Chinese University of Hong Kong

Monocular 3D object detection is an important task for autonomous driving because of its low cost. It is much more challenging than the conventional 2D case because it is inherently ill-posed, mainly due to the lack of depth information. Recent progress in 2D detection offers opportunities to better solve this problem. However, it is non-trivial to make a general 2D detector work on this 3D task. In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector and propose a general framework, FCOS3D. Specifically, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them into 2D and 3D attributes. Then the objects are distributed to different feature levels according to their 2D scales and assigned only according to the projected 3D-center during training. Furthermore, the center-ness is redefined with a 2D Gaussian distribution based on the 3D-center to fit the 3D target formulation. All of these make the framework simple yet effective, getting rid of any 2D detection or 2D-3D correspondence priors. Our solution achieves 1st place among all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020. Code and models are released at https://github.com/open-mmlab/mmdetection3d.



1 Introduction

Object detection is a fundamental problem in computer vision. It aims to identify objects of interest in the image and predict their categories and corresponding 2D bounding boxes. With the rapid progress of deep learning, 2D object detection has been well explored in recent years. Various models such as Faster R-CNN [25], RetinaNet [18], and FCOS [28] have greatly advanced the field and benefit various practical applications like autonomous driving.

However, 2D information is not enough for an intelligent agent to perceive the 3D real world. For example, for an autonomous vehicle to run smoothly and safely on the road, it must have accurate 3D information about the objects around it to make safe decisions. Therefore, 3D object detection is becoming increasingly important in these robotic applications. Most state-of-the-art methods [35, 14, 33, 29, 37, 36, 38] rely on the accurate 3D information provided by LiDAR point clouds, but installing expensive LiDARs on each vehicle is a heavy burden. Monocular 3D object detection, as a simple and cheap setting for deployment, has therefore become a meaningful research problem.

Since monocular 2D and 3D object detection share exactly the same input but have different outputs, a straightforward solution for monocular 3D object detection is to follow the practices of the 2D domain and add extra components to predict the additional 3D attributes of the objects. Some previous work [27, 20] keeps the prediction of 2D boxes and further regresses 3D attributes on top of 2D centers and regions of interest. Others [1, 9, 2] simultaneously predict 2D and 3D boxes with 3D priors corresponding to each 2D anchor. Another stream of methods based on redundant 3D information [13, 16] predicts extra keypoints to optimize the final results. In short, the core underlying challenge is how to assign 3D targets to the 2D domain through the 2D-3D correspondence and predict them afterwards.

In this technical report, we adopt a simple yet efficient method to enable a 2D detector to predict 3D localization. We first project the commonly defined 7-DoF 3D locations onto the 2D image and get the projected center point, which we call the 3D-center in contrast to the previous 2D-center. With this projection, the 3D-center actually contains 2.5D information, i.e., a 2D location and its corresponding depth. The 2D location can be further reduced to the 2D offset from a certain point on the image, which serves as the only 2D attribute that can be normalized among different feature levels as in 2D detection. In comparison, depth, 3D size and orientation are regarded as 3D attributes after decoupling. In this way, we transform the 3D targets with a center-based paradigm and avoid the need for any 2D detection or 2D-3D correspondence priors.

As a practical implementation, we build our method on FCOS [28], a simple anchor-free fully convolutional single-stage detector. We first distribute the objects to different feature levels according to their 2D scales. Then the regression targets of each training sample are assigned only according to the projected 3D-centers. In contrast to FCOS, which defines the center-ness by the distances to the box boundaries, we define the 3D center-ness with a 2D Gaussian distribution based on the 3D-center.

We evaluate our method on a popular large-scale dataset, nuScenes [3], and achieved 1st place on the camera track of this benchmark without any prior information. Moreover, we need only half the computing resources and one day to train a baseline model with performance comparable to the previous best open-source method, CenterNet [34], which makes training about 3x faster. Both show that our framework is simple and efficient. Detailed ablation studies show the importance of each component.

2 Related Work

Figure 2: An overview of our pipeline. To leverage well-developed 2D feature extractors, we basically follow the typical design of the backbone and neck for 2D detectors. For the detection head, we first reformulate the 3D targets with a center-based paradigm to decouple them as multi-task learning. The strategies for multi-level target assignment and center sampling are further adjusted accordingly to equip this framework with a better capability of handling overlapped ground truths and the scale variance problem.

2D Object Detection Research on 2D object detection has made great progress with the breakthrough of deep learning approaches. Modern methods can be divided into two branches, anchor-based and anchor-free, according to the basis of their initial guesses. Anchor-based methods [10, 25, 19, 24] benefit from predefined anchors, which make regression much easier, but have many hyper-parameters to tune. In contrast, anchor-free methods [12, 23, 28, 15, 34] do not need these prior settings and are thus neater and more general. For simplicity, this paper takes FCOS, a representative anchor-free detector, as the baseline considering its capability of handling overlapped ground truths and the scale variance problem.

From another perspective, monocular 3D detection is a more difficult task closely related to 2D detection. But little work has investigated the connection and difference between them, which leaves the two tasks isolated and unable to benefit from each other's advances. This report takes the adaptation of FCOS as an example and aims to build a closer connection between these two tasks.

Monocular 3D Object Detection Monocular 3D detection is more difficult than conventional 2D detection. The underlying key problem is the inconsistency between the 2D input modality and the 3D output predictions.

Methods involving sub-networks Earlier work uses sub-networks to assist 3D detection. 3DOP [4] and MLFusion [31] use a depth estimation network while Deep3DBox [21] uses a 2D object detector. They rely on the design and performance of these sub-networks, and even on external data and pretrained models, which makes training inconvenient and introduces additional system complexity.

Transform to 3D representation Another category converts the RGB input to other representations, as in OFTNet [26] and Pseudo-Lidar [30]. Although these methods have shown promising performance, they still rely on dense depth labels and hence are not regarded as pure monocular approaches. There is also a domain gap between different depth sensors and LiDARs, which makes it hard to generalize smoothly to a new practical setting. Furthermore, the efficiency of processing a large number of point clouds is also a significant issue when applying these methods in the real world.

End-to-end design like 2D detection Recent work notices these drawbacks and proposes end-to-end frameworks. M3D-RPN [1] implements a single-stage multi-class detector with an end-to-end region proposal network and depth-aware convolution. SS3D [13] proposes to detect 2D key points and further predicts object characteristics with uncertainties. MonoDIS [27] introduces a disentangling loss to reduce the instability of the training procedure. Some of them still have multiple training stages or a hand-crafted post-optimization phase, and all of these methods are anchor-based, so the consistency of 2D and 3D anchors needs to be determined. In contrast, anchor-free methods [34, 16, 5] do not need to collect statistics on the given data and generalize better to more complicated cases with more diverse classes or different intrinsic settings, so we choose to follow this paradigm.

Nevertheless, these works hardly study the key difficulty of applying a general 2D detector to monocular 3D detection. What should be kept or leveraged, and what should be adjusted or focused on in this procedure, is seldom discussed when new frameworks are proposed. In contrast, this technical report concentrates on exactly this point, which could provide a reference when applying a typical 2D detector framework to a closely related task. On this basis, a more in-depth understanding of the connection and difference between these two tasks will also benefit further research in both communities.

3 Approach

Object detection is one of the most essential and challenging problems for scene understanding. Conventional 2D object detection expects the model to predict 2D bounding boxes and category labels for each object of interest. In comparison, monocular 3D detection requires predicting 3D bounding boxes instead, which need to be decoupled and transformed to the 2D image plane as much as possible. In this section, we first present an overview of our framework with our adopted reformulation of 3D targets, and then elaborate on two corresponding technical designs tailored to this task, 2D guided multi-level 3D prediction and the center sampling strategy, which together make the 2D detector FCOS work on this 3D task.

3.1 Framework Overview

A fully convolutional one-stage detector typically consists of three components: a backbone for feature extraction, a neck for constructing multi-level branches and detection heads for dense predictions. We briefly introduce each of them below.

Firstly, we use ResNet101 [11] pretrained on ImageNet [8] with deformable convolutions [7] for feature extraction, which achieves a good trade-off between accuracy and efficiency in our experiments. We fix the parameters of the first convolutional block to avoid extra memory overhead.
The second module is the Feature Pyramid Network [17], which is a basic component for detecting objects at different scales. For clarity, we denote the feature maps from level 3 to 7 as P3 to P7, as shown in Figure 2. We follow the approach in the original FCOS to obtain P3 to P5 and simply downsample P5 with two convolutional blocks to obtain P6 and P7. All five feature maps are then responsible for predictions at different scales.
Finally for shared detection heads, there are two important issues to be dealt with. The first is how to assign targets to different levels of feature maps and different points, which is one of the core problems for different detectors and will be presented in the next subsection. The second is how to design the architecture. We follow the conventional design of RetinaNet [18] and FCOS [28]. Each shared head consists of 4 shared convolutional blocks and small heads for various regression targets. There are several alternative designs for the setting of small heads, which will be discussed in the experiments. From our experience, disentangled heads for targets with different measurements could benefit the training procedure. It appears more important for regression targets, including offset, depth, orientation, size and velocity. So we set one small head for each of them.
So far, we have introduced the overall design of our framework. Next, let’s formulate this problem more formally and present the detailed training and inference procedure.
Regression Targets Let's first recall the anchor-free formulation of object detection in FCOS. Given a feature map at layer $i$ of the backbone, denoted as $F_i \in \mathbb{R}^{H \times W \times C}$, we need to predict objects based on each point on this feature map, which correspond to uniformly distributed points on the original input image. Formally, for each location $(x, y)$ on the feature map $F_i$, suppose the total stride until layer $i$ is $s$; then the corresponding location on the original image is $(xs + \lfloor s/2 \rfloor, ys + \lfloor s/2 \rfloor)$. Unlike anchor-based detectors, which regress targets by taking predefined anchors as reference, we directly predict objects based on these locations. Moreover, because we do not rely on anchors, the criterion for judging whether a point is foreground is no longer the IOU between anchors and ground truths. Instead, as long as the point is close enough to the box center, it can be a foreground point.
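The mapping from feature-map locations to image locations can be sketched as follows. This is a minimal illustration assuming the FCOS convention that location $(x, y)$ at total stride $s$ maps to $(xs + \lfloor s/2 \rfloor, ys + \lfloor s/2 \rfloor)$; the function name `feature_points` is hypothetical and not taken from the released code.

```python
import numpy as np

def feature_points(height, width, stride):
    """Image-plane (x, y) coordinates for every location of a feature map.

    Assumes the FCOS convention: feature location (x, y) at total stride s
    corresponds to image location (x*s + s//2, y*s + s//2).
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    half = stride // 2
    return np.stack([xs * stride + half, ys * stride + half], axis=-1)

# A 2x3 feature map at stride 8 yields a (2, 3, 2) array of image points.
points = feature_points(2, 3, stride=8)
```

Each point is the center of the stride-sized image patch that the feature location covers, which is why predictions made at that location are interpreted relative to it.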

Figure 3: Our exploited rotation encoding scheme. Two objects with opposite orientation share the same rotation offset based on the 2-bin boundary, thus have the same value. To distinguish them, we predict an additional direction class from the regression branch.

In the 2D case, the model just needs to regress the distances from the point to the top/bottom/left/right sides, denoted as $t, b, l, r$ in Fig. 1. However, in the 3D case, it is non-trivial to regress the distances to the six faces of the 3D bounding box. Instead, a more straightforward implementation is to convert the commonly defined 7-DoF regression targets to the 2.5D center and 3D size, in which the 2.5D center can be easily transformed back to 3D space with the camera intrinsic matrix. Regressing the 2.5D center can be further reduced to regressing the offset from the center to a specific foreground point, $\Delta x, \Delta y$, and its corresponding depth $d$, respectively. In addition, to predict the allocentric orientation of the object, we follow [32] and divide it into two parts: an angle $\theta$ with period $\pi$ and a 2-bin direction classification. The first component naturally models the IOU of our predictions with the ground truth boxes, while the second component focuses on the adversarial case where two boxes have opposite orientations. Thanks to this angle encoding, our method surpasses another center-based framework, CenterNet, in terms of orientation accuracy, as compared in the experiments. The rotation encoding scheme is illustrated in Fig. 3.
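As a rough illustration of the rotation encoding, the sketch below splits a yaw angle into an offset with period $\pi$ plus a 2-bin direction class, and decodes it back. The bin boundaries at $0$ and $\pi$ and the function names are assumptions made for this example; the released code may place the boundary elsewhere.

```python
import math

def encode_yaw(yaw):
    """Split a yaw angle into (offset in [0, pi), direction bin in {0, 1}).

    Assumption: bin 0 covers [0, pi) and bin 1 covers [pi, 2*pi).
    """
    yaw = yaw % (2 * math.pi)          # wrap into [0, 2*pi)
    direction = int(yaw >= math.pi)    # which half of the circle
    offset = yaw - direction * math.pi
    return offset, direction

def decode_yaw(offset, direction):
    """Inverse of encode_yaw."""
    return offset + direction * math.pi

# Two opposite headings share the same offset and differ only in the bin.
front = encode_yaw(0.3)
back = encode_yaw(0.3 + math.pi)
```

Two boxes with opposite orientations overlap almost perfectly, so the regressed offset alone cannot tell them apart; the direction class carries that remaining bit.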
In addition to these regression targets related to the location and orientation of objects, we also regress a binary target $c$, namely center-ness, like FCOS. It serves as a soft binary classifier to determine which points are closer to centers, and helps suppress low-quality predictions far away from object centers. More details are presented in Sec. 3.3.

To sum up, the regression branch needs to predict $\Delta x, \Delta y, d, w, l, h, \theta, v_x, v_y$, the direction class $C_\theta$ and the center-ness $c$, while the classification branch needs to output the class label of the object and its attribute label.
Loss For classification and the different regression targets, we define their losses respectively and take their weighted sum as the total loss. Firstly, for the classification branch, we use the commonly used focal loss [18] as the object classification loss:

$$L_{cls} = -\alpha (1 - p)^{\gamma} \log p$$

where $p$ is the class probability of a predicted box, and we follow the settings of the original paper, $\alpha = 0.25$ and $\gamma = 2$. For attribute classification, we use a simple softmax classification loss, denoted as $L_{attr}$.
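For reference, the focal loss can be written in a few lines. This is a minimal sketch assuming the form $-\alpha (1 - p)^{\gamma} \log p$ with the defaults $\alpha = 0.25$ and $\gamma = 2$ from the focal loss paper [18], where `p` is the predicted probability of the ground-truth class; it is not the implementation used in the released code.

```python
import numpy as np

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for the predicted probability p of the true class."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)  # avoid log(0)
    return -alpha * (1.0 - p) ** gamma * np.log(p)

# Well-classified samples (p close to 1) contribute far less than hard ones.
easy, hard = focal_loss(0.9), focal_loss(0.5)
```

The $(1 - p)^{\gamma}$ factor down-weights the many easy background points that a dense detector evaluates, so the gradient is dominated by hard examples.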
For the regression branch, we use the smooth L1 loss for each regression target except center-ness, with corresponding weights considering their scales:

$$L_{loc} = \sum_{b \in (\Delta x, \Delta y, d, w, l, h, \theta, v_x, v_y)} \text{SmoothL1}(\Delta b)$$

where the weight of the $\Delta x, \Delta y, w, l, h, \theta$ error is 1, the weight of $d$ is 0.2 and the weight of $v_x, v_y$ is 0.05. It should be noted that although we employ $\exp(x)$ for depth prediction, we still compute the loss in the original depth space instead of the log space, which ultimately leads to more accurate object localization. We use the softmax classification loss and the binary cross entropy (BCE) loss for direction classification and center-ness regression, denoted as $L_{dir}$ and $L_{ct}$ respectively.
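Finally, the total loss is:

$$L = \frac{1}{N_{pos}} \left( \beta_{cls} L_{cls} + \beta_{attr} L_{attr} + \beta_{loc} L_{loc} + \beta_{dir} L_{dir} + \beta_{ct} L_{ct} \right)$$

where $N_{pos}$ is the number of positive predictions and $\beta_{cls} = \beta_{attr} = \beta_{loc} = \beta_{dir} = \beta_{ct} = 1$.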
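The weighted regression loss can be sketched as below. The per-target weights (1 for offset, size and rotation, 0.2 for depth, 0.05 for velocity) follow the text; the dictionary keys, the smooth L1 threshold `beta=1.0` and the function names are assumptions made for this example. Depth is decoded with `exp` first, so its error is measured in the original depth space rather than in log space.

```python
import math

# Weights as stated in the text; key names are illustrative only.
WEIGHTS = {"dx": 1.0, "dy": 1.0, "d": 0.2, "w": 1.0, "l": 1.0,
           "h": 1.0, "theta": 1.0, "vx": 0.05, "vy": 0.05}

def smooth_l1(error, beta=1.0):
    """Smooth L1 loss of a scalar error (beta is an assumed threshold)."""
    error = abs(error)
    return 0.5 * error ** 2 / beta if error < beta else error - 0.5 * beta

def loc_loss(raw, target):
    """Weighted smooth L1 over all regression targets of one sample."""
    decoded = dict(raw)
    decoded["d"] = math.exp(raw["d"])  # network predicts log-depth
    return sum(w * smooth_l1(decoded[k] - target[k]) for k, w in WEIGHTS.items())

target = {k: 0.0 for k in WEIGHTS}
target["d"] = 10.0
raw = {k: 0.0 for k in WEIGHTS}
raw["d"] = math.log(12.0)        # decodes to 12 m, i.e. 2 m of depth error
loss = loc_loss(raw, target)     # 0.2 * (2 - 0.5) = 0.3
```

Computing the depth term after decoding means a 2 m error costs the same at 10 m as at 50 m, whereas a log-space loss would penalize the distant error far less.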

Inference For inference, given an input image, we just need to forward it through the framework and obtain bounding boxes with their class scores, attribute scores and center-ness predictions. We multiply the class score by the center-ness to obtain the confidence of each prediction and, like most 3D detectors, conduct rotated Non-Maximum Suppression (NMS) in the bird's-eye view to get the final results. Note that there are some transformations in the process, like rotation decoding and projecting the 2.5D center back to 3D space, which are basically the inverse of the data preprocessing procedures.
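Two of the decoding steps mentioned above can be sketched as follows: lifting a 2.5D center back to 3D camera coordinates with the intrinsic matrix, and combining class score and center-ness into a confidence. This assumes a standard pinhole model with a 3x3 intrinsic matrix `K`; the function names and the example intrinsics are hypothetical.

```python
import numpy as np

def project_center(center_3d, K):
    """Project a 3D center in camera coordinates to (u, v, depth)."""
    uvd = K @ np.asarray(center_3d, dtype=float)
    return uvd[0] / uvd[2], uvd[1] / uvd[2], uvd[2]

def unproject_center(u, v, depth, K):
    """Lift a 2.5D center (pixel location plus depth) back to 3D."""
    pixel = np.array([u * depth, v * depth, depth], dtype=float)
    return np.linalg.solve(K, pixel)

def confidence(class_score, centerness):
    """Final ranking score used before NMS."""
    return class_score * centerness

# Illustrative intrinsics: focal length 1000 px, principal point (800, 450).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
u, v, d = project_center([2.0, 1.0, 20.0], K)
center = unproject_center(u, v, d, K)
```

The round trip shows why the 2.5D center is a lossless reparameterization of the 3D center once the intrinsics are known.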

Figure 4: Our proposed distance-based target assignment for dealing with ambiguity case could significantly improve the best possible recall (BPR) for each class, especially for large objects like trailers. Construction vehicle and traffic cone are abbreviated as CV and TC in this figure.

3.2 2D Guided Multi-Level 3D Prediction

The FCOS authors discuss two of the most important issues of target assignment: 1) the Best Possible Recall (BPR) of anchor-free versus anchor-based detectors, and 2) the intractable ambiguity caused by overlapping ground-truth boxes. The first issue has been well verified by the comparison in the original paper, which shows that multi-level prediction through FPN improves BPR and can even achieve better results than anchor-based methods. This conclusion also applies to our adapted framework. The second issue involves the specific setting of the regression target, which we discuss next.
In the original FCOS, objects of different sizes are detected at different levels of feature maps. Unlike anchor-based methods, which assign anchors of different sizes, FCOS directly assigns ground-truth boxes of different sizes to different feature levels. Formally, it first computes the 2D regression targets, $l^*, r^*, t^*, b^*$, for each location at each feature level. A location satisfying $\max(l^*, r^*, t^*, b^*) > m_i$ or $\max(l^*, r^*, t^*, b^*) < m_{i-1}$ is regarded as a negative sample, where $m_i$ denotes the maximum regression range for feature level $i$ (we set the regression ranges to $(0, 48, 96, 192, 384, \infty)$ for $m_2$ to $m_7$ in our experiments). We follow this criterion in our implementation, since the 2D scale of an object directly determines how large a region we need to focus on. However, we use 2D detection only for filtering meaningless targets in this assignment step. After target assignment, our regression targets include only 3D-related ones. Note that we generate the 2D bounding boxes by computing the minimum and maximum coordinates of the 8 projected vertices of the 3D bounding boxes, so we do not need any 2D detection annotations or priors.
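The level assignment can be sketched as below. The sketch assumes the FCOS convention that the last regression range is unbounded, i.e. ranges $(0, 48), (48, 96), (96, 192), (192, 384), (384, \infty)$ for P3 to P7; the function names and the example corners are hypothetical. When the longest side distance falls exactly on a boundary, this sketch simply returns the lower level.

```python
import numpy as np

# Assumed regression ranges for P3..P7 (upper bound of P7 taken as infinity).
REGRESS_RANGES = [(0, 48), (48, 96), (96, 192), (192, 384), (384, float("inf"))]

def box_from_corners(corners_2d):
    """Axis-aligned 2D box enclosing the 8 projected vertices of a 3D box."""
    corners_2d = np.asarray(corners_2d, dtype=float)
    x1, y1 = corners_2d.min(axis=0)
    x2, y2 = corners_2d.max(axis=0)
    return float(x1), float(y1), float(x2), float(y2)

def level_for_point(point, box):
    """Index into REGRESS_RANGES for this point, or None if it is negative."""
    x, y = point
    x1, y1, x2, y2 = box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
    if min(l, t, r, b) <= 0:          # point lies outside the 2D box
        return None
    longest = max(l, t, r, b)
    for level, (low, high) in enumerate(REGRESS_RANGES):
        if low <= longest <= high:
            return level
    return None

corners = [(100, 110), (200, 100), (190, 160), (110, 150),
           (120, 105), (180, 108), (170, 140), (130, 145)]
box = box_from_corners(corners)       # (100, 100, 200, 160)
```

The 2D box here is only a by-product of the 3D annotation, which is how the method avoids needing separate 2D labels.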
Next, let us discuss how to deal with the ambiguity problem, i.e., which box should be assigned to a point that lies inside multiple ground truth boxes at the same feature level. The usual way is to select according to the area of the 2D bounding box: the box with the smaller area is selected as the target box for this point. We call this scheme the area-based criterion. Large objects clearly receive less attention under this scheme, which is also verified by our experiments (Figure 4). Since our regression targets are center-based, and points closer to the object center can obtain more comprehensive and balanced local region features and thus produce higher-quality predictions, we propose a distance-based criterion, i.e., selecting the box with the closer center as the regression target. Through simple verification (Figure 4), we find that this scheme significantly improves the best possible recall (BPR) and mAP of large objects, and also improves the overall mAP by about 1%, as presented in the ablation study.
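The difference between the two criteria can be illustrated with a small sketch. The function name `pick_box` and the box/center layout are assumptions for this example; boxes are `(x1, y1, x2, y2)` and centers are the projected 3D-centers.

```python
def pick_box(point, boxes, centers, criterion="distance"):
    """Choose which ground-truth box a point is assigned to.

    criterion="area":     the smallest enclosing 2D box wins (FCOS default).
    criterion="distance": the box with the closest projected 3D-center wins.
    Returns the index of the chosen box, or None if no box contains the point.
    """
    px, py = point
    inside = [i for i, (x1, y1, x2, y2) in enumerate(boxes)
              if x1 < px < x2 and y1 < py < y2]
    if not inside:
        return None
    if criterion == "area":
        def key(i):
            return (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
    else:
        def key(i):
            return (px - centers[i][0]) ** 2 + (py - centers[i][1]) ** 2
    return min(inside, key=key)

# A large object (index 0) partially overlapped by a smaller one (index 1).
boxes = [(0, 0, 200, 100), (90, 40, 190, 90)]
centers = [(100, 50), (140, 65)]
by_area = pick_box((100, 50), boxes, centers, criterion="area")
by_dist = pick_box((100, 50), boxes, centers, criterion="distance")
```

In this example the point sits exactly on the large object's center, yet the area-based rule hands it to the smaller box; the distance-based rule keeps it with the large object.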
In addition to the center-based approach for dealing with ambiguity, we also use the 3D-center to determine foreground points, i.e., only the points close enough to the center are regarded as positive samples. We define a hyper-parameter, radius, to measure this central portion. Points whose distance to the object center is smaller than radius $\times$ stride are considered positive, where radius is set to 1.5 in our experiments.
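A minimal sketch of this center sampling rule follows. It uses the Euclidean distance, as the text describes; an implementation could equally use a square region of the same half-width, so treat the distance metric and the function name as assumptions.

```python
import math

def is_positive(point, center, stride, radius=1.5):
    """True if an image point is within radius * stride of the 3D-center."""
    distance = math.hypot(point[0] - center[0], point[1] - center[1])
    return distance < radius * stride

# The same 20-pixel distance is negative at stride 8 but positive at stride 16.
near = is_positive((104, 100), (100, 100), stride=8)
far_p3 = is_positive((120, 100), (100, 100), stride=8)
far_p4 = is_positive((120, 100), (100, 100), stride=16)
```

Scaling the threshold by the stride keeps the number of positive locations per object roughly constant across feature levels.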
Finally, we also follow the original FCOS in distinguishing the shared heads across feature levels, i.e., replacing each output $x$ of the different regression branches with $s_i x$, where $s_i$ is a trainable scalar used for adjusting the base of the exponential function for feature level $i$. This also brings a minor improvement in detection performance.

| Methods | Dataset | Modality | mAP | mATE | mASE | mAOE | mAVE | mAAE | NDS |
|---|---|---|---|---|---|---|---|---|---|
| CenterFusion [22] | test | Camera & Radar | 0.326 | 0.631 | 0.261 | 0.516 | 0.614 | 0.115 | 0.449 |
| PointPillars [14] | test | LiDAR | 0.305 | 0.517 | 0.290 | 0.500 | 0.316 | 0.368 | 0.453 |
| MEGVII [36] | test | LiDAR | 0.528 | 0.300 | 0.247 | 0.379 | 0.245 | 0.140 | 0.633 |
| LRM0 | test | Camera | 0.294 | 0.752 | 0.265 | 0.603 | 1.582 | 0.14 | 0.371 |
| MonoDIS [27] | test | Camera | 0.304 | 0.738 | 0.263 | 0.546 | 1.553 | 0.134 | 0.384 |
| CenterNet [34] (HGLS) | test | Camera | 0.338 | 0.658 | 0.255 | 0.629 | 1.629 | 0.142 | 0.4 |
| Noah CV Lab | test | Camera | 0.331 | 0.660 | 0.262 | 0.354 | 1.663 | 0.198 | 0.418 |
| FCOS3D (Ours) | test | Camera | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124 | 0.428 |
| CenterNet [34] (DLA) | val | Camera | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 | 0.328 |
| FCOS3D (Ours) | val | Camera | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | 0.415 |

Table 1: Results on the nuScenes dataset.

3.3 3D Center-ness with 2D Gaussian Distribution

In the original design of FCOS, center-ness is defined by the 2D regression targets, $l^*, r^*, t^*, b^*$:

$$c = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

Because our regression targets are changed to the 3D center-based paradigm, we define center-ness by a 2D Gaussian distribution with the projected 3D-center as the origin. The 2D Gaussian distribution is simplified as:

$$c = e^{-\alpha \left( (\Delta x)^2 + (\Delta y)^2 \right)}$$

where $\alpha$ is used to adjust the intensity attenuation from the center to the periphery and is set to 2.5 in our experiments. We take it as the ground truth of center-ness and predict it from the regression branch for filtering low-quality predictions later. As mentioned earlier, this center-ness target ranges from 0 to 1, so we use the Binary Cross Entropy (BCE) loss for training that branch.
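The two center-ness definitions can be compared with a short sketch. It assumes the Gaussian form $c = e^{-\alpha((\Delta x)^2 + (\Delta y)^2)}$ with $\alpha = 2.5$ and offsets expressed in normalized (stride) units; the function names are hypothetical.

```python
import math

def centerness_fcos(l, t, r, b):
    """Original FCOS center-ness from distances to the four box sides."""
    return math.sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b))

def centerness_3d(dx, dy, alpha=2.5):
    """Gaussian center-ness from the offset to the projected 3D-center."""
    return math.exp(-alpha * (dx ** 2 + dy ** 2))

at_center = centerness_3d(0.0, 0.0)    # 1.0 exactly at the 3D-center
one_away = centerness_3d(1.0, 0.0)     # exp(-2.5), roughly 0.08
fcos_off = centerness_fcos(5, 10, 20, 10)
```

The FCOS version needs the 2D box sides, while the Gaussian version depends only on the offset to the 3D-center, which matches the center-based regression targets.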

4 Experimental Setup

4.1 Dataset

We evaluate our framework on a large-scale, commonly used dataset, nuScenes [3]. The nuScenes dataset consists of multi-modal data collected from 1000 scenes, including RGB images from 6 surrounding cameras and points from 5 Radars and 1 LiDAR. It is split into 700/150/150 scenes for training/validation/testing. Overall, there are 1.4M annotated 3D bounding boxes from 10 categories. Given the variety of its scenes and ground truths, it is becoming one of the most convincing benchmarks for 3D object detection. Therefore, we take it as the platform to validate the efficacy of our method.

| Methods | car | truck | bus | trailer | CV | ped | motor | bicycle | TC | barrier | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LRM0 | 0.467 | 0.21 | 0.17 | 0.149 | 0.061 | 0.359 | 0.287 | 0.246 | 0.476 | 0.512 | 0.294 |
| MonoDIS [27] | 0.478 | 0.22 | 0.188 | 0.176 | 0.074 | 0.37 | 0.29 | 0.245 | 0.487 | 0.511 | 0.304 |
| CenterNet [34] (HGLS) | 0.536 | 0.27 | 0.248 | 0.251 | 0.086 | 0.375 | 0.291 | 0.207 | 0.583 | 0.533 | 0.338 |
| Noah CV Lab | 0.515 | 0.278 | 0.249 | 0.213 | 0.066 | 0.404 | 0.338 | 0.237 | 0.522 | 0.49 | 0.331 |
| FCOS3D (Ours) | 0.524 | 0.27 | 0.277 | 0.255 | 0.117 | 0.397 | 0.345 | 0.298 | 0.557 | 0.538 | 0.358 |

Table 2: Average precision for each class on the nuScenes test benchmark. CV and TC are abbreviations of construction vehicle and traffic cone.

| Methods | mAP | mATE | mASE | mAOE | mAVE | mAAE | NDS |
|---|---|---|---|---|---|---|---|
| Baseline | 0.227 | 0.868 | 0.272 | 0.778 | 1.326 | 0.393 | 0.282 |
| +depth loss in original space | 0.25 | 0.838 | 0.268 | 0.892 | 1.33 | 0.413 | 0.284 |
| +flip augmentation | 0.248 | 0.85 | 0.267 | 1.016 | 1.358 | 0.268 | 0.286 |
| +dist-based target assign & attr pred | 0.257 | 0.832 | 0.268 | 0.852 | 1.2 | 0.18 | 0.316 |
| +global NMS | 0.26 | 0.828 | 0.267 | 0.85 | 1.371 | 0.18 | 0.317 |
| +ResNet101 | 0.272 | 0.821 | 0.265 | 0.81 | 1.379 | 0.17 | 0.329 |
| +disentangle heads | 0.28 | 0.822 | 0.274 | 0.64 | 1.305 | 0.177 | 0.349 |
| +DCN in backbone | 0.295 | 0.806 | 0.268 | 0.511 | 1.315 | 0.17 | 0.372 |
| +finetune w/ depth weight=1.0 | 0.316 | 0.755 | 0.263 | 0.458 | 1.307 | 0.169 | 0.393 |
| +TTA | 0.326 | 0.743 | 0.259 | 0.441 | 1.341 | 0.163 | 0.402 |
| +more epochs & ensemble | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | 0.415 |

Table 3: Ablation studies on the nuScenes validation 3D detection benchmark.

4.2 Evaluation Metrics

For fair comparison with other methods, we use the official metrics, distance-based mAP and NDS, which are given by the benchmark. Next, we briefly introduce these two kinds of metrics as follows.
Average Precision metric The Average Precision (AP) metric is generally used when evaluating the performance of object detectors. Here, instead of using Intersection over Union (IoU) for thresholding, nuScenes defines a match by the 2D center distance $d$ on the ground plane, which decouples detection from object size and orientation. On this basis, we calculate AP as the normalized area under the precision-recall curve for recall and precision over 10%. Finally, mAP is computed over all matching thresholds, $\mathbb{D} = \{0.5, 1, 2, 4\}$ meters, and all categories $\mathbb{C}$:

$$mAP = \frac{1}{|\mathbb{C}||\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} AP_{c,d}$$


True Positive metrics Apart from Average Precision, we also calculate 5 kinds of True Positive metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE) and Average Attribute Error (AAE). To obtain these measurements, we first define predictions with center distance $d \le 2$ m from the matching ground truth as true positives (TP). Then matching and scoring are conducted independently for each class of objects, and each metric is the average of the cumulative mean at each recall level above 10%. ATE is the Euclidean center distance in 2D (m). ASE is equal to $1 - IOU$, where $IOU$ is calculated between predictions and labels after aligning their translation and orientation. AOE is the smallest yaw angle difference between predictions and labels (radians). Note that, unlike other classes, which are measured on the full $2\pi$ period, barrier is measured on a $\pi$ period. AVE is the L2-norm of the absolute velocity error in 2D (m/s). AAE is defined as $1 - acc$, where $acc$ refers to the attribute classification accuracy. Finally, given these metrics, we compute the mean TP metric (mTP) over all categories:

$$mTP = \frac{1}{|\mathbb{C}|} \sum_{c \in \mathbb{C}} TP_c$$

Note that metrics that are not well defined are omitted, like AVE for cones and barriers, since these objects are stationary.
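The orientation error underlying AOE can be sketched as the smallest angular difference on a given period. This is an illustration of the definition above, with the full $2\pi$ period as the default and a $\pi$ period for barriers; the function name is hypothetical and this is not the benchmark's own code.

```python
import math

def yaw_diff(pred, gt, period=2 * math.pi):
    """Smallest absolute angle difference between two yaws on a period."""
    diff = (pred - gt) % period
    return min(diff, period - diff)

wrap = yaw_diff(0.1, 2 * math.pi - 0.1)             # 0.2, not 2*pi - 0.2
opposite = yaw_diff(0.0, math.pi)                   # pi on the full period
barrier = yaw_diff(0.0, math.pi, period=math.pi)    # 0 on the pi period
```

Measuring barriers on a $\pi$ period reflects that a barrier rotated by 180 degrees is indistinguishable from the original.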

NuScenes Detection Score The conventional mAP couples the evaluation of locations, sizes and orientations of detections and cannot capture some aspects of this setting, like velocity and attributes, so this benchmark proposes a more comprehensive, decoupled but simple metric, the nuScenes detection score (NDS):

$$NDS = \frac{1}{10} \left[ 5\, mAP + \sum_{mTP \in \mathbb{TP}} \left( 1 - \min(1, mTP) \right) \right]$$

where mAP is the mean Average Precision and $\mathbb{TP}$ is the set composed of the five True Positive metrics. Since mAVE, mAOE and mATE can be larger than 1, a bound is applied to limit them between 0 and 1.
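As a worked check, the sketch below computes NDS from a row of Table 1, assuming the standard nuScenes definition $NDS = \frac{1}{10}[5\,mAP + \sum (1 - \min(1, mTP))]$ over the five TP metrics. The function name is hypothetical, and small differences from the reported values are expected because the table entries are rounded.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors."""
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# FCOS3D rows of Table 1: mATE, mASE, mAOE, mAVE, mAAE.
test_nds = nds(0.358, [0.690, 0.249, 0.452, 1.434, 0.124])  # about 0.4275
val_nds = nds(0.343, [0.725, 0.263, 0.422, 1.292, 0.153])   # about 0.4152
```

Because mAVE exceeds 1 for the camera-only methods, the velocity term contributes nothing to their NDS, which is why their NDS trails methods with similar mAP but better velocity estimates.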

4.3 Implementation Details

Network Architectures As shown in Figure 2, our framework basically follows the design of FCOS. Given the input image, we use ResNet101 as the feature extraction backbone, followed by a Feature Pyramid Network (FPN) for generating multi-level predictions. Detection heads are shared among multi-level feature maps, except that three scale factors are used to differentiate some of their final regressed results, namely offsets, depths and sizes. The final small heads after the four shared convolutional layers for each regression target are simply convolutional layers with kernel size and stride 1. All the convolutional modules are made up of basic convolution, batch normalization and activation layers, and weights are initialized from a normal distribution. The overall framework is built on top of MMDetection3D.

Training Parameters For all experiments, we trained randomly initialized networks from scratch in an end-to-end manner. Models are trained with the SGD optimizer, using gradient clipping and a warm-up policy, with learning rate 0.001, 500 warm-up iterations, warm-up ratio 0.33 and batch size 32 on 16 GTX 1080Ti GPUs. To achieve a stable training procedure at the beginning, our baseline model is trained with weight 0.2 for depth regression. For more competitive performance and a more accurate detector, we finetune our model with this weight switched to 1. Related results are presented in the ablation study.
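The warm-up policy can be sketched as follows. The learning rate, number of warm-up iterations and warm-up ratio are the values stated above; the linear shape of the ramp and the function name are assumptions, since the text does not specify the warm-up type.

```python
def warmup_lr(iteration, base_lr=0.001, warmup_iters=500, warmup_ratio=0.33):
    """Learning rate at a given iteration under an assumed linear warm-up.

    Starts at warmup_ratio * base_lr and ramps linearly to base_lr over
    warmup_iters iterations, then stays at base_lr (decay is not modeled).
    """
    if iteration >= warmup_iters:
        return base_lr
    progress = iteration / warmup_iters
    return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * progress)

start = warmup_lr(0)        # 0.00033
middle = warmup_lr(250)     # 0.000665
after = warmup_lr(500)      # 0.001
```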

Data Augmentation Like previous work, we only use image flipping for data augmentation, both in training and in testing. Note that when flipping images, only the offset needs to be flipped as a 2D attribute, while 3D boxes need to be transformed correspondingly in 3D space. For test time augmentation, we average the score maps output by the detection heads, except for the rotation and velocity related scores due to their inaccuracy. Empirically, this is a more efficient approach to augmentation than merging boxes at the end.

Figure 5: Qualitative analysis of detection results. 3D bounding boxes predictions are projected onto images from six different views and bird-view respectively. Boxes from different categories are marked with different colors. From left part, we can see the results are reasonable except some detection with false class predictions. Moreover, a few small objects are detected by our model while not annotated as ground truth, like barriers in the back/back right camera. However, apart from the intrinsic occlusion problem in this setting, there still exists noticeable inaccuracy in terms of depth and rotation predictions, which can be observed in the visualization from bird view.

5 Results

In this section, we present experimental results quantitatively and qualitatively, and make a detailed ablation study in terms of important factors in the procedure of pushing our method towards the state-of-the-art.

5.1 Quantitative Analysis

We first present the quantitative results, shown in Tab. 1, on the test set and the validation set respectively. On the test set, we first compare with all methods that take RGB images as input. We achieve the best performance among them, with an mAP of 0.358 and an NDS of 0.428; in particular, we exceed the previous best method by more than 2% in mAP. The LiDAR-based benchmarks include PointPillars [14], which is faster and lighter, and CBGS [36] (MEGVII in Tab. 1), which has relatively high performance. For approaches that take mixed RGB image and Radar data as input, we select CenterFusion [22] as the benchmark. Although our method still has a gap to the high-performance CBGS, it even surpasses PointPillars and CenterFusion on mAP, which shows that this ill-posed problem can be solved decently with enough data.
At the same time, the methods using other data modalities have relatively better NDS, mainly because their mAVE is smaller. The reason is that these methods use continuous multi-frame data, such as point clouds from consecutive frames, to predict the speed of objects. Radar can also measure velocity directly, so CenterFusion achieves reasonable speed prediction even with a single-frame image. Neither is possible with a single-frame image alone, so how to mine speed information from consecutive image frames is one of the directions to explore in the future. For the detailed mAP of each category, please refer to Tab. 2 and the official benchmark.
On the validation set, we compare our method with the best open-source center-based detector, CenterNet. CenterNet not only takes about 3 days to train (whereas our method needs only 1 day to reach comparable performance, possibly thanks in part to our pretrained backbone), but is also inferior to our method on every metric except mATE. In particular, thanks to our rotation encoding scheme, we significantly improve the accuracy of angle prediction. The significant improvement in mAP reflects the superiority of our multi-level prediction. Together, these improvements yield a gain of about 9% on NDS.
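To make the link between mAVE and NDS concrete, the following minimal sketch computes NDS with the nuScenes definition, NDS = (5 mAP + Σ (1 − min(1, mTP))) / 10, summed over the five true-positive errors. The function name `nds` and all error values are illustrative assumptions; the numbers are placeholders and are not taken from Tab. 1.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean true-positive errors."""
    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Placeholder values for illustration only (NOT results from Tab. 1).
errors = {"mATE": 0.7, "mASE": 0.25, "mAOE": 0.45, "mAVE": 1.4, "mAAE": 0.15}

# A single-frame camera model with a velocity error above 1 gets no credit for mAVE.
camera_only = nds(0.35, errors)

# Same detector, but with a hypothetical velocity error of 0.4.
with_velocity = nds(0.35, {**errors, "mAVE": 0.4})

print(f"NDS with mAVE=1.4: {camera_only:.3f}")
print(f"NDS with mAVE=0.4: {with_velocity:.3f}")
```

Because any mAVE of 1 or more contributes nothing to the sum, a method that cannot estimate velocity loses up to 0.1 NDS regardless of its mAP, which is consistent with the gap discussed above.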

Figure 6: Failure cases. As shown in this figure, our detector performs poorly especially on occluded and large objects. Yellow dotted circles mark failure cases caused by occlusion, while red dotted circles mark inaccurate predictions for large objects. The former problem is intrinsic to the ill-posed nature of the task, so a direction for improving our method is to enhance detection performance on large objects.

5.2 Qualitative Analysis

We then show some qualitative results in Fig. 5 and Fig. 6 to give an intuitive understanding of our model's performance. In Fig. 5, we draw the predicted 3D bounding boxes in the six-view images and on the top-view point clouds. From the image perspective, the predictions look convincing. In particular, some small unlabeled objects, such as the barriers in the rear-right camera, are detected by our model. At the same time, our method still has obvious problems in depth estimation and in identifying occluded objects. For example, it has difficulty detecting the blocked car in the rear-left image. Moreover, in the top view the results, especially the depth estimates, are not as good as they appear in the images. This is in line with our expectation that depth estimation is still the core challenge in this ill-posed problem.
In Fig. 6, we show some failure cases, mainly involving large objects and occluded objects. In both the camera view and the top view, a yellow dotted circle marks an occluded object that was not detected, while a red dotted circle marks a detected large object with obvious deviation. The former manifests as a failure to find the object behind the occluder, while the latter manifests as inaccurate estimation of the object's size and orientation. The causes of the two failure cases also differ. The former is due to an inherent property of the current setting and is difficult to solve; the latter may be because the receptive field of the convolution kernels in the current model is not large enough, resulting in poor detection of large objects. Future research may therefore focus more on the latter.

5.3 Ablation Studies

Finally, Tab. 3 shows the factors that mattered most over the course of this study. In the early stage, transforming depth back to the original space to compute the loss is an important factor in improving mAP, and distance-based target assignment is an important factor in improving the overall NDS. In the later stage, a stronger backbone, such as replacing the original ResNet50 with ResNet101 and using DCN, is very important. At the same time, because the regression targets differ in scale and measurement, using disentangled heads for them is also an important way to improve the accuracy of angle prediction and NDS. Finally, we reach the current state of the art through simple augmentation, more training epochs, and a basic model ensemble.
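As a rough illustration of why the space in which the depth loss is computed matters, the sketch below compares a loss taken in log space with one taken after transforming the prediction back to metres. It assumes that the network outputs log-depth and that a smooth L1 loss is used; the helper names and the 10% error scenario are hypothetical and are not taken from the released code.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for a single scalar prediction."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def loss_in_log_space(log_depth_pred, depth_gt):
    # Compare the raw network output against the log of the ground-truth depth.
    return smooth_l1(log_depth_pred, math.log(depth_gt))


def loss_in_original_space(log_depth_pred, depth_gt):
    # Transform the prediction back to metres before computing the loss.
    return smooth_l1(math.exp(log_depth_pred), depth_gt)


# The same 10% overestimate for a near object (5 m) and a far object (50 m).
near_pred = math.log(1.1 * 5.0)
far_pred = math.log(1.1 * 50.0)

for name, pred, gt in (("near", near_pred, 5.0), ("far", far_pred, 50.0)):
    print(name,
          f"log-space loss = {loss_in_log_space(pred, gt):.4f},",
          f"original-space loss = {loss_in_original_space(pred, gt):.4f}")
```

In log space both objects receive the same loss, whereas in the original space the far object is penalized in proportion to its error in metres. This is only one plausible reading of why the change helps, and Tab. 3 remains the evidence for its effect.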

6 Conclusion

In this paper, we propose a simple yet efficient one-stage framework, FCOS3D, for monocular 3D object detection without any 2D detection or 2D-3D correspondence priors. In the framework, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them into 2D and 3D attributes to fit the 3D setting. On this basis, the objects are distributed to different feature levels according to their 2D scales and further assigned only according to their 3D centers. In addition, the center-ness is redefined with a 2D Gaussian distribution based on the 3D center to be compatible with our target formulation. Experimental results with detailed ablation studies show the efficacy of our approach. For future work, a promising direction is to better tackle the difficulty of depth and orientation estimation in this ill-posed setting.


  • [1] G. Brazil and X. Liu (2019) M3D-rpn: monocular 3d region proposal network for object detection. In IEEE International Conference on Computer Vision, Cited by: §1, §2.
  • [2] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele (2020) Kinematic 3d object detection in monocular video. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
  • [3] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. CoRR abs/1903.11027. Cited by: §1, §4.1.
  • [4] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3D object proposals for accurate object class detection. In Conference on Neural Information Processing Systems, Cited by: §2.
  • [5] Y. Chen, L. Tai, K. Sun, and M. Li (2020) MonoPair: monocular 3d object detection using pairwise spatial relationships. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [6] MMDetection3D Contributors (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. Note: https://github.com/open-mmlab/mmdetection3d Cited by: §4.3.
  • [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision, Cited by: §3.1.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [9] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo (2020) Learning depth-guided convolutions for monocular 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [10] R. Girshick (2015) Fast r-cnn. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [12] L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) DenseBox: unifying landmark localization with end to end object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [13] E. Jörgensen, C. Zach, and F. Kahl (2019) Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss. CoRR abs/1906.08070. Cited by: §1, §2.
  • [14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, Table 1, §5.1.
  • [15] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In European Conference on Computer Vision, Cited by: §2.
  • [16] P. Li, H. Zhao, P. Liu, and F. Cao (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. In European Conference on Computer Vision, Cited by: §1, §2.
  • [17] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [18] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.1, §3.1.
  • [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [20] F. Manhardt, W. Kehl, and A. Gaidon (2019) ROI-10d: monocular lifting of 2d detection to 6d pose and metric shape. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [21] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [22] R. Nabati and H. Qi (2020) CenterFusion: center-based radar and camera fusion for 3d object detection. In IEEE Winter Conference on Applications of Computer Vision, Cited by: Table 1, §5.1.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [24] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [25] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • [26] T. Roddick, A. Kendall, and R. Cipolla (2018) Orthographic feature transform for monocular 3d object detection. CoRR abs/1811.08188. Cited by: §2.
  • [27] A. Simonelli, S. R. R. Bulò, L. Porzi, M. López-Antequera, and P. Kontschieder (2019) Disentangling monocular 3d object detection. In IEEE International Conference on Computer Vision, Cited by: §1, §2, Table 1, Table 2.
  • [28] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §3.1.
  • [29] T. Wang, X. Zhu, and D. Lin (2020) Reconfigurable voxels: a new representation for lidar-based point clouds. In Conference on Robot Learning, Cited by: §1.
  • [30] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [31] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [32] Y. Yan, Y. Mao, and B. Li (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10). Cited by: §3.1.
  • [33] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) STD: sparse-to-dense 3d object detector for point cloud. In IEEE International Conference on Computer Vision, Cited by: §1.
  • [34] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. CoRR abs/1904.07850. Cited by: §1, §2, §2, Table 1, Table 2.
  • [35] Y. Zhou and O. Tuzel (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [36] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu (2019) Class-balanced grouping and sampling for point cloud 3d object detection. CoRR abs/1908.09492. Cited by: §1, Table 1, §5.1.
  • [37] X. Zhu, Y. Ma, T. Wang, Y. Xu, J. Shi, and D. Lin (2020) SSN: shape signature networks for multi-class object detection from point clouds. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
  • [38] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, and D. Lin (2021) Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the European Conference on Computer Vision, Cited by: §1.