1 Introduction
Object detection is a fundamental problem in computer vision. It aims to identify objects of interest in an image and predict their categories and corresponding 2D bounding boxes. With the rapid progress of deep learning, 2D object detection has been well explored in recent years. Models such as Faster R-CNN [25], RetinaNet [18], and FCOS [28] have significantly advanced the field and benefit practical applications such as autonomous driving.
However, 2D information is not enough for an intelligent agent to perceive the 3D real world. For example, for an autonomous vehicle to drive smoothly and safely on the road, it must have accurate 3D information about the objects around it in order to make safe decisions. Therefore, 3D object detection is becoming increasingly important in these robotic applications. Most state-of-the-art methods [35, 14, 33, 29, 37, 36, 38] rely on the accurate 3D information provided by LiDAR point clouds, but installing expensive LiDARs on every vehicle is a heavy burden. Monocular 3D object detection, as a simple and cheap setting for deployment, has therefore become a meaningful research problem.
Considering that monocular 2D and 3D object detection have exactly the same input but different outputs, a straightforward solution for monocular 3D object detection is to follow the practices of the 2D domain while adding extra components to predict the additional 3D attributes of objects. Some previous work [27, 20] keeps the prediction of 2D boxes and further regresses 3D attributes on top of 2D centers and regions of interest. Others [1, 9, 2] simultaneously predict 2D and 3D boxes with 3D priors corresponding to each 2D anchor. Another stream of methods based on redundant 3D information [13, 16] predicts extra keypoints and uses them to optimize the final results. In short, the core underlying challenge is how to assign 3D targets to the 2D domain via the 2D-3D correspondence and then predict them.
In this technical report, we adopt a simple yet efficient method that enables a 2D detector to predict 3D localization. We first project the commonly defined 7-DoF 3D locations onto the 2D image and obtain the projected center point, which we call the 3D-center in contrast to the previous 2D-center. With this projection, the 3D-center contains 2.5D information, i.e., a 2D location and its corresponding depth. The 2D location can be further reduced to a 2D offset from a certain point on the image, which serves as the only 2D attribute that can be normalized across different feature levels, as in 2D detection. In comparison, depth, 3D size and orientation are regarded as 3D attributes after decoupling. In this way, we transform the 3D targets with a center-based paradigm and avoid the need for any 2D detection or 2D-3D correspondence priors.
As a practical implementation, we build our method on FCOS [28], a simple anchor-free, fully convolutional single-stage detector. We first distribute the objects to different feature levels according to their 2D scales. Then the regression targets of each training sample are assigned only according to the projected 3D centers. In contrast to FCOS, which defines centerness by the distances to the box boundaries, we represent the 3D centerness with a 2D Gaussian distribution based on the 3D-center.
We evaluate our method on a popular large-scale dataset, nuScenes [3], and achieve 1st place on the camera track of this benchmark without any prior information. Moreover, we need only half the computing resources to train, in one day, a baseline model with performance comparable to the previous best open-source method, CenterNet [34], which makes training 3x faster as well. Both results show that our framework is simple and efficient. Detailed ablation studies show the importance of each component.
2 Related Work
2D Object Detection Research on 2D object detection has made great progress with the breakthrough of deep learning approaches. Modern methods can be divided into two branches, anchor-based and anchor-free, according to how the initial guesses are formed. Anchor-based methods [10, 25, 19, 24] benefit from predefined anchors, which make regression much easier, but have many hyper-parameters to tune. In contrast, anchor-free methods [12, 23, 28, 15, 34] do not need these prior settings and are thus neater and more general. For simplicity, this paper takes FCOS, a representative anchor-free detector, as the baseline, considering its capability of handling overlapping ground truths and scale variance.
From another perspective, monocular 3D detection is a more difficult task closely related to 2D detection. However, little work has investigated the connections and differences between them, which leaves the two tasks isolated and unable to benefit from each other's advances. This report takes the adaptation of FCOS as an example and aims to build a closer connection between these two tasks.
Monocular 3D Object Detection Monocular 3D detection is more difficult than conventional 2D detection. The key underlying problem is the inconsistency between the 2D input modality and the 3D output predictions.
Methods involving sub-networks Earlier work uses sub-networks to assist 3D detection. 3DOP [4] and ML-Fusion [31] use a depth estimation network, while Deep3DBox [21] uses a 2D object detector. They rely on the design and performance of these sub-networks, and even on external data and pre-trained models, which makes training inconvenient and introduces additional system complexity.
Transform to 3D representation Another category converts the RGB input to other representations, as in OFTNet [26] and Pseudo-Lidar [30]. Although these methods have shown promising performance, they still rely on dense depth labels and hence are not regarded as pure monocular approaches. There is also a domain gap between different depth sensors and LiDARs, which makes it hard to generalize smoothly to a new practical setting. Furthermore, the efficiency of processing a large number of point clouds is a significant issue when applying these methods in the real world.
End-to-end design like 2D detection Recent work has noticed these drawbacks and proposed end-to-end frameworks. M3D-RPN [1] implements a single-stage multi-class detector with an end-to-end region proposal network and depth-aware convolution. SS3D [13] proposes to detect 2D key points and further predicts object characteristics with uncertainties. MonoDIS [27] introduces a disentangling loss to reduce the instability of the training procedure. Some of these methods still have multiple training stages or a hand-crafted post-optimization phase, and all of them are anchor-based, so the consistency between 2D and 3D anchors needs to be determined. In contrast, anchor-free methods [34, 16, 5] do not need to compute statistics on the given data and generalize better to more complicated cases with more diverse classes or different intrinsic settings, so we choose to follow this paradigm.
Nevertheless, these works hardly study the key difficulties of applying a general 2D detector to monocular 3D detection. What should be kept or leveraged, and what should be adjusted or emphasized in this procedure, is seldom discussed when new frameworks are proposed. In contrast, this technical report concentrates on exactly this point, which could provide a reference for applying a typical 2D detector framework to a closely related task. On this basis, a more in-depth understanding of the connections and differences between these two tasks will also benefit further research in both communities.
3 Approach
Object detection is one of the most essential and challenging problems in scene understanding. Conventional 2D object detection expects a model to predict 2D bounding boxes and category labels for each object of interest. In comparison, monocular 3D detection requires predicting 3D bounding boxes instead, whose targets need to be decoupled and transformed to the 2D image plane as far as possible. In this section, we first present an overview of our framework with our adopted reformulation of 3D targets, and then elaborate on two corresponding technical designs tailored to this task, 2D guided multi-level 3D prediction and the center sampling strategy, which together make the 2D detector FCOS work in this 3D task.
3.1 Framework Overview
A fully convolutional one-stage detector typically consists of three components: a backbone for feature extraction, necks for constructing multi-level branches, and detection heads for dense predictions. We briefly introduce each of them below.
First, we use ResNet101 [11] pretrained on ImageNet [8] with deformable convolutions [7] for feature extraction, which achieves a good trade-off between accuracy and efficiency in our experiments. We fix the parameters of the first convolutional block to avoid additional memory overhead.
The second module is the Feature Pyramid Network [17], which is a basic component for detecting objects at different scales. For clarity, we denote the feature maps from level 3 to 7 as P3 to P7, as shown in Figure 2. We follow the original FCOS to obtain P3 to P5 and downsample P5 with two convolutional blocks to obtain P6 and P7. All five feature maps are then responsible for predictions at different scales.
Finally, for the shared detection heads, there are two important issues to deal with. The first is how to assign targets to different feature levels and different points, which is one of the core problems for any detector and will be presented in the next subsection. The second is how to design the architecture. We follow the conventional design of RetinaNet [18] and FCOS [28]. Each shared head consists of 4 shared convolutional blocks and small heads for the various regression targets. There are several alternative designs for the small heads, which will be discussed in the experiments. In our experience, disentangled heads for targets with different measurements benefit the training procedure. This appears particularly important for the regression targets, including offset, depth, orientation, size and velocity, so we set one small head for each of them.
So far, we have introduced the overall design of our framework. Next, we formulate the problem more formally and present the detailed training and inference procedure.
Regression Targets We first recall the anchor-free formulation of object detection in FCOS. Given a feature map at layer $i$ of the backbone, denoted as $F_i$, we need to predict objects based on each point on this feature map, which corresponds to uniformly distributed points on the original input image. Formally, for each location $(x, y)$ on the feature map $F_i$, suppose the total stride up to layer $i$ is $s$; then the corresponding location on the original image is $(xs + \lfloor s/2 \rfloor, ys + \lfloor s/2 \rfloor)$. Unlike anchor-based detectors, which regress targets with predefined anchors as reference, we directly predict objects based on these locations. Moreover, because we do not rely on anchors, the criterion for judging whether a point is foreground is no longer the IoU between anchors and ground truths. Instead, a point is a foreground point as long as it is close enough to the box center.
In the 2D case, the model only needs to regress the distances from the point to the top/bottom/left/right sides, denoted as $t, b, l, r$ in Fig. 1. In the 3D case, however, it is non-trivial to regress the distances to the six faces of the 3D bounding box. Instead, a more straightforward implementation is to convert the commonly defined 7-DoF regression targets to the 2.5D center and 3D size, where the 2.5D center can be easily transformed back to 3D space with the camera intrinsic matrix. Regressing the 2.5D center can be further reduced to regressing the offset from the center to a specific foreground point, $\Delta x, \Delta y$, and its corresponding depth $d$. In addition, to predict the allocentric orientation of the object, we follow [32] and divide it into two parts: an angle $\theta$ with period $\pi$ and a 2-bin direction classification. The first component naturally models the IoU of our predictions with the ground-truth boxes, while the second focuses on the adversarial case where two boxes have opposite orientations. Thanks to this angle encoding, our method surpasses another center-based framework, CenterNet, in terms of orientation accuracy, as compared in the experiments. The rotation encoding scheme is illustrated in Fig. 3.
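The target reformulation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the offset is kept in image pixels, and the bin boundaries (direction bin 0 for yaw in [0, π), bin 1 for [π, 2π)) are an assumption; the actual implementation may use a different offset or bin convention.

```python
import math


def feature_to_image(x, y, stride):
    """Map feature-map location (x, y) to the centre of its cell on the input image."""
    half = stride // 2
    return x * stride + half, y * stride + half


def encode_offset(center_uv, point_uv):
    """2D offset (dx, dy) from a foreground point to the projected 3D center, in pixels."""
    return center_uv[0] - point_uv[0], center_uv[1] - point_uv[1]


def encode_yaw(yaw):
    """Split a yaw angle into a residual in [0, pi) and a direction bin in {0, 1}.

    Assumed convention: two boxes with opposite orientations share the same
    residual and differ only in the direction bin.
    """
    wrapped = yaw % (2.0 * math.pi)          # wrap into [0, 2*pi)
    direction = 1 if wrapped >= math.pi else 0
    residual = wrapped - direction * math.pi
    return residual, direction
```

With this convention, `encode_yaw(0.5)` and `encode_yaw(0.5 + math.pi)` yield the same residual but different direction bins, which is the "opposite orientation" case the 2-bin classifier is meant to resolve.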
In addition to these regression targets related to the location and orientation of objects, we also regress a binary target $c$, namely centerness, as in FCOS. It serves as a soft binary classifier to determine which points are closer to centers, and helps suppress low-quality predictions far away from object centers. More details are presented in Sec. 3.3.
To sum up, the regression branch needs to predict $\Delta x, \Delta y, d, w, l, h, \theta, v_x, v_y$, the direction class $C_\theta$ and the centerness $c$, while the classification branch needs to output the class label of the object and its attribute label.
Loss For classification and the different regression targets, we define their losses separately and take their weighted sum as the total loss. First, for the classification branch, we use the commonly used focal loss [18] as the object classification loss:
$$L_{cls} = -\alpha (1 - p)^{\gamma} \log p \tag{1}$$
where $p$ is the class probability of a predicted box, and we follow the settings of the original paper, $\alpha = 0.25$ and $\gamma = 2$. For attribute classification, we use a simple softmax classification loss, denoted as $L_{attr}$.
For the regression branch, we use the smooth L1 loss for each regression target except centerness, with corresponding weights considering their scales:
$$L_{loc} = \sum_{b \in (\Delta x, \Delta y, d, w, l, h, \theta, v_x, v_y)} \mathrm{SmoothL1}(\Delta b) \tag{2}$$
where the weight of the $\Delta x, \Delta y, w, l, h, \theta$ errors is 1, the weight of $d$ is 0.2 and the weight of $v_x, v_y$ is 0.05. Note that although we employ $\exp(x)$ for depth prediction, we still compute the loss in the original depth space instead of the log space, which ultimately leads to more accurate object localization. We use the softmax classification loss and the binary cross-entropy (BCE) loss for direction classification and centerness regression, denoted as $L_{dir}$ and $L_{ct}$ respectively.
Finally, the total loss is:
$$L = \frac{1}{N_{pos}} \left( \beta_{cls} L_{cls} + \beta_{attr} L_{attr} + \beta_{loc} L_{loc} + \beta_{dir} L_{dir} + \beta_{ct} L_{ct} \right) \tag{3}$$
where $N_{pos}$ is the number of positive predictions and $\beta_{cls} = \beta_{attr} = \beta_{loc} = \beta_{dir} = \beta_{ct} = 1$.
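To make the loss terms concrete, the following scalar Python sketch evaluates a focal loss term, a weighted smooth L1 localization loss and the normalized total. It is a minimal sketch under stated assumptions: the function and dictionary names are hypothetical, the smooth L1 transition point `beta` is not given in the text and defaults to 1 here, the depth head is assumed to output log-depth that is exponentiated before the loss is computed, and all total-loss weights are taken as 1.

```python
import math


def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for the probability p assigned to the true class."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def smooth_l1(diff, beta=1.0):
    """Smooth L1 penalty on a single regression error (beta is an assumed default)."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


# Per-target weights as stated in the text: depth 0.2, velocity 0.05, all others 1.
WEIGHTS = {"dx": 1.0, "dy": 1.0, "d": 0.2, "w": 1.0, "l": 1.0, "h": 1.0,
           "theta": 1.0, "vx": 0.05, "vy": 0.05}


def localization_loss(pred, target):
    """Weighted sum of smooth L1 terms over all regression targets.

    pred["d"] is assumed to be a log-depth output; it is exponentiated so the
    depth error is measured in the original depth space, not the log space.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = math.exp(pred[name]) if name == "d" else pred[name]
        total += weight * smooth_l1(value - target[name])
    return total


def total_loss(l_cls, l_attr, l_loc, l_dir, l_ct, n_pos):
    """Sum of all loss terms (all weights 1), normalized by the number of positives."""
    return (l_cls + l_attr + l_loc + l_dir + l_ct) / max(n_pos, 1)
```

For example, a prediction whose only error is a depth of 10 m against a 12 m target contributes 0.2 × (2 − 0.5) = 0.3 to the localization loss.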
Inference At inference time, given an input image, we forward it through the framework and obtain bounding boxes with their class scores, attribute scores and centerness predictions. We multiply the class score and centerness to obtain the confidence of each prediction and, as in most 3D detectors, conduct rotated Non-Maximum Suppression (NMS) in the bird's-eye view to get the final results. Note that the process involves some transformations, such as rotation decoding and projecting the 2.5D center back to 3D space, which are essentially the inverse of the data preprocessing procedures.
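The decoding steps mentioned above can be illustrated with a small Python sketch. The names are hypothetical and two assumptions are made: the yaw is stored as a residual in [0, π) plus a direction bin in {0, 1}, and the camera follows a pinhole model without skew, described by focal lengths `fx`, `fy` and principal point `cx`, `cy`. Rotated NMS is not shown.

```python
import math


def confidence(class_score, centerness):
    """Final confidence of a prediction: class score multiplied by centerness."""
    return class_score * centerness


def decode_yaw(residual, direction):
    """Recombine a residual in [0, pi) with a direction bin and wrap to [-pi, pi)."""
    yaw = residual + direction * math.pi
    return (yaw + math.pi) % (2.0 * math.pi) - math.pi


def unproject_center(u, v, depth, fx, fy, cx, cy):
    """Lift a 2.5D center (pixel u, v plus depth) to 3D camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth
```

A center predicted 100 pixels to the right of the principal point at a depth of 20 m, with a focal length of 1000 pixels, is lifted to a point 2 m to the right of the optical axis.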
3.2 2D Guided Multi-Level 3D Prediction
The FCOS authors discuss two of the most important issues in target assignment: 1) the difference in Best Possible Recall (BPR) between anchor-free and anchor-based detectors, and 2) the intractable ambiguity caused by overlapping ground-truth boxes. The first issue is well addressed by the comparison in the original paper, which shows that multi-level prediction through FPN improves BPR and even achieves better results than anchor-based methods. This conclusion also applies to our adapted framework. The second issue involves the specific setting of the regression targets, which we discuss next.
The original FCOS detects objects of different sizes at different levels of feature maps. Unlike anchor-based methods, which assign anchors with different sizes, FCOS directly assigns ground-truth boxes with different sizes to different levels of feature maps. Formally, it first computes the 2D regression targets $l^*, r^*, t^*, b^*$ for each location at each feature level. A location satisfying $\max(l^*, r^*, t^*, b^*) > m_i$ or $\max(l^*, r^*, t^*, b^*) < m_{i-1}$ is regarded as a negative sample, where $m_i$ denotes the maximum regression range for feature level $i$ (in our experiments, the regression ranges $m_2$ to $m_7$ are set to $(0, 48, 96, 192, 384, \infty)$). We follow this criterion in our implementation, considering that the scale of 2D detection is directly consistent with how large a region we need to focus on. However, we use 2D detection only for filtering meaningless targets in this assignment step. After target assignment is complete, our regression targets include only the 3D-related ones. Note that we generate the 2D bounding boxes by computing the minimum and maximum coordinates of the 8 projected vertices of each 3D bounding box, so we do not need any 2D detection annotations or priors.
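The level assignment described above can be sketched as follows. This is a hedged illustration rather than the actual implementation: the function names are hypothetical, the regression ranges are read as (0, 48, 96, 192, 384, ∞) for levels P3 to P7, and both range boundaries are treated as inclusive.

```python
import math

# Regression range boundaries; level i in 3..7 covers [RANGES[i - 3], RANGES[i - 2]].
RANGES = (0, 48, 96, 192, 384, math.inf)


def box_from_projected_corners(corners_uv):
    """Tight 2D box (x1, y1, x2, y2) around the 8 projected vertices of a 3D box."""
    us = [u for u, _ in corners_uv]
    vs = [v for _, v in corners_uv]
    return min(us), min(vs), max(us), max(vs)


def is_valid_for_level(point, box, level):
    """True if the point lies inside the box and the largest of its four side
    distances falls within the regression range of feature level P<level>."""
    px, py = point
    x1, y1, x2, y2 = box
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    if min(l, t, r, b) <= 0:
        return False                      # point is outside the 2D box
    m_low, m_high = RANGES[level - 3], RANGES[level - 2]
    return m_low <= max(l, t, r, b) <= m_high
```

For instance, a point at the center of a 40 × 40 pixel box has a largest side distance of 20 and is therefore handled by P3, while the same point inside a 200 × 200 pixel box is handled by P5.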
Next, we discuss how to deal with the ambiguity problem: when a point lies inside multiple ground-truth boxes at the same feature level, which box should be assigned to it? The usual way is to select according to the area of the 2D bounding box: the box with the smaller area is selected as the target box for this point. We call this scheme the area-based criterion. Large objects clearly receive less attention under this scheme, which is also verified by our experiments (Figure 4). Our regression targets are center-based, and points closer to the object center can obtain more comprehensive and balanced local features and thus produce higher-quality predictions. We therefore propose a distance-based criterion, i.e., we select the box with the closer center as the regression target. Through simple verification (Figure 4), we find that this scheme significantly improves the best possible recall (BPR) and mAP of large objects, and also improves the overall mAP by about 1%, as presented in the ablation study.
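The difference between the two criteria can be seen in a small Python sketch. The helper names are hypothetical, boxes are given as (x1, y1, x2, y2), and `centers` is assumed to hold the projected 3D center of each box.

```python
import math


def contains(box, point):
    """True if the point lies strictly inside the 2D box."""
    x1, y1, x2, y2 = box
    return x1 < point[0] < x2 and y1 < point[1] < y2


def pick_by_area(point, boxes):
    """Area-based criterion: index of the smallest box containing the point."""
    candidates = [i for i, box in enumerate(boxes) if contains(box, point)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda i: (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1]))


def pick_by_distance(point, boxes, centers):
    """Distance-based criterion: index of the containing box whose center is closest."""
    candidates = [i for i, box in enumerate(boxes) if contains(box, point)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda i: math.hypot(point[0] - centers[i][0], point[1] - centers[i][1]))
```

For a point near the center of a large box that also falls just inside a smaller overlapping box, the area-based rule hands the point to the small box, whereas the distance-based rule keeps it with the large one.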
In addition to the center-based approach for dealing with ambiguity, we also use the 3D-center to determine foreground points, i.e., only points close enough to the center are regarded as positive samples. We define a hyper-parameter, radius, to measure this central portion. Points whose distance to the object center is smaller than radius × stride are considered positive, where radius is set to 1.5 in our experiments.
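As a minimal sketch of this rule, the check below marks a point as positive when it lies within radius × stride of the projected 3D center. The function name is hypothetical, and the use of Euclidean distance is an assumption; a per-axis (box-shaped) region around the center is another plausible reading of "distance".

```python
import math


def is_positive(point, center, stride, radius=1.5):
    """True if the point is closer than radius * stride to the projected 3D center.

    Euclidean distance is assumed here for illustration.
    """
    distance = math.hypot(point[0] - center[0], point[1] - center[1])
    return distance < radius * stride
```

With a stride of 8, the positive region has a radius of 12 pixels, so it grows proportionally with the stride of coarser feature levels.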
Finally, we also follow the original FCOS in distinguishing the shared heads of different feature levels, i.e., we replace each output $x$ of the different regression branches with $s_i x$, where $s_i$ is a trainable scalar used to adjust the base of the exponential function for feature level $i$. This also brings a minor improvement in detection performance.
Table 1: Results on the nuScenes test and validation sets.
| Methods | Dataset | Modality | mAP | mATE | mASE | mAOE | mAVE | mAAE | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CenterFusion [22] | test | Camera & Radar | 0.326 | 0.631 | 0.261 | 0.516 | 0.614 | 0.115 | 0.449 |
| PointPillars [14] | test | LiDAR | 0.305 | 0.517 | 0.290 | 0.500 | 0.316 | 0.368 | 0.453 |
| MEGVII [36] | test | LiDAR | 0.528 | 0.300 | 0.247 | 0.379 | 0.245 | 0.140 | 0.633 |
| LRM0 | test | Camera | 0.294 | 0.752 | 0.265 | 0.603 | 1.582 | 0.14 | 0.371 |
| MonoDIS [27] | test | Camera | 0.304 | 0.738 | 0.263 | 0.546 | 1.553 | 0.134 | 0.384 |
| CenterNet [34] (HGLS) | test | Camera | 0.338 | 0.658 | 0.255 | 0.629 | 1.629 | 0.142 | 0.4 |
| Noah CV Lab | test | Camera | 0.331 | 0.660 | 0.262 | 0.354 | 1.663 | 0.198 | 0.418 |
| FCOS3D (Ours) | test | Camera | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124 | 0.428 |
| CenterNet [34] (DLA) | val | Camera | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 | 0.328 |
| FCOS3D (Ours) | val | Camera | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | 0.415 |
3.3 3D Centerness with 2D Gaussian Distribution
In the original design of FCOS, centerness $c$ is defined by the 2D regression targets $l^*, r^*, t^*, b^*$:
$$c = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}} \tag{4}$$
Because our regression targets are changed to the 3D center-based paradigm, we define centerness by a 2D Gaussian distribution with the projected 3D center as the origin. The 2D Gaussian distribution is simplified as:
$$c = e^{-\alpha \left( (\Delta x)^2 + (\Delta y)^2 \right)} \tag{5}$$
where $\alpha$ is used to adjust the intensity attenuation from the center to the periphery and is set to 2.5 in our experiments. We take it as the ground truth of centerness and predict it from the regression branch for filtering low-quality predictions later. As mentioned earlier, this centerness target ranges from 0 to 1, so we use the Binary Cross Entropy (BCE) loss for training that branch.
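The two centerness definitions can be compared with a short Python sketch. The function names are hypothetical, and the offsets passed to the Gaussian version are assumed to be expressed in units of the feature stride (with pixel offsets the value would decay far too quickly for α = 2.5); the text does not state the unit explicitly.

```python
import math


def fcos_centerness(l, t, r, b):
    """Original FCOS centerness from the four side distances of a 2D box."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def gaussian_centerness(dx, dy, alpha=2.5):
    """Centerness as a 2D Gaussian of the offset to the projected 3D center.

    dx, dy are assumed to be normalized by the stride of the feature level.
    """
    return math.exp(-alpha * (dx * dx + dy * dy))
```

Both definitions equal 1 exactly at the center and decay towards 0 away from it; the Gaussian version depends only on the offset to the projected 3D center, not on the 2D box extents.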
4 Experimental Setup
4.1 Dataset
We evaluate our framework on a large-scale, commonly used dataset, nuScenes [3]. The nuScenes dataset consists of multi-modal data collected from 1000 scenes, including RGB images from 6 surrounding cameras and points from 5 radars and 1 LiDAR. It is split into 700/150/150 scenes for training/validation/testing. There are 1.4M annotated 3D bounding boxes in total from 10 categories. Given the variety of scenes and ground truths, it is becoming one of the most convincing benchmarks for 3D object detection. Therefore, we take it as the platform to validate the efficacy of our method.
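Table 2: Average precision for each category on the nuScenes test set. CV and TC denote construction vehicle and traffic cone.
| Methods | car | truck | bus | trailer | CV | ped | motor | bicycle | TC | barrier | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LRM0 | 0.467 | 0.21 | 0.17 | 0.149 | 0.061 | 0.359 | 0.287 | 0.246 | 0.476 | 0.512 | 0.294 |
| MonoDIS [27] | 0.478 | 0.22 | 0.188 | 0.176 | 0.074 | 0.37 | 0.29 | 0.245 | 0.487 | 0.511 | 0.304 |
| CenterNet [34] (HGLS) | 0.536 | 0.27 | 0.248 | 0.251 | 0.086 | 0.375 | 0.291 | 0.207 | 0.583 | 0.533 | 0.338 |
| Noah CV Lab | 0.515 | 0.278 | 0.249 | 0.213 | 0.066 | 0.404 | 0.338 | 0.237 | 0.522 | 0.49 | 0.331 |
| FCOS3D (Ours) | 0.524 | 0.27 | 0.277 | 0.255 | 0.117 | 0.397 | 0.345 | 0.298 | 0.557 | 0.538 | 0.358 |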
Table 3: Ablation studies on the nuScenes validation set.
| Methods | mAP | mATE | mASE | mAOE | mAVE | mAAE | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.227 | 0.868 | 0.272 | 0.778 | 1.326 | 0.393 | 0.282 |
| +depth loss in original space | 0.25 | 0.838 | 0.268 | 0.892 | 1.33 | 0.413 | 0.284 |
| +flip augmentation | 0.248 | 0.85 | 0.267 | 1.016 | 1.358 | 0.268 | 0.286 |
| +dist-based target assign & attr pred | 0.257 | 0.832 | 0.268 | 0.852 | 1.2 | 0.18 | 0.316 |
| +global NMS | 0.26 | 0.828 | 0.267 | 0.85 | 1.371 | 0.18 | 0.317 |
| +ResNet101 | 0.272 | 0.821 | 0.265 | 0.81 | 1.379 | 0.17 | 0.329 |
| +disentangle heads | 0.28 | 0.822 | 0.274 | 0.64 | 1.305 | 0.177 | 0.349 |
| +DCN in backbone | 0.295 | 0.806 | 0.268 | 0.511 | 1.315 | 0.17 | 0.372 |
| +finetune w/ depth weight=1.0 | 0.316 | 0.755 | 0.263 | 0.458 | 1.307 | 0.169 | 0.393 |
| +TTA | 0.326 | 0.743 | 0.259 | 0.441 | 1.341 | 0.163 | 0.402 |
| +more epochs & ensemble | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | 0.415 |
4.2 Evaluation Metrics
For a fair comparison with other methods, we use the official metrics given by the benchmark: distance-based mAP and NDS. We briefly introduce these metrics below.
Average Precision metric The Average Precision (AP) metric is generally used to evaluate the performance of object detectors. Instead of using Intersection over Union (IoU) for thresholding, nuScenes defines a match by the 2D center distance $d$ on the ground plane, decoupling detection from object size and orientation. On this basis, we calculate AP as the normalized area under the precision-recall curve for recall and precision over 10%. Finally, mAP is computed over all matching thresholds, $\mathbb{D} = \{0.5, 1, 2, 4\}$ meters, and all categories $\mathbb{C}$:
$$mAP = \frac{1}{|\mathbb{C}||\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} AP_{c,d} \tag{6}$$
True Positive metrics Apart from Average Precision, we also calculate 5 kinds of True Positive metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE) and Average Attribute Error (AAE). To obtain these measurements, we first define predictions with a center distance $d \le 2$ m from the matching ground truth as true positives (TP). Matching and scoring are then conducted independently for each class of objects, and each metric is the average of the cumulative mean at each recall level above 10%. ATE is the Euclidean center distance in 2D (m). ASE is equal to $1 - IOU$, where $IOU$ is calculated between predictions and labels after aligning their translation and orientation. AOE is the smallest yaw angle difference between predictions and labels (radians). Note that, unlike other classes, which are measured on the full $360^\circ$ period, barriers are measured on a $180^\circ$ period. AVE is the L2 norm of the absolute velocity error in 2D (m/s). AAE is defined as $1 - acc$, where $acc$ refers to the attribute classification accuracy. Finally, given these metrics, we compute the mean TP metric (mTP) over all categories:
$$mTP = \frac{1}{|\mathbb{C}|} \sum_{c \in \mathbb{C}} TP_c \tag{7}$$
Note that metrics that are not well defined are omitted, such as AVE for cones and barriers, since these objects are stationary.
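Two of the per-box errors above are easy to get subtly wrong, so a small Python sketch may help. It is an illustration under assumptions, not the official evaluation code: the function names are hypothetical, sizes are given as (w, l, h), and the scale error uses the fact that, once centers and orientations are aligned, the intersection of two boxes is the box of per-axis minimum sizes.

```python
import math


def yaw_diff(pred_yaw, gt_yaw, period=2.0 * math.pi):
    """Smallest absolute yaw difference in radians under the given period.

    Use period=math.pi for classes evaluated on a half-turn period.
    """
    diff = (pred_yaw - gt_yaw) % period
    return min(diff, period - diff)


def scale_error(size_pred, size_gt):
    """1 - IoU of two 3D boxes after aligning their centers and orientations."""
    inter = 1.0
    for a, b in zip(size_pred, size_gt):
        inter *= min(a, b)
    vol_pred = size_pred[0] * size_pred[1] * size_pred[2]
    vol_gt = size_gt[0] * size_gt[1] * size_gt[2]
    return 1.0 - inter / (vol_pred + vol_gt - inter)
```

Under the half-turn period, a prediction that is exactly reversed with respect to the label has zero orientation error, which is why the period matters for symmetric objects.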
NuScenes Detection Score The conventional mAP couples the evaluation of locations, sizes and orientations of detections and cannot capture some aspects of this setting, such as velocity and attributes, so the benchmark proposes a more comprehensive, decoupled yet simple metric, the nuScenes detection score (NDS):
$$NDS = \frac{1}{10} \left[ 5\, mAP + \sum_{mTP \in \mathbb{TP}} \left( 1 - \min(1, mTP) \right) \right] \tag{8}$$
where mAP is the mean Average Precision and $\mathbb{TP}$ is the set of the five mean True Positive metrics. Since mAVE, mAOE and mATE can be larger than 1, a bound is applied to limit them to between 0 and 1.
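As a quick check of these definitions, the following Python sketch computes mAP from a table of per-class, per-threshold AP values and NDS from mAP and the five mean TP errors. The function names and the dictionary layout are illustrative assumptions.

```python
def mean_ap(ap_table):
    """Average AP over all classes and all matching thresholds.

    ap_table maps class name -> {distance threshold: AP}; every class is
    assumed to list the same thresholds.
    """
    values = [ap for per_class in ap_table.values() for ap in per_class.values()]
    return sum(values) / len(values)


def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors."""
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Test-set numbers reported for FCOS3D in Table 1.
fcos3d_test = {"mATE": 0.690, "mASE": 0.249, "mAOE": 0.452, "mAVE": 1.434, "mAAE": 0.124}
score = nds(0.358, fcos3d_test)
```

With the test-set numbers from Table 1, the score evaluates to 0.4275, consistent with the reported NDS of 0.428 after rounding. Note that the mAVE of 1.434 is clipped to 1 and therefore contributes nothing to the score.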
4.3 Implementation Details
Network Architectures As shown in Figure 2, our framework basically follows the design of FCOS. Given the input image, we use ResNet101 as the feature extraction backbone, followed by a Feature Pyramid Network (FPN) for generating multi-level predictions. Detection heads are shared among the multi-level feature maps, except that three scale factors are used to differentiate some of their final regressed results, namely offsets, depths and sizes. The final small heads after the four shared convolutional layers for each regression target are simply convolutional layers with kernel size and stride 1. All convolutional modules are made up of basic convolution, batch normalization and activation layers, and a normal distribution is used for weight initialization. The overall framework is built on top of MMDetection3D [6].
Training Parameters For all experiments, we train randomly initialized networks from scratch in an end-to-end manner. Models are trained with the SGD optimizer, using gradient clipping and a warmup policy with learning rate 0.001, 500 warmup iterations, warmup ratio 0.33 and batch size 32 on 16 GTX 1080Ti GPUs. To achieve a stable training procedure at the beginning, our baseline model is trained with weight 0.2 for depth regression. For more competitive performance and a more accurate detector, we fine-tune our model with this weight switched to 1. Related results are presented in the ablation study.
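The warmup settings above can be illustrated with a short Python sketch. The shape of the warmup is not stated in the text, so linear interpolation from `warmup_ratio × base_lr` up to `base_lr` is an assumption here, as is the function name; any learning-rate decay after warmup is not modeled.

```python
def warmup_lr(iteration, base_lr=0.001, warmup_iters=500, warmup_ratio=0.33):
    """Learning rate at a given iteration under an assumed linear warmup.

    Starts at warmup_ratio * base_lr and reaches base_lr after warmup_iters.
    """
    if iteration >= warmup_iters:
        return base_lr
    k = (1.0 - iteration / warmup_iters) * (1.0 - warmup_ratio)
    return base_lr * (1.0 - k)
```

Under this assumption, training starts at a learning rate of 0.00033, passes 0.000665 at iteration 250 and reaches the full 0.001 at iteration 500.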
Data Augmentation Like previous work, we only use image flipping for data augmentation, both during training and testing. Note that when flipping images, the offset is the only 2D attribute that needs to be flipped, while the 3D boxes need to be transformed correspondingly in 3D space. For test-time augmentation, we average the score maps output by the detection heads, except for the rotation- and velocity-related scores due to their inaccuracy. Empirically, this is a more efficient augmentation approach than merging boxes at the end.
5 Results
In this section, we present quantitative and qualitative experimental results, and provide a detailed ablation study of the factors that were important in pushing our method towards the state of the art.
5.1 Quantitative Analysis
First, we give the quantitative results, shown in Tab. 1. We compare results on the test set and the validation set respectively. On the test set, we first compare with all methods that use RGB images as input. We achieve the best performance among them with mAP 0.358 and NDS 0.428, exceeding the previous best method by 2 points of mAP (0.358 vs. 0.338). Benchmarks that use LiDAR data as input include PointPillars [14], which is faster and lighter, and CBGS [36] (MEGVII in Tab. 1), which has relatively high performance. For approaches that use mixed RGB image and radar input, we select CenterFusion [22] as the benchmark. Although our method still has a gap with the high-performance CBGS, it surpasses PointPillars and CenterFusion on mAP, which shows that this ill-posed problem can be solved decently with enough data. At the same time, the methods using other data modalities have better NDS, mainly because their mAVE is smaller. The reason is that these methods use continuous multi-frame data, such as point clouds from consecutive frames, to predict the speed of objects. Radar also measures velocity directly, so CenterFusion can achieve reasonable speed prediction even with a single-frame image. However, this cannot be achieved with a single-frame image alone, so how to mine speed information from consecutive image frames is one direction to explore in the future. For detailed mAP for each category, please refer to Tab. 2 and the official benchmark.
On the validation set, we compare our method with the best open-source center-based detector, CenterNet. CenterNet not only takes about 3 days to train (compared with only 1 day for our method to achieve comparable performance, possibly thanks in part to our pretrained backbone), but is also inferior to our method on every metric except mATE. In particular, thanks to our rotation encoding scheme, we achieve a significant improvement in the accuracy of angle prediction (mAOE 0.422 vs. 0.609). The significant improvement in mAP reflects the superiority of our multi-level prediction. Building on these improvements, we finally achieve an improvement of about 9% in NDS (0.415 vs. 0.328).
5.2 Qualitative Analysis
We then show some qualitative results in Fig. 5 and 6 to give an intuitive understanding of the performance of our model. In Fig. 5, we draw the predicted 3D bounding boxes in the six-view images and the top-view point clouds. From the image perspective, our predictions look very appealing. In particular, some small objects that are not labeled, such as the barriers in the rear-right camera, are nevertheless detected by our model. At the same time, our method still has obvious problems in depth estimation and in identifying occluded objects. For example, it fails to detect the occluded car in the rear-left image. Moreover, in the top view, the results are not as good as they appear in the images, especially in terms of depth estimation. This is in line with our expectation that depth estimation is still the core challenge in this ill-posed problem.
In Fig. 6, we show some failure cases, mainly concerning the detection of large objects and occluded objects. In the camera view and top view, yellow dotted circles mark occluded objects that were not detected, while red dotted circles mark detected large objects with obvious deviation. The former mainly manifests as a failure to find objects hidden behind others, while the latter mainly manifests as inaccurate estimation of object size and orientation. The reasons behind the two failure cases also differ. The former is due to an inherent property of the current setting and is difficult to solve; the latter may be because the receptive field of the convolution kernels in the current model is not large enough, resulting in poor performance on large objects. Therefore, future research may focus more on the latter.
5.3 Ablation Studies
Finally, Tab. 3 shows the important factors over the whole course of our study. In the early stage, transforming depth back to the original space to compute the loss is an important factor for improving mAP, and distance-based target assignment is an important factor for improving the overall NDS. In the later stage, a stronger backbone, obtained by replacing the original ResNet50 with ResNet101 and using DCN, is a very important factor. At the same time, due to the differences in scales and measurements, using disentangled heads for different regression targets is also an important way to improve the accuracy of angle prediction and NDS. Finally, we reach the current state of the art through simple augmentation, more training epochs and a basic model ensemble.
6 Conclusion
In this paper, we propose a simple yet efficient one-stage framework, FCOS3D, for monocular 3D object detection without any 2D detection or 2D-3D correspondence priors. In the framework, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them into 2D and 3D attributes to fit the 3D setting. On this basis, the objects are distributed to different feature levels in consideration of their 2D scales and further assigned only according to their 3D-centers. In addition, the centerness is redefined with a 2D Gaussian distribution based on the 3D-center to be compatible with our target formulation. Experimental results with detailed ablation studies show the efficacy of our approach. For future work, a promising direction is how to better tackle the difficulty of depth and orientation estimation in this ill-posed setting.
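As a compact reminder of the Gaussian centerness mentioned above, the following minimal sketch computes a centerness target that equals 1 at the 3D-center and decays with the squared offset from it. The function name, the assumption that the offsets are normalized (for example by the feature-level stride), and the default value of `alpha` are illustrative assumptions rather than a definitive restatement of the method.

```python
import math


def gaussian_centerness(dx, dy, alpha=2.5):
    """Centerness target that decays with squared offset from the 3D-center.

    dx, dy: offsets from a location to the projected 3D-center, assumed to be
    normalized (e.g. by the feature-level stride).
    alpha: controls how quickly the value falls off; 2.5 is an assumed default.
    """
    return math.exp(-alpha * (dx * dx + dy * dy))
```

The value lies in (0, 1], is symmetric in the two offsets, and decreases monotonically with distance from the center, so it can serve as a soft target for down-weighting predictions far from the 3D-center.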
References
 [1] (2019) M3D-RPN: monocular 3d region proposal network for object detection. In IEEE International Conference on Computer Vision, Cited by: §1, §2.
 [2] (2020) Kinematic 3d object detection in monocular video. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
 [3] (2019) NuScenes: a multimodal dataset for autonomous driving. CoRR abs/1903.11027. Cited by: §1, §4.1.
 [4] (2015) 3D object proposals for accurate object class detection. In Conference on Neural Information Processing Systems, Cited by: §2.

 [5] (2020) MonoPair: monocular 3d object detection using pairwise spatial relationships. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [6] (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. Note: https://github.com/open-mmlab/mmdetection3d Cited by: §4.3.
 [7] (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision, Cited by: §3.1.
 [8] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
 [9] (2020) Learning depth-guided convolutions for monocular 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
 [10] (2015) Fast R-CNN. In IEEE International Conference on Computer Vision, Cited by: §2.
 [11] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
 [12] (2015) DenseBox: unifying landmark localization with end to end object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [13] (2019) Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss. CoRR abs/1906.08070. Cited by: §1, §2.
 [14] (2019) PointPillars: fast encoders for object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, Table 1, §5.1.
 [15] (2018) CornerNet: detecting objects as paired keypoints. In European Conference on Computer Vision, Cited by: §2.
 [16] (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. In European Conference on Computer Vision, Cited by: §1, §2.
 [17] (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
 [18] (2017) Focal loss for dense object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.1, §3.1.
 [19] (2016) SSD: single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
 [20] (2019) ROI-10D: monocular lifting of 2d detection to 6d pose and metric shape. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
 [21] (2017) 3D bounding box estimation using deep learning and geometry. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [22] (2020) CenterFusion: center-based radar and camera fusion for 3d object detection. In IEEE Winter Conference on Applications of Computer Vision, Cited by: Table 1, §5.1.
 [23] (2016) You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [24] (2017) YOLO9000: better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [25] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
 [26] (2018) Orthographic feature transform for monocular 3d object detection. CoRR abs/1811.08188. Cited by: §2.
 [27] (2019) Disentangling monocular 3d object detection. In IEEE International Conference on Computer Vision, Cited by: §1, §2, Table 1, Table 2.
 [28] (2019) FCOS: fully convolutional one-stage object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §3.1.
 [29] (2020) Reconfigurable voxels: a new representation for LiDAR-based point clouds. In Conference on Robot Learning, Cited by: §1.
 [30] (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [31] (2018) Multi-level fusion based 3d object detection from monocular images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [32] (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10). Cited by: §3.1.
 [33] (2019) STD: sparse-to-dense 3d object detector for point cloud. In IEEE International Conference on Computer Vision, Cited by: §1.
 [34] (2019) Objects as points. CoRR abs/1904.07850. Cited by: §1, §2, §2, Table 1, Table 2.
 [35] (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
 [36] (2019) Class-balanced grouping and sampling for point cloud 3d object detection. CoRR abs/1908.09492. Cited by: §1, Table 1, §5.1.
 [37] (2020) SSN: shape signature networks for multi-class object detection from point clouds. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
 [38] (2021) Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the European Conference on Computer Vision, Cited by: §1.