1 Introduction
As an important and challenging problem, 3D object detection plays a fundamental role in various computer vision applications, such as autonomous driving, robotics, and augmented/virtual reality. In recent years, monocular 3D object detection has received great attention because it uses only a monocular camera instead of requiring extra sensing devices, as LiDAR-based [36, 16, 8, 37] and stereo-based [7, 19, 29, 43] methods do. However, the performance gap between LiDAR-based and monocular image-based approaches remains significant, mainly because of the lack of reliable depth information. To quantify this, we replace only the depth predictions of a baseline model with the ground-truth depth values. The detection performance improves remarkably, from 11.84% to 70.91% in terms of the AP metric [39] under the moderate setting of the car category on the KITTI val set (see Table 1), which suggests that depth estimation is a critical performance bottleneck in monocular 3D object detection.
Depth information has also been successfully applied as an important 3D geometry element to facilitate learning in other problems, such as 2D object detection [46, 45] [33], and camera localization [44, 2, 35]. However, how to jointly model the geometry relationships between the scene depth and different 2D/3D network predictions, such as 2D box sizes, 3D dimensions, and poses, and how to enable joint learning with the modeled geometry constraints for geometry-aware monocular 3D object detection, is rarely explored in the literature. An intuitive way to introduce the geometric relationships is to leverage the perspective projection between the 3D scene space and the 2D image plane. Prior works [1, 5, 18, 20] either use the geometry only weakly, considering the projection consistency between 2D and 3D for post-processing, or employ perspective projection without regard to the object poses and 3D dimensions. These elements, however, can provide considerably stronger geometric constraints and are extremely important for accurate depth estimation. As can be observed in Fig. 1, the depth values of cars with the same 2D/3D box height can differ by more than 5 meters because of their distinct poses and positions.
In this paper, we propose a novel geometric formula by principled modeling of the relationships between the scene depth and different geometry elements predicted by the deep network for monocular 3D object detection, including 2D bounding boxes, 3D object dimensions, and object poses. We further implement the proposed formula as a geometry-based network module, which can be flexibly embedded into the deep learning framework, allowing effective geometry-aware learning at the representation level to guide the depth estimation and advance monocular 3D object detection. Besides, the geometry module can be used during both the training and inference phases without additional complex post-processing.
Moreover, we provide a simple yet strong baseline for ensuring robust learning with the proposed geometry module, which is achieved by addressing the severe misalignment between the annotated 2D bounding box and the 2D bounding box projected from the 3D annotations. This effective baseline achieves an AP of 13.37% under the moderate setting of the car category on the KITTI val set.
To summarize, the contributions of this paper are threefold:

We propose a novel geometric formula, which jointly models the perspective geometry relationships of multiple 2D/3D elements predicted by the deep monocular 3D object detection network, providing strong geometric constraints for learning the 3D detection network.

We implement the proposed geometric formula as a neural network module, which can be leveraged to guide the representation learning and boost the depth estimation, significantly advancing the performance of monocular 3D object detection.

We provide a simple yet strong baseline by dealing with the misalignment between 2D projected boxes and 2D annotation boxes, which achieves an AP of 13.37% under the moderate setting on the KITTI val set. We expect our baseline to benefit the community in future research on monocular 3D object detection.
Extensive experiments conducted on the challenging KITTI [11] dataset clearly demonstrate the effectiveness of the proposed approach and show that our method achieves an AP of 13.81%, a 2.80% absolute improvement over the state of the art in monocular 3D object detection under the moderate setting of the KITTI test set for the car category.
2 Related Work
There are two groups of works closely related to ours: monocular 3D object detection and geometry-guided 3D object detection.
Monocular 3D Object Detection. Compared with methods that use LiDAR and stereo sensors, 3D object detection with monocular images is challenging due to the absence of reliable depth information. Existing works [6, 27, 25, 24, 5, 10] have considered using external pre-trained networks, extra training data, and prior knowledge to improve the performance of monocular 3D object detection. In particular, DeepMANTA [6] utilizes extra 3D shape and template datasets to learn 2D/3D vehicle models and then performs 2D/3D matching for the detection. Motivated by the importance of accurate depth for 3D object detection, many works [31, 25, 24, 10, 47] introduce a pre-trained external network for depth estimation. In contrast to these methods, we only use the monocular image as input without any extra burden.
In recent years, some works also use only RGB data as the input for the task [39, 3, 9, 40, 22]. For instance, MonoDIS [39] proposes to leverage a disentangling transformation between different 2D and 3D tasks to optimize the parameters at the loss level. M3D-RPN [3] focuses on the design of depth-aware convolution layers to improve 3D parameter estimation and on the post-optimization of the orientation by exploring the consistency between projected and annotated bounding boxes. To address the common occlusion issue in monocular object detection, MonoPair [9] proposes to model spatial relationships of objects in paired adjacent RGB images by introducing an uncertainty-based prediction to improve the detection. MoVi-3D [40] builds virtual views where the object appearance is normalized depending on the distance to reduce the visual appearance variability. RAR-Net [22] builds a post-processing method by introducing reinforcement learning to improve the 3D object detection performance. Although these existing methods achieve very promising results, the beneficial geometry relationships between the different 2D and 3D predictions from the detection network are not explicitly modeled to boost the learning of the detection network.
Geometry-Guided 3D Object Detection. Several recent methods consider utilizing geometric information for monocular 3D object detection [14, 30, 28, 5, 18]. One research direction mainly focuses on using geometry information to improve the detection performance in the inference stage via post-processing [3, 38]. For instance, M3D-RPN [3] employs the consistency between the 2D projected and the predicted 2D bounding boxes to optimize orientation parameters in a post-processing step. UR3D [38] uses estimated keypoints to post-optimize the predictions of physical sizes and yaw angles by minimizing an objective function. Some other works [30, 28, 18, 5] consider using a simplified perspective projection relationship in the training phase. In particular, MonoGRNet [30] presents a geometric reasoning method based on instance depth estimation and 2D bounding box projection to obtain more accurate 3D localization. GS3D [18] uses average object sizes based on the statistics of the training data to guide the location estimation. Decoupled-3D [5] estimates the depth from the projected average height of each vertical edge and the 3D height of the objects. RTM3D [20] predicts keypoints, including the eight vertices and the center of the 3D object in the image plane, and then minimizes an energy function using the geometric constraints of perspective projection. Ivan [1] relies on extra CAD models to process labels for keypoint detection and enforces the constraint between 2D keypoints and the CAD models using a consistency loss. However, these methods basically utilize the geometry at the prediction level and ignore several important geometry elements (object poses and locations) in their geometric modeling.
In contrast to these methods, we jointly model the geometry relationships between the scene depth and 2D bounding boxes, 3D dimensions, and object poses, and the geometric model is implemented as a network module to be leveraged for geometry-aware representation learning that directly boosts the depth estimation.
3 The Proposed Approach
3.1 Framework Overview
A framework overview is illustrated in Fig. 2. We model an object as a single point following [49, 9]. Our framework consists of four key steps. First, we use deep layer aggregation [48], a fully-convolutional encoder-decoder network, to extract features from a monocular image. Second, the features are fed into several network branches to separately predict the 2D bounding box, 3D object dimension, and orientation (Sec. 3.2). Third, the geometric module models the geometry relationships among these 2D/3D predictions to obtain a geometric formula, which is implemented as a network module for geometry-aware feature learning (Sec. 3.3). Finally, we use the geometric features for depth estimation (Sec. 3.3), which is combined with the other 3D predictions to obtain the 3D object detection results.
3.2 Base Detection Structure
Our base network structure for 2D detection, 3D dimension and orientation prediction is derived from anchor-free 2D object detection [49, 41], with six output branches. Each branch takes the backbone features as input and applies a 3x3 convolution, a ReLU, and a 1x1 convolution for prediction. The heatmap branch is used to locate the 2D object center. The 2D/3D offset branches estimate the 2D/3D centers in the 2D image coordinate system. The 2D box size branch and the 3D dimension branch predict the size of the 2D bounding box and the dimensions of the 3D object, respectively. Similar to [28, 9, 49], the orientation branch predicts the observation angle of the object by encoding it into scalars.
3.3 Geometric Module for Learning Geometric Representations
In this section, we introduce the proposed geometric formula via modeling the relationships between the depth and 2D/3D predictions, and present how it can be implemented to learn geometric representations for depth estimation.
Formulation and notation. We adopt the 3D object definition described by the KITTI dataset. The coordinate system is constructed in meters with the camera center as the origin. A 3D bounding box is represented as a 7-tuple (W, H, L, x, y, z, r_y), where W, H, and L are the dimensions of the 3D bounding box (width, height, and length, respectively), and (x, y, z) is the bottom center coordinate of the 3D bounding box. As shown in Fig. 3, r_y denotes the rotation around the Y-axis in the camera coordinate system, in a range of [-π, π]. Moreover, to facilitate the introduction of the proposed geometric formula, we define the 2D bounding box with a 4-tuple (w, h, u, v), where (w, h) and (u, v) represent the size and the center of the 2D bounding box, respectively.
3.3.1 Projective Modeling of Depth and 2D/3D Network Predictions
We derive a geometric formula for modeling the geometric relationships between the scene depth and multiple 2D/3D network predictions, 2D bounding box, 3D dimension, and object orientation from the perspective projection.
Geometric relationship of 2D and 3D corners. First, we represent an object in the object coordinate system, in which the origin is the bottom center of the object via the translation transformation from the camera coordinate system. As shown in Fig. 3, the coordinate of the th () corner in the 3D object bounding box, denoted as , can be given as follows:
(1) 
where , and represent the coordinate difference between the corner and the center of the object in X, Y, and Z direction, respectively; denotes the index of different values as shown in Fig. 3. With the position of the object in the camera coordinate system, we can represent the corner in the same coordinate system as:
(2) 
where and respectively represent the bottom center coordinate and the corner coordinate of the 3D object bounding box in the camera coordinate system; , , and denote the coordinate value along the X, Y, and Z dimension in the camera plane. also represents the distance from the bottom center of object to the camera plane, the depth of the object in the camera coordinate system; Given the intrinsic matrix of the camera provided by the official KITTI dataset, , we can project the corner in the camera coordinate system to the pixel coordinate system as:
(3) 
where denotes the projected corner coordinate in the pixel coordinate system; indicates the depth of the th corner; and respectively denote the horizontal and vertical coordinate of the corner in the pixel coordinate system.
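The chain of transformations behind Eqs. 1 to 3 (object frame, camera frame, pixel plane) can be sketched as follows. This is a minimal illustration that assumes the standard KITTI box convention (length along X, height along negative Y, width along Z, yaw about the Y-axis) and a plain pinhole model that ignores the small translation terms of the KITTI projection matrix. The function and parameter names (`box_corners_object`, `to_camera`, `project`, `fx`, `fy`, `cu`, `cv`) are illustrative and do not come from the paper.

```python
import math


def box_corners_object(w, h, l, ry):
    """Eight box corners relative to the bottom center, rotated by yaw ry about Y."""
    corners = []
    for dx in (l / 2, -l / 2):
        for dy in (0.0, -h):  # Y points down in the camera frame, so the top face is at -h
            for dz in (w / 2, -w / 2):
                x = dx * math.cos(ry) + dz * math.sin(ry)
                z = -dx * math.sin(ry) + dz * math.cos(ry)
                corners.append((x, dy, z))
    return corners


def to_camera(corners, bottom_center):
    """Translate object-frame corners by the bottom center given in the camera frame."""
    cx, cy, cz = bottom_center
    return [(x + cx, y + cy, z + cz) for x, y, z in corners]


def project(points, fx, fy, cu, cv):
    """Pinhole projection to (u, v, depth); assumes every point has z > 0."""
    return [(fx * x / z + cu, fy * y / z + cv, z) for x, y, z in points]


if __name__ == "__main__":
    # Hypothetical car: 2.0 m wide, 1.5 m high, 4.0 m long, 20 m ahead of the camera.
    cam = to_camera(box_corners_object(2.0, 1.5, 4.0, 0.3), (1.0, 1.5, 20.0))
    for u, v, z in project(cam, 700.0, 700.0, 600.0, 180.0):
        print(round(u, 1), round(v, 1), round(z, 2))
```

Each projected corner keeps its own depth, which is why the 2D height of the box depends on the pose and dimensions of the object and not only on the depth of its center.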
Relationship between 2D height and 3D corners. Given the eight corners of the 3D object box in the pixel plane, the height of the projected 2D bounding box can be estimated from the difference between the vertical coordinate of the uppermost corner () and that of the lowermost corner () in the pixel coordinate system as:
(4) 
where is derived from Eq. 3; represents the maximum of of the eight corners, analogically for ; denotes the focal length in the vertical direction of the pixel plane.
Relationship between depth and other 2D/3D parameters. Similar to the definition of the bird’seye view angle (see Fig. 3a), we define the angle between the bottom center of the object and the horizontal plane as (see Fig. 3b). Given the projected coordinate of the object bottom center in the pixel plane based on Eq. 3, we can obtain the following geometric relationship:
(5) 
where is the location of the principal point relative to the origin in the pixel plane. Then, combining Eq. 4 and Eq. 5, the depth of the center of the object, , can be written as:
(6) 
where . It can be clearly observed that, the depth is correlated to the camera intrinsic parameters ( and ), the object position (when deriving ), 3D dimension (when deriving and ), and orientation of the object (when deriving ).
Relationship to existing works. Eq. 6 obeys the perspective projection principle that farther objects tend to appear smaller than nearer objects. It also differs clearly from prior works in that our formula has a nonlinear relationship between the scene depth and , due to the modeling of the object pose and 3D dimensions. We can simplify the proposed formula in two different ways: (i) To reduce the computational complexity, we can consider only the first term in Eq. 6 to obtain a simplified geometric formula v1:
(7) 
(ii) If the variation of pose and position is not considered, then the formulation in Eq. 7 can be further derived as a simplified geometric formula v2:
(8) 
where represents the scale factor for the depth scale conversion. The formula in Eq. 8 is widely used in 3D object detection [18, 5]. We report a detailed comparison and analysis of our formulation in Eq. 6 and the two simplified versions in the experimental results (see Sec. 4.2).
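As a point of reference for the simplest variant, the sketch below uses the standard pinhole similar-triangles relation, in which depth is proportional to the focal length times the 3D height divided by the 2D height. It is an assumed form of the height-only relation that the text describes as widely used, not a transcription of Eq. 8. The names `depth_from_height`, `f_y`, `height_3d`, `height_2d`, and `scale` are hypothetical.

```python
def depth_from_height(f_y, height_3d, height_2d, scale=1.0):
    """Height-only depth estimate: scale * f_y * H / h.

    f_y       vertical focal length in pixels
    height_3d object height in meters
    height_2d projected box height in pixels
    scale     optional depth scale factor (assumed to default to 1)
    """
    if height_2d <= 0:
        raise ValueError("2D height must be positive")
    return scale * f_y * height_3d / height_2d


if __name__ == "__main__":
    # A 1.5 m tall car whose box is 52.5 px high, seen with f_y = 700 px.
    print(depth_from_height(700.0, 1.5, 52.5))
```

Because this relation uses only the two heights, two cars with the same 2D and 3D height receive the same depth regardless of pose or position, which is the limitation that the full formula is meant to address.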
3.3.2 GeometryGuided Scene Depth Learning
Following the proposed geometric formula, we devise and implement a network module for geometry-guided deep representation learning for accurate depth prediction, as shown in the red dashed box of Fig. 2. The module aims to learn geometric representations using the 2D/3D geometry-related network predictions (2D bounding box, 3D object dimension, and orientation) as input. Specifically, in the training stage, the module first produces a one-channel depth map calculated with the proposed geometric formula as described in Eq. 6. By introducing the camera parameters, the depth map is then transformed into a 3-channel 3D map, in which each spatial position represents a 3D data point; this map serves as the initial geometric input. The 3D map then goes through three nonlinear transformation blocks, each consisting of a convolution, a batch normalization, and a ReLU layer and taking the output of the previous block as input, to learn a robust geometric representation map. We set the number of channels of this map to 32 in our experiments. These learned geometric representations are further concatenated with the image representations produced by the backbone network to learn the depth estimation. In the inference stage, we perform the same procedure as in training, and the final depth output is combined with the other predictions, including 2D bounding boxes, 3D dimensions, and orientations, to produce 3D object bounding boxes.

Method  3D Detection (Easy, Mod., Hard)  BEV (Easy, Mod., Hard)
Baseline  16.42  11.84  10.06  24.47  17.17  15.40
w/ gt Dim  19.85  14.06  12.02  25.06  18.29  15.85
w/ gt Depth  79.82  70.91  62.41  88.60  82.66  75.41
Table 1: Baseline results on the KITTI val set (car category) when the predicted dimensions (gt Dim) or the predicted depth (gt Depth) are replaced with ground-truth values.
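The step in Sec. 3.3.2 that turns the one-channel depth map into a 3-channel 3D map amounts to a back-projection with the camera intrinsics. The sketch below is one plausible reading of that step, assuming a pinhole camera and assuming that feature cell (row, col) corresponds to pixel (col * stride, row * stride). The names `depth_to_xyz_map`, `fx`, `fy`, `cu`, `cv`, and `stride` are illustrative.

```python
def depth_to_xyz_map(depth, fx, fy, cu, cv, stride=1):
    """Back-project a 2D grid of depths into a grid of (X, Y, Z) camera-frame points."""
    xyz = []
    for row_idx, row in enumerate(depth):
        out_row = []
        for col_idx, z in enumerate(row):
            u = col_idx * stride  # pixel column of this feature cell (assumed mapping)
            v = row_idx * stride  # pixel row of this feature cell (assumed mapping)
            out_row.append(((u - cu) * z / fx, (v - cv) * z / fy, z))
        xyz.append(out_row)
    return xyz


if __name__ == "__main__":
    # Tiny 2x2 depth map with hypothetical intrinsics.
    for row in depth_to_xyz_map([[10.0, 10.0], [20.0, 20.0]], 100.0, 100.0, 1.0, 0.0):
        print(row)
```

In a network, the three values at each position would form the three input channels of the first transformation block.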
3.4 Misalignment in 2D and 3D bounding Boxes
A misalignment remains between the 2D projected box and the 2D annotation box. Due to the perspective projection effect, farther objects appear smaller than nearer ones, so the misalignment is more serious for nearby objects, which makes learning with the proposed formula inaccurate for them. To handle this misalignment, we propose to use the 2D projected box instead of the 2D annotation box as the ground truth to ensure the correctness of the depth estimation. According to Eqs. 1 and 2, we compute the 3D corner coordinates of the object from its 3D pose and 3D dimensions. We then obtain their coordinates on the pixel plane through the projection transformation in Eq. 3. Finally, we calculate the differences between the extreme vertices in the image plane as the height and width of the 2D projected box.
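The last step of this procedure, going from the projected corners to a 2D projected box, can be sketched as below. The function takes corner positions that have already been projected to the pixel plane and returns the tight axis-aligned box around them. It is an illustration only: it does not clip the box to the image bounds, and the name `box_from_projected_corners` is hypothetical.

```python
def box_from_projected_corners(pixels):
    """Tight axis-aligned box around projected corners.

    pixels: iterable of (u, v) or (u, v, depth) tuples.
    Returns (u_center, v_center, width, height) in pixels.
    """
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    u_min, u_max = min(us), max(us)
    v_min, v_max = min(vs), max(vs)
    return ((u_min + u_max) / 2, (v_min + v_max) / 2, u_max - u_min, v_max - v_min)


if __name__ == "__main__":
    corners_px = [(100.0, 50.0), (140.0, 90.0), (120.0, 60.0), (110.0, 95.0)]
    print(box_from_projected_corners(corners_px))
```

Using this box as the training target keeps the 2D height consistent with the 3D annotation by construction, whereas a hand-annotated 2D box may deviate from it.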
3.5 Implementation Details
Backbone. We adopt a DLA34 [48] network architecture without deformable convolutions as our backbone. During training, we set the input resolution of the network as . The spatial size of the feature map from the backbone is , where represents the downsampling factor of the backbone CNN.
Optimization loss. The optimization objective of our deep detection framework follows a multi-task learning setting and consists of classification and regression losses for both the 2D and 3D predictions. Specifically, we train the heatmap prediction with the focal loss [21]. The branches for offsets and dimensions in both the 2D and 3D detection are trained with L1 losses. The branch for the orientation prediction is trained with a MultiBin loss following [9, 49]. Based on [9, 13], we use an L1 loss with heteroscedastic aleatoric uncertainty for the depth estimation (more details are given in the Appendix).
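For readers unfamiliar with uncertainty-weighted regression, the sketch below shows one common Laplacian form of an L1 loss with heteroscedastic aleatoric uncertainty, in which the network predicts a log-scale term alongside the depth. The exact form used by the authors is given in their appendix and is not reproduced here; the expression and the names `uncertainty_l1` and `log_sigma` are assumptions.

```python
import math


def uncertainty_l1(pred, target, log_sigma):
    """Assumed form: sqrt(2) / sigma * |pred - target| + log(sigma), with sigma = exp(log_sigma)."""
    return math.sqrt(2.0) * math.exp(-log_sigma) * abs(pred - target) + log_sigma


if __name__ == "__main__":
    # A 2 m depth error, scored with low and with high predicted uncertainty.
    print(uncertainty_l1(12.0, 10.0, 0.0))
    print(uncertainty_l1(12.0, 10.0, 1.0))
```

The first term down-weights residuals where the predicted uncertainty is high, and the second term penalizes the network for predicting high uncertainty everywhere.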
Method  Extra data  3D Detection (Easy, Mod., Hard)  BEV (Easy, Mod., Hard)  AOS (Easy, Mod., Hard)  Runtime
MonoDLE [26]  -  17.23  12.26  10.29  24.79  18.89  16.00  93.46  90.23  80.11  -
GrooMeD-NMS [15]  -  18.10  12.32  9.65  26.19  18.27  14.05  90.05  79.93  63.43  -
DDMP-3D [42]  -  19.71  12.78  9.80  28.08  17.89  13.44  90.73  80.20  61.82  -
Decoupled-3D [5]  Yes  11.08  7.02  5.63  23.16  14.82  11.25  87.34  67.23  53.84  -
UR3D [38]  Yes  15.58  8.61  6.00  21.8  12.51  9.20  -  -  -  120ms
AM3D [25]  Yes  16.50  10.74  9.52  25.03  17.32  14.91  -  -  -  400ms
PatchNet [24]  Yes  15.68  11.12  10.17  22.97  16.86  14.97  -  -  -  400ms
DA-3Ddet [47]  Yes  16.80  11.50  8.9  -  -  -  -  -  -  -
D4LCN [10]  Yes  16.65  11.72  9.51  22.51  16.02  12.55  90.01  82.08  63.98  -
Kinematic3D [4]  Yes  19.07  12.72  9.17  26.69  17.52  13.10  58.33  45.50  34.81  120ms
CaDDN [31]  Yes  19.17  13.41  11.46  27.94  18.91  17.19  78.28  67.31  59.52  -
GS3D [18]  No  4.47  2.90  2.47  8.41  6.08  4.94  85.79  75.63  61.85  2000ms
MonoGRNet [30]  No  9.61  5.74  4.25  18.19  11.17  8.73  -  -  -  60ms
MonoDIS [39]  No  10.37  7.94  6.40  17.23  13.19  11.12  -  -  -  -
M3D-RPN [3]  No  14.76  9.71  7.42  21.02  13.67  10.23  88.38  82.81  67.08  161ms
MonoPair [9]  No  13.04  9.99  8.65  19.28  14.83  12.89  91.65  86.11  76.45  57ms
RTM3D [20]  No  14.41  10.34  8.77  19.17  14.20  11.99  91.75  86.73  77.18  55ms
MoVi-3D [40]  No  15.19  10.90  9.26  22.76  17.03  14.85  -  -  -  45ms
RAR-Net [22]  No  16.37  11.01  9.52  22.45  15.02  12.93  88.48  83.29  67.54  -
Our method  No  18.85  13.81  11.52  25.86  18.99  16.19  94.67  89.44  79.27  50ms
Improvement  -  +2.48  +2.80  +2.00  +3.10  +1.96  +1.34  +2.92  +2.71  +2.09  -
Table 2: Comparison on the KITTI test set for the car category. A dash marks an entry that is not available.

Training: We use a batch size of and train the overall deep network for 140 epochs on NVIDIA 1080 Ti GPUs. To alleviate overfitting, we adopt data augmentation techniques including random scaling, random horizontal flipping, and random cropping for the 2D detection, and random horizontal flipping for the 3D detection. We use the Adam optimizer with a weight decay of 1e-5 to optimize the full training loss as described in [9]. The initial learning rate is 1.25e-4, which is dropped by multiplying after the th and the th epoch. To make training stable, we apply a linear warm-up strategy for learning with the geometric network module in the first 5 epochs.
Inference: We first predict 2D bounding boxes, 3D dimensions, and orientations via a shared backbone and several separate task branches. Then, we use the proposed formula to predict a coarse depth, followed by several convolution layers for the depth estimation. Finally, similar to [49], we use a simple post-processing algorithm based on max-pooling and back-projection to recover 3D bounding boxes from the 2D boxes, 3D dimensions, orientations, and depth.
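The max-pooling part of this post-processing is typically a peak search on the center heatmap: a cell is kept if it equals the maximum of its 3x3 neighbourhood, which replaces box-level non-maximum suppression in point-based detectors. The sketch below illustrates that idea in plain Python. The score threshold of 0.3 and the name `heatmap_peaks` are hypothetical and are not taken from the paper.

```python
def heatmap_peaks(heat, threshold=0.3):
    """Return (row, col, score) for cells that are 3x3 local maxima above threshold."""
    n_rows, n_cols = len(heat), len(heat[0])
    peaks = []
    for r in range(n_rows):
        for c in range(n_cols):
            score = heat[r][c]
            if score < threshold:
                continue
            # Maximum over the 3x3 window, truncated at the borders.
            window_max = max(
                heat[rr][cc]
                for rr in range(max(0, r - 1), min(n_rows, r + 2))
                for cc in range(max(0, c - 1), min(n_cols, c + 2))
            )
            if score >= window_max:
                peaks.append((r, c, score))
    return peaks


if __name__ == "__main__":
    heat = [
        [0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.2, 0.1, 0.6],
    ]
    print(heatmap_peaks(heat))
```

Each surviving peak would then be combined with the offset, dimension, orientation, and depth predictions at the same location to form one 3D box.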
4 Experiments
Setup. The KITTI dataset [11] provides widely used benchmarks for various visual tasks in autonomous driving, including 2D object detection, Average Orientation Similarity (AOS), Bird's Eye View (BEV), and 3D object detection. The official dataset contains 7481 training and 7518 test images with 2D and 3D bounding box annotations for cars, pedestrians, and cyclists. We report the average precision (AP) for each task under three different settings: easy, moderate, and hard, as defined in [11]. Moreover, following [39], we use 40 recall positions instead of the 11 recall positions proposed in the original Pascal VOC benchmark, which results in a fairer comparison of the results. Each class uses a different IoU standard for evaluation. We report our results on the official IoU settings for cars.

Method  3D Detection IoU 0.7 (Easy, Mod., Hard)  BEV IoU 0.7 (Easy, Mod., Hard)  3D Detection IoU 0.5 (Easy, Mod., Hard)  BEV IoU 0.5 (Easy, Mod., Hard)
CenterNet [49]  0.60  0.66  0.77  3.46  3.31  3.21  20.00  17.50  15.57  34.36  27.91  24.65
MonoGRNet [30]  11.90  7.56  5.76  19.72  12.81  10.15  47.59  32.28  25.50  52.13  35.99  28.72
MonoDIS [39]  11.06  7.60  6.37  18.45  12.58  10.66  -  -  -  -  -  -
M3D-RPN [3]  14.53  11.07  8.65  20.85  15.62  11.88  48.53  35.94  28.59  53.35  39.60  31.76
MoVi-3D [40]  14.28  11.13  9.68  22.36  17.87  15.73  -  -  -  -  -  -
MonoPair [9]  16.28  12.30  10.42  24.12  18.17  15.76  55.38  42.39  37.99  61.06  47.63  41.92
Baseline  16.54  13.37  11.15  23.62  19.19  16.70  53.93  40.97  36.67  58.72  45.48  40.02
Our method  18.45  14.48  12.87  27.15  21.17  18.35  56.59  43.70  39.37  61.96  47.84  43.10
Table 3: Comparison on the KITTI val set for the car category with the AP metric. The results of the previous works are from [9]. Our approach significantly outperforms the previous state-of-the-art methods on almost all the evaluation protocols and settings. The bold black/blue color indicates the best/the second best performing method.
4.1 Overall Performance Comparison and Analysis
Table 2 and 3 show the overall performance of the proposed approach on the KITTI 3D test and val sets for cars from the official online leaderboard as of Mar. 12th, 2021. Existing stateoftheart monocular 3D object detectors, including methods using extra data and only using monocular image are listed in the tables for comparison. The KITTI val results of MonoGRNet [30], M3DRPN [3] and MonoPair [9] are quoted from [9].
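The AP values in these tables are computed with 40 recall positions, as described in the setup. A minimal sketch of that metric, assuming the usual interpolated definition (the mean over recall levels 1/40, 2/40, ..., 1 of the best precision achieved at that recall or higher), is given below. The official KITTI evaluation builds the precision-recall curve from detection scores and difficulty filters, which this sketch takes as given; the name `ap_r40` is illustrative.

```python
def ap_r40(recalls, precisions):
    """Interpolated average precision over 40 recall positions.

    recalls, precisions: paired points on a precision-recall curve, values in [0, 1].
    """
    total = 0.0
    for k in range(1, 41):
        level = k / 40
        # Best precision among curve points whose recall reaches this level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level - 1e-12]
        total += max(reachable) if reachable else 0.0
    return total / 40


if __name__ == "__main__":
    print(ap_r40([0.5, 1.0], [1.0, 0.5]))
```

Unlike the 11-point variant, the recall level 0 is excluded, so a detector cannot gain credit from a single confident correct detection alone.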
Description  3D Detection (Easy, Mod., Hard)  BEV (Easy, Mod., Hard)
Original baseline  12.78  9.83  8.27  18.32  14.18  12.11
+ Uncertainty  15.40  11.10  9.58  22.33  16.53  14.18
+ Center3d  16.22  12.88  10.94  22.61  17.89  16.17
+ Projected box  16.54  13.37  11.15  23.62  19.19  16.70
Enhanced baseline  16.54  13.37  11.15  23.62  19.19  16.70
Table 4: Step-by-step results from the original baseline to the enhanced baseline on the KITTI val set.
Building a simple yet strong baseline for monocular 3D object detection. We report the results of the enhanced baseline for monocular 3D object detection in Table 4. Overall, the enhanced baseline improves on the original one by 3.76%, 3.54%, and 2.88% on the easy, moderate, and hard difficulty levels, respectively. This is achieved by introducing three changes to the original baseline. First, we adopt the L1 loss with aleatoric uncertainty from [9, 13], which makes the training stage more robust to noisy input. Second, we use the projected 3D center as the ground truth for the 2D heatmap prediction, similar to SMOKE [23]. Third, we address the misalignment between the 2D ground-truth bounding boxes and the 2D projected bounding boxes by using the 2D projected box as the ground truth. This guarantees the consistency between 2D and 3D boxes under the projection relationships in the proposed geometric formula and ensures robust learning with the formula. The enhanced baseline achieves 16.54%, 13.37%, and 11.15% on the easy, moderate, and hard difficulty levels, respectively.
Comparison with monocular image-based methods. Our approach achieves a notable improvement over the state-of-the-art monocular image-based detectors [39, 30, 3, 9] on both the val and test sets. As shown in Table 2, on the KITTI test set for the car category, an indispensable part of the 3D object detection task in the autonomous driving scenario, our method achieves 18.85% (+2.48% improvement) on the easy, 13.81% (+2.80% improvement) on the moderate, and 11.52% (+2.00% improvement) on the hard setting compared with the previous state-of-the-art image-only method. Besides, compared with the unpublished methods [26, 15], our method still increases the AP by 1.49% on the moderate setting. For the Bird's Eye View (BEV) on the car class, our method also achieves the best performance among the published image-only methods, increasing the AP over the second-best such method by 3.10%, 1.96%, and 1.34% on the easy, moderate, and hard levels, respectively. On the KITTI val set, our method also establishes new state-of-the-art performance on both 3D object detection and BEV. Tables 2 and 3 show a considerable and robust improvement over the state-of-the-art monocular detection methods, benefiting from the proposed geometric formula for learning geometry-aware representations to advance the depth estimation.
Comparison with methods using extra data or networks. The prior methods [5, 25, 24, 10, 31] achieve impressive performance on the KITTI test set by introducing extra data or external networks. Although our method uses none of this information, as shown in Table 2, it still outperforms these methods in terms of the AP metric for 3D detection, by 0.40% on the moderate level over the best of them. This improvement demonstrates the strong performance of our method with the proposed geometry-guided depth learning for monocular 3D object detection.
Latency. We test our model on an Nvidia GTX 1080 Ti with PyTorch 1.1, CUDA 9.0, and an Intel CPU @ 2.60GHz. As shown in Table 2, the proposed method achieves 20 fps and runs at a speed similar to other real-time state-of-the-art methods [20, 40]. This demonstrates the efficiency of our method when compared with other competitive methods under a similar experimental environment.

Method  3D Detection (Easy, Mod., Hard)  BEV (Easy, Mod., Hard)
Baseline  16.54  13.37  11.15  23.62  19.19  16.70
+ 3DCAT  15.87  11.80  10.33  21.85  16.90  14.51
+ GeoSV1  17.25  13.38  11.29  24.33  18.57  16.06
+ GeoSV2  17.10  13.22  11.13  25.02  18.62  16.48
Ours (full model)  18.45  14.48  12.87  27.15  21.17  18.35
Table 5: Ablation study of the geometric formula and module on the KITTI val set.
4.2 Ablation Experiments
We conduct extensive ablation studies on the KITTI val set to demonstrate the effectiveness of the proposed approach for geometry-guided depth learning in advancing monocular 3D object detection. The AP metric is employed for all evaluations. We mainly investigate two aspects: the effect of the proposed geometric formula and module, and the effect of the geometry-guided representation learning for depth estimation.
Baseline and variant models. To conduct an extensive evaluation, we consider the following baseline and variant models: (i) Baseline, which is a base model achieving a strong 3D detection performance with an AP of 13.37% on the moderate level (see Table 5); (ii) 3DCAT, which directly inputs the concatenation of the 3D network predictions to the nonlinear transformation blocks while bypassing the depth calculation with the geometric formula; (iii) GeoSV1, which uses our simplified geometric formula v1 as in Eq. 7; (iv) GeoSV2, which uses our simplified geometric formula v2 as in Eq. 8.
Effects of the geometric formula and module. A detailed ablation study is shown in Table 5. As we can observe, ours (full model) achieves a large gain (2.68% on the moderate level) over Baseline + 3DCAT, meaning that directly using the 3D network predictions is not effective enough for learning the geometric representations, which verifies the importance of the proposed geometric formula. Comparing Baseline + GeoSV2, Baseline + GeoSV1, and ours (full model), all three of which use geometric relationships, the 3D detection performance on the moderate level gradually improves as more geometry elements are involved in the modeling. This confirms our motivation of modeling the relationship between depth and multiple 2D/3D geometry elements instead of only part of them, such as the height alone, which is typically considered in most existing works [18, 5] in a way similar to GeoSV2. Finally, ours (full model) improves over the Baseline by 1.11% and 1.98% on the moderate level for 3D detection and BEV, respectively, which demonstrates the effectiveness of our proposed approach.
Effect of the geometry-guided representation learning for depth estimation. Fig. 6 shows a performance comparison between the baseline and our approach on depth estimation. Specifically, we evaluate the predicted depth of all car samples in different depth ranges under two primary metrics (SILog and sqRel) widely used in the depth estimation field. Fig. 7 shows that 87% of the cars are within 40m, while only 5.0% of them are 45m away. Fig. 6 shows that our approach consistently outperforms the baseline in all the depth ranges, especially within the 40m range that contains most samples, which further validates our idea of using geometry-guided representation learning to boost depth estimation and advance monocular 3D object detection.
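The two depth metrics named above are not defined in the text. The sketch below uses the definitions that are common in depth-estimation evaluation code: SILog as 100 times the standard deviation of the log-depth error, and sqRel as the mean squared error divided by the ground-truth depth. Both forms, and the names `silog` and `sq_rel`, are assumptions here, not a statement of the exact evaluation used for Fig. 6.

```python
import math


def silog(pred, gt):
    """Scale-invariant log error: 100 * sqrt(mean(d^2) - mean(d)^2), d = log(pred) - log(gt)."""
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    mean_d = sum(d) / len(d)
    mean_sq = sum(x * x for x in d) / len(d)
    # Clamp at zero to guard against tiny negative values from rounding.
    return 100.0 * math.sqrt(max(mean_sq - mean_d * mean_d, 0.0))


def sq_rel(pred, gt):
    """Squared relative error: mean((pred - gt)^2 / gt)."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


if __name__ == "__main__":
    predicted = [11.0, 18.0, 41.0]
    truth = [10.0, 20.0, 40.0]
    print(silog(predicted, truth), sq_rel(predicted, truth))
```

SILog ignores a global scale error in the predictions, while sqRel penalizes large absolute errors on nearby objects more heavily, so the two metrics are complementary.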
5 Conclusion
We proposed a novel geometric formula, derived by principled modeling of multiple 2D/3D network predictions, to guide the depth estimation and advance monocular 3D object detection. We design and implement this formula as a neural network module that learns geometry-aware features, which are combined with the image representations to boost the learning of the depth. Extensive experiments demonstrate the effectiveness of the proposed approach, and our results achieve state-of-the-art performance by a large margin.
References
 [1] (2020) Monocular 3d object detection via geometric reasoning on keypoints. In VISIGRAPP.
 [2] (2018) Geometry-aware learning of maps for camera localization. In CVPR.
 [3] (2019) M3d-rpn: monocular 3d region proposal network for object detection. In ICCV.
 [4] (2020) Kinematic 3d object detection in monocular video. In ECCV.
 [5] (2020) Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. In AAAI.
 [6] (2017) Deep manta: a coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In CVPR.
 [7] (2015) 3d object proposals for accurate object class detection. In NIPS.
 [8] (2019) Fast point r-cnn. In ICCV.
 [9] (2020) MonoPair: monocular 3d object detection using pairwise spatial relationships. In CVPR.
 [10] (2020) Learning depth-guided convolutions for monocular 3d object detection. In CVPR.
 [11] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR.
 [12] (2019) Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss. CoRR abs/1906.08070.
 [13] (2019) Geometry and uncertainty in deep learning for computer vision. Ph.D. Thesis, University of Cambridge.
 [14] (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In CVPR.
 [15] (2021) GrooMeD-nms: grouped mathematically differentiable nms for monocular 3d object detection. In CVPR.
 [16] (2019) Pointpillars: fast encoders for object detection from point clouds. In CVPR.
 [17] (2018) Cornernet: detecting objects as paired keypoints. In ECCV.
 [18] (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In CVPR.
 [19] (2019) Stereo r-cnn based 3d object detection for autonomous driving. In CVPR.
 [20] (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. In ECCV.
 [21] (2017) Focal loss for dense object detection. In ICCV.
 [22] (2020) Reinforced axial refinement network for monocular 3d object detection. In ECCV.
 [23] (2020) SMOKE: single-stage monocular 3d object detection via keypoint estimation. In CVPR.
 [24] (2020) Rethinking pseudo-lidar representation. In ECCV.
 [25] (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In ICCV.
 [26] (2021) Delving into localization errors for monocular 3d object detection. In CVPR.
 [27] (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In CVPR.
 [28] (2017) 3d bounding box estimation using deep learning and geometry. In CVPR.
 [29] (2018) Frustum pointnets for 3d object detection from rgb-d data. In CVPR.
 [30] (2019) Monogrnet: a geometric reasoning network for monocular 3d object localization. In AAAI.
 [31] (2021) Categorical depth distribution network for monocular 3d object detection. In CVPR.
 [32] (2021) Categorical depth distribution network for monocular 3d object detection. In CVPR.
 [33] (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In ECCV.
 [34] (2019) Orthographic feature transform for monocular 3d object detection. In BMVC.
 [35] (2019) Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In ICCV.
 [36] (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In CVPR.
 [37] (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. TPAMI.
 [38] (2020) Distance-normalized unified representation for monocular 3d object detection. In ECCV.
 [39] (2019) Disentangling monocular 3d object detection. In ICCV.
 [40] (2020) Towards generalization across depth for monocular 3d object detection. In ECCV.
 [41] (2019) Fcos: fully convolutional one-stage object detection. In ICCV.
 [42] (2021) Depth-conditioned dynamic message propagation for monocular 3d object detection. In CVPR.
 [43] (2018) Multi-level fusion based 3d object detection from monocular images. In CVPR.
 [44] (2021) Moving slam: fully unsupervised deep learning in non-rigid scenes. In IROS.
 [45] (2019) Geometry-aware video object detection for static cameras. In BMVC.
 [46] (2019) Reppoints: point set representation for object detection. In ICCV.
 [47] (2020) Monocular 3d object detection via feature domain adaptation. In ECCV.
 [48] (2018) Deep layer aggregation. In CVPR.
 [49] (2019) Objects as points. arXiv preprint arXiv:1904.07850.
Supplementary Material
In this Supplementary Material, we elaborate on the implementation details, experimental results, and qualitative results. Specifically, we present the implementation details of the model training in Section A, additional quantitative results and analysis in Section B, an additional ablation study in Section C, and additional qualitative results in Section D.
A Additional Implementation Details
The overall network optimization loss of the proposed approach consists of three parts: a classification loss $L_{cls}$, a 2D regression loss $L_{2D}$, and a 3D regression loss $L_{3D}$. We present the details of these losses one by one.

(i) For the classification loss, similar to [17, 49], we employ a variant of the focal loss which reduces the penalty for negative locations according to their distance from a positive location:

$$L_{cls} = -\frac{1}{N}\sum \begin{cases} (1-\hat{Y})^{\alpha}\log(\hat{Y}) & \text{if } Y = 1, \\ (1-Y)^{\beta}\,\hat{Y}^{\alpha}\log(1-\hat{Y}) & \text{otherwise,} \end{cases} \tag{9}$$

where $Y$ and $\hat{Y}$ represent the ground-truth class probability given by an unnormalized 2D Gaussian and the model's predicted probability for the class, respectively, the sum runs over all heatmap locations and classes, and $N$ is the number of positive locations. $\alpha$ and $\beta$ are hyperparameters that control the importance of each sample. We set $\alpha$ to 2 and $\beta$ to 4 as the default setting in our experiments.

(ii) The 2D regression loss $L_{2D}$ is defined on a 6-tuple of ground-truth bounding-box targets and a predicted 6-tuple. Specifically, the 6-tuple consists of two 2D offsets, two 3D offsets, and two 2D box sizes. The 2D/3D offsets are used to adjust the 2D/3D center locations before remapping them to the input resolution, following [17, 49]. We use an $L_1$ loss to optimize each parameter of the 6-tuple.

(iii) The 3D regression loss $L_{3D}$ consists of an $L_1$ loss $L_{dim}$ for regressing the dimensions of the 3D bounding box (width, height, and length), and an $L_1$ loss $L_{dep}$ with an uncertainty term for regressing the depth. Specifically, we follow [9, 13] and employ the heteroscedastic aleatoric uncertainty in the depth estimation loss as:

$$L_{3D} = L_{dim} + L_{dep}, \tag{10}$$

$$L_{dep} = \frac{\sqrt{2}}{\sigma}\,\lvert \hat{z} - z \rvert + \log \sigma, \tag{11}$$

where $\hat{z}$ and $z$ represent the predicted depth and the ground-truth depth, respectively, and $\sigma$ is the noisy observation parameter of the model. Hence, the overall optimization loss is the weighted sum of the three losses:

$$L = L_{cls} + \lambda_{1} L_{2D} + \lambda_{2} L_{3D}, \tag{12}$$

where $\lambda_{1}$ and $\lambda_{2}$ are loss weights controlling the balance between the different losses. We consider $L_{2D}$ and $L_{3D}$ equally important and use $\lambda_{1} = \lambda_{2} = 1$ in all experiments.
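The three loss terms described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the classification term follows the penalty-reduced focal loss of [17, 49] with the exponents 2 and 4 given in the text, and the depth term assumes the Laplacian (L1) form of aleatoric uncertainty used in [9], with the network predicting the log of the uncertainty for numerical stability.

```python
import numpy as np


def penalty_reduced_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal-loss variant of [17, 49] on a class heatmap.

    pred:   predicted probabilities in (0, 1).
    target: ground truth splatted with an unnormalized 2D Gaussian,
            equal to exactly 1.0 at positive locations.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0
    # Positive locations: standard focal term.
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negative locations: penalty shrinks near a positive via (1 - target)^beta.
    neg_loss = ((1.0 - target) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    num_pos = max(int(pos.sum()), 1)  # avoid division by zero
    return -(pos_loss + neg_loss) / num_pos


def uncertainty_l1_depth_loss(depth_pred, depth_gt, log_sigma):
    """L1 depth loss with heteroscedastic aleatoric uncertainty.

    Assumed Laplacian form: sqrt(2) / sigma * |z_pred - z_gt| + log(sigma).
    """
    sigma = np.exp(log_sigma)
    return np.mean(np.sqrt(2.0) / sigma * np.abs(depth_pred - depth_gt) + log_sigma)


def total_loss(l_cls, l_2d, l_3d, lambda_1=1.0, lambda_2=1.0):
    """Weighted sum of the three terms; both weights default to 1 as in the text."""
    return l_cls + lambda_1 * l_2d + lambda_2 * l_3d
```

With `log_sigma = 0` the depth term reduces to a plain L1 loss scaled by the square root of 2, and a large predicted uncertainty down-weights the residual at the cost of the `log_sigma` penalty.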
B Additional Results and Analysis
B.1 Additional Results for the Pedestrian/Cyclist Category
Table 6: 3D detection / BEV results for the pedestrian and cyclist categories on the KITTI test set.
Cat.  Method  Easy (3D/BEV)  Mod. (3D/BEV)  Hard (3D/BEV)
Ped.  OFTNet [34]  0.63/1.28  0.36/0.81  0.35/0.51
Ped.  SS3D [12]  2.31/2.48  1.78/2.09  1.48/1.61
Ped.  M3DRPN [3]  4.92/5.65  3.48/4.05  2.94/3.29
Ped.  MoVi3D [40]  8.99/10.08  5.44/6.29  4.57/5.37
Ped.  MonoPair [9]  10.02/10.99  6.68/7.04  5.53/6.29
Ped.  Ours  8.00/9.54  5.63/6.77  4.71/5.83
Cyc.  OFTNet [34]  0.14/0.36  0.06/0.16  0.07/0.15
Cyc.  SS3D [12]  2.80/3.45  1.45/1.89  1.35/1.44
Cyc.  M3DRPN [3]  0.94/1.25  0.65/0.81  0.47/0.78
Cyc.  MoVi3D [40]  1.08/1.45  0.63/0.91  0.70/0.93
Cyc.  MonoPair [9]  3.79/4.76  2.12/2.87  1.83/2.42
Cyc.  Ours  4.73/5.93  2.93/3.87  2.58/3.42
As mentioned in the main paper, the official KITTI [11] dataset contains 7,481 training and 7,518 test images with 2D and 3D bounding box annotations, including the pedestrian and cyclist categories. We report our quantitative results in Table 6, using the official settings with IoU ≥ 0.5 for pedestrians and cyclists on the KITTI test set. Our method establishes new state-of-the-art performance on all three detection levels (easy, moderate, and hard) for the cyclist category, with only a slight drop for the pedestrian category. We investigate this slight drop by comparing the 2D detection results between car and pedestrian. In principle, the advantage of the proposed geometric formula is independent of the object class, as 2D images conform to the projective camera model and every object obeys the same geometric reasoning. However, a performance gap between car detection and pedestrian/cyclist detection commonly exists in ours and many previous works on the KITTI dataset. This is mainly due to insufficient training samples of the pedestrian and cyclist categories on KITTI, leading to unstable training, sensitivity to hyperparameters, and inaccurate, high-variance predictions of the 2D/3D information (2D boxes, orientation, and 3D dimensions). This category imbalance is a common issue on the KITTI dataset for the 3D object detection task. Table 7 shows that the 2D detection results on the moderate level are only 50.48% and 44.63% for cyclist and pedestrian, respectively, while reaching 90.14% for car on the test set. Similarly, for orientation estimation, the pedestrian score (39.76%) is less than half of the car score (89.44%) on the moderate level. These two factors introduce more noise into our geometric formula and affect the geometry-guided representation learning. Nevertheless, our results for pedestrians and cyclists are highly competitive with other SOTA methods on the KITTI test set.
Table 7: 2D detection / AOS results on the KITTI test set.
Cat.  Method  Easy (2D/AOS)  Mod. (2D/AOS)  Hard (2D/AOS)
Car  SS3D [12]  92.72/92.57  84.92/84.38  70.35/69.82
Car  M3DRPN [3]  89.04/88.38  85.08/82.81  69.26/67.08
Car  Ours  95.11/94.67  90.14/89.44  80.19/79.27
Ped.  SS3D [12]  61.58/53.72  45.79/39.60  41.14/35.40
Ped.  M3DRPN [3]  56.64/44.33  41.46/31.88  37.31/28.55
Ped.  Ours  58.49/52.87  44.63/39.76  40.41/35.83
Cyc.  SS3D [12]  52.97/42.95  35.48/27.79  31.07/24.26
Cyc.  M3DRPN [3]  61.54/48.11  41.54/31.09  35.23/26.10
Cyc.  Ours  65.42/55.58  50.48/42.05  42.48/35.48
B.2 Further Analysis on Depth Estimation from Geometry Modeling
We conduct a further statistical analysis of depth on the train+val set. Table 8 shows that for two cars with the same height in both the 2D bounding box and the 3D bounding box, the depth values of their centers may differ by more than 5 meters due to their distinct poses and locations. This confirms the critical importance of considering the 3D pose and location simultaneously in the geometric modeling for depth estimation, which has not been investigated by previous works.
depth  The height of 3D bounding boxes  

avg.  
30  39.51  40.23  40.39  42.23  39.47  
37.69  36.53  36.53  37.21  37.25  
diff.  1.82  3.70  3.86  5.02  2.22  
35  34.04  34.68  35.69  34.12  36.40  
32.99  31.72  31.77  32.05  31.75  
diff.  1.05  2.96  3.92  2.07  4.65 
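The ambiguity discussed in this subsection can be illustrated with a small pinhole-projection sketch. It is not the paper's geometric formula: it only projects the eight corners of a 3D box, measures the height of the enclosing 2D box, and then recovers depth with the naive height-only relation (focal length times 3D height divided by 2D height). The function names, the car dimensions, and the focal length of 721.5 pixels (close to the KITTI color cameras) are assumptions made for illustration.

```python
import numpy as np


def box_2d_height(center, dims, yaw, focal=721.5):
    """Pixel height of the 2D box enclosing a projected 3D box.

    center: (x, y, z) of the 3D box center in camera coordinates (meters),
            with x right, y down, z forward.
    dims:   (h, w, l) of the 3D box in meters.
    yaw:    rotation around the camera y-axis in radians.
    """
    h, w, l = dims
    # Eight corners in the object frame (length along x, width along z).
    xs = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    ys = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2.0
    zs = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate around y, then translate to the box center.
    Y = ys + center[1]
    Z = -s * xs + c * zs + center[2]
    # Vertical image coordinate; the principal point cancels in the difference.
    v = focal * Y / Z
    return v.max() - v.min()


def depth_from_height(h_3d, h_2d, focal=721.5):
    """Naive height-only depth estimate, ignoring pose and location."""
    return focal * h_3d / h_2d


dims = (1.5, 1.6, 4.0)      # hypothetical car: height, width, length
center = (0.0, 1.0, 30.0)   # same center depth of 30 m in both cases
for yaw in (0.0, np.pi / 2):
    h_2d = box_2d_height(center, dims, yaw)
    print(f"yaw={yaw:.2f}  2D height={h_2d:.2f}px  "
          f"height-only depth={depth_from_height(dims[0], h_2d):.2f}m")
```

Both boxes share the same center depth and 3D height, yet their 2D heights differ because the nearest corners lie at different distances, so a height-only estimate returns two different depths. This is the kind of pose-dependent discrepancy that motivates including pose and dimensions in the geometric modeling.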
B.3 Additional Results at Different Distances
We provide additional results on depth estimation and monocular 3D object detection at different distances. Table 9 shows more depth estimation results on the KITTI val set, comparing the enhanced baseline and our method. Specifically, we evaluate the depth estimation by computing the Scale Invariant Logarithmic (SILog) error, the squared Relative (sqRel) error, the absolute Relative (absRel) error, and the Root Mean Squared Error of the inverse depth (iRMSE). Our method outperforms the enhanced baseline by large margins on all these evaluation metrics. The depth estimation results clearly demonstrate the effectiveness of our proposed idea of using geometry-guided representation learning to boost depth estimation from monocular images and thereby advance monocular 3D object detection.
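Table 9: Depth estimation results on the KITTI val set in different depth ranges.
Depth Range  Num.  Method  SILog  absRel  sqRel  iRMSE
0-10m  867  Baseline  16.49  8.65  67.55  16.53
0-10m  867  Ours  14.35  7.75  38.46  14.94
0-20m  4236  Baseline  12.12  6.02  31.42  9.98
0-20m  4236  Ours  10.48  5.42  20.81  8.77
0-30m  7379  Baseline  11.00  5.70  25.25  8.16
0-30m  7379  Ours  10.27  5.30  19.73  7.52
0-40m  9797  Baseline  10.49  5.65  24.68  7.23
0-40m  9797  Ours  10.49  5.36  21.71  7.03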
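The four depth metrics named above can be computed as in the following sketch. The definitions are assumptions based on common usage: SILog is taken as the square root of the variance of the log-depth error, sqRel divides the squared error by the ground-truth depth, and no percentage or unit scaling is applied, so the absolute values are not directly comparable to Table 9. The function name `depth_metrics` is hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Per-object depth errors; pred and gt are positive depths in meters."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # SILog: spread of the log-depth error, invariant to a global scale factor.
    log_diff = np.log(pred) - np.log(gt)
    silog_var = np.mean(log_diff ** 2) - np.mean(log_diff) ** 2
    silog = np.sqrt(max(silog_var, 0.0))  # clamp tiny negative round-off

    # Relative errors, normalized by the ground-truth depth.
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)

    # RMSE of the inverse depth, which emphasizes errors on close objects.
    irmse = np.sqrt(np.mean((1.0 / pred - 1.0 / gt) ** 2))

    return {"SILog": silog, "absRel": abs_rel, "sqRel": sq_rel, "iRMSE": irmse}
```

Restricting `pred` and `gt` to the objects whose ground-truth depth falls in a given range before calling the function gives range-wise numbers of the kind reported per row.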
Moreover, we conduct experiments on the improvement of monocular 3D object detection at different distances. Table 10 reports the performance in different object distance ranges following [32]. Our method consistently outperforms the baseline in all ranges.
Table 10: 3D detection and BEV results in different object distance ranges.
Description  3D Det. (15m)  3D Det. (30m)  3D Det. (all)  BEV (15m)  BEV (30m)  BEV (all)
Baseline  18.85  15.42  11.32  26.95  21.94  16.82
Ours  22.29  17.38  12.87  31.37  24.82  18.35
C Additional Ablation Study for Uncertainty and Equation
We investigate the effect of uncertainty together with our geometric module on the KITTI val set in Table 12. The uncertainty is helpful for learning the geometry, but the main improvement comes from the proposed principled geometric modeling. To further validate the effectiveness of Eq. (6), we compare our geometric module against a variant in which all predictions are simply followed by a pointwise MLP (Table 11). Ours is significantly better than the pointwise MLP.
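Table 11: Comparison between the pointwise MLP variant and our geometric module on the KITTI val set.
Description  3D Det. (Easy)  3D Det. (Mod.)  3D Det. (Hard)  BEV (Easy)  BEV (Mod.)  BEV (Hard)
Baseline  16.54  13.37  11.15  23.62  19.19  16.70
Pointwise MLP  17.09  13.12  11.05  23.79  18.20  16.26
Ours  18.79  14.53  12.77  26.48  20.75  18.04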
All Other Enhancements  Uncertainty  Geometric Module  3D Detection  BEV 
11.81  17.51  
14.44  19.77  
13.37  19.19  

14.48  21.17 
D Additional Qualitative Results
Fig. 8 shows a comparison between the enhanced baseline and the proposed method from the bird's-eye view. Fig. 9 presents additional qualitative 3D detection results on images, again comparing the two on the KITTI val set. We can observe from the figures that the proposed geometry-guided learning approach achieves significantly better 3D detection and localization performance than the enhanced baseline.
Figs. 10 and 11 show additional visualizations of the prediction results on KITTI 3D raw data in the image plane and the LiDAR coordinate system, respectively. We use orange, purple, and green boxes for car, pedestrian, and cyclist, respectively. Our approach is able to accurately localize 3D objects at different depths.