Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

07/29/2021 · Yinmin Zhang et al. · SenseTime Corporation, The University of Sydney, The Hong Kong University of Science and Technology

As a crucial task of autonomous driving, 3D object detection has made great progress in recent years. However, monocular 3D object detection remains a challenging problem due to the unsatisfactory performance in depth estimation. Most existing monocular methods directly regress the scene depth while ignoring important relationships between the depth and various geometric elements (e.g., bounding box sizes, 3D object dimensions, and object poses). In this paper, we propose to learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection. Specifically, a principled geometric formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised. We further implement and embed the proposed formula to enable geometry-aware deep representation learning, allowing effective 2D and 3D interactions for boosting the depth estimation. Moreover, we provide a strong baseline through addressing the substantial misalignment between 2D annotation boxes and projected boxes to ensure robust learning with the proposed geometric formula. Experiments on the KITTI dataset show that our method remarkably improves the detection performance of the state-of-the-art monocular method without extra data by 2.80% (absolute) under the moderate test setting. The model and code will be released at https://github.com/YinminZhang/MonoGeo.


1 Introduction

As an important and challenging problem, 3D object detection plays a fundamental role in various computer vision applications, such as autonomous driving, robotics, and augmented/virtual reality. In recent years, monocular 3D object detection has received great attention, because it simply uses a monocular camera instead of requiring extra sensing devices as in LiDAR-based [36, 16, 8, 37] and stereo-based [7, 19, 29, 43] methods. However, the performance gap between LiDAR-based and monocular image-based approaches remains significant, mainly because of the lack of reliable depth information. A quantitative investigation is conducted by only replacing the depth predictions with the ground-truth depth values on a baseline model. The detection performance of the model can be remarkably improved from 11.84% to 70.91% in terms of the AP_40 metric [39] under the moderate setting of the car category on the KITTI val set (see Table 1), which suggests that depth estimation is a critical performance bottleneck in monocular 3D object detection.

Depth information has also been successfully applied as an important 3D geometry element to facilitate learning in other problems, such as 2D object detection [46, 45], human pose estimation [33], and camera localization [44, 2, 35]. However, how to jointly model the geometric relationships between the scene depth and different 2D/3D network predictions, such as 2D box sizes, 3D dimensions, and poses, and how to enable joint learning with the modeled geometric constraints for geometry-aware monocular 3D object detection, is rarely explored in the literature. An intuitive way to introduce the geometric relationships is to leverage the perspective projection between the 3D scene space and the 2D image plane. Prior works [1, 5, 18, 20] either use the geometry only weakly, considering the projection consistency between 2D and 3D for post-processing, or employ the perspective projection while ignoring the object poses and 3D dimensions, which can provide considerably stronger geometric constraints and are extremely important for accurate depth estimation. As can be observed in Fig. 1, the depth values differ by more than 5 meters due to the distinct poses and positions of the cars with the same height of 2D/3D boxes.

In this paper, we propose a novel geometric formula by principled modeling of the relationships between the scene depth and different geometry elements predicted from the deep network for the task of monocular 3D object detection, including 2D bounding boxes, 3D object dimensions, and object poses. We further implement the proposed formula to develop a geometry-based network module, which can be flexibly embedded into the deep learning framework, allowing effective geometry-aware learning on the representation level for guiding the depth estimation and advancing the monocular 3D object detection. Besides, the geometry module can be utilized during both the training and inference phases without additional complex post-processing. Moreover, we provide a simple yet strong baseline for ensuring robust learning with the proposed geometry module, which is achieved through addressing the severe misalignment between the annotated 2D bounding box and the projected 2D bounding box from the 3D annotations. This effective baseline achieves an AP of 13.37% under the moderate setting of the car category on the KITTI val set.

To summarize, the contributions of this paper are threefold:

  • We propose a novel geometric formula, which jointly models the perspective geometry relationships of multiple 2D/3D elements predicted from the deep monocular 3D object detection network, providing strong geometric constraints for learning the 3D detection network.

  • We implement the proposed geometric formula in the neural network as a module, which can be leveraged to guide the representation learning for boosting the depth estimation and significantly advancing the performance of monocular 3D object detection.

  • We provide a simple yet strong baseline through dealing with the misalignment between 2D projected boxes and 2D annotation boxes, which achieves 13.37% AP_40 under the moderate setting of the KITTI val set. We expect our baseline will be beneficial for the community in future research on monocular 3D object detection.

Extensive experiments conducted on the challenging KITTI [11] dataset clearly demonstrate the effectiveness of the proposed approach and show that our method achieves 13.81% in terms of the AP_40 metric, a 2.80% absolute improvement over the state of the art of monocular 3D object detection on the moderate setting of the KITTI test set for the car category.

2 Related Work

There are two groups of works closely related to ours: monocular 3D object detection and geometry-guided 3D object detection.

Monocular 3D Object Detection. Compared with the methods using LiDAR and stereo sensors, 3D object detection with monocular images is challenging due to the absence of reliable depth information. Existing works [6, 27, 25, 24, 5, 10] have considered using external pretrained networks, extra training data, and prior knowledge to improve the performance of monocular 3D object detection. Particularly, DeepMANTA [6] utilizes extra 3D shape and template datasets in learning 2D/3D vehicle models and then performs 2D/3D matching for the detection. Inspired by the importance of accurate depth for 3D object detection, many works [31, 25, 24, 10, 47] develop monocular 3D object detection by introducing pretrained external networks for depth estimation. In contrast to these methods, we only use the monocular image as input without any extra burden.

In recent years, some works also use only RGB data as the input for the task [39, 3, 9, 40, 22]. For instance, MonoDIS [39] proposes to leverage a disentangling transformation between different 2D and 3D tasks to optimize the parameters at the loss level. M3D-RPN [3] focuses on the design of depth-aware convolution layers to improve 3D parameter estimation and on post-optimization of the orientation by exploring the consistency between projected and annotated bounding boxes. To address the common occlusion issue in monocular object detection, MonoPair [9] proposes to model spatial relationships of objects in paired adjacent RGB images via introducing an uncertainty-based prediction for improving the detection. MoVi-3D [40] builds virtual views where the object appearance is normalized depending on the distance, to reduce the visual appearance variability. RAR-Net [22] builds a post-processing method that introduces reinforcement learning to improve the 3D object detection performance. Although these existing methods have achieved very promising results, the beneficial geometric relationships between the different 2D and 3D predictions from the detection network are not explicitly modeled for boosting the learning of the detection network.

Figure 2: An overview of our proposed approach. We leverage a 2D backbone network to extract features from the input monocular RGB image. Then several 2D/3D output branches are used for generating 2D/3D predictions through decoding, except for the depth. The 2D/3D predictions are utilized by the geometric module to compute and generate geometric features via the proposed geometric formula implemented as a network module. The geometric features are concatenated with the image features of the backbone for depth estimation. Based on the depth and the other 3D predictions from the output branches, the decoder outputs the 3D object detection results. Dashed lines and solid lines represent normal flows and neural network forward flows, respectively. The symbol shown in the figure represents a tensor concatenation operation.

Geometry-Guided 3D Object Detection. There are several recent methods considering utilizing geometric information for monocular 3D object detection [14, 30, 28, 5, 18]. One research direction mainly focuses on using geometry information to improve the detection performance in the inference stage via post-processing [3, 38]. For instance, M3D-RPN [3] employs the consistency between the 2D projected and the predicted 2D bounding boxes to optimize orientation parameters in a post-processing step. UR3D [38] uses estimated key points to post-optimize the predictions of physical sizes and yaw angles by minimizing an objective function. Some other works [30, 28, 18, 5] consider using a simplified perspective projection relationship in the training phase. In particular, MonoGRNet [30] presents a geometric reasoning method based on instance depth estimation and 2D bounding box projection to obtain more accurate 3D localization. GS3D [18] uses average object sizes based on statistics of the training data to guide the location estimation. Decoupled-3D [5] estimates the depth from the projected average height of each vertical edge and the 3D height of the objects. RTM3D [20] predicts keypoints, including the eight vertices and the center of the 3D object in the image plane, and then minimizes an energy function using geometric constraints of the perspective projection. Barabanau et al. [1] rely on extra CAD models to process labels for keypoint detection and enforce the constraint between 2D keypoints and the CAD models using a consistency loss. However, these methods basically utilize the geometry at the prediction level and ignore several important geometry elements (object poses and locations) in their geometric modeling. In contrast to these methods, we jointly model the geometric relationships between the scene depth and the 2D bounding boxes, 3D dimensions, and object poses, and the geometric model is implemented as a network module to be leveraged for geometry-aware representation learning that directly boosts the depth estimation.

3 The Proposed Approach

3.1 Framework Overview

A framework overview is illustrated in Fig. 2. We model an object as a single point following [49, 9]. Our framework consists of the following key steps. First, we use deep layer aggregation [48], a fully-convolutional encoder-decoder network, to extract features from a monocular image. Second, the features are fed into several network branches to separately predict the 2D bounding box, 3D object dimensions, and orientation (Sec. 3.2). Third, the geometric module models the geometric relationships from these 2D/3D predictions to obtain a geometric formula, which is implemented as a network module for geometry-aware feature learning (Sec. 3.3). Finally, we utilize the geometric features for depth estimation (Sec. 3.3), which is combined with the other 3D predictions for obtaining the 3D object detection results.

3.2 Base Detection Structure

Our base network structure for 2D detection, 3D dimension, and orientation prediction is derived from anchor-free 2D object detection [49, 41] with six output branches. Each branch takes the backbone features as input and uses a 3x3 convolution, ReLU, and 1x1 convolution for prediction. The heatmap branch is used to locate the 2D object center. The 2D/3D offset branch is applied for estimating the 2D/3D center in the 2D image coordinate system. The 2D box size and the 3D dimension branches predict the size of the 2D bounding box and the 3D dimensions of the 3D object, respectively. Similar to [28, 9, 49], the orientation branch predicts the observation angle of the object via encoding it into scalars.
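To make the branch design concrete, below is a minimal PyTorch-style sketch of one output head as described above (a 3x3 convolution, ReLU, and 1x1 convolution); the channel sizes, module names, and per-branch output dimensionalities are illustrative assumptions rather than the released implementation.

```python
import torch.nn as nn

class PredictionBranch(nn.Module):
    """One output head: 3x3 conv -> ReLU -> 1x1 conv (channel sizes are assumed)."""
    def __init__(self, in_channels=64, head_channels=256, out_channels=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, head_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(head_channels, out_channels, kernel_size=1),
        )

    def forward(self, feat):
        return self.head(feat)

# Six branches: heatmap, 2D/3D offsets, 2D box size, 3D dimensions, orientation.
# Output channel counts below are illustrative (e.g., 3 heatmap classes).
heads = nn.ModuleDict({
    "heatmap":   PredictionBranch(out_channels=3),
    "offset_2d": PredictionBranch(out_channels=2),
    "offset_3d": PredictionBranch(out_channels=2),
    "size_2d":   PredictionBranch(out_channels=2),
    "dim_3d":    PredictionBranch(out_channels=3),
    "rotation":  PredictionBranch(out_channels=8),  # MultiBin-style encoding
})
```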

3.3 Geometric Module for Learning Geometric Representations

In this section, we introduce the proposed geometric formula via modeling the relationships between the depth and 2D/3D predictions, and present how it can be implemented to learn geometric representations for depth estimation.

Formulation and notation. We adopt the 3D object definition described by the KITTI dataset. The coordinate system is constructed in meters with the camera center as the origin of coordinates. A 3D bounding box is represented as a 7-tuple (w, h_3D, l, x, y, z, θ), where w, h_3D, and l are the dimensions of the 3D bounding box, i.e., the width, height, and length, respectively, and (x, y, z) is the bottom center coordinate of the 3D bounding box. As shown in Fig. 3, θ denotes the rotation around the Y-axis in the camera coordinate system, in a range of [−π, π]. Moreover, to facilitate the introduction of the proposed geometric formula, we define the 2D bounding box with a 4-tuple (w_2D, h_2D, u_c, v_c), where (w_2D, h_2D) and (u_c, v_c) represent the size and the center of the 2D bounding box, respectively.
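For reference, both tuples can be read directly from a KITTI label file, whose per-object format (class, truncation, occlusion, alpha, 2D box, 3D dimensions, bottom-center location, rotation around the Y-axis) is fixed by the benchmark; the short Python sketch below shows one way to parse it, with the function name being our own.

```python
def parse_kitti_label_line(line):
    """Parse one line of a KITTI label file into the notation used above."""
    f = line.split()
    cls = f[0]
    left, top, right, bottom = map(float, f[4:8])    # annotated 2D box
    h3d, w3d, l3d = map(float, f[8:11])              # 3D dimensions (meters)
    x, y, z = map(float, f[11:14])                   # bottom center, camera coords
    theta = float(f[14])                             # rotation around the Y-axis
    box_2d = (right - left, bottom - top,            # (w_2D, h_2D, u_c, v_c)
              (left + right) / 2.0, (top + bottom) / 2.0)
    box_3d = (w3d, h3d, l3d, x, y, z, theta)         # the 7-tuple defined above
    return cls, box_2d, box_3d
```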

3.3.1 Projective Modeling of Depth and 2D/3D Network Predictions

We derive a geometric formula for modeling the geometric relationships between the scene depth and multiple 2D/3D network predictions, i.e., the 2D bounding box, 3D dimensions, and object orientation, based on the perspective projection.

Geometric relationship of 2D and 3D corners. First, we represent an object in the object coordinate system, in which the origin is the bottom center of the object, obtained via a translation transformation from the camera coordinate system. As shown in Fig. 3, the coordinate of the i-th (i = 1, ..., 8) corner of the 3D object bounding box, denoted as (Δx^i, Δy^i, Δz^i), can be given as follows:

Δx^i = cos θ · x_o^i + sin θ · z_o^i,   Δy^i = y_o^i,   Δz^i = −sin θ · x_o^i + cos θ · z_o^i,
with x_o^i ∈ {±l/2}, y_o^i ∈ {0, −h_3D}, z_o^i ∈ {±w/2},    (1)

where Δx^i, Δy^i, and Δz^i represent the coordinate differences between the corner and the bottom center of the object in the X, Y, and Z directions, respectively; i denotes the index of the different corner configurations as shown in Fig. 3. With the position (x, y, z) of the object in the camera coordinate system, we can represent the corner in the same coordinate system as:

[x_c^i, y_c^i, z_c^i] = [x + Δx^i, y + Δy^i, z + Δz^i],    (2)

where (x, y, z) and (x_c^i, y_c^i, z_c^i) respectively represent the bottom center coordinate and the corner coordinate of the 3D object bounding box in the camera coordinate system; x, y, and z denote the coordinate values along the X, Y, and Z dimensions. z also represents the distance from the bottom center of the object to the camera plane, i.e., the depth of the object in the camera coordinate system. Given the intrinsic matrix K of the camera provided by the official KITTI dataset, we can project the corner from the camera coordinate system to the pixel coordinate system as:

d^i · [u^i, v^i, 1]^T = K · [x_c^i, y_c^i, z_c^i]^T,    (3)

where (u^i, v^i) denotes the projected corner coordinate in the pixel coordinate system; d^i = z_c^i indicates the depth of the i-th corner; u^i and v^i respectively denote the horizontal and vertical coordinates of the corner in the pixel coordinate system.

Relationship between 2D height and 3D corners. Given the eight corners of the 3D object box in the pixel plane, the height of the projected 2D bounding box can be estimated from the difference between the maximum vertical coordinate (v^max, attained by corner m) and the minimum vertical coordinate (v^min, attained by corner n) over the eight corners in the pixel coordinate system:

h_2D = v^max − v^min = f_y · ( (y + Δy^m)/(z + Δz^m) − (y + Δy^n)/(z + Δz^n) ),    (4)

where v^i is derived from Eq. 3; v^max represents the maximum of v^i over the eight corners, and analogously for v^min; f_y denotes the focal length in the vertical direction of the pixel plane.

Figure 3: Visualization of the notations in different object observation views: (a) the bird's-eye view, and (b) the right-side view.

Relationship between depth and other 2D/3D parameters. Similar to the definition of the bird's-eye-view angle (see Fig. 3a), we define the angle between the ray towards the bottom center of the object and the horizontal plane as β (see Fig. 3b). Given the projected coordinate (u_b, v_b) of the object bottom center in the pixel plane based on Eq. 3, we can obtain the following geometric relationship:

tan β = y / z = (v_b − c_y) / f_y,    (5)

where c_y is the vertical location of the principal point relative to the origin of the pixel plane. Then, combining Eq. 4 and Eq. 5, the depth of the bottom center of the object, z, can be written as:

z = [ f_y · (Δy^m − Δy^n) − (v^m − c_y) · Δz^m + (v^n − c_y) · Δz^n ] / h_2D,    (6)

where m and n index the corners attaining v^max and v^min, whose vertical coordinates are linked to the object position through β in Eq. 5. It can be clearly observed that the depth is correlated to the camera intrinsic parameters (f_y and c_y), the object position (when deriving β), the 3D dimensions (when deriving Δy and Δz), and the orientation of the object (when deriving Δz).

Relationship to existing works. Obviously, Eq. 6 obeys the perspective projection principle that further objects tend to appear smaller than nearer objects. It is also clearly different from prior works in that, in our formula, there is a non-linear relationship between the scene depth and the 2D/3D box heights, due to the modeling via the introduction of the object pose and 3D dimensions. We can simplify the proposed formula in two different ways. (i) To reduce the computational complexity, we can consider only the first term in Eq. 6 to obtain a simplified geometric formula v1:

z = f_y · (Δy^m − Δy^n) / h_2D.    (7)

(ii) If the variation of pose and position is not considered, then the formulation in Eq. 7 can be further reduced to a simplified geometric formula v2:

z = k · f_y · h_3D / h_2D,    (8)

where k represents the scale factor for the depth scale conversion. The formula in Eq. 8 is widely used in 3D object detection [18, 5]. We report a detailed comparison and analysis of our formulation in Eq. 6 and the two simplified versions in the experimental results (see Sec. 4.2).
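To illustrate how the projective relations above can be turned into a depth value, the following numpy sketch builds the rotated corner offsets (Eq. 1), reads the extreme vertical coordinates from a predicted 2D box, and solves the linear projection relation for z; it also includes the widely used simplified formula v2 (Eq. 8). This is a sketch of the modeling under the equations as reconstructed above, not the authors' released implementation, and all names are illustrative.

```python
import numpy as np

def corner_offsets(w, h3d, l, theta):
    """Eq. 1: offsets of the 8 corners from the bottom center, after yaw rotation."""
    x_o = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y_o = np.array([ 0.0,  0.0,  0.0,  0.0, -h3d, -h3d, -h3d, -h3d])
    z_o = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    c, s = np.cos(theta), np.sin(theta)
    return c * x_o + s * z_o, y_o, -s * x_o + c * z_o

def geometric_depth(box2d, dims, theta, fy, cy):
    """Solve the projective relation (v - cy)(z + dz) = fy * (y + dy) for z,
    using the two extreme corners; box2d = (w_2D, h_2D, u_c, v_c)."""
    _, h2d, _, v_c = box2d
    w, h3d, l = dims
    _, _, dz = corner_offsets(w, h3d, l, theta)       # only depth offsets needed
    v_max, v_min = v_c + h2d / 2.0, v_c - h2d / 2.0   # lowest / highest image rows
    # Assumption: v_max is reached by a bottom corner nearest to the camera,
    # v_min by a top corner farthest from it.
    dz_near, dz_far = dz.min(), dz.max()
    # (v_max - cy)(z + dz_near) - (v_min - cy)(z + dz_far) = fy * h3d
    return (fy * h3d - (v_max - cy) * dz_near + (v_min - cy) * dz_far) / h2d

def simplified_depth_v2(h2d, h3d, fy, k=1.0):
    """Eq. 8: the widely used pinhole relation z = k * fy * h_3D / h_2D."""
    return k * fy * h3d / h2d
```

As a rough sanity check on the simplified relation: with a vertical focal length around 700 pixels and a 1.5 m-tall car projected to a 50-pixel box, Eq. 8 gives z ≈ 21 m, while the pose-dependent Δz terms adjust this estimate per object.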

3.3.2 Geometry-Guided Scene Depth Learning

Following the proposed geometric formula, we devise and implement a network module for geometry-guided deep representation learning for accurate depth prediction, as shown in the red dashed box of Fig. 2. The module aims to learn geometric representations using the 2D/3D geometry-related network predictions (2D bounding box, 3D object dimensions, and orientation) as input. Specifically, in the training stage, the module first produces a calculated one-channel depth map with the proposed geometric formula as described in Eq. 6. The depth map is then transformed into a 3D map of 3 channels, with each spatial position representing a 3D data point, by introducing the camera parameters as the initial geometric input. Then, the 3D map goes through three non-linear transformation blocks, each consisting of a convolution (taking the output of the previous block as input), a batch-norm, and a ReLU layer, to learn a robust geometric representation map with C channels. We set C to 32 in our experiments. These learned geometric representations are further concatenated with the image representations produced from the backbone network to learn the depth estimation. In the inference stage, we perform the same procedure as in training, and the final depth output is further combined with the other predictions, including 2D bounding boxes, 3D dimensions, and orientations, to produce the 3D object bounding boxes.
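The sketch below gives a minimal PyTorch rendering of this module, assuming the geometric depth map is lifted to a 3-channel 3D point map with the camera intrinsics and then passed through three convolution-BatchNorm-ReLU blocks (C = 32) before concatenation with the backbone features; layer shapes and the exact back-projection are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GeometricModule(nn.Module):
    """Turn the geometry-computed depth map into a C-channel representation (C = 32)."""
    def __init__(self, out_channels=32):
        super().__init__()
        blocks, in_ch = [], 3            # 3-channel 3D map (X, Y, Z per location)
        for _ in range(3):               # three non-linear transformation blocks
            blocks += [nn.Conv2d(in_ch, out_channels, 3, padding=1),
                       nn.BatchNorm2d(out_channels),
                       nn.ReLU(inplace=True)]
            in_ch = out_channels
        self.blocks = nn.Sequential(*blocks)

    @staticmethod
    def back_project(depth, fx, fy, cx, cy):
        """Lift the 1-channel depth map to a 3-channel 3D point map via the intrinsics."""
        b, _, h, w = depth.shape
        v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        u = u.to(depth).expand(b, 1, h, w)
        v = v.to(depth).expand(b, 1, h, w)
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        return torch.cat([x, y, depth], dim=1)

    def forward(self, geo_depth, intrinsics, image_feat):
        fx, fy, cx, cy = intrinsics
        geo_feat = self.blocks(self.back_project(geo_depth, fx, fy, cx, cy))
        # Concatenate with backbone features; a depth head then refines the estimate.
        return torch.cat([image_feat, geo_feat], dim=1)
```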

Method 3D Detection BEV
Easy Mod. Hard Easy Mod. Hard
Baseline 16.42 11.84 10.06 24.47 17.17 15.40
w/ gt Dim 19.85 14.06 12.02 25.06 18.29 15.85
w/ gt Depth 79.82 70.91 62.41 88.60 82.66 75.41
Table 1: Error analysis. Similar to the error analysis in [49], we replace the predicted depth and 3D dimensions with their corresponding ground-truth values. Using the ground-truth depth remarkably improves the AP from 11.84% to 70.91% on the moderate setting, suggesting that the depth is a significantly important factor affecting the accuracy of monocular 3D object detection.

3.4 Misalignment in 2D and 3D Bounding Boxes

There remains a misalignment between the 2D projected box and the 2D annotation box. Generally, due to the perspective projection effect (i.e., further objects appear smaller than nearer objects), the misalignment is more serious for nearby objects, which makes learning with the proposed formula inaccurate, especially for those objects. To handle this misalignment, we propose to use the 2D projected box instead of the 2D annotation box as the ground truth to ensure the correctness of the depth estimation. According to Eqs. 1 and 2, we compute the 3D corner coordinates of the object from its 3D pose and 3D dimensions. We then obtain their coordinates on the pixel plane through the projection transformation of Eq. 3, and calculate the differences between the extreme vertices in the image plane as the height and width of the 2D projected box.
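Concretely, the projected 2D ground-truth box can be obtained from the 3D annotation as in the short numpy sketch below, which follows the corner construction of Eqs. 1-3; the function name is illustrative.

```python
import numpy as np

def projected_2d_box(box_3d, K):
    """Project a 3D annotation (w, h3d, l, x, y, z, theta) into the image and return
    the tight 2D box (w_2D, h_2D, u_c, v_c) used as the 2D ground truth."""
    w, h3d, l, x, y, z, theta = box_3d
    # Eq. 1: corner offsets from the bottom center, rotated by the yaw angle.
    x_o = np.array([l/2, l/2, -l/2, -l/2] * 2)
    y_o = np.array([0.0] * 4 + [-h3d] * 4)
    z_o = np.array([w/2, -w/2, -w/2, w/2] * 2)
    c, s = np.cos(theta), np.sin(theta)
    corners = np.stack([x + c * x_o + s * z_o,     # Eq. 2: camera coordinates
                        y + y_o,
                        z - s * x_o + c * z_o])
    uvw = K @ corners                              # Eq. 3: perspective projection
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    return (u.max() - u.min(), v.max() - v.min(),
            (u.max() + u.min()) / 2.0, (v.max() + v.min()) / 2.0)
```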

3.5 Implementation Details

Backbone. We adopt a DLA-34 [48] network architecture without deformable convolutions as our backbone. During training, the input image is resized to a fixed resolution, and the spatial size of the feature map from the backbone is 1/R of the input resolution, where R represents the down-sampling factor of the backbone CNN.

Optimization loss. The optimization objective of our deep detection framework follows a multi-task learning setting, and consists of classification and regression losses for both the 2D and 3D predictions. Specifically, we train the heatmap prediction with the focal loss [21]. The branches for offsets and dimensions in both the 2D and 3D detection are trained with L1 losses. The branch for the orientation prediction is trained with a MultiBin loss following [9, 49]. Based on [9, 13], we use an L1 loss with heteroscedastic aleatoric uncertainty for the depth estimation (more details are provided in the Appendix).

Method Extra data 3D Detection BEV AOS Runtime
Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard
MonoDLE[26] - 17.23 12.26 10.29 24.79 18.89 16.00 93.46 90.23 80.11 -
GrooMeD-NMS[15] - 18.10 12.32 9.65 26.19 18.27 14.05 90.05 79.93 63.43 -
DDMP-3D[42] - 19.71 12.78 9.80 28.08 17.89 13.44 90.73 80.20 61.82 -
Decoupled-3D[5] Yes 11.08 7.02 5.63 23.16 14.82 11.25 87.34 67.23 53.84 -
UR3D[38] Yes 15.58 8.61 6.00 21.8 12.51 9.20 - - - 120ms
AM3D[25] Yes 16.50 10.74 9.52 25.03 17.32 14.91 - - - 400ms
PatchNet[24] Yes 15.68 11.12 10.17 22.97 16.86 14.97 - - - 400ms
DA-3Ddet[47] Yes 16.80 11.50 8.9 - - - - - - -
D4LCN[10] Yes 16.65 11.72 9.51 22.51 16.02 12.55 90.01 82.08 63.98 -
Kinematic3D[4] Yes 19.07 12.72 9.17 26.69 17.52 13.10 58.33 45.50 34.81 120ms
CaDDN[31] Yes 19.17 13.41 11.46 27.94 18.91 17.19 78.28 67.31 59.52 -
GS3D[18] No 4.47 2.90 2.47 8.41 6.08 4.94 85.79 75.63 61.85 2000ms
MonoGRNet[30] No 9.61 5.74 4.25 18.19 11.17 8.73 - - - 60ms
MonoDIS[39] No 10.37 7.94 6.40 17.23 13.19 11.12 - - - -
M3D-RPN[3] No 14.76 9.71 7.42 21.02 13.67 10.23 88.38 82.81 67.08 161ms
MonoPair[9] No 13.04 9.99 8.65 19.28 14.83 12.89 91.65 86.11 76.45 57ms
RTM3D[20] No 14.41 10.34 8.77 19.17 14.20 11.99 91.75 86.73 77.18 55ms
MoVi-3D[40] No 15.19 10.90 9.26 22.76 17.03 14.85 - - - 45ms
RAR-Net[22] No 16.37 11.01 9.52 22.45 15.02 12.93 88.48 83.29 67.54 -
Our method No 18.85 13.81 11.52 25.86 18.99 16.19 94.67 89.44 79.27 50ms
Improvement - +2.48 +2.80 +2.00 +3.10 +1.96 +1.34 +2.92 +2.71 +2.09 -

Table 2: State-of-the-art comparison on the KITTI test set for the car category in terms of the AP_40 metric. 'Extra data' denotes whether extra data or external networks are used in training or inference. '-' denotes methods that have not been officially published yet, with specific details unavailable. The bold black/blue color indicates the best/the second best performing method under the same 'No' setting. 'Improvement' denotes the increase in performance compared to methods without extra data. Our approach achieves the best performance compared with the state of the art under both settings on the moderate level.

Training: We train the overall deep network for 140 epochs on NVIDIA 1080 Ti GPUs. To alleviate overfitting, we adopt data augmentation techniques including random scaling, random horizontal flipping, and random cropping for the 2D detection, and random horizontal flipping for the 3D detection. We use the Adam optimizer with a weight decay of 1e-5 to optimize the full training loss as described in [9]. The initial learning rate is 1.25e-4 and is dropped by a constant factor at two later epochs. To make training stable, we apply a linear warm-up strategy for learning with the geometric network module in the first 5 epochs.
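The schedule can be set up as in the hedged sketch below; the weight decay, initial learning rate, and 5-epoch warm-up follow the text, while the decay milestones and decay factor are hypothetical placeholders since they are not specified.

```python
import torch

def build_optimizer_and_scheduler(model, warmup_epochs=5):
    """Adam with 1e-5 weight decay and an initial lr of 1.25e-4 (as in the text);
    the step-decay milestones and factor below are hypothetical placeholders."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[90, 120], gamma=0.1)   # hypothetical values
    def geometric_warmup_factor(epoch):
        # Linear warm-up of the geometric-module learning over the first 5 epochs.
        return min(1.0, (epoch + 1) / float(warmup_epochs))
    return optimizer, scheduler, geometric_warmup_factor
```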

Inference: We first predict the 2D bounding boxes, 3D dimensions, and orientations via a shared backbone and several separate task branches. Then, we use the proposed formula to predict a coarse depth, followed by several convolution layers for the final depth estimation. Finally, similar to [49], we use a simple post-processing algorithm based on max-pooling and back-projection to recover the 3D bounding boxes from the 2D boxes, 3D dimensions, orientations, and the depth.
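The max-pooling step corresponds to the standard peak extraction of [49]; a minimal sketch of this decoding step is given below (a generic rendering of the technique, not the exact released code).

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=50):
    """Keep local maxima of the class heatmaps (3x3 max-pooling as a simple NMS)
    and return the top-k scores with their class and grid location."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()   # suppress non-maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)
    classes = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    return scores, classes, ys, xs
```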

4 Experiments

Setup. The KITTI dataset [11] provides widely used benchmarks for various visual tasks in autonomous driving, including 2D object detection, average orientation similarity (AOS), bird's-eye view (BEV), and 3D object detection. The official dataset contains 7,481 training and 7,518 test images with 2D and 3D bounding box annotations for cars, pedestrians, and cyclists. We report the average precision (AP_40) for each task under three different settings: easy, moderate, and hard, as defined in [11]. Moreover, we use 40 recall positions instead of the 11 recall positions proposed in the original Pascal VOC benchmark, following [39], which results in a fairer comparison of the results. Each class uses a different IoU threshold for evaluation; we report our results under the official IoU threshold of 0.7 for cars.
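As a reference for the protocol, the small sketch below computes AP over 40 equally spaced recall positions in the spirit of [39]; it assumes precision/recall arrays have already been computed for a given class and difficulty setting.

```python
import numpy as np

def average_precision_r40(precisions, recalls):
    """AP over 40 recall positions: average the best precision achievable at
    recall >= t for t = 1/40, 2/40, ..., 1.0 (zero if that recall is never reached)."""
    ap = 0.0
    for t in np.linspace(1.0 / 40, 1.0, 40):
        mask = recalls >= t
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 40.0
```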

Method 3D Detection IoU0.7 BEV IoU0.7 3D Detection IoU0.5 BEV IoU0.5
Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard
CenterNet [49] 0.60 0.66 0.77 3.46 3.31 3.21 20.00 17.50 15.57 34.36 27.91 24.65
MonoGRNet [30] 11.90 7.56 5.76 19.72 12.81 10.15 47.59 32.28 25.50 52.13 35.99 28.72
MonoDIS [39] 11.06 7.60 6.37 18.45 12.58 10.66 - - - - - -
M3D-RPN [3] 14.53 11.07 8.65 20.85 15.62 11.88 48.53 35.94 28.59 53.35 39.60 31.76
MoVi-3D [40] 14.28 11.13 9.68 22.36 17.87 15.73 - - - - - -
MonoPair [9] 16.28 12.30 10.42 24.12 18.17 15.76 55.38 42.39 37.99 61.06 47.63 41.92
Baseline 16.54 13.37 11.15 23.62 19.19 16.70 53.93 40.97 36.67 58.72 45.48 40.02
Our method 18.45 14.48 12.87 27.15 21.17 18.35 56.59 43.70 39.37 61.96 47.84 43.10
Table 3: Monocular 3D object detection results on the KITTI val set for the car category with the evaluation metric of AP_40. The results of the previous works are from [9]. Our approach significantly outperforms the previous state of the art on almost all the different evaluation protocols and settings. The bold black/blue color indicates the best/the second best performing method.

4.1 Overall Performance Comparison and Analysis

Tables 2 and 3 show the overall performance of the proposed approach on the KITTI 3D test and val sets for cars, with the test results taken from the official online leaderboard as of Mar. 12th, 2021. Existing state-of-the-art monocular 3D object detectors, including methods using extra data and methods using only the monocular image, are listed in the tables for comparison. The KITTI val results of MonoGRNet [30], M3D-RPN [3], and MonoPair [9] are quoted from [9].

description 3D Detection BEV
Easy Mod. Hard Easy Mod. Hard
Original baseline 12.78 9.83 8.27 18.32 14.18 12.11
+ Uncertainty 15.40 11.10 9.58 22.33 16.53 14.18
+ Center3d 16.22 12.88 10.94 22.61 17.89 16.17
+ Projected box 16.54 13.37 11.15 23.62 19.19 16.70
Enhanced baseline 16.54 13.37 11.15 23.62 19.19 16.70
Table 4: Results of the enhanced baseline on the KITTI val set for the car category with the evaluation metric of AP_40. Each row adds an extra component on top of the row above.

Building a simple yet strong baseline for monocular 3D object detection. We report the enhanced baseline results for monocular 3D object detection in Table 4. Overall, the enhanced baseline significantly increases the performance over the original one by 3.76%, 3.54%, and 2.88% on the easy, moderate, and hard difficulty levels, respectively. This is achieved by introducing three changes to the original baseline. First, we adopt the L1 loss with the aleatoric uncertainty in [9, 13], which makes the training stage more robust to noisy input. Second, we use the projected 3D center as the ground truth for the 2D heatmap prediction, similar to SMOKE [23]. Third, we address the misalignment between the 2D ground-truth bounding boxes and the 2D projected bounding boxes by using the 2D projected box as the ground truth. This guarantees the consistency between the 2D and 3D boxes in terms of the projection relationships of the proposed geometric formula, and ensures robust learning with the formula. The enhanced baseline achieves 16.54%, 13.37%, and 11.15% on the easy, moderate, and hard difficulty levels, respectively.

Comparison with monocular image based methods. Our approach achieves a notable improvement over the state-of-the-art monocular image-based detectors [39, 30, 3, 9] on both the val and test sets. As shown in Table 2, on the KITTI test set for the car category, an indispensable part of the 3D object detection task for the autonomous driving scenario, our method achieves 18.85% (+2.48% improvement) on the easy, 13.81% (+2.80%) on the moderate, and 11.52% (+2.00%) on the hard level compared with the previous state-of-the-art image-only method. Besides, compared with the unpublished methods [26, 15], our method still increases the AP_40 by 1.49% on the moderate level. For the bird's-eye view (BEV) on the car class, our method also achieves the best performance, increasing the AP_40 over the second best method by 3.10%, 1.96%, and 1.34% on the easy, moderate, and hard levels, respectively. For the KITTI val set, our method also establishes new state-of-the-art performance on both 3D object detection and BEV. Tables 2 and 3 show considerable improvements over the state-of-the-art monocular detection methods with great robustness, benefiting from the introduction of the proposed geometric formula for learning geometry-aware representations to advance the depth estimation.

Comparison with methods using extra data or networks. The prior methods [5, 25, 24, 10, 31] achieve impressive performance on the KITTI test set by introducing extra data or external networks. Although our method utilizes none of this kind of information, as shown in Table 2, it can still outperform these comparison methods in terms of the AP_40 metric by 0.40% on the moderate level. These improvements demonstrate the superior performance of our method with the proposed geometry-guided depth learning for monocular 3D object detection.

Latency. We test our model on an NVIDIA GTX 1080 Ti GPU with PyTorch 1.1, CUDA 9.0, and an Intel CPU @ 2.60GHz. As shown in Table 2, the proposed method achieves 20 fps and runs on par with other real-time state-of-the-art methods [20, 40]. This clearly demonstrates the efficiency of our method when compared with other competitive methods under a similar experimental environment.

Figure 4: Qualitative results of our method for multi-class 3D object detection. We use orange boxes for cars, purple boxes for pedestrians, and green boxes for cyclists. All illustrated images are from the KITTI test set. Zoom in on the images for more details.
Figure 5: Qualitative results of our method for Bird’s-Eye-View. We use black box for ground-truth, red box for baseline results, and blue box for our results. All the illustrated images are from the KITTI val set. Zoom in on the circles for more detailed comparison.
Method 3D Detection BEV
Easy Mod. Hard Easy Mod. Hard
Baseline 16.54 13.37 11.15 23.62 19.19 16.70
+ 3D-CAT 15.87 11.80 10.33 21.85 16.90 14.51
+ Geo-SV1 17.25 13.38 11.29 24.33 18.57 16.06
+ Geo-SV2 17.10 13.22 11.13 25.02 18.62 16.48
Ours (full model) 18.45 14.48 12.87 27.15 21.17 18.35
Table 5: Quantitative comparison of different variants of the proposed approach. The experiments are conducted on the KITTI val set for the car category with the evaluation metric of AP_40, to investigate the effect of the proposed geometric formula and geometry-guided representation learning. '3D-CAT', 'Geo-SV1', and 'Geo-SV2' represent the variants using the transformation blocks combined with the 3D dimensions, the simplified geometry formula v1, and the simplified geometry formula v2, respectively.
Figure 6: Depth prediction performance w.r.t. SILog (Scale invariant logarithmic error) and sqRel (Relative squared error) metrics on KITTI val set for all the car samples. Different depth ranges are considered in the performance evaluation.

4.2 Ablation Experiments

We conduct extensive ablation studies on the KITTI val set to demonstrate the effectiveness of the proposed approach for geometry-guided depth learning in advancing monocular 3D object detection. For all the evaluations, the AP_40 metric is employed. We mainly investigate two aspects: the effect of the proposed geometric formula and module, and the effect of the geometry-guided representation learning for depth estimation.

Baseline and variant models. To conduct an extensive evaluation, we consider the following baseline and variant models: (i) Baseline, a base model achieving a strong 3D detection performance with an AP_40 of 11.8% on the moderate level; (ii) 3D-CAT, which directly inputs the concatenation of the 3D network predictions to the non-linear transformation blocks while bypassing the depth calculation with the geometric formula; (iii) Geo-SV1, which uses our simplified geometry formula v1 as in Eq. 7; (iv) Geo-SV2, which uses our simplified geometry formula v2 as in Eq. 8.

Effects of the geometric formula and module. A detailed ablation study is shown in Table 5. As we can observe, ours (full model) achieves a large gain (2.68% on the moderate level) over Baseline + 3D-CAT, meaning that directly using the 3D network predictions is not effective enough for learning the geometric representations, thus verifying the importance of the proposed geometric formula. By comparing Baseline + Geo-SV2, Baseline + Geo-SV1, and ours (full model), all three of which use geometric relationships, the performance gradually improves as more geometry elements are involved in the modeling, confirming our motivation of modeling between the depth and multiple 2D/3D geometry elements instead of only a subset of them (typically only the height, as considered in most existing works [18, 5] and similar to Geo-SV2). Finally, ours (full model) brings 1.11% and 1.98% improvements over the baseline on the moderate level for 3D detection and BEV, respectively, which adequately demonstrates the effectiveness of our proposed approach.

Figure 7:  Statistics on the KITTI train+val set for car samples, showing the number of samples (left) and cumulative proportions (right) w.r.t. different depths. Most samples are within 40m, while our method achieves significant depth improvements in this range.

Effect of the geometry-guided representation learning for depth estimation. Fig. 6 shows a performance comparison between the baseline and our approach on depth estimation. Specifically, we evaluate the predicted depth of all car samples in different depth ranges under two primary metrics (SILog and sqRel) widely used in the depth estimation field. Fig. 7 shows that 87% of the cars are within 40m, while only 5.0% of them are farther than 45m. Fig. 6 shows that our approach outperforms the baseline consistently in all the depth ranges, especially within 40m where most samples lie, which further validates our idea of using geometry-guided representation learning to boost depth estimation and advance monocular 3D object detection.

5 Conclusion

We proposed a novel geometric formula, modeled in a principled manner from multiple 2D/3D network predictions, to guide the depth estimation and advance monocular 3D object detection. We designed and implemented this formula as a neural network module to enable geometry-aware feature learning together with the image representations and thus boost the learning of the depth. Extensive experiments demonstrate the effectiveness of the proposed approach, and our results achieve state-of-the-art performance by a large margin.

References

  • [1] I. Barabanau, A. Artemov, E. Burnaev, and V. Murashkin (2020) Monocular 3d object detection via geometric reasoning on keypoints. In VISIGRAPP.
  • [2] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz (2018) Geometry-aware learning of maps for camera localization. In CVPR.
  • [3] G. Brazil and X. Liu (2019) M3d-rpn: monocular 3d region proposal network for object detection. In ICCV.
  • [4] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele (2020) Kinematic 3d object detection in monocular video. In ECCV.
  • [5] Y. Cai, B. Li, Z. Jiao, H. Li, X. Zeng, and X. Wang (2020) Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. In AAAI.
  • [6] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau (2017) Deep manta: a coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In CVPR.
  • [7] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3d object proposals for accurate object class detection. In NIPS.
  • [8] Y. Chen, S. Liu, X. Shen, and J. Jia (2019) Fast point r-cnn. In ICCV.
  • [9] Y. Chen, L. Tai, K. Sun, and M. Li (2020) MonoPair: monocular 3d object detection using pairwise spatial relationships. In CVPR.
  • [10] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo (2020) Learning depth-guided convolutions for monocular 3d object detection. In CVPR.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR.
  • [12] E. Jörgensen, C. Zach, and F. Kahl (2019) Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss. CoRR abs/1906.08070.
  • [13] A. G. Kendall (2019) Geometry and uncertainty in deep learning for computer vision. Ph.D. Thesis, University of Cambridge.
  • [14] J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In CVPR.
  • [15] A. Kumar, G. Brazil, and X. Liu (2021) GrooMeD-nms: grouped mathematically differentiable nms for monocular 3d object detection. In CVPR.
  • [16] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In CVPR.
  • [17] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV.
  • [18] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In CVPR.
  • [19] P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. In CVPR.
  • [20] P. Li, H. Zhao, P. Liu, and F. Cao (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. In ECCV.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV.
  • [22] L. Liu, C. Wu, J. Lu, L. Xie, J. Zhou, and Q. Tian (2020) Reinforced axial refinement network for monocular 3d object detection. In ECCV.
  • [23] Z. Liu, Z. Wu, and R. Tóth (2020) SMOKE: single-stage monocular 3d object detection via keypoint estimation. In CVPR.
  • [24] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang (2020) Rethinking pseudo-lidar representation. In ECCV.
  • [25] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In ICCV.
  • [26] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang (2021) Delving into localization errors for monocular 3d object detection. In CVPR.
  • [27] F. Manhardt, W. Kehl, and A. Gaidon (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In CVPR.
  • [28] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3d bounding box estimation using deep learning and geometry. In CVPR.
  • [29] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In CVPR.
  • [30] Z. Qin, J. Wang, and Y. Lu (2019) Monogrnet: a geometric reasoning network for monocular 3d object localization. In AAAI.
  • [31] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander (2021) Categorical depth distribution network for monocular 3d object detection. In CVPR.
  • [32] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander (2021) Categorical depth distribution network for monocular 3d object detection. In CVPR.
  • [33] H. Rhodin, M. Salzmann, and P. Fua (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In ECCV.
  • [34] T. Roddick, A. Kendall, and R. Cipolla (2019) Orthographic feature transform for monocular 3d object detection. In BMVC.
  • [35] L. Sheng, D. Xu, W. Ouyang, and X. Wang (2019) Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In ICCV.
  • [36] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In CVPR.
  • [37] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. TPAMI.
  • [38] X. Shi, Z. Chen, and T. Kim (2020) Distance-normalized unified representation for monocular 3d object detection. In ECCV.
  • [39] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder (2019) Disentangling monocular 3d object detection. In ICCV.
  • [40] A. Simonelli, S. R. Bulò, L. Porzi, E. Ricci, and P. Kontschieder (2020) Towards generalization across depth for monocular 3d object detection. In ECCV.
  • [41] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In ICCV.
  • [42] L. Wang, L. Du, X. Ye, Y. Fu, G. Guo, X. Xue, J. Feng, and L. Zhang (2021) Depth-conditioned dynamic message propagation for monocular 3d object detection. In CVPR.
  • [43] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In CVPR.
  • [44] D. Xu, A. Vedaldi, and J. F. Henriques (2021) Moving slam: fully unsupervised deep learning in non-rigid scenes. In IROS.
  • [45] D. Xu, W. Xie, and A. Zisserman (2019) Geometry-aware video object detection for static cameras. In BMVC.
  • [46] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) Reppoints: point set representation for object detection. In ICCV.
  • [47] X. Ye, L. Du, Y. Shi, Y. Li, X. Tan, J. Feng, E. Ding, and S. Wen (2020) Monocular 3d object detection via feature domain adaptation. In ECCV.
  • [48] F. Yu, D. Wang, E. Shelhamer, and T. Darrell (2018) Deep layer aggregation. In CVPR.
  • [49] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850.

Supplementary Material

In this Supplementary Material, we provide more elaboration on the implementation details, experimental results, and qualitative results. Specifically, we present the implementation details of the model training in Section A, additional quantitative results and analysis in Sections B and C, and additional qualitative results in Section D.

A Additional Implementation Details

The overall network optimization loss of the proposed approach consists of three parts, i.e., a classification loss L_cls, a 2D regression loss L_2D, and a 3D regression loss L_3D. We present the details of these losses one by one. (i) Regarding the classification loss, similar to [17, 49], we employ a variant of the focal loss which reduces the penalty for negative locations according to their distance from a positive location:

L_cls = −(1/N) Σ_{x,y,c} { (1 − p̂_{xyc})^α · log(p̂_{xyc}),  if p_{xyc} = 1;   (1 − p_{xyc})^β · (p̂_{xyc})^α · log(1 − p̂_{xyc}),  otherwise },    (9)

where p_{xyc} and p̂_{xyc} represent the ground-truth class probability given by an unnormalized 2D Gaussian and the model's predicted probability for the class, respectively, N is the number of positive locations, and α and β are hyperparameters that control the importance of each sample. We set α to 2 and β to 4 as the default setting in our experiments. (ii) The 2D regression loss L_2D is defined upon a 6-tuple of ground-truth bounding-box targets and a predicted 6-tuple. Specifically, the 6-tuple consists of two 2D offsets, two 3D offsets, and two 2D box sizes. The 2D/3D offsets are used to adjust the 2D/3D center locations before remapping them to the input resolution, following [17, 49]. We use an L1 loss to optimize each of the 6-tuple parameters. (iii) The 3D regression loss L_3D consists of an L1 loss L_dim for regressing the dimensions of the 3D bounding box (width, height, and length), and an L1 loss L_dep with an uncertainty term for regressing the depth. Specifically, we follow [9, 13] and employ the heteroscedastic aleatoric uncertainty in the depth estimation loss as:

L_dep = |d − d*| / σ + log σ,    (10)
L_3D = L_dim + L_dep,    (11)

where d and d* represent the predicted depth and the ground-truth depth, respectively, and σ is the noisy observation parameter of the model. Hence, the overall optimization loss is the sum of the three losses, written as:

L = L_cls + λ_1 · L_2D + λ_2 · L_3D,    (12)

where λ_1 and λ_2 are loss weights controlling the balance between the different losses. We consider the 2D and 3D losses equally important and use λ_1 = λ_2 = 1 in all experiments.
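A minimal sketch of such an uncertainty-attenuated L1 depth loss is shown below, predicting s = log σ for numerical stability; the exact weighting used in the paper may differ.

```python
import torch

def uncertainty_l1_depth_loss(pred_depth, gt_depth, log_sigma):
    """L1 depth loss attenuated by a predicted observation noise sigma:
    |d - d*| / sigma + log(sigma), with sigma parameterized as exp(log_sigma)."""
    sigma = torch.exp(log_sigma)
    return (torch.abs(pred_depth - gt_depth) / sigma + log_sigma).mean()

# Example usage: the depth head outputs both a depth value and log_sigma per object.
# loss_dep = uncertainty_l1_depth_loss(pred[:, 0], target_depth, pred[:, 1])
```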

B Additional Results and Analysis

B.1 Additional Results for the Pedestrian/Cyclist Category

Cat. Method 3D Detection/BEV
Easy Mod. Hard
Ped. OFTNet [34] 0.63/1.28 0.36/0.81 0.35/0.51
SS3D [12] 2.31/2.48 1.78/2.09 1.48/1.61
M3D-RPN [3] 4.92/5.65 3.48/4.05 2.94/3.29
MoVi-3D [40] 8.99/10.08 5.44/6.29 4.57/5.37
MonoPair [9] 10.02/10.99 6.68/7.04 5.53/6.29
Ours 8.00/9.54 5.63/6.77 4.71/5.83
Cyc. OFTNet [34] 0.14/0.36 0.06/0.16 0.07/0.15
SS3D [12] 2.80/3.45 1.45/1.89 1.35/1.44
M3D-RPN [3] 0.94/1.25 0.65/0.81 0.47/0.78
MoVi-3D [40] 1.08/1.45 0.63/0.91 0.70/0.93
MonoPair [9] 3.79/4.76 2.12/2.87 1.83/2.42
Ours 4.73/5.93 2.93/3.87 2.58/3.42
Table 6: Monocular 3D object detection results on the KITTI test set for the pedestrian and cyclist categories with the evaluation metric of AP_40. The IoU threshold is set to 0.5. The bold black/blue color indicates the best/the second best performing method, respectively.

As mentioned in the main paper, the official KITTI [11] dataset contains 7,481 training and 7,518 test images with 2D and 3D bounding box annotations, including the pedestrian and cyclist categories. We report our quantitative results in Table 6, using the official settings with an IoU threshold of 0.5 for pedestrians and cyclists on the KITTI test set. Our method establishes new state-of-the-art performance on all three detection levels (i.e., easy, moderate, and hard) for the cyclist category, with only a slight drop for the pedestrian category. We investigate this slight performance drop in the pedestrian category by comparing the 2D detection results between cars and pedestrians. In fact, the advantage of the proposed geometric formula is independent of the class, as 2D images conform to the projective camera model and every object obeys the geometric reasoning. However, a performance gap between car detection and pedestrian/cyclist detection commonly exists in ours and many previous works on the KITTI dataset. This is mainly due to the insufficient training samples of the pedestrian and cyclist categories on KITTI, leading to unstable training, sensitivity to hyper-parameters, and inaccurate prediction of the 2D/3D information (2D boxes, orientation, and 3D dimensions) with high variance. This imbalance of the category data is, however, a common issue on the KITTI dataset for the 3D object detection task. Table 7 shows that the 2D detection results on the moderate level are only 50.48% and 44.63% for cyclists and pedestrians respectively, while reaching up to 90.14% for cars on the test set. Similarly, for orientation estimation, the pedestrian AOS (39.76%) is less than half of that of the car (89.44%) on the moderate level. These two factors introduce more noise into our geometric formula and thus affect the geometry-guided representation learning. Nevertheless, our results for pedestrians and cyclists remain highly competitive with other state-of-the-art methods on the KITTI test set.

Cat. Method 2D Detection/AOS
Easy Mod. Hard
Car SS3D [12] 92.72/92.57 84.92/84.38 70.35/69.82
M3D-RPN [3] 89.04/88.38 85.08/82.81 69.26/67.08
Ours 95.11/94.67 90.14/89.44 80.19/79.27
Ped. SS3D [12] 61.58/53.72 45.79/39.60 41.14/35.40
M3D-RPN [3] 56.64/44.33 41.46/31.88 37.31/28.55
Ours 58.49/52.87 44.63/39.76 40.41/35.83
Cyc. SS3D [12] 52.97/42.95 35.48/27.79 31.07/24.26
M3D-RPN [3] 61.54/48.11 41.54/31.09 35.23/26.10
Ours 65.42/55.58 50.48/42.05 42.48/35.48
Table 7: Monocular 2D object detection results on the KITTI test set for all categories with the evaluation metric of AP_40. The metric is used for the 2D detection and AOS evaluation, and the IoU threshold is set to 0.5. The bold black/blue color indicates the best/the second best performing method, respectively.

B.2 Further Analysis on Depth Estimation from Geometry Modeling

We conduct a further statistical analysis of depth on the train+val set. Table 8 shows that for two cars with the same height of the 2D bounding box and the same height of the 3D bounding box, the depth values of their centers may differ by more than 5 meters due to their distinct poses and locations. This confirms the critical importance of considering the 3D pose and location simultaneously in the geometric modeling for depth estimation, which is however not investigated by previous works.

depth        The height of 3D bounding boxes                avg.
30    max   39.51  40.23  40.39  42.23   39.47
      min   37.69  36.53  36.53  37.21   37.25
      diff.  1.82   3.70   3.86   5.02    2.22
35    max   34.04  34.68  35.69  34.12   36.40
      min   32.99  31.72  31.77  32.05   31.75
      diff.  1.05   2.96   3.92   2.07    4.65
Table 8: Depth values on the training set (in meter). We show the maximum (max) and minimum (min) depth values of the cars with the same height of 2D bounding boxes and the same height of 3D bounding boxes, and the difference (diff.) between the maximum and minimum depth values.

B.3 Additional Results at Different Distances

We provide additional results on depth estimation and monocular 3D object detection at different distances. Table 9 shows more depth estimation results on the KITTI val set, comparing the enhanced baseline and our method. Specifically, we evaluate the depth estimation by computing the scale-invariant logarithmic (SILog) error, squared relative (sqRel) error, absolute relative (absRel) error, and root mean squared error of the inverse depth (iRMSE). Our method outperforms the enhanced baseline by large margins on all these evaluation metrics. The depth estimation results clearly demonstrate the effectiveness of our proposed idea of using geometry-guided representation learning to boost depth estimation from monocular images for advancing monocular 3D object detection.

Depth Range Num. SILog absRel sqRel iRMSE
0-10m 867 16.49 8.65 67.55 16.53
14.35 7.75 38.46 14.94
0-20m 4236 12.12 6.02 31.42 9.98
10.48 5.42 20.81 8.77
0-30m 7379 11.00 5.70 25.25 8.16
10.27 5.30 19.73 7.52
0-40m 9797 10.49 5.65 24.68 7.23
10.49 5.36 21.71 7.03
Table 9: Depth prediction results on the KITTI val set for all car samples. For each depth range, we show first the baseline and then ours (bold). 'Num.' denotes the number of car samples on the val set, which contains 11,178 car samples in total.

Moreover, we conduct experiments on the monocular 3D object detection improvement at different distances. Table 10 reports the AP_40 performance at different object distance ranges following [32]. It is clear that our method consistently outperforms the baseline at all ranges.

Description 3D Detection BEV
15m 30m all 15m 30m all
Baseline 18.85 15.42 11.32 26.95 21.94 16.82
Ours 22.29 17.38 12.87 31.37 24.82 18.35
Table 10: Performance on KITTI val at different ranges.

C Additional Ablation Study for Uncertainty and Eq. 6

We investigate the effect of the uncertainty term together with our geometric module on the KITTI val set in Table 12. It can be seen that the uncertainty is helpful for learning the geometry, but the main improvement comes from the proposed principled geometric modeling. To further validate the effectiveness of Eq. 6, we compare a variant that feeds all predictions into a pointwise MLP with our geometric module in Table 11. Ours is significantly better than the pointwise MLP.

Description 3D Detection BEV
Easy Mod. Hard Easy Mod. Hard
Baseline 16.54 13.37 11.15 23.62 19.19 16.70
Pointwise MLP 17.09 13.12 11.05 23.79 18.20 16.26
Ours 18.79 14.53 12.77 26.48 20.75 18.04
Table 11: Results of different modules on the KITTI val set with the AP_40 metric.
All Other Enhancements | Uncertainty | Geometric Module | 3D Detection | BEV
✓ |   |   | 11.81 | 17.51
✓ |   | ✓ | 14.44 | 19.77
✓ | ✓ |   | 13.37 | 19.19
✓ | ✓ | ✓ | 14.48 | 21.17
Table 12: Ablation study on KITTI val set for uncertainty and geometric modeling on the moderate setting of cars.
Figure 8: Qualitative results of our method for Bird’s-Eye-View. We use black box for ground-truth, red box for baseline results, and blue box for our results. All the illustrated images are from the KITTI val set. Zoom in on the circles for more detailed comparison.

D Additional Qualitative Results

Fig. 8 shows comparison results between the enhanced baseline and the proposed method from the bird's-eye view. Figure 9 presents additional qualitative 3D detection results on the images, with a comparison between the two on the KITTI val set. We can observe from the figures that the proposed geometry-guided learning approach achieves significantly better 3D detection and localization performance than the enhanced baseline.

Figures 10 and 11 show additional visualizations of the prediction results on KITTI raw data in the image plane and the LiDAR coordinate system, respectively. We use orange boxes, purple boxes, and green boxes for cars, pedestrians, and cyclists, respectively. Our approach is able to accurately localize 3D objects at different depths.

Figure 9: Qualitative Results. The predictions on the KITTI val set. Results are from the enhanced baseline (left column) and ours (right column).
Figure 10: Qualitative results of our method for multi-class 3D object detection. We use orange box for cars, purple box for pedestrians, and green box for cyclists. All illustrated images are from the KITTI test set. Zoom in on the images for more details.
Figure 11: Qualitative results of our method for multi-class 3D object detection. We use orange box for cars, purple box for pedestrians, and green box for cyclists. All illustrated images are from the KITTI test set. Zoom in on the images for more details.