As an important and challenging problem, 3D object detection plays a fundamental role in various computer vision applications, such as autonomous driving, robotics, and augmented/virtual reality. In recent years, monocular 3D object detection has received great attention, because it only requires a monocular camera rather than the extra sensing devices used in LiDAR-based [36, 16, 8, 37] and stereo-based [7, 19, 29, 43] methods. However, the performance gap between LiDAR-based and monocular image-based approaches remains significant, mainly because of the lack of reliable depth information. We conduct a quantitative investigation by simply replacing the depth predictions of a baseline model with the ground-truth depth values. The detection performance improves remarkably, from 11.84% to 70.91% AP under the moderate setting of the car category on the KITTI val set (see Table 1), which suggests that depth estimation is a critical performance bottleneck in monocular 3D object detection.
In this paper, we propose a novel geometric formula by principled modeling of the relationships between the scene depth and different geometry elements predicted from the deep network for monocular 3D object detection, including 2D bounding boxes, 3D object dimensions, and object poses. We further implement the proposed formula as a geometry-based network module, which can be flexibly embedded into the deep learning framework, allowing effective geometry-aware learning at the representation level to guide the depth estimation and advance monocular 3D object detection. The geometric module can be utilized during both the training and inference phases without additional complex post-processing. Moreover, we provide a simple yet strong baseline that ensures robust learning with the proposed geometry module, achieved by addressing the severe misalignment between the annotated 2D bounding box and the 2D bounding box projected from the 3D annotations. This effective baseline achieves an AP of 13.37% under the moderate setting of the car category on the KITTI val set.
To summarize, the contribution of this paper is threefold:
We propose a novel geometric formula, which jointly models the perspective geometry relationships of multiple 2D/3D elements predicted from the deep monocular 3D object detection network, providing strong geometric constraints for learning the 3D detection network.
We implement the proposed geometric formula as a neural network module, which can be leveraged to guide the representation learning and boost the depth estimation, significantly advancing the performance of monocular 3D object detection.
We provide a simple yet strong baseline by addressing the misalignment between the 2D projected boxes and the 2D annotation boxes, which achieves 13.37% AP under the moderate setting of the KITTI val set. We expect our baseline to benefit the community in future research on monocular 3D object detection.
Extensive experiments conducted on the challenging KITTI dataset clearly demonstrate the effectiveness of the proposed approach and show that our method achieves 13.81% AP, a 2.80% absolute improvement over the state of the art in monocular 3D object detection, under the moderate setting of the KITTI test set for the car category.
2 Related Work
There are two groups of works closely related to ours: monocular 3D object detection and geometry-guided 3D object detection.
Monocular 3D Object Detection. Compared with methods using LiDAR or stereo sensors, 3D object detection with monocular images is challenging due to the absence of reliable depth information. Existing works [6, 27, 25, 24, 5, 10] have considered using external pretrained networks, extra training data, and prior knowledge to improve the performance of monocular 3D object detection. Particularly, DeepMANTA utilizes extra 3D shape and template datasets to learn 2D/3D vehicle models and then performs 2D/3D matching for the detection. Motivated by the importance of accurate depth for 3D object detection, many works [31, 25, 24, 10, 47] develop monocular 3D object detection by introducing an external pretrained network for depth estimation. In contrast to these methods, we only use the monocular image as input, without any such extra burden.
In recent years, some works also use only RGB data as input for the task [39, 3, 9, 40, 22]. For instance, MonoDIS proposes to leverage a disentangling transformation between different 2D and 3D tasks to optimize the parameters at the loss level. M3D-RPN focuses on the design of depth-aware convolution layers to improve 3D parameter estimation, and post-optimizes the orientation by exploring the consistency between projected and annotated bounding boxes. To address the common occlusion issue in monocular object detection, MonoPair proposes to model the spatial relationships of paired adjacent objects via an uncertainty-based prediction to improve the detection. MoVi-3D builds virtual views in which the object appearance is normalized depending on the distance, reducing the visual appearance variability. RAR-Net builds a post-processing method based on reinforcement learning to improve 3D object detection performance. Although these existing methods have achieved very promising results, the beneficial geometry relationships between the different 2D and 3D predictions of the detection network are not explicitly modeled to boost the learning of the detection network.
Geometry-Guided 3D Object Detection. Several recent methods consider utilizing geometric information for monocular 3D object detection [14, 30, 28, 5, 18]. One research direction mainly focuses on using geometry to improve detection performance in the inference stage via post-processing [3, 38]. For instance, M3D-RPN employs the consistency between the 2D projected and the predicted 2D bounding boxes to optimize orientation parameters in a post-processing step. UR3D uses estimated keypoints to post-optimize the predictions of physical sizes and yaw angles by minimizing an objective function. Some other works [30, 28, 18, 5] consider a simplified perspective projection relationship in the training phase. In particular, MonoGRNet presents a geometric reasoning method based on instance depth estimation and 2D bounding box projection to obtain more accurate 3D localization. GS3D uses average object sizes, based on statistics of the training data, to guide the location estimation. Decoupled-3D estimates the depth from the projected average height of each vertical edge and the 3D height of the objects. RTM3D predicts keypoints, including the eight vertices and the center of the 3D object in the image plane, and then minimizes an energy function using geometric constraints of perspective projection. Ivan et al. rely on extra CAD models to process labels for keypoint detection and enforce the constraint between 2D keypoints and the CAD models using a consistency loss. However, these methods basically utilize the geometry at the prediction level and ignore several important geometry elements (object poses and locations) in their geometric modeling.
In contrast to these methods, we jointly model the geometric relationships between the scene depth and the 2D bounding boxes, 3D dimensions, and object poses, and implement the geometric model as a network module that is leveraged for geometry-aware representation learning to directly boost the depth estimation.
3 The Proposed Approach
3.1 Framework Overview
A framework overview is illustrated in Fig. 2. We model an object as a single point following [49, 9]. Our framework consists of three key steps. First, we use a deep layer aggregation (DLA) network, a fully-convolutional encoder-decoder, to extract features from the monocular image. Second, the features are fed into several network branches to separately predict the 2D bounding box, 3D object dimensions, and orientation (Sec. 3.2). Third, the geometric module models the geometry relationships among these 2D/3D predictions to obtain a geometric formula, which is implemented as a network module for geometry-aware feature learning (Sec. 3.3). Finally, we utilize the geometric features for depth estimation (Sec. 3.3), which is combined with the other 3D predictions to obtain the 3D object detection results.
3.2 Base Detection Structure
Our base detection structure builds on the backbone described above with six output branches. Each branch takes the backbone features as input and uses a 3x3 convolution, ReLU, and 1x1 convolution for prediction. The heatmap branch is used to locate the 2D object center. The 2D/3D offset branch estimates the 2D/3D center in the 2D image coordinate system. The 2D box size branch and the 3D dimension branch predict the size of the 2D bounding box and the dimensions of the 3D object, respectively. Similar to [28, 9, 49], the orientation branch predicts the observation angle of the object by encoding it into scalars.
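The branch design described above can be sketched in PyTorch. This is a minimal sketch, not the paper's implementation: the channel counts, the number of heatmap classes, and the multi-bin encoding size are our assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One output branch: 3x3 conv -> ReLU -> 1x1 conv (see Sec. 3.2)."""
    def __init__(self, in_channels: int, out_channels: int, mid_channels: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1),
        )

    def forward(self, x):
        return self.head(x)

# Six branches on top of shared backbone features (channel counts assumed):
backbone_channels = 64
heads = nn.ModuleDict({
    "heatmap": PredictionHead(backbone_channels, 3),      # one map per class (car/ped/cyc)
    "offset_2d": PredictionHead(backbone_channels, 2),    # 2D center offset
    "offset_3d": PredictionHead(backbone_channels, 2),    # projected 3D center offset
    "size_2d": PredictionHead(backbone_channels, 2),      # 2D box width/height
    "dim_3d": PredictionHead(backbone_channels, 3),       # w, h, l
    "orientation": PredictionHead(backbone_channels, 8),  # multi-bin encoding (8 assumed)
})
```

Each head shares the backbone features, so the extra cost per branch is small.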
3.3 Geometric Module for Learning Geometric Representations
In this section, we introduce the proposed geometric formula via modeling the relationships between the depth and 2D/3D predictions, and present how it can be implemented to learn geometric representations for depth estimation.
Formulation and notation. We adopt the 3D object definition of the KITTI dataset. The coordinate system is constructed in meters with the camera center as the origin. A 3D bounding box is represented as a 7-tuple $(w, h, l, x, y, z, \theta)$, where $w$, $h$, and $l$ are the width, height, and length of the 3D bounding box, respectively, and $(x, y, z)$ is the bottom center coordinate of the 3D bounding box. As shown in Fig. 3, $\theta$ denotes the rotation around the Y-axis in the camera coordinate system, in the range $[-\pi, \pi]$. Moreover, to facilitate the introduction of the proposed geometric formula, we define the 2D bounding box with a 4-tuple $(w_{2d}, h_{2d}, u_{2d}, v_{2d})$, where $(w_{2d}, h_{2d})$ and $(u_{2d}, v_{2d})$ represent the size and the center of the 2D bounding box, respectively.
3.3.1 Projective Modeling of Depth and 2D/3D Network Predictions
We derive a geometric formula modeling the geometric relationships between the scene depth and multiple 2D/3D network predictions (2D bounding box, 3D dimensions, and object orientation) based on perspective projection.
Geometric relationship of 2D and 3D corners. First, we represent an object in the object coordinate system, whose origin is the bottom center of the object, obtained via a translation from the camera coordinate system. As shown in Fig. 3, the coordinate of the $i$-th ($i \in \{1, \dots, 8\}$) corner of the 3D object bounding box, denoted as $(\Delta x_i, \Delta y_i, \Delta z_i)$, can be given as follows:

$(\Delta x_i,\ \Delta y_i,\ \Delta z_i)^\top = R(\theta)\,(k_i\,w/2,\ -m_i\,h,\ n_i\,l/2)^\top,\quad k_i, n_i \in \{-1, +1\},\ m_i \in \{0, 1\},$   (1)

where $\Delta x_i$, $\Delta y_i$, and $\Delta z_i$ represent the coordinate difference between the corner and the bottom center of the object in the X, Y, and Z directions, respectively; $w$, $h$, $l$, and $\theta$ are the box dimensions and the Y-axis rotation defined above, with $R(\theta)$ the corresponding rotation matrix; and $i$ indexes the different sign combinations shown in Fig. 3. With the position $(x, y, z)$ of the object in the camera coordinate system, we can represent the corner in the same coordinate system as:

$(X^c_i,\ Y^c_i,\ Z^c_i)^\top = (x,\ y,\ z)^\top + (\Delta x_i,\ \Delta y_i,\ \Delta z_i)^\top,$   (2)

where $(x, y, z)$ and $(X^c_i, Y^c_i, Z^c_i)$ respectively represent the bottom center coordinate and the corner coordinate of the 3D object bounding box in the camera coordinate system; X, Y, and Z denote the coordinate axes of the camera frame. $z$ also represents the distance from the bottom center of the object to the camera plane, i.e., the depth of the object in the camera coordinate system. Given the intrinsic matrix $K$ of the camera provided by the official KITTI dataset, we can project the corner from the camera coordinate system to the pixel coordinate system as:

$Z^c_i\,(u_i,\ v_i,\ 1)^\top = K\,(X^c_i,\ Y^c_i,\ Z^c_i)^\top,$   (3)

where $(u_i, v_i)$ denotes the projected corner coordinate in the pixel coordinate system; $Z^c_i$ indicates the depth of the $i$-th corner; and $u_i$ and $v_i$ respectively denote the horizontal and vertical coordinates of the corner in the pixel plane.
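The corner construction and projection steps above can be sketched in NumPy. This is a minimal sketch assuming the KITTI convention that the camera Y-axis points down and the object frame origin is the bottom center; the intrinsic matrix values are illustrative, not an actual KITTI calibration.

```python
import numpy as np

def box3d_corners(w, h, l, x, y, z, ry):
    """Eight corners of a 3D box in camera coordinates.
    The object frame origin is the bottom center; Y points down,
    so the top face sits at y - h."""
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (w / 2.0)
    dz = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * (l / 2.0)
    dy = np.array([0, 0, 0, 0, -h, -h, -h, -h])
    # Rotation around the Y-axis by the yaw angle ry
    c, s = np.cos(ry), np.sin(ry)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    corners_obj = np.stack([dx, dy, dz])           # (3, 8) in the object frame
    return R @ corners_obj + np.array([[x], [y], [z]])

def project(K, pts3d):
    """Perspective projection of (3, N) camera-frame points to pixels."""
    uvw = K @ pts3d
    return uvw[:2] / uvw[2]                        # (2, N) pixel coordinates

K = np.array([[721.5, 0.0, 609.6],                 # illustrative intrinsics
              [0.0, 721.5, 172.9],
              [0.0,   0.0,   1.0]])
corners = box3d_corners(w=1.6, h=1.5, l=3.9, x=1.0, y=1.5, z=20.0, ry=0.3)
uv = project(K, corners)
h2d = uv[1].max() - uv[1].min()                    # projected 2D box height
```

For this 20 m car, `h2d` comes out close to the pinhole approximation `fy * h / z`; the small difference reflects the per-corner depths that the full formula accounts for.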
Relationship between 2D height and 3D corners. Given the eight corners of the 3D object box in the pixel plane, the height of the projected 2D bounding box can be estimated as the difference between the vertical coordinate of the lowermost corner ($v_{\max}$) and that of the uppermost corner ($v_{\min}$) in the pixel coordinate system:

$h_{2d} = v_{\max} - v_{\min} = f_y\left(\frac{Y^c}{Z^c}\Big|_{v_{\max}} - \frac{Y^c}{Z^c}\Big|_{v_{\min}}\right),$   (4)

where the vertical pixel coordinate $v_i$ of each corner is derived from Eq. 3; $v_{\max}$ represents the maximum of $v_i$ over the eight corners, and analogously for $v_{\min}$; and $f_y$ denotes the focal length in the vertical direction of the pixel plane.
Relationship between depth and other 2D/3D parameters. Similar to the definition of the bird's-eye-view angle $\alpha$ (see Fig. 3a), we define the angle between the ray to the bottom center of the object and the horizontal plane as $\beta$ (see Fig. 3b). Given the projected coordinate $(u_b, v_b)$ of the object bottom center in the pixel plane based on Eq. 3, we can obtain the following geometric relationship:

$\tan\beta = \frac{y}{z} = \frac{v_b - c_v}{f_y},$   (5)

where $c_v$ denotes the vertical coordinate of the principal point. It can be clearly observed that the depth $z$ is correlated to the camera intrinsic parameters, the object position (when deriving $\beta$), the 3D dimensions, and the orientation of the object (when deriving the corner offsets in Eq. 1).
Relationship to existing works. Obviously, Eq. 6 obeys the perspective projection principle that farther objects tend to appear smaller than nearer ones. It also clearly differs from prior works in that, in our formula, there is a non-linear relationship between the scene depth and the other 2D/3D elements, owing to the introduction of the object pose and 3D dimensions into the modeling. We can simplify the proposed formula in two different ways: (i) to reduce the computational complexity, we can consider only the first term in Eq. 6 to obtain a simplified geometric formula v1:
(ii) If the variation of the pose and position is not considered, the formulation in Eq. 7 can be further reduced to a simplified geometric formula v2:

$z = \sigma \cdot \frac{f_y\,h}{h_{2d}},$   (8)

where $\sigma$ represents the scale factor for the depth scale conversion, $h$ is the 3D object height, and $h_{2d}$ is the 2D box height. The formula in Eq. 8 is widely used in 3D object detection [18, 5]. We report a detailed comparison and analysis of our formulation in Eq. 6 and the two simplified versions in the experiments (see Sec. 4.2).
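As a quick sanity check of the v2 pinhole relation, a car-sized object at a typical depth gives (illustrative numbers, not taken from the paper; the scale factor is set to 1):

```python
# Simplified formula v2: depth from similar triangles, z = fy * h / h2d.
fy = 721.5         # vertical focal length in pixels (illustrative)
h = 1.5            # 3D object height in meters
h2d = 54.0         # projected 2D box height in pixels
z = fy * h / h2d   # -> ~20.04 m
```

Note how a one-pixel error in `h2d` at this range shifts the depth by roughly 0.4 m, which is why the full formula's extra pose and position terms matter.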
3.3.2 Geometry-Guided Scene Depth Learning
Following the proposed geometric formula, we devise and implement a network module for geometry-guided deep representation learning for accurate depth prediction, as shown in the red dashed box of Fig. 2. The module aims to learn geometric representations using the 2D/3D geometry-related network predictions (2D bounding box, 3D object dimensions, and orientation) as input. Specifically, in the training stage, the module first produces a calculated one-channel depth map with the proposed geometric formula described in Eq. 6. By introducing the camera parameters, the depth map is then transformed into a 3-channel 3D map, with each spatial position representing a 3D data point, as the initial geometric input. Then, the 3D map goes through three non-linear transformation blocks, each consisting of a convolution layer (taking the output of the previous block as input), a batch-norm layer, and a ReLU layer, to learn a robust geometric representation map with C channels; we set C to 32 in our experiments. These learned geometric representations are further concatenated with the image representations produced by the backbone network to learn the depth estimation. In the inference stage, we perform the same procedure as in training, and the final depth output is combined with the other predictions, including 2D bounding boxes, 3D dimensions, and orientations, to produce the 3D object bounding boxes.
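The first step of the module, turning the calculated one-channel depth map into a 3-channel map of 3D points via the camera intrinsics, can be sketched in NumPy as follows. This is a minimal back-projection sketch; the subsequent convolution/batch-norm/ReLU blocks are omitted, and the intrinsic values and map size are illustrative.

```python
import numpy as np

def depth_to_3d_map(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into a (3, H, W) map of 3D points,
    one camera-frame point (X, Y, Z) per spatial position."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / fx    # inverse pinhole projection per pixel
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth])

depth = np.full((96, 320), 20.0)   # toy constant-depth map at feature resolution
xyz = depth_to_3d_map(depth, fx=721.5, fy=721.5, cx=160.0, cy=48.0)
```

The resulting 3-channel map is what the non-linear transformation blocks consume to produce the C-channel geometric representation.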
Table 1 (excerpt), AP on the KITTI val set:
| Method | 3D Easy | 3D Mod. | 3D Hard | BEV Easy | BEV Mod. | BEV Hard |
| w/ gt Dim | 19.85 | 14.06 | 12.02 | 25.06 | 18.29 | 15.85 |
| w/ gt Depth | 79.82 | 70.91 | 62.41 | 88.60 | 82.66 | 75.41 |
3.4 Misalignment between 2D and 3D Bounding Boxes
A misalignment remains between the 2D projected box and the 2D annotation box. Due to the perspective projection effect, farther objects appear smaller than nearer ones, so the misalignment is more severe for nearby objects, which makes learning with the proposed formula inaccurate, especially for nearby objects. To handle this misalignment, we propose to use the 2D projected box instead of the 2D annotation box as the ground truth, ensuring the correctness of the depth estimation. According to Eq. 1 and Eq. 2, we compute the 3D corner coordinates of the object from its 3D pose and 3D dimensions. We then obtain their coordinates on the pixel plane through the projection transformation in Eq. 3. Finally, we calculate the differences between the projected vertices in the image plane as the height and width of the 2D projected box.
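The projected-box ground truth described above reduces to taking min/max over the eight projected corner coordinates. A minimal sketch (the `uv` values below are toy pixel coordinates, not real KITTI projections):

```python
import numpy as np

def projected_box2d(uv):
    """Ground-truth 2D box (center, width, height) from the (2, 8) pixel
    coordinates of the projected 3D box corners (Sec. 3.4)."""
    u_min, v_min = uv.min(axis=1)
    u_max, v_max = uv.max(axis=1)
    w2d, h2d = u_max - u_min, v_max - v_min
    cu, cv = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0
    return cu, cv, w2d, h2d

uv = np.array([[100, 140, 150, 110, 100, 140, 150, 110],
               [200, 205, 210, 203, 160, 158, 166, 162]], dtype=float)
cu, cv, w2d, h2d = projected_box2d(uv)   # -> 125.0, 184.0, 50.0, 52.0
```

Because this box is derived from the same 3D annotations used by the geometric formula, it is consistent with the projection relationships by construction.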
3.5 Implementation Details
Backbone. We adopt a DLA-34 network architecture without deformable convolutions as our backbone. During training, the network input is resized to a fixed resolution. The spatial size of the feature map from the backbone is the input size divided by the down-sampling factor of the backbone CNN.
Optimization loss. The optimization objective of our deep detection framework follows a multi-task learning setting and consists of classification and regression losses for both the 2D and 3D predictions. Specifically, we train the heatmap prediction with the focal loss. The branches for offsets and dimensions in both the 2D and 3D detection are trained with L1 losses. The branch for the orientation prediction is trained with a MultiBin loss following [9, 49]. Based on [9, 13], we use an L1 loss with heteroscedastic aleatoric uncertainty for the depth estimation (more details are provided in the Appendix).
| Method | Extra data | 3D Detection | BEV | AOS | Runtime |
Training: We train the overall deep network for 140 epochs on NVIDIA 1080 Ti GPUs. To alleviate overfitting, we adopt data augmentation techniques, including random scaling, random horizontal flipping, and random cropping for the 2D detection, and random horizontal flipping for the 3D detection. We use the Adam optimizer with a weight decay of 1e-5 to optimize the full training loss. The initial learning rate is 1.25e-4 and is decayed at two scheduled epochs. To make training stable, we apply a linear warm-up strategy for learning with the geometric network module in the first 5 epochs.
Inference: We first predict the 2D bounding boxes, 3D dimensions, and orientations via a shared backbone and several separate task branches. Then, we use the proposed formula to predict a coarse depth, followed by several convolution layers for the final depth estimation. Finally, we use a simple post-processing algorithm based on max-pooling and back-projection to recover the 3D bounding boxes from the 2D boxes, 3D dimensions, orientations, and depth.
4 Experiments
Setup. The KITTI dataset provides widely used benchmarks for various visual tasks in autonomous driving, including 2D object detection, Average Orientation Similarity (AOS), Bird's Eye View (BEV), and 3D object detection. The official dataset contains 7,481 training and 7,518 test images with 2D and 3D bounding box annotations for cars, pedestrians, and cyclists. We report the average precision (AP) for each task under three different settings: easy, moderate, and hard. Moreover, following recent practice, we use 40 recall positions instead of the 11 recall positions proposed in the original Pascal VOC benchmark, which results in a fairer comparison. Each class uses a different IoU threshold for evaluation; we report our results under the official IoU settings for cars.
| Method | 3D Detection IoU ≥ 0.7 | BEV IoU ≥ 0.7 | 3D Detection IoU ≥ 0.5 | BEV IoU ≥ 0.5 |
Results for the car category with the AP evaluation metric. The results of the previous works are quoted from prior work. Our approach significantly outperforms the previous state of the art under almost all evaluation protocols and settings. Bold black/blue indicates the best/second-best performing method.
4.1 Overall Performance Comparison and Analysis
Tables 2 and 3 show the overall performance of the proposed approach on the KITTI test and val sets for cars, from the official online leaderboard as of Mar. 12th, 2021. Existing state-of-the-art monocular 3D object detectors, including methods using extra data and methods using only the monocular image, are listed in the tables for comparison. The KITTI val results of MonoGRNet, M3D-RPN, and MonoPair are quoted from prior work.
Table 4 (excerpt): | + Projected box | 16.54 | 13.37 | 11.15 | 23.62 | 19.19 | 16.70 |
Building a simple yet strong baseline for monocular 3D object detection. We report the enhanced baseline results for monocular 3D object detection in Table 4. Overall, the enhanced baseline significantly improves over the original one by 3.76%, 3.54%, and 2.88% on the easy, moderate, and hard difficulty levels, respectively. This is achieved by introducing three techniques into the original baseline. First, we adopt the L1 loss with aleatoric uncertainty from [9, 13], which makes the training stage more robust to noisy input. Second, we use the projected 3D center as the ground truth for the 2D heatmap prediction, similar to SMOKE. Third, we address the misalignment between the 2D ground-truth bounding boxes and the 2D projected bounding boxes by using the 2D projected box as the ground truth. This guarantees the consistency between 2D and 3D boxes under the projection relationships in the proposed geometric formula, and ensures robust learning with the formula. The enhanced baseline achieves 16.54%, 13.37%, and 11.15% on the easy, moderate, and hard difficulty levels, respectively.
Comparison with monocular image based methods. Our approach achieves a notable improvement over the state-of-the-art monocular image-based detectors [39, 30, 3, 9] on both the val and test sets. As shown in Table 2, on the KITTI test set for the car category, an indispensable part of the 3D object detection task in autonomous driving scenarios, our method achieves 18.85% on the easy, 13.81% on the moderate, and 11.52% on the hard setting, improving over the previous state-of-the-art image-only method. Besides, compared with the unpublished works [26, 15], our method still increases the AP by 1.49% on the moderate setting. For the Bird's Eye View (BEV) evaluation on the car class, our method also achieves the best performance, increasing the AP over the second-best method by 3.10%, 1.96%, and 1.34% on the easy, moderate, and hard levels, respectively. On the KITTI val set, our method also establishes new state-of-the-art performance on both 3D object detection and BEV. Tables 2 and 3 show considerable improvements over the state-of-the-art monocular detection methods with great robustness, benefiting from the introduction of the proposed geometric formula for learning geometry-aware representations to advance the depth estimation.
Comparison with methods using extra data or networks. Prior methods [5, 25, 24, 10, 31] achieve impressive performance on the KITTI test set by introducing extra data or external networks. Although our method utilizes none of this information, as shown in Table 2, it still outperforms these comparison methods in terms of AP by 0.40% on the moderate level. This demonstrates the superior performance of our method with the proposed geometry-guided depth learning for monocular 3D object detection.
Runtime. We test our model on an NVIDIA GTX 1080 Ti with PyTorch 1.1, CUDA 9.0, and an Intel CPU @ 2.60GHz. As shown in Table 2, the proposed method achieves 20 fps and runs at a speed similar to other real-time state-of-the-art methods [20, 40]. This clearly demonstrates the efficiency of our method compared with other competitive methods under a similar experimental environment.
Table 5 (excerpt): | Ours (full model) | 18.45 | 14.48 | 12.87 | 27.15 | 21.17 | 18.35 |
4.2 Ablation Experiments
We conduct extensive ablation studies on the KITTI val set to demonstrate the effectiveness of the proposed approach for geometry-guided depth learning in advancing monocular 3D object detection. For all evaluations, the AP metric is employed. We mainly investigate two aspects: the effect of the proposed geometric formula and module, and the effect of geometry-guided representation learning for depth estimation.
Baseline and variant models. To conduct an extensive evaluation, we consider the following baseline and variant models: (i) Baseline, a base model achieving a strong 3D detection performance with an AP of 11.8% on the moderate setting; (ii) 3D-CAT., which directly feeds the concatenation of the 3D network predictions to the non-linear transformation blocks while bypassing the depth calculation with the geometric formula; (iii) Geo-SV1, which uses our simplified geometric formula v1 as in Eq. 7; (iv) Geo-SV2, which uses our simplified geometric formula v2 as in Eq. 8.
Effects of the geometric formula and module. A detailed ablation study is shown in Table 5. As we can observe, ours (full model) achieves a large gain (2.68% on the moderate level) over Baseline + 3D-CAT., meaning that directly using the 3D network predictions is not effective enough for learning the geometric representations, thus verifying the importance of the proposed geometric formula. Comparing Baseline + Geo-SV2, Baseline + Geo-SV1, and ours (full model), all three of which use the geometric relationships, the performance gradually improves as more geometry elements are involved in the modeling, confirming our motivation of modeling the relationship between the depth and multiple 2D/3D geometry elements, rather than only a subset of them (typically only the height, as in most existing works [18, 5], similar to Geo-SV2). Finally, ours (full model) brings 1.11% and 1.98% improvements on the moderate level for the 3D detection and BEV, respectively, which adequately demonstrates the effectiveness of our proposed approach.
Effect of the geometry-guided representation learning for depth estimation. Fig. 6 shows a performance comparison between the baseline and our approach on depth estimation. Specifically, we evaluate the predicted depth of all car samples in different depth ranges under two primary metrics (SILog and sqRel) widely used in the depth estimation field. Fig. 7 shows that 87% of the cars are within 40m, while only 5.0% are farther than 45m. Fig. 6 shows that our approach outperforms the baseline consistently in all depth ranges, especially within the 40m range containing most samples, which further validates our idea of using geometry-guided representation learning to boost the depth estimation and advance monocular 3D object detection.
5 Conclusion
We proposed a novel geometric formula, derived by principled modeling of multiple 2D/3D network predictions, to guide the depth estimation and advance monocular 3D object detection. We designed and implemented this formula as a neural network module, enabling geometry-aware feature learning together with the image representations to boost the learning of the depth. Extensive experiments demonstrate the effectiveness of the proposed approach, and our results achieve state-of-the-art performance by a large margin.
- (2020) Monocular 3D object detection via geometric reasoning on keypoints. In VISIGRAPP.
- (2018) Geometry-aware learning of maps for camera localization. In CVPR.
- (2019) M3D-RPN: monocular 3D region proposal network for object detection. In ICCV.
- (2020) Kinematic 3D object detection in monocular video. In ECCV.
- (2020) Monocular 3D object detection with decoupled structured polygon estimation and height-guided depth estimation. In AAAI.
- (2017) Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In CVPR.
- (2015) 3D object proposals for accurate object class detection. In NIPS.
- (2019) Fast Point R-CNN. In ICCV.
- (2020) MonoPair: monocular 3D object detection using pairwise spatial relationships. In CVPR.
- (2020) Learning depth-guided convolutions for monocular 3D object detection. In CVPR.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
- (2019) Monocular 3D object detection and box fitting trained end-to-end using intersection-over-union loss. CoRR abs/1906.08070.
- (2019) Geometry and uncertainty in deep learning for computer vision. Ph.D. thesis, University of Cambridge.
- (2019) Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In CVPR.
- (2021) GrooMeD-NMS: grouped mathematically differentiable NMS for monocular 3D object detection. In CVPR.
- (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR.
- (2018) CornerNet: detecting objects as paired keypoints. In ECCV.
- (2019) GS3D: an efficient 3D object detection framework for autonomous driving. In CVPR.
- (2019) Stereo R-CNN based 3D object detection for autonomous driving. In CVPR.
- (2020) RTM3D: real-time monocular 3D detection from object keypoints for autonomous driving. In ECCV.
- (2017) Focal loss for dense object detection. In ICCV.
- (2020) Reinforced axial refinement network for monocular 3D object detection. In ECCV.
- (2020) SMOKE: single-stage monocular 3D object detection via keypoint estimation. In CVPR.
- (2020) Rethinking pseudo-LiDAR representation. In ECCV.
- (2019) Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In ICCV.
- (2021) Delving into localization errors for monocular 3D object detection. In CVPR.
- (2019) ROI-10D: monocular lifting of 2D detection to 6D pose and metric shape. In CVPR.
- (2017) 3D bounding box estimation using deep learning and geometry. In CVPR.
- (2018) Frustum PointNets for 3D object detection from RGB-D data. In CVPR.
- (2019) MonoGRNet: a geometric reasoning network for monocular 3D object localization. In AAAI.
- (2021) Categorical depth distribution network for monocular 3D object detection. In CVPR.
- (2021) Categorical depth distribution network for monocular 3D object detection. In CVPR.
- (2018) Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV.
- (2019) Orthographic feature transform for monocular 3D object detection. In BMVC.
- (2019) Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep SLAM. In ICCV.
- (2020) PV-RCNN: point-voxel feature set abstraction for 3D object detection. In CVPR.
- (2020) From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. TPAMI.
- (2020) Distance-normalized unified representation for monocular 3D object detection. In ECCV.
- (2019) Disentangling monocular 3D object detection. In ICCV.
- (2020) Towards generalization across depth for monocular 3D object detection. In ECCV.
- (2019) FCOS: fully convolutional one-stage object detection. In ICCV.
- (2021) Depth-conditioned dynamic message propagation for monocular 3D object detection. In CVPR.
- (2018) Multi-level fusion based 3D object detection from monocular images. In CVPR.
- (2021) Moving SLAM: fully unsupervised deep learning in non-rigid scenes. In IROS.
- (2019) Geometry-aware video object detection for static cameras. In BMVC.
- (2019) RepPoints: point set representation for object detection. In ICCV.
- (2020) Monocular 3D object detection via feature domain adaptation. In ECCV.
- (2018) Deep layer aggregation. In CVPR.
- (2019) Objects as points. arXiv preprint arXiv:1904.07850.
In this Supplementary Material, we provide further elaboration on the implementation details, experimental results, and qualitative results. Specifically, we present the implementation details of the model training in Section A, additional quantitative results and analysis in Section B, an additional ablation study in Section C, and additional qualitative results in Section D.
A Additional Implementation Details
The overall network optimization loss of the proposed approach consists of three parts: a classification loss $\mathcal{L}_{cls}$, a 2D regression loss $\mathcal{L}_{2D}$, and a 3D regression loss $\mathcal{L}_{3D}$. We present the details of these losses one by one. (i) For the classification loss, similar to [17, 49], we employ a variant of the focal loss that reduces the penalty for negative locations according to their distance from a positive location as:
$$\mathcal{L}_{cls} = -\frac{1}{N}\sum
\begin{cases}
(1-\hat{p})^{\alpha}\log(\hat{p}) & \text{if } p = 1,\\
(1-p)^{\beta}\,\hat{p}^{\alpha}\log(1-\hat{p}) & \text{otherwise,}
\end{cases}$$
where $p$ and $\hat{p}$ represent the ground-truth class probability given by an unnormalized 2D Gaussian and the model's predicted probability for the class, respectively, and $N$ is the number of positive locations. $\alpha$ and $\beta$
are hyperparameters that control the importance of each sample. We set $\alpha$ to 2 and $\beta$ to 4 as the default setting in our experiments. (ii) The 2D regression loss $\mathcal{L}_{2D}$ is defined upon a 6-tuple of ground-truth bounding-box targets and a predicted 6-tuple. Specifically, the 6-tuple consists of two 2D offsets, two 3D offsets, and two 2D box sizes; the 2D/3D offsets are used to adjust the 2D/3D center locations before remapping them to the input resolution, following [17, 49]. We use an $L_1$ loss to optimize each 6-tuple of parameters. (iii) The 3D regression loss $\mathcal{L}_{3D}$ consists of an $L_1$ loss for regressing the dimensions of the 3D bounding box (width, height, and length), and an $L_1$ loss with an uncertainty term for regressing the depth. Specifically, we follow [9, 13] and employ the heteroscedastic aleatoric uncertainty in the depth estimation loss as:
$$\mathcal{L}_{dep} = \frac{\sqrt{2}}{\sigma}\left|d - d^{*}\right| + \log\sigma,$$
where $d$ and $d^{*}$ represent the predicted depth and the ground-truth depth, respectively, and $\sigma$ is the noisy observation parameter of the model. Hence, the overall optimization loss is the sum of the three losses, written as:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{3D}\,\mathcal{L}_{3D},$$
where $\lambda_{2D}$ and $\lambda_{3D}$ are loss weights controlling the balance between the different losses. We consider $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ equally important and use $\lambda_{2D} = \lambda_{3D} = 1$ in all experiments.
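As a reference, the three loss terms described above can be sketched in NumPy as follows. This is an illustrative re-implementation of the stated definitions, not our actual training code; function names and tensor shapes are placeholders.

```python
import numpy as np

def focal_loss(p, p_hat, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss variant over a heatmap.

    p     : ground-truth class probability (unnormalized 2D Gaussian, 1 at centers)
    p_hat : predicted class probability
    alpha=2, beta=4 follow the default setting in the paper.
    """
    pos = (p == 1.0)
    n_pos = max(int(pos.sum()), 1)  # normalize by the number of positives
    pos_loss = -((1.0 - p_hat[pos]) ** alpha) * np.log(p_hat[pos] + eps)
    neg_loss = -((1.0 - p[~pos]) ** beta) * (p_hat[~pos] ** alpha) \
               * np.log(1.0 - p_hat[~pos] + eps)
    return (pos_loss.sum() + neg_loss.sum()) / n_pos

def depth_loss(d_pred, d_gt, sigma):
    """L1 depth regression with heteroscedastic aleatoric uncertainty sigma."""
    return np.sqrt(2.0) / sigma * np.abs(d_pred - d_gt) + np.log(sigma)

def total_loss(l_cls, l_2d, l_3d, lam_2d=1.0, lam_3d=1.0):
    """Overall loss with lambda_2D = lambda_3D = 1 by default."""
    return l_cls + lam_2d * l_2d + lam_3d * l_3d
```

Note that `depth_loss` lets the network attenuate the penalty on uncertain samples by predicting a larger `sigma`, at the cost of the `log(sigma)` regularizer.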
B Additional Results and Analysis
b.1 Additional Results for the Pedestrian/Cyclist Category
As mentioned in the main paper, the official KITTI dataset contains 7,481 training and 7,518 test images with 2D and 3D bounding box annotations for the pedestrian and cyclist categories. We report our quantitative results in Table 6, using the official settings with an IoU threshold of 0.5 for pedestrians and cyclists on the KITTI test set. Our method establishes new state-of-the-art performance on all three detection levels (easy, moderate, and hard) for the cyclist category, with only a slight drop for the pedestrian category. We investigate this slight performance drop by comparing the 2D detection results between the car and pedestrian categories. In fact, the advantage of the proposed geometric formula is independent of the object class: 2D images conform to the projective camera model, so every object satisfies the geometric reasoning. However, a performance gap between car detection and pedestrian/cyclist detection commonly exists
in ours and many previous works on the KITTI dataset. This is mainly due to insufficient training samples for the pedestrian and cyclist categories on KITTI, leading to unstable training, sensitivity to hyper-parameters, and inaccurate, high-variance predictions of the 2D/3D information (2D boxes, orientation, and 3D dimensions). This imbalance of the category data is a common issue on the KITTI dataset for the 3D object detection task. Table 7 shows that the 2D detection results at the moderate level are only 50.48% and 44.63% for cyclist and pedestrian, respectively, while reaching 90.14% for car on the test set. Similarly for orientation estimation, the pedestrian result (39.76%) is less than half of the car result (89.44%) at the moderate level. These two factors introduce more noise into our geometric formula and thus affect the geometry-guided representation learning. Nevertheless, our results for pedestrians and cyclists remain highly competitive with other state-of-the-art methods on the KITTI test set.
b.2 Further Analysis on Depth Estimation from Geometry Modeling
We conduct a further statistical analysis of depth on the train+val set. Table 8 shows that for two cars with the same height in both the 2D bounding box and the 3D bounding box, the depth values of their centers may still differ by several meters due to their distinct poses and locations. This confirms the critical importance of jointly considering the 3D pose and location in the geometric modeling for depth estimation, which has not been investigated by previous works.
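For intuition, the naive pinhole relation below (a standard approximation, not our full geometric formula from the main paper) recovers depth from the object height alone. Precisely because it ignores pose and location, it assigns the same depth to the two cars discussed above; the focal length and box heights used here are hypothetical values.

```python
def naive_depth_from_height(focal_px, height_3d_m, height_2d_px):
    """Pinhole approximation: depth ~ f * H_3D / h_2D (in meters).

    Ignores object pose and location, so two objects with identical 3D and
    2D heights always receive the same depth under this model, even though
    their true center depths can differ (cf. Table 8).
    """
    return focal_px * height_3d_m / height_2d_px

# Hypothetical example: focal length ~700 px, a 1.5 m tall car projected to 50 px.
d = naive_depth_from_height(700.0, 1.5, 50.0)  # 21.0 m
```

This is exactly the ambiguity that motivates modeling pose and location jointly with the box heights.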
b.3 Additional Results at Different Distances
We provide additional results on depth estimation and monocular 3D object detection at different distances. Table 9 shows additional depth estimation results on the KITTI val set, comparing the enhanced baseline and our method. Specifically, we evaluate the depth estimation by computing the Scale-Invariant Logarithmic error (SILog), the squared Relative error (sqRel), the absolute Relative error (absRel), and the Root Mean Squared Error of the inverse depth (iRMSE). Our method outperforms the enhanced baseline by large margins on all these evaluation metrics. The depth estimation results clearly demonstrate the effectiveness of our proposed idea of using geometry-guided representation learning to boost depth estimation from monocular images, thereby advancing monocular 3D object detection.
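For completeness, the four depth-error metrics can be computed as sketched below. The definitions follow the common KITTI depth-benchmark conventions; the x100 and x1000 scale factors are the usual reporting units and are an assumption here, not something specified in this paper.

```python
import numpy as np

def depth_metrics(d_pred, d_gt):
    """Compute SILog, sqRel, absRel, and iRMSE for positive depth arrays.

    SILog is reported x100 and iRMSE in 1/km (x1000), following the usual
    KITTI conventions (assumed, not taken from this paper).
    """
    d_pred = np.asarray(d_pred, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    log_diff = np.log(d_pred) - np.log(d_gt)
    silog = np.sqrt(np.mean(log_diff ** 2) - np.mean(log_diff) ** 2) * 100.0
    sq_rel = np.mean((d_pred - d_gt) ** 2 / d_gt)
    abs_rel = np.mean(np.abs(d_pred - d_gt) / d_gt)
    irmse = np.sqrt(np.mean((1.0 / d_pred - 1.0 / d_gt) ** 2)) * 1000.0
    return {"SILog": silog, "sqRel": sq_rel, "absRel": abs_rel, "iRMSE": irmse}
```

All four metrics are zero for a perfect prediction, and SILog is invariant to a global scaling of the predicted depths.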
Moreover, we conduct experiments on the monocular 3D object detection improvement at different distances. Table 10 reports performance at different object distance ranges following the evaluation protocol of prior work. It is clear that our method consistently outperforms the baseline at different ranges.
C Additional Ablation Study for Uncertainty and Equation
We investigate the effect of uncertainty with our geometric module on the KITTI val set in Table 12. It can be seen that the uncertainty is helpful for learning the geometry, but the main improvement comes from the proposed principled geometric modeling. To further validate the effectiveness of Eq. (6), Table 11 compares our geometric module with a variant that feeds all predictions into a pointwise MLP. Ours is significantly better than the pointwise MLP.
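For clarity, the pointwise-MLP baseline compared in Table 11 can be sketched as below: each object's concatenated geometric predictions are mapped to a depth by a small MLP shared across objects. The dimensions and weights here are random placeholders for illustration, not the trained network used in the ablation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_mlp_depth(x, w1, b1, w2, b2):
    """Regress depth from concatenated geometric predictions with a
    pointwise (per-object) MLP, instead of the closed-form geometric module.
    x: (num_objects, feat_dim) array of predictions (2D box, dims, pose)."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer, shared weights
    return h @ w2 + b2                # one scalar depth per object

# Placeholder setup: 4 objects, 8-dim prediction vector, 16 hidden units.
x = rng.normal(size=(4, 8))
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
d = pointwise_mlp_depth(x, w1, b1, w2, b2)  # shape (4, 1)
```

Unlike the geometric module, such an MLP must learn the projective relationship purely from data, which is consistent with its weaker results in Table 11.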
D Additional Qualitative Results
Figure 8 shows comparison results between the enhanced baseline and the proposed method from the bird's-eye view. Figure 9 presents additional qualitative 3D detection results on images, again comparing the two methods on the KITTI val set. We can observe from the figures that the proposed geometry-guided learning approach achieves significantly better 3D detection and localization performance than the enhanced baseline.
Figures 10 and 11 show additional visualizations of the prediction results on KITTI raw data in the image plane and the LiDAR coordinate system, respectively. We use orange, purple, and green boxes for cars, pedestrians, and cyclists, respectively. Our approach is able to accurately localize 3D objects at different depths.