Towards Model Generalization for Monocular 3D Object Detection

05/23/2022
by   Zhenyu Li, et al.

Monocular 3D object detection (Mono3D) has achieved tremendous improvements with emerging large-scale autonomous driving datasets and the rapid development of deep learning techniques. However, owing to severe domain gaps (e.g., differences in field of view (FOV), pixel size, and object size among datasets), Mono3D detectors have difficulty generalizing, leading to drastic performance degradation on unseen domains. To solve these issues, we combine the position-invariant transform and multi-scale training with the pixel-size depth strategy to construct an effective unified camera-generalized paradigm (CGP). It fully considers discrepancies in the FOV and pixel size of images captured by different cameras. Moreover, we further investigate the obstacle in quantitative metrics during cross-dataset inference through an exhaustive systematic study. We discern that the size bias of predictions leads to a colossal failure. Hence, we propose the 2D-3D geometry-consistent object scaling strategy (GCOS) to bridge the gap via instance-level augmentation. Our method, called DGMono3D, achieves remarkable performance on all evaluated datasets and surpasses the SoTA unsupervised domain adaptation scheme even without utilizing data on the target domain.


1 Introduction

3D object detection is a critical component for many computer vision applications such as autonomous driving, robot navigation, and virtual reality, to name a few, aiming to categorize and localize objects in 3D space. Previous methods have achieved engaging performance based on the accurate spatial information from multiple sensors, such as LiDAR-scanned point clouds Zhou and Tuzel [2018]; Lang et al. [2019]; Shi et al. [2020] or stereo images Chen et al. [2020]; Li et al. [2019]; Sun et al. [2020]. With the rapid development of intelligent systems, monocular 3D object detection (Mono3D) from single images has drawn increasing research attention due to the potential prospects of reduced cost and increased modular redundancy. Driven by deep neural networks and large-scale human-annotated datasets Kesten et al. [2019]; Caesar et al. [2020]; Geiger et al. [2012], this field has obtained remarkable advancements Zhang et al. [2022]; Huang et al. [2022]; Wang et al. [2021a]; Park et al. [2021]; Liu et al. [2020a]; Wang et al. [2019]; Chen et al. [2016].

However, when trained on one specific domain (i.e., the source domain), Mono3D detectors cannot generalize well to a novel test domain (i.e., the target domain) due to inevitable domain shifts arising from geographical locations, imaging processes, and object characteristics, which hampers the deployment of models in practical applications Li et al. [2022]. Although collecting and training with more data from different domains could alleviate this issue, it is often infeasible given the diversity of real-world scenarios and expensive annotation costs Yang et al. [2021a]; Li et al. [2022]. Hence, more generalized Mono3D detectors are highly demanded. In this paper, we aim to achieve single-domain generalization for monocular 3D object detection, which is more challenging and practical compared with unsupervised domain adaptation (UDA) Li et al. [2022], since the data in the target domain is sometimes inaccessible and we can obtain at most a few statistics during the training stage.

STMono3D Li et al. [2022] makes the first effort towards Mono3D UDA. Regarding the discrepancies among datasets (i.e., domain gaps), Li et al. [2022] conducts a detailed investigation and shows that the geometry misalignment caused by different camera devices (i.e., intrinsic parameters) leads to a severe depth-shift phenomenon and is the main obstacle to cross-domain inference. Hence, Li et al. [2022] proposes a geometry-aligned multi-scale training strategy and adopts the pixel-size depth to make detectors camera-aware. Nevertheless, the detectors still suffer from an imaging misalignment caused by different camera fields of view (FOV). The FOV gap cannot be perceived by models through the proposed strategy Li et al. [2022] since the FOV is invariant under image resizing, leading to a sub-optimal solution for model generalization. To bridge this gap, we propose to seamlessly combine the position-invariant transform (PIT) Gu et al. [2021] and multi-scale training with the pixel-size depth strategy Li et al. [2022]; Park et al. [2021]; Chen et al. [2022] to construct a unified camera-generalized paradigm (CGP). It ensures geometry consistency (i.e., fully considers the discrepancies of FOV and pixel size) when 3D detectors infer on images captured by different cameras and avoids unacceptable overhead costs during the training stage.

While this effective paradigm achieves camera generalization and yields competitive model performance on loose quantitative metrics (e.g., AP at an IoU threshold of 0.5), we observe that the trained detectors cannot obtain satisfactory results (i.e., the average precision drops to zero) on relatively strict metrics (e.g., AP at an IoU threshold of 0.7). Moreover, even STMono3D Li et al. [2022], which utilizes images in the target domain for self-training, suffers from a similar dilemma. Unlike 2D bounding boxes, whose sizes vary widely depending on the distance of the object from the camera, the sizes of 3D bounding boxes are more consistent within the same dataset, regardless of the relative location to the camera Luo et al. [2021]. Hence, Mono3D detectors tend to overfit a narrow and dataset-specific distribution of object sizes from the source domain, which is consistent with observations in LiDAR-based 3D detection Wang et al. [2020]; Luo et al. [2021] and leads to the size bias of predictions (Fig. 2). To better analyze the influence of such a bias on quantitative metrics, we replace the predicted dimensions of object sizes with the ground truth step by step. As shown in Fig. 3b, it is the size bias that leads to the severe degradation of strict metrics. To alleviate this issue, we propose the 2D-3D geometry-consistent object scaling strategy (GCOS) that simultaneously scales objects in the 2D image and the 3D spatial space while maintaining the 2D-3D geometry consistency of objects. This strategy frees the training stage from requiring data in the target domain, yielding more generalized Mono3D detectors and a more effective training pipeline.

In summary, the major contributions of this work are as follows. First, we combine the position-invariant transform Gu et al. [2021] and multi-scale training with the pixel-size depth strategy Li et al. [2022]; Park et al. [2021]; Chen et al. [2022] to construct a unified paradigm for camera-generalized Mono3D detectors, which comprehensively considers FOV and pixel-size discrepancies among domains. Second, we investigate the underlying reasons behind the degradation of strict quantitative metrics. Accordingly, we propose the 2D-3D GCOS strategy to alleviate this issue in a data augmentation manner and boost model generalization. Third, we conduct extensive experiments on various 3D object detection datasets: KITTI Geiger et al. [2012], NuScenes Caesar et al. [2020], and Lyft Kesten et al. [2019]. Our method for Domain Generalized Monocular 3D detection, named DGMono3D, obtains engaging results. Without training on the target domain, our models achieve competitive and even better performance compared with the UDA method Li et al. [2022], demonstrating the effectiveness of DGMono3D.

2 Related Work

Monocular 3D object detection has drawn much attention in recent years Chen et al. [2015]; Xu and Chen [2018]; Mousavian et al. [2017]; Roddick et al. [2018]; Weng and Kitani [2019]; Brazil and Liu [2019]; Wang et al. [2022b, 2021a]; Park et al. [2021]; Wang et al. [2022a]. Because of the lack of spatial information, earlier work adopts auxiliary depth networks Chen et al. [2015]; Xu and Chen [2018] or 2D object detectors Mousavian et al. [2017] to support 3D detection. Another line of study attempts to lift RGB images into 3D representations, such as OFTNet Roddick et al. [2018] and Pseudo-LiDAR Weng and Kitani [2019]. To avoid the dependency on sub-networks, recent methods design the framework in an end-to-end manner like 2D detection Brazil and Liu [2019]; Liu et al. [2020a]; Wang et al. [2022b]; Park et al. [2021]. In this paper, we conduct experiments based on FCOS3D Wang et al. [2021a], a neat and representative Mono3D detector that keeps the well-developed designs for 2D detection and is adapted to Mono3D with only fundamental designs for specific 3D targets.

Domain adaptation aims to generalize a model trained on source domains to target domains Wang and Deng [2018]. Based on whether labels are available on target domains, methods can be divided into supervised and unsupervised categories Wang and Deng [2018]. In detection, most domain adaptation approaches are designed for 2D detectors Hsu et al. [2020]; Chen et al. [2018]; Khodabandeh et al. [2019]; Kim et al. [2019], and directly adopting these techniques for Mono3D may not work well due to the distinct characteristics of targets in the 3D spatial coordinates Li et al. [2022]. For domain adaptation methods on LiDAR-based 3D detection Yang et al. [2021a, b]; Luo et al. [2021]; Zhang et al. [2021], the fundamental differences in data structures and network architectures render these approaches not readily applicable to Mono3D.

STMono3D Li et al. [2022] proposes the first UDA paradigm for Mono3D via self-training. While it adopts multi-scale training with the pixel-size depth strategy to overcome the discrepancy of camera parameters in cross-dataset inference, the FOV, which is invariant under image scaling, is neglected and hampers the model performance. Moreover, its performance still degrades drastically on strict quantitative metrics, as shown in Fig. 3. In this paper, we fully consider both pixel size and FOV, constructing the camera-generalized paradigm (CGP). Subsequently, we perform a detailed systematic analysis and design the 2D-3D geometry-consistent object scaling strategy (GCOS) to handle the degradation on strict quantitative metrics. Our method aims to achieve single-domain generalization Qiao et al. [2020]; Wang et al. [2021b]; Li et al. [2021] and does not need images from target domains, which are hard to obtain in practice.

Cut-and-paste for detection is a representative data augmentation strategy applied to image regions that contain objects Dvornik et al. [2018]; Fang et al. [2019]; Dwibedi et al. [2017]. It is also commonly utilized in LiDAR-based 3D detection to process object points and is crucial for detector performance Liu et al. [2020b]; Yan et al. [2018]; Yang et al. [2019]. Besides, there are several explorations for multi-modal 3D detection Zhang et al. [2020], but such strategies are absent for Mono3D Wang et al. [2021a, 2022a, 2022b]. In this paper, we design GCOS in a cut-and-paste manner. Instead of saving image crops to build a ground-truth database and pasting them randomly during training Yan et al. [2018]; Zhang et al. [2020], we directly operate on objects of the current images in an online manner. Moreover, our method also resizes objects and keeps the consistency of the 2D-3D correspondence to diminish the size gap in cross-dataset inference.

                    (a) Overview                           (b) Camera-Generalized Paradigm
Figure 1: (a) An overview of our proposed DGMono3D, containing a camera-generalized paradigm (CGP, red) and a geometry-consistent object scaling strategy (GCOS, green). (b) Based on the reversibility of the position-invariant transform (PIT) and the multi-scale strategy (MS), we evolve an advanced CGP that avoids infeasible offline redundancy.

3 DGMono3D

In this section, we first formulate the single-domain generalization task for Mono3D in Sec. 3.1 and then present the overall pipeline of DGMono3D in Sec. 3.2. Subsequently, we introduce our key contributions in detail, including the camera-generalized paradigm in Sec. 3.3 and the 2D-3D geometry-consistent object scaling strategy in Sec. 3.4.

3.1 Problem Definition

Under the single-domain generalization setting, we have access to labeled images from a single source domain D_s = {(I_i, K_i, Y_i)}_{i=1..N_s}, while the target-domain data D_t = {(I_j, K_j)}_{j=1..N_t} is inaccessible, where N_s and N_t are the numbers of samples from the source and target domains, respectively. Each 2D image I is paired with a camera parameter K that associates points in 3D space with the 2D image plane, while Y denotes the label of the corresponding training sample from the source domain. A label is in the form of object class, location, size in each dimension, and orientation. We aim to train models on D_s and avoid performance degradation when inferring on any other target domain D_t. Images and the camera parameters needed to re-project predictions to 3D locations are available during inference in target domains.
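To make the setting concrete, here is a minimal sketch of the assumed per-sample interface and the standard pinhole projection that links 3D labels to the image plane (Python; all names are illustrative, not taken from the authors' code).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Mono3DSample:
    """One labeled source-domain sample, as assumed in Sec. 3.1."""
    image: np.ndarray        # H x W x 3 RGB image
    intrinsics: np.ndarray   # 3 x 3 camera matrix K linking 3D points and pixels
    classes: np.ndarray      # (N,) object categories
    locations: np.ndarray    # (N, 3) object centers (x, y, z) in camera coordinates
    dimensions: np.ndarray   # (N, 3) object sizes (height, width, length) in meters
    yaws: np.ndarray         # (N,) orientation angles around the vertical axis

def project_to_image(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-frame points onto the image plane with K."""
    uvw = points_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]
```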

(a) Height (b) Length (c) Width
Figure 2: Size statistics of different datasets and model predictions (Nus→KITTI). The distribution of object sizes varies drastically across datasets. Models trained on the source domain without GCOS predict object sizes with a bias. Our proposed GCOS alleviates this issue, achieving more generalized Mono3D detectors.

3.2 Overview

As shown in Fig. 1a, DGMono3D is a two-stage training method based on standard center-based Mono3D detectors that predict 3D attributes including classes, 2D locations, depths, sizes, and directions. The input images are first passed through the camera-generalized paradigm (CGP), consisting of the position-invariant transform (PIT) Gu et al. [2021] and the multi-scale augmentation (MS) Li et al. [2022]; Park et al. [2021]; Chen et al. [2022]. The former projects the image onto a spherical surface to eliminate the image distortion caused by FOV Gu et al. [2021], and the latter is combined with the pixel-size depth modification to learn invariant geometry correspondences between scaled images and camera parameters Li et al. [2022]. In the first training stage, we initialize the detector encoder with standard classification pre-trained parameters and disable the 2D-3D geometry-consistent object scaling strategy (GCOS) so that the detector can first learn reliable localization cues. After obtaining a model with sufficient detection capability, we fine-tune the detector with GCOS to enhance model generalization.

3.3 Camera-Generalized Paradigm

As shown in Fig. 1, the vanilla CGP first randomly scales images and camera parameters as

u' = r_x · u,   v' = r_y · v,   f'_x = r_x · f_x,   f'_y = r_y · f_y,   c'_x = r_x · c_x,   c'_y = r_y · c_y,   (1)

where r_x and r_y are resize rates, f and c denote the focal length and optical center, and u and v indicate the image coordinate axes, respectively. Since the scaling operation cannot change the FOV (proven in the supplementary material), the model cannot be FOV-aware and thus suffers from this discrepancy when conducting cross-dataset inference. Hence, we apply the position-invariant transform (PIT) Gu et al. [2021] to remove the FOV distortion and make the model more generalized. Given the image coordinates, we project them onto spherical coordinates by

(2)

where bilinear interpolation is adopted to sample points from images. More details can be found in Gu et al. [2021]. However, since the PIT is time-consuming and the transformed images have to be saved in an offline manner, it leads to unacceptable overhead and makes tuning the scaling parameters much harder. To solve this issue, we reverse the order of the MS and PIT based on the reversibility proven in the supplementary material. Hence, we can apply MS in an online manner, which reduces the storage requirement by a factor of N when there are N scale transformations, as illustrated in Fig. 1b.
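The sketch below illustrates the online multi-scale step of the advanced CGP under these assumptions: the PIT conversion is done once offline, and only the image resize with the consistent rescaling of the intrinsics (Eq. 1) happens during training. Function and variable names are illustrative.

```python
import cv2
import numpy as np

def multi_scale_augment(image: np.ndarray, K: np.ndarray, r_x: float, r_y: float):
    """Online multi-scale step of the advanced CGP (Fig. 1b).

    The image is assumed to be already PIT-converted offline; here we only
    resize it and rescale the intrinsics consistently (Eq. 1), which keeps the
    FOV unchanged but changes the effective focal length / pixel size.
    """
    h, w = image.shape[:2]
    scaled = cv2.resize(image, (int(round(w * r_x)), int(round(h * r_y))))
    K_scaled = K.copy().astype(np.float64)
    K_scaled[0, :] *= r_x   # f_x and c_x scale with the horizontal resize rate
    K_scaled[1, :] *= r_y   # f_y and c_y scale with the vertical resize rate
    return scaled, K_scaled

# usage: pit_image and K come from the one-off offline PIT conversion;
# the sampling range of r is an illustrative choice
# r = np.random.uniform(0.6, 1.4); img_s, K_s = multi_scale_augment(pit_image, K, r, r)
```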

On top of that, we replace the metric depth with the pixel-size depth following Li et al. [2022]; Park et al. [2021]:

d = (c / s_p) · d_p,   (3)

where s_p and c are the pixel size and a constant, and d_p is the model prediction, which is scaled to the final metric depth d via Eq. 3. The MS strategy and the pixel-size depth make the model camera-aware Li et al. [2022]; Park et al. [2021], ensuring basic model generalization to other target domains with different camera devices.
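A minimal sketch of the conversion in Eq. 3 as reconstructed here; the exact definitions of the pixel size s_p and the constant c are not given in this excerpt, so an inverse-focal-length pixel size is assumed purely for illustration.

```python
import numpy as np

def pixel_size(K: np.ndarray) -> float:
    """Pixel size used for the pixel-size depth; here taken as the inverse of
    the geometric-mean focal length (one common choice, assumed here)."""
    return 1.0 / float(np.sqrt(K[0, 0] * K[1, 1]))

def to_metric_depth(d_pred: np.ndarray, K: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Convert the camera-agnostic pixel-size depth predicted by the network
    into metric depth, d = (c / s_p) * d_pred (Eq. 3 as reconstructed here)."""
    return (c / pixel_size(K)) * d_pred
```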

In addition, since the pixels generated in PIT-transformed images occupy areas of different sizes in the plain images, they have different pixel sizes, which can be computed through:

(4)
(5)

where the computed quantity is the real side length of a PIT-transformed pixel. We present the visualized weight map and give a more intuitive description in the supplementary material.

                                    (a) Visualization of Size Bias                                  (b) Systematic Analysis
Figure 3: (a) The pretrained model on NuScenes dataset predicts objects with larger size in the target KITTI dataset, indicating a potential size bias. The ground-truth and model prediction are presented in blue and orange, respectively. (b) We further evaluate the results by replacing the predicted dimensions of object sizes with ground-truth value step by step to better analyze the influence of the size bias on quantitative metrics.

3.4 2D-3D Geometry-Consistent Object Scaling Strategy

3.4.1 Systematic Analysis

We take NuScenes as the source dataset and KITTI as the target dataset. From Fig. 2 and Fig. 3a, our key insights include two aspects. First, the distribution of object sizes varies across datasets, leading to a geometric mismatch that can be a factor of the domain gap Luo et al. [2021]. Second, directly applying a model trained on NuScenes to KITTI (referred to as +CGP in Fig. 3b) is ineffective since the model predicts object sizes close to those of the source domain. We then conduct the systematic analysis by replacing the predicted sizes with the ground-truth values step by step. The results shown in Fig. 3b demonstrate the influence of size prediction on strict quantitative metrics. Specifically, when we replace the predicted height with the ground-truth height, the AP at IoU 0.7 is improved from 0.6% to 9.8%, indicating that the height plays an essential role. Subsequently, we replace all the predicted dimensions (i.e., height, width, and length) and the accuracy reaches 10.3%, approaching the results of Oracle models. Therefore, the low accuracy on strict metrics is mainly caused by the size error. To alleviate it, we propose the 2D-3D geometry-consistent object scaling strategy (GCOS).
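A simplified sketch of the replacement protocol used in this analysis: the paper matches predictions to ground truth via maximum 3D IoU, while this illustration uses nearest-center matching for brevity; names and the distance threshold are assumptions.

```python
import numpy as np

def replace_sizes_with_gt(pred_dims, pred_centers, gt_dims, gt_centers,
                          dims_to_replace=(0,), max_dist=2.0):
    """Step-by-step size replacement used in the systematic analysis (Fig. 3b).

    `dims_to_replace` selects which dimensions (e.g., 0 = height) are
    overwritten with ground-truth values before re-running the evaluation.
    """
    out = pred_dims.copy()
    for i, c in enumerate(pred_centers):
        dists = np.linalg.norm(gt_centers - c, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:          # simple nearest-center association
            for d in dims_to_replace:
                out[i, d] = gt_dims[j, d]
    return out
```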

3.4.2 GCOS Design

Motivated by Yang et al. [2021a]; Wang et al. [2020], we fine-tune the already trained object detector with the GCOS so that its predicted box sizes better match the target statistics. The principle is to maintain the 2D-3D correspondence and avoid breaking geometry consistency, which is more complex and more essential for Mono3D. To facilitate the augmentation, we scale objects in 3D space and then adjust the corresponding areas in 2D images, keeping the bird's-eye-view (BEV) position of the visible face invariant.

As shown in Fig. 4, we classify objects into (a) objects with only one face visible and (b) objects with two faces visible in perspective view, and apply different strategies to scale them. As for (a), we extend or shrink objects along the direction perpendicular to the visible BEV edge and retain the BEV position of the visible-edge center. In terms of (b), we first split the two visible faces at the nearest vertical edge of the 3D bounding box and then extend or shrink objects along the directions parallel to these two visible BEV edges. When scaling, the bottom of the 3D object is kept fixed on the ground. After obtaining the scaled 3D bounding boxes, we project them onto the 2D images to obtain the target boundaries. Then, we scale the cropped object patches to the target size and paste them back into the images.
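A simplified BEV sketch of case (a), assuming for illustration that the visible face is the near face along the heading axis; the exact face selection in the paper may differ.

```python
import numpy as np

def scale_box_keep_near_face(center_bev, yaw, length, width, scale):
    """Case (a) of GCOS in BEV (Fig. 4a), as a simplified sketch.

    The box is scaled by `scale`, and its center is shifted along the heading
    direction so that the BEV center of the face visible to the camera stays
    where it was, preserving the perspective-consistent layout.
    """
    heading = np.array([np.cos(yaw), np.sin(yaw)])          # BEV heading direction
    cam_dir = -center_bev / (np.linalg.norm(center_bev) + 1e-9)
    # pick the face whose outward normal points toward the camera
    sign = 1.0 if np.dot(heading, cam_dir) > 0 else -1.0
    near_face_center = center_bev + sign * heading * (length / 2.0)
    new_length, new_width = scale * length, scale * width
    new_center = near_face_center - sign * heading * (new_length / 2.0)
    return new_center, new_length, new_width
```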

Given the dense regular arrangement of image pixels, we highlight several points in operating on 2D crops. (1) When scaling objects, directly shrinking 2D crops would leave blank fringes on the original images. To avoid leaking object information, supposing we shrink an object with ratio ρ, we first expand the range of the 2D crop by 1/ρ and then shrink the expanded crop by ρ. Hence, the blank fringes are filled with background on the fly, and the object is shrunk correctly. (2) We crop and paste objects in an inverse-depth order, avoiding breaking the layout in perspective view. (3) To reduce artifacts caused by image patches, we follow Dwibedi et al. [2017]; Zhang et al. [2020] and apply random blending to smooth the boundaries of image patches. We present the implementation details of GCOS in Sec. 4.2.
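A minimal sketch of the 2D side of GCOS for shrinking (point (1) above), assuming integer pixel boxes and OpenCV resizing; blending and the inverse-depth-order pasting are indicated only by the trailing comment.

```python
import cv2
import numpy as np

def shrink_object_patch(image, box_2d, rho):
    """Shrink an object crop by ratio rho < 1 without leaving blank fringes.

    First enlarge the crop region by 1/rho around its center, then resize that
    enlarged crop back to the original region size, so the fringe is filled by
    real background pixels rather than left blank.
    """
    x1, y1, x2, y2 = box_2d
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) / (2.0 * rho), (y2 - y1) / (2.0 * rho)
    ex1, ey1 = max(int(cx - hw), 0), max(int(cy - hh), 0)
    ex2, ey2 = min(int(cx + hw), image.shape[1]), min(int(cy + hh), image.shape[0])
    expanded = image[ey1:ey2, ex1:ex2]
    out = image.copy()
    out[y1:y2, x1:x2] = cv2.resize(expanded, (x2 - x1, y2 - y1))
    return out

# objects should be processed in inverse-depth order (farthest first) so nearer
# patches are pasted last and the perspective layout is preserved
```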

Figure 4: Illustration of the GCOS strategy. We classify objects into (a) objects with only one face visible and (b) objects with two faces visible (best viewed in BEV). We apply two different scaling strategies to them, but the shared principle is to keep the position of the visible BEV edges invariant, so as not to break the perspective geometry consistency.

4 Experiments

  Dataset  Samples    Loc.     Shape     FOV   Height   Width   Length
KITTI Geiger et al. [2012] 3712 EUR. (375,1242) (29,81) 1.52 1.63 3.87
NuScenes Caesar et al. [2020] 27522 SG.,EUR. (900,1600) (39,65) 1.71 1.92 4.62
Lyft Kesten et al. [2019] 21623 SG.,EUR. (1024,1224) (60,70) 1.73 1.94 4.77
Table 1: Dataset overview. We focus on properties related to frontal-view cameras. Samples refers to the number of images used in the training stage, Shape is the image resolution (H, W), and FOV is given in degrees (vertical, horizontal). We present the mean object sizes in meters.

4.1 Setup

Dataset. Following Li et al. [2022], we conduct experiments on three widely used autonomous driving datasets: KITTI Geiger et al. [2012] (CC BY-NC-SA 3.0), NuScenes Caesar et al. [2020] (CC BY-NC-SA 4.0), and Lyft Kesten et al. [2019] (CC BY-NC-SA 4.0). We explore two difficulties in domain generalization: (1) images captured by different camera devices (i.e., different pixel sizes and FOVs), and (2) bias in the distribution of object sizes between the source and target domains. To highlight these discrepancies, we summarize the dataset information in Tab. 1. For Lyft and NuScenes, we subsample 1/4 of the data for simplicity.

Comparison Methods. In our experiments, we compare our DGMono3D with three counterparts: (1) Source Only indicates directly evaluating the source-domain trained model on the target domain. (2) Oracle indicates the fully supervised model trained on the target domain. (3) STMono3D Li et al. [2022] is the SoTA unsupervised domain adaptation method, which must utilize images of the target domain during the training stage.

Metrics.

We adopt the KITTI evaluation metrics to evaluate our methods. Following most domain adaptation methods for 3D detection Yang et al. [2021a]; Luo et al. [2021]; Li et al. [2022], we focus on the commonly used car category. When KITTI is considered, we report the average precision (AP) with IoU thresholds of 0.5/0.7 (i.e., loose/strict) for both the bird's-eye-view (BEV) IoU and the 3D IoU. When inferring on NuScenes, since the attribute labels are different or unavailable on the target domain, we discard the average attribute error (mAAE) and report the average translation error (mATE), average scale error (mASE), average orientation error (mAOE), and average precision (mAP). Following Yang et al. [2021a]; Li et al. [2022], we highlight the closed performance gap between Source Only and Oracle for a more intuitive comparison.

4.2 Implementation

Training. We train DGMono3D based on FCOS3D Wang et al. [2021a] in a two-stage manner. Following Gu et al. [2021], we adopt the loss re-weighting strategy to substitute the reversed PIT during the training process, which avoids the extra computational cost Gu et al. [2021]. In the first training stage, we follow the standard scheme proposed in Wang et al. [2022a] (FCOS3D++). We train our model for 48 epochs on the KITTI dataset, and for 12 epochs on the NuScenes and Lyft datasets. In the second training stage, we fine-tune the model with the same training scheme but adopt the GCOS strategy for 2 epochs, fixing all model parameters except the size prediction branch. When the statistical information of the target domain is available, we apply GCOS with the ratio of the mean object size of the target domain to that of the source domain. When it is inaccessible, we use a random scale factor in GCOS to flatten the distribution of predicted object sizes and enhance the model generalization. Compared with the need to collect a large number of images on the target domain for self-training Li et al. [2022], DGMono3D is more practical and nearly hyper-parameter-free, avoiding tricky and complex training strategies.
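A small helper sketch for choosing the GCOS scaling factor; the random range is an illustrative assumption, and the statistical ratio uses the per-dimension means reported in Tab. 1.

```python
import numpy as np

def gcos_scale_factor(source_mean_hwl, target_mean_hwl=None, rand_range=(0.85, 1.15)):
    """Choose the GCOS scaling factor for the fine-tuning stage.

    If target-domain size statistics are available, use the per-dimension ratio
    of mean sizes (stat. GCOS); otherwise sample a random factor to flatten the
    predicted size distribution (rand. GCOS). The range is illustrative.
    """
    if target_mean_hwl is not None:
        return np.asarray(target_mean_hwl) / np.asarray(source_mean_hwl)
    return np.random.uniform(*rand_range, size=3)

# example with the Tab. 1 means (H, W, L): NuScenes -> KITTI
# ratio = gcos_scale_factor([1.71, 1.92, 4.62], [1.52, 1.63, 3.87])
```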

Nus→K                         Loose Metrics                                        Strict Metrics
Method                        AP_BEV@0.5 (Easy Mod. Hard)   AP_3D@0.5 (Easy Mod. Hard)   AP_BEV@0.7 (Easy Mod. Hard)   AP_3D@0.7 (Easy Mod. Hard)
Source Only                   0 0 0                         0 0 0                        0 0 0                         0 0 0
Oracle                        38.77 30.84 29.82             34.78 27.97 26.59            19.98 16.83 16.29             15.80 13.61 13.16
STMono3D Li et al. [2022]     35.63 27.37 23.95             28.65 21.89 19.55            5.37 4.38 3.76                0.60 0.64 0.64
DGMono3D                      34.22 28.99 27.82             28.77 24.82 23.67            16.07 14.89 14.32             12.00 11.29 10.97
Closed Gap                    88.3% 94.0% 93.2%             82.7% 88.7% 89.1%            80.4% 88.4% 87.9%             75.9% 82.9% 88.3%
L→K                           Loose Metrics (AP_3D@0.5)     Strict Metrics (AP_3D@0.7)
Method                        Easy Mod. Hard                Easy Mod. Hard
Source Only                   0 0 0                         0 0 0
Oracle                        34.78 27.97 26.59             15.80 13.61 13.16
STMono3D Li et al. [2022]     18.14 13.32 11.83             4.54 4.54 4.54
DGMono3D                      30.03 23.38 22.23             11.56 10.55 10.31
Closed Gap                    86.3% 83.5% 83.6%             73.1% 77.5% 68.8%
L→Nus                         Metrics
Method                        AP     ATE     ASE     AOE
Source Only                   2.40   1.302   0.190   0.802
Oracle                        28.2   0.798   0.160   0.209
STMono3D Li et al. [2022]     21.3   0.911   0.170   0.355
DGMono3D                      25.5   0.842   0.169   0.208
Closed Gap                    90.4%  91.2%   70.0%   100%
Table 2: Performance of DGMono3D on three source-target pairs. We report the AP of the car category at IoU thresholds 0.5/0.7 as well as the domain gap closed by DGMono3D. On Lyft→Nus, DGMono3D achieves a slightly better result than the Oracle model on AOE, demonstrating the effectiveness of our proposed method. Moreover, our approach surpasses STMono3D Li et al. [2022] by large margins and approaches the Oracle results on strict metrics, indicating that the size gap is significantly bridged.

Inference. When inferring on the target domain, unlike STMono3D Li et al. [2022], which first scales images to a certain range, we adopt images at the original resolution converted by PIT Gu et al. [2021] for a fair comparison with Oracle models. The model outputs (i.e., 2D centers) are reversed through inverse PIT to the corresponding positions on the plain images Gu et al. [2021] and then back-projected to 3D space by combining the predicted object depth with the camera intrinsic parameters Wang et al. [2021a, 2022a]. When conducting the systematic analysis introduced in Sec. 3.4, we first associate the predicted and ground-truth objects via maximum 3D IoU, and then replace each individual object size with the ground-truth value.
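A minimal sketch of the final back-projection step, assuming the predicted 2D center has already been mapped back to the plain image by inverse PIT.

```python
import numpy as np

def back_project(center_2d, depth, K):
    """Lift a predicted 2D object center and its metric depth to a 3D location
    using the camera intrinsics: x_3d = depth * K^{-1} [u, v, 1]^T."""
    u, v = center_2d
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray has unit z component
    return depth * ray
```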

4.3 Main Results

As shown in Tab. 2, we compare the proposed DGMono3D with Source Only, Oracle, and the SoTA unsupervised domain adaptation method for Mono3D (i.e., STMono3D Li et al. [2022]). Owing to the domain gap, the Source Only model cannot correctly localize 3D objects, and its mAP drops to almost 0% on the target domain. In contrast, benefiting from the CGP, our DGMono3D works on the target domain as normal, even though the images are captured by different camera devices. Specifically, DGMono3D improves the performance of Source Only models by a large margin: around 88% and 83% of the loose-AP performance gaps are closed on the NuScenes→KITTI and Lyft→KITTI settings, respectively. Notably, the AOE of DGMono3D is even slightly better than that of the Oracle on the Lyft→NuScenes setting, which indicates the effectiveness of our method.

Moreover, DGMono3D outperforms STMono3D even without training on data from the target domain. We highlight the strict metrics on the KITTI dataset. Although STMono3D successfully bridges a large part of the gaps on the loose metrics, its performance on strict metrics is hampered by the size gap, as discussed in Sec. 3.4. For DGMono3D, the size bias is alleviated by the proposed GCOS, improving the model performance on strict metrics by significant margins. To further validate the effectiveness of DGMono3D, we present qualitative comparisons in Fig. 5. With GCOS, the size bias is significantly alleviated, yielding tighter 3D object predictions, as shown in the figures.

Although Yang et al. [2021a]; Li et al. [2022] show that label annotations on the Lyft dataset are not as comprehensive as those of the KITTI and NuScenes datasets, our DGMono3D achieves satisfying results and avoids the drastic performance degradation suffered by STMono3D Li et al. [2022], demonstrating the better generalization of our method. Specifically, the large discrepancy of FOV between Lyft and the other datasets may cause such difficulty in cross-dataset inference, but our CGP can alleviate it via the embedded PIT Gu et al. [2021].

4.4 Ablation Studies and Discussions

We present ablation study results to demonstrate the effectiveness of each component in DGMono3D and further discuss the reasons behind the performance gains and drops. More results and discussions about the methods can be found in the supplementary material.

              (a) Oracle               (b) STMono3D Li et al. [2022]               (c) CGP             (d) GCOS
Figure 5: We present qualitative comparisons on the Nus→KITTI setting. Benefiting from CGP, DGMono3D fully considers the FOV and pixel-size discrepancies among images captured by different camera devices, achieving more generalized models with better performance than STMono3D Li et al. [2022] on the target domain. Moreover, with GCOS, the size bias is significantly alleviated, yielding tighter 3D object predictions. Zoom in for a clear comparison.
Nus→K
Domain / Method        PIT (w/ diff.)   PIT (w/o diff.)   MSPSD      Loose Metrics (Easy Mod. Hard)      Strict Metrics (Easy Mod. Hard)
Source Only 0 0 0 0 0 0
0.74 1.26 1.28 0.11 0.33 0.33
0.82 1.42 1.43 0.21 0.37 0.37
20.75 18.99 19.24 0.30 0.26 0.27
21.98 22.63 22.10 0.61 0.57 0.54
27.31 23.08 22.31 0.73 0.67 0.66
Target (Oracle) 34.78 27.97 26.59 15.80 13.61 13.16
30.16 23.13 21.49 10.60 6.12 5.54
29.19 25.87 24.08 8.75 5.70 5.01
27.18 22.94 20.72 12.09 11.03 10.59
35.85 27.68 24.54 10.49 7.86 7.05
Table 3: Effectiveness of our proposed camera-generalized paradigm, including the position-invariant transform (PIT) Gu et al. [2021] and multi-scale training with the pixel-size depth strategy (MSPSD) Li et al. [2022]. We further investigate each component and detailed training settings in a more fine-grained manner.

4.4.1 Effectiveness of the CGP

We investigate the effectiveness of CGP, including the position-invariant transform (PIT) and the multi-scale training with the pixel-size depth strategy (MSPSD), on both target and source domains. Detailed ablation results are presented in Tab. 3.

PIT. For the Oracle models, the PIT does not improve the model performance and even leads to a slight decline. We argue that the elimination of geometry cues caused by PIT hampers the performance improvement: the distortion removed by the PIT conversion, especially around image borders, can itself be useful information for Mono3D detectors to localize objects. Moreover, the results (w/ diff. vs. w/o diff.) also indicate that applying the different pixel sizes (Eq. 4, 5) on PIT-transformed images is essential for geometry consistency in Mono3D.

In terms of single-domain generalization (Source Only), installing the PIT alone cannot help Mono3D detectors localize objects in images captured by camera devices different from those of the source domain, since it cannot solve the depth-shift issue identified in Li et al. [2022]. However, when combined with MSPSD, CGP boosts the model performance on the target domain by a significant margin, indicating the effectiveness of our proposed CGP. One different observation on Source Only models is that the performance drops when we apply the different pixel sizes. One possible reason is the discrepancy of image resolutions presented in Tab. 1, which introduces additional domain gaps and hampers the model generalization. Therefore, we utilize the constant pixel size calculated from plain images when converting the pixel-size depth to metric depth.

Stat. Info.  Method        Nus→K: AP_BEV@0.7 (Easy Mod. Hard) | AP_3D@0.7 (Easy Mod. Hard)        Nus→Lyft: AP_BEV@0.7 | AP_3D@0.7
baseline 6.94 6.25 6.11 0.72 0.66 0.66 8.83 4.36
scale pred 15.85 14.71 14.50 0.99 0.94 0.92 8.55 4.19
gt rep. 16.36 15.13 14.95 10.71 10.32 10.21 9.52 5.11
stat. gcos 16.07 14.89 14.32 12.00 11.29 10.97 16.90 12.32
rand. gcos 11.30 10.70 10.48 1.07 1.02 1.02 12.07 7.82
Oracle 19.98 16.83 16.29 15.80 13.61 13.16 19.19 8.52
Table 4: Effectiveness of our proposed geometry-consistent object scaling strategy.

MSPSD. MSPSD plays a crucial role in domain generalization. As discussed in Li et al. [2022], it forces the Mono3D detectors to predict object depth via cues of the object's apparent area in perspective view. With the help of MSPSD, model generalization is improved significantly. However, during the training stage, other depth cues are inevitably weakened, leading to a slight decrease in performance for Oracle models.

4.4.2 Effectiveness of the GCOS

We choose the trained model equipped with CGP as the baseline for the ablation study of GCOS and compare the following strategies to investigate its effectiveness. Suppose the statistical size ratio of the target domain to the source domain is ρ_s. (1) scale pred.: directly multiply size predictions by ρ_s when inferring on the target domain. (2) gt rep.: replace the predicted sizes with the ground-truth values. (3) stat. gcos: apply GCOS with ρ_s. (4) rand. gcos: apply GCOS with a random scaling factor for the case where the statistical information is inaccessible. Moreover, we consider domain generalization in two different situations, with a large (e.g., Nus→K) and a small (e.g., Nus→Lyft) domain gap in size distributions, respectively.

As shown in Tab. 4, no matter how large the discrepancy in size distribution between the source and target domains is, directly applying scale pred. cannot improve the model performance on strict metrics. However, for the models fine-tuned with stat. gcos, the degradation on strict metrics is remarkably alleviated, which clearly demonstrates the effectiveness of our GCOS.

Furthermore, we also explore the condition where no statistical information about the target domain is available, which is a much more challenging setting. We adopt rand. gcos to flatten the predicted size distribution and improve the model generalization. In the tough Nus→K setting, while the strict AP_BEV at IoU 0.7 is improved by a large margin, indicating that GCOS with a random scaling factor is beneficial to model generalization, the AP_3D at IoU 0.7 is on par with the baseline model, which suggests a difficulty for Mono3D detectors in predicting the correct object height in cross-dataset inference. However, for the situation where the size gap is not oversized, which is a more common setting for real-world applications, rand. gcos successfully bridges size gaps by prominent margins, indicating that GCOS can make Mono3D detectors more generalized.

5 Conclusion

This paper presents DGMono3D, a carefully designed single-domain generalization framework tailored for the monocular 3D object detection task. We first combine the PIT Gu et al. [2021] and multi-scale training with the pixel-size depth strategy Li et al. [2022]; Park et al. [2021]; Chen et al. [2022] to construct a unified camera-generalized paradigm that fully considers the discrepancies of FOV and pixel size in cross-dataset inference on images captured by different camera devices. Then, we investigate the model performance degradation on strict quantitative metrics caused by the different distributions of object sizes on the source and target domains. To alleviate this issue, we propose the 2D-3D geometry-consistent object scaling strategy, which scales objects based on statistical information during the training stage. Extensive experimental results on three datasets demonstrate the effectiveness of DGMono3D, which can serve as a solid baseline for industrial applications and further research on domain adaptation for Mono3D.

References

  • G. Brazil and X. Liu (2019) M3d-rpn: monocular 3d region proposal network for object detection. In ICCV, pp. 9287–9296. Cited by: §2.
  • H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In CVPR, pp. 11621–11631. Cited by: Appendix B, §1, §1, §4.1, Table 1.
  • X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In CVPR, pp. 2147–2156. Cited by: §1.
  • X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3d object proposals for accurate object class detection. NeurIPS 28. Cited by: §2.
  • Y. Chen, S. Liu, X. Shen, and J. Jia (2020) Dsgn: deep stereo geometry network for 3d object detection. In CVPR, pp. 12536–12545. Cited by: §1.
  • Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In CVPR, pp. 3339–3348. Cited by: §2.
  • Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao (2022) Graph-detr3d: rethinking overlapping regions for multi-view 3d object detection. arXiv preprint arXiv:2204.11582. Cited by: §1, §1, §3.2, §5.
  • N. Dvornik, J. Mairal, and C. Schmid (2018) Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 364–380. Cited by: §2.
  • D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE international conference on computer vision, pp. 1301–1310. Cited by: §2, §3.4.2.
  • H. Fang, J. Sun, R. Wang, M. Gou, Y. Li, and C. Lu (2019) Instaboost: boosting instance segmentation via probability map guided copy-pasting. In ICCV, pp. 682–691. Cited by: §2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: Appendix B, §1, §1, §4.1, Table 1.
  • Q. Gu, Q. Zhou, M. Xu, Z. Feng, G. Cheng, X. Lu, J. Shi, and L. Ma (2021) Pit: position-invariant transform for cross-fov domain adaptation. In ICCV, pp. 8761–8770. Cited by: Figure 8, §C.1.2, §C.1.3, §1, §1, §3.2, §3.3, §4.2, §4.2, §4.3, Table 3, §5.
  • H. Hsu, C. Yao, Y. Tsai, W. Hung, H. Tseng, M. Singh, and M. Yang (2020) Progressive domain adaptation for object detection. In WACV, pp. 749–757. Cited by: §2.
  • K. Huang, T. Wu, H. Su, and W. H. Hsu (2022) MonoDTR: monocular 3d object detection with depth-aware transformer. arXiv preprint arXiv:2203.10981. Cited by: §1.
  • R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet (2019) Level 5 perception dataset 2020. Note: https://level-5.global/level5/data/ Cited by: Appendix B, §1, §1, §4.1, Table 1.
  • M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. In ICCV, pp. 480–490. Cited by: §2.
  • T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In CVPR, pp. 12456–12465. Cited by: §2.
  • A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In CVPR, pp. 12697–12705. Cited by: §1.
  • L. Li, K. Gao, J. Cao, Z. Huang, Y. Weng, X. Mi, Z. Yu, X. Li, and B. Xia (2021) Progressive domain expansion network for single domain generalization. In CVPR, pp. 224–233. Cited by: §2.
  • P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. In CVPR, pp. 7644–7652. Cited by: §1.
  • Z. Li, Z. Chen, A. Li, L. Fang, Q. Jiang, X. Liu, and J. Jiang (2022) Unsupervised domain adaptation for monocular 3d object detection via self-training. arXiv preprint arXiv:2204.11590. Cited by: §C.1.1, §C.1.1, §C.1.1, §1, §1, §1, §1, §2, §2, §3.2, §3.3, Figure 5, §4.1, §4.1, §4.1, §4.2, §4.2, §4.3, §4.3, §4.4.1, §4.4.1, Table 2, Table 3, §5.
  • Z. Liu, Z. Wu, and R. Tóth (2020a) Smoke: single-stage monocular 3d object detection via keypoint estimation. In CVPRW, pp. 996–997. Cited by: §1, §2.
  • Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai (2020b) Tanet: robust 3d object detection from point clouds with triple attention. In AAAI, Vol. 34, pp. 11677–11684. Cited by: §2.
  • Z. Luo, Z. Cai, C. Zhou, G. Zhang, H. Zhao, S. Yi, S. Lu, H. Li, S. Zhang, and Z. Liu (2021) Unsupervised domain adaptive 3d detection with multi-level consistency. In ICCV, pp. 8866–8875. Cited by: §1, §2, §3.4.1, §4.1.
  • A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3d bounding box estimation using deep learning and geometry. In CVPR, pp. 7074–7082. Cited by: §2.
  • D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon (2021) Is pseudo-lidar needed for monocular 3d object detection?. In ICCV, pp. 3142–3152. Cited by: §C.1.1, §C.1.1, §1, §1, §1, §2, §3.2, §3.3, §5.
  • F. Qiao, L. Zhao, and X. Peng (2020) Learning to learn single domain generalization. In CVPR, pp. 12556–12565. Cited by: §2.
  • T. Roddick, A. Kendall, and R. Cipolla (2018) Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188. Cited by: §2.
  • S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In CVPR, pp. 10529–10538. Cited by: §1.
  • J. Sun, L. Chen, Y. Xie, S. Zhang, Q. Jiang, X. Zhou, and H. Bao (2020) Disp r-cnn: stereo 3d object detection via shape prior guided instance disparity estimation. In CVPR, pp. 10548–10557. Cited by: §1.
  • M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §2.
  • T. Wang, Z. Xinge, J. Pang, and D. Lin (2022a) Probabilistic and geometric depth: detecting objects in perspective. In CoRL, pp. 1475–1485. Cited by: §2, §2, §4.2, §4.2.
  • T. Wang, X. Zhu, J. Pang, and D. Lin (2021a) Fcos3d: fully convolutional one-stage monocular 3d object detection. In ICCVW, pp. 913–922. Cited by: §1, §2, §2, §4.2, §4.2.
  • Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In CVPR, pp. 8445–8453. Cited by: §1.
  • Y. Wang, X. Chen, Y. You, L. Erran, B. Hariharan, M. Campbell, K. Q. Weinberger, and W. Chao (2020) Train in germany, test in the usa: making 3d object detectors generalize. In CVPR, pp. 11713–11723. Cited by: §C.2.1, §1, §3.4.2.
  • Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon (2022b) Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, pp. 180–191. Cited by: §2, §2.
  • Z. Wang, Y. Luo, R. Qiu, Z. Huang, and M. Baktashmotlagh (2021b) Learning to diversify for single domain generalization. In ICCV, pp. 834–843. Cited by: §2.
  • X. Weng and K. Kitani (2019) Monocular 3d object detection with pseudo-lidar point cloud. In ICCVW, pp. 0–0. Cited by: §2.
  • B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In CVPR, pp. 2345–2353. Cited by: §2.
  • Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2.
  • J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi (2021a) St3d: self-training for unsupervised domain adaptation on 3d object detection. In CVPR, pp. 10368–10378. Cited by: §C.2.1, §1, §2, §3.4.2, §4.1, §4.3.
  • J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi (2021b) ST3D++: denoised self-training for unsupervised domain adaptation on 3d object detection. arXiv preprint arXiv:2108.06682. Cited by: §2.
  • Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In ICCV, pp. 1951–1960. Cited by: §2.
  • R. Zhang, H. Qiu, T. Wang, X. Xu, Z. Guo, Y. Qiao, P. Gao, and H. Li (2022) MonoDETR: depth-aware transformer for monocular 3d object detection. arXiv preprint arXiv:2203.13310. Cited by: §1.
  • W. Zhang, W. Li, and D. Xu (2021) SRDAN: scale-aware and range-aware domain adaptation network for cross-dataset 3d object detection. In CVPR, pp. 6769–6779. Cited by: §2.
  • W. Zhang, Z. Wang, and C. C. Loy (2020) Exploring data augmentation for multi-modality 3d object detection. arXiv preprint arXiv:2012.12741. Cited by: §2, §3.4.2.
  • Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In CVPR, pp. 4490–4499. Cited by: §1.

Appendix A Proofs

A.1 Invariance of FOV in Image Scaling

Consider the formulation of the FOV:

FOV_x = 2 · arctan(W / (2 · f_x)),   FOV_y = 2 · arctan(H / (2 · f_y)),

where W and H are the image width and height, and f_x and f_y are the focal lengths along the horizontal and vertical directions, respectively.

When scaling images, we simultaneously operate on the image resolution and the camera intrinsic parameters as:

W' = r_x · W,   H' = r_y · H,   K' = diag(r_x, r_y, 1) · K,

where r_x and r_y are scaling factors and K is the camera intrinsic matrix.

After scaling the image with r_x and r_y, we re-calculate the FOV:

FOV'_x = 2 · arctan(W' / (2 · f'_x)) = 2 · arctan(r_x · W / (2 · r_x · f_x)) = FOV_x.

Similarly, the invariance of FOV_y can be proven. We add more discussion about the consequences of this invariance during multi-scale training (MS) in Sec. C.1.2.
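A quick numeric check of this invariance (the camera values are illustrative, not tied to a specific dataset).

```python
import numpy as np

def fov_deg(size_px: float, focal_px: float) -> float:
    """FOV = 2 * arctan(size / (2 * focal)), returned in degrees."""
    return np.degrees(2.0 * np.arctan(size_px / (2.0 * focal_px)))

W, H, fx, fy = 1600.0, 900.0, 1266.0, 1266.0   # illustrative values
r_x, r_y = 0.5, 0.75
# scaling both the image size and the focal length leaves the FOV unchanged
assert np.isclose(fov_deg(W, fx), fov_deg(r_x * W, r_x * fx))
assert np.isclose(fov_deg(H, fy), fov_deg(r_y * H, r_y * fy))
```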

A.2 Reversibility of MS and PIT

A.2.1 Vanilla CGP

We present the order in which MS is applied first and then PIT. The anchor pixel at coordinates (u, v) changes to (u_s, v_s) after scaling the image with scale factors r_x and r_y as:

(6)

The camera intrinsic parameter is changed simultaneously as:

(7)

Finally, we apply PIT to convert the pixel onto the spherical coordinate as

(8)

A.2.2 Advanced CGP

In this advanced version, we first apply PIT and then MS. The anchor pixel at coordinates (u, v) is converted onto spherical coordinates as

(9)

Then, we apply the MS on the scaled image and also change the camera intrinsic parameter as:

(10)

After that, we can calculate the corresponding position on the raw scaled plain image as

(11)

A.2.3 Invariance

We prove that the two orders yield identical coordinates, which demonstrates the commutativity of MS and PIT. Owing to symmetry, we present the proofs of two of the four equalities; the other two can be proven similarly.

The proof of the first equality:

(12)

Hence, the position of the anchor point on the scaled PIT-converted image is the same under both orders of operations.

The proof of the second equality:

(13)

Hence, the corresponding position of the anchor pixel on the scaled PIT-converted image is the same under both orders of operations.

Thus, the reversibility of MS and PIT is proven.

Nus→K                      Loose Metrics (AP_BEV@0.5 Easy Mod. Hard | AP_3D@0.5 Easy Mod. Hard)      Strict Metrics (AP_BEV@0.7 Easy Mod. Hard | AP_3D@0.7 Easy Mod. Hard)
Method
Source Only (SO) 0 0 0 0 0 0 0 0 0 0 0 0
Oracle 38.77 30.84 29.82 34.78 27.97 26.59 19.98 16.83 16.29 15.80 13.61 13.16
SO + PIT 34.65 29.44 28.90 23.08 24.82 22.31 6.94 6.25 6.11 0.72 0.66 0.66
SO + GCOS (w/o stat.) 33.35 27.99 27.28 25.11 21.52 20.53 11.30 10.70 10.48 1.07 1.02 1.02
SO + GCOS (w/ stat.) 34.22 28.99 27.82 28.77 24.82 23.67 16.07 14.89 14.32 12.00 11.29 10.97
L→K                        Loose Metrics (AP_BEV@0.5 Easy Mod. Hard | AP_3D@0.5 Easy Mod. Hard)      Strict Metrics (AP_BEV@0.7 Easy Mod. Hard | AP_3D@0.7 Easy Mod. Hard)
Method
Source Only (SO) 0 0 0 0 0 0 0 0 0 0 0 0
Oracle 38.77 30.84 29.82 34.78 27.97 26.59 19.98 16.83 16.29 15.80 13.61 13.16
SO + CGP 30.36 26.29 25.48 22.97 18.62 17.88 3.55 3.26 3.19 0.29 0.36 0.34
SO + GCOS (w/o stat.) 36.94 23.40 22.08 28.00 15.44 14.30 6.62 3.60 3.37 0.92 0.51 0.51
SO + GCOS (w/ stat.) 36.18 28.30 27.16 30.03 23.38 22.23 17.29 14.76 14.30 11.56 10.55 10.31
K→Nus                                                   K→Lyft
Method    AP   ATE   ASE   AOE                          AP_BEV@0.5   AP_3D@0.5   AP_BEV@0.7   AP_3D@0.7
Source Only (SO) 2.4 1.302 0.190 0.802 0 0 0 0
Oracle 28.2 0.798 0.160 0.209 33.00 29.74 19.19 8.52
SO + CGP 18.0 0.847 0.297 0.441 6.50 5.37 0.82 0.08
SO + GCOS (w/o stat.) 18.4 0.842 0.288 0.446 10.18 7.00 1.67 0.82
SO + GCOS (w/ stat.) 18.2 0.813 0.184 0.436 13.02 9.81 3.68 1.28
Lyft→Nus                                                Nus→Lyft
Method    AP   ATE   ASE   AOE                          AP_BEV@0.5   AP_3D@0.5   AP_BEV@0.7   AP_3D@0.7
Source Only (SO) 7.1 1.182 0.185 0.447 0 0 0 0
Oracle 28.2 0.798 0.160 0.209 33.00 29.74 19.19 8.52
SO + CGP 25.5 0.842 0.169 0.208 26.57 22.34 8.83 4.36
SO + GCOS (w/o stat.) 25.3 0.844 0.169 0.213 31.32 27.58 12.07 7.82
SO + GCOS (w/ stat.) 25.4 0.841 0.166 0.209 31.68 28.11 16.90 12.32
Table 5: Performance of DGMono3D on three source-target pairs. We present more fine-grained ablation results to demonstrate the effectiveness of each component proposed in this paper.

Appendix B Experimental Results

We present all the experimental results on three-pair datasets as shown in Tab. 5 and highlight several interesting observations:

(1) Based on the statistical information (specific values are presented in Tab. 1 of the paper), the distributions of object dimensions on NuScenes Caesar et al. [2020] and Lyft Kesten et al. [2019] are similar; in other words, their mean values are close. However, objects in the KITTI dataset Geiger et al. [2012] are much smaller, leading to a large and tricky domain gap. Therefore, while applying GCOS (w/o stat.) on the difficult settings (e.g., Nus→K) can obtain performance gains, there is still severe degradation on the strict metrics. When the dimension discrepancy is not so large (e.g., Nus→Lyft), GCOS (w/o stat.) can achieve satisfactory results without any information about the target domain. As for GCOS (w/ stat.), it works effectively on all these settings, bridging the large gaps caused by the dimension discrepancy.

(2) Different from the metrics on the KITTI and Lyft datasets, the AP on NuScenes is less affected by the accuracy of dimension predictions. In contrast, the average size error (ASE) specifically measures the accuracy of dimension predictions. Hence, GCOS mainly contributes to the ASE rather than the AP in cross-dataset inference on NuScenes.

Figure 6: Illustration of the multi-scale training and pixel-size depth.

Appendix C Discussions about CGP and GCOS

C.1 Camera-Generalized Paradigm

C.1.1 What Do Models Learn from Multi-Scale Training and Pixel-Size Depth?

As investigated in Li et al. [2022] and our paper, multi-scale training (MS) and the pixel-size depth play crucial roles in cross-domain inference. In Park et al. [2021], the authors claim that these make models camera-aware, but there is still a lack of explanation in the literature. Here, we aim to dig into what models learn from MS and the pixel-size depth.

As introduced in the paper, we replace the metric depth with pixel-size depth following Li et al. [2022]; Park et al. [2021]:

d = (c / s_p) · d_p.   (14)

In practice, we do not change the ground-truth metric depth d on scaled images, maintaining the invariance of the 3D space. When scaling the images and the camera intrinsic parameters, we imitate taking pictures with different cameras (i.e., intrinsic parameters). The model predicts d_p, which is post-processed by multiplying with c / s_p to obtain the metric-depth prediction.

However, it is hard to understand the function of MS because of the entanglement of the factor c / s_p and the pixel-size depth d_p. To better understand it, we can equivalently analyze the scaling of the metric depth d by the factor s_p / c, whose result is the depth we want the models to predict from the scaled image (i.e., the target depth). Reasonably, for the example illustrated in Fig. 6, when we shrink an image with a scale factor r < 1, the areas of objects are reduced. In perspective view, the scaled objects correspondingly appear farther from the camera. Since there is a negative correlation between the factor s_p / c and the scale factor r, shrinking the images with a smaller scale factor r means the metric depth is multiplied by a larger factor, yielding a more distant target depth. Therefore, the multi-scale training essentially reinforces the depth cue that more distant objects appear smaller in perspective view. Moreover, it also increases the diversity of the factor c / s_p, making the post-processing more generalized.
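A small numeric illustration of this effect under the notation assumed above (pixel size taken as 1 / f; all values are illustrative).

```python
# Assumed notation: pixel-size depth target d_p = d * s_p / c, with s_p = 1 / f.
# Shrinking the image by r scales f to r * f, so the target the network must
# regress grows by 1 / r -- the same object "looks" farther away, reinforcing
# the perspective depth cue discussed above.
d_metric, f, c, r = 40.0, 1266.0, 500.0, 0.5
d_p_full = d_metric * (1.0 / f) / c
d_p_small = d_metric * (1.0 / (r * f)) / c
print(d_p_small / d_p_full)   # -> 2.0, i.e., 1 / r
```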

Since the geometry structure of 3D space is invariant among datasets, the depth cues in perspective view (more distant objects appear smaller) learned from the source domain still apply on the target domain, thus yielding reasonable and correct predictions of d_p. Combined with the factor c / s_p, which can be calculated directly from the camera intrinsic parameters, Mono3D detectors can localize object depth on the target domain without the depth shift illustrated in Li et al. [2022]. If we directly adopt the metric depth as the regression target, the models still predict object depth based on similar depth cues and thus suffer from depth shift, since they have no notion of how different cameras affect object size in perspective view. In other words, benefiting from the dataset-invariant pixel-size depth and the explicit modeling of how camera parameters affect object size, Mono3D detectors can predict correct metric depth on the target domain.

    (a) 2D-3D geometry-consistent object scaling (b) Vanilla object scaling
Figure 7: Comparison of the 2D-3D geometry-consistent object scaling and vanilla object scaling.

C.1.2 Why Do We Need to Apply the Position-Invariant Transform?

As discussed in C.1.1, multi-scale training tries to imitate taking pictures with different cameras (i.e., intrinsic parameters). Therefore, equipped with the pixel-size depth, the model becomes aware of the discrepancy in camera parameters among datasets in cross-domain inference. However, since the multi-scale augmentation cannot change the camera FOV, as proven in A.1, the FOV gaps among different datasets can lead to performance degradation. Moreover, it is tough to generate images with different FOVs for training and to model the influence of the FOV discrepancy on Mono3D. To alleviate the potential domain gap raised by FOV, we adopt the position-invariant transform (PIT) Gu et al. [2021] to remove the distortion caused by FOV. The models then perform 3D detection on images without FOV distortion, avoiding the potential domain gap raised by FOV.

                 (a) PIT Gu et al. [2021]                                  (b) Different scale factors
Figure 8: Illustration of different scale factors for pixel-size depth.

C.1.3 Visualization of Different Scale Factors

In the paper, we have discussed the different depth scale factors in PIT-converted images (Eqs. 4 and 5). Here, we present a more intuitive description. As shown in Fig. 8a from Gu et al. [2021], pixels in the converted spherical coordinates occupy areas of different sizes on the plain raw images. Hence, the pixel size varies across the PIT-converted image. To keep geometry consistency, we apply spatially varying scale factors (shown in Fig. 8b) when converting the predicted pixel-size depth to the metric depth.

More intuitively, at the edges of the PIT-converted image, pixels are 'larger', covering more pixels on the plain raw image. Considering the explanation in C.1.1, objects in these areas look smaller compared to those in the center of the PIT-converted image. In perspective view, smaller means farther, so the model will predict a larger depth value. Recall that we do not scale the ground-truth value during the training stage. Therefore, it makes sense that we need to multiply the larger predicted depth by a smaller scale factor to bring the predicted metric depth closer to the ground truth.

C.2 2D-3D Geometry-Consistent Object Scaling

C.2.1 Why Do We Need 2D-3D Geometry Consistency?

In LiDAR-based methods Wang et al. [2020]; Yang et al. [2021a], the object scaling strategy is center-based, as shown in Fig. 7b, which means the BEV position of the object center is unchanged. This is reasonable when considering that LiDAR points are discretely distributed and the points reflected by an object are centered around its 3D center. However, for Mono3D, we aim to localize objects given monocular images. Directly applying the vanilla object scaling strategy leads to the ambiguity that the object size in the 2D perspective view is changed by the combined influence of 3D-object scaling and depth variation (best understood in BEV). In contrast, in our 2D-3D geometry-consistent object scaling strategy, we maintain the BEV position of the visible edges. Hence, the object size in the 2D perspective view is changed only by the 3D-object scaling. While the experimental results are on par, it makes sense to reduce potential factors that could lead to performance degradation.

Appendix D Limitation

We present more images augmented via 2D-3D GCOS in Fig. 9. However, owing to the dense regular arrangement of image pixels, applying the scaling strategy inevitably introduces artifacts and leaks localization information. We therefore have to utilize a two-stage fine-tuning strategy that trains only the dimension branch of the generalized models, avoiding hurting the model performance on the target domain. Moreover, for more challenging conditions where there are significant discrepancies in the distributions of object sizes on the target domain (i.e., Nus→KITTI), the statistical information for GCOS is necessary. How to reduce this dependence on statistical information is worth further research.

Figure 9: Images augmented by the 2D-3D GCOS.