Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-Training

04/25/2022
by   Zhenyu Li, et al.
Harbin Institute of Technology
USTC

Monocular 3D object detection (Mono3D) has achieved unprecedented success with the advent of deep learning techniques and emerging large-scale autonomous driving datasets. However, the drastic performance degradation in practical cross-domain deployment remains an under-studied challenge, owing to the lack of labels on the target domain. In this paper, we first comprehensively investigate the key underlying factor of the domain gap in Mono3D, where the critical observation is a depth-shift issue caused by the geometric misalignment of domains. Then, we propose STMono3D, a new self-teaching framework for unsupervised domain adaptation on Mono3D. To mitigate the depth-shift, we introduce a geometry-aligned multi-scale training strategy to disentangle the camera parameters and guarantee the geometry consistency of domains. Based on this, we develop a teacher-student paradigm to generate adaptive pseudo labels on the target domain. Benefiting from the end-to-end framework that provides richer information about the pseudo labels, we propose a quality-aware supervision strategy that takes instance-level pseudo confidences into account and improves the effectiveness of the target-domain training process. Moreover, a positive focusing training strategy and a dynamic threshold are proposed to handle the numerous false negative (FN) and false positive (FP) pseudo samples. STMono3D achieves remarkable performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection dataset. To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D.


1 Introduction

Monocular 3D object detection (Mono3D) aims to categorize and localize objects from single input RGB images. With the prevalent deployment of cameras on autonomous vehicles and mobile robots, this field has drawn increasing research attention. Recently, it has obtained remarkable advancements [7, 2, 41, 37, 36, 28, 29], driven by deep neural networks and large-scale human-annotated autonomous driving datasets [15, 3, 19].

However, 3D detectors developed on one specific dataset (i.e., the source domain) might suffer tremendous performance degradation when generalized to another dataset (i.e., the target domain) due to unavoidable domain gaps arising from different types of sensors, weather conditions, and geographical locations. In particular, as shown in Fig. 1, the severe depth-shift caused by different imaging devices leads to completely failed localization. As a result, a monocular detector trained on data collected in Singapore with NuScenes [3] cameras cannot work well (i.e., average precision drops to zero) when evaluated on data from European cities captured by KITTI [15] cameras. While collecting and training with more data from different domains could alleviate this problem, doing so is infeasible given the diversity of real-world scenarios and expensive annotation costs. Therefore, methods for effectively adapting a monocular 3D detector trained on a labeled source domain to a novel unlabeled target domain are highly demanded in practical applications. We call this task unsupervised domain adaptation (UDA) for monocular 3D object detection.

Figure 1: Depth-shift illustration. Panels: (a) camera view, (b) BEV view, (c) STMono3D. When inferring on the target domain, models can accurately locate objects in the 2D image but predict totally wrong object depth with tremendous shifts. Such unreliable predictions used as pseudo labels cannot improve but rather hurt model performance in STMono3D. GAMS guarantees geometry consistency and enables models to predict correct object depth. Best viewed in color: predictions and ground truth are in blue and orange; depth-shift is shown with green arrows.

While intensive UDA studies [11, 24, 18, 9, 31, 14] have been proposed in the 2D image setting, they mainly focus on handling lighting, color, and texture variations. In terms of Mono3D, however, since detectors must estimate the spatial information of objects from monocular RGB images, the geometry alignment of domains is much more crucial. Moreover, for UDA on LiDAR-based 3D detection [43, 44, 25, 45], the fundamental differences in data structures and network architectures render those approaches not readily applicable to this problem.

In this paper, we propose STMono3D for UDA on monocular 3D object detection. We first thoroughly investigate the depth-shift issue caused by the tight entanglement of models and camera parameters during the training stage: models can accurately locate objects in the 2D image but predict severely shifted object depth when inferring on the target domain. To alleviate this issue, we develop the geometry-aligned multi-scale (GAMS) training strategy, which guarantees the geometry consistency of domains and predicts pixel-size depth to overcome the inevitable misalignment and ambiguity. Hence, models can provide effective predictions on the unlabeled target domain. Building on this, we adopt the mean teacher [34] paradigm to facilitate the learning. The teacher model is essentially a temporal ensemble of student models, whose parameters are updated by an exponential moving average over student models of preceding iterations. It produces stable supervision for the student model without prior knowledge of the target domain.

Moreover, we observe that the Mono3D teacher model suffers from extremely low confidence scores and numerous failed predictions on the target domain. To handle these issues, we adopt the Quality-Aware Supervision (QAS), Positive Focusing Training (PFT), and Dynamic Threshold (DT) strategies. Benefiting from the flexibility of the end-to-end mean teacher framework, we utilize the reliability of each teacher-generated prediction to dynamically reweight the supervision loss of the student model. This takes instance-level qualities of pseudo labels into account and prevents low-quality samples from interfering with the training process. Since the backgrounds of domains are similar in the autonomous-driving Mono3D UDA setting, we ignore negative samples and utilize only positive pseudo labels to train the model, which prevents the excessive FN pseudo labels at the beginning of training from impairing the model's capability to recognize objects. In synchronization with training, we utilize a dynamic threshold to adjust the filter score, which stabilizes the growth of the number of pseudo labels.

To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D. Experimental results on the 3D object detection datasets KITTI [15], NuScenes [3], and Lyft [19] demonstrate the effectiveness of our proposed methods: the performance gaps between source-only results and fully supervised oracle results are closed by a large margin. Notably, STMono3D even outperforms the oracle results under the NuScenes→KITTI setting. Our code is available at https://github.com/zhyever/STMono3D.

2 Related Work

2.1 Monocular 3D Object Detection

Mono3D has drawn increasing attention in recent years [8, 41, 27, 30, 40, 2, 39, 37, 28, 36]. Earlier work utilizes sub-networks to assist 3D detection. For instance, 3DOP [8] and MLFusion [41] use a depth estimation network, while Deep3DBox [27] uses a 2D object detector. Another line of research makes efforts to convert the RGB input to 3D representations, such as OFTNet [30] and Pseudo-Lidar [40]. While these methods have shown promising performance, they rely on the design and performance of sub-networks or dense depth labels. Hence, some methods propose to design the framework in an end-to-end manner like 2D detection. M3D-RPN [2] implements a single-stage multi-class detector with a region proposal network and depth-aware convolution. SMOKE [23] proposes a neat framework to predict 3D objects without generating 2D proposals. DETR3D [39] develops a DETR-like [4] bbox head, where 3D objects are predicted by independent queries in a set-to-set manner. DD3D [28] further investigates the influence of a pre-trained monocular depth estimation network, finding that depth estimation plays a crucial part in Mono3D. In this paper, we mainly conduct UDA experiments based on FCOS3D [37], a neat and representative Mono3D paradigm that keeps the well-developed designs for 2D feature extraction and is adapted to this 3D task with only basic designs for specific 3D detection targets.

2.2 Unsupervised Domain Adaptation

UDA aims to generalize a model trained on a source domain to unlabeled target domains. So far, numerous methods have been proposed for various computer vision tasks [11, 24, 18, 9, 31, 14, 47] (e.g., recognition, detection, segmentation). Some methods [26, 33, 5] employ statistic-based metrics to model the differences between two domains. Other approaches [32, 46, 20] utilize the self-training strategy to generate pseudo labels for unlabeled target domains. Moreover, inspired by Generative Adversarial Networks (GANs) [16], adversarial learning has been employed to align feature distributions [35, 12, 13], which can be explained as minimizing the H-divergence [1] or the Jensen-Shannon divergence [17] between two domains. [22, 38] alleviate the domain shift in batch normalization layers by modulating the BN statistics before evaluation or by specializing the BN parameters domain by domain. Most of these domain adaptation approaches are designed for general 2D image recognition tasks, and direct adoption of these techniques for the large-scale monocular 3D object detection task may not work well due to the distinct characteristics of Mono3D, especially targets in 3D spatial coordinates.

In terms of 3D object detection, [45, 43, 25] investigate UDA strategies for LiDAR-based detectors. SRDAN [45] adopts adversarial losses to align features and instances with similar scales between two domains. ST3D [43] and MLC-Net [25] develop self-training strategies with delicate designs, such as random object scaling, a triplet memory bank, and multi-level alignment, for domain adaptation. Following the successful trend of UDA on LiDAR-based 3D object detection, we investigate effective self-training strategies for Mono3D. To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D.

3 STMono3D

In this section, we first formulate the UDA task on Mono3D (Sec. 3.1), and present an overview of our framework (Sec. 3.2), followed by the Self-Teacher with Temporal Ensemble paradigm (Sec. 3.3). Then, we explain the details of the Geometry-Aligned Multi-Scale Training (GAMS, Sec. 3.4), the Quality-Aware Supervision (QAS, Sec. 3.5), and some other crucial training strategies consisting of Positive Focusing Training (PFT) and Dynamic Threshold (DT) (Sec. 3.6).

3.1 Problem Definition

Under the unsupervised domain adaptation setting, we have access to $N_s$ labeled images $\{(x^s_i, K^s_i, y^s_i)\}_{i=1}^{N_s}$ from the source domain $\mathcal{D}_s$ and $N_t$ unlabeled images $\{(x^t_i, K^t_i)\}_{i=1}^{N_t}$ from the target domain $\mathcal{D}_t$, where $N_s$ and $N_t$ are the numbers of samples from the source and target domains, respectively. Each 2D image $x_i$ is paired with a camera parameter $K_i$ that projects points in 3D space onto the 2D image plane, while $y^s_i$ denotes the label of the corresponding training sample in the specific camera coordinate frame of the source domain. A label $y$ consists of the object class $c$, location $(x, y, z)$, size $(h, w, l)$ in each dimension, and orientation $\theta$. We aim to train models with $\mathcal{D}_s$ and the unlabeled $\mathcal{D}_t$, and avoid performance degradation when evaluating on the target domain.

Figure 2: Framework overview. STMono3D leverages the mean-teacher [34] paradigm, where the teacher model is the exponential moving average of the student model, updated at each iteration. We design GAMS (Sec. 3.4) to alleviate the severe depth-shift in cross-domain inference and ensure the availability of pseudo labels predicted by the teacher model. QAS (Sec. 3.5) is a simple soft-teacher approach that leverages richer information from the teacher model to reweight losses and provide quality-aware supervision of the student model. PFT and DT are two further crucial training strategies, presented in Sec. 3.6.

3.2 Framework Overview

We illustrate our STMono3D in Fig. 2. The labeled source domain data is utilized for supervised training of the student model with a loss $\mathcal{L}_{src}$. In terms of the unlabeled target domain data, we first perturb it by applying a strong random augmentation. Before being passed to the models, both the target and source domain inputs are further augmented by the GAMS strategy of Sec. 3.4, where images and camera intrinsic parameters are cautiously aligned via simultaneous rescaling. Subsequently, the original and perturbed images are sent to the teacher and student models, respectively, where the teacher model generates intuitively reasonable pseudo labels and supervises the student model via the loss $\mathcal{L}_{tgt}$ on the target domain:

$$\mathcal{L}_{tgt} = \mathcal{L}^{tgt}_{reg} + \mathcal{L}^{tgt}_{cls} \qquad (1)$$

where $\mathcal{L}^{tgt}_{reg}$ and $\mathcal{L}^{tgt}_{cls}$ are the regression loss and classification loss, respectively. Here, we adopt the QAS strategy of Sec. 3.5 to further leverage richer information from the teacher model by instance-wise reweighting of $\mathcal{L}^{tgt}_{cls}$. In each iteration, the student model is updated through gradient descent with the total loss $\mathcal{L}$, a linear combination of $\mathcal{L}_{src}$ and $\mathcal{L}_{tgt}$:

$$\mathcal{L} = \mathcal{L}_{src} + \lambda \, \mathcal{L}_{tgt} \qquad (2)$$

where $\lambda$ is the weight coefficient. Then, the teacher model parameters are updated from the corresponding parameters of the student model, as detailed in Sec. 3.3. Moreover, we observe that the teacher model suffers from numerous FN and FP pseudo labels on the target domain. To handle this issue, we utilize the PFT and DT strategies illustrated in Sec. 3.6.
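For concreteness, a minimal PyTorch-style sketch of one training iteration follows; `gams_aug` and `strong_aug` stand for the augmentations described above, and the mmdet3d-like `loss(...)`/`predict(...)` interfaces of `student`/`teacher` are assumptions for illustration, not the released code:

```python
import torch

def train_step(student, teacher, src_batch, tgt_batch, optimizer,
               strong_aug, gams_aug, score_thr, lam=1.0):
    """One STMono3D-style iteration (sketch)."""
    # Source domain: supervised loss L_src on labeled, geometry-aligned inputs.
    src_imgs, src_cams, src_labels = gams_aug(*src_batch)
    loss_src = student.loss(src_imgs, src_cams, src_labels)

    # Target domain: the teacher sees the geometry-aligned image, the student a
    # strongly perturbed version of the same image.
    tgt_imgs, tgt_cams = gams_aug(*tgt_batch)
    with torch.no_grad():
        pseudo = teacher.predict(tgt_imgs, tgt_cams)          # 3D boxes + scores
    pseudo = [p for p in pseudo if p["score"] >= score_thr]   # filtering (Sec. 3.6)
    loss_tgt = student.loss(strong_aug(tgt_imgs), tgt_cams, pseudo)

    loss = loss_src + lam * loss_tgt                          # Eq. (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

After each such step, the teacher parameters are updated by the EMA of Eq. (3), as detailed in Sec. 3.3.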

3.3 Self-Teacher with Temporal Ensemble

Following the successful trend of the mean teacher paradigm [34] in semi-supervised learning, we adapt it to our Mono3D UDA task as illustrated in Fig. 2. The teacher model and the student model share the same network architecture but have different parameters $\theta_t$ and $\theta_s$, respectively. During training, the parameters of the teacher model are updated by taking the exponential moving average (EMA) of the student parameters:

$$\theta_t \leftarrow \alpha \, \theta_t + (1 - \alpha) \, \theta_s \qquad (3)$$

where $\alpha$ is the momentum, commonly set close to 1, e.g., 0.999 in our experiments. Moreover, the input of the student model is perturbed by a strong augmentation, which ensures that the pseudo labels generated by the teacher model are more accurate than the student model predictions, thus providing usable optimization directions for the parameter update. In addition, the strong augmentation improves the generalization of the model to inputs from different domains. Hence, by supervising the student model with pseudo targets generated by the teacher model (i.e., forcing consistency between the predictions of the student and the teacher), the student can learn domain-invariant representations and adapt to the unlabeled target domain. Fig. 4 shows that the teacher model provides effective supervision to the student model, and Tab. 4 and Tab. 5 demonstrate the effectiveness of the mean teacher paradigm.
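A minimal sketch of the EMA update of Eq. (3); copying normalization buffers directly from the student is a common implementation choice we assume here:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999):
    """theta_t <- momentum * theta_t + (1 - momentum) * theta_s (Eq. 3)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)                  # e.g., BatchNorm running statistics
```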

3.4 Geometry-Aligned Multi-Scale Training

3.4.1 Observation.

As shown in Fig. 1, depth-shift drastically harms the quality of pseudo labels on the target domain. It is mainly caused by the domain-specific geometry correspondence between 3D objects and images (i.e., the camera imaging process). For instance, since the pixel size (defined in Eq. 6) of the KITTI dataset is larger than that of the NuScenes dataset, objects in images captured by KITTI cameras appear smaller than in NuScenes ones. While the model can predict accurate 2D locations on the image plane, it tends to estimate a larger object depth, following the perspective cue that farther objects appear smaller. We call this phenomenon depth-shift: models localize the accurate 2D location but predict depth with tremendous shifts on the target domain. To mitigate it, we propose a straightforward yet effective augmentation strategy, geometry-aligned multi-scale training, which fully leverages the geometry consistency of the imaging process.
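As a rough numerical illustration of this cue (assuming representative focal lengths of roughly 721 px for KITTI and 1266 px for the NuScenes front camera, and a car of height $H = 1.5$ m at depth $d = 30$ m; the exact values vary per sequence):

$$h_{px} = \frac{f H}{d}: \qquad h^{KITTI}_{px} = \frac{721 \times 1.5}{30} \approx 36\ \text{px}, \qquad h^{Nus}_{px} = \frac{1266 \times 1.5}{30} \approx 63\ \text{px}$$

A model that has implicitly learned the NuScenes mapping therefore reads a 36 px car as lying at $d \approx 1266 \times 1.5 / 36 \approx 53$ m rather than 30 m, which is precisely the depth-shift of Fig. 1.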

3.4.2 Method.

Given the source input $(x^s, K^s)$ and the target input $(x^t, K^t)$, a naive geometry-aligned strategy is to rescale camera parameters to the same constant values and resize images correspondingly:

$$f'_x = r_x f_x, \quad c'_x = r_x c_x, \qquad f'_y = r_y f_y, \quad c'_y = r_y c_y \qquad (4)$$

where $r_x$ and $r_y$ are the resize rates, $f$ and $c$ are the focal length and optical center, and the subscripts $x$ and $y$ indicate the image coordinate axes, respectively. However, since the FOV cannot be changed by resizing, it is impracticable to strictly align the geometry correspondences of 3D objects and images between different domains via such convenient transformations. The inevitable discrepancy and ambiguity lead to a failure of UDA.

To solve this issue, motivated by DD3D [28], we propose to predict the pixel-size depth $d_p$ instead of the metric depth $d$:

$$d = \frac{c}{s} \, d_p \qquad (5)$$

$$s = \frac{1}{\sqrt{f_x f_y}} \qquad (6)$$

where $s$ and $c$ are the pixel size and a constant, and $d_p$ is the model prediction, which is scaled to the final result $d$. Hence, while there are inevitable discrepancies between the aligned geometry correspondences of the two domains, the model can infer depth from the pixel size and is more robust to varying imaging processes. Moreover, we rescale camera parameters into a multi-scale range, instead of to the same constant values, and resize images correspondingly to enhance the dynamics of the model. During training, we keep ground-truth 3D bounding boxes and pseudo labels unchanged, but modify camera parameters and image resolutions simultaneously.
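A minimal sketch of GAMS under the reconstruction above (the `cv2.resize` call and the exact pixel-size definition in Eq. (6) are our assumptions; the released code may differ):

```python
import numpy as np
import cv2

def geometry_aligned_resize(img: np.ndarray, K: np.ndarray, r_x: float, r_y: float):
    """Resize the image and rescale the intrinsics consistently (Eq. 4).
    In multi-scale training, r_x and r_y are sampled from a range per iteration."""
    K = K.copy()
    K[0, 0] *= r_x; K[0, 2] *= r_x      # f_x, c_x
    K[1, 1] *= r_y; K[1, 2] *= r_y      # f_y, c_y
    h, w = img.shape[:2]
    img = cv2.resize(img, (int(round(w * r_x)), int(round(h * r_y))))
    return img, K

def pixel_size(K: np.ndarray) -> float:
    """Pixel size s (Eq. 6, as reconstructed): inversely proportional to focal length."""
    return 1.0 / np.sqrt(K[0, 0] * K[1, 1])

def metric_depth(d_p: float, K: np.ndarray, c: float = 1.0) -> float:
    """Scale the predicted pixel-size depth d_p back to metric depth d (Eq. 5)."""
    return c * d_p / pixel_size(K)
```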

3.5 Quality-Aware Supervision

3.5.1 Observation.

The cross-domain performance of the detector highly depends on the quality of the pseudo labels. In practice, we have to utilize a high threshold on the foreground score to filter out most false positive (FP) box candidates with low confidence. However, unlike teacher models that detect objects with high confidence in semi-supervised 2D detection or in UDA of LiDAR-based 3D detectors (e.g., the threshold is set to 90% and 70% in [42] and [43], respectively), we find that the Mono3D cross-domain teacher suffers from much lower confidence, as shown in Fig. 3. This is another phenomenon unique to Mono3D UDA, caused by the much worse oracle performance of monocular 3D detection compared with 2D detection and LiDAR-based 3D detection. It indicates that even though a prediction's confidence surpasses the threshold, we cannot ensure the sample quality, especially for samples near the threshold. To alleviate the impact, we propose quality-aware supervision (QAS) to leverage richer information from the teacher and take instance-level quality into account.

3.5.2 Method.

Thanks to the flexibility of the end-to-end mean teacher framework, we assess the reliability of each teacher-generated bbox as a real foreground, which is then used to weight the foreground classification loss of the student model. Given the foreground bounding box set $\mathcal{B}^{fg} = \{b^{fg}_i\}_{i=1}^{N^{fg}}$, the classification loss of the unlabeled images on the target domain is defined as:

$$\mathcal{L}^{tgt}_{cls} = \frac{1}{N^{fg}} \sum_{i=1}^{N^{fg}} s_i^{\gamma} \, \ell_{cls}\!\left(b^{fg}_i, c^{*}_i\right) \qquad (7)$$

where $\{c^{*}_i\}$ denotes the set of pseudo class labels, $\ell_{cls}$ is the box classification loss, $s_i$ is the confidence score of the $i$-th foreground pseudo box, $N^{fg}$ is the number of foreground pseudo boxes, and $\gamma$ is a constant hyperparameter.

QAS resembles a simple positive mining strategy; it is intuitively reasonable that pseudo labels with higher confidence should exert a stronger training signal. Moreover, compared with semi-supervised and supervised tasks that focus on easy/hard negative samples [42, 6], it is more critical for UDA Mono3D models to prevent the harmful influence of low-quality pseudo labels near the threshold. Such an instance-level weighting strategy balances the loss terms based on foreground confidence scores and significantly improves the effectiveness of STMono3D.
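A minimal sketch of this reweighting, under our reconstruction of Eq. (7); the per-box classification losses `loss_cls`, teacher confidences `scores`, and the power-form weighting are assumptions for illustration:

```python
import torch

def qas_cls_loss(loss_cls: torch.Tensor, scores: torch.Tensor,
                 gamma: float = 1.0) -> torch.Tensor:
    """Quality-aware classification loss: per-box losses are reweighted by the
    teacher's (detached) confidence, so higher-confidence pseudo boxes supervise
    the student more strongly. Shapes: loss_cls [N_fg], scores [N_fg]."""
    weights = scores.detach().clamp(min=0.0, max=1.0) ** gamma
    n_fg = max(loss_cls.numel(), 1)
    return (weights * loss_cls).sum() / n_fg
```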

Figure 3: (a) Correlation between confidence value and box IoU with ground truth. (b) Distribution of confidence scores; the teacher suffers from low scores on the target domain. (c) Distribution of IoU between ground truth and pseudo labels near the threshold (0.35-0.4); we highlight the existence of numerous low-quality and FP samples among these pseudo labels.

3.6 Crucial Training Strategies

3.6.1 Positive Focusing Training.

Since the whole of STMono3D is trained in an end-to-end manner, the teacher model can hardly detect objects with confidence scores higher than the threshold at the start of training, and the resulting flood of FN pseudo samples impairs the capability of the model to recognize objects. Because the backgrounds of different domains are similar, with negligible domain gaps in Mono3D UDA (e.g., street, sky, and houses), we propose the positive focusing training strategy: for $\mathcal{L}^{tgt}_{cls}$, we discard negative background pseudo labels and utilize only the positive samples to supervise the student model, which ensures that the model does not collapse by overfitting to the FN pseudo labels during training.
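A minimal sketch of this masking, assuming a per-location classification loss map and a foreground mask derived from the teacher's positive pseudo boxes (names are illustrative, not from the released code):

```python
import torch

def pft_cls_loss(cls_loss_map: torch.Tensor, fg_mask: torch.Tensor) -> torch.Tensor:
    """Positive focusing: only locations assigned to positive pseudo labels
    contribute to the target-domain classification loss; background locations
    (potential false negatives of the teacher) are ignored."""
    fg_mask = fg_mask.bool()
    if not fg_mask.any():          # early training: teacher yields no positives yet
        return cls_loss_map.sum() * 0.0
    return cls_loss_map[fg_mask].mean()
```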

3.6.2 Dynamic Threshold.

In practice, we find that the mean confidence score of pseudo labels gradually increases over the course of training. Increasing numbers of false positive (FP) samples appear in the middle and late stages of training, which severely hurts model performance. While the QAS strategy proposed in Sec. 3.5 can reduce the negative impact of low-quality pseudo labels, completely wrong predictions still introduce unavoidable noise into the training process. To alleviate this issue, we propose a simple progressively increasing threshold strategy that dynamically changes the threshold:

$$\tau(t) = \begin{cases} \tau_0, & t < t_s \\ \tau_0 + k\,(t - t_s), & t_s \le t < t_e \\ \tau_0 + k\,(t_e - t_s), & t \ge t_e \end{cases} \qquad (8)$$

where $\tau_0$ is the base threshold, set to 0.35 in our experiments based on the statistics in Fig. 3(a); $k$ is the slope of the threshold increase; and $t$ is the training iteration. The threshold is fixed at the base value during the first warm-up steps ($t < t_s$), as the teacher model can hardly detect objects with confidence scores higher than the base threshold. It then increases linearly once the teacher model begins to predict FP pseudo labels, to avoid the model being blemished by the growing number of failure predictions. Finally, since the increase of the average score saturates, the threshold is fixed at the end of the training stage ($t \ge t_e$) to guarantee a sufficient number of pseudo labels.
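As a sketch, Eq. (8) amounts to a clamped linear schedule; the start/stop iterations below follow Tab. 8 (880 × 8 and 880 × 10), while the per-iteration slope is illustrative, since Tab. 8 specifies it per update step:

```python
def dynamic_threshold(t: int, tau0: float = 0.35, k: float = 5e-3,
                      t_start: int = 880 * 8, t_stop: int = 880 * 10) -> float:
    """Dynamic pseudo-label score threshold of Eq. (8): fixed at the base value
    tau0 during warm-up, increasing linearly with slope k, then fixed again."""
    if t < t_start:
        return tau0
    if t < t_stop:
        return tau0 + k * (t - t_start)
    return tau0 + k * (t_stop - t_start)
```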

Dataset       Size   Anno.   Loc.      Shape        FOV      Objects  Night
KITTI [15]    3712   17297   EUR.      (375,1242)   (29,81)  8        No
NuScenes [3]  27522  252427  SG.,EUR.  (900,1600)   (39,65)  23       Yes
Lyft [19]     21623  139793  SG.,EUR.  (1024,1224)  (60,70)  9        No
Table 1: Dataset overview. We focus on properties related to frontal-view cameras and 3D object detection. The dataset size refers to the number of images used in the training stage. For NuScenes and Lyft, we subsample the data. See text for details.

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

We conduct experiments on three widely used autonomous driving datasets: KITTI [15], NuScenes [3], and Lyft [19]. Our experiments cover two aspects: adaptation across domains with different cameras (present in all source-target pairs) and adaptation from label-rich domains to label-insufficient domains (i.e., NuScenes→KITTI). We summarize the dataset information in Tab. 1 and present more visual comparisons in the supplementary material.

4.1.2 Comparison Methods.

In our experiments, we compare STMono3D with three baselines: (i) Source Only, directly evaluating the source-domain-trained model on the target domain; (ii) Oracle, the fully supervised model trained on the target domain; and (iii) Naive ST (with GAMS), the basic self-training method, for which we first train a model (with GAMS) on the source domain, then generate pseudo labels for the target domain, and finally fine-tune the trained model on the target domain.

4.1.3 Evaluation Metric.

We adopt the KITTI evaluation metric for NuScenes→KITTI and Lyft→KITTI, and the NuScenes metric for Lyft→NuScenes. We focus on the commonly used car category in our experiments. For Lyft→NuScenes, we evaluate models on the full ring view, which is more useful in real-world applications. For KITTI, we report the average precision (AP) with an IoU threshold of 0.5 for both bird's eye view (BEV) IoU and 3D IoU. For NuScenes, since the attribute labels differ from those of the source domain (i.e., Lyft), we discard the mean average attribute error (mAAE) and report the mean average translation error (mATE), scale error (mASE), orientation error (mAOE), and mean average precision (mAP). Following [43], we report the closed performance gap from Source Only to Oracle.

4.1.4 Implementation Details.

We validate our proposed STMono3D on the detection backbone FCOS3D [37]. Since no modification to the model is required, our method can be adapted to other Mono3D backbones as well. We implement STMono3D based on the popular 3D object detection codebase MMDetection3D [10]. We utilize the SGD [21] optimizer; gradient clipping and a warm-up policy are exploited with a learning rate of 0.002, 500 warm-up iterations, a warm-up ratio of 0.33, and a batch size of 32 on 8 Tesla V100s. The loss weight $\lambda$ between domains in Eq. 2 is set to 1. We apply a momentum $\alpha$ of 0.999 in Eq. 3, following most mean teacher paradigms [25, 42]. As for the strong augmentation, we adopt widely used image data augmentations, including random flipping, random erasing, random toning, etc. We subsample the NuScenes and Lyft datasets during training for simplicity. Notably, unlike the mean teacher paradigm or the self-training strategy used in UDA of LiDAR-based 3D detectors [25, 43], our STMono3D is trained in a totally end-to-end manner.

4.2 Main Results

As shown in Tab. 2, we compare the performance of our STMono3D with Source Only and Oracle. Our method outperforms the Source Only baseline under all evaluated UDA settings. Due to the domain gap, the Source Only model can barely detect 3D objects, with mAP dropping almost to 0%. In contrast, STMono3D improves performance on the NuScenes→KITTI and Lyft→KITTI tasks by a large margin, closing around 110%/67% of the respective performance gaps. Notably, STMono3D even surpasses the Oracle results on several NuScenes→KITTI metrics, which indicates the effectiveness of our method. Furthermore, when transferring Lyft models to a domain with full ring-view annotations for evaluation (i.e., Lyft→NuScenes), our STMono3D also attains a considerable performance gain, closing the gap between Oracle and Source Only by up to 66%. These encouraging results validate that our method can effectively adapt 3D object detectors to the target domain.

Nus→K
Method        Easy    Mod.    Hard   | Easy   Mod.    Hard   | Easy   Mod.   Hard  | Easy   Mod.   Hard
Source Only   0       0       0      | 0      0       0      | 0      0      0     | 0      0      0
Oracle        33.46   23.62   22.18  | 29.01  19.88   17.17  | 33.70  23.22  20.68 | 28.33  18.97  16.57
STMono3D      35.63   27.37   23.95  | 28.65  21.89   19.55  | 31.85  22.82  19.30 | 24.00  16.85  13.66
Closed Gap    106.5%  115.8%  107.9% | 98.7%  110.1%  113.8% | 94.5%  98.2%  93.3% | 84.7%  88.8%  82.4%

L→K                                                            L→Nus           Metrics
Method        Easy    Mod.    Hard   | Easy   Mod.    Hard     Method        AP     ATE    ASE    AOE
Source Only   0       0       0      | 0      0       0        Source Only   2.40   1.302  0.190  0.802
Oracle        33.46   23.62   22.18  | 29.01  19.88   17.17    Oracle        28.2   0.798  0.160  0.209
STMono3D      26.46   20.71   17.66  | 18.14  13.32   11.83    STMono3D      21.3   0.911  0.170  0.355
Closed Gap    79.0%   87.6%   79.6%  | 62.5%  67.0%   68.8%    Closed Gap    73.2%  77.5%  66.7%  82.9%
Table 2: Performance of STMono3D on three source-target pairs. We report results for the car category as well as the domain gap closed by STMono3D. In Nus→KITTI, STMono3D achieves even better results than the Oracle model on several metrics, which demonstrates the effectiveness of our proposed method.
Nus→K
GAMS   Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard
✗      0      0      0     | 0      0      0     | 0      0      0     | 0      0      0
✓      35.63  27.37  23.95 | 28.65  21.89  19.55 | 31.85  22.82  19.30 | 24.00  16.85  13.66
Table 3: Ablation study of the geometry-aligned multi-scale training.

4.3 Ablation Studies and Analysis

In this section, we conduct extensive ablation experiments to investigate the individual components of our STMono3D. All experiments are conducted on the NuScenes→KITTI task.

4.3.1 Effectiveness of Geometry-Aligned Multi-Scale Training.

We study the effects of GAMS in the mean teacher paradigm of STMono3D and in the Naive ST pipeline. Tab. 3 reports the experimental results when GAMS is disabled. Due to the depth-shift analyzed in Sec. 3.4, the teacher model generates incorrect pseudo labels on the target domain, leading to a severe drop in model performance. Furthermore, as shown in Tab. 4, GAMS is crucial for effective Naive ST as well. This is reasonable, as GAMS enables the model trained on the source domain to generate valid pseudo labels on the target domain, making the fine-tuning stage helpful for model performance. We present pseudo labels predicted by the teacher model of STMono3D in Fig. 1, which shows that the depth-shift is well alleviated. All of these results highlight the importance of GAMS for effective Mono3D UDA.

Nus→K                KITTI                                      NuScenes Metrics
Method               Easy   Mod.   Hard  | Easy   Mod.   Hard  | AP    ATE    ASE    AOE
Naive ST             0      0      0     | 0      0      0     | 0     0      0      0
Naive ST with GAMS   9.05   9.08   8.82  | 3.72   3.69   3.58  | 14.0  0.906  0.164  0.264
STMono3D             35.63  27.37  23.95 | 28.65  21.89  19.55 | 36.5  0.731  0.160  0.167
Table 4: Comparison of different self-training paradigms.
Nus→K
EMA    Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard
✗      2.55   2.41   2.38  | 0.82   0.82   0.82  | 0.45   0.31   0.25  | 0.06   0.03   0.02
✓      35.63  27.37  23.95 | 28.65  21.89  19.55 | 31.85  22.82  19.30 | 24.00  16.85  13.66
Table 5: Ablation study of the exponential moving average strategy.
Nus→K
Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard
26.33  21.92  19.57 | 21.17  18.14  16.46 | 21.66  16.64  14.03 | 15.55  12.06  9.88
21.50  17.57  15.35 | 16.57  13.80  11.34 | 20.47  15.77  13.12 | 15.32  11.69  9.35
35.63  27.37  23.95 | 28.65  21.89  19.55 | 31.85  22.82  19.30 | 24.00  16.85  13.66
21.74  19.56  17.22 | 18.09  15.67  14.71 | 16.01  13.26  11.15 | 10.89  9.22   7.49
Table 6: Ablation study of QAS on different loss terms.
Figure 4: Performance comparison. (a) Oracle vs. Source Only with GAMS: while the Oracle performance progressively improves, the Source Only model suffers from drastic performance fluctuations. (b) Mean teacher vs. student on the target domain: not only does the teacher model outperform the student at the end of the training phase, its performance curve is also smoother and more stable.

4.3.2 Comparison of Self-Training Paradigms.

We compare our STMono3D with the commonly used self-training paradigm (i.e., Naive ST) in Tab. 4. While GAMS helps the Naive ST teacher generate effective pseudo labels on the target domain and boosts UDA performance, our STMono3D still outperforms it by a significant margin. One of the primary concerns is the low quality of pseudo labels caused by the domain gap. Moreover, as shown in Fig. 4(a), while the performance of the Oracle improves progressively, the Source Only model suffers from performance fluctuations on the target domain. It is therefore also troublesome to choose a specific, suitable model from the intermediate results to generate pseudo labels for the student model.

In terms of our STMono3D, the whole framework is trained in an end-to-end manner, and the teacher is a temporal ensemble of student models at different time stamps. Fig. 4(b) shows that our teacher model is much more stable than its counterpart in Naive ST and performs better than the student model at the end of the training phase, when the teacher model starts to generate more predictions above the filtering score threshold. This validates our analysis in Sec. 3.3 that the mean teacher paradigm provides a more effective teacher model for pseudo label generation. Tab. 5 demonstrates the effectiveness of the EMA in STMono3D: performance degrades significantly when the EMA is disabled, and the model easily collapses during training. Moreover, since the model is simultaneously trained with data from both domains, our STMono3D can still preserve knowledge from the source domain, which implies a more powerful generalization capability. As shown in Tab. 4, STMono3D achieves even better results than Oracle models trained on the source domain.

4.3.3 Effectiveness of Quality-Aware Supervision.

We study the effects of applying the proposed QAS strategy to different loss terms. Generally, the loss terms of Mono3D can be divided into two categories: (i) $\mathcal{L}_{cls}$, containing the object classification loss and attribute classification loss, and (ii) $\mathcal{L}_{reg}$, consisting of the location loss, dimension loss, and orientation loss. We apply QAS separately to these two kinds of losses and report the corresponding results in Tab. 6. Interestingly, utilizing the confidence score from the teacher to reweight $\mathcal{L}_{reg}$ cannot improve model performance. We speculate this is caused by the loose correlation between the confidence score and localization quality (see the yellow and blue lines in Fig. 3(a)), which is in line with the findings of the LiDAR-based method [43]. However, we find QAS is well suited to $\mathcal{L}_{cls}$, where model performance increases by about 20.6%, indicating the effectiveness of our proposed QAS strategy. This is intuitively reasonable, since the score of a pseudo label itself measures the confidence of the predicted object class. Such an instance-level reweighting strategy helps the model better handle low-quality pseudo labels, as discussed in Sec. 3.5.

4.3.4 Effectiveness of Crucial Training Strategies.

We further investigate the effectiveness of our proposed PFT and DT strategies, with ablation results in Tab. 7. When we disable these strategies, model performance suffers drastic degradation, dropping by 64.3%; the results demonstrate that they are crucial components of STMono3D. As shown in Fig. 5(a), we also present their influence in a more intuitive manner. If we disable PFT, the model is severely impaired by the numerous FN predictions (Fig. 5(b), top) in the warm-up stage, leading to a failure to recognize objects in the following training iterations. On the other hand, for the teacher model without DT, the number of predictions abruptly increases at the end of the training process, introducing more FP predictions (Fig. 5(b), bottom) that are harmful to model performance. When both strategies are jointly utilized, the number of pseudo labels increases stably, meaning that the detection capability of the model is gradually enhanced on the target domain.

Figure 5: Effects of the proposed PFT and DT. (a) Average number of pseudo labels over training iterations. (b) Examples of harmful FN and FP pseudo labels caused by disabling PFT and DT, respectively.
Nus→K
PFT DT   Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard  | Easy   Mod.   Hard
         13.57  11.33  10.31 | 9.10   7.80   7.00  | 12.36  9.42   8.03  | 7.82   5.82   5.08
         19.59  16.00  14.35 | 15.96  13.15  12.23 | 13.44  9.76   7.90  | 9.23   6.52   5.13
         18.90  16.57  15.75 | 15.15  13.73  12.85 | 12.74  10.35  9.42  | 8.41   6.81   5.96
         35.63  27.37  23.95 | 28.65  21.89  19.55 | 31.85  22.82  19.30 | 24.00  16.85  13.66
Table 7: Ablation study of PFT and DT.

5 Conclusion

In this paper, we have presented STMono3D, a carefully designed unsupervised domain adaptation framework tailored to the monocular 3D object detection task. We show that the depth-shift caused by the geometric discrepancy between domains leads to drastic performance degradation in cross-domain inference. To alleviate this issue, we leverage a teacher-student paradigm for pseudo label generation and propose quality-aware supervision, positive focusing training, and a dynamic threshold to handle the difficulties of Mono3D UDA. Extensive experimental results demonstrate the effectiveness of STMono3D. In future work, we would like to explore temporal consistency to boost UDA performance.

Appendix 0.B Dataset Comparisons

To provide more intuitive comparisons among the different datasets (i.e., KITTI [15], NuScenes [3], and Lyft [19]), we present images with projected ground-truth labels in Fig. 6. One can easily observe that the cameras used in these datasets have different parameters, which is reflected in the image resolutions, FOV, etc. This work focuses on designing a general Mono3D UDA framework and solving the severe depth-shift caused by misaligned camera intrinsic parameters, which is the most crucial problem in Mono3D UDA. However, there are still numerous unsolved issues, such as different image color styles, various distributions of object dimensions, and different distributions of object depth. Our proposed STMono3D can serve as a well-developed baseline for future research.

Appendix 0.C Visualizations of Pseudo Labels

Here, we present more visualizations of pseudo labels generated by the teacher model during the training stage. The images show that the depth-shift issue caused by the misalignment of camera parameters can be well resolved. The reasonable pseudo labels provide regular supervision on the target domain and achieve Mono3D UDA in a teacher-student paradigm. In addition, we can still find slight errors in predicted locations or dimensions, which can be improved by further development of Mono3D methods and enhancement of the UDA algorithms; there is still tremendous room for improvement in Mono3D UDA.

Appendix 0.D Detailed Training Settings

In this section, we introduce more detailed training settings. As for the model, we follow the basic config provided in MMDetection3D [10]. The only modification lies in the scaling of the predicted object depth based on the pixel size (the GAMS introduced in our paper). We then summarize all the runtime settings in Tab. 8, including the number of iterations, training schedule, threshold schedule, etc.

Figure 6: Dataset visualizations with ground-truth labels (left to right: KITTI, NuScenes, Lyft).
Figure 7: Visualizations of pseudo labels generated by the teacher model.
Schedule:             number of iters 880 × 13; learning rate policy: step; warmup type: linear; warmup iters 500; warmup ratio 1/3; step [880 × 8, 880 × 11]
Optimizer:            type SGD; learning rate 0.002; gradient clip: True; batch size per GPU 4; number of GPUs 8; source:target samples per batch 1:1
Mean Teacher:         momentum 0.999; interval 1; warmup 0; increasing thr per step 0.005; start iter 880 × 8; stop iter 880 × 10
Inference (KITTI):    student: nms pre 100, nms thr 0.05, score thr 0.001, max per img 20; teacher (only diff.): score thr 0.35
Strong Augmentation   RandomFlip3D (p=0.5, horizontal);
 (type/prob/details): Mono3DResize (p=1): (1600, 840) (1600, 900) (1600, 960) (1600, 1020) (1600, 1080) (1600, 1140) (1600, 1200) (1600, 1260) (1540, 840) (1480, 780) (1420, 720) (1380, 680) (1660, 960) (1720, 1020) (1800, 1080) (1880, 1140);
                      OneOf (p=1): Identity, AutoContrast, RandEqualize, RandSolarize, RandColor, RandContrast, RandBrightness, RandSharpness, RandPosterize;
                      RandErase (p=1): size=[0, 0.2], n blocks=(1, 5), squared=True
Weak Augmentation:    RandomFlip3D (p=0.5, horizontal); Mono3DResize (p=1): same resolution list as above
Table 8: Detailed training settings.

References

  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79 (1), pp. 151–175. Cited by: §2.2.
  • [2] G. Brazil and X. Liu (2019) M3d-rpn: monocular 3d region proposal network for object detection. In International Conference on Computer Vision (ICCV), pp. 9287–9296. Cited by: §1, §2.1.
  • [3] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) NuScenes: a multimodal dataset for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 11621–11631. Cited by: Appendix 0.B, §1, §1, §1, Table 1, §4.1.1.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), pp. 213–229. Cited by: §2.1.
  • [5] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo (2017) Autodial: automatic domain alignment layers. In International Conference on Computer Vision (ICCV), pp. 5077–5085. Cited by: §2.2.
  • [6] X. Chen, Y. Yuan, G. Zeng, and J. Wang (2021) Semi-supervised semantic segmentation with cross pseudo supervision. In Computer Vision and Pattern Recognition (CVPR), pp. 2613–2622. Cited by: §3.5.2.
  • [7] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 2147–2156. Cited by: §1.
  • [8] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3d object proposals for accurate object class detection. Advances in Neural Information Processing Systems (NIPS) 28. Cited by: §2.1.
  • [9] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Computer Vision and Pattern Recognition (CVPR), pp. 3339–3348. Cited by: §1, §2.2.
  • [10] M. Contributors (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. Note: https://github.com/open-mmlab/mmdetection3d Cited by: Appendix 0.D, §4.1.4.
  • [11] F. Dubourvieux, R. Audigier, A. Loesch, S. Ainouz, and S. Canu (2021) Unsupervised domain adaptation for person re-identification through source-guided pseudo-labeling. In International Conference on Pattern Recognition (ICPR), pp. 4957–4964. Cited by: §1, §2.2.
  • [12] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pp. 1180–1189. Cited by: §2.2.
  • [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR) 17 (1), pp. 2096–2030. Cited by: §2.2.
  • [14] Y. Ge, F. Zhu, D. Chen, R. Zhao, et al. (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Advances in Neural Information Processing Systems (NIPS) 33, pp. 11309–11321. Cited by: §1, §2.2.
  • [15] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: Appendix 0.B, §1, §1, §1, Table 1, §4.1.1.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS) 27. Cited by: §2.2.
  • [17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. Advances in Neural Information Processing Systems (NIPS) 30. Cited by: §2.2.
  • [18] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §1, §2.2.
  • [19] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet (2019) Level 5 perception dataset 2020. Note: https://level-5.global/level5/data/ Cited by: Appendix 0.B, §1, §1, Table 1, §4.1.1.
  • [20] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. In International Conference on Computer Vision (ICCV), pp. 480–490. Cited by: §2.2.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.4.
  • [22] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition (PR) 80, pp. 109–117. Cited by: §2.2.
  • [23] Z. Liu, Z. Wu, and R. Tóth (2020) Smoke: single-stage monocular 3d object detection via keypoint estimation. In Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 996–997. Cited by: §2.1.
  • [24] M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), pp. 97–105. Cited by: §1, §2.2.
  • [25] Z. Luo, Z. Cai, C. Zhou, G. Zhang, H. Zhao, S. Yi, S. Lu, H. Li, S. Zhang, and Z. Liu (2021) Unsupervised domain adaptive 3d detection with multi-level consistency. In International Conference on Computer Vision (ICCV), pp. 8866–8875. Cited by: §1, §2.2, §4.1.4.
  • [26] M. Mancini, L. Porzi, S. R. Bulo, B. Caputo, and E. Ricci (2018) Boosting domain adaptation by discovering latent domains. In Computer Vision and Pattern Recognition (CVPR), pp. 3771–3780. Cited by: §2.2.
  • [27] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3d bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR), pp. 7074–7082. Cited by: §2.1.
  • [28] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon (2021) Is pseudo-lidar needed for monocular 3d object detection?. In International Conference on Computer Vision (ICCV), pp. 3142–3152. Cited by: §1, §2.1, §3.4.2.
  • [29] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander (2021) Categorical depth distribution network for monocular 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 8555–8564. Cited by: §1.
  • [30] T. Roddick, A. Kendall, and R. Cipolla (2018) Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188. Cited by: §2.1.
  • [31] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 6956–6965. Cited by: §1, §2.2.
  • [32] K. Saito, Y. Ushiku, and T. Harada (2017) Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pp. 2988–2997. Cited by: §2.2.
  • [33] B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV), pp. 443–450. Cited by: §2.2.
  • [34] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems (NIPS) 30. Cited by: §1, Figure 2, §3.3.
  • [35] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In International Conference on Computer Vision (ICCV), pp. 4068–4076. Cited by: §2.2.
  • [36] T. Wang, Z. Xinge, J. Pang, and D. Lin (2022) Probabilistic and geometric depth: detecting objects in perspective. In Conference on Robot Learning (CoRL), pp. 1475–1485. Cited by: §1, §2.1.
  • [37] T. Wang, X. Zhu, J. Pang, and D. Lin (2021) Fcos3d: fully convolutional one-stage monocular 3d object detection. In International Conference on Computer Vision Workshop (ICCVW), pp. 913–922. Cited by: §1, §2.1, §4.1.4.
  • [38] X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan (2019) Transferable normalization: towards improving transferability of deep neural networks. Advances in Neural Information Processing Systems (NIPS) 32. Cited by: §2.2.
  • [39] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon (2022) Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning (CoRL), pp. 180–191. Cited by: §2.1.
  • [40] X. Weng and K. Kitani (2019) Monocular 3d object detection with pseudo-lidar point cloud. In International Conference on Computer Vision Workshops (ICCVW), pp. 0–0. Cited by: §2.1.
  • [41] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In Computer Vision and Pattern Recognition (CVPR), pp. 2345–2353. Cited by: §1, §2.1.
  • [42] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu (2021) End-to-end semi-supervised object detection with soft teacher. In International Conference on Computer Vision (ICCV), pp. 3060–3069. Cited by: §3.5.1, §3.5.2, §4.1.4.
  • [43] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi (2021) St3d: self-training for unsupervised domain adaptation on 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 10368–10378. Cited by: §1, §2.2, §3.5.1, §4.1.3, §4.1.4, §4.3.3.
  • [44] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi (2021) ST3D++: denoised self-training for unsupervised domain adaptation on 3d object detection. arXiv preprint arXiv:2108.06682. Cited by: §1.
  • [45] W. Zhang, W. Li, and D. Xu (2021) SRDAN: scale-aware and range-aware domain adaptation network for cross-dataset 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 6769–6779. Cited by: §1, §2.2.
  • [46] Y. Zou, Z. Yu, B. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §2.2.
  • [47] Y. Zou, Z. Yu, B. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §2.2.