Monocular 3D object detection (Mono3D) aims to categorize and localize objects from single input RGB images. With the prevalent deployment of cameras on autonomous vehicles and mobile robots, this field has drawn increasing research attention. Recently, it has achieved remarkable advances [7, 2, 41, 37, 36, 28, 29] driven by deep neural networks and large-scale human-annotated autonomous driving datasets [15, 3, 19].
However, 3D detectors developed on one specific dataset (i.e., the source domain) may suffer tremendous performance degradation when generalizing to another dataset (i.e., the target domain) due to unavoidable domain gaps arising from different sensor types, weather conditions, and geographical locations. In particular, as shown in Fig. 1, the severe depth-shift caused by different imaging devices leads to completely failed localization. As a result, a monocular detector trained on data collected in Singapore with NuScenes cameras cannot work well (i.e., average precision drops to zero) when evaluated on data from European cities captured by KITTI cameras. While collecting and training with more data from different domains could alleviate this problem, it is unfortunately infeasible given the diversity of real-world scenarios and expensive annotation costs. Therefore, methods for effectively adapting a monocular 3D detector trained on a labeled source domain to a novel unlabeled target domain are in high demand in practical applications. We call this task unsupervised domain adaptation (UDA) for monocular 3D object detection.
(Fig. 1: (a) Camera View; (b) BEV View; (c) STMono3D)
While various UDA approaches have been proposed for the 2D image setting, they mainly focus on handling lighting, color, and texture variations. However, for Mono3D, since detectors must estimate the spatial information of objects from monocular RGB images, the geometric alignment of domains is much more crucial. Moreover, for UDA on LiDAR-based 3D detection [43, 44, 25, 45], the fundamental differences in data structures and network architectures render those approaches not readily applicable to this problem.
In this paper, we propose STMono3D for UDA on monocular 3D object detection. We first thoroughly investigate the depth-shift issue caused by the tight entanglement of models and camera parameters during training: models can accurately locate objects on the 2D image but predict totally wrong object depth, with tremendous shifts, when inferring on the target domain. To alleviate this issue, we develop the geometry-aligned multi-scale (GAMS) training strategy, which guarantees geometric consistency across domains and predicts pixel-size depth to overcome the inevitable misalignment and ambiguity. Hence, models can provide effective predictions on the unlabeled target domain. Built upon this, we adopt the mean teacher paradigm to facilitate learning. The teacher model is essentially a temporal ensemble of student models, whose parameters are updated by an exponential moving average over student models of preceding iterations. It produces stable supervision for the student model without prior knowledge of the target domain.
Moreover, we observe that the Mono3D teacher model suffers from extremely low confidence scores and numerous failed predictions on the target domain. To handle these issues, we adopt Quality-Aware Supervision (QAS), Positive Focusing Training (PFT), and Dynamic Threshold (DT) strategies. Benefiting from the flexibility of the end-to-end mean teacher framework, we utilize the reliability of each teacher-generated prediction to dynamically reweight the supervision loss of the student model; this takes the instance-level quality of pseudo labels into account and prevents low-quality samples from interfering with the training process. Since the backgrounds of domains are similar in the autonomous driving Mono3D UDA setting, we ignore negative samples and only utilize positive pseudo labels to train the model, which prevents excessive false negative (FN) pseudo labels at the beginning of training from impairing the model's capability to recognize objects. In synchronization with training, we utilize a dynamic threshold to adjust the filtering score, which stabilizes the growth of the pseudo-label set.
To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D. Experimental results on various 3D object detection datasets, KITTI, NuScenes, and Lyft, demonstrate the effectiveness of our proposed methods, where the performance gaps between source-only results and fully supervised oracle results are closed by a large margin. It is noteworthy that STMono3D even outperforms the oracle results under the NuScenes→KITTI setting. Our code will be available at https://github.com/zhyever/STMono3D.
2 Related Work
2.1 Monocular 3D Object Detection
Mono3D has drawn increasing attention in recent years [8, 41, 27, 30, 40, 2, 39, 37, 28, 36]. Earlier work utilizes sub-networks to assist 3D detection. For instance, 3DOP and MLFusion use a depth estimation network, while Deep3DBox uses a 2D object detector. Another line of research makes efforts to convert the RGB input to 3D representations, e.g., OFTNet and Pseudo-LiDAR. While these methods have shown promising performance, they rely on the design and performance of sub-networks or dense depth labels. Hence, some methods propose to design the framework in an end-to-end manner as in 2D detection. M3D-RPN implements a single-stage multi-class detector with a region proposal network and depth-aware convolution. SMOKE proposes a neat framework to predict 3D objects without generating 2D proposals. DETR3D develops a DETR-like bbox head, where 3D objects are predicted by independent queries in a set-to-set manner. DD3D further investigates the influence of a pre-trained monocular depth estimation network and finds that depth estimation plays a crucial part in Mono3D. In this paper, we mainly conduct UDA experiments based on FCOS3D, a neat and representative Mono3D paradigm that keeps the well-developed designs for 2D feature extraction and adapts them to the 3D task with only basic designs for specific 3D detection targets.
2.2 Unsupervised Domain Adaptation
UDA aims to generalize a model trained on a source domain to unlabeled target domains. So far, numerous methods have been proposed for various computer vision tasks [11, 24, 18, 9, 31, 14, 47] (e.g., recognition, detection, segmentation). Some methods [26, 33, 5] employ statistic-based metrics to model the differences between two domains. Other approaches [32, 46, 20] utilize a self-training strategy to generate pseudo labels for unlabeled target domains. Moreover, inspired by Generative Adversarial Networks (GANs), adversarial learning has been employed to align feature distributions [35, 12, 13], which can be explained as minimizing the H-divergence or the Jensen–Shannon divergence between two domains. [22, 38] alleviate the domain shift at batch normalization (BN) layers by modulating the statistics in the BN layer before evaluation or by specializing BN parameters domain by domain. Most of these domain adaptation approaches are designed for general 2D image recognition tasks, and directly adopting them for the large-scale monocular 3D object detection task may not work well due to the distinct characteristics of Mono3D, especially its targets in 3D spatial coordinates.
In terms of 3D object detection, [45, 43, 25] investigate UDA strategies for LiDAR-based detectors. SRDAN adopts adversarial losses to align features and instances with similar scales between two domains. ST3D and MLC-Net develop self-training strategies with delicate designs (e.g., random object scaling, a triplet memory bank, and multi-level alignment) for domain adaptation. Following the successful trend of UDA on LiDAR-based 3D object detection, we investigate effective self-training strategies for Mono3D. To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D.
3 Method
In this section, we first formulate the UDA task on Mono3D (Sec. 3.1) and present an overview of our framework (Sec. 3.2), followed by the Self-Teacher with Temporal Ensemble paradigm (Sec. 3.3). Then, we explain the details of Geometry-Aligned Multi-Scale Training (GAMS, Sec. 3.4), Quality-Aware Supervision (QAS, Sec. 3.5), and two other crucial training strategies, Positive Focusing Training (PFT) and Dynamic Threshold (DT) (Sec. 3.6).
3.1 Problem Definition
Under the unsupervised domain adaptation setting, we have access to $N_s$ labeled images $\{(x_i^s, K_i^s, y_i^s)\}_{i=1}^{N_s}$ from the source domain $\mathcal{D}_s$ and $N_t$ unlabeled images $\{(x_j^t, K_j^t)\}_{j=1}^{N_t}$ from the target domain $\mathcal{D}_t$, where $N_s$ and $N_t$ are the numbers of samples from the source and target domains, respectively. Each 2D image $x$ is paired with a camera parameter matrix $K$ that projects points in 3D space to the 2D image plane, while $y^s$ denotes the label of the corresponding training sample in the specific camera coordinate frame of the source domain. Each label consists of the object class $c$, location $\mathbf{t} \in \mathbb{R}^3$, size in each dimension $(h, w, l)$, and orientation $\theta$. We aim to train models with this data and avoid performance degradation when evaluating on the target domain.
3.2 Framework Overview
We illustrate our STMono3D in Fig. 2. The labeled source domain data is utilized for supervised training of the student model with a loss $\mathcal{L}_{src}$. For the unlabeled target domain data, we first perturb it by applying strong random augmentation. Before being passed to the models, both the target and source domain inputs are further augmented by the GAMS strategy in Sec. 3.4, where images and camera intrinsic parameters are cautiously aligned via simultaneous rescaling. Subsequently, the original and perturbed images are sent to the teacher and student model, respectively, where the teacher model generates intuitively reasonable pseudo labels and supervises the student model via a loss $\mathcal{L}_{tgt}$ on the target domain:

$$\mathcal{L}_{tgt} = \mathcal{L}_{reg} + \mathcal{L}_{cls},$$
where $\mathcal{L}_{reg}$ and $\mathcal{L}_{cls}$ are the regression loss and classification loss, respectively. Here, we adopt the QAS strategy in Sec. 3.5 to further leverage richer information from the teacher model by instance-wise reweighting of the loss $\mathcal{L}_{cls}$. In each iteration, the student model is updated through gradient descent with the total loss $\mathcal{L}$, which is a linear combination of $\mathcal{L}_{src}$ and $\mathcal{L}_{tgt}$:

$$\mathcal{L} = \mathcal{L}_{src} + \mu\,\mathcal{L}_{tgt},$$
where $\mu$ is the weight coefficient. Then, the teacher model parameters are updated from the corresponding parameters of the student model, with details introduced in Sec. 3.3. Moreover, we observe that the teacher model suffers from numerous FN and FP pseudo labels on the target domain. To handle this issue, we utilize the PFT and DT strategies illustrated in Sec. 3.6.
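The overall iteration can be sketched as follows. This is a minimal sketch, not the paper's implementation (which builds on mmDetection3D): the helper names `loss`, `predict`, and the batch keys are hypothetical stand-ins.

```python
# Sketch of one STMono3D training iteration: supervised loss on labeled
# source data plus a pseudo-label loss on strongly augmented target data.
# `student`/`teacher` are assumed to expose hypothetical loss/predict APIs.

def train_step(student, teacher, source_batch, target_batch, mu=1.0):
    """Return the total loss L = L_src + mu * L_tgt for one iteration."""
    # Supervised loss on the labeled source domain.
    l_src = student.loss(source_batch["images"], source_batch["labels"])
    # The teacher sees the original target images and produces pseudo labels.
    pseudo_labels = teacher.predict(target_batch["images"])
    # The student is trained on a strongly augmented view of the same images.
    l_tgt = student.loss(target_batch["images_aug"], pseudo_labels)
    return l_src + mu * l_tgt
```

After the gradient step on the student, the teacher is updated by EMA (Sec. 3.3), so no gradients flow through the teacher.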
3.3 Self-Teacher with Temporal Ensemble
Following the successful trend of the mean teacher paradigm in semi-supervised learning, we adapt it to our Mono3D UDA task as illustrated in Fig. 2. The teacher model and the student model share the same network architecture but have different parameters $\theta_t$ and $\theta_s$, respectively. During training, the parameters of the teacher model are updated by taking the exponential moving average (EMA) of the student parameters:

$$\theta_t \leftarrow m\,\theta_t + (1 - m)\,\theta_s,$$
where $m$ is the momentum, commonly set close to 1 (e.g., 0.999 in our experiments). Moreover, the input of the student model is perturbed by strong augmentation, which ensures that the pseudo labels generated by the teacher model are more accurate than the student model's predictions, thus providing a usable optimization direction for the parameter update. In addition, the strong augmentation also improves the model's generalization to inputs from different domains. Hence, by supervising the student model with pseudo targets generated by the teacher model (i.e., forcing consistency between the predictions of the student and the teacher), the student can learn domain-invariant representations and adapt to the unlabeled target domain. Fig. 4 shows that the teacher model provides effective supervision to the student model, and Tabs. 4 and 5 demonstrate the effectiveness of the mean teacher paradigm.
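The EMA update can be sketched as follows; parameters are simplified to flat lists of floats, whereas in practice the update runs over the state dicts of the two networks.

```python
def ema_update(teacher_params, student_params, momentum=0.999):
    """Exponential moving average: theta_t <- m*theta_t + (1-m)*theta_s.

    Applied element-wise; a momentum close to 1 makes the teacher a slow,
    stable temporal ensemble of past student models."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

With `momentum=0.999`, each student snapshot contributes only 0.1% per step, so the teacher changes smoothly even when individual student updates are noisy.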
3.4 Geometry-Aligned Multi-Scale Training
As shown in Fig. 1, depth-shift drastically harms the quality of pseudo labels on the target domain. It is mainly caused by domain-specific geometric correspondences between 3D objects and images (i.e., the camera imaging process). For instance, since the pixel size (defined in Eq. 6) of the KITTI dataset is larger than that of the NuScenes dataset, objects in images captured by KITTI cameras appear smaller than in NuScenes ones. While the model can predict accurate 2D locations on the image plane, it tends to estimate larger object depth based on the perspective cue that far objects appear smaller. We call this phenomenon depth-shift: models localize accurate 2D locations but predict depth with tremendous shifts on the target domain. To mitigate it, we propose a straightforward yet effective augmentation strategy, geometry-aligned multi-scale training, which fully leverages the geometric consistency of the imaging process.
Given the source input $(x^s, K^s)$ and the target input $(x^t, K^t)$, a naive geometry-aligned strategy is to rescale the camera parameters to the same constant values and resize the images correspondingly:

$$f'_u = r_u f_u,\quad c'_u = r_u c_u,\quad f'_v = r_v f_v,\quad c'_v = r_v c_v,$$

where $r_u$ and $r_v$ are the resize rates, $f$ and $c$ are the focal length and optical center, and $u$ and $v$ indicate the image coordinate axes, respectively. However, since the FOV cannot be changed by resizing, it is impracticable to strictly align the geometric correspondences of 3D objects and images between different domains via such convenient transformations. The inevitable discrepancy and ambiguity lead to failure in UDA.
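The rescaling rule above can be sketched for a standard 3x3 intrinsic matrix; the function name and the plain-list representation are illustrative, not from the paper.

```python
def rescale_intrinsics(K, r_u, r_v):
    """Update a 3x3 intrinsic matrix after resizing the image by (r_u, r_v):
    f_u' = r_u * f_u, c_u' = r_u * c_u (and likewise for the v axis).

    Note that the focal length and optical center are scaled by the same
    rate, so the FOV is left unchanged -- the limitation discussed above."""
    K2 = [row[:] for row in K]  # copy so the original intrinsics survive
    K2[0][0] *= r_u  # f_u
    K2[0][2] *= r_u  # c_u
    K2[1][1] *= r_v  # f_v
    K2[1][2] *= r_v  # c_v
    return K2
```

For example, doubling the image resolution doubles both the focal lengths and the optical center, but their ratio, and hence the FOV, stays fixed.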
To solve this issue, motivated by DD3D, we propose to predict the pixel-size depth $d_{ps}$ instead of the metric depth $d$:

$$d = \frac{C}{s} \cdot d_{ps},$$

where $s$ and $C$ are the pixel size and a constant, and $d_{ps}$ is the model prediction, which is scaled to the final result $d$. Hence, while inevitable discrepancies remain between the aligned geometric correspondences of the two domains, the model can infer depth from the pixel size and is thus more robust to varying imaging processes. Moreover, we further rescale the camera parameters into a multi-scale range, instead of to the same constant values, and resize the images correspondingly to enhance the dynamics of the model. During training, we keep the ground-truth 3D bounding boxes and pseudo labels unchanged but modify the camera parameters and image resolutions simultaneously.
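A minimal sketch of the scaling step, assuming the relation $d = (C/s)\,d_{ps}$; the exact definition of the pixel size $s$ (the paper's Eq. 6) is not reproduced in this excerpt, so treat this form as illustrative.

```python
def to_metric_depth(d_ps, pixel_size, C=1.0):
    """Scale a predicted pixel-size depth d_ps back to metric depth.

    Assumed relation: d = (C / s) * d_ps. A camera with a larger pixel
    size (e.g., KITTI vs. NuScenes) makes the same object appear smaller,
    so the same raw prediction maps to a smaller metric depth, which is
    exactly the correction needed to counter depth-shift."""
    return (C / pixel_size) * d_ps
```

Because the network's target is (approximately) invariant to the camera's pixel size, the same prediction head can serve both domains after GAMS rescaling.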
3.5 Quality-Aware Supervision
The cross-domain performance of the detector highly depends on the quality of pseudo labels. In practice, we have to apply a threshold on the foreground score to filter out most false positive (FP) box candidates with low confidence. However, unlike teacher models that detect objects with high confidence in semi-supervised 2D detection or in UDA of LiDAR-based 3D detectors (e.g., thresholds of 90% and 70%, respectively), we find that the Mono3D cross-domain teacher suffers from much lower confidence, as shown in Fig. 3. This is another phenomenon unique to Mono3D UDA, caused by the much worse oracle performance of monocular 3D detection compared with 2D detection and LiDAR-based 3D detection. It implies that even when the prediction confidence surpasses the threshold, we cannot ensure the sample quality, especially for samples near the threshold. To alleviate this impact, we propose quality-aware supervision (QAS) to leverage richer information from the teacher and take instance-level quality into account.
Thanks to the flexibility of the end-to-end mean teacher framework, we assess the reliability of each teacher-generated bbox being a real foreground, which is then used to weight the foreground classification loss of the student model. Given the foreground bounding box set $\mathcal{B}_{fg} = \{b_i\}_{i=1}^{N_{fg}}$, the classification loss on the unlabeled images of the target domain is defined as:

$$\mathcal{L}_{cls} = \frac{1}{N_{fg}} \sum_{i=1}^{N_{fg}} s_i^{\alpha} \cdot \ell_{cls}(b_i, \hat{c}_i),$$

where $\{\hat{c}_i\}$ denotes the set of pseudo class labels, $\ell_{cls}$ is the box classification loss, $s_i$ is the confidence score of the foreground pseudo box $b_i$, $N_{fg}$ is the number of foreground pseudo boxes, and $\alpha$ is a constant hyperparameter.
The QAS resembles a simple positive mining strategy: it is intuitively reasonable that there should be a more severe penalty for pseudo labels with higher confidence. Moreover, compared with semi-supervised and supervised tasks that focus on simple/hard negative samples [42, 6], it is more critical for UDA Mono3D models to prevent the harmful influence of low-quality pseudo labels near the threshold. Such an instance-level weighting strategy balances the loss terms based on foreground confidence scores and significantly improves the effectiveness of STMono3D.
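The reweighting can be sketched as below; the weight form `s_i ** alpha` is an assumption consistent with the description (it grows with confidence), not the paper's verified formula.

```python
def qas_cls_loss(cls_losses, scores, alpha=1.0):
    """Quality-aware classification loss over foreground pseudo boxes.

    Each per-box loss is reweighted by the teacher's confidence score
    (assumed weight w_i = s_i ** alpha): confident pseudo labels are
    penalized more, while uncertain boxes near the filtering threshold
    contribute less to the student's update."""
    assert len(cls_losses) == len(scores)
    n = max(len(cls_losses), 1)  # avoid division by zero on empty batches
    return sum((s ** alpha) * l for l, s in zip(cls_losses, scores)) / n
```

A larger `alpha` sharpens the weighting, pushing the supervision further toward the most reliable pseudo boxes.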
3.6 Crucial Training Strategies
3.6.1 Positive Focusing Training.
Since the whole STMono3D is trained in an end-to-end manner, the teacher model can hardly detect objects with confidence scores higher than the threshold at the start of training. Tons of FN pseudo samples would impair the capability of the model to recognize objects. Because the backgrounds of different domains are similar, with negligible domain gaps in Mono3D UDA (e.g., street, sky, and houses), we propose the positive focusing training strategy: for $\mathcal{L}_{tgt}$, we discard negative background pseudo labels and only utilize positive samples to supervise the student model, which ensures that the model does not collapse by overfitting to the FN pseudo labels during the training stage.
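The selection step can be sketched as a simple filter (illustrative names; in the real pipeline the dropped regions simply contribute no loss rather than being supervised as negatives):

```python
def positive_focusing(pseudo_boxes, scores, threshold):
    """Keep only confident positive pseudo labels for the target-domain loss.

    Everything below the threshold -- including all background regions --
    is dropped instead of being treated as a supervised negative, so early
    false-negative pseudo labels cannot teach the student to suppress
    real objects."""
    return [box for box, s in zip(pseudo_boxes, scores) if s >= threshold]
```

Early in training this list is often empty, which is harmless: the student is then driven by the supervised source loss alone.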
3.6.2 Dynamic Threshold.
In practice, we find that the mean confidence score of pseudo labels gradually increases as training proceeds. Increasing numbers of false positive (FP) samples appear in the middle and late stages of training, which severely hurts model performance. While the QAS strategy proposed in Sec. 3.5 can reduce the negative impact of low-quality pseudo labels, completely wrong predictions still introduce unavoidable noise into the training process. To alleviate this issue, we propose a simple progressively increasing threshold strategy that dynamically changes the threshold as:

$$\tau_t = \min\big(\tau_{max},\ \tau_0 + k \cdot \max(0,\, t - t_w)\big),$$

where $\tau_0$ is the base threshold, set to 0.35 based on the statistics in Fig. 3(a) in our experiments, $k$ is the slope of the increasing threshold, and $t$ is the training iteration. The threshold is fixed to the minimum during the first $t_w$ warm-up steps, as the teacher model can hardly detect objects with confidence scores higher than the base threshold at that stage. It then increases linearly once the teacher model begins to predict pseudo labels containing FP samples, to avoid the model being blemished by increasing failure predictions. Finally, we find that the increase of the average score tends to saturate; therefore, the threshold is fixed at the end of the training stage to guarantee the number of pseudo labels.
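The schedule described above can be sketched as a small function. Only the base threshold of 0.35 comes from the paper; the slope, warm-up length, and cap below are illustrative values.

```python
def dynamic_threshold(t, tau0=0.35, slope=1e-5, warmup=1000, tau_max=0.5):
    """Piecewise pseudo-label threshold schedule for iteration t:
    flat at tau0 during warm-up, then linear growth with the given
    slope, finally clipped at tau_max once scores saturate."""
    return min(tau_max, tau0 + slope * max(0, t - warmup))
```

Raising the threshold in lock-step with the teacher's rising confidence keeps the FP rate roughly constant instead of letting it explode late in training.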
4 Experiments
4.1 Experimental Setup
4.1.1 Datasets.
We conduct experiments on three widely used autonomous driving datasets: KITTI, NuScenes, and Lyft. Our experiments cover two aspects: cross-domain adaptation with different cameras (present in all source–target pairs) and adaptation from label-rich domains to label-insufficient domains (i.e., NuScenes→KITTI). We summarize the dataset information in detail in Tab. 1 and present more visual comparisons in the supplementary material.
4.1.2 Comparison Methods.
In our experiments, we compare STMono3D with three methods: (i) Source Only, which directly evaluates the source-domain-trained model on the target domain; (ii) Oracle, a fully supervised model trained on the target domain; and (iii) Naive ST (with GAMS), a basic self-training method in which we first train a model (with GAMS) on the source domain, then generate pseudo labels for the target domain, and finally fine-tune the trained model on the target domain.
4.1.3 Evaluation Metric.
We adopt the KITTI evaluation metric in NuScenes→KITTI and Lyft→KITTI, and the NuScenes metric in Lyft→NuScenes. We focus on the commonly used car category in our experiments. For Lyft→NuScenes, we evaluate models on the full ring view, which is more useful in real-world applications. For KITTI, we report the average precision (AP) with an IoU threshold of 0.5 for both bird's-eye-view (BEV) IoU and 3D IoU. For NuScenes, since the attribute labels differ from those of the source domain (i.e., Lyft), we discard the mean average attribute error (mAAE) and report the mean average translation error (mATE), scale error (mASE), orientation error (mAOE), and mean average precision (mAP). Following prior work, we also report the closed performance gap between Source Only and Oracle.
4.1.4 Implementation Details.
We validate our proposed STMono3D on the FCOS3D detection backbone. Since no modification is made to the model, our method can be adapted to other Mono3D backbones as well. We implement STMono3D based on the popular 3D object detection codebase mmDetection3D. We utilize the SGD optimizer. Gradient clipping and a warm-up policy are exploited, with 500 warm-up iterations, a warm-up ratio of 0.33, and a batch size of 32 on 8 Tesla V100s. The loss weight between domains in Eq. 2 is set to 1. We apply a momentum of 0.999 in Eq. 3, following most mean teacher paradigms [25, 42]. As for the strong augmentation, we adopt widely used image data augmentations, including random flipping, random erasing, random toning, etc. We subsample the NuScenes and Lyft datasets during their training stages for simplicity. Notably, unlike the mean teacher paradigm or the self-training strategy used in UDA of LiDAR-based 3D detectors [25, 43], our STMono3D is trained in a totally end-to-end manner.
4.2 Main Results
As shown in Tab. 2, we compare the performance of STMono3D with Source Only and Oracle. Our method outperforms the Source Only baseline in all evaluated UDA settings. Due to the domain gap, the Source Only model can hardly detect 3D objects, and its mAP drops almost to 0%. In contrast, STMono3D improves performance on the NuScenes→KITTI and Lyft→KITTI tasks by a large margin, closing around 110%/67% of the performance gap. Notably, STMono3D even surpasses the Oracle results on some metrics, which indicates the effectiveness of our method. Furthermore, when transferring Lyft models to another domain with full ring-view annotations for evaluation (i.e., Lyft→NuScenes), STMono3D also attains a considerable performance gain, closing up to 66% of the gap between Oracle and Source Only. These encouraging results validate that our method can effectively adapt 3D object detectors to the target domain.
(Tab. 2 excerpt; left: KITTI-metric results, right: NuScenes-metric results)
|Source Only | 0 | 0 | 0 | 0 | 0 | 0 |   |Source Only | 2.40 | 1.302 | 0.190 | 0.802 |
|Closed Gap | 79.0% | 87.6% | 79.6% | 62.5% | 67.0% | 68.8% |   |Closed Gap | 73.2% | 77.5% | 66.7% | 82.9% |
4.3 Ablation Studies and Analysis
In this section, we conduct extensive ablation experiments to investigate the individual components of our STMono3D. All experiments are conducted on the NuScenes→KITTI task.
4.3.1 Effectiveness of Geometry-Aligned Multi-Scale Training.
We study the effects of GAMS in the mean teacher paradigm of STMono3D and in the Naive ST pipeline. Tab. 3 first reports the experimental results when GAMS is disabled. Due to the depth-shift analyzed in Sec. 3.4, the teacher model generates incorrect pseudo labels on the target domain, leading to a severe drop in model performance. Furthermore, as shown in Tab. 4, GAMS is crucial for effective Naive ST as well. This is reasonable, since GAMS enables the model trained on the source domain to generate valid pseudo labels on the target domain, making the fine-tuning stage beneficial to model performance. We present pseudo labels predicted by the teacher model of STMono3D in Fig. 1, which shows that the depth-shift is well alleviated. All these results highlight the importance of GAMS for effective Mono3D UDA.
(Tab. 4 excerpt)
|Naive ST with GAMS | 9.05 | 9.08 | 8.82 | 3.72 | 3.69 | 3.58 | 14.0 | 0.906 | 0.164 | 0.264 |
(Fig. 4: (a) Oracle vs. Source Only + GAMS; (b) STMono3D Teacher vs. Student)
4.3.2 Comparison of Self-Training Paradigm.
We compare our STMono3D with the commonly used self-training paradigm (i.e., Naive ST) in Tab. 4. While GAMS helps the Naive ST teacher generate effective pseudo labels on the target domain and thus boosts UDA performance, our STMono3D still outperforms it by a significant margin. One of the primary concerns lies in the low-quality pseudo labels caused by the domain gap. Moreover, as shown in Fig. 4(a), while the performance of the Oracle improves progressively, the Source Only model on the target domain suffers from performance fluctuations. It is also troublesome to choose a specific, suitable model from the intermediate results to generate pseudo labels for the student model.
In STMono3D, by contrast, the whole framework is trained in an end-to-end manner, and the teacher is a temporal ensemble of student models at different time stamps. Fig. 4(b) shows that our teacher model is much more stable than the ones in Naive ST and performs better than the student model at the end of the training phase, when the teacher model starts to generate more predictions above the filtering score threshold. This validates our analysis in Sec. 3.3 that the mean teacher paradigm provides a more effective teacher model for pseudo-label generation. Tab. 5 demonstrates the effectiveness of the EMA in STMono3D: performance significantly degrades when the EMA is disabled, and the model easily collapses during training. Moreover, since the model is simultaneously trained on data from both domains, STMono3D can still preserve knowledge from the source domain, implying a more powerful generalization capability. As shown in Tab. 4, STMono3D achieves even better results than Oracle models trained on the source domain.
4.3.3 Effectiveness of Quality-Aware Supervision.
We study the effects of applying the proposed QAS strategy to different loss terms. Generally, the loss terms of Mono3D can be divided into two categories: (i) classification losses, containing the object classification loss and the attribute classification loss, and (ii) regression losses, consisting of the location loss, dimension loss, and orientation loss. We separately apply QAS to these two kinds of losses and report the corresponding results in Tab. 6. Interestingly, utilizing the confidence score from the teacher to reweight the regression losses cannot improve model performance. We speculate this is caused by the loose correlation between the IoU score and localization quality (see the yellow or blue line in Fig. 3(a)), which is in line with the findings of LiDAR-based methods. However, we find QAS much more applicable to the classification losses, where model performance increases by about 20.6%, indicating the effectiveness of our proposed QAS strategy. This is intuitively reasonable, since the score of a pseudo label itself measures the confidence of the predicted object class. Such an instance-level reweighting strategy helps the model better handle low-quality pseudo labels, as discussed in Sec. 3.5.
4.3.4 Effectiveness of Crucial Training Strategies.
We then further investigate the effectiveness of the proposed PFT and DT strategies, with ablation results in Tab. 7. When we disable these strategies, model performance suffers drastic degradation, dropping by up to 64.3%. The results demonstrate that they are crucial strategies in STMono3D. As shown in Fig. 5(a), we also present their influence in a more intuitive manner. If we disable PFT, the model is severely impaired by the numerous FN predictions (shown in Fig. 5(b), top) in the warm-up stage, leading to a failure to recognize objects in the following training iterations. On the other hand, for the teacher model without DT, the number of predictions abruptly increases at the end of the training process, introducing more FP predictions (shown in Fig. 5(b), bottom) that are harmful to model performance. When both strategies are jointly utilized, the number of pseudo labels increases stably, which means the detection capability of the model on the target domain is gradually enhanced.
(Fig. 5: (a) Number of pseudo labels during training; (b) Visualization examples)
In this paper, we have presented STMono3D, a meticulously designed unsupervised domain adaptation framework tailored for the monocular 3D object detection task. We show that the depth-shift caused by the geometric discrepancy between domains leads to drastic performance degradation under cross-domain inference. To alleviate this issue, we leverage a teacher-student paradigm for pseudo-label generation and propose quality-aware supervision, positive focusing training, and dynamic thresholding to handle the difficulties of Mono3D UDA. Extensive experimental results demonstrate the effectiveness of STMono3D. In future work, we would like to explore temporal consistency to further boost UDA performance.
Appendix 0.B Dataset Comparisons
To provide more intuitive comparisons among the different datasets (e.g., KITTI, NuScenes, and Lyft), we present images with projected ground-truth labels in Fig. 6. One can easily observe that the cameras used in these datasets have different parameters, reflected in the image resolutions, FOV, etc. This work focuses on designing a general Mono3D UDA framework and solving the severe depth-shift caused by misaligned camera intrinsic parameters, which is the most crucial problem in Mono3D UDA. However, numerous issues remain unsolved, such as different image color styles, various distributions of object dimensions, and different distributions of object depth. Our proposed STMono3D can serve as a well-developed baseline for future research.
Appendix 0.C Visualizations of Pseudo Labels
Here, we present more visualizations of pseudo labels generated by the teacher model during the training stage. The images show that the depth-shift issue caused by the misalignment of camera parameters can be well solved. The reasonable pseudo labels provide regular supervision on the target domain and enable Mono3D UDA in a teacher-student paradigm. In addition, we can still observe slight errors in predicted locations or dimensions, which can be reduced by further development of Mono3D methods and enhancement of UDA algorithms. There is still tremendous room for improvement in Mono3D UDA.
Appendix 0.D Detailed Training Settings
In this section, we introduce more detailed training settings. For the model, we follow the basic config provided in MMDetection3D. The only modification lies in the scaling of predicted object depth based on the pixel size (the GAMS introduced in our paper). We then summarize all the runtime settings in Tab. 8, including the number of iterations, training schedule, threshold changes, etc.
|KITTI | NuScenes | Lyft|
-  (2010) A theory of learning from different domains. Machine Learning 79 (1), pp. 151–175. Cited by: §2.2.
-  (2019) M3D-RPN: monocular 3D region proposal network for object detection. In International Conference on Computer Vision (ICCV), pp. 9287–9296. Cited by: §1, §2.1.
-  nuScenes: a multimodal dataset for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 11621–11631. Cited by: Appendix 0.B, §1, Table 1, §4.1.1.
-  (2020) End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), pp. 213–229. Cited by: §2.1.
-  (2017) Autodial: automatic domain alignment layers. In International Conference on Computer Vision (ICCV), pp. 5077–5085. Cited by: §2.2.
-  (2021) Semi-supervised semantic segmentation with cross pseudo supervision. In Computer Vision and Pattern Recognition (CVPR), pp. 2613–2622. Cited by: §3.5.2.
-  (2016) Monocular 3d object detection for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), pp. 2147–2156. Cited by: §1.
-  (2015) 3d object proposals for accurate object class detection. Advances in Neural Information Processing Systems (NIPS) 28. Cited by: §2.1.
-  (2018) Domain adaptive faster r-cnn for object detection in the wild. In Computer Vision and Pattern Recognition (CVPR), pp. 3339–3348. Cited by: §1, §2.2.
-  (2020) MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. Note: https://github.com/open-mmlab/mmdetection3d Cited by: Appendix 0.D, §4.1.4.
-  (2021) Unsupervised domain adaptation for person re-identification through source-guided pseudo-labeling. In International Conference on Pattern Recognition (ICPR), pp. 4957–4964. Cited by: §1, §2.2.
-  (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pp. 1180–1189. Cited by: §2.2.
-  (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR) 17 (1), pp. 2096–2030. Cited by: §2.2.
-  (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Advances in Neural Information Processing Systems (NIPS) 33, pp. 11309–11321. Cited by: §1, §2.2.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: Appendix 0.B, §1, §1, §1, Table 1, §4.1.1.
-  (2014) Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS) 27. Cited by: §2.2.
-  (2017) Improved training of wasserstein gans. Advances in Neural Information Processing Systems (NIPS) 30. Cited by: §2.2.
-  (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §1, §2.2.
-  (2019) Level 5 perception dataset 2020. Note: https://level-5.global/level5/data/ Cited by: Appendix 0.B, §1, §1, Table 1, §4.1.1.
-  (2019) A robust learning approach to domain adaptive object detection. In International Conference on Computer Vision (ICCV), pp. 480–490. Cited by: §2.2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.4.
-  (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition (PR) 80, pp. 109–117. Cited by: §2.2.
-  (2020) Smoke: single-stage monocular 3d object detection via keypoint estimation. In Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 996–997. Cited by: §2.1.
-  (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), pp. 97–105. Cited by: §1, §2.2.
-  (2021) Unsupervised domain adaptive 3d detection with multi-level consistency. In International Conference on Computer Vision (ICCV), pp. 8866–8875. Cited by: §1, §2.2, §4.1.4.
-  (2018) Boosting domain adaptation by discovering latent domains. In Computer Vision and Pattern Recognition (CVPR), pp. 3771–3780. Cited by: §2.2.
-  (2017) 3d bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR), pp. 7074–7082. Cited by: §2.1.
-  (2021) Is pseudo-lidar needed for monocular 3d object detection?. In International Conference on Computer Vision (ICCV), pp. 3142–3152. Cited by: §1, §2.1, §3.4.2.
-  (2021) Categorical depth distribution network for monocular 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 8555–8564. Cited by: §1.
-  (2018) Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188. Cited by: §2.1.
-  (2019) Strong-weak distribution alignment for adaptive object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 6956–6965. Cited by: §1, §2.2.
-  (2017) Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pp. 2988–2997. Cited by: §2.2.
-  (2016) Deep coral: correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV), pp. 443–450. Cited by: §2.2.
-  (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems (NIPS) 30. Cited by: §1, Figure 2, §3.3.
-  (2015) Simultaneous deep transfer across domains and tasks. In International Conference on Computer Vision (ICCV), pp. 4068–4076. Cited by: §2.2.
-  (2022) Probabilistic and geometric depth: detecting objects in perspective. In Conference on Robot Learning (CoRL), pp. 1475–1485. Cited by: §1, §2.1.
-  (2021) Fcos3d: fully convolutional one-stage monocular 3d object detection. In International Conference on Computer Vision Workshop (ICCVW), pp. 913–922. Cited by: §1, §2.1, §4.1.4.
-  (2019) Transferable normalization: towards improving transferability of deep neural networks. Advances in Neural Information Processing Systems (NIPS) 32. Cited by: §2.2.
-  (2022) Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning (CoRL), pp. 180–191. Cited by: §2.1.
-  (2019) Monocular 3d object detection with pseudo-lidar point cloud. In International Conference on Computer Vision Workshops (ICCVW), pp. 0–0. Cited by: §2.1.
-  (2018) Multi-level fusion based 3d object detection from monocular images. In Computer Vision and Pattern Recognition (CVPR), pp. 2345–2353. Cited by: §1, §2.1.
-  (2021) End-to-end semi-supervised object detection with soft teacher. In International Conference on Computer Vision (ICCV), pp. 3060–3069. Cited by: §3.5.1, §3.5.2, §4.1.4.
-  (2021) St3d: self-training for unsupervised domain adaptation on 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 10368–10378. Cited by: §1, §2.2, §3.5.1, §4.1.3, §4.1.4, §4.3.3.
-  (2021) ST3D++: denoised self-training for unsupervised domain adaptation on 3d object detection. arXiv preprint arXiv:2108.06682. Cited by: §1.
-  (2021) SRDAN: scale-aware and range-aware domain adaptation network for cross-dataset 3d object detection. In Computer Vision and Pattern Recognition (CVPR), pp. 6769–6779. Cited by: §1, §2.2.
-  (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §2.2.