Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency

Deep learning-based 3D object detection has achieved unprecedented success with the advent of large-scale autonomous driving datasets. However, drastic performance degradation remains a critical challenge for cross-domain deployment. In addition, existing 3D domain adaptive detection methods often assume prior access to the target domain annotations, which is rarely feasible in the real world. To address this challenge, we study a more realistic setting, unsupervised 3D domain adaptive detection, which only utilizes source domain annotations. 1) We first comprehensively investigate the major underlying factors of the domain gap in 3D detection. Our key insight is that geometric mismatch is the key factor of domain shift. 2) Then, we propose a novel and unified framework, Multi-Level Consistency Network (MLC-Net), which employs a teacher-student paradigm to generate adaptive and reliable pseudo-targets. MLC-Net exploits point-, instance- and neural statistics-level consistency to facilitate cross-domain transfer. Extensive experiments demonstrate that MLC-Net outperforms existing state-of-the-art methods (including those using additional target domain information) on standard benchmarks. Notably, our approach is detector-agnostic, which achieves consistent gains on both single- and two-stage 3D detectors.



1 Introduction


With the prevalent use of LiDARs for autonomous vehicles and mobile robots, 3D object detection on point clouds has drawn increasing research attention. In recent years, large-scale 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes] have empowered deep learning-based models [shi2019pointrcnn, yang20203dssd, yan2018second, lang2019pointpillars, shi2020pvrcnn, yang2019std, qi2019votenet, sindagi2019mvx, shi2020parta2, zhu2020ssn, yin2020centerpoint] to achieve remarkable success. However, deep learning models trained on one dataset (source domain) often suffer tremendous performance degradation when evaluated on another dataset (target domain). We investigate the bounding box scale mismatch problem (e.g., vehicles in the U.S. are noticeably larger than those in Germany), which is found to be a major contributor to the domain gap, in alignment with previous work [wang2020traininger]. This is unique to 3D detection: compared to 2D bounding boxes, whose sizes vary widely with the distance of the object from the camera, 3D bounding boxes have a more consistent size within the same dataset, regardless of the objects’ location relative to the LiDAR sensor. Hence, the detector tends to memorize a narrow, dataset-specific distribution of bounding box sizes from the source domain (Figure 2).

Unfortunately, existing works are inadequate to address the domain gap with a realistic setup. Recent methods on domain adaptive 3D detection either require some labeled data from the target domain for finetuning or utilize some additional statistics (such as the mean size) of the target domain [wang2020traininger]. However, such knowledge of the target domain is not always available. In addition, popular 2D unsupervised domain adaptation methods that leverage feature alignment techniques [chen2018dafastercnn, saito2019strongweak, zheng2020crossdomain, chen2020harmonizing, xu2020exploring, li2020spatialattention] to mitigate domain shift are not readily transferable to 3D detection. While these methods are effective in handling domain gaps due to lighting, color, and texture variations, such information is unavailable in point clouds. Instead, point clouds pose unique challenges such as the geometric mismatch discussed above.

Therefore, we propose MLC-Net for unsupervised domain adaptive 3D detection. MLC-Net is designed to tackle two major challenges. The first is to create meaningful scale-adaptive targets to facilitate learning. Specifically, MLC-Net employs the mean teacher [tarvainen2017meanteacher] learning paradigm. The teacher model is essentially a temporal ensemble of student models: the parameters of the teacher model are updated as an exponential moving average of the student models of preceding iterations. Our analyses show that the mean teacher produces accurate and stable supervision for the student model without any prior knowledge of the target domain. To the best of our knowledge, we are the first to introduce the mean teacher paradigm in unsupervised domain adaptive 3D detection. The second is to design scale-related consistency losses and to construct useful correspondences between teacher and student predictions so that gradients can flow. To this end, we design MLC-Net to enforce consistency at three levels. 1) Point-level. As point clouds are unstructured, point-based region proposals or equivalents [shi2019pointrcnn, yang20203dssd] are common. Hence, we sample the same subset of points and share them between the teacher and student. We retain the indices of the points, which allows 3D augmentation methods to be applied without losing the correspondences. 2) Instance-level. Matching region proposals can be erroneous, especially at the initial stage when the quality of region proposals is substandard. Hence, we resort to transferring teacher region proposals to the student to circumvent the matching process. 3) Neural statistics-level. As the teacher model only accesses the target domain input, the mismatch between the batch statistics hinders effective learning. We thus transfer the student’s statistics, which are gathered from both the source and the target domain, to the teacher to achieve a more stable training behavior.

MLC-Net shows remarkable compatibility with popular mainstream 3D detectors, allowing us to implement it on both two-stage [shi2019pointrcnn] and single-stage [yang20203dssd] detectors. Moreover, we verify our design through rigorous experiments across multiple widely used 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes]. Our method outperforms baselines by convincing margins, even surprisingly surpassing existing methods that utilize additional information. In summary, our main contributions are:

  • We formulate and study unsupervised domain adaptive 3D detection, a pragmatic, yet underexplored task that requires no information of the target domain. We comprehensively investigate the major underlying factors of the domain gap in 3D detection and find geometric mismatch is the key factor.

  • We propose a concise yet effective mean-teacher paradigm that leverages three levels of consistency to facilitate cross-domain transfer, achieving a significant performance boost that is consistent on various mainstream detectors and across multiple popular public datasets.

  • We validate our hypothesis on the unique challenges associated with point clouds and verify our proposed approach with comprehensive evaluations, which we hope would lay a strong foundation for future research.

2 Related Works

LiDAR-based 3D Detection. LiDAR-based 3D detection methods mainly fall into two categories, namely grid-based methods and point-based methods. Grid-based approaches convert the whole point cloud scene to grids of fixed size and process the input with 2D or 3D CNNs. MV3D [mv3d] first projects point clouds to bird's-eye view images to generate proposals. PointPillars [lang2019pointpillars] performs voxelization on point clouds and converts the representation to 2D. VoxelNet [zhou2018voxelnet] obtains voxel representations by applying PointNet [qi2017pointnet] to points and processes the features with 3D convolution. SECOND [yan2018second] applies 3D sparse convolution [graham20183dsparseconv] to improve the efficiency. PV-RCNN [shi2020pvrcnn] proposes to combine voxelization and point-based set abstraction to obtain more discriminative features. On the other hand, point-based methods directly extract features from raw point cloud input. F-PointNet [qi2018frustum] applies PointNet [qi2017pointnet] to perform 3D detection based on 2D bounding boxes. PointRCNN proposes a two-stage framework to generate bounding box proposals from the whole point cloud and refine them with feature pooling. 3DSSD proposes to use F-FPS for better point sampling to achieve single-stage detection. In this work, we focus our discussion on PointRCNN [shi2019pointrcnn] as the base model, but we show that our method is also compatible with a single-stage detector (3DSSD) in the Supplementary Material.

Point Cloud Domain Adaptation. While extensive research has been conducted on domain adaptation tasks with 2D image data, the literature on 3D point cloud domain adaptation is relatively small. PointDAN [qin2019pointdan] proposes to jointly align local and global features using discrepancy loss and adversarial training for point cloud classification. Achituve et al. [achituve2021selfsuupervised] introduce an additional self-supervised reconstruction task to improve the classification performance on the target domain. [yi2020completeandlabel] designs a sparse voxel completion network to perform point cloud completion for domain adaptive semantic segmentation. [jaritz2020xmuda] leverages multi-modal information by projecting point clouds to 2D images and training models jointly. For object detection, [wang2020traininger] identifies the major domain gap among autonomous driving datasets and proposes to mitigate the gap by leveraging target statistical information. SF-UDA [saltori2020sourcefree] computes motion coherence over consecutive frames to select the best scale for the target domain. Our proposed method works under a similar setup to [wang2020traininger] but does not require target domain statistical information.

Mean Teacher. The mean teacher framework [tarvainen2017meanteacher] was first proposed for semi-supervised learning tasks. Many variants [cubuk2018autoaugment, berthelot2019mixmatch, xie2019unsupervised] have been proposed to further improve its performance. Furthermore, the framework has also been applied to other fields such as domain adaptation [french2017selfensembling, cai2019exploringobjectrelation] and self-supervised learning [he2020moco, grill2020byol, liu2020selfemd], where labeled data is scarce or unavailable. Specifically, the mean teacher framework incorporates one trainable student model and one non-trainable teacher model whose weights are obtained from the exponential moving average of the student model’s weights. The student model is optimized based on the consistency loss between the student and teacher network predictions. In particular, although [cai2019exploringobjectrelation] also employs the mean teacher paradigm for the detection task by aligning region-level features, point cloud detection models are substantially different from 2D detectors, and our proposed method differs by incorporating multi-level consistency.

Figure 3: The network architecture of our proposed MLC-Net. MLC-Net leverages the mean-teacher [tarvainen2017meanteacher] paradigm, where the teacher is the exponential moving average (hence the name mean-teacher) of the student model and is updated at every iteration. This mean-teacher design provides high-quality pseudo labels to facilitate smooth learning of the student model. Towards this goal, we enforce consistency at three levels. First, at point-level, 3D proposals are associated based on point correspondences, which are established by sampling the same set of points from the target domain for both the student and teacher models. Second, at instance-level, the teacher 3D proposals are passed to the student Box Refinement Network, so the correspondences between 3D box predictions from the two models are naturally maintained. Third, at neural statistics-level, we find that the non-learnable parameters in batch normalization layers exhibit significant domain shift; we thus align the teacher’s parameters with the student’s. We highlight the efficacy of MLC-Net and further discuss our design motivations in Section 3. Best viewed in color.

3 Our Approach

In this section, we formulate the 3D point cloud domain adaptive detection problem (Section 3.1), and provide an overview of our MLC-Net (Section 3.2), followed by the details of our mean-teacher paradigm (Section 3.3). Finally, we explain the details of the point-level (Section 3.4), instance-level (Section 3.5), and statistics-level (Section 3.6) consistency of our MLC-Net.

3.1 Problem Definition

Under the unsupervised domain adaptation setting, we have access to point cloud data from one labeled source domain $\mathcal{D}_s = \{(x_s^i, y_s^i)\}_{i=1}^{N_s}$ and one unlabeled target domain $\mathcal{D}_t = \{x_t^i\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples from the source and target domains, respectively. Each point cloud scene $x^i$ consists of points with their 3D coordinates, while $y_s^i$ denotes the label of the corresponding training sample from the source domain. $y_s^i$ is in the form of an object class $k$ and a 3D bounding box parameterized by the center location of the bounding box $(c_x, c_y, c_z)$, the size in each dimension $(d_x, d_y, d_z)$, and the orientation $\theta$. The goal of the domain adaptive detection task is to train a model based on $\mathcal{D}_s$ and $\mathcal{D}_t$ and maximize the performance on $\mathcal{D}_t$.

3.2 Framework Overview

We illustrate MLC-Net in Figure 3. The labeled source input $(x_s, y_s)$ is used for standard supervised training of the student model with loss $\mathcal{L}_{src}$. For each unlabeled target domain example $x_t$, we perturb it by applying random augmentation $\mathcal{A}$ to obtain $\hat{x}_t$. The perturbed and original point cloud inputs are passed to the student model and teacher model, respectively, to get their point-level box proposals $P_{pt}^{S}$ and $P_{pt}^{T}$, where point-level consistency is applied. Subsequently, teacher proposals are passed to the student model for box refinement to obtain $P_{ins}^{S}$. Together with the teacher’s instance-level predictions $P_{ins}^{T}$, the instance-level consistency is applied. The overall consistency loss is computed as:

$$\mathcal{L}_{cons} = \mathcal{L}_{pt}^{cls} + \mathcal{L}_{pt}^{box} + \mathcal{L}_{ins}^{cls} + \mathcal{L}_{ins}^{box},$$

where pt, ins, cls and box stand for point-level, instance-level, classification and box regression, respectively. These loss components are elaborated in Sections 3.4 and 3.5. In each iteration, the student model is updated through gradient descent with the total loss $\mathcal{L}$, which is a weighted sum of $\mathcal{L}_{src}$ and $\mathcal{L}_{cons}$:

$$\mathcal{L} = \mathcal{L}_{src} + \lambda \mathcal{L}_{cons},$$

where $\lambda$ is the weight coefficient. The learnable parameters of the student model are then used to update the corresponding teacher model parameters, as detailed in Section 3.3. In addition, we enforce non-learnable parameters to be aligned between the teacher and the student via neural statistics-level consistency (Section 3.6).

MLC-Net achieves two major design goals towards effective unsupervised 3D domain adaptive detection. The first is to generate accurate and robust pseudo targets without any access to the target domain annotation or statistical information. MLC-Net leverages a mean teacher paradigm where the teacher model can be regarded as a temporal ensemble of student models, allowing it to produce high-quality predictions and guide the learning of the student. The second is to design effective consistency losses at point-, instance- and neural statistics-level that enhance adaptability to scale variation, and to construct the teacher-student correspondences that allow the back-propagated gradient to flow through the correct routes. Although we conduct most analyses on PointRCNN [shi2019pointrcnn] as the representative of two-stage 3D detectors, we highlight that our method is generic and can be easily extended to single-stage detection models such as 3DSSD [yang20203dssd] with modest modifications (see Supplementary Material).

3.3 Mean Teacher

Motivated by the success of the mean teacher paradigm [tarvainen2017meanteacher] in semi-supervised learning and self-supervised learning, we apply it to our point cloud domain adaptive detection task as illustrated in Figure 3. The framework consists of a student model and a teacher model with the same network architecture but different weights $W^{S}$ and $W^{T}$, respectively. The weights of the teacher model are updated by taking the exponential moving average of the student model weights:

$$W^{T} \leftarrow m\, W^{T} + (1 - m)\, W^{S},$$

where $m$ is known as the momentum, which is usually a number close to 1, e.g. 0.99. Figure 5 shows that the teacher model constantly provides effective supervision to the student model via high-quality pseudo targets. Hence, by enforcing the consistency between the student and the teacher, the student learns domain-invariant representations to adapt to the unlabeled target domain, guided by the pseudo labels. We show in Table 5 that the mean teacher significantly improves model performance compared to the baseline.
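The exponential moving average update described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the dictionary-of-scalars "models", the function name `ema_update`, and the toy values are hypothetical and not taken from the paper's implementation.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return the teacher weights after one exponential-moving-average step.

    Both arguments map parameter names to floats (a stand-in for tensors).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example: one scalar weight per model (hypothetical values).
teacher = {"w": 0.0}
student = {"w": 1.0}

# The student is held fixed here, so the teacher drifts towards it:
# after k steps the teacher weight equals 1 - momentum**k.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and averages over many past student states, which is what makes it behave like a temporal ensemble.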

3.4 Point-Level Consistency

The point-level consistency loss is calculated between the first-stage box proposals of the student and teacher models. One of the key challenges in formulating consistency is to find the correspondence between the student and the teacher. Unlike image pixels that are arranged in regular lattices, points reside in continuous 3D space, which lacks structure [qi2017pointnet]. Hence, constructing point correspondences after the fact can be problematic (Table 3). Instead, we circumvent the difficulty by feeding the teacher and the student two identical sets of points at the very beginning and tracing the point indices to maintain correspondences.

Specifically, for each target domain example, we sample points from the point cloud scene to obtain the teacher input $x_t$ and apply random augmentation $\mathcal{A}$ on a replicated set to obtain the student input $\hat{x}_t$, with $\hat{x}_t = \mathcal{A}(x_t)$. $\mathcal{A}$ consists of random global scaling of the point cloud scenes and can be regarded as applying displacements to individual points, without disrupting the point correspondences. As a result, each point $\hat{p}_i \in \hat{x}_t$ corresponds to a point $p_i \in x_t$, and this relationship holds for the point-level predictions of the region proposal network. We denote the first-stage prediction as $P_{pt}$, with $P_{pt}^{S}$ for the student and $P_{pt}^{T}$ for the teacher. Note that the point correspondences are transferred to box proposals as each point generates one box proposal. $P_{pt}$ consists of class prediction $c_{pt}$ and box regression $b_{pt}$. For the class predictions, we define the consistency loss as the Kullback-Leibler (KL) divergence between each point pair from $\hat{x}_t$ and $x_t$:

$$\mathcal{L}_{pt}^{cls} = \frac{1}{|x_t|} \sum_{i=1}^{|x_t|} D_{KL}\!\left(c_{pt,i}^{T} \,\|\, c_{pt,i}^{S}\right),$$

where $|x_t|$ stands for the number of points in $x_t$.
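As a rough illustration of a per-point KL consistency term, the sketch below averages the KL divergence over index-aligned point pairs. It assumes the teacher distribution is the reference (first argument of the KL), which is a common convention for mean-teacher training but is an assumption here; the function names and the toy probabilities are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def point_cls_consistency(teacher_probs, student_probs):
    """Mean KL divergence over point pairs that share the same index."""
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must predict for the same points")
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_probs, student_probs))
    return total / len(teacher_probs)


# Two points, two classes (background / foreground); values are made up.
teacher_probs = [[0.9, 0.1], [0.2, 0.8]]
student_probs = [[0.9, 0.1], [0.5, 0.5]]
loss = point_cls_consistency(teacher_probs, student_probs)
```

Only the second point contributes to the loss in this toy case, since the first pair of distributions is identical.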

More importantly, we enforce consistency between bounding box regression predictions to address geometric mismatch. For the bounding box predictions, we only compute the consistency over points belonging to the objects because the background points do not generate meaningful bounding boxes. We obtain a set of points $\mathcal{M}_{pt}$ which fall inside the bounding boxes of the final predictions of both the student and teacher networks, with $\mathcal{M}_{pt} = \{\, i \mid \hat{p}_i \in b_{ins}^{S},\ p_i \in b_{ins}^{T} \,\}$, where $b_{ins}^{S}$ and $b_{ins}^{T}$ are the refined bounding box predictions after the second stage (see Section 3.5). We then compute the point-level box consistency loss as:

$$\mathcal{L}_{pt}^{box} = \frac{1}{|\mathcal{M}_{pt}|} \sum_{i \in \mathcal{M}_{pt}} \mathcal{L}_{1}^{smooth}\!\left(b_{pt,i}^{S} - \mathcal{A}\!\left(b_{pt,i}^{T}\right)\right),$$

where $\mathcal{L}_{1}^{smooth}$ is the smooth $L_1$ loss and $\mathcal{A}$ is the random augmentation applied to the input $x_t$. We apply the same augmentation to the teacher bounding box predictions to align with the scale of the student point cloud scene before computing the consistency.
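The following sketch illustrates the idea of rescaling teacher boxes before comparing them with student boxes. It assumes that the augmentation is a single global scale factor, that boxes are 7-tuples `(cx, cy, cz, dx, dy, dz, theta)`, and that the smooth L1 threshold is 1.0; all names and values are hypothetical, and angle wrap-around is ignored for brevity.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) penalty on a scalar residual."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def scale_box(box, s):
    """Apply a global scaling to a box: centre and size scale, orientation does not."""
    cx, cy, cz, dx, dy, dz, theta = box
    return (cx * s, cy * s, cz * s, dx * s, dy * s, dz * s, theta)


def point_box_consistency(student_boxes, teacher_boxes, fg_indices, s):
    """Mean smooth L1 between student boxes and rescaled teacher boxes.

    Only foreground point indices are used; boxes are matched by index.
    """
    if not fg_indices:
        return 0.0
    total = 0.0
    for i in fg_indices:
        target = scale_box(teacher_boxes[i], s)
        total += sum(smooth_l1(a - b) for a, b in zip(student_boxes[i], target))
    return total / len(fg_indices)


scale = 1.1
teacher_boxes = [(10.0, 2.0, 0.0, 4.0, 2.0, 1.5, 0.3), (0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0)]
# Student box 0 agrees with the rescaled teacher box except for a 0.2 m length error.
perfect = scale_box(teacher_boxes[0], scale)
student_boxes = [perfect[:3] + (perfect[3] + 0.2,) + perfect[4:], teacher_boxes[1]]
loss = point_box_consistency(student_boxes, teacher_boxes, fg_indices=[0], s=scale)
```

Because the teacher target is rescaled first, a student that predicts the correctly scaled box incurs zero loss, so the penalty reflects only genuine disagreement in geometry.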

3.5 Instance-Level Consistency

In the second stage, NMS is performed on $P_{pt}$ to obtain high-confidence region proposals, denoted as $r$, for each point cloud scene. We highlight that the association between region proposals from the student and teacher models is lost in the NMS due to the differences between $P_{pt}^{S}$ and $P_{pt}^{T}$. To match the instance-level predictions for consistency computation, a common method is to perform greedy matching based on IoU between teacher and student region proposals. However, such matching is not robust due to the large number of noisy predictions, which leads to ineffective learning, as shown experimentally in Table 3. Hence, we adopt a simple approach: we replicate the teacher region proposals $r^{T}$ to the student model and apply the input augmentation $\mathcal{A}$ to match the scale of the student input. Subsequently, we perturb the region proposals by applying random RoI augmentation $\phi$ to both sets of region proposals before they are used for feature pooling. The motivation of this operation is to force the models to output consistent predictions given non-identical region proposals and to prevent convergence to trivial solutions. Formally, the above process can be described as $f^{S} = \mathrm{Pool}(\phi(\mathcal{A}(r^{T})))$ and $f^{T} = \mathrm{Pool}(\phi(r^{T}))$ for the student and teacher models, respectively, where $f$ denotes the instance-level features obtained from feature pooling as described in [shi2019pointrcnn]. The pooled features are then passed to the box refinement network to obtain the second-stage predictions $P_{ins}$. Similar to the first-stage prediction $P_{pt}$, $P_{ins}$ consists of the class prediction $c_{ins}$ as well as the bounding box prediction $b_{ins}$. We define the instance-level class consistency as the difference between $c_{ins}^{S}$ and $c_{ins}^{T}$:

$$\mathcal{L}_{ins}^{cls} = \frac{1}{|r^{T}|} \sum_{j=1}^{|r^{T}|} D_{KL}\!\left(c_{ins,j}^{T} \,\|\, c_{ins,j}^{S}\right),$$

where $|r^{T}|$ denotes the number of region proposals. On the other hand, to compute the instance-level box consistency loss, we first obtain a set of positive predictions $\mathcal{M}_{ins}$ by selecting bounding boxes with classification predictions larger than a probability threshold $\tau$. We then apply $\mathcal{A}$ to $b_{ins}^{T}$ to match the scale and compute the instance-level box consistency loss based on the discrepancy between $b_{ins}^{S}$ and $\mathcal{A}(b_{ins}^{T})$ for the selected predictions:

$$\mathcal{L}_{ins}^{box} = \frac{1}{|\mathcal{M}_{ins}|} \sum_{j \in \mathcal{M}_{ins}} \mathcal{L}_{1}^{smooth}\!\left(b_{ins,j}^{S} - \mathcal{A}\!\left(b_{ins,j}^{T}\right)\right).$$
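The proposal-sharing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a global scale factor as the input augmentation, a simple shift-and-resize jitter as the RoI augmentation, and that positives are selected from a list of confidence scores; the jitter ranges, function names, and box values are all hypothetical.

```python
import random


def scale_box(box, s):
    """Global scaling of a (cx, cy, cz, dx, dy, dz, theta) box."""
    cx, cy, cz, dx, dy, dz, theta = box
    return (cx * s, cy * s, cz * s, dx * s, dy * s, dz * s, theta)


def jitter_box(box, rng, max_shift=0.1, max_scale=0.05):
    """Random RoI augmentation: small centre shift and small size change."""
    cx, cy, cz, dx, dy, dz, theta = box
    sx, sy, sz = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    k = 1.0 + rng.uniform(-max_scale, max_scale)
    return (cx + sx, cy + sy, cz + sz, dx * k, dy * k, dz * k, theta)


def build_rois(teacher_proposals, scale, rng):
    """Replicate teacher proposals for the student, keeping index j aligned.

    The student copy is rescaled to the augmented scene; both copies are
    jittered independently so the two models see non-identical RoIs.
    """
    teacher_rois = [jitter_box(r, rng) for r in teacher_proposals]
    student_rois = [jitter_box(scale_box(r, scale), rng) for r in teacher_proposals]
    return student_rois, teacher_rois


def positive_indices(scores, threshold):
    """Indices of predictions whose confidence exceeds the threshold."""
    return [j for j, p in enumerate(scores) if p > threshold]


rng = random.Random(0)
teacher_proposals = [
    (10.0, 2.0, 0.0, 4.0, 2.0, 1.5, 0.3),
    (20.0, -3.0, 0.0, 4.2, 1.9, 1.6, 1.2),
]
student_rois, teacher_rois = build_rois(teacher_proposals, scale=1.1, rng=rng)
positives = positive_indices([0.9, 0.2], threshold=0.5)
```

Since the student RoIs are derived from the teacher proposals rather than matched afterwards, RoI `j` in one list always refers to the same object as RoI `j` in the other, and no IoU-based matching is needed.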


3.6 Neural Statistics-Level Consistency

While the student model takes both source domain data $x_s$ and target domain data $x_t$ as input, the teacher model only has access to the target data $x_t$. The distribution shift between source and target data could lead to mismatched batch statistics between the batch normalization (BN) layers of the student and teacher models. This mismatch could cause misaligned normalization and, in turn, lead to an unstable training process with degraded performance or even divergence. We provide an in-depth analysis regarding this matter in Section 4.4.

To mitigate this issue, we propose to use the running statistics of the student model BN layers for the teacher model during the training process. Specifically, for each BN layer in the student model, the batch mean $\mu_{B}$ and variance $\sigma_{B}^{2}$ are used to update the running statistics at every iteration:

$$\mu \leftarrow (1 - \alpha)\,\mu + \alpha\,\mu_{B}, \qquad \sigma^{2} \leftarrow (1 - \alpha)\,\sigma^{2} + \alpha\,\sigma_{B}^{2},$$

where $\mu$ and $\sigma^{2}$ are the running averages of $\mu_{B}$ and $\sigma_{B}^{2}$, and $\alpha$ is the BN momentum that controls how quickly the batch statistics update the running statistics. For the teacher model, we use $\mu$ and $\sigma^{2}$ instead of the batch statistics for all the BN layers to normalize the layer inputs. We argue that this modification closes the gap caused by domain mismatch and leads to more stable training behavior. We empirically demonstrate the effectiveness by comparing the performance under different BN settings in Section 4.3.
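A toy version of this statistics sharing is sketched below for a single scalar feature channel. It is an assumption-laden illustration: the class `RunningStats`, the biased (population) variance, the initial values of 0 and 1, and the feature values are all hypothetical choices rather than details from the paper.

```python
import math


class RunningStats:
    """Running mean and variance of one BN channel, updated by the student."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.mean = 0.0
        self.var = 1.0

    def update(self, batch):
        """Fold one batch's statistics into the running statistics."""
        mu_b = sum(batch) / len(batch)
        var_b = sum((v - mu_b) ** 2 for v in batch) / len(batch)
        a = self.momentum
        self.mean = (1.0 - a) * self.mean + a * mu_b
        self.var = (1.0 - a) * self.var + a * var_b
        return mu_b, var_b


def normalize(batch, mean, var, eps=1e-5):
    """Normalize features with the given statistics (affine parameters omitted)."""
    return [(v - mean) / math.sqrt(var + eps) for v in batch]


stats = RunningStats(momentum=0.1)

# Student step: the batch mixes source and target features (made-up values).
mixed_batch = [1.0, 3.0, 5.0, 7.0]
stats.update(mixed_batch)

# Teacher step: target-only batch, normalized with the student's running
# statistics instead of its own batch statistics.
teacher_batch = [5.0, 7.0]
teacher_out = normalize(teacher_batch, stats.mean, stats.var)
```

The teacher never computes statistics from its target-only batch, so both models normalize features on the same scale even though they see different data.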


KITTI → Waymo

Methods                          AP/L1    APH/L1   AP/L2    APH/L2
Direct Transfer                  0.0917   0.0899   0.0794   0.0778
Wide-Range Aug                   0.1861   0.1818   0.1677   0.1640
DA-Faster [chen2018dafastercnn]  0.0696   0.0687   0.0642   0.0633
OT [wang2020traininger]          0.2648   0.2584   0.2385   0.2329
SN [wang2020traininger]          0.3069   0.3006   0.2723   0.2667
Ours                             0.3821   0.3774   0.3446   0.3404

Waymo → KITTI

Methods                          Easy      Moderate  Hard
Direct Transfer                  20.2213   21.4261   20.4927
Wide-Range Aug                   30.2341   31.4959   32.8531
DA-Faster [chen2018dafastercnn]  4.4248    5.5510    5.5296
OT [wang2020traininger]          39.7762   37.8212   39.5546
SN [wang2020traininger]          61.9289   58.0656   58.4406
Ours                             69.3518   59.4454   56.2913

KITTI → nuScenes

Methods                          ATE     ASE     AOE     AP
Direct Transfer                  0.207   0.248   0.212   13.0073
Wide-Range Aug                   0.200   0.228   0.211   16.0081
DA-Faster [chen2018dafastercnn]  0.247   0.253   0.292   10.7661
OT [wang2020traininger]          0.207   0.220   0.212   14.6650
SN [wang2020traininger]          0.227   0.168   0.368   23.1491
Ours                             0.197   0.179   0.197   23.4720

nuScenes → KITTI

Methods                          Easy      Moderate  Hard
Direct Transfer                  49.1303   39.5565   35.5127
Wide-Range Aug                   58.7072   45.3730   43.0254
DA-Faster [chen2018dafastercnn]  52.2501   40.6209   35.9015
OT [wang2020traininger]          23.1286   27.2584   29.0979
SN [wang2020traininger]          44.8135   45.1496   47.5991
Ours                             71.2648   55.4152   48.9880


Table 1: Performance of MLC-Net on four source-target pairs in comparison with various baselines and state-of-the-art methods. MLC-Net outperforms all baselines and even surpasses SOTA methods that utilize target domain annotation information (OT and SN). Direct Transfer: the model trained on the source domain is directly tested on the target domain. Wide-Range Aug: a baseline with random scaling augmentation over a wide range that potentially includes the target domain scales. Its results confirm that the drastic performance degradation cannot be fully mitigated by simple data augmentation. DA-Faster: we also compare with adversarial feature alignment [chen2018dafastercnn], a common technique used in 2D domain adaptation, with the implementation adapted from 2D to 3D. However, feature alignment is unable to solve the geometric mismatch, which we argue is unique to 3D detection. The state-of-the-art work [wang2020traininger] proposes to perform output transformation (OT) to scale predictions and statistical normalization (SN) for scale-adjusted training examples. Both OT and SN require known target domain statistics. MLC-Net, despite being fully unsupervised, even surpasses these methods on key metrics: APH/L2 (Waymo), AP (nuScenes), and AP Moderate (KITTI).

4 Experiments

We first introduce the popular autonomous driving datasets used in the experiments, including KITTI [geiger2013kittidataset], Waymo Open Dataset [sun2020waymo], and nuScenes [caesar2020nuscenes] (Section 4.1). We then benchmark MLC-Net across datasets, where it achieves a consistent performance boost (Section 4.2). Moreover, we ablate MLC-Net to give a comprehensive assessment of its submodules and justify our design choices in Section 4.3. Finally, we further investigate the challenges of unsupervised domain adaptive 3D detection and show how MLC-Net addresses them in Section 4.4. Due to space constraints, we include the implementation details in the Supplementary Material.

4.1 Datasets

We follow [wang2020traininger] to evaluate MLC-Net on various source-target combinations with the following datasets.

KITTI. KITTI [geiger2013kittidataset] is a popular autonomous driving dataset that consists of 3,712 training samples and 3,769 validation samples. The 3D bounding box annotations are only provided for objects within the Field of View (FoV) of the front camera. Therefore, points outside of the FoV are ignored during training and evaluation. We use the official KITTI evaluation metrics for evaluation where the objects are categorized into three levels (Easy, Moderate, and Hard) based on the number of pixels, occlusion and truncation levels.

Waymo Open Dataset. The Waymo Open Dataset (referred to as Waymo) [sun2020waymo] is a large-scale benchmark that contains 122,000 training samples and 30,407 validation samples. We subsample 1/10 of the training and validation sets. To align the input convention, we apply the same front camera FoV as in the KITTI dataset. The official Waymo evaluation metrics are used to benchmark the performance.

nuScenes. The nuScenes [caesar2020nuscenes] dataset consists of 28,130 training samples and 6,019 validation samples. We subsample the training dataset by 50% and use the entire validation set. We also apply the same FoV on the input as other datasets. We adopt the official evaluation metrics of translation, scale, and orientation errors, with the addition of the commonly used average precision based on 3D IoU with a threshold of 0.7 to reflect the overall detection accuracy.

4.2 Benchmarking Results

As cross-domain point cloud detection is an emerging research area, the literature on it is relatively small. To the best of our knowledge, [wang2020traininger] is the most relevant work, with a setting similar to that of our study. We compare our method with two normalization methods proposed in [wang2020traininger], namely Output Transformation (OT) and Statistical Normalization (SN), where the former transforms the predictions by an offset and the latter trains the detector with scale-normalized input. Moreover, we also compare with the adversarial feature alignment method, which is commonly used on image-based tasks, by adapting DA-Faster [chen2018dafastercnn] to our PointRCNN [shi2019pointrcnn] base model. We also provide Direct Transfer and Wide-Range Augmentation as baselines. More results can be found in the Supplementary Material.

Table 1 reports the cross-domain detection performance on four source-target domain pairs. MLC-Net outperforms all unsupervised baselines by convincing margins. We highlight that our method adapts the scale for each instance instead of applying a global shift, allowing us to surpass state-of-the-art methods that utilize target domain statistical information.

4.3 Ablation Study

To evaluate the effectiveness of the components of MLC-Net, we conduct ablation studies on KITTI → Waymo transfer with PointRCNN as the base model.

Effectiveness of Point/Instance-Level Consistency. We study the effects of different components of the proposed consistency loss. Table 2 reports the experimental results when different combinations of loss components are applied. It is observed that, for both point-level consistency and instance-level consistency, the box consistency clearly has a larger contribution as compared to the class consistency. This observation indicates that the scale difference is a major source of the domain gap between source and target domains with different object size distributions, which is also in line with the previous work [wang2020traininger]. It also shows that our proposed box consistency regularization method effectively mitigates this gap. In addition, all losses are complementary to one another: the best result is achieved when all four of them are used.


AP/L1    APH/L1   AP/L2    APH/L2
0.1861   0.1818   0.1677   0.1640
0.2034   0.1991   0.1807   0.1770
0.3034   0.2969   0.2708   0.2649
0.3100   0.3039   0.2764   0.2709
0.2112   0.2087   0.1879   0.1857
0.3321   0.3244   0.2995   0.2926
0.3495   0.3453   0.3143   0.3105
0.3821   0.3774   0.3446   0.3404


Table 2: Ablation study of point-level and instance-level consistency loss components. Results show loss components are highly complementary; the joint use of all four losses at two levels achieves the best performance. More importantly, we find that the bounding box regression loss, which is directly associated with bounding box scale, benefits the performance more than the classification loss. This further validates our stance that geometric mismatch is a key domain gap for 3D detection.

Furthermore, we compare MLC-Net with two alternative approaches for point and box matching, respectively, in Table 3. In contrast to these baseline approaches, MLC-Net replicates the input point clouds and the region proposals before they are passed to the student and teacher models, which eliminates the noise that arises from inaccurate matching. The results highlight the importance of correspondence in constructing meaningful consistency losses for effective unsupervised learning.


Matching Method AP/L1 APH/L1 AP/L2 APH/L2
Nearest Point 0.0293 0.0286 0.0265 0.0258
Max IoU Box 0.2695 0.2666 0.2418 0.2392
Ours 0.3821 0.3774 0.3446 0.3404


Table 3: Ablation study of point-level and instance-level matching methods. Nearest Point: a baseline for point matching where a point in the student input is matched to the nearest point in the teacher input using Euclidean distance. Max IoU Box: a baseline for box matching where a student box prediction is matched to the teacher pseudo label with the largest IoU. Ours: input point clouds or region proposals of the student are replicated from the teacher. We highlight that our matching method ensures accurate one-to-one correspondence, which is critical to effective teacher-student learning.
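The difference between nearest-point matching and replicated inputs can be illustrated with a small sketch. It assumes the student input is a transformed copy of the teacher input that keeps the point order, so index i in one cloud corresponds to index i in the other. The scaling factor is deliberately exaggerated, and the function names and coordinates are hypothetical.

```python
import math

def nearest_point_match(student_pts, teacher_pts):
    """For each student point, return the index of the closest teacher point."""
    return [
        min(range(len(teacher_pts)), key=lambda j: math.dist(p, teacher_pts[j]))
        for p in student_pts
    ]

def replicated_match(student_pts):
    """With replicated inputs, student index i maps to teacher index i by construction."""
    return list(range(len(student_pts)))

def scale(points, factor):
    """Hypothetical augmentation: uniform scaling about the origin."""
    return [tuple(c * factor for c in p) for p in points]

teacher_pts = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
student_pts = scale(teacher_pts, 2.0)  # same points, same order, after augmentation

print(nearest_point_match(student_pts, teacher_pts))  # [1, 2, 2]
print(replicated_match(student_pts))                  # [0, 1, 2]
```

In this toy case the nearest-neighbor baseline assigns every student point to the wrong teacher point, so a consistency loss built on it would compare predictions for unrelated points. Index-based correspondence is exact regardless of the augmentation.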

Effectiveness of Neural Statistics-Level Consistency. We also evaluate the effectiveness of neural statistics-level consistency by comparing the performance when such alignment is enabled and disabled. Table 4 shows that model performance drops severely when neural statistics-level consistency is disabled. As analyzed in Section 3.6, when neural statistics-level consistency is not in place, the teacher model's BN layers normalize the input features using batch statistics obtained from target data only, while the student model performs BN with statistics from both source and target domains. This misalignment creates a significant gap and, as a result, invalidates the consistency computation between the student and teacher predictions. We also compare with an approach in which the student model performs separate BN for source and target data. In this case, although the normalization of target inputs is performed with target statistics in both models, the mismatched normalization of source and target inputs leads to suboptimal performance compared to MLC-Net.


Setting AP/L1 APH/L1 AP/L2 APH/L2
Disabled 0.0279 0.0274 0.0254 0.0249
Separate 0.2988 0.2945 0.2685 0.2648
Enabled 0.3821 0.3774 0.3446 0.3404


Table 4: Ablation study of neural statistics-level consistency indicates that MLC-Net effectively closes the domain gap due to neural statistics mismatch. Disabled: no consistency is enforced. Separate: the student model performs BN separately for source and target domain inputs to align with the teacher model. Enabled: our proposed neural statistics-level alignment.
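The statistics mismatch described above can be reproduced on a toy example. The sketch below normalizes the same target features twice: once with statistics computed from target data only (as the teacher does when alignment is disabled) and once with statistics computed from a mixed source and target batch (as the student does). The feature values and the helper `batch_norm` are made up for illustration, and the learnable scale and shift of a real BN layer are omitted.

```python
import statistics

def batch_norm(values, stats_from, eps=1e-5):
    """Normalize `values` with the mean and variance computed from `stats_from`."""
    mean = statistics.mean(stats_from)
    var = statistics.pvariance(stats_from)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Made-up one-dimensional features with a clear domain offset.
source = [0.0, 1.0, 2.0, 3.0]
target = [10.0, 11.0, 12.0, 13.0]

teacher_out = batch_norm(target, stats_from=target)           # target-only statistics
student_out = batch_norm(target, stats_from=source + target)  # mixed-batch statistics
shared_out = batch_norm(target, stats_from=source + target)   # teacher reusing student statistics

gap = max(abs(t - s) for t, s in zip(teacher_out, student_out))
print(f"largest per-feature gap without shared statistics: {gap:.3f}")
```

The two models see identical target inputs yet produce different normalized features, so their predictions are not directly comparable. When both use the same statistics (`shared_out` versus `student_out`), the gap vanishes in this toy setting. How MLC-Net achieves the alignment is described in Section 3.6 and is not reproduced here.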

Effectiveness of Mean Teacher. The teacher model is essentially a temporal ensemble of student models at different time stamps. We study the effectiveness of the mean teacher paradigm by comparing the performance when the exponential moving average update is enabled and disabled. Table 5 shows that the moving average update is important for the teacher to generate meaningful supervision for the student model, and that removing this mechanism leads to a large performance drop.


EMA Update AP/L1 APH/L1 AP/L2 APH/L2
Disabled 0.0895 0.0866 0.0835 0.0808
Enabled 0.3821 0.3774 0.3446 0.3404


Table 5: Ablation study of the exponential moving average (EMA) update scheme in the mean teacher paradigm. The performance degrades significantly when the exponential moving average update is disabled, highlighting the importance of the mean teacher design in producing meaningful targets.
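The exponential moving average update behind the mean teacher can be sketched in a few lines. The decay values and the toy parameter trajectory below are assumptions chosen for illustration; the decay used by MLC-Net is not stated in this section.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Move each teacher parameter a small step toward the student parameter."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Toy trajectory: a single student "weight" that oscillates between steps.
student_trajectory = [[0.0], [2.0], [0.0], [2.0], [0.0], [2.0]]
teacher = [1.0]
teacher_trajectory = []
for student in student_trajectory:
    teacher = ema_update(teacher, student, decay=0.9)
    teacher_trajectory.append(teacher[0])

student_range = max(s[0] for s in student_trajectory) - min(s[0] for s in student_trajectory)
teacher_range = max(teacher_trajectory) - min(teacher_trajectory)
print(f"student range: {student_range:.3f}, teacher range: {teacher_range:.3f}")
```

The teacher averages out step-to-step fluctuations of the student, which is why it can be viewed as a temporal ensemble. With `decay=0` the teacher becomes a plain copy of the latest student. That is one possible reading of the Disabled setting, although the exact configuration is not spelled out here.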

4.4 Further Analysis

Analysis of Distribution Shift. We highlight that geometric mismatch is a significant issue for cross-domain deployment of 3D detection models. In Figure 2, the object dimension (length, width, and height) distributions differ drastically across domains, with relatively small overlap. The baseline, trained on the source domain, does not generalize to the target domain: the distribution of its dimension predictions remains close to that of the source domain. In contrast, MLC-Net adapts to the new domain by predicting a geometric distribution highly similar to that of the target domain.
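One simple way to quantify how much two dimension distributions overlap is the histogram overlap coefficient, sketched below. The object lengths are made-up numbers rather than statistics from any dataset, and the bin range and count are arbitrary choices.

```python
def histogram(values, lo, hi, bins):
    """Normalized histogram of `values` over [lo, hi], clamping out-of-range values."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return [c / len(values) for c in counts]

def overlap_coefficient(a, b, lo, hi, bins=10):
    """Sum of bin-wise minima: 1.0 for identical histograms, 0.0 for disjoint ones."""
    ha = histogram(a, lo, hi, bins)
    hb = histogram(b, lo, hi, bins)
    return sum(min(x, y) for x, y in zip(ha, hb))

# Hypothetical object lengths in meters for two domains and two sets of predictions.
source_lengths = [4.6, 4.8, 5.0, 5.2, 5.4]
target_lengths = [3.6, 3.8, 4.0, 4.2, 4.4]
baseline_preds = [4.5, 4.7, 4.9, 5.1, 5.3]  # still source-like
adapted_preds = [3.7, 3.9, 4.1, 4.3, 4.5]   # shifted toward the target

for name, preds in [("baseline", baseline_preds), ("adapted", adapted_preds)]:
    score = overlap_coefficient(preds, target_lengths, lo=3.0, hi=6.0, bins=6)
    print(f"{name} overlap with target: {score:.2f}")
```

A detector whose predicted dimensions stay source-like yields a low overlap with the target distribution, while an adapted detector yields a high one. The same measure could be applied separately to width and height.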

Analysis of Neural Statistics Mismatch. Figure 4 shows that inputs from different domains have very different distributions of batch statistics, which explains the severe performance drop when our proposed neural statistics-level consistency is not applied to align the statistics (Table 4).

Figure 4: Neural statistics mismatch across domains. We plot the distributions of batch mean and batch variance. Significant misalignment is observed, which highlights the necessity of neural statistics-level consistency.

Analysis of Teacher/Student Paradigm. In Figure 5, the teacher model in MLC-Net outperforms the student model throughout training until convergence. Moreover, the teacher model exhibits a smoother learning curve. This validates the effectiveness of our mean-teacher paradigm in creating accurate and reliable supervision for robust optimization of the student model.

Figure 5: Teacher and student model performance against iteration. Not only does the teacher model consistently outperform the student, its performance curve is also smoother. Hence, the teacher model, which can be regarded as a temporal ensemble of the student model, is able to produce more stable and accurate pseudo labels to supervise the student model.

5 Conclusion

We study unsupervised 3D domain adaptive detection that requires no target domain annotation or statistics. We validate that geometric mismatch is a major contributor to the domain shift and propose MLC-Net that leverages a teacher-student paradigm for robust and reliable pseudo label generation via point-, instance- and neural statistics-level consistency to enforce effective transfer. MLC-Net outperforms all the baselines by convincing margins, and even surpasses methods that require additional target information.