1 Introduction
With the prevalent use of LiDARs on autonomous vehicles and mobile robots, 3D object detection on point clouds has drawn increasing research attention. Large-scale 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes] in recent years have empowered deep learning-based models [shi2019pointrcnn, yang20203dssd, yan2018second, lang2019pointpillars, shi2020pvrcnn, yang2019std, qi2019votenet, sindagi2019mvx, shi2020parta2, zhu2020ssn, yin2020centerpoint] to achieve remarkable success. However, deep learning models trained on one dataset (source domain) often suffer tremendous performance degradation when evaluated on another dataset (target domain). We investigate the bounding box scale mismatch problem (e.g., vehicles in the U.S. are noticeably larger than those in Germany), which is found to be a major contributor to the domain gap, in line with previous work [wang2020traininger]. This problem is unique to 3D detection: whereas 2D bounding boxes can vary widely in size depending on the distance of the object from the camera, 3D bounding boxes have a more consistent size within the same dataset, regardless of the objects' location relative to the LiDAR sensor. Hence, the detector tends to memorize a narrow, dataset-specific distribution of bounding box sizes from the source domain (Figure 2).
Unfortunately, existing works are inadequate for addressing the domain gap under a realistic setup. Recent methods on domain adaptive 3D detection either require some labeled data from the target domain for fine-tuning or utilize additional statistics (such as the mean object size) of the target domain [wang2020traininger]. However, such knowledge of the target domain is not always available. In addition, popular 2D unsupervised domain adaptation methods that leverage feature alignment techniques [chen2018dafastercnn, saito2019strongweak, zheng2020crossdomain, chen2020harmonizing, xu2020exploring, li2020spatialattention] to mitigate domain shift are not readily transferable to 3D detection. While these methods are effective in handling domain gaps due to lighting, color, and texture variations, such information is unavailable in point clouds. Instead, point clouds pose unique challenges such as the geometric mismatch discussed above.
Therefore, we propose MLCNet for unsupervised domain adaptive 3D detection. MLCNet is designed to tackle two major challenges. The first is to create meaningful scale-adaptive targets to facilitate learning. Specifically, MLCNet employs the mean teacher [tarvainen2017meanteacher] learning paradigm. The teacher model is essentially a temporal ensemble of student models: the parameters of the teacher model are updated by an exponential moving average over the student models of preceding iterations. Our analyses show that the mean teacher produces accurate and stable supervision for the student model without any prior knowledge of the target domain. To the best of our knowledge, we are the first to introduce the mean teacher paradigm to unsupervised domain adaptive 3D detection. The second is to design scale-related consistency losses and construct useful teacher-student prediction correspondences to initiate gradient flow; to this end, MLCNet enforces consistency at three levels. 1) Point level. As point clouds are unstructured, point-based region proposals or equivalents [shi2019pointrcnn, yang20203dssd] are common. Hence, we sample the same subset of points and share it between the teacher and student. We retain the indices of the points, which allows 3D augmentation methods to be applied without losing the correspondences. 2) Instance level. Matching region proposals can be erroneous, especially at the initial stage when the quality of region proposals is substandard. Hence, we transfer the teacher region proposals to the student to circumvent the matching process. 3) Neural statistics level. As the teacher model only accesses the target-domain input, the mismatch between batch statistics hinders effective learning. We thus transfer the student's statistics, which are gathered from both the source and the target domain, to the teacher to achieve more stable training behavior.
MLCNet shows remarkable compatibility with popular mainstream 3D detectors, allowing us to implement it on both two-stage [shi2019pointrcnn] and single-stage [yang20203dssd] detectors. Moreover, we verify our design through rigorous experiments across multiple widely used 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes]. Our method outperforms baselines by convincing margins, even surpassing existing methods that utilize additional information. In summary, our main contributions are:

We formulate and study unsupervised domain adaptive 3D detection, a pragmatic yet under-explored task that requires no information about the target domain. We comprehensively investigate the major underlying factors of the domain gap in 3D detection and find that geometric mismatch is the key factor.

We propose a concise yet effective mean-teacher paradigm that leverages three levels of consistency to facilitate cross-domain transfer, achieving a significant performance boost that is consistent across various mainstream detectors and multiple popular public datasets.

We validate our hypothesis on the unique challenges associated with point clouds and verify our proposed approach with comprehensive evaluations, which we hope will lay a strong foundation for future research.
2 Related Work
LiDAR-based 3D Detection. LiDAR-based 3D detection methods fall into two main categories, namely grid-based methods and point-based methods. Grid-based approaches convert the whole point cloud scene into grids of fixed size and process the input with 2D or 3D CNNs. MV3D [mv3d] first projects point clouds to bird's-eye view images to generate proposals. PointPillars [lang2019pointpillars] performs voxelization on point clouds and converts the representation to 2D. VoxelNet [zhou2018voxelnet] obtains voxel representations by applying PointNet [qi2017pointnet] to points and processes the features with 3D convolutions. SECOND [yan2018second] applies 3D sparse convolution [graham20183dsparseconv] to improve efficiency. PV-RCNN [shi2020pvrcnn] combines voxelization and point-based set abstraction to obtain more discriminative features. On the other hand, point-based methods directly extract features from the raw point cloud input. F-PointNet [qi2018frustum] applies PointNet [qi2017pointnet] to perform 3D detection based on 2D bounding boxes. PointRCNN [shi2019pointrcnn] proposes a two-stage framework that generates bounding box proposals from the whole point cloud and refines them with feature pooling. 3DSSD [yang20203dssd] uses F-FPS for better point sampling to achieve single-stage detection. In this work, we conduct our focused discussion with PointRCNN [shi2019pointrcnn] as the base model, but we show that our method is also compatible with a single-stage detector (3DSSD) in the Supplementary Material.
Point Cloud Domain Adaptation. While extensive research has been conducted on domain adaptation with 2D image data, the 3D point cloud domain adaptation field has a relatively small body of literature. PointDAN [qin2019pointdan] proposes to jointly align local and global features using a discrepancy loss and adversarial training for point cloud classification. Achituve et al. [achituve2021selfsuupervised] introduce an additional self-supervised reconstruction task to improve classification performance on the target domain. [yi2020completeandlabel] designs a sparse voxel completion network to perform point cloud completion for domain adaptive semantic segmentation. [jaritz2020xmuda] leverages multi-modal information by projecting point clouds to 2D images and training models jointly. For object detection, [wang2020traininger] identifies the major domain gap among autonomous driving datasets and proposes to mitigate the gap by leveraging target statistical information. SFUDA [saltori2020sourcefree] computes motion coherence over consecutive frames to select the best scale for the target domain. Our proposed method works under a setup similar to [wang2020traininger] but does not require target-domain statistical information.
Mean Teacher. The mean teacher framework [tarvainen2017meanteacher] was first proposed for the semi-supervised learning task. Many variants [cubuk2018autoaugment, berthelot2019mixmatch, xie2019unsupervised] have been proposed to further improve its performance. Furthermore, the framework has also been applied to other fields such as domain adaptation [french2017selfensembling, cai2019exploringobjectrelation] and self-supervised learning [he2020moco, grill2020byol, liu2020selfemd], where labeled data is scarce or unavailable. Specifically, the mean teacher framework incorporates one trainable student model and a non-trainable teacher model whose weights are obtained from the exponential moving average of the student model's weights. The student model is optimized based on the consistency loss between the student and teacher network predictions. In particular, although [cai2019exploringobjectrelation] also employs the mean teacher paradigm for the detection task by aligning region-level features, point cloud detection models are substantially different from 2D detectors, and our proposed method differs by incorporating multi-level consistency.
3 Our Approach
In this section, we formulate the 3D point cloud domain adaptive detection problem (Section 3.1) and provide an overview of our MLCNet (Section 3.2), followed by the details of our mean-teacher paradigm (Section 3.3). Finally, we explain the details of the point-level (Section 3.4), instance-level (Section 3.5), and statistics-level (Section 3.6) consistency of MLCNet.
3.1 Problem Definition
Under the unsupervised domain adaptation setting, we have access to point cloud data from one labeled source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and one unlabeled target domain $\mathcal{D}_t = \{x_i^t\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples from the source and target domains, respectively. Each point cloud scene consists of points with their 3D coordinates, while $y_i^s$ denotes the label of the corresponding training sample from the source domain. $y_i^s$ is in the form of the object class and a 3D bounding box parameterized by the center location $(c_x, c_y, c_z)$, the size in each dimension $(l, w, h)$, and the orientation $\theta$. The goal of the domain adaptive detection task is to train a model based on $\mathcal{D}_s$ and $\mathcal{D}_t$ and maximize the performance on $\mathcal{D}_t$.
3.2 Framework Overview
We illustrate MLCNet in Figure 3. The labeled source input is used for standard supervised training of the student model with loss $\mathcal{L}_{sup}$. For each unlabeled target-domain example, we perturb it by applying random augmentation to obtain the student input. The perturbed and original point cloud inputs are passed to the student and teacher models, respectively, to obtain their point-level box proposals, on which the point-level consistency is applied. Subsequently, the teacher proposals are passed to the student model for box refinement. Together with the teacher's instance-level predictions, the instance-level consistency is applied. The overall consistency loss is computed as:
$\mathcal{L}_{con} = \mathcal{L}_{pt}^{cls} + \mathcal{L}_{pt}^{box} + \mathcal{L}_{ins}^{cls} + \mathcal{L}_{ins}^{box}$  (1)
where the subscripts pt and ins and the superscripts cls and box stand for point-level, instance-level, classification, and box regression, respectively. These loss components are elaborated in Sections 3.4 and 3.5. In each iteration, the student model is updated through gradient descent with the total loss $\mathcal{L}$, which is a weighted sum of the supervised loss $\mathcal{L}_{sup}$ and the consistency loss $\mathcal{L}_{con}$:
$\mathcal{L} = \mathcal{L}_{sup} + \lambda \mathcal{L}_{con}$  (2)
where $\lambda$ is the weight coefficient. The learnable parameters of the student model are then used to update the corresponding teacher model parameters; the details can be found in Section 3.3. In addition, we enforce non-learnable parameters to be aligned between the teacher and the student via neural statistics-level consistency (Section 3.6).
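The loss composition of Equations (1) and (2) can be sketched in a few lines; the function name and the default weight value below are illustrative assumptions, not taken from the original implementation.

```python
def total_loss(l_sup, l_pt_cls, l_pt_box, l_ins_cls, l_ins_box, lam=1.0):
    # Eq. (1): the consistency loss sums the point-level and instance-level
    # classification and box-regression terms.
    l_con = l_pt_cls + l_pt_box + l_ins_cls + l_ins_box
    # Eq. (2): weighted sum of the supervised source-domain loss and the
    # consistency loss; lam is the weight coefficient (hypothetical default).
    return l_sup + lam * l_con
```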
MLCNet achieves two major design goals towards effective unsupervised 3D domain adaptive detection. The first is to generate accurate and robust pseudo targets without any access to target-domain annotations or statistical information. MLCNet leverages a mean teacher paradigm in which the teacher model can be regarded as a temporal ensemble of student models, allowing it to produce high-quality predictions and guide the learning of the student. The second is to design effective consistency losses at the point, instance, and neural statistics levels that enhance adaptability to scale variation, and to construct the teacher-student correspondences that allow the backpropagated gradient to flow through the correct routes. Although we conduct most of our analysis on PointRCNN [shi2019pointrcnn] as the representative of two-stage 3D detectors, we highlight that our method is generic and can be easily extended to single-stage detection models such as 3DSSD [yang20203dssd] with modest modifications (see Supplementary Material).
3.3 Mean Teacher
Motivated by the success of the mean teacher paradigm [tarvainen2017meanteacher] in semi-supervised learning and self-supervised learning, we apply it to our point cloud domain adaptive detection task as illustrated in Figure 3. The framework consists of a student model and a teacher model with the same network architecture but different weights $\theta_s$ and $\theta_t$, respectively. The weights of the teacher model are updated by taking the exponential moving average of the student model weights:
$\theta_t \leftarrow m\,\theta_t + (1 - m)\,\theta_s$  (3)
where $m$ is known as the momentum, which is usually a number close to 1, e.g., 0.99. Figure 5 shows that the teacher model constantly provides effective supervision to the student model via high-quality pseudo targets. Hence, by enforcing the consistency between the student and the teacher, the student learns domain-invariant representations to adapt to the unlabeled target domain, guided by the pseudo labels. We show in Table 5 that the mean teacher significantly improves model performance compared to the baseline.
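The exponential moving average update of Equation (3) can be sketched as follows; the name-to-float dictionaries are a simplification of real framework tensors, which would be updated without gradient tracking.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Eq. (3): theta_t <- m * theta_t + (1 - m) * theta_s.

    Parameters are represented as plain name->float dicts for illustration.
    """
    for name, theta_s in student_params.items():
        teacher_params[name] = (
            momentum * teacher_params[name] + (1.0 - momentum) * theta_s
        )
    return teacher_params
```

In practice this update runs once per training iteration, after the student's gradient step, so the teacher lags behind as a smoothed temporal ensemble.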
3.4 PointLevel Consistency
The point-level consistency loss is calculated between the first-stage box proposals of the student and teacher models. One of the key challenges in formulating consistency is to find the correspondence between the student and the teacher. Unlike image pixels that are arranged in regular lattices, points reside in continuous 3D space, which lacks structure [qi2017pointnet]. Hence, constructing point correspondences can be problematic (Table 3). Instead, we circumvent the difficulty by feeding the teacher and the student two identical sets of points at the very beginning and tracing the point indices to maintain correspondences.
Specifically, for each target-domain example, we sample $N$ points from the point cloud scene to obtain the teacher input and apply random augmentation $\mathcal{T}$ on a replicated set to obtain the student input. $\mathcal{T}$ consists of random global scaling of the point cloud scene and can be regarded as applying displacements to individual points without disrupting the point correspondences. As a result, each teacher point corresponds to a student point, and this relationship holds for the point-level predictions of the region proposal network. Note that the point correspondences carry over to the box proposals, as each point generates one box proposal. The first-stage prediction of each point consists of a class prediction $c_i$ and a box regression $b_i$, where superscripts $s$ and $t$ denote the student and teacher, respectively. For the class predictions, we define the consistency loss as the Kullback-Leibler (KL) divergence between each corresponding point pair:
$\mathcal{L}_{pt}^{cls} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\!\left(c_i^{t} \,\middle\|\, c_i^{s}\right)$  (4)
where $N$ stands for the number of sampled points in each scene.
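The shared-sampling scheme and the point-level class consistency can be sketched as follows; the NumPy representation, the scaling range [0.9, 1.1], and the function names are illustrative assumptions rather than the original implementation.

```python
import numpy as np


def sample_shared_points(points, n_sample, rng):
    # Draw ONE subset of point indices and share it between the two branches:
    # the teacher sees the original points, the student a globally scaled copy.
    # Because the copy preserves ordering, row i of both arrays is the same
    # physical point, so no explicit matching is ever needed.
    idx = rng.choice(len(points), size=n_sample, replace=False)
    teacher_pts = points[idx]
    scale = rng.uniform(0.9, 1.1)       # hypothetical range for augmentation T
    student_pts = teacher_pts * scale
    return teacher_pts, student_pts, scale


def point_cls_consistency(teacher_cls, student_cls, eps=1e-8):
    # Eq. (4): mean KL divergence KL(teacher || student) over per-point class
    # distributions of shape (N, C), relying on the shared row ordering.
    t = np.clip(teacher_cls, eps, 1.0)
    s = np.clip(student_cls, eps, 1.0)
    return float(np.mean(np.sum(t * np.log(t / s), axis=-1)))
```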
More importantly, we enforce consistency between the bounding box regression predictions to address the geometric mismatch. For the bounding box predictions, we only compute the consistency over points belonging to objects, because background points do not generate meaningful bounding boxes. We obtain the set $\mathcal{P}$ of points that fall inside the bounding boxes of the final predictions of both the student and teacher networks, where the refined bounding box predictions are obtained after the second stage (see Section 3.5). We then compute the point-level box consistency loss as:
$\mathcal{L}_{pt}^{box} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \ell_{s}\!\left(b_i^{s}, \mathcal{T}(b_i^{t})\right)$  (5)
where $\ell_{s}$ is the smooth-$L_1$ loss and $\mathcal{T}$ is the random augmentation applied to the input. We apply the same augmentation to the teacher bounding box predictions to align them with the scale of the student point cloud scene before computing the consistency.
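A sketch of the point-level box consistency, assuming a 7-parameter box encoding [x, y, z, l, w, h, theta] and that the augmentation is a global scaling; both assumptions are for illustration only.

```python
import numpy as np


def smooth_l1(x, beta=1.0):
    # Standard smooth-L1: quadratic near zero, linear beyond |x| = beta.
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)


def point_box_consistency(student_boxes, teacher_boxes, scale, fg_mask):
    # student_boxes / teacher_boxes: (N, 7) per-point box predictions
    # [x, y, z, l, w, h, theta]; fg_mask selects the foreground points P that
    # fall inside the final predicted boxes of both models.
    # The teacher boxes are rescaled by the same global scaling T that was
    # applied to the student input, so both live in one coordinate frame; the
    # orientation theta is unaffected by uniform scaling.
    aligned = teacher_boxes.copy()
    aligned[:, :6] *= scale
    diff = student_boxes[fg_mask] - aligned[fg_mask]
    return float(np.mean(smooth_l1(diff)))
```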
3.5 InstanceLevel Consistency
In the second stage, non-maximum suppression (NMS) is performed on the first-stage proposals to obtain high-confidence region proposals for each point cloud scene. We highlight that the association between region proposals from the student and teacher models is lost in the NMS due to the differences between the two sets of proposals. To match the instance-level predictions for consistency computation, a common method is to perform greedy matching based on the IoU between teacher and student region proposals. However, such matching is not robust due to the large number of noisy predictions, which leads to ineffective learning, as shown experimentally in Table 3. Hence, we adopt a simple approach: we replicate the teacher region proposals to the student model and apply the input augmentation $\mathcal{T}$ to match the scale of the student model. Subsequently, we disturb the region proposals by applying random RoI augmentation before they are used for feature pooling. The motivation of this operation is to force the models to output consistent predictions given non-identical region proposals and to prevent convergence to trivial solutions. The pooled instance-level features, obtained via feature pooling as described in [shi2019pointrcnn], are then passed to the box refinement network to obtain the second-stage predictions. Similar to the first-stage prediction, the second-stage prediction consists of a class prediction $\hat{c}_j$ and a bounding box prediction $\hat{b}_j$. We define the instance-level class consistency as the KL divergence between the student and teacher class predictions:
$\mathcal{L}_{ins}^{cls} = \frac{1}{M} \sum_{j=1}^{M} D_{KL}\!\left(\hat{c}_j^{t} \,\middle\|\, \hat{c}_j^{s}\right)$  (6)
where $M$ denotes the number of region proposals. On the other hand, to compute the instance-level box consistency loss, we first obtain a set of positive predictions $\mathcal{B}$ by selecting bounding boxes whose classification scores are larger than a probability threshold. We then apply $\mathcal{T}$ to the teacher boxes to match the scale and compute the instance-level box consistency loss based on the discrepancy between the student and teacher boxes for the selected predictions:
$\mathcal{L}_{ins}^{box} = \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \ell_{s}\!\left(\hat{b}_j^{s}, \mathcal{T}(\hat{b}_j^{t})\right)$  (7)
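The proposal-sharing step can be sketched as follows; the 7-parameter box layout, the jitter magnitude, and the function name are illustrative assumptions, and a real RoI augmentation would perturb location, size, and orientation more carefully.

```python
import numpy as np


def share_and_perturb_proposals(teacher_proposals, scale, rng, noise=0.1):
    # Copy the teacher's post-NMS proposals to the student branch (rescaled by
    # the input augmentation T) so no IoU-based matching is needed: proposal j
    # in one branch refines the same object as proposal j in the other.
    student_rois = teacher_proposals.copy()
    student_rois[:, :6] *= scale
    # Jitter each branch's copy independently (random RoI augmentation) so the
    # two refinement networks see non-identical boxes and cannot collapse to a
    # trivial solution; noise=0.1 is a hypothetical magnitude.
    student_rois = student_rois + rng.uniform(-noise, noise, student_rois.shape)
    teacher_rois = teacher_proposals + rng.uniform(
        -noise, noise, teacher_proposals.shape
    )
    return student_rois, teacher_rois
```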
3.6 Neural StatisticsLevel Consistency
While the student model takes both source-domain data and target-domain data as input, the teacher model only has access to the target data. The distribution shift between source and target data can lead to mismatched batch statistics between the batch normalization (BN) layers of the student and teacher models. This mismatch can cause misaligned normalization and, in turn, lead to an unstable training process with degraded performance or even divergence. We provide an in-depth analysis of this matter in Section 4.4.
To mitigate this issue, we propose to use the running statistics of the student model's BN layers for the teacher model during the training process. Specifically, for each BN layer in the student model, the batch mean $\mu_B$ and variance $\sigma_B^2$ are used to update the running statistics at every iteration:
$\hat{\mu} \leftarrow (1 - \alpha)\,\hat{\mu} + \alpha\,\mu_B$  (8)
$\hat{\sigma}^2 \leftarrow (1 - \alpha)\,\hat{\sigma}^2 + \alpha\,\sigma_B^2$  (9)
where $\hat{\mu}$ and $\hat{\sigma}^2$ are the running estimates of $\mu_B$ and $\sigma_B^2$, and $\alpha$ is the BN momentum that controls how quickly the batch statistics update the running statistics. For the teacher model, we use $\hat{\mu}$ and $\hat{\sigma}^2$ instead of the batch statistics in all the BN layers to normalize the layer inputs. We argue that this modification closes the gap caused by the domain mismatch and leads to more stable training behavior. We empirically demonstrate its effectiveness by comparing the performance under different BN settings in Section 4.3.
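A minimal sketch of the statistics transfer, assuming per-channel NumPy arrays; `momentum` plays the role of the BN momentum and the function names are hypothetical.

```python
import numpy as np


def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    # Student side (Eqs. 8-9): fold the current batch statistics, computed from
    # both source and target inputs, into the running estimates.
    mu_b = batch.mean(axis=0)
    var_b = batch.var(axis=0)
    new_mean = (1.0 - momentum) * running_mean + momentum * mu_b
    new_var = (1.0 - momentum) * running_var + momentum * var_b
    return new_mean, new_var


def teacher_batchnorm(x, student_mean, student_var, eps=1e-5):
    # Teacher side: normalize with the STUDENT's running statistics instead of
    # the teacher's own target-only batch statistics, so both branches share
    # one normalization and their predictions remain comparable.
    return (x - student_mean) / np.sqrt(student_var + eps)
```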


KITTI → Waymo

Methods  AP/L1  APH/L1  AP/L2  APH/L2
Direct Transfer  0.0917  0.0899  0.0794  0.0778
Wide-Range Aug  0.1861  0.1818  0.1677  0.1640
DA-Faster [chen2018dafastercnn]  0.0696  0.0687  0.0642  0.0633
OT [wang2020traininger]  0.2648  0.2584  0.2385  0.2329
SN [wang2020traininger]  0.3069  0.3006  0.2723  0.2667
Ours  0.3821  0.3774  0.3446  0.3404

Waymo → KITTI

Methods  Easy  Moderate  Hard
Direct Transfer  20.2213  21.4261  20.4927
Wide-Range Aug  30.2341  31.4959  32.8531
DA-Faster [chen2018dafastercnn]  4.4248  5.5510  5.5296
OT [wang2020traininger]  39.7762  37.8212  39.5546
SN [wang2020traininger]  61.9289  58.0656  58.4406
Ours  69.3518  59.4454  56.2913


KITTI → nuScenes

Methods  ATE  ASE  AOE  AP
Direct Transfer  0.207  0.248  0.212  13.0073
Wide-Range Aug  0.200  0.228  0.211  16.0081
DA-Faster [chen2018dafastercnn]  0.247  0.253  0.292  10.7661
OT [wang2020traininger]  0.207  0.220  0.212  14.6650
SN [wang2020traininger]  0.227  0.168  0.368  23.1491
Ours  0.197  0.179  0.197  23.4720

nuScenes → KITTI

Methods  Easy  Moderate  Hard
Direct Transfer  49.1303  39.5565  35.5127
Wide-Range Aug  58.7072  45.3730  43.0254
DA-Faster [chen2018dafastercnn]  52.2501  40.6209  35.9015
OT [wang2020traininger]  23.1286  27.2584  29.0979
SN [wang2020traininger]  44.8135  45.1496  47.5991
Ours  71.2648  55.4152  48.9880

4 Experiments
We first introduce the popular autonomous driving datasets used in the experiments, namely KITTI [geiger2013kittidataset], the Waymo Open Dataset [sun2020waymo], and nuScenes [caesar2020nuscenes] (Section 4.1). We then benchmark MLCNet across datasets, where it achieves a consistent performance boost (Section 4.2). Moreover, we ablate MLCNet to give a comprehensive assessment of its submodules and justify our design choices (Section 4.3). Finally, we further analyze the challenges of unsupervised domain adaptive 3D detection and how MLCNet addresses them (Section 4.4). Due to the space constraint, we include the implementation details in the Supplementary Material.
4.1 Datasets
We follow [wang2020traininger] to evaluate MLCNet on various source-target combinations of the following datasets.
KITTI. KITTI [geiger2013kittidataset] is a popular autonomous driving dataset that consists of 3,712 training samples and 3,769 validation samples. The 3D bounding box annotations are only provided for objects within the field of view (FoV) of the front camera. Therefore, points outside the FoV are ignored during training and evaluation. We use the official KITTI evaluation metrics, where the objects are categorized into three difficulty levels (Easy, Moderate, and Hard) based on the 2D bounding box height in pixels and the occlusion and truncation levels.
Waymo Open Dataset. The Waymo Open Dataset (referred to as Waymo) [sun2020waymo] is a large-scale benchmark that contains 122,000 training samples and 30,407 validation samples. We subsample 1/10 of the training and validation sets. To align the input convention, we apply the same front-camera FoV as in the KITTI dataset. The official Waymo evaluation metrics are used to benchmark the performance.
nuScenes. The nuScenes [caesar2020nuscenes] dataset consists of 28,130 training samples and 6,019 validation samples. We subsample the training set by 50% and use the entire validation set. We also apply the same FoV to the input as for the other datasets. We adopt the official evaluation metrics of translation, scale, and orientation errors, with the addition of the commonly used average precision based on 3D IoU with a threshold of 0.7 to reflect the overall detection accuracy.
4.2 Benchmarking Results
As an emerging research area, cross-domain point cloud detection has a relatively small body of literature. To the best of our knowledge, [wang2020traininger] is the most relevant work with a setting similar to ours. We compare our method with the two normalization methods proposed in [wang2020traininger], namely Output Transformation (OT) and Statistical Normalization (SN), where the former transforms the predictions by an offset and the latter trains the detector with scale-normalized input. Moreover, we compare with adversarial feature alignment, which is commonly used in image-based tasks, by adapting DA-Faster [chen2018dafastercnn] to our PointRCNN [shi2019pointrcnn] base model. We also provide Direct Transfer and Wide-Range Augmentation as baselines. More results can be found in the Supplementary Material.
Table 1 demonstrates the cross-domain detection performance on four source-target domain pairs. MLCNet outperforms all unsupervised baselines by convincing margins. We highlight that our method adapts the scale of each instance instead of applying a global shift, allowing us to surpass state-of-the-art methods that utilize target-domain statistical information.
4.3 Ablation Study
To evaluate the effectiveness of the components of MLCNet, we conduct ablation studies on the KITTI → Waymo transfer with PointRCNN as the base model.
Effectiveness of Point/Instance-Level Consistency. We study the effects of the different components of the proposed consistency loss. Table 2 reports the experimental results when different combinations of loss components are applied. It is observed that, for both point-level and instance-level consistency, the box consistency contributes considerably more than the class consistency. This indicates that the scale difference is a major source of the domain gap between source and target domains with different object size distributions, in line with previous work [wang2020traininger], and that our proposed box consistency regularization effectively mitigates this gap. In addition, all losses are complementary to one another: the best result is achieved when all four are used.



AP/L1  APH/L1  AP/L2  APH/L2  
0.1861  0.1818  0.1677  0.1640  
0.2034  0.1991  0.1807  0.1770  
0.3034  0.2969  0.2708  0.2649  
0.3100  0.3039  0.2764  0.2709  
0.2112  0.2087  0.1879  0.1857  
0.3321  0.3244  0.2995  0.2926  
0.3495  0.3453  0.3143  0.3105  
0.3821  0.3774  0.3446  0.3404  

Furthermore, we compare MLCNet with two alternative approaches for point and box matching, respectively, in Table 3. Compared to these baseline approaches, MLCNet replicates the input point clouds and the region proposals before they are passed to the student and teacher models, eradicating any noise that arises from inaccurate matching. The results highlight the importance of correspondence in constructing meaningful consistency losses for effective unsupervised learning.



Matching Method  AP/L1  APH/L1  AP/L2  APH/L2 
Nearest Point  0.0293  0.0286  0.0265  0.0258 
Max IoU Box  0.2695  0.2666  0.2418  0.2392 
Ours  0.3821  0.3774  0.3446  0.3404 

Effectiveness of Neural Statistics-Level Consistency. We also examine the effectiveness of neural statistics-level consistency by comparing the performance when such alignment is enabled and disabled. From Table 4, we can see that the model performance drops severely when neural statistics-level consistency is disabled. As analyzed in Section 3.6, without it, the teacher model's BN layers normalize the input features using batch statistics obtained from target data only, while the student model performs BN with statistics from both the source and target domains. This misalignment creates a significant gap and, as a result, invalidates the consistency computation between the student and teacher predictions. We also compare with an approach in which the student model performs separate BN for source and target data. In this case, although the normalization of the target input is performed with target statistics for both models, the mismatched normalization of source and target inputs leads to suboptimal performance compared to MLCNet.



Setting  AP/L1  APH/L1  AP/L2  APH/L2 
Disabled  0.0279  0.0274  0.0254  0.0249 
Separate  0.2988  0.2945  0.2685  0.2648 
Enabled  0.3821  0.3774  0.3446  0.3404 

Effectiveness of Mean Teacher. The teacher model is essentially a temporal ensemble of student models at different timestamps. We study the effectiveness of the mean teacher paradigm by comparing the performance when the exponential moving average update is enabled or disabled. Table 5 shows that the moving average update mechanism is essential for the teacher to generate meaningful supervision to guide the student model, and its removal leads to severe performance deterioration.



EMA  AP/L1  APH/L1  AP/L2  APH/L2 
Disabled  0.0895  0.0866  0.0835  0.0808 
Enabled  0.3821  0.3774  0.3446  0.3404 

4.4 Further Analysis
Analysis of Distribution Shift. We highlight that geometric mismatch is a significant issue for cross-domain deployment of 3D detection models. In Figure 2, the object dimension (length, width, and height) distributions are drastically different across domains, with relatively small overlap. The baseline, trained on the source domain, is unable to generalize to the target domain, as its predicted dimension distribution remains close to that of the source domain. In contrast, MLCNet adapts to the new domain by predicting a geometric distribution highly similar to that of the target domain.
Analysis of Neural Statistics Mismatch. Figure 4 shows that inputs from different domains have very different distributions of batch statistics, which explains the tremendous performance drop when our proposed neural statistics-level consistency is not applied to align the statistics (Table 4).
Analysis of Teacher/Student Paradigm. In Figure 5, the teacher model in MLCNet demonstrates stronger performance throughout the training process until convergence. Moreover, the teacher model exhibits a smoother learning curve. This validates the effectiveness of our mean-teacher paradigm in creating accurate and reliable supervision for robust optimization of the student model.
5 Conclusion
We study unsupervised 3D domain adaptive detection, which requires no target-domain annotations or statistics. We validate that geometric mismatch is a major contributor to the domain shift and propose MLCNet, which leverages a teacher-student paradigm for robust and reliable pseudo-label generation and enforces point-, instance-, and neural statistics-level consistency for effective transfer. MLCNet outperforms all baselines by convincing margins and even surpasses methods that require additional target information.