1 Introduction
With the prevalent use of LiDARs on autonomous vehicles and mobile robots, 3D object detection on point clouds has drawn increasing research attention. Large-scale 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes] in recent years have empowered deep learning-based models [shi2019pointrcnn, yang20203dssd, yan2018second, lang2019pointpillars, shi2020pvrcnn, yang2019std, qi2019votenet, sindagi2019mvx, shi2020parta2, zhu2020ssn, yin2020centerpoint] to achieve remarkable success. However, deep learning models trained on one dataset (the source domain) often suffer tremendous performance degradation when evaluated on another dataset (the target domain). We investigate the bounding box scale mismatch problem (e.g., vehicle sizes in the U.S. are noticeably larger than those in Germany), which is found to be a major contributor to the domain gap, in alignment with previous work [wang2020traininger]. This problem is unique to 3D detection: whereas 2D bounding boxes can have a large variety of sizes depending on the distance of the object from the camera, 3D bounding boxes have a more consistent size within the same dataset, regardless of the objects' location relative to the LiDAR sensor. Hence, the detector tends to memorize a narrow, dataset-specific distribution of bounding box sizes from the source domain (Figure 2).
Unfortunately, existing works are inadequate to address the domain gap with a realistic setup. Recent methods on domain adaptive 3D detection either require some labeled data from the target domain for finetuning or utilize some additional statistics (such as the mean size) of the target domain [wang2020traininger]. However, such knowledge of the target domain is not always available. In addition, popular 2D unsupervised domain adaptation methods that leverage feature alignment techniques [chen2018dafastercnn, saito2019strongweak, zheng2020crossdomain, chen2020harmonizing, xu2020exploring, li2020spatialattention] to mitigate domain shift are not readily transferable to 3D detection. While these methods are effective in handling domain gaps due to lighting, color, and texture variations, such information is unavailable in point clouds. Instead, point clouds pose unique challenges such as the geometric mismatch discussed above.
Therefore, we propose MLC-Net for unsupervised domain adaptive 3D detection. MLC-Net is designed to tackle two major challenges. The first is to create meaningful scale-adaptive targets to facilitate learning. Specifically, MLC-Net employs the mean teacher [tarvainen2017meanteacher] learning paradigm. The teacher model is essentially a temporal ensemble of student models: its parameters are updated by an exponential moving average over the student models of preceding iterations. Our analyses show that the mean teacher produces accurate and stable supervision for the student model without any prior knowledge of the target domain. To the best of our knowledge, we are the first to introduce the mean teacher paradigm to unsupervised domain adaptive 3D detection. The second is to design scale-related consistency losses and to construct useful correspondences between teacher and student predictions to initiate the gradient flow. To this end, MLC-Net enforces consistency at three levels. 1) Point-level. As point clouds are unstructured, point-based region proposals or their equivalents [shi2019pointrcnn, yang20203dssd] are common. Hence, we sample the same subset of points and share them between the teacher and the student, retaining the point indices so that 3D augmentation methods can be applied without losing the correspondences. 2) Instance-level. Matching region proposals can be erroneous, especially at the initial stage when the quality of region proposals is substandard. Hence, we transfer the teacher region proposals to the student to circumvent the matching process. 3) Neural statistics-level. As the teacher model only accesses the target domain input, the mismatch in batch statistics hinders effective learning. We thus transfer the student's statistics, which are gathered from both the source and the target domain, to the teacher to achieve more stable training behavior.
MLC-Net shows remarkable compatibility with popular mainstream 3D detectors, allowing us to implement it on both two-stage [shi2019pointrcnn] and single-stage [yang20203dssd] detectors. Moreover, we verify our design through rigorous experiments across multiple widely used 3D object detection datasets [geiger2013kittidataset, sun2020waymo, caesar2020nuscenes]. Our method outperforms baselines by convincing margins, even surprisingly surpassing existing methods that utilize additional information. In summary, our main contributions are:
We formulate and study unsupervised domain adaptive 3D detection, a pragmatic, yet underexplored task that requires no information of the target domain. We comprehensively investigate the major underlying factors of the domain gap in 3D detection and find geometric mismatch is the key factor.
We propose a concise yet effective mean-teacher paradigm that leverages three levels of consistency to facilitate cross-domain transfer, achieving a significant performance boost that is consistent on various mainstream detectors and across multiple popular public datasets.
We validate our hypothesis on the unique challenges associated with point clouds and verify our proposed approach with comprehensive evaluations, which we hope would lay a strong foundation for future research.
2 Related Works
LiDAR-based 3D Detection. LiDAR-based 3D detection methods mainly fall into two categories, namely grid-based methods and point-based methods. Grid-based approaches convert the whole point cloud scene into grids of fixed size and process the input with 2D or 3D CNNs. MV3D [mv3d] first projects point clouds to bird's-eye-view images to generate proposals. PointPillars [lang2019pointpillars] performs voxelization on point clouds and converts the representation to 2D. VoxelNet [zhou2018voxelnet] obtains voxel representations by applying PointNet [qi2017pointnet] to points and processes the features with 3D convolutions. SECOND [yan2018second] applies 3D sparse convolution [graham20183dsparseconv] to improve efficiency. PV-RCNN [shi2020pvrcnn] combines voxelization and point-based set abstraction to obtain more discriminative features. On the other hand, point-based methods directly extract features from the raw point cloud input. F-PointNet [qi2018frustum] applies PointNet [qi2017pointnet] to perform 3D detection based on 2D bounding boxes. PointRCNN proposes a two-stage framework that generates bounding box proposals from the whole point cloud and refines them with feature pooling. 3DSSD uses F-FPS for better point sampling to achieve single-stage detection. In this work, we conduct a focused discussion with PointRCNN [shi2019pointrcnn] as the base model, but we show our method is also compatible with a single-stage detector (3DSSD) in the Supplementary Material.
Point Cloud Domain Adaptation. While extensive research has been conducted on domain adaptation with 2D image data, the literature on 3D point cloud domain adaptation is relatively small. PointDAN [qin2019pointdan] proposes to jointly align local and global features using a discrepancy loss and adversarial training for point cloud classification. Achituve et al. [achituve2021selfsuupervised] introduce an additional self-supervised reconstruction task to improve classification performance on the target domain. [yi2020completeandlabel] designs a sparse voxel completion network to perform point cloud completion for domain adaptive semantic segmentation. [jaritz2020xmuda] leverages multi-modal information by projecting point clouds to 2D images and training models jointly. For object detection, [wang2020traininger] identifies the major domain gap among autonomous driving datasets and proposes to mitigate it by leveraging target statistical information. SF-UDA [saltori2020sourcefree] computes motion coherence over consecutive frames to select the best scale for the target domain. Our proposed method works under a similar setup to [wang2020traininger] but does not require target domain statistical information.
Mean Teacher. The mean teacher framework [tarvainen2017meanteacher] was first proposed for semi-supervised learning, and many variants [cubuk2018autoaugment, berthelot2019mixmatch, xie2019unsupervised] have been proposed to further improve its performance. The framework has also been applied to other fields where labeled data is scarce or unavailable, such as domain adaptation [french2017selfensembling, cai2019exploringobjectrelation] and self-supervised learning [he2020moco, grill2020byol, liu2020selfemd]. Specifically, the mean teacher framework incorporates one trainable student model and a non-trainable teacher model whose weights are obtained from the exponential moving average of the student model's weights. The student model is optimized based on the consistency loss between the student and teacher predictions. Although [cai2019exploringobjectrelation] also employs the mean teacher paradigm for the detection task by aligning region-level features, point cloud detection models are substantially different from 2D detectors, and our proposed method differs by incorporating multi-level consistency.
3 Our Approach
In this section, we formulate the 3D point cloud domain adaptive detection problem (Section 3.1), and provide an overview of our MLC-Net (Section 3.2), followed by the details of our mean-teacher paradigm (Section 3.3). Finally, we explain the details of the point-level (Section 3.4), instance-level (Section 3.5), and statistics-level (Section 3.6) consistency of our MLC-Net.
3.1 Problem Definition
Under the unsupervised domain adaptation setting, we have access to point cloud data from one labeled source domain $\mathcal{D}_s = \{(P_i^s, y_i^s)\}_{i=1}^{N_s}$ and one unlabeled target domain $\mathcal{D}_t = \{P_i^t\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples from the source and target domains, respectively. Each point cloud scene $P_i$ consists of points with their 3D coordinates $(x, y, z)$, while $y_i^s$ denotes the label of the corresponding training sample from the source domain. $y_i^s$ is in the form of the object class and a 3D bounding box parameterized by the center location of the bounding box $(c_x, c_y, c_z)$, the size in each dimension $(l, w, h)$, and the orientation $\theta$. The goal of the domain adaptive detection task is to train a model based on $\mathcal{D}_s$ and $\mathcal{D}_t$ and maximize the performance on $\mathcal{D}_t$.
3.2 Framework Overview
We illustrate MLC-Net in Figure 3. The labeled source input $(P^s, y^s)$ is used for standard supervised training of the student model with the detection loss $\mathcal{L}_{det}$. For each unlabeled target domain example $P^t$, we perturb it by applying random augmentation $\mathcal{A}$ to obtain $\tilde{P}^t = \mathcal{A}(P^t)$. The perturbed and original point cloud inputs are passed to the student model and teacher model respectively to get their point-level box proposals $\tilde{b}^{pt}$ and $b^{pt}$, where the point-level consistency is applied. Subsequently, the teacher proposals are passed to the student model for box refinement, to obtain $\tilde{b}^{ins}$. Together with the teacher's instance-level predictions $b^{ins}$, the instance-level consistency is applied. The overall consistency loss is computed as:

$$\mathcal{L}_{cons} = \mathcal{L}^{pt}_{cls} + \mathcal{L}^{pt}_{box} + \mathcal{L}^{ins}_{cls} + \mathcal{L}^{ins}_{box},$$

where $pt$, $ins$, $cls$ and $box$ stand for point-level, instance-level, classification and box regression respectively. These loss components are elaborated in Sections 3.4 and 3.5. In each iteration, the student model is updated through gradient descent with the total loss $\mathcal{L}$, which is a weighted sum of $\mathcal{L}_{det}$ and $\mathcal{L}_{cons}$:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda \mathcal{L}_{cons},$$

where $\lambda$ is the weight coefficient. The learnable parameters of the student model are then used to update the corresponding teacher model parameters; the details can be found in Section 3.3. In addition, we enforce non-learnable parameters to be aligned between the teacher and the student via the neural statistics-level consistency (Section 3.6).
MLC-Net achieves two major design goals towards effective unsupervised 3D domain adaptive detection. The first is to generate accurate and robust pseudo targets without any access to target domain annotations or statistical information: MLC-Net leverages a mean teacher paradigm where the teacher model can be regarded as a temporal ensemble of student models, allowing it to produce high-quality predictions and guide the learning of the student. The second is to design effective consistency losses at the point, instance, and neural statistics levels that enhance adaptability to scale variation, and to construct the teacher-student correspondences that allow the back-propagated gradient to flow through the correct routes. Although we conduct most of our analysis on PointRCNN [shi2019pointrcnn] as the representative two-stage 3D detector, we highlight that our method is generic and can easily be extended to single-stage detection models such as 3DSSD [yang20203dssd] with modest modifications (see Supplementary Material).
3.3 Mean Teacher
Motivated by the success of the mean teacher paradigm [tarvainen2017meanteacher] in semi-supervised learning and self-supervised learning, we apply it to our point cloud domain adaptive detection task as illustrated in Figure 3. The framework consists of a student model and a teacher model with the same network architecture but different weights $\theta$ and $\theta'$, respectively. The weights of the teacher model are updated by taking the exponential moving average of the student model weights:

$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t,$$

where $\alpha$ is known as the momentum, which is usually a number close to 1, e.g., 0.99. Figure 5 shows that the teacher model constantly provides effective supervision to the student model via high-quality pseudo targets. Hence, by enforcing the consistency between the student and the teacher, the student learns domain-invariant representations to adapt to the unlabeled target domain, guided by the pseudo labels. We show in Table 5 that the mean teacher significantly improves model performance compared to the baseline.
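As a concrete illustration, the exponential moving average update can be sketched in a few lines. The dict-of-arrays weight representation below is hypothetical and merely stands in for real network parameters:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Teacher weights become an exponential moving average of student weights."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# toy single-layer example
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
teacher = ema_update(teacher, student, momentum=0.9)  # -> [0.9, 1.1]
```

With a momentum close to 1, the teacher changes slowly from iteration to iteration, which is what makes it act as a temporal ensemble of past students.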
3.4 Point-Level Consistency
The point-level consistency loss is calculated between the first-stage box proposals of the student and teacher models. One of the key challenges for formulating consistency is to find the correspondence between the student and the teacher. Unlike image pixels that are arranged in regular lattices, points reside in continuous 3D space which lacks structure [qi2017pointnet]. Hence, constructing point correspondences can be problematic (Table 3). Instead, we circumvent the difficulty by feeding the teacher and the student two identical sets of points at the very beginning and trace the point indices to maintain correspondences.
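The index-tracing idea can be sketched as follows, with a synthetic point cloud and a global-scaling augmentation; all names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.normal(size=(1000, 3))      # full point cloud scene (x, y, z)

# sample ONE subset of indices, shared by the teacher and the student
idx = rng.choice(len(scene), size=256, replace=False)
teacher_pts = scene[idx]                # teacher input: original points
scale = rng.uniform(0.9, 1.1)           # random global scaling augmentation
student_pts = teacher_pts * scale       # student input: augmented replica

# the augmentation is applied element-wise to the same array, so row i of
# student_pts still corresponds to row i of teacher_pts -- no matching needed
assert np.allclose(student_pts / scale, teacher_pts)
```

Because the correspondence is carried by the shared row index rather than by spatial proximity, it survives any invertible point-wise augmentation.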
Specifically, for each target domain example, we sample $n$ points from the point cloud scene to obtain the teacher input $P^t = \{p_i\}_{i=1}^{n}$ and apply random augmentation on a replicated set to obtain $\tilde{P}^t = \{\tilde{p}_i\}_{i=1}^{n}$ with $\tilde{p}_i = \mathcal{A}(p_i)$. $\mathcal{A}$ consists of random global scaling of the point cloud scene and can be regarded as applying displacements to individual points, without disrupting the point correspondences. As a result, each point $p_i$ corresponds to a point $\tilde{p}_i$, and this relationship holds for the point-level predictions of the region proposal network. We denote the first-stage prediction as $b^{pt}$ for the teacher and $\tilde{b}^{pt}$ for the student. Note that the point correspondences are transferred to the box proposals, as each point generates one box proposal. $b^{pt}$ consists of the class prediction $b^{pt}_{cls}$ and the box regression $b^{pt}_{box}$. For the class predictions, we define the consistency loss as the Kullback-Leibler (KL) divergence between each point pair from $b^{pt}_{cls}$ and $\tilde{b}^{pt}_{cls}$:

$$\mathcal{L}^{pt}_{cls} = \frac{1}{n} \sum_{i=1}^{n} D_{KL}\big(b^{pt}_{cls,i} \,\big\|\, \tilde{b}^{pt}_{cls,i}\big),$$

where $n$ stands for the number of points in $P^t$.
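A minimal NumPy sketch of this point-level class consistency, assuming per-point foreground/background probability pairs (the toy values below are hypothetical):

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL(p || q) per point, averaged over the point set."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.mean(np.sum(p * np.log(p / q), axis=-1))

# hypothetical per-point class probabilities for two corresponding points
teacher_cls = np.array([[0.9, 0.1], [0.2, 0.8]])
student_cls = np.array([[0.8, 0.2], [0.3, 0.7]])
loss = kl_div(teacher_cls, student_cls)
```

The loss is zero only when the student's per-point class distributions exactly match the teacher's, which is what drives the student toward the teacher's pseudo targets.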
More importantly, we enforce consistency between the bounding box regression predictions to address the geometric mismatch. For the bounding box predictions, we only compute the consistency over points belonging to the objects, because the background points do not generate meaningful bounding boxes. We obtain the set $\mathcal{P}$ of points which fall inside the bounding boxes of the final predictions of both the student and teacher networks, where $\tilde{b}^{ins}$ and $b^{ins}$ are the refined bounding box predictions after the second stage (see Section 3.5). We then compute the point-level box consistency loss as:

$$\mathcal{L}^{pt}_{box} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \ell^{smooth}_{1}\big(\tilde{b}^{pt}_{box,i},\, \mathcal{A}(b^{pt}_{box,i})\big),$$

where $\ell^{smooth}_{1}$ is the smooth $L_1$ loss and $\mathcal{A}$ is the random augmentation applied to the input. We apply the same augmentation to the teacher bounding box predictions to align with the scale of the student point cloud scene before computing the consistency.
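The box consistency with augmentation alignment can be sketched as below, assuming boxes parameterized as (x, y, z, l, w, h, yaw) and a global-scaling augmentation; the alignment step and the toy values are illustrative only:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax**2 / beta, ax - 0.5 * beta)

def box_consistency(student_boxes, teacher_boxes, scale):
    """Apply the student's global-scaling augmentation to the teacher boxes
    before comparing, so both sets live in the same (augmented) frame."""
    aligned = teacher_boxes.copy()
    aligned[:, :6] *= scale          # centers (x,y,z) and sizes (l,w,h); yaw unchanged
    return np.mean(smooth_l1(student_boxes - aligned))

# hypothetical box: (x, y, z, l, w, h, yaw)
t = np.array([[10.0, 2.0, 0.5, 4.0, 1.8, 1.6, 0.3]])
s = t.copy()
s[:, :6] *= 1.05                     # the student scene was scaled by 1.05
```

Without the alignment step, a perfectly consistent student would still incur a loss equal to the augmentation offset, so the model would be penalized for the augmentation itself rather than for genuine disagreement.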
3.5 Instance-Level Consistency
In the second stage, NMS is performed on $b^{pt}$ and $\tilde{b}^{pt}$ to obtain high-confidence region proposals for each point cloud scene. We highlight that the association between region proposals from the student and teacher models is lost in the NMS due to the differences between $b^{pt}$ and $\tilde{b}^{pt}$. To match the instance-level predictions for the consistency computation, a common method is to perform greedy matching based on the IoU between teacher and student region proposals. However, such matching is not robust due to the large number of noisy predictions, which leads to ineffective learning, as shown experimentally in Table 3. Hence, we adopt a simple approach: we replicate the teacher region proposals $r$ to the student model and apply the input augmentation $\mathcal{A}$ to match the scale of the student model. Subsequently, we disturb the region proposals by applying random RoI augmentation $\mathcal{A}'$ to the sets of region proposals before they are used for feature pooling. The motivation of this operation is to force the models to output consistent predictions given non-identical region proposals and to prevent convergence to trivial solutions. Formally, the above process can be described as $\tilde{f} = \mathcal{G}(\tilde{P}^t, \mathcal{A}'(\mathcal{A}(r)))$ and $f = \mathcal{G}(P^t, \mathcal{A}'(r))$ for the student and teacher models, respectively, where $\mathcal{G}$ denotes the feature pooling that produces the instance-level features, as described in [shi2019pointrcnn]. The pooled features are then passed to the box refinement network to obtain the second-stage predictions. Similar to the first-stage prediction, the second-stage prediction $b^{ins}$ consists of the class prediction $b^{ins}_{cls}$ as well as the bounding box prediction $b^{ins}_{box}$. We define the instance-level class consistency as the difference between $b^{ins}_{cls}$ and $\tilde{b}^{ins}_{cls}$:

$$\mathcal{L}^{ins}_{cls} = \frac{1}{m} \sum_{j=1}^{m} D_{KL}\big(b^{ins}_{cls,j} \,\big\|\, \tilde{b}^{ins}_{cls,j}\big),$$

where $m$ denotes the number of region proposals. On the other hand, to compute the instance-level box consistency loss, we first obtain a set $\mathcal{S}$ of positive predictions by selecting bounding boxes whose classification predictions are larger than a probability threshold. We then apply $\mathcal{A}$ to $b^{ins}_{box}$ to match the scale and compute the instance-level box consistency loss based on the discrepancy between $\tilde{b}^{ins}_{box}$ and $\mathcal{A}(b^{ins}_{box})$ for the selected predictions:

$$\mathcal{L}^{ins}_{box} = \frac{1}{|\mathcal{S}|} \sum_{j \in \mathcal{S}} \ell^{smooth}_{1}\big(\tilde{b}^{ins}_{box,j},\, \mathcal{A}(b^{ins}_{box,j})\big).$$
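The proposal-sharing and RoI perturbation step can be sketched as follows; the box parameterization and noise scales are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def roi_augment(boxes, rng, pos_noise=0.1, size_noise=0.05):
    """Perturb shared region proposals so the two branches refine slightly
    different RoIs, preventing convergence to trivial solutions."""
    out = boxes.copy()
    out[:, :3] += rng.normal(scale=pos_noise, size=(len(boxes), 3))
    out[:, 3:6] *= 1.0 + rng.normal(scale=size_noise, size=(len(boxes), 3))
    return out

# teacher proposals after NMS (x, y, z, l, w, h, yaw), replicated to the student
teacher_rois = np.array([[10.0, 2.0, 0.5, 4.0, 1.8, 1.6, 0.3],
                         [20.0, -1.0, 0.4, 3.9, 1.7, 1.5, 1.2]])
student_rois = roi_augment(teacher_rois, rng)
# proposal j in each branch still refers to the same object: no IoU matching needed
```

Because the student's proposal list is a perturbed copy of the teacher's, the j-th refined boxes of the two branches are comparable by construction, sidestepping the noisy greedy matching entirely.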
3.6 Neural Statistics-Level Consistency
While the student model takes both source domain data and target domain data as input, the teacher model only has access to the target data. The distribution shift between source and target data can lead to mismatched batch statistics between the batch normalization (BN) layers of the student and teacher models. This mismatch causes misaligned normalization and, in turn, leads to an unstable training process with degraded performance or even divergence. We provide an in-depth analysis of this matter in Section 4.4.
To mitigate this issue, we propose to use the running statistics of the student model's BN layers for the teacher model during the training process. Specifically, for each BN layer in the student model, the batch mean $\mu_B$ and variance $\sigma^2_B$ are used to update the running statistics at every iteration:

$$\mu_{run} \leftarrow (1 - m)\,\mu_{run} + m\,\mu_B, \qquad \sigma^2_{run} \leftarrow (1 - m)\,\sigma^2_{run} + m\,\sigma^2_B,$$

where $\mu_{run}$ and $\sigma^2_{run}$ are the running means of $\mu_B$ and $\sigma^2_B$, and $m$ is the BN momentum that controls the speed at which the batch statistics update the running statistics. For the teacher model, we use $\mu_{run}$ and $\sigma^2_{run}$ instead of the batch statistics for all BN layers to normalize the layer inputs. We argue that this modification closes the gap caused by the domain mismatch and leads to more stable training behavior. We empirically demonstrate the effectiveness by comparing the performance under different BN settings in Section 4.3.
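A minimal NumPy sketch of this statistics-level consistency: the student normalizes with its batch statistics while maintaining running statistics, and the teacher normalizes with the student's running statistics instead of its own target-only batch statistics (feature dimensions, momentum, and data distributions are illustrative):

```python
import numpy as np

def student_bn_update(x, run_mean, run_var, m=0.1):
    """Student BN: normalize with batch stats, update running stats."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    run_mean = (1 - m) * run_mean + m * mu
    run_var = (1 - m) * run_var + m * var
    y = (x - mu) / np.sqrt(var + 1e-5)
    return y, run_mean, run_var

def teacher_bn(x, run_mean, run_var):
    """Teacher BN: normalize with the student's RUNNING statistics."""
    return (x - run_mean) / np.sqrt(run_var + 1e-5)

rng = np.random.default_rng(0)
run_mean, run_var = np.zeros(4), np.ones(4)
for _ in range(100):                          # student sees mixed source+target batches
    batch = rng.normal(loc=2.0, size=(32, 4))
    _, run_mean, run_var = student_bn_update(batch, run_mean, run_var)
target = rng.normal(loc=2.0, size=(16, 4))
out = teacher_bn(target, run_mean, run_var)   # teacher reuses student statistics
```

This mirrors how BN layers track running statistics during training; the only change MLC-Net makes is which statistics the teacher consumes at normalization time.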
KITTI → Waymo (official Waymo metrics: AP / APH at LEVEL_1 and LEVEL_2):

| Method | AP (L1) | APH (L1) | AP (L2) | APH (L2) |
|---|---|---|---|---|
| Direct Transfer | 0.0917 | 0.0899 | 0.0794 | 0.0778 |
| Wide-Range Aug | 0.1861 | 0.1818 | 0.1677 | 0.1640 |
| DA-Faster [chen2018dafastercnn] | 0.0696 | 0.0687 | 0.0642 | 0.0633 |
| OT [wang2020traininger] | 0.2648 | 0.2584 | 0.2385 | 0.2329 |
| SN [wang2020traininger] | 0.3069 | 0.3006 | 0.2723 | 0.2667 |

Waymo → KITTI (3D AP at Easy / Moderate / Hard):

| Method | Easy | Moderate | Hard |
|---|---|---|---|
| Direct Transfer | 20.2213 | 21.4261 | 20.4927 |
| Wide-Range Aug | 30.2341 | 31.4959 | 32.8531 |
| DA-Faster [chen2018dafastercnn] | 4.4248 | 5.5510 | 5.5296 |
| OT [wang2020traininger] | 39.7762 | 37.8212 | 39.5546 |
| SN [wang2020traininger] | 61.9289 | 58.0656 | 58.4406 |

KITTI → nuScenes (translation / scale / orientation errors, lower is better; AP, higher is better):

| Method | Trans. err. | Scale err. | Orient. err. | AP |
|---|---|---|---|---|
| Direct Transfer | 0.207 | 0.248 | 0.212 | 13.0073 |
| Wide-Range Aug | 0.200 | 0.228 | 0.211 | 16.0081 |
| DA-Faster [chen2018dafastercnn] | 0.247 | 0.253 | 0.292 | 10.7661 |
| OT [wang2020traininger] | 0.207 | 0.220 | 0.212 | 14.6650 |
| SN [wang2020traininger] | 0.227 | 0.168 | 0.368 | 23.1491 |

nuScenes → KITTI (3D AP at Easy / Moderate / Hard):

| Method | Easy | Moderate | Hard |
|---|---|---|---|
| Direct Transfer | 49.1303 | 39.5565 | 35.5127 |
| Wide-Range Aug | 58.7072 | 45.3730 | 43.0254 |
| DA-Faster [chen2018dafastercnn] | 52.2501 | 40.6209 | 35.9015 |
| OT [wang2020traininger] | 23.1286 | 27.2584 | 29.0979 |
| SN [wang2020traininger] | 44.8135 | 45.1496 | 47.5991 |
4 Experiments

We first introduce the popular autonomous driving datasets used in the experiments, namely KITTI [geiger2013kittidataset], the Waymo Open Dataset [sun2020waymo], and nuScenes [caesar2020nuscenes] (Section 4.1). We then benchmark MLC-Net across datasets, where it achieves a consistent performance boost (Section 4.2). Moreover, we ablate MLC-Net to give a comprehensive assessment of its submodules and justify our design choices (Section 4.3). Finally, we further investigate the challenges of unsupervised domain adaptive 3D detection and show how MLC-Net addresses them (Section 4.4). Due to the space constraint, we include the implementation details in the Supplementary Material.

4.1 Datasets

We follow [wang2020traininger] to evaluate MLC-Net on various source-target combinations with the following datasets.
KITTI. KITTI [geiger2013kittidataset] is a popular autonomous driving dataset that consists of 3,712 training samples and 3,769 validation samples. The 3D bounding box annotations are only provided for objects within the field of view (FoV) of the front camera; therefore, points outside the FoV are ignored during training and evaluation. We use the official KITTI evaluation metrics, where the objects are categorized into three levels (Easy, Moderate, and Hard) based on the 2D box height in pixels and the occlusion and truncation levels.
Waymo Open Dataset. The Waymo Open Dataset (referred to as Waymo) [sun2020waymo] is a large-scale benchmark that contains 122,000 training samples and 30,407 validation samples. We subsample 1/10 of the training and validation sets. To align the input convention, we apply the same front-camera FoV as the KITTI dataset. The official Waymo evaluation metrics are used to benchmark the performance.
nuScenes. The nuScenes [caesar2020nuscenes] dataset consists of 28,130 training samples and 6,019 validation samples. We subsample the training dataset by 50% and use the entire validation set. We also apply the same FoV on the input as other datasets. We adopt the official evaluation metrics of translation, scale, and orientation errors, with the addition of the commonly used average precision based on 3D IoU with a threshold of 0.7 to reflect the overall detection accuracy.
4.2 Benchmarking Results
As an emerging research area, cross-domain point cloud detection has a relatively small literature. To the best of our knowledge, [wang2020traininger] is the most relevant work with a setting similar to our study. We compare our method with the two normalization methods proposed in [wang2020traininger], namely Output Transformation (OT) and Statistical Normalization (SN), where the former transforms the predictions by an offset and the latter trains the detector with scale-normalized input. Moreover, we also compare with adversarial feature alignment, which is commonly used on image-based tasks, by adapting DA-Faster [chen2018dafastercnn] to our PointRCNN [shi2019pointrcnn] base model. We also provide Direct Transfer and Wide-Range Augmentation as baselines. More results can be found in the Supplementary Material.
Table 1 demonstrates the cross-domain detection performance on four source-target domain pairs: MLC-Net outperforms all unsupervised baselines by convincing margins. We highlight that our method adapts the scale for each instance instead of applying a global shift, allowing us to surpass state-of-the-art methods that utilize target domain statistical information.
4.3 Ablation Study
To evaluate the effectiveness of the components of MLC-Net, we conduct ablation studies on the KITTI → Waymo transfer with PointRCNN as the base model.
Effectiveness of Point/Instance-Level Consistency. We study the effects of different components of the proposed consistency loss. Table 2 reports the experimental results when different combinations of loss components are applied. It is observed that, for both point-level consistency and instance-level consistency, the box consistency clearly has a larger contribution as compared to the class consistency. This observation indicates that the scale difference is a major source of the domain gap between source and target domains with different object size distributions, which is also in line with the previous work [wang2020traininger]. It also shows that our proposed box consistency regularization method effectively mitigates this gap. In addition, all losses are complementary to one another: the best result is achieved when all four of them are used.
Furthermore, we compare MLC-Net with two alternative approaches for point and box matching, respectively, in Table 3. Compared to these baseline approaches, MLC-Net replicates the input point clouds and the region proposals before they are passed to the student and teacher models, eradicating any noise that arises from inaccurate matching. The results highlight the importance of correspondence in constructing meaningful consistency losses for effective unsupervised learning.
| Matching strategy | AP (L1) | APH (L1) | AP (L2) | APH (L2) |
|---|---|---|---|---|
| Max IoU Box | 0.2695 | 0.2666 | 0.2418 | 0.2392 |
Effectiveness of Neural Statistics-Level Consistency. We also examine the effectiveness of neural statistics-level consistency by comparing the performance when such alignment is enabled and disabled. From Table 4, we can see that the model performance drops severely when neural statistics-level consistency is disabled. As analyzed in Section 3.6, when neural statistics-level consistency is not in place, the teacher model's BN layers normalize the input features using batch statistics obtained from target data only, while the student model performs BN with statistics from both the source and target domains. This misalignment creates a significant gap; as a result, the consistency computation between the student and teacher predictions is invalidated. We also compare with the approach in which the student model performs separate BN for source and target data. In this case, although the normalization of the target input is performed with target statistics for both models, the mismatched normalization of source and target inputs leads to suboptimal performance compared to MLC-Net.
Effectiveness of Mean Teacher. The teacher model is essentially a temporal ensemble of student models at different time stamps. We study the effectiveness of the mean teacher paradigm by comparing the performance when the exponential moving average update is enabled or disabled. Table 5 shows that the moving average update mechanism is important for the teacher to generate meaningful supervision to guide the student model, and removing it leads to performance deterioration.
4.4 Further Analysis
Analysis of Distribution Shift. We highlight that the geometric mismatch is a significant issue for the cross-domain deployment of 3D detection models. As shown in Figure 2, the object dimension (length, width, and height) distributions are drastically different across domains, with relatively small overlap. The baseline, trained on the source domain, is not able to generalize to the target domain, as the distribution of its dimension predictions remains close to that of the source domain. In contrast, MLC-Net adapts to the new domain by predicting a geometric distribution highly similar to that of the target domain.
Analysis of Neural Statistics Mismatch. Figure 4 shows that inputs from different domains have very different distributions of batch statistics, which explains the tremendous performance drop when our proposed neural statistics-level consistency is not applied to align the statistics (Table 4).
Analysis of Teacher/Student Paradigm. In Figure 5, the teacher model in MLC-Net demonstrates stronger performance during the training process until convergence. Moreover, the teacher model exhibits a smoother learning curve. This validates the effectiveness of our mean-teacher paradigm to create accurate and reliable supervision for robust optimization of the student model.
5 Conclusion

We study unsupervised 3D domain adaptive detection, which requires no target domain annotation or statistics. We validate that geometric mismatch is a major contributor to the domain shift and propose MLC-Net, which leverages a teacher-student paradigm for robust and reliable pseudo label generation via point-, instance- and neural statistics-level consistency to enforce effective transfer. MLC-Net outperforms all the baselines by convincing margins, and even surpasses methods that require additional target information.