3D object detection aims to categorize and localize objects from 3D sensor data (e.g. LiDAR point clouds), with many applications in autonomous driving, robotics and virtual reality, to name a few. Recently, this field has made remarkable advances [60, 29, 47, 48, 45, 46], driven by deep neural networks and large-scale human-annotated datasets [16, 49, 4, 25].
However, 3D detectors developed on one specific domain (i.e. the training or source domain) might not generalize well to novel testing domains (i.e. target domains) due to unavoidable domain shifts arising from different types of 3D depth sensors, weather conditions, geographical locations, etc. For instance, a 3D detector trained on data collected in USA cities with Waymo LiDAR (i.e. the Waymo dataset) suffers from a dramatic performance drop when evaluated on data from European cities captured by Velodyne LiDAR (i.e. the KITTI dataset). Though collecting more training data from different domains could alleviate this problem, it is often infeasible given the variety of real-world scenarios and the enormous cost of 3D annotation. Therefore, approaches that effectively adapt 3D detectors trained on a labeled source domain to a new unlabeled target domain are in high demand in practical applications. This task is known as unsupervised domain adaptation (UDA) for 3D object detection.
In contrast to the intensive studies on UDA in the 2D setting [12, 34, 22, 9, 41, 14, 15], few efforts have been made to explore UDA for 3D object detection. Meanwhile, the fundamental differences in data structures and network architectures render UDA approaches for image tasks not readily applicable to this problem. For domain adaptation on 3D detection, while a prior method has obtained promising results, it requires object size statistics of the target domain, and its efficacy largely depends on data distributions.
Self-training starts from pre-training a model on labeled source data and then iterates between pseudo label generation and model training on unlabeled target data until convergence. It formulates UDA as a supervised learning problem on the target domain with pseudo labels, which explicitly closes the domain gap. Despite encouraging results in image tasks, our study shows that naive self-training does not work well in UDA for 3D object detection, as shown in Fig. 1 (“source only” vs. “naive ST”).
The major obstacle for self-training on domain adaptive 3D object detection is severe pseudo label noise, such as imprecise object localization (i.e. oriented 3D bounding boxes) and incorrect object categories, which is largely overlooked in the naive self-training pipeline. The noise is the compound effect of the domain gap between source and target data and the limited capability of the 3D object detector (e.g. its systematic errors). These noisy pseudo labels not only misguide the direction of model optimization but also make errors accumulate during iterative pseudo label generation and model training, leading to inferior performance.
In this paper, we propose a holistic denoised self-training pipeline for UDA on 3D object detection, namely ST3D++, which simultaneously reduces target domain pseudo label noise and mitigates the negative impacts of noisy pseudo labels on model training.
First, in model pre-training on labeled source data, we develop random object scaling (ROS), a simple 3D object augmentation technique that randomly scales 3D objects to overcome the bias in object scale of the source domain and effectively reduces the pseudo label noise caused by biased source data. Second, for pseudo label generation in iterative self-training, we develop a hybrid quality-aware triplet memory (HQTM), which comprises a hybrid box scoring criterion to assess the quality of object localization and categorization, a triplet box partition scheme to avoid assigning pseudo labels to inconclusive examples, and a memory updating strategy that integrates historical pseudo labels via ensemble and voting to reduce pseudo label noise and instability. Finally, in the model training process, we design a source-assisted self-denoised (SASD) training method with separate source and target batch normalization, which leverages clean and diverse source annotations to rectify the direction of gradients and addresses the negative impacts of joint optimization on source and target data under domain shift. Meanwhile, curriculum data augmentation (CDA) is developed for pseudo-labeled target data to guarantee effective learning at the beginning and to gradually simulate hard examples by progressively increasing the intensity of augmentation. CDA also prevents the model from overfitting to easy examples (pseudo-labeled data with high confidence) and thus improves the model’s capabilities.
Experimental results on four 3D object detection datasets (KITTI, Waymo, nuScenes and Lyft) for three common categories (car, pedestrian and cyclist) demonstrate the effectiveness of our approach. The performance gaps between source-only results and fully supervised oracle results are closed by a large percentage. Besides, we outperform existing approaches [54, 44, 66] by a notable margin (around 13% to 17%) under the same setup. It is also noteworthy that our approach even outperforms the oracle results for all categories on the Waymo → KITTI setting when further combined with target statistics, as shown in Fig. 1.
Different from our conference paper: This manuscript significantly improves the conference version: (i) We extend domain adaptive 3D detection to multiple categories (i.e. car, pedestrian and cyclist), which, as far as we know, is the first such attempt on four popular 3D detection datasets: Waymo, nuScenes, KITTI and Lyft. (ii) We conduct more analysis on pseudo label noise for self-training and present a holistic pseudo label denoised self-training pipeline, which addresses pseudo label noise in a systematic manner from model pre-training and pseudo label generation to model optimization. (iii) We improve the quality-aware criterion to account for both localization quality and classification accuracy. (iv) We propose a source-assisted training strategy where source examples are leveraged in the self-training stage to rectify incorrect gradient directions from noisy pseudo labels and to provide more diverse patterns. Besides, domain-specific normalization is incorporated to avoid the negative impacts of mixing source and target data in source-assisted joint optimization. (v) We carry out a study on the quality of pseudo labels in 3D object detection, where five quantitative indicators are proposed, and conduct more analysis on how the proposed strategies improve pseudo label quality. (vi) We conduct extensive experiments on four datasets for three categories. The proposed ST3D++ outperforms ST3D over all categories on all adaptation settings, as well as the other existing UDA methods SF-UDA, Dreaming and MLC-Net, by a large margin. (vii) We explore leveraging temporal information to further improve ST3D++ through fused sequential point frames, which also mitigates the point density gaps across domains.
2 Related Work
3D Object Detection from Point Clouds aims to localize and classify 3D objects from point clouds, which is a challenging task due to the irregularity and sparsity of 3D point clouds. Some previous works [8, 28, 61] propose to resolve this task with existing 2D detection methods by mapping the irregular 3D point clouds to 2D bird's-eye-view grids. Another line of research [60, 70, 48, 19, 45] adopts 3D convolutional networks to learn 3D features from voxelized point clouds, and the extracted 3D feature volumes are further compressed to bird's-eye-view feature maps as above. Recently, point-based approaches [47, 64] propose to directly generate 3D proposals from raw point clouds by adopting PointNet++ to extract point-wise features. There are also methods [37, 55] that utilize 2D images to generate 2D box proposals, which are then employed to crop object-level point clouds and produce 3D bounding boxes. In our work, we adopt SECOND, PointRCNN and PV-RCNN as our 3D object detectors.
Unsupervised Domain Adaptation aims to obtain a robust model that generalizes well to target domains given only labeled source examples and unlabeled target data. Previous works [34, 35] explore domain-invariant feature learning by minimizing the Maximum Mean Discrepancy. Inspired by GANs, adversarial learning was employed to align feature distributions across different domains on image classification [12, 13], semantic segmentation [22, 52] and object detection [9, 41] tasks. Besides, benefiting from the development of unpaired image-to-image translation, some methods [21, 69, 17] proposed to mitigate the domain gap at the pixel level by translating images across domains. Another line of approaches [42, 73, 26, 5] leverages self-training to generate pseudo labels for unlabeled target domains. Saito et al. adopted a two-branch classifier to reduce the domain discrepancy. Other works [31, 53, 7] alleviated the domain shift on batch normalization (BN) layers by modulating the statistics in BN layers before evaluation or by specializing BN parameters domain by domain. The methods in [50, 11, 10] employed curriculum learning and separated examples by their difficulties to realize a local sample-level curriculum. Xu et al. proposed a progressive feature-norm enlarging method to reduce the domain gap. Finally, [33, 63] injected feature perturbations to obtain a robust classifier through adversarial training.
On par with the developments in domain adaptation for image recognition tasks, some recent works also aim to address the domain shift on point clouds for shape classification and semantic segmentation [56, 65, 24]. However, despite intensive studies on the 3D object detection task [70, 47, 60, 48, 64, 45], only a few approaches have been proposed to solve UDA for 3D object detection. Wang et al. proposed SN to normalize the object size of the source domain by leveraging the object statistics of the target domain, closing the size-level domain gap. Though the performance is improved, the method needs target statistics, and its effectiveness depends on the source and target data distributions. SF-UDA and Dreaming leveraged an extra tracker and temporal consistency regularization along target point cloud sequences to generate better pseudo labels for self-training. SRDAN employed scale-aware and range-aware adversarial feature alignment to match the distributions of the two domains, which might suffer from stability and convergence issues. MLC-Net employed the mean-teacher paradigm to address the geometric mismatch between source and target domains. In contrast, the proposed ST3D++ tailors a self-denoising framework to simultaneously close content and point distribution gaps across domains, and achieves superior performance on all three categories of four adaptation tasks with neither prior target object statistics nor extra computation.
3 Analysis of Pseudo Label Noise in Domain Adaptive 3D Object Detection
Self-training has achieved remarkable progress on unsupervised domain adaptation [42, 73, 26, 5] by alternating between pseudo-labeling target data and updating the model, which explicitly closes the domain gap by reformulating the UDA problem as a supervised learning problem on the target domain with pseudo labels. If the pseudo labels were perfect, the problem would be close to supervised learning on the target domain. Hence, the quality of pseudo labels is the determining factor for successful domain adaptation. In the following, we analyze pseudo label noise in 3D object detection.
As 3D object detection requires jointly categorizing and localizing objects, pseudo label noise can be divided into localization noise and classification noise, as shown in Fig. 2. For these two types of noise, their causes, including domain shifts and model capabilities, and their negative impacts on domain adaptive 3D object detection with self-training are elaborated below.
Localization noise (i.e. low-quality pseudo-labeled bounding boxes) is unavoidable when pseudo labeling the target domain data. On the one hand, the object detector has to estimate oriented 3D bounding boxes from incomplete point clouds, which is itself an ill-posed and challenging problem with ambiguities (see the Oracle prediction in the upper row of Fig. 2). Different source and target point cloud patterns further increase the difficulty of producing accurate bounding box predictions on the target domain data. As shown in the upper row of Fig. 2, the pseudo label of a car predicted by the source-only model achieves a much lower IoU with the ground truth (GT) than the pseudo label obtained by the oracle model, due to a few noise points. On the other hand, object sizes and annotation rules vary across datasets (e.g. cars in the Waymo dataset are around 0.9 meters longer than cars in KITTI on average), which gives rise to negative transfer in 3D bounding boxes from the source domain to the target domain and generates inaccurate pseudo-labeled 3D boxes on the target domain.
Classification noise refers to false positives (i.e. backgrounds or objects of other categories that are misclassified and assigned pseudo-labeled bounding boxes) and false negatives (i.e. missed objects), as shown in the bottom row of Fig. 2. The main causes are threefold: different point cloud patterns in the source and target domains confuse the object detector, leading to incorrect class predictions; the detector’s capability is insufficient to distinguish backgrounds from different object categories; and the criteria used to generate pseudo categories are sub-optimal.
Pseudo label noise, including localization and classification noise in 3D object detection, is unavoidable and yet detrimental to the self-training process if not properly handled. First, the noisy pseudo labeled data will produce imprecise gradients which will guide the updating of model weights in an incorrect direction. Then, the negative impacts will be amplified as the updated model will be further used to produce more noisy pseudo labeled data, making the training process fall into a vicious circle of error accumulation.
Our goal is to adapt a 3D object detector trained on labeled source data $\{(P^s_i, L^s_i)\}_{i=1}^{n_s}$ with $n_s$ samples to unlabeled target data $\{P^t_i\}_{i=1}^{n_t}$ with $n_t$ samples. Here, $P^s_i$ and $L^s_i$ represent the $i$-th source input point cloud and its corresponding label. $L^s_i$ contains the category and 3D bounding box information for each object in the $i$-th point cloud, and each box is parameterized by its center $(c_x, c_y, c_z)$, size $(l, w, h)$ and heading angle $\theta$. Similarly, $P^t_i$ denotes the $i$-th unlabeled target point cloud.
We present ST3D++, a self-training framework with a holistic pseudo label denoising pipeline, to adapt a 3D detector trained on the source domain to the target domain. ST3D++ tackles the noise in pseudo labeling for 3D object detection by reducing the above noise to generate high-quality pseudo labels and by mitigating the negative impacts of noisy labels during model training. An overview of our approach is shown in Fig. 3 and described in Algorithm 1. First, the object detector is pre-trained on the labeled source domain with random object scaling (ROS) (see Fig. 4 (a)), which mitigates source domain bias and facilitates obtaining a robust object detector for pseudo label generation. Then, the object detector is progressively improved by alternating between two steps: generating pseudo labels on the target data, where the hybrid quality-aware triplet memory (HQTM) (see Fig. 4 (b)) is designed to denoise the pseudo-labeled data and enforce pseudo label consistency throughout training; and fine-tuning the object detector on pseudo-labeled target data, where a source-assisted self-denoised (SASD) training method is proposed to rectify imprecise gradient directions from noisy labels and a curriculum data augmentation (CDA) strategy (see Fig. 7) is incorporated to avoid overfitting to easy pseudo-labeled samples.
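The alternation described above can be outlined in a few lines of Python. This is only a structural sketch under stated assumptions: the function `self_train` and its four callables (`pretrain`, `generate_proxy_labels`, `update_memory`, `finetune`) are hypothetical placeholders for the stages named in the text, not the interface of Algorithm 1 or of any released code.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def self_train(
    pretrain: Callable[[], Any],
    generate_proxy_labels: Callable[[Any, Any], List[Any]],
    update_memory: Callable[[List[Any], List[Any]], List[Any]],
    finetune: Callable[[Any, Dict[int, List[Any]]], Any],
    target_scans: Sequence[Any],
    num_rounds: int,
) -> Tuple[Any, Dict[int, List[Any]]]:
    """Alternate between pseudo label generation and model training.

    All callables are hypothetical stand-ins:
      pretrain              -> source pre-training (e.g. with ROS)
      generate_proxy_labels -> scoring + triplet partition of detector outputs
      update_memory         -> memory ensemble and voting
      finetune              -> training on source + pseudo-labeled target data
    """
    model = pretrain()
    memory: Dict[int, List[Any]] = {}  # one pseudo label list per target scan
    for _ in range(num_rounds):
        # Stage 1: refresh the pseudo labels of every target scan.
        for idx, scan in enumerate(target_scans):
            proxy = generate_proxy_labels(model, scan)
            memory[idx] = update_memory(memory.get(idx, []), proxy)
        # Stage 2: train the detector with the refreshed memory.
        model = finetune(model, memory)
    return model, memory
```

Keeping the memory keyed by scan index makes explicit that historical pseudo labels persist across rounds, which the memory update step relies on.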
4.2 Model Pre-training with Random Object Scaling
ST3D++ starts from training a 3D object detector on the labeled source data. The pre-trained model learns how to perform 3D detection on labeled source data and is further adopted to initialize object predictions for the unlabeled target domain data.
Observations. Despite the useful knowledge it acquires, the pre-trained detector also learns the bias of the source data, such as object sizes and point densities. Among them, the bias in object size has a direct negative impact on 3D object detection and results in noisy pseudo-labeled target bounding boxes with incorrect sizes. This is in line with findings in prior work. To mitigate the issue, we propose a very simple yet effective object-wise augmentation strategy, i.e. random object scaling (ROS), which fully leverages the high degree of freedom of 3D space.
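Random Object Scaling. Given an annotated 3D bounding box with size $(l, w, h)$, center $(c_x, c_y, c_z)$ and heading angle $\theta$, ROS scales the box in the length, width and height dimensions with random scale factors $(r_l, r_w, r_h)$ by transforming the points inside the box. We denote the points inside the box as $\{p_i\}_{i=1}^{n_p}$ with a total of $n_p$ points, and the coordinate of $p_i$ is represented as $(p^x_i, p^y_i, p^z_i)$. First, we transform the points to the object-centric coordinate system of the box along its length, width and height dimensions via

$$(p^l_i, p^w_i, p^h_i) = (p^x_i - c_x,\; p^y_i - c_y,\; p^z_i - c_z) \cdot R, \qquad R = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{1}$$

where $\cdot$ is matrix multiplication. Second, to derive the scaled object, the point coordinates inside the box are scaled to be $(r_l p^l_i, r_w p^w_i, r_h p^h_i)$ with object size $(r_l l, r_w w, r_h h)$. Third, to derive the augmented data $\{p^{aug}_i\}_{i=1}^{n_p}$, the points inside the scaled box are transformed back to the ego-car coordinate system and shifted to the center $(c_x, c_y, c_z)$ as

$$p^{aug}_i = (r_l p^l_i,\; r_w p^w_i,\; r_h p^h_i) \cdot R^{\top} + (c_x, c_y, c_z). \tag{2}$$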
Albeit simple, ROS effectively simulates objects with diverse sizes to address the size bias, and hence facilitates training de-biased object detectors that produce more accurate initial pseudo boxes on the target domain for subsequent self-training.
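To make the three ROS steps concrete, the following plain-Python sketch applies them to the points of a single box. It is an illustration under assumptions: the function name, the argument layout and the default sampling range `scale_range=(0.75, 1.25)` are hypothetical choices made here, and the heading angle is taken as a rotation about the vertical axis.

```python
import math
import random


def random_object_scaling(points, center, size, heading,
                          scale=None, scale_range=(0.75, 1.25), rng=random):
    """Scale the points of one object along its own length/width/height axes.

    points : list of (x, y, z) points that lie inside the box (ego frame)
    center : (cx, cy, cz) box center; size : (l, w, h); heading : yaw in radians
    scale  : optional fixed (r_l, r_w, r_h); sampled from scale_range if None
             (the default range is an assumption of this sketch)
    Returns the augmented points and the scaled box size.
    """
    if scale is None:
        scale = tuple(rng.uniform(*scale_range) for _ in range(3))
    r_l, r_w, r_h = scale
    cx, cy, cz = center
    cos_t, sin_t = math.cos(heading), math.sin(heading)

    augmented = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        # Step 1: ego frame -> object-centric frame (rotate by -heading).
        p_l = dx * cos_t + dy * sin_t
        p_w = -dx * sin_t + dy * cos_t
        p_h = dz
        # Step 2: scale along the box axes.
        p_l, p_w, p_h = r_l * p_l, r_w * p_w, r_h * p_h
        # Step 3: rotate back by +heading and shift to the box center.
        augmented.append((p_l * cos_t - p_w * sin_t + cx,
                          p_l * sin_t + p_w * cos_t + cy,
                          p_h + cz))

    l, w, h = size
    return augmented, (r_l * l, r_w * w, r_h * h)
```

With unit scale factors the function returns the input points unchanged, which is a quick sanity check that the forward and backward rotations are consistent.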
4.3 De-noised Pseudo Label Generation with Hybrid Quality-aware Triplet Memory
With the trained detector, the next step is to generate pseudo labels for the unlabeled target data. Given the target sample $P^t_i$, the output of the object detector is a group of predicted boxes containing category labels, confidence scores, regressed box sizes, box centers and heading angles, where non-maximum suppression (NMS) has already been conducted to remove duplicated detections. For clarity, we denote $[B^t_i]_k$ as the outputs from the object detector for the $i$-th sample at stage $k$.
Observations. Different from classification and segmentation tasks, 3D object detection needs to jointly consider classification and localization information, which poses great challenges for high-quality pseudo label generation. First, the confidence of the object category prediction may not reflect the precision of localization, as shown by the blue line in Fig. 5 (a). Second, the fraction of false labels is greatly increased in confidence score intervals with medium values, as illustrated in Fig. 5 (b). Third, model fluctuations induce inconsistent pseudo labels, as demonstrated in Fig. 5 (c). These factors negatively affect pseudo-labeled objects, leading to noisy supervisory information and unstable model training. Hence, reducing the noise of pseudo labels in the pseudo label generation stage is essential for self-training.
To this end, we design the hybrid quality-aware triplet memory (HQTM) to parse noisy object predictions $[B^t_i]_k$ into high-quality pseudo labels $[M^t_i]_k$, which are cached into the memory at each stage $k$ for self-training. To obtain high-quality $[M^t_i]_k$, HQTM includes two major denoising components tailored to 3D object detection. Given noisy predictions $[B^t_i]_k$, the first denoising component incorporates a hybrid criterion (see Sec. 4.3.1) to assess the quality of each prediction regarding localization and classification, and a triplet partition scheme (see Sec. 4.3.2) to produce proxy-pseudo labels, denoted as $[\hat{L}^t_i]_k$, aiming to avoid assigning labels to object predictions with ambiguous confidence. (We use the term “proxy-pseudo label” to differentiate the intermediate pseudo labels produced from the object detector from the pseudo labels in the memory.) Then, to further enhance pseudo label quality, the second denoising component (see Sec. 4.3.3) combines the proxy-pseudo labels $[\hat{L}^t_i]_k$ and the historical pseudo labels $[M^t_i]_{k-1}$ in the memory through the elaborately designed memory ensemble and voting strategies. The overall pipeline of this procedure is demonstrated in Fig. 4 (b).
[Table I: score intervals 0.4 - 0.5, 0.5 - 0.6, 0.6 - 0.7, 0.7 - 0.8 and 0.8 - 0.9; table body not recovered.]
4.3.1 Hybrid Quality-aware Criterion for Scoring
Although classification confidence is a widely adopted criterion to measure the quality of predictions [42, 73, 26] in self-training, it fails to reflect the localization quality in 3D object detection as shown in Fig. 5 (a). To simultaneously evaluate the quality of localization and classification, we propose an IoU-based criterion which is further integrated with the confidence-based criterion to assess prediction qualities detailed as below.
IoU-based Criterion for Scoring.
To directly assess the localization quality of pseudo labels, we propose to augment the original object detection model with a lightweight IoU regression head. Specifically, given the feature derived from RoI pooling, we append two fully connected layers to directly predict the 3D box IoU between RoIs and their ground truths (GTs) or pseudo labels. A sigmoid function is adopted to map the output into the $[0, 1]$ range. During model training, the IoU branch is optimized using a binary cross-entropy loss as

$$\mathcal{L}_{iou} = -\hat{u} \log u - (1 - \hat{u}) \log (1 - u), \tag{3}$$

where $u$ is the predicted IoU and $\hat{u}$ is the IoU between the ground truth (or pseudo label) box and the predicted 3D box.
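For a single RoI, the binary cross-entropy objective with a soft IoU target can be written as below. This is a minimal scalar sketch: the names `iou_bce_loss`, `logit` and `target_iou` and the clamping constant `eps` are assumptions introduced here for illustration, and a real detector would compute the same quantity over batches of RoIs.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw head output to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def iou_bce_loss(logit: float, target_iou: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted IoU and a soft IoU target.

    logit      : raw output of the (hypothetical) IoU regression head
    target_iou : IoU between the predicted box and its GT or pseudo label box
    eps        : clamp to keep the logarithms finite (an assumption of this sketch)
    """
    pred = min(max(sigmoid(logit), eps), 1.0 - eps)
    return -(target_iou * math.log(pred)
             + (1.0 - target_iou) * math.log(1.0 - pred))
```

Because the target is a soft value rather than a hard 0/1 label, the loss is minimized when the predicted IoU equals the target IoU, not when it saturates at 0 or 1.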
Hybrid Criterion for Better Scoring. For object categories that are easily distinguishable from backgrounds (e.g. “cars”), the IoU score not only correlates well with localization quality (see Fig. 5 (a)), but also yields more true positives (TPs) in the confident score interval (see Table I) if adopted as a criterion. However, the classification score is clearly superior for categories that are easily confused with backgrounds (e.g. pedestrians, which resemble background “trees” and “poles”): it obtains pseudo labels with higher TP ratios than the IoU score in the high-confidence regime (i.e. confidence larger than 0.7) in Table I.
To take the best of both worlds, we propose an effective hybrid quality-aware criterion which integrates classification confidence and IoU scores in a weighted manner as

$$o = \phi\, p + (1 - \phi)\, u, \tag{4}$$

where $o$ is the final criterion for each 3D box, $p$ is the classification score of object predictions, $u$ is the predicted IoU score, and $\phi$ is the trade-off parameter between $u$ and $p$ ($\phi$ is set between 0 and 0.5 across different tasks).
4.3.2 Triplet Box Partition to Avoid Ambiguous Samples
Now, we are equipped with a hybrid quality assessment criterion to assess the predictions $[B^t_i]_k$ (for the $i$-th sample at stage $k$) from the detector after NMS. Here, to avoid assigning labels to ambiguous examples, which introduce noise, we present a triplet box partition scheme to obtain the proxy-pseudo labels $[\hat{L}^t_i]_k$. Given an object box $b$ from $[B^t_i]_k$ with the final criterion $o_b$, we create a margin $[T_{neg}, T_{pos}]$ and ignore boxes whose score falls inside this margin, preventing them from contributing to training, as follows:

$$state_b = \begin{cases} \text{Positive}, & T_{pos} \le o_b, \\ \text{Ignored}, & T_{neg} \le o_b < T_{pos}, \\ \text{Negative}, & o_b < T_{neg}. \end{cases} \tag{5}$$

If $state_b$ is positive, $b$ will be cached into $[\hat{L}^t_i]_k$ as a positive sample with its state, category label, pseudo box and confidence. Similarly, ignored boxes will also be incorporated into $[\hat{L}^t_i]_k$ to identify regions that should be ignored during model training due to their high uncertainty. Boxes with a negative $state_b$ will be discarded, as they correspond to backgrounds.
Our triplet box partition scheme reduces pseudo label noise caused by ambiguous boxes and thus ensures the quality of pseudo-labeled boxes. It also shares a similar spirit with curriculum learning, where confident easy boxes are learned at earlier stages and ambiguous hard examples are handled later, after the model has been trained well enough to distinguish these examples.
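The scoring and partition steps of Sec. 4.3.1 and Sec. 4.3.2 can be sketched together as follows. The sketch rests on assumptions: the weighted sum places the trade-off weight (`cls_weight`, hypothetical name) on the classification score, boxes are represented as plain dictionaries with hypothetical keys `cls_score` and `iou_score`, and the margin boundaries are treated as inclusive on the lower side. The threshold values in the usage lines are arbitrary examples, not values from the paper.

```python
def hybrid_score(cls_score: float, iou_score: float, cls_weight: float = 0.5) -> float:
    """Weighted combination of classification confidence and predicted IoU.

    Which of the two scores carries the weight is an assumption of this sketch.
    """
    return cls_weight * cls_score + (1.0 - cls_weight) * iou_score


def triplet_partition(boxes, t_neg: float, t_pos: float, cls_weight: float = 0.5):
    """Split detector outputs into positive / ignored / discarded boxes.

    boxes : list of dicts with (hypothetical) keys "cls_score" and "iou_score"
    Returns only the boxes that enter the proxy-pseudo labels, each annotated
    with its hybrid "score" and its "state" ("positive" or "ignored").
    """
    kept = []
    for box in boxes:
        score = hybrid_score(box["cls_score"], box["iou_score"], cls_weight)
        if score >= t_pos:
            state = "positive"   # confident: used as a pseudo label
        elif score >= t_neg:
            state = "ignored"    # ambiguous: region masked out during training
        else:
            continue             # low score: treated as background, dropped
        kept.append({**box, "score": score, "state": state})
    return kept


# Usage with arbitrary example thresholds.
detections = [
    {"cls_score": 0.9, "iou_score": 0.8},
    {"cls_score": 0.5, "iou_score": 0.3},
    {"cls_score": 0.1, "iou_score": 0.2},
]
proxy_labels = triplet_partition(detections, t_neg=0.25, t_pos=0.6)
```

Keeping ignored boxes in the output, rather than dropping them, is what allows the training loss to skip those regions instead of treating them as background.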
4.3.3 Memory Update and Pseudo Label Generation
Here, we combine the proxy-pseudo labels $[\hat{L}^t_i]_k$ at stage $k$ and the historical pseudo labels $[M^t_i]_{k-1}$ in the memory via memory ensemble and voting, which leverages historical pseudo labels to denoise pseudo labels and obtain consistency-regularized pseudo labels at stage $k$. The outputs are the updated pseudo labels $[M^t_i]_k$ that will serve as supervision for subsequent model training. During this memory update process, each pseudo box $b$ from $[\hat{L}^t_i]_k$ and $[M^t_i]_{k-1}$ has three attributes $(o_b, state_b, cnt_b)$, which are the hybrid quality-aware confidence score, the state (positive or ignored) and an unmatched memory counter (UMC) (for memory voting), respectively. We assume that $[\hat{L}^t_i]_k$ contains $n_l$ boxes denoted as $\{(o^l_j, state^l_j, cnt^l_j)\}_{j=1}^{n_l}$ and $[M^t_i]_{k-1}$ has $n_m$ boxes represented as $\{(o^m_j, state^m_j, cnt^m_j)\}_{j=1}^{n_m}$.
Memory Ensemble. Instead of directly replacing $[M^t_i]_{k-1}$ with the latest proxy-pseudo labels $[\hat{L}^t_i]_k$, we propose the memory ensemble operation to combine $[M^t_i]_{k-1}$ and $[\hat{L}^t_i]_k$, which denoises the pseudo labels through consistency checking against historical pseudo labels.
The memory ensemble operation matches two object boxes with similar locations, sizes and angles from $[M^t_i]_{k-1}$ and $[\hat{L}^t_i]_k$, and merges them to produce a new object box. By default, we adopt the consistency ensemble strategy for box matching. Specifically, it calculates the pair-wise 3D IoU matrix $A = \{a_{jv}\} \in \mathbb{R}^{n_m \times n_l}$ between each box in $[M^t_i]_{k-1}$ and each box in $[\hat{L}^t_i]_k$. For the $j$-th object box in $[M^t_i]_{k-1}$, its matched box index $\hat{j}$ in $[\hat{L}^t_i]_k$ is derived by

$$\hat{j} = \arg\max_{v} \; a_{jv}, \qquad v = 1, \dots, n_l. \tag{6}$$

Note that if $a_{j\hat{j}} < 0.1$, we denote each of these two paired boxes as unmatched boxes that will be further processed by the memory voting operation.
We denote the successfully matched pair of object boxes as $(o^l_{\hat{j}}, state^l_{\hat{j}}, cnt^l_{\hat{j}})$ and $(o^m_j, state^m_j, cnt^m_j)$. They are merged by caching the pseudo-labeled box with the higher confidence value into $[M^t_i]_k$ and updating its corresponding attributes as

$$(o^m_j, state^m_j, cnt^m_j) = \begin{cases} (o^l_{\hat{j}}, state^l_{\hat{j}}, 0), & \text{if } o^m_j \le o^l_{\hat{j}}, \\ (o^m_j, state^m_j, 0), & \text{otherwise}. \end{cases} \tag{7}$$

Here, we choose the box with the higher confidence instead of a weighted combination, since a weighted combination can produce an unreasonable final box if the matched boxes have very different heading angles (see Fig. 6 “wrong case”). Similarly, 3D Auto Labeling also observes that selecting the box candidate with the highest confidence is preferable.
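The consistency ensemble can be sketched in plain Python given a precomputed IoU matrix. Several points are assumptions of this sketch: boxes are dictionaries with hypothetical keys `score`, `state` and `cnt`; the 3D IoU computation itself is left out and passed in as `iou`; the default `match_thresh=0.1` mirrors the threshold the text quotes for treating low-IoU pairs as unmatched; and the counter of a merged box is reset to zero.

```python
def consistency_ensemble(memory, proxy, iou, match_thresh=0.1):
    """Match each historical box to its best-overlapping proxy box and keep
    the more confident one of every matched pair.

    memory : historical pseudo boxes (list of dicts with "score", "state", "cnt")
    proxy  : current proxy-pseudo boxes (same layout)
    iou    : iou[j][v] = 3D IoU between memory[j] and proxy[v] (precomputed)
    Returns (merged, unmatched_memory, unmatched_proxy).
    """
    merged, unmatched_memory = [], []
    matched_proxy = set()
    for j, m_box in enumerate(memory):
        # Best-overlapping proxy box for this memory box (None if no proxies).
        best_v = max(range(len(proxy)), key=lambda v: iou[j][v], default=None)
        if best_v is None or iou[j][best_v] < match_thresh:
            unmatched_memory.append(m_box)
            continue
        matched_proxy.add(best_v)
        p_box = proxy[best_v]
        # Keep the higher-confidence box instead of averaging the two, which
        # could yield an implausible box when the headings disagree.
        winner = p_box if p_box["score"] >= m_box["score"] else m_box
        merged.append({**winner, "cnt": 0})
    unmatched_proxy = [p for v, p in enumerate(proxy) if v not in matched_proxy]
    return merged, unmatched_memory, unmatched_proxy
```

Note that this greedy per-row matching allows two memory boxes to select the same proxy box; the bipartite variant discussed in the text is one way to enforce one-to-one pairs.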
In addition, we explore two other memory ensemble variants for box ensembling. The first variant, NMS ensemble, is an intuitive solution that matches and merges the two types of boxes by removing duplicated boxes based on IoU: it directly removes matched boxes with lower confidence scores. Specifically, we concatenate the historical pseudo labels $[M^t_i]_{k-1}$ and the current proxy-pseudo labels $[\hat{L}^t_i]_k$ into $[\tilde{M}^t_i]_k$, as well as their corresponding confidence scores into $[\tilde{o}^t_i]_k$, for each target sample $P^t_i$. Then, we obtain the final pseudo boxes $[M^t_i]_k$ and the corresponding confidence scores $[o^t_i]_k$ by applying NMS with a fixed IoU threshold as

$$[M^t_i]_k,\; [o^t_i]_k = \mathrm{NMS}\big([\tilde{M}^t_i]_k,\; [\tilde{o}^t_i]_k\big). \tag{8}$$
The second variant, bipartite ensemble, employs optimal bipartite matching to pair historical pseudo labels with current proxy-pseudo labels and then follows the consistency ensemble to process matched pairs. Concretely, we assume that there are $n_m$ and $n_l$ boxes in $[M^t_i]_{k-1}$ and $[\hat{L}^t_i]_k$, respectively. Then, we search for a permutation $\sigma$ of $n_m$ elements with the lowest cost as

$$\hat{\sigma} = \arg\min_{\sigma} \sum_{j=1}^{n_m} \mathcal{L}_{match}\big(b^m_j,\; b^l_{\sigma(j)}\big), \tag{9}$$

where the matching cost $\mathcal{L}_{match}$ is the negative IoU between the matched boxes. Notice that matched box pairs with an IoU lower than 0.1 would still be regarded as unmatched.
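For intuition, the bipartite variant can be illustrated with a brute-force search over one-to-one assignments. This is a toy sketch under assumptions: the function name `bipartite_match` is hypothetical, the cost is taken as the negative IoU so that the lowest cost pairs the most-overlapping boxes, and exhaustive enumeration is used only because it needs no dependencies. It is practical for a handful of boxes at most; an efficient solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice in practice.

```python
from itertools import permutations


def bipartite_match(iou, match_thresh=0.1):
    """One-to-one matching between memory boxes (rows) and proxy boxes
    (columns) that minimizes the total negative IoU.

    iou : iou[j][v] = IoU between memory box j and proxy box v
    Returns a list of (memory_index, proxy_index) pairs whose IoU reaches
    match_thresh; all other boxes count as unmatched.
    """
    n_m = len(iou)
    n_l = len(iou[0]) if n_m else 0
    if n_m == 0 or n_l == 0:
        return []

    best_cost, best_pairs = float("inf"), []
    if n_m <= n_l:
        # Assign every memory box to a distinct proxy box.
        for perm in permutations(range(n_l), n_m):
            cost = -sum(iou[j][perm[j]] for j in range(n_m))
            if cost < best_cost:
                best_cost = cost
                best_pairs = [(j, perm[j]) for j in range(n_m)]
    else:
        # More memory boxes than proxies: assign every proxy box instead.
        for perm in permutations(range(n_m), n_l):
            cost = -sum(iou[perm[v]][v] for v in range(n_l))
            if cost < best_cost:
                best_cost = cost
                best_pairs = [(perm[v], v) for v in range(n_l)]

    # Pairs with too little overlap are treated as unmatched.
    return sorted((j, v) for j, v in best_pairs if iou[j][v] >= match_thresh)
```

In the example `[[0.6, 0.7], [0.0, 0.65]]`, a per-row argmax would send both memory boxes to the second proxy box, whereas the one-to-one matching pairs each memory box with a different proxy box.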
Memory Voting. The memory ensemble operation can effectively select the better box from each matched pair. However, it cannot handle the unmatched pseudo boxes from either $[M^t_i]_{k-1}$ or $[\hat{L}^t_i]_k$. As the unmatched boxes often contain both false positive and true positive boxes, either caching them all into the memory or discarding them all is sub-optimal. To address this problem, we propose a novel memory voting approach, which leverages the history of unmatched object boxes to robustly determine their status (cache, discard or ignore). For the $j$-th unmatched pseudo box $b_j$ from $[M^t_i]_{k-1}$ or $[\hat{L}^t_i]_k$, its unmatched memory counter (UMC) $cnt_j$ will be updated as follows:

$$cnt_j = \begin{cases} 0, & \text{if } b_j \in [\hat{L}^t_i]_k, \\ cnt_j + 1, & \text{if } b_j \in [M^t_i]_{k-1}. \end{cases} \tag{10}$$

We update the UMC of unmatched boxes in $[M^t_i]_{k-1}$ by adding 1 and initialize the UMC of newly generated boxes in $[\hat{L}^t_i]_k$ as 0. The UMC records the number of times a box has gone unmatched, and is combined with two thresholds $T_{ign}$ and $T_{rm}$ to select the subsequent operation for unmatched boxes as

$$\begin{cases} \text{Discard}, & cnt_j \ge T_{rm}, \\ \text{Ignore (store to } [M^t_i]_k\text{)}, & T_{ign} \le cnt_j < T_{rm}, \\ \text{Cache (store to } [M^t_i]_k\text{)}, & cnt_j < T_{ign}. \end{cases} \tag{11}$$

Benefiting from memory voting, we can generate more robust and consistent pseudo boxes by caching occasionally unmatched boxes in the memory.
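The voting rule for unmatched boxes can be sketched as below. This is illustrative only: boxes are dictionaries with hypothetical keys `state` and `cnt`, the exact placement of the inequalities (discard once the counter reaches `t_rm`, ignore once it reaches `t_ign`) is an assumption, and the threshold values are supplied by the caller rather than taken from the paper.

```python
def memory_vote(unmatched_memory, unmatched_proxy, t_ign, t_rm):
    """Decide the fate of boxes that found no partner in the memory ensemble.

    unmatched_memory : historical boxes without a match (their counter grows)
    unmatched_proxy  : newly predicted boxes without a match (counter starts at 0)
    t_ign, t_rm      : counter thresholds for ignoring and removing a box
    Returns the boxes that stay in the memory with updated "cnt" and "state".
    """
    # Counter update: new boxes start at 0, historical boxes add one miss.
    candidates = [{**box, "cnt": 0} for box in unmatched_proxy]
    candidates += [{**box, "cnt": box["cnt"] + 1} for box in unmatched_memory]

    kept = []
    for box in candidates:
        if box["cnt"] >= t_rm:
            continue                              # missed too often: discard
        if box["cnt"] >= t_ign:
            box = {**box, "state": "ignored"}     # uncertain: keep but ignore
        kept.append(box)                          # otherwise cache unchanged
    return kept
```

A box that is missed once is therefore not thrown away immediately; it is only demoted to ignored, and later removed, if it keeps going unmatched over several rounds.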
4.4 Model Training with SASD and CDA
The proposed hybrid quality-aware triplet memory addresses pseudo label noise during the pseudo label generation stage and can produce high-quality and consistent pseudo labels for each target point cloud. Now, the detection model can be trained on the labeled source data and the pseudo-labeled target data as described in Algorithm 1 (Line 7) and Fig. 7. In the following, we elaborate on how to alleviate the negative impacts of noisy pseudo-labeled data on model training using source-assisted self-denoised training (see Sec. 4.4.1) and how to avoid model over-fitting with curriculum data augmentation (see Sec. 4.4.2).
4.4.1 Source-assisted Self-denoised Training
Although we have attempted to mitigate pseudo label noise at the model pre-training and pseudo label generation stages as discussed above, pseudo labels still cannot be totally noise-free. The pseudo label noise will unfortunately disturb the direction of model updating, and errors will accumulate during iterative self-training. Here, we propose source-assisted self-denoised (SASD) training to make the optimization more noise-tolerant and to ease error accumulation through joint optimization on source data and pseudo-labeled target data. However, simultaneously optimizing on data from different domains could induce domain shifts and degrade model performance. To tackle this issue, we adopt domain specific normalization to avoid the negative impacts of joint optimization on data with different underlying distributions. Details of domain specific normalization and the joint training optimization objectives are elaborated below.
Domain Specific Normalization. Batch normalization (BN) is an extensively employed layer in deep neural networks, as it can effectively accelerate convergence and boost model performance. Nevertheless, BN suffers from transferability issues when applied to cross-domain data, as source and target examples are drawn from different underlying distributions. This is also observed in the adversarial training community for handling out-of-domain data. In this regard, we replace each BN layer with a very simple Domain Specific Normalization (DSNorm) layer, which disentangles the statistics estimation of different domains at the normalization layer. During training, DSNorm calculates the batch mean $\mu^d$ and variance $(\sigma^d)^2$ for each domain separately as Eq. (12):

$$\mu^d = \frac{1}{n^d N} \sum_{i=1}^{n^d} \sum_{j=1}^{N} F^d_{ij}, \qquad (\sigma^d)^2 = \frac{1}{n^d N} \sum_{i=1}^{n^d} \sum_{j=1}^{N} \big(F^d_{ij} - \mu^d\big)^2, \tag{12}$$

where $d \in \{s, t\}$ is the domain indicator for the source and target domain respectively, $n^d$ is the total number of domain-specific samples in a mini-batch, $F^d$ is the input feature of one feature channel, and $N$ is the number of elements (e.g. pixels or points) in this feature channel. Here, we ignore the channel index for simplicity, and the above process is performed on each channel separately. Meanwhile, since the transformation parameters $\gamma$ and $\beta$ are domain agnostic and transferable across domains, the two domains share the learnable scale and shift parameters as Eq. (13):

$$\hat{F}^d_{ij} = \frac{F^d_{ij} - \mu^d}{\sigma^d}, \qquad y^d_{ij} = \gamma \hat{F}^d_{ij} + \beta, \tag{13}$$

where $\hat{F}^d_{ij}$ is the normalized value and $y^d_{ij}$ is the transformed feature.
At the inference stage on the target domain, model predictions are obtained using the target statistics $\mu^t$ and $\sigma^t$, estimated by a moving average over the batch mean and variance during training.
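The behaviour of domain specific normalization can be illustrated on a single feature channel with plain Python. The sketch below is not the layer used in the paper: the class name, the `momentum` and `eps` defaults (borrowed from common batch normalization practice) and the flattened-list input are all assumptions made for illustration.

```python
import math


class DomainSpecificNorm:
    """Single-channel normalization with per-domain statistics and shared
    scale/shift parameters (a toy stand-in for a DSNorm layer)."""

    def __init__(self, momentum: float = 0.1, eps: float = 1e-5):
        self.momentum = momentum
        self.eps = eps            # numerical-stability term (assumption)
        self.gamma = 1.0          # scale, shared by both domains
        self.beta = 0.0           # shift, shared by both domains
        # Separate running statistics for each domain.
        self.running = {
            "source": {"mean": 0.0, "var": 1.0},
            "target": {"mean": 0.0, "var": 1.0},
        }

    def forward(self, values, domain: str, training: bool = True):
        """values: all elements of this channel for one domain's mini-batch."""
        stats = self.running[domain]
        if training:
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Moving average, later used at inference for this domain.
            stats["mean"] = (1 - self.momentum) * stats["mean"] + self.momentum * mean
            stats["var"] = (1 - self.momentum) * stats["var"] + self.momentum * var
        else:
            mean, var = stats["mean"], stats["var"]
        return [self.gamma * (v - mean) / math.sqrt(var + self.eps) + self.beta
                for v in values]
```

Feeding source and target features through the same object with different `domain` arguments normalizes each with its own statistics, while `gamma` and `beta` remain common to both.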
Optimization Objective. The detection loss $\mathcal{L}_{det}$ on each domain consists of four loss terms as below,

$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \alpha_1 \mathcal{L}_{reg} + \alpha_2 \mathcal{L}_{dir} + \mathcal{L}_{iou}. \tag{14}$$

The anchor classification loss $\mathcal{L}_{cls}$ is calculated using focal loss with default parameters to balance the foreground and background classes, $\mathcal{L}_{reg}$ employs smooth L1 loss to regress the residuals of box parameters, $\mathcal{L}_{dir}$ utilizes binary cross-entropy loss to estimate the ambiguity of the heading angle as in [60, 45], and the IoU estimation loss $\mathcal{L}_{iou}$ is formulated as discussed in Eq. (3). In addition, the trade-off factors $\alpha_1$ and $\alpha_2$ (2.0 and 0.2 by default) are set to balance the regression loss and the direction loss. Then, considering both domains, the overall optimization objective at the model training stage is
where and are detection losses for source and target domains respectively with the trade-off parameter (default set as 1.0).
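As a rough illustration of how the loss terms above combine, the following sketch uses the stated default weights (2.0 for regression, 0.2 for direction, 1.0 for the cross-domain trade-off). The function names are hypothetical, and applying the trade-off parameter to the source term is an assumption; with the default of 1.0 the choice does not change the result.

```python
def detection_loss(cls, reg, direction, iou, reg_weight=2.0, dir_weight=0.2):
    """Weighted sum of the four per-domain loss terms."""
    return cls + reg_weight * reg + dir_weight * direction + iou


def overall_loss(source_terms, target_terms, trade_off=1.0):
    """Joint objective over both domains.

    Assumption: the trade-off parameter scales the source-domain loss.
    """
    return trade_off * detection_loss(**source_terms) + detection_loss(**target_terms)


# Toy loss values, chosen only to show the arithmetic.
terms = {"cls": 1.0, "reg": 0.5, "direction": 0.5, "iou": 0.25}
total = overall_loss(terms, terms)
```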
Analysis. Source-assisted self-denoised self-training has the following merits, which help alleviate the negative impacts of pseudo label noise on model training and improve the model’s robustness. First, by leveraging noise-free labeled source data, the imprecise gradients due to noisy pseudo labels can be effectively rectified. Second, through joint optimization on source and target domain data, the model is forced to learn domain-invariant features and to remain discriminative on diverse patterns across different domains. Third, the source domain data can provide the model with challenging cases that are easily ignored or misclassified (e.g. “tree” or “pole” vs. “pedestrian”) on pseudo labeled target data. Finally, the simple domain specific normalization scheme can effectively address domain shift issues in cross-domain joint optimization and further improve model performance.
|Dataset||# Beam Ways||Beam Angles||# Points Per Scene||# Training Frames||# Validation Frames||Location||# Night/Rain|
|Waymo ||64-beam||[-18.0, 2.0]||160k||158,081||39,987||USA||Yes/Yes|
|KITTI ||64-beam||[-23.6, 3.2]||118k||3,712||3,769||Germany||No/No|
|Lyft ||64-beam||[-29.0, 5.0]||69k||18,900||3,780||USA||No/No|
|nuScenes ||32-beam||[-30.0, 10.0]||25k||28,130||6,019||USA and Singapore||Yes/Yes|
4.4.2 Curriculum Data Augmentation.
Our observation shows that most positive pseudo boxes are easy examples since they are generated from previous high-confidence object predictions. Consequently, during training, the model is prone to overfitting to these easy examples with low loss values (see Fig. 5 (d)) and is unable to further mine hard examples to improve the detector. To prevent the model from being trapped in bad local minima, strong data augmentations could be used to generate diverse and potentially hard examples. However, this might confuse the learner and hence be harmful to model training at the initial stage.
Curriculum Data Augmentation. Motivated by the above observation, we design a curriculum data augmentation (CDA) strategy that progressively increases the intensity of data augmentation, generating increasingly hard examples that improve the model while ensuring effective learning at the early stages.
To progressively increase the intensity of the data augmentations (i.e. global points transformation and per-object points transformation), we design a multi-step intensity scheduler with an initial intensity for each data augmentation. Specifically, we split the total training epochs into several stages. After each stage, the data augmentation intensity is multiplied by a fixed enlarging ratio. Thus, the intensity of each data augmentation at a given stage is its initial intensity multiplied by the enlarging ratio once for every completed stage. The random sampling range of each data augmentation is then widened according to this intensity.
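The multi-step scheduler described above can be sketched as follows. The function names, the symmetric rotation range and the scaling range centred at one are assumptions for illustration, and any concrete stage count or enlarging ratio passed in is hypothetical rather than the paper's default.

```python
def stage_index(epoch, total_epochs, num_stages):
    """0-based stage that a 0-based epoch falls into."""
    epochs_per_stage = total_epochs / num_stages
    return min(int(epoch / epochs_per_stage), num_stages - 1)


def intensity(initial, ratio, stage):
    """Initial intensity multiplied by the ratio once per completed stage."""
    return initial * ratio ** stage


def sampling_range(kind, delta):
    """Random sampling range for one augmentation (assumed forms)."""
    if kind == "rotation":
        return (-delta, delta)            # angle sampled around zero
    if kind == "scaling":
        return (1.0 - delta, 1.0 + delta)  # factor sampled around one
    raise ValueError(f"unknown augmentation kind: {kind}")


# Hypothetical setting: 30 epochs, 6 stages, ratio 1.2, initial scaling 0.05.
for epoch in (0, 10, 29):
    s = stage_index(epoch, total_epochs=30, num_stages=6)
    low, high = sampling_range("scaling", intensity(0.05, 1.2, s))
    print(epoch, s, round(low, 4), round(high, 4))
```

The range only widens at stage boundaries, so the learner sees a fixed difficulty level within each stage.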
CDA enables the model to learn from challenging samples while keeping the difficulty of examples within the capability of the learner during the whole training process. Experiments in Table X illustrate the effectiveness of this curriculum regime.
5.1 Experimental Setup
Datasets. We conduct experiments on four widely used LiDAR 3D object detection datasets: KITTI, Waymo, nuScenes and Lyft. The statistics of the four datasets are summarized in Table II. The domain gaps across different datasets are mainly twofold: a content gap (e.g. object size, weather condition, etc.) caused by different data-capture locations and time, and a point distribution gap owing to different LiDAR types (e.g. number of beam ways, beam range, vertical inclination and horizontal azimuth of LiDAR).
|Task||Method||Car (BEV / 3D AP)||Pedestrian (BEV / 3D AP)||Cyclist (BEV / 3D AP)|
|Waymo → KITTI||Source Only||67.64 / 27.48||46.29 / 43.13||48.61 / 43.84|
| ||SN||78.96 / 59.20||53.72 / 50.44||44.61 / 41.43|
| ||ST3D||82.19 / 61.83||52.92 / 48.33||53.73 / 46.09|
| ||ST3D (w/ SN)||85.83 / 73.37||54.74 / 51.92||56.19 / 53.00|
| ||ST3D++||80.78 / 65.64||57.13 / 53.87||57.23 / 53.43|
| ||ST3D++ (w/ SN)||86.47 / 74.61||62.10 / 59.21||65.07 / 60.76|
| ||Oracle||83.29 / 73.45||46.64 / 41.33||62.92 / 60.32|
|Waymo → Lyft||Source Only||72.92 / 54.34||37.87 / 33.40||33.47 / 28.90|
| ||SN||72.33 / 54.34||39.07 / 33.59||30.21 / 23.44|
| ||ST3D||76.32 / 59.24||36.50 / 32.51||35.06 / 30.27|
| ||ST3D (w/ SN)||76.35 / 57.99||37.53 / 33.28||31.77 / 26.34|
| ||ST3D++||79.61 / 59.93||40.17 / 35.47||37.89 / 34.49|
| ||ST3D++ (w/ SN)||76.67 / 58.86||37.89 / 34.49||37.73 / 32.05|
| ||Oracle||84.47 / 68.78||47.92 / 39.17||43.74 / 39.24|
|Waymo → nuScenes||Source Only||32.91 / 17.24||7.32 / 5.01||3.50 / 2.68|
| ||SN||33.23 / 18.57||7.29 / 5.08||2.48 / 1.8|
| ||ST3D||35.92 / 20.19||5.75 / 5.11||4.70 / 3.35|
| ||ST3D (w/ SN)||35.89 / 20.38||5.95 / 5.30||2.5 / 2.5|
| ||ST3D++||35.73 / 20.90||12.19 / 8.91||5.79 / 4.84|
| ||ST3D++ (w/ SN)||36.65 / 22.01||15.50 / 12.13||5.78 / 4.70|
| ||Oracle||51.88 / 34.87||25.24 / 18.92||15.06 / 11.73|
|nuScenes → KITTI||Source Only||51.84 / 17.92||39.95 / 34.57||17.70 / 11.08|
| ||SN||40.03 / 21.23||38.91 / 34.36||11.11 / 5.67|
| ||ST3D||75.94 / 54.13||44.00 / 42.60||29.58 / 21.21|
| ||ST3D (w/ SN)||79.02 / 62.55||43.12 / 40.54||16.60 / 11.33|
| ||ST3D++||80.52 / 62.37||47.20 / 43.96||30.87 / 23.93|
| ||ST3D++ (w/ SN)||78.87 / 65.56||47.94 / 45.57||13.57 / 12.64|
| ||Oracle||83.29 / 73.45||46.64 / 41.33||62.92 / 60.32|
Adaptation Benchmark. We design experiments to cover the most practical 3D domain adaptation scenarios: adaptation from label-rich domains to label-insufficient domains, adaptation across domains with different data collection locations and time (e.g. Waymo → KITTI, nuScenes → KITTI), and adaptation across domains with a different number of LiDAR beams (i.e. Waymo → nuScenes and nuScenes → KITTI). Therefore, we evaluate domain adaptive 3D object detection models on the following four adaptation tasks: Waymo → KITTI, Waymo → Lyft, Waymo → nuScenes and nuScenes → KITTI. We rule out some ill-posed settings that are not suitable for evaluation. For example, we do not consider KITTI and Lyft as the source domain, since KITTI lacks ring view annotations (less practical) and Lyft uses very different annotation rules (i.e. a large number of objects outside the road are not annotated). Besides, we only include the most typical tasks to keep the evaluation computationally manageable for follow-up research.
Comparison Methods. We compare ST3D++ with four references: Source Only indicates directly evaluating the source domain pre-trained model on the target domain; SN is the pioneering weakly-supervised domain adaptation method on 3D object detection, with target domain statistical object size as extra information; ST3D is the state-of-the-art method for both unsupervised and weakly-supervised (i.e. with extra target object size statistics) domain adaptation on 3D object detection; and Oracle indicates the fully supervised model trained on the target domain.
Evaluation Metric. We adopt the KITTI evaluation metric for evaluating our methods on the common categories: car (also named vehicle for similar categories in the Waymo Open Dataset), pedestrian and cyclist (also named bicyclist and motorcyclist in nuScenes and Lyft). Except for the KITTI dataset, which only provides annotations in the front view, we evaluate the methods on ring view point clouds since they are more widely used in real-world applications. We follow the official KITTI evaluation metric and report the average precision (AP) in both the bird’s eye view (BEV) and 3D over a fixed set of recall positions. The mean average precision is evaluated with one IoU threshold for cars and another for pedestrians and cyclists.
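To make "average precision over recall positions" concrete, the sketch below computes an interpolated AP from a precision-recall curve. It assumes evenly spaced recall positions that exclude zero and leaves their number as a parameter; the function name and the toy curve are hypothetical, and the official KITTI tooling should be consulted for the exact protocol.

```python
def interpolated_ap(recalls, precisions, num_positions):
    """Mean of the best precision achievable at or beyond each recall position.

    Assumption: positions are 1/N, 2/N, ..., 1 for N = num_positions.
    """
    total = 0.0
    for i in range(num_positions):
        threshold = (i + 1) / num_positions
        # Precisions of all operating points that reach this recall level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(reachable, default=0.0)
    return total / num_positions


# Toy curve with two operating points, evaluated at four recall positions.
ap = interpolated_ap(recalls=[0.5, 1.0], precisions=[1.0, 0.5], num_positions=4)
```

A detection counts as a true positive on this curve only if its IoU with a ground-truth box exceeds the category-specific threshold, which is why the reported AP changes with the threshold.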
Implementation Details. We validate our proposed ST3D++ on three detection backbones: SECOND, PointRCNN and PV-RCNN. Specifically, we improve the SECOND detector with an extra IoU head to estimate the IoU between the object proposals and their GTs, and name this detector SECOND-IoU. Given object proposals from the RPN head in the original SECOND network, we extract proposal features from 2D BEV features using the rotated RoI-align operation. Then, taking the extracted features as inputs, we adopt two fully connected layers with ReLU nonlinearity and batch normalization to regress the IoU between RoIs and their corresponding GTs (or pseudo boxes) with sigmoid nonlinearity. During training, we do not back-propagate the gradient from our IoU head to our backbone network.
We adopt the training settings of the popular point cloud detection codebase OpenPCDet to pre-train our detectors on the source domain with our proposed random object scaling (ROS) data augmentation strategy. For the triplet box partition in the hybrid quality-aware triplet memory, the two thresholds are typically set to 0.6 and 0.25, respectively. For the following target domain self-training stage, we use Adam with a one-cycle learning rate scheduler to fine-tune the detectors for 30 epochs. We update the pseudo labels with memory ensemble and voting after every 2 epochs. For all the above datasets, the detection range is set identically for the horizontal axes and separately for the vertical axis (the origins of coordinates of different datasets have been shifted to the ground plane), and the same voxel size is used on all datasets. More detailed parameter setups can be found in our released code.
During both the pre-training and self-training processes, we adopt widely used data augmentations, including random world flipping, random world scaling, random world rotation, random object scaling and random object rotation. CDA is utilized in the self-training process to provide suitably hard examples for training. All experiments are run on 8 NVIDIA GTX 1080 Ti GPUs.
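The role of the two triplet-partition thresholds can be sketched as below. This is a simplified reading, not the released code: the function name is hypothetical, and treating boxes between the thresholds as ignored (neither positive nor background) and boxes below the lower threshold as discarded is an assumption consistent with avoiding label assignment to ambiguous samples.

```python
def partition_pseudo_boxes(scores, t_pos=0.6, t_neg=0.25):
    """Split predicted boxes into three groups by confidence score.

    Returns the indices of boxes in each group.
    """
    groups = {"positive": [], "ignored": [], "discarded": []}
    for idx, score in enumerate(scores):
        if score >= t_pos:
            groups["positive"].append(idx)   # confident: used as pseudo label
        elif score >= t_neg:
            groups["ignored"].append(idx)    # ambiguous: excluded from the loss
        else:
            groups["discarded"].append(idx)  # unreliable: dropped
    return groups


groups = partition_pseudo_boxes([0.9, 0.4, 0.1, 0.6, 0.25])
# groups == {"positive": [0, 3], "ignored": [1, 4], "discarded": [2]}
```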
5.2 Main results and Analysis
As shown in Table III, we compare the performance of our ST3D++ with Source Only, SN, ST3D and Oracle on four adaptation tasks. Since SN, one of the baseline methods, employs extra statistical supervision on the target domain, we conduct our experiments with ST3D++ under two settings: the unsupervised DA setting, including source only, ST3D and ST3D++, and the weakly-supervised DA setting, including SN, ST3D (w/ SN) and ST3D++ (w/ SN), which utilizes the target object size distribution as a prior. In addition, our analysis mainly focuses on the two types of domain gaps mentioned in Sec. 5.1.
For the content gap caused by different data-capture locations and time, the representative adaptation tasks are Waymo → KITTI and nuScenes → KITTI. For both tasks, our ST3D++ outperforms the source only and SN baselines in both the UDA and weakly-supervised DA settings. Specifically, without leveraging the target domain size prior, we improve the performance on the Waymo → KITTI and nuScenes → KITTI tasks by a large margin of around 38% to 44%, 10% to 11% and 9% to 10% on car, pedestrian and cyclist, respectively, in terms of 3D AP, which largely closes the performance gap between source only and oracle. Even compared with the current SOTA method ST3D, our ST3D++ still demonstrates its superiority, especially in challenging categories, e.g. pedestrian and cyclist. On Waymo → KITTI, ST3D++ exceeds ST3D in both pedestrian and cyclist (see Table III). Besides, without bells and whistles, our ST3D++ even surpasses the oracle in pedestrian on both adaptation tasks, demonstrating the effectiveness of ST3D++ for UDA on 3D object detection.
In addition, since the object size gap is large between KITTI (captured in Germany) and the other three datasets (all or partially captured in the USA), by incorporating target object size as a prior, SN, ST3D (w/ SN) and ST3D++ (w/ SN) perform markedly better than their unsupervised counterparts without SN. In particular, ST3D++ (w/ SN) even outperforms the oracle results on all evaluated categories, i.e. car, pedestrian and cyclist, on the Waymo → KITTI task. However, it is noteworthy that for the adaptation tasks with minor domain shifts in object size (i.e. Waymo → nuScenes and Waymo → Lyft), only minor performance gains or even performance degradation are observed for SN. In contrast, our ST3D++ still obtains consistent improvements on Waymo → Lyft, where data for both domains are captured at similar locations with similar object distributions. (Note that the Lyft dataset is constructed with different label rules from the other three datasets, i.e. a large number of objects that are outside the road are not annotated, which enlarges the domain gaps.)
For the point distribution gap owing to different LiDAR types, we select nuScenes → KITTI and Waymo → nuScenes as representatives since they use different LiDAR types with different numbers of beam ways (see Table II). When the model is adapted from a sparse domain towards a dense domain, such as nuScenes → KITTI, even though the performance of the baseline model is relatively low, our ST3D++ obtains significant performance gains on car, pedestrian and cyclist (see Table III). These performance gains demonstrate the ability of our ST3D++ to improve the model’s capability starting from a weak detector pre-trained on the source domain. However, when we adapt a model obtained in a dense domain to a sparse domain, such as Waymo → nuScenes, the performance gains are relatively small. In this regard, self-training strategies have more advantages on sparse-to-dense adaptation tasks than on dense-to-sparse adaptation tasks. The reason is that a 3D object detector trained on dense point clouds tends to make predictions with low confidence in sparse regions. As a result, when the detector is adapted to a sparse domain, it cannot generate enough high-quality pseudo labels to provide sufficient knowledge in the self-training stage.
5.3 Comparison to Contemporary Works
|Method||Architecture||Sequence||IoU 0.7||IoU 0.5|
To further demonstrate the advantages of our ST3D++, we compare it with SF-UDA, Dreaming and MLC-Net on nuScenes → KITTI (i.e. the only common adaptation task attempted by the four approaches). As shown in Table IV, both SF-UDA and Dreaming utilize the temporal information from the point cloud sequence and an extra object tracker. Nevertheless, by only taking the single-frame point cloud as input, based on SECOND-IoU, our ST3D++ outperforms SF-UDA and Dreaming by 7.87% at IoU 0.7 and 16.75% at IoU 0.5, respectively. Besides, ST3D++ with SECOND-IoU surpasses MLC-Net by 6.95% at IoU 0.7 with a lower baseline.
To exclude the influence of different detection architectures, we further verify our ST3D++ on PointRCNN. Based on PointRCNN, our implementation of source only is 20.85%, 33.68% and 3.20% stronger than the implementations by SF-UDA, Dreaming and MLC-Net, respectively. Besides, our ST3D++ with PointRCNN even achieves 67.61% at IoU 0.7 thanks to the prominent two-stage refinement ability of PointRCNN, while ST3D++ based on SECOND-IoU performs more notably at IoU 0.5.
|# Source Frames||Method||Car (BEV / 3D AP)||Pedestrian (BEV / 3D AP)||Cyclist (BEV / 3D AP)|
|1||Source Only||51.84 / 17.92||39.95 / 34.57||17.70 / 11.08|
| ||ST3D||75.94 / 54.13||44.00 / 42.60||29.58 / 21.21|
| ||ST3D++||80.52 / 62.37||47.20 / 43.96||30.87 / 23.93|
|5||Source Only||62.82 / 32.08||29.50 / 24.66||20.05 / 12.07|
| ||ST3D||81.06 / 66.98||34.65 / 31.76||27.32 / 20.52|
| ||ST3D++||80.91 / 68.23||30.48 / 27.86||29.88 / 25.57|
|-||Oracle||83.29 / 73.45||46.64 / 41.33||62.92 / 60.32|
Equip ST3D++ with temporal information. Considering that the contemporary works SF-UDA and Dreaming are designed to leverage temporal information of point cloud sequences through consistency regularization, here we simply fuse several sequential point cloud frames to exploit temporal information in a straightforward manner. Furthermore, as shown in Table II, the point density of KITTI is around 5 times larger than that of nuScenes, which causes a serious point cloud distribution gap, so our multi-frame fusion can also mitigate the point density gap. Specifically, we merge five LiDAR pose-calibrated frames in nuScenes to approach the point cloud density of KITTI. We extend the point cloud with an extra timestamp channel to identify different frames in nuScenes. For the target domain KITTI, we simply pad zeros to each point as the timestamp channel.
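The fusion and padding steps above can be sketched in a few lines. The sketch assumes the source frames have already been transformed into a common coordinate frame (pose calibration is not shown), and the function names and the use of a time lag as the timestamp value are hypothetical.

```python
def fuse_frames(frames, time_lags):
    """Concatenate pose-aligned frames, tagging each point with a timestamp.

    frames: list of point lists, each point an (x, y, z) tuple.
    time_lags: one timestamp value per frame (assumed: lag to the key frame).
    """
    fused = []
    for points, lag in zip(frames, time_lags):
        fused.extend((x, y, z, lag) for x, y, z in points)
    return fused


def pad_timestamp(points):
    """Single-frame target data: append a zero timestamp channel."""
    return [(x, y, z, 0.0) for x, y, z in points]


source = fuse_frames(
    frames=[[(1.0, 2.0, 0.5)], [(1.1, 2.0, 0.5), (4.0, 0.0, 0.2)]],
    time_lags=[0.0, 0.05],
)
target = pad_timestamp([(3.0, 1.0, 0.4)])
```

Both domains then share the same four-channel point format, so the detector input layer does not need to change between source and target.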
As shown in Table V, for the source only model, directly adapting the model from densified nuScenes to KITTI brings around 14% and 1% gains on car and cyclist, respectively, in terms of 3D AP. Besides, with multi-frame fused source data, ST3D++ obtains around 5.9% and 1.6% improvements in 3D AP and achieves new SOTA performance on car and cyclist. These experimental results demonstrate that our ST3D++ can largely benefit from temporal information even with a simple fusion scheme. We believe that ST3D++ can consistently benefit from the development of temporal-based 3D detection from point clouds since the two are orthogonal. Note that pedestrian suffers from performance degradation with five-frame fused source data. The reason might lie in the different characteristics of rigid and non-rigid categories. The poses of non-rigid pedestrians vary considerably across frames, while the same car keeps the same shape across frames as a rigid object. As a result, directly fusing multiple frames might produce indistinct shapes for non-rigid categories and ultimately confuse the deep learner.
5.4 ST3D++ Results with SOTA Detection Architecture
|Method||Car (BEV / 3D AP)||Pedestrian (BEV / 3D AP)||Cyclist (BEV / 3D AP)|
|Source Only||61.18 / 22.01||46.65 / 43.18||54.40 / 50.56|
|SN||79.78 / 63.60||54.78 / 53.04||52.65 / 49.56|
|ST3D||84.10 / 64.78||50.15 / 47.24||51.63 / 48.23|
|ST3D (w/ SN)||86.65 / 76.86||55.23 / 53.20||56.84 / 53.71|
|ST3D++||84.59 / 67.73||56.63 / 53.36||58.64 / 55.07|
|ST3D++ (w/ SN)||86.92 / 77.36||63.58 / 62.87||65.61 / 61.42|
|Oracle||88.98 / 82.50||54.13 / 49.96||73.65 / 70.69|
We further investigate the generalization of our ST3D++ by employing a more sophisticated detection architecture, PV-RCNN, as the base detector without any specific hyper-parameter adjustment. As shown in Table VI, for the adaptation task Waymo → KITTI, our unsupervised ST3D++ outperforms source only by 45.72%, 10.18% and 4.51% on car, pedestrian and cyclist, respectively, in terms of 3D AP. It also surpasses the SOTA cross-domain 3D object detection method ST3D by 2.95%, 6.12% and 6.84% on car, pedestrian and cyclist in 3D AP. Furthermore, when combined with SN, our ST3D++ (w/ SN) is further improved to approach the oracle results, since SN provides better localized pseudo labels for subsequent self-training with the prior of target object statistics. These experimental results demonstrate that our ST3D++ is a model-agnostic self-training pipeline. ST3D++ can further benefit from progress in 3D object detectors, since stronger detectors produce more accurate pseudo labels.
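The gains quoted above follow directly from the 3D AP column of Table VI, as the short check below shows. The dictionary simply copies the relevant table entries; the helper name `gains` is hypothetical.

```python
# 3D AP (car, pedestrian, cyclist) copied from Table VI.
table_vi_3d = {
    "Source Only": (22.01, 43.18, 50.56),
    "ST3D": (64.78, 47.24, 48.23),
    "ST3D++": (67.73, 53.36, 55.07),
}


def gains(method, baseline, table=table_vi_3d):
    """Per-category absolute 3D AP difference, rounded to two decimals."""
    return tuple(round(m - b, 2) for m, b in zip(table[method], table[baseline]))


over_source_only = gains("ST3D++", "Source Only")  # 45.72, 10.18, 4.51
over_st3d = gains("ST3D++", "ST3D")                # 2.95, 6.12, 6.84
```

The gains are absolute differences in AP points rather than relative percentages.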
6 Ablation Studies
6.1 Component Analysis of ST3D++
In this section, we conduct extensive ablation experiments to investigate the individual components of our ST3D++. All experiments are conducted on SECOND-IoU for the adaptation task Waymo → KITTI. Note that for category-agnostic components such as memory updating and curriculum data augmentation, the ablation experiments are only conducted on the car category.
|Method||Car (BEV / 3D AP)||Pedestrian (BEV / 3D AP)||Cyclist (BEV / 3D AP)|
|(a) Source Only||67.64 / 27.48||46.29 / 43.13||48.61 / 43.84|
|(b) ROS||78.07 / 54.67||49.90 / 46.43||50.61 / 46.96|
|(c) SN||78.96 / 59.20||53.72 / 50.44||44.61 / 41.43|
|(d) ST3D++ (w/o ROS)||77.35 / 33.73||52.03 / 48.13||57.83 / 50.81|
|(e) ST3D++ (w/ ROS)||80.78 / 65.64||55.98 / 53.30||57.88 / 52.87|
|(f) ST3D++ (w/ SN)||86.47 / 74.61||62.10 / 59.21||65.07 / 60.76|
|78.96 / 59.20||53.72 / 50.44||44.61 / 41.43|
|79.74 / 65.88||51.18 / 49.13||51.65 / 48.61|
|79.81 / 67.39||52.72 / 49.73||54.29 / 50.33|
|82.72 / 70.17||54.06 / 51.13||54.72 / 52.85|
|85.35 / 72.52||54.74 / 51.92||56.19 / 53.00|
|85.35 / 72.52||59.36 / 56.13||57.38 / 53.55|
|85.83 / 73.37||59.97 / 56.27||56.30 / 53.49|
|86.47 / 74.61||62.10 / 59.21||65.07 / 60.76|
Random Object Scaling. Here we investigate the effectiveness of our unsupervised random object scaling (ROS) for mitigating the domain shift of object size statistics across domains, as mentioned in Sec. 4.2. By employing random object scaling as one of the data augmentations for pre-training, the detector becomes more robust to variations of object sizes in different domains. As shown in Table VII (a), (b), (c), our unsupervised ROS improves the performance by around 27.2%, 3.3% and 3.1% on car, pedestrian and cyclist, respectively, in terms of 3D AP. Besides, our ROS is only 4.5% lower than the weakly-supervised SN method on car and even surpasses SN on cyclist. Furthermore, as shown in Table VII (d), (e), the ROS pre-trained model also greatly benefits the subsequent self-training process since it provides pseudo labels with less noise. We also observe that there still exists a gap between the performance of ST3D++ (w/ ROS) and ST3D++ (w/ SN) in 3D AP, potentially because the KITTI dataset has a larger domain gap in object sizes compared with other datasets; in this situation, the weakly-supervised SN can leverage accurate object size information and generate pseudo labels with less localization noise.
Component Analysis in Self-training. As shown in Table VIII, we investigate the effectiveness of our individual components at the self-training stage. Our ST3D++ (w/ SN) (last line) outperforms the SN baseline and the naive ST baseline by 15.14% and 8.73% on car, 16.08% and 10.08% on pedestrian, as well as 11.52% and 6.35% on cyclist in terms of 3D AP, demonstrating the effectiveness of self-training for domain adaptive 3D object detection.
In the pseudo label generation stage, the hybrid quality-aware triplet memory (HQTM), including the hybrid quality-aware criterion, triplet box partition and memory ensemble-and-voting, is designed to address pseudo label noise, and yields an improvement of 6.6%, 7% and 4.9% in 3D AP on car, pedestrian and cyclist, respectively (see Table VIII). First, by avoiding assigning labels to ambiguous samples, the triplet box partition scheme brings 1.51%, 0.6% and 2.56% improvements on car, pedestrian and cyclist in 3D AP. Then, memory ensemble and voting integrates historical pseudo labels and stabilizes the pseudo label updating process, leading to an improvement of around 1.5% to 2.8% in 3D AP on all evaluated categories. Finally, the quality-aware criterion, which only considers the IoU as a criterion, improves the performance of car by 2.35%, while boosting the performance of the other categories, i.e. pedestrian and cyclist, only slightly. In contrast, our hybrid quality-aware criterion (HQAC) consistently boosts the evaluated categories car, pedestrian and cyclist by 2.35%, 4.21% and 0.7%. This shows that HQAC is instrumental in improving performance.
In the model training stage, source-assisted self-denoised training (SASD) and curriculum data augmentation (CDA) jointly deliver around 2.1%, 3.1% and 7.2% improvements in 3D AP on car, pedestrian and cyclist, respectively. In particular, small categories such as pedestrian and cyclist benefit more from SASD since they typically suffer from more severe classification and localization noise, which misleads the optimization process. SASD efficiently corrects the direction of gradient descent in the training stage and significantly improves the performance, i.e. by 1.24%, 2.94% and 7.27% for car, pedestrian and cyclist in 3D AP, respectively.
Analysis of Memory Ensemble and Voting. Here, we further investigate the memory ensemble and memory voting schemes for memory updating and consistent pseudo label generation. The analysis covers three aspects: different memory ensemble strategies, the benefit of memory voting, and pseudo label merging methods in memory ensemble. The comparison results are shown in Table IX. First, for the comparison of different memory ensemble variants, the three variants (see Sec. 4.3.3 for details) achieve similar performance, and the bipartite ensemble outperforms ME-N and ME-C by 1.4% and 0.6%, respectively, in terms of 3D AP. For the paired box merging methods (see Fig. 6), we compare two merging approaches, “max score” and “weighted average”, where max score obtains a 2.5% performance gain over the weighted average. This validates our analysis in Sec. 4.3.3 that simply selecting the box with higher confidence can generate better pseudo labels. In addition, without memory voting, the performance of ST3D++ (w/ ME-C) drops by around 1.9% in terms of 3D AP since the unmatched boxes across different memories cannot be properly handled. Our memory voting strategy can robustly mine high-quality boxes and discard low-quality boxes.
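The two merging approaches compared above can be contrasted with a minimal sketch for one matched pair of boxes. The box layout (a dict with `params` and `score`), the function names, and the choice to keep the higher score for the averaged box are assumptions for illustration only.

```python
def merge_max_score(box_a, box_b):
    """Keep the box with the higher confidence unchanged."""
    return box_a if box_a["score"] >= box_b["score"] else box_b


def merge_weighted_average(box_a, box_b):
    """Average box parameters, weighted by confidence.

    Note: averaging heading angles this way ignores angle wrap-around, which
    is one way a merged box can end up worse than either input.
    """
    total = box_a["score"] + box_b["score"]
    params = [
        (box_a["score"] * pa + box_b["score"] * pb) / total
        for pa, pb in zip(box_a["params"], box_b["params"])
    ]
    # Assumption: the merged box inherits the higher of the two scores.
    return {"params": params, "score": max(box_a["score"], box_b["score"])}


# Boxes as (x, y, z, length, width, height, yaw); values are illustrative.
old = {"params": [0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0], "score": 0.75}
new = {"params": [4.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0], "score": 0.25}
kept = merge_max_score(old, new)
blended = merge_weighted_average(old, new)
```

In this toy case the averaged centre lies between the two boxes, so a single poorly localized box shifts the merged result, whereas the max-score rule keeps the more confident box as it is.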
|Method||Merge Manner||Memory Voting||Car|
|ST3D++ (w/ ME-N)||Max||86.69 / 74.21|
|ST3D++ (w/ ME-B)||Max||86.84 / 75.22|
|ST3D++ (w/ ME-C)||Max||86.47 / 74.61|
|Avg||82.35 / 72.09|
|Max||86.41 / 72.76|
|Avg||86.60 / 72.32|
|ST3D++ (w/ SN)||-||84.62 / 69.62|
|Normal||84.17 / 69.17|
|Normal||86.78 / 73.63|
|Normal||86.65 / 73.98|
|Strong||86.71 / 73.59|
|Curriculum||86.47 / 74.61|
Data Augmentation Analysis. As shown in Table X, we also explore the influence of different data augmentation strategies and intensities in the model training stage. We divide all data augmentations into two groups: world-level data augmentations that affect the whole scene (i.e. random world flipping, random world scaling and random world rotation) and object-level data augmentations that change each instance (i.e. random object rotation and random object scaling). We observe that without any data augmentation, ST3D++ suffers from around 5% performance degradation. Object-level data augmentations provide significant improvements of around 4% in terms of AP, while world-level data augmentations even slightly harm performance. The combination of object- and world-level data augmentations further improves the detector’s capability. When it comes to the intensity of data augmentation (see Sec. 4.4.2), compared to the normal intensity, simply adopting a stronger data augmentation magnitude confuses the deep learner and leads to a slight performance drop, while our CDA, which progressively enlarges the intensities, brings around 0.6% gains.
|85.83 / 73.37||59.97 / 56.27||56.30 / 53.49|
|84.10 / 68.08||58.70 / 55.96||59.65 / 56.88|
|86.47 / 74.61||62.10 / 59.21||65.07 / 60.76|
Effect of Domain-Specific Normalization in SASD. As mentioned in Sec. 4.4.1, although the assistance of source data can help rectify the gradient directions and offer hard cases to train the model, a noteworthy issue is the distribution shift caused by statistic differences of batch normalization layers. Here, we investigate the effectiveness of Domain-Specific Normalization (DSNorm) in addressing this issue. As illustrated in Table XI, without DSNorm, the naive SASD even causes performance degradation on car and pedestrian due to domain shifts. After being equipped with DSNorm, SASD obtains clear improvements, particularly for pedestrian and cyclist (i.e. 2.94% and 7.27% in terms of 3D AP, respectively).
6.2 Effects of Each Component on Pseudo Label Qualities
We investigate how each module benefits self-training by analyzing its contribution to correcting pseudo label noise and improving pseudo label quality. We adopt AP and the number of true positives (#TPs) to assess the correctness of pseudo labels, and employ ATE, ASE and AOE to assess the average translation, scale and orientation errors of pseudo labels. The latter are inspired by the nuScenes toolkit: average translation error (ATE) is the Euclidean object center distance in 2D from the bird’s eye view (measured in meters); average scale error (ASE) is based on the 3D intersection over union (IoU) of a prediction and its corresponding ground truth after aligning heading angle and object center (calculated as 1 − IoU); and average orientation error (AOE) is the smallest yaw angle difference between the pseudo label and the ground truth (measured in radians).
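The three error measures can be written down per matched box pair as in the sketch below; averaging the values over all matched pseudo labels gives ATE, ASE and AOE. The function names are hypothetical, and the scale error follows the nuScenes-style definition of one minus the IoU of the two boxes once centers and headings are aligned.

```python
import math


def translation_error(center_a, center_b):
    """Euclidean center distance in the bird's eye view (x, y only)."""
    return math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])


def scale_error(size_a, size_b):
    """1 - IoU of two boxes that share the same center and heading.

    size_*: (length, width, height). With aligned boxes the intersection is
    the box formed by the per-dimension minima.
    """
    inter = math.prod(min(a, b) for a, b in zip(size_a, size_b))
    union = math.prod(size_a) + math.prod(size_b) - inter
    return 1.0 - inter / union


def orientation_error(yaw_a, yaw_b):
    """Smallest absolute yaw difference in radians, in [0, pi]."""
    diff = (yaw_a - yaw_b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)
```

For example, a pseudo box with the correct center and heading but half the ground-truth height has a scale error of 0.5 and zero translation and orientation error.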
As shown in Fig. 8, when the proposed components are gradually incorporated into the pipeline (i.e. from C to H), the number of true positives and the AP progressively increase. This tendency illustrates that the classification noise is significantly reduced and thus the correctness of pseudo labels is improved by a large margin. Specifically, ROS mitigates domain differences in object size distributions and hence largely reduces ASE. With Triplet, HQAC and MEV, our method generates accurate and stable pseudo labels, localizing more #TPs with fewer errors. CDA overcomes overfitting and reduces both ASE and AOE. SASD reduces ASE and AOE by addressing the negative impacts of pseudo label noise on model training.
We have presented ST3D++, a holistic denoised self-training pipeline for unsupervised domain adaptive 3D object detection from point clouds. ST3D++ redesigns the different self-training stages, from model pre-training on source labeled data and pseudo label generation on target data to model training on pseudo labeled data. At the first two stages, it effectively reduces pseudo label noise by pre-training a de-biased object detector via random object scaling and by designing a robust pseudo label assignment method, a better pseudo label selection criterion and a consistency-regularized pseudo label update strategy. These components cooperate to make pseudo labels accurate and consistent. In addition, at the model training stage, the negative impacts of noisy pseudo labels are alleviated with the assistance of supervision from source labeled data. Simultaneously, curriculum data augmentation is developed to overcome the overfitting issue on easy pseudo labeled target data. Our extensive experimental results demonstrate that ST3D++ substantially advances state-of-the-art methods. Our future work will extend ST3D++ to indoor scenes or sequence data and investigate point cloud translation approaches to address domain gaps in point distributions.
- (2018) Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.
- Impossibility theorems for domain adaptation. International Conference on Artificial Intelligence and Statistics, pp. 129–136.
- Proceedings of the 26th annual International Conference on Machine Learning, pp. 41–48.
- (2020) Nuscenes: a multimodal dataset for autonomous driving. pp. 11621–11631.
- (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11457–11466.
- (2020) End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872.
- (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354–7362.
- (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
- (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3339–3348.
- (2018) Adaptive semantic segmentation with a strategic curriculum of proxy labels. arXiv preprint arXiv:1811.03542.
- (2019) Pseudo-labeling curriculum for unsupervised domain adaptation. arXiv preprint arXiv:1908.00262.
- Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189.
- (2016) Domain-adversarial training of neural networks. JMLR 17 (1), pp. 2096–2030.
- (2019) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In International Conference on Learning Representations.
- (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. arXiv preprint arXiv:2006.02713.
- (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition.
- (2019) Dlow: domain flow for adaptation and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2477–2486.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- (2020) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882.
- (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213.
- (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
- (2020) XMUDA: cross-modal unsupervised domain adaptation for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12605–12614.
- (2019) Lyft level 5 perception dataset 2020. Note: https://level5.lyft.com/dataset/
- (2019) A robust learning approach to domain adaptive object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 480–490.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8.
- (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
-  (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §2, §3.
-  (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117. Cited by: §2.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: §4.4.1.
-  (2019) Transferable adversarial training: a general approach to adapting deep classifiers. In International Conference on Machine Learning, pp. 4013–4022. Cited by: §2.
-  (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105. Cited by: §1, §2.
-  (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1647–1657. Cited by: §2.
-  (2021) Unsupervised domain adaptive 3d detection with multi-level consistency. arXiv preprint arXiv:2107.11355. Cited by: §1, §2, §5.3, TABLE IV.
-  (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §2.
-  (2021) Offboard 3d object detection from point cloud sequences. arXiv preprint arXiv:2103.05073. Cited by: §4.3.3.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2.
-  (2019) PointDAN: a multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pp. 7192–7203. Cited by: §2.
-  (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §1, §2.
-  (2017) Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 2988–2997. Cited by: §2, §3, §4.3.1.
-  (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3723–3732. Cited by: §2.
-  (2020) SF-uda3D: source-free unsupervised domain adaptation for lidar-based 3d object detection. arXiv preprint arXiv:2010.08243. Cited by: §1, §1, §2, §5.3, §5.3, TABLE IV.
-  (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538. Cited by: §1, §2, §2, §4.4.1, §5.1, §5.4, TABLE VI.
-  (2021) PV-rcnn++: point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463. Cited by: §1.
-  (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §2, §2, §5.1, §5.3.
-  (2019) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. arXiv preprint arXiv:1907.03670. Cited by: §1, §2, §2.
-  (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454. Cited by: §1, §1, §1, §1, TABLE II, §5.1.
-  (2013) Self-paced learning for long-term tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2379–2386. Cited by: §2.
-  (2020) OpenPCDet: an open-source toolbox for 3d object detection from point clouds. Note: https://github.com/open-mmlab/OpenPCDet Cited by: §5.1.
-  (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §2.
-  (2019) Transferable normalization: towards improving transferability of deep neural networks. In Advances in Neural Information Processing Systems 32, Cited by: §2.
-  (2020) Train in germany, test in the usa: making 3d object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11713–11723. Cited by: Fig. 1, §1, §1, §1, §2, §4.2, TABLE II, §5.1, §5.1, §5.2, §5.2, TABLE III.
-  (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. arXiv preprint arXiv:1903.01864. Cited by: §2.
-  (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation, pp. 1887–1893. Cited by: §2.
-  (2020) Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828. Cited by: §4.4.1.
-  (2020) Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §1.
-  (2019) Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision, Cited by: §2.
-  (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: Fig. 1, §1, §2, §2, §4.4.1, §5.1.
-  (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §2.
-  (2021) ST3D: self-training for unsupervised domain adaptation on 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §5.1, §5.2, §5.2, TABLE III, §6.1, TABLE VIII.
-  (2020) An adversarial perturbation oriented domain adaptation approach for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12613–12620. Cited by: §2.
-  (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1951–1960. Cited by: §2, §2.
-  (2020) Complete & label: a domain adaptation approach to semantic segmentation of lidar point clouds. arXiv preprint arXiv:2007.08488. Cited by: §2.
-  (2021) Exploiting playbacks in unsupervised domain adaptation for 3d object detection. arXiv preprint arXiv:2103.14198. Cited by: §1, §1, §2, §5.3, §5.3, TABLE IV.
-  (2021) SRDAN: scale-aware and range-aware domain adaptation network for cross-dataset 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6769–6779. Cited by: §2.
-  (2020) Label propagation with augmented anchors: a simple semi-supervised learning baseline for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 781–797. Cited by: §1.
-  (2018) Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6810–6818. Cited by: §2.
-  (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §2, §2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §2.
-  (2019) Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5982–5991. Cited by: §1.
-  (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision, pp. 289–305. Cited by: §2, §3, §4.3.1.