A robust autonomous driving system requires its LiDAR-based detector to reliably handle different environmental conditions, e.g., geographic locations and weather conditions. While 3D detection has received increasing interest in recent years, most existing works [zhou2018voxelnet, chen2017multi, chen2019fast, chen2020dsgn, du2018general, konigshof2019realtime, lang2019pointpillars, li2019gs3d, li2019stereo, liang2019multi, liang2018deep, meyer2019lasernet, pon2020object, qi2018frustum, shi2020pv, shi2019pointrcnn, shi2019points, shi2020point, xu2020zoomnet, yan2018second, yang2018pixor, yang20203dssd, yang2019std, xu2020grid, zhou2020end] have focused on the performance in a single domain, where training and test data are captured in similar conditions. It is still an open question how to generalize a 3D detector to different domains, where the environment varies significantly. In this paper, we address the domain gap caused by the deteriorating point cloud quality and aim to improve 3D object detection in the setting of unsupervised domain adaptation (UDA). We use the Waymo Domain Adaptation dataset [sun2019scalability] to analyze the domain gap and introduce semantic point generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shift. SPG is able to improve detection quality in both the target domain and the source domain and can be naturally combined with modern LiDAR-based detectors.
1.1 Understanding the Domain Gap
The Waymo Open Dataset (OD) is mainly collected in California and Arizona, while the Waymo Kirkland Dataset (Kirk) [sun2019scalability] is collected in Kirkland. We consider OD as the source domain and Kirk as the target domain. To understand the possible domain gap, we take a PointPillars [lang2019pointpillars] model trained on the OD training set and compare its 3D vehicle detection performance on the OD validation set with that on the Kirk validation set. We observe a drastic drop in 3D average precision (AP) (see Table 1).
We first confirm that there is no significant difference in object size between the two domains. Then, by investigating the metadata in the datasets, we find that only a small fraction of LiDAR frames in OD are collected under rainy weather, whereas almost all frames in Kirk share the rainy weather attribute. To rule out other factors, we extract all dry-weather frames in the Kirk training set to form a “Kirk Dry” dataset. Because raindrops change the surface properties of objects, there are roughly twice as many missing LiDAR points per frame in the Kirk validation set as in OD or Kirk Dry (see Table 1). As a result, vehicles in Kirk receive considerably fewer LiDAR point observations than those in OD (see statistics and more details in the supplemental). In Figure 2, we visualize two range images from OD and Kirk, respectively. We can observe that in the rainy weather, a significant number of points are missing and the distribution of missing points is more irregular than in the dry weather.
To conclude, the major domain gap between OD and Kirk is the deteriorating point cloud quality caused by the rainy weather. We refer to this phenomenon in the target domain as the “missing point” problem.
1.2 Previous Methods to Address the Domain Gap
Multiple studies propose to align features across domains. Most of them focus on 2D tasks [morerio2017minimal, ganin2015unsupervised, tzeng2017adversarial, dong2019semantic] or object-level 3D tasks [zhou2018unsupervised, qin2019pointdan]. Applying feature alignment [chen2018domain, he2019multi, luo2020unsupervised] requires redesigning the model or the loss of a detector. Our goal is to seek a general solution that benefits recent LiDAR-based detectors [lang2019pointpillars, shi2020pv, zhou2018voxelnet, shi2019pointrcnn, he2020sassd].
Another direction is to apply transformations to the data from one domain to match the data from another domain. A naive approach is to randomly down-sample the point cloud, but this not only fails to satisfactorily simulate the pattern of missing points (Figure 2d) but also hurts performance on the source domain. Another approach is to up-sample the point cloud [yu2018pu, yifan2019patch, li2019pu] in the target domain, which can increase point density around observed regions. However, those methods have limited capability in recovering the 3D shape of objects that are only partially observed. Moreover, up-sampling the entire point cloud leads to significantly higher latency. A third approach is to leverage style transfer techniques: [zhu2017unpaired, park2020contrastive, choi2019self, he2019multi, shan2019pixel, hsu2020progressive, saleh2019domain] render point clouds as 2D pseudo-images and enforce the renderings from different domains to share a similar style. However, these methods introduce an information bottleneck during rasterization [zhou2018voxelnet] and are not applicable to modern point-based 3D detectors [shi2020pv].
1.3 SPG for Closing the Domain Gap
The “missing point” problem deteriorates the point cloud quality and reduces the number of point observations, thus undermining the detection performance. To address this issue, we propose Semantic Point Generation (SPG). Our approach aims to learn the semantic information of the point cloud and performs foreground region prediction to identify voxels that are inside foreground objects. Based on the predicted foreground voxels, SPG generates points to recover the foreground regions. Since these points are discriminatively generated at foreground objects, we denote them by semantic points. These semantic points are merged with the original points into an augmented point cloud, which is then fed to a 3D detector.
The contributions of this paper are two-fold:
1. We present an in-depth analysis of unsupervised domain adaptation (UDA) for LiDAR 3D detectors across different geographic locations and weather conditions. Our study reveals that rainy weather can severely deteriorate the quality of LiDAR point clouds and lead to a drastic performance drop for modern detectors.
2. We propose semantic point generation (SPG). To the best of our knowledge, it is the first learning-based model that targets UDA for point cloud 3D detection. Specifically, SPG has the following merits:
SPG can generate semantic points that faithfully recover the foreground regions suffering from the “missing point” problem. SPG significantly improves performance on poor-quality point clouds in the target domain while also benefiting the source domain, for representative 3D detectors including PointPillars [lang2019pointpillars] and PV-RCNN [shi2020pv].
SPG also improves the performance for the general 3D object detection task. We verify its effectiveness on KITTI [geiger2013vision] for the aforementioned 3D detectors.
SPG is a general approach and can be easily combined with modern off-the-shelf LiDAR-based detectors.
Our approach is lightweight and efficient. Introducing only a small number of additional points, SPG adds only marginal complexity to a 3D detector.
2 Related Work
2.1 Unsupervised Domain Adaptation
Unsupervised domain adaptation (UDA) aims to generalize a model to a novel (target) domain by using label information only from the source domain. The two domains are generally related, but there exists a distribution shift (domain gap). Most methods focus on learning aligned feature representations across domains. To reach this goal, [borgwardt2006integrating] proposes Maximum Mean Discrepancy (MMD) and [pan2010domain] proposes Transfer Component Analysis (TCA). [long2013transfer] designs Joint Distribution Adaptation to close the distribution shift, while [long2015learning, long2016unsupervised] utilize a shared Hilbert space. Without using explicit distance measures, deep learning models [ganin2015unsupervised, tzeng2017adversarial, dong2019semantic, qin2019generatively, saito2018maximum] use adversarial training to learn features that are indistinguishable between domains.
Unsupervised Domain Adaptation for 2D Detection
The object detection task is sensitive to local geometric features. [chen2018domain, he2019multi] hierarchically align features between domains; most of these works focus on UDA for 2D detection. With recent advances in unpaired style transfer [park2020contrastive, zhu2017unpaired], studies such as [shan2019pixel, hsu2020progressive] translate images from the source domain to the target domain or vice versa.
Unsupervised Domain Adaptation for 3D Tasks
Most UDA methods focus on 2D tasks; only a few studies explore UDA in 3D. [zhou2018unsupervised, qin2019pointdan] align global and local features for object-level tasks. To reduce sparsity, [wu2019squeezesegv2] projects the point cloud to a 2D view, while [saleh2019domain] projects the point cloud to the birds-eye view (BEV). [du2020associate] creates a set of car models and adapts their features to the features of detected objects; however, that study targets general car 3D detection within a single point cloud domain. [wang2020train] is the first published study targeting UDA for 3D LiDAR detection. They identify vehicle size as the domain gap between KITTI [geiger2013vision] and other datasets, and therefore resize the vehicles in the data. In contrast, we identify point cloud quality as the major domain gap between Waymo’s two datasets [sun2019scalability], and we use a learning-based approach to close the domain gap.
2.2 Point Cloud Transformation
One way to improve point cloud quality is to suitably transform the point cloud. Point cloud up-sampling methods [yu2018pu, yifan2019patch, li2019pu] can transform a low-density point cloud into a high-density one, but they require high-density ground-truth point clouds during training. These networks can densify the point cloud in observed regions; in our case, however, we also need to recover regions with no point observations at all, caused by “missing points”.
Point cloud completion networks [yuan2018pcn, chen2019unpaired, yang2018foldingnet, xie2020grnet] aim to complete partially observed shapes. Specialized in object-level completion, these models assume a single object has been manually located and that the input consists only of the points on this object. Therefore, these models do not fit our purpose of object detection. Point cloud style transfer models [cao2020psnet, cao2019neural] can transfer the color theme and the object-level geometric style of a point cloud. However, these models do not focus on preserving local details with high fidelity, so their transformations cannot directly help 3D detection.
3 Semantic Point Generation
In the input point cloud, each point has three xyz channels plus property channels (e.g., intensity, elongation). Figure 3 illustrates the SPG-aided 3D detection pipeline. SPG takes the raw point cloud as input and generates a set of semantic points in the predicted foreground regions. These semantic points are then combined with the original point cloud into an augmented point cloud, which is fed into a point cloud detector to obtain object detection results.
As shown in Figure 4, SPG voxelizes the point cloud into an evenly spaced 3D voxel grid and learns the point cloud semantics for these voxels. For each voxel, the network predicts the probability that it is a foreground voxel (i.e., contained in a foreground object bounding box). In each predicted foreground voxel, the network generates a semantic point whose features consist of its xyz coordinates and its point properties.
To faithfully recover the foreground regions of the observed objects, we define a generation area: only voxels occupied or neighbored by the observed points are considered within the generation area. We also filter out semantic points whose foreground probability falls below a threshold, then take the semantic points with the highest probabilities and merge them with the original point cloud to obtain the augmented point cloud. The threshold and the point budget used in practice are given in the implementation details.
To enable SPG to be directly used by modern LiDAR-based detectors, we encode the augmented point cloud in the same point format as the input. We add one more property channel to each point, indicating the confidence of the foreground prediction: the predicted foreground probability is used for the semantic points, and 1.0 for the original raw points.
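Putting the probability thresholding, top-scoring selection, and confidence-channel encoding together, a minimal NumPy sketch might look like the following. The function name and threshold value are illustrative; the 8000-point cap matches the figure quoted later in the experiments, but is otherwise a placeholder.

```python
import numpy as np

def augment_point_cloud(raw_points, sem_points, sem_probs,
                        prob_threshold=0.5, max_points=8000):
    """Merge generated semantic points into the raw cloud (sketch).

    raw_points: (N, C) original points (xyz + properties).
    sem_points: (M, C) generated semantic points.
    sem_probs:  (M,)   predicted foreground probabilities.

    Points below the probability threshold are discarded, at most
    `max_points` highest-confidence points are kept, and a confidence
    channel is appended: the predicted probability for semantic points
    and 1.0 for original raw points.
    """
    keep = sem_probs >= prob_threshold
    sem_points, sem_probs = sem_points[keep], sem_probs[keep]
    if len(sem_probs) > max_points:
        top = np.argsort(-sem_probs)[:max_points]  # highest probabilities first
        sem_points, sem_probs = sem_points[top], sem_probs[top]
    raw_aug = np.hstack([raw_points, np.ones((len(raw_points), 1))])
    sem_aug = np.hstack([sem_points, sem_probs[:, None]])
    return np.vstack([raw_aug, sem_aug])
```

The appended confidence channel lets the downstream detector distinguish generated points from real observations without any architectural change.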
3.1 Training Targets
To train SPG, we need to create two kinds of supervision: 1) a class label indicating whether a voxel (either occupied or empty) is a foreground voxel, which supervises the predicted foreground probability; 2) a regression target for the semantic point features.
As visualized in Figure 4, we mark a point as a foreground point if it is inside an object bounding box, and mark voxels contained in a foreground bounding box as foreground voxels. Each voxel is assigned the label 1 if it is a foreground voxel and 0 otherwise. If a voxel is an occupied foreground voxel, we set its regression target to the centroid (xyz) of all foreground points it contains, together with the mean of their point properties (e.g., intensity, elongation).
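The target creation above can be sketched as follows for a single axis-aligned bounding box. The function and argument names are hypothetical, and real boxes are oriented in 3D, so this is only a simplified NumPy illustration of the labeling and centroid-regression logic.

```python
import numpy as np

def make_targets(points, box_min, box_max, grid_min, voxel_size, grid_shape):
    """Per-voxel training targets for one axis-aligned foreground box (sketch).

    points: (N, C) array with xyz in columns 0:3 and extra properties after.
    Returns:
      labels:  dict voxel_index -> 1/0 label (voxel center inside the box)
      targets: dict voxel_index -> regression target for occupied foreground
               voxels: centroid of their foreground points + mean properties.
    """
    labels = {}
    for idx in np.ndindex(*grid_shape):
        center = grid_min + (np.array(idx) + 0.5) * voxel_size
        labels[idx] = int(np.all(center >= box_min) and np.all(center <= box_max))

    # Foreground points are those inside the bounding box.
    fg = np.all((points[:, :3] >= box_min) & (points[:, :3] <= box_max), axis=1)
    vox = np.floor((points[:, :3] - grid_min) / voxel_size).astype(int)
    targets = {}
    for idx in {tuple(v) for v in vox[fg]}:
        in_vox = fg & np.all(vox == idx, axis=1)
        targets[idx] = points[in_vox].mean(axis=0)  # centroid xyz + mean properties
    return labels, targets
```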
3.2 Model Structure
The lower part of Figure 4 illustrates the network architecture. SPG uses a light-weight encoder-decoder network [zhou2018voxelnet, lang2019pointpillars], which is composed of three modules:
1) The Voxel Feature Encoding module [zhou2018voxelnet] aggregates points inside each voxel by using several MLPs. Similar to [lang2019pointpillars, shi2020pv], these voxel features are later stacked into pillars and projected onto a birds-eye view feature space;
2) The Information Propagation module applies 2D convolutions on the pillar features. As shown in Figure 4, the semantic information in the occupied pillars (dark green) is populated into the neighboring empty pillars (light green), which enables SPG to recover the foreground regions in the empty space.
3) The Point Generation module maps the pillar features to the corresponding voxels. For each voxel in the generation area, the module creates a semantic point encoding the point location, the point properties, and the foreground probability.
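As an illustration of the Voxel Feature Encoding step, here is a minimal NumPy sketch of a single shared point-wise layer followed by a voxel-wise max-pool, in the VoxelNet style. The layer sizes and weights are placeholders, not the paper's configuration.

```python
import numpy as np

def voxel_feature_encoding(voxel_points, weight, bias):
    """One VFE layer (sketch): shared point-wise MLP + voxel-wise max-pool.

    voxel_points: (P, C_in) points belonging to a single voxel.
    weight: (C_in, C_out), bias: (C_out,) parameters of the point-wise layer.
    Returns a single (C_out,) feature vector summarizing the voxel.
    """
    # Apply the same linear layer + ReLU to every point in the voxel.
    point_feats = np.maximum(voxel_points @ weight + bias, 0.0)
    # Aggregate into one voxel feature by element-wise max over points.
    return point_feats.max(axis=0)
```

The max-pool makes the voxel feature invariant to point ordering, which is why this aggregation is standard for point sets.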
3.3 Foreground Region Recovery
The above pipeline supervises SPG to generate semantic points in the occupied voxels. However, it is also crucial to recover the empty voxels caused by the “missing points” problem. To generate semantic points in the empty areas, SPG employs two strategies:
“Hide and Predict”, which produces the “missing points” on the source domain during training and guides SPG to recover the foreground object shape in the empty space.
“Semantic Area Expansion”, which leverages the foreground/background voxel labels derived from the bounding boxes and encourages SPG to recover more unobserved foreground regions in each bounding box.
3.3.1 Hide and Predict
SPG voxelizes the point cloud into a voxel set. Before passing the point cloud to the network, we randomly select a fraction of the occupied voxels and hide all of their points. During training, SPG is required to predict the foreground/background labels for all voxels, even though it only observes the points that remain visible. The point features predicted in the hidden voxels should match the corresponding ground truth calculated from the hidden points.
This strategy brings two benefits: 1. hiding points region by region mimics the missing-point pattern in the target domain; 2. the strategy naturally creates training targets for semantic points in the empty space. Section 4.4 shows the effectiveness of this strategy. The hiding ratio is a fixed hyperparameter.
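A sketch of the hiding step, assuming a uniform voxel grid; the function name, grid layout, and hiding ratio are illustrative (the paper treats the ratio as a tuned hyperparameter).

```python
import numpy as np

def hide_voxels(points, grid_min, voxel_size, hide_ratio, rng):
    """'Hide and Predict' (sketch): randomly select a fraction of the
    occupied voxels and remove all points inside them. The hidden points
    remain available as ground truth for the generation targets.
    """
    vox = np.floor((points[:, :3] - grid_min) / voxel_size).astype(int)
    occupied = np.unique(vox, axis=0)               # unique occupied voxels
    n_hide = int(round(hide_ratio * len(occupied)))
    hidden = occupied[rng.choice(len(occupied), n_hide, replace=False)]
    # Mark points whose voxel index matches any hidden voxel.
    hide_mask = (vox[:, None, :] == hidden[None, :, :]).all(-1).any(-1)
    return points[~hide_mask], points[hide_mask]    # visible input, hidden GT
```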
3.3.2 Semantic Area Expansion
In Section 1.1, we found that poor point cloud quality leads to insufficient points on each object and substantially degrades detection performance. To remedy this problem, we allow SPG to expand the generation area into the empty space. Figure 5 a and c show examples of the generation area with and without the expansion, respectively.
Without the expansion, we can use the ground-truth knowledge of foreground points to supervise SPG only on the occupied voxels (Figure 5 b). However, with the expansion, there are no foreground points inside the empty voxels. Therefore, as shown in Figure 5 d, we design a supervision scheme as follows:
1. For background voxels, both occupied and empty, we impose negative supervision and set the foreground label to 0.
2. For occupied foreground voxels, we set the foreground label to 1.
3. For empty voxels inside a bounding box, we also set the foreground label to 1, but assign a down-weighting factor smaller than one to their loss.
4. We impose point feature supervision only at occupied foreground voxels.
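The four rules above can be expressed compactly. The following NumPy sketch assigns labels and loss weights per voxel; `alpha` is a placeholder for the paper's down-weighting factor, whose exact value is not specified here.

```python
import numpy as np

def expansion_labels(occupied, in_box, alpha=0.5):
    """Supervision scheme of 'Semantic Area Expansion' (sketch of Figure 5d).

    occupied, in_box: boolean arrays over the voxels of the generation area.
    Returns (label, weight):
      - background voxels (occupied or empty, not in any box) get label 0;
      - occupied foreground voxels get label 1 with full weight;
      - empty voxels inside a box get label 1 but a reduced weight `alpha`.
    """
    label = in_box.astype(float)
    weight = np.ones_like(label)
    weight[in_box & ~occupied] = alpha  # down-weight empty in-box voxels
    return label, weight
```

The reduced weight reflects that an object rarely fills its entire bounding box, so empty in-box voxels are weaker positive evidence than occupied ones.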
To investigate the effectiveness of the expansion, we train a model on the OD training set and evaluate it on the Kirk validation set. The expansion yields substantially more semantic points on foreground objects, which mitigates the “missing points” problem caused by environmental interference and occlusions. Figure 6 shows the generation results with and without the expansion. The supervision scheme encourages SPG to learn the extended shape of vehicle parts and enables SPG to fill in more foreground space with semantic points. We also conduct ablation studies (Section 4.4) to show the effectiveness of the proposed strategy.
We use two loss functions: a foreground area classification loss and a feature regression loss.
To supervise the foreground classification with these labels, we use the focal loss [lin2017focal] to mitigate the background-foreground class imbalance. The classification loss can be decomposed into focal losses on four categories of voxels: the occupied voxels, the empty background voxels, the empty foreground voxels, and the hidden voxels. The labeling strategy for these categories is described in Section 3.3.2.
We use the Smooth-L1 loss [he2019multi] for point feature regression, and supervise the semantic points in the occupied foreground voxels and the hidden foreground voxels.
Please note that both losses are only computed on voxels inside the generation area. We find that an appropriate weighting between the classification and regression losses achieves the best result.
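For reference, the two standard loss ingredients can be written as follows: a generic binary focal loss and the Smooth-L1 loss. The `alpha`, `gamma`, and `beta` values are common defaults, not necessarily the paper's settings.

```python
import numpy as np

def focal_loss(p, label, alpha=0.25, gamma=2.0):
    """Binary focal loss [lin2017focal] on a predicted foreground
    probability p for a 0/1 label. Down-weights easy examples via gamma."""
    p_t = np.where(label == 1, p, 1 - p)       # probability of the true class
    a_t = np.where(label == 1, alpha, 1 - alpha)
    return -a_t * (1 - p_t) ** gamma * np.log(p_t)

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss: quadratic near zero, linear for large errors."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
```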
4 Experiments
In this section, we first evaluate the effectiveness of SPG as a general UDA approach for 3D detection, based on the Waymo Domain Adaptation Dataset [sun2019scalability]. In addition, we show that SPG can also improve results for top-performing 3D detectors on the source domain [sun2019scalability, geiger2013vision]. To demonstrate the wide applicability of SPG, we choose two representative detectors: 1) PointPillars [lang2019pointpillars], popular among industrial-grade autonomous driving systems; 2) PV-RCNN [shi2020pv], a high-performance LiDAR-based 3D detector [geiger2013vision, sun2019scalability]. We perform two groups of model comparisons under the settings of unsupervised domain adaptation (UDA) and general 3D object detection: group 1, PointPillars vs. SPG + PointPillars; group 2, PV-RCNN vs. SPG + PV-RCNN. SPG could also be combined with range image-based detectors [meyer2019lasernet, zhou2020end, REF:Range_AlexBewley2020] by applying ray casting to the generated points; however, we leave this as future work.
Datasets
The Waymo Domain Adaptation dataset 1.0 [sun2019scalability] consists of two sub-datasets, the Waymo Open Dataset (OD) and the Waymo Kirkland Dataset (Kirk). OD provides 798 training segments with 158,361 LiDAR frames and 202 validation segments with 40,077 frames. Captured across California and Arizona, the vast majority of its frames have dry weather. Kirk is a smaller dataset, including 80 training segments with 15,797 frames and 20 validation segments with 3,933 frames. Captured in Kirkland, its LiDAR frames are predominantly collected in rainy weather. To examine a detector’s reliability when entering a new environment, we conduct UDA experiments without using the data in Kirk during training.
KITTI [geiger2013vision] contains 7481 training samples and 7518 testing samples. Following [REF:Multiview3D_2017], we divide the training data into a train split and a val split containing 3721 and 3769 LiDAR frames, respectively.
Implementation and Training Details
We use a single lightweight network architecture for all experiments. As shown in Figure 4, our Voxel Feature Encoding module [zhou2018voxelnet] includes a single-layer point-wise MLP and a voxel-wise max-pooling [qi2017pointnet, zhou2018voxelnet]. The Information Propagation module includes two levels of CNN layers. The first level includes three CNN layers with stride 1. The second level includes one CNN layer with stride 2 and four subsequent CNN layers with stride 1, whose output is then up-sampled back to the original resolution. Each layer has an output dimension of 128. From the BEV feature map, the Point Generation module uses one FC layer to produce the foreground probability and another FC layer to generate the point features for the voxels in each pillar. SPG and each detector are trained separately.
We implement PointPillars following [lang2019pointpillars] and use the PV-RCNN code provided by [shi2020pv] (the training settings on OD 1.0 are obtained via direct communication with the author). On the Waymo Domain Adaptation Dataset [sun2019scalability], we set the voxel dimensions to (0.32m, 0.32m, 0.4m) for PointPillars and (0.2m, 0.2m, 0.3m) for PV-RCNN. On KITTI, we set the voxel dimensions to (0.16m, 0.16m, 0.2m) and (0.2m, 0.2m, 0.3m) for PointPillars and PV-RCNN, respectively. By default, the generation area includes voxels within 6 steps of any occupied voxel. After probability thresholding, we preserve up to a fixed number of semantic points per scene; the cap differs between the Waymo Domain Adaptation Dataset and KITTI.
4.1 Evaluation on the Waymo Open Dataset
We perform two groups of model comparisons by training them on the OD training set and evaluating them on both the OD validation set and the Kirk validation set.
The Kirk 1.0 validation set only provides evaluation labels for the vehicle and pedestrian classes. We use the official evaluation tool released by [sun2019scalability]. The IoU thresholds for vehicles and pedestrians are 0.7 and 0.5, respectively. In Table 2, we report both 3D and BEV AP at two difficulty levels. More results with distance breakdowns are shown in the supplemental material.
On Kirk, we observe that SPG brings remarkable improvements for both detectors across all object types. Averaged over the two difficulty levels, SPG improves PointPillars on both Kirk vehicle 3D AP and BEV AP, and improves PV-RCNN on both Kirk pedestrian 3D AP and BEV AP (see Table 2).
Unlike most UDA methods [chen2018domain, hsu2020progressive, shan2019pixel] that only optimize performance on the target domain, SPG also consistently improves results on the source domain: averaged across both difficulty levels, it improves OD vehicle 3D AP for PointPillars and OD pedestrian 3D AP for PV-RCNN.
Comparison with Alternative Strategies
We compare SPG with alternative strategies that also target the deteriorating point cloud quality. We employ PointPillars as the baseline and choose LEVEL_1 vehicle 3D AP on the Kirk validation set as the main metric, under the UDA setting. Three strategies are implemented: 1. RndDrop, where we randomly drop a fraction of the points in the source domain during training; the dropout ratio is chosen so that the numbers of points in the source and target domains match (see Table 1). 2. K-frames, where we use K consecutive historical frames in both the source domain and the target domain; the points in the first K−1 frames are transformed into the last frame according to the ground-truth ego-motion, so that the last frame has roughly K times the number of points. 3. Adversarial Domain Adaptation (ADA), where we follow [ganin2015unsupervised] and add a domain classification loss on the pillar features of PointPillars.
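The "K-frames" baseline can be sketched as follows, assuming ground-truth sensor-to-world poses are available for each frame. Array shapes and names are illustrative, not the evaluation code.

```python
import numpy as np

def aggregate_frames(frames, poses):
    """'K-frames' sketch: express the points of earlier frames in the last
    frame's coordinates using ground-truth ego poses.

    frames: list of (N_i, 3) xyz arrays, each in its own sensor frame.
    poses:  list of 4x4 sensor-to-world transforms, one per frame.
    Returns all points stacked in the last frame's coordinate system.
    """
    world_to_last = np.linalg.inv(poses[-1])
    merged = []
    for pts, pose in zip(frames, poses):
        rel = world_to_last @ pose                     # frame_i -> last frame
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        merged.append((homo @ rel.T)[:, :3])
    return np.vstack(merged)
```

This densifies the cloud with real observations, at the cost of the extra memory and latency noted below.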
As shown in Table 3, although “RndDrop” makes the quantity of missing points in the source domain match that in the target domain, the pattern of missing points still differs from reality (see Figure 2), which limits the improvement in 3D AP. To remedy the “missing points” problem, “3-frames” contains real points from 3 frames and “5-frames” contains points from 5 frames. With around 800K points per scene, “5-frames” significantly improves over the single-frame baseline. However, aggregating multiple frames inevitably increases memory usage and processing time. ADA improves 3D AP on the target domain, but we observe an AP drop in the source domain. Remarkably, SPG outperforms “5-frames” by adding only 8000 semantic points, a small fraction of the points in a single frame.
4.2 Evaluation on the KITTI Dataset
In this section, we show that, besides its usefulness in UDA (Sec. 4.1), the proposed SPG can also boost performance on another popular 3D detection benchmark, KITTI [geiger2013vision]. We follow the training and evaluation protocols in [lang2019pointpillars, shi2020pv].
KITTI Test Set
As shown in Table 4, SPG significantly improves PV-RCNN on car 3D detection. As of Mar. 3rd, 2021, our method ranks 1st on KITTI car 3D detection among all published methods (4th among all submitted approaches). Moreover, SPG demonstrates strong robustness in detecting hard objects (truncation up to 50%). Specifically, SPG surpasses all submitted methods on the hard category by a large margin and achieves the highest overall 3D AP (averaged over Easy, Mod. and Hard).
KITTI Validation Set
We summarize the results in Table 5. We train each group of models using the recommended settings of baseline detectors [lang2019pointpillars, shi2020pv].
SPG remarkably improves both PointPillars and PV-RCNN on all object types and difficulty levels. In particular, SPG improves the car 3D AP of PointPillars on the easy, moderate, and hard levels, and improves the pedestrian 3D AP of PV-RCNN on all three levels.
4.3 Model Efficiency
We evaluate the efficiency of SPG on the KITTI val split (Table 6). SPG is compact in parameter count and adds only a small latency to the detectors. This indicates that SPG is efficient enough for industrial-grade deployment under a stringent computation budget.
4.4 Ablation Studies
We conduct ablation studies on “Semantic Area Expansion”, “Hide and Predict”, and whether to add the foreground confidence as a point property, and show that all of them benefit detection quality (see Table 7). We also vary the weighting factor on the empty foreground voxels. A larger factor encourages more point generation in the empty foreground space. However, in reality, an object typically does not occupy the entire space within its bounding box; therefore, over-aggressively generating points does not improve performance (see Table 7).
In Table 8, we show the effect of choosing different thresholds during probability thresholding. While a higher threshold only keeps semantic points with high foreground probability, a lower threshold admits more points but may introduce points into the background. We find that an intermediate threshold achieves the best results.
5 Conclusion
In this paper, we investigate unsupervised domain adaptation for LiDAR-based 3D detectors across different geographic locations and weather conditions. Based on the Waymo Domain Adaptation dataset, we observe that rainy weather can severely deteriorate point cloud quality and cause a drastic performance drop for modern 3D detectors. The proposed SPG method addresses this issue in a novel unsupervised domain adaptation (UDA) setting, without using any training data from the new domain. This setting allows us to rigorously test 3D detectors against real-world challenges that autonomous vehicles may experience due to diverse conditions (e.g., levels of fog/rain/snow beyond what one may effectively train for) during a trip.
Utilizing two strategies, “Hide and Predict” and “Semantic Area Expansion”, SPG generates semantic points to recover the shape of foreground objects with negligible overhead (adding only a small number of extra points) and can be conveniently integrated with modern LiDAR-based detectors. We test SPG with two detectors: PointPillars and PV-RCNN. For unsupervised domain adaptation, SPG achieves significant performance gains on the challenging target domain. On the Waymo Open Dataset and KITTI, SPG also consistently benefits detection quality on the source domain.
We would like to thank Boqing Gong for the helpful discussions. We also thank Jingwei Ji for the careful proofreading.
Appendix A Statistics of the Waymo Domain Adaptation Dataset
We collect statistics on the average number of points in a vehicle bounding box across different ranges. The range value is calculated as the Euclidean distance between the LiDAR sensor and the center of a bounding box. We investigate four sets of point clouds:
The OD Validation set, in which the vast majority of frames are collected in dry weather.
The Kirk Dry set, which consists of all the frames with the dry weather condition from the Kirk training set.
The Kirk Training Rainy set, which consists of all the frames with the rainy weather condition from the Kirk training set.
The Kirk Validation set, in which all the frames are collected in the rainy weather.
As shown in Figure 7, the point clouds with similar weather conditions share similar numbers of points per object, even though they are collected at different locations. Specifically, the vehicle objects of the two “dry datasets”, i.e., the Kirk Dry set and the OD Validation set, have similar numbers of points across all ranges. The vehicle objects of the two “rainy datasets” i.e., the Kirk Training Rainy set and the Kirk Validation set, share similar statistics.
In addition, the point clouds captured in the dry weather (the OD Validation set and the Kirk Dry set) have more points on each object than those collected in the rainy weather (the Kirk Training Rainy set and the Kirk Validation set). Please note that a logarithmic scale is applied to the number of points for better visualization. The difference in the number of points between the two weather conditions is substantial across all ranges.
Appendix B The Robustness of the Foreground Voxel Classifier
In order to generalize detectors to different domains, it is crucial to correctly classify foreground voxels so that semantic points can be reliably generated. Table 9 lists the evaluation results of the foreground voxel classifier.
Train set | Eval set | Accuracy | Precision | Recall | AP
OD Train | OD Val | 99.3% | 90.9% | 92.9% | 86.7%
OD Train | Kirk Val | 98.9% | 88.4% | 88.2% | 78.3%
Table 9: Foreground voxel classification results of our SPG. The model is trained on the OD training set and then evaluated on the OD validation set and the Kirk validation set, respectively. The accuracy, precision and recall are computed by thresholding the prediction score.
The results in Table 9 are averaged among all voxels in the foreground regions. Our SPG is trained on the OD training set and then evaluated on the OD validation set and the Kirk validation set, respectively. The classification of a voxel is considered correct if its prediction score is above the threshold for a ground-truth foreground voxel, or below the threshold for a ground-truth background voxel. The accuracy, precision and recall are all calculated under this setting. The AP is calculated using 40 recall thresholds. The results show that SPG achieves high performance in both domains.
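The metrics in Table 9 can be computed as follows; this is a generic sketch where the decision threshold is a placeholder.

```python
import numpy as np

def classification_metrics(scores, labels, threshold=0.5):
    """Accuracy, precision and recall for a binary foreground classifier,
    thresholding the predicted score at `threshold`."""
    pred = scores >= threshold
    pos = labels == 1
    tp = np.sum(pred & pos)                 # true positives
    accuracy = np.mean(pred == pos)
    precision = tp / max(np.sum(pred), 1)   # guard against zero predictions
    recall = tp / max(np.sum(pos), 1)       # guard against zero positives
    return accuracy, precision, recall
```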
Appendix C Dropout Rate of the RndDrop Method
In the experiment section, we implement a baseline method, RndDrop, where we randomly drop a fraction of the points in point clouds from the source domain during training. This dropout ratio is chosen to match the ratio of missing points in the target domain: we calculate it as r = 1 - Nt/Ns, where Ns is the average number of points per scene in the source domain and Nt is the average number of points per scene in the target domain.
Appendix D More Results on the Waymo Domain Adaptation Dataset
The evaluation tool [sun2019scalability] provides the average precision for three distance-based breakdowns: 0 to 30 meters, 30 to 50 meters, and beyond 50 meters. The AP is calculated using 100 recall thresholds.
We perform two groups of model comparisons in the setting of UDA: Group 1. PointPillars vs. SPG + PointPillars; Group 2. PV-RCNN vs. SPG + PV-RCNN. We train all models on the OD training set and evaluate them on both the OD validation set and the Kirk validation set. Table 10 and 11 show the comparisons on vehicle 3D AP and vehicle BEV AP, respectively. Table 12 and Table 13 show the comparisons in pedestrian 3D AP and pedestrian BEV AP, respectively. In most cases, SPG improves the detection performance across all ranges for both vehicles and pedestrians.
Appendix E More Results on KITTI
We provide more 3D object detection results on KITTI. There are two commonly used metric standards for evaluating the detection performance: 1) R11, where the AP is evaluated with 11 recall positions; 2) R40, where the AP is evaluated with 40 recall positions. In addition to the improvement on car and pedestrian detection, SPG also significantly boosts the performance in cyclist detection. Based on R11, Table 14 and Table 15 show the results in 3D AP and BEV AP for three object types, respectively. Based on R40, Table 16 and Table 17 show the results in 3D AP and BEV AP for three object types, respectively.
We show more comparisons on the KITTI test set in Table 18.
Appendix F More Visualization of Semantic Point Generation
In Figure 8, we illustrate more augmented point clouds, where the raw points are rendered in the grey color and the generated semantic points are highlighted in red.