SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation

08/15/2021
by   Qiangeng Xu, et al.
University of Southern California

In autonomous driving, a LiDAR-based object detector should perform reliably at different geographic locations and under various weather conditions. While recent 3D detection research focuses on improving performance within a single domain, our study reveals that the performance of modern detectors can drop drastically cross-domain. In this paper, we investigate unsupervised domain adaptation (UDA) for LiDAR-based 3D object detection. On the Waymo Domain Adaptation dataset, we identify the deteriorating point cloud quality as the root cause of the performance drop. To address this issue, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points at the predicted foreground regions and faithfully recovers missing parts of the foreground objects, which are caused by phenomena such as occlusions, low reflectance or weather interference. By merging the semantic points with the original points, we obtain an augmented point cloud, which can be directly consumed by modern LiDAR-based detectors. To validate the wide applicability of SPG, we experiment with two representative detectors, PointPillars and PV-RCNN. On the UDA task, SPG significantly improves both detectors across all object categories of interest and at all difficulty levels. SPG can also benefit object detection in the original domain. On the Waymo Open Dataset and KITTI, SPG improves 3D detection results of these two methods across all categories. Combined with PV-RCNN, SPG achieves state-of-the-art 3D detection results on KITTI.


1 Introduction

A robust autonomous driving system requires its LiDAR-based detector to reliably handle different environmental conditions, e.g., geographic locations and weather conditions. While 3D detection has received increasing interest in recent years, most existing works [zhou2018voxelnet, chen2017multi, chen2019fast, chen2020dsgn, du2018general, konigshof2019realtime, lang2019pointpillars, li2019gs3d, li2019stereo, liang2019multi, liang2018deep, meyer2019lasernet, pon2020object, qi2018frustum, shi2020pv, shi2019pointrcnn, shi2019points, shi2020point, xu2020zoomnet, yan2018second, yang2018pixor, yang20203dssd, yang2019std, xu2020grid, zhou2020end] have focused on the performance in a single domain, where training and test data are captured in similar conditions. It is still an open question how to generalize a 3D detector to different domains, where the environment varies significantly. In this paper, we address the domain gap caused by the deteriorating point cloud quality and aim to improve 3D object detection in the setting of unsupervised domain adaptation (UDA). We use the Waymo Domain Adaptation dataset [sun2019scalability] to analyze the domain gap and introduce semantic point generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shift. SPG is able to improve detection quality in both the target domain and the source domain and can be naturally combined with modern LiDAR-based detectors.

1.1 Understanding the Domain Gap

Waymo Open Dataset (OD) is mainly collected in California and Arizona, and the Waymo Kirkland Dataset (Kirk) [sun2019scalability] is collected in Kirkland. We consider OD as the source domain and Kirk as the target domain. To understand the possible domain gap, we take a PointPillars [lang2019pointpillars] model trained on the OD training set and compare its 3D vehicle detection performance on the OD validation set with that on the Kirk validation set. We observe a drastic drop of more than 20 points in 3D average precision (AP) (see Table 1).

| Dataset | Rainy frames | Avg. number of missing points per frame | Avg. number of points per vehicle | 3D L1 AP |
|---|---|---|---|---|
| OD Val | 0.5% | 23.0K | 306.2 | 56.54 |
| Kirk Dry | 0.0% | 25.1K | 303.6 | 55.98 |
| Kirk Val | 100.0% | 42.8K | 222.3 | 34.74 |

Table 1: Statistics of OD and Kirk. Each frame contains at most 163.8K points. Kirk Dry is formed by the dry-weather frames in the Kirk training set.
Figure 2: Examples of RGB and range image (intensity channel) in OD validation set and Kirk validation set. The dark regions in the range images indicate missed LiDAR returns. The regions of “missing points” are irregular in shape.

We first confirm that there is no significant difference in object size between the two domains. Then, by investigating the metadata in the datasets, we find that only a tiny fraction of LiDAR frames in OD are collected under rainy weather (0.5% of the validation frames; see Table 1), whereas almost all frames in Kirk share the rainy weather attribute. To rule out other factors, we extract all dry-weather frames in the Kirk training set and form a “Kirk Dry” dataset. Because raindrops change the surface properties of objects, there are nearly twice as many missing LiDAR points per frame in the Kirk validation set as in OD or Kirk Dry (see Table 1). As a result, vehicles in Kirk receive around 27% fewer LiDAR point observations than those in OD (222.3 vs. 306.2 points on average; see the supplemental for more statistics and details). In Figure 2, we visualize two range images from OD and Kirk, respectively. We can observe that in rainy weather a significant number of points are missing, and the distribution of missing points is more irregular compared to dry weather.

To conclude, the major domain gap between OD and Kirk is the deteriorating point cloud quality caused by the rainy weather. We refer to this phenomenon in the target domain as the “missing point” problem.

1.2 Previous Methods to Address the Domain Gap

Multiple studies propose to align features across domains. Most of them focus on 2D tasks [morerio2017minimal, ganin2015unsupervised, tzeng2017adversarial, dong2019semantic] or object-level 3D tasks [zhou2018unsupervised, qin2019pointdan]. Applying feature alignment [chen2018domain, he2019multi, luo2020unsupervised] requires redesigning a detector's model or loss. Our goal is to seek a general solution that benefits recently reported LiDAR-based detectors [lang2019pointpillars, shi2020pv, zhou2018voxelnet, shi2019pointrcnn, he2020sassd].

Another direction is to apply transformations to the data from one domain so that it matches the data from the other domain. A naive approach is to randomly down-sample the point cloud, but this not only fails to satisfactorily simulate the pattern of missing points (Figure 2d) but also hurts performance on the source domain. Another approach is to up-sample the point cloud [yu2018pu, yifan2019patch, li2019pu] in the target domain, which can increase point density around observed regions. However, these methods have limited capability in recovering the 3D shape of objects that are only partially observed. Moreover, up-sampling the entire point cloud leads to significantly higher latency. A third approach is to leverage style transfer techniques: [zhu2017unpaired, park2020contrastive, choi2019self, he2019multi, shan2019pixel, hsu2020progressive, saleh2019domain] render point clouds as 2D pseudo-images and enforce the renderings from different domains to be similar in style. However, these methods introduce an information bottleneck during rasterization [zhou2018voxelnet] and are not applicable to modern point-based 3D detectors [shi2020pv].

1.3 SPG for Closing the Domain Gap

The “missing point” problem deteriorates the point cloud quality and reduces the number of point observations, thus undermining the detection performance. To address this issue, we propose Semantic Point Generation (SPG). Our approach aims to learn the semantic information of the point cloud and performs foreground region prediction to identify voxels that are inside foreground objects. Based on the predicted foreground voxels, SPG generates points to recover the foreground regions. Since these points are discriminatively generated at foreground objects, we denote them by semantic points. These semantic points are merged with the original points into an augmented point cloud, which is then fed to a 3D detector.

The contributions of this paper are two-fold:
1. We present an in-depth analysis of unsupervised domain adaptation (UDA) for LiDAR 3D detectors across different geographic locations and weather conditions. Our study reveals that the rainy weather can severely deteriorate the quality of LiDAR point clouds and lead to drastic performance drop for modern detectors.
2. We propose semantic point generation (SPG). To our best knowledge, it is the first learning-based model that targets UDA for point cloud 3D detection. Specifically, SPG has the following merits:


  • SPG can generate semantic points that faithfully recover the foreground regions suffering from the “missing point” problem. SPG significantly improves performance over poor-quality point clouds in the target domain while also benefiting the source domain, for representative 3D detectors including PointPillars [lang2019pointpillars] and PV-RCNN [shi2020pv].

  • SPG also improves the performance for the general 3D object detection task. We verify its effectiveness on KITTI [geiger2013vision] for the aforementioned 3D detectors.

  • SPG is a general approach and can be easily combined with modern off-the-shelf LiDAR-based detectors.

  • Our approach is light-weight and efficient. Introducing only a small number of additional points, SPG adds only marginal complexity to a 3D detector.

2 Related Work

2.1 Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to generalize a model to a novel (target) domain by using label information only from the source domain. The two domains are generally related, but there exists a distribution shift (domain gap). Most methods focus on learning aligned feature representations across domains. To reach this goal, [borgwardt2006integrating] proposes Maximum Mean Discrepancy (MMD), while [pan2010domain] proposes Transfer Component Analysis (TCA). [long2013transfer] designs Joint Distribution Adaptation to close the distribution shift, while [long2015learning, long2016unsupervised] utilize a shared Hilbert space. Without using explicit distance measures, deep learning models [ganin2015unsupervised, tzeng2017adversarial, dong2019semantic, qin2019generatively, saito2018maximum] use adversarial training to learn features that are indistinguishable between domains.

Unsupervised Domain Adaptation for 2D Detection

The object detection task is sensitive to local geometric features. [chen2018domain, he2019multi] hierarchically align features between domains. Most of these works focus on UDA for 2D detection. With the recent advances in unpaired style transfer [park2020contrastive, zhu2017unpaired], studies such as [shan2019pixel, hsu2020progressive] translate images from the source domain to the target domain or vice versa.

Unsupervised Domain Adaptation for 3D Tasks

Most UDA methods focus on 2D tasks; only a few studies explore UDA in 3D. [zhou2018unsupervised, qin2019pointdan] align global and local features for object-level tasks. To reduce sparsity, [wu2019squeezesegv2] projects the point cloud to a 2D view, while [saleh2019domain] projects it to the bird's-eye view (BEV). [du2020associate] creates a set of car models and adapts their features to the features of detected objects; however, this study targets general car 3D detection in a single point cloud domain. [wang2020train] is the first published study targeting UDA for 3D LiDAR detection. They identify vehicle size as the domain gap between KITTI [geiger2013vision] and other datasets and therefore resize the vehicles in the data. In contrast, we identify point cloud quality as the major domain gap between Waymo's two datasets [sun2019scalability] and use a learning-based approach to close it.

2.2 Point Cloud Transformation

One way to improve point cloud quality is to suitably transform the point cloud. Point cloud up-sampling methods [yu2018pu, yifan2019patch, li2019pu] can transform a low-density point cloud into a high-density one. However, they require high-density ground-truth point clouds during training, and they can only densify the point cloud in observed regions. In our case, we also need to recover regions with no point observations at all, caused by the “missing point” problem.

Point cloud completion networks [yuan2018pcn, chen2019unpaired, yang2018foldingnet, xie2020grnet] aim to complete the point cloud. Specialized in object-level completion, these models assume a single object has been manually located and the input only consists of the points on this object. Therefore, these models do not fit our purpose of object detection. Point cloud style transfer models [cao2020psnet, cao2019neural] can transfer the color theme and the object-level geometric style for the point cloud. However, these models do not focus on preserving local details with high-fidelity. Therefore, their transformation cannot directly help 3D detection.

3 Semantic Point Generation

Figure 3: Illustration of the SPG-aided 3D detection pipeline. SPG voxelizes the entire point cloud and generates a prediction for each voxel (both occupied and empty) within the generation area. After probability thresholding, we take the voxels with the highest foreground probabilities and add a semantic point (red) at the predicted location in each of these voxels. These points are merged with the original point cloud and fed into the selected 3D point cloud detector.

Each point in the input point cloud has three coordinate channels (xyz) plus additional property channels (e.g., intensity, elongation). Figure 3 illustrates the SPG-aided 3D detection pipeline. SPG takes the raw point cloud as input and generates a set of semantic points in the predicted foreground regions. These semantic points are then combined with the original point cloud into an augmented point cloud, which is fed into a point cloud detector to obtain object detection results.

As shown in Figure 4, SPG voxelizes the point cloud into an evenly spaced 3D voxel grid and learns point cloud semantics for these voxels. For each voxel, the network predicts the probability that it is a foreground voxel (i.e., contained in a foreground object bounding box). In each predicted foreground voxel, the network generates a semantic point described by its xyz coordinates and its point properties.

To faithfully recover the foreground regions of the observed objects, we define a generation area: only voxels occupied by, or neighboring, the observed points are considered. We filter out semantic points whose foreground probability falls below a threshold, then keep the semantic points with the highest foreground probabilities and merge them with the original point cloud to obtain the augmented point cloud. In practice, the number of retained semantic points is capped (8,000 per frame on the Waymo Domain Adaptation Dataset; see Section 4).

To enable SPG to be directly used by modern LiDAR-based detectors, the augmented point cloud keeps the same per-point format as the raw point cloud, with one additional property channel indicating the confidence of the foreground prediction: the predicted foreground probability is used for semantic points, and 1.0 for the original raw points.
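For concreteness, the following is a minimal NumPy sketch of the thresholding, top-K selection, and merging step described above. The array layout, function name, and the default threshold of 0.5 are illustrative assumptions; the 8,000-point cap follows the Waymo setting quoted in Section 4.

```python
import numpy as np

def merge_semantic_points(raw_points, sem_xyz, sem_props, sem_prob,
                          prob_threshold=0.5, max_points=8000):
    """Merge generated semantic points with the raw point cloud.

    raw_points: (N, 3 + C) array of xyz plus C property channels.
    sem_xyz:    (M, 3) predicted locations of candidate semantic points.
    sem_props:  (M, C) predicted point properties.
    sem_prob:   (M,)   predicted foreground probabilities.
    Returns an augmented cloud of shape (N + K, 3 + C + 1); the last channel
    stores the foreground confidence (1.0 for the original raw points).
    """
    # 1) Probability thresholding.
    keep = sem_prob >= prob_threshold
    sem_xyz, sem_props, sem_prob = sem_xyz[keep], sem_props[keep], sem_prob[keep]

    # 2) Keep at most `max_points` semantic points with the highest confidence.
    if sem_prob.shape[0] > max_points:
        top = np.argsort(-sem_prob)[:max_points]
        sem_xyz, sem_props, sem_prob = sem_xyz[top], sem_props[top], sem_prob[top]

    # 3) Append the confidence channel and merge with the raw points.
    raw_aug = np.concatenate([raw_points,
                              np.ones((raw_points.shape[0], 1))], axis=1)
    sem_aug = np.concatenate([sem_xyz, sem_props, sem_prob[:, None]], axis=1)
    return np.concatenate([raw_aug, sem_aug], axis=0)
```

The output array can be consumed by an off-the-shelf detector as long as the detector is configured to read the extra confidence channel as one more point property.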

3.1 Training Targets

To train SPG, we need to create two kinds of supervision: 1) a binary label for each voxel (either occupied or empty) indicating whether it is a foreground voxel, which supervises the predicted foreground probability; 2) a regression target for the generated semantic point features.

As visualized in Figure 4, we mark a point as a foreground point if it is inside an object bounding box, and mark voxels contained in a foreground bounding box as foreground voxels. Each voxel is assigned a foreground label of 1 if it is a foreground voxel and 0 otherwise. If a voxel is an occupied foreground voxel, its regression target is formed by the centroid (xyz) of all foreground points inside it together with the mean of their point properties (e.g., intensity, elongation).
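The sketch below illustrates this target construction for occupied voxels only (labels for empty voxels inside boxes are assigned by the expansion scheme of Section 3.3.2). The voxel hashing by integer index tuples and the per-point foreground mask are simplifying assumptions for illustration.

```python
import numpy as np

def build_voxel_targets(points, is_foreground, voxel_size, grid_origin):
    """Create per-voxel classification labels and regression targets.

    points:        (N, 3 + C) xyz plus C property channels.
    is_foreground: (N,) boolean mask, True if the point lies inside a
                   ground-truth bounding box.
    Returns dicts keyed by integer voxel-index tuples:
      labels[v]  = 1 if the occupied voxel contains a foreground point.
      targets[v] = centroid (xyz) and mean properties of its foreground points.
    """
    voxel_idx = np.floor((points[:, :3] - grid_origin) / voxel_size).astype(int)
    labels, targets = {}, {}
    for v in map(tuple, np.unique(voxel_idx, axis=0)):
        in_voxel = np.all(voxel_idx == v, axis=1)
        fg = points[in_voxel & is_foreground]
        labels[v] = int(fg.shape[0] > 0)
        if fg.shape[0] > 0:
            # Regression target: centroid of foreground xyz + mean properties.
            targets[v] = np.concatenate([fg[:, :3].mean(0), fg[:, 3:].mean(0)])
    return labels, targets
```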


Figure 4: Training target construction and the SPG model architecture. Three steps create the semantic point training targets: 1. voxelization; 2. foreground point search; 3. label assignment and ground-truth point feature calculation. SPG includes the Voxel Feature Encoding (VFE) module, the Information Propagation module, and the Point Generation module.

3.2 Model Structure

The lower part of Figure 4 illustrates the network architecture. SPG uses a light-weight encoder-decoder network [zhou2018voxelnet, lang2019pointpillars] composed of three modules (a minimal structural sketch follows):
1) The Voxel Feature Encoding module [zhou2018voxelnet] aggregates the points inside each voxel using several MLPs. Similar to [lang2019pointpillars, shi2020pv], these voxel features are then stacked into pillars and projected onto a bird's-eye-view feature space.
2) The Information Propagation module applies 2D convolutions to the pillar features. As shown in Figure 4, the semantic information in the occupied pillars (dark green) is propagated into the neighboring empty pillars (light green), which enables SPG to recover foreground regions in the empty space.
3) The Point Generation module maps the pillar features back to the corresponding voxels. For each voxel in the generation area, the module creates a semantic point encoded by its location, its point properties, and its foreground probability.
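The following PyTorch-style skeleton reflects our reading of the three modules. It is a simplified sketch, not the released implementation: the 128-channel width follows the implementation details in Section 4, but the exact layer counts, strides, pillar scatter, and head design are assumptions, and the BEV grid is assumed to have even height and width.

```python
import torch
import torch.nn as nn

class SPGBackbone(nn.Module):
    """Skeleton of the SPG encoder-decoder: VFE -> BEV propagation -> generation."""

    def __init__(self, point_dim=5, feat_dim=128, voxels_per_pillar=8, prop_dim=2):
        super().__init__()
        # Voxel Feature Encoding: point-wise MLP followed by voxel-wise max-pool.
        self.vfe = nn.Sequential(nn.Linear(point_dim, feat_dim), nn.ReLU())
        # Information Propagation: 2D convolutions over the pillar (BEV) grid.
        self.propagate = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=1, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        # Point Generation: per-pillar heads for probability and point features.
        self.prob_head = nn.Conv2d(feat_dim, voxels_per_pillar, 1)
        self.feat_head = nn.Conv2d(feat_dim, voxels_per_pillar * (3 + prop_dim), 1)

    def forward(self, pillar_points):
        # pillar_points: (B, H, W, P, point_dim) points grouped into pillars.
        feats = self.vfe(pillar_points).max(dim=3).values   # (B, H, W, feat_dim)
        bev = self.propagate(feats.permute(0, 3, 1, 2))     # (B, feat_dim, H, W)
        prob = torch.sigmoid(self.prob_head(bev))           # foreground probability per voxel
        feat = self.feat_head(bev)                          # semantic point xyz + properties
        return prob, feat
```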

3.3 Foreground Region Recovery

The above pipeline supervises SPG to generate semantic points in the occupied voxels. However, it is also crucial to recover the empty voxels caused by the “missing points” problem. To generate semantic points in the empty areas, SPG employs two strategies:


  • “Hide and Predict”, which produces the “missing points” on the source domain during training and guides SPG to recover the foreground object shape in the empty space.

  • “Semantic Area Expansion”, which leverages the foreground/background voxel labels derived from the bounding boxes and encourages SPG to recover more unobserved foreground regions in each bounding box.

3.3.1 Hide and Predict

SPG voxelizes the point cloud into a voxel set. Before passing the point cloud to the network, we randomly select a fixed fraction of the occupied voxels and hide all of their points. During training, SPG is required to predict the foreground/background label for all voxels, even though it only observes the points in the voxels that are not hidden. The predicted point features in the hidden voxels should match the ground truth computed from the hidden points.

This strategy brings two benefits: 1. hiding points region by region mimics the missing-point pattern in the target domain; 2. it naturally creates training targets for semantic points in the empty space. Section 4.4 shows the effectiveness of this strategy; the hiding ratio is a fixed hyperparameter.
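A minimal sketch of the hiding step is given below, assuming per-point voxel indices have already been computed. The default hiding ratio of 0.25 is only a placeholder, since the exact value is not stated in this text.

```python
import numpy as np

def hide_and_predict_split(points, voxel_idx, hide_ratio=0.25, rng=None):
    """Randomly hide all points of a fraction of the occupied voxels.

    points:    (N, D) point array.
    voxel_idx: (N, 3) integer voxel index of each point.
    Returns (visible_points, hidden_points, hidden_voxels). The network only
    sees `visible_points`, but is still supervised on `hidden_voxels`, whose
    regression targets are computed from `hidden_points`.
    """
    rng = rng or np.random.default_rng()
    occupied = np.unique(voxel_idx, axis=0)
    n_hide = int(hide_ratio * occupied.shape[0])
    hidden_voxels = occupied[rng.choice(occupied.shape[0], n_hide, replace=False)]

    # A point is hidden if its voxel index matches any hidden voxel.
    hidden_set = {tuple(v) for v in hidden_voxels}
    is_hidden = np.array([tuple(v) in hidden_set for v in voxel_idx])
    return points[~is_hidden], points[is_hidden], hidden_voxels
```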

3.3.2 Semantic Area Expansion

0pt0pt

Figure 5: Visualization of “Semantic Area Expansion”. (a) and (c) show the occupied voxels and the generation area, respectively. (b) and (d) show the supervision strategies.

In Section 1.1, we found that poor point cloud quality leads to insufficient points on each object and substantially degrades detection performance. To remedy this problem, we allow SPG to expand the generation area into the empty space. Figure 5 a and c show examples of the generation area without and with the expansion, respectively.

Without the expansion, we can use the ground-truth knowledge of foreground points to supervise SPG only on the occupied voxels (Figure 5 b). With the expansion, however, there are no foreground points inside the empty voxels. Therefore, as shown in Figure 5 d, we design the following supervision scheme (a sketch of the resulting label assignment is given after the list):
1. For both occupied and empty background voxels, we impose negative supervision and set the foreground label to 0.
2. For the occupied foreground voxels, we set the foreground label to 1.
3. For the empty voxels inside a bounding box, we also set the foreground label to 1, but assign them a weighting factor α, where 0 ≤ α ≤ 1.
4. We impose point feature supervision only at the occupied foreground voxels.
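The sketch below writes out the four rules above as bookkeeping code. The helper predicates and the label/weight encoding are our own illustrative choices, not the paper's implementation.

```python
def assign_expansion_labels(voxels, is_occupied, in_foreground_box, alpha=0.5):
    """Assign foreground labels and weights under Semantic Area Expansion.

    voxels:            iterable of voxel ids inside the generation area.
    is_occupied:       dict voxel id -> bool (contains observed points).
    in_foreground_box: dict voxel id -> bool (lies inside a GT bounding box).
    Returns dicts: label (0/1), weight (per-voxel weight), regress (supervise features?).
    """
    label, weight, regress = {}, {}, {}
    for v in voxels:
        fg, occ = in_foreground_box[v], is_occupied[v]
        if not fg:                       # rule 1: background voxels, occupied or empty
            label[v], weight[v] = 0, 1.0
        elif occ:                        # rule 2: occupied foreground voxels
            label[v], weight[v] = 1, 1.0
        else:                            # rule 3: empty voxels inside a box
            label[v], weight[v] = 1, alpha
        regress[v] = fg and occ          # rule 4: point features only where observed
    return label, weight, regress
```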

To investigate the effectiveness of the expansion, we train a model on the OD training set and evaluate it on the Kirk validation set. The expansion results in 510% more semantic points on foreground objects, which mitigates the “missing points” problem caused by environmental interference and occlusions. Figure 6 shows the generation results with and without the expansion. The supervision scheme encourages SPG to learn the extended shape of vehicle parts and enables SPG to fill in more foreground space with semantic points. We also conduct ablation studies (Section 4.4) to show the effectiveness of the proposed strategy.

Figure 6: Comparisons between generated semantic points (red) with and without “Semantic Area Expansion”.

3.4 Objectives

We use two loss functions: a foreground area classification loss L_cls and a point feature regression loss L_reg.

To supervise the predicted foreground probability with the voxel labels, we use the focal loss [lin2017focal] to mitigate the background-foreground class imbalance. L_cls can be decomposed into focal losses on four categories of voxels: the occupied voxels, the empty background voxels, the empty foreground voxels, and the hidden voxels. The labeling strategy for these categories is described in Section 3.3.2.

L_cls = L_focal(V_occ) + L_focal(V_empty^bg) + α · L_focal(V_empty^fg) + L_focal(V_hide)    (1)

We use the Smooth-L1 loss [he2019multi] for point feature regression, supervising the semantic points in the occupied foreground voxels and the hidden foreground voxels.

L_reg = L_smooth-L1(V_occ^fg) + L_smooth-L1(V_hide^fg)    (2)

Please note that L_cls and L_reg are only computed on voxels inside the generation area. The weights balancing the two losses are tuned empirically for the best result.
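The following PyTorch sketch shows one way the two objectives could be combined. It uses torchvision's sigmoid focal loss as a stand-in for the focal loss above; the per-voxel weights follow the labeling scheme of Section 3.3.2, and the relative loss weighting is an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def spg_loss(prob_logits, labels, voxel_weights,
             pred_feats, target_feats, regress_mask, reg_weight=1.0):
    """Combine the foreground classification loss and the feature regression loss.

    prob_logits:   (V,) raw foreground logits for voxels in the generation area.
    labels:        (V,) 0/1 foreground labels.
    voxel_weights: (V,) per-voxel weights (alpha for empty foreground voxels).
    pred_feats:    (V, 3 + C) predicted point location and properties.
    target_feats:  (V, 3 + C) ground-truth targets (valid where regress_mask is True).
    regress_mask:  (V,) bool, True for occupied/hidden foreground voxels.
    """
    cls = sigmoid_focal_loss(prob_logits, labels.float(), reduction="none")
    cls = (cls * voxel_weights).mean()

    if regress_mask.any():
        reg = F.smooth_l1_loss(pred_feats[regress_mask], target_feats[regress_mask])
    else:
        reg = prob_logits.new_zeros(())
    return cls + reg_weight * reg
```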

4 Experiments

In this section, we first evaluate the effectiveness of SPG as a general UDA approach for 3D detection, based on the Waymo Domain Adaptation Dataset [sun2019scalability]. In addition, we show that SPG can also improve results for top-performing 3D detectors on the source domain[sun2019scalability, geiger2013vision]. To demonstrate the wide applicability of SPG, we choose two representative detectors: 1) PointPillars [lang2019pointpillars], popular among industrial-grade autonomous driving systems; 2) PV-RCNN [shi2020pv], a high performance LiDAR-based 3D detector  [geiger2013vision, sun2019scalability]. We perform two groups of model comparisons under the setting of unsupervised domain adaptation (UDA) and general 3D object detection: group 1, PointPillars vs. SPG + PointPillars; group 2, PV-RCNN vs. SPG + PV-RCNN. SPG can also be combined with range image-based detectors [meyer2019lasernet, zhou2020end, REF:Range_AlexBewley2020] by applying ray casting to the generated points. However, we leave this as future work.

Datasets

The Waymo Domain Adaptation dataset 1.0 [sun2019scalability] consists of two sub-datasets, the Waymo Open Dataset (OD) and the Waymo Kirkland Dataset (Kirk). OD provides 798 training segments with 158,361 LiDAR frames and 202 validation segments with 40,077 frames. Captured across California and Arizona, nearly all of its frames have dry weather (only 0.5% of the validation frames are rainy; see Table 1). Kirk is a smaller dataset, including 80 training segments with 15,797 frames and 20 validation segments with 3,933 frames. Captured in Kirkland, its LiDAR frames are dominated by rainy weather (100% of the validation frames; see Table 1). To examine a detector's reliability when entering a new environment, we conduct UDA experiments without using any data from Kirk during training.

KITTI [geiger2013vision] contains 7481 training samples and 7518 testing samples. Following [REF:Multiview3D_2017], we divide the training data into a train split and a val split containing 3721 and 3769 LiDAR frames, respectively.

Implementation and Training Details

We use a single lightweight network architecture in all experiments. As shown in Figure 4, our Voxel Feature Encoding [zhou2018voxelnet] module includes a single-layer point-wise MLP and a voxel-wise max-pooling [qi2017pointnet, zhou2018voxelnet]. The Information Propagation module includes two levels of CNN layers: the first level has three CNN layers with stride 1, and the second level has one CNN layer with stride 2 followed by four CNN layers with stride 1, whose output is then up-sampled back to the original resolution. Each layer has an output dimension of 128. From the BEV feature map, the Point Generation module uses one FC layer to produce the foreground probabilities and another FC layer to generate the point features for the voxels in each pillar. SPG and each detector are trained separately.

We implement PointPillars following [lang2019pointpillars] and use the PV-RCNN code provided by [shi2020pv] (the training settings on OD 1.0 are obtained via direct communication with the authors). On the Waymo Domain Adaptation Dataset [sun2019scalability], we set the voxel dimensions to (0.32m, 0.32m, 0.4m) for PointPillars and (0.2m, 0.2m, 0.3m) for PV-RCNN. On KITTI, we set the voxel dimensions to (0.16m, 0.16m, 0.2m) and (0.2m, 0.2m, 0.3m) for PointPillars and PV-RCNN, respectively. By default, the generation area includes voxels within 6 steps of any occupied voxel. After probability thresholding, we preserve up to 8,000 semantic points per frame for the Waymo Domain Adaptation Dataset; a separate cap is used for KITTI.
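For reference, the per-dataset settings quoted in this paragraph are collected below. The field names are our own, the probability threshold is a placeholder (the tuned value is reported only relative to Table 8), the KITTI point cap is left unset because it is not stated here, and we read the voxel sizes as belonging to SPG's own grid when paired with each detector.

```python
from dataclasses import dataclass

@dataclass
class SPGConfig:
    voxel_size: tuple              # (dx, dy, dz) in meters for SPG's voxel grid
    generation_steps: int = 6      # expand the generation area this many voxels
    prob_threshold: float = 0.5    # placeholder; the tuned value is not given here
    max_semantic_points: int | None = None

# Settings quoted in this section (KITTI cap not stated here).
WAYMO_POINTPILLARS = SPGConfig(voxel_size=(0.32, 0.32, 0.4), max_semantic_points=8000)
WAYMO_PVRCNN       = SPGConfig(voxel_size=(0.2, 0.2, 0.3),   max_semantic_points=8000)
KITTI_POINTPILLARS = SPGConfig(voxel_size=(0.16, 0.16, 0.2))
KITTI_PVRCNN       = SPGConfig(voxel_size=(0.2, 0.2, 0.3))
```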

| Difficulty | Method | Kirk Vehicle 3D AP | Kirk Vehicle BEV AP | Kirk Pedestrian 3D AP | Kirk Pedestrian BEV AP | OD Vehicle 3D AP | OD Vehicle BEV AP | OD Pedestrian 3D AP | OD Pedestrian BEV AP |
|---|---|---|---|---|---|---|---|---|---|
| LEVEL_1 | PointPillars | 34.65 | 51.88 | 20.65 | 22.33 | 57.27 | 72.26 | 55.20 | 63.82 |
| LEVEL_1 | SPG + PointPillars | 41.56 | 60.44 | 23.72 | 24.83 | 62.44 | 77.63 | 56.06 | 64.66 |
| LEVEL_1 | Improvement | +6.91 | +8.56 | +3.07 | +2.50 | +5.17 | +5.37 | +0.86 | +0.84 |
| LEVEL_2 | PointPillars | 31.67 | 47.93 | 17.66 | 18.40 | 52.96 | 69.09 | 51.33 | 60.13 |
| LEVEL_2 | SPG + PointPillars | 38.15 | 56.94 | 19.57 | 20.67 | 58.54 | 74.90 | 52.33 | 60.93 |
| LEVEL_2 | Improvement | +6.48 | +9.01 | +1.91 | +2.27 | +5.58 | +5.81 | +1.00 | +0.80 |
| LEVEL_1 | PV-RCNN | 55.16 | 70.38 | 24.47 | 25.39 | 74.01 | 85.13 | 65.34 | 70.35 |
| LEVEL_1 | SPG + PV-RCNN | 58.31 | 72.56 | 30.82 | 31.92 | 75.27 | 87.38 | 66.93 | 70.37 |
| LEVEL_1 | Improvement | +3.15 | +2.18 | +6.35 | +6.53 | +1.26 | +2.25 | +1.59 | +0.02 |
| LEVEL_2 | PV-RCNN | 45.81 | 60.13 | 17.16 | 17.88 | 64.69 | 76.84 | 56.03 | 60.81 |
| LEVEL_2 | SPG + PV-RCNN | 48.70 | 62.03 | 22.05 | 22.65 | 65.98 | 78.05 | 57.68 | 60.88 |
| LEVEL_2 | Improvement | +2.89 | +1.90 | +4.89 | +4.77 | +1.29 | +1.21 | +1.65 | +0.07 |

Table 2: Results on the Waymo Open Dataset 1.0 and the Kirkland Dataset. Results for PointPillars are based on our own implementation following [lang2019pointpillars]. We use the PV-RCNN source code and obtain training settings for the Waymo Open Dataset [sun2019scalability] via direct communication with the author.

4.1 Evaluation on the Waymo Open Dataset

We perform two groups of model comparisons by training them on the OD training set and evaluating them on both the OD validation set and the Kirk validation set.

Evaluation Metrics

The Kirk 1.0 validation set only provides the evaluation labels for the vehicle and the pedestrian classes. We use the official evaluation tool released by [sun2019scalability]. The IoU thresholds for vehicles and pedestrians are 0.7 and 0.5. In Table 2 we report both 3D and BEV AP on two difficulty levels. More results with distance breakdown are shown in the supplemental material.

Target Domain

On Kirk, we observe that SPG brings remarkable improvements for both detectors across all object types. Averaged over the two difficulty levels, SPG improves PointPillars on Kirk vehicle detection by 6.7 points in 3D AP and 8.8 points in BEV AP. For PV-RCNN, SPG improves Kirk pedestrian detection by 5.6 points in 3D AP and 5.7 points in BEV AP.

Source Domain

Unlike most UDA methods [chen2018domain, hsu2020progressive, shan2019pixel] that only optimize performance on the target domain, SPG also consistently improves results on the source domain. Averaged over the two difficulty levels, SPG improves OD vehicle 3D AP for PointPillars by 5.4 points and OD pedestrian 3D AP for PV-RCNN by 1.6 points.

Comparison with Alternative Strategies

We compare SPG with alternative strategies that also target the deteriorating point cloud quality. We employ PointPillars as the baseline and use LEVEL_1 vehicle 3D AP on the Kirk validation set as the main metric under UDA. Three strategies are implemented: 1. RndDrop, where we randomly drop a fraction of the points in the source domain during training; the dropout ratio is chosen so that the number of points per frame in the source domain matches that in the target domain (see Table 1). 2. K-frames, where we aggregate K consecutive historical frames in both the source domain and the target domain; the points in the first K-1 frames are transformed into the last frame according to the ground-truth ego-motion, so that the last frame contains roughly K times the number of points. 3. Adversarial Domain Adaptation (ADA), where we follow [ganin2015unsupervised] and add a domain classification loss on the pillar features of PointPillars.
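The K-frames baseline can be sketched as follows, assuming each frame provides a 4x4 vehicle-to-global pose; this is our reading of the strategy, not released code.

```python
import numpy as np

def aggregate_k_frames(frames):
    """Aggregate K LiDAR frames into the coordinate frame of the last one.

    frames: list of (points, pose) pairs, where points is (N_i, 3) in the
            vehicle frame and pose is the 4x4 vehicle-to-global transform.
    Returns all points expressed in the last frame's vehicle coordinates.
    """
    world_to_last = np.linalg.inv(frames[-1][1])
    merged = []
    for points, pose in frames:
        # Homogeneous transform: vehicle_i -> global -> vehicle_last.
        hom = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
        merged.append((hom @ (world_to_last @ pose).T)[:, :3])
    return np.concatenate(merged, axis=0)
```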

As shown in Table 3, although “RndDrop” makes the quantity of missing points in the source domain match that in the target domain, the pattern of missing points still differs from reality (see Figure 2), which limits the improvement to only 0.8 points in 3D AP. To remedy the “missing point” problem, “3-frames” contains real points from 3 frames and “5-frames” contains points from 5 frames. With around 800K points per scene, “5-frames” significantly improves over the single-frame baseline. However, aggregating multiple frames inevitably increases memory usage and processing time. ADA improves 3D AP to 36.34 on the target domain, but at the cost of an AP drop on the source domain. Remarkably, SPG outperforms “5-frames” by adding only 8,000 semantic points, which is less than 5% of the points in a single frame.

| Method | Baseline | RndDrop | 3-frames | 5-frames | ADA | SPG |
|---|---|---|---|---|---|---|
| 3D AP | 34.65 | 35.45 | 38.00 | 38.51 | 36.34 | 41.56 |

Table 3: Comparison of different strategies targeting the deteriorating point cloud quality. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars [lang2019pointpillars] as the baseline.

4.2 Evaluation on the KITTI Dataset

In this section, we show that, besides its usefulness for UDA (Section 4.1), the proposed SPG can also boost performance on another popular 3D detection benchmark, KITTI [geiger2013vision]. We follow the training and evaluation protocols in [lang2019pointpillars, shi2020pv].

Car - 3D AP

| Method | Reference | Easy | Mod. | Hard | Avg. |
|---|---|---|---|---|---|
| SA-SSD [he2020sassd] | CVPR 2020 | 88.75 | 79.79 | 74.16 | 80.90 |
| 3D-CVF [yoo20203d] | ECCV 2020 | 89.20 | 80.05 | 73.11 | 80.79 |
| CIA-SSD [zheng2020ciassd] | AAAI 2021 | 89.59 | 80.28 | 72.87 | 80.91 |
| Asso-3Ddet [du2020associate] | CVPR 2020 | 85.99 | 77.40 | 70.53 | 77.97 |
| Voxel R-CNN [deng2020voxel] | AAAI 2021 | 90.90 | 81.62 | 77.06 | 83.19 |
| PV-RCNN [shi2020pv] | CVPR 2020 | 90.25 | 81.43 | 76.82 | 82.83 |
| SPG + PV-RCNN | - | 90.50 | 82.13 | 78.90 | 83.84 |

Table 4: Car detection Results on the KITTI test set. See the full list of comparisons in the supplemental.
KITTI Test Set

As shown in Table 4, SPG significantly improves PV-RCNN on car 3D detection. As of Mar. 3rd, 2021, our method ranks 1st on KITTI car 3D detection among all published methods (4th among all submitted approaches). Moreover, SPG demonstrates strong robustness in detecting hard objects (truncation up to 50%). Specifically, SPG surpasses all submitted methods on the hard category by a large margin and achieves the highest overall 3D AP of 83.84 (averaged over Easy, Mod., and Hard).

| Method | Car 3D Easy | Car 3D Mod. | Car 3D Hard | Car BEV Easy | Car BEV Mod. | Car BEV Hard | Ped. 3D Easy | Ped. 3D Mod. | Ped. 3D Hard | Ped. BEV Easy | Ped. BEV Mod. | Ped. BEV Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointPillars | 87.75 | 78.39 | 75.18 | 92.03 | 88.05 | 86.66 | 57.30 | 51.41 | 46.87 | 61.59 | 56.01 | 52.04 |
| SPG + PointPillars | 89.77 | 81.36 | 78.85 | 94.38 | 89.92 | 87.97 | 59.65 | 53.55 | 49.24 | 65.38 | 59.48 | 55.32 |
| Improvement | +2.02 | +2.97 | +3.67 | +2.35 | +1.87 | +1.31 | +2.35 | +2.14 | +2.37 | +3.79 | +3.47 | +3.28 |
| PV-RCNN | 92.10 | 84.36 | 82.48 | 93.02 | 90.33 | 88.53 | 64.26 | 56.67 | 51.91 | 67.97 | 60.52 | 55.80 |
| SPG + PV-RCNN | 92.53 | 85.31 | 82.82 | 94.99 | 91.11 | 88.86 | 69.66 | 61.80 | 56.39 | 71.79 | 64.50 | 59.51 |
| Improvement | +0.43 | +0.95 | +0.34 | +1.97 | +0.78 | +0.33 | +5.40 | +5.13 | +4.48 | +3.82 | +3.98 | +3.71 |

Table 5: Comparisons on the KITTI validation set. Average Precision (AP) is computed over 40 recall positions. The baseline results [shi2020pv, openpcdet2020] are obtained using publicly released models. See more results (including Cyclist) in the supplemental.
KITTI Validation Set

We summarize the results in Table 5. We train each group of models using the recommended settings of baseline detectors [lang2019pointpillars, shi2020pv].

SPG remarkably improves both PointPillars and PV-RCNN on all object types and difficulty levels. Specifically, for PointPillars, SPG improves the 3D AP of car detection by 2.02, 2.97, and 3.67 points on the easy, moderate, and hard levels, respectively. For PV-RCNN, SPG improves the 3D AP of pedestrian detection by 5.40, 5.13, and 4.48 points on the easy, moderate, and hard levels, respectively.

4.3 Model Efficiency

We evaluate the efficiency of SPG on the KITTI val split (Table 6). SPG contains only 0.39 million parameters and adds less than 20 milliseconds of latency to the detectors. This indicates that SPG is efficient enough for industrial-grade deployment under a stringent computation budget.

| | PointPillars | SPG + PointPillars | PV-RCNN | SPG + PV-RCNN | SPG (standalone) |
|---|---|---|---|---|---|
| Latency (ms) | 23.56 | 36.67 | 139.96 | 156.85 | 16.82 |
| Parameters | 4.83M | 5.22M | 13.12M | 13.51M | 0.39M |

Table 6: Latency and model parameters. “M” stands for million. The last column shows the results of standalone SPG. The evaluation is based on a 1080Ti GPU with a batch size of 1. The latency is averaged over the KITTI val split.

4.4 Ablation Studies

| Model | Expansion | Hide & Predict | Foreground Confidence | 3D AP | Improvement |
|---|---|---|---|---|---|
| Baseline | | | | 34.65 | |
| SPG | | | | 35.89 | +1.24 |
| SPG | | | | 38.09 | +3.44 |
| SPG | α = 0.0 | | | 38.96 | +4.31 |
| SPG | α = 1.0 | | | 38.42 | +3.77 |
| SPG | α = 0.5 | | | 39.22 | +4.57 |
| SPG | α = 0.5 | | | 37.96 | +3.31 |
| SPG (ours) | α = 0.5 | | | 41.56 | +6.91 |

Table 7: Ablation studies of SPG. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars [lang2019pointpillars] as our baseline.

We conduct ablation studies on “Semantic Area Expansion”, “Hide and Predict”, and whether to add the foreground confidence as an extra point property, and show that all of them benefit detection quality (see Table 7). We also vary the weighting factor α on the empty foreground voxels. A larger α encourages more point generation in the empty foreground space. However, in reality an object typically does not occupy the entire space within its bounding box, so over-aggressively generating points does not improve performance (see the α = 1.0 row in Table 7).

Probability Thresholding

In Table 8, we show the effect of choosing different thresholds during probability thresholding. A higher threshold only keeps semantic points with high foreground probability, while a lower threshold admits more points but may introduce points in the background. We find that an intermediate threshold achieves the best result.

| Threshold setting | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 3D AP | 39.39 | 40.09 | 41.56 | 41.18 | 40.89 |
Table 8: Ablation studies on the probability threshold (a semantic point is kept only if its predicted foreground probability exceeds the threshold). Our best SPG model uses the third setting. The metric is LEVEL_1 Vehicle 3D AP on the Kirk validation set.

5 Conclusions

In this paper, we investigate unsupervised domain adaptation for LiDAR-based 3D detectors across different geographic locations and weather conditions. Based on the Waymo Domain Adaptation dataset, we observe that rainy weather can severely deteriorate point cloud quality and cause a drastic performance drop for modern 3D detectors. We address this issue as a novel unsupervised domain adaptation (UDA) task, without using any training data from the new domain. This setting allows us to rigorously test 3D detectors against real-world challenges that autonomous vehicles may experience due to diverse conditions (e.g., levels of fog, rain, or snow beyond what one can effectively train for) during a trip.

Utilizing the two strategies “Hide and Predict” and “Semantic Area Expansion”, SPG generates semantic points that recover the shape of foreground objects with negligible overhead (adding only a small number of extra points) and can be conveniently integrated with modern LiDAR-based detectors. We test SPG with two detectors, PointPillars and PV-RCNN. For unsupervised domain adaptation, SPG achieves significant performance gains on the challenging target domain. On the Waymo Open Dataset and KITTI, SPG also consistently benefits detection quality on the source domain.

6 Acknowledgement

We would like to thank Boqing Gong for the helpful discussions. We also thank Jingwei Ji for the careful proofreading.

References

Appendix A Statistics of the Waymo Domain Adaptation Dataset

Figure 7: The average number of raw points per vehicle across different ranges. On the x axis, the range value is the distance between the center of a bounding box and the LiDAR sensor. The y axis shows a scaled value of the number of points (the scaling is applied for better visualization). “Kirk Dry” is extracted from the Kirk training set and contains frames captured in dry weather.

We collect statistics on the average number of points in a vehicle bounding box across different ranges. The range value is calculated as the Euclidean distance between the LiDAR sensor and the center of a bounding box. We investigate four sets of point clouds:


  • The OD Validation set, in which 99.5% of the frames are collected in dry weather (see Table 1).

  • The Kirk Dry set, which consists of all the frames with the dry weather condition from the Kirk training set.

  • The Kirk Training Rainy set, which consists of all the frames with the rainy weather condition from the Kirk training set.

  • The Kirk Validation set, in which all the frames are collected in the rainy weather.

As shown in Figure 7, point clouds captured in similar weather conditions share similar numbers of points per object, even though they are collected at different locations. Specifically, the vehicle objects of the two “dry” datasets, i.e., the Kirk Dry set and the OD Validation set, have similar numbers of points across all ranges. The vehicle objects of the two “rainy” datasets, i.e., the Kirk Training Rainy set and the Kirk Validation set, also share similar statistics.

In addition, the point clouds captured in dry weather (the OD Validation set and the Kirk Dry set) have more points on each object than those collected in rainy weather (the Kirk Training Rainy set and the Kirk Validation set). Please note that we have applied a scaling to the number of points for better visualization. The difference in the number of points between the two weather conditions is substantial across all ranges.
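The statistics in Figure 7 can be gathered with a procedure like the one sketched below. The box parameterization, bin edges, and helper names are illustrative assumptions; the distance is measured in the BEV plane for simplicity.

```python
import numpy as np

def points_in_box(points, center, dims, heading):
    """Count LiDAR points inside a 7-DoF box (center, l/w/h dims, yaw heading)."""
    local = points[:, :3] - center
    c, s = np.cos(-heading), np.sin(-heading)
    x = c * local[:, 0] - s * local[:, 1]
    y = s * local[:, 0] + c * local[:, 1]
    z = local[:, 2]
    l, w, h = dims
    return int(np.sum((np.abs(x) <= l / 2) & (np.abs(y) <= w / 2) & (np.abs(z) <= h / 2)))

def points_per_vehicle_by_range(frames, bins=(0, 10, 20, 30, 40, 50, 60, 70, 80)):
    """Average number of points per vehicle box, bucketed by distance to the sensor."""
    counts = [[] for _ in range(len(bins) - 1)]
    for points, boxes in frames:              # boxes: list of (center, dims, heading)
        for center, dims, heading in boxes:
            dist = np.linalg.norm(center[:2])  # BEV distance to the LiDAR
            b = np.searchsorted(bins, dist, side="right") - 1
            if 0 <= b < len(counts):
                counts[b].append(points_in_box(points, center, dims, heading))
    return [float(np.mean(c)) if c else 0.0 for c in counts]
```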

Appendix B The Robustness of the Foreground Voxel Classifier

In order to generalize detectors to different domains, it is crucial to correctly classify foreground voxels so that semantic points can be reliably generated. Table 9 lists the evaluation results of the foreground voxel classifier.

| Train | Eval | Accuracy | Precision | Recall | AP |
|---|---|---|---|---|---|
| OD Train | OD Val | 99.3% | 90.9% | 92.9% | 86.7% |
| OD Train | Kirk Val | 98.9% | 88.4% | 88.2% | 78.3% |

Table 9: Foreground voxel classification results of SPG. The model is trained on the OD training set and evaluated on the OD validation set and the Kirk validation set, respectively. Accuracy, precision, and recall are computed with a fixed threshold on the predicted foreground probability.

The results in Table 9 are averaged over all voxels in the foreground regions. SPG is trained on the OD training set and then evaluated on the OD validation set and the Kirk validation set, respectively. The classification of a voxel is considered correct if its prediction score is above the threshold for a foreground voxel, or below the threshold for a background voxel. Accuracy, precision, and recall are all calculated under this setting, and the AP is calculated using 40 recall thresholds. The results show that SPG achieves high classification performance in both domains.
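A sketch of this evaluation is given below. The 0.5 decision threshold is only a placeholder, since the exact value is not stated in this text, and the AP computation is a standard interpolated-precision approximation over 40 recall positions rather than the official tool.

```python
import numpy as np

def voxel_classification_metrics(scores, labels, threshold=0.5, num_recall_pos=40):
    """Accuracy / precision / recall at a fixed threshold, plus interpolated AP.

    scores: (V,) predicted foreground probabilities for voxels in the
            foreground regions; labels: (V,) 0/1 ground truth.
    """
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    accuracy = np.mean(pred == (labels == 1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)

    # Interpolated AP over evenly spaced recall positions.
    order = np.argsort(-scores)
    tp_cum = np.cumsum(labels[order] == 1)
    prec_curve = tp_cum / np.arange(1, len(scores) + 1)
    rec_curve = tp_cum / max(np.sum(labels == 1), 1)
    ap = 0.0
    for r in np.linspace(0, 1, num_recall_pos):
        mask = rec_curve >= r
        ap += (prec_curve[mask].max() if mask.any() else 0.0) / num_recall_pos
    return accuracy, precision, recall, ap
```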

Appendix C Dropout Rate of the RndDrop Method

In the experiment section, we implement a baseline method, RndDrop, in which we randomly drop a fraction r of the points in source-domain point clouds during training. This dropout ratio is chosen to match the ratio of missing points in the target domain: we calculate r = 1 − N_t / N_s, where N_s is the average number of points per scene in the source domain and N_t is the average number of points per scene in the target domain.
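Written out as code, the ratio computation is simply the following; the per-scene averages are measured on the source (OD) and target (Kirk) training data.

```python
def rnddrop_ratio(avg_points_source: float, avg_points_target: float) -> float:
    """Dropout ratio that makes the expected source point count match the target.

    Solves avg_points_source * (1 - r) = avg_points_target for r.
    """
    return 1.0 - avg_points_target / avg_points_source
```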

Appendix D More Results on the Waymo Domain Adaptation Dataset

The evaluation tool [sun2019scalability] provides the average precision for three distance-based breakdowns: 0 to 30 meters, 30 to 50 meters, and beyond 50 meters. The AP is calculated using 100 recall thresholds.

We perform two groups of model comparisons in the UDA setting: Group 1, PointPillars vs. SPG + PointPillars; Group 2, PV-RCNN vs. SPG + PV-RCNN. We train all models on the OD training set and evaluate them on both the OD validation set and the Kirk validation set. Tables 10 and 11 show the comparisons in vehicle 3D AP and vehicle BEV AP, respectively. Tables 12 and 13 show the comparisons in pedestrian 3D AP and pedestrian BEV AP, respectively. In most cases, SPG improves detection performance across all ranges for both vehicles and pedestrians.

Vehicle 3D AP (IoU = 0.7); Kirk is the target domain, OD the source domain.

| Difficulty | Method | Kirk Overall | Kirk 0-30m | Kirk 30-50m | Kirk 50m-Inf | OD Overall | OD 0-30m | OD 30-50m | OD 50m-Inf |
|---|---|---|---|---|---|---|---|---|---|
| LEVEL_1 | PointPillars | 34.65 | 63.13 | 24.56 | 7.65 | 57.27 | 84.39 | 52.97 | 28.22 |
| LEVEL_1 | SPG + PointPillars | 41.56 | 68.26 | 31.91 | 13.08 | 62.44 | 86.18 | 58.13 | 35.40 |
| LEVEL_1 | Improvement | +6.91 | +5.13 | +7.35 | +5.43 | +5.17 | +1.79 | +5.16 | +7.18 |
| LEVEL_2 | PointPillars | 31.67 | 59.26 | 22.09 | 7.08 | 52.96 | 82.30 | 50.74 | 24.6 |
| LEVEL_2 | SPG + PointPillars | 38.15 | 64.57 | 28.66 | 11.96 | 58.54 | 85.75 | 56.02 | 31.02 |
| LEVEL_2 | Improvement | +6.48 | +5.31 | +6.57 | +4.88 | +5.58 | +3.45 | +5.28 | +6.42 |
| LEVEL_1 | PV-RCNN | 55.16 | 76.68 | 47.96 | 27.59 | 74.01 | 91.39 | 70.94 | 49.51 |
| LEVEL_1 | SPG + PV-RCNN | 58.31 | 77.81 | 51.65 | 31.29 | 75.27 | 92.36 | 73.47 | 51.03 |
| LEVEL_1 | Improvement | +3.15 | +1.13 | +3.69 | +3.70 | +1.26 | +0.97 | +2.53 | +1.52 |
| LEVEL_2 | PV-RCNN | 45.81 | 71.31 | 38.83 | 20.52 | 64.69 | 88.95 | 64.80 | 37.37 |
| LEVEL_2 | SPG + PV-RCNN | 48.70 | 72.41 | 42.16 | 23.52 | 65.98 | 91.62 | 65.61 | 39.87 |
| LEVEL_2 | Improvement | +2.89 | +1.10 | +3.33 | +3.00 | +1.29 | +2.67 | +0.81 | +2.50 |

Table 10: The unsupervised domain adaptation vehicle detection results on both Waymo Open Dataset (OD) and Kirkland Dataset (Kirk). We show the vehicle 3D AP results in this table. The AP distance breakdowns are provided by the official evaluation tool.

Vehicle BEV AP (IoU = 0.7); Kirk is the target domain, OD the source domain.

| Difficulty | Method | Kirk Overall | Kirk 0-30m | Kirk 30-50m | Kirk 50m-Inf | OD Overall | OD 0-30m | OD 30-50m | OD 50m-Inf |
|---|---|---|---|---|---|---|---|---|---|
| LEVEL_1 | PointPillars | 51.88 | 75.56 | 46.04 | 25.55 | 72.26 | 92.23 | 71.35 | 51.11 |
| LEVEL_1 | SPG + PointPillars | 60.44 | 80.89 | 53.73 | 38.24 | 77.63 | 93.39 | 75.96 | 61.16 |
| LEVEL_1 | Improvement | +8.56 | +5.33 | +7.69 | +12.69 | +5.37 | +1.16 | +4.61 | +10.05 |
| LEVEL_2 | PointPillars | 47.93 | 71.18 | 42.41 | 23.47 | 69.09 | 91.83 | 68.87 | 45.53 |
| LEVEL_2 | SPG + PointPillars | 56.94 | 77.13 | 49.99 | 35.04 | 74.90 | 93.06 | 73.96 | 54.51 |
| LEVEL_2 | Improvement | +9.01 | +5.95 | +7.58 | +11.57 | +5.81 | +1.23 | +5.09 | +8.98 |
| LEVEL_1 | PV-RCNN | 70.38 | 84.27 | 65.31 | 52.98 | 85.13 | 95.99 | 84.02 | 72.19 |
| LEVEL_1 | SPG + PV-RCNN | 72.56 | 84.43 | 68.79 | 58.49 | 87.38 | 97.54 | 86.63 | 74.59 |
| LEVEL_1 | Improvement | +2.18 | +0.16 | +3.48 | +5.51 | +2.25 | +1.55 | +2.61 | +2.40 |
| LEVEL_2 | PV-RCNN | 60.13 | 78.10 | 54.36 | 40.67 | 76.84 | 93.29 | 76.64 | 58.29 |
| LEVEL_2 | SPG + PV-RCNN | 62.03 | 78.86 | 56.47 | 44.94 | 78.05 | 94.45 | 80.25 | 59.56 |
| LEVEL_2 | Improvement | +1.90 | +0.76 | +2.11 | +4.27 | +1.21 | +1.16 | +3.61 | +1.27 |

Table 11: The unsupervised domain adaptation vehicle detection results on both Waymo Open Dataset (OD) and Kirkland Dataset (Kirk). We show the vehicle BEV AP results in this table. The AP distance breakdowns are provided by the official evaluation tool.

Pedestrian 3D AP (IoU = 0.5); Kirk is the target domain, OD the source domain.

| Difficulty | Method | Kirk Overall | Kirk 0-30m | Kirk 30-50m | Kirk 50m-Inf | OD Overall | OD 0-30m | OD 30-50m | OD 50m-Inf |
|---|---|---|---|---|---|---|---|---|---|
| LEVEL_1 | PointPillars | 20.65 | 43.98 | 9.27 | 3.24 | 55.20 | 69.24 | 52.04 | 32.72 |
| LEVEL_1 | SPG + PointPillars | 23.72 | 50.19 | 9.11 | 5.57 | 56.06 | 69.32 | 53.12 | 34.73 |
| LEVEL_1 | Improvement | +3.07 | +6.21 | -0.16 | +2.33 | +0.86 | +0.08 | +1.08 | +2.01 |
| LEVEL_2 | PointPillars | 17.66 | 40.67 | 7.40 | 2.32 | 51.33 | 65.85 | 49.32 | 29.29 |
| LEVEL_2 | SPG + PointPillars | 19.57 | 46.42 | 7.44 | 3.99 | 52.33 | 65.63 | 50.10 | 31.25 |
| LEVEL_2 | Improvement | +1.91 | +5.75 | +0.04 | +1.67 | +1.00 | -0.22 | +0.78 | +1.96 |
| LEVEL_1 | PV-RCNN | 24.47 | 39.69 | 14.24 | 8.05 | 65.34 | 72.23 | 64.89 | 50.04 |
| LEVEL_1 | SPG + PV-RCNN | 30.82 | 48.04 | 18.80 | 13.39 | 66.93 | 73.55 | 66.60 | 50.82 |
| LEVEL_1 | Improvement | +6.35 | +8.35 | +4.56 | +5.34 | +1.59 | +1.32 | +1.71 | +0.78 |
| LEVEL_2 | PV-RCNN | 17.16 | 36.39 | 9.64 | 3.51 | 56.03 | 66.88 | 56.58 | 35.76 |
| LEVEL_2 | SPG + PV-RCNN | 22.05 | 44.07 | 12.91 | 5.77 | 57.68 | 68.28 | 58.29 | 37.64 |
| LEVEL_2 | Improvement | +4.89 | +7.68 | +3.27 | +2.26 | +1.65 | +1.40 | +1.71 | +1.88 |

Table 12: The unsupervised domain adaptation pedestrian detection results on both Waymo Open Dataset (OD) and Kirkland Dataset (Kirk). We show the pedestrian 3D AP results in this table. The AP distance breakdowns are provided by the official evaluation tool.

Pedestrian BEV AP (IoU = 0.5); Kirk is the target domain, OD the source domain.

| Difficulty | Method | Kirk Overall | Kirk 0-30m | Kirk 30-50m | Kirk 50m-Inf | OD Overall | OD 0-30m | OD 30-50m | OD 50m-Inf |
|---|---|---|---|---|---|---|---|---|---|
| LEVEL_1 | PointPillars | 22.33 | 45.00 | 10.50 | 3.49 | 63.82 | 76.33 | 61.90 | 42.81 |
| LEVEL_1 | SPG + PointPillars | 24.83 | 51.44 | 10.80 | 5.71 | 64.66 | 76.11 | 62.69 | 44.98 |
| LEVEL_1 | Improvement | +2.50 | +6.44 | +0.30 | +2.22 | +0.84 | -0.22 | +0.79 | +2.17 |
| LEVEL_2 | PointPillars | 18.40 | 41.63 | 8.58 | 2.49 | 60.13 | 73.34 | 58.77 | 38.83 |
| LEVEL_2 | SPG + PointPillars | 20.67 | 47.56 | 8.98 | 4.11 | 60.93 | 72.94 | 59.54 | 41.11 |
| LEVEL_2 | Improvement | +2.27 | +5.93 | +0.40 | +1.62 | +0.80 | -0.40 | +0.77 | +2.28 |
| LEVEL_1 | PV-RCNN | 25.39 | 40.23 | 14.72 | 9.76 | 70.35 | 76.22 | 70.49 | 56.77 |
| LEVEL_1 | SPG + PV-RCNN | 31.92 | 49.06 | 19.87 | 14.87 | 70.37 | 75.86 | 72.29 | 57.47 |
| LEVEL_1 | Improvement | +6.53 | +8.83 | +5.15 | +5.11 | +0.02 | -0.36 | +1.80 | +0.70 |
| LEVEL_2 | PV-RCNN | 17.88 | 36.89 | 9.97 | 4.23 | 60.81 | 69.22 | 61.86 | 41.32 |
| LEVEL_2 | SPG + PV-RCNN | 22.65 | 44.57 | 13.48 | 6.38 | 60.88 | 70.62 | 63.65 | 43.27 |
| LEVEL_2 | Improvement | +4.77 | +7.68 | +3.51 | +2.15 | +0.07 | +1.40 | +1.79 | +1.95 |

Table 13: The unsupervised domain adaptation pedestrian detection results on both Waymo Open Dataset (OD) and Kirkland Dataset (Kirk). We show the pedestrian BEV AP results in this table. The AP distance breakdowns are provided by the official evaluation tool.

Appendix E More Results on KITTI

We provide more 3D object detection results on KITTI. There are two commonly used metric standards for evaluating the detection performance: 1) R11, where the AP is evaluated with 11 recall positions; 2) R40, where the AP is evaluated with 40 recall positions. In addition to the improvement on car and pedestrian detection, SPG also significantly boosts the performance in cyclist detection. Based on R11, Table 14 and Table 15 show the results in 3D AP and BEV AP for three object types, respectively. Based on R40, Table 16 and Table 17 show the results in 3D AP and BEV AP for three object types, respectively.

We show more comparisons on the KITTI test set in Table 18.

3D AP (R11) for Car, Pedestrian, and Cyclist.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| PointPillars [lang2019pointpillars] | 86.46 | 77.28 | 74.65 | 57.75 | 52.29 | 47.90 | 80.05 | 62.68 | 59.70 |
| SPG + PointPillars | 87.98 | 78.54 | 77.32 | 59.91 | 54.58 | 50.34 | 81.58 | 65.70 | 62.28 |
| Improvement | +1.52 | +1.26 | +2.67 | +2.16 | +2.29 | +2.44 | +1.53 | +3.02 | +2.58 |
| PV-RCNN [shi2020pv] | 89.35 | 83.69 | 78.70 | 64.60 | 57.90 | 53.23 | 85.22 | 70.47 | 65.75 |
| SPG + PV-RCNN | 89.81 | 84.45 | 79.14 | 69.04 | 62.18 | 56.77 | 86.82 | 73.35 | 69.30 |
| Improvement | +0.46 | +0.76 | +0.44 | +4.44 | +4.28 | +3.54 | +1.60 | +2.88 | +3.55 |

Table 14: Result comparisons on the KITTI validation set. The results are evaluated by the Average Precision with 11 recall positions. The baseline detectors, PointPillars and PV-RCNN, are directly evaluated by using the checkpoints released by [shi2020pv, openpcdet2020].

BEV AP (R11) for Car, Pedestrian, and Cyclist.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| PointPillars [lang2019pointpillars] | 89.65 | 87.17 | 84.37 | 61.63 | 56.27 | 52.60 | 82.27 | 66.25 | 62.64 |
| SPG + PointPillars | 90.07 | 88.00 | 86.63 | 65.16 | 59.86 | 56.07 | 86.02 | 71.93 | 65.69 |
| Improvement | +0.42 | +0.83 | +2.26 | +3.53 | +3.59 | +3.47 | +3.75 | +5.68 | +3.05 |
| PV-RCNN [shi2020pv] | 90.09 | 87.90 | 87.41 | 67.01 | 61.38 | 56.10 | 86.79 | 73.55 | 69.69 |
| SPG + PV-RCNN | 90.41 | 88.49 | 87.74 | 71.19 | 64.37 | 59.88 | 92.54 | 74.43 | 70.99 |
| Improvement | +0.32 | +0.59 | +0.33 | +4.18 | +2.99 | +3.78 | +5.75 | +0.88 | +1.30 |

Table 15: Result comparisons on the KITTI validation set. The results are evaluated by the Average Precision with 11 recall positions. The baseline detectors, PointPillars and PV-RCNN, are directly evaluated by using the checkpoints released by [shi2020pv, openpcdet2020].

3D AP (R40) for Car, Pedestrian, and Cyclist.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| PointPillars [lang2019pointpillars] | 87.75 | 78.39 | 75.18 | 57.30 | 51.41 | 46.87 | 81.57 | 62.94 | 58.98 |
| SPG + PointPillars | 89.77 | 81.36 | 78.85 | 59.65 | 53.55 | 49.24 | 83.27 | 66.11 | 61.99 |
| Improvement | +2.02 | +2.97 | +3.67 | +2.35 | +2.14 | +2.37 | +1.70 | +3.17 | +3.01 |
| PV-RCNN [shi2020pv] | 92.10 | 84.36 | 82.48 | 64.26 | 56.67 | 51.91 | 88.88 | 71.95 | 66.78 |
| SPG + PV-RCNN | 92.53 | 85.31 | 82.82 | 69.66 | 61.80 | 56.39 | 91.75 | 74.35 | 69.49 |
| Improvement | +0.43 | +0.95 | +0.34 | +5.40 | +5.13 | +4.48 | +2.87 | +2.40 | +2.71 |

Table 16: Result comparisons on the KITTI validation set. The results are evaluated by the Average Precision with 40 recall positions. The baseline detectors, PointPillars and PV-RCNN, are directly evaluated by using the checkpoints released by [shi2020pv, openpcdet2020].

BEV AP (R40) for Car, Pedestrian, and Cyclist.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| PointPillars [lang2019pointpillars] | 92.03 | 88.05 | 86.66 | 61.59 | 56.01 | 52.04 | 85.27 | 66.34 | 62.36 |
| SPG + PointPillars | 94.38 | 89.92 | 87.97 | 65.38 | 59.48 | 55.32 | 90.29 | 71.43 | 66.96 |
| Improvement | +2.35 | +1.87 | +1.31 | +3.79 | +3.47 | +3.28 | +5.02 | +5.09 | +4.60 |
| PV-RCNN [shi2020pv] | 93.02 | 90.33 | 88.53 | 67.97 | 60.52 | 55.80 | 91.02 | 74.54 | 69.92 |
| SPG + PV-RCNN | 94.99 | 91.11 | 88.86 | 71.79 | 64.50 | 59.51 | 93.62 | 76.45 | 71.64 |
| Improvement | +1.97 | +0.78 | +0.33 | +3.82 | +3.98 | +3.71 | +2.60 | +1.91 | +1.72 |

Table 17: Result comparisons on the KITTI validation set. The results are evaluated by the Average Precision with 40 recall positions. The baseline detectors, PointPillars and PV-RCNN, are directly evaluated by using the checkpoints released by [shi2020pv, openpcdet2020].

Car - 3D AP

| Method | Reference | Modality | Easy | Mod. | Hard | Avg. |
|---|---|---|---|---|---|---|
| F-PointNet [qi2018frustum] | CVPR 2018 | LiDAR & RGB | 82.19 | 69.79 | 60.59 | 70.86 |
| AVOD-FPN [ku2018joint] | IROS 2018 | LiDAR & RGB | 83.07 | 71.76 | 65.73 | 73.52 |
| F-ConvNet [wang2019frustum] | IROS 2019 | LiDAR & RGB | 87.36 | 76.39 | 66.69 | 76.81 |
| UberATG-MMF [liang2019multi] | CVPR 2019 | LiDAR & RGB | 88.40 | 77.43 | 70.22 | 78.68 |
| EPNet [huang2020epnet] | ECCV 2020 | LiDAR & RGB | 89.81 | 79.28 | 74.59 | 81.23 |
| CLOCs_PVCas [pang2020clocs] | IROS 2020 | LiDAR & RGB | 88.94 | 80.67 | 77.15 | 82.25 |
| 3D-CVF [yoo20203d] | ECCV 2020 | LiDAR & RGB | 89.20 | 80.05 | 73.11 | 80.79 |
| SECOND [yan2018second] | Sensors 2018 | LiDAR | 83.34 | 72.55 | 65.82 | 73.90 |
| PointPillars [lang2019pointpillars] | CVPR 2019 | LiDAR | 82.58 | 74.31 | 68.99 | 75.30 |
| PointRCNN [shi2019pointrcnn] | CVPR 2019 | LiDAR | 86.96 | 76.50 | 71.39 | 77.77 |
| 3D IoU Loss [zhou2019iou] | 3DV 2019 | LiDAR | 86.16 | 75.64 | 70.70 | 78.28 |
| Fast Point R-CNN [chen2019fast] | ICCV 2019 | LiDAR | 85.29 | 77.40 | 70.24 | 77.64 |
| STD [yang2019std] | ICCV 2019 | LiDAR | 87.95 | 79.71 | 75.09 | 80.91 |
| SegVoxelNet [yi2020segvoxelnet] | ICRA 2020 | LiDAR | 86.04 | 76.13 | 70.76 | 77.64 |
| SARPNET [ye2020sarpnet] | Neurocomputing 2019 | LiDAR | 85.63 | 76.64 | 71.31 | 77.86 |
| HRI-VoxelFPN [yi2020segvoxelnet] | Sensors 2020 | LiDAR | 85.63 | 76.70 | 69.44 | 77.26 |
| HotSpotNet [chen2020object] | ECCV 2020 | LiDAR | 87.60 | 78.31 | 73.34 | 79.75 |
| Part-A2 [9018080] | TPAMI 2020 | LiDAR | 87.81 | 78.49 | 73.51 | 79.94 |
| SERCNN [Zhou_2020_CVPR] | CVPR 2020 | LiDAR | 87.74 | 78.96 | 74.14 | 51.03 |
| Point-GNN [shi2020point] | CVPR 2020 | LiDAR | 88.33 | 79.47 | 72.29 | 80.03 |
| 3DSSD [yang20203dssd] | CVPR 2020 | LiDAR | 88.36 | 79.57 | 74.55 | 80.83 |
| SA-SSD [he2020sassd] | CVPR 2020 | LiDAR | 88.75 | 79.79 | 74.16 | 80.90 |
| CIA-SSD [zheng2020ciassd] | AAAI 2021 | LiDAR | 89.59 | 80.28 | 72.87 | 80.91 |
| Asso-3Ddet [du2020associate] | CVPR 2020 | LiDAR | 85.99 | 77.40 | 70.53 | 77.97 |
| Voxel R-CNN [deng2020voxel] | AAAI 2021 | LiDAR | 90.90 | 81.62 | 77.06 | 83.19 |
| PV-RCNN [shi2020pv] | CVPR 2020 | LiDAR | 90.25 | 81.43 | 76.82 | 82.83 |
| SPG + PV-RCNN (Ours) | - | LiDAR | 90.49 | 82.13 | 78.88 | 83.83 |

Table 18: Car detection result comparisons on the KITTI test set. The results are evaluated by the Average Precision with 40 recall positions on the KITTI benchmark website. We compare with the front-runner detectors on the leaderboard that are associated with conferences or journals published before our submission. The Avg. AP is calculated by averaging the APs of the Easy, Mod., and Hard difficulty levels.

Appendix F More Visualization of Semantic Point Generation

In Figure 8, we illustrate more augmented point clouds, where the raw points are rendered in the grey color and the generated semantic points are highlighted in red.

Figure 8: More visualization of generated semantic points. The grey points are original raw points. The red points are the generated semantic points. The green boxes are the predicted bounding boxes.