1st Place Solution for Waymo Open Dataset Challenge – 3D Detection and Domain Adaptation

06/28/2020 ∙ by Zhuangzhuang Ding, et al.

In this technical report, we introduce our winning solution "HorizonLiDAR3D" for the 3D detection track and the domain adaptation track in the Waymo Open Dataset Challenge at CVPR 2020. Many existing 3D object detectors rely on prior-based anchor box designs to account for the different scales, aspect ratios and classes of objects, which limits their ability to generalize to a different dataset or domain and requires post-processing (e.g. Non-Maximum Suppression (NMS)). We propose a one-stage, anchor-free and NMS-free 3D point cloud object detector, AFDet, which uses object key-points to encode the 3D attributes and learns point cloud object detection end to end, without the need to hand-engineer or learn anchors. AFDet serves as a strong baseline in our winning solution, and significant improvements are made over this baseline during the challenges. Specifically, we design stronger networks and enhance the point cloud data using densification and point painting. To leverage camera information, we append/paint additional attributes to each point by projecting it to camera space and gathering image-based perception information. The final detection performance also benefits from model ensemble and Test-Time Augmentation (TTA) in both the 3D detection track and the domain adaptation track. Our solution achieves 1st place with 77.11 mAPH/L2 in the 3D detection track and 69.49 mAPH/L2 in the domain adaptation track.




1 Introduction to the Challenge

The Waymo Open Dataset Challenges at CVPR 2020, which ran from March 19, 2020 to May 31, 2020, are the largest and most exciting competition on challenging perception tasks in autonomous driving. The Waymo Open Dataset [15] was recently released with high-quality data collected from both LiDAR and camera sensors in real self-driving scenarios, and it enables much new and exciting research. In the 3D detection track, the challenge requires the algorithm to detect objects as a set of 3D bounding boxes. The domain adaptation track is similar to the 3D detection track except that the data was collected from a different location.

Figure 2: The framework of the anchor-free one-stage 3D detection (AFDet) system. The whole pipeline consists of the Point Cloud Encoder, the 3D Feature Extractor and Region Proposal Network (RPN), and the Anchor-free Detector. The numbers in the brackets indicate the output channels of the last convolution layer. C is the number of categories used in the detection.

2 Network

In this section, we present the details of our 3D detector in the challenge. We use our AFDet [2], which is one-stage, anchor-free and NMS-free, as a strong baseline 3D point cloud detector. Significant improvements were made to this baseline during the challenges. Our proposed AFDet consists of four modules: the point cloud encoder, the 3D Feature Extractor, the Region Proposal Network (RPN), and the anchor-free detector. In this section we mainly describe the differences between the original AFDet and the version we used.

2.1 Point Cloud Encoder

We use a simple point cloud encoder [22] to voxelize the point cloud. Specifically, we use a fixed grid size along each of the x, y and z axes to convert the raw point cloud into a voxel representation. In each voxel we calculate the mean of all points inside it and feed the result to the 3D feature extractor. The resulting feature map is then reshaped to form a top-down view pseudo image, as illustrated in Figure 3.
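The mean-per-voxel encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the grid size and the range origin passed in are hypothetical, and only the "average all points in a voxel" step is taken from the text.

```python
import numpy as np

def mean_voxel_encode(points, grid_size, range_min):
    """Average all points that fall into the same voxel.

    points:    (N, D) array whose first three columns are x, y, z.
    grid_size: (3,) voxel edge lengths (hypothetical values).
    range_min: (3,) lower corner of the detection range (hypothetical).
    Returns integer voxel indices (M, 3) and per-voxel mean features (M, D).
    """
    points = np.asarray(points, dtype=np.float64)
    coords = np.floor((points[:, :3] - range_min) / grid_size).astype(np.int64)
    # Group points by voxel index; `inverse` maps each point to its voxel row.
    voxel_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((len(voxel_coords), points.shape[1]))
    np.add.at(sums, inverse, points)  # unbuffered scatter-add of point features
    counts = np.bincount(inverse, minlength=len(voxel_coords))
    return voxel_coords, sums / counts[:, None]
```

The occupied voxels and their mean features would then be passed to a sparse 3D feature extractor; voxels without points are simply never created.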

Figure 3: Point cloud encoding process from voxelization to pseudo image. 3D Submanifold Sparse Conv and 3D Sparse Conv are used in the 3D Feature Extractor.
Figure 4: Illustration and parameters of (a) B1, (b) B2 and (c) B3. Rectangles with different colors represent different operations. Circles represent feature maps. T. Conv is short for transposed convolution. The feature map size given for each output is its size at the inference stage. Better viewed in color and zoomed in for details.

2.2 3D Feature Extractor and RPN

We introduce two different 3D Feature Extractors, FE-v1 and FE-v2, and three different RPNs [18, 21], RPN-v1, RPN-v2 and RPN-v3. Their combinations B1, B2 and B3 are used in the final submissions during the challenge.

SpMiddleFHD from [22] is employed as the 3D Feature Extractor in FE-v1. It contains four phases, and each phase contains several submanifold convolutional layers [18] and one sparse convolutional layer [18] to perform downsampling along the z-axis. The feature map is downsampled 4× after FE-v1.

For the 3D Feature Extractor in FE-v2, we use an extractor similar to [5] with the auxiliary branch removed. It has four sparse 3D convolution blocks, each of which is composed of submanifold convolutions with a kernel size of 3. The last three blocks use an additional sparse convolution with a stride of 2 to downsample the feature map. The feature map is then reshaped to obtain the Bird's-Eye View (BEV) representation. Eight standard 3×3 convolutions are applied to further abstract the features. The final downsample factor for FE-v2 is 8.

The RPN has several downsample and upsample blocks. In RPN-v1, we use the same structure as in [8, 18]. RPN-v1 has 3 cascaded downsample blocks, each of which downsamples the feature map by 2×, so the outputs of the three downsample blocks are downsampled by 2×, 4× and 8× respectively. Three upsample blocks are then applied separately to upsample these feature maps by 1×, 2× and 4×. We concatenate the results to get the output of RPN-v1, so the final feature map of RPN-v1 is downsampled 2×.

The RPN-v2 shares the same structure with RPN-v1 except for two differences. The first downsample block uses a downsample factor of 1 instead of 2, and thus the final downsample factor is 1 for RPN-v2. The other difference is that the channels of each block are doubled compared with RPN-v1.

RPN-v3 further improves on RPN-v1. Based on RPN-v1, the first downsample block uses a factor of 1 instead of 2. We then reduce the number of filters in each layer to 3/4 of the original (i.e. from 128 to 96 and from 256 to 192). Inspired by [16], we add a simplified version of BiFPN with 4 repeating blocks after the downsample blocks in RPN-v3. Unlike the original BiFPN [16], which uses a different number of filters for each repeat, we use three separate convolutional layers to produce the same number of filters (96) for the output of each downsample block, and then repeat the block four times with that same number of filters. Our simplified BiFPN does not use the weighted fusion mechanism of [16]. The final downsample factor of RPN-v3 is 1.

We use three different networks, B1, B2 and B3, during the challenge. The 3D Feature Extractor and RPN combinations of B1, B2 and B3 can be found in Table 1. Their parameters and detailed illustration are shown in Figure 4. In the 3D Feature Extractor and RPN, each convolution is followed by batch normalization and a ReLU non-linearity.

Model | 3D Feature Extractor | RPN    | Downsample Factor
B1    | FE-v1                | RPN-v1 | 8
B2    | FE-v2                | RPN-v2 | 8
B3    | FE-v1                | RPN-v3 | 4

Table 1: 3D Feature Extractor and RPN combinations of B1, B2 and B3.
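As a quick consistency check, the overall factors in Table 1 follow from multiplying the feature extractor factor by the RPN factor stated in the text. The short sketch below only encodes those stated numbers; the dictionary and function names are illustrative.

```python
# Downsample factors as stated in Section 2.2.
FE_FACTOR = {"FE-v1": 4, "FE-v2": 8}
RPN_FACTOR = {"RPN-v1": 2, "RPN-v2": 1, "RPN-v3": 1}

# Combinations listed in Table 1.
BACKBONES = {
    "B1": ("FE-v1", "RPN-v1"),
    "B2": ("FE-v2", "RPN-v2"),
    "B3": ("FE-v1", "RPN-v3"),
}

def total_downsample(name):
    """Overall BEV downsample factor of a backbone: extractor factor times RPN factor."""
    fe, rpn = BACKBONES[name]
    return FE_FACTOR[fe] * RPN_FACTOR[rpn]

for name in BACKBONES:
    print(name, total_downsample(name))
```

B3 therefore runs its detection heads on a feature map with twice the BEV resolution of B1 and B2.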

2.3 Anchor Free Base Detector

We briefly introduce the anchor-free base detector in this report. For more details, please refer to [2]. AFDet [2] consists of five sub-heads: the keypoint heatmap head, the local offset head, the z-axis location head, the 3D object size head, and the orientation head. Figure 2 shows the details of the anchor-free detector.

Five Sub-Heads. The heatmap head and the offset head predict a keypoint heatmap and a local offset regression map respectively, with one heatmap channel per keypoint type. The keypoint heatmap helps to locate the object center (x-y coordinates) in BEV. The offset regression map compensates for the discretization error due to voxelization and helps to recover more accurate object locations in BEV. The z-axis location head regresses the z-axis values. Additionally, we regress the object sizes directly. For orientation prediction, we use the MultiBin method following [12, 20].

Loss. The heatmap head is trained with the modified focal loss [9]. For the orientation prediction head, the classification part is trained with softmax while the offset part is trained with L1 loss. L1 loss is also employed in training the local offset head, the z-axis location head, and the 3D object size head. The overall training objective is

$$\mathcal{L} = \mathcal{L}_{heat} + \lambda_{off}\mathcal{L}_{off} + \lambda_{z}\mathcal{L}_{z} + \lambda_{size}\mathcal{L}_{size} + \lambda_{ori}\mathcal{L}_{ori},$$

where $\lambda$ represents the weight for each sub-task. For all regression sub-tasks, including local offset, z-axis location, size and orientation, we only regress objects which are in the detection range.

Gather Indices and Decode. From the resultant keypoint heatmap, we can easily decode the center of each object, along with the local offset correcting the discretization error. Other information such as orientation and dimension can be obtained from the regression results [2].
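The decoding step can be illustrated with a small sketch. It assumes the peak locations have already been found, that feature-map columns map to x and rows to y, and that `cell_size` (voxel size times the downsample factor) and `range_min` are known; all of these names and conventions are assumptions made for illustration, not details confirmed by the report.

```python
import numpy as np

def decode_centers(peaks, offset_map, cell_size, range_min):
    """Recover metric BEV centers from keypoint indices plus predicted offsets.

    peaks:      (K, 2) integer (row, col) keypoint locations in the BEV map.
    offset_map: (2, H, W) predicted sub-cell offsets (dx, dy), in cell units.
    cell_size:  metres per feature-map cell (hypothetical value).
    range_min:  (x_min, y_min) of the detection range.
    """
    rows, cols = peaks[:, 0], peaks[:, 1]
    dx = offset_map[0, rows, cols]  # gather the offsets at the keypoint indices
    dy = offset_map[1, rows, cols]
    x = (cols + dx) * cell_size + range_min[0]
    y = (rows + dy) * cell_size + range_min[1]
    return np.stack([x, y], axis=1)
```

The same gathered indices would be used to read the z-axis location, size and orientation maps at each detected keypoint.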

3 Methods

The network architecture has been explained in detail in the previous section. In this section, we first introduce our input format and data augmentation strategies. Then we explain a novel painting method using 2D detection, test time augmentation, and model ensemble.

3.1 Input and Data Augmentation

As in [1], we accumulate LiDAR sweeps to utilize temporal information and to densify the LiDAR point cloud. To distinguish points from different sweeps, the time difference is attached to each point as an additional attribute. In the Waymo challenge, we use the past four frames combined with the current frame as our input point cloud. A detailed ablation study can be found in Table 2.

Waymo Open Dataset [15] provides sweeps from five LiDARs. To fully utilize this information, we combine the first and second returns of the point clouds generated by all five LiDARs. Considering frame densification and painting, each input point is extended with the time difference mentioned above and with the painted attributes.
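The densification step can be sketched as below. The sketch assumes each past sweep has already been transformed into the current frame's coordinate system (ego-motion compensation is not shown) and that timestamps are given in seconds; the function name and argument layout are hypothetical.

```python
import numpy as np

def accumulate_sweeps(sweeps, timestamps, current_time):
    """Concatenate several sweeps and append a time-difference column.

    sweeps:       list of (N_i, D) arrays, assumed already in the current frame.
    timestamps:   capture time of each sweep.
    current_time: capture time of the current frame.
    Returns an (sum N_i, D + 1) array; the last column is current_time - t.
    """
    merged = []
    for pts, t in zip(sweeps, timestamps):
        dt = np.full((pts.shape[0], 1), current_time - t, dtype=pts.dtype)
        merged.append(np.hstack([pts, dt]))
    return np.vstack(merged)
```

Points from the current frame get a time difference of zero, so the network can tell fresh returns from accumulated ones.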

We use a data augmentation strategy following [18, 8]. First, we generate an annotation database containing labels and the associated point cloud data. During training, we randomly select 6, 8 and 10 ground truth samples for vehicle, pedestrian and cyclist respectively, and place them into the current frame. Second, we apply random flipping along one axis [19], together with global rotation, global scaling and global translation along one axis, each drawn from a fixed range [21, 18, 8].
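A minimal sketch of the ground-truth sampling step is shown below. The per-class counts come from the text; the database layout (a dict from class name to a list of point arrays) is a hypothetical simplification, and the collision check against existing boxes that practical implementations usually perform is omitted.

```python
import numpy as np

# Number of pasted samples per class, as stated in the text.
SAMPLES_PER_CLASS = {"vehicle": 6, "pedestrian": 8, "cyclist": 10}

def paste_gt_samples(frame_points, database, rng):
    """Paste randomly chosen database objects into the current frame.

    frame_points: (N, D) points of the current frame.
    database:     dict class name -> list of (M_i, D) arrays (hypothetical layout).
    rng:          numpy.random.Generator used for sampling.
    Returns the augmented point array and the number of samples added per class.
    """
    pasted = [frame_points]
    added = {}
    for cls, wanted in SAMPLES_PER_CLASS.items():
        pool = database.get(cls, [])
        k = min(wanted, len(pool))  # cannot draw more samples than the pool holds
        chosen = rng.choice(len(pool), size=k, replace=False) if k else []
        pasted.extend(pool[i] for i in chosen)
        added[cls] = k
    return np.vstack(pasted), added
```

In a full pipeline the labels of the pasted objects would be appended to the frame's annotations in the same step.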

Figure 5: Point cloud painting using 2D detections and semantic segmentation labels

3.2 Painting

Inspired by PointPainting [17], we leverage camera information by projecting each point to the image space and gathering its image-based perception information (i.e. 2D detections and semantic segmentation labels) which will be appended/painted to the point, as illustrated in Figure 5.

First we use 2D bounding boxes to paint the points. Specifically, points are projected from the vehicle coordinate to the corresponding image frame. If a certain point falls inside a 2D bounding box, the classification information will be appended to the point as extra attributes. Otherwise, zeros will be filled in additional channels. The training set is painted with ground truth 2D bounding boxes, while during inference painting uses 2D detections produced by a single Cascade R-CNN based detector used in our 2D detection track.

Second, we use semantic segmentation information to paint point clouds. We start with a WideResNet38 pretrained on ImageNet [7] and finetune the network on KITTI [3]. Only around half of the points in a frame can be painted because Waymo Open Dataset only provides images from the front and side cameras. The improvements from introducing point painting can be seen in Table 2.
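The 2D-box painting described in this section can be sketched as follows. The sketch assumes a single 3×4 projection matrix `proj` that maps homogeneous vehicle-frame points straight to pixel coordinates, which is a hypothetical simplification (a real pipeline would use the dataset's camera calibration), and it paints with one-hot class channels as one plausible encoding of the "classification information".

```python
import numpy as np

def paint_with_boxes(points, proj, boxes, labels, num_classes, image_size):
    """Append one-hot class channels to points that project into a 2D box.

    points:     (N, D) array, first three columns are x, y, z.
    proj:       (3, 4) projection matrix to pixel coordinates (hypothetical).
    boxes:      (K, 4) 2D boxes as x1, y1, x2, y2 in pixels.
    labels:     (K,) integer class ids in [0, num_classes).
    image_size: (width, height) of the image.
    """
    n = points.shape[0]
    homo = np.hstack([points[:, :3], np.ones((n, 1))])
    uvw = homo @ proj.T
    depth = uvw[:, 2]
    in_front = depth > 1e-6  # points behind the camera are never painted
    uv = np.zeros((n, 2))
    uv[in_front] = uvw[in_front, :2] / depth[in_front, None]
    w, h = image_size
    visible = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    paint = np.zeros((n, num_classes))  # zeros stay for points outside every box
    for (x1, y1, x2, y2), lab in zip(boxes, labels):
        inside = visible & (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
        paint[inside, lab] = 1.0
    return np.hstack([points, paint])
```

Points that fall in overlapping boxes of different classes receive several active channels in this sketch; how the original system resolves that case is not stated in the report.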

Models   | Frames | Painting: 2D Box | Painting: Seg | L2 mAP | L2 mAPH
AFDet-B1 |        |                  |               | 60.87  | 56.50
AFDet-B1 |        |                  |               | 65.71  | 62.22
AFDet-B1 |        |                  |               | 67.88  | 64.77
AFDet-B1 |        |                  |               | 68.03  | 65.02

Table 2: 3D detection results of single model on validation set of the 3D detection track. Models are trained on 1/20 of training set and tested on 1/10 of validation set.
Figure 6: Point cloud enhancement. The top left image shows the raw point cloud. The top right image shows point cloud densified with [-4, +0] frames. The bottom left image shows point cloud painted with 2D detection boxes. The bottom right image shows point cloud painted with semantic segmentation. Note that only half of the points are painted in two bottom images because rear-view images are unavailable.

3.3 Further Improvements

Test Time Augmentation (TTA). We perform several different test-time augmentations, including point cloud rotation around the pitch, roll and yaw axes, global scaling of the point cloud, and translation of the point cloud along one axis, similar to the data augmentation used in training. We adapted Weighted Boxes Fusion (WBF) [14] to a 3D variant to merge the different groups of detections into the final result. We use a fixed set of angles for both pitch and roll rotation, seven angles for yaw rotation, [0.95, 1.05] for global scaling, and a fixed set of offsets for translation. Our experiments show that yaw rotation is the most effective of all the augmentation methods. Therefore we use it in combination with the other augmentation methods.
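Yaw-rotation TTA can be sketched as below: rotate the input, run the detector, then map the detections back to the original frame before fusion. The `detector` callable, the 7-value box layout `[x, y, z, l, w, h, yaw]` and the angle list are assumptions made for illustration; the fusion step itself (3D WBF) is not shown.

```python
import numpy as np

def rotate_yaw(xyz, angle):
    """Rotate (N, 3) coordinates around the vertical axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return xyz @ rot.T

def detect_with_yaw_tta(points, detector, angles):
    """Run `detector` on yaw-rotated copies of the input and undo the rotation.

    detector: hypothetical callable, points -> (K, 7) boxes [x, y, z, l, w, h, yaw].
    Returns one group of boxes per angle, all in the original frame.
    """
    groups = []
    for angle in angles:
        rotated = points.copy()
        rotated[:, :3] = rotate_yaw(points[:, :3], angle)
        boxes = detector(rotated).copy()
        boxes[:, :3] = rotate_yaw(boxes[:, :3], -angle)  # centers back to the original frame
        boxes[:, 6] -= angle                              # headings back to the original frame
        groups.append(boxes)
    return groups
```

Each returned group would then be treated as one "model" by the box fusion step.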

Ensemble Method and Naive Grid Search. As mentioned above, we use a 3D version of WBF [14] to ensemble different models with test time augmentation. Equal weights are assigned to AFDet-B1, AFDet-B2 and AFDet-B3. We replace 3D IoU with BEV IoU to speed up the ensembling process. A naive grid search is conducted to select the best thresholds used in the ensembling. We search three variables: the box-association IoU threshold, the pre-WBF box-skipping threshold and the post-WBF box-skipping threshold. Each threshold has its own search space, and the search intervals for the three thresholds are 0.05, 0.05 and 0.01 respectively. The post-WBF threshold is searched from 0.01 instead of 0.00 to accelerate the search process. All grid searches are performed on 1/10 of the validation set. We use different thresholds for different classes, as shown in Table 3.
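The naive grid search amounts to an exhaustive loop over the three thresholds. The sketch below assumes a hypothetical `evaluate(iou, pre, post)` callable that runs the ensembling with the given thresholds and returns a validation score such as mAPH; the search ranges used in the usage note are placeholders, since only the step sizes are given in the text.

```python
import itertools

def frange(start, stop, step):
    """Inclusive float range built from integer steps to avoid accumulated drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

def grid_search(evaluate, iou_grid, pre_grid, post_grid):
    """Return (best_score, iou, pre, post) over the Cartesian product of the grids."""
    best = None
    for iou, pre, post in itertools.product(iou_grid, pre_grid, post_grid):
        score = evaluate(iou, pre, post)
        if best is None or score > best[0]:
            best = (score, iou, pre, post)
    return best
```

For example, `grid_search(evaluate, frange(0.5, 0.9, 0.05), frange(0.05, 0.3, 0.05), frange(0.01, 0.1, 0.01))` would try every combination once; running it separately per class yields class-specific thresholds.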

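Dataset           | Threshold            | Vehicle | Pedestrian | Cyclist
3D Detection      | Box-association IoU  | 0.80    | 0.70       | 0.65
3D Detection      | Pre-WBF box skipping | 0.10    | 0.15       | 0.25
3D Detection      | Post-WBF box skipping| 0.03    | 0.03       | 0.03
Domain Adaptation | Box-association IoU  | 0.80    | 0.70       | 0.65
Domain Adaptation | Pre-WBF box skipping | 0.10    | 0.15       | 0.25
Domain Adaptation | Post-WBF box skipping| 0.05    | 0.05       | 0.05

Table 3: Thresholds used in the final submission: the box-association IoU threshold, the pre-WBF box-skipping threshold and the post-WBF box-skipping threshold.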

Models       | Test Time Augmentation | Naive Grid Search | L2 mAP | L2 mAPH
AFDet-B1     |                        |                   | 68.03  | 65.02
AFDet-B2     |                        |                   | 67.05  | 64.29
AFDet-B3     |                        |                   | 68.91  | 65.24
B1 + B2      |                        |                   | 67.97  | 65.71
B1 + B3      |                        |                   | 69.72  | 67.16
B1 + B2 + B3 |                        |                   | 71.08  | 68.96
B1 + B2 + B3 |                        |                   | 73.65  | 72.14
B1 + B2 + B3 |                        |                   | 75.21  | 73.63

Table 4: 3D detection ensemble results on validation set of the 3D detection track. Models are trained on 1/20 of training set and tested on 1/10 of validation set.

Method                | Vehicle L2 mAP | Vehicle L2 mAPH | Pedestrian L2 mAP | Pedestrian L2 mAPH | Cyclist L2 mAP | Cyclist L2 mAPH | All L2 mAP | All L2 mAPH
HorizonLiDAR3D (Ours) | 78.23 | 77.83 | 79.32 | 76.50 | 77.91 | 76.98 | 78.49 | 77.11
PV-RCNN               | 73.69 | 72.23 | 73.98 | 70.16 | 72.38 | 71.16 | 73.35 | 71.52
TS-LidarDet           | 72.65 | 72.12 | 68.10 | 59.32 | 66.55 | 65.16 | 69.10 | 65.53
Simple Baseline v2    | 66.44 | 65.91 | 66.00 | 60.93 | 65.13 | 64.10 | 65.84 | 63.65
Det3D-Waymo-3D-FS-VS  | 66.03 | 65.11 | 66.38 | 57.83 | 67.60 | 66.18 | 66.67 | 63.04
Table 5: Top five submissions of Waymo Open Dataset Challenge on 3D detection track.

Method                | Vehicle L2 mAP | Vehicle L2 mAPH | Pedestrian L2 mAP | Pedestrian L2 mAPH | Cyclist L2 mAP | Cyclist L2 mAPH | All L2 mAP | All L2 mAPH | Domain gap L2 mAP | Domain gap L2 mAPH
HorizonLiDAR3D (Ours) | 60.39 | 60.04 | 64.00 | 61.63 | 87.55 | 86.79 | 70.65 | 69.49 | 7.84  | 7.62
PV-RCNN-DA            | 59.67 | 59.08 | 48.27 | 45.74 | 28.31 | 27.50 | 45.42 | 44.11 | 27.93 | 27.41
Simple Baseline v2    | 48.47 | 47.92 | 46.13 | 43.21 | 18.84 | 18.71 | 37.81 | 36.62 | 28.03 | 27.02
Det3D-Waymo-DA        | 50.07 | 49.33 | 43.51 | 37.90 | 4.75  | 4.66  | 32.78 | 30.63 | 33.89 | 32.41
Simple Baseline       | 45.50 | 44.77 | 45.99 | 43.12 | 1.53  | 1.48  | 31.01 | 29.79 | -     | -
Table 6: Top five submissions of the domain adaptation track. Domain gap is the difference between 3D detection track and domain adaptation track.

4 Experiment Settings

We implemented our detector based on the Det3D framework [22]. To reduce GPU memory usage, we train our detector within the range [(-76.8, 76.8), (-51.2, 51.2), (-1, 3)] with respect to the x, y and z axes, while enlarging the detection range to [(-80, 80), (-80, 80), (-1, 3)] at inference. For all of our models, we cap the maximum number of objects, the maximum number of points per voxel and the maximum number of voxels. For the offset regression head, we regress a square area around each object center by default. At the inference stage, we use max pooling with stride 1 and apply an AND operation between the feature map before and after the max pooling to get the peaks of the keypoint heatmaps. Therefore there is no need to suppress overlapping detections. The sub-losses are weighted individually. All parameters listed here are default values unless explicitly stated otherwise.
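The max-pooling peak extraction can be sketched in NumPy as below. It interprets the "AND operation" as keeping the cells whose value is unchanged by the stride-1 max pooling, which is the common way this trick is implemented; the kernel size of 3 and the score threshold are hypothetical defaults, as the report's values are not available here.

```python
import numpy as np

def heatmap_peaks(heatmap, kernel=3, score_thresh=0.1):
    """Return (row, col) indices of local maxima in a single-class heatmap.

    heatmap:      (H, W) float array of keypoint scores.
    kernel:       odd max-pooling window size (hypothetical default).
    score_thresh: minimum score for a peak to be kept (hypothetical default).
    """
    pad = kernel // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Stride-1 max pooling: every output cell is the max over its neighbourhood.
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kernel, kernel))
    pooled = windows.max(axis=(2, 3))
    # A cell is a peak if pooling left it unchanged and it is confident enough.
    keep = (pooled == heatmap) & (heatmap > score_thresh)
    return np.argwhere(keep)
```

Because only one cell per neighbourhood can survive, overlapping detections of the same object are filtered without a separate NMS pass.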

We use the AdamW [10] optimizer with the one-cycle policy [4]. To achieve convergence, we use a fixed maximum learning rate, a division factor of 2, momentum ranging from 0.95 to 0.85, and a fixed weight decay of 0.01. To quickly verify our ideas, we sample the training set every 20 frames and the validation set every 10 frames according to their timestamps. Unless explicitly indicated otherwise, all models used for ablation studies, including those in Table 2 and Table 4, are trained on 1/20 of the training data and tested on 1/10 of the validation data. In the final submission, for AFDet-B1 and AFDet-B2 we first train for 60 epochs on 1/20 of the training data and then finetune for 10 epochs on the whole trainval data. For AFDet-B3, we train for 60 epochs on 1/20 of the training data and finetune for only 3 epochs on the whole trainval data. Besides, we adopt half-precision (FP16) [11, 13] for the B3 model to reduce GPU memory usage and speed up training. In our final submission, we ensemble AFDet-B1, AFDet-B2 and AFDet-B3 for 3D detection, and AFDet-B1 and AFDet-B2 for domain adaptation.
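To illustrate the one-cycle policy, here is a small standalone schedule. The division factor of 2 and the momentum range 0.95 to 0.85 come from the text; the cosine annealing shape, the 40% warm-up fraction, the final learning-rate divisor and the maximum learning rate in the usage note are assumptions chosen for illustration, not the report's settings.

```python
import math

def cosine_interp(start, end, frac):
    """Cosine interpolation from `start` (frac=0) to `end` (frac=1)."""
    return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

def one_cycle(step, total_steps, lr_max, div_factor=2.0, moms=(0.95, 0.85),
              pct_start=0.4, final_div=1e4):
    """Return (learning_rate, momentum) for a given training step.

    The learning rate rises from lr_max / div_factor to lr_max while momentum
    falls from moms[0] to moms[1]; both then anneal back in the second phase.
    pct_start and final_div are assumed values.
    """
    warm = int(total_steps * pct_start)
    lr_start = lr_max / div_factor
    if step < warm:
        frac = step / warm
        return cosine_interp(lr_start, lr_max, frac), cosine_interp(moms[0], moms[1], frac)
    frac = (step - warm) / max(1, total_steps - 1 - warm)
    return cosine_interp(lr_max, lr_start / final_div, frac), cosine_interp(moms[1], moms[0], frac)
```

Calling `one_cycle(step, 100, 3e-3)` for `step` in `range(100)` traces the full cycle; with AdamW the momentum value would typically be applied to the first beta coefficient.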

Class      | 3D training Num | 3D training Per | 3D validation Num | 3D validation Per | Domain training Num | Domain training Per | Domain validation Num | Domain validation Per
Object     | 6.3M | 100%   | 1.6M  | 100%   | 260K  | 100%   | 45.2K | 100%
Vehicle    | 4.2M | 67.15% | 1.1M  | 68.42% | 242K  | 93.15% | 43K   | 95.11%
Pedestrian | 2.0M | 32.07% | 0.5M  | 30.80% | 17.6K | 6.78%  | 2.2K  | 4.89%
Cyclist    | 49K  | 0.78%  | 12.3K | 0.78%  | 184   | 0.07%  | 0     | 0

Table 7: Number of objects and the percentage of each class in the 3D detection and domain adaptation tracks.
Figure 7: Examples of results on the testing set of the 3D detection track. Only bounding boxes with a score larger than 0.15 are visualized.

5 Results

3D Detection. To study the effect of each module used in our solution, we conduct ablation experiments on 1/10 of the validation set, as shown in Table 2 and Table 4. Specifically, Table 2 demonstrates the contributions made by densification and point painting, while Table 4 shows the improvements introduced by ensembling the networks and using TTA and WBF. The final results on the official 3D detection leaderboard are shown in Table 5. As can be seen from Table 5, our method outperformed the 2nd place solution by 5.59 and the 3rd place solution by 11.58 in terms of mAPH L2.

Domain Adaptation. Waymo Open Dataset [15] also provides additional data collected from different locations, with only a small subset annotated, to test robustness against domain diversity. We directly apply our models trained for the 3D detection track to the domain adaptation data. For the final submission, AFDet-B1 and AFDet-B2 were applied with test time augmentation and the results were merged by WBF. The three WBF thresholds for our submission are listed in Table 3. As shown in Table 6, our solution outperforms the other methods in domain adaptation by a large margin.

This result demonstrates the benefits of using an anchor-free model, the better generalizability of our method and its robustness against domain diversity. Another reason might be that most other methods finetuned their models using the annotated domain adaptation data, which is highly imbalanced. Table 7 compares the statistics computed from the 3D detection and domain adaptation datasets. For example, only 0.07% of the objects in the domain adaptation training set are cyclists, which is about 10× less than the 0.78% in the 3D detection training set. Fine-tuning on the domain adaptation dataset might compromise the performance of the resulting detector, especially for rare classes such as cyclist. Leveraging image information may also have contributed to the robustness of our solution against domain diversity.

6 Conclusion

We proposed a state-of-the-art 3D detection framework that achieved 1st place on both the 3D detection track and the domain adaptation track in the Waymo Open Dataset Challenges at CVPR 2020.