One Million Scenes for Autonomous Driving: ONCE Dataset

06/21/2021, by Jiageng Mao et al., HUAWEI Technologies Co., Ltd.

Current perception models in autonomous driving are notorious for their heavy reliance on large amounts of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the path toward next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community generally suffers from a shortage of such essential real-world scene data, which hampers the exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving datasets available (e.g. nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses of those methods and provide valuable observations on their performance relative to the scale of the data used. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.


1 Introduction

Autonomous driving is a promising technology that has the potential to ease the drivers’ burden and save human lives from accidents. In autonomous driving systems, 3D object detection is a crucial technique that can identify and localize the vehicles and humans surrounding the self-driving vehicle, given 3D point clouds from LiDAR sensors and 2D images from cameras as input. Recent advances Caesar et al. (2020); Sun et al. (2020) in 3D object detection demonstrate that large-scale and diverse scene data can significantly improve the perception accuracy of 3D detectors.

Figure 1: Images and point clouds sampled from the ONCE (One millioN sCenEs) dataset. Our ONCE dataset covers a variety of geographical locations, time periods and weather conditions.

Unlike other image-based datasets (e.g. ImageNet Deng et al. (2009), MS COCO Lin et al. (2014)), in which the training data can be obtained directly from websites and the annotation pipeline is relatively simple, the research community faces two major problems in the acquisition and exploitation of scene data for autonomous driving: 1) Data resources are scarce and the scenes generally lack diversity. Scenes for autonomous driving must be collected by driving a car that carries an expensive sensor suite on the roads in compliance with local regulations, so existing autonomous driving datasets can only provide a limited amount of scene data. For instance, even the largest Waymo Open dataset Sun et al. (2020) is recorded over only 6.4 driving hours, which can hardly cover enough different circumstances. 2) Effectively leveraging unlabeled scene data becomes an important problem in practical applications. A data acquisition vehicle typically collects far more point cloud frames in a working day than a skilled worker can annotate, which leads to a rapid accumulation of unlabeled data. Although algorithms for semi-supervised learning Pham et al. (2020a); Tarvainen and Valpola (2017); Xie et al. (2020a), self-supervised learning He et al. (2020); Grill et al. (2020) and unsupervised domain adaptation Long et al. (2015); Ganin and Lempitsky (2015) show promising results in handling unlabeled data in the image domain, currently only a few methods Wang et al. (2020a); Xie et al. (2020b) have been studied for the autonomous driving scenario, mainly because of the limited amount of data provided by existing datasets.

To resolve the data inadequacy problem, in this paper we introduce the ONCE (One millioN sCenEs) dataset, which is the largest and most diverse autonomous driving dataset to date. The ONCE dataset contains 1 million 3D scenes and 7 million corresponding 2D images, about 4x the scenes and 7x the images of the largest Waymo Open dataset Sun et al. (2020), and the 3D scenes are recorded over 144 driving hours, roughly 20x longer than existing datasets, covering more diverse weather conditions, traffic conditions, time periods and areas. Figure 1 shows various scenes in the ONCE dataset. Each scene is captured by a high-quality LiDAR sensor and transformed into a dense 3D point cloud. For each scene, seven cameras capture high-resolution images that together cover a 360° field of view. The LiDAR and camera data are precisely synchronized, and calibration and transformation parameters are provided to enable the fusion of multiple sensors and scenes. We exhaustively annotate a subset of the most representative scenes with 3D ground-truth boxes of 5 categories (car, bus, truck, pedestrian and cyclist), giving rise to 417k 3D boxes in total, and corresponding 2D bounding boxes are provided for the camera images by projecting the 3D boxes onto the image planes. The remaining scenes are kept unannotated, mainly to facilitate future research on the exploitation of unlabeled data. Comprehensive comparisons between the ONCE dataset and other autonomous driving datasets are given in Table 1.

To resolve the unlabeled data exploitation problem and facilitate future research in this area, in this paper we introduce a 3D object detection benchmark in which we implement and evaluate a variety of self-supervised and semi-supervised learning methods on the ONCE dataset. Specifically, we first carefully select a set of widely-used self-supervised and semi-supervised learning methods, including classic image algorithms and methods designed for indoor 3D scenarios. We then adapt those methods to the task of 3D object detection for autonomous driving and reproduce them within the same detection framework. We train and evaluate those approaches and provide observations on semi-/self-supervised learning for 3D detection by analyzing the obtained results. We also provide baseline results for multiple 3D detectors and unsupervised domain adaptation methods. Extensive experiments show that models pretrained on the ONCE dataset perform much better than those pretrained on other datasets (nuScenes and Waymo) using the same self-supervised method, which implies the superior data quality and diversity of our dataset.

Our main contributions are twofold: 1) We introduce the ONCE dataset, which is the largest and most diverse autonomous driving dataset to date. 2) We introduce a benchmark of self-supervised and semi-supervised learning for 3D detection in the autonomous driving scenario.

2 Related Work

Autonomous driving datasets. Most autonomous driving datasets collect data on the roads with multiple sensors mounted on a vehicle, and the obtained point clouds and images are further annotated for perception tasks including detection and tracking. The KITTI dataset Geiger et al. (2013) is a pioneering work that records road sequences with stereo cameras and a LiDAR sensor. The ApolloScape dataset Huang et al. (2018) offers per-pixel semantic annotations for camera images, and Ma et al. (2019) additionally provide point cloud data based on ApolloScape. The KAIST Multi-Spectral dataset Choi et al. (2018) uses thermal imaging cameras to record scenes. The H3D dataset Patil et al. (2019) provides point cloud data in urban scenes. The Argoverse dataset Chang et al. (2019) introduces geometric and semantic maps. The Lyft L5 dataset Kesten et al. and the A*3D dataset Pham et al. (2020b) offer 30k and 39k annotated LiDAR frames respectively.

Dataset  Scenes  Size (hr.)  Area (km²)  Images  3D boxes  Night/Rain  Cls.
KITTI Geiger et al. (2013) 15k 1.5 - 15k 80k No/No 3
ApolloScape Ma et al. (2019) 20k 2 - 0 475k No/No 6
KAIST Choi et al. (2018) 8.9k - - 8.9k 0 Yes/No 3
A2D2 Geyer et al. (2020) 40k - - - - No/Yes 14
H3D Patil et al. (2019) 27k 0.8 - 83k 1.1M No/No 8
Cityscapes 3D Gählert et al. (2020) 20k - - 20k - No/No 8
Argoverse Chang et al. (2019) 44k 1 1.6 490k 993k Yes/Yes 15
Lyft L5 Kesten et al. 30k 2.5 - - 1.3M No/No 9
A*3D Pham et al. (2020b) 39k 55 - 39k 230k Yes/Yes 7
nuScenes Caesar et al. (2020) 400k 5.5 5 1.4M 1.4M Yes/Yes 23
Waymo Open Sun et al. (2020) 230k 6.4 76 1M 12M Yes/Yes 4
ONCE (ours) 1M 144 210 7M 417k Yes/Yes 5
Table 1: Comparisons with other 3D autonomous driving datasets. "-" means not mentioned. Our ONCE dataset has about 4x the scenes, 7x the images, and 22x the driving hours of the largest existing dataset Sun et al. (2020).

The nuScenes dataset Caesar et al. (2020) and the Waymo Open dataset Sun et al. (2020) are currently the most widely-used autonomous driving datasets. The nuScenes dataset records 5.5 hours of driving data with multiple sensors, providing 400k 3D scenes in total, and the Waymo Open dataset offers 230k scenes from 6.4 driving hours with massive annotations. Compared with those two datasets, our ONCE dataset is not only quantitatively larger in terms of scenes and images, e.g. 1M scenes versus 230k in Sun et al. (2020), but also more diverse, since our 144 driving hours cover all time periods as well as most weather conditions. Statistical comparisons with other autonomous driving datasets are shown in Table 1.

3D object detection in driving scenarios. Many techniques have been explored for 3D object detection in driving scenarios, and they can be broadly categorized into two classes: 1) Single-modality 3D detectors Yan et al. (2018); Shi et al. (2019); Lang et al. (2019); Shi et al. (2020); Yin et al. (2020); Zhou and Tuzel (2018); Qi et al. (2018); Yang et al. (2018) are designed to detect objects solely from sparse point clouds. PointRCNN Shi et al. (2019) operates directly on point clouds to predict bounding boxes. SECOND Yan et al. (2018) rasterizes point clouds into voxels and applies 3D convolutional networks on voxel features to generate predictions. PointPillars Lang et al. (2019) introduces the pillar representation to project point clouds into Bird's Eye View (BEV) and utilizes 2D convolutional networks for object detection. PV-RCNN Shi et al. (2020) combines point clouds and voxels for proposal generation and refinement. CenterPoints Yin et al. (2020) introduces a center-based assignment scheme for accurate object localization. 2) Multi-modality approaches Vora et al. (2020); Chen et al. (2017); Ku et al. (2018); Sindagi et al. (2019); Liang et al. (2019); Yoo et al. (2020) leverage both point clouds and images for 3D object detection. PointPainting Vora et al. (2020) uses images to generate segmentation maps and appends the segmentation scores to the corresponding point clouds to enhance point features. Other methods Chen et al. (2017); Ku et al. (2018) fuse point features and image features at multiple stages of a detector. Our benchmark evaluates a variety of 3D detection models, including both single-modality and multi-modality detectors, on the ONCE dataset.

Deep learning on unlabeled data. Semi-supervised learning and self-supervised learning are two promising areas in which various emerging methods are proposed to learn effectively from unlabeled data. Semi-supervised learning methods mainly fall into two branches: the first branch annotates unlabeled data with pseudo labels Lee et al. (2013); Berthelot et al. (2019b, a); Pham et al. (2020a) via self-training Xie et al. (2020a) or a teacher model Tarvainen and Valpola (2017). Other methods Xie et al. (2019); Long et al. (2015); Laine and Aila (2016); Miyato et al. (2018); Sohn et al. (2020) regularize pairs of augmented images under consistency constraints. Self-supervised learning approaches learn from unlabeled data by leveraging auxiliary tasks Zhang et al. (2016); Noroozi and Favaro (2016) or by clustering Caron et al. (2018, 2020); Asano et al. (2019). Recent advances He et al. (2020); Grill et al. (2020); Chen et al. (2020b, a) demonstrate that contrastive learning methods show promising results in self-supervised learning. Semi-/self-supervised learning has also been studied in 3D scenarios. SESS Zhao et al. (2020) is a semi-supervised method that utilizes geometric and semantic consistency for indoor 3D object detection. 3DIoUMatch Wang et al. (2020a) utilizes an auxiliary IoU head for box filtering. For self-supervised learning, PointContrast Xie et al. (2020b) and DepthContrast Zhang et al. (2021) apply contrastive learning to point clouds. Our benchmark provides a fair comparison of various self-supervised and semi-supervised methods.

Figure 2: Sensor locations and coordinate systems. The data acquisition vehicle is equipped with one LiDAR sensor and seven cameras that capture 3D point clouds and images covering a 360° field of view.
Sensor  Freq. (Hz)  HFOV (°)  VFOV (°)  Image size  Range (m)  Accuracy (cm)  Points/second
CAM_1,9  10  60.6  [-18, +18]  1920×1020  n/a  n/a  n/a
CAM_3-8  10  120  [-37, +37]  1920×1020  n/a  n/a  n/a
LiDAR  10  360  [-25, +15]  n/a  [0.3, 200]  2  7.2M
Table 2: Detailed parameters of LiDAR and cameras.

3 ONCE Dataset

3.1 Data Acquisition System

Sensor specifications. The data acquisition system is built with one multi-beam LiDAR sensor and seven high-resolution cameras mounted on a car. The specific sensor layout is shown in Figure 2, and the detailed parameters of all sensors are listed in Table 2. We note that the LiDAR sensor and the set of cameras each capture data covering a 360° horizontal field of view around the driving vehicle, and all the sensors are well-synchronized, which enables good alignment of cross-modality data. We carefully calibrate the intrinsics and extrinsics of each camera using patterned calibration target boards. We check the calibration parameters every day and re-calibrate any sensor that shows errors. We release the intrinsics and extrinsics along with the data so that users can perform camera projection.

Data protection. The driving scenes are collected in permitted areas. We comply with local regulations and avoid releasing any localization information, including GPS information and map data. For privacy protection, we detect, with a high recall rate, any object in each image that may contain personal information, e.g. human faces and license plates, and then blur those detected objects so that no personal information is disclosed.

3.2 Data Format

(a) Weather conditions
(b) Time periods
(c) Areas
Figure 3: Proportions of different weather conditions, time periods and areas in the ONCE dataset. Our dataset covers a wide range of domains, including scenes captured on rainy days and scenes collected at night.

Coordinate systems. There are three types of coordinate systems in the ONCE dataset, i.e., the LiDAR coordinate, the camera coordinates, and the image coordinate. The LiDAR coordinate is placed at the center of the LiDAR sensor, with the x-axis positive forwards, the y-axis positive to the left, and the z-axis positive upwards. We additionally provide a transformation matrix (vehicle pose) between every two adjacent LiDAR frames, which enables the fusion of multiple point clouds. Each camera coordinate is placed at the center of the respective lens, with the x-y plane parallel to the image plane and the z-axis positive forwards. The camera coordinates can be transformed to the LiDAR coordinate directly using the respective camera extrinsics. The image coordinate is a 2D coordinate system whose origin is at the top-left corner of the image, with the x-axis and the y-axis along the image width and height respectively. The camera intrinsics enable the projection from the camera coordinate to the image coordinate. An illustration of our coordinate systems is given in Figure 2.
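To make these transformations concrete, the sketch below projects LiDAR points into an image through the camera extrinsics and intrinsics. It is a minimal Python/NumPy example under stated assumptions: the calibration is assumed to be given as a 4x4 camera-to-LiDAR extrinsic matrix and a 3x3 intrinsic matrix, and the parameter names are illustrative rather than the official devkit API.

```python
# A minimal projection sketch, assuming a 4x4 camera-to-LiDAR extrinsic matrix
# and a 3x3 pinhole intrinsic matrix; names are illustrative, not the devkit API.
import numpy as np

def project_lidar_to_image(points_lidar, cam_to_lidar, intrinsic):
    """points_lidar: (N, 3) xyz coordinates in the LiDAR frame."""
    # Invert the camera->LiDAR extrinsic to map LiDAR points into the camera frame.
    lidar_to_cam = np.linalg.inv(cam_to_lidar)
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])  # (N, 4)
    pts_cam = (lidar_to_cam @ pts_h.T).T[:, :3]                             # (N, 3)
    # Keep only points in front of the camera (positive z in the camera frame).
    front = pts_cam[:, 2] > 0
    pts_cam = pts_cam[front]
    # Pinhole projection with the camera intrinsics, then normalize by depth.
    uvw = (intrinsic @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]   # (M, 2) pixel coordinates in the image plane
    return uv, front
```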

LiDAR data. The original LiDAR data is recorded at 10 frames per second (FPS). We further downsample the original data to a lower frame rate, since most adjacent frames are quite similar and thus redundant. The downsampled data is then transformed into 3D point clouds, resulting in 1 million point clouds, i.e., scenes, in total. Each point cloud is represented as an N×4 matrix, where N is the number of points in the scene and each point is a 4-dim vector (x, y, z, r). The 3D coordinate (x, y, z) is given in the LiDAR coordinate system, and r denotes the reflection intensity. The point clouds are stored in separate binary files, one per scene, and can be easily read by users.
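A minimal sketch of reading one such file is shown below, assuming each scene is stored as a flat binary file of float32 values in (x, y, z, r) order, matching the N×4 description above; the exact dtype and layout should be verified against the official devkit.

```python
# A minimal loading sketch; the float32 (x, y, z, r) layout is an assumption.
import numpy as np

def load_point_cloud(bin_path):
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)  # (N, 4)
    xyz, reflectance = points[:, :3], points[:, 3]
    return xyz, reflectance
```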

Camera data. The camera data is downsampled along with the LiDAR data for synchronization, and distortions are removed to enhance image quality. We provide JPEG-compressed images for all the cameras, resulting in 7 million images in total.

Annotations. We select the most representative scenes from the dataset and exhaustively annotate all the 3D bounding boxes of 5 categories: car, bus, truck, pedestrian and cyclist. Each bounding box is a 3D cuboid represented as a 7-dim vector (cx, cy, cz, l, w, h, θ), where (cx, cy, cz) is the center of the cuboid in the LiDAR coordinate, (l, w, h) denotes length, width and height, and θ is the yaw angle of the cuboid. We also provide 2D bounding boxes by projecting the 3D boxes onto the image planes.
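For illustration, the hypothetical helper below expands a 7-dim box into its eight corners in the LiDAR frame, assuming the yaw angle θ is measured around the z-axis; it is a sketch rather than part of the official toolkit.

```python
# A minimal sketch of converting (cx, cy, cz, l, w, h, yaw) into 8 corner points.
import numpy as np

def box_to_corners(box):
    cx, cy, cz, l, w, h, yaw = box
    # Axis-aligned corner offsets around the box center.
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2.0
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2.0
    z = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2.0
    corners = np.stack([x, y, z], axis=1)  # (8, 3)
    # Rotate around the z-axis by the yaw angle, then translate to the box center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ rot.T + np.array([cx, cy, cz])
```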

Other information. Weather and time information of each scene is useful since it contains explicit domain knowledge, but existing datasets seldom release those important data. In the ONCE dataset, we provide weather conditions, i.e., sunny, cloudy, rainy, and time periods, i.e., morning, noon, afternoon, night, for every labeled and unlabeled scene. We pack all the information, i.e., weather, period, timestamp, pose, calibration, annotations, into a single JSON file for each scene.

Dataset splits. Our ONCE dataset contains a set of annotated scenes, which can be used for training and testing detection models, as well as nearly one million unlabeled scenes, which are specially reserved for semi-supervised and self-supervised approaches. The annotated scenes are further divided into three splits: a training (or fine-tuning) split, a validation split, and a testing split. To explore the effect of different amounts of data on 3D detection, we also divide the unlabeled scenes into three subsets of increasing size, referred to below as the small, medium, and full unlabeled subsets, where the full subset contains all the unlabeled scenes. We note that the small and medium subsets are created by selecting particular roads in time order instead of uniformly sampling from all the scenes, which is more practical since driving data is usually incrementally updated in real applications.

3.3 Dataset Analysis

Diversity analysis. We analyze the ratios of different weather conditions, time periods and areas in our ONCE dataset. Figure 3 shows that the scenes in our dataset are captured under various weather conditions; for example, a portion of the scenes are captured on rainy days, where the point clouds are relatively sparser than those obtained on sunny or cloudy days. The one million scenes are also distributed across all time periods of a day, with a considerable number of scenes captured at night. The 144 driving hours also cover a variety of regions including downtown, suburbs, highways, bridges and tunnels.

Quality analysis. In order to evaluate data quality and provide a fair comparison across different datasets, we propose an approach that uses pretrained models to reflect the respective data quality. Specifically, we first pretrain the same backbone of the SECOND detector Yan et al. (2018) with the self-supervised method DeepCluster Tian et al. (2017) using data from nuScenes Caesar et al. (2020), Waymo Sun et al. (2020) and ONCE respectively, and then we finetune those pretrained models on multiple downstream datasets under the same settings and report their performance. The best-performing model should have the best pretrained backbone, which indicates that its corresponding pretraining dataset has the best data quality. The model pretrained on the ONCE dataset achieves 67.2 moderate mAP on the downstream KITTI Geiger et al. (2013) dataset, significantly outperforming the models pretrained on the Waymo dataset (66.5) and the nuScenes dataset (66.1), which implies our superior data quality compared with nuScenes and Waymo. More results are provided in the appendix.

3.4 Evaluation Metric

Figure 4: An overview of our 3D object detection benchmark. We reproduce detection models, self-supervised learning, semi-supervised learning, and unsupervised domain adaptation methods for 3D object detection. We give comprehensive analyses on the results and offer valuable observations.

The evaluation metric is critical for fair comparisons of different approaches to 3D object detection. The current 3D IoU-based evaluation metric Geiger et al. (2013) faces the problem that objects with opposite orientations can both be matched to the ground truth under the IoU criterion. To resolve this problem, we extend Geiger et al. (2013) and take object orientations into special consideration. In particular, we first re-rank the predictions according to their scores, and mark as false positives those predicted boxes that have low 3D IoUs with all ground truths of the same category, using a category-specific IoU threshold for car, bus, truck, pedestrian and cyclist. We then add an additional filtering step in which predictions are also marked as false positives if their orientations do not fall within the allowed range around the matched ground-truth orientation. This step imposes a more stringent criterion specifically on orientations. The remaining matched predictions are treated as true positives. Finally, we determine score thresholds at evenly spaced recall rates and calculate the corresponding precision rates to draw the precision-recall curve. The calculation of our orientation-aware AP can be formulated as:

\mathrm{AP} = 100 \int_{0}^{1} \max\{\, p(r') \mid r' \ge r \,\}\, \mathrm{d}r \qquad (1)

where p(r) denotes the precision at recall r on the orientation-aware precision-recall curve, and the integral is approximated at the evenly spaced recall rates described above.

We merge the car, bus and truck classes into a super-class called vehicle following Sun et al. (2020), so we officially report the AP of vehicle, pedestrian and cyclist respectively in the following experiments; an evaluation interface for all 5 classes is still provided for users. Mean AP (mAP) is obtained by averaging the AP scores of the three categories. To further inspect detection performance at different distances, we also report AP over three distance ranges: within 30m, 30-50m, and farther than 50m, obtained by only considering ground truths and predictions within that distance range. We extensively discuss and compare the evaluation metrics of different datasets in Appendix A.
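To make the procedure concrete, the sketch below implements a simplified version of the orientation-aware AP described above: predictions are ranked by score, matched to ground truths by 3D IoU, matches whose heading deviates too far from the matched ground truth are rejected, and AP is computed from interpolated precision at evenly spaced recall rates. The π/2 orientation tolerance, the single IoU threshold, the recall sampling, and the externally supplied iou_fn are illustrative assumptions, not the official evaluation code.

```python
# A simplified orientation-aware AP sketch; thresholds and iou_fn are assumptions.
import numpy as np

def orientation_aware_ap(pred_boxes, pred_scores, gt_boxes, iou_fn,
                         iou_thresh=0.7, yaw_tol=np.pi / 2, recall_steps=50):
    order = np.argsort(-pred_scores)                  # rank predictions by score
    matched_gt, tp = set(), np.zeros(len(order), dtype=bool)
    for rank, i in enumerate(order):
        ious = np.array([iou_fn(pred_boxes[i], g) for g in gt_boxes])
        j = int(np.argmax(ious)) if len(gt_boxes) else -1
        if j < 0 or ious[j] < iou_thresh or j in matched_gt:
            continue                                   # false positive: low IoU or duplicate
        yaw_diff = np.abs(pred_boxes[i][6] - gt_boxes[j][6]) % (2 * np.pi)
        yaw_diff = min(yaw_diff, 2 * np.pi - yaw_diff)
        if yaw_diff > yaw_tol:
            continue                                   # false positive: wrong orientation
        matched_gt.add(j)
        tp[rank] = True
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(len(gt_boxes), 1)
    precision = cum_tp / (np.arange(len(order)) + 1)
    # Interpolated precision at evenly spaced recall rates.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, recall_steps + 1):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return 100.0 * ap / (recall_steps + 1)
```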

4 Benchmark for 3D Object Detection

Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
Multi-Modality (point clouds + images)
PointPainting Vora et al. (2020) 66.46 83.70 56.89 40.74 47.62 58.95 39.33 23.34 65.27 73.48 61.53 43.90 59.78
Single-Modality (point clouds only)
PointRCNN Shi et al. (2019) 52.00 74.44 40.72 22.14 8.73 12.20 6.96 2.96 34.02 46.48 27.39 11.45 31.58
PointPillars Lang et al. (2019) 69.52 84.51 60.55 45.72 17.28 20.21 15.06 11.48 49.63 60.15 42.43 27.73 45.47
SECOND Yan et al. (2018) 69.71 86.96 60.22 43.02 26.09 30.52 24.63 14.19 59.92 70.54 54.89 34.34 51.90
PV-RCNN Shi et al. (2020) 76.98 89.89 69.35 55.52 22.66 27.23 21.28 12.08 61.93 72.13 56.64 37.23 53.85
CenterPoints Yin et al. (2020) 66.35 83.65 56.74 41.57 51.80 62.80 45.41 24.53 65.57 73.02 62.85 44.77 61.24
Table 3: Results of detection models on the testing split.

In this section, we present a 3D object detection benchmark on the ONCE dataset. We reproduce widely-used detection models as well as various methods of self-supervised learning, semi-supervised learning and unsupervised domain adaptation on 3D object detection. We validate those methods with a unified standard and provide performance analysis as well as suggestions for future research.

4.1 Models for 3D Object Detection

We implement the most widely-used single-modality 3D detectors: PointRCNN Shi et al. (2019), PointPillars Lang et al. (2019), SECOND Yan et al. (2018), PV-RCNN Shi et al. (2020) and CenterPoints Yin et al. (2020), which use only point clouds as input, as well as the multi-modality detector PointPainting Vora et al. (2020), which uses both point clouds and images as input, on the ONCE dataset. We train those detectors on the training split and report the overall and distance-wise AP on the testing split using the evaluation metric in Section 3.4. Performance on the validation split is also reported in Appendix B. We provide training and implementation details in Appendix C.

Points vs. voxels vs. pillars. Multiple representations (points/voxels/pillars) have been explored for 3D detection. Our experiments in Table 3 demonstrate that the point-based detector Shi et al. (2019) performs poorly with only 31.58 mAP on the ONCE dataset, since small objects like pedestrians are naturally sparse and a small number of sampled points cannot guarantee a high recall rate. The voxel-based detector Yan et al. (2018) shows decent performance with 51.90 mAP compared with 45.47 mAP of the pillar-based detector Lang et al. (2019), mainly because voxels contain finer geometric information than pillars. PV-RCNN Shi et al. (2020) combines both the point representation and the voxel representation and further improves the detection performance to 53.85 mAP.

Anchor assignments vs. center assignments. The only difference between SECOND Yan et al. (2018) and CenterPoints Yin et al. (2020) is that SECOND uses anchor-based target assignments while CenterPoints introduces center-based assignments. In our experiments, SECOND shows better performance on the vehicle category (69.71 vs. 66.35) while CenterPoints performs much better on small objects, including pedestrians (51.80 vs. 26.09) and cyclists (65.57 vs. 59.92). This is because the center-based method has the stronger localization ability required for detecting small objects, while the anchor-based method can estimate the size of objects more precisely.

Single-modality vs. multi-modality. PointPainting Vora et al. (2020) appends the segmentation scores to the input point clouds of CenterPoints Yin et al. (2020), but the performance drops from 61.24 to 59.78 mAP. We find that the performance of PointPainting heavily relies on the accuracy of the segmentation scores, and without explicit segmentation labels on the ONCE dataset we cannot generate accurate semantic segmentation maps from images, which brings negative effects on 3D detection.

4.2 Self-Supervised Learning for 3D Object Detection

We reproduce self-supervised learning methods, including contrastive learning methods (PointContrast Xie et al. (2020b) and BYOL Grill et al. (2020)) as well as clustering-based methods (DeepCluster Tian et al. (2017) and SwAV Caron et al. (2020)), on our dataset. We choose the backbone of the SECOND detector Yan et al. (2018) as the pretrained backbone. We first pretrain the backbone using the self-supervised learning methods with different amounts of unlabeled data, namely the small, medium, and full (one million scene) unlabeled subsets, and then we finetune the detection model on the training split. We report the detection performance on the testing split using the evaluation metric in Section 3.4. Performance on the validation split is also reported in Appendix B. We provide training and implementation details in Appendix C.

Self-supervised learning on unlabeled data. Experiments in Table 4 show that self-supervised methods can improve the detection results given enough unlabeled data. PointContrast Xie et al. (2020b) obtains 50.75 mAP with the small unlabeled subset, but the performance consistently improves to 52.76 and 52.99 with the medium and full unlabeled subsets respectively, giving rise to a final gain of 1.09 mAP over the baseline. Self-supervised learning thus benefits from an increasing amount of unlabeled data.

Contrastive learning vs. clustering. The detection results indicate that clustering-based methods Tian et al. (2017); Caron et al. (2020) consistently outperform contrastive learning methods Xie et al. (2020b); Grill et al. (2020). SwAV Caron et al. (2020) and DeepCluster Tian et al. (2017) achieve 54.28 and 54.27 mAP respectively on the full unlabeled subset, compared with 52.10 and 52.99 obtained by BYOL Grill et al. (2020) and PointContrast Xie et al. (2020b). This is mainly because constructing representative views of a 3D scene for contrastive learning is non-trivial in driving scenarios: generating different views simply by applying different augmentations to the same point cloud may yield quite similar views, which makes the pretraining process easily converge to a trivial solution.

Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
baseline Yan et al. (2018) 69.71 86.96 60.22 43.02 26.09 30.52 24.63 14.19 59.92 70.54 54.89 34.34 51.90
BYOL Grill et al. (2020) 67.57 84.61 58.26 41.59 17.22 19.45 16.71 10.43 53.36 64.95 47.47 27.66 46.05 (-5.85)
PointContrast Xie et al. (2020b) 71.53 87.02 62.37 47.23 22.68 26.33 21.58 12.98 58.04 70.01 51.74 31.69 50.75 (-1.15)
SwAV Caron et al. (2020) 72.25 87.20 63.38 48.93 25.11 29.32 23.50 14.13 60.67 70.90 55.91 35.39 52.68 (+0.78)
DeepCluster Tian et al. (2017) 72.06 87.09 63.09 48.78 27.56 32.21 26.60 13.61 50.30 70.33 55.82 35.89 53.31 (+1.41)
BYOL Grill et al. (2020) 69.69 84.83 60.41 46.05 27.31 32.58 24.60 13.69 57.22 69.57 51.07 29.15 51.41 (-0.49)
PointContrast Xie et al. (2020b) 70.15 86.71 61.12 48.11 29.23 35.52 36.28 13.06 58.91 70.05 53.86 34.27 52.76 (+0.86)
SwAV Caron et al. (2020) 72.10 87.11 63.15 48.58 28.00 33.10 25.88 14.19 60.17 70.46 55.61 34.84 53.42 (+1.52)
DeepCluster Tian et al. (2017) 72.12 87.31 62.97 48.55 30.06 36.07 27.23 13.47 60.45 70.81 54.93 36.03 54.21 (+2.31)
BYOL Grill et al. (2020) 72.23 87.30 63.13 48.31 23.62 27.10 22.14 13.47 60.45 70.82 55.31 35.65 52.10 (+0.20)
PointContrast Xie et al. (2020b) 73.15 83.92 67.29 50.97 27.48 31.45 24.17 16.70 58.33 70.37 52.26 35.61 52.99 (+1.09)
SwAV Caron et al. (2020) 71.96 86.92 62.83 48.85 30.60 36.42 28.03 14.52 60.27 70.43 55.52 36.25 54.28 (+2.38)
DeepCluster Tian et al. (2017) 71.85 86.96 62.91 48.54 30.54 37.08 27.55 13.86 60.42 70.60 55.47 36.29 54.27 (+2.37)
Table 4: Results of self-supervised learning methods on the testing split.
Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
baseline Yan et al. (2018) 69.71 86.96 60.22 43.02 26.09 30.52 24.63 14.19 59.92 70.54 54.89 34.34 51.90
Pseudo Label Lee et al. (2013) 71.05 86.51 61.81 47.49 25.58 31.03 22.03 14.12 58.08 68.50 52.63 35.61 51.57 (-0.33)
Noisy Student Xie et al. (2020a) 73.25 88.84 64.61 49.95 28.04 34.62 23.43 14.20 57.58 67.77 53.43 33.76 52.96 (+1.06)
Mean Teacher Tarvainen and Valpola (2017) 74.13 89.34 65.28 50.91 31.66 37.44 29.90 14.61 62.69 71.88 59.22 39.45 56.16 (+4.26)
SESS Zhao et al. (2020) 72.42 87.23 63.55 49.11 27.32 32.26 24.47 15.36 61.76 72.39 57.29 37.33 53.83 (+1.93)
3DIoUMatch Wang et al. (2020a) 72.12 87.05 63.65 50.35 31.41 38.56 27.62 14.25 59.46 69.53 54.82 36.18 54.33 (+2.43)
Pseudo Label Lee et al. (2013) 70.72 86.21 61.72 47.39 21.74 25.73 19.91 13.28 56.01 67.14 50.18 33.23 49.49 (-2.41)
Noisy Student Xie et al. (2020a) 73.97 89.09 65.35 51.04 30.32 36.24 27.08 16.24 61.35 71.28 56.70 37.96 55.22 (+3.32)
Mean Teacher Tarvainen and Valpola (2017) 74.71 89.28 66.10 52.90 36.03 42.97 33.29 18.70 64.88 74.05 60.80 42.63 58.54 (+6.64)
SESS Zhao et al. (2020) 72.60 87.02 64.29 50.68 35.23 42.59 31.40 16.64 64.67 73.93 61.14 40.80 57.50 (+5.60)
3DIoUMatch Wang et al. (2020a) 74.26 89.08 66.11 53.03 33.91 41.02 30.07 16.15 61.30 71.29 56.49 38.13 56.49 (+4.59)
Pseudo Label Lee et al. (2013) 70.29 85.94 61.18 46.66 21.85 25.83 20.22 12.75 55.72 66.96 50.29 32.92 49.29 (-2.61)
Noisy Student Xie et al. (2020a) 74.50 89.23 67.11 53.15 33.28 40.20 28.89 17.50 62.05 71.76 57.53 39.32 56.61 (+4.71)
Mean Teacher Tarvainen and Valpola (2017) 76.60 89.41 68.29 55.66 36.37 43.84 32.49 17.11 66.99 75.87 63.35 44.06 59.99 (+8.09)
SESS Zhao et al. (2020) 74.52 88.97 66.32 52.47 36.29 43.53 33.15 16.68 65.52 74.63 62.67 41.91 58.78 (+6.88)
3DIoUMatch Wang et al. (2020a) 74.48 89.13 66.35 54.59 35.74 43.35 32.08 17.34 62.06 71.86 58.00 39.09 57.43 (+5.53)
Table 5: Results of semi-supervised learning methods on the testing split.

4.3 Semi-Supervised Learning for 3D Object Detection

We implement image-based semi-supervised methods: Pseudo Label Lee et al. (2013), Mean Teacher Tarvainen and Valpola (2017) and Noisy Student Xie et al. (2020a), as well as semi-supervised methods designed for point clouds in indoor scenarios: SESS Zhao et al. (2020) and 3DIoUMatch Wang et al. (2020a). We first pretrain the model on the training split and then apply the semi-supervised learning methods to both the training split and the unlabeled scenes, training on each of the small, medium, and full unlabeled subsets in turn. We report the detection performance on the testing split for each unlabeled subset separately. Performance on the validation split is also reported in Appendix B. We provide training and implementation details in Appendix C.

Semi-supervised learning on unlabeled data. Experiments in Table 5 show that most semi-supervised methods can improve the detection results using unlabeled data. For instance, Mean Teacher Tarvainen and Valpola (2017) improves the baseline result by 8.09 mAP using the full unlabeled set. The detection performance can be further boosted as the amount of unlabeled data increases: SESS Zhao et al. (2020) obtains a 1.93 mAP gain using the small unlabeled subset, and the gain reaches 5.60 with the medium subset and 6.88 with the full one million scenes.

Pseudo labels vs. consistency. There are two keys to the success of labeling-based methods Lee et al. (2013); Xie et al. (2020a); Wang et al. (2020a): augmentations and label quality. Without strong augmentations, the performance of Pseudo Label Lee et al. (2013) drops from the 51.90 baseline to 49.29 even though one million scenes are provided for training. 3DIoUMatch Wang et al. (2020a) adds an additional step to filter out low-quality labels, and it surpasses Noisy Student on every unlabeled subset (e.g. 57.43 vs. 56.61 on the full set). Consistency-based methods Tarvainen and Valpola (2017); Zhao et al. (2020) generally perform better than labeling-based methods, and Mean Teacher obtains the highest performance of 59.99 mAP on the full set. SESS Zhao et al. (2020) performs worse than Mean Teacher with 58.78 mAP, which indicates that the size and center consistency terms may not be as useful in driving scenarios.

Semi-supervised learning vs. self-supervised learning. Our results in Table 4 and Table 5 demonstrate that semi-supervised methods generally perform better than self-supervised methods. Mean Teacher Tarvainen and Valpola (2017) attains the best performance of 59.99 mAP, while the best self-supervised method SwAV Caron et al. (2020) only obtains 54.28 on the full unlabeled subset. The major reason is that in semi-supervised learning the model usually receives stronger and more precise supervisory signals, e.g. pseudo labels or consistency with a trained model, when learning from the unlabeled data. In self-supervised learning, by contrast, the supervisory signals on the unlabeled data are cluster ids or pairwise similarities, which are typically noisy and uncertain.

Task  Waymo → ONCE  nuScenes → ONCE  ONCE → KITTI
Method
Source Only 65.55 32.88 46.85 23.74 42.01 12.11
SN Wang et al. (2020b) 67.97 (+2.42) 38.25 (+5.67) 62.47 (+15.62) 29.53 (+5.79) 48.12 (+6.11) 21.12 (+9.01)
ST3D Yang et al. (2021) 68.05 (+2.50) 48.34 (+15.46) 42.53 (-4.32) 17.52 (-6.22) 86.89 (+44.88) 41.42 (+29.31)
Oracle 89.00 77.50 89.00 77.50 83.29 73.45
Table 6: Results on unsupervised domain adaptation. Source Only means trained on the source and directly evaluated on the target domain. Oracle means trained and tested both on the target domain.

4.4 Unsupervised Domain Adaptation for 3D Object Detection

Unsupervised domain adaptation for 3D object detection aims to adapt a detection model from the source dataset to the target dataset without supervisory signals on the target domain. Different datasets typically have different collected environments, sensor locations and point cloud densities. In this paper, we reproduce commonly-used methods: SN Wang et al. (2020b) and ST3D Yang et al. (2021). We follow the settings of those methods and conduct experiments on transferring the model trained on the nuScenes and Waymo Open dataset to our ONCE dataset, as well as transferring the model trained on the ONCE dataset to the KITTI dataset. The detection results of the car class are reported on the respective target validation or testing set using the KITTI metric following Wang et al. (2020b); Yang et al. (2021).

Statistical normalization vs. self-training. The normalization-based method SN Wang et al. (2020b) surpasses the Source Only model by 2.42 and 5.67 points on the two reported metrics of the Waymo → ONCE adaptation task, and the self-training method ST3D Yang et al. (2021) also attains a considerable performance gain of up to 15.46 points. However, ST3D performs poorly on the nuScenes → ONCE task, mainly because the nuScenes dataset has fewer LiDAR beams, so a model trained on nuScenes may produce more low-quality pseudo labels, which harms the self-training process of ST3D. Although those two methods achieve strong results on the adaptation from/to the ONCE dataset, there is still a gap with the Oracle results, leaving large room for future research.

5 Conclusion

In this paper, we introduce the ONCE (One millioN sCenEs) dataset, which is the largest autonomous driving dataset to date. To facilitate future research on 3D object detection, we additionally provide a benchmark for detection models and methods of self-supervised learning, semi-supervised learning and unsupervised domain adaptation. For future work, we plan to support more autonomous driving tasks, including 2D object detection, 3D semantic segmentation and planning.

References

Appendix A The ONCE dataset

We publish the ONCE dataset, benchmark, development kit, data format and annotation instructions at our website http://www.once-for-auto-driving.com. It is our priority to protect the privacy of third parties. We bear all responsibility in case of violation of rights and confirm the data license.

License. The ONCE dataset is published under CC BY-NC-SA 4.0 license, which means everyone can use this dataset for non-commercial research purpose. Find more details in http://www.once-for-auto-driving.com/license.html.

Dataset documentation. http://www.once-for-auto-driving.com/documentation.html shows the dataset documentation and intended uses.

Terms of use and privacy. Terms of use and privacy are in http://www.once-for-auto-driving.com/terms_of_use.html.

Data maintenance. http://www.once-for-auto-driving.com/download.html provides data download links for users. Data is stored in Google Drive for global users, and another copy is stored in BaiduYunPan for users in China. We will maintain the data for a long time and check data accessibility on a regular basis.

Benchmark and code. http://www.once-for-auto-driving.com/benchmark.html provides benchmark results. The reproduction code will be released upon acceptance.

Annotation statistics. Figure 5 shows the distribution of the number of objects across the annotated scenes. Our annotated set covers diverse object counts: the number of vehicles per scene spans a wide range, and so do the numbers of pedestrians and cyclists. The distributions of the training, validation and testing splits are largely similar but differ slightly in some intervals, which guarantees stable evaluation results and encourages evaluated methods to generalize across the three splits.

Discussion on the evaluation metrics. Current evaluation metrics for 3D detection typically extend the Average Precision (AP) metric Everingham et al. (2010) of 2D detection to 3D scenarios by changing the matching criterion between ground-truth boxes and predictions. The nuScenes dataset Caesar et al. (2020) uses the center distance between boxes on the ground plane as the matching criterion for AP calculation, ignoring the size and orientation of the objects. Although the nuScenes detection score (NDS) is proposed to take all factors into consideration, AP still accounts for half of the total NDS score, which shows a strong preference for accurate localization of object centers but pays less attention to the objects' size and orientation. The Waymo Open dataset Sun et al. (2020) applies the Hungarian algorithm to match ground truths and predictions, which may lead to an overestimation of AP since objects with no overlap can also be matched. The KITTI dataset Geiger et al. (2013) uses 3D Intersection over Union (IoU) above a certain threshold as the matching criterion, but predicted boxes with orientations opposite to the ground truths can also be matched, which can be dangerous in practice. In this paper, we extend the 3D IoU-based evaluation metric of Geiger et al. (2013) and take object orientations into special consideration. Our orientation-aware metric is more stringent than Geiger et al. (2013). Compared with the weighted-scoring method in Sun et al. (2020), our method avoids repeated counting of the orientation factor, since orientation already participates in the computation of the 3D rotated IoU. Compared with the distance-based matching scheme of Caesar et al. (2020), our method puts equal weight on object size, center and orientation.

Limitations. The major limitation of our ONCE dataset is that we currently annotate only a small fraction of the one million scenes, which may hamper broader exploration of 3D object detection. To overcome this limitation, we plan to provide more annotations in the near future. We also plan to support more autonomous driving tasks in addition to 3D detection on the ONCE dataset.

(a) Vehicle
(b) Pedestrian
(c) Cyclist
Figure 5: Distribution of annotation counts per scene.

Appendix B Experiments on the validation split

B.1 Models for 3D Object Detection

Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
Multi-Modality (point clouds + images)
PointPainting Vora et al. (2020) 66.17 80.31 59.80 42.26 44.84 52.63 36.63 22.47 62.34 73.55 57.20 40.39 57.78
Single-Modality (point clouds only)
PointRCNN Shi et al. (2019) 52.09 74.45 40.89 16.81 4.28 6.17 2.40 0.91 29.84 46.03 20.94 5.46 28.74
PointPillars Lang et al. (2019) 68.57 80.86 62.07 47.04 17.63 19.74 15.15 10.23 46.81 58.33 40.32 25.86 44.34
SECOND Yan et al. (2018) 71.19 84.04 63.02 47.25 26.44 29.33 24.05 18.05 58.04 69.96 52.43 34.61 51.89
PV-RCNN Shi et al. (2020) 77.77 89.39 72.55 58.64 23.50 25.61 22.84 17.27 59.37 71.66 52.58 36.17 53.55
CenterPoints Yin et al. (2020) 66.79 80.10 59.55 43.39 49.90 56.24 42.61 26.27 63.45 74.28 57.94 41.48 60.05
Table 7: Results of detection models on the validation split.

B.2 Self-Supervised Learning for 3D Object Detection

Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
baseline Yan et al. (2018) 71.19 84.04 63.02 47.25 26.44 29.33 24.05 18.05 58.04 69.96 52.43 34.61 51.89
BYOL Grill et al. (2020) 68.02 81.01 60.21 44.17 19.50 22.16 16.68 12.06 50.61 62.46 44.29 28.18 46.04 (-5.85)
PointContrast Xie et al. (2020b) 71.07 83.31 64.90 49.34 22.52 23.73 21.81 16.06 56.36 68.11 50.35 34.06 49.98 (-1.91)
SwAV Caron et al. (2020) 72.71 83.68 65.91 50.10 25.13 27.77 22.77 16.36 58.05 69.99 52.23 34.86 51.96 (+0.07)
DeepCluster Tian et al. (2017) 73.19 84.25 66.86 50.47 24.00 26.36 21.73 16.79 58.99 70.80 53.66 36.17 52.06 (+0.17)
BYOL Grill et al. (2020) 70.93 84.15 63.48 45.74 25.86 29.91 21.55 15.83 55.63 58.59 49.01 29.53 50.82 (-1.07)
PointContrast Xie et al. (2020b) 71.39 83.89 65.22 47.73 27.69 32.53 23.00 14.68 56.88 69.01 50.41 34.57 51.99 (+0.10)
SwAV Caron et al. (2020) 72.51 83.39 65.46 51.08 27.08 29.94 25.19 17.13 57.85 69.87 52.38 33.78 52.48 (+0.59)
DeepCluster Tian et al. (2017) 71.62 83.99 65.55 50.77 29.33 33.25 25.08 17.00 57.61 68.57 52.58 34.05 52.86 (+0.97)
BYOL Grill et al. (2020) 71.32 83.59 64.89 50.27 25.02 27.06 22.96 17.04 58.56 70.18 52.74 36.32 51.63 (-0.26)
PointContrast Xie et al. (2020b) 71.87 86.93 62.85 48.65 28.03 33.07 25.91 14.44 60.88 71.12 55.77 36.78 53.59 (+1.70)
SwAV Caron et al. (2020) 72.46 83.09 66.66 51.50 29.84 34.15 26.22 17.61 57.84 68.79 52.21 35.39 53.38 (+1.49)
DeepCluster Tian et al. (2017) 72.89 83.52 67.09 50.38 30.32 34.76 26.43 18.33 57.94 69.18 52.42 34.36 53.72 (+1.83)
Table 8: Results of self-supervised learning methods on the validation split.

B.3 Semi-Supervised Learning for 3D Object Detection

Method Vehicle Pedestrian Cyclist mAP
overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf overall 0-30m 30-50m 50m-inf
baseline Yan et al. (2018) 71.19 84.04 63.02 47.25 26.44 29.33 24.05 18.05 58.04 69.96 52.43 34.61 51.89
Pseudo Label Lee et al. (2013) 72.80 84.46 64.97 51.46 25.50 28.36 22.66 18.51 55.37 65.95 50.34 34.42 51.22 (-0.67)
Noisy Student Xie et al. (2020a) 73.69 84.69 67.72 53.41 28.81 33.23 23.42 16.93 54.67 65.58 50.43 32.65 52.39 (+0.50)
Mean Teacher Tarvainen and Valpola (2017) 74.46 86.65 68.44 53.59 30.54 34.24 26.31 20.12 61.02 72.51 55.24 39.11 55.34 (+3.45)
SESS Zhao et al. (2020) 73.33 84.52 66.22 52.83 27.31 31.11 23.94 19.01 59.52 71.03 53.93 36.68 53.39 (+1.50)
3DIoUMatch Wang et al. (2020a) 73.81 84.61 68.11 54.48 30.86 35.87 25.55 18.30 56.77 68.02 51.80 35.91 53.81 (+1.92)
Pseudo Label Lee et al. (2013) 73.03 86.06 65.96 51.42 24.56 27.28 20.81 17.00 53.61 65.26 48.44 33.58 50.40 (-1.49)
Noisy Student Xie et al. (2020a) 75.53 86.52 69.78 55.05 31.56 35.80 26.24 21.21 58.93 69.61 53.73 36.94 55.34 (+3.45)
Mean Teacher Tarvainen and Valpola (2017) 76.01 86.47 70.34 55.92 35.58 40.86 30.44 19.82 63.21 74.89 56.77 40.29 58.27 (+6.38)
SESS Zhao et al. (2020) 72.11 84.06 66.44 53.61 33.44 38.58 28.10 18.67 61.82 73.20 56.60 38.73 55.79 (+3.90)
3DIoUMatch Wang et al. (2020a) 75.69 86.46 70.22 56.06 34.14 38.84 29.19 19.62 58.93 69.08 54.16 38.87 56.25 (+4.36)
Pseudo Label Lee et al. (2013) 72.41 84.06 64.54 50.05 23.62 26.80 20.13 16.66 53.25 64.69 48.52 33.47 49.76 (-2.13)
Noisy Student Xie et al. (2020a) 75.99 86.67 70.48 55.60 33.31 37.81 28.19 21.39 59.81 70.01 55.13 38.33 56.37 (+4.48)
Mean Teacher Tarvainen and Valpola (2017) 76.38 86.45 70.99 57.48 35.95 41.76 29.05 18.81 65.50 75.72 60.07 43.66 59.28 (+7.39)
SESS Zhao et al. (2020) 75.95 86.83 70.45 55.76 34.43 40.00 27.92 19.20 63.58 74.85 58.88 39.51 57.99 (+6.10)
3DIoUMatch Wang et al. (2020a) 75.81 86.11 71.82 57.84 35.70 40.68 30.34 21.15 59.69 70.69 54.92 39.08 57.07 (+5.18)
Table 9: Results of semi-supervised learning methods on the validation split.

Appendix C Implementation details

In this section, we provide implementation and training details for the 3D object detection benchmark.

C.1 Models for 3D Object Detection

General configurations. All the models are trained with the same initial learning rate under a cosine annealing schedule, and with the same batch size and number of epochs. Non-Maximum Suppression (NMS) with a fixed IoU threshold is adopted for post-processing. Other configurations are kept the same as the official versions of those models unless otherwise mentioned.

PointRCNN. PointRCNN is a point-based 3D detector that generates proposals directly from point clouds. We sample a fixed number of points per frame and construct the segmentation backbone with progressively downsampled point sets. We use the mean size of each category for proposal generation.

PointPillars. PointPillars is a pioneering work that introduces the pillar-based representation into 3D object detection. We use a fixed pillar size and use the mean size of each category as the anchor size.

SECOND. SECOND is a voxel-based detector that transforms point clouds into voxels for feature extraction. We use a fixed voxel size and the same anchors as PointPillars.

PV-RCNN. PV-RCNN is a point-voxel based detector that applies SECOND for proposal generation and then utilizes keypoints for RoI feature extraction. We sample a fixed number of keypoints per scene.

CenterPoints. CenterPoints introduces center-based target assignments to replace the anchor-based assignments. In addition to the center head, we use the same backbone as SECOND.

PointPainting. PointPainting uses CenterPoints as the 3D detector and an HRNet trained on Cityscapes to generate semantic segmentation results.

C.2 Self-Supervised Learning for 3D Object Detection

General configurations. We use the voxel-based SECOND detector as the baseline model for all the methods. During the pretraining stage, we pretrain the backbone of the SECOND detector on each unlabeled subset: 20 epochs on the small subset, 5 epochs on the medium subset, and 3 epochs on the full one-million-scene subset. For all the experiments, we use data parallelization on NVIDIA V100 GPUs.

Multi-view augmentation setup. We generate multiple views of the original scene by random flipping, scaling with a factor sampled from [0.95, 1.05], and rotation around the vertical yaw axis by an angle sampled from [-10, 10] degrees. We also downsample the points by a factor sampled from [0.9, 1].
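A minimal sketch of this augmentation pipeline is given below; the flip axis and the exact parameter handling are illustrative assumptions.

```python
# A minimal multi-view augmentation sketch following the ranges stated above.
import numpy as np

def augment_view(points):
    """points: (N, 4) array of (x, y, z, r); returns one augmented view."""
    pts = points.copy()
    if np.random.rand() < 0.5:                        # random flip (assumed over the x-axis)
        pts[:, 1] = -pts[:, 1]
    pts[:, :3] *= np.random.uniform(0.95, 1.05)       # random scaling
    angle = np.deg2rad(np.random.uniform(-10, 10))    # random rotation around the yaw axis
    c, s = np.cos(angle), np.sin(angle)
    pts[:, :2] = pts[:, :2] @ np.array([[c, -s], [s, c]]).T
    keep = np.random.uniform(0.9, 1.0)                # random downsampling factor
    idx = np.random.choice(len(pts), int(len(pts) * keep), replace=False)
    return pts[idx]
```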

PointContrast. PointContrast defines a contrastive loss over point-level features given a pair of overlapping partial scans. The objective is to minimize the distance between matched points (positive pairs) and maximize the distance between unmatched ones (negative pairs). In our setting, we sample a random geometric transformation to transform an original point cloud scene into augmented views. After passing the scenes through the SECOND backbone to obtain voxel-wise features, we randomly select a subset of voxels within each scene. The voxel-wise features are passed through a two-layer MLP with batch normalization and ReLU to project them into a latent space, the latent features are concatenated with the initial features and passed through a one-layer MLP, and the resulting features are used for contrastive pretraining. We pretrain the model using the Adam optimizer.
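The sketch below shows the kind of voxel-level InfoNCE objective described above, assuming matched voxel features from the two views have already been extracted and projected; the feature shapes and the temperature are illustrative assumptions.

```python
# A minimal PointContrast-style contrastive loss sketch over matched voxel features.
import torch
import torch.nn.functional as F

def point_contrast_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (M, C) projected features of M matched voxels in two views."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature                  # (M, M) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # Matched voxels (the diagonal) are positives; all other pairs act as negatives.
    return F.cross_entropy(logits, targets)
```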

DeepCluster. DeepCluster uses k-means clustering to assign each instance a cluster id as its pseudo label and uses these labels to train the network. Since the clustering method is designed to learn semantic representations, we randomly crop patches from 3D scenes as pseudo instances and pass the patches through the backbone to obtain patch-wise features. We project the features into a latent space for clustering and pretraining: each patch-wise feature is passed through a two-layer MLP with batch normalization and ReLU, and this MLP is not used in the finetuning stage. We pretrain the backbone with the Adam optimizer, using an initial learning rate with cosine decay.
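A minimal sketch of one clustering round is given below: patch features are clustered with k-means and the resulting cluster ids serve as pseudo labels for a classification loss. The cluster count and the classifier head shown here are illustrative placeholders, not the paper's configuration.

```python
# A minimal DeepCluster-style round: k-means pseudo labels + classification loss.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deepcluster_round(patch_feats, classifier, num_clusters=128):
    """patch_feats: (P, C) tensor of patch-wise features; classifier: C -> num_clusters head."""
    feats_np = patch_feats.detach().cpu().numpy()
    pseudo_labels = KMeans(n_clusters=num_clusters).fit_predict(feats_np)
    targets = torch.as_tensor(pseudo_labels, dtype=torch.long, device=patch_feats.device)
    logits = classifier(patch_feats)                  # classify patches into their clusters
    return F.cross_entropy(logits, targets)
```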

SwAV. SwAV improves DeepCluster by introducing prototypes, online clustering and swapped predictions. We use the same clustering and training settings as DeepCluster. For other configurations we follow the settings in the original paper.

BYOL. BYOL introduces two networks, referred to as the online network and the target network, that interact and learn from each other. Given a 3D scene, we train the online network to predict the target network's representation of an augmented view of the same scene. In particular, after passing the 3D scene through the backbone, we project the representation through a two-layer MLP; a predictor in the online network, also a two-layer MLP, then projects the embedding into a latent space as the final representation of the online network. We update the target network by a slow-moving average of the online network. To avoid the model collapsing to trivial solutions, we further introduce a contrastive regularization term: following the design in PointContrast, we randomly select some voxel-wise features and apply a contrastive loss between the online and target representations of different views of the same voxel. We pretrain the model using the Adam optimizer.

C.3 Semi-Supervised Learning for 3D Object Detection

General configurations. We use the SECOND detector as the baseline model for all the methods, which guarantees a fair comparison among them. The model is first pretrained on the training split following the configurations in C.1, and then the semi-supervised learning methods are applied to the model using data from the training split as well as the respective unlabeled subset.

Pseudo Label. We use the pretrained model to generate pseudo ground-truth boxes for each unlabeled scene. The model is then trained with the pseudo labels in the unlabeled scenes as well as the real labels in the training split. It is worth noting that we did not apply any augmentation in the semi-supervised learning process, mainly to explore whether augmentations are necessary when a large amount of data is available.
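A minimal sketch of this pseudo-labeling step is shown below; the detector interface and the confidence threshold are illustrative assumptions.

```python
# A minimal pseudo-label generation sketch; the detector output format is assumed.
import torch

@torch.no_grad()
def generate_pseudo_labels(detector, points, score_thresh=0.5):
    detector.eval()
    boxes, scores, labels = detector(points)          # assumed (boxes, scores, labels) output
    keep = scores > score_thresh                      # keep only confident detections
    return boxes[keep], labels[keep]
```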

Mean Teacher. Mean Teacher uses a teacher model and a student model for semi-supervised learning. We first load the pretrained weights into both models, and then the teacher model produces pseudo ground truths to train the student model on the unlabeled subset. A consistency loss is introduced to regularize the two models: we first match the predicted boxes of the student model with the pseudo boxes of the teacher model using the nearest-neighbor criterion, and then apply the Kullback-Leibler divergence between the class predictions of the matched pairs of boxes as the consistency loss. The teacher model is updated by an exponential moving average (EMA) of the student model.
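The sketch below illustrates the two ingredients described above, an EMA update of the teacher from the student and a KL-divergence consistency loss over the class predictions of matched box pairs; the momentum value and the matching step are assumptions, not taken from the paper.

```python
# A minimal Mean Teacher sketch: EMA weight update + KL consistency on matched boxes.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(momentum).add_(s_param.data, alpha=1.0 - momentum)

def consistency_loss(student_cls_logits, teacher_cls_logits):
    """Class logits of matched student/teacher box pairs, both of shape (M, num_classes)."""
    log_p_student = F.log_softmax(student_cls_logits, dim=1)
    p_teacher = F.softmax(teacher_cls_logits, dim=1)
    # KL(teacher || student), averaged over the matched pairs.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean')
```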

Noisy Student. Noisy Student is a self-training approach in which the student is trained with noise, i.e., strong augmentations, using pseudo labels provided by the teacher. After the first round of semi-supervised training, we make the student the new teacher for the second round.

SESS. Self-Ensembling Semi-Supervised (SESS) 3D object detection extends Mean Teacher by introducing another two consistency constraints: size consistency and center consistency, along with class consistency to the matched pairs of boxes. The teacher is also updated by EMA.

3DIoUMatch. 3DIoUMatch introduces an extra IoU prediction head on the detection model, and the predicted IoUs are used to filter out low-quality pseudo boxes. We discard the IoU-guided lower-half suppression and the EMA update scheme, since those components were detrimental to detection performance in our experiments.

C.4 Unsupervised Domain Adaptation for 3D Object Detection

SN. Statistical Normalization (SN) is based on the observation that the domain gap mainly comes from differences in object size between datasets, so this method normalizes the object sizes of the source dataset according to the object statistics of the target domain.
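The sketch below illustrates the core idea: shifting the source annotation sizes by the difference between the target-domain and source-domain mean object sizes (a full implementation would also rescale the points inside each box); the statistics here are placeholder inputs.

```python
# A minimal SN-style size normalization sketch; mean sizes are placeholder inputs.
import numpy as np

def normalize_box_sizes(source_boxes, source_mean_size, target_mean_size):
    """source_boxes: (N, 7) array of (cx, cy, cz, l, w, h, yaw)."""
    boxes = source_boxes.copy()
    delta = np.asarray(target_mean_size) - np.asarray(source_mean_size)  # (l, w, h) shift
    boxes[:, 3:6] += delta
    return boxes
```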

ST3D. ST3D contains two stages: the model is first trained on the source dataset with an augmentation method named random object scaling. Then the model is trained on the target dataset with the aid of pseudo labels and a memory bank.

Appendix D Visualization

Dataset quality analysis. In order to evaluate data quality and provide a fair comparison across different datasets, we propose an approach that uses pretrained models to reflect the respective data quality. Specifically, we first pretrain the same backbone of the SECOND detector Yan et al. (2018) with the self-supervised method DeepCluster Tian et al. (2017) using data from nuScenes Caesar et al. (2020), Waymo Sun et al. (2020) and ONCE respectively, and then we finetune those pretrained models on multiple downstream datasets under the same settings and report their performance. The best-performing model should have the best pretrained backbone, which indicates that its corresponding pretraining dataset has the best data quality. The model pretrained on the ONCE dataset achieves 67.2 moderate mAP on the downstream KITTI Geiger et al. (2013) dataset, significantly outperforming the models pretrained on the Waymo dataset (66.5) and the nuScenes dataset (66.1). On the downstream nuScenes dataset, the model pretrained on ONCE attains a 51.5 NDS score, outperforming the 49.9 of the Waymo-pretrained model. On the downstream Waymo dataset, the model pretrained on ONCE achieves 54.4 L2 mAP, which is better than the 53.9 of the nuScenes-pretrained model. Thus our ONCE dataset has better data quality compared with nuScenes and Waymo. Details can be found in Table 10.

pretrain / downstream KITTI (moderate mAP) nuScenes (NDS) Waymo (L2 mAP)
nuScenes 66.1 - 53.9
Waymo 66.5 49.9 -
ONCE 67.2 51.5 54.4
Table 10: Dataset quality analysis.

Annotation examples. We present an example of annotations in Figure 6.

Figure 6: Example of 3D annotations.

Multi-modality alignments. We present an illustration of the alignments between point clouds and images in Figure 7.

Figure 7: Alignment of point cloud and image.