Winner's Code and Technical Report for the nuScenes 3D Object Detection challenge in WAD, CVPR 2019
This report presents our method, which won the nuScenes 3D Detection Challenge held in the Workshop on Autonomous Driving (WAD, CVPR 2019). We use sparse 3D convolution to extract rich semantic features, which are then fed into a class-balanced multi-head network to perform 3D object detection. To handle the severe class imbalance inherent in autonomous driving scenarios, we design a class-balanced sampling and augmentation strategy that generates a more balanced data distribution. Furthermore, we propose a balanced grouping head to boost performance for categories with similar shapes. Based on the Challenge results, our method outperforms the PointPillars baseline by a large margin across all metrics, achieving state-of-the-art detection performance on the nuScenes dataset. Code will be released at CBGS.
Point cloud 3D object detection has recently received increasing attention and has become an active research topic in the 3D computer vision community, since it has great potential for applications such as autonomous driving and robot navigation. The KITTI dataset is the most widely used dataset for this task. Recently, NuTonomy released the nuScenes dataset, which greatly extends KITTI in dataset size, sensor modalities, categories, and number of annotations. Compared to the KITTI 3D detection benchmark, in which objects of 3 categories are located and classified separately, the nuScenes 3D Detection Challenge requires detecting 10 categories at the same time. Moreover, a set of attributes and a velocity must be estimated for each object. Furthermore, the nuScenes dataset suffers from severe class imbalance. As shown in Figure 2, the instance distribution over categories in the nuScenes dataset is long-tailed, with an extreme imbalance in the number of examples between common and rare object classes. All of these factors make the nuScenes 3D Detection Challenge more difficult, yet closer to real-world scenarios.
Existing methods have explored several ways to tackle the 3D object detection task. Several works [3, 13, 15, 29, 14] convert the point cloud into a bird's-eye-view format and apply a 2D CNN to obtain 3D object detection results. Voxel-based methods [26, 32, 28] convert the point cloud into regular 3D voxels and then apply 3D CNNs or 3D sparse convolution [10, 9, 5] to extract features for 3D object detection. Point-based methods [19, 27] first use 2D detectors to obtain 2D boxes from the image, and then apply PointNet++ [20, 21] on the cropped point cloud to estimate the location, size, and orientation of 3D objects. Methods that combine voxel-based and point-based approaches, such as [25, 30, 24], first use PointNet-style networks to acquire high-quality proposals and then apply voxel-based methods to obtain the final predictions. However, most of the above methods are trained on each category separately in order to achieve their highest performance. For example, the previous state-of-the-art method PointPillars achieves very low performance on most of the rare categories (e.g., Bicycle).
Multi-task learning is another technique that we use in the challenge, because multi-category joint detection can be treated as a multi-task learning problem. Many works investigate how to adaptively set the weights of the different tasks. For example, MGDA treats multi-task learning as a multi-objective optimization problem, and GradNorm uses gradient normalization to balance the losses of different tasks adaptively. Benefiting from multi-task learning, our method performs better when training all categories jointly than when training each of them individually.
There are 3 tracks in the nuScenes 3D Detection Challenge: Lidar Track, Vision Track, and Open Track. Only lidar input is allowed in the Lidar Track, and only camera input is allowed in the Vision Track. External data and map data are not allowed in these two tracks. In the Open Track, any input is allowed. Pre-training is allowed in all 3 tracks. We participate in the Lidar Track of the challenge. Final leaderboard can be found at . Our contributions in this challenge can be summarized as follows:
We propose a class-balanced sampling strategy to handle the extreme imbalance in the nuScenes dataset.
We design a multi-group head network so that categories of similar shapes or sizes can benefit from each other, while categories of different shapes or sizes no longer interfere with each other.
|Class||Instance Num||Sample Num||Instance Num After||Sample Num After|
The overall network architecture is presented in Figure 3. It is composed of 4 parts: the Input Module, the 3D Feature Extractor, the Region Proposal Network, and the Multi-group Head network. Together with improvements to data augmentation, the loss function, and the training procedure, the network not only performs 3D object detection for 10 categories, velocity prediction, and attribute prediction simultaneously, but also achieves better performance than detecting each category separately.
In this section, we first introduce the inputs and the corresponding data augmentation strategies. We then explain the 3D Feature Extractor, Region Proposal Network, and Multi-group Head network in detail. Finally, we introduce improvements to the loss and training procedure, as well as other tricks.
The nuScenes dataset provides point cloud sweeps in format, each of them associated with a time-stamp. We follow the official nuScenes baseline by accumulating 10 LiDAR sweeps to form dense point cloud inputs. Specifically, our input is of format. is the time lag between each non-keyframe sweep and the keyframe sweep, and ranges from 0s to 0.45s. We use a grid size of 0.1m, 0.1m, 0.2m along the x, y, z axes respectively to convert the raw point cloud into a voxel representation. Within each voxel, we take the mean of all points to obtain the final input to the network. No extra data normalization is applied.
As shown in Figure 2, the nuScenes dataset has a severe class imbalance problem. The blue columns show the original distribution of the training split. To alleviate the imbalance, we propose DS Sampling, which generates a smoother instance distribution, as the orange columns indicate. Like the sampling strategies used in image classification, we duplicate samples of a category according to its fraction of all samples: the fewer samples a category has, the more of its samples are duplicated to form the final training dataset. More specifically, we first count, for each category, the number of point cloud samples in the training split that contain it; summed over all categories, these counts total 128106 samples. Note that this total contains duplicates, because objects of several categories can appear in one point cloud sample. Intuitively, to achieve a class-balanced dataset, all categories should have similar proportions in the training split. We therefore randomly sample 10% of 128106 (12810) point cloud samples for each category from the class-specific samples mentioned above. As a result, we expand the training set from 28130 samples to 128100 samples, which is about 4.5 times the size of the original dataset. In short, DS Sampling can be seen as increasing the average density of rare classes in the training split. It alleviates the imbalance problem effectively, as shown by the orange columns in Figure 2.
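The resampling step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the function name `class_balanced_resample` and the input layout (one set of class names per sample) are hypothetical, and sampling with replacement is assumed so that rare classes with fewer samples than the per-class quota can still fill it.

```python
import random
from collections import defaultdict


def class_balanced_resample(sample_classes, class_names, seed=0):
    """Return a list of sample indices with a roughly equal share per class.

    sample_classes: list of sets; sample_classes[i] holds the class names
                    present in point cloud sample i (hypothetical layout).
    class_names:    list of all detection classes.
    """
    rng = random.Random(seed)

    # Collect, for every class, the indices of samples that contain it.
    per_class = defaultdict(list)
    for idx, present in enumerate(sample_classes):
        for name in present:
            if name in class_names:
                per_class[name].append(idx)

    # Sum of class-specific sample counts (a sample is counted once per class).
    total = sum(len(indices) for indices in per_class.values())
    # Equal quota per class, e.g. 128106 // 10 = 12810 in the setting above.
    quota = total // len(class_names)

    resampled = []
    for name in class_names:
        pool = per_class[name]
        if not pool:
            continue  # class never appears; nothing to draw from
        # Assumption: draw with replacement so small pools can fill the quota.
        resampled.extend(rng.choices(pool, k=quota))
    return resampled
```

With 10 classes and a class-specific total of 128106, the quota is 12810 per class and the resampled list has 128100 entries, matching the numbers quoted above.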
In addition, we use the GT-AUG strategy proposed in SECOND to sample ground truths from an annotation database, which is generated offline, and place the sampled boxes into another point cloud. Note that the ground plane of a point cloud sample must be computed before object boxes can be placed properly. We therefore use the least squares method and RANSAC to estimate each sample's ground plane, which can be formulated as . Examples from our ground plane detection module can be seen in Figure 1.
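As a rough illustration of ground plane estimation, the sketch below runs a plain RANSAC loop over 3D points. It is a sketch under assumptions rather than the authors' module: the plane is parameterized as a*x + b*y + c*z + d = 0 with a unit normal (an assumed, standard form), the function name and the default values for `n_iters` and `threshold` are hypothetical, and the least-squares refinement on the inliers mentioned in the text is omitted for brevity.

```python
import random


def fit_plane_ransac(points, n_iters=200, threshold=0.1, seed=0):
    """Estimate a plane a*x + b*y + c*z + d = 0 from a list of (x, y, z) tuples.

    Returns ((a, b, c, d), inlier_count) with a unit-length normal (a, b, c).
    n_iters and threshold (meters) are hypothetical defaults.
    """
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(n_iters):
        # Hypothesis: the plane through three randomly chosen points.
        p1, p2, p3 = rng.sample(points, 3)
        u = [p2[i] - p1[i] for i in range(3)]
        v = [p3[i] - p1[i] for i in range(3)]
        # Normal vector is the cross product u x v.
        a = u[1] * v[2] - u[2] * v[1]
        b = u[2] * v[0] - u[0] * v[2]
        c = u[0] * v[1] - u[1] * v[0]
        norm = (a * a + b * b + c * c) ** 0.5
        if norm < 1e-9:
            continue  # the three points are (nearly) collinear
        a, b, c = a / norm, b / norm, c / norm
        d = -(a * p1[0] + b * p1[1] + c * p1[2])
        # Count points whose distance to the plane is below the threshold.
        inliers = sum(
            1 for x, y, z in points if abs(a * x + b * y + c * z + d) < threshold
        )
        if inliers > best_inliers:
            best_plane, best_inliers = (a, b, c, d), inliers
    return best_plane, best_inliers
```

In practice one would likely restrict the candidate points to a low height band first and refit the plane on the final inliers with least squares; both steps are left out of this sketch.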
With these two strategies, the model performs better on all classes, and especially on tail classes, which shows that they clearly help alleviate the class imbalance problem.
As shown in Figure 3, we use sparse 3D convolution with skip connections to build a ResNet-like architecture for the 3D feature extractor network. For a input tensor, the feature extractor outputs a feature map, is the downscale factor of z, x, y dimensions respectively, is output channel of 3D Feature Extractor's last layer. To make the 3D feature maps more suitable for the following Region Proposal Network and multi-group head, which are explained in the next subsection, we reshape the feature maps to , then use a region proposal network like that of VoxelNet to perform regular 2D convolution and deconvolution, further aggregating features and producing higher-resolution feature maps. Based on these feature maps, the multi-group head network is able to detect objects of different categories efficiently and effectively.
The intrinsic long-tail property poses many open challenges for object detection, since models are largely dominated by the abundant head classes while performance degrades on the many tail classes. As shown in Figure 2, for example, Car accounts for 43.7% of the annotations in the whole dataset, which is 40 times the number of Bicycle annotations, making it difficult for a model to learn features of tail classes sufficiently. That is, if the instance numbers of classes sharing a common head differ a lot, the tail class receives little or no data most of the time. As a result, the corresponding head, shown as the purple parts in Figure 3, will be dominated by the major classes, resulting in poor performance on rare classes. On the other hand, if we put classes of very different shapes or sizes together, the regression targets have larger inter-class variance, which makes classes of different shapes interfere with each other. That is why performance when training different shapes jointly is often lower than when training them individually. Our experiments show that classes of similar shape or size are easier to learn within the same task.
Intuitively, classes of similar shapes or sizes can contribute to each other's performance when trained jointly, because related categories share common features and can therefore compensate for each other to achieve better detection results together. To this end, we manually divide all categories into several groups following a few principles. A particular head in the Multi-group Head module only needs to recognize the classes of its group and locate the objects belonging to them. Two main principles guide how we split the 10 classes into groups:
Classes of similar shapes or sizes should be grouped. Classes of similar shapes often share many common attributes. For example, all vehicles look similar because they all have wheels and are roughly cuboid. Motorcycle and Bicycle, as well as Traffic Cone and Pedestrian, are related in a similar way. By grouping classes of similar shape or size, we logically divide classification into two steps. First the model recognizes 'superclasses', namely groups; then, within each group, the different classes share the same head. As a result, different groups learn to model different shape and size patterns, and within a specific group the network is forced to learn the inter-class differences between similar shapes or sizes.
Instance numbers of different groups should be balanced properly. The instance numbers of different groups should not vary greatly, since otherwise the learning process is dominated by the major classes. We therefore separate major classes from groups of similar shape or size. For example, Car, Truck, and Construction Vehicle have similar shapes and sizes, but Car would dominate the group if we put the 3 classes together, so we treat Car as a single group and put Truck and Construction Vehicle together in another group. In this way, we can control the weights of different groups to further alleviate the imbalance problem.
Guided by these two principles, in the final setting we split the 10 classes into 6 groups: (Car), (Truck, Construction Vehicle), (Bus, Trailer), (Barrier), (Motorcycle, Bicycle), (Pedestrian, Traffic Cone). According to our ablation study, shown in Table 4, the class-balanced grouping contributes the most to the final result.
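The six groups listed above can be written down as a small configuration, from which each class is mapped to the head that predicts it and to its label inside that head. This is a minimal sketch: the lower-case class identifiers and the helper name `build_class_to_head` are assumptions made for illustration and need not match the released code.

```python
# Six groups as listed in the text; identifier spelling is an assumption.
GROUPS = [
    ["car"],
    ["truck", "construction_vehicle"],
    ["bus", "trailer"],
    ["barrier"],
    ["motorcycle", "bicycle"],
    ["pedestrian", "traffic_cone"],
]


def build_class_to_head(groups):
    """Map each class name to (head index, class index within that head)."""
    mapping = {}
    for head_idx, names in enumerate(groups):
        for local_idx, name in enumerate(names):
            mapping[name] = (head_idx, local_idx)
    return mapping


CLASS_TO_HEAD = build_class_to_head(GROUPS)
```

Under this layout each head only needs as many classification outputs as its group has classes, so the Car head is a single-class detector while the Motorcycle/Bicycle head separates two similar shapes.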
|Point Pillars ||Lidar||30.5||0.517||0.290||0.500||0.316||0.368||45.3|
|Point Pillars ||70.5||59.9||34.4||33.2||29.6||25.0||16.7||20.0||4.50||1.60||29.5|
|GT-AUG||DB Sampling||Multi-head||Res-Encoder||SE||Heavier Head||WS||Hi-res||mAP||NDS|
Apart from the regular classification and bounding box regression branches required for 3D object detection, we add an orientation classification branch as proposed in SECOND. It is important to point out that, according to our statistics, most object boxes are parallel or perpendicular to the LiDAR coordinate axes. If orientation classification is applied as it is in SECOND, the mAOE turns out to be very high, because the orientation of many predicted bounding boxes is exactly opposite to the ground truth. We therefore add an offset to the orientation classification targets to remove this orientation ambiguity. As for velocity estimation, regression without normalization achieves the best performance compared to adding extra normalization operations.
We use anchors to reduce learning difficulty by importing prior knowledge. Anchors are configured as in VoxelNet. That is, anchors of different classes have different height and width configurations, which are determined by the class mean values. Each category has 1 size configuration with 2 different directions. For velocities, the anchor is set to 0 along both the x and y axes. Objects move along the ground, so we do not need to estimate velocity along the z axis.
In each group, we use weighted Focal Loss for classification, smooth-L1 loss for regression, and softmax cross-entropy loss for orientation classification. We do not add an attribute estimation branch, because its results fall short of simply assigning each category its most common attribute. We further improve attribute estimation by taking velocity into account. For example, most bicycles are , but if the model predicts a bicycle's velocity is above a threshold, there should be riders so we change corresponding bicycle's attribute to .
The Multi-group head is taken as a multi-task learning procedure in our experiments. We use Uniform Scaling to configure weights of different branches.
Apart from the above improvements, we find that SENet and Weight Standardization can also help in the detection task when used properly. In addition, a heavier head network improves performance further. In our final submission, we ensemble several models of multiple scales to achieve our best performance: mAP 53.2%, NDS 63.78% on the validation split.
In this section, we explain the implementation details of the data augmentation, the training procedure, and the method itself. Our method is implemented in PyTorch. All experiments are trained in a distributed fashion on NVIDIA 2080Ti GPUs with synchronized batch normalization support.
For this task, we consider the point cloud within the range of [-50.4, 50.4] × [-51.2, 51.2] × [-5, 3] meters along the X, Y, Z axes respectively. We choose a voxel size of 0.1, 0.1, 0.2 meters along X, Y, Z, which leads to a grid of 1008 × 1024 × 40 voxels. The maximum number of points allowed in a voxel is set to 10. When using 10 sweeps (1 keyframe + 9 preceding non-keyframes), the maximum number of non-empty voxels is 60000.
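The grid dimensions follow directly from dividing each axis range by the voxel size. The short sketch below reproduces that arithmetic and shows how a single point would be mapped to a voxel index; the function names are hypothetical and the code is only meant to illustrate the computation, not the actual voxelization used for training.

```python
PC_RANGE = [(-50.4, 50.4), (-51.2, 51.2), (-5.0, 3.0)]  # (min, max) per X, Y, Z in meters
VOXEL_SIZE = (0.1, 0.1, 0.2)                            # meters per X, Y, Z


def voxel_grid_shape(pc_range, voxel_size):
    """Number of voxels along each axis; round() absorbs float error."""
    return tuple(
        round((hi - lo) / size) for (lo, hi), size in zip(pc_range, voxel_size)
    )


def point_to_voxel(point, pc_range, voxel_size):
    """Integer voxel index of an (x, y, z) point, or None if out of range."""
    index = []
    for value, (lo, hi), size in zip(point, pc_range, voxel_size):
        if not (lo <= value < hi):
            return None
        index.append(int((value - lo) // size))
    return tuple(index)
```

For the ranges above, `voxel_grid_shape(PC_RANGE, VOXEL_SIZE)` evaluates to `(1008, 1024, 40)`, since 100.8 / 0.1 = 1008, 102.4 / 0.1 = 1024, and 8 / 0.2 = 40.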
During training, we apply the following data augmentation: random flip along the x-axis, scaling with a factor sampled from [0.95, 1.05], rotation around the Z axis in the range [-0.3925, 0.3925] rad, and translation in the range [0.2, 0.2, 0.2] m along all axes. For GT-AUG, we first filter out ground truth boxes with fewer than 5 points inside, then randomly select ground truth boxes of different classes and paste them onto the ground plane, using a different magnitude for each class as shown in Table 5.
with LR max 0.04, division factor 10, momentum ranging from 0.95 to 0.85, and fixed weight decay 0.01 to achieve super-convergence. With batch size 5, the model is trained for 20 epochs. During inference, the top 1000 proposals are kept in each group, then NMS with score threshold 0.1 and IoU threshold 0.2 is applied. The maximum number of boxes allowed in each group after NMS is 80.
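The per-group inference steps (keep the top 1000 proposals, apply the score and IoU thresholds, cap the output at 80 boxes) can be sketched as follows. Note the simplification: this sketch uses axis-aligned bird's-eye-view boxes `(x_min, y_min, x_max, y_max)` and a plain greedy NMS, whereas the detector predicts rotated 3D boxes; the function names and box layout are assumptions for illustration only.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess_group(boxes, scores, pre_max=1000, score_thr=0.1,
                      iou_thr=0.2, post_max=80):
    """Return indices of the boxes kept for one group, best score first."""
    # Keep the highest-scoring proposals, then drop low-confidence ones.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = [i for i in order[:pre_max] if scores[i] >= score_thr]

    keep = []
    for i in order:
        # Greedy NMS: keep a box only if it does not overlap a kept box too much.
        if all(bev_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == post_max:
                break
    return keep
```

Because each group is post-processed independently, a low IoU threshold such as 0.2 only suppresses overlaps among classes of the same group.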
|Category||Car||Truck||Bus||Trailer||Cons. Veh.||Traffic Cone||Barrier||Bicycle||Motorcycle||Pedestrian|
For the 3D feature extractor, we use 16, 32, 64, 128 layers of sparse 3D convolution respectively for each block. As used in , submanifold sparse convolution is used when we downsample the feature map. In other conditions, regular sparse convolution is applied. For the region proposal module, we use 128 and 256 layers respectively for the layers with downscale ratio 16 and 8. In each head, we apply a 1 × 1 Conv to get the final predictions. To obtain a heavier head, we first use one 3 × 3 Conv layer to reduce channels by , then use a 1 × 1 Conv layer to get the final predictions. Batch Normalization is used for all but the last layer.
Anchors of different categories are set according to their mean height and width, with different thresholds for assigning class labels. For categories with sufficient annotations, we set the positive area threshold to 0.6; for categories with fewer annotations, we set the threshold to 0.4.
We use the default focal loss settings from the original paper. For regression, we use a weight of 0.2 for velocity prediction and set the other weights to 1.0 to achieve a balanced and stable training process.
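For reference, a scalar version of the sigmoid focal loss with the defaults from the original focal loss paper (alpha = 0.25, gamma = 2) looks like the sketch below. This is an illustrative sketch, not the training code: it handles one logit at a time, the function name is hypothetical, and the additional per-class weighting implied by "weighted Focal Loss" is not specified in the text and is therefore left out.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for a single binary prediction.

    logit:  raw network output for one anchor/class pair.
    target: 1 for a positive anchor, 0 for a negative one.
    """
    p = 1.0 / (1.0 + math.exp(-logit))          # predicted probability
    p_t = p if target == 1 else 1.0 - p         # probability of the true label
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

The modulating factor is what keeps the large number of easy background anchors from dominating the classification loss.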
In this section we report our results in detail. We also investigate contributions of each module to the final result in Table 4.
As shown in Table 2, our method surpasses the official PointPillars baseline by 73.1%. More specifically, our method performs better in all categories, especially in long-tail classes such as Bicycle, Motorcycle, Bus, and Trailer. Moreover, our method achieves lower error in translation (mATE), scale (mASE), orientation (mAOE), velocity (mAVE), and attribute (mAAE). Examples of detection results can be seen in Figure 4; our method generates reliable detection results on all categories. The edge of the bounding box with a line attached indicates the vehicle's front.
In this report, we present our method and results on the newly released large-scale nuScenes dataset, which poses more challenges than KITTI on the 3D object detection task, such as class imbalance. With carefully designed strategies that address class imbalance and multi-class joint detection through the data, the network, and the learning objective, we achieve the best result in the WAD challenge. However, there are still few methods that report results on the nuScenes dataset, so we will release our code in the hope that it facilitates research on this topic.