
Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection

by   Benjin Zhu, et al.

This report presents our method, which won the nuScenes 3D Detection Challenge [17] held at the Workshop on Autonomous Driving (WAD, CVPR 2019). We use sparse 3D convolution to extract rich semantic features, which are then fed into a class-balanced multi-head network to perform 3D object detection. To handle the severe class imbalance inherent in autonomous driving scenarios, we design a class-balanced sampling and augmentation strategy that generates a more balanced data distribution. Furthermore, we propose a balanced grouping head to boost performance for categories with similar shapes. Based on the Challenge results, our method outperforms the PointPillars [14] baseline by a large margin across all metrics, achieving state-of-the-art detection performance on the nuScenes dataset. Code will be released at CBGS.






1 Introduction

Point cloud 3D object detection has recently received growing attention and has become an active research topic in the 3D computer vision community, since it has great potential for applications such as autonomous driving and robot navigation. The KITTI dataset [7] is the most widely used dataset for this task. Recently, NuTonomy released the nuScenes dataset [2], which greatly extends KITTI in dataset size, sensor modalities, categories, and number of annotations. Compared to the KITTI 3D detection benchmark [8], in which objects of 3 categories are located and classified separately, the nuScenes 3D Detection Challenge requires detecting 10 categories at the same time. Moreover, a set of attributes and a velocity must be estimated for each object. Furthermore, the nuScenes dataset [2] suffers from severe class imbalance. As shown in Figure 2, the instance distribution of categories in the nuScenes dataset is long-tailed, exhibiting an extreme imbalance in the number of examples between common and rare object classes. All of the above make the nuScenes 3D Detection Challenge more difficult, yet closer to real-world scenarios.

Existing methods have explored several ways to tackle the 3D object detection task. Several works [3, 13, 15, 29, 14] convert the point cloud into a bird's-eye-view format and apply a 2D CNN to obtain 3D detection results. Voxel-based methods [26, 32, 28] convert the point cloud into regular 3D voxels and then apply a 3D CNN or 3D sparse convolution [10, 9, 5] to extract features for 3D object detection. Point-based methods [19, 27] first use 2D detectors to obtain 2D boxes from the image, and then apply PointNet++ [20, 21] on the cropped point cloud to further estimate the location, size and orientation of 3D objects. Methods that combine the voxel-based and point-based approaches, such as [25, 30, 24], first use PointNet-style networks to acquire high-quality proposals, then apply voxel-based methods to obtain final predictions. However, most of the above methods are trained on each category separately in order to achieve their highest performance. For example, the previous SOTA method PointPillars [14] achieves only very low performance on most of the rare categories (e.g., Bicycle).

Multi-task learning is another technique that we use in the challenge, because multi-category joint detection can be treated as a multi-task learning problem. Many works investigate how to adaptively set the weights of the different tasks. For example, MGDA [23] treats multi-task learning as a multi-objective optimization problem. GradNorm [4] uses gradient normalization to balance the losses of different tasks adaptively. Benefiting from multi-task learning, our method performs better when training all categories jointly than when training each of them individually.

There are 3 tracks in the nuScenes 3D Detection Challenge: Lidar Track, Vision Track, and Open Track. Only lidar input is allowed in the Lidar Track, and only camera input is allowed in the Vision Track. External data or map data is not allowed in these two tracks. In the Open Track, any input is allowed. Pre-training is allowed in all 3 tracks. We participate in the Lidar Track of the challenge. The final leaderboard can be found at [17]. Our contributions in this challenge can be summarized as follows:

  • We propose a class-balanced sampling strategy to handle the extreme imbalance in the nuScenes Dataset.

  • We design a multi-group head network so that categories of similar shapes or sizes can benefit from each other, and categories of different shapes or sizes stop interfering with each other.

  • Together with improvements to the network architecture, loss function, and training procedure, our method achieves state-of-the-art performance on the challenging nuScenes Dataset.


We first introduce our methodology in Section 2. Training details and network settings are presented in Section 3. Results are shown in Section 4. Finally, we conclude in Section 5.


Class          Instance Num   Sample Num   Instance Num After   Sample Num After
Car                  413318        27558              1962556             126811
Truck                 72815        20120               394195             104092
Bus                   13163         9156                70795              49745
Trailer               20701         7276               125003              45573
Constr. Veh.          11993         6770                82253              46710
Pedestrian           185847        22923               962123             110425
Motorcycle            10109         6435                60925              38875
Bicycle                9478         6263                58276              39301
Traffic Cone          82362        12336               534692              73070
Barrier              125095         9269               881469              60443
Total                944881        28130              5132287             128100


Table 1: Instance and sample distribution of the training split before and after dataset sampling (DS Sampling). Column Instance Num gives the number of instances of each category. Column Sample Num gives the number of samples in the training split in which a category appears. Column Instance Num After gives the number of instances of each category after dataset sampling, which expands the training set from 28130 to 128100 samples. Column Sample Num After is the counterpart of Sample Num after dataset sampling. The Total entries for samples give the training dataset size rather than the sum over the categories listed above, since multiple categories can appear in the same point cloud sample.

2 Methodology

The overall network architecture is presented in Figure 3. It is mainly composed of 4 parts: Input Module, 3D Feature Extractor, Region Proposal Network, and Multi-group Head network. Together with improvements to data augmentation, loss function, and training procedure, the network not only performs 3D object detection, velocity prediction and attribute prediction for 10 categories simultaneously, but also achieves better performance than detecting each category separately.

In this section, we first introduce inputs and corresponding data augmentation strategies. Then the 3D Feature Extractor, Region Proposal Network, and Multi-group head network will be explained in detail. Finally, improvements on loss, training procedure as well as other tricks will be introduced.

2.1 Input and Augmentation

The nuScenes dataset provides point cloud sweeps in (x, y, z, intensity, ring index) format, each of them associated with a time-stamp. We follow the official nuScenes baseline [2] by accumulating 10 LiDAR sweeps to form dense point cloud inputs. Specifically, our input is of (x, y, z, intensity, Δt) format. Δt is the time lag between each non-keyframe sweep and the keyframe sweep, and ranges from 0s to 0.45s. We use a grid size of 0.1m, 0.1m, 0.2m along the x, y, z axes respectively to convert the raw point cloud into a voxel representation. In each voxel, we take the mean of all points in that voxel to get the final inputs to the network. No extra data normalization is applied.
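The per-voxel averaging described above can be sketched with NumPy as follows. This is a minimal illustration, not the released implementation: the function name `voxelize_mean`, the five-column point layout, and the absence of any cap on points per voxel are assumptions made for the example.

```python
import numpy as np

def voxelize_mean(points, voxel_size, pc_range):
    """Average all points that fall into the same voxel.

    points:     (N, F) array whose first three columns are x, y, z
                (the remaining columns, e.g. intensity and time lag, are assumed).
    voxel_size: (sx, sy, sz) in meters.
    pc_range:   (xmin, ymin, zmin, xmax, ymax, zmax) in meters.
    Returns integer voxel indices (M, 3) and per-voxel mean features (M, F).
    """
    points = np.asarray(points, dtype=np.float64)
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    size = np.asarray(voxel_size, dtype=np.float64)

    # Keep only points inside the detection range.
    inside = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    points = points[inside]

    # Integer voxel index of every remaining point.
    idx = np.floor((points[:, :3] - lo) / size).astype(np.int64)

    # Group points by voxel and average their features.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(coords), points.shape[1]))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None]
```

A production voxelizer would additionally limit the number of points per voxel and the number of non-empty voxels; those limits are left out here to keep the sketch short.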

As shown in Figure 2, the nuScenes dataset [2] has a severe class imbalance problem. Blue columns show the original distribution of the training split. To alleviate this imbalance, we propose DS Sampling, which generates a smoother instance distribution, as the orange columns indicate. Like the sampling strategy used in image classification, we duplicate samples of a category according to its fraction of all samples: the fewer samples a category has, the more of its samples are duplicated to form the final training dataset. More specifically, we first count, for each category, the number of point cloud samples in the training split that contain it; summed over all categories, this gives 128106 samples. Note that this sum contains duplicates, because objects of different categories can appear in one point cloud sample. Intuitively, to achieve a class-balanced dataset, all categories should have similar proportions in the training split. So for each category we randomly sample 10% of 128106 (12810) point cloud samples from the class-specific samples mentioned above. As a result, we expand the training set from 28130 samples to 128100 samples, which is about 4.5 times larger than the original dataset. To conclude, DS Sampling can be seen as increasing the average density of rare classes in the training split. DS Sampling alleviates the imbalance problem effectively, as shown by the orange columns in Figure 2.
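The sampling procedure above can be illustrated with a short sketch. It assumes each training sample is described by the set of category names it contains, and it draws with replacement so that rare categories can reach their quota; the function name `ds_sampling` and the with-replacement choice are assumptions of this example rather than details stated in the text.

```python
import random

def ds_sampling(sample_classes, seed=0):
    """Return indices of a class-balanced, resampled training set.

    sample_classes: list of sets; sample_classes[i] holds the category
                    names present in point cloud sample i.
    """
    rng = random.Random(seed)

    # Class-specific sample lists: a sample appears once per category it contains.
    per_class = {}
    for idx, classes in enumerate(sample_classes):
        for name in classes:
            per_class.setdefault(name, []).append(idx)

    # Every category receives the same share of the summed sample count
    # (with 10 categories this is 10% of the total, e.g. 12810 of 128106).
    total = sum(len(v) for v in per_class.values())
    quota = int(total * (1.0 / len(per_class)))

    resampled = []
    for name in sorted(per_class):
        # Assumption: draw with replacement so rare classes are duplicated.
        resampled.extend(rng.choices(per_class[name], k=quota))
    return resampled
```

With the counts from Table 1, the quota is int(128106 × 0.1) = 12810 per category, which gives 10 × 12810 = 128100 samples in total.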

Besides, we use the GT-AUG strategy proposed in SECOND [28] to sample ground truths from an annotation database, which is generated offline, and place those sampled boxes into another point cloud. Note that the ground plane of a point cloud sample needs to be computed before object boxes can be placed properly. So we use the least squares method and RANSAC [6] to estimate each sample's ground plane, which can be formulated as Ax + By + Cz + D = 0. Examples of our ground plane detection module can be seen in Figure 1.
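One plausible way to combine RANSAC with a least-squares fit for ground plane estimation is sketched below: RANSAC selects an inlier set from random three-point planes, and a least-squares plane is then fitted to those inliers. The iteration count, distance threshold, and function names are illustrative assumptions, not values from the report.

```python
import numpy as np

def fit_plane_lstsq(pts):
    """Least-squares plane through pts (K, 3).

    Returns a unit normal (A, B, C) and offset D with Ax + By + Cz + D = 0.
    """
    centroid = pts.mean(axis=0)
    # The normal is the direction of least variance of the centered points.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    return normal, -float(normal @ centroid)

def ransac_ground_plane(points, n_iters=100, dist_thresh=0.1, seed=0):
    """Estimate a dominant plane; n_iters and dist_thresh are assumed values."""
    rng = np.random.default_rng(seed)
    xyz = np.asarray(points, dtype=np.float64)[:, :3]
    best_inliers = None
    for _ in range(n_iters):
        # Hypothesize a plane from three random points.
        sample = xyz[rng.choice(len(xyz), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # collinear points, no unique plane
        normal = normal / norm
        d = -float(normal @ sample[0])
        # Points closer than dist_thresh to the plane count as inliers.
        inliers = np.abs(xyz @ normal + d) < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refine with a least-squares fit on the best inlier set.
    normal, d = fit_plane_lstsq(xyz[best_inliers])
    if normal[2] < 0:  # make the normal point upwards
        normal, d = -normal, -d
    return normal, d, best_inliers
```

For a nearly horizontal plane, the ground height is approximately -D / C, which is the quantity needed to place sampled boxes on the ground.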

Figure 1: Examples of ground plane detection results. Points belonging to the ground plane are shown in color; the plane can be formulated as Ax + By + Cz + D = 0. On average, the ground plane is at about -1.82 meters along the z axis. Open3D [31] is used for visualization.

With the help of the above two strategies, the model performs better on all classes, especially tail classes, which shows a clear benefit in alleviating the class imbalance problem.

Figure 2: Class imbalance in the nuScenes Dataset. 50% of the categories account for only a small fraction of total annotations. The distribution of the original training split is shown in blue. The distribution of the sampled training split is shown in orange.

2.2 Network

Figure 3: Network Architecture. The 3D Feature Extractor is composed of submanifold and regular 3D sparse convolutions. Outputs of the 3D Feature Extractor have a 16× downscale ratio; they are flattened along the output axis and fed into the following Region Proposal Network to generate 8× feature maps, followed by the multi-group head network to generate final predictions. The number of groups in the head is set according to the grouping specification.

As shown in Figure 3, we use sparse 3D convolution with skip connections to build a ResNet-like architecture for the 3D feature extractor network. For an N × C × H × W input tensor, the feature extractor outputs an N × l × C/m × H/n × W/n feature map, where m and n are the downscale factors of the z and the x, y dimensions respectively, and l is the number of output channels of the 3D Feature Extractor's last layer. To make the 3D feature maps more suitable for the following Region Proposal Network and the multi-group head, which will be explained in detail in the next subsection, we reshape the feature maps to N × (C × l)/m × H/n × W/n, then use a region proposal network like that of VoxelNet [32] to perform regular 2D convolution and deconvolution, further aggregating features and producing higher resolution feature maps. Based on these feature maps, the multi-group head network is able to detect objects of different categories efficiently and effectively.
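The reshape step that turns the 3D feature volume into a 2D map for the Region Proposal Network can be sketched as follows, assuming a dense tensor with a channel axis followed by a depth (z) axis. The axis order and the function name `to_bev` are assumptions of this example.

```python
import numpy as np

def to_bev(feat3d):
    """Fold the depth axis into the channel axis.

    feat3d: dense array of shape (N, channels, depth, H, W).
    Returns an array of shape (N, channels * depth, H, W) that ordinary
    2D convolutions can consume.
    """
    n, channels, depth, h, w = feat3d.shape
    return feat3d.reshape(n, channels * depth, h, w)
```

Because the reshape is row-major, output channel `c * depth + z` holds the features of input channel `c` at depth index `z`.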

2.3 Class-balanced Grouping

The intrinsic long-tail property poses a multitude of open challenges for object detection, since models are largely dominated by the abundant head classes and degrade on many tail classes. As shown in Figure 2, for example, Car accounts for 43.7% of the annotations in the whole dataset, which is about 40 times the number of Bicycle annotations, making it difficult for a model to learn features of tail classes sufficiently. That is, if the instance numbers of classes sharing a common head differ a lot, there is usually no data for the tail class most of the time. As a result, the corresponding head, shown as the purple parts in Figure 3, will be dominated by the major classes, resulting in poor performance on rare classes. On the other hand, if we put classes of discrepant shapes or sizes together, the regression targets will have larger inter-class variance, which makes classes of different shapes interfere with each other. That is why performance when training different shapes jointly is often lower than when training them individually. Our experiments show that classes of similar shape or size are easier to learn within the same task.

Intuitively, classes of similar shapes or sizes can contribute to each other's performance when trained jointly, because these related categories share common features and can compensate for each other to achieve better detection results together. To this end, we manually divide all categories into several groups following some principles. A particular head in the Multi-group Head module only needs to recognize the classes, and locate the objects, that belong to its group. Two main principles guide us in splitting the 10 classes into groups effectively:

  • Classes of similar shapes or sizes should be grouped. Classes of similar shapes often share many common attributes. For example, all vehicles look similar because they all have wheels and look like a cuboid. Motorcycle and bicycle, and traffic cone and pedestrian, are related in a similar way. By grouping classes of similar shape or size, we logically divide classification into two steps. First the model recognizes 'superclasses', namely groups; then, within each group, different classes share the same head. As a result, different groups learn to model different shape and size patterns, and within a specific group the network is forced to learn the inter-class differences between similar shapes or sizes.

  • Instance numbers of different groups should be balanced properly. The instance numbers of different groups should not vary greatly, otherwise the learning process will be dominated by major classes. So we separate major classes from groups of similar shape or size. For example, Car, Truck and Construction Vehicle have similar shape and size, but Car would dominate the group if we put the 3 classes together, so we take Car as a single group and put Truck and Construction Vehicle together as another group. In this way, we can control the weights of different groups to further alleviate the imbalance problem.

Guided by the above two principles, in the final settings we split 10 classes into 6 groups: (Car), (Truck, Construction Vehicle), (Bus, Trailer), (Barrier), (Motorcycle, Bicycle), (Pedestrian, Traffic Cone). According to our ablation study as shown in Table 4, the class-balanced grouping contributes the most to the final result.
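The six-group split above can be written down as a small configuration, together with the bookkeeping that routes each ground-truth object to its head. This is a minimal sketch: the lowercase class identifiers and the helper names are assumptions chosen for illustration.

```python
# Grouping taken from the text; each inner list is served by one head.
GROUPS = [
    ["car"],
    ["truck", "construction_vehicle"],
    ["bus", "trailer"],
    ["barrier"],
    ["motorcycle", "bicycle"],
    ["pedestrian", "traffic_cone"],
]

def build_class_to_head(groups):
    """Map each class name to (head index, label index inside that head)."""
    mapping = {}
    for head_idx, names in enumerate(groups):
        for label_idx, name in enumerate(names):
            if name in mapping:
                raise ValueError(f"class {name!r} appears in more than one group")
            mapping[name] = (head_idx, label_idx)
    return mapping

def split_targets_by_head(labels, groups):
    """Route ground-truth class names to per-head (object index, local label) lists."""
    mapping = build_class_to_head(groups)
    per_head = [[] for _ in groups]
    for obj_idx, name in enumerate(labels):
        head_idx, label_idx = mapping[name]
        per_head[head_idx].append((obj_idx, label_idx))
    return per_head
```

Each head then only sees the objects of its own group, so a head serving a single class (such as Car) reduces to a binary foreground/background classifier.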


Method               Modality   Map   External   mAP    mATE    mASE    mAOE    mAVE    mAAE    NDS
Point Pillars [14]   Lidar      no    no         30.5   0.517   0.290   0.500   0.316   0.368   45.3
BRAVE [17]           Lidar      no    no         32.4   0.400   0.249   0.763   0.272   0.090   48.4
Tolist [17]          Lidar      no    no         42.0   0.364   0.255   0.438   0.270   0.319   54.5
MEGVII (Ours)        Lidar      no    no         52.8   0.300   0.247   0.380   0.245   0.140   63.3


Table 2: Overall performance. BRAVE and Tolist are the other two top-three teams. Our method achieves the best performance on all metrics except mAAE.


Method               Car    Ped    Bus    Barrier   TC     Truck   Trailer   Moto   Cons. Veh.   Bicycle   Mean
Point Pillars [14]   70.5   59.9   34.4   33.2      29.6   25.0    16.7      20.0   4.50         1.60      29.5
MEGVII (Ours)        81.1   80.1   54.9   65.7      70.9   48.5    42.9      51.5   10.5         22.3      52.8


Table 3: mAP by category compared to PointPillars. Our method shows more competitive and balanced performance on tail classes. For example, Bicycle improves by about 14 times. Motorcycle, Construction Vehicle (Cons. Veh.), Trailer, and Traffic Cone (TC) improve by more than 2 times.


Configuration (components added cumulatively)   mAP     NDS
Baseline                                        35.68   45.17
+ GT-AUG                                        37.69   53.66
+ DB Sampling                                   42.64   56.66
+ Multi-head                                    44.86   58.13
+ Res-Encoder                                   48.64   60.08
+ SE                                            48.14   59.66
+ Heavier Head                                  49.55   60.20
+ WS                                            49.43   60.56
+ Hi-res                                        51.44   62.56


Table 4: Ablation studies for different components used in our method on Validation Split. Database Sampling and Res-Encoder contribute the most to mAP.

2.4 Loss Function

Apart from the regular classification and bounding box regression branches required by 3D object detection, we add an orientation classification branch as proposed in SECOND [28]. It is important to point out that, according to our statistics, most object boxes are parallel or perpendicular to the LiDAR coordinate axes. If orientation classification is applied as it is in SECOND, the mAOE turns out to be very high, because many predicted bounding boxes have an orientation exactly opposite to the ground truth. We therefore add an offset to the orientation classification targets to remove this orientation ambiguity. As for velocity estimation, regression without normalization achieves the best performance compared to adding extra normalization operations.

We use anchors to reduce learning difficulty by importing prior knowledge. Anchors are configured as in VoxelNet [32]. That is, anchors of different classes have different height and width configurations, which are determined by class mean values. Each category has 1 size configuration with 2 different directions. For velocities, the anchor is set to 0 on both the x and y axes. Objects move along the ground, so we do not need to estimate velocity along the z axis.

In each group, we use weighted Focal Loss for classification, smooth-L1 loss for regression, and softmax cross-entropy loss for orientation classification. We do not add an attribute estimation branch because its results are not comparable to simply assigning each category's most common attribute. We further improve attribute estimation by taking velocity into account. For example, most bicycles are without a rider, but if the model predicts that a bicycle's velocity is above a threshold, there should be a rider, so we change the corresponding bicycle's attribute to with rider.
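The velocity-based attribute rule for bicycles can be sketched in a few lines. The threshold value, the attribute strings `"with_rider"` and `"without_rider"`, and the function name are hypothetical; the report does not state the threshold that was used.

```python
import math

def bicycle_attribute(vx, vy, speed_thresh=1.0):
    """Pick a bicycle attribute from the predicted planar velocity.

    vx, vy:       predicted velocity components in m/s.
    speed_thresh: hypothetical speed threshold in m/s.
    """
    speed = math.hypot(vx, vy)
    # Default to the most common attribute; switch only for moving bicycles.
    return "with_rider" if speed > speed_thresh else "without_rider"
```

The same pattern (a per-category default attribute, overridden when the predicted speed exceeds a threshold) could be applied to other categories, with thresholds tuned on the validation split.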

The Multi-group head is taken as a multi-task learning procedure in our experiments. We use Uniform Scaling to configure weights of different branches.

2.5 Other Improvements

Apart from the above improvements, we find that SENet [11] and Weight Standardization [22] can also help in the detection task when used properly. Besides, a heavier head network further improves performance. In our final submission, we ensemble several models of multiple scales to achieve our best performance: mAP 53.2%, NDS 63.78% on the validation split.

3 Training Details

In this section, we explain the implementation details of the data augmentation, the training procedure and the method itself. Our method is implemented in PyTorch. All experiments are trained on NVIDIA 2080Ti GPUs in a distributed manner with synchronized batch normalization.

For this task, we consider the point cloud within the range of [-50.4, 50.4] × [-51.2, 51.2] × [-5, 3] meters along the X, Y, Z axes respectively. We choose a voxel size of 0.1, 0.1, 0.2 meters along X, Y, Z, which leads to a grid of 1008 × 1024 × 40 voxels. The maximum number of points allowed in a voxel is set to 10. When using 10 sweeps (1 keyframe + 9 preceding non-keyframes), the maximum number of non-empty voxels is 60000.
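The grid dimensions follow directly from the range and the voxel size. A small helper (the name `grid_shape` is chosen here for illustration) makes the arithmetic explicit:

```python
def grid_shape(pc_range, voxel_size):
    """Number of voxels along x, y, z.

    pc_range:   (xmin, ymin, zmin, xmax, ymax, zmax) in meters.
    voxel_size: (sx, sy, sz) in meters.
    """
    mins, maxs = pc_range[:3], pc_range[3:]
    # round() absorbs floating-point error in the division, e.g. 100.8 / 0.1.
    return tuple(round((hi - lo) / s) for lo, hi, s in zip(mins, maxs, voxel_size))

shape = grid_shape((-50.4, -51.2, -5.0, 50.4, 51.2, 3.0), (0.1, 0.1, 0.2))
```

Here 100.8 / 0.1 = 1008, 102.4 / 0.1 = 1024 and 8 / 0.2 = 40, so `shape` is the 1008 × 1024 × 40 grid stated above.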

During training, we apply data augmentation consisting of random flip in the x-axis, scaling with a factor sampled from [0.95, 1.05], rotation around the Z axis by an angle in [-0.3925, 0.3925] rad, and translation in the range [0.2, 0.2, 0.2] m along the three axes. For GT-AUG, we first filter out ground truth boxes with fewer than 5 points inside, then randomly select and paste ground truth boxes of different classes onto the ground plane, using the per-class magnitudes shown in Table 5.
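The global augmentations listed above can be sketched for the point cloud alone as follows. Several details are assumptions of this example: the flip is taken to mirror the cloud across the x-axis (negating y), the translation is drawn uniformly from ±0.2 m per axis, and the corresponding transforms of boxes and velocities are omitted.

```python
import numpy as np

def augment_points(points, rng):
    """Apply random flip, scaling, rotation about z, and translation.

    points: (N, F) array whose first three columns are x, y, z.
    rng:    a numpy.random.Generator.
    """
    pts = np.array(points, dtype=np.float64, copy=True)

    # Random flip: mirror across the x-axis with probability 0.5 (assumed).
    if rng.random() < 0.5:
        pts[:, 1] = -pts[:, 1]

    # Global scaling with a factor in [0.95, 1.05].
    pts[:, :3] *= rng.uniform(0.95, 1.05)

    # Global rotation around the z axis by an angle in [-0.3925, 0.3925] rad.
    angle = rng.uniform(-0.3925, 0.3925)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ rot.T

    # Global translation, here uniform in [-0.2, 0.2] m per axis (assumed).
    pts[:, :3] += rng.uniform(-0.2, 0.2, size=3)
    return pts
```

In a full pipeline the same flip, scale, rotation and translation must also be applied to the ground-truth boxes, and the rotation and scale to the velocity targets, so that labels stay consistent with the points.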

3.1 Training Procedure

We use the AdamW [16] optimizer together with the one-cycle policy [1], with a maximum learning rate of 0.04, division factor 10, momentum ranging from 0.95 to 0.85, and fixed weight decay 0.01, to achieve super-convergence. With batch size 5, the model is trained for 20 epochs. During inference, the top 1000 proposals are kept in each group, then NMS with score threshold 0.1 and IoU threshold 0.2 is applied. The maximum number of boxes allowed in each group after NMS is 80.
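A one-cycle schedule with the stated maximum learning rate, division factor and momentum range can be sketched as below. The warm-up fraction `pct_start`, the cosine interpolation, and the choice to return to the starting learning rate at the end are assumptions of this example; the report does not specify them.

```python
import math

def one_cycle(step, total_steps, lr_max=0.04, div_factor=10.0,
              mom_max=0.95, mom_min=0.85, pct_start=0.4):
    """Return (learning rate, momentum) at a given training step."""
    def cos_interp(start, end, t):
        # Smoothly move from `start` (t = 0) to `end` (t = 1).
        return end + (start - end) * (1.0 + math.cos(math.pi * t)) / 2.0

    lr_min = lr_max / div_factor
    warm = max(1, int(total_steps * pct_start))
    if step < warm:
        # Phase 1: learning rate rises while momentum falls.
        t = step / warm
        return cos_interp(lr_min, lr_max, t), cos_interp(mom_max, mom_min, t)
    # Phase 2: learning rate decays while momentum recovers.
    t = (step - warm) / max(1, total_steps - warm)
    return cos_interp(lr_max, lr_min, t), cos_interp(mom_min, mom_max, t)
```

The learning rate starts at 0.04 / 10 = 0.004, peaks at 0.04, and the momentum moves in the opposite direction between 0.95 and 0.85; the fixed weight decay of 0.01 is applied by the optimizer and is not part of the schedule.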


Category       Magnitude
Car            2
Truck          3
Bus            7
Trailer        4
Cons. Veh.     6
Traffic Cone   2
Barrier        6
Bicycle        6
Motorcycle     2
Pedestrian     2


Table 5: GT-AUG magnitudes of different categories. For each category, the magnitude is the number of instances placed into a point cloud sample.

Figure 4: Examples of detection results on the validation split. Ground truth annotations are in green and detection results are in blue. Detection results come from a model with 51.9% mAP and 62.5% NDS. The token on top of each point cloud bird's-eye-view image is its corresponding sample data token.

3.2 Network Details

For the 3D feature extractor, we use 16, 32, 64, 128 layers of sparse 3D convolution respectively for each block. As used in [10], submanifold sparse convolution is used when we downsample the feature map. In other conditions, regular sparse convolution is applied. For the region proposal module, we use 128 and 256 layers respectively for the 16× and 8× downscale layers. In each head, we apply a 1 × 1 Conv to get final predictions. To achieve a heavier head, we first use one 3 × 3 Conv layer to reduce the number of channels, then use a 1 × 1 Conv layer to get final predictions. Batch Normalization [12] is used for all but the last layer.

Anchors of different categories are set according to their mean height and width, with different thresholds when assigning class labels. For categories with sufficient annotations, we set the positive area threshold to 0.6; for categories with fewer annotations, we set the threshold to 0.4.

We use the default focal loss settings from the original paper. For the regression loss, we use a weight of 0.2 for velocity prediction and set the other weights to 1.0 to achieve a balanced and stable training process.
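The two loss terms can be sketched in NumPy as below. The focal loss defaults (alpha = 0.25, gamma = 2) are those of the original focal loss paper; the smooth-L1 transition point `beta`, the function names, and the layout of the regression targets are assumptions of this example.

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Element-wise sigmoid focal loss for binary targets in {0, 1}."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    p = 1.0 / (1.0 + np.exp(-logits))
    # Probability assigned to the true label, and the matching alpha weight.
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # Easy examples (p_t close to 1) are down-weighted by (1 - p_t) ** gamma.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))

def weighted_smooth_l1(pred, target, code_weights, beta=1.0):
    """Element-wise smooth-L1 loss scaled by per-dimension weights."""
    diff = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss * np.asarray(code_weights, dtype=np.float64)
```

With a regression target whose last two entries are the velocities, the weighting described above corresponds to `code_weights` of 1.0 for the box dimensions and 0.2 for the two velocity entries.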

4 Results

In this section, we report our results in detail. We also investigate the contribution of each module to the final result in Table 4.

As shown in Table 2, our method surpasses the official PointPillars [14] baseline by 73.1% in mAP (52.8 vs. 30.5). More specifically, our method shows better performance in all categories, especially in long-tail classes like Bicycle, Motorcycle, Bus, and Trailer. Moreover, our method achieves lower error in translation (mATE), scale (mASE), orientation (mAOE), velocity (mAVE) and attribute (mAAE). Examples of detection results can be seen in Figure 4; our method generates reliable detection results on all categories. The edge with a line attached in each bounding box indicates the vehicle's front.

5 Conclusion

In this report, we present our method and results on the newly released large-scale nuScenes Dataset, which poses more challenges, such as class imbalance, than KITTI on the 3D object detection task. With carefully designed strategies that address class imbalance and multi-class joint detection through data, network and learning objective, we achieve the best result in the WAD challenge. However, there are still few methods that report results on the nuScenes Dataset, so we will release our code in the hope that it can facilitate research on this topic.