Many voxel-based 3D single stage detectors have been proposed, while point-based single stage methods remain underexplored. In this paper, we first present a lightweight and effective point-based 3D single stage object detector, named 3DSSD, which achieves a good balance between accuracy and efficiency. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in all existing point-based methods, are abandoned to reduce the large computation cost. We propose a fusion sampling strategy in the downsampling process to make detection on fewer representative points feasible. A delicate box prediction network, consisting of a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, is designed to meet our demands for accuracy and speed. Our paradigm is an elegant single stage anchor-free framework that compares favorably with existing methods. We evaluate 3DSSD on the widely used KITTI dataset and the more challenging nuScenes dataset. Our method outperforms all state-of-the-art voxel-based single stage methods by a large margin and has comparable performance to two stage point-based methods, with an inference speed of more than 25 FPS, 2x faster than former state-of-the-art point-based methods.
Although great breakthroughs have been made in 2D detection, it is inappropriate to translate these 2D methods to 3D directly because of the unique characteristics of point clouds. Compared to 2D images, point clouds are sparse, unordered and locality sensitive, which makes it difficult to parse them directly with convolutional neural networks (CNNs). Therefore, how to convert and utilize raw point cloud data has become the primary problem in the detection task.
Some existing methods convert point clouds from a sparse form to compact representations by projecting them to images [4, 12, 9, 21, 6], or by subdividing them into equally distributed voxels [19, 29, 36, 32, 31, 13]. We call these methods voxel-based methods, since they conduct voxelization on the whole point cloud. Features in each voxel are generated by either PointNet-like backbones [24, 25] or hand-crafted features. Many 2D detection paradigms can then be applied on the compact voxel space without extra effort. Although these methods are straightforward and efficient, they suffer from information loss during voxelization and encounter a performance bottleneck.
Another stream is point-based methods, like [34, 35, 26]. They take raw point clouds as input, and predict bounding boxes based on each point. Specifically, they are composed of two stages. In the first stage, they first utilize set abstraction (SA) layers for downsampling and extracting context features. Afterwards, feature propagation (FP) layers are applied for upsampling and broadcasting features to points which are discarded during downsampling. A 3D region proposal network (RPN) is then applied for generating proposals centered at each point. Based on these proposals, a refinement module is developed as the second stage to give final predictions. These methods achieve better performance, but their inference time is usually intolerable in many real-time systems.
Different from all previous methods, we first develop a lightweight and efficient point-based 3D single stage object detection framework. We observe that in point-based methods, FP layers and the refinement stage consume half of the inference time, motivating us to remove these two modules. However, it is non-trivial to abandon FP layers. Under the current sampling strategy in SA, i.e., furthest point sampling based on 3D Euclidean distance (D-FPS), foreground instances with few interior points may lose all points after sampling. Consequently, it is impossible for them to be detected, which leads to a huge performance drop. In STD, conducting detection on the remaining downsampled points without upsampling reduces performance by about 9%. That is why FP layers must be adopted for point upsampling, although a large amount of extra computation is introduced. To deal with this dilemma, we first propose a novel sampling strategy based on feature distance, called F-FPS, which effectively preserves interior points of various instances. Our final sampling strategy is a fusion of F-FPS and D-FPS.
To fully exploit the representative points retained after SA layers, we design a delicate box prediction network, which utilizes a candidate generation layer (CG), an anchor-free regression head and a 3D center-ness assignment strategy. In the CG layer, we first shift representative points from F-FPS to generate candidate points. This shifting operation is supervised by the relative locations between these representative points and the centers of their corresponding instances. Then, we treat these candidate points as centers, find their surrounding points from the whole set of representative points from both F-FPS and D-FPS, and extract their features through multi-layer perceptron (MLP) networks. These features are finally fed into an anchor-free regression head to predict 3D bounding boxes. We also design a 3D center-ness assignment strategy which assigns higher classification scores to candidate points closer to instance centers, so as to retrieve more precise localization predictions.
We evaluate our method on the widely used KITTI dataset and the more challenging nuScenes dataset. Experiments show that our model outperforms all state-of-the-art voxel-based single stage methods by a large margin, achieving comparable performance to all two stage point-based methods at a much faster inference speed. In summary, our primary contributions are as follows.
We first propose a lightweight and effective point-based 3D single stage object detector, named 3DSSD. In our paradigm, we remove FP layers and the refinement module, which are indispensable in all existing point-based methods, leading to a large reduction in the inference time of our framework.
A novel fusion sampling strategy in SA layers is developed to keep adequate interior points of different foreground instances, which preserves rich information for regression and classification.
We design a delicate box prediction network, which further improves both the effectiveness and the efficiency of our framework. Experimental results on the KITTI and nuScenes datasets show that our framework outperforms all single stage methods and has comparable performance to state-of-the-art two stage methods at a much faster speed of 38ms per scene.
There are several methods exploring how to fuse information from multiple sensors for object detection. MV3D projects the LiDAR point cloud to bird's-eye view (BEV) in order to generate proposals. These proposals, together with other information from the image, front view and BEV, are then sent to the second stage to predict final bounding boxes. AVOD extends MV3D by introducing image features in the proposal generation stage. MMF fuses information from depth maps, LiDAR point clouds, images and maps to accomplish multiple tasks including depth completion, 2D object detection and 3D object detection. These tasks benefit each other and enhance the final performance on 3D object detection.
There are mainly two streams of methods dealing with 3D object detection using only LiDAR data. One is voxel-based, which applies voxelization on the entire point cloud. The difference among voxel-based methods lies in the initialization of voxel features. Some methods encode each non-empty voxel with 6 statistical quantities computed from the points within it, while others use a binary encoding for each voxel grid. VoxelNet utilizes PointNet to extract features of each voxel. SECOND applies sparse convolution layers for parsing the compact representation. PointPillars treats pseudo-images as the representation after voxelization.
The other stream is point-based. These methods take the raw point cloud as input and generate predictions based on each point. F-PointNet and IPOD adopt 2D mechanisms like detection or segmentation to filter out most useless points, and generate predictions from the remaining useful points. PointRCNN utilizes a PointNet++ with SA and FP layers to extract features for each point, proposes a region proposal network (RPN) to generate proposals, and applies a refinement module to predict bounding boxes and class labels. These methods outperform voxel-based ones, but their long inference time prevents them from being applied in real-time autonomous driving systems. STD tries to take advantage of both point-based and voxel-based methods. It uses the raw point cloud as input, applies PointNet++ to extract features, proposes a PointsPool layer for converting features from sparse to dense representations and utilizes CNNs in the refinement module. Although it is faster than all former point-based methods, it is still much slower than voxel-based methods. As analyzed before, all point-based methods are composed of two stages: a proposal generation module including SA layers and FP layers, and a refinement module as the second stage for accurate predictions. In this paper, we propose to remove FP layers and the refinement module so as to speed up point-based methods.
In this section, we first analyze the bottleneck of point-based methods, and describe our proposed fusion sampling strategy. Next, we present the box prediction network including a candidate generation layer, anchor-free regression head and our 3D center-ness assignment strategy. Finally, we discuss the loss function. The whole framework of 3DSSD is illustrated in Figure 1.
Currently, there are two streams of methods in 3D object detection: point-based and voxel-based frameworks. Albeit accurate, point-based methods are more time-consuming than voxel-based ones. We observe that all current point-based methods [35, 26, 34] are composed of two stages, a proposal generation stage and a prediction refinement stage. In the first stage, SA layers are applied to downsample points for better efficiency and larger receptive fields, while FP layers are applied to broadcast features to the points dropped during downsampling so as to recover all points. In the second stage, a refinement module optimizes proposals from the RPN to get more accurate predictions. SA layers are necessary for extracting features of points, but FP layers and the refinement module limit the efficiency of point-based methods, as illustrated in Table 1. Therefore, we are motivated to design a lightweight and effective point-based single stage detector.
However, it is non-trivial to remove FP layers. As mentioned before, SA layers in the backbone utilize D-FPS to choose a subset of points as the downsampled representative points. Without FP layers, the box prediction network operates on those surviving representative points only. Nonetheless, this sampling method only takes the relative locations among points into consideration. Consequently, a large portion of the surviving representative points are actually background points, such as ground points, because background points are so numerous. In other words, several foreground instances are totally erased through this process, making them impossible to detect.
With a limit on the total number of representative points $N_m$, the inner points of some remote instances are unlikely to be selected, because they are far fewer than the background points. The situation becomes even worse on more complex datasets like the nuScenes dataset. To illustrate this fact statistically, we use points recall, defined as the quotient between the number of instances whose interior points survive in the sampled representative points and the total number of instances. As illustrated in the first row of Table 2, with 1024 or 512 representative points, the points recall of D-FPS is only 65.9% or 51.8% respectively, which means that one third to nearly half of the instances are totally erased and cannot be detected. To avoid this circumstance, most existing methods apply FP layers to recall the useful points abandoned during downsampling, but they pay for this with extra computation and longer inference time.
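The points recall metric described above can be sketched in a few lines. This is a minimal illustration, assuming each point carries an integer instance id with -1 for background; the function name `points_recall` and the toy data are hypothetical and not part of the original work.

```python
import numpy as np

def points_recall(instance_ids, sampled_idx):
    """Fraction of instances that keep at least one interior point after sampling.

    instance_ids: (N,) int array, instance id per point, -1 for background (assumed encoding).
    sampled_idx:  indices of the points kept by the sampler.
    """
    all_instances = np.unique(instance_ids[instance_ids >= 0])
    if all_instances.size == 0:
        return 1.0  # no instance in the scene, nothing to recall
    kept = instance_ids[sampled_idx]
    kept_instances = np.unique(kept[kept >= 0])
    return kept_instances.size / all_instances.size

# Toy scene: 8 points, two instances (ids 0 and 1), the rest background.
ids = np.array([-1, -1, 0, 0, -1, 1, -1, -1])
recall_a = points_recall(ids, np.array([0, 2, 4, 6]))  # instance 1 is lost
recall_b = points_recall(ids, np.array([2, 5]))        # both instances survive
```

In the toy scene, the first sampler keeps only one of the two instances, so half of the instances could no longer be detected.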
Table 1: Running time of each component of a point-based model composed of 4 SA layers and 4 FP layers for feature extraction, and a refinement module with 3 SA layers for prediction.
|Methods||SA layers (ms)||FP layers (ms)||Refinement Module (ms)|
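Table 2: Points recall of different sampling strategies on the nuScenes dataset. Column headers give the number of representative points.
|Methods||4096||1024||512|
|D-FPS||99.7 %||65.9 %||51.8 %|
|F-FPS ($\lambda$=0.0)||99.7 %||83.5 %||68.4 %|
|F-FPS ($\lambda$=0.5)||99.7 %||84.9 %||74.9 %|
|F-FPS ($\lambda$=1.0)||99.7 %||89.2 %||76.1 %|
|F-FPS ($\lambda$=2.0)||99.7 %||86.3 %||73.7 %|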
In order to preserve positive points (interior points within any instance) and erase meaningless negative points (points located on the background), we have to consider not only the spatial distance but also the semantic information of each point during the sampling process. We note that semantic information is well captured by the deep neural network. Therefore, when the feature distance is used as the criterion in FPS, most similar and useless negative points, such as the massive number of ground points, are removed. Positive points of remote objects can also survive, because the semantic features of points from different objects are distinct from each other.
However, taking the semantic feature distance as the sole criterion will preserve many points within the same instance, which introduces redundancy as well. For example, given a car, there is much difference between the features of points around the windows and those around the wheels. As a result, points around both parts will be sampled, although a point from either part alone is informative enough for regression. To reduce the redundancy and increase the diversity, we apply both spatial distance and semantic feature distance as the criterion in FPS. It is formulated as
$$C(A, B) = \lambda L_d(A, B) + L_f(A, B),$$
where $L_d(A, B)$ and $L_f(A, B)$ represent the L2 XYZ distance and the L2 feature distance between two points $A$ and $B$, and $\lambda$ is the balance factor. We call this sampling method Feature-FPS (F-FPS). The comparison among different values of $\lambda$ is shown in Table 2, which demonstrates that combining the two distances as the criterion in the downsampling operation is more powerful than only using the feature distance, i.e., the special case where $\lambda$ equals 0. Moreover, as illustrated in Table 2, using F-FPS with 1024 representative points and setting $\lambda$ to 1 guarantees that 89.2% of instances are preserved on the nuScenes dataset, which is 23.3 percentage points higher than with the D-FPS sampling strategy.
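To make the sampling criterion concrete, the following is a minimal NumPy sketch of furthest point sampling under a combined distance. It assumes the criterion is the balance factor times the L2 XYZ distance plus the L2 feature distance; the function name `f_fps`, the `start` argument and the toy scene are hypothetical, and a practical implementation would run on the GPU.

```python
import numpy as np

def f_fps(xyz, feats, n_samples, lam=1.0, start=0):
    """Greedy furthest point sampling under lam * spatial + feature distance (assumed form)."""
    selected = [start]
    min_dist = np.full(xyz.shape[0], np.inf)
    for _ in range(n_samples - 1):
        last = selected[-1]
        d_xyz = np.linalg.norm(xyz - xyz[last], axis=1)
        d_feat = np.linalg.norm(feats - feats[last], axis=1)
        # Distance of every point to its nearest already-selected point.
        min_dist = np.minimum(min_dist, lam * d_xyz + d_feat)
        selected.append(int(np.argmax(min_dist)))
    return np.array(selected)

# Toy scene: four ground points with identical features, two object points.
xyz = np.array([[0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0],
                [14, 1, 0], [15, 1, 0]], dtype=float)
feats = np.array([[0, 0]] * 4 + [[50, 0]] * 2, dtype=float)

d_fps_idx = f_fps(xyz, np.zeros_like(feats), 2)  # constant features: spatial distance only
f_fps_idx = f_fps(xyz, feats, 2, lam=1.0)        # combined criterion
```

In this toy scene the spatial-only sampler picks the farthest ground point as its second sample, whereas the combined criterion picks one of the two object points.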
A large number of positive points within different instances are preserved after SA layers thanks to F-FPS. However, with a fixed total number of representative points $N_m$, many negative points are discarded during the downsampling process, which benefits regression but hampers classification. That is, during the grouping stage in an SA layer, which aggregates features from neighboring points, a negative point is unable to find enough surrounding points, making it impossible to enlarge its receptive field. As a result, the model has difficulty distinguishing positive from negative points, leading to poor classification performance. Our ablation study also demonstrates this limitation. Although the model with F-FPS has a higher recall rate and better localization accuracy than the one with D-FPS, it tends to treat many negative points as positive ones, leading to a drop in classification accuracy.
As analyzed above, after an SA layer we should not only sample as many positive points as possible but also gather enough negative points for more reliable classification. We present a novel fusion sampling strategy (FS), in which both F-FPS and D-FPS are applied during an SA layer, to retain more positive points for localization and keep enough negative points for classification as well. Specifically, we sample points separately with F-FPS and D-FPS and feed the two sets together to the following grouping operation in an SA layer.
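A minimal sketch of fusion sampling follows. It assumes the point budget is split equally between the two samplers and that the F-FPS criterion is the balance factor times the spatial distance plus the feature distance; both are assumptions for illustration, and all function names here are hypothetical.

```python
import numpy as np

def pairwise_l2(x):
    """Dense (N, N) matrix of L2 distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def fps_from_dist(dist, n_samples, start=0):
    """Furthest point sampling on a precomputed (N, N) distance matrix."""
    selected = [start]
    min_dist = dist[start].copy()
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected

def fusion_sampling(xyz, feats, n_total, lam=1.0):
    """Return (F-FPS indices, D-FPS indices); the equal split is an assumption."""
    d_xyz = pairwise_l2(xyz)
    d_feat = pairwise_l2(feats)
    half = n_total // 2
    idx_f = fps_from_dist(lam * d_xyz + d_feat, half)  # feature-aware sampling
    idx_d = fps_from_dist(d_xyz, n_total - half)       # purely spatial sampling
    return np.array(idx_f), np.array(idx_d)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(32, 3))
feats = rng.normal(size=(32, 8))
idx_f, idx_d = fusion_sampling(xyz, feats, 8)
```

Both index sets would then be passed together to the grouping step, so that the F-FPS points can still aggregate context from the D-FPS points around them.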
After the backbone network implemented by several SA layers intertwined with fusion sampling, we gain a subset of points from both F-FPS and D-FPS, which are used for final predictions. In former point-based methods, another SA layer should be applied to extract features before the prediction head. There are three steps in a normal SA layer, including center point selection, surrounding points extraction and semantic feature generation.
In order to further reduce computation cost and fully utilize the advantages of fusion sampling, we present a candidate generation layer (CG) before our prediction head, which is a variant of the SA layer. Since most representative points from D-FPS are negative points and useless in bounding box regression, we only use those from F-FPS as initial center points. These initial center points are shifted under the supervision of their relative locations to their corresponding instances, as illustrated in Figure 2, in the same way as VoteNet. We call the new points after the shifting operation candidate points, and we treat these candidate points as the center points in our CG layer. We use candidate points rather than original points as the center points for the sake of performance, which will be discussed in detail later. Next, we find the surrounding points of each candidate point from the whole representative point set, containing points from both D-FPS and F-FPS, within a pre-defined range threshold, concatenate their normalized locations and semantic features as input, and apply MLP layers to extract features. These features are sent to the prediction head for regression and classification. This entire process is illustrated in Figure 1.
With the fusion sampling strategy and the CG layer, our model can safely remove the time-consuming FP layers and the refinement module. In the regression head, we are faced with two options: an anchor-based or an anchor-free prediction network. If an anchor-based head is adopted, we have to construct multi-scale and multi-orientation anchors so as to cover objects with varying sizes and orientations. Especially in complex scenes like those in the nuScenes dataset, where objects are from 10 different categories with a wide range of orientations, we need at least 20 anchors, covering 10 different sizes and 2 different orientations, in an anchor-based model. To avoid the cumbersome setting of multiple anchors and to be consistent with our lightweight design, we utilize an anchor-free regression head.
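The grouping step of the CG layer can be sketched as follows. This is an illustration only: the shifts are passed in rather than predicted by a network, "normalized locations" are assumed to mean offsets relative to the candidate point divided by the radius, and the function name and arguments are hypothetical.

```python
import numpy as np

def candidate_generation(xyz_f, shifts, xyz_all, feats_all, radius, max_neighbors):
    """Shift F-FPS points into candidates and group neighbours around each candidate.

    xyz_f:     (M, 3) representative points from F-FPS.
    shifts:    (M, 3) offsets towards instance centers (given here, predicted in practice).
    xyz_all:   (N, 3) all representative points from both F-FPS and D-FPS.
    feats_all: (N, C) their semantic features.
    Returns the candidates (M, 3) and one (k_i, 3 + C) array per candidate.
    """
    candidates = xyz_f + shifts
    groups = []
    for c in candidates:
        dist = np.linalg.norm(xyz_all - c, axis=1)
        idx = np.flatnonzero(dist <= radius)[:max_neighbors]
        rel = (xyz_all[idx] - c) / radius  # assumed normalization of local coordinates
        groups.append(np.concatenate([rel, feats_all[idx]], axis=1))
    return candidates, groups

xyz_all = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [5.0, 5.0, 5.0]])
feats_all = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
candidates, groups = candidate_generation(
    xyz_f=xyz_all[:1], shifts=np.array([[1.0, 0.0, 0.0]]),
    xyz_all=xyz_all, feats_all=feats_all, radius=1.5, max_neighbors=16)
```

Each grouped array would then go through shared MLP layers and a pooling step to give one feature vector per candidate point.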
In the regression head, for each candidate point, we predict the distance to its corresponding instance, as well as the size and orientation of its corresponding instance. Since there is no prior orientation for each point, we follow previous work and apply a hybrid of classification and regression formulation in orientation angle regression. Specifically, we pre-define $N_a$ equally split orientation angle bins and classify the proposal orientation angle into different bins. The residual is regressed with respect to the bin value. $N_a$ is set to 12 in our experiments.
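The hybrid bin-plus-residual formulation can be illustrated with a small encoder and decoder. This sketch assumes bins are centered at multiples of the bin width, which is one common convention and not necessarily the one used by the authors; the function names are hypothetical.

```python
import math

NUM_BINS = 12  # number of orientation bins stated in the text

def encode_angle(angle, num_bins=NUM_BINS):
    """Map an angle in radians to (bin index, residual from the bin center)."""
    bin_size = 2 * math.pi / num_bins
    # Shift by half a bin so that bin k is centered at k * bin_size (assumed convention).
    shifted = (angle % (2 * math.pi) + bin_size / 2) % (2 * math.pi)
    bin_idx = min(int(shifted // bin_size), num_bins - 1)
    residual = shifted - (bin_idx * bin_size + bin_size / 2)
    return bin_idx, residual

def decode_angle(bin_idx, residual, num_bins=NUM_BINS):
    """Inverse of encode_angle, returning an angle in [0, 2*pi)."""
    bin_size = 2 * math.pi / num_bins
    return (bin_idx * bin_size + residual) % (2 * math.pi)

bin_idx, residual = encode_angle(math.pi / 2)
```

The classification branch predicts the bin index and the regression branch predicts the residual, whose magnitude never exceeds half a bin width.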
In the training process, we need an assignment strategy to assign labels to each candidate point. 2D single stage detectors usually use an intersection-over-union (IoU) threshold or a mask [28, 33] to assign labels to pixels. FCOS proposes a continuous center-ness label, replacing the original binary classification label, to further distinguish pixels. It assigns higher center-ness scores to pixels closer to instance centers, leading to relatively better performance compared to IoU- or mask-based assignment strategies. However, directly applying the center-ness label to the 3D detection task is unsatisfactory. Given that all LiDAR points are located on the surfaces of objects, their center-ness labels are all very small and similar, which makes it impossible to distinguish good predictions from other points.
Instead of utilizing the original representative points in the point cloud, we resort to the predicted candidate points, which are supervised to be close to instance centers. Candidate points closer to instance centers tend to get more accurate localization predictions, and 3D center-ness labels are able to distinguish them easily. For each candidate point, we define its center-ness label through two steps. We first determine whether it is within an instance, which gives a binary value $l_{mask}$. Then we draw a center-ness label $l_{ctrness}$ according to its distance to the 6 surfaces of its corresponding instance. The center-ness label is calculated as
$$l_{ctrness} = \sqrt[3]{\frac{\min(f, b)}{\max(f, b)} \times \frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, d)}{\max(t, d)}},$$
where $(f, b, l, r, t, d)$ represent the distances to the front, back, left, right, top and bottom surfaces respectively. The final classification label is the product of $l_{mask}$ and $l_{ctrness}$.
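A worked sketch of the label follows. It assumes the center-ness value is the cube root of the product of the min/max distance ratios along the three box axes, multiplied by a binary inside-the-box mask; the box parameterization (center, size, yaw around the up axis) and the function name are hypothetical choices for illustration.

```python
import numpy as np

def centerness_label(point, center, size, yaw=0.0):
    """Return mask * center-ness for one candidate point and one box.

    size = (length, width, height); yaw rotates the box around the up (z) axis.
    """
    # Express the point in the box frame by rotating the offset by -yaw.
    c, s = np.cos(-yaw), np.sin(-yaw)
    dx, dy, dz = np.asarray(point, float) - np.asarray(center, float)
    local = np.array([c * dx - s * dy, s * dx + c * dy, dz])
    half = np.asarray(size, float) / 2.0
    d_pos = half - local  # distances to the front, left and top faces
    d_neg = half + local  # distances to the back, right and bottom faces
    if np.any(d_pos < 0) or np.any(d_neg < 0):
        return 0.0  # outside the box, so the mask is 0
    ratios = np.minimum(d_pos, d_neg) / np.maximum(d_pos, d_neg)
    return float(np.prod(ratios) ** (1.0 / 3.0))

box_center, box_size = (0.0, 0.0, 0.0), (4.0, 2.0, 2.0)
at_center = centerness_label((0.0, 0.0, 0.0), box_center, box_size)
off_center = centerness_label((1.0, 0.0, 0.0), box_center, box_size)
on_surface = centerness_label((2.0, 0.0, 0.0), box_center, box_size)
outside = centerness_label((3.0, 0.0, 0.0), box_center, box_size)
```

A point on a box face receives a label of zero, which illustrates why raw LiDAR points, all lying on object surfaces, are poorly separated by this label while shifted candidate points are not.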
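The overall loss is composed of the classification loss, the regression loss and the shifting loss, as
$$L = \frac{1}{N_c} \sum_i L_c(s_i, u_i) + \lambda_1 \frac{1}{N_p} \sum_i [u_i > 0] L_r + \lambda_2 \frac{1}{N_p^*} L_s,$$
where $N_c$ and $N_p$ are the numbers of total candidate points and positive candidate points, i.e., candidate points located in foreground instances, $L_r$ and $L_s$ are the regression loss and the shifting loss, and $\lambda_1$ and $\lambda_2$ are their weights. In the classification loss, we denote $s_i$ and $u_i$ as the predicted classification score and the center-ness label for point $i$ respectively, and use the cross entropy loss as $L_c$.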
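As an illustration of a classification loss with continuous center-ness labels, the sketch below averages a binary cross entropy over all candidate points. The binary, per-point form is an assumption made for this example; the text only states that a cross entropy loss is used, and the function name and sample values are hypothetical.

```python
import numpy as np

def soft_label_bce(scores, labels, eps=1e-7):
    """Mean binary cross entropy between predicted scores and soft labels in [0, 1]."""
    s = np.clip(np.asarray(scores, float), eps, 1.0 - eps)  # avoid log(0)
    u = np.asarray(labels, float)
    return float(np.mean(-(u * np.log(s) + (1.0 - u) * np.log(1.0 - s))))

# Hypothetical scores and center-ness labels for three candidate points.
loss = soft_label_bce([0.9, 0.2, 0.6], [1.0, 0.0, 0.7])
```

Taking the mean over candidate points corresponds to normalizing the classification term by the total number of candidate points.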
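The regression loss $L_r$ includes the distance regression loss $L_{dist}$, the size regression loss $L_{size}$, the angle regression loss $L_{angle}$ and the corner loss $L_{corner}$. Specifically, we utilize the smooth-$l_1$ loss for $L_{dist}$ and $L_{size}$, in which the targets are the offsets from candidate points to their corresponding instance centers and the sizes of the corresponding instances respectively. The angle regression loss contains an orientation classification loss and a residual prediction loss, as
$$L_{angle} = L_c(d_c^a, t_c^a) + D(d_r^a, t_r^a),$$
where $d_c^a$ and $d_r^a$ are the predicted angle class and residual while $t_c^a$ and $t_r^a$ are their targets, $L_c$ denotes the cross entropy loss and $D$ the residual regression loss. The corner loss is the distance between the predicted 8 corners and the assigned ground-truth, expressed as
$$L_{corner} = \sum_{m=1}^{8} \| P_m - G_m \|,$$
where $G_m$ and $P_m$ are the locations of the ground-truth and the prediction for point $m$.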
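The corner loss and the smooth-l1 loss can be sketched as below. The example assumes the corner loss is the sum of Euclidean distances between the 8 corresponding corners of the predicted and ground-truth boxes, and uses the common smooth-l1 definition with a threshold of 1; the box parameterization and function names are hypothetical.

```python
import numpy as np

def box_corners(center, size, yaw):
    """8 corners of a (length, width, height) box rotated by yaw around the z axis."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    local = signs * np.asarray(size, float) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.asarray(center, float)

def corner_loss(pred_box, gt_box):
    """Sum of distances between corresponding corners (assumed form of the corner loss)."""
    diff = box_corners(*pred_box) - box_corners(*gt_box)
    return float(np.linalg.norm(diff, axis=1).sum())

def smooth_l1(pred, target, beta=1.0):
    """Quadratic below beta, linear above it, summed over all elements."""
    diff = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return float(np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum())

gt = ((0.0, 0.0, 0.0), (4.0, 2.0, 1.5), 0.3)
pred = ((1.0, 0.0, 0.0), (4.0, 2.0, 1.5), 0.3)  # same box shifted by 1 along x
shift_only_loss = corner_loss(pred, gt)
```

A pure translation by one unit moves every corner by one unit, so the corner loss for the example pair is 8. Because the corners depend jointly on center, size and orientation, the corner term couples the otherwise separate regression targets.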
As for the shifting loss $L_s$, which supervises the shift prediction in the CG layer, we utilize a smooth-$l_1$ loss to calculate the distance between the predicted shifts and the residuals from representative points to their corresponding instance centers. $N_p^*$ is the number of positive representative points from F-FPS.
There are 7,481 training images/point clouds and 7,518 test images/point clouds with three categories of Car, Pedestrian and Cyclist in the KITTI dataset. We only evaluate our model on the class Car, due to its large amount of data and complex scenarios. Moreover, most state-of-the-art methods only test their models on this class. We use the average precision (AP) metric to compare different methods. During evaluation, we follow the official KITTI evaluation protocol, in which the IoU threshold is 0.7 for class Car.
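Table 3: AP (%) for class Car on the KITTI dataset.
|Type||Method||Sensors||Easy||Moderate||Hard|
|2-stage||MV3D||R + L||71.09||62.35||55.12|
|2-stage||Fast Point-RCNN|| ||85.29||77.40||70.24|
|1-stage||ContFuse||R + L||83.68||68.78||61.67|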
Training uses an optimizer with an initial learning rate of 0.002 and a batch size of 16 equally distributed on 4 GPU cards. The learning rate is decayed by a factor of 10 at epoch 40. We train our model for 50 epochs.
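The schedule can be written as a small step function. This sketch assumes epochs are counted from 0 and that "decayed by 10" means division by 10 from epoch 40 onwards; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.002, decay_epoch=40, factor=10.0):
    """Step schedule: base_lr before decay_epoch, base_lr / factor from then on."""
    return base_lr / factor if epoch >= decay_epoch else base_lr

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(e) for e in range(50)]
```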
We adopt 4 different data augmentation strategies on the KITTI dataset in order to prevent overfitting. First, we use a mix-up strategy from prior work, which randomly adds foreground instances with their inner points from other scenes to the current point cloud. Then, for each bounding box, we rotate it by an angle drawn from a uniform distribution and add a random translation. Third, each point cloud is randomly flipped along one axis. Finally, we randomly rotate each point cloud around the up axis and rescale it.
As illustrated in Table 3, our method outperforms all state-of-the-art voxel-based single stage detectors by a large margin. On the main metric, i.e., AP on "moderate" instances, our method outperforms both SECOND and PointPillars. It also retains comparable performance to the state-of-the-art point-based method STD while being more than 2x faster at inference. Our method further outperforms other two stage methods like Part-A^2 net and PointRCNN. Moreover, it shows its superiority when compared to multi-sensor methods like MMF and F-ConvNet. We present several qualitative results in Figure 4.
The nuScenes dataset is a more challenging dataset. It contains 1000 scenes, gathered from Boston and Singapore due to their dense traffic and highly challenging driving situations. It provides 1.4M 3D objects of 10 different classes, as well as their attributes and velocities. There are about 40k points per frame, and in order to predict velocity and attribute, all former methods combine points from the key frame and from frames in the last 0.5s, leading to about 400k points. Faced with such a large number of points, all point-based two stage methods perform worse than voxel-based methods on this dataset, due to the GPU memory limitation. In the benchmark, a new evaluation metric called the nuScenes detection score (NDS) is presented, which is a weighted sum of the mean average precision (mAP) and the mean average errors of location (mATE), size (mASE), orientation (mAOE), attribute (mAAE) and velocity (mAVE). We use $TP$ to denote the set of the five mean average errors, and NDS is calculated by
$$NDS = \frac{1}{10}\left[5\, mAP + \sum_{mTP \in TP} \left(1 - \min(1, mTP)\right)\right].$$
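As a worked example of the score, the sketch below follows the nuScenes definition, in which mAP carries half of the weight and each of the five error terms is clipped to 1 and converted to a score. The function name and the input values are hypothetical.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean true-positive errors."""
    if len(tp_errors) != 5:
        raise ValueError("expected mATE, mASE, mAOE, mAAE and mAVE")
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_score = sum(1.0 - min(1.0, e) for e in tp_errors)
    return (5.0 * m_ap + tp_score) / 10.0

# Hypothetical values: mAP of 0.4 and all five errors equal to 0.5.
example_nds = nds(0.4, [0.5, 0.5, 0.5, 0.5, 0.5])
```

With these inputs the five error terms contribute 2.5 and the mAP term contributes 2.0, giving a score of 0.45.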
For each key frame, we combine its points with points from frames within the last 0.5s so as to get a richer point cloud input, in the same way as other methods. Then, we apply voxelization to randomly sample the point clouds, so as to align the input size while keeping the original distribution. We randomly choose 65536 voxels, including 16384 voxels from the key frame and 49152 voxels from other frames. One interior point is randomly selected from each voxel. We feed these 65536 points into our point-based network.
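The voxel-based random sampling can be sketched as follows. The voxel size used here is a placeholder, the fallback to sampling with replacement when there are too few non-empty voxels is an assumption, and the function name is hypothetical.

```python
import numpy as np

def voxel_random_sample(points, voxel_size, n_voxels, rng):
    """Pick n_voxels non-empty voxels at random and one random point from each."""
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # voxel id of every point
    # Shuffle the points, then keep the first point seen for every voxel id,
    # which selects one uniformly random interior point per non-empty voxel.
    order = rng.permutation(len(points))
    _, first = np.unique(inverse[order], return_index=True)
    one_per_voxel = order[first]
    # Assumed fallback: sample with replacement if there are too few voxels.
    replace = len(one_per_voxel) < n_voxels
    chosen = rng.choice(one_per_voxel, size=n_voxels, replace=replace)
    return points[chosen]

rng = np.random.default_rng(0)
key_frame = rng.uniform(0.0, 10.0, size=(500, 3))  # hypothetical key-frame points
sampled = voxel_random_sample(key_frame, voxel_size=1.0, n_voxels=64, rng=rng)
```

In the setting described above, the same routine would be applied once to the key frame and once to the remaining frames, and the two samples concatenated into a fixed-size input.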
The backbone network is illustrated in Figure 3. The training schedule is just the same as the schedule on KITTI dataset. We only apply flip augmentation during training.
We show NDS and mAP among different methods in Table 5, and compare their APs of each class in Table 4. As illustrated in Table 5, our method performs better than all voxel-based single stage methods by a large margin. Not only on mAP, it also outperforms those methods on the AP of each class, as illustrated in Table 4. The results show that our model can deal with different objects with a large variance in scale. Even for a huge scene with many negative points, our fusion sampling strategy is still able to gather enough positive points. In addition, better results on velocity and attribute illustrate that our model can also discriminate information from different frames.
All ablation studies are conducted on the KITTI dataset. We follow VoxelNet to split the original training set into a train set of 3,712 images/scenes and a val set of 3,769 images/scenes. All "AP" results in ablation studies are calculated on the "Moderate" difficulty level.
We report the performance of our method on the KITTI validation set and compare it with other state-of-the-art voxel-based single stage methods in Table 6. As shown, on the most important "moderate" difficulty level, our method outperforms PointPillars, SECOND and VoxelNet. This illustrates the superiority of our method.
Our fusion sampling strategy is composed of F-FPS and D-FPS. We compare points recall and AP among different sub-sampling methods in Table 7. Sampling strategies containing F-FPS have higher points recall than D-FPS only. In Figure 5, we also present some visual examples to illustrate the benefits of F-FPS to fusion sampling. In addition, the fusion sampling strategy achieves a much higher AP than the one with F-FPS only. The reason is that the fusion sampling method can gather enough negative points, which enlarges receptive fields and achieves accurate classification results.
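Table 8: AP (%) under different assignment strategies, with and without the shifting operation in the CG layer.
| ||IoU||Mask||3D center-ness|
|without shifting (%)||70.4||76.1||43.0|
|with shifting (%)||78.1||77.3||79.4|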
In Table 8, we compare the performance with and without shifting the representative points from F-FPS in the CG layer. Under all assignment strategies, the APs of models with shifting are higher than those of models without the shifting operation. If the candidate points are closer to the centers of instances, it is easier for them to retrieve their corresponding instances.
We compare the performance of different assignment strategies including IoU, mask and 3D center-ness label. As shown in Table 8, with shifting operation, the model using center-ness label gains better performance than the other two strategies.
The total inference time of 3DSSD is 38ms, tested on the KITTI dataset with a Titan V GPU. We compare the inference time of 3DSSD and all existing point-based methods in Table 9. As illustrated, our method is much faster than all existing point-based methods. Moreover, our method maintains an inference speed similar to that of state-of-the-art voxel-based single stage methods like SECOND, which takes 40ms. Among all existing methods, it is only slower than PointPillars, which is enhanced by several implementation optimizations such as TensorRT that can also be utilized by our method for even faster inference.
In this paper, we first propose a lightweight and efficient point-based 3D single stage object detection framework. We introduce a novel fusion sampling strategy to remove the time-consuming FP layers and the refinement module used in all existing point-based methods. In the prediction network, a candidate generation layer is designed to further reduce computation cost and fully utilize the downsampled representative points, and an anchor-free regression head with a 3D center-ness label is proposed in order to boost the performance of our model. All of the above designs make our model superior to all existing single stage 3D detectors in both performance and inference time.
Multiview random forest of local experts combining RGB and LIDAR data for pedestrian detection. In IV.
PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR.