High-efficiency point cloud 3D object detection operated on embedded systems is important for many robotics applications, including autonomous driving. Most previous works try to solve it using anchor-based detection methods, which come with two drawbacks: post-processing is relatively complex and computationally expensive, and tuning anchor parameters is tricky. We are the first to address these drawbacks with an anchor-free and Non-Maximum Suppression-free one-stage detector called AFDet. The entire AFDet can be processed efficiently on a CNN accelerator or a GPU with the simplified post-processing. Without bells and whistles, our proposed AFDet performs competitively with other one-stage anchor-based methods on the KITTI validation set and the Waymo Open Dataset validation set.
Detecting 3D objects in point clouds is one of the most important perception tasks for autonomous driving. To satisfy power and efficiency constraints, most detection systems are operated on vehicle embedded systems. Developing an embedded-system-friendly point cloud 3D detection system is a critical step toward making autonomous driving a reality.
Due to the sparse nature of the point cloud, it is inefficient to directly apply 3D or 2D Convolutional Neural Networks (CNNs) [9, 28] to the raw point cloud. On one hand, many point cloud encoders [38, 4, 33, 11, 37] have been introduced to encode the raw point cloud into data formats that can be efficiently processed by 3D or 2D CNNs. On the other hand, some works [22, 32, 25, 35, 20], inspired by PointNet [23, 24], directly extract features from raw point clouds for 3D detection. For the detection part, however, most of them adopt anchor-based detection methods proven effective in image object detection tasks.
| | Anchor-based | Ours |
| --- | --- | --- |
| Embedded Systems Friendly | ✗ | ✓ |

The comparison between anchor-based methods and our method. We use max pooling and an AND operation to achieve functionality similar to NMS but at much higher speed. In our experiments, our max pooling and AND operation run on one Nvidia 2080 Ti GPU substantially faster than the CPU-implemented NMS.
Anchor-based methods have two major disadvantages. First, Non-Maximum Suppression (NMS) is necessary for anchor-based methods to suppress overlapped high-confidence detection bounding boxes, but it introduces non-trivial computational cost, especially on embedded systems. According to our experiments, it takes more than 20 ms to process one KITTI point cloud frame even on a modern high-end desktop CPU with an efficient implementation, let alone the CPUs typically deployed in embedded systems. Second, anchor-based methods require anchor selection, which is tricky and time-consuming because critical parts of the tuning can be a manual trial-and-error process. For instance, every time a new detection class is added to the detection system, hyper-parameters such as the appropriate anchor number, anchor size, anchor angle and anchor density need to be selected.
Can we get rid of NMS and design an embedded system friendly anchor free point cloud 3D detection system with high efficiency? Recently, anchor free methods [12, 36, 31] in image detection have achieved remarkable performance. In this work, we propose an anchor free and NMS free one stage end-to-end point cloud 3D object detector (AFDet) with simple post-processing.
We use PointPillars  to encode the entire point cloud into pseudo images or image-like feature maps in Bird’s Eye View (BEV) in our experiments. However, AFDet can be used with any point cloud encoder which generates pseudo images or image-like 2D data. After encoding, a CNN with upsampling necks is applied to output the feature maps, which connect to five different heads to predict object centers in the BEV plane and to regress different attributes of the 3D bounding boxes. Finally, the outputs of the five heads are combined together to generate the detection results. A keypoint heat map prediction head is used to predict the object centers in the BEV plane. It will encode every object into a small area with a heat peak as its center. At the inference stage, every heat peak will be picked out by max pooling operation. After this, we no longer have multiple regressed anchors tiled into one location, therefore there is no need to use traditional NMS. This makes the entire detector runnable on a typical CNN accelerator or GPU, saving CPU resources for other critical tasks in autonomous driving.
Our contributions can be summarized as below:
(1) We are the first to propose an anchor free and NMS free detector for point cloud 3D object detection with simplified post-processing.
(2) AFDet is embedded system friendly and can achieve high processing speed with much less engineering effort.
(3) AFDet can achieve competitive accuracy compared with previous single-stage detectors on the KITTI validation set. A variant of our AFDet surpasses the state-of-the-art single-stage 3D detection methods on Waymo validation set.
Thanks to accurate 3D spatial information provided by LiDAR, LiDAR-based solutions prevail in 3D object detection task.
Due to their non-fixed length and order, point clouds are in a sparse and irregular format which needs to be encoded before being input into a neural network. Some works utilize a mesh grid to voxelize point clouds. Features such as density, intensity, height, etc., are concatenated in different voxels as different channels. Voxelized point clouds are either projected to different views, such as BEV, Range View (RV), etc., to be processed by 2D convolution [4, 10, 27, 34], or kept in 3D coordinates to be processed by sparse 3D convolution. PointNet proposes an effective solution that uses the raw point cloud as input to conduct 3D detection and segmentation. PointNet wields Multilayer Perceptron (MLP) and max pooling operations to handle the point cloud's disorder and non-uniformity and provides satisfactory performance. Successive 3D detection solutions based on raw point cloud input, such as PointNet++, Frustum PointNet, PointRCNN and STD, provide promising performance. VoxelNet combines voxelization and PointNet to propose the Voxel Feature Extractor (VFE), in which a PointNet-style encoder is implemented inside each voxel. A similar idea is used in SECOND, except that sparse 3D convolution is utilized to further extract and downsample information along the z-axis following VFE. VFE improves the performance of LiDAR-based detectors dramatically; however, with encoders that are learned from data, the detection pipeline becomes slower. PointPillars proposes to encode the point cloud as pillars instead of voxels. As a result, the whole point cloud becomes a BEV pseudo image whose number of channels equals VFE's output channels instead of 3.
Anchor free. In anchor-based methods, pre-defined boxes are provided for bounding box encoding. However, using dense anchors leads to an exhaustive number of potential target objects, which makes NMS an unavoidable issue. Some previous works [34, 18, 2, 25, 21] mention anchor-free concepts. PointRCNN proposes a 3D proposal generation sub-network without anchor boxes based on whole-scene point cloud segmentation. VoteNet constructs 3D bounding boxes from voted interest points instead of predefined anchor boxes. But none of them is NMS free, which makes them less efficient and less friendly to embedded systems. Besides, PIXOR is a BEV detector rather than a 3D detector.
Camera-based solutions have thrived, driven by the desire to reduce cost. With more sophisticated networks being designed, camera-based solutions are catching up rapidly with LiDAR-based solutions. MonoDIS leverages a novel disentangling transformation for 2D and 3D detection losses and a novel self-supervised confidence score for 3D bounding boxes. It achieves a top ranking in the nuScenes 3D object detection challenge. CenterNet predicts the location and class of an object from the center of its bounding box on a feature map. Though originally designed for 2D detection, CenterNet also has the potential to conduct 3D detection with a mono camera. TTFNet proposes techniques to shorten training time and increase inference speed. RTM3D predicts nine perspective keypoints of a 3D bounding box in image space and recovers the 3D bounding box with geometric constraints.
In this section, we present the details of AFDet from three aspects: point cloud encoder, backbone and necks, and anchor free detector. The framework is shown in Figure 1.
To further tap the efficiency potential of our anchor free detector, we use PointPillars as the point cloud encoder because of its fast speed. First, the detection range is discretized into pillars in the Bird's Eye View (BEV) plane, which is also the $x$-$y$ plane. Different points are assigned to different pillars based on their $x$-$y$ values. Every point is also augmented with additional channels at this step. Second, a linear layer and a max operation are applied to the pre-defined number of pillars that contain enough points, creating an output tensor of size $(C, P)$, where $C$ is the number of output channels of the linear layer in PointNet and $P$ is the number of selected pillars. Since the selected pillars are not in one-to-one correspondence with the original pillars in the entire detection range, the third step is to scatter the selected pillars back to their original locations in the detection range. After that we obtain a pseudo image of size $(C, W, H)$, where $W$ and $H$ indicate the width and height, respectively.
Although we use PointPillars  as the point cloud encoder, our anchor free detector is compatible with any point cloud encoders which generate pseudo images or image-like 2D data.
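The scatter step above can be sketched as follows. This is a minimal illustration, not the reference implementation; the function name, argument names and array layout are assumptions.

```python
import numpy as np

def scatter_pillars(pillar_features, pillar_indices, W, H):
    """Scatter encoded pillar features back onto the BEV canvas.

    pillar_features: (C, P) array, one feature column per selected pillar.
    pillar_indices:  (P, 2) array of (x, y) grid coordinates per pillar.
    Returns a (C, H, W) pseudo image; locations without pillars stay zero.
    """
    C, P = pillar_features.shape
    canvas = np.zeros((C, H, W), dtype=pillar_features.dtype)
    for p in range(P):
        x, y = pillar_indices[p]
        canvas[:, y, x] = pillar_features[:, p]
    return canvas
```

A vectorized variant would flatten the canvas and assign all columns at once, but the loop makes the one-to-one mapping between selected pillars and grid cells explicit.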
Our anchor free detector consists of five heads: the keypoint heatmap head, the local offset head, the $z$-axis location head, the 3D object size head and the orientation head. Figure 1 shows details of the anchor free detector.
Object localization in BEV. For the heatmap head and the offset head, we predict a keypoint heatmap with $K$ channels, where $K$ is the number of keypoint types, and a two-channel local offset regression map. The keypoint heatmap is used to find where the object center is in BEV. The offset regression map helps the heatmap find more accurate object centers in BEV and also helps to recover the discretization error caused by the pillarization process.
For a 3D object with category $k$, we parameterize its 3D ground truth bounding box as $(x, y, z, w, l, h, \theta)$, where $(x, y, z)$ represents the center location in the LiDAR coordinate system, $w$, $l$, $h$ are the width, length and height of the bounding box, and $\theta$ is the yaw rotation around the $z$-axis, which is perpendicular to the ground. Let $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}]$ denote the detection range in the $x$-$y$ plane; to be specific, $x_{\min}$ and $x_{\max}$ are along the $x$-axis and $y_{\min}$ and $y_{\max}$ are along the $y$-axis in the LiDAR coordinate system. In this work, the pillar in the $x$-$y$ plane is always a square, so let $b$ denote the pillar side length. For each object center we have the keypoint $p = \left(\frac{x - x_{\min}}{b}, \frac{y - y_{\min}}{b}\right)$ in BEV pseudo image coordinates, and $\tilde{p} = \lfloor p \rfloor$ is its equivalent in the keypoint heatmap, where $\lfloor \cdot \rfloor$ is the floor operation. The 2D bounding box in BEV can then be expressed by the projected center together with the scaled size $\left(\frac{w}{b}, \frac{l}{b}\right)$ and the yaw $\theta$.
For each pixel covered by a 2D bounding box in the pseudo image, we set its value in the heatmap as a function of $d$, the Euclidean distance between the bounding box center and the corresponding pixel in the discretized pseudo image coordinates: the value is 1 at the center, decays as $d$ grows, and is 0 outside all boxes. A prediction of 1 thus represents an object center, while 0 indicates that the pillar is background.
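As a concrete illustration of this target construction, the sketch below fills heat values that decay with distance inside the box. The specific decay $1/(1+d)$ is an illustrative assumption standing in for the paper's exact formula, as are the function and argument names.

```python
import numpy as np

def car_shape_heatmap(W, H, center, box_mask):
    """Fill heatmap values for every pixel covered by a BEV box.

    center:   integer (cx, cy) keypoint of the object.
    box_mask: boolean (H, W) array, True inside the 2D BEV box.
    The decay 1 / (1 + d) is an illustrative choice, not the paper's
    exact formula.
    """
    heat = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.nonzero(box_mask)
    cx, cy = center
    for x, y in zip(xs, ys):
        d = np.hypot(x - cx, y - cy)   # Euclidean distance to the center
        heat[y, x] = 1.0 / (1.0 + d)   # 1 at the center, smaller farther away
    return heat
```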
The pillars whose heatmap value equals 1, which represent the object centers in BEV, are treated as positive samples, while all other pillars are treated as negative samples. Following [12, 36], we use the modified focal loss

$$L_{heat} = -\frac{1}{N} \sum_{x,y,k} \begin{cases} (1 - \hat{M}_{xyk})^{\alpha} \log(\hat{M}_{xyk}) & \text{if } M_{xyk} = 1 \\ (1 - M_{xyk})^{\beta} \, \hat{M}_{xyk}^{\alpha} \log(1 - \hat{M}_{xyk}) & \text{otherwise} \end{cases}$$

to train the heatmap, where $N$ is the number of objects in the detection range, $M$ and $\hat{M}$ are the ground truth and predicted heatmaps, and $\alpha$ and $\beta$ are hyper-parameters whose values follow [12, 36] in all our experiments.
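The modified focal loss can be sketched as below. The default `alpha`/`beta` values follow common practice in keypoint detectors and are assumptions here, as is the function name.

```python
import numpy as np

def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """CornerNet-style modified focal loss over a heatmap (a sketch;
    alpha/beta defaults follow common practice, not necessarily the
    paper's exact values)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = gt == 1.0  # object-center pixels are the positives
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos])).sum()
    # negatives are down-weighted the closer their ground truth is to 1
    neg_loss = (
        (1 - gt[~pos]) ** beta * pred[~pos] ** alpha * np.log(1 - pred[~pos])
    ).sum()
    n = max(pos.sum(), 1)  # normalize by the number of object centers
    return -(pos_loss + neg_loss) / n
```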
For the offset regression head, there are two main functions. First, it eliminates the error caused by the pillarization process, in which we assign float object centers to integer pillar locations in BEV, as mentioned above. Second, it plays an important role in refining the heatmap object center predictions, especially when the heatmap predicts wrong centers. To be specific, once the heatmap predicts a wrong center that is several pixels away from the ground truth center, the offset head has the capability to mitigate and even eliminate the several-pixel error to the ground truth object center.
We select a square area with radius $r$ around each object center pixel in the offset regression map. The farther a pixel is from the object center, the larger its offset value becomes. We train the offset using an $L1$ loss

$$L_{off} = \frac{1}{N} \sum_{q \in S(\tilde{p}, r)} \left| \hat{O}_{q} - (p - q) \right|,$$

where the training covers only the square area $S(\tilde{p}, r)$ with side length $2r + 1$ around each keypoint location $\tilde{p}$. We will discuss the offset regression further in Section 4.
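The offset target construction can be sketched as follows; the function name and the (2, H, W) target layout are illustrative assumptions.

```python
import numpy as np

def offset_targets(center_float, center_int, r, W, H):
    """Build the 2-channel offset regression target around one keypoint.

    center_float: sub-pixel object center (x, y) in pseudo-image coords.
    center_int:   its floored keypoint.
    Pixels in the (2r+1) x (2r+1) square regress the vector pointing back
    to the true center, so farther pixels get larger offsets.
    """
    target = np.zeros((2, H, W), dtype=np.float32)
    cx, cy = center_int
    fx, fy = center_float
    for y in range(max(0, cy - r), min(H, cy + r + 1)):
        for x in range(max(0, cx - r), min(W, cx + r + 1)):
            target[0, y, x] = fx - x  # x-offset back to the true center
            target[1, y, x] = fy - y  # y-offset back to the true center
    return target
```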
$z$-axis location regression. After object localization in BEV, we only have the object $x$-$y$ location. Thus we have the $z$-axis location head to regress the $z$-axis values. We directly regress the $z$-value using an $L1$ loss

$$L_{z} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{z}_{k} - z_{k} \right|.$$
Size regression. Additionally, we regress the object sizes directly. For each object $k$, we have the size $s_{k} = (w_{k}, l_{k}, h_{k})$. The training loss for size regression is

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{s}_{k} - s_{k} \right|,$$

which is also an $L1$ loss.
Orientation prediction. The orientation of an object is the scalar angle $\theta$ rotated around the $z$-axis, which is perpendicular to the ground. We follow [19, 36] and encode it into eight scalars, with four scalars for each of two bins. Two scalars are for softmax classification and the other two are for angle regression. The angle ranges of the two bins overlap slightly. For each bin $i$, we predict two scalars used for softmax classification and two scalars used for the sine and cosine of the offset $\Delta\theta_{i}$ to the bin center $c_{i}$. The classification part is trained with softmax, while the offset part is trained with an $L1$ loss. The loss for orientation training is the sum of the classification loss over both bins and the regression loss for the bin that contains the ground truth angle, selected by an indicator function. We can decode the predicted orientation value as

$$\hat{\theta}_{k} = c_{j} + \arctan\!\left(\frac{\widehat{\sin(\Delta\theta_{j})}}{\widehat{\cos(\Delta\theta_{j})}}\right),$$

where $j$ is the bin index with the larger classification score for object $k$.
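The two-bin encode/decode logic can be sketched as below. The bin centers here are an assumption for illustration only, since the paper's exact bin ranges are not reproduced above.

```python
import math

# Illustrative two-bin setup; the exact bin ranges are not given above,
# so we assume bins centered at -pi/2 and +pi/2.
BIN_CENTERS = (-math.pi / 2, math.pi / 2)

def encode_orientation(theta):
    """Encode a yaw angle as (bin index, sin and cos of the bin offset)."""
    j = min(range(2), key=lambda i: abs(theta - BIN_CENTERS[i]))
    off = theta - BIN_CENTERS[j]
    return j, math.sin(off), math.cos(off)

def decode_orientation(j, sin_off, cos_off):
    """Recover the yaw from the winning bin and its regressed offset."""
    return BIN_CENTERS[j] + math.atan2(sin_off, cos_off)
```

Regressing sine and cosine rather than the raw angle avoids the wrap-around discontinuity, and `atan2` recovers the offset unambiguously at decode time.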
Loss. We have described the losses for each head. The overall training objective is

$$L = L_{heat} + \lambda_{off} L_{off} + \lambda_{z} L_{z} + \lambda_{size} L_{size} + \lambda_{ori} L_{ori},$$

where each $\lambda$ represents the weight for the corresponding head. For all regression heads, including local offset, $z$-axis location, size and orientation regression, we only regress objects that are within the detection range.
Gather indices and decode. At the training stage, we do not back-propagate through the entire feature maps. Instead, we only back-propagate at the indices that are object centers for all regression heads. At the inference stage, we use max pooling and an AND operation to find the peaks in the predicted heatmap, which is much faster and more efficient than IoU-based NMS.
After the max pooling and AND operation, we can easily gather the indices of each center from the keypoint heatmap. Let $\hat{P} = \{\hat{p}_{k}\}_{k=1}^{N}$ denote the set of detected BEV object centers, where $N$ is the total number of detected objects. The final object center in BEV for object $k$ is then $\hat{p}_{k} + \hat{o}_{k}$, where the offset $\hat{o}_{k}$ is found in the offset regression map using the index $\hat{p}_{k}$. All other prediction values are either taken directly from the regression results or decoded as described above. The predicted bounding box for object $k$ combines the decoded BEV center, the $z$-axis location, the size and the orientation.
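The NMS-free peak picking described above can be sketched with plain numpy; the function name and score threshold are illustrative assumptions.

```python
import numpy as np

def heatmap_peaks(heat, score_thresh=0.1):
    """NMS-free peak picking: 3x3 max pooling plus an AND operation.

    A pixel survives only if it equals its local 3x3 maximum, so at most
    one detection remains per local neighborhood.
    Returns (y, x) indices of the surviving peaks.
    """
    H, W = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # 3x3 max pooling with stride 1 via shifted views of the padded map
    pooled = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    keep = (heat == pooled) & (heat > score_thresh)  # the AND operation
    return np.argwhere(keep)
```

Because the surviving pixels are already unique per neighborhood, no IoU computation or sorting is needed, which is why this step maps well onto a CNN accelerator.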
In this work, we make several key modifications to the backbone used in [38, 33, 11] to support our anchor free detector. The network includes a backbone part and a necks part. The backbone part is similar to the networks used in classification tasks; it extracts features while downsampling the spatial size through different blocks. The necks part upsamples the features to make sure all outputs from different blocks of the backbone have the same spatial size, so that we can concatenate them along one axis. Figure 2 shows details of the backbone and necks.
Each block consists of convolution layers, each followed by a BatchNorm and a ReLU, and $S$ is defined as the downsampling stride for the block. First, by reducing the number of blocks from 3 to 2, we remove the feature maps that are downsampled 4 times in [38, 33, 11]. We accordingly reduce the number of upsampling necks from 3 to 2. Each upsampling neck contains one transposed convolution, with its own number of output channels and upsampling stride, followed by BatchNorm and ReLU. Second, the first block we use does not downsample the output feature size compared with the input size.
So the final backbone and necks consist of two blocks followed by two upsampling necks. By doing this, the width and height of the input feature maps and the pseudo images are the same. In short, we do not downsample in the process of generating feature maps, which is critical to maintaining a similar detection performance on the KITTI dataset. Reducing the downsampling stride alone would only increase FLOPs, so we also reduce the number of filters in the backbone and necks. It turns out that we have fewer FLOPs in the backbone and necks than [38, 33, 11]. We discuss the backbone and necks further in Section 4.
In this section, we first introduce the two datasets. Then we describe the experiment settings and our data augmentation strategy. Finally, we show the performance on KITTI  validation set and some preliminary results on Waymo  validation set.
The KITTI object detection dataset consists of training samples with both calibrations and annotations, and test samples which only have calibrations. In our experiments, we split the official training samples into a training set and a validation set following the common split. The KITTI dataset provides both LiDAR point clouds and images; however, annotations are only labeled in the camera field of view (FOV). To accelerate the training process, we crop out the points that are in the camera FOV for training and evaluation [4, 38].
The Waymo Open Dataset (Waymo OD) is a newly released large dataset for autonomous driving. It consists of training sequences and validation sequences. Unlike KITTI, where only the objects in the camera FOV are labeled, the objects in Waymo OD are labeled in the full field.
Unless explicitly indicated, all parameters take their default values. We use the AdamW optimizer with the one-cycle policy. We set the maximum learning rate, a division factor of 2, momentum ranging from 0.95 to 0.85 and a fixed weight decay of 0.01 to achieve convergence. Fixed weights are used for the different sub-losses. In the following, we first introduce the parameters used for KITTI, then the Waymo OD parameters that differ from KITTI.
For KITTI car detection, we set the detection range along the $x$, $y$ and $z$ axes to be the same as the PointPillars settings for a fair comparison, and we cap the maximum number of detected objects per class. For the PointPillars encoder, we fix the pillar side length, set the max number of points per pillar to 100 and cap the max number of pillars. We set the number of output channels of the linear layer in the encoder to 64. For the backbone, all the convolution layers have kernel size 3; their strides and numbers of output filters are shown in Figure 2. The outputs of the backbone and necks have the same width and height as the pseudo images. For every head, we use two convolution layers: the first has kernel size 3 and 32 channels; the second has kernel size 1, with a channel number that differs per head as shown in Figure 1. For the offset regression head, we use $r = 2$ by default, which means we regress a square area with side length $2r + 1 = 5$. We use max pooling with kernel size 3 and stride 1 and apply an AND operation between the feature maps before and after the max pooling to get the peaks of the keypoint heatmaps at the inference stage, so we do not need NMS to suppress overlapped detections. The model is trained for 240 epochs. Due to the small size of the KITTI dataset, we run every experiment 3 times and select the best one on the validation set.
First, we generate a database containing the labels of all ground truths and their associated point cloud data. For each sample, we randomly select 15 ground truth samples for car/vehicle and place them into the current point cloud, thereby increasing the number of ground truths in one point cloud. Second, each bounding box and the points inside it are rotated following a uniform distribution around the $z$-axis and translated following a normal distribution along all axes. Third, we also apply random flipping, global rotation following a uniform distribution and global scaling [38, 33, 11].
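The global part of this augmentation can be sketched as below. The distribution parameters are not reproduced in the text, so they are passed in explicitly; the function name and the flip axis are assumptions for illustration.

```python
import numpy as np

def global_augment(points, angle, scale, flip):
    """Apply a global rotation about the z-axis, uniform scaling, and an
    optional flip to an (N, 3) point cloud (an illustrative sketch).
    """
    c, s = np.cos(angle), np.sin(angle)
    # rotation about the z-axis, which is perpendicular to the ground
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points @ R.T * scale
    if flip:
        out[:, 1] = -out[:, 1]  # mirror across the x-z plane (assumed axis)
    return out
```

In practice `angle`, `scale` and `flip` would be drawn per sample from the uniform, normal and Bernoulli distributions mentioned above.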
We follow the official KITTI evaluation protocol to evaluate our detector, where the IoU threshold is 0.7 for the car class. We compare different methods and variants using the average precision (AP) metric.
We first compare different heatmap prediction methods and different offset regression methods. Then we compare different backbones for our detector. Finally, we compare our method with PointPillars.
(Table 2, excerpt) Car Shape (Ours): 75.57 | 85.68 | 69.31
(Table 3, header) Methods | # Params | # MACs | Anchor | 3D IoU=0.7 | BEV IoU=0.7
Heatmap prediction. We compare our car shape heatmap prediction method with the Gaussian heatmap prediction method. The car shape heatmap prediction is described in Section 3. For the Gaussian heatmap prediction, we splat all ground truth keypoints onto a heatmap using a Gaussian kernel $\exp\!\left(-\frac{(x - \tilde{p}_{x})^{2} + (y - \tilde{p}_{y})^{2}}{2\sigma^{2}}\right)$, where $\sigma$ is the size-adaptive standard deviation from prior work. The biggest difference between the two methods is the number of non-zero values in the heatmap. For the Gaussian kernel method, the non-zero values cover only several pixels (e.g. 9 pixels) around the object center, while for the car shape method all pixels inside the 2D bounding box (the car shape in BEV) are non-zero. An illustration can be found in Figure 3 (a) and (b). From Table 2, we can see that predicting the entire car shape rather than the Gaussian kernel improves the result on moderate difficulty by about 2%.
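The Gaussian splat baseline can be sketched as follows; the function name is an assumption, and a fixed `sigma` is used here in place of the size-adaptive one.

```python
import numpy as np

def gaussian_splat(W, H, center, sigma):
    """Splat one keypoint with an unnormalized Gaussian, as in
    CenterNet-style targets; sigma would be size-adaptive in practice."""
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = center
    # peak value 1 at the center, decaying with squared distance
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

In a multi-object heatmap, overlapping splats are typically merged with an element-wise maximum so that nearby objects do not erase each other's peaks.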
Offset regression. To verify the effectiveness of our proposed offset regression method, in which the training covers the square area with side length $2r + 1$ around each object center, we compare it with the offset regression method proposed in prior work, in which the training is only for the object center pixel. The latter regression method is actually a special case of our method with $r$ equal to 0. An illustration of the two methods is shown in Figure 3 (c) and (d). We set $r$ to 0, 1, 2 and 3. From Table 2, we can see that by setting $r$ to 2 we achieve about a 1-point improvement over the center-only regression method.
| Methods | Anchor Free | # Epochs | Overall | 0 - 30m | 30 - 50m | 50m - |
| --- | --- | --- | --- | --- | --- | --- |
| PointPillars¹ | ✗ | 100 | 56.62 | 81.01 | 51.75 | 27.94 |

The last four columns report LEVEL_1 3D AP at IoU=0.7.

¹ [37, 20, 5] report slightly different performance for the same method; here we adopt the reported results.
First, the baseline backbone downsamples 3 times with stride 2 for each block. After upsampling, the feature map used in the detection head is downsampled by 2 compared with the pseudo images. We remove the first downsampling stride and keep the following downsampling strides, as shown in Table 3; the feature maps fed to the detector then have the same size as the pseudo images. We can see that the performance improves by around 2% compared with the baseline. But the # MACs grows from 125.37 to 501.46, about 4 times the calculation of the baseline. This is mainly caused by doubling the feature maps' width and height.
Second, modifying the downsampling stride improves the performance, but we need to make sure that the improvement comes from enlarging the feature map size rather than from the increased computation. So we reduce the number of downsampling blocks from 3 to 2, removing the last downsampling block, and we also halve the number of output filters in the convolution layers. This computation-reducing modification is shown in Table 3. We can see that the performance barely changes while both the # MACs and the # parameters drop substantially. So enlarging the feature map in our anchor free detector helps to improve the performance.
Comparison with PointPillars. We compare our method with PointPillars on the KITTI validation set, using the Det3D implementation to evaluate PointPillars. All comparisons are under the same settings, including but not limited to the detection range and the PointPillars voxel size. As we can see, our AFDet with the modified backbone achieves performance similar to PointPillars, but our method does not have a complex post-processing process: we do not need the traditional NMS to filter out results. More importantly, the number of parameters in AFDet is about 0.56 M, which is only about 11.6% of its equivalent in PointPillars.
Furthermore, using max pooling and an AND operation rather than NMS makes AFDet friendlier to deploy on embedded systems. We can run nearly the entire algorithm on a CNN accelerator without tedious post-processing on a CPU, reserving more CPU computation resources for other tasks in autonomous driving cars. We also tried kernel sizes 5 and 7 in the max pooling; they do not show much difference from kernel size 3.
We show three qualitative results in Figure 4. As we can see, AFDet is capable of detecting object centers in the heatmap and regressing other object attributes (e.g. object sizes and $z$-axis locations) well. This validates the effectiveness of the anchor free method for 3D point cloud detection.
We also include some preliminary evaluation results on the Waymo OD validation set, using the Waymo online system to evaluate our performance. We try our best to use the same settings and parameters for a fair comparison, but we do not always know other methods' detailed parameters. On Waymo OD, we train our model with significantly fewer epochs than other methods, yet we still show competitive or even better results.
We show two AFDet results with PointPillars encoders in Table 4. The number after the encoder name represents the voxel size in the $x$-$y$ plane. As we can see, our “AFDet+PointPillars-0.16” with voxel size 0.16 m beats “PointPillars” by 2% on LEVEL_1 vehicle detection. When we reduce the voxel size to 0.10 m, our “AFDet+PointPillars-0.10” outperforms the state-of-the-art single-stage methods on the Waymo validation set. We only train our model for 16 epochs, while others train their models for 75 or 100 epochs for better convergence.
In this paper, we addressed the 3D point cloud detection problem. We presented a novel anchor free one stage 3D object detector (AFDet) to detect 3D objects in point clouds. We are the first to use an anchor free and NMS free method in 3D point cloud detection, which is advantageous for embedded systems. The experimental results demonstrate the effectiveness of our proposed method.