Deep neural networks (DNNs) have made tremendous advances on 2D vision tasks such as vehicle and general object detection [liu2016ssd, faster-rcnn] and semantic segmentation [he2017maskrcnn, yang2018segstereo]. However, compared with its image-based counterpart, 3D scene understanding is still under-explored, although it is crucial for many real-world applications such as autonomous driving. In autonomous driving, LiDAR is one of the most important sensors; it captures the 3D structure of the scene as a sparse point cloud. In this work, we focus on 3D vehicle detection, whose goal is to localize the oriented 3D bounding boxes of physical vehicles from the point cloud.
Leveraging the success of vehicle detection in 2D images, 3D vehicle detectors that combine image and point cloud have been proposed. [qi2018frustum] first detects 2D bounding boxes in the image and then detects 3D boxes in the point cloud constrained to the corresponding frustum. [chen2017multi, ku2018joint] encode the image as additional features and enrich the point cloud representation with a 2D detection backbone. However, such methods rely on strict time synchronization and extrinsic calibration between the LiDAR and the camera, which may not be satisfied in practical systems.
Methods exploiting the point cloud only as input have also been explored in many works. Similar to 2D detection, these methods can be divided into two categories: one-stage and two-stage methods. Generally, one-stage methods are faster than two-stage ones. They usually encode the point cloud as a bird's eye view (BEV) representation [chen2017multi, yang2018pixor] or as irregular pillars [lang2018pointpillars], and directly predict the 3D bounding boxes and their scores. [voxelnet, yan2018second] first group the point cloud into voxels and then extract features from the voxels using 3D convolutions, which are fed into a region proposal network (RPN) [girshick2015fast]. Two-stage methods usually generate bounding box proposals in the first stage and refine them in the second stage [himmelsbach2008LiDAR]. Recently, [shi2019pointrcnn] achieved impressive 3D detection results on the KITTI [Geiger2012CVPR] benchmark by introducing [qi2017pointnet++] as a second-stage encoder for canonical 3D bounding box refinement. However, these methods often need a separate model for each stage, which is time-consuming.
Although these methods have made impressive progress, we argue that two main problems still remain unsolved. First, it is difficult for current models to distinguish ambiguous vehicles without semantic context information. As shown in Fig. 1 a), our baseline method detects a false positive caused by an ambiguous obstacle outside the road. Second, the density distribution of point cloud varies continuously for vehicles at different depths, as shown in Fig. 1 b). It may be difficult for the network to simultaneously model the distribution across different depths.
Based on the above observation, we design a unified model called SegVoxelNet by exploiting free-of-charge bird’s eye view (BEV) semantic masks as additional supervision signal and a depth-aware head for learning distinctive depth-aware features for vehicles at various depths. The whole network is fully-convolutional and end-to-end trainable. Experiments on the well-known KITTI [Geiger2012CVPR] dataset show that our proposed method achieves considerable improvement over state-of-the-art methods.
Our main contributions are summarized below:
A unified framework called SegVoxelNet that incorporates free-of-charge semantic segmentation information into 3D vehicle detection pipeline, where semantic context provides guidance for 3D vehicle detection.
A depth-aware head with convolutional layers of different kernel sizes and dilated rates that improve feature learning of vehicles at different depths.
Our SegVoxelNet achieves state-of-the-art results on the KITTI dataset with real-time efficiency.
II Related Work
In this section, we briefly review the existing methods for 3D vehicle detection, which could be classified into the following three categories based on different input settings.
3D Vehicle Detection from Images: Some works directly predict the 3D bounding boxes from RGB images. Chen et al. [chen2016monocular] encoded the 3D locations, context and shape features of 3D vehicles into an energy function to score exhaustively placed 3D bounding boxes on the estimated ground plane. Mousavian et al. [mousavian20173d] recovered the 3D locations by leveraging the geometric constraints between 2D and 3D bounding boxes and predicted vehicle orientations with their proposed MultiBin architecture. [chabot2017deep, manhardt2018roi] predicted the 3D bounding boxes by estimating the 3D vehicle shape from introduced CAD models. Some works [chen20183d, licvpr2019, wangcvpr2019] further utilize stereo images to better estimate depth for 3D vehicle detection. However, whether monocular or stereo images are used, there is still a large performance gap between these methods and point cloud based ones, as the vehicle depth estimated from images is far from accurate.
3D Vehicle Detection from Multiple Sensors: Some methods fuse information from the point cloud and the RGB image to improve 3D vehicle detection. Chen et al. [chen2017multi] and Ku et al. [ku2018joint] encoded the point cloud as bird's eye view feature maps and projected the 3D proposals to different views (e.g., bird's eye view for the point cloud and front view for the image) to crop object features from each sensor for the final 3D bounding box prediction. Qi et al. [qi2018frustum] and Xu et al. [xu2018pointfusion] exploited 2D image detectors to generate 2D proposals, which are then used to crop the point cloud within each 2D box for the subsequent 3D box estimation by applying PointNet [qi2017pointnet, qi2017pointnet++] to the cropped points. However, bird's eye view based methods suffer from quantization information loss and feature misalignment between sensors, while 2D proposal based methods heavily depend on the performance of the 2D detector and may fail on occluded objects. Unlike these methods, our method directly predicts 3D bounding boxes from the 3D space spanned by the point cloud, which is both efficient and natural for handling occluded objects, since they are separated in 3D space.
3D Vehicle Detection from Sparse Point Cloud: Detecting 3D vehicles directly from the raw point cloud is practical and important for autonomous driving, since it avoids the sensor synchronization problem. Zhou et al. [voxelnet] first proposed the VoxelNet architecture to predict 3D bounding boxes from the point cloud, and the follow-up work [yan2018second] combined VoxelNet with sparse convolutions [3DSemanticSegmentationWithSubmanifoldSparseConvNet] to further improve efficiency and effectiveness. Several other works [yang2018pixor, lang2018pointpillars] project the point cloud to bird's eye view and utilize 2D CNNs to predict the 3D bounding boxes from bird's eye view feature maps. Shi et al. [shi2019pointrcnn] proposed the PointRCNN architecture, which directly generates 3D proposals from the raw point cloud by segmenting the foreground points and refines them in canonical coordinates. Different from these methods, our proposed one-stage detector SegVoxelNet further explores semantic context information as guidance to benefit confidence prediction and 3D box generation, and the new depth-aware detection architecture also improves 3D detection performance by learning separate features for point cloud at different depths.
III Method
In this section, we introduce our proposed single-stage detection framework SegVoxelNet, as illustrated in Fig. 2.
III-A Voxel Feature Encoder
The Voxel Feature Encoder (VFE) is applied to the raw point cloud to obtain a voxelized feature representation for the following SCE module. It consists of two steps, i.e., point cloud voxelization and voxel feature extraction.
We first partition the point cloud into equally spaced voxels of a fixed size. To save computation and reduce the imbalance in the number of points between voxels, a fixed number of 3D points is randomly sampled in each voxel. Finally, the averaged coordinates of the 3D points in a voxel are taken as the feature of that voxel. The voxelized point cloud is then processed sequentially by four repeated sparse convolution blocks. Each block consists of several 3D sub-manifold sparse convolution layers and a normal sparse convolution layer for down-sampling along the x- and y-axes and squeezing the z-axis. Each sparse convolution layer is followed by a BatchNorm layer and a ReLU layer. Finally, the output feature maps of the last block in the VFE are reshaped into 2D BEV tensors for further processing.
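The voxelization step above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the voxel size and per-voxel point cap are placeholder values (the actual settings appear in Sec. IV-B), and only non-empty voxels are kept.

```python
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.1), max_points=5):
    """Group a point cloud (N, 3) into voxels and compute mean-coordinate features.

    Returns a dict mapping non-empty voxel grid coordinates to the averaged
    coordinates of (at most max_points) points inside that voxel.
    """
    coords = np.floor(points / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for c, p in zip(map(tuple, coords), points):
        voxels.setdefault(c, []).append(p)

    features = {}
    rng = np.random.default_rng(0)
    for c, pts in voxels.items():
        pts = np.asarray(pts)
        if len(pts) > max_points:           # cap points per voxel to bound cost
            pts = pts[rng.choice(len(pts), max_points, replace=False)]
        features[c] = pts.mean(axis=0)      # averaged coordinates as the voxel feature
    return features
```

Only the non-empty voxel dictionary is stored, mirroring the sparse storage described in the implementation details.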
III-B Semantic Context Encoder
The Semantic Context Encoder (SCE) in Fig. 2 takes the BEV feature maps from the Voxel Feature Encoder as input and outputs semantic context encoded feature maps for detection. The proposed SCE consists of two branches that share the input VFE feature maps. The first branch is modified from SECOND [yan2018second]: a U-Net structure with a downsampling and an upsampling layer to obtain a larger receptive field. The output of the U-Net has the same size as the input, so it can be concatenated with the input VFE feature map to generate the main feature map. The second branch learns to predict BEV semantic masks, which are then used to enhance the feature maps from the first branch through a fusion module.
For the semantic segmentation branch, the semantic masks can be easily calculated by projecting the 3D ground truth boxes to BEV. Here we explore two types of masks:
voxel-type mask. We first match the 3D ground truth boxes to voxels. Only non-empty voxels which overlap with the boxes will be assigned as foreground.
box-type mask. We directly project the 3D ground truth boxes to BEV, and set voxels in the boxes to foreground.
Both types of masks can be used to train the SCE.
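The box-type mask generation can be sketched as below. This is a NumPy illustration under assumptions: boxes are given as BEV rectangles (cx, cy, l, w, yaw) in meters, and the grid resolution is a free parameter; the paper's actual rasterization details may differ.

```python
import numpy as np

def bev_box_mask(boxes, grid_shape=(100, 100), voxel_size=0.1):
    """Rasterize ground-truth boxes (cx, cy, l, w, yaw) into a binary BEV mask."""
    H, W = grid_shape
    xs = (np.arange(W) + 0.5) * voxel_size          # BEV cell centres along x
    ys = (np.arange(H) + 0.5) * voxel_size          # BEV cell centres along y
    gx, gy = np.meshgrid(xs, ys)                    # (H, W) coordinate grids
    mask = np.zeros(grid_shape, dtype=bool)
    for cx, cy, l, w, yaw in boxes:
        # rotate cell centres into the box frame, then test the axis-aligned extent
        dx, dy = gx - cx, gy - cy
        lx = dx * np.cos(yaw) + dy * np.sin(yaw)
        ly = -dx * np.sin(yaw) + dy * np.cos(yaw)
        mask |= (np.abs(lx) <= l / 2) & (np.abs(ly) <= w / 2)
    return mask
```

The voxel-type mask would additionally intersect this mask with the set of non-empty voxels.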
Inspired by FPN [FPN], the semantic segmentation branch uses a feature pyramid to extract multi-scale context information from the BEV image. Specifically, the network consists of residual blocks, max-pooling layers, upsampling layers, and convolutional layers followed by BatchNorm and ReLU layers for fusing the multi-scale feature maps. The segmentation branch is trained with the standard softmax cross-entropy loss.
Then, we fuse the two branches to obtain the semantic context encoded feature maps for the detection head. The fusion re-weights the feature maps by the probability map, following the attention residual learning of [wang2017residual]. Formally, given the probability map $M$ and the feature map $F$ of the first branch, the re-weighted feature is calculated by

$F'(c, x, y) = F(c, x, y) \cdot (1 + M(x, y)),$

where $c$ indexes the channel, $(x, y)$ indexes the spatial location, and $M(x, y) \in [0, 1]$ is the probability from the segmentation branch. This formulation enhances vehicle-related features and suppresses noise in the trunk features.
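The attention-residual fusion is a one-liner in practice; the sketch below assumes NumPy arrays with the feature map in (C, H, W) layout.

```python
import numpy as np

def reweight(features, prob_map):
    """Attention-residual fusion: F'(c, x, y) = F(c, x, y) * (1 + M(x, y)).

    features: (C, H, W) trunk feature map; prob_map: (H, W) with values in [0, 1].
    The residual form preserves the original features (identity) where M ~ 0,
    while amplifying responses on cells the segmentation branch marks as vehicle.
    """
    return features * (1.0 + prob_map[None, :, :])
```

A simple alternative is concatenating the probability map as an extra channel; the residual form adds no parameters.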
III-C Depth-aware Head
In the 2D detector SSD [liu2016ssd], the detection head receives RPN feature maps at different scales to classify and regress vehicles of different sizes. Unlike a camera, LiDAR is an active sensor that scans the scene to generate a point cloud, so the sizes of vehicles in the point cloud are invariant to depth. This scale invariance removes the need for multi-scale feature learning in the detection head. However, the density of points on vehicles at different depths differs greatly, as demonstrated in Fig. 1: the number of points on a vehicle decreases rapidly with increasing depth. Based on these observations, we design a depth-aware head with convolution layers of different kernel sizes and dilation rates, as illustrated in Fig. 2. To increase model capacity while keeping the network as efficient as possible, we divide the input feature maps from the SCE into three parts along the x-axis of the LiDAR coordinate frame, with two overlapping regions whose length is designed to be twice the vehicle length. For the three parts, convolution layers with different kernel sizes and dilation rates are used to learn distinctive features at the three distance ranges. Each part solves the same detection multi-task and is trained by minimizing the same loss function as SECOND [yan2018second]:

$L_{det} = \frac{1}{N_{pos}} \left( \beta_{cls} L_{cls} + \beta_{loc} L_{loc} + \beta_{dir} L_{dir} \right),$

which is normalized by the number of positive anchors $N_{pos}$; the loss weights $\beta_{cls}$, $\beta_{loc}$ and $\beta_{dir}$ follow SECOND [yan2018second].
To perform 3D vehicle detection, ground truth boxes and anchors are parameterized as $(x, y, z, w, l, h, \theta)$. The localization regression residuals between the ground truth and the anchors are defined by:

$\Delta x = \frac{x^g - x^a}{d^a},\quad \Delta y = \frac{y^g - y^a}{d^a},\quad \Delta z = \frac{z^g - z^a}{h^a},$
$\Delta w = \log\frac{w^g}{w^a},\quad \Delta l = \log\frac{l^g}{l^a},\quad \Delta h = \log\frac{h^g}{h^a},\quad \Delta\theta = \theta^g - \theta^a,$

where the superscripts $g$ and $a$ denote the ground truth and anchor boxes respectively, and $d^a = \sqrt{(w^a)^2 + (l^a)^2}$ is the diagonal of the anchor base. The localization loss is:

$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b).$
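The SECOND-style target encoding and Smooth-L1 loss can be sketched as follows. This is a NumPy illustration; the box component ordering and the Smooth-L1 beta are assumptions for the sketch.

```python
import numpy as np

def encode_residuals(gt, anchor):
    """SECOND-style regression residuals between a ground-truth box and an anchor.

    Boxes are (x, y, z, w, l, h, theta); d_a normalizes the centre offsets
    by the anchor's base diagonal so the targets are scale-balanced.
    """
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(wa ** 2 + la ** 2)
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
        np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
        tg - ta,
    ])

def smooth_l1(x, beta=1.0):
    """Smooth-L1 (Huber) loss summed over the residual vector."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta).sum()
```

When the ground truth coincides with the anchor, every residual is zero and the loss vanishes, which is the sanity check below.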
Since this loss cannot distinguish opposite orientations of a bounding box, we add a cross-entropy loss on the classification of discretized directions [yan2018second]. For vehicle classification, we use the focal loss [lin2017focal].
Finally, we fuse the class score feature maps from the three parts by taking the highest class score at each position. The detected vehicles are obtained by applying oriented NMS with a bird's eye view IoU threshold to remove overlapping bounding boxes.
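The depth-aware splitting and score fusion can be illustrated with the sketch below. The three part-specific heads are stand-in callables, and the part boundaries and overlap length are placeholder values, not the paper's settings.

```python
import numpy as np

def depth_aware_scores(feat, heads, overlap):
    """Split a BEV map (H, W) along the depth axis into near/middle/far parts
    with overlapping borders, score each part with its own head, and fuse by max.

    feat: (H, W) feature map whose row index grows with depth (x in LiDAR frame);
    heads: three callables mapping a 2-D array to per-cell scores of the same shape.
    """
    H = feat.shape[0]
    b1, b2 = H // 3, 2 * H // 3                       # nominal part boundaries
    parts = [(0, min(b1 + overlap, H)),               # near part, extended by overlap
             (max(b1 - overlap, 0), min(b2 + overlap, H)),
             (max(b2 - overlap, 0), H)]               # far part, extended by overlap
    scores = np.full(feat.shape, -np.inf)
    for (lo, hi), head in zip(parts, heads):
        s = head(feat[lo:hi])                         # part-specific head on its slice
        scores[lo:hi] = np.maximum(scores[lo:hi], s)  # keep the highest score per cell
    return scores
```

Overlapping slices ensure vehicles near a boundary are seen by two heads, and the max-fusion keeps whichever head is more confident.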
III-D Final Loss Function
The final loss function for training our proposed SegVoxelNet is defined as:

$L = \sum_{i} L_{det}^{(i)} + \lambda L_{seg},$

where $i$ denotes the part index of the depth-aware head, $L_{det}^{(i)}$ is the detection loss of the $i$-th part, $L_{seg}$ is the semantic segmentation loss, and $\lambda$ balances the semantic segmentation and detection classification constraints.
| Method | Modality | Speed (FPS) | 3D Box (Easy / Mod / Hard) | Bird's Eye View (Easy / Mod / Hard) | Orientation (Easy / Mod / Hard) |
| --- | --- | --- | --- | --- | --- |
| AVOD-FPN [ku2018joint] | LiDAR & RGB | 10 | 81.94 / 71.88 / 66.38 | 88.53 / 83.79 / 77.90 | 89.95 / 87.13 / 79.74 |
| F-PointNet [qi2018frustum] | LiDAR & RGB | 5.9 | 81.20 / 70.39 / 62.19 | 88.70 / 84.00 / 75.33 | - / - / - |
| UberATG-MMF [Liang_2019_CVPR] | LiDAR & RGB | 12 | 86.81 / 76.75 / 68.41 | 89.49 / 87.47 / 79.10 | - / - / - |
| PointRCNN [shi2019pointrcnn] | LiDAR (two-stage) | 10 | 84.32 / 75.42 / 67.86 | 89.28 / 86.04 / 79.02 | 90.76 / 89.55 / 80.76 |
(Figure caption) Qualitative 3D detection results. The detected vehicles are shown with 3D bounding boxes, and the orientation of each vehicle is indicated by the middle line on the bottom face of the 3D box. Ground truths are drawn in green and predictions in red.
IV Experiments and Results
In this section, we first introduce our experimental setup, including the dataset and implementation details, and then compare with previous state-of-the-art methods for 3D vehicle detection on both the val and test splits of the KITTI dataset [Geiger2012CVPR].
IV-A Dataset
The KITTI dataset [Geiger2012CVPR] is used in all experiments. It contains 7,481 training samples and 7,518 test samples, each consisting of an RGB image and a corresponding point cloud. Following common practice [chen2017multi], we split the training samples into a train split (3,712 samples) and a val split (3,769 samples). We train our model on the train split and compare it with other state-of-the-art methods on both the val and test splits. The KITTI benchmark is stratified into easy, moderate and hard difficulties, and the leaderboard is ranked by 3D average precision (AP) on the moderate difficulty.
IV-B Implementation Details
First, we voxelize the point cloud within a fixed range and with a fixed voxel size along each axis of the LiDAR coordinate system. The maximum number of randomly sampled points in each voxel is set to 5. In this setting, the point cloud from an HDL-64E Velodyne LiDAR in the KITTI [Geiger2012CVPR] dataset yields only a small fraction of non-empty voxels; we store only the non-empty voxels and their coordinates to reduce memory and further speed up processing. The voxel feature encoder (VFE) network consists of four blocks: Block1 (4, 16, 2, 2), Block2 (16, 32, 2, 2), Block3 (32, 64, 3, 2) and Block4 (64, 64, 3, 1). In the semantic context encoder (SCE) network, multi-scale feature maps are output and fused. In the depth-aware head, we split the feature maps into three parts along the x-axis, with two overlapping regions to account for cars near the part boundaries.
We initialize our model as in [he2015delving]. For training, a proposal is considered positive if its 3D IoU with a ground truth box is above 0.6 and negative if its IoU is below 0.45. Each position has two anchors with different orientations. We train SegVoxelNet end-to-end using the AdamW optimizer [loshchilov2017fixing].
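The anchor labeling rule can be sketched as below. For brevity, this sketch uses axis-aligned BEV rectangles (x1, y1, x2, y2) instead of the rotated 3D IoU used in the paper; the 0.6/0.45 thresholding logic is the same.

```python
import numpy as np

def assign_anchors(anchors, gts, pos_thr=0.6, neg_thr=0.45):
    """Label anchors as positive (1), negative (0) or ignored (-1) by best IoU."""
    labels = np.full(len(anchors), -1, dtype=int)
    for i, a in enumerate(anchors):
        best = 0.0
        for g in gts:
            # axis-aligned intersection-over-union of two rectangles
            ix = max(0.0, min(a[2], g[2]) - max(a[0], g[0]))
            iy = max(0.0, min(a[3], g[3]) - max(a[1], g[1]))
            inter = ix * iy
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (g[2] - g[0]) * (g[3] - g[1]) - inter)
            best = max(best, inter / union)
        if best >= pos_thr:
            labels[i] = 1          # matched: regression targets are computed
        elif best < neg_thr:
            labels[i] = 0          # background: contributes only to classification
    return labels                  # anchors in between are ignored during training
```

Anchors whose best overlap falls between the two thresholds are ignored so that ambiguous matches do not pollute either class.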
For data augmentation, we follow SECOND [yan2018second] by randomly selecting several ground truths and merging them into the current scene. However, we find that the ground planes of different scenes have different heights, so we introduce a ground plane equation calculated by RANSAC [li2017improved] to constrain the augmented samples. All ground truth boxes and the associated LiDAR points are translated along each axis and rotated uniformly within fixed ranges.
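The RANSAC ground-plane estimation can be sketched as follows. This is a NumPy illustration; the iteration count and inlier threshold are assumed values, and the plane is parameterized as z = a*x + b*y + c.

```python
import numpy as np

def ransac_ground_plane(points, iters=100, thresh=0.05, seed=0):
    """Fit a ground plane z = a*x + b*y + c by RANSAC over a point cloud (N, 3).

    Returns the coefficients (a, b, c) with the most inliers; augmented boxes
    can then be snapped onto this plane so pasted objects do not float or sink.
    """
    rng = np.random.default_rng(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(len(points), 3, replace=False)
        p = points[idx]
        A = np.c_[p[:, 0], p[:, 1], np.ones(3)]
        try:
            coeff = np.linalg.solve(A, p[:, 2])      # plane through the 3 samples
        except np.linalg.LinAlgError:
            continue                                  # degenerate (collinear) sample
        pred_z = points[:, 0] * coeff[0] + points[:, 1] * coeff[1] + coeff[2]
        inliers = np.sum(np.abs(points[:, 2] - pred_z) < thresh)
        if inliers > best_inliers:
            best, best_inliers = coeff, inliers
    return best
```

With a large majority of ground points, a pure three-point sample recovers the plane exactly and dominates the inlier count.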
IV-C 3D Vehicle Detection Results on KITTI
Evaluation of 3D vehicle detection: All detection results are measured using the official KITTI evaluation metrics, where average precision (AP) is the metric for 3D and BEV detection, computed with rotated IoU, and average orientation similarity (AOS) is the metric for orientation estimation, as in [Geiger2012CVPR].
We evaluate the 3D detection results on the KITTI [Geiger2012CVPR] test split in Table I. On the most important metric, 3D AP at moderate difficulty, our proposed SegVoxelNet outperforms all previous methods except [Liang_2019_CVPR], which uses both LiDAR and RGB as input and benefits from multi-task learning. Compared with LiDAR-only single-stage methods [voxelnet, lang2018pointpillars, yan2018second], our method outperforms them by a large margin across all difficulties in 3D vehicle detection. We also achieve results comparable to the LiDAR-only two-stage method [shi2019pointrcnn] while running faster. For BEV detection, our method achieves better AP on the most important moderate difficulty than all other LiDAR-only methods and comparable AP on the easy and hard difficulties. The orientation estimation results show that our method predicts much more accurate orientations at hard difficulty, while at easy and moderate difficulties it outperforms all other methods except [shi2019pointrcnn].
We also report the 3D detection performance on the val split in Table II. Our method outperforms previous state-of-the-art methods at all difficulties, which demonstrates its effectiveness. Our method also achieves strong BEV AP and AOS across the easy, moderate and hard difficulties.
The benefit of integrating semantic segmentation information into the detection feature maps can also be observed qualitatively. We further evaluate different connections between the semantic segmentation feature map and the detection feature map quantitatively in Sec. IV-E.
We provide qualitative results on the val split of the KITTI [Geiger2012CVPR] dataset in Fig. 4 and Fig. 5. For ease of interpretation, we visualize the 3D bounding box predictions from both the top-view and the image front-view perspectives. Fig. 4 shows our detection results with tight oriented 3D bounding boxes. As shown in Fig. 5, although our method predicts most vehicles accurately, some difficult examples remain (partially occluded or faraway vehicles, and similar classes such as vans), which lead to false positives. Besides, Fig. 5 shows some false positives that are actual vehicles but are not labeled in the ground truth annotations.
IV-D Real-time Inference
As shown in Table I, our method achieves impressive 3D vehicle detection results with real-time efficiency. We divide SegVoxelNet into its components and analyze the runtime of each separately; all runtimes are measured on a desktop with an Intel i7 CPU and a GTX 1080 Ti GPU. The main inference steps are as follows: the point cloud is first preprocessed into voxels and transferred to the GPU; the voxel input tensor is encoded by the VFE, the semantic context encoded feature maps are extracted by the SCE and processed by the depth-aware head; finally, NMS is applied on the CPU. Since a LiDAR typically operates at 10 Hz and many further speed-ups are possible, our method runs with real-time efficiency.
| Method | Modality | Easy | Moderate | Hard |
| --- | --- | --- | --- | --- |
| F-PointNet [qi2018frustum] | LiDAR & RGB | 83.76 | 70.92 | 63.65 |
| AVOD-FPN [ku2018joint] | LiDAR & RGB | 84.41 | 74.44 | 68.65 |
| PointRCNN [shi2019pointrcnn] | LiDAR (two-stage) | 88.88 | 78.63 | 77.38 |
IV-E Ablation Study
In this section, we conduct extensive ablation experiments to analyze the effectiveness of the different components of SegVoxelNet. All models are trained on the train split and evaluated on the val split with the car class.
Effect of different fusions between semantic segmentation feature maps and detection feature maps: In the Semantic Context Encoder (SCE), semantic context information is introduced to highlight vehicle regions and suppress the background. We evaluate the quantitative effect of the proposed fusion method by comparing the attention residual learning with simple concatenation along the channel axis. Table III shows the 3D vehicle detection performance with re-weighting and concatenation. Both methods outperform the baseline without the semantic branch, which confirms the effectiveness of semantic context information for detection. While the performance gap between the two variants is very small, the proposed re-weighting method has fewer parameters and is more efficient than the concatenation alternative.
Effect of the depth-aware head: As described in Sec. III-C, for high efficiency we design the depth-aware head with small overlapping regions between the parts of the SCE output feature map, which adds negligible computation. To show its importance, we compare it with our baseline on 3D vehicle detection performance, as shown in Table III. The depth-aware head improves the 3D AP by about 0.3 to 0.4 points across the three difficulties. The gain of each branch is shown in Table IV: we achieve a large improvement in the middle (M) branch and a small improvement in the near (N) branch, but a small decrease in the far (F) branch. We believe the latter may be caused by unlabeled vehicles at far distances, as shown in Fig. 5.
In this paper, we propose a unified framework called SegVoxelNet that incorporates semantic information into 3D vehicle detection, where semantic context serves as active guidance for detection. Moreover, an efficient depth-aware head is designed for vehicles at different depths in autonomous driving scenarios. Experiments show that, compared with other LiDAR-only methods, SegVoxelNet achieves state-of-the-art results on the challenging KITTI 3D detection benchmark [Geiger2012CVPR] with real-time efficiency.
This project was supported by the National Key R&D Program of China (No.2017YFB1002700, No.2017YFB0203000) and NSFC of China (No.61632003, No.61661146002, No.61631001).