Stereo RGB and Deeper LIDAR Based Network for 3D Object Detection

by Qingdong He, et al.

3D object detection is an emerging task in autonomous driving scenarios. Previous works process 3D point clouds using either projection-based or voxel-based models. However, both approaches have drawbacks. Voxel-based methods lack semantic information, while projection-based methods suffer substantial loss of spatial information when the points are projected to different views. In this paper, we propose the Stereo RGB and Deeper LIDAR (SRDL) framework, which utilizes semantic and spatial information simultaneously to improve the performance of the network for 3D object detection. Specifically, the network generates candidate boxes from stereo pairs and combines different region-wise features using a deep fusion scheme. The stereo strategy offers more information for prediction than prior works. Then, several local and global feature extractors are stacked in the segmentation module to capture richer deep semantic and geometric features from point clouds. After aligning the interior points with the fused features, the proposed network refines the prediction more accurately and encodes the whole box with a novel compact method. Experimental results on the challenging KITTI detection benchmark demonstrate the effectiveness of utilizing both stereo images and point clouds for 3D object detection.




1 Introduction

Accurate and robust 3D object detection is indispensable for many real-world applications, such as autonomous driving Geiger et al. (2012) and augmented reality (AR) Park et al. (2008). State-of-the-art methods achieve a high average precision (AP) in 2D object detection Ren et al. (2015); Redmon et al. (2016) and have obtained strong results on public datasets such as KITTI Geiger et al. (2012) and COCO Chen et al. (2015). However, directly extending 2D detection methods to 3D is nontrivial due to the sparseness and irregularity of point clouds. How to process point cloud data together with the semantic information from RGB data remains an active and challenging problem.

Researchers have explored several methods to tackle these problems, aiming to obtain geometric information such as target position, size and pose in 3D space. Some works Chen et al. (2016); Mousavian et al. (2017); Xu and Chen (2018); Chen et al. (2017a); Li et al. (2018) build networks that rely on the characteristics of RGB images. However, the key problem of image-based methods is that depth information cannot be obtained directly, which results in a large positioning error of the object in 3D space. Even stereo vision Li et al. (2019) is very sensitive to factors such as illumination variations and occlusions, which lead to deviations in depth calculations.

Compared with image data, LIDAR point cloud data have accurate depth information and spatial features. At present, most state-of-the-art 3D object detection algorithms focus on processing LIDAR point clouds through projection Simon et al. (2019); Chen et al. (2017b); Ku et al. (2018); Liang et al. (2018); Yang et al. (2018a, b) or voxelization Li (2017); Engelcke et al. (2017); Zhou and Tuzel (2018). However, these works either suffer from information loss during projection and quantization Liu et al. (2019) or depend heavily on the performance of 2D object detectors Qi et al. (2018). Recently, some works Shi et al. (2019b); Yang et al. (2019); Shi and Rajkumar (2020) propose to operate only on the point clouds for 3D object detection. But they achieve inferior performance, especially on cyclists and pedestrians, because they discard the information from the image plane.

Different from the aforementioned methods, we observe that a stereo camera can provide large-scale perception from two views and LIDAR sensors can capture accurate 3D structures, so combining them exploits their respective advantages while making up for their shortcomings. In other words, the left and right images can provide a more accurate receptive field while achieving comparable depth and position accuracy. Furthermore, we find that the most commonly used PointNet Qi et al. (2017a, b) fails to capture local feature information at variable scales, because it processes 3D points independently to maintain permutation invariance and therefore ignores the distance metric between the points. Although the later SAWnet Kaul et al. (2019) integrates the global features from a shared Multi-Layer Perceptron (MLP) with the dynamic locality information from Dynamic Graph CNNs (DGCNNs) Wang et al. (2019), it is unable to focus on important features and suppress unnecessary ones in its residual connections He et al. (2016a).

Motivated by these observations, we present the Stereo RGB and Deeper LIDAR (SRDL) network for 3D object detection, which takes stereo RGB images and LIDAR point clouds as input and utilizes an attention mechanism to achieve robust and accurate 3D detection. Specifically, the left and right views generate proposals from different angles that do not completely overlap. They can correct each other, and a more precise region can be generated during the fusion phase. Considering that the fused proposals may still contain noisy points and excess space around the objects, a feature-oriented segmentation network on 3D point clouds is designed to separate the object points from the background. Given the segmented object points and cropped proposals, we propose to encode the bounding boxes in a novel compact manner by adding more constraints. This design removes more redundancy and locates the size of the objects more accurately while reducing the feature dimensions.

The main contributions of our work can be summarized as follows:

  • To the best of our knowledge, we are the first to propose a novel framework that combines semantic information from stereo images and spatial information from raw point clouds for 3D object detection.

  • We propose a residual attention learning mechanism to optimize the segmentation network, which can extract deeper geometric features at different levels from the original irregular 3D point clouds.

  • We propose a novel 3D bounding box encoding scheme that regresses the oriented 3D boxes in a more compact manner, ensuring higher 3D localization accuracy.

  • Our proposed SRDL network achieves comparable results with the state-of-the-art image-based and LIDAR-based methods on the challenging KITTI 3D detection dataset.

2 Related work

Image-based 3D object detection. For processing RGB images, there are two mainstream approaches: monocular-based and stereo-based methods. Among monocular-based methods, many works Ma et al. (2019); Chen et al. (2016); Mousavian et al. (2017); Brazil and Liu (2019) share a similar framework with 2D detection. Surprisingly, only a few works utilize stereo vision for 3D object detection Chen et al. (2017a); Li et al. (2018). Typically, Li et al. (2019) propose Stereo RCNN, which detects and associates objects in stereo images using both semantic properties and dense constraints of objects, extending Faster RCNN Ren et al. (2015) to stereo inputs. However, none of the above approaches combines stereo images with point clouds to exploit the advantages of both, and they fail to achieve superior performance because they lack accurate depth information.

LIDAR-based object detection. Generally, there are two major ways to process the point clouds from 3D LIDAR sensors: voxelization and direct processing of raw point clouds. The voxelization-based methods Liu et al. (2020); Zhou and Tuzel (2018); Yan et al. (2018); Shi et al. (2019a) usually take voxels as input and apply either 2D or 3D convolution to make predictions. VoxelNet Zhou and Tuzel (2018) is one of the first methods to apply a PointNet-like network to learn low-level geometric features with several stacked VFE layers in 3D voxelization space. However, the network structure is computationally inefficient, and the shallow 3D CNN layers Li (2017) are not enough to extract deeper 3D features. Even though SECOND Yan et al. (2018) applies sparse convolution to accelerate VoxelNet, it is still unable to overcome the 3D convolution bottleneck.

Besides, PointNet Qi et al. (2017a) and PointNet++ Qi et al. (2017b) are the two pioneering works that operate directly on raw points to extract features without converting them to other formats. Using PointNet as the backbone network, several works infer 3D objects from point clouds Shi et al. (2019b); Chen et al. (2019); Lang et al. (2019); Shi et al. (2019c); Yang et al. (2019). Very recently, Point-GNN Shi and Rajkumar (2020) even proposes a graph neural network to detect objects from point clouds.

LIDAR and RGB image fusion based object detection. The majority of state-of-the-art 3D object detection methods adopt a LIDAR and mono-image fusion scheme to provide accurate 3D information, and they process the raw LIDAR input in different representations. Many methods Chen et al. (2017b); Ku et al. (2018); Liang et al. (2018); Yang et al. (2018b, a) project point clouds to the bird's eye view or front view and utilize 2D CNNs to obtain denser information for 3D box generation. However, these methods remain limited when detecting small objects such as pedestrians and cyclists, and they do not handle cases with multiple objects along the depth direction. F-PointNet Qi et al. (2018) is the first method to utilize mature 2D detectors and raw point clouds to predict 3D objects: PointNet is employed to process the point cloud within every cropped image region to detect 3D objects. However, a mono image cannot provide features as comprehensive as a binocular pair, and PointNet lacks the ability to capture local feature information in the metric space.

3 Proposed method

Figure 1: Architecture of the proposed SRDL network which contains three modules: (a) Stereo proposal fusion, which takes stereo RGB images as input and utilizes CNN to generate 2D proposals following RoIpooling in the two views respectively. (b) The 3D point clouds segmentation stacks several attention-based layers to separate the points of objects from backgrounds according to the projected proposals after the fusion operation. (c) After refining the boxes, the 3D bounding box regression module proposes to encode the bounding box in a more accurate scheme to get the final detection results.
Figure 2: Proposals from left and right views. The proposals from the left and right views are not completely overlapped and the final fused proposal is more accurate than either of them.

In this paper, we propose the Stereo RGB and Deeper LIDAR (SRDL) network for 3D object detection, which includes three modules: stereo proposal fusion, 3D point cloud segmentation and 3D bounding box regression, as shown in Figure 1. In the following subsections, we introduce these modules in detail.

3.1 Stereo proposal fusion

In our framework, we take the stereo images as input and leverage a mature 2D object detector to generate 2D object proposals for the left and right images respectively. At the same time, we apply convolution-deconvolution in each view to acquire features at a higher resolution. Using the 2D proposals, RoIpooling is employed in each view to obtain features of the same size. Finally, we fuse the two cropped features output by RoIpooling via an element-wise mean operation.
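The crop-and-fuse step described above can be sketched as follows. This is a minimal NumPy illustration, not the network's implementation: the helper names `roi_max_pool` and `fuse_stereo_rois`, the 7x7 output size, the feature-map shapes and the box coordinates are all assumptions made for the example, and the feature maps are random stand-ins for the convolution-deconvolution outputs.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=7):
    """Crop box=(x0, y0, x1, y1) from a (C, H, W) map and max-pool it to (C, out_size, out_size)."""
    x0, y0, x1, y1 = box
    crop = feature_map[:, y0:y1, x0:x1]
    channels, height, width = crop.shape
    # Integer bin edges that split the crop into out_size x out_size cells.
    ys = np.linspace(0, height, out_size + 1).astype(int)
    xs = np.linspace(0, width, out_size + 1).astype(int)
    pooled = np.empty((channels, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee every cell covers at least one pixel.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = crop[:, ys[i]:y_end, xs[j]:x_end].max(axis=(1, 2))
    return pooled

def fuse_stereo_rois(left_roi, right_roi):
    """Element-wise mean of two equally sized RoI features."""
    return (left_roi + right_roi) / 2.0

# Hypothetical feature maps and proposals for the two views.
rng = np.random.default_rng(0)
left_map = rng.normal(size=(8, 32, 48))
right_map = rng.normal(size=(8, 32, 48))
left_roi = roi_max_pool(left_map, (5, 4, 30, 25))
right_roi = roi_max_pool(right_map, (8, 4, 33, 25))
fused = fuse_stereo_rois(left_roi, right_roi)
print(fused.shape)
```

Because both crops are pooled to the same fixed size, the mean is well defined even when the left and right proposals differ in width or height.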

As Figure 2 shows, the outputs of the left and right branches do not completely overlap. Instead, each branch generates different proposals from its own view. With a known camera projection matrix to offer accurate depth information, each bounding box can be projected into 3D space, and the two projections intersect in a shared object area. After the final element-wise fusion, the resulting proposal contains less empty space and fewer irrelevant points, and it is more accurate than either of the initial proposals thanks to their mutual supervision and correction.
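One way to picture the intersection of the two projected proposals is to keep only the 3D points that fall inside both 2D boxes. The sketch below assumes pinhole projection matrices of shape 3x4, points already expressed in the left camera frame, and a right camera shifted along the x axis by a baseline; the function name `frustum_mask` and all numeric values are hypothetical.

```python
import numpy as np

def frustum_mask(points, proj, box2d):
    """Boolean mask of (N, 3) points whose projection with a 3x4 matrix lies inside box2d."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    uvw = homo @ proj.T
    depth = uvw[:, 2]
    in_front = depth > 1e-6                      # discard points behind the camera
    safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by zero or negatives
    u = uvw[:, 0] / safe_depth
    v = uvw[:, 1] / safe_depth
    u0, v0, u1, v1 = box2d
    return in_front & (u >= u0) & (u <= u1) & (v >= v0) & (v <= v1)

# Hypothetical calibration: focal length 100 px, principal point (50, 50), baseline 0.5 m.
focal, cx, cy, baseline = 100.0, 50.0, 50.0, 0.5
proj_left = np.array([[focal, 0.0, cx, 0.0],
                      [0.0, focal, cy, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
proj_right = proj_left.copy()
proj_right[0, 3] = -focal * baseline

points = np.array([[0.0, 0.0, 10.0],
                   [5.0, 0.0, 10.0],
                   [0.0, 0.0, -5.0],
                   [0.5, -0.5, 10.0]])
left_box = (40, 40, 60, 60)
right_box = (35, 40, 55, 60)

# A point survives only if both views agree that it lies inside their proposal.
mask = frustum_mask(points, proj_left, left_box) & frustum_mask(points, proj_right, right_box)
print(mask)  # [ True False False  True]
```

The LIDAR-to-camera transform is omitted here; in practice the points would first be moved into the camera frame with the dataset calibration.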

3.2 3D point cloud segmentation

As illustrated in Figure 1, the fused proposal is fed into the second module together with the depth range to outline an appropriate location in 3D space. Given the 2D image region and its corresponding 3D locations, we design a 3D segmentation network to separate the object's 3D points from the background for further 3D coordinate regression.

3.2.1 Architecture overview

The input of the segmentation network is fed into a transformer net that uses attention-based layers to regress a transformation matrix, the elements of which are the learnt affine transformation values for point cloud alignment. The aligned points are then fed into several stacked attention-based layers to generate a permutation-invariant embedding of the points. Among them, the residual attention module serves as a bridge between two adjacent layers to transfer information. After that, the outputs of all the previous 1024-D attention-based layers are concatenated and max pooling is used to obtain the final global information aggregation for the point clouds. The information is then fed into an MLP layer to predict a score matrix and make a point-wise prediction. More details of the specific architecture are described in the supplementary.

3.2.2 Attention-based layer

Figure 3: Illustration of how information propagates between two adjacent layers. Residual-attention connections transfer local and global information from the previous layer to the current layer individually.

The architecture of the attention-based layers is shown in Figure 3, in which the current layer is considered an intermediate layer and the features are transmitted not only by the mainstream information flow but also by the residual attention link from the previous layer. The point embedding from the current layer is input into two parallel layers: a local EdgeConv layer and a global MLP layer. The local EdgeConv layer constructs a dynamic graph and incorporates nearest local neighborhood information. The global MLP layer operates on each point independently and subsequently applies a symmetric function to accumulate features. The outputs of these two layers are combined element-wise with the outputs of the same branch of the previous layer before being concatenated together. The two layers also individually transfer information to the next embedding layer using residual attention connections.

Specifically, consider a set of points in an embedding space whose dimension can simply be set to 3, which means each point contains three coordinates. These points are processed in parallel by the local EdgeConv layer and the global MLP layer in each attention-based layer. In the local EdgeConv branch, the input is expressed in terms of each point's nearest neighbours in the dynamic graph, and the EdgeConv operation evaluates a point's dependency on those nearest neighbours. The extracted edge features are fed into batch normalization and a ReLU. The output can be represented as:


After applying another edgeconv-BN layer, the output can be represented as:


Here the activation is the ReLU function, and each of the two layers has its own MLP weights. After max pooling over the output, the attention map from the residual module is added to the output in a point-wise manner, which can be written as:


Similarly, in the global MLP layer, the input points are transformed by an MLP with shared weights. The output of the first shared MLP layer is:


Applying another MLP-BN layer, the output is:


This output is connected to the attention-aware feature from the residual attention module in the same way:


Here the attention map is applied through element-wise multiplication, and its values vary between 0 and 1 in response to different features. Different from the original ResNet, the output of our residual attention module works as a feature filter that weakens the noisy features and amplifies the good ones. Note that the outputs from the two branches have the same dimension and are transferred to the next layer as well. Finally, they are concatenated together and the embedded points are fed into the next layer as the input.
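The two-branch layer described above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions that are not taken from the paper: a single linear map per branch instead of two MLP stages, no batch normalization, no symmetric pooling in the global branch, an attention mask computed as a sigmoid of a linear map of the previous layer's features, and a gating of the form (1 + mask) * feature borrowed from residual attention networks. All names and shapes are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbours of every point (excluding itself)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def edge_conv(x, weight, k):
    """Local branch: shared linear map + ReLU on edge features, max over neighbours."""
    idx = knn_indices(x, k)                           # (N, k)
    centre = np.repeat(x[:, None, :], k, axis=1)      # (N, k, F)
    edge = np.concatenate([centre, x[idx] - centre], axis=-1)  # (N, k, 2F)
    return np.maximum(edge @ weight, 0.0).max(axis=1)           # (N, C)

def global_mlp(x, weight):
    """Global branch: the same linear map + ReLU applied to every point independently."""
    return np.maximum(x @ weight, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_layer(x, prev_local, prev_global, params, k=4):
    """One layer: both branches, gated by masks in (0, 1) built from the previous layer."""
    local = edge_conv(x, params["w_edge"], k)
    glob = global_mlp(x, params["w_mlp"])
    mask_local = sigmoid(prev_local @ params["w_att_local"])
    mask_global = sigmoid(prev_global @ params["w_att_global"])
    local = (1.0 + mask_local) * local     # assumed residual-attention gating
    glob = (1.0 + mask_global) * glob
    return local, glob, np.concatenate([local, glob], axis=1)

rng = np.random.default_rng(0)
n_points, in_dim, out_dim = 16, 3, 8
x = rng.normal(size=(n_points, in_dim))
prev_local = rng.normal(size=(n_points, out_dim))
prev_global = rng.normal(size=(n_points, out_dim))
params = {
    "w_edge": rng.normal(size=(2 * in_dim, out_dim)),
    "w_mlp": rng.normal(size=(in_dim, out_dim)),
    "w_att_local": rng.normal(size=(out_dim, out_dim)),
    "w_att_global": rng.normal(size=(out_dim, out_dim)),
}
local, glob, embedding = attention_layer(x, prev_local, prev_global, params)
print(embedding.shape)  # (16, 16)
```

Because the neighbour aggregation is a max and the global branch is point-wise, reordering the input points only reorders the output rows, which is the property the stacked layers rely on.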

3.2.3 Attention module

Figure 4: Architecture of attention module which mainly consists of residual units in a bottom-up and top-down manner.

The attention module not only attempts to emphasize meaningful features but also enhances different representations of objects at certain locations. We design our attention module as a bottom-up and top-down structure, as shown in Figure 4. The bottom-up operation aims to collect global information and the top-down operation combines the global information with the original feature maps. We use the residual unit in He et al. (2016b) as the basic unit of the attention module. The attention module contains three blocks. In block1, max pooling and a residual unit are applied to enlarge the receptive field. After reaching the lowest resolution, a symmetrical top-down architecture in block2 infers each pixel to obtain dense features. Besides, we append skip connections between the bottom-up and top-down feature maps to capture features at different scales. In block3, a bilinear interpolation is inserted after a residual unit to up-sample the output. Finally, we use the sigmoid function to normalize the output after two consecutive convolution layers to balance the dimensions.
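The bottom-up and top-down pattern can be illustrated with a deliberately reduced sketch: one max-pooling step, one linear up-sampling step, a skip connection and a sigmoid. The residual units and convolution layers of the actual module are left out, the rows of the feature array are treated as an ordered 1D grid, and the name `soft_attention_mask` is hypothetical.

```python
import numpy as np

def soft_attention_mask(features):
    """Return a mask in (0, 1) with the same (N, C) shape as the input features."""
    n, c = features.shape
    if n % 2:
        raise ValueError("this sketch expects an even number of rows")
    # Bottom-up: stride-2 max pooling halves the resolution and widens the receptive field.
    pooled = features.reshape(n // 2, 2, c).max(axis=1)
    # Top-down: linear interpolation back to the original resolution, channel by channel.
    coarse_pos = np.arange(n // 2)
    fine_pos = np.linspace(0, n // 2 - 1, n)
    upsampled = np.stack(
        [np.interp(fine_pos, coarse_pos, pooled[:, ch]) for ch in range(c)], axis=1
    )
    # Skip connection with the input, then sigmoid normalisation.
    return 1.0 / (1.0 + np.exp(-(upsampled + features)))

features = np.random.default_rng(0).normal(size=(8, 4))
mask = soft_attention_mask(features)
print(mask.shape, float(mask.min()) > 0.0, float(mask.max()) < 1.0)
```

The resulting mask has one weight per feature entry, so it can be applied to a branch output by element-wise multiplication.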

3.3 3D bounding box regression

Given the segmented object points, this part regresses the final 3D bounding box by a more accurate bounding box encoding scheme after the proposal refining.

3.3.1 3D proposal refining

After the segmentation operation on the point cloud, we can separate the object points from the background and acquire the points inside the bounding box at the location given by the first module. However, combining the predefined proposals from the first module with the segmentation network only yields a relatively rough box. Therefore, we propose to pool 3D points and their corresponding features to rescale the proposal. For each 3D box proposal, we define a new 3D box by adding a constant to its size to resize the box. For each point, a validation test is performed to decide whether it is inside the resized box or not. If it is, the point and its features are kept for refining the box proposal. The ablation study illustrates the effectiveness of this operation in improving performance.
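The resize-and-test step can be sketched as below. The box parametrisation (centre, size as length, width, height, and a yaw angle about a vertical z axis) is an assumption made for illustration, as are the function name `points_in_resized_box` and the example numbers; the same constant is added to all three size components.

```python
import numpy as np

def points_in_resized_box(points, center, size, yaw, delta):
    """Mask of (N, 3) points inside the box after adding `delta` to each size component."""
    new_size = np.asarray(size, dtype=float) + delta
    if np.any(new_size <= 0):
        raise ValueError("delta shrinks the box to a non-positive size")
    local = points - np.asarray(center, dtype=float)
    # Rotate by -yaw so the box becomes axis-aligned in its own frame.
    c, s = np.cos(-yaw), np.sin(-yaw)
    x = c * local[:, 0] - s * local[:, 1]
    y = s * local[:, 0] + c * local[:, 1]
    z = local[:, 2]
    half = new_size / 2.0
    return (np.abs(x) <= half[0]) & (np.abs(y) <= half[1]) & (np.abs(z) <= half[2])

points = np.array([[1.9, 0.0, 0.0],
                   [1.0, 0.5, 0.2],
                   [0.0, 0.0, 0.7],
                   [3.0, 0.0, 0.0]])
center, size, yaw = (0.0, 0.0, 0.0), (4.0, 2.0, 1.5), 0.0

original = points_in_resized_box(points, center, size, yaw, delta=0.0)
shrunk = points_in_resized_box(points, center, size, yaw, delta=-0.5)
print(original)  # [ True  True  True False]
print(shrunk)    # [False  True False False]
```

Points near the faces of the original box drop out once the box shrinks, which is how a negative constant removes border points from the refinement stage.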

3.3.2 3D bounding box encoding

Figure 5: Comparison between different methods for encoding bounding box. We propose to encode the bounding box with three points (two corners + one center) and two heights to reduce redundancy and keep physical connections.

To determine the orientation of a 3D bounding box, we stay consistent with AVOD, which regresses an orientation vector to solve the angle-wrapping problem. As for the box encoding, there are several different methods, as shown in Figure 5. The axis-aligned encoding, first proposed in Song and Xiao (2016), encodes the box with its center and size. In MV3D Chen et al. (2017b), Chen et al. claim that the 8-corner box encoding works better than the axis-aligned one. In AVOD Ku et al. (2018), Jason Ku et al. replace the 8 corners with 4 corners and 2 heights to encode the box efficiently. However, the 8-corner encoding needs a 24-D vector normalized by the diagonal length of the proposal box and neglects the physical constraints. The 4 corners + heights encoding does not take into account the physical connections between the 4 corners within a plane. To further reduce redundancy and keep the physical connections, we propose to encode the bounding box with three points (two corners + one center) and two heights representing the offsets from the proposal box to the ground plane. The three points lie on the diagonal of the cube, and the middle one is the center point of the cube. The regression target is therefore an 11-D vector. Although our 11-D representation is slightly larger than the 10-dimensional one, we not only use fewer points but also encode the bounding box compactly through the following constraints between these parameters. During regression, we should consider:

  1. , .

  2. , .

We should also ensure that the following equations hold true:

  1. .

  2. .

where denotes that there exist constraints in the regression process.
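A small sketch can make the 11-D layout concrete: three 3D points (9 values) plus two heights. The construction below is an assumption for illustration only. It takes the two corners to be opposite ends of a body diagonal, uses the midpoint relation between the corners and the center as the one constraint that follows directly from the text, measures the two heights from a flat ground plane at a given z, and encodes orientation as (cos, sin) in the spirit of the orientation vector mentioned above. The constraints listed in the text are not reproduced here.

```python
import numpy as np

def encode_box(center, size, yaw, ground_z=0.0):
    """Encode a box as [corner1 (3), center (3), corner2 (3), h1, h2], 11 values in total."""
    center = np.asarray(center, dtype=float)
    length, width, height = size
    c, s = np.cos(yaw), np.sin(yaw)
    # Offset from the center to one top corner, rotated about the vertical z axis.
    offset = np.array([c * length / 2 - s * width / 2,
                       s * length / 2 + c * width / 2,
                       height / 2])
    corner1 = center - offset      # bottom corner
    corner2 = center + offset      # diagonally opposite top corner
    h1 = corner1[2] - ground_z     # bottom-face height above the ground plane
    h2 = corner2[2] - ground_z     # top-face height above the ground plane
    return np.concatenate([corner1, center, corner2, [h1, h2]])

def midpoint_residual(code):
    """Largest deviation of the encoded center from the midpoint of the two corners."""
    corner1, center, corner2 = code[0:3], code[3:6], code[6:9]
    return float(np.abs((corner1 + corner2) / 2.0 - center).max())

def orientation_vector(yaw):
    """Angle as (cos, sin), which avoids the discontinuity at +/- pi."""
    return np.array([np.cos(yaw), np.sin(yaw)])

def decode_yaw(vec):
    return float(np.arctan2(vec[1], vec[0]))

code = encode_box(center=(10.0, 2.0, 0.8), size=(4.0, 2.0, 1.6), yaw=0.3)
print(code.shape, midpoint_residual(code))
```

A residual like `midpoint_residual` could be checked on predicted codes to see how far a regressed box departs from a physically consistent one.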

3.4 Loss function

We use a multi-task loss to train our network. The total loss is composed of three main components from the three modules: the fused loss, the segmentation loss and the bounding box regression loss, combined as:


where three weighting parameters are used to balance the relative importance of the different parts, and their values are set to 1, 4 and 2 respectively. Two of the terms are object classification losses and two are box regression losses. In practice, we apply binary cross entropy for all classification losses. For regression, we employ the Smooth L1 loss for all bounding box and orientation vector regression. For segmentation, we use the focal loss Lin et al. (2017) to handle the class imbalance problem. More details of the specific losses are described in the supplementary.
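The loss terms named above can be sketched in NumPy. The mapping of the weights 1, 4 and 2 to the fused, segmentation and regression losses in that order is an assumption based on the order in which the losses are listed, and the focal-loss parameters alpha = 0.25 and gamma = 2 are the common defaults from Lin et al. (2017), not values stated in this paper.

```python
import numpy as np

def binary_cross_entropy(prob, target, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1 - target) * np.log(1 - prob))))

def smooth_l1(pred, target, beta=1.0):
    diff = np.abs(pred - target)
    # Quadratic near zero, linear for large errors.
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.mean(loss))

def focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)      # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    # Well-classified points (p_t close to 1) are down-weighted by (1 - p_t) ** gamma.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def total_loss(fused, segmentation, box, weights=(1.0, 4.0, 2.0)):
    """Weighted sum of the three module losses (weight order is an assumption)."""
    return weights[0] * fused + weights[1] * segmentation + weights[2] * box

prob = np.array([0.9, 0.2, 0.7])
label = np.array([1, 0, 1])
seg = focal_loss(prob, label)
cls = binary_cross_entropy(prob, label)
reg = smooth_l1(np.array([0.5, 2.0]), np.zeros(2))
print(total_loss(cls + reg, seg, cls + reg))
```

The relatively large weight on the segmentation term compensates for the focal loss producing small values once most background points are classified confidently.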

4 Experiments

We evaluate our method on the 3D detection benchmark and the bird's eye view detection benchmark of the KITTI test server Geiger et al. (2012). For evaluation, we use the average precision (AP) metric to compare with different methods and use the official 3D IoU thresholds of 0.7, 0.5, and 0.5 for the categories of car, cyclist, and pedestrian respectively. In this section, we introduce the experimental results. More details about the datasets, implementation and training are given in the supplementary.
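The role of the per-class IoU thresholds can be illustrated with a short sketch. It uses axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), which is a simplification: the official KITTI evaluation handles rotated boxes. The helper names and the example boxes are hypothetical.

```python
import numpy as np

# Thresholds stated in the text for the three categories.
IOU_THRESHOLDS = {"car": 0.7, "cyclist": 0.5, "pedestrian": 0.5}

def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    lower = np.maximum(a[:3], b[:3])
    upper = np.minimum(a[3:], b[3:])
    intersection = np.prod(np.clip(upper - lower, 0.0, None))
    volume_a = np.prod(a[3:] - a[:3])
    volume_b = np.prod(b[3:] - b[:3])
    return float(intersection / (volume_a + volume_b - intersection))

def is_true_positive(category, predicted_box, ground_truth_box):
    return iou_3d_axis_aligned(predicted_box, ground_truth_box) >= IOU_THRESHOLDS[category]

ground_truth = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
close_prediction = (0.5, 0.0, 0.0, 4.5, 2.0, 1.5)   # shifted by 0.5 m along x
loose_prediction = (1.0, 0.0, 0.0, 5.0, 2.0, 1.5)   # shifted by 1.0 m along x

print(round(iou_3d_axis_aligned(close_prediction, ground_truth), 4))  # 0.7778
print(is_true_positive("car", loose_prediction, ground_truth))         # False
print(is_true_positive("pedestrian", loose_prediction, ground_truth))  # True
```

The same 1 m shift that fails the 0.7 threshold for cars would pass the 0.5 threshold used for pedestrians and cyclists, which is why the car category demands tighter localization.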

4.1 Comparing with state-of-the-art methods

| Method | Modality | Car Easy | Car Moderate | Car Hard | Pedestrian Easy | Pedestrian Moderate | Pedestrian Hard | Cyclist Easy | Cyclist Moderate | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| M3D-RPN Brazil and Liu (2019) | Mono | 15.52 | 11.44 | 9.62 | - | - | - | - | - | - |
| CE3R Ma et al. (2019) | Mono | 21.48 | 16.08 | 15.26 | - | - | - | - | - | - |
| Stereo RCNN Li et al. (2019) | Stereo | 49.23 | 34.05 | 28.39 | - | - | - | - | - | - |
| MV3D Chen et al. (2017b) | Mono+Lidar | 71.09 | 62.35 | 55.12 | - | - | - | - | - | - |
| F-Pointnet Qi et al. (2018) | Mono+Lidar | 81.20 | 70.39 | 62.19 | 51.21 | 44.89 | 40.23 | 71.96 | 56.77 | 50.39 |
| AVOD-FPN Ku et al. (2018) | Mono+Lidar | 81.94 | 71.88 | 66.38 | 50.80 | 42.81 | 40.88 | 64.00 | 52.18 | 46.61 |
| F-ConvNet Wang and Jia (2019) | Mono+Lidar | 85.88 | 76.51 | 68.08 | 52.37 | 45.61 | 41.49 | 79.58 | 64.68 | 57.03 |
| MMF Liang et al. (2019) | Mono+Lidar | 86.81 | 76.75 | 68.41 | - | - | - | - | - | - |
| Voxelnet Zhou and Tuzel (2018) | Lidar | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.51 | 61.22 | 48.36 | 44.37 |
| SECOND Yan et al. (2018) | Lidar | 83.13 | 73.66 | 66.20 | 51.07 | 42.56 | 37.29 | 70.51 | 53.85 | 46.90 |
| PointPillars Lang et al. (2019) | Lidar | 79.05 | 74.99 | 68.30 | 52.08 | 43.43 | 41.49 | 75.78 | 59.07 | 52.92 |
| PointRCNN Shi et al. (2019b) | Lidar | 85.94 | 75.76 | 68.32 | 49.43 | 41.78 | 38.63 | 73.93 | 59.60 | 53.59 |
| STD Yang et al. (2019) | Lidar | 86.61 | 77.63 | 76.06 | 53.08 | 44.24 | 41.97 | 78.89 | 62.53 | 55.77 |
| Point-GNN Shi and Rajkumar (2020) | Lidar | 88.33 | 79.47 | 72.29 | 51.92 | 43.77 | 40.14 | 78.60 | 63.48 | 57.08 |
| SRDL (ours) | Stereo+Lidar | 89.27 | 79.95 | 73.79 | 53.44 | 45.91 | 42.61 | 78.68 | 64.88 | 57.74 |

Table 1: Performance comparison on KITTI 3D object detection for cars, pedestrians and cyclists. The evaluation metric is the average precision (AP) on the official test set.

On the 3D object detection and bird's eye view detection test benchmarks, shown in Table 1 and Table 2, our proposed method achieves competitive results compared with other state-of-the-art methods for all categories on the three difficulty levels. For cars, our method achieves results better than or comparable to most of the methods. For pedestrians and cyclists, SRDL improves considerably over the mono+Lidar methods thanks to the stereo images, especially on the moderate and hard sets. For the most important car category, we also report the performance of our method on the KITTI val split; the results are shown in the supplementary.

| Method | Modality | Car Easy | Car Moderate | Car Hard | Pedestrian Easy | Pedestrian Moderate | Pedestrian Hard | Cyclist Easy | Cyclist Moderate | Cyclist Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| M3D-RPN Brazil and Liu (2019) | Mono | 21.29 | 15.23 | 13.16 | - | - | - | - | - | - |
| Stereo RCNN Li et al. (2019) | Stereo | 61.27 | 43.87 | 36.44 | - | - | - | - | - | - |
| MV3D Chen et al. (2017b) | Mono+Lidar | 86.02 | 76.90 | 68.49 | - | - | - | - | - | - |
| F-Pointnet Qi et al. (2018) | Mono+Lidar | 88.70 | 84.00 | 75.33 | 58.09 | 50.22 | 47.20 | 75.38 | 61.96 | 54.68 |
| AVOD-FPN Ku et al. (2018) | Mono+Lidar | 88.53 | 83.79 | 77.90 | 58.75 | 51.05 | 47.54 | 68.09 | 57.48 | 50.77 |
| F-ConvNet Wang and Jia (2019) | Mono+Lidar | 89.69 | 83.08 | 74.56 | 58.90 | 50.48 | 46.72 | 82.59 | 68.62 | 60.62 |
| MMF Liang et al. (2019) | Mono+Lidar | 89.49 | 87.47 | 79.10 | - | - | - | - | - | - |
| Voxelnet Zhou and Tuzel (2018) | Lidar | 89.35 | 79.26 | 77.39 | 46.13 | 40.74 | 38.11 | 66.70 | 54.76 | 50.55 |
| SECOND Yan et al. (2018) | Lidar | 88.07 | 79.37 | 77.95 | 55.10 | 46.27 | 44.76 | 73.67 | 56.04 | 48.78 |
| PointPillars Lang et al. (2019) | Lidar | 88.35 | 86.10 | 79.83 | 58.66 | 50.23 | 47.19 | 79.14 | 62.25 | 56.00 |
| PointRCNN Shi et al. (2019b) | Lidar | 89.47 | 85.58 | 79.10 | - | - | - | 81.52 | 66.77 | 60.78 |
| STD Yang et al. (2019) | Lidar | 89.66 | 87.76 | 86.89 | 60.99 | 51.39 | 45.89 | 81.04 | 65.32 | 57.85 |
| Point-GNN Shi and Rajkumar (2020) | Lidar | 93.11 | 89.17 | 83.90 | 55.36 | 47.07 | 44.61 | 81.17 | 67.28 | 59.67 |
| SRDL (ours) | Stereo+Lidar | 90.82 | 89.74 | 81.93 | 59.62 | 51.46 | 48.32 | 82.61 | 69.11 | 60.37 |

Table 2: Performance comparison on KITTI bird's eye view detection for cars, pedestrians and cyclists. The evaluation metric is the average precision (AP) on the official test set.

4.2 Qualitative results

We present some qualitative results of our proposed SRDL network on the test split of the KITTI dataset in Figure 6. The figures show that our proposed network can estimate accurate 3D bounding boxes in different scenes. Surprisingly, we observe that our method still achieves satisfactory detection results even with very sparse point clouds and severe occlusion.

4.3 Ablation studies

In this section, we conduct extensive ablation studies on the validation split of KITTI by changing components and variants of our proposed SRDL. We follow the convention and use the car class, which contains the largest number of training examples. The evaluation metric is the average precision (AP %) on the val set.

Figure 6: Qualitative 3D detection results of SRDL on the KITTI test set. The detected objects are shown with green 3D bounding boxes and the relative labels. The upper row in each image is the 3D object detection result projected onto the RGB image and the bottom is the result in the corresponding point clouds.

Effect of Different Design Choices in the Whole Network. We illustrate the importance of different components of our network by removing one part and keeping all the others unchanged, as shown in Table 3. Without stereo images as input (removing “stereo” means a mono image is used), the performance of SRDL drops dramatically, which shows that stereo images provide rich feature information to locate the object. Similarly, removing the 3D bounding box encoding decreases AP by 11.65, 12.27 and 16.33 percentage points for easy, moderate and hard respectively, which confirms that the encoding is indispensable. The performance degradation caused by the absence of either the local or the global branch for segmentation shows that only their combination produces the best results.

| Stereo | Local | Global | Encoding | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| ✗ | ✓ | ✓ | ✓ | 83.64 | 73.59 | 67.48 |
| ✓ | ✗ | ✓ | ✓ | 86.77 | 72.32 | 71.65 |
| ✓ | ✓ | ✗ | ✓ | 88.46 | 76.71 | 75.69 |
| ✓ | ✓ | ✓ | ✗ | 78.63 | 67.55 | 62.32 |
| ✓ | ✓ | ✓ | ✓ | 90.28 | 79.82 | 78.65 |

Table 3: Performance when removing different parts of our network. ✗ denotes removing and ✓ denotes retaining.

| Fusion Method | Easy | Moderate | Hard |
|---|---|---|---|
| Global | 83.85 | 71.08 | 65.73 |
| Local | 83.89 | 71.27 | 66.54 |
| Global+Local | 84.51 | 71.33 | 68.79 |
| Global+Attention | 86.77 | 72.32 | 71.65 |
| Local+Attention | 88.46 | 76.71 | 75.69 |
| Global+Local+Attention | 90.28 | 79.82 | 78.65 |

Table 4: Performance comparison of different fusion methods with the attention mechanism.

| Encoding Method | Easy | Moderate | Hard |
|---|---|---|---|
| Axis | 79.13 | 68.42 | 65.36 |
| 8 Corners | 83.09 | 74.51 | 70.82 |
| 4 Corners+2 Heights | 89.61 | 78.37 | 77.64 |
| 3 Points+2 Heights | 90.28 | 79.82 | 78.65 |

Table 5: Performance comparison of different bounding box encoding methods.

Effect of Attention Module. To show the importance of the attention module in Section 3.2.3, we add the attention module to the three different designs. As shown in Table 4, with the attention module transferring features between connected layers, the fused models outperform the original ones by 1.24%, 5.44% and 8.49% respectively at the moderate difficulty. Our final fusion method with the attention mechanism outperforms the alternative without attention by 5.77%, 8.49% and 9.86% for the three difficulties.

| Refining size | Easy | Moderate | Hard |
|---|---|---|---|
| 1.5m | 72.62 | 69.86 | 68.57 |
| 1.0m | 79.43 | 70.53 | 70.82 |
| 0.8m | 84.59 | 72.58 | 71.94 |
| 0.5m | 87.26 | 76.25 | 74.85 |
| 0m | 89.47 | 78.84 | 77.62 |
| -0.5m | 90.28 | 79.82 | 78.65 |
| -0.8m | 89.12 | 78.76 | 77.38 |
| -1.0m | 87.69 | 77.17 | 75.93 |
| -1.5m | 84.81 | 73.91 | 73.11 |

Table 6: Performance with different refining sizes for 3D box refining. Negative values denote shrinking and positive values denote enlarging; 0m is the original size.

Effect of Different Refining Size. In Section 3.3.1, we propose to refine the proposals by adding a constant to the size of the box. Table 6 shows the results with different sizes. A constant of -0.5m performs best in our network, which means that we should shrink the original box by 0.5m. Note that when we enlarge the box, especially by 1m or more, the AP drops sharply. This indicates that the original box already contains redundant space, and enlarging it further only includes more unrelated areas. At the same time, shrinking the box too much also degrades performance, since a small region may exclude relevant areas.

Effect of Bounding Box Encoding Method. As stated in Section 3.3.2, there are different bounding box encoding methods, including the one we propose. We use the four different methods to encode boxes in our proposed network. From Table 5, we note that although the 4 corners+2 heights method uses fewer dimensions, its performance is worse than that of our method. On the one hand, the 4 corners+2 heights method does not take into account the coordinate relationship between the four corners, so the number of points is redundant. On the other hand, it cannot establish a constraint relationship between the coordinates of the corners and the heights. Our method establishes four sets of constraint relationships to constrain the length, width, and height respectively.

5 Conclusions

In this paper, we have proposed a novel stereo RGB and deeper LIDAR (SRDL) network for 3D object detection in autonomous driving scenarios. Our method takes full advantage of the merits of stereo RGB images and point clouds to form an end-to-end framework. The semantic information from stereo images and the spatial information from point clouds together improve the performance. Extensive experiments on the challenging KITTI 3D detection benchmark demonstrate the effectiveness of our method. In future research, we will optimize the inference speed, focus more on integrating RGB and point-wise features, and add different operations on the point clouds to further improve our detection framework.

Broader Impact

This article addresses an applied subtask within the study of autonomous driving. Accurate detection of vehicles and people on the road can greatly promote the development of autonomous driving, and our goal is to accurately detect 3D objects on the road in order to avoid accidents. In this paper, stereo RGB images and point cloud data are used for joint detection. The data from the optical cameras and the LIDAR sensor jointly support the detection result, which is helpful for the further deployment of this application. Autonomous driving companies may find inspiration in the methods of this article for further improving the safety of autonomous driving, and the article also offers researchers a new way to make full use of road environment data. However, we have to admit that the proposed method uses several kinds of data, so the hardware requirements of an implementation are relatively high, which brings additional cost. In addition, if the data from one of the sensors is missing at the input, the algorithm in this paper will fail.

This work is supported by a grant from the National Natural Science Foundation of China (No. 61872068, 61720106004), by a grant from the Science & Technology Department of Sichuan Province of China (No. 2018GZ0071, 2019YFG0426), and by a grant from the Fundamental Research Funds for the Central Universities (No. 2672018ZYGX2018J014).


  • [1] G. Brazil and X. Liu (2019) M3d-rpn: monocular 3d region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9287–9296. Cited by: §2, Table 1, Table 2.
  • [2] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2156. Cited by: §1, §2.
  • [3] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun (2017) 3d object proposals using stereo imagery for accurate object class detection. IEEE transactions on pattern analysis and machine intelligence 40 (5), pp. 1259–1272. Cited by: §1, §2.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §1, §2, §3.3.2, Table 1, Table 2.
  • [5] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §1.
  • [6] Y. Chen, S. Liu, X. Shen, and J. Jia (2019) Fast point r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9775–9784. Cited by: §2.
  • [7] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner (2017) Vote3deep: fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1355–1361. Cited by: §1.
  • [8] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §1, §4.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.2.3.
  • [11] C. Kaul, N. Pears, and S. Manandhar (2019) SAWNet: a spatially aware deep neural network for 3d point cloud processing. arXiv preprint arXiv:1905.07650. Cited by: §1.
  • [12] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: §1, §2, §3.3.2, Table 1, Table 2.
  • [13] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §2, Table 1, Table 2.
  • [14] B. Li (2017) 3d fully convolutional network for vehicle detection in point cloud. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1513–1518. Cited by: §1, §2.
  • [15] P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7644–7652. Cited by: §1, §2, Table 1, Table 2.
  • [16] P. Li, T. Qin, et al. (2018) Stereo vision-based semantic 3d object and ego-motion tracking for autonomous driving. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–661. Cited by: §1, §2.
  • [17] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: Table 1, Table 2.
  • [18] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: §1, §2.
  • [19] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.4.
  • [20] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai (2020) TANet: robust 3d object detection from point clouds with triple attention. In AAAI. arXiv:1912.05163. Cited by: §2.
  • [21] Z. Liu, H. Tang, Y. Lin, and S. Han (2019) Point-voxel cnn for efficient 3d deep learning. In Advances in Neural Information Processing Systems, pp. 963–973. Cited by: §1.
  • [22] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6851–6860. Cited by: §2, Table 1.
  • [23] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082. Cited by: §1, §2.
  • [24] Y. Park, V. Lepetit, and W. Woo (2008) Multiple 3d object tracking for augmented reality. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, ISMAR ’08, Washington, DC, USA, pp. 117–120. ISBN 978-1-4244-2840-3. Cited by: §1.
  • [25] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §1, §2, Table 1, Table 2.
  • [26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §2.
  • [27] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §1, §2.
  • [28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [29] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.
  • [30] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2019) PV-rcnn: point-voxel feature set abstraction for 3d object detection. arXiv preprint arXiv:1912.13192. Cited by: §2.
  • [31] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §2, Table 1, Table 2.
  • [32] S. Shi, Z. Wang, X. Wang, and H. Li (2019) Part-A^2 net: 3d part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670. Cited by: §2.
  • [33] W. Shi and R. Rajkumar (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, Table 1, Table 2.
  • [34] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. Michael Gross (2019) Complexer-yolo: real-time 3d object detection and tracking on semantic point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §1.
  • [35] S. Song and J. Xiao (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 808–816. Cited by: §3.3.2.
  • [36] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 38 (5), pp. 146:1–146:12. ISSN 0730-0301. Cited by: §1.
  • [37] Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1742–1749. Cited by: Table 1, Table 2.
  • [38] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2345–2353. Cited by: §1.
  • [39] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2, Table 1, Table 2.
  • [40] B. Yang, M. Liang, and R. Urtasun (2018) Hdnet: exploiting hd maps for 3d object detection. In Conference on Robot Learning, pp. 146–155. Cited by: §1, §2.
  • [41] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §1, §2.
  • [42] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1951–1960. Cited by: §1, §2, Table 1, Table 2.
  • [43] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §1, §2, Table 1, Table 2.