3D object detection is a crucial task in autonomous driving since vehicles must be provided with accurate scene information. It is essential for safe driving to recognize the other cars driving ahead or parked at the side of the road precisely. To achieve this goal, recent works of 3D object detection utilize the data from several types of sensors, such as a monocular camera, stereo cameras, and LiDAR. They provide various and useful data, but each sensor has its limitations.
RGB images from cameras contain appearance information that can be efficiently exploited by the advanced techniques of 2D Convolutional Neural Network (CNN). However, they lack depth information which is material to 3D object detection. This drawback results in considerably low performance for monocular 3D object detection[naiden2019shift, Wang_2019_CVPR, Chen_2016_CVPR, Mousavian_2017_CVPR], that only uses a single image to detect the objects in 3D space. To overcome the limitation, [you2019pseudo, li2019stereo, chen2020dsgn]
used two images from the stereo camera to estimate the depth of the scene more precisely, by sharing the weights of image feature extractors. The estimated depth of the stereo images is more enhanced than monocular images, but due to the discrete nature of the pixels, they still are insufficient for real-world applications. Therefore, depth prediction from 2D images itself is not adequate for autonomous driving, where an accurate location of the object is required.
On the other hand, point clouds from LiDAR provide precise distance information, which makes it easy to capture the spatial structures of the scenes. Thus, most 3D object detection methods with high performance are point-based detection. However, the LiDAR point cloud has fatal weaknesses of being sparse in the long range, and that the points are irregular and unordered. To handle these challenges, existing methods such as PointNet [Qi_2017_CVPR] and PointNet++ [qi2017pointnet++] directly process the raw point cloud regardless of the order of the points. Alternatively, point clouds are converted into specific forms that can be easily processed by projecting a point cloud onto the corresponding image for 2D CNN [ku2018joint, yang2018pixor, liang2018deep, chen2017multi] or dividing into voxels for 3D CNN [zhou2018voxelnet, song2016deep, yan2018second].