Object detection has long been considered one of the most challenging computer vision problems. Recently, the emergence of convolutional neural networks (CNNs) has enabled unprecedented progress in object detection techniques owing to their ability to extract abstract high-level features from 2D images. Thus far, numerous methods have been developed for 2D object detection [ssd, yolo, fasterrcnn]. These studies have since been extended to the 3D object detection task [pixor, mv3d, avod, roarnet, fpointrcnn, fpointnet, voxelnet, second, mmf, pointpillar, contfuse, std, fconvnet], where the locations of the objects must be identified in 3D world coordinates. 3D object detection is particularly useful for autonomous driving applications because diverse types of dynamic objects, such as surrounding vehicles, pedestrians, and cyclists, must be identified in the 3D environment.
In general, achieving good accuracy in 3D object detection using only a camera sensor is difficult owing to the lack of depth information. Thus, other ranging sensors such as LiDAR, radar, and RGB-D cameras are widely used as alternative signal sources for 3D object detection. Thus far, various 3D object detectors employing LiDAR sensors have been proposed, including MV3D [mv3d], PIXOR [pixor], ContFuse [contfuse], PointRCNN [pointrcnn], F-ConvNet [fconvnet], STD [std], VoxelNet [voxelnet], SECOND [second], MMF [mmf], PointPillar [pointpillar], and Part-$A^2$ [parta2]. Although the performance of LiDAR-only 3D object detectors has improved significantly of late, LiDAR point clouds are still limited in providing dense and rich information about objects, such as their shapes, colors, and textures. Hence, using camera and LiDAR data together is expected to yield more accurate and more robust detection results. Various camera and LiDAR fusion strategies have been proposed for 3D object detection. Well-known camera and LiDAR fusion methods include AVOD [avod], MV3D [mv3d], MMF [mmf], RoarNet [roarnet], F-PointNet [fpointnet], and ContFuse [contfuse].
In fact, the problem of fusing camera and LiDAR sensors is challenging because the features obtained from the camera image and the LiDAR points are represented in different points of view (i.e., camera view versus 3D world view). When the camera feature is projected into 3D world coordinates, some useful spatial information about the objects might be lost since this transformation is a one-to-many mapping. Furthermore, there might be some inconsistency between the projected coordinates and the LiDAR 3D coordinates due to imperfect calibration. Indeed, it has been difficult for camera-LiDAR fusion-based methods to beat LiDAR-only methods in terms of performance. This motivates us to find a robust way to fuse two feature maps in different views without losing important information for 3D object detection.
In this paper, we propose a new 3D object detection method based on cross-view spatial feature fusion, named 3D-CVF, which effectively fuses the spatial feature maps separately extracted from the camera and LiDAR data. As shown in Fig. 2, we are interested in fusing the LiDAR sensor and the multi-view cameras deployed to cover a wider field of view. Information fusion between the camera and LiDAR is achieved over two object detection stages. In the first stage, we aim to generate strong joint camera-LiDAR features. The auto-calibrated feature projection maps the camera-view features to smooth and dense bird's-eye-view (BEV) feature maps using an interpolated projection capable of correcting the spatial offsets. Fig. 1 (a) and (b) compare the feature maps obtained without and with the auto-calibrated projection, respectively. Note that the auto-calibrated projection yields a smooth camera feature map in the BEV domain, as shown in Fig. 1 (b). We also note from Fig. 1 (b) that since the camera feature mapping is a one-to-many mapping, we cannot localize the objects on the transformed camera feature map. To resolve objects in the BEV domain, we employ the adaptive gated fusion network, which uses attention mechanisms to determine how much information is brought from each of the two sources depending on the region. Fig. 1 (c) shows the appropriately localized activation for the objects obtained by applying the adaptive gated fusion network. Camera-LiDAR information fusion is also achieved at the second stage, proposal refinement. Once the region proposals are found based on the joint camera-LiDAR feature map obtained in the first stage, 3D RoI grid pooling is applied to transfer the camera features from the camera-view domain to the 3D RoI box. After being encoded by the PointNet encoding network, these camera features are fused with the RoI-aligned BEV feature map. This feature aggregation in the proposal refinement stage further improves the quality of the features and consequently yields more robust detection results.
We have evaluated our 3D-CVF method on the publicly available KITTI [kitti] and nuScenes [nuScenes] datasets. We confirm that by combining the above two sensor fusion strategies, the proposed method offers up to 1.57% and 2.74% performance gains in mAP over the baseline without sensor fusion on the KITTI and nuScenes datasets, respectively. Also, we show that the proposed 3D-CVF method achieves detection accuracy comparable to the state of the art on the KITTI 3D object detection benchmark.
The contributions of our work are summarized as follows:
We propose a new 3D object detection architecture that effectively combines information provided by both camera and LiDAR sensors in two detection stages. After the strong joint camera-LiDAR feature is generated by applying the auto-calibrated projection and the gated attention in the first stage, 3D RoI-based feature aggregation is performed in the second proposal refinement stage.
We investigate the effect of the sensor fusion achieved by the 3D-CVF. Our experiments show that the performance gain achieved by the sensor fusion on the nuScenes dataset is higher than that on the KITTI dataset. Because the resolution of the LiDAR used in nuScenes is lower than that in KITTI, this shows that the camera sensor compensates for the low resolution of the LiDAR data. Also, we observe that the performance gain achieved by the sensor fusion is much higher for distant objects than for near objects, which also supports this conclusion.
2 Related Work
2.1 LiDAR-Only 3D Object Detection
LiDAR-based 3D object detectors must encode the point clouds because they have unordered and irregular structures. The point encoding methods are roughly categorized into three types: 1) projection-based methods, 2) PointNet-based methods, and 3) voxel-based point encoding methods. First, the projection-based methods project 3D point clouds onto a discrete grid structure in 2D planes, generating multi-view 2D images. These methods include MV3D [mv3d] and PIXOR [pixor]. One disadvantage of these methods is the information loss caused by the discretization of LiDAR points during the point projection step. PointNet-based methods directly process raw LiDAR points using PointNet [pointnet, pointnet++] to yield a global feature representing the geometric structure of the entire point set. This approach includes PointRCNN [pointrcnn] and STD [std]. These methods tend to have relatively high computational complexity. Voxel-based point encoding methods use 3D voxels to organize the unordered point clouds and encode the points in each voxel using a point encoding network [voxelnet]. VoxelNet [voxelnet] was the pioneering work that adopted the voxel-based learnable point encoding process. Since then, various voxel-based 3D object detectors have been proposed, including SECOND [second], PointPillar [pointpillar], and Part-$A^2$ [parta2].
2.2 LiDAR and Camera Fusion-based 3D Object Detection
To exploit the advantages of the camera and LiDAR sensors, various camera and LiDAR fusion methods have been proposed for 3D object detection. The approaches proposed in [fpointnet, roarnet, pointfusion, fconvnet] detect objects in two sequential steps, where 1) the region proposals are generated based on the camera image, and then 2) the LiDAR points in the region of interest are processed to detect the objects. However, the performance of these methods is limited by the accuracy of the camera-based detector. MV3D [mv3d] proposed a two-stage detector, where 3D proposals are found from the LiDAR point clouds projected in BEV, and 3D object detection is performed by fusing the multi-view features obtained by RoI pooling. AVOD [avod] fused the LiDAR BEV and camera front-view features at the intermediate convolutional layer to propose 3D bounding boxes. ContFuse [contfuse] proposed an effective fusion architecture that transforms the front camera-view features into those in BEV through an interpolation network. MMF [mmf] learned to fuse both camera and LiDAR data through a multi-task loss associated with 2D and 3D object detection, ground estimation, and depth completion.
While various sensor fusion networks have been proposed, they do not easily outperform LiDAR-only detectors. This might be because it is difficult to combine camera and LiDAR features represented in different view domains. In the next sections, we present an effective way to overcome this challenge.
3 Proposed 3D Object Detector
In this section, we present the details of the proposed architecture.
3.1 Overall architecture
The overall architecture of the proposed method is illustrated in Fig. 2. It consists of five modules including the 1) LiDAR pipeline, 2) camera pipeline, 3) cross-view spatial feature mapping, 4) gated camera-LiDAR feature fusion network, and 5) proposal generation and refinement network. Each of them is described in the following:
LiDAR Pipeline: LiDAR points are first organized based on the LiDAR voxel structure. The LiDAR points in each voxel are encoded by the point encoding network [voxelnet], which generates a fixed-length embedding vector. These encoded LiDAR voxels are processed by six 3D sparse convolution layers with stride two, which produce a LiDAR feature map of 128 channels in the BEV domain. After the sparse convolutional layers are applied, the spatial size of the resulting LiDAR feature map is reduced by a factor of eight compared to that of the LiDAR voxel structure.
RGB Pipeline: In parallel to the LiDAR pipeline, the camera RGB images are processed by the CNN backbone network. We use the pre-trained ResNet-18 [resnet] followed by feature pyramid network (FPN) [fpn] to generate the camera feature map of 256 channels represented in camera-view. The spatial size of the camera feature maps is reduced by a factor of eight compared to that of the input RGB images.
Cross-View Feature Mapping: The cross-view feature (CVF) mapping generates the camera feature maps projected in BEV. The auto-calibrated projection converts the camera feature maps in camera-view to the calibrated and interpolated feature maps in BEV. Then, the projected feature map is enhanced by the additional convolutional layers and delivered to the gated camera-LiDAR feature fusion block.
Gated Camera-LiDAR Feature Fusion: The adaptive gated fusion network is used to combine the camera feature maps and the LiDAR feature map. The spatial attention maps are applied to both feature maps to weight the information from each modality depending on their contributions to the detection task. The adaptive gated fusion network produces the joint camera-LiDAR feature map, which is delivered to the 3D RoI fusion-based refinement block.
3D RoI Fusion-based Refinement: After the region proposals are generated based on the joint camera-LiDAR feature map, the RoI pooling is applied for proposal refinement. Since the joint camera-LiDAR feature map does not contain sufficient spatial information, the camera view-domain features are brought using 3D RoI grid-based pooling. These features are encoded by the PointNet encoding network and fused with the joint camera-LiDAR feature map by a 3D-RoI-based fusion network. The fused feature is finally refined to produce the final detection results.
3.2 Cross-View Feature Mapping
Dense Camera Voxel Structure: The camera voxel structure is used for the feature mapping. To generate spatially dense features, we construct a camera voxel structure whose size is two times larger than that of the LiDAR voxel structure along each of the $x$ and $y$ axes. This leads to a voxel structure with higher spatial resolution. In our design, the camera voxel structure has four times as many voxels as the LiDAR voxel structure.
Auto-Calibrated Projection Method: The structure of the auto-calibrated projection method is depicted in Fig. 3. First, the center of each voxel is projected to $(\hat{x}, \hat{y})$ in the camera-view plane using the world-to-camera-view projection matrix, and $(\hat{x}, \hat{y})$ is adjusted by the calibration offset $(\Delta x, \Delta y)$. Then, the neighboring camera feature pixels near the calibrated position $(\hat{x}+\Delta x, \hat{y}+\Delta y)$ are combined with the weights determined by interpolation methods. That is, the combined pixel vector $\mathbf{u}$ is given by
$$\mathbf{u} = \sum_{k} w_k \mathbf{f}_k,$$
where $\mathbf{f}_k$ is the feature of the $k$th closest pixel to $(\hat{x}+\Delta x, \hat{y}+\Delta y)$, and $w_k$ is the weight obtained by the interpolation methods. In bilinear interpolation, $w_k$ is obtained from the Euclidean distance between the calibrated position and the $k$th pixel.
Then, the combined feature $\mathbf{u}$ is assigned to the corresponding voxel. Note that a different calibration offset is applied to each voxel or to each local region. These calibration offset parameters can be learned jointly with the other parameters. The auto-calibrated projection provides spatially smooth camera feature maps that best match the LiDAR feature map in the BEV domain.
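The projection and interpolation steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard bilinear weights (the exact weighting in the paper may differ), stores the feature map as a nested list indexed as [row][column][channel], ignores the down-sampling factor between the image and the feature map, and treats the calibration offset as a given pair of numbers, whereas in the network it would be learned. The function names and the 3x4 matrix `proj` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def project_to_image(center: Sequence[float],
                     proj: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Project a 3D point with a (hypothetical) 3x4 world-to-camera matrix."""
    x, y, z = center
    u, v, s = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in proj)
    return u / s, v / s


def bilinear_sample(feat: FeatureMap, x: float, y: float) -> List[float]:
    """Weighted sum of the four pixels around the continuous position (x, y)."""
    height, width = len(feat), len(feat[0])
    # Clamp so that all four neighbours lie inside the feature map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    # Standard bilinear weights; they are non-negative and sum to one.
    neighbours = [
        ((1 - ax) * (1 - ay), feat[y0][x0]),
        (ax * (1 - ay), feat[y0][x1]),
        ((1 - ax) * ay, feat[y1][x0]),
        (ax * ay, feat[y1][x1]),
    ]
    channels = len(feat[0][0])
    return [sum(w * f[c] for w, f in neighbours) for c in range(channels)]


def auto_calibrated_feature(feat: FeatureMap,
                            voxel_center: Sequence[float],
                            proj: Sequence[Sequence[float]],
                            offset: Tuple[float, float]) -> List[float]:
    """Project a voxel center, shift it by the offset, and interpolate."""
    u, v = project_to_image(voxel_center, proj)
    return bilinear_sample(feat, u + offset[0], v + offset[1])
```

Because the interpolation is piecewise linear in the sampling position, gradients can flow back to the offset, which is what makes it possible to learn the offsets jointly with the rest of the network.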
3.3 Gated Camera-LiDAR Feature Fusion
Adaptive Gated Fusion Network: To extract only the essential features from both the camera and LiDAR sensors, we apply an adaptive gated fusion network that selectively combines the feature maps depending on their relevance to the object detection task. The proposed gated fusion structure is depicted in Fig. 4. The camera and LiDAR features are gated using the attention maps as follows:
$$F_{g.C} = F_C \otimes A_C, \qquad F_{g.L} = F_L \otimes A_L,$$
where $F_C$ and $F_L$ represent the camera feature and LiDAR feature, respectively, $A_C$ and $A_L$ are the corresponding attention maps, $F_{g.C}$ and $F_{g.L}$ are the corresponding gated features, $\otimes$ is the element-wise product operation, and $\oplus$ is the channel-wise concatenation operation. Note that the attention maps adaptively control the relative contributions of the camera and LiDAR features. After the attention maps are applied, the final joint feature is obtained by concatenating $F_{g.C}$ and $F_{g.L}$ channel-wise, i.e., $F_{g.C} \oplus F_{g.L}$ (see Fig. 2).
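The gating idea can be illustrated for a single BEV cell with the Python sketch below. It is a sketch under assumptions rather than the network described in the paper: each attention value is computed here by a 1x1 convolution over the concatenated camera and LiDAR features followed by a sigmoid, which is one common way to build such a gate, and the kernels `cam_kernel` and `lidar_kernel` are hypothetical stand-ins for learned weights.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention_weight(joint: Sequence[float],
                     kernel: Sequence[float],
                     bias: float = 0.0) -> float:
    """1x1 convolution at one BEV cell followed by a sigmoid (assumed gate)."""
    return sigmoid(sum(k * v for k, v in zip(kernel, joint)) + bias)


def gated_fusion_cell(cam: Sequence[float],
                      lidar: Sequence[float],
                      cam_kernel: Sequence[float],
                      lidar_kernel: Sequence[float]) -> List[float]:
    """Gate each modality with its own attention value, then concatenate."""
    joint = list(cam) + list(lidar)           # channel-wise concatenation
    a_cam = attention_weight(joint, cam_kernel)
    a_lidar = attention_weight(joint, lidar_kernel)
    gated_cam = [a_cam * v for v in cam]      # element-wise product
    gated_lidar = [a_lidar * v for v in lidar]
    return gated_cam + gated_lidar            # joint camera-LiDAR feature
```

With all-zero kernels both gates equal 0.5, so the two modalities are passed through with equal weight; training would move the gates toward 0 or 1 in regions where one sensor is more informative.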
3.4 3D-RoI Fusion-based Refinement
Region Proposal Generation: The initial detection results are obtained by the region proposal network (RPN). Initial regression results and objectness scores are predicted by applying the detection sub-network to the joint camera-LiDAR feature. Since the initial detection results contain a large number of proposal boxes with associated objectness scores, only the boxes with high objectness scores are kept through NMS post-processing with an IoU threshold of 0.7.
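The NMS step can be sketched as follows. For brevity, this illustration assumes axis-aligned boxes in the BEV plane, whereas proposals in a 3D detector are generally rotated, so it should be read as a simplified stand-in; only the IoU threshold of 0.7 is taken from the text, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in BEV


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.7) -> List[int]:
    """Return indices of kept boxes, highest objectness score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two proposals that overlap with an IoU of about 0.9 collapse to the higher-scoring one, while a distant third proposal is unaffected.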
3D RoI-based Feature Fusion: The predicted box regression values are translated to the global coordinates using the rotated 3D RoI alignment in [mmf] to pool the object features from the joint camera-LiDAR feature. We fuse the RoI aligned features with the camera-view features to enhance the feature quality. These camera-view features retain the detailed spatial information on objects (particularly in axis) so that it can provide the useful semantic information for refining the region proposals in the 3D RoI boxes. Since the camera-view features are represented in different domain from the 3D RoI boxes, we propose the 3D RoI grid pooling to bring the camera-view features to the RoI box. As shown in Fig. 5, consider the size of equally spaced coordinates in the RoI box. These points are projected to the camera view-domain and the camera feature pixels corresponding to these points are brought for fusion with the RoI aligned feature after applying the PointNet encoding network. The final RoI feature is obtained by concatenating the RoI aligned feature and the encoded camera feature and used for the proposal refinement network.
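A minimal Python sketch of the grid pooling idea is given below. It is illustrative only and rests on several assumptions: the grid resolution `r` is a hypothetical parameter, the RoI box is rotated about the vertical axis by its yaw angle, the camera feature is read with a nearest-pixel lookup rather than interpolation, the 3x4 matrix `proj` is a placeholder for the world-to-camera projection, and points behind the camera are not handled.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def roi_grid_points(center: Sequence[float], size: Sequence[float],
                    yaw: float, r: int) -> List[Point]:
    """r*r*r equally spaced points (cell centers) inside a rotated 3D box."""
    cx, cy, cz = center
    length, width, height = size
    cos_t, sin_t = math.cos(yaw), math.sin(yaw)
    ticks = [(i + 0.5) / r - 0.5 for i in range(r)]  # fractions in (-0.5, 0.5)
    points: List[Point] = []
    for tx in ticks:
        for ty in ticks:
            for tz in ticks:
                dx, dy, dz = tx * length, ty * width, tz * height
                points.append((cx + dx * cos_t - dy * sin_t,
                               cy + dx * sin_t + dy * cos_t,
                               cz + dz))
    return points


def project(point: Point, proj: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates with a 3x4 matrix."""
    x, y, z = point
    u, v, s = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in proj)
    return u / s, v / s


def grid_pool(feat: FeatureMap, points: Sequence[Point],
              proj: Sequence[Sequence[float]]) -> List[List[float]]:
    """Fetch the camera feature at the pixel nearest to each projected point."""
    rows, cols = len(feat), len(feat[0])
    pooled = []
    for p in points:
        u, v = project(p, proj)
        col = min(max(int(round(u)), 0), cols - 1)
        row = min(max(int(round(v)), 0), rows - 1)
        pooled.append(feat[row][col])
    return pooled
```

The pooled per-point features would then be passed to a PointNet-style encoder and concatenated with the RoI-aligned feature, as described in the text.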
3.5 Training Loss Function
To predict the 3D bounding box, we parameterize the 3D ground truth box as $(x, y, z, l, w, h, \theta)$, where $(x, y, z)$ is the 3D coordinate of the center, $(l, w, h)$ represents (length, width, height), and $\theta$ is the yaw rotation around the $z$-axis. Similarly, $(x_a, y_a, z_a, l_a, w_a, h_a, \theta_a)$ represents the predefined 3D anchor. The regression target for the box parameters is as follows:
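One widely used form of such a regression target is the residual encoding of VoxelNet, whose setup the text states it follows. The Python sketch below assumes that encoding (center offsets normalized by the anchor diagonal and height, log-ratios for the sizes, and a plain angle difference); it is not confirmed to be term-by-term identical to the targets used by the authors, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box7 = Tuple[float, float, float, float, float, float, float]  # x, y, z, l, w, h, yaw


def encode_box(gt: Sequence[float], anchor: Sequence[float]) -> Box7:
    """VoxelNet-style residuals of a ground truth box relative to an anchor."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    diag = math.hypot(la, wa)  # diagonal of the anchor base
    return ((xg - xa) / diag, (yg - ya) / diag, (zg - za) / ha,
            math.log(lg / la), math.log(wg / wa), math.log(hg / ha),
            tg - ta)


def decode_box(residual: Sequence[float], anchor: Sequence[float]) -> Box7:
    """Invert encode_box to recover box parameters from predicted residuals."""
    dx, dy, dz, dl, dw, dh, dt = residual
    xa, ya, za, la, wa, ha, ta = anchor
    diag = math.hypot(la, wa)
    return (xa + dx * diag, ya + dy * diag, za + dz * ha,
            la * math.exp(dl), wa * math.exp(dw), ha * math.exp(dh),
            ta + dt)
```

Normalizing by the anchor dimensions keeps the targets on a similar scale across object classes of different sizes, which is the usual motivation for this kind of encoding.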
The RPN loss function is given by
where and are set to and , respectively. To alleviate the class imbalance problem, the focal loss [focal] is used as a classification loss , i.e.,
where depicts total number of boxes, is the objectness scores for th box, and we set and . The Smoothed-L1 loss [fastrcnn] is used as regression loss , where is the total number of positive boxes and is the th predicted coordinates or size of boxes. We follow the setup for the data encoding and loss functions used in [voxelnet, second]. The proposal refinement loss function is given by
For the confidence score refinement loss , we follow the 3D IoU loss [Gs3d]. The details of the training procedure using these loss functions are provided in the next subsection.
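The two loss terms named in this subsection can be sketched in Python as follows. The sketch uses the standard definitions of the focal loss and the smooth-L1 loss; the default values `alpha=0.25`, `gamma=2.0`, and `beta=1.0` are the defaults commonly used with these losses and are assumptions here, not values taken from this text. The normalization follows the description above: the classification term is averaged over all boxes and the regression term over the positive boxes.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], labels: Sequence[int],
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss averaged over all boxes (alpha, gamma are assumed)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1.0 - 1e-7)      # avoid log(0)
        p_t = p if y == 1 else 1.0 - p         # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / max(len(probs), 1)


def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def regression_loss(preds: Sequence[float], targets: Sequence[float],
                    num_positive: int) -> float:
    """Sum of smooth-L1 terms normalized by the number of positive boxes."""
    total = sum(smooth_l1(p, t) for p, t in zip(preds, targets))
    return total / max(num_positive, 1)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of boxes that are already classified confidently, which is how the focal loss counteracts the dominance of easy background anchors.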
In this section, we evaluate the performance of the proposed 3D-CVF on the KITTI [kitti] and nuScenes [nuScenes] datasets.
The KITTI dataset is a widely used dataset for evaluating 3D object detectors. It contains camera and LiDAR data collected using a single Point Grey camera and a Velodyne HDL-64E LiDAR. The training set and test set contain 7,481 images and 7,518 images, respectively. For validation, we split the labeled training set in half into a train set and a validation set, as done in [mv3d]. The detection task is divided into three levels of difficulty, namely "easy", "moderate", and "hard". The average precision (AP) obtained from the 41-point precision-recall (PR) curve is used as the performance metric.
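As an illustration of the metric, the Python sketch below computes an interpolated AP over equally spaced recall levels. It assumes that "41-point" means the recall levels 0, 1/40, ..., 1 and that the precision at each level is the maximum precision achieved at any recall greater than or equal to that level; the official KITTI evaluation code may differ in detail, and the function name is hypothetical.

```python
from typing import Sequence


def interpolated_ap(recalls: Sequence[float], precisions: Sequence[float],
                    num_points: int = 41) -> float:
    """Mean of interpolated precision at equally spaced recall levels."""
    total = 0.0
    for k in range(num_points):
        level = k / (num_points - 1)
        # Interpolated precision: best precision at recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / num_points
```

A detector that reaches full recall at precision 1.0 scores an AP of 1.0, while one that stops at a recall of 0.5 with perfect precision is credited only for the recall levels up to 0.5.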
Training Configuration: We limit the range of point cloud to   in axis. The LiDAR voxel structure consists of voxel grids with each voxel of size. Two anchors with different angles (0, 90) were used. For training our architecture, we use pre-trained LiDAR backbone network and train the overall architecture using the ADAM optimizer with one-cycle policy [1cycle]
with a maximum learning rate (LR) of 3e-3, a division factor of 10, a momentum range from 0.95 to 0.85, and a fixed weight decay of 1e-2. The mini-batch size is set to 12, and the model is trained for 70 epochs. The refinement network is trained separately for 50 epochs with a mini-batch size of 6. The initial learning rate is 1e-4 for the first 30 epochs and decays by a factor of 0.1 every 10 epochs. As the camera backbone network, we use the ResNet-18 [resnet] network with FPN [fpn] pre-trained on the KITTI 2D object detection dataset.
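A one-cycle schedule with the parameters quoted above can be sketched in Python as follows. Only the maximum learning rate (3e-3), the division factor (10), and the momentum range (0.95 to 0.85) come from the text; the warm-up fraction `warmup_frac`, the cosine annealing shape, and the final divisor `final_div` are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from typing import Tuple


def _cos_anneal(start: float, end: float, frac: float) -> float:
    """Cosine interpolation from start (frac=0) to end (frac=1)."""
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))


def one_cycle(step: int, total_steps: int,
              max_lr: float = 3e-3, div_factor: float = 10.0,
              momentum_range: Tuple[float, float] = (0.95, 0.85),
              warmup_frac: float = 0.4,      # assumed, not from the text
              final_div: float = 1e4) -> Tuple[float, float]:
    """Return (learning rate, momentum) for the given training step."""
    base_lr = max_lr / div_factor
    high_mom, low_mom = momentum_range
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        # Phase 1: learning rate rises while momentum falls.
        frac = step / warmup_steps
        return (_cos_anneal(base_lr, max_lr, frac),
                _cos_anneal(high_mom, low_mom, frac))
    # Phase 2: learning rate decays while momentum recovers.
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return (_cos_anneal(max_lr, max_lr / final_div, frac),
            _cos_anneal(low_mom, high_mom, frac))
```

With these settings the schedule starts at a learning rate of 3e-4 and momentum 0.95, peaks at 3e-3 with momentum 0.85 after the warm-up, and then anneals the learning rate toward a small final value.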
Data Augmentation: Since we use both camera data and LiDAR point clouds together, careful coordination between the camera and LiDAR data is necessary for data augmentation. The changes made for the LiDAR points should be adapted for the camera images accordingly. We randomly flip the LiDAR points and rotate the point clouds within a range of  along the axis. We also scale the coordinates of the points with a factor within . Since per-object augmentations cannot be applied in the camera images, such augmentation strategies are used only for the LiDAR backbone pre-training process. Note that the overall architecture is trained without the per-object augmentation strategies.
Results on KITTI Test Set: Table 1 provides the mAP performance of several 3D object detectors evaluated on the KITTI 3D object detection tasks. The results for the other algorithms are taken from the KITTI leaderboard (http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d). We observe that the proposed 3D-CVF achieves a significant performance gain over the other camera-LiDAR fusion-based detectors on the leaderboard. In particular, the 3D-CVF achieves up to 2.58% gains (for the hard difficulty) over UberATG-MMF [mmf], the best fusion-based method so far. The 3D-CVF outperforms most of the LiDAR-based 3D object detectors except for STD [std]. The 3D-CVF outperforms STD [std] at the easy and moderate levels but not at the hard level. Since STD [std] uses a PointNet-based backbone, it might have a stronger LiDAR pipeline than the voxel-based backbone used in our 3D-CVF. It would be possible to apply our sensor fusion strategies to this kind of detector to improve its performance.
The nuScenes dataset is a large-scale 3D detection dataset that contains more than 1,000 scenes in Boston and Singapore [nuScenes]. The dataset is collected using six multi-view cameras and a 32-channel LiDAR, and 360-degree object annotations for 10 object classes are provided. The dataset consists of 28,130 training samples and 6,019 validation samples. The nuScenes dataset suggests an evaluation metric called the nuScenes detection score (NDS), defined in [nuScenes].
Training Configuration: For the nuScenes dataset, the range of point cloud is within   in axis which is voxelized with each voxel size of that leads size of voxel structure. Anchor sizes of each class are created by averaging the width and height values of the labels. We train the network with a one-cycle policy for 20 epochs, the mini-batch size is set to 6. We expand the train data split using DS sampling [megvil] to handle the class imbalance issue of nuScenes dataset.
Data Augmentation: For the nuScenes dataset, we use the same augmentation strategies, including the random flip, global rotation, and global scaling, with the same parameters. Note that per-object augmentation strategies are not applied because the nuScenes dataset is large enough to prevent over-fitting.
Results on nuScenes Validation Set: We mainly test our 3D-CVF on nuScenes to verify the performance gain achieved by sensor fusion. For this purpose, we compare the proposed 3D-CVF with the baseline algorithm, which has the same structure as our method except that the camera pipeline is disabled. For a fair comparison, the DS sampling strategy is also applied to the baseline. As a reference, we also add the performance of SECOND [second] and PointPillar [pointpillar]. Table 2 provides the AP for 8 classes, the mAP, and the NDS achieved by several 3D object detectors. We observe that the sensor fusion offers 2.74% and 3.57% performance gains in the mAP and NDS metrics, respectively. The proposed method consistently outperforms the baseline in terms of the AP for all classes. In particular, the detection accuracy is significantly improved for classes with relatively low APs. This shows that the camera sensor helps to detect objects that are relatively difficult to detect.
4.3 Ablation study
In Table 3, we conduct an extensive ablation study to validate the effect of the ideas in the proposed 3D-CVF method. Note that our ablation study has been conducted on the KITTI validation set. Overall, our ablation study shows that the fusion strategies used in our 3D-CVF offer 1.32%, 1.57%, and 1.39% gains in $AP_{easy}$, $AP_{mod}$, and $AP_{hard}$ over the LiDAR-only baseline. This is a significant improvement considering how difficult it has recently become to beat the detection accuracy of the top-ranked methods on the KITTI 3D object detection benchmark.
Effect of Naive Camera-LiDAR Fusion: We observe that when we fuse the camera and LiDAR features without the proposed strategies, including the adaptive gated fusion network, cross-view feature mapping, and 3D RoI fusion-based refinement, the improvement in detection accuracy is marginal.
Effect of Adaptive Gated Fusion Network: The adaptive gated fusion network leads to 0.54%, 0.87%, and 0.79% performance boosts in $AP_{easy}$, $AP_{mod}$, and $AP_{hard}$, respectively. By combining the camera and LiDAR features selectively depending on their relevance to the task, our method can generate enhanced joint camera-LiDAR features that deliver this performance gain.
Effect of Cross-View Feature Mapping: The auto-calibrated projection generates smooth and dense camera features in the BEV domain, which offers features of better quality for 3D object detection. The detection accuracy improves over the baseline by 0.5%, 0.06%, and 0.15% in $AP_{easy}$, $AP_{mod}$, and $AP_{hard}$, respectively. Note that the performance gain of our method is largest for $AP_{easy}$ since this method keeps the spatial details of the camera features in the camera-view domain.
Effect of 3D RoI Fusion-based Refinement: We observe that the 3D RoI fusion-based refinement improves $AP_{easy}$, $AP_{mod}$, and $AP_{hard}$ by 0.28%, 0.63%, and 0.45%, respectively. This indicates that our 3D RoI fusion-based refinement compensates for the lack of spatial information in the joint camera-LiDAR features, which is lost while passing through many CNN layers.
4.4 Performance Evaluation based on Object Distance
To investigate the effectiveness of sensor fusion, we evaluate the detection accuracy of the 3D-CVF for different object distances. We categorize the objects in the KITTI validation set into three classes by distance range: (0-20 m), (20-40 m), and (40-70 m). Table 4 provides the mAPs achieved by the 3D-CVF for the three classes of objects. Note that the performance gain achieved by the sensor fusion is significantly higher for distant objects. The difference in mAP between nearby and distant objects is up to 5%. This result shows that the LiDAR-only baseline does not perform well for distant objects due to the sparseness of LiDAR points and that the camera modality successfully compensates for this.
In this paper, we proposed a new camera and LiDAR fusion architecture for 3D object detection. The 3D-CVF achieves multi-modal fusion over two object detection stages. In the first stage, to generate an effective joint representation of camera and LiDAR data, we introduced the cross-view feature mapping that transforms the camera feature map into a calibrated and interpolated feature map in BEV. Then, the camera and LiDAR feature maps were selectively combined based on the region using the adaptive gated fusion network. These two steps produced the joint camera and LiDAR feature maps used to find the region proposals. In the second stage, the 3D RoI-based fusion network refined the region proposals based on the results of the 3D RoI-based feature fusion. The 3D RoI-based feature fusion brought in the camera-view features using the 3D RoI grid pooling and combined them with the joint feature map. Our evaluation conducted on the KITTI and nuScenes datasets confirmed that a significant performance gain was achieved by the camera-LiDAR fusion and that the proposed 3D-CVF outperformed most of the state-of-the-art 3D object detectors on the KITTI leaderboard.