Autonomous driving systems need accurate 3D perception of vehicles and other objects in their environment. Unlike 2D visual detection, 3D-based object detection enables spatial path planning for object avoidance and navigation. Compared to 2D object detection, which has been well-studied [ren2015faster, Calandra2016, lin2017feature, lin2017focal], 3D object detection is more challenging with more output parameters needed to specify 3D oriented bounding boxes around targets. In addition, LiDAR methods [zhou2018voxelnet, yan2018second, qi2017pointnet, shi2019pointrcnn, lang2019pointpillars] are hampered by typically lower input data resolution than video which has a large adverse impact on accuracy at longer ranges. Fig. 1 illustrates the difficulty in detecting vehicles from just a few points and no texture at long range. Human annotators use both the camera images together with the LiDAR point clouds to create the ground truth bounding boxes [geiger2012we]. This motivates multi-modal sensor fusion as a way to improve single-modal methods.
While sensor fusion has the potential to address the shortcomings of video-only and LiDAR-only detection, finding an effective approach that improves on the state-of-the-art single-modality detectors has been difficult. This is illustrated in the official KITTI 3D object detection benchmark leaderboard, where LiDAR-only methods outperform most of the fusion-based methods. Fusion methods can be divided into three broad classes: early fusion, deep fusion and late fusion, each with its own pros and cons. While early and deep fusion have the greatest potential to leverage cross-modality information, they suffer from sensitivity to data alignment, often involve complicated architectures [chen2017multi, ku2018joint, xu2018pointfusion, liang2018deep], and typically require pixel-level correspondences of sensor data. On the other hand, late fusion systems are much simpler to build, as they incorporate pre-trained, single-modality detectors without change and only need association at the detection level. Our late fusion approach uses much-reduced thresholds for each sensor and combines detection candidates before Non-Maximum Suppression (NMS). By leveraging cross-modality information, it can keep detection candidates that would be mistakenly suppressed by single-modality methods.
We propose Camera-LiDAR Object Candidates Fusion (CLOCs) as a way to achieve improved accuracy for 3D object detection. The proposed architecture delivers the following contributions:
Versatility & Modularity: CLOCs uses any pair of pre-trained 2D and 3D detectors without requiring re-training, and hence, can be readily employed by any relevant already-optimized detection approaches.
Probabilistic-driven Learning-based Fusion: CLOCs is designed to exploit the geometric and semantic consistencies between 2D and 3D detections and automatically learns probabilistic dependencies from training data to perform fusion.
Speed and Memory: CLOCs is fast, leveraging sparse tensors with a low memory footprint, and adds less than 3 ms of latency per frame on a desktop-level GPU.
Detection Performance: CLOCs improves single-modality detectors, including state-of-the-art detectors, to achieve new performance levels. At time of submission, CLOCs ranks the highest among all the fusion based methods in the official KITTI leaderboard.
The rest of the paper is organized as follows. We first review related work in section 2. Then, we introduce the motivation of our work and why we choose to fuse the detection candidates in section 3. In section 4, we illustrate our Camera-LiDAR Object Candidates (CLOCs) Fusion architecture and relevant details of our network. We report and analyse our experimental results on the KITTI dataset in section 5. In section 6, we conclude the paper.
II Related Work
The three main categories of 3D object detection are based on (1) 2D images, (2) 3D point clouds and (3) both images and point clouds. Although 2D image-based methods are attractive for not requiring LiDAR, there is a large gap in 3D performance between these methods and those leveraging point clouds, and so here we focus on the latter two categories.
II-A 3D Detection Using 2D Images
Mousavian et al. [mousavian20173d] leverage the geometric constraints between 2D and 3D bounding boxes to recover 3D information. [chabot2017deep, mottaghi2015coarse] estimate 3D object information by calculating the similarity between 3D objects and CAD models. [wang2019pseudo] and [you2019pseudo] explore using stereo images to generate a dense point cloud and conduct object detection on that cloud. These image-based methods are promising, but compared to LiDAR-based techniques they generate much less accurate 3D bounding boxes.
II-B 3D Detection Using Point Cloud
Point-cloud techniques currently lead in popularity for 3D object detection. Compared to multi-modal fusion based methods, a single-sensor setup avoids multi-sensor calibration and synchronization issues. However, object detection performance at longer distances is still relatively poor. Methods vary by how they encode and learn features from the raw point cloud. VoxelNet [zhou2018voxelnet] uses voxels to encode the raw point cloud, and 3D CNNs (Convolutional Neural Networks) are applied to learn voxel features for classification and bounding box regression. SECOND [yan2018second] is an upgraded version of [zhou2018voxelnet]; since the raw LiDAR point cloud is very sparse, it uses sparse 3D CNNs, which reduce the inference time significantly. PointPillars [lang2019pointpillars] uses PointNets [qi2017pointnet] in an encoder that represents point clouds organized in vertical columns (pillars), followed by a 2D CNN detection head to perform 3D object detection; it enables inference at 62 Hz. In contrast to the one-stage methods discussed above, PointRCNN [shi2019pointrcnn], Fast PointRCNN [Chen2019fastpointrcnn] and STD [std2019yang] apply a two-stage architecture that first generates 3D proposals in a bottom-up manner and then refines these proposals in a second stage. PV-RCNN [shi2020pv] leverages the advantages of both 3D voxel CNNs and PointNet-based set abstraction to learn more discriminative features. In addition, Part-$A^2$ [shi2020part] explores predicting intra-object part locations (lower left, upper right, etc.) in the first stage, and such part locations can assist accurate 3D bounding box refinement in the second stage.
II-C 3D Detection Using Multi-modal Fusion
We focus on camera-LiDAR fusion methods in this section since this is the most common sensor setup for self-driving cars. Frustum PointNet [qi2018frustum], Pointfusion [xu2018pointfusion] and Frustum ConvNet [wang2019frustum] are representative 2D-driven 3D detectors, which exploit mature 2D detectors to generate 2D proposals and narrow down the 3D processing domain to the corresponding cropped region in the image. However, 2D image-based proposal generation might fail in some cases that can only be observed from 3D space. MV3D [chen2017multi] and AVOD [ku2018joint] project the raw point cloud into bird’s eye view (BEV) to form a multi-channel BEV image. A deep fusion based 2D CNN is used to extract features from this BEV image as well as the front camera image for 3D bounding box regression. The overall performance of these fusion based methods is worse than that of LiDAR-only methods. Possible reasons include the following. First, transforming the raw point cloud into a BEV image loses spatial information. Second, the crop and resize operation used in these algorithms to fuse feature vectors from different sensor modalities may destroy the feature structure from each sensor. Camera images are high-resolution dense data, while LiDAR point clouds are low-resolution sparse data, so fusing these two different types of data structure is not trivial. Forcing feature vectors from 2D images and 3D LiDAR point clouds to have the same size, and then concatenating, aggregating or averaging them, could result in inaccurate correspondence between these feature vectors and is therefore not the optimal way to fuse features. In order to fuse features from different sensor modalities with better correspondence, MMF [liang2019multi] adopts continuous convolution [liang2018deep] to build dense LiDAR BEV feature maps and performs point-wise feature fusion with dense image feature maps. MMF is currently one of the best public multi-modal fusion based 3D detectors according to the KITTI 3D/BEV object detection benchmark. However, it is still 24% worse in moderate level than the best LiDAR-only based detectors in KITTI leaderboard.
III-A 2D and 3D Object Detection
We first introduce the basic concepts of 2D and 3D object detection used in this paper. The 2D detection systems discussed in this paper take RGB images as input, and output classified 2D axis-aligned bounding boxes with confidence scores, as shown in Fig. 2. 3D detection systems generate classified oriented 3D bounding boxes with confidence scores, as shown in Fig. 2. In the KITTI dataset [geiger2012we] only rotation about the z axis (yaw angle) is considered, while rotations about the x and y axes are set to zero for simplicity. Using the calibration parameters of the camera and LiDAR, the 3D bounding box in the LiDAR coordinate frame can be accurately projected into the image plane, as shown in Fig. 2.
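The projection step can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names are hypothetical, the box is assumed to be centred at (x, y, z), and `P` is assumed to be a single 3x4 matrix that already combines the LiDAR-to-camera transform with the camera intrinsics.

```python
import math

def box3d_corners(h, w, l, x, y, z, yaw):
    """Return the 8 corners of a box centred at (x, y, z), rotated by yaw about z.

    Assumption: l, w, h extend along the box's local x, y, z axes respectively.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-l / 2.0, l / 2.0):
        for dy in (-w / 2.0, w / 2.0):
            for dz in (-h / 2.0, h / 2.0):
                # Rotate the offset about the z axis, then translate to the centre.
                corners.append((x + c * dx - s * dy, y + s * dx + c * dy, z + dz))
    return corners

def project_points(P, points):
    """Project 3D points with a 3x4 matrix P; points behind the camera are dropped."""
    pixels = []
    for px, py, pz in points:
        hom = (px, py, pz, 1.0)
        u, v, d = (sum(P[r][k] * hom[k] for k in range(4)) for r in range(3))
        if d > 0.0:
            pixels.append((u / d, v / d))
    return pixels

def projected_box2d(P, box7):
    """Axis-aligned image rectangle (x1, y1, x2, y2) enclosing the projected corners."""
    pixels = project_points(P, box3d_corners(*box7))
    if not pixels:
        return None
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return (min(us), min(vs), max(us), max(vs))

# Toy example: a pinhole camera with focal length 100 looking along +z.
P_toy = [[100.0, 0.0, 0.0, 0.0],
         [0.0, 100.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
toy_rect = projected_box2d(P_toy, (2.0, 2.0, 2.0, 0.0, 0.0, 10.0, 0.0))
```

The enclosing rectangle, rather than the eight projected corners, is what gets compared against 2D detections.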
III-B Why Fusion of Detection Candidates
Fusion architectures can be categorized according to the point in their processing at which features from different modalities are combined. Three general categories are (1) early fusion, which combines data at the input, (2) deep fusion, which has different networks for different modalities while simultaneously combining intermediate features, and (3) late fusion, which processes each modality on a separate path and fuses the outputs at the decision level.
Early fusion has the greatest opportunity for cross-modal interaction, but at the same time inherent data differences between modalities including alignment, representation, and sparsity are not necessarily well-addressed by passing them all through the same network.
Deep fusion addresses this issue by including separate channels for different modalities while still combining features during processing. This is the most complicated approach, and it is not easy to determine whether or not the complexity actually leads to real improvements; simply showing gain over single-modality methods is insufficient.
Late fusion has a significant advantage in training: single-modality algorithms can be trained using their own sensor data. Hence, the multi-modal data does not need to be synchronized or aligned with other modalities. Only the final fusion step requires jointly aligned and labeled data. Additionally, the detection candidate data that late fusion operates on is compact and simple to encode for a network. Since late fusion prunes rather than creates new detections, it is important that the input detectors be tuned to maximize their recall rate rather than their precision. In practice, this implies that individual modalities (a) avoid the NMS stage, which may mistakenly suppress true detections, and (b) keep thresholds as low as possible.
In our late fusion framework, we incorporate all detection candidates before NMS in the fusion step to maximize the probability of extracting all potential correct detections. Our approach is data-driven; we train a discriminative network that receives as input the output scores and classifications of individual detection candidates, as well as spatial descriptions of the detection candidates. It learns from data how best to combine input detection candidates for a final output detection.
IV Camera-LiDAR Object Candidates Fusion
IV-A Geometric and Semantic Consistencies
For a given frame of image and LiDAR data there may be many detection candidates with various confidences in each modality, from which we seek a single set of 3D detections and scores. Fusing these detection candidates requires an association between the different modalities (even if the association is not unique). For this we build a geometric association score and apply semantic consistency. These are described in more detail as follows.
Geometric consistency An object that is correctly detected by both a 2D and a 3D detector will have an identical bounding box in the image plane (see Fig. 2), whereas false positives are less likely to have identical bounding boxes. Small errors in pose will result in a reduction of overlap. This motivates an image-based Intersection over Union (IoU) of the 2D bounding box and the bounding box of the projected corners of the 3D detection, to quantify geometric consistency between a 2D and a 3D detection.
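As a minimal sketch of this measure, the function below computes the IoU of two axis-aligned rectangles given as (x1, y1, x2, y2) pixel coordinates. The function name is hypothetical, and the second rectangle is assumed to be the image-plane bounding box of the projected 3D corners.

```python
def iou_2d(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0.0 else 0.0

# A 2D detection and a slightly shifted projected 3D detection.
example_iou = iou_2d((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

An IoU of zero means the two candidates do not overlap at all, so they cannot refer to the same object.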
Semantic consistency Detectors may output multiple categories of objects, but we only associate detections of the same category during fusion. We avoid thresholding detections at this stage (or use very low thresholds), and leave thresholding to the final output based on the final fused score.
The two types of consistency described above are the fundamental concepts used in our fusion network.
IV-B Network Architecture
In this section we explain the preprocessing/encoding of fused data, the fusion network architecture and the loss function used for training.
IV-B1 Sparse Input Tensor Representation
The goal of our encoding step is to convert all individual 2D and 3D detection candidates into a set of consistent joint detection candidates that can be fed into our fusion network. The general output of a 2D object detector is a set of 2D bounding boxes in the image plane with corresponding confidence scores. The $k$ 2D detection candidates in one image can be defined as follows:
$$P^{2D} = \{p^{2D}_1, p^{2D}_2, \dots, p^{2D}_k\}, \qquad p^{2D}_i = \{[x_{i1}, y_{i1}, x_{i2}, y_{i2}],\ s^{2D}_i\}$$
where $P^{2D}$ is the set of all $k$ detection candidates in one image; for the $i$-th detection $p^{2D}_i$, $(x_{i1}, y_{i1})$ and $(x_{i2}, y_{i2})$ are the pixel coordinates of the top-left and bottom-right corner points of the 2D bounding box, and $s^{2D}_i$ is the confidence score.
The outputs of 3D object detectors are 3D oriented bounding boxes in LiDAR coordinates with confidence scores. There are multiple ways to encode the 3D bounding boxes; in the KITTI dataset [geiger2012we], a 7-digit vector containing the 3D dimensions (height, width and length), 3D location (x, y, z) and rotation (yaw angle) is used. The $n$ 3D detection candidates in one LiDAR scan can be defined as follows:
$$P^{3D} = \{p^{3D}_1, p^{3D}_2, \dots, p^{3D}_n\}, \qquad p^{3D}_i = \{[h_i, w_i, l_i, x_i, y_i, z_i, \theta_i],\ s^{3D}_i\}$$
where $P^{3D}$ is the set of all $n$ detection candidates in one LiDAR scan; for the $i$-th detection $p^{3D}_i$, $[h_i, w_i, l_i, x_i, y_i, z_i, \theta_i]$ is the 7-digit vector for the 3D bounding box and $s^{3D}_i$ is the 3D confidence score. Note that we take 2D and 3D detections without applying NMS; as discussed in the previous section, some correct detections may otherwise be suppressed because of the limited information available from a single sensor modality. Our proposed fusion network reevaluates all detection candidates from both sensor modalities to make better predictions. For $k$ 2D detections and $n$ 3D detections, we build a $k \times n \times 4$ tensor $T$, as shown in Fig. 3. For each element $T_{i,j}$, there are 4 channels denoted as follows:
$$T_{i,j} = \{IoU_{i,j},\ s^{2D}_i,\ s^{3D}_j,\ d_j\}$$
where $IoU_{i,j}$ is the IoU between the $i$-th 2D detection and the $j$-th projected 3D detection, $s^{2D}_i$ and $s^{3D}_j$ are the confidence scores for the $i$-th 2D detection and the $j$-th 3D detection respectively, and $d_j$ represents the normalized distance between the $j$-th 3D bounding box and the LiDAR in the $xy$ plane. Elements $T_{i,j}$ with zero $IoU_{i,j}$ are eliminated as they are geometrically inconsistent.
The input tensor $T$ is sparse because, for each projected 3D detection, only a few 2D detections intersect with it, and so most elements are empty. The fusion network only needs to learn from these intersecting examples. Because we take the raw predictions before NMS, $k$ and $n$ are large numbers; for SECOND [yan2018second], there are 70400 ($200 \times 176 \times 2$) predictions in each frame. It would be impractical to perform convolution on a dense tensor of this shape. We propose an implementation architecture that exploits the sparsity of $T$ and makes the calculations much faster and feasible for large $k$ and $n$ values. Only non-empty elements are delivered to the fusion network for processing, as shown in Fig. 3. As we discuss later, the indices $(i, j)$ of the non-empty elements are important for further calculations, so these indices are saved in a cache, as shown in the blue box in Fig. 3. Note that for a projected 3D detection $p^{3D}_j$ that has no intersecting 2D detection, we still fill the last element $T_{k,j}$ in the $j$-th column of $T$ with the available 3D detection information and set $IoU_{k,j}$ and $s^{2D}_k$ to $-1$. This is because the 3D detector can sometimes detect objects that the 2D detector cannot, and we do not want to discard these 3D detections. Setting $IoU_{k,j}$ and $s^{2D}_k$ to $-1$ rather than $0$ enables our network to distinguish this case from other examples with very small $IoU$ and $s^{2D}$.
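The encoding described above can be sketched as follows. This is an illustrative, unoptimized version under stated assumptions: the function names are hypothetical, the 3D boxes are assumed to be already projected to image rectangles, there are k >= 1 2D detections, and indices are 0-based, so the "last element" of a column is row k - 1. A practical implementation would vectorize the pairwise IoU computation.

```python
def iou_2d(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0.0 else 0.0

def build_sparse_input(boxes2d, scores2d, proj_boxes3d, scores3d, dists3d):
    """Return (features, indices) for the non-empty elements of the k x n x 4 tensor.

    Each feature is (IoU, 2D score, 3D score, normalized distance) and each
    index is the (i, j) position of that element in the dense tensor.
    """
    k = len(boxes2d)
    features, indices = [], []
    for j, box3d in enumerate(proj_boxes3d):
        matched = False
        for i, box2d in enumerate(boxes2d):
            iou = iou_2d(box2d, box3d)
            if iou > 0.0:  # keep only geometrically consistent pairs
                features.append((iou, scores2d[i], scores3d[j], dists3d[j]))
                indices.append((i, j))
                matched = True
        if not matched:
            # No overlapping 2D detection: keep the 3D candidate in the last
            # row of column j, marking IoU and 2D score with -1.
            features.append((-1.0, -1.0, scores3d[j], dists3d[j]))
            indices.append((k - 1, j))
    return features, indices

# Two 2D detections and two projected 3D detections; the second 3D box
# overlaps no 2D box and therefore receives the -1 markers.
toy_features, toy_indices = build_sparse_input(
    boxes2d=[(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)],
    scores2d=[0.9, 0.8],
    proj_boxes3d=[(1.0, 1.0, 3.0, 3.0), (50.0, 50.0, 60.0, 60.0)],
    scores3d=[0.7, 0.6],
    dists3d=[0.1, 0.5],
)
```

Storing the features and their indices separately is what lets the network process only the non-empty elements and later scatter the results back into place.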
| Detector | Input Data | 3D AP (%) Easy | 3D AP (%) Moderate | 3D AP (%) Hard | Bird’s Eye View AP (%) Easy | Bird’s Eye View AP (%) Moderate | Bird’s Eye View AP (%) Hard |
|---|---|---|---|---|---|---|---|
| SECOND (baseline) [yan2018second] | LiDAR | 83.34 | 72.55 | 65.82 | 89.39 | 83.77 | 78.59 |
| Improvement (CLOCs_SecCas over SECOND) | - | +3.04 | +5.90 | +6.63 | +1.77 | +4.46 | +4.04 |
| PointRCNN (baseline) [shi2019pointrcnn] | LiDAR | 86.23 | 75.81 | 68.99 | 92.51 | 86.52 | 81.39 |
| Improvement (CLOCs_PointCas over PointRCNN) | - | +1.27 | +1.04 | +2.21 | +0.09 | +2.47 | +0.35 |
| PV-RCNN (baseline) [shi2020pv] | LiDAR | 87.45 | 80.28 | 76.21 | 91.91 | 88.13 | 85.41 |
| Improvement (CLOCs_PVCas over PV-RCNN) | - | +1.49 | +0.39 | +0.94 | +1.14 | +1.67 | +1.17 |
IV-C Network Details
The fusion network is a set of 2D convolution layers. We use Conv2D($c_{in}$, $c_{out}$, $\mathbf{k}$, $s$) to represent a 2-dimensional convolution operator, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, and $\mathbf{k}$ and $s$ are the kernel size vector and stride, respectively. We employ four convolution layers sequentially as Conv2D(4, 18, (1,1), 1), Conv2D(18, 36, (1,1), 1), Conv2D(36, 36, (1,1), 1) and Conv2D(36, 1, (1,1), 1), which yields a tensor of size $1 \times p \times 1$, as shown in Fig. 3, where $p$ is the number of non-empty elements in the input tensor $T$. Note that ReLU [nair2010rectified] is applied after each of the first three convolution layers. Since we have saved the indices $(i, j)$ of these non-empty elements, as shown in Fig. 3, we can build a tensor $T_{out}$ of shape $k \times n \times 1$ by filling in the $p$ outputs based on the indices $(i, j)$ and putting negative infinity elsewhere. Finally, this tensor is mapped to the desired learning targets, a probability score map of size $1 \times n$, through max-pooling in the first dimension.
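Because every kernel is 1x1, each convolution acts independently on each non-empty element, so the forward pass can be sketched as a small per-element network followed by a scatter and a max over the 2D-detection dimension. The sketch below is illustrative only: the weights are random placeholders rather than trained parameters, the helper names are hypothetical, and the returned values are raw (pre-sigmoid) scores.

```python
import math
import random

# (input channels, output channels) of the four 1x1 convolution layers.
LAYER_SIZES = [(4, 18), (18, 36), (36, 36), (36, 1)]

def init_params(seed=0):
    """Random placeholder weights; a real system would load trained ones."""
    rng = random.Random(seed)
    params = []
    for c_in, c_out in LAYER_SIZES:
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(c_out)]
        biases = [0.0] * c_out
        params.append((weights, biases))
    return params

def fuse_element(params, feature):
    """Apply the four 1x1 layers to one 4-channel element; ReLU after the first three."""
    x = list(feature)
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]

def fused_scores(params, features, indices, k, n):
    """Scatter the p outputs into a k x n grid of -inf, then max over rows."""
    grid = [[-math.inf] * n for _ in range(k)]
    for feature, (i, j) in zip(features, indices):
        grid[i][j] = fuse_element(params, feature)
    # One fused score per 3D detection candidate (column).
    return [max(grid[i][j] for i in range(k)) for j in range(n)]
```

Filling empty positions with negative infinity ensures they can never win the max, so each 3D candidate takes its score from the best-matching 2D candidate.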
We use a cross entropy loss for target classification, modified by the focal loss in [lin2017focal] with parameters $\alpha$ and $\gamma$ to address the large class imbalance between targets and background.
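For reference, the binary focal loss of [lin2017focal] can be sketched as below. The default values 0.25 and 2.0 are the settings reported as working well in [lin2017focal] and are used here only as an assumption; the values used for training the fusion network are not restated in this text. The function name is hypothetical.

```python
import math

def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; target is 1 (object) or 0 (background)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # p_t is the probability assigned to the true class.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)  # guard against log(0)
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct background prediction contributes far less loss
# than an uncertain one, which limits the influence of the many negatives.
easy_negative = focal_loss(-5.0, 0)
hard_negative = focal_loss(0.0, 0)
```

Setting gamma to 0 recovers an alpha-weighted cross entropy, which makes the effect of the modulating factor easy to check.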
V Experimental Results
In this section we present our experimental setup and results, including dataset, platform, performance results and analyses. For all experiments, we focus on the car class since it has the most training and testing samples in the KITTI [geiger2012we] dataset.
Our fusion system is evaluated on the challenging KITTI 3D object detection benchmark [geiger2012we], which has both LiDAR point clouds and camera images. There are 7481 training samples and 7518 testing samples. Ground truth labels are only available for training samples. For the evaluation of testing samples, one needs to submit the detection results to the KITTI server. For experimental studies, we follow the convention in [chen20153d] to split the original training samples into 3712 training samples and 3769 validation samples. We compare our method with state-of-the-art multi-modal fusion methods for 3D object detection on the official test split of KITTI as well as on the validation split.
V-B 2D/3D Detector Setup
We apply our fusion network to combinations of different 2D and 3D detectors to demonstrate the flexibility of our proposed pipeline. The 2D detectors we used are: RRC [ren2017accurate], MS-CNN [cai2016unified] and Cascade R-CNN [cai2019cascade]. The 3D detectors we incorporated are: SECOND [yan2018second], PointPillars [lang2019pointpillars], PointRCNN [shi2019pointrcnn] and PV-RCNN [shi2020pv]. While not the top performers within the KITTI leaderboard, we have selected these methods as they are the best currently-available open-source detectors. Our experiments show that CLOCs improves the performance of these detectors significantly. At the time of submission, the CLOCs fusion of PV-RCNN with Cascade R-CNN is ranked number 4 on the KITTI 3D detection leaderboard, number 6 on the Bird’s Eye View detection leaderboard and number 1 on the 2D detection leaderboard, and outperforms all other fusion methods.
|Detector||3D AP (%)||Bird’s Eye View AP (%)|
*C-RCNN is Cascade R-CNN.
|Detector||3D AP (%)||Bird’s Eye View AP (%)|
|Detector||3D AP (%)||Bird’s Eye View AP (%)|
V-C Evaluation Results
We evaluate the detection results on the KITTI test server. The IoU threshold for car is 0.7. All instances are classified into three difficulty levels (easy, moderate and hard) based on their 2D bounding boxes’ height, occlusion level and truncation level. Since KITTI restricts the number of submissions, we only show results evaluated on the official KITTI test server for three fusion combinations of 2D and 3D detectors: SECOND [yan2018second] and Cascade R-CNN [cai2019cascade], written as CLOCs_SecCas; PointRCNN [shi2019pointrcnn] and Cascade R-CNN, written as CLOCs_PointCas; and PV-RCNN [shi2020pv] and Cascade R-CNN, written as CLOCs_PVCas. All the other combinations are evaluated on the validation set. Table I shows the performance of our fusion method on the KITTI test set through server submission. Our methods outperform all multi-modal fusion based works in the moderate and hard levels at the time of submission. Note that the official open-source code of PV-RCNN performs slightly worse than the private version owned by the PV-RCNN authors shown on the KITTI leaderboard, and our CLOCs_PVCas result is based on the open-source PV-RCNN. The baseline PV-RCNN in Table I refers to the open-source PV-RCNN. As shown in Table I, compared to the baseline methods SECOND, PointRCNN and PV-RCNN, fusion with Cascade R-CNN through our fusion network increases the performance in 3D and BEV object detection by a large margin.
We evaluate the performance of all the combinations of 2D and 3D detectors on the car class of the KITTI validation set; the results are shown in Table II. Compared to the corresponding baseline 3D detectors, our fusion methods perform better on the 3D and BEV detection benchmarks. These results demonstrate the effectiveness as well as the flexibility of our fusion approach.
Table III and Table IV show the 3D and BEV evaluation results for pedestrians and cyclists on the KITTI validation set. The IoU threshold for pedestrian and cyclist is 0.5. Among the 3D detectors, only SECOND [yan2018second] and PointPillars [lang2019pointpillars] publish their training configurations for the pedestrian and cyclist classes; among the 2D detectors, only MS-CNN [cai2016unified] does. Therefore, we only show the evaluation results based on SECOND, PointPillars and MS-CNN. As shown in Table III and Table IV, our fusion method improves the detection performance by a large margin.
Fig. 4 shows the average precision (AP) on the KITTI validation set in different distance ranges. The distance is defined as the Euclidean distance in the $xy$ plane between objects and the LiDAR. The blue bars are the APs for the SECOND detector, and the orange bars represent the APs for our CLOCs_SecCas. The yellow and purple bars show the APs of PointRCNN and CLOCs_PointCas respectively. As shown in Fig. 4, the APs for CLOCs are higher than the corresponding baselines in all distance ranges on both the 3D and BEV detection benchmarks. The largest improvement is in the long-distance range. This is because point clouds at long distance are too sparse for LiDAR-only detectors such as SECOND and PointRCNN, while CLOCs can utilize 2D detections to improve the performance.
Fig. 5 shows some qualitative results of our proposed fusion method on the KITTI [geiger2012we] test set. Red bounding boxes represent wrong detections (false positives) from SECOND that are deleted by CLOCs; blue bounding boxes represent missed detections from SECOND that are corrected by CLOCs; green bounding boxes are correct detections.
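| Type of Scores | 3D AP (%) Easy | 3D AP (%) Moderate | 3D AP (%) Hard | Bird’s Eye View AP (%) Easy | Bird’s Eye View AP (%) Moderate | Bird’s Eye View AP (%) Hard |
|---|---|---|---|---|---|---|
| corrected sigmoid score | 92.83 | 83.73 | 80.12 | 95.88 | 90.19 | 87.08 |
| corrected log score | 92.88 | 83.92 | 80.22 | 96.07 | 89.93 | 87.21 |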
V-D Score Scales
There are two common output scores for detectors: the first is a real number approximating the log likelihood ratio between target and clutter, and the second is a sigmoid transformation of this onto the range 0 to 1, thus approximating a probability of target. We compare the use of these in CLOCs in Table V and find improved performance using the log likelihood score. The primary reason for the poor performance of the normalized score is that it poorly approximates a probability of target (or precision). Using this score forces the fusion network to learn a non-linear correction, whereas the equivalent log likelihood score discrepancy is a simple offset that can easily be corrected by the fusion layer. If we instead use a fitted sigmoid to obtain better probabilistic outputs from PointRCNN, then fusion works equally well with either input. In general we believe it is simpler to use a log likelihood output for each single-modality detector and fuse these.
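The difference between the two scales can be illustrated numerically. In the sketch below, a detector is assumed to overestimate its log likelihood ratio by a constant, hypothetical offset of 2.0. In log likelihood space the discrepancy is the same constant for every score, while in sigmoid (probability) space it depends on the score itself, which is the non-linear correction mentioned above. The function names are illustrative.

```python
import math

def sigmoid(x):
    """Map a log likelihood ratio to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: map a probability back to a log likelihood ratio."""
    return math.log(p / (1.0 - p))

def score_gaps(true_llrs, offset):
    """Return (gaps in log likelihood space, gaps in probability space)."""
    log_gaps, prob_gaps = [], []
    for llr in true_llrs:
        p_true = sigmoid(llr)
        p_biased = sigmoid(llr + offset)
        log_gaps.append(logit(p_biased) - logit(p_true))
        prob_gaps.append(p_biased - p_true)
    return log_gaps, prob_gaps

# Hypothetical constant miscalibration of 2.0 applied to three scores.
log_gaps, prob_gaps = score_gaps([-4.0, 0.0, 4.0], offset=2.0)
```

Here the three gaps in log likelihood space are all approximately 2.0, whereas the gaps in probability space differ widely between low, middle and high scores.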
V-E Ablation Study
We evaluate the contribution of each channel and of the focal loss in our fusion pipeline. The four channels are: the IoU between 2D detections and projected 3D detections ($IoU$), the 2D confidence score ($s^{2D}$), the 3D confidence score ($s^{3D}$) and the normalized distance ($d$) between the 3D bounding box and the LiDAR in the $xy$ plane. The results are shown in Table VI.
$IoU$, as the measure of geometric consistency, is crucial to the fusion network. Without $IoU$, the association between 2D and 3D detections would be ambiguous, which would lead to degraded performance. The 2D confidence score ($s^{2D}$) indicates the certainty of 2D detections, which can provide useful clues for the fusion. The 3D confidence score ($s^{3D}$) plays the most important role among the four channels, because CLOCs generates new confidence scores for all 3D detection candidates through fusion, in which the original 3D scores are highly important evidence. Closer objects are usually easier to detect because there are more hits from the LiDAR; the normalized distance ($d$) can be a useful indicator of this. Because there is a large imbalance between positives and negatives among the detection candidates, focal loss can address this issue and improve the detection accuracy.
|focal loss||3D AP||BEV AP|
In this paper, we propose Camera-LiDAR Object Candidates Fusion (CLOCs), as a fast and simple way to improve performance of just about any 2D and 3D object detectors when both LiDAR and camera data are available. CLOCs exploits the geometric and semantic consistencies between 2D and 3D detections and automatically learns fusion parameters. The experiments show that our fusion method outperforms previous state-of-the-art methods by a large margin on the challenging 3D detection benchmark of KITTI dataset, especially in long distance detection. As such, CLOCs provides a baseline for other types of fusion including early and deep fusion.