Recent breakthroughs in deep learning and sensor technologies have motivated rapid development of autonomous driving technology, which could potentially improve road safety, traffic efficiency and personal mobility [Duarteeaav9843] [bigman2020life]. However, technical challenges and the cost of exteroceptive sensors have so far confined autonomous driving systems to small-scale deployments in controlled environments. One critical challenge is to obtain an adequately accurate understanding of the vehicle’s 3D surrounding environment in real-time. To this end, sensor fusion, which leverages multiple types of sensors with complementary characteristics to enhance perception and reduce cost, has become an emerging research theme.
In particular, the fusion of low-cost digital cameras and high-precision LiDARs has become increasingly popular, as these two types of sensors are currently the most informative and most relied-upon sensors on most autonomous driving platforms. For instance, vision-based perception systems achieve satisfactory performance at low cost, often outperforming human experts [Silver2016MasteringTG] [mnih2015humanlevel]. However, a mono-camera perception system cannot provide reliable 3D geometry, which is essential for autonomous driving. Stereo cameras, on the other hand, can provide 3D geometry, but do so at a high computational cost and struggle in high-occlusion and textureless environments. Furthermore, camera-based perception systems have difficulties with complex or poor lighting conditions, which limits their all-weather capabilities. In contrast, LiDARs provide high-precision 3D geometry that is invariant to lighting conditions, but they are limited by low resolution (ranging from 16 to 128 channels), low refresh rates (10 Hz) and high cost. To mitigate these limitations, many works have combined these two complementary sensors and demonstrated significant performance advantages over single-modality approaches. Therefore, this paper focuses on reviewing current strategies for camera-LiDAR fusion.
However, camera-LiDAR fusion is not a trivial task. First of all, images record the real world by projecting it onto the image plane, whereas the point cloud preserves the 3D geometry. Furthermore, in terms of data structure and type, the point cloud is irregular, orderless and continuous, while the image is regular, ordered and discrete. As a result, images and point clouds have to be processed in very different ways. Figure 1 compares the characteristics of images and point clouds.
Previous reviews [wang2019multi] [Feng_2020] on deep learning methods for multi-modal data fusion covered a broad range of sensors, including radars, cameras, LiDARs, ultrasonics, IMUs, odometers, GNSS and HD maps. This paper focuses on camera-LiDAR fusion only and can therefore give more detailed reviews of individual methods. Furthermore, we cover a broader range of perception-related topics (depth completion, object detection, semantic segmentation and tracking) that are interconnected and are not fully included in the previous reviews [Feng_2020]. The contributions of this paper are summarized as follows:
To the best of our knowledge, this paper is the first survey focusing on deep learning based image and point cloud fusion approaches in autonomous driving, including depth completion, object detection, semantic segmentation and object tracking.
This paper organizes and reviews methods based on their fusion methodologies. Furthermore, this paper presents the most up-to-date (2014-2020) overviews and performance comparisons of state-of-the-art camera-LiDAR fusion methods.
This paper raises overlooked open questions, such as online self-calibration and sensor-agnostic frameworks, that are critical for the real-world deployment of autonomous driving technology. Moreover, summaries of trends and open challenges are presented.
This paper first provides a brief overview of deep learning methods on image and point cloud data in Section II. In Sections III to VI, reviews of camera-LiDAR based depth completion, 3D object detection, semantic segmentation and object tracking are presented respectively. Trends, open challenges and promising directions are discussed in Section VII. Finally, a summary is given in Section VIII. Figure 2 presents the overall structure of this survey and the corresponding topics.
II A Brief Review of Deep Learning
Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers to extract features progressively. Early work on learning-based artificial neural networks started in the 1940s and gained increasing attention and development throughout the 1970s and 1980s. Some of the most important concepts, such as automatic differentiation [linnainmaa1976taylor] [werbos1990backpropagation] [287150], were proposed during this time. During the 1990s and 2000s, research in artificial intelligence slowed down for various reasons. However, driven by the exponential growth of computational power and data, combined with advances in network architectures and training strategies, research in artificial intelligence has since regained momentum. In particular, deep learning based methods have demonstrated great potential, outperforming other methods in the ImageNet competition [RussakovskyFeiFei] [NIPS2012_4824] [szegedy2015going].
II-A Deep Learning on Image
Convolutional Neural Networks (CNNs) are one of the most efficient and powerful deep learning models for image processing and understanding. Compared to a Multi-Layer Perceptron (MLP), a CNN is shift-invariant, contains fewer weights and exploits hierarchical patterns, making it highly efficient for image semantic extraction. Typically, the hidden layers of a CNN consist of a hierarchy of convolutional, batch normalization, activation and pooling layers, which can be trained end-to-end. This hierarchical structure extracts image features with increasing levels of abstraction and an increasing receptive field, enabling the learning of high-level semantics.
II-B Deep Learning on Point Cloud
The point cloud is a set of data points, which are measurements of the detected object’s surface. In terms of data structure, the point cloud is sparse, irregular, orderless and continuous. A point cloud encodes information in 3D structures and reflective intensities, which are invariant to scale, rigid transformation and permutation. These characteristics are challenging for existing CNN-based deep learning models and require either modifications of existing models or specially designed ones. Therefore, this section focuses on introducing some common methodologies for point cloud processing.
II-B1 Volumetric representation based
Volumetric representation partitions the point cloud into a 3D grid. The point features of each voxel can be hand-crafted or learned. This partition is often achieved with fixed-resolution voxels [Zhou_2018], which allows standard 3D convolution to be applied. However, this partitioning scheme leads to a dilemma: a high-resolution voxel space retains richer fine-grained geometry, but drastically increases the computation and memory cost. Furthermore, the number of empty voxels also grows cubically with the resolution.
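The resolution dilemma can be illustrated with a minimal sketch of fixed-resolution voxelization. The function name `voxelize`, the 10 m cube and the random points are assumptions made for illustration only and do not correspond to any specific method reviewed here.

```python
import numpy as np

def voxelize(points, voxel_size, origin):
    """Map each point to an integer voxel index and count the points per occupied voxel."""
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    voxels, counts = np.unique(idx, axis=0, return_counts=True)
    return voxels, counts

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(1000, 3))   # hypothetical points in a 10 m cube

for size in (2.0, 1.0, 0.5):
    voxels, counts = voxelize(points, size, origin=np.zeros(3))
    total = int(round(10.0 / size)) ** 3          # grid size grows cubically as voxels shrink
    print(f"voxel size {size} m: {len(voxels)} occupied of {total} voxels")
```

Halving the voxel size multiplies the number of grid cells by eight, while the number of occupied voxels is bounded by the number of points, so most of the added cells stay empty.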
II-B2 Tree-like representation based
To alleviate the trade-off between high resolution and computational cost, adaptive-resolution partition methods that leverage tree-like structures [riegler2017octnet] [lei2019octree] [zeng20183dcontextnet] have been proposed. By dividing the point cloud into a series of unbalanced trees, regions can be partitioned based on their point densities. This allows regions with lower point densities to have lower resolutions, which reduces unnecessary computation and memory cost. Point features are extracted along the tree structure using convolution-like operations.
II-B3 Point representation based
Point representation based methods consume the raw point cloud. One of the pioneering works, PointNet [qi2017pointnet], employs an independent T-Net module to align the point cloud and shared Multi-Layer Perceptrons (MLPs) to process individual points for point-wise feature extraction. The key idea that allows the processing of unordered data is to employ symmetric functions (max-pooling):
$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n))$$
where $x_i$ represents the points, $h$ the shared per-point function (MLP), $g$ the symmetric function and $f$ the general function that we want to approximate.
The point cloud’s global features can be aggregated from these pointwise features via max-pooling. This method is invariant to rigid transformation and permutation. However, local inter-point geometry is not fully explored. Moreover, the alignment network introduces additional computation costs.
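The permutation invariance obtained from a symmetric function can be checked with a small sketch. It assumes a single shared linear layer with random weights in place of PointNet's learned MLPs and omits the T-Net alignment entirely; `W`, `b` and `global_feature` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # hypothetical weights of one shared per-point layer
b = rng.normal(size=16)

def global_feature(points):
    """Shared per-point layer with ReLU, followed by max-pooling over the point axis."""
    pointwise = np.maximum(points @ W + b, 0.0)   # (N, 16), same weights for every point
    return pointwise.max(axis=0)                  # (16,), does not depend on point order

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]
print(np.allclose(global_feature(cloud), global_feature(shuffled)))
```

Because the maximum is taken over the set of point-wise features, reordering the input points leaves the global feature unchanged.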
II-B4 2D view representation based
2D view representations of the point cloud are generated by projecting the point cloud onto a 2D view plane/grid. This results in an image-like feature map, where each pixel/grid cell encodes the features of the points that fall within it. One of the most popular views is the bird’s-eye view (BEV), where perspective occlusions are minimal and the raw information about objects’ orientation and x/y coordinates is retained. Commonly used point features include average point height, density and intensity. Standard 2D convolution and off-the-shelf CNN architectures can be directly applied to the BEV representations.
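A BEV feature map of this kind can be sketched as follows. The grid ranges, the cell size and the use of the raw point count as the density channel are assumptions of this sketch; individual methods define these features differently.

```python
import numpy as np

def bev_features(points, intensity, x_range, y_range, cell):
    """Rasterize points into a BEV grid with mean height, point count and mean intensity."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)   # drop points outside the grid
    ix, iy = ix[keep], iy[keep]
    count = np.zeros((nx, ny))
    height_sum = np.zeros((nx, ny))
    intensity_sum = np.zeros((nx, ny))
    np.add.at(count, (ix, iy), 1.0)                        # unbuffered add handles repeated cells
    np.add.at(height_sum, (ix, iy), points[keep, 2])
    np.add.at(intensity_sum, (ix, iy), intensity[keep])
    denom = np.maximum(count, 1.0)                         # empty cells stay at zero
    return np.stack([height_sum / denom, count, intensity_sum / denom], axis=0)

# Hypothetical points (x, y, z) in metres with reflectance values.
points = np.array([[1.2, 0.3, 0.5], [1.4, 0.1, 1.5], [3.7, -1.2, 0.2], [50.0, 0.0, 0.0]])
intensity = np.array([0.2, 0.4, 0.9, 0.1])
bev = bev_features(points, intensity, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0)
print(bev.shape)
```

The result has one channel per feature and can be passed to a standard 2D CNN like an ordinary multi-channel image.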
II-B5 Geometric representation based
Point clouds can be represented as graphs, and convolution-like operations can be implemented on graphs in the spatial or spectral domain [henaff2015deep] [simonovsky2017dynamic]. For graph convolution in the spatial domain, operations are carried out by MLPs on spatially neighbouring points.
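Spectral-domain graph convolution extends convolution as spectral filtering on graphs through the Laplacian spectrum [doi:10.1111/cgf.12693] [ae482107de73461787258f805cf8f4ed] [defferrard2016convolutional]. With the eigendecomposition of the graph Laplacian $L = U \Lambda U^{T}$, the filtering of a signal $x$ in the spectral domain is defined as:
$$y = g_{\theta}(L)\,x = U\, g_{\theta}(\Lambda)\, U^{T} x$$
and a non-parametric filter is defined as:
$$g_{\theta}(\Lambda) = \mathrm{diag}(\theta), \qquad y = U\left(\theta \odot U^{T} x\right)$$
where $\odot$ represents the Hadamard product and $\theta$ is a vector of Fourier coefficients.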
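Spectral filtering on a graph can be sketched on a toy example: transform the signal into the eigenbasis of the graph Laplacian, scale each coefficient, and transform back. The 4-node path graph and the low-pass coefficients below are assumptions chosen for illustration; in a learned model the coefficients would be trainable.

```python
import numpy as np

# Hypothetical 4-node path graph: adjacency A and combinatorial Laplacian L = D - A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Eigendecomposition L = U diag(lam) U^T; the columns of U act as the graph Fourier basis.
lam, U = np.linalg.eigh(L)

def spectral_filter(x, theta):
    """Scale the graph Fourier coefficients of x element-wise by theta."""
    return U @ (theta * (U.T @ x))

x = np.array([1.0, 0.0, 0.0, 0.0])     # signal living on the graph nodes
low_pass = 1.0 / (1.0 + lam)           # illustrative coefficients, not learned
y = spectral_filter(x, low_pass)
print(y)
```

With all coefficients set to one the filter returns the input unchanged, and the low-pass choice above spreads the signal from the first node to its neighbours.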
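Pointwise discrete convolution [hua2018pointwise] [Lan_2019] places the center of the convolution kernel at each point and weights nearby points with respect to their distances to the center point. A pointwise convolution is defined as:
$$x_i^{\ell} = \sum_{k} w_k \frac{1}{|\Omega_i(k)|} \sum_{p_j \in \Omega_i(k)} x_j^{\ell-1}$$
where $x_j$ represents the value of point $p_j$, $\Omega_i(k)$ represents the $k$-th sub-domain of the convolution kernel centered at point $p_i$, $w_k$ represents the kernel weight of that sub-domain, and $\ell - 1$ and $\ell$ index the input and output layers.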
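A pointwise convolution of this form can be sketched with scalar point values. For brevity the kernel sub-domains are taken to be concentric distance shells around each point, which is an assumption of this sketch and not necessarily the kernel layout used in the cited works.

```python
import numpy as np

def pointwise_conv(points, values, weights, radius):
    """One weight per shell; each shell contributes the mean value of the points inside it."""
    n_shells = len(weights)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # (N, N)
    shell = np.floor(dist / radius * n_shells).astype(int)   # shell index of p_j around p_i
    out = np.zeros(len(points))
    for k in range(n_shells):
        in_shell = shell == k                                # points at or beyond radius never match
        count = in_shell.sum(axis=1)
        total = (in_shell * values[None, :]).sum(axis=1)
        out += weights[k] * np.where(count > 0, total / np.maximum(count, 1), 0.0)
    return out

# Hypothetical points on a line with one scalar value each.
points = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
values = np.array([1.0, 3.0, 10.0])
out = pointwise_conv(points, values, weights=np.array([1.0, 0.5]), radius=1.0)
print(out)
```

The isolated third point only sees itself, whereas the first two points each receive a down-weighted contribution from their neighbour in the outer shell.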
|Fusion Level||Training||Models||Run-time||iRMSE (1/km)||iMAE (1/km)||RMSE (mm)||MAE (mm)|
|Signal Level||Supervised||Sparse2Dense [Ma_2018]||0.08s||4.07||1.57||1299.85||350.32|
|Feature Level||Supervised||Spade-RGBsD [Jaritz_2018]||0.07s||2.17||0.95||917.64||234.81|
|Feature Level||Supervised||HDE-Net||0.05s||-||-||-||-|
III Depth Completion
One of the obstacles to the large-scale commercialization of autonomous vehicles is the cost of sensors. In particular, the cost of high-resolution LiDARs (64 or 128 channels) is extremely high. Furthermore, even with high-end LiDARs, the measurements of long-range targets are still limited and sparse. This sparsity greatly limits and complicates 3D perception algorithms, which are often designed to process dense and regular data. Depth completion aims to solve this problem by up-sampling the sparse, irregular data to dense, regular data. Camera-LiDAR fusion based approaches often leverage the high-resolution image to guide depth up-sampling, which also leads to pixel-wise fusion and results in a dense depth image. Figure 3 gives the timeline of depth completion models and their corresponding fusion levels. The comparative results of depth completion models on the KITTI depth completion benchmark [Uhrig2017THREEDV] are listed in Table I and plotted in Figure 4.
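The depth completion task can be represented as:
$$w^{*} = \arg\min_{w} \mathcal{L}\left(f(x; w), G\right)$$
where the network $f(\cdot)$, parametrized by $w$, predicts the ground truth $G$ given the input $x$, and the loss function is represented as $\mathcal{L}(\cdot, \cdot)$.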
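In practice the depth ground truth is usually only semi-dense, so the error between prediction and ground truth is evaluated on valid pixels only. The sketch below computes RMSE and MAE under the assumption that a value of 0 marks a pixel without ground truth; the arrays are made-up values.

```python
import numpy as np

def depth_errors(pred, gt):
    """RMSE and MAE over pixels that have ground truth (gt > 0 marks valid pixels here)."""
    valid = gt > 0
    diff = pred[valid] - gt[valid]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae

gt = np.array([[0.0, 10.0], [20.0, 0.0]])      # hypothetical semi-dense ground truth, metres
pred = np.array([[5.0, 11.0], [17.0, 8.0]])    # hypothetical dense network output, metres
rmse, mae = depth_errors(pred, gt)
print(rmse, mae)
```

Predictions at pixels without ground truth do not influence either error, which is why dense outputs can be trained and evaluated against sparse references.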
III-A Mono Camera and LiDAR fusion
Most current research on depth completion focuses on using images from a mono camera to guide depth completion. The idea is that dense RGB/color information contains relevant 3D geometry, which can be leveraged as a reference for depth up-sampling.
III-A1 Signal-level fusion
In 2018, Ma et al. [Ma_2018] presented a ResNet [He_2016] based autoencoder network that leverages an image concatenated with a sparse depth map to predict a dense depth map. This approach employs signal-level fusion and outperformed other depth completion methods at the time. However, it requires pixel-level depth ground truth, which is difficult to obtain. To solve this issue, Ma et al. [Ma_2019] presented a model-based self-supervised framework that only requires a sequence of images and sparse depth images for training. This self-supervision is achieved by employing a sparse depth constraint, a photometric loss and a smoothness loss. However, this approach assumes objects to be stationary. Furthermore, the resulting depth output is blurry and input depths may not be preserved.
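Signal-level fusion by concatenation can be sketched in a few lines. The image size, the use of 0 for pixels without a LiDAR return and the function name `fuse_rgbd` are assumptions of this sketch.

```python
import numpy as np

def fuse_rgbd(rgb, sparse_depth):
    """Stack an (H, W, 3) image and an (H, W) sparse depth map into an (H, W, 4) input."""
    return np.concatenate([rgb, sparse_depth[..., None]], axis=-1)

rgb = np.full((4, 6, 3), 0.5)       # hypothetical normalized image
sparse_depth = np.zeros((4, 6))     # 0 marks pixels without a LiDAR return (assumption)
sparse_depth[1, 2] = 12.3           # a single projected LiDAR measurement, in metres
rgbd = fuse_rgbd(rgb, sparse_depth)
print(rgbd.shape)
```

The fused array is then consumed by the network like a four-channel image, so no architectural change beyond the first layer is needed.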
To generate a sharp dense depth map in real-time, Cheng et al. [Cheng_2018] fed RGB-D images (an image and a sparse depth map) to a novel convolutional spatial propagation network (CSPN). CSPN aims to extract the image-dependent affinity matrix directly from data, producing significantly better results in key measurements with a shorter run-time. In CSPN++, Cheng et al. [cheng2019cspn] proposed to dynamically select convolutional kernel sizes and iterations to reduce computation. Furthermore, CSPN++ employs weighted assembling to boost its performance.
III-A2 Feature-level fusion
Jaritz et al. [Jaritz_2018] presented an autoencoder network that can perform either depth completion or semantic segmentation from sparse depths and images without applying validity masks. Images and sparse depth maps are first processed by two parallel NASNet-based encoders [Zoph_2018] before being fused in the shared decoder. This approach can achieve decent performance even with very sparse depth inputs (8-channel LiDAR). Wang et al. designed an integrable module (PnP) that leverages the sparse depth map to improve the performance of existing image-based depth prediction networks. This PnP module leverages gradients calculated from the sparse depth ground truth to iteratively update the intermediate feature map produced by the existing depth prediction network. Consistent improvements are achieved by integrating the PnP module into state-of-the-art methods. Eldesokey et al. [Eldesokey_2019] presented a framework for unguided depth completion that processes images and very sparse depths in parallel and combines them in a shared decoder. Furthermore, a novel normalized convolution is designed to process highly sparse depth data and to propagate confidence. Valada et al. [Valada_2019] extended one-stage feature-level fusion to multiple stages at varying depths of the network. Similarly, GuideNet [tang2019learning] by Tang et al. fused image features with sparse depth features at different stages of the encoder to guide the up-sampling of sparse depths. It is worth noting that GuideNet is currently the best performing model on the KITTI depth completion benchmark.
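The idea behind normalized convolution for sparse depth can be sketched in one dimension: the signal is weighted by a confidence map before filtering, the result is divided by the filtered confidence, and a new confidence is propagated alongside. The fixed kernel, the binary input confidence and the 1D setting are simplifications assumed here; in a learned model the kernel would be trained.

```python
import numpy as np

def normalized_conv1d(signal, confidence, kernel, eps=1e-8):
    """Confidence-weighted filtering that also returns the propagated confidence."""
    weighted = np.convolve(signal * confidence, kernel, mode="same")
    support = np.convolve(confidence, kernel, mode="same")
    out = weighted / (support + eps)       # only confident samples contribute to the average
    out_conf = support / kernel.sum()      # fraction of kernel mass that saw valid data
    return out, out_conf

depth = np.array([0.0, 0.0, 4.0, 0.0, 0.0, 8.0, 0.0])   # hypothetical sparse depth row
conf = (depth > 0).astype(float)                        # 1 where a measurement exists
kernel = np.array([1.0, 2.0, 1.0])                      # fixed, non-negative, symmetric
dense, dense_conf = normalized_conv1d(depth, conf, kernel)
print(dense, dense_conf)
```

Missing samples are filled from their valid neighbours without being biased towards zero, and the output confidence is lower wherever fewer valid inputs fell under the kernel.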
III-A3 Multi-level fusion
Van Gansbeke et al. [Van_Gansbeke_2019] further combined signal-level and feature-level fusion techniques in an image-guided depth completion network. The network consists of a global and a local branch that process RGB-D data and depth data in parallel before their outputs are fused based on confidence maps. This approach runs in real-time and ranks among the top methods on the KITTI depth completion benchmark.
III-B Stereo Cameras and LiDAR fusion
Compared with the RGB image, the dense disparity from stereo cameras contains richer 3D geometry information. On the other hand, LiDAR depth is sparse but of higher accuracy. These complementary characteristics make it possible to compute a dense and more accurate depth from the two. However, it is worth noting that stereo cameras have a limited range and struggle in high-occlusion, textureless environments, making them less ideal for autonomous driving.
III-B1 Feature-level fusion
One of the pioneering works is from Park et al., in which a high-precision dense disparity map is computed from dense stereo disparity and the point cloud using a two-stage CNN. The first stage of the CNN takes LiDAR and stereo disparity to produce a fused disparity. In the second stage, this fused disparity and the left RGB image are fused in the feature space to predict the final high-precision disparity. Finally, the 3D scene can be reconstructed from this high-precision disparity. The bottleneck of this approach is the need for large-scale annotated stereo-LiDAR datasets, which are rare. LidarStereoNet [Cheng_2019] averted this difficulty with an unsupervised learning scheme, which does not require depth ground truth. The unsupervised scheme employs image warping/photometric loss, sparse depth loss, smoothness loss and plane fitting loss for end-to-end training. Furthermore, the introduction of a ‘feedback loop’ makes LidarStereoNet robust against noisy point clouds and sensor misalignment. This ‘feedback loop’ leverages dense stereo depth to filter unreliable measurements in the point cloud before network processing takes place. Similarly, Zhang et al. [zhang2019listereo] presented a self-supervised scheme for depth completion, whose loss function consists of sparse depth, photometric and smoothness losses.
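The step from a disparity map to metric depth, which underlies the 3D reconstruction mentioned above, follows the standard relation Z = f B / d for a rectified stereo pair. The focal length and baseline below are hypothetical values, and marking non-positive disparities as invalid with 0 is an assumption of this sketch.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth Z = f * B / d; pixels with non-positive disparity are marked invalid (0)."""
    depth = np.zeros_like(disparity, dtype=float)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

disparity = np.array([[0.0, 8.0], [16.0, 64.0]])    # disparity in pixels
depth = disparity_to_depth(disparity, focal_px=800.0, baseline_m=0.5)
print(depth)
```

Since depth is inversely proportional to disparity, a fixed disparity error translates into a depth error that grows with distance, which is one reason accurate LiDAR depth is a useful complement at long range.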
IV 3D Object Detection
3D object detection aims to locate, classify and estimate oriented bounding boxes in the 3D space. There are two main approaches to object detection: sequential and one-step. Sequential models consist of a proposal stage followed by a 3D bounding box (bbox) regression stage. In the proposal stage, regions that may contain objects of interest are proposed. In the bbox regression stage, the region proposals are classified based on the region-wise features extracted from 3D information. However, the performance of sequential fusion is limited by each stage. On the other hand, one-step models consist of one stage, where 2D and 3D information is processed in parallel. The timeline of 3D object detection networks and their corresponding fusion levels is shown in Figure 5. Table II and Figure 6 present comparative results of 3D object detection models on the KITTI 3D Object Detection benchmark [Geiger2012CVPR].
IV-A 2D Proposal Based Sequential Models
A 2D proposal based sequential model attempts to utilize 2D image semantics in the proposal stage, which takes advantage of off-the-shelf image processing models. Specifically, these methods leverage the image object detector to generate 2D region proposals, which can be projected to the 3D space as detection seeds. There are two projection approaches to translate 2D proposals to 3D. The first one is projecting bounding boxes in the image plane to the 3D point cloud, which results in a frustum shaped 3D search space. The second method projects point cloud to the image plane, which results in the point cloud with point-wise 2D semantic information. However, far away or occluded objects are often represented by a handful of sparse points, making 3D bbox regression difficult.
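The first projection approach, which turns a 2D box into a frustum-shaped 3D search space, can be sketched by keeping the points whose image projection falls inside the box. The sketch assumes a pinhole camera, points already transformed into the camera frame (z pointing forward), and a hypothetical intrinsic matrix `K`.

```python
import numpy as np

def project_to_image(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def points_in_frustum(points_cam, K, box):
    """Keep points in front of the camera whose projection lies inside box = (u1, v1, u2, v2)."""
    u1, v1, u2, v2 = box
    in_front = points_cam[:, 2] > 0
    safe = np.where(in_front[:, None], points_cam, [0.0, 0.0, 1.0])   # avoid dividing by z <= 0
    uv = project_to_image(safe, K)
    inside = (uv[:, 0] >= u1) & (uv[:, 0] <= u2) & (uv[:, 1] >= v1) & (uv[:, 1] <= v2)
    return points_cam[in_front & inside]

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                 # hypothetical camera intrinsics
points = np.array([[0.0, 0.0, 10.0], [2.0, 0.0, 10.0], [0.0, 0.0, -5.0], [-3.0, 1.0, 20.0]])
selected = points_in_frustum(points, K, box=(300.0, 200.0, 480.0, 300.0))
print(selected)
```

Only the points inside the frustum are passed on to the 3D stage, which is how a 2D detection narrows the 3D search space.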
IV-A1 Result-level Fusion
In these methods, information aggregation happens at the result level. The intuition is to use off-the-shelf 2D object detectors to narrow down the regions of interest for the 3D object detector. The most common way to turn a 2D bbox into a 3D seed region is reverse camera projection. This replaces the processing of the entire point cloud with multiple smaller regions of interest, which brings a significant reduction in computation and run-time. However, since the whole pipeline depends on the results of the 2D object detector, the overall performance of this approach is limited by the performance of the 2D object detector. The core idea of result-level fusion is not to use multimodal data to complement each other, but rather to reduce computation.
One of the early frustum-based methods is F-PointNets [Qi_2018], where 2D bounding boxes are first generated from image data and then projected to the 3D space. The resulting frustum proposals are fed into a PointNet [Charles_2017] based detector for 3D object detection. Du et al. [Du_2018] extended the 2D to 3D proposal generation stage with an additional proposal refinement stage. During this refinement stage, a model fitting based method is used to filter out background points inside the seed region. Finally, the filtered points are fed into the bbox regression network. This extra step further reduces unnecessary computation on background points. RoarNet [Shin_2019] follows a similar idea, but instead employs a neural network for the proposal refinement stage. Multiple 3D cylinder proposals are first generated from each 2D bbox using geometric agreement search [Mousavian_2017], which results in smaller but more precise proposals than those of F-PointNets [Qi_2018]. These initial cylinder proposals are then processed by a PointNet [qi2017pointnet] based header network for final refinement. This approach outperformed the state of the art on the KITTI test set, including scenarios where the sensors are not synchronized. However, these approaches assume that each seed region contains only one object of interest, which does not hold for crowded scenes and small objects like pedestrians.
One possible solution to the aforementioned issues is to replace the 2D object detector with 2D semantic segmentation and the region-wise seed proposals with point-wise seed proposals. Intensive Point-based Object Detector (IPOD) [yang2018ipod] by Yang et al. is a work in this direction. In the first step, 2D semantic segmentation is used to filter out background points. This is achieved by projecting points onto the image plane and associating each point with a 2D semantic label. The resulting foreground point cloud retains the context information and fine-grained location, which the authors consider essential for region proposal and bbox regression. In the following point-wise proposal generation and bbox regression stage, two PointNet++ [qi2017pointnet] based networks are used for proposal feature extraction and bbox prediction. In addition, a novel criterion called PointsIoU is proposed to speed up training and inference. This approach has yielded significant performance advantages over other state-of-the-art approaches in scenes with high occlusion or many objects.
IV-A2 Multi-level Fusion
Another possible direction of improvement is to combine result-level fusion with feature-level fusion; one such work is PointFusion [Xu_2018]. PointFusion first utilizes an existing 2D object detector to generate 2D bboxes. These bboxes are used to select the corresponding points by projecting the points onto the image plane and keeping those that fall inside the bboxes. Finally, a ResNet [He_2016] and a PointNet [Charles_2017] based network combine image and point cloud features to estimate 3D objects. In this approach, image features and point cloud features are fused per proposal for the final 3D object detection, which facilitates 3D bbox regression. However, the proposal stage still relies on a single modality. In SIFRNet [zhao20193d], frustum proposals are first generated from an image. Point cloud features in these frustum proposals are then combined with their corresponding image features for the final 3D bbox regression. To achieve scale invariance, PointSIFT [jiang2018pointsift] is incorporated into the network. Additionally, an SENet module is used to suppress less informative features.
|Methods||Fusion Level||Fusion Type||PCR||Models||Run-time||Cars||Pedestrians||Cyclists|
|2D Proposal Based Sequential||Result Level||N/A||Points||F-PointNet [Qi_2018]||0.17s||69.79||42.15||56.12|
|2D Proposal Based Sequential||Feature Level||Point-wise||Multiple||PointPainting [vora2019pointpainting]||0.4s||71.70||40.97||63.78|
|2D Proposal Based Sequential||Multi Level||ROI-wise||Points||PointFusion [Xu_2018]||1.3s||63.00||28.04||29.42|
|3D Proposal Based Sequential||Feature Level||ROI-wise||2D views||MV3D [Chen_2017]||0.36s||63.63||-||-|
|3D Proposal Based Sequential||Feature Level||Pixel/ROI-wise||2D views||AVOD-FPN [Ku_2018]||0.08s||71.76||42.27||50.55|
|3D Proposal Based Sequential||Feature Level||Pixel/ROI-wise||2D views||SCANet||0.17s||68.12||37.93||53.38|
|3D Proposal Based Sequential||Feature Level||Point-wise||2D views||ContFuse [Liang_2018_ECCV]||0.06s||68.78||-||-|
|3D Proposal Based Sequential||Feature Level||Pixel-wise||2D views||BEVF||-||-||-||45.00|
|3D Proposal Based Sequential||Multi Level||Point-wise||2D views||MMF [Liang_2019_CVPR]||0.08s||77.43||-||-|
|One-step||Feature Level||Point-wise||Voxels||MVX-Net [Sindagi_2019]||0.15s||75.86||43.73||61.03|
|One-step||Feature Level||Pixel-wise||2D views||LaserNet++ [meyer2019sensor]||0.04s||-||-||-|
IV-A3 Feature-level Fusion
Early attempts at multimodal fusion were performed pixel-wise, where 3D geometry is converted to an image format or appended as additional channels of an image. The intuition is to project 3D geometry onto the image plane and leverage mature image processing methods to extract information. The resulting output, however, also lies on the image plane, which is not ideal for locating objects in the 3D space. In 2014, Gupta et al. proposed Depth-RCNN [Gupta_2014], an R-CNN [Girshick_2014] based architecture for 2D object detection, instance segmentation and semantic segmentation. It encodes the 3D geometry from a Microsoft Kinect camera in three image channels: Horizontal disparity, Height above ground, and Angle with gravity (HHA). Gupta et al. extended Depth-RCNN in 2015 for 3D object detection by aligning 3D CAD models, yielding significant performance improvement. In 2016, Gupta et al. developed a novel technique for supervised knowledge transfer between networks trained on image data and an unseen paired image modality (depth image) [Gupta_2016]. In 2016, Schlosser et al. further exploited learning RGB-HHA representations on 2D CNNs for pedestrian detection. In this case, however, the HHA data is generated from LiDAR depth instead of a depth camera. The authors also noticed that better results can be achieved if the fusion of RGB and HHA happens at deeper layers of the network.
To locate objects accurately in 3D, current works often employ point-wise fusion. In this approach, image features are appended to each point in the point cloud, which is achieved by projecting points onto the image plane for point-pixel association. PointPainting [vora2019pointpainting], proposed by Vora et al., follows the idea of projecting points to 2D semantic maps in [yang2018ipod]. However, instead of using the 2D semantics to filter the point cloud, the 2D semantics are simply appended to the points as additional channels. The authors argued that this technique makes PointPainting flexible, as it enables any point cloud based network to be applied to the fused data. To demonstrate this flexibility, the fused point cloud is fed into multiple existing point cloud detectors, which are based on PointRCNN [Shi_2019], VoxelNet [Zhou_2018] and PointPillars [Lang_2019].
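Point-wise fusion of this kind can be sketched by looking up a per-pixel score vector for every projected point and appending it to the point's coordinates. The sketch assumes camera-frame points, a hypothetical intrinsic matrix `K` and a made-up segmentation output, and it simply drops points that do not project into the image, which is a simplification.

```python
import numpy as np

def paint_points(points_cam, K, seg_scores):
    """Append per-pixel class scores of shape (H, W, C) to the points that project into the image."""
    h, w, _ = seg_scores.shape
    pts = points_cam[points_cam[:, 2] > 0]             # keep points in front of the camera
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)    # pixel column
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)    # pixel row
    visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return np.hstack([pts[visible], seg_scores[v[visible], u[visible]]])

K = np.array([[100.0, 0.0, 4.0],
              [0.0, 100.0, 3.0],
              [0.0, 0.0, 1.0]])                         # hypothetical intrinsics for a 6 x 8 image
seg_scores = np.zeros((6, 8, 2))
seg_scores[..., 0] = 1.0                                # everything is background ...
seg_scores[3, 4] = [0.1, 0.9]                           # ... except one object pixel
points = np.array([[0.0, 0.0, 5.0], [0.125, 0.0, 5.0], [5.0, 0.0, 5.0], [0.0, 0.0, -1.0]])
painted = paint_points(points, K, seg_scores)
print(painted)
```

The painted array keeps the format of an ordinary point cloud with extra channels, which is why it can be fed to different point cloud detectors without changing their architecture.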
IV-B 3D Proposal Based Sequential Model
In a 3D proposal based sequential model, 3D proposals are directly generated from 2D or 3D data. The elimination of the 2D to 3D transformation greatly narrows down the 3D search space for object detection. Common methods for 3D proposal generation include the multi-view approach and the point cloud voxelization approach.
The multi-view approach exploits the point cloud’s bird’s-eye view (BEV) representation for 3D proposal generation. The BEV is the preferred viewpoint because it avoids perspective occlusions and retains the raw information about objects’ orientation and x, y coordinates. This orientation and coordinate information is critical for 3D object detection, and it also makes the coordinate transformation between the BEV and other views straightforward.
Point cloud voxelization transforms the continuous irregular data structure to a discrete regular data structure. This makes it possible to apply standard 3D discrete convolution and leverage existing network structures to process point cloud. The drawback is the loss of some spatial resolution, which might contain fine-grained 3D structure information.
IV-B1 Feature-level fusion
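One of the pioneering and most important works on generating 3D proposals from BEV representations is MV3D [Chen_2017]. MV3D generates 3D proposals on a pixelized top-down LiDAR feature map (height, density and intensity). These 3D candidates are then projected to the LiDAR front view and the image plane to extract and fuse region-wise features for bbox regression. The fusion happens at the region-of-interest (ROI) level via ROI pooling. The ROI of each view can be defined as:
$$ROI_{v} = T_{3D \rightarrow v}(p_{3D}), \quad v \in \{BV, FV, RGB\}$$
where $T_{3D \rightarrow v}$ represents the transformation function from the 3D space to the BEV, the front view and the image plane, and $p_{3D}$ is a 3D proposal. The ROI-pooling can be defined as:
$$f_{v} = R(x, ROI_{v}), \quad v \in \{BV, FV, RGB\}$$
where $R$ denotes the ROI pooling operation, $x$ the feature map of view $v$ and $f_{v}$ the resulting fixed-length feature vector.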
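ROI pooling turns regions of different sizes into features of the same size, which is what allows region-wise features from several views to be fused. The sketch below max-pools an axis-aligned region of a feature map into a fixed grid; the region format `(r1, c1, r2, c2)` with exclusive end indices and the toy feature map are assumptions of this sketch.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size):
    """Max-pool the region (r1, c1, r2, c2) of an (H, W, C) map to (out_size, out_size, C)."""
    r1, c1, r2, c2 = roi
    rows = np.linspace(r1, r2, out_size + 1).astype(int)    # bin edges along the rows
    cols = np.linspace(c1, c2, out_size + 1).astype(int)    # bin edges along the columns
    pooled = np.zeros((out_size, out_size, feature_map.shape[2]))
    for i in range(out_size):
        for j in range(out_size):
            r_end = max(rows[i + 1], rows[i] + 1)           # every bin covers at least one cell
            c_end = max(cols[j + 1], cols[j] + 1)
            pooled[i, j] = feature_map[rows[i]:r_end, cols[j]:c_end].max(axis=(0, 1))
    return pooled

feature_map = np.arange(36, dtype=float).reshape(6, 6, 1)   # toy single-channel feature map
pooled_a = roi_max_pool(feature_map, roi=(0, 0, 4, 4), out_size=2)
pooled_b = roi_max_pool(feature_map, roi=(1, 2, 6, 5), out_size=2)
print(pooled_a[..., 0], pooled_b.shape)
```

Both regions yield a 2 x 2 output although they differ in size, at the price of discarding the finer spatial layout inside each bin.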
Although MV3D outperformed the state-of-the-art models by a remarkable margin, it still has a few flaws. First, generating 3D proposals on the BEV assumes that all objects of interest are captured without interference from this viewpoint and by this LiDAR sensor. This assumption does not hold well for small object instances, such as pedestrians and bicyclists, which can be fully occluded by other large objects in the point cloud. Second, the spatial information of small object instances is lost during the down-sampling of feature maps caused by consecutive convolution operations. Third, object-centric fusion combines the feature maps of image and point cloud through ROI-pooling, which spoils fine-grained geometric information in the process. It is also worth noting that redundant proposals lead to repetitive computation in the bbox regression stage. To mitigate these challenges, multiple methods have been put forward to improve MV3D.
To improve the detection of small objects, the Aggregate View Object Detection network (AVOD) [Ku_2018] first improved the proposal stage of MV3D [Chen_2017] with feature maps from both the BEV point cloud and the image. Furthermore, an auto-encoder architecture is employed to up-sample the final feature maps to their original size. This alleviates the problem that small objects might get down-sampled to one ‘pixel’ by consecutive convolution operations. The proposed feature fusion Region Proposal Network (RPN) first extracts equal-length feature vectors from multiple modalities (BEV point cloud and image) with crop and resize operations. This is followed by a convolution operation for feature space dimensionality reduction, which reduces the computational cost and increases speed. Lu et al. also utilized an encoder-decoder based proposal network with a Spatial-Channel Attention (SCA) module and an Extension Spatial Upsample (ESU) module. The SCA module captures multi-scale contextual information, whereas the ESU module recovers the spatial information.
One of the problems of object-centric fusion methods [Ku_2018][Chen_2017] is the loss of fine-grained geometric information during ROI-pooling. ContFuse [Liang_2018_ECCV] by Liang et al. tackles this information loss with point-wise fusion. This point-wise fusion is achieved with continuous convolution [wang2018deep] fusion layers, which bridge image and point cloud features of different scales at multiple stages in the network. To do so, the K nearest neighbour points are first extracted for each pixel in the BEV representation of the point cloud. These points are then projected onto the image plane to retrieve the related image features. Finally, the fused feature vectors are weighted according to their geometric offsets to the target ‘pixel’ before being fed into MLPs. However, point-wise fusion might fail to take full advantage of high-resolution images when the LiDAR points are sparse. In [Liang_2019_CVPR], Liang et al. further extended point-wise fusion by combining multiple fusion methodologies, such as signal-level fusion (RGB-D), feature-level fusion, multi-view and depth completion. In particular, depth completion up-samples the sparse depth map using image information to generate a dense pseudo point cloud. This up-sampling process alleviates the sparse point-wise fusion problem, which facilitates the learning of cross-modality representations. Furthermore, the authors argued that multiple complementary tasks (ground estimation, depth completion and 2D/3D object detection) could help the network achieve better overall performance. However, point-wise/pixel-wise fusion leads to the ‘feature blurring’ problem. This ‘feature blurring’ happens when one point in the point cloud is associated with multiple pixels in the image or the other way around, which confounds the data fusion. Similarly, Wang et al. replaced the ROI-pooling in MV3D [Chen_2017] with sparse non-homogeneous pooling, which enables effective fusion between feature maps from multiple modalities.
The simplest means to combine the voxelized point cloud and image is to append RGB information as additional channels of a voxel. In a 2014 paper by Song et al. [10.1007/978-3-319-10599-4_41], 3D object detection is achieved by sliding a 3D detection window over the voxelized point cloud. The classification is performed by an ensemble of Exemplar-SVMs. In this work, color information is appended to voxels by projection. Song et al. further extended this idea with 3D discrete convolutional neural networks [song2016deep]. In the first stage, the voxelized point cloud (generated from RGB-D data) is processed by a multi-scale 3D RPN for 3D proposal generation. These candidates are then classified by a joint Object Recognition Network (ORN), which takes both the image and the voxelized point cloud as inputs. However, the volumetric representation introduces boundary artifacts and spoils fine-grained local geometry. In addition, the resolution mismatch between the image and the voxelized point cloud makes fusion inefficient.
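As a rough illustration of appending color as extra voxel channels, the sketch below groups already-colored points into a regular grid and stores an occupancy flag plus the mean RGB value per voxel. The function name, the dictionary layout and the mean-color aggregation are assumptions made for this example; they are not taken from the works cited above.

```python
import math
from collections import defaultdict


def voxelize_with_rgb(points_xyzrgb, voxel_size):
    """Group coloured points into voxels.

    points_xyzrgb: iterable of (x, y, z, r, g, b).
    Returns {(ix, iy, iz): (occupancy, mean_r, mean_g, mean_b)} for non-empty voxels.
    """
    buckets = defaultdict(list)
    for x, y, z, r, g, b in points_xyzrgb:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[key].append((r, g, b))

    voxels = {}
    for key, colours in buckets.items():
        n = len(colours)
        mean_rgb = tuple(sum(c[i] for c in colours) / n for i in range(3))
        # Channel 0 is geometry (occupancy); channels 1-3 are the appended colour.
        voxels[key] = (1.0,) + mean_rgb
    return voxels
```

A coarser `voxel_size` merges more points (and more pixels' worth of color) into one cell, which is one concrete form of the resolution mismatch noted above.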
IV-C One-step Models
One-step models perform proposal generation and bbox regression in a single stage. By fusing the proposal and bbox regression stages into one step, these models are often more computationally efficient. This makes them better suited for real-time applications on mobile computational platforms.
MVX-Net [Sindagi_2019], presented by Sindagi et al., introduced two methods that fuse image and point cloud data point-wise or voxel-wise. Both methods employ a pre-trained 2D CNN for image feature extraction and a VoxelNet [Zhou_2018] based network to estimate objects from the fused point cloud. In the point-wise fusion method, the point cloud is first projected to the image feature space to extract image features, before being voxelized and processed by VoxelNet. The voxel-wise fusion method first voxelizes the point cloud and then projects the non-empty voxels to the image feature space for voxel/region-wise feature extraction. These voxel-wise features are only appended to their corresponding voxels at a later stage of VoxelNet. MVX-Net achieved state-of-the-art results and out-performed other LiDAR-based methods on the KITTI benchmark, while lowering the false positive and false negative rates compared to [Zhou_2018].
Meyer et al. [meyer2019sensor] extended LaserNet [Meyer_2019] to a multi-task and multimodal network, performing 3D object detection and 3D semantic segmentation on fused image and LiDAR data. Two CNNs process the depth-image (generated from the point cloud) and the front-view image in parallel, and fuse them by projecting points to the image plane to associate the corresponding image features. This feature map is fed into LaserNet to predict per-point distributions of the bounding box, which are combined into the final 3D proposals. This method is highly efficient while achieving state-of-the-art performance.
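| Methods | Data association | Models | Run-time | MOTA (%) | MOTP (%) | MT (%) | ML (%) | IDS | FRAG |
|---|---|---|---|---|---|---|---|---|---|
| DBT | min-cost flow | DSM [Frossard_2018] | 0.1s | 76.15 | 83.42 | 60.00 | 8.31 | 296 | 868 |
| DBT | min-cost flow | mmMOT [Zhang_2019] | 0.02s | 84.77 | 85.21 | 73.23 | 2.77 | 284 | 753 |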
V 2D/3D Semantic Segmentation
This section reviews existing camera-LiDAR fusion methods for 2D semantic segmentation, 3D semantic segmentation and instance segmentation. 2D/3D semantic segmentation aims to predict per-pixel and per-point class labels, while instance segmentation additionally distinguishes individual instances. Figure 7 presents a timeline of 3D semantic segmentation networks and their corresponding fusion levels.
V-A 2D Semantic Segmentation
V-A1 Feature-level fusion
Jaritz et al. [Jaritz_2018] presented a NASNet [Zoph_2018] based auto-encoder network that can be used for 2D semantic segmentation or depth completion, leveraging the image and sparse depth. The image and the corresponding sparse depth map are processed by two parallel encoders before being fused in the shared decoder. Valada et al. [Valada_2019] employed multi-stage feature-level fusion at varying depths to facilitate semantic segmentation. Caltagirone et al. [Caltagirone_2019] utilized an up-sampled depth-image and the image for 2D semantic segmentation. This dense depth-image is up-sampled from the sparse depth-image (from the point cloud) and image data [inproceedings]. The authors also explored three different fusion methodologies, namely early fusion (signal fusion), late fusion and cross fusion. The best-performing cross-fusion model processes the dense depth-image and image data in two parallel CNN branches with skip-connections in between, and fuses the two feature maps in the final convolution layer.
V-B 3D Semantic Segmentation
V-B1 Feature-level fusion
Dai et al. [Dai_2018] presented 3DMV, a multi-view network for 3D semantic segmentation which fuses image semantics and point features in a voxelized point cloud. Image features are extracted by 2D CNNs from multiple aligned images and projected back to the 3D space. These multi-view image features are max-pooled voxel-wise and fused with 3D geometry before being fed into 3D CNNs for per-voxel semantic prediction. 3DMV out-performed other voxel-based approaches on the ScanNet [Dai_2017] benchmark. However, the performance of voxel-based approaches is determined by the voxel resolution and hindered by voxel boundary artifacts.
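The voxel-wise max-pooling over views can be sketched for a single voxel as follows. This is a simplified illustration under stated assumptions: features are plain lists, a view in which the voxel is not visible is represented by `None`, and voxels seen by no view receive zero-filled image channels. The helper names and the zero-fill convention are hypothetical and not taken from 3DMV.

```python
def max_pool_views(view_features):
    """Element-wise max over the views that observe one voxel.

    view_features: one entry per view; each entry is a feature list,
    or None when the voxel is not visible in that view.
    Returns None if no view observes the voxel.
    """
    visible = [f for f in view_features if f is not None]
    if not visible:
        return None
    return [max(channel) for channel in zip(*visible)]


def fuse_voxel(geometry_features, view_features, num_image_channels):
    """Concatenate 3D geometry features with the pooled multi-view image features."""
    pooled = max_pool_views(view_features)
    if pooled is None:
        # Assumed convention: unobserved voxels get zero image features.
        pooled = [0.0] * num_image_channels
    return list(geometry_features) + pooled
```

Max-pooling is order-invariant, so the fused feature does not depend on how many views are available or in which order they are processed.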
To alleviate the problems caused by point cloud voxelization, Chiang et al. [Chiang_2019] proposed a point-based semantic segmentation framework (UPF) that also enables efficient representation learning of image features, geometrical structures and global context priors. Features of rendered multi-view images are extracted using a semantic segmentation network and projected to 3D space for point-wise feature fusion. This fused point cloud is processed by two PointNet++ [qi2017pointnet] based encoders to extract local and global features, which are then fed into a decoder for per-point semantic label prediction. Similarly, Multi-View PointNet (MVPNet) [jaritz2019multiview] fused multi-view image semantics and 3D geometry to predict per-point semantic labels.
Permutohedral lattice representation is an alternative approach for multimodal data fusion and processing. SParse LATtice Networks (SPLATNet) from Su et al. [Su_2018] employed sparse bilateral convolution to achieve spatial-aware representation learning and multimodal (image and point cloud) reasoning. In this approach, point cloud features are interpolated onto a permutohedral lattice, where bilateral convolution is applied. The results are interpolated back onto the point cloud. Image features are extracted from multi-view images using a CNN and projected to the 3D lattice space to be combined with 3D features. This fused feature map is further processed by a CNN to predict the per-point label.
V-C Instance Segmentation
In essence, instance segmentation aims to perform semantic segmentation and object detection jointly. It extends the semantic segmentation task by discriminating between individual instances within a class, which makes it more challenging.
V-C1 Proposal based
Hou et al. presented 3D-SIS [Hou_2019], a two-stage 3D CNN that performs voxel-wise 3D instance segmentation on multi-view images and RGB-D scan data. In the 3D detection stage, multi-view image features are extracted and down-sampled using an ENet [paszke2016enet] based network. This down-sampling process tackles the mismatch between a high-resolution image feature map and a low-resolution voxelized point cloud feature map. These down-sampled image feature maps are projected back to the 3D voxel space and appended to the corresponding 3D geometry features. This fused feature map is fed into a 3D CNN to predict object classes and 3D bbox poses. In the 3D mask stage, a 3D CNN takes the image features, point cloud features and 3D object detection results to predict per-voxel instance labels.
Narita et al. [Narita_2019] extended 2D panoptic segmentation to perform scene reconstruction, 3D semantic segmentation and 3D instance segmentation jointly on RGB images and depth images. This approach takes RGB and depth frames as inputs for the instance and 2D semantic segmentation networks. To track labels between frames, the frame-wise predicted panoptic annotations and the corresponding depth are associated with and integrated into the volumetric map. In the final step, a fully connected conditional random field (CRF) is employed to fine-tune the outputs. However, this approach does not support dynamic scenes and is vulnerable to long-term pose drift.
V-C2 Proposal-free based
Elich et al. [Elich_2019] presented 3D-BEVIS, a framework that performs 3D semantic and instance segmentation jointly by clustering points aggregated with 2D semantics. 3D-BEVIS first extracts a global semantic score map and an instance feature map from the 2D BEV representation (RGB and height-above-ground). These two semantic maps are propagated to the points using a graph neural network. Finally, the mean shift algorithm [comaniciu2002mean] uses these semantic features to cluster points into instances. This approach is mainly constrained by its dependence on semantic features from the BEV, which can suffer from occlusions inherent to the top-down view.
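To make the final clustering step concrete, the following is a minimal flat-kernel mean shift over per-point feature vectors. It is a didactic sketch rather than the 3D-BEVIS implementation: the `bandwidth`, the iteration limit and the rule for merging converged modes (within half a bandwidth) are assumptions chosen for simplicity.

```python
import math


def mean_shift(features, bandwidth, iterations=50, tol=1e-6):
    """Flat-kernel mean shift; returns one cluster label per feature vector."""
    modes = [list(f) for f in features]
    for _ in range(iterations):
        moved = 0.0
        new_modes = []
        for m in modes:
            # All original features inside the window around the current mode.
            neighbours = [f for f in features if math.dist(m, f) <= bandwidth]
            if not neighbours:
                new_modes.append(m)
                continue
            centre = [sum(c) / len(neighbours) for c in zip(*neighbours)]
            moved = max(moved, math.dist(m, centre))
            new_modes.append(centre)
        modes = new_modes
        if moved < tol:
            break

    # Modes that converged close to each other form one instance.
    labels, centres = [], []
    for m in modes:
        for idx, c in enumerate(centres):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels
```

Unlike k-means, the number of instances is not fixed in advance; it follows from the bandwidth, which is why the quality of the learned instance features matters so much for this family of methods.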
VI 3D Object Tracking
Multi-object tracking (MOT) is indispensable for the decision making of autonomous vehicles. To this end, this section reviews camera-LiDAR fusion based object tracking methods and compares their performance on the KITTI multi-object tracking benchmark (car) [Geiger2012CVPR] in Table III.
VI-A Detection-Based Tracking (DBT)
The DBT or tracking-by-detection framework consists of two stages. In the first stage, objects of interest are detected. The second stage associates these objects over time and links them into trajectories, a problem that is formulated as a linear program. Frossard et al. [Frossard_2018] presented an end-to-end trainable tracking-by-detection framework comprised of multiple independent networks that leverage both image and point cloud. This framework performs object detection, proposal matching and scoring, and linear optimization consecutively. To achieve end-to-end learning, detection and matching are formulated via a deep structured model (DSM). Zhang et al. [Zhang_2019] presented a sensor-agnostic framework, which employs a loss-coupling scheme for image and point cloud fusion. Similar to [Frossard_2018], the framework consists of three stages: object detection, adjacency estimation and linear optimization. In the object detection stage, image and point cloud features are extracted via a VGG-16 [Simonyan15] and a PointNet [qi2017pointnet] in parallel and fused via a robust fusion module. The robust fusion module is designed to work with both single-modal and multi-modal inputs. The adjacency estimation stage extends min-cost flow to multi-modality via adjacency matrix learning. Finally, an optimal path is computed from the min-cost flow graph.
Tracking and 3D reconstruction tasks can be performed jointly. Extending this idea, Luiten et al. [Luiten_2020] leveraged 3D reconstruction to improve tracking, making tracking robust against complete occlusion. The proposed MOTSFusion consists of two stages. In the first stage, detected objects are associated into spatio-temporal tracklets. In the second stage, these tracklets are matched and merged into trajectories using the Hungarian algorithm. Furthermore, MOTSFusion can work with LiDAR, mono and stereo depth.
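The matching step solves a linear assignment problem: given a cost for pairing each tracklet with each candidate continuation, find the one-to-one pairing with minimum total cost. The sketch below solves this by exhaustive search, which returns the same optimum as the Hungarian algorithm but scales factorially, so it is only meant for tiny illustrative inputs. The cost matrix and function name are hypothetical.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Optimal one-to-one assignment for a small square cost matrix.

    cost[i][j] is the cost of linking tracklet i to candidate j.
    Returns (assignment, total) where assignment[i] is the column chosen for row i.
    Exhaustive search: a stand-in for the Hungarian algorithm on toy inputs only.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total
```

In practice the cost would typically encode motion or appearance consistency between tracklets, and pairs whose cost exceeds a gating threshold would be left unmatched.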
VI-B Detection-Free Tracking (DFT)
In DFT, objects are manually initialized. Complexer-YOLO [simon2019complexeryolo] is a real-time framework for decoupled 3D object detection and tracking on image and point cloud data. In the 3D object detection phase, 2D semantics are extracted and fused point-wise to the point cloud. This semantic point cloud is voxelized and fed into a 3D Complex-YOLO for 3D object detection. To speed up the training process, IoU is replaced by a novel metric called the Scale-Rotation-Translation score (SRTs), which evaluates 3 DoFs of the bounding box position. Multi-object tracking is decoupled from the detection, and inference is achieved via a Labeled Multi-Bernoulli Random Finite Sets Filter (LMB RFS).
VII Trends, Open Challenges and Promising Directions
The perception module in a driverless car is responsible for obtaining and understanding its surrounding scenes. Its down-stream modules, such as planning, decision making and self-localization, depend on its outputs. Therefore, its performance and reliability are prerequisites for the competence of the entire driverless system. To this end, LiDAR and camera fusion is applied to improve the performance and reliability of the perception system, making driverless vehicles more capable of understanding complex scenes (e.g. urban traffic, extreme weather conditions and so on). Consequently, in this section, we summarize overall trends and discuss open challenges and potential influencing factors in this regard. As shown in Table IV, we focus on improving the performance of the fusion methodology and the robustness of the fusion pipeline; other nontrivial topics related to engineering practice are also discussed.
From the methods reviewed above, we observed some general trends among the image and point cloud fusion approaches, which are summarized as follows:
2D to 3D: With the progress of 3D feature extraction methods, locating, tracking and segmenting objects in 3D space has become an active area of research.
Single-task to multi-task: Some recent works [Liang_2019_CVPR] [simon2019complexeryolo] combine multiple complementary tasks, such as object detection, semantic segmentation and depth completion, to achieve better overall performance and reduce computational costs.
Signal-level to multi-level fusion: Early works often leverage signal-level fusion, where 3D geometry is translated to the image plane to leverage off-the-shelf image processing models, while recent models try to fuse image and LiDAR data at multiple levels (e.g. early fusion, late fusion) and to encode temporal context.
[Table IV: open challenges, including Encode Geometry Constraint, Encode Temporal Context, and Adversarial Attacks and Corner Cases]
VII-A Performance-related Open Research Questions
VII-A1 Feature/Signal Representation of Fused Data
The feature/signal representation of fused data plays a fundamental role in the design of any data fusion algorithm. Current feature/signal representations include:
Appending 3D geometry as additional channels of the image. This is commonly found among early signal-level works, as the result can be processed by off-the-shelf image processing models. However, the results are also limited to the 2D image plane, which is less ideal for autonomous driving.
Appending RGB signal/features as additional channels of the point cloud. This can be achieved by projecting points to the image plane for pixel-point association. Nevertheless, the resolution mismatch between high-resolution images and the low-resolution point cloud leads to inefficiency.
Translating image and point cloud features/signals into an intermediate data representation. Current intermediate data representations include the voxelized point cloud [10.1007/978-3-319-10599-4_41] and the lattice [Su_2018]. Future research could explore other novel intermediate data structures, such as graphs and trees, which could lead to better performance.
VII-A2 Encoding Geometric Constraint
Compared with other sources of depth data, such as RGB-D data from stereo or structured light, LiDAR has a longer range and higher accuracy, which provides detailed and accurate 3D geometry. Geometric constraints have become common practice in the fusion pipeline of image and point cloud, as they provide extra information to guide the network to better performance.
Projecting the point cloud to the image plane in the form of an RGB-D image seems the most natural workaround for the point cloud’s unordered data format, but its sparsity causes empty holes. Depth completion and point cloud up-sampling can handle this problem to some extent. On the other hand, monocular depth prediction methods introduce self-supervised learning between consecutive frames, which may ease the situation. However, how to encode this geometry into the fusion pipeline remains to be explored.
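The ’empty holes’ of a projected depth channel are easy to see in a small sketch. The code below rasterises already-projected LiDAR returns into a depth image and measures how many pixels actually receive a value. The input format `(row, col, depth)`, the use of `0.0` as the "no return" marker, and the keep-nearest rule for collisions are assumptions made for this example.

```python
def sparse_depth_map(pixels_with_depth, height, width):
    """Rasterise projected LiDAR points into a depth image.

    pixels_with_depth: iterable of (row, col, depth) with depth in metres.
    Pixels without any return keep the value 0.0. When several points
    fall on the same pixel, the nearest one is kept.
    """
    depth = [[0.0] * width for _ in range(height)]
    for row, col, d in pixels_with_depth:
        if 0 <= row < height and 0 <= col < width and d > 0.0:
            if depth[row][col] == 0.0 or d < depth[row][col]:
                depth[row][col] = d
    return depth


def fill_ratio(depth):
    """Fraction of pixels that carry a valid depth value."""
    total = sum(len(r) for r in depth)
    filled = sum(1 for r in depth for v in r if v > 0.0)
    return filled / total
```

A low `fill_ratio` on such a map is what depth completion and point cloud up-sampling attempt to remedy before the depth channel is fused with the RGB channels.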
VII-A3 Encoding Temporal Context
Some problems found in real-world practice also greatly hinder the deployment of driverless cars, such as time synchronization between LiDAR and camera, point cloud deformation caused by the low refresh rate, and LiDAR ranging errors. These problems cause a mismatch between the image and the point cloud, and between the point cloud and the actual environmental distance. Based on the experience in depth completion, the temporal context between consecutive frames can be adopted to improve the pose estimate, which should improve feature fusion performance and benefit later task-related header networks. In the context of autonomous driving, it is vital to estimate the motion state of surrounding vehicles accurately, and temporal context should help produce smoother and more stable results. Furthermore, temporal context could benefit online self-calibration, which remains an open question. Therefore, more research on encoding temporal context should be encouraged.
VII-A4 Network Architecture Design
To design a fusion network architecture, we first need to find the best network architecture for point cloud processing. For image processing, it is widely accepted that CNNs are the best choice. However, this is not the case for point cloud processing, which remains an open research question. We have discussed the current point cloud processing architectures in Section II and presented several very different approaches to point cloud processing.
However, there are no widely accepted or proven network design principles for fusion networks. Most current fusion networks are based on their image processing counterparts or on empirical or experimental results. Therefore, methods that employ Neural Architecture Search (NAS) [liu2018progressive] could prove to be effective.
VII-A5 Unsupervised or weakly-supervised Learning Framework
Annotating images and point clouds is expensive and time consuming, which limits the size of current multi-modal datasets. Research on unsupervised and weakly-supervised fusion frameworks could allow networks to be trained on large unlabeled or coarsely labeled datasets and lead to better performance.
VII-B Reliability-related Open Research Questions
VII-B1 Sensor-agnostic Framework
From an engineering perspective, redundancy design in an autonomous vehicle is crucial for its safety. Although fusing LiDAR and camera improves perception performance, it also comes with the problem of signal coupling. If one of the signal paths suddenly fails, the whole pipeline could break down and cripple down-stream modules. This is unacceptable for autonomous driving systems, which require robust perception pipelines. To achieve this robustness, we can adopt multiple fusion modules with different sensor inputs, or a multi-path fusion module for asynchronous multi-modal data. However, the best solution is still open for study.
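One way to picture a fusion module that tolerates a failed signal path is the toy function below, which fuses whichever modalities are currently available and only fails when none is. The averaging rule and the use of `None` to signal a missing sensor are assumptions made purely for illustration; learned robust fusion modules, such as the one in [Zhang_2019], are considerably more elaborate.

```python
def robust_fuse(image_feature, lidar_feature):
    """Fuse two equal-length feature vectors, tolerating one missing modality.

    A missing modality is passed as None. With both present, the features are
    averaged element-wise; with one present, it is passed through unchanged.
    """
    available = [f for f in (image_feature, lidar_feature) if f is not None]
    if not available:
        raise ValueError("no sensor input available")
    return [sum(vals) / len(available) for vals in zip(*available)]
```

Because the output has the same length in every case, down-stream modules keep receiving a valid input when one sensor drops out, at the price of reduced information.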
VII-B2 All-weather/Lighting Ability
Autonomous vehicles need to work in all weather and lighting conditions. However, current datasets and methods mostly focus on scenes with good lighting and weather conditions. This leads to poor performance in the real world, where illumination and weather conditions are more complex. Therefore, datasets that contain a wide range of lighting and weather conditions would be beneficial. In addition, methods that employ multi-modal data to tackle complex lighting and weather conditions require further investigation.
VII-B3 Adversarial Attacks and Corner Cases
Adversarial examples targeted at camera-based perception systems have proven effective. This poses a grave danger for autonomous vehicles, as they operate in safety-critical environments. In this context, research on utilizing LiDAR’s accurate 3D geometry together with images to jointly identify these attacks can be further explored.
As self-driving cars operate in an unpredictable open environment with infinite possibilities, it is critical to consider corner and edge cases in the design of the perception pipeline. The perception system should anticipate unseen and unusual obstacles, strange behaviors and extreme weather. Examples include the image of a cyclist printed on a large vehicle and people wearing costumes. Leveraging multi-modal data to identify these corner cases could prove more effective and reliable than relying on a single-modal sensor. Further research in this direction could greatly benefit the safety and commercialization of autonomous driving technology.
VII-C Engineering-related Open Research Questions
VII-C1 Online Self-calibration
One of the preconditions and assumptions of all camera-LiDAR fusion pipelines is a flawless calibration between the camera and the LiDAR, which includes the camera intrinsic parameters and the extrinsic parameters between the camera and the LiDAR. In reality, however, this is rarely the case. Even when the camera and LiDAR are perfectly calibrated, their calibration parameters change over time due to vibration, heat, etc. As most fusion methods are extremely sensitive to calibration errors, this could significantly cripple their performance and reliability. Furthermore, calibration is mostly carried out offline, so the need to constantly update calibration parameters is troublesome and impractical. However, this problem receives relatively little attention, as it is less obvious in published datasets. Nevertheless, it is necessary to develop methods for online self-calibration of the camera and LiDAR. Recent works have employed motion-guided [castorena2018motion] and targetless  self-calibration. It would be interesting to see more research in this important direction.
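The sensitivity to calibration errors can be made tangible with a back-of-the-envelope calculation under a pinhole camera model. For a point on the optical axis, a yaw error of Δθ in the extrinsic rotation shifts its projection by f·tan(Δθ) pixels regardless of depth, while a lateral translation error of Δt shifts it by f·Δt/z pixels at depth z. The focal length and error magnitudes used below are hypothetical values chosen only to illustrate the scale of the effect.

```python
import math


def pixel_shift_from_yaw_error(focal_px, yaw_error_deg):
    """Horizontal reprojection shift (pixels) of an on-axis point under a yaw error."""
    return focal_px * math.tan(math.radians(yaw_error_deg))


def pixel_shift_from_lateral_offset(focal_px, offset_m, depth_m):
    """Horizontal reprojection shift (pixels) caused by a lateral translation error."""
    return focal_px * offset_m / depth_m


# Hypothetical setup: 1000 px focal length, 0.5 degree yaw drift, 2 cm offset at 10 m.
yaw_shift = pixel_shift_from_yaw_error(1000.0, 0.5)
offset_shift = pixel_shift_from_lateral_offset(1000.0, 0.02, 10.0)
```

Under these assumed values the rotational drift alone displaces the projection by several pixels, which is enough to associate a LiDAR point with the wrong image feature near object boundaries.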
VII-C2 Time synchronization
Knowing the exact time at which data frames are captured by multiple sensors is critical for real-time sensor fusion, as it directly affects the fusion result. However, it is difficult to guarantee time synchronization in practice. To begin with, the LiDAR and camera have different refresh rates, and each sensor has its own time source. Furthermore, uncontrollable time delays may occur in many parts of the system (networking, exposure time, etc.). Fortunately, there are several ways to alleviate this problem. Increasing the sensor refresh rate to reduce time deviation is a natural idea. It is also common practice to use a GPS PPS time source to keep different sensors synchronized with the host machine. More specifically, the host machine sends timestamp synchronization requests to each sensor to keep all sensors on the same timeline. Some of the most recent sensors can be triggered by an external signal, which enables dedicated hardware with an oscillator to provide precise timestamps. This allows different sensors to be triggered at almost the same time, which facilitates the synchronization between the LiDAR and panoramic cameras. Further research on innovative software and hardware based solutions should be encouraged.
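On the software side, a common fallback when hardware triggering is unavailable is to pair frames by nearest timestamp and discard pairs that are too far apart. The sketch below assumes that both sensors already report timestamps on a shared clock (in seconds) and that the camera timestamps are sorted; the function name and the `max_offset` tolerance are illustrative assumptions.

```python
from bisect import bisect_left


def match_nearest(lidar_stamps, camera_stamps, max_offset):
    """Pair each LiDAR timestamp with the nearest camera timestamp.

    camera_stamps must be sorted in ascending order. Pairs whose time
    difference exceeds max_offset (seconds) are dropped.
    """
    pairs = []
    for t in lidar_stamps:
        i = bisect_left(camera_stamps, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = camera_stamps[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda c: abs(c - t))
        if abs(nearest - t) <= max_offset:
            pairs.append((t, nearest))
    return pairs
```

The residual offset within each accepted pair still translates into a spatial mismatch for moving objects, so a tight `max_offset` trades fewer fused frames for better alignment.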
This paper presented an in-depth review of the most recent progress on deep learning models for point cloud and image fusion in the context of autonomous driving. Specifically, this review organizes methods based on their fusion methodologies and covers topics in depth completion, object detection, semantic segmentation and tracking. Furthermore, performance comparisons on publicly available datasets are presented in both tables and scatter plots. Finally, we summarized general trends and discussed open challenges and possible future directions. This survey also raised awareness of, and provided insights into, questions that are overlooked by the research community but trouble the real-world deployment of autonomous driving technology.