With the development of sensors and environmental perception technology, how to integrate multi-view and multi-modality data through effective fusion to achieve accurate detection, classification and positioning has gained increasing attention in the field of autonomous driving.
Autonomous driving systems are generally equipped with sensors such as cameras, LiDAR, millimeter-wave radar and ultrasonic radar. The camera usually provides richer information, which is an advantage in object classification, size estimation and angular positioning, but its velocity measurement and ranging accuracy are comparatively limited and can be further degraded by environmental factors such as light and weather. In contrast, radar systems have better ranging performance and are insensitive to environmental changes, but they provide relatively sparse information that is insufficient for sensing object size and category. Each single-modality sensor therefore has inherent disadvantages and cannot handle the complex environmental changes encountered while the system is running, so it is necessary to fuse multi-modality sensors to obtain accurate and robust perception.
Multi-modality sensor fusion can be divided into decision-level, feature-level and data-level fusion according to the input data. Decision-level fusion takes the perception result of each sensor as input and improves the reliability of the final decision by mutually verifying the decision results of multiple sensors. In contrast, feature-level and data-level fusion take abstracted multi-level features and raw sensor data as input, respectively, obtaining richer information at the cost of a more difficult fusion. In recent years, many decision-level or feature-level 3D detection, segmentation and tracking methods have been proposed [15, 14, 19, 12, 25, 26], achieving competitive results in a relatively simple form through an end-to-end data-driven approach. However, a single-level fusion method often cannot fully exploit the effective information of each sensor, and although the end-to-end approach simplifies the fusion process, it is not easy to interpret, which makes problems difficult to locate and analyze when they occur.
In this paper, we propose a general multi-modality cascaded fusion framework that combines decision-level and feature-level fusion and can be extended to arbitrary fusion of camera, LiDAR and radar with great interpretability. More specifically, the fusion process runs alongside a multi-object tracking algorithm, so that fusion and association facilitate each other. Firstly, intra-frame fusion is performed on the data from the different sensors. It uses a hierarchical method to associate targets of different confidence into a main-sub format and enhances the fusion result through dynamic coordinate alignment of the multi-modality sensors. Secondly, inter-frame fusion is executed once intra-frame fusion is completed; it extends the tracking lists by matching targets in adjacent frames and can handle single-sensor failure. Finally, we propose an affinity loss that boosts the performance of the deep affinity networks, which are used in intra-frame and inter-frame fusion in place of hand-crafted strategies, and thereby improves the association results. Extensive experiments conducted on the NuScenes dataset demonstrate the effectiveness of the proposed framework.
In summary, the contributions of this paper are as follows:
We propose a multi-modality cascaded fusion framework, which makes full use of multi-modality sensor information, improving robustness and accuracy with great interpretability.
We propose a dynamic coordinate alignment method of sensors from different modalities that reduces the problems of heterogeneous data fusion.
We propose an affinity loss used in training deep affinity estimation networks, which improves the performance of multi-modality data association.
2 Related Work
2.1 Multi-Modality Fusion
In autonomous driving, thanks to the rise of deep learning, not only has single-sensor research, such as camera or LiDAR, made great progress [10, 13, 11, 18], but multi-modality fusion has also gained increasing attention [12, 26, 3, 21]. These methods usually adopt decision-level or feature-level fusion. One proposes a multi-task network that integrates vision and LiDAR information to achieve ground estimation, 3D detection, 2D detection and depth estimation simultaneously. Another adopts PointNet and ResNet to extract point cloud and image features, respectively, and then fuses them to obtain 3D object bounding boxes. A third collects the RGB image together with the front view and top view of the LiDAR and obtains 3D bounding boxes by feature-level fusion. A fourth proposes a real-time moving target detection network, which captures the motion information of vision and LiDAR and exploits the fact that LiDAR is not affected by light to compensate for camera failure in low-light conditions. The framework proposed in this paper combines decision-level and feature-level fusion and achieves robust and effective multi-modality fusion.
2.2 Multi-Object Tracking
Multi-object tracking is a core task in autonomous driving, because the control and decision-making of the vehicle depend on the surrounding environment, such as the movement of pedestrians and other cars. Many multi-object tracking algorithms consider only visual input and focus on the task itself, for example by developing different multi-object tracking frameworks, solving the association problem by graph optimization, or replacing the traditional association process with an end-to-end method. However, vision-based multi-object tracking can easily fail under extreme lighting conditions and cannot obtain accurate target distance and velocity. [6, 16] use more information than a single RGB image and obtain accurate inter-frame transformations in the tracking process. A multi-modality multi-object tracking framework has also been proposed that obtains the tracking lists by directly fusing the vision and LiDAR detection results in a deep network, but the fusion between vision and LiDAR is simple and lacks mutual promotion between the sensors in the association process. In contrast, the framework proposed in this paper forms a main-sub format that preserves the multi-modality sensor information, which reinforces the hierarchical association of multi-modality data and is more robust to single-sensor failure.
In this paper, we propose a multi-modality cascaded fusion framework that exploits the advantages of decision-level and feature-level fusion, achieving accuracy and stability while keeping interpretability.
As shown in Figure 1, the proposed cascaded framework is composed of two parts, namely, the intra-frame fusion and the inter-frame fusion. Firstly, the intra-frame fusion module fuses the vision and radar detections within each frame, producing fused detections in the main-sub format. Secondly, the inter-frame fusion performs association between the tracklets from the previous time step and the fused detections at the current time step, and generates accurate object trajectories.
In intra-frame fusion, the local and the global association are progressive not only in order but also in function. By using dynamic coordinate alignment, multi-modality data can be mapped into the same coordinate system, yielding a more reliable feature similarity evaluation. In addition, because the calculation of similarity is the core of data association, a deep affinity network is designed to realize accurate similarity evaluation throughout the intra-frame and inter-frame fusion.
3.1 Intra-Frame Fusion
The intra-frame fusion is the first step of the proposed cascaded fusion framework, which aims at fusing the multi-modality detections within each frame. After intra-frame fusion, we obtain a series of fused detections in the main-sub format, which contain information from one or more modalities. Specifically, it is performed in two sequential steps: the local association and the global association.
Consider the vision detections and the radar detections at a given time. All of them are divided into high-confidence and low-confidence sets according to the detection confidence and a corresponding threshold for each sensor. During the intra-frame fusion, the high-confidence vision and radar detections are collected and sent into the local association. The association cost matrix between the high-confidence vision and radar detections is computed pairwise: the common features extracted for each vision or radar object, including range, angle, velocity and confidence, are fused for each vision-radar pair, and the fused feature vector is passed to the deep network for similarity evaluation, which is introduced in Section 3.3.
By applying the Hungarian algorithm to the cost matrix, we acquire an assignment matrix with elements 0 and 1. For each vision-radar pair that is assigned and whose similarity is greater than a threshold, the corresponding vision and radar detections are matched, forming a main-sub (vision-radar) detection. The remaining high-confidence vision and radar detections are then added to the low-confidence vision and radar sets, respectively, and compete in the subsequent association.
Note that the local association is generally reliable because of the high object confidence; hence we use its matching results to dynamically align the sensor coordinates, which reduces the mapping error of heterogeneous data and facilitates the subsequent fusion.
To realize further fusion, the global association is conducted after the local association. It considers all the low-confidence vision and radar detections, as well as the unassigned high-confidence ones. In addition, because of the limited azimuth resolution of the antenna, the radar sensor is often unable to distinguish different targets in dense scenes. To cope with this, we adjust the assignment strategy in the global association to allow one-to-many assignment, i.e., a low-confidence radar detection can match multiple vision detections when the similarities are high enough. The cost matrix is calculated in the same way as in the local association.
After intra-frame fusion, we obtain three types of fusion results, namely fused vision-radar detections, vision-only detections and radar-only detections, as the input of the following inter-frame fusion.
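The two-stage association described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of a detection, and the `affinity` callback are hypothetical; a brute-force search stands in for the Hungarian algorithm; the one-to-many rule (each leftover vision detection takes its best leftover radar detection if the similarity exceeds the threshold) is an assumed reading of the text; and the dynamic coordinate alignment between the two stages is omitted.

```python
from itertools import permutations


def best_assignment(sim):
    """Brute-force one-to-one assignment maximizing total similarity.

    Stand-in for the Hungarian algorithm; only practical for small matrices.
    Returns a list of (row, col) index pairs.
    """
    n_rows = len(sim)
    n_cols = len(sim[0]) if n_rows else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows > n_cols:
        # Transpose so that rows are the smaller side, then swap indices back.
        transposed = [[sim[r][c] for r in range(n_rows)] for c in range(n_cols)]
        return [(r, c) for c, r in best_assignment(transposed)]
    best, best_score = None, float("-inf")
    for cols in permutations(range(n_cols), n_rows):
        score = sum(sim[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best, best_score = cols, score
    return list(enumerate(best))


def intra_frame_fusion(vision, radar, affinity, conf_thr_v, conf_thr_r, sim_thr):
    """vision / radar: lists of dicts with a 'conf' key; affinity(v, r) -> [0, 1]."""
    hi_v = [i for i, v in enumerate(vision) if v["conf"] >= conf_thr_v]
    hi_r = [j for j, r in enumerate(radar) if r["conf"] >= conf_thr_r]

    # Local association: one-to-one matching among high-confidence detections.
    sim = [[affinity(vision[i], radar[j]) for j in hi_r] for i in hi_v]
    pairs = [(hi_v[a], hi_r[b]) for a, b in best_assignment(sim) if sim[a][b] > sim_thr]

    # Global association: low-confidence and unassigned detections compete,
    # and one radar detection may be shared by several vision detections.
    matched_v = {i for i, _ in pairs}
    matched_r = {j for _, j in pairs}
    rest_v = [i for i in range(len(vision)) if i not in matched_v]
    rest_r = [j for j in range(len(radar)) if j not in matched_r]
    for i in rest_v:
        scored = [(affinity(vision[i], radar[j]), j) for j in rest_r]
        if scored:
            score, j = max(scored)
            if score > sim_thr:
                pairs.append((i, j))

    fused_v = {i for i, _ in pairs}
    fused_r = {j for _, j in pairs}
    vision_only = [i for i in range(len(vision)) if i not in fused_v]
    radar_only = [j for j in range(len(radar)) if j not in fused_r]
    return pairs, vision_only, radar_only
```

The three returned lists correspond to fused pairs, vision-only detections and radar-only detections, expressed as indices into the input lists.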
3.2 Dynamic Coordinate Alignment
Generally, the similarity computation between heterogeneous sensor data relies on their common features. For example, these can be the object range, velocity, angle and confidence in a specifically defined geometric space. However, because the perception principles differ, although the geometric relationship between the camera and the radar sensor can be calibrated in advance, it inevitably changes during driving. This leads to non-equivalent common features, e.g., the object range, and ultimately degrades the association results. Therefore, a dynamic sensor coordinate alignment is critical.
For a vision system with a monocular camera, the range feature can be extracted by two conventional means: size ranging and trigonometric ranging. Size ranging uses the proportion between the image size and the real physical size according to the pinhole camera geometry. This method relies on an accurate a priori object size, which is usually hard to obtain in practice. Trigonometric ranging assumes that the objects of interest lie in the same plane as the road surface, and its performance mainly depends on the accuracy of the given vanishing horizontal line.
In order to realize dynamic coordinate alignment, we propose a vanishing horizontal line compensation approach by using the radar ranging information of local association results or the historical image ranging results. The trigonometric ranging model is shown in Figure 2.
Assume that the road is planar and that the camera optical axis is parallel to the road surface up to the camera pitch angle. The object range can then be calculated from the camera height, the pitch angle, and the angle between the optical axis and the line formed by the optical center and the ranging point.
The latter angle is obtained from the image rows, in pixels, of the object's bottom center and of the optical center, after which the distance follows from Equation 2. However, the pitch angle might change during driving because the road is not always flat, which leads to an inaccurate range. It is worth noting that radar ranging is generally accurate and stable owing to the sensor characteristics, so it can be used to compensate the camera pitch angle. Specifically, given the radar range of a reliable pair from the local association, the ranging relation can be solved for the pitch angle.
Hence, by using multiple local association pairs, the camera pitch angle can be updated immediately, improving the performance of the global association.
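A minimal sketch of the trigonometric ranging model and the radar-based pitch compensation follows. Several points are assumptions made for illustration: the flat-road relation range = height / tan(pitch + ray angle), the sign convention (pitch positive when the camera looks down, image rows increasing downward), the use of a focal length in pixels to convert rows to an angle, and averaging as the rule for combining several matched pairs. All function and parameter names are hypothetical.

```python
import math


def pixel_angle(y_bottom, y_center, focal_px):
    """Angle between the optical axis and the ray through the object's bottom-center row."""
    return math.atan2(y_bottom - y_center, focal_px)


def range_from_pitch(cam_height, pitch, alpha):
    """Flat-road trigonometric ranging: d = h / tan(pitch + alpha)."""
    return cam_height / math.tan(pitch + alpha)


def pitch_from_radar(cam_height, radar_range, alpha):
    """Invert the ranging relation for one matched vision-radar pair."""
    return math.atan2(cam_height, radar_range) - alpha


def align_pitch(cam_height, pairs, focal_px, y_center):
    """Average the per-pair pitch estimates; pairs are (y_bottom, radar_range) tuples."""
    estimates = [
        pitch_from_radar(cam_height, d, pixel_angle(y, y_center, focal_px))
        for y, d in pairs
    ]
    return sum(estimates) / len(estimates)
```

With the compensated pitch from `align_pitch`, `range_from_pitch` can be re-evaluated for the vision detections that have no radar match, so that their range feature is consistent with the radar coordinate before the global association.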
3.3 Deep Affinity Network
As the core of the association, the deep similarity computation network proposed in this paper is shown in Figure 3.
For convenience, consider the outputs of two sensors, each containing one feature vector per detection; the extracted feature vectors have the same dimension for both sensors, while the number of detections may differ. The feature vectors used in this paper contain the range, angle, velocity, size, confidence and even abstract features extracted from the raw image and the radar power map. The two inputs are converted by broadcasting into a feature map with one fused feature vector for every pair of detections from the two sensors. Unlike traditional similarity computation by Euclidean or cosine distance, we instead use a deep affinity network. Reshaping the fused feature vectors as the input, we train a multi-layer fully connected network to predict the final affinity matrix.
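The broadcast and the forward pass of such a network can be sketched in plain Python as below. This is only an illustration under stated assumptions: fusing a pair by concatenating the two feature vectors, a ReLU hidden activation, random initial weights, and every function name are hypothetical; the two fully connected layers with a sigmoid output follow the configuration reported in the experiments.

```python
import math
import random


def pairwise_features(feats_a, feats_b):
    """Broadcast N x K and M x K feature lists to an N x M grid of 2K-dim vectors.

    Concatenation is an assumed fusion operator.
    """
    return [[list(fa) + list(fb) for fb in feats_b] for fa in feats_a]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_params(in_dim, hidden_dim, seed=0):
    """Random weights for a two-layer network (illustrative initialization)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]]
    return w1, [0.0] * hidden_dim, w2, [0.0]


def affinity_network(pair_grid, params):
    """Map every fused pair vector to an affinity in [0, 1]."""
    w1, b1, w2, b2 = params
    affinity = []
    for row in pair_grid:
        out_row = []
        for x in row:
            hidden = [max(0.0, h) for h in dense(x, w1, b1)]  # ReLU (assumed)
            logit = dense(hidden, w2, b2)[0]
            out_row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid output
        affinity.append(out_row)
    return affinity
```

The result is an N x M affinity matrix whose entry (i, j) scores the i-th detection of one sensor against the j-th detection of the other.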
The label of the affinity matrix is defined as follows:
We design two loss functions for training the network: a mask loss and an affinity loss.
Mask loss. The mask loss is simple and easy to understand because the cost is directly compared with the label of the affinity matrix. It is defined in Equation 6.
Affinity loss. In practice, matching pairs are obtained with the Hungarian algorithm, so the network output does not have to fit the label exactly, which is often difficult. Therefore, we only need to train the network to meet the following condition:
In other words, if a detection from one sensor and a detection from the other sensor are the same object, the corresponding entry should be the maximum of its row and its column of the affinity matrix.
Based on the above analysis, we propose an affinity loss function as Equation 8:
Here, a margin separates the positive and negative samples. The affinity loss encourages the network to fit better when the margin is larger. Compared with the mask loss, the affinity loss makes the network converge better and faster.
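The condition above, that a matched entry should dominate its row and its column, can be turned into a margin-based penalty. The sketch below is one plausible form and should not be read as the paper's Equation 8: the binary label definition (1 for the same object, 0 otherwise), the hinge formulation, the averaging, the default margin value and the function names are all assumptions.

```python
def affinity_labels(ids_a, ids_b):
    """Label matrix: 1.0 where two detections share an object id, else 0.0 (assumed form)."""
    return [[1.0 if a == b else 0.0 for b in ids_b] for a in ids_a]


def margin_affinity_loss(affinity, labels, margin=0.2):
    """Hinge penalty whenever a positive entry fails to beat another entry
    in its row or column by at least `margin`."""
    n_rows, n_cols = len(affinity), len(affinity[0])
    loss, count = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            if labels[i][j] != 1.0:
                continue
            positive = affinity[i][j]
            rivals = [affinity[i][k] for k in range(n_cols) if k != j]
            rivals += [affinity[k][j] for k in range(n_rows) if k != i]
            for negative in rivals:
                loss += max(0.0, margin - (positive - negative))
                count += 1
    return loss / count if count else 0.0
```

A loss of this kind is zero as soon as every matched entry leads its row and column by the margin, even if the affinity values themselves are far from 0 and 1, which is the sense in which it is less strict than fitting the label matrix directly.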
3.4 Inter-Frame Fusion
The aim of inter-frame fusion is to associate the tracklets from the previous time step with the intra-frame fusion results at the current time step, which includes homologous data association and heterogeneous data association. The inputs of inter-frame fusion are the three types of objects produced by intra-frame fusion.
As shown in Figure 4, the inter-frame fusion contains multiple strategies for the different matching modes, of which there are nine in this paper.
There are 7 types of homologous data association and 2 types of heterogeneous data association. The homologous data association is more reliable because it has richer common features and a naturally identical coordinate system. Therefore, the homologous data association is conducted first to reduce the interference of other data. For example, when the similarity of two fused vision-radar objects is computed, the image-to-image similarity and the radar-to-radar similarity are computed separately and then combined into the final similarity. When a fused object is compared with an image-only object, only the image-to-image similarity needs to be computed. The heterogeneous data association, for example image-to-radar and radar-to-image, needs to be considered in the same dimension. Therefore, we first perform the coordinate conversion between image and radar data, and then extract common features such as range, angle, velocity and confidence for the similarity computation.
The stable and reliable tracking lists are extended after the hierarchical association, and multiple matching strategies make the fusion results insensitive to single modality sensor failure.
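The count of nine matching modes can be checked with a short sketch. It assumes that tracklets and detections each come in three types (fused vision-radar, vision-only, radar-only) and that a pair is homologous when the two sides share at least one modality; the type names, the function names, and the use of a plain mean to combine the image and radar similarities are assumptions for illustration.

```python
from itertools import product

# Modalities carried by each object type (hypothetical names).
TYPES = {
    "fused": frozenset({"vision", "radar"}),
    "vision": frozenset({"vision"}),
    "radar": frozenset({"radar"}),
}


def matching_mode(track_type, det_type):
    """Return ('homologous' | 'heterogeneous', shared modalities)."""
    shared = TYPES[track_type] & TYPES[det_type]
    return ("homologous" if shared else "heterogeneous"), shared


def all_modes():
    """Enumerate every (tracklet type, detection type) combination."""
    return {(t, d): matching_mode(t, d)[0] for t, d in product(TYPES, TYPES)}


def combined_similarity(track_type, det_type, image_sim, radar_sim, cross_sim):
    """Pick or combine similarities according to the matching mode."""
    kind, shared = matching_mode(track_type, det_type)
    if kind == "heterogeneous":
        # Computed on common features after coordinate conversion.
        return cross_sim
    sims = []
    if "vision" in shared:
        sims.append(image_sim)
    if "radar" in shared:
        sims.append(radar_sim)
    return sum(sims) / len(sims)  # mean is an assumed combination rule
```

Under these assumptions `all_modes()` yields seven homologous and two heterogeneous combinations, the latter being vision-to-radar and radar-to-vision.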
We evaluate the proposed multi-modality cascaded fusion technology on the NuScenes dataset. NuScenes is a well-known public dataset for autonomous driving that includes LiDAR, radar, camera and GPS data. About 1000 scenes with 3D bounding boxes are available, covering various weather and road conditions. There are 6 different cameras and 5 radars on the test vehicle. However, in this work, we use only the front camera and the front radar.
We randomly split the dataset into training, validation and test sets with 275, 277 and 275 scenes, respectively. We merge multiple vehicle types, such as car, truck and bus, into a single class, car, which gives 37907 objects in the test set.
However, the dataset lacks a distance annotation for each object. Prior work proposed a method to obtain the ground-truth distance and the keypoint of each object, extracting the depth value at a fixed rank among the sorted laser points as the distance of the object. This is unreasonable as a true value, because that work was more concerned with using the keypoint to calculate the projection loss for an enhanced model than with using it as the ground truth. To solve this problem, we propose a new method to generate the ground-truth distances for the test set, as shown in Figure 5.
We project the 3D bounding boxes provided by the dataset into bird's-eye-view coordinates. The two corners closest to the test vehicle are extracted and their midpoint is calculated. The distance between the midpoint and the bumper of the test vehicle is the ground-truth distance of the object. The distances in the test set range from 0 to 105 m, with most within 5 to 40 m, as shown in Figure 6.
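The ground-truth construction can be sketched as below. The sketch assumes that the four bird's-eye-view corners are already expressed in a frame whose origin is the bumper reference point and that the distance is Euclidean; the function name and the `bumper` parameter are hypothetical.

```python
import math


def ground_truth_distance(bev_corners, bumper=(0.0, 0.0)):
    """bev_corners: four (x, y) box corners in bird's-eye view, in metres."""

    def dist(p):
        return math.hypot(p[0] - bumper[0], p[1] - bumper[1])

    # Take the two corners closest to the test vehicle and their midpoint.
    (x1, y1), (x2, y2) = sorted(bev_corners, key=dist)[:2]
    midpoint = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    return dist(midpoint)
```

For a vehicle directly ahead, the two nearest corners are those of its rear face, so the value is the distance from the bumper to the middle of that face.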
4.2 Implementation Details
The experiments include two parts, public comparison and ranging accuracy comparison.
4.2.1 Public comparison
For fairness, we evaluate our method on the NuScenes mini dataset using the classic evaluation metrics for depth prediction, including the threshold proportion, absolute relative difference (Abs Rel), squared relative difference (Squa Rel), root of mean squared errors (RMSE) and root of mean squared errors computed in log space (RMSElog). The ground-truth distance is obtained in the same way as in prior work.
In this part, we show the effectiveness of the proposed cascaded fusion framework without dynamic coordinate alignment (DCA) and the deep affinity network (DAN). The distance of each image object is obtained by our monocular ranging model, which uses the triangulation and scale ranging methods. In addition, the camera and radar need to be calibrated in advance, and the similarity computation is designed by hand.
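For reference, these metrics are commonly computed as in the sketch below. The definitions follow the usual depth-estimation convention, and the threshold values 1.25, 1.25 squared and 1.25 cubed are the customary choice rather than something stated in this section; the function name and dictionary keys are hypothetical.

```python
import math


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """pred / gt: equal-length sequences of positive distances in metres."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        # Proportion of objects whose ratio max(p/g, g/p) is below each threshold.
        f"delta_{k}": sum(r < t for r in ratios) / n
        for k, t in enumerate(thresholds, start=1)
    }
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in pairs) / n
    metrics["sq_rel"] = sum((p - g) ** 2 / g for p, g in pairs) / n
    metrics["rmse"] = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    metrics["rmse_log"] = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    return metrics
```

Higher is better for the threshold proportions, lower is better for the four error terms.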
The similarity computation is as follows:
Here, the overall similarity combines the similarity features of range, angle and velocity, each with a corresponding weight obtained by policy searching. Each similarity feature compares the range, angle or velocity of the vision object with that of the radar object, using a corresponding threshold for normalization.
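A hand-crafted similarity of this kind can be sketched as follows. The exact functional form is an assumption: each feature similarity is taken here as one minus the absolute vision-radar difference divided by its threshold, clipped at zero, and the overall score is the weighted sum. The dictionary keys, function names and the numeric values in the usage example are placeholders, not the searched weights.

```python
FEATURES = ("range", "angle", "velocity")


def feature_similarity(vision_value, radar_value, threshold):
    """1 when the two values agree, falling linearly to 0 at `threshold` apart."""
    return max(0.0, 1.0 - abs(vision_value - radar_value) / threshold)


def handcrafted_similarity(vision, radar, weights, thresholds):
    """Weighted combination of the per-feature similarities."""
    return sum(
        weights[k] * feature_similarity(vision[k], radar[k], thresholds[k])
        for k in FEATURES
    )


# Usage with placeholder numbers.
example = handcrafted_similarity(
    {"range": 20.0, "angle": 0.10, "velocity": 5.0},
    {"range": 21.0, "angle": 0.12, "velocity": 5.5},
    weights={"range": 0.5, "angle": 0.3, "velocity": 0.2},
    thresholds={"range": 5.0, "angle": 0.2, "velocity": 2.0},
)
```

Evaluating this function for every vision-radar pair fills the affinity matrix that is then passed to the assignment step.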
The affinity matrix is obtained by the traditional method above for all radar and image objects at the same time. The matching results are then acquired by the Hungarian algorithm, and the range and velocity of each vision object are updated from the associated radar object.
Table 1: Public comparison on NuScenes mini dataset.
4.2.2 Ranging accuracy comparison
We conduct extensive experiments on the split test dataset to evaluate the proposed DCA, DAN and cascaded fusion framework in terms of ranging accuracy. Ranging accuracy is the proportion of objects whose ranging error relative to the ground truth, obtained by the method described in Section 4.1, is within a given threshold.
All the similarities are computed by the DAN, which has two fully connected (FC) layers and a sigmoid activation, so that the output lies between 0 and 1.
The DAN is trained using the SGD optimizer with a learning rate of 0.001 and a batch size of 1.
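The metric can be sketched in a few lines. Treating the error as relative to the ground-truth distance is an assumption, and the tolerance is left as a parameter because its value is not given here; the function and parameter names are hypothetical.

```python
def ranging_accuracy(pred, gt, max_rel_error):
    """Proportion of objects whose |pred - gt| is within max_rel_error * gt."""
    correct = sum(abs(p - g) <= max_rel_error * g for p, g in zip(pred, gt))
    return correct / len(gt)
```

For example, with a tolerance of 0.1, predictions of 10, 21 and 50 m against ground truths of 10, 20 and 40 m give two correct objects out of three.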
4.3.1 Public comparison
We compare our method with support vector regression (SVR), the inverse perspective mapping algorithm (IPM) and an enhanced model with keypoint regression (EMWK). We first test the proposed framework with camera and radar, but without DCA and DAN, namely Baseline. Then we remove the radar input to test the robustness of the framework, namely Baseline-R.
The results are shown in Table 1. Our method Baseline-R surpasses SVR, IPM and EMWK by a large margin, which demonstrates the superiority and robustness of the framework. Furthermore, after adding the radar input, Baseline is better than Baseline-R, which shows that radar contributes to the ranging performance and that the proposed cascaded fusion framework is effective.
4.3.2 Ranging accuracy comparison
In this part, we conduct 5 experiments, namely Baseline - R, Baseline - R + DCA, Baseline + DCA, Baseline + DCA + DAN-ml (mask loss) and Baseline + DCA + DAN-al (affinity loss). It is worth noting that Baseline - R + DCA and Baseline + DCA both use DCA, but Baseline + DCA uses radar ranging to align the coordinates while Baseline - R + DCA uses image ranging.
The results are shown in Table 2.
|Car ranging accuracy||Average ranging accuracy|
|Baseline - R||0.7238||0.4985||0.4821||0.3411||0.4977||0.6162|
|Baseline - R + DCA||0.8460||0.5776||0.4776||0.2424||0.5210||0.6661|
|Baseline + DCA||0.8809||0.7447||0.6304||0.3646||0.6665||0.7823|
|Baseline + DCA + DAN-ml||0.7640||0.6407||0.4930||0.3949||0.5519||0.6539|
|Baseline + DCA + DAN-al||0.8869||0.7463||0.6366||0.4164||0.6720||0.7934|
DCA. As can be seen from Table 2, Baseline - R + DCA performs better than Baseline - R. The ranging accuracy rises for both the closest in-path vehicle (CIPV) and cars, which demonstrates the effectiveness of the proposed DCA.
DAN. The DAN trained with the affinity loss has the best performance among all the methods. More precisely, the ranging accuracy rises for both CIPV and cars beyond 80 m, which shows that the DAN is better than, and can be used to replace, the traditional hand-crafted strategy.
Loss function. The DAN with affinity loss has an obvious performance advantage. When trained with the mask loss, the network converges more slowly and with more difficulty, which noticeably decreases the ranging accuracy.
Qualitative results of Baseline - R and Baseline + DCA + DAN-al are shown in Figure 7, covering flat and rough roads, day and night, and sunny and rainy days. The proposed multi-modality cascaded framework improves the ranging accuracy, which indicates better fusion results for camera and radar.
We propose a multi-modality cascaded fusion framework that supports the fusion of various sensors with great interpretability. In addition, the dynamic coordinate alignment facilitates feature extraction and can be adapted to other sensor fusion methods. Moreover, the affinity loss function is more suitable for practical applications, as it eases model convergence and improves association accuracy. Finally, the hierarchical association and fusion framework is insensitive to single-modality sensor failure, making the overall perception results more robust and better able to serve autonomous driving decisions.
-  Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without bells and whistles. arXiv preprint arXiv:1903.05625, 2019.
-  Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
-  Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
-  Fatih Gökçe, Göktürk Üçoluk, Erol Şahin, and Sinan Kalkan. Vision-based detection and distance estimation of micro unmanned aerial vehicles. Sensors, 15(9):23805–23846, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  David Held, Jesse Levinson, and Sebastian Thrun. Precision tracking with sparse 3d and dense color 2d data. In 2013 IEEE International Conference on Robotics and Automation, pages 1138–1145. IEEE, 2013.
-  Samira Ebrahimi Kahou, Xavier Bouthillier, Pascal Lamblin, Caglar Gulcehre, Vincent Michalski, Kishore Konda, Sébastien Jean, Pierre Froumenty, Yann Dauphin, Nicolas Boulanger-Lewandowski, et al. Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2):99–111, 2016.
-  Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M Rehg. Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision, pages 4696–4704, 2015.
-  Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
-  Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1019–1028, 2019.
-  Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7644–7652, 2019.
-  Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7345–7353, 2019.
-  Weixin Lu, Yao Zhou, Guowei Wan, Shenhua Hou, and Shiyu Song. L3-net: Towards learning based lidar localization for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6389–6398, 2019.
-  Khaled El Madawy, Hazem Rashed, Ahmad El Sallab, Omar Nasr, Hanan Kamel, and Senthil Yogamani. Rgb and lidar fusion based 3d semantic segmentation for autonomous driving. arXiv preprint arXiv:1906.00208, 2019.
-  Gregory P Meyer, Jake Charland, Darshan Hegde, Ankit Laddha, and Carlos Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
-  Dennis Mitzel and Bastian Leibe. Taking mobile multi-object tracking to the next level: People, unknown objects, and carried items. In European Conference on Computer Vision, pages 566–579. Springer, 2012.
-  James Munkres. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 5(1):32–38, 1957.
-  Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
-  Amir Hossein Raffiee and Humayun Irshad. Class-specific anchoring proposal for 3d object recognition in lidar and rgb images. arXiv preprint arXiv:1907.09081, 2019.
-  Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017.
-  Hazem Rashed, Mohamed Ramzy, Victor Vaquero, Ahmad El Sallab, Ganesh Sistu, and Senthil Yogamani. Fusemodnet: Real-time camera and lidar based moving object detection for robust low-light autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
-  ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal S Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 2019.
-  Shane Tuohy, Diarmaid O’Cualain, Edward Jones, and Martin Glavin. Distance determination for an automobile environment using inverse perspective mapping in opencv. 2010.
-  Abhinav Valada, Gabriel L Oliveira, Thomas Brox, and Wolfram Burgard. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In International Symposium on Experimental Robotics, pages 465–477. Springer, 2016.
-  Bin Xu and Zhenzhong Chen. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2345–2353, 2018.
-  Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2018.
-  Wenwei Zhang, Hui Zhou, Shuyang Sun, Zhe Wang, Jianping Shi, and Chen Change Loy. Robust multi-modality multi-object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 2365–2374, 2019.
-  Jing Zhu and Yi Fang. Learning object-specific distance from a monocular image. In Proceedings of the IEEE International Conference on Computer Vision, pages 3839–3848, 2019.