
Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving

by   Di Feng, et al.

This work presents a probabilistic deep neural network that combines LiDAR point clouds and RGB camera images for robust, accurate 3D object detection. We explicitly model uncertainties in the classification and regression tasks, and leverage these uncertainties to train the fusion network via a sampling mechanism. We validate our method on three datasets with challenging real-world driving scenarios. Experimental results show that the predicted uncertainties reflect complex environmental uncertainty, such as the difficulty a human expert has in labeling objects. The results also show that our method consistently improves the Average Precision by up to 7%. When sensors are temporally misaligned, the sampling method improves the Average Precision by up to 20%.





I Introduction

A driverless car is usually equipped with multiple onboard sensors, such as video, LiDAR, and Radar sensors, in order to build a robust and accurate scene understanding. An object detection system that can exploit the complementary properties of different sensing modalities is crucial for safe autonomous driving.

As a powerful tool for learning hierarchical feature representations and complex transformations, deep learning has been widely applied to computer vision tasks. In this regard, many methods have recently been proposed that employ deep learning to fuse multiple sensors for object detection in autonomous driving [12]. Typical methods such as MV3D [6] and AVOD [25] have achieved promising results on standard open datasets, e.g. KITTI [17], with perfectly aligned sensors and good weather conditions. However, those methods are not robust against the temporal and spatial sensor misalignment that might occur during driving. Even a small spatial sensor displacement has been shown to drastically degrade the network performance [40]. Improving the network’s robustness against noisy sensor data remains a challenge.

Reliable uncertainty estimation in object detection networks provides extra information to support predictions, and has the potential to improve other modules such as motion planning [2]. Previous work has studied uncertainties in object detectors that employ only a single sensing modality, such as LiDAR point clouds [10, 11, 14, 13, 33] or RGB camera images [36, 27, 22, 23]. To the best of our knowledge, uncertainty estimation has not yet been introduced to multi-modal object detection. Furthermore, though previous studies have illustrated that uncertainties reflect environmental noise [11] or data distribution [36], there has been no study on how they reflect more complex attributes, which can be measured by how difficult it is for a human expert to label an object correctly. Ideally, a probabilistic detector should assign high uncertainties to objects that human oracles also find difficult to label.

In this work, we propose a probabilistic two-stage multi-modal object detection network that fuses LiDARs and RGB cameras. We explicitly model classification and regression uncertainties in the network, and study how they reflect the labeling difficulties of a human expert. To do this, we build a dataset with labels for environmental effects (e.g. occlusion, number of points) as well as an estimate of the labeling difficulty. Regarding the labeling difficulty, human annotators mark each object as either “Unsure” or “Sure”. Statistical analyses show that our network produces higher uncertainties when detecting “Unsure” objects than when detecting “Sure” objects, after correcting for other environmental effects. Afterwards, we leverage those uncertainties to improve the detection accuracy and the network’s robustness, especially in the sensor misalignment situation. The method works by optimizing the fusion network with training data sampled from the estimated probability distributions. We evaluate our method on multiple datasets with real-world driving scenarios, including two open datasets (KITTI [17] and NuScenes [4]) and our self-recorded Bosch dataset.

The contributions of this paper are three-fold:

  • We explicitly model classification and regression uncertainties in a multi-modal object detection network, and study how these uncertainties are affected by complex environmental factors such as labeling difficulty.

  • We leverage uncertainties to improve the detection accuracy and the network’s robustness, especially when sensors are temporally misaligned.

  • We validate our proposed method on three real-world datasets.

II Related Work

In this section, we briefly summarize deep learning methods for multi-modal object detection and uncertainty estimation. For a more comprehensive overview, we refer interested readers to the survey paper [12].

II-A Multi-modal Object Detection

When combining multiple sensors for object detection, RGB cameras and LiDARs are the most common sensors reported in the literature [6, 38, 25, 29]. Some works propose to combine RGB images with thermal images [41], LiDAR point clouds with HD maps [28], or RGB images with Radar points [5]. Modern multi-modal object detection networks follow either the two-stage or the one-stage pipeline. The variety of network architectures provides many options for sensor fusion. For instance, in the two-stage pipeline, different sensors can be combined at the first stage [25, 28], or at the second stage after regional proposal generation [6, 38, 40]. In the one-stage pipeline, sensing modalities can be fused at one specific layer [5] or at multiple layers [29, 42]. Typical fusion operations include feature concatenation, element-wise averaging, and ensembling. As discussed in [12], we do not find conclusive evidence that one fusion scheme is better than the others, and the fusion performance is highly dependent on the network architectures and datasets.

In this work, we propose a two-stage object detection network that combines RGB cameras and LiDARs. We first extract sensor-specific features using two different backbone networks, and then perform feature concatenation after the regional proposal generation (Fig. LABEL:fig:framework).

II-B Uncertainty Estimation and Probabilistic Object Detectors

There are many ways of estimating predictive probabilities in supervised deep networks. Bayesian Neural Networks (BNN) [32] place priors over the network weights, and infer their posterior distributions through variational inference [3, 19] or Monte Carlo sampling [15]. Deep Ensembles [26] obtain predictive probabilities from an ensemble of networks with the same architecture but different training initializations. Uncertainty propagation methods approximate the variance in each activation layer and propagate uncertainties through networks [16, 37]. Direct-modelling approaches assume certain distributions over the network outputs. Networks are then trained to directly predict output distributions [24, 16, 8] or their higher-order conjugate priors, such as the Dirichlet prior for the multinomial distribution in the classification task, or the Gaussian prior on the mean and an Inverse-Gamma prior on the variance for the Gaussian distribution in the regression task [1].


We can capture two types of uncertainties in an object detection network: the epistemic uncertainty and the aleatoric uncertainty [10]. The former reflects the model’s capability for describing data, and can be explained away given enough training data; the latter captures observation noise inherent in environments or sensors [24]. Previous studies have leveraged epistemic uncertainty to improve detections in open-set conditions [36] or to boost training efficiency in the active learning setting [14]. Other works have shown that modelling aleatoric uncertainty, especially in the bounding box regression task, can greatly improve the detection accuracy [11, 23, 33, 34] and reduce False Positives [27, 7]. Miller et al. [35] and Harakeh et al. [22] have found that the merging strategy, such as Non Maximum Suppression (NMS), significantly influences the uncertainty estimation. Feng et al. [13] identify uncertainty miscalibration problems in a one-stage object detection network. They follow [20] and calibrate uncertainties to correctly estimate the prediction error within the training data distribution.

In this work, we use the direct-modelling approach to explicitly model aleatoric uncertainties for both classification and regression tasks. We employ these uncertainties to improve the detection accuracy and the network’s robustness against temporal sensor misalignment.

III Methodology

III-A Network Architecture

Fig. LABEL:fig:framework illustrates our proposed two-stage multi-modal object detection network which fuses LiDAR point clouds and RGB camera images. Following [6, 25], the LiDAR point clouds are discretized and projected onto the Bird’s Eye View (BEV) plane, because this representation has been shown to be very effective in 3D perception [12]. The input signals are first processed separately by sensor-specific networks to extract high-level feature maps. The LiDAR network head also generates accurate 3D object proposals. Afterwards, these proposals are projected onto the BEV and front-view to extract Region of Interest (RoI) features from LiDAR and image feature maps respectively. Finally, RoI features are combined in a small fusion network for 3D bounding box regression and object classification.
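The discretization of a LiDAR point cloud onto the BEV plane can be illustrated with a minimal sketch. It assumes a single-channel binary occupancy grid; the function name `bev_occupancy`, the ranges, and the resolution are hypothetical values chosen for illustration and are not the settings used in this work.

```python
def bev_occupancy(points, x_range, y_range, resolution):
    """Project 3D points onto a binary Bird's Eye View occupancy grid.

    points: iterable of (x, y, z) tuples in metres.
    x_range, y_range: (min, max) extents of the grid in metres (hypothetical).
    resolution: cell size in metres (hypothetical).
    """
    n_rows = int(round((x_range[1] - x_range[0]) / resolution))
    n_cols = int(round((y_range[1] - y_range[0]) / resolution))
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        # Points outside the region of interest are discarded.
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            row = min(int((x - x_range[0]) / resolution), n_rows - 1)
            col = min(int((y - y_range[0]) / resolution), n_cols - 1)
            grid[row][col] = 1  # the height coordinate is dropped
    return grid
```

A real BEV encoding would typically keep several height slices and a reflectance channel per cell; the sketch drops them to keep the projection step visible.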

The network architecture is designed in a modular manner that eases adoption. In practice, we can directly leverage off-the-shelf pre-trained sensor-specific networks to process LiDAR and camera data (as we do in this work). We can also easily adapt the network to new sensors (e.g. Radar) by re-training only the light-weight fusion network, without affecting other modules. However, the fusion performance is limited by the LiDAR network. If a 3D proposal is missed, e.g. due to sparse LiDAR point clouds, the object within it can never be detected by the fusion network. A potential improvement is to generate 3D proposals from Radar or camera channels, which we leave as future work.

III-A1 Input and Output Encodings

Each 3D proposal generated by the LiDAR network for an input sample consists of a class label with its softmax score and the proposal’s location. For brevity we only consider the binary classes “Object” and “Non-object”. We encode the location as the center positional offsets on the horizontal plane, the proposal’s bottom positional offset, the length, width, and height at log scale, and the orientation. The fusion network predicts a class label with its softmax score and the 3D bounding box position. We encode the box position as offsets to the region proposal prediction.

III-A2 Sensor-specific Networks and Fusion Networks

We process LiDAR data and extract 3D proposals using PIXOR [43], a state-of-the-art one-stage LiDAR object detector, with several modifications. We estimate the object’s height instead of only predicting on the BEV plane, and explicitly model predictive probabilities, as introduced in Sec. III-B. For the RGB image data, we employ the Feature Pyramid Network [30], a well-performing image feature extractor. The fusion network combines LiDAR and image RoI features through concatenation, similar to [25]. It also takes the proposal positions and softmax scores as inputs, because we find that these proposal features can improve the 3D bounding box regression. The fusion network consists of three fully connected layers, each with 256 hidden units and followed by a dropout layer.
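The fusion network described above can be sketched in plain Python as follows. Only the concatenated inputs, the three fully connected layers with 256 hidden units, and the dropout layers come from the text; the ReLU activation, the dropout rate, the weight initialization, the two separate output heads, and all function names are assumptions made for illustration.

```python
import random

def linear(x, weights, bias):
    # One fully connected layer: y = W x + b.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, rate, rng, training):
    # Inverted dropout: active only during training.
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5  # assumed initialization
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_fusion_net(n_in, hidden=256, n_layers=3, n_box=7, n_cls=2, seed=0):
    rng = random.Random(seed)
    dims = [n_in] + [hidden] * n_layers
    layers = [init_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]
    return {
        "layers": layers,
        "box_head": init_layer(hidden, n_box, rng),  # assumed: 7 box offsets
        "cls_head": init_layer(hidden, n_cls, rng),  # assumed: Object / Non-object logits
    }

def fusion_forward(net, lidar_roi, image_roi, proposal_box, proposal_scores,
                   training=False, drop_rate=0.5, rng=None):
    rng = rng or random.Random(0)
    # Concatenate RoI features with the proposal position and softmax scores.
    h = list(lidar_roi) + list(image_roi) + list(proposal_box) + list(proposal_scores)
    for weights, bias in net["layers"]:
        h = dropout(relu(linear(h, weights, bias)), drop_rate, rng, training)
    return linear(h, *net["box_head"]), linear(h, *net["cls_head"])
```

The RoI feature sizes are left to the caller, since they depend on the backbone networks and the pooling resolution.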

III-B Learning with Probability

Suppose we have pre-trained LiDAR and camera networks that produce 3D proposals and RoI features for fusion. The standard approach to training the fusion network can be viewed from the maximum likelihood perspective, where we learn a set of network weights that maximize the observation likelihood of the training data. We minimize the negative log likelihood by setting the loss function:


In the context of classification, the likelihood is usually set to be the multinomial mass function, and the resulting loss is widely known as the cross-entropy loss. It can also be adapted to tackle the imbalance between positive and negative samples via the focal loss [31]. For deterministic regression, we can assume the likelihood to be the Gaussian density function with fixed variance. The corresponding loss function is the L2 loss.

In this work, we incorporate the proposal distribution into the loss function:


Since an analytical solution is intractable, we approximate this loss function via sampling (as illustrated in Fig. 1(a)):


This new training strategy brings two benefits. First, propagating proposal uncertainties to the fusion network helps to improve its robustness, as the network learns to handle proposals with small and big uncertainties. Second, sampling proposals can serve as a simple data augmentation method that aids generalization.
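The sampling approximation can be sketched as a Monte Carlo average over proposals drawn from a Gaussian proposal distribution. This is a minimal sketch under stated assumptions: the per-sample losses are simply averaged, `loss_fn` stands in for the fusion network's forward pass plus its loss, and all names and the number of samples are hypothetical.

```python
import random

def sample_proposal(mean, std, rng):
    # Draw one proposal; each variable is treated as an independent Gaussian.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

def sampled_loss(loss_fn, proposal_mean, proposal_std, target, num_samples=10, seed=0):
    """Average loss_fn over proposals sampled from the proposal distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        proposal = sample_proposal(proposal_mean, proposal_std, rng)
        total += loss_fn(proposal, target)
    return total / num_samples

def squared_error(proposal, target):
    # Toy stand-in for "run the fusion network on this proposal and score it".
    return sum((p - t) ** 2 for p, t in zip(proposal, target))
```

With zero standard deviation the estimate reduces to the ordinary loss on the proposal mean, which is the standard training setting without sampling.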

In practice, we could pre-define a proposal distribution, such as a Gaussian distribution with fixed variance. In this work, we use predictive uncertainties from a pre-trained probabilistic LiDAR network, which are more flexible because they can encode both the varying environmental noise and the network’s inaccuracy for each proposal. We illustrate how to model these probabilities in the following section.

Fig. 1: (a). Training the fusion network with sampling. We assume that the 3D proposals are Gaussian distributed, and propagate the sampled proposals based on predicted probability distribution to the fusion network; (b). Training the probabilistic LiDAR network. The network directly regresses the parameters of the probability distributions, which are incorporated in the loss function.

Probabilistic Modelling:

We explicitly model regression and classification uncertainties in our LiDAR network. For simplicity we use scalar notation instead of vector notation to introduce our method. Fig. 1(b) shows the process. Following [11], we assume that each proposal regression variable is Gaussian distributed, with its mean being the network’s standard output and its variance (regression noise) an auxiliary regression variable. We employ the attenuated loss proposed by [24] to learn this probability distribution (see Fig. 1(b)). Similarly, we assume the distribution of the softmax logit to be Gaussian, and add another output layer in the LiDAR network head to regress the logit variance (classification noise). Directly learning this variable is difficult [24]. Instead, we sample a logit using the re-parametrization trick and transform it to the softmax score, which is used to calculate the final classification loss (see Fig. 1(b)).
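The two losses described above can be sketched as follows. The regression part follows the usual form of the attenuated loss, with the network predicting the log variance for numerical stability (an assumption). The classification part samples logits with the re-parametrization trick and, as a simplification, averages the cross-entropy over samples. Function names and the number of samples are hypothetical.

```python
import math
import random

def attenuated_regression_loss(target, mean, log_var):
    # Gaussian negative log likelihood up to a constant:
    # 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, with s = log(sigma^2).
    return 0.5 * math.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampled_classification_loss(logit_means, logit_stds, label, num_samples=20, seed=0):
    """Cross-entropy on logits sampled as mean + std * eps, with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        logits = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(logit_means, logit_stds)]
        total += -math.log(softmax(logits)[label])
    return total / num_samples
```

A large predicted variance down-weights the squared residual but is penalized by the log-variance term, so the network cannot lower the loss by predicting high noise everywhere.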

It is worth mentioning that we train the probabilistic LiDAR network and the fusion network separately to favour the modular architecture design, as discussed in Sec. III-A. When optimizing the fusion network under the proposal uncertainties, we directly sample the proposal position, and indirectly propagate the softmax uncertainty by sampling the softmax logit (see Fig. 1(a)). No sampling operation is required during inference. Therefore, this approach to directly modelling uncertainties adds almost no computational cost or parameters, as discussed in [11]. In practice, we could also train the whole detector end-to-end by applying the re-parametrization trick to the proposal variance. However, we do not find that end-to-end training improves the detection accuracy.

IV Experimental Results

The experimental results are structured as follows. In Sec. IV-B, we study how the predictive uncertainties used in the fusion network behave. We conduct statistical analyses and show that those explicitly modelled uncertainties reflect complex environmental noise and the labelling uncertainty of human annotators. Afterwards, we show in Sec. IV-C that the proposed uncertainty estimation and sampling mechanism improve the object detection performance across three datasets. Specifically, we observe that the proposed sampling mechanism is more robust than the fixed-sampling approach, because the predictive uncertainties encode useful information, as shown in the first experiment. Finally, in Sec. IV-D we demonstrate the robustness of our method against temporal sensor misalignment.

IV-A Setup

IV-A1 Datasets

We validate our method on detecting objects of the “Car” category in three real-world urban driving datasets recorded at different locations and with different sensor setups.

KITTI [17]: the dataset is recorded in Karlsruhe, a mid-sized city in Germany, only during daytime and on sunny days. Following [6], we split the training data into a train set and a val set. The network is optimized on the train set and evaluated on the val set.


Bosch: we also record data in several major cities in southern Germany with a vehicle setup similar to KITTI, but in much more diverse driving scenarios, such as night drives and rainy or cloudy days. We follow KITTI and label the object truncation and occlusion using ordinal numbers. In addition, we ask annotators to label each object as either “Unsure” or “Sure”. An object is “Unsure” if the annotators find it difficult to define its ground truth label, such as its box parameters. This label enables us to study how the predictive uncertainties from our model reflect the labelling difficulties of a human expert. In the experiment, we randomly split the data into train drives and test drives.

NuScenes [4]: this large-scale dataset is recorded in Singapore and Boston, with a rich variety of traffic and weather conditions. Unlike the KITTI and Bosch datasets, which use 64-channel LiDARs, NuScenes is recorded with 32-channel LiDARs [4], making the LiDAR perception more challenging. Since the full dataset is quite large, we only use a small subset for evaluation. We randomly select scenes from the full training data to train the network, and use the data in the NuScenes-teaser release for testing.

IV-A2 Implementation Details

We assemble our multi-modal object detector following the modular design discussed in Sec. III-A. First, we take the pre-trained Feature Pyramid Network directly from Detectron [18] as the image backbone, and pre-train the PIXOR-like LiDAR network with the SGD optimizer and a step-decayed learning rate, using a different number of training steps for KITTI than for Bosch and NuScenes. Our LiDAR network achieves detection performance similar to the original PIXOR results [43]. Afterwards, we fix the sensor-specific networks, and train the fusion network with the ADAM optimizer. We find that this strategy of long training with a small learning rate makes the fusion network more stable. We pass the 3D proposals with the highest classification scores to the fusion network during training, and reduce their number during inference. Proposals outside the camera field of view are not considered for fusion. For the KITTI and Bosch datasets, we use the LiDAR point cloud within a fixed length × width × height range and discretize it at a fixed resolution. For the NuScenes dataset we adjust the length range due to LiDAR point cloud sparsity. All experiments are conducted using Titan X GPUs.

IV-B Understanding Uncertainties

We study how uncertainties behave using the probabilistic proposals predicted by the LiDAR network on the Bosch test data. We measure the Shannon entropy of the softmax scores, the total variance of the regression noise, and the classification noise given by the variance of the positive logit. In the binary case, the Shannon entropy is high for softmax scores close to 0.5 and low for scores close to 0 or 1.
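The two scalar summaries used here can be computed as in the following sketch. It assumes the natural logarithm for the entropy and treats the total variance as the sum of the per-variable regression variances; the function names are hypothetical.

```python
import math

def shannon_entropy(probs):
    # H(p) = -sum_c p_c * log(p_c); terms with p_c = 0 contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def total_variance(variances):
    # Trace of a diagonal covariance: the sum of per-variable variances.
    return sum(variances)
```

In the binary case the entropy peaks at log(2) for a softmax score of 0.5 and drops to zero when the score reaches 0 or 1, matching the behaviour described above.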

Fig. 3 shows an example of how uncertainties are distributed in an input frame. We only visualize uncertainties at regions whose positive softmax score exceeds a threshold, because below this threshold the regression uncertainties are dominated by random noise due to the training strategy (no regression loss is applied if a region is not assigned to a ground truth). We observe that the proposals in the “around-object” regions usually show higher Shannon entropy than those in the “in-object” regions (Fig. 3(b)), because those regions lie on the boundary between objects and background. The regression and classification noise do not show this behaviour (Fig. 3(c), (d)). Instead, both uncertainties are more affected by environmental noise, e.g. distant proposals show high uncertainties. Similar results have also been observed in [11].

In order to study whether the predicted regression and classification noise match the human annotations of “Unsure/Sure” objects, we first associate LiDAR proposals with ground truths based on an IoU threshold, and then calculate the uncertainty histograms of the associated proposals with respect to the “Unsure/Sure” labels. As shown in Fig. 2, the distributions differ between “Unsure” and “Sure” objects, with “Unsure” detections in general showing higher regression noise, classification noise, and Shannon entropy than “Sure” detections. To check whether the correlation in Fig. 2 is mainly due to other environmental effects, we train linear models for the regression noise and the classification noise with the independent variables “Distance”, “Occlusion”, “Number of LiDAR points” (within a bounding box), and the “Unsure” label, where the ordinal variable “Occlusion” is represented by orthogonal polynomials. The resulting model for the regression noise has a higher adjusted R² than the model for the classification noise, indicating that the regression noise can be better modelled by the available parameters than the classification noise. T-tests for the regression parameters of both models yield significant p-values for all parameters. Thus, all of the parameters have a significant impact on the noise even after correcting for the other variables. To conclude, the predictive uncertainties reflect the environmental noise, as measured by the difficulty of human annotation.
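The goodness-of-fit measure used to compare the two linear models can be sketched as follows. The functions compute the coefficient of determination from observed and fitted values and then apply the standard adjustment for the number of predictors; the function names and any sample or predictor counts passed in are hypothetical.

```python
def r_squared(observed, fitted):
    # R^2 = 1 - SS_res / SS_tot.
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    # Penalizes R^2 for the number of predictors in the linear model.
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
```

Because the ordinal “Occlusion” variable is expanded into several polynomial terms, the predictor count passed to the adjustment is larger than the number of named variables.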

Fig. 2: Histograms of the uncertainties for “Sure” and “Unsure” objects. The uncertainties are predicted by the LiDAR network on our Bosch dataset. (a) The total variance of the regression noise; (b) the classification noise for the positive logit; (c) the Shannon entropy of the softmax scores.

Fig. 3: A detection example. We only visualize uncertainties (normalized to the same scale) at the regions whose softmax objectness score exceeds a threshold. Areas out of the camera field of view are shaded in grey.

IV-C Detection Performance

We study the detection performance of our proposed method on three different datasets, shown in Tab. I. Our method (“Ours”) explicitly estimates both regression and classification uncertainties in the LiDAR network, and performs sampling during the training phase. It is compared with a model without the sampling mechanism (“Modelling uncertainty” in Tab. I), and with a baseline model that uses neither sampling nor uncertainty estimation. Following [17], we use Average Precision to evaluate detections in 3D space, in the bird’s eye view, and on the camera front-view plane. We group detections according to their distance to the ego-vehicle. For the Bosch and KITTI datasets, we lower the Intersection over Union (IoU) threshold with increasing distance, from IoU=0.7 for the nearest range through IoU=0.6 to IoU=0.5 for the farthest range, because localizing distant objects using LiDAR data becomes very difficult due to point cloud sparsity. For the NuScenes dataset, we use a shorter detection distance and IoU=0.5.

Tab. I shows that the networks perform similarly on the Bosch and KITTI datasets, probably due to similar sensor settings. However, all networks perform much worse on the NuScenes dataset, even with a smaller detection range and a less strict IoU threshold, which illustrates the perception challenge when the number of LiDAR channels is halved. In all datasets, “Modelling uncertainty” consistently outperforms “Baseline”. This is because probabilistic inference helps the LiDAR network predict more accurate 3D proposals. Similar results have also been found in our previous study [11]. Built upon “Modelling uncertainty”, “Ours” further improves the detection performance in most cases (though marginally), indicating that the sampling mechanism helps networks to generalize.


TABLE I: Comparison of detection performance on three datasets (Bosch, KITTI, and NuScenes). For each dataset, the rows “Modelling uncertainty” and “Modelling uncertainty + Sampling (Ours)” are reported per detection distance range [m].


Test regression uncertainty
Nr.   Model           Sampling
A     LiDAR network   No
B     Fusion          No
C     Fusion          Fixed variance
D     Fusion          Fixed variance
E     Fusion          Fixed variance
F     Fusion          Ours
G     Fusion          Ours

Test classification uncertainty
Nr.   Model           Sampling
H     LiDAR network   No
I     Fusion          No
J     Fusion          Fixed variance
K     Fusion          Ours

TABLE II: Ablation study on the KITTI dataset (Best performance is marked in bold, second best with underline). Each model is evaluated on the Easy, Moderate, and Hard settings.

Ablation Study: We additionally conduct an ablation study to validate the proposed method. Tab. II shows the detection performance on the KITTI dataset. We divide the data into Easy, Moderate, and Hard settings [17], and use IoU=0.7 for all settings. Models A and H are LiDAR networks which only model regression uncertainty and classification uncertainty, respectively. The fusion networks B-G are built upon Model A, while Models I-K are built upon Model H. To verify the advantage of sampling from the learned uncertainties, we additionally train fusion networks by sampling with a fixed standard deviation for the regression uncertainty (Models C-E) and for the classification uncertainty (Model J). The fixed deviation for the regression uncertainty increases from Model C to Model E, and the deviation for the classification uncertainty is set to the mean value of the learned uncertainty.

From the table we make three observations. First, our proposed sampling mechanism consistently improves the fusion performance for both regression (comparing Models F and G with Model B) and classification (comparing Model K with Model I), and sampling more regression variables can bring better detection accuracy (comparing G with F). Second, sampling with a small fixed deviation (Model C) improves the detection accuracy due to the data augmentation advantage. However, increasing the fixed deviation may harm the performance (Models D and E). In fact, sampling with the largest fixed deviation (Model E) even underperforms the LiDAR-only detector (Model A). Similar results have also been found in [21]. Therefore, the fixed-sampling approach is an ad-hoc solution and requires tedious hyper-parameter tuning. In contrast, our method of sampling from learned uncertainties (Models F, G and K) avoids this process. It generates diverse training data that corresponds to complex environmental noise (as shown in Sec. IV-B), and provides competitive or better detection performance, making it more robust than the fixed-sampling method. Finally, fusing LiDAR data with RGB camera information largely improves the 2D detection results (e.g. comparing Models A and B), but may degrade the BEV performance relative to the LiDAR-only network (e.g. Models E and F) due to the fusion network design: the network implicitly learns the sensor alignment between the LiDAR top-down view and the camera front-view, which makes training the 3D bounding box estimation challenging. One remedy for this problem is to project camera images onto the LiDAR top-down view before fusion, similar to [29].

IV-D Robustness Testing

So far, we have shown how the proposed method works on datasets with well-aligned sensor settings. When deploying a multi-modal object detector online, however, sensor mismatch could occur due to different timestamps (i.e. temporal misalignment) or calibration errors (i.e. spatial misalignment), which may drastically degrade the perception performance. An ideal object detector should not only perform well in good conditions, but also be robust against sensor misalignment.

Fig. 4: Robustness testing. We randomly shift the LiDAR point clouds in the horizontal plane following the Gaussian distribution to simulate the sensor temporal misalignment. The horizontal axis represents the standard deviations, and the vertical axis represents the detection performance on the KITTI dataset.

In this work, we evaluate the method’s robustness against temporal misalignment. We follow [40] and simulate the misalignment by randomly shifting all LiDAR point clouds in a frame following a Gaussian distribution with zero mean and increasing standard deviation, while keeping the cameras as reference. We compare three models: the fusion network trained without sampling, the one trained with fixed sampling, and the one trained with our sampling mechanism. All models are trained on the clean KITTI train set and tested on the misaligned val set. Fig. 4 reports the 3D detection performance in the KITTI “Moderate” setting. At the same shifting level, our method clearly outperforms both the model without sampling and the one with fixed sampling, showing its high robustness against noisy data. Though we only conduct experiments with the KITTI dataset, we expect similar results on other datasets (such as Bosch and NuScenes).
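The misalignment simulation can be sketched as follows. It assumes that one horizontal offset is drawn per frame and applied to every LiDAR point in that frame, with independent zero-mean Gaussian shifts along x and y; the function name and the standard deviation values are hypothetical.

```python
import random

def shift_point_cloud(points, std, rng):
    """Shift all points of one frame by a single random horizontal offset.

    points: list of (x, y, z) tuples in metres.
    std: standard deviation of the zero-mean Gaussian shift in metres.
    """
    dx = rng.gauss(0.0, std)
    dy = rng.gauss(0.0, std)
    # The height coordinate is left untouched; camera images stay unchanged.
    return [(x + dx, y + dy, z) for x, y, z in points]
```

Sweeping `std` over increasing values while leaving the images fixed yields test sets of the kind evaluated in Fig. 4.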

V Conclusion and Discussion

We have presented a probabilistic two-stage multi-modal object detection network that fuses LiDARs and RGB cameras. The method predicts 3D proposals from the LiDAR branch, and combines the regional LiDAR and camera features with a light-weight fusion network. We explicitly model classification and regression uncertainties in the LiDAR network, and leverage those uncertainties to train the fusion network. We evaluate our method on three datasets with real-world driving scenarios. Experimental results show that the predicted uncertainties reflect complex environmental uncertainties, such as the difficulty human annotators have in labeling certain objects. Furthermore, modelling uncertainties helps to improve the detection accuracy and the network’s robustness, especially in the sensor misalignment situation.

In this work, we only model uncertainties in the LiDAR branch. Modelling uncertainties in the image backbone and the fusion network is interesting future work. In addition, to reduce the computational cost of our fusion network for online deployment, we will introduce the quantization technique [9] into our method.


We thank our colleague William H. Beluch for the suggestions and fruitful discussions.


  • [1] A. Amini, W. Schwarting, A. Soleimany, and D. Rus (2019) Deep evidential regression. arXiv preprint arXiv:1910.02600. Cited by: §II-B.
  • [2] H. Banzhaf, M. Dolgov, J. Stellet, and J. M. Zöllner (2018) From footprints to beliefprints: motion planning under uncertainty for maneuvering automated vehicles in dense scenarios. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 1680–1687. Cited by: §I.
  • [3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. In 32nd International Conference on Machine Learning, pp. 1613–1622. Cited by: §II-B.
  • [4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §I, §IV-A1.
  • [5] S. Chadwick, W. Maddern, and P. Newman (2019) Distant vehicle detection using radar and vision. In IEEE International Conference on Robotics and Automation (ICRA). Cited by: §II-A.
  • [6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6526–6534. Cited by: §I, §II-A, §III-A, §IV-A1.
  • [7] J. Choi, D. Chun, H. Kim, and H. Lee (2019) Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §II-B.
  • [8] S. Choi, K. Lee, S. Lim, and S. Oh (2018) Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6915–6922. Cited by: §II-B.
  • [9] L. Enderich, F. Timm, L. Rosenbaum, and W. Burgard (2019) Learning multimodal fixed-point weights using gradient descent. In 2019 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN). Cited by: §V.
  • [10] D. Feng, L. Rosenbaum, and K. Dietmayer (2018) Towards safe autonomous driving: capture uncertainty in the deep neural network for lidar 3d vehicle detection. In IEEE International Conference on Intelligent Transportation Systems (ITSC), pp. 3266–3273. Cited by: §I, §II-B.
  • [11] D. Feng, L. Rosenbaum, F. Timm, and K. Dietmayer (2019) Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection. In IEEE Intelligent Vehicles Symposium (IV). Cited by: §I, §II-B, §III-B, §III-B, §IV-B, §IV-C.
  • [12] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, F. Timm, C. Gläser, W. Wiesbeck, and K. Dietmayer (2019) Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges. arXiv preprint arXiv:1902.07830. Cited by: §I, §II-A, §II, §III-A.
  • [13] D. Feng, L. Rosenbaum, C. Gläser, F. Timm, and K. Dietmayer (2019) Can we trust you? on calibration of a probabilistic object detector for autonomous driving. arXiv preprint arXiv:1909.12358. Cited by: §I, §II-B.
  • [14] D. Feng, X. Wei, L. Rosenbaum, A. Maki, and K. Dietmayer (2019) Deep active learning for efficient training of a lidar 3d object detector. In IEEE Intelligent Vehicles Symposium (IV). Cited by: §I, §II-B.
  • [15] Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §II-B.
  • [16] J. Gast and S. Roth (2018) Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3369–3378. Cited by: §II-B.
  • [17] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §I, §I, §IV-A1, §IV-C, §IV-C.
  • [18] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Cited by: §IV-A2.
  • [19] A. Graves (2011) Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356. Cited by: §II-B.
  • [20] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330. Cited by: §II-B.
  • [21] C. Haase-Schütz, H. Hertlein, and W. Wiesbeck (2019) Estimating labeling quality with deep object detectors. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 33–38. Cited by: §IV-C.
  • [22] A. Harakeh, M. Smart, and S. L. Waslander (2019) BayesOD: a bayesian approach for uncertainty estimation in deep object detectors. arXiv preprint arXiv:1903.03838. Cited by: §I, §II-B.
  • [23] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §I, §II-B.
  • [24] A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pp. 5574–5584. Cited by: §II-B, §II-B, §III-B.
  • [25] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1–8. ISSN 2153-0866. Cited by: §I, §II-A, §III-A2, §III-A.
  • [26] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §II-B.
  • [27] M. T. Le, F. Diehl, T. Brunner, and A. Knoll (2018) Uncertainty estimation for deep neural object detectors in safety-critical applications. In IEEE International Conference on Intelligent Transportation Systems (ITSC), pp. 3873–3878. Cited by: §I, §II-B.
  • [28] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: §II-A.
  • [29] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: §II-A, §IV-C.
  • [30] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §III-A2.
  • [31] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §III-B.
  • [32] D. J. C. MacKay (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3), pp. 448–472. Cited by: §II-B.
  • [33] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §I, §II-B.
  • [34] G. P. Meyer and N. Thakurdesai (2019) Learning an uncertainty-aware object detector for autonomous driving. arXiv preprint arXiv:1910.11375. Cited by: §II-B.
  • [35] D. Miller, F. Dayoub, M. Milford, and N. Sünderhauf (2018) Evaluating merging strategies for sampling-based uncertainty techniques in object detection. In IEEE International Conference on Robotics and Automation (ICRA). Cited by: §II-B.
  • [36] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf (2018) Dropout sampling for robust object detection in open-set conditions. In IEEE International Conference on Robotics and Automation. Cited by: §I, §II-B.
  • [37] J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari (2019) Sampling-free epistemic uncertainty estimation using approximated variance propagation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2931–2940. Cited by: §II-B.
  • [38] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum PointNets for 3d object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §II-A.
  • [39] M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, pp. 3179–3189. Cited by: §II-B.
  • [40] K. Shin, Y. P. Kwon, and M. Tomizuka (2019) Roarnet: a robust 3d object detection based on region approximation refinement. In IEEE Intelligent Vehicles Symposium (IV), pp. 2510–2515. Cited by: §I, §II-A, §IV-D.
  • [41] J. Wagner, V. Fischer, M. Herman, and S. Behnke (2016) Multispectral pedestrian detection using deep fusion convolutional neural networks. In 24th Eur. Symp. Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 509–514. Cited by: §II-A.
  • [42] Z. Wang, W. Zhan, and M. Tomizuka (2018) Fusing bird view lidar point cloud and front view camera image for deep object detection. In IEEE Intelligent Vehicles Symposium (IV). Cited by: §II-A.
  • [43] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §III-A2, §IV-A2.