
Leveraging Heteroscedastic Aleatoric Uncertainties for Robust Real-Time LiDAR 3D Object Detection

by   Di Feng, et al.

We present a robust real-time LiDAR 3D object detector that leverages heteroscedastic aleatoric uncertainties to significantly improve its detection performance. A multi-loss function is designed to incorporate uncertainty estimations predicted by auxiliary output layers. Using our proposed method, the network learns to ignore noisy training samples and to focus more on informative ones. We validate our method on the KITTI object detection benchmark. Our method surpasses the baseline method, which does not explicitly estimate uncertainties, by up to nearly 9% in Average Precision. It also produces state-of-the-art results compared to other methods while running with an inference time of only 72 ms. In addition, we conduct extensive experiments to understand how aleatoric uncertainties behave. Extracting aleatoric uncertainties brings almost no additional computation cost during deployment, making our method highly desirable for autonomous driving applications.



I Introduction

Fig. 1: Our proposed LiDAR 3D object detector (PROD). The network regresses the heteroscedastic aleatoric uncertainties in RPN and FRH.

A robust and accurate object detection system using on-board sensors (e.g. camera, LiDAR, Radar) is crucial for road scene understanding in autonomous driving. Among different sensors, LiDAR can provide us with accurate depth information, and is robust under different illumination conditions such as daytime and nighttime. These properties make LiDAR indispensable for safe autonomous driving. The recent fatal accident involving Uber’s autonomous vehicle could have been avoided if the LiDAR perception system had robustly detected the pedestrian, or had informed the human driver in time to trigger emergency braking because it was uncertain about the driving situation.


Recently, deep learning approaches have brought significant improvement to the object detection problem [2]. Many methods have been proposed that use LiDAR point clouds [3, 4, 5, 6, 7, 8, 9, 10, 11] or fuse them with camera images [12, 13, 14, 15, 16, 17, 18, 19]. However, they only give us deterministic bounding box regression and use softmax scores to represent recognition probability, which does not necessarily represent uncertainties in the network [20]. In other words, they do not provide detection confidence regarding classification and localization. For a robust perception system, we need to explicitly model the network’s uncertainties.

Towards this goal, in this work we build a probabilistic two-stage object detector from LiDAR point clouds by introducing heteroscedastic aleatoric uncertainties - the uncertainties that represent sensor observation noises and vary with the data input. The method works by adding auxiliary outputs to model the aleatoric uncertainties, and training the network with a robust multi-loss function. In this way, the network learns to focus more on informative training samples and ignore the noisy ones. We call our method PROD (Probabilistic Real-time Object Detector). Our contributions can be summarized as follows:

  • We model heteroscedastic aleatoric uncertainties in a 3D object detection network using LiDAR point clouds.

  • We show that by leveraging aleatoric uncertainties, the network produces state-of-the-art results and significantly increases the average precision by up to nearly 9% compared to the baseline method without any uncertainty estimations.

  • We systematically study how the aleatoric uncertainties behave. We show that the uncertainties are related to each other and are influenced by multiple factors such as detection distance, occlusion, softmax score, and orientation.

In the sequel, we first summarize related works in Sec. II, and then describe our proposed method in Sec. III in detail. Sec. IV shows the experimental results regarding the improvement of object detection performance by leveraging aleatoric uncertainties and understanding how the uncertainties behave. Sec. V summarizes the work and discusses future research. The video of this work is provided as supplementary material.

II Related Works

In the following, we summarize methods for LiDAR-based object detection in autonomous driving and uncertainty quantification in deep neural networks.

II-A LiDAR-based Object Detection

Many works process the LiDAR information directly from point clouds [5, 7, 3, 17, 14, 15]. For example, Zhou et al. [5] propose a voxel feature encoding layer to handle 3D point clouds. Li [3] employs a 3D fully convolutional neural network on discretized point clouds to predict an objectness map and a 3D bounding box map. Other works project 3D point clouds onto a 2D plane and use 2D convolutional networks to process these LiDAR feature maps. They can be represented by front-view cylindrical images [4, 21, 19, 12], camera-coordinate images [8, 22, 23], or bird’s eye view (BEV) maps [24, 25, 26, 27]. Besides LiDAR, an autonomous driving car is usually equipped with other sensors such as cameras or radar sensors. Therefore, it is natural to fuse them for more robust and accurate object detection. For instance, Chen et al. [12] use the LiDAR BEV map to generate region proposals, and fuse the regional features from LiDAR BEV and front-view maps, as well as camera images, for 3D car detection. Qi et al. [15] propose to generate 2D object bounding boxes with an image detector, and use the regional point clouds within these bounding boxes for object detection.

II-B Uncertainty Quantification in Deep Neural Networks

There are two types of uncertainties we can model in neural networks. Epistemic uncertainty shows the model’s uncertainty when describing the training dataset. It can be quantified through variational inference [28], sampling techniques [29, 20] or ensembles [30], and has been applied to active learning [31, 32], image semantic segmentation [33, 34], camera location estimation [35] or open-dataset object detection problems [36]. Aleatoric uncertainty, on the other hand, models the observation noises of input data. It has been modeled by a Laplacian distribution or a Gaussian distribution for camera image semantics and geometry predictions [37, 38]. Recently, we explicitly modeled and compared the epistemic and aleatoric uncertainties in an object detector [24]. We showed that epistemic uncertainties are related to detection accuracy, whereas aleatoric uncertainties are influenced by observation noises.

III Methodology

In this section, we present our method which leverages heteroscedastic aleatoric uncertainties for robust LiDAR 3D object detection. We start with illustrating our network architecture, which is followed by a description of how to model the uncertainties. We end the section with a description of our proposed robust multi-loss function.

III-A Network Architecture

III-A1 LiDAR Point Clouds Transformation

In this work, LiDAR point clouds are encoded in 2D bird’s eye view (BEV) feature maps as network inputs. These feature maps include height and density information generated by projecting the 3D point clouds on the 2D grid [13].
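The BEV encoding described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from [13]: the function name, the grid extents, the cell resolution, the number of height slices, and the log-scaled density normalization are all hypothetical values chosen for the example.

```python
import math


def encode_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0),
               z_range=(0.0, 2.0), resolution=1.0, num_slices=2):
    """Encode (x, y, z) points as height-slice maps plus one density map.

    Every default value here is a hypothetical placeholder.
    Returns (height, density) with height[k][i][j] and density[i][j].
    """
    nx = int(round((x_range[1] - x_range[0]) / resolution))
    ny = int(round((y_range[1] - y_range[0]) / resolution))
    slice_height = (z_range[1] - z_range[0]) / num_slices

    height = [[[0.0] * ny for _ in range(nx)] for _ in range(num_slices)]
    count = [[0] * ny for _ in range(nx)]

    for x, y, z in points:
        inside = (x_range[0] <= x < x_range[1]
                  and y_range[0] <= y < y_range[1]
                  and z_range[0] <= z < z_range[1])
        if not inside:
            continue  # drop points outside the cropped volume
        i = min(nx - 1, int((x - x_range[0]) / resolution))
        j = min(ny - 1, int((y - y_range[0]) / resolution))
        k = min(num_slices - 1, int((z - z_range[0]) / slice_height))
        # Height feature: maximum height above the lower z bound per slice and cell.
        height[k][i][j] = max(height[k][i][j], z - z_range[0])
        count[i][j] += 1

    # Density feature: log-scaled point count per cell, capped at 1.0 (assumed).
    density = [[min(1.0, math.log(c + 1) / math.log(16)) for c in row]
               for row in count]
    return height, density
```

Stacking the height slices with the density map gives the multi-channel 2D input that the detector consumes.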

III-A2 Two-stage Object Detector

The network architecture is shown in Fig. 1. We follow the two-stage object detection network proposed in [39]. The LiDAR BEV feature maps are fed into a pre-processing network based on VGG16 to extract high-level LiDAR features. After the pre-processing layers, a region proposal network (RPN) produces 3D regions of interest (ROIs) based on pre-defined 3D anchors at each pixel on the feature map. The RPN consists of two task-specific fully connected layers, each with hidden units. A ROI is parametrized by its position in the bird’s eye view plane, its height, its dimensions, and the softmax objectness score. The anchor’s dimensions are determined by clustering the training samples with two clusters. Its height is based on the LiDAR’s height above the ground plane. Similar to [39], the RPN regresses the region proposals as normalized offsets and predicts the objectness score with a softmax layer.

The Fast-RCNN head (FRH) with three fully connected hidden layers ( units for each layer) is designed to fine-tune the ROIs generated by the RPN. It produces multi-task outputs, i.e. the softmax probability, the 3D bounding box location and the orientation. We encode the location with the four-corner method introduced in [13], which uses the relative positions of the bounding box corners along the x and y axes of the LiDAR coordinate frame together with the height offsets from the ground plane. We also encode the object orientation in BEV explicitly. As explained in [13], explicitly modeling the orientation can remedy the angle wrapping problem, resulting in better object detection performance.
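To illustrate why an explicit orientation encoding helps with angle wrapping, here is a minimal sketch. The exact encoding used in the paper is not preserved in this text; the (cos, sin) pair below is one common choice and is an assumption, as are all function names.

```python
import math


def encode_orientation(theta):
    """Map an angle in radians to a (cos, sin) pair (assumed encoding)."""
    return (math.cos(theta), math.sin(theta))


def decode_orientation(vec):
    """Recover the angle in (-pi, pi] from a (cos, sin) pair."""
    return math.atan2(vec[1], vec[0])


def raw_angle_error(a, b):
    """Naive regression error on raw angles; jumps at the +/-pi boundary."""
    return abs(a - b)


def vector_error(a, b):
    """Euclidean distance between the two encodings; stays small across the boundary."""
    ca, sa = encode_orientation(a)
    cb, sb = encode_orientation(b)
    return math.hypot(ca - cb, sa - sb)


# Two headings that are almost identical but sit on opposite sides of +/-pi.
a = math.pi - 0.01
b = -math.pi + 0.01
print(raw_angle_error(a, b))  # large, although the headings nearly coincide
print(vector_error(a, b))     # small, as expected for nearly equal headings
```

With raw angles, a regression target close to the wrap-around point produces a large residual for a nearly correct prediction; the vector encoding avoids this discontinuity.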

III-B Modeling Heteroscedastic Aleatoric Uncertainties

As introduced in Sec. I, heteroscedastic aleatoric uncertainties indicate data-dependent observation noises in the LiDAR sensor. For example, a distant or occluded object should yield high aleatoric uncertainties, since only a few LiDAR points represent it. In our proposed robust LiDAR 3D object detector, we extract aleatoric uncertainties in both RPN and FRH.

Let us denote an input LiDAR BEV feature image as , and a region of interest produced by RPN as . We also refer to and as the classification labels for RPN and FRH outputs respectively, with indicating “object” and “background” class. We denote as the network weights and as the noise-free outputs of the object detection network.

We use multivariate Gaussian distributions with diagonal covariance matrices to model the observation likelihood for the anchor position , 3D bounding box location and orientation :


where , and refer to the observation noise vectors, each element of which indicates an aleatoric uncertainty scalar corresponding to an element in , and .
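For reference, a Gaussian observation likelihood with a diagonal covariance matrix is commonly written in the form below. The symbols are assumptions introduced for this sketch, since the original notation is not preserved here: $\mathbf{x}$ is the input, $\mathbf{W}$ the network weights, $\mathbf{f}^{\mathbf{W}}(\mathbf{x})$ the noise-free network output, $\mathbf{t}$ a regression target of dimension $D$, and $\boldsymbol{\sigma}$ the corresponding observation noise vector.

```latex
p\left(\mathbf{t} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)
  = \mathcal{N}\left(\mathbf{f}^{\mathbf{W}}(\mathbf{x}),\,
      \operatorname{diag}\left(\boldsymbol{\sigma}^{2}\right)\right)
  = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{d}^{2}}}
    \exp\left(-\frac{\left(t_{d} - f_{d}^{\mathbf{W}}(\mathbf{x})\right)^{2}}
                    {2\sigma_{d}^{2}}\right)

-\log p\left(\mathbf{t} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)
  = \sum_{d=1}^{D} \left[
      \frac{\left(t_{d} - f_{d}^{\mathbf{W}}(\mathbf{x})\right)^{2}}{2\sigma_{d}^{2}}
      + \frac{1}{2}\log\sigma_{d}^{2}
    \right] + \frac{D}{2}\log 2\pi
```

The negative log-likelihood shows the two ingredients that an uncertainty-aware regression loss typically inherits: a residual term scaled by the inverse variance and a log-variance penalty that prevents the predicted noise from growing without bound.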

For classification tasks, the observation likelihood and can be represented by softmax functions:


Here, we do not explicitly model the aleatoric classification uncertainties, as they are already contained in the softmax scores, which follow the categorical distribution.

The uncertainty scores , , can be obtained by adding auxiliary output layers in the object detection network. To increase numerical stability and consider the positivity constraints, we use for regression. The regression outputs of RPN that model the aleatoric uncertainties can be formulated as , and for FRH they are and .

III-C Robust Multi-Loss Function

We incorporate aleatoric uncertainties , and in a multi-loss function for training our object detector via:


where we use the smooth L1 loss for the three regression tasks (anchor position, 3D bounding box location, and orientation) and the cross entropy loss for the two classification tasks (RPN and FRH), similar to [38]. Modeling auxiliary uncertainties in this multi-loss function has two effects. First, an uncertainty score can serve as a relative weight to a sub-loss. Thus, optimizing relative weights enables the object detector to balance the contribution of each sub-loss, allowing the network to be trained more easily. Second, aleatoric uncertainties can increase the network robustness against noisy input data. For an input sample with high aleatoric uncertainties, i.e. a noisy sample, the model decreases the residual regression loss because the uncertainty-dependent weight on that loss becomes small. Conversely, the network is encouraged to learn from the informative samples with low aleatoric uncertainty, because their residual regression loss is multiplied by a larger weight.
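The weighting behavior described above can be illustrated with a small sketch. It assumes the widely used attenuated form exp(-s) * loss + 0.5 * s with s = log(sigma^2) as the regressed log variance; the exact weighting of the paper's Eq. 3 is not preserved in this text, so the formula and all names below are assumptions rather than the authors' implementation.

```python
import math


def smooth_l1(residual):
    """Smooth L1 loss of a scalar residual with transition point 1.0."""
    a = abs(residual)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def attenuated_regression_loss(pred, target, log_var):
    """Sum over elements of exp(-s) * smooth_l1(pred - target) + 0.5 * s.

    s = log(sigma^2) is the predicted log variance (assumed parametrization).
    exp(-s) down-weights the residual of samples predicted to be noisy,
    while 0.5 * s penalizes predicting a large variance everywhere.
    """
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        total += math.exp(-s) * smooth_l1(p - t) + 0.5 * s
    return total


# A noisy sample (large residual): a higher predicted variance lowers the loss.
print(attenuated_regression_loss([2.0], [0.0], [0.0]))
print(attenuated_regression_loss([2.0], [0.0], [1.0]))

# An informative sample (small residual): a higher predicted variance raises the loss.
print(attenuated_regression_loss([0.1], [0.0], [0.0]))
print(attenuated_regression_loss([0.1], [0.0], [1.0]))
```

Regressing the log variance instead of the variance keeps the weight exp(-s) positive for any network output and avoids a division by a value close to zero.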

IV Experimental Results

IV-A Experimental Setup

IV-A1 Dataset and Input Representation

We evaluate the performance of our proposed method on the “Car” category from the KITTI object detection benchmark [40]. We use the LiDAR point cloud within the range - meters in the LiDAR coordinate frame. The point clouds are discretized into height slices along the axis with meters resolution and the length and width are discretized with meters resolution, similar to [13]. After incorporating a density feature map, the input LiDAR point clouds are represented by the feature maps with size .

IV-A2 Implementation Details

The RPN network and the Fast-RCNN head are trained jointly in an end-to-end fashion. The background anchors are ignored for the RPN regression loss. An anchor is assigned to the “Car” class when its Intersection over Union (IoU) with ground truth in the BEV is larger than and to “background” if it is below . Anchors that are neither “Car” nor “background” do not contribute to the learning. Besides, we apply Non-Maximum Suppression (NMS) with the threshold on the region proposals to reduce redundancy, keeping proposals during the training process and during the test time. To train the Fast-RCNN head, a ROI is labeled to be positive when its 2D IoU overlap with ground truth in BEV is larger than and negative when it is less than  [13]. We train the network with the Adam optimizer. The learning rate is initialized as and decayed exponentially for every steps with a decay factor . We also use Dropout and regularization to prevent over-fitting. We first train the network without aleatoric uncertainties for steps and then use the robust multi-loss function (Eq. 3) for another steps. We find that the network converges faster following this training strategy.
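The anchor labeling and proposal filtering steps above can be sketched as follows. The threshold values, the proposal budget, and the axis-aligned box format are hypothetical placeholders, because the actual values are not preserved in this text; the function names are also illustrative.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0.0 or ih <= 0.0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def assign_anchor_label(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.3):
    """Return 1 ("Car"), 0 ("background"), or None (ignored in training).

    pos_thresh and neg_thresh are hypothetical values.
    """
    best = max((bev_iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return None


def nms(boxes, scores, iou_thresh=0.8, max_keep=300):
    """Greedy NMS; returns indices of kept boxes in descending score order.

    iou_thresh and max_keep are hypothetical values.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(bev_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep
```

Anchors that receive `None` correspond to the ones that are neither "Car" nor "background" and therefore contribute nothing to the loss.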

IV-B Comparison with State-of-the-art Methods

We first compare the performance of our proposed network PROD (Ours) with the baseline method which does not explicitly model aleatoric uncertainties, as well as other state-of-the-art methods (see Tab. I and Tab. II). For a fair comparison, we only consider LiDAR-only methods. We use 3D Average Precision for 3D detection (AP_3D) and bird’s eye view Average Precision for 3D localization (AP_BEV). The AP values are calculated at an Intersection over Union (IoU) threshold of 0.7 as introduced in [40] unless mentioned otherwise.
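As a rough illustration of how an AP value is obtained from ranked detections, the sketch below computes an 11-point interpolated AP. It assumes that each detection has already been matched against the ground truth at the chosen IoU threshold, and that 11-point interpolation is the protocol in use; the benchmark's exact evaluation rules should be taken from [40]. The function name and inputs are illustrative.

```python
def average_precision_11pt(scores, is_true_positive, num_gt):
    """11-point interpolated AP for detections pooled over a dataset.

    scores: confidence per detection.
    is_true_positive: whether each detection matched a ground-truth box
        at the chosen IoU threshold (matching is assumed to be done already).
    num_gt: total number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    recalls = []
    precisions = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    ap = 0.0
    for k in range(11):
        r = k / 10.0
        # Interpolated precision: best precision at any recall >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += max(candidates, default=0.0) / 11.0
    return ap
```

Raising the IoU threshold turns some matches into false positives, which is why AP at 0.7 is stricter than AP at 0.5.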

Tab. I shows the performance on the KITTI test set. The baseline method performs similarly to the MV3D (BV+FV) network. By leveraging aleatoric uncertainties, PROD improves the 3D detection AP of the baseline by up to 7.15 points (moderate setting), and produces results comparable to PIXOR [27] and VoxelNet [5]. Tab. II shows the detection performance on the KITTI val set with the same train-val split introduced in [12]. By modeling aleatoric uncertainties, PROD improves the 3D detection AP of the baseline by up to 7.88 points (hard setting). It also outperforms all other methods in 3D detection AP for the moderate and hard settings. The experiments show the effectiveness of our proposed method.

                      3D detection AP (%)         BEV localization AP (%)
Method                Easy    Moderate  Hard      Easy    Moderate  Hard
3D FCN [3]            -       -         -         69.94   62.54     55.94
MV3D (BV+FV) [12]     66.77   52.73     51.31     85.82   77.00     68.94
PIXOR [27]            -       -         -         81.70   77.05     72.95
VoxelNet [5]          77.47   65.11     57.73     89.35   79.26     77.39
Baseline              70.46   55.48     54.43     84.94   76.70     69.03
Ours                  73.92   62.63     56.77     85.91   76.06     68.75
TABLE I: Comparison with state-of-the-art methods on KITTI test set.
                          3D detection AP (%)         BEV localization AP (%)
Method                    Easy    Moderate  Hard      Easy    Moderate  Hard
VeloFCN [4]               15.20   13.66     15.98     40.14   32.08     30.47
MV3D (BV+FV) [12]         71.19   56.60     55.30     86.18   77.32     76.33
F-PointNet (LiDAR) [15]   69.50   62.30     59.73     -       -         -
PIXOR [27]                -       -         -         86.79   80.75     76.60
VoxelNet [5]              81.97   65.46     62.85     89.60   84.81     78.57
Baseline                  71.50   63.71     57.31     86.33   76.44     69.72
Ours                      78.81   65.89     65.19     87.03   77.15     76.95
TABLE II: Comparison with state-of-the-art methods on KITTI val set.

IV-C Ablation Study

We then conduct an extensive study regarding where to model the uncertainties, network speed and memory, as well as a qualitative analysis. We use the train-val split introduced in [12] for evaluations.

IV-C1 Where to Model Aleatoric Uncertainties

In this experiment we study the effectiveness of modeling aleatoric uncertainties in different sub-networks of our LiDAR 3D object detector. To this end, we train another two detectors that capture uncertainties only in RPN (Aleatoric RPN) or only in FRH (Aleatoric FRH), and compare their 3D detection performance with the baseline method (Baseline) and our proposed method (Ours) that models uncertainties in both RPN and FRH. Tab. III lists the AP values and their improvements over Baseline on the KITTI easy, moderate, and hard settings. We find that modeling the aleatoric uncertainties in either RPN, FRH or both can improve the detection performance, while modeling the uncertainties in both networks brings the largest performance gain in the moderate and hard settings. Furthermore, we evaluate the detection performance on different LiDAR ranges, shown in Tab. IV. Again, our method consistently improves the detection performance compared to Baseline. Modeling the uncertainties in both RPN and FRH shows the highest improvement in the 35-50 meter range, indicating that aleatoric uncertainties in both networks can complement each other. As we will demonstrate in the following section (Sec. IV-D), our proposed network handles cars from easy, moderate, hard, near-range or long-range settings differently. It learns to adapt to noisy data, resulting in improved detection performance.

Method           Easy            Moderate        Hard
Baseline         71.50           63.71           57.31
Aleatoric RPN    72.92 (+1.42)   63.84 (+0.13)   58.61 (+1.30)
Aleatoric FRH    81.07 (+9.57)   65.51 (+1.80)   65.09 (+7.78)
Ours             78.81 (+7.31)   65.89 (+2.18)   65.19 (+7.88)
TABLE III: Ablation study - Comparison of 3D car detection performance for easy, moderate, and hard settings on KITTI val set.
Method           0-20 (m)        20-35 (m)       35-50 (m)       50-70 (m)
Baseline         72.42           78.96           57.87           26.17
Aleatoric RPN    80.86 (+8.44)   79.72 (+0.76)   65.44 (+7.57)   30.54 (+4.37)
Aleatoric FRH    79.10 (+6.68)   83.89 (+4.93)   61.98 (+4.11)   29.67 (+3.50)
Ours             80.78 (+8.36)   84.75 (+5.79)   66.81 (+8.94)   34.07 (+7.90)
TABLE IV: Ablation study - Comparison of 3D car detection performance for data within different distances from KITTI val set. We use an IoU threshold of 0.5 in this experiment.

IV-C2 Runtime and Number of Parameters

We use the runtime and the number of parameters to evaluate the computational efficiency and memory requirement. Tab. V shows the results of PROD relative to the baseline network. Predicting aleatoric uncertainties requires only 26112 additional parameters (+0.07%) and increases the inference time by 2.86%, showing the high efficiency of our proposed method.

Method Number of parameters Runtime
Data pre-processing Inference
Ours +26112 (+0.07%) (+2.86%)
TABLE V: Comparison of runtime and the number of parameters. The runtime is measured by averaging the predictions over all KITTI val set on a TITAN X GPU.

IV-C3 Qualitative Analysis

Fig. 2 shows some exemplary results. In general, our proposed method can detect cars with higher recall, especially at long distance or in highly-occluded regions (e.g. Fig. 2). However, the network also tends to mis-classify objects with car-like shapes, e.g. in Fig. 2 the network incorrectly predicts the fences on the bottom-left side as a car. Such failures could be avoided by fusing the image features from vision cameras.

Fig. 2: Some car detection examples on KITTI val set within the range - meters. The LiDAR points which are out of the camera front view are not evaluated. The predictions from our proposed network (Ours) are visualized in blue, the predictions from the baseline method in red, and the labeled samples in green. Best viewed zoomed in.

IV-D Understanding Aleatoric Uncertainties

We finally conduct comprehensive experiments to understand how aleatoric uncertainties behave. We use the scalar Total Variance (TV) introduced in [24] to quantify aleatoric uncertainties. A large TV score indicates high uncertainty. We also use the Pearson Correlation Coefficient (PCC) to measure linear correlation. We study RPN uncertainties and FRH uncertainties. The RPN uncertainties indicate observation noises when predicting anchor positions, whereas the FRH uncertainties indicate the noises for final bounding box predictions. The latter consist of FRH location uncertainties for the bounding box regression and FRH orientation uncertainties for heading predictions. We evaluate all predictions with a score larger than from the KITTI val set unless mentioned otherwise.
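The two summary statistics used in this section can be sketched as follows. The sketch assumes that TV is the trace of the predicted covariance matrix, which for a diagonal covariance is the sum of the per-dimension variances, and that the network outputs log variances; the definition actually used should be taken from [24]. Function names are illustrative.

```python
import math


def total_variance(variances):
    """Trace of a diagonal covariance matrix: the sum of its variances."""
    return sum(variances)


def total_variance_from_log(log_variances):
    """Same quantity when the network outputs s = log(sigma^2) (assumed)."""
    return sum(math.exp(s) for s in log_variances)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A PCC close to 1 or -1 indicates a strong linear relationship between two quantities, for example between the TV of the RPN outputs and the TV of the FRH outputs.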

IV-D1 Relationship Between Uncertainties

Fig. 3(a) and Fig. 3(b) show the prediction distribution of FRH uncertainties with RPN uncertainties, as well as FRH location uncertainties with orientation uncertainties, respectively. The uncertainties are highly correlated with each other, indicating that our detector has learned to represent LiDAR observation noises at different sub-networks and prediction outputs.

Fig. 3: (a) The distribution of uncertainties in FRH ( axis) and in RPN ( axis). Each scatter represents a prediction. (b) The distribution of uncertainties for FRH location ( axis) and orientation prediction ( axis).
Fig. 4: (a) The prediction distribution of FRH orientation uncertainties w.r.t angle values (angular axis). (b) The average orientation uncertainties with orientation difference between predicted angles and the nearest base angles.
Fig. 5: (a) Average RPN and FRH uncertainties with the increasing softmax scores for anchor and final object classification. (b) Average uncertainties with detection distance.
Fig. 6: Histogram of FRH uncertainties for the (a) easy, (b) moderate, and (c) hard settings in the KITTI val set.
Fig. 7: Exemplary detections with the total variance of uncertainties distributed equally from -3 to -1 at log scale. The aleatoric uncertainty decreases from left to right. The 3D bounding boxes are predicted by PROD and are projected onto the camera coordinate frame for visualization purposes only. We also show the softmax score, object distance, level of occlusion (fully visible, partly occluded and difficult to see) as well as orientation for each detection.

IV-D2 Orientation Uncertainties

Fig. 4(a) illustrates the prediction distribution of FRH orientation uncertainties (radial axis) w.r.t. angle values (angular axis) in polar coordinates. Most predictions lie at four base angles. Fig. 4(b) shows the average orientation uncertainties against the orientation difference between predicted angles and the nearest base angles. They are highly correlated with PCC=0.99, showing that the network produces high observation noises when predicting car headings that are different from the base angles.
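The orientation difference to the nearest base angle can be computed as in the sketch below. The values of the four base angles are not preserved in this text; the sketch assumes they are 0, 90, 180 and 270 degrees, which is a hypothetical choice, and the function name is illustrative.

```python
BASE_ANGLES_DEG = (0.0, 90.0, 180.0, 270.0)  # assumed base angles


def diff_to_nearest_base_angle(theta_deg):
    """Smallest circular distance in degrees from theta to any base angle."""
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return min(circular_distance(theta_deg, base) for base in BASE_ANGLES_DEG)
```

Binning detections by this difference and averaging the orientation uncertainty per bin gives the kind of curve that can then be checked for linear correlation.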

IV-D3 Relationship Between Softmax Score, Detection Distance and Uncertainties

In Fig. 5(a) we plot the average RPN and FRH uncertainties against increasing softmax scores for anchor and final object classification. We find a strong negative correlation between them. As introduced by Eq. 2, the softmax scores can be regarded as aleatoric uncertainties for classification. This indicates that our network has learned to adapt the uncertainties in regression tasks (i.e. anchor position, bounding box location and orientation) to those in classification, i.e. the uncertainties increase as the softmax score decreases. Fig. 5(b) shows that the average RPN and FRH uncertainties become larger as detection distance increases. This is because the LiDAR sensor measures fewer reflections from a distant car, which yields high observation noises.

IV-D4 Uncertainty Distribution for Easy, Moderate, and Hard Settings

We finally evaluate the FRH uncertainty distribution for easy, moderate, and hard objects, shown in Fig. 6. The uncertainty distributions vary: for the easy setting there are more predictions with lower uncertainties, whereas for hard objects, which have larger occlusions, truncations, and detection distances, the network estimates higher uncertainties. The result indicates that the network has learned to treat objects from these three settings differently.

IV-D5 Qualitative Observations

In Fig. 7 we visualize nine exemplary detections whose uncertainties are equally distributed at log scale, ranging from -3 to -1. We observe: (1) The uncertainties are influenced by occlusion, detection distance, orientation as well as softmax score, as discussed above. (2) The network shows higher aleatoric uncertainties if there are fewer points around the car. The results show that our network has captured reasonable LiDAR observation noises.

V Discussion and Conclusion

We have presented our robust LiDAR 3D object detection network called PROD that leverages heteroscedastic aleatoric uncertainties to significantly improve detection performance. Trained with a multi-loss function which incorporates aleatoric uncertainties, PROD learns to adapt to noisy data and increases the average precision by up to nearly 9%, producing state-of-the-art results on the KITTI object detection benchmark. Our method only requires modifying the cost function and output layers, with only 2.86% additional inference time (Tab. V). This makes our method suitable for real-time autonomous driving applications.

We have qualitatively analyzed how PROD learns to deal with noisy data in Sec. IV-C. The network tends to predict high uncertainties for detections that are highly occluded (Fig. 7), far from the ego-vehicle (Fig. 5), with different orientations from the base angles (Fig. 4) or with low objectness score (Fig. 5). Therefore, our network learns less from noisy samples during the training process by penalizing the multi-loss objective with the uncertainty terms (Eq. 3). Conversely, the network is encouraged to learn more from the informative training samples with more LiDAR reflections. In this way, its robustness against noisy data is enhanced, resulting in improved detection performance for data from easy, moderate, hard settings (Tab. III) or at different distances (Tab. IV). Note that this effect arises because we explicitly model the observation noises, rather than from an ad-hoc solution.

Compared to Focal Loss [41], which incorporates prediction probability in the loss function to tackle the positive-negative sample imbalance problem, our proposed robust multi-loss function works in the opposite way: it down-weights “outlier” training samples with high aleatoric uncertainties, and encourages the network to learn from those with small errors. Investigating how a network behaves with Focal Loss and our proposed method on different driving datasets is interesting future work.

In the future, we intend to improve PIXOR [27] and VoxelNet [5] by incorporating aleatoric uncertainties. We also intend to investigate the influence of object reflectance and bounding box encodings on uncertainty estimations with our own test vehicles.


We thank Zhongyu Lou for his suggestions and inspiring discussions. We also thank Xiao Wei for reading the script.