With the rise of autonomous driving application scenarios and the emergence of large-scale annotated datasets (e.g., KITTI and Waymo), 3D object detection has attracted much attention in both industry and academia.
In recent years, deep learning frameworks have been extensively investigated for 3D object detection. However, many object-level annotations are inevitably inaccurate or ambiguous, which can confuse the learning process of the detection model. For example, in the data collection phase, the captured point clouds are typically incomplete due to environmental occlusion and the intrinsic properties of LiDAR sensors. In the data labeling phase, ambiguity occurs when human annotators subjectively estimate object locations and shapes from 2D images or partial 3D points. Concretely, as illustrated in Fig. 1, an incomplete LiDAR observation can correspond to multiple potentially plausible labels, and similar point cloud objects can be annotated with significantly varying bounding boxes. This phenomenon motivates us to carefully consider and exploit label uncertainty throughout the whole dataset.
Currently, the dominant 3D object detectors are designed as deterministic learning frameworks, in which label ambiguity is entirely ignored in the bounding box regression branch. To address this issue, another family of probabilistic detectors [9, 19, 4, 5] introduces uncertainty estimation by modeling the predictions as Gaussian distributions. However, these detectors simply model every ground-truth bounding box as a Dirac delta function with zero uncertainty and thus ignore the noise in labels. Within the probabilistic detection paradigm, a limited number of studies quantify label uncertainty using simple heuristics or Bayes estimation. However, the heuristic approach can be less reliable due to its insufficient modeling capacity, and the assumption of conditional probabilistic independence between point clouds made by the Bayes approach is often untenable in practice. In addition, both approaches [18, 36] tend to predict the bounding box uncertainty as a whole rather than dimension by dimension, which ignores the fact that the variance typically differs across dimensions.
In this paper, we aim to explicitly model the one-to-many relationship between a typical 3D object and its potentially plausible bounding boxes under a learning-based framework. Technically, we present GLENet, a novel deep generative network adapted from conditional variational auto-encoders (CVAE), which introduces a latent variable to capture the distribution over potentially plausible bounding boxes of point cloud objects. During testing, we sample latent variables multiple times to generate diverse bounding boxes, the variance of which is taken as label uncertainty to guide the learning of localization uncertainty estimation in the downstream detection task. Besides, motivated by the observation that the uncertainty predicted by probabilistic detectors is correlated with localization quality, as illustrated in Fig. 3, we further propose the Uncertainty-aware Quality Estimator (UAQE), which facilitates the training of the IoU-branch with the predicted uncertainty information. To demonstrate the effectiveness and universality of our method, we integrate GLENet into several popular 3D object detection frameworks to build powerful probabilistic detectors. Experiments on the KITTI and Waymo datasets demonstrate that our method brings stable performance gains and achieves state-of-the-art results.
We summarize the main contributions of this paper as follows.
- We introduce a general and unified deep learning-based paradigm to generate reliable label uncertainty, and further extend it as an auxiliary regression target to improve 3D object detection.
- We present a deep generative model adapted from CVAE to capture the one-to-many relationship between incomplete point cloud objects and the potentially plausible ground-truth bounding boxes.
- Inspired by the strong correlation between the localization quality and the predicted variance in probabilistic detectors, we propose UAQE to facilitate the training of the IoU-branch.
The remainder of the paper is organized as follows. Section 2 reviews related work on LiDAR-based detectors and existing label uncertainty estimation methods. Section 3 describes our architecture and the strategy to estimate label uncertainty. In Section 4, we conduct experiments on the KITTI dataset and the Waymo Open Dataset to demonstrate that our method enhances existing 3D detectors, together with ablation studies that analyze the effect of different components. Finally, Section 5 concludes the paper.
2 Related Work
2.1 LiDAR Point Cloud-based 3D Object Detection
Existing 3D object detectors can be categorized into single-stage and two-stage frameworks. For single-stage detectors, Zhou et al. proposed to convert raw point clouds to regular volumetric representations and adopted voxel-based feature encoding. Yan et al. presented a more efficient sparse convolution. Lang et al. converted point clouds to sparse pseudo-images using pillars. Shi et al. aggregated point information via a graph structure. He et al. introduced point segmentation and center estimation as auxiliary tasks in the training phase to enhance model capacity. Zheng et al. constructed an SSFA module for robust feature extraction and a multi-task head for confidence rectification, and proposed DI-NMS for post-processing. For two-stage detectors, Shi et al. exploited a voxel-based network to learn the additional spatial relationship between intra-object parts under the supervision of 3D box annotations. Shi et al. proposed to directly generate 3D proposals from raw point clouds in a bottom-up manner, using semantic segmentation to identify the valid points from which detection boxes are regressed. The follow-up work further proposed PointsPool to convert sparse proposal features to compact representations and used spherical anchors to generate accurate proposals. Shi et al. utilized both point-based and voxel-based methods to fuse multi-scale voxel and point features. Deng et al. proposed voxel RoI pooling to extract RoI features from coarse voxels.
Compared with 2D object detection, boundary ambiguity is more serious in 3D object detection due to occlusion and missing signals. Studies such as SPG try to use point cloud completion methods to restore the full shape of objects and improve detection performance [39, 21]. However, it is non-trivial to generate complete and precise shapes from incomplete point clouds alone.
2.2 Probabilistic 3D Object Detector
Most object detectors produce a deterministic box with a confidence score for each detection. While the confidence score represents the existence and semantic confidence, it cannot reflect the localization uncertainty. By contrast, probabilistic object detectors [9, 19, 14, 34] estimate the probabilistic distribution of predicted bounding boxes rather than taking them as deterministic results. A representative approach models the predicted boxes as Gaussian distributions, the variance of which indicates the localization uncertainty. With the KL loss, the regression branch is expected to output a larger variance, and thereby incur a smaller loss, for inaccurate localization estimates. However, most probabilistic detectors take the ground-truth bounding box as a deterministic Dirac delta distribution and ignore the ambiguity in labels. The localization variance is therefore learned in an unsupervised manner, which may result in sub-optimal localization precision and erratic training.
2.3 Label Uncertainty Estimation
Label uncertainty estimation and uncertainty estimation in detectors are two different tasks. The former focuses on estimating the uncertainty of annotated labels with an independent framework, while the latter aims to predict the uncertainty of detected results with a branch in the detector.
Only a limited number of previous works focus on quantifying the uncertainty statistics of ground-truth bounding boxes in the 3D object detection task. Meyer et al. proposed to model label uncertainty by the IoU between the label bounding box and the corresponding convex hull of the aggregated LiDAR observations. Feng et al. proposed a Bayes method to estimate label noise by quantifying the matching degree of point clouds for the given bounding box with a Gaussian Mixture Model. However, its assumption of conditional probabilistic independence between point clouds is often untenable in practice. Besides, [18, 36] only produce the uncertainty of the box as a whole instead of the uncertainty of each dimension.
The conditional variational autoencoder (CVAE) is a powerful tool for controllable generative tasks, which has been applied to a wide range of language [47, 45, 35, 13] and vision [38, 24, 22, 46] processing scenarios. Inspired by works that apply generative models to handle ambiguities in conversations, where a question may have multiple suitable responses, we propose GLENet based on CVAE to capture the one-to-many relationship between incomplete point cloud objects and potentially plausible bounding box labels. Compared with previous methods, our method can estimate the label uncertainty in each of the 7 box dimensions. Although previous studies have incorporated VAE into point cloud applications [20, 43], we are the first to employ CVAE in 3D object detection for label uncertainty modeling.
3 Proposed Method
As aforementioned, label ambiguity widely exists in 3D object detection scenarios and has adverse effects on the deep model learning process, which is not well addressed or even completely ignored by previous works. To this end, we propose GLENet, a generic and unified deep learning framework that generates label uncertainty by modeling the one-to-many relationship between point cloud objects and potentially plausible bounding box labels, and extend it as an auxiliary regression objective to enhance 3D object detection performance.
In what follows, we will explicitly formulate the label uncertainty estimation problem from the probabilistic distribution perspective, followed by the technical implementation of GLENet in Sec. 3.1. After that, we introduce a unified way of integrating the label uncertainty statistics predicted by GLENet into the existing 3D object detection frameworks to build more powerful probabilistic detectors in Sec. 3.2.
3.1 Estimating Label Uncertainty with GLENet
3.1.1 Problem Formulation
Let $C = \{c_i\}_{i=1}^{n}$ denote a set of $n$ observed LiDAR points belonging to an object, where each $c_i \in \mathbb{R}^3$ is a 3D point represented by its spatial coordinates. Let $X$ be the annotated ground-truth bounding box of $C$, parameterized by the center location $(c_x, c_y, c_z)$, the size (length $l$, width $w$, and height $h$), and the orientation $r$, i.e., $X = [c_x, c_y, c_z, w, l, h, r] \in \mathbb{R}^7$.
To quantify label uncertainty, we propose to model $p(X|C)$, i.e., the distribution of potentially plausible bounding boxes $X$ conditioned on the point cloud $C$, and take its variance as the uncertainty. Because directly modeling $p(X|C)$ can be intractable and result in inaccurate distribution estimation, we resort to Bayes' theorem and introduce an intermediate variable $z$ to reformulate the conditional distribution as $p(X|C) = \int_z p(X|z, C)\, p(z|C)\, dz$, in which $p(X|z, C)$ and $p(z|C)$ are deduced through neural networks parameterized by $\theta$. Meanwhile, inspired by the optimization process of CVAE, we regularize the variable $z$ by maximizing the variational lower bound of the conditional log-likelihood $\log p(X|C)$:

$$\log p(X|C) \ge \mathbb{E}_{q_\phi(z|X,C)}\left[\log p_\theta(X|z,C)\right] - KL\left(q_\phi(z|X,C)\,\|\,p_\theta(z|C)\right), \tag{1}$$

where $\mathbb{E}_{q_\phi(z|X,C)}$ denotes the expectation over the distribution $q_\phi(z|X,C)$, and $KL(\cdot\,\|\,\cdot)$ denotes the KL-divergence. The first (task) term enforces the prediction network to learn bounding box knowledge. The second term regularizes the distribution of $z$ by minimizing the KL-divergence between $q_\phi(z|X,C)$ and $p_\theta(z|C)$, in which the auxiliary distribution $q_\phi(z|X,C)$ is introduced to approximate the true posterior $p(z|X,C)$.
The overall workflow of GLENet is illustrated in Fig. 2. We assume that the prior distribution $p_\theta(z|C)$ and the auxiliary posterior distribution $q_\phi(z|X,C)$ follow the multivariate Gaussian distributions $\mathcal{N}(\mu_z, \sigma_z^2)$ and $\mathcal{N}(\mu'_z, \sigma'^2_z)$, respectively. Here, $(\mu_z, \sigma_z)$ and $(\mu'_z, \sigma'_z)$ denote the vectorized parameters of the Gaussian distributions learned by the prior network and the recognition network, respectively. When modeling $p_\theta(X|z,C)$, we employ a context encoder to embed the input points $C$ into geometric feature representations $F_C$, which are combined with a sample of $z$ and jointly fed into a prediction network to learn the bounding box distribution.
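The variational lower bound used to regularize the latent variable follows from Jensen's inequality. The short derivation below is a standard CVAE argument rather than something specific to this paper; the symbols $X$ (box), $C$ (points), $z$ (latent variable), $p_\theta$ (prior and prediction networks) and $q_\phi$ (recognition network) are notation assumed here for illustration.

```latex
\begin{aligned}
\log p(X|C)
  &= \log \int p_\theta(X|z,C)\, p_\theta(z|C)\, dz \\
  &= \log \mathbb{E}_{q_\phi(z|X,C)}
     \left[ \frac{p_\theta(X|z,C)\, p_\theta(z|C)}{q_\phi(z|X,C)} \right] \\
  &\ge \mathbb{E}_{q_\phi(z|X,C)}\left[ \log p_\theta(X|z,C) \right]
     - \mathbb{E}_{q_\phi(z|X,C)}
       \left[ \log \frac{q_\phi(z|X,C)}{p_\theta(z|C)} \right] \\
  &= \mathbb{E}_{q_\phi(z|X,C)}\left[ \log p_\theta(X|z,C) \right]
     - KL\left( q_\phi(z|X,C) \,\|\, p_\theta(z|C) \right).
\end{aligned}
```

The inequality is tight when $q_\phi(z|X,C)$ equals the true posterior, which is why the auxiliary distribution is trained to approximate it.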
3.1.2 Prior Network and Recognition Network
For the prior network, we adopt PointNet for point feature embedding and add additional MLP layers to map the input points to the parameters of the prior distribution $p_\theta(z|C)$. For the recognition network, which we denote as $q_\phi(z|X,C)$, we adopt the same learning architecture as the prior network to generate point cloud embeddings, which are concatenated with ground-truth bounding box information and jointly fed into the subsequent MLP layers to learn the distribution $q_\phi(z|X,C)$. To facilitate the learning process, we encode the ground-truth bounding box into offsets relative to a predefined anchor and then normalize them as:

$$t_{c_x} = \frac{c_x^{gt}}{d^a},\quad t_{c_y} = \frac{c_y^{gt}}{d^a},\quad t_{c_z} = \frac{c_z^{gt}}{h^a},\quad t_w = \log\frac{w^{gt}}{w^a},\quad t_l = \log\frac{l^{gt}}{l^a},\quad t_h = \log\frac{h^{gt}}{h^a},\quad t_r = \sin(r^{gt}), \tag{2}$$

where ($w^a$, $l^a$, $h^a$) is the size of the predefined anchor located at the center of the point cloud, and $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the anchor box. We also take $\cos(r^{gt})$ as an additional input of the recognition network to handle the issue of angle periodicity.
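The anchor-relative encoding can be sketched in a few lines. This is a minimal illustration that assumes a SECOND-style encoding with the anchor placed at the origin of a centered point cloud; the function name `encode_box`, the exact form of each target, and the anchor size used in the example are assumptions, not values confirmed by the text.

```python
import math

def encode_box(box, anchor_size):
    """Encode a box (cx, cy, cz, w, l, h, r) relative to an anchor at the origin.

    anchor_size is (w_a, l_a, h_a). Returns the 7 regression targets and
    cos(r), which can be passed along as an extra input so that the
    sine-encoded angle is not ambiguous.
    """
    cx, cy, cz, w, l, h, r = box
    w_a, l_a, h_a = anchor_size
    d_a = math.hypot(w_a, l_a)  # diagonal of the anchor footprint
    targets = [
        cx / d_a,               # planar offsets normalized by the diagonal
        cy / d_a,
        cz / h_a,               # vertical offset normalized by anchor height
        math.log(w / w_a),      # log-ratio of sizes
        math.log(l / l_a),
        math.log(h / h_a),
        math.sin(r),            # periodic encoding of the orientation
    ]
    return targets, math.cos(r)

# Hypothetical car-sized anchor; a box identical to the anchor encodes to zeros.
anchor = (1.6, 3.9, 1.56)
targets, cos_r = encode_box((0.0, 0.0, 0.0, 1.6, 3.9, 1.56, 0.0), anchor)
```

Normalizing by the anchor keeps all seven targets on a comparable scale, which is the usual motivation for this kind of encoding.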
3.1.3 Context Encoder
We design a context encoder to generate the corresponding deterministic geometric feature vector $F_C$ from the given points $C$. As empirically observed in various related domains, it can be difficult to make use of latent variables when the decoder is sufficiently expressive to generate a plausible output using only the condition $C$. Therefore, we deploy a simplified PointNet architecture as the backbone of the context encoder to avoid posterior collapse.
3.1.4 Prediction Network
Given a condition $C$ and its bounding box $X$, we assume there is a true posterior distribution $p(z|X,C)$, and train the prediction network to restore $X$ from $z$ sampled from $p(z|X,C)$ and the context features $F_C$. To approximate $p(z|X,C)$ with the recognition network $q_\phi(z|X,C)$, we feed samples of $q_\phi(z|X,C)$ instead of $p(z|X,C)$ to the prediction network during training. In the inference phase, we feed the prediction network with $z$ sampled from the prior $p_\theta(z|C)$ instead of $q_\phi(z|X,C)$ to prevent the generated box from overfitting the given $X$.
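The switch between the two sources of latent samples can be illustrated as follows. This is a minimal sketch assuming diagonal Gaussians and the usual reparameterization $z = \mu + \sigma\epsilon$; the function names and the `(mu, sigma)` tuple layout are hypothetical.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def choose_latent(prior, posterior, training, rng):
    """Pick the distribution that feeds the prediction network.

    prior and posterior are (mu, sigma) pairs produced by the prior network
    and the recognition network. Training samples from the posterior, which
    has seen the ground-truth box; inference samples from the prior, which
    has only seen the points.
    """
    mu, sigma = posterior if training else prior
    return sample_latent(mu, sigma, rng)

rng = random.Random(0)
z_train = choose_latent(([0.0] * 8, [1.0] * 8), ([0.3] * 8, [0.5] * 8), True, rng)
z_infer = choose_latent(([0.0] * 8, [1.0] * 8), ([0.3] * 8, [0.5] * 8), False, rng)
```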
3.1.5 Objective Function
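As formulated in Eq. (1), the whole optimization objective of GLENet consists of a task term and a regularization term. The task term is formulated as the reconstruction loss:

$$\mathcal{L}_{rec} = \mathcal{L}_{rec}^{reg} + \mathcal{L}_{rec}^{dir}, \tag{3}$$

where $\mathcal{L}_{rec}^{reg}$ denotes the Huber loss imposed on the encoded regression targets, as described in Eq. (2), and $\mathcal{L}_{rec}^{dir}$ denotes the binary cross-entropy loss used for direction classification.
For the regularization term, since $p_\theta(z|C)$ and $q_\phi(z|X,C)$ are re-parameterized as $\mathcal{N}(\mu_z, \sigma_z^2)$ and $\mathcal{N}(\mu'_z, \sigma'^2_z)$ through the prior network and the recognition network, we can define the regularization loss as:

$$\mathcal{L}_{KL} = KL\left(q_\phi(z|X,C)\,\|\,p_\theta(z|C)\right) = \log\frac{\sigma_z}{\sigma'_z} + \frac{\sigma'^2_z + (\mu'_z - \mu_z)^2}{2\sigma_z^2} - \frac{1}{2}. \tag{4}$$

Thus, the overall objective function can be formulated as

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda\, \mathcal{L}_{KL}, \tag{5}$$

where the hyperparameter $\lambda$ is set empirically and kept fixed in all experiments.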
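The regularization term has a closed form because both distributions are Gaussian. The sketch below evaluates the standard KL divergence between two diagonal Gaussians and combines it with a reconstruction loss through a weighted sum; the function names and the placement of the weight on the KL term are assumptions made for illustration.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def total_loss(rec_loss, kl_loss, kl_weight):
    """Weighted sum of the task term and the regularization term."""
    return rec_loss + kl_weight * kl_loss

# Posterior (recognition network) versus prior, one latent dimension each.
kl = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
loss = total_loss(rec_loss=0.2, kl_loss=kl, kl_weight=1.0)
```

The divergence is zero only when the recognition network and the prior network agree, so minimizing it pulls the posterior toward what can be inferred from the points alone.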
3.1.6 Label Uncertainty Estimation and Evaluation Metric of GLENet
Inspired by previous works, we assume that the ground-truth bounding box follows a Gaussian distribution $\mathcal{N}(X, \sigma^2)$, whose expectation is exactly the annotated value and whose variance indicates the uncertainty. This uncertainty can be approximated by the degree of dispersion in the distribution of potential bounding boxes $p(X|C)$. Although the intermediate variable $z$ is explicitly modeled by a learnable Gaussian distribution, it is still intractable to deduce the variance of $p(X|C)$ directly from the prediction network. We therefore adopt a Monte Carlo method to approximate $p(X|C)$ by sampling $z$ multiple times, and take the variance of the resulting predictions as the label uncertainty in seven dimensions, i.e., $\sigma^2 \in \mathbb{R}^7$.
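The Monte Carlo estimate can be sketched as below. The decoder here is a hypothetical stand-in for the prediction network, and the use of the population variance is an assumption, since the text does not say which estimator is used.

```python
import random
import statistics

def toy_decoder(z, context):
    """Hypothetical decoder: only the first box dimension depends on z."""
    box = list(context)
    box[0] += 0.5 * z[0]
    return box

def estimate_label_uncertainty(decoder, prior_mu, prior_sigma, context,
                               num_samples=30, seed=0):
    """Per-dimension variance of boxes decoded from repeated prior samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(prior_mu, prior_sigma)]
        predictions.append(decoder(z, context))
    # zip(*predictions) regroups the samples dimension by dimension.
    return [statistics.pvariance(dim) for dim in zip(*predictions)]

variances = estimate_label_uncertainty(
    toy_decoder, prior_mu=[0.0] * 8, prior_sigma=[1.0] * 8, context=[0.0] * 7)
```

With this toy decoder only the first dimension receives a non-zero variance, which mirrors the idea that uncertainty is reported separately for each box parameter.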
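Considering the unavailability of the true distribution of the ground-truth bounding box, we evaluate GLENet in a non-reference manner. To this end, we propose to compute the negative log-likelihood between the estimated distribution of the ground truth $p_D(X|C)$ and the distribution of the predictions $p_\theta(X|C)$:

$$\mathcal{L}_{NLL} = -\int p_\theta(X|C)\,\log p_D(X|C)\,dX \approx \frac{1}{S}\sum_{s=1}^{S}\sum_{k=1}^{7}\left[\frac{\left(\hat{t}^{(s)}_k - t_k\right)^2}{2\sigma_k^2} + \frac{\log \sigma_k^2}{2}\right] + \frac{7\log 2\pi}{2}, \tag{6}$$

where $S$ denotes the number of inference times, $t$ and $\hat{t}^{(s)}$ represent the regression targets and the predicted offsets of the $s$-th inference, respectively, and $k$ indexes the seven box dimensions. We estimate the integral by randomly sampling multiple prediction results, i.e., with the Monte Carlo method. $\mathcal{L}_{NLL}$ encourages the network to predict diverse plausible boxes with high variance for incomplete point clouds and precise boxes with low variance for high-quality point clouds.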
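A Gaussian negative log-likelihood of this kind can be evaluated as follows. This sketch assumes the estimated ground-truth distribution is a diagonal Gaussian centered on the regression targets, with the label variance supplied by the caller; the function name and argument layout are hypothetical.

```python
import math

def nll_metric(sampled_offsets, target, label_var):
    """Average Gaussian NLL of sampled predictions under N(target, label_var).

    sampled_offsets: list of S predictions, each a list of per-dimension offsets.
    target, label_var: per-dimension regression targets and variances.
    """
    total = 0.0
    for pred in sampled_offsets:
        for p, t, v in zip(pred, target, label_var):
            total += (p - t) ** 2 / (2.0 * v) + 0.5 * math.log(v) + 0.5 * math.log(2.0 * math.pi)
    return total / len(sampled_offsets)

# An overconfident estimate (tiny variance, wrong value) is penalized heavily.
confident_but_wrong = nll_metric([[1.0]], [0.0], [0.01])
cautious = nll_metric([[1.0]], [0.0], [1.0])
```

The squared-error term rewards larger variance when predictions are far from the target, while the log term penalizes variance that is larger than needed, so the metric is lowest when the spread matches the actual error.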
3.2 Building Probabilistic 3D Detectors with Label Uncertainty
Most existing probabilistic detectors model the prediction and the ground truth as a Gaussian distribution and a Dirac delta function, respectively. The probabilistic regression branch is trained with the KL loss:

$$\mathcal{L}_{reg}^{prob} = \frac{(t - \hat{t})^2}{2\hat{\sigma}^2} + \frac{\log \hat{\sigma}^2}{2} + \text{const}, \tag{7}$$

where $t$ is the regression target of the detector, $\hat{t}$ is the predicted offset, and $\hat{\sigma}^2$ is the predicted localization variance. Intuitively, the regression branch should output a larger variance, and thereby incur a smaller loss, for inaccurate localization estimates. However, considering the existence of label noise, the ground-truth distribution cannot be sufficiently described by a Dirac delta function with zero uncertainty. In contrast, GLENet is designed to boost existing 3D object detectors with probabilistic bounding box regression by learning label uncertainty, which serves as auxiliary supervision to facilitate learning the variance of the predicted bounding box in probabilistic 3D detectors.
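The behavior of the KL loss with a Dirac delta label can be checked numerically. The sketch below uses the standard Gaussian form with additive constants dropped; the function name is hypothetical.

```python
import math

def kl_loss_dirac_label(target, pred, pred_var):
    """KL loss for a Dirac delta label, up to additive constants."""
    return (target - pred) ** 2 / (2.0 * pred_var) + 0.5 * math.log(pred_var)

# For a prediction that is off by 2, raising the variance lowers the loss.
loss_small_var = kl_loss_dirac_label(0.0, 2.0, 1.0)
loss_large_var = kl_loss_dirac_label(0.0, 2.0, 4.0)

# For an exact prediction, the loss keeps falling as the variance shrinks.
loss_exact = kl_loss_dirac_label(0.0, 0.0, 0.01)
```

For a fixed error the loss is minimized when the predicted variance equals the squared error, so the variance is driven only by the residual and never by any knowledge of how reliable the label itself is.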
3.2.1 Incorporating Label Uncertainty into KL-Loss
We assume that the ground-truth bounding box follows a Gaussian distribution with variance $\sigma^2$, and approximate this label noise through GLENet. Then, we incorporate the generated label uncertainty into the KL loss between the distributions of the prediction and the ground truth in the detection head:

$$\mathcal{L}_{reg}^{prob} = \log\frac{\hat{\sigma}}{\sigma} + \frac{\sigma^2 + (t - \hat{t})^2}{2\hat{\sigma}^2} - \frac{1}{2}, \tag{8}$$

where $t$ denotes the regression target of the detector, $\hat{t}$ is the predicted offset, $\hat{\sigma}^2$ is the predicted variance, and $\sigma^2$ is the label uncertainty estimated by GLENet. Intuitively, given samples with high label uncertainty, the model is encouraged to predict a larger variance under the supervision of $\sigma^2$.
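The effect of the label variance on the predicted variance can be made explicit with a short derivation. It uses the standard KL divergence between two univariate Gaussians, with $\sigma^2$ for the label variance, $\hat{\sigma}^2$ for the predicted variance, $t$ for the regression target and $\hat{t}$ for the predicted offset; this notation is assumed here for illustration.

```latex
\begin{aligned}
\mathcal{L}(\hat{\sigma}^2)
  &= \frac{1}{2}\log\frac{\hat{\sigma}^2}{\sigma^2}
     + \frac{\sigma^2 + (t - \hat{t})^2}{2\hat{\sigma}^2} - \frac{1}{2}, \\
\frac{\partial \mathcal{L}}{\partial \hat{\sigma}^2}
  &= \frac{1}{2\hat{\sigma}^2}
     - \frac{\sigma^2 + (t - \hat{t})^2}{2\hat{\sigma}^4} = 0
  \quad\Longrightarrow\quad
  \hat{\sigma}^2 = \sigma^2 + (t - \hat{t})^2 .
\end{aligned}
```

For a fixed offset error, the loss-minimizing predicted variance is the label variance plus the squared error. As the label variance tends to zero this reduces to the squared error alone, which is the Dirac delta case.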
3.2.2 Uncertainty-aware Quality Estimator
Motivated by the strong correlation between the uncertainty and localization quality for each bounding box (see Fig. 3), we propose Uncertainty-aware Quality Estimator (UAQE) to facilitate the training of the IoU-branch and improve the IoU estimation accuracy.
Given the predicted uncertainty as input, we build a lightweight sub-module to generate a coefficient, which is multiplied by the original output of the IoU-branch to obtain the final estimate. The UAQE consists of two fully-connected (FC) layers with a Sigmoid activation at the output end.
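The structure of such a sub-module can be sketched in plain Python. The layer widths, the ReLU between the two FC layers, and all weights below are assumptions chosen for illustration; only the two-FC-plus-Sigmoid layout and the multiplication with the IoU output come from the description above.

```python
import math

def linear(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def uaqe_coefficient(uncertainty, w1, b1, w2, b2):
    """Two FC layers followed by a Sigmoid, giving a coefficient in (0, 1)."""
    hidden = [max(0.0, h) for h in linear(uncertainty, w1, b1)]  # ReLU (assumed)
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))

def refined_iou(raw_iou, uncertainty, params):
    """Scale the IoU-branch output by the uncertainty-dependent coefficient."""
    return raw_iou * uaqe_coefficient(uncertainty, *params)

# Hypothetical weights: 7 uncertainty inputs, 4 hidden units, 1 output.
w1 = [[-1.0] * 7 for _ in range(4)]
b1 = [1.0] * 4
w2 = [[1.0] * 4]
b2 = [0.0]
score = refined_iou(0.8, [0.05] * 7, (w1, b1, w2, b2))
```

With these toy weights a larger uncertainty lowers the hidden activations and therefore the coefficient, so boxes with uncertain localization receive a lower final quality score.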
3.2.3 3D Variance Voting
In probabilistic object detectors, the localization variance learned with the KL loss is interpretable and reflects the uncertainty of the predicted bounding boxes. Following prior work on variance voting, we propose 3D variance voting, which combines neighboring boxes to find a more precise representative box. Specifically, during the merging process, neighboring boxes that are closer and have a lower variance are assigned higher weights. Note that, for the detected box with the maximum score, neighboring boxes whose orientation differs greatly from it do not participate in the ensembling of angles.
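A weighted merge of this kind can be sketched as follows. The weighting function (a Gaussian-like kernel on IoU divided by the variance), the parameters `sigma_t` and `max_angle_diff`, and the assumption that IoUs with the top-scoring box are precomputed are all illustrative choices, not details given in the text.

```python
import math

def variance_voting(boxes, variances, ious, sigma_t=0.05, max_angle_diff=0.3):
    """Merge neighboring boxes into one box by variance-weighted averaging.

    boxes[0] is the top-scoring box; each box is (cx, cy, cz, w, l, h, r).
    variances[i][k] is the predicted variance of dimension k of box i.
    ious[i] is the (precomputed) IoU between boxes[i] and boxes[0].
    """
    ref_angle = boxes[0][6]
    voted = []
    for k in range(7):
        num = den = 0.0
        for box, var, iou in zip(boxes, variances, ious):
            # Closer boxes (high IoU) with low variance get larger weights.
            weight = math.exp(-((1.0 - iou) ** 2) / sigma_t) / var[k]
            if k == 6:
                # Signed angular difference wrapped to [-pi, pi).
                diff = (box[6] - ref_angle + math.pi) % (2.0 * math.pi) - math.pi
                if abs(diff) > max_angle_diff:
                    continue  # skip neighbors with very different headings
                num += weight * diff
            else:
                num += weight * box[k]
            den += weight
        voted.append(ref_angle + num / den if k == 6 else num / den)
    return voted

boxes = [[0.0, 0.0, 0.0, 1.6, 3.9, 1.5, 0.0], [0.2, 0.0, 0.0, 1.6, 3.9, 1.5, 1.5]]
merged = variance_voting(boxes, [[1.0] * 7, [0.25] * 7], [1.0, 0.9])
```

Angles are averaged as offsets from the reference heading so that wrap-around near plus or minus pi does not corrupt the result.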
4 Experiments
To reveal the effectiveness and universality of our method, we integrated GLENet into several popular types of 3D object detection frameworks to form probabilistic detectors, which were evaluated on two commonly used benchmark datasets, i.e., the Waymo Open Dataset and the KITTI dataset. Specifically, we start by introducing the experiment settings and implementation details in Sec. 4.1. After that, we report the detection performance of the resulting probabilistic detectors and make comparisons with previous state-of-the-art approaches in Sec. 4.2 and 4.3. In the end, we conduct a series of ablation studies to verify the necessity of different key components and configurations in Sec. 4.4.
4.1 Experiment Settings
4.1.1 Benchmark Datasets
KITTI dataset. The KITTI dataset contains 7481 training samples with annotations in the camera field of view and 7518 testing samples. According to the occlusion level, visibility, and bounding box size, the samples are further divided into three difficulty levels: easy, moderate, and hard. Following common practice, when performing experiments on the val set, we further split the training samples into a subset of 3712 samples for training and the remaining 3769 samples for validation. We report the performance on both the val set and the online test leaderboard for comparison, and we use all training data for the test server submission.
Waymo Open Dataset. The Waymo Open Dataset (WOD) is a large-scale autonomous driving dataset with more diverse scenes and object annotations in the full 360° field of view, which contains 798 sequences (158361 LiDAR frames) for training and 202 sequences (40077 LiDAR frames) for validation. The annotated objects are further divided into two difficulty levels: LEVEL_1 for boxes with more than five points and LEVEL_2 for boxes with at least one point. We report performance on both LEVEL_1 and LEVEL_2 objects using the recommended metrics, mean Average Precision (mAP) and mean Average Precision weighted by heading accuracy (mAPH).
4.1.2 Implementation Details
We trained GLENet on all annotated objects in the training set. As the initial input of GLENet, each point cloud object was uniformly pre-processed into 512 points via random subsampling or upsampling. We then centered each point cloud by subtracting the coordinates of its center point to eliminate the impact of translation.
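The pre-processing step can be sketched with the standard library alone. The text does not say how upsampling is done or which center point is used, so repeating randomly chosen points and subtracting the centroid are assumptions; the function names are hypothetical.

```python
import random

def resample_points(points, num=512, rng=None):
    """Return exactly `num` points by random subsampling or upsampling."""
    rng = rng or random.Random(0)
    if len(points) >= num:
        return rng.sample(points, num)  # subsample without replacement
    # Upsample by repeating randomly chosen points (assumed strategy).
    extra = [rng.choice(points) for _ in range(num - len(points))]
    return list(points) + extra

def center_points(points):
    """Subtract the centroid so the object is translation-invariant."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in points]

sparse_object = [(10.0, 2.0, 0.5), (10.4, 2.1, 0.6), (10.2, 1.9, 0.4)]
prepared = center_points(resample_points(sparse_object))
```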
Architecturally, we realized the prior network and recognition network with an identical PointNet structure consisting of three fully-connected layers with output dimensions (64, 128, 512), followed by another fully-connected layer to generate an 8-dim latent variable. To avoid posterior collapse, we particularly chose a lightweight PointNet structure with channel dimensions (8, 8, 8) in the context encoder. The prediction network concatenates the generated latent variable and context features and feeds them into subsequent fully-connected layers with channels (64, 64) before predicting offsets and directions.
4.1.3 Training and Inference Strategies
We adopted Adam ($\beta_1$=0.9, $\beta_2$=0.99) for the optimization of GLENet, which was trained for 400 epochs on KITTI and 40 epochs on Waymo with a batch size of 64 on 2 GPUs. We initialized the learning rate as 0.003 and updated it with the one-cycle policy.
In the training process, we applied common data augmentation strategies including random flipping, scaling, and rotation, in which the scaling factor was uniformly drawn from [0.95, 1.05] and the rotation angle was uniformly drawn from a fixed range. It is important to include multiple plausible ground-truth boxes in training, especially for incomplete point clouds, so we further propose an occlusion-driven augmentation approach, as illustrated in Fig. 5, after which a complete point cloud may look similar to another incomplete point cloud while their ground-truth boxes are completely different. To overcome posterior collapse, we also adopted KL annealing to gradually increase the weight of the KL loss from 0 to 1. To avoid overfitting, we followed k-fold cross-sampling and divided all training objects into 10 mutually exclusive subsets: each time, we trained GLENet on 9 subsets and made predictions on the remaining subset, so that label uncertainty estimates were obtained for the whole training set. During inference, we sampled the latent variable from the predicted prior distribution 30 times to form multiple predictions, the variance of which was used as the label uncertainty.
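The two training devices mentioned above, KL annealing and 10-fold cross-sampling, can be sketched as follows. The linear annealing schedule and the shuffled round-robin fold assignment are assumptions; the text specifies only the 0-to-1 weight range and the 10 mutually exclusive subsets.

```python
import random

def kl_weight(step, warmup_steps):
    """Linear KL annealing: rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def split_into_folds(num_objects, k=10, seed=0):
    """Shuffle object indices and deal them into k mutually exclusive folds."""
    indices = list(range(num_objects))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def out_of_fold_rounds(num_objects, k=10, seed=0):
    """For each fold, train on the other k-1 folds and predict on the held-out one."""
    folds = split_into_folds(num_objects, k, seed)
    rounds = []
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        rounds.append((train, folds[held_out]))
    return rounds

rounds = out_of_fold_rounds(num_objects=95)
```

Because every object is held out exactly once, each label uncertainty estimate comes from a model that never saw that object's annotation during training.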
| Method | Reference | Modality | Easy | Moderate | Hard | mAP |
|---|---|---|---|---|---|---|
| STD | ICCV 2019 | LiDAR | 87.95 | 79.71 | 75.09 | 80.92 |
| Part-A2 | TPAMI 2020 | LiDAR | 87.81 | 78.49 | 73.51 | 79.94 |
| 3DSSD | CVPR 2020 | LiDAR | 88.36 | 79.57 | 74.55 | 80.83 |
| SA-SSD | CVPR 2020 | LiDAR | 88.80 | 79.52 | 72.30 | 80.21 |
| PV-RCNN | CVPR 2020 | LiDAR | 90.25 | 81.43 | 76.82 | 82.83 |
| SE-SSD | CVPR 2021 | LiDAR | 91.49 | 82.54 | 77.15 | 83.73 |
| VoTR | ICCV 2021 | LiDAR | 89.90 | 82.09 | 79.14 | 83.71 |
| Pyramid-PV | ICCV 2021 | LiDAR | 88.39 | 82.08 | 77.49 | 82.65 |
| CT3D | ICCV 2021 | LiDAR | 87.83 | 81.77 | 77.16 | 82.25 |

Table I: Comparison with the state-of-the-art methods on the KITTI test set for vehicle detection, under the evaluation metric of 3D Average Precision (AP) with 40 sampling recall points. The best and second best results are highlighted in bold and underline, respectively.
4.1.4 Base Detectors
We integrated GLENet into three popular base detectors, i.e., SECOND, CIA-SSD, and Voxel R-CNN, to construct probabilistic detectors, which are dubbed GLENet-S, GLENet-C, and GLENet-VR, respectively. Specifically, we introduced an extra fully-connected layer on top of the detection head to estimate standard deviations along with the box locations. Meanwhile, we applied the proposed UAQE to GLENet-VR to facilitate the training of the IoU-branch. Note that we kept all the other network configurations in these base detectors unchanged for fair comparisons.
4.2 Evaluation on the KITTI Dataset
As shown in Table I, we compare GLENet-VR with the state-of-the-art detectors on the KITTI test set. We report the AP for each difficulty level and the mAP, which averages the APs of easy, moderate, and hard objects. As of March 29th, 2022, our method surpasses all published single-modal detection methods by a large margin.
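As a quick arithmetic check of how the mAP column relates to the per-difficulty APs, the sketch below averages the three APs of one published row of Table I. The helper name `mean_ap` is hypothetical.

```python
def mean_ap(easy, moderate, hard):
    """Average the APs of the three difficulty levels, rounded to two decimals."""
    return round((easy + moderate + hard) / 3.0, 2)

# PV-RCNN row of Table I: Easy 90.25, Moderate 81.43, Hard 76.82.
print(mean_ap(90.25, 81.43, 76.82))  # 82.83
```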
Table II lists the validation results of different detection frameworks on the KITTI dataset, from which we can observe that GLENet-S, GLENet-C, and GLENet-VR consistently outperform their corresponding baseline methods, i.e., SECOND, CIA-SSD, and Voxel R-CNN, by 4.79%, 4.78%, and 1.84% in terms of 3D R11 AP on the moderate car category. In particular, GLENet-VR achieves 86.36% AP on the moderate car class, which surpasses all other state-of-the-art methods. Besides, as a single-stage method, GLENet-C achieves 84.59% AP on the moderate car class, which is comparable to the existing two-stage approaches at a relatively lower inference cost. It is worth noting that our method is compatible with mainstream detectors and can be expected to achieve better performance when combined with stronger baselines.
Table II: Validation results of different detection frameworks on the KITTI dataset (car class, 3D AP).

| Method | Reference | R11 Easy | R11 Moderate | R11 Hard | R40 Easy | R40 Moderate | R40 Hard |
|---|---|---|---|---|---|---|---|
| Part-A2 | TPAMI 2020 | 89.47 | 79.47 | 78.54 | - | - | - |
| 3DSSD | CVPR 2020 | 89.71 | 79.45 | 78.67 | - | - | - |
| SA-SSD | CVPR 2020 | 90.15 | 79.91 | 78.78 | 92.23 | 84.30 | 81.36 |
| PV-RCNN | CVPR 2020 | 89.35 | 83.69 | 78.70 | 92.57 | 84.83 | 83.31 |
| SE-SSD | CVPR 2021 | 90.21 | 85.71 | 79.22 | 93.19 | 86.12 | 83.31 |
| VoTR | ICCV 2021 | 89.04 | 84.04 | 78.68 | - | - | - |
| Pyramid-PV | ICCV 2021 | 89.37 | 84.38 | 78.84 | - | - | - |
| CT3D | ICCV 2021 | 89.54 | 86.06 | 78.99 | 92.85 | 85.82 | 83.46 |
| SECOND | Sensors 2018 | 88.61 | 78.62 | 77.22 | 91.16 | 81.99 | 78.82 |
| CIA-SSD | AAAI 2021 | 90.04 | 79.81 | 78.80 | 93.59 | 84.16 | 81.20 |
| Voxel R-CNN | AAAI 2021 | 89.41 | 84.52 | 78.93 | 92.38 | 85.29 | 82.86 |
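Table III: Evaluation results on the Waymo Open Dataset.

| Method | LEVEL_1 3D mAP Overall | 0-30m | 30-50m | 50m-Inf | LEVEL_1 mAPH | LEVEL_2 3D mAP Overall | 0-30m | 30-50m | 50m-Inf | LEVEL_2 mAPH |
|---|---|---|---|---|---|---|---|---|---|---|
| Voxel R-CNN | 76.08 | 92.44 | 74.67 | 54.69 | 75.67 | 68.06 | 91.56 | 69.62 | 42.80 | 67.64 |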
4.3 Evaluation on the Waymo Open Dataset
The evaluation results of different approaches on both LEVEL_1 and LEVEL_2 of the Waymo Open Dataset are reported in Table III, which shows that our method contributes 2.44% and 1.24% improvements in terms of LEVEL_1 mAP for SECOND and Voxel R-CNN, respectively. The performance boost brought by our method is much more pronounced in the ranges of 30-50m and 50m-Inf. Intuitively, this is because distant point cloud objects tend to be sparser and thus suffer from more serious bounding box ambiguity. GLENet-VR achieves better performance than the existing methods, with 77.32% mAP and 69.68% mAP for the LEVEL_1 and LEVEL_2 difficulties, respectively.
4.4 Ablation Study
We conducted ablative analyses to verify the effectiveness and characteristics of our processing pipeline. In this section, all the involved model variants are built upon the Voxel R-CNN baseline and evaluated on the KITTI dataset, under the evaluation metric of average precision calculated with 40 recall positions.
4.4.1 Different Methods for Label Uncertainty Estimation
We compared our method with two other ways of estimating label uncertainty: 1) treating the label distribution as a deterministic Dirac delta distribution with zero uncertainty; 2) estimating the label uncertainty with simple heuristics, i.e., the number of points in the ground-truth bounding box or the IoU between the label bounding box and the convex hull of the aggregated LiDAR observations.
As shown in Table IV, our method consistently outperforms existing label uncertainty estimation paradigms. Compared with heuristic strategies, our deep generative learning paradigm can adaptively estimate label uncertainty statistics in 7 dimensions, instead of the uncertainty of bounding boxes as a whole, considering the variance in each dimension could be very different.
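| Method | Easy | Moderate | Hard |
|---|---|---|---|
| GLENet-VR w/ $\sigma^2$ (=0) | 92.48 | 85.37 | 83.05 |
| GLENet-VR w/ $\sigma^2$ (points num) | 92.46 | 85.58 | 83.16 |
| GLENet-VR w/ $\sigma^2$ (convex hull) | 92.33 | 85.45 | 82.81 |
| GLENet-VR w/ $\sigma^2$ (Ours) | 93.49 | 86.10 | 83.56 |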
4.4.2 Key Components of Probabilistic Detectors
We analyzed the contribution of different key components in our constructed probabilistic detectors and reported the results in Table V. According to the second row, we can conclude that training with the KL loss alone brings little performance gain. Using the label uncertainty generated by GLENet in the KL loss contributes 0.75%, 0.51%, and 0.3% improvements in the APs of the easy, moderate, and hard classes, respectively, which demonstrates its regularization effect on the KL loss (Eq. 7) and its ability to estimate more reliable uncertainty statistics of bounding box labels. Our UAQE module in the probabilistic detection head boosts the easy, moderate, and hard APs by 0.25%, 0.19%, and 0.15%, respectively, which demonstrates the effectiveness of UAQE in estimating the localization quality.
|KL loss||LU||var voting||UAQE||Easy||Moderate||Hard|
4.4.3 Influence of Data Augmentation
To generate similar point cloud shapes with diverse ground-truth bounding boxes during the training of GLENet, we proposed an occlusion data augmentation strategy that generates more incomplete point clouds while keeping the bounding boxes unchanged (see Fig. 5). As listed in Table VI, the occlusion data augmentation effectively enhances the performance of GLENet and the downstream detection task. Besides, the results also validate the effectiveness of the NLL metric, which is proposed to evaluate GLENet and to select optimal configurations for generating reliable label uncertainty.
4.4.4 Conditional Analysis
To figure out in what cases our method improves the base detector most, we evaluated GLENet-VR on different occlusion levels and distance ranges. As shown in Table VII, compared with the baseline, our method mainly improves on the heavily occluded and distant samples, which suffer from more serious boundary ambiguities of ground-truth bounding boxes.
Note: The results include separate APs for objects belonging to different occlusion levels and APs for moderate vehicle class in different distance ranges. Definition of occlusion levels: levels 0, 1 and 2 correspond to fully visible samples, partly occluded samples, and samples difficult to see respectively.
4.4.5 Inference Latency
We evaluated the inference speed of different baselines with a batch size of 1 on a desktop with an Intel CPU E5-2560 @ 2.10 GHz and an NVIDIA GeForce RTX 2080Ti GPU. As shown in Table VIII, our approach does not significantly increase the computational overhead. In particular, GLENet-VR takes only 0.6 ms more than the base Voxel R-CNN, since the number of candidates fed to variance voting is relatively small in two-stage detectors.
4.5 Qualitative Results
Fig. 6 visualizes the detection results using the proposed GLENet-VR and the baseline Voxel R-CNN. We observe that GLENet-VR can obtain better detection results with fewer false-positive bounding boxes and fewer missed heavily occluded and distant objects compared with the base detector on the KITTI dataset.
We also visualize some results from GLENet. As shown in Fig. 7, given a point cloud object, we can acquire potentially plausible bounding boxes with GLENet by sampling latent variables multiple times. In general, GLENet tends to predict diverse bounding boxes for objects represented by sparse point clouds and incomplete outlines, and consistently accurate bounding boxes for high-quality point cloud objects. Therefore, the variance of GLENet's multiple predictions can represent the label uncertainty in ground-truth bounding boxes.
5 Conclusion
We presented a general and unified deep learning-based paradigm for generative modeling of 3D object-level label uncertainty. Technically, we proposed GLENet, adapted from the CVAE learning framework, to capture the one-to-many relationship between incomplete point cloud objects and potentially plausible bounding boxes. As a plug-and-play component, GLENet generates reliable label uncertainty statistics that can be conveniently integrated into various types of 3D detection pipelines to build powerful probabilistic detectors. We verified the effectiveness and universality of our method by incorporating GLENet into several existing deep 3D object detectors, which showed stable improvements and achieved state-of-the-art performance on both the KITTI and Waymo datasets.
References
-  (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21. Cited by: §4.1.3.
-  End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §2.2.
-  Voxel r-cnn: towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 1201–1209. Cited by: §2.1, §3.1.5, §4.1.4, TABLE II, TABLE III, TABLE VII, TABLE VIII.
-  (2018) Towards safe autonomous driving: capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3266–3273. Cited by: §1.
-  Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 1280–1287. Cited by: §1.
-  Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §1, §1, §4.
-  (2017) Z-forcing: training stochastic recurrent networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6716–6726. Cited by: §3.1.3.
-  (2020-06) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, TABLE I, TABLE II.
-  (2019-06) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2, §3.1.6, §3.2.3.
-  (2020) Epnet: enhancing point features with image semantics for 3d object detection. In European Conference on Computer Vision, pp. 35–52. Cited by: TABLE I.
-  (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.3.
-  (2019-06) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, TABLE III.
-  Generating classical chinese poems via conditional variational autoencoder and adversarial training. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 3890–3900. Cited by: §2.3.
-  (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems 33, pp. 21002–21012. Cited by: §2.2.
-  (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.2.
-  (2021-10) Pyramid r-cnn: towards better performance and adaptability for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2723–2732. Cited by: TABLE I, TABLE II, TABLE III.
-  (2021) Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173. Cited by: TABLE I, TABLE II, TABLE III.
-  (2020) Learning an uncertainty-aware object detector for autonomous driving. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10521–10527. Cited by: §1, §2.3, §4.4.1, TABLE IV.
-  (2019-06) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2.
-  (2019) 6-dof graspnet: variational grasp generation for object manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2901–2910. Cited by: §2.3.
-  (2020-06) DOPS: learning to detect 3d objects and predict their 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2020) Cardiac segmentation with strong anatomical guarantees. IEEE transactions on medical imaging 39 (11), pp. 3703–3713. Cited by: §2.3.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §3.1.2.
-  Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.3.
-  (2021) Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2743–2752. Cited by: TABLE I, TABLE II, TABLE III.
-  (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538. Cited by: §2.1, TABLE I, TABLE II, TABLE III.
-  (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 770–779. Cited by: §2.1.
-  (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE transactions on pattern analysis and machine intelligence 43 (8), pp. 2647–2664. Cited by: §2.1, §2.2, TABLE I, TABLE II.
-  (2020-06) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2017) Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pp. 464–472. Cited by: §4.1.3.
-  (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28, pp. 3483–3491. Cited by: §2.3, §3.1.1.
-  (2020-06) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §4.
-  (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790. Cited by: §2.2.
-  (2020) Mixture dense regression for object detection and human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13086–13095. Cited by: §2.2.
-  (2019) T-cvae: transformer-based conditioned variational autoencoder for story completion.. In IJCAI, pp. 5233–5239. Cited by: §2.3.
-  (2020) Inferring spatial uncertainty in object detection. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5792–5799. Cited by: §1, §2.3.
-  (2021-10) SPG: unsupervised domain adaptation for 3d object detection via semantic point generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15446–15456. Cited by: §2.1.
-  (2016) Attribute2image: conditional image generation from visual attributes. In European conference on computer vision, pp. 776–791. Cited by: §2.3.
-  (2021) Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3101–3109. Cited by: §2.1.
-  (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2.1, §3.1.5, §4.1.4, TABLE II, TABLE III, TABLE VIII.
-  (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11040–11048. Cited by: TABLE I, TABLE II.
-  (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1951–1960. Cited by: §2.1, TABLE I.
-  (2019-06) GSPN: generative shape proposal network for 3d instance segmentation in point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
-  (2020) 3d-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In European Conference on Computer Vision, pp. 720–736. Cited by: TABLE I.
-  Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 521–530. Cited by: §2.3.
-  (2020) UC-net: uncertainty inspired rgb-d saliency detection via conditional variational autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8582–8591. Cited by: §2.3.
-  (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 654–664. Cited by: §2.3.
-  (2021) CIA-ssd: confident iou-aware single-stage object detector from point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3555–3562. Cited by: §2.1, §4.1.4, TABLE II, TABLE VIII.
-  (2021) SE-ssd: self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14494–14503. Cited by: TABLE I, TABLE II.
-  (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pp. 923–932. Cited by: TABLE III.
-  (2018-06) VoxelNet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.