False Positive Detection and Prediction Quality Estimation for LiDAR Point Cloud Segmentation

We present a novel post-processing tool for semantic segmentation of LiDAR point cloud data, called LidarMetaSeg, which estimates the prediction quality segmentwise. For this purpose we compute dispersion measures based on network probability outputs as well as feature measures based on point cloud input features and aggregate them on segment level. These aggregated measures are used to train a meta classification model to predict whether a predicted segment is a false positive or not and a meta regression model to predict the segmentwise intersection over union. Both models can then be applied to semantic segmentation inferences without knowing the ground truth. In our experiments we use different LiDAR segmentation models and datasets and analyze the power of our method. We show that our results outperform other standard approaches.

I Introduction

Fig. 1: A visualization of LidarMetaSeg containing the ground truth (bottom left), the LiDAR segmentation (bottom right), the LiDAR segmentation quality (top left) as the IoU of prediction and ground truth, and its estimation obtained by LidarMetaSeg (top right). The higher the IoU, the better the prediction quality.

In the field of automated driving, scene understanding is essential. One possible solution for the semantic interpretation of scenes captured by multiple sensor modalities is LiDAR point cloud segmentation [16, 6, 28, 27] (in the following LiDAR segmentation for brevity), where each point of the point cloud is assigned to a class of a given set. A segment is an area of points of the same class. Compared to camera images, a LiDAR point cloud is relatively sparse, but provides accurate depth information. Furthermore, since the LiDAR sensor in general is rotating, 360 degrees of the environment are considered. A summary of sensor modalities is given in [25]. In recent years, the performance of LiDAR segmentation networks has increased enormously [18, 16, 6, 28, 27], but there are only few works on uncertainty quantification [6]. In applications of street scene understanding, safety and reliability of perception systems are just as important as their accuracy. To tackle this problem, we introduce a post-processing tool, called LidarMetaSeg, which estimates the segmentwise (i.e., per connected component of the predicted segmentation) prediction quality in terms of the segmentwise intersection over union (IoU) [13] of the LiDAR segmentation model, see also fig. 1. This provides not only uncertainty quantification per predicted segment but also an online assessment of prediction quality.

State-of-the-art LiDAR segmentation models are based on deep neural networks and can be grouped into two main approaches: projection-based (2D) and non-projection-based (3D) networks, cf. [12]. Projection-based networks like [26, 16, 6] use a spherical (2D) image representation of the point cloud. The predicted semantic categories on the image are thereafter reinserted along the spherical rays into the 3D point cloud. This may involve some post-processing steps, like the k-nearest neighbor (kNN) approach, see [16]. Due to the representation of point clouds as projected images, the networks employed for LiDAR segmentation often resemble image segmentation architectures. The non-projection-based networks, e.g. [19, 23, 28], process the point cloud directly in 3D space, with or without different 3D representation approaches. For example, in [23], the network operates on the 3D point cloud without introducing an additional representation, while in [28] the authors perform a 3D cylinder partition. A combination of a 2D and a 3D representation of the point cloud is used in [27]. All current architectures, whether they use a 2D representation, a 3D representation or a combination of both, provide a segmentation of the point cloud and can therefore also output the class probabilities, which is the only prerequisite for LidarMetaSeg.

Concerning uncertainty quantification in deep learning, Bayesian approaches like Monte Carlo (MC) dropout [10] are commonly used, e.g. in image-based object detection [17], image segmentation [14] and also in LiDAR object detection [8]. In object detection and instance segmentation, so-called scores containing (un)certainty information are used, while this is not the case for semantic segmentation. The SalsaNext network [6] for LiDAR segmentation makes use of MC dropout to output the model (epistemic) and the observation (aleatoric) uncertainty.

In our method LidarMetaSeg we first project the point cloud and the corresponding softmax probabilities of the network to a spherical 2D image representation, which are then used to compute different types of dispersion measures resulting in different dispersion heatmaps. To estimate uncertainty on segment level, we aggregate the dispersion measures with respect to each predicted segment. The IoU is commonly used to evaluate the performance of a segmentation model. For each predicted segment, we compute its IoU with the ground truth and call this the segmentwise IoU. In our experiments we observe a strong correlation of the segmentwise IoU with the aggregated dispersion measures. Hence, we use the aggregated dispersion measures together with additional information from the point cloud input to create a set of handcrafted features. The latter are used in a post-processing manner as input for training i) a meta classification model to detect false positive segments, i.e., to predict whether the IoU is equal to 0 or greater than 0, and ii) a meta regression model to estimate the segmentwise IoU. Thus, we not only have a pointwise uncertainty quantification, given by the dispersion heatmaps, but also a false positive detection as well as a segmentation quality estimation on segment level.

The idea of meta classification and regression to detect false positives and to estimate the segmentwise prediction quality was first introduced in the field of semantic segmentation of images [20], called MetaSeg. The work presented in [22] goes in a similar direction, but for brain tumor segmentation. MetaSeg was further extended in other directions, i.e., for controlled false negative reduction [3], for time-dynamic uncertainty estimates for video data [15], for taking resolution-dependent uncertainties into account [21] and as part of an active learning method [5]. Inspired by the possibility of representing the point cloud as a 2D image, our method LidarMetaSeg is an extension and further development of the original work. Therefore, MetaSeg [20] is the work most closely related to our approach LidarMetaSeg which, together with SalsaNext [6], is up to now the only work in the direction of uncertainty quantification in LiDAR segmentation.

With MC dropout, SalsaNext follows a Bayesian approach to quantifying the model and the observation uncertainty. The uncertainty output is point-based and not segment-based as in our approach. Moreover, MC dropout requires inferring each sample multiple times. LidarMetaSeg requires only a single network inference and estimates uncertainties by means of the network's class probabilities. In a 2D representation, these pixelwise uncertainty estimates can be viewed as uncertainty heatmaps. From those heatmaps, we compute aggregated uncertainties for each predicted segment, therefore clearly going beyond the stage of pixelwise uncertainty estimation. In contrast to MetaSeg for image segmentation, we not only use the network's output but also utilize information from the point cloud input, such as the intensity and range features provided for each point of the point cloud.

LidarMetaSeg is therefore a universal post-processing tool that allows for the detection of false positive segments as well as the estimation of segmentwise LiDAR segmentation quality. Besides that, the present work is the first one to provide uncertainty estimates on the level of predicted segments. We evaluate our method on two different datasets, SemanticKITTI [1] and nuScenes [2], and with three different network architectures: two projection-based models, RangeNet++ [16] and SalsaNext [6], and one non-projection-based model, Cylinder3D [28]. For meta classification, we achieve area under receiver operating characteristic curve (AUROC) and area under precision recall curve (AUPRC) [7] values of up to and , respectively. For the meta regression, we achieve coefficient of determination (R²) values of up to . We show that our aggregated measures – in terms of meta classification and regression – lead to a significant performance gain in comparison to only considering a single uncertainty metric like the segmentwise entropy.

II Method

LidarMetaSeg is a post-processing method for LiDAR semantic segmentation to estimate the segmentwise prediction quality. It consists of a meta classification and a meta regression model that, for each predicted segment, classify whether it has an IoU equal to 0 or greater than 0 with the ground truth and predict the segmentwise IoU with the ground truth, respectively. The method works as follows: in a preprocessing step we project each sample, i.e., the point cloud, the corresponding network probabilities and the labels, into a spherical 2D image representation. In a next step and based on the projected data, we compute dispersion measures and other features, as is done for image data in [20, 21, 3]. Afterwards we identify the segments of a given semantic class and aggregate the pixelwise values from the previous step on segment level. In addition, we compute the IoU of each predicted segment with the ground truth of the same class. This results in a structured dataset which consists of the aggregated dispersion measures as well as additional features and of the target variable – the IoU for the task of meta regression or the binary variable indicating IoU = 0 (a false positive) for the task of meta classification – for each segment. We fit a classification and a regression model to this dataset. In the end, we re-project the meta classification and regression from the image representation to the point cloud.

II-A Preprocessing

A sample of input data for LidarMetaSeg is assumed to be given on point cloud level and contains the following:

  • the point cloud with Cartesian coordinates and an intensity value for each of the points in the LiDAR point cloud,

  • the ground truth / labels, taking values in the set of given classes,

  • the probabilities of a LiDAR segmentation network, given as softmax probabilities,

  • the prediction.

Typically, one is also interested in the range of a given point in the point cloud, which is part of most LiDAR segmentation networks' input. Since the ego car is located in the origin of the coordinate system, this quantity is given by the Euclidean distance of the point to the origin.

The projection of a point cloud to a spherical 2D image representation follows two steps: a transformation from Cartesian to spherical coordinates and then a transformation from spherical to image coordinates. The spherical coordinates are given as (r, θ, φ) with range r, polar angle θ and azimuth angle φ. The transformation from the Cartesian to the spherical coordinates is given by

r = \sqrt{x^2 + y^2 + z^2}, \qquad \theta = \arcsin\!\left(\frac{z}{r}\right), \qquad \varphi = \operatorname{atan2}(y, x)   (1)

for each point of the point cloud. Based on the spherical coordinates we get the image coordinates (u, v) with the equation

u = \frac{1}{2}\left[1 - \frac{\varphi}{\pi}\right] w, \qquad v = \left[1 - \frac{\theta + f_{\mathrm{down}}}{f_{\mathrm{up}} + f_{\mathrm{down}}}\right] h   (2)

with the width w and height h of the image and the vertical field of view (FOV) of the LiDAR sensor, bounded by the angles f_{\mathrm{up}} above and f_{\mathrm{down}} below the horizontal plane.

In order to get an image representation where each point corresponds to one pixel and vice versa, we need the number of channels, the angular resolution α and the horizontal FOV of the LiDAR sensor. To this end, we define the height h as the number of channels and the width w as the quotient of the horizontal FOV (which is in general 360°, since the LiDAR sensor is rotating) and the angular resolution α, i.e.,

h = \#\,\text{channels}, \qquad w = \frac{\mathrm{FOV}_{\mathrm{hor}}}{\alpha} = \frac{360°}{\alpha}.   (3)

Thus, the image representation – using the explicit sensor information – has as many entries or pixels as the point cloud can have points at maximum. Unfortunately, for technical reasons such as ego-motion compensation or overlapping channel angles, it can still happen that multiple points are projected to the same pixel. More details concerning such projection errors can be found in [24]. With the projection proposed above, this happens rarely enough to be negligible.
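The projection can be sketched as follows; this is a minimal NumPy sketch of eqs. (1)–(3), where the function name, the FOV parameters fov_up_deg / fov_down_deg (in degrees) and the handling of boundary pixels are our own illustrative choices and not taken from the paper.

```python
import numpy as np

def spherical_projection(xyz, fov_up_deg, fov_down_deg, n_channels, ang_res_deg):
    """Project a LiDAR point cloud (N, 3) to 2D image coordinates (u, v).

    Height = number of channels, width = horizontal FOV / angular resolution,
    cf. eq. (3). Returns integer pixel coordinates and the range per point.
    """
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.maximum(np.sqrt(x**2 + y**2 + z**2), 1e-8)   # range, eq. (1)
    theta = np.arcsin(z / r)                             # polar angle
    phi = np.arctan2(y, x)                               # azimuth angle

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(abs(fov_down_deg))
    fov = fov_up + fov_down                              # vertical FOV

    h = n_channels
    w = int(360.0 / ang_res_deg)                         # eq. (3)

    u = 0.5 * (1.0 - phi / np.pi) * w                    # eq. (2), horizontal
    v = (1.0 - (theta + fov_down) / fov) * h             # eq. (2), vertical

    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)
    return u, v, r
```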

Following the projection above, we denote the projected 2D representations similarly as before, but now indexed by the pixel position instead of the point index, i.e.,

  • the image representation (of the point cloud),

  • the ground truth / labels,

  • the probabilities,

  • the prediction.

The proposed image projection yields a sparse image representation. However, our post-processing approach LidarMetaSeg is based on connected components of the segmentation. In order to identify connected components (segments) of pixels in the 2D image resulting from the projection, we fill these gaps by setting any empty entry (entries without a corresponding point in the point cloud) to the value of one of its nearest neighbors that received a value from the projection. An example of such a filled image representation is shown in fig. 2, left panel. In the following we only consider the filled image representations. We store the information which pixel received its value via projection and which one via fill-in in a binary mask M of width w and height h, where 1 represents a projected point and 0 a filled entry, i.e.,

M_{u,v} = \begin{cases} 1, & \text{if a point of the cloud is projected to } (u,v), \\ 0, & \text{otherwise.} \end{cases}   (4)

For simplicity, we refer to the filled image representations (that are input quantities for the segmentation network) as feature measures.
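A possible implementation of the fill-in and of the binary mask of eq. (4) is sketched below; it uses scipy's Euclidean distance transform with return_indices=True to find, for every empty pixel, the nearest projected pixel. This particular nearest-neighbor fill is an assumption on our side; the paper only states that empty entries take the value of one of their nearest projected neighbors.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fill_sparse_image(img, mask):
    """Fill empty pixels with the value of their nearest projected pixel.

    img:  (H, W) or (H, W, C) sparse image representation
    mask: (H, W) binary mask, 1 = pixel received a projected point (eq. (4))
    """
    # distance transform of the empty region; also returns, for each pixel,
    # the indices of the nearest pixel that received a projected point
    _, (iy, ix) = distance_transform_edt(mask == 0, return_indices=True)
    filled = img[iy, ix]   # projected pixels keep their value, empty ones are filled
    return filled
```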

Fig. 2: Visual examples of our method LidarMetaSeg. The left panel shows the preprocessing part: the ground truth (top left) and the prediction (top right) of the point cloud as well as the corresponding sparse (middle) and filled (bottom) image representations. The right panel visualizes a dispersion heatmap, the segmentwise prediction quality and its estimation: the probability difference heatmap of the prediction-based probabilities (top right), where higher values correspond to higher uncertainty, in the middle the true (left) and estimated (right) IoU values for the image representation and in the bottom part the corresponding visualizations after the re-projection to the point cloud. The prediction of the point cloud and the corresponding prediction quality estimation are highlighted.

II-B Dispersion Measures and Segmentwise Aggregation

First we define the dispersion and feature measures and afterwards the segmentwise aggregation.

Dispersion and Feature Measures

Based on the probabilities, we define the dispersion measures entropy E, probability difference D and variation ratio V at pixel position (u,v) as follows, where p_c(u,v) denotes the softmax probability of class c ∈ C at pixel (u,v) and \hat{c}(u,v) = \arg\max_{c \in C} p_c(u,v) the predicted class:

E_{u,v} = -\frac{1}{\log |C|} \sum_{c \in C} p_c(u,v) \log p_c(u,v),   (5)

D_{u,v} = 1 - p_{\hat{c}(u,v)}(u,v) + \max_{c \in C \setminus \{\hat{c}(u,v)\}} p_c(u,v),   (6)

V_{u,v} = 1 - p_{\hat{c}(u,v)}(u,v).   (7)

In addition, the feature measures coordinates, intensity and range at position (u,v) are given by the filled image representation of the point cloud,

(X_{u,v}, Y_{u,v}, Z_{u,v}, I_{u,v}, R_{u,v}).   (8)

For the sake of brevity, we define the set of dispersion and feature measures

\mathcal{U} = \{E, D, V, X, Y, Z, I, R\},   (9)

omitting the index for the position as this will follow from the context. Note that, due to the position dependence, each element of \mathcal{U} can be considered as a heatmap.
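A minimal sketch of the dispersion heatmaps of eqs. (5)–(7), assuming the (filled) softmax probabilities are available as an array of shape (H, W, |C|); all variable names are illustrative.

```python
import numpy as np

def dispersion_heatmaps(probs, eps=1e-12):
    """Compute entropy, probability difference and variation ratio heatmaps.

    probs: (H, W, C) softmax probabilities per pixel.
    """
    n_classes = probs.shape[-1]
    # normalized Shannon entropy, eq. (5)
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1) / np.log(n_classes)

    # sort probabilities per pixel to get the largest and second largest value
    sorted_p = np.sort(probs, axis=-1)
    p_max, p_2nd = sorted_p[..., -1], sorted_p[..., -2]

    variation_ratio = 1.0 - p_max                # eq. (7)
    prob_difference = 1.0 - p_max + p_2nd        # eq. (6)
    return entropy, prob_difference, variation_ratio
```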

Segmentwise Aggregation

For a given prediction and the corresponding ground truth, we denote by \hat{K} and K the sets of connected components (segments) in the prediction and the ground truth, respectively. A connected component is a set of pixels that are adjacent to each other and belong to the same class, see also fig. 2, left panel. For each predicted segment k ∈ \hat{K}, we define the following quantities. Additionally, and in order to count only the pixels with a corresponding point in the point cloud, all pixel counts and sums below are restricted to pixels with mask value M_{u,v} = 1.

  • the interior k_{in} ⊂ k, i.e., a pixel is an element of k_{in} if all eight neighboring pixels are an element of k,

  • the boundary k_{bd} = k \setminus k_{in},

  • the pixel sizes S, S_{in} and S_{bd}, i.e., the number of mask-restricted pixels in k, k_{in} and k_{bd}, respectively,

  • the segment size in the point cloud, i.e., the number of LiDAR points corresponding to the segment.

Furthermore, we define the target variable, the IoU, and the so-called adjusted IoU as follows:

  • the IoU: let K' be the union of all ground truth segments that have non-trivial intersection with k and whose class label equals the predicted class of k; then

    IoU(k) = \frac{|k \cap K'|}{|k \cup K'|},   (10)
  • the adjusted IoU does not count pixels of the ground truth that are not contained in the predicted segment k, but in other predicted segments of the same class: let Q be the union of all predicted segments of the same class as k, excluding k itself; then

    IoU_{adj}(k) = \frac{|k \cap K'|}{|k \cup (K' \setminus Q)|}.   (11)

In cases where a ground truth segment is covered by more than one predicted segment of the same class, each predicted segment would have a low IoU, although the predicted segments represent the ground truth quite well. As a remedy, the adjusted IoU was introduced in [20] to not punish this situation. The adjusted IoU is more suitable for the task of meta regression. For the meta classification it holds that IoU(k) = 0 if and only if IoU_{adj}(k) = 0.
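The segmentwise IoU of eq. (10) and the adjusted IoU of eq. (11) can be sketched on boolean pixel masks as follows; the construction of the ground truth union K' and of the union Q of other predicted segments of the same class is assumed to happen outside the function, and all names are ours.

```python
import numpy as np

def segment_iou(pred_seg, gt_same_class, other_pred_same_class):
    """Segmentwise IoU (eq. (10)) and adjusted IoU (eq. (11)).

    pred_seg:              (H, W) bool, the predicted segment k
    gt_same_class:         (H, W) bool, union K' of ground truth segments of the
                           same class that intersect k
    other_pred_same_class: (H, W) bool, union Q of the other predicted segments
                           of the same class
    """
    inter = np.logical_and(pred_seg, gt_same_class).sum()
    union = np.logical_or(pred_seg, gt_same_class).sum()
    iou = inter / union if union > 0 else 0.0

    # adjusted IoU: ground truth pixels covered by other predicted segments of
    # the same class are removed from the union
    gt_adj = np.logical_and(gt_same_class, np.logical_not(other_pred_same_class))
    union_adj = np.logical_or(pred_seg, gt_adj).sum()
    iou_adj = inter / union_adj if union_adj > 0 else 0.0
    return iou, iou_adj
```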

Based on the previous definitions, we derive the following segmentwise metrics from the dispersion and feature measures:

  • the mean and variance metrics

    \bar{D}(k) = \frac{1}{S} \sum_{(u,v) \in k,\ M_{u,v}=1} D_{u,v},   (12)

    v_D(k) = \frac{1}{S} \sum_{(u,v) \in k,\ M_{u,v}=1} \left( D_{u,v} - \bar{D}(k) \right)^2   (13)

    for every measure D ∈ \mathcal{U}, computed analogously for the interior k_{in} and the boundary k_{bd} (denoted \bar{D}_{in}, v_{D,in} and \bar{D}_{bd}, v_{D,bd}),

  • the relative sizes \tilde{S} = S / S_{bd} and \tilde{S}_{in} = S_{in} / S_{bd},

  • the relative mean and variance metrics

    \bar{D}_{rel}(k) = \bar{D}(k)\,\tilde{S}, \qquad \bar{D}_{rel,in}(k) = \bar{D}_{in}(k)\,\tilde{S}_{in},   (14)

    v_{D,rel}(k) = v_D(k)\,\tilde{S}, \qquad v_{D,rel,in}(k) = v_{D,in}(k)\,\tilde{S}_{in}   (15)

    for every measure D ∈ \mathcal{U},

  • the ratio of the neighborhood’s correct predictions of each class

    (16)

    with the set of neighbors, i.e.,  ,

  • the mean class probabilities

    \bar{P}_c(k) = \frac{1}{S} \sum_{(u,v) \in k,\ M_{u,v}=1} p_c(u,v) \quad \text{for } c \in C.   (17)

Typically, the dispersion measures are large for pixels close to the segment boundary. This motivates the separate treatment of interior and boundary measures. Furthermore, we observe a correlation between fractal segment shapes and a bad or wrong prediction, which motivates the relative size metrics. In summary, our set of metrics consists of the (relative) mean and variance metrics, the (relative) size metrics as well as the neighborhood ratios and the mean class probabilities. An example of the pixelwise dispersion measures as well as the segmentwise IoU values and their prediction is shown in fig. 2, right panel.

With the exception of the segmentwise IoU and adjusted IoU values, all quantities defined above can be computed without knowledge of the ground truth.
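A sketch of the segmentwise aggregation for a single heatmap, using 8-connected components of the predicted class map and the binary mask of eq. (4); the reduced set of computed metrics and the use of scipy.ndimage for labeling and erosion are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import label, binary_erosion

def aggregate_segmentwise(pred, heatmap, mask):
    """Aggregate a pixelwise heatmap over connected components of the prediction.

    pred:    (H, W) predicted class ids
    heatmap: (H, W) dispersion or feature measure, e.g. entropy
    mask:    (H, W) binary mask, 1 = pixel has a corresponding LiDAR point
    Returns one metric dict per predicted segment.
    """
    eight = np.ones((3, 3), dtype=bool)          # 8-neighborhood
    valid = mask.astype(bool)
    metrics = []
    for c in np.unique(pred):
        comps, n = label(pred == c, structure=eight)
        for i in range(1, n + 1):
            seg = comps == i
            interior = binary_erosion(seg, structure=eight)  # all 8 neighbors in seg
            boundary = seg & ~interior
            s = int((seg & valid).sum())
            if s == 0:
                continue
            vals = heatmap[seg & valid]
            metrics.append({
                "class": int(c),
                "S": s,
                "S_in": int((interior & valid).sum()),
                "S_bd": int((boundary & valid).sum()),
                "mean": float(vals.mean()),      # eq. (12)
                "var": float(vals.var()),        # eq. (13)
            })
    return metrics
```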

III Numerical Experiments

For numerical experiments we used two datasets: SemanticKITTI [1] and nuScenes [2]. For meta classification and regression we deploy XGBoost [4]. Other classification and regression methods like linear / logistic regression, neural networks or tree-based ensemble methods [9] are also possible. However, as shown in [15], XGBoost leads to the best results. Due to the reason mentioned in the previous section, the target variable for the meta regression (and classification) is the adjusted IoU.
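A minimal sketch of fitting the meta classification and meta regression models with XGBoost on the structured dataset of aggregated metrics; the hyperparameters are illustrative defaults and not the values used in the paper.

```python
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

# X_train: (n_segments, n_metrics) aggregated metrics
# iou_adj_train: (n_segments,) adjusted IoU of each predicted segment
def fit_meta_models(X_train, iou_adj_train):
    # meta classification: a segment is a false positive iff its adjusted IoU is 0
    y_cls = (iou_adj_train == 0).astype(int)
    meta_clf = XGBClassifier(n_estimators=100, max_depth=4)
    meta_clf.fit(X_train, y_cls)

    # meta regression: predict the adjusted IoU directly
    meta_reg = XGBRegressor(n_estimators=100, max_depth=4)
    meta_reg.fit(X_train, iou_adj_train)
    return meta_clf, meta_reg

# on validation data: probability of being a false positive and estimated IoU
# p_fp = meta_clf.predict_proba(X_val)[:, 1]; iou_hat = meta_reg.predict(X_val)
```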

First, we describe the settings of the experiments for both datasets and evaluate the results for the false positive detection and the segmentwise prediction quality estimation when using all metrics presented in the previous section. Afterwards we conduct an analysis of the metrics and the meta classification model.

III-A SemanticKITTI

The SemanticKITTI dataset [1] contains street scenes from and around Karlsruhe, Germany. It provides sequences with about K samples for training and validation, consisting of classes. The data is recorded with a Velodyne HDL-64E LiDAR sensor, which has 64 channels and a (horizontal) angular resolution of . Furthermore, the data is recorded and annotated with 10 frames per second (fps) and each point cloud contains about K points. The authors of the dataset recommend using all sequences to train the LiDAR segmentation model except sequence 08, which should be used for validation.

For the experiments we used three pretrained LiDAR segmentation models, two projection-based models, i.e., RangeNet++ [16] and SalsaNext [6], and one non-projection-based model, i.e., Cylinder3D [28], which followed the recommended data split. For RangeNet++ and SalsaNext, the softmax probabilities are given for the 2D image representation prediction. As we assume that softmax probabilities are given for the point cloud, we consider this representation as the starting point and re-project the softmax probabilities to the point cloud.

After the re-projection from the 2D image representation to the point cloud, both models apply an additional kNN post-processing step to clean the point cloud from undesired discretization and inference artifacts [16], which may result in changing the semantic class of a few points. To take this post-processing step into account, we set the probability of the newly assigned class of each cleaned point to 1 and all other values to 0. Therefore the softmax condition (the sum of all probability values of a point is equal to 1 and all values are between 0 and 1) is met and the adjusted prediction is equal to the argmax of the probabilities. We do not expect other approaches to significantly change the results since we aggregate our dispersion measures and the number of modified points is small.
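The described adjustment can be sketched as follows, assuming point-wise softmax probabilities of shape (N, |C|) and the labels after the kNN post-processing; the function and variable names are hypothetical.

```python
import numpy as np

def adjust_probs_for_knn(probs, cleaned_labels, changed_idx):
    """Set the probabilities of kNN-cleaned points to a one-hot vector.

    probs:          (N, C) softmax probabilities per point
    cleaned_labels: (N,) labels after the kNN post-processing
    changed_idx:    indices of the points whose class changed
    """
    probs = probs.copy()
    probs[changed_idx] = 0.0
    probs[changed_idx, cleaned_labels[changed_idx]] = 1.0
    # the softmax condition holds and argmax(probs) equals the cleaned prediction
    return probs
```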

Following our method, the image representation of the point cloud data is of size , cf. eq. (3).

Most deep learning models tend to overfit. Therefore, we only use samples for LidarMetaSeg which are not part of the training data of the segmentation network, as an overfitted model affects the dispersion measures. Thus, we only use sequence 08 for our experiments. Computing the connected components and metrics yields approx. M segments for each network. Most of the segments are very small. Therefore we follow a similar segment exclusion rule as in MetaSeg [20], where segments with empty interior are excluded. Here, we exclude segments consisting of less than LiDAR points, also shown in gray color in fig. 2. Hence, we reduce the number of segments to approx. M but we retain of the data measured in terms of the number of points. We tested the dependence of our results under variation of the exclusion size; the results were very similar to those we present in the following.

For training and validation of LidarMetaSeg we split sequence 08 and the corresponding connected components and metrics into disjoint sub-sequences. These sub-sequences are used for a k-fold cross validation. A cross validation over all samples would yield highly correlated training and validation splits as all sequences are recorded with 10 fps. The results for the meta classification and regression are given in table I. For all three models we achieve a validation accuracy between and , see row 'ACC LMS' (short for LidarMetaSeg). The accuracy of random guessing ('ACC naive baseline') is between and , which directly amounts to the percentage of segments with an IoU greater than 0.

For each method, the accuracy values correspond to a single decision threshold. In contrast to that, the AUROC and AUPRC are obtained by varying the decision threshold of the classification output. The AUROC essentially measures the overlap of the distributions corresponding to negative and positive samples; this score does not place more emphasis on one class over the other in case of class imbalance. The ACC of random guessing indicates the class imbalance: about of the segments have an IoU greater than 0 and of the segments have an IoU of 0, i.e., they are false positives. The underlying precision recall curve of the AUPRC ignores true negatives and emphasizes the detection of the positive class (false positives).

Using the metrics of the previous section (LMS) for the meta classification yields AUROC values above and AUPRC values up to . For the meta regression we achieve R² values between . Fig. 3 depicts the quality of predicting the IoU. A visualization of the IoU estimation is shown in fig. 2 and in the supplementary video (https://youtu.be/907jJSRgHUk).

SemanticKITTI nuScenes
RangeNet++ SalsaNext Cylinder3D Cylinder3D
training validation training validation training validation training validation
Classification

ACC

LMS
LMS w/o features
Entropy
LMS MCDO
naive baseline

AUROC

LMS
LMS w/o features
Entropy
LMS MCDO

AUPRC

LMS
LMS w/o features
Entropy
LMS MCDO
Regression

LMS
LMS w/o features
Entropy
LMS MCDO

TABLE I: Results for meta classification and regression, averaged over 10 runs. The numbers in brackets denote standard deviations of the computed mean values. The best results in terms of ACC, AUROC, AUPRC and R² on the validation data are highlighted.

Fig. 3: True vs. predicted IoU for RangeNet++, SalsaNext and Cylinder3D on SemanticKITTI as well as Cylinder3D on nuScenes, from left to right.

III-B nuScenes

The nuScenes dataset [2] contains street scenes from two cities, Boston (US) and Singapore. It provides sequences for training and sequences for validation. Each sequence contains about samples, which amounts to a total of K key frames. The dataset has classes and is recorded and annotated with fps. The LiDAR sensor has 32 channels and an angular resolution of . Every point cloud contains roughly K points. For our experiments we used the pretrained Cylinder3D model with the recommended data split. We did not test RangeNet++ and SalsaNext since the corresponding pretrained models are not available.

The image projection is of size . Computing the connected components for all samples of the validation sequences yields approx. M segments. Excluding all small segments containing less than points reduces that number to M. Still, we retain of the data in terms of points. We performed k-fold cross validation where we always took of the sequences for training and the remaining sequences for validation of the meta models. The results are presented in table I. of all segments have an . With the meta classification we achieve an accuracy of , an AUROC of and an AUPRC of , see 'LMS' rows. For the meta regression we achieve an R² of for the validation data. The quality of predicting the IoU is shown in fig. 3.

III-C Metric Selection

So far, we have presented results based on all metrics from section II, indicated by LMS in table I. In order to analyze the impact of the metrics on the performance, we repeated the experiments for multiple sets of metrics.

Feature Measures

First, we tested the performance of the meta classification and regression models without the feature measures, i.e., without the metrics based on the point cloud input features, see row 'LMS w/o features'. The performance in terms of ACC, AUROC, AUPRC and R² is for all experiments up to percentage points (pp.) lower compared to when incorporating the feature measures.

Entropy

Since the entropy is commonly used in uncertainty quantification, we repeated all experiments using only the mean entropy, see 'Entropy' rows. The performance for the meta classification is up to pp. lower compared to LMS; for the meta regression, the R² decreases by up to pp.

Bayesian Uncertainties

The projection-based SalsaNext model follows a Bayesian approach as already mentioned in section I: the LiDAR model provides a model (epistemic) and an observation (aleatoric) uncertainty output for the point cloud's 2D image representation prediction, estimated by MC dropout (MCDO). To obtain these uncertainties we followed the procedure in [6]. This results in epistemic and aleatoric uncertainty values for each pixel position. We compute the same aggregated metrics for these as for the measures in \mathcal{U}. Adding these new metrics to the previous metrics LMS is referred to as LMS MCDO. The additional Bayesian uncertainties do not improve the meta classification and regression performance significantly, see table I. We have not tested SalsaNext on nuScenes since the pretrained model is not available. For comparability of results, we only used publicly available pretrained models.


SemanticKITTI

number of metrics
RangeNet++
ACC
Added all
Added all

SalsaNext
ACC
Added all
Added all

Cylinder3D
ACC
Added all
Added all

nuScenes

number of metrics
Cylinder3D
ACC
Added all
Added all

TABLE II: Metric selection using a greedy method that in each step adds one metric that maximizes the meta classification / regression performance in terms of ACC / R². All results, SemanticKITTI (top) and nuScenes (bottom), are calculated on the validation set of the respective metrics dataset.

Greedy Heuristic

Inspired by forward-stepwise selection for linear regression, we analyze different subsets of metrics by performing a greedy heuristic: we start with an empty set of metrics and iteratively add a single metric that maximally improves the performance – ACC for the false positive detection and R² for the prediction quality estimation. We performed this greedy heuristic for both meta classification and meta regression. The results in terms of ACC and R² are shown in fig. 4 (only for SemanticKITTI) and in table II. For the meta classification, we observe a comparatively big accuracy gain while adding the first metrics, then the accuracy increases rather moderately. For the meta regression, this performance gain in terms of R² spreads wider across the first 10 iterations, before the improvement per iteration becomes moderate. Furthermore, the results show that a small subset of metrics is sufficient for good models. We achieve nearly the same performance for both tasks with the metrics selected by the greedy heuristic as when using all metrics (LMS). Considering table II, the mean variation ratio and the mean probability difference in most cases constitute the initial choices. Furthermore, the mean class probabilities are also frequently among the early choices.
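A sketch of the greedy forward selection over metric columns, shown here for the classification accuracy; the use of a fixed train/validation split and the XGBoost hyperparameters are simplifications compared to the cross-validation described above, and all names are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def greedy_metric_selection(X_tr, y_tr, X_val, y_val, n_select):
    """Iteratively add the metric column that maximally improves validation ACC."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    history = []
    for _ in range(n_select):
        best_acc, best_j = -1.0, None
        for j in remaining:                       # try adding each remaining metric
            cols = selected + [j]
            clf = XGBClassifier(n_estimators=50, max_depth=3)
            clf.fit(X_tr[:, cols], y_tr)
            acc = accuracy_score(y_val, clf.predict(X_val[:, cols]))
            if acc > best_acc:
                best_acc, best_j = acc, j
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((best_j, best_acc))        # metric added and resulting ACC
    return selected, history
```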

Fig. 4: Performance of the meta classification (left) and the meta regression (right) model on SemanticKITTI depending on the number of metrics, which are selected by the greedy approach.

III-D Confidence Calibration

The false positive detection is based on a meta classification model, which classifies whether the predicted segment has an IoU equal to 0 or greater than 0. In order to demonstrate the reliability of the classification model, we show that the confidences are well calibrated. Confidence scores are called calibrated if the confidence is representative of the probability of a correct classification, cf. [11].

The meta classification model estimates for each predicted segment the probability of being a false positive, i.e., of having an IoU of 0. We group the probabilities of all meta classified segments of the validation data into interval bins. The accuracy of a bin is the relative amount of true predictions; the confidence of a bin is the mean of its probabilities. The closer the accuracy and the confidence are to each other, the more reliable the corresponding classification model is. This is visualized in a so-called reliability diagram. For the evaluation of calibration, we define the maximum calibration error (MCE) as the maximum absolute difference between the accuracy and the confidence over all bins and the expected calibration error (ECE) as a weighted average of the bins' differences between accuracy and confidence, where the weights are proportional to the number of elements per bin. Further details are given in [11].
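A sketch of the ECE and MCE computation over equally spaced probability bins; here the accuracy of a bin is taken as the empirical frequency of false positives among its segments, which is one possible reading of the description above, and the number of bins is an assumption.

```python
import numpy as np

def reliability(p_fp, is_fp, n_bins=10):
    """Expected (ECE) and maximum (MCE) calibration error over probability bins.

    p_fp:  (n_segments,) predicted probability of being a false positive
    is_fp: (n_segments,) binary ground truth (1 = false positive, i.e. IoU = 0)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n_total = 0.0, 0.0, len(p_fp)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (p_fp >= lo) & ((p_fp < hi) if hi < 1.0 else (p_fp <= hi))
        if in_bin.sum() == 0:
            continue
        confidence = p_fp[in_bin].mean()          # mean predicted probability in the bin
        accuracy = is_fp[in_bin].mean()           # empirical false-positive frequency
        gap = abs(accuracy - confidence)
        ece += (in_bin.sum() / n_total) * gap     # weighted average gap
        mce = max(mce, gap)                       # maximum gap over all bins
    return ece, mce
```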

The reliability diagrams and the MCE as well as the ECE for all previously discussed meta classification models are shown in fig. 5. The smaller the gaps, i.e., the closer the outputs are to the diagonal, the more reliable and better calibrated the model is. The MCE and ECE values are between and , respectively. The results indicate well calibrated and reliable meta classification models.

Fig. 5: Reliability diagrams with MCE and ECE for the meta classification model: RangeNet++, SalsaNext, Cylinder3D on SemanticKITTI as well as Cylinder3D on nuScenes, from left to right.

IV Conclusion

In this work we presented our method LidarMetaSeg for segmentwise false positive detection and prediction quality estimation of LiDAR point cloud segmentation. We have shown that the more of our hand-crafted aggregated metrics we incorporate, the better the results get. This holds for all considered evaluation metrics – ACC, AUROC, AUPRC and R². Furthermore, the results show that adding Bayesian uncertainties (epistemic and aleatoric ones approximated by MC dropout) on top of our dispersion measures based on the softmax probabilities improves neither meta classification nor meta regression performance. We have demonstrated the effectiveness of the method on street scene scenarios and are confident that this method can be adapted to other LiDAR segmentation tasks and applications, e.g. indoor segmentation or panoptic segmentation.

References

  • [1] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) Semantickitti: a dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9297–9307. Cited by: §I, §III-A, §III.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631. Cited by: §I, §III-B, §III.
  • [3] R. Chan, M. Rottmann, F. Hüger, P. Schlicht, and H. Gottschalk (2020) Controlled false negative reduction of minority classes in semantic segmentation. Cited by: §I, §II.
  • [4] T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. Cited by: §III.
  • [5] P. Colling, L. Roese-Koerner, H. Gottschalk, and M. Rottmann (2021) MetaBox+: a new region based active learning method for semantic segmentation using priority maps. In Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, pp. 51–62. External Links: Document, ISBN 978-989-758-486-2 Cited by: §I.
  • [6] T. Cortinhal, G. Tzelepis, and E. E. Aksoy (2020) SalsaNext: fast, uncertainty-aware semantic segmentation of lidar point clouds for autonomous driving. arXiv preprint arXiv:2003.03653. Cited by: §I, §I, §I, §I, §I, §III-A, §III-C.
  • [7] J. Davis and M. Goadrich (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. Cited by: §I.
  • [8] D. Feng, L. Rosenbaum, and K. Dietmayer (2018) Towards safe autonomous driving: capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3266–3273. Cited by: §I.
  • [9] J. Friedman, T. Hastie, R. Tibshirani, et al. (2001) The elements of statistical learning. Vol. 1, Springer series in statistics New York. Cited by: §III.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §I.
  • [11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §III-D, §III-D.
  • [12] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun (2020) Deep learning for 3d point clouds: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §I.
  • [13] P. Jaccard (1912) The distribution of the flora in the alpine zone. 1. New phytologist 11 (2), pp. 37–50. Cited by: §I.
  • [14] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §I.
  • [15] K. Maag, M. Rottmann, and H. Gottschalk (2020) Time-dynamic estimates of the reliability of deep semantic segmentation networks. In 2020 IEEE International Conference on Tools with Artificial Intelligence (ICTAI). Cited by: §I, §III.
  • [16] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss (2019) Rangenet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220. Cited by: §I, §I, §I, §III-A, §III-A.
  • [17] O. Ozdemir, B. Woodward, and A. A. Berlin (2017) Propagating uncertainty in multi-stage bayesian convolutional neural networks with application to pulmonary nodule detection. arXiv preprint arXiv:1712.00497. Cited by: §I.
  • [18] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §I.
  • [19] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §I.
  • [20] M. Rottmann, P. Colling, T. P. Hack, R. Chan, F. Hüger, P. Schlicht, and H. Gottschalk (2020) Prediction error meta classification in semantic segmentation: detection via aggregated dispersion measures of softmax probabilities. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–9. Cited by: §I, §II-B, §II, §III-A.
  • [21] M. Rottmann and M. Schubert (2019) Uncertainty measures and prediction quality rating for the semantic segmentation of nested multi resolution street scene images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §I, §II.
  • [22] A. G. Roy, S. Conjeti, N. Navab, and C. Wachinger (2018) Inherent brain segmentation quality control from fully convnet monte carlo sampling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 664–672. Cited by: §I.
  • [23] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420. Cited by: §I.
  • [24] L. T. Triess, D. Peter, C. B. Rist, and J. M. Zöllner (2020) Scan-based semantic segmentation of lidar point clouds: an experimental study. In 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1116–1121. Cited by: §II-A.
  • [25] Z. Wang, Y. Wu, and Q. Niu (2019) Multi-sensor fusion in automated driving: a survey. IEEE Access 8, pp. 2847–2868. Cited by: §I.
  • [26] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka (2020) Squeezesegv3: spatially-adaptive convolution for efficient point-cloud segmentation. In European Conference on Computer Vision, pp. 1–19. Cited by: §I.
  • [27] J. Xu, R. Zhang, J. Dou, Y. Zhu, J. Sun, and S. Pu (2021) RPVNet: a deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. arXiv preprint arXiv:2103.12978. Cited by: §I, §I.
  • [28] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin (2020) Cylinder3d: an effective 3d framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550. Cited by: §I, §I, §I, §III-A.