CertainNet: Sampling-free Uncertainty Estimation for Object Detection

10/04/2021, by Stefano Gasperini, et al.

Estimating the uncertainty of a neural network plays a fundamental role in safety-critical settings. In perception for autonomous driving, measuring the uncertainty means providing additional calibrated information to downstream tasks, such as path planning, that can use it towards safe navigation. In this work, we propose a novel sampling-free uncertainty estimation method for object detection. We call it CertainNet, and it is the first to provide separate uncertainties for each output signal: objectness, class, location and size. To achieve this, we propose an uncertainty-aware heatmap, and exploit the neighboring bounding boxes provided by the detector at inference time. We evaluate the detection performance and the quality of the different uncertainty estimates separately, also with challenging out-of-domain samples: BDD100K and nuImages with models trained on KITTI. Additionally, we propose a new metric to evaluate location and size uncertainties. When transferring to unseen datasets, CertainNet generalizes substantially better than previous methods and an ensemble, while being real-time and providing high quality and comprehensive uncertainty estimates.


I Introduction

While neural networks have been widely employed in research and mobile applications, their usage is still limited in safety-critical real-world settings, such as autonomous driving [8]. The ability of a model to estimate the uncertainty of its predictions is a key enabler for its safe and reliable use in these contexts and unknown scenarios [5]. This makes uncertainty estimation a fundamental companion to object detection [3], semantic segmentation [17], visual odometry [24], and other relevant visual tasks, especially to handle out-of-domain data.

As one of the main perception tasks, object detection comprises a plurality of outputs (e.g. object location, size, and class), requiring both regression and classification, which renders it particularly challenging for uncertainty estimation. Existing real-time approaches have focused on providing an uncertainty estimate only for a subset of these detection outputs (e.g. location and size [3, 15]). In traffic, there can be situations where an object's location and size are relatively certain, while its existence in the scene (i.e. objectness) is uncertain, e.g. a person in a dark area or entering a car with an unusual pose, as shown at the bottom of Figure 1. In these cases, it is particularly important to assess the objectness uncertainty, to avoid ignoring the detection solely because of a low confidence score. A model's confidence is a score that does not necessarily correspond to the actual probability of being correct, while uncertainty estimation reduces this gap [8].

However, prior works cover this only partially, as they leave out the uncertainty over important parameters, such as the objectness, posing a potential safety issue. By extracting a dedicated uncertainty measure for each of the output signals, a system would provide deeper and better insights into the scene, which could be exploited by downstream tasks (e.g. path planning). Furthermore, incorporating uncertainty can reduce the accuracy of the models [4], or significantly increase the latency [6, 16], leading to a challenging trade-off between accuracy, safety, and runtime.

Fig. 1: Example predictions of the proposed CertainNet from the KITTI validation set [9]. Uncertainties for objectness, location and size are shown as the complement of the value at the top left of each box, as the plus sign around the center, and as the dashed boxes, respectively. The object class is color-coded.

In this work, we address these issues with a novel method that provides separate and dedicated uncertainties for all detection outputs, while preserving speed and accuracy. We name our approach CertainNet, and our contributions can be summarized as follows:

  • We introduce a novel sampling-free method to estimate the model uncertainty of an object detector.

  • We propose the first real-time approach to estimate the uncertainty of all signals of the detection output (i.e. objectness, location, size and class).

  • We present new simple metrics to assess the quality of size and location uncertainties for object detection.

Additionally, we provide extensive evaluations and comparisons on three challenging public datasets, including out-of-domain data, reaching competitive accuracy, while improving significantly on uncertainty-aware metrics.

II Related Work

II-A Uncertainty Estimation for Neural Networks

A learning-based system can express two kinds of uncertainty: the epistemic uncertainty caused by the model, and the aleatoric uncertainty due to the input data [8]. The former can arise from errors in the training procedure (e.g. ignoring data imbalance), a sub-optimal architecture or the knowledge gap with out-of-domain data, while the latter originates from the data itself, e.g. due to sensor noise, coarse approximations, ambiguous samples, or also inaccurate ground truth labels [8].

There are several types of uncertainty estimation approaches. In this Section we provide a brief overview following the categorization of Gawlikowski et al. [8]. Single deterministic methods predict with a deterministic model and provide an uncertainty estimate directly through that model [21], or via external computation [22, 18]. There are various ways for external methods to estimate the uncertainty, e.g. by computing the distance to the training data. DUQ, the work of Van Amersfoort et al. [22], belongs to this subcategory. The authors used Radial Basis Functions (RBF) to learn a transformation of features into a hyperspace. At inference time, they quantify the uncertainty as the distance in the hyperspace between the predicted features and a set of learned class centroids.
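
To make the idea concrete, the following is a minimal sketch (not the authors' released code) of how an RBF kernel turns the distance between predicted feature embeddings and learned class centroids into per-class certainty scores; the embedding dimension, length scale, and centroid values are placeholders.

```python
import torch

def rbf_certainty(embeddings, centroids, length_scale=0.1):
    """Per-class certainty as an RBF kernel of the distance to each class centroid.

    embeddings: (B, D) predicted hyperspace coordinates for a batch of inputs
    centroids:  (C, D) learned class centroids
    Returns a (B, C) tensor of scores in [0, 1]; the uncertainty is their complement.
    """
    # Squared Euclidean distance between every embedding and every centroid
    sq_dists = torch.cdist(embeddings, centroids) ** 2        # (B, C)
    return torch.exp(-sq_dists / (2 * length_scale ** 2))     # RBF kernel

# Toy usage: 4 samples, 8-dim hyperspace, 3 classes (all values are dummies)
scores = rbf_certainty(torch.randn(4, 8), torch.randn(3, 8))
uncertainty = 1.0 - scores.max(dim=1).values   # high when far from every centroid
```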

Another group of works is that of Bayesian methods, which infer the probability distribution of the model parameters [8]. In this category, Monte-Carlo Dropout (MC Dropout) [6] approximates this distribution by considering the set of models originating from keeping dropout active at test time. Thanks to the randomness of dropout, each forward pass produces different outputs, from which the uncertainty can be estimated according to their disagreement.

Then, ensemble methods also combine multiple outputs for both predictions and uncertainty estimates [14]. In this case the outputs are inferred by a set of different deterministic models, trained independently. Subcategories of ensembles can be found depending on how these different models are obtained [8].

Lastly, test-time augmentation methods use a single deterministic model and augment its input during inference, collecting a variety of predictions, from which a final output and uncertainty estimate are computed [23].

A different categorization distinguishes sampling-based and sampling-free approaches, depending on how the uncertainty is estimated. The former group requires inferring multiple times on the same input with different models (for Bayesian and ensemble methods), or different inputs with the same model (with test-time augmentations), and then aggregating the results. This significantly increases the runtime [6, 16], which makes it impractical for most real-time scenarios, such as autonomous driving. Instead, the latter category does not require sampling, thanks to built-in quantification techniques, such as those typically used in single deterministic methods [22].

Our work builds on top of the findings of Van Amersfoort et al. [22]. In particular, we considerably extend and adapt their DUQ from image classification to a dense regression task for object detection, estimating the model objectness uncertainty at a given location, while significantly improving training stability and convergence. Additionally, we estimate the uncertainty on the variety of regressed outputs of object detection (e.g. object size), by aggregating the predictions provided by the detector.

II-B Uncertainty Estimation in Object Detection

While many approaches quantify the uncertainty on basic tasks, laying the foundations of this domain [6, 22], fewer estimate it in safety-critical settings, such as autonomous driving [17, 12]. As we focus on estimating the uncertainty for object detection, in this Section we provide a brief summary of existing works in this area.

Choi et al. proposed Gaussian YOLOv3 (GYOLO) [3], extending the popular YOLOv3 detector [19] to estimate the object location and size uncertainties. They achieved this by predicting Gaussian parameters (i.e. mean and variance) for the position and dimensions of the boxes. Miller et al. [16] used MC Dropout [6] and thoroughly evaluated different clustering techniques, extracting the variance and entropy of spatial and class predictions respectively. Harakeh et al. [12] also used MC Dropout [6] and replaced the usual non-maximum suppression (NMS) with clustering and Bayesian inference, obtaining an uncertainty quantification for the object size and class. Instead, Lee et al. [15] introduced Gaussian-FCOS, which adds a dedicated head on top of an anchor-less detector to estimate the localization uncertainty for each of the four box boundaries independently.

Likewise, our proposed CertainNet estimates the uncertainty for object detection, but in a different way. Unlike the works of Miller et al. [16] and Harakeh et al. [12], ours is sampling-free and real-time capable. Compared to GYOLO [3] and Gaussian-FCOS [15], ours does not regress the uncertainty as an explicit additional output, thereby reducing the potential impact of biases in the training data. Moreover, we are the first to quantify the uncertainty for each individual aspect of the detection output (i.e. objectness, location, size and class), increasing the explainability.

Fig. 2: The proposed CertainNet. The top shows how each uncertainty-aware objectness score is computed. At the bottom right, examples of more and less certain predictions for each output signal are shown.

III Method

In this work, we estimate the uncertainty for each aspect of object detection, improving the model explainability for safety-critical applications. Towards this end, we build on top of an anchor-less detection framework (Section III-A), which we render uncertainty-aware (Section III-B). We then compute the uncertainty for objectness, location, dimensions and class (Section III-C). Figure 2 shows an overview of the proposed method. This decomposition of the uncertainty into the individual detection components provides valuable information to improve the safety of downstream tasks.

III-A Base Detection Framework

CenterNet [27] is a fast and accurate anchor-less detector. It predicts object centers via a class heatmap, and regresses the bounding box size in a separate dimensions heatmap. As an intermediate output, CenterNet provides a bounding box for every pixel, hence yielding a distribution of outputs from which we estimate our uncertainties. For these reasons, we use it as our base detector. While we focus on 2D bounding boxes, 3D boxes can be predicted following [27], and the additional uncertainties (e.g. for object distance and length) can be computed analogously to the ones presented in this work.

III-B Uncertainty-awareness

We make the class heatmap of the detector uncertainty-aware by exploiting DUQ [22], extending it from recognizing out-of-domain samples in image classification, for which it was proposed, to a dense regression task within our object detector. During training, we learn a set of class representatives (i.e. centroids), which are then compared with each prediction at inference time. By doing so, similarly to DUQ [22], we assess the deviation from learned object prototypes, which is linked to the model uncertainty. We compute this for all regressed heatmap values, which represent the objectness scores. In particular, it is the comparison with the learned centroids that makes every value of this heatmap uncertainty-aware.

Proposed for image classification, DUQ [22] learns to map each image to high-dimensional spaces, one per class. At training time, the class centroids are updated as moving averages of positive samples. As Berger et al. [1] also pointed out, DUQ suffers from instability issues and deployment difficulties. Therefore, after performing a thorough analysis of DUQ and its feature space, we incorporate several modifications to better suit the object detection task, as well as to improve the overall convergence and stability, as visualized in Figure 3.

Adaptation of DUQ to dense regression: Compared to the original implementation [22], the high dimensionality of the centroid space (e.g. 512 in [22]), together with the two additional dimensions required by our class heatmap output (i.e. width and height), raises speed and memory efficiency issues. We circumvent these by sharing the transformation to the hyperspace across pixels: towards this end, we use convolutions, instead of individual multiplications as in [22].
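
As a rough illustration of this adaptation (our own sketch, not the released implementation), a 1x1 convolution can map the backbone feature map to per-pixel, per-class hyperspace coordinates in one shared operation, after which the RBF kernel against each class centroid yields the uncertainty-aware score heatmap; all layer sizes and the length scale below are placeholders.

```python
import torch
import torch.nn as nn

class DenseRBFHead(nn.Module):
    """Maps backbone features to per-pixel, per-class hyperspace coordinates and
    turns the distance to each class centroid into an objectness-score heatmap."""

    def __init__(self, in_ch=64, n_classes=3, emb_dim=32, length_scale=0.2):
        super().__init__()
        # A single shared convolution replaces per-pixel matrix multiplications
        self.to_hyperspace = nn.Conv2d(in_ch, n_classes * emb_dim, kernel_size=1)
        self.register_buffer("centroids", torch.randn(n_classes, emb_dim))
        self.n_classes, self.emb_dim = n_classes, emb_dim
        self.length_scale = length_scale

    def forward(self, feats):
        b, _, h, w = feats.shape
        e = self.to_hyperspace(feats).view(b, self.n_classes, self.emb_dim, h, w)
        diff = e - self.centroids.view(1, self.n_classes, self.emb_dim, 1, 1)
        sq_dist = (diff ** 2).sum(dim=2)                       # (B, C, H, W)
        heatmap = torch.exp(-sq_dist / (2 * self.length_scale ** 2))
        return heatmap, e                                      # uncertainty-aware scores

# Toy usage with dummy backbone features (shapes are placeholders)
head = DenseRBFHead()
scores, embeddings = head(torch.randn(2, 64, 96, 320))
```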

Balanced centroid update: The centroids need to be a good representation of the data, as at inference time they determine both predictions and uncertainties, making the centroid update a critical step. In particular, the centroids are updated as trailing averages, whose current estimates are moved closer to the current sets of predictions. In a regular setup [22], scaling by the amount of samples of a given class is reasonable, as each minibatch is representative of the whole dataset. However, with unbalanced classes and a varying amount of objects, this would lead to inconsistent update magnitudes, hindering convergence. We circumvent this and account for the minibatch variability by computing the centroid as:

(1)

This allows setting a higher weight on the center pixel than on its surroundings, controlled by a weighting hyperparameter. The right term is an average weighted and scaled by the ground truth class heatmap values, which form a Gaussian around each object center. The predicted hyperspace coordinates are obtained by applying the feature extractor and the class-specific hyperspace transformation. The update magnitude is controlled by a hyperparameter, defined as the centroid momentum [22].
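
A minimal sketch of a heatmap-weighted exponential moving average in the spirit of Equation (1) follows; the exact balance between the center pixel and its surroundings used in the paper is not reproduced, and the momentum value is a placeholder.

```python
import torch

@torch.no_grad()
def update_centroid(centroid, embeddings, gt_heatmap, momentum=0.999):
    """Move one class centroid towards the heatmap-weighted mean of the predicted
    hyperspace coordinates (a trailing / exponential moving average).

    centroid:   (D,) current centroid of the class
    embeddings: (B, D, H, W) predicted hyperspace coordinates for that class
    gt_heatmap: (B, H, W) ground-truth heatmap with Gaussians around object centers
    """
    w = gt_heatmap.unsqueeze(1)                                # (B, 1, H, W)
    # Normalizing by the total heatmap mass keeps the update magnitude consistent
    # regardless of how many objects of this class appear in the minibatch.
    weighted_mean = (embeddings * w).sum(dim=(0, 2, 3)) / w.sum().clamp(min=1e-6)
    return momentum * centroid + (1.0 - momentum) * weighted_mean
```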

Hyperspace regularization: To properly represent the training data, the centroids should lie among positive samples distributed along a Gaussian hypersphere. However, by visualizing the hyperspace via PCA and t-SNE, we noticed that this is not the case, with the centroid often falling around the boundary of an irregularly shaped training distribution (top of Figure 3). We circumvent this and aim at the ideal case by regularizing this hyperspace with:

(2)

which acts on the Euclidean distance between the centroid and the prediction at each pixel, weighted by the ground truth class heatmap values. Furthermore, we do not use the gradient penalty of DUQ [22], as we find our regularization to deliver more stable and consistent results.
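
A possible form of such a regularizer, following the description above (our reading of Equation (2), with the loss weight omitted), pulls the embeddings of positive pixels towards their class centroid, weighted by the ground-truth heatmap:

```python
import torch

def hyperspace_regularization(embeddings, centroid, gt_heatmap):
    """Penalize the Euclidean distance between positive-pixel embeddings and the
    class centroid, weighted by the ground-truth class heatmap values.

    embeddings: (B, D, H, W), centroid: (D,), gt_heatmap: (B, H, W)
    """
    dist = (embeddings - centroid.view(1, -1, 1, 1)).norm(dim=1)   # (B, H, W)
    return (gt_heatmap * dist).sum() / gt_heatmap.sum().clamp(min=1e-6)
```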

Fig. 3: Centroid update scheme, with the learned hyperspace. The bottom shows the impact of our modifications, compared to the unstable procedure of DUQ [22] at the top when applied to an unbalanced detection task.

Centroid momentum scheduling: The centroids are moved using steps whose magnitude is proportional to their distance to the predictions and depends on the centroid momentum. A large momentum slows down training, while improving its stability, but reduces the robustness against the initialization; a low momentum has the inverse effect. While a relatively low momentum worked well for balanced classes and large batch sizes [22], it did not in our task. We apply a scheduling of the momentum similar to that of learning rates, to improve the stability in later training stages, while preserving robustness and speed at earlier ones. This effect is represented in Figure 3. Specifically, we reduce the centroid update step by a factor of 10 at fixed epochs.

Outliers protection: Additionally, we prevent outliers from impacting the centroids. In particular, we define an outlier as a prediction lying further from its centroid than three times the length scale. This further improves the training stability and convergence: instead of following a highly moving target (i.e. the centroid), the predictions move towards a more stable one.

Length scale annealing: At early stages of training, the centroids' positions and the hyperspace transformation are not yet properly tuned, due to the random initialization. This, together with the high dimensionality of the space, causes instability, as even correct inputs can be mapped far from their centroid. We alleviate this problem and improve stability with an approach similar to simulated annealing [13]: we increase the length scale at the initial stages, and then slowly reduce it at each step towards the original value. This increases the gradients at larger distances, which would otherwise be nearly zero, thereby improving convergence and predictions, as depicted in Figure 3.
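
Both schedules can be realized in a few lines; the sketch below uses illustrative values (not the paper's settings): the centroid update step is cut by 10x at fixed epochs, and the length scale starts enlarged and decays multiplicatively per step back to its target value.

```python
def scheduled_update_step(epoch, base_step=1e-2, milestones=(30, 60)):
    """Centroid momentum scheduling: the update step (1 - momentum) is reduced by a
    factor of 10 at fixed epochs, so centroids move freely early on and stabilize
    later. All numbers are illustrative placeholders, not the paper's settings."""
    step = base_step
    for m in milestones:
        if epoch >= m:
            step /= 10.0
    return step                      # momentum = 1 - step

def annealed_length_scale(step, target=0.1, start_factor=4.0, decay=0.999):
    """Length scale annealing: start with an enlarged length scale so that distant,
    randomly initialized embeddings still receive useful gradients, then decay it
    multiplicatively at each training step until the target value is reached."""
    return max(target * start_factor * decay ** step, target)
```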

| Method | mAP | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard | AUPR-In | AUPR-Out | fps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GaussianYOLO [3] | 56.74 | 89.91 | 84.36 | 75.65 | 58.43 | 50.95 | 43.69 | 45.97 | 31.17 | 30.53 | 65.39 | 96.42 | 18.06 |
| CenterNet [27] | 71.49 | 92.30 | 89.15 | 82.17 | 76.53 | 67.53 | 59.37 | 73.48 | 52.63 | 50.25 | 68.33 | 99.82 | 20.21 |
| 5-Ensemble [16] | 73.70 | 97.84 | 89.75 | 80.87 | 77.08 | 67.66 | 59.17 | 79.45 | 57.30 | 54.17 | 72.13 | 97.32 | 5.10 |
| CertainNet [ours] | 73.00 | 93.81 | 89.36 | 82.11 | 76.33 | 66.13 | 58.54 | 78.02 | 57.49 | 55.20 | 69.85 | 99.85 | 16.46 |
TABLE I: Detection performance comparison with related works on the KITTI [9] validation set across classes and difficulties.
| Dataset | Method | AP | AUPR-In (obj.) | AUPR-Out (obj.) | AUROC (obj.) | ECE (obj.) | UE (obj.) | CE (loc.) | UBQ (loc.) | BR (loc.) | CE (dim.) | UBQ (dim.) | BR (dim.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KITTI | GaussianYOLO [3] | 84.36 | 74.93 | 91.24 | 86.23 | 23.37 | 20.02 | 11.58 | 60.01 | 60.56 | 6.18 | 75.34 | 77.47 |
| KITTI | CenterNet [27] | 89.15 | 72.76 | 99.56 | 96.65 | 10.49 | 6.00 | 7.17 | 68.27 | 69.80 | 4.23 | 85.84 | 89.58 |
| KITTI | 5-Ensemble [16] | 89.75 | 73.79 | 93.61 | 87.13 | 16.93 | 17.72 | 4.59 | 86.71 | 90.43 | 5.01 | 84.28 | 87.54 |
| KITTI | CertainNet [ours] | 89.36 | 78.16 | 99.81 | 98.00 | 4.60 | 4.98 | 4.54 | 75.92 | 77.81 | 4.47 | 86.76 | 91.05 |
| BDD | GaussianYOLO [3] | 31.03 | 88.73 | 79.72 | 85.84 | 22.80 | 20.68 | 9.48 | 57.32 | 58.49 | 7.67 | 57.94 | 59.19 |
| BDD | CenterNet [27] | 34.51 | 81.35 | 98.06 | 91.74 | 7.78 | 14.03 | 25.72 | 35.90 | 36.76 | 10.53 | 73.43 | 79.55 |
| BDD | 5-Ensemble [16] | 26.75 | 96.42 | 81.84 | 91.99 | 25.64 | 15.26 | 3.31 | 84.83 | 93.54 | 9.40 | 84.81 | 93.50 |
| BDD | CertainNet [ours] | 40.93 | 78.77 | 97.82 | 91.17 | 4.89 | 14.85 | 14.79 | 44.71 | 46.03 | 9.68 | 72.91 | 78.89 |
| nuIm. | GaussianYOLO [3] | 30.38 | 86.24 | 87.95 | 87.78 | 17.05 | 17.74 | 10.42 | 56.20 | 57.62 | 8.74 | 58.38 | 60.14 |
| nuIm. | CenterNet [27] | 43.93 | 75.94 | 98.93 | 91.58 | 5.51 | 15.70 | 27.07 | 35.69 | 36.21 | 8.39 | 74.79 | 79.21 |
| nuIm. | 5-Ensemble [16] | 35.31 | 93.93 | 90.90 | 93.04 | 17.74 | 13.98 | 3.37 | 89.48 | 98.51 | 8.36 | 89.31 | 98.21 |
| nuIm. | CertainNet [ours] | 53.14 | 79.97 | 99.28 | 94.24 | 7.80 | 11.90 | 16.21 | 43.57 | 44.46 | 7.73 | 75.36 | 80.25 |
TABLE II: Uncertainty quality comparison with related works on the validation sets of KITTI [9], BDD100K [25], and nuImages (nuIm.) [2]. All models were trained on KITTI, and transferred to the other datasets without fine-tuning. All results are for Car (moderate difficulty for KITTI). For CenterNet we applied our location and dimensions uncertainty estimations as post-processing.

III-C Uncertainty Estimation

Objectness: Thanks to the uncertainty-awareness of the class heatmap (Section III-B), the objectness uncertainty can be directly inferred as the complement of the score:

(3)

with the score given by a Gaussian kernel centered at the class centroid, whose variance is the length scale.

Location: An object location is computed according to the class heatmap. In our CertainNet, this output is uncertainty-aware (Section III-B). We make use of this by extracting the location uncertainty from the distribution of scores around an object center. As can be seen in Figure 2, a location is more certain if the center peak is sharper and more defined, and less certain with a smoother and flatter peak. The uncertainty for the location along the x direction depends on the horizontal variance and is normalized by the predicted width. We compute it as follows:

(4)

where the sum iterates over the pixels around the center, using for each pixel its x-offset to the center, the angle between the x-axis and the pixel, and its predicted objectness (Equation 3). The cosine of the angle weights the influence of the scores that are not axis-aligned. The vertical location uncertainty is computed analogously, replacing the predicted width with the predicted height and the x-offsets with the y-offsets.
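
A plausible instantiation of this computation is sketched below; it measures the objectness-weighted spread of the x-offsets around the detected center, normalized by the predicted width, with off-axis pixels down-weighted by the cosine of their angle. It follows the description above rather than reproducing Equation (4) exactly, and the window radius is an assumption.

```python
import numpy as np

def location_uncertainty_x(heatmap, cx, cy, width, radius=3):
    """Horizontal location uncertainty from the sharpness of the heatmap peak:
    an objectness-weighted spread of the x-offsets around the detected center,
    normalized by the predicted box width.

    heatmap: (H, W) uncertainty-aware objectness scores
    cx, cy:  integer center coordinates; width: predicted box width in pixels
    """
    h, w = heatmap.shape
    num, den = 0.0, 0.0
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            dx, dy = x - cx, y - cy
            if dx == 0 and dy == 0:
                continue
            cos_theta = abs(dx) / np.hypot(dx, dy)    # down-weight off-axis pixels
            num += heatmap[y, x] * cos_theta * dx ** 2
            den += heatmap[y, x] * cos_theta
    return float(np.sqrt(num / den) / max(width, 1e-6)) if den > 0 else 0.0
```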

Dimensions: As described in Section III-A, the model provides a bounding box for each pixel in the output heatmaps. We exploit this by computing the uncertainty on the object size from the boxes around an object center. Intuitively, the more these boxes are similar to one another, the more certain the model is about the object dimensions, and vice versa, as shown in Figure 2. Towards this end, we compute the size uncertainty as an RMSE over the surrounding predicted dimensions for each object center, weighted by the pixel objectness scores:

(5)

where the predicted object width is compared with the width at each heatmap pixel around the center. The height uncertainty is computed analogously using the predicted heights.
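
The sketch below illustrates this for the width, as an objectness-weighted RMSE between the width predicted at the center and the widths predicted at the surrounding pixels; the window radius is an assumption, and the height case is analogous.

```python
import numpy as np

def width_uncertainty(width_map, score_map, cx, cy, pred_width, radius=3):
    """Width uncertainty as an objectness-weighted RMSE between the width predicted
    at the object center and the widths predicted at the surrounding pixels.

    width_map: (H, W) per-pixel predicted widths; score_map: (H, W) objectness scores
    """
    h, w = width_map.shape
    sq_err, weights = 0.0, 0.0
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            if x == cx and y == cy:
                continue
            sq_err += score_map[y, x] * (width_map[y, x] - pred_width) ** 2
            weights += score_map[y, x]
    return float(np.sqrt(sq_err / max(weights, 1e-6)))
```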

Class: The object class is predicted via the class heatmap, which is uncertainty-aware in our model (Section III-B). As in Figure 2, a class prediction is more certain if there is a peak only at that class while the other class probabilities are low, and it is less certain the higher those other probabilities are. We formalize this by comparing the objectness scores at each predicted center across the classes:

(6)

with the k-highest class probability corresponding, for k = 1, to the predicted class. We compute the measure recursively over the ranked class probabilities, reaching the final class uncertainty at the last step.
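
As a simplified stand-in for Equation (6) (the recursive formulation over all ranked class probabilities is not reproduced here), the class uncertainty can be approximated by the margin between the two highest class scores at the predicted center:

```python
import numpy as np

def class_uncertainty(center_scores):
    """Simplified class uncertainty: the ratio of the runner-up class score to the
    winning class score at a predicted center (0 when unambiguous, close to 1 when
    two classes are equally likely). center_scores: (C,) per-class objectness."""
    s = np.sort(np.asarray(center_scores, dtype=float))[::-1]
    if len(s) < 2:
        return 0.0
    if s[0] <= 0:
        return 1.0
    return float(s[1] / s[0])
```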

IV Experiments and Results

IV-A Experimental setup

Datasets: We conducted our experiments on three public autonomous driving datasets, namely KITTI [9], the recent nuImages extension of nuScenes [2], and the dashcam-based BDD100K [25]. KITTI is a popular benchmark for object detection, and it has been recorded in Germany. We followed the standard 3DOP split [27], which comprises 3712 training and 3769 validation images. Moreover, we used the standard car, pedestrian and cyclist classes. NuImages is a highly diverse large-scale dataset, collected in Boston and Singapore. It includes rain, snow and night conditions. BDD100K is another diverse large-scale dataset. It was crowd-sourced from a variety of dashboard cameras and different vehicles, around cities of the USA. It also includes various challenging weather and lighting conditions, such as snow, fog and night. In particular, we used nuImages and BDD100K to assess the generalization capability to out-of-domain data, which is critical for autonomous driving and uncertainty estimation. Towards this end, we evaluated KITTI models (without any fine-tuning) on the validation set of nuImages, which includes 3249 images, sized 1600x900, and the validation set of BDD100K, which contains 10K samples, sized 1280x720. Due to the sub-optimal class overlap between KITTI and the two transfer datasets, for the evaluations we focused on the car class.

Evaluation metrics: We evaluated the object detection performance with the standard AP, using the standard IoU threshold of 0.7 for KITTI cars and 0.5 for the other classes and datasets. We split the evaluation according to the specific uncertainty estimates, which we thoroughly evaluated. For the objectness uncertainty we computed various metrics, such as the area under the precision-recall curve (AUPR), reported as AUPR-In and AUPR-Out [16], which should be considered jointly. The former assesses the ability to accept all correct detections and minimise wrong acceptances. Conversely, the latter looks at the rejection of negative samples and the rate of wrong rejections. Additionally, we computed the area under the receiver operating characteristic curve (AUROC), comparing true and false positive rates, as used by Miller et al. [16]. Following [11], we calculated the expected calibration error (ECE) as the average difference between confidence and detection accuracy, weighted by the occurrence. Moreover, we evaluated the minimum uncertainty error (UE) as the ability to accept correct detections and reject incorrect ones based on the uncertainty estimate [16]. In the top of Figure 4, we show how location and size uncertainties have a similar effect on the box uncertainty boundaries, despite leading to different box distributions. This allows evaluating them in the same way. We adapted the expected calibration error [11] to our detection case, as the calibration error (CE): the average deviation between the predicted uncertainty and the measured error for matched detections. We independently computed the CE on x and y for the location, and on width and height for the size, reporting the average of each pair. All metrics in this work are percentage-based.
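
A minimal sketch of the CE as described above follows; it averages the absolute deviation between predicted uncertainty and measured error over matched detections, for one quantity at a time, omitting the binning of the standard ECE. The values used in the usage example are dummies.

```python
import numpy as np

def calibration_error(pred_uncertainties, measured_errors):
    """Calibration error (CE) for one quantity (e.g. the x-location or the width):
    mean absolute deviation between predicted uncertainty and measured error over
    matched detections. Both inputs are expected on the same (normalized) scale."""
    u = np.asarray(pred_uncertainties, dtype=float)
    e = np.asarray(measured_errors, dtype=float)
    return float(np.mean(np.abs(u - e)))

# The location CE averages the x and y components (dummy values below);
# the dimensions CE analogously averages width and height.
ce_location = 0.5 * (calibration_error([0.10, 0.05], [0.12, 0.02])
                     + calibration_error([0.08, 0.06], [0.05, 0.09]))
```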

Fig. 4: The top shows the effect of size and location uncertainties on the box uncertainty boundaries, allowing to evaluate them in a similar fashion. The bottom shows the IBQ and OBQ terms of our novel UBQ metric.

Uncertainty boundary quality: We introduce a new metric to evaluate the uncertainties on the object location and size. As shown at the bottom of Figure 4, the goal is assessing how much of an object lies within the uncertainty boundaries. We name this metric uncertainty boundary quality (UBQ), and compute it as the average across matched boxes of:

(7)

with BR being the boundary ratio between the inner and outer boundaries:

(8)

UBQ is composed of an inner (IBQ) and an outer (OBQ) term, which are averaged and multiplied by the boundary ratio (BR), to penalize extremely broad and trivial confidence intervals. Figure 4 shows how IBQ and OBQ are computed:

(9)

where the areas of the ground truth box and of the inner and outer predicted boundaries are compared. As for the CE, our UBQ metric can be used for both location and size. We report UBQ as the summary value, and BR as the ratio.
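
The sketch below shows one plausible instantiation of UBQ for a single matched detection. Since Equations (7)-(9) are not reproduced above, the exact IBQ and OBQ definitions here are assumptions: IBQ measures how much of the inner box lies inside the ground truth, OBQ how much of the ground truth is covered by the outer box, and BR is taken as the inner-to-outer area ratio.

```python
def box_area(b):
    """b = (x1, y1, x2, y2); degenerate boxes have zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection_area(a, b):
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(inter)

def ubq(inner, outer, gt):
    """Uncertainty boundary quality for one matched detection, with the assumed
    IBQ/OBQ/BR definitions from the lead-in. Returns (UBQ, BR)."""
    ibq = intersection_area(inner, gt) / max(box_area(inner), 1e-6)  # inner inside object
    obq = intersection_area(gt, outer) / max(box_area(gt), 1e-6)     # object covered by outer
    br = box_area(inner) / max(box_area(outer), 1e-6)                # penalize overly wide bounds
    return 0.5 * (ibq + obq) * br, br
```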

Network architecture: All our models were based on a DLA architecture [26]. We selected it as, compared to others, it offers a good speed-accuracy trade-off [27]. Specifically, we followed the DLA-34 configuration of CenterNet, with deformable convolutions [27]. We then made the class heatmap head uncertainty-aware, with the hyperspace transformation and the RBF kernel, as described in Section III-B. We used a single model for all three classes of KITTI.

Fig. 5: Example predictions of the proposed CertainNet on out-of-domain data. The model was trained on KITTI [9] and applied to nuImages [2] and BDD100K [25] without any fine-tuning. The bottom row shows crops. The boxes follow the format introduced in Figure 1.

Implementation details: We trained our models with an Adam optimizer for epochs, with additional epochs for the objectness head. We used an initial learning rate of , reduced by a factor of after epochs and . We chose a batch size of 16 and a resolution of for KITTI, resizing BDD and nuImages samples to . We selected as the centroid dimensionality, and initialized the centroid momentum to , increasing it to at epochs respectively. The length scale was initialized at , with a decay rate of per step for the sigma annealing, until reaching . We weighted the hyperspace regularization loss by , and set to for balancing the score. To compensate for the added loss on the objectness, the weight on the dimensions loss was increased to . We set , and kept the remaining hyperparameters the same as proposed for CenterNet [27]. We trained our models using PyTorch on a single NVIDIA Tesla V100 32GB GPU and evaluated the runtime (fps) on a single NVIDIA Quadro RTX 4000 8GB GPU.

| ID | Description | AP | AUPR-In | AUPR-Out |
|---|---|---|---|---|
| A0 | Adapted DUQ | 76.41 | 81.64 | 99.62 |
| A1 | A0 − gradient penalty + hyperspace reg. | 87.33 | 76.26 | 99.15 |
| A2 | A1 + balanced update | 87.98 | 74.12 | 99.71 |
| A3 | A2 + outlier protection | 88.14 | 77.34 | 99.56 |
| A4 | A3 + momentum scheduling | 88.76 | 75.27 | 99.72 |
| A5 | A4 + length scale annealing | 88.85 | 81.08 | 99.80 |
| A6 | A5 + freeze last 10 epochs | 89.36 | 78.16 | 99.81 |
TABLE III: Ablation study on the KITTI [9] validation set according to Car moderate detection and uncertainty quality.

Details for prior works: For a fair comparison, we re-trained all methods on the same standard dataset split [27], using the official implementations and hyperparameters, starting from weights pretrained on ImageNet [20], until convergence. For GYOLO [3] we used the authors' best performing settings, with input size 704 on Darknet. When evaluating location and size uncertainties, we extended CenterNet [27] with our location and dimensions uncertainty estimations, applied on top of its outputs. For the ensemble, we trained 5 DLA-34-based CenterNet models (randomly initialized, with data shuffling and augmentations), and combined their predictions following the merging strategy of Miller et al. [16] with DBSCAN. For CenterNet, GYOLO and the ensemble, the objectness uncertainty is evaluated using the standard confidence scores they provide.

IV-B Quantitative Results

Detection: Table I shows the detection performance of our approach compared to related works on KITTI [9]. Our uncertainty-aware CertainNet outperformed the CenterNet [27] baseline on the overall mAP, as well as most classes and difficulties, while sharing the same architecture and providing additional safety-related outputs. Moreover, the sampling-based 5-Ensemble [16] models achieved the overall highest mAP, and the best score in most classes and difficulties, with ours securing a close second place. However, the superior performance of the ensemble came at the cost of a significantly higher runtime. GYOLO [3] produced the worst detections. As shown in Table I, its AP scores are significantly lower than those reported by Choi et al. in [3], due to the different dataset splits used for KITTI. In particular, Choi et al. split training and validation randomly across single frames. We used instead a standard split [27], reducing the similarity between training and validation sets.

Uncertainty estimation: In Table II we report a variety of uncertainty metrics on different datasets for the main class, Car. On KITTI, our CertainNet achieved higher scores and lower errors than the other methods across most metrics. Remarkably, our approach achieved the best objectness results across the board, often with a substantial margin (e.g. on the calibration ECE), thanks to its effective uncertainty-aware heatmap. Interestingly, in terms of uncertainties, the ensemble could not achieve satisfactory results, except for the location. This can be attributed to the disagreement among the 5 models, and to the difficulty of merging contrasting predictions in the case of object detection. Table II also shows the suboptimal performance of GYOLO [3], which tended to overestimate the uncertainty boundaries, especially for the location (BR around 60%), despite its training time modifications for location and size, and despite having been trained on the same dataset. Furthermore, the flexibility of our uncertainty quantification techniques allowed us to produce estimates also for the location and size of the uncertainty-unaware CenterNet [27]. However, this configuration could not match the uncertainty quality of our CertainNet, especially for the location, which is computed on the class heatmap. This shows how our modifications at training time positively contributed to the uncertainty-awareness of the whole model.

Out-of-domain data: As all models were trained on KITTI, the results for BDD100K and nuImages in Table II show the ability of each method to generalize to unseen scenarios in out-of-domain data. As also pointed out in [7, 10], multiple differences define a substantial domain gap between these two datasets and KITTI: weather and lighting conditions, image resolution and aspect ratio, camera mounting position and type (e.g. dashcams, often with reflections, in BDD), different continents and street layouts, the amount of dynamic scenes, as well as the different vehicles on sale in those regions. All of these render the transfer task particularly difficult. Most notably, our CertainNet reached the highest AP on both datasets, by a significant margin on both BDD and nuImages. This shows the robustness of our approach, as well as the strong generalization provided by our uncertainty estimations, compared to CenterNet [27]. Interestingly, the ensemble, which performed well on KITTI (i.e. the training dataset), underperformed on both out-of-domain datasets. We attribute this to the rule-based clustering step required to merge the predictions [16], which was tuned on KITTI. This reiterates that the ensemble's aggregation step is critical in object detection, and is prone to deliver suboptimal results on out-of-domain data, a setting that is fundamental in autonomous driving. GYOLO [3] followed a similar trend, but suffered less than the ensemble from the transfer to out-of-distribution data, thanks to its learning-based techniques, which generalized better. Nevertheless, the uncertainty estimates of the ensemble seemed of high quality for the most part, especially for the object location and size. However, this is due to the rather low number of detections that all related works provided, as discussed below. Overall, our CertainNet achieved the highest AP, while maintaining high quality uncertainty estimates, which correctly increased on out-of-domain data.

Amount of detections: The uncertainty evaluation metrics used in this work are highly influenced by the number of predictions, or by that of matched detections. Therefore, high scores and low error rates can be achieved by a model that outputs a low amount of high confidence boxes. We found this to be the case for the methods in Table II: on nuImages [2] our method detected 4905 cars, CenterNet 3714, 5-Ensemble 3229, and GYOLO 2248; similarly on BDD [25], our CertainNet detected 26802, CenterNet 20210, the ensemble 16440, and GYOLO 15395. The results also show the limitations of the ensemble, confirming that contrasting predictions among its models can occur in highly uncertain scenarios, such as out-of-domain data (e.g. BDD and nuImages in Table II), and thereby significantly degrade the overall detection and uncertainty performance. In general, each of the uncertainty metrics evaluates only a specific aspect and should be considered together with the others, and combined with the detection performance (i.e. AP and mAP).

Ablation study: The importance of our modifications over DUQ [22] is confirmed by the results in Table III, where A6 represents our full approach. In particular, the baseline A0 could not converge properly, due to the limitations described in Section III-B, worsened by the high class imbalance of KITTI [9]. The biggest improvement was brought by A1, which replaces the gradient penalty of DUQ with our hyperspace regularization and also improved the consistency of training convergence. Moreover, the AP increased monotonically with the introduction of each modification (Section III-B), while maintaining good uncertainty estimates throughout.

IV-C Qualitative Results

In Figures 1 and 5 we show predictions of our method on challenging scenes of KITTI and on out-of-domain samples, respectively. The images confirm the large domain gap described in Section IV-B. Nevertheless, the proposed CertainNet correctly detected most objects, despite the unseen night and rainy settings in Figure 5. Compared to the KITTI predictions in Figure 1, the objectness certainties (values at the top left of each box) are lower for the unusual and difficult to see out-of-domain vehicles in Figure 5. Interestingly, the uncertainty estimates for location and size improve the overlap between sub-optimal predictions and the object, e.g. at the bottom left of Figure 5, as captured by our UBQ metric.

V Conclusion

In this work we introduced CertainNet, a novel method to estimate the uncertainty of every aspect of 2D object detection. Extensive evaluations showed the benefit of quantifying the uncertainty, especially when transferring to out-of-distribution data for safety-critical applications. Our method substantially improved the generalization ability over previous works, with more accurate predictions and high quality uncertainty estimates, while running in real-time. Therefore, the proposed CertainNet constitutes a valuable contribution towards robust object detection for autonomous driving.

References

  • [1] C. Berger, M. Paschali, B. Glocker, and K. Kamnitsas (2021) Confidence-based out-of-distribution detection: a comparative study and analysis. arXiv, 2107.02568. Cited by: §III-B.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, et al. (2020) Nuscenes: a multimodal dataset for autonomous driving. In IEEE/CVF CVPR, pp. 11621–11631. Cited by: TABLE II, Fig. 5, §IV-A, §IV-B.
  • [3] J. Choi, D. Chun, H. Kim, and H. Lee (2019) Gaussian yolov3: an accurate and fast object detector using localization uncertainty for autonomous driving. In IEEE/CVF ICCV, pp. 502–511. Cited by: §I, §I, §II-B, §II-B, TABLE I, TABLE II, §IV-A, §IV-B, §IV-B, §IV-B.
  • [4] T. Cortinhal, G. Tzelepis, and E. E. Aksoy (2020) SalsaNext: fast, uncertainty-aware semantic segmentation of lidar point clouds. In ISVC, pp. 207–222. Cited by: §I.
  • [5] D. Feng, L. Rosenbaum, and K. Dietmayer (2018) Towards safe autonomous driving: capture uncertainty in the deep neural network for LiDAR 3D vehicle detection. In IEEE ITSC, pp. 3266–3273. Cited by: §I.
  • [6] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In ICML, pp. 1050–1059. Cited by: §I, §II-A, §II-A, §II-B, §II-B.
  • [7] S. Gasperini, P. Koch, V. Dallabetta, N. Navab, B. Busam, and F. Tombari (2021) R4Dyn: exploring radar for self-supervised monocular depth estimation of dynamic scenes. arXiv, 2108.04814. Cited by: §IV-B.
  • [8] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, et al. (2021) A survey of uncertainty in deep neural networks. arXiv, 2107.03342. Cited by: §I, §I, §II-A, §II-A, §II-A, §II-A.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE/CVF CVPR, pp. 3354–3361. Cited by: Fig. 1, TABLE I, TABLE II, Fig. 5, §IV-A, §IV-B, §IV-B, TABLE III.
  • [10] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon (2020) 3D packing for self-supervised monocular depth estimation. In IEEE/CVF CVPR, pp. 2485–2494. Cited by: §IV-B.
  • [11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In ICML, pp. 1321–1330. Cited by: §IV-A.
  • [12] A. Harakeh, M. Smart, and S. L. Waslander (2020) Bayesod: a bayesian approach for uncertainty estimation in deep object detectors. In IEEE ICRA, pp. 87–93. Cited by: §II-B, §II-B, §II-B.
  • [13] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi (1983) Optimization by simulated annealing. Science 220 (4598), pp. 671–680. Cited by: §III-B.
  • [14] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pp. 6405–6416. Cited by: §II-A.
  • [15] Y. Lee, J. Hwang, H. Kim, K. Yun, and Y. Kwon (2020) Localization uncertainty estimation for anchor-free object detection. arXiv, 2006.15607. Cited by: §I, §II-B, §II-B.
  • [16] D. Miller, F. Dayoub, M. Milford, and N. Sünderhauf (2019) Evaluating merging strategies for sampling-based uncertainty techniques in object detection. In IEEE ICRA, pp. 2348–2354. Cited by: §I, §II-A, §II-B, §II-B, TABLE I, TABLE II, §IV-A, §IV-A, §IV-B, §IV-B.
  • [17] J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari (2019) Sampling-free epistemic uncertainty estimation using approximated variance propagation. In IEEE/CVF ICCV, pp. 2931–2940. Cited by: §I, §II-B.
  • [18] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, and J. Kleinberg (2019) Direct uncertainty prediction for medical second opinions. In ICML, pp. 5281–5290. Cited by: §II-A.
  • [19] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv, 1804.02767. Cited by: §II-B.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al. (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (3), pp. 211–252. External Links: Document Cited by: §IV-A.
  • [21] M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In NeurIPS, pp. 3183–3193. Cited by: §II-A.
  • [22] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal (2020) Uncertainty estimation using a single deep deterministic neural network. In ICML, pp. 9690–9700. Cited by: §II-A, §II-A, §II-A, §II-B, Fig. 3, §III-B, §III-B, §III-B, §III-B, §III-B, §III-B, §IV-B.
  • [23] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, and T. Vercauteren (2019) Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338, pp. 34–45. Cited by: §II-A.
  • [24] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers (2020) D3VO: deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE/CVF CVPR, pp. 1281–1292. Cited by: §I.
  • [25] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) BDD100k: a diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF CVPR, pp. 2636–2645. Cited by: TABLE II, Fig. 5, §IV-A, §IV-B.
  • [26] F. Yu, D. Wang, E. Shelhamer, and T. Darrell (2018) Deep layer aggregation. In IEEE/CVF CVPR, pp. 2403–2412. Cited by: §IV-A.
  • [27] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv, 1904.07850. Cited by: §III-A, TABLE I, TABLE II, §IV-A, §IV-A, §IV-A, §IV-A, §IV-B, §IV-B, §IV-B.