Deep Deterministic Uncertainty for Semantic Segmentation

by   Jishnu Mukhoti, et al.

We extend Deep Deterministic Uncertainty (DDU), a method for uncertainty estimation using feature space densities, to semantic segmentation. DDU enables quantifying and disentangling epistemic and aleatoric uncertainty in a single forward pass through the model. We study the similarity of feature representations of pixels at different locations for the same class and conclude that it is feasible to apply DDU location independently, which leads to a significant reduction in memory consumption compared to pixel dependent DDU. Using the DeepLab-v3+ architecture on Pascal VOC 2012, we show that DDU improves upon MC Dropout and Deep Ensembles while being significantly faster to compute.




1 Introduction

With the increasing deployment of deep learning models in safety-critical applications like autonomous driving (Huang & Chen, 2020) and medical diagnosis (Esteva et al., 2017), it is imperative for such models to quantify their uncertainty reliably, in addition to making accurate predictions. A significant amount of research has been conducted in this direction and several methods have been introduced in the context of classification (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Blundell et al., 2015). These methods require several forward passes through the model, which makes them practically infeasible for large-scale applications like semantic segmentation (Long et al., 2015), where dense pixel-wise predictions are necessary, often in real time.

Recently, several methods have been introduced to obtain uncertainty in a single forward pass (Van Amersfoort et al., 2020; Liu et al., 2020; Mukhoti et al., 2021). In particular, DUQ (Van Amersfoort et al., 2020) and SNGP (Liu et al., 2020) propose using feature extractors with certain inductive biases to impose a bi-Lipschitz constraint on the feature space. They then use a distance-aware layer, either an RBF or a Gaussian Process, trained end-to-end with the feature extractor. However, both these methods require extensive changes to the model architecture and training setup, with additional hyperparameters which need to be fine-tuned. DDU (Mukhoti et al., 2021) shows that using feature space density with proper inductive biases can capture uncertainty and avoids the problem of feature collapse (Van Amersfoort et al., 2020). Due to feature collapse, out-of-distribution (OoD) samples are often mapped to in-distribution regions in the feature space, making the model overconfident on such inputs. Hence, in order to capture uncertainty through feature space density, one needs to use proper inductive biases on the model architecture.

Two kinds of uncertainty are important in the deep learning literature. Epistemic uncertainty captures what the model does not know: it is high for unseen or OoD inputs and can be reduced with more training data. Aleatoric uncertainty captures ambiguity and observation noise in in-distribution samples (Kendall & Gal, 2017). In DDU, the epistemic uncertainty is quantified using a feature space density, while the entropy of the softmax distribution can be used to estimate aleatoric uncertainty.

Figure 1: Applying DDU in the context of semantic segmentation

In this paper, we apply and extend DDU to the task of semantic segmentation (Long et al., 2015), where each pixel of a given input image is classified to produce an output which has the same spatial dimensions as the input. We choose semantic segmentation in particular as it forms an excellent example of an application with class imbalance and therefore requires reliable epistemic uncertainty estimates. Furthermore, state-of-the-art models for semantic segmentation (Chen et al., 2017; Zhao et al., 2017; Wang et al., 2020) are large, and conventional uncertainty quantification methods like MC Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017) are often prohibitively expensive on such models (Kendall et al., 2015; Mukhoti & Gal, 2018).

The paper is organised as follows: in Section 2, we describe how DDU can be extended for semantic segmentation and in Section 3, we provide results using the well-known DeepLab-v3+ (Chen et al., 2017) architecture on the Pascal VOC 2012 (Everingham et al., 2010) dataset to show that DDU outperforms other conventional methods (MC Dropout and Deep Ensembles) of uncertainty quantification in deep learning.

2 DDU in Semantic Segmentation

In this section, we provide details on how DDU can be extended to obtain epistemic and aleatoric uncertainty estimates in semantic segmentation.

A brief introduction to DDU: As described in Mukhoti et al. (2021) in the context of multiclass classification, after training a model with a bi-Lipschitz constraint, we can compute the feature space means and covariances per class using a single pass over all the training samples. The feature space means and covariance matrices can then be used to fit a Gaussian Discriminant Analysis (GDA) (Murphy, 2012). Let $z$ be the feature representation for a given input $x$, i.e., $z = f_\theta(x)$, where $\theta$ represents model parameters. Then the feature density is computed by marginalizing the density over all classes $y$ as

$$q(z) = \sum_{y} q(z \mid y)\, q(y) \qquad (1)$$

where $q(z \mid y)$ is obtained from the GDA and $q(y)$ can be computed directly from the training set. The feature density thus computed can be used to estimate model confidence (opposite of epistemic uncertainty). At the same time, for in-distribution samples, the entropy of the softmax distribution can be used to capture aleatoric uncertainty.
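As a rough illustration of the marginal feature density, the following NumPy sketch fits one Gaussian per class and evaluates the log-density with a log-sum-exp over classes. The function names, the `jitter` term added to each covariance for numerical stability, and the assumption that every class has at least a few samples are choices made for this example, not details taken from the paper.

```python
import numpy as np

def fit_gda(features, labels, num_classes, jitter=1e-6):
    """Fit one Gaussian per class; return means, covariances and class priors.

    features: (N, D) array, labels: (N,) integer array.
    Assumes every class occurs in `labels` (hypothetical simplification).
    """
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    covs = np.zeros((num_classes, dim, dim))
    priors = np.zeros(num_classes)
    for c in range(num_classes):
        z_c = features[labels == c]
        means[c] = z_c.mean(axis=0)
        centered = z_c - means[c]
        # Sample covariance plus a small diagonal term (an assumption) for stability.
        covs[c] = centered.T @ centered / max(len(z_c) - 1, 1) + jitter * np.eye(dim)
        priors[c] = len(z_c) / len(features)
    return means, covs, priors

def log_feature_density(z, means, covs, priors):
    """Return log q(z) = logsumexp_c [log q(z | y=c) + log q(y=c)] for each row of z."""
    n, dim = z.shape
    log_joint = np.empty((n, len(priors)))
    for c in range(len(priors)):
        diff = z - means[c]
        _, logdet = np.linalg.slogdet(covs[c])
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[c]), diff)
        log_joint[:, c] = -0.5 * (dim * np.log(2 * np.pi) + logdet + maha) + np.log(priors[c])
    # Numerically stable log-sum-exp over the class axis.
    m = log_joint.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))).ravel()
```

Working in log space avoids underflow, since Gaussian densities in high-dimensional feature spaces are typically far too small to represent directly.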

Pixel-independent class-wise means and covariances: In semantic segmentation, each pixel has a prediction attached to it and a corresponding softmax distribution. A natural question is whether means and covariance matrices must be computed per pixel in order to fit a GDA to the semantic segmentation model. Fortunately, we find that this is not necessary: we can compute means and covariance matrices independently of pixel location, just like in multi-class classification (thereby enforcing invariance). To see this, in Figure 2, we plot the L2 distances between the feature space means of all pairs of classes in the Pascal VOC validation set for two “distant” pixels. We find that the means of the same class are much closer together than the means of different classes, irrespective of where the pixels are located. This makes intuitive sense as the convolution kernel (a linear operator) which converts the feature space representations into logits is shared across the entire feature space representation.
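The comparison described above can be sketched as follows: compute the class-wise feature means at two fixed pixel locations, then take the L2 distance between every pair of class means. The array layout `(N, D, H, W)` and both function names are assumptions made for this example.

```python
import numpy as np

def class_means_at_pixel(features, labels, pixel, num_classes):
    """Mean feature vector per class at one spatial location.

    features: (N, D, H, W) feature maps; labels: (N, H, W) integer class maps.
    Classes never observed at that location get a row of NaN.
    """
    row, col = pixel
    z = features[:, :, row, col]   # (N, D) features at this pixel
    y = labels[:, row, col]        # (N,) labels at this pixel
    means = np.full((num_classes, z.shape[1]), np.nan)
    for c in range(num_classes):
        if np.any(y == c):
            means[c] = z[y == c].mean(axis=0)
    return means

def pairwise_mean_distances(means_a, means_b):
    """Entry (i, j) is the L2 distance between class i at pixel A and class j at pixel B."""
    diff = means_a[:, None, :] - means_b[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

If features of the same class are similar regardless of location, the diagonal of the resulting matrix is small relative to the off-diagonal entries.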

Computing feature density: Following the rationale above, we fit a GDA assuming pixels to be independent samples. Hence, we obtain one mean and one covariance matrix per class (not per pixel) and can apply Equation 1 to obtain the feature density per pixel, given an input image. Separately, we can also obtain the per pixel softmax entropy from the model. Using these two we can disentangle aleatoric and epistemic uncertainty with a single deterministic model in semantic segmentation. We present a schematic diagram of this process in Figure 1.
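One way to obtain a single mean and covariance per class without keeping every pixel's feature vector in memory is to accumulate running sums image by image. The sketch below is an assumption about how this could be organised, not the authors' implementation; the class name, the `(D, H, W)` layout, and the use of 255 as an ignored "void" label are illustrative choices.

```python
import numpy as np

class PixelwiseGDAStats:
    """Accumulates per-class sums over all labelled pixels, treating pixels as
    independent samples, so only O(classes * D^2) memory is needed."""

    def __init__(self, num_classes, dim):
        self.count = np.zeros(num_classes)
        self.sum = np.zeros((num_classes, dim))
        self.outer = np.zeros((num_classes, dim, dim))

    def update(self, feature_map, label_map, ignore_index=255):
        """feature_map: (D, H, W); label_map: (H, W) at the same resolution."""
        z = feature_map.reshape(feature_map.shape[0], -1).T   # (H*W, D)
        y = label_map.reshape(-1)
        for c in np.unique(y):
            if c == ignore_index:   # assumed void label, skipped
                continue
            z_c = z[y == c]
            self.count[c] += len(z_c)
            self.sum[c] += z_c.sum(axis=0)
            self.outer[c] += z_c.T @ z_c

    def finalize(self):
        """Return class means, (biased) covariances and priors.

        Classes with no pixels yield NaN means and covariances.
        """
        with np.errstate(invalid="ignore", divide="ignore"):
            n = self.count[:, None]
            means = self.sum / n
            # Cov = E[z z^T] - mean mean^T
            covs = self.outer / n[:, :, None] - np.einsum("ci,cj->cij", means, means)
        priors = self.count / self.count.sum()
        return means, covs, priors
```

Calling `update` once per training image and `finalize` at the end gives the class statistics needed for Equation 1, which can then be evaluated at every pixel of a new feature map.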

Figure 2: L2 distances between the feature space means of different classes for pairs of distant pixels on the Pascal VOC 2012 val set. The left, middle and right panels each show a different pixel pair.

3 Experiments

In this section, we evaluate DDU on semantic segmentation using the well-known Pascal VOC (Everingham et al., 2010) dataset and compare it with three other uncertainty baselines widely applied in practice: softmax entropy, MC Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017).

Architecture & Training setup: For all the baselines, we use DeepLab-v3+ (Chen et al., 2017) with a ResNet-101 (He et al., 2016) backbone as the architecture of choice. We train each model for 50 epochs on the Pascal VOC training set augmented using the Semantic Boundaries Dataset (SBD) (Hariharan et al., 2011), using SGD with momentum and weight decay as the optimiser. The learning rate follows a polynomial decay from its initial value over the course of training. Finally, training is parallelized over 4 GPUs.
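For readers unfamiliar with polynomial learning rate decay, a minimal sketch is shown below. The function name and the default exponent of 0.9 are assumptions for illustration (0.9 is a common choice for this kind of schedule); the exponent and base learning rate used in these experiments are not stated here.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomially decayed learning rate: base_lr * (1 - step / total_steps) ** power.

    `power=0.9` is a hypothetical default, not a value taken from the paper.
    """
    return base_lr * (1.0 - step / total_steps) ** power
```

The schedule starts at `base_lr` and decays smoothly to zero at `total_steps`.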

Table 1: Pascal VOC val set mIoU and runtime in milliseconds of a single forward pass for different baselines, averaged over 10 single forward passes. Columns: Baseline, mIoU, Runtime (ms). Note that for each single forward pass in the MC Dropout baseline, we perform 5 stochastic forward passes.

Baselines & Uncertainty metrics: As mentioned before, we compare DDU with three well-known baselines:

  1. Softmax Entropy: One of the most commonly used metrics for uncertainty is the entropy of the softmax distribution (Hendrycks & Gimpel, 2016). This metric is often preferred due to its simplicity and lack of computational overhead. Softmax entropy is known to capture aleatoric uncertainty for in-distribution samples (Mukhoti et al., 2021). However, it cannot capture epistemic uncertainty reliably (e.g., for OoD inputs).

  2. MC Dropout (MCDO) (Gal & Ghahramani, 2016) is a method which uses dropout at test time as an approximation to Bayesian inference. Multiple stochastic forward passes are performed with dropout layers active during test time. The softmax distributions $\{p_t\}_{t=1}^{T}$ obtained from these $T$ forward passes can then be used to compute either predictive entropy (PE), $\mathbb{H}\left[\frac{1}{T}\sum_{t} p_t\right]$, or mutual information (MI), $\mathbb{H}\left[\frac{1}{T}\sum_{t} p_t\right] - \frac{1}{T}\sum_{t} \mathbb{H}[p_t]$ (Houlsby et al., 2011), as measures of uncertainty. While MI is known to estimate epistemic uncertainty, PE captures both epistemic and aleatoric uncertainty (Gal, 2016). In our experiments, we implement MC Dropout by activating the dropout layers already present in the DeepLab-v3+ architecture during test time. We do not insert new dropout layers. Finally, we use 5 stochastic forward passes for MC Dropout.

  3. Deep Ensembles (Lakshminarayanan et al., 2017) is a simple method where an ensemble of neural networks is trained. Similar to MC Dropout, both the PE and the MI from the ensemble predictions can be used to estimate uncertainty. In our experiments we use an ensemble of 3 DeepLab-v3+ models, all trained with identical architecture and training setup.
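Predictive entropy and mutual information, as used by both the MC Dropout and Deep Ensembles baselines, can be computed per pixel from a stack of softmax outputs. The sketch below uses the standard definitions (entropy of the averaged prediction, and that entropy minus the average per-pass entropy); the function names, the `(T, C, H, W)` layout and the `eps` term inside the logarithm are assumptions for this example.

```python
import numpy as np

def entropy(probs, axis, eps=1e-12):
    """Shannon entropy (in nats) along the given class axis."""
    return -np.sum(probs * np.log(probs + eps), axis=axis)

def predictive_entropy_and_mi(probs):
    """probs: (T, C, H, W) softmax outputs from T stochastic passes or ensemble members.

    Returns per-pixel predictive entropy and mutual information, each of shape (H, W).
    """
    mean_probs = probs.mean(axis=0)                          # (C, H, W)
    pe = entropy(mean_probs, axis=0)                         # entropy of the averaged prediction
    expected_entropy = entropy(probs, axis=1).mean(axis=0)   # average entropy of individual passes
    return pe, pe - expected_entropy
```

When all passes agree, MI is close to zero even if each prediction is ambiguous; when confident passes disagree with each other, MI is high. With a single pass (`T = 1`), PE reduces to the plain softmax entropy of the first baseline.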

Metrics for evaluation: In order to evaluate the quality of uncertainty in semantic segmentation, we use the metrics proposed in (Mukhoti & Gal, 2018): p(accurate|certain), p(uncertain|inaccurate) and PAVPU. The metric p(accurate|certain) measures the probability of a prediction being accurate given that the model is confident on the prediction. Similarly, p(uncertain|inaccurate) measures the probability of the model being uncertain on inaccurate predictions. PAVPU computes the probability of the model being confident on an accurate prediction or uncertain on an inaccurate one. A good model should ideally have high values on all these 3 metrics. Note that these metrics depend on a threshold for uncertainty, i.e., to define a prediction as certain or uncertain. Hence, they can be computed for different uncertainty thresholds. We plot the performance of all the baselines on these metrics for different uncertainty thresholds in Figure 3.
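The three metrics can be sketched from four counts at a given uncertainty threshold: accurate and certain, accurate and uncertain, inaccurate and certain, inaccurate and uncertain. The version below counts individual pixels, which is a simplification assumed for this example; the same counts could equally be taken over image patches. The function name and dictionary keys are illustrative.

```python
import numpy as np

def uncertainty_metrics(correct, uncertainty, threshold):
    """correct: boolean array, True where the prediction matches the label.
    uncertainty: array of the same shape; values above `threshold` count as uncertain.
    """
    uncertain = uncertainty > threshold
    n_ac = np.sum(correct & ~uncertain)    # accurate and certain
    n_au = np.sum(correct & uncertain)     # accurate and uncertain
    n_ic = np.sum(~correct & ~uncertain)   # inaccurate and certain
    n_iu = np.sum(~correct & uncertain)    # inaccurate and uncertain
    return {
        "p(accurate|certain)": n_ac / max(n_ac + n_ic, 1),
        "p(uncertain|inaccurate)": n_iu / max(n_ic + n_iu, 1),
        "PAVPU": (n_ac + n_iu) / max(n_ac + n_au + n_ic + n_iu, 1),
    }
```

Since feature density measures confidence rather than uncertainty, one option for DDU is to pass the negative log-density as `uncertainty`. Sweeping `threshold` over a range of values gives one curve per metric.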

In addition, we report the Pascal VOC 2012 validation set mIoU and the runtime of a single forward pass for all the baselines in Table 1. Note that a single forward pass for the MC Dropout baseline consists of five stochastic forward passes and a single forward pass from the ensemble involves getting predictions from three ensemble components. Finally, we visualise the uncertainty estimates from each baseline for four samples from the Pascal VOC 2012 val set in Figure 4.

Figure 3: Evaluation metrics: p(accurate|certain), p(uncertain|inaccurate) and PAVPU evaluated on different baselines on the PASCAL VOC validation set. DDU outperforms all other baselines.
Figure 4: Visualisation of different uncertainty baselines on samples from the PASCAL VOC validation set. Panels: (a) Accuracy, (b) MCDO PE, (c) MCDO MI, (d) Ensemble PE, (e) Ensemble MI, (f) Entropy, (g) Density. The first column captures pixel-wise accuracy, with bright signifying accurate and dark inaccurate. The second and third columns show predictive entropy (PE) and mutual information (MI) obtained from the MC Dropout (MCDO) baseline respectively, and the fourth and fifth columns show the PE and MI from deep ensembles. The sixth column maps per-pixel softmax entropy, which is the aleatoric uncertainty estimate of DDU, and the seventh column shows feature density, which is the epistemic component captured by DDU. All the baselines except DDU density (last column on the right) capture uncertainty, i.e., the brighter the pixel, the more uncertain the prediction. DDU feature density instead captures confidence, so brighter pixels signify more confident predictions and darker pixels less confident ones.

Observations: Firstly, we note from Table 1 that the runtimes of DDU and a normal softmax model are far lower than those of MC Dropout and Deep Ensembles. In fact, a single forward pass in MC Dropout takes on the order of seconds on an Nvidia Quadro RTX 6000 GPU. Although these baselines have not been tuned for runtime, real-time latency requirements of around 200ms (i.e., 5 predictions a second) make adoption of time-consuming methods infeasible in real-life applications. Furthermore, note that the val set mIoU is very similar for all the models.

Secondly, from Figure 3, we can see that DDU, having higher values on p(accurate|certain), p(uncertain|inaccurate) and PAVPU for most uncertainty thresholds, outperforms all other baselines on all these three metrics.

Finally, from Figure 4, we note that DDU feature density captures epistemic uncertainty whereas softmax entropy captures aleatoric uncertainty. Note that aleatoric uncertainty is high on edges of objects as those are the regions with maximum ambiguity or observation noise. On the other hand, epistemic uncertainty is high on regions which are previously unseen (or less seen) by the model. In the first two samples (first two rows), the epistemic uncertainty is not high and aleatoric uncertainty is captured on the edges by softmax entropy. In the last sample (last row), the epistemic uncertainty is high for a big patch which is inaccurately predicted. DDU feature density for that entire patch is significantly lower whereas softmax entropy doesn’t capture it and is only high on the edges.

4 Conclusion

In this paper, we show that Deep Deterministic Uncertainty (DDU) can be easily extended to the task of semantic segmentation. We find that fitting DDU to a semantic segmentation model with a fully convolutional architecture can be done in a pixel-independent fashion, thereby making its adoption relatively simple. Finally, with experiments on Pascal VOC 2012 using DeepLab-v3+, we observe that DDU outperforms other well-known methods of uncertainty quantification without compromising on accuracy/mIoU and with the runtime of a single deterministic model.