There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.
Understanding what a model does not know is a critical part of many machine learning systems. Today, deep learning algorithms are able to learn powerful representations which can map high dimensional data to an array of outputs. However these mappings are often taken blindly and assumed to be accurate, which is not always the case. In two recent examples this has had disastrous consequences. In May 2016 there was the first fatality from an assisted driving system, caused by the perception system confusing the white side of a trailer for bright sky NHTSA (2017). In a second recent example, an image classification system erroneously identified two African Americans as gorillas Guynn (2015), raising concerns of racial discrimination. If both these algorithms were able to assign a high level of uncertainty to their erroneous predictions, then each system may have been able to make better decisions, likely avoiding disaster.
Quantifying uncertainty in computer vision applications can be largely divided into regression settings such as depth regression, and classification settings such as semantic segmentation. Existing approaches to model uncertainty in such settings in computer vision include particle filtering and conditional random fields (Blake et al., 1993; He et al., 2004). However many modern applications mandate the use of deep learning to achieve state-of-the-art performance He et al. (2016), with most deep learning models not able to represent uncertainty. Deep learning does not allow for uncertainty representation in regression settings for example, and deep learning classification models often give normalised score vectors, which do not necessarily capture model uncertainty. For both settings uncertainty can be captured with Bayesian deep learning approaches – which offer a practical framework for understanding uncertainty with deep learning models Gal (2016).
In Bayesian modeling, there are two main types of uncertainty one can model (Der Kiureghian and Ditlevsen, 2009). Aleatoric uncertainty captures noise inherent in the observations. This could be for example sensor noise or motion noise, resulting in uncertainty which cannot be reduced even if more data were to be collected. On the other hand, epistemic uncertainty accounts for uncertainty in the model parameters – uncertainty which captures our ignorance about which model generated our collected data. This uncertainty can be explained away given enough data, and is often referred to as model uncertainty. Aleatoric uncertainty can further be categorized into homoscedastic uncertainty, uncertainty which stays constant for different inputs, and heteroscedastic uncertainty. Heteroscedastic uncertainty depends on the inputs to the model, with some inputs potentially having noisier outputs than others. Heteroscedastic uncertainty is especially important for computer vision applications. For example, for depth regression, highly textured input images with strong vanishing lines are expected to result in confident predictions, whereas an input image of a featureless wall is expected to have very high uncertainty.
In this paper we make the observation that in many big data regimes (such as the ones common to deep learning with image data), it is most effective to model aleatoric uncertainty, uncertainty which cannot be explained away. This is in comparison to epistemic uncertainty which is mostly explained away with the large amounts of data often available in machine vision. We further show that modeling aleatoric uncertainty alone comes at a cost. Out-of-data examples, which can be identified with epistemic uncertainty, cannot be identified with aleatoric uncertainty alone.
For this we present a unified Bayesian deep learning framework which allows us to learn mappings from input data to aleatoric uncertainty and compose these together with epistemic uncertainty approximations. We derive our framework for both regression and classification applications and present results for per-pixel depth regression and semantic segmentation tasks (see Figure 1 and the supplementary video for examples). We show how modeling aleatoric uncertainty in regression can be used to learn loss attenuation, and develop a complementary approach for the classification case. This demonstrates the efficacy of our approach on difficult and large scale tasks.
The main contributions of this work are:
We capture an accurate understanding of aleatoric and epistemic uncertainties, in particular with a novel approach for classification,
We improve model performance over non-Bayesian baselines by reducing the effect of noisy data with the implied attenuation obtained from explicitly representing aleatoric uncertainty,
We study the trade-offs between modeling aleatoric or epistemic uncertainty by characterizing the properties of each uncertainty and comparing model performance and inference time.
Existing approaches to Bayesian deep learning capture either epistemic uncertainty alone, or aleatoric uncertainty alone Gal (2016). These uncertainties are formalised as probability distributions over either the model parameters, or model outputs, respectively. Epistemic uncertainty is modeled by placing a prior distribution over a model's weights, and then trying to capture how much these weights vary given some data. Aleatoric uncertainty on the other hand is modeled by placing a distribution over the output of the model. For example, in regression our outputs might be modeled as corrupted with Gaussian random noise. In this case we are interested in learning the noise's variance as a function of different inputs (such noise can also be modeled with a constant value for all data points, but this is of less practical interest). These uncertainties, in the context of Bayesian deep learning, are explained in more detail in this section.
To capture epistemic uncertainty in a neural network (NN) we put a prior distribution over its weights, for example a Gaussian prior distribution: $\mathbf{W} \sim \mathcal{N}(0, I)$.
Such a model is referred to as a Bayesian neural network (BNN) (Denker and LeCun, 1991; MacKay, 1992; Neal, 1995). Bayesian neural networks replace the deterministic network's weight parameters with distributions over these parameters, and instead of optimising the network weights directly we average over all possible weights (referred to as marginalisation). Denoting the random output of the BNN as $\mathbf{f}^{\mathbf{W}}(\mathbf{x})$, we define the model likelihood $p(\mathbf{y} \,|\, \mathbf{f}^{\mathbf{W}}(\mathbf{x}))$. Given a dataset $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, Bayesian inference is used to compute the posterior over the weights $p(\mathbf{W} \,|\, \mathbf{X}, \mathbf{Y})$. This posterior captures the set of plausible model parameters, given the data.
For regression tasks we often define our likelihood as a Gaussian with mean given by the model output: $p(\mathbf{y} \,|\, \mathbf{f}^{\mathbf{W}}(\mathbf{x})) = \mathcal{N}(\mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma^2)$, with an observation noise scalar $\sigma$. For classification, on the other hand, we often squash the model output through a softmax function, and sample from the resulting probability vector: $p(\mathbf{y} \,|\, \mathbf{f}^{\mathbf{W}}(\mathbf{x})) = \text{Softmax}(\mathbf{f}^{\mathbf{W}}(\mathbf{x}))$.
BNNs are easy to formulate, but difficult to perform inference in. This is because the marginal probability $p(\mathbf{Y} \,|\, \mathbf{X})$, required to evaluate the posterior $p(\mathbf{W} \,|\, \mathbf{X}, \mathbf{Y}) = p(\mathbf{Y} \,|\, \mathbf{X}, \mathbf{W})\, p(\mathbf{W}) \,/\, p(\mathbf{Y} \,|\, \mathbf{X})$, cannot be evaluated analytically. Different approximations exist (Graves, 2011; Blundell et al., 2015; Hernández-Lobato et al., 2016; Gal and Ghahramani, 2016). In these approximate inference techniques, the posterior $p(\mathbf{W} \,|\, \mathbf{X}, \mathbf{Y})$ is fitted with a simple distribution $q^*_\theta(\mathbf{W})$, parameterised by $\theta$. This replaces the intractable problem of averaging over all weights in the BNN with an optimisation task, where we seek to optimise over the parameters of the simple distribution instead of optimising the original neural network's parameters.
Dropout variational inference is a practical approach for approximate inference in large and complex models (Gal and Ghahramani, 2016). This inference is done by training a model with dropout before every weight layer, and by also performing dropout at test time to sample from the approximate posterior (stochastic forward passes, referred to as Monte Carlo dropout). More formally, this approach is equivalent to performing approximate variational inference where we find a simple distribution $q^*_\theta(\mathbf{W})$ in a tractable family which minimises the Kullback-Leibler (KL) divergence to the true model posterior $p(\mathbf{W} \,|\, \mathbf{X}, \mathbf{Y})$. Dropout can be interpreted as a variational Bayesian approximation, where the approximating distribution is a mixture of two Gaussians with small variances and the mean of one of the Gaussians is fixed at zero. The minimisation objective is given by (Jordan et al., 1999):

$\mathcal{L}(\theta, p) = -\frac{1}{N} \sum_{i=1}^{N} \log p(\mathbf{y}_i \,|\, \mathbf{f}^{\widehat{\mathbf{W}}_i}(\mathbf{x}_i)) + \frac{1 - p}{2N} \|\theta\|^2$
with $N$ data points, dropout probability $p$, samples $\widehat{\mathbf{W}}_i \sim q^*_\theta(\mathbf{W})$, and $\theta$ the set of the simple distribution's parameters to be optimised (weight matrices in dropout's case). In regression, for example, the negative log likelihood can be further simplified as

$-\log p(\mathbf{y}_i \,|\, \mathbf{f}^{\widehat{\mathbf{W}}_i}(\mathbf{x}_i)) \propto \frac{1}{2\sigma^2} \|\mathbf{y}_i - \mathbf{f}^{\widehat{\mathbf{W}}_i}(\mathbf{x}_i)\|^2 + \frac{1}{2} \log \sigma^2$

for a Gaussian likelihood, with $\sigma$ the model's observation noise parameter – capturing how much noise we have in the outputs.
Epistemic uncertainty in the weights can be reduced by observing more data. This uncertainty induces prediction uncertainty by marginalising over the (approximate) weights posterior distribution. For classification this can be approximated using Monte Carlo integration as follows:

$p(y = c \,|\, \mathbf{x}, \mathbf{X}, \mathbf{Y}) \approx \frac{1}{T} \sum_{t=1}^{T} \text{Softmax}(\mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x}))$

with $T$ sampled masked model weights $\widehat{\mathbf{W}}_t \sim q^*_\theta(\mathbf{W})$, where $q^*_\theta(\mathbf{W})$ is the dropout distribution (Gal, 2016). The uncertainty of this probability vector $\mathbf{p}$ can then be summarised using its entropy: $H(\mathbf{p}) = -\sum_{c=1}^{C} p_c \log p_c$. For regression this epistemic uncertainty is captured by the predictive variance, which can be approximated as:

$\text{Var}(\mathbf{y}) \approx \sigma^2 + \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x})^\top \mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x}) - \mathbb{E}[\mathbf{y}]^\top \mathbb{E}[\mathbf{y}]$

with predictions in this epistemic model done by approximating the predictive mean: $\mathbb{E}[\mathbf{y}] \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x})$. The first term in the predictive variance, $\sigma^2$, corresponds to the amount of noise inherent in the data (which will be explained in more detail soon). The second part of the predictive variance measures how much the model is uncertain about its predictions – this term will vanish when we have zero parameter uncertainty (i.e. when all draws $\mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x})$ take the same constant value).
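As an illustration, the following is a PyTorch-flavoured sketch of these Monte Carlo dropout estimates. It is our own illustration under stated assumptions (the paper's implementation uses TensorFlow, and `model` here stands for any network containing dropout layers):

```python
# Illustrative sketch of Monte Carlo dropout prediction (not the authors' code).
import torch

def mc_dropout_samples(model, x, T=50):
    model.train()  # keep dropout active at test time; in practice only the
                   # dropout layers (not e.g. batch norm) should be in train mode
    with torch.no_grad():
        return torch.stack([model(x) for _ in range(T)])  # [T, B, ...]

def epistemic_regression(model, x, T=50):
    samples = mc_dropout_samples(model, x, T)
    return samples.mean(dim=0), samples.var(dim=0)  # predictive mean, epistemic variance

def classification_entropy(model, x, T=50):
    p = mc_dropout_samples(model, x, T).softmax(dim=-1).mean(dim=0)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)  # H(p)
```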
In the above we captured model uncertainty – uncertainty over the model parameters – by approximating the distribution $p(\mathbf{W} \,|\, \mathbf{X}, \mathbf{Y})$. To capture aleatoric uncertainty in regression, we would have to tune the observation noise parameter $\sigma$.
Homoscedastic regression assumes constant observation noise $\sigma$ for every input point $\mathbf{x}$. Heteroscedastic regression, on the other hand, assumes that observation noise can vary with input $\mathbf{x}$ (Nix and Weigend, 1994; Le et al., 2005). Heteroscedastic models are useful in cases where parts of the observation space might have higher noise levels than others. In non-Bayesian neural networks, this observation noise parameter is often fixed as part of the model's weight decay, and ignored. However, when made data-dependent, it can be learned as a function of the data:

$\mathcal{L}_{\text{NN}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2\sigma(\mathbf{x}_i)^2} \|\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i)\|^2 + \frac{1}{2} \log \sigma(\mathbf{x}_i)^2$

with added weight decay parameterised by $\lambda$ (and similarly for an $L_1$ loss). Note that here, unlike the above, variational inference is not performed over the weights, but instead we perform MAP inference – finding a single value for the model parameters $\theta$. This approach does not capture epistemic model uncertainty, as epistemic uncertainty is a property of the model and not of the data.
In the next section we will combine these two types of uncertainties together in a single model. We will see how heteroscedastic noise can be interpreted as model attenuation, and develop a complementary approach for the classification case.
In the previous section we described existing Bayesian deep learning techniques. In this section we present novel contributions which extend this existing literature. We develop models that will allow us to study the effects of modeling either aleatoric uncertainty alone, epistemic uncertainty alone, or modeling both uncertainties together in a single model. This is followed by an observation that aleatoric uncertainty in regression tasks can be interpreted as learned loss attenuation – making the loss more robust to noisy data. We follow that by extending the ideas of heteroscedastic regression to classification tasks. This allows us to learn loss attenuation for classification tasks as well.
We wish to capture both epistemic and aleatoric uncertainty in a vision model. For this we turn the heteroscedastic NN in §2.2 into a Bayesian NN by placing a distribution over its weights, with our construction in this section developed specifically for the case of vision models (although this construction can be generalised for any heteroscedastic NN architecture).
We need to infer the posterior distribution for a BNN model mapping an input image $\mathbf{x}$ to a unary output $\hat{\mathbf{y}} \in \mathbb{R}$ and a measure of aleatoric uncertainty given by variance $\hat{\sigma}^2$. We approximate the posterior over the BNN with a dropout variational distribution using the tools of §2.1. As before, we draw model weights from the approximate posterior $\widehat{\mathbf{W}} \sim q(\mathbf{W})$ to obtain a model output, this time composed of both predictive mean as well as predictive variance:

$[\hat{\mathbf{y}}, \hat{\sigma}^2] = \mathbf{f}^{\widehat{\mathbf{W}}}(\mathbf{x})$

where $\mathbf{f}$ is a Bayesian convolutional neural network parametrised by model weights $\widehat{\mathbf{W}}$. We can use a single network to transform the input $\mathbf{x}$, with its head split to predict both $\hat{\mathbf{y}}$ as well as $\hat{\sigma}^2$.
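A minimal sketch of such a split-head architecture follows. This is our own illustration: `encoder`, `feat_dim`, and `out_dim` are placeholders, not the architecture used in the paper.

```python
# Sketch only: a shared backbone with a split head predicting the mean y_hat
# and the log variance s = log(sigma^2).
import torch
import torch.nn as nn

class HeteroscedasticNet(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, out_dim: int):
        super().__init__()
        self.encoder = encoder            # shared (Bayesian, via dropout) backbone
        self.dropout = nn.Dropout(p=0.5)  # kept active at test time for MC dropout
        self.mean_head = nn.Linear(feat_dim, out_dim)     # predicts y_hat
        self.log_var_head = nn.Linear(feat_dim, out_dim)  # predicts s = log sigma^2

    def forward(self, x):
        h = self.dropout(self.encoder(x))
        return self.mean_head(h), self.log_var_head(h)
```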
We fix a Gaussian likelihood to model our aleatoric uncertainty. This induces a minimisation objective given labelled output points $\mathbf{y}_i$:

$\mathcal{L}_{\text{BNN}}(\theta) = \frac{1}{D} \sum_{i} \frac{1}{2} \hat{\sigma}_i^{-2} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2 + \frac{1}{2} \log \hat{\sigma}_i^2$

where $D$ is the number of output pixels $\mathbf{y}_i$ corresponding to input image $\mathbf{x}$, indexed by $i$ (additionally, the loss includes weight decay which is omitted for brevity). For example, we may set $D = 1$ for image-level regression tasks, or $D$ equal to the number of pixels for dense prediction tasks (predicting a unary corresponding to each input image pixel). $\hat{\sigma}_i^2$ is the BNN output for the predicted variance for pixel $i$.
This loss consists of two components: the residual regression obtained with a stochastic sample through the model – making use of the uncertainty over the parameters – and an uncertainty regularization term. We do not need ‘uncertainty labels’ to learn uncertainty. Rather, we only need to supervise the learning of the regression task. We learn the variance, $\hat{\sigma}_i^2$, implicitly from the loss function. The second regularization term prevents the network from predicting infinite uncertainty (and therefore zero loss) for all data points.
In practice, we train the network to predict the log variance, $s_i := \log \hat{\sigma}_i^2$:

$\mathcal{L}_{\text{BNN}}(\theta) = \frac{1}{D} \sum_{i} \frac{1}{2} \exp(-s_i) \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2 + \frac{1}{2} s_i$

This is because it is more numerically stable than regressing the variance, $\hat{\sigma}_i^2$, as the loss avoids a potential division by zero. The exponential mapping also allows us to regress unconstrained scalar values, where $\exp(-s_i)$ is resolved to the positive domain, giving valid values for variance.
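In code, this objective amounts to a few lines. The following is a hedged sketch of ours (not the authors' TensorFlow implementation), where `log_var` is the network's prediction of $s_i = \log \hat{\sigma}_i^2$:

```python
# Sketch of the learned-attenuation regression loss in log-variance form.
import torch

def attenuated_l2_loss(y_pred, log_var, y_true):
    precision = torch.exp(-log_var)  # 1 / sigma^2; exp keeps it positive
    residual = (y_true - y_pred) ** 2
    return (0.5 * precision * residual + 0.5 * log_var).mean()
```

Note that setting `log_var` to a constant recovers a standard (scaled) L2 loss, so the learned-attenuation model strictly generalises the usual regression objective.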
To summarize, the predictive uncertainty for pixel $i$ in this combined model can be approximated using:

$\text{Var}(\mathbf{y}_i) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{y}}_{i,t}^2 - \left(\frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{y}}_{i,t}\right)^2 + \frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_{i,t}^2$

with a set of $T$ sampled outputs $[\hat{\mathbf{y}}_{i,t}, \hat{\sigma}_{i,t}^2] = \mathbf{f}^{\widehat{\mathbf{W}}_t}(\mathbf{x})$ for randomly masked weights $\widehat{\mathbf{W}}_t \sim q(\mathbf{W})$.
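A sketch of this combined estimate, under the assumption of a split-head model (as above) that returns `(y_hat, log_var)` per forward pass:

```python
# Combined epistemic + aleatoric predictive uncertainty from T stochastic passes.
import torch

def combined_predictive_uncertainty(model, x, T=50):
    model.train()  # keep dropout active for Monte Carlo sampling
    with torch.no_grad():
        outs = [model(x) for _ in range(T)]
    means = torch.stack([y for y, _ in outs])           # [T, B, ...]
    ale_vars = torch.stack([s.exp() for _, s in outs])  # sigma_t^2 per sample
    y_hat = means.mean(dim=0)
    epistemic = means.var(dim=0)      # spread of the sampled means
    aleatoric = ale_vars.mean(dim=0)  # average predicted data noise
    return y_hat, epistemic + aleatoric
```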
We observe that allowing the network to predict uncertainty effectively allows it to temper the residual loss by $\hat{\sigma}_i^2$, which depends on the data. This acts similarly to an intelligent robust regression function. It allows the network to adapt the residual's weighting, and even allows the network to learn to attenuate the effect of erroneous labels. This makes the model more robust to noisy data: inputs for which the model learned to predict high uncertainty will have a smaller effect on the loss.
The model is discouraged from predicting high uncertainty for all points – in effect ignoring the data – through the $\log \hat{\sigma}_i^2$ term. Large uncertainty increases the contribution of this term, and in turn penalizes the model: the model can learn to ignore the data – but is penalised for that. The model is also discouraged from predicting very low uncertainty for points with high residual error, as low $\hat{\sigma}_i^2$ will exaggerate the contribution of the residual and will penalize the model. It is important to stress that this learned attenuation is not an ad-hoc construction, but a consequence of the probabilistic interpretation of the model.
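This trade-off can be made concrete with a short check (our addition, following directly from the objective above). Minimising the per-pixel term with respect to $\hat{\sigma}_i^2$ gives

$\frac{\partial}{\partial \hat{\sigma}_i^2} \left[ \frac{\|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2}{2 \hat{\sigma}_i^2} + \frac{1}{2} \log \hat{\sigma}_i^2 \right] = -\frac{\|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2}{2 \hat{\sigma}_i^4} + \frac{1}{2 \hat{\sigma}_i^2} = 0 \quad\Rightarrow\quad \hat{\sigma}_i^2 = \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$

so the loss-minimising variance for a pixel equals its squared residual: points with large residuals are optimally assigned large variance, and are therefore down-weighted in the residual term.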
This learned loss attenuation property of heteroscedastic NNs in regression is a desirable effect for classification models as well. However, heteroscedastic NNs in classification are peculiar models because technically any classification task has input-dependent uncertainty. Nevertheless, the ideas above can be extended from regression heteroscedastic NNs to classification heteroscedastic NNs.
For this we adapt the standard classification model to marginalise over intermediate heteroscedastic regression uncertainty placed over the logit space. We therefore explicitly refer to our proposed model adaptation as a heteroscedastic classification NN.
For classification tasks our NN predicts a vector of unaries $\mathbf{f}_i$ for each pixel $i$, which when passed through a softmax operation forms a probability vector $\mathbf{p}_i$. We change the model by placing a Gaussian distribution over the unaries vector:

$\hat{\mathbf{x}}_i \,|\, \mathbf{W} \sim \mathcal{N}\left(\mathbf{f}_i^{\mathbf{W}}, (\sigma_i^{\mathbf{W}})^2\right), \qquad \hat{\mathbf{p}}_i = \text{Softmax}(\hat{\mathbf{x}}_i)$

Here $\mathbf{f}_i^{\mathbf{W}}$ and $\sigma_i^{\mathbf{W}}$ are the network outputs with parameters $\mathbf{W}$. The vector $\mathbf{f}_i^{\mathbf{W}}$ is corrupted with Gaussian noise with variance $(\sigma_i^{\mathbf{W}})^2$ (a diagonal matrix with one element for each logit value), and the corrupted vector is then squashed with the softmax function to obtain $\hat{\mathbf{p}}_i$, the probability vector for pixel $i$.
Our expected log likelihood for this model is given by:

$\log \mathbb{E}_{\mathcal{N}(\hat{\mathbf{x}}_i;\, \mathbf{f}_i^{\mathbf{W}}, (\sigma_i^{\mathbf{W}})^2)}\left[\hat{p}_{i,c}\right]$

with $c$ the observed class for input $i$, which gives us our loss function. Ideally, we would want to analytically integrate out this Gaussian distribution, but no analytic solution is known. We therefore approximate the objective through Monte Carlo integration, and sample unaries through the softmax function. We note that this operation is extremely fast because we perform the computation once (passing inputs through the model to get logits). We only need to sample from the logits, which is a fraction of the network's compute, and therefore does not significantly increase the model's test time. We can rewrite the above and obtain the following numerically-stable stochastic loss:

$\hat{\mathbf{x}}_{i,t} = \mathbf{f}_i^{\mathbf{W}} + \sigma_i^{\mathbf{W}} \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$

$\mathcal{L}_{\mathbf{x}} = \sum_i \log \frac{1}{T} \sum_{t} \exp\left(\hat{x}_{i,t,c} - \log \sum_{c'} \exp \hat{x}_{i,t,c'}\right)$

with $\hat{x}_{i,t,c'}$ the $c'$-th element in the logit vector $\hat{\mathbf{x}}_{i,t}$.
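A hedged sketch of this stochastic loss in code (our illustration, not the authors' implementation): sample logits from the predicted Gaussian, take log-softmax per sample, and average the true-class probabilities in log space with logsumexp.

```python
import math
import torch
import torch.nn.functional as F

def attenuated_ce_loss(logits, log_var, target, T=10):
    # logits, log_var: [B, C]; target: [B] integer class labels
    std = (0.5 * log_var).exp()                              # sigma per logit
    eps = torch.randn(T, *logits.shape, device=logits.device)
    sampled = logits.unsqueeze(0) + std.unsqueeze(0) * eps   # x_hat_{i,t}: [T, B, C]
    log_probs = F.log_softmax(sampled, dim=-1)
    idx = target.view(1, -1, 1).expand(T, -1, 1)
    log_p_true = log_probs.gather(-1, idx).squeeze(-1)       # [T, B]
    # log (1/T) sum_t p_t, computed stably in log space:
    log_lik = torch.logsumexp(log_p_true, dim=0) - math.log(T)
    return -log_lik.mean()  # minimise the negative expected log likelihood
```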
This objective can be interpreted as learning loss attenuation, similarly to the regression case. We next assess the ideas above empirically.
In this section we evaluate our methods with pixel-wise depth regression and semantic segmentation. An analysis of these results is given in the following section. To show the robustness of our learned loss attenuation – a side-effect of modeling uncertainty – we present results on an array of popular datasets, CamVid, Make3D, and NYUv2 Depth, where we set new state-of-the-art benchmarks.
Our models are based on the DenseNet architecture Jégou et al. (2016). We use our own independent implementation of the architecture using TensorFlow Abadi et al. (2016) (which slightly outperforms the original authors' implementation on CamVid by 0.2%, see Table 1(a)). For all experiments we train on image crops with a batch size of 4, and then fine-tune on full-size images with a batch size of 1. We train with RMS-Prop using a constant learning rate and weight decay.
We compare the results of the Bayesian neural network models outlined in §3. We model epistemic uncertainty using Monte Carlo dropout (§2.1). The DenseNet architecture places dropout after each convolutional layer. Following Kendall et al. (2015), we use 50 Monte Carlo dropout samples. We model aleatoric uncertainty with MAP inference using loss functions (8) and (12) (the latter given in the appendix), for regression and classification respectively (§2.2). However, we derive the loss function using a Laplacian prior, as opposed to the Gaussian prior used for the derivations in §3. This is because it results in a loss function which applies an L1 distance on the residuals. Typically, we find this to outperform the L2 loss for regression tasks in vision. We model the benefit of combining both epistemic and aleatoric uncertainty using our developments presented in §3.
CamVid is a road scene understanding dataset with 367 training images and 233 test images of day and dusk scenes, with 11 semantic classes. We resize images to 360×480 pixels for training and evaluation. In Table 1(a) we present results for our architecture. Our method sets a new state-of-the-art mean intersection over union (IoU) score on this dataset. We observe that modeling both aleatoric and epistemic uncertainty improves over the baseline result. The implicit attenuation obtained from the aleatoric loss provides a larger improvement than the epistemic uncertainty model. However, the combination of both uncertainties improves performance even further. This shows that for this application it is more important to model aleatoric uncertainty, suggesting that epistemic uncertainty can be mostly explained away in this large data setting.
Secondly, NYUv2 Silberman et al. (2012) is a challenging indoor segmentation dataset with 40 different semantic classes. It has 1449 images from 464 different indoor scenes. Table 1(b) shows our results. This dataset is much harder than CamVid because there is significantly less structure in indoor scenes compared to street scenes, and because of the increased number of semantic classes. We use DeepLab-LargeFOV Chen et al. (2014) as our baseline model. We observe a similar result (qualitative results are given in Figure 6); we improve baseline performance by giving the model flexibility to estimate uncertainty and attenuate the loss. The effect is more pronounced, perhaps because the dataset is more difficult.
We demonstrate the efficacy of our method for regression using two popular monocular depth regression datasets, Make3D Saxena et al. (2009) and NYUv2 Depth Silberman et al. (2012). The Make3D dataset consists of 400 training and 134 testing images, gathered using a 3-D laser scanner. We evaluate our method using the same protocol as Laina et al. (2016), resizing images to the resolution used in that work and evaluating on pixels with depth less than 70m. NYUv2 Depth is taken from the same dataset used for classification above. It contains RGB-D imagery from 464 different indoor scenes. We compare to previous approaches for Make3D in Table 2(a) and NYUv2 Depth in Table 2(b), using standard metrics (for a description of these metrics please see Eigen et al. (2014)).
These results show that aleatoric uncertainty is able to capture many aspects of this task which are inherently difficult. For example, in the qualitative results in Figure 6 we observe that aleatoric uncertainty is greater for large depths, reflective surfaces, and occlusion boundaries in the image. These are common failure modes of monocular depth algorithms Laina et al. (2016). On the other hand, these qualitative results show that epistemic uncertainty captures difficulties due to lack of data. For example, we observe larger uncertainty for objects which are rare in the training set, such as humans in the third example of Figure 6.
In summary, we have demonstrated that our model can improve performance over non-Bayesian baselines by implicitly learning attenuation of systematic noise and difficult concepts. For example we observe high aleatoric uncertainty for distant objects and on object and occlusion boundaries.
In §4 we showed that modeling aleatoric and epistemic uncertainties improves prediction performance, with the combination performing even better. In this section we wish to study the effectiveness of modeling aleatoric and epistemic uncertainty. In particular, we wish to quantify the performance of these uncertainty measurements and analyze what they capture.
Firstly, in Figure 2 we show precision-recall curves for regression and classification models. They show how model performance improves as pixels with uncertainty above various percentile thresholds are removed. This illustrates two behaviors of the aleatoric and epistemic uncertainty measures. First, the uncertainty measurements correlate well with accuracy, because all curves are strictly decreasing functions: precision is lower when more points that the model is uncertain about are included. Second, the curves for the epistemic and aleatoric uncertainty models are very similar. This shows that each uncertainty ranks pixel confidence similarly to the other, in the absence of the other uncertainty. This suggests that when only one uncertainty is explicitly modeled, it attempts to compensate for the lack of the alternative uncertainty when possible.
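A small sketch of this analysis (ours, with array names as placeholders): keep only the most-confident fraction of pixels at each percentile and measure the accuracy of what remains.

```python
import numpy as np

def precision_vs_recall(uncertainty, correct, steps=10):
    # uncertainty: per-pixel uncertainty; correct: 1 if the prediction was right
    curve = []
    for q in np.linspace(10, 100, steps):
        keep = uncertainty <= np.percentile(uncertainty, q)  # most-confident q%
        curve.append((q / 100.0, correct[keep].mean()))      # (recall, precision)
    return curve
```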
Secondly, in Figure 3 we analyze the quality of our uncertainty measurement using calibration plots from our model on the test set. To form calibration plots for classification models, we discretize our model's predicted probabilities into a number of bins, for all classes and all pixels in the test set. We then plot the frequency of correctly predicted labels for each bin of probability values. Better performing uncertainty estimates should correlate more accurately with the line $y = x$ in the calibration plots. For regression models, we can form calibration plots by comparing the frequency of residuals lying within varying thresholds of the predicted distribution. Figure 3 shows the calibration of our classification and regression uncertainties.
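For concreteness, a sketch of the classification calibration computation (our illustration; empty bins yield NaN and can be skipped when plotting):

```python
import numpy as np

def calibration_curve(probs, correct, n_bins=10):
    # probs: predicted probability per (pixel, class); correct: 1 if that label is right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(probs, edges[1:-1])  # bin index in [0, n_bins)
    conf = np.array([probs[idx == b].mean() for b in range(n_bins)])
    freq = np.array([correct[idx == b].mean() for b in range(n_bins)])
    return conf, freq  # perfectly calibrated when freq == conf (the line y = x)
```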
In this section we show two results:
Aleatoric uncertainty cannot be explained away with more data,
Aleatoric uncertainty does not increase for out-of-data examples (situations different from training set), whereas epistemic uncertainty does.
In Table 3 we give accuracy and uncertainty for models trained on increasing sized subsets of datasets. This shows that epistemic uncertainty decreases as the training dataset gets larger. It also shows that aleatoric uncertainty remains relatively constant and cannot be explained away with more data. Testing the models with a different test set (bottom two lines) shows that epistemic uncertainty increases considerably on those test points which lie far from the training sets.
These results reinforce the case that epistemic uncertainty can be explained away with enough data, but is required to capture situations not encountered in the training set. This is particularly important for safety-critical systems, where epistemic uncertainty is required to detect situations which have never been seen by the model before.
Our model based on DenseNet Jégou et al. (2016) can process a 640×480 resolution image in around 150ms on an NVIDIA Titan X GPU. The aleatoric uncertainty models add negligible compute. However, epistemic models require expensive Monte Carlo dropout sampling. For models such as ResNet He et al. (2016), this is possible to achieve economically because only the last few layers contain dropout. Other models, like DenseNet, require the entire architecture to be sampled. This is difficult to parallelize due to GPU memory constraints, and often results in a roughly 50× slow-down for 50 Monte Carlo samples.
We presented a novel Bayesian deep learning framework to learn a mapping to aleatoric uncertainty from the input data, which is composed on top of epistemic uncertainty models. We derived our framework for both regression and classification applications. We showed that it is important to model aleatoric uncertainty for:
Large data situations, where epistemic uncertainty is explained away,
Real-time applications, because we can form aleatoric models without expensive Monte Carlo samples.
And epistemic uncertainty is important for:
Safety-critical applications, because epistemic uncertainty is required to understand examples which are different from training data,
Small datasets where the training data is sparse.
However aleatoric and epistemic uncertainty models are not mutually exclusive. We showed that the combination is able to achieve new state-of-the-art results on depth regression and semantic segmentation benchmarks.
The first paragraph of this paper described two recent disasters which could have been averted by real-time Bayesian deep learning tools. Therefore, we leave finding a method for real-time epistemic uncertainty in deep learning as an important direction for future research.
X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.
D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.