Single Shot MC Dropout Approximation

by   Kai Brach, et al.

Deep neural networks (DNNs) are known for their high prediction performance, especially in perceptual tasks such as object recognition or autonomous driving. Still, DNNs are prone to yield unreliable predictions when encountering completely new situations without indicating their uncertainty. Bayesian variants of DNNs (BDNNs), such as MC dropout BDNNs, do provide uncertainty measures. However, BDNNs are slow during test time because they rely on a sampling approach. Here we present a single shot MC dropout approximation that preserves the advantages of BDNNs without being slower than a DNN. Our approach is to analytically approximate for each layer in a fully connected network the expected value and the variance of the MC dropout signal. We evaluate our approach on different benchmark datasets and a simulated toy example. We demonstrate that our single shot MC dropout approximation resembles the point estimate and the uncertainty estimate of the predictive distribution that is achieved with an MC approach, while being fast enough for real-time deployments of BDNNs.




1 Introduction

Over the last decade, deep neural networks (DNNs) have arisen as the dominant technique for the analysis of perceptual data. In safety-critical applications like autonomous driving, where the vehicle must be able to understand its environment, DNNs have also seen rapid progress in several tasks (Grigorescu et al., 2019).

However, classical DNNs have deficits in capturing the model uncertainty (Kendall and Gal, 2017; Gal and Ghahramani, 2016). When DNN models are used in safety-critical applications, it is mandatory to provide an uncertainty measure that can be used to identify unreliable predictions (Michelmore et al., 2018) (1) (Harakeh et al., 2019) (2) (McAllister et al., 2017).

For example, in the field of robotics (Sünderhauf et al., 2018), medical applications, or autonomous driving (Bojarski et al., 2016), where machines interact with humans, it is important to identify situations where a model prediction is unreliable and human intervention is necessary. These can, for example, be situations that are completely different from anything that occurred during training.

Employing Bayesian DNNs (BDNNs) (MacKay, 1992) tackles the problem and makes it possible to compute an uncertainty measure. However, state-of-the-art BDNNs require sampling during deployment, leading to computation times that are larger than those of a classical DNN by a factor equal to the number of MC runs. This work overcomes this drawback by providing a method that approximates the expected value and variance of a BDNN's predictive distribution in a single run. It therefore has the same computation time as a classical DNN. We focus here on a special variant of BDNNs which is known as MC dropout (Gal and Ghahramani, 2016). While our approximation method is also applicable to convolutional neural networks and classification settings, we focus in this work on regression through fully connected networks.

Ensembling-based models take an alternative approach to estimating uncertainties and have been successfully applied to DNNs (Lakshminarayanan et al., 2017; Pearce et al., 2020). However, ensemble methods also do not allow the uncertainty to be quantified in a single-shot manner.

2 Related Work

2.1 MC Dropout Bayesian Neural Networks

BDNNs are probabilistic models that capture the uncertainty by means of probability distributions. Probabilistic DNNs, which are non-Bayesian, only define a distribution for the conditional outcome. In common probabilistic DNNs the output nodes control the parameters of a conditional probability distribution (CPD) of the outcome. For regression-type problems a common choice for the CPD is the normal distribution $N(\mu, \sigma^2)$, where the variance $\sigma^2$ quantifies the data uncertainty, known as aleatoric uncertainty. BDNNs additionally define distributions for the weights, which translate into a distribution of the modeled parameters. In this manner the model uncertainty is captured, which is known as epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009). In the case of MC dropout BDNNs each weight distribution is a Bernoulli distribution: with the dropout probability $p^*$ the weight takes the value zero, and with probability $1 - p^*$ it takes the value $w$. All weights starting from the same neuron are set to zero simultaneously. The dropout probability $p^*$ is usually treated as a fixed hyperparameter and the weight value $w$ is tuned during training.

In contrast to standard dropout (Srivastava et al., 2014), the weights in MC dropout are not frozen and rescaled after training; instead, the dropout procedure is also applied at test time. It can be shown that MC dropout is an approximation to a BDNN (Gal and Ghahramani, 2016). MC dropout BDNNs have been used successfully in many applications, have proven to yield improved prediction performance, and make it possible to define uncertainty measures that identify individual unreliable predictions (Gal and Ghahramani, 2016; Ryu et al., 2019; Dürr et al., 2018; Kwon et al., 2020). To employ a trained Bayesian DNN in practice, one performs several prediction runs. In each run, weights are sampled from the weight distributions, leading to a certain constellation of weight values that is used to compute the parameters of a CPD. To determine the outcome distribution of a BDNN, we draw samples from the CPDs that resulted from the different MC runs. In this way, the outcome distribution incorporates both the epistemic and the aleatoric uncertainty. A drawback of an MC dropout BDNN compared to its classical DNN variant is the increased computing time. The sampling procedure leads to a computing time that is prohibitive for many real-time applications like autonomous driving.
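The sampling procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `mc_dropout_predict`, the layer layout (dropout before every dense layer, ReLU on all but the last layer), and the absence of any rescaling of the kept activations are assumptions made for the example.

```python
import numpy as np

def mc_dropout_predict(x, weights, biases, p_drop, n_runs, rng):
    """Run n_runs stochastic forward passes with dropout kept active.

    Assumed layout (hypothetical): dropout -> dense -> ReLU for every hidden
    layer, dropout -> dense for the output layer, no rescaling of kept nodes.
    Returns the sample mean and sample variance of the network output.
    """
    outputs = []
    for _ in range(n_runs):
        h = np.asarray(x, dtype=float)
        for k, (W, b) in enumerate(zip(weights, biases)):
            # Each input node of the layer is dropped with probability p_drop,
            # which zeroes all weights leaving that node at once.
            mask = rng.random(h.shape) >= p_drop
            h = (h * mask) @ W + b
            if k < len(weights) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        outputs.append(h)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.var(axis=0)

# Small usage example with random, untrained weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
x = np.array([1.0, -2.0, 0.5])
mean, var = mc_dropout_predict(x, [W1, W2], [b1, b2], p_drop=0.2, n_runs=500, rng=rng)
print("MC mean:", mean, "MC variance:", var)
```

The cost grows linearly with `n_runs`, which is the overhead that a single-shot approximation aims to avoid.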

2.2 Moment Propagation

Our method relies on statistical moment propagation (MP). More specifically, we propagate the expectation and the variance of our signal distribution through the different layers of a neural network. The variance of the signal arises from the dropout process. Quantifying the variance after a transformation is also done in error propagation (EP). EP quantifies how the uncertainty of an input (e.g. a measurement error) that is transformed by a function transfers to an uncertainty of the output of this function. In the case of a continuous output it is common to characterize the uncertainty by the variance. This approach is also used in statistics as the delta method (Dorfman, 1938). In MP we approximate the layer-wise transformations of the variance and the expected value. A similar approach has been used for neural networks before (Frey and Hinton, 1999; Adachi, 2019), and has been used to detect adversarial examples in (Jin, 2015) and (Gast and Roth, 2018).

However, to the best of our knowledge, our approach is the first method that provides a single shot approximation to the expected value and the variance of the predictive distribution resulting from an MC dropout NN.

3 Methods

The goal of our method is to approximate the expected value E and the variance V of the predicted output which is obtained by the MC dropout method described above. When propagating an observation through an MC dropout network, we get for each layer with $p$ nodes an activation signal with an expected value $E$ (of dimension $p$) and a variance given by a variance-covariance matrix $V$ (of dimension $p \times p$). We neglect the effect of correlations between different activations, which are small in deeper layers anyway due to the decorrelation effect of the dropout. Hence, we only consider the diagonal terms of the covariance matrix. In the following, we describe for each layer type in a fully connected network how the expected value E and its variance V are propagated. As layer types we consider the dropout, dense, and ReLU activation layers. Figure 1 provides an overview of the layer-wise abstraction.

Figure 1: Overview of the proposed method. The expectation E and the variance V flow through the different layers of the network in a single forward pass. Shown is an example configuration in which dropout (DO) is followed by a dense layer (FC) and a ReLU activation. More complex networks can be built by different arrangements of the individual blocks.

3.1 Dropout Layer

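We start our discussion with the effect of MC dropout. Let $E_i$ be the expectation at the i'th node of the input layer and $V_i$ the variance at the i'th node. In a dropout layer the random value of a node is multiplied independently with a Bernoulli variable $Y$ that takes the value zero with the dropout probability $p^*$ and the value one otherwise, so that $E(Y) = 1 - p^*$. The expectation $E_i^D$ of the i'th node after dropout is then given by:

$$E_i^D = E_i \, (1 - p^*) \tag{1}$$

For computing the variance $V_i^D$ of the i'th node after dropout, we use the fact that the variance $V(X \cdot Y)$ of the product of two independent random variables $X$ and $Y$ is given by (Goodman, 1960):

$$V(X \cdot Y) = V(X)\,V(Y) + V(X)\,E^2(Y) + E^2(X)\,V(Y) \tag{2}$$

With $V(Y) = p^* (1 - p^*)$, we get:

$$V_i^D = V_i \, p^* (1 - p^*) + V_i \, (1 - p^*)^2 + E_i^2 \, p^* (1 - p^*) \tag{3}$$

Dropout is the only layer in our approach where uncertainty is created, i.e. even if the input has $V_i = 0$, the output of the dropout layer has $V_i^D > 0$ for $E_i \neq 0$.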
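As a small numerical illustration of the dropout step, the sketch below applies the product-variance rule of Goodman (1960) to a node with mean `E` and variance `V` that is multiplied by an independent Bernoulli mask, and compares the result with a direct simulation. The function name `dropout_moments` and the chosen numbers are assumptions for this example only; the input is drawn from a Gaussian purely to have something to sample from.

```python
import numpy as np

def dropout_moments(E, V, p_drop):
    """Mean and variance of a node after multiplying it with a Bernoulli mask.

    The mask is zero with probability p_drop and one otherwise, so its mean is
    (1 - p_drop) and its variance is p_drop * (1 - p_drop).
    """
    keep = 1.0 - p_drop
    var_mask = p_drop * keep
    E_out = E * keep
    # Variance of a product of two independent random variables (Goodman, 1960).
    V_out = V * var_mask + V * keep**2 + E**2 * var_mask
    return E_out, V_out

# Compare against brute-force sampling (hypothetical values).
rng = np.random.default_rng(0)
E, V, p_drop = 2.0, 0.5, 0.3
x = rng.normal(E, np.sqrt(V), size=200_000)
mask = rng.random(x.shape) >= p_drop
samples = x * mask

E_mp, V_mp = dropout_moments(E, V, p_drop)
print("mean:     analytic", E_mp, " sampled", samples.mean())
print("variance: analytic", V_mp, " sampled", samples.var())

# Even a deterministic input (V = 0) leaves the layer with non-zero variance.
print(dropout_moments(1.0, 0.0, 0.5))
```

Because the product-variance rule is exact for independent factors, the analytic and sampled values should agree up to sampling noise, whatever the input distribution.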

3.2 Dense Layer

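For the dense layer with $p$ input and $q$ output nodes, we compute the value of the i'th output node as $\sum_{j=1}^{p} x_j w_{ji} + b_i$, where $x_j$, $j = 1, \ldots, p$, are the values of the input nodes. Using the linearity of the expectation, we get the expectation $E_i^F$ of the i'th output node from the expectations $E_j$, $j = 1, \ldots, p$, of the input nodes:

$$E_i^F = \sum_{j=1}^{p} E_j \, w_{ji} + b_i \tag{4}$$

To calculate the change of the variance, we use the fact that the variance under a linear transformation behaves like $V(w_{ji} \, x_j + b) = w_{ji}^2 \, V(x_j)$. Further, we assume independence of the $p$ different summands, yielding:

$$V_i^F = \sum_{j=1}^{p} V_j \, w_{ji}^2 \tag{5}$$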
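In matrix form, the dense-layer step amounts to one product with the weight matrix for the means and one product with the element-wise squared weight matrix for the variances. The sketch below assumes a weight matrix `W` of shape (inputs, outputs), with `W[j, i]` the weight from input node j to output node i; this storage convention and the name `dense_moments` are choices made for the example.

```python
import numpy as np

def dense_moments(E, V, W, b):
    """Propagate per-node means E and variances V through a dense layer.

    Assumes the input nodes are independent, so covariances are ignored and
    only the diagonal of the covariance matrix is carried along.
    """
    E_out = E @ W + b       # linearity of the expectation
    V_out = V @ (W ** 2)    # Var(w * x + b) = w^2 * Var(x), summed over inputs
    return E_out, V_out

# Tiny worked example: two input nodes, two output nodes.
E = np.array([1.0, 2.0])
V = np.array([0.5, 0.25])
W = np.array([[1.0, -1.0],
              [2.0,  0.0]])
b = np.array([0.5, 0.0])

E_out, V_out = dense_moments(E, V, W, b)
print(E_out)  # first output: 1*1 + 2*2 + 0.5 = 5.5, second: -1
print(V_out)  # first output: 0.5*1 + 0.25*4 = 1.5, second: 0.5
```

Note that the bias shifts the mean but does not contribute to the variance.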


3.3 ReLU Activation Layer

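To calculate the expectation $E_i^R$ and variance $V_i^R$ of the i'th node after a ReLU as a function of the expectation $E_i$ and variance $V_i$ of this node before the ReLU, we need to make a distributional assumption. We assume that the input is Gaussian distributed. With $\varphi$ the PDF and $\Phi$ the corresponding CDF of the standard normal distribution, we get (see (Frey and Hinton, 1999) for a derivation) for the expectation and variance of the output:

$$E_i^R = E_i \, \Phi\!\left(\frac{E_i}{\sqrt{V_i}}\right) + \sqrt{V_i} \, \varphi\!\left(\frac{E_i}{\sqrt{V_i}}\right) \tag{6}$$

$$V_i^R = \left(E_i^2 + V_i\right) \Phi\!\left(\frac{E_i}{\sqrt{V_i}}\right) + E_i \sqrt{V_i} \, \varphi\!\left(\frac{E_i}{\sqrt{V_i}}\right) - \left(E_i^R\right)^2 \tag{7}$$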
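The mean and variance of a rectified Gaussian have a closed form in terms of the standard normal PDF and CDF, which is what the ReLU step relies on. The following sketch implements these standard formulas and checks them against sampling. The function name `relu_moments`, the small `eps` floor on the variance (added here to avoid division by zero), and the clipping of tiny negative variances are assumptions of this example, not part of the text.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise error function without SciPy

def relu_moments(E, V, eps=1e-12):
    """Mean and variance of max(0, X) for X ~ N(E, V), applied element-wise."""
    E = np.asarray(E, dtype=float)
    V = np.asarray(V, dtype=float)
    s = np.sqrt(np.maximum(V, eps))      # standard deviation, floored at sqrt(eps)
    z = E / s
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    E_out = E * cdf + s * pdf                                # first moment
    second_moment = (E**2 + s**2) * cdf + E * s * pdf        # E[max(0, X)^2]
    V_out = np.maximum(second_moment - E_out**2, 0.0)        # clip round-off
    return E_out, V_out

# Compare with sampling for one node (hypothetical values).
rng = np.random.default_rng(0)
E, V = 1.0, 2.0
samples = np.maximum(rng.normal(E, math.sqrt(V), size=200_000), 0.0)
E_r, V_r = relu_moments(E, V)
print("mean:     analytic", float(E_r), " sampled", samples.mean())
print("variance: analytic", float(V_r), " sampled", samples.var())
```

The agreement is exact only when the pre-activation really is Gaussian; inside a network it is an approximation whose quality depends on how close the summed inputs are to a normal distribution.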


4 Results

4.1 Toy Dataset

We first apply our approach to a one-dimensional regression toy dataset with only one input feature. We use a fully connected NN with three layers, each with 256 nodes, ReLU activations, and dropout after the dense layers. A single node in the output layer is interpreted as the expected value $\mu(x)$ of the conditional outcome distribution $p(y|x)$. We train the network using the MSE loss and apply dropout with a fixed dropout probability. From the MC dropout BDNN, we get at each $x$-position $T$ MC samples $\mu_t(x)$, from which we can estimate the expectation $E$ by the average value and the variance $V$ by the variance of $\mu_t(x)$. For comparison, we use our MP approach to also approximate the expected value and the variance of $\mu(x)$ at each $x$-position (see upper panel of figure 2). We also include the deterministic output of the DNN in which dropout has been used only during training. All three approaches yield nearly identical results within the range of the training data. We attribute this to the fact that we have plenty of training data, so that the epistemic uncertainty is negligible. In the lower panel of figure 2, the uncertainty of $\mu(x)$ is compared by displaying an interval given by the expected value of $\mu(x)$ plus or minus two times the standard deviation of $\mu(x)$. Here the widths of the resulting intervals of a BDNN via the MP approach and via MC dropout are comparable (the DNN has no spread). This indicates the usefulness of this approach for epistemic uncertainty estimation.

Figure 2: Comparison of the MP and MC dropout results of a BDNN and the results of a DNN. The NNs were fitted on training data that were available in the range of -3 to 19. In the upper panel the estimated expectations of the MC BDNN, the MP BDNN, and the DNN are compared. In the lower panel the predicted spread of $\mu(x)$ is shown for the MC and MP methods.

Dataset    N        Q     Test RMSE        Test NLL         Test RT [s]
                          MC      MP       MC      MP       MC      MP
Boston     506      13    3.14    3.10     2.57    2.56     2.51    0.04
Concrete   1,030    8     5.46    5.40     3.12    3.13     3.37    0.04
Energy     768      8     1.65    1.61     1.95    2.01     2.84    0.04
Kin8nm     8,192    8     0.08    0.08     -1.10   -1.11    7.37    0.04
Naval      11,934   16    0.00    0.00     -4.36   -3.60    9.69    0.04
Power      9,568    4     4.05    4.04     2.82    2.84     6.85    0.04
Protein    45,730   9     4.42    4.41     2.90    2.91     31.38   0.05
Wine       1,599    11    0.63    0.63     0.95    0.95     4.78    0.04
Yacht      308      6     2.93    2.91     2.35    2.11     2.01    0.04

Table 1: Comparison of the average prediction performance in test RMSE (Root-Mean-Square Error), test NLL (Negative Log-Likelihood) and test RT (Runtime) including standard error on UCI regression benchmark datasets between MC and MP. N and Q correspond to the dataset size and the input dimension. For all test measures, smaller means better.

4.2 UCI-Datasets

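To benchmark our method, we redo the analysis of (Gal and Ghahramani, 2016) for the UCI regression benchmark datasets. We use the same NN model as Gal and Ghahramani, which is a fully connected neural network including one hidden layer with ReLU activation, in which the CPD $p(y|x)$ over $T$ MC runs is given by sampling from the normal PDF:

$$p(y|x) = \frac{1}{T} \sum_{t=1}^{T} N\!\left(y;\, \mu_t(x),\, \tau^{-1}\right) \tag{8}$$

Again, $\mu_t(x)$ is the single output of the BDNN for the t'th MC run. To derive a predictive distribution, Gal assumes in each run a Gaussian distribution centered at $\mu_t(x)$ with a precision $\tau$, corresponding to the reciprocal of the variance. The parameter $\mu_t(x)$ is received from the NN and $\tau$ is treated as a hyperparameter. For the MP model, the MC sampling (Eq. 8) is replaced by integration, where $E$ and $V$ denote the expectation and variance of the network output obtained with MP:

$$p(y|x) = \int N\!\left(y;\, \mu,\, \tau^{-1}\right) N\!\left(\mu;\, E,\, V\right) d\mu = N\!\left(y;\, E,\, V + \tau^{-1}\right) \tag{9}$$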
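The two ways of evaluating the test NLL can be contrasted in a few lines. In the sketch below, the MC version averages Gaussian densities centered at the sampled network outputs (computed in log space for numerical stability), while the single-shot version uses one Gaussian whose variance is the propagated variance plus the noise variance 1/tau. The latter follows from integrating a Gaussian likelihood over a Gaussian distribution of the network output, which is an assumption of this example. The function names and the synthetic numbers are likewise hypothetical.

```python
import numpy as np

def gaussian_logpdf(y, mean, var):
    """Log density of N(y; mean, var), element-wise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def nll_mc(y, mu_samples, tau):
    """NLL of an equally weighted Gaussian mixture over T sampled outputs.

    mu_samples has shape (T, N): T MC runs for each of N test points.
    """
    log_p = gaussian_logpdf(y[None, :], mu_samples, 1.0 / tau)
    m = log_p.max(axis=0)                                  # log-sum-exp trick
    log_mix = m + np.log(np.mean(np.exp(log_p - m), axis=0))
    return -log_mix.mean()

def nll_mp(y, E, V, tau):
    """NLL of a single Gaussian with mean E and variance V + 1/tau."""
    return -gaussian_logpdf(y, E, V + 1.0 / tau).mean()

# Synthetic check: if the sampled outputs really are Gaussian with mean E and
# variance V, both estimates should agree up to sampling noise.
rng = np.random.default_rng(0)
E = np.array([0.0, 1.0, -0.5, 2.0, 0.3])
V = np.full(5, 0.5)
tau = 1.0
y = E + rng.normal(0.0, 1.0, size=5)
mu_samples = rng.normal(E, np.sqrt(V), size=(20_000, 5))

print("NLL via MC mixture: ", nll_mc(y, mu_samples, tau))
print("NLL via single shot:", nll_mp(y, E, V, tau))
```

For a real network the sampled outputs are only approximately Gaussian, so the two values will generally differ slightly.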


We used the same protocol as (Gal and Ghahramani, 2016), which is available online. Accordingly, we train the network for 10 times the epochs provided in the individual dataset configuration. As described in (Gal and Ghahramani, 2016), an extensive grid search over the dropout rate and different values of the precision $\tau$ is done. The hyperparameters minimizing the validation NLL are chosen and applied to the test set.

We report in table 1 the test performance (RMSE and NLL) achieved via the MC BDNN using the optimal hyperparameters for the different UCI datasets. We also report the test RMSE and the NLL achieved with our MP method. Overall, the MC and MP approaches produce similar results. However, as shown in the last column of the table, the MP method is much faster, since it has to perform only one forward pass instead of $T$ forward passes.
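To make the runtime argument concrete, the sketch below chains the dropout, dense, and ReLU moment rules into a single forward pass and times it against repeated stochastic passes on a small random network. Everything specific here is an assumption for illustration: the function names, the layer sizes, the untrained random weights, the dropout rate, and the number of MC runs. It is not the benchmark setup behind table 1, and the printed numbers are not meant to reproduce it.

```python
import math
import time
import numpy as np

_erf = np.vectorize(math.erf)

def mp_forward(x, weights, biases, p_drop, eps=1e-12):
    """Single pass that propagates mean E and variance V instead of samples."""
    E = np.asarray(x, dtype=float)
    V = np.zeros_like(E)
    keep = 1.0 - p_drop
    for k, (W, b) in enumerate(zip(weights, biases)):
        # Dropout: Bernoulli mask with mean keep and variance p_drop * keep.
        E, V = E * keep, V * keep + E**2 * p_drop * keep
        # Dense: linear in the mean, squared weights for the variance.
        E, V = E @ W + b, V @ W**2
        if k < len(weights) - 1:
            # ReLU under a Gaussian assumption on the pre-activation.
            s = np.sqrt(np.maximum(V, eps))
            z = E / s
            pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
            cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
            E_new = E * cdf + s * pdf
            V = np.maximum((E**2 + s**2) * cdf + E * s * pdf - E_new**2, 0.0)
            E = E_new
    return E, V

def mc_forward(x, weights, biases, p_drop, n_runs, rng):
    """Reference: n_runs stochastic passes with the same layer layout."""
    outs = np.empty((n_runs, biases[-1].shape[0]))
    for t in range(n_runs):
        h = np.asarray(x, dtype=float)
        for k, (W, b) in enumerate(zip(weights, biases)):
            h = (h * (rng.random(h.shape) >= p_drop)) @ W + b
            if k < len(weights) - 1:
                h = np.maximum(h, 0.0)
        outs[t] = h
    return outs.mean(axis=0), outs.var(axis=0)

rng = np.random.default_rng(0)
sizes = [4, 64, 64, 1]  # hypothetical architecture
weights = [rng.normal(0.0, 1.0 / math.sqrt(m), size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.normal(size=sizes[0])

t0 = time.perf_counter()
E_mp, V_mp = mp_forward(x, weights, biases, p_drop=0.1)
t_mp = time.perf_counter() - t0

t0 = time.perf_counter()
E_mc, V_mc = mc_forward(x, weights, biases, p_drop=0.1, n_runs=1000, rng=rng)
t_mc = time.perf_counter() - t0

print(f"MP: mean {E_mp[0]:.4f}, var {V_mp[0]:.4f}, time {t_mp:.5f} s")
print(f"MC: mean {E_mc[0]:.4f}, var {V_mc[0]:.4f}, time {t_mc:.5f} s")
```

The single pass does a constant amount of extra work per layer (a second matrix product for the variances), whereas the sampling cost scales with the number of MC runs.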

5 Discussion

With our MP approach we have introduced an approximation to MC dropout which requires no sampling but instead propagates the expectation and the variance of the signal through the network. This results in a time saving by a factor that approximately corresponds to the number of MC runs (10,000 in our benchmark experiment). We have shown that our fast MP approach closely approximates the expectation and variance of the predictive distribution achieved by MC dropout. The achieved prediction performance in terms of RMSE and NLL also does not show significant differences between MC dropout and our MP approach. Hence, our MP approach opens the door to including uncertainty information in real-time applications.

We are currently working on extending the approach to different architectures such as convolutional neural networks. We are also investigating how to make use of the uncertainty information to detect novel classes in classification settings.

6 Acknowledgements

We are very grateful to Elektrobit Automotive GmbH for supporting this research work. Further, part of the work has been funded by the Federal Ministry of Education and Research of Germany (BMBF) in the project DeepDoubt (grant no. 01IS19083A).


  • [1] Cited by: §1.
  • [2] Cited by: §1.
  • J. Adachi (2019) Estimating and factoring the dropout induced distribution with gaussian mixture model. In International Conference on Artificial Neural Networks, pp. 775–792. Cited by: §2.2.
  • M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to End Learning for Self-Driving Cars. External Links: 1604.07316, Link Cited by: §1.
  • A. Der Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? Does it matter?. Structural safety 31 (2), pp. 105–112. Cited by: §2.1.
  • R. A. Dorfman (1938) A note on the δ-method for finding variance formulae. Biometric Bulletin. Cited by: §2.2.
  • O. Dürr, E. Murina, D. Siegismund, V. Tolkachev, S. Steigele, and B. Sick (2018) Know When You Don’t Know: A Robust Deep Learning Approach in the Presence of Unknown Phenotypes. Assay and drug development technologies 16 (6), pp. 343–349. Cited by: §2.1.
  • B. J. Frey and G. E. Hinton (1999) Variational learning in nonlinear gaussian belief networks. Neural Computation 11 (1), pp. 193–213. External Links: Document, ISSN 08997667, Link Cited by: §2.2, §3.3.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In 33rd International Conference on Machine Learning, ICML 2016, Vol. 3, pp. 1651–1660. External Links: 1506.02142, ISBN 9781510829008 Cited by: §1, §1, §2.1, §4.2, §4.2.
  • J. Gast and S. Roth (2018) Lightweight Probabilistic Deep Networks. Technical report External Links: Document, 1805.11327, ISBN 9781538664209, ISSN 10636919 Cited by: §2.2.
  • L. A. Goodman (1960) On the Exact Variance of Products. Journal of the American Statistical Association 55 (292), pp. 708–713. External Links: Document, ISSN 01621459, Link Cited by: §3.1.
  • S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu (2019) A Survey of Deep Learning Techniques for Autonomous Driving. Journal of Field Robotics 37 (3), pp. 362–386. External Links: Document, 1910.07738, Link Cited by: §1.
  • A. Harakeh, M. Smart, and S. L. Waslander (2019) BayesOD: A Bayesian Approach for Uncertainty Estimation in Deep Object Detectors. External Links: 1903.03838, Link Cited by: §1.
  • J. Jin (2015) Robust Convolutional Neural Networks under Adversarial Noise.. CoRR abs/1511.0. External Links: Link Cited by: §2.2.
  • A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. Technical report Vol. 2017-December. External Links: ISSN 10495258, Link Cited by: §1.
  • Y. Kwon, J. Won, B. J. Kim, and M. C. Paik (2020) Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis 142, pp. 106816. Cited by: §2.1.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413. Cited by: §1.
  • D. J. C. MacKay (1992) A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4 (3), pp. 448–472. External Links: Document, ISSN 0899-7667 Cited by: §1.
  • R. McAllister, Y. Gal, A. Kendall, M. Van Der Wilk, A. Shah, R. Cipolla, and A. Weller (2017) Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. In IJCAI International Joint Conference on Artificial Intelligence. External Links: Document, ISBN 9780999241103, ISSN 10450823 Cited by: §1.
  • R. Michelmore, M. Kwiatkowska, and Y. Gal (2018) Evaluating Uncertainty Quantification in End-to-End Autonomous Driving Control. External Links: 1811.06817, Link Cited by: §1.
  • T. Pearce, F. Leibfried, A. Brintrup, M. Zaki, and A. Neely (2020) Uncertainty in neural networks: approximately bayesian ensembling. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, Cited by: §1.
  • S. Ryu, Y. Kwon, and W. Y. Kim (2019) A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification. Chemical Science 10 (36), pp. 8438–8446. Cited by: §2.1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. External Links: ISSN 15337928 Cited by: §2.1.
  • N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, and P. Corke (2018) The limits and potentials of deep learning for robotics. International Journal of Robotics Research. External Links: Document, 1804.06557, ISSN 17413176 Cited by: §1.