Over the last decade, deep neural networks (DNNs) have emerged as the dominant technique for the analysis of perceptual data. Also in safety-critical applications like autonomous driving, where the vehicle must be able to understand its environment, DNNs have made rapid progress on several tasks (Grigorescu et al., 2019).
However, classical DNNs have deficits in capturing the model uncertainty (Kendall and Gal, 2017; Gal and Ghahramani, 2016). But when using DNN models in safety-critical applications, it is mandatory to provide an uncertainty measure that can be used to identify unreliable predictions (Michelmore et al., 2018; Harakeh et al., 2019; McAllister et al., 2017).
For example, in the field of robotics (Sünderhauf et al., 2018), medical applications, or autonomous driving (Bojarski et al., 2016), where machines interact with humans, it is important to identify situations in which a model prediction is unreliable and human intervention is necessary. Such situations may, for example, be completely different from anything that occurred during training.
Employing Bayesian DNNs (BDNNs) (MacKay, 1992) tackles this problem and allows one to compute an uncertainty measure. However, state-of-the-art BDNNs require sampling during deployment, leading to computation times that are larger than those of a classical DNN by a factor equal to the number of MC runs. This work overcomes this drawback by providing a method that approximates the expected value and variance of a BDNN's predictive distribution in a single run. It therefore has the same computation time as a classical DNN. We focus here on a special variant of BDNNs known as MC dropout (Gal and Ghahramani, 2016). While our approximation method is also applicable to convolutional neural networks and classification settings, we focus in this work on regression through fully connected networks.
2 Related Work
2.1 MC Dropout Bayesian Neural Networks
BDNNs are probabilistic models that capture uncertainty by means of probability distributions. Probabilistic DNNs, which are non-Bayesian, only define a distribution for the conditional outcome. In common probabilistic DNNs the output nodes control the parameters of a conditional probability distribution (CPD) of the outcome. For regression-type problems a common choice for the CPD is the normal distribution, where the variance quantifies the data uncertainty, known as aleatoric uncertainty. BDNNs additionally define distributions for the weights, which translate into a distribution of the modeled parameters. In this manner the model uncertainty is captured, which is known as epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009). In the case of MC dropout BDNNs each weight distribution is a Bernoulli distribution: the weight takes the value zero with the dropout probability p, and the weight value w with probability 1 - p. All weights starting from the same neuron are set to zero simultaneously. The dropout probability p is usually treated as a fixed hyperparameter and the weight value w is tuned during training.
In contrast to standard dropout (Srivastava et al., 2014), the weights in MC dropout are not frozen and rescaled after training; instead, the dropout procedure is also performed at test time. It can be shown that MC dropout is an approximation to a BDNN (Gal and Ghahramani, 2016). MC dropout BDNNs have been used successfully in many applications and have proven to yield improved prediction performance and to allow the definition of uncertainty measures that identify individual unreliable predictions (Gal and Ghahramani, 2016; Ryu et al., 2019; Dürr et al., 2018; Kwon et al., 2020). To employ a trained Bayesian DNN in practice, one performs several runs of predictions. In each run, weights are sampled from the weight distributions, leading to a constellation of weight values that is used to compute the parameters of a CPD. To determine the outcome distribution of a BDNN, we draw samples from the CPDs that result from the different MC runs. In this way, the outcome distribution incorporates both the epistemic and the aleatoric uncertainty. A drawback of an MC dropout BDNN compared to its classical DNN variant is the increased computing time. The sampling procedure leads to a computing time that is prohibitive for many real-time applications like autonomous driving.
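The MC prediction procedure described above can be sketched in a few lines. The tiny two-layer network and its random weights below are hypothetical stand-ins for a trained model; only the sampling logic is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trained network: the weight values are random stand-ins,
# only the MC dropout prediction procedure is of interest here.
p = 0.2                                          # dropout probability
W1, b1 = rng.normal(size=(1, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)

def stochastic_forward(x):
    """One MC run: dropout stays active at test time."""
    h = np.maximum(x @ W1 + b1, 0.0)              # dense + ReLU
    h = h * rng.binomial(1, 1 - p, size=h.shape)  # sample a dropout mask
    return h @ W2 + b2

# T runs give samples of the CPD parameter; their mean and variance
# summarize the predictive distribution (up to the aleatoric part).
x = np.array([[0.5]])
samples = np.stack([stochastic_forward(x) for _ in range(1000)])
pred_mean, pred_var = samples.mean(axis=0), samples.var(axis=0)
```

Each of the 1,000 forward passes here is a full network evaluation, which is exactly the cost our MP approximation avoids.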
2.2 Moment Propagation
Our method relies on statistical moment propagation (MP). More specifically, we propagate the expectation and the variance of our signal distribution through the different layers of a neural network. The variance of the signal arises from the dropout process. Quantifying the variance after a transformation is also done in error propagation (EP). EP quantifies how the uncertainty of an input that is transformed by a function (e.g., a measurement error) transfers to an uncertainty of the output of this function. In the case of a continuous output it is common to characterize the uncertainty by the variance. This approach is also used in statistics as the delta method (Dorfman, 1938). In MP we approximate the layer-wise transformations of the variance and the expected value. A similar approach has been used for neural networks before (Frey and Hinton, 1999; Adachi, 2019), and to detect adversarial examples in (Jin, 2015) and (Gast and Roth, 2018). To the best of our knowledge, however, our approach is the first method that provides a single shot approximation to the expected value and the variance of the predictive distribution resulting from an MC dropout NN.
The goal of our method (code available at https://github.com/kaibrach/Moment-Propagation) is to approximate the expected value E and the variance V of the predicted output that is obtained by the MC dropout method described above. When propagating an observation through an MC dropout network, we get at each layer with n nodes an activation signal with an expected value (of dimension n) and a variance given by a variance-covariance matrix (of dimension n x n). We neglect the effect of correlations between different activations, which are small anyway in deeper layers due to the decorrelation effect of the dropout. Hence, we only consider the diagonal terms of the covariance matrix. In the following, we describe for each layer type in a fully connected network how the expected value E and the variance V are propagated. As layer types we consider dropout, dense, and ReLU activation layers. Figure 1 provides an overview of the layer-wise abstraction.
3.1 Dropout Layer
We start our discussion with the effect of MC dropout. Let E_i be the expectation and V_i the variance at the i'th node of the input layer. In a dropout layer the random value of a node is multiplied independently with a Bernoulli variable z that is either zero or one. The expectation of the i'th node after dropout is then given by:

E_i^D = (1 - p) E_i

For computing the variance of the i'th node after dropout, we use the fact that the variance V(X Y) of the product of two independent random variables X and Y is given by (Goodman, 1960):

V(X Y) = V(X) V(Y) + V(X) E(Y)^2 + V(Y) E(X)^2

With E(z) = 1 - p and V(z) = p (1 - p), we get:

V_i^D = p (1 - p) E_i^2 + (1 - p) V_i

Dropout is the only layer in our approach where uncertainty is created, i.e., even if the input has V_i = 0, the output of the dropout layer has V_i^D > 0 for E_i != 0.
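The two dropout moment equations translate directly into code. A minimal sketch (the function name `dropout_moments` is ours, not from the paper's repository):

```python
import numpy as np

def dropout_moments(E, V, p):
    """Propagate expectation and variance through a dropout layer.

    With z ~ Bernoulli(1 - p) multiplied independently onto each node:
      E_out = (1 - p) * E
      V_out = p * (1 - p) * E**2 + (1 - p) * V   (via Goodman's formula)
    """
    E = np.asarray(E, dtype=float)
    V = np.asarray(V, dtype=float)
    return (1 - p) * E, p * (1 - p) * E**2 + (1 - p) * V

# Even a noise-free input (V = 0) leaves the layer with positive variance:
E_out, V_out = dropout_moments([1.0, -2.0], [0.0, 0.0], p=0.5)
```

With p = 0.5 this gives E_out = [0.5, -1.0] and V_out = [0.25, 1.0], illustrating how dropout injects uncertainty even into a deterministic signal.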
3.2 Dense Layer
For a dense layer with n input and m output nodes, we compute the value of the i'th output node as y_i = sum_j w_ij x_j + b_i, where x_j are the values of the input nodes. Using the linearity of the expectation, we get the expectation of the i'th output node from the expectations E_j of the input nodes:

E_i^dense = sum_j w_ij E_j + b_i

To calculate the change of the variance, we use the fact that the variance under a linear transformation behaves as V(w x + b) = w^2 V(x). Further, we assume independence of the different summands, yielding:

V_i^dense = sum_j w_ij^2 V_j
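In vectorized form, these two sums are just matrix products: the expectation passes through the weight matrix itself and the variance through its elementwise square. A sketch under the paper's independence assumption (names are ours):

```python
import numpy as np

def dense_moments(E, V, W, b):
    """Propagate moments through a dense layer y = x @ W + b:
    expectations pass through the linear map, and under the independence
    assumption variances pass through the squared weights."""
    E = np.asarray(E, dtype=float)
    V = np.asarray(V, dtype=float)
    return E @ W + b, V @ W**2

# Two inputs, one output: E_out = 1*2 + 1*3 + 0.5, V_out = 0.1*4 + 0.2*9
E_out, V_out = dense_moments([1.0, 1.0], [0.1, 0.2],
                             W=np.array([[2.0], [3.0]]), b=np.array([0.5]))
```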
3.3 ReLU Activation Layer
To calculate the expectation and variance of the i'th node after a ReLU activation, as a function of the expectation E_i and variance V_i of this node before the ReLU, we need to make a distributional assumption. We assume that the input is Gaussian distributed with mean mu = E_i and standard deviation sigma = sqrt(V_i). With phi the PDF and Phi the corresponding CDF of the standard normal distribution, we get (see (Frey and Hinton, 1999) for a derivation) for the expectation and variance of the output:

E_i^ReLU = mu Phi(mu/sigma) + sigma phi(mu/sigma)

V_i^ReLU = (mu^2 + sigma^2) Phi(mu/sigma) + mu sigma phi(mu/sigma) - (E_i^ReLU)^2
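These closed-form expressions can be implemented with nothing more than the error function; the small epsilon guarding sigma = 0 is our own numerical choice, not part of the derivation:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # vectorized error function (avoids SciPy)

def relu_moments(E, V):
    """Propagate moments through ReLU assuming Gaussian inputs
    (Frey and Hinton, 1999):
      E_out = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)
      V_out = (mu**2 + sigma**2) * Phi(mu/sigma)
              + mu * sigma * phi(mu/sigma) - E_out**2
    """
    mu = np.asarray(E, dtype=float)
    sigma = np.sqrt(np.asarray(V, dtype=float)) + 1e-12  # guard sigma = 0
    z = mu / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    E_out = mu * cdf + sigma * pdf
    V_out = (mu**2 + sigma**2) * cdf + mu * sigma * pdf - E_out**2
    return E_out, np.maximum(V_out, 0.0)

# Standard normal input: E_out = phi(0) ~ 0.3989, V_out ~ 0.3408
E_out, V_out = relu_moments([0.0], [1.0])
```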
4.1 Toy Dataset
We first apply our approach to a one-dimensional regression toy dataset with only one input feature. We use a fully connected NN with three layers, each with 256 nodes, ReLU activations, and dropout after the dense layers. A single node in the output layer is interpreted as the expected value mu of the conditional outcome distribution p(y|x). We train the network using the MSE loss and apply dropout with a fixed dropout probability. From the MC dropout BDNN, we get at each x-position MC samples of mu, from which we estimate the expectation of mu by their average and its uncertainty by their variance. For comparison, we use our MP approach to approximate the expected value and the variance of mu at each x-position (see upper panel of figure 2). We also include the deterministic output of the DNN in which dropout has been used only during training. All three approaches yield nearly identical results within the range of the training data. We attribute this to the fact that we have plenty of training data, so the epistemic uncertainty is negligible. In the lower panel of figure 2, the uncertainty of mu is compared by displaying an interval given by the expected value of mu plus/minus two times its standard deviation. The widths of the resulting intervals of the BDNN via the MP approach and via MC dropout are comparable (the DNN has no spread). This indicates the usefulness of our approach for epistemic uncertainty estimation.
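The agreement observed in this experiment can be reproduced in miniature by composing the three layer rules from the previous section and comparing against brute-force MC sampling. The small random network below is a hypothetical stand-in for the trained toy model, and all names are ours:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
p = 0.2
W1, b1 = rng.normal(scale=0.5, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)
_erf = np.vectorize(math.erf)

def mp_forward(x):
    """Single shot: dense -> ReLU -> dropout -> dense in moment space."""
    E, V = x @ W1 + b1, np.zeros((x.shape[0], W1.shape[1]))      # dense
    mu, sigma = E, np.sqrt(V) + 1e-12
    z = mu / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + _erf(z / math.sqrt(2)))
    E = mu * cdf + sigma * pdf                                   # ReLU
    V = np.maximum((mu**2 + sigma**2) * cdf + mu * sigma * pdf - E**2, 0)
    E, V = (1 - p) * E, p * (1 - p) * E**2 + (1 - p) * V         # dropout
    return E @ W2 + b2, V @ W2**2                                # dense

def mc_forward(x, T=20000):
    """Brute force: T sampled dropout masks on the same network."""
    h = np.maximum(x @ W1 + b1, 0.0)                  # deterministic part
    masks = rng.binomial(1, 1 - p, size=(T, h.shape[1]))
    outs = (h * masks) @ W2 + b2                      # T sampled outputs
    return outs.mean(axis=0), outs.var(axis=0)

x = np.array([[0.5]])
(E_mp, V_mp), (E_mc, V_mc) = mp_forward(x), mc_forward(x)
```

For this architecture the MP moments are exact (the dropout masks are the only noise source and enter linearly into the last layer), so the MC estimates converge to them as T grows.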
Table 1 (columns): Dataset | N | Q | Test RMSE | Test NLL | Test RT [s]
To benchmark our method, we redo the analysis of (Gal and Ghahramani, 2016) for the UCI regression benchmark datasets. We use the same NN model as Gal and Ghahramani, a fully connected neural network with one hidden layer and ReLU activation, in which the CPD over T MC runs is given by sampling from the normal PDF:

p(y|x) ≈ (1/T) sum_t N(y; y_t(x), 1/tau)

Again, y_t(x) is the single output of the BDNN for the t'th MC run. To derive a predictive distribution, Gal assumes in each run a Gaussian distribution centered at y_t(x) with a precision tau, corresponding to the reciprocal of the variance. The output y_t(x) is received from the NN, and tau is treated as a hyperparameter. For the MP model, the MC sampling (Eq. 8) is replaced by integration over the distribution of the output, which yields a Gaussian predictive distribution with the propagated moments:

p(y|x) ≈ N(y; E, V + 1/tau)
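Under this moment-matched Gaussian, the test NLL needs no sampling at all; a minimal sketch (our variable names, assuming the Gaussian form above):

```python
import math

def mp_gaussian_nll(y, E, V, tau):
    """Negative log-likelihood of one observation y under the single shot
    MP predictive distribution N(y; E, V + 1/tau): the propagated variance V
    carries the epistemic part, the precision term 1/tau the aleatoric part."""
    var = V + 1.0 / tau
    return 0.5 * math.log(2.0 * math.pi * var) + (y - E) ** 2 / (2.0 * var)

nll = mp_gaussian_nll(y=2.0, E=1.0, V=0.05, tau=10.0)  # total variance 0.15
```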
We used the same protocol as (Gal and Ghahramani, 2016), which can be found at https://github.com/yaringal/DropoutUncertaintyExps. Accordingly, we train the network for 10 times the epochs provided in the individual dataset configuration. As described in (Gal and Ghahramani, 2016), an extensive grid search over the dropout rate and different values of the precision tau is done. The hyperparameters minimizing the validation NLL are chosen and applied to the test set.
We report in table 1 the test performance (RMSE and NLL) achieved via the MC BDNN using the optimal hyperparameters for the different UCI datasets. We also report the test RMSE and NLL achieved with our MP method. Overall, the MC and MP approaches produce similar results. However, as shown in the last column of the table, the MP method is much faster, having to perform only one forward pass instead of one per MC run.
With our MP approach we have introduced an approximation to MC dropout that requires no sampling but instead propagates the expectation and the variance of the signal through the network. This results in a time saving by a factor that approximately corresponds to the number of MC runs (in our benchmark experiment 10,000). We have shown that our fast MP approach precisely approximates the expectation and variance of the predictive distribution achieved by MC dropout. Also, the achieved prediction performance in terms of RMSE and NLL does not show significant differences between MC dropout and our MP approach. Hence, our MP approach opens the door to including uncertainty information in real-time applications.
We are currently working on extending the approach to different architectures such as convolutional neural networks. We are also investigating how to make use of the uncertainty information to detect novel classes in classification settings.
We are very grateful to Elektrobit Automotive GmbH for supporting this research work. Further, part of the work has been funded by the Federal Ministry of Education and Research of Germany (BMBF) in the project DeepDoubt (grant no. 01IS19083A).
- Estimating and factoring the dropout induced distribution with Gaussian mixture model. In International Conference on Artificial Neural Networks, pp. 775–792.
- End to End Learning for Self-Driving Cars.
- Aleatory or epistemic? Does it matter? Structural Safety 31 (2), pp. 105–112.
- A note on the δ-method for finding variance formulae. Biometric Bulletin.
- Know When You Don't Know: A Robust Deep Learning Approach in the Presence of Unknown Phenotypes. Assay and Drug Development Technologies 16 (6), pp. 343–349.
- Variational learning in nonlinear Gaussian belief networks. Neural Computation 11 (1), pp. 193–213.
- Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In 33rd International Conference on Machine Learning, ICML 2016, Vol. 3, pp. 1651–1660.
- Lightweight Probabilistic Deep Networks. Technical report.
- On the Exact Variance of Products. Journal of the American Statistical Association 55 (292), pp. 708–713.
- A Survey of Deep Learning Techniques for Autonomous Driving. Journal of Field Robotics 37 (3), pp. 362–386.
- BayesOD: A Bayesian Approach for Uncertainty Estimation in Deep Object Detectors.
- Robust Convolutional Neural Networks under Adversarial Noise. CoRR abs/1511.0.
- What uncertainties do we need in Bayesian deep learning for computer vision? Technical report, Vol. 2017-December.
- Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis 142, pp. 106816.
- Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
- A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4 (3), pp. 448–472.
- Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. In IJCAI International Joint Conference on Artificial Intelligence.
- Evaluating Uncertainty Quantification in End-to-End Autonomous Driving Control.
- Uncertainty in neural networks: Approximately Bayesian ensembling. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020.
- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification. Chemical Science 10 (36), pp. 8438–8446.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
- The limits and potentials of deep learning for robotics. International Journal of Robotics Research.