Dropout As A Bayesian Approximation: Code
We propose a Bayesian convolutional neural network built upon Bayes by Backprop and elaborate how this known method can serve as the fundamental construct of our novel, reliable variational inference method for convolutional neural networks. First, we show how Bayes by Backprop can be applied to convolutional layers, where weights in filters have probability distributions instead of point-estimates; and second, how our proposed framework achieves, with various network architectures, performances comparable to convolutional neural networks with point-estimate weights. This work expands the group of Bayesian neural networks, which now consists of feedforward, recurrent, and convolutional ones.
However, from a probability theory perspective, it is unjustifiable to use single point-estimates as weights to base any classification on. CNNs with frequentist inference require substantial amounts of data examples to train on and are prone to overfitting on datasets with few examples per class.
At the same time, by using a prior probability distribution to integrate out the parameters, we compute the average across many models during training, which gives a regularization effect to the network, thus preventing overfitting.
In our approach, weights are represented by probability distributions which comprise the properties of Gaussian distributions: mean μ ∈ ℝ^d and variance σ² ∈ ℝ^d, denoted N(θ|μ, σ²), where d is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance σ², expresses an uncertainty estimation of every model parameter. The main contributions of our work are as follows:
We present how Bayes by Backprop can be efficiently applied to CNNs. To this end, we introduce the idea of applying two convolutional operations, one for the mean and one for the variance.
We empirically show how this generic and reliable variational inference method for Bayesian CNNs can be applied to various CNN architectures without limiting their performance, while adding intrinsic regularization effects. We compare the performances of these Bayesian CNNs to CNNs which use single point-estimates as weights, i.e. which are trained by frequentist inference.
This paper is structured as follows: after this introduction, we review Bayesian neural networks with variational inference, including previous work, an explanation of Bayes by Backprop, and its implementation in CNNs. Third, we examine aleatoric and epistemic uncertainty estimations, outlining previous work and how our proposed method directly connects to it. Fourth, we present our results and findings from an experimental evaluation of the proposed method on various architectures and datasets, before we finally conclude our work.
Recently, the uncertainty afforded by Bayes by Backprop trained neural networks has been used successfully to train feedforward neural networks in both supervised and reinforcement learning environments blundell2015weight ; lipton2016efficient ; houthooft2016curiosity , for training recurrent neural networks fortunato2017bayesian , and for CNNs shridhar2018BayesianComprehensive ; neklyudov2018variance . Here, we review this method for CNNs to construct a common foundation on which we build in section 3.
Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution p(w|D). buntine1991bayesian began by proposing various maximum-a-posteriori (MAP) schemes for neural networks. They were also the first to suggest second-order derivatives in the prior probability distribution to encourage smoothness of the resulting approximate posterior probability distribution. In subsequent work by hinton1993keeping , the first variational methods were proposed, which naturally served as a regularizer in neural networks. hochreiter1995simplifying suggest taking an information-theoretic perspective and utilizing a minimum description length (MDL) loss, which penalises non-robust weights by means of an approximate penalty based upon the effect of weight perturbations on the outputs. denker1991transforming and mackay1995probable investigated the posterior probability distributions of neural networks using Laplace approximations. As a response to the limitations of Laplace approximations, neal2012bayesian investigated the use of hybrid Monte Carlo for training neural networks, although it has so far been difficult to apply this to the large neural networks built in modern applications. More recently, graves2011practical derived a variational inference scheme for neural networks, and blundell2015weight extended this with an update for the variance that is unbiased and simpler to compute. graves2016stochastic derives a similar algorithm in the case of a mixture posterior probability distribution.
Several authors have derived how Dropout srivastava2014dropout and Gaussian Dropout wang2013fast can be viewed as approximate variational inference schemes gal2015bayesian ; kingma2015variational . We compare our results to gal2015bayesian . Furthermore, structured variational approximations louizos2017multiplicative , auxiliary variables maaloe2016auxiliary , and Stochastic Gradient MCMC li2016learning have been proposed to approximate the intractable posterior probability distribution.
In Bayesian inference, we place a probability distribution, parameterized by θ, over the weights w, from which weights can be sampled in backpropagation. Since the true posterior is typically intractable, an approximate distribution q_θ(w|D) is defined that is aimed to be as similar as possible to the true posterior p(w|D), measured by the Kullback-Leibler (KL) divergence kullback1951information . Hence, we define the optimal parameters θ^opt as

θ^opt = argmin_θ KL[q_θ(w|D) ‖ p(w|D)] = argmin_θ KL[q_θ(w|D) ‖ p(w)] − E_{q(w|θ)}[log p(D|w)] + log p(D)
This derivation forms an optimization problem with a resulting cost function widely known as variational free energy neal1998view ; yedidia2005constructing ; friston2007variational , which is built upon two terms: the former, KL[q_θ(w|D) ‖ p(w)], is dependent on the definition of the prior p(w), thus called complexity cost, whereas the latter, E_{q(w|θ)}[log p(D|w)], is dependent on the data p(D|w), thus called likelihood cost.
The term log p(D) can be omitted in the optimization because it is constant.
Since the KL-divergence is also intractable to compute exactly, we follow a stochastic variational method graves2011practical ; blundell2015weight . We sample the weights w from the variational distribution q_θ(w|D), since it is much more probable to draw samples which are appropriate for numerical methods from the variational posterior q_θ(w|D) than from the true posterior p(w|D). Consequently, we arrive at the tractable cost function (3), which is to be minimised with respect to θ during training:

F(D, θ) ≈ Σ_{i=1}^{n} [ log q_θ(w^{(i)}|D) − log p(w^{(i)}) − log p(D|w^{(i)}) ]

where n is the number of draws. We sample w^{(i)} from q_θ(w|D).
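As an illustration, this Monte Carlo cost can be sketched for a toy one-parameter linear model (a minimal sketch assuming a univariate Gaussian variational posterior, a standard Gaussian prior, and a unit-variance Gaussian likelihood; the helper names are ours, not from the released code):

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # log density of a (diagonal) Gaussian, summed over components
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

def variational_free_energy(mu, rho, x, y, n_draws=5, rng=None):
    """Monte Carlo estimate of F(D, theta) for a toy model y = w * x.

    theta = (mu, rho); sigma = softplus(rho) keeps the standard deviation positive.
    Each draw uses the reparameterization w = mu + sigma * eps, eps ~ N(0, 1).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = np.log1p(np.exp(rho))  # softplus
    total = 0.0
    for _ in range(n_draws):
        eps = rng.standard_normal()
        w = mu + sigma * eps                               # sample from q
        log_q = log_gauss(np.array([w]), mu, sigma)        # log q_theta(w|D)
        log_prior = log_gauss(np.array([w]), 0.0, 1.0)     # standard Gaussian prior
        log_lik = log_gauss(y, w * x, 1.0)                 # Gaussian likelihood
        total += log_q - log_prior - log_lik
    return total / n_draws
```

With data generated by a true weight of 2, the free energy is lower near the true weight than far from it, as expected for a cost to be minimised.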
In this section, we explain our algorithm for building CNNs with probability distributions over the weights in each filter, as seen in Figure 1, and apply Bayes by Backprop to approximate the intractable posterior probability distribution p(w|D), as described in the previous section 2.2. Notably, a fully Bayesian perspective on a CNN is, for most CNN architectures, not accomplished by merely placing probability distributions over the weights in convolutional layers; it also requires probability distributions over the weights in fully-connected layers.
We utilize the local reparameterization trick kingma2015variational and apply it to CNNs. Following kingma2015variational ; neklyudov2018variance , we do not sample the weights w, but instead sample the layer activations b, due to the consequent computational acceleration. The variational posterior probability distribution q_θ(w_{ijhw}|D) = N(μ_{ijhw}, α_{ijhw} μ²_{ijhw}) (where i and j index the input and output layers, and h and w the height and width of any given filter) allows the local reparameterization trick to be implemented in convolutional layers. This results in the subsequent equation for convolutional layer activations b:

b_j = A_i ∗ μ_i + ε_j ⊙ sqrt(A_i² ∗ (α_i ⊙ μ_i²))

where ε_j ~ N(0, 1), A_i is the receptive field, ∗ signalises the convolutional operation, and ⊙ the component-wise multiplication.
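The local reparameterization sampling of convolutional activations, with one operation for the mean and one for the variance, can be sketched as follows (a minimal numpy sketch with a naive single-channel "valid" convolution; function names are ours, not from the released code):

```python
import numpy as np

def conv2d_valid(x, k):
    # naive single-channel 'valid' 2-D cross-correlation
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def bayesian_conv_activation(A, mu, alpha, rng):
    """Sample conv-layer activations via the local reparameterization trick:
    b = A * mu + eps ⊙ sqrt(A² * (alpha ⊙ mu²)), with eps ~ N(0, 1)."""
    mean = conv2d_valid(A, mu)                    # first conv op: mean of b
    var = conv2d_valid(A**2, alpha * mu**2)       # second conv op: variance of b
    eps = rng.standard_normal(mean.shape)
    return mean + eps * np.sqrt(var)
```

Setting alpha to zero recovers the deterministic (point-estimate) convolution, which makes the correspondence to frequentist inference easy to check.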
The crux of equipping a CNN with probability distributions over weights instead of single point-estimates, and of being able to update the variational posterior probability distribution q_θ(w|D) by backpropagation, lies in applying two convolutional operations, whereas filters with single point-estimates apply one. As explained in the previous section 2.3.1, we deploy the local reparameterization trick and sample from the activations b. Since the activations b are functions of the mean and variance, among others, we are able to compute the two variables determining a Gaussian probability distribution, mean μ and variance σ², separately.
We pursue this in two convolutional operations: in the first, we treat the output b as the output of a CNN updated by frequentist inference. We optimize with Adam kingma2014adam towards a single point-estimate which increases the classification accuracy on the validation dataset. We interpret this single point-estimate as the mean μ of the variational posterior probability distributions q_θ(w|D). In the second convolutional operation, we learn the variance σ² = αμ². As this formulation of the variance includes the mean μ, only α needs to be learned in the second convolutional operation molchanov2017variational . In this way, we ensure that only one parameter is updated per convolutional operation, exactly as it would be in a CNN updated by frequentist inference.
In other words, while we learn in the first convolutional operation the MAP of the variational posterior probability distribution q_θ(w|D), we observe in the second convolutional operation how much the values of the weights w deviate from this MAP. This procedure is repeated in the fully-connected layers. In addition, to accelerate computation, to ensure a positive non-zero variance σ², and to enhance accuracy, we learn α and use the Softplus activation function, as further described in the Experiments section 4.
In classification tasks, we are interested in the predictive distribution p(y*|x*, D), where x* is an unseen data example and y* its predicted class. For a Bayesian neural network, this quantity is given by:

p(y*|x*, D) = ∫ p(y*|x*, w) p(w|D) dw
In Bayes by Backprop, Gaussian distributions q_θ(w|D) = N(w|μ, σ²), where θ = {μ, σ}, are learned with some dataset D as we explained previously in 2.2. Due to the discrete and finite nature of most classification tasks, the predictive distribution is commonly assumed to be categorical. Incorporating this aspect into the predictive distribution gives us

p(y*|x*, D) = ∫ ∏_{c=1}^{C} f_w(x*)_c^{y*_c} N(w|μ, σ²) dw

where C is the total number of classes and Σ_{c=1}^{C} f_w(x*)_c = 1.
As there is no closed-form solution due to the lack of conjugacy between categorical and Gaussian distributions, we cannot recover this distribution analytically. However, we can construct an unbiased estimator of the expectation by sampling from q_θ(w|D):

E_q[p(y*|x*, D)] ≈ 1/T Σ_{t=1}^{T} p(y*|x*, w_t), with w_t ~ q_θ(w|D)
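This Monte Carlo estimator can be sketched as follows (a hedged sketch in which `forward` stands for an arbitrary network function with sampled weights, and a Softmax head is assumed for the categorical probabilities):

```python
import numpy as np

def softmax(z):
    # numerically stable Softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predictive_distribution(forward, x, weight_samples):
    """Unbiased MC estimate of p(y*|x*, D): average the categorical
    predictions of T networks whose weights are drawn from q_theta(w|D)."""
    probs = np.stack([softmax(forward(x, w)) for w in weight_samples])
    return probs.mean(axis=0)  # one probability per class
```

Because each per-sample prediction is a valid probability vector, their average is one as well.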
where T is the predefined number of samples. This estimator allows us to evaluate the uncertainty of our predictions by the definition of variance, hence called predictive variance and denoted Var_q:

Var_q(p(y*|x*, D)) = E_q[y* y*ᵀ] − E_q[y*] E_q[y*]ᵀ = ∫ [diag(E_p[y*]) − E_p[y*] E_p[y*]ᵀ] q_θ(w|D) dw + ∫ (E_p[y*] − E_q[y*])(E_p[y*] − E_q[y*])ᵀ q_θ(w|D) dw   (8)

where E_p[y*] denotes the expectation with respect to p(y*|x*, w) and E_q[y*] = ∫ E_p[y*] q_θ(w|D) dw. The first term is the aleatoric and the second the epistemic uncertainty.
It is important to split the uncertainty in the form of the predictive variance into aleatoric and epistemic quantities, since it allows the modeler to evaluate the room for improvement: while aleatoric uncertainty (also known as statistical uncertainty) is merely a measure of the variation of ("noisy") data, epistemic uncertainty is caused by the model. Hence, a modeler can see whether the quality of the data is low (i.e. high aleatoric uncertainty), or whether the model itself is the cause of poor performance (i.e. high epistemic uncertainty). The former can be improved by gathering more data, whereas the latter requires refining the model der2009aleatory .
As shown in (8), the predictive variance can be decomposed into the aleatoric and epistemic uncertainty. kendall2017uncertainties and kwon2018uncertainty have proposed disparate methods to do so. In this paragraph, we first review these two methods before we explain our own method and how it overcomes deficiencies of the two aforementioned ones, especially the usage of Softmax in the output layer by kwon2018uncertainty .
kendall2017uncertainties derived a method by which aleatoric and epistemic uncertainties can be directly estimated by constructing a Bayesian neural network whose last layer before activation consists of the mean and variance of the logits, denoted f and σ². In other words, the pre-activated linear output of the neural network has 2K dimensions, with K being the number of output units, i.e. potential classes. They propose the estimator

1/T Σ_{t=1}^{T} diag(σ²_t) + 1/T Σ_{t=1}^{T} (f_t − f̄)(f_t − f̄)ᵀ

where f̄ = 1/T Σ_{t=1}^{T} f_t. kwon2018uncertainty mention the deficiencies of this approach: first, it models the variability of the linear predictors f_t and σ²_t and not the predictive probabilities; second, it ignores the fact that the covariance matrix of a multinomial random variable is a function of the mean vector; and third, the aleatoric uncertainty does not reflect correlations because of the diagonal matrix.
To address these deficiencies, kwon2018uncertainty propose the estimator

1/T Σ_{t=1}^{T} [diag(p_t) − p_t p_tᵀ] + 1/T Σ_{t=1}^{T} (p_t − p̄)(p_t − p̄)ᵀ   (10)

where p_t = Softmax(f_{w_t}(x*)) and p̄ = 1/T Σ_{t=1}^{T} p_t. By doing so, they do not need the pre-activated linear outputs f_t and σ²_t. Consequently, they can directly compute the variability of the predictive probability and do not require additional sampling steps. With increasing T, (10) converges in probability to (8).
where the label y = 1 is assigned to the input x if the sigmoid output σ(f(x)) ≥ 0.5. Assuming we have K classes, this logistic sigmoid function can be generalized to the Softmax function:

Softmax(f(x))_c = exp(f(x)_c) / Σ_{c'=1}^{K} exp(f(x)_{c'})
Here, we propose a novel method by which aleatoric and epistemic uncertainty estimations can be computed without an additional non-linear transformation by the generalized logistic sigmoid function in the output layer.
This procedure can be summarized as a replacement of the Softmax function with p_t = Softplus(f_{w_t}(x*)) / Σ_{c=1}^{K} Softplus(f_{w_t}(x*)_c), where the denominator is the aforementioned normalization of the Softplus output. We call this computation Softplus normalization. Our intuition for this is as follows: we classify according to a categorical distribution, as we outlined previously in section 3. This can be seen as an approximation of the outputs with one-hot vectors. If those vectors already contain many zeros, the approximation is going to be more accurate. For a Softmax, predicting zeros requires a logit of −∞, which is hard to achieve in practice. For the Softplus normalization, we just need a sufficiently negative value to get nearly a zero in the output. Consequently, Softplus normalization can easily produce vectors that are in practice zero, whereas Softmax cannot. In sum, our proposed method is as follows:

Var_q(p(y*|x*, D)) ≈ 1/T Σ_{t=1}^{T} [diag(p_t) − p_t p_tᵀ] + 1/T Σ_{t=1}^{T} (p_t − p̄)(p_t − p̄)ᵀ

where p_t = Softplus(f_{w_t}(x*)) / Σ_{c=1}^{K} Softplus(f_{w_t}(x*)_c) and p̄ = 1/T Σ_{t=1}^{T} p_t.
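The Softplus-normalization decomposition above can be sketched as follows (a minimal numpy sketch; the function names are ours, not from the released code):

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def uncertainty_decomposition(logits):
    """Aleatoric/epistemic split from T stochastic forward passes.

    logits: array of shape (T, K), the pre-activation outputs of T weight samples.
    Probabilities come from Softplus normalization: p_t = softplus(f_t) / sum(...).
    Returns (aleatoric, epistemic) covariance matrices of shape (K, K).
    """
    s = softplus(logits)
    p = s / s.sum(axis=1, keepdims=True)       # (T, K) predictive probabilities
    p_bar = p.mean(axis=0)
    # aleatoric: mean of per-sample multinomial covariances diag(p_t) - p_t p_t^T
    aleatoric = np.mean([np.diag(pt) - np.outer(pt, pt) for pt in p], axis=0)
    # epistemic: scatter of the sampled probability vectors around their mean
    epistemic = np.mean([np.outer(pt - p_bar, pt - p_bar) for pt in p], axis=0)
    return aleatoric, epistemic
```

Note that when all T passes agree exactly, the epistemic term vanishes while the aleatoric term (the multinomial covariance) does not, matching the interpretation of the two quantities.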
For all conducted experiments, we implement the foregoing description of Bayesian CNNs with variational inference in LeNet-5 lecun1998gradient , AlexNet krizhevsky2012imagenet , and VGG simonyan2014very . The exact architecture specifications can be found in the Appendix 6 and in our GitHub repository (https://github.com/kumar-shridhar/PyTorch-Softplus-Normalization-Uncertainty-Estimation-Bayesian-CNN). We train the networks with the MNIST dataset of handwritten digits lecun1998gradient , and with the CIFAR-10 and CIFAR-100 datasets krizhevsky2009learning , since these datasets serve widely as benchmarks for CNNs' performances. The originally chosen activation functions in all architectures are ReLU, but we must introduce another one, called Softplus, see (16), because of our method of applying two convolutional or fully-connected operations. As aforementioned, one of these determines the mean μ, and the other the variance σ². Specifically, we apply the Softplus function because we want to ensure that the variance σ² never becomes zero. This would be equivalent to merely calculating the MAP, which can be interpreted as equivalent to a maximum likelihood estimation (MLE), which is further equivalent to utilising single point-estimates, hence frequentist inference. The Softplus activation function is a smooth approximation of ReLU. Although it is practically not influential, it has the subtle and analytically important advantage that it never becomes exactly zero, whereas ReLU becomes zero for all negative inputs.
Softplus(x) = 1/β · log(1 + exp(β · x))   (16)

where β is by default set to 1.
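A minimal implementation of this activation function, including the β parameter:

```python
import numpy as np

def softplus(x, beta=1.0):
    # smooth approximation of ReLU; strictly positive for all finite inputs
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way
    return np.logaddexp(0.0, beta * x) / beta
```

Unlike ReLU, the output is strictly positive even for very negative inputs, while for large positive inputs it approaches the identity.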
All experiments are performed with the same hyper-parameter settings, as stated in the Appendix 6.
|Architecture||MNIST||CIFAR-10||CIFAR-100|
|Bayesian VGG (with VI)||99||86||45|
|Bayesian AlexNet (with VI)||99||73||36|
|Bayesian LeNet-5 (with VI)||98||69||31|
|Bayesian LeNet-5 (with Dropout)||99||83||—|
We evaluate the performance of our Bayesian CNNs with variational inference. Table 1 shows a comparison of validation accuracies (in percent) for architectures trained by two disparate Bayesian approaches, namely variational inference, i.e. Bayes by Backprop, and Dropout as proposed by Gal and Ghahramani gal2015bayesian , plus frequentist inference, for all three datasets. Bayesian CNNs trained by variational inference achieve validation accuracies comparable to their counterparts trained by frequentist inference. On MNIST, validation accuracies of the two disparate Bayesian approaches are comparable, but a Bayesian LeNet-5 with Dropout achieves a considerably higher validation accuracy on CIFAR-10, although we were not able to reproduce these reported results.
In Figure 2, we show how Bayesian networks naturally incorporate regularization effects, exemplified on AlexNet. While an AlexNet trained by frequentist inference without any regularization overfits greatly on CIFAR-100, an AlexNet trained by Bayesian inference on CIFAR-100 does not. It performs on par with an AlexNet trained by frequentist inference with three layers of Dropout after the first, fourth, and sixth layers of the architecture. In initial epochs, Bayesian CNNs trained by variational inference start with a low validation accuracy compared to architectures trained by frequentist inference. Initialization for both inference methods is chosen equivalently: the variational posterior probability distributions q_θ(w|D) are initially approximated as standard Gaussian distributions, while initial point-estimates in architectures trained by frequentist inference are randomly drawn from a standard Gaussian distribution. These initialization methods ensure that weights are neither too small nor too large at the beginning of training.
Our results for the Softplus normalization estimation method of aleatoric and epistemic uncertainties are summarized in Table 2. It compares the means over epochs of the aleatoric and epistemic uncertainties for our Bayesian CNNs LeNet-5, AlexNet, and VGG. We train on MNIST and CIFAR-10, and compute homoscedastic uncertainties by averaging over all classes, i.e. over the (heteroscedastic) uncertainties of each class. We see a correlation between validation accuracy and epistemic uncertainty: with increasing validation accuracy, epistemic uncertainty decreases, as observable across the different models. In contrast, aleatoric uncertainty measures the irreducible variability of the datasets and hence depends only on the datasets and not the models, which can be seen in the constant aleatoric uncertainty of each dataset across models.
|Model (dataset)||Aleatoric uncertainty||Epistemic uncertainty||Validation accuracy (%)|
|Bayesian VGG (MNIST)||0.00110||0.0004||99|
|Bayesian VGG (CIFAR-10)||0.00099||0.0013||85|
|Bayesian AlexNet (MNIST)||0.00110||0.0019||99|
|Bayesian AlexNet (CIFAR-10)||0.00099||0.0002||73|
|Bayesian LeNet-5 (MNIST)||0.00110||0.0026||98|
|Bayesian LeNet-5 (CIFAR-10)||0.00099||0.0404||69|
We further investigate influences on aleatoric uncertainty estimations by additive standard Gaussian noise of different levels. Here, a random sample of the standard Gaussian distribution is multiplied by a level and added to each pixel value. We test for different noise levels and see that the aleatoric uncertainty is independent of the added noise level (see Figure 3), i.e. the aleatoric uncertainty is constant to six decimal places. Our intuition for this phenomenon is as follows: by normalizing the output of the Softplus function within one batch, of which all images have the same added level of noise, the aleatoric uncertainty captures the variability among the images in one batch, and not across batches with different levels of noise.
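The noise procedure described above can be sketched as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def add_gaussian_noise(images, level, rng):
    """Add level-scaled standard Gaussian noise to each pixel value.

    images: array of shape (N, H, W); level: scalar noise level.
    """
    return images + level * rng.standard_normal(images.shape)
```

A level of zero leaves the batch unchanged; increasing the level scales the per-pixel perturbations proportionally.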
We propose a novel method by which aleatoric and epistemic uncertainties can be estimated in Bayesian CNNs. We call this method Softplus normalization. First, we briefly discuss the differences between Bayes by Backprop based variational inference and the Dropout based approximation of Gal & Ghahramani gal2015bayesian . Second, we derive the Softplus normalization uncertainty estimation method from previously published methods for these uncertainties. We evaluate our approach on three datasets (MNIST, CIFAR-10, CIFAR-100).
As a base, we show that Bayesian CNNs with variational inference achieve results comparable to those achieved by the same network architectures trained by frequentist inference, while naturally including a regularization effect and an uncertainty measure. Building on that, we examine how our proposed Softplus normalization method for estimating aleatoric and epistemic uncertainties derives from previous work in this field, how it differs from that work, and why it is more appropriate for use in computer vision than previously proposed methods.
|layer type||width||stride||padding||input shape||nonlinearity|