1 Introduction
Deep neural networks (DNNs), particularly convolutional neural networks (CNNs), have recently been used to solve complex perceptual and decision tasks [15; 21; 23]. While these models take into account aleatoric uncertainty via their softmax output (i.e. the uncertainty present in the training data), they do not take into account epistemic uncertainty (i.e. parameter uncertainty) [12]. Bayesian DNNs attempt to learn a distribution over their parameters thereby allowing for the computation of the uncertainty of their outputs given the parameters. However, ideal Bayesian methods do not scale well due to the difficulty in computing the posterior of a network’s parameters.
As a result, several approximate Bayesian methods have been proposed for DNNs. MacKay [18] proposed using the Laplace approximation. Markov chain Monte Carlo (MCMC) methods have been suggested for estimating the posterior of a network's weights given the training data [22; 26]. Expectation propagation has also been proposed [11; 8]. However, these methods can be difficult to implement for the very large CNNs commonly used for object recognition. Variational inference methods have also been used to make Bayesian NNs more tractable [9; 1; 7; 2]. Largely because these methods substantially increase the number of parameters in a network, they have not been extensively applied to large DNNs. Gal and Ghahramani [5] and Kingma et al. [13] bypassed this issue by developing Bayesian CNNs using Bernoulli and Gaussian dropout [24], respectively. While independent weight sampling with additive Gaussian noise has been investigated [9; 1; 7; 2], independently sampling weights using multiplicative Bernoulli noise, i.e. dropconnect [25], or independently sampled multiplicative Gaussian noise has not been thoroughly evaluated.
In addition to Bernoulli and Gaussian distributions, spike-and-slab distributions, a combination of the two, have been investigated, particularly for linear models [20; 19; 6; 10]. Interestingly, Bernoulli dropout and dropconnect can be seen as approximations to spike-and-slab distributions for units and weights, respectively [17; 3]. Spike-and-slab variational distributions have been implemented using Bernoulli dropout with additive weight noise sampled from a Gaussian with a learned standard deviation [17]. This approach more than doubled the number of learned parameters, since the mean and the standard deviation of each weight, as well as the dropout rate for each unit, were learned. However, this method did not consistently outperform standard neural networks. Gal [3] also discussed motivations for spike-and-slab variational distributions, but did not suggest a practical implementation.
We evaluated the performance of Bayesian CNNs with different variational distributions on MNIST [16] and CIFAR-10 [14]. We also investigated how adding Gaussian image noise with varying standard deviations to the test set affected each network's learned uncertainty. We did this to test how networks responded to inputs not drawn from the distribution used to create the training and test sets. We also propose an approximation of spike-and-slab variational inference based on Bernoulli dropout and Gaussian dropconnect, which combines the advantages of both, leading to better uncertainty estimates and good test set generalization without increasing the number of learned parameters.
2 Methods
2.1 Bayesian Deep Neural Networks
DNNs are commonly trained by finding the maximum a posteriori (MAP) weights given the training data $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$ and a prior over the weight matrix $\mathbf{W}$, $p(\mathbf{W})$. However, ideal Bayesian learning would involve computing the full posterior $p(\mathbf{W}|\mathcal{D})$. This can be intractable due to both the difficulty in calculating $p(\mathcal{D})$ and in calculating the joint distribution of a large number of parameters. Instead, $p(\mathbf{W}|\mathcal{D})$ can be approximated using a variational distribution $q(\mathbf{W})$. This distribution is constructed to allow for easy generation of samples. The objective of variational inference is to optimize the variational parameters $\mathbf{V}$ so that the Kullback-Leibler (KL) divergence between $q(\mathbf{W})$ and $p(\mathbf{W}|\mathcal{D})$ is minimized [9; 1; 7; 2]:

$\mathbf{V}^* = \operatorname*{arg\,min}_{\mathbf{V}} \; \mathrm{KL}\left[ q(\mathbf{W}) \,\|\, p(\mathbf{W}) \right] - \mathbb{E}_{q(\mathbf{W})}\left[ \log p(\mathbf{Y}|\mathbf{X}, \mathbf{W}) \right] \qquad (1)$
Using Monte Carlo (MC) methods to estimate $\mathbb{E}_{q(\mathbf{W})}\left[ \log p(\mathbf{Y}|\mathbf{X}, \mathbf{W}) \right]$, using $n$ weight samples $\hat{\mathbf{W}}_i \sim q(\mathbf{W})$, results in the following loss function:

$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log p(\mathbf{Y}|\mathbf{X}, \hat{\mathbf{W}}_i) + \mathrm{KL}\left[ q(\mathbf{W}) \,\|\, p(\mathbf{W}) \right] \qquad (2)$
MC sampling can also be used to estimate the probability of test data:

$p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{Y}) \approx \frac{1}{n} \sum_{i=1}^{n} p(\mathbf{y}^*|\mathbf{x}^*, \hat{\mathbf{W}}_i), \qquad \hat{\mathbf{W}}_i \sim q(\mathbf{W}) \qquad (3)$
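As an illustration of Eqs. (2) and (3), the following sketch (ours, not the paper's code) uses NumPy with a single linear softmax layer standing in for a full CNN, and Bernoulli dropconnect masks as an example variational distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(x, V, p_drop=0.5, n_samples=100):
    """Eq. (3): average p(y*|x*, W_i) over n weight samples W_i ~ q(W).
    Here q(W) multiplies the variational parameters V by independent
    Bernoulli masks (dropconnect) with training-time scaling."""
    probs = np.zeros((x.shape[0], V.shape[1]))
    for _ in range(n_samples):
        mask = rng.random(V.shape) >= p_drop      # keep with prob 1 - p_drop
        W_hat = V * mask / (1.0 - p_drop)
        probs += softmax(x @ W_hat)
    return probs / n_samples

def mc_nll(x, y, V, p_drop=0.5, n_samples=10):
    """Data term of Eq. (2): average negative log-likelihood over samples."""
    nll = 0.0
    for _ in range(n_samples):
        mask = rng.random(V.shape) >= p_drop
        p = softmax(x @ (V * mask / (1.0 - p_drop)))
        nll -= np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return nll / n_samples
```

The KL term of Eq. (2) is omitted in this sketch; for the variational distributions considered below, it reduces to L2 regularization of the variational parameters (see Supplementary Material).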
2.2 Variational Distributions
The number and continuous nature of the parameters in DNNs make sampling from the entire distribution of possible weight matrices computationally challenging. However, variational distributions can make sampling easier. In deep learning, the most common sampling method is to draw multiplicative noise masks from some distribution. Several of these methods can be formulated as variational distributions where weights are sampled by elementwise multiplication of the variational parameters $\mathbf{V}$, the connection matrix with an element for each connection between the units in the network, by a mask $\hat{\mathbf{M}}$, which is sampled from some probability distribution:

$\hat{\mathbf{W}} = \mathbf{V} \circ \hat{\mathbf{M}} \qquad (4)$
From this perspective, the difference between dropout and dropconnect, as well as Bernoulli and Gaussian methods, is simply the probability distribution used to generate the mask sample (Figure 1).
2.2.1 Bernoulli Dropconnect & Dropout
In Bernoulli dropconnect, each element of the mask is sampled independently, so $\hat{m}_{i,j} \sim \mathrm{Bern}(1-p)$, where $p$ is the probability of dropping a connection. In Bernoulli dropout, however, the weights are not sampled independently. Instead, one Bernoulli variable is sampled for each row of the weight matrix, so $\hat{m}_{i,j} = \hat{b}_i$ with $\hat{b}_i \sim \mathrm{Bern}(1-p)$, where $p$ is the probability of dropping a unit.
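The two Bernoulli mask structures can be sketched as follows (our illustration, in NumPy; the elementwise product follows Eq. (4)):

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_dropconnect_mask(shape, p_drop):
    # Dropconnect: every connection gets its own Bernoulli variable.
    return (rng.random(shape) >= p_drop).astype(float)

def bernoulli_dropout_mask(shape, p_drop):
    # Dropout: one Bernoulli variable per row (unit); the whole row shares it.
    b = (rng.random((shape[0], 1)) >= p_drop).astype(float)
    return np.tile(b, (1, shape[1]))

# Eq. (4): a weight sample is the elementwise product of V and a mask.
V = rng.normal(size=(4, 3))
W_dropconnect = V * bernoulli_dropconnect_mask(V.shape, 0.5)
W_dropout = V * bernoulli_dropout_mask(V.shape, 0.5)
```

The only difference between the two samplers is whether mask elements are tied within a row, which is exactly the unit-versus-connection distinction drawn above.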
2.2.2 Gaussian Dropconnect & Dropout
In Gaussian dropconnect and dropout, each weight $\hat{w}_{i,j}$ is sampled from a Gaussian distribution centered at variational parameter $v_{i,j}$. This is accomplished by sampling the multiplicative mask from a Gaussian distribution with a mean of 1 and a variance of $\sigma^2 = p/(1-p)$, which matches the mean and variance of Bernoulli dropout when training-time scaling is used [24]. In Gaussian dropconnect, each element of the mask is sampled independently, which results in $\hat{m}_{i,j} \sim \mathcal{N}(1, \sigma^2)$. In Gaussian dropout, each element in a row shares the same random variable, so $\hat{m}_{i,j} = \hat{g}_i$ with $\hat{g}_i \sim \mathcal{N}(1, \sigma^2)$. It can be shown that using Gaussian dropconnect or dropout with L2 regularization leads to optimizing a stochastic lower bound of the variational objective function (see Supplementary Material).

2.2.3 Spike-and-Slab Dropout
A spike-and-slab distribution is the normalized linear combination of a "spike" of probability mass at zero and a "slab" consisting of a Gaussian distribution. It returns 0 with probability $p_{spike}$ or a sample from the Gaussian with probability $1 - p_{spike}$. We propose concurrently using Bernoulli dropout and Gaussian dropconnect to approximate the use of a spike-and-slab variational distribution and a spike-and-slab prior by optimizing a lower bound of the variational objective function (see Supplementary Material). In this formulation, $\hat{w}_{i,j} = \hat{b}_i \left( v_{i,j} \hat{m}_{i,j} \right)$, where $\hat{b}_i \sim \mathrm{Bern}(1 - p_{drop})$ for each mask row and $\hat{m}_{i,j} \sim \mathcal{N}(1, \sigma^2)$. As in Bernoulli dropout, each row of the mask is multiplied by 0 with probability $p_{drop}$; otherwise, each element in that row is multiplied by a value independently sampled from a Gaussian distribution, as in Gaussian dropconnect. During non-sampling inference, spike-and-slab dropout uses the mean weight values and, per Bernoulli dropout, multiplies unit outputs by $1 - p_{drop}$.
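A minimal sketch of this sampling scheme (ours, not the paper's implementation; NumPy, assuming the slab variance is set from a dropconnect probability `p_dc` as $\sigma^2 = p_{dc}/(1-p_{dc})$):

```python
import numpy as np

rng = np.random.default_rng(2)

def spike_and_slab_weights(V, p_drop, p_dc, sample=True):
    """Spike-and-slab dropout: Bernoulli dropout rows ("spike") combined
    with Gaussian dropconnect elements ("slab")."""
    if not sample:
        # Non-sampling inference: mean weights, scaled by the keep probability.
        return V * (1.0 - p_drop)
    sigma = np.sqrt(p_dc / (1.0 - p_dc))          # Gaussian dropconnect std
    b = (rng.random((V.shape[0], 1)) >= p_drop)   # spike: per-row Bernoulli
    m = rng.normal(1.0, sigma, size=V.shape)      # slab: per-element Gaussian
    return b * (V * m)
```

A dropped row is zeroed as a whole, while a surviving row receives independent multiplicative Gaussian noise on each connection.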
3 Experiments
3.1 Logistic Regression
In order to visualize the effects of each variational distribution, we trained linear networks with five hidden units to classify data drawn from two 2D multivariate Gaussian distributions. Multiple linear units were used so that Bernoulli dropout would not drop out the only unit in the network. For the dropout methods, unit sampling was performed on the linear hidden layer. For the dropconnect methods, every weight was sampled. The same dropout and dropconnect probability was used for each of these networks, except for the spike-and-slab dropconnect probability, which was set separately. In Figure 2, we show the decision boundaries learned by the various networks. Higher variability in the decision boundaries corresponds to higher uncertainty. All of the MC sampling methods predict with higher uncertainty as points become further away from the training data. This is particularly true for the dropconnect and spike-and-slab methods.

Table 1: Test set classification errors on MNIST and CIFAR-10.

| Method | MNIST Mean Error (%) | MNIST Error Std. Dev. | CIFAR-10 Mean Error (%) | CIFAR-10 Error Std. Dev. |
|---|---|---|---|---|
| MAP | 0.76 | – | 25.86 | – |
| Bernoulli DropConnect | 0.56 | – | 16.46 | – |
| MC Bernoulli DropConnect | 0.56 | 0.03 | 16.59 | 0.11 |
| Gaussian DropConnect | 0.56 | – | 16.78 | – |
| MC Gaussian DropConnect | 0.58 | 0.02 | 16.65 | 0.11 |
| Bernoulli Dropout | 0.49 | – | 11.23 | – |
| MC Bernoulli Dropout | 0.48 | 0.03 | 9.95 | 0.08 |
| Gaussian Dropout | 0.42 | – | 9.07 | – |
| MC Gaussian Dropout | 0.36 | 0.04 | 9.00 | 0.10 |
| Spike-and-Slab Dropout | 0.48 | – | 10.64 | – |
| MC Spike-and-Slab Dropout | 0.46 | 0.01 | 10.05 | 0.06 |
3.2 Convolutional Neural Networks
We trained CNNs on MNIST [16] and CIFAR-10 [14]. For each dataset, a 10,000-image subset of the training set was used for validation. For MNIST, each CNN had two convolutional layers followed by a fully connected layer and a softmax layer. For CIFAR-10, each CNN had 13 convolutional layers followed by a fully connected layer and a softmax layer. (See Supplementary Material for the detailed architectures.) For the dropout networks, dropout was used after each convolutional and fully connected layer, but before the nonlinearity. For the dropconnect networks, all weights were sampled. All dropout and dropconnect probabilities were treated as network-wide hyperparameters. For L2 regularization, coefficients of 1e-5 (MNIST) and 4e-5 (CIFAR-10) were used for all weights. No data augmentation was used for MNIST. Random horizontal flipping was used during CIFAR-10 training. We evaluated the trained CNNs using the original test sets and using the test images with added random Gaussian noise of increasing variance, in order to test each network's uncertainty for regions of input space not seen in the training set.
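The noise evaluation can be sketched as follows (our illustration, in NumPy, with a toy linear softmax model in place of the trained CNNs): corrupt the inputs with zero-mean Gaussian noise of increasing standard deviation and track the entropy of the MC-averaged predictive distribution as an uncertainty measure.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_predictive_entropy(x, V, p_drop=0.5, n_samples=50):
    """Average entropy of the MC (Bernoulli dropconnect) predictive
    distribution; higher entropy indicates higher uncertainty."""
    probs = np.zeros((x.shape[0], V.shape[1]))
    for _ in range(n_samples):
        mask = rng.random(V.shape) >= p_drop
        probs += softmax(x @ (V * mask / (1.0 - p_drop)))
    probs /= n_samples
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

# Sweep the noise standard deviation, as in the evaluation protocol.
x_clean = rng.normal(size=(64, 10))
V = rng.normal(size=(10, 5))
entropies = {s: mean_predictive_entropy(
                 x_clean + rng.normal(0.0, s, size=x_clean.shape), V)
             for s in (0.0, 1.0, 4.0)}
```

In our actual experiments the same sweep is applied to the trained CNNs, with calibration measured alongside accuracy.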
While the dropout-based methods were the most accurate on the test set (Table 1), as image noise was added they became increasingly worse compared to the dropconnect-based networks (Figures 3.a and 4.a). Sampling only consistently improved the accuracy for Bernoulli and spike-and-slab dropout. However, sampling did consistently improve the calibration of the networks as the image noise was increased (Figures 3.b and 4.b). For a given accuracy across each set of noisy test images, sampling also generally led to better calibration (Figures 3.c, 4.c, and 5). (See Supplementary Material for the calibration plots.) Gaussian dropout led to the highest test set accuracy, but it also led to reduced robustness to noise. While slightly less accurate on the test set, Bernoulli dropout and spike-and-slab dropout were much more robust.
Seemingly contradictory results have been reported in the literature regarding CIFAR-10 and MC Bernoulli dropout. Gal and Ghahramani [4] found that standard Bernoulli dropout methods led to relatively inaccurate networks when dropout was used at every layer in a CNN, whereas MC sampling increased the accuracy of these networks. However, Srivastava et al. [24] found that using dropout at every layer led to increased generalization performance even without sampling at prediction time. In our CIFAR-10 experiments, but not our MNIST experiments, we found that sampling at prediction time makes networks more robust to high-variance dropout. Using lower-variance dropout results in standard and MC methods having similar accuracies, while using higher-variance distributions results in MC inference outperforming standard methods (Figure 6). (See Supplementary Material for further results.) These results indicate that Bernoulli or Gaussian dropout with MC sampling is less dependent on the exact value of $p$ and can allow higher levels of dropout regularization to be used.
4 Discussion
L2 regularization and Bernoulli dropout are widely used for regularization and routinely lead to increased test accuracy. However, the uncertainty they learn does not generalize well. In contrast, performing approximate Bayesian inference via sampling during training and testing allowed CNNs to better model their uncertainty. Dropconnect-based CNNs performed worse on the unmodified test set, but were much more robust to deviations from the training distribution. On the other hand, dropout-based networks, particularly MC Gaussian dropout, performed well on the unmodified test set, but were not as robust. Using sampling, and combining Bernoulli dropout with Gaussian dropconnect to approximate a spike-and-slab variational distribution, led to a CNN that performed better near the test set than the dropconnect methods and represented its uncertainty more robustly than the dropout methods.
5 Acknowledgments
The authors would like to thank Sergii Strelchuk, Charles Zheng, Yarin Gal, Richard Turner, and Aapo Hyvärinen for their comments on previous versions of the manuscript. This research was funded by the UK Medical Research Council (Programme MC-A060-5PR20), by a European Research Council Starting Grant (ERC-2010-StG 261352), and by the Human Brain Project (EU grant 604102 'Context-sensitive multisensory object recognition: a deep network model constrained by multi-level, multi-species data').
References
Barber and Bishop [1998] David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. NATO ASI Series F Computer and Systems Sciences, 168:215–238, 1998.
Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
Gal [2016] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
Gal and Ghahramani [2015] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Insights and applications. In Deep Learning Workshop, ICML, 2015.
Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. In 4th International Conference on Learning Representations (ICLR) workshop track, 2016.
George and McCulloch [1997] Edward I George and Robert E McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, pages 339–373, 1997.
Graves [2011] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
Hernández-Lobato and Adams [2015] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
Hinton and Van Camp [1993] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
Ishwaran and Rao [2005] Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, pages 730–773, 2005.
Jylänki et al. [2014] Pasi Jylänki, Aapo Nummenmaa, and Aki Vehtari. Expectation propagation for neural networks with sparsity-promoting priors. Journal of Machine Learning Research, 15(1):1849–1901, 2014.
Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
 Kingma et al. [2015] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1–9, 2012.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Louizos [2015] Christos Louizos. Smart regularization of deep architectures. Master’s thesis, University of Amsterdam, 2015.
MacKay [1992] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Madigan and Raftery [1994] David Madigan and Adrian E Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89(428):1535–1546, 1994.
Mitchell and Beauchamp [1988] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Neal [2012] Radford M Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Wan et al. [2013] Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 1058–1066, 2013.
Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.