1 Introduction
For decades, many different approaches have been suggested to integrate Bayesian inference and neural networks. In such Bayesian neural networks (BNNs), we have, after training, not a single set of parameters, or weights, but an (approximate) posterior distribution over those parameters. The posterior distribution, for example, enables uncertainty estimates over the network output, selection of hyperparameters and models in a principled framework, as well as guided data collection (active learning).
In general, exact Bayesian inference over the weights of a neural network is intractable, as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. For this reason, much of the research in this area has focused on approximation techniques. Most modern techniques stem from key works which used either a Laplace approximation (MacKay, 1992), variational methods (Hinton and van Camp, 1993), or Monte Carlo methods (Neal, 1995). Over the past few years, many methods for approximating the posterior distribution have been suggested, each falling into one of these categories. These methods include assumed density filtering (Lobato and Adams, 2015; Soudry et al., 2014), approximate power expectation propagation (Lobato et al., 2016), stochastic gradient Langevin dynamics (Balan et al., 2015; Ahn et al., 2012; Welling and Teh, 2011), incremental moment matching (Lee et al., 2017), and variational Bayes (Blundell et al., 2015; Graves, 2011).
The standard variational Bayes approach developed by Blundell et al. (2015), called Bayes by Backprop (BBB), has several shortcomings. The variational free energy minimized in BBB is a sum of a log-likelihood cost function and a complexity cost function. The complexity cost function acts as a regularizer, enforcing a solution that captures the complexity of the data while keeping the posterior close to the prior. Finding a good prior is usually a nontrivial task, and overly restrictive priors could potentially cause underfitting. To alleviate these issues, Kingma et al. (2015) introduced variational dropout, which uses an improper prior to ensure that the complexity cost function becomes constant in the weight parameters. Later modifications of this approach (Khan et al., 2018; Molchanov et al., 2017; Achterhold et al., 2018) were shown to be useful for weight pruning (without retraining, similarly to BBB). However, Hron et al. (2017) recently pointed out that such variational dropout approaches are not Bayesian. To avoid these issues altogether, Zeno et al. (2018) proposed an online variational Bayes scheme using a new prior for each minibatch, instead of one prior for all the data.
In this paper, we develop a novel Bayesian approach to learning for neural networks, built upon adaptive subgradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Lei, 2015). Unlike the aforementioned approaches, ours does not require the specification of a method for approximating the posterior distribution, as it relies on a new probabilistic interpretation of adaptive subgradient algorithms which shows that these can readily be used as approximate Bayesian posterior inference schemes. This has similar underpinnings to the work of Mandt et al. (2017), although the latter is based on a stochastic model for gradient variations that imposes a number of restrictions on the gradient noise covariance structure, which our framework is able to sidestep by using Adam as the underlying subgradient method. Our proposed algorithm is also similar in spirit to the work of Khan et al. (2018), but there are some important differences that we discuss in detail in Section 6.
2 Preliminaries
Notation.
Vectors are denoted by lower case Roman letters such as $a$, and all vectors are assumed to be column vectors. Upper case Roman letters, such as $M$, denote matrices, with the exception of the identity matrix, which we denote by $I$ and whose dimension is implicit from the context. Finally, for any vector $a \in \mathbb{R}^d$, $a_i$ denotes its $i$th coordinate, where $i \in \{1, \dots, d\}$.
Problem setup.
Let $f(\theta)$ be a noisy objective function: a stochastic scalar function that is differentiable w.r.t. the parameters $\theta \in \Theta$, where $\Theta$ denotes the parameter space. In general, $\Theta$ is a subset of $\mathbb{R}^d$, but for simplicity, we shall assume that $\Theta = \mathbb{R}^d$ throughout the remainder of this paper. We are interested in minimizing the expected value of this function, $\mathbb{E}[f(\theta)]$, w.r.t. its parameters $\theta$. Let $f_1(\theta), \dots, f_T(\theta)$ denote the realizations of the stochastic function at the subsequent time steps $t = 1, \dots, T$. The stochastic nature may arise from the evaluation of the function at random subsamples (minibatches) of datapoints, or from inherent function noise.
The simplest algorithm for this setting is the standard online gradient descent algorithm (Zinkevich, 2003), which moves the current estimate $\theta_t$ of $\theta$ in the opposite direction of the last observed (sub)gradient value $g_t = \nabla f_t(\theta_t)$, i.e.,
$$\theta_{t+1} = \theta_t - \alpha_t g_t, \tag{1}$$
where $\alpha_t$ is an adaptive learning rate that is typically set to $\alpha / \sqrt{t}$, for some positive constant $\alpha$. While the decreasing learning rate is required for convergence, such an aggressive decay typically translates into poor empirical performance.
Generic adaptive subgradient descent.
We now present a framework that contains a wide range of popular adaptive subgradient methods as special cases, and highlights their flaws and differences. The presentation here closely follows that of Reddi et al. (2018). The update rule of this generic class of adaptive methods can be compactly written in the form
$$\theta_{t+1} = \theta_t - \alpha_t \hat{H}_t^{-1} \hat{g}_t, \tag{2}$$
where $\hat{g}_t$ and $\hat{H}_t^{-1}$ are estimates of the (sub)gradient and inverse Hessian, respectively, of the functions $f_t$, based on observations up to and including iteration $t$. In other words, they are functions of the (sub)gradient history $g_{1:t} = (g_1, \dots, g_t)$, which we express as
$$\hat{g}_t = \tilde{g}_t(g_{1:t}), \qquad \hat{H}_t = \tilde{H}_t(g_{1:t}), \tag{3}$$
where $\tilde{g}_t$ and $\tilde{H}_t$ denote estimator functions for the (sub)gradient and Hessian of the loss function at iteration $t$, respectively. The corresponding procedure is outlined in Algorithm 1.
For computational performance, many popular algorithms restrict themselves to diagonal variants of the general method encapsulated by Algorithm 1, such that $\hat{H}_t = \mathrm{diag}(\hat{h}_t)$, where $\hat{h}_t$ is the vector of diagonal elements. We first observe that the standard online gradient descent (OGD) algorithm arises as a special case of this framework if we use:
$$\tilde{g}_t(g_{1:t}) = g_t, \qquad \tilde{H}_t(g_{1:t}) = I. \tag{OGD}$$
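The key idea of adaptive methods is to choose the estimator functions so as to obtain good convergence. For instance, AdaGrad (Duchi et al., 2011), the first adaptive method and the one that spurred much of the subsequent research, uses the following estimator functions:
$$\tilde{g}_t(g_{1:t}) = g_t, \qquad \tilde{h}_t(g_{1:t}) = \sqrt{\frac{1}{t} \sum_{i=1}^{t} g_i^2}, \tag{AdaGrad}$$
where $\tilde{h}_t$ returns the diagonal of the curvature estimate, the square and square root are applied elementwise, and $\alpha_t = \alpha / \sqrt{t}$ as in OGD. In contrast to the learning rate of $\alpha / \sqrt{t}$ in OGD with learning-rate decay, such a setting effectively implies a modest learning-rate decay of $\alpha / \sqrt{\sum_{i=1}^{t} g_{i,j}^2}$ for $j \in \{1, \dots, d\}$. When the gradients are sparse, this can potentially lead to huge gains in terms of convergence (see Duchi et al. (2011)). These gains have also been observed in practice even in some nonsparse settings.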
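The generic update of Eq. 2, restricted to diagonal curvature estimates, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the small constant `eps`, and the toy quadratic objective are all introduced here, and the full gradient history is stored purely for clarity.

```python
import numpy as np

def ogd_estimators(grads):
    """OGD: use the last gradient and an identity curvature estimate."""
    g = grads[-1]
    return g, np.ones_like(g)

def adagrad_estimators(grads, eps=1e-8):
    """AdaGrad: last gradient, root of the simple average of squared gradients.

    eps is an assumed numerical safeguard against division by zero.
    """
    g = grads[-1]
    h = np.sqrt(np.mean(np.square(grads), axis=0)) + eps
    return g, h

def adaptive_step(theta, grads, alpha, estimators):
    """One diagonal step: theta - alpha_t * g_hat / h_hat, alpha_t = alpha / sqrt(t)."""
    t = len(grads)
    g_hat, h_hat = estimators(grads)
    return theta - (alpha / np.sqrt(t)) * g_hat / h_hat

def run(estimators, steps=200, alpha=0.5):
    """Minimize the toy objective 0.5 * ||theta||^2, whose gradient is theta."""
    theta = np.array([1.0, -2.0])
    grads = []  # gradient history g_1, ..., g_t
    for _ in range(steps):
        grads.append(theta.copy())
        theta = adaptive_step(theta, grads, alpha, estimators)
    return theta

if __name__ == "__main__":
    print("OGD:    ", run(ogd_estimators))
    print("AdaGrad:", run(adagrad_estimators))
```

With the AdaGrad estimators, the factor $1/\sqrt{t}$ in the learning rate cancels against the simple average, so the effective per-coordinate step size is $\alpha$ divided by the root of the accumulated squared gradients.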
Adaptive methods based on EWMA.
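Exponentially weighted moving average (EWMA) variants of AdaGrad are popular in the deep learning community. AdaDelta (Zeiler, 2012), RMSProp (Tieleman and Hinton, 2012), Adam (Kingma and Lei, 2015) and Nadam (Dozat, 2016) are some prominent algorithms that fall in this category. The key difference between these and AdaGrad is that they use an EWMA as the function $\tilde{h}_t$ instead of a simple average. Adam, a particularly popular variant, is based on the following estimator functions:
$$\tilde{g}_t(g_{1:t}) = \frac{1 - \beta_1}{1 - \beta_1^t} \sum_{i=1}^{t} \beta_1^{t-i} g_i, \qquad \tilde{h}_t(g_{1:t}) = \sqrt{\frac{1 - \beta_2}{1 - \beta_2^t} \sum_{i=1}^{t} \beta_2^{t-i} g_i^2}, \tag{Adam}$$
where $\beta_1, \beta_2 \in [0, 1)$ are exponential decay rates. This update can alternatively be stated in terms of the following simple recursions:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad \hat{g}_t = \frac{m_t}{1 - \beta_1^t}, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \quad \hat{h}_t = \sqrt{\frac{v_t}{1 - \beta_2^t}}, \tag{4}$$
for all $t \in \{1, \dots, T\}$, with $m_{0,j} = v_{0,j} = 0$ for all $j \in \{1, \dots, d\}$. Note that the denominators $1 - \beta_1^t$ and $1 - \beta_2^t$ represent bias-correction terms. Values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are typically recommended in practice (Kingma and Lei, 2015). RMSProp, which appeared in an earlier unpublished work (Tieleman and Hinton, 2012), is essentially a variant of Adam with $\beta_1 = 0$. In practice, especially in deep-learning applications, the momentum term arising due to nonzero $\beta_1$ appears to significantly boost performance.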
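As a sanity check on the equivalence between the EWMA estimator functions and the recursive form of Adam, the following NumPy sketch computes both and compares them. The function names are hypothetical, and the curvature estimate is assumed to be the square root of the bias-corrected second-moment EWMA.

```python
import numpy as np

def adam_estimates(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected estimates (g_hat, h_hat) computed with the recursions."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g        # first-moment EWMA
        v = beta2 * v + (1.0 - beta2) * g ** 2   # second-moment EWMA
    t = len(grads)
    g_hat = m / (1.0 - beta1 ** t)               # bias correction
    h_hat = np.sqrt(v / (1.0 - beta2 ** t))      # assumed curvature estimate
    return g_hat, h_hat

def adam_estimates_closed_form(grads, beta1=0.9, beta2=0.999):
    """The same estimates written as explicit exponentially weighted sums."""
    G = np.asarray(grads, dtype=float)
    t = len(G)
    powers = np.arange(t - 1, -1, -1)            # exponents t-i for i = 1..t
    w1 = (1.0 - beta1) * beta1 ** powers
    w2 = (1.0 - beta2) * beta2 ** powers
    g_hat = (w1[:, None] * G).sum(axis=0) / (1.0 - beta1 ** t)
    h_hat = np.sqrt((w2[:, None] * G ** 2).sum(axis=0) / (1.0 - beta2 ** t))
    return g_hat, h_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = [rng.normal(size=3) for _ in range(50)]
    print(adam_estimates(history))
    print(adam_estimates_closed_form(history))
```

For a constant gradient history the bias correction is exact: the gradient estimate equals that constant and the curvature estimate equals its absolute value, regardless of how few steps have been taken.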
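More recently, Reddi et al. (2018) pointed out that the aforementioned methods can fail to converge to an optimal solution (or a critical point in nonconvex settings). They showed that one cause for such failures is the use of EWMAs and, as a result, proposed AMSGrad, a variant of Adam which relies on a long-term memory of past gradients. Specifically, the AMSGrad update rule is characterized by the following system of recursive equations:
$$m_t = \beta_{1,t} m_{t-1} + (1 - \beta_{1,t}) g_t, \quad \hat{g}_t = m_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \quad \hat{v}_t = \max(\hat{v}_{t-1}, v_t), \quad \hat{h}_t = \sqrt{\hat{v}_t}, \tag{AMSGrad}$$
for all $t \in \{1, \dots, T\}$, with $m_{0,j} = v_{0,j} = \hat{v}_{0,j} = 0$ for all $j \in \{1, \dots, d\}$, and where $\beta_{1,1}, \beta_{1,2}, \dots \in [0, 1)$ is a sequence of exponential smoothing factors.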
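The long-term memory of AMSGrad amounts to keeping a running elementwise maximum of the second-moment estimate. A minimal NumPy sketch follows; the function name `amsgrad_step`, the constant smoothing factor `beta1` (the general method allows a sequence), and the safeguard `eps` are assumptions made for illustration.

```python
import numpy as np

def amsgrad_step(theta, g, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step; state = (m, v, v_hat) holds the running estimates."""
    m, v, v_hat = state
    m = beta1 * m + (1.0 - beta1) * g           # first-moment EWMA
    v = beta2 * v + (1.0 - beta2) * g ** 2      # second-moment EWMA
    v_hat = np.maximum(v_hat, v)                # never let the estimate shrink
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat)

if __name__ == "__main__":
    theta = np.zeros(1)
    state = (np.zeros(1), np.zeros(1), np.zeros(1))
    # One large gradient followed by small ones: v decays, v_hat does not.
    for g in [10.0, 0.1, 0.1, 0.1]:
        theta, state = amsgrad_step(theta, np.array([g]), state)
        print("v =", state[1], " v_hat =", state[2])
```

Because the maximum is non-decreasing, the effective per-coordinate step size can only shrink over time, unlike in Adam where it may grow again once large gradients leave the EWMA window.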
Bayesian neural networks.
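As the name suggests, a Bayesian neural network (BNN) is a neural network equipped with a prior distribution over its weights $\theta$. Consider an i.i.d. data set of $N$ feature vectors $x_1, \dots, x_N$, with a corresponding set of outputs $\mathcal{D} = \{y_1, \dots, y_N\}$. For illustration purposes, we shall suppose that the likelihood for each datapoint is Gaussian, with an $x$-dependent mean given by the output $\mathrm{NN}(x; \theta)$ of a neural-network model and with variance $\sigma^2$:
$$p(y_n \mid x_n, \theta, \sigma^2) = \mathcal{N}\big(y_n \mid \mathrm{NN}(x_n; \theta), \sigma^2\big). \tag{5}$$
Similarly, we shall choose a prior distribution over the weights that is Gaussian of the form
$$p(\theta) = \mathcal{N}(\theta \mid \mu_0, \Sigma_0), \tag{6}$$
with prior mean $\mu_0$ and prior covariance $\Sigma_0$. Since the data set is i.i.d., the likelihood function is given by
$$p(\mathcal{D} \mid \theta, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathrm{NN}(x_n; \theta), \sigma^2\big), \tag{7}$$
and so, by virtue of Bayes' theorem, the resulting posterior distribution is then
$$p(\theta \mid \mathcal{D}, \sigma^2) \propto p(\theta)\, p(\mathcal{D} \mid \theta, \sigma^2), \tag{8}$$
which, as a consequence of the nonlinear dependence of $\mathrm{NN}(x; \theta)$ on $\theta$, will be non-Gaussian. However, we can find a Gaussian approximation by using the Laplace approximation (MacKay, 1992). Alternative approximation methods have been briefly discussed in Section 1.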
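To make the structure of this posterior concrete, the sketch below evaluates its unnormalized logarithm for a Gaussian likelihood and, as a simplifying assumption, a zero-mean isotropic Gaussian prior. The one-hidden-unit `tiny_network`, the argument names, and the prior variance `prior_var` are hypothetical choices made only for illustration.

```python
import numpy as np

def tiny_network(x, theta):
    """Hypothetical scalar network with one tanh hidden unit and 4 parameters."""
    w1, b1, w2, b2 = theta
    return w2 * np.tanh(w1 * x + b1) + b2

def log_likelihood(theta, x, y, predict, sigma2):
    """Sum of Gaussian log-densities of y around the network output."""
    resid = y - predict(x, theta)
    n = y.shape[0]
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

def log_prior(theta, prior_var):
    """Zero-mean isotropic Gaussian log-density (an assumed special case)."""
    d = theta.size
    return -0.5 * d * np.log(2.0 * np.pi * prior_var) - 0.5 * np.sum(theta ** 2) / prior_var

def log_unnormalized_posterior(theta, x, y, predict, sigma2=1.0, prior_var=1.0):
    """log prior + log likelihood, i.e. the log posterior up to a constant."""
    return log_prior(theta, prior_var) + log_likelihood(theta, x, y, predict, sigma2)

if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 5)
    y = np.sin(x)
    theta = np.array([1.0, 0.0, 1.0, 0.0])
    print(log_unnormalized_posterior(theta, x, y, tiny_network))
```

The dependence on `theta` enters through the `tanh` nonlinearity, which is why this function is not a quadratic form in the weights and the posterior is not Gaussian.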
3 Probabilistic Interpretation of Adaptive Subgradient Methods
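Our probabilistic interpretation of adaptive subgradient methods represented by Algorithm 1 is based on a second-order Taylor expansion of the loss function $f$ around the current iterate $\theta_t$:
$$f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \frac{1}{2} (\theta - \theta_t)^\top \nabla^2 f(\theta_t) (\theta - \theta_t). \tag{9}$$
Since the gradient and Hessian are unknown, we replace them with the unbiased estimates
$$\nabla f(\theta_t) \approx \hat{g}_t, \qquad \nabla^2 f(\theta_t) \approx \frac{1}{\alpha_t} \hat{H}_t, \tag{10}$$
which results in the following approximation:
$$f(\theta) \approx f(\theta_t) + \hat{g}_t^\top (\theta - \theta_t) + \frac{1}{2 \alpha_t} (\theta - \theta_t)^\top \hat{H}_t (\theta - \theta_t). \tag{11}$$
The corresponding likelihood model is given by
$$p(g_{1:t} \mid \theta) \propto \exp\Big\{ -N \Big[ \hat{g}_t^\top (\theta - \theta_t) + \frac{1}{2 \alpha_t} (\theta - \theta_t)^\top \hat{H}_t (\theta - \theta_t) \Big] \Big\}, \tag{12}$$
where the number of samples $N$ corrects the averaging over samples that is implicitly contained in the loss function $f$.
Under an improper prior over $\theta$, i.e. $p(\theta) \propto 1$, Eq. 12 coincides with the unnormalized posterior over $\theta$ given the gradient history $g_{1:t}$. Completing the square with respect to $\theta$ in the exponential yields
$$\log p(\theta \mid g_{1:t}) = -\frac{N}{2 \alpha_t} (\theta - \theta_{t+1})^\top \hat{H}_t (\theta - \theta_{t+1}) + \text{const}, \tag{13}$$
where, as a reminder, $\theta_{t+1} = \theta_t - \alpha_t \hat{H}_t^{-1} \hat{g}_t$ and “const” denotes terms that are independent of $\theta$. This leads to a Gaussian posterior of the form
$$p(\theta \mid g_{1:t}; \alpha_t, \lambda) = \mathcal{N}\Big(\theta \,\Big|\, \theta_{t+1}, \frac{\alpha_t}{N} \hat{H}_t^{-1}\Big), \tag{14}$$
where $\lambda$ is a vector of hyperparameters other than the learning rate, if any, that govern the underlying adaptive subgradient method. For example, $\lambda = \emptyset$ in the case of AdaGrad, whereas $\lambda = (\beta_1, \beta_2)$ for Adam.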
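The completing-the-square step can be checked numerically in the diagonal case. The sketch below assumes that the quadratic model has curvature equal to the curvature estimate divided by the learning rate, so that its minimizer is the next iterate; the function names and the numbers in the example are hypothetical.

```python
import numpy as np

def gaussian_posterior(theta_t, g_hat, h_hat, alpha, n_samples):
    """Mean and diagonal variance of the Gaussian posterior (diagonal case).

    Assumes the quadratic model has curvature h_hat / alpha, so the mean is
    the next iterate and the variance is alpha / (n_samples * h_hat).
    """
    mean = theta_t - alpha * g_hat / h_hat
    var = alpha / (n_samples * h_hat)
    return mean, var

def neg_log_posterior(theta, theta_t, g_hat, h_hat, alpha, n_samples):
    """Negative log of the unnormalized posterior under a flat prior."""
    delta = theta - theta_t
    return n_samples * (g_hat @ delta + 0.5 / alpha * delta @ (h_hat * delta))

if __name__ == "__main__":
    theta_t = np.array([0.5, -1.0])
    g_hat = np.array([0.2, -0.4])
    h_hat = np.array([2.0, 0.5])
    mean, var = gaussian_posterior(theta_t, g_hat, h_hat, alpha=0.1, n_samples=100)
    print("posterior mean:", mean)
    print("posterior variance:", var)
```

Directions with a large curvature estimate receive a small variance, and the variance shrinks in proportion to the number of samples, as one would expect from a posterior that concentrates with more data.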
A few comments are in order regarding Eq. 14. Even though it was easily derived, the final posterior distribution in Eq. 14 is closely related to those obtained via several other approaches. Firstly, note that the posterior mean of the weight distribution is merely the point estimator of the descent algorithm. This is also the case for most of the alternative approaches (e.g., Blundell et al. (2015); Mandt et al. (2017); Zeno et al. (2018)). The expression for the variance, and the nature of our approach in general, is closest in spirit to the work of Mandt et al. (2017). In fact, Eq. 12 is closely related to Assumption 4 in their paper. A notable difference between their paper and ours, however, is that the former relies on an Ornstein-Uhlenbeck process to describe the stochastic dynamics of the gradients. Specifically, Mandt et al. (2017) assume that the variability of the gradients can be reasonably captured by a constant covariance matrix, which is an unrealistic assumption given that, in practice, this covariance matrix evolves as one explores different regions of the energy landscape. Instead, our approach, as we shall discuss in the next section, is to use Adam’s EWMA estimates for $\hat{g}_t$ and $\hat{H}_t$. This enables us to filter out the noise arising from the stochastic nature of the gradients, while at the same time accounting for changes in these quantities over different areas of the energy landscape.
4 Practical Algorithms for Bayesian Learning of Neural Networks
Based on the insights from the previous section, we obtain the following generic algorithm for Bayesian learning of neural networks via adaptive subgradient methods.
Intuitively, we can understand Algorithm 2 as follows. Recall that $\hat{H}_t$ is an estimate of the Hessian of the loss function $f$, and thereby captures the curvature of the energy landscape. Large curvature in a given direction thus results in a small variance in that direction, while small curvature leads to large variance.
Specializing Algorithm 2 to OGD, we would get a covariance matrix proportional to the identity. To approximate the correct covariance of the weights, one needs a good estimate of the curvature. Furthermore, due to the stochastic nature of the energy surface, this curvature estimate should not be based on a single (the final) observation, but rather on a history of observations, so as to mitigate the noise in the resulting curvature estimate. Algorithms like AdaGrad and Adam do exactly that. Furthermore, since working with a full covariance matrix becomes computationally intractable for large networks, one can use approximations that diagonalize the covariance matrix. This occurs when using Algorithm 2 with the values taken by the estimator functions $\tilde{g}_t$ and $\tilde{H}_t$ in AdaGrad and Adam. Note that the online variational Bayes algorithms of Blundell et al. (2015) and Zeno et al. (2018) also employ a diagonal approximation. Additionally, the model discussed by Mandt et al. (2017) arguably bears certain similarities with Algorithm 2 when the latter is specialized to AdaGrad. For this reason, we shall focus here on discussing how to instantiate Algorithm 2 with Adam.
Adam has many appealing features when used as an optimizer for neural networks, and it is a natural choice for instantiating Algorithm 2 for two reasons:

Given that the posterior mean of our algorithm is identical to the point estimate generated by Adam, we expect it to perform well in practice, for the same reasons that Adam excels and is widely used in practice.

There is a trade-off in estimating the curvature of the landscape, and thus the covariance matrix: if we just focus on the last observation, our estimate will be too noisy; however, if we base it on the entire history – as AdaGrad does – we are implicitly assuming that it is constant throughout the landscape. Ideally, therefore, we should base our estimate on the most recent observations close to the final weight update. This is achieved by using EWMAs, as in Adam. (For the same reason, EWMAs are popular in finance, where one encounters noisy observations from nonstationary distributions.)
The specific approach whereby Algorithm 2 uses the update rules of Adam is illustrated in Algorithm 3.
Note that the denominators $1 - \beta_1^t$ and $1 - \beta_2^t$ in the updates for the first- and second-moment estimates $m_t$ and $v_t$, respectively, correct the initialization bias. These factors quickly converge to 1, and any effect on the final posterior variance quickly vanishes. Thus, in practice, we can absorb those factors into the learning rate by using $\alpha_t = \alpha \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$ in place of $\alpha$, as is done in many implementations of Adam, including that in TensorFlow.
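A minimal sketch of how Adam's running estimates can be turned into a diagonal Gaussian posterior is given below. It is not the authors' Algorithm 3: the function name `badam`, the signature of `grad_fn`, the safeguard `eps`, the explicit bias correction, and the variance formula (learning rate divided by the number of samples times the curvature estimate) are assumptions made for illustration.

```python
import numpy as np

def badam(grad_fn, theta0, n_samples, steps, alpha=1e-3,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam and return (posterior mean, diagonal posterior variance)."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    h_hat = np.ones_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)                           # stochastic (sub)gradient
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        g_hat = m / (1.0 - beta1 ** t)               # bias-corrected gradient estimate
        h_hat = np.sqrt(v / (1.0 - beta2 ** t)) + eps  # curvature estimate
        theta = theta - alpha * g_hat / h_hat        # usual Adam update
    variance = alpha / (n_samples * h_hat)           # assumed posterior variance
    return theta, variance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([1.0, -2.0])
    # Toy noisy quadratic: the second coordinate has much noisier gradients.
    noisy_grad = lambda th: (th - target) + rng.normal(scale=[0.1, 1.0])
    mean, var = badam(noisy_grad, np.zeros(2), n_samples=1000, steps=5000, alpha=0.01)
    print("posterior mean:", mean)
    print("posterior variance:", var)
```

The point estimate is exactly what plain Adam would return; the only extra work is reading off the final second-moment estimate, which is why the uncertainties come at essentially no additional cost.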
5 Proof-of-concept experiments
In this section, we evaluate the empirical performance of our Bayesian adaptive subgradient framework using the Bayes by Backprop (BBB) algorithm of Blundell et al. (2015) as the baseline. For brevity, we confine ourselves to a classification exercise on the MNIST data set. While we could have employed alternative approaches such as variational dropout (Kingma et al., 2015) or Bayesian gradient descent (Zeno et al., 2018), we did not consider the former because variational dropout cannot be used for weight pruning (the variances are simply proportional to the weight values), while the latter achieves similar performance on the MNIST classification task as the more widely used BBB, and for this reason we decided to omit it.
We remark that our goal here is not to establish the superiority of our family of algorithms over BBB. Rather, we want to establish a proof of concept that, compared to BBB, our methods are competitive, at least on the MNIST classification experiment. Our methods have the big advantage that posterior distributions, and thus uncertainties, can be extracted for “free” from the standard Adam algorithm, which is widely used in practice. Our framework thus provides an out-of-the-box tool to measure the uncertainty in the weights of a neural network, and the fact that we are able to achieve empirical results that are comparable to those obtained by considerably more complex Bayesian modelling approaches, such as BBB, is very promising. While the results reported by Blundell et al. (2015) exhibit better performance, they rely on extensive hyperparameter tuning as well as on a large number of epochs. However, with a smaller number of epochs, the BBB algorithm performs worse than our framework, whose major strengths are the lesser reliance on hyperparameters and the fact that it works better out of the box.
5.1 Description of the data
MNIST is a database of handwritten digits comprising a training and test set of 60,000 and 10,000 pixelated images, respectively, each of size 28 by 28. Each image is labeled with its corresponding digit (between 0 and 9, inclusive).
5.2 Experimental setup
In order to make the results of our framework comparable to those obtained by the BBB algorithm, we replicate the experimental setup proposed in (Blundell et al., 2015), except for the few modifications below:

we preprocess the pixels by dividing values by 255 instead of by 126 as in (Blundell et al., 2015);

there are two dropout layers after each hidden layer; while Blundell et al. (2015) use implicit regularization at this level, we apply dropout combined with regularization on the weights and biases;

while Blundell et al. (2015) base their results on 600 epochs, we only experiment with 20 and 300 epochs.
Otherwise, our experimental configuration is identical to that used by Blundell et al. (2015). Specifically, we use the same neural network architecture with 2 hidden layers made up of 1,200 units each, ReLUs as activation functions, and a softmax output layer with 10 units, one for each possible digit. The total number of parameters, i.e. weights and biases, is approximately 2.4 million. The biases were all initialized at 0, while the initial weights were randomly drawn from a zero-mean Gaussian distribution with a standard deviation of 0.1. As in Blundell et al. (2015), we used a training set of 50,000 examples and a test set of 10,000 examples.
5.3 Results
We do not place much emphasis on the convergence properties of our algorithms, as these properties are directly inherited from the underlying subgradient method which, in the case of Badam, is Adam. As the convergence of Adam is widely studied, and the algorithm is commonly used because of its good convergence properties among other things, the main goal of our experiments is to assess the quality of the confidence measure provided by Badam, which represents its competitive edge.
Figure 1 shows the distribution of posterior means over the entire network for various algorithms. The values for OGD and BBB are taken from (Blundell et al., 2015). It is seen that when compared to OGD, BBB widens the range of weight values. The same holds for the weight values obtained from Badam.
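Table 1: Classification error on MNIST after weight pruning. The number in parentheses is the number of training epochs.
| Proportion removed | # weights | BBB (600) | BBB (300) | Badam (300) | Badagrad (300) | BBB (20) | Badam (20) |
|---|---|---|---|---|---|---|---|
| 0% | 2.4m | 1.24% | 1.49% | 1.66% | 1.60% | 1.86% | 1.96% |
| 50% | 1.2m | 1.24% | 1.53% | 1.66% | 1.70% | 1.90% | 1.95% |
| 75% | 600k | 1.24% | 1.77% | 1.75% | 2.19% | 2.13% | 1.96% |
| 95% | 120k | 1.29% | 4.53% | 1.93% | 27.91% | 3.16% | 2.20% |
| 98% | 48k | 1.39% | 11.7% | 2.15% | 71.12% | 5.40% | 2.38% |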
Because the posterior means of Badam have a similar distribution to those of BBB, we need to compare the respective variances, which quantify the uncertainties in the weights. To assess the quality of the obtained uncertainties and to show that our posterior distributions are meaningful, we follow the weight-pruning experiment carried out by Blundell et al. (2015). Given a posterior mean $\mu_i$ and a standard deviation $\sigma_i$ for weight $i$, we compute the signal-to-noise ratio as $|\mu_i| / \sigma_i$. An illustration of the distribution of the signal-to-noise ratio across all weights in a network is depicted in Figure 2. To perform the pruning of weights, we sort the weights by their signal-to-noise ratio and discard the fraction of weights with the lowest values, by setting these weights equal to zero. As a baseline, we perform pruning on a model with constant variances, $\sigma_i = \sigma$ for all $i$. For this model, pruning via the signal-to-noise ratio is equivalent to pruning via the absolute value of the weights.
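The pruning procedure described above can be sketched as follows. The function name `prune_by_snr` and the rounding of the number of pruned weights are assumptions made for illustration; the posterior means and standard deviations are taken as given arrays.

```python
import numpy as np

def prune_by_snr(mu, sigma, fraction):
    """Zero out the given fraction of weights with the lowest |mu| / sigma."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    snr = np.abs(mu) / sigma                      # signal-to-noise ratio per weight
    n_prune = int(round(fraction * mu.size))      # number of weights to discard
    pruned = mu.copy()
    if n_prune > 0:
        lowest = np.argsort(snr, axis=None)[:n_prune]  # indices into the flattened array
        pruned.flat[lowest] = 0.0
    return pruned

if __name__ == "__main__":
    mu = np.array([1.0, -0.2, 3.0, 0.5])
    sigma = np.array([2.0, 0.1, 1.0, 5.0])
    print(prune_by_snr(mu, sigma, 0.5))              # uses the posterior uncertainties
    print(prune_by_snr(mu, np.ones_like(mu), 0.5))   # constant-variance baseline
```

With constant standard deviations the ranking reduces to the absolute value of the weights, which is the baseline mentioned in the text; with weight-specific uncertainties, a small but well-determined weight can survive while a larger but poorly determined one is removed.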
Figure 3 exhibits the test accuracy as a function of the pruning percentage for various models. The hyperparameter configurations we utilized to obtain these results are outlined in Table 2. It is worth noting that, when increasing the number of epochs in Badam from 20 to 300, we considerably lower the learning rate while simultaneously altering $\beta_2$, which governs the horizon of the EWMA estimate of the curvature. Since we are using a much smaller learning rate, and are thus sampling from a smaller region, it is instructive to increase the horizon by increasing $\beta_2$. Our implementations of BBB for 20 and 300 epochs perform less well and are arguably noisier than their Badam counterparts. In our implementation of BBB, we used only one Monte Carlo sample, while Blundell et al. (2015) considered either 1, 2, 5 or 10 samples. Increasing the number of samples helps improve the estimate and makes it more robust, but it also increases the computational overhead, which is why we chose a single sample for a fairer comparison against Badam. This arguably favors Badam, which does well out of the box, whereas BBB requires more sampling. The bottom line of Figure 3 is that the Badam instances of our framework based on 20 and 300 epochs both beat the BBB algorithm with 20 epochs in terms of robustness to weight pruning. As detailed below, this suggests that, relative to BBB, Badam provides a more accurate measure of the uncertainty about neural-network weights.
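Table 2: Hyperparameter configurations. The number in parentheses is the number of training epochs.
| Hyperparameter | Badagrad (300) | Badam (300) | Badam (20) | BBB (300) | BBB (20) |
|---|---|---|---|---|---|
| Learning rate | 0.01 | 0.0001 | 0.0001 | 0.001 | 0.01 |
| Regularization parameter | 0.002 | 0.001 | 0.001 | n/a | n/a |
| Dropout | 0.6 | 0.6 | 0.5 | n/a | n/a |
| $\beta_2$ | n/a | 0.9999 | 0.999 | n/a | n/a |
| Minibatch size | 128 | 128 | 128 | 128 | 128 |
| # samples | n/a | n/a | n/a | 1 | 1 |
|  | n/a | n/a | n/a | 0.25 | 0.25 |
|  | n/a | n/a | n/a | 1 | 1 |
|  | n/a | n/a | n/a | 0.75 | 0.75 |
|  | n/a | n/a | n/a | 0.1 | 0.1 |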
We end this section by emphasizing that the key underlying principle of the Bayesian treatment we propose in this paper is to provide a measure of uncertainty over a neural network’s weight parameters, and not just a better or faster (in convergence terms) point estimate thereof. The quality of this uncertainty measure can be assessed by pruning the weights: the more robust the classification error of a Bayesian algorithm for learning neural networks is to weight pruning, the better the quality of the uncertainty embedded in the corresponding (approximate) weight posterior distribution. This remark is especially valid for larger pruning proportions. Obtaining a better error without performing any pruning (i.e., for a pruning percentage equal to 0%) just means having done a better job at tuning the network architecture and hyperparameters. On the other hand, an algorithm achieving a smaller error at higher pruning rates – even if its corresponding error rate at 0% pruning is less attractive – comes with genuinely desirable uncertainty estimates. This is clearly illustrated in Table 1 for the Badam variants of our algorithm with 20 and 300 epochs. We can clearly discern the robustness of Badam to weight pruning, compared to BBB, which suffers from abrupt jumps in its error rate as we increase the pruning percentage.
6 Conclusion
In this paper, we introduced a novel approach to Bayesian learning for neural networks, derived from a new probabilistic interpretation of adaptive subgradient algorithms. In particular, we discussed how to refine this framework to Adam and AdaGrad, calling the resulting Bayesian neural networks Badam and Badagrad, respectively. Finally, we demonstrated the competitive empirical performance of Badam on MNIST classification, employing the variational Bayes approach of Blundell et al. (2015) as a benchmark.
While completing this work, we became aware of the Vadam algorithm (Khan et al., 2018). The latter applies weight perturbations to the Adam method in order to arrive at an approximate posterior distribution over a neural network’s weights. Although the two are close in spirit, there are some important differences between Vadam and our Badam algorithm, as outlined below:

Badam, unlike Vadam, does not use weight perturbations. This is nontrivial as Vadam without weight perturbations boils down to the deterministic Adam method without uncertainties. In fact, the underlying derivation is entirely different: while Vadam uses uncertainties from variational Bayes which results in weight perturbations, Badam uses a geometrical derivation where uncertainties are related to the curvature of the loss function. This results in conceptually different and arguably more intuitive theoretical underpinnings;

Vadam relies on the inverse of the diagonal Fisher information matrix as an approximation to the inverse Hessian, whereas Badam uses the square root of the inverse of the diagonal Fisher information matrix approximation, i.e., $1/\sqrt{v_t}$, where $v_t$ denotes Adam's second-moment estimate. As already pointed out in the context of Adam (Kingma and Lei, 2015), this preconditioner (which is also used in AdaGrad) adapts to the geometry of the data and is more conservative in its adaptation than the inverse of the diagonal Fisher information matrix. In fact, using $1/v_t$ in place of $1/\sqrt{v_t}$ does not result in good uncertainties, as is manifested when using them for weight pruning.
It is worth emphasizing that the above discussion suggests that one might improve Vadam by using the square root of the inverse of the diagonal Fisher information matrix approximation, as is done here.
We hope that Badam will find usage as a practical off-the-shelf Bayesian adaptive subgradient algorithm, providing posterior distributions with highly accurate confidence measures over neural-network parameters.
References
 Achterhold et al. (2018) Jan Achterhold, Jan M. Kohler, Anke Schmeink, and Tim Genewein. Variational Network Quantization. In 6th International Conference on Learning Representations, 2018.
 Ahn et al. (2012) Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proceedings of the 29th International Conference on Machine Learning, 2012.
 Balan et al. (2015) Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian Dark Knowledge. In Advances in Neural Information Processing Systems 28, 2015.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

 Dozat (2016) Timothy Dozat. Incorporating Nesterov Momentum into Adam. In 4th International Conference on Learning Representations – Workshop Track, 2016.
 Duchi et al. (2011) John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 Graves (2011) Alex Graves. Practical Variational Inference for Neural Networks. In Advances in Neural Information Processing Systems 24, 2011.

 Hinton and van Camp (1993) Geoffrey E. Hinton and Drew van Camp. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, 1993.
 Hron et al. (2017) Jiri Hron, Alexander G. de G. Matthews, and Zoubin Ghahramani. Variational Gaussian Dropout is not Bayesian. arXiv:1711.02989 [stat.ML], 2017.
 Khan et al. (2018) Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Kingma and Lei (2015) Diederik P. Kingma and Jimmy B. Lei. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
 Kingma et al. (2015) Diederik P. Kingma, Tim Salimans, and Max Welling. Variational Dropout and the Local Reparameterization Trick. CoRR, abs/1506.02557, 2015.
 Lee et al. (2017) Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming Catastrophic Forgetting by Incremental Moment Matching. In Advances in Neural Information Processing Systems 30, 2017.

 Lobato and Adams (2015) José Miguel Hernández Lobato and Ryan P. Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 1861–1869, 2015.
 Lobato et al. (2016) José Miguel Hernández Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hernández-Lobato, and Richard Turner. Black-Box Alpha Divergence Minimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 1511–1520, 2016.
 MacKay (1992) David J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4:448–472, 1992.
 Mandt et al. (2017) Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. Journal of Machine Learning Research, 18:1–35, 2017.
 Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Neal (1995) Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
 Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. In 6th International Conference on Learning Representations, 2018.
 Soudry et al. (2014) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights. In Advances in Neural Information Processing Systems 27, pages 963–971, 2014.
 Tieleman and Hinton (2012) Tijman Tieleman and Geoffrey Hinton. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Welling and Teh (2011) Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011.
 Zeiler (2012) Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701, 2012.
 Zeno et al. (2018) Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Bayesian Gradient Descent: Online Variational Bayes Learning with Increased Robustness to Catastrophic Forgetting and Weight Pruning. arXiv:1803.10123 [stat.ML], 2018.
 Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.