I Introduction
Modern machine-learning methods, such as deep learning, can produce accurate predictions, which has led to their enormous recent success in fields such as computer vision, speech recognition, and recommendation systems. However, accuracy alone is not enough in fields such as robotics and medical diagnostics, where we also require an accurate estimate of the confidence or uncertainty in the predictions. Bayesian inference provides such uncertainty measures through the posterior distribution obtained using Bayes' rule. Unfortunately, this computation requires integrating over all possible values of the model parameters, which is infeasible for large, complex models such as Bayesian neural networks.
Sampling methods such as Markov chain Monte Carlo usually converge slowly when applied to such large problems. In contrast, approximate Bayesian methods such as variational inference (VI) can scale to large problems by obtaining approximations to the posterior distribution with an optimization method, e.g., stochastic-gradient descent (SGD) [6, 16, 4]. These methods can provide reasonable approximations very quickly.
An issue with SGD is that it ignores the information geometry of the posterior approximation (see Figure 1(a)). Recent approaches address this issue by using stochastic natural-gradient descent methods, which exploit the Riemannian geometry of exponential-family approximations to improve the rate of convergence [8, 9, 7]. Unfortunately, these approaches only apply to a restricted class of models known as conditionally-conjugate models, and do not work for nonconjugate models such as Bayesian neural networks.
This paper discusses some recent methods that generalize the use of natural gradients to such large and complex nonconjugate models. We show that, for exponential-family approximations, a duality between their natural and expectation parameter spaces enables a simple natural-gradient update. The resulting updates are equivalent to a recently proposed method called Conjugate-computation Variational Inference (CVI) [10]. An attractive feature of the method is that it naturally obtains local exponential-family approximations for individual model components. We discuss the application of CVI to Bayesian neural networks and show results from recent work [11] demonstrating faster convergence of natural-gradient VI methods compared to gradient-based VI methods (see Figure 1(b)).

II Problem Formulation
In this section, we discuss the problem of variational inference and show how SGD can be used to solve it. SGD ignores the geometry of the posterior approximations, and we discuss how natural-gradient methods address this issue. We end the section by mentioning issues with existing natural-gradient methods for variational inference.
II-A Variational Inference (VI)
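We consider models that take the following form (the methods discussed in this paper apply to a more general class of models, e.g., the model class discussed in [10], but for clarity of presentation we focus on a restricted class):
$$p(\mathcal{D}, \mathbf{z}) = \prod_{i=1}^{N} p(\mathcal{D}_i \mid \mathbf{z})\, p(\mathbf{z}), \tag{1}$$
where $p(\mathcal{D}_i \mid \mathbf{z})$ is a likelihood function which relates the model parameters $\mathbf{z}$ to the $i$'th data example $\mathcal{D}_i$, and $p(\mathbf{z})$ is the prior distribution, which we assume to be an exponential-family distribution [21],
$$p(\mathbf{z}) = h(\mathbf{z}) \exp\left[ \langle \phi(\mathbf{z}), \eta \rangle - A(\eta) \right], \tag{2}$$
where $\phi(\mathbf{z})$ is a vector of sufficient statistics, $\eta$ is the natural-parameter vector, and $A(\eta)$ is the log-partition function. The model parameter $\mathbf{z}$ is a random vector here and is sometimes referred to as the latent vector.
Example: Consider Bayesian neural networks (BNN) [3] to model data that contain an input $\mathbf{x}_i$ and a scalar output $y_i$. The vector $\mathbf{z}$ is the vector of network weights. The likelihood $p(y_i \mid f_{\mathbf{z}}(\mathbf{x}_i))$ could be an exponential-family distribution whose parameter is a neural network $f_{\mathbf{z}}(\mathbf{x}_i)$ parameterized by $\mathbf{z}$. We assume an isotropic Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}/\delta)$, where $\delta$ is a scalar. Its natural parameters are $\eta = \{\mathbf{0}, -\delta \mathbf{I}/2\}$. ∎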
For such models, Bayesian approaches can estimate a measure of uncertainty by using the posterior distribution: $p(\mathbf{z} \mid \mathcal{D}) = p(\mathcal{D}, \mathbf{z}) / p(\mathcal{D})$. This requires computation of the normalization constant $p(\mathcal{D}) = \int p(\mathcal{D}, \mathbf{z})\, d\mathbf{z}$, which unfortunately is difficult to compute in models such as Bayesian neural networks. One source of difficulty is that the likelihood does not take the same form as the prior with respect to $\mathbf{z}$, or, in other words, the model is nonconjugate [5]. As a result, the product $\prod_{i} p(\mathcal{D}_i \mid \mathbf{z})\, p(\mathbf{z})$ does not take a form with which $p(\mathcal{D})$ can be easily computed.
Variational inference (VI) simplifies the problem by approximating $p(\mathbf{z} \mid \mathcal{D})$ with a distribution $q(\mathbf{z})$ whose normalizing constant is relatively easier to compute. In models of the form (1), a straightforward choice is to choose $q$ to be of the same parametric form as the prior but with a different natural-parameter vector $\lambda$, i.e., $q(\mathbf{z}) = h(\mathbf{z}) \exp\left[ \langle \phi(\mathbf{z}), \lambda \rangle - A(\lambda) \right]$. (This restriction may not lead to a suboptimal approximation, e.g., in mean-field approximations of conjugate exponential-family models, the optimal form according to the variational objective turns out to be an exponential-family approximation [3].) The parameter $\lambda$ can be obtained by maximizing the variational objective $\mathcal{L}(\lambda)$, which is also a lower bound to $\log p(\mathcal{D})$ [3],
$$\max_{\lambda \in \Omega} \; \mathcal{L}(\lambda) := \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right] + \sum_{i=1}^{N} \mathbb{E}_{q}\left[ \log p(\mathcal{D}_i \mid \mathbf{z}) \right], \tag{3}$$
where $\Omega$ is the set of valid variational parameters. Intuitively, the first term favors a $q(\mathbf{z})$ which is close to the prior, while the second term favors those that obtain high expected log-likelihood values. The variational objective has a familiar form, similar to many other regularized optimization problems in machine learning [3].
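Example: In the BNN example, we can choose $q(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mu, \Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance. The natural-parameter vector is $\lambda = \{\Sigma^{-1}\mu, -\Sigma^{-1}/2\}$, and our goal in VI is to maximize $\mathcal{L}(\lambda)$ with respect to these parameters. ∎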
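The Gaussian parameterization in the example can be made concrete with a small sketch. The helper names below (`gaussian_to_natural`, `natural_to_gaussian`) are hypothetical and only illustrate the standard mapping between the mean and covariance of a Gaussian and its natural parameters $(\Sigma^{-1}\mu, -\Sigma^{-1}/2)$; the numerical values are arbitrary.

```python
import numpy as np

def gaussian_to_natural(mu, Sigma):
    """Map (mean, covariance) to natural parameters (Sigma^{-1} mu, -Sigma^{-1}/2)."""
    P = np.linalg.inv(Sigma)          # precision matrix
    return P @ mu, -0.5 * P

def natural_to_gaussian(lam1, lam2):
    """Inverse map: recover (mean, covariance) from the natural parameters."""
    Sigma = np.linalg.inv(-2.0 * lam2)
    return Sigma @ lam1, Sigma

# Arbitrary illustrative values for q(z) = N(mu, Sigma).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
lam1, lam2 = gaussian_to_natural(mu, Sigma)
mu_back, Sigma_back = natural_to_gaussian(lam1, lam2)

# Isotropic prior N(0, I/delta): natural parameters are (0, -delta I / 2).
delta = 2.0
eta1, eta2 = gaussian_to_natural(np.zeros(2), np.eye(2) / delta)
```

The round trip recovers the original mean and covariance, and the prior's second natural parameter is $-\delta \mathbf{I}/2$, matching the BNN example.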
II-B VI with Gradient Descent
A straightforward approach to maximize $\mathcal{L}(\lambda)$ is to use a gradient-based method, e.g., the following stochastic-gradient descent (SGD) algorithm:
$$\lambda_{t+1} = \lambda_t + \rho_t \hat{\nabla}_\lambda \mathcal{L}(\lambda_t), \tag{4}$$
where $t$ is the iteration number, $\rho_t$ is a step size, and $\hat{\nabla}_\lambda \mathcal{L}(\lambda_t)$ is a stochastic estimate of the derivative of $\mathcal{L}$ at $\lambda_t$ (the 'hat' here indicates a stochastic estimate). Such stochastic gradients can be easily computed using methods such as REINFORCE [22] and the reparameterization trick [13, 18]. This results in a simple but powerful approach which applies to many models and scales to large data.
Despite this, a direct application of SGD to optimize $\mathcal{L}$ is problematic because SGD ignores the information geometry of the distribution $q$. To see this, we can rewrite (4) as
$$\lambda_{t+1} = \arg\max_{\lambda} \; \langle \lambda, \hat{\nabla}_\lambda \mathcal{L}(\lambda_t) \rangle - \frac{1}{2\rho_t} \| \lambda - \lambda_t \|_2^2. \tag{5}$$
Equivalence can be established by taking the derivative with respect to $\lambda$ and setting it to 0. Equation (5) implies that SGD moves in the direction of the gradient while remaining close, in terms of the Euclidean distance, to the previous $\lambda_t$. However, the Euclidean distance between natural parameters is not appropriate because $\lambda$ is the parameter of a distribution, and the Euclidean distance is often a poor measure of dissimilarity between distributions. This is illustrated in Figure 1(a). A more informative measure such as the Kullback-Leibler (KL) divergence, which directly measures the dissimilarity between distributions, might be more appropriate.
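The mismatch between Euclidean distance and KL divergence can be seen in a small numerical sketch. The two pairs of univariate Gaussians below are chosen by us purely for illustration (they are not taken from Figure 1(a)): both pairs are exactly one unit apart in natural-parameter space, yet their KL divergences differ by four orders of magnitude.

```python
import numpy as np

def natural_1d(mu, sigma):
    """Natural parameters (mu/sigma^2, -1/(2 sigma^2)) of N(mu, sigma^2)."""
    return np.array([mu / sigma**2, -0.5 / sigma**2])

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Pair A: two narrow Gaussians (sigma = 0.1) whose means differ by 0.01.
dist_a = np.linalg.norm(natural_1d(0.0, 0.1) - natural_1d(0.01, 0.1))
kl_a = kl_gauss(0.0, 0.1, 0.01, 0.1)

# Pair B: two wide Gaussians (sigma = 10) whose means differ by 100.
dist_b = np.linalg.norm(natural_1d(0.0, 10.0) - natural_1d(100.0, 10.0))
kl_b = kl_gauss(0.0, 10.0, 100.0, 10.0)

print("pair A: Euclidean", dist_a, "KL", kl_a)
print("pair B: Euclidean", dist_b, "KL", kl_b)
```

Both Euclidean distances equal 1, while the KL divergences are about 0.005 and 50 respectively, so a step of fixed Euclidean length in $\lambda$ can correspond to a negligible or a drastic change in the distribution.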
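II-C VI with Natural-Gradient Descent
The issue discussed above can be addressed by using natural-gradient methods that exploit the information geometry of $q$ [1]. An exponential-family distribution induces a Riemannian manifold with a metric defined by the Fisher information matrix (FIM) [2]. For example, in the natural parameterization the FIM can be obtained as follows:
$$\mathbf{F}(\lambda) := \mathbb{E}_{q}\left[ \nabla_\lambda \log q(\mathbf{z}) \, \nabla_\lambda \log q(\mathbf{z})^\top \right]. \tag{6}$$
Natural-gradient descent modifies the SGD step (5) by using the Riemannian metric instead of the Euclidean distance,
$$\lambda_{t+1} = \arg\max_{\lambda} \; \langle \lambda, \hat{\nabla}_\lambda \mathcal{L}(\lambda_t) \rangle - \frac{1}{2\beta_t} (\lambda - \lambda_t)^\top \mathbf{F}(\lambda_t) (\lambda - \lambda_t), \tag{7}$$
where $\beta_t$ is a scalar step size. This results in an update similar to the SGD update shown in (4),
$$\lambda_{t+1} = \lambda_t + \beta_t \, \mathbf{F}(\lambda_t)^{-1} \hat{\nabla}_\lambda \mathcal{L}(\lambda_t), \tag{8}$$
where the stochastic gradient is scaled by the inverse of the FIM. The scaled stochastic gradient is referred to as the stochastic natural gradient, defined as follows:
$$\tilde{\nabla}_\lambda \mathcal{L}(\lambda_t) := \mathbf{F}(\lambda_t)^{-1} \hat{\nabla}_\lambda \mathcal{L}(\lambda_t). \tag{9}$$
We use the notation $\tilde{\nabla}$ to distinguish the natural gradient from the standard gradient in Euclidean space, denoted by $\nabla$. In practice, the scaling, in a similar spirit to Newton's method, improves convergence and also simplifies step-size tuning.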
Natural gradients are also naturally suited for VI in a certain class of models. The work in [8] shows that, for conjugate exponential-family models, natural gradients with respect to the natural parameterization take a very simple form. For example, consider the first term in (3), which consists of the ratio of two terms that are conjugate to each other. The natural gradient is then equal to the difference between the natural parameters of the two terms (see Eq. 41 in [10] for more details):
$$\tilde{\nabla}_\lambda \mathbb{E}_{q}\left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z})} \right] = \eta - \lambda. \tag{10}$$
The above natural gradient does not require computation of the FIM, which is surprising. It is natural to ask whether a similar expression is possible when the model contains nonconjugate terms. We show that it is possible to do so if we perform natural-gradient descent in the natural-parameter space, but not if we do it in the space of expectation parameters.
III Natural Gradients with Exponential Family
In this section, we show that the natural gradient with respect to the natural parameters can be obtained by computing the gradient with respect to the expectation parameters. In the next section, we will show that this enables a simple natural-gradient update which does not require an explicit inversion of the FIM.
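We start by defining the expectation parameter (sometimes also called the mean or moment parameter) of an exponential-family distribution as follows: $\mathbf{m}(\lambda) := \mathbb{E}_{q}[\phi(\mathbf{z})]$, where we have expressed $\mathbf{m}$ as a function of $\lambda$. Alternatively, $\mathbf{m}$ can be obtained from the natural parameters by simply differentiating the log-partition function, i.e., $\mathbf{m} = \nabla_\lambda A(\lambda)$. The mapping $\lambda \mapsto \mathbf{m}$ is one-to-one and onto (a bijection) iff the representation is minimal. Therefore, we can express $\mathcal{L}(\lambda)$ in terms of $\mathbf{m}$. We denote the new objective by $\mathcal{L}_*(\mathbf{m}) := \mathcal{L}(\lambda)$. We can now state our claim.
Theorem 1. For an exponential family in the minimal representation, the natural gradient with respect to $\lambda$ is equal to the gradient with respect to $\mathbf{m}$, and vice versa, i.e.,
$$\tilde{\nabla}_\lambda \mathcal{L}(\lambda) = \nabla_{\mathbf{m}} \mathcal{L}_*(\mathbf{m}), \qquad \tilde{\nabla}_{\mathbf{m}} \mathcal{L}_*(\mathbf{m}) = \nabla_\lambda \mathcal{L}(\lambda). \tag{11}$$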
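Proof: Using the chain rule, we can rewrite the derivative with respect to $\lambda$ in terms of $\mathbf{m}$:
$$\nabla_\lambda \mathcal{L}(\lambda) = \left[ \nabla_\lambda \mathbf{m}^\top \right] \nabla_{\mathbf{m}} \mathcal{L}_*(\mathbf{m}) = \left[ \nabla^2_{\lambda\lambda} A(\lambda) \right] \nabla_{\mathbf{m}} \mathcal{L}_*(\mathbf{m}). \tag{12}$$
It is well known that the second derivative of $A(\lambda)$ is equal to the FIM for exponential-family distributions, i.e., $\mathbf{F}(\lambda) = \nabla^2_{\lambda\lambda} A(\lambda)$ [15]. This matrix is invertible when the representation is minimal. Therefore, multiplying the above equation by the inverse of $\mathbf{F}(\lambda)$ gives us the first equality. Since the FIM with respect to $\mathbf{m}$ is the inverse of the FIM with respect to $\lambda$ [15], the second equality is immediate. ∎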
This result is a consequence of a relationship between $\lambda$ and $\mathbf{m}$. The two vectors are related through the Legendre transform, which is the following transformation: $\mathbf{m} = \nabla_\lambda A(\lambda)$. Since $A(\lambda)$ is a convex function, the spaces of $\lambda$ and $\mathbf{m}$ are both Riemannian manifolds which are also duals of each other (in information geometry, this is known as the dually-flat Riemannian structure [2]). An attractive property of this structure is that the FIM in one space is the inverse of the FIM in the other space. This enables us to compute the natural gradient in one space using the gradient in the other, as shown in (11). This result is also discussed in an earlier work by Hensman et al. [7] in the context of conjugate models, although they do not explicitly mention the connection to duality.
The natural gradient $\tilde{\nabla}_\lambda$ is the better choice for conjugate models because $\tilde{\nabla}_\lambda \mathcal{L}$ assumes a simple form which does not require computation of the FIM. The natural gradient $\tilde{\nabla}_{\mathbf{m}}$ unfortunately does not have this property. For example, for the term in (10), $\tilde{\nabla}_{\mathbf{m}}$ requires computation of the FIM because it is equal to $\mathbf{F}(\lambda)(\eta - \lambda)$. This can be shown by using (11), (9) and (10).
The recent work [10] proposes to use the gradients with respect to $\mathbf{m}$ to perform natural-gradient descent with respect to $\lambda$. They arrive at this conclusion by using the equivalence of mirror descent and natural-gradient descent. Our discussion above complements their work by using the duality of the two spaces.
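Theorem 1 and the conjugate result (10) can be checked numerically on a univariate Gaussian. The following is a minimal sketch under our own assumptions: the objective is the first term of (3), $\mathbb{E}_q[\log p(\mathbf{z})/q(\mathbf{z})] = -\mathrm{KL}(q \,\|\, p)$, the gradients are central finite differences, and the FIM is the analytic Hessian of the Gaussian log-partition function. All function names are hypothetical.

```python
import numpy as np

def to_gauss(lam):
    """Natural parameters (lam1, lam2) -> (mean, variance) of a 1-D Gaussian."""
    var = -0.5 / lam[1]
    return lam[0] * var, var

def neg_kl(lam, eta):
    """E_q[log p(z)/q(z)] = -KL(q || p) for Gaussians with natural params lam, eta."""
    mq, vq = to_gauss(lam)
    mp, vp = to_gauss(eta)
    return -0.5 * (np.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)

def m_to_lam(m):
    """Expectation parameters (E[z], E[z^2]) -> natural parameters."""
    var = m[1] - m[0] ** 2
    return np.array([m[0] / var, -0.5 / var])

def fisher(lam):
    """FIM in natural parameters: Hessian of A(lam) = -lam1^2/(4 lam2) - log(-2 lam2)/2."""
    l1, l2 = lam
    off = 0.5 * l1 / l2**2
    return np.array([[-0.5 / l2, off],
                     [off, 0.5 / l2**2 - 0.5 * l1**2 / l2**3]])

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

eta = np.array([0.0, -0.5])     # prior p = N(0, 1)
lam = np.array([1.5, -2.0])     # q = N(0.375, 0.25)
mu, var = to_gauss(lam)
m = np.array([mu, var + mu**2])  # expectation parameters of q

grad_lam = num_grad(lambda l: neg_kl(l, eta), lam)
nat_grad = np.linalg.solve(fisher(lam), grad_lam)            # F^{-1} grad, as in (9)
grad_m = num_grad(lambda mm: neg_kl(m_to_lam(mm), eta), m)   # plain gradient in m-space
```

Up to finite-difference error, `nat_grad` and `grad_m` coincide, and both equal `eta - lam`, so the natural gradient is obtained here without ever inverting the FIM.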
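IV Natural Gradients for Nonconjugate Models
In this section, we show that in some cases the natural gradient of the nonconjugate term can be easily computed by using $\nabla_{\mathbf{m}}$. We also show that the resulting update takes a simple form.
We start with the expression for $\tilde{\nabla}_\lambda \mathcal{L}(\lambda)$. Using (11) and (10), it is straightforward to write this expression:
$$\tilde{\nabla}_\lambda \mathcal{L}(\lambda) = \eta - \lambda + \sum_{i=1}^{N} \nabla_{\mathbf{m}} \mathbb{E}_{q}\left[ \log p(\mathcal{D}_i \mid \mathbf{z}) \right], \tag{13}$$
where we have expressed $\mathbb{E}_{q}[\log p(\mathcal{D}_i \mid \mathbf{z})]$ as a function of $\mathbf{m}$. For notational convenience, we will denote the $i$'th expectation inside the summation by $f_i(\mathbf{m}) := \mathbb{E}_{q}[\log p(\mathcal{D}_i \mid \mathbf{z})]$.
A stochastic natural-gradient descent update can be obtained by using the gradient of a randomly sampled data example $i$ and multiplying it by $N$, as shown below:
$$\lambda_{t+1} = (1 - \beta_t)\, \lambda_t + \beta_t \left[ \eta + N \hat{\nabla}_{\mathbf{m}} f_i(\mathbf{m}_t) \right], \tag{14}$$
where the gradient is multiplied by $N$ to obtain an unbiased stochastic gradient. This update is equivalent to the update obtained in [10], where it is referred to as Conjugate-computation Variational Inference (CVI). In [10], this is derived using a mirror-descent formulation, while we use the duality of the exponential family (Theorem 1).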
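As a sanity check of the natural-gradient update, the sketch below applies it to Bayesian linear regression, a model we pick only because its exact posterior is known (it is conjugate, so it is not one of the nonconjugate models this section targets). Two assumptions are ours: the sampled term $N \hat{\nabla}_{\mathbf{m}} f_i$ is replaced by the full sum over data examples from (13), and the noise variance is fixed to 1, so that the gradients with respect to the expectation parameters are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, delta = 50, 3, 1.0
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)   # unit noise variance

# Prior N(0, I/delta) in natural parameters (Sigma^{-1} mu, -Sigma^{-1}/2).
eta1, eta2 = np.zeros(D), -0.5 * delta * np.eye(D)

# For log p(y_i | z) = -0.5 (y_i - x_i^T z)^2 + const, with f_i = E_q[log p(y_i | z)]:
#   grad of f_i w.r.t. E[z]     = x_i y_i
#   grad of f_i w.r.t. E[z z^T] = -0.5 x_i x_i^T
grad_m1 = X.T @ y               # summed over all data examples
grad_m2 = -0.5 * (X.T @ X)

lam1, lam2 = eta1.copy(), eta2.copy()   # initialise q at the prior
beta = 0.5
for t in range(60):
    # Full-batch natural-gradient step: lam <- (1 - beta) lam + beta (eta + sum_i grad_m f_i)
    lam1 = (1 - beta) * lam1 + beta * (eta1 + grad_m1)
    lam2 = (1 - beta) * lam2 + beta * (eta2 + grad_m2)

Sigma = np.linalg.inv(-2.0 * lam2)
mu = Sigma @ lam1

# Exact posterior for comparison.
Sigma_exact = np.linalg.inv(delta * np.eye(D) + X.T @ X)
mu_exact = Sigma_exact @ (X.T @ y)
```

The iterates converge geometrically to the exact posterior, and a single step with `beta = 1` would reach it immediately, which illustrates why the update reduces to simple additions of natural parameters.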
Unlike the SGD update, the natural-gradient update (14) only computes gradients of the nonconjugate terms, and therefore requires less computation. We now give an example which shows that $\nabla_{\mathbf{m}} f_i(\mathbf{m})$ assumes a simple form and can be computed easily using automatic-gradient methods.
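Example: For the BNN example, $\nabla_{\mathbf{m}} f_i(\mathbf{m})$ can be obtained by using backpropagated gradients $\mathbf{g}_i(\mathbf{z}) := \nabla_{\mathbf{z}} \log p(\mathcal{D}_i \mid \mathbf{z})$ and Hessians $\mathbf{H}_i(\mathbf{z}) := \nabla^2_{\mathbf{z}\mathbf{z}} \log p(\mathcal{D}_i \mid \mathbf{z})$. For a Gaussian $q(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mu, \Sigma)$, there are two expectation parameters, $\mathbf{m}^{(1)} = \mu$ and $\mathbf{m}^{(2)} = \Sigma + \mu\mu^\top$, and two natural parameters, $\lambda^{(1)} = \Sigma^{-1}\mu$ and $\lambda^{(2)} = -\Sigma^{-1}/2$. As shown in [11], we can write the gradients as follows:
$$\nabla_{\mathbf{m}^{(1)}} f_i(\mathbf{m}) = \mathbb{E}_{q}[\mathbf{g}_i(\mathbf{z})] - \mathbb{E}_{q}[\mathbf{H}_i(\mathbf{z})]\, \mu, \qquad \nabla_{\mathbf{m}^{(2)}} f_i(\mathbf{m}) = \tfrac{1}{2} \mathbb{E}_{q}[\mathbf{H}_i(\mathbf{z})]. \tag{15}$$
If we approximate the expectations using a single Monte Carlo sample $\mathbf{z}_t \sim \mathcal{N}(\mathbf{z} \mid \mu_t, \Sigma_t)$, we can write the update in (14) as
$$\Sigma_{t+1}^{-1} = (1 - \beta_t)\, \Sigma_t^{-1} + \beta_t \left[ \delta \mathbf{I} - N \mathbf{H}_i(\mathbf{z}_t) \right], \tag{16}$$
$$\mu_{t+1} = \mu_t + \beta_t\, \Sigma_{t+1} \left[ N \mathbf{g}_i(\mathbf{z}_t) - \delta \mu_t \right]. \tag{17}$$
These updates take a form similar to Newton's method. The covariance matrix $\Sigma_{t+1}$ plays a role similar to that of the inverse Hessian in Newton's method and scales the gradient in the update of $\mu$. The precision matrix $\Sigma_{t+1}^{-1}$ itself contains a moving average of the past Hessians. It is, however, not common to compute Hessians for deep models, but, as we discuss in Section VI, we can use another approximation to simplify this computation. With such an approximation, these updates can be implemented efficiently within existing deep-learning codebases, as discussed in [11]. ∎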
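A Newton-like update of this kind can be sketched on a small nonconjugate model. The example below uses Bayesian logistic regression on synthetic data in place of a neural network, which is our simplification. It assumes the convention that the gradient `g` and Hessian `H` are those of the log-likelihood of one randomly chosen example at a single sample `z` from the current Gaussian, and that the updates take the form: precision $\leftarrow (1-\beta)\,$precision $+ \beta\,(\delta \mathbf{I} - N \mathbf{H})$, followed by mean $\leftarrow$ mean $+ \beta\,\Sigma\,(N \mathbf{g} - \delta\,$mean$)$ with the new covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, delta, beta = 200, 2, 1.0, 0.01
X = rng.normal(size=(N, D))
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=N) > 0).astype(float)

def sigmoid(a):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * a))

mu, prec = np.zeros(D), delta * np.eye(D)    # start q at the prior N(0, I/delta)
for t in range(2000):
    i = rng.integers(N)                       # randomly selected data example
    Sigma = np.linalg.inv(prec)
    z = rng.multivariate_normal(mu, Sigma)    # single Monte Carlo sample from q
    s = sigmoid(X[i] @ z)
    g = (y[i] - s) * X[i]                         # gradient of log p(y_i | z) at z
    H = -s * (1.0 - s) * np.outer(X[i], X[i])     # Hessian of log p(y_i | z) at z
    # Precision update: moving average of (prior precision - N * Hessian).
    prec = (1 - beta) * prec + beta * (delta * np.eye(D) - N * H)
    # Mean update: gradient scaled by the new covariance (solve instead of inverse).
    mu = mu + beta * np.linalg.solve(prec, N * g - delta * mu)
```

Because the logistic log-likelihood is concave, `-H` is positive semi-definite and the precision stays positive definite throughout. For models with indefinite Hessians, such as neural networks, this is not guaranteed, which is one motivation for the Gauss-Newton approximation mentioned in Section VI.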
Similarly to the above example, it might be possible to employ automatic-gradient methods to compute natural gradients in many models. A recent work [20] explores this possibility. Another stochastic approximation method, discussed in [19], is also useful. For simple models, such as generalized linear models, where we can directly derive the distribution of the local variables, we can compute the gradients locally. This is discussed in [10] for generalized linear models, Gaussian processes, and linear dynamical systems with nonlinear likelihoods.
V Local Approximations with Natural Gradients
We now show that natural gradients not only result in simple updates, but also give rise to local exponential-family approximations of the nonconjugate terms. An attractive feature of these approximations is that the natural gradient of a nonconjugate likelihood is also the natural parameter of its local approximation.
We start by analyzing the optimality condition of $\mathcal{L}$. First, by setting the natural gradient (13) to zero, we note that a maximum $\lambda_*$ of $\mathcal{L}$, with expectation parameter $\mathbf{m}_*$, satisfies the following: $\lambda_* = \eta + \sum_{i=1}^{N} \nabla_{\mathbf{m}} f_i(\mathbf{m}_*)$. (A similar optimality condition is used in [19], although the connection to natural gradients is not discussed.) Then, taking the inner product with $\phi(\mathbf{z})$, exponentiating the whole equation, and using the definition (2) of the prior, we can rewrite the optimality condition as follows:
$$q_{\lambda_*}(\mathbf{z}) \propto \left[ \prod_{i=1}^{N} \exp \langle \phi(\mathbf{z}), \nabla_{\mathbf{m}} f_i(\mathbf{m}_*) \rangle \right] p(\mathbf{z}). \tag{18}$$
Comparing this expression to the original model (1), we see that the nonconjugate likelihoods are replaced by local exponential-family approximations whose natural parameters are the local natural gradients $\nabla_{\mathbf{m}} f_i(\mathbf{m}_*)$. This type of local approximation is employed in Expectation Propagation (EP) [14]. In contrast, here they naturally emerge during a global step, i.e., during the optimization of the whole variational objective.
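We denote the $i$'th local approximation at iteration $t$ by $\tilde{p}_i^{\,t}(\mathbf{z})$ and define it as follows:
$$\tilde{p}_i^{\,t}(\mathbf{z}) := \exp \langle \phi(\mathbf{z}), \hat{\nabla}_{\mathbf{m}} f_i(\mathbf{m}_t) \rangle. \tag{19}$$
We can then write the update (14) as an approximate Bayesian filter, as shown below:
$$q_{t+1}(\mathbf{z}) \propto \left[ q_t(\mathbf{z}) \right]^{1 - \beta_t} \left[ \tilde{p}_i^{\,t}(\mathbf{z})^{N} \, p(\mathbf{z}) \right]^{\beta_t}. \tag{20}$$
This update replaces each likelihood term in the model (1) by the approximation of the $i$'th likelihood term, which is why $\tilde{p}_i^{\,t}$ is raised to the power $N$. All distributions in the above update take the same exponential form as $q$, and the resulting computation can therefore be performed using conjugate computations, i.e., by simply adding their natural parameters. This algorithm is referred to as Conjugate-computation VI (CVI) in [10].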
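Finally, if the parameters $\eta$ of the prior distribution do not change with iterations, then we can further simplify the updates by pulling $\eta$ out of the iterations and expressing the local natural parameters, denoted by $\tilde{\lambda}_{i,t}$, as a recursion, as shown below:
$$\lambda_{t+1} = \eta + \sum_{i=1}^{N} \tilde{\lambda}_{i,t+1}, \qquad \tilde{\lambda}_{i,t+1} = (1 - \beta_t)\, \tilde{\lambda}_{i,t} + \beta_t\, s_{i,t}\, N \hat{\nabla}_{\mathbf{m}} f_i(\mathbf{m}_t), \tag{21}$$
where $s_{i,t} = 1$ if the $i$'th data point is selected in the $t$'th iteration, and $s_{i,t} = 0$ otherwise. The natural parameter $\tilde{\lambda}_{i,t}$ plays a role similar to the so-called site parameters in EP [17]. As the algorithm progresses, the local natural parameters converge to the optimal natural parameters shown in (18).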
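The bookkeeping behind the local natural parameters can be checked with a short sketch. It assumes the recursion works as follows: at every iteration all local parameters are scaled by $(1 - \beta)$, and only the selected example additionally receives $\beta N$ times its gradient with respect to the expectation parameters. The linear-Gaussian likelihood (unit noise variance) is chosen by us only because those gradients are available in closed form. The global parameter is tracked separately with the update (14), and the two are compared at the end.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, delta, beta = 20, 2, 1.0, 0.1
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

def pack(v1, v2):
    """Stack the two Gaussian natural-parameter blocks into one vector."""
    return np.concatenate([v1, v2.ravel()])

eta = pack(np.zeros(D), -0.5 * delta * np.eye(D))     # prior natural parameters
# Closed-form gradients w.r.t. (E[z], E[z z^T]) for each linear-Gaussian likelihood.
grads = [pack(X[i] * y[i], -0.5 * np.outer(X[i], X[i])) for i in range(N)]

lam = eta.copy()                   # global natural parameter
sites = np.zeros((N, eta.size))    # local (site) natural parameters
for t in range(500):
    i = rng.integers(N)            # data example selected at this iteration
    lam = (1 - beta) * lam + beta * (eta + N * grads[i])   # global update
    sites *= (1 - beta)                                    # every site decays ...
    sites[i] += beta * N * grads[i]                        # ... only the selected one is refreshed

lam_from_sites = eta + sites.sum(axis=0)
```

When `lam` is initialised at the prior and the sites at zero, the prior natural parameter plus the sum of the sites reproduces the global natural parameter at every iteration, so the posterior approximation can be stored as a prior and one exponential-family factor per data example.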
VI Results on Bayesian Neural Networks
In this section, we compare an approximate natural-gradient VI method with a gradient-based VI method. The natural-gradient method employs two approximations to the updates (16)-(17). The first approximation is to use a diagonal covariance matrix, which enables fast computation when the dimensionality of $\mathbf{z}$ is large. The second approximation is to use a generalized Gauss-Newton approximation for the Hessian. This avoids the need to compute second-order derivatives, making the implementation easier. The resulting method is called Variational Online Gauss-Newton (VOGN) [11]. The updates of this method, as discussed in [11], are very similar to those of the Adam optimizer [12] and can be implemented with a few lines of code change. This makes it easy to apply VOGN to large deep-learning problems.
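The two approximations can be illustrated with a rough sketch. It is not the VOGN algorithm of [11], whose exact form (including any momentum or scaling terms) is not reproduced here. It only applies a diagonal covariance and a Gauss-Newton-style curvature proxy to a Newton-like update, using logistic regression on synthetic data. We assume that the curvature proxy is the minibatch average of squared per-example gradients; the minibatch size `M` and all other constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, delta, beta = 200, 5, 16, 1.0, 0.05
X = rng.normal(size=(N, D))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

mu, prec = np.zeros(D), delta * np.ones(D)    # mean and diagonal precision of q
for t in range(500):
    idx = rng.choice(N, size=M, replace=False)         # minibatch of M examples
    z = mu + rng.normal(size=D) / np.sqrt(prec)        # sample from the diagonal Gaussian q
    s = 0.5 * (1.0 + np.tanh(0.5 * (X[idx] @ z)))      # stable sigmoid of the logits
    G = (y[idx] - s)[:, None] * X[idx]                 # per-example gradients of the log-likelihood
    ggn = np.mean(G ** 2, axis=0)                      # assumed diagonal Gauss-Newton proxy for -Hessian
    prec = (1 - beta) * prec + beta * (delta + N * ggn)        # diagonal precision update
    mu = mu + beta * (N * G.mean(axis=0) - delta * mu) / prec  # mean update, scaled elementwise
```

Only first-order gradients are needed, and every quantity is a vector of the same size as the weights, which is what makes this kind of update resemble an adaptive-learning-rate optimizer in cost and structure.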
Figure 1(b) compares VOGN with a gradient-based approach called Bayes by Backprop [4]. The latter optimizes the variational objective using the Adam optimizer. The results are obtained using a neural network with a single hidden layer of 64 hidden units and ReLU activations. The same prior precision $\delta$, a minibatch size of 128, and 16 Monte Carlo samples are used for all runs. The two figures show results on the following two datasets: 'Australian' ( and ) and 'Breast Cancer' ( and ). We show the loss versus epochs, where a lower value indicates better performance. We clearly see that the natural-gradient method is much faster than the gradient-based method. See [11] for more experimental results.
VII Conclusions
In this paper, we discuss methods for natural-gradient descent in variational inference. Unlike gradient-based approaches, natural-gradient methods exploit the information geometry of the solution and can converge quickly. We review a few recent works and provide new insights using the duality associated with exponential-family approximations. We discuss an attractive property of the natural gradient: it yields local conjugate approximations for individual model components. Finally, we show some illustrative examples where these methods have been applied to perform Bayesian deep learning.
Acknowledgment
We would like to thank the following people at RIKEN, AIP for discussions and feedback: Aaron Mishkin, Frederik Kunstner, Voot Tangkaratt, and Wu Lin. We would also like to thank James Hensman and Shunichi Amari for discussions.
References
 [1] Shunichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 [2] Shunichi Amari. Information geometry and its applications. Springer, 2016.
 [3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. International Conference on Machine Learning, 2015.
 [5] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA, 2014.
 [6] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 [7] James Hensman, Magnus Rattray, and Neil D Lawrence. Fast variational inference in the conjugate exponential family. In Advances in neural information processing systems, pages 2888–2896, 2012.
 [8] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 [9] Antti Honkela and Harri Valpola. Unsupervised variational Bayesian learning of nonlinear models. In Advances in Neural Information Processing Systems, pages 593–600, 2004.

 [10] Mohammad Emtiyaz Khan and Wu Lin. Conjugate-computation variational inference: converting variational inference in non-conjugate models to inferences in conjugate models. In International Conference on Artificial Intelligence and Statistics, 2017.
 [11] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 [12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [14] T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2001.
 [15] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
 [16] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In International conference on Artificial Intelligence and Statistics, pages 814–822, 2014.
 [17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

 [19] Tim Salimans and David Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
 [20] Hugh Salimbeni, Stefanos Eleftheriadis, and James Hensman. Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In International Conference on Artificial Intelligence and Statistics, 2018.
 [21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1–2:1–305, 2008.

 [22] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.