1 Introduction
Variational inference methods are one of the most widely-used computational tools to deal with the intractability of Bayesian inference, while stochastic gradient (SG) methods are one of the most widely-used tools for solving optimization problems on huge datasets. The last three years have seen an explosion of work exploring SG methods for variational inference
(Hoffman et al., 2013; Salimans et al., 2013; Ranganath et al., 2013; Titsias & Lázaro-Gredilla, 2014; Mnih & Gregor, 2014; Kucukelbir et al., 2014). In many settings, these methods can yield simple updates and scale to huge datasets. A challenge that has been addressed in many of the recent works on this topic is that the “black-box” SG method ignores the geometry of the variational-parameter space. This has led to methods like the stochastic variational inference (SVI) method of Hoffman et al. (2013), which uses natural gradients to exploit the geometry. This leads to better performance in practice, but the approach only applies to conditionally-conjugate models. In addition, it is not clear how using natural gradients for variational inference affects the theoretical convergence rate of SG methods.
In this work we consider a general framework that (i) can be stochastic to allow huge datasets, (ii) can exploit the geometry of the variational-parameter space to improve performance, and (iii) can yield a closed-form update even for non-conjugate models. The new framework can be viewed as a stochastic generalization of the proximal-gradient method of Khan et al. (2015), which splits the objective into conjugate and non-conjugate terms. By linearizing the non-conjugate terms, this previous method as well as our new method yield simple closed-form proximal-gradient updates even for non-conjugate models.
While proximal-gradient methods have been well-studied in the optimization community (Beck & Teboulle, 2009), like SVI there is nothing known about the convergence rate of the method of Khan et al. (2015) because it uses “divergence” functions which do not satisfy standard assumptions. Our second contribution is to analyze the convergence rate of the proposed method. In particular, we generalize an existing result on the convergence rate of stochastic mirror descent in non-convex settings (Ghadimi et al., 2014)
to allow a general class of divergence functions that includes the cases above (in both deterministic and stochastic settings). While it has been observed empirically that including an appropriate divergence function enables larger steps than basic SG methods, this work gives the first theoretical result justifying the use of these more general divergence functions. In particular, it reveals how different factors affect the convergence rate, such as the Lipschitz continuity of the lower bound, the information geometry of the divergence functions, and the variance of the stochastic approximation. Our results also suggest conditions under which the proximal-gradient steps of
Khan et al. (2015) can make more progress than (non-split) gradient steps, and shed light on the choice of step-size for these methods. Our experimental results indicate that the new method leads to improvements in performance on a variety of problems, and we note that the algorithm and theory might be useful beyond the variational inference scenarios we have considered in this work.
2 Variational Inference
Consider a general latent variable model where we have a data vector y of length N and a latent vector z of length D. In Bayesian inference, we are interested in computing the marginal likelihood p(y), which can be written as the integral of the joint distribution p(y, z)
over all values of z. This integral is often intractable, and in variational inference we typically approximate it with the evidence lower bound (ELBO). This approximation introduces a distribution q(z|λ) and chooses the variational parameters λ to maximize the following lower bound on the marginal likelihood:

log p(y) = log ∫ p(y, z) dz ≥ E_{q(z|λ)}[ log p(y, z) − log q(z|λ) ] =: L(λ).   (1)

The inequality follows from concavity of the logarithm function. The set S is the set of valid parameters λ.
To optimize L, one of the seemingly-simplest approaches is gradient descent: λ_{k+1} = λ_k + β_k ∇L(λ_k), which can be viewed as optimizing a quadratic approximation of L,

λ_{k+1} = argmax_{λ∈S} λᵀ∇L(λ_k) − (1/(2β_k)) ‖λ − λ_k‖².   (2)
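The closed form of the quadratic surrogate in (2) can be checked numerically; the snippet below is our own illustration (the function names are ours) confirming that maximizing the surrogate recovers the usual gradient step.

```python
import numpy as np

def surrogate_argmax(grad, lam_k, beta):
    # Maximize lam * grad - (lam - lam_k)^2 / (2 * beta) in closed form:
    # setting the derivative to zero gives lam = lam_k + beta * grad.
    return lam_k + beta * grad

# Compare against a brute-force maximization on a fine 1-D grid.
lam_k, beta, g = 0.3, 0.1, 2.0
grid = np.linspace(-2, 2, 400001)
obj = grid * g - (grid - lam_k) ** 2 / (2 * beta)
brute = grid[np.argmax(obj)]
assert abs(surrogate_argmax(g, lam_k, beta) - brute) < 1e-4
```

The same calculation goes through coordinate-wise for vector-valued λ, which is why (2) and the plain gradient step are interchangeable.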
While we can often choose the family q so that it has convenient computational properties, it might be impractical to apply gradient descent in this context when we have a very large dataset or when some terms in the lower bound are intractable. Recently, SG methods have been proposed to deal with these issues (Ranganath et al., 2013; Titsias & Lázaro-Gredilla, 2014): they allow large datasets by using random subsets (mini-batches) and can approximate intractable integrals using Monte Carlo methods that draw samples from q(z|λ).
A second drawback of applying gradient descent to variational inference is that it uses the Euclidean distance and thus ignores the geometry of the variational-parameter space, which often results in slow convergence. Intuitively, (2) implies that we should move in the direction of the gradient, but not move too far away from λ_k in terms of the Euclidean distance. However, the Euclidean distance is not appropriate for variational inference because λ is the parameter vector of a distribution; the Euclidean distance is often a poor measure of dissimilarity between distributions. The following example from Hoffman et al. (2013)
illustrates this point: the two normal distributions N(0, 10000)
and N(10, 10000) are almost indistinguishable, yet the Euclidean distance between their parameter vectors is 10, whereas the distributions N(0, 0.01) and N(0.1, 0.01) barely overlap, but the Euclidean distance between their parameters is only 0.1.
Natural-Gradient Methods: The canonical way to address the problem above is by replacing the Euclidean distance in (2) with another divergence function. For example, the natural-gradient method defines its iteration by using the symmetric Kullback-Leibler (KL) divergence (Hoffman et al., 2013; Pascanu & Bengio, 2013; Amari, 1998),
λ_{k+1} = argmax_{λ∈S} λᵀ∇L(λ_k) − (1/(2β_k)) KL^sym[ q(z|λ) ‖ q(z|λ_k) ].   (3)

This leads to the update

λ_{k+1} = λ_k + β_k [F(λ_k)]⁻¹ ∇L(λ_k),   (4)

where F(λ) := E_{q(z|λ)}[ ∇_λ log q(z|λ) ∇_λ log q(z|λ)ᵀ ] is the Fisher information matrix.
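The Gaussian example above can be verified numerically. The sketch below is our own illustration, assuming the parameter vector is (mean, variance); it shows that the symmetric KL orders the two pairs in the opposite way to the Euclidean distance.

```python
import math

def kl_normal(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) for univariate normals (v = variance).
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(m1, v1, m2, v2):
    return kl_normal(m1, v1, m2, v2) + kl_normal(m2, v2, m1, v1)

def euclid(m1, v1, m2, v2):
    return math.hypot(m1 - m2, v1 - v2)

# Pair A: nearly indistinguishable distributions, Euclidean distance 10.
# Pair B: barely overlapping distributions, Euclidean distance only 0.1.
dA_e, dA_kl = euclid(0, 1e4, 10, 1e4), sym_kl(0, 1e4, 10, 1e4)
dB_e, dB_kl = euclid(0, 0.01, 0.1, 0.01), sym_kl(0, 0.01, 0.1, 0.01)
assert dA_e == 10 and abs(dB_e - 0.1) < 1e-12
assert dA_kl < dB_kl   # KL ranks the pairs correctly; Euclidean does not.
```

Here the symmetric KL of pair A is 0.01 while that of pair B is 1.0, matching the intuition in the text.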
Hoffman et al. (2013) show that the natural-gradient update can be computationally simpler than gradient descent for conditionally-conjugate exponential family models. In this family, we assume that the joint distribution factorizes as p(y, z) = ∏_i p(z_i | pa_i), where the z_i are disjoint subsets of {y, z} and pa_i are the parents of z_i in a directed acyclic graph. This family also assumes that each conditional distribution p(z_i | pa_i) is in the exponential family,
p(z_i | pa_i) = h_i(z_i) exp[ η_i(pa_i)ᵀ φ_i(z_i) − A_i(η_i(pa_i)) ],

where η_i are the natural parameters, φ_i(z_i) are the sufficient statistics, A_i is the partition function, and h_i is the base measure. Hoffman et al. (2013) consider a mean-field approximation where each factor belongs to the same exponential-family distribution as the joint distribution,

q(z | λ) = ∏_i q(z_i | λ_i),   q(z_i | λ_i) = h_i(z_i) exp[ λ_iᵀ φ_i(z_i) − A_i(λ_i) ].

The parameters of this distribution are denoted by λ_i to differentiate them from the joint-distribution parameters η_i.
As shown by Hoffman et al. (2013), the Fisher matrix for this problem is equal to ∇²_{λλ}A(λ), and the gradient of the lower bound with respect to λ is equal to ∇²_{λλ}A(λ)(m(λ) − λ), where m(λ) are the mean-field parameters (see Paquet, 2014). Therefore, when computing the natural gradient, the terms ∇²_{λλ}A(λ) cancel out and the natural gradient is simply m(λ) − λ, which is much easier to compute than the actual gradient. Unfortunately, for non-conjugate models this cancellation does not happen and the simplicity of the update is lost. The Riemannian conjugate-gradient method of Honkela et al. (2011) has similar issues, in that computing the Riemannian metric is typically very costly.
KL-Divergence Based Methods: Rather than using the symmetric KL, Theis & Hoffman (2015) consider using the KL divergence within a stochastic proximal-point method:

λ_{k+1} = argmax_{λ∈S} L(λ) − (1/β_k) KL[ q(z|λ) ‖ q(z|λ_k) ].   (5)
This method yields better convergence properties, but requires numerical optimization to implement the update even for conditionally-conjugate models. Khan et al. (2015) consider a deterministic proximal-gradient variant of this method by splitting the lower bound into L := f + h, where h contains all the “easy” terms and f contains all the “difficult” terms. By linearizing the “difficult” terms, this leads to a closed-form update even for non-conjugate models. The update is given by:

λ_{k+1} = argmax_{λ∈S} λᵀ∇f(λ_k) + h(λ) − (1/β_k) KL[ q(z|λ) ‖ q(z|λ_k) ].   (6)
However, this method requires the exact gradient ∇f(λ_k), which is usually not feasible for large datasets and/or complex models.
Mirror Descent Methods: In the optimization literature, mirror descent (and stochastic mirror descent) algorithms are a generalization of (2), where the squared-Euclidean distance can be replaced by any Bregman divergence B_φ generated from a strongly-convex function φ (Beck & Teboulle, 2003),

λ_{k+1} = argmax_{λ∈S} λᵀ∇L(λ_k) − (1/β_k) B_φ(λ ‖ λ_k),  where B_φ(λ ‖ λ_k) := φ(λ) − φ(λ_k) − ∇φ(λ_k)ᵀ(λ − λ_k).   (7)
The convergence rate of the mirror descent algorithm has been analyzed in convex (Duchi et al., 2010) and more recently in non-convex (Ghadimi et al., 2014) settings. However, mirror descent does not cover the cases described above in (5) and (6), where a KL divergence between two exponential-family distributions is used with λ as the natural parameter. For such cases, the Bregman divergence corresponds to a KL divergence with swapped parameters (see Nielsen & Garcia, 2009, Equation 29),

B_A(λ ‖ λ_k) = KL[ q(z|λ_k) ‖ q(z|λ) ],   (8)

where A is the partition function of q(z|λ). Because (5) and (6) both use a KL divergence where the second argument is fixed to λ_k, instead of the first argument, they are not covered under the mirror-descent framework. In addition, even though mirror descent has been used for variational inference (Ravikumar et al., 2010), Bregman divergences do not yield an efficient update in many scenarios.
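The swapped-argument identity (8) can be checked in the simplest exponential family: a unit-variance Gaussian whose natural parameter equals its mean, where the log-partition is A(λ) = λ²/2 up to additive constants. This is a toy sketch of ours, not code from the paper.

```python
def A(lam):
    return 0.5 * lam ** 2          # log-partition; natural param = mean

def dA(lam):
    return lam                     # mean parameter of the Gaussian

def bregman(lam, lam_k):
    # Bregman divergence generated by A.
    return A(lam) - A(lam_k) - dA(lam_k) * (lam - lam_k)

def kl_unit_var(m_a, m_b):
    # KL( N(m_a, 1) || N(m_b, 1) ) = (m_a - m_b)^2 / 2
    return 0.5 * (m_a - m_b) ** 2

lam, lam_k = 1.7, -0.4
# Equation (8): the Bregman divergence equals the KL with swapped arguments.
assert abs(bregman(lam, lam_k) - kl_unit_var(lam_k, lam)) < 1e-12
```

For this symmetric example both orderings coincide, but for general exponential families (e.g. Gaussians with varying variance) only the swapped ordering matches the Bregman divergence.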
3 Proximal-Gradient SVI
Our proximal-gradient stochastic variational inference (PG-SVI) method extends (6) to allow stochastic gradients ĝ(λ_k) and general divergence functions D(λ ‖ λ_k) by using the iteration

λ_{k+1} = argmax_{λ∈S} λᵀ ĝ(λ_k) + h(λ) − (1/β_k) D(λ ‖ λ_k).   (9)
This unifies a variety of existing approaches since it allows:

- Splitting of L into a difficult term f and a simple term h, similar to the method of Khan et al. (2015).

- A stochastic approximation ĝ of the gradient of the difficult term, similar to SG methods.

- A general divergence function D to incorporate the geometry of the variational-parameter space, similar to mirror-descent methods.
Below, we describe each feature in detail, along with the precise assumptions used in our analysis.
3.1 Splitting
Following Khan et al. (2015), we split the lower bound into a sum of a “difficult” term f and an “easy” term h, enabling a closed-form solution for (9). Specifically, we split using p(y, z) = c p̃_d(z) p̃_e(z), where p̃_d contains all factors that make the optimization difficult, and p̃_e contains the rest (while c is a constant). By substituting in (1), we get the following split of the lower bound:

L(λ) = E_q[ log p̃_d(z) ] + E_q[ log p̃_e(z) − log q(z|λ) ] + log c =: f(λ) + h(λ) + log c.

Note that p̃_d and p̃_e
need not be probability distributions.
We make the following assumptions about f and h:

 (A1) The function f is differentiable and its gradient is Lipschitz-continuous, i.e., for all λ, λ′ ∈ S we have ‖∇f(λ) − ∇f(λ′)‖ ≤ L ‖λ − λ′‖.

 (A2) The function h can be a general convex function.
These assumptions are very weak. The function f can be non-convex, and the Lipschitz-continuity assumption is typically satisfied in practice (indeed, the analysis can be generalized to only require this assumption on a smaller set containing the iterates). The assumption that h is convex seems strong, but note that we can always take h = 0 in the split if the function has no “nice” convex part. Below, we give several illustrative examples of such splits for variational-Gaussian inference with q(z|λ) = N(z | m, V), so that λ = {m, V} with m being the mean and V being the covariance matrix.
Gaussian Process (GP) Models: Consider GP models (Kuss & Rasmussen, 2005) for input-output pairs {x_n, y_n} indexed by n = 1, 2, …, N. Let z be the vector of latent function values drawn from a GP with mean 0 and covariance K. We use a non-Gaussian likelihood p(y_n | z_n) to model the output. We can then use the following split, where the non-Gaussian terms are in p̃_d and the Gaussian terms are in p̃_e:

p(y, z) = [ ∏_n p(y_n | z_n) ] N(z | 0, K) = p̃_d(z) p̃_e(z).   (10)

The detailed derivation is in the appendix. By substituting in (1), we obtain the lower bound shown below along with its split:

L(λ) = ∑_n E_q[ log p(y_n | z_n) ] + E_q[ log N(z | 0, K) − log q(z|λ) ] =: f(λ) + h(λ).   (11)

A1 is satisfied for common likelihoods, while it is easy to establish that h is convex. We show in Section 6 that this split leads to a closed-form update for iteration (9).
Generalized Linear Models (GLMs): A similar split can be obtained for GLMs (Nelder & Baker, 1972), where the non-conjugate likelihood terms are in p̃_d and the rest are in p̃_e. Denoting the weights by w and assuming a standard Gaussian prior over them, we can use the following split:

p(y, w) = [ ∏_n p(y_n | wᵀx_n) ] N(w | 0, I) = p̃_d(w) p̃_e(w).

We give further details about the bound for this case in the appendix.
Correlated Topic Model (CTM): Given a text document with a vocabulary of N words, denote its word-count vector by y. Let K be the number of topics and z be the K-dimensional vector of topic proportions. We can then use a similar split, where the non-conjugate multinomial terms are in p̃_d and the Gaussian prior is in p̃_e, with μ and Σ being the parameters of the Gaussian prior and β being the parameters of the multinomials. We give further details about the bound in the appendix.
3.2 Stochastic Approximation
The approach of Khan et al. (2015) considers (9) in the special case of (6), where the exact gradient ∇f(λ_k) is used in the first term. In practice this gradient is often difficult to compute. In our framework, we instead allow a stochastic approximation of ∇f(λ_k), which we denote by ĝ(λ_k).
As shown in the previous section, f might take the form f(λ) = ∑_n f_n(λ) for a set of functions f_n, as in the GP model (11). In some situations, computing ∇f_n is computationally expensive or intractable. For example, in GP models the expectation f_n(λ) = E_q[ log p(y_n | z_n) ] is intractable for most non-Gaussian likelihoods. In such cases, we can form a stochastic approximation by using a small number of samples z⁽ˢ⁾ from q(z|λ), as shown below:

∇f̂_n(λ) = (1/S) ∑_{s=1}^{S} log p(y_n | z_n⁽ˢ⁾) ∇_λ log q(z⁽ˢ⁾|λ) = ∇f_n(λ) + ε_n,

where ε_n represents the noise in the stochastic approximation and we use the identity ∇_λ E_q[ f(z) ] = E_q[ f(z) ∇_λ log q(z|λ) ] to derive the expression (Ranganath et al., 2013). We can then form a stochastic gradient by randomly selecting a mini-batch of M functions f_{n_i}
and employing the estimate

ĝ(λ) = (N/M) ∑_{i=1}^{M} ∇f̂_{n_i}(λ).   (12)
In our analysis we make the following two assumptions regarding the stochastic approximation of the gradient:

 (A3) The estimate is unbiased: E[ ĝ(λ) ] = ∇f(λ).

 (A4) Its variance is upper-bounded: E[ ‖ĝ(λ) − ∇f(λ)‖² ] ≤ σ².

In both assumptions, the expectation is taken with respect to the noise ε. The first assumption is true for the stochastic approximation of (12). The second assumption is stronger, but only needs to hold for the iterates λ_k, so it is almost always satisfied in practice.
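The unbiasedness in A3 can be illustrated with a small score-function (REINFORCE-style) estimator of the kind used above. The following toy sketch is our own, using a test function whose gradient is known analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_grad(m, f, S):
    # Score-function estimate of d/dm E_{N(z|m,1)}[ f(z) ], using the identity
    # grad = E[ f(z) * d/dm log N(z|m,1) ] = E[ f(z) * (z - m) ].
    z = rng.normal(m, 1.0, size=S)
    return np.mean(f(z) * (z - m))

# For f(z) = z^2 we know E[f] = m^2 + 1, so the true gradient is 2m.
m = 0.5
est = score_grad(m, lambda z: z ** 2, S=200_000)
assert abs(est - 2 * m) < 0.05   # consistent with A3: close to the true gradient
```

Averaging many samples shrinks the estimator's variance, which is exactly what the mini-batch estimate (12) and assumption A4 exploit.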
3.3 Divergence Functions
To incorporate the geometry of q(z|λ), we include a divergence function D(λ ‖ λ′) between variational parameters λ and λ′. The divergence functions need to satisfy two assumptions:

 (A5) D(λ ‖ λ′) > 0, for all λ ≠ λ′.

 (A6) There exists an α > 0 such that, for all λ_k, λ_{k+1} generated by (9), we have:

(λ_{k+1} − λ_k)ᵀ [ ∇_λ D(λ ‖ λ_k) ]_{λ=λ_{k+1}} ≥ α ‖λ_{k+1} − λ_k‖².   (13)
The first assumption is reasonable and is satisfied by typical divergence functions like the squared Euclidean distance and variants of the KL divergence. In the next section we show that, whenever the iteration (9) is defined and all λ_k stay within a compact set, the second assumption is satisfied for all divergence functions considered in Section 2.
4 Special Cases
Most methods discussed in Section 2 are special cases of the proposed iteration (9). We obtain gradient descent if f = L, h = 0, ĝ(λ_k) = ∇L(λ_k), and D(λ ‖ λ_k) = ½‖λ − λ_k‖² (in this case A6 is satisfied with α = 1). From here, there are three standard generalizations in the optimization literature: SG methods do not require that ĝ(λ_k) = ∇L(λ_k), proximal-gradient methods do not require that h = 0, and mirror descent allows D to be a different Bregman divergence generated by a strongly-convex function. Our analysis applies to all these variations on existing optimization algorithms because A1 to A5 are standard assumptions (Ghadimi et al., 2014) and, as we now show, A6 is satisfied for this class of Bregman divergences. In particular, consider the generic Bregman divergence shown on the left side of (8) for some strongly-convex function φ. By taking the gradient with respect to λ and substituting in (13), we obtain that A6 is equivalent to

(∇φ(λ_{k+1}) − ∇φ(λ_k))ᵀ (λ_{k+1} − λ_k) ≥ α ‖λ_{k+1} − λ_k‖²,
which is equivalent to strong convexity of the function φ (Nesterov, 2004, Theorem 2.1.9).
The method of Theis & Hoffman (2015) corresponds to choosing f = 0, h = L, and D(λ ‖ λ_k) = KL[ q(z|λ) ‖ q(z|λ_k) ], where q is an exponential-family distribution with natural parameters λ. Since we assume h to be convex, only limited cases of their approach are covered under our framework. The method of Khan et al. (2015) also uses this KL divergence and focuses on the deterministic case where ĝ(λ_k) = ∇f(λ_k), but uses the split L = f + h to allow for non-conjugate models. In both of these methods, A6 is satisfied when the Fisher matrix is positive-definite. This can be shown by using the definition of the KL divergence for exponential families (Nielsen & Garcia, 2009):

KL[ q(z|λ) ‖ q(z|λ_k) ] = A(λ_k) − A(λ) − (λ_k − λ)ᵀ ∇A(λ).   (14)

Taking the derivative with respect to λ and substituting in (13) with D(λ ‖ λ_k) = KL[ q(z|λ) ‖ q(z|λ_k) ], we get the condition

(λ_{k+1} − λ_k)ᵀ ∇²A(λ_{k+1}) (λ_{k+1} − λ_k) ≥ α ‖λ_{k+1} − λ_k‖²,

which is satisfied when the Fisher matrix ∇²A(λ) is positive-definite over a compact set, for α
equal to its lowest eigenvalue on the set.
Methods based on the natural gradient using iteration (3) (like SVI) correspond to using f = L, h = 0, and the symmetric KL divergence. Assumptions A1 to A5 are usually assumed for these methods and, as we show next, A6 is also satisfied. In particular, when q(z|λ) is an exponential-family distribution, the symmetric KL divergence can be written as the sum of the Bregman divergence shown in (8) and the KL divergence shown in (14),

KL^sym[ q(z|λ) ‖ q(z|λ_k) ] = KL[ q(z|λ) ‖ q(z|λ_k) ] + KL[ q(z|λ_k) ‖ q(z|λ) ] = KL[ q(z|λ) ‖ q(z|λ_k) ] + B_A(λ ‖ λ_k),

where the first equality follows from the definition of the symmetric KL divergence and the second follows from (8). Since the two divergences in the sum satisfy A6, the symmetric KL divergence also satisfies the assumption.
5 Convergence of PG-SVI
We first analyze the convergence rate of deterministic methods where the gradient is exact, ĝ(λ_k) = ∇f(λ_k). This yields a simplified result that applies to a wide variety of existing variational methods. Subsequently, we consider the more general case where a stochastic approximation of the gradient is used.
5.1 Deterministic Methods
We first establish convergence under a fixed step-size when using the exact gradient. We use C₀ := L(λ*) − L(λ₁) as the initial (constant) sub-optimality, and express our result in terms of the quantity

‖Δ_k‖²,  where Δ_k := λ_{k+1} − λ_k

and λ_{k+1} is computed using (9).
Proposition 1.
Let A1, A2, A5, and A6 be satisfied. If we run K iterations of (9) with a fixed step-size β_k = β < 2α/L for all k and an exact gradient ĝ(λ_k) = ∇f(λ_k), then we have

min_{k ∈ {1,…,K}} ‖Δ_k‖² ≤ β C₀ / [ K (α − βL/2) ].   (15)
We give a proof in the appendix. Stating the result in terms of ‖Δ_k‖² may appear unconventional, but this quantity is a natural measure of first-order optimality. For example, consider the special case of gradient descent, where D(λ ‖ λ_k) = ½‖λ − λ_k‖² and α = 1. In this case, Δ_k = β∇L(λ_k), therefore we have ‖Δ_k‖² = β²‖∇L(λ_k)‖², and Proposition 1 implies that min_k ‖∇L(λ_k)‖² has a convergence rate of O(1/K). This means that the method converges at a sublinear rate to an approximate stationary point, which would be a global maximum in the special case where L is concave.
In more general settings, the quantity ‖Δ_k‖² provides a generalized notion of first-order optimality for problems that may be non-smooth or use a non-Euclidean geometry. Further, if the objective is bounded (so that C₀ is finite), this result implies that the algorithm converges to such a stationary point and also gives a rate of convergence of O(1/K).
If we use a divergence with α > 1, then we can use a step-size larger than 2/L and the error will decrease faster than for gradient descent. To our knowledge, this is the first result that formally shows that natural-gradient methods can achieve faster convergence rates. The splitting of the objective into f and h is also likely to improve the step-size: since the Lipschitz constant L only depends on f, it might sometimes be possible to reduce it by choosing an appropriate split.
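The O(1/K) behaviour in the gradient-descent special case can be illustrated on a toy concave objective. The bound checked below is the classical descent-lemma bound for gradient methods, and the whole setup is our own illustration rather than the paper's experiment.

```python
# Gradient ascent on the concave toy objective L(lam) = -lam^2 / 2,
# whose gradient -lam is 1-Lipschitz.  The classical analysis gives
#   min_k ||grad L(lam_k)||^2 <= C0 / (K * beta * (1 - Lip * beta / 2)),
# an O(1/K) rate in the number of iterations K.
beta, Lip, lam, K = 0.5, 1.0, 1.0, 50
C0 = 0.0 - (-0.5 * lam ** 2)          # L(lam*) - L(lam_1), with lam* = 0
grads = []
for _ in range(K):
    g = -lam                          # gradient of L at the current iterate
    grads.append(g ** 2)
    lam = lam + beta * g              # ascent step
assert min(grads) <= C0 / (K * beta * (1 - Lip * beta / 2))
```

On this quadratic the iterates actually converge geometrically, so the O(1/K) guarantee is loose here; it becomes the binding rate on harder non-convex objectives.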
We next give a more general result that allows a per-iteration step-size.
Proposition 2.
If we choose the step-sizes β_k ≤ 2α/L for all k, with β_k < 2α/L for at least one k, then

min_{k ∈ {1,…,K}} ‖Δ_k‖² ≤ C₀ / [ ∑_{k=1}^{K} (α/β_k − L/2) ].   (16)
We give a proof in the appendix. For gradient descent, the above result implies that we can use any step-size less than 2/L, which agrees with the classical step-size choices for gradient and proximal-gradient methods.
5.2 Stochastic Methods
We now give a bound for the more general case where we use a stochastic approximation of the gradient.
Proposition 3.
Let A1-A6 be satisfied. If we run K iterations of (9) with a fixed step-size β_k = β (where β is a scalar) and a fixed batch-size M_k = M for all k, with a stochastic gradient ĝ(λ_k), then we have

E[ ‖Δ_R‖² ] ≤ [ C₀ + c σ² K / (2M) ] / [ K (α/β − L/2 − 1/(2c)) ],

where c > 0 is a constant such that α/β − L/2 − 1/(2c) > 0, and Δ_R := λ_{R+1} − λ_R. The expectation is taken with respect to the noise ε and a random variable R which follows the uniform distribution over {1, 2, …, K}.

Unlike the bound of Proposition 1, this bound depends on the noise variance σ² as well as the mini-batch size M. In particular, as we would expect, the bound gets tighter as the variance gets smaller and as the size of our mini-batch grows. Notice that the dependence on the variance is also improved if we have a favourable geometry that increases α. Thus, we can achieve a higher accuracy by either increasing the mini-batch size or improving the geometry. In the appendix we give a more general result that allows non-constant sequences of step-sizes, although we found that constant step-sizes work better empirically. Note that while stating the result in terms of a randomized iterate might seem strange, in practice we typically just take the last iterate as the approximate solution.
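The role of the mini-batch size M in Proposition 3 comes from the familiar 1/M variance reduction of averaged gradients, which a toy simulation (our own sketch) confirms:

```python
import numpy as np

rng = np.random.default_rng(1)

def minibatch_var(M, trials=20_000):
    # Variance of a mini-batch average of M i.i.d. unit-variance noise terms;
    # it should scale as 1/M, tightening the bound as M grows.
    noise = rng.normal(0.0, 1.0, size=(trials, M))
    return np.var(noise.mean(axis=1))

v1, v10 = minibatch_var(1), minibatch_var(10)
assert abs(v1 - 1.0) < 0.05 and abs(v10 - 0.1) < 0.01
```

This is why, in the bound, the σ² term is divided by M: averaging M independent per-example gradients shrinks the effective noise variance by a factor of M.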
6 Closed-Form Updates for Non-Conjugate Models
We now give an example where iteration (9) attains a closed-form solution. We expect such closed-form solutions to exist for a large class of problems, including models where q(z|λ) is an exponential-family distribution, but here we focus on the GP model discussed in Section 3.1.
For the GP model, we rewrite the lower bound (11) as

L(m, V) = ∑_n f_n(m̃_n, ṽ_n) − KL[ N(z | m, V) ‖ N(z | 0, K) ],   (17)

where we have used f_n(m̃_n, ṽ_n) := E_q[ log p(y_n | z_n) ], with m̃_n being the n’th entry of m and ṽ_n being the n’th diagonal entry of V. We can compute a stochastic approximation of ∇f using (12) by randomly selecting an example n (choosing M = 1) and using a Monte Carlo gradient approximation of f_n. Using this approximation, the linearized term in (9) can be simplified to the following:

λᵀ ĝ(λ_k) = N [ m̃_n ∇_{m̃_n} f̂_n(m̃_{n,k}, ṽ_{n,k}) + ṽ_n ∇_{ṽ_n} f̂_n(m̃_{n,k}, ṽ_{n,k}) ],   (18)

where m̃_{n,k} and ṽ_{n,k} denote the values of m̃_n and ṽ_n in the k’th iteration for the selected n. By using the KL divergence as our divergence function in iteration (9), and by denoting q(z|λ_k) by q_k, we can express the last two terms in (9) as a single KL divergence between q(z|λ) and a distribution proportional to the product of q_k and the Gaussian prior. Comparing this to (17), we see that this objective is similar to that of a GP model with a Gaussian prior (since q_k and the prior are both Gaussian, their product is a Gaussian) and a linear, Gaussian-like log-likelihood. Therefore, we can obtain closed-form updates for its minimization.
The updates are shown below and a detailed derivation is given in the appendix.
(19) 
where the precision parameters are initialized to a small positive constant to avoid numerical issues, e_n is a vector with all zero entries except the n’th entry, which is equal to 1, and k_n is the n’th column of K. For the next iteration, we use the new mean and variance to compute the gradients and run the above updates again. We continue until a convergence criterion is reached.
These updates have numerous advantages. First, we do not need to store the full covariance matrix V: the updates avoid forming V explicitly and only update the quantities needed for the stochastic gradient. This works because we only need one diagonal element of V in each iteration to compute the stochastic gradient. For large N this is a clear advantage, since the memory cost is O(N) rather than O(N²). Second, computing the mean vector and a diagonal entry of V only requires solving two linear systems, as shown in the second and third lines of (19). In general, for a mini-batch of size M, we need to solve only a small number of linear systems, which is much cheaper than an explicit inversion. Finally, the linear systems at iteration k+1 are very similar to those at iteration k, since the site parameters differ only at one entry. Therefore, we can reuse computations from the previous iteration to improve the computational efficiency of the updates.
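The linear-system trick described above can be sketched as follows. This is our own illustration under the standard Gaussian-posterior identities (the variable names are ours, not the paper's exact update): with site precisions gamma, the posterior is V = (K⁻¹ + diag(gamma))⁻¹ and m = V b, and both the mean and a single diagonal entry of V follow from solves with I + diag(gamma) K, avoiding explicit inversion.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 6
Araw = rng.normal(size=(N, N))
K = Araw @ Araw.T + N * np.eye(N)        # SPD kernel matrix
gamma = rng.uniform(0.5, 2.0, size=N)    # per-site precisions
b = rng.normal(size=N)

# Push-through identity: (K^{-1} + diag(gamma))^{-1} = K (I + diag(gamma) K)^{-1}.
Msys = np.eye(N) + np.diag(gamma) @ K
m = K @ np.linalg.solve(Msys, b)                 # one solve for the mean
n = 3
e_n = np.zeros(N); e_n[n] = 1.0
v_n = K[n] @ np.linalg.solve(Msys, e_n)          # one solve for V[n, n]

# Sanity check against the (expensive) direct inversion.
V_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(gamma))
assert np.allclose(m, V_direct @ b)
assert np.isclose(v_n, V_direct[n, n])
```

Since only one entry of gamma changes per iteration, factorizations of I + diag(gamma) K can be updated cheaply across iterations, which is the reuse mentioned above.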
7 Experimental Results
In this section, we compare our method to many existing approaches, such as SGD and four adaptive gradient methods (ADAGRAD, ADADELTA, RMSprop, ADAM), as well as two variational inference methods for non-conjugate models (the delta method and the Laplace method). We show results on Gaussian process classification
(Kuss & Rasmussen, 2005) and correlated topic models (Blei & Lafferty, 2007). The code to reproduce these experiments can be found on GitHub: https://github.com/emtiyaz/proxgradsvi
7.1 Gaussian Process Classification
We first consider binary classification by using a GP model with a Bernoulli-logit likelihood on three datasets: Sonar, Ionosphere, and USPS-3vs5. These datasets can be found at the UCI data repository (https://archive.ics.uci.edu/ml/datasets.html) and their details are discussed in Kuss & Rasmussen (2005). For the GP prior, we use the zero mean-function and a squared-exponential covariance function with hyperparameters as defined in Kuss & Rasmussen (2005) (see Eq. 33). We set the values of the hyperparameters using cross-validation, separately for each of the three datasets.
7.1.1 Performance Under a Fixed Step-Size
In our first experiment, we compare the performance under a fixed step-size. The results also demonstrate the faster convergence of our method compared to gradient-descent methods. We compare the following four algorithms on the Ionosphere dataset: (1) batch gradient descent (referred to as ‘GD’), (2) the batch proximal-gradient algorithm (referred to as ‘PG’), (3) the batch proximal-gradient algorithm with gradients approximated by using Monte Carlo (referred to as ‘PG-MC’), and (4) the proposed proximal-gradient stochastic variational inference (referred to as ‘PG-SVI’) method, where stochastic gradients are obtained using (12).
Figure 1 shows the number of examples required for convergence versus the step-size. A lower number implies faster convergence. Convergence is assessed by monitoring the lower bound; when the change across consecutive iterations does not exceed a certain threshold, we stop the algorithm.
We clearly see that GD requires many more passes through the data, and that proximal-gradient methods converge faster than GD. In addition, the upper bound on the step-size for PG is much larger than for GD, which implies that PG can take larger steps than GD. PG-SVI is surprisingly as fast as PG, which shows the advantage of our approach over the approach of Khan et al. (2015).
7.1.2 Comparison with Adaptive Gradient Methods
We also compare PG-SVI to SGD and four adaptive methods, namely ADADELTA (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), ADAGRAD (Duchi et al., 2011), and ADAM (Kingma & Ba, 2014). The implementation details of these algorithms are given in the appendix. We compare the value of the lower bound versus the number of passes through the data. We also compare the average log-loss on the test data, −(1/N*) ∑_n log₂ p̂(y_n* | y), where p̂(y_n* | y) is the predictive probability of a test point given the training data and N* is the total number of test pairs. A lower value of the log-loss is better, and a value of 1 is equal to the performance of random coin-flipping.
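The average log-loss can be computed as below. We assume base-2 logarithms here, since that is what makes the coin-flipping baseline equal to 1 as stated (a toy sketch of ours):

```python
import math

def avg_log_loss(probs):
    # Average log2-loss of the predictive probabilities assigned to the
    # true test labels; random coin-flipping (p = 0.5 everywhere) scores 1.
    return -sum(math.log2(p) for p in probs) / len(probs)

assert avg_log_loss([0.5, 0.5, 0.5, 0.5]) == 1.0     # coin-flip baseline
assert avg_log_loss([0.9, 0.8, 0.95]) < 1.0          # confident correct predictions
```

A classifier assigning high probability to the true labels scores well below 1, while one assigning low probability is penalized sharply as its predictions approach 0.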
Figure 2 summarizes the results. In these plots, lower is better for both objectives and one “pass” means the number of randomly selected examples is equal to the total number of examples. Our method is much faster to converge than other methods. It always converged within 10 passes through the data while other methods required more than 100 passes.
7.2 Correlated Topic Model
We next show results for the correlated topic model on two collections of documents, namely the NIPS and Associated Press (AP) datasets. The NIPS dataset (https://archive.ics.uci.edu/) contains 1500 documents from the NIPS conferences held between 1987 and 1999 (a vocabulary size of 12,419 words and a total of around 1.9M words). The AP collection (http://www.cs.columbia.edu/~blei/lda-c/index.html) contains 2,246 documents from the Associated Press (a vocabulary size of 10,473 words and a total of 436K observed words). We use 50% of the documents for training and 50% for testing.
We compare to the delta method and the Laplace method discussed in Wang & Blei (2013), and also to the original mean-field (MF) method of Blei & Lafferty (2007). For these methods, we use an available implementation (https://www.cs.princeton.edu/~chongw/resource.html). All of these methods approximate the lower bound by using approximations to the expectation of log-sum-exp functions (see the appendix for details). We compare these methods to two versions of our algorithm which do not use such approximations, but instead use a stochastic gradient as explained in Section 3.2. Specifically, we use the following two versions: one with a full covariance (referred to as PG-SVI), and the other with a diagonal covariance (referred to as PG-SVI-MF). For both of these algorithms, we use a fixed step-size of 0.001 and a mini-batch size of 2 documents.
Following Wang & Blei (2013), we compare the held-out log-likelihood, which is computed as follows: a new test document y is split into two halves (y₁, y₂), then we compute an approximation q to the posterior given y₁ and use it to compute the held-out log-likelihood of each word y_n in y₂ using

log p(y_n | y₁) ≈ log E_q[ p(y_n | z) ].   (20)

We use a Monte Carlo approximation of this quantity by using a large number of samples from q (unlike Wang & Blei (2013), who approximate it by using the delta method). We report the average of this quantity over all words in y₂.
Figure 3 shows the negative of the average held-out log-likelihood versus time for 10 topics (lower values are better). We see that methods based on the proximal-gradient algorithm converge a little faster than the existing methods. More importantly, they achieve better performance. This could be due to the fact that we do not approximate the expectation of the log-sum-exp function, unlike the delta and Laplace methods. We obtained similar results when we used a different number of topics.
8 Discussion
This work has made two contributions. First, we proposed a new variational inference method that combines variable splitting, stochastic gradients, and general divergence functions. This method is well-suited for a huge variety of the variational inference problems that arise in practice, and we anticipate that it may improve over state-of-the-art methods in a variety of settings. Our second contribution is a theoretical analysis of the convergence rate of this general method. Our analysis generalizes existing results for the mirror descent algorithm in optimization, and establishes convergence rates for a variety of existing variational inference methods. Due to its generality, we expect that this analysis could be useful for establishing convergence rates of other algorithms that we have not thought of, perhaps beyond the variational inference settings we consider in this work. However, an open problem, also discussed by Ghadimi et al. (2014), is to establish convergence to an arbitrary accuracy with a fixed batch size.
One issue that we have not satisfactorily resolved is giving a theoretically-justified way to set the step-size in practice; our analysis only indicates that it must be sufficiently small. However, this problem is common to many methods in the literature, and our analysis at least suggests the factors that should be taken into account. Another open issue is the applicability of our method to many other latent variable models; in this paper we have shown applications to variational-Gaussian inference, but we expect that our method should result in simple updates for a larger class of latent variable models, such as non-conjugate exponential-family models. Additional work on these issues will improve the usability of our method.
References
 Amari (1998) Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Beck & Teboulle (2003) Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 Beck & Teboulle (2009) Beck, Amir and Teboulle, Marc. A fast iterative shrinkagethresholding algorithm with application to waveletbased image deblurring. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 693–696, 2009.
 Blei & Lafferty (2007) Blei, David M and Lafferty, John D. A correlated topic model of science. The Annals of Applied Statistics, pp. 17–35, 2007.

 Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
 Duchi et al. (2010) Duchi, John C, Shalev-Shwartz, Shai, Singer, Yoram, and Tewari, Ambuj. Composite objective mirror descent. COLT, 2010.
 Ghadimi et al. (2014) Ghadimi, Saeed, Lan, Guanghui, and Zhang, Hongchao. Minibatch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, pp. 1–39, 2014.
 Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Honkela et al. (2011) Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235–3268, 2011.
 Khan et al. (2015) Khan, Mohammad Emtiyaz, Baque, Pierre, Fleuret, Francois, and Fua, Pascal. Kullback-Leibler proximal variational inference. In Advances in Neural Information Processing Systems, 2015.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kucukelbir et al. (2014) Kucukelbir, Alp, Ranganath, Rajesh, Gelman, Andrew, and Blei, David. Fully automatic variational inference of differentiable probability models. In NIPS Workshop on Probabilistic Programming, 2014.
 Kuss & Rasmussen (2005) Kuss, M. and Rasmussen, C. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
 Mnih & Gregor (2014) Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 Nelder & Baker (1972) Nelder, John A and Baker, R Jacob. Generalized linear models. Encyclopedia of Statistical Sciences, 1972.
 Nesterov (2004) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
 Nielsen & Garcia (2009) Nielsen, Frank and Garcia, Vincent. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.

 Paquet (2014) Paquet, Ulrich. On the convergence of stochastic variational inference in Bayesian networks. NIPS Workshop on Variational Inference, 2014.
 Pascanu & Bengio (2013) Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
 Ranganath et al. (2013) Ranganath, Rajesh, Gerrish, Sean, and Blei, David M. Black box variational inference. arXiv preprint arXiv:1401.0118, 2013.

Ravikumar et al. (2010)
Ravikumar, Pradeep, Agarwal, Alekh, and Wainwright, Martin J.
Messagepassing for graphstructured linear programs: Proximal methods and rounding schemes.
The Journal of Machine Learning Research, 11:1043–1080, 2010. 
Salimans et al. (2013)
Salimans, Tim, Knowles, David A, et al.
Fixedform variational posterior approximation through stochastic linear regression.
Bayesian Analysis, 8(4):837–882, 2013.  Theis & Hoffman (2015) Theis, Lucas and Hoffman, Matthew D. A trustregion method for stochastic variational inference with applications to streaming data. arXiv preprint arXiv:1505.07649, 2015.

Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
 Titsias & Lázaro-Gredilla (2014) Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
 Wang & Blei (2013) Wang, Chong and Blei, David M. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005–1031, 2013.
 Zeiler (2012) Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Examples of Splitting for Variational-Gaussian Inference
We give detailed derivations for the splitting examples shown in Section 3.1 of the main paper. As in the main paper, we denote the Gaussian posterior distribution by $q(z)$, so that $q(z) := \mathcal{N}(z \mid m, V)$ with $m$ being the mean and $V$ being the covariance matrix.
A.1 Gaussian Process (GP) Models
Consider GP models for input-output pairs $\{x_n, y_n\}$ indexed by $n = 1, 2, \ldots, N$. Let $z$ be the vector of latent-function values drawn from a GP with a zero-mean function and a covariance function $\kappa(x, x')$. We denote the kernel matrix obtained on the data $x_n$ for all $n$ by $K$.
We use a non-Gaussian likelihood $p(y_n \mid z_n)$ to model the output, and assume that each $y_n$ is independently sampled from this likelihood given $z_n$. The joint distribution over $y$ and $z$ is shown below:
(21)  $p(y, z) = \left[ \prod_{n=1}^{N} p(y_n \mid z_n) \right] \mathcal{N}(z \mid 0, K)$
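As a concrete sketch of Eq. (21), the snippet below evaluates the joint log-density for a GP model. The squared-exponential kernel and the Bernoulli-logit likelihood $\log p(y_n \mid z_n) = y_n z_n - \log(1 + e^{z_n})$ are illustrative assumptions here, not choices fixed by the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix K for inputs X of shape (N, d)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def joint_log_density(y, z, K):
    """log p(y, z) = sum_n log p(y_n | z_n) + log N(z | 0, K).

    Uses a Bernoulli-logit likelihood, log p(y_n | z_n) = y_n z_n - log(1 + e^{z_n}),
    as an illustrative non-Gaussian likelihood.
    """
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))  # non-Gaussian likelihood terms
    log_prior = multivariate_normal.logpdf(z, mean=np.zeros(len(z)), cov=K)  # GP prior
    return log_lik + log_prior
```

Any other non-Gaussian likelihood (e.g. Poisson for counts) can be swapped into `joint_log_density` without changing the Gaussian prior term.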
The ratio required for the lower bound is shown below, along with the split, where the non-Gaussian terms are the likelihood factors $p(y_n \mid z_n)$ and the Gaussian terms are $\mathcal{N}(z \mid 0, K)$ and $q(z)$:
(22)  $\frac{p(y, z)}{q(z)} = \left[ \prod_{n=1}^{N} p(y_n \mid z_n) \right] \frac{\mathcal{N}(z \mid 0, K)}{q(z)}$
By substituting in Eq. 1 of the main paper, we can obtain the lower bound after a few simplifications, as shown below:
(23)  $\mathcal{L}(m, V) = \mathbb{E}_{q}\left[ \log p(y, z) - \log q(z) \right]$
(24)  $= \sum_{n=1}^{N} \mathbb{E}_{q}\left[ \log p(y_n \mid z_n) \right] + \mathbb{E}_{q}\left[ \log \frac{\mathcal{N}(z \mid 0, K)}{q(z)} \right]$
(25)  $= \sum_{n=1}^{N} \mathbb{E}_{q}\left[ \log p(y_n \mid z_n) \right] - \mathbb{D}_{KL}\left[ \mathcal{N}(z \mid m, V) \,\|\, \mathcal{N}(z \mid 0, K) \right]$
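The expectation of the non-Gaussian likelihood terms in Eq. (25) generally has no closed form, so one simple way to evaluate the bound is by Monte Carlo with reparameterized samples $z = m + L\epsilon$, $V = LL^\top$. The sketch below is a minimal illustration under assumed choices (a Bernoulli-logit likelihood, a fixed seed, and `n_samples` draws), not the paper's own estimator:

```python
import numpy as np

def gauss_kl(m, V, K):
    """KL[N(m, V) || N(0, K)], cf. Eq. (26)."""
    D = len(m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_V = np.linalg.slogdet(V)
    return 0.5 * (np.trace(np.linalg.solve(K, V))
                  + m @ np.linalg.solve(K, m) - D + logdet_K - logdet_V)

def elbo_mc(y, m, V, K, n_samples=1000, seed=0):
    """Monte Carlo estimate of Eq. (25):
    L(m, V) = sum_n E_q[log p(y_n | z_n)] - KL[N(m, V) || N(0, K)],
    using an illustrative Bernoulli-logit likelihood.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(V)
    eps = rng.standard_normal((n_samples, len(m)))
    z = m + eps @ L.T  # reparameterized samples z ~ N(m, V)
    exp_log_lik = np.mean(np.sum(y * z - np.log1p(np.exp(z)), axis=1))
    return exp_log_lik - gauss_kl(m, V, K)
```

Only the expectation term is estimated by sampling; the KL between the two Gaussians is available in closed form, which keeps the estimator's variance low.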
The assumption A2 is satisfied since the KL divergence is convex in both $m$ and $V$. This is clear from the expression of the KL divergence:
(26)  $\mathbb{D}_{KL}\left[ \mathcal{N}(z \mid m, V) \,\|\, \mathcal{N}(z \mid 0, K) \right] = \frac{1}{2}\left[ \operatorname{tr}(K^{-1} V) + m^\top K^{-1} m - D + \log |K| - \log |V| \right]$
where $D$ is the dimensionality of $z$. Convexity w.r.t. $m$ follows from the fact that the above is quadratic in $m$ and $K^{-1}$ is positive semidefinite. Convexity w.r.t. $V$ follows from the concavity of $\log |V|$ (the trace term is linear in $V$, so it does not matter).
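These convexity claims are easy to sanity-check numerically via midpoint convexity, $f\!\left(\tfrac{x_1+x_2}{2}\right) \le \tfrac{1}{2}\left[f(x_1)+f(x_2)\right]$. The helper names below are our own; the KL formula is Eq. (26):

```python
import numpy as np

def gauss_kl(m, V, K):
    """KL[N(m, V) || N(0, K)] of Eq. (26)."""
    D = len(m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_V = np.linalg.slogdet(V)
    return 0.5 * (np.trace(np.linalg.solve(K, V))
                  + m @ np.linalg.solve(K, m) - D + logdet_K - logdet_V)

def midpoint_convex_in_m(m1, m2, V, K):
    """Check KL((m1+m2)/2) <= (KL(m1) + KL(m2))/2 with V fixed."""
    lhs = gauss_kl(0.5 * (m1 + m2), V, K)
    rhs = 0.5 * (gauss_kl(m1, V, K) + gauss_kl(m2, V, K))
    return lhs <= rhs + 1e-10

def midpoint_convex_in_V(m, V1, V2, K):
    """Check KL at (V1+V2)/2 <= average of KLs, with m fixed."""
    lhs = gauss_kl(m, 0.5 * (V1 + V2), K)
    rhs = 0.5 * (gauss_kl(m, V1, K) + gauss_kl(m, V2, K))
    return lhs <= rhs + 1e-10
```

A numerical check is of course no substitute for the proof above; it is just a quick guard against sign errors in an implementation of Eq. (26).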
Assumption A1 depends on the choice of the likelihood