We develop stochastic variational inference, a scalable algorithm for
approximating posterior distributions. We develop this technique for a large
class of probabilistic models and we demonstrate it with two probabilistic
topic models, latent Dirichlet allocation and the hierarchical Dirichlet
process topic model. Using stochastic variational inference, we analyze several
large collections of documents: 300K articles from Nature, 1.8M articles from
The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can
easily handle data sets of this size and outperforms traditional variational
inference, which can only handle a smaller subset. (We also show that the
Bayesian nonparametric topic model outperforms its parametric counterpart.)
Stochastic variational inference lets us apply complex Bayesian models to
massive data sets.

We propose extreme stochastic variational inference (ESVI), an asynchron...

Code Repositories

svinet

This package implements algorithms for identifying overlapping communities in large undirected networks. The sampling based algorithms derive from stochastic variational inference under the (assortative) mixed-membership stochastic blockmodel. For details see the following reference: http://www.pnas.org/content/early/2013/08/14/1221839110.full.pdf

Modern data analysis requires computation with massive data. As
examples, consider the following. (1) We have an archive of the raw
text of two million books, scanned and stored online. We want to
discover the themes in the texts, organize the books by subject, and
build a navigator for users to explore our collection. (2) We have
data from an online shopping website containing millions of users’
purchase histories as well as descriptions of each item in the
catalog. We want to recommend items to users based on this
information. (3) We are continuously collecting data from an online
feed of photographs. We want to build a classifier from these
data. (4) We have measured the gene sequences of millions of people.
We want to make hypotheses about connections between observed genes
and other traits.

These problems illustrate some of the challenges to modern data
analysis. Our data are complex and high-dimensional; we have
assumptions to make—from science, intuition, or other data
analyses—that involve structures we believe exist in the data but
that we cannot directly observe; and finally our data sets are large,
possibly even arriving in a never-ending stream.

Statistical machine learning research has addressed some of these
challenges by developing the field of probabilistic modeling, a field
that provides an elegant approach to developing new methods for
analyzing
data (Pearl, 1988; Jordan, 1999; Bishop, 2006; Koller and Friedman, 2009; Murphy, 2012).
In particular, probabilistic graphical models give us a
visual language for expressing assumptions about data and its hidden
structure. The corresponding posterior inference algorithms
let us analyze data under those assumptions, inferring the hidden
structure that best explains our observations.

In descriptive tasks, like problems #1 and #4 above, graphical
models help us explore the data—the organization of books or the
connections between genes and traits—with the hidden structure
probabilistically “filled in.” In predictive tasks, like problems
#2 and #3, we use models to form predictions about new observations.
For example, we can make recommendations to users or predict the class
labels of new images. With graphical models, we enjoy a powerful
suite of probability models to connect and combine; and we have
general-purpose computational strategies for connecting models to data
and estimating the quantities needed to use them.

The problem we face is scale. Inference algorithms of the 1990s and
2000s used to be considered scalable, but they cannot easily handle
the amount of data that we described in the four examples above. This
is the problem we address here. We present an approach to computing
with graphical models that is appropriate for massive data sets, data
that might not fit in memory or even be stored locally. Our method
does not require clusters of computers or specialized hardware, though
it can be further sped up with these amenities.

As an example of this approach to data analysis, consider topic
models. Topic models are probabilistic models of text used to uncover
the hidden thematic structure in a collection of
documents (Blei, 2012). The main idea in a topic model is that
there are a set of topics that describe the collection and each
document exhibits those topics with different degrees. As a
probabilistic model, the topics and how they relate to the documents
are hidden structure and the main computational problem is to infer
this hidden structure from an observed collection. Figure 1
illustrates the results of our algorithm on a probabilistic topic
model. These are two sets of topics, weighted distributions over the
vocabulary, found in 1.8M articles from the New York Times
and 300,000 articles from Nature. Topic models are motivated
by applications that require analyzing massive collections of
documents like this, but traditional algorithms for topic model
inference do not easily scale collections of this size.

Our algorithm builds on variational inference, a method that transforms
complex inference problems into high-dimensional optimization problems
(Jordan et al., 1999; Wainwright and Jordan, 2008). Traditionally, the optimization
is solved with a coordinate ascent algorithm, iterating between
re-analyzing every data point in the data set and re-estimating its
hidden structure. This is inefficient for large data sets, however,
because it requires a full pass through the data at each iteration.

In this paper we derive a more efficient algorithm by using stochastic
optimization (Robbins and Monro, 1951), a technique that follows noisy
estimates of the gradient of the objective. When used in variational
inference, we show that this gives an algorithm which iterates between
subsampling the data and adjusting the hidden structure based only on
the subsample. This is much more efficient than traditional
variational inference. We call our method stochastic
variational inference.

We will derive stochastic variational inference for a large class of
graphical models. We will study its performance on two kinds of
probabilistic topic models. In particular, we demonstrate stochastic
variational inference on latent Dirichlet allocation
(Blei et al., 2003), a simple topic model, and the hierarchical
Dirichlet process topic model (Teh et al., 2006a), a more flexible model
where the number of discovered topics grows with the data. (This
latter application demonstrates how to use stochastic variational
inference in a variety of Bayesian nonparametric settings.) Stochastic
variational inference can efficiently analyze massive data sets with
complex probabilistic models.

Technical summary.

We now turn to the technical context of
our method. In probabilistic modeling, we use hidden variables to
encode hidden structure in observed data; we articulate the
relationship between the hidden and observed variables with a
factorized probability distribution (i.e., a graphical model); and we
use inference algorithms to estimate the posterior distribution, the
conditional distribution of the hidden structure given the
observations.

Consider a graphical model of hidden and observed random variables for
which we want to compute the posterior. For many models of interest,
this posterior is not tractable to compute and we must appeal to
approximate methods. The two most prominent strategies in statistics
and machine learning are Markov chain Monte Carlo (MCMC) sampling and
variational inference. In MCMC sampling, we construct a Markov chain
over the hidden variables whose stationary distribution is the
posterior of interest

(Metropolis et al., 1953; Hastings, 1970; Geman and Geman, 1984; Gelfand and Smith, 1990; Robert and Casella, 2004). We
run the chain until it has (hopefully) reached equilibrium and collect
samples to approximate the posterior. In variational inference, we
define a flexible family of distributions over the hidden variables,
indexed by free parameters (Jordan et al., 1999; Wainwright and Jordan, 2008). We
then find the setting of the parameters (i.e., the member of the
family) that is closest to the posterior. Thus we solve the inference
problem by solving an optimization problem.

Neither MCMC nor variational inference scales easily to the kinds of
settings described in the first paragraph. Researchers have proposed
speed-ups of both approaches, but these usually are tailored to
specific models or compromise the correctness of the algorithm (or
both). Here, we develop a general variational method that scales.

As we mentioned above, the main idea in this work is to use stochastic
optimization (Robbins and Monro, 1951; Spall, 2003)

. In stochastic
optimization, we find the maximum of an objective function by
following noisy (but unbiased) estimates of its gradient.
Under the right conditions, stochastic optimization
algorithms provably converge to an optimum of the objective.
Stochastic optimization is particularly attractive when the objective
(and therefore its gradient) is a sum of many terms that can be
computed independently. In that setting, we can cheaply compute noisy
gradients by subsampling only a few of these terms.

Variational inference is amenable to stochastic optimization because
the variational objective decomposes into a sum of terms, one for each
data point in the analysis. We can cheaply obtain noisy estimates of
the gradient by subsampling the data and computing a scaled gradient
on the subsample. If we sample independently then the expectation of
this noisy gradient is equal to the true gradient. With one more
detail—the idea of a natural
gradient (Amari, 1998)—stochastic variational inference has an
attractive form:

Subsample one or more data points from the data.

Analyze the subsample using the current variational parameters.

Implement a closed-form update of the variational parameters.

Repeat.
While traditional algorithms require repeatedly analyzing the whole
data set before updating the variational parameters, this algorithm
only requires that we analyze randomly sampled subsets. We will show
how to use this algorithm for a large class of graphical models.

Related work. Variational inference for probabilistic
models was pioneered in the mid-1990s. In Michael Jordan’s lab, the
seminal papers of Saul et al. (1996); Saul and Jordan (1996) and Jaakkola (1997)
grew out of reading the statistical physics
literature (Peterson and Anderson, 1987; Parisi, 1988). In parallel, the
mean-field methods explained in Neal and Hinton (1999) (originally published
in 1993) and Hinton and Van Camp (1993) led to variational algorithms for
mixtures of experts (Waterhouse et al., 1996).

In subsequent years, researchers began to understand the potential for
variational inference in more general settings and developed generic
algorithms for conjugate exponential-family
models (Attias, 1999, 2000; Wiegerinck, 2000; Ghahramani and Beal, 2001; Xing et al., 2003).
These innovations led to automated variational inference, allowing a
practitioner to write down a model and immediately use variational
inference to estimate its posterior (Bishop et al., 2003). For good
reviews of variational inference see Jordan et al. (1999)
and Wainwright and Jordan (2008).

In this paper, we develop scalable methods for generic Bayesian
inference by solving the variational inference problem with stochastic
optimization

(Robbins and Monro, 1951). Our algorithm builds on the
earlier approach of Sato (2001), whose algorithm only applies to
the limited set of models that can be fit with the EM
algorithm (Dempster et al., 1977). Specifically, we generalize his
approach to the much wider set of probabilistic models that are
amenable to closed-form coordinate ascent inference. Further, in the
sense that EM itself is a mean-field method (Neal and Hinton, 1999), our
algorithm builds on the stochastic optimization approach to
EM (Cappé and Moulines, 2009). Finally, we note that stochastic optimization
was also used with variational inference in Platt et al. (2008) for fast
approximate inference in a specific model of web service activity.

For approximate inference, the main alternative to variational methods
is Markov chain Monte Carlo (MCMC) (Robert and Casella, 2004). Despite its
popularity in Bayesian inference, relatively little work has focused
on developing MCMC algorithms that can scale to very large data sets.
One exception is sequential Monte Carlo, although these typically lack
strong convergence guarantees (Doucet et al., 2001). Another is the
stochastic gradient Langevin method of Welling and Teh (2011), which
enjoys asymptotic convergence guarantees and also takes advantage of
stochastic optimization. Finally, in topic modeling, researchers have
developed several approaches to parallel
MCMC (Newman et al., 2009; Smola and Narayanamurthy, 2010; Ahmed et al., 2012).

The organization of this paper. In Section 2, we
review variational inference for graphical models and then derive
stochastic variational inference. In Section 3, we review
probabilistic topic models and Bayesian nonparametric models and then
derive the stochastic variational inference algorithms in these
settings. In Section 4, we study stochastic variational
inference on several large text data sets.

2 Stochastic Variational Inference

We derive stochastic variational inference, a stochastic
optimization algorithm for mean-field variational inference. Our
algorithm approximates the posterior distribution of a probabilistic
model with hidden variables, and can handle massive data sets of
observations.

We divide this section into four parts.

We define the class of models to which our algorithm applies.
We define local and global hidden variables, and
requirements on the conditional distributions within the model.

We review mean-field variational inference, an
approximate inference strategy that seeks a tractable distribution
over the hidden variables which is close to the posterior
distribution. We derive the traditional variational inference
algorithm for our class of models, which is a coordinate ascent
algorithm.

We review the natural gradient and derive the natural
gradient of the variational objective function. The natural
gradient closely relates to coordinate ascent variational inference.

We review stochastic optimization, a technique that uses noisy
estimates of a gradient to optimize an objective function, and apply
it to variational inference. Specifically, we use stochastic
optimization with noisy estimates of the natural gradient of the
variational objective. These estimates arise from repeatedly
subsampling the data set. We show how the resulting algorithm,
stochastic variational inference, easily builds on
traditional variational inference algorithms but can handle much
larger data sets.

2.1 Models with local and global hidden variables

Our class of models involves observations, global hidden variables,
local hidden variables, and fixed parameters. The N observations
are x=x1:N

β;
the N local hidden variables are z=z1:N, each of which is a
collection of J variables zn=zn,1:J; the vector of fixed
parameters is α. (Note we can easily allow α to partly
govern any of the random variables, such as fixed parts of the
conditional distribution of observations. To keep notation simple, we
assume that they only govern the global hidden variables.)

The joint distribution factorizes into a global term and a product of
local terms,

p(x,z,β|α)=p(β|α)N∏n=1p(xn,zn|β).

(1)

Figure 2 illustrates the graphical model. Our goal is to
approximate the posterior distribution of the hidden variables given
the observations, p(β,z|x).

The distinction between local and global hidden variables is
determined by the conditional dependencies. In particular, the nth
observation xn and the nth local variable zn are conditionally
independent, given global variables β, of all other observations
and local hidden variables,

p(xn,zn|x−n,z−n,β,α)=p(xn,zn|β,α).

The notation x−n and z−n refers to the set of variables
except the nth.

This kind of model frequently arises in Bayesian statistics. The
global variables

β are parameters endowed with a prior
p(β) and each local variable zn contains the hidden structure
that governs the n

th observation. For example, consider a Bayesian
mixture of Gaussians. The global variables are the mixture
proportions and the means and variances of the mixture components; the
local variable

zn is the hidden cluster label for the nth
observation xn.

We have described the independence assumptions of the hidden
variables. We make further assumptions about the complete
conditionals in the model. A complete conditional is the
conditional distribution of a hidden variable given the other hidden
variables and the observations. We assume that these distributions
are in the exponential family,

The scalar functions h(⋅) and a(⋅) are respectively the
base measure and log-normalizer; the vector functions
η(⋅) and t(⋅) are respectively the natural
parameter and sufficient statistics.^{1}^{1}1We use
overloaded notation for the functions h(⋅) and t(⋅) so
that they depend on the names of their arguments; for example,
h(znj) can be thought of as a shorthand for the more formal
(but more cluttered) notation hznj(znj). This is
analogous to the standard convention of overloading the probability
function p(⋅). These are conditional distributions, so the
natural parameter is a function of the variables that are being
conditioned on. (The subscripts on the natural parameter η
indicate complete conditionals for local or global variables.) For
the local variables znj, the complete conditional distribution
is determined by the global variables β and the other local
variables in the nth context, i.e., the nth data point xn and
the local variables zn,−j. This follows from the factorization
in Equation 1.

These assumptions on the complete conditionals imply a conjugacy
relationship between the global variables β and the local
contexts (zn,xn), and this relationship implies a specific form
of the complete conditional for β. Specifically, the
distribution of the local context given the global variables must be in an
exponential family,

p(xn,zn|β)=h(xn,zn)exp{β⊤t(xn,zn)−aℓ(β)}.

(4)

The prior distribution p(β) must also be in an exponential
family,

α has two
components α=(α1,α2). The first component
α1 is a vector of the same dimension as β; the second
component α2 is a scalar.

Equations 4 and 5 imply that the
complete conditional for the global variable in Equation 2 is
in the same exponential family as the prior with natural parameter

ηg(x,z,α)=(α1+∑Nn=1t(zn,xn),α2+N).

(6)

This form will be important when we derive stochastic variational
inference in Section 2.4. See
Bernardo and Smith (1994) for a general discussion of conjugacy and the
exponential family.

This family of distributions—those with local and global variables,
and where the complete conditionals are in the exponential
family—contains many useful statistical models from the machine
learning and statistics literature. Examples include Bayesian mixture
models (Ghahramani and Beal, 2000; Attias, 2000), latent Dirichlet
allocation (Blei et al., 2003)

(Gelman and Hill, 2007), hierarchical probit
classification models (McCullagh and Nelder, 1989; Girolami and Rogers, 2006),
probabilistic factor analysis/matrix factorization
models (Spearman, 1904; Tipping and Bishop, 1999; Collins et al., 2002; Wang, 2006; Salakhutdinov and Mnih, 2008; Paisley and Carin, 2009; Hoffman et al., 2010b), certain Bayesian nonparametric mixture
models (Antoniak, 1974; Escobar and West, 1995; Teh et al., 2006a), and
others.^{2}^{2}2We note that our assumptions can be relaxed to the
case where the full conditional p(β|x,z) is not tractable,
but each partial conditional p(βk|x,z,β−k)
associated with the global variable βk is in a tractable
exponential family. The topic models of the next section do not
require this complexity, so we chose to keep the derivation a little
simpler.

Analyzing data with one of these models amounts to computing the
posterior distribution of the hidden variables given the observations,

p(z,β|x)=p(x,z,β)∫p(x,z,β)dzdβ.

(7)

We then use this posterior to explore the hidden structure of our data
or to make predictions about future data. For many models however,
such as the examples listed above, the denominator in Equation 7
is intractable to compute. Thus we resort to approximate posterior
inference, a problem that has been a focus of modern Bayesian
statistics. We now turn to mean-field variational inference, the
approximation inference technique which roots our strategy for
scalable inference.

2.2 Mean-field variational inference

Variational inference casts the inference problem as an
optimization. We introduce a family of distributions over the hidden
variables that is indexed by a set of free parameters, and then
optimize those parameters to find the member of the family that is
closest to the posterior of interest. (Closeness is measured with
Kullback-Leibler divergence.) We use the resulting distribution,
called the

variational distribution, to approximate the
posterior.

In this section we review mean-field variational inference, the form
of variational inference that uses a family where each hidden variable
is independent. We describe the variational objective function,
discuss the mean-field variational family, and derive the traditional
coordinate ascent algorithm for fitting the variational parameters.
This algorithm is a stepping stone to stochastic variational
inference.

The evidence lower bound. Variational inference minimizes
the Kullback-Leibler (KL) divergence from the variational distribution
to the posterior distribution. It maximizes the evidence
lower bound (ELBO), a lower bound on the logarithm of the marginal
probability of the observations logp(x). The ELBO is equal to the
negative KL divergence up to an additive constant.

We derive the ELBO by introducing a distribution over the hidden
variables q(z,β) and using Jensen’s inequality. (Jensen’s
inequality and the concavity of the logarithm function imply that
logE[f(y)]≥E[logf(y)] for any random variable y.) This
gives the following bound on the log marginal,

logp(x)

=log∫p(x,z,β)dzdβ

=log∫p(x,z,β)q(z,β)q(z,β)dzdβ

=log(Eq[p(x,z,β)q(z,β)])

≥Eq[logp(x,z,β)]−Eq[logq(z,β)]

(8)

≜L(q).

The ELBO contains two terms. The first term is the expected log
joint, Eq[logp(x,z,β)]. The second term is the entropy of
the variational distribution, −Eq[logq(z,β)]. Both of
these terms depend on q(z,β), the variational distribution of
the hidden variables.

We restrict q(z,β) to be in a family that is tractable, one for
which the expectations in the ELBO can be efficiently computed. We
then try to find the member of the family that maximizes the ELBO.
Finally, we use the optimized distribution as a proxy for the
posterior.

Solving this maximization problem is equivalent to finding the member
of the family that is closest in KL divergence to the
posterior (Jordan et al., 1999; Wainwright and Jordan, 2008),

KL(q(z,β)||p(z,β|x))

=

Eq[logq(z,β)]−Eq[logp(z,β|x)]

=

Eq[logq(z,β)]−Eq[logp(x,z,β)]+logp(x)

=

−L(q)+const.

logp(x) is replaced by a constant because it does not depend
on q.

The mean-field variational family. The simplest
variational family of distributions is the mean-field family.
In this family, each hidden variable is independent and governed by
its own parameter,

q(z,β)=q(β|λ)N∏n=1J∏j=1q(znj|ϕnj).

(9)

The global parameters λ govern the global variables; the local
parameters ϕn govern the local variables in the nth context.
The ELBO is a function of these parameters.

Equation 9 gives the factorization of the variational family,
but does not specify its form. We set q(β|λ) and
q(znj|ϕnj) to be in the same exponential family as the
complete conditional distributions p(β|x,z) and p(znj|xn,zn,−j,β), from Equations 2 and
3. The variational parameters λ and
ϕnj are the natural parameters to those families,

q(β|λ)

=h(β)exp{λ⊤t(β)−ag(λ)}

(10)

q(znj|ϕnj)

=h(znj)exp{ϕ⊤njt(znj)−aℓ(ϕnj)}.

(11)

These forms of the variational distributions lead to an easy
coordinate ascent algorithm. Further, the optimal mean-field
distribution, without regard to its particular functional form, has
factors in these families (Bishop, 2006).

Note that assuming that these exponential families are the same as
their corresponding conditionals means that t(⋅) and h(⋅)
in Equation 10 are the same functions as t(⋅) and h(⋅)
in Equation 2. Likewise, t(⋅) and h(⋅) in
Equation 11 are the same as in Equation 3. We will sometimes
suppress the explicit dependence on ϕ and λ, substituting
q(znj) for q(znj|ϕnj) and q(β) for
q(β|λ).

The mean-field family has several computational advantages. For one,
the entropy term decomposes,

where Eϕnj[⋅] denotes an expectation with respect to
q(znj|ϕnj) and Eλ[⋅] denotes an
expectation with respect to q(β|λ). Its other
computational advantages will emerge as we derive the gradients of the
variational objective and the coordinate ascent algorithm.

The gradient of the ELBO and coordinate ascent inference.
We have defined the objective function in Equation 8 and the
variational family in Equations 9, 10
and 11. Our goal is to optimize the objective with
respect to the variational parameters.

In traditional mean-field variational inference, we optimize
Equation 8 with coordinate ascent. We iteratively optimize each
variational parameter, holding the other parameters fixed. With the
assumptions that we have made about the model and variational
distribution—that each conditional is in an exponential family and
that the corresponding variational distribution is in the same
exponential family—we can optimize each coordinate in closed form.

We first derive the coordinate update for the parameter λ to the
variational distribution of the global variables q(β|λ). As a function of λ, we can rewrite the objective
as

L(λ)=Eq[logp(β|x,z)]−Eq[logq(β)]+const.

(12)

The first two terms are expectations that involve β; the third
term is constant with respect to λ. The constant absorbs
quantities that depend only on the other hidden variables. Those
quantities do not depend on q(β|λ) because all
variables are independent in the mean-field family.

Equation 12 reproduces the full ELBO in Equation 8. The
second term of Equation 12

is the entropy of the global
variational distribution. The first term derives from the expected
log joint likelihood, where we use the chain rule to separate terms
that depend on the variable

β from terms that do not,

Eq[logp(x,z,β)]=Eq[logp(x,z)]+Eq[logp(β|x,z)].

The constant absorbs Eq[logp(x,z)], leaving
the expected log conditional Eq[logp(β|x,z)].

Finally, we substitute the form of q(β|λ) in
Equation 10 to obtain the final expression for the ELBO as a
function of λ,

L(λ)=Eq[ηg(x,z,α)]⊤∇λag(λ)−λ⊤∇λag(λ)+ag(λ)+const.

(13)

In the first and second terms on the right side, we used the
exponential family identity that the expectation of the sufficient
statistics is the gradient of the log normalizer,
Eq[t(β)]=∇λag(λ). The constant
has further absorbed the expected log normalizer of the conditional
distribution −Eq[ag(ηg(x,z,α))], which does
not depend on q(β).

Equation 13 simplifies the ELBO as a function of the global
variational parameter. To derive the coordinate ascent update, we
take the gradient,

∇λL=∇2λag(λ)(Eq[ηg(x,z,α)]−λ).

(14)

We can set this gradient to zero by setting

λ=Eq[ηg(x,z,α)].

(15)

This sets the global variational parameter equal to the expected
natural parameter of its complete conditional distribution.
Implementing this update, holding all other variational parameters
fixed, optimizes the ELBO over λ. Notice that the mean-field
assumption plays an important role. The update is the expected
conditional parameter Eq[ηg(x,z,α)],
which is an expectation of a function of the other random variables
and observations. Thanks to the mean-field assumption, this
expectation is only a function of the local variational parameters and
does not depend on λ.

We now turn to the local parameters ϕnj. The gradient is
nearly identical to the global case,

∇ϕnjL=∇2ϕnjaℓ(ϕnj)(Eq[ηℓ(xn,zn,−j,β)]−ϕnj).

It equals zero when

ϕnj=Eq[ηℓ(xn,zn,−j,β)].

(16)

Mirroring the global update, this expectation does not depend on
ϕnj. However, while the global update in Equation 15
depends on all the local variational parameters—and note there is a
set of local parameters for each of the N observations—the local
update in Equation 16 only depends on the global parameters and
the other parameters associated with the nth context. The
computational difference between local and global updates will be
important in the scalable algorithm of Section 2.4.

The updates in Equations 15 and
16 form the algorithm for coordinate ascent
variational inference, iterating between updating each local parameter
and the global parameters. The full algorithm is in
Figure 3, which is guaranteed to find a local optimum of
the ELBO. Computing the expectations at each step is easy for
directed graphical models with tractable complete conditionals, and in
Section 3 we show that these updates are tractable for
many topic models. Figure 3 is the “classical”
variational inference algorithm, used in many settings.

As an aside, these updates reveal a connection between mean-field
variational inference and Gibbs sampling (Gelfand and Smith, 1990)

. In
Gibbs sampling, we iteratively sample from each complete conditional.
In variational inference, we take variational expectations of the
natural parameters of the same distributions. The updates also show a
connection to the expectation-maximization (EM) algorithm

(Dempster et al., 1977)—Equation 16 corresponds to the E step,
and Equation 15 corresponds to the M step (Neal and Hinton, 1999).

We mentioned that the local steps (Steps 3 and 4 in
Figure 3) only require computation with the global
parameters and the nth local context. Thus, the data can be
distributed across many machines and the local variational updates can
be implemented in parallel. These results can then be aggregated in
Step 6 to find the new global variational parameters.

However, the local steps also reveal an inefficiency in the algorithm.
The algorithm begins by initializing the global parameters λ
randomly—the initial value of λ does not reflect any
regularity in the data. But before completing even one iteration, the
algorithm must analyze every data point using these initial (random)
values. This is wasteful, especially if we expect that we can learn
something about the global variational parameters from only a subset
of the data.

We solve this problem with stochastic optimization. This leads to
stochastic variational inference, an efficient algorithm that
continually improves its estimate of the global parameters as it
analyzes more observations. Though the derivation requires some
details, we have now described all of the computational components of
the algorithm. (See Figure 4.) At each iteration, we sample
a data point from the data set and compute its optimal local
variational parameters; we form intermediate global
parameters using classical coordinate ascent updates where the
sampled data point is repeated N times; finally, we set the new
global parameters to a weighted average of the old estimate and the
intermediate parameters.

The algorithm is efficient because it need not analyze the whole data
set before improving the global variational parameters, and the
per-iteration steps only require computation about a single local
context. Furthermore, it only uses calculations from classical
coordinate inference. Any existing implementation of variational
inference can be easily configured to this scalable alternative.

We now show how stochastic inference arises by applying stochastic
optimization to the natural gradients of the variational objective.
We first discuss natural gradients and their relationship to the
coordinate updates in mean-field variational inference.

2.3 The natural gradient of the ELBO

The natural gradient of a function accounts for the information
geometry of its parameter space, using a Riemannian metric to adjust
the direction of the traditional gradient. Amari (1998)
discusses natural gradients for maximum-likelihood estimation, which
give faster convergence than standard gradients. In this section we
describe Riemannian metrics for probability distributions and the
natural gradient of the ELBO.

Gradients and probability distributions. The classical
gradient method for maximization tries to find a maximum of a function
f(λ) by taking steps of size ρ in the direction of the
gradient,

λ(t+1)=λ(t)+ρ∇λf(λ(t)).

The gradient (when it exists) points in the direction of steepest
ascent. That is, the gradient ∇λf(λ) points in
the same direction as the solution to

argmaxdλf(λ+dλ)subject to ||dλ||2<ϵ2

(17)

for sufficiently small ϵ. Equation 17 implies that
if we could only move a tiny distance ϵ away from λ
then we should move in the direction of the gradient. Initially this
seems reasonable, but there is a complication. The gradient direction
implicitly depends on the Euclidean distance metric associated with
the space in which λ lives. However, the Euclidean metric
might not capture a meaningful notion of distance between settings of
λ.

The problem with Euclidean distance is especially clear in our
setting, where we are trying to optimize an objective with respect to
a parameterized probability distribution q(β|λ). When
optimizing over a probability distribution, the Euclidean distance
between two parameter vectors λ and λ′ is often a poor
measure of the dissimilarity of the distributions q(β|λ) and q(β|λ′). For example, suppose q(β)
is a univariate normal and λ is the mean μ and scale
σ. The distributions N(0,10000) and N(10,10000) are
almost indistinguishable, and the Euclidean distance between their
parameter vectors is 10. In contrast, the distributions N(0,0.01)
and N(0.1,0.01) barely overlap, but this is not reflected in the
Euclidean distance between their parameter vectors, which is only 0.1.
The natural gradient corrects for this issue by redefining
the basic definition of the gradient (Amari, 1998).

Natural gradients and probability distributions. A
natural measure of dissimilarity between probability distributions is
the symmetrized KL divergence

Symmetrized KL depends on the distributions themselves, rather than on
how they are parameterized; it is invariant to parameter transformations.

With distances defined using symmetrized KL, we find the direction of
steepest ascent in the same way as for gradient methods,

argmaxdλf(λ+dλ)subject toDsymKL(λ,λ+dλ)<ϵ.

(19)

As ϵ→0, the solution to this problem points in the
same direction as the natural gradient. While the Euclidean
gradient points in the direction of steepest ascent in Euclidean
space, the natural gradient points in the direction of steepest ascent
in the Riemannian space, i.e., the space where local distance is
defined by KL divergence rather than the L2 norm.

We manage the more complicated constraint in Equation 19 with
a Riemannian metric G(λ) (Do Carmo, 1992)

λ under which the squared Euclidean
distance between λ and a nearby vector λ+dλ is
the KL between q(β|λ) and q(β|λ+dλ),

dλTG(λ)dλ=DsymKL(λ,λ+dλ),

(20)

and note that the transformation can be a function of λ.
Amari (1998) showed that we can compute the natural gradient by
premultiplying the gradient by the inverse of the Riemannian metric
G(λ)−1,

^∇λf(λ)≜G(λ)−1∇λf(λ),

where G is the Fisher information matrix of q(λ)
(Amari, 1982; Kullback and Leibler, 1951),

G(λ)=Eλ[(∇λlogq(β|λ))(∇λlogq(β|λ))⊤].

(21)

We can show that Equation 21 satisfies Equation 20 by
approximating logq(β|λ+dλ) using the
first-order Taylor approximations about λ

This follows from the exponential family identity that the Hessian of
the log normalizer function a with respect to the natural parameter
λ is the covariance matrix of the sufficient statistic vector
t(β).

Natural gradients and mean field variational inference.
We now return to variational inference and compute the natural
gradient of the ELBO with respect to the variational parameters.
Researchers have used the natural gradient in variational inference
for nonlinear state space models (Honkela et al., 2008) and Bayesian
mixtures (Sato, 2001).^{3}^{3}3Our work here—using the natural
gradient in a stochastic optimization algorithm—is closest to that
of Sato (2001), though we develop the algorithm via a
different path and Sato does not address models for which the joint
conditional p(zn|β,xn) is not tractable.

Consider the global variational parameter λ. The gradient of
the ELBO with respect to λ is in Equation 14. Since
λ is a natural parameter to an exponential family
distribution, the Fisher metric defined by q(β) is
∇2λag(λ). Note that the Fisher metric is the
first term in Equation 14. We premultiply the gradient by the
inverse Fisher information to find the natural gradient. This reveals
that the natural gradient has the following simple form,

^∇λL=Eϕ[ηg(x,z,α)]−λ.

(22)

An analogous computation goes through for the local variational
parameters,

^∇ϕnjL=Eλ,ϕn,−j[ηℓ(xn,zn,−j,β)]−ϕnj.

The natural gradients are closely related to the coordinate ascent
updates of Equation 15 or Equation 16. Consider a full
set of variational parameters λ and ϕ. We can
compute the natural gradient by computing the coordinate updates in
parallel and subtracting the current setting of the parameters. The
classical coordinate ascent algorithm can thus be interpreted as a
projected natural gradient algorithm (Sato, 2001). Updating a
parameter by taking a natural gradient step of length one is
equivalent to performing a coordinate update.

We motivated natural gradients by mathematical reasoning around the
geometry of the parameter space. More importantly, however, natural
gradients are easier to compute than classical gradients. They are
easier to compute because premultiplying by the Fisher information
matrix—which we must do to compute the classical gradient
in Equation 14 but which disappears from the natural gradient
in Equation 22—is prohibitively expensive for variational
parameters with many components. In the next section we will see that
efficiently computing the natural gradient lets us develop scalable
variational inference algorithms.

2.4 Stochastic variational inference

The coordinate ascent algorithm in Figure 3 is inefficient
for large data sets because we must optimize the local variational
parameters for each data point before re-estimating the global
variational parameters. Stochastic variational inference uses
stochastic optimization to fit the global variational parameters. We
repeatedly subsample the data to form noisy estimates of the natural
gradient of the ELBO, and we follow these estimates with a decreasing
step-size.

We have reviewed mean-field variational inference in models with
exponential family conditionals and showed that the natural gradient
of the variational objective function is easy to compute. We now
discuss stochastic optimization, which uses a series of noisy
estimates of the gradient, and use it with noisy natural gradients to
derive stochastic variational inference.

Stochastic optimization. Stochastic optimization
algorithms follow noisy estimates of the gradient with a decreasing
step size. Noisy estimates of a gradient are often cheaper to compute
than the true gradient, and following such estimates can allow
algorithms to escape shallow local optima of complex objective
functions. In statistical estimation problems, including variational
inference of the global parameters, the gradient can be written as a
sum of terms (one for each data point) and we can compute a fast noisy
approximation by subsampling the data. With certain conditions on the
step-size schedule, these algorithms provably converge to an
optimum (Robbins and Monro, 1951). Spall (2003) gives an overview of
stochastic optimization; Bottou (2003) gives an overview of its
role in machine learning.

Consider an objective function f(λ) and a random function
B(λ) that has expectation equal to the gradient so that
Eq[B(λ)]=∇λf(λ). The stochastic
gradient algorithm, which is a type of stochastic optimization,
optimizes f(λ) by iteratively following realizations of
B(λ). At iteration t, the update for λ is

λ(t)=λ(t−1)+ρtbt(λ(t−1)),

where bt is an independent draw from the noisy gradient B. If
the sequence of step sizes ρt satisfies

∑ρt=∞∑ρ2t<∞

(23)

then λ(t) will converge to the optimal λ∗ (if f
is convex) or a local optimum of f (if not convex).^{4}^{4}4To find
a local optimum, f must be three-times differentiable and meet a
few mild technical requirements (Bottou, 1998). The
variational objective satisfies these criteria. The same results
apply if we premultiply the noisy gradients bt by a sequence of
positive-definite matrices G−1t

As our notation suggests, we will use the Fisher metric for Gt,
replacing stochastic Euclidean gradients with stochastic natural
gradients.

Stochastic variational inference. We use stochastic
optimization with noisy natural gradients to optimize the variational
objective function. The resulting algorithm is in
Figure 4. At each iteration we have a current setting of the
global variational parameters. We repeat the following steps:

Sample a data point from the set; optimize its local variational
parameters.

Form intermediate global variational parameters, as though we were
running classical coordinate ascent and the sampled data point were
repeated N times to form the collection.

Update the global variational parameters to be a weighted
average of the intermediate parameters and their current setting.

We show that this algorithm is stochastic natural gradient ascent on
the global variational parameters.

Our goal is to find a setting of the global variational parameters
λ that maximizes the ELBO. Writing L as a function
of the global and local variational parameters, Let the function
ϕ(λ) return a local optimum of the local variational
parameters so that

∇ϕL(λ,ϕ(λ))=0.

Define the locally maximized ELBOL(λ) to be
the ELBO when λ is held fixed and the local variational
parameters ϕ are set to a local optimum ϕ(λ),

L(λ)≜L(λ,ϕ(λ)).

We can compute the (natural) gradient of L(λ) by first
finding the corresponding optimal local parameters ϕ(λ) and
then computing the (natural) gradient of L(λ,ϕ(λ)), holding ϕ(λ) fixed. The reason is that
the gradient of L(λ) is the same as the gradient of the
two-parameter ELBO L(λ,ϕ(λ)),

∇λL(λ)

=

∇λL(λ,ϕ(λ))+(∇λϕ(λ))⊤∇ϕL(λ,ϕ(λ))

(24)

=

∇λL(λ,ϕ(λ)),

(25)

where ∇λϕ(λ) is the Jacobian of
ϕ(λ) and we use the fact that the gradient of L(λ,ϕ) with respect to ϕ is zero at
ϕ(λ).

Stochastic variational inference optimizes the maximized ELBO L(λ) by subsampling the data to form noisy estimates of the
natural gradient. First, we decompose L(λ) into a global term and a sum of local terms,

Now consider a variable that chooses an index of the data uniformly at
random, I∼Unif(1,…,N). Define LI(λ) to be the following random function of the
variational parameters,

The expectation of LI is equal to the objective in
Equation 26. Therefore, the natural gradient of LI with respect to each global variational parameter λ
is a noisy but unbiased estimate of the natural gradient of the
variational objective. This process—sampling a data point and then
computing the natural gradient of LI—will provide cheaply
computed noisy gradients for stochastic optimization.

We now compute the noisy gradient. Suppose we have sampled the ith
data point. Notice that Equation 27 is equivalent to the full
objective of Equation 26 where the ith data point is
observed N times. Thus the natural gradient of
Equation 27—which is a noisy natural gradient of the
ELBO—can be found using Equation 22,

^∇Li=Eq[ηg(x(N)i,z(N)i,α)]−λ,

where {x(N)i,z(N)i} are a data set formed by N
replicates of observation xn and hidden variables zn.

We compute this expression in more detail. Recall the complete
conditional ηg(x,z,α) from Equation 6.
From this equation, we can compute the conditional natural parameter
for the global parameter given N replicates of xn,

ηg(x(N)i,z(N)i,α)=α+N⋅(t(xn,zn),1).

Using this in the natural gradient of Equation 22 gives a noisy
natural gradient,

^∇λLi=α+N⋅(Eϕi(λ)[t(xi,zi)],1)−λ,

where ϕi(λ) gives the elements of ϕ(λ)
associated with the ith local context. While the full natural
gradient would use the local variational parameters for the whole data
set, the noisy natural gradient only considers the local parameters
for one randomly sampled data point. These noisy gradients are
cheaper to compute.

Finally, we use the noisy natural gradients in a Robbins-Monro
algorithm to optimize the ELBO. We sample a data point xi at each
iteration. Define the intermediate global parameter ^λt
to be the estimate of λ that we would obtain if the sampled
data point was replicated N times,

^λt≜α+NEϕi(λ)[(t(xi,zi),1)].

This comprises the first two terms of the noisy natural gradient. At
each iteration we use the noisy gradient (with step size ρt) to
update the global variational parameter. The update is

λ(t)

=

λ(t−1)+ρt(^λt−λ(t−1))

=

(1−ρt)λ(t−1)+ρt^λt.

This is a weighted average of the previous estimate of λ and
the estimate of λ that we would obtain if the sampled data
point was replicated N times.

1: Initialize λ(0) randomly.

2: Set the step-size schedule ρt appropriately.

3:repeat

4: Sample a data point xi uniformly from the data set.

5: Compute its local variational parameter,

ϕ=Eλ(t−1)[ηg(x(N)i,z(N)i)].

6: Compute intermediate global
parameters as though xi is replicated N times,

^λ=Eϕ[ηg(x(N)i,z(N)i)].

7: Update the current estimate
of the global variational parameters,

λ(t)=(1−ρt)λ(t−1)+ρt^λ.

8:until forever

Figure 4: Stochastic variational inference.

Figure 4 presents the full algorithm. At each iteration, the
algorithm has an estimate of the global variational parameter
λ(t−1). It samples a single data point from the data and
cheaply computes the intermediate global parameter ^λt,
i.e., the next value of λ if the data set contained N
replicates of the sampled point. It then sets the new estimate of the
global parameter to be a weighted average of the previous estimate and
the intermediate parameter.

We set the step-size at iteration t as follows,

ρt=(t+τ)−κ.

(28)

This satisfies the conditions in Equation 23. The forgetting
rateκ∈(0.5,1] controls how quickly old information is
forgotten; the delayτ≥0 down-weights early
iterations. In Section 4 we fix the delay to be one and
explore a variety of forgetting rates. Note that this is just one way
to parameterize the learning rate. As long as the step size
conditions in Equation 23 are satisfied, this iterative algorithm
converges to a local optimum of the ELBO.

2.5 Extensions

We now describe two extensions of the basic stochastic inference
algorithm in Figure 4

: the use of multiple samples
(“minibatches”) to improve the algorithm’s stability, and empirical
Bayes methods for hyperparameter estimation.

Minibatches. So far, we have considered stochastic
variational inference algorithms where only one observation xt is
sampled at a time. Many stochastic optimization algorithms benefit
from “minibatches,” i.e., several examples at a
time (Bottou and Bousquet, 2008; Liang et al., 2009; Mairal et al., 2010). In stochastic
variational inference, we can sample a set of S examples at each
iteration xt,1:S (with or without replacement), compute the local
variational parameters ϕs(λ(t−1)) for each data point,
compute the intermediate global parameters ^λs for each
data point xts, and finally average the ^λs variables
in the update

λ(t)=(1−ρt)λ(t−1)+ρtS∑s^λs.

The stochastic natural gradients associated with each point xs have
expected value equal to the gradient. Therefore, the average of these
stochastic natural gradients has the same expectation and the
algorithm remains valid.

There are two reasons to use minibatches. The first reason is to
amortize any computational expenses associated with updating the
global parameters across more data points; for example, if the
expected sufficient statistics of β are expensive to compute,
using minibatches allows us to incur that expense less frequently.
The second reason is that it may help the algorithm to find better
local optima. Stochastic variational inference is guaranteed to
converge to a local optimum but taking large steps on the basis of
very few data points may lead to a poor one. As we will see in
Section 4, using more of the data per update can help the
algorithm.

Empirical Bayes estimation of hyperparameters. In some
cases we may want to both estimate the posterior of the hidden random
variables β and z and obtain a point estimate of the values of
the hyperparameters α. One approach to fitting α is to
try to maximize the marginal likelihood of the data p(x|α),
which is also known as empirical Bayes (Maritz and Lwin, 1989) estimation.
Since we cannot compute p(x|α) exactly, an approximate
approach is to maximize the fitted variational lower bound L
over α. In the non-stochastic setting, α can be
optimized by interleaving the coordinate ascent updates in
Figure 3 with an update for α that increases the
ELBO. This is called variational expectation-maximization.

In the stochastic setting, we update α simultaneously with
λ. We can take a step in the direction of the gradient of the
noisy ELBO Lt (Equation 27) with respect to α,
scaled by the step-size ρt,

α(t)=α(t−1)+ρt∇αLt(λ(t−1),ϕ,α(t−1)).

Here λ(t−1) are the global parameters from the previous
iteration and ϕ are the optimized local parameters for the
currently sampled data point. We can also replace the standard
Euclidean gradient with a natural gradient or Newton step.

3 Stochastic Variational Inference in Topic
Models

We derived stochastic variational inference, a scalable inference
algorithm that can be applied to a large class of hierarchical
Bayesian models. In this section we show how to use the general
algorithm of Section 2 to derive stochastic variational
inference for two probabilistic topic models: latent Dirichlet
allocation (LDA) (Blei et al., 2003) and its Bayesian nonparametric
counterpart, the hierarchical Dirichlet process (HDP) topic
model (Teh et al., 2006a).

Topic models are probabilistic models of document collections that use
latent variables to encode recurring patterns of word
use (Blei, 2012). Topic modeling algorithms are inference
algorithms; they uncover a set of patterns that pervade a collection
and represent each document according to how it exhibits them. These
patterns tend to be thematically coherent, which is why the models are
called “topic models.” Topic models are used for both descriptive
tasks, such as to build thematic navigators of large collections of
documents, and for predictive tasks, such as to aid document
classification. Topic models have been extended and applied in many
domains.

Topic models assume that the words of each document arise from a
mixture of multinomials. Across a collection, the documents share the
same mixture components (called topics). Each document,
however, is associated with its own mixture proportions (called
topic proportions). In this way, topic models represent
documents heterogeneously—the documents share the same set of topics,
but each exhibits them to a different degree. For example, a document
about sports and health will be associated with the sports and health
topics; a document about sports and business will be associated with
the sports and business topics. They both share the sports topic, but
each combines sports with a different topic. More generally, this is
called mixed membership (Erosheva, 2003).

The central computational problem in topic modeling is posterior
inference: Given a collection of documents, what are the topics that
it exhibits and how does each document exhibit them? In practical
applications of topic models, scale is important—these models
promise an unsupervised approach to organizing large collections of
text (and, with simple adaptations, images, sound, and other data).
Thus they are a good testbed for stochastic variational inference.

More broadly, this section illustrates how to use the results from
Section 2 to develop algorithms for specific models. We will
derive the algorithms in several steps: (1) we specify the model
assumptions; (2) we derive the complete conditional distributions of
the latent variables; (3) we form the mean-field variational family;
(4) we derive the corresponding stochastic inference algorithm. In
Section 4, we will report our empirical study of stochastic
variational inference with these models.

Observations are words, organized into documents. The
nth word in the dth document is wdn. Each word is an
element in a fixed vocabulary of V terms.

A topicβk is a distribution over the
vocabulary. Each topic is a point on the V−1-simplex, a positive
vector of length V that sums to one. We denote the wth entry
in the kth topic as βkw. In LDA there are K topics;
in the HDP topic model there are an infinite number of topics.

Each document in the collection is associated with a vector of
topic proportionsθd, which is a distribution over
topics. In LDA θd is a point on the K−1-simplex. In the
HDP topic model, θd is a point on the infinite simplex. (We
give details about this below in Section 3.3.) We denote the kth
entry of the topic proportion vector θd as θdk.

Each word in each document is assumed to have been drawn from a
single topic. The topic assignmentzdn indexes the
topic from which wdn is drawn.

The only observed variables are the words of the documents. The
topics, topic proportions, and topic assignments are latent variables.

3.2 Latent Dirichlet allocation

Var

Type

Conditional

Param

Relevant Expectations

zdn

Multinomial

logθdk+logβk,wdn

ϕdn

E[Zkdn]=ϕkdn

θd

Dirichlet

α+∑Nn=1zdn

γd

E[logθdk]=Ψ(γdk)−∑Kj=1Ψ(γdj)

βk

Dirichlet

η+∑Dd=1∑Nn=1zkdnwdn

λk

E[logβkv]=Ψ(λkv)−∑Vy=1Ψ(λky)

Figure 5: (Top) The graphical model representation of Latent
Dirichlet allocation. Note that in practice each document d may
not have the same number of words N. (Bottom) In LDA: hidden
variables, complete conditionals, variational parameters, and
expected sufficient statistics.

LDA is the simplest topic model. It assumes that each document
exhibits K topics with different proportions. The generative process
is

In LDA, each document exhibits the same shared topics but with
different proportions. LDA assumes Dirichlet priors for βk and
θd. Dirichlet distributions over the D-simplex take D+1
parameters, but for simplicity we assume exchangeable Dirichlet
priors; that is, we require that all of these parameters are set to
the same value. (The prior on βk has parameter η; the
prior on θd has parameter α.). We note that
Blei et al. (2003) and Wallach et al. (2009) found improved empirical
performance with non-exchangeable priors.

LDA models an observed collection of documents \boldmathw=w1:D, where
each wd is a collection of words wd,1:N. Analyzing the
documents amounts to posterior inference of p(β,θ,\boldmathz|\boldmathw). Conditioned on the documents, the posterior distribution
captures the topics that describe them (β=β1:K), the
degree to which each document exhibits those topics (θ=θ1:D), and which topics each word was assigned to (\boldmathz=z1:D,1:N). We can use the posterior to explore large collections
of documents. Figure 1 illustrates posterior topics found
with stochastic variational inference.

The posterior is intractable to compute (Blei et al., 2003).
Approximating the posterior in LDA is a central computational problem
for topic modeling. Researchers have developed many methods, including
Markov chain Monte Carlo methods (Griffiths and Steyvers, 2004), expectation
propagation (Minka and Lafferty, 2002), and variational
inference (Blei et al., 2003; Teh et al., 2006b; Asuncion et al., 2009). Here we use the
results of Section 2 to develop stochastic inference for LDA.
This scales the original variational algorithm for LDA to massive
collections of documents.^{5}^{5}5The algorithm we present was
originally developed in Hoffman et al. (2010a), which is a special
case of the stochastic variational inference algorithm we developed
in Section 2.

Figure 7 illustrates the performance of 100-topic LDA on
three large collections—Nature contains 350K documents,
New York Times contains 1.8M documents, and
Wikipedia contains 3.8M documents. (Section 4
describes the complete study, including the details of the performance
measure and corpora.) We compare two inference algorithms for LDA:
stochastic inference on the full collection and batch inference on a
subset of 100,000 documents. (This is the size of collection that
batch inference can handle.) We see that stochastic variational
inference converges faster and to a better model. It is both more
efficient and lets us handle the full data set.

Indicator vectors and Dirichlet distributions. Before
deriving the algorithm, we discuss two mathematical details. These
will be useful both here and in the next section.

zdn and observed words wdn with indicator
vectors. An indicator vector is a binary vector with a single one.
For example, the topic assignment zdn can take on one of K
values (one for each topic). Thus, it is represented as a K-vector
with a one in the component corresponding to the value of the
variable: if zkdn=1 then the nth word in document d is
assigned to the kth topic. Likewise, wvdn=1 implies that the
nth word in document d is v. In a slight abuse of notation, we
will sometimes use wdn and zdn as indices—for example,
if zkdn=1, then βzdn refers to the kth topic
βk.

Second, we review the Dirichlet distribution. As we described above,
a K-dimensional Dirichlet is a distribution on the K−1-simplex,
i.e., positive vectors over K elements that sum to one. It is
parameterized by a positive K-vector γ,

Dirichlet(θ;γ)=Γ(∑Ki=1γi)∏Ki=1Γ(γi)K∏i=1θγi−1,

where Γ(⋅) is the Gamma function, which is a real-valued
generalization of the factorial function. The expectation of the
Dirichlet is its normalized parameter,

E[θk|γ]=γk∑Ki=1γi.

The expectation of its log uses Ψ(⋅), which is the first
derivative of the log Gamma function,

E[logθk|γ]=Ψ(γk)−Ψ(∑Ki=1γi).

(29)

This can be derived by putting the Dirichlet in exponential family
form, noticing that logθ is the vector of sufficient
statistics, and computing its expectation by taking the gradient of
the log-normalizer with respect to the natural parameter vector
γ.

Complete conditionals and variational distributions. We
specify the global and local variables of LDA to place it in the
stochastic variational inference setting of Section 2. In
topic modeling, the local context is a document d. The local
observations are its observed words wd,1:N. The local hidden
variables are the topic proportions θd and the topic assignments
zd,1:N. The global hidden variables are the topics β1:K.

Recall from Section 2 that the complete conditional is the
conditional distribution of a variable given all of the other
variables, hidden and observed. In mean-field variational inference,
the variational distributions of each variable are in the same family
as the complete conditional.

We begin with the topic assignment zdn. The complete
conditional of the topic assignment is a multinomial,

p(zdn=k|θd,β1:K,wdn)∝exp{logθdk+logβk,wdn}.

(30)

Thus its variational distribution is a multinomial q(zdn)=Multinomial(ϕdn), where the variational parameter ϕdn is a
point on the K−1-simplex. Per the mean-field approximation, each
observed word is endowed with a different variational distribution for
its topic assignment, allowing different words to be associated with
different topics.

The complete conditional of the topic proportions is a Dirichlet,

p(θd|\boldmathzd)=Dirichlet(α+∑Nn=1zdn).

(31)

Since zdn is an indicator vector, the kth element of the
parameter to this Dirichlet is the sum of the hyperparameter α
and the number of words assigned to topic k in document d. Note
that, although we have assumed an exchangeable Dirichlet prior, when
we condition on z the conditional p(θd|zd) is a
non-exchangeable Dirichlet.

With this conditional, the variational distribution of the topic
proportions is also Dirichlet q(θd)=Dirichlet(γd), where
γd is a K-vector Dirichlet parameter. There is a different
variational Dirichlet parameter for each document, allowing different
documents to be associated with different topics in different
proportions.

These are local hidden variables. The complete conditionals only
depend on other variables in the local context (i.e., the document)
and the global variables; they do not depend on variables from other
documents.

Finally, the complete conditional for the topic βk is also a
Dirichlet,

The vth element of the parameter to the Dirichlet conditional for
topic k is the sum of the hyperparameter η and the number of
times that the term v was assigned to topic k. This is a global
variable—its complete conditional depends on the words and topic
assignments of the entire collection.

The variational distribution for each topic is a V-dimensional
Dirichlet,

q(βk)=Dirichlet(λk).

As we will see in the next section, the traditional variational
inference algorithm for LDA is inefficient with large collections of
documents. The root of this inefficiency is the update for the topic
parameter λk, which (from Equation 32) requires summing
over variational parameters for every word in the collection.

Batch variational inference.

With the complete conditionals in hand, we now derive the coordinate
ascent variational inference algorithm, i.e., the batch inference
algorithm of Figure 3. We form each coordinate update by
taking the expectation of the natural parameter of the complete
conditional. This is the stepping stone to stochastic variational
inference.

The variational parameters are the global per-topic Dirichlet parameters
λ1:K, local per-document Dirichlet parameters γ1:D, and
local per-word multinomial parameters ϕ1:D,1:N. Coordinate ascent
variational inference iterates between updating all of the local
variational parameters (Equation 16) and updating the global
variational parameters (Equation 15).

We update each document d’s local variational in a local coordinate
ascent routine, iterating between updating each word’s topic
assignment and the per-document topic proportions,

ϕkdn

∝

exp{Ψ(γdk)+Ψ(λk,wdn)−Ψ(∑vλkv)}for n∈{1,…,N}

(33)

γd

=

α+∑Nn=1ϕdn.

(34)

These updates derive from taking the expectations of the natural
parameters of the complete conditionals in Equation 30 and
Equation 31. (We then map back to the usual parameterization
of the multinomial.) For the update on the topic
assignment, we have used the Dirichlet expectations in
Equation 29. For the update on the topic proportions, we
have used that the expectation of an indicator is its probability,
Eq[zkdn]=ϕkdn.

After finding variational parameters for each document, we update the
variational Dirichlet for each topic,