α-Variational Inference with Statistical Guarantees

10/09/2017 ∙ by Yun Yang, et al. ∙ Florida State University and Texas A&M University

We propose a variational approximation to Bayesian posterior distributions, called α-VB, with provable statistical guarantees for models with and without latent variables. The standard variational approximation is a special case of α-VB with α=1. When α∈(0,1), a novel class of variational inequalities is developed for linking the Bayes risk under the variational approximation to the objective function in the variational optimization problem, implying that maximizing the evidence lower bound in variational inference has the effect of minimizing the Bayes risk within the variational density family. Operating in a frequentist setup, the variational inequalities imply that point estimates constructed from the α-VB procedure converge at an optimal rate to the true parameter in a wide range of problems. We illustrate our general theory with a number of examples, including the mean-field variational approximation to (low)-high-dimensional Bayesian linear regression with spike and slab priors, mixture of Gaussian models, latent Dirichlet allocation, and (mixture of) Gaussian variational approximation in regular parametric models.


1 Introduction and preliminaries

Variational inference [25, 38] is a widely-used tool for approximating complicated probability densities, especially those arising as posterior distributions from complex hierarchical Bayesian models. It provides an alternative strategy to Markov chain Monte Carlo (MCMC, [20, 15]) sampling by turning the sampling/inference problem into an optimization problem, where a closest member, relative to the Kullback–Leibler (KL) divergence, in a family of approximate densities is picked out as a proxy to the target density. Variational inference has found success in a variety of contexts, especially in models involving latent variables, such as hidden Markov models [30], graphical models [3, 38], mixture models [23, 14, 35], and topic models [9, 11], among others. See the recent review paper [10] by Blei et al. for a comprehensive introduction to variational inference.

The popularity of variational methods can be largely attributed to their computational advantages over MCMC. It has been empirically observed in many applications that variational inference operates orders of magnitude faster than MCMC for achieving the same approximation accuracy. Moreover, compared to MCMC, variational inference tends to be easier to scale to big data due to its inherent optimization nature, and can take advantage of modern optimization techniques such as stochastic optimization [27, 26] and distributed optimization [1]. However, unlike MCMC, which is guaranteed to produce (almost) exact samples from the target density for ergodic chains [33], variational inference does not enjoy such a general theoretical guarantee.

Several threads of research have been devoted to characterizing statistical properties of the variational proxy to the true posterior distribution; refer to Section 5.2 of [10] for a relatively comprehensive survey of the theoretical literature on variational inference. However, almost all these studies are conducted in a case-by-case manner, by either explicitly analyzing the fixed point equation of the variational optimization problem, or directly analyzing the iterative algorithm for solving the optimization problem. In addition, these analyses require certain structural assumptions on the priors such as conjugacy, and are not applicable to broader classes of priors.

This article introduces a novel class of variational approximations and studies their large sample convergence properties in a unified framework. The new variational approximation, termed α-VB, introduces a fixed temperature parameter α inside the usual VB objective function which controls the relative trade-off between model-fit and prior regularization. The usual VB approximation is retained as a special case corresponding to α = 1. The α-VB objective function is partly motivated by fractional posteriors [39, 4]; specific connections are drawn in §2.1. The general α-VB procedure also inherits all the computational tractability and scalability of the α = 1 case, and implementation-wise only requires simple modifications to existing variational algorithms.

For α ∈ (0, 1), we develop novel variational inequalities for the Bayes risk under the variational solution. These variational inequalities link the Bayes risk with the α-VB objective function, implying that maximizing the evidence lower bound has the effect of minimizing the Bayes risk within the variational density family. A crucial upshot of this analysis is that point estimates constructed from the variational posterior concentrate at the true parameter at the same rate as those constructed from the actual posterior for a variety of problems. There is now a well-developed literature on the frequentist concentration properties of posterior distributions in nonparametric problems; refer to [34] for a detailed review. The present paper takes a step towards developing similar general-purpose theoretical guarantees for variational solutions. We apply our theory to a number of examples where VB is commonly used, including the mean-field variational approximation to high-dimensional Bayesian linear regression with spike and slab priors, Gaussian mixture models, latent Dirichlet allocation, and the Gaussian-mixture variational approximation to regular parametric models.

The α < 1 case is of particular interest, as the major ingredient of the variational inequality involves the prior mass assigned to appropriate Kullback–Leibler neighborhoods of the truth, which can be bounded in a straightforward fashion in the aforesaid models and beyond. We mention here a recent preprint by Alquier and Ridgeway [2] where variational approximations to tempered posteriors (without latent variables) are studied. The α-VB objective function considered here incorporates a much broader class of models involving latent variables, and the corresponding variational inequality recovers the risk bound of [2] when no latent variables are present. The variational inequalities for the α < 1 case do not immediately extend to the α = 1 case under a simple limiting operation, and require a separate treatment under stronger assumptions. In particular, we make use of additional testability assumptions on the likelihood function detailed in §3.2. Similar assumptions have been used to study concentration of the usual posterior [17].

It is a well-known fact [41, 43] that the covariance matrices from variational approximations are typically “too small” compared with those for the sampling distribution of the maximum likelihood estimator, which, combined with the Bernstein–von Mises theorem [36], implies that the variational approximation may not converge to the true posterior distribution. This fact combined with our results illustrates the landscape of variational approximation—minimizing the KL divergence over the variational family forces the variational distribution to concentrate around the truth at the optimal rate (due to the heavy penalty on the tails in the KL divergence); however, the local shape of the obtained density function around the truth can be far away from that of the true posterior due to the mismatch between the distributions in the variational family and the true posterior. Overall, our results reveal that concentration of the posterior measure is not only useful in guaranteeing desirable statistical properties, but also has computational benefits in certifying consistency and concentration of variational approximations.

In the remainder of this section, we introduce key notation used in the paper and provide necessary background on variational inference.

1.1 Notation

We briefly introduce notation that will be used throughout the paper. Let h(p, q) = {∫ (√p − √q)² dμ}^{1/2} and D(p ‖ q) = ∫ p log(p/q) dμ denote the Hellinger distance and Kullback–Leibler divergence, respectively, between two probability density functions p and q relative to a common dominating measure μ. We define an additional discrepancy measure V(p ‖ q) = ∫ p log²(p/q) dμ, which will be referred to as the V-divergence. For a set A, we use the notation 1_A to denote its indicator function. For any vector m and positive semidefinite matrix Σ, we use N(m, Σ) to denote the normal distribution with mean m and covariance matrix Σ, and use N(x; m, Σ) to denote its pdf at x.

For any α ∈ (0, 1), let

D_α(p ‖ q) = (α − 1)^{-1} log ∫ p^α q^{1−α} dμ    (1.1)

denote the Rényi divergence of order α. Jensen’s inequality implies that D_α(p ‖ q) ≥ 0 for any α ∈ (0, 1), and the equality holds if and only if p = q. The Hellinger distance can be related to the α-divergence with α = 1/2 via D_{1/2}(p ‖ q) = −2 log{1 − h²(p, q)/2} ≥ h²(p, q), by using the inequality −log(1 − x) ≥ x for x ∈ [0, 1). More details and properties of the α-divergence can be found in [37].
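As a quick numerical check of the definitions above (a minimal sketch using NumPy/SciPy and the conventions stated in this section; it is not part of the original paper), one can tabulate two Gaussian densities on a grid, evaluate D_α, and verify the α = 1/2 relation to the Hellinger distance:

```python
import numpy as np
from scipy.stats import norm

# Discretize two univariate Gaussian densities on a fine grid.
x = np.linspace(-12, 12, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=1.5)

def renyi(p, q, alpha, dx):
    """Rényi divergence D_alpha(p || q) for densities tabulated on a grid."""
    return np.log(np.sum(p**alpha * q**(1 - alpha)) * dx) / (alpha - 1)

hell_sq = np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx   # squared Hellinger distance
d_half = renyi(p, q, alpha=0.5, dx=dx)

# D_{1/2}(p || q) = -2 log(1 - h^2/2), which is at least h^2.
print(d_half, -2 * np.log(1 - hell_sq / 2), hell_sq)
```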

1.2 Review of variational inference

Suppose we have observations X^(n) = (X_1, …, X_n) with n denoting the sample size. Let P_θ^(n) be the distribution of X^(n) given parameter θ ∈ Θ, which admits a density p_θ^(n) relative to the Lebesgue measure. We will also interchangeably use P_θ and p_θ to denote P_θ^(n) and its density function (likelihood function) p_θ^(n). Assume additionally that the likelihood can be represented as

p_θ^(n)(X^(n)) = Σ_{S^(n)} p_θ^(n)(X^(n), S^(n)),

where S^(n) denotes a collection of latent or unobserved variables; the superscript n signifies the possible dependence of the number of latent variables on the sample size; for example, S^(n) = (S_1, …, S_n) when there are observation-specific latent variables. In certain situations, the latent variables may be introduced for purely computational reasons to simplify an otherwise intractable likelihood, such as the latent cluster indicators in a mixture model. Alternatively, a complex probabilistic model may itself be defined in a hierarchical fashion by first specifying the distribution of the data given latent variables and parameters, and then specifying the latent variable distribution given parameters; examples include latent Dirichlet allocation and many other prominent Bayesian hierarchical models. For ease of presentation, we have assumed discrete latent variables in the above display and continue to do so subsequently, although our development seamlessly extends to continuous latent variables by replacing sums with integrals; further details are provided in a supplemental document.

Let Π denote a prior distribution on θ with density function π, and let π(·, · | X^(n)) denote the corresponding posterior of (θ, S^(n)). In a Bayesian framework, all inference is based on the augmented posterior density given by

π(θ, S^(n) | X^(n)) = p_θ^(n)(X^(n), S^(n)) π(θ) / { ∫_Θ Σ_{S^(n)} p_θ^(n)(X^(n), S^(n)) π(θ) dθ }.    (1.2)

In many cases, the augmented posterior (1.2) can be inconvenient for conducting direct analysis due to its intractable normalizing constant, and expensive to sample from due to the slow mixing of standard MCMC algorithms. Variational inference aims to bypass these difficulties by turning the inference problem into an optimization problem, which can be solved by using iterative algorithms such as coordinate descent [7] and alternating minimization.

Let Γ denote a pre-specified family of density functions over (θ, S^(n)) that can be either parameterized by some “variational parameters”, or required to satisfy some structural constraints (see below for examples of Γ). The goal of variational inference is to approximate the posterior density (1.2) by finding the closest member of this family in KL divergence to the conditional density of interest, that is, computing the minimizer

q̂(θ, S^(n)) := argmin_{q ∈ Γ} D( q(θ, S^(n)) ‖ π(θ, S^(n) | X^(n)) )
             = argmin_{q ∈ Γ} ∫ Σ_{S^(n)} q(θ, S^(n)) log [ q(θ, S^(n)) / { p_θ^(n)(X^(n), S^(n)) π(θ) } ] dθ,    (1.3)

where the last step follows by using Bayes’ rule and the fact that the marginal density p(X^(n)) = ∫_Θ Σ_{S^(n)} p_θ^(n)(X^(n), S^(n)) π(θ) dθ does not depend on θ and S^(n). The function inside the argmin operator above (without the negative sign) is called the evidence lower bound (ELBO, [10]) since it provides a lower bound to the log evidence log p(X^(n)),

log p(X^(n)) = ELBO(q) + D( q(θ, S^(n)) ‖ π(θ, S^(n) | X^(n)) ) ≥ ELBO(q),    (1.4)

where the equality holds if and only if q(θ, S^(n)) = π(θ, S^(n) | X^(n)). The decomposition (1.4) provides an alternative interpretation of variational inference to the original derivation from Jensen’s inequality [25]—minimizing the KL divergence over the variational family Γ is equivalent to maximizing the ELBO over Γ. When Γ is composed of all densities over (θ, S^(n)), this variational approximation exactly recovers the posterior π(θ, S^(n) | X^(n)). In general, the variational family is chosen to balance computational tractability against approximation accuracy. Some common examples of Γ are provided below.
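To make the decomposition (1.4) concrete, the following minimal sketch (not from the paper; the conjugate Beta–Bernoulli model and all numbers are illustrative) computes the exact log evidence in closed form and a Monte Carlo estimate of the ELBO for a candidate Beta variational density; the ELBO sits below the log evidence and matches it when q equals the exact posterior:

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

# Toy conjugate model: theta ~ Beta(a0, b0), X_i | theta ~ Bernoulli(theta).
rng = np.random.default_rng(0)
a0, b0 = 1.0, 1.0
n, k = 20, 13                                  # n observations, k successes
log_evidence = betaln(a0 + k, b0 + n - k) - betaln(a0, b0)

def elbo(q_a, q_b, num_draws=200_000):
    """Monte Carlo estimate of ELBO(q) for q = Beta(q_a, q_b)."""
    theta = rng.beta(q_a, q_b, size=num_draws)
    log_joint = (k * np.log(theta) + (n - k) * np.log1p(-theta)
                 + stats.beta.logpdf(theta, a0, b0))
    return np.mean(log_joint - stats.beta.logpdf(theta, q_a, q_b))

# ELBO < log evidence for a generic q; equality (up to Monte Carlo error)
# when q is the exact posterior Beta(a0 + k, b0 + n - k).
print(log_evidence, elbo(2.0, 2.0), elbo(a0 + k, b0 + n - k))
```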

Example: (Exponential variational family)

When there is no latent variable and θ corresponds to the parameter in the model, a popular choice of the variational family is an exponential family of distributions. Among the exponential variational families, the Gaussian variational family { N(m, Σ) : m ∈ ℝ^d, Σ positive definite }, for θ ∈ ℝ^d, is the most widely used owing to the Bernstein–von Mises theorem (Section 10.2 of [36]), stating that for regular parametric models, the posterior distribution converges to a Gaussian limit relative to the total variation metric as the sample size tends to infinity. There are also some recent developments replacing the single Gaussian with a Gaussian mixture as the variational family to improve the finite-sample approximation [47], which is useful when the posterior distribution is skewed or far away from Gaussian for the given sample size.
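As a minimal illustration of fitting a member of the Gaussian variational family (a sketch on a toy, skewed target standing in for a posterior; the target, grid, and optimizer choices are illustrative and not from the paper), one can discretize the target and minimize KL(q ‖ p) over the Gaussian mean and scale:

```python
import numpy as np
from scipy import optimize
from scipy.stats import gamma, norm

# A skewed toy "posterior": Gamma(shape=3), tabulated on a grid.
x = np.linspace(1e-3, 25, 5000)
dx = x[1] - x[0]
p = gamma.pdf(x, a=3.0)

def kl_q_to_p(params):
    """KL(q || p) for a Gaussian q with mean m and standard deviation exp(log_s)."""
    m, log_s = params
    q = norm.pdf(x, m, np.exp(log_s))
    q /= q.sum() * dx                      # renormalize after truncation to the grid
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

res = optimize.minimize(kl_q_to_p, x0=[2.0, 0.0])
true_mean = np.sum(x * p) * dx
true_sd = np.sqrt(np.sum((x - true_mean) ** 2 * p) * dx)
print(res.x[0], np.exp(res.x[1]))          # variational mean and spread
print(true_mean, true_sd)                  # target mean and spread, for comparison
```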

Example: (Mean-field variational family)

Suppose that the pair (θ, S^(n)) can be decomposed into m components (or blocks) as (θ, S^(n)) = (z_1, …, z_m) for some m ≥ 1, where each component z_j can be multidimensional. The mean-field variational family is composed of all density functions over (θ, S^(n)) that factorize as

q(z_1, …, z_m) = ∏_{j=1}^m q_j(z_j),

where each variational factor q_j is a density function over z_j for j = 1, …, m. A natural mean-field decomposition is to let q(θ, S^(n)) = q_θ(θ) q_S(S^(n)), with q_S often further decomposed as q_S(S^(n)) = ∏_{i=1}^n q_{S_i}(S_i).

Note that we have not specified the parametric form of the individual variational factors, which are determined by properties of the model—in some cases, the optimal q_{S_i} is in the same parametric family as the conditional distribution of S_i given the parameter. The corresponding mean-field variational approximation q̂, which is necessarily of the form q̂(z_1, …, z_m) = ∏_{j=1}^m q̂_j(z_j), can be computed via the coordinate ascent variational inference (CAVI) algorithm [7, 10], which iteratively optimizes each variational factor keeping the others fixed at their present values and resembles the EM algorithm in the presence of latent variables.

The mean-field variational family can be further constrained by restricting each factor to belong to a parametric family, such as the exponential family in the previous example. In particular, it is common practice to restrict the variational density q_θ of the parameter to a structured family (for example, the mean-field family if θ is multi-dimensional), which will be denoted by Γ_θ in the sequel.
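For concreteness, here is a minimal CAVI sketch for a two-component Gaussian mixture with known unit variance, uniform mixing weights, and a Gaussian prior on the component means (the model, prior variance, and variable names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from a two-component Gaussian mixture with unit variance.
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])
n, K, sigma2 = x.size, 2, 10.0            # sigma2: prior variance of the means

# Mean-field factors: q(mu_k) = N(m[k], s2[k]) and q(c_i) = Categorical(phi[i]).
m = rng.normal(0.0, 1.0, K)
s2 = np.ones(K)
phi = np.full((n, K), 1.0 / K)

for _ in range(100):                      # CAVI iterations
    # Update the cluster-assignment factors q(c_i).
    logits = np.outer(x, m) - 0.5 * (s2 + m**2)
    phi = np.exp(logits - logits.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Update the component-mean factors q(mu_k).
    nk = phi.sum(axis=0)
    s2 = 1.0 / (1.0 / sigma2 + nk)
    m = s2 * (phi * x[:, None]).sum(axis=0)

print(np.sort(m))                          # variational means, roughly (-2, 3)
```

Each update optimizes one factor with the others held fixed, which is exactly the coordinate ascent scheme described above.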
The rest of the paper is organized as follows. In §2, we introduce the α-VB objective function and relate it to the usual VB. §3 presents our general theoretical results concerning finite sample risk bounds for the α-VB solution. In §4, we apply the theory to concrete examples. We conclude with a discussion in §5. All proofs and some additional discussions are provided in a separate supplemental document. The supplemental document also contains a detailed simulation study.

2 The α-VB procedure

Before introducing the proposed family of objective functions, we first represent the KL term in a more convenient form which provides intuition into how VB works in the presence of latent variables and aids our subsequent theoretical development.

2.1 A further decomposition of the ELBO

To aid our subsequent development, we introduce some additional notation and make some simplifying assumptions. First, we decompose the parameter as θ = (β, η), with β ∈ Θ_β and η ∈ Θ_η. In other words, β is the parameter characterizing the conditional distribution of the observations given the latent variables, and η characterizes the distribution of the latent variables. We shall also assume the mean-field decomposition

q(θ, S^(n)) = q_θ(θ) q_S(S^(n))    (2.1)

throughout, and let Γ^MF denote the class of such product variational distributions. When necessary subsequently, we shall further assume q_θ(θ) = q_β(β) q_η(η) and q_S(S^(n)) = ∏_{i=1}^n q_{S_i}(S_i), which however is not immediately necessary for this subsection.

The KL divergence in (1.3) involves both parameters and latent variables. Separating out the KL divergence for the parameter part leads to the equivalent representation

D( q_θ q_S ‖ π(θ, S^(n) | X^(n)) ) = D(q_θ ‖ π) + ∫ [ − Σ_{S^(n)} q_S(S^(n)) log{ p_θ^(n)(X^(n), S^(n)) / q_S(S^(n)) } ] q_θ(θ) dθ + log p(X^(n)).    (2.2)

Observe that, using concavity of the logarithm and Jensen’s inequality,

Σ_{S^(n)} q_S(S^(n)) log{ p_θ^(n)(X^(n), S^(n)) / q_S(S^(n)) } ≤ log p_θ^(n)(X^(n)).

The quantity on the left-hand side, which appears in (2.2), can therefore be recognized as an approximation (from below) to the log likelihood in terms of the latent variables. Define an average Jensen gap due to the variational approximation to the log-likelihood,

Δ_n(q_θ, q_S) := ∫ [ log p_θ^(n)(X^(n)) − Σ_{S^(n)} q_S(S^(n)) log{ p_θ^(n)(X^(n), S^(n)) / q_S(S^(n)) } ] q_θ(θ) dθ ≥ 0.

With this, write the KL divergence as

D( q_θ q_S ‖ π(θ, S^(n) | X^(n)) ) = ∫ [ −log p_θ^(n)(X^(n)) ] q_θ(θ) dθ + D(q_θ ‖ π) + Δ_n(q_θ, q_S) + log p(X^(n)),    (2.3)

which splits as a sum of three terms (plus the constant log evidence): an integrated (w.r.t. the variational distribution) negative log-likelihood, the KL divergence between the variational distribution and the prior for θ, and the Jensen gap due to the latent variables. In particular, the role of the latent-variable variational distribution q_S is conveniently confined to Δ_n(q_θ, q_S).

Another view of the above is an equivalent formulation of the ELBO decomposition (1.4),

ELBO(q_θ q_S) = log p(X^(n)) − Δ_n(q_θ, q_S) − D( q_θ ‖ π(θ | X^(n)) ),    (2.4)

which readily follows since

D( q_θ q_S ‖ π(θ, S^(n) | X^(n)) ) = D( q_θ ‖ π(θ | X^(n)) ) + Δ_n(q_θ, q_S).

Thus, in latent variable models, maximizing the ELBO is equivalent to minimizing the sum of the Jensen gap and the KL divergence between the variational density of the parameters and their posterior density. When there is no likelihood approximation with latent variables, Δ_n(q_θ, q_S) ≡ 0 and (2.4) reduces to (1.4).
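As a small numerical illustration of the Jensen gap (a sketch for a single observation from a two-component Gaussian mixture with fixed, illustrative parameters; not from the paper), the gap equals the KL divergence from q_S to the conditional distribution of the latent label and therefore vanishes exactly at that conditional:

```python
import numpy as np
from scipy.stats import norm

# One observation from a two-component Gaussian mixture, parameters held fixed.
x = 0.7
weights, means = np.array([0.4, 0.6]), np.array([-1.0, 2.0])
joint = weights * norm.pdf(x, means, 1.0)        # p(x, s) for s = 1, 2
log_lik = np.log(joint.sum())                    # log p(x)

def jensen_gap(q):
    """log p(x) minus the lower bound sum_s q(s) log{p(x, s)/q(s)}."""
    return log_lik - np.sum(q * (np.log(joint) - np.log(q)))

exact_conditional = joint / joint.sum()          # p(s | x)
print(jensen_gap(np.array([0.5, 0.5])))          # strictly positive
print(jensen_gap(exact_conditional))             # ~0: the gap vanishes at p(s | x)
```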

2.2 The α-VB objective function

Here and in the rest of the paper, we adopt the frequentist perspective by assuming that there is a true data generating model P_{θ*}^(n) that generates the data X^(n), and θ* will be referred to as the true parameter, or simply the truth. Let ℓ_n(θ, θ*) = log{ p_θ^(n)(X^(n)) / p_{θ*}^(n)(X^(n)) } be the log-likelihood ratio. Define

Ψ_n(q_θ, q_S) = ∫ [ −ℓ_n(θ, θ*) ] q_θ(θ) dθ + Δ_n(q_θ, q_S) + D(q_θ ‖ π),    (2.5)

and observe that Ψ_n differs from the KL divergence in (2.3) only by the term log p_{θ*}^(n)(X^(n)) − log p(X^(n)), which does not involve the variational densities. Hence, minimizing Ψ_n is equivalent to minimizing the KL divergence in (2.3). We note here that the introduction of the term involving θ* is to develop theoretical intuition, and the actual minimization does not require knowledge of θ*.

The objective function in (2.5) elucidates the trade-off between model-fit and fidelity to the prior underlying a variational approximation, which is akin to the classical bias-variance trade-off for shrinkage or penalized estimators. The model-fit term consists of two constituents: the first term is an averaged (with respect to the variational distribution) log-likelihood ratio which tends to get small as the variational distribution q_θ places more mass near the true parameter θ*, while the second term is the Jensen gap due to the variational approximation with the latent variables. On the other hand, the regularization or penalty term prevents over-fitting to the data by constricting the KL divergence between the variational solution and the prior.

In this article, we study a wider class of variational objective functions indexed by a scalar parameter α ∈ (0, 1] which encompasses the usual VB,

Ψ_{n,α}(q_θ, q_S) = ∫ [ −ℓ_n(θ, θ*) ] q_θ(θ) dθ + Δ_n(q_θ, q_S) + (1/α) D(q_θ ‖ π),    (2.6)

and define the α-VB solution as

(q̂_θ^α, q̂_S^α) = argmin_{(q_θ, q_S): q_θ q_S ∈ Γ^MF} Ψ_{n,α}(q_θ, q_S).    (2.7)

Observe that the α-VB criterion Ψ_{n,α} differs from Ψ_n only in the regularization term, where the inverse temperature parameter α controls the amount of regularization, with a smaller α implying a stronger penalty. When α = 1, Ψ_{n,α} reduces to the usual variational objective function in (2.5), and we shall denote the solution of (2.7) by q̂_θ and q̂_S as before. As we shall see in the sequel, the introduction of the temperature parameter substantially simplifies the theoretical analysis and allows one to certify (near-)minimax optimality of the α-VB solution for α < 1 under only a prior mass condition, whereas analysis of the usual VB solution (α = 1) requires more intricate testing arguments.

The α-VB solution can also be interpreted as the minimizer of a certain divergence function between the product variational distribution and the joint α-fractional posterior distribution [4] of (θ, S^(n)),

π_{n,α}(θ, S^(n) | X^(n)) ∝ { p_θ^(n)(X^(n), S^(n)) }^α π(θ),    (2.8)

which is obtained by raising the joint likelihood of (X^(n), S^(n)) to the fractional power α, and combining with the prior using Bayes’ rule. We shall use π_{n,α}(·, · | X^(n)) to denote the fractional posterior density. The fractional posterior is a specific example of a Gibbs posterior [24] and shares a nice coherence property with the usual posterior when viewed as a mechanism for updating beliefs [8].
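For intuition, here is a minimal conjugate sketch of a fractional posterior (Beta–Bernoulli model; the prior, data counts, and α values are illustrative): raising the likelihood to the power α keeps conjugacy and simply discounts the data, so the fractional posterior remains a Beta distribution with tempered counts and widens as α decreases.

```python
import numpy as np
from scipy.stats import beta

# Fractional posterior in a conjugate Beta-Bernoulli model:
# theta ~ Beta(a0, b0), k successes out of n; raising the likelihood to the
# power alpha gives Beta(a0 + alpha * k, b0 + alpha * (n - k)).
a0, b0 = 1.0, 1.0
n, k = 50, 36
for alpha in (1.0, 0.5, 0.25):
    post = beta(a0 + alpha * k, b0 + alpha * (n - k))
    print(alpha, post.mean(), post.std())   # smaller alpha => more spread
```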

Proposition 2.1 (Connection with fractional posteriors).

The α-VB solution (q̂_θ^α, q̂_S^α) satisfies

(q̂_θ^α, q̂_S^α) = argmin_{(q_θ, q_S): q_θ q_S ∈ Γ^MF} { D( q_θ q_S ‖ π_{n,α}(θ, S^(n) | X^(n)) ) + (1 − α) H(q_S) },

where H(q_S) = −Σ_{S^(n)} q_S(S^(n)) log q_S(S^(n)) is the entropy of q_S, and π_{n,α}(·, · | X^(n)) is the joint α-fractional posterior density of (θ, S^(n)).

The additional entropy term encourages the latent-variable variational density q_S to be more concentrated than a pure KL projection would make it, in addition to minimizing the KL divergence between q_θ q_S and the fractional posterior π_{n,α}(·, · | X^(n)). In particular, if there are no latent variables, the entropy term disappears and the objective function reduces to a KL divergence between q_θ and the α-fractional posterior of θ.
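A short calculation behind Proposition 2.1 (a sketch under the forms (2.5), (2.6) and (2.8) as stated above): writing C_α for the normalizing constant of (2.8),

D( q_θ q_S ‖ π_{n,α}(θ, S^(n) | X^(n)) ) = D(q_θ ‖ π) − H(q_S) − α ∫ Σ_{S^(n)} q_S(S^(n)) log p_θ^(n)(X^(n), S^(n)) q_θ(θ) dθ + log C_α,

while expanding α Ψ_{n,α}(q_θ, q_S) using the definition of Δ_n gives the same expression up to the extra term (1 − α) H(q_S) and additive constants that do not involve (q_θ, q_S); the two criteria therefore share the same minimizers up to this entropy correction.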

We conclude this section by remarking that the additive decomposition of the model-fit term in (2.6) provides a peek into why mean-field approximations work for latent variable models, since the roles of the variational density q_S for the latent variables and q_θ for the model parameters are decoupled. Roughly speaking, a good choice of q_S should aim to make the Jensen gap small, while the choice of q_θ should balance the integrated log-likelihood ratio and the penalty term. This point is crucial for the theoretical analysis.

3 Variational risk bounds for α-VB

In this section, we investigate concentration properties of the α-VB posterior under a frequentist framework assuming the existence of a true data generating parameter θ*. We first focus on the α < 1 case, and then separately consider the α = 1 case. The main take-away message from our theoretical results below is that under fairly general conditions, the α-VB procedure concentrates at the true parameter at the same rate as the actual posterior, and as a result, point estimates obtained from α-VB can provide rate-optimal frequentist estimators. These results thus complement the empirical success of VB in a wide variety of models.

We present our results in the form of Bayes risk bounds for the variational distribution. Specifically, for a suitable loss function L(θ, θ*), we aim to obtain a high-probability (under the data generating distribution P_{θ*}^(n)) upper bound on the variational risk

∫ L(θ, θ*) q̂_θ^α(θ) dθ.    (3.1)

In particular, if L(·, θ*) is convex in its first argument, then the above risk bound immediately translates into a risk bound for the α-VB point estimate θ̂^α := ∫ θ q̂_θ^α(θ) dθ using Jensen’s inequality:

L(θ̂^α, θ*) ≤ ∫ L(θ, θ*) q̂_θ^α(θ) dθ.

Specifically, our goal will be to establish general conditions under which θ̂^α concentrates around θ* at the minimax rate for the particular problem.

3.1 Risk bounds for the case α < 1

We use the shorthand

D_α^(n)(θ, θ*) := (1/n) D_α( p_θ^(n) ‖ p_{θ*}^(n) )

to denote the averaged α-divergence between p_θ^(n) and p_{θ*}^(n). We adopt the theoretical framework of [4] to use this divergence as our loss function for measuring the closeness between any θ and the truth θ*. Note that in the case of i.i.d. observations, this averaged divergence simplifies to D_α(p_θ ‖ p_{θ*}), which for any fixed α ∈ [1/2, 1) is stronger than the squared Hellinger distance between p_θ and p_{θ*}.

Our first main result provides a general finite-sample upper bound to the variational Bayes risk (3.1) for the above choice of L.

Theorem 3.1 (Variational risk bound).

Recall the α-VB objective function Ψ_{n,α} from (2.6). For any ζ ∈ (0, 1), it holds with probability at least 1 − ζ that, for any probability measure q_θ with D(q_θ ‖ π) < ∞ and any probability measure q_S on the latent variables,

∫ D_α^(n)(θ, θ*) q̂_θ^α(θ) dθ ≤ {α / (1 − α)} (1/n) Ψ_{n,α}(q_θ, q_S) + log(1/ζ) / {(1 − α) n}.

Here and elsewhere, the probability statement is uniform over all (q_θ, q_S). Theorem 3.1 links the variational Bayes risk for the α-divergence to the objective function in (2.6). As a consequence, minimizing Ψ_{n,α} in (2.6) has the same effect as minimizing the variational Bayes risk. To apply Theorem 3.1 to various problems, we now discuss strategies to further analyze and simplify the resulting bound under appropriate structural constraints on q_θ and q_S. To that end, we make some simplifying assumptions.

First, we assume a further mean-field decomposition q_S(S^(n)) = ∏_{i=1}^n q_{S_i}(S_i) for the latent variables, where each factor q_{S_i} is restriction-free. Second, the inconsistency of the mean-field approximation for state-space models proved in [40] indicates that this mean-field approximation for the latent variables may not generally work for non-independent observations with non-independent latent variables. For this reason, we assume that the observation–latent variable pairs (X_i, S_i) are mutually independent across i. In fact, we assume that (X_i, S_i), i = 1, …, n, are i.i.d. copies of a pair (X, S) whose density function is given by p_θ(x, s) = p_β(x | s) p_η(s). Following earlier notation, p_η denotes the probability mass function of the i.i.d. discrete latent variables S_i, each taking values in {1, …, K}, with the parameter η residing in the (K − 1)-dimensional probability simplex. Finally, we assume the variational family of the parameter decomposes into Γ_θ = Γ_β ⊗ Γ_η, where Γ_β (respectively Γ_η) denotes the variational family for the parameter β (respectively η).

Let p_θ(x) = Σ_{s=1}^K p_β(x | s) p_η(s) denote the marginal probability density function of the i.i.d. observations X_i. The i.i.d. assumption implies a simplified structure for various quantities encountered before, e.g. ℓ_n(θ, θ*) = Σ_{i=1}^n log{ p_θ(X_i) / p_{θ*}(X_i) } and D_α^(n)(θ, θ*) = D_α(p_θ ‖ p_{θ*}). Moreover, under these assumptions, the Jensen gap decomposes across observations as Δ_n(q_θ, q_S) = Σ_{i=1}^n Δ_i(q_θ, q_{S_i}), with Δ_i the gap associated with the i-th observation.

As discussed in the previous subsection, the decoupling of the roles of q_S and q_θ in the model-fit term aids in bounding Ψ_{n,α}. Specifically, we first choose a q_S which controls the Jensen gap Δ_n(q_θ, q_S), and then make a choice of q_θ which controls the remaining two terms. The choice of q_θ requires a delicate balance between placing enough mass near θ* and controlling the KL divergence from the prior.

For a fixed θ, if we choose each q_{S_i} to be the full conditional distribution of S_i given (X_i, θ), i.e.,

q_{S_i}(s) = p_θ(s | X_i) = p_β(X_i | s) p_η(s) / p_θ(X_i),    s = 1, …, K,

then the normalizing constant of this conditional distribution is p_θ(X_i), and as a result, the Jensen gap vanishes. The mean-field approximation precludes us from choosing q_S dependent on θ, and hence the Jensen gap cannot be made exactly zero in general. However, this naturally suggests replacing θ by θ* in the above display and choosing q_{S_i}(s) = p_{θ*}(s | X_i). This leads us to the following corollary.

Corollary 3.2 (i.i.d. observations).

It holds with probability at least 1 − ζ that, for any probability measure q_θ with D(q_θ ‖ π) < ∞,

∫ D_α^(n)(θ, θ*) q̂_θ^α(θ) dθ
  ≤ (α / {(1 − α) n}) { ∫ [ −ℓ_n(θ, θ*) ] q_θ(θ) dθ + Δ_n(q_θ, q_S^*) + (1/α) D(q_θ ‖ π) } + log(1/ζ) / {(1 − α) n}
  = (α / {(1 − α) n}) ∫ Σ_{i=1}^n Σ_{s=1}^K q_{S_i}^*(s) log{ p_{θ*}(X_i, s) / p_θ(X_i, s) } q_θ(θ) dθ + { D(q_θ ‖ π) + log(1/ζ) } / {(1 − α) n},    (3.2)

where q_S^* = ∏_{i=1}^n q_{S_i}^* and q_{S_i}^* is the probability distribution over {1, …, K} defined as

q_{S_i}^*(s) = p_{θ*}(s | X_i) = p_{β*}(X_i | s) p_{η*}(s) / p_{θ*}(X_i).    (3.3)

The second line of (3.2) follows from the first since, for every θ,

−ℓ_n(θ, θ*) + Σ_{i=1}^n { log p_θ(X_i) − Σ_{s=1}^K q_{S_i}^*(s) log[ p_θ(X_i, s) / q_{S_i}^*(s) ] } = Σ_{i=1}^n Σ_{s=1}^K q_{S_i}^*(s) log{ p_{θ*}(X_i, s) / p_θ(X_i, s) }.

After choosing q_S as in (3.3) in Corollary 3.2, we can make the first term on the r.h.s. of (3.2) small by choosing the variational factor of θ to be concentrated around θ*. In the rest of this subsection, we will apply Corollary 3.2 to derive more concrete variational Bayes risk bounds under some further simplifying assumptions.

As a first application, assume there is no latent variable in the model, that is, the likelihood does not involve S^(n) and Δ_n ≡ 0. As discussed before, the α-VB solution in this case coincides with the nearest point, in KL divergence within the variational family, to the α-fractional posterior of the parameter. A reviewer pointed out a recent preprint by Alquier and Ridgeway [2] where they exploit risk bounds for fractional posteriors developed in [4] to analyze tempered posteriors and their variational approximations, which coincide with the α-VB solution when there are no latent variables. The following Theorem 3.3 arrives at a similar conclusion to Corollary 2.3 of [2]. We reiterate here that our main motivation is models with latent variables not considered in [2], and Theorem 3.3 follows as a corollary of our general result in Theorem 3.1.
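To connect with this no-latent-variable special case, the following minimal sketch (not from [2] or the present paper; the conjugate model, grid, and optimizer are illustrative assumptions) fits a Gaussian variational density to the α-fractional posterior of a Beta–Bernoulli model by directly minimizing the KL divergence on a grid:

```python
import numpy as np
from scipy import optimize
from scipy.stats import beta, norm

# alpha-fractional posterior of a conjugate Beta-Bernoulli model:
# Beta(a0 + alpha * k, b0 + alpha * (n - k)), tabulated on a grid.
a0, b0, n, k, alpha = 1.0, 1.0, 50, 36, 0.5
grid = np.linspace(1e-4, 1 - 1e-4, 5000)
dx = grid[1] - grid[0]
log_frac_post = beta.logpdf(grid, a0 + alpha * k, b0 + alpha * (n - k))

def kl_to_fractional_posterior(params):
    """KL(q || pi_{n,alpha}) for a Gaussian q with mean m and scale exp(log_s)."""
    m, log_s = params
    q = norm.pdf(grid, m, np.exp(log_s))
    q /= q.sum() * dx                          # renormalize on the grid
    return np.sum(q * (np.log(q + 1e-300) - log_frac_post)) * dx

res = optimize.minimize(kl_to_fractional_posterior, x0=[0.5, -2.0])
print(res.x[0], np.exp(res.x[1]))              # variational mean and spread
```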

Theorem 3.3 (No latent variable).

It holds with probability at least 1 − ζ that, for any probability measure q_θ with D(q_θ ‖ π) < ∞,

∫ D_α^(n)(θ, θ*) q̂_θ^α(θ) dθ ≤ (α / {(1 − α) n}) ∫ [ −ℓ_n(θ, θ*) ] q_θ(θ) dθ + { D(q_θ ‖ π) + log(1/ζ) } / {(1 − α) n}.    (3.4)

We will illustrate some particular choices of q_θ for typical variational families in the examples in Section 4.

As a second application, we consider the special case when Γ_θ is restriction-free, which is an ideal example for conveying the general idea of how to choose q_θ to control the upper bound in (3.2). To that end, define two KL neighborhoods around θ* = (β*, η*) with radius ε as

(3.5)

where we used the shorthand D(η* ‖ η) = Σ_{s=1}^K η*_s log(η*_s / η_s) to denote the KL divergence between categorical distributions with parameters η* and η in the (K − 1)-dimensional probability simplex. By choosing q_θ as the restriction of the prior π to these neighborhoods, we obtain the following theorem. Here, we make the assumption of independent priors on β and η, i.e., π = π_β ⊗ π_η, to simplify the presentation.

Theorem 3.4 (Parameter restriction-free).

For any fixed ε > 0 and ζ ∈ (0, 1), with probability at least 1 − ζ, it holds that

(3.6)

Although the results in this section assume discrete latent variables, similar results can be seamlessly obtained for continuous latent variables; see the supplemental document for more details. We will apply this theorem to analyze mean-field approximations for the Gaussian mixture model and latent Dirichlet allocation in Section 4.

Observe that the variational risk bound in Theorem 3.4 depends only on the prior mass assigned to appropriate KL neighborhoods of the truth. This renders the application of Theorem 3.4 to various practical problems particularly straightforward. As we shall see in the next subsection, the α = 1 case, i.e. regular VB, requires more stringent conditions involving the existence of exponentially consistent tests to separate points in the parameter space. The testing condition is even necessary for the actual posterior to contract; see, e.g., [4], and hence one cannot avoid the testing assumption for its usual variational approximation. Nevertheless, we show below that once the existence of such tests can be verified, the regular VB approximation can also be shown to contract optimally.

3.2 Risk bounds for the case α = 1

We consider any loss function L(θ, θ*) satisfying the following assumption.

Assumption T (Statistical identifiability):

For some sequence ε_n → 0 and any ε ≥ ε_n, there exist a sieve set F_n ⊂ Θ and a test function φ_n = φ_n(X^(n)) ∈ [0, 1] such that

(3.7)
(3.8)
(3.9)

Roughly speaking, the sieve set F_n can be viewed as the effective support of the prior distribution at sample size n, and ε_n as the contraction rate of the usual posterior distribution. The first condition (3.7) allows us to focus attention on this important region of the parameter space, which is not too large but still possesses most of the prior mass. The last two conditions (3.8) and (3.9) ensure the statistical identifiability of the parameter under the loss L through the existence of a test