BMLrepo
Bayesian Machine Learning repo
view repo
Variational inference (VI) provides fast approximations of a Bayesian posterior in part because it formulates posterior approximation as an optimization problem: to find the closest distribution to the exact posterior over some family of distributions. For practical reasons, the family of distributions in VI is usually constrained so that it does not include the exact posterior, even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior. We propose to instead consider a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). For efficient inference, we borrow ideas from gradient boosting to develop an algorithm we call boosting variational inference (BVI). BVI iteratively improves the current approximation by mixing it with a new component from the base distribution family and thereby yields progressively more accurate posterior approximations as more computing time is spent. Unlike a number of common VI variants including mean-field VI, BVI is able to capture multimodality, general posterior covariance, and nonstandard posterior shapes.
READ FULL TEXT VIEW PDFBayesian Machine Learning repo
Bayesian inference offers a flexible framework for learning with rich, hierarchical models of data and for coherently quantifying uncertainty in unknown parameters through the posterior distribution. However, for any moderately complex model, the posterior is intractable to calculate exactly and must be approximated. Variational inference (VI) has grown in popularity as a method for approximating the posterior since it is often fast even for large data sets. VI is fast partly because it formulates posterior approximation as an optimization problem; the rough idea is to find the closest distribution to the exact posterior over some family of distributions.
Mean-field variational inference (MFVI) is a particularly widely-used variant of VI due in part to its simplicity. MFVI assumes that the approximating distribution factorizes across the parameters of the model, and this assumption leads to an efficient coordinate-ascent algorithm (Blei et al., 2016). But this assumption also means that MFVI effectively cannot capture multimodality and that MFVI tends to underestimate the posterior covariance, sometimes drastically (Bishop, 2006; Wang and Titterington, 2004; Turner and Sahani, 2011; Rue et al., 2009; MacKay, 2003). The linear response technique of Giordano et al. (2015) and the full-rank approach within Kucukelbir et al. (2015) provide a correction to the covariance underestimation of MFVI in the unimodal case but do not address the multimodality issue. “Black-box” inference, as in Ranganath et al. (2014) and the mean-field approach within Kucukelbir et al. (2015), focus on making the MFVI optimization problem easier for practitioners, by avoiding tedious calculations, but they do not change the optimization objective of MFVI and therefore still face the problems outlined here.
An alternative and more flexible class of approximating distributions for variational inference (VI) is the family of mixture models. Even if we consider only Gaussian component distributions, one can find a mixture of Gaussians that is arbitrarily close to any
continuous probability density
(Epanechnikov, 1969; Parzen, 1962). Bishop et al. (1998); Jaakkola and Jordan (1998) and Gershman et al. (2012) have previously considered using approximating families with a fixed number of mixture components. Since the problem is non-convex, for practical inference, these authors also employ a further approximation to the VI optimization objective, or impose constraints on the components (e.g., Gaussians with isotropic covariance and equal weights (Gershman et al., 2012)). The resulting optimization algorithms also suffer from some practical limitations; for instance, they are sensitive to initial values. For good performance, one may need rerun the algorithm for multiple different initializations and multiple choices for the number of components.Finally, in all of the variants of VI discussed above, the family of approximating distributions does not, in general, include the exact posterior—even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior.
We propose a new algorithm, boosting variational inference (BVI), that addresses these concerns. Inspired by boosting, BVI starts with a single-component approximation and proceeds to add a new mixture component at each step for a progressively more accurate posterior approximation. Thus, BVI allows us to trade off computing time for more statistical accuracy on the fly. BVI effectively considers a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). We will see that this flexibility allows us to capture multiple modes, general posterior covariance, and nonstandard posterior shapes. We highlight that a similar idea and treatment is developed in a concurrent and independent paper Miller et al. (2016).
We start by defining our optimization problem in Section 2. We review boosting and gradient boosting in Section 3 and develop the BVI algorithm in Section 4. We demonstrate the improved performance of our algorithm on a range of target posteriors, including those generated from both real and simulated data in Section 5.
Suppose that we have observed data , and we identify a set of parameters of interest: . In the remainder, we do not make any assumptions on the form or space of the observed data . A Bayesian model is specified with a prior and a likelihood
. Bayes Theorem yields the posterior distribution,
(1) |
which may be considered the product of Bayesian inference and which expresses the practitioner’s knowledge of after seeing the data . While the unnormalized posterior in Eq. 1 is easy to evaluate, the normalizing constant usually involves an intractable integral. So an approximation is needed to use the normalized posterior distribution for any moderately complex model. The posterior
may further be used to report a point estimate of
via a mean, to report an uncertainty estimate via a covariance, or to report some other posterior functional: for some function . For example, yields the posterior mean.One approach to approximating is to find, roughly, the closest distribution among some family of distributions , where the distributions in are typically easier to work with; e.g., the calculation of posterior functionals may be easier for distributions in . More precisely, we choose a discrepancy between distribution and the exact posterior . Then we approximate with the distribution in that minimizes . We assume is non-negative in general and that is zero if and only if . In general, though, the optimum
(2) |
over some constrained family of distributions may be strictly greater than zero. Roughly, we expect a larger to yield a lower but at a higher computational cost. When , the Kullback-Leibler (KL) divergence between and , this optimization problem is called variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008; Blei et al., 2016).
A straightforward choice for is some simple parametric family of known distributions, which we denote by
(3) |
where is the space of feasible parameters. For example, we have for a multivariate Gaussian approximation, and for a mean-field Poisson-Gamma approximation to .
A natural extension to is the family of mixtures. Let denote the (-1)-dimensional simplex: . Then,
is the set of all -component mixtures over these base distributions, where is the collection of all component-wise parameters. , namely a single-component base distribution, is usually adopted for VI, including MFVI. While some authors (Bishop et al., 1998; Jaakkola and Jordan, 1998; Gershman et al., 2012) have considered for , these past works have focused on the case where the number of components is fixed and finite.
In this article, we consider an even larger family , which is the set of all finite mixtures, namely
(4) |
Note that forms the convex hull of the base distributions in (Li and Barron, 1999; Rakhlin, 2006). The family is highly flexible; for example, if multivariate Gaussians are used for , one can find a member of arbitrarily close to any continuous density (Epanechnikov, 1969). Table 1 further compares with other common choices of family. Our main contribution is to propose a novel and efficient algorithm for approximately solving the discrepancy minimization problem in .
Family | Covariance | Multimodality | Arbitrary |
---|---|---|---|
approximation | |||
Mean-field | ✗ | ✗ | ✗ |
Full-rank | ✓ | ✗ | ✗ |
✓ | ✓ | ✗ | |
✓ | ✓ | ✓ |
For a general target distribution , we do not expect to achieve zero, or infimum, discrepancy for any finite mixture. Rather, we expect to get progressively closer to the infimum by increasing the mixture size. In the case where the base distribution is Gaussian, we expect the discrepancy to approach zero as we increase the mixture size. More precisely, we consider a greedy, incremental procedure, as in Zhang (2003), to construct a sequence of finite mixtures for each . The quality of approximation can be measured with the excess discrepancy
and we want as . Then, for any , we can always find a large enough such that .
In particular, we start with a single base distribution for some . In practice, a crude initialization (e.g., ) satisfying should suffice. Iteratively, at each step , let be the approximation from the previous step. We form by mixing a new base distribution with weight together with with weight , i.e.,
(5) |
This approach is called greedy (Rakhlin, 2006, Ch. 4) if we choose (approximately) the optimal base distribution and weight at each step , so the resulting satisfies:
(6) |
That is, rather than choosing to be exactly the optimal at any round, it is enough to choose so that the discrepancy is within some non-negative sequence of optimality.
At each step the approximation remains normalized by construction and takes the form of a mixture of base distributions. The iterative updates are in the style of boosting or greedy error minimization (Freund and Schapire, 1995; Freund et al., 1999; Friedman et al., 2000; Li and Barron, 1999). Under convexity and strong smoothness conditions on , Theorem II.1 of Zhang (2003) guarantees that converges to zero at rate if Eq. 6 is performed exactly. We verify that KL divergence satisfies these conditions, subject to regularity conditions on the , in Theorem 1.
Let as a shorthand notation. Rather than jointly optimizing over , which is non-convex and difficult in general, we consider nearly optimal choices in the spirit of Eq. 6. In particular, we choose first, in the style of gradient descent. Then we fix and optimize the corresponding weight ; we show in Section 4.1 that choosing the optimal is a convex problem.
To find , we use gradient boosting (Friedman, 2001) and consider the functional gradient at the current solution . In what follows, we adopt the notation for functions and . In its general form, gradient boosting considers minimizing a loss by perturbing around its current solution , where is a function. Suppose we additively perturb to with a function ; then a Taylor expansion would yield
(7) |
as . The functional gradient can be defined from the resulting linear term: .
In our case, we consider the perturbation from to so that the solution remains normalized, where now is a distribution that describes the shape of the perturbation and describes the size of the perturbation. Now, as , a Taylor expansion yields
(8) |
where is the functional gradient . Roughly, for fixed in Eq. 8, we will choose to minimize the inner product term . I.e., as in gradient descent, we choose to “match the direction” of , where is the discrepancy we want to minimize. We provide more detail in Section 4.2.
Boosting variational inference (BVI) applies the framework of the previous section with Kullback-Leibler (KL) divergence as the discrepancy measure. We first justify the choice of KL, and then present a two-stage multivariate Gaussian mixture boosting algorithm (Algorithm 2) to stochastically decrease the excess discrepancy. In each iteration, our algorithm first chooses via gradient boosting, and then solves for given .
For as in Eq. 1, the KL discrepancy is defined as
(9) | ||||
By dropping the term , which is a constant in , an effective discrepancy can be defined as
(10) |
which is the negative value of the evidence lower bound, or “ELBO” (Blei et al., 2016).
The following theorem shows that KL satisfies the greedy boosting conditions we discuss after Eq. 6 (Zhang, 2003; Rakhlin, 2006) under assumptions on that we examine after the proof. These conditions then also hold for the effective discrepancy .
Given densities and true density , KL divergence is a convex functional, i.e., for any satisfying
If we further assume that densities are bounded , and denote the functional gradient of KL at density as , then the KL divergence is also strongly smooth, i.e., satisfying
For any densities and , we have the convexity from
where we have used and .
Let ; then we have . Again, using the inequality , we have the strong smoothness from
where in the last step we used that . ∎
Here we assume the densities are lower-bounded by . While this requirement is unrealistic for densities with unbounded support, it is not so onerous if the parameter space is bounded. Even when the parameter space is not intrinsically bounded, we are often most interested in a bounded subset of the parameter space. E.g., we might focus a compact set such that for a small .
First we show how to find given a fixed . Then in Section 4.2 below, we will show how to find . For now, we can take derivatives of effective discrepancy with respect to to find
(11) | ||||
(12) |
where
(13) |
By Eq. 12, is a convex function of . Using Eq. 11, we can estimate the gradient with Monte Carlo by drawing samples from and
. Then we use stochastic gradient descent (SGD) to solve for
, as presented below in Algorithm 1.Since the gradient is stochastically estimated instead of exact, to ensure convergence we use a decaying sequence of step sizes in Algorithm 1 that satisfy the Robbins-Monro conditions (Robbins and Monro, 1951). Note that we use only -order methods here because unlike the gradient, estimating the derivative in Eq. 12 involves computation no longer in the logarithmic scale and is hence numerically unstable.
It is difficult to solve for the best in the optimization problem within Eq. 6, so instead we use gradient boosting to identify a new component that will be good enough, in the sense of Eq. 6. As we suggest after Eq. 8, we propose (roughly) to choose in the “direction” of the negative functional gradient. We first take the Taylor expansion of the effective discrepancy around
(14) |
When the new component contributes small weight , this result suggests that we might choose to minimize , where is the functional gradient .
However, a direct minimization of the inner product is ill-posed since will degenerate to a point mass at the minimum of functional gradient. Instead we consider the regularized minimization
(15) |
with some . The problem can be rewritten as
(16) |
For a Gaussian family with , we have
(17) |
where the log determinant term prevents degeneracy.
Eq. 17, though, is still a non-convex problem and requires approximation. Although methods based on SGD and the reparameterization trick (Kingma and Welling, 2014) can be applied, they are out of the scope of this article. Instead, we seek to develop a simple and efficient procedure based on off-the-shelf optimization routines. For this purpose, we interpret Eq. 17 as follows.
We say that is the residual log-density corresponding to posterior approximation . Ideally, we would have , and the residual would be a flat plane. In general, though, the residual has peaks where posterior density is underestimated and basins where the posterior density is overestimated. Intuitively, the new component should be introduced to “patch” the region where density is currently underestimated. This idea is illustrated in Figure 1.
We therefore propose the following efficient heuristic (
Algorithm 2) to approximately optimize Eq. 17. We start by approximately decomposing the residual into a constant plus a quadratic peakFor instance, we can choose and according to Laplace approximation of ; that is, we match the location of a peak of with and match the Hessian at that location with . Then, Eq. 17 reduces to
which is identical to
(18) |
This last equation describes a convex problem, and we can write the solution in closed form:
(19) |
For the Laplace approximation, the (local) optimum and Hessian can be calculated numerically. This particular choice for and , together with the resulting , is described in Algorithm 2 and illustrated in Figure 1.
We now provide a justification for this choice based on a similar treatment of the KL divergence in the alternate direction, namely , defined as
where is the normalizing constant. Fixing , we use the shorthand , where we employ the lower-case kl to distinguish this discrepancy from the KL divergence in the other direction (cf. Eqs. 10 and 9) By dropping constants that do not involve as in Eq. 10, we can define an effective discrepancy,
(20) |
We now consider a gradient boosting procedure for . Again, from a Taylor expansion around , we have
Here, we should match to the direction of . To avoid degeneracy, we match in the norm
(21) |
as suggested by Friedman (2001). By plugging in the optimal value for , the objective is equivalent to
(22) |
which, by Jensen’s inequality, lower bounds the objective in Eq. 16 when . Heuristically, we might expect that this bound iteratively tightens; as better approximates , approaches a constant function. Thus, referencing Jensen’s inequality, we might expect to more closely approximate .
In Algorithm 2, we maximize the log-residual
to find a local peak. If the exact posterior has a strictly heavier tail than any Gaussian random variable, we have that
as , and the argmax diverges. In order to ensure the Laplace approximation mean is finite, and therefore of practical use, in Algorithm 2, we manually stabilize the tails of log-residual. In particular we add a small constant (e.g., ) to the both and , so that(23) |
The log-sum-exp trick is useful for implementing this modification.
When the dimension of the parameter space is high, the calculation of the covariance in Algorithm 2 can be costly. In particular, the cost of estimating the Hessian with finite difference is , and another for inversion or Cholesky decomposition. When is large, one option is to consider a diagonal approximation to the Hessian (and hence covariance). While this choice reduces the computation cost of a single iteration to , the convergence may be slower in terms of number of iterations.
We refer to Algorithm 2 as boosting variational inference (BVI) and evaluate its performance on a variety of different exact posterior distributions—including distributions arising from both simulated and real data sets. For comparison, we run automatic differentiation variational inference (ADVI) (Kucukelbir et al., 2015), which effectively performs mean-field variational inference (MFVI) but does not require any conditional conjugacy properties in the generative model. We use the ADVI implementation provided in Stan (Kucukelbir et al., 2015; Carpenter et al., 2016). In all experiments, we choose particles to evaluate stochastic gradients both in BVI and ADVI. We choose a tolerance of in Algorithm 1. We use L-BFGS (Byrd et al., 1995) for optimization and the finite difference method when estimating the Hessian in Algorithm 2.
First, in Figure 2, we consider a number of simple choices for the exact target distribution (plotted in blue) that illustrate the different behavior of ADVI (orange) and BVI (red). In particular, the target distribution in (a) of Figure 2 is a Cauchy density with heavy tails. In Figure 2(b), we see that BVI is able to capture the multimodality of the target distribution, which is a mixture of univariate Gaussians with different locations and scales. And in Figure 2
(c), the target distribution is a mixture of five two-dimensional Gaussians with different means and covariances. In each case, we initialize BVI with a Gaussian with very large (co)variance (plotted in green), and then run BVI for 30 iterations. The sequences
and are shown in subplots.We next highlight the ability of BVI to trade-off longer running times for more approximation accuracy on the fly as the number of iterations increase. We demonstrate this property on the “banana” distribution, which is widely-used for studying the performance of MCMC algorithms (Haario et al., 1999, 2001; Roberts and Rosenthal, 2009). The distribution has the following density on :
where the constant controls the curvature, and we choose . This target density is shown in the leftmost panel of Figure 5.
We initialize Algorithm 2
with a standard Gaussian distribution and run for
iterations. Figure 5 shows the intermediate estimates of posterior density as the number of iterations grows. We can see how early, crude approximations are quickly made available, but then the approximation is refined over time. The sequence of weights and Monte Carlo estimates of ELBO are shown in Figure 3.In this experiment, we look at a sensor network localization problem that has been considered by Ihler et al. (2005); Ahn et al. (2013) and Lan et al. (2014). We have sensors located on a 2D plane, with coordinates for . For each pair of sensors
, their distance is observed according to a Bernoulli distribution with the given probability:
(24) |
Let denote the distance measured between sensors and . When (observed), the measured distance is the actual distance polluted by a Gaussian noise
(25) |
Otherwise, we assume when (unobserved).
Now, following the setup in Ahn et al. (2013) and Lan et al. (2014), we set , and . To avoid ambiguity of localization arising from translation and rotation, we assume is known for . Hence, we are interested in localizing the remaining 8 sensors. In particular, we want to infer the posterior distribution of (located at crosses in Fig. 6) given and (located at circles in Fig. 6). The resulting posterior distribution is a highly non-Gaussian and multimodal one on with .
In Figure 6, we compare the posterior samples as inferred by BVI and ADVI (full-rank). Both algorithms are given enough running time (200 iterations for BVI, 100,000 iterations for ADVI implemented in Stan (Kucukelbir et al., 2015; Carpenter et al., 2016)) to reach convergence. For reference, we treat the result from a converged NUTS sampler (Hoffman and Gelman, 2014) as ground truth. In Figure 6, we see that, unlike ADVI, our algorithm is able to reliably capture the complex shape in the posterior.
We investigate the trade-off between time and accuracy for both BVI and ADVI. To measure the accuracy of posterior estimation, we use the relative error of the estimated mean (REM) as a summary index (Ahn et al., 2013; Lan et al., 2014). REM measures the relative error of approximating the posterior mean in norm,
(26) |
and can be computed from sample averages. A comparison of REM versus running time for both BVI and ADVI is shown in Figure 4. Recall that we use particles for both BVI and ADVI to evaluate stochastic gradients.
We apply our algorithm to Bayesian logistic regression for the Nodal dataset (Brown, 1980), consisting of observations of 6 predictors (intercept included) and a binary response . The likelihood is , where and we use the prior . We treat results from a Polya-Gamma sampler (a specialized MCMC algorithm) using R package BayesLogit (Polson et al., 2013) as the ground truth. As shown in Figure 7, while both BVI and ADVI capture the correct mean, BVI provides better estimates of the variance and, unlike MFVI, does not set the covariances to zero. Here, though, we also see the limitations of BVI in the noisy covariance estimates. As expected, the case for BVI is more compelling in the previous examples, where MFVI yields biased estimates of the posterior means (Turner and Sahani, 2011) or the posterior is multimodal.
We have developed a new variational inference (VI) algorithm, boosting variational inference (BVI). And we have demonstrated empirically that BVI can capture posterior multimodality and nonstandard shapes. There are notoriously few theoretical guarantees for quality of inference in VI. Nonetheless, the family considered here (all possible finite mixtures of a parametric base distribution) may be more ripe for future theoretical analysis due to its desirable limiting properties, though such an analysis is outside the scope of the present paper.
European conference on computational learning theory
, pages 23–37. Springer, 1995.Journal-Japanese Society For Artificial Intelligence
, 14(771-780):1612, 1999.On estimation of a probability density function and mode.
The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.