 # Boosting Variational Inference

Variational inference (VI) provides fast approximations of a Bayesian posterior in part because it formulates posterior approximation as an optimization problem: to find the closest distribution to the exact posterior over some family of distributions. For practical reasons, the family of distributions in VI is usually constrained so that it does not include the exact posterior, even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior. We propose to instead consider a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). For efficient inference, we borrow ideas from gradient boosting to develop an algorithm we call boosting variational inference (BVI). BVI iteratively improves the current approximation by mixing it with a new component from the base distribution family and thereby yields progressively more accurate posterior approximations as more computing time is spent. Unlike a number of common VI variants including mean-field VI, BVI is able to capture multimodality, general posterior covariance, and nonstandard posterior shapes.

## Code Repositories

### BMLrepo

Bayesian Machine Learning repo

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Bayesian inference offers a flexible framework for learning with rich, hierarchical models of data and for coherently quantifying uncertainty in unknown parameters through the posterior distribution. However, for any moderately complex model, the posterior is intractable to calculate exactly and must be approximated. Variational inference (VI) has grown in popularity as a method for approximating the posterior since it is often fast even for large data sets. VI is fast partly because it formulates posterior approximation as an optimization problem; the rough idea is to find the closest distribution to the exact posterior over some family of distributions.

Mean-field variational inference (MFVI) is a particularly widely-used variant of VI due in part to its simplicity. MFVI assumes that the approximating distribution factorizes across the parameters of the model, and this assumption leads to an efficient coordinate-ascent algorithm (Blei et al., 2016). But this assumption also means that MFVI effectively cannot capture multimodality and that MFVI tends to underestimate the posterior covariance, sometimes drastically (Bishop, 2006; Wang and Titterington, 2004; Turner and Sahani, 2011; Rue et al., 2009; MacKay, 2003). The linear response technique of Giordano et al. (2015) and the full-rank approach within Kucukelbir et al. (2015) provide a correction to the covariance underestimation of MFVI in the unimodal case but do not address the multimodality issue. “Black-box” inference, as in Ranganath et al. (2014) and the mean-field approach within Kucukelbir et al. (2015), focus on making the MFVI optimization problem easier for practitioners, by avoiding tedious calculations, but they do not change the optimization objective of MFVI and therefore still face the problems outlined here.

An alternative and more flexible class of approximating distributions for variational inference (VI) is the family of mixture models. Even if we consider only Gaussian component distributions, one can find a mixture of Gaussians that is arbitrarily close to any

continuous probability density

(Epanechnikov, 1969; Parzen, 1962). Bishop et al. (1998); Jaakkola and Jordan (1998) and Gershman et al. (2012) have previously considered using approximating families with a fixed number of mixture components. Since the problem is non-convex, for practical inference, these authors also employ a further approximation to the VI optimization objective, or impose constraints on the components (e.g., Gaussians with isotropic covariance and equal weights (Gershman et al., 2012)). The resulting optimization algorithms also suffer from some practical limitations; for instance, they are sensitive to initial values. For good performance, one may need rerun the algorithm for multiple different initializations and multiple choices for the number of components.

Finally, in all of the variants of VI discussed above, the family of approximating distributions does not, in general, include the exact posterior—even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior.

We propose a new algorithm, boosting variational inference (BVI), that addresses these concerns. Inspired by boosting, BVI starts with a single-component approximation and proceeds to add a new mixture component at each step for a progressively more accurate posterior approximation. Thus, BVI allows us to trade off computing time for more statistical accuracy on the fly. BVI effectively considers a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). We will see that this flexibility allows us to capture multiple modes, general posterior covariance, and nonstandard posterior shapes. We highlight that a similar idea and treatment is developed in a concurrent and independent paper Miller et al. (2016).

We start by defining our optimization problem in Section 2. We review boosting and gradient boosting in Section 3 and develop the BVI algorithm in Section 4. We demonstrate the improved performance of our algorithm on a range of target posteriors, including those generated from both real and simulated data in Section 5.

## 2. Variational inference and Gaussian mixtures

Suppose that we have observed data , and we identify a set of parameters of interest: . In the remainder, we do not make any assumptions on the form or space of the observed data . A Bayesian model is specified with a prior and a likelihood

. Bayes Theorem yields the posterior distribution,

 p(θ|X)=π(θ)p(X|θ)p(X)∝π(θ)p(X|θ)=:f(θ), (1)

which may be considered the product of Bayesian inference and which expresses the practitioner’s knowledge of after seeing the data . While the unnormalized posterior in Eq. 1 is easy to evaluate, the normalizing constant usually involves an intractable integral. So an approximation is needed to use the normalized posterior distribution for any moderately complex model. The posterior

may further be used to report a point estimate of

via a mean, to report an uncertainty estimate via a covariance, or to report some other posterior functional: for some function . For example, yields the posterior mean.

One approach to approximating is to find, roughly, the closest distribution among some family of distributions , where the distributions in are typically easier to work with; e.g., the calculation of posterior functionals may be easier for distributions in . More precisely, we choose a discrepancy between distribution and the exact posterior . Then we approximate with the distribution in that minimizes . We assume is non-negative in general and that is zero if and only if . In general, though, the optimum

 D∗:=infq∈HD(q,pX) (2)

over some constrained family of distributions may be strictly greater than zero. Roughly, we expect a larger to yield a lower but at a higher computational cost. When , the Kullback-Leibler (KL) divergence between and , this optimization problem is called variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008; Blei et al., 2016).

A straightforward choice for is some simple parametric family of known distributions, which we denote by

 H1={hϕ(θ):ϕ∈Φ}, (3)

where is the space of feasible parameters. For example, we have for a multivariate Gaussian approximation, and for a mean-field Poisson-Gamma approximation to .

A natural extension to is the family of mixtures. Let denote the (-1)-dimensional simplex: . Then,

 Hk={h:h(θ)=k∑j=1wjhϕj(θ),w∈Δk,ϕ∈Φk}

is the set of all -component mixtures over these base distributions, where is the collection of all component-wise parameters. , namely a single-component base distribution, is usually adopted for VI, including MFVI. While some authors (Bishop et al., 1998; Jaakkola and Jordan, 1998; Gershman et al., 2012) have considered for , these past works have focused on the case where the number of components is fixed and finite.

In this article, we consider an even larger family , which is the set of all finite mixtures, namely

 H∞=∞⋃k=1Hk. (4)

Note that forms the convex hull of the base distributions in  (Li and Barron, 1999; Rakhlin, 2006). The family is highly flexible; for example, if multivariate Gaussians are used for , one can find a member of arbitrarily close to any continuous density (Epanechnikov, 1969). Table 1 further compares with other common choices of family. Our main contribution is to propose a novel and efficient algorithm for approximately solving the discrepancy minimization problem in .

## 3. Boosting and Gradient Boosting

For a general target distribution , we do not expect to achieve zero, or infimum, discrepancy for any finite mixture. Rather, we expect to get progressively closer to the infimum by increasing the mixture size. In the case where the base distribution is Gaussian, we expect the discrepancy to approach zero as we increase the mixture size. More precisely, we consider a greedy, incremental procedure, as in Zhang (2003), to construct a sequence of finite mixtures for each . The quality of approximation can be measured with the excess discrepancy

 ΔD(qt):=D(qt,pX)−D∗≥0,

and we want as . Then, for any , we can always find a large enough such that .

In particular, we start with a single base distribution for some . In practice, a crude initialization (e.g., ) satisfying should suffice. Iteratively, at each step , let be the approximation from the previous step. We form by mixing a new base distribution with weight together with with weight , i.e.,

 qt=(1−αt)qt−1+αtht. (5)

This approach is called greedy (Rakhlin, 2006, Ch. 4) if we choose (approximately) the optimal base distribution and weight at each step , so the resulting satisfies:

 D(qt,pX)≤infϕ∈Φ0≤α≤1D((1−α)qt−1+αhϕ,pX)+ϵt. (6)

That is, rather than choosing to be exactly the optimal at any round, it is enough to choose so that the discrepancy is within some non-negative sequence of optimality.

At each step the approximation remains normalized by construction and takes the form of a mixture of base distributions. The iterative updates are in the style of boosting or greedy error minimization (Freund and Schapire, 1995; Freund et al., 1999; Friedman et al., 2000; Li and Barron, 1999). Under convexity and strong smoothness conditions on , Theorem II.1 of Zhang (2003) guarantees that converges to zero at rate if Eq. 6 is performed exactly. We verify that KL divergence satisfies these conditions, subject to regularity conditions on the , in Theorem 1.

Let as a shorthand notation. Rather than jointly optimizing over , which is non-convex and difficult in general, we consider nearly optimal choices in the spirit of Eq. 6. In particular, we choose first, in the style of gradient descent. Then we fix and optimize the corresponding weight ; we show in Section 4.1 that choosing the optimal is a convex problem.

To find , we use gradient boosting (Friedman, 2001) and consider the functional gradient at the current solution . In what follows, we adopt the notation for functions and . In its general form, gradient boosting considers minimizing a loss by perturbing around its current solution , where is a function. Suppose we additively perturb to with a function ; then a Taylor expansion would yield

 L(f+ϵh)=L(f)+ϵ ⟨h,g⟩+o(ϵ2) (7)

as . The functional gradient can be defined from the resulting linear term: .

In our case, we consider the perturbation from to so that the solution remains normalized, where now is a distribution that describes the shape of the perturbation and describes the size of the perturbation. Now, as , a Taylor expansion yields

 D((1−ϵ)q+ϵh)=D(q+ϵ (h−q))=D(q)+ϵ ⟨h−q,g⟩+o(ϵ2), (8)

where is the functional gradient . Roughly, for fixed in Eq. 8, we will choose to minimize the inner product term . I.e., as in gradient descent, we choose to “match the direction” of , where is the discrepancy we want to minimize. We provide more detail in Section 4.2.

## 4. Boosting Variational Inference

Boosting variational inference (BVI) applies the framework of the previous section with Kullback-Leibler (KL) divergence as the discrepancy measure. We first justify the choice of KL, and then present a two-stage multivariate Gaussian mixture boosting algorithm (Algorithm 2) to stochastically decrease the excess discrepancy. In each iteration, our algorithm first chooses via gradient boosting, and then solves for given .

For as in Eq. 1, the KL discrepancy is defined as

 D(q)=DKL(q∥pX) =∫q(θ)logq(θ)pX(θ)dθ (9) =logp(X)+∫q(θ)logq(θ)f(θ)dθ.

By dropping the term , which is a constant in , an effective discrepancy can be defined as

 ~DKL(q)=∫q(θ)logq(θ)f(θ)dθ, (10)

which is the negative value of the evidence lower bound, or “ELBO” (Blei et al., 2016).

The following theorem shows that KL satisfies the greedy boosting conditions we discuss after Eq. 6 (Zhang, 2003; Rakhlin, 2006) under assumptions on that we examine after the proof. These conditions then also hold for the effective discrepancy .

###### Theorem 1.

Given densities and true density , KL divergence is a convex functional, i.e., for any satisfying

 DKL((1−α)q1+αq2∥p)≤(1−α)DKL(q1∥p)+αDKL(q2∥p).

If we further assume that densities are bounded , and denote the functional gradient of KL at density as , then the KL divergence is also strongly smooth, i.e., satisfying

 DKL(q2∥p)−DKL(q1∥p) ≤⟨∇DKL(q1∥p),q2−q1⟩+1a∥q2−q1∥22.
###### Proof.

For any densities and , we have the convexity from

 DKL((1−α)q1+αq2∥p) =(1−α)∫q1log(1−α)q1+αq2pdθ+α∫q2log(1−α)q1+αq2pdθ =(1−α)∫q1logq1[1+α(q2/q1−1)]pdθ+α∫q2logq2[1+(1−α)(q1/q2−1)]pdθ ≤(1−α)DKL(q1∥p)+αDKL(q2∥p),

where we have used and .

Let ; then we have . Again, using the inequality , we have the strong smoothness from

 DKL(q1+h∥p)−DKL(q1∥p) =⟨q1+h,log(q1+h)⟩−⟨q1,logq1⟩−⟨h,logp⟩ ≤⟨q1+h,hq1+logq1⟩−⟨q1,logq1⟩−⟨h,logp⟩ =⟨h,logq1⟩+⟨h,hq1⟩−⟨h,logp⟩ ≤⟨logq1−logp,h⟩+1a∥h∥22,

where in the last step we used that . ∎

Here we assume the densities are lower-bounded by . While this requirement is unrealistic for densities with unbounded support, it is not so onerous if the parameter space is bounded. Even when the parameter space is not intrinsically bounded, we are often most interested in a bounded subset of the parameter space. E.g., we might focus a compact set such that for a small .

### 4.1. Setting αt with SGD

First we show how to find given a fixed . Then in Section 4.2 below, we will show how to find . For now, we can take derivatives of effective discrepancy with respect to to find

 ∂~DKL(qt)∂αt =∫(ht−qt−1)log(1−αt)qt−1+αthtfdθ =Eθ∼htγαt(θ)−Eθ∼qt−1γαt(θ), (11) ∂2~DKL(qt)∂α2t =∫(ht−qt−1)2(1−αt)qt−1+αthtdθ≥0, (12)

where

 γαt(θ):=log(1−αt)qt−1(θ)+αtht(θ)f(θ). (13)

By Eq. 12, is a convex function of . Using Eq. 11, we can estimate the gradient with Monte Carlo by drawing samples from and

. Then we use stochastic gradient descent (SGD) to solve for

, as presented below in Algorithm 1.

Since the gradient is stochastically estimated instead of exact, to ensure convergence we use a decaying sequence of step sizes in Algorithm 1 that satisfy the Robbins-Monro conditions (Robbins and Monro, 1951). Note that we use only -order methods here because unlike the gradient, estimating the derivative in Eq. 12 involves computation no longer in the logarithmic scale and is hence numerically unstable.

### 4.2. Setting ht with Laplacian Gradient Boosting

It is difficult to solve for the best in the optimization problem within Eq. 6, so instead we use gradient boosting to identify a new component that will be good enough, in the sense of Eq. 6. As we suggest after Eq. 8, we propose (roughly) to choose in the “direction” of the negative functional gradient. We first take the Taylor expansion of the effective discrepancy around

 ~DKL(qt)=~DKL(qt−1)+αt⟨ht,log(qt−1/f)⟩−αt⟨qt−1,log(qt−1/f)⟩+o(α2t). (14)

When the new component contributes small weight , this result suggests that we might choose to minimize , where is the functional gradient .

However, a direct minimization of the inner product is ill-posed since will degenerate to a point mass at the minimum of functional gradient. Instead we consider the regularized minimization

 ^ht=argminh=hϕ⟨h,−log(f/qt−1)⟩+λ2log∥h∥22 (15)

with some . The problem can be rewritten as

 ^ht=argmaxh=hϕEθ∼hlogf(θ)qt−1(θ)−λ/2⋅log∥h∥22. (16)

For a Gaussian family with , we have

 ^μt,^Σt=argmaxμ,ΣEθ∼Nμ,Σlogf(θ)qt−1(θ)+λ/4⋅log|Σ|, (17)

where the log determinant term prevents degeneracy.

Eq. 17, though, is still a non-convex problem and requires approximation. Although methods based on SGD and the reparameterization trick (Kingma and Welling, 2014) can be applied, they are out of the scope of this article. Instead, we seek to develop a simple and efficient procedure based on off-the-shelf optimization routines. For this purpose, we interpret Eq. 17 as follows.

We say that is the residual log-density corresponding to posterior approximation . Ideally, we would have , and the residual would be a flat plane. In general, though, the residual has peaks where posterior density is underestimated and basins where the posterior density is overestimated. Intuitively, the new component should be introduced to “patch” the region where density is currently underestimated. This idea is illustrated in Figure 1. Figure 1. Algorithm 2 identifies new component h2 by finding a (local) peak of the log residual and its corresponding Hessian.

We therefore propose the following efficient heuristic (

Algorithm 2) to approximately optimize Eq. 17. We start by approximately decomposing the residual into a constant plus a quadratic peak

 log(f(θ)/qt−1(θ))≈−12(θ−η)TH(θ−η)+const.

For instance, we can choose and according to Laplace approximation of ; that is, we match the location of a peak of with and match the Hessian at that location with . Then, Eq. 17 reduces to

 minμ,Σ−12Eθ∼N(μ,Σ)(θ−η)TH(θ−η)+λ4log|Σ|,

which is identical to

 minμ,Σ−12Tr(H(μμT+Σ−2μηT))+λ4log|Σ|. (18)

This last equation describes a convex problem, and we can write the solution in closed form:

 μ∗=η,Σ∗=λ2H−1. (19)

For the Laplace approximation, the (local) optimum and Hessian can be calculated numerically. This particular choice for and , together with the resulting , is described in Algorithm 2 and illustrated in Figure 1.

#### 4.2.1. Choosing λ=1

We now provide a justification for this choice based on a similar treatment of the KL divergence in the alternate direction, namely , defined as

 DKL(pX∥qt) =∫pX(θ)logpX(θ)qt(θ)dθ =∫pX(θ)logpX(θ)dθ−c∫f(θ)logqt(θ)dθ,

where is the normalizing constant. Fixing , we use the shorthand , where we employ the lower-case kl to distinguish this discrepancy from the KL divergence in the other direction (cf. Eqs. 10 and 9) By dropping constants that do not involve as in Eq. 10, we can define an effective discrepancy,

 ~Dkl(q)=−∫f(θ)logq(θ)dθ. (20)

We now consider a gradient boosting procedure for . Again, from a Taylor expansion around , we have

 ~Dkl(qt)=~Dkl(qt−1)−αt⟨f(θ)/qt−1(θ),ht(θ)⟩+αt∫f(θ)dθ+o(α2t).

Here, we should match to the direction of . To avoid degeneracy, we match in the norm

 minh=hϕ,λ′>0∥λ′h−f/qt−1∥22, (21)

as suggested by Friedman (2001). By plugging in the optimal value for , the objective is equivalent to

 maxh=hϕlogEθ∼h(f(θ)/qt−1(θ))−1/2⋅log∥h∥22, (22)

which, by Jensen’s inequality, lower bounds the objective in Eq. 16 when . Heuristically, we might expect that this bound iteratively tightens; as better approximates , approaches a constant function. Thus, referencing Jensen’s inequality, we might expect to more closely approximate .

#### 4.2.2. Stabilizing the Log-Residual

In Algorithm 2, we maximize the log-residual

to find a local peak. If the exact posterior has a strictly heavier tail than any Gaussian random variable, we have that

as , and the argmax diverges. In order to ensure the Laplace approximation mean is finite, and therefore of practical use, in Algorithm 2, we manually stabilize the tails of log-residual. In particular we add a small constant (e.g., ) to the both and , so that

 log((f(θ)+a)/(qt−1(θ)+a))→0as % ∥θ∥→∞. (23)

The log-sum-exp trick is useful for implementing this modification.

#### 4.2.3. Scaling to High Dimensions

When the dimension of the parameter space is high, the calculation of the covariance in Algorithm 2 can be costly. In particular, the cost of estimating the Hessian with finite difference is , and another for inversion or Cholesky decomposition. When is large, one option is to consider a diagonal approximation to the Hessian (and hence covariance). While this choice reduces the computation cost of a single iteration to , the convergence may be slower in terms of number of iterations.

## 5. Experiments

We refer to Algorithm 2 as boosting variational inference (BVI) and evaluate its performance on a variety of different exact posterior distributions—including distributions arising from both simulated and real data sets. For comparison, we run automatic differentiation variational inference (ADVI) (Kucukelbir et al., 2015), which effectively performs mean-field variational inference (MFVI) but does not require any conditional conjugacy properties in the generative model. We use the ADVI implementation provided in Stan (Kucukelbir et al., 2015; Carpenter et al., 2016). In all experiments, we choose particles to evaluate stochastic gradients both in BVI and ADVI. We choose a tolerance of in Algorithm 1. We use L-BFGS (Byrd et al., 1995) for optimization and the finite difference method when estimating the Hessian in Algorithm 2.

### 5.1. Toy Examples Figure 2. Toy examples: (a) (heavy-tailed) Cauchy distribution (b) a mixture of four univariate Gaussians and (c) a mixture of five bivariate Gaussians with random means and covariances. Curves/contours are colored by blue (true), red (BVI), green (initial q1 in BVI) and orange (ADVI). Sequence of (αt)t and Monte Carlo estimates of (DKL(qt))t are plotted against iteration in subplots.

First, in Figure 2, we consider a number of simple choices for the exact target distribution (plotted in blue) that illustrate the different behavior of ADVI (orange) and BVI (red). In particular, the target distribution in (a) of Figure 2 is a Cauchy density with heavy tails. In Figure 2(b), we see that BVI is able to capture the multimodality of the target distribution, which is a mixture of univariate Gaussians with different locations and scales. And in Figure 2

(c), the target distribution is a mixture of five two-dimensional Gaussians with different means and covariances. In each case, we initialize BVI with a Gaussian with very large (co)variance (plotted in green), and then run BVI for 30 iterations. The sequences

and are shown in subplots. Figure 3. Sequence of (αt)t and Monte Carlo ELBO estimates (−~DKL(qt))t against the number of iterations, for estimating the “banana”’ distribution in Section 5.2. The y-axis is in logarithmic scale and the points of αt at the bottom are values near zero. Figure 4. Comparison of accuracy versus running time for different algorithms in sensor localization (Section 5.3). Results are obtained by averaging 5 runs of each algorithm.

### 5.2. Banana Distribution Figure 5. Contour plots for the estimated posterior density of the “banana” distribution, as the number of iterations of BVI grows (t=1,10,20,50,100,200,400). The true target density is shown in the leftmost panel.

We next highlight the ability of BVI to trade-off longer running times for more approximation accuracy on the fly as the number of iterations increase. We demonstrate this property on the “banana” distribution, which is widely-used for studying the performance of MCMC algorithms (Haario et al., 1999, 2001; Roberts and Rosenthal, 2009). The distribution has the following density on :

 pX(θ1,θ2)∝exp(−θ21/200−(θ2+Bθ21−100B)2/2),

where the constant controls the curvature, and we choose . This target density is shown in the leftmost panel of Figure 5.

We initialize Algorithm 2

with a standard Gaussian distribution and run for

iterations. Figure 5 shows the intermediate estimates of posterior density as the number of iterations grows. We can see how early, crude approximations are quickly made available, but then the approximation is refined over time. The sequence of weights and Monte Carlo estimates of ELBO are shown in Figure 3.

### 5.3. Sensor Network Localization Figure 6. Posterior samples of the sensor localization problem in Section 5.3 as inferred from NUTS (considered truth), BVI and ADVI. The circles (location known) and crosses (location unknown) mark the true coordinates of the sensors.

In this experiment, we look at a sensor network localization problem that has been considered by Ihler et al. (2005); Ahn et al. (2013) and Lan et al. (2014). We have sensors located on a 2D plane, with coordinates for . For each pair of sensors

, their distance is observed according to a Bernoulli distribution with the given probability:

 Zij|θi,θj∼Ber(exp(−∥θi−θj∥22/(2R2))). (24)

Let denote the distance measured between sensors and . When (observed), the measured distance is the actual distance polluted by a Gaussian noise

 Yij|Zij=1,θi,θj∼N(∥θi−θj∥2,σ2). (25)

Otherwise, we assume when (unobserved).

Now, following the setup in Ahn et al. (2013) and Lan et al. (2014), we set , and . To avoid ambiguity of localization arising from translation and rotation, we assume is known for . Hence, we are interested in localizing the remaining 8 sensors. In particular, we want to infer the posterior distribution of (located at crosses in Fig. 6) given and (located at circles in Fig. 6). The resulting posterior distribution is a highly non-Gaussian and multimodal one on with .

In Figure 6, we compare the posterior samples as inferred by BVI and ADVI (full-rank). Both algorithms are given enough running time (200 iterations for BVI, 100,000 iterations for ADVI implemented in Stan (Kucukelbir et al., 2015; Carpenter et al., 2016)) to reach convergence. For reference, we treat the result from a converged NUTS sampler (Hoffman and Gelman, 2014) as ground truth. In Figure 6, we see that, unlike ADVI, our algorithm is able to reliably capture the complex shape in the posterior.

We investigate the trade-off between time and accuracy for both BVI and ADVI. To measure the accuracy of posterior estimation, we use the relative error of the estimated mean (REM) as a summary index (Ahn et al., 2013; Lan et al., 2014). REM measures the relative error of approximating the posterior mean in norm,

 REM(q,pX)=∥Eqθ−EpXθ∥1∥EpXθ∥1, (26)

and can be computed from sample averages. A comparison of REM versus running time for both BVI and ADVI is shown in Figure 4. Recall that we use particles for both BVI and ADVI to evaluate stochastic gradients.

### 5.4. Bayesian Logistic Regression Figure 7. Estimated posterior mean and covariance for logistic regression (dashed: estimate = true).

We apply our algorithm to Bayesian logistic regression for the Nodal dataset (Brown, 1980), consisting of observations of 6 predictors (intercept included) and a binary response . The likelihood is , where and we use the prior . We treat results from a Polya-Gamma sampler (a specialized MCMC algorithm) using R package BayesLogit (Polson et al., 2013) as the ground truth. As shown in Figure 7, while both BVI and ADVI capture the correct mean, BVI provides better estimates of the variance and, unlike MFVI, does not set the covariances to zero. Here, though, we also see the limitations of BVI in the noisy covariance estimates. As expected, the case for BVI is more compelling in the previous examples, where MFVI yields biased estimates of the posterior means (Turner and Sahani, 2011) or the posterior is multimodal.

## 6. Conclusions

We have developed a new variational inference (VI) algorithm, boosting variational inference (BVI). And we have demonstrated empirically that BVI can capture posterior multimodality and nonstandard shapes. There are notoriously few theoretical guarantees for quality of inference in VI. Nonetheless, the family considered here (all possible finite mixtures of a parametric base distribution) may be more ripe for future theoretical analysis due to its desirable limiting properties, though such an analysis is outside the scope of the present paper.

## References

• Ahn et al.  S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting Monte Carlo through regenerations. In AISTATS, pages 108–116. Citeseer, 2013.
• Bishop  C. M. Bishop. Pattern Recognition and Machine Learning, chapter 10. Springer, New York, 2006.
• Bishop et al.  C. M. Bishop, N. Lawrence, T. Jaakkola, and M. I. Jordan. Approximating posterior distributions in belief networks using mixtures. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, volume 10, page 416, 1998.
• Blei et al.  D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670, 2016.
• Brown  B. Brown. Prediction analyses for binary data. Biostatistics Casebook, pages 3–18, 1980.
• Byrd et al.  R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
• Carpenter et al.  B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. J Stat Softw, 2016.
• Epanechnikov  V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
• Freund and Schapire  Y. Freund and R. E. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In

European conference on computational learning theory

, pages 23–37. Springer, 1995.
• Freund et al.  Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting.

Journal-Japanese Society For Artificial Intelligence

, 14(771-780):1612, 1999.
• Friedman et al.  J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
• Friedman  J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
• Gershman et al.  S. Gershman, M. Hoffman, and D. Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, pages 663–670, 2012.
• Giordano et al.  R. J. Giordano, T. Broderick, and M. I. Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems, pages 1441–1449, 2015.
• Haario et al.  H. Haario, E. Saksman, and J. Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–396, 1999.
• Haario et al.  H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, pages 223–242, 2001.
• Hoffman and Gelman  M. D. Hoffman and A. Gelman. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
• Ihler et al.  A. T. Ihler, J. W. Fisher, R. L. Moses, and A. S. Willsky. Nonparametric belief propagation for self-localization of sensor networks. IEEE Journal on Selected Areas in Communications, 23(4):809–819, 2005.
• Jaakkola and Jordan  T. Jaakkola and M. I. Jordan. Improving the mean field approximation via the use of mixture distributions. In Learning in graphical models, pages 163–173. Springer, 1998.
• Jordan et al.  M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
• Kingma and Welling  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In The International Conference on Learning Representations, 2014.
• Kucukelbir et al.  A. Kucukelbir, R. Ranganath, A. Gelman, and D. Blei. Automatic variational inference in Stan. In Advances in neural information processing systems, pages 568–576, 2015.
• Lan et al.  S. Lan, J. Streets, and B. Shahbaba. Wormhole Hamiltonian Monte Carlo. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, volume 2014, page 1953. NIH Public Access, 2014.
• Li and Barron  J. Q. Li and A. R. Barron. Mixture density estimation. In Advances in Neural Information Processing Systems, pages 279–285, 1999.
• MacKay  D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms, chapter 33. Cambridge University Press, 2003.
• Miller et al.  A. C. Miller, N. Foti, and R. P. Adams. Variational boosting: Iteratively refining posterior approximations. arXiv preprint arXiv:1611.06585, 2016.
• Parzen  E. Parzen.

On estimation of a probability density function and mode.

The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
• Polson et al.  N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya–gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.
• Rakhlin  A. Rakhlin. Applications of empirical processes in learning theory: algorithmic stability and generalization bounds. PhD thesis, Massachusetts Institute of Technology, 2006.
• Ranganath et al.  R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.
• Robbins and Monro  H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
• Roberts and Rosenthal  G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
• Rue et al.  H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (statistical methodology), 71(2):319–392, 2009.
• Turner and Sahani  R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models. Cambridge University Press, 2011.
• Wainwright and Jordan  M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
• Wang and Titterington  B. Wang and M. Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In Workshop on Artificial Intelligence and Statistics, pages 373–380, 2004.
• Zhang  T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.