Variational Boosting: Iteratively Refining Posterior Approximations

11/20/2016 ∙ by Andrew C. Miller, et al. ∙ Harvard University and the University of Washington

We propose a black-box variational inference method to approximate intractable distributions with an increasingly rich approximating class. Our method, termed variational boosting, iteratively refines an existing variational approximation by solving a sequence of optimization problems, allowing the practitioner to trade computation time for accuracy. We show how to expand the variational approximating class by incorporating additional covariance structure and by introducing new components to form a mixture. We apply variational boosting to synthetic and real statistical models, and show that resulting posterior inferences compare favorably to existing posterior approximation algorithms in both accuracy and efficiency.

Code repository: vboost, the code supplement for variational boosting (https://arxiv.org/abs/1611.06585)


1 Introduction

Variational inference (VI) [2, 18, 35] is a family of methods designed to approximate an intractable target distribution (typically known only up to a constant) with a tractable surrogate distribution. VI procedures typically minimize the Kullback-Leibler (KL) divergence from the approximation to the target by maximizing an appropriately defined tractable objective. Often, the class of approximating distributions is fixed and typically excludes the neighborhood surrounding the target distribution, which prevents the variational approximation from becoming arbitrarily close to the true posterior. This mismatch between the variational family and the true posterior commonly manifests as underestimation of the posterior variances of the model parameters [35].

Markov chain Monte Carlo (MCMC), an alternative class of inference methods, instead approximates the target distribution with samples drawn from a Markov chain constructed to leave the target distribution invariant. MCMC methods allow a user to trade computation time for increased accuracy: drawing more samples makes the approximation closer to the true target distribution. However, MCMC algorithms typically must be run iteratively, and it can be difficult to assess convergence to the true posterior. Furthermore, correctly specifying MCMC moves can be more algorithmically restrictive than optimizing an objective (e.g., data subsampling in stochastic gradient methods).

To alleviate the mismatch between tractable variational approximations and complicated posterior distributions, we propose a variational inference method that iteratively allows the approximating distribution to become more complex and, eventually, to represent the true distribution arbitrarily well. This choice allows the practitioner to trade time spent performing inference against accuracy of posterior estimates; in the limit, the exact posterior can be recovered, as in MCMC. Our algorithm grows the complexity of the approximating class in two ways: 1) by incorporating richer covariance structure in the component distributions, and 2) by sequentially adding new components to the approximating distribution. Our method builds on black-box variational inference methods using the re-parameterization trick [20, 29, 32], and is applicable to a broad class of target distributions.

The following section discusses variational inference methods, drawing comparisons to alternative approximate inference algorithms. Section 3 and its subsections describe variational boosting; Section 3.1 shows how to adapt the re-parameterization trick to mixture approximations. Section 4 describes numerical experiments on real and synthetic data.

2 Variational Inference

Given a target distribution with density π(x) for a random variable x ∈ ℝ^D, variational inference approximates π(x) with a tractable approximating distribution, q(x; λ), from which we can draw samples and form sample-based estimates of functions of x. (We assume π(x) is known only up to a constant, π(x) = π̃(x)/Z for some normalizing constant Z, which we omit to simplify notation. We also treat the density function as a synecdoche for the entire law, using the density and the distribution interchangeably at the risk of slight notational abuse.) Variational methods minimize the KL-divergence, KL(q_λ ‖ π), between q(x; λ) and the true π as a function of the variational parameters λ [1]. Direct optimization of KL(q_λ ‖ π) is often intractable; however, we can derive a tractable objective based on properties of the KL-divergence. This objective is often referred to as the evidence lower bound (ELBO), written

L(λ) := E_{q_λ}[ ln π̃(x) − ln q(x; λ) ]    (1)
      = E_{q_λ}[ ln π(x) − ln q(x; λ) ] + ln Z    (2)
      = −KL(q_λ ‖ π) + ln Z    (3)
      ≤ ln Z,    (4)

which, due to the positivity of the KL-divergence, is a lower bound on the log normalization constant ln Z of π̃ (often referred to as the log marginal likelihood, or model evidence, in Bayesian inference). Variational methods typically define (or derive) a family of distributions Q = {q(·; λ) : λ ∈ Λ} and maximize the ELBO with respect to λ. Most commonly the class Q is fixed, and there exists some (possibly non-unique) λ* ∈ Λ for which KL(q_{λ*} ‖ π) is minimized. When the family Q does not include π, there will be a non-zero gap between L(λ*) and ln Z, and that discrepancy realizes itself in the form of biased estimates of functions of x.

Variational inference is often seen as an alternative to other approximate inference algorithms, most notably Markov chain Monte Carlo (MCMC). MCMC methods construct a Markov chain that leaves the target distribution invariant (i.e., the target is admitted along the margins); expectations with respect to the target can then be calculated as averages over these correlated samples. MCMC typically enjoys nice asymptotic properties: as the number of samples grows, MCMC samplers represent the true target distribution with higher and higher fidelity. However, the rules for constructing correct Markov steps are quite restrictive. With a few exceptions [24, 36], most MCMC algorithms require evaluating a log-likelihood that touches all of the data at each step in the chain (sometimes many times per step). This is problematic in analyses with a large amount of data, where MCMC methods are often considered unusable because of this computational bottleneck. Data sub-sampling, on the other hand, can often be used in conjunction with variational inference: unbiased estimates of the log-likelihood based on data sub-samples can be used within stochastic optimization methods. Because variational methods recast inference as optimization, data sub-sampling often makes an already efficient approximation even more efficient. In the next section, we propose an algorithm that iteratively grows the approximating class and reframes the VI procedure as a series of optimization problems, resulting in an inference method that can both represent complex distributions and scale to large data sets.
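For concreteness, the following is a minimal sketch (ours, not the authors' code) of a reparameterized Monte Carlo ELBO estimate for a single diagonal-Gaussian approximation, using autograd as in the paper's experiments. The names elbo_estimate and logprob are hypothetical; logprob stands in for ln π̃ evaluated on a batch of samples.

```python
import autograd.numpy as np
from autograd import grad

def elbo_estimate(lam, logprob, D, n_samples=100, rs=np.random):
    """Monte Carlo ELBO E_q[ln pi~(x) - ln q(x; lam)] for q = N(m, diag(exp(2*s)))."""
    m, s = lam[:D], lam[D:]                  # mean and log-standard-deviation
    eps = rs.randn(n_samples, D)             # base randomness, independent of lam
    x = m + np.exp(s) * eps                  # reparameterized samples x ~ q(x; lam)
    logq = -0.5 * np.sum(eps**2 + 2 * s + np.log(2 * np.pi), axis=1)
    return np.mean(logprob(x) - logq)

elbo_grad = grad(elbo_estimate)              # unbiased reparameterization gradient in lam
```

Maximizing this estimate with a stochastic gradient method recovers the usual black-box VI procedure that variational boosting starts from.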

3 Method: Variational Boosting

We define our approximate distribution to be a mixture of C simpler component distributions,

q^(C)(x; λ) = Σ_{c=1}^{C} ρ_c q_c(x; λ_c),   ρ_c ≥ 0,  Σ_c ρ_c = 1,    (5)

where we have defined component distributions q_c(x; λ_c) (we denote full mixtures with parenthetical superscripts, q^(C), and components with naked subscripts, q_c), mixture component parameters λ_c, and mixing proportions ρ = (ρ_1, …, ρ_C). The component distributions can be any distribution over x ∈ ℝ^D from which we can draw samples using a continuous mapping that depends on λ_c (e.g., multivariate normals [16], or a composition of invertible maps [31]). When posterior expectations and variances are of interest, mixture distributions provide tractable summaries so long as the component distributions do (mixtures are also simple to sample from, so more complicated functionals can easily be estimated). Expectations are easily expressed in terms of component expectations,

E_{q^(C)}[ f(x) ] = ∫ f(x) Σ_{c=1}^{C} ρ_c q_c(x; λ_c) dx    (6)
                  = Σ_{c=1}^{C} ρ_c E_{q_c}[ f(x) ].    (7)

In the case of multivariate normal components, the mean and covariance of the mixture are easy to compute in closed form,

μ^(C) = Σ_{c=1}^{C} ρ_c μ(λ_c),    (8)
Σ^(C) = Σ_{c=1}^{C} ρ_c [ Σ(λ_c) + (μ(λ_c) − μ^(C))(μ(λ_c) − μ^(C))^⊤ ],    (9)

as are marginal distributions along any subset of dimensions S,

q^(C)(x_S; λ) = Σ_{c=1}^{C} ρ_c 𝒩(x_S; μ_S(λ_c), Σ_S(λ_c)),    (10)
where μ(λ_c) and Σ(λ_c) isolate the mean and covariance from the variational component parameter λ_c (and the subscript S selects the corresponding entries).

Our method begins with a single mixture component, q^(1)(x; λ) = q_1(x; λ_1). We use existing black-box variational inference methods to fit the first component parameters λ_1; ρ_1 is fixed to 1 by definition. At the next iteration we fix λ_1 and introduce a new component into the mixture, q_2(x; λ_2). We define a new ELBO objective as a function of the new component parameters λ_2 and a new mixture weight ρ_2, and optimize this objective with respect to λ_2 and ρ_2 until convergence. At each subsequent iteration c, we introduce new component parameters λ_c and a mixing weight ρ_c, which are then optimized according to the new ELBO objective. We refer to this procedure as variational boosting, inspired by methods for learning strong classifiers by weighting an ensemble of weak classifiers.

In order for our method to be applicable to a general class of target distributions, we use black-box variational inference methods and the re-parameterization trick [20, 29, 32] to fit each component and the mixture weights. The re-parameterization trick is a method for obtaining unbiased estimates of the gradient of the ELBO; these gradient estimates can then be used to optimize the ELBO objective with a stochastic gradient method. However, using mixtures as the variational approximation complicates the use of the re-parameterization trick.
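For concreteness, a minimal numpy sketch (ours, not the authors' code) of the mixture moment formulas in Eqs. (8)-(9); the helper name mixture_moments is hypothetical.

```python
import numpy as np

def mixture_moments(means, covs, weights):
    """Mean and covariance of a mixture of Gaussian components.

    means:   (C, D) component means mu(lambda_c)
    covs:    (C, D, D) component covariances Sigma(lambda_c)
    weights: (C,) mixing proportions rho_c, summing to one
    """
    mu = weights @ means                                       # Eq. (8)
    diffs = means - mu                                         # deviations from the mixture mean
    cov = (np.einsum('c,cij->ij', weights, covs)
           + np.einsum('c,ci,cj->ij', weights, diffs, diffs))  # Eq. (9)
    return mu, cov
```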

3.1 The re-parameterization trick and mixture distributions

The re-parameterization trick is a method for computing low-variance, unbiased estimates of the gradient of an objective of the form L(λ) = E_{q_λ}[ f(x) ], where samples are drawn from q(x; λ). To obtain a Monte Carlo gradient of L using the re-parameterization trick, we first separate the randomness needed to generate x from the parameters λ by defining a deterministic map x = T(ε; λ) such that ε ∼ p_0(ε) implies x ∼ q(x; λ), where p_0 is a base distribution that, importantly, is not a function of λ. We can then differentiate through T with respect to λ to obtain a gradient estimator.

The re-parameterization trick when q is a mixture, however, is less straightforward. The sampling procedure for a mixture typically contains a discrete step (sampling component identities), which cannot be differentiated through. We circumvent this complication by re-writing the variational objective as a weighted combination of expectations with respect to the individual mixture components. Because of the form of the mixture, we can write the ELBO as

L(λ) = E_{q^(C)}[ ln π̃(x) − ln q^(C)(x; λ) ]
     = Σ_{c=1}^{C} ρ_c E_{q_c}[ ln π̃(x) − ln q^(C)(x; λ) ],

which is a function of expectations with respect to the individual mixture components. If these component distributions are continuous, and there exist maps T_c such that ε ∼ p_0 and x = T_c(ε; λ_c) imply x ∼ q_c(x; λ_c), then we can apply the re-parameterization trick to each component to obtain gradients of the ELBO,

∇_λ L(λ) = Σ_{c=1}^{C} ρ_c ∇_λ E_{ε ∼ p_0}[ ln π̃(T_c(ε; λ_c)) − ln q^(C)(T_c(ε; λ_c); λ) ].

Variational boosting uses this fact together with the re-parameterization trick in a component-by-component manner, allowing us to improve the variational approximation as we incorporate and fit new components.
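A minimal sketch of this component-wise reparameterized ELBO, under the assumption of diagonal-Gaussian components parameterized by a mean and log-standard-deviation vector (the function names and the logprob stand-in for ln π̃ are ours, not the authors'):

```python
import autograd.numpy as np
from autograd import grad

def component_logpdf(x, lam, D):
    """log N(x; m, diag(exp(2*s))) for lam = concat(m, s)."""
    m, s = lam[:D], lam[D:]
    return -0.5 * np.sum(((x - m) / np.exp(s))**2 + 2 * s + np.log(2 * np.pi), axis=1)

def mixture_elbo(lams, rhos, logprob, D, n_samples=100, seed=0):
    """ELBO written as a rho-weighted sum of per-component reparameterized expectations."""
    rs = np.random.RandomState(seed)
    elbo = 0.
    for c in range(len(rhos)):
        eps = rs.randn(n_samples, D)                     # base randomness, free of lambda
        x = lams[c][:D] + np.exp(lams[c][D:]) * eps      # x ~ q_c(x; lambda_c)
        # mixture density q^(C)(x); a log-sum-exp would be used here for numerical stability
        qC = sum(rhos[k] * np.exp(component_logpdf(x, lams[k], D))
                 for k in range(len(rhos)))
        elbo = elbo + rhos[c] * np.mean(logprob(x) - np.log(qC))
    return elbo

mixture_elbo_grad = grad(mixture_elbo)   # gradients w.r.t. the list of component parameters
```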

Figure 1: Illustrative one-dimensional example of adding a new component in variational boosting. Top: initial approximation with a single component (solid green). Middle: a new component (dotted red) is initialized using Algorithm 1. Bottom: the new component parameters and mixing weights are optimized using Monte Carlo gradients of the ELBO. Note that this allows the mass of the existing components to rise and fall, but not shift in space.
Figure 2: Sequence of increasingly complex approximate posteriors as new components are added. The background (grey/black) contours depict the target distribution, and the foreground (red) contours depict the iterative approximations.

3.2 Adding Components

In this section we present details of the proposed algorithm. We first describe the process of fitting a single component and then the process for adding an additional component to an existing mixture distribution.

Fitting the first component

The procedure starts by fitting an approximation to π(x) that consists of a single component, q^(1)(x; λ) = q_1(x; λ_1). We do this by maximizing the first ELBO objective,

L^(1)(λ_1) := E_{q_1}[ ln π̃(x) − ln q_1(x; λ_1) ],    (11)
λ_1* = argmax_{λ_1} L^(1)(λ_1).    (12)

Depending on the forms of π and q_1, optimizing this objective can be accomplished by various methods. One general method for fitting a continuous-valued component is to compute stochastic, unbiased gradients of L^(1) (e.g., via the re-parameterization trick) and use stochastic gradient optimization. After convergence (or close to it), we fix λ_1 to be λ_1*.
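Reusing the elbo_estimate sketch above, fitting the first component might look like the following (a sketch under the assumption that autograd's adam optimizer is used, as in the paper's experiments; the names are ours):

```python
import autograd.numpy as np
from autograd import grad
from autograd.misc.optimizers import adam

def fit_first_component(logprob, D, num_iters=2000, step_size=0.05):
    """Maximize the single-component ELBO (Eq. 11) with stochastic gradients."""
    def neg_elbo(lam, t):
        # fresh Monte Carlo samples at every iteration, seeded for reproducibility
        return -elbo_estimate(lam, logprob, D, n_samples=100,
                              rs=np.random.RandomState(t))
    lam0 = np.zeros(2 * D)                   # zero mean, unit standard deviation
    return adam(grad(neg_elbo), lam0, num_iters=num_iters, step_size=step_size)
```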

Fitting component C + 1

After iteration C, our current approximation to π(x) is a mixture distribution with C components,

q^(C)(x; λ^(C)) = Σ_{c=1}^{C} ρ_c q_c(x; λ_c),    (13)

where λ^(C) = (ρ_1, …, ρ_C, λ_1, …, λ_C) is the list of component parameters and mixing weights, and q_c is the component distribution parameterized by λ_c. Adding a new component introduces a new component parameter, λ_{C+1}, and a new mixing weight, ρ_{C+1}. Here the mixing parameter ρ_{C+1} mixes between the new component, q_{C+1}(x; λ_{C+1}), and the existing approximation, q^(C)(x; λ^(C)). The new approximate distribution is

q^(C+1)(x; λ^(C+1)) = (1 − ρ_{C+1}) q^(C)(x; λ^(C)) + ρ_{C+1} q_{C+1}(x; λ_{C+1}).    (14)

The new optimization objective, as a function of λ_{C+1} and ρ_{C+1}, is

L^(C+1)(λ_{C+1}, ρ_{C+1}) := E_{q^(C+1)}[ ln π̃(x) − ln q^(C+1)(x; λ^(C+1)) ]    (15)
= E_{(1−ρ_{C+1}) q^(C) + ρ_{C+1} q_{C+1}}[ ln π̃(x) − ln q^(C+1)(x; λ^(C+1)) ]    (16)
= (1 − ρ_{C+1}) E_{q^(C)}[ ln π̃(x) − ln q^(C+1)(x; λ^(C+1)) ] + ρ_{C+1} E_{q_{C+1}}[ ln π̃(x) − ln q^(C+1)(x; λ^(C+1)) ].    (17)

Above we have separated the objective into two expectations: one with respect to the existing approximation q^(C) (which is fixed), and one with respect to the new component distribution q_{C+1}. Because the existing approximation is fixed, we only need to optimize the new component parameters λ_{C+1} and the new weight ρ_{C+1}, which allows us to use the re-parameterization trick and Monte Carlo gradients of L^(C+1). Figure 1 illustrates the algorithm on a simple one-dimensional example, showing the initialization of a new component and the resulting mixture after optimizing the second objective, L^(2). Figure 2 depicts the result of the variational boosting procedure on a two-dimensional, multi-modal target distribution. In both cases, the component distributions are Gaussians with diagonal covariance.
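A minimal sketch of the boosting step under the same diagonal-Gaussian assumptions, reusing the mixture_elbo sketch above and autograd's adam optimizer: the existing mixture is frozen, and only the new component's parameters and its weight (through an unconstrained logit) are optimized. The importance-weighted initialization of Section 3.4 is omitted here for brevity (we simply start from zeros and a small weight).

```python
import autograd.numpy as np
from autograd import grad
from autograd.misc.optimizers import adam

def add_component(logprob, fixed_lams, fixed_rhos, D, n_samples=400, num_iters=1000):
    """Fit (lambda_{C+1}, rho_{C+1}) of Eq. (14) with the existing mixture held fixed."""

    def neg_objective(params, t):
        lam_new, rho_logit = params[:2 * D], params[-1]
        rho_new = 1.0 / (1.0 + np.exp(-rho_logit))          # rho_{C+1} in (0, 1)
        lams = fixed_lams + [lam_new]
        rhos = [(1.0 - rho_new) * r for r in fixed_rhos] + [rho_new]
        return -mixture_elbo(lams, rhos, logprob, D, n_samples=n_samples, seed=t)

    init = np.concatenate([np.zeros(2 * D), np.array([-2.0])])   # small initial weight
    opt = adam(grad(neg_objective), init, num_iters=num_iters, step_size=0.05)
    return opt[:2 * D], 1.0 / (1.0 + np.exp(-opt[-1]))           # (lambda_{C+1}, rho_{C+1})
```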

3.3 Structured Multivariate Normal Components

Though our method can use any component distribution that can be sampled via a continuous mapping, a sensible choice of component distribution is a multivariate normal,

q_c(x; λ_c) = 𝒩(x; μ(λ_c), Σ(λ_c)),    (18)
x = μ(λ_c) + Σ(λ_c)^{1/2} ε,   ε ∼ 𝒩(0, I_D),    (19)

where the variational parameter λ_c is transformed into a mean vector μ(λ_c) and covariance matrix Σ(λ_c). Specifying the structure of the covariance matrix is a choice that largely depends on the dimensionality D of x and the correlation structure of the target distribution. A common first choice of covariance parameterization is a diagonal matrix,

Σ(λ_c) = diag(exp(v_c)),   λ_c = (μ_c, v_c),    (20)

which implies that q_c is independent across the dimensions of x. When the approximation consists of only one component, this structure is commonly referred to as the mean field family. While computationally efficient, mean field approximations cannot model posterior correlations, which often leads to underestimation of marginal variances. Additionally, when diagonal covariances are used as the component distributions in Eq. (5), the resulting mixture may require a large number of components to represent strong correlations. Further, the independence restriction can introduce local optima in the variational objective [35]. On the other end of the spectrum, we can parameterize the entire covariance matrix by parameterizing the lower triangle of a Cholesky decomposition, L, such that Σ = L L^⊤. This allows Σ to be any positive semi-definite matrix, giving q_c the full flexibility of a D-dimensional multivariate normal distribution. However, this introduces D(D+1)/2 parameters, which can become computationally cumbersome when D is large. Furthermore, it may not be the case that all pairs of variables exhibit posterior correlation, particularly in multi-level models where different parameter types may be more or less independent in the posterior. Alternatively, we can incorporate some capacity to capture correlations between dimensions of x without introducing many more parameters. The next subsection discusses a covariance specification that provides this tradeoff while remaining computationally tractable within the BBVI framework.

Low-rank plus diagonal covariance

Black-box variational inference methods with the re-parameterization trick rely on sampling from the variational distribution and efficiently computing (or approximating) the entropy of the variational distribution. For multivariate normal distributions, the entropy is a function of the determinant of the covariance matrix, det Σ(λ), while computing the log likelihood requires inverting the covariance matrix, Σ(λ)^{-1}. When the dimensionality D of the target is large, computing determinants and inverses costs O(D³) and may be prohibitively expensive to compute at every iteration. However, it may be unnecessary to represent all possible correlations in the target distribution, particularly if certain dimensions are close to independent. One way to increase the capacity of q(x; λ) beyond a diagonal covariance is to model the covariance as a low-rank plus diagonal (LR+D) matrix,

Σ = F F^⊤ + diag(exp(v)),    (21)

where F is a D × r matrix of off-diagonal factors and v is the log-diagonal component; both F and v are represented by the component parameter λ_c. The choice of rank r presents a tradeoff: with a larger rank, the variational approximation can be more flexible; with a lower rank, the computations necessary for fitting the variational approximation are more efficient. As a concrete example, in the Experiments section we present a 37-dimensional posterior resulting from a non-conjugate hierarchical model, and we show that a “rank 3 plus diagonal” covariance does an excellent job capturing all pairwise correlations and marginal variances. Incorporating more components using the variational boosting framework further improves the approximation of the distribution. To use the re-parameterization trick with this low-rank covariance, we can simulate from q in two steps,

ε_lr ∼ 𝒩(0, I_r),    (22)
ε_d ∼ 𝒩(0, I_D),    (23)
x = μ(λ) + F ε_lr + exp(v / 2) ⊙ ε_d,    (24)

where ε_lr generates the randomness due to the low-rank structure, ε_d generates the randomness due to the diagonal structure, and ⊙ denotes the element-wise product (used for notational brevity). This generative process can be differentiated through, yielding Monte Carlo estimates of the gradient with respect to F and v suitable for stochastic optimization. In order to use the LR+D covariance structure within variational boosting, we also need to efficiently compute the determinant and inverse of Σ. The matrix determinant lemma [12] allows us to represent the determinant of Σ as the product of two determinants,

det(Σ) = det(diag(exp(v))) · det(I_r + F^⊤ diag(exp(−v)) F)    (25)
       = (Π_{d=1}^{D} exp(v_d)) · det(I_r + F^⊤ diag(exp(−v)) F),    (26)

where the left term is simply the product of the diagonal component and the right term is the determinant of an r-dimensional matrix, computable in O(r³) time. Similarly, the Woodbury matrix identity [10] allows us to represent the inverse of Σ as

Σ^{-1} = ( diag(exp(v)) + F F^⊤ )^{-1}
       = diag(exp(−v)) − diag(exp(−v)) F ( I_r + F^⊤ diag(exp(−v)) F )^{-1} F^⊤ diag(exp(−v)),    (27)

which involves the inversion of a smaller, r × r matrix that can be done in O(r³) time. Importantly, the above operations are efficiently differentiable and amenable to use within the BBVI framework.
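A minimal numpy sketch (ours) of the three operations needed to use an LR+D covariance within BBVI: sampling via Eqs. (22)-(24), the log-determinant via Eqs. (25)-(26), and linear solves via the Woodbury identity of Eq. (27).

```python
import numpy as np

def lrd_sample(m, F, v, n_samples, rs=np.random):
    """Draw x ~ N(m, F F^T + diag(exp(v)))."""
    D, r = F.shape
    eps_lr = rs.randn(n_samples, r)              # low-rank randomness
    eps_d = rs.randn(n_samples, D)               # diagonal randomness
    return m + eps_lr @ F.T + np.exp(0.5 * v) * eps_d

def lrd_logdet(F, v):
    """log det(F F^T + diag(exp(v))) via the matrix determinant lemma."""
    r = F.shape[1]
    cap = np.eye(r) + (F.T * np.exp(-v)) @ F     # r x r capacitance matrix
    return np.sum(v) + np.linalg.slogdet(cap)[1]

def lrd_solve(F, v, b):
    """Compute (F F^T + diag(exp(v)))^{-1} b via the Woodbury identity."""
    Ainv_b = np.exp(-v) * b                      # diag(exp(v))^{-1} b
    Ainv_F = np.exp(-v)[:, None] * F             # diag(exp(v))^{-1} F
    cap = np.eye(F.shape[1]) + F.T @ Ainv_F      # r x r
    return Ainv_b - Ainv_F @ np.linalg.solve(cap, F.T @ Ainv_b)
```

Only r × r determinants and solves appear, so the per-iteration cost scales with the rank rather than with the full dimensionality D.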

Fitting the rank

To specify the ELBO objective, we need to choose a rank for the component covariance. Because fitting a single component is relatively cheap, we start with a single component of rank 0 (a diagonal covariance), continue to fit ranks 1, 2, …, and rely on a heuristic stopping criterion. For a single Gaussian, one such criterion is the average change in marginal variances: if the marginal variation along each dimension remains the same when moving from rank r − 1 to rank r, then the new covariance component is not incorporating additional explanatory power, particularly if marginal variances are the quantities of interest. Because the KL(q ‖ π) objective tends to underestimate variances when restricted to a particular model class, we observe that the marginal variances grow as new covariance rank components are added. When fitting rank r, we can therefore monitor the average absolute change in marginal variance (or standard deviation) as more covariance structure is incorporated. Figure 9 in Section 4 depicts this measurement for the 37-dimensional posterior of Section 4.2.

To justify sequentially adding ranks to mixture components, we consider the KL-divergence from a rank-r Gaussian approximation to a full-covariance Gaussian, KL(𝒩(0, Σ_r) ‖ 𝒩(0, Σ)), where Σ_r = F F^⊤ + Λ with F ∈ ℝ^{D×r} and Λ diagonal, and Σ is a full-rank covariance. For simplicity, we assume both distributions have zero mean. If the true posterior is non-Gaussian, we attempt to approximate the best full-rank Gaussian with a low-rank Gaussian, and thus suffer an irreducible KL-divergence between the family of Gaussians and the true posterior. We also assume that the diagonal component, Λ, and the first r − 1 columns of F are held fixed. Then we have

KL(𝒩(0, Σ_r) ‖ 𝒩(0, Σ)) = ½ [ tr(Σ^{-1} Σ_r) − D + ln det Σ − ln det Σ_r ],    (28)

which we differentiate with respect to the new column f_r of F, remove terms that do not depend on f_r, and set to zero, yielding

∂KL/∂f_r = ½ ∂/∂f_r [ tr(Σ^{-1}(Σ_{r−1} + f_r f_r^⊤)) − ln det(Σ_{r−1} + f_r f_r^⊤) ]    (29)
         = Σ^{-1} f_r − Σ_r^{-1} f_r    (30)
         = 0,    (31)

where Σ_{r−1} = Λ + Σ_{k<r} f_k f_k^⊤ collects the fixed diagonal and the fixed first r − 1 columns.

We can thus determine the optimal f_r from the following equation: applying the Sherman-Morrison identity to Σ_r^{-1} = (Σ_{r−1} + f_r f_r^⊤)^{-1} and rearranging gives

M f_r = (1 + f_r^⊤ Σ_{r−1}^{-1} f_r) f_r,    (32)

where we have defined M := Σ Σ_{r−1}^{-1}. Eq. (32) is reminiscent of an eigenvalue problem, indicating that the optimal solution for f_r should maximally explain M, i.e., the parameter space not already explained by Σ_{r−1}. This provides justification for the previously proposed stopping criterion that monitors the increase in marginal variances, since incorporating a new vector into the low-rank approximation should grow the marginal variances if extra correlations are captured. This is due to minimizing KL(q ‖ π), which underestimates variances when dependencies between parameters are broken.

3.4 Initializing Components

Introducing a new component requires initializing its parameters. When our components are Gaussians, we found that the optimization procedure is sensitive to this initialization. This section describes an importance-weighting scheme that (empirically) produces good initial values of the component and mixing parameters. Conceptually, a good initial component is located in a region of the target that is under-represented by the existing approximation q^(C); a good initial weight is close to the proportion of target mass in that unexplained region. Following this principle, we construct the new component by first drawing importance-weighted samples from our existing approximation,

x_i ∼ q^(C)(x; λ),   w_i = π̃(x_i) / q^(C)(x_i; λ),   i = 1, …, N.    (33)

The samples with the largest weights indicate regions of the target that are poorly represented by our approximation. In fact, as N grows, and if q^(C) is “close” enough to π, we can interpret {(x_i, w_i)} as a weighted sample from π. Based on this interpretation, we can fit a mixture distribution (or some components of a mixture distribution) to this weighted sample using maximum likelihood, recovering a kind of target approximation. For mixture distributions, an efficient inference procedure is expectation-maximization (EM) [3].

This approach, however, presents a few complications. First, we must adapt EM to fit a weighted sample. Second, importance weights can suffer from extremely high variance: one or two values may be extremely large compared to all other weights. This destabilizes the new component parameters and mixing weight, particularly the variance of the component. Intuitively, if a single weight is extremely large, many samples are effectively located at a single point, and maximum likelihood with EM will want to shrink the variance of the new component to zero right on that location. To combat this behavior, we use a simple resampling and re-weighting step to break up the big weights before applying weighted EM. Empirically, this improves our new component initializations and the subsequent ELBO convergence.

Weighted EM

Expectation-maximization (EM) is typically used to perform maximum likelihood estimation in latent variable models. Mixture distributions are easily represented with latent variables: each sample's latent variable indicates the mixture component that produced it. EM starts with some initialization of the model parameters (e.g., component means, variances, and mixing weights). The algorithm then iterates between two steps: 1) the E-step, which computes the distribution over the latent variables given the current setting of parameters, and 2) the M-step, which maximizes the expected complete-data log-likelihood with respect to the distributions computed in the E-step. We suppress details of the general treatment of EM and focus on EM for mixture models as presented in [1]. For mixture distributions, the E-step computes “responsibilities”: the probability that a datapoint came from each of the components. The M-step then computes a weighted maximum likelihood, where the log-likelihood of a datapoint for a particular component is weighted by the associated responsibility. This weighted maximum likelihood is an easy entry point for an additional set of weights, namely the importance weights associated with each datapoint. More concretely, for a sample of data x_1, …, x_N, K mixture components, and current mixture component parameters λ_1, …, λ_K and weights ρ_1, …, ρ_K, the E-step computes the following quantities,

r_{i,k} = ρ_k q_k(x_i; λ_k) / Σ_{k'=1}^{K} ρ_{k'} q_{k'}(x_i; λ_{k'}),    (34)
N_k = Σ_{i=1}^{N} r_{i,k},    (35)

where r_{i,k} is the “responsibility” of cluster k for datapoint x_i. The M-step then computes component parameters by a weighted maximum likelihood,

λ_k ← argmax_{λ} Σ_{i=1}^{N} r_{i,k} ln q_k(x_i; λ),   ρ_k ← N_k / N.    (36)

To incorporate the importance weights w_i, we only need to slightly change the M-step:

λ_k ← argmax_{λ} Σ_{i=1}^{N} w_i r_{i,k} ln q_k(x_i; λ),   ρ_k ← Σ_i w_i r_{i,k} / Σ_i w_i.    (37)

Because we are adding a new component, we would like our weighted EM routine to leave the existing components unchanged: for instance, we want λ_1, …, λ_C to be fixed, while λ_{C+1} is free to explain the weighted sample. This can be accomplished in a straightforward manner by simply clamping the first C component parameters during the M-step.
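A minimal numpy sketch (ours, not the authors' code) of one weighted-EM update for diagonal-Gaussian components: the importance weights are folded into the responsibilities in the M-step (cf. Eqs. 34-37), and the means and variances of the first n_fixed components are clamped. For simplicity this sketch re-normalizes all mixing weights in the update, which is a design choice of the sketch.

```python
import numpy as np

def weighted_em_step(X, w, means, variances, mix, n_fixed=0):
    """One EM update for a diagonal-Gaussian mixture fit to an importance-weighted sample.

    X: (N, D) samples; w: (N,) importance weights, normalized to sum to one;
    means, variances: (K, D) component parameters; mix: (K,) mixing weights.
    """
    K = len(mix)
    # E-step: responsibilities r[i, k] proportional to mix_k * N(x_i; mu_k, var_k)
    logp = np.stack([np.log(mix[k])
                     - 0.5 * np.sum((X - means[k])**2 / variances[k]
                                    + np.log(2 * np.pi * variances[k]), axis=1)
                     for k in range(K)], axis=1)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step with importance weights folded in
    rw = r * w[:, None]
    Nk = rw.sum(axis=0)
    mix = Nk / Nk.sum()
    for k in range(n_fixed, K):                  # clamp the existing components
        means[k] = rw[:, k] @ X / Nk[k]
        variances[k] = rw[:, k] @ (X - means[k])**2 / Nk[k]
    return means, variances, mix
```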

1: procedure InitComp(π̃, q^(C), N)
2:     x_i ∼ q^(C)(x; λ) for i = 1, …, N            ▷ sample from existing approx
3:     w_i ← π̃(x_i) / q^(C)(x_i; λ)                 ▷ set importance weights
4:     w_i ← w_i / Σ_j w_j
5:     construct the proposal q_prop of Eq. (38)     ▷ break up big weights
6:     x̃_i ∼ q_prop for i = 1, …, N                  ▷ sample from new mixture
7:     w̃_i ← π̃(x̃_i) / q_prop(x̃_i)                  ▷ re-sampled importance weights
8:     λ_{C+1}, ρ_{C+1} ← WeightedEM({x̃_i, w̃_i})    ▷ fit new component
9:     return λ_{C+1}, ρ_{C+1}
Algorithm 1: Importance-weighted initialization of new components. The algorithm takes in the (unnormalized) target distribution π̃, the current approximate distribution q^(C), and a number of samples N, and returns an initial value of the new component parameters λ_{C+1} and a new mixing weight ρ_{C+1}.
Resampling importance weights

If our current approximation q^(C) is sufficiently different from the posterior in certain regions, then some weights will end up being large compared to the others. For instance, the KL(q ‖ π) objective tends to under-cover regions of the posterior, allowing π̃(x_i) to be much larger than q^(C)(x_i), so the weight associated with x_i will be large. This creates instability in the weighted EM approximation: likelihood maximization will want to put a zero-variance component on the single highest-weighted sample, which does not accurately reflect the local curvature of π. To combat this, we construct a slightly more complicated proposal distribution. Conceptually, we first create the naïve importance-weighted sample, then find the samples with outlier weights and break those samples up. We do this by constructing a new proposal distribution that mixes the existing proposal, q^(C), with components located at the outlier samples. We define this proposal to be

q_prop(x) = ρ_0 q^(C)(x; λ) + (1 − ρ_0) (1/|O|) Σ_{x_o ∈ O} 𝒩(x; x_o, Σ_o),    (38)

where O denotes the set of outlier samples from our original sample and ρ_0 is the mass not placed on the outlier samples. The variance of each outlier component, Σ_o, is set to some heuristic value; we typically use the diagonal of the covariance of q^(C) as a good-enough guess. We then create a new importance-weighted sample using q_prop, just as before. By placing new components (with non-zero variance) on the outlier samples, which are known to lie in regions of high target probability and low approximate probability, we assume there is more local probability mass around those regions that needs to be explored. This inflates the local variance of the samples in the region where weighted EM will place the new component. Algorithm 1 unites the components from the above sections into our final initialization procedure.
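The pieces above can be assembled into a rough rendering of Algorithm 1 (ours, not the authors' code). It reuses the weighted_em_step sketch; instead of the outlier-mixture proposal of Eq. (38) it simply clips the largest weights at a quantile, and it summarizes the existing approximation by a single clamped Gaussian, both simplifications of the paper's procedure.

```python
import numpy as np

def init_new_component(logprob_tilde, sample_q, logpdf_q, n_samples, rs=np.random):
    """Importance-weighted initialization of a new mixture component (cf. Algorithm 1)."""
    X = sample_q(n_samples)                              # x_i ~ q^(C)
    logw = logprob_tilde(X) - logpdf_q(X)                # log importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    w = np.minimum(w, np.quantile(w, 0.99))              # crude stand-in for Eq. (38)
    w /= w.sum()
    # one clamped component summarizing q^(C), one free component for q_{C+1}
    means = np.vstack([X.mean(0), X[rs.choice(n_samples, p=w)]])
    variances = np.vstack([X.var(0), X.var(0)])
    mix = np.array([0.75, 0.25])
    for _ in range(25):
        means, variances, mix = weighted_em_step(X, w, means, variances, mix, n_fixed=1)
    return means[1], variances[1], mix[1]                # initial (mu, sigma^2, rho) for q_{C+1}
```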

3.5 Related Work

Using a mixture model as an approximating distribution in variational inference is a well-studied idea. Mixtures of mean field approximations [16] introduced mean-field-like updates for a mixture approximation, using a bound on the entropy term and model-specific parameter updates. Nonparametric variational inference [9] is a black-box variational inference algorithm that approximates a target distribution with a mixture of equally-weighted isotropic normals; the authors use a lower bound on the entropy term in the ELBO to make the optimization procedure tractable. Similarly, [32] present a method for fitting mixture distributions as an approximation; however, their method is restricted to mixture components within the exponential family and relies on a joint optimization procedure. Sequential estimation of mixture models has been studied previously, where the error between the sequentially learned model and the optimal model with all components and weights learned jointly is bounded by O(1/C), with C the number of mixture components used [22, 21, 28]. The arguments in these works rely on extending convergence results for the iterative approximation of functions to the KL divergence. A similar bound was also shown to hold using arguments from convex analysis in Zhang [37]. More recently, sequentially constructing a mixture of deep generative models has been shown to achieve the same error bound when trained using an adversarial approach [34]. Though these ideas show promise for deriving error bounds for variational boosting, there are difficulties in applying them directly.

In concurrent work, boosting has been used to construct flexible approximations to posterior distributions [11]. In particular, Guo et al. use gradient boosting [6] to determine candidate component distributions and then optimize the mixture weight for the new component. They use the greedy approximation error bounds derived by Zhang to suggest that their algorithm attains an O(1/C) error bound. However, this requires the gradient-boosting procedure to find the optimal new component, which is not true in general. Guo et al. provide important first steps in the theoretical development of boosting methods applied to variational inference; we note, however, that a comprehensive theory that deals with the difficulties of multi-modality and the non-convexity of the KL divergence is still needed.

Using a low-rank Gaussian as a variational approximation was explored in [33] using a PCA-like algorithm. Additionally, concurrent work has proposed the use of an LR+D covariance for Gaussian posterior approximations [27]; that work derives explicit forms of the gradients and demonstrates the efficacy of the approach on high-dimensional logistic regression. Though the spiked-covariance approximation is useful for capturing posterior correlations, we find that combining the idea with boosting new components yields superior posterior approximations. We fit the low-rank components of a Gaussian using black-box methods and joint optimization. We also note that mixture distributions are a type of hierarchical variational model [30], where the component identity can be thought of as a latent variable in the variational distribution. While [30] optimize a lower bound on the ELBO to fit general hierarchical variational models, our approach integrates out the discrete latent variable because it is tractable to do so.

4 Experiments and Analysis

To supplement the illustrative synthetic examples, in this section we apply variational boosting to approximate various intractable posterior distributions resulting from real statistical analyses.

Figure 3: Comparison of univariate and bivariate marginals for the hierarchical binomial model. Each histogram/scatterplot results from 20,000 NUTS samples. Top left: a bivariate marginal, showing HMC samples and a mean field approximation (MFVI). Top right: the same bivariate marginal with the variational boosting approximation. Bottom: comparison of NUTS, MFVI, and VBoost on univariate marginals (global parameters).
Figure 4: Comparison of posterior covariances for the 20-dimensional baseball model (hierarchical binomial regression). Each plot corresponds to the variational boosting algorithm as it incorporates new diagonal-covariance mixture components. The vertical axis indicates covariance estimates from 20k samples of NUTS (shared across all plots). As more components are added, the variational boosting covariance estimates more closely match the MCMC covariance estimates.

4.1 Hierarchical Binomial Regression

We test our posterior approximation on a hierarchical binomial regression model (model and data from the mc-stan case studies: http://mc-stan.org/documentation/case-studies/pool-binary-trials.html). We borrow an example from [4] and estimate the binomial rates of success (batting averages) of baseball players using a hierarchical model. The model describes a latent “skill” parameter for each baseball player: the probability of obtaining a hit in a given at bat. The model of the data is

m ∼ Uniform(0, 1)    (hyper prior)
κ ∼ Pareto(1, 1.5)    (hyper prior)
θ_j ∼ Beta(m κ, (1 − m) κ)    (player prior)
y_j ∼ Binomial(K_j, θ_j)    (player hits)
where y_j is the number of successes (hits) player j has obtained in K_j attempts (at bats). Each player has a latent success rate θ_j, which is governed by two global variables, m and κ. There are 18 players in this example, creating a posterior distribution over D = 20 parameters. This model is not conjugate and requires approximate Bayesian inference. We use adam [19] for each stochastic optimization problem with default parameters. For stochastic gradients, we use 400 samples for the new component and 400 samples for the previous components. In all experiments, we use autograd [26, 25] to obtain automatic gradients with respect to new component parameters. To highlight the fidelity of our method, we compare variational boosting to mean field VI and the No-U-Turn Sampler (NUTS) [15]. The empirical distribution resulting from 20k NUTS samples is considered the “ground truth” posterior in this example. Figure 3 compares a selection of univariate and bivariate posterior marginals; variational boosting closely matches the NUTS posteriors, improving upon the MFVI approximation. Figure 4 compares the variational boosting covariance estimates to the “ground truth” MCMC estimates at various stages of the algorithm.
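For illustration, a sketch of the unnormalized log joint of this model in its constrained parameters, assuming the Uniform/Pareto/Beta priors of the cited mc-stan case study (the paper's exact priors and hyperparameters may differ); in practice the model would be reparameterized to unconstrained space (logit/log transforms plus Jacobians) before applying BBVI.

```python
import autograd.numpy as np
from autograd.scipy.special import gammaln

def log_joint(m, kappa, theta, hits, at_bats):
    """Unnormalized log joint: m in (0,1), kappa > 1, theta: (J,), hits/at_bats: (J,)."""
    a, b = m * kappa, (1.0 - m) * kappa                            # Beta(a, b) player prior
    lp = np.log(1.5) - 2.5 * np.log(kappa)                         # Pareto(1, 1.5) prior on kappa
    lp = lp + np.sum(gammaln(a + b) - gammaln(a) - gammaln(b)
                     + (a - 1) * np.log(theta)
                     + (b - 1) * np.log(1.0 - theta))              # player priors
    lp = lp + np.sum(hits * np.log(theta)
                     + (at_bats - hits) * np.log(1.0 - theta))     # binomial likelihood
    return lp
```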

(a) Rank 0 (MFVI)
(b) Rank 1
(c) Rank 2
(d) Rank 3
Figure 5: Comparison of single-Gaussian-component marginals by rank for the 37-dimensional Poisson GLM posterior. The top left plot is a diagonal Gaussian approximation. The subsequent plots show how the marginal variances inflate as the covariance is allotted more capacity.
(a) Rank 0 (MFVI)
(b) Rank 1
(c) Rank 2
(d) Rank 3
Figure 6: A sampling of bivariate marginals for a single Gaussian component, by rank, for the 37-dimensional Poisson GLM posterior. Although there are a total of 666 covariances to be approximated, only a few directions in the 37-dimensional parameter space exhibit non-trivial correlations.

4.2 Multi-level Poisson GLM

We apply variational boosting to approximate the posterior of a common hierarchical model, a hierarchical Poisson GLM. This model was formulated to measure the relative rates of stop-and-frisk events for different ethnicities and in different precincts [7], and has been used as an illustrative example of multi-level modeling [8, Chapter 15, Section 1]. The model incorporates a precinct and an ethnicity effect to describe the relative rate of stop-and-frisk events.

μ ∼ 𝒩(0, σ_μ²)    (mean offset)
ln σ_α², ln σ_β² ∼ 𝒩(0, σ_v²)    (group variances)
α_e ∼ 𝒩(0, σ_α²)    (ethnicity effect)
β_p ∼ 𝒩(0, σ_β²)    (precinct effect)
ln λ_{ep} = μ + α_e + β_p + ln N_{ep}    (log rate)
Y_{ep} ∼ Poisson(λ_{ep})    (stop-and-frisk events)

where Y_{ep} is the number of stop-and-frisk events within ethnicity group e and precinct p over some fixed period of time; N_{ep} is the total number of arrests of ethnicity group e in precinct p over the same period of time; and α_e and β_p are the ethnicity and precinct effects.

(a) Rank 3, 2-component
(b) Rank 3, 8-component
Figure 7: A sampling of bivariate marginals for mixtures of rank-3 Gaussians at various stages of the variational boosting algorithm for the 37-dimensional frisk model. Introducing new mixture components allows the posterior approximation to take a non-Gaussian shape, most clearly exhibited in the third column.
(a) Marginal standard deviations
(b) Pairwise covariances
Figure 8: A comparison of standard deviations and covariances for the frisk model. The MCMC-inferred values are along the horizontal axis, with the variational boosting values along the vertical axis. While the rank 3 plus diagonal covariance structure is able to account for most of the marginal variances, the largest one is still underestimated. Incorporating more rank 3 components allows the approximation to account for this variance. Similarly, the non-zero covariance measurements improve as more components are added.

As before, we simulate 20k NUTS samples and compare various variational approximations. Because of the high posterior correlations present in this example, variational boosting with diagonal-covariance components represents this structure inefficiently; this example therefore relies more heavily on the low-rank approximation to shape the posterior. Figure 5 shows how increasing the rank of a single multivariate normal component results in better variance approximations. Figure 6 shows a handful of bivariate marginal posterior approximations as a function of covariance rank. Figure 7 shows the same bivariate marginals as more rank-3 components are added to the approximation. Lastly, Figure 8 compares the marginal standard deviations and covariances to MCMC-based measurements. These results indicate that while the incorporation of covariance structure increases the accuracy of marginal variance approximations, the non-Gaussianity afforded by the incorporation of mixture components allows for a better posterior approximation that translates into even more accurate moment estimates.

Figure 9: Mean percent change in marginal variances for the Poisson GLM. After rank 5, the average percent change is less than 5% — this estimate is slightly noisy due to the stochastic optimization procedure.

4.3 Bayesian Neural Network

Lastly, we apply our method to a Bayesian neural network regression model, which admits a high-dimensional, non-conjugate posterior. We compare the predictive performance of variational boosting to probabilistic backpropagation (PBP) [13]. Mimicking the experimental setup of [13], we use a single 50-unit hidden layer with ReLU activation functions. We place a normal prior over each weight in the neural network, governed by a shared variance with an inverse-Gamma prior; we also place an inverse-Gamma prior over the observation variance. The model can be written as

σ_w² ∼ InvGamma(a_w, b_w)    (weight prior hyper) (39)
σ_y² ∼ InvGamma(a_y, b_y)    (noise prior hyper) (40)
w_i ∼ 𝒩(0, σ_w²)    (weights) (41)
y_n | x_n, w, σ_y² ∼ 𝒩(φ(x_n; w), σ_y²)    (output distribution) (42)

where w = {w_i} is the set of network weights, and φ(x; w) is a multi-layer perceptron that maps an input x to an approximate output as a function of the parameters w. We denote the full set of model parameters as θ = (w, σ_w², σ_y²). We approximate the posterior p(θ | D), where D = {(x_n, y_n)}_{n=1}^{N} is the training set of input-output pairs. We then use the posterior predictive distribution to compute the distribution over the output y* for a new input x*,

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ    (43)
             ≈ (1/L) Σ_{l=1}^{L} p(y* | x*, θ_l),   θ_l ∼ q^(C)(θ; λ),    (44)

and report average predictive log probabilities for held-out data, (1/N_test) Σ_n ln p(y_n* | x_n*, D). In our experiment, we report held-out predictive performance of different approximate posteriors on six datasets. For each dataset, we perform the following procedure 20 times. First, we create a random partition into a 90% training set and a 10% testing set. For each single-component variational model (e.g., mean field, rank 5, rank 10), we take 500 adam steps using 20 samples to approximate the ELBO gradient. We then fix the rank 5 component and add new components using the variational boosting procedure, allowing each additional component only 200 iterations. To save time on initialization, we draw 100 samples from the existing approximation and initialize the new component at the sample with the maximum weight. Probabilistic backpropagation is given 1,000 passes over the training data (which, empirically, was sufficient for the algorithm to converge). Table 1 presents out-of-sample log probability for single-component multivariate Gaussian approximations with varying rank structure. Table 2 presents out-of-sample log probability as additional rank 5 components are added using the variational boosting procedure. Although we do not see much predictive improvement as rank structure is added, we do see predictive improvement as components are added. Our results suggest that incorporating and adapting new mixture components yields a more expressive posterior approximation that translates into better predictive results. In fact, for all datasets, incorporating a new component improves test log-probability, and for most datasets we see further improvement with additional components. Furthermore, on five of the datasets, predictive performance surpasses probabilistic backpropagation as new components are added, highlighting variational boosting's ability to trade computation for improved accuracy. We note that the original observation of [13] may hold: gradient noise does make this optimization problem more difficult, but it is mitigated by using more samples (20 in our case). Despite the stochastic training, these empirical results suggest that augmenting a Gaussian approximation with additional capacity can improve predictive performance in a Bayesian neural network while retaining computational tractability.
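A minimal numpy sketch (ours) of the Monte Carlo predictive estimate in Eq. (44); mlp is an assumed forward-pass function, and q_samples is assumed to pack the network weights together with the log observation variance.

```python
import numpy as np

def predictive_logprob(x_test, y_test, q_samples, mlp, n_weights):
    """Average held-out log probability, log (1/L) sum_l N(y*; mlp(x*, w_l), sigma_l^2)."""
    w, log_s2 = q_samples[:, :n_weights], q_samples[:, -1]
    logps = np.stack([
        -0.5 * ((y_test - mlp(x_test, w[l]))**2 / np.exp(log_s2[l])
                + log_s2[l] + np.log(2 * np.pi))
        for l in range(len(w))])                         # (L, N_test) per-sample log likelihoods
    m = logps.max(axis=0)
    log_pred = np.log(np.mean(np.exp(logps - m), axis=0)) + m   # log-mean-exp over samples
    return np.mean(log_pred)
```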

pbp mfvi rank 5 rank 10 rank 15
wine -0.990 ( 0.08) -0.973 ( 0.05) -0.972 ( 0.05) -0.972 ( 0.05) -0.973 ( 0.05)
boston -2.902 ( 0.64) -2.658 ( 0.18) -2.670 ( 0.16) -2.696 ( 0.14) -2.743 ( 0.12)
concrete -3.162 ( 0.15) -3.248 ( 0.07) -3.247 ( 0.06) -3.261 ( 0.06) -3.286 ( 0.05)
power-plant -2.798 ( 0.04) -2.812 ( 0.03) -2.814 ( 0.03) -2.838 ( 0.03) -2.867 ( 0.02)
yacht -0.990 ( 0.08) -0.973 ( 0.05) -0.972 ( 0.05) -0.972 ( 0.05) -0.973 ( 0.05)
energy-efficiency -1.971 ( 0.11) -2.451 ( 0.12) -2.452 ( 0.12) -2.469 ( 0.11) -2.502 ( 0.09)
Table 1: Comparison of test log probability for PBP [13] to Variational Inference with various ranks. Each entry shows the average predictive performance of the model on a specific dataset and the standard deviation across the 20 trials — bold indicates the best average (though not necessarily statistical significance).
pbp rank 5 vboost 2 vboost 6 vboost 10
wine -0.990 ( 0.08) -0.972 ( 0.05) -0.971 ( 0.05) -0.978 ( 0.06) -0.994 ( 0.06)
boston -2.902 ( 0.64) -2.670 ( 0.16) -2.651 ( 0.16) -2.599 ( 0.16) -2.628 ( 0.16)
concrete -3.162 ( 0.15) -3.247 ( 0.06) -3.228 ( 0.06) -3.169 ( 0.07) -3.134 ( 0.08)
power-plant -2.798 ( 0.04) -2.814 ( 0.03) -2.811 ( 0.03) -2.800 ( 0.03) -2.793 ( 0.03)
yacht -0.990 ( 0.08) -0.972 ( 0.05) -0.971 ( 0.05) -0.978 ( 0.06) -0.994 ( 0.06)
energy-efficiency -1.971 ( 0.11) -2.452 ( 0.12) -2.422 ( 0.11) -2.345 ( 0.11) -2.299 ( 0.12)
Table 2: Comparison of test log probability for PBP [13] to Variational Boosting with fixed rank (5), varying the number of components. Each entry shows the average predictive performance of the model on a specific dataset and the standard deviation across the 20 trials — bold indicates the best average (though not necessarily statistical significance).

5 Discussion and Conclusion

We have proposed a practical variational inference method that iteratively incorporates new components into the approximation and is applicable to a large class of Bayesian models of interest. We demonstrated the ability of the method to learn rich representations of complex posteriors over a moderate number of parameters.

We see a few avenues of future work. First, while it is known that mixtures of Gaussians can approximate smooth distributions to arbitrary precision given enough components [5], it remains an open question whether our approach of fixing previously fit components and iteratively adding new ones using this sequence of ELBO objectives will converge to the target. Existing work has shown that this is the case for the alternative direction of the KL-divergence, KL(π ‖ q) [22, 28], but it remains to be shown for KL(q ‖ π); the rate of convergence would, ideally, also be characterized. The variational boosting framework also allows for more flexible component distributions: for instance, compositions of invertible maps have been used to enrich variational families [31], as have auxiliary-variable variational models [23]. Although our mixture-component fitting algorithm is greedy, our determination of an appropriate covariance rank is not; we imagine that the result of the diagonal-covariance fit could inform the procedure for rank 1, and so on, and we leave this sort of nested boosting for low-rank determination to future work. Finally, when optimizing the parameters of a variational family, it has been shown that the natural gradient can be more robust and lead to better optima [14, 17]. While the Fisher information required for computing the natural gradient can be computed in closed form for a single Gaussian component, it is less straightforward for a mixture. We hope to incorporate natural gradient updates in future work.

Acknowledgments

The authors would like to acknowledge Arjumand Masood, Mike Hughes, and Finale Doshi-Velez for helpful conversations. ACM is supported by the Applied Mathematics Program within the Office of Science Advanced Scientific Computing Research of the U.S. Department of Energy under contract No. DE-AC02-05CH11231. NJF is supported by a Washington Research Foundation Innovation Postdoctoral Fellowship in Neuroengineering and Data Science. RPA is supported by NSF IIS-1421780 and the Alfred P. Sloan Foundation.

References