Tightening Bounds for Variational Inference by Revisiting Perturbation Theory

09/30/2019 · Robert Bamler, et al. · University of California, Irvine; Microsoft; Berlin Institute of Technology (Technische Universität Berlin)

Variational inference has become one of the most widely used methods in latent variable modeling. In its basic form, variational inference employs a fully factorized variational distribution and minimizes its KL divergence to the posterior. As the minimization can only be carried out approximately, this approximation induces a bias. In this paper, we revisit perturbation theory as a powerful way of improving the variational approximation. Perturbation theory relies on a form of Taylor expansion of the log marginal likelihood, loosely speaking in terms of the log ratio of the true posterior and its variational approximation. While first order terms give the classical variational bound, higher-order terms yield corrections that tighten it. However, traditional perturbation theory does not provide a lower bound, making it unsuitable for stochastic optimization. In this paper, we present a similar yet alternative way of deriving corrections to the ELBO that resemble perturbation theory, but that result in a valid bound. We show in experiments on Gaussian processes and variational autoencoders that the new bounds are more mass-covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.


1 Introduction

Bayesian inference is the task of reasoning about random variables that can only be measured indirectly. Given a probabilistic model and a data set of observations, Bayesian inference seeks the posterior probability distribution over the remaining, i.e., unobserved or ‘latent’ variables. The computational bottleneck is calculating the marginal likelihood, the probability of the data with the latent variables integrated out. The marginal likelihood is also an important quantity for model selection. It is the objective function of the expectation-maximization (EM) algorithm [8], which has seen a recent resurgence with the popularity of variational autoencoders (VAEs) [16].

Variational inference (VI) [14, 13, 46] scales up approximate Bayesian inference to large data sets by framing inference as an optimization problem. VI derives and maximizes a lower bound to the log marginal likelihood, termed the ‘evidence’, and uses this so-called ‘evidence lower bound’ (ELBO) as a proxy for the true evidence. In its original formulation, VI was limited to a restricted class of so-called conditionally conjugate models [13], for which a closed-form expression for the ELBO can be derived by solving known integrals. More recently, black box variational inference (BBVI) approaches have become more popular [34, 37], which lift this restriction by approximating the gradient of the ELBO with Monte Carlo samples. BBVI also enables the optimization of alternative bounds that do not possess closed-form integrals. With Monte Carlo gradient estimation, the focus shifts from the tractability of these integrals to the variance of the Monte Carlo gradient estimators, calling for bounds with low-variance gradients.

In this article, we propose a family of new lower bounds that are tighter than the standard ELBO while having lower variance than previous alternatives and admitting unbiased Monte Carlo estimation of the gradients of the bound. We derive these new bounds by introducing ideas from so-called variational perturbation theory into BBVI.

Variational perturbation theory provides an alternative to VI for approximating the evidence [31, 30, 29, 41]. It is based on a Taylor expansion of the evidence around the variational distribution, i.e., a parameterized approximate posterior distribution. The lowest order of the Taylor expansion recovers the standard ELBO, and higher order terms correct for the mismatch between the variational distribution and the true posterior.

Variational perturbation theory based on so-called cumulant expansions [29] requires a fair amount of manual derivations and puts strong constraints on the tractability of certain integrals. A cumulant expansion also generally does not result in a lower bound, making it impossible to minimize the approximation error with stochastic gradient descent. In this article, we propose a new variant of perturbation theory that addresses these issues. Our proposed perturbative expansion leads to a family of lower bounds on the marginal likelihood that can be optimized with the same black box techniques as the standard ELBO. The proposed bounds are enumerated by an odd integer $K$, the order of the perturbative expansion, and are given by

$$\mathcal{L}_K(\lambda, V_0) = e^{V_0} \sum_{k=0}^{K} \frac{1}{k!}\, \mathbb{E}_{q_\lambda(z)}\!\left[ \big(\log p(x, z) - \log q_\lambda(z) - V_0\big)^k \right]. \qquad (1)$$

Here, $p(x, z)$ is the joint distribution of the probabilistic model with observed variables $x$ and latent variables $z$, $q_\lambda(z)$ is the variational distribution with variational parameters $\lambda$, and $p(x) = \int p(x, z)\, dz$ is the marginal likelihood. Further, the reference energy $V_0$ is an additional free parameter over which we optimize jointly with $\lambda$.

The proposed bounds in Eq. 1 generalize the standard ELBO, which one obtains for $K = 1$. Higher order terms make the bound successively tighter while guaranteeing that it still remains a lower bound for all odd orders $K$. In the limit $K \to \infty$, Eq. 1 becomes an asymptotic series to the exact marginal likelihood $p(x)$.
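As a concrete illustration (not part of the original derivation), the following sketch estimates $\mathcal{L}_K$ from Eq. 1 by simple Monte Carlo for a fully factorized Gaussian variational distribution; `log_joint` is a hypothetical stand-in for the model's $\log p(x, z)$:

```python
import math
import numpy as np

def pbbvi_bound(log_joint, mu, sigma, V0, K=3, num_samples=1000, seed=0):
    """Monte Carlo estimate of the perturbative bound L_K (Eq. 1).

    log_joint: callable z -> log p(x, z) (hypothetical model interface;
               takes an array of shape (num_samples, dim))
    mu, sigma: mean and std. dev. of a fully factorized Gaussian q_lambda(z)
    V0:        reference energy
    K:         odd order of the perturbative expansion
    """
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((num_samples, len(mu)))
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2
                          + np.log(2 * np.pi * sigma ** 2), axis=1)
    V = log_joint(z) - log_q                       # interaction energy V(z)
    poly = sum((V - V0) ** k / math.factorial(k) for k in range(K + 1))
    return np.exp(V0) * poly.mean()                # estimate of L_K(lambda, V0)
```

For $K = 1$ and $V_0$ chosen optimally, this evaluates to $e^{\mathrm{ELBO}}$; larger odd $K$ tightens the estimate toward $p(x)$ from below.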

Our main contributions are as follows:

  • We revisit variational perturbation theory [31, 30, 29] based on cumulant expansions, which does not result in a bound. By introducing a generalized variational inference framework, we propose a similar yet alternative construction to cumulant expansions which results in a valid lower bound of the evidence.

  • We furthermore show that the new bound admits unbiased Monte Carlo gradients with low variance, making it well suited to stochastic optimization. Unbiasedness is an important property and is necessary to guarantee convergence [38]. Biased gradients may destroy the bound and may lead to divergence problems [7, 10, 11].

  • We evaluate the proposed method experimentally both for inference and for model selection. Already the lowest nonstandard order, $K = 3$, is less prone to underestimating posterior variances than standard VI, and it fits better models in a variational autoencoder (VAE) setup on small data sets. While this reduction in bias comes at the cost of larger gradient variance, we show experimentally and explain theoretically that the gradient variance is still smaller than in previously proposed alternatives [27, 12, 10] to the standard ELBO, resulting in faster convergence.

This paper is an extended version of a preceding conference paper by the same authors [4], which also first characterized variational inference as a form of biased importance sampling, where the tightness of the bound is traded off against a low-variance stochastic gradient.

Our paper is structured as follows. In Section 2, we revisit the cumulant expansion for variational inference. In Section 3, we derive a new family of bounds with similar properties, but which is amenable to stochastic optimization. We then discuss theoretical properties of the bounds. Experiments are presented in Section 4. Finally, we discuss related work in Section 5 and open questions in Section 6.

2 Background: Perturbation Theory for Variational Inference

We begin by reviewing variational perturbation theory as introduced in [29]. We consider a probabilistic model with data $x$, latent variables $z$, and joint distribution $p(x, z)$. The goal of Bayesian inference is to find the posterior distribution,

$$p(z \mid x) = \frac{p(x, z)}{p(x)}, \qquad \text{where} \quad p(x) = \int p(x, z)\, dz. \qquad (2)$$

Exact posterior inference is impossible in all but the most simple models because of the intractable integral defining the marginal likelihood $p(x)$ in Eq. 2. Variational perturbation theory approximates the log marginal likelihood, or evidence, $\log p(x)$, via a Taylor expansion. We introduce a so-called variational distribution $q_\lambda(z)$, parameterized by variational parameters $\lambda$, and we write the evidence as follows:

$$\log p(x) = \log \mathbb{E}_{q_\lambda(z)}\!\left[ e^{V(z)} \right] = \log \mathbb{E}_{q_\lambda(z)}\!\left[ e^{\beta V(z)} \right] \Big|_{\beta=1}. \qquad (3)$$

Above, the ‘interaction energy’ $V(z)$ is defined as

$$V(z) = \log p(x, z) - \log q_\lambda(z), \qquad (4)$$

and the notation $(\,\cdot\,)|_{\beta=1}$ on the right-hand side of Eq. 3 denotes that we introduced an auxiliary real parameter $\beta$, which we set to one at the end of the calculation.

The reason for this notation is as follows. We approximate the right-hand side of Eq. 3 by a finite order Taylor expansion in $\beta$ before setting $\beta = 1$. While $\beta$ is not a small parameter, it is a placeholder that keeps track of appearances of $V(z)$, which we consider to be small for all $z$ in the support of $q_\lambda(z)$. More precisely, it is enough to demand that $V(z)$ is almost constant in $z$ with small deviations, as is the case when $q_\lambda(z)$ is close to the posterior $p(z \mid x)$. (This will become clear in Eq. 6 below.) Hence, we can think of this approximation informally as an expansion in terms of the difference between the log variational distribution and the log posterior.

The first order term of the Taylor expansion is the evidence lower bound, or ELBO,

$$\mathcal{L}_{\mathrm{ELBO}}(\lambda) = \mathbb{E}_{q_\lambda(z)}\!\left[ V(z) \right] = \mathbb{E}_{q_\lambda(z)}\!\left[ \log p(x, z) - \log q_\lambda(z) \right]. \qquad (5)$$

As is well known in variational inference (VI) [14, 5, 46], the ELBO is a lower bound on the evidence, i.e., $\mathcal{L}_{\mathrm{ELBO}}(\lambda) \leq \log p(x)$ for all $\lambda$. VI maximizes the ELBO over $\lambda$, and thus tries to find the best approximation of the evidence that is possible within a first order Taylor expansion in $\beta$.

Higher order terms of the Taylor expansion lead to corrections to the standard ELBO. Suppressing the dependence of $V$ on $z$ and of the expectations on $\lambda$ to simplify the notation, we obtain

$$\log p(x) \approx \mathbb{E}[V] + \frac{1}{2}\Big( \mathbb{E}[V^2] - \mathbb{E}[V]^2 \Big) + \frac{1}{6}\Big( \mathbb{E}[V^3] - 3\,\mathbb{E}[V^2]\,\mathbb{E}[V] + 2\,\mathbb{E}[V]^3 \Big) + \ldots \qquad (6)$$

Eq. 6 is called the cumulant expansion of the evidence [29]. Due to the higher order correction terms, it can potentially approximate the true evidence better than the ELBO.

Variational perturbation theory has not found its way into mainstream machine learning, which arguably has to do with two major problems of the cumulant expansion and related approaches. First, in contrast to the ELBO, a higher order cumulant expansion does not result in a lower bound on the evidence. This prevents the usage of gradient-based optimization methods (as opposed to, e.g., coordinate updates), which is currently the mainstream approach to VI. Second, the cumulant expansion cannot be estimated efficiently using Monte Carlo sampling, as we discuss in Section 3.3 below. In the following sections, we present a similar construction of an improved variational objective that does not suffer from these shortcomings, and that is compatible with black-box optimization approaches. To this end, we have to start from a generalized formulation of VI.

3 Perturbative Black Box Variational Inference

This section presents our main results. We derive a family of new objective functions for Black Box Variational Inference (BBVI) which are amenable to stochastic optimization. We begin by deriving a generalized ELBO for VI based on concave functions. We then show that, in a special case, we obtain a strict bound with resemblance to variational perturbation theory.

3.1 Variational Inference with Generalized Lower Bounds

Variational inference (VI) approximates the evidence $\log p(x)$ of the model in Eq. 2 by maximizing the ELBO $\mathcal{L}_{\mathrm{ELBO}}(\lambda)$ (Eq. 5) over the variational parameters $\lambda$. A more common approach to derive the ELBO is through Jensen's inequality [14, 5, 46]. Here, we present a similar derivation for a broader family of lower bounds.

To this end, we consider an arbitrary concave function $f$ over the positive reals, i.e., with second derivative $f''(x) \leq 0$. As an example, $f$ could be the logarithm. We now consider the marginal likelihood $p(x) = \mathbb{E}_{q_\lambda(z)}[\,p(x, z)/q_\lambda(z)\,]$, and we derive a lower bound on $f(p(x))$ using Jensen's inequality:

$$f(p(x)) = f\!\left( \mathbb{E}_{q_\lambda(z)}\!\left[ \frac{p(x, z)}{q_\lambda(z)} \right] \right) \;\geq\; \mathbb{E}_{q_\lambda(z)}\!\left[ f\!\left( \frac{p(x, z)}{q_\lambda(z)} \right) \right] \;=:\; \mathcal{L}_f(\lambda). \qquad (7)$$

As stated above, for $f = \log$ (which is a concave function), the bound in Eq. 7 is the standard ELBO from Eq. 5. In this case, we refer to the corresponding VI scheme as KLVI, as maximizing the standard ELBO is equivalent to minimizing the Kullback-Leibler (KL) divergence from $q_\lambda(z)$ to the true posterior $p(z \mid x)$.

The above class of generalized bounds $\mathcal{L}_f(\lambda)$ is compatible with current mainstream optimization schemes for VI, often summarized as Black Box Variational Inference (BBVI) [37, 34]. For the reader's convenience, we briefly review these techniques. BBVI optimizes the ELBO via stochastic gradient descent (SGD) based on noisy estimates of its gradient. Crudely speaking, BBVI obtains its gradient estimates in three steps: (1) drawing Monte Carlo (MC) samples $z_s \sim q_\lambda(z)$, (2) averaging the bound over these samples, and (3) computing the gradient of the resulting average. A slight complication arises because both the expression inside the expectation in Eq. 7 and the distribution of MC samples depend on $\lambda$. The two main solutions to this problem are as follows:

  • Score function gradients [34], also known as the REINFORCE method, use the chain rule to relate the change in MC samples to a change in the log variational distribution,

    $$\nabla_\lambda \mathcal{L}_f(\lambda) = \mathbb{E}_{q_\lambda(z)}\!\left[ f\!\left( \frac{p(x, z)}{q_\lambda(z)} \right) \nabla_\lambda \log q_\lambda(z) + \nabla_\lambda f\!\left( \frac{p(x, z)}{q_\lambda(z)} \right) \right]. \qquad (8)$$
  • Reparameterization gradients [16] can be used if the samples $z \sim q_\lambda(z)$ can be expressed as a deterministic differentiable transformation $z = g_\lambda(\epsilon)$ of auxiliary random variables $\epsilon$ drawn from some $\lambda$-independent distribution $p(\epsilon)$. Under these conditions, the gradient of the bound can be written as

    $$\nabla_\lambda \mathcal{L}_f(\lambda) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\!\left[ \nabla_\lambda f\!\left( \frac{p(x, g_\lambda(\epsilon))}{q_\lambda(g_\lambda(\epsilon))} \right) \right]. \qquad (9)$$

One obtains an unbiased estimate of $\nabla_\lambda \mathcal{L}_f(\lambda)$ by approximating the expectation on the right-hand side of Eq. 8 or 9 with MC samples $z_s \sim q_\lambda(z)$ or $\epsilon_s \sim p(\epsilon)$, respectively. Empirically, reparameterization gradients often have smaller variance than score function gradients.
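To make the recipe concrete, here is a minimal sketch of a reparameterization gradient estimate (Eq. 9) for the special case $f = \log$, i.e., the standard ELBO, with a fully factorized Gaussian $q_\lambda(z)$; `log_joint` is again a hypothetical stand-in for $\log p(x, z)$:

```python
import torch

def elbo_reparam_grad(log_joint, mu, log_sigma, num_samples=64):
    """Reparameterization gradient estimate (Eq. 9) for f = log, i.e. the ELBO.

    mu, log_sigma: variational parameters (leaf tensors with requires_grad=True)
    log_joint:     callable z -> log p(x, z), assumed differentiable in z
    """
    eps = torch.randn(num_samples, mu.shape[0])   # eps ~ p(eps), lambda-independent
    z = mu + log_sigma.exp() * eps                # z = g_lambda(eps)
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum(dim=1)
    elbo = (log_joint(z) - log_q).mean()          # MC average of f(p/q) with f = log
    elbo.backward()                               # gradients accumulate in mu, log_sigma
    return mu.grad, log_sigma.grad
```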

Based on these new bounds $\mathcal{L}_f(\lambda)$, we make the following observations.

Variational Inference and Importance Sampling.

The generalized family of bounds, Eq. 7, shows that VI and importance sampling are closely connected [4, 11]. For $f$ being the identity, the variational bound becomes independent of $\lambda$; instead, it becomes an unbiased importance sampling estimator of $p(x)$, with proposal distribution $q_\lambda(z)$ and importance ratio $p(x, z)/q_\lambda(z)$. For other choices of $f$, the bound is no longer an unbiased estimator of $p(x)$. This way, we can identify BBVI as a form of biased importance sampling.

Bias-Variance Trade-off.

Importance sampling suffers from high variance in high dimensions because the log importance ratio $V(z)$ (Eq. 4) scales approximately linearly with the size $N$ of the model. (To see this, consider a setup where both $p(x, z)$ and $q_\lambda(z)$ factorize over all $N$ data points, in which case $V(z)$ is proportional to $N$.)
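As a quick numerical illustration of this scaling (a toy model, assuming $V(z)$ is a sum of $N$ i.i.d. per-data-point contributions):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in [1, 10, 50, 100]:
    # Toy assumption: V(z) is a sum of N i.i.d. per-data-point terms.
    V = rng.normal(loc=0.0, scale=0.1, size=(100_000, N)).sum(axis=1)
    weights = np.exp(V - V.mean())  # importance weights e^{V}, centered for stability
    # Variance grows exponentially in N (lognormal: e^{0.01 N} (e^{0.01 N} - 1)):
    print(N, weights.var())
```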

The choice of the function $f$ in Eq. 7 trades off some of this variance for a bias. KLVI uses $f = \log$, which leads to low variance gradient estimates in Eqs. 8-9 that depend only linearly on $V(z)$, rather than exponentially. This makes the KLVI bound easy to optimize, at the cost of some bias. An alternative to KLVI that has been explored in the literature is $\alpha$-VI [27, 12, 21, 10], which corresponds to setting $f(x) = x^{1-\alpha}/(1-\alpha)$ with some $\alpha > 0$, and thus $f(e^{V(z)}) = e^{(1-\alpha)V(z)}/(1-\alpha)$. This choice has an alternative bias-variance trade-off, as $V(z)$ enters in the exponent, leading to more variance than KLVI. Our empirical results in Figure 4, discussed in Section 4.3, confirm this picture by showing that $\alpha$-VI suffers from slow convergence due to large gradient variance.

According to these findings, a good alternative bound should perform favorably on the bias-variance trade-off. Ideally, it should depend on $V(z)$ only polynomially as opposed to exponentially. Such a family of bounds is introduced next. We also show how these bounds connect to variational perturbation theory (Section 2).

Figure 1: Different choices for the concave function $f$ in Eq. 7. For $f(x) = x$ (black), the bound is tight but independent of $\lambda$. KLVI corresponds to $f = \log$ (pink). Our proposed PBBVI bound uses $f_{V_0}^{(K)}$ (green, Eq. 10), which is tighter than KLVI for arguments close to $e^{V_0}$ (the PBBVI curve uses $K = 3$).

3.2 Perturbative Lower Bounds

We now propose a family of concave functions $f$ for the bound in Eq. 7 that is inspired by ideas from perturbation theory (Section 2). The choice of the function $f$ determines the tightness of the bound $\mathcal{L}_f(\lambda)$. A tight bound, i.e., a good approximation of the marginal likelihood, is important for the variational expectation-maximization (EM) algorithm, which uses the bound as a proxy for the marginal likelihood. The bound becomes perfectly tight if we set $f$ to the identity function, $f(x) = x$ (black line in Figure 1). However, as we discussed, this is a singular limit in which the bound does not depend on $\lambda$, and therefore does not provide any gradient signal to learn an optimal variational distribution $q_\lambda(z)$.

Ideally, the bound should be as tight as possible if $q_\lambda(z)$ is close to the posterior $p(z \mid x)$. In order to obtain a gradient signal, the bound should be lower if $q_\lambda(z)$ is very different from the posterior. To achieve this, we recall that the argument of the function $f$ in Eq. 7 is the ratio $p(x, z)/q_\lambda(z)$, which equals $p(x)$ if $q_\lambda(z)$ is the true posterior. Thus, the concave function $f$ should be close to the identity function for arguments close to $p(x)$. Since the only things we know about $p(x)$ are that it is positive and independent of $z$, we parameterize it as $e^{V_0}$ with a real parameter $V_0$, over which we optimize.

For any $V_0 \in \mathbb{R}$ and any odd integer $K$, we propose the following concave function:

$$f_{V_0}^{(K)}(x) = e^{V_0} \sum_{k=0}^{K} \frac{1}{k!} \left( \log x - V_0 \right)^k. \qquad (10)$$

The pink and green lines in Figure 1 plot $f_{V_0}^{(K)}$ for $K = 1$ and $K = 3$, respectively. The function $f_{V_0}^{(K)}$ converges pointwise to the identity function (black line in Figure 1) for $K \to \infty$ because it is the $K$th order Taylor expansion of $x = e^{V_0} e^{\log x - V_0}$ in $(\log x - V_0)$ around the reference point $x = e^{V_0}$. For any finite $K$, the Taylor expansion approximates the identity function most closely if the expansion parameter $(\log x - V_0)$ is small. We provide further intuition for Eq. 10 in Section 3.3 below.
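A minimal numerical sketch of Eq. 10: the function below evaluates $f_{V_0}^{(K)}$, and the checks confirm that it lies below the identity (as Theorem 1 in Section 3.4 guarantees) and approaches it as the odd order $K$ grows:

```python
import math
import numpy as np

def f_perturbative(x, V0=0.0, K=3):
    """Evaluate the concave function f_{V0}^{(K)}(x) of Eq. 10 (K must be odd)."""
    u = np.log(x) - V0
    return np.exp(V0) * sum(u**k / math.factorial(k) for k in range(K + 1))

x = np.linspace(0.1, 5.0, 50)
for K in [1, 3, 5, 7]:
    gap = x - f_perturbative(x, V0=0.0, K=K)
    assert np.all(gap >= -1e-12)   # f lies below the identity (Theorem 1)
    print(K, gap.max())            # the maximal gap shrinks as K grows
```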

An important result of this section is that the functions $f_{V_0}^{(K)}$ define a lower bound on the marginal likelihood for any choice of $V_0$ and odd $K$,

$$p(x) \;\geq\; f_{V_0}^{(K)}(p(x)) \;\geq\; \mathbb{E}_{q_\lambda(z)}\!\left[ f_{V_0}^{(K)}\!\left( \frac{p(x, z)}{q_\lambda(z)} \right) \right] \;=\; \mathcal{L}_K(\lambda, V_0). \qquad (11)$$

Here, the last equality merely introduces the more convenient notation $\mathcal{L}_K(\lambda, V_0)$ for the bound defined by Eqs. 7 and 10. An explicit form for $\mathcal{L}_K$ was given in Eq. 1 in the introduction. Eq. 11 follows from Eq. 7 and from the fact that $f_{V_0}^{(K)}$ is concave and lies below the identity function, which we prove formally in Section 3.4 below. Since $\mathcal{L}_K(\lambda, V_0)$ is a lower bound on $p(x)$ for all $V_0$, we can find the optimal reference point $V_0$ that leads to the tightest bound by maximizing $\mathcal{L}_K$ jointly over both $\lambda$ and $V_0$. We also prove in Section 3.4 that the bound on $p(x)$ is nontrivial, i.e., that $\mathcal{L}_K$ is positive at its maximum. (A negative bound would be vacuous since $p(x)$ is trivially larger than any negative number.)

Input: joint probability $p(x, z)$; order of perturbation $K$ (odd integer); learning rate schedule $\rho_t$; number of Monte Carlo samples $S$; number of training iterations $T$; variational family $q_\lambda(z)$ that admits reparameterization gradients as in Eq. 9.
Output: fitted variational parameters $\lambda^*$.
1 Initialize $\lambda$ randomly and $V_0 \leftarrow 0$;
2 for $t \leftarrow 1$ to $T$ do
      3 Draw $S$ samples $\epsilon_1, \ldots, \epsilon_S$ from the noise distribution $p(\epsilon)$ (see Eq. 9);
       // Obtain gradient estimates of the surrogate objective $\tilde{\mathcal{L}}$, see Eq. 12:
      4 $\hat{g}_\lambda \leftarrow$ reparameterized estimate of $\nabla_\lambda \tilde{\mathcal{L}}(\lambda, V_0)$;
      5 $\hat{g}_{V_0} \leftarrow$ reparameterized estimate of $\nabla_{V_0} \tilde{\mathcal{L}}(\lambda, V_0)$;
       // Perform update steps with rescaled gradients, see Eq. 13:
      6 $\lambda \leftarrow \lambda + \rho_t\, e^{\tilde{\mathcal{L}}(\lambda, V_0)}\, \hat{g}_\lambda$;
      7 $V_0 \leftarrow V_0 + \rho_t\, e^{\tilde{\mathcal{L}}(\lambda, V_0)}\, (1 + \hat{g}_{V_0})$;
end for
Algorithm 1: Perturbative Black Box Variational Inference (PBBVI)

To optimize the bound, we use SGD with reparameterization gradients, see Section 3.1. Score function gradients can be used in a similar way. Algorithm 1 explains the optimization in detail. We name the method ‘Perturbative Black Box Variational Inference’ (PBBVI). The algorithm implicitly scales all gradients by a factor of $e^{-V_0}$ to avoid a potential exponential explosion or decay of gradients due to the prefactor $e^{V_0}$ in the bound, Eq. 1. We calculate these rescaled gradients in a numerically stable way by considering the surrogate objective function

$$\tilde{\mathcal{L}}(\lambda, V_0) = \log\!\left( \sum_{k=0}^{K} \frac{1}{k!}\, \mathbb{E}_{q_\lambda(z)}\!\Big[ \big( V(z) - V_0 \big)^k \Big] \right) = \log \mathcal{L}_K(\lambda, V_0) - V_0. \qquad (12)$$

The rescaled gradients are thus

$$e^{-V_0}\, \nabla_{\lambda, V_0}\, \mathcal{L}_K(\lambda, V_0) = e^{\tilde{\mathcal{L}}(\lambda, V_0)}\, \nabla_{\lambda, V_0} \Big[ V_0 + \tilde{\mathcal{L}}(\lambda, V_0) \Big]. \qquad (13)$$

Eqs. 12-13 allow us to estimate the rescaled gradients (lines 4-7 of Algorithm 1) without having to evaluate any potentially overflowing or underflowing expressions $e^{\pm V_0}$.
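The following sketch computes the surrogate objective of Eq. 12 with automatic differentiation, under the same hypothetical `log_joint` interface as above; it assumes the polynomial inside the logarithm is positive, which holds near the optimum (Theorem 2):

```python
import math
import torch

def pbbvi_surrogate(log_joint, mu, log_sigma, V0, K=3, num_samples=64):
    """Surrogate objective L~(lambda, V0) of Eq. 12 for a factorized Gaussian q.

    mu, log_sigma, V0: leaf tensors with requires_grad=True (V0 is a scalar).
    """
    eps = torch.randn(num_samples, mu.shape[0])
    sigma = log_sigma.exp()
    z = mu + sigma * eps                               # reparameterization, Eq. 9
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(dim=1)
    u = log_joint(z) - log_q - V0                      # V(z) - V0
    poly = sum(u**k / math.factorial(k) for k in range(K + 1)).mean()
    return poly.log()                                  # log(e^{-V0} L_K); no e^{V0} formed

# Rescaled gradient step (Eq. 13): scale grad(V0 + L~) by exp(L~), detached:
# surrogate = pbbvi_surrogate(log_joint, mu, log_sigma, V0, K=3)
# scale = surrogate.detach().exp()
# (V0 + surrogate).backward()
# with torch.no_grad():
#     for p in (mu, log_sigma, V0):
#         p += rho * scale * p.grad   # rho: learning rate
```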

3.3 Connection to Variational Perturbation Theory

We now provide more intuition for the choice of the functions $f_{V_0}^{(K)}$ in Eq. 10 by comparing the resulting bounds, Eq. 1, to the cumulant expansion, Eq. 6. We will show that the bounds enjoy similar benefits as the cumulant expansion while being valid lower bounds and providing unbiased Monte Carlo gradients.

Both the cumulant expansion and the proposed bounds $\mathcal{L}_K$ are Taylor expansions involving the interaction energy $V(z)$. The cumulant expansion in Eq. 6 is a Taylor expansion of the evidence $\log p(x)$. By contrast, the bound $\mathcal{L}_K$ in Eq. 1 is a Taylor expansion of the marginal likelihood $p(x)$, and the expansion starts from a reference point $e^{V_0}$ over which we optimize.

Despite this difference, the two expansions are remarkably similar up to second order. Up to this order, the cumulant expansion, Eq. 6, coincides with a Taylor expansion of $\log \mathcal{L}_K(\lambda, V_0)$, and is thus a special case of the proposed lower bound if we set $V_0 = \mathbb{E}_{q_\lambda(z)}[V(z)]$. Beyond second order, the cumulant expansion contains additional terms that do not appear in the bound $\mathcal{L}_K$, such as the last spelled out term on the right-hand side of Eq. 6. These terms contain products of expectations under $q_\lambda(z)$, which are difficult to estimate with Monte Carlo techniques without introducing an additional bias. The bounds $\mathcal{L}_K$, by contrast, depend only linearly on expectations under $q_\lambda(z)$, which means that they can be estimated without bias with a single sample from $q_\lambda(z)$.

In addition, the main advantage of $\mathcal{L}_K$ over the cumulant expansion is that $\mathcal{L}_K$ is a lower bound on the marginal likelihood for all $\lambda$ and all $V_0$. This allows us to make the bound as tight as possible by maximizing it over $\lambda$ and $V_0$ using stochastic gradient estimates. By contrast, variational perturbation theory with the cumulant expansion requires either optimizing towards saddle points, or optimizing a separate objective function, such as the ELBO, to find a good variational distribution [29].

The lack of a lower bound in the cumulant expansion is a particular limitation when performing model selection with the variational expectation-maximization (EM) algorithm. Variational EM fits a model by maximizing an approximation of the marginal likelihood or the model evidence over model parameters. Such an optimization can diverge if the approximation is not a lower bound. This makes perturbative BBVI an interesting alternative for improved evidence approximations [7, 10, 21, 11, 33].

3.4 Proofs of the Theorems

We conclude the presentation of the proposed bounds $\mathcal{L}_K$ by providing proofs of the claims made in Section 3.2.

Theorem 1.

For all $V_0 \in \mathbb{R}$ and all odd integers $K$, the function $f_{V_0}^{(K)}$ defined in Eq. 10 satisfies the following properties:

  1. $f_{V_0}^{(K)}$ is concave; and

  2. $f_{V_0}^{(K)}(x) \leq x$ for all $x > 0$.

As a corollary, Eq. 11 holds.

Proof.
  1. The first derivative of $f_{V_0}^{(K)}$ is

    $$\frac{d f_{V_0}^{(K)}(x)}{dx} = \frac{e^{V_0}}{x} \sum_{k=0}^{K-1} \frac{1}{k!} \left( \log x - V_0 \right)^k. \qquad (14)$$

    For the second derivative, the two contributions from the numerator and the denominator cancel for all but the highest order term, and we obtain

    $$\frac{d^2 f_{V_0}^{(K)}(x)}{dx^2} = -\,\frac{e^{V_0}}{x^2}\, \frac{\left( \log x - V_0 \right)^{K-1}}{(K-1)!}, \qquad (15)$$

    which is nonpositive everywhere for odd $K$ since $K - 1$ is even. Therefore, $f_{V_0}^{(K)}$ is concave.

  2. Consider the function $h(x) = f_{V_0}^{(K)}(x) - x$, which is also concave since it has the same second derivative as $f_{V_0}^{(K)}$. Further, $h$ has a stationary point at $x = e^{V_0}$ since $d f_{V_0}^{(K)}(x)/dx = 1$ there, which can be verified by inserting $x = e^{V_0}$ into Eq. 14. For a concave function, a stationary point is a global maximum. Therefore, we have

    $$f_{V_0}^{(K)}(x) - x \;\leq\; f_{V_0}^{(K)}\!\big(e^{V_0}\big) - e^{V_0} \;=\; 0 \qquad \forall x > 0, \qquad (16)$$

    which is equivalent to the proposition $f_{V_0}^{(K)}(x) \leq x$.

Thus, the first inequality in Eq. 11 holds by property (ii), and the second inequality in Eq. 11 follows from property (i) and Jensen's inequality. ∎

Theorem 2.

The bound on the positive quantity $p(x)$ is nontrivial, i.e., the maximum value of the bound, $\max_{\lambda, V_0} \mathcal{L}_K(\lambda, V_0)$, is positive for all odd integers $K$.

Proof.

At the maximum position $(\lambda^*, V_0^*)$ of $\mathcal{L}_K$, its gradient is zero. Taking the derivative with respect to $V_0$ in Eq. 1 and using the product rule for the prefactor $e^{V_0}$ and the remaining expression, we find that all terms except the contribution from $k = K$ cancel. We thus obtain, at the maximum position $(\lambda^*, V_0^*)$,

$$\mathbb{E}_{q_{\lambda^*}(z)}\!\left[ \big( V(z) - V_0^* \big)^K \right] = 0. \qquad (17)$$

Thus, when we evaluate $\mathcal{L}_K$ at $(\lambda^*, V_0^*)$, the term with $k = K$ on the right-hand side of Eq. 1 evaluates to zero, and we are left with

$$\mathcal{L}_K(\lambda^*, V_0^*) = e^{V_0^*}\, \mathbb{E}_{q_{\lambda^*}(z)}\!\left[ g\big( V(z) \big) \right] \qquad (18)$$

with

$$g(V) = \sum_{k=0}^{K-1} \frac{1}{k!} \left( V - V_0^* \right)^k, \qquad (19)$$

where the sum runs only to $K - 1$.

We now show that $g(V)$ is positive for all $V$ and all odd $K$. If $K = 1$, then $g(V) = 1$ is trivially positive. For $K \geq 3$, $g$ is a polynomial in $V$ of even order $K - 1$, whose highest order term has a positive coefficient $1/(K-1)!$. Therefore, $g(V)$ goes to positive infinity for both $V \to \infty$ and $V \to -\infty$. It thus has a global minimum at some value $V = V^*$. At the global minimum, its derivative is zero, i.e.,

$$g'(V^*) = \sum_{k=0}^{K-2} \frac{1}{k!} \left( V^* - V_0^* \right)^k = 0. \qquad (20)$$

Subtracting Eq. 20 from Eq. 19, we find that the value of $g$ at its global minimum $V^*$ is

$$g(V^*) = \frac{1}{(K-1)!} \left( V^* - V_0^* \right)^{K-1}, \qquad (21)$$

which is nonnegative because $K - 1$ is even. Further, $g(V^*)$ is not zero, since this would imply $V^* = V_0^*$ by Eq. 21, but $g(V_0^*) = 1 \neq 0$ by Eq. 19. Therefore, $g(V^*)$ is strictly positive, and since $V^*$ is a global minimum of $g$, we have $g(V) > 0$ for all $V$. Inserting into Eq. 18 concludes the proof that the lower bound at the optimum is positive. ∎

4 Experiments

We evaluate PBBVI with different models. We begin with a qualitative comparison on a simple one-dimensional toy problem (Section 4.1). We then investigate the behavior of BBVI in a controlled setup of Gaussian processes on synthetic data (Section 4.2). Next, we evaluate PBBVI on a classification task using Gaussian process classifiers with data from the UCI machine learning repository (Section 4.3). This is a Bayesian non-conjugate setup where black box inference is required. Finally, we use an experiment with the variational autoencoder (VAE) to explore our approach on a deep generative model (Section 4.4). This experiment is carried out on MNIST data. We use the perturbative order $K = 3$ for all experiments with PBBVI. This corresponds to the lowest order beyond standard VI with the KL divergence (KLVI), since $K$ has to be an odd integer, and PBBVI with $K = 1$ is equivalent to KLVI. Across all experiments, PBBVI demonstrates advantages based on different metrics.

Figure 2: Behavior of different VI methods when fitting a univariate Gaussian to a bimodal target distribution (black). PBBVI (proposed, green) covers more of the mass of the target distribution than traditional KLVI (pink). $\alpha$-VI is mode-seeking for large $\alpha$ (blue) and mass-covering for smaller $\alpha$ (orange).

4.1 Mass Covering Effect

In Figure 2, we fit a Gaussian distribution to a one-dimensional bimodal target distribution (black line), using different divergences. Compared to BBVI with the standard KL divergence (KLVI, pink line), $\alpha$-divergences [27] are more mode-seeking (purple line) for large values of $\alpha$, and more mass-covering (orange line) for small $\alpha$ [21] (the limit $\alpha \to 1$ recovers KLVI [27]). Our PBBVI bound ($K = 3$, green line) achieves a similar mass-covering effect as $\alpha$-divergences, but with associated low-variance reparameterization gradients. This is seen in Figure 4 (b), discussed in Section 4.3 below, which compares the gradient variances of $\alpha$-VI and PBBVI as a function of the number of latent dimensions.

4.2 GP Regression on Synthetic Data

(a) KLVI
(b) PBBVI with $K = 3$
Figure 3: Gaussian process regression on synthetic data (green dots). Three standard deviations are shown in varying shades of orange. The blue dashed lines show three standard deviations of the true posterior; the red dashed lines show the inferred three standard deviations using KLVI (a) and PBBVI (b). The results from our proposed PBBVI are close to the analytic solution, while traditional KLVI underestimates the variances.

Method    Avg. variance
Analytic  0.0415
KLVI      0.0176
PBBVI     0.0355
(a) GP regression

Data set  Crab   Pima   Heart  Sonar
KLVI      0.22   0.245  0.148  0.212
PBBVI     0.11   0.240  0.133  0.173
(b) GP classification

Table 1: Results for the Gaussian process (GP) experiments. (a) Average marginal posterior variances at the positions of the data points for GP regression on synthetic data (see Figure 3). The proposed PBBVI is closer to the analytic solution than standard KLVI. (b) Error rate of GP classification on the test set (lower is better). Our proposed PBBVI consistently obtains better classification results.

In this section, we inspect the inference behavior on a synthetic data set with Gaussian processes (GPs). We generate the data according to a Gaussian noise distribution centered around a mixture of sinusoids, and we sample 50 data points (green dots in Figure 3). We then use a GP to model the data, thus assuming the generative process $f \sim \mathcal{GP}(0, \mathcal{K})$ and $y_i \sim \mathcal{N}(f(x_i), \sigma_n^2)$.

In this experiment, the model admits an analytic solution (three standard deviations shown as blue dashed lines in Figure 3). We compare the analytic solution to the approximate posteriors obtained with a fully factorized Gaussian variational distribution by KLVI (Figure 3 (a)) and by the proposed PBBVI (Figure 3 (b)). The results from PBBVI are almost identical to the analytic solution. In contrast, KLVI underestimates the posterior variance. This is consistent with Table 1 (a), which shows the average marginal variances; the PBBVI results are much closer to the analytic values.
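For reference, the analytic baseline in Table 1 (a) follows from the standard GP regression posterior; a minimal sketch (with the kernel passed in as a callable and the noise variance assumed known) is:

```python
import numpy as np

def gp_posterior_variance(X, kernel, noise_var):
    """Marginal posterior variances of a GP regression at the training inputs X.

    kernel:    callable (X, X') -> covariance matrix
    noise_var: observation noise variance (assumed known here)
    """
    Kxx = kernel(X, X)
    # Posterior covariance K - K (K + noise * I)^{-1} K; we keep its diagonal.
    posterior_cov = Kxx - Kxx @ np.linalg.solve(Kxx + noise_var * np.eye(len(X)), Kxx)
    return np.diag(posterior_cov)
```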

4.3 Gaussian Process Classification

We evaluate the performance of PBBVI and KLVI on a GP classification task. Since the model is non-conjugate, no analytic baseline is available in this case. We model the data with the following generative process:

$$f \sim \mathcal{GP}\big(0, \mathcal{K}\big), \qquad y_i \sim \mathrm{Bern}\big( \sigma(f(x_i)) \big). \qquad (22)$$

Above, $\mathcal{K}$ is the GP kernel, $\sigma(\cdot)$ indicates the sigmoid function, and $\mathrm{Bern}(\cdot)$ indicates the Bernoulli distribution. We furthermore use the Matérn-3/2 kernel,

$$\mathcal{K}(x_i, x_j) = s^2 \left( 1 + \frac{\sqrt{3}\, \lVert x_i - x_j \rVert}{\ell} \right) \exp\!\left( -\frac{\sqrt{3}\, \lVert x_i - x_j \rVert}{\ell} \right). \qquad (23)$$
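A direct NumPy sketch of the kernel in Eq. 23, with scale $s$ and lengthscale $\ell$ as the hyperparameters:

```python
import numpy as np

def matern32(X1, X2, s=1.0, ell=1.0):
    """Matern-3/2 kernel of Eq. 23 between row-wise inputs X1 and X2."""
    # Pairwise Euclidean distances between all rows of X1 and X2:
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    r = np.sqrt(3.0) * d / ell
    return s**2 * (1.0 + r) * np.exp(-r)
```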

Data.

We use four data sets from the UCI machine learning repository that are suitable for binary classification: Crab (200 data points), Pima (768 data points), Heart (270 data points), and Sonar (208 data points). We randomly split each data set into two halves; one half is used for training and the other half for testing. We keep the kernel hyperparameters $s$ and $\ell$ fixed throughout all experiments, with values chosen based on the dimensionality of the input $x$.

Table 1 (b) shows the classification performance (error rate) for these data sets. Our proposed PBBVI consistently performs better than the traditional KLVI.

(a) Speed of convergence
(b) Gradient variance
Figure 4: Comparisons between PBBVI (proposed) and $\alpha$-VI. (a) Training curves (test log-likelihood per data point) for GP classification on the Sonar data set. PBBVI converges faster than $\alpha$-VI even though we tuned the number of MC samples per training step and the constant learning rate so as to maximize the performance of $\alpha$-VI on a validation set. (b) Sampling variance of the stochastic gradient (averaged over its components) in the optimum of a GP regression model with synthetic data, for $\alpha$-divergences (orange, purple, gray) and the proposed PBBVI (green). The variance grows exponentially with the latent dimension for $\alpha$-VI, and only algebraically for PBBVI.

Convergence speed comparison.

We also carry out a comparison in terms of speed of convergence, focusing on PBBVI and -VI. Our results indicate that the smaller variance of the reparameterization gradient in PBBVI leads to faster convergence of the optimization algorithm.

We train the GP classifier from Eqs. 22-23 on the Sonar UCI data set using a constant learning rate. Figure 4 (a) shows the test log-likelihood under the posterior mean as a function of training iterations. We split the data set into equally sized training, validation, and test sets. We then tune the learning rate and the number of Monte Carlo samples per gradient step to obtain optimal performance on the validation set after optimizing the $\alpha$-VI bound with a fixed budget of random samples; smaller values of $\alpha$ (i.e., a stronger mass-covering effect) lead to even slower convergence. We then optimize the PBBVI lower bound with the same learning rate and number of Monte Carlo samples. Both methods reach a comparable final test error rate (the data set has binary labels and is approximately balanced). Although the hyperparameters are tuned for $\alpha$-VI, the proposed PBBVI converges an order of magnitude faster (Figure 4 (a)).

Figure 4 (b) provides more insight into the scaling of the gradient variance. Here, we fit GP regression models on synthetically generated data by maximizing the PBBVI lower bound and the $\alpha$-VI lower bound, respectively. We generate a separate synthetic data set for each number $N$ of latent variables by drawing $N$ random data points around a sinusoidal curve. For each $N$, we fit a one-dimensional GP regression with PBBVI and $\alpha$-VI, respectively, using the same data set for both methods. The variational distribution is a fully factorized Gaussian with $N$ latent variables. After convergence, we estimate the sampling variance of the gradient of each lower bound with respect to the posterior mean. We calculate the empirical variance of the gradient based on samples from $q_\lambda(z)$, and we average over the $N$ coordinates. Figure 4 (b) shows the average sampling variance as a function of $N$ on a logarithmic scale. The variance of the gradient of the $\alpha$-VI bound grows exponentially in the number of latent variables. By contrast, we find only algebraic growth for PBBVI.

4.4 Variational Autoencoder

Figure 5: Predictive likelihood of a VAE trained on data sets of different sizes. Higher values are better. The training data are randomly sampled subsets of the MNIST training set. Our proposed PBBVI method outperforms KLVI mainly when the training data set is small; the fewer the training data, the larger the advantage of PBBVI.

We experiment on variational autoencoders (VAEs), comparing the PBBVI and the KLVI bound in terms of predictive likelihoods on held-out data [16]. Autoencoders compress unlabeled training data into low-dimensional representations by fitting an encoder-decoder model that maps the data to itself. These models are prone to learning the identity function when the hyperparameters are not carefully tuned or when the network is too expressive, especially for a moderately sized training set. VAEs are designed to partially avoid this problem by estimating the uncertainty that is associated with each data point in the latent space. It is therefore important that the inference method does not underestimate posterior variances. We show that, for small data sets, training a VAE by maximizing the PBBVI lower bound leads to higher predictive likelihoods than maximizing the KLVI lower bound.

We train the VAE on the MNIST data set of handwritten digits [19]. We build on the publicly available implementation by [7], and we use the same architecture and hyperparameters, including the number of samples drawn from the variational distribution per gradient step. The model has two stochastic layers, with 100 latent units in the first stochastic layer and 50 latent units in the second stochastic layer.

The VAE model factorizes over all data points. We train it by stochastically maximizing the sum of the PBBVI lower bounds for all data points using minibatches. The VAE amortizes the gradient signal across data points by training inference networks. The inference networks express the mean and variance of the variational distribution as a function of the data point. We add an additional inference network that learns the mapping from a data point to the reference energy $V_0$; here, we use a network with four fully connected hidden layers.
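A minimal sketch of such an amortized reference energy network; the layer widths below are illustrative placeholders, not the ones used in the experiments:

```python
import torch.nn as nn

class ReferenceEnergyNet(nn.Module):
    """Amortized map from a data point to its reference energy V0 (a scalar)."""
    def __init__(self, input_dim=784, hidden=(256, 128, 64, 32)):  # widths illustrative
        super().__init__()
        layers, d = [], input_dim
        for h in hidden:                    # four fully connected hidden layers
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        layers.append(nn.Linear(d, 1))      # scalar output: V0(x)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)
```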

MNIST contains 60,000 training images. To test our approach on smaller-scale data where Bayesian uncertainty matters more, we evaluate the test likelihood after training the model on randomly sampled fractions of the training set. We use the same training schedules as in the publicly available implementation, keeping the total number of training iterations independent of the size of the training set. Different from the original implementation, we shuffle the training set before each training epoch, as this turns out to increase the performance of both our method and the baseline.

Figure 5 shows the predictive log-likelihood on the whole test set, where the VAE is trained on random subsets of different sizes of the training set. We use the same subset to train with PBBVI and KLVI for each training set size. PBBVI leads to a higher predictive likelihood than traditional KLVI on small subsets of the data. We explain this finding with our observation that the variational distributions obtained from PBBVI capture more of the posterior variance. As the size of the training set grows, and the posterior uncertainty decreases, the performance of KLVI catches up with PBBVI.

As a potential explanation for why PBBVI converges to the KLVI result for large training sets, we note that $\mathbb{E}_{q_\lambda(z)}[(V(z) - V_0)^K] = 0$ at the optimal variational distribution and reference energy (see Section 3.4). If $V(z)$ becomes a symmetric random variable (such as a Gaussian) in the limit of a large training set, then this implies that $V_0 = \mathbb{E}_{q_\lambda(z)}[V(z)]$, and PBBVI with $K = 3$ reduces to KLVI for large training sets.

5 Related work

Our approach is related to BBVI, VI with generalized divergences, and variational perturbation theory. We thus briefly discuss related work in these three directions.

Black box variational inference (BBVI).

BBVI has already been addressed in the introduction [40, 16, 37, 34, 39]; it enables variational inference for many models [35, 2, 3, 9, 20, 26, 23]. Recent developments include variance reduction and improvements in reparameterization and amortization [17, 45, 36, 24, 39, 6, 25] which are all compatible with our approach. Our work builds upon BBVI in that BBVI makes a large class of new divergence measures between the posterior and the approximating distribution tractable. Depending on the divergence measure, BBVI may suffer from high-variance stochastic gradients. This is a practical problem that we address in this paper.

Generalized divergence measures.

Our work connects to generalized information-theoretic divergences [1]. [27] introduced a broad class of divergences for variational inference, including -divergences. Most of these divergences have been intractable in large-scale applications until the advent of BBVI. In this context, -divergences were first suggested by  [12] for local divergence minimization, and later for global minimization by [21] and [10]. As we show in this paper, -divergences have the disadvantage of inducing high-variance gradients, since the ratio between posterior and variational distribution enters the bound polynomially instead of logarithmically. In contrast, our approach leads to a more stable inference scheme in high dimensions.

Variational perturbation theory.

Perturbation theory refers to methods that approximate an intractable quantity by truncating a typically divergent power series after a finite number of terms. In machine learning, these approaches have been addressed from an information-theoretic perspective by [42, 43]. Thouless-Anderson-Palmer (TAP) equations [44] are a form of second-order perturbation theory. They were originally developed in statistical physics to include perturbative corrections to the mean-field solution of Ising models. They were adopted into Bayesian inference in [32] and advanced by many authors [15, 31, 30, 28]. In variational inference, perturbation theory yields extra terms in the mean-field variational objective that are difficult to calculate analytically, which may be a reason why the methods discussed above are not widely adopted by practitioners. In this paper, we emphasize the ease of including perturbative corrections in a black box variational inference framework. Furthermore, in contrast to earlier formulations, our approach yields a strict lower bound to the marginal likelihood which can be conveniently optimized. Our approach differs from the traditional variational perturbation formulation [18, 41], which generally does not result in a bound.

6 Conclusion

We first presented a view of black box variational inference as a form of biased importance sampling, where the choice of divergence trades off bias against variance. Bias refers to the deviation of the bound from the true marginal likelihood; variance refers to the variance of its reparameterization gradient estimator. We then proposed a family of new variational bounds that connect to variational perturbation theory and that include corrections to the standard Kullback-Leibler bound. Our proposed PBBVI bound is an asymptotic series to the true marginal likelihood for large orders of the perturbative expansion, and we showed both theoretically and experimentally that it has lower-variance reparameterization gradients compared to $\alpha$-VI.

In order to scale our method up to massive data sets, future work will explore stochastic versions of PBBVI. Since the PBBVI bound contains interaction terms between all data points, breaking it up into minibatches is not straightforward. Moreover, while our experiments used a fixed perturbative order of $K = 3$, it could be beneficial to increase the perturbative order at some point during the training cycle, once an empirical estimate of the gradient variance drops below a certain threshold. It would also be interesting to investigate a $K$-independent formulation of PBBVI using Russian roulette estimates [22]. Furthermore, the PBBVI and $\alpha$-bounds can be combined, such that PBBVI further approximates $\alpha$-VI. This could lead to promising results on large data sets where traditional $\alpha$-VI is hard to optimize due to its variance, and where traditional PBBVI converges to KLVI. As a final remark, a tighter variational bound is not guaranteed to always result in a better posterior approximation, since the variational family limits the quality of the solution. However, in the context of variational EM, where one performs gradient-based hyperparameter optimization on the log marginal likelihood, our bound gives more reliable results, since higher orders of $K$ can be assumed to approximate the marginal likelihood better.

References


  • [1] Shunichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 2012.
  • [2] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In International Conference on Machine Learning, pages 380–389, 2017.
  • [3] Robert Bamler and Stephan Mandt. Improving optimization in models with continuous symmetry breaking. In International Conference on Machine Learning, pages 432–441, 2018.
  • [4] Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In Advances in Neural Information Processing Systems, pages 5079–5088, 2017.
  • [5] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [6] Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo variational inference. In International Conference on Machine Learning, pages 667–676, 2018.
  • [7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, pages 1–9, 2016.
  • [8] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977.
  • [9] Zhiwei Deng, Rajitha Navarathna, Peter Carr, Stephan Mandt, Yisong Yue, Iain Matthews, and Greg Mori. Factorized variational autoencoders for modeling audience reactions to movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2577–2586, 2017.
  • [10] Adji B Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M Blei. Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732–2741, 2017.
  • [11] Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pages 4470–4479, 2018.
  • [12] Jose Hernandez-Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hernández-Lobato, and Richard Turner. Black-box alpha divergence minimization. In International Conference on Machine Learning, pages 2256–2273, 2016.
  • [13] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [14] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • [15] Hilbert J Kappen and Wim Wiegerinck. Second order approximations for probability models. In Advances in Neural Information Processing Systems, pages 238–244, 2001.
  • [16] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, pages 1–9, 2014.
  • [17] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
  • [18] Hagen Kleinert. Path integrals in quantum mechanics, statistics, polymer physics, and financial markets. World scientific, 2009.
  • [19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, pages 5656–5665, 2018.
  • [21] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
  • [22] Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson, et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
  • [23] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez Lobato, Sebastian Nowozin, and Cheng Zhang. EDDI: efficient dynamic discovery of high-value information with partial VAE. In International Conference on Machine Learning, pages 1–8, 2019.
  • [24] Stephan Mandt and David Blei. Smoothed gradients for stochastic variational inference. In Advances in Neural Information Processing Systems, pages 2438–2446, 2014.
  • [25] Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, pages 3400–3409, 2018.
  • [26] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning, pages 2391–2400, 2017.
  • [27] Thomas Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
  • [28] Manfred Opper. Expectation propagation. In Statistical Physics, Optimization, Inference, and Message-Passing Algorithms, chapter 9, pages 263–292. 2015.
  • [29] Manfred Opper, Marco Fraccaro, Ulrich Paquet, Alex Susemihl, and Ole Winther. Perturbation theory for variational inference. In Advances in Neural Information Processing Systems Workshop on Advances in Approximate Bayesian Inference, 2015.
  • [30] Manfred Opper, Ulrich Paquet, and Ole Winther. Perturbative corrections for approximate inference in gaussian latent variable models. Journal of Machine Learning Research, 14(1):2857–2898, 2013.
  • [31] Ulrich Paquet, Ole Winther, and Manfred Opper. Perturbation corrections in approximate inference: Mixture modelling applications. Journal of Machine Learning Research, 10(Jun):1263–1304, 2009.
  • [32] Timm Plefka. Convergence condition of the TAP equation for the infinite-ranged ising spin glass model. Journal of Physics A: Mathematical and general, 15(6):1971, 1982.
  • [33] Tom Rainforth, Adam Kosiorek, Tuan Anh Le, Chris Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, pages 4277–4285, 2018.
  • [34] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, pages 814–822, 2014.
  • [35] Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771, 2015.
  • [36] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on International Conference on Machine Learning, pages 1530–1538, 2015.
  • [37] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
  • [38] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • [39] Francisco Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.
  • [40] Tim Salimans and David A Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
  • [41] Moshe Schwartz and Eytan Katzav. The ideas behind self-consistent expansion. Journal of Statistical Mechanics: Theory and Experiment, 2008(04):23, 2008.
  • [42] Toshiyuki Tanaka. A theory of mean field approximation. In Advances in Neural Information Processing Systems, pages 1–8, 1999.
  • [43] Toshiyuki Tanaka. Information geometry of mean-field approximation. Neural Computation, 12(8):1951–1968, 2000.
  • [44] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of ‘solvable model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
  • [45] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.
  • [46] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, pages 1–20, 2018.