Importance Weighted Hierarchical Variational Inference

05/08/2019 ∙ by Artem Sobolev, et al. ∙ HSE University

Variational Inference is a powerful tool in the Bayesian modeling toolkit; however, its effectiveness is determined by the expressivity of the utilized variational distributions in terms of their ability to match the true posterior distribution. In turn, the expressivity of the variational family is largely limited by the requirement of having a tractable density function. To overcome this roadblock, we introduce a new family of variational upper bounds on a marginal log density in the case of hierarchical models (also known as latent variable models). We then give an upper bound on the Kullback-Leibler divergence and derive a family of increasingly tighter variational lower bounds on the otherwise intractable standard evidence lower bound for hierarchical variational distributions, enabling the use of more expressive approximate posteriors. We show that previously known methods, such as Hierarchical Variational Models, Semi-Implicit Variational Inference and Doubly Semi-Implicit Variational Inference, can be seen as special cases of the proposed approach, and empirically demonstrate superior performance of the proposed method in a set of experiments.


1 Introduction

Bayesian Inference is an important statistical tool. However, exact inference is possible only in a small class of conjugate problems, and for many practically interesting cases one has to resort to Approximate Inference techniques. Variational Inference (Hinton and van Camp, 1993; Waterhouse et al., 1996; Wainwright et al., 2008), being one of them, is an efficient and scalable approach that has gained a lot of interest in recent years due to advances in neural networks.

However, the efficiency and accuracy of Variational Inference heavily depend on how close the approximate posterior is to the true posterior. As a result, neural networks' universal approximation abilities and great empirical success have propelled a lot of interest in employing them as powerful sample generators (Nowozin et al., 2016; Goodfellow et al., 2014; MacKay, 1995) that are trained to output samples from the approximate posterior when fed some standard noise as input. Unfortunately, a significant obstacle in this direction is the need for a tractable density q(z|x), which in general requires intractable integration. A theoretically sound approach is then to give tight lower bounds on the intractable term – the differential entropy of q(z|x) – which is easy to recover from upper bounds on the marginal log-density. One such bound was introduced by Agakov and Barber (2004); however, its tightness depends on the auxiliary variational distribution. Yin and Zhou (2018) suggested a multisample bound whose tightness is controlled by the number of samples.

In this paper we consider hierarchical variational models (Ranganath et al., 2016; Salimans et al., 2015; Agakov and Barber, 2004), where the approximate posterior q(z|x) is represented as a mixture of tractable distributions q(z|x, ψ) over an arbitrarily complicated mixing distribution q(ψ|x): q(z|x) = ∫ q(z|x, ψ) q(ψ|x) dψ. We show that such variational models contain the semi-implicit models first studied by Yin and Zhou (2018). To overcome the need for a closed-form marginal density q(z|x), we then propose a novel family of tighter bounds on the marginal log-likelihood log p(x), which can be shown to generalize many previously known bounds: Hierarchical Variational Models (Ranganath et al., 2016), also known as the auxiliary VAE bound (Maaløe et al., 2016), Semi-Implicit Variational Inference (Yin and Zhou, 2018) and Doubly Semi-Implicit Variational Inference (Molchanov et al., 2018). At the core of our work lies a novel variational upper bound on the marginal log-density, which we combine with the previously known lower bound (Burda et al., 2015) to give a novel upper bound on the Kullback-Leibler (KL) divergence between hierarchical models, and apply it to the evidence lower bound (ELBO) to enable hierarchical approximate posteriors and/or priors. Finally, our method can be combined with the multisample bound of Burda et al. (2015) to tighten the marginal log-likelihood estimate further.

2 Background

Having a hierarchical (latent variable) model p(x, z) = p(x|z) p(z) for observable objects x, we are interested in two tasks: inference and learning. The problem of Bayesian inference is that of finding the true posterior distribution p(z|x), which is often intractable and is thus approximated by some q(z|x). The problem of learning is that of finding the parameters of the generative model such that the marginal model distribution p(x) approximates the true data-generating process of x as well as possible, typically in terms of KL divergence, which leads to the Maximum Likelihood Estimation problem.

Variational Inference provides a way to solve both tasks simultaneously by lower-bounding the intractable marginal log-likelihood log p(x) with the Evidence Lower Bound (ELBO) using the posterior approximation q(z|x):

log p(x) ≥ E_{q(z|x)} [ log p(x, z) − log q(z|x) ]

The bound requires analytically tractable densities for both p(x, z) and q(z|x). The gap between the marginal log-likelihood and the bound is equal to KL(q(z|x) || p(z|x)), which acts as a regularizer preventing the true posterior p(z|x) from deviating too far from the approximate one q(z|x), thus limiting the expressivity of the marginal distribution p(x). Burda et al. (2015) proposed a family of tighter multisample bounds generalizing the ELBO. We call it the IWAE bound:

log p(x) ≥ E_{z_{1:M} ~ q(z|x)} [ log (1/M) Σ_{m=1}^{M} p(x, z_m) / q(z_m|x) ]

Where from now on we write z_{1:M} for a set of M i.i.d. samples z_1, ..., z_M from q(z|x) for brevity. This bound has been shown (Domke and Sheldon, 2018) to be a tractable lower bound on the ELBO of a variational distribution obtained from q(z|x) by a certain importance-resampling procedure (cf. Lemma A.2). However, the price of this increased tightness is higher computational complexity, especially in terms of the number of evaluations of the joint distribution p(x, z), and thus we might want to come up with a more expressive posterior approximation to be used in the plain ELBO – a special case of the IWAE bound with M = 1.
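As an illustration (not part of the original text), the following minimal NumPy sketch compares Monte Carlo estimates of the ELBO and the M-sample IWAE bound on a toy conjugate Gaussian model where log p(x) is available in closed form; the model, the deliberately mismatched q(z|x) and all names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1)  =>  p(x) = N(0, 2) in closed form.
# Approximate posterior: q(z | x) = N(x / 2, 0.8), slightly off the true N(x / 2, 0.5).
x = 1.5

def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)

def log_q(z):
    return norm.logpdf(z, x / 2.0, np.sqrt(0.8))

def sample_q(size):
    return rng.normal(x / 2.0, np.sqrt(0.8), size=size)

def elbo_estimate(n_mc=10_000):
    z = sample_q(n_mc)
    return np.mean(log_joint(z) - log_q(z))

def iwae_estimate(M, n_mc=10_000):
    z = sample_q((n_mc, M))                 # M samples per estimate
    log_w = log_joint(z) - log_q(z)         # log importance weights
    return np.mean(logsumexp(log_w, axis=1) - np.log(M))

true_log_px = norm.logpdf(x, 0.0, np.sqrt(2.0))
print(true_log_px, elbo_estimate(), iwae_estimate(5), iwae_estimate(50))
# In expectation the IWAE estimates approach log p(x) from below as M grows.
```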

In the direction of improving the single-sample ELBO it was proposed (Agakov and Barber, 2004; Salimans et al., 2015; Maaløe et al., 2016; Ranganath et al., 2016) to use a hierarchical variational model (HVM) for q(z|x) with an explicit joint density q(z, ψ|x) = q(z|x, ψ) q(ψ|x), where ψ is an auxiliary random variable. To overcome the intractability of the marginal density q(z|x) = ∫ q(z|x, ψ) q(ψ|x) dψ, a variational lower bound on the ELBO is used, whose tightness is controlled by the auxiliary variational distribution τ(ψ|x, z):

log p(x) ≥ E_{q(z, ψ|x)} [ log p(x, z) + log τ(ψ|x, z) − log q(z|x, ψ) − log q(ψ|x) ]    (1)

Recently Yin and Zhou (2018) introduced semi-implicit models: hierarchical models with an implicit but reparametrizable mixing distribution q(ψ|x) and an explicit conditional q(z|x, ψ). They suggested the following surrogate objective, which was later shown to be a lower bound (the SIVI bound) for all finite K by Molchanov et al. (2018):

log p(x) ≥ E_{ψ_0 ~ q(ψ|x)} E_{z ~ q(z|x, ψ_0)} E_{ψ_{1:K} ~ q(ψ|x)} [ log p(x, z) − log (1/(K+1)) Σ_{k=0}^{K} q(z|x, ψ_k) ]    (2)

SIVI can also be generalized into a multisample bound, similarly to the IWAE bound (Burda et al., 2015), in an efficient way by reusing the same ψ samples for different z_m:

log p(x) ≥ E [ log (1/M) Σ_{m=1}^{M} p(x, z_m) / ( (1/(K+1)) ( q(z_m|x, ψ_{0m}) + Σ_{k=1}^{K} q(z_m|x, ψ_k) ) ) ]    (3)

Where the expectation is taken over ψ_{0m} ~ q(ψ|x), z_m ~ q(z|x, ψ_{0m}) for m = 1, ..., M, and ψ_{1:K} is the same set of i.i.d. samples from q(ψ|x) for all m (one could also include all the ψ_{0m} into the set of reused samples, expanding its size to K + M). Importantly, this estimator has O(K + M) sampling complexity for ψ, unlike the naive approach of drawing a fresh set of K samples for every z_m, which leads to O(KM) sampling complexity. We will get back to this discussion in section 4.1.
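A minimal sketch of the sample-reusing multisample SIVI estimator (3), under the reading of the reuse scheme described above; the toy semi-implicit posterior and generative model are illustrative assumptions, not the models used in the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)
x = 1.5

# Toy semi-implicit posterior: psi ~ q(psi) = N(0, 1), z | psi ~ q(z | psi) = N(psi, 0.5^2).
# Generative model as in the previous sketch: p(z) = N(0, 1), p(x | z) = N(z, 1).
def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)

def sivi_multisample_bound(M, K, n_mc=2_000):
    est = np.empty(n_mc)
    for i in range(n_mc):
        psi0 = rng.normal(0.0, 1.0, size=M)          # one "own" psi_{0m} per z_m
        z = rng.normal(psi0, 0.5)                    # z_m ~ q(z | psi_{0m})
        psi_shared = rng.normal(0.0, 1.0, size=K)    # K psi samples reused for all m
        # Mixture estimate (1/(K+1)) * (q(z_m | psi_{0m}) + sum_k q(z_m | psi_k)):
        log_q_own = norm.logpdf(z, psi0, 0.5)                             # (M,)
        log_q_shared = norm.logpdf(z[:, None], psi_shared[None, :], 0.5)  # (M, K)
        log_q_mix = logsumexp(np.concatenate([log_q_own[:, None], log_q_shared], axis=1),
                              axis=1) - np.log(K + 1)
        log_w = log_joint(z) - log_q_mix
        est[i] = logsumexp(log_w) - np.log(M)
    return est.mean()

print(sivi_multisample_bound(M=5, K=10), sivi_multisample_bound(M=5, K=100))
# In expectation the lower bound on log p(x) tightens as K grows, for a fixed M.
```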

2.1 SIVI Insights

Here we outline SIVI's weak points and identify certain traits that make it possible to generalize the method and bridge it with prior work.

First, note that both SIVI bounds (2) and (3) use samples from the prior q(ψ|x) to describe the mixture density q(z|x), and in high dimensions one might expect such "uninformed" samples to miss most of the time, resulting in near-zero likelihoods q(z|x, ψ_k) and thus reducing the effective sample size. It is therefore expected that in higher dimensions it would take many samples to accurately cover the regions of ψ that have high posterior probability q(ψ|x, z) for the given z. Instead, ideally, we would like to target such regions directly while keeping the lower-bound guarantees.

Another important observation that we will make use of is that many such semi-implicit models can be equivalently reformulated as a mixture of two explicit distributions: due to the reparametrizability of q(ψ|x) we have ψ = g(ε, x) for some ε ~ q(ε) with a tractable density. We can then consider an equivalent hierarchical model that first samples ε from this simple distribution, transforms the sample into ψ = g(ε, x), and then generates z from q(z|x, ψ). Thus from now on we will assume that both the mixing distribution and the conditional q(z|x, ψ) have tractable densities, yet q(z|x, ψ) can depend on ψ in an arbitrarily complex way, making analytic marginalization intractable.
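A small illustrative sketch of this reformulation (the transform g and all distributions below are made-up stand-ins; in practice g would be a neural network):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Implicit mixing distribution: psi = g(eps, x) with eps ~ N(0, 1); psi's density is
# intractable, but the pair (eps, z) forms an explicit hierarchical model:
#   q(eps) = N(0, 1)                         -- tractable mixing density
#   q(z | x, eps) = N(g(eps, x), 0.5^2)      -- tractable conditional, complex in eps
def g(eps, x):                 # stand-in for a neural net; illustrative only
    return np.tanh(2.0 * eps + x) + 0.1 * eps**3

def sample_z(x, n):
    eps = rng.normal(size=n)           # auxiliary variable with tractable density
    psi = g(eps, x)                    # implicit "sample generator" output
    z = rng.normal(loc=psi, scale=0.5, size=n)
    return eps, z

def log_q_z_given_eps(z, eps, x):      # explicit conditional density
    return norm.logpdf(z, loc=g(eps, x), scale=0.5)

def log_q_eps(eps):                    # explicit mixing density
    return norm.logpdf(eps, 0.0, 1.0)

# The marginal q(z | x) = ∫ q(z | x, eps) q(eps) d(eps) remains intractable,
# which is exactly the quantity the bounds in this paper sandwich.
```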

3 Importance Weighted Hierarchical Variational Inference

With the intractable marginal log-density log q(z|x) being the source of our problems, we seek a tractable and efficient upper bound on it, which is provided by the following theorem:

Theorem (Marginal log density upper bound).

For any K ≥ 0 and any auxiliary distribution τ(ψ|x, z) (under some regularity conditions), consider the following quantity:

U_K := E_{ψ_0 ~ q(ψ|x, z)} E_{ψ_{1:K} ~ τ(ψ|x, z)} [ log (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k|x) / τ(ψ_k|x, z) ]

Then the following holds: log q(z|x) ≤ U_K for all K, the bounds are non-increasing in K (U_{K+1} ≤ U_K), and U_K → log q(z|x) as K → ∞.

Proof.

See Theorem A.1 in the Appendix. ∎

The proposed upper bound provides a variational alternative to MCMC-based upper bounds (Grosse et al., 2015) and complements the standard Importance Weighted stochastic lower bound of Burda et al. (2015) on the marginal log-density:

log q(z|x) ≥ E_{ψ_{1:K} ~ τ(ψ|x, z)} [ log (1/K) Σ_{k=1}^{K} q(z, ψ_k|x) / τ(ψ_k|x, z) ]
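To make the sandwich concrete, here is a minimal sketch (illustrative, not the paper's code) that evaluates both bounds on a toy conjugate hierarchical Gaussian where log q(z), the posterior q(ψ|z) and therefore the optimal τ are all available in closed form. In general q(ψ|z) is intractable, but in the ELBO setting the required joint sample (ψ_0, z) comes for free from q(ψ) q(z|ψ), as discussed in section 3.1.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(3)

# Toy hierarchical model: q(psi) = N(0, 1), q(z | psi) = N(psi, 0.5^2),
# so q(z) = N(0, 1.25) and q(psi | z) = N(0.8 * z, 0.2) are known in closed form.
s2 = 0.25
z = 1.7

def log_q_joint(psi):                     # log q(z, psi) for the fixed z above
    return norm.logpdf(psi, 0.0, 1.0) + norm.logpdf(z, psi, np.sqrt(s2))

post_mean, post_var = z / (1.0 + s2), s2 / (1.0 + s2)   # exact q(psi | z)

def bounds(K, tau_mean, tau_std, n_mc=5_000):
    upper = np.empty(n_mc); lower = np.empty(n_mc)
    for i in range(n_mc):
        # Upper bound: psi_0 ~ q(psi | z), psi_{1:K} ~ tau(psi | z)
        psi0 = rng.normal(post_mean, np.sqrt(post_var))
        psis = np.concatenate([[psi0], rng.normal(tau_mean, tau_std, size=K)])
        log_w = log_q_joint(psis) - norm.logpdf(psis, tau_mean, tau_std)
        upper[i] = logsumexp(log_w) - np.log(K + 1)
        # Lower bound: psi_{1:K} ~ tau(psi | z) only
        psis = rng.normal(tau_mean, tau_std, size=K)
        log_w = log_q_joint(psis) - norm.logpdf(psis, tau_mean, tau_std)
        lower[i] = logsumexp(log_w) - np.log(K)
    return lower.mean(), upper.mean()

true_log_qz = norm.logpdf(z, 0.0, np.sqrt(1.0 + s2))
print("true          :", true_log_qz)
print("tau = prior   :", bounds(K=10, tau_mean=0.0, tau_std=1.0))
print("tau = q(psi|z):", bounds(K=10, tau_mean=post_mean, tau_std=np.sqrt(post_var)))
# With tau equal to the exact posterior both bounds collapse onto log q(z);
# with an "uninformed" tau (the prior, as in SIVI) the sandwich is looser.
```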

3.1 Upper Bound on KL divergence between hierarchical models

We now apply these bounds to the marginal log-densities appearing in the KL divergence KL(q(z|x) || p(z)) in the case of both q(z|x) = ∫ q(z|x, ψ) q(ψ|x) dψ and p(z) = ∫ p(z|ζ) p(ζ) dζ being (potentially structurally different) hierarchical models. This results in a novel upper bound on the KL divergence with auxiliary variational distributions τ(ψ|x, z) and ν(ζ|z):

KL(q(z|x) || p(z)) ≤ E_{q(ψ_0, z|x)} E_{τ(ψ_{1:K}|x, z)} E_{ν(ζ_{1:L}|z)} [ log (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k|x) / τ(ψ_k|x, z) − log (1/L) Σ_{l=1}^{L} p(z, ζ_l) / ν(ζ_l|z) ]    (4)

Crucially, in (4) we merged the expectations over z ~ q(z|x) and ψ_0 ~ q(ψ|x, z) into one expectation over the joint distribution q(z, ψ_0|x), which admits the more favorable factorization q(ψ_0|x) q(z|x, ψ_0), and samples from the latter are easy to simulate for Monte Carlo estimation.

One can also give lower bounds in the same setting as (4). However, the main focus of this paper is on lower-bounding the marginal log-likelihood, so we leave this discussion to appendix C.
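For concreteness, a worked special case (spelled out here for reference; it follows directly from the marginal log-density upper bound above rather than being a separate result): when the prior p(z) is analytically tractable, as in the experiments below, the second term reduces to log p(z) and the bound becomes

KL(q(z|x) || p(z)) ≤ E_{q(ψ_0, z|x)} E_{τ(ψ_{1:K}|x, z)} [ log (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k|x) / τ(ψ_k|x, z) − log p(z) ]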

3.2 Tractable lower bounds on marginal log-likelihood with hierarchical proposal

The proposed upper bound (4) allows us to lower-bound the otherwise intractable ELBO in the case of hierarchical q(z|x) and/or p(z), leading to the Importance Weighted Hierarchical Variational Inference (IWHVI) lower bound:

log p(x) ≥ E_{q(ψ_0, z|x)} E_{τ(ψ_{1:K}|x, z)} E_{ν(ζ_{1:L}|z)} [ log (1/L) Σ_{l=1}^{L} p(x|z) p(z, ζ_l) / ν(ζ_l|z) − log (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k|x) / τ(ψ_k|x, z) ]    (5)

This bound introduces two additional auxiliary variational distributions, τ(ψ|x, z) and ν(ζ|z), that are learned by maximizing the bound w.r.t. their parameters, tightening the bound. While the optimal distributions are τ(ψ|x, z) = q(ψ|x, z) and ν(ζ|z) = p(ζ|z) (this choice makes the inner bounds equal to the corresponding marginal log-densities), one can see that particular choices of these distributions and of the hyperparameters K and L render previously known methods like DSIVI, SIVI and HVM special cases (see appendix B).

The bound (5) can be seen as a variational generalization of SIVI (2) or as a multisample generalization of HVM (1). It therefore has the capacity to better estimate the true ELBO, reducing the gap and its regularizing effect, which should lead to more expressive variational approximations q(z|x).
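A minimal sketch of a single-z Monte Carlo estimate of this bound in the explicit-prior special case (toy Gaussian densities; the linear-Gaussian τ below stands in for an auxiliary inference network, with coefficients that roughly match the exact posterior over ψ in this conjugate toy; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(4)
x = 1.5

# Generative model: p(z) = N(0, 1), p(x | z) = N(z, 1)        (explicit prior).
# Hierarchical posterior: q(psi | x) = N(0, 1), q(z | x, psi) = N(psi + x/2, 0.5^2).
# Auxiliary reverse model: tau(psi | x, z) = N(a * (z - x/2), b^2).
a, b = 0.8, 0.5

def iwhvi_elbo(K, n_mc=5_000):
    est = np.empty(n_mc)
    for i in range(n_mc):
        psi0 = rng.normal(0.0, 1.0)                      # psi_0 ~ q(psi | x)
        z = rng.normal(psi0 + x / 2.0, 0.5)              # z ~ q(z | x, psi_0)
        tau_mean = a * (z - x / 2.0)
        psis = np.concatenate([[psi0], rng.normal(tau_mean, b, size=K)])
        # Upper bound on log q(z | x):
        log_w = (norm.logpdf(psis, 0.0, 1.0)                  # log q(psi_k | x)
                 + norm.logpdf(z, psis + x / 2.0, 0.5)        # log q(z | x, psi_k)
                 - norm.logpdf(psis, tau_mean, b))            # - log tau(psi_k | x, z)
        log_q_upper = logsumexp(log_w) - np.log(K + 1)
        log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
        est[i] = log_joint - log_q_upper                 # single-z IWHVI estimate
    return est.mean()

print(iwhvi_elbo(K=0), iwhvi_elbo(K=10), iwhvi_elbo(K=100))
# K = 0 recovers the HVM / auxiliary-VAE bound (1); larger K tightens the estimate
# towards the (intractable) ELBO of the hierarchical q.
```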

4 Multisample Extensions

Multisample bounds similar to the proposed one have already been studied extensively. In this section, we relate our results to such prior work.

4.1 Multisample Bound and Complexity

In this section we generalize the bound (5) further in a way similar to the IWAE multisample bound (Burda et al., 2015), leading to the Doubly Importance Weighted Hierarchical Variational Inference (DIWHVI) bound (Theorem A.4):

log p(x) ≥ E [ log (1/M) Σ_{m=1}^{M} ( (1/L) Σ_{l=1}^{L} p(x|z_m) p(z_m, ζ_{lm}) / ν(ζ_{lm}|z_m) ) / ( (1/(K+1)) Σ_{k=0}^{K} q(z_m, ψ_{km}|x) / τ(ψ_{km}|x, z_m) ) ]    (6)

Where the expectation is taken over the same generative process as in eq. 5, independently repeated M times:

  1. Sample ψ_{0m} ~ q(ψ|x) for m = 1, ..., M

  2. Sample z_m ~ q(z|x, ψ_{0m}) for m = 1, ..., M

  3. Sample ψ_{km} ~ τ(ψ|x, z_m) for m = 1, ..., M and k = 1, ..., K

  4. Sample ζ_{lm} ~ ν(ζ|z_m) for m = 1, ..., M and l = 1, ..., L

The price of the tighter bound (6) is quadratic sample complexity: it requires M(K + 1) samples of ψ and M·L samples of ζ. Unfortunately, the DIWHVI cannot benefit from the sample-reuse trick of SIVI that leads to the bound (3). The reason is that the bound (6) requires all terms in the m-th denominator (the estimate of log q(z_m|x)) to use the same distribution τ(ψ|x, z_m), whereas by its very nature this distribution should be very different for different z_m. A viable option, though, is to consider a multisample-conditioned τ(ψ|x, z_{1:M}) that is invariant to permutations of z_{1:M}. We leave a more detailed investigation to future work.
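A minimal sketch of the DIWHVI estimator in the explicit-prior case (so step 4 and the ν terms drop out), continuing the toy Gaussian setup from the previous sketch; illustrative only:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(5)
x = 1.5
a, b = 0.8, 0.5      # illustrative tau(psi | x, z) = N(a * (z - x/2), b^2)

def diwhvi_bound(M, K, n_mc=2_000):
    est = np.empty(n_mc)
    for i in range(n_mc):
        psi0 = rng.normal(0.0, 1.0, size=M)              # step 1: psi_{0m} ~ q(psi | x)
        z = rng.normal(psi0 + x / 2.0, 0.5)              # step 2: z_m ~ q(z | x, psi_{0m})
        tau_mean = a * (z - x / 2.0)
        psis = np.concatenate([psi0[:, None],
                               rng.normal(tau_mean[:, None], b, size=(M, K))], axis=1)
        # step 3 done; per-m upper bound on log q(z_m | x):
        log_w = (norm.logpdf(psis, 0.0, 1.0)
                 + norm.logpdf(z[:, None], psis + x / 2.0, 0.5)
                 - norm.logpdf(psis, tau_mean[:, None], b))
        log_q_upper = logsumexp(log_w, axis=1) - np.log(K + 1)       # shape (M,)
        log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
        est[i] = logsumexp(log_joint - log_q_upper) - np.log(M)      # outer IWAE average
    return est.mean()

print(diwhvi_bound(M=1, K=10))    # reduces to the (explicit-prior) IWHVI bound
print(diwhvi_bound(M=10, K=10))   # tighter, at the price of M * (K + 1) psi samples
```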

Runtime-wise, compared to the multisample SIVI (3), the DIWHVI requires M additional passes through the auxiliary inference network to generate the distributions τ(ψ|x, z_m). However, since SIVI requires a much larger number of ψ samples to reach the same level of accuracy (see section 6.1), all of which are then passed through a network to generate the distributions q(z|x, ψ_k), the extra computation is likely to either bear a minor overhead or be completely justified by the reduced K. This is particularly true in the IWHVI case (M = 1), where IWHVI's single extra pass that generates τ(ψ|x, z) is dominated by the K passes that generate the q(z|x, ψ_k).

4.2 Signal to Noise Ratio

Rainforth et al. (2018) have shown that multisample bounds (Burda et al., 2015; Nowozin, 2018) behave poorly during the training phase, yielding noisier gradient estimates for the inference network, which manifests itself in a decreasing Signal-to-Noise Ratio (SNR) as the number of samples increases. This raises a natural concern whether the same happens in the proposed model as K increases. Tucker et al. (2019) have shown that, upon careful examination, a REINFORCE-like (Williams, 1992) term can be seen in the gradient estimate, and REINFORCE is known for its typically high variance (Rezende et al., 2014). The authors further suggest applying the reparametrization trick (Kingma and Welling, 2013) a second time to obtain a reparametrization-based gradient estimate, which is then shown to solve the decreasing-SNR problem. The same reasoning can be applied to our bound; we provide further details and experiments in appendix D, developing an IWHVI-DReG gradient estimator. We conclude that the problem of decreasing SNR exists in our bound as well and is mitigated by the proposed gradient estimator.
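To see schematically where the REINFORCE-like term comes from, consider a generic objective of the form L(φ) = E_{τ_φ(ψ)} [ f_φ(ψ) ] (a simplified sketch, not the exact expression analyzed by Tucker et al. (2019)). Differentiating through the sampling distribution gives

∇_φ L(φ) = E_{τ_φ(ψ)} [ ∇_φ f_φ(ψ) + f_φ(ψ) ∇_φ log τ_φ(ψ) ]

where the second, score-function term is the REINFORCE-like component that typically has high variance. Reparametrizing ψ = g_φ(ε) with ε ~ p(ε) instead yields

∇_φ L(φ) = E_{p(ε)} [ ∇_φ f_φ(g_φ(ε)) ]

which removes the explicit score-function term; applying this idea a second time to the importance-weighted objective is the essence of DReG-style estimators.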

4.3 Debiasing the bound

Nowozin (2018) has shown that the standard IWAE can be seen as a biased estimate of the marginal log-likelihood with a bias of order 1/M. They then suggested using the Generalized Jackknife of m-th order to reuse these samples and come up with an estimator with a smaller bias of order O(M^{-(m+1)}), at the cost of higher variance and losing the lower-bound guarantees. Again, the same idea can be applied to our estimate; we leave further details to appendix E. We conclude that this way one can obtain better estimates of the marginal log-density; however, since there is no guarantee that the obtained estimator gives an upper or a lower bound, we chose not to use it in the experiments.
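For illustration, a minimal sketch of the simplest (first-order, delete-one) jackknife correction applied to the toy IWAE estimate from the Background sketch; this is a simplified stand-in for the higher-order corrections of Nowozin (2018), and the resulting estimate is indeed no longer a guaranteed lower bound:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(6)
x = 1.5

def log_w_samples(M):
    # Same toy setup as before: p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(x/2, 0.8).
    z = rng.normal(x / 2.0, np.sqrt(0.8), size=M)
    return (norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
            - norm.logpdf(z, x / 2.0, np.sqrt(0.8)))

def iwae_hat(log_w):
    return logsumexp(log_w) - np.log(len(log_w))

def jackknife_iwae(M, n_mc=5_000):
    plain = np.empty(n_mc); debiased = np.empty(n_mc)
    for i in range(n_mc):
        log_w = log_w_samples(M)
        full = iwae_hat(log_w)
        loo = np.array([iwae_hat(np.delete(log_w, j)) for j in range(M)])
        plain[i] = full
        debiased[i] = M * full - (M - 1) * loo.mean()   # first-order jackknife
    return plain.mean(), debiased.mean()

print("true log p(x):", norm.logpdf(x, 0.0, np.sqrt(2.0)))
print("M = 8 (plain, debiased):", jackknife_iwae(8))
# The debiased estimate is closer to log p(x) on average but can overshoot it.
```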

5 Related Work

More expressive variational distributions have been under active investigation for a while. While we have focused our attention on approaches employing hierarchical models via bounds, there are many other approaches, roughly falling into two broad classes.

One possible approach is to augment some standard distribution with the help of copulas (Tran et al., 2015), mixtures (Guo et al., 2016), or invertible transformations with tractable Jacobians, also known as normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016; Papamakarios et al., 2017), all while preserving the tractability of the density. Kingma and Dhariwal (2018) have demonstrated that flow-based models are able to approximate complex high-dimensional distributions of real images, but the requirement of invertibility might lead to inefficient parameter usage and does not allow for abstraction, as one needs to preserve dimensionality.

An alternative direction is to embrace implicit distributions that one can only sample from, and to overcome the need for a tractable density using bounds or estimates (Huszár, 2017). Methods based on estimates (Mescheder et al., 2017; Shi et al., 2017), for example via the Density Ratio Estimation trick (Goodfellow et al., 2014; Uehara et al., 2016; Mohamed and Lakshminarayanan, 2016), typically estimate the densities indirectly, utilizing an auxiliary critic, and hide the dependency on the variational parameters, hence biasing the optimization procedure. Titsias and Ruiz (2018) have shown that in gradient-based ELBO optimization, in the case of a hierarchical model with tractable q(ψ|x) and q(z|x, ψ), one does not need the marginal log-density q(z|x) per se, only its gradient, which can be estimated using MCMC. A major disadvantage of these methods is that they either lose the bound guarantees or make the bound's evaluation intractable, and thus they cannot be combined with multisample bounds during the evaluation phase.

The core contribution of the paper is a novel upper bound on the marginal log-density. Previously, Dieng et al. (2017); Kuleshov and Ermon (2017) suggested using the χ-divergence to give a variational upper bound on the marginal log-likelihood. However, their bound was not an expectation of a random variable, but a logarithm of an expectation, preventing unbiased stochastic optimization. Jebara and Pentland (2001) reverse Jensen's inequality to give a variational upper bound in the case of mixtures of exponential-family distributions by extensive use of the problem's structure. Related to our core idea of jointly sampling z and ψ_0 in (4) is an observation of Grosse et al. (2015) that Annealed Importance Sampling (AIS, Neal (2001)) run backward from the auxiliary variable sample ψ_0 gives an unbiased estimate of the reciprocal marginal density 1/q(z|x), and thus can also be used to upper-bound the marginal log-density. However, AIS-based estimation is too computationally expensive to be used during training.

6 Experiments

We focus only on cases with an explicit prior p(z) to simplify comparison to prior work. Hierarchical priors correspond to nested variational inference, to which most of the variational inference results readily apply (Atanov et al., 2019).

[Figure 0: (a) Negative-entropy bounds for a 50-dimensional Laplace distribution; the shaded area denotes the 90% confidence interval computed over 50 independent runs for each K. (b) Final marginal log-likelihood estimates and the expected KL divergence between τ(ψ|x, z) and the prior q(ψ|x) for IWHVI-based VAEs trained with different K; each model was trained and plotted 6 times.]

6.1 Toy Experiment

As a toy experiment we consider a 50-dimensional factorized standard Laplace distribution represented as a hierarchical scale-mixture model: each coordinate z_d is a zero-mean Gaussian whose variance ψ_d is mixed over an exponential distribution, so that marginally z_d ~ Laplace(0, 1).

We do not make use of the factorized structure of the joint distribution, in order to explore the bound's behavior in high dimensions. We use the proposed bound from Theorem A.1 and compare it to SIVI (Yin and Zhou, 2018) on the task of upper-bounding the negative differential entropy E_{q(z)} [ log q(z) ]. For IWHVI we take τ(ψ|z) to be a Gamma distribution whose concentration and rate are generated from z by a neural network with three 500-dimensional hidden layers. We use the freedom to design the architecture to initialize the network at the prior: namely, we add a sigmoid "gate" output with a large initial negative bias and use the gate to combine the prior concentration and rate with those generated by the network. This way we are guaranteed to perform no worse than SIVI even with a randomly initialized τ(ψ|z). Figure 0(a) shows the value of the bound for different numbers of optimization steps over τ's parameters, minimizing the bound. The whole process (including random initialization of the neural networks) was repeated 50 times to compute empirical 90% confidence intervals. As the results clearly indicate, the proposed bound can be made much tighter, more than halving the gap to the true negative entropy.
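For illustration, a one-dimensional version of this kind of scale-mixture construction together with the SIVI-style bound (τ set to the prior); the exact parameterization below (exponential mixing over the variance) is one standard way to obtain a Laplace distribution and is an assumption rather than a description of the paper's exact setup:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(7)

# Standard Laplace as a Gaussian scale mixture: psi ~ Exp(rate = 1/2) (a variance),
# z | psi ~ N(0, psi)  =>  marginally z ~ Laplace(0, 1).
def upper_bound_neg_entropy(K, n_mc=20_000):
    # SIVI-style upper bound on E_q[log q(z)], i.e. tau(psi | z) set to the prior q(psi).
    psi0 = rng.exponential(scale=2.0, size=n_mc)          # scale = 1 / rate
    z = rng.normal(0.0, np.sqrt(psi0))
    psis = np.concatenate([psi0[:, None],
                           rng.exponential(scale=2.0, size=(n_mc, K))], axis=1)
    # With tau equal to the prior, the importance ratios q(psi_k) / tau(psi_k) cancel,
    # leaving the log of the averaged conditional densities q(z | psi_k):
    log_q_cond = norm.logpdf(z[:, None], 0.0, np.sqrt(psis))
    return np.mean(logsumexp(log_q_cond, axis=1) - np.log(K + 1))

print("true E[log q(z)] :", -(1.0 + np.log(2.0)))   # negative entropy of Laplace(0, 1)
for K in (1, 10, 100):
    print(f"K = {K:3d} upper bound:", upper_bound_neg_entropy(K))
# The bound decreases towards the true value as K grows; a learned tau(psi | z)
# (IWHVI) tightens it much faster, as reported in Figure 0(a).
```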

6.2 Variational Autoencoder

[Figure 1: Left: Test log-likelihood on dynamically binarized MNIST and OMNIGLOT for VAE, VAE + IAF, VAE + RealNVP, HVM, SIVI, IWHVI, and AVB + AC (the latter from Mescheder et al., 2017). Right: Comparison of multisample DIWHVI and SIVI-IW on a trained MNIST VAE from section 6.2; the shaded area denotes the standard-deviation interval, computed over 10 independent runs for each value of K.]

We further test our method on the task of generative modeling, applying it to VAE (Kingma and Welling, 2013), which is a standard benchmark for inference methods. Ideally, better inference should allow one to learn more expressive generative models. We report results on two datasets: MNIST (LeCun et al., 1998) and OMNIGLOT (Lake et al., 2015). For MNIST we follow the setup by Mescheder et al. (2017), and for OMNIGLOT we follow the standard setup (Burda et al., 2015). For experiment details see appendix F.

During training we used the proposed bound (5) with an analytically tractable prior and an increasing number of auxiliary samples K: K was increased after the first 250 epochs, again after the next 250 epochs, and once more after the next 500 epochs, and was kept fixed from then on.

To estimate the marginal log-likelihood for hierarchical models (IWHVI, SIVI, HVM) we use the DIWHVI lower bound (6) (for a justification of DIWHVI as an evaluation metric see section 6.3). Results are shown in fig. 1. To evaluate SIVI using the DIWHVI bound we fit τ(ψ|x, z) to the trained model by making 7000 epochs on the train set, keeping the parameters of q and p fixed. We observed improved performance compared to the special cases of HVM and SIVI, and the method showed results comparable to those of prior works.

For HVM on MNIST we observed that its τ(ψ|x, z) essentially collapsed to the prior q(ψ|x), with the expected KL divergence between the two extremely close to zero. This indicates the "posterior collapse" (Kim et al., 2018; Chen et al., 2016) problem, where the inference network chose to ignore the extra input ψ and effectively degenerated into a vanilla VAE. At the same time, IWHVI does not suffer from this problem due to a non-zero KL divergence, achieving an average of approximately 6.2 nats, see section 6.3. For OMNIGLOT, HVM did learn a useful τ(ψ|x, z) and achieved a non-trivial average KL divergence; however, IWHVI did much better and achieved a considerably larger one.

To investigate K's influence on the training process, we trained VAEs on MNIST for 3000 epochs for different values of K, evaluated the DIWHVI bound, and plotted the results in fig. 0(b). One can see that higher values of K lead to better final models in terms of the marginal log-likelihood, as well as to auxiliary inference networks τ(ψ|x, z) that are more informative than the prior q(ψ|x). Using the IWHVI-DReG gradient estimator (see appendix D) increased the KL divergence, but resulted in only a modest increase in the marginal log-likelihood.

6.3 DIWHVI as Evaluation Metric

One of the established approaches to evaluating the intractable marginal log-likelihood in latent variable models is to compute the multisample IWAE bound with a large number of samples M, since it is shown to converge to the marginal log-likelihood as M goes to infinity. Since both IWHVI and SIVI allow tightening the bound by taking more samples ψ_k, we compare the methods along this direction.

Both DIWHVI and SIVI (being a special case of the former) can be shown to converge to the marginal log-likelihood as both M and K go to infinity; however, the rates might differ. We empirically compare the two by evaluating the MNIST-trained IWHVAE model from section 6.2 for several different values of M and K. We use the proposed DIWHVI bound (6) and compare it with several SIVI modifications. We call SIVI-like the bound (6) with τ(ψ|x, z) set to the prior q(ψ|x), but without sample reuse, thus using M(K + 1) independent samples of ψ. SIVI Equicomp stands for the sample-reusing bound (3), which uses only K + M samples and the same ψ_{1:K} for every z_m. SIVI Equisample is a fair comparison in terms of the number of samples: we take M(K + 1) samples of ψ and reuse all of them for every z_m. This way we use the same number of ψ samples as DIWHVI does, but perform roughly M times more log-density evaluations to estimate the mixture densities, which is why we only examine this setting in a single case.

Results shown in fig. 1 indicate superior performance of the DIWHVI bound. Surprisingly, the SIVI-like and SIVI Equicomp estimates nearly coincide, with no significant difference in variance; thus we conclude that sample reuse does not hurt SIVI. Still, there is a considerable gap to the IWHVI bound, which uses an amount of computation and samples similar to SIVI-like. In the fairer comparison to the Equisample SIVI bound, the gap is significantly reduced, yet IWHVI is still a superior bound, especially in terms of computational efficiency, as it involves no quadratic number of density evaluations.

Comparing IWHVI and SIVI-like, we see that the former converges after a few dozen samples, while SIVI is rapidly improving, yet lags almost 1 nat behind at 100 samples, and even 0.5 nats behind the HVM bound (IWHVI for K = 0). One explanation for the observed behaviour is a large KL divergence between the true inverse model q(ψ|x, z) and the prior q(ψ|x), which was estimated (on a test set) to be at least several nats (the difference between a multisample IWAE-style bound and the ELBO gives a lower bound on this divergence), causing many samples from the prior q(ψ|x) to produce poor likelihoods q(z|x, ψ) for a given z due to the large difference with the true inverse model. This is consistent with the motivation laid out in section 2.1: a better approximate inverse model leads to more efficient sample usage. At the same time, the corresponding divergence for the learned τ(ψ|x, z) was estimated to be considerably smaller, proving that one can indeed do much better by learning τ(ψ|x, z) instead of using the prior q(ψ|x).

7 Conclusion

We presented a variational upper bound on the marginal log-density, which allowed us to formulate sandwich bounds on log q(z|x) for the case of a hierarchical model q(z|x), in addition to prior works that only provided multisample variational upper bounds for the more restricted case of semi-implicit models. We applied it to lower-bound the intractable ELBO with a tractable one for the case of a hierarchical (latent variable model) approximate posterior q(z|x). We experimentally validated the bound and showed that it alleviates the regularizing effect to a greater extent than prior works do, allowing for more expressive approximate posteriors, which does translate into better inference. We then combined our bound with the multisample IWAE bound, which led to a tighter lower bound on the marginal log-likelihood. We therefore believe the proposed variational inference method will be useful for many variational models.

Acknowledgements

The authors would like to thank Aibek Alanov, Dmitry Molchanov and Oleg Ivanov for valuable discussions and feedback.

References

Appendix A Proofs

Theorem A.1 (Marginal log-density upper bound).

For any K ≥ 0, any q(z, ψ), and any τ(ψ|z) such that for any ψ s.t. q(z, ψ) > 0 we have τ(ψ|z) > 0, consider

U_K := E_{ψ_0 ~ q(ψ|z)} E_{ψ_{1:K} ~ τ(ψ|z)} [ log (1/(K+1)) Σ_{k=0}^{K} q(z, ψ_k) / τ(ψ_k|z) ]

where we write τ(ψ_{1:K}|z) for the product Π_{k=1}^{K} τ(ψ_k|z) for brevity. Then the following holds:

  1. log q(z) ≤ U_K for any K ≥ 0.

  2. U_{K+1} ≤ U_K.

  3. U_K → log q(z) as K → ∞.

Proof.
  1. Consider the gap between the proposed bound and the marginal log-density:

    Gap

    Where the last line holds due to the distribution defined in Lemma A.2 being a normalized density function:

  2. Now we will prove the second claim.

    Where we used the fact that the density defined in Lemma A.3 is normalized.

  3. For the last claim we follow (Burda et al., 2015). Consider

    Due to the Law of Large Numbers we have

    Thus

Lemma A.2 (the importance-resampled distribution, following Domke and Sheldon (2018)).

Given a proposal τ(ψ|z) and K ≥ 0, consider the following generative process:

  • Sample K + 1 i.i.d. samples ψ_0, ..., ψ_K from τ(ψ|z)

  • For each sample ψ_k compute its weight w_k = q(z, ψ_k) / τ(ψ_k|z)

  • Sample an index j with probability proportional to w_j

  • Put the j-th sample first, and then the rest: (ψ_j, ψ_0, ..., ψ_{j-1}, ψ_{j+1}, ..., ψ_K)

Then the marginal density of the resulting tuple is

Proof.

The joint density for the generative process described above is

One can see that this is indeed a normalized density

The marginal density then is

Where on the second line we used the fact that the integrand is symmetric under the choice of the index j.
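For illustration, a minimal sketch of the generative process above (the toy target and proposal are made-up stand-ins, not the paper's): the resampling step preferentially places high-weight proposal samples first.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Toy unnormalized target q(z, psi) (z fixed) and proposal tau(psi | z); both illustrative.
z = 1.7
def log_target(psi):                      # log q(z, psi) = log q(psi) + log q(z | psi)
    return norm.logpdf(psi, 0.0, 1.0) + norm.logpdf(z, psi, 0.5)

def log_proposal(psi):                    # tau(psi | z) = N(0, 1.5^2)
    return norm.logpdf(psi, 0.0, 1.5)

def sample_resampled_tuple(K):
    psi = rng.normal(0.0, 1.5, size=K + 1)            # K + 1 i.i.d. proposal samples
    w = np.exp(log_target(psi) - log_proposal(psi))   # importance weights
    j = rng.choice(K + 1, p=w / w.sum())              # index drawn proportionally to w
    order = [j] + [k for k in range(K + 1) if k != j]
    return psi[order]                                 # chosen sample first, rest after

first = np.array([sample_resampled_tuple(K=10)[0] for _ in range(20_000)])
# The first element approximates the posterior over psi, here N(0.8 * z, 0.2),
# up to finite-K bias; the density of the whole reordered tuple is what the lemma computes.
print(first.mean(), first.var())
```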

Lemma A.3.

Let

Then the above is a normalized density.

Proof.

It is non-negative due to all the terms being non-negative. Now we will show that it integrates to 1:

Theorem A.4 (DIWHVI Evidence Lower Bound).
log p(x) ≥ E [ log (1/M) Σ_{m=1}^{M} ( (1/L) Σ_{l=1}^{L} p(x|z_m) p(z_m, ζ_{lm}) / ν(ζ_{lm}|z_m) ) / ( (1/(K+1)) Σ_{k=0}^{K} q(z_m, ψ_{km}|x) / τ(ψ_{km}|x, z_m) ) ]    (7)

Where the expectation is taken over the following generative process:

  1. Sample ψ_{0m} ~ q(ψ|x) for m = 1, ..., M

  2. Sample z_m ~ q(z|x, ψ_{0m}) for m = 1, ..., M

  3. Sample ψ_{km} ~ τ(ψ|x, z_m) for m = 1, ..., M and k = 1, ..., K

  4. Sample ζ_{lm} ~ ν(ζ|z_m) for m = 1, ..., M and l = 1, ..., L

Proof.

Consider a random variable

We will show that it is an unbiased estimate of p(x) and then invoke Jensen's inequality: