Bayesian Inference is an important statistical tool. However, exact inference is possible only in a small class of conjugate problems, and for many practically interesting cases one has to resort to Approximate Inference techniques. Variational Inference (Hinton and van Camp, 1993; Waterhouse et al., 1996; Wainwright et al., 2008), being one of them, is an efficient and scalable approach that has gained a lot of interest in recent years due to advances in Neural Networks.
However, the efficiency and accuracy of Variational Inference heavily depend on how close the approximate posterior is to the true posterior. As a result, Neural Networks’ universal approximation abilities and great empirical success have spurred a lot of interest in employing them as powerful sample generators (Nowozin et al., 2016; Goodfellow et al., 2014; MacKay, 1995) that are trained to output samples from the approximate posterior when fed some standard noise as input. Unfortunately, a significant obstacle in this direction is the need for a tractable density $q(z \mid x)$, which in general requires intractable integration. A theoretically sound approach then is to give tight lower bounds on the intractable term, the differential entropy of $q(z \mid x)$, which is easy to recover from upper bounds on the marginal log-density. One such bound was introduced by Agakov and Barber (2004); however, its tightness depends on the auxiliary variational distribution. Yin and Zhou (2018) suggested a multisample loss, whose tightness is controlled by the number of samples.
In this paper we consider hierarchical variational models (Ranganath et al., 2016; Salimans et al., 2015; Agakov and Barber, 2004) where the approximate posterior $q(z \mid x)$ is represented as a mixture of tractable distributions $q(z \mid \psi, x)$ over an arbitrarily complicated mixing distribution $q(\psi \mid x)$: $q(z \mid x) = \int q(z \mid \psi, x)\, q(\psi \mid x)\, d\psi$. We show that such variational models contain semi-implicit models, first studied by Yin and Zhou (2018). To overcome the need for the closed-form marginal density $q(z \mid x)$ we then propose a novel family of tighter bounds on the marginal log-likelihood $\log p(x)$, which can be shown to generalize many previously known bounds: Hierarchical Variational Models (Ranganath et al., 2016), also known as the auxiliary VAE bound (Maaløe et al., 2016), Semi-Implicit Variational Inference (Yin and Zhou, 2018) and Doubly Semi-Implicit Variational Inference (Molchanov et al., 2018). At the core of our work lies a novel variational upper bound on the marginal log-likelihood, which we combine with the previously known lower bound (Burda et al., 2015) to give a novel upper bound on the Kullback-Leibler (KL) divergence between hierarchical models, and apply it to the evidence lower bound (ELBO) to enable hierarchical approximate posteriors and/or priors. Finally, our method can be combined with the multisample bound of Burda et al. (2015) to tighten the marginal log-likelihood estimate further.
Having a hierarchical model $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$ for observable objects $x$, we are interested in two tasks: inference and learning. The problem of Bayesian inference is that of finding the true posterior distribution $p_\theta(z \mid x)$, which is often intractable and thus is approximated by some $q_\phi(z \mid x)$. The problem of learning is that of finding parameters $\theta$ such that the marginal model distribution $p_\theta(x)$ approximates the true data-generating process of $x$ as well as possible, typically in terms of KL divergence, which leads to the Maximum Likelihood Estimation problem.
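Variational Inference provides a way to solve both tasks simultaneously by lower-bounding the intractable marginal log-likelihood $\log p_\theta(x)$ with the Evidence Lower Bound (ELBO) using the posterior approximation $q_\phi(z \mid x)$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)} \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}$$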
The bound requires analytically tractable densities for both $q_\phi(z \mid x)$ and $p_\theta(x, z)$. The gap between the marginal log-likelihood and the bound is equal to $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$, which acts as a regularization preventing the true posterior $p_\theta(z \mid x)$ from deviating too far away from the approximate one $q_\phi(z \mid x)$, thus limiting the expressivity of the marginal distribution $p_\theta(x)$. Burda et al. (2015) proposed a family of tighter multisample bounds, generalizing the ELBO. We call it the IWAE bound:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z_{1:M} \mid x)} \log \frac{1}{M} \sum_{m=1}^{M} \frac{p_\theta(x, z_m)}{q_\phi(z_m \mid x)}$$
where from now on we write $q_\phi(z_{1:M} \mid x) = \prod_{m=1}^{M} q_\phi(z_m \mid x)$ for brevity. This bound has been shown (Domke and Sheldon, 2018) to be a tractable lower bound on the ELBO for a variational distribution that has been obtained by a certain scoring procedure. However, the price of this increased tightness is higher computational complexity, especially in terms of the number of evaluations of the joint distribution $p_\theta(x, z)$, and thus we might want to come up with a more expressive posterior approximation to be used in the ELBO, which is the $M = 1$ special case of the IWAE bound.
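As a minimal sketch of how the multisample bound tightens the ELBO, the following compares the two on a toy conjugate model where $\log p(x)$ is known in closed form. The model (standard normal prior, unit-variance Gaussian likelihood), the deliberately crude proposal $q(z \mid x) = \mathcal{N}(0, 1)$ and all function names are assumptions made for illustration; none of them come from the experiments in this paper.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def iwae_bound(x, num_samples, num_repeats, rng):
    """Monte Carlo estimate of the M-sample IWAE bound (M = num_samples).

    Toy model (an assumption): p(z) = N(0, 1), p(x | z) = N(z, 1),
    proposal q(z | x) = N(0, 1).
    """
    total = 0.0
    for _ in range(num_repeats):
        log_w = []
        for _ in range(num_samples):
            z = rng.gauss(0.0, 1.0)                      # z ~ q(z | x)
            log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
            log_q = log_normal(z, 0.0, 1.0)
            log_w.append(log_joint - log_q)              # log importance weight
        total += log_mean_exp(log_w)
    return total / num_repeats

rng = random.Random(0)
x_obs = 1.5
true_log_px = log_normal(x_obs, 0.0, 2.0)   # the marginal of this toy model is N(0, 2)
elbo = iwae_bound(x_obs, 1, 20000, rng)     # M = 1 recovers the ELBO
iwae_16 = iwae_bound(x_obs, 16, 5000, rng)  # M = 16
print(f"ELBO {elbo:.3f}  IWAE(16) {iwae_16:.3f}  log p(x) {true_log_px:.3f}")
```

With a poorly matched proposal the single-sample ELBO sits well below $\log p(x)$, while the 16-sample estimate closes most of the gap at the cost of 16 joint-density evaluations per draw.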
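In the direction of improving the single-sample ELBO it was proposed (Agakov and Barber, 2004; Salimans et al., 2015; Maaløe et al., 2016; Ranganath et al., 2016) to use a hierarchical variational model (HVM) for $q_\phi(z \mid x) = \int q_\phi(z \mid \psi, x)\, q_\phi(\psi \mid x)\, d\psi$ with explicit joint density $q_\phi(z, \psi \mid x)$, where $\psi$ are auxiliary random variables. To overcome the intractability of the density $q_\phi(z \mid x)$, a variational lower bound on the ELBO is proposed. The tightness of the bound is controlled by the auxiliary variational distribution $\tau(\psi \mid z, x)$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z, \psi \mid x)} \left[ \log p_\theta(x, z) - \log \frac{q_\phi(z, \psi \mid x)}{\tau(\psi \mid z, x)} \right]$$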
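Recently Yin and Zhou (2018) introduced semi-implicit models: hierarchical models with implicit but reparametrizable $q_\phi(\psi \mid x)$ and explicit $q_\phi(z \mid \psi, x)$, and suggested the following surrogate objective, which was later shown to be a lower bound (the SIVI bound) for all finite $K$ by Molchanov et al. (2018):
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z, \psi_0 \mid x)} \mathbb{E}_{q_\phi(\psi_{1:K} \mid x)} \log \frac{p_\theta(x, z)}{\frac{1}{K+1} \sum_{k=0}^{K} q_\phi(z \mid \psi_k, x)} \tag{2}$$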
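SIVI can also be generalized into a multisample bound similarly to the IWAE bound (Burda et al., 2015) in an efficient way by reusing the samples $\psi_{1:K}$ for different $z_m$:
$$\log p_\theta(x) \ge \mathbb{E} \log \frac{1}{M} \sum_{m=1}^{M} \frac{p_\theta(x, z_m)}{\frac{1}{K+1} \sum_{k=0}^{K} q_\phi(z_m \mid \psi_{m,k}, x)} \tag{3}$$
where the expectation is taken over $q_\phi(\psi_{1:M,0}, z_{1:M} \mid x)$ and $\psi_{m,1:K} = \psi_{1:K} \sim q_\phi(\psi_{1:K} \mid x)$ is the same set of $K$ i.i.d. random variables for all $m$. (One could also include all $\psi_{1:M,0}$ into the set of reused samples $\psi$, expanding its size to $M + K$.) Importantly, this estimator has $O(M + K)$ sampling complexity for $\psi$, unlike the naive approach, which leads to $O(MK + M)$ sampling complexity. We will get back to this discussion in section 4.1.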
2.1 SIVI Insights
Here we outline SIVI’s weak points and identify certain traits that make it possible to generalize the method and bridge it with prior work.
First, note that both SIVI bounds (2) and (3) use samples from $q(\psi_{1:K} \mid x)$ to describe $z$, and in high dimensions one might expect that such "uninformed" samples would miss most of the time, resulting in near-zero likelihood $q(z \mid \psi_k, x)$ and thus reducing the effective sample size. Therefore it is expected that in higher dimensions it would take many samples to accurately cover the regions of high probability of $q(\psi \mid z, x)$ for the given $z$. Instead, ideally, we would like to target such regions directly while keeping the lower bound guarantees.
Another important observation that we’ll make use of is that many such semi-implicit models can be equivalently reformulated as a mixture of two explicit distributions: due to the reparametrizability of $q(\psi \mid x)$ we have $\psi = g(\varepsilon \mid x)$ for some $\varepsilon \sim q(\varepsilon)$ with tractable density. We can then consider an equivalent hierarchical model that first samples $\varepsilon$ from some simple distribution, transforms this sample into $\psi$ and then generates samples from $q(z \mid \psi, x)$. Thus from now on we’ll assume both $q(z \mid \psi, x)$ and $q(\psi \mid x)$ have tractable density, yet $q(z \mid \psi, x)$ can depend on $\psi$ in an arbitrarily complex way, making analytic marginalization intractable.
3 Importance Weighted Hierarchical Variational Inference
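Having intractable $\log q(z \mid x)$ as a source of our problems, we seek a tractable and efficient upper bound, which is provided by the following theorem:
Theorem (Marginal log density upper bound).
For any $q(z, \psi \mid x)$, $K \in \mathbb{N}_0$ and $\tau(\psi \mid z, x)$ (under some regularity conditions) consider the following
$$\mathcal{U}_K = \mathbb{E}_{q(\psi_0 \mid z, x)} \mathbb{E}_{\tau(\psi_{1:K} \mid z, x)} \log \left( \frac{1}{K+1} \sum_{k=0}^{K} \frac{q(z, \psi_k \mid x)}{\tau(\psi_k \mid z, x)} \right)$$
Then the following holds:
1. $\mathcal{U}_K \ge \log q(z \mid x)$
2. $\mathcal{U}_K \ge \mathcal{U}_{K+1}$
3. $\lim_{K \to \infty} \mathcal{U}_K = \log q(z \mid x)$
Proof. See Theorem A.1 in the Appendix. ∎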
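The proposed upper bound provides a variational alternative to MCMC-based upper bounds (Grosse et al., 2015) and complements the standard Importance Weighted stochastic lower bound of Burda et al. (2015) on the marginal log density:
$$\log q(z \mid x) \ge \mathbb{E}_{\tau(\psi_{1:K} \mid z, x)} \log \left( \frac{1}{K} \sum_{k=1}^{K} \frac{q(z, \psi_k \mid x)}{\tau(\psi_k \mid z, x)} \right)$$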
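The pair of bounds can be checked numerically on a model whose marginal is known. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the toy hierarchy $\psi \sim \mathcal{N}(0, 1)$, $z \mid \psi \sim \mathcal{N}(\psi, 1)$ (so $q(z) = \mathcal{N}(0, 2)$), takes the upper bound to be the $(K+1)$-term average of $q(z, \psi_k) / \tau(\psi_k \mid z)$ with $\psi_0$ drawn jointly with $z$, and takes the lower bound to be the $K$-term average over samples from $\tau$ only. Both are averaged over $z \sim q(z)$, so they sandwich the negative entropy of $q(z)$. The helpers `prior_tau` and `exact_tau` are hypothetical names.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def log_ratio(z, psi, tau_mean, tau_var):
    """log q(z, psi) - log tau(psi | z) for the assumed toy model."""
    log_joint = log_normal(psi, 0.0, 1.0) + log_normal(z, psi, 1.0)
    return log_joint - log_normal(psi, tau_mean, tau_var)

def sandwich(tau, num_inner, num_outer, rng):
    """Return (upper, lower) estimates of E_{q(z)} log q(z)."""
    upper_sum = lower_sum = 0.0
    for _ in range(num_outer):
        psi0 = rng.gauss(0.0, 1.0)
        z = rng.gauss(psi0, 1.0)                  # (z, psi0) ~ q(z, psi)
        tau_mean, tau_var = tau(z)
        from_tau = [
            log_ratio(z, rng.gauss(tau_mean, math.sqrt(tau_var)), tau_mean, tau_var)
            for _ in range(num_inner)             # psi_1..psi_K ~ tau(psi | z)
        ]
        joint_term = log_ratio(z, psi0, tau_mean, tau_var)
        upper_sum += log_mean_exp([joint_term] + from_tau)   # K + 1 terms
        lower_sum += log_mean_exp(from_tau)                  # K terms
    return upper_sum / num_outer, lower_sum / num_outer

def prior_tau(z):
    return 0.0, 1.0            # tau(psi | z) = q(psi): uninformed samples

def exact_tau(z):
    return z / 2.0, 0.5        # tau(psi | z) = q(psi | z): optimal choice

rng = random.Random(0)
neg_entropy = -0.5 * (1.0 + math.log(4.0 * math.pi))   # E log q(z) for N(0, 2)
upper_prior, lower_prior = sandwich(prior_tau, 5, 20000, rng)
upper_exact, lower_exact = sandwich(exact_tau, 5, 20000, rng)
print(f"prior tau: {lower_prior:.3f} <= {neg_entropy:.3f} <= {upper_prior:.3f}")
print(f"exact tau: {lower_exact:.3f} vs {upper_exact:.3f}")
```

In this toy setting the exact posterior makes every ratio equal to $q(z)$, so the two bounds coincide, whereas the prior choice leaves a visible gap on both sides even with $K = 5$.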
3.1 Upper Bound on KL divergence between hierarchical models
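We now apply these bounds to the marginal log-densities appearing in the KL divergence in the case of both $q(z \mid x)$ and $p(z)$ being different (potentially structurally) hierarchical models. This results in a novel upper bound on the KL divergence with auxiliary variational distributions $\tau(\psi \mid z, x)$ and $\nu(\zeta \mid z)$:
$$D_{KL}(q(z \mid x) \,\|\, p(z)) \le \mathbb{E}_{q(z, \psi_0 \mid x)} \mathbb{E}_{\tau(\psi_{1:K} \mid z, x)} \mathbb{E}_{\nu(\zeta_{1:L} \mid z)} \log \frac{\frac{1}{K+1} \sum_{k=0}^{K} \frac{q(z, \psi_k \mid x)}{\tau(\psi_k \mid z, x)}}{\frac{1}{L} \sum_{l=1}^{L} \frac{p(z, \zeta_l)}{\nu(\zeta_l \mid z)}} \tag{4}$$
Crucially, in (4) we merged expectations over $q(z \mid x)$ and $q(\psi_0 \mid z, x)$ into one expectation over the joint distribution $q(z, \psi_0 \mid x)$, which admits a more favorable factorization into $q(\psi_0 \mid x)\, q(z \mid \psi_0, x)$, and samples from the latter are easy to simulate for Monte Carlo-based estimation.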
3.2 Tractable lower bounds on marginal log-likelihood with hierarchical proposal
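The proposed upper bound (4) allows us to lower bound the otherwise intractable ELBO in the case of hierarchical $q(z \mid x)$ and $p(z)$, leading to the Importance Weighted Hierarchical Variational Inference (IWHVI) lower bound:
$$\log p(x) \ge \mathbb{E}_{q(z, \psi_0 \mid x)} \mathbb{E}_{\tau(\psi_{1:K} \mid z, x)} \mathbb{E}_{\nu(\zeta_{1:L} \mid z)} \log \frac{p(x \mid z)\, \frac{1}{L} \sum_{l=1}^{L} \frac{p(z, \zeta_l)}{\nu(\zeta_l \mid z)}}{\frac{1}{K+1} \sum_{k=0}^{K} \frac{q(z, \psi_k \mid x)}{\tau(\psi_k \mid z, x)}} \tag{5}$$
This bound introduces two additional auxiliary variational distributions $\tau$ and $\nu$ that are learned by maximizing the bound w.r.t. their parameters, tightening the bound. While the optimal distributions are $\tau(\psi \mid z, x) = q(\psi \mid z, x)$ and $\nu(\zeta \mid z) = p(\zeta \mid z)$ (this choice makes both the upper and the lower bound equal to the marginal log-density), one can see that some particular choices of these distributions and of the hyperparameters $K$ and $L$ render previously known methods like DSIVI, SIVI and HVM as special cases (see appendix B).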
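The two simplest reductions can be written out directly. The worked lines below are a sketch under an assumed notation: an explicit prior $p(z)$ (so the $\nu$-dependent factor disappears), a hierarchical proposal $q(z, \psi \mid x)$ and an auxiliary distribution $\tau(\psi \mid z, x)$. They are meant as a reading aid and do not replace the full treatment in appendix B.

```latex
% Assumed form of the bound with an explicit prior p(z):
%   log p(x) >= E log [ p(x, z) / ( (1/(K+1)) sum_{k=0}^{K} q(z, psi_k | x) / tau(psi_k | z, x) ) ]
\begin{align*}
K = 0:\quad
  &\mathbb{E}_{q(z, \psi_0 \mid x)}
   \log \frac{p(x, z)\, \tau(\psi_0 \mid z, x)}{q(z, \psi_0 \mid x)}
  && \text{(HVM bound)} \\
\tau(\psi \mid z, x) = q(\psi \mid x):\quad
  &\frac{q(z, \psi_k \mid x)}{\tau(\psi_k \mid z, x)} = q(z \mid \psi_k, x)
   \;\Longrightarrow\;
   \mathbb{E} \log \frac{p(x, z)}{\frac{1}{K+1} \sum_{k=0}^{K} q(z \mid \psi_k, x)}
  && \text{(SIVI bound)}
\end{align*}
```

In the first line only the jointly sampled $\psi_0$ remains, which is the single-sample auxiliary bound; in the second line the auxiliary distribution ignores $z$, so the extra samples are the "uninformed" ones discussed in section 2.1.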
4 Multisample Extensions
Multisample bounds similar to the proposed one have already been studied extensively. In this section, we relate our results to such prior work.
4.1 Multisample Bound and Complexity
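In this section we generalize the bound (5) further in a way similar to the IWAE multisample bound (Burda et al., 2015) (Theorem A.4), leading to the Doubly Importance Weighted Hierarchical Variational Inference (DIWHVI) bound:
$$\log p(x) \ge \mathbb{E} \log \frac{1}{M} \sum_{m=1}^{M} \frac{p(x \mid z_m)\, \frac{1}{L} \sum_{l=1}^{L} \frac{p(z_m, \zeta_{m,l})}{\nu(\zeta_{m,l} \mid z_m)}}{\frac{1}{K+1} \sum_{k=0}^{K} \frac{q(z_m, \psi_{m,k} \mid x)}{\tau(\psi_{m,k} \mid z_m, x)}} \tag{6}$$
where the expectation is taken over the same generative process as in eq. 5, independently repeated $M$ times, so that $(z_m, \psi_{m,0}) \sim q(z, \psi \mid x)$ for $1 \le m \le M$, followed by:
Sample $\psi_{m,k} \sim \tau(\psi \mid z_m, x)$ for $1 \le m \le M$ and $1 \le k \le K$
Sample $\zeta_{m,l} \sim \nu(\zeta \mid z_m)$ for $1 \le m \le M$ and $1 \le l \le L$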
The price of the tighter bound (6) is quadratic sample complexity: it requires $M(1 + K)$ samples of $\psi$ and $ML$ samples of $\zeta$. Unfortunately, DIWHVI cannot benefit from the sample reuse trick of SIVI that leads to the bound (3). The reason for this is that the bound (6) requires all terms in the outer denominator (the $\log q(z_m \mid x)$ estimate) to use the same distribution $\tau$, whereas by its very nature it should be very different for different $z_m$. A viable option, though, is to consider a multisample-conditioned $\tau(\psi \mid z_{1:M}, x)$ that is invariant to permutations of $z_{1:M}$. We leave a more detailed investigation to future work.
Runtime-wise, when compared to the multisample SIVI (3), DIWHVI requires additional passes to generate the $\tau(\psi \mid z_m, x)$ distributions. However, since SIVI requires a much larger number of samples $K$ to reach the same level of accuracy (see section 6.1), all of which are then passed through a network to generate the $q(z \mid \psi_k, x)$ distributions, the extra computation is likely to either incur a minor overhead, or be completely justified by the reduced $K$. This is particularly true in the IWHVI case ($M = 1$), where IWHVI’s single extra pass that generates $\tau(\psi \mid z, x)$ is dominated by the $K + 1$ passes that generate $q(z \mid \psi_k, x)$.
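To make the sampling pattern and its cost concrete, here is a minimal sketch of a doubly importance weighted estimate on a toy problem with an explicit prior. Everything specific is an assumption chosen so the answer is known: $p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, and a hierarchical proposal $\psi \sim \mathcal{N}(x/2, 1/4)$, $z \mid \psi \sim \mathcal{N}(\psi, 1/4)$ whose marginal happens to equal the true posterior. The function name `diwhvi_estimate` is hypothetical.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def diwhvi_estimate(x, num_outer, num_inner, exact_tau, rng):
    """One draw of the estimate plus the number of psi samples it consumed."""
    q_mean, q_var = x / 2.0, 0.25
    log_weights, psi_count = [], 0
    for _ in range(num_outer):                          # m = 1..M
        psi0 = rng.gauss(q_mean, math.sqrt(q_var))
        z = rng.gauss(psi0, math.sqrt(q_var))           # (z_m, psi_m0) ~ q(z, psi | x)
        if exact_tau:                                   # tau = q(psi | z, x)
            t_mean, t_var = (q_mean + z) / 2.0, q_var / 2.0
        else:                                           # tau = q(psi | x), SIVI-like
            t_mean, t_var = q_mean, q_var
        psis = [psi0] + [rng.gauss(t_mean, math.sqrt(t_var)) for _ in range(num_inner)]
        psi_count += len(psis)
        log_ratios = [
            log_normal(p, q_mean, q_var) + log_normal(z, p, q_var)
            - log_normal(p, t_mean, t_var)
            for p in psis
        ]
        log_q_hat = log_mean_exp(log_ratios)            # inner estimate of log q(z_m | x)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        log_weights.append(log_joint - log_q_hat)       # outer importance weight
    return log_mean_exp(log_weights), psi_count

rng = random.Random(0)
x_obs = 1.5
true_log_px = log_normal(x_obs, 0.0, 2.0)
exact_value, exact_count = diwhvi_estimate(x_obs, 4, 3, True, rng)
sivi_like = sum(diwhvi_estimate(x_obs, 1, 1, False, rng)[0] for _ in range(20000)) / 20000
print(f"psi samples for M=4, K=3: {exact_count}")
print(f"exact tau {exact_value:.4f}  SIVI-like tau {sivi_like:.4f}  log p(x) {true_log_px:.4f}")
```

Each of the $M$ outer samples needs its own $K + 1$ auxiliary samples because the auxiliary distribution is conditioned on $z_m$, which is where the $M(1 + K)$ count comes from. With the exact auxiliary distribution the toy estimate equals $\log p(x)$; with the uninformed one it falls below it on average.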
4.2 Signal to Noise Ratio
Rainforth et al. (2018) have shown that multisample bounds (Burda et al., 2015; Nowozin, 2018) behave poorly during the training phase, yielding noisier gradient estimates for the inference network, which manifests itself in a decreasing Signal-to-Noise Ratio (SNR) as the number of samples increases. This raises a natural concern whether the same happens in the proposed model as $K$ increases. Tucker et al. (2019) have shown that upon careful examination a REINFORCE-like (Williams, 1992) term can be seen in the gradient estimate, and REINFORCE is known for its typically high variance (Rezende et al., 2014). The authors further suggest applying the reparametrization trick (Kingma and Welling, 2013) a second time to obtain a reparametrization-based gradient estimate, which is then shown to solve the decreasing SNR problem. The same reasoning can be applied to our bound, and we provide further details and experiments in appendix D, developing an IWHVI-DReG gradient estimator. We conclude that the problem of decreasing SNR exists in our bound as well, and is mitigated by the proposed gradient estimator.
4.3 Debiasing the bound
Nowozin (2018) has shown that the standard IWAE can be seen as a biased estimate of the marginal log-likelihood with the bias of order $O(1/M)$. They then suggested using the Generalized Jackknife of $d$-th order to reuse these $M$ samples and come up with an estimator with a smaller bias of order $O(M^{-(d+1)})$ at the cost of higher variance and losing lower bound guarantees. Again, the same idea can be applied to our estimate; we leave further details to appendix E. We conclude that this way one can obtain better estimates of the marginal log-density; however, since there is no guarantee that the obtained estimator gives an upper or a lower bound, we chose not to use it in experiments.
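The first-order case of this correction is short enough to sketch. The code below is a minimal illustration, assuming the classical first-order jackknife ($M$ times the full estimate minus $M - 1$ times the average leave-one-out estimate) applied to plain IWAE log-weights on an assumed toy model with $p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$ and the prior as proposal. It is not the estimator of appendix E, and `iwae_and_jackknife` is a hypothetical name.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def iwae_and_jackknife(log_w):
    """Return the M-sample IWAE estimate and its first-order jackknife correction."""
    m = len(log_w)
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]      # weights rescaled for stability
    total = sum(w)
    full = top + math.log(total / m)
    # Average of the M leave-one-out estimates, each built from M - 1 weights.
    # Assumes no single weight dominates so strongly that total - wi underflows.
    loo = sum(top + math.log((total - wi) / (m - 1)) for wi in w) / m
    return full, m * full - (m - 1) * loo

rng = random.Random(0)
x_obs, num_samples, num_repeats = 1.5, 8, 20000
true_log_px = log_normal(x_obs, 0.0, 2.0)
iwae_mean = jack_mean = 0.0
for _ in range(num_repeats):
    # The proposal equals the prior, so each log-weight reduces to log p(x | z).
    log_w = [log_normal(x_obs, rng.gauss(0.0, 1.0), 1.0) for _ in range(num_samples)]
    full, corrected = iwae_and_jackknife(log_w)
    iwae_mean += full / num_repeats
    jack_mean += corrected / num_repeats
print(f"IWAE bias      {iwae_mean - true_log_px:+.4f}")
print(f"jackknife bias {jack_mean - true_log_px:+.4f}")
```

The corrected estimate reuses the same $M$ weights, so it costs no additional density evaluations, but its sign is no longer guaranteed, which is the reason given above for not using it as a bound.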
5 Related Work
More expressive variational distributions have been under active investigation for a while. While we have focused our attention on approaches employing hierarchical models via bounds, there are many other approaches, roughly falling into two broad classes.
One possible approach is to augment some standard $q(z \mid x)$ with the help of copulas (Tran et al., 2015), mixtures (Guo et al., 2016), or invertible transformations with tractable Jacobians, also known as normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016; Papamakarios et al., 2017), all while preserving the tractability of the density. Kingma and Dhariwal (2018) have demonstrated that flow-based models are able to approximate complex high-dimensional distributions of real images, but the requirement for invertibility might lead to inefficient parameter usage and does not allow for abstraction, as one needs to preserve dimensions.
An alternative direction is to embrace implicit distributions that one can only sample from, and overcome the need for a tractable density using bounds or estimates (Huszár, 2017). Methods based on estimates (Mescheder et al., 2017; Shi et al., 2017), for example, via the Density Ratio Estimation trick (Goodfellow et al., 2014; Uehara et al., 2016; Mohamed and Lakshminarayanan, 2016), typically estimate the densities indirectly utilizing an auxiliary critic and hide the dependency on the variational parameters $\phi$, hence biasing the optimization procedure. Titsias and Ruiz (2018) have shown that in gradient-based ELBO optimization in the case of a hierarchical model with tractable $q(z \mid \psi)$ and $q(\psi)$ one does not need the marginal log density $\log q(z)$ per se, only its gradient, which can be estimated using MCMC. A major disadvantage of these methods is that they either lose bound guarantees or make the bound’s evaluation intractable, and thus cannot be combined with multisample bounds during the evaluation phase.
The core contribution of the paper is a novel upper bound on the marginal log-likelihood. Previously, Dieng et al. (2017) and Kuleshov and Ermon (2017) suggested using the $\chi^2$-divergence to give a variational upper bound on the marginal log-likelihood. However, their bound was not an expectation of a random variable, but instead a logarithm of the expectation, preventing unbiased stochastic optimization. Jebara and Pentland (2001) reverse Jensen’s inequality to give a variational upper bound in the case of mixtures of exponential family distributions by extensive use of the problem’s structure. Related to our core idea of jointly sampling $z$ and $\psi_0$ in (4) is an observation of Grosse et al. (2015) that Annealed Importance Sampling (AIS, Neal (2001)) run backward from the auxiliary variable sample $\psi_0$ gives an unbiased estimate of $1 / q(z \mid x)$, and thus can also be used to upper bound the marginal log-density. However, AIS-based estimation is too computationally expensive to be used during training.
We only focus on cases with an explicit prior $p(z)$ to simplify comparison to prior work. Hierarchical priors correspond to nested variational inference, to which most of the variational inference results readily apply (Atanov et al., 2019).
6.1 Toy Experiment
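As a toy experiment we consider a 50-dimensional factorized standard Laplace distribution $q(z)$ as a hierarchical scale-mixture model:
$$q(z) = \prod_{d=1}^{50} \mathrm{Laplace}(z_d \mid 0, 1) = \int \prod_{d=1}^{50} \mathcal{N}(z_d \mid 0, \psi_d)\, \mathrm{Exp}(\psi_d \mid \tfrac{1}{2})\, d\psi_{1:50}$$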
We do not make use of the factorized joint distribution $q(z, \psi)$, in order to explore the bound’s behavior in high dimensions. We use the proposed bound from Theorem A.1 and compare it to SIVI (Yin and Zhou, 2018) on the task of upper-bounding the negative differential entropy $\mathbb{E}_{q(z)} \log q(z)$. For IWHVI we take $\tau(\psi \mid z)$ to be a Gamma distribution whose concentration and rate are generated from $z$ by a neural network with three 500-dimensional hidden layers. We use the freedom to design the architecture and initialize the network at the prior. Namely, we also add a sigmoid "gate" output with a large initial negative bias and use the gate to combine the prior concentration and rate with those generated by the network. This way we are guaranteed to perform no worse than SIVI even at a randomly initialized $\tau$. Figure 0(a) shows the value of the bound for a different number of optimization steps over the $\tau$ parameters, minimizing the bound. The whole process (including random initialization of neural networks) was repeated 50 times to compute empirical 90% confidence intervals. As the results clearly indicate, the proposed bound can be made much tighter, more than halving the gap to the true negative entropy.
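One way such a gate could be wired is sketched below. The convex combination, the prior parameters (an exponential with rate 1/2 written as a Gamma with concentration 1) and the function name `gated_gamma_params` are all assumptions made for illustration; the text above only states that a sigmoid gate with a large negative initial bias combines the prior and the network outputs.

```python
import math

PRIOR_CONCENTRATION = 1.0   # Exp(rate) equals Gamma(concentration=1, rate)
PRIOR_RATE = 0.5            # assumed prior rate of the scale-mixture toy

def sigmoid(t):
    """Logistic function mapping a real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def gated_gamma_params(net_concentration, net_rate, gate_logit):
    """Blend network outputs with the prior's Gamma parameters.

    A gate near 0 reproduces the prior; a gate near 1 trusts the network.
    The convex combination is an illustrative assumption.
    """
    gate = sigmoid(gate_logit)
    concentration = (1.0 - gate) * PRIOR_CONCENTRATION + gate * net_concentration
    rate = (1.0 - gate) * PRIOR_RATE + gate * net_rate
    return concentration, rate

# With a large negative gate bias the auxiliary distribution starts at the
# prior, whatever the untrained network happens to output.
at_init = gated_gamma_params(net_concentration=7.3, net_rate=0.02, gate_logit=-20.0)
after_training = gated_gamma_params(net_concentration=7.3, net_rate=0.02, gate_logit=20.0)
print(at_init, after_training)
```

Starting at the prior means the first optimization step begins from the SIVI special case, so any improvement in the bound comes from what the network learns rather than from a lucky initialization.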
6.2 Variational Autoencoder
We further test our method on the task of generative modeling, applying it to VAE (Kingma and Welling, 2013), which is a standard benchmark for inference methods. Ideally, better inference should allow one to learn more expressive generative models. We report results on two datasets: MNIST (LeCun et al., 1998) and OMNIGLOT (Lake et al., 2015). For MNIST we follow the setup by Mescheder et al. (2017), and for OMNIGLOT we follow the standard setup (Burda et al., 2015). For experiment details see appendix F.
During training we used the proposed bound (eq. 5) with an analytically tractable prior $p(z)$ and an increasing number of samples $K$: we used for the first 250 epochs, for the next 250 epochs, and for the next 500 epochs, and from then on. We used during training.
To estimate the marginal log-likelihood for hierarchical models (IWHVI, SIVI, HVM) we use the DIWHVI lower bound (6) for , (for justification of DIWHVI as an evaluation metric see section 6.3). Results are shown in fig. 1. To evaluate SIVI using the DIWHVI bound we fit $\tau$ to a trained model by training for 7000 epochs on the training set with , keeping the parameters of and fixed. We observed improved performance compared to the special cases of HVM and SIVI, and the method showed results comparable to prior works.
For HVM on MNIST we observed that its $\tau(\psi \mid z, x)$ essentially collapsed to $q(\psi \mid x)$, with the expected KL divergence between the two extremely close to zero. This indicates the "posterior collapse" (Kim et al., 2018; Chen et al., 2016) problem, where the inference network $q(z \mid \psi, x)$ chose to ignore the extra input $\psi$ and effectively degenerated to a vanilla VAE. At the same time IWHVI does not suffer from this problem due to non-zero $K$, achieving an average $D_{KL}(\tau(\psi \mid z, x) \,\|\, q(\psi \mid x))$ of approximately 6.2 nats, see section 6.3. For OMNIGLOT, HVM did learn a useful $\tau$ and achieved average nats; however, IWHVI did much better and achieved nats.
To investigate $K$’s influence on the training process, we trained VAEs on MNIST for 3000 epochs for different values of $K$, evaluated the DIWHVI bound for and plotted the results in fig. 0(b). One can see that higher values of $K$ lead to better final models in terms of marginal log-likelihood, as well as more informative auxiliary inference networks $\tau$ compared to the prior $q(\psi \mid x)$. Using the IWHVI-DReG gradient estimator (see appendix D) increased the KL divergence, but resulted in only a modest increase in the marginal log-likelihood.
6.3 DIWHVI as Evaluation Metric
One of the established approaches to evaluating the intractable marginal log-likelihood in Latent Variable Models is to compute the multisample IWAE bound with large $M$, since it is shown to converge to the marginal log-likelihood as $M$ goes to infinity. Since both IWHVI and SIVI allow tightening the bound by taking more samples $\psi_k$, we compare the methods along this direction.
Both DIWHVI and SIVI (being a special case of the former) can be shown to converge to the marginal log-likelihood as both $M$ and $K$ go to infinity; however, the rates might differ. We empirically compare the two by evaluating an MNIST-trained IWHVAE model from section 6.2 for several different values of $K$ and $M$. We use the proposed DIWHVI bound (6), and compare it with several SIVI modifications. We call SIVI-like the bound (6) with $\tau(\psi \mid z, x) = q(\psi \mid x)$, but without $\psi$ reuse, thus using independent samples. SIVI Equicomp stands for the sample-reusing bound (3), which uses only $M + K$ samples, and uses the same $\psi_{1:K}$ for every $z_m$. SIVI Equisample is a fair comparison in terms of the number of samples: we take $M(K + 1)$ samples of $\psi$, and reuse $MK$ of them for every $z_m$. This way we use the same number of samples as DIWHVI does, but perform $O(M^2 K)$ log-density evaluations to estimate $\log q(z \mid x)$, which is why we only examine the case.
Results shown in fig. 1 indicate superior performance of the DIWHVI bound. Surprisingly, the SIVI-like and SIVI Equicomp estimates nearly coincide, with no significant difference in variance; thus we conclude that sample reuse does not hurt SIVI. Still, there is a considerable gap to the IWHVI bound, which uses an amount of computation and samples similar to SIVI-like. In a fairer comparison to the Equisample SIVI bound, the gap is significantly reduced, yet IWHVI is still a superior bound, especially in terms of computational efficiency, as there are no $O(M^2 K)$ operations.
Comparing IWHVI and SIVI-like for we see that the former converges after a few dozen samples, while SIVI is rapidly improving, yet lagging almost 1 nat behind for 100 samples, and even 0.5 nats behind the HVM bound (IWHVI for $K = 0$). One explanation for the observed behaviour is large , which was estimated (on a test set; the difference between the -sample IWAE bound and the ELBO gives a lower bound on , we used ) to be at least nats, causing many samples from $q(\psi \mid x)$ to generate poor likelihood for a given $z$ due to the large difference from the true inverse model $q(\psi \mid z, x)$. This is consistent with the motivation laid out in section 2.1: a better approximate inverse model $\tau$ leads to more efficient sample usage. At the same time was estimated to be approximately and , indicating that one can indeed do much better by learning $\tau$ instead of using the prior $q(\psi \mid x)$.
We presented a variational upper bound on the marginal log density, which allowed us to formulate sandwich bounds for $D_{KL}(q(z \mid x) \,\|\, p(z))$ for the case of a hierarchical model $q(z \mid x)$, in addition to prior works that only provided multisample variational upper bounds for the case of $p(z)$ being a hierarchical model. We applied it to lower bound the intractable ELBO with a tractable one for the case of a latent variable model as the approximate posterior $q(z \mid x)$. We experimentally validated the bound and showed that it alleviates the regularization effect to a further extent than prior works do, allowing for more expressive approximate posteriors, which does translate into better inference. We then combined our bound with the multisample IWAE bound, which led to a tighter lower bound on the marginal log-likelihood. We therefore believe the proposed variational inference method will be useful for many variational models.
The authors would like to thank Aibek Alanov, Dmitry Molchanov and Oleg Ivanov for valuable discussions and feedback.
Angelova (2012)
Angelova, J. A.
2012. On moments of sample mean and variance. International Journal of Pure and Applied Mathematics, 79.
Agakov and Barber (2004)
Agakov, F. V. and D. Barber
2004. An auxiliary variational method. In International Conference on Neural Information Processing, Pp. 561–566. Springer.
Atanov et al. (2019)
Atanov, A., A. Ashukha, K. Struminsky, D. Vetrov, and M. Welling
2019. The deep weight prior. In International Conference on Learning Representations.
Burda et al. (2015)
Burda, Y., R. Grosse, and R. Salakhutdinov
2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Chen et al. (2016)
Chen, X., D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel
2016. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
Dieng et al. (2017)
Dieng, A. B., D. Tran, R. Ranganath, J. Paisley, and D. Blei
2017. Variational inference via $\chi$ upper bound minimization. In Advances in Neural Information Processing Systems, Pp. 2732–2741.
Dillon et al. (2017)
Dillon, J. V., I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous
2017. Tensorflow distributions. arXiv preprint arXiv:1711.10604.
Dinh et al. (2016)
Dinh, L., J. Sohl-Dickstein, and S. Bengio
2016. Density estimation using real NVP. CoRR, abs/1605.08803.
Domke and Sheldon (2018)
Domke, J. and D. R. Sheldon
2018. Importance weighting and variational inference. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Pp. 4471–4480. Curran Associates, Inc.
Goodfellow et al. (2014)
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio
2014. Generative adversarial nets. In Advances in neural information processing systems, Pp. 2672–2680.
Grosse et al. (2015)
Grosse, R. B., Z. Ghahramani, and R. P. Adams
2015. Sandwiching the marginal likelihood using bidirectional monte carlo. CoRR, abs/1511.02543.
Guo et al. (2016)
Guo, F., X. Wang, K. Fan, T. Broderick, and D. B. Dunson
2016. Boosting variational inference. arXiv preprint arXiv:1611.05559.
Hinton and van Camp (1993)
Hinton, G. E. and D. van Camp
1993. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, Pp. 5–13, New York, NY, USA. ACM.
Huszár (2017)
Huszár, F.
2017. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.
Jebara and Pentland (2001)
Jebara, T. and A. Pentland
2001. On reversing jensen’s inequality. In Advances in Neural Information Processing Systems, Pp. 231–237.
Kim et al. (2018)
Kim, Y., S. Wiseman, A. Miller, D. Sontag, and A. Rush
2018. Semi-amortized variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., volume 80 of Proceedings of Machine Learning Research, Pp. 2678–2687, Stockholmsmässan, Stockholm Sweden. PMLR.
Kingma and Ba (2014)
Kingma, D. P. and J. Ba
2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Kingma and Dhariwal (2018)
Kingma, D. P. and P. Dhariwal
2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Pp. 10235–10244. Curran Associates, Inc.
Kingma et al. (2016)
Kingma, D. P., T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling
2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds., Pp. 4743–4751. Curran Associates, Inc.
Kingma and Welling (2013)
Kingma, D. P. and M. Welling
2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kuleshov and Ermon (2017)
Kuleshov, V. and S. Ermon
2017. Neural variational inference and learning in undirected graphical models. In Advances in Neural Information Processing Systems, Pp. 6734–6743.
Lake et al. (2015)
Lake, B. M., R. Salakhutdinov, and J. B. Tenenbaum
2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
LeCun et al. (1998)
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner
1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Louizos et al. (2017)
Louizos, C., K. Ullrich, and M. Welling
2017. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Pp. 3288–3298. Curran Associates, Inc.
Maaløe et al. (2016)
Maaløe, L., C. K. Sønderby, S. K. Sønderby, and O. Winther
2016. Auxiliary deep generative models. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., volume 48 of Proceedings of Machine Learning Research, Pp. 1445–1453, New York, New York, USA. PMLR.
MacKay (1995)
MacKay, D. J.
1995. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73 – 80. Proceedings of the Third Workshop on Neutron Scattering Data Analysis.
Mescheder et al. (2017)
Mescheder, L., S. Nowozin, and A. Geiger
2017. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning (ICML).
Mohamed and Lakshminarayanan (2016)
Mohamed, S. and B. Lakshminarayanan
2016. Learning in implicit generative models. arXiv preprint arXiv:1610.03483.
Molchanov et al. (2018)
Molchanov, D., V. Kharitonov, A. Sobolev, and D. Vetrov
2018. Doubly semi-implicit variational inference. arXiv preprint arXiv:1810.02789.
Neal (2001)
Neal, R. M.
2001. Annealed importance sampling. Statistics and computing, 11(2):125–139.
Nowozin (2018)
Nowozin, S.
2018. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations.
Nowozin et al. (2016)
Nowozin, S., B. Cseke, and R. Tomioka
2016. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds., Pp. 271–279. Curran Associates, Inc.
Papamakarios et al. (2017)
Papamakarios, G., I. Murray, and T. Pavlakou
2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Pp. 2335–2344.
Rainforth et al. (2018)
Rainforth, T., A. R. Kosiorek, T. A. Le, C. J. Maddison, M. Igl, F. Wood, and Y. W. Teh
2018. Tighter variational bounds are not necessarily better. In ICML.
Ranganath et al. (2016)
Ranganath, R., D. Tran, and D. Blei
2016. Hierarchical variational models. In International Conference on Machine Learning, Pp. 324–333.
Reddi et al. (2018)
Reddi, S. J., S. Kale, and S. Kumar
2018. On the convergence of adam and beyond. In International Conference on Learning Representations.
Rezende and Mohamed (2015)
Rezende, D. J. and S. Mohamed
2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
Rezende et al. (2014)
Rezende, D. J., S. Mohamed, and D. Wierstra
2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara, eds., volume 32 of Proceedings of Machine Learning Research, Pp. 1278–1286, Beijing, China. PMLR.
Salimans et al. (2015)
Salimans, T., D. Kingma, and M. Welling
2015. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, Pp. 1218–1226.
Sharot (1976)
Sharot, T.
1976. The generalized jackknife: finite samples and subsample sizes. Journal of the American Statistical Association, 71(354):451–454.
Shi et al. (2017)
Shi, J., S. Sun, and J. Zhu
2017. Kernel implicit variational inference. arXiv preprint arXiv:1705.10119.
Titsias and Ruiz (2018)
Titsias, M. K. and F. J. Ruiz
2018. Unbiased implicit variational inference. arXiv preprint arXiv:1808.02078.
Tran et al. (2015)
Tran, D., D. Blei, and E. M. Airoldi
2015. Copula variational inference. In Advances in Neural Information Processing Systems, Pp. 3564–3572.
Tucker et al. (2019)
Tucker, G., D. Lawson, S. Gu, and C. J. Maddison
2019. Doubly reparameterized gradient estimators for monte carlo objectives. In International Conference on Learning Representations.
Uehara et al. (2016)
Uehara, M., I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo
2016. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
Wainwright et al. (2008)
Wainwright, M. J., M. I. Jordan, et al.
2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305.
Waterhouse et al. (1996)
Waterhouse, S. R., D. MacKay, and A. J. Robinson
1996. Bayesian methods for mixtures of experts. In Advances in neural information processing systems, Pp. 351–357.
Williams (1992)
Williams, R. J.
1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
Yin and Zhou (2018)
Yin, M. and M. Zhou
2018. Semi-implicit variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, Pp. 5660–5669. PMLR.
Appendix A Proofs
Theorem A.1 (Marginal log-density upper bound).
For any , and such that for any s.t. we have , consider
where we write for brevity. Then the following holds:
Lemma A.2 ( distribution, following Domke and Sheldon (2018)).
Given , consider the following generative process:
Sample i.i.d. samples from
For each sample compute its weight
Put -th sample first, and then the rest: ,
Then the marginal density of
The joint density for the generative process described above is
One can see that this is indeed a normalized density
The marginal density then is
where on the second line we used the fact that the integrand is symmetric under the choice of .
Then is a normalized density.
is non-negative due to all the terms being non-negative. Now we’ll show it integrates to 1 (colors denote corresponding terms):
Theorem A.4 (DIWHVI Evidence Lower Bound).
Where the expectation is taken over the following generative process:
Sample for and
Sample for and
Consider a random variable
We’ll show it’s an unbiased estimate of (colors denote corresponding terms) and then just invoke Jensen’s inequality: