code for hierarchical importance weighted autoencoders
In this note, we study the relationship between the variational gap and the variance of the (log) likelihood ratio. We show that the gap can be upper bounded by some form of dispersion measure of the likelihood ratio, which suggests the bias of variational inference can be reduced by making the distribution of the likelihood ratio more concentrated, such as via averaging and variance reduction.
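Let $x$ and $z$ denote the observed and unobserved random variables, following a joint density function $p_\theta(x, z)$. Generally, the log marginal likelihood $\log p_\theta(x) = \log \int p_\theta(x, z)\,dz$ is not tractable, so the maximum likelihood principle cannot be readily applied to estimate the model parameter $\theta$. Instead, one can maximize the evidence lower bound (ELBO):
$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z)}\left[\frac{p_\theta(x, z)}{q_\phi(z)}\right] \ge \mathbb{E}_{q_\phi(z)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z)}\right] =: \mathcal{L}(\theta, \phi),$$
where the inequality becomes an equality if and only if $q_\phi(z) = p_\theta(z \mid x)$, since $\log$ is a strictly concave function. This way, learning and inference can be jointly achieved by maximizing $\mathcal{L}$ with respect to $\theta$ and $\phi$, respectively.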
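Alternatively, one can maximize another family of lower bounds due to Burda et al. (2015):
$$\mathcal{L}_K(\theta, \phi) := \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z)}\left[\log \frac{1}{K} \sum_{j=1}^{K} \frac{p_\theta(x, z_j)}{q_\phi(z_j)}\right],$$
which we call the importance weighted lower bound (IWLB). Clearly $\mathcal{L}_1 = \mathcal{L}$. An appealing property of this family of lower bounds is that $\mathcal{L}_K$ is monotonic, i.e. $\mathcal{L}_M \ge \mathcal{L}_N$ if $M \ge N$, and can be made arbitrarily close to $\log p_\theta(x)$ provided $K$ is sufficiently large.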
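The monotonicity of the IWLB can be observed numerically. The following is a minimal sketch, not the setup of any of the cited papers: it assumes a toy model $z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$, for which the marginal $x \sim N(0, 2)$ is known in closed form, and it uses the prior as the proposal. The names `log_normal_pdf`, `iwlb_estimate` and `x_obs` are illustrative.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v (broadcasts over arrays)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def iwlb_estimate(x, K, n_rep, rng):
    """Monte Carlo estimate of L_K = E[log((1/K) sum_j w_j)], w_j = p(x, z_j) / q(z_j)."""
    z = rng.standard_normal((n_rep, K))            # z_j ~ q(z) = N(0, 1) (assumed proposal)
    log_w = (log_normal_pdf(z, 0.0, 1.0)           # log p(z)
             + log_normal_pdf(x, z, 1.0)           # + log p(x | z)
             - log_normal_pdf(z, 0.0, 1.0))        # - log q(z)
    m = log_w.max(axis=1, keepdims=True)           # stabilised log-mean-exp over the K samples
    log_avg = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return float(log_avg.mean())

rng = np.random.default_rng(0)
x_obs = 1.5                                        # arbitrary observation for the toy model
log_px = float(log_normal_pdf(x_obs, 0.0, 2.0))    # exact log marginal, since x ~ N(0, 2)
bounds = {K: iwlb_estimate(x_obs, K, 50_000, rng) for K in (1, 4, 16, 64)}

for K, value in bounds.items():
    print(f"K={K:3d}  L_K ~ {value:.4f}  gap ~ {log_px - value:.4f}")
```

Up to Monte Carlo error, the estimates increase with $K$ and stay below the exact log marginal likelihood.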
One interpretation for this is that by weighting the samples according to the importance ratio $p_\theta(x, z_j) / q_\phi(z_j)$, we are effectively correcting or biasing the proposal towards the true posterior $p_\theta(z \mid x)$; see Cremer et al. (2017) for more details. Another interpretation due to Nowozin (2018) is to view $\mathcal{L}_K$ as a biased estimator for $\log p_\theta(x)$, where the bias is of the order $O(1/K)$.
We take a different view by looking at the variance, or some notion of dispersion, of the likelihood ratio. We write $X_K := \frac{1}{K} \sum_{j=1}^{K} \frac{p_\theta(x, z_j)}{q_\phi(z_j)}$ as the average before $\log$ is applied. The variational gap, $\log p_\theta(x) - \mathcal{L}_K$, is caused by (1) the strict concavity of $\log$, and (2) the dispersion of $X_K$. To see this, one can view the expectation $\mathbb{E}[\log X_K]$ as the centroid of uncountably many points $(X_K, \log X_K)$ weighted by their probability density, which lies below the graph of $\log$. By using a larger number of samples, the distribution of $X_K$ becomes more concentrated around its expectation $p_\theta(x)$, pushing the “centroid” up to be closer to the graph of $\log$. See Figure 1 for an illustration.
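The concentration argument can be illustrated with a small simulation. This is a sketch under an assumed toy setting, not taken from the note: the likelihood ratio is modelled as a log-normal variable with unit mean (so that $\log \mathbb{E}[X_K] = 0$), and `s` and `summary` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)
s = 0.5             # assumed log-scale spread of the toy likelihood ratio
n_rep = 50_000      # Monte Carlo replicates of X_K

summary = {}
for K in (1, 4, 16, 64):
    z = rng.standard_normal((n_rep, K))
    w = np.exp(s * z - 0.5 * s**2)       # log-normal ratio with E[w] = 1
    x_k = w.mean(axis=1)                 # X_K: the average before log is applied
    # Dispersion of X_K and height of the "centroid" E[log X_K]
    summary[K] = (float(x_k.std()), float(np.log(x_k).mean()))
    print(f"K={K:3d}  std(X_K) ~ {summary[K][0]:.4f}  E[log X_K] ~ {summary[K][1]:.4f}")
```

In this setting the standard deviation of $X_K$ shrinks roughly like $1/\sqrt{K}$, while the centroid $\mathbb{E}[\log X_K]$ rises towards $\log \mathbb{E}[X_K] = 0$.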
This intuition has been exploited, and ideas of correlating the likelihood ratios of a joint proposal have been proposed in Klys et al. (2018); Wu et al. (2019); Huang et al. (2019). Even though attempts have been made to establish the connection between the variance of $X_K$ and the gap (or bias) $\log p_\theta(x) - \mathcal{L}_K$, the obtained results are asymptotic and require further assumptions on boundedness (such as uniform integrability) of the sequence $X_K$, which makes the results harder to interpret. (Footnote 1: For example, see Klys et al. (2018); Huang et al. (2019), where they seek to minimize the variance to improve the variational approximation, and Maddison et al. (2017); Domke and Sheldon (2018), where they analyze the asymptotic bias by looking at the variance of $X_K$.) Rather than bounding the asymptotic bias by the variance of $X_K$, we analyze the non-asymptotic relationship between the gap and the variance of $X_K$ and $\log X_K$. Our finding justifies exploiting the structure of the likelihood ratios of a joint proposal, as anti-correlation among the likelihood ratios serves to further reduce the variance of an average, which, as we will show in the next section, upper bounds the variational gap.
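To make the remark about anti-correlation concrete, the sketch below compares the variance of an average of two independent likelihood ratios with that of a negatively correlated pair. Antithetic sampling ($z$ and $-z$) is used here purely as a simple way to induce negative correlation; it is an assumption of this example and not necessarily the construction used in the cited works. The log-normal `ratio` function is likewise a toy stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
s, n = 0.5, 200_000

def ratio(z):
    """Toy likelihood ratio: log-normal with mean 1 (illustrative assumption)."""
    return np.exp(s * z - 0.5 * s**2)

z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

avg_independent = 0.5 * (ratio(z1) + ratio(z2))    # average of two independent ratios
avg_antithetic = 0.5 * (ratio(z1) + ratio(-z1))    # average of a negatively correlated pair

corr = float(np.corrcoef(ratio(z1), ratio(-z1))[0, 1])
var_ind = float(avg_independent.var())
var_anti = float(avg_antithetic.var())

print(f"correlation of the antithetic pair: {corr:.3f}")
print(f"variance, independent pair: {var_ind:.4f}")
print(f"variance, antithetic pair:  {var_anti:.4f}")
```

Both averages remain unbiased because $-z$ has the same distribution as $z$; only the dispersion of the average changes, in line with $\mathrm{Var}\big(\tfrac{1}{2}(A + B)\big) = \tfrac{\sigma^2}{2}(1 + \rho)$ for identically distributed $A$, $B$ with correlation $\rho$.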
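Let $\mu_X$ and $\nu_X$ be the mean and median of a random variable $X$, i.e. $\mu_X = \mathbb{E}[X]$ and $\mathbb{P}(X \le \nu_X) = \tfrac{1}{2}$. (Footnote 2: We assume there is a unique median to simplify the analysis.) Here we assume $X$ is a positive random variable. One can think of it as the average $\frac{1}{K} \sum_{j=1}^{K} \frac{p_\theta(x, z_j)}{q_\phi(z_j)}$, or some other unbiased estimate of $p_\theta(x)$. By Jensen’s inequality, we know $\mu_Y \le \log \mu_X$, where $Y = \log X$. We want to bound the gap $\log \mu_X - \mu_Y$ via some notion of dispersion of $X$ and $Y$. Now assume $|\mu_X - \nu_X| \le \epsilon_X$ and $|\mu_Y - \nu_Y| \le \epsilon_Y$. Constants $\epsilon_X$ and $\epsilon_Y$ correspond to the dispersion just mentioned. For example, the following lemma shows that $\epsilon$ can be taken to be the standard deviation: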
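Lemma. For a random variable $X$ with mean $\mu$, median $\nu$ and standard deviation $\sigma < \infty$, we have $|\mu - \nu| \le \sigma$.
Proof. Using the fact that the median minimizes the mean absolute error, together with Jensen’s inequality, we have
$$|\mu - \nu| = |\mathbb{E}[X - \nu]| \le \mathbb{E}|X - \nu| \le \mathbb{E}|X - \mu| \le \sqrt{\mathbb{E}[(X - \mu)^2]} = \sigma.$$
The first and last inequalities are Jensen’s inequality; the middle one holds because $\nu$ minimizes $c \mapsto \mathbb{E}|X - c|$. ∎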
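The chain of inequalities behind the bound $|\text{mean} - \text{median}| \le \text{standard deviation}$ can be checked on empirical distributions. The following sketch draws samples from three skewed distributions, chosen arbitrarily for illustration, and compares the four quantities that appear in the argument. Because an empirical distribution is itself a distribution, the inequalities hold exactly for the sample statistics (with the standard deviation computed with `ddof=0`), up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = {
    "log-normal": rng.lognormal(mean=0.0, sigma=1.0, size=10_001),
    "exponential": rng.exponential(scale=2.0, size=10_001),
    "gamma": rng.gamma(shape=0.5, scale=1.0, size=10_001),
}

checks = {}
for name, x in samples.items():
    mu, nu, sigma = x.mean(), np.median(x), x.std()   # x.std() uses ddof=0 by default
    mae_median = np.abs(x - nu).mean()                # E|X - median|
    mae_mean = np.abs(x - mu).mean()                  # E|X - mean|
    # Chain being checked: |mu - nu| <= E|X - nu| <= E|X - mu| <= sigma
    checks[name] = (abs(mu - nu), mae_median, mae_mean, sigma)
    print(name, " <= ".join(f"{v:.4f}" for v in checks[name]))
```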
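Without further assumptions, we can derive a weaker result. Since $\log$ is strictly monotonic, $\nu_Y = \log \nu_X$, so we have $|\mu_Y - \log \nu_X| \le \epsilon_Y$. Since $\nu_X \ge \mu_X - \epsilon_X$, by monotonicity of $\log$ (and provided $\mu_X - \epsilon_X > 0$),
$$\log(\mu_X - \epsilon_X) \le \log \nu_X \le \mu_Y + \epsilon_Y,$$
which after rearrangement gives $\log(\mu_X - \epsilon_X) - \mu_Y \le \epsilon_Y$. This means that if $\epsilon_X$ is small enough so that the difference between $\log(\mu_X - \epsilon_X)$ and $\log \mu_X$ can be neglected, then the gap of interest is bounded by the dispersion of $Y$, $\epsilon_Y$.
Now, we quantify the error between $\log(\mu_X - \epsilon_X)$ and $\log \mu_X$ by the following result: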
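Proposition. Let $X$ be a positive random variable with mean $\mu_X$ and median $\nu_X$, and let $Y = \log X$ with mean $\mu_Y$ and median $\nu_Y$. Assume $|\mu_X - \nu_X| \le \epsilon_X$ and $|\mu_Y - \nu_Y| \le \epsilon_Y$ for some constants $\epsilon_X, \epsilon_Y \ge 0$. If $\epsilon_X < \mu_X$, then
$$\log \mu_X - \mu_Y \le \epsilon_Y + \frac{\epsilon_X}{\mu_X - \epsilon_X}.$$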
A visual illustration of the proof is presented in Figure 2. The main idea is to use the first-order Taylor approximation as a linear upper bound on $\log$, so that the error in using $\mu_X - \epsilon_X$ to approximate $\mu_X$ can be translated to the log scale. Hence the additional term is inversely proportional to $\mu_X - \epsilon_X$: its coefficient $1 / (\mu_X - \epsilon_X)$ is the derivative of $\log$ at that point, which is the slope of the linear upper bound.
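Proof. Since $\log$ is a strictly concave function, the first-order Taylor approximation (at $\mu_X - \epsilon_X$) gives a linear upper bound:
$$\log t \le \log(\mu_X - \epsilon_X) + \frac{t - (\mu_X - \epsilon_X)}{\mu_X - \epsilon_X} =: \ell(t) \quad \text{for all } t > 0.$$
By monotonicity of the logarithm and $\nu_X \ge \mu_X - \epsilon_X$, we have $\log(\mu_X - \epsilon_X) \le \log \nu_X$. The logarithm can be bounded from above by the linear upper bound $\ell$ at $t = \mu_X$, which yields
$$\log \mu_X \le \ell(\mu_X) = \log(\mu_X - \epsilon_X) + \frac{\epsilon_X}{\mu_X - \epsilon_X} \le \log \nu_X + \frac{\epsilon_X}{\mu_X - \epsilon_X}.$$
Notice that $\log \nu_X = \nu_Y$ (since $\log$ is strictly monotonic), so that we can plug in $\nu_Y$. Now the premise $|\mu_Y - \nu_Y| \le \epsilon_Y$ combined with this inequality again yields $\log \mu_X \le \mu_Y + \epsilon_Y + \frac{\epsilon_X}{\mu_X - \epsilon_X}$, concluding the proof. ∎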
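The main takeaway of the proposition is that if the dispersion of $X$ is sufficiently small, then minimizing the standard deviations of $X$ and $Y$ amounts to minimizing an upper bound on the gap $\log \mu_X - \mu_Y$. We summarize it by the following corollary:
Corollary. Let $X$ be an unbiased estimator for the marginal likelihood $p_\theta(x)$, and let $Y = \log X$. Denote by $\sigma_X$ and $\sigma_Y$ the standard deviations of $X$ and $Y$, respectively. Then, whenever $\sigma_X < p_\theta(x)$,
$$\log p_\theta(x) - \mathbb{E}[Y] \le \sigma_Y + \frac{\sigma_X}{p_\theta(x) - \sigma_X}.$$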
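As a sanity check, the sketch below compares the variational gap with a dispersion-based bound on simulated data. It assumes the bound takes the form $\log \mu_X - \mu_Y \le \sigma_Y + \sigma_X / (\mu_X - \sigma_X)$ for $\sigma_X < \mu_X$, which is what one obtains by combining a linear upper bound on $\log$ with the fact that the mean and median differ by at most one standard deviation. The toy estimator (an average of `K` unit-mean log-normal ratios) and the helper `gap_and_bound` are illustrative assumptions, and all moments are empirical.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rep = 50_001      # number of simulated draws of the estimator X

def gap_and_bound(K, s=1.0):
    """Empirical gap log(mean X) - mean(log X) and the assumed dispersion bound."""
    z = rng.standard_normal((n_rep, K))
    x = np.exp(s * z - 0.5 * s**2).mean(axis=1)   # X: average of K unit-mean ratios
    y = np.log(x)                                 # Y = log X
    mu_x, sigma_x, sigma_y = x.mean(), x.std(), y.std()
    gap = float(np.log(mu_x) - y.mean())
    # The bound is only defined when sigma_X < mu_X
    bound = float(sigma_y + sigma_x / (mu_x - sigma_x)) if sigma_x < mu_x else float("inf")
    return gap, bound

results = {K: gap_and_bound(K) for K in (4, 16, 64)}
for K, (gap, bound) in results.items():
    print(f"K={K:3d}  gap ~ {gap:.4f}  bound ~ {bound:.4f}")
```

In this toy setting the bound is loose but shrinks together with the gap as $K$ grows, which is the qualitative behaviour described above.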
We would like to thank Kris Sankaran for proofreading the note.
Importance weighted autoencoders. In International Conference on Learning Representations, 2015.
International Conference on Machine Learning, 2019.