 # Note on the bias and variance of variational inference

In this note, we study the relationship between the variational gap and the variance of the (log) likelihood ratio. We show that the gap can be upper bounded by some form of dispersion measure of the likelihood ratio, which suggests the bias of variational inference can be reduced by making the distribution of the likelihood ratio more concentrated, such as via averaging and variance reduction.

## 1 Introduction

Let $v$ and $h$ denote the observed and unobserved random variables, following a joint density function $p_\theta(v,h)$. Generally, the log marginal likelihood $\log p_\theta(v) = \log \int p_\theta(v,h)\,dh$ is not tractable, so the maximum likelihood principle cannot be readily applied to estimate the model parameter $\theta$. Instead, one can maximize the evidence lower bound (ELBO):

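$$\log p_\theta(v) = \log \mathbb{E}_{q_\phi(h)}\left[\frac{p_\theta(v,h)}{q_\phi(h)}\right] \geq \mathbb{E}_{q_\phi(h)}\left[\log \frac{p_\theta(v,h)}{q_\phi(h)}\right] := \mathcal{L}(\theta,\phi)$$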

where the inequality becomes an equality if and only if $q_\phi(h) = p_\theta(h \mid v)$, since $\log$ is a strictly concave function. This way, learning and inference can be jointly achieved by maximizing $\mathcal{L}$ with respect to $\theta$ and $\phi$, respectively.
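As a concrete check of the ELBO inequality, the following minimal sketch uses a toy model that is not part of this note: $h \sim \mathcal{N}(0,1)$ and $v \mid h \sim \mathcal{N}(h,1)$, so that $p(v) = \mathcal{N}(v; 0, 2)$ and the posterior is $\mathcal{N}(v/2, 1/2)$. The helper names `log_normal_pdf` and `elbo_estimate` are hypothetical and chosen for illustration only.

```python
import numpy as np


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2


def elbo_estimate(v, q_mean, q_std, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(v, h) - log q(h)] for the toy model."""
    h = rng.normal(q_mean, q_std, size=num_samples)
    log_joint = log_normal_pdf(h, 0.0, 1.0) + log_normal_pdf(v, h, 1.0)
    return np.mean(log_joint - log_normal_pdf(h, q_mean, q_std))


rng = np.random.default_rng(0)
v = 1.5
# Exact log marginal likelihood: v ~ N(0, 2) in this toy model (assumption).
log_pv = log_normal_pdf(v, 0.0, np.sqrt(2.0))
# Proposal equal to the prior N(0, 1): the bound is strictly below log p(v).
elbo_prior = elbo_estimate(v, 0.0, 1.0, 200_000, rng)
# Proposal equal to the exact posterior N(v/2, 1/2): the ratio is constant in h.
elbo_posterior = elbo_estimate(v, v / 2.0, np.sqrt(0.5), 200_000, rng)

print(f"log p(v)          = {log_pv:.4f}")
print(f"ELBO, q=prior     = {elbo_prior:.4f}")
print(f"ELBO, q=posterior = {elbo_posterior:.4f}")
```

With the posterior as proposal, every sampled likelihood ratio equals $p(v)$, so the estimate matches $\log p(v)$ up to floating-point error. With the prior as proposal, the ratio is dispersed and the bound is visibly loose.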

Alternatively, one can maximize another family of lower bounds due to Burda et al. (2015):

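$$\log p_\theta(v) = \log \mathbb{E}_{h_j \sim q_\phi(h)}\left[\frac{1}{K}\sum_{j=1}^{K} \frac{p_\theta(v,h_j)}{q_\phi(h_j)}\right] \geq \mathbb{E}_{h_j \sim q_\phi(h)}\left[\log \frac{1}{K}\sum_{j=1}^{K} \frac{p_\theta(v,h_j)}{q_\phi(h_j)}\right] := \mathcal{L}_K(\theta,\phi)$$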

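which we call the importance weighted lower bound (IWLB). Clearly $\mathcal{L}_1 = \mathcal{L}$. An appealing property of this family of lower bounds is that $\mathcal{L}_K$ is monotonic, i.e. $\mathcal{L}_K \geq \mathcal{L}_M$ if $K \geq M$, and can be made arbitrarily close to $\log p_\theta(v)$ provided $K$ is sufficiently large.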
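The monotonicity of the IWLB can be observed numerically. The sketch below assumes the same kind of toy model as a stand-in ($h \sim \mathcal{N}(0,1)$, $v \mid h \sim \mathcal{N}(h,1)$, with the prior used as proposal); it is an illustration under those assumptions, not the experimental setup of any cited work. The function `iwlb_estimate` is a hypothetical helper.

```python
import numpy as np


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2


def iwlb_estimate(v, K, num_repeats, rng):
    """Monte Carlo estimate of L_K with the prior N(0, 1) as the proposal q."""
    h = rng.normal(0.0, 1.0, size=(num_repeats, K))
    # With q equal to the prior, log p(v, h_j) - log q(h_j) reduces to log p(v | h_j).
    log_w = log_normal_pdf(v, h, 1.0)
    # Log of the averaged ratio, computed stably via log-sum-exp over the K samples.
    log_avg = np.logaddexp.reduce(log_w, axis=1) - np.log(K)
    return log_avg.mean()


rng = np.random.default_rng(0)
v = 1.5
log_pv = log_normal_pdf(v, 0.0, np.sqrt(2.0))  # exact value in the toy model
Ks = [1, 4, 16, 64]
bounds = [iwlb_estimate(v, K, 50_000, rng) for K in Ks]

for K, value in zip(Ks, bounds):
    print(f"K = {K:3d}: L_K ~ {value:.4f}, gap ~ {log_pv - value:.4f}")
```

The estimated bounds increase with $K$ and approach $\log p(v)$ from below, in line with the property stated above.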

One interpretation for this is that by weighting the samples according to the importance ratio $p_\theta(v,h_j)/q_\phi(h_j)$, we are effectively correcting or biasing the proposal towards the true posterior $p_\theta(h \mid v)$; see Cremer et al. (2017) for more details. Another interpretation due to Nowozin (2018) is to view $\mathcal{L}_K$ as a biased estimator for $\log p_\theta(v)$, where the bias is of the order $\mathcal{O}(K^{-1})$.

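We take a different view by looking at the variance, or some notion of dispersion, of the averaged likelihood ratio. We write $X_K := \frac{1}{K}\sum_{j=1}^{K} \frac{p_\theta(v,h_j)}{q_\phi(h_j)}$ as the average before $\log$ is applied, and $Y_K := \log X_K$. The variational gap, $\log p_\theta(v) - \mathcal{L}_K = \log \mathbb{E}[X_K] - \mathbb{E}[Y_K]$, is caused by (1) the strict concavity of $\log$, and (2) the dispersion of $X_K$. To see this, one can view the expectation $(\mathbb{E}[X_K], \mathbb{E}[Y_K])$ as the centroid of uncountably many points $(X_K, Y_K)$ weighted by their probability density, which lies below the graph of $\log$. By using a larger number of samples, the distribution of $X_K$ becomes more concentrated around its expectation $\mathbb{E}[X_K] = p_\theta(v)$, pushing the "centroid" up to be closer to the graph of $\log$. See Figure 1 for an illustration.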

This intuition has been exploited, and ideas of correlating the likelihood ratios of a joint proposal have been proposed in Klys et al. (2018); Wu et al. (2019); Huang et al. (2019). Even though attempts have been made to establish the connection between the variance of $X_K$ and the gap (or bias) $\log p_\theta(v) - \mathcal{L}_K$, the obtained results are asymptotic and require further assumptions on boundedness (such as uniform integrability) of the sequence $\{X_K\}_{K \geq 1}$, which makes the results harder to interpret. (For example, see Klys et al. (2018); Huang et al. (2019), who seek to minimize the variance to improve the variational approximation, and Maddison et al. (2017); Domke and Sheldon (2018), who analyze the asymptotic bias by looking at the variance of $X_K$.) Rather than bounding the asymptotic bias by the variance of $X_K$, we analyze the non-asymptotic relationship between the gap and the variance of $X_K$ and $Y_K$. Our finding justifies exploiting the structure of the likelihood ratios of a joint proposal, as anti-correlation among the likelihood ratios serves to further reduce the variance of an average, which, as we show in the next section, upper bounds the variational gap.

Figure 1: Visualizing the reduction in variational gap and the concentration of the distribution of the likelihood ratio due to averaging. x-axis: $X_K$. y-axis: $Y_K$. The dotted lines indicate the expected values, and the solid line is the $\log$ function.
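To illustrate how averaging and anti-correlation concentrate the likelihood ratio, the sketch below compares a single ratio, an average of two independent ratios, and an average of an antithetic pair. The ratio $w(z) = \exp(\sigma z - \sigma^2/2)$ with $z \sim \mathcal{N}(0,1)$ is a hypothetical stand-in for $p_\theta(v,h)/q_\phi(h)$ with mean exactly 1; it is an assumption made for illustration and is not the construction used in the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 400_000


def weight(z):
    """Positive ratio with mean 1 under z ~ N(0, 1); stand-in for p(v, h) / q(h)."""
    return np.exp(sigma * z - 0.5 * sigma**2)


z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

estimators = {
    "single (K=1)": weight(z1),
    "iid pair (K=2)": 0.5 * (weight(z1) + weight(z2)),
    # z1 and -z1 have the same marginal law, but their weights are anti-correlated.
    "antithetic pair (K=2)": 0.5 * (weight(z1) + weight(-z1)),
}

stats = {}
for name, x in estimators.items():
    variance = x.var()
    gap = np.log(1.0) - np.log(x).mean()  # log E[X] = 0 since the true mean is 1
    stats[name] = (variance, gap)
    print(f"{name:22s} Var(X) ~ {variance:.3f}, gap ~ {gap:.3f}")
```

In this toy setting the antithetic pair has both the smallest variance and the smallest gap, which is the qualitative behaviour the argument above relies on.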

## 2 Bounding the gap via central tendency

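Let $\mu_X$ and $\nu_X$ be the mean and median of a random variable $X$ (we assume there is a unique median to simplify the analysis), i.e.

$$\mu_X := \mathbb{E}[X], \qquad \mathbb{P}(X \geq \nu_X) \wedge \mathbb{P}(X \leq \nu_X) \geq \frac{1}{2}$$

Here we assume $X$ is a positive random variable. One can think of it as $X_K$, or some other unbiased estimate of $p_\theta(v)$. By Jensen's inequality, we know $\log \mu_X \geq \mu_Y$, where $Y := \log X$. We want to bound the gap $\log \mu_X - \mu_Y$ via some notion of dispersion of $X$ and $Y$. Now assume $|\mu_X - \nu_X| \leq C_X$ and $|\mu_Y - \nu_Y| \leq C_Y$. The constants $C_X$ and $C_Y$ correspond to the dispersion just mentioned. For example, the following lemma shows that $C_X$ can be taken to be the standard deviation $\sigma_X$ (the case $p = 2$):

###### Lemma 1.

For $p \geq 1$ and $X \in L^p$, we have $|\mu_X - \nu_X| \leq \|X - \mu_X\|_p$.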

###### Proof.

Using the fact that the median minimizes the mean absolute error and Jensen’s inequality, we have

$$|\mu_X - \nu_X| = |\mathbb{E}[X - \nu_X]| \leq \mathbb{E}[|X - \nu_X|] \leq \mathbb{E}[|X - \mu_X|] \leq \|X - \mu_X\|_p$$

∎
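The chain of inequalities in the proof can be checked on an empirical distribution, for which each step holds exactly. The sketch below uses a log-normal sample as an arbitrary skewed example (an assumption for illustration) and takes $p = 2$, so the last term is the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_001)  # skewed, positive sample

mu = x.mean()
nu = np.median(x)

chain = [
    abs(mu - nu),            # |mu_X - nu_X|
    np.abs(x - nu).mean(),   # E|X - nu_X|: the median minimizes mean absolute error
    np.abs(x - mu).mean(),   # E|X - mu_X|
    x.std(),                 # ||X - mu_X||_2, the (population) standard deviation
]

for label, value in zip(["|mu - nu|", "E|X - nu|", "E|X - mu|", "std"], chain):
    print(f"{label:10s} = {value:.4f}")
```

The printed values are non-decreasing from top to bottom, matching the order of the terms in the displayed inequality.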

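Without further assumptions, we can derive a weaker result. Since $\log$ is strictly monotonic, $\nu_Y = \log \nu_X$, so we have $\log \mu_X - \mu_Y = \log \mu_X - \log \nu_X + \nu_Y - \mu_Y$. Since $\nu_X \geq \mu_X - C_X$ (and provided $\mu_X > C_X$, so that the logarithm is defined), by monotonicity of $\log$,

$$\log \mu_X - \mu_Y \leq \log \mu_X - \log(\mu_X - C_X) + \nu_Y - \mu_Y$$

which after arrangement gives $\log(\mu_X - C_X) - \mu_Y \leq \nu_Y - \mu_Y \leq C_Y$. This means that if $C_X$ is small enough so that the difference between $\log \mu_X$ and $\log(\mu_X - C_X)$ can be neglected, then the gap of interest is bounded by the dispersion of $Y$, $C_Y$.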

Now, we quantify the error between $\log \mu_X$ and $\log(\mu_X - C_X)$ by the following result:

Figure 2: Bounding the variational gap using a linear majorizer (green curve) of $\log$ (blue curve) at $x = \nu_X$.

###### Proposition 2.

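Let $X$ be a positive random variable with finite mean $\mu_X$, and $Y := \log X$ with finite mean $\mu_Y$. Assume

$$|\mu_X - \nu_X| \leq C_X \quad (*) \qquad \text{and} \qquad |\mu_Y - \nu_Y| \leq C_Y \quad (\dagger)$$

for some constants $C_X, C_Y \geq 0$. If $C_X < \mu_X$, then

$$\log \mu_X - \mu_Y \leq \frac{C_X}{\mu_X - C_X} + C_Y$$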

A visual illustration of the proof is presented in Figure 2. The main idea is to use the first-order Taylor approximation as a linear upper bound on $\log$, so that the error in using $\nu_X$ to approximate $\mu_X$ can be translated to the log scale. Hence the additional term is inversely proportional to $\nu_X$: it scales with $1/\nu_X$, the derivative of $\log$ at $\nu_X$, which is the slope of the linear upper bound.

###### Proof.

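Since $\log$ is a strictly concave function, the first-order Taylor approximation (at $\nu_X$) gives a linear upper bound:

$$f(x) := \frac{1}{\nu_X}(x - \nu_X) + \log \nu_X \geq \log(x)$$

By monotonicity of the logarithm and $(*)$, $\log \mu_X \leq \log(\nu_X + C_X)$. The logarithm can be bounded from above by the linear upper bound $f$, which yields

$$\log \mu_X - \mu_Y \leq f(\nu_X + C_X) - \mu_Y = \frac{C_X}{\nu_X} + \log \nu_X - \mu_Y$$

Notice that $\log \nu_X = \nu_Y$ (since $\log$ is strictly monotonic), so that we can plug in $(\dagger)$ to get $\log \nu_X - \mu_Y \leq C_Y$. Now the premise $C_X < \mu_X$ combined with $(*)$ again yields $\nu_X \geq \mu_X - C_X > 0$, hence $\frac{C_X}{\nu_X} \leq \frac{C_X}{\mu_X - C_X}$, concluding the proof. ∎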
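The bound of the proposition can be sanity-checked on an empirical distribution. The sketch below takes the tightest admissible constants, $C_X = |\mu_X - \nu_X|$ and $C_Y = |\mu_Y - \nu_Y|$, and uses a log-normal sample as an arbitrary example (an assumption for illustration). An odd sample size is used so that the empirical median is a single order statistic and $\nu_Y = \log \nu_X$ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Odd sample size: the median is one of the samples, so median(log x) == log(median(x)).
x = rng.lognormal(mean=0.0, sigma=0.5, size=100_001)
y = np.log(x)

mu_x, nu_x = x.mean(), np.median(x)
mu_y, nu_y = y.mean(), np.median(y)

# Tightest constants satisfying (*) and (†) for this empirical distribution.
c_x = abs(mu_x - nu_x)
c_y = abs(mu_y - nu_y)

gap = np.log(mu_x) - mu_y              # log(mu_X) - mu_Y, nonnegative by Jensen
bound = c_x / (mu_x - c_x) + c_y       # right-hand side of the proposition

print(f"C_X = {c_x:.4f} (< mu_X = {mu_x:.4f}), C_Y = {c_y:.4f}")
print(f"gap = {gap:.4f} <= bound = {bound:.4f}")
```

Because the constants are chosen as tightly as possible, the bound in this example is close to the gap itself; looser constants such as the standard deviations give a more conservative bound.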

The main takeaway of the proposition is that if the dispersion of $X$ is sufficiently small, then minimizing the standard deviation of $X$ and $Y$ amounts to minimizing an upper bound on the gap $\log \mu_X - \mu_Y$. We summarize it by the following Corollary:

###### Corollary 3.

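Let $X$ be an unbiased estimator for the marginal likelihood $p_\theta(v)$, and let $Y := \log X$. Denote by $\sigma_X$ and $\sigma_Y$ the standard deviation of $X$ and $Y$, respectively. Then

$$\sigma_X < p_\theta(v) \;\Longrightarrow\; \log p_\theta(v) - \mathbb{E}[Y] \leq \frac{\sigma_X}{p_\theta(v) - \sigma_X} + \sigma_Y$$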
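As an illustration of how the bound of Proposition 2 behaves under averaging when the standard deviations are used as dispersion constants (Lemma 1 with $p = 2$), the sketch below estimates the gap and the bound for $X_K$ at several values of $K$. The likelihood ratio $\exp(\sigma z - \sigma^2/2)$ with $z \sim \mathcal{N}(0,1)$ is a hypothetical stand-in with mean exactly 1, so $p_\theta(v) = 1$ here by construction; all quantities are Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, num_repeats = 0.5, 50_000
p_v = 1.0  # the stand-in ratios below have mean exactly 1 (assumption)

results = {}
for K in (1, 4, 16, 64):
    z = rng.standard_normal((num_repeats, K))
    x = np.exp(sigma * z - 0.5 * sigma**2).mean(axis=1)  # X_K: average of K ratios
    y = np.log(x)                                        # Y_K = log X_K

    sigma_x, sigma_y = x.std(), y.std()
    gap = np.log(p_v) - y.mean()                         # log p(v) - E[Y_K]
    bound = sigma_x / (p_v - sigma_x) + sigma_y          # requires sigma_x < p_v
    results[K] = (gap, bound, sigma_x)
    print(f"K = {K:3d}: sigma_X ~ {sigma_x:.3f}, gap ~ {gap:.4f}, bound ~ {bound:.4f}")
```

Both the gap and the bound shrink as $K$ grows, although the bound is far from tight in this example.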

We would like to thank Kris Sankaran for proofreading the note.

## References

• Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2015.
• Cremer et al. (2017) Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.
• Domke and Sheldon (2018) Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pages 4470–4479, 2018.
• Huang et al. (2019) Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron Courville. Hierarchical importance weighted autoencoders. In International Conference on Machine Learning, 2019.
• Klys et al. (2018) Jack Klys, Jesse Bettencourt, and David Duvenaud. Joint importance sampling for variational inference. 2018.
• Maddison et al. (2017) Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.
• Nowozin (2018) Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018.
• Wu et al. (2019) Mike Wu, Noah Goodman, and Stefano Ermon. Differentiable antithetic sampling for variance reduction in stochastic variational inference. In Artificial Intelligence and Statistics (AISTATS), 2019.