SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models

Standard variational lower bounds used to train latent variable models produce biased estimates of most quantities of interest. We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series. If parameterized by an encoder-decoder architecture, the parameters of the encoder can be optimized to minimize its variance of this estimator. We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost. This estimator also allows use of latent variable models for tasks where unbiased estimators, rather than marginal likelihood lower bounds, are preferred, such as minimizing reverse KL divergences and estimating score functions.



page 1

page 2

page 3

page 4


Double Control Variates for Gradient Estimation in Discrete Latent Variable Models

Stochastic gradient-based optimisation for discrete latent variable mode...

Unbiased approximations of products of expectations

We consider the problem of approximating the product of n expectations w...

A Fourier View of REINFORCE

We show a connection between the Fourier spectrum of Boolean functions a...

Filtering Variational Objectives

When used as a surrogate objective for maximum likelihood estimation in ...

Sandwiching the marginal likelihood using bidirectional Monte Carlo

Computing the marginal likelihood (ML) of a model requires marginalizing...

MuProp: Unbiased Backpropagation for Stochastic Neural Networks

Deep neural networks are powerful parametric models that can be trained ...

Bi-level Doubly Variational Learning for Energy-based Latent Variable Models

Energy-based latent variable models (EBLVMs) are more expressive than co...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Latent variable models are powerful tools for constructing highly expressive data distributions and for understanding how high-dimensional observations might possess a simpler representation. Latent variable models are often framed as probabilistic graphical models, allowing these relationships to be expressed in terms of conditional independence. Mixture models, probabilistic principal component analysis 

(Tipping & Bishop, 1999)

, hidden Markov models, and latent Dirichlet allocation 

(Blei et al., 2003)

are all examples of powerful latent variable models. More recently there has been a surge of interest in probabilistic latent variable models that incorporate flexible nonlinear likelihoods via deep neural networks 

(Kingma & Welling, 2014)

. These models can blend the advantages of highly structured probabilistic priors with the empirical successes of deep learning 

(Johnson et al., 2016; Luo et al., 2018)

. Moreover, these explicit latent variable models can often yield relatively interpretable representations, in which simple interpolation in the latent space can lead to semantically-meaningful changes in high-dimensional observations (e.g.,

Higgins et al. (2017)).

It can be challenging, however, to fit the parameters of a flexible latent variable model, since computing the marginal likelihood of the data requires integrating out the latent variables. Typical approaches to this problem include the celebrated expectation maximization algorithm 

(Dempster et al., 1977)

, Markov chain Monte Carlo, and the Laplace approximation. Variational inference generalizes expectation maximization by forming a lower bound on the aforementioned (log) marginal likelihood, using a tractable approximation to the unmanageable posterior over latent variables. The maximization of this lower bound—rather than the true log marginal likelihood—is often relatively straightforward when using automatic differentiation and Monte Carlo sampling. However, a lower bound may be ill-suited for tasks such as posterior inference and other situations where there exists an entropy maximization objective; for example in entropy-regularized reinforcement learning 

(Williams & Peng, 1991; Mnih et al., 2016; Norouzi et al., 2016) which requires minimizing the log probability of the samples under the model.

While there is a long history in Bayesian statistics of estimating the marginal likelihood (e.g.,

Newton & Raftery (1994); Neal (2001)), we often want high-quality estimates of the logarithm of the marginal likelihood, which is better behaved when the data is high dimensional; it is not as susceptible to underflow and it has gradients that are numerically sensible. However, the log transformation introduces some challenges: Monte Carlo estimation techniques such as importance sampling do not straightforwardly give unbiased estimates of this quantity. Nevertheless, there has been significant work to construct estimators of the log marginal likelihood in which it is possible to explicitly trade off between bias against computational cost (Burda et al., 2016; Bamler et al., 2017; Nowozin, 2018). Unfortunately, while there are asymptotic regimes where the bias of these estimators approaches zero, it is always possible to optimize the parameters to increase this bias to infinity.

In this work, we construct an unbiased estimator of the log marginal likelihood. Although there is no theoretical guarantee that this estimator has finite variance, we find that it can work well in practice. We show that this unbiased estimator can train latent variable models to achieve higher test log-likelihood than lower bound estimators at the same expected compute cost. More importantly, this unbiased estimator allows us to apply latent variable models in situations where these models were previously problematic to optimize with lower bound estimators. Such applications include latent variable modeling for posterior inference and for reinforcement learning in high-dimensional action spaces, where an ideal model is one that is highly expressive yet efficient to sample from.

2 Preliminaries

2.1 Latent variable models

Latent variable models (LVMs) describe a distribution over data in terms of a mixture over unobserved quantities. Let  be a family of probability density (mass) functions on a data space , indexed by parameters . We will generally refer to this as a “density” for consistency, even when the data should be understood to be discrete; similarly we will use integrals even when the marginalization is over a discrete set. In a latent variable model,  is defined via a space of latent variables , a family of mixing measures on this latent space with density denoted , and a conditional distribution . This conditional distribution is sometimes called an “observation model” or a conditional likelihood. We will take  to parameterize both  and  in the service of determining the marginal  via the mixture integral:


This simple formalism allows for a large range of modeling approaches, in which complexity can be baked into the latent variables (as in traditional graphical models), into the conditional likelihood (as in variational autoencoders), or into both (as in structured VAEs). The downside of this mixing approach is that the integral may be intractable to compute, making it difficult to evaluate 

—a quantity often referred to in Bayesian statistics and machine learning as the

marginal likelihood or evidence. Various Monte Carlo techniques have been developed to provide consistent and often unbiased estimators of , but it is usually preferable to work with  and unbiased estimation of this quantity has, to our knowledge, not been previously studied.

2.2 Training latent variable models

Fitting a parametric distribution to observed data is often framed as the minimization of a difference between the model distribution and the empirical distribution. The most common difference measure is the forward Kullback-Leibler (KL) divergence; if  is the empirical distribution and  is a parametric family, then minimizing the KL divergence () with respect to  is equivalent to maximizing the likelihood:


Equivalently, the optimization problem of finding the MLE parameters  comes down to maximizing the expected log probability of the data:


Since expectations can be estimated in an unbiased manner using Monte Carlo procedures, simple subsampling of the data enables powerful stochastic optimization techniques, with stochastic gradient descent in particular forming the basis for learning the parameters of many nonlinear models. However, this requires unbiased estimates of

, which are not available for latent variable models. Instead, a stochastic lower bound of is often used and then differentiated for optimization.

Though many lower bound estimators (Burda et al., 2016; Bamler et al., 2017; Nowozin, 2018) are applicable, we focus on an importance-weighted evidence lower bound (Burda et al., 2016). This lower bound is constructed by introducing a proposal distribution  and using it to form an importance sampling estimate of the marginal likelihood:


If  samples are drawn from  then this provides an unbiased estimate of  and the biased “importance-weighted autoencoder” estimator  of is given by


The special case of generates an unbiased estimate of the evidence lower bound (ELBO), which is often used for performing variational inference by stochastic gradient descent. While the IWAE lower bound acts as a useful replacement of in maximum likelihood training, it may not be suitable for other objectives such as those that involve entropy maximization. We discuss tasks for which a lower bound estimator would be ill-suited in Section 3.4.

There are two properties of IWAE that will allow us to modify it to produce an unbiased estimator: First, it is consistent in the sense that as the number of samples  increases, the expectation of converges to . Second, it is also monotonically non-decreasing in expectation:


These properties are sufficient to create an unbiased estimator using the Russian roulette estimator.

2.3 Russian roulette estimator

In order to create an unbiased estimator of the log probability function, we employ the Russian roulette estimator (Kahn, 1955). This estimator is used to estimate the sum of infinite series, where each sample of the estimator almost surely requires only a finite amount of computation. Intuitively, the Russian roulette estimator relies on a randomized truncation and upweighting of each term to account for the possibility of not computing the remaining terms.

To illustrate the idea, let denote the -th term of an infinite series. Assume the partial sum of the series converges to some quantity we wish to obtain. We can construct a simple estimator by always computing the first term then flipping a coin to determine whether we stop or continue evaluating the remaining terms. With probability , we compute the rest of the series. By reweighting the remaining future terms by , we obtain an unbiased estimator:

To obtain the “Russian roulette” (RR) estimator (Forsythe & Leibler, 1950)

, we repeatedly apply this trick to the remaining terms. In effect, we make the number of terms a random variable 

, taking values in  to use in the summation (i.e., the number of coin flips) from some distribution with probability mass function  with support over the positive integers. With  drawn from , the estimator takes the form:


The equality on the right hand of equation 7 holds so long as (i) , and (ii) the series converges absolutely, i.e.,  (Chen et al. (2019)

; Lemma 3). This condition ensures that the average of multiple samples will converge to the value of the infinite series by the law of large numbers. However, the variance of this estimator depends on the choice of

and can potentially be very large or even infinite (McLeish, 2011; Rhee & Glynn, 2015; Beatson & Adams, 2019).

3 SUMO: Unbiased estimation of log probability for LVMs

3.1 Russian roulette to debias lower bounds

We can turn any absolutely convergent series into a telescoping series and apply the Russian roulette randomization to form an unbiased stochastic estimator. We focus here on the IWAE bound described in Section 2.2. Let , then since converges absolutely, we apply equation 7 to construct our estimator, which we call SUMO (Stochastically Unbiased Marginalization Objective). The detailed derivation of SUMO is in Appendix A.1.


The randomized truncation of the series using the Russian roulette estimator means that this is an unbiased estimator of the log marginal likelihood, regardless of the distribution :


where the expectation is taken over and (see Algorithm 1 for our exact sampling procedure). Furthermore, under some conditions, we have (see Appendix A.4).

1:, , encoder , decoder , , reverse_cdf
6:cum_iwae log_cumsum_exp
7:inv_weights = reverse_cdf
8:cum_iwae- + suminv_weights * cum_iwae: - cum_iwae-:-
Algorithm 1 Computing SUMO, an unbiased estimator of .

3.2 Optimizing variance-compute product by choice of

To efficiently optimize a limit, one should choose an estimator to minimize the product of the second moment of the

gradient estimates and the expected compute cost per evaluation. The choice of  effects both the variance and computation cost of our estimator. Denoting  and , the Russian roulette estimator is optimal across a broad family of unbiased randomized truncation estimators if the are statistically independent, in which case it has second moment  (Beatson & Adams, 2019). While the are not in fact strictly independent with our sampling procedure (Algorithm 1), and other estimators within the family may perform better, we justify our choice by showing that for  converges to zero much faster than (Appendices A.2 & A.3). In the following, we assume independence of and choose to minimize the product of compute and variance.

We first show that is (Appendix A.5). This implies the optimal compute-variance product (Rhee & Glynn, 2015; Beatson & Adams, 2019) is given by . In our case, this gives , which results in an estimator with infinite expected computation and no finite bound on variance. In fact, any which gives rise to provably finite variance requires a heavier tail than and so will have infinite expected computation.

Though we could not theoretically show that our estimator and gradients have finite variance, we empirically find that gradient descent converges —even in the setting of minimizing log probability. We plot

for the toy variational inference task used to assess signal to noise ratio in 

Tucker et al. (2018) and Rainforth et al. (2018b), and find that they converge faster than in practice (Appendix A.6). While this indicates the variance is better than the theoretical bound, an estimator having infinite expected computation cost will always be an issue as it indicates significant probability of sampling arbitrarily large . We therefore modify the tail of the sampling distribution such that the estimator has finite expected computation:


We typically choose , which gives an expected computation cost of approximately 5 terms.

3.2.1 Trading variance and compute

One way to improve the RR estimator is to construct it so that some minimum number of terms (denoted here as ) are always computed. This puts a lower bound on the computational cost, but can potentially lower variance, providing a design space for trading off estimator quality against computational cost. This corresponds to a choice of RR estimator in which  for . This computes the sum out to  terms (effectively computing ) and then estimates the remaining difference with Russian roulette:


In practice, instead of tuning parameters of , we set to achieve a given expected computation cost per estimator evaluation for fair comparison with IWAE and related estimators.

3.3 Training to reduce variance

The SUMO estimator does not require amortized variational inference, but the use of an “encoder” to produce an approximate posterior  has been shown to be a highly effective way to perform rapid feedforward inference in neural latent variable models. We use to denote the parameters of the encoder . However, the gradients of SUMO with respect to  are in expectation zero precisely because SUMO is an unbiased estimator of , regardless of our choice of . Nevertheless, we would expect the choice of significantly impacts the variance of our estimator. As such, we optimize to reduce the variance of the SUMO estimator. We can obtain unbiased gradients in the following way (Ruiz et al., 2016; Tucker et al., 2017):


Notably, the expectation of this estimator depends on the variance of SUMO, which we have not been able to bound. In practice, we observe gradients which are sometimes very large. We apply gradient clipping to the encoder to clip gradients which are excessively large in magnitude. This helps stabilize the training progress but introduces bias into the encoder gradients. Fortunately, the encoder itself is merely a tool for variance reduction, and biased gradients with respect to the encoder can still significantly help optimization.

3.4 Applications of unbiased log probability

Here we list some applications where an unbiased log probability is useful. Using SUMO to replace existing lower bound estimates allows latent variable models to be used for new applications where a lower bound is inappropriate. As latent variable models can be both expressive and efficient to sample from, they are frequently useful in applications where the data is high-dimensional and samples from the model are needed.

Minimizing .

Some machine learning objectives include terms that seek to increase the entropy of the learned model. The “reverse KL” objective—often used for training models to perform approximate posterior inference—minimizes  where is a target density that may only be known up a normalization constant. Local updates of this form are the basis of the expectation propagation procedure (Minka, 2001)

. This objective has also been used for distilling autoregressive models that are inefficient at sampling 

(Oord et al., 2018). Moreover, reverse KL is connected to the use of entropy-regularized objectives (Williams & Peng, 1991; Ziebart, 2010; Mnih et al., 2016; Norouzi et al., 2016) in decision-making problems, where the goal is to encourage the decision maker toward exploration and prevent it from settling into a local minimum.

Unbiased score function .

The score function is the gradient of the log-likelihood with respect to the parameters and has uses in estimating the Fisher information matrix and performing stochastic gradient Langevin dynamics (Welling & Teh, 2011), among other applications. Of particular note, the REINFORCE gradient estimator (Williams, 1992)—generally applicable for optimizing objectives of the form —is estimated using the score function. This can be replaced with the gradient of SUMO which itself is an estimator of the score function .


where the inner expectation is over the stochasticity of the SUMO estimator. Such estimators are often used for reward maximization in reinforcement learning where  is a stochastic policy.

4 Related Work

There is a long history in Bayesian statistics of marginal likelihood estimation in the service of model selection. The harmonic mean estimator 

(Newton & Raftery, 1994), for example, has a long (and notorious) history as a consistent estimator of the marginal likelihood that may have infinite variance (Murray & Salakhutdinov, 2009) and exhibits simulation psuedo-bias (Lenk, 2009). The Chib estimator (Chib, 1995), the Laplace approximation, and nested sampling (Skilling, 2006) are alternative proposals that can often have better properties (Murray & Salakhutdinov, 2009). Annealed importance sampling (Neal, 2001) probably represents the gold standard for marginal likelihood estimation. These, however, turn into consistent estimators at best when estimating the log marginal probability (Rainforth et al., 2018a). Bias removal schemes such as jackknife variational inference (Nowozin, 2018) have been proposed to debias log-evidence estimation, IWAE in particular. Hierarchical IWAE (Huang et al., 2019) uses a joint proposal to induce negative correlation among samples and connects the convergence of variance of the estimator and the convergence of the lower bound.

Russian roulette also has a long history. It dates back to unpublished work from von Neumann and Ulam, who used it to debias Monte Carlo methods for matrix inversion (Forsythe & Leibler, 1950) and particle transport problems (Kahn, 1955). It has gained popularity in statistical physics (Spanier & Gelbard, 1969; Kuti, 1982; Wagner, 1987), for unbiased ray tracing in graphics and rendering (Arvo & Kirk, 1990), and for a number of estimation problems in the statistics community (Wei & Murray, 2017; Lyne et al., 2015; Rychlik, 1990, 1995; Jacob & Thiery, 2015; Jacob et al., 2017). It has also been independently rediscovered many times (Fearnhead et al., 2008; McLeish, 2011; Rhee & Glynn, 2012; Tallec & Ollivier, 2017).

The use of Russian roulette estimation in deep learning and generative modeling applications has been gaining traction in recent years. It has been used to solve short-term bias in optimization problems (Tallec & Ollivier, 2017; Beatson & Adams, 2019). Wei & Murray (2017) estimates the reciprocal normalization constant of an unnormalized density. Han et al. (2018)

uses a similar random truncation approach to estimate the distribution of eigenvalues in a symmetric matrix. Along similar motivations with our work,

Chen et al. (2019) uses this estimator to construct an unbiased estimator of the change of variables equation in the context of normalizing flows (Rezende & Mohamed, 2015), and Xu et al. (2019) uses it to construct unbiased log probability for a nonparameteric distribution in the context of variational autoencoders (Kingma & Welling, 2014).

Though we extend latent variable models to applications that require unbiased estimates of log probability and benefit from efficient sampling, an interesting family of models already fulfill these requirements. Normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2017) offer exact log probability and certain models have been proven to be universal density estimators (e.g. Huang et al. (2018)). However, these models often require restrictive architectural choices with no dimensionality-reduction capabilities, and make use of many more parameters to scale up (Kingma & Dhariwal, 2018) than alternative generative models. Discrete variable versions of these models are still in their infancy and make use of biased gradients (Tran et al., 2019; Hoogeboom et al., 2019), whereas latent variable models naturally extend to discrete observations.

5 Density Modeling Experiments

Training Objective =5 =15 =50 =5 =15 =50
ELBO (Burda et al., 2016) 86.47 86.35 107.62 107.80
IWAE (Burda et al., 2016) 85.54 84.78 106.12 104.67
ELBO (Our impl.) 85.970.01 85.990.05 85.880.07 106.790.08 106.980.19 106.840.13
IWAE (Our impl.) 85.280.01 84.890.03 84.500.02 104.960.04 104.530.05 103.990.12
JVI (Our impl.) 84.750.03 104.080.11
SUMO 85.090.01 84.710.02 84.400.03 104.850.04 104.290.12 103.790.14
Table 1: Test negative log-likelihood of the trained model, estimated using IWAE(=5000). For SUMO, refers to the expected number of computed terms.

We first compare the performance of SUMO when used as a replacement to IWAE with the same expected cost on density modeling tasks. We make use of two benchmark datasets: dynamically binarized MNIST 

(LeCun et al., 1998) and binarized OMNIGLOT (Lake et al., 2015).

We use the same neural network architecture as IWAE (Burda et al., 2016). The prior

is a 50-dimensional standard Gaussian distribution. The conditional distributions

are independent Bernoulli, with the decoder parameterized by two hidden layers, each with 200 units. The approximate posterior is also a 50-dimensional Gaussian distribution with diagonal covariance, whose mean and variance are both parameterized by two hidden layers with 200

units. We reimplemented and tuned IWAE, obtaining strong baseline results which are better than those previously reported. We then used the same hyperparameters to train with the SUMO estimator. We find clipping very large gradients can help performance, as large gradients may be infrequently sampled. This introduces a small amount of bias into the gradients while reducing variance, but can nevertheless help achieve faster convergence and should still result in a less-biased estimator. A posthoc study of the effect on final test performance as a function of this bias-variance tradeoff mechanism is discussed in Appendix 

A.7. We note that gradient clipping is only done for the density modeling experiments.

The averaged test log-likelihoods and standard deviations over 3 runs are summarized in Table 

1. To be consistent with existing literature, we evaluate our model using IWAE with samples. In all the cases, SUMO achieves slightly better performance than IWAE with the same expected cost. We also bold the results that are statistically insignificant from the best performing model according to an unpaired -test with significance level . However, we do see diminishing returns as we increase , suggesting that as we increase compute, the variance of our estimator may impact performance more than the bias of IWAE.

6 Latent variables models for entropy maximization

We move on to our first task for which a lower bound estimate of log probability would not suffice. The reverse KL objective is useful when we have access to a (possibly unnormalized) target distribution but no efficient sampling algorithm.


A major problem with fitting latent variables models to this objective is the presence of an entropy maximization term, effectively a minimization of . Estimating this log marginal probability with a lower bound estimator could result in optimizing to maximize the bias of the estimator instead of the true objective. Our experiments demonstrate that this causes IWAE to often fail to optimize the objective unless we use a large amount of computation.

Target log probability
Training w/ IWAE (k=15)
Training w/ SUMO (k=15)
Model samples
Figure 1: We trained latent variable models for posterior inference, which requires minimizing log probability under the model. Training with IWAE leads to optimizing for the bias while leaving the true model in an unstable state, whereas training with SUMO—though noisy—leads to convergence.
Modifying IWAE.

The bias of the IWAE estimator can be interpreted as the KL between an importance-weighted approximate posterior implicitly defined by the encoder and the true posterior  (Domke & Sheldon, 2018). Both the encoder and decoder parameters can therefore affect this bias. In practice, we find that the encoder optimization proceeds at a faster timescale than the decoder optimization: i.e., the encoder can match to the decoder’s posterior more quickly than the latter can match an objective. For this reason, we train the encoder to reduce bias and use a minimax training objective


Though this is still a lower bound with unbounded bias, it makes for a stronger baseline than optimizing in the same direction as . We find that this approach can indeed work in practice, but requires setting extremely high.

Figure 2: Training with reverse KL requires minimizing . SUMO estimates are unbiased and trains well, but minimizing the lower bound IWAE with small leads to estimates of .

We choose a “funnel” target distribution (Figure 1) similar to the distribution used as a benchmark for inference in Neal et al. (2003), where has support in and is defined We use neural networks with one hidden layer of 200 hidden units and activations for both the encoder and decoder networks. We use 20 latent variables, with , , and all being Gaussian distributed.

Figure 2 shows the learning curves when using IWAE and SUMO. Unless is set very large, IWAE will at some point start optimizing the bias instead of the actual objective. The reverse KL is a non-negative quantity, so any estimate significantly below zero can be attributed to the unbounded bias. On the other hand, SUMO correctly optimizes for the objective even with a small expected cost. Increasing the expected cost for SUMO simply reduces variance. We also found that if is set sufficiently large, then IWAE can work when we train using the minimax objective in equation 15, suggesting that a sufficiently debiased estimator can also work in practice. However, this requires much more compute and likely does not scale compared to SUMO. We also visualize the contours of the resulting models in Figure 1. For IWAE, we visualize the model a few iterations before it reaches numerical instability.

7 Latent variable policies for combinatorial optimization

Let us now consider the problem of finding the maximum of a non-differentiable function, a special case of reinforcement learning without an interacting environment. Variational optimization (Staines & Barber, 2012) can be used to reformulate this as the optimization of a parametric distribution,


which is now a differentiable function with respect to the parameters , whose gradients can be estimated using a combination of the REINFORCE gradient estimator and the SUMO estimator (equation 13). Furthermore, entropy regularized reinforcement learning—where we maximize with being the entropy of —encourages exploration and is inherently related to minimizing a reverse KL objective with the target being an exponentiated reward (Norouzi et al., 2016).

For concreteness, we focus on the problem of quadratic pseudo-Boolean optimization (QPBO) where the objective is to maximize



are binary variables. Without further assumptions, QPBO is NP-hard 

(Boros & Hammer, 2002). As there exist complex dependencies between the binary variables and optimization of equation 16 requires sampling from the policy distribution , a model that is both expressive and allows efficient sampling would be ideal. For this reason, we motivate the use of latent variable models with independent conditional distributions, trained using the SUMO estimator. Our baselines are an autoregressive policy, which captures dependencies but for which sampling must be performed sequentially, and an independent policy, which is easy to sample from but captures no dependencies.

We note that Haarnoja et al. (2018) also argued for latent variable policies in favor of learning diverse strategies but ultimately made use of normalizing flows which do not require marginalization.

QPBO w/ variables
QPBO w/ variables
QPBO w/ variables
& Entropy regularization
Figure 3: Latent variable policies allow faster exploration than autoregressive policy models, while being more expressive than an independent policy. SUMO works well with entropy regularization, whereas IWAE is unstable and converges to similar performance as the non-latent variable model.

We constructed one problem instance for each , which we note are already intractable for exact optimization. For each instance, we randomly sampled the weights and uniformly from the interval . Figure 3 shows the performance of each policy model. In general, the independent policy is quick to converge to a local minima and is unable to explore different regions, whereas more complex models have a better grasp of the “frontier” of reward distributions during optimization. The autoregressive model works well overall but is much slower to train due to its sequential sampling procedure; with , it is slower per iteration than SUMO. Surprisingly, we find that estimating the REINFORCE gradient with IWAE results in decent performance when no entropy regularization is present. With entropy regularization, all policies improve significantly; however, training with IWAE in this setting results in performance similar to the independent model, suggesting that it may not be using the latent state. On the other hand, SUMO works with both REINFORCE gradient estimation and entropy regularization, albeit at the cost of slower convergence due to variance.

8 Conclusion

We introduced SUMO, a new unbiased estimator of the log probability for latent variable models, and demonstrated tasks for which this estimator performs better than standard lower bounds. Specifically, we investigated applications involving entropy maximization where a lower bound performs poorly, but our unbiased estimator can train properly with relatively smaller amount of compute.

In the future, we plan to investigate new families of gradient-based optimizers which can handle heavy-tailed stochastic gradients. It may also be fruitful to investigate the use of convex combination of consistent estimators within the SUMO approach, as any convex combination is unbiased, or to apply variance reduction methods to increase stability of training with SUMO.


This work was partially funded by NSF IIS-1421780. Y.L and J.Z were supported by the NSF China Project (No. 61620106010), Beijing NSF Project (No. L172037), the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration.


  • Arvo & Kirk (1990) James Arvo and David Kirk. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.
  • Bamler et al. (2017) Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In Advances in Neural Information Processing Systems. 2017.
  • Beatson & Adams (2019) Alex Beatson and Ryan P. Adams. Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, 2019.
  • Bińkowski et al. (2018) Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • Boros & Hammer (2002) Endre Boros and Peter L Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3):155–225, 2002.
  • Burda et al. (2016) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
  • Chen et al. (2019) Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 2019.
  • Chib (1995) Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.
  • Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
  • Domke & Sheldon (2018) Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4470–4479, 2018.
  • Fearnhead et al. (2008) Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.
  • Forsythe & Leibler (1950) George E Forsythe and Richard A Leibler. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 1846–1855, 2018.
  • Han et al. (2018) Insu Han, Haim Avron, and Jinwoo Shin. Stochastic Chebyshev gradient descent for spectral optimization. In Advances in Neural Information Processing Systems, pp. 7386–7396, 2018.
  • Higgins et al. (2017) Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Machine Learning, 2017.
  • Hoogeboom et al. (2019) Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
  • Huang et al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, 2018.
  • Huang et al. (2019) Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron Courville. Hierarchical importance weighted autoencoders. In International Conference on Machine Learning, pp. 2869–2878, 2019.
  • Jacob & Thiery (2015) Pierre E Jacob and Alexandre H Thiery. On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784, 2015.
  • Jacob et al. (2017) Pierre E Jacob, John O’Leary, and Yves F Atchadé. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.
  • Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.
  • Kahn (1955) Herman Kahn. Use of different Monte Carlo sampling techniques. Santa Monica, CA: RAND Corporation, 1955. URL
  • Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kingma & Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
  • Kuti (1982) Julius Kuti. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.
  • Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lenk (2009) Peter Lenk. Simulation pseudo-bias correction to the harmonic mean estimator of integrated likelihoods. Journal of Computational and Graphical Statistics, 18(4):941–960, 2009.
  • Luo et al. (2018) Yucen Luo, Tian Tian, Jiaxin Shi, Jun Zhu, and Bo Zhang. Semi-crowdsourced clustering with deep generative models. In Advances in Neural Information Processing Systems, pp. 3212–3222, 2018.
  • Lyne et al. (2015) Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson, et al.

    On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods.

    Statistical science, 30(4):443–467, 2015.
  • McLeish (2011) Don McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 17(4):301–315, 2011.
  • Minka (2001) Thomas P Minka. Expectation propagation for approximate Bayesian inference. In

    Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence

    , pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Murray & Salakhutdinov (2009) Iain Murray and Ruslan Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (eds.), Advances in Neural Information Processing Systems 21, pp. 1137–1144. 2009.
  • Neal (2001) Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
  • Neal et al. (2003) Radford M Neal et al. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
  • Newton & Raftery (1994) Michael A Newton and Adrian E Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3–26, 1994.
  • Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731, 2016.
  • Nowozin (2018) Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel Wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 2018.
  • Rainforth et al. (2018a) Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning, 2018a.
  • Rainforth et al. (2018b) Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018b.
  • Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
  • Rhee & Glynn (2012) Chang-han Rhee and Peter W Glynn. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, pp.  17. Winter Simulation Conference, 2012.
  • Rhee & Glynn (2015) Chang-han Rhee and Peter W Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.
  • Ruiz et al. (2016) Francisco JR Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
  • Rychlik (1990) Tomasz Rychlik. Unbiased nonparametric estimation of the derivative of the mean. Statistics & probability letters, 10(4):329–333, 1990.
  • Rychlik (1995) Tomasz Rychlik.

    A class of unbiased kernel estimates of a probability density function.

    Applicationes Mathematicae, 22(4):485–497, 1995.
  • Skilling (2006) John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.
  • Spanier & Gelbard (1969) Jerome Spanier and Ely M Gelbard. Monte Carlo Principles and Neutron Transport Problems. Addison-Wesley Publishing Company, 1969.
  • Staines & Barber (2012) Joe Staines and David Barber. Variational optimization. arXiv preprint arXiv:1212.4507, 2012.
  • Tallec & Ollivier (2017) Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
  • Tipping & Bishop (1999) Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
  • Tran et al. (2019) Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible generative models of discrete data. arXiv preprint arXiv:1905.10347, 2019.
  • Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2627–2636, 2017.
  • Tucker et al. (2018) George Tucker, Dieterich Lawson, Shixiang Gu, and Chris J Maddison. Doubly reparameterized gradient estimators for Monte Carlo objectives. In International Conference on Learning Representations, 2018.
  • Wagner (1987) Wolfgang Wagner. Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33, 1987.
  • Wei & Murray (2017) Colin Wei and Iain Murray. Markov Chain Truncation for Doubly-Intractable Inference. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
  • Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688, 2011.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Williams & Peng (1991) Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
  • Xu et al. (2019) Kai Xu, Akash Srivastava, and Charles Sutton. Variational Russian roulette for deep Bayesian nonparametrics. In International Conference on Machine Learning, pp. 6963–6972, 2019.
  • Ziebart (2010) Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, figshare, 2010.

Appendix A Appendix

a.1 Derivation of SUMO


where are sampled independently from . And we define the -th term of the infinite series . Using the properties of IWAE  in equation 6, we have , and


which means the series converges absolutely. This is a sufficient condition for finite expectation of the Russian roulette estimator (Chen et al. (2019); Lemma 3). Applying equation 7 to the series:


Let , Hence our estimator is constructed:


And it can be easily seen from equation 22 and equation 23 that SUMO is an unbiased estimator of the log marginal likelihood:


a.2 Convergence of

We follow the analysis of JVI (Nowozin, 2018), which applied the delta method for moments to show the asymptotic results on the bias and variance of both at a rate of . We build on this analysis to analyze the convergence of .

Let and we define as the sample mean and we have .


We note that we rely on for this power series to converge. This condition was implicitly assumed, but not explicitly noted, in (Nowozin, 2018). This condition will hold for sufficiently large so long as the moments of exist: one could bound the probability

by Chebyshev’s inequality or by the Central Limit Theorem. We use the central moments

and for .


Expanding Eq. 28 to order two gives


Since we use cumulative sum to compute and , we obtain .


We note that and . Therefore is , and .

a.3 Convergence of

Without loss of generality, suppose ,


For clarity, let be the zero-mean random variable. Nowozin (2018) gives the relations


Expanding both the sums inside the brackets to order two:

We will proceed by bounding each of the terms , , , . First, we decompose . Let .


We know that is independent of and , implying . Note .

Now we show that is zero:

We now investigate :

We now show that is zero: