Variational Rejection Sampling

04/05/2018 ∙ by Aditya Grover, et al. ∙ 0

Learning latent variable models with stochastic variational inference is challenging when the approximate posterior is far from the true posterior, due to high variance in the gradient estimates. We propose a novel rejection sampling step that discards samples from the variational posterior which are assigned low likelihoods by the model. Our approach provides an arbitrarily accurate approximation of the true posterior at the expense of extra computation. Using a new gradient estimator for the resulting unnormalized proposal distribution, we achieve average improvements of 3.71 nats and 0.21 nats over state-of-the-art single-sample and multi-sample alternatives respectively for estimating marginal log-likelihoods using sigmoid belief networks on the MNIST dataset.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Latent variable models trained using stochastic variational inference can learn complex, high dimensional distributions [Hoffman et al., 2013, Ranganath et al., 2014]. Learning typically involves maximization of a lower bound to the intractable log-likelihood of the observed data, marginalizing over the latent, unobserved variables. To scale to large datasets, inference is amortized by introducing a recognition model approximating the true posterior over the latent variables, conditioned on the observed data [Dayan et al., 1995, Gershman and Goodman, 2014]

. The generative and recognition models are jointly trained and commonly parameterized using deep neural networks. While this provides flexibility, it also leads to expectations without any closed form expressions in the learning objective and corresponding gradients.

The general approach to stochastic optimization of such objectives involves Monte Carlo estimates of the gradients using the variational posterior (a.k.a. the recognition model) as a proposal distribution [Mnih and Rezende, 2016]. A simple feed forward network, however, may not capture the full complexity of the posterior, a difficulty which shows up in practice as high variance in the gradients estimated with respect to the parameters of the proposal distribution.

There is a vast body of prior work in variance reduction for stochastic optimization, including recent work focusing on variational methods for generative modeling. The standard approach is to use score function estimators with appropriate baselines [Glynn, 1990, Williams, 1992, Fu, 2006]. Many continuous distributions are also amenable to reparameterization, which transforms the original problem of taking gradients with respect to the parameters of the proposal to the simpler problem of taking gradients with respect to a deterministic function [Kingma and Welling, 2014, Rezende et al., 2014, Titsias and Lázaro-Gredilla, 2014]. Finally, a complementary technique for variance reduction is the use of multi-sample objectives which compute importance weighted gradient estimates based on multiple samples from the proposal [Burda et al., 2016, Mnih and Rezende, 2016]. We discuss these approaches in Section 2.

In this work, we propose a new class of estimators for variational learning based on rejection sampling. The variational rejection sampling approach modifies the sampling procedure into a two-step process: first, a proposal distribution (in our case, the variational posterior of a generative model) proposes a sample and then we explicitly accept or reject this sample based on a novel differentiable accept-reject test. The test is designed to reject samples from the variational posterior that are assigned low likelihoods by the generative model, wherein the threshold for rejection can be controlled based on the available computation.

We show how this procedure leads to a modification of the original variational posterior to a richer family of approximating resampled proposal distributions. The modification is defined implicitly [Mohamed and Lakshminarayanan, 2016]

since the only requirement from the original variational posterior is that it should permit efficient sampling. Hence, our soft accept-reject test provides a knob to smoothly interpolate between plain importance sampling with a fixed variational posterior (no rejections) to obtaining samples from the exact posterior in the limit (with potentially high rejection rate), thereby trading off statistical accuracy for computational cost. Further, even though the resampled proposal is unnormalized due to the introduction of an accept-reject test, we can surprisingly derive unbiased gradient estimates with respect to the model parameters that only require the unnormalized density estimates of the resampled proposal, leading to an efficient learning algorithm.

Empirically, we demonstrate that variational rejection sampling outperforms competing single-sample and multi-sample approaches by nats and nats respectively on average for estimating marginal log-likelihoods using sigmoid belief networks on the MNIST dataset.

2 Background

In this section, we present the setup for stochastic optimization of expectations of arbitrary functions with respect to parameterized distributions. We also discuss prior work applicable in the context of variational learning. We use upper-case symbols to denote probability distributions and assume they admit densities on a suitable reference measure, denoted by the corresponding lower-case notation.

Consider the following objective:


where and denote sets of parameters and is a parameterized sampling distribution over which can be discrete or continuous. We will assume that sampling from is efficient, and suppress subscript notation in expectations from to simply wherever the context is clear. We are interested in optimizing the expectation of a function with respect to the sampling distribution using gradient methods. In general, and the density need not be differentiable with respect to and .

Such objectives are intractable to even evaluate in general, but unbiased estimates can be obtained efficiently using Monte Carlo techniques. The gradients of the objective with respect to

are given by:

As long as is differentiable with respect to , we can compute unbiased estimates of the gradients using Monte Carlo. There are two primary class of estimators for computing gradients with respect to which we discuss next.

Score function estimators.

Using the fact that , the gradients with respect to can be expressed as:

The first term can be efficiently estimated using Monte Carlo if is differentiable with respect to . The second term, referred to as the score function estimator or the likelihood-ratio estimator or REINFORCE by different authors [Fu, 2006, Glynn, 1990, Williams, 1992], requires gradients with respect to the log density of the sampling distribution and can suffer from large variance [Glasserman, 2013, Schulman et al., 2015]. Hence, these estimators are used in conjunction with control variates (also referred to as baselines). A control variate,

, is any constant or random variable (could even be a function of

if we can correct for its bias) positively correlated with that reduces the variance of the estimator without introducing any bias:

Reparameterization estimators.

Many continuous distributions can be reparameterized such that it is possible to obtain samples from the original distribution by applying a deterministic transformation to a sample from a fixed distribution [Kingma and Welling, 2014, Rezende et al., 2014, Titsias and Lázaro-Gredilla, 2014]. For instance, if the sampling distribution is an isotropic Gaussian, , then a sample can be equivalently obtained by sampling and passing through a deterministic function, . This allows exchanging the gradient and expectation, giving a gradient with respect to after reparameterization as:

where is a fixed sampling distribution and is a deterministic transformation. Reparameterized gradient estimators typically have lower variance but are not widely applicable since they require and to be differentiable with respect to and respectively unlike score function estimators [Glasserman, 2013, Schulman et al., 2015]. Recent work has tried to bridge this gap by reparameterizing continuous relaxations to discrete distributions (called Concrete distributions) that gives low variance, but biased gradient estimates [Maddison et al., 2017, Jang et al., 2017] and deriving gradient estimators that interpolate between score function estimators and reparameterization estimators for distributions that can be simulated using acceptance-rejection algorithms, such as the Gamma and Dirichlet distributions [Ruiz et al., 2016, Naesseth et al., 2017]. Further reductions in the variance of reparameterization estimators is possible as explored in recent work, potentially introducing bias [Roeder et al., 2017, Miller et al., 2017, Levy and Ermon, 2018].

2.1 Variational learning

We can cast variational learning as an objective of the form given in Eq. (1

). Consider a generative model that specifies a joint distribution

over the observed variables and latent variables respectively, parameterized by . We assume the true posterior over the latent variables is intractable, and we introduce a variational approximation to the posterior represented by a recognition network and parameterized by . The parameters of the generative model and the recognition network are learned jointly [Kingma and Welling, 2014, Rezende et al., 2014] by optimizing an evidence lower bound (ELBO) on the marginal log-likelihood of a datapoint :


Besides reparameterization estimators that were introduced in the context of variational learning, there has been considerable research in the design of control variates (CV) for variational learning using the more broadly applicable score function estimators [Paisley et al., 2012]. In particular, Wingate and Weber [2013] and Ranganath et al. [2014] use simple scalar CV, NVIL proposed input-dependent CV [Mnih and Gregor, 2014], and MuProp combines input-dependent CV with deterministic first-order Taylor approximations to the mean-field approximation of the model [Gu et al., 2016]. Recently, REBAR used CV based on the Concrete distribution to give low variance, unbiased gradient updates [Tucker et al., 2017], which has been subsequently generalized to a more flexible parametric version in RELAX [Grathwohl et al., 2018].

In a parallel line of work, there is an increasing effort to learn models with more expressive posteriors. Major research in this direction focuses on continuous latent variable models, for e.g., see Gregor et al. [2014, 2015], Salimans et al. [2015], Rezende and Mohamed [2015], Chen et al. [2017], Song et al. [2017], Grover and Ermon [2018] and the references therein. Closely related to the current work is Gummadi [2014], which originally proposed a resampling scheme to improve the richness of the posterior approximation and derived unbiased estimates of gradients for the KL divergence from arbitrary unnormalized posterior approximations. Related work for discrete latent variable models is scarce. Hierarchical models impose a prior over the discrete latent variables to induce dependencies between the variables [Ranganath et al., 2016], which can also be specified as an undirected model [Kuleshov and Ermon, 2017]. On the theoretical side, random projections of discrete posteriors have been shown to provide tight bounds on the quality of the variational approximation [Zhu and Ermon, 2015, Grover and Ermon, 2016, Hsu et al., 2016].

Multi-sample estimators.

Multi-sample objectives improve the family of distributions represented by variational posteriors by trading off computational efficiency with statistical accuracy. Learning algorithms based on these objectives do not introduce additional parameters but instead draw multiple samples from the variational posterior to reduce the variance in gradient estimates as well as tighten the ELBO. A multi-sample ELBO is given as:


Biased gradient estimators using similar objectives were first used by Raiko et al. [2015] for structured prediction. Burda et al. [2016]

showed that a multi-sample ELBO is a tighter lower bound on the log-likelihood than the ELBO. Further, they derived unbiased gradient estimates for optimizing variational autoencoders trained using the objective in Eq. (

3). VIMCO generalized this to discrete latent variable models with arbitrary Monte Carlo objectives using a score function estimator with per-sample control variates [Mnih and Rezende, 2016], which serves as a point of comparison in our experiments. Recently, Naesseth et al. [2018] proposed an importance weighted multi-sample objective for probabilistic models of dynamical systems based on sequential Monte Carlo.

3 The Vrs Framework

(a) Target dist.
(e) 1e-3
Figure 1: The resampled posterior approximation (b-e) gets closer (in terms of divergence) to a target 2D discrete distribution (a) as we decrease the parameter , which controls the acceptance probability . The triples shown are divergence to target.

To motivate variational rejection sampling (VRS), consider the ELBO objective in Eq. (2) for any fixed and . This is maximized when the variational posterior matches the true posterior . However, in practice, the approximate posterior could be arbitrarily far from the true posterior which we seek to mitigate by rejection sampling.

3.1 The Resampled ELBO (R-ELBO)

Consider an alternate sampling distribution for the variational posterior with the density defined below:


where is an acceptance probability function that could depend on , and additional parameter(s) . Unlike , and , note that does not represent a density over the latent variables , but simply a function that maps each possible to a number between 0-1 and hence, it denotes the probability of acceptance for each . In order to sample from , we follow a two step sampling procedure defined in Algorithm 1. Hence, computing Monte Carlo expectations with respect to the modified proposal involves resampling from the original proposal due to an additional accept-reject step. We refer to such sampling distributions as resampled proposal distributions.

1:input ,
3:while True do
4:      sample from proposal .
5:     Compute acceptance probability
6:     Sample uniform: .
7:     if  then
8:         Output sample .
9:     end if
10:end while
Algorithm 1 Sampler for

The resampled proposal defines a new evidence lower bound on the marginal log-likelihood of , which we refer to as the “resampled ELBO”, or R-ELBO:


where is the (generally intractable) normalization constant for the resampled proposal distribution. To make the resampling framework described above work, we need to define a suitable acceptance function and derive low variance Monte Carlo gradient estimators with respect to and for the R-ELBO which we discuss next.

3.2 Acceptance probability functions

The general intuition behind designing an acceptance probability function is that it should allow for the resampled posterior to come “close” to the target posterior (possibly at the cost of extra computation). While there could be many possible ways of designing such acceptance probability functions, we draw inspiration from rejection sampling [Halton, 1970].

In order to draw samples from a target distribution , a rejection sampler first draws samples from an easy-to-sample distribution with a larger-or-equal support, i.e., wherever . Then, provided we have a fixed, finite upper bound on the likelihood ratio , we can obtain samples from the target by accepting samples from with a probability . The choice of guarantees that the acceptance probability is less than or equal to 1, and overall probability of any accepted sample is proportional to which gives us as desired. The constant has to be large enough such that the acceptance probability does not exceed , but a very high value of leads to an increase in computation due to a higher rejection rate.

If the target is only known up to a normalization constant, then rejection sampling can be used provided is large enough to ensure that the acceptance probability never exceeds 1. However, we do not know in general how large should be and even if we did, it would be computationally infeasible to actually use it in a practical algorithm. A natural approximation that departs from the typical rejection sampler would be to accept proposed samples with probability for some that is no longer guaranteed to dominate the likelihood ratios across the entire state space. In the setting of variational learning, the target corresponds to the true, but intractable posterior that can be specified up to a normalization constant as for any fixed and . If denotes the proposal distribution and is any function of the threshold parameter , the acceptance probability for the approximate rejection sampler is given by:

To get a fully differentiable approximation to the min operator, we consider:

where the approximation in the last step holds for large or when any of the two terms in the expression dominates the other. For , we get the exponentiated negative softplus function which we will use to be the acceptance probability function in the remainder of this paper. We leave other approximations to future work. Letting , the log probability of acceptance is parameterized as:


where and denotes the softplus function, i.e., .

Informally, the resampling scheme of Algorithm 1 with the choice of acceptance probability function as in Eq. (3.2) enforces the following behavior: samples from the approximate posterior that disagree (as measured by the log-likelihoods) with the target posterior beyond a level implied by the corresponding threshold have an exponentially decaying probability of getting accepted, while leaving the remaining samples with negligible interference from resampling.

When the proposed sample from is assigned a small likelihood by , the random variable is correspondingly large with high probability (and linear in the negative log-likelihood assigned by ), resulting in a low acceptance probability. Conversely, when assigns a high likelihood to , we get a higher acceptance probability. Furthermore, a large value of the scalar bias results in an acceptance probability of , recovering the regular variational inference setting as a special case. On the other extreme, for a small value of , we get the behavior of a rejection sampler with high computational cost that is also close to the target distribution in divergence. More formally, we have Theorem 1 which shows that the divergence can be improved monotonically by decreasing . However, a smaller value of would require more aggressive rejections and thereby, more computation.

Theorem 1.

For fixed , the divergence between the approximate and true posteriors, is monotonically increasing in where is the resampled proposal distribution with the choice of acceptance probability function in Eq. (3.2). Furthermore, the behavior of the sampler in Algorithm 1 interpolates between the following two extremes:

  • As , is equivalent to , with perfect sampling efficiency for the accept-reject step i.e., .

  • As , is equivalent to , with the sampling efficiency of a plain rejection sampler i.e., .

This phenomenon is illustrated in Figure 1 where we approximate an example 2D discrete target distribution on a grid, with a uniform proposal distribution plus resampling. With no resampling (), the approximation is far from the target. As is reduced, Figure 1 demonstrates progressive improvement in the posterior quality both visually as well as via an estimate of the divergence from approximation to the target along with an increasing computation cost reflected in the lower acceptance probabilities. In summary, we can express the R-ELBO as:


Theorem 1 and Eq. (7) give the following corollary.

Corollary 1.

The R-ELBO gets tighter by decreasing but more expensive to compute.

With an appropriate acceptance probability function, we can therefore traverse the computational-statistical trade off for maximum likelihood estimation by adaptively tuning the threshold based on available computation.

3.3 Gradient estimation

The resampled proposal distribution in Eq. (4) is unnormalized with an intractable normalization constant, . The presence of an intractable normalization constant seems challenging for both evaluation and stochastic optimization of the R-ELBO. Even though the constant cannot be computed in closed form,111Note that Monte Carlo estimates of the partition function can be obtained efficiently for evaluation. we can nevertheless compute Monte Carlo estimates of its gradients, as we show in Lemma 1 in the appendix. The resulting R-ELBO gradients are summarized below in Theorem 2.

Theorem 2.

Let denote the covariance of the two random variables and , where . Then:

  • The R-ELBO gradients with respect to :

    where the covariance is between the following r.v.:

  • The R-ELBO gradients with respect to :


denotes the sigmoid function applied to

, i.e., .

In the above expressions, the gradients are expressed as the covariance of two random variables that are a function of the latent variables sampled from the approximate posterior . Hence, we only need samples from for learning, which can be done using Algorithm 1 followed by Monte Carlo estimation analogous to estimation of the usual ELBO gradients.

4 Learning Algorithm

1:input Network architectures for

; quantile hyperparameter

; initial parameters, ; threshold update frequency, ; quantile estimation sample count ; Covariance estimate sample count , SGD based optimizer, OPT; Dataset

, Number of epochs:

2:output Final estimates, .
3:Initialize ; ; .
4:for  do
5:     if   then
6:         for  each in dataset do
7:              Sample .
8:              , the Monte Carlo estimate of Eq. (8) based on samples .
9:         end for
10:     end if
11:     for  each in dataset do
12:          Draw independent samples using Algorithm 1.
13:          Use Theorem 2 and Eq. (9) with to estimate gradients, .
14:         Update .
15:     end for
16:end for
Algorithm 2 Variational Rejection Sampling

A practical implementation of variational rejection sampling as shown in Algorithm 2 requires several algorithmic design choices that we discuss in this section.

4.1 Threshold selection heuristic

The monotonicity of divergence in Theorem 1 suggests that it is desirable to choose a value of as low as computationally feasible for obtaining the highest accuracy. However, the quality of the approximate posterior, for a fixed parameter could vary significantly across different examples . This would require making dependent on . Although learning in a parametric way is one possibility, in this work, we restrict attention to a simple estimation based approach that reduces the design choice to a single hyperparameter that can be adjusted to trade extra computation for accuracy. For each fixed , let denote the probability distribution of the scalar random variable, , where . Let denote the quantile function222Recall that for a given CDF , the quantile function is its ‘inverse’, namely . for any given 1-D distribution . For each quantile parameter

, we consider a heuristic family of threshold parameters given

, defined as:


For example, for , this is the median of . Eq. (8) implies that the acceptance probability stays roughly in the range of for most samples. This is due to the fact that the negative log of the acceptance probability, defined in Eq. (3.2) as is positive approximately with probability , an event which is likely to result in a rejection. In Algorithm 2, we compute a Monte Carlo estimate for the threshold and denote the resulting value using samples as (Lines 5-10). This estimation is done independently from the SGD updates, once every epochs, to save computational cost, and also implies that is not continuously updated as a function of . Technically speaking, this introduces a slight bias in the gradients through their dependence on , but we ignore this correction since it only happens once every few epochs.

4.2 Computing covariance estimates

To compute an unbiased Monte Carlo estimate of the covariance terms in the gradients, we need to subtract the mean of at least one random variable while forming the product term. In order to do this in Algorithm 2 (Lines 12-13), we process a fixed batch of (accepted) samples per gradient update, and for each sample, use all-but-one to compute the mean estimate to be subtracted, similar to the local learning signals proposed in Mnih and Rezende [2016]. This requires generating samples from simultaneously at each step to be able to compute each gradient. More precisely, the leave-one-out unbiased Monte Carlo estimator for the covariance of two random variables is defined as follows. Let be independent samples from the joint pair , and let denote the sample mean for : . Then the covariance estimate is given by:


4.3 Hyperparameters and overall algorithm

In summary, Algorithm 2 involves the following hyperparameters: , the number of samples for estimating covariance; , the quantile used for setting thresholds; , the number of epochs between updating .

5 Experimental Evaluation

We evaluated variational rejection sampling (VRS) against competing methods on a diagnostic synthetic experiment and a benchmark density estimation task.

5.1 Diagnostic experiments

In this experiment, we consider a synthetic setup that involves fitting an approximate posterior candidate from a constrained family to a fixed target distribution that clearly falls outside the approximating family. We restrict attention to training a 1-D parameter, exclusively (i.e., we do not consider optimization over ), and for the non-amortized case (i.e., conditioning on is not applicable). The target distribution is 1-D, with support on non-negative integers, and denoted as . This distribution, visualized in Figure 5, is obtained by removing the mass on the first

integers of a Poisson distribution with rate

. More details are given in the Appendix. The approximate proposal is parameterized as , where is an unconstrained scalar, and denotes a (unmodified) Poisson distribution with the (non-negative) rate parameter, . Note that for to explicitly represent a small mass on would require , but this would be a bad fit for points just above . As a result, does not contain candidates close to the target distribution in the sense of divergence, even a simple resampling modification could transform the raw proposal into a better candidate approximation, .

Figure 3: Acceptance probability vs. SGD iteration
Figure 4: VRS learning dynamics. The x-axis shows the number of total samples (both accepted and rejected) at each SGD iteration.
(a) Error:
(b) Gradients for .
(a) Error: .
(b) Gradients for .
Figure 4: VRS learning dynamics. The x-axis shows the number of total samples (both accepted and rejected) at each SGD iteration.
Figure 5: VIMCO learning dynamics. The x-axis shows the number of total samples, which is equal to times the number of iterations at each SGD iteration.
Figure 2: Target distribution, .

In Figures 5 and 5 we illustrate the dynamics of SGD using VRS gradients for approximating . To keep the analysis simple, the threshold was kept fixed at a constant value during learning. Figure 5 shows the efficiency of the sampler improving as the learning progresses due to a better fit to the target distribution. Figure 3(a) shows the difference between the current parameter and from the target distribution, quickly converging to as learning proceeds.

As a benchmark, we evaluated the dynamics based on VIMCO gradients [Mnih and Rezende, 2016]. Figure 5 suggests that the signal in gradients is too low (i.e., high variance in gradient estimates). This behavior was persistent even with much smaller learning rates and large sample sizes compared to VRS gradients. One explanation is that the VIMCO gradient update for has a term that assigns the same average weight to the entire batch of samples, both good and bad ones (see Eq. (8) in Mnih and Rezende [2016]). In contrast, Algorithm 1 discards rejected samples from contributing to the gradients explicitly. Yet another qualitative aspect that distinguishes VRS gradients from importance weighted multi-sample objective gradients is that Algorithm 1 can dynamically adapt the amount of additional computation spent in resampling based on sample quality, as opposed to being fixed in advance.

Model / Architecture 200 200-200
NVIL () 112.5 99.6
MuProp 111.7 99.07
REBAR () 111.7 99
REBAR 111.6 99.8
Concrete () 111.3 102.8
VRS (IS, ) 106.97 96.38
VRS (RS, ) 106.89 96.30
VRS (IS, ) 106.71 96.26
VRS (RS, ) 106.63 96.36
(a) Baseline results from Tucker et al. [2017]
Model / Architecture 200-200-200
NVIL () 95.2
NVIL () 93.6
NVIL () 93.7
NVIL () 93.4
NVIL () 96.2
RWS () 94.6
RWS () 93.4
RWS () 93.0
RWS () 92.5
VIMCO () 93.5
VIMCO () 92.8
VIMCO () 92.6
VIMCO () 91.9
VRS (IS, ) 92.01
VRS (RS, ) 91.93
VRS (IS, ) 92.09
VRS (RS, ) 91.69
(b) Baseline results from Mnih and Rezende [2016]
Table 1: Test NLL (in nats) for MNIST comparing VRS with published results. Lower is better.

5.2 Generative modeling

We trained sigmoid belief networks (SBN) on the binarized MNIST dataset 

[LeCun et al., 2010]. Following prior work, the SBN benchmark architectures for this task consist of several linear layers of 200 hidden units, and the recognition model has the same architecture in the reverse direction. Training such models is sensitive to choices of hyperparamters, and hence we directly compare VRS with published baselines in Table 1. The hyperparameter details for SBNs trained with VRS are given in the Appendix.

With regards to key baseline hyperparameters in Table 1, Concrete and REBAR specify a temperature controlling the degree of relaxation (denoted by ), whereas multi-sample estimators based on importance weighting specify the number of samples, to trade-off computation for statistical accuracy. The relevant parameter, in our case is not directly comparable but we report results for values of where empirically, the average number of rejections per training example was somewhere between drawing and samples for an equivalent importance weighted objective for both and (with the latter requiring more computation). Additionally, we provide two estimators for evaluating the test lower bound for VRS. For the importance sampled (IS) version, we simply evaluate the ELBO using importance sampling with the original posterior, . The resampled (RS) version, on the other hand uses the resampled proposal, with the partition function, estimated as a Monte Carlo expectation.

From the results, we observe that VRS outperforms other methods, including multi-sample estimators with as high as that require much greater computation than the VRS models considered. Generally speaking, the RS estimates are better than the corresponding IS estimates, and decreasing improves performance (at the cost of increased computation).

6 Conclusion

We presented a rejection sampling framework for variational learning in generative models that is theoretically principled and allows for a flexible trade-off between computation and statistical accuracy by improving the quality of the variational approximation made by any parameterized model. We demonstrated the practical benefits of our framework over competing alternatives based on multi-sample objectives for variational learning of discrete distributions.

In the future, we plan to generalize VRS while exploiting factorization structure in the generative model based on intermediate resampling checkpoints [Gummadi, 2014]

. Many baseline methods in our experiments are complementary and could benefit from VRS. Yet another direction involves the applications to stochastic optimization problems arising in reinforcement learning and structured prediction.


We are thankful to Zhengyuan Zhou and Daniel Levy for helpful comments on early drafts. This research has been supported by a Microsoft Research PhD fellowship in machine learning for the first author, NSF grants #1651565, #1522054, #1733686, Toyota Research Institute, Future of Life Institute, and Intel.


  • Burda et al. [2016] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
  • Chen et al. [2017] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In International Conference on Learning Representations, 2017.
  • Dayan et al. [1995] Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, September 1995.
  • Fu [2006] Michael C Fu. Gradient estimation. Handbooks in operations research and management science, 13:575–616, 2006.
  • Gershman and Goodman [2014] Sam Gershman and Noah D. Goodman. Amortized inference in probabilistic reasoning. In Annual Conference of the Cognitive Science Society, 2014.
  • Glasserman [2013] Paul Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
  • Glynn [1990] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • Grathwohl et al. [2018] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
  • Gregor et al. [2014] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2014.
  • Gregor et al. [2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra.

    DRAW: A recurrent neural network for image generation.

    In International Conference on Machine Learning, 2015.
  • Grover and Ermon [2016] Aditya Grover and Stefano Ermon. Variational Bayes on Monte Carlo steroids. In Advances in Neural Information Processing Systems, 2016.
  • Grover and Ermon [2018] Aditya Grover and Stefano Ermon. Boosted generative models. In

    AAAI Conference on Artificial Intelligence

    , 2018.
  • Gu et al. [2016] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations, 2016.
  • Gummadi [2014] R. Gummadi. Resampled belief networks for variational inference. In NIPS Workshop on Advances in Variational Inference, 2014.
  • Halton [1970] John H Halton. A retrospective and prospective survey of the Monte Carlo method. SIAM review, 12(1):1–63, 1970.
  • Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Hsu et al. [2016] Lun-Kai Hsu, Tudor Achim, and Stefano Ermon. Tight variational bounds via random projections and I-projections. In Artificial Intelligence and Statistics, 2016.
  • Jang et al. [2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kuleshov and Ermon [2017] Volodymyr Kuleshov and Stefano Ermon. Neural variational inference and learning in undirected graphical models. In Advances in Neural Information Processing Systems, 2017.
  • LeCun et al. [2010] Yann LeCun, Corinna Cortes, and Christopher JC Burges. MNIST handwritten digit database. http://yann. lecun. com/exdb/mnist, 2010.
  • Levy and Ermon [2018] Daniel Levy and Stefano Ermon. Deterministic policy optimization by combining pathwise and score function estimators for discrete action spaces. In AAAI Conference on Artificial Intelligence, 2018.
  • Maddison et al. [2017] Chris J Maddison, Andriy Mnih, and Yee Whye Teh.

    The Concrete distribution: A continuous relaxation of discrete random variables.

    In International Conference on Learning Representations, 2017.
  • Miller et al. [2017] Andrew C Miller, Nicholas J Foti, Alexander D’Amour, and Ryan P Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, 2017.
  • Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
  • Mnih and Rezende [2016] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, 2016.
  • Mohamed and Lakshminarayanan [2016] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • Naesseth et al. [2017] Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Naesseth et al. [2018] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2018.
  • Paisley et al. [2012] John Paisley, David Blei, and Michael Jordan.

    Variational Bayesian inference with stochastic search.

    In International Conference on Machine Learning, 2012.
  • Raiko et al. [2015] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. In International Conference on Learning Representations, 2015.
  • Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.
  • Ranganath et al. [2016] Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.
  • Rezende and Mohamed [2015] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • Roeder et al. [2017] Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. In Advances in Neural Information Processing Systems, 2017.
  • Ruiz et al. [2016] Francisco Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
  • Salimans et al. [2015] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
  • Schulman et al. [2015] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 2015.
  • Song et al. [2017] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, 2017.
  • Titsias and Lázaro-Gredilla [2014] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
  • Tucker et al. [2017] George Tucker, Andriy Mnih, Chris J Maddison, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, 2017.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Wingate and Weber [2013] David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.
  • Zhu and Ermon [2015] Michael Zhu and Stefano Ermon. A hybrid approach for probabilistic inference using random projections. In International Conference on Machine Learning, 2015.


Appendix A Proofs of theoretical results

a.1 Theorem 1


We can explicitly write down the acceptance probability function as:

From the above equation, it is easy to see that as , we get an acceptance probability close to 1, resulting in an approximate posterior close to the original proposal, , whereas with , the acceptance probability degenerates to a standard rejection sampler with acceptance probability close to , but with potentially untenable efficiency. Intermediate values of can interpolate between these two extremes.

To prove monotonicity, we first derive the partial derivative of the divergence with respect to as a covariance of two random variables that are monotone transformations of each other. To get the derivative, we use the fact that the gradient of the divergence is the negative of the ELBO gradient derived in Theorem 1. Recall that the ELBO and the KL divergence add up to a constant independent of , and that the expressions for the gradients with respect to and are functionally the same. We have:


For the second term in the covariance, we can use the expressions from Eq. (4) and Eq. (3.2) to write:

where is the sigmoid function. Putting the two terms together, we have:

To prove that the two random variables, and are a monotone transformation of each other, we can use the identity to rewrite the final expression for the gradient of the divergence as:

The inequality follows from the fact that the covariance of a random variable and a monotone transformation (the logarithm in this case) is non-negative. ∎

a.2 Theorem 2

Before proving Theorem 2, we first state and prove an important lemma333We assume a discrete and finite state space in all proofs below for simplicity/clarity, but when combined with the necessary technical conditions required for the existence of the corresponding integrals, they admit a straightforward replacement of sums with integrals..

Lemma 1.

Suppose and are two unnormalized densities, where only depends on (the recognition network parameters), but both and can depend on .444The dependence for on can happen via some resampling mechanism that is allowed to, for example, evaluate on the sample proposals before making its accept/reject decisions, as in our case. Let . Then the variational lower bound objective (on ) and its gradients with respect to the parameters are given by:

Note that the covariance is the expectation of the product of (at least one) mean-subtracted version of the two random variables. Further, we can also write: , where is the mean subtracted version of the learning signal, .


The equation for the ELBO follows from the definition. For the gradients, we can write: , where:

Simplifying , , and , we get:

which implies:

Next, observe that , Therefore, using the fact that the expectation of the product of two random variables is the same as their covariance when at least one of the two random variables has a zero mean, we get . Next note that we can add an arbitrary constant to either random variable without changing the covariance, therefore this is equal to .

The derivation for the gradient with respect to is analogous, except for , which has an additional term which did not appear in the gradient with respect to because of our assumption on the lack of dependence of on the recognition parameters . For the identity on the KL divergence, we have:

Using the above lemma, we provide a proof for Theorem 2 below.


We apply the result of Theorem 1, which computes the ELBO corresponding to the two unnormalized distributions on the latent variable space (for fixed ), with and . This gives: . We can then evaluate , where is the sigmoid function. Note that this is a consequence of the fact that the derivative of the softplus, , is the sigmoid function, . Similarly for the gradient, we get:


Appendix B Experimental details

b.1 Synthetic

To construct the target distribution, we transform a Poisson distribution of rate , denoted by removing probability mass near . More precisely, this transformation forces a negligible uniform mass, , on . This leaves the distribution unnormalized, although this fact is not particularly relevant for subsequent discussion. The approximate proposal is parameterized as , where is an unconstrained scalar, and denotes a (unmodified) Poisson distribution with the (non-negative) rate parameter, . Note that for to explicitly represent a small mass on would require , but this would be a bad fit for points just above . As a result, does not contain candidates close to the target distribution in the sense of divergence, even while it may be possible to approximate well with a simple resampling modification that transforms the raw proposal into a better candidate, .

The target distribution was set with an optimal parameter (i.e., the rate parameter is 10.0), and . The optimizer used was SGD with momentum using a mass of 0.5. We observed that the gradients were consistently positive while initialized at a parameter setting less than the true value (as shown in the plots) and similarly consistently negative when initialized to a parameter more than the true value (which we did not present in the paper) due to the consistency of the correlation between the two terms in the covariance specific to this toy example. For resampling, plots show results with learning rate set to 0.01 and . For VIMCO, plots show results with learning rate set to 0.005 and .

b.2 Mnist

We consider the standard train/validation/test split of the binarized MNIST dataset. For a direct comparison with prior work, both the generative and recognition networks have the same architecture of stochastic layers. No additional deterministic layers were used for SBNs trained using VRS. The batch size was , the optimizer used is Adam with a learning rate of . We ran the algorithm for steps updating the resampling thresholds after every iterations based on the threshold selection heuristic corresponding to the top quantile. We set for the unbiased covariance estimates for gradients. The lower bounds on the test set are calculated based on importance sampling with samples for IS or resampling with accepted samples for RS.