1 Introduction
Energy-based models (EBMs) have a long history in statistics and machine learning (Dayan et al., 1995; Zhu et al., 1998; LeCun et al., 2006). EBMs score configurations of variables with an energy function, which induces a distribution on the variables in the form of a Gibbs distribution. Different choices of energy function recover well-known probabilistic models including Markov random fields (Kinderman and Snell, 1980), (restricted) Boltzmann machines (Smolensky, 1986; Freund and Haussler, 1992; Hinton, 2002), and conditional random fields (Lafferty et al., 2001). However, this flexibility comes at the cost of challenging inference and learning: both sampling and density evaluation of EBMs are generally intractable, which hinders the applications of EBMs in practice.
Because of the intractability of general EBMs, practical implementations rely on approximate sampling procedures (e.g., Markov chain Monte Carlo (MCMC)) for inference. This creates a mismatch between the model and the approximate inference procedure, and can lead to suboptimal performance and unstable training when approximate samples are used in the training procedure.
Currently, most attempts to fix the mismatch lie in designing better sampling algorithms (e.g., Hamiltonian Monte Carlo (Neal, 2005), annealed importance sampling (Neal, 2001)) or exploiting variational techniques (Kim and Bengio, 2016; Dai et al., 2017, 2018) to reduce the inference approximation error.
Instead, we bridge the gap between the model and inference by directly treating the sampling procedure as the model of interest and optimizing the log-likelihood of the sampling procedure. We call these models energy-inspired models (EIMs) because they incorporate a learned energy function while providing tractable, exact samples. This shift in perspective aligns the training and sampling procedure, leading to principled and consistent training and inference.
To accomplish this, we cast the sampling procedure as a latent variable model. This allows us to maximize variational lower bounds (Jordan et al., 1999; Blei et al., 2017) on the log-likelihood (cf. Kingma and Welling, 2013; Rezende et al., 2014). To illustrate this, we develop and evaluate energy-inspired models based on truncated rejection sampling (Algorithm 1), self-normalized importance sampling (Algorithm 2), and Hamiltonian importance sampling (Algorithm 3). Interestingly, the model based on self-normalized importance sampling is closely related to ranking NCE (Jozefowicz et al., 2016; Ma and Collins, 2018), suggesting a principled objective for training the “noise” distribution.
Our second contribution is to show that EIMs provide a unifying conceptual framework to explain many advances in constructing tighter variational lower bounds for latent variable models (e.g., Burda et al., 2015; Maddison et al., 2017; Naesseth et al., 2018; Le et al., 2017; Yin and Zhou, 2018; Molchanov et al., 2018; Sobolev and Vetrov, 2018). Previously, each bound required a separate derivation and evaluation, and their relationship was unclear. We show that these bounds can be viewed as specific instances of auxiliary variable variational inference (Agakov and Barber, 2004; Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al., 2016) with different EIMs as the variational family. Based on general results for auxiliary latent variables, this immediately gives rise to a variational lower bound with a characterization of the tightness of the bound. Furthermore, this unified view highlights the implicit (potentially suboptimal) choices made and exposes the reusable components that can be combined to form novel variational lower bounds. Concurrently, Domke and Sheldon (2019) note a similar connection; however, their focus is on the use of the variational distribution for posterior inference.
In summary, our contributions are:

The construction of a tractable class of energy-inspired models (EIMs), which lead to consistent learning and inference. To illustrate this, we build models with truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling and evaluate them on synthetic and real-world tasks. These models can be fit by maximizing a tractable lower bound on their log-likelihood.

We show that EIMs with auxiliary variable variational inference provide a unifying framework for understanding recent tighter variational lower bounds, simplifying their analysis and exposing potentially suboptimal design choices.
2 Background
In this work, we consider learned probabilistic models of data $p(x)$. Energy-based models (LeCun et al., 2006) define $p(x)$ in terms of an energy function $U(x)$
$$p(x) = \frac{\pi(x)\exp(-U(x))}{Z},$$
where $\pi$ is a tractable “prior” distribution and $Z = \int \pi(x)\exp(-U(x))\,dx$ is a generally intractable partition function. To fit the model, many approximate methods have been developed (e.g., pseudo log-likelihood (Besag, 1975), contrastive divergence (Hinton, 2002; Tieleman, 2008), score matching estimator (Hyvärinen, 2005), minimum probability flow (Sohl-Dickstein et al., 2011), noise contrastive estimation (Gutmann and Hyvärinen, 2010)) to bypass the calculation of the partition function. Empirically, previous work has found that convolutional architectures that score images (i.e., map $x$ to a real number) tend to have strong inductive biases that match natural data (Du and Mordatch, 2019). These networks are a natural fit for energy-based models. Because drawing exact samples from these models is intractable, samples are typically approximated by Monte Carlo schemes, for example, Hamiltonian Monte Carlo (Neal and others, 2011).
Alternatively, latent variables $z$ allow us to construct complex distributions by defining the likelihood $p(x) = \int p(x \mid z)\,p(z)\,dz$ in terms of tractable components $p(z)$ and $p(x \mid z)$. While marginalizing $z$ is generally intractable, we can instead optimize a tractable lower bound on $\log p(x)$ using the identity
$$\log p(x) = \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, z)}{q(z \mid x)}\right] + D_{KL}\left(q(z \mid x)\,\|\,p(z \mid x)\right), \tag{1}$$
where $q(z \mid x)$ is a variational distribution and the positive $D_{KL}$ term can be omitted to form a lower bound commonly referred to as the evidence lower bound (ELBO) (Jordan et al., 1999; Blei et al., 2017). The tightness of the bound is controlled by how accurately $q(z \mid x)$ models $p(z \mid x)$, so limited expressivity in the variational family can negatively impact the learned model.
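As an illustration of the identity in Eq. 1, the following sketch checks numerically, on a toy model with a three-state latent variable, that the log marginal likelihood equals the ELBO plus a nonnegative KL term. The probability tables and the function name `elbo_and_gap` are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical toy model: one observed x and a latent z with 3 states.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
lik = np.array([0.9, 0.4, 0.1])          # p(x | z) for the observed x
q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational q(z | x)


def elbo_and_gap(p_z, lik, q):
    """Return (log p(x), ELBO, KL(q || p(z | x))) by exact enumeration."""
    joint = p_z * lik                     # p(x, z) for each z
    log_px = np.log(joint.sum())          # log marginal likelihood
    posterior = joint / joint.sum()       # exact posterior p(z | x)
    elbo = np.sum(q * (np.log(joint) - np.log(q)))
    kl = np.sum(q * (np.log(q) - np.log(posterior)))
    return log_px, elbo, kl


log_px, elbo, kl = elbo_and_gap(p_z, lik, q)
print(f"log p(x) = {log_px:.6f}, ELBO = {elbo:.6f}, KL = {kl:.6f}")
```

Setting `q` equal to the exact posterior drives the KL term to zero, which is the sense in which the expressivity of the variational family controls the tightness of the bound.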
3 Energy-Inspired Models
Instead of viewing the sampling procedure as drawing approximate samples from the energy-based model, we treat the sampling procedure as the model of interest. We represent the randomness in the sampler as latent variables, and we obtain a tractable lower bound on the marginal likelihood using the ELBO. Explicitly, if $p(\lambda)$ represents the randomness in the sampler and $p(x \mid \lambda)$ is the generative process, then
$$\log p(x) \geq \mathbb{E}_{q(\lambda \mid x)}\left[\log \frac{p(\lambda)\,p(x \mid \lambda)}{q(\lambda \mid x)}\right], \tag{2}$$
where $q(\lambda \mid x)$ is a variational distribution that can be optimized to tighten the bound. In this section, we explore concrete instantiations of models in this paradigm: one based on truncated rejection sampling (TRS), one based on self-normalized importance sampling (SNIS), and another based on Hamiltonian importance sampling (HIS; Neal, 2005).
3.1 Truncated Rejection Sampling (TRS)
Consider the truncated rejection sampling process (Algorithm 1) used in (Bauer and Mnih, 2018), where we sequentially draw a sample from and accept it with probability . To ensure that the process ends, if we have not accepted a sample after steps, then we return .
In this case, , so we need to construct a variational distribution . The optimal is , which motivates choosing a similarly structured variational distribution. It is straightforward to see that where is generally intractable. So, we choose , where is a learnable variational parameter. Then, we sample and as in the generative process. This results in a simple variational bound
The TRS generative process is the same process as the Learned Accept/Reject Sampling (LARS) model (Bauer and Mnih, 2018). The key difference is the training procedure. LARS tries to directly estimate the gradient of the log likelihood. Without truncation, such a process is attractive because unbiased gradients of its log likelihood can easily be computed without knowing the normalizing constant. Unfortunately, after truncating the process, we require estimating a normalizing constant. In practice, Bauer and Mnih (2018) estimate the normalizing constant using samples during training and samples during evaluation. Even so, LARS requires additional implementation tricks (e.g., evaluating the target density, using an exponential moving average to estimate the normalizing constant) to ensure successful training, which complicate the implementation and analysis of the algorithm. On the other hand, we optimize a tractable log likelihood lower bound. As a result, no implementation tricks are necessary.
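The generative process described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the acceptance probability is assumed here to be a sigmoid of the negative energy, and the energy function, proposal, and truncation length are hypothetical choices made for the example.

```python
import numpy as np


def truncated_rejection_sample(sample_proposal, accept_prob, max_steps, rng):
    """Draw one sample; returns (x, number of proposals drawn)."""
    for step in range(1, max_steps + 1):
        x = sample_proposal(rng)
        # The final candidate is always returned so the process terminates.
        if step == max_steps or rng.random() < accept_prob(x):
            return x, step


def energy(x):
    # Hypothetical energy with its minimum at x = 2.
    return 0.5 * (x - 2.0) ** 2


def accept_prob(x):
    # Assumed acceptance function: sigmoid(-U(x)).
    return 1.0 / (1.0 + np.exp(energy(x)))


rng = np.random.default_rng(0)
draws = [
    truncated_rejection_sample(lambda r: r.normal(), accept_prob, 20, rng)
    for _ in range(2000)
]
xs = np.array([d[0] for d in draws])
steps = np.array([d[1] for d in draws])
print("sample mean:", xs.mean(), "mean proposals used:", steps.mean())
```

With a standard Gaussian proposal, the accepted samples shift toward the low-energy region, while the truncation bounds the cost of every draw by `max_steps` proposals.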
3.2 Self-Normalized Importance Sampling (SNIS)
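Consider the sampling process defined by self-normalized importance sampling. That is, first sampling a set of $K$ candidate $x_i$s from a proposal distribution $\pi$, and then sampling $x$ from the empirical distribution composed of atoms located at each $x_i$ and weighted proportionally to $\exp(-U(x_i))$ (Algorithm 2). In this case, the latent variables $\lambda$ are the locations of the proposal samples $x_1, \ldots, x_K$ (abbreviated $x_{1:K}$) and the index of the selected sample, $i$.
Explicitly, the model is defined by
$$p(x_{1:K}, i) = \left(\prod_{k=1}^{K} \pi(x_k)\right) \frac{\exp(-U(x_i))}{\sum_{k=1}^{K} \exp(-U(x_k))}, \qquad p(x \mid x_{1:K}, i) = \delta_{x_i}(x),$$
with $\lambda = (x_{1:K}, i)$. We denote the density of the process by $p_{SNIS}(x)$. Choosing $q(\lambda \mid x) = \frac{1}{K}\,\delta_{x_i}(x) \prod_{j \neq i} \pi(x_j)$ in Eq. 2 yields
$$\log p_{SNIS}(x) \geq \mathbb{E}_{x_{2:K} \sim \pi}\left[\log \frac{\pi(x)\exp(-U(x))}{\frac{1}{K}\left(\exp(-U(x)) + \sum_{j=2}^{K} \exp(-U(x_j))\right)}\right]. \tag{3}$$
To summarize, $p_{SNIS}(x)$ can be sampled from exactly and has a tractable lower bound on its log-likelihood. For the same number of candidate samples, we expect $p_{SNIS}$ to outperform $p_{TRS}$ because it considers all candidate samples simultaneously instead of sequentially.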
As $K \to \infty$, $p_{SNIS}(x)$ becomes proportional to $\pi(x)\exp(-U(x))$. For finite $K$, $p_{SNIS}(x)$ interpolates between the tractable proposal $\pi(x)$ and the energy model $\pi(x)\exp(-U(x))$. Furthermore, Equation 3 is closely connected with the ranking NCE loss (Jozefowicz et al., 2016; Ma and Collins, 2018), a popular objective for training energy-based models. In fact, if we consider $\pi$ as our noise distribution and choose the energy function accordingly, then up to an additive constant, we recover the ranking NCE loss using the notation from Ma and Collins (2018). The ranking NCE loss is motivated by the fact that it is a consistent objective for any number of noise samples when the true data distribution is in our model family. As a result, it is straightforward to adapt the consistency proof from Ma and Collins (2018) to our setting. Furthermore, our perspective gives a coherent objective for jointly learning the noise distribution and the energy function and shows that the ranking NCE loss can be viewed as a lower bound on the log-likelihood of a well-specified model regardless of whether the true data distribution is in our model family. In addition, we can recover the recently proposed InfoNCE (Oord et al., 2018) bound on mutual information by using SNIS as the variational distribution in the classic variational bound by Barber and Agakov (2003) (see Appendix C for details).
To train the SNIS model, we perform stochastic gradient ascent on Eq. 3 with respect to the parameters of the proposal distribution $\pi$ and the energy function $U$. When the data are continuous, reparameterization gradients can be used to estimate the gradients to the proposal distribution (Rezende et al., 2014; Kingma and Welling, 2013). When the data are discrete, score function gradient estimators such as REINFORCE (Williams, 1992) or relaxed gradient estimators such as the Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016) can be used.
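The SNIS sampler and a single-sample estimate of its log-likelihood lower bound can be sketched as below. The sketch assumes the bound takes the form "log proposal density of x, minus the energy of x, minus the log of the average of exp(-energy) over x and K - 1 fresh proposal samples"; the function names, the Gaussian proposal, and the quadratic energy are hypothetical choices for illustration only.

```python
import numpy as np


def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))


def snis_sample(sample_proposal, energy, num_candidates, rng):
    """Draw K candidates, then pick one with probability ∝ exp(-U)."""
    candidates = sample_proposal(rng, num_candidates)
    log_w = -energy(candidates)
    probs = np.exp(log_w - logsumexp(log_w))
    index = rng.choice(num_candidates, p=probs)
    return candidates[index]


def snis_lower_bound(x, log_proposal, sample_proposal, energy, num_candidates, rng):
    """Single-sample Monte Carlo estimate of the assumed lower bound at x."""
    others = sample_proposal(rng, num_candidates - 1)
    log_w = np.concatenate([[-energy(x)], -energy(others)])
    log_mean_w = logsumexp(log_w) - np.log(num_candidates)
    return log_proposal(x) - energy(x) - log_mean_w


# Hypothetical setup: standard Gaussian proposal, quadratic energy.
def sample_proposal(rng, n):
    return rng.normal(size=n)


def log_proposal(x):
    return -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)


def energy(x):
    return 0.5 * (x - 2.0) ** 2


rng = np.random.default_rng(0)
samples = np.array([snis_sample(sample_proposal, energy, 64, rng) for _ in range(500)])
bound = snis_lower_bound(1.0, log_proposal, sample_proposal, energy, 64, rng)
print("sample mean:", samples.mean(), "bound estimate at x = 1:", bound)
```

In an actual training loop the same estimate would be computed with a differentiable framework and averaged over a minibatch; only the forward computation is shown here.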
3.3 Hamiltonian Importance Sampling (HIS)
Simple importance sampling scales poorly with dimensionality, so it is natural to consider more complex samplers with better scaling properties. We evaluated models based on Hamiltonian importance sampling (HIS; Neal, 2005), which evolve an initial sample under deterministic, discretized Hamiltonian dynamics with a learned energy function. In particular, we sample initial location and momentum variables, and then transition the candidate sample and momentum with leapfrog integration steps, changing the temperature at each step (Algorithm 3). While the quality of samples from SNIS is limited by the samples initially produced by the proposal, a model based on HIS updates the positions of the samples directly, potentially allowing for more expressive power. Intuitively, the proposal provides a coarse starting sample which is further refined by gradient optimization on the energy function. When the proposal is already quite strong, drawing additional samples as in SNIS may be advantageous.
In practice, we parameterize the temperature schedule such that . This ensures that the deterministic invertible transform from to has a Jacobian determinant of (i.e., ). Applying Eq. 2 yields a tractable variational objective
We jointly optimize and the variational parameters with stochastic gradient ascent. Goyal et al. (2017) propose a similar approach that generates a multistep trajectory via a learned transition operator.
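To make the volume-preservation argument concrete, the sketch below implements a deterministic transform of a (position, momentum) pair under stated assumptions: each step is taken to be a leapfrog update followed by a momentum rescaling, and the temperature constraint is taken to be that the per-step scales multiply to one. The exact update in Algorithm 3 is not reproduced here, and the energy, step size, and function names are hypothetical. In one dimension the Jacobian determinant of the full map can be checked by finite differences.

```python
import numpy as np


def make_scales(raw):
    """Map unconstrained values to positive scales whose product is 1."""
    return np.exp(raw - np.mean(raw))


def his_transform(x0, rho0, grad_energy, step_size, scales):
    """Leapfrog steps with a momentum rescaling after each step (assumed form)."""
    x, rho = x0, rho0
    for alpha in scales:
        rho = rho - 0.5 * step_size * grad_energy(x)   # half step on momentum
        x = x + step_size * rho                         # full step on position
        rho = alpha * (rho - 0.5 * step_size * grad_energy(x))
    return x, rho


def grad_energy(x):
    # Gradient of a hypothetical double-well energy U(x) = x^4 / 4 - x^2 / 2.
    return x ** 3 - x


scales = make_scales(np.array([0.3, -0.1, 0.4, 0.2]))


def jacobian_det_1d(x0, rho0, eps=1e-5):
    """Central finite-difference Jacobian determinant of the 1-D transform."""
    def f(x, r):
        return np.array(his_transform(x, r, grad_energy, 0.1, scales))
    d_x = (f(x0 + eps, rho0) - f(x0 - eps, rho0)) / (2.0 * eps)
    d_r = (f(x0, rho0 + eps) - f(x0, rho0 - eps)) / (2.0 * eps)
    return d_x[0] * d_r[1] - d_x[1] * d_r[0]


print("product of scales:", np.prod(scales))
print("Jacobian determinant:", jacobian_det_1d(0.7, -0.3))
```

Each leapfrog sub-step is a shear with unit determinant, so under these assumptions the determinant of the whole map is the product of the scales, which the parameterization in `make_scales` fixes at one.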
4 Experiments
We evaluated the proposed models on a set of synthetic datasets, binarized MNIST (LeCun, 1998) and Fashion MNIST (Xiao et al., 2017), and continuous MNIST, Fashion MNIST, and CelebA (Liu et al., 2015). See Appendix D for details on the datasets, network architectures, and other implementation details. To provide a competitive baseline, we use the recently developed Learned Accept/Reject Sampling (LARS) model (Bauer and Mnih, 2018).
4.1 Synthetic data
As a preliminary experiment, we evaluated the methods on modeling synthetic densities: a mixture of 9 equally weighted Gaussian densities, a checkerboard density with uniform mass distributed in 8 squares, and two concentric rings (see Fig. 1 and Appendix Fig. 2 for visualizations). For all methods, we used a unimodal standard Gaussian as the proposal distribution (see Appendix D for further details).
TRS, SNIS, and LARS perform comparably on the Nine Gaussians and Checkerboard datasets. On the Two Rings dataset, despite tuning hyperparameters, we were unable to make LARS learn the density.
On these simple problems, the target density lies in the high probability region of the proposal density, so TRS, SNIS, and LARS only have to reweight the proposal samples appropriately. In high-dimensional problems when the proposal density is mismatched from the target density, however, we expect HIS to outperform TRS, SNIS, and LARS. To test this, we ran each algorithm on the Nine Gaussians problem with a Gaussian proposal of mean 0 and variance 0.1 so that there was a significant mismatch in support between the target and proposal densities. The results in the rightmost panel of Fig. 1 show that HIS was almost unaffected by the change in proposal while the other algorithms suffered considerably.
4.2 Binarized MNIST and Fashion MNIST
Method  Static MNIST  Dynamic MNIST  Fashion MNIST 

VAE w/ Gaussian prior  
VAE w/ TRS prior  
VAE w/ SNIS prior  
VAE w/ HIS prior  
VAE w/ LARS prior  
ConvHVAE w/ Gaussian prior 

ConvHVAE w/ TRS prior  
ConvHVAE w/ SNIS prior  
ConvHVAE w/ HIS prior  
ConvHVAE w/ LARS prior  
SNIS w/ VAE proposal  
SNIS w/ ConvHVAE proposal  
LARS w/ VAE proposal  —  — 
Method  MNIST  Fashion MNIST  CelebA 

Small VAE  
LARS w/ small VAE proposal  
SNIS w/ small VAE proposal  
HIS w/ small VAE proposal  
VAE  
LARS w/ VAE proposal  
SNIS w/ VAE proposal  
HIS w/ VAE proposal  
MAF  —  — 
Next, we evaluated the models on binarized MNIST and Fashion MNIST. MNIST digits can be either statically or dynamically binarized. For the statically binarized dataset we used the binarization from Salakhutdinov and Murray (2008), and for the dynamically binarized dataset we sampled images from Bernoulli distributions with probabilities equal to the continuous values of the images in the original MNIST dataset. We dynamically binarized the Fashion MNIST dataset in a similar manner.
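Dynamic binarization as described above amounts to resampling a binary image from the grayscale intensities each time an example is used. A minimal sketch, assuming the images are stored as floating-point arrays scaled to [0, 1] (the function name and array shapes are hypothetical):

```python
import numpy as np


def dynamically_binarize(images, rng):
    """Sample binary pixels with P(pixel = 1) equal to the intensity in [0, 1]."""
    return (rng.random(images.shape) < images).astype(np.float32)


rng = np.random.default_rng(0)
# Stand-in batch of 28x28 "images" with a constant intensity of 0.3.
gray = np.full((1000, 28, 28), 0.3)
binary = dynamically_binarize(gray, rng)
print("fraction of active pixels:", binary.mean())
```

Because a fresh binary image is drawn on every call, the model sees a different binarization of each digit across epochs, in contrast to the fixed images of the statically binarized dataset.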
First, we used the models as the prior distribution in a Bernoulli observation likelihood VAE. We summarize log-likelihood lower bounds on the test set in Table 1 (referred to as VAE w/ method prior). SNIS outperformed LARS on static MNIST and dynamic MNIST even though it used only 1024 samples for training and evaluation, whereas LARS used 1024 samples during training and samples for evaluation. As expected due to the similarity between methods, TRS performed comparably to LARS. On all datasets, HIS either outperformed or performed comparably to SNIS. We increased and for SNIS and HIS, respectively, and found that performance improves at the cost of additional computation (Appendix Fig. 3). We also used the models as the prior distribution of a convolutional hierarchical VAE (ConvHVAE, following the architecture in Bauer and Mnih (2018)). In this case, SNIS outperformed all methods.
Then, we used a VAE as the proposal distribution to SNIS. A limitation of the HIS model is that it requires continuous data, so it cannot be used in this way on the binarized datasets. Initially, we thought that an unbiased, low-variance estimator could be constructed similarly to VIMCO (Mnih and Rezende, 2016); however, this estimator still had high variance. Next, we used the Gumbel Straight-Through estimator (Jang et al., 2016) to estimate gradients through the discrete samples proposed by the VAE, but found that this method performed worse than ignoring those gradients altogether. We suspect that this may be due to bias in the gradients. Thus, for the SNIS model with VAE proposal, we report results on training runs which ignore those gradients. Future work will investigate low-variance, unbiased gradient estimators. In this case, SNIS again outperforms LARS; however, the performance is worse than using SNIS as a prior distribution. Finally, we used a ConvHVAE as the proposal for SNIS and saw performance improvements over both the vanilla ConvHVAE and SNIS with a VAE proposal, demonstrating that our modeling improvements are complementary to improving the proposal distribution.
4.3 Continuous MNIST, Fashion MNIST, and CelebA
Finally, we evaluated SNIS and HIS on continuous versions of MNIST, Fashion MNIST, and CelebA (64x64). We use the same preprocessing as in Dinh et al. (2016). Briefly, we dequantize pixel values by adding uniform noise, rescale them to , and then transform the rescaled pixel values into logit space by , where . When we calculate log-likelihoods, we take into account this change of variables.
We speculated that when the proposal is already strong, drawing additional samples as in SNIS may be better than HIS. To test this, we experimented with a smaller VAE as the proposal distribution. As we expected, HIS outperformed SNIS when the proposal was weaker, especially on the more complex datasets, as shown in Table 2.
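The preprocessing described at the start of this subsection (dequantization, rescaling, logit transform, and the accompanying change of variables) can be sketched as follows. The specific constants are not recoverable from the text, so the sketch assumes 256 intensity levels, rescaling to the unit interval, and a squashing parameter `alpha` that keeps values away from 0 and 1; these, and the function names, are assumptions for illustration only.

```python
import numpy as np


def dequantize(pixels, rng):
    """Add uniform noise in [0, 1) to integer pixel values."""
    return pixels + rng.random(pixels.shape)


def logit_transform(values, alpha=0.05, num_levels=256):
    """Map dequantized values in [0, num_levels) to logit space.

    Returns the logits and the log-determinant of the Jacobian of the map,
    summed over all pixels. `alpha` and `num_levels` are assumed values.
    """
    unit = values / num_levels                        # rescale to [0, 1)
    squashed = alpha + (1.0 - 2.0 * alpha) * unit     # keep away from 0 and 1
    logits = np.log(squashed) - np.log1p(-squashed)
    log_det = np.sum(
        np.log(1.0 - 2.0 * alpha)
        - np.log(num_levels)
        - np.log(squashed)
        - np.log1p(-squashed)
    )
    return logits, log_det


rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(28, 28))
logits, log_det = logit_transform(dequantize(pixels, rng))
# A density model fit in logit space reports log p_logit(logits); the
# log-likelihood in the original pixel space is log p_logit(logits) + log_det.
print("log-det of the preprocessing map:", log_det)
```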
5 Variational Inference with EIMs
To provide a tractable lower bound on the log-likelihood of EIMs, we used the ELBO (Eq. 1). More generally, this variational lower bound has been used to optimize deep generative models with latent variables following the influential work by Kingma and Welling (2013); Rezende et al. (2014), and models optimized with this bound have been successfully used to model data such as natural images (Rezende and Mohamed, 2015; Kingma et al., 2016; Chen et al., 2016; Gulrajani et al., 2016), speech and music time series (Chung et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2015), and video (Babaeizadeh et al., 2017; Ha and Schmidhuber, 2018; Denton and Fergus, 2018). Due to the usefulness of such a bound, there has been an intense effort to provide improved bounds (Burda et al., 2015; Maddison et al., 2017; Naesseth et al., 2018; Le et al., 2017; Yin and Zhou, 2018; Molchanov et al., 2018; Sobolev and Vetrov, 2018). The tightness of the ELBO is determined by the expressiveness of the variational family (Zellner, 1988), so it is natural to consider using flexible EIMs as the variational family. As we explain, EIMs provide a conceptual framework to understand many of the recent improvements in variational lower bounds.
In particular, suppose we use a conditional EIM $q(z \mid x)$ as the variational family (i.e., $q(z \mid x) = \int q(z, \lambda \mid x)\,d\lambda$ is the marginalized sampling process). Then, we can use the ELBO lower bound on $\log p(x)$ (Eq. 1); however, the density of the EIM $q(z \mid x)$ is intractable. Agakov and Barber (2004); Salimans et al. (2015); Ranganath et al. (2016); Maaløe et al. (2016) develop an auxiliary variable variational bound
$$\mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, z)}{q(z \mid x)}\right] = \mathbb{E}_{q(z, \lambda \mid x)}\left[\log \frac{p(x, z)\,r(\lambda \mid z, x)}{q(z, \lambda \mid x)}\right] + \mathbb{E}_{q(z \mid x)}\left[D_{KL}\left(q(\lambda \mid z, x)\,\|\,r(\lambda \mid z, x)\right)\right], \tag{4}$$
where $r(\lambda \mid z, x)$ is a variational distribution meant to model $q(\lambda \mid z, x)$, and the identity follows from the fact that $q(z \mid x) = q(z, \lambda \mid x) / q(\lambda \mid z, x)$. Similar to Eq. 1, Eq. 4 shows the gap introduced by using $r(\lambda \mid z, x)$ to deal with the intractability of $q(z \mid x)$. We can form a lower bound on the original ELBO and thus a lower bound on the log marginal by omitting the positive $D_{KL}$ term. This provides a tractable lower bound on the log-likelihood using flexible EIMs as the variational family and precisely characterizes the bound gap as the sum of $D_{KL}$ terms in Eq. 1 and Eq. 4. For different choices of EIM, this bound recovers many of the recently proposed variational lower bounds.
Furthermore, the bound in Eq. 4 is closely related to partition function estimation because $p(x, z)\,r(\lambda \mid z, x) / q(z, \lambda \mid x)$ is an unbiased estimator of $p(x)$ when $z, \lambda \sim q(z, \lambda \mid x)$. To first order, the bound gap is related to the variance of this partition function estimator (e.g., Maddison et al., 2017), which motivates sampling algorithms used in lower variance partition function estimators such as SMC (Doucet et al., 2001) and AIS (Neal, 2001).
5.1 Importance Weighted Autoencoders (IWAE)
To tighten the ELBO without explicitly expanding the variational family, Burda et al. (2015) introduced the importance weighted autoencoder (IWAE) bound,
$$\log p(x) \geq \mathbb{E}_{z_{1:K} \sim \prod_{i} q(z_i \mid x)}\left[\log \frac{1}{K} \sum_{i=1}^{K} \frac{p(x, z_i)}{q(z_i \mid x)}\right]. \tag{5}$$
The IWAE bound reduces to the ELBO when $K = 1$, is non-decreasing as $K$ increases, and converges to $\log p(x)$ as $K \to \infty$ under mild conditions (Burda et al., 2015). Bachman and Precup (2015) introduced the idea of viewing IWAE as auxiliary variable variational inference and Naesseth et al. (2018); Cremer et al. (2017); Domke and Sheldon (2018) formalized the notion.
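Consider the variational family defined by the EIM based on SNIS (Algorithm 2). We use a learned, tractable distribution $q(z \mid x)$ as the proposal $\pi(z \mid x)$ and set $U(z \mid x) = \log q(z \mid x) - \log p(x, z)$, motivated by the fact that $p(z \mid x) \propto q(z \mid x)\exp(-U(z \mid x))$ is the optimal variational distribution. Similar to the variational distribution used in Section 3.2, setting
$$r(z_{1:K}, i \mid z, x) = \frac{1}{K}\,\delta_{z_i}(z) \prod_{j \neq i} q(z_j \mid x) \tag{6}$$
yields the IWAE bound Eq. 5 when plugged into Eq. 4 (see Appendix A for details).
From Eq. 4, it is clear that IWAE is a lower bound on the standard ELBO for the EIM $q(z \mid x)$ and the gap is due to $D_{KL}\left(q(z_{1:K}, i \mid z, x)\,\|\,r(z_{1:K}, i \mid z, x)\right)$. The choice of $r$ in Eq. 6 was for convenience and is suboptimal. The optimal choice of $r$ is
$$q(z_{1:K}, i \mid z, x) = \frac{1}{K}\,\delta_{z_i}(z)\,q(z_{-i} \mid i, z, x).$$
Compared to the optimal choice, Eq. 6 makes the approximation $q(z_{-i} \mid i, z, x) \approx \prod_{j \neq i} q(z_j \mid x)$, which ignores the influence of $z$ on $z_{-i}$ and the fact that $z_{-i}$ are not independent given $z$. A simple extension could be to learn a factored variational distribution conditional on $z$: $r(z_{1:K}, i \mid z, x) = \frac{1}{K}\,\delta_{z_i}(z) \prod_{j \neq i} r(z_j \mid z, x)$. Learning such an $r$ could improve the tightness of the bound, and we leave exploring this to future work.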
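The properties of the IWAE bound stated earlier in this subsection can be checked exactly on a small discrete example. The sketch below assumes the standard form of the bound from Burda et al. (2015), namely the expected log of the average of K importance weights p(x, z_i) / q(z_i | x) with the z_i drawn independently from q, and computes it by enumerating all K-tuples. The probability tables and the function name `iwae_bound` are hypothetical.

```python
import itertools

import numpy as np

# Hypothetical joint p(x, z) for one observed x and z in {0, 1, 2},
# together with an arbitrary proposal q(z | x).
p_joint = np.array([0.45, 0.12, 0.02])
q = np.array([0.6, 0.3, 0.1])


def iwae_bound(p_joint, q, num_samples):
    """Exact K-sample IWAE bound, enumerating every tuple (z_1, ..., z_K)."""
    weights = p_joint / q
    bound = 0.0
    for combo in itertools.product(range(len(q)), repeat=num_samples):
        idx = list(combo)
        prob = np.prod(q[idx])                  # probability of this tuple under q
        bound += prob * np.log(np.mean(weights[idx]))
    return bound


log_px = np.log(p_joint.sum())
bounds = [iwae_bound(p_joint, q, k) for k in (1, 2, 3, 4)]
for k, b in zip((1, 2, 3, 4), bounds):
    print(f"K = {k}: bound = {b:.6f}  (log p(x) = {log_px:.6f})")
```

The K = 1 value coincides with the ELBO, and the averaged importance weight is an unbiased estimate of p(x), which is the link to partition function estimation noted above.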
5.2 Semi-Implicit Variational Inference
As a way of increasing the flexibility of the variational family, Yin and Zhou (2018) introduce the idea of semi-implicit variational families. That is, they define an implicit distribution by transforming a random variable with a differentiable deterministic transformation (i.e., ). However, Sobolev and Vetrov (2018) keenly note that can be equivalently written as with two explicit distributions. As a result, semi-implicit variational inference is simply auxiliary variable variational inference by another name.
Additionally, Yin and Zhou (2018) provide a multi-sample lower bound on the log-likelihood which is generally applicable to auxiliary variable variational inference.
(7) 
We can interpret this bound as using an EIM for in Eq. 4. Generally, if we introduce additional auxiliary random variables into , we can tractably bound the objective
(8) 
where is a variational distribution. Analogously to the previous section, we set as an EIM based on the self-normalized importance sampling process with proposal and . If we choose
with , then Eq. 8 recovers the bound in Yin and Zhou (2018) (see Appendix B for details). In a similar manner, we can continue to recursively augment the variational distribution (i.e., add auxiliary latent variables to ).
This view reveals that the multi-sample bound from Yin and Zhou (2018) is simply one approach to choosing a flexible variational . Alternatively, Ranganath et al. (2016) use a learned variational . It is unclear when drawing additional samples is preferable to learning a more complex variational distribution. Furthermore, the two approaches can be combined by using a learned proposal instead of , which results in a bound described in Sobolev and Vetrov (2018).
5.3 Additional Bounds
Finally, we can also use the self-normalized importance sampling procedure to extend a proposal family to a larger family (instead of solely extending ) (Sobolev and Vetrov, 2018). Self-normalized importance sampling is a particular choice of taking a proposal distribution and moving it closer to a target. Hamiltonian Monte Carlo (Neal and others, 2011) is another choice which can also be embedded in this framework, as done by Salimans et al. (2015) and Caterini et al. (2018). Similarly, SMC can be used as a sampling procedure in an EIM and when used as the variational family, it succinctly derives variational SMC (Maddison et al., 2017; Naesseth et al., 2018; Le et al., 2017) without any instance-specific tricks. In this way, more elaborate variational bounds can be constructed by specific choices of EIMs without additional derivation.
6 Discussion
We proposed a flexible, yet tractable family of distributions by treating the approximate sampling procedure of energy-based models as the model of interest, referring to them as energy-inspired models. The proposed EIMs bridge the gap between learning and inference in EBMs. We explore three instantiations of EIMs induced by truncated rejection sampling, self-normalized importance sampling, and Hamiltonian importance sampling, and we demonstrate comparable or stronger performance than recently proposed generative models. The results presented in this paper use simple architectures on relatively small datasets. Future work will scale up both the architectures and size of the datasets.
Interestingly, as a byproduct, exploiting the EIMs to define the variational family provides a unifying framework for recent improvements in variational bounds, which simplifies existing derivations, reveals potentially suboptimal choices, and suggests ways to form novel bounds.
Acknowledgments
We thank Ben Poole, Abhishek Kumar, and Diederik Kingma for helpful comments. We thank Matthias Bauer for answering implementation questions about LARS.
References
 An auxiliary variational method. In International Conference on Neural Information Processing, pp. 561–566. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions, §1, §5.
 Stochastic variational video prediction. International Conference on Learning Representations. Cited by: §5.
 Training deep generative models: variations on a theme. In NIPS Approximate Inference Workshop, Cited by: §5.1.
 The im algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 201–208. Cited by: Appendix C, §3.2.
 Resampled priors for variational autoencoders. arXiv preprint arXiv:1810.11428. Cited by: §D.2, EnergyInspired Models: Learning with SamplerInduced Distributions, §3.1, §3.1, Figure 1, §4.2, Table 1, §4.
 Statistical analysis of nonlattice data. Journal of the Royal Statistical Society: Series D (The Statistician) 24 (3), pp. 179–195. Cited by: §2.
 Variational inference: a review for statisticians. Journal of the American Statistical Association. Cited by: §1, §2.
 Fundamentals of statistical exponential families: with applications in statistical decision theory. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions.
 Importance weighted autoencoders. nternational Conference on Learning Representations. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions, §1, §5.1, §5.
 Hamiltonian variational autoencoder. In Advances in Neural Information Processing Systems, pp. 8167–8177. Cited by: §5.3.
 Variational lossy autoencoder. International Conference on Learning Representations. Cited by: §5.
 A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988. Cited by: §5.
 Reinterpreting importanceweighted autoencoders. arXiv preprint arXiv:1704.02916. Cited by: §5.1.
 Kernel exponential family estimation via doubly dual embedding. arXiv preprint arXiv:1811.02228. Cited by: §1.
 Calibrating energybased generative adversarial networks. arXiv preprint arXiv:1702.01691. Cited by: §1.
 The helmholtz machine. Neural computation 7 (5), pp. 889–904. Cited by: §1.
 Stochastic video generation with a learned prior. International Conference on Machine Learning. Cited by: §5.
 Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §4.3.
 Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4471–4480. Cited by: §5.1.
 Divide and couple: using monte carlo variational objectives for posterior approximation. arXiv preprint arXiv:1906.10115. Cited by: §1.
 An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Cited by: §5.
 Implicit generation and generalization in energybased models. arXiv preprint arXiv:1903.08689. Cited by: §2.
 Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207. Cited by: §5.
 A fast and exact learning rule for a restricted class of boltzmann machines. Advances in Neural Information Processing Systems 4, pp. 912–919. Cited by: §1.
 Variational walkback: learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402. Cited by: §3.3.
 Pixelvae: a latent variable model for natural images. International Conference on Learning Representations. Cited by: §5.

Noisecontrastive estimation: a new estimation principle for unnormalized statistical models.
In
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
, pp. 297–304. Cited by: §2.  World models. Advances in neural information processing systems. Cited by: §5.
 Training products of experts by minimizing contrastive divergence. Neural computation 14 (8), pp. 1771–1800. Cited by: §1, §2.
 Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709. Cited by: §2.
 Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144. Cited by: §3.2, §4.2.
 An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233. Cited by: §1, §2.
 Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions, §1, §3.2.
 Deep directed generative models with energybased probability estimation. arXiv preprint arXiv:1606.03439. Cited by: §1.
 Markov random fields and their applications. American mathematical society. Cited by: §1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §D.1, §D.2, §D.3.
 Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751. Cited by: §5.
 Autoencoding variational bayes. nternational Conference on Learning Representations. Cited by: §1, §3.2, §5.

Deep kalman filters
. arXiv preprint arXiv:1511.05121. Cited by: §5.  Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. Cited by: §1.
 Autoencoding sequential monte carlo. International Conference on Learning Representations. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions, §1, §5.3, §5.
 A tutorial on energybased learning. Cited by: EnergyInspired Models: Learning with SamplerInduced Distributions, §1, §2.

 The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.
 Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: §4.
 Noise contrastive estimation and negative sampling for conditional models: consistency and statistical efficiency. arXiv preprint arXiv:1809.01812. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §3.2.
 Auxiliary deep generative models. arXiv preprint arXiv:1602.05473. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.
 Filtering variational objectives. In Advances in Neural Information Processing Systems, pp. 6573–6583. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.3, §5, §5.

 The Concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.2.
 Variational inference for Monte Carlo objectives. International Conference on Machine Learning. Cited by: §4.2.
 Doubly semi-implicit variational inference. arXiv preprint arXiv:1810.02789. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.
 Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pp. 968–977. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.1, §5.3, §5.
 MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2 (11), pp. 2. Cited by: §2, §5.3.
 Annealed importance sampling. Statistics and Computing 11 (2), pp. 125–139. Cited by: §1, §5.
 Hamiltonian importance sampling. In talk presented at the Banff International Research Station (BIRS) workshop on Mathematical Issues in Molecular Dynamics. Cited by: §1, §3.3, §3.
 Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §3.2.
 Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347. Cited by: Table 2.
 Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.2, §5.

 Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §1, §3.2, §5.
 Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §5.

 On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879. Cited by: §4.2.
 Markov chain Monte Carlo and variational inference: bridging the gap. In International Conference on Machine Learning, pp. 1218–1226. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.3, §5.
 Information processing in dynamical systems: foundations of harmony theory. Technical report Colorado Univ at Boulder Dept of Computer Science. Cited by: §1.
 Importance weighted hierarchical variational inference. In Bayesian Deep Learning Workshop. Cited by: Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.2, §5.2, §5.3, §5.
 Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 905–912. Cited by: §2.
 Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071. Cited by: §2.

 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256. Cited by: §3.2.
 Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. External Links: cs.LG/1708.07747 Cited by: §4.
 Semi-implicit variational inference. arXiv preprint arXiv:1805.11183. Cited by: Appendix B, Energy-Inspired Models: Learning with Sampler-Induced Distributions, §1, §5.2, §5.2, §5.2, §5.
 Optimal information processing and Bayes's theorem. The American Statistician 42 (4), pp. 278–280. Cited by: §5.
 Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. International Journal of Computer Vision 27 (2), pp. 107–126. Cited by: §1.
Appendices
Appendix A IWAE bound as AVVI with an EIM
We provide a proof sketch that the IWAE bound can be interpreted as auxiliary variable variational inference with an EIM. Recall the auxiliary variable variational inference bound (Eq. 1 and Eq. 4),
(9) 
Let be an EIM based on SNIS with proposal and energy function and be
(10) 
Then, plugging Eq. 10 into Eq. 9 with gives
which is the IWAE bound.
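For reference, the IWAE bound in its standard form is written out below. The notation is an assumption, since the symbols are not fixed in the text of this appendix: $p(x, z)$ denotes the joint model, $q(z \mid x)$ the variational proposal, and $K$ the number of proposal samples.

```latex
\log p(x) \;\ge\;
\mathbb{E}_{z_{1:K} \sim \prod_{i=1}^{K} q(z_i \mid x)}
\left[ \log \frac{1}{K} \sum_{i=1}^{K} \frac{p(x, z_i)}{q(z_i \mid x)} \right]
```

With $K = 1$ this reduces to the usual ELBO.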
Appendix B Semi-implicit Variational Inference Bound
which is equivalent to the multisample bound from [Yin and Zhou, 2018].
Appendix C Connection with CPC
Starting from the well-known variational bound on mutual information due to Barber and Agakov [2003]
for a variational distribution , we can use the self-normalized importance sampling distribution and choose the proposal to be (i.e., ). Applying the bound in Eq. 3, we have
This recovers the CPC bound and proves that it is indeed a lower bound on mutual information, whereas the heuristic justification in the original paper relied on unnecessary approximations.
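For orientation, the two bounds named above can be stated in their commonly used forms. The notation is assumed rather than taken from this appendix: $q(x \mid y)$ is the variational distribution, $f$ is a positive critic (under an energy convention one might take $f = \exp(-U)$), $(x_1, y)$ is drawn from the joint $p(x, y)$, and $x_2, \dots, x_K$ are drawn independently from the marginal $p(x)$.

```latex
% Barber and Agakov variational lower bound
I(X; Y) \;\ge\; H(X) + \mathbb{E}_{p(x, y)}\!\left[ \log q(x \mid y) \right]
        \;=\; \mathbb{E}_{p(x, y)}\!\left[ \log \frac{q(x \mid y)}{p(x)} \right]

% CPC (InfoNCE) lower bound with K samples
I(X; Y) \;\ge\; \mathbb{E}\!\left[ \log
  \frac{f(x_1, y)}{\frac{1}{K} \sum_{j=1}^{K} f(x_j, y)} \right]
```

The second expression is at most $\log K$, so more samples are needed to certify larger values of the mutual information.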
Appendix D Implementation Details
D.1 Synthetic data
All methods used a fixed 2D proposal distribution and a learned acceptance/energy function parameterized by a neural network with 2 hidden layers of size 20 and tanh activations. For SNIS and LARS, the number of proposal samples drawn, , was set to 1024 and for HIS . We used a batch size of 128 and Adam [Kingma and Ba, 2014] with a learning rate of to fit the models. For evaluation, we report the IWAE bound with 1000 samples for HIS and SNIS. For LARS there is no equivalent to the IWAE bound, so we instead estimate the normalizing constant with 1000 samples.

The nine Gaussians density is a mixture of nine equally weighted 2D Gaussians with variance and means . The checkerboard density places equal mass on the squares and, . The two rings density is defined as
D.2 Binarized MNIST and Fashion MNIST
We chose hyperparameters to match the MNIST experiments in Bauer and Mnih [2018]. Specifically, we parameterized the energy function by a neural network with two hidden layers of size 100 and tanh activations, and parameterized the VAE observation model by neural networks with two layers of 300 units and tanh activations. The latent spaces of the VAEs were 50-dimensional, SNIS’s was set to 1024, and HIS’s was set to . We also linearly annealed the weight of the KL term in the ELBO from 0 to 1 over the first steps and dropped the learning rate from to on step . All models were trained with Adam [Kingma and Ba, 2014].
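The two schedules described above (linear KL annealing and a single learning-rate drop) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the parameters `anneal_steps`, `initial_lr`, `final_lr`, and `drop_step` are hypothetical placeholders for the values used in the experiments.

```python
def kl_weight(step, anneal_steps):
    """Weight on the KL term: rises linearly from 0 to 1 over the
    first `anneal_steps` training steps, then stays at 1."""
    return min(1.0, step / anneal_steps)


def learning_rate(step, initial_lr, final_lr, drop_step):
    """Piecewise-constant schedule: `initial_lr` before `drop_step`,
    `final_lr` from `drop_step` onwards."""
    return initial_lr if step < drop_step else final_lr


def annealed_negative_elbo(reconstruction_nll, kl, step, anneal_steps):
    """Training loss with the annealed KL weight applied.
    `reconstruction_nll` and `kl` are the two per-example ELBO terms."""
    return reconstruction_nll + kl_weight(step, anneal_steps) * kl
```

Once the annealing phase is over the weight is exactly 1, so the loss coincides with the negative ELBO for the rest of training.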
D.3 Continuous MNIST, Fashion MNIST, and CelebA
For the small VAE, we parameterized the VAE observation model by neural networks with a single layer of 20 units and tanh activations. The latent spaces of the small VAEs were 10-dimensional. In these experiments, SNIS’s was set to 128, and HIS’s was set to . We also dropped the learning rate from to on step . All models were trained with Adam [Kingma and Ba, 2014].
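As an illustration of the SNIS-based sampler in the synthetic-data configuration of D.1 (2D proposal, energy network with 2 hidden layers of size 20 and tanh activations, 1024 proposal samples), a minimal NumPy sketch follows. Several details are assumptions: the proposal is taken to be a standard normal, the weights are random rather than trained, and the helper names `init_mlp`, `energy`, and `snis_sample` are hypothetical.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Random (untrained) weights for a fully connected network.
    sizes = [input_dim, hidden_1, ..., output_dim]."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def energy(params, x):
    """Scalar energy per row of x: tanh hidden layers, linear output."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return (h @ w + b)[..., 0]


def snis_sample(rng, params, num_proposals=1024, dim=2):
    """Draw one sample: propose `num_proposals` points, then pick one
    with probability proportional to exp(-energy)."""
    # Assumed proposal: standard normal in `dim` dimensions.
    proposals = rng.standard_normal((num_proposals, dim))
    logits = -energy(params, proposals)
    logits -= logits.max()          # stabilise the softmax numerically
    probs = np.exp(logits)
    probs /= probs.sum()
    idx = rng.choice(num_proposals, p=probs)
    return proposals[idx]


rng = np.random.default_rng(0)
params = init_mlp(rng, [2, 20, 20, 1])   # 2 hidden layers of size 20
sample = snis_sample(rng, params)        # one 2D sample from the sampler
```

Each draw costs one batched energy evaluation over all proposals, so the number of proposals trades compute for how closely the sampler can follow the energy function.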