1 Introduction
Implicit distributions are probability models whose probability density function may be intractable, but for which there is a way to

sample from them exactly and/or calculate or approximate expectations under them, and

calculate or estimate gradients of such expectations with respect to model parameters.

A popular example of implicit models is the stochastic generative network: samples from a simple distribution, such as a uniform or Gaussian, are transformed nonlinearly and non-invertibly by a deep neural network. Such networks can flexibly parametrise a wide range of probability distributions, including degenerate ones which may not even have a continuous density.
Implicit models have been successfully applied to generative modelling in generative adversarial networks (GANs; Goodfellow et al., 2014) and subsequent work (Salimans et al., 2016; Radford et al., 2016; Donahue et al., 2017; Dumoulin et al., 2017). They work particularly well for visual data, partly because they can exploit the inductive biases of convolutional neural networks, and partly because they can flexibly model potentially degenerate, manifold-like distributions which natural images are assumed to follow.
This note is about using implicit distributions in another important probabilistic machine learning problem: approximate inference in latent variable models. Unlike in the first applications of GANs to generative modelling, where an implicit model directly models the distribution of observed data, in approximate inference we are interested in modelling the posterior distribution of latent variables given observed data. Direct generative modelling and approximate inference are very different problems indeed: in the former we are provided with samples from the distribution to be modelled; in the latter we are given a joint distribution $p_\theta(x, z)$ of latents $z$ and observations $x$, and a set of observed samples, but no samples from the posterior itself.

In this note we focus on variational inference (VI), which works by minimising a divergence between the approximate and real posteriors. More precisely, we follow the usual KL-divergence formulation, but other, more general variational methods exist (e.g. Li & Turner, 2016; Ranganath et al., 2016). VI also provides a lower bound to the marginal likelihood or model evidence, the evidence lower bound or ELBO, which can be maximised with respect to parameters of the latent variable model to approximate maximum likelihood learning. It is important to keep in mind that although several algorithms in this note look and feel like adversarial training procedures, the way the model is fit to observed data is more akin to variational autoencoders (VAEs; Kingma & Welling, 2014; Rezende & Mohamed, 2015) than to GANs.
There are several reasons to explore implicit distributions in the context of variational inference. Firstly, explicit VI is often limited to exponential family distributions or other distributions with tractable densities (Rezende & Mohamed, 2015; Kingma et al., 2016), which may not be expressive enough to capture non-trivial dependencies that the real posterior exhibits. A flexible, implicit distribution may provide a better approximation to the posterior and a sharper lower bound. Secondly, it may be desirable to use an implicit likelihood, as the resulting latent variable model may fit the data better. For example, the likelihood or forward model might be described as a probabilistic program (Vajda, 2014) whose density is intractable or unknown. Finally, sometimes we may want to use implicit priors over latent variables. For example, in a deep hierarchical latent variable model the prior for a layer may be a complicated probabilistic model with an intractable density. Or, when solving inference problems in computational photography, the prior may be the empirical distribution of natural images, as in (Sønderby et al., 2017). In summary, any or all of the prior, the likelihood and the approximate posterior may have to be modelled implicitly, and we need VI procedures that are ready to tackle these situations.
In this note we present two sets of tools to handle implicit distributions in variational inference: GAN-like adversarial algorithms, which rely on density ratio estimation, and denoising-based algorithms, which build a representation of the gradients of each implicit distribution's log-density and use these gradient estimates directly in a stochastic gradient descent (SGD) algorithm. We further classify algorithms as prior-contrastive and joint-contrastive depending on which form of the variational bound they use. Prior-contrastive methods only deal with implicit distributions over latent variables (i.e. the prior or approximate posterior), while joint-contrastive methods can handle fully implicit models where none of the distributions involved has a tractable density. This classification gives rise to a range of algorithms listed in Table 1, alongside related algorithms from prior work. All of the algorithms presented here can perform variational approximate inference, which is the main focus of this note, but not all of them can perform learning unless the likelihood is explicitly defined.
[Table 1: capabilities of the algorithms presented here and of related prior work (VAE, Kingma & Welling, 2014; NF, Rezende & Mohamed, 2015; PC-Adv, Algorithm 1; AffGAN, Sønderby et al., 2017; AVB, Mescheder et al., 2017; OPVI, Ranganath et al., 2016; PC-Den, Algorithm 3; JC-Adv, Algorithm 2; JC-Den; JC-Adv-RMD; AAE, Makhzani et al., 2016; DeePSiM, Dosovitskiy & Brox, 2016; ALI, Dumoulin et al., 2017; BiGAN, Donahue et al., 2017), indicating for each whether it performs variational inference and which of its distributions may be implicit.] AffGAN is specialised to the task of image super-resolution where the likelihood is degenerate and linear. The reverse-mode differentiation-based JC-Adv-RMD algorithm has not been validated experimentally.

1.1 Overview of prior work
Several of the algorithms proposed here have been discovered in some form before. However, their connection to variational inference is rarely made explicit. In this section we review algorithms for inference and feature learning which use implicit distributions or adversarial training. As we will see, several of these admit a variational interpretation or can be rather straightforwardly modified to fit the variational framework.
GANs have been used rather successfully to solve inverse problems in computer vision. These inverse problems can be cast as a special case of approximate inference.
Dosovitskiy & Brox (2016) used GANs to reconstruct and generate images from nonlinear feature representations. As pointed out later by Sønderby et al. (2017), this method, DeePSiM, can be interpreted as a special case of amortised maximum a posteriori (MAP) or variational inference with a Gaussian observation model. GANs have also been used for inference in image super-resolution (Ledig et al., 2016; Sønderby et al., 2017). Connections between GANs and VI in this context were first pointed out in (Sønderby et al., 2017, Appendix F). Sønderby et al. (2017) also introduced a modified objective function for the GAN generator which ensures that the algorithm minimises the Kullback-Leibler divergence as opposed to the Jensen-Shannon divergence, an essential step towards using GANs for VI. The AffGAN algorithm presented there is highly application-specific, thus it does not solve VI in general.
In more recent, parallel work, Mescheder et al. (2017) propose adversarial variational Bayes (AVB), perhaps the best description of the use of GANs for variational inference. AVB is a general algorithm that allows for implicit variational distributions and is in fact equivalent to the prior-contrastive adversarial algorithm (PC-Adv, Algorithm 1) described in Section 3. Operator variational inference (OPVI, Ranganath et al., 2016) formulates a general class of variational lower bounds based on operator divergences, resulting in a practical algorithm for training implicit inference networks without a tractable density. As shown in that paper, the KL-divergence-based variational bound used here and in (Mescheder et al., 2017) is a special case of OPVI. Adversarial autoencoders (AAE, Makhzani et al., 2016) are similar to variational autoencoders, with the KL-divergence term replaced by an adversarial objective. However, AAEs do not use the KL-divergence formulation of the adversarial loss, and their discriminator is independent of the encoder's input, thus they are not a true variational method. Finally, Karaletsos (2016) proposed adversarial message passing, in which adversaries are employed to minimise local Jensen-Shannon divergences in an algorithm more akin to expectation propagation (Minka, 2001) than to variational inference.
Another line of research extends GANs to latent variable models by training the discriminator on the joint distribution of latent and observed variables. This technique has been independently discovered as bidirectional GAN (BiGAN, Donahue et al., 2017) and adversarially learned inference (ALI, Dumoulin et al., 2017). These algorithms are closely related to the joint-contrastive adversarial algorithm (JC-Adv, Algorithm 2). ALI and BiGAN use the Jensen-Shannon formulation of GANs rather than the KL-divergence formulation used here. On the one hand, this means that the Jensen-Shannon variants are not technically VI algorithms. On the other hand, the symmetry of the Jensen-Shannon divergence makes ALI and BiGAN completely symmetric, enabling not only approximate inference but also learning in the same algorithm. Unfortunately, this is no longer true when KL divergences are used: JC-Adv is an algorithm for inference only.
The algorithms mentioned so far are examples of adversarial techniques which rely on density ratio estimation as the primary tool for dealing with implicit distributions (Mohamed & Lakshminarayanan, 2016). Sønderby et al. (2017) and Warde-Farley & Bengio (2017) demonstrated an alternative or complementary technique based on denoising autoencoders. As shown by Alain & Bengio (2014), the optimal denoising function learns to represent gradients of the log data density, which in turn can be used in an inference method. Sønderby et al. (2017) used this insight to build a denoiser-based inference algorithm for image super-resolution and connected it to amortised maximum a posteriori (MAP) inference. The extension from MAP to variational inference is straightforward, and this method is closely related to the prior-contrastive denoising VI (PC-Den, Algorithm 3) algorithm presented here.

2 Variational Inference: Two forms
In this section, we give a lightweight overview of amortised variational inference (VI) in a latent variable model, similar to e.g. variational autoencoders (Kingma & Welling, 2014). We observe an i.i.d. sequence of data points $x_n$. For each data point there exists an associated latent variable $z_n$. We specify a prior $p_\theta(z)$ over latent variables and a forward model $p_\theta(x|z)$ which describes how the observations are related to latents. In such a model we are interested in maximum likelihood learning, which maximises the marginal likelihood or model evidence $p_\theta(x)$ with respect to parameters $\theta$, and in inference, which involves calculating the posterior $p_\theta(z|x)$. We assume that neither the marginal likelihood nor the posterior is tractable.
In amortised VI we introduce an auxiliary probability distribution $q_\psi(z|x)$, known as the recognition model, inference network or approximate posterior. Using $q_\psi$ we define the evidence lower bound (ELBO) as follows:
$$\mathcal{L}(\theta, \psi; x) = \mathbb{E}_{z \sim q_\psi(z|x)} \log \frac{p_\theta(x, z)}{q_\psi(z|x)} = \log p_\theta(x) - \operatorname{KL}\left[ q_\psi(z|x) \,\|\, p_\theta(z|x) \right] \quad (1)$$
As the name suggests, the ELBO is a lower bound to the model evidence, and the bound is tight when $q_\psi(z|x)$ matches the true posterior exactly. Maximising the ELBO with respect to $\psi$ is known as variational inference: it minimises the KL divergence $\operatorname{KL}[q_\psi(z|x) \| p_\theta(z|x)]$, thus moving $q_\psi$ closer to the posterior. Conversely, maximising the ELBO with respect to $\theta$ is known as variational learning, which approximates maximum likelihood learning.
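For a fully tractable model, both properties of the bound can be checked numerically. The sketch below is illustrative and not from the note: it assumes the conjugate model $p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$, for which the evidence is $\mathcal{N}(x; 0, 2)$ and the posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import numpy as np

def log_normal(x, mean, var):
    # log density of N(mean, var) evaluated at x
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo(x, m, s2):
    """ELBO for q(z|x) = N(m, s2) in the model z ~ N(0,1), x|z ~ N(z,1)."""
    # E_q[log p(x|z)] in closed form
    expected_loglik = log_normal(x, m, 1.0) - 0.5 * s2
    # KL[N(m, s2) || N(0, 1)] in closed form
    kl = 0.5 * (m ** 2 + s2 - 1.0 - np.log(s2))
    return expected_loglik - kl

x = 1.3
log_evidence = log_normal(x, 0.0, 2.0)
# A mismatched q gives a strict lower bound ...
assert elbo(x, 0.0, 1.0) < log_evidence
# ... and the bound is tight at the true posterior N(x/2, 1/2).
assert abs(elbo(x, x / 2, 0.5) - log_evidence) < 1e-9
```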
The ELBO can be calculated exactly for many combinations of $p_\theta$ and $q_\psi$ whose densities are tractable. VAEs use the reparametrisation trick to construct a low-variance estimator of the ELBO, but still require tractable densities for both the model $p_\theta$ and the recognition model $q_\psi$. If $p_\theta$ and/or $q_\psi$ are implicit, the ELBO needs to be approximated differently. As we will see in the next sections, it is useful to formulate the ELBO in terms of density ratios. There are two main forms considered here. Firstly, the prior-contrastive form, used also by VAEs (Kingma & Welling, 2014):
$$\mathcal{L}(\theta, \psi; x) = \mathbb{E}_{z \sim q_\psi(z|x)} \left[ \log p_\theta(x|z) - r(z, x) \right] \quad (2)$$
where we introduced the notation $r(z, x) = \log \frac{q_\psi(z|x)}{p_\theta(z)}$ for the logarithmic density ratio.
We call Eqn. (2) the prior-contrastive expression, as the term $r(z, x)$ contrasts the approximate posterior $q_\psi(z|x)$ with the prior $p_\theta(z)$. Alternatively, we can write the ELBO in a joint-contrastive form as follows:
$$\mathbb{E}_{x \sim p_{\mathcal{D}}}\, \mathcal{L}(\theta, \psi; x) = -\operatorname{KL}\left[ p_{\mathcal{D}}(x)\, q_\psi(z|x) \,\|\, p_\theta(x, z) \right] + \mathbb{H}[p_{\mathcal{D}}] \quad (3)$$
where we introduced the notation $p_{\mathcal{D}}$ to denote the real data distribution and $\mathbb{H}[p_{\mathcal{D}}]$ denotes its entropy (in practice, $p_{\mathcal{D}}$ is an empirical distribution of samples, so technically it does not have a continuous density or differential entropy $\mathbb{H}[p_{\mathcal{D}}]$; we still use this notation liberally to avoid unnecessarily complicating the derivations). $\mathbb{H}[p_{\mathcal{D}}]$ can be ignored as it is constant with respect to both $\theta$ and $\psi$. We also introduced the notation $s(x, z) = \log \frac{p_{\mathcal{D}}(x)\, q_\psi(z|x)}{p_\theta(x, z)}$ for the corresponding logarithmic density ratio. Note that while $r$ was a log-ratio between densities over $z$, $s$ is a ratio of joint densities over the tuple $(x, z)$. As this form contrasts joint distributions, we call Eqn. (3) the joint-contrastive expression.
When using implicit models the density ratios $r$ and $s$ cannot be computed analytically. Indeed, even if all distributions involved are explicitly defined, $p_{\mathcal{D}}$ is only available as an empirical distribution, thus $s$ cannot be calculated even if the densities of the other distributions are tractable. In this note we rely on techniques for estimating $r$ or $s$, or their gradients, directly from samples. For this to work we need to deal with a final difficulty: $r$ and $s$ themselves implicitly depend on the parameter $\psi$ which we would like to optimise.
2.1 Dependence of $r$ and $s$ on $\psi$
The KL-divergences in Eqns. (2) and (3) depend on $\psi$ in two ways: first, an expectation is taken with respect to $q_\psi$; this is fine, as we assumed expectations under implicit distributions and their gradients can be approximated easily. Secondly, the ratios $r$ and $s$ themselves depend on $\psi$, which may cause difficulties. If one optimised the ELBO naively via gradient descent, one would have to backpropagate through both of these dependencies. Fortunately, the second dependence can be ignored:
$$\frac{\partial}{\partial \psi}\, \mathbb{E}_{z \sim q_\psi} \log \frac{q_\psi(z|x)}{p_\theta(z)} = \left. \frac{\partial}{\partial \psi}\, \mathbb{E}_{z \sim q_\psi} \log \frac{q_{\psi_0}(z|x)}{p_\theta(z)} \right|_{\psi_0 = \psi} \quad (4)$$
The only difference between the LHS and RHS of the equation is in the subscripts $\psi$ vs. $\psi_0$. As $r_{\psi_0}(z, x) = \log \frac{q_{\psi_0}(z|x)}{p_\theta(z)}$ is a constant with respect to $\psi$, Eqn. (4) reduces to the gradient of an expectation with respect to $q_\psi$, which we assumed we can approximate even if $q_\psi$ is an implicit distribution. The detailed proof of Eqn. (4) is in Appendix A; the key idea is the observation that the expected score vanishes, $\mathbb{E}_{z \sim q_\psi} \frac{\partial}{\partial \psi} \log q_\psi(z|x) = 0$, for any $\psi$.
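The identity in Eqn. (4) is easy to sanity-check numerically. The sketch below is illustrative and not from the note: it assumes $q_\psi = \mathcal{N}(\psi, 1)$ and $p = \mathcal{N}(0, 1)$, for which the full gradient of $\operatorname{KL}[q_\psi \| p] = \psi^2/2$ is simply $\psi$, and compares it to the gradient taken with the ratio's parameter frozen at $\psi_0 = \psi$.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = 0.7
eps = rng.standard_normal(100_000)  # reparametrisation noise: z = psi + eps

def log_ratio(z, psi0):
    # log N(z; psi0, 1) - log N(z; 0, 1) simplifies to psi0*z - psi0**2/2
    return psi0 * z - psi0 ** 2 / 2

# Full gradient of KL[N(psi,1) || N(0,1)] = psi**2 / 2 with respect to psi:
full_grad = psi

# RHS of Eqn. (4): differentiate only through the sampling distribution,
# with the ratio's parameter frozen at psi0 = psi (central finite differences):
h = 1e-4
frozen_grad = (log_ratio(psi + h + eps, psi).mean()
               - log_ratio(psi - h + eps, psi).mean()) / (2 * h)

assert abs(frozen_grad - full_grad) < 1e-6
```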
An analogous equation holds for $s$ in Eqn. (3), and indeed for any other KL divergence.
2.2 Approximate SGD Algorithms for VI
In the following sections we outline algorithms for VI which allow for implicit distributions. These algorithms can generally be described as two nested loops of the following nature:

the outer loop performs stochastic gradient descent (SGD) on an approximation to the ELBO with respect to $\psi$, using gradient estimates obtained by the inner loop;

in each iteration of the outer loop, with $\psi$ fixed, the inner loop constructs an estimate of $r$, of $s$, or more generally of the gradient in Eqn. (4).
As long as the gradient estimates provided by the inner loop have no memory between subsequent iterations of the outer loop, and on average constitute a conservative vector field, the algorithms can be seen as instances of SGD, and as such should have the convergence properties of SGD.
3 Direct Density Ratio Estimation
Direct density ratio estimation, also known as direct importance estimation, is the task of estimating the ratio between the densities of two probability distributions given only i.i.d. samples from each of them (see e.g. Kanamori et al., 2009; Sugiyama et al., 2008; Mohamed & Lakshminarayanan, 2016). This task is relevant in many machine learning applications, such as dealing with covariate shift or domain adaptation. A range of methods have been introduced to learn density ratios from samples; here we focus on adversarial techniques which employ a discriminator trained via logistic regression. We note that other methods such as KLIEP (Sugiyama et al., 2008; Mohamed & Lakshminarayanan, 2016) or LSIF (Kanamori et al., 2009; Uehara et al., 2016) could be used just as well.

3.1 Adversarial approach using discriminators
Bickel et al. (2007) proposed estimating density ratios by training a logistic regression classifier between samples from the two distributions. Assuming the classifier is close to the unique Bayes-optimum, it can be used directly to provide an estimate of the logarithmic density ratio. This approach has found great application in generative adversarial networks (Sønderby et al., 2017; Mohamed & Lakshminarayanan, 2016; Uehara et al., 2016), which work particularly well for generative modelling of images (see e.g. Salimans et al., 2016).
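As a minimal illustration of this idea (not from the note), a logistic regression classifier trained to tell $q = \mathcal{N}(1, 1)$ from $p = \mathcal{N}(0, 1)$ should recover the Bayes-optimal logit $\log q(z)/p(z) = z - 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
zq = rng.normal(1.0, 1.0, n)  # samples from q = N(1, 1), labelled 1
zp = rng.normal(0.0, 1.0, n)  # samples from p = N(0, 1), labelled 0
z = np.concatenate([zq, zp])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression with logit r(z) = w*z + b, trained by gradient descent
w, b = 0.0, 0.0
for _ in range(3000):
    p_hat = 1.0 / (1.0 + np.exp(-(w * z + b)))
    w -= 0.2 * np.mean((p_hat - y) * z)
    b -= 0.2 * np.mean(p_hat - y)

# The Bayes-optimal logit is the true log density ratio z - 0.5
assert abs(w - 1.0) < 0.05 and abs(b + 0.5) < 0.05
```

Up to Monte Carlo error, the learnt logit matches the analytical log ratio, which is exactly the property the adversarial VI algorithms below exploit.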
Let us use this to construct an approximation $\hat{r}$ to the logarithmic density ratio $r(z, x)$ from Eqn. (2). We can do this by minimising the following objective function, typically via SGD:
$$\ell(\hat{r}) = -\mathbb{E}_{z \sim q_\psi(z|x)}\, \varsigma_-\!\big(\hat{r}(z, x)\big) + \mathbb{E}_{z \sim p_\theta(z)}\, \varsigma_+\!\big(\hat{r}(z, x)\big) \quad (5)$$
where $\varsigma_+(t) = \log(1 + e^t)$ and $\varsigma_-(t) = -\log(1 + e^{-t})$ are the softplus and softminus functions, respectively; at the optimum, $\hat{r}(z, x) = r(z, x)$. Once the approximate log-ratio is found, we can use it to take a gradient descent step along the approximate negative ELBO:
$$-\frac{\partial}{\partial \psi}\, \mathbb{E}_{\epsilon} \left[ \log p_\theta\big(x \mid g_\psi(\epsilon, x)\big) - \hat{r}\big(g_\psi(\epsilon, x), x\big) \right] \quad (6)$$
where we reparametrised sampling from $q_\psi(z|x)$ in terms of a generator function $g_\psi$ and noise $\epsilon$. When $q_\psi$ is explicitly defined, this reparametrisation is the same as the one used in VAEs. When $q_\psi$ is an implicit distribution, it is often already defined in terms of a nonlinear function and a noise variable which it transforms.
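The reparametrised gradient in Eqn. (6) can be illustrated on a toy objective (again not from the note): with $z = g_\psi(\epsilon) = \psi + \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)$, and test function $f(z) = z^2$, the true gradient of $\mathbb{E}[f(z)] = \psi^2 + 1$ is $2\psi$:

```python
import numpy as np

rng = np.random.default_rng(1)
psi = 0.4
eps = rng.standard_normal(200_000)

# Reparametrise z ~ N(psi, 1) as z = g_psi(eps) = psi + eps, then push the
# gradient through the sampler: d/dpsi f(g_psi(eps)) = f'(z) * dg/dpsi = 2z
z = psi + eps
grad_estimate = np.mean(2.0 * z)

# The analytical gradient of E[z^2] = psi^2 + 1 is 2*psi
assert abs(grad_estimate - 2 * psi) < 0.05
```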
Equations (5) and (6) are analogous to the discriminator and generator losses in generative adversarial networks, with $\hat{r}$ taking the role of the discriminator's logit. Optimising the two losses in tandem gives rise to the prior-contrastive adversarial algorithm (PC-Adv, Algorithm 1) for variational inference. This algorithm is equivalent to the independently developed adversarial variational Bayes (AVB; Mescheder et al., 2017).
As the likelihood $p_\theta(x|z)$ appears in Eqn. (6), in Algorithm 1 the forward model has to be explicitly defined, but the prior and the approximate posterior can be implicit. Algorithm 1 only describes variational inference, i.e. finding $\psi$ given $\theta$, but the approximate ELBO in Eqn. (6) can be used for variational learning of $\theta$ as well, with the exception of the parameters of the prior $p_\theta(z)$.
Learning prior parameters involves minimising the KL-divergence $\operatorname{KL}[q_\psi(z|x) \| p_\theta(z)]$, which is akin to fitting $p_\theta(z)$ to samples from the aggregate posterior via maximum likelihood. If the prior has a tractable density, this may be an easy task. A more interesting case is when the prior itself is a latent variable model, in which case we can lower bound said KL divergence with another ELBO, thereby stacking multiple models on top of each other in a hierarchical fashion (Kingma & Welling, 2014; Rezende & Mohamed, 2015; Sønderby et al., 2016).
A similar adversarial algorithm (JC-Adv, Algorithm 2) can be constructed to target $s$ in the joint-contrastive formulation of the ELBO (Eqn. 3). JC-Adv is very similar to ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017) in that it learns to discriminate between the joint distributions $p_{\mathcal{D}}(x)\, q_\psi(z|x)$ and $p_\theta(x, z)$. Unlike these methods, however, JC-Adv uses the correct loss functions, so that it maximises an approximation to the ELBO. Unlike PC-Adv, which required a tractable likelihood $p_\theta(x|z)$, JC-Adv also works with completely implicitly defined models. As a downside, JC-Adv provides no direct way to perform variational learning of $\theta$. ALI and BiGAN exploit the symmetry of the Jensen-Shannon divergence to optimise for $\theta$, but as JC-Adv uses the asymmetric KL-divergence, this is not an option. Section 7 explores an idea for fixing this drawback of JC-Adv.

4 Denoiser-guided learning
Although most versions of GANs use an adversarial discriminator based on logistic regression, there are other ways one can tackle learning and inference with implicit distributions. One interesting tool that has emerged in recent papers (Sønderby et al., 2017; Warde-Farley & Bengio, 2017) is the use of denoising autoencoders (DAEs, Vincent et al., 2008) or reconstruction-contractive autoencoders (RCAEs, Alain & Bengio, 2014). The key observation for using denoising is that the Bayes-optimal denoiser function captures the gradients of the log density of the data-generating distribution:
$$v^*(z) \approx z + \sigma^2\, \nabla_z \log p(z) \quad (7)$$
where $\sigma^2$ is the variance of the Gaussian corruption noise and the approximation becomes exact as $\sigma \to 0$. This allows us to construct an estimator of the score $\nabla_z \log p(z)$ of a distribution by fitting a DAE to samples from it. We note that it is possible to obtain a more precise analytical expression for the optimal denoising function (Valpola et al., 2016).
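Eqn. (7) can be verified in closed form for a standard normal (an illustrative sketch, not from the note): corrupting $z \sim \mathcal{N}(0, 1)$ with noise of variance $\sigma^2$, the least-squares optimal linear denoiser is $v(\tilde{z}) = \tilde{z}/(1 + \sigma^2)$, which equals $\tilde{z} + \sigma^2 \nabla \log p_\sigma(\tilde{z})$ for the corrupted density $p_\sigma = \mathcal{N}(0, 1 + \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
z = rng.standard_normal(200_000)               # clean samples from p = N(0, 1)
zt = z + sigma * rng.standard_normal(z.shape)  # Gaussian-corrupted samples

# Fit the least-squares linear denoiser v(zt) = a * zt to predict z from zt
a = np.dot(zt, z) / np.dot(zt, zt)

# Theory: a = 1 / (1 + sigma^2), so (v(zt) - zt) / sigma^2 = -zt / (1 + sigma^2),
# the score of the corrupted density N(0, 1 + sigma^2)
assert abs(a - 1.0 / (1.0 + sigma ** 2)) < 0.01
```

Note that, as discussed in Section 7, the recovered score is that of the noise-corrupted distribution rather than the original; the two coincide only in the small-noise limit.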
Let us see how one can use this in the prior-contrastive scenario to deal with an implicit $q_\psi(z|x)$. First, we fit a denoising function $v$ by minimising the following loss:
$$\ell(v) = \mathbb{E}_{z \sim q_\psi(z|x)}\, \mathbb{E}_{\eta \sim \mathcal{N}(0, \sigma^2 I)} \left\| v(z + \eta, x) - z \right\|^2 \quad (8)$$
so that $\frac{v(z, x) - z}{\sigma^2}$ approximates the score $\nabla_z \log q_\psi(z|x)$, which can then be used to approximate the gradient of the ELBO (Eqn. (4)) with respect to $\psi$ as follows:
$$\frac{\partial}{\partial \psi} \mathcal{L} \approx \mathbb{E}_{\epsilon} \left[ \frac{\partial g_\psi(\epsilon, x)}{\partial \psi}^{\!\top} \left( \nabla_z \log p_\theta(x, z) - \frac{v(z, x) - z}{\sigma^2} \right) \right]_{z = g_\psi(\epsilon, x)} \quad (9)$$
Several SGD methods only require gradients of the objective function as input, so this gradient estimate can be readily used to optimise an approximate ELBO. The resulting iterative algorithm, prior-contrastive denoising VI (PC-Den, Algorithm 3), updates the denoiser and the variational distribution in tandem. Following a similar derivation one can construct several other variants of the algorithm. The denoiser approach is more flexible than the adversarial approach, as one can pick and choose which individual distributions are modelled explicitly and which ones are implicit. For example, when the prior is implicit, we can train a denoiser to represent its score function. Or, one can start from the joint-contrastive formulation of the ELBO and train a DAE over the joint distribution of $x$ and $z$, giving rise to joint-contrastive denoising VI (JC-Den). In the interest of space, the detailed description of these variants is omitted here.
As the denoising criterion estimates the gradients of the ELBO but not its value, the denoising approach does not provide a direct way to learn the model parameters $\theta$. The denoising method may be best utilised in conjunction with an adversarial algorithm, such as a combination of PC-Den and PC-Adv. The denoising method works better early on in training, when $q_\psi(z|x)$ and $p_\theta(z)$ are very different and the discrimination task is therefore too easy. Conversely, as $q_\psi(z|x)$ approaches $p_\theta(z)$, the discriminator can focus its efforts on modelling the residual differences between them, rather than trying to model everything about $q_\psi(z|x)$ in isolation as the denoiser in Algorithm 3 does. Warde-Farley & Bengio (2017) already used such a combination of adversarial training with a denoising criterion for generative modelling. However, the additional nonlinear transformation before denoising introduced in that work breaks the mathematical connection to KL divergence minimisation.
5 Summary
To summarise, we have presented two main ways to formulate the ELBO in terms of the logarithmic density ratios $r$ and $s$. We called these prior-contrastive (PC) and joint-contrastive (JC). We have then described two techniques by which these density ratios, or their gradients, can be estimated if the distributions involved are implicit: adversarial methods (PC-Adv and JC-Adv, Algorithms 1 & 2) directly estimate density ratios via logistic regression; denoising methods (PC-Den and JC-Den, Algorithm 3) estimate gradients of the log densities via denoising autoencoders. We have mentioned that these methods can be combined, and that such a combination may improve convergence.
While all of these algorithms can perform variational inference, i.e. fitting the variational parameters $\psi$, not all of them can perform full variational learning of the model parameters $\theta$ if the model itself is implicitly defined. In Section 7 we outline an idea based on reverse mode differentiation (RMD, Maclaurin et al., 2015), giving rise to an algorithm we refer to as JC-Adv-RMD, which can in theory perform fully variational inference and learning in a model where all distributions involved are implicit.
The capabilities of the algorithms presented here and in related work are summarised in Table 1. Adversarial variational Bayes (Mescheder et al., 2017) is equivalent to PC-Adv, while ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017) are closely related to JC-Adv. AffGAN (Sønderby et al., 2017) and DeePSiM (Dosovitskiy & Brox, 2016) are closely related to PC-Adv, although the former solves a limited special case and the latter uses the Jensen-Shannon formulation and hence is not fully variational.
6 Experiments
Several related papers have already demonstrated the success of the methods surveyed here on real-world datasets; see for example (Mescheder et al., 2017; Dosovitskiy & Brox, 2016) for PC-Adv, (Dumoulin et al., 2017; Donahue et al., 2017) for JC-Adv and (Sønderby et al., 2017; Warde-Farley & Bengio, 2017) for denoiser-based techniques. Experiments in these papers typically focus on the models' ability to learn $\theta$, and on the quality of samples from the learnt generative model.
As the focus here is on inference rather than learning, the goal of this section is to validate the algorithms' ability to perform inference. To this end, we have devised a simple toy problem loosely based on the "sprinkler" example which exhibits explaining away (Pearl, 1988). In our "continuous sprinkler" model, two independent scalar hidden variables $z_1$ and $z_2$ are combined nonlinearly to produce a univariate observation $x$.
Although $z_1$ and $z_2$ are a priori independent, the likelihood introduces dependence between the two variables once conditioned on data: either latent variable taking a large value can explain a large observed $x$. This is an example of explaining away, an important phenomenon in latent variable models that is known to be hard to model with simple, unimodal distributions.
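The note does not reproduce the exact form of the nonlinearity here, so the sketch below is a hypothetical instantiation of such a model: standard normal priors on $z_1$ and $z_2$ combined through a max, with a small amount of Gaussian observation noise (all of these choices are assumptions made for illustration, not the model used in the experiments). It nonetheless exhibits the explaining-away effect described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sprinkler(n, noise_std=0.1):
    """Hypothetical 'continuous sprinkler': x depends on max(z1, z2)."""
    z1 = rng.standard_normal(n)  # assumed standard normal prior
    z2 = rng.standard_normal(n)
    x = np.maximum(z1, z2) + noise_std * rng.standard_normal(n)
    return z1, z2, x

z1, z2, x = sample_sprinkler(100_000)
# A priori the latents are (empirically) uncorrelated ...
assert abs(np.corrcoef(z1, z2)[0, 1]) < 0.02
# ... but conditioned on a large x they become negatively correlated
# (explaining away): one large latent suffices to explain the observation.
big = x > 1.5
assert np.corrcoef(z1[big], z2[big])[0, 1] < -0.1
```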
Column A in Figure 1 illustrates the joint posterior density of $z_1$ and $z_2$ for various values of $x$. The subsequent columns show the posterior approximations obtained by PC-Adv, JC-Adv and PC-Den, respectively. $q_\psi(z|x)$ is implemented as a stochastic generative network where the observation $x$ and Gaussian noise variables are fed as input to a multilayer perceptron. The discriminators and denoisers were implemented as multilayer perceptrons as well. Columns C and D illustrate the limiting behaviour of the discriminator in the PC-Adv algorithm: as $q_\psi$ converges to the true posterior, $\hat{r}$ is expected to resemble the log-likelihood up to an additive constant. In JC-Adv the discriminator eventually converges to the flat solution.

7 Discussion and future work
Are adversaries really needed? When using adversarial techniques for VI, we model the distribution of latent variables rather than observations. The distributions we encounter in VI are usually thought of as simpler than the distribution of observed data, so the question arises whether the flexibility of the adversarial framework is really needed.
Is the prior-contrastive too much like noise-contrastive? In the PC-Adv algorithm, the discriminator compares samples from the approximate posterior to the prior, and the prior is often high-dimensional Gaussian noise. Even at convergence, the two distributions the discriminator sees never overlap, and this may slow down training. This can be remedied by observing that as $q_\psi$ converges to the true posterior, the discriminator converges to the log-likelihood plus a constant. Hence, the task of the discriminator can be made easier by forming an ensemble of a neural network and the actual log-likelihood.
Aren't denoising methods imprecise? The main criticism of denoiser-based methods is that the gradient estimates are imprecise. As Valpola et al. (2016) pointed out, the optimal denoising function represents the gradients of the noise-corrupted distribution rather than the original, and in practical cases the noise level may not be small enough for this effect to be negligible. Sønderby et al. (2017) observed that denoiser-based methods cannot produce results as sharp as their adversarial counterparts. Finally, for the outer-loop SGD to work consistently, the gradient estimates provided by the inner loop have to form a conservative vector field. While the Bayes-optimal denoiser function satisfies this, it is unclear to what degree this property is preserved when using suboptimal denoisers (Im et al., 2016). We believe that an alternative approach based on score matching (Hyvärinen, 2005), a task intimately related to denoising (Vincent, 2011), might overcome both of these issues.
How to learn $\theta$? The focus of this note is on variational inference, that is, finding $\psi$. However, it is equally important to think about learning $\theta$. Unfortunately, none of the algorithms presented here allows for fully variational learning of the model parameters when the likelihood is implicit. ALI and BiGAN do provide an algorithm, but as we mentioned, they are not fully variational. We close by highlighting one possible avenue for future work to enable this: differentiating the inner loop of the JC-Adv algorithm via reverse mode differentiation (RMD, Maclaurin et al., 2015). To learn $\theta$ via SGD, one only needs an estimate of the gradient of the ELBO with respect to $\theta$. We cannot compute the optimal ratio estimate exactly, only an approximation reached via SGD, and each step of this inner SGD depends implicitly on $\theta$. Following Maclaurin et al. (2015), we can algorithmically differentiate the SGD algorithm in a memory-efficient way to obtain an estimate of the gradient we need for learning $\theta$. We have not validated this approach experimentally, but included it as JC-Adv-RMD in Table 1.
References
 Alain & Bengio (2014) Alain, Guillaume and Bengio, Yoshua. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
 Bickel et al. (2007) Bickel, Steffen, Brückner, Michael, and Scheffer, Tobias. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine learning, pp. 81–88. ACM, 2007.
 Donahue et al. (2017) Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. In International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1605.09782.
 Dosovitskiy & Brox (2016) Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
 Dumoulin et al. (2017) Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. In International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1606.00704.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Hyvärinen (2005) Hyvärinen, Aapo. Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

 Im et al. (2016) Im, Daniel Jiwoong, Belghazi, Mohamed Ishmael, and Memisevic, Roland. Conservativeness of untied auto-encoders. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1694–1700. AAAI Press, 2016.
 Kanamori et al. (2009) Kanamori, Takafumi, Hido, Shohei, and Sugiyama, Masashi. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
 Karaletsos (2016) Karaletsos, Theofanis. Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048, 2016. URL https://arxiv.org/abs/1612.05048.
 Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6114.
 Kingma et al. (2016) Kingma, Diederik P, Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
 Ledig et al. (2016) Ledig, Christian, Theis, Lucas, Huszár, Ferenc, Caballero, Jose, Cunningham, Andrew, Acosta, Alejandro, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 Li & Turner (2016) Li, Yingzhen and Turner, Richard E. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pp. 1073–1081, 2016.

 Maclaurin et al. (2015) Maclaurin, Dougal, Duvenaud, David K, and Adams, Ryan P. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.
 Makhzani et al. (2016) Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian. Adversarial autoencoders. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.05644.
 Mescheder et al. (2017) Mescheder, Lars M., Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. CoRR, abs/1701.04722, 2017. URL http://arxiv.org/abs/1701.04722.

 Minka (2001) Minka, Thomas P. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.
 Mohamed & Lakshminarayanan (2016) Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 Pearl (1988) Pearl, Judea. Embracing causality in default reasoning. Artificial Intelligence, 35(2):259–271, 1988.
 Radford et al. (2016) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1511.06434.
 Ranganath et al. (2016) Ranganath, Rajesh, Tran, Dustin, Altosaar, Jaan, and Blei, David. Operator variational inference. In Advances in Neural Information Processing Systems, pp. 496–504, 2016.
 Rezende & Mohamed (2015) Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In Blei, David and Bach, Francis (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1530–1538. JMLR Workshop and Conference Proceedings, 2015. URL http://jmlr.org/proceedings/papers/v37/rezende15.pdf.
 Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.
 Sønderby et al. (2016) Sønderby, Casper Kaae, Raiko, Tapani, Maaløe, Lars, Sønderby, Søren Kaae, and Winther, Ole. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746, 2016.
 Sønderby et al. (2017) Sønderby, Casper Kaae, Caballero, Jose, Theis, Lucas, Shi, Wenzhe, and Huszár, Ferenc. Amortised MAP inference for image super-resolution. In International Conference on Learning Representations, 2017.
 Sugiyama et al. (2008) Sugiyama, Masashi, Nakajima, Shinichi, Kashima, Hisashi, Buenau, Paul V, and Kawanabe, Motoaki. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pp. 1433–1440, 2008.
 Uehara et al. (2016) Uehara, Masatoshi, Sato, Issei, Suzuki, Masahiro, Nakayama, Kotaro, and Matsuo, Yutaka. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
 Vajda (2014) Vajda, Steven. Probabilistic programming. Academic Press, 2014.
 Valpola et al. (2016) Valpola et al. Learning by denoising part 2: Connection between data distribution and denoising function. https://thecuriousaicompany.com/connectiontog/, 2016. Retrieved on 24 February 2017.
 Vincent (2011) Vincent, Pascal. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
 Vincent et al. (2008) Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, PierreAntoine. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine learning, pp. 1096–1103. ACM, 2008.
 Warde-Farley & Bengio (2017) Warde-Farley, David and Bengio, Yoshua. Improving generative adversarial networks with denoising feature matching. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=S1X7nhsxl&noteId=S1X7nhsxl.
Appendix A Ignoring implicit dependence on
Proof of Eqn. (4):

$$
\begin{aligned}
&\frac{\mathrm{d}}{\mathrm{d}\theta} f\bigl(\theta, \psi^*(\theta)\bigr) \\
&\quad= \left.\frac{\partial f(\theta, \psi)}{\partial \theta}\right|_{\psi=\psi^*(\theta)} + \left.\frac{\partial f(\theta, \psi)}{\partial \psi}\right|_{\psi=\psi^*(\theta)} \cdot \frac{\mathrm{d}\psi^*(\theta)}{\mathrm{d}\theta} \\
&\quad= \left.\frac{\partial f(\theta, \psi)}{\partial \theta}\right|_{\psi=\psi^*(\theta)}
\end{aligned}
\tag{10}
$$

where the third line is obtained by noting that $\psi^*(\theta)$ is a local minimum of $f(\theta, \cdot)$, hence the second term in the second line is $0$.
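This envelope-style identity is easy to verify numerically. The sketch below (not from the note; the toy objective $f(\theta, \psi) = (\psi - \theta)^2 + \theta^2$ and its minimiser $\psi^*(\theta) = \theta$ are illustrative assumptions) checks by finite differences that the total derivative through $\psi^*(\theta)$ agrees with the partial derivative in $\theta$ alone:

```python
# Toy check of Eqn. (10): at a local minimum psi*(theta) of f(theta, .),
# the chain-rule term vanishes because df/dpsi = 0 there, so the total
# derivative d/dtheta f(theta, psi*(theta)) equals the partial in theta.
# Toy objective (an assumption for illustration, not from the note):
#   f(theta, psi) = (psi - theta)**2 + theta**2,  with psi*(theta) = theta.

def f(theta, psi):
    return (psi - theta) ** 2 + theta ** 2

def psi_star(theta):
    # argmin over psi of f(theta, psi): the quadratic is minimised at psi = theta
    return theta

def total_derivative(theta, eps=1e-6):
    # d/dtheta f(theta, psi*(theta)), central finite differences,
    # letting psi* track theta (the "implicit dependence")
    return (f(theta + eps, psi_star(theta + eps))
            - f(theta - eps, psi_star(theta - eps))) / (2 * eps)

def partial_derivative(theta, eps=1e-6):
    # partial f / partial theta, with psi frozen at psi*(theta)
    psi = psi_star(theta)
    return (f(theta + eps, psi) - f(theta - eps, psi)) / (2 * eps)

theta = 0.7
print(total_derivative(theta))    # both approximately 2 * theta = 1.4
print(partial_derivative(theta))
```

Both derivatives come out to $2\theta$ for this toy objective, confirming that the implicit dependence of the minimiser on $\theta$ can be ignored when differentiating.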