1 Introduction
Significant progress has been made recently on generative models capable of synthesizing highly realistic data samples [Goodfellow et al., 2014, Oord et al., 2016, Kingma and Welling, 2014]. If $p(x)$ represents the true underlying probability distribution of data $x \in \mathcal{X}$, most of these models seek to represent draws $x \sim p(x)$ as $x = h_\theta(\epsilon)$ and $\epsilon \sim q_0$, with $q_0$ a specified distribution that may be sampled easily [Goodfellow et al., 2014, Radford et al., 2016]. The objective is to learn $h_\theta(\epsilon)$, modeled typically via a deep neural network. Note that the model doesn’t impose a form on (or attempt to explicitly model) the density function $q_\theta(x)$ used to implicitly model $p(x)$.
When learning $h_\theta(\epsilon)$ it is typically assumed that one has access to a set of samples $\{x_i\}_{i=1}^{N}$, with each $x_i$ drawn i.i.d. from $p(x)$. While such samples are often available, there are other important settings for which one may wish to learn a generative model for $x$, without access to associated samples. An important example occurs when one has access to an unnormalized distribution $u(x)$, with $p(x) = u(x)/C$ and normalizing constant $C$ unknown. The goal of sampling from $p(x)$ based on $u(x)$ is a classic problem in physics, statistics and machine learning [Hastings, 1970, Gelman et al., 1995]. This objective has motivated theoretically exact (but expensive) methods like Markov chain Monte Carlo (MCMC) [Brooks et al., 2011, Welling and Teh, 2011], and approximate methods like variational Bayes [Hoffman et al., 2013, Kingma and Welling, 2014, Rezende et al., 2014] and expectation propagation [Minka, 2001, Li et al., 2015]. A challenge with methods of these types (in addition to computational cost/approximations) is that they are means of drawing samples or approximating density forms based on $u(x)$, but they do not directly yield a model like $x = h_\theta(\epsilon)$ and $\epsilon \sim q_0$, with the latter important for many fast machine learning implementations.
A recently developed, and elegant, means of modeling samples based on $u(x)$ is Stein variational gradient descent (SVGD) [Liu and Wang, 2016]. SVGD also learns to draw a set of samples, and an amortization step is used to learn $x = h_\theta(\epsilon)$ and $\epsilon \sim q_0$ based on the SVGD-learned samples [Wang and Liu, 2016, Feng et al., 2017, Y. Pu and Carin, 2017]. Such amortization may also be used to build $h_\theta(\epsilon)$ based on MCMC-generated samples [Li et al., 2017b]. While effective, SVGD-based learning of this form may be limited computationally by the number of samples that may be practically modeled, limiting accuracy. Further, the two-step nature by which $h_\theta(\epsilon)$ is manifested may be viewed as less appealing.
In this paper we develop a new extension of generative adversarial networks (GANs) [Goodfellow et al., 2014] for settings in which we have access to $u(x)$, rather than samples drawn from $p(x)$. The formulation, while new, is simple, based on a recognition that many existing GAN methods constitute different means of estimating a function of a likelihood ratio [Kanamori et al., 2010, Mohamed and L., 2016, Uehara et al., 2016]. The likelihood ratio is associated with the true density function $p(x)$ and the model $q_\theta(x)$. Since we do not have access to $p(x)$ or $q_\theta(x)$, we show, by a detailed investigation of $f$-GAN [Nowozin et al., 2016], that many GAN models reduce to learning $g(p(x)/q_\theta(x))$, where $g(\cdot)$ is a general monotonically increasing function. $f$-GAN is an attractive model for uncovering underlying principles associated with GANs, due to its generality, and because many existing GAN approaches may be viewed as special cases of $f$-GAN. With the understanding provided by an analysis of $f$-GAN, we demonstrate how $g(p(x)/q_\theta(x))$ may be estimated via $u(x)$ and an introduced reference distribution $p_r(x)$. As discussed below, the assumptions on $p_r(x)$ are that it is easily sampled, it has a known functional form, and it represents a good approximation to $q_\theta(x)$.
For the special case of variational inference for latent models, the proposed formulation recovers the adversarial variational Bayes (AVB) [Mescheder et al., 2017] setup. However, we demonstrate that the proposed approach has broader applicability than inference. Specifically, we demonstrate its application to soft Q-learning [Haarnoja et al., 2017], and it leads to the first general purpose adversarial policy algorithm in reinforcement learning. We make a favorable comparison in this context to the aforementioned SVGD formulation.
An additional contribution of this paper concerns regularization of adversarial learning, of interest when learning based on samples or on an unnormalized distribution $u(x)$. Specifically, we develop an entropy-based regularizer. When learning based on $u(x)$, we make connections to simulated annealing regularization methods used in prior sampling-based models. We also introduce a bound on the entropy, applicable to learning based on samples or $u(x)$, and make connections to prior work on cycle consistency used in GAN regularization.
Figure 1: Learning schemes for the two settings. (a) Learning from an unnormalized distribution. (b) Learning from a sample set.
2 Traditional GAN Learning
We begin by discussing GAN from the perspective of the $f$-divergence [Nguyen et al., 2010a], which has resulted in $f$-GAN [Nowozin et al., 2016]. $f$-GAN is considered because many popular GAN methods result as special cases, thereby affording the opportunity to identify generalizable components that may be extended to new settings. Considering continuous probability density functions $p(x)$ and $q(x)$ for $x \in \mathcal{X}$, the $f$-divergence is defined as $D_f(p\|q) = \int_{\mathcal{X}} q(x)\, f\big(p(x)/q(x)\big)\, dx$, where $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex, lower-semicontinuous function satisfying $f(1) = 0$. Different choices of $f(r(x))$, with $r(x) = p(x)/q(x)$, yield many common divergences; see [Nowozin et al., 2016] and Table 1.
An important connection has been made between the $f$-divergence and generative adversarial learning, based on the inequality [Nguyen et al., 2010a]
$$D_f(p\|q) \ge \sup_{T \in \mathcal{T}} \Big\{ \mathbb{E}_{x \sim p}[T(x)] - \mathbb{E}_{x \sim q}[f^*(T(x))] \Big\}, \tag{1}$$
where $f^*(t)$ is the convex conjugate function, defined as $f^*(t) = \sup_{s \in \mathrm{dom}_f} \{ s t - f(s) \}$, which has an analytic form for many choices of $f$ [Nowozin et al., 2016]. Further, under mild conditions, the bound is tight when $T(x) = f'(p(x)/q(x))$, where $f'(r)$ is the derivative of $f(r)$. Even if we know $f'(r)$ we cannot evaluate $T(x)$ explicitly, because $p(x)$ and/or $q(x)$ are unknown.
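The variational bound on the $f$-divergence can be checked numerically on a small discrete example. The sketch below is illustrative only: it assumes the KL case, $f(r) = r \log r$ with conjugate $f^*(t) = \exp(t - 1)$, and the two four-outcome distributions are hypothetical values chosen for the demonstration.

```python
import numpy as np

def kl_f(r):
    """f(r) = r log r, the generator function of the KL divergence."""
    return r * np.log(r)

def kl_f_prime(r):
    """Derivative f'(r) = 1 + log r."""
    return 1.0 + np.log(r)

def kl_f_conj(t):
    """Convex conjugate f*(t) = exp(t - 1)."""
    return np.exp(t - 1.0)

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)) for discrete p and q."""
    return float(np.sum(q * f(p / q)))

def variational_bound(p, q, T, f_conj):
    """E_p[T(x)] - E_q[f*(T(x))], the right-hand side of the bound."""
    return float(np.sum(p * T) - np.sum(q * f_conj(T)))

# Hypothetical discrete distributions on four outcomes.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

exact = f_divergence(p, q, kl_f)
tight = variational_bound(p, q, kl_f_prime(p / q), kl_f_conj)  # T = f'(p/q)
loose = variational_bound(p, q, np.zeros(4), kl_f_conj)        # arbitrary T
```

With the critic set to $f'(p/q)$ the bound coincides with the divergence, while an arbitrary critic (here identically zero) gives a strictly smaller value.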
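Note that to compute the bound in (1), we require expectations wrt $p$ and $q$, which we effect via sampling (this implies we only need samples from $p$ and $q$, and do not require the explicit form of the underlying distributions). Specifically, assume $p$ corresponds to the true distribution we wish to model, and $q_\theta$ is a model distribution with parameters $\theta$. We seek to learn $\theta$ by minimizing the bound of $D_f(p\|q_\theta)$ in (1), with draws from $q_\theta$ implemented as $x = h_\theta(\epsilon)$ with $\epsilon \sim q_0$, where $q_0$ is a probability distribution that may be sampled easily (e.g., uniform, or isotropic Gaussian [Goodfellow et al., 2014]). The learning problem consists of solving
$$(\hat{\theta}, \hat{\phi}) = \arg\min_{\theta} \arg\max_{\phi} \; \mathbb{E}_{x \sim p}[T_\phi(x)] - \mathbb{E}_{\epsilon \sim q_0}[f^*(T_\phi(h_\theta(\epsilon)))], \tag{2}$$
where $T_\phi(x)$ is typically a (deep) neural network with parameters $\phi$, with $h_\theta(\epsilon)$ defined similarly. Attempting to solve (2) produces $f$-GAN [Nowozin et al., 2016].
One typically solves this minimax problem by alternating between update of $\phi$ and $\theta$ [Nowozin et al., 2016, Goodfellow et al., 2014]. Note that the update of $\theta$ only involves the second term in (2), corresponding to $\arg\max_\theta \mathbb{E}_{\epsilon \sim q_0}[f^*(T_\phi(h_\theta(\epsilon)))]$. Recall that the bound in (1) is tight when $T_\phi(x) = f'(p(x)/q_{\theta'}(x))$ [Nguyen et al., 2010b], where $\theta'$ represent parameters from the previous iteration. Hence, assuming $T_\phi(x) = f'(p(x)/q_{\theta'}(x))$, we update $\theta$ as
$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{\epsilon \sim q_0}\, g\big(r_{\theta'}(h_\theta(\epsilon))\big), \tag{3}$$
where $r_{\theta'}(x) = p(x)/q_{\theta'}(x)$ and $g(r) = f^*(f'(r))$.
Different choices of $f$ yield a different optimal function $g(r)$ (see Table 1). However, in each case $\theta$ is updated such that samples from $q_\theta$ yield an increase in the likelihood ratio $r_{\theta'}(x)$, implying samples from $q_\theta$ better match $p(x)$ than they do $q_{\theta'}(x)$. Recall that the likelihood ratio is the optimal means of distinguishing between samples from $p(x)$ and $q_{\theta'}(x)$ [Van Trees, 2001, Neyman and Pearson, 1933]. Hence, $r_{\theta'}(x)$ is a critic, approximated through $T_\phi(x)$, that the actor $h_\theta$ seeks to maximize when estimating $\theta$.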
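Table 1: Function $g(r) = f^*(f'(r))$ in the update of $\theta$ for common $f$-divergences, with $r$ the likelihood ratio.
Divergence  $g(r)$ in update of $\theta$
Kullback-Leibler (KL)  $r$
Reverse KL  $\log r - 1$
Squared Hellinger  $\sqrt{r} - 1$
Total variation  $\tfrac{1}{2}\,\mathrm{sign}(r - 1)$
Pearson $\chi^2$  $r^2 - 1$
Neyman $\chi^2$  $2 - 2/r$
GAN  $\log(1 + r)$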
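The composition $g(r) = f^*(f'(r))$ can be evaluated numerically for several divergences. The sketch below is a sanity check under stated assumptions: the $(f', f^*)$ pairs are the standard ones tabulated in the $f$-GAN literature, and the dictionary `PAIRS` and helper `g` are hypothetical names introduced for this illustration.

```python
import numpy as np

# (f', f*) pairs for a few divergences; g(r) = f*(f'(r)).
PAIRS = {
    "KL": (lambda r: 1.0 + np.log(r), lambda t: np.exp(t - 1.0)),
    "Reverse KL": (lambda r: -1.0 / r, lambda t: -1.0 - np.log(-t)),
    "Pearson": (lambda r: 2.0 * (r - 1.0), lambda t: 0.25 * t ** 2 + t),
    "GAN": (lambda r: np.log(r / (1.0 + r)), lambda t: -np.log(1.0 - np.exp(t))),
}

def g(name, r):
    """Evaluate f*(f'(r)) for the named divergence."""
    f_prime, f_conj = PAIRS[name]
    return f_conj(f_prime(r))

# Grid of likelihood-ratio values on which to evaluate g.
r = np.linspace(0.1, 5.0, 50)
values = {name: g(name, r) for name in PAIRS}
```

In each of these cases the resulting $g(r)$ is monotonically increasing in the likelihood ratio, which is the property the update of $\theta$ relies on.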
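We may alternatively consider
$$\hat{\phi} = \arg\max_{\phi} \; \mathbb{E}_{x \sim p} \log \sigma(\nu_\phi(x)) + \mathbb{E}_{\epsilon \sim q_0} \log\big[1 - \sigma(\nu_\phi(h_{\theta'}(\epsilon)))\big], \tag{4}$$
$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{\epsilon \sim q_0}\, g\big(\nu_{\hat{\phi}}(h_\theta(\epsilon))\big), \tag{5}$$
where now $g(\cdot)$ is an arbitrary monotonically increasing function of its argument, $\sigma(\cdot)$ is the sigmoid function, and $\theta'$ denotes the parameters from the previous iteration. From [Kanamori et al., 2010, Mescheder et al., 2017, Gutmann and Hyvärinen, 2010], the solution to (4) is
$$\nu_{\hat{\phi}}(x) = \log \frac{p(x)}{q_{\theta'}(x)}, \tag{6}$$
where model $\nu_\phi(x)$ is assumed to have sufficient capacity to represent the likelihood ratio for all $x \in \mathcal{X}$. Hence, here $\nu_\phi(x)$ replaces $T_\phi(x)$ from $f$-GAN, and the solution to $\nu_\phi(x)$ is a particular function of the likelihood ratio. If $g(\nu) = \nu$ this corresponds to learning $\theta$ based on minimizing the reverse KL divergence $\mathrm{KL}(q_\theta\|p)$. When $g(\nu) = -\log[1 - \sigma(\nu)]$, one recovers the original GAN [Goodfellow et al., 2014], for which learning corresponds to minimizing the Jensen-Shannon divergence $\mathrm{JSD}(p\|q_\theta)$.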
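The claim that a logistic critic recovers the log likelihood ratio can be illustrated in one dimension. The following is a minimal sketch, not the authors' implementation: it assumes two hypothetical Gaussians whose log-ratio is exactly linear, uses a linear critic `a * x + b`, and replaces sample averages with quadrature on a grid so the result is deterministic.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical densities: p = N(0, 1) and q = N(1, 1).
# Their exact log-ratio is linear: log p(x)/q(x) = -x + 0.5.
x = np.linspace(-8.0, 9.0, 3401)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 1.0)

# Gradient ascent on E_p[log sigmoid(nu)] + E_q[log(1 - sigmoid(nu))]
# for the linear critic nu(x) = a * x + b.
a, b = 0.0, 0.0
for _ in range(2000):
    s = sigmoid(a * x + b)
    weight = p * (1.0 - s) - q * s   # pointwise derivative w.r.t. nu(x)
    a += 0.5 * np.sum(weight * x) * dx
    b += 0.5 * np.sum(weight) * dx
```

After training, `a` and `b` approach the coefficients of the exact log-ratio, so the critic output itself (before the sigmoid) serves as the ratio estimate.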
3 Unnormalized-Distribution GAN
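In the above discussion, and in virtually all prior GAN research, access is assumed to samples from target distribution $p(x)$. In many applications samples from $p(x)$ are unavailable, but the unnormalized $u(x)$ is known, with $p(x) = u(x)/C$ but with constant $C = \int u(x)\,dx$ intractable. A contribution of this paper is a recasting of GAN to cases for which we have $u(x)$ but no samples from $p(x)$, recognizing that most GAN models require an accurate estimate of the underlying likelihood ratio.
We consider the formulation in (4)-(5) and for simplicity set $g(\nu) = \nu$, although any choice of $g(\cdot)$ may be considered as long as it is monotonically increasing. The update of $\theta$ remains as in (5), and we seek to estimate $\log[p(x)/q_\theta(x)]$ based on knowledge of $u(x)$. Since $\log[p(x)/q_\theta(x)] = \log[u(x)/q_\theta(x)] - \log C$, for the critic it is sufficient to estimate $\log[u(x)/q_\theta(x)]$. Toward that end, we introduce a reference distribution $p_r(x)$, that may be sampled easily, and has an explicit functional form that may be evaluated. The reference distribution can be connected to both importance sampling and the reference ratio method developed in bioinformatics [Hamelryck et al., 2010]. We have
$$\log \frac{u(x)}{q_\theta(x)} = \log \frac{p_r(x)}{q_\theta(x)} + \log \frac{u(x)}{p_r(x)}, \tag{7}$$
where $\log[u(x)/p_r(x)]$ may be evaluated explicitly. We learn $\log[p_r(x)/q_\theta(x)]$ via (4), with $p(x)$ changed to $p_r(x)$. Therefore, with $\nu_\phi(x)$ denoting the critic network and $\theta'$ the parameters from the previous iteration, learning becomes alternating between the following two updates:
$$\hat{\phi} = \arg\max_{\phi} \; \mathbb{E}_{x \sim p_r} \log \sigma(\nu_\phi(x)) + \mathbb{E}_{\epsilon \sim q_0} \log\big[1 - \sigma(\nu_\phi(h_{\theta'}(\epsilon)))\big], \tag{8}$$
$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{\epsilon \sim q_0} \big[ \nu_{\hat{\phi}}(h_\theta(\epsilon)) + \log u(h_\theta(\epsilon)) - \log p_r(h_\theta(\epsilon)) \big]. \tag{9}$$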
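The role of the reference distribution can be sketched as follows: a learned critic supplies the log-ratio between the reference and the sampler, and the analytically known terms supply the rest. This is an illustration under stated assumptions; the function `ras_log_ratio`, the one-dimensional densities, and the "ideal" critic are all hypothetical stand-ins, not the released implementation.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def ras_log_ratio(x, critic, log_u, log_p_r):
    """Estimate log u(x)/q(x) as critic(x) + log u(x) - log p_r(x).

    critic  : callable approximating log p_r(x)/q(x), learned adversarially
    log_u   : callable, log of the unnormalized target (known analytically)
    log_p_r : callable, log-density of the reference (known analytically)
    """
    return critic(x) + log_u(x) - log_p_r(x)

# Hypothetical 1-D setup, for illustration only.
log_u = lambda x: -0.5 * x ** 2                    # unnormalized N(0, 1)
log_q = lambda x: gaussian_logpdf(x, 1.0, 2.0)     # stands in for the sampler density
log_p_r = lambda x: gaussian_logpdf(x, 0.8, 1.5)   # imperfect reference
ideal_critic = lambda x: log_p_r(x) - log_q(x)     # what a perfect critic would output

x = np.linspace(-3.0, 3.0, 7)
estimate = ras_log_ratio(x, ideal_critic, log_u, log_p_r)
```

When the critic is exact, the reference cancels and the estimate equals $\log u(x) - \log q(x)$; the reference only affects how hard the critic's estimation problem is.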
We call this procedure reference-based adversarial sampling (RAS) for unnormalized distributions. One should carefully note its distinction from traditional GANs (see footnote 1), which usually learn to draw samples to mimic the given samples of a target distribution. To illustrate the difference, we visualize the learning schemes for the two settings in Figure 1.
Footnote 1: We refer to generative models learned via samples as GAN, and generative models learned via an unnormalized distribution as RAS.
The parameters of the reference distribution are estimated using samples from the current sampler $x = h_\theta(\epsilon)$. We consider different forms of the reference distribution depending on the application.

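Unconstrained domains For the case when the support of the target distribution is unconstrained, we model $p_r(x)$ as a Gaussian distribution with diagonal covariance matrix, with mean and variance components estimated via samples from $q_\theta(x)$, drawn as $x = h_\theta(\epsilon)$ with $\epsilon \sim q_0$.
Constrained domains In some real-world applications the support is bounded. For example, in reinforcement learning, the action often resides within a finite interval $[c_0, c_1]$. In this case, we propose to represent each dimension of $p_r(x)$ as a generalized Beta distribution $\mathrm{Beta}(\hat{\alpha}, \hat{\beta}, c_0, c_1)$. The shape parameters are estimated using the method of moments: $\hat{\alpha} = \bar{m}\big(\bar{m}(1 - \bar{m})/\bar{v} - 1\big)$ and $\hat{\beta} = (1 - \bar{m})\big(\bar{m}(1 - \bar{m})/\bar{v} - 1\big)$, where $\bar{m} = (m - c_0)/(c_1 - c_0)$ and $\bar{v} = v/(c_1 - c_0)^2$, and $m$ and $v$ are sample mean and variance, respectively.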
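The method-of-moments fit for a Beta reference on a bounded interval can be written in a few lines. The sketch below uses the standard moment-matching formulas for the Beta distribution after rescaling to the unit interval; the function name `beta_shapes_from_moments` and the interval arguments `low` and `high` are hypothetical names chosen for this illustration.

```python
def beta_shapes_from_moments(mean, var, low, high):
    """Shape parameters of a Beta on [low, high] matching a mean and variance.

    Assumes low < mean < high and a variance small enough that both
    returned shape parameters are positive.
    """
    m_bar = (mean - low) / (high - low)     # mean rescaled to [0, 1]
    v_bar = var / (high - low) ** 2         # variance rescaled to [0, 1]
    common = m_bar * (1.0 - m_bar) / v_bar - 1.0
    return m_bar * common, (1.0 - m_bar) * common

# Moments of a Beta(2, 5) stretched from [0, 1] to [-1, 1]:
# mean = -1 + 2 * (2 / 7), variance = 4 * 10 / 392.
alpha_hat, beta_hat = beta_shapes_from_moments(-3.0 / 7.0, 40.0 / 392.0, -1.0, 1.0)
```

In practice the mean and variance would be computed per dimension from a batch of generator outputs, so the reference tracks the sampler as training proceeds.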
4 Entropy Regularization
Whether we perform adversarial learning based on samples from $p(x)$, as in Sec. 2, or based upon an unnormalized distribution $u(x)$, as in Sec. 3, the update of parameters $\theta$ is of the form $\hat{\theta} = \arg\max_\theta \mathbb{E}_{\epsilon \sim q_0}\, g\big(\log r_{\theta'}(h_\theta(\epsilon))\big)$, where the log-likelihood ratio $\log r_{\theta'}(x) = \log[p(x)/q_{\theta'}(x)]$, with $\theta'$ the parameters from the previous iteration, is approximated as in (4) or its modified form (for learning from an unnormalized distribution).
A well-known failure mode of GAN is the tendency of the generative model, $x = h_\theta(\epsilon)$ with $\epsilon \sim q_0$, to under-represent the full diversity of data that may be drawn $x \sim p(x)$. Considering $r_{\theta'}(x) = p(x)/q_{\theta'}(x)$, the update of $\theta$ will seek to favor synthesis of data $x$ for which $q_{\theta'}(x)$ is small and $p(x)$ large. When learning in this manner, at iteration $t$ the model tends to favor synthesis of a subset of data $x$ that are probable from $p(x)$ and less probable from the model at iteration $t - 1$. This subset of data that the model represents well can change with $t$, with the iterative learning continuously moving to model a subset of the data $x$ that are probable via $p(x)$. This subset can be very small, in the worst case yielding a model that always generates the same single data sample that looks like a real draw from $p(x)$; in this case $h_\theta(\epsilon)$ yields the same or near-same output for all $\epsilon \sim q_0$, albeit a realistic-looking sample $x$.
To mitigate this failure mode, it is desirable to add a regularization term to the update of $\theta$, encouraging that the entropy of $q_\theta(x)$ be large at each iteration $t$, discouraging the model from representing (while iteratively training) a varying small subset of the data supported by $p(x)$. Specifically, consider the regularized update of (5) as:
$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{\epsilon \sim q_0}\, g\big(\log r_{\theta'}(h_\theta(\epsilon))\big) + \lambda H(q_\theta), \tag{10}$$
where $H(q_\theta)$ represents the entropy of the distribution $q_\theta(x)$, and $\lambda \ge 0$ is the regularization weight. The significant challenge is that $H(q_\theta) = -\mathbb{E}_{x \sim q_\theta} \log q_\theta(x)$, but by construction we lack an explicit form for $q_\theta(x)$, and hence the entropy may not be computed directly. Below we consider two means by which we may approximate $H(q_\theta)$, one of which is explicitly appropriate for the case in which we learn based upon the unnormalized $u(x)$, and the other of which is applicable to whether we learn via samples from $p(x)$ or based on $u(x)$.
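In the case for which $p(x) = u(x)/C$ and $u(x)$ is known, we may consider approximating or replacing the entropy $H(q_\theta)$ with $-\mathbb{E}_{x \sim q_\theta} \log p(x) = -\mathbb{E}_{x \sim q_\theta} \log u(x) + \log C$, and the term $\log C$ may be ignored, because it doesn’t impact the regularization in (10); we therefore replace the entropy with the cross entropy $-\mathbb{E}_{x \sim q_\theta} \log u(x)$. The first term in (10) tends to encourage the model to learn to draw samples where $p(x)$, or $u(x)$, is large, while the second term discourages over-concentration on such high-probability regions, as $-\mathbb{E}_{x \sim q_\theta} \log u(x)$ becomes large when $\theta$ encourages samples near lower probability regions of $p(x)$. This will ideally yield a spreading-out of the samples encouraged by the first term, with high-probability regions of $p(x)$ modeled well, but also regions spreading out from these high-probability regions.
To gain further insight into (10), we again consider the useful case of $g$ equal to the identity and assume the ideal solution for the critic, $\log[p(x)/q_{\theta'}(x)]$, with $\theta'$ the parameters from the previous iteration and $\lambda$ the regularization weight in (10). In this case cross-entropy-based regularization may be seen as seeking to maximize wrt $\theta$ the function $\mathbb{E}_{x \sim q_\theta}\big[\log u(x)^{1 - \lambda} - \log q_{\theta'}(x)\big]$, up to an additive constant.
For the special case of $u(x) = \exp[-E(x)]$, with $E(x)$ an “energy” function, we have $u(x)^{1 - \lambda} = \exp[-E(x)/T_\lambda]$ with $T_\lambda = 1/(1 - \lambda)$. Hence, the cross-entropy regularization is analogous to annealing, with $\lambda \in [0, 1)$; $\lambda \to 1^-$ corresponds to high “temperature” $T_\lambda$, which as $\lambda \to 0$ is lowered, with $T_\lambda \to 1$ and $u(x)^{1 - \lambda} \to u(x)$. When $\lambda > 0$ the peaks in $u(x)$ are “flattened out,” allowing the model to yield samples that “spread out” and explore the diversity of $u(x)$. This interpretation suggests learning via (10), with the cross-entropy replacement for $H(q_\theta)$, with $\lambda$ near 1 at the start, and progressively reducing $\lambda$ toward 0 (corresponding to lowering temperature $T_\lambda$).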
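The annealing interpretation can be visualized numerically. Writing the regularization weight as $\lambda$ (a symbol assumed here), subtracting $\lambda$ times $\log u(x)$ from the ideal log-ratio leaves a target proportional to $u(x)^{1-\lambda}$, i.e. an energy model at temperature $1/(1-\lambda)$. The sketch below evaluates such tempered densities on a grid; the double-well energy and the helper `tempered_density` are hypothetical choices for illustration.

```python
import numpy as np

def tempered_density(energy, x, lam):
    """Normalized density proportional to exp(-(1 - lam) * E(x)) on a uniform grid."""
    log_w = -(1.0 - lam) * energy(x)
    w = np.exp(log_w - log_w.max())          # subtract the max for numerical stability
    return w / (np.sum(w) * (x[1] - x[0]))

# Hypothetical double-well energy with modes near x = -2 and x = 2.
energy = lambda x: (x ** 2 - 4.0) ** 2 / 4.0
x = np.linspace(-5.0, 5.0, 2001)

# Peak height of the tempered target for a decreasing sequence of weights.
peak_heights = {lam: tempered_density(energy, x, lam).max() for lam in (0.9, 0.5, 0.0)}
```

Larger weights give a flatter target with lower peaks, so a sampler trained early with a weight near 1 is pushed to cover both wells before the weight is reduced.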
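The above setup assumes we have access to $u(x)$, which is not the case when we seek to learn $\theta$ based on samples of $p(x)$. Further, rather than replacing $H(q_\theta)$ by the cross-entropy, we may wish to approximate $H(q_\theta)$ based on samples of $q_\theta(x)$, which we have via $x = h_\theta(\epsilon)$ with $\epsilon \sim q_0$ (with this estimated via samples from $p(x)$ or based on $u(x)$). Toward that end, consider the following lemma.
Lemma 1
Let $q_\eta(\epsilon|x)$ be a probabilistic inverse mapping associated with the generator $q_\theta(x)$, with parameters $\eta$. The mutual information between $x$ and $\epsilon$ satisfies
$$I(x; \epsilon) = H(q_\theta) \ge H(q_0) + \mathbb{E}_{\epsilon \sim q_0} \log q_\eta(\epsilon \,|\, h_\theta(\epsilon)). \tag{11}$$
The proof is provided in the Supplement Material (SM). Since $H(q_0)$ is a constant wrt $(\theta, \eta)$, one may seek to maximize $\mathbb{E}_{\epsilon \sim q_0} \log q_\eta(\epsilon \,|\, h_\theta(\epsilon))$ to increase the entropy $H(q_\theta)$. Hence, in (10) we replace the entropy term with $\mathbb{E}_{\epsilon \sim q_0} \log q_\eta(\epsilon \,|\, h_\theta(\epsilon))$.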
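One concrete instance of this entropy surrogate, assumed here for illustration, takes the inverse mapping to be a Gaussian with identity covariance, so its log-density is a negative squared reconstruction error up to a constant. The sketch below compares the resulting term for a spread-out generator and a collapsed one; the generators, inverse means, and the function name `cycle_consistency_bonus` are hypothetical.

```python
import numpy as np

def cycle_consistency_bonus(eps, generator, inverse_mean):
    """Batch average of -||eps - mu(h(eps))||^2 / 2.

    Up to an additive constant this equals E[log q(eps | x)] for a Gaussian
    inverse mapping with identity covariance and mean inverse_mean(x).
    """
    recon = inverse_mean(generator(eps))
    return float(-0.5 * np.mean(np.sum((eps - recon) ** 2, axis=1)))

rng = np.random.default_rng(0)
eps = rng.standard_normal((256, 2))

# Hypothetical generators: one invertible, one collapsed to a single point.
spread = lambda e: 2.0 * e + 1.0
collapsed = lambda e: np.ones_like(e)

# Hypothetical inverse means; no inverse can recover eps from a constant output.
bonus_spread = cycle_consistency_bonus(eps, spread, lambda x: (x - 1.0) / 2.0)
bonus_collapsed = cycle_consistency_bonus(eps, collapsed, lambda x: np.zeros_like(x))
```

The collapsed generator receives a markedly lower value, which is how maximizing this term penalizes mapping many noise vectors to the same output.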
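In practice we consider $q_\eta(\epsilon|x) = \mathcal{N}(\epsilon; \mu_\eta(x), \mathbf{I})$, where here $\mathbf{I}$ is the identity matrix, and $\mu_\eta(x)$ is a vector mean. Hence, the entropy term in (10) is replaced by $-\mathbb{E}_{\epsilon \sim q_0} \|\epsilon - \mu_\eta(h_\theta(\epsilon))\|^2$. Note that a failure mode of GAN, as discussed above, corresponds to many or all $\epsilon \sim q_0$ being mapped via $h_\theta(\epsilon)$ to the same output. This is discouraged via this regularization, as such behavior makes it difficult to simultaneously minimize $\mathbb{E}_{\epsilon \sim q_0} \|\epsilon - \mu_\eta(h_\theta(\epsilon))\|^2$. This regularization is related to cycle-consistency [Li et al., 2017a]. However, the justification of the negative cycle-consistency as a lower bound of the entropy of the generated samples is deemed a contribution of this paper (not addressed in [Li et al., 2017a]).
5 Related Work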
Use of a reference distribution We have utilized a readily-sampled reference distribution, with known density function $p_r(x)$, when learning to sample from an unnormalized distribution $u(x)$. The authors of [Gutmann and Hyvärinen, 2010] also use such a reference distribution to estimate the probability distribution associated with observed data samples. However, [Gutmann and Hyvärinen, 2010] considered a distinct problem, for which one wished to fit observed samples to a specified unnormalized distribution. Here we employ the reference distribution in the context of learning to sample from a known $u(x)$, with no empirical samples from $p(x)$ provided.
Adversarial variational Bayes In the context of variational Bayes analysis, the adversarial variational Bayes (AVB) [Mescheder et al., 2017] was proposed for posterior inference of variational autoencoders (VAEs) [Kingma and Welling, 2014]. Assume we are given a parametric generative model $p(x|z)$ with prior $p(z)$ on latent code $z$, designed to model observed data samples $\{x_i\}$. There is interest in designing an inference arm, capable of efficiently inferring a distribution on the latent code $z$ given observed data $x$. Given observed $x$, the posterior distribution on the code is $p(z|x) = u(z; x)/C(x)$, where $C(x) = \int u(z; x)\, dz$, and $u(z; x) = p(x|z)p(z)$ represents an unnormalized distribution of the latent variable $z$, which also depends on the data $x$.
One may show that if the procedure in Sec. 3 is employed to draw samples from $p(z|x)$, based on the unnormalized $u(z; x)$, one exactly recovers AVB [Mescheder et al., 2017]. The AVB considered within our framework. We do not consider the application to inference with VAEs, as the experiments in [Mescheder et al., 2017] are applicable to the framework we have developed. The generality of RAS is made clearer in our paper. We show its applicability to reinforcement learning in Sec. 6, and broaden the discussion on the type of adaptive reference distributions in Sec. 3, with extensions to constrained domain sampling.
Regularization The cycle-consistency term employed here was considered in [Li et al., 2017a, Chen et al., 2018b, Zhu et al., 2017], but the use of it as a bound on the entropy of the generated data $x$ is new. From Lemma 1 we see that this term is also a bound on the mutual information between $x$ and $\epsilon$, maximization of which is the same goal as InfoGAN [Chen et al., 2016]. However, unlike in [Chen et al., 2016], here the mapping from $\epsilon$ to $x$ is deterministic, whereas in InfoGAN it is stochastic. Additionally, the goal here is to encourage diversity in generated $x$, which helps mitigate mode collapse, whereas in InfoGAN the goal was to discover latent semantic concepts.
Stein variational gradient descent (SVGD) In the formulation of (4)-(5), if one sets $g$ to the identity, then the learning objective corresponds to minimizing the reverse KL divergence $\mathrm{KL}(q_\theta\|p)$. SVGD [Liu and Wang, 2016] also addresses this goal given unnormalized distribution $u(x)$, with $p(x) = u(x)/C$. As for the proposed approach, the goal is not to explicitly learn a functional form for the sampling density; rather, the goal of SVGD is to learn to draw samples from it. We directly learn a sampler model via $x = h_\theta(\epsilon)$ and $\epsilon \sim q_0$, whereas in [Wang and Liu, 2016] a specified set of samples is adjusted sequentially to correspond to draws from the unnormalized distribution $u(x)$. In this setting, one assumes access to a set of samples $\{x_i\}$ drawn from some distribution, and these samples are updated deterministically as
$$x_i' = x_i + \delta\, \psi(x_i),$$
where $\delta$ is a small step size, and $\psi(x)$ is a nonlinear function, assumed described by a reproducing kernel Hilbert space (RKHS) with given kernel $k(x, x')$. In this setting, the samples are updated $\{x_i\} \to \{x_i'\}$, with a deterministic function $\psi(x)$ that is evaluated in terms of $\{x_i\}$ and $u(x)$. While this process is capable of transforming a specific set of samples such that they ultimately approximate samples drawn from $p(x)$, we do not have access to a model that allows one to draw new samples quickly, on demand. Consequently, within the SVGD framework, a model $x = h_\theta(\epsilon)$ is learned separately as a second “amortization” step. The two-step character of SVGD should be contrasted with the direct approach of the proposed model to learn $h_\theta(\epsilon)$. SVGD has been demonstrated to work well, and therefore it is a natural model against which to compare, as considered below.
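For reference, a single SVGD particle update can be sketched as below. This follows the standard SVGD direction with an RBF kernel (a kernel-weighted score term plus a repulsive kernel-gradient term); the fixed bandwidth, step size, and the standard-normal target used in the demonstration are assumptions made for this illustration, not settings taken from the experiments.

```python
import numpy as np

def svgd_step(x, grad_log_u, step_size=0.1, bandwidth=1.0):
    """One SVGD update of particles x (shape [n, d]) with an RBF kernel."""
    diff = x[:, None, :] - x[None, :, :]               # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    # Attraction: sum_j k(x_j, x_i) * grad log u(x_j)
    drive = k.T @ grad_log_u(x)
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = -sum_j k[j, i] (x_j - x_i) / h^2
    repulse = -np.einsum("ji,jid->id", k, diff) / bandwidth ** 2
    return x + step_size * (drive + repulse) / x.shape[0]

# Hypothetical demo: unnormalized standard normal, so grad log u(x) = -x.
rng = np.random.default_rng(0)
particles = 3.0 + 0.5 * rng.standard_normal((50, 1))
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)
```

The update only moves the given particles; producing a fresh sample later requires either rerunning the procedure or fitting a separate generator to the particles, which is the amortization step discussed above.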
6 Experimental Results
The TensorFlow code to reproduce the experimental results is available on GitHub at https://github.com/ChunyuanLI/RAS.
6.1 Effectiveness of Entropy Regularization
6.1.1 Learning based on samples
We first demonstrate that the proposed entropy regularization improves mode coverage when learning based on samples. Following the design in [Metz et al., 2017], we consider a synthetic dataset of samples drawn from a 2D mixture of 8 Gaussians. The results on real datasets are reported in SM.
We consider the original GAN and three state-of-the-art GAN variants: Unrolled-GAN [Metz et al., 2017], D2GAN [Nguyen et al., 2017] and Spectral Normalization (SN)-GAN [Miyato et al., 2018]. For simplicity, we consider the case when $g$ is an identity function, and this form of GAN is denoted as adversarially learned likelihood-ratio (ALL) in Fig. 3.
For all variants, we study their entropy-regularized versions, by adding the entropy bound in (11), when training the generator. Unless otherwise mentioned, we use a fix-and-decay scheme for the regularization weight in all experiments: over the total training iterations, we first hold the weight fixed for an initial set of iterations, then linearly decay it to 0 over the remaining iterations. On this 8-Gaussian dataset, k and k.
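A fix-and-decay schedule of this kind can be sketched as a small function. The argument names (`total_steps`, `hold_steps`, `initial_weight`) and the values in the usage lines are hypothetical; the source does not preserve the actual settings.

```python
def fix_and_decay(step, total_steps, hold_steps, initial_weight):
    """Regularization weight at a given 0-indexed training iteration.

    The weight is held at initial_weight for the first hold_steps iterations,
    then decays linearly, reaching 0 at total_steps.
    """
    if step < hold_steps:
        return initial_weight
    if step >= total_steps:
        return 0.0
    progress = (step - hold_steps) / (total_steps - hold_steps)
    return initial_weight * (1.0 - progress)

# Hypothetical settings: 100 iterations in total, weight held for the first 20.
schedule = [fix_and_decay(t, 100, 20, 1.0) for t in range(101)]
```

Holding the weight first gives the sampler time to spread over the support before the regularizer is gradually removed.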
Twenty runs were conducted for each algorithm. Since we know the true distribution in this case, we employ the symmetric KL divergence as a metric to quantitatively compare the quality of generated data. In Fig. 3 we report the distribution of divergence values for all runs. We add the entropy bound to each variant, and visualize their results as violin plots with gray edges (the color for each variant is kept for comparison). The substantially decreased mean and reduced variance of the divergence show that the entropy annealing yields significantly more consistent and reliable solutions, across all methods. We plot the generated samples in Fig. 3. The generated samples of the original GAN are shown in Fig. 3(a). The samples “struggle” between covering all modes and separating modes. This issue is significantly reduced by ALL with entropy regularization, as shown in Fig. 3(b). SN-GAN (Fig. 3(c)) generates samples that concentrate only around the centroid of each mode. However, after adding our entropy regularizer (Fig. 3(d)), the issue is alleviated and the samples spread out.
6.1.2 Learning based on an unnormalized distribution
When the unnormalized form of a target distribution is given, we consider two types of entropy regularization to improve our RAS algorithm: E: the cycle-consistency-based regularization; E: the cross-entropy-based regularization. To clearly see the advantage of the regularizers, we keep the regularization weight fixed in this experiment. Figure 4 shows the results, with each case shown in one row. The target distributions are shown in column (a), and the sampling results of RAS are shown in column (b). RAS can reflect the general shape of the underlying distribution, but tends to concentrate on the high density regions. The results with the two entropy regularizers are shown in (c) and (d). The entropy encourages the samples to spread out, leading to better approximation, and E appears to yield best performance.
Figure 4: Sampling from unnormalized distributions. (a) Target. (b) RAS. (c) RAS+E. (d) RAS+E.

6.1.3 Comparison of two learning settings
In traditional GAN learning, we have a finite set of samples with the empirical distribution $\tilde{p}(x)$ to learn from, each sample drawn from the true distribution $p(x)$. It is known that the optimum of GANs yields the marginal distribution matching $q_\theta(x) = \tilde{p}(x)$ [Goodfellow et al., 2014]; it also implies that the performance of $q_\theta(x)$ in approximating $p(x)$ is limited by $\tilde{p}(x)$. In contrast, when we learn from an unnormalized form as in RAS, the likelihood ratio is estimated using samples drawn from the reference distribution and from the generator. Hence, we can draw as many samples as desired to get an accurate likelihood-ratio estimation, which further enables $q_\theta(x)$ to approach $p(x)$. This means RAS can potentially provide better approximation, when $u(x)$ is available.
We demonstrate this advantage on the above 8-Gaussian distribution. We train GAN on with , samples, and train RAS on . Note that the samples from the reference distribution and the generator are drawn in an online fashion to train RAS. With an appropriate number of iterations (k) to assure convergence, in total samples were used to estimate the likelihood ratio in (8), where is the minibatch size.
In the evaluation stage, we draw 20k samples from for each model, and compute the symmetric KL divergence against the true distribution. The results are shown in Figure 5. As an illustration for the ideal performance, we draw 20k samples from the target distribution and show its divergence as the black line. The GAN gradually performs better, as more target samples are available in training. However, they are still worse than RAS by a fairly large margin.
Figure 6: Sampling from a constrained domain. (a) Beta reference. (b) Gaussian reference. (c) SVGD. (d) Amortized SVGD.
6.2 Sampling from Constrained Domains
To show that RAS can draw samples when the support is bounded, we apply it to sample from the distributions with the support . The details for the functions and the decay of the regularization weight are in the SM. We adopt the Beta distribution as our reference, whose parameters are estimated using the method of moments (see Sec. 3). The activation function in the last layer of the generator is chosen as . As a baseline, we naively use an empirical Gaussian as the reference. We also compare with the standard SVGD [Liu and Wang, 2016] and the amortized SVGD methods [Wang and Liu, 2016], in which 512 particles are used.
Figure 6 shows the comparison. Note that since the support of the Beta distribution is defined on an interval, our RAS can easily match this reference distribution, leading the adversary to accurately estimate the likelihood ratio. Therefore, it closely approximates the target, as shown in Figure 6(a). Alternatively, when a Gaussian reference is considered, the adversarial ratio estimation can be inaccurate in the low density regions, resulting in degraded sampling performance shown in Figure 6(b). Since SVGD is designed for sampling in unconstrained domains, a principled mechanism to extend it to a constrained domain is less clear. Figure 6(c) shows SVGD results, and a substantial percentage of particles fall outside the desired domain. The amortized SVGD method adopts an $\ell_2$ metric to match the generator’s samples to the SVGD targets; it collapses to the distribution mode, as in Figure 6(d). We observed that the amortized MCMC results [Li et al., 2017b, Chen et al., 2018a] are similar to the amortized SVGD [Li et al., 2018].
6.3 Soft Q-learning
Soft Q-learning (SQL) has been proposed recently [Haarnoja et al., 2017], with the reinforcement learning (RL) policy based on a general class of distributions, with the goal of representing complex, multimodal behavior. An agent can take an action $a$ based on a policy $\pi(a|s)$, defined as the probability of taking action $a$ when in state $s$. It is shown in [Haarnoja et al., 2017] that the target policy has a known unnormalized density.
Figure 7: Total average return of evaluation rollouts during training. (a) Swimmer (rllab). (b) Hopper. (c) Humanoid (rllab). (d) Halfcheetah. (e) Ant. (f) Walker.
SQL-SVGD
To take actions from the optimal policy (i.e., sampling), learning in [Haarnoja et al., 2017] is performed via amortized SVGD in two separate steps: the samples of the target policy are first drawn using SVGD by minimizing the reverse KL divergence; these samples are then used as the target to update the sampler parameters under an amortization metric. We call this procedure SQL-SVGD.
SQL-RAS
Alternatively, we apply our RAS algorithm to replace the amortized SVGD. When the action space is unconstrained, we may use the Gaussian reference. However, the action space is often constrained in continuous control, with each dimension in a bounded interval. Hence, we adopt the Beta-distribution reference for RAS.
Following [Haarnoja et al., 2018], we compare RAS with amortized SVGD on six continuous control tasks: Hopper, Halfcheetah, Ant and Walker from the OpenAI benchmark suite [Brockman et al., 2016], as well as the Swimmer and Humanoid tasks in the rllab implementation [Duan et al., 2016]. Note that the action space is bounded for all the tasks. The dimension of the action space ranges from 2 to 21 across the different tasks. The higher-dimensional environments are usually harder to solve. All hyperparameters used in this experiment are listed in the SM.
Figure 7 shows the total average return of evaluation rollouts during training. We train 3 different instances of each algorithm, with each performing one evaluation rollout every 1k environment steps. The solid curves correspond to the mean and the shaded regions to the standard deviation. Overall, it shows that RAS significantly outperforms amortized SVGD on four tasks, both in terms of learning speed and final performance. This includes the most complex benchmark, the 21-dimensional Humanoid (rllab). On the other two tasks, the two methods perform comparably. In the SQL setting, learning a good stochastic policy with entropy maximization can help training. This suggests that RAS can better estimate the target policy.
7 Conclusions
We introduce a reference-based adversarial sampling method as a general approach to draw from unnormalized distributions. It allows us to extend GANs from the traditional sample-based learning setting to this new setting, and provides novel methods for important downstream applications, e.g., soft Q-learning. RAS can also be easily used for constrained domain sampling. Further, an entropy regularization is proposed to improve the sample quality, applicable to learning from samples or an unnormalized distribution. Extensive experimental results show the effectiveness of the entropy regularization. In soft Q-learning, RAS provides performance comparable to, if not better than, its alternative method amortized SVGD.
Acknowledgements
We thank Rohith Kuditipudi, Ruiyi Zhang, Yulai Cong and Ricardo Henao for helpful feedback/editing. We acknowledge anonymous reviewers for proofreading and improving the manuscript. The research was supported by DARPA, DOE, NIH, NSF and ONR.
References
 [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
 [Brooks et al., 2011] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo.
 [Chen et al., 2018a] Chen, C., Li, C., Chen, L., Wang, W., Pu, Y., and Carin, L. (2018a). Continuous-time flows for efficient inference and density estimation. In International Conference on Machine Learning, pages 823–832.
 [Chen et al., 2018b] Chen, L., Dai, S., Pu, Y., Li, C., Su, Q., and Carin, L. (2018b). Symmetric variational autoencoder and connections to adversarial learning. AISTATS.
 [Chen et al., 2016] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS.
 [Duan et al., 2016] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In ICML.
 [Feng et al., 2017] Feng, Y., Wang, D., and Liu, Q. (2017). Learning to draw samples with amortized stein variational gradient descent. UAI.
 [Gelman et al., 1995] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
 [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS.
 [Gutmann and Hyvärinen, 2010] Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS.
 [Haarnoja et al., 2017] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. ICML.
 [Haarnoja et al., 2018] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML.
 [Hamelryck et al., 2010] Hamelryck, T., Borg, M., Paluszewski, M., Paulsen, J., Frellsen, J., Andreetta, C., Boomsma, W., Bottaro, S., and Ferkinghoff-Borg, J. (2010). Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE.
 [Hastings, 1970] Hastings, W. (1970). Monte Carlo sampling methods using Markov Chains and their applications. Biometrika.
 [Heusel et al., 2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. (2017). GANs trained by a two timescale update rule converge to a Nash equilibrium. NIPS.
 [Hoffman et al., 2013] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research.
 [Kanamori et al., 2010] Kanamori, T., Suzuki, T., and Sugiyama, M. (2010). Theoretical analysis of density ratio estimation. IEICE Trans. Fund. Electronics, Comm., CS.
 [Kingma and Welling, 2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
 [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
 [Li et al., 2018] Li, C., Li, J., Wang, G., and Carin, L. (2018). Learning to sample with adversarially learned likelihood-ratio.
 [Li et al., 2017a] Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. (2017a). ALICE: Towards understanding adversarial learning for joint distribution matching. NIPS.
 [Li et al., 2015] Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2015). Stochastic expectation propagation. In NIPS.
 [Li et al., 2017b] Li, Y., Turner, R. E., and Liu, Q. (2017b). Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343.
 [Liu and Wang, 2016] Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS.
 [Liu et al., 2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In ICCV.
 [Mescheder et al., 2017] Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML.
 [Metz et al., 2017] Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2017). Unrolled generative adversarial networks. ICLR.
 [Minka, 2001] Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI.
 [Miyato et al., 2018] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In ICLR.
 [Mohamed and L., 2016] Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. NIPS workshop on adversarial training.
 [Neyman and Pearson, 1933] Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231(694-706):289–337.
 [Nguyen et al., 2017] Nguyen, T., Le, T., Vu, H., and Phung, D. (2017). Dual discriminator generative adversarial nets. NIPS.
 [Nguyen et al., 2010a] Nguyen, X., Wainwright, M., and Jordan, M. (2010a). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Info. Theory.
 [Nguyen et al., 2010b] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010b). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory.
 [Nowozin et al., 2016] Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. NIPS.
 [Oord et al., 2016] Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In ICML.
 [Radford et al., 2016] Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
 [Rezende et al., 2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
 [Uehara et al., 2016] Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
 [Van Trees, 2001] Van Trees, H. L. (2001). Detection, estimation, and modulation theory. John Wiley & Sons.
 [Wang and Liu, 2016] Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722v2.
 [Welling and Teh, 2011] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
 [Y. Pu and Carin, 2017] Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. (2017). VAE learning via Stein variational gradient descent. NIPS.
 [Zhu et al., 2017] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.
Appendix A Proof of the Entropy Bound in Lemma 1
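Consider random variables $(x, \epsilon)$ under the joint distribution $q_\theta(x, \epsilon) = q_0(\epsilon)\, q_\theta(x \mid \epsilon)$, where $q_\theta(x \mid \epsilon) = \delta(x - h_\theta(\epsilon))$. The mutual information between $x$ and $\epsilon$ satisfies $I(x; \epsilon) = H(x) - H(x \mid \epsilon) = H(\epsilon) - H(\epsilon \mid x)$. Since $x$ is a deterministic function of $\epsilon$, $H(x \mid \epsilon) = 0$. We therefore have $H(x) = H(\epsilon) - H(\epsilon \mid x)$, where $H(\epsilon)$ is a constant wrt $\theta$. For general distribution $t_\phi(\epsilon \mid x)$,
$$H(\epsilon \mid x) = -\mathbb{E}_{q_\theta(x, \epsilon)}\big[\log q_\theta(\epsilon \mid x)\big] = -\mathbb{E}_{q_\theta(x, \epsilon)}\big[\log t_\phi(\epsilon \mid x)\big] - \mathbb{E}_{q_\theta(x)}\big[\mathrm{KL}\big(q_\theta(\epsilon \mid x) \,\|\, t_\phi(\epsilon \mid x)\big)\big] \qquad (12)$$
$$\le -\mathbb{E}_{q_\theta(x, \epsilon)}\big[\log t_\phi(\epsilon \mid x)\big]. \qquad (13)$$
We consequently have
$$H(x) \ge H(\epsilon) + \mathbb{E}_{q_\theta(x, \epsilon)}\big[\log t_\phi(\epsilon \mid x)\big]. \qquad (14)$$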
Therefore, the entropy is lower bounded, up to a constant, by the log likelihood, i.e., the negative cycle-consistency loss; minimizing the cycle-consistency loss maximizes a lower bound on the entropy, or equivalently on the mutual information.
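The bound can be checked numerically in a small discrete setting. The sketch below is only an illustration under stated assumptions: the noise is uniform over `K` values, `gen` is a hypothetical deterministic many-to-one generator, and `t_rand` is an arbitrary inference distribution; none of these come from the experiments in this paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical setup: uniform noise over K values and a deterministic,
# many-to-one generator mapping each noise index to a sample index.
K = 6
q_eps = np.full(K, 1.0 / K)
gen = np.array([0, 0, 1, 1, 1, 2])
n_x = gen.max() + 1

# Joint q(x, eps): the mass of each noise value sits on x = gen[eps].
joint = np.zeros((n_x, K))
joint[gen, np.arange(K)] = q_eps
q_x = joint.sum(axis=1)

h_x = entropy(q_x)
h_eps = entropy(q_eps)

def lower_bound(t):
    """H(eps) + E_{q(x, eps)}[log t(eps | x)] for a row-stochastic t of shape (n_x, K)."""
    mask = joint > 0
    return h_eps + float((joint[mask] * np.log(t[mask])).sum())

# With the exact posterior q(eps | x) the bound is tight.
posterior = joint / q_x[:, None]
tight = lower_bound(posterior)

# With an arbitrary strictly positive inference distribution it is loose.
rng = np.random.default_rng(0)
t_rand = rng.random((n_x, K)) + 0.1
t_rand /= t_rand.sum(axis=1, keepdims=True)
loose = lower_bound(t_rand)
```

In this toy case `tight` coincides with the entropy of the samples and `loose` falls below it, which is the behaviour the lemma predicts: improving the inference distribution tightens the bound.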
Appendix B Experiments
B.1 Sampling from 8-GMM
Two methods are presented for estimating the likelihood ratio: (i) ALL for the discriminator in the standard GAN, i.e., Eq. (4); (ii) ALL for a variational characterization of measures in [Nguyen et al., 2010a].
In Figure 8, we plot the distribution of inception score (ICP) values [Li et al., 2017a]. The conclusions match those drawn from the symmetric KL divergence metric: (1) the likelihood-ratio implementation improves on the original GAN, and (2) the entropy regularizer improves all GAN variants. Note that because ICP favors samples closer to the mean of each mode, and SN-GAN generates samples that concentrate only around each mode's centroid, SN-GAN shows a slightly better ICP than its entropy-regularized version. We argue that the entropy regularizer helps generate diverse samples, and that the lower ICP value merely reflects a limitation of the metric.
The learning curves of the inception score and symmetric KL divergence are plotted over iterations in Figure 9(a) and (b), respectively. The GAN variants with the entropy term outperform those without it. We conclude that the entropy regularizer can significantly improve both the convergence speed and the final performance.
Figure 9: (a) Inception score over iterations. (b) Symmetric KL over iterations.
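To make the remark about ICP favoring samples near mode centers concrete, the following is a minimal sketch of the standard inception-score formula, exp of the average KL divergence between per-sample class probabilities and their marginal. How class probabilities are obtained for the 8-GMM samples is not specified here, so the probability matrices below are hypothetical inputs.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) for an (N, C) array of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

n_modes = 8

# Hypothetical case 1: every sample is assigned to one mode with certainty,
# and the modes are covered evenly.
confident = np.eye(n_modes)

# Hypothetical case 2: samples spread away from the mode centers, so the
# assignments are softer (still covering all modes evenly).
soft = 0.7 * np.eye(n_modes) + 0.3 / n_modes

# Hypothetical case 3: completely uninformative assignments.
uncertain = np.full((n_modes, n_modes), 1.0 / n_modes)

score_confident = inception_score(confident)
score_soft = inception_score(soft)
score_uncertain = inception_score(uncertain)
```

Under these assumptions the score is highest for samples that sit exactly on the mode centers and drops as soon as assignments soften, even though all eight modes remain equally covered. This is the sense in which a sampler with more spread can score lower on ICP.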
Architectures and Hyperparameters
For the 8-GMM and MNIST datasets, the network architectures are specified in Table 2, and hyperparameters are detailed in Table 3. The inference network is used to construct the cycle-consistency loss that bounds the entropy.
We further study three real-world datasets of increasing diversity and size: MNIST, CIFAR-10 [Krizhevsky et al., 2012] and CelebA [Liu et al., 2015]. For each dataset, we start with a standard GAN model: two-layer fully connected (FC) networks on MNIST, and DCGAN [Radford et al., 2016] on CIFAR and CelebA. We then add the entropy regularizer. On MNIST, we repeat the experiments 5 times and report the mean ICP. On CIFAR and CelebA, the performance is also quantified via the recently proposed Fréchet Inception Distance (FID) [Heusel et al., 2017], which approximates the Wasserstein-2 distance between generated samples and true samples. The best ICP and FID for each algorithm are reported in Table B.1. The entropy variants consistently show better performance than their original counterparts.
| Dataset | ICP (Standard) | ICP (E) | FID (Standard) | FID (E) |
|---|---|---|---|---|
| MNIST | | | | |
| CIFAR | | | | |
| CelebA | | | | |
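For reference, the Fréchet distance underlying FID has a closed form once Gaussians are fit to two sets of feature vectors: it is the squared Wasserstein-2 distance between those two Gaussians. The sketch below assumes the features are already available as arrays; the Inception feature extraction is outside its scope, and the random arrays are placeholders rather than real features.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two (N, D) feature arrays, D >= 2."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Placeholder features: a pure shift changes only the mean term.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
shift = np.array([1.0, 0.0, -2.0, 0.5])
d_same = frechet_distance(real_feats, real_feats)
d_shift = frechet_distance(real_feats, real_feats + shift)
```

When the two feature sets differ only by a constant shift, the covariance terms cancel and the distance reduces to the squared norm of the shift, which gives a quick sanity check on any implementation.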
B.2 Constrained Domains
The two functions are: (1) , and (2) . The network architectures used for constrained domains are reported in Table 5. The batch size is 512, and the learning rate is . The total number of training iterations is k, and we start to decay after k iterations.
B.3 Soft Q-learning
We show the detailed settings of the environments for Soft Q-learning in Table 6. The network architectures are specified in Table 7, and hyperparameters are detailed in Table 8. We only add the entropy regularization at the beginning to stabilize training, and then quickly decay it to 0. The total number of training epochs is 200; we start to decay the regularization after 10 epochs, and set it to 0 after 50 epochs. This is because we observed that the entropy regularization did not help in the end, and removing it could accelerate training.
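The schedule described above can be sketched as a simple function of the epoch. The text does not state the shape of the decay or the initial weight, so the linear ramp and `initial_weight=1.0` below are assumptions made purely for illustration; only the 10, 50 and 200 epoch marks come from the description.

```python
def entropy_weight(epoch, initial_weight=1.0, decay_start=10, decay_end=50):
    """Entropy-regularization weight at a given epoch.

    Constant until decay_start, then (by assumption) linearly decayed,
    and exactly 0 from decay_end onward.
    """
    if epoch < decay_start:
        return initial_weight
    if epoch >= decay_end:
        return 0.0
    frac = (epoch - decay_start) / (decay_end - decay_start)
    return initial_weight * (1.0 - frac)

# Weight for each of the 200 training epochs.
schedule = [entropy_weight(e) for e in range(200)]
```

Any monotone decay with the same start and end points would match the description equally well; the linear form is just the simplest choice.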
| Environment | Action Space | Reward Scale | Replay Pool Size |
|---|---|---|---|
| Swimmer (rllab) | 2 | 100 | |
| Hopper-v1 | 3 | 1 | |
| HalfCheetah-v1 | 6 | 1 | |
| Walker2d-v1 | 6 | 3 | |
| Ant-v1 | 8 | 10 | |
| Humanoid (rllab) | 21 | 100 | |