Adversarial Learning of a Sampler Based on an Unnormalized Distribution

by Chunyuan Li, et al.

We investigate adversarial learning in the case when only an unnormalized form u(x) of the target density can be accessed, rather than samples. With insights so garnered, adversarial learning is extended to this setting, for which one has access to u(x) but no samples from the corresponding normalized density. Further, new concepts in GAN regularization are developed, based on learning from samples or from u(x). The proposed method is compared to alternative approaches, with encouraging results demonstrated across a range of applications, including deep soft Q-learning.



1 Introduction

Significant progress has been made recently on generative models capable of synthesizing highly realistic data samples [Goodfellow et al., 2014, Oord et al., 2016, Kingma and Welling, 2014]. If p(x) represents the true underlying probability distribution of data x, most of these models seek to represent draws from p(x) as x = h_θ(z) with z ∼ q(z), where q(z) is a specified distribution that may be sampled easily [Goodfellow et al., 2014, Radford et al., 2016]. The objective is to learn the generator h_θ(z), modeled typically via a deep neural network. Note that the model doesn't impose a form on (or attempt to explicitly model) the density function q_θ(x) used to implicitly model p(x).

When learning θ it is typically assumed that one has access to a set of samples {x_i}, with each x_i drawn i.i.d. from p(x). While such samples are often available, there are other important settings for which one may wish to learn a generative model for p(x), without access to associated samples. An important example occurs when one has access to an unnormalized distribution u(x), with p(x) = u(x)/C and normalizing constant C unknown. The goal of sampling from p(x) based on u(x) is a classic problem in physics, statistics and machine learning [Hastings, 1970, Gelman et al., 1995]. This objective has motivated theoretically exact (but expensive) methods like Markov chain Monte Carlo (MCMC) [Brooks et al., 2011, Welling and Teh, 2011], and approximate methods like variational Bayes [Hoffman et al., 2013, Kingma and Welling, 2014, Rezende et al., 2014] and expectation propagation [Minka, 2001, Li et al., 2015]. A challenge with methods of these types (in addition to computational cost/approximations) is that they are means of drawing samples or approximating density forms based on u(x), but they do not directly yield a model like x = h_θ(z) with z ∼ q(z), with the latter important for many fast machine learning implementations.

A recently developed, and elegant, means of modeling samples based on u(x) is Stein variational gradient descent (SVGD) [Liu and Wang, 2016]. SVGD also learns to draw a set of samples, and an amortization step is used to learn h_θ(z) based on the SVGD-learned samples [Wang and Liu, 2016, Feng et al., 2017, Y. Pu and Carin, 2017]. Such amortization may also be used to build h_θ(z) based on MCMC-generated samples [Li et al., 2017b]. While effective, SVGD-based learning of this form may be limited computationally by the number of samples that may be practically modeled, limiting accuracy. Further, the two-step nature by which h_θ(z) is manifested may be viewed as less appealing.

In this paper we develop a new extension of generative adversarial networks (GANs) [Goodfellow et al., 2014] for settings in which we have access to u(x), rather than samples drawn from p(x). The formulation, while new, is simple, based on a recognition that many existing GAN methods constitute different means of estimating a function of a likelihood ratio [Kanamori et al., 2010, Mohamed and L., 2016, Uehara et al., 2016]. The likelihood ratio p(x)/q_θ(x) is associated with the true density function p(x) and the model q_θ(x). Since we do not have access to p(x) or q_θ(x), we show, by a detailed investigation of f-GAN [Nowozin et al., 2016], that many GAN models reduce to learning g(p(x)/q_θ(x)), where g(·) is a general monotonically increasing function. f-GAN is an attractive model for uncovering underlying principles associated with GANs, due to its generality, and because many existing GAN approaches may be viewed as special cases of f-GAN. With the understanding provided by an analysis of f-GAN, we demonstrate how the likelihood ratio may be estimated via u(x) and an introduced reference distribution p_r(x). As discussed below, the assumptions on p_r(x) are that it is easily sampled, it has a known functional form, and it represents a good approximation to p(x).

For the special case of variational inference for latent-variable models, the proposed formulation recovers the adversarial variational Bayes (AVB) [Mescheder et al., 2017] setup. However, we demonstrate that the proposed approach has applicability well beyond inference. Specifically, we demonstrate its application to soft Q-learning [Haarnoja et al., 2017], where it leads to the first general-purpose adversarial policy-learning algorithm in reinforcement learning. We make a favorable comparison in this context to the aforementioned SVGD formulation.

An additional contribution of this paper concerns regularization of adversarial learning, of interest whether learning is based on samples or on an unnormalized distribution u(x). Specifically, we develop an entropy-based regularizer. When learning based on u(x), we make connections to simulated annealing regularization methods used in prior sampling-based models. We also introduce a bound on the entropy, applicable to learning based on samples or u(x), and make connections to prior work on cycle consistency used in GAN regularization.

Figure 1: Illustration of learning in the two different settings of the target p(x). (a) Learning from an unnormalized distribution, as in RAS; (b) learning from samples, as in traditional GANs.

2 Traditional GAN Learning

We begin by discussing GAN from the perspective of the f-divergence [Nguyen et al., 2010a], which has resulted in f-GAN [Nowozin et al., 2016].

f-GAN is considered because many popular GAN methods result as special cases, thereby affording the opportunity to identify generalizable components that may be extended to new settings. Considering continuous probability density functions p(x) and q(x) for x ∈ X, the f-divergence is defined as

D_f(p‖q) = ∫ q(x) f(p(x)/q(x)) dx,

where f(·) is a convex, lower-semicontinuous function satisfying f(1) = 0. Different choices of f yield many common divergences; see [Nowozin et al., 2016] and Table 1.
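As a concrete check of this definition, the following minimal sketch (our own illustration; the function names are not from the paper) evaluates D_f numerically for discrete distributions and confirms that the choice f(t) = t log t recovers the KL divergence:

```python
import math

def f_divergence(p, q, f):
    # D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# f(t) = t*log(t) satisfies f(1) = 0 and recovers KL(p || q)
f_kl = lambda t: t * math.log(t)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_f = f_divergence(p, q, f_kl)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert abs(d_f - kl) < 1e-12
```

Other rows of Table 1 follow by swapping in the corresponding f.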

An important connection has been made between the f-divergence and generative adversarial learning, based on the inequality [Nguyen et al., 2010a]

D_f(p‖q) ≥ sup_T { E_{x∼p(x)}[T(x)] − E_{x∼q(x)}[f*(T(x))] },   (1)

where f*(t) = sup_u { ut − f(u) } is the convex conjugate function, which has an analytic form for many choices of f [Nowozin et al., 2016]. Further, under mild conditions, the bound is tight when T(x) = f′(p(x)/q(x)), where f′ is the derivative of f. Even if we know f′, we cannot evaluate this optimal T(x) explicitly, because p(x) and/or q(x) are unknown.

Note that to compute the bound in (1), we require expectations wrt p(x) and q(x), which we effect via sampling (this implies we only need samples from p and q, and do not require the explicit form of the underlying distributions). Specifically, assume p(x) corresponds to the true distribution we wish to model, and q_θ(x) is a model distribution with parameters θ. We seek to learn θ by minimizing the bound of D_f(p‖q_θ) in (1), with draws from q_θ implemented as x = h_θ(z) with z ∼ q(z), where q(z) is a probability distribution that may be sampled easily (e.g., uniform, or isotropic Gaussian [Goodfellow et al., 2014]). The learning problem consists of solving

min_θ max_φ { E_{x∼p(x)}[T_φ(x)] − E_{z∼q(z)}[f*(T_φ(h_θ(z)))] },   (2)

where T_φ(x) is typically a (deep) neural network with parameters φ, with h_θ(z) defined similarly. Attempting to solve (2) produces f-GAN [Nowozin et al., 2016].

One typically solves this minimax problem by alternating between the update of φ and θ [Nowozin et al., 2016, Goodfellow et al., 2014]. Note that the update of θ only involves the second term in (2), corresponding to E_{z∼q(z)}[f*(T_φ(h_θ(z)))]. Recall that the bound in (1) is tight when T_{φ_t}(x) = f′(p(x)/q_{θ_{t−1}}(x)) [Nguyen et al., 2010b], where θ_{t−1} represents parameters from the previous iteration. Hence, assuming T_{φ_t}(x) ≈ f′(p(x)/q_{θ_{t−1}}(x)), we update θ as

θ_t = argmax_θ E_{z∼q(z)}[ f*( f′( r_{t−1}(h_θ(z)) ) ) ],   (3)

where r_{t−1}(x) = p(x)/q_{θ_{t−1}}(x).

Different choices of f yield a different optimal function f*(f′(r)) (see Table 1). However, in each case θ is updated such that samples from q_θ yield an increase in the likelihood ratio r(x) = p(x)/q_{θ_{t−1}}(x), implying samples from q_θ better match p(x) than they do q_{θ_{t−1}}(x). Recall that the likelihood ratio is the optimal means of distinguishing between samples from p(x) and q_{θ_{t−1}}(x) [Van Trees, 2001, Neyman and Pearson, 1933]. Hence, r(x) is a critic, approximated through T_φ(x), that the actor h_θ(z) seeks to maximize when estimating θ_t.

f-Divergence | f*(f′(r)) in the θ update
Kullback-Leibler (KL) | r
Reverse KL | log r − 1
Squared Hellinger | √r − 1
Total variation | (1/2) sgn(r − 1)
Table 1: Functions f*(f′(r)) of the likelihood ratio r = p(x)/q_θ(x) corresponding to particular f-GAN setups.

We may alternatively consider

max_φ { E_{x∼p(x)}[log σ(g(T_φ(x)))] + E_{x∼q_θ(x)}[log(1 − σ(g(T_φ(x))))] },   (4)

where g(·) is an arbitrary monotonically increasing function of T, and σ(·) is the sigmoid function. From [Kanamori et al., 2010, Mescheder et al., 2017, Gutmann and Hyvärinen, 2010], the solution to (4) is

g(T_{φ*}(x)) = log( p(x)/q_θ(x) ),   (5)

where model T_φ is assumed to have sufficient capacity to represent the likelihood ratio for all x. Hence, g(T_φ(x)) here replaces the critic T(x) from f-GAN, and the solution to (4) is a particular function of the likelihood ratio. If g(·) is the identity, updating θ by maximizing E_{z∼q(z)}[T_φ(h_θ(z))] corresponds to learning based on minimizing the reverse KL divergence KL(q_θ(x)‖p(x)). For another choice of g(·), one recovers the original GAN [Goodfellow et al., 2014], for which learning corresponds to minimizing the Jensen-Shannon divergence.

In (4)-(5) and in f-GAN, respective estimation of T_φ and θ yields approximation of a function of a likelihood ratio; such an estimation appears to be at the heart of many GAN models. This understanding is our launching point for extending the range of applications of adversarial learning.
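The likelihood-ratio view can be made concrete with a small sketch (our own illustration, not code from the paper): a logistic critic trained to discriminate samples from p and q_θ recovers log(p(x)/q_θ(x)) as its logit. Below, the two densities are unit-variance Gaussians at ±1, for which the exact log-ratio is 2x, and a full-batch logistic regression recovers approximately that logit:

```python
import math, random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Samples from p = N(+1, 1) (label 1) and q = N(-1, 1) (label 0)
n = 2000
xs = [random.gauss(1.0, 1.0) for _ in range(n)] + [random.gauss(-1.0, 1.0) for _ in range(n)]
ys = [1.0] * n + [0.0] * n

# Full-batch gradient ascent on the mean Bernoulli log-likelihood;
# the critic's logit is T(x) = w0 + w1 * x
w0, w1 = 0.0, 0.0
lr = 1.0
for _ in range(1000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = y - sigmoid(w0 + w1 * x)
        g0 += err
        g1 += err * x
    m = len(xs)
    w0 += lr * g0 / m
    w1 += lr * g1 / m

# Optimal logit: log p(x)/q(x) = 2x for these two Gaussians
assert abs(w1 - 2.0) < 0.3 and abs(w0) < 0.3
```

This is the mechanism the next section exploits: the critic estimates a log-ratio even when one of the two densities is only known up to a constant.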

3 Unnormalized-Distribution GAN

In the above discussion, and in virtually all prior GAN research, access is assumed to samples from the target distribution p(x). In many applications samples from p(x) are unavailable, but the unnormalized u(x) is known, with p(x) = u(x)/C but with the constant C = ∫ u(x) dx intractable. A contribution of this paper is a recasting of GAN to cases for which we have u(x) but no samples from p(x), recognizing that most GAN models require an accurate estimate of the underlying likelihood ratio.

We consider the formulation in (4)-(5) and for simplicity set g(·) to the identity, although any choice of g(·) may be considered as long as it is monotonically increasing. The update of θ remains as in (5), and we seek to estimate log(p(x)/q_θ(x)) based on knowledge of u(x). Since p(x) ∝ u(x), for the critic it is sufficient to estimate log(u(x)/q_θ(x)). Toward that end, we introduce a reference distribution p_r(x) that may be sampled easily and has an explicit functional form that may be evaluated. The reference distribution can be connected to both importance sampling and the reference ratio method developed in bioinformatics [Hamelryck et al., 2010]. We have

log( u(x)/q_θ(x) ) = log( u(x)/p_r(x) ) + log( p_r(x)/q_θ(x) ),   (8)

where log(u(x)/p_r(x)) may be evaluated explicitly. We learn log(p_r(x)/q_θ(x)) via (4), with p(x) changed to p_r(x). Therefore, learning alternates between two updates: the critic is updated to estimate log(p_r(x)/q_θ(x)), and θ is updated using the full ratio estimate in (8).

We call this procedure reference-based adversarial sampling (RAS) for unnormalized distributions. One should carefully note its distinction from traditional GANs (we refer to generative models learned via samples as GAN, and generative models learned via an unnormalized distribution as RAS), which usually learn to draw samples that mimic given samples of a target distribution. To illustrate the difference, we visualize the learning schemes for the two settings in Figure 1.
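The decomposition in (8) can be checked numerically: only the middle ratio p_r/q_θ requires adversarial estimation, while u/p_r is available in closed form. A minimal sketch, with hypothetical stand-in densities for u(x), p_r(x) and q_θ(x) (these particular Gaussians are our own choice for illustration):

```python
import math

def log_normal(x, mu, sigma):
    # log density of N(mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Unnormalized target u(x) = exp(-x^2 / 2), i.e. p(x) = N(0, 1) with C = sqrt(2*pi)
log_u = lambda x: -0.5 * x ** 2

log_p_r = lambda x: log_normal(x, 0.5, 1.5)   # reference: known form, easy to sample
log_q = lambda x: log_normal(x, 1.0, 2.0)     # stand-in for the generator density

for x in [-2.0, 0.0, 0.7, 3.0]:
    # Decomposition (8): the explicit term log(u/p_r) plus the estimable term log(p_r/q)
    lhs = log_u(x) - log_q(x)
    rhs = (log_u(x) - log_p_r(x)) + (log_p_r(x) - log_q(x))
    assert abs(lhs - rhs) < 1e-9
```

In the actual algorithm, log(p_r/q_θ) is replaced by the output of the adversarially trained critic; log C is an additive constant that does not affect the θ update.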

The parameters of the reference distribution p_r(x) are estimated using samples from q_θ(x). We consider different forms of p_r(x) depending on the application.

  • Unconstrained domains  For the case when the support of the target distribution is unconstrained, we model p_r(x) as a Gaussian distribution with diagonal covariance matrix, with mean and variance components estimated via samples from q_θ(x), drawn as x = h_θ(z) with z ∼ q(z).

  • Constrained domains  In some real-world applications the support is bounded. For example, in reinforcement learning, the action often resides within a finite interval. In this case, we propose to represent each dimension of p_r(x) as a generalized Beta distribution Beta(α, β) on the given interval. After rescaling samples to [0, 1], the shape parameters are estimated using the method of moments: α̂ = m̄(m̄(1 − m̄)/v̄ − 1) and β̂ = (1 − m̄)(m̄(1 − m̄)/v̄ − 1), where m̄ and v̄ are the sample mean and variance, respectively.
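The constrained-domain fit above can be sketched as follows (a minimal illustration with names of our own choosing; we check it by recovering the parameters of a known Beta distribution rescaled to [-1, 1]):

```python
import random

random.seed(1)

def beta_moment_match(samples, lo, hi):
    # Rescale samples to [0, 1], then apply the standard Beta method-of-moments
    # estimates: alpha = m * c, beta = (1 - m) * c, with c = m(1 - m)/v - 1
    u = [(s - lo) / (hi - lo) for s in samples]
    n = len(u)
    m = sum(u) / n
    v = sum((x - m) ** 2 for x in u) / n
    c = m * (1 - m) / v - 1.0
    return m * c, (1 - m) * c   # (alpha, beta)

# Sanity check: recover the parameters of a known Beta(2, 5) rescaled to [-1, 1]
draws = [-1 + 2 * random.betavariate(2.0, 5.0) for _ in range(50000)]
a_hat, b_hat = beta_moment_match(draws, -1.0, 1.0)
assert abs(a_hat - 2.0) < 0.15 and abs(b_hat - 5.0) < 0.35
```

In RAS, the `samples` argument would be draws from the current generator q_θ(x), refit periodically as θ changes.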

4 Entropy Regularization

Whether we perform adversarial learning based on samples from p(x), as in Sec. 2, or based upon an unnormalized distribution u(x), as in Sec. 3, the update of the generator parameters θ is of the form θ_t = argmax_θ E_{z∼q(z)}[g(T_{φ_t}(h_θ(z)))], where T_φ is approximated as in (4) or its modified form (for learning from an unnormalized distribution).

A well-known failure mode of GAN is the tendency of the generative model x = h_θ(z), with z ∼ q(z), to under-represent the full diversity of data that may be drawn from p(x). Considering (5), θ will seek to favor synthesis of data x for which q_θ(x) is small and p(x) is large. When learning in this manner, at iteration t the model tends to favor synthesis of a subset of data that are probable from p(x) and less probable from q_{θ_{t−1}}(x). This subset of data that q_θ models well can change with t, with the iterative learning continuously moving to model a subset of the data that are probable via p(x). This subset can be very small, in the worst case yielding a model that always generates the same single data sample that looks like a real draw from p(x); in this case h_θ(z) yields the same or near-same output for all z ∼ q(z), albeit a realistic-looking sample x.

To mitigate this failure mode, it is desirable to add a regularization term to the update of θ, encouraging that the entropy of q_θ(x) be large at each iteration t, discouraging the model from representing (while iteratively training) a varying small subset of the data supported by p(x). Specifically, consider the regularized update of (5) as

θ_t = argmax_θ { E_{z∼q(z)}[g(T_{φ_t}(h_θ(z)))] + λ_t H(q_θ) },   (10)

where H(q_θ) represents the entropy of the distribution q_θ(x), for λ_t ≥ 0. The significant challenge is that x = h_θ(z), but by construction we lack an explicit form for q_θ(x), and hence the entropy may not be computed directly. Below we consider two means by which we may approximate H(q_θ), one of which is explicitly appropriate for the case in which we learn based upon the unnormalized u(x), and the other of which is applicable whether we learn via samples from p(x) or based on u(x).

In the case for which u(x) = C p(x) and u(x) is known, we may consider approximating or replacing the entropy H(q_θ) = −E_{x∼q_θ}[log q_θ(x)] with the cross entropy −E_{x∼q_θ}[log u(x)]; the term log C may be ignored, because it doesn't impact the regularization in (10). The first term in (10) tends to encourage the model to learn to draw samples x where p(x), or u(x), is large, while the second term discourages over-concentration on such high-probability regions, as the cross entropy becomes large when q_θ places samples near lower-probability regions of u(x). This will ideally yield a spreading-out of the samples encouraged by h_θ(z), with high-probability regions of u(x) modeled well, but also regions spreading out from these high-probability regions.

To gain further insight into (10), we again consider the useful case of g(·) the identity, and assume the ideal solution T_{φ_t}(x) = log(p(x)/q_θ(x)). In this case cross-entropy-based regularization may be seen as seeking to maximize wrt θ the function

E_{x∼q_θ}[ (1 − λ) log u(x) − log q_θ(x) ].

For the special case of u(x) = exp(−E(x)), with E(x) an "energy" function, we have u(x)^{1−λ} = exp(−E(x)/T) with T = 1/(1 − λ). Hence, the cross-entropy regularization is analogous to annealing: λ near 1 corresponds to high "temperature" T, which is lowered as λ is reduced, with T → 1 as λ → 0. When T is large the peaks in u(x) are "flattened out," allowing the model to yield samples that "spread out" and explore the diversity of p(x). This interpretation suggests learning via (10), with the cross-entropy replacement for H(q_θ), with λ near 1 at the start, and progressively reducing λ toward 0 (corresponding to lowering the temperature T toward 1).
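The flattening effect of the temperature can be verified directly. The sketch below (our own illustration) normalizes a tempered density exp(β · log u(x)) on a 1-D grid, with β = 1 − λ, and checks that the high-temperature version (λ near 1) has strictly higher entropy than the untempered target:

```python
import math

def entropy_of_tempered(log_u, beta):
    # Normalize p(x) ∝ exp(beta * log_u(x)) on a grid and return its entropy
    w = [math.exp(beta * l) for l in log_u]
    Z = sum(w)
    p = [x / Z for x in w]
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A peaked unnormalized log-density log u(x) = -E(x) on a 1-D grid
xs = [i * 0.01 - 3.0 for i in range(601)]
log_u = [-8.0 * x * x for x in xs]

# beta = 1 - lambda: lambda near 1 => high temperature => flatter, higher entropy
h_hot = entropy_of_tempered(log_u, 1 - 0.9)   # lambda = 0.9
h_cold = entropy_of_tempered(log_u, 1 - 0.0)  # lambda = 0 (the target itself)
assert h_hot > h_cold
```

Decaying λ then traces exactly the annealing schedule described above, from a flattened target back to u(x).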

The above setup assumes we have access to u(x), which is not the case when we seek to learn based on samples of p(x). Further, rather than replacing H(q_θ) by the cross entropy, we may wish to approximate H(q_θ) based on samples of q_θ(x), which we have via x = h_θ(z) with z ∼ q(z) (with θ estimated via samples from p(x) or based on u(x)). Toward that end, consider the following lemma.

Lemma 1

Let q_ψ(z|x) be a probabilistic inverse mapping associated with the generator q_θ(x), with parameters ψ. The mutual information between x and z satisfies

I(x; z) ≥ H(z) + E_{z∼q(z), x=h_θ(z)}[log q_ψ(z|x)].   (11)

The proof is provided in the Supplemental Material (SM). Since H(z) is a constant wrt (θ, ψ), one may seek to maximize E[log q_ψ(z|x)] to increase the entropy H(q_θ). Hence, in (10) we replace the entropy term with H(z) + E_{z∼q(z)}[log q_ψ(z|h_θ(z))].

In practice we consider q_ψ(z|x) = N(z; μ_ψ(x), I), where I is the identity matrix and μ_ψ(x) is a vector mean. Hence, H(q_θ) in (10) is replaced (up to a constant) by −E_{z∼q(z)}[‖z − μ_ψ(h_θ(z))‖²]. Note that a failure mode of GAN, as discussed above, corresponds to many or all z being mapped via h_θ(z) to the same output. This is discouraged via this regularization, as such behavior makes it difficult to simultaneously minimize ‖z − μ_ψ(h_θ(z))‖² for all z. This regularization is related to cycle-consistency [Li et al., 2017a]. However, the justification of the negative cycle-consistency term as a lower bound on the entropy of q_θ(x) is deemed a contribution of this paper (not addressed in [Li et al., 2017a]).
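The argument that mode collapse makes the cycle term irreducibly large can be checked on a toy example (our own illustration; the generators below are hypothetical stand-ins). For an invertible generator the best inverse map drives the penalty to zero, while for a collapsed generator no inverse map can do better than the prior variance:

```python
import random

random.seed(2)

zs = [random.gauss(0.0, 1.0) for _ in range(10000)]

# Diverse (invertible) generator h(z) = 2z + 1: best inverse is mu(x) = (x - 1) / 2,
# so the cycle penalty E||z - mu(h(z))||^2 is exactly zero
err_diverse = sum((z - ((2 * z + 1) - 1) / 2) ** 2 for z in zs) / len(zs)

# Collapsed generator h(z) = 0 for all z: the best inverse is the constant mean(z),
# so the penalty is at least Var(z) = 1
mu = sum(zs) / len(zs)
err_collapsed = sum((z - mu) ** 2 for z in zs) / len(zs)

assert err_diverse < 1e-12
assert err_collapsed > 0.9
```

Maximizing the negative cycle penalty therefore pushes h_θ away from collapsed solutions, consistent with the entropy bound of Lemma 1.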

5 Related Work

Use of a reference distribution We have utilized a readily-sampled reference distribution, with known density function p_r(x), when learning to sample from an unnormalized distribution u(x). The authors of [Gutmann and Hyvärinen, 2010] also use such a reference distribution, to estimate the probability distribution associated with observed data samples. However, [Gutmann and Hyvärinen, 2010] considered a distinct problem, for which one wished to fit observed samples to a specified unnormalized distribution. Here we employ the reference distribution in the context of learning to sample from a known u(x), with no empirical samples from p(x) provided.

Adversarial variational Bayes In the context of variational Bayes analysis, adversarial variational Bayes (AVB) [Mescheder et al., 2017] was proposed for posterior inference of variational autoencoders (VAEs) [Kingma and Welling, 2014]. Assume we are given a parametric generative model p_θ(x|z) with prior p(z) on latent code z, designed to model observed data samples x. There is interest in designing an inference arm, capable of efficiently inferring a distribution on the latent code z given observed data x. Given observed x, the posterior distribution on the code is p(z|x) = u(z; x)/C(x), where u(z; x) = p_θ(x|z)p(z) represents an unnormalized distribution of the latent variable z, which also depends on the data x.

One may show that if the procedure in Sec. 3 is employed to draw samples from p(z|x), based on the unnormalized u(z; x), one exactly recovers AVB [Mescheder et al., 2017]; AVB is thus a special case of the framework considered here. We do not consider the application to inference with VAEs, as the experiments in [Mescheder et al., 2017] carry over directly to the framework we have developed. Instead, we make the generality of RAS clear: we show its applicability to reinforcement learning in Sec. 6, and broaden the discussion of adaptive reference distributions in Sec. 3, with extensions to constrained-domain sampling.

Regularization The cycle-consistency term employed here was considered in [Li et al., 2017a, Chen et al., 2018b, Zhu et al., 2017], but its use as a bound on the entropy of q_θ(x) is new. From Lemma 1 we see that it is also a bound on the mutual information between x and z, maximization of which is the same goal as InfoGAN [Chen et al., 2016]. However, unlike in [Chen et al., 2016], here the mapping x = h_θ(z) is deterministic, whereas in InfoGAN it is stochastic. Additionally, the goal here is to encourage diversity in the generated x, which helps mitigate mode collapse, whereas in InfoGAN the goal was to discover latent semantic concepts.

Figure 2: Comparison of different GAN variants. The GAN models and corresponding entropy-regularized variants are visualized in the same color; in each case, the left result is unregularized, and the right employs entropy regularization. The black dots indicate the means of the distributions.
(a) GAN (b) GAN-E (c) SN-GAN (d) SN-GAN-E
Figure 3: Generated samples.

Stein variational gradient descent (SVGD) In the formulation of (4)-(5), if one sets g(·) to the identity, then the learning objective corresponds to minimizing the reverse KL divergence KL(q_θ(x)‖p(x)). SVGD [Liu and Wang, 2016] also addresses this goal given the unnormalized distribution u(x), with p(x) = u(x)/C. As in the proposed approach, the goal is not to explicitly learn a functional form for p(x); rather, the goal of SVGD is to learn to draw samples from it. We directly learn a sampler model x = h_θ(z) with z ∼ q(z), whereas in [Wang and Liu, 2016] a specified set of samples is adjusted sequentially to correspond to draws from the unnormalized distribution u(x). In that setting, one assumes access to a set of samples drawn from some initial distribution, and these samples are updated deterministically as x_i ← x_i + ε φ(x_i), where ε is a small step size and φ(·) is a nonlinear function, assumed described by a reproducing kernel Hilbert space (RKHS) with given kernel k(·, ·); φ(·) is a deterministic function evaluated in terms of u(x) and the current samples. While this process is capable of transforming a specific set of samples such that they ultimately approximate samples drawn from p(x), we do not have access to a model that allows one to draw new samples quickly, on demand. Consequently, within the SVGD framework, a model h_θ(z) is learned separately as a second "amortization" step. The two-step character of SVGD should be contrasted with the direct approach of the proposed model to learn h_θ(z). SVGD has been demonstrated to work well, and therefore it is a natural model against which to compare, as considered below.
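A minimal SVGD sketch (our own illustration, using the standard RBF-kernel update; parameter values are arbitrary) shows the particle dynamics the text describes: each particle is attracted along ∇ log u and repelled from its neighbors through the kernel gradient, and the ensemble drifts toward the unnormalized target:

```python
import math, random

random.seed(3)

def svgd_step(xs, grad_log_u, eps=0.05, h=0.5):
    # phi(x) = (1/n) sum_j [ k(x_j, x) * grad_log_u(x_j) + d/dx_j k(x_j, x) ]
    # with RBF kernel k(a, b) = exp(-(a - b)^2 / (2h))
    n = len(xs)
    out = []
    for x in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-(xj - x) ** 2 / (2 * h))
            phi += k * grad_log_u(xj) + k * (x - xj) / h  # attraction + repulsion
        out.append(x + eps * phi / n)
    return out

# Unnormalized target u(x) = exp(-(x - 2)^2 / 2), so grad log u = -(x - 2)
grad_log_u = lambda x: -(x - 2.0)

particles = [random.gauss(0.0, 1.0) for _ in range(50)]
for _ in range(1000):
    particles = svgd_step(particles, grad_log_u)

mean = sum(particles) / len(particles)
assert abs(mean - 2.0) < 0.3                       # particles center on the target
assert max(particles) - min(particles) > 1.0       # repulsion prevents collapse
```

Note that this yields only a fixed particle set; turning it into an on-demand sampler requires the separate amortization step discussed above, which RAS avoids.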

6 Experimental Results

The TensorFlow code to reproduce the experimental results is available on GitHub.


6.1 Effectiveness of Entropy Regularization

6.1.1 Learning based on samples

We first demonstrate that the proposed entropy regularization improves mode coverage when learning based on samples. Following the design in [Metz et al., 2017], we consider a synthetic dataset of samples drawn from a 2D mixture of 8 Gaussians. The results on real datasets are reported in SM.

We consider the original GAN and three state-of-the-art GAN variants: Unrolled-GAN [Metz et al., 2017], D2GAN [Nguyen et al., 2017] and Spectral Normalization (SN)-GAN [Miyato et al., 2018]. For simplicity, we consider the case when g(·) is the identity function; this form of GAN is denoted adversarially learned likelihood-ratio (ALL) in Fig. 3.

For all variants, we study their entropy-regularized versions, by adding the entropy bound in (11) when training the generator. If not specifically mentioned, we use a fix-and-decay scheme for λ in all experiments: over the total number of training iterations, we first fix λ during an initial warm-up phase, then linearly decay it to 0 over the remaining iterations. On this 8-Gaussian dataset, the warm-up and total iteration counts are a few thousand.

Twenty runs were conducted for each algorithm. Since we know the true distribution in this case, we employ the symmetric KL divergence as a metric to quantitatively compare the quality of the generated data. In Fig. 2 we report the distribution of divergence values over all runs. We add the entropy bound to each variant, and visualize the results as violin plots with gray edges (the color for each variant is retained for comparison). The markedly decreased mean and reduced variance of the divergence show that the entropy annealing yields significantly more consistent and reliable solutions, across all methods. We plot the generated samples in Fig. 3. The samples of the original GAN, shown in Fig. 3(a), "struggle" between covering all modes and separating the modes. This issue is significantly reduced by ALL with entropy regularization, as shown in Fig. 3(b). SN-GAN (Fig. 3(c)) generates samples that concentrate only around the centroids of the modes. After adding our entropy regularizer (Fig. 3(d)), the issue is alleviated and the samples spread out.

6.1.2 Learning based on an unnormalized distribution

When the unnormalized form u(x) of a target distribution is given, we consider two types of entropy regularization to improve our RAS algorithm: the cycle-consistency-based regularization and the cross-entropy-based regularization. To clearly see the advantage of the regularizers, we fix λ in this experiment. Figure 4 shows the results, with each case shown in one row. The target distributions are shown in column (a), and the sampling results of RAS are shown in column (b). RAS reflects the general shape of the underlying distribution, but tends to concentrate on the high-density regions. The two entropy regularizers are shown in columns (c) and (d). The entropy regularization encourages the samples to spread out, leading to a better approximation of the target.

Figure 4: Entropy regularization for unnormalized distributions. (a) Target; (b) RAS; (c) RAS with cycle-consistency-based regularization; (d) RAS with cross-entropy-based regularization.

6.1.3 Comparison of two learning settings

In traditional GAN learning, we have a finite set of samples, with empirical distribution p̂(x), to learn from, each sample drawn from the true distribution p(x). It is known that the optimum of GAN yields marginal distribution matching q_θ(x) = p̂(x) [Goodfellow et al., 2014]; this also implies that the performance of q_θ(x) is limited by p̂(x). In contrast, when we learn from an unnormalized form u(x) as in RAS, the likelihood ratio is estimated using samples drawn from p_r(x) and from q_θ(x). Hence, we can draw as many samples as desired to get an accurate likelihood-ratio estimate, which further enables q_θ(x) to approach p(x). This means RAS can potentially provide a better approximation, when u(x) is available.

Figure 5: Comparison of learning via GAN and RAS.

We demonstrate this advantage on the above 8-Gaussian distribution. We train GAN on empirical sample sets of increasing size drawn from p(x), and train RAS on u(x). Note that the samples from p_r(x) and q_θ(x) are drawn in an online fashion to train RAS. With an appropriate number of iterations to assure convergence, the total number of samples used to estimate the likelihood ratio in (8) is the number of iterations times the minibatch size.

In the evaluation stage, we draw 20k samples from q_θ(x) for each model, and compute the symmetric KL divergence against the true distribution. The results are shown in Figure 5. As an illustration of the ideal performance, we draw 20k samples from the target distribution itself and show its divergence as the black line. The GAN gradually performs better as more target samples are available in training, but it remains worse than RAS by a fairly large margin.

(a) Beta reference (b) Gaussian reference (c) SVGD (d) Amortized SVGD
Figure 6: Sampling from constrained domains

6.2 Sampling from Constrained Domains

To show that RAS can draw samples when the support of p(x) is bounded, we apply it to sample from distributions whose support is a finite interval. The details of the target functions and the decay of λ are in the SM. We adopt the Beta distribution as our reference, whose parameters are estimated using the method of moments (see Sec. 3). The activation function in the last layer of the generator is chosen to map onto the bounded support. As a baseline, we naively use an empirical Gaussian as the reference. We also compare with the standard SVGD [Liu and Wang, 2016] and the amortized SVGD method [Wang and Liu, 2016], in which 512 particles are used.

Figure 6 shows the comparison. Since the support of the Beta distribution is an interval, our RAS can easily match this reference distribution, leading the adversary to accurately estimate the likelihood ratio. Therefore, it closely approximates the target, as shown in Figure 6(a). Alternatively, when a Gaussian reference is considered, the adversarial ratio estimation can be inaccurate in the low-density regions, resulting in the degraded sampling performance shown in Figure 6(b). Since SVGD is designed for sampling in unconstrained domains, a principled mechanism to extend it to a constrained domain is less clear; Figure 6(c) shows the SVGD results, in which a substantial percentage of particles falls outside the desired domain. Because the amortized SVGD method adopts an ℓ2 metric to match the generator's samples to the SVGD targets, it collapses to the distribution mode, as in Figure 6(d). We observed that the amortized MCMC results [Li et al., 2017b, Chen et al., 2018a] are similar to those of amortized SVGD [Li et al., 2018].

6.3 Soft Q-learning

Soft Q-learning (SQL) has been proposed recently [Haarnoja et al., 2017], with the reinforcement learning (RL) policy based on a general class of distributions, with the goal of representing complex, multimodal behavior. An agent takes an action a based on a policy π(a|s), defined as the probability of taking action a when in state s. It is shown in [Haarnoja et al., 2017] that the target policy has a known unnormalized density, π(a|s) ∝ exp(Q(s, a)), where Q(s, a) is the soft Q-function.
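For a discrete action set the normalization of this policy is tractable, which makes the structure easy to illustrate (a minimal sketch of our own; `alpha` is an assumed temperature parameter, not notation from the paper). For continuous actions the normalizer is an intractable integral, which is exactly why an unnormalized-density sampler such as RAS (or SVGD) is needed:

```python
import math

def soft_policy(q_values, alpha=1.0):
    # pi(a|s) ∝ exp(Q(s, a) / alpha); normalization is tractable here only
    # because the action set is discrete
    w = [math.exp(q / alpha) for q in q_values]
    z = sum(w)
    return [x / z for x in w]

q = [1.0, 2.0, 0.5]
pi_soft = soft_policy(q, alpha=1.0)
pi_greedy = soft_policy(q, alpha=0.05)

assert abs(sum(pi_soft) - 1.0) < 1e-12
assert max(pi_soft) < 0.7      # the soft policy keeps mass on all actions
assert pi_greedy[1] > 0.99     # low temperature approaches the greedy argmax
```

The multimodal, entropy-preserving behavior of the soft policy is what the sampler x = h_θ(z) must reproduce in the continuous-action experiments below.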

Figure 7: Soft Q-learning on MuJoCo environments: (a) Swimmer, (b) Hopper, (c) Humanoid, (d) Half-cheetah, (e) Ant, (f) Walker.

To take actions from the optimal policy (i.e., sampling), learning in [Haarnoja et al., 2017] is performed via amortized SVGD in two separate steps: samples from the target policy are first drawn using SVGD; these samples are then used as the target to update the policy network under an amortization metric. We call this procedure SQL-SVGD.


Alternatively, we apply our RAS algorithm to replace the amortized SVGD. When the action space is unconstrained, we may use the Gaussian reference distribution. However, the action space is often constrained in continuous control, with each dimension restricted to a finite interval. Hence, we adopt the Beta-distribution reference for RAS.

Following [Haarnoja et al., 2018], we compare RAS with amortized SVGD on six continuous control tasks: Hopper, Half-cheetah, Ant and Walker from the OpenAI benchmark suite [Brockman et al., 2016], as well as the Swimmer and Humanoid tasks in the implementation of [Duan et al., 2016]. Note that the action space is constrained to a finite interval for all the tasks. The dimension of the action space ranges from 2 to 21 across the different tasks; the higher-dimensional environments are usually harder to solve. All hyperparameters used in this experiment are listed in the SM.

Figure 7 shows the total average return of evaluation rollouts during training. We train 3 different instances of each algorithm, with each performing one evaluation rollout every 1k environment steps. The solid curves correspond to the mean and the shaded regions to the standard deviation. Overall, RAS significantly outperforms amortized SVGD on four tasks, both in terms of learning speed and final performance; this includes the most complex benchmark, the 21-dimensional Humanoid. On the other two tasks, the two methods perform comparably. In the SQL setting, learning a good stochastic policy with entropy maximization can help training, suggesting that RAS better estimates the target policy.

7 Conclusions

We introduced a reference-based adversarial sampling method as a general approach to draw from unnormalized distributions. It allows us to extend GANs from the traditional sample-based learning setting to this new setting, and provides novel methods for important downstream applications, e.g., soft Q-learning. RAS can also be easily used for constrained-domain sampling. Further, an entropy regularization is proposed to improve sample quality, applicable to learning from samples or from an unnormalized distribution. Extensive experimental results show the effectiveness of the entropy regularization. In soft Q-learning, RAS provides performance comparable to, if not better than, its alternative, amortized SVGD.


We thank Rohith Kuditipudi, Ruiyi Zhang, Yulai Cong and Ricardo Henao for helpful feedback/editing. We acknowledge anonymous reviewers for proofreading and improving the manuscript. The research was supported by DARPA, DOE, NIH, NSF and ONR.


  • [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
  • [Brooks et al., 2011] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo.
  • [Chen et al., 2018a] Chen, C., Li, C., Chen, L., Wang, W., Pu, Y., and Duke, L. C. (2018a). Continuous-time flows for efficient inference and density estimation. In International Conference on Machine Learning, pages 823–832.
  • [Chen et al., 2018b] Chen, L., Dai, S., Pu, Y., Li, C., Su, Q., and Carin, L. (2018b). Symmetric variational autoencoder and connections to adversarial learning. AISTATS.
  • [Chen et al., 2016] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS.
  • [Duan et al., 2016] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In ICML.
  • [Feng et al., 2017] Feng, Y., Wang, D., and Liu, Q. (2017). Learning to draw samples with amortized stein variational gradient descent. UAI.
  • [Gelman et al., 1995] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
  • [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS.
  • [Gutmann and Hyvärinen, 2010] Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS.
  • [Haarnoja et al., 2017] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. ICML.
  • [Haarnoja et al., 2018] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML.
  • [Hamelryck et al., 2010] Hamelryck, T., Borg, M., Paluszewski, M., Paulsen, J., Frellsen, J., Andreetta, C., Boomsma, W., Bottaro, S., and Ferkinghoff-Borg, J. (2010). Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PloS one.
  • [Hastings, 1970] Hastings, W. (1970). Monte Carlo sampling methods using Markov Chains and their applications. Biometrika.
  • [Heusel et al., 2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a Nash equilibrium. NIPS.
  • [Hoffman et al., 2013] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research.
  • [Kanamori et al., 2010] Kanamori, T., Suzuki, T., and Sugiyama, M. (2010). Theoretical analysis of density ratio estimation. IEICE Trans. Fund. Electronics, Comm., CS.
  • [Kingma and Welling, 2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
  • [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
  • [Li et al., 2018] Li, C., Li, J., Wang, G., and Carin, L. (2018). Learning to sample with adversarially learned likelihood-ratio.
  • [Li et al., 2017a] Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. (2017a). ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS.

  • [Li et al., 2015] Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2015). Stochastic expectation propagation. In NIPS.
  • [Li et al., 2017b] Li, Y., Turner, R. E., and Liu, Q. (2017b). Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343.
  • [Liu and Wang, 2016] Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS.
  • [Liu et al., 2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In ICCV.
  • [Mescheder et al., 2017] Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML.
  • [Metz et al., 2017] Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2017). Unrolled generative adversarial networks. ICLR.
  • [Minka, 2001] Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI.
  • [Miyato et al., 2018] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In ICLR.
  • [Mohamed and L., 2016] Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. NIPS workshop on adversarial training.
  • [Neyman and Pearson, 1933] Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231(694-706):289–337.
  • [Nguyen et al., 2017] Nguyen, T., Le, T., Vu, H., and Phung, D. (2017). Dual discriminator generative adversarial nets. NIPS.
  • [Nguyen et al., 2010a] Nguyen, X., Wainwright, M., and Jordan, M. (2010a). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Info. Theory.
  • [Nguyen et al., 2010b] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010b). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory.
  • [Nowozin et al., 2016] Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. NIPS.
  • [Oord et al., 2016] Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In ICML.
  • [Radford et al., 2016] Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
  • [Rezende et al., 2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
  • [Uehara et al., 2016] Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
  • [Van Trees, 2001] Van Trees, H. L. (2001). Detection, estimation, and modulation theory. John Wiley & Sons.
  • [Wang and Liu, 2016] Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.
  • [Welling and Teh, 2011] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  • [Y. Pu and Carin, 2017] Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. (2017). VAE learning via Stein variational gradient descent. NIPS.
  • [Zhu et al., 2017] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.


Appendix A Proof of the Entropy Bound in Lemma 1

Consider random variables $(x, z)$ under the joint distribution $q(x, z) = q(z)\,q(x|z)$, where $x = g_\theta(z)$ with $z \sim q(z)$. The mutual information between $x$ and $z$ satisfies $I(x; z) = H(x) - H(x|z)$. Since $x$ is a deterministic function of $z$, $H(x|z) = 0$. We therefore have $I(x; z) = H(x) = H(z) - H(z|x)$, where $H(z)$ is a constant wrt $\theta$. For a general distribution $q_\phi(z|x)$,

$$H(z|x) = -\mathbb{E}_{q(x,z)}[\log q(z|x)] \le -\mathbb{E}_{q(x,z)}[\log q_\phi(z|x)].$$

We consequently have

$$H(x) \ge H(z) + \mathbb{E}_{q(x,z)}[\log q_\phi(z|x)].$$

Therefore, the entropy is lower bounded by the log-likelihood, or negative cycle-consistency loss; minimizing the cycle-consistency loss maximizes the entropy, or mutual information.
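To make the inequality H(x) >= H(z) + E[log q_phi(z|x)] concrete, here is a small discrete sanity check; the toy generator z // 2 and both inference distributions are illustrative assumptions, not part of the paper:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# z uniform over 8 states; x = g(z) = z // 2 is deterministic, so H(x|z) = 0
# and the lemma gives H(x) >= H(z) + E_{q(x,z)}[log q_phi(z|x)].
pz = np.full(8, 1 / 8)
H_z = entropy(pz)                                  # = log 8
px = np.zeros(4)
for z in range(8):
    px[z // 2] += pz[z]
H_x = entropy(px)                                  # = log 4

# Exact posterior q_phi(z|x): uniform over the two preimages -> bound is tight.
E_log_q_exact = sum(pz[z] * np.log(1 / 2) for z in range(8))
# Crude q_phi(z|x): uniform over all eight z -> bound is loose but still valid.
E_log_q_unif = sum(pz[z] * np.log(1 / 8) for z in range(8))
```

With the exact posterior the bound is an equality (log 4 = log 8 - log 2), which is why a well-trained inference network makes the cycle-consistency surrogate a tight entropy estimate.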

Appendix B Experiments

B.1 Sampling from 8-GMM

Two methods are presented for estimating the likelihood ratio: (i) -ALL, based on the discriminator in the standard GAN, i.e., Eq. (4); (ii) -ALL, based on a variational characterization of -measures in [Nguyen et al., 2010a].

In Figure 8, we plot the distribution of inception score (ICP) values [Li et al., 2017a]. Similar conclusions to those drawn with the symmetric KL divergence metric can be made: (1) the likelihood-ratio implementation improves the original GAN, and (2) the entropy regularizer improves all GAN variants. Note that because ICP favors samples closer to the mean of each mode, and SN-GAN generates samples that concentrate only around each mode's centroid, SN-GAN shows slightly better ICP than its entropy-regularized version. We argue that the entropy regularizer helps generate diverse samples; the lower ICP value is merely a limitation of the metric.
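For reference, the inception-score statistic being plotted is ICP = exp(E_x KL(p(y|x) || p(y))). The helper below is a generic numpy sketch of that formula, not the exact classifier-based variant used in the paper, and the two toy inputs are illustrative:

```python
import numpy as np

def inception_score(pyx):
    """Inception score from per-sample class posteriors.

    pyx: array of shape (num_samples, num_classes); row i is p(y | x_i).
    ICP = exp( E_x KL(p(y|x) || p(y)) ), where p(y) is the marginal
    class distribution over the generated samples.
    """
    pyx = np.asarray(pyx, dtype=float)
    py = pyx.mean(axis=0)                       # marginal class distribution
    kl = np.sum(pyx * (np.log(pyx + 1e-12) - np.log(py + 1e-12)), axis=1)
    return float(np.exp(kl.mean()))

# Perfectly confident, perfectly balanced samples over 8 modes -> ICP = 8.
confident = np.eye(8)
# Totally uninformative posteriors -> ICP = 1.
uniform = np.full((16, 8), 1 / 8)
```

This makes the bias noted above visible: samples confidently assigned to mode centroids score high even when diversity within each mode is poor.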

The learning curves of the inception score and symmetric KL divergence values are plotted over iterations in Figure 9 (a) and (b), respectively. The family of GAN variants with the entropy term dominates in performance, compared with those without it. We conclude that the entropy regularizer can significantly improve both the convergence speed and the final performance.

Figure 8: Comparison of inception scores across the GAN variants. Each GAN variant and its entropy-regularized counterpart are visualized in the same color, with the latter shaded slightly. The black dots indicate the means of the distributions.
(a) Inception score over iterations. (b) Symmetric KL over iterations.
Figure 9: Learning curves of different GAN variants. The standard GAN variants are visualized as dashed lines, while their corresponding entropy-regularized variants are visualized as the solid lines in the same color.
Architectures and Hyper-parameters

For the 8-GMM and MNIST datasets, the network architectures are specified in Table 2, and hyper-parameters are detailed in Table 3. The inference network is used to construct the cycle-consistency loss to bound the entropy.
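As a sketch of how the inference network enters the objective, the cycle-consistency term can be written as a latent reconstruction loss, which (for a fixed-variance Gaussian inference model) matches the log-likelihood term in the entropy bound up to a constant. The linear generator, its exact and crude inverses, and all names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cycle_consistency_loss(generator, inference_net, z):
    """Entropy surrogate: reconstruct the latent z from x = generator(z).

    Minimizing E ||z - inference_net(generator(z))||^2 corresponds, for a
    Gaussian q_phi(z|x) with fixed variance, to maximizing the
    E[log q_phi(z|x)] term that lower-bounds the sample entropy H(x).
    """
    z_rec = inference_net(generator(z))
    return float(np.mean(np.sum((z - z_rec) ** 2, axis=1)))

# Toy linear generator and two candidate inference nets.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
generator = lambda z: z @ A.T
good_inference = lambda x: x @ np.linalg.inv(A).T   # exact inverse of g
poor_inference = lambda x: x                        # ignores the scaling

z = rng.standard_normal((256, 2))
```

A perfect inference network drives the loss to zero (tight entropy bound); a poor one leaves a gap, which is exactly the slack in Lemma 1.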

                 8-GMM    MNIST
Networks         Size     Size
Generator
Discriminator
Auxiliary

Table 2: Network architectures. The convention for an architecture "X–H–H–Y": X is the input size, Y the output size, and H the hidden size. ReLU is used for all hidden layers, and the activation of the output layer is linear, except for the generator on MNIST, which uses a sigmoid.

Hyper-parameters    8-GMM           MNIST
Learning rate
Batch size
Updates             (k iterations)  (epochs)

Table 3: The hyper-parameters of the experiments. The Adam optimizer is used.

We further study three real-world datasets of increasing diversity and size: MNIST, CIFAR10 [Krizhevsky et al., 2012] and CelebA [Liu et al., 2015]. For each dataset, we start with a standard GAN model: two-layer fully connected (FC) networks on MNIST, and DCGAN [Radford et al., 2016] on CIFAR and CelebA. We then add the entropy regularizer. On MNIST, we repeat the experiments 5 times and report the mean ICP. On CIFAR and CelebA, performance is also quantified via the recently proposed Fréchet Inception Distance (FID) [Heusel et al., 2017], which approximates the Wasserstein-2 distance between generated and true samples. The best ICP and FID for each algorithm are reported in Table B.1. The entropy variants consistently outperform their original counterparts.
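FID has a closed form once Gaussians are fitted to inception features: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}). The numpy sketch below is not the reference implementation; it uses the trace identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2}) so that only symmetric PSD matrix square roots are needed:

```python
import numpy as np

def _sqrtm_psd(m):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, s1, mu2, s2):
    """Frechet distance between N(mu1, s1) and N(mu2, s2) fitted to features.

    Equals the squared Wasserstein-2 distance between the two Gaussians.
    """
    s1_half = _sqrtm_psd(s1)
    # Tr((s1 s2)^{1/2}) = Tr((s1^{1/2} s2 s1^{1/2})^{1/2}): keeps us in PSD land.
    covmean = _sqrtm_psd(s1_half @ s2 @ s1_half)
    diff = np.asarray(mu1) - np.asarray(mu2)
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical Gaussians give FID 0; a unit mean shift in each of d feature dimensions adds exactly d, which makes the metric easy to sanity-check.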

                 ICP                 FID
Dataset     Standard    E       Standard    E
MNIST                               -       -
CIFAR
CelebA          -       -

B.2 Constrained Domains

The two functions are: (1) , and (2) . The network architectures used for the constrained domains are reported in Table 5. The batch size is 512, and the learning rate is . The total number of training iterations is k, and we start the decay after k iterations.
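One simple way to realize constrained-domain sampling, consistent with the Tanh output activation listed in Table 5, is to let the generator's final activation squash samples into a box by construction. The single-layer network, the box bounds, and all names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def box_constrained_generator(z, W, b, lo=-1.0, hi=1.0):
    """Map latent noise into the box [lo, hi]^d via a Tanh output layer.

    tanh guarantees every raw output lies in (-1, 1)^d; an affine
    rescaling then maps it into the target box, so the sampler respects
    the constrained support for any network parameters.
    """
    h = np.tanh(z @ W + b)                      # squash into (-1, 1)
    return lo + (hi - lo) * (h + 1.0) / 2.0     # rescale into [lo, hi]

W = rng.standard_normal((4, 2))
b = rng.standard_normal(2)
samples = box_constrained_generator(rng.standard_normal((1000, 4)), W, b,
                                    lo=0.0, hi=2.0)
```

Because the constraint is enforced architecturally, the adversarial objective never needs to penalize out-of-support samples.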

Networks        Size
Generator
Discriminator
Auxiliary

Table 5: The convention for an architecture "X–H–H–Y": X is the input size, Y the output size, and H the hidden size. ReLU is used for all hidden layers, and the activation of the output layer is Tanh.
0:  Create replay memory ; Initialize network parameters ; Assign target parameters: , .
1:  for each epoch do
2:     for each  do
4:         Sample an action for using : , where .
5:         Sample next state and reward from the environment: and
6:         Save the new experience in the replay memory:
8:         .
10:         Sample for each .
11:         Compute the soft Q-values as the target unnormalized density form.
12:         Compute gradient of Q-network and update
14:         Sample actions for each from the stochastic policy via                               , where
15:         Sample actions for each from a Beta (or Gaussian) reference policy
16:         Compute gradient of discriminator in (8) and update
17:         Compute gradient of policy network in (9), and update
18:     end for
19:     if epoch mod update_interval = 0 then
20:         Update target parameters: ,
21:     end if
22:  end for
Algorithm 1 Adversarial Soft Q-learning
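Step 11's target can be sketched as follows: in soft Q-learning the target policy is pi(a|s) proportional to exp(Q(s,a)/alpha), and the corresponding soft value is a log-mean-exp over sampled actions. This minimal numpy helper omits the proposal-density correction for clarity, and the function name is ours:

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """Soft value V(s) = alpha * log mean_i exp(Q(s, a_i) / alpha).

    q_values holds Q(s, a_i) at sampled actions a_i; the soft Q-learning
    target policy pi(a|s) ∝ exp(Q(s, a) / alpha) is the unnormalized
    density the RAS sampler is trained to match.  A max-shift keeps the
    log-sum-exp numerically stable.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return float(m + alpha * np.log(np.mean(np.exp((q - m) / alpha))))
```

With a constant Q the soft value reduces to that constant; in general it lies between the mean and the max of the sampled Q-values, interpolating between average-reward and greedy backups as alpha varies.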

B.3 Soft Q-learning

We show the detailed settings of the environments for Soft Q-learning in Table 6. The network architectures are specified in Table 7, and hyper-parameters are detailed in Table 8. We add the entropy regularization only at the beginning to stabilize training, and then quickly decay its weight to 0. The total number of training epochs is 200; we start the decay after 10 epochs and set the weight to 0 after 50 epochs. This is because we observed that the entropy regularization did not help at the end of training, and removing it could accelerate training.
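The decay described above can be sketched as a simple schedule. The initial weight and the linear shape between epochs 10 and 50 are assumptions; the paper only states when the decay starts and when the weight reaches 0:

```python
def entropy_weight(epoch, w0=1.0, decay_start=10, zero_after=50):
    """Entropy-regularization weight schedule used only early in training.

    Full weight w0 until `decay_start`, linear decay to 0 by `zero_after`,
    and exactly 0 afterwards.  The constants mirror the 10/50-epoch
    schedule described above; w0 and the linear shape are assumptions.
    """
    if epoch < decay_start:
        return w0
    if epoch >= zero_after:
        return 0.0
    return w0 * (zero_after - epoch) / (zero_after - decay_start)
```

The regularized policy loss would then be weighted as loss + entropy_weight(epoch) * entropy_term at each update.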

Environment         Action Space    Reward Scale    Replay Pool Size
Swimmer (rllab)     2               100
Hopper-v1           3               1
HalfCheetah-v1      6               1
Walker2d-v1         6               3
Ant-v1              8               10
Humanoid (rllab)    21              100

Table 6: Hyper-parameters in SQL.

Networks         Size
Policy-Network
Q-Network