1 Introduction
Generative adversarial networks (GANs) (Goodfellow et al., 2014) have drawn great research interest and achieved remarkable success in image synthesis Radford et al. (2015); Brock et al. (2018), video generation Mathieu et al. (2015), and other tasks. However, it is usually hard to train a GAN well, because the training process is commonly unstable, sensitive to disturbances, and prone to collapse. To alleviate this issue, substantial effort has been devoted to improving training stability from different perspectives, e.g., divergence minimization Nowozin et al. (2016); Nock et al. (2017), Wasserstein distance with Lipschitz continuity of the discriminator Arjovsky et al. (2017); Gulrajani et al. (2017); Wei et al. (2018), energy-based models Zhao et al. (2016); Berthelot et al. (2017), etc.
In spite of this progress, the instability in training has not been well resolved Chu et al. (2020), since it is difficult to balance the strength of the generator and the discriminator. Worse, the instability is exacerbated in text generation due to the sequential and discrete nature of text Fedus et al. (2018); Caccia et al. (2020); Nie et al. (2018). Specifically, the high sensitivity of text generation to noise and the underlying errors caused by sparse discriminator signals in the generated text can often result in destructive updates to both generator and discriminator, enlarging the instability in GANs.
In this work, we develop a novel variational GAN training framework to improve training stability, which is broadly applicable to GANs of varied architectures for image and text generation. This training framework is derived from a variational perspective of GANs and the resulting connections to reinforcement learning (in particular, RL-as-inference) Abdolmaleki et al. (2018); Levine (2018); Schulman et al. (2017) and other rich literature Hu et al. (2018b); Grover et al. (2019); Burda et al. (2015). Specifically, our approach consists of two stabilization techniques, namely probability ratio clipping and sample reweighting, for stabilizing the generator and discriminator, respectively. (1) Under the new variational perspective, the generator update is subject to a KL penalty on the change of the generator distribution. This KL penalty closely resembles that in the popular Trust Region Policy Optimization (TRPO) Schulman et al. (2015) and the related Proximal Policy Optimization (PPO) Schulman et al. (2017). This connection motivates a simple surrogate objective with a clipped probability ratio between the new generator and the old one. The probability ratio clipping discourages excessively large generator updates, and has been shown to be effective for stabilizing policy optimization Schulman et al. (2017). Figure 1 (left) shows the intuition behind the surrogate objective: the objective decreases with an overly large generator change and thus imposes regularization on the updates.
(2) When updating the discriminator, the new perspective induces an importance sampling mechanism, which effectively reweights fake samples by their discriminator scores. Since low-quality samples tend to receive smaller weights, the discriminator trained on the reweighted samples is more likely to maintain stable performance, and in turn provide informative gradients for subsequent generator updates. Figure 1 (middle/right) demonstrates the effect of the reweighting in reducing the variance of both discriminator and generator losses. Similar importance weighting methods have recently been used in other contexts, such as debiasing generative models Grover et al. (2019) and sampling from energy-based models Deng et al. (2020). Our derivation can be seen as a variant for the new application of improving GANs.
We give a theoretical analysis showing that the generator under our training framework can converge to the real data distribution. Empirically, we conduct extensive experiments on a range of tasks, including text generation, text style transfer, and image generation. Our approach shows significant improvement over state-of-the-art methods, demonstrating its broad applicability and efficacy.
2 Related Work
Wasserstein distance, WGAN, and Lipschitz continuity. The GAN framework Goodfellow et al. (2014) features two components: a generator $G_\theta$ that synthesizes samples $x$ given some noise source $z$, namely $x = G_\theta(z)$ with $z \sim p_z(z)$, and a discriminator that distinguishes the generator's output from real data and provides gradient feedback to improve the generator's performance. WGAN Arjovsky et al. (2017) improves the training stability of GANs by minimizing the Wasserstein distance $W(p_r, p_\theta)$ between the generation distribution $p_\theta$ (induced from $G_\theta$) and the real data distribution $p_r$. Its training loss is formulated as:
$$\min_\theta \max_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_\theta}[f(x)], \tag{1}$$
where $\mathcal{F}$ is the set of 1-Lipschitz functions; $f$ acts as the discriminator and is usually implemented by a neural network $f_\phi$. The original approach to enforcing the Lipschitz constraint is weight clipping Arjovsky et al. (2017). WGAN-GP Gulrajani et al. (2017) later improves on it by replacing weight clipping with a gradient penalty on the discriminator. CT-GAN Wei et al. (2018) further imposes the Lipschitz continuity constraint on the manifold of the real data $x \sim p_r$. Our approach is orthogonal to these prior works and can serve as a drop-in replacement to stabilize the generator and discriminator in various kinds of GANs, such as WGAN-GP and CT-GAN.
Research on the Lipschitz continuity of GAN discriminators has resulted in the theory of “informative gradients” Zhou et al. (2019, 2018). Under certain mild conditions, a Lipschitz discriminator can provide informative gradients to the generator in a GAN framework: when $p_\theta$ and $p_r$ are disjoint, the gradient $\nabla f^*(x)$ of the optimal discriminator $f^*$ w.r.t. each sample $x \sim p_\theta$ points to a sample $y \sim p_r$, which guarantees that the generation distribution $p_\theta$ is moving towards $p_r$. We extend the informative gradient theory to our new case and show convergence of our approach.
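As a rough illustration of the WGAN objective in Eq.(1) and of a WGAN-GP style gradient penalty, the following sketch estimates both quantities for one-dimensional samples. It is not the implementation used in this paper: the function names (`critic_objective`, `gradient_penalty`), the toy Gaussian data, and the finite-difference slope are all assumptions made for illustration.

```python
import random

def critic_objective(f, real, fake):
    """Monte Carlo estimate of E_{x~p_r}[f(x)] - E_{x~p_theta}[f(x)]."""
    return sum(f(x) for x in real) / len(real) - sum(f(x) for x in fake) / len(fake)

def gradient_penalty(f, real, fake, rng, h=1e-4):
    """Mean of (|f'(x_hat)| - 1)^2 at random interpolates between real and fake samples.

    Assumption: 1-D inputs, with the slope estimated by central finite differences.
    """
    pairs = list(zip(real, fake))
    total = 0.0
    for x_real, x_fake in pairs:
        a = rng.random()
        x_hat = a * x_real + (1.0 - a) * x_fake
        slope = (f(x_hat + h) - f(x_hat - h)) / (2.0 * h)
        total += (abs(slope) - 1.0) ** 2
    return total / len(pairs)

rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]  # toy "real" data
fake = [rng.gauss(0.0, 0.5) for _ in range(1000)]  # toy "generated" data

f_lipschitz = lambda x: x      # slope 1, satisfies the 1-Lipschitz constraint
f_steep = lambda x: 3.0 * x    # slope 3, violates it

objective = critic_objective(f_lipschitz, real, fake)
penalty_ok = gradient_penalty(f_lipschitz, real, fake, rng)
penalty_bad = gradient_penalty(f_steep, real, fake, rng)
```

In this toy setting the penalty is essentially zero for the slope-1 critic and close to $(3-1)^2 = 4$ for the steeper one, which is the behaviour a gradient penalty is meant to enforce.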
Reinforcement Learning as Inference. Casting RL as probabilistic inference has a long history of research Dayan and Hinton (1997); Deisenroth et al. (2013); Rawlik et al. (2013); Levine (2018); Abdolmaleki et al. (2018). For example, Abdolmaleki et al. (2018) introduced maximum a posteriori policy optimization from a variational perspective. TRPO Schulman et al. (2015) is closely related to this line of work in that it uses a KL divergence regularizer to stabilize standard RL objectives. PPO Schulman et al. (2017) further proposed a practical clipped surrogate objective that emulates the regularization. Our approach draws on connections to this research, particularly the variational perspective and PPO, to improve GAN training.
Other related work. Importance reweighting has been adopted in different problems, such as improving variational autoencoders Burda et al. (2015), debiasing generative models Grover et al. (2019), and learning knowledge constraints Hu et al. (2018b). Our derivation from the variational perspective leads to reweighting in the new context of discriminator stabilization.
3 Improving GAN Training
3.1 Motivations
Our approach is motivated by connecting GAN training with the well-established RL-as-inference methods Abdolmaleki et al. (2018); Levine (2018) under a variational perspective. The connections enable us to augment GAN training with existing powerful probabilistic inference tools as well as draw inspiration from the rich RL literature for stable training. In particular, the connection to the popular TRPO Schulman et al. (2015) and PPO Schulman et al. (2017) yields the probability ratio clipping in generator training that avoids destructive updates (Sec. 3.2), and the application of importance sampling estimation gives rise to sample reweighting for adaptive discriminator updates (Sec. 3.3). The full training procedure is then summarized in Alg.1.
Specifically, as described in Sec. 2, the conventional formulation, e.g., WGAN Arjovsky et al. (2017), for updating the generator maximizes the expected discriminator score: $\max_\theta \mathbb{E}_{x \sim p_\theta}[f_\phi(x)]$, where $p_\theta$ is the generator distribution and $f_\phi$ is the Lipschitz-continuous discriminator parameterized with $\phi$. The objective straightforwardly relates to policy optimization in RL by seeing $p_\theta$ as a policy and $f_\phi$ as a reward function. Thus, inspired by the probabilistic inference formulations of policy optimization Abdolmaleki et al. (2018); Ding and Soricut (2017); Hu et al. (2018b), here we transform the conventional objective by introducing a non-parametric auxiliary distribution $q(x)$ and defining a new variational objective:
$$\mathcal{L}(\theta, q) = \mathbb{E}_{x \sim q}[f_\phi(x)] - \mathrm{KL}\big(q(x) \,\|\, p_\theta(x)\big), \tag{2}$$
where KL is the KL divergence. Intuitively, we are maximizing the expected discriminator score of the auxiliary $q$ (instead of the generator $p_\theta$), while encouraging the generator to stay close to $q$.
As we shall see in more detail shortly, the new formulation allows us to take advantage of off-the-shelf inference methods, which naturally leads to new components that improve GAN training. In particular, the above objective can be maximized with the expectation maximization (EM) algorithm Neal and Hinton (1998), which alternately optimizes $q$ at the E-step and $\theta$ at the M-step. More specifically, at each iteration $t$, given the current generator parameters $\theta = \theta^{(t)}$, the E-step that maximizes $\mathcal{L}(\theta, q)$ w.r.t. $q$ has a closed-form solution:
$$q^{(t)}(x) = \frac{p_{\theta^{(t)}}(x) \exp\{f_\phi(x)\}}{Z_\phi}, \tag{3}$$
where $Z_\phi = \int_x p_{\theta^{(t)}}(x) \exp\{f_\phi(x)\}$ is the normalization term depending on the discriminator parameters $\phi$. We elaborate on the M-step in the following, where we continue to develop the practical procedures for updating the generator and the discriminator, respectively.
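To make the closed-form E-step concrete, the sketch below checks it numerically on a three-point discrete space. It assumes the objective has the form "expected score under $q$ minus $\mathrm{KL}(q \,\|\, p)$" and that the maximizer is proportional to $p(x)\exp\{f(x)\}$; the probabilities, scores, and function names are made up for illustration.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def variational_objective(q, p, f):
    """E_q[f] - KL(q || p), the quantity maximized over q in the E-step."""
    return sum(qi * fi for qi, fi in zip(q, f)) - kl(q, p)

def e_step(p, f):
    """Closed-form maximizer q(x) = p(x) exp(f(x)) / Z, returned together with Z."""
    unnormalized = [pi * math.exp(fi) for pi, fi in zip(p, f)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z

p_old = [0.5, 0.3, 0.2]   # hypothetical current generator distribution
f = [-1.0, 0.0, 2.0]      # hypothetical discriminator scores

q_star, z = e_step(p_old, f)
best = variational_objective(q_star, p_old, f)
at_generator = variational_objective(p_old, p_old, f)
at_uniform = variational_objective([1 / 3, 1 / 3, 1 / 3], p_old, f)
```

At the maximizer the objective equals $\log Z$, and any other choice of $q$ (here the generator itself and the uniform distribution) scores lower.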
3.2 Generator Training with Probability Ratio Clipping
The M-step optimizes $\mathcal{L}(\theta, q)$ w.r.t. $\theta$, which is equivalent to minimizing the KL divergence term in Eq.(2). However, since the generator $p_\theta$ in GANs is often an implicit distribution that does not permit evaluating likelihood, the above KL term (which involves evaluating the likelihood of samples from $q$) is not applicable. We adopt an approximation, which has also been used in the classical wake-sleep algorithm Hinton et al. (1995) and recent work Hu et al. (2018b), by minimizing the reverse KL divergence as below. With Eq.(3) plugged in, we have:
$$\min_\theta \mathrm{KL}\big(p_\theta \,\|\, q^{(t)}\big) = \min_\theta \; -\mathbb{E}_{x \sim p_\theta}[f_\phi(x)] + \mathrm{KL}\big(p_\theta \,\|\, p_{\theta^{(t)}}\big) + \log Z_\phi, \tag{4}$$
where $\log Z_\phi$ does not depend on $\theta$.
As proven in the appendix, approximating with the reverse KL does not change the optimization problem. The first term on the right-hand side of the equation recovers the conventional objective of updating the generator. Of particular interest is the second term, which is a new KL regularizer between the generator and its “old” state from the previous iteration. The regularizer discourages the generator from changing too much between updates, which is useful to stabilize the stochastic optimization procedure. The regularization closely resembles that of TRPO/PPO, where a similar KL regularizer is imposed to prevent uncontrolled policy updates and make the policy gradient robust to noise. Sec. 3.4 gives a convergence analysis of the KL-regularized generator updates.
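The decomposition of the reverse KL into an expected-score term and a KL regularizer toward the old generator can be sanity-checked numerically. The sketch below does so on a small discrete space, assuming the auxiliary distribution is proportional to the old generator times the exponentiated score; all numbers and names are illustrative.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

p_old = [0.5, 0.3, 0.2]   # hypothetical generator distribution at iteration t
f = [-1.0, 0.0, 2.0]      # hypothetical discriminator scores

# Auxiliary distribution q_t(x) = p_old(x) exp(f(x)) / Z
unnormalized = [p * math.exp(s) for p, s in zip(p_old, f)]
z = sum(unnormalized)
q_t = [u / z for u in unnormalized]

def reverse_kl(p_new):
    """Left-hand side: KL(p_new || q_t)."""
    return kl(p_new, q_t)

def decomposed(p_new):
    """Right-hand side: -E_{p_new}[f] + KL(p_new || p_old) + log Z."""
    expected_score = sum(p * s for p, s in zip(p_new, f))
    return -expected_score + kl(p_new, p_old) + math.log(z)

p_candidate = [0.2, 0.3, 0.5]
lhs = reverse_kl(p_candidate)
rhs = decomposed(p_candidate)
```

Because $\log Z$ is constant in the new generator, minimizing the left-hand side is the same as maximizing the expected score while penalizing divergence from the old generator.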
In practice, directly optimizing with the KL regularizer can be infeasible due to the same difficulty with the implicit distribution as above. Fortunately, PPO Schulman et al. (2017) has presented a simplified solution that emulates the regularized updates using a clipped surrogate objective, which is widely used in RL. We adapt the solution to our context, leading to the following practical procedure for generator updates.
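Probability Ratio Clipping. Let $r_t(\theta) = \frac{p_\theta(x)}{p_{\theta^{(t)}}(x)}$ denote the probability ratio which measures the difference between the new and old distributions. We have $r_t(\theta^{(t)}) = 1$. The clipped surrogate objective for updating the generator, as adapted from PPO, is:
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_{x \sim p_\theta}\Big[\min\big(r_t(\theta)\, f_\phi(x),\; r_t^{\text{clip}}(\theta)\, f_\phi(x)\big)\Big], \tag{5}$$
where $r_t^{\text{clip}}(\theta) = \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ clips the probability ratio, so that moving $r_t(\theta)$ outside of the interval $[1-\epsilon, 1+\epsilon]$ is discouraged. Taking the minimum puts a ceiling on the increase of the objective. Thus the generator does not benefit from moving far away from the old generator.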
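The clipped surrogate can be sketched in a few lines. The example below follows the PPO-style form (minimum of the unclipped and clipped ratio times the score) on plain Python lists; the function names and the clipping width `eps=0.2` are illustrative assumptions, not values taken from this paper.

```python
def clip(value, low, high):
    """Clamp value into the interval [low, high]."""
    return max(low, min(high, value))

def clipped_surrogate(ratios, scores, eps=0.2):
    """Batch average of min(r * f, clip(r, 1 - eps, 1 + eps) * f).

    ratios: probability ratios between the new and old generator, one per sample
    scores: discriminator scores for the same samples
    """
    terms = [
        min(r * s, clip(r, 1.0 - eps, 1.0 + eps) * s)
        for r, s in zip(ratios, scores)
    ]
    return sum(terms) / len(terms)

# Positive score: raising the ratio beyond 1 + eps brings no further gain.
gain_capped = clipped_surrogate([1.5], [1.0])        # min(1.5, 1.2)
# Negative score: lowering the ratio below 1 - eps brings no further gain.
loss_capped = clipped_surrogate([0.5], [-1.0])       # min(-0.5, -0.8)
# Moves that make the objective worse are not clipped.
penalized = clipped_surrogate([1.5], [-1.0])         # min(-1.5, -1.2)
```

The minimum makes the bound one-sided: improvements are capped once the ratio leaves the interval, whereas changes that lower the objective are still fully counted.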
Finally, to estimate the probability ratio $r_t(\theta)$ when $p_\theta$ is implicit, we use an efficient approximation similar to Che et al. (2017); Grover et al. (2019) by introducing a binary classifier $C$ trained to distinguish real and generated samples. Assuming an optimal $C$ Goodfellow et al. (2014); Che et al. (2017) which has $p_{\theta^{(t)}}(x) = \frac{1 - C(x)}{C(x)}\, p_r(x)$, we can approximate $r_t(\theta)$ through:
(6)
Note that after plugging the rightmost expression into Eq.(5), the gradient can propagate through $C$ to $\theta$ since each sample is produced by the generator, $x = G_\theta(z)$. In practice, during the phase of generator training, we maintain $C$ by fine-tuning it for one iteration every time after $\theta$ is updated (Alg.1). Thus the maintenance of $C$ is cheap. We give more details of $C$ in the appendix.
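The classifier-based estimate rests on a standard identity: a Bayes-optimal real-versus-generated classifier with equal class priors encodes the density ratio between the two distributions. The sketch below verifies that identity with closed-form Gaussian densities. It is only an illustration under assumed names and toy densities; in particular, forming the new-to-old ratio as a quotient of two classifier-based ratios is one possible construction and is not claimed to be the exact expression of Eq.(6).

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def optimal_classifier(x, p_real, p_gen):
    """Bayes-optimal probability that x is real, assuming equal class priors."""
    return p_real(x) / (p_real(x) + p_gen(x))

def gen_over_real(c):
    """Recover p_gen(x) / p_real(x) from the classifier output c = C(x)."""
    return (1.0 - c) / c

p_real = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_old = lambda x: gaussian_pdf(x, 1.0, 1.0)   # hypothetical generator before the update
p_new = lambda x: gaussian_pdf(x, 0.8, 1.0)   # hypothetical generator after the update

x = 0.7
c_old = optimal_classifier(x, p_real, p_old)
c_new = optimal_classifier(x, p_real, p_new)

# The unknown real density cancels in the quotient of the two ratios.
ratio_estimate = gen_over_real(c_new) / gen_over_real(c_old)
ratio_exact = p_new(x) / p_old(x)
```

In practice the classifier is a learned network and only approximately optimal, so the estimated ratio is itself an approximation.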
3.3 Discriminator Training with Sample Reweighting
We next discuss the training of the discriminator $f_\phi$, where we augment the conventional training with an importance weighting mechanism for adaptive updates. Concretely, given the form of the auxiliary distribution solution $q$ in Eq.(3), we first draw from the recent energy-based modeling work Kim and Bengio (2016); Hu et al. (2018b); Deng et al. (2020); Che et al. (2020) and propose to maximize the data log-likelihood of $q$ w.r.t. $\phi$. By taking the gradient, we have:
$$\nabla_\phi\, \mathbb{E}_{x \sim p_r}[\log q(x)] = \mathbb{E}_{x \sim p_r}[\nabla_\phi f_\phi(x)] - \mathbb{E}_{x \sim q}[\nabla_\phi f_\phi(x)]. \tag{7}$$
We can observe that the resulting form resembles the conventional one (Sec. 2), as we are essentially maximizing $f_\phi$ on real data while minimizing $f_\phi$ on fake samples. An important difference is that here fake samples are drawn from the auxiliary distribution $q$ instead of the generator $p_\theta$. This difference leads to the new sample reweighting component as below. Note that, as in WGAN (Sec. 2), we maintain $f_\phi$ to be from the class of 1-Lipschitz functions, which is necessary for the convergence analysis in Sec. 3.4. In practice, we can use a gradient penalty following Gulrajani et al. (2017); Wei et al. (2018).
Sample Reweighting. We use importance sampling to estimate the expectation under $q$ in Eq.(7). Given the multiplicative form of $q$ in Eq.(3), similar to Abdolmaleki et al. (2018); Hu et al. (2018b); Deng et al. (2020), we use the generator $p_\theta$ as the proposal distribution. This leads to
$$\mathbb{E}_{x \sim q}[\nabla_\phi f_\phi(x)] = \mathbb{E}_{x \sim p_\theta}\Big[\frac{\exp\{f_\phi(x)\}}{Z_\phi}\, \nabla_\phi f_\phi(x)\Big]. \tag{8}$$
That is, fake samples from the generator are weighted by the exponentiated discriminator score when used to update the discriminator. Intuitively, the mechanism assigns higher weights to samples that can fool the discriminator better, while low-quality samples are down-weighted to avoid degrading the discriminator's performance. It is worth mentioning that a similar importance weighting scheme has been used in Che et al. (2017); Hu et al. (2018a) for generator training in GANs, and in Burda et al. (2015) for improving variational autoencoders. Our work instead results in a reweighting scheme in the new context of discriminator training.
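A minimal sketch of the reweighting step is given below. It assumes that the normalization term is estimated from the current batch, so that the weights become a softmax over the fake samples' discriminator scores; that self-normalization, the function names, and the toy scores are assumptions for illustration rather than details confirmed by the text.

```python
import math

def reweight(fake_scores):
    """Self-normalized importance weights w_i proportional to exp(f(x_i)).

    Subtracting the maximum score before exponentiating avoids overflow.
    """
    m = max(fake_scores)
    exps = [math.exp(s - m) for s in fake_scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_critic_loss(real_scores, fake_scores):
    """Loss to minimize: -(mean real score) + (weighted fake score).

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); here everything is a plain float.
    """
    weights = reweight(fake_scores)
    real_term = sum(real_scores) / len(real_scores)
    fake_term = sum(w * s for w, s in zip(weights, fake_scores))
    return -real_term + fake_term

weights = reweight([0.0, math.log(3.0)])   # second sample fools the critic more
loss = reweighted_critic_loss([1.0, 3.0], [0.0, math.log(3.0)])
```

With equal scores the weights reduce to the uniform average used in conventional training; the more the scores differ, the more the update concentrates on the higher-scoring fake samples.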
Alg.1 summarizes the proposed training procedure for the generator and discriminator.
3.4 Theoretical Analysis
In this section, we show that the generator distribution under our training approach can converge to the real data distribution given an optimal discriminator. The analysis is based on the reverse KL updates for the generator (Eq.(4)), while the probability ratio clipping serves as a practical emulation of these updates. We begin by adapting Proposition 1 in Gulrajani et al. (2017) to our problem:
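Proposition 3.1.
Let $p_r$ and $p_\theta$ be two distributions in $\mathcal{X}$, a compact metric space. Then, there is a 1-Lipschitz function $f^*$ which is the optimal solution of
$$\max_{\|f\|_L \le 1} \; \mathbb{E}_{y \sim p_r}[f(y)] - \mathbb{E}_{x \sim p_\theta}[f(x)].$$
Let $\pi$ be the optimal coupling between $p_r$ and $p_\theta$, defined as the minimizer of: $W(p_r, p_\theta) = \inf_{\pi \in \Pi(p_r, p_\theta)} \mathbb{E}_{(x, y) \sim \pi}[\|x - y\|]$, where $\Pi(p_r, p_\theta)$ is the set of joint distributions $\pi(x, y)$ whose marginals are $p_r$ and $p_\theta$, respectively. Then, if $f^*$ is differentiable, $\pi(x = y) = 0$, and $x_t = t x + (1 - t) y$ with $0 \le t \le 1$, it holds that $\mathbb{P}_{(x, y) \sim \pi}\Big[\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|}\Big] = 1$.
The proposition shows that the optimal $f^*$ provides informative gradient Zhou et al. (2018) from $p_\theta$ towards $p_r$. We then generalize the conclusion to by considering correlation between and .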
4 Experiments
We conduct extensive experiments on three unsupervised generation tasks: image generation, text generation, and text style transfer. The three tasks apply GANs to model different data modalities, namely image, text, and neural hidden representations. Our approach consistently improves over the state of the art on all three tasks. We present more experimental details in the appendix. Our code is included in the supplementary materials and will be released upon acceptance.
4.1 Image generation
We first use the popular CIFAR-10 benchmark for evaluation and in-depth analysis of our approach.
Setup. CIFAR-10 Krizhevsky and Hinton (2010) contains 50K images of size $32 \times 32$. Following the setup in CT-GAN Wei et al. (2018) and its public implementation (https://github.com/igul222/improved_wgan_training), we use a residual architecture to implement both generator and discriminator, and also impose a Lipschitz constraint on the discriminator. For each iteration, we update both generator and discriminator 5 times. We use Inception Score (IS) Salimans et al. (2016) and Frechet Inception Distance (FID) Heusel et al. (2017) as our evaluation metrics. Among them, IS evaluates the quality and diversity of generated images, and FID captures model issues, e.g., mode collapse Xu et al. (2018).
Results. Table 1 reports the results on CIFAR-10. For the four latest methods, SN-GANs Miyato et al. (2018) introduced spectral normalization to stabilize the discriminator training; WGAN-ALP Terjék (2020) developed an explicit Lipschitz penalty; SRN-GAN Sanyal et al. (2020) introduced a weight-normalization scheme for generalization; and AutoGAN Gong et al. (2019) incorporated neural architecture search for the generator architecture. From Table 1, one can observe that our full approach (CT-GAN + discriminator sample reweighting + generator probability ratio clipping) achieves the best performance, with both IS and FID significantly surpassing the baselines. These results accord with the visual results in Figure 2, where our generated samples show higher visual quality than those of the baselines. Moreover, the comparison between CT-GAN and our approach with only discriminator reweighting shows significant improvement. Further adding the probability ratio clipping to arrive at our full approach improves both IS and FID by a large margin. The results demonstrate the effectiveness of the two components in our approach.
Figure 1 in Sec. 1 has shown the effects of the proposed approach in stabilizing the generator and discriminator training. Here we further analyze these two components. Figure 3 (left) shows the convergence curves of different GAN methods. For a fair comparison, all models use the same DCGAN architecture Radford et al. (2015), and both our approach and WGAN-GP Gulrajani et al. (2017) enforce the same discriminator Lipschitz constraint. From Figure 3 (left), one can observe that our full approach surpasses our approach with only sample reweighting, and both converge faster and achieve a higher IS than WGAN-GP and DCGAN. Figure 3 (right) further examines how the reweighting of fake samples affects discriminator training. It shows that by injecting sample reweighting into WGAN-GP, its gradients on fake samples become more stable and show much lower variance, which also partially explains the higher training stability of the discriminator in Figure 1.
4.2 Text Generation
In this section, we evaluate our approach on text generation, a task that is known to be notoriously difficult for GANs due to the discrete and sequential nature of text.
Setup. We implement our approach based on the RelGAN Nie et al. (2018) architecture, a state-of-the-art GAN model for text generation. Specifically, we replace the generator and discriminator objectives in RelGAN with ours. We follow WGAN-GP Gulrajani et al. (2017) and impose the discriminator Lipschitz constraint with a gradient penalty. As in Nie et al. (2018), we use the Gumbel-softmax approximation Jang et al. (2017); Maddison et al. (2017) on the discrete text to enable gradient backpropagation, and the generator is initialized with maximum likelihood (MLE) pre-training. Our implementation is based on the public PyTorch code of RelGAN (https://github.com/williamSYSU/TextGANPyTorch). As in previous studies Nie et al. (2018); Guo et al. (2018); Yu et al. (2017), we evaluate our approach on both synthetic and real text datasets.
Results on Synthetic Data. The synthetic data consists of 10K discrete sequences generated by an oracle LSTM with fixed parameters Yu et al. (2017). This setup facilitates evaluation, as the quality of generated samples can be directly measured by the negative log-likelihood (NLL) of the oracle on the samples. We use synthetic data with sequence lengths 20 and 40, respectively. Table 2 reports the results. MLE is the baseline with maximum likelihood training, whose output model is used to initialize the generators of GANs. Besides the previous text generation GANs Yu et al. (2017); Guo et al. (2018); Nie et al. (2018), we also compare with WGAN-GP, which uses the same neural architecture as RelGAN and ours. From Table 2, one can observe that our approach significantly outperforms all other approaches on both synthetic sets. Our improvement over RelGAN and WGAN-GP demonstrates that our proposed generator and discriminator objectives are more effective than the previous ones.
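| Length | MLE | SeqGAN Yu et al. (2017) | LeakGAN Guo et al. (2018) | RelGAN Nie et al. (2018) | WGAN-GP Gulrajani et al. (2017) | Ours | Real |
|---|---|---|---|---|---|---|---|
| 20 | 9.038 | 8.736 | 7.038 | 6.680 | 6.89 | 5.67 | 5.750 |
| 40 | 10.411 | 10.310 | 7.191 | 6.765 | 6.78 | 6.14 | 4.071 |

Table 2: Oracle negative log-likelihood (NLL) on synthetic data with sequence lengths 20 and 40.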
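The oracle-based evaluation can be sketched as follows. The experiments use an oracle LSTM; the toy below swaps in a first-order Markov chain as the oracle purely to stay self-contained, and it reports the summed per-sequence NLL averaged over samples (whether the reported numbers are normalized per token is not stated here, so that choice is an assumption). All names and probabilities are hypothetical.

```python
import math

def sequence_nll(tokens, initial, transition):
    """Negative log-likelihood of one token sequence under a Markov-chain oracle."""
    nll = -math.log(initial[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        nll -= math.log(transition[prev][cur])
    return nll

def oracle_nll(samples, initial, transition):
    """Average sequence NLL of generated samples under the oracle (lower is better)."""
    return sum(sequence_nll(s, initial, transition) for s in samples) / len(samples)

# Toy oracle over a two-token vocabulary that strongly prefers repeating tokens.
initial = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.1, 0.9]]

likely_samples = [[0, 0, 0], [1, 1, 1]]     # agree with the oracle's preference
unlikely_samples = [[0, 1, 0], [1, 0, 1]]   # contradict it

nll_likely = oracle_nll(likely_samples, initial, transition)
nll_unlikely = oracle_nll(unlikely_samples, initial, transition)
```

Samples that follow the oracle's dynamics obtain a lower NLL, which is why the metric can serve as a direct measure of sample quality in the synthetic setting.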
| Method | BLEU-2 (↑) | BLEU-3 (↑) | BLEU-4 (↑) | BLEU-5 (↑) | NLL (↓) |
|---|---|---|---|---|---|
| MLE | 0.768 | 0.473 | 0.240 | 0.126 | 2.382 |
| SeqGAN Yu et al. (2017) | 0.777 | 0.491 | 0.261 | 0.138 | 2.773 |
| LeakGAN Guo et al. (2018) | 0.826 | 0.645 | 0.437 | 0.272 | 2.356 |
| RelGAN (100) Nie et al. (2018) | 0.881 | 0.705 | 0.501 | 0.319 | 2.482 |
| RelGAN (1000) Nie et al. (2018) | 0.837 | 0.654 | 0.435 | 0.265 | 2.285 |
| WGAN-GP Gulrajani et al. (2017) | 0.872 | 0.636 | 0.379 | 0.220 | 2.209 |
| Ours | 0.905 | 0.692 | 0.470 | 0.322 | 2.265 |

Table 3: Results on EMNLP2017 WMT News, where RelGAN (100) and RelGAN (1000) use different hyperparameters as reported in the paper.
Results on Real Data. We then evaluate our method on the EMNLP2017 WMT News, the largest real text data used for text GAN studies Guo et al. (2018); Nie et al. (2018). The dataset consists of 270K/10K training/test sentences with a maximum length of 51 and a vocabulary size of 5,255. To measure the generation quality, we use the popular BLEU-$n$ metric, which measures $n$-gram overlap between generated and real text ($n = 2, 3, 4, 5$). To evaluate the diversity of generation, we use the negative log-likelihood of the generator on the real test set (NLL) Guo et al. (2018); Nie et al. (2018). From the results in Table 3, one can see that our approach shows comparable performance with the previous best model RelGAN (100) in terms of text quality (BLEU), but has better sample diversity. Our model also achieves much higher BLEU scores than WGAN-GP (e.g., 0.322 vs. 0.220 on BLEU-5), demonstrating its ability to generate higher-quality samples.
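For readers unfamiliar with BLEU, the core quantity is the modified n-gram precision between a generated sentence and one or more reference sentences. The sketch below computes only that precision; the full BLEU score additionally combines several n-gram orders and applies a brevity penalty, and the exact evaluation script behind Table 3 is not specified here. Function names and example sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against a list of references."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
degenerate = "the the the the the the the".split()

unigram_precision = modified_precision(candidate, [reference], 1)
bigram_precision = modified_precision(candidate, [reference], 2)
degenerate_precision = modified_precision(degenerate, [reference], 1)
```

Clipping each n-gram count by its count in the reference prevents a degenerate output that repeats one frequent word from scoring well.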
4.3 Text Style Transfer
We further apply our approach to the text style transfer task, which is gaining increasing attention in NLP Hu et al. (2017); Shen et al. (2017). The task aims at rewriting a sentence to modify its style (e.g., sentiment) while preserving the content. Previous work applies GANs on neural hidden states to learn disentangled representations Shen et al. (2017); Tikhonov et al. (2019). The task can thus serve as a good benchmark for GANs, as hidden state modeling provides a new modality that differs from the image and text modeling studied above.
Setup. We follow the same experimental setting and use the same model architecture as the latest work Tikhonov et al. (2019). In particular, Tikhonov et al. (2019) extended the variational autoencoder based model Hu et al. (2017); Kingma and Welling (2013) by adding a latent code discriminator which eliminates stylistic information in the latent code. We replace their adversarial objectives with our proposed ones, and impose the discriminator Lipschitz constraint with a gradient penalty Gulrajani et al. (2017). Our implementation is based on the public code (https://github.com/VAShibaev/text_style_transfer) released in Tikhonov et al. (2019). We test our approach on sentiment transfer, in which the sentiment (positive or negative) is treated as the style of the text. We use the standard Yelp review dataset (www.yelp.com/dataset), and the human-written output text provided by Li et al. (2018) as the ground truth for evaluation.
Results. Following the previous work Tikhonov et al. (2019), we first report the BLEU score that measures the similarity of the generated samples against the human-written text. Table 4 shows that our approach achieves the best performance, improving over the state-of-the-art BLEU result of Tikhonov et al. (2019).
The second widely used evaluation method is to measure (1) the style accuracy by applying a pretrained style classifier on generated text, and (2) the content preservation by computing the BLEU score between the generated text and the original input text (BLEUX). There is often a trade-off between the two metrics. Figure 4 displays the trade-off achieved by different models. Our results lie in the top-right corner, indicating that our approach achieves the best overall style-content trade-off.
5 Conclusion
We have presented a new training framework for GANs that is derived from a new variational perspective and draws on rich connections with RL-as-inference. This results in probability ratio clipping for generator updates to discourage overly large changes, and fake sample reweighting for discriminator updates to stabilize training. Experiments show that our approach achieves better results than previous best methods on image generation, text generation, and text style transfer. Our approach also shows more stable training. We are interested in exploring more connections between GANs and other learning paradigms to inspire more techniques for improved GAN training.
References
 [1] (2018) Maximum a posteriori policy optimisation. In ICLR, Cited by: §1, §2, §3.1, §3.1, §3.3.

[2]
(2017)
Wasserstein generative adversarial networks.
In
International Conference on Machine Learning
, pp. 214–223. Cited by: §1, §2, §3.1, §6.3.2, §6.3.3, §6.3.4.  [3] (2017) Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §1.
 [4] (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
 [5] (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: §1, §2, §3.3.
 [6] (2020) Language gans falling short. In ICLR, Cited by: §1.
 [7] (2017) Maximumlikelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983. Cited by: §3.2, §3.3.
 [8] (2020) Your GAN is secretly an energybased model and you should use discriminator driven latent sampling. arXiv preprint arXiv:2003.06060. Cited by: §3.3.
 [9] (2020) Smoothness and stability in gans. In ICLR, Cited by: §1.
 [10] (1997) Using expectationmaximization for reinforcement learning. Neural Computation 9 (2), pp. 271–278. Cited by: §2.
 [11] (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §2.
 [12] (2020) Residual energybased models for text generation. In ICLR, Cited by: §1, §3.3, §3.3.
 [13] (2017) Coldstart reinforcement learning with softmax policy gradient. In NeurIPS, Cited by: §3.1.
 [14] (2018) MaskGAN: better text generation via filling in the_. In ICLR, Cited by: §1.

[15]
(2019)
Autogan: neural architecture search for generative adversarial networks.
In
Proceedings of the IEEE International Conference on Computer Vision
, pp. 3224–3234. Cited by: §4.1, §4.  [16] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2, §3.2.
 [17] (2019) Bias correction of learned generative models using likelihoodfree importance weighting. In NeurIPS, Cited by: §1, §1, §2, §3.2.
 [18] (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1, §2, §3.3, §3.4, §4.1, §4.2, §4.3, Table 2, Table 3, §4, §6.3.2, 4.

[19]
(2018)
Long text generation via adversarial training with leaked information.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, Cited by: §4.2, §4.2, §4.2, Table 2, Table 3.  [20] (2017) Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.1.
 [21] (1995) The" wakesleep" algorithm for unsupervised neural networks. Science 268 (5214), pp. 1158–1161. Cited by: §3.2.
 [22] (2019) Texar: a modularized, versatile, and extensible toolkit for text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 159–164. Cited by: §6.3.4.
 [23] (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1587–1596. Cited by: §4.3, §4.3.
 [24] (2018) On unifying deep generative models. In ICLR, Cited by: §3.3.
 [25] (2018) Deep generative models with learnable knowledge constraints. In Advances in Neural Information Processing Systems, pp. 10501–10512. Cited by: §1, §2, §3.1, §3.2, §3.3, §3.3, §6.2.
 [26] (2017) Categorical reparameterization with gumbelsoftmax. In ICLR, Cited by: §4.2.
 [27] (2016) Deep directed generative models with energybased probability estimation. arXiv preprint arXiv:1606.03439. Cited by: §3.3.
 [28] (2013) Autoencoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.3.

[29]
(2010)
Convolutional deep belief networks on cifar10
. Unpublished manuscript 40 (7), pp. 1–9. Cited by: §4.1.  [30] (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §1, §2, §3.1.
 [31] (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of NAACLHLT, pp. 1865–1874. Cited by: §4.3.

[32]
(2017)
The concrete distribution: a continuous relaxation of discrete random variables
. Cited by: §4.2.  [33] (2015) Deep multiscale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §1.
 [34] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §4.1, §4, §6.3.2.
 [35] (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp. 355–368. Cited by: §3.1.
 [36] (2018) Relgan: relational generative adversarial networks for text generation. Cited by: §1, §4.2, §4.2, §4.2, Table 2, Table 3.
 [37] (2017) FGANs in an information geometric nutshell. In Advances in Neural Information Processing Systems, pp. 456–464. Cited by: §1.
 [38] (2016) Fgan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §1.
 [39] (2017) Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2642–2651. Cited by: §6.3.4.
 [40] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1, Figure 3, §4.1, §6.3.2.
 [41] (2013) On stochastic optimal control and reinforcement learning by approximate inference. In IJCAI, Cited by: §2.
 [42] (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §4.1.
 [43] (2020) Stable rank normalization for improved generalization in neural networks and gans. In ICLR, Cited by: §4.1, §4.
 [44] (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §1, §2, §3.1.
 [45] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §2, §3.1, §3.2.
 [46] (2017) Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6830–6841. Cited by: §4.3.
 [47] (2018) Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552. Cited by: §4.3.
 [48] (2020) Adversarial Lipschitz regularization. In ICLR. Cited by: §4.1, §4.
 [49] (2018) Structured content preservation for unsupervised text style transfer. arXiv preprint arXiv:1810.06526. Cited by: §4.3.
 [50] (2019) Style transfer for texts: to err is human, but error margins matter. In EMNLP. Cited by: Figure 4, §4.3, §4.3, §4.3, §4.3, Figure 7, §6.3.4.
 [51] (2018) Improving the improved training of wasserstein GANs: a consistency term and its dual effect. arXiv preprint arXiv:1803.01541. Cited by: §1, §2, §3.3, §4.1, §4, §6.3.2, §6.3.2.
 [52] (2018) An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755. Cited by: §4.1.
 [53] (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §4.2, §4.2, Table 2, Table 3.
 [54] (2018) Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894. Cited by: §4.3.
 [55] (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §1.
 [56] (2019) Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687. Cited by: §2.
 [57] (2018) Understanding the effectiveness of Lipschitz-continuity in generative adversarial nets. arXiv preprint arXiv:1807.00751. Cited by: §2, §3.4, §6.2.
6 Appendix
6.1 Proof on the equivalence between Reverse KL Divergence and KL Divergence
We prove that optimizing is equivalent to optimizing . This provides a guarantee for the approximation that leads to (4).
Claim: Under the assumption that Lipschitz, is bounded because the input is bounded. Let be the Lipschitz constant of , and let
(9) 
We then show that differ by a constant. Since the function is lower- and upper-bounded, there exists , such that for any bounded.
(10) 
where ① plugs in ; ② uses the fact ; ③ uses . The above claim completes the theoretical guarantee on the reverse-KL approximation in (4).
6.2 Proof on the necessity of Lipschitz constraint on the discriminator
Although [25] shows preliminary connections between PR and GANs, the proposed PR framework does not provide an informative gradient to the generator when treated as a GAN loss. Following [57], we consider the training problem when the discriminator (i.e. here) is optimal: in that case, the gradient of the generator is which could be very small due to vanished . As a result, it is hard to push the generated data distribution towards the target real distribution . This problem also exists in the discriminator loss because
(11) 
So if and are disjoint, we have
(12) 
Note that for any , is not related to and thus its gradient is not related to either. Similarly, for any , does not provide any information about . Therefore, the proposed loss in [25] cannot guarantee an informative gradient [57] that pushes or towards .
6.3 Experiments: More Details and Results
6.3.1 Binary classifier for probability ratio clipping
For both image generation and text generation, the binary classifier 6 has the same architecture as the discriminator, except for an additional softmax activation at the output layer. The binary classifier is trained with real and fake mini-batches alongside the generator, and requires no additional loops.
In addition, for image generation, we observe similar overall performance whether the classifier is trained on raw inputs from the generator/dataset or on input features from the first residual block of the discriminator (). Using the latter further reduces the computational overhead of the binary classifier.
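To make the role of the classifier concrete, the following is a minimal sketch of the standard density-ratio trick, which turns a binary classifier's output into an estimate of a probability ratio. It assumes the two classes are balanced; the function names are hypothetical, and the mapping of the two classes onto the distributions used for ratio clipping follows the main text rather than this sketch.

```python
import math


def ratio_from_probability(prob_a: float, eps: float = 1e-6) -> float:
    """Estimate p_a(x) / p_b(x) from a classifier's probability that x came from a.

    Assumes balanced classes. `eps` keeps the estimate finite when the
    classifier saturates at 0 or 1.
    """
    p = min(max(prob_a, eps), 1.0 - eps)
    return p / (1.0 - p)


def ratio_from_logits(logit_a: float, logit_b: float) -> float:
    """Same estimate computed from two-class softmax logits.

    With softmax outputs, p_a / (1 - p_a) simplifies to exp(logit_a - logit_b).
    """
    return math.exp(logit_a - logit_b)
```

An undecided classifier (probability 0.5, or equal logits) gives a ratio of 1, meaning the two distributions are indistinguishable at that sample.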
6.3.2 Image Generation on CIFAR-10
We translate the code (github.com/biuyq/CT-GAN) provided by Wei et al. [51] into PyTorch to conduct our experiments. We use the same architecture, a residual architecture for both the generator and the discriminator, and enforce the Lipschitz constraint on the discriminator in the same way as CT-GAN [51]. During training, we interleave 5 generator iterations with 5 discriminator iterations. We optimize the generator and the discriminator with Adam (generator lr: , discriminator lr: , betas: ). We set the clipping threshold for the surrogate loss and we linearly anneal the learning rate with respect to the number of training epochs.
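The linear learning-rate annealing mentioned above can be sketched as follows. This is an illustrative schedule only: the function name is hypothetical, and both the base learning rate used in the usage note and the final value of zero are assumptions, not settings reported here.

```python
def linearly_annealed_lr(base_lr: float, epoch: int, total_epochs: int,
                         final_lr: float = 0.0) -> float:
    """Interpolate linearly from base_lr (epoch 0) to final_lr (last epoch).

    final_lr=0.0 is an assumed endpoint, not a value from the paper.
    """
    if total_epochs <= 0:
        raise ValueError("total_epochs must be positive")
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    return base_lr + (final_lr - base_lr) * frac
```

For example, with a placeholder base rate of 1e-4 and 100 epochs, the schedule returns 1e-4 at epoch 0, half of that at epoch 50, and 0 at epoch 100.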
Discriminator sample reweighting stabilizes DCGAN
We quantitatively evaluate the effect of discriminator reweighted sampling by comparing DCGAN [40] against DCGAN with discriminator reweighting. Starting from the DCGAN architecture and hyperparameters, we run 200 random configurations of learning rate, batch size, non-linearity (ReLU/LeakyReLU), and base filter count (32, 64). Results are summarized in Table 5. DCGANs trained with reweighted sampling have a significantly lower collapse rate and achieve better overall performance in terms of Inception Score (IS). These results demonstrate the effectiveness of the proposed discriminator reweighted sampling mechanism.

Method               | Collapse rate | Avg IS | Best IS
DCGAN                | 52.4%         | 4.2    | 6.1
DCGAN + Reweighting  | 30.2%         | 5.1    | 6.7
Discriminator reweighted samples
To illustrate how discriminator weights help the discriminator concentrate on fake samples of better quality during training, Figure 5 plots the fake samples of a trained ResNet model alongside their corresponding discriminator weights.
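As a rough illustration of how per-sample discriminator weights can be formed, the sketch below normalizes critic scores over a batch of fake samples with a softmax, so that higher-scoring samples receive larger weights. The softmax form, the `temperature` parameter, and the function names are assumptions made for this example; the exact weighting used in our method is the one defined in the main text.

```python
import math
from typing import List, Sequence


def softmax_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Normalize critic scores on a fake batch into weights that sum to 1.

    Subtracting the maximum score keeps the exponentials numerically stable.
    `temperature` is a hypothetical knob controlling how peaked the weights are.
    """
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fake_term(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of critic scores over the fake batch."""
    return sum(w * s for w, s in zip(weights, scores))
```

With equal scores the weights reduce to the uniform average, so an unweighted discriminator loss is recovered as a special case.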
Clipped surrogate objective
One unique benefit of the clipped surrogate objective is that it allows our model to obtain an estimate of the effectiveness of the discriminator, which in turn enables us to follow a curriculum that takes more than one generator step per critic step. In practice, setting achieves good quality, which also allows us to take times more generator steps than prior works [2, 18, 51, 34] with the same number of discriminator iterations. Table 1 shows the improvement enabled by applying the surrogate objective.
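For reference, a PPO-style clipped surrogate [45] can be sketched as below. This is a generic illustration rather than our exact generator loss: `signal` stands in for the per-sample learning signal (for example, a discriminator score), the function names are hypothetical, and `eps=0.2` is a placeholder, not the clipping threshold used in our experiments.

```python
from typing import Sequence


def clipped_surrogate(ratio: float, signal: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective (to be maximized) for a single sample.

    ratio  : estimated probability ratio between the new and old generator.
    signal : per-sample learning signal (assumed stand-in for the critic output).
    eps    : clipping threshold (placeholder value).
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum removes any incentive to move the ratio outside
    # [1 - eps, 1 + eps] in the direction that would increase the objective.
    return min(ratio * signal, clipped_ratio * signal)


def batch_surrogate(ratios: Sequence[float], signals: Sequence[float],
                    eps: float = 0.2) -> float:
    """Mean clipped surrogate over a batch."""
    values = [clipped_surrogate(r, s, eps) for r, s in zip(ratios, signals)]
    return sum(values) / len(values)
```

With a positive signal, a ratio of 1.5 is capped at 1.2, so pushing the generator further yields no additional gain; with a negative signal, the unclipped and more pessimistic term is kept.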
Generated samples
Figure 6 shows more image samples generated by our model.
6.3.3 Text Generation
We build upon the PyTorch implementation (github.com/williamSYSU/TextGAN-PyTorch) of RelGAN. We use the exact same model architecture as provided in the code, and enforce the Lipschitz constraint on the discriminator in the same way as in WGAN-GP [2].
During training, we interleave 5 generator iterations with 5 discriminator iterations. We use the Adam optimizer (generator lr: 1e-4, discriminator lr: 3e-4). We set the clipping threshold for the surrogate loss and we linearly anneal the learning rate with respect to the number of training epochs.
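The interleaving of generator and discriminator iterations can be written as a simple schedule. The sketch below is only an illustration: the function name is hypothetical, and starting each block with the generator is an assumption, since the order within a block is not specified here.

```python
from itertools import cycle, islice
from typing import List


def interleaved_schedule(total_iters: int, gen_iters: int = 5,
                         disc_iters: int = 5) -> List[str]:
    """Return the update order as a list of "G" (generator) and "D" (discriminator).

    Each block runs `gen_iters` generator updates followed by `disc_iters`
    discriminator updates; the generator-first order is an assumption.
    """
    block = ["G"] * gen_iters + ["D"] * disc_iters
    return list(islice(cycle(block), total_iters))
```

A training loop would iterate over this list and dispatch to the generator or discriminator update accordingly.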
6.3.4 Text Style Transfer
We build upon the Texar-TensorFlow [22] style-transfer model by Tikhonov et al. [50] (https://github.com/VAShibaev/text_style_transfer). We use the exact same model architecture and hyperparameters as provided in the code, and enforce the Lipschitz constraint on the discriminator in the same way as WGAN-GP [2]. In addition, we replace the discriminator in Figure 7 with our loss, together with an auxiliary linear style classifier as in Odena et al. [39]. We did not apply the surrogate loss to approximate the KL divergence, but relied on gradient clipping on the generator.
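For readers unfamiliar with gradient clipping, the sketch below shows one common variant, clipping by global norm, on a flat list of gradient values. Which clipping variant and threshold the style-transfer code uses is not stated here, so both the global-norm choice and the function name are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def clip_by_global_norm(grads: Sequence[float],
                        max_norm: float) -> Tuple[List[float], float]:
    """Rescale gradients so that their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Gradients already within the limit are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm
```

Like the clipped surrogate, this bounds the size of each generator update, but it acts on the gradient directly rather than on the change in the generator distribution.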