1 Introduction
Deep generative models aim to learn a target distribution and have shown great promise in numerous scenarios, such as image generation (Arjovsky et al., 2017; Goodfellow et al., 2014), density estimation (Ho et al., 2019; Salimans et al., 2017; Kingma and Welling, 2013; Townsend et al., 2019), stylization (Ulyanov et al., 2016), and text generation (Yu et al., 2017; Li et al., 2016). Learning generative models for text data is an important task with significant impact on several real-world applications, e.g., machine translation, literary creation, and article summarization. However, text generation remains challenging due to the discrete nature of the data and the huge sample space, which grows exponentially with sentence length.
Because the sample space is huge and sentences vary in length, current text generation models are mainly based on density factorization rather than on directly modeling the joint distribution, which explains the prevalence of neural autoregressive models in language modeling. Since neural autoregressive models have an explicit likelihood function, it is straightforward to train them with Maximum Likelihood Estimation (MLE). Although MLE is asymptotically consistent, in practical finite-sample scenarios it is prone to overfitting on the training set. Additionally, during the inference (generation) stage, the error at each time step accumulates along the generation process, a phenomenon known as exposure bias (Ranzato et al., 2015).
Many efforts have been devoted to addressing the above limitations of MLE. Researchers have proposed several non-MLE methods based on minimizing different discrepancy measures, e.g., sequential GANs (Yu et al., 2017; Che et al., 2017; Kusner and Hernández-Lobato, 2016) and CoT (Lu et al., 2018). However, non-MLE methods typically rely on sampling from the generative distribution to estimate gradients, which results in high variance and instability, since the generative distribution is non-stationary during training. A recent study (Caccia et al., 2018) empirically shows that non-MLE methods potentially suffer from mode collapse and cannot actually outperform MLE in terms of the quality-diversity trade-off.
†The work was done during the first author's internship at ByteDance AI Lab.
In this paper, we seek to leverage the generative model itself to provide an unlimited number of samples for augmenting the training dataset, which has the potential of alleviating the overfitting caused by limited samples, as well as of addressing the exposure bias problem by providing the model with prefixes (partial input sequences) sampled from its own distribution. To correct the bias incurred by sampling from the model distribution, we propose to learn a progressive density ratio estimator based on Bregman divergence minimization. Together, these procedures form a novel training scheme for sequence generative models, termed ψ-MLE.
Another essential difference between ψ-MLE and MLE lies in the fact that in MLE, the likelihoods of samples outside the training set are penalized equally through normalization, whether they lie near or far from the true distribution. ψ-MLE, in contrast, takes the difference in quality among unseen samples into account through the importance weights assigned by the density ratio estimator, which can be expected to yield further improvement.
Empirically, plain MLE on the mixture training data gives the same performance as vanilla MLE on the training data alone, whereas our proposed ψ-MLE consistently outperforms vanilla MLE. Additionally, we empirically demonstrate the superiority of our algorithm over strong baselines such as GANs in terms of generative performance (in the quality-diversity space) on both synthetic and real-world datasets.
2 Preliminary
2.1 Notations
We denote the target data distribution as $P_d$ and the empirical data distribution as $\hat{P}_d$. The parameters of the generative model are denoted by $\theta$ and the parameters of the density ratio estimator by $\phi$. $P_\theta$ denotes the distribution implied by the tractable-density generative model. The objective is to fit the underlying data distribution $P_d$ with a parameterized model distribution $P_\theta$, given empirical samples from $\hat{P}_d$. We use $x$ to denote a sample sequence, drawn either from the dataset or from the generator's output, and $x_l$ denotes the $l$-th token of $x$, where $l \in \{1, \dots, L\}$.
2.2 MLE vs Sequential GANs
It should be noticed that both MLE and GANs for sequence generation suffer from their corresponding issues. In this section, we delve deeply into the specific properties of MLE and GANs, and explore how these properties affect their performances in modeling sequential data.
MLE
The objective of Maximum Likelihood Estimation (MLE) is:

$\max_\theta \; \mathbb{E}_{x \sim P_d}[\log P_\theta(x)]$  (1)

where $P_\theta(x)$ is the learned probability of sequence $x$ under the generative model. Maximizing this objective is equivalent to minimizing the Kullback-Leibler (KL) divergence:

$\min_\theta \; D_{\mathrm{KL}}(P_d \,\|\, P_\theta)$  (2)
Though MLE has many attractive properties, it suffers from two critical issues:
1) MLE is prone to overfitting on small training sets. When training an autoregressive sequence generative model with MLE on a training set $\{x^{(i)}\}_{i=1}^{N}$ of $N$ sentences of length $L$, the standard objective can be derived as follows:

$\max_\theta \; \frac{1}{N}\sum_{i=1}^{N}\sum_{l=1}^{L} \log P_\theta\big(x^{(i)}_l \mid x^{(i)}_{1:l-1}\big)$  (3)
The forced exposure to ground-truth data shown in Eq. 3 is known as "teacher forcing", which causes overfitting. What makes things worse is exposure bias. During training, the model only learns to predict $x_l$ given $x_{1:l-1}$, where the prefixes are fluent prefixes from the training set. During sampling, once the model makes some small mistakes and the first $l-1$ tokens no longer form a fluent prefix, it may easily fail to predict $x_l$.
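As a concrete illustration of the teacher-forcing objective in Eq. 3, the toy sketch below scores a sentence under a hypothetical bigram model, always conditioning on ground-truth prefixes. The bigram table and tokens are invented for illustration only; they are not the paper's model.

```python
import math

# Teacher forcing: the loss for one sentence sums -log P(x_l | x_{1:l-1}),
# with every prefix taken from the ground-truth data, never from the
# model's own samples. This hypothetical bigram "model" stands in for an
# autoregressive neural generator.
bigram_logprob = {
    ("<s>", "the"): math.log(0.5),
    ("the", "cat"): math.log(0.4),
    ("cat", "sat"): math.log(0.3),
}

def teacher_forcing_nll(sentence):
    """Negative log-likelihood computed with ground-truth prefixes."""
    tokens = ["<s>"] + sentence
    return -sum(bigram_logprob[(tokens[i - 1], tokens[i])]
                for i in range(1, len(tokens)))

nll = teacher_forcing_nll(["the", "cat", "sat"])  # about 2.813
```

At generation time the model would instead condition on its own (possibly imperfect) prefixes, which is exactly the train/test mismatch described above.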
2) KL divergence punishes the case where the generative model assigns low probability to real data points much more severely than the case where unreasonable data points are assigned high probability. As a result, models trained with MLE focus more on not missing real data points than on avoiding the generation of low-quality data points.
Sequential GANs
Sequential GANs (Yu et al., 2017; Guo et al., 2018) were proposed to overcome the above shortcomings of MLE. Their typical objective is:

$\max_\theta \; \mathbb{E}_{x \sim P_\theta}\Big[\sum_{l=1}^{L} Q(x_{1:l-1}, x_l)\Big]$  (4)

where $Q(x_{1:l-1}, x_l)$ is the action value, usually approximated by a discriminator's evaluation of complete sequences sampled from the prefix $x_{1:l}$. The main advantage of GANs is that when the generative model is updated, error is explicitly reduced through the effect of the normalizing constant.
However, GANs also have a major drawback. As the gradient is estimated by the REINFORCE algorithm (Yu et al., 2017) and the generated distribution is non-stationary, the estimated gradient may suffer from high variance. Though many methods have been proposed to stabilize the training of sequential GANs, e.g., control variates (Che et al., 2017) or MLE pre-training (Yu et al., 2017), they only have limited effect on sequential data. Moreover, as indicated by recent work (Caccia et al., 2018), sequential GANs sharpen the density function within the distribution's support, which sacrifices diversity for better quality.
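The score-function (REINFORCE) estimate behind Eq. 4 can be sketched in a few lines. The one-step Bernoulli "policy" below is a hypothetical stand-in for a sequence model, chosen only because its true gradient is known in closed form, so the noisiness of the sampled estimate is easy to see.

```python
import random

# REINFORCE: sample from the current policy and weight grad-log-prob by a
# reward. For a Bernoulli policy with parameter p and reward 1 for the
# "good" outcome, E[R] = p, so the true gradient d E[R] / dp is exactly 1.
random.seed(0)
p = 0.7  # probability of emitting the "good" token

def reinforce_grad(n_samples):
    grads = []
    for _ in range(n_samples):
        good = random.random() < p
        reward = 1.0 if good else 0.0
        # d/dp log Bernoulli(x; p)
        grad_logp = (1 / p) if good else (-1 / (1 - p))
        grads.append(reward * grad_logp)
    return sum(grads) / len(grads)

# Estimates scatter around the true value 1; with few samples the
# scatter (variance) is large, which is the instability discussed above.
estimate = reinforce_grad(10000)
```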
3 Methodology
To combine the advantages of MLE, which directly trains the model on high-quality training samples, and GANs, which actively explore unseen spaces, we propose ψ-MLE. We further remove noisy points by performing importance sampling, with weights given by a density ratio estimator.
3.1 ψ-MLE for Sequence Generation
The different properties of MLE and GANs mainly result from their effect zones of supervision. Concretely, the effect zone of supervision is the subset of all possible data points whose likelihoods are directly updated during training. MLE only maximizes the probabilities of points in the training set, which is discrete and finite. However, the actual data space contains far more points than the training set, on which there is no supervision. In contrast, since the generators of GANs are able to generate all possible data points, their effect zone is essentially the whole data space. Large as this effect zone is, the supervision signal, i.e., the gradient for updating a GAN's generator, usually has high variance compared with the gradient of MLE.
To combine the merits of both methods, we propose ψ-MLE, which blends samples generated by the current generation model into the training data:

$P'(x) = \alpha \hat{P}_d(x) + (1-\alpha) P_\theta(x)$  (5)

where $\alpha$ is the proportion of training data. With ψ-MLE, the effect zone of supervision extends to the whole space. And since real training data are present in the mixture samples, the gradients are more informative and have lower variance.
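Sampling from the mixture in Eq. 5 amounts to flipping an α-coin per example: take a real training sentence with probability α, otherwise a model sample. The sketch below illustrates this; the strings and the stub for drawing from $P_\theta$ are hypothetical placeholders.

```python
import random

random.seed(1)
alpha = 0.5  # proportion of real training data in the mixture (Eq. 5)
training_set = ["real sentence a", "real sentence b"]

def sample_from_model():
    # Hypothetical stand-in for drawing a sequence from P_theta.
    return "model sample"

def sample_mixture(n):
    """Draw n examples from alpha * P_data_hat + (1 - alpha) * P_theta."""
    batch = []
    for _ in range(n):
        if random.random() < alpha:
            batch.append(random.choice(training_set))
        else:
            batch.append(sample_from_model())
    return batch

batch = sample_mixture(1000)
frac_real = sum(s.startswith("real") for s in batch) / len(batch)
```

Over a large batch, roughly an α-fraction of the examples are real, which is what keeps the gradients anchored to high-quality data.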
For training, we directly minimize the forward KL divergence between $P'$ and $P_\theta$, which is equivalent to performing MLE on samples from $P'$, since the training goal at each step is to maximize:

$\mathbb{E}_{x \sim P'}[\log P_\theta(x)]$  (6)

As this KL divergence decreases, the gap between $P_\theta$ and $P'$ gets smaller. Eventually, when $P_\theta = P'$, $P_\theta$ also equals $\hat{P}_d$, since $P' = \alpha \hat{P}_d + (1-\alpha) P_\theta$.
However, $P_\theta$ may be very different from $P_d$, especially at the beginning of training. This discrepancy may result in very poor samples that have high likelihood under $P_\theta$ but not under $P_d$. As a result, the training set gets noisier, which may harm performance.
3.2 Noise Reduction by Importance Sampling
To make the distribution of training samples closer to $P_d$, we introduce the following importance sampling method. The main idea is to first draw a batch of samples from $P'$, and then give each sample an importance weight according to its similarity to real samples. The training objective then becomes:

$\max_\theta \; \mathbb{E}_{x \sim P'}[\, r_\phi(x) \log P_\theta(x) \,]$  (7)

where $\phi$ is the parameter of the importance weight estimator $r_\phi$.
In the ideal condition, where $r_\phi(x) = P_d(x)/P'(x)$, the training essentially minimizes the KL divergence between $P_\theta$ and the real data distribution $P_d$:

$\mathbb{E}_{x \sim P'}\Big[\frac{P_d(x)}{P'(x)} \log P_\theta(x)\Big] = \mathbb{E}_{x \sim P_d}[\log P_\theta(x)]$  (8)

where the samples in the last expectation are drawn from $P_d$. We assert that dividing by $P'$ will not cause any numerical problem, since the support of $P_d$ is a subset of the support of $P'$.
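A quick sanity check of the identity in Eq. 8: expectations under the data distribution can be recovered from mixture samples by weighting each one with the density ratio $P_d(x)/P'(x)$. The three-symbol distributions below are hypothetical, chosen so both sides can be evaluated.

```python
import random

random.seed(2)
p_data  = {"a": 0.7, "b": 0.2, "c": 0.1}
p_model = {"a": 0.1, "b": 0.3, "c": 0.6}
alpha = 0.5
# Mixture distribution P' = alpha * P_data + (1 - alpha) * P_model (Eq. 5).
p_mix = {x: alpha * p_data[x] + (1 - alpha) * p_model[x] for x in p_data}

def weighted_mean(f, n):
    """Estimate E_{P_data}[f(x)] from samples of P' via importance weights."""
    xs = random.choices(list(p_mix), weights=list(p_mix.values()), k=n)
    return sum(p_data[x] / p_mix[x] * f(x) for x in xs) / n

# E_{P_data}[1{x == "a"}] = 0.7; the weighted estimate from mixture
# samples should land close to it.
est = weighted_mean(lambda x: 1.0 if x == "a" else 0.0, 20000)
```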
However, it is infeasible to calculate $P_d(x)/P'(x)$ directly, so we need to approximate it with $r_\phi(x)$. A first thought is to use a new parametric model $Q_\phi$ to approximate $P_d$ and set $r_\phi(x) = Q_\phi(x)/P'(x)$, but this leads to severe numerical instability. In this paper, we choose to approximate the ratio directly by training a discriminator between $P'$ and $\hat{P}_d$. More concretely, we first assign positive labels to samples from $\hat{P}_d$ and negative labels to samples from $P'$. Then we train a probabilistic classifier $D_\phi(x)$ to output the probability of $x$ belonging to each class. After the training of $D_\phi$ converges, we set $r_\phi(x) = \gamma \frac{D_\phi(x)}{1 - D_\phi(x)}$ and obtain the following proposition:

Proposition 1. With a Bayes-optimal classifier $D^*_\phi$,

$\gamma \, \frac{D^*_\phi(x)}{1 - D^*_\phi(x)} = \frac{\hat{P}_d(x)}{P'(x)}$  (9)

where $\gamma$ is the ratio of the number of negative samples to the number of positive samples. We keep $\gamma = 1$ by using the same number of negative and positive samples in each minibatch.
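The classifier-to-ratio conversion of Proposition 1 is a one-liner; the sketch below transcribes it, with hypothetical classifier outputs and the balanced-minibatch setting γ = 1.

```python
# Density ratio from a classifier: given D(x), the probability that x came
# from the real-data ("positive") class, the ratio is recovered as
# r(x) = gamma * D(x) / (1 - D(x)), where gamma is the negative-to-positive
# sample count ratio (1 here, matching balanced minibatches).
gamma = 1.0

def density_ratio(d_x):
    return gamma * d_x / (1.0 - d_x)

# D(x) = 0.5 means the classifier cannot tell real from generated,
# so the importance weight is exactly 1.
r = density_ratio(0.5)
```

Note how the weight blows up as D(x) approaches 1, which is one reason a miscalibrated classifier yields unreliable ratios, as discussed next.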
Note that this density ratio is obtained indirectly from a classifier, which is typically poorly calibrated. We would therefore need to calibrate $D_\phi$ frequently to obtain a good density ratio estimate and avoid the numerical problems caused by miscalibration (Turner et al., 2018). However, calibrating after each update can be quite computationally expensive. To sidestep this obstacle, directly estimating the density ratio is a more general approach, which may also lead to a more accurate estimate than the classifier-based method described above.
Given two distributions $P$ and $Q$, the target of direct density ratio estimation is a density ratio model $r_\phi(x)$ that directly approximates the true ratio $r^*(x) = P(x)/Q(x)$. Sugiyama et al. (2012) and Uehara et al. (2016) proposed to use the Bregman divergence as a measure of the discrepancy between two density ratio functions, which guides the training of the density ratio model. The Bregman divergence is an extension of the Euclidean distance that measures the distance between two points $t_1$ and $t_2$; with respect to a function $\psi$, it is defined as:

$B_\psi(t_1 \,\|\, t_2) = \psi(t_1) - \psi(t_2) - \nabla\psi(t_2)(t_1 - t_2)$  (10)

where $\psi$ is a strictly convex and continuously differentiable function defined on a closed set.
The integrated Bregman divergence between an estimated density ratio function $r_\phi$ and the real density ratio function $r^*$ under the measure $Q$ is:

$B_\psi(r^* \,\|\, r_\phi) = \int Q(x)\,\big[\psi(r^*(x)) - \psi(r_\phi(x)) - \nabla\psi(r_\phi(x))(r^*(x) - r_\phi(x))\big]\,dx$  (11)

The estimation procedure then becomes an optimization procedure with respect to the parameter $\phi$. We leave the discussion of different choices of $\psi$ to Sec. 4.2. In practical training, we alternately update $\theta$ and $\phi$. The whole training procedure is summarized in Algorithm 1.
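The Bregman-divergence estimation procedure can be made concrete with the least-squares choice ψ(t) = (t − 1)²/2, one member of the admissible family (not necessarily the paper's default). With this ψ, minimizing Eq. 11 over $r_\phi$ reduces to minimizing $\mathbb{E}_Q[r_\phi(x)^2]/2 - \mathbb{E}_P[r_\phi(x)]$ up to a constant. The two-symbol distributions and the tabular parameterization are hypothetical, and exact expectations replace sampling for clarity.

```python
# Direct density ratio estimation by Bregman divergence minimization with
# psi(t) = (t - 1)^2 / 2 (least squares). Gradient descent on
#   E_Q[r(x)^2] / 2  -  E_P[r(x)]
# drives r(x) toward the true ratio P(x) / Q(x).
p_num = {"a": 0.8, "b": 0.2}   # numerator distribution P
q_den = {"a": 0.5, "b": 0.5}   # denominator distribution Q
r = {"a": 1.0, "b": 1.0}       # tabular ratio model, one parameter per symbol

lr = 0.1
for _ in range(500):
    for x in r:
        # Exact-expectation gradient:
        # d/dr(x) [ q(x) * r(x)^2 / 2  -  p(x) * r(x) ] = q(x) r(x) - p(x)
        grad = q_den[x] * r[x] - p_num[x]
        r[x] -= lr * grad

# Converges to the true ratios: r["a"] -> 0.8/0.5 = 1.6, r["b"] -> 0.4.
```

In the actual method, the tabular `r` would be a neural network and the expectations would be Monte Carlo estimates over minibatches from the two distributions.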
4 Connection with other methods
In this section, we provide further investigation on direct density ratio estimation and theoretical justification for our proposed methods.
4.1 Relation with GANs
As introduced in Sec. 2.2, sequential GANs usually adopt policy gradient methods for training. Their objectives can be interpreted in a Reinforcement Learning (RL) fashion:

$\max_\theta \; \mathbb{E}_{x \sim P_\theta}[R(x)] + \tau \mathcal{H}(P_\theta)$  (12)
In this formula, $R(x)$ is the reward function, which is usually implemented by a discriminator, and the entropy regularization term $\mathcal{H}(P_\theta)$ is added to mitigate mode collapse. We further introduce the exponentiated payoff distribution (Norouzi et al., 2016):

$P_R(x) = \frac{1}{Z} \exp\big(R(x)/\tau\big)$  (13)

Then we can see that training discrete GANs essentially minimizes the following KL divergence:

$D_{\mathrm{KL}}(P_\theta \,\|\, P_R) = -\frac{1}{\tau}\mathbb{E}_{x \sim P_\theta}[R(x)] - \mathcal{H}(P_\theta) + \log Z$  (14)

The last step holds because $\log Z$ is constant during the optimization of $\theta$. Our method can be seen as optimizing the opposite direction of the KL divergence, i.e., $D_{\mathrm{KL}}(P_d \,\|\, P_\theta)$. As it is intractable to sample directly from $P_d$, we first sample from $P'$ and conduct importance sampling with weight $r_\phi(x)$ to obtain an unbiased estimate of the objective.

4.2 Relation with f-Divergence
Density ratio estimation is closely related to the f-divergence (Nowozin et al., 2016), which measures the difference between two probability distributions. Given two distributions with absolutely continuous density functions $P$ and $Q$, the f-divergence is defined as:

$D_f(P \,\|\, Q) = \int Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right) dx$  (15)

where $f$ is a convex and lower-semicontinuous function with $f(1) = 0$.
If $\psi$ is a strictly convex and continuously differentiable function, the following conclusion can be derived.
Proposition 2. Minimizing the Bregman divergence between the density ratios of two distributions $P$ and $Q$ with respect to $\psi$ is essentially estimating the f-divergence between $P$ and $Q$, with $\nabla\psi$ serving as the dual coordinates.
When the true density ratio is available, the f-divergence can also be obtained, so it is not surprising that estimating the density ratio by minimizing the Bregman divergence with respect to a function $\psi$ is essentially the dual of estimating the f-divergence by maximizing a variational bound. We rewrite Eq. 11 as follows:

$B_\psi(r^* \,\|\, r_\phi) = \int Q(x)\,\psi(r^*(x))\,dx - \int Q(x)\,\big[\psi(r_\phi(x)) + \nabla\psi(r_\phi(x))(r^*(x) - r_\phi(x))\big]\,dx$  (16)

After some simple manipulation, we obtain a variational lower bound on the first term. The inequality holds because the convexity of $\psi$ gives $\psi(r^*) \ge \psi(r_\phi) + \nabla\psi(r_\phi)(r^* - r_\phi)$, with equality if and only if $r_\phi = r^*$.
Meanwhile, the dual representation of the f-divergence (Nowozin et al., 2016) is:

$D_f(P \,\|\, Q) \ge \sup_{T \in \mathcal{T}} \big( \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \big)$  (17)

where $\mathcal{T}$ is an arbitrary class of functions $T$ and $f^*$ denotes the Fenchel conjugate of $f$. The bound in Eq. 17 is tight because $f^{**} = f$. The above discussion also indicates a knowledge distillation perspective on our method: we first estimate the discrepancy through the density ratio model, and then distill this knowledge into the generator by minimizing the KL divergence between the importance-weighted $P'$ and $P_\theta$.
5 Experiments
Table 1: Results on the synthetic dataset (lower is better).

| Model | NLL_oracle | NLL_test | best NLL_oracle + NLL_test |
|---|---|---|---|
| MLE | 5.53 | 7.58 | 16.28 |
| SeqGAN (Yu et al., 2017) | 8.12 | 7.92 | 18.44 |
| CoT (Lu et al., 2018) | 6.20 | 7.56 | 16.32 |
| LeakGAN (Guo et al., 2018) | 10.01 | 8.52 | 19.45 |
| ψ-MLE | 5.09 | 7.56 | 15.98 |
Table 3: Results on COCO image captions. BLEU (↑) measures quality; Self-BLEU (↓) measures diversity.

| Model | BLEU-2 | BLEU-3 | BLEU-4 | BLEU-5 | Self-BLEU-2 | Self-BLEU-3 | Self-BLEU-4 | Self-BLEU-5 |
|---|---|---|---|---|---|---|---|---|
| Training Data | 0.86 | 0.61 | 0.38 | 0.23 | 0.86 | 0.62 | 0.38 | 0.24 |
| SeqGAN (Yu et al., 2017) | 0.72 | 0.42 | 0.18 | 0.09 | 0.91 | 0.70 | 0.46 | 0.27 |
| MaliGAN (Che et al., 2017) | 0.76 | 0.44 | 0.17 | 0.08 | 0.91 | 0.72 | 0.47 | 0.25 |
| LeakGAN (Guo et al., 2018) | 0.84 | 0.65 | 0.44 | 0.27 | 0.94 | 0.82 | 0.67 | 0.51 |
| ψ-MLE (·) | 0.93 | 0.74 | 0.51 | 0.32 | 0.93 | 0.78 | 0.59 | 0.41 |
| ψ-MLE (·) | 0.93 | 0.76 | 0.54 | 0.33 | 0.91 | 0.75 | 0.56 | 0.38 |
Table 2: Results on EMNLP 2017 News. BLEU (↑) measures quality; Self-BLEU (↓) measures diversity.

| Model | BLEU-2 | BLEU-3 | BLEU-4 | BLEU-5 | Self-BLEU-2 | Self-BLEU-3 | Self-BLEU-4 | Self-BLEU-5 |
|---|---|---|---|---|---|---|---|---|
| Training Data | 0.68 | 0.47 | 0.30 | 0.19 | 0.86 | 0.62 | 0.38 | 0.42 |
| SeqGAN (Yu et al., 2017) | 0.75 | 0.50 | 0.29 | 0.18 | 0.95 | 0.84 | 0.67 | 0.49 |
| MaliGAN (Che et al., 2017) | 0.67 | 0.43 | 0.26 | 0.16 | 0.92 | 0.78 | 0.61 | 0.44 |
| LeakGAN (Guo et al., 2018) | 0.74 | 0.52 | 0.33 | 0.21 | 0.93 | 0.82 | 0.66 | 0.51 |
| MLE | 0.74 | 0.52 | 0.33 | 0.21 | 0.89 | 0.72 | 0.54 | 0.38 |
| ψ-MLE (·) | 0.75 | 0.53 | 0.36 | 0.23 | 0.89 | 0.70 | 0.53 | 0.36 |
Table 4: The family of objective functions ψ available for direct density ratio estimation (σ stands for the sigmoid function).
To demonstrate the effectiveness of our method, we conduct experiments in a synthetic setting as well as on two real-world benchmark datasets. We compare our method with several baseline methods, including MLE, SeqGAN (Yu et al., 2017), LeakGAN (Guo et al., 2018), CoT (Lu et al., 2018), and MaliGAN (Che et al., 2017). Note that an important hyperparameter of our method is the mixture weight $\alpha$, which is set to $1/2$ by default in all experiments except the ablation studies on $\alpha$ in Sec. 5.4.

5.1 Implementation Details
5.1.1 Bregman Divergence Minimization
The density ratio in Sec. 3.2 is estimated through an optimization procedure based on the Bregman divergence. A variety of functions meet the requirements on $\psi$, but in all experiments we use a single choice of $\psi$ (see Table 4) as the default objective for its numerical stability during training. The effect of using different objectives is analyzed empirically in Sec. 5.4.
5.1.2 Variance Reduction
The density ratio estimate $r_\phi(x)$ can be seen as the importance weight that corrects the bias of the hybrid distribution $P'$. To improve sample quality, we apply two variance reduction methods for importance sampling (Owen, 2013; Grover et al., 2019):

Self-normalization: the self-normalized estimator rescales the density ratios across a batch of samples:

$\tilde{r}_\phi(x_i) = \frac{r_\phi(x_i)}{\frac{1}{B}\sum_{j=1}^{B} r_\phi(x_j)}$  (18)

where $B$ is the batch size.

Ratio flattening: the density ratio can be flattened to an intermediate state between the original and uniform importance weights through a parameter $\beta$:

$\tilde{r}_\phi(x) = r_\phi(x)^{\beta}$  (19)
We find that self-normalization works best, so all experiments are implemented with self-normalization.
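Both variance-reduction estimators above can be sketched in a few lines. Eq. 18 is implemented here in its mean-one form and the flattening exponent is written `beta`; the batch of raw weights is hypothetical.

```python
weights = [0.5, 2.0, 1.0, 4.5]  # raw density-ratio weights for one batch

def self_normalize(ws):
    """Eq. 18: rescale ratios by their batch mean so they average to 1."""
    mean = sum(ws) / len(ws)
    return [w / mean for w in ws]

def flatten(ws, beta):
    """Eq. 19: w**beta interpolates between the original weights (beta=1)
    and uniform weights (beta=0)."""
    return [w ** beta for w in ws]

normalized = self_normalize(weights)  # sums to the batch size
uniform = flatten(weights, 0.0)       # every weight becomes 1.0
```

Self-normalization trades a small bias for a large variance reduction, which is the usual motivation for self-normalized importance sampling.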
5.2 Synthetic Experiments
The synthetic experiments follow the typical settings of previous works (Yu et al., 2017; Guo et al., 2018; Lu et al., 2018). We use a randomly initialized LSTM as the oracle model and test each generation model's ability to learn from samples generated by this oracle. We use a single-layer LSTM with 32 hidden units, with parameters initialized from a standard normal distribution. With a fixed LSTM as the target, the ground-truth density is available, so generation quality can be analyzed quantitatively via the negative log-likelihood NLL_oracle given by the oracle model. Besides, the log-likelihood the generative model assigns to held-out test data, i.e., NLL_test, is another metric, used to evaluate sample diversity.

As pointed out by Caccia et al. (2018), evaluating quality alone is misleading for the sequence generation task. Note that the conditional probability is formalized as $P_\theta(x_l \mid x_{1:l-1}) = \mathrm{softmax}(o_l W / \tau)$, where $o_l$ is the pre-logit activation of the generator, $W$ is the word embedding matrix, and $\tau$ is a Boltzmann temperature parameter. Caccia et al. (2018) introduced a temperature sweep procedure, which enumerates possible values of $\tau$ in a predefined range and reports the corresponding NLL_oracle and NLL_test. In the same way, we obtain a curve of NLL_oracle versus NLL_test at different temperatures (Fig. 1). The curve of our method lies under the curves of all baseline methods, showing the superiority of our method.

Quantitative results are reported in Table 1, including the best NLL_oracle, NLL_test, and the comprehensive evaluation metric NLL_oracle + NLL_test. These results are obtained by tuning the temperature in the valid range defined by Caccia et al. (2018), reflecting the quality, diversity, and their trade-off for each training paradigm under the constraint that the tuned model is still a valid language model. Our method outperforms previous methods as it combines the strengths of MLE and GANs.

5.3 Real Data Experiments
We conduct real-data experiments on two text benchmark datasets, image COCO captions and EMNLP 2017 News. We use the BLEU score between generated samples and the whole test set to evaluate generation quality. At the same time, we use Self-BLEU (Zhu et al., 2018), the average BLEU score between each generated sample and all other generated samples, as a metric of diversity. Following Caccia et al. (2018), the temperature is selected such that the BLEU scores are similar to the numbers reported in Guo et al. (2018), for fair comparison.
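The temperature knob used for this selection can be sketched directly: divide the pre-softmax logits by a temperature and renormalize, so that temperatures below 1 sharpen the distribution (quality up, diversity down) and temperatures above 1 flatten it. The logits below are hypothetical.

```python
import math

def softmax_with_temperature(logits, t):
    """Softmax over logits scaled by temperature t (numerically stable)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # peaked: favors top token
flat = softmax_with_temperature(logits, 5.0)   # near-uniform: more diverse
```

Sweeping this temperature over a range and recording the quality and diversity metrics at each setting produces the trade-off curves discussed in Sec. 5.2.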
COCO mainly contains short image captions, while EMNLP 2017 News consists of longer, formal texts. The results on COCO and EMNLP 2017 News are shown in Table 3 and Table 2, respectively. Our model achieves higher BLEU scores and lower Self-BLEU scores, revealing both better quality and higher diversity.
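The mechanics of Self-BLEU as a diversity metric can be illustrated with a stripped-down sketch: each generated sample is scored against all the other samples, and the scores are averaged, so higher Self-BLEU means less diversity. The unigram-precision "BLEU" below is a toy stand-in for the real n-gram metric, and the sentences are invented.

```python
def unigram_bleu(references, hypothesis):
    """Toy BLEU: fraction of hypothesis tokens found in any reference."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    hits = sum(any(tok in ref for ref in refs) for tok in hyp)
    return hits / len(hyp)

def self_bleu(samples):
    """Average score of each sample against all the other samples."""
    scores = [unigram_bleu(samples[:i] + samples[i + 1:], s)
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)

identical = self_bleu(["the cat sat", "the cat sat"])   # no diversity -> 1.0
diverse = self_bleu(["the cat sat", "a dog ran far"])   # disjoint -> 0.0
```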
5.4 Ablation Study and Sensitivity Analysis
5.4.1 Mixture Weight
One important hyperparameter of our model is the mixture weight $\alpha$ used to construct the proposal distribution $P'$. To figure out how the method behaves with different $\alpha$, we gradually increase $\alpha$ from $0$ to $1$ and show the results in Table 5. With a small $\alpha$, ψ-MLE performs similarly to sequential GANs. This is because $P'$ is then more similar to $P_\theta$; in fact, our method degenerates to a variant of a sequential GAN when $\alpha = 0$. Correspondingly, the model is closer to MLE as $\alpha$ approaches $1$. The best performance of ψ-MLE is achieved when $\alpha$ is set to an intermediate value between $0$ and $1$, where both the exploration properties of GANs and the stability of MLE are incorporated. These experiments further justify the connection among MLE, ψ-MLE, and GANs.
Table 5: Ablation on the mixture weight α (synthetic data).

| α | 0 | 1/4 | 1/2 | 3/4 | 1 |
|---|---|---|---|---|---|
| NLL_oracle | 7.60 | 6.23 | 5.09 | 5.63 | 5.53 |
| NLL_test | 8.01 | 7.91 | 7.56 | 7.54 | 7.60 |
| NLL_oracle + NLL_test | 17.43 | 16.27 | 15.98 | 15.94 | 16.30 |
5.4.2 Objective Density Functions
As illustrated in Table 4, there is a family of objectives that meet the definition of the Bregman divergence and are available for direct density ratio estimation. We conduct ablation studies in the synthetic setting to examine the training dynamics of the different objectives in practice; the results are shown in Fig. 2. Training with certain choices of $\psi$ is unstable due to numerical issues, while with the other choices ψ-MLE obtains remarkable improvements over MLE with a more stable training procedure.
6 Related Work
In the context of sequence generation models, there have been fruitful lines of study on leveraging adversarial training for the sequence generation task. These works are inspired by generative adversarial nets (Goodfellow et al., 2014), an implicit generative model that seeks to minimize the Jensen-Shannon divergence between the generative distribution and the real data distribution through a two-player minimax game. In the sequence generation task, gradients cannot be directly backpropagated to the generative module as in the continuous setting; hence reparameterization (Kusner and Hernández-Lobato, 2016) or policy gradients (Yu et al., 2017; Guo et al., 2018; Che et al., 2017) are utilized to obtain unbiased gradient estimates.
Our method can be seen as a more general objective family, with MLE and policy-gradient-based GANs as two special cases. It is closely related to methods that leverage a tractable-density distribution as noise to estimate another density, especially self-contrastive estimation (Goodfellow, 2014). Self-contrastive estimation is a degenerate version of ψ-MLE, i.e., it directly uses samples from $P_\theta$ as ground truth to conduct MLE without the bias correction step with $r_\phi$. CoT (Lu et al., 2018) also leverages a tractable density as noise. Our approach differs from CoT in the calculation of the density ratio: CoT introduces another generative module to estimate the denominator of the density ratio, while we apply direct density ratio estimation methods, which are more flexible and efficient.
Density ratio estimation has attracted the attention of the generative modeling community. Nowozin et al. (2016) describe a general objective family for training GANs in which the density ratio is a key element, and Uehara et al. (2016) further investigate the connection between GANs and density ratio estimation. Density ratio estimators have also been utilized to improve a learned generative model: Azadi et al. (2018) and Turner et al. (2018) leverage the density ratio to conduct rejection sampling over the support of the generative distribution to obtain high-quality samples. Similarly, Grover et al. (2019) utilize an importance sampling framework to correct the biased statistics of the generated distribution, resulting in improvements in several application scenarios of generative models.
7 Discussion and Future Work
We propose ψ-MLE, a new training paradigm for sequence generation that is effective and stable when operating in the large sample spaces encountered in sequence generation. Our method is derived from the concept we term the effect zone of supervision, which accounts for the properties of different sequence generation models. We generalize the effect zone of supervision through self-augmentation, followed by a density-ratio-based bias correction procedure that achieves unbiased optimization at each training step. Experimental results demonstrate that ψ-MLE achieves a better quality-diversity trade-off than previous sequence generation methods. An exciting avenue for future work is to extend our training paradigm to conditional text generation tasks such as machine translation, dialogue systems, and abstractive summarization. We also look forward to further investigation of the consistency and generalization properties of our approach.
Acknowledgements
We thank the anonymous reviewers for their insightful comments. Hao Zhou and Lei Li are the corresponding authors of this paper.
References

- Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
- Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. (2018). Discriminator rejection sampling. arXiv preprint arXiv:1810.06758.
- Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. (2018). Language GANs falling short. arXiv preprint arXiv:1811.02549.
- Che, T., Li, Y., Zhang, R., Hjelm, R. D., Li, W., Song, Y., and Bengio, Y. (2017). Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.
- Goodfellow, I. (2014). On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515.
- Goodfellow, I., et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- Grover, A., et al. (2019). Bias correction of learned generative models using likelihood-free importance weighting. arXiv preprint arXiv:1906.09531.
- Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. (2018). Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. (2019). Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275.
- Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Kusner, M. J. and Hernández-Lobato, J. M. (2016). GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051.
- Li, J., et al. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
- Lu, S., et al. (2018). CoT: Cooperative training for generative modeling of discrete data. arXiv preprint arXiv:1804.03782.
- Norouzi, M., et al. (2016). Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pp. 1723–1731.
- Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279.
- Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
- Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
- Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
- Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044.
- Townsend, J., Bird, T., and Barber, D. (2019). Practical lossless compression with latent variables using bits back coding. arXiv preprint arXiv:1901.04866.
- Turner, R., et al. (2018). Metropolis-Hastings generative adversarial networks. arXiv preprint arXiv:1811.11357.
- Uehara, M., et al. (2016). Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
- Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- Yu, L., Zhang, W., Wang, J., and Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
- Zhu, Y., et al. (2018). Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1097–1100.