1 Introduction
Generative models are appealing because they provide ways to obtain insights on the underlying data distribution and statistics. In particular, these models play a pivotal role in many natural language processing tasks such as language modeling, machine translation, and dialogue generation. However, the generated sentences are often unsatisfactory (Sordoni et al., 2015; Bowman et al., 2015; Serban et al., 2017; Wiseman & Rush, 2016). For example, they often lack consistency in long-term semantics and have less coherence in high-level topics and syntax (Bowman et al., 2015; Zhang et al., 2016a).
This is largely attributed to a defect in the dominant training approach for existing discrete generative models. To generate discrete sequences, it is popular to train autoregressive models through teacher forcing (Williams & Zipser, 1989), which nevertheless causes the exposure bias problem (Ranzato et al., 2016). The existing approach trains autoregressive models to maximize the conditional probabilities of next tokens given the ground-truth histories. In other words, during training, autoregressive generative models are only exposed to ground truths from the data distribution rather than to samples from the model distribution, i.e., their own predictions. This prevents the trained model from learning to make the next prediction in the context of its own previously generated words, resulting in a bias and difficulty in approaching the true underlying distribution (Ranzato et al., 2016; Bengio et al., 2015). Another limitation of teacher forcing is that it is inapplicable to autoregressive models with latent random variables, which have performed better than autoregressive (deterministic state) recurrent neural networks (i.e., the usual RNNs, LSTMs, or GRUs) on multiple tasks (Serban et al., 2017; Miao et al., 2016; Zhang et al., 2016a).
An alternative and attractive solution to training autoregressive models is to use generative adversarial networks (GAN) (Goodfellow et al., 2014). The problem discussed above can be prevented if the generative model is able to visit its own predictions during training and has an overall view of the generated sequences. We propose to facilitate the training of autoregressive models with an additional discriminator under the GAN setting. With a discriminator trained to separate real from generated sequences, the generative model is able to make use of the knowledge of the discriminator to improve itself. Since the discriminator is trained on the entire sequence, it can in principle provide a training signal that avoids the problem of exposure bias.
However, it is nontrivial to apply GANs to discrete data, as it is difficult to optimize the generator using the signal provided by the discriminator. In fact, it is usually very hard, if not impossible, to push the generated distribution toward the real data distribution by moving a generated sequence (e.g., a faulty sentence) towards a “true” one (e.g., a correct sentence) in a high-dimensional discrete state space. As standard backpropagation fails in discrete settings, the generator can be optimized using the discriminator’s output as a reward via reinforcement learning. Unfortunately, even with careful pretraining, we found that the policy has difficulty obtaining positive and stable reward signals from the discriminator.
To tackle these limitations, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks (MaliGAN). At the core of this model is a novel GAN training objective that sidesteps the stability issue that arises when the discriminator output is used as a direct reinforcement learning reward. Instead, we develop a normalized maximum likelihood optimization target inspired by Norouzi et al. (2016b). We use importance sampling and several variance reduction techniques to successfully optimize this objective. The procedure was discovered independently of us by Hjelm et al. (2017) in the context of image generation.
The new target gives the proposed MaliGAN several attractive properties. First, it is theoretically consistent and easier to optimize (Section 3.2). Second, it allows the model not only to maximize the likelihood of good behaviors, but also to minimize the likelihood of bad behaviors, with the help of a GAN discriminator. Equipped with these strengths, the model focuses more on improving itself by gaining beneficial knowledge that is not yet well acquired, and by excluding the most probable harmful behaviors. Combined with several proposed variance reduction techniques, the proposed MaliGAN successfully and stably models discrete data sequences (Section 4).
2 Preliminaries and Overview
The basic framework for discrete sequence generation is to fit a set of data $\{x_i\}_{i=1}^{n}$ coming from an underlying generating distribution $p_d$ by training a parameterized autoregressive probabilistic model $p_\theta$.
In this work, we aim to generate discrete data, especially discrete sequential data, under the GAN setting (Goodfellow et al., 2014). GAN defines a framework for training generative models by posing it as a minimax game against a discriminative model. The goal of the generator $G$ is to match its distribution $p_g$ to the real data distribution $p_d$. To achieve this, the generator transforms noise $z$ sampled from $p(z)$ to a data sample $G(z)$. Following this, the discriminator $D$ is trained to distinguish between the samples coming from $p_d$ and $p_g$, and can be used to provide a training signal to the generator.
When applying the GAN framework to discrete data, the discontinuity prohibits the update of the generator parameters via standard backpropagation. To tackle this, one way is to employ a typical reinforcement learning (RL) strategy that directly uses the GAN discriminator’s output, $D$ or $\log D$, as a reward. In practice, the problem is usually solved by REINFORCE-like algorithms (Williams, 1992), perhaps with some variance reduction techniques.
Formally, we train a generator $G$ together with a discriminator $D(x)$. In its original form, the discriminator is trained to distinguish between the generating distribution $p_\theta$ and the real data distribution $p_d$. The generator is then trained to maximize $\mathbb{E}_{x \sim p_\theta}[\log D(x)]$. Namely, the objective for the generator to optimize is as follows:
$$\mathcal{L}_{GAN}(\theta) = -\mathbb{E}_{x \sim p_\theta}[\log D(x)]$$
Our work is related to the viewpoint of casting the GAN training as a reinforcement learning problem with a moving reward signal monotone in $D(x)$. Define the normalized probability distribution $q'(x) = \frac{1}{Z(D)} D(x)$ in some bounded region to guarantee integrability (note that $q'$ is an approximation to $p_d$ if $D$ is well trained) and also put a maximum-entropy regularizer $\mathbb{H}(p_\theta)$ to encourage diversity, yielding the regularized loss:
$$\mathcal{L}_{GAN}(\theta) = -\mathbb{E}_{x \sim p_\theta}[\log D(x)] - \mathbb{H}(p_\theta) = KL(p_\theta \,\|\, q') + c(D) \qquad (1)$$
where $c(D)$ is a constant depending only on $D$. Hence, optimizing the traditional GAN is basically equivalent to optimizing the KL-divergence $KL(p_\theta \,\|\, q')$. One major problem with this approach is that $q'$ always moves with $D$, which is undesirable for both stability and convergence. When we have some samples $x_i \sim p_\theta$, we want to change $p_\theta$ a bit in order to adjust the likelihood of the samples $x_i$ to improve the quality of the generator. However, since $p_\theta$ initially generates very bad sequences, it has little chance of generating good sequences in order to get positive rewards. Though dedicated pretraining and variance reduction mechanisms help (Yu et al., 2017), the RL algorithm based on the moving reward signal still seems very unstable and does not work on large-scale datasets.
We therefore propose to utilize the information of the discriminator as an additional source of training signals, on top of the maximum-likelihood objective. We employ importance sampling to make the objective trainable. The novel training objective has much less variance than that of vanilla reinforcement learning approaches that directly adopt $D$ or $\log D$ as reward signals. The analysis and discussion are presented in more detail in Section 3.2.
3 Maximum-Likelihood Augmented Discrete Generative Adversarial Networks
In this section, we present the details of the proposed model. At the heart of this model is a novel training objective that significantly reduces the variance during training; we give a theoretical and practical analysis of the objective’s equivalence and attractive properties. We also show how this core algorithm can be combined with several variance reduction techniques to form the full MaliGAN algorithm for discrete sequence generation.
3.1 Basic Model of MaliGAN
We propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks (MaliGAN) to generate discrete data. With MaliGAN, we train a discriminator $D(x)$ with the standard objective that GAN employs. What differs from GANs is a novel objective for the generator to optimize, using importance sampling, which makes the training procedure closer to maximum likelihood (MLE) training of autoregressive models, and thus more stable and with less variance in the gradients.
To do so, we keep a delayed copy $p'(x)$ of the generator whose parameters are updated less often in order to stabilize training. From the basic property of GANs, we know that an optimal $D$ has the property $D(x) = \frac{p_d}{p_d + p'}$. So in this case, we have $p_d = \frac{D}{1-D}\, p'$. Therefore, we set the target distribution $q$ for maximum likelihood training to be $\frac{D}{1-D}\, p'$. Letting $r_D = \frac{D}{1-D}$, we define the augmented target distribution as:
$$q(x) = \frac{1}{Z(\theta')} \frac{D(x)}{1 - D(x)}\, p'(x) = \frac{r_D(x)}{Z(\theta')}\, p'(x)$$
Regarding $q$ as a fixed probability distribution, the target to optimize is:
$$L_G(\theta) = KL\big(q(x) \,\|\, p_\theta(x)\big)$$
This objective has the attractive property that $q$ is a “fixed” distribution during training, i.e., if $D$ is sufficiently trained, then $q$ is always approximately the data generating distribution $p_d$. By defining the gradient as $\nabla L_G = \mathbb{E}_{q}[\nabla_\theta \log p_\theta(x)]$, we have the following importance sampling formula:
$$\nabla L_G = \mathbb{E}_{p'}\left[\frac{q(x)}{p'(x)} \nabla_\theta \log p_\theta(x)\right] = \frac{1}{Z(\theta')}\, \mathbb{E}_{p_\theta}\big[r_D(x)\, \nabla_\theta \log p_\theta(x)\big]$$
where we assume that $p' = p_\theta$ and the delayed generator is only one step behind the current update in the experiments. This importance sampling procedure was discovered independently of us by Hjelm et al. (2017). We propose to optimize the generator using the following novel gradient estimator:
$$\nabla L_G(\theta) \approx \sum_{i=1}^{m} \left( \frac{r_D(x_i)}{\sum_{j=1}^{m} r_D(x_j)} - b \right) \nabla_\theta \log p_\theta(x_i) \qquad (2)$$
where $x_1, \dots, x_m$ are samples from the generator and $b$ is a baseline from reinforcement learning in order to reduce variance. In practice, we let $b$ increase very slowly from 0 to 1. Combined with the objective of the discriminator in an ordinary GAN, we get the proposed MaliGAN algorithm as shown in Algorithm 1.
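As a rough illustration of the minibatch weighting described in this section, the following Python sketch computes one weight per generated sample. It assumes the per-sample score is the ratio D/(1 - D) implied by the optimal-discriminator property, normalized by its sum over the minibatch, with a baseline subtracted. The names `mali_weights` and `surrogate_loss` are hypothetical and do not come from the paper; a real implementation would take the log-probabilities from the generator and differentiate the surrogate with an autodiff framework.

```python
def mali_weights(d_outputs, baseline=0.0, eps=1e-8):
    """Return r_i / sum_j r_j - baseline, where r = D / (1 - D).

    d_outputs: discriminator outputs in (0, 1) for generated samples.
    eps guards against division by zero when D is numerically 1
    (an assumption of this sketch, not something the paper specifies).
    """
    ratios = [d / max(1.0 - d, eps) for d in d_outputs]
    total = sum(ratios)
    return [r / total - baseline for r in ratios]


def surrogate_loss(weights, log_probs):
    """Scalar whose gradient w.r.t. the log-probabilities is minus the
    weighted sum of their gradients; weights are treated as constants."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Toy minibatch of four generated samples.
d_out = [0.10, 0.40, 0.70, 0.20]
w = mali_weights(d_out, baseline=1.0 / len(d_out))
loss = surrogate_loss(w, [-3.2, -2.9, -3.5, -4.1])
```

With the baseline set to one over the minibatch size, the weights sum to zero, so only the relative quality of samples within the minibatch affects the update.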
3.2 Analysis
The proposed objective in Eq. 2 is also theoretically guaranteed to be sound. In the following theorem, we show that our training objective approximately optimizes the KL divergence $KL(q(x) \,\|\, p_\theta(x))$ when $D$ is close to optimal. What’s more, the objective still makes sense when $D$ is well trained but far from optimal.
Theorem 3.1.
We have the following two theoretical guarantees for our new training objective.
If discriminator $D(x)$ is optimal between delayed generator $p'$ and real data distribution $p_d$, we have the following equation.
$$\mathbb{E}_{p_d}\big[\nabla_\theta \log p_\theta(x)\big] = \frac{1}{Z}\, \mathbb{E}_{p'}\big[r_D(x)\, \nabla_\theta \log p_\theta(x)\big]$$
where $Z = \mathbb{E}_{p'}[r_D(x)] = 1$.
If $D$ is trained well but not sufficiently, namely, $\forall x$, $D(x)$ lies between 0.5 and $\frac{p_d(x)}{p_d(x) + p'(x)}$, we have the property that for $m \to \infty$, almost surely
(3) 
The above gives us a condition for our objective to still push the generator in a descent direction even when the discriminator is not trained to optimality.
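For the optimal-discriminator case, the normalizer can be worked out directly. The sketch below assumes the standard GAN result that the optimal discriminator between the real distribution $p_d$ and the delayed generator $p'$ is $D^* = p_d / (p_d + p')$, and that $p'(x) > 0$ wherever $p_d(x) > 0$; the notation is illustrative.

```latex
\frac{D^*(x)}{1 - D^*(x)} = \frac{p_d(x)}{p'(x)}
\quad\Longrightarrow\quad
\mathbb{E}_{x \sim p'}\!\left[\frac{D^*(x)}{1 - D^*(x)}\right]
  = \sum_{x} p'(x)\,\frac{p_d(x)}{p'(x)}
  = \sum_{x} p_d(x) = 1 .
```

By the same change of measure, $\mathbb{E}_{x \sim p'}\big[\tfrac{D^*(x)}{1 - D^*(x)} f(x)\big] = \mathbb{E}_{x \sim p_d}[f(x)]$ for any function $f$, which is why reweighting generator samples by this ratio recovers an expectation under the data distribution.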
In addition to its attractiveness in theory, we now demonstrate why the gradient estimator in Eq. 2 can in practice produce a better training signal for the generator than the original GAN objective. Similar discussions can be found in Bornschein & Bengio (2015) and Norouzi et al. (2016a).
In the original GAN setting from a reinforcement learning perspective, e.g. the inclusive KL in Eq. 1, the free-running autoregressive model can be viewed as an RL agent exploring the state space and getting a reward, $D$ or $\log D$, at the end of the exploration. The model then tries to adjust the probability of each of its exploration paths according to this reward. However, this gradient estimator becomes drastically inefficient when almost all generated paths have a very small discriminator output. Unfortunately, this is very common in GAN training and cannot be solved even with a carefully selected baseline.
In the MaliGAN objective, however, the partition function $Z$ is estimated using the samples from the minibatch, which helps deal with the above dilemma. When we choose, for example, the baseline $b = 1/m$ for a minibatch of $m$ samples, we can see that the weights on the generated paths sum to zero, and the probability of each path is adjusted not according to the absolute value of the discriminator output, but according to its relative quality in that minibatch. This ensures that the model can always learn something as long as some generations are better than others in that minibatch. Furthermore, the previous theorem ensures the consistency of the minibatch-level normalization procedure.
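The zero-sum claim can be checked in one line. This is a sketch under the assumption that each generated sample $x_i$ in a minibatch of size $m$ receives the weight $w_i = r_i / \sum_j r_j - b$, where $r_i > 0$ is a discriminator-derived score; the symbols $w_i$ and $r_i$ are illustrative names.

```latex
\sum_{i=1}^{m} w_i
  = \sum_{i=1}^{m} \left( \frac{r_i}{\sum_{j=1}^{m} r_j} - b \right)
  = 1 - m\,b ,
\qquad\text{so}\quad b = \tfrac{1}{m} \;\Longrightarrow\; \sum_{i=1}^{m} w_i = 0 .
```

With this choice, $w_i > 0$ exactly when $r_i$ exceeds the minibatch mean of the scores, so the sign of each update depends only on how a sample compares with the others in the same minibatch.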
From a theoretical point of view, this normalization procedure also helps. Although at the first glance, when is optimal, one can prove that , so estimating seems to only introduce additional variance to the model. However, using this estimator in fact reduces the variance due to the following reason: is actually a function with singularity when is in a region in the data space on which . Even with very careful pretraining, such a region and , making the ratio blow up. In our target , since it is almost impossible to get samples from with in a reasonable size minibatch, the actual distribution we are sampling from is a “regularized” distribution where and . So when doing importance sampling to estimate our training objective with small minibatches, we are actually doing normalizedweights importance sampling based on : . Since the Monte Carlo estimator has much more variance to estimate than , in practical minibatch training settings, we can view that we are doing importance sampling with the distribution , and this objective has much less variance compared to importance sampling with on which has an infinite singularity. This is why estimating is important in order to reduce the variance in the minibatch training setting.
When training autoregressive models with teacher forcing, a serious problem is exposure bias (Ranzato et al., 2016; Norouzi et al., 2016b; Lamb et al., 2016). Namely, the model is only trained on demonstrated behaviors (real data samples), but we also want it to be trained on free-running behaviors. When we set a positive baseline $b$, the model first generates samples, and then adjusts the probability of each generated sample by reinforcing the best behaviors and excluding the worse behaviors relative to those in the minibatch.
3.3 Variance Reduction in MaliGAN
The proposed renormalized objective in MaliGAN supports much more stable training behavior than the RL objective in a standard GAN. Nevertheless, when the long sequence generation procedure consists of multiple steps of random sampling, we find it is better to further integrate the following advanced variance reduction techniques.
3.3.1 Monte Carlo Tree Search
Instead of using the same weight for all time steps in one sample, we use the following formula which is well known in the RL literature:
where stands for the “expected total reward” given by of generating token given previous generation , which can be estimated with, e.g., Monte Carlo tree search (MCTS, Silver et al. (2016)).
Thus, following the gradient estimator presented in Theorem 3.1, we derive another gradient estimator:
where $m$ is the size of the minibatch. Using Monte Carlo tree search brings several benefits. First, it allows different steps of the generated sample to be adjusted with different weights. Second, it gives us a more stable estimator of the partition function $Z$. Both properties can dramatically reduce the variance of our proposed estimator.
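To make the idea of a per-step “expected total reward” concrete, here is a minimal Python sketch that estimates it with plain Monte Carlo rollouts, a simplification of tree search rather than MCTS itself. The functions `rollout_value`, `toy_next_token`, and `toy_reward` are hypothetical stand-ins: in the setting of this section the next-token sampler would be the generator and the reward would be derived from the discriminator.

```python
import random


def rollout_value(prefix, length, next_token, reward, n_rollouts, rng):
    """Average reward of random completions of `prefix` up to `length` tokens."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < length:
            seq.append(next_token(seq, rng))
        total += reward(seq)
    return total / n_rollouts


def toy_next_token(prefix, rng):
    # Hypothetical stand-in for sampling the next token from the generator.
    return rng.choice([0, 1])


def toy_reward(seq):
    # Hypothetical stand-in for a sequence-level reward from the discriminator.
    return sum(seq) / len(seq)


# Estimated value of having generated the token 1 as the first step.
q_est = rollout_value([1], 4, toy_next_token, toy_reward, 100, random.Random(0))
```

Each time step of a generated sample can then be weighted by its own estimate instead of sharing one sequence-level weight.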
3.3.2 Mixed MLE-Mali Training
When dealing with long sequences, the above model may result in accumulated variance. To alleviate the issue, we significantly reduce the variance by clamping the input to the training data for the first $N$ time steps, and switching to free-running mode for the remaining time steps. Then, during our training procedure, inspired by Ranzato et al. (2016), we slowly move $N$ from the full sequence length towards 0.
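The clamping scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear schedule in `anneal_prefix_length` is an assumption (the text only says the clamped length is decreased slowly), and `toy_next_token` is a hypothetical stand-in for sampling from the generator.

```python
import random


def anneal_prefix_length(step, total_steps, max_len):
    """Clamped prefix length, moved linearly from max_len down to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(max_len * (1.0 - frac))


def mixed_sample(real_seq, n_clamped, next_token, rng):
    """Copy the first n_clamped real tokens, then free-run to the same length."""
    n = min(n_clamped, len(real_seq))
    seq = list(real_seq[:n])
    while len(seq) < len(real_seq):
        seq.append(next_token(seq, rng))
    return seq


def toy_next_token(prefix, rng):
    # Hypothetical stand-in for sampling the next token from the generator.
    return rng.choice(["a", "b", "c"])


rng = random.Random(0)
real = ["the", "cat", "sat", "on", "mat"]
schedule = [anneal_prefix_length(s, 4, len(real)) for s in range(5)]
sample = mixed_sample(real, 3, toy_next_token, rng)
```

Early in training almost the whole sequence is clamped, which is close to teacher forcing; by the end the generator runs freely from the first token.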
The training objective is equivalent to setting in the last section to:
We also assume is trained on the real samples and fake samples generated by
Let , we have:
For each sample from the real data batch, if it has length larger than , we fix the first words of , and then sample times from our model till the end of the sequence, and get samples .
We then have the following series of minibatch estimators for each :
(4) 
One difference is that in this model, we normalize the coefficients based only on samples generated from a single real data sample. The reason for using this trick is explained in the next subsection.
We have the following theorem which guarantees the theoretical property of this estimator.
Theorem 3.2.
When is correctly trained but not optimal in the sense of Theorem 3.1, when , we almost surely have ,
(5) 
3.3.3 Single real data based renormalization
Many generative models have multiple layers of randomness. For example, in autoregressive models, the samples are generated via multiple sampling steps. Other examples include hierarchical generative models like deep Boltzmann machines and deep belief networks (Salakhutdinov & Hinton, 2009; Hinton, 2009).
In these models, high-level random variables are usually responsible for modeling high-level decisions or “modes” of the probability distribution. Changing them can have much larger effects than changing low-level variables. Motivated by this observation, in each minibatch we first draw a minibatch of samples (e.g. 32) of high-level latent variables, and then for each high-level value we draw a number of low-level data samples (e.g. 32). Then we re-estimate the partition function from the low-level samples that are generated by each high-level sample. Because lower-level sampling has a much smaller variance, the model can receive better gradient signals from the weights provided by the discriminator.
This sampling principle corresponds to applying the mixed MLE-Mali training discussed above in the autoregressive setting. In this case we first sample a few data samples, then fix the first $N$ words and let the network generate many completions after them as our next minibatch. We refer to this full algorithm as sequential MaliGAN with Mixed MLE Training, which is summarized in Algorithm 2.
Optional: Pretrain model using pure MLE with some epochs.
The reason why doing this single real sample based renormalization is beneficial can be summarized around two elements. First, consider is a sample from the training set. The first N words should be completed by our model. The conditional distribution should be much simpler than the full distribution . Namely, consists of only one or a few “modes”. So this renormalization technique can be viewed as trying to train the model on these simpler conditional distributions, which gives more stable gradients.
Second, this normalization scheme makes our model robust to mode missing, which is a common failure pattern when training GANs (Che et al., 2016). Single-sample-based renormalization ensures that for every real sample, the model can receive a moderately strong training signal for how to perform better on generating completions conditioned on its prefix. However, in batch-wise renormalization as in the basic MaliGAN, this is not possible because there might be some completions with very large weights, so the other training samples in that minibatch receive very little gradient signal.
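The contrast between the two normalization schemes can be illustrated with a small Python sketch. It assumes, as a labeled simplification, that each completion's score is the ratio D/(1 - D) of its discriminator output and that no baseline is subtracted; the function names are hypothetical.

```python
def ratio(d, eps=1e-8):
    """Discriminator ratio D / (1 - D), guarded against division by zero."""
    return d / max(1.0 - d, eps)


def per_prefix_weights(d_groups):
    """Normalize within each group of completions sharing one real prefix."""
    weights = []
    for group in d_groups:
        r = [ratio(d) for d in group]
        total = sum(r)
        weights.append([x / total for x in r])
    return weights


def batchwise_weights(d_groups):
    """Normalize over all completions in the minibatch at once."""
    flat = [ratio(d) for group in d_groups for d in group]
    total = sum(flat)
    it = iter(x / total for x in flat)
    return [[next(it) for _ in group] for group in d_groups]


# Two real prefixes with three completions each; one completion of the
# first prefix gets a discriminator output close to 1.
d_groups = [[0.99, 0.50, 0.40], [0.30, 0.20, 0.10]]
per_prefix = per_prefix_weights(d_groups)
batchwise = batchwise_weights(d_groups)
```

Under batch-wise normalization the single high-scoring completion absorbs almost all of the weight, leaving the second prefix with under one percent of the total, whereas per-prefix normalization gives every real sample a full unit of weight to distribute among its own completions.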
4 Experiments
To examine the effectiveness of the proposed algorithms, we conduct experiments on three discrete sequence generation tasks. We achieve promising results on all three tasks, including a standard and challenging language modeling task. From the empirical results and the following analysis, we demonstrate the soundness of MaliGAN and show its robustness to overfitting.
4.1 Discrete MNIST
We first evaluate MaliGAN on the binarized image generation task for the MNIST handwritten digits dataset, similar to Hjelm et al. (2017). The original dataset has 60,000 and 10,000 samples in the training and testing sets, respectively. We split the training set and randomly selected 10,000 samples for validation. We adopted as the generator a deep convolutional neural network based on the DCGAN architecture (Radford et al., 2015). To generate the discrete samples, we sample from the generator’s output binomial distribution. We adopt Algorithm 1 of MaliGAN for training and use the single latent variable renormalization technique for variance reduction.
To compare our proposed MaliGAN with models trained using the discriminator’s output as a direct reward, we also train a generator with the same network architecture, but use the output of the discriminator as the weight of generated samples. We denote it as the REINFORCE-like model. The comparison results are shown in Figure 1 and Figure 2.
The two figures in the first row show the training losses of the generator and discriminator for the proposed MaliGAN. We can see that the training process of MaliGAN with variance reduction techniques is stable and the loss curve is meaningful. The bottom two figures in Figure 2 are samples generated by the REINFORCE-like model and by MaliGAN. Clearly, the samples generated by MaliGAN have much better visual quality and closely resemble the training data.
4.2 Poem Generation
We examine the effectiveness of our model on a Chinese poem generation task. Typically, there are two genres of Chinese poems. We refer to those consisting of 5 or 7 Chinese characters in each short sentence as Poem-5 and Poem-7, respectively. We use the dataset provided in Zhang & Lapata (2014), and split it in the standard way (http://homepages.inf.ed.ac.uk/mlap/Data/EMNLP14/).
The generator is a one-layer LSTM (Hochreiter & Schmidhuber, 1997) with 32 hidden units for Poem-5 and 100 for Poem-7. Our discriminators are two-layer Bi-LSTMs with 32 hidden neurons. We denote our models trained with Algorithm 1 and Algorithm 2 as MaliGAN-basic and MaliGAN-full. We compare against two models: the autoregressive model with the same architecture but trained with maximum likelihood (MLE), and SeqGAN (Yu et al., 2017). Following Yu et al. (2017), we report the BLEU-2 scores (Papineni et al., 2002) in Table 1.
MaliGAN-full obtained the best BLEU-2 scores on both tasks, and MaliGAN-basic was the next best. Clearly, MLE lagged far behind despite the same architecture, which should be attributed to the inherent defect in the MLE teacher-forcing training framework. As pointed out by Wiseman & Rush (2016), BLEU might not be a proper evaluation metric, so we also calculate the perplexity (PPL) of these four models, obtaining qualitatively similar results. The best scores are reported in Table 1 and the perplexity curves are illustrated in Figure 3.

Model | Poem-5 BLEU-2 | Poem-5 PPL | Poem-7 BLEU-2 | Poem-7 PPL
MLE | 0.6934 | 564.1 | 0.3186 | 192.7
SeqGAN | 0.7389 | | |
MaliGAN-basic | 0.7406 | 548.6 | 0.4892 | 182.2
MaliGAN-full | 0.7628 | 542.7 | 0.5526 | 180.2

From the above figures, we can see how our models perform during the training procedure. Although with some oscillations, both MaliGAN-basic and MaliGAN-full achieved lower perplexity. In particular, on Poem-7 in Figure 3, both of our proposed models avoid overfitting, whereas MLE ends up overfitting. Comparing the training curve of MaliGAN-basic with that of MaliGAN-full, we find that the latter has less variance. This demonstrates the effectiveness of the advanced variance reduction techniques in our full model. The peak in the MLE curve on Poem-5 in Figure 3 is, however, unlikely to be a result of overfitting, because MLE “recovered” from it quickly and continued to converge until the end. In fact, we find it harder to train a stable MLE model on Poem-5 than on Poem-7. We conjecture this resulted from the intricate mutual influence between the improper evaluation and the small training data size.
4.3 Sentence-Level Language Modeling
We also examine the proposed algorithm on a more challenging task, sentence-level language modeling, which can be considered a fundamental task with applications to various discrete sequence generation tasks. To explore the possibilities and limitations of our algorithm, we conduct extensive experiments on the standard Penn Treebank (PTB) dataset (Marcus et al., 1993) through parameter searching and model ablations. For evaluation we report sentence-level perplexity, which is the averaged perplexity on all sentences in the test set. For simplicity and efficiency, we adopt a 1-layer GRU (Cho et al., 2014) as our generator, and use the same setting for the baseline model trained with standard teacher forcing (Williams & Zipser, 1989). We use a bidirectional GRU network as our discriminator. To stabilize training and provide good initialization for the generator, we first pretrain our generator on the training set using teacher forcing, then we train two models, MaliGAN-basic and MaliGAN-full. MaliGAN-basic is trained with Algorithm 1 without MCTS. MaliGAN-full is trained with Algorithm 2 with all the variance reduction techniques included.
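The metric can be sketched in a few lines of Python. This is one reading of “averaged perplexity on all sentences”: compute the perplexity of each sentence from its per-token natural-log probabilities, then take the arithmetic mean over sentences. The function names are hypothetical, and the paper does not state details such as how end-of-sentence tokens are counted.

```python
import math


def sentence_perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def sentence_level_perplexity(corpus_log_probs):
    """Arithmetic mean of the per-sentence perplexities."""
    ppls = [sentence_perplexity(s) for s in corpus_log_probs]
    return sum(ppls) / len(ppls)


# Toy corpus: every token of the first sentence has probability 0.5,
# every token of the second has probability 0.25.
corpus = [[math.log(0.5)] * 4, [math.log(0.25)] * 2]
avg_ppl = sentence_level_perplexity(corpus)
```

Note that this differs from corpus-level perplexity, which pools all tokens before exponentiating, so short sentences carry more influence here.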
Note that the computational cost of MCTS is very large, so we remove all sentences longer than 35 words in the training set. We set and at the beginning of the training and pretrain our discriminator to make it reliable enough to provide informative and correct signals for the generator. The perplexity shown in Table 2 is achieved by our best performing model, which has 200 hidden neurons and 200 dimensions for word embeddings.
 | MLE | MaliGAN-basic | MaliGAN-full
Valid-Perplexity | 141.9 | 131.6 | 128.0
Test-Perplexity | 138.2 | 125.3 | 123.8
From Table 2 we can see that the simplest model trained by MaliGAN reduced the perplexity of the baseline effectively. Both the basic and the full model, i.e., MaliGAN-basic and MaliGAN-full, obtained a notably lower perplexity than the MLE model. Although the PTB dataset is much more difficult, we obtain results consistent with Table 1. It is encouraging to see that our model is more robust to overfitting given the relatively small size of the PTB data. These results strengthen our confidence in applying our algorithm to even larger datasets, which we leave as future work.
The positive result again demonstrates the effectiveness of MaliGAN, whose primary component is the novel optimization objective we propose in Eq. 2. We also gain insights from the model ablation tests about the advanced variance reduction techniques described in Section 3.3. Combined with the perplexity curves in Figure 3, we can see that with the advanced techniques, MaliGAN-full was more stable during training and achieved somewhat lower perplexity scores than MaliGAN-basic. We believe these techniques will be beneficial in other similar problem settings.
5 Related Work
To improve the performance of discrete autoregressive models, some researchers aim to tackle the exposure bias problem, which is discussed in detail in Ranzato et al. (2016), Serban et al. (2016), and Wiseman & Rush (2016). The problem occurs when the training algorithm prevents models from being exposed to their own predictions during training. The second issue is the discrepancy between the objective during training and the evaluation metric during testing, which is analyzed in Ranzato et al. (2016) and then summarized as Loss-Evaluation Mismatch by Wiseman & Rush (2016). Typically, the objective in training autoregressive models is to maximize the word-level probabilities, while at test time we often evaluate the models using sequence-level metrics, such as BLEU (Papineni et al., 2002). To alleviate these two issues, the most straightforward way is to add the evaluation metrics into the objective in the training phase. Because these metrics are often discrete and cannot be utilized through standard backpropagation, researchers generally turn to reinforcement learning. Ranzato et al. (2016) exploit the REINFORCE algorithm (Williams, 1992) and propose several model variants to adapt the algorithm to text generation applications. Liu et al. (2016) share a similar idea and directly optimize image caption metrics through policy gradient methods (Igel, 2005). There exists a third issue, namely Label Bias, especially in the sequence-to-sequence learning framework, which prevents MLE-trained models from being optimized globally (Andor et al., 2016; Wiseman & Rush, 2016).
To address the above-mentioned issues in training autoregressive models, we propose to formulate the problem under the setting of generative adversarial networks. Initially proposed by Goodfellow et al. (2014), the generative adversarial network (GAN) has attracted a lot of attention because it provides a powerful framework to generate promising samples through a min-max game. Researchers have successfully applied GAN to generate promising images conditionally (Mirza & Osindero, 2014; Reed et al., 2016; Zhang et al., 2016b) and unconditionally (Radford et al., 2015; Nguyen et al., 2016), to realize image manipulation and super-resolution (Zhu et al., 2016; Sønderby et al., 2017; Ledig et al., 2016), and to produce video sequences (Mathieu et al., 2016; Zhou & Berg, 2016; Saito & Matsumoto, 2016). Despite these successes, the feasibility and advantages of applying GAN to text generation remain little explored yet noteworthy.
It is appealing to generate discrete sequences using GAN, as discussed above. The generative model is able to use the discriminator’s output to obtain information about its own distribution, which is inaccessible if it is trained by teacher forcing (Williams & Zipser, 1989; Ranzato et al., 2016). However, it is nontrivial to train GAN on discrete data due to its discontinuous nature. The instability inherent in GAN training makes things even worse (Salimans et al., 2016; Che et al., 2016; Arjovsky & Bottou, 2017; Arjovsky et al., 2017). Lamb et al. (2016) exploit adversarial domain adaptation to regularize the training of recurrent neural networks. Yu et al. (2017) apply GAN to discrete sequence generation by directly optimizing the discrete discriminator’s rewards. They adopt the Monte Carlo tree search technique (Silver et al., 2016). A similar technique has been employed in Li et al. (2017), which improves response generation by using adversarial learning.
In Bornschein & Bengio (2015), which inspired us, the authors propose a way of doing minibatch reweighting when training latent variable models with discrete variables. However, they make use of inference networks, which are infeasible in the GAN setting.
Our work is also closely related to Norouzi et al. (2016b), who propose to work with the objective in a conditional generation setting. In this case, the situation is similar to ours because rewards such as BLEU scores are available. However, conditional generation metrics such as BLEU scores decompose over time steps, which makes it possible to sample directly from the augmented distributions. This is not possible for sequence-level GANs, e.g., in language modeling, so we have to use importance sampling to train the model.
6 Discussions and Future Work
In spite of their great popularity on continuous datasets such as images, GANs haven’t yet achieved an equivalent success in discrete domains such as natural language processing. We observed that the main cause of this gap is that while the discriminator can almost perfectly discriminate the good samples from the bad ones, it is notoriously difficult to pass this information to the generator due to the difficulty of credit assignment through discrete computation and inherent instability of RL algorithms applied to dynamic environments with sparse reward.
In this work, we take a different approach. We start from the maximum likelihood training objective $KL(p_d \,\|\, p_\theta)$, and then use importance sampling combined with the discriminator output to derive a novel training objective. We argue that although this objective looks similar to the objective used in reinforcement learning, the normalization in fact does reduce the variance of the estimator by ignoring the region in the data space around the singularity of $\frac{D}{1-D}$, in which the generator has almost zero probability of producing samples. Namely, by estimating the partition function $Z$ using samples, we are approximately doing normalized importance sampling with another distribution which has much lower variance (cf. Section 3.2). Practically, this single real sample normalization process combined with mixed training (Ranzato et al., 2016) successfully avoided the missing mode problem by providing an equivalent training signal for each mode.
Besides reducing the variance of standard reinforcement learning algorithms, our algorithm is surprisingly robust to overfitting. Teacher forcing is prone to overfitting because, by maximizing the likelihood of the training data, the model can easily fit not only the regularities but also the noise in the data. In our model, however, if the generator fits too much of the noise in the data, the generated samples will not look good, and the discriminator should be able to capture the differences between generated and real samples easily.
As for future work, we plan to train the model on large datasets such as Google’s One Billion Word benchmark (Chelba et al., 2014) and on conditional generation tasks such as dialogue generation.
References
 Andor et al. (2016) Andor, Daniel, Alberti, Chris, Weiss, David, Severyn, Aliaksei, Presta, Alessandro, Ganchev, Kuzman, Petrov, Slav, and Collins, Michael. Globally normalized transition-based neural networks. CoRR, abs/1603.06042, 2016.
 Arjovsky & Bottou (2017) Arjovsky, Martín and Bottou, Léon. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.
 Arjovsky et al. (2017) Arjovsky, Martín, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
 Bengio et al. (2015) Bengio, Samy, Vinyals, Oriol, Jaitly, Navdeep, and Shazeer, Noam. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
 Bornschein & Bengio (2015) Bornschein, Jörg and Bengio, Yoshua. Reweighted wake-sleep. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Bowman et al. (2015) Bowman, Samuel R, Vilnis, Luke, Vinyals, Oriol, Dai, Andrew M, Jozefowicz, Rafal, and Bengio, Samy. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Che et al. (2016) Che, Tong, Li, Yanran, Jacob, Athul Paul, Bengio, Yoshua, and Li, Wenjie. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
 Chelba et al. (2014) Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, and Koehn, Philipp. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2014.
 Cho et al. (2014) Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Hinton (2009) Hinton, Geoffrey E. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
 Hjelm et al. (2017) Hjelm, R Devon, Jacob, Athul, Che, Tong, Cho, Kyunghyun, and Bengio, Yoshua. Boundary-seeking generative adversarial networks. 2017.
 Hochreiter & Jürgen Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
 Igel (2005) Igel, C. Policy gradient methods for reinforcement learning with function approximation. 2005.
 Lamb et al. (2016) Lamb, Alex, Goyal, Anirudh, Zhang, Ying, Zhang, Saizheng, Courville, Aaron C., and Bengio, Yoshua. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pp. 4601–4609, 2016.
 Ledig et al. (2016) Ledig, Christian, Theis, Lucas, Huszár, Ferenc, Caballero, Jose, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 Li et al. (2017) Li, Jiwei, Monroe, Will, Shi, Tianlin, Ritter, Alan, and Jurafsky, Dan. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
 Liu et al. (2016) Liu, Siqi, Zhu, Zhenhai, Ye, Ning, Guadarrama, Sergio, and Murphy, Kevin. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370, 2016.
 Marcus et al. (1993) Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
 Mathieu et al. (2016) Mathieu, Michael, Couprie, Camille, and LeCun, Yann. Deep multi-scale video prediction beyond mean square error. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 Miao et al. (2016) Miao, Yishu, Yu, Lei, and Blunsom, Phil. Neural variational inference for text processing. CoRR, abs/1511.06038, 2016.
 Mirza & Osindero (2014) Mirza, Mehdi and Osindero, Simon. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Nguyen et al. (2016) Nguyen, Anh, Yosinski, Jason, Bengio, Yoshua, Dosovitskiy, Alexey, and Clune, Jeff. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
 Norouzi et al. (2016a) Norouzi, Mohammad, Bengio, Samy, Chen, Zhifeng, Jaitly, Navdeep, Schuster, Mike, Wu, Yonghui, and Schuurmans, Dale. Reward augmented maximum likelihood for neural structured prediction. In NIPS, 2016a.
 Norouzi et al. (2016b) Norouzi, Mohammad, Bengio, Samy, Jaitly, Navdeep, Schuster, Mike, Wu, Yonghui, Schuurmans, Dale, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731, 2016b.
 Papineni et al. (2002) Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
 Radford et al. (2015) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Ranzato et al. (2016) Ranzato, Marc’Aurelio, Chopra, Sumit, Auli, Michael, and Zaremba, Wojciech. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 Reed et al. (2016) Reed, Scott, Akata, Zeynep, Yan, Xinchen, Logeswaran, Lajanugen, Schiele, Bernt, and Lee, Honglak. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
 Saito & Matsumoto (2016) Saito, Masaki and Matsumoto, Eiichi. Temporal generative adversarial nets. arXiv preprint arXiv:1611.06624, 2016.
 Salakhutdinov & Hinton (2009) Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep Boltzmann machines. In AISTATS, volume 1, pp. 3, 2009.
 Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian J., Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016.
 Serban et al. (2016) Serban, Iulian V, Sordoni, Alessandro, Bengio, Yoshua, Courville, Aaron, and Pineau, Joelle. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI-16, 2016.
 Serban et al. (2017) Serban, Iulian V, Sordoni, Alessandro, Lowe, Ryan, Charlin, Laurent, Pineau, Joelle, Courville, Aaron, and Bengio, Yoshua. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
 Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sønderby et al. (2017) Sønderby, Casper Kaae, Caballero, Jose, Theis, Lucas, Shi, Wenzhe, and Huszár, Ferenc. Amortised MAP inference for image super-resolution. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
 Sordoni et al. (2015) Sordoni, Alessandro, Galley, Michel, Auli, Michael, Brockett, Chris, Ji, Yangfeng, Mitchell, Margaret, Nie, Jian-Yun, Gao, Jianfeng, and Dolan, William B. A neural network approach to context-sensitive generation of conversational responses. In HLT-NAACL, 2015.
 Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Williams & Zipser (1989) Williams, Ronald J and Zipser, David. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
 Wiseman & Rush (2016) Wiseman, Sam and Rush, Alexander M. Sequence-to-sequence learning as beam-search optimization. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
 Yu et al. (2017) Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
 Zhang et al. (2016a) Zhang, Biao, Xiong, Deyi, Su, Jinsong, Duan, Hong, and Zhang, Min. Variational neural machine translation. In EMNLP, 2016a.
 Zhang et al. (2016b) Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Huang, Xiaolei, Wang, Xiaogang, and Metaxas, Dimitris. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv:1612.03242, 2016b.
 Zhang & Lapata (2014) Zhang, Xingxing and Lapata, Mirella. Chinese poetry generation with recurrent neural networks. In EMNLP, pp. 670–680, 2014.
 Zhou & Berg (2016) Zhou, Yipin and Berg, Tamara L. Learning temporal transformations from time-lapse videos. In European Conference on Computer Vision, pp. 262–277. Springer, 2016.
 Zhu et al. (2016) Zhu, Jun-Yan, Krähenbühl, Philipp, Shechtman, Eli, and Efros, Alexei A. Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV), 2016.