Maximum-Likelihood Augmented Discrete Generative Adversarial Networks

by   Tong Che, et al.

Despite their success in capturing continuous distributions, the application of generative adversarial networks (GANs) to discrete settings, such as natural language tasks, is rather restricted. The fundamental reason is the difficulty of back-propagation through discrete random variables, combined with the inherent instability of the GAN training objective. To address these problems, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. Instead of directly optimizing the GAN objective, we derive a novel and low-variance objective, using the discriminator's output, that corresponds to the log-likelihood. Compared with the original, the new objective is proved to be consistent in theory and beneficial in practice. The experimental results on various discrete datasets demonstrate the effectiveness of the proposed approach.



1 Introduction

Generative models are appealing because they provide ways to obtain insights into the underlying data distribution and statistics. In particular, these models play a pivotal role in many natural language processing tasks such as language modeling, machine translation, and dialogue generation. However, the generated sentences are often unsatisfactory (Sordoni et al., 2015; Bowman et al., 2015; Serban et al., 2017; Wiseman & Rush, 2016). For example, they often lack consistency in long-term semantics and are less coherent in high-level topics and syntax (Bowman et al., 2015; Zhang et al., 2016a).

This is largely attributed to a defect in the dominant training approach for existing discrete generative models. To generate discrete sequences, it is popular to train auto-regressive models with teacher forcing (Williams & Zipser, 1989), which nevertheless causes the exposure bias problem (Ranzato et al., 2016). The existing approach trains auto-regressive models to maximize the conditional probabilities of next tokens given the ground-truth histories. In other words, during training, auto-regressive generative models are only exposed to ground truths from the data distribution rather than to samples from the model distribution, i.e., their own predictions. This prevents the trained model from learning in the context of its own previously generated words when making the next prediction, resulting in a bias and in difficulty approaching the true underlying distribution (Ranzato et al., 2016; Bengio et al., 2015). Another limitation of teacher forcing is that it is inapplicable to auto-regressive models with latent random variables, which have performed better than auto-regressive (deterministic state) recurrent neural networks (i.e., the usual RNNs, LSTMs or GRUs) on multiple tasks (Serban et al., 2017; Miao et al., 2016; Zhang et al., 2016a).

An alternative and attractive solution for training auto-regressive models is to use generative adversarial networks (GANs) (Goodfellow et al., 2014). The problem discussed above can be prevented if the generative model is able to visit its own predictions during training and has an overall view of the generated sequences. We propose to facilitate the training of auto-regressive models with an additional discriminator under the GAN setting. With a discriminator trained to separate real from generated sequences, the generative model is able to use the knowledge of the discriminator to improve itself. Since the discriminator is trained on the entire sequence, it can in principle provide the training signal needed to avoid the problem of exposure bias.

However, it is nontrivial to apply GANs to discrete data, as it is difficult to optimize the generator using the signal provided by the discriminator. In fact, it is usually very hard, if not impossible, to push the generated distribution towards the real data distribution by moving a generated sequence (e.g., a faulty sentence) towards a “true” one (e.g., a correct sentence) in a high-dimensional discrete state space. As standard back-propagation fails in discrete settings, the generator can be optimized using the discriminator’s output as a reward via reinforcement learning. Unfortunately, even with careful pre-training, we found that the policy has difficulty obtaining positive and stable reward signals from the discriminator.

To tackle these limitations, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks (MaliGAN). At the core of this model is a novel GAN training objective that sidesteps the stability issue arising when the discriminator output is used as a direct reinforcement learning reward. Instead, we develop a normalized maximum likelihood optimization target inspired by Norouzi et al. (2016b). We use importance sampling and several variance reduction techniques in order to successfully optimize this objective. The procedure was discovered independently of us by Hjelm et al. (2017) in the context of image generation.

The new target brings several attractive properties in the proposed MaliGAN. First, it is theoretically consistent and easier to optimize (Section 3.2). Second, it allows the model not only to maximize the likelihood of good behaviors, but also to minimize the likelihood of bad behaviors, with the help of a GAN discriminator. Equipped with these strengths, the model focuses more on improving itself by gaining beneficial knowledge that is not yet well acquired, and excluding the most probable and harmful behaviors. Combined with several proposed variance reduction techniques, the proposed MaliGAN successfully and stably models discrete data sequences (Section 4).

2 Preliminaries and Overview

The basic framework for discrete sequence generation is to fit a set of data coming from an underlying generating distribution $p_d$ by training a parameterized auto-regressive probabilistic model $p_\theta$.

In this work, we aim to generate discrete data, especially discrete sequential data, under the GAN setting (Goodfellow et al., 2014). GAN defines a framework for training generative models by posing it as a minimax game against a discriminative model. The goal of the generator $G$ is to match its distribution $p_g$ to the real data distribution $p_d$. To achieve this, the generator transforms noise $z$ sampled from $p(z)$ to a data sample $G(z)$. Following this, the discriminator $D$ is trained to distinguish between the samples coming from $p_g$ and $p_d$, and can be used to provide a training signal to the generator.

When applying the GAN framework to discrete data, the discontinuity prevents the generator parameters from being updated via standard back-propagation. One way to tackle this is to employ a typical reinforcement learning (RL) strategy that directly uses the GAN discriminator’s output, $D$ or $\log D$, as a reward. In practice, the problem is usually solved by REINFORCE-like algorithms (Williams, 1992), perhaps with some variance reduction techniques.

Formally, we train a generator $p_\theta(x)$ together with a discriminator $D(x)$. In its original form, the discriminator is trained to distinguish between the generating distribution $p_\theta$ and the real data distribution $p_d$. The generator is then trained to maximize $E_{x \sim p_\theta}[\log D(x)]$. Namely, the objective for the generator to optimize is as follows:

$$\max_\theta \; E_{x \sim p_\theta}[\log D(x)]$$

Our work is related to the viewpoint of casting the GAN training as a reinforcement learning problem with a moving reward signal monotone in $D(x)$. Define the normalized probability distribution

$$q'(x) = \frac{1}{Z(D)} D(x)$$

in some bounded region to guarantee integrability (note that $q'$ is an approximation to $p_d$ if $D$ is well trained) and also put a maximum-entropy regularizer $H(p_\theta)$ to encourage diversity, yielding the regularized loss:

$$L_{GAN}(\theta) = -E_{x \sim p_\theta}[\log D(x)] - H(p_\theta) = KL(p_\theta \,\|\, q') + c(D) \qquad (1)$$

where $c(D)$ is a constant depending only on $D$. Hence, optimizing the traditional GAN is basically equivalent to optimizing the KL-divergence $KL(p_\theta \,\|\, q')$. One major problem with this approach is that $q'$ always moves with $D$, which is undesirable for both stability and convergence. When we have some samples $x \sim p_\theta$, we want to change $\theta$ a bit in order to adjust the likelihood of these samples and improve the quality of the generator. However, since $p_\theta$ initially generates very bad sequences, it has little chance of generating good sequences that receive positive rewards. Although dedicated pre-training and variance reduction mechanisms help (Yu et al., 2017), the RL algorithm based on the moving reward signal still seems very unstable and does not work on large-scale datasets.

We therefore propose to utilize the information of the discriminator as an additional source of training signals, on top of the maximum-likelihood objective. We employ importance sampling to make the objective trainable. The novel training objective has much less variance than that of vanilla reinforcement learning approaches that directly adopt $D$ or $\log D$ as reward signals. The analysis and discussion are presented in more detail in Section 3.2.

3 Maximum-Likelihood Augmented Discrete Generative Adversarial Networks

In this section, we present the details of the proposed model. At the heart of this model is a novel training objective that significantly reduces the variance during training. We provide a theoretical and practical analysis of the objective’s equivalence and its attractive properties. We also show how this core algorithm can be combined with several variance reduction techniques to form the full MaliGAN algorithm for discrete sequence generation.

3.1 Basic Model of MaliGAN

We propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks (MaliGAN) to generate discrete data. With MaliGAN, we train a discriminator with the standard objective that GAN employs. What differs from GANs is a novel objective for the generator, optimized using importance sampling, which makes the training procedure closer to maximum likelihood (MLE) training of auto-regressive models and thus more stable, with less variance in the gradients.

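To do so, we keep a delayed copy $p'(x)$ of the generator whose parameters are updated less often in order to stabilize training. From the basic property of GANs, we know that an optimal $D$ has the property $D(x) = \frac{p_d(x)}{p_d(x) + p'(x)}$. So in this case, we have $p_d(x) = \frac{D(x)}{1 - D(x)} p'(x)$. Therefore, we set the target distribution $q$ for maximum likelihood training to be $\frac{D}{1 - D} p'$. Letting $r_D(x) = \frac{D(x)}{1 - D(x)}$, we define the augmented target distribution as:

$$q(x) = \frac{1}{Z(\theta')} \frac{D(x)}{1 - D(x)} p'(x) = \frac{r_D(x)}{Z(\theta')} p'(x)$$

Regarding $q$ as a fixed probability distribution, the target to optimize is:

$$L_G(\theta) = KL(q(x) \,\|\, p_\theta(x))$$

This objective has the attractive property that $q$ is a “fixed” distribution during training, i.e., if $D$ is sufficiently trained, then $q$ is always approximately the data generating distribution $p_d$. By defining the gradient as $\nabla L_G = E_q[\nabla_\theta \log p_\theta(x)]$ (the ascent direction of the likelihood under $q$), we have the following importance sampling formula:

$$\nabla L_G = E_{p'}\left[\frac{q(x)}{p'(x)} \nabla_\theta \log p_\theta(x)\right] = \frac{1}{Z} E_{p_\theta}\left[r_D(x) \nabla_\theta \log p_\theta(x)\right]$$

where we assume that $p' = p_\theta$ and the delayed generator is only one step behind the current update in the experiments. This importance sampling procedure was discovered independently of us by Hjelm et al. (2017). We propose to optimize the generator using the following novel gradient estimator:

$$\nabla L_G(\theta) \approx \sum_{i=1}^{m} \left( \frac{r_D(x_i)}{\sum_{j=1}^{m} r_D(x_j)} - b \right) \nabla_\theta \log p_\theta(x_i) \qquad (2)$$

where $b$ is a baseline from reinforcement learning used to reduce variance. In practice, we let $b$ increase very slowly from 0 to 1. Combined with the objective of the discriminator in an ordinary GAN, we get the proposed MaliGAN algorithm as shown in Algorithm 1.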

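Require: A generator $p_\theta$ with parameters $\theta$. A discriminator $D(x)$ with parameters $\theta_d$. A baseline $b$.
for number of training iterations do
    for k steps do
        Sample a minibatch of samples $\{x_i\}_{i=1}^{m}$ from $p_\theta$.
        Sample a minibatch of samples $\{y_i\}_{i=1}^{m}$ from $p_d$.
        Update the parameters of the discriminator by gradient ascent on the discriminator loss: $\sum_i \nabla_{\theta_d} \log D(y_i) + \sum_i \nabla_{\theta_d} \log(1 - D(x_i))$
    end for
    Sample a minibatch of samples $\{x_i\}_{i=1}^{m}$ from $p_\theta$.
    Update the generator by applying the gradient update $\sum_{i=1}^{m} \left( \frac{r_D(x_i)}{\sum_j r_D(x_j)} - b \right) \nabla_\theta \log p_\theta(x_i)$, with $r_D(x) = D(x) / (1 - D(x))$
end for
Algorithm 1 MaliGAN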
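
As a rough illustration of the generator update in Algorithm 1, the following NumPy sketch computes the mini-batch weights and combines them with per-sample score gradients. It assumes the discriminator ratio $r_D(x) = D(x) / (1 - D(x))$, which follows from the optimal-discriminator identity; the function names `mali_weights` and `generator_gradient`, the clipping constant `eps`, and the toy inputs are hypothetical and not taken from the paper.

```python
import numpy as np

def mali_weights(d_out, baseline=0.0, eps=1e-8):
    """Batch-normalized weights r_D(x_i) / sum_j r_D(x_j) - b.

    d_out holds discriminator outputs D(x_i) in (0, 1) for generated samples.
    The clipping by eps is an assumption added to avoid division by zero.
    """
    d = np.clip(np.asarray(d_out, dtype=np.float64), eps, 1.0 - eps)
    r = d / (1.0 - d)                 # ratio r_D(x) = D(x) / (1 - D(x))
    return r / r.sum() - baseline     # normalize within the mini-batch

def generator_gradient(d_out, grad_log_p, baseline=0.0):
    """Weighted sum of per-sample gradients of log p_theta(x_i).

    grad_log_p has shape (m, n_params); in a real model these rows would
    come from automatic differentiation.
    """
    w = mali_weights(d_out, baseline)
    return (w[:, None] * np.asarray(grad_log_p, dtype=np.float64)).sum(axis=0)

# Toy mini-batch of three generated samples (hypothetical values).
d_out = np.array([0.1, 0.5, 0.9])
weights = mali_weights(d_out)
grad = generator_gradient(d_out, np.eye(3))
```

With the identity matrix standing in for the score gradients, `grad` simply reproduces the weights, which makes it easy to see that the sample the discriminator rates highest dominates the update.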

3.2 Analysis

The proposed objective in Eq. 2 is also theoretically sound. In the following theorem, we show that our training objective approximately optimizes the KL divergence $KL(p_d \,\|\, p_\theta)$ when $D$ is close to optimal. Moreover, the objective still makes sense when $D$ is well trained but far from optimal.

Theorem 3.1.

We have the following two theoretical guarantees for our new training objective.

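If the discriminator $D(x)$ is optimal between the delayed generator $p'$ and the real data distribution $p_d$, we have the following equation:

$$E_{p_d}[\log p_\theta(x)] = \frac{1}{Z} E_{p'}\left[r_D(x) \log p_\theta(x)\right]$$

where $Z = E_{p'}[r_D(x)] = 1$ and $r_D(x) = D(x) / (1 - D(x))$.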

If $D(x)$ is trained well but not sufficiently, namely, for all $x$, $D(x)$ lies between 0.5 and $\frac{p_d(x)}{p_d(x) + p_\theta(x)}$, we have the property that for $m \to \infty$, almost surely


The above gives us a condition for our objective to still push the generator in a descent direction even when the discriminator is not trained to optimality.

In addition to its attractiveness in theory, we now demonstrate why the gradient estimator in Eq. 2 can in practice produce a better training signal for the generator than the original GAN objective. Similar discussions can be found in Bornschein & Bengio (2015) and Norouzi et al. (2016a).

In the original GAN setting viewed from a reinforcement learning perspective, e.g., the KL objective in Eq. 1, the free-running auto-regressive model can be viewed as an RL agent exploring the state space and receiving a reward, $D$ or $\log D$, at the end of the exploration. The model then tries to adjust the probability of each of its exploration paths according to this reward. However, this gradient estimator is drastically inefficient when almost all generated paths have a very small discriminator output. Unfortunately, this is very common in GAN training and cannot be solved even with a carefully selected baseline.

In the MaliGAN objective, however, the partition function $Z$ is estimated using the samples from the mini-batch, which helps to deal with the above dilemma. When we choose, for example, the baseline $b = 1/m$, the weights on the generated paths sum to zero, and the probability of each path is adjusted not according to the absolute value of the discriminator output, but according to its relative quality within that mini-batch. This ensures that the model can always learn something as long as some generations are better than others in that mini-batch. Furthermore, the previous theorem ensures the consistency of the mini-batch level normalization procedure.
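
To make the contrast concrete, the short NumPy sketch below compares a raw-reward signal with the batch-normalized weights in the regime where every generated sample receives a tiny discriminator output. The numbers are made up for illustration, and using a mean-subtracted $D$ as the REINFORCE-style reward is an assumption about the baseline, not a detail from the paper.

```python
import numpy as np

# Hypothetical discriminator outputs: every generated sample looks fake.
d_out = np.array([1e-4, 2e-4, 4e-4, 8e-4])
m = d_out.size

# REINFORCE-style signal: raw reward D(x) with a mean baseline (assumption).
raw_reward = d_out - d_out.mean()

# MaliGAN-style signal: normalized ratio r_D with baseline b = 1/m.
r = d_out / (1.0 - d_out)
mali = r / r.sum() - 1.0 / m

print("largest raw-reward magnitude:", np.abs(raw_reward).max())
print("largest normalized-weight magnitude:", np.abs(mali).max())
print("sum of normalized weights:", mali.sum())
```

The raw rewards are all close to zero, so the update barely moves the generator, whereas the normalized weights keep a magnitude on the order of $1/m$ and still rank the samples against each other.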

From a theoretical point of view, this normalization procedure also helps. At first glance, when $D$ is optimal, one can prove that $Z = 1$, so estimating $Z$ seems only to introduce additional variance to the model. However, using this estimator in fact reduces the variance for the following reason: is actually a function with singularity when is in a region in the data space on which . Even with very careful pre-training, such a region and , making the ratio blow up. In our target , since it is almost impossible to get samples from with in a reasonable size mini-batch, the actual distribution we are sampling from is a “regularized” distribution where and . So when doing importance sampling to estimate our training objective with small mini-batches, we are actually doing normalized-weights importance sampling based on : . Since the Monte Carlo estimator has much more variance to estimate than , in practical mini-batch training settings, we can view that we are doing importance sampling with the distribution , and this objective has much less variance compared to importance sampling with on which has an infinite singularity. This is why estimating $Z$ is important in order to reduce the variance in the mini-batch training setting.

When training auto-regressive models with teacher forcing, a serious problem is exposure bias (Ranzato et al., 2016; Norouzi et al., 2016b; Lamb et al., 2016). Namely, the model is only trained on demonstrated behaviors (real data samples), but we also want it to be trained on free-running behaviors. When we set a positive baseline $b > 0$, the model first generates samples, and then adjusts the probability of each generated sample by reinforcing the best behaviors and excluding the worst behaviors relative to those in the mini-batch.

3.3 Variance Reduction in MaliGAN

The proposed renormalized objective in MaliGAN supports much more stable training behavior than the RL objective in a standard GAN. Nevertheless, when the long sequence generation procedure consists of multiple steps of random sampling, we find it is better to further integrate the following advanced variance reduction techniques.

3.3.1 Monte Carlo Tree Search

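Instead of using the same weight for all time steps in one sample, we use the following formula, which is well known in the RL literature:

$$\nabla L_G = E_{p_\theta}\left[ \sum_{t=1}^{L} Q(a_t, s_t) \nabla_\theta \log p_\theta(a_t \mid s_t) \right]$$

where $L$ is the sequence length and $Q(a, s)$ stands for the “expected total reward” given by the discriminator for generating token $a$ given the previous generation $s$, which can be estimated with, e.g., Monte Carlo tree search (MCTS; Silver et al., 2016).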
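
A full tree search is beyond a short example, but the idea of a per-step weight can be sketched with plain Monte Carlo rollouts, a simpler stand-in for MCTS. The sketch below estimates the expected reward of a prefix by averaging the discriminator ratio $D / (1 - D)$ over random completions. The helper names `rollout_q`, `uniform_next`, and `toy_d`, the fixed sequence length, and the toy discriminator are all hypothetical.

```python
import numpy as np

def rollout_q(prefix, seq_len, sample_next, d_fn, n_rollouts, rng):
    """Monte Carlo estimate of the expected reward of `prefix`.

    The prefix (previous tokens plus the candidate token) is completed
    n_rollouts times with sample_next, and the discriminator ratio
    D / (1 - D) of each full sequence is averaged.
    """
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(sample_next(seq, rng))
        d = min(max(d_fn(seq), 1e-8), 1.0 - 1e-8)  # keep D strictly inside (0, 1)
        total += d / (1.0 - d)
    return total / n_rollouts

def uniform_next(seq, rng):
    """Toy generator: draws a binary token uniformly at random."""
    return int(rng.integers(0, 2))

def toy_d(seq):
    """Toy discriminator: prefers sequences containing more ones."""
    return 0.1 + 0.8 * sum(seq) / len(seq)

rng = np.random.default_rng(0)
q_one = rollout_q([1], 6, uniform_next, toy_d, 200, rng)
q_zero = rollout_q([0], 6, uniform_next, toy_d, 200, rng)
```

Under this toy discriminator, starting with token 1 yields a higher estimate than starting with token 0, so the two first-step choices would receive different weights even though they belong to sequences drawn from the same generator.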

Thus, following the gradient estimator presented in Theorem 3.1, we derive another gradient estimator:

where $m$ is the size of the mini-batch. Using Monte Carlo tree search brings several benefits. First, it allows different steps of the generated sample to be adjusted with different weights. Second, it gives us a more stable estimator of the partition function $Z$. Together, these two properties can dramatically reduce the variance of our proposed estimator.

3.3.2 Mixed MLE-Mali Training

When dealing with long sequences, the above model may result in accumulated variance. To alleviate the issue, we significantly reduce the variance by clamping the input to the training data for the first $N$ time steps and switching to free-running mode for the remaining $T - N$ time steps. Then, during our training procedure, inspired by Ranzato et al. (2016), we slowly move $N$ from $T$ towards 0.

The training objective is equivalent to setting in the last section to:

We also assume is trained on the real samples and fake samples generated by

Let , we have:

For each sample from the real data batch, if its length is larger than $N$, we fix its first $N$ words and then repeatedly sample from our model until the end of the sequence, obtaining a set of completions that share this prefix.

We then have the following series of mini-batch estimators for each :


One difference is that in this model, we normalize the coefficients based only on the samples generated from a single real data sample. The reason for using this trick is explained in the next subsection.
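
The difference between the two normalization schemes can be sketched in NumPy as follows. Each row of `d_out` is assumed to hold the discriminator outputs for the completions of one real sample; the function name `per_sample_weights` and all numbers are hypothetical, and the ratio $D / (1 - D)$ is used as the unnormalized weight.

```python
import numpy as np

def per_sample_weights(d_out, baseline=0.0, eps=1e-8):
    """Normalize discriminator ratios within each row.

    d_out has shape (n_real, n_completions): row i holds D(x_ij) for the
    completions generated from the prefix of real sample i.
    """
    d = np.clip(np.asarray(d_out, dtype=np.float64), eps, 1.0 - eps)
    r = d / (1.0 - d)
    return r / r.sum(axis=1, keepdims=True) - baseline

# Two real samples, three completions each (hypothetical values).
# Completions of the first prefix fool the discriminator; the others do not.
d_out = np.array([[0.99, 0.98, 0.97],
                  [0.10, 0.20, 0.05]])

row_wise = per_sample_weights(d_out)

# Batch-wise normalization over all six completions, for comparison.
r = d_out / (1.0 - d_out)
batch_wise = r / r.sum()

print("weight mass on second prefix, batch-wise:", batch_wise[1].sum())
print("weight mass on second prefix, row-wise:  ", row_wise[1].sum())
```

With batch-wise normalization nearly all weight goes to the first prefix, while row-wise normalization gives every real sample a full unit of weight to distribute over its own completions.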

We have the following theorem which guarantees the theoretical property of this estimator.

Theorem 3.2.

When is correctly trained but not optimal in the sense of Theorem  3.1, when , we almost surely have ,


3.3.3 Single real data based renormalization

Many generative models have multiple layers of randomness. For example, in auto-regressive models, the samples are generated via multiple sampling steps. Other examples include hierarchical generative models like deep Boltzmann machines and deep belief networks (Salakhutdinov & Hinton, 2009; Hinton, 2009).

In these models, high-level random variables are usually responsible for modeling high-level decisions or “modes” of the probability distribution. Changing them can have much larger effects than changing low-level variables. Motivated by this observation, in each mini-batch we first draw a number of samples (e.g., 32) of the high-level latent variables, and then for each high-level value we draw a number of low-level data samples (e.g., 32). We then re-estimate the partition function from the low-level samples generated by each high-level sample. Because lower-level sampling has a much smaller variance, the model can receive better gradient signals from the weights provided by the discriminator.

This sampling principle corresponds to applying the mixed MLE-Mali training discussed above in the auto-regressive setting. In this case, we first sample a few data samples, then fix the first $N$ words and let the network generate many completions from that point on as our next mini-batch. We refer to this full algorithm as sequential MaliGAN with Mixed MLE Training, which is summarized in Algorithm 2.

Require: A generator $p_\theta$ with parameters $\theta$. A discriminator $D(x)$ with parameters $\theta_d$. Maximum sequence length $T$, step size $K$. A baseline $b$, sampling multiplicity $m$.
$N = T$
Optional: Pretrain the model using pure MLE for some epochs.
for number of training iterations do
    $N = N - K$
    for k steps do
        Sample a minibatch of sequences $\{y_i\}$ from $p_d$.
        While keeping the first $N$ steps the same as $\{y_i\}$, sample a minibatch of sequences $\{x_i\}$ from $p_\theta$ from time step $N$.
        Update the discriminator by gradient ascent on the discriminator loss.
    end for
    Sample a minibatch of sequences $\{x_i\}$ from $p_d$.
    For each sample $x_i$ with length larger than $N$ in the mini-batch, clamp the generator to the first $N$ words of $x_i$, and freely run the model to generate $m$ samples $x_{i,j}$, $j = 1, \dots, m$, till the end of the sequence.
    Update the generator by applying the mixed MLE-Mali gradient update.
end for
Algorithm 2 Sequential MaliGAN with Mixed MLE Training
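
The two mechanical ingredients of Algorithm 2, clamping to a real prefix and shrinking the clamped length over training, can be sketched as below. This is only an illustration under simplifying assumptions: completions run to the length of the real sequence rather than to an end-of-sequence token, the generator is a toy sampler, and the names `complete_from_prefix`, `clamp_schedule`, and `toy_next` are hypothetical.

```python
import numpy as np

def complete_from_prefix(real_seq, n_clamp, n_samples, sample_next, rng):
    """Clamp to the first n_clamp tokens of real_seq, then free-run.

    Returns n_samples completions that share the real prefix. Generation
    stops at len(real_seq), a simplification of sampling until the end
    of the sequence.
    """
    prefix = list(real_seq[:n_clamp])
    completions = []
    for _ in range(n_samples):
        seq = list(prefix)
        while len(seq) < len(real_seq):
            seq.append(sample_next(seq, rng))
        completions.append(seq)
    return completions

def clamp_schedule(max_len, step):
    """Yield the clamped length after each decrement, floored at zero."""
    n = max_len
    while n > 0:
        n = max(n - step, 0)
        yield n

def toy_next(seq, rng):
    """Toy generator: uniform over a six-token vocabulary."""
    return int(rng.integers(0, 6))

rng = np.random.default_rng(0)
completions = complete_from_prefix([3, 1, 4, 1, 5], 2, 4, toy_next, rng)
schedule = list(clamp_schedule(10, 4))
```

Every completion starts with the real tokens `[3, 1]`, so the discriminator weights for these samples only compare alternative continuations of the same prefix.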

The benefit of this single real sample based renormalization can be summarized in two points. First, suppose $x$ is a sample from the training set whose first $N$ words are given and should be completed by our model. The conditional distribution $p_d(x_{>N} \mid x_{\le N})$ should be much simpler than the full distribution $p_d(x)$. Namely, $p_d(x_{>N} \mid x_{\le N})$ consists of only one or a few “modes”. So this renormalization technique can be viewed as training the model on these simpler conditional distributions, which gives more stable gradients.

Second, this normalization scheme makes our model robust to mode missing, which is a common failure pattern when training GANs (Che et al., 2016). Single sample based renormalization ensures that for every real sample $x$, the model receives a moderately strong training signal for how to better generate $x_{>N}$ conditioned on $x_{\le N}$. In batch-wise renormalization as in the basic MaliGAN, however, this is not possible, because some completions may have very large discriminator weights, so that the other training samples in that mini-batch receive very little gradient signal.

4 Experiments

To examine the effectiveness of the proposed algorithms, we conduct experiments on three discrete sequence generation tasks. We achieve promising results on all three tasks, including a standard and challenging language modeling task. From the empirical results and the following analysis, we demonstrate the soundness of MaliGAN and show its robustness to overfitting.

4.1 Discrete MNIST

We first evaluate MaliGAN on the binarized image generation task for the MNIST hand-written digits dataset, similar to Hjelm et al. (2017). The original dataset has 60,000 and 10,000 samples in the training and testing sets, respectively. We split the training set and randomly selected 10,000 samples for validation. As the generator, we adopted a deep convolutional neural network based on the DCGAN architecture (Radford et al., 2015). To generate the discrete samples, we sample from the generator’s output binomial distribution. We adopt Algorithm 1 of MaliGAN for training and use the single latent variable renormalization technique for variance reduction.

To compare our proposed MaliGAN with the models trained using the discriminator’s output as a direct reward, we also train a generator with the same network architecture, but use the output of the discriminator as the weight of generated samples. We denote it as the REINFORCE-like model. The comparison results are shown in Figure 1 and Figure 2.

Figure 1 shows the training losses of the generator and the discriminator of the proposed MaliGAN. We can see that the training process of MaliGAN with variance reduction techniques is stable and the loss curve is meaningful. Figure 2 shows samples generated by the REINFORCE-like model and by MaliGAN. Clearly, the samples generated by MaliGAN have much better visual quality and closely resemble the training data.

Figure 1: The training loss of the generator (left) and the discriminator (right) of MaliGAN on Discrete MNIST task.
Figure 2: Samples generated by REINFORCE-like model (left) and by MaliGAN (right).

4.2 Poem Generation

We examine the effectiveness of our model on a Chinese poem generation task. Typically, there are two genres of Chinese poems. We use Poem-5 and Poem-7 to refer to those consisting of 5 or 7 Chinese characters in each short sentence, respectively. We use the dataset provided in Zhang & Lapata (2014), and split it in the standard way.

The generator is a one-layer LSTM (Hochreiter & Schmidhuber, 1997) with 32 hidden units for Poem-5 and 100 for Poem-7. Our discriminators are two-layer Bi-LSTMs with 32 hidden neurons. We denote our models trained with Algorithm 1 and Algorithm 2 as MaliGAN-basic and MaliGAN-full, respectively. We compare against two models: the auto-regressive model with the same architecture but trained with maximum likelihood (MLE), and SeqGAN (Yu et al., 2017). Following Yu et al. (2017), we report the BLEU-2 scores (Papineni et al., 2002) in Table 1.

MaliGAN-full obtained the best BLEU-2 scores on both tasks, and MaliGAN-basic was the next best. Clearly, MLE lagged far behind despite having the same architecture, which should be attributed to the inherent defect in the MLE teacher-forcing training framework. Since, as pointed out by Wiseman & Rush (2016), BLEU might not be a proper evaluation metric, we also calculate the perplexity of the MLE and MaliGAN models, obtaining qualitatively similar results. The best scores are reported in Table 1 and the perplexity curves are illustrated in Figure 3.

Model            Poem-5                Poem-7
                 BLEU-2   Perplexity   BLEU-2   Perplexity
MLE              0.6934   564.1        0.3186   192.7
SeqGAN           0.7389   -            -        -
MaliGAN-basic    0.7406   548.6        0.4892   182.2
MaliGAN-full     0.7628   542.7        0.5526   180.2
Table 1: Experimental results on Poetry Generation task. The result of SeqGAN is directly taken from (Yu et al., 2017).
Figure 3: Perplexity curves on Poem-5 (left) and Poem-7 (right).

From the above figures, we can see how our models perform during the training procedure. Despite some oscillations, both MaliGAN-basic and MaliGAN-full achieved lower perplexity than MLE. On Poem-7 in particular (Figure 3), both of our proposed models avoid the overfitting that MLE ends up with. Comparing the training curves of MaliGAN-basic and MaliGAN-full, we find that the latter has less variance. This demonstrates the effectiveness of the advanced variance reduction techniques in our full model. The peak in the MLE curve on Poem-5 in Figure 3 is, however, unlikely to be a result of overfitting, because MLE “recovered” from it quickly and continued to converge until the end. In fact, we find it harder to train a stable MLE model on Poem-5 than on Poem-7. We conjecture that this results from the intricate mutual influence between the improper evaluation and the small training data size.

4.3 Sentence-Level Language Modeling

We also examine the proposed algorithm on a more challenging task, sentence-level language modeling, which can be considered a fundamental task with applications to various discrete sequence generation tasks. To explore the possibilities and limitations of our algorithm, we conduct extensive experiments on the standard Penn Treebank (PTB) dataset (Marcus et al., 1993) through parameter searching and model ablations. For evaluation we report sentence-level perplexity, which is the averaged perplexity over all sentences in the test set. For simplicity and efficiency, we adopt a 1-layer GRU (Cho et al., 2014) as our generator, and use the same setting for the baseline model trained with standard teacher forcing (Williams & Zipser, 1989). We use a bi-directional GRU network as our discriminator. To stabilize training and provide a good initialization for the generator, we first pre-train our generator on the training set using teacher forcing, and then train two models, MaliGAN-basic and MaliGAN-full. MaliGAN-basic is trained with Algorithm 1 without MCTS. MaliGAN-full is trained with Algorithm 2 with all the variance reduction techniques included.
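
The evaluation metric described above can be sketched as follows. The code assumes that the perplexity of a sentence is the exponential of its mean negative token log-probability and that the reported number is the plain average over sentences; the exact normalization used in the experiments is not stated, so this is one plausible reading. The function names and toy probabilities are hypothetical.

```python
import numpy as np

def sentence_perplexity(token_log_probs):
    """Perplexity of one sentence: exp of the mean negative log-probability."""
    return float(np.exp(-np.mean(token_log_probs)))

def sentence_level_perplexity(corpus_log_probs):
    """Average of the per-sentence perplexities over a corpus."""
    return float(np.mean([sentence_perplexity(lp) for lp in corpus_log_probs]))

# Two toy sentences with hypothetical per-token model probabilities.
corpus = [np.log([0.5, 0.5]), np.log([0.25, 0.25, 0.25])]
ppl = sentence_level_perplexity(corpus)
```

Note that averaging per-sentence perplexities weights every sentence equally, which differs from corpus-level perplexity, where longer sentences contribute more tokens to a single average.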

Note that the computational cost of MCTS is very large, so we remove all sentences longer than 35 words in the training set. We set and at the beginning of the training and pre-train our discriminator to make it reliable enough to provide informative and correct signals for the generator. The perplexity shown in Table 2 is achieved by our best performing model, which has 200 hidden neurons and 200 dimensions for word embeddings.

                    MLE     MaliGAN-basic   MaliGAN-full
Valid-Perplexity    141.9   131.6           128.0
Test-Perplexity     138.2   125.3           123.8
Table 2: Experimental results on PTB. Note that we evaluate the models at the sentence level.

From Table 2 we can see that even the simplest model trained by MaliGAN effectively reduced the perplexity of the baseline. Both MaliGAN-basic and MaliGAN-full obtained a notably lower perplexity than the MLE model. Although the PTB dataset is much more difficult, we obtain results consistent with Table 1. It is encouraging to see that our model is more robust to overfitting, considering the relatively small size of the PTB data. These results strengthen our belief that our algorithm can be applied to even larger datasets, which we leave as future work.

The positive result again demonstrates the effectiveness of MaliGAN, whose primary component is the novel optimization objective we propose in Eq. 2. Besides, we also gain insights from the model ablation tests about the advanced variance reduction techniques provided in Section 3.3. Combined with the Perplexity curve in Figure 3, we can see that with advanced techniques, MaliGAN-full performed in a more stable way during training and can to some extent achieve lower perplexity scores than MaliGAN-basic. We believe these fruitful techniques will be beneficial in other similar problem settings.

5 Related Work

To improve the performance of discrete auto-regressive models, some researchers aim to tackle the exposure bias problem, which is discussed in detail in Ranzato et al. (2016), Serban et al. (2016), and Wiseman & Rush (2016). The problem occurs when the training algorithm prevents models from being exposed to their own predictions during training. The second issue is the discrepancy between the objective during training and the evaluation metric during testing, which is analyzed in Ranzato et al. (2016) and then summarized as Loss-Evaluation Mismatch by Wiseman & Rush (2016). Typically, the objective in training auto-regressive models is to maximize word-level probabilities, while at test time we often evaluate the models using sequence-level metrics, such as BLEU (Papineni et al., 2002). To alleviate these two issues, the most straightforward way is to add the evaluation metrics into the objective in the training phase. Because these metrics are often discrete and cannot be utilized through standard back-propagation, researchers generally turn to reinforcement learning. Ranzato et al. (2016) exploit the REINFORCE algorithm (Williams, 1992) and propose several model variants to adapt the algorithm to text generation applications. Liu et al. (2016) share a similar idea and directly optimize image caption metrics through policy gradient methods (Igel, 2005). There exists a third issue, namely Label Bias, especially in the sequence-to-sequence learning framework, which prevents MLE-trained models from being optimized globally (Andor et al., 2016; Wiseman & Rush, 2016).

To address the abovementioned issues in training auto-regressive models, we propose to formulate the problem in the setting of generative adversarial networks. Initially proposed by Goodfellow et al. (2014), the generative adversarial network (GAN) has attracted a lot of attention because it provides a powerful framework for generating promising samples through a min-max game. Researchers have successfully applied GANs to generate promising images conditionally (Mirza & Osindero, 2014; Reed et al., 2016; Zhang et al., 2016b) and unconditionally (Radford et al., 2015; Nguyen et al., 2016), to realize image manipulation and super-resolution (Zhu et al., 2016; Sønderby et al., 2017; Ledig et al., 2016), and to produce video sequences (Mathieu et al., 2016; Zhou & Berg, 2016; Saito & Matsumoto, 2016). Despite these successes, the feasibility and advantages of applying GANs to text generation remain little explored, yet noteworthy.

It is appealing to generate discrete sequences using GANs, as discussed above. The generative model can use the discriminator's output to gain information about its own distribution, which is inaccessible when it is trained by teacher forcing (Williams & Zipser, 1989; Ranzato et al., 2016). However, it is nontrivial to train GANs on discrete data because of its discontinuous nature. The instability inherent in GAN training makes things even worse (Salimans et al., 2016; Che et al., 2016; Arjovsky & Bottou, 2017; Arjovsky et al., 2017). Lamb et al. (2016) exploit adversarial domain adaptation to regularize the training of recurrent neural networks.

Yu et al. (2017) apply GANs to discrete sequence generation by directly optimizing the discrete discriminator's rewards. They adopt the Monte Carlo tree search technique (Silver et al., 2016). A similar technique is employed by Li et al. (2017), who improve response generation using adversarial learning.

In Bornschein & Bengio (2015), which inspired us, the authors propose a way of doing mini-batch reweighting when training latent variable models with discrete variables. However, they make use of an inference network, which is infeasible in the GAN setting.

Our work is also closely related to Norouzi et al. (2016b), who propose to work with the objective in a conditional generation setting. Their situation is similar to ours because rewards such as BLEU scores are available. However, conditional generation metrics such as BLEU decompose over time steps, which lets them sample directly from the augmented distributions. This is not possible for sequence-level GANs, e.g., in language modeling, so we have to use importance sampling to train the model.

6 Discussions and Future Work

In spite of their great popularity on continuous datasets such as images, GANs have not yet achieved equivalent success in discrete domains such as natural language processing. We observed that the main cause of this gap is that, while the discriminator can almost perfectly discriminate the good samples from the bad ones, it is notoriously difficult to pass this information to the generator, owing to the difficulty of credit assignment through discrete computation and the inherent instability of RL algorithms applied to dynamic environments with sparse reward.

In this work, we take a different approach. We start from the maximum likelihood training objective, and then use importance sampling combined with the discriminator output to derive a novel training objective. We argue that although this objective looks similar to the objective used in reinforcement learning, the normalization does in fact reduce the variance of the estimator by ignoring the region of the data space around the singularity of the importance weight, where the generator has almost zero probability of producing samples. Namely, by estimating the partition function using samples, we are approximately doing normalized importance sampling with another distribution that has much lower variance (cf. Section 3.2). In practice, this single real sample normalization process, combined with mixed training (Ranzato et al., 2016), successfully avoided the missing mode problem by providing an equivalent training signal for each mode.
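The normalization step described above can be illustrated with a minimal sketch. It assumes the importance weight takes the density-ratio form D(x)/(1 − D(x)), which is what an optimal discriminator yields for the ratio between the data and generator distributions, and it estimates the partition function by the sum of these ratios over a batch of generated samples. The function names, the `baseline` parameter, and the example values are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
def maligan_weights(d_outputs, baseline=0.0):
    """Self-normalized importance weights from discriminator outputs.

    d_outputs: discriminator probabilities D(x_i) in (0, 1) for a batch
    of generated samples (assumed form of the weight: D / (1 - D)).
    baseline: hypothetical constant subtracted for variance reduction.
    """
    eps = 1e-8  # guards against division by zero when D(x) is close to 1
    ratios = [d / max(1.0 - d, eps) for d in d_outputs]
    total = sum(ratios)  # sample-based estimate of the partition function
    return [r / total - baseline for r in ratios]


def surrogate_loss(log_probs, weights):
    """Negative weighted log-likelihood of the generated samples.

    The weights are treated as constants, so the gradient of this loss
    with respect to the generator parameters is the weighted sum of the
    gradients of log p(x_i).
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Three generated samples that the discriminator rates very differently.
d_outputs = [0.9, 0.5, 0.1]
weights = maligan_weights(d_outputs)
```

Because the weights are divided by their batch sum, they stay bounded even when a single discriminator output approaches 1, whereas the unnormalized ratios can grow without bound.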

Besides successfully reducing the variance of standard reinforcement learning algorithms, our algorithm is surprisingly robust to overfitting. Teacher forcing is prone to overfitting because, by maximizing the likelihood of the training data, the model can easily fit not only the regularities but also the noise in the data. In our model, however, if the generator tries to fit too much noise in the data, the generated samples will not look good, and the discriminator should be able to capture the differences between the generated and the real samples very easily.

As for future work, we plan to train the model on large datasets such as Google's One Billion Word benchmark (Chelba et al., 2014) and on conditional generation tasks such as dialogue generation.