Training language GANs from Scratch

05/23/2019 ∙ by Cyprien de Masson d'Autume, et al. ∙ Google

Generative Adversarial Networks (GANs) enjoy great success at image generation, but have proven difficult to train in the domain of natural language. Challenges with gradient estimation, optimization instability, and mode collapse have led practitioners to resort to maximum likelihood pre-training, followed by small amounts of adversarial fine-tuning. The benefits of GAN fine-tuning for language generation are unclear, as the resulting models produce comparable or worse samples than traditional language models. We show it is in fact possible to train a language GAN from scratch -- without maximum likelihood pre-training. We combine existing techniques such as large batch sizes, dense rewards and discriminator regularization to stabilize and improve language GANs. The resulting model, ScratchGAN, performs comparably to maximum likelihood training on the EMNLP2017 News and WikiText-103 corpora according to quality and diversity metrics.


1 Introduction

Unsupervised word level text generation is a stepping stone for a plethora of applications, from dialogue generation to machine translation and summarization (Wu et al., 2016; Lample et al., 2017; Li et al., 2016; Allahyari et al., 2017). While recent innovations such as architectural changes and leveraging big datasets are promising (Vaswani et al., 2017; Jozefowicz et al., 2016; Radford et al., 2019), the problem of unsupervised text generation is far from being solved.

Today, language models trained using maximum likelihood are the most successful and widespread approach to text modeling, but they are not without limitations. Since they explicitly model sequence probabilities, language models trained by maximum likelihood are confined to an autoregressive structure, prohibiting applications such as one-shot language generation. When combined with maximum likelihood training, autoregressive modelling can result in poor samples due to exposure bias (Bengio et al., 2015): a distributional shift between training sequences used for learning and model data required for generation. Scheduled sampling (Bengio et al., 2015) has been proposed as a solution, but is thought to encourage sample quality by reducing sample diversity, inducing mode collapse (Huszár, 2015).

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are an alternative to models trained via maximum likelihood. GANs do not suffer from exposure bias since the model learns to sample during training: the learning objective is to generate samples which are indistinguishable from real data according to a discriminator. Since GANs do not require an explicit probability model, they remove the restriction to autoregressive architectures, allowing one-shot feed-forward generation (Gulrajani et al., 2017).

The sequential and discrete nature of text has made the application of GANs to language challenging, with fundamental issues such as difficult gradient estimation and mode collapse yet to be addressed. Existing language GANs avoid these issues by pre-training models with maximum likelihood (Yu et al., 2017; Guo et al., 2017; Zhang et al., 2017; Che et al., 2017; Lin et al., 2017) and limiting the amount of adversarial fine-tuning by restricting the number of fine-tuning epochs and often using a small learning rate (Semeniuta et al., 2018; Caccia et al., 2018). This suggests “that the best-performing GANs tend to stay close to the solution given by maximum-likelihood training” (Caccia et al., 2018). Even with adversarial fine-tuning playing a limited role, extensive evaluation has shown that existing language GANs do not improve over maximum likelihood-trained models (Semeniuta et al., 2018; Caccia et al., 2018).

We show that pure adversarial training is a viable approach for unsupervised word-level text generation by training a language GAN from scratch. We achieve this by tackling the fundamental limitations of training discrete GANs through a combination of existing techniques as well as carefully choosing the model and training regime. To the best of our knowledge we are the first to do so successfully; we thus call our model ScratchGAN. Compared to prior work on discrete language GANs which “barely achieve non-random results without supervised pre-training” (Semeniuta et al., 2018),  ScratchGAN achieves results comparable with maximum likelihood models.

Our aim is to learn models that capture both the semantic coherence and the grammatical correctness of language, and to demonstrate that these properties have been captured with the use of different evaluation metrics. BLEU and Self-BLEU (Zhu et al., 2018) capture basic local consistency. The Fréchet Distance metric (Semeniuta et al., 2018) captures global consistency and semantic information, while being less sensitive to local syntax. We use Language and Reverse Language model scores (Caccia et al., 2018) across various softmax temperatures to capture the diversity-quality trade-off. Nearest neighbor analysis in embedding and data space provides evidence that our model is not trivially overfitting, e.g. by copying sections of training text.

We make the following contributions:

  • We show that GANs without any pre-training are competitive with maximum likelihood methods at unconditional text generation.

  • We show that large batch sizes, dense rewards and discriminator regularization are key ingredients of training language GANs from scratch.

  • We perform an extensive evaluation of the quality and diversity of our model. In doing so, we show that no current evaluation metric is able to capture all the desired properties of language.

2 Generative Models of Text

The generative model practitioner has two choices to make: how to model the unknown data distribution and how to learn the parameters of the model. The choice of model is often where prior information about the data is encoded, either through the factorization of the distribution, or through its parametrization. The language sequence naturally lends itself to autoregressive modeling:

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}) \qquad (1)$$

Sampling from an autoregressive model is an iterative process: each token $x_t$ is sampled from the conditional distribution imposed by previous samples: $x_t \sim p_\theta(\cdot \mid x_1, \dots, x_{t-1})$. The distributions $p_\theta(\cdot \mid x_1, \dots, x_{t-1})$ are Categorical distributions over the vocabulary, and are often parametrized as recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014).

The specific tokenization for a given data sequence is left to the practitioner, with character level or word level splits being the most common. Throughout this work, we use word level language modeling.
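
As an illustration of this sampling loop, the following is a minimal PyTorch sketch of a word-level autoregressive LSTM generator; the layer sizes and the beginning-of-sequence handling are illustrative assumptions rather than the ScratchGAN configuration.

```python
import torch
import torch.nn as nn


class LSTMGenerator(nn.Module):
    """Minimal word-level autoregressive generator (illustrative only)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def sample(self, batch_size, max_len, bos_id):
        # Start every sequence with a <bos> token and sample one word at a time.
        prev = torch.full((batch_size,), bos_id, dtype=torch.long)
        state = None
        tokens = []
        for _ in range(max_len):
            out, state = self.lstm(self.embed(prev).unsqueeze(1), state)
            logits = self.proj(out[:, -1])            # (batch, vocab)
            dist = torch.distributions.Categorical(logits=logits)
            prev = dist.sample()                      # x_t ~ p_theta(x_t | x_<t)
            tokens.append(prev)
        return torch.stack(tokens, dim=1)             # (batch, max_len)
```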

2.1 Maximum Likelihood

Once a choice of model is made, the question of how to train it arises. The most common approach to learning a model of language is maximum likelihood estimation (MLE):

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_\theta(x) \right] \qquad (2)$$

The combination of autoregressive models and maximum likelihood learning has been very fruitful in language modeling (Shannon, 1951; Mikolov et al., 2010; Vaswani et al., 2017), but it is unclear whether maximum likelihood is the optimal perceptual objective for text data (Huszár, 2015). In this work we will retain the use of autoregressive models and focus on the impact of the training criterion on the quality and sample diversity of generated data, by using adversarial training instead.
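
A minimal sketch of the corresponding training criterion, assuming teacher forcing and padded batches (the function name and tensor shapes are ours, not the paper's code):

```python
import torch.nn.functional as F


def mle_loss(logits, targets, pad_id):
    """Per-token negative log-likelihood under teacher forcing (Eq. 2).

    logits:  (batch, time, vocab) model predictions given ground-truth prefixes.
    targets: (batch, time) next tokens to predict; padding positions are ignored.
    """
    return F.cross_entropy(
        logits.transpose(1, 2),   # cross_entropy expects (batch, vocab, time)
        targets,
        ignore_index=pad_id,
        reduction="mean",
    )
```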

2.2 Generative Adversarial Networks

Generative adversarial networks (Goodfellow et al., 2014) learn the data distribution through a two player adversarial game between a discriminator and a generator. A discriminator $D$ is trained to distinguish between real data and samples from the generator distribution $p_\theta$, while the generator is trained to fool the discriminator into identifying its samples as real. The original formulation proposes a min-max optimization procedure using the objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log D(x) \right] + \mathbb{E}_{\hat{x} \sim p_\theta} \left[ \log \left( 1 - D(\hat{x}) \right) \right] \qquad (3)$$

Goodfellow et al. (2014) suggested using the alternative generator loss $-\log D(\hat{x})$, as it provides better gradients for the generator. Since then, multiple other losses have been proposed (Arjovsky et al., 2017; Mao et al., 2016; Jolicoeur-Martineau, 2018; Mohamed and Lakshminarayanan, 2016).

Challenges of learning language GANs arise from the combination of the adversarial learning principle with the choice of an autoregressive model. Learning using equation 3 requires backpropagating through a sampling operation, forcing the language GAN practitioner to choose between high variance, unbiased estimators such as REINFORCE (Williams, 1992), or lower variance, but biased estimators, such as the Gumbel-Softmax trick (Maddison et al., 2016; Jang et al., 2016) and other continuous relaxations (Gulrajani et al., 2017). Gradient estimation issues compounded with other GAN problems such as mode collapse or training instability (Arjovsky et al., 2017; Fedus et al., 2017) led prior work on language GANs to use maximum likelihood pre-training (Yu et al., 2017; Guo et al., 2017; Fedus et al., 2018; Nie et al., 2019; Zhang et al., 2017; Lin et al., 2017). This is the current preferred approach to train text GANs.

2.3 Learning Signals

To train the generator we use the REINFORCE gradient estimator (Williams, 1992):

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta} \left[ R(x) \right] = \mathbb{E}_{x \sim p_\theta} \left[ R(x) \, \nabla_\theta \log p_\theta(x) \right] \qquad (4)$$

where $R(x)$ is provided by the discriminator. By analogy with reinforcement learning, we call $R(x)$ a reward. Setting $R(x) = \frac{p_{\text{data}}(x)}{p_\theta(x)}$ recovers the MLE estimator in Eq (2), as shown by Che et al. (2017):

$$\mathbb{E}_{x \sim p_\theta} \left[ \frac{p_{\text{data}}(x)}{p_\theta(x)} \, \nabla_\theta \log p_\theta(x) \right] = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \nabla_\theta \log p_\theta(x) \right] \qquad (5)$$

The gradient updates provided by the MLE estimator can be seen as a special case of the REINFORCE updates used in language GAN training. The important difference lies in the fact that for language GANs rewards are learned. Learned discriminators have been shown to be a useful measure of model quality and correlate with human evaluation (Kannan and Vinyals, 2017). We postulate that learned rewards provide a smoother signal to the generator than the classical MLE loss: the discriminator can learn to generalize and provide a meaningful signal over parts of the distribution not covered by the training data. As the training progresses and the signal from the discriminator improves, the generator also explores other parts of data space, providing a natural curriculum, whereas MLE models are only exposed to the dataset.
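
For concreteness, a minimal sketch of a surrogate loss whose gradient is the estimator in Equation (4); the tensor shapes and the teacher-forcing remark are our own illustration, not the exact implementation.

```python
def reinforce_surrogate(log_probs, rewards):
    """Surrogate loss whose gradient is the REINFORCE estimator of Eq. (4).

    log_probs: (batch, time) log p_theta(x_t | x_<t) of the *sampled* tokens.
    rewards:   (batch, time) rewards from the discriminator, treated as constants.

    Note: with rewards fixed to 1 and tokens taken from the data instead of the
    model (teacher forcing), minimising this reduces to the MLE objective.
    """
    return -(rewards.detach() * log_probs).sum(dim=1).mean()
```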

Adversarial training also enables the use of domain knowledge. Discriminator ensembles where each discriminator is biased to focus on specific aspects of the samples such as syntax, grammar, semantics, or local versus global structure are a promising approach (Holtzman et al., 2018). The research avenues opened by learned rewards and the issues with MLE pre-training motivate our search for a language GAN which does not make use of maximum likelihood pre-training.

3 Training Language GANs from Scratch

To achieve the goal of training a language GAN from scratch, we tried different loss functions and architectures, various reward structures and regularization methods, ensembles, and other modifications. Most of these approaches did not succeed or did not provide any significant gains. Via this extensive experimentation we found that the key ingredients to train language GANs from scratch are: a recurrent discriminator used to provide dense rewards at each time step, large batches for variance reduction, and discriminator regularization. We describe the generator architecture and reward structure we found effective in Figure 1 and provide details of the other techniques we tried in Appendix C.

Figure 1: ScratchGAN architecture and reward structure.

Table 1: BLEU-5 and Self-BLEU-5 metrics for a 5-gram model.
Model           BLEU-5   SBLEU-5
Kneser-Ney      20.67    19.73
Training data   20.73    20.73

3.1 Dense Rewards

Our ultimate goal is to generate entire sequences, so we could train a discriminator to distinguish between complete data sequences and complete sampled sequences, with the generator receiving a reward only after generating a full sequence. However, in this setting the generator would get no learning signal early in training, when generated sentences can easily be determined to be fake by the discriminator. We avoid this issue by instead training a recurrent discriminator which provides rewards for each generated token (Fedus et al., 2018). The discriminator learns to distinguish between sentence prefixes coming from real data and sampled sentence prefixes.

While a sequential discriminator is potentially harder to learn than sentence based feed-forward discriminators, it is computationally cheaper than approaches that use Monte Carlo Tree Search to score partial sentences (Guo et al., 2017; Yu et al., 2017; Lin et al., 2017) and has been shown to perform better empirically (Semeniuta et al., 2018).

For a generated token $x_t$, the reward provided to the ScratchGAN generator at time step $t$ is:

$$r_t = 2 \, D_\phi(x_t \mid x_{1:t-1}) - 1 \qquad (6)$$

Rewards scale linearly with the probability the discriminator assigns to the current prefix pertaining to a real sentence. Bounded rewards help stabilize training.

The goal of the generator at timestep $t$ is to maximize the sum of discounted future rewards, using a discount factor $\gamma$:

$$R_t = \sum_{s=t}^{T} \gamma^{s-t} \, r_s \qquad (7)$$

Like ScratchGAN, SeqGAN-step (Semeniuta et al., 2018) uses a recurrent discriminator to provide rewards per time step to a generator trained using policy gradient for unsupervised word level text generation. Unlike SeqGAN-step, our model is trained from scratch using only the adversarial objective, without any maximum likelihood pretraining.
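
A minimal sketch of this reward and return computation, consistent with the definitions above (function names are ours):

```python
import torch


def per_step_rewards(disc_probs):
    """Affine map from discriminator probabilities in [0, 1] to bounded rewards
    in [-1, 1], matching the linear, bounded rewards of Eq. (6)."""
    return 2.0 * disc_probs - 1.0


def discounted_returns(rewards, gamma):
    """R_t = sum_{s >= t} gamma^(s - t) * r_s (Eq. 7), computed right to left.

    rewards: (batch, time) per-step rewards r_t.
    """
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns
```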

3.2 Large Batch Sizes for Variance Reduction

The ScratchGAN generator parameters are updated using Monte Carlo estimates of the policy gradient (Equation 4), where $N$ is the batch size:

$$\nabla_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \left( R^n_t - b \right) \nabla_\theta \log p_\theta\!\left(x^n_t \mid x^n_{1:t-1}\right)$$

A key component of ScratchGAN is the use of large batch sizes to reduce the variance of the gradient estimation, exploiting the ability to cheaply generate experience by sampling from the generator. To further reduce the gradient variance ScratchGAN uses a global moving-average of rewards as the baseline $b$ (Sutton and Barto, 2018), as we empirically found it improves performance for certain datasets.

Providing rewards only for the sampled token, as in the estimator above, results in a substantial training speed boost compared to methods that use the discriminator to provide rewards for each token in the vocabulary, in order to reduce variance and provide a richer learning signal. These methods score all prefixes at time $t$ and thus scale linearly with vocabulary size (Fedus et al., 2018).
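
A minimal sketch of the resulting generator update, assuming the moving-average baseline described in Appendix D; the names and the exact form of the baseline update are our assumptions, not the exact implementation.

```python
import torch


def generator_step_loss(log_probs, returns, baseline, alpha):
    """Batched policy-gradient surrogate with a global moving-average baseline.

    log_probs: (batch, time) log-probabilities of the sampled tokens.
    returns:   (batch, time) discounted returns R_t (no gradient needed).
    baseline:  scalar tensor tracking a moving average of returns.
    alpha:     update rate of the moving average (assumed form, see Appendix D).
    """
    baseline = (1.0 - alpha) * baseline + alpha * returns.mean().detach()
    advantage = returns.detach() - baseline
    # Averaging over a large batch is what keeps the variance of this estimate low.
    loss = -(advantage * log_probs).sum(dim=1).mean()
    return loss, baseline
```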

3.3 Architectures and Discriminator Regularization

The ScratchGAN discriminator and generator use an embedding layer followed by one or more LSTM layers (Hochreiter and Schmidhuber, 1997). For the embedding layer, we have experimented with training the embeddings from scratch, as well as using pre-trained GloVe embeddings (Pennington et al., 2014) concatenated with learned embeddings. When GloVe embeddings are used, they are shared by the discriminator and the generator, and kept fixed during training.

Discriminator regularization in the form of layer normalization (Ba et al., 2016), dropout (Srivastava et al., 2014) and weight decay provide a substantial performance boost to ScratchGAN. Our findings align with prior work which showed the importance of discriminator regularization on image GANs (Brock et al., 2018; Miyato et al., 2018; Fedus et al., 2017).

Despite using a recurrent discriminator, we also provide the discriminator with positional information by concatenating a fixed sinusoidal signal to the word embeddings used in the discriminator (Vaswani et al., 2017). We found this necessary to ensure the sentence length distribution obtained from generator samples matches that of the training data. Ablation experiments are provided in Appendix G.

4 Evaluation Metrics

(a) Negative BLEU-5 versus Self-BLEU-5.
(b) Language- and reverse language-model scores.
Figure 2: BLEU scores on EMNLP2017 News (left) and language model scores on Wikitext-103 (right). For BLEU scores, left is better and down is better. LeakGAN, MaliGAN, RankGAN and SeqGAN results from Caccia et al. (2018).

Evaluating text generation remains challenging, since no single metric is able to capture all desired properties: local and global consistency, diversity and quality, as well as generalization beyond the training set. We follow Semeniuta et al. (2018) and Caccia et al. (2018) in the choice of metrics. We use n-gram based metrics to capture local consistency, Fréchet Distance to measure distances to real data in embedding space, and language model scores to measure the quality-diversity trade-off. To show our model is not trivially overfitting we look at nearest neighbors in data and embedding space.

4.1 n-gram based Metrics

BLEU (Papineni et al., 2002) and Self-BLEU have been proposed (Zhu et al., 2018) as measures of quality and diversity, respectively. BLEU based metrics capture local consistency and detect relatively simple problems with syntax but do not capture semantic variation (Semeniuta et al., 2018; Reiter, 2018).

We highlight the limitations of BLEU metrics by training an n-gram model with Kneser-Ney smoothing (Kneser and Ney, 1995) on EMNLP2017 News and measuring its BLEU score. The results are reported in Table 1. The n-gram model scores close to perfect according to the BLEU-5 metric, although its samples are qualitatively very poor (see Table 10 in the Appendix). In the rest of the paper we report BLEU-5 and Self-BLEU-5 metrics to compare with prior work, and complement them with metrics that capture global consistency, like Fréchet Distance.
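
For reference, a minimal sketch of how BLEU-5 and Self-BLEU-5 can be computed with NLTK over tokenized sentences; this illustrates the metric definitions and is not the exact evaluation code used here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

BLEU5_WEIGHTS = (0.2, 0.2, 0.2, 0.2, 0.2)


def bleu5(references, hypotheses):
    """Average BLEU-5 of hypotheses against a fixed set of reference sentences.
    Both arguments are lists of token lists."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu(references, h, weights=BLEU5_WEIGHTS,
                            smoothing_function=smooth) for h in hypotheses]
    return sum(scores) / len(scores)


def self_bleu5(samples):
    """Self-BLEU-5: each sample is scored against all the other samples."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, h in enumerate(samples):
        refs = samples[:i] + samples[i + 1:]
        scores.append(sentence_bleu(refs, h, weights=BLEU5_WEIGHTS,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```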

4.2 Fréchet Embedding Distance

Semeniuta et al. (2018) proposed the Fréchet InferSent Distance (FID), inspired by the Fréchet Inception Distance used for images (Heusel et al., 2017). The metric computes the Fréchet distance between two Gaussian distributions fitted to data embeddings and model sample embeddings, respectively.

Semeniuta et al. (2018) showed that the metric is not sensitive to the choice of embedding model and use InferSent for model evaluation, while we use a Universal Sentence Encoder (Cer et al., 2018) (the model can be found at https://tfhub.dev/google/universal-sentence-encoder/2). We call the metric Fréchet Embedding Distance to clarify that we use a different embedding model from Semeniuta et al. (2018).

The Fréchet Embedding Distance (FED) offers several advantages over BLEU-based metrics, as highlighted in  Semeniuta et al. (2018): it captures both quality and diversity; it captures global consistency; it is faster and simpler to compute than BLEU metrics; it correlates with human evaluation; it is less sensitive to word order than BLEU metrics; it is empirically proven useful for images.

We find that the Fréchet Embedding Distance provides a useful metric to optimize for during model development, and we use it to choose the best models. However, we notice that FED also has drawbacks: it can be sensitive to sentence length, and we avoid this bias by ensuring that all compared models match the sentence length distribution of the data (see details in Appendix E).
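
A minimal sketch of the metric, given sentence embeddings for real and sampled text (e.g. from the Universal Sentence Encoder); this illustrates the Fréchet distance between fitted Gaussians, not the exact evaluation pipeline.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_embedding_distance(real_emb, sample_emb):
    """Fréchet distance between Gaussians fitted to sentence embeddings.

    real_emb, sample_emb: (num_sentences, embedding_dim) arrays.
    """
    mu_r, mu_s = real_emb.mean(axis=0), sample_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_s = np.cov(sample_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical noise can leave tiny imaginary parts
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))
```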

4.3 Language Model Scores

Caccia et al. (2018) proposed evaluating the quality of generated model samples using a language model (Language Model score, LM), as well as training a language model on the generated samples and scoring the original data with it (Reverse Language Model score, RLM). LM measures sample quality: bad samples score poorly under a language model trained on real data. RLM measures sample diversity: real data scores poorly under a language model trained on samples which lack diversity. While insightful, this evaluation criterion relies on training new models, and hence the results can depend on the evaluator architecture. The metric could also have an inherent bias favoring language models, since they are trained using the same criterion.

(a) Wikitext-103.
(b) FED vs softmax temperature.
(c)  ScratchGAN ablation study.
Figure 3: FED scores. Lower is better. EMNLP2017 News results unless otherwise specified.

5 Experimental Results

We use two datasets, EMNLP2017 News (http://www.statmt.org/wmt17/) and WikiText-103 (Merity et al., 2016). We use EMNLP2017 News to compare with prior work (Guo et al., 2017; Caccia et al., 2018) but note that this dataset has limitations: a small vocabulary (5.7k words), no out-of-vocabulary tokens, a sentence length limited to 50 tokens, and a size of only 300k sentences. WikiText-103 is a large scale dataset of almost 4 million sentences that captures more of the statistical properties of natural language and is a standard benchmark in language modeling (Dauphin et al., 2016; Bai et al., 2018). For WikiText-103 we use a vocabulary of 20k words. In WikiText-103 we remove sentences with fewer than 7 tokens or more than 100 tokens. All our models are trained on individual sentences, using an NVIDIA P100 GPU.

In all our experiments, the baseline maximum likelihood trained language model is an LSTM. Exact architectures and experimental procedures for the results below are detailed in Appendix D. Samples from ScratchGAN can be seen in Appendix H, alongside data and MLE samples.

5.1 Quality and Diversity

As suggested in Caccia et al. (2018), we measure the diversity-quality trade-off of different models by changing the softmax temperature at sampling time. Reducing the softmax temperature below 1 results in higher quality but less diverse samples, while increasing it results in samples closer and closer to random. Reducing the temperature for a language GAN is similar to the “truncation trick” used in image GANs (Brock et al., 2018). We compute all metrics at different temperatures.
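
A minimal sketch of temperature-scaled sampling (illustrative only):

```python
import torch


def sample_with_temperature(logits, temperature):
    """Temperature-scaled sampling: temperature < 1 sharpens the softmax
    (higher quality, lower diversity); temperature > 1 flattens it."""
    return torch.distributions.Categorical(logits=logits / temperature).sample()
```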

ScratchGAN shows improved local consistency compared to existing language GANs and significantly reduces the gap between language GANs and maximum likelihood language models. Figure 2(a) reports negative BLEU-5 versus Self-BLEU-5 metrics on EMNLP2017 News for ScratchGAN and other language GANs, as reported in Caccia et al. (2018).

ScratchGAN improves over an MLE trained model on WikiText-103 according to FED, as shown in Figure 3(a). This suggests that ScratchGAN is more globally consistent and better captures semantic information. Figure 3(b) shows the quality-diversity trade-off as measured by FED as the softmax temperature changes. ScratchGAN performs slightly better than the MLE model on this metric. This contrasts with the Language Model and Reverse Language Model scores shown in Figure 2(b), which suggest that MLE samples are more diverse. Similar results on EMNLP2017 News are shown in Appendix A.

Unlike image GANs, ScratchGAN learns an explicit model of the data, namely an explicit autoregressive model of language. This allows us to compute model perplexities on validation data by feeding the model ground truth at each step. We report ScratchGAN and MLE perplexities on EMNLP2017 News in Table 2. This metric favors the MLE model, which is trained to minimize perplexity and thus has an incentive to spread mass around the data distribution to avoid being penalized for not explaining training instances (Theis et al., 2015), unlike ScratchGAN, which is penalized by the discriminator when deviating from the data manifold and thus favors quality over diversity.

Our diversity and quality evaluation across multiple metrics shows that compared to the MLE model, ScratchGAN trades off local consistency to achieve slightly better global consistency.

Model        Word level perplexity
Random       5725
ScratchGAN   154
MLE          42
Table 2: EMNLP2017 News perplexity.

Figure 4: Matching n-grams in EMNLP2017 News.

5.2 Nearest Neighbors

A common criticism of GAN models is that they produce realistic samples by overfitting to the training set, e.g. by copying text snippets. For a selection of ScratchGAN samples we find and present the nearest neighbors in the training set. We consider two similarity measures: 3-gram cosine similarity, to capture copied word sequences, and cosine similarity of embeddings produced by the Universal Sentence Encoder, to capture semantically similar sentences. In Table 5 in Appendix B we display a selection of four random samples and the corresponding top three closest training set sentences with respect to each similarity measure, and see the training text snippets have a mild thematic correspondence but have distinct phrasing and meaning. Additionally, we perform a quantitative analysis over the full set of samples: we compare the longest matching n-grams between text from the training set and (a) ScratchGAN samples, (b) MLE samples, and (c) text from the validation set. In Figure 4 we see fewer ScratchGAN samples with long matching n-grams in comparison with MLE samples and text from the validation set. We conclude the generator is producing genuinely novel sentences, although they are not always grammatically or thematically consistent.

5.3 Ablation Study and SeqGAN-step comparison

We show the relative importance of individual features of ScratchGAN with an ablation study in Figure 3(c). We successively add all elements that appear important to ScratchGAN performance, namely large batch size, discriminator regularization (L2 weight decay, dropout, and layer normalization), pre-trained embeddings, and a value baseline for REINFORCE.

The baseline model in Figure 3(c) is a SeqGAN-step like model (Yu et al., 2017) without pretraining. To highlight the improvement of ScratchGAN compared to prior work, we show in Table 3 the FED difference between the two models.

5.4 Training Stability

Despite the high variance of REINFORCE gradients and the often unstable GAN training dynamics, our training procedure is very stable, due to the use of large batch sizes and the chosen reward structure. We report the FED scores for ScratchGAN across hyperparameters and random seeds in Table 4. When we fixed hyperparameters and repeated an experiment across 50 seeds, we obtained very similar FED scores. No divergence or mode collapse occurred in any of the 50 runs. For WikiText-103, the results are similar (0.055 ± 0.003). Further details can be found in Appendix F.

Model                          FED
SeqGAN-step (no pretraining)   0.084
ScratchGAN                     0.015
Table 3: FED on EMNLP2017 News.

Variation              FED
Hyperparameters        0.021 ± 0.0056
Seeds (best hypers)    0.018 ± 0.0008
Table 4: FED sensitivity on EMNLP2017 News.

6 Related Work

Our work expands on the prior work of discrete language GANs, which opened up the avenues to this line of research. Methods which use discrete data have proven to be more successful than methods using continuous relaxations (Semeniuta et al., 2018), but face their own challenges, such as finding the right reward structure and reducing gradient variance. Previously proposed solutions include: receiving dense rewards via Monte Carlo Search (Guo et al., 2017; Yu et al., 2017; Lin et al., 2017) or a recurrent discriminator (Fedus et al., 2018; Semeniuta et al., 2018), leaking information from the discriminator to the generator (Guo et al., 2017), using actor critic methods to reduce variance (Fedus et al., 2018), using ranking or moment matching to provide a richer learning signal (Lin et al., 2017; Zhang et al., 2017), and curriculum learning (Fedus et al., 2018). Despite alleviating problems somewhat, all of the above methods require pre-training, sometimes together with teacher forcing (Che et al., 2017) or interleaved supervised and adversarial training (Guo et al., 2017).

Nie et al. (2019) recently showed that language GANs can benefit from complex architectures such as Relation Networks (Santoro et al., 2017). Their RelGAN model can achieve better than random results without supervised pre-training, but still requires pre-training to achieve results comparable to MLE models.

Press et al. (2017) is perhaps the closest to our work: they train a character level GAN without pre-training. Unlike Press et al. (2017), ScratchGAN is a word level model and does not require teacher helping, curriculum learning, or continuous relaxations during training. Importantly, we have performed an extensive evaluation to quantify the performance of ScratchGAN, as well as measured overfitting using multiple metrics, beyond n-gram matching.

By learning reward signals through the use of discriminators, our work is in line with recent imitation learning work (Ho and Ermon, 2016), as well as training non-differentiable generators (Ganin et al., 2018).

7 Discussion

Existing language GANs use maximum likelihood pretraining to minimize adversarial training challenges, such as unstable training dynamics and high variance gradient estimation. However, they have shown little to no performance improvements over traditional language models, likely due to constraining the set of possible solutions to be close to those found by maximum likelihood. We have shown that large batch sizes, dense rewards and discriminator regularization remove the need for maximum likelihood pre-training in language GANs. To the best of our knowledge, we are the first to use Generative Adversarial Networks to train word-level language models successfully from scratch. Removing the need for maximum likelihood pretraining in language GANs opens up a new avenue of language modeling research, with future work exploring GANs with one-shot feed-forward generators and specialized discriminators which distinguish different features of language, such as semantics and syntax, local and global structure.

We have measured the quality and diversity of ScratchGAN samples using BLEU metrics, Fréchet distance, and language model scores. None of these metrics is sufficient to evaluate language generation: we have shown that BLEU metrics only capture local consistency; language model scores do not capture semantic similarity; and that while embedding based Fréchet distance is a promising global consistency metric, it is sensitive to sentence length. Until new ways to assess language generation are developed, current metrics need to be used together to compare models.

8 Acknowledgments

We would like to thank Chris Dyer, Oriol Vinyals, Karen Simonyan, Ali Eslami, David Warde-Farley, Siddhant Jayakumar and William Fedus for thoughtful discussions.

References

Appendix A Fréchet Embedding Distance and Language model scores on EMNLP2017 News

On EMNLP2017 News, FED and LM/RLM results are similar to those on WikiText-103, see Figure 5(a) and Figure 5(b). Here we report the FED against both the training and validation sets, to assess model overfitting. On this metric, we again notice that ScratchGAN performs better than the MLE model.

(a) FED against training and validation data.
(b) Language model scores.
Figure 5: EMNLP2017 News results.

Appendix B Nearest Neighbors

In Table 5 we show, for a selection of four random samples, the corresponding top three closest training set sentences with respect to each similarity measure; there is not a clear pattern of overfitting or training set repetition.

USE Nearest Neighbours 3-gram Nearest Neighbours
Sample: A nice large part of Trump has to plan exactly what Pence would worth , for Trump to choose him
     strongly in Florida, where he can be 100 percent away.
0.77 His name , of course , is Donald Trump , the billionaire businessman who leads most national polls for the Republican nomination . 0.13 It ’ s like the situation in Florida , where he didn ’ t pay taxes on his golf course .
0.75 But to get there , Rubio believes he needs to cut significantly into Cruz ’ s support in Iowa , a state dominated by social conservatives . 0.12 Donald Trump is spending his third straight day in Florida , where he ’ s already made six campaign stops since Sunday .
0.72 On the Republican side , the Iowa poll shows Ted Cruz leading Donald Trump by four points , but Trump has a 16 - point lead in New Hampshire . 0.10 He has long been mentioned as a possible candidate for governor in Florida , where he has a home in Miami with his wife and four school - age children .
Sample: I didn ’ t know how to put him up to the floor among reporters Thursday or when he did what he said.
0.69 Speaking at a news conference on Monday , he said : ” Let me make clear that this is a great professional and a great person . 0.25 Her explanation for saying ” I didn ’ t glass her , I don ’ t know why I ’ m getting arrested ” was said out of panic , I didn ’ t know how to handle the situation .
0.67 In a text message late Monday , he said he had not seen the court filing and could not comment on it . 0.23 I didn ’ t know how to do it or who to talk to , so I had to create opportunities for myself .
0.59 ” We ’ re not going to stand by any agent that has deliberately done the wrong thing , ” he said . 0.23 I didn ’ t know how to face it , but as soon as I ’ d got through that it was OK .
Sample: Paul have got a fine since the last 24 game , and it ’ s just a nine - day mark .
0.50 As he said after Monday night ’ s game : ” We know we have enough quality , it ’ s not always the quality . 0.21 We ’ ve been in this situation too many times , and it ’ s a 60 - minute game , and it doesn ’ t matter .
0.50 The 26 - year - old from Brisbane was forced to come from behind to score an impressive 6 - 7 ( 5 - 7 ), 6 - 4 , 7 - 6 ( 8 - 6 ) win . 0.21 There are already plenty people fighting fire with fire , and it ’ s just not helping anyone or anything .
0.48 But he ’ s had a very good start to this year and beat Roger to win Brisbane a couple of weeks ago . 0.20 We ’ ve just got to move on , it ’ s part of the game , and it ’ s always going to happen , that kind of stuff
Sample: Such changes from the discussion and social support of more people living in the EU with less generous
income and faith.
0.72 The EU has promised Ankara three billion euros in aid if it does more to stop the flow of migrants headed for Europe . 0.14 There are nearly three - quarters of a million British people living in Spain and over two million living in the EU as a whole .
0.68 Now , as Norway is not a member of the EU , it has no say over these or any other EU rules . 0.1 About 60 people living in the facility were moved to another part of the building for safety , according to authorities .
0.67 We can ’ t debate the UK ’ s place in Europe ahead of an historic EU referendum without accurate statistics on this and other issues . 0.1 We ’ d like to hear from people living in the country about what life as a Canadian is really like .
Table 5: EMNLP2017 News nearest neighbours to ScratchGAN samples. Similarity with respect to embedding cosine distance using the Universal Sentence Encoder, and with respect to 3-gram cosine distance. We see the GAN samples are not composed of cut-and-paste text snippets from the training set.

Appendix C Negative results

Here we list some approaches that we tried but which proved unsuccessful or unnecessary:

  • Using a Wasserstein Loss on generator logits, with a straight-through gradient. This was unsuccessful.

  • Using ensembles of discriminators and generators. The results are on par with those obtained by a single discriminator-generator pair.

  • Training against past versions of generators/discriminators. Same as above.

  • Using bi-directional discriminators. They can work but tend to over-fit and provide less useful feedback to the generator.

  • Using several discriminators with different architectures, hoping to have the simple discriminators capture simple failure modes of the generators such as repeated words. It did not improve over single discriminator-generator pair.

  • Training on small datasets such as Penn Tree Bank. The discriminator quickly over-fit to the training data. This issue could probably be solved with stronger regularization but we favoured larger datasets.

  • Using a Hinge loss (Miyato et al., 2018) on the discriminator. This did not improve over the cross-entropy loss.

  • Using a hand-designed curriculum, where the generator is first trained against a simple discriminator, and later in training a more complex discriminator is substituted. This was unsuccessful. We suspect that adversarial training requires a difficult balance between discriminator quality and generator quality, which is difficult to reach when either component has been trained independently from the other.

  • Varying significantly the number of discriminator steps per generator step, say 5 discriminator steps per generator step. This was unsuccessful.

  • Looking at discriminator probabilities (checking that the discriminator outputs are close to 1 on real data and close to 0 on samples) to evaluate training. The discriminator seems to be able to provide a good gradient signal even when its predictions are not close to the targets, as long as its predictions on real data are distinct from its predictions on fake data.

  • Using a population of discriminators to evaluate the quality of a generator, or conversely. This metric failed when the population as a whole is not making progress.

  • Mapping all data to GloVe embeddings, and training a one-shot feed-forward generator to generate word embeddings directly, while the discriminator receives word embeddings directly. This was unsuccessful.

Appendix D Experimental details

We now provide the experimental details of our work.

D.1 ScratchGAN architectural details

Generator

The core of the generator is an LSTM with tanh activation function and skip connections. We use an embedding matrix which is the concatenation of a fixed pretrained GloVe embedding matrix, whose first dimension is the vocabulary size, and a learned embedding matrix whose second dimension depends on the dataset. An embedding for the token at the previous time-step is looked up in the embedding matrix, and then linearly projected using a learned matrix to the feature size of the LSTM. This is the input to the LSTM. The output of the LSTM is the concatenation of the hidden outputs of all layers. This output is linearly projected using a learned matrix to the dimension of the embedding matrix. We add a learned bias to obtain the logits over the vocabulary. We apply a softmax operation to the logits to obtain a Categorical distribution and sample from it to generate the token for the current time-step.

Discriminator
The input to the discriminator is a sequence of tokens, coming either from the real data or from the generator. The core of the discriminator is an LSTM. The discriminator uses its own embedding matrix, independent from the generator. It has the same structure as the generator embedding matrix. Dropout is applied to this embedding matrix. An embedding for the token at the current time-step is looked up in the embedding matrix. A fixed position embedding (see Appendix G) is concatenated to the embedding. As for the generator, the embedding is linearly projected using a learned matrix to the feature size of the LSTM. This is the input to the LSTM. The output of the LSTM is linearly projected to a scalar, which is passed through a sigmoid to obtain the discriminator probability. The discriminator LSTM is regularized with layer normalization. L2 regularization is applied to all learned variables in the discriminator.

Losses
The discriminator is trained with the usual cross-entropy loss. The generator is trained with a REINFORCE loss. The value baseline at training step $n$, denoted $b_n$, is computed as a moving average of rewards:

$$b_n = (1 - \alpha) \, b_{n-1} + \alpha \, \bar{R}_n \qquad (8)$$

where $\bar{R}_n$ is the mean cumulative reward over all sequence timesteps and over the current batch at training step $n$. The generator loss at timestep $t$ and training step $n$ is then:

$$L^n_t = -\left( R_t - b_n \right) \log p_\theta\!\left(x_t \mid x_{1:t-1}\right) \qquad (9)$$

and the total generator loss to minimize at training step $n$ is $L^n = \sum_t L^n_t$.

Optimization
Both generators and discriminators are trained with Adam (Kingma and Ba, 2014). We perform one discriminator step per generator step.

Data considerations

The maximum sequence length for EMNLP2017 News is 50 timesteps. The generator vocabulary also contains a special end of sequence token. If the generator outputs the end of sequence token at any timestep, the rest of the sequence is padded with spaces. At the first timestep, the input to the generator LSTM is a space character. Generator and discriminator are both recurrent, so the time and space complexity of inference and training are linear in the sequence length.

D.2 Sweeps and best hyperparameters

To choose our best model, we sweep over the following hyperparameters:

  • Discriminator learning rate.

  • Generator learning rate.

  • Discount factor γ.

  • The number of discriminator updates per generator update.

  • The LSTM feature size of the discriminator and generator.

  • The number of layers for the generator.

  • Batch size.

  • Dropout rate for the discriminator.

  • Trainable embedding size.

  • Update frequency of the baseline, α.

The best hyperparameters for EMNLP2017 News are:

  • Discriminator learning rate: .

  • Generator learning rate:

  • Discount factor γ: 0.23.

  • The LSTM feature size of the discriminator and generator: 512 and 512.

  • The number of layers for the generator: 2.

  • Batch size: 512.

  • Dropout rate for the discriminator embeddings: 0.1

  • Trainable embedding size: 64.

  • Update frequency of the baseline, α: 0.08.

The best hyperparameters for WikiText-103 are:

  • Discriminator learning rate:

  • Generator learning rate:

  • Discount factor γ: 0.79.

  • The LSTM feature size of the discriminator and generator: 256 and 256.

  • The number of layers for the discriminator: 1.

  • Batch size: 768.

  • Dropout rate for the discriminator embeddings: 0.4.

  • Trainable embedding size: 16.

  • Update frequency of the baseline, α: 0.23.

D.3 Training procedure

For both datasets, we train for at least generator training steps, saving the model every steps, and we select the model with the best FED against the validation data. Each training run used approximately 4 Intel Skylake x86-64 CPUs at 2 GHz, 1 Nvidia Tesla V100 GPU, and 20 GB of RAM, for 1 to 5 days depending on the dataset.

D.4 Language models

The language models we compare to are LSTMs. Interestingly, we found that smaller architectures are necessary for the LM compared to the GAN model, in order to avoid overfitting. For the maximum likelihood language models, we sweep over the size of the embedding layer, the feature size of the LSTM, and the dropout rate used for the embedding layer. We choose the model with the smallest validation perplexity.

For EMNLP2017 News, the MLE model used an LSTM feature size of 512, an embedding size of 512, and an embedding dropout rate of 0.2.

For WikiText-103, the MLE model used an LSTM feature size of 3000, an embedding size of 512, and an embedding dropout rate of 0.3.

D.5 Metrics

FED and BLEU/Self-BLEU metrics on EMNLP2017 News are always computed with samples. On WikiText-103, FED is computed with samples because this is the number of sentences in the WikiText-103 validation data, after filtering outliers.

To compute the reverse language model scores at different softmax temperatures, we used the same architecture as the best EMNLP2017 News language model. We trained a language model on 268590 model samples and used it to score the validation data.

D.6 Datasets

WikiText-103 is available at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip. EMNLP2017 News is available at http://www.statmt.org/wmt17/ and a preprocessed version at https://github.com/pclucas14/GansFallingShort/blob/master/real_data_experiments/data/news/.

Appendix E Fréchet Embedding Distance sensitivity to sentence length

We show that FED is slightly dependent on sentence length, highlighting a possible limitation of this metric. For each sentence length, we randomly select a subset of 10k sentences from the EMNLP2017 News training set conditioned on this sentence length, and we measure the FED between this subset and the 10k validation set. We show the results in Figure 6(a). We see that there is a small dependence of FED on sentence length: FED seems to be worse for sentences that are significantly shorter or longer than the mean.

(a) FED vs sentence length.

(b) Providing positional information to the discriminator helps the generator capture sentence length distribution correctly.
Figure 6: EMNLP2017 News results.

Appendix F Hyperparameter variance

Here we clarify the definition of the subset of hyper-parameter space that we used to show the stability of our training procedure. All runs with hyper-parameters in the ranges defined below gave good results in our experiments.

  • baseline decay (α in Equation 8 in Appendix D) is in .

  • batch size is in

  • discriminator dropout is in

  • discriminator LSTM feature size is in

  • discriminator learning rate is in

  • discriminator weight decay is in

  • discriminator LSTM number of layers is in

  • number of discriminator updates per training step is in

  • discount factor in REINFORCE is in

  • generator LSTM feature size is in

  • generator learning rate is in

  • generator LSTM number of layers is in

  • number of generator updates per training step is in

  • dimension of trainable embeddings is in

Appendix G Positional information in the discriminator

Here we discuss the importance of providing positional information to the discriminator. In early experiments we noticed that the distribution of sentence length in the generator samples did not match the distribution of sentence length found in the real data. In theory, we would expect a discriminator based on an LSTM to be able to easily spot samples that are significantly too short or too long, and to provide that signal to the generator. But in practice, the generator was biased towards avoiding short and long sentences.

We therefore provide the discriminator with explicit positional information, by concatenating a fixed sinusoidal signal to the word embeddings used in the discriminator. We choose 8 periods $T_1, \dots, T_8$, log-linearly spaced, such that the largest period $T_8$ is 4 times the maximum sentence length. For the token at position $t$ in the sentence, the positional information is the vector $\left(\sin(2\pi t / T_i)\right)_{i=1,\dots,8}$. We concatenate this positional information to the word embedding for the token at position $t$ in the discriminator before using it as input for the discriminator LSTM.
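
A minimal sketch of this construction; the smallest period is not specified here, so min_period below is an assumption.

```python
import numpy as np


def positional_features(position, max_len, num_periods=8, min_period=1.0):
    """Fixed sinusoidal positional signal concatenated to the discriminator's
    word embeddings: num_periods sinusoids with log-linearly spaced periods,
    the largest being 4x the maximum sentence length (min_period is assumed)."""
    periods = np.geomspace(min_period, 4.0 * max_len, num=num_periods)
    return np.sin(2.0 * np.pi * position / periods)
```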

Figure 6(b) shows distributions of sentence length in samples of two GAN models, one with and one without this positional information. We compare these distributions against the reference distribution of sentence length in the training data. Even with positional information in the discriminator, the generator still seems slightly biased towards shorter sentences, compared to the training data. But the sentence length distribution is still a much better fit with positional information than without.

Appendix H Samples

Training examples from both datasets can be found in Table 6. Samples from our model, the maximum likelihood trained language model, and the n-gram model can be found in Tables 7, 9, and 10.

EMNLP2017 News
   My sources have suggested that so far the company sees no reason to change its tax structures , which are perfectly legal .
   I ’ d say this is really the first time I ’ ve had it in my career : how good I feel about my game and knowing where it ’ s at .
   We would open our main presents after lunch ( before the Queen ’ s speech ) then take the dog for a walk .
WikiText-103
   the actual separation of technetium @-@ N from spent nuclear fuel is a long process .
   she was launched on N december N , after which fitting @-@ out work commenced .
   covington was extremely intrigued by their proposal , considering eva perón to be a non @-@ commercial idea for a musical .
Table 6: Training data examples on EMNLP2017 News and WikiText-103.
EMNLP2017 News
   We are pleased for the trust and it was incredible , our job quickly learn the shape and get on that way .
   But I obviously have him with the guys , maybe in Melbourne , the players that weren ’ t quite clear there .
   There is task now that the UK will make for the society to seek secure enough government budget fund reduce the economy .
   Keith is also held in 2005 and Ted ’ s a successful campaign spokeswoman for students and a young brothers has took an advantage of operator .
   And as long as it is lower about , our families are coming from a friend of a family .
WikiText-103
   the general manager of the fa cup final was intended for the final day as a defensive drive , rather than twenty field goals .
   the faces of competitive groups and visual effects were in much of the confidence of the band at UNK ’s over close circles , and as well as changing the identical elements to the computing .
   a much UNK ground was believed to convey UNK other words , which had been UNK writing and that he possessed receiving given powers by his UNK transport , rather than rendered well prior to his “ collapse of the local government .
   the highest viewership from the first N @.@ N % of the debate over the current event .
   the british the united states launched double special education to its N % ;
Table 7: Randomly selected ScratchGAN samples on EMNLP2017 News and WikiText-103.
Beginning of training, FED=0.54
   because kicking firm transparency accommodation Tim earnings While contribution once forever diseases O spotlight furniture intervention guidelines false Republicans Asked defeated raid - who rapid Bryant felt ago oil refused deals today dance stocks Center reviews Storm residents emerging Duke blood draw chain Law expanding code few MPs stomach <unk> countries civilians
   March labour leave theft afterwards coach 1990 importance issues American revealing players reports confirmed depression crackdown Green publication violence keeps 18th address defined photos experiencing implemented Center shots practical visa felt tweeted hurt Raiders lies artist 1993 reveal cake Amazon express party although equal touch Protection performance own rule Under golden routine
During training, FED=0.034
   Cuba owners might go him because a break in a very small - defeat City drive an Commons
   Germany made it by the chairman of his supporters , who are closed in Denver and 4 average -
   Nine news she scored Donald Trump , appeared to present a New -
   If he did , he wants a letter of the electorate that he accepted the nomination campaign for his first campaign to join passing the election .
   The former complaint she said : ” whatever this means certain players we cannot have the result of the current market .
End of training, FED=0.018
   She ’ s that result she believes that for Ms . Marco Rubio ’ s candidate and that is still become smaller than ever .
   I hadn ’ t been able to move on the surface – if grow through ,’ she said , given it at a time later that time .
   If Iran wins business you have to win ( Iowa ) or Hillary Clinton ’ s survived nothing else since then , but also of all seeks to bring unemployment .
   All the storm shows is incredible , most of the kids who are telling the girls the people we ’ re not turning a new study with a challenging group .
   Six months before Britain were the UK leaving the EU we will benefit from the EU - it is meeting by auto , from London , so it ’ s of also fierce faith Freedom .
Table 8: Randomly selected ScratchGAN samples on EMNLP2017 News as training progresses.
EMNLP2017 News
   It came out the five days of the developing player waiting to begin the final major European championship of state - owned teams in 2015 and 2015 .
   ” I look from my size , you know in the most part , I ’ ve been fighting every day , ” she says .
   When you are around mid - 2006 , you play one and train with you earlier this year and the manager would make the opposition .
   She said : ’ I ’ d like food to be now , where my baby and children deserve to be someone ’ s kids .
   He ’ d been very good at that , but it ’ s fun , the camera have been incredibly tight - with that we can be on the ball at the beginning of his debut .
WikiText-103
   in an interview with journalist UNK UNK during his death , a new specimen was brought in the UNK museum of modern art .
   after the sets of UNK wear UNK and UNKUNKUNK to tell him , UNK UNK they play UNK UNK with UNK around a UNK .
   after he urged players to fight for what he saw as a fantastic match , the bank sustained a fractured arm and limited injury in the regular season .
   the album peaked at number eight on rolling stones ’ s N .
   in the UNK sitting on the starboard N @-@ inch , a UNK woman looks ( UNK UNK ) with an eagle during the day of all singing due to her the doors being edged far through where she UNK , which included UNK , UNK UNK , UNK UNK ,    and UNK ’s motifs on the bridge .
Table 9: Randomly selected MLE model samples on EMNLP2017 News and WikiText-103.
EMNLP2017 News
   It ’ s like a ’ test site will boost powerful published on the question , 60 years on the fact that at moment .
   The bridge opens fire Dallas - and they ’ ll be best remembered as scheduled by accident and emergency units .
   The study focused on everything Donald Trump was ” somebody to cope with a social events that was not wearing the result of a 1 , 2017 , will be in .
   It ’ s going to finish me off , when a recent poll , more than the actual match to thank the British way of the seven years .
   We can be sure that has been struck off by the company , is to be completed by Smith had taken a week later , you just like , what ’ s going on in everyday reflects a material drone hundreds of comments .
Table 10: Randomly selected samples from an -gram model with Kneser-Ney smoothing.