Evaluating Text GANs as Language Models

by   Guy Tevet, et al.

Generative Adversarial Networks (GANs) are a promising approach for text generation that, unlike traditional language models (LMs), does not suffer from the problem of "exposure bias". However, a major hurdle for understanding the potential of GANs for text generation is the lack of a clear evaluation metric. In this work, we propose to approximate the distribution of text generated by a GAN, which permits evaluating GANs with traditional probability-based LM metrics. We apply our approximation procedure to several GAN-based models and show that they currently perform substantially worse than state-of-the-art LMs. Our evaluation procedure promotes better understanding of the relation between GANs and LMs, and can accelerate progress in GAN-based text generation.



1 Introduction

Neural networks have revolutionized the field of text generation, in machine translation Sutskever et al. (2014); Neubig (2017); Luong et al. (2016); Chen et al. (2018), summarization See et al. (2017), image captioning You et al. (2016) and many other applications Goldberg (2017).

Traditionally, text generation models are trained by going over a gold sequence of characters (tokens) from left-to-right, and maximizing the probability of the next character (token) given the history, namely, a language modeling (LM) objective. A commonly discussed drawback of such LM-based text generation is exposure bias Ranzato et al. (2016): during training, the model predicts the next token conditioned on the ground truth history, while at test time prediction is based on predicted tokens, causing a train-test mismatch. Models trained in this manner often struggle to overcome previous prediction errors.

Generative Adversarial Networks Goodfellow et al. (2014) offer a solution for exposure bias. Originally introduced for images, GANs leverage a discriminator, which is trained to discriminate between real images and generated images via an adversarial loss. In such a framework, the generator is not directly exposed to the ground truth data, but instead learns to imitate it using global feedback from the discriminator. This has led to several attempts to use GANs for text generation, with a generator using either a recurrent neural network (RNN) Yu et al. (2017); Guo et al. (2018); Press et al. (2017); Subramanian et al. (2017), or a Convolutional Neural Network (CNN) Gulrajani et al. (2017); Subramanian et al. (2017).

However, evaluating GANs is more difficult than evaluating LMs. While in language modeling, evaluation is based on the log-probability of a model on held-out text, this cannot be straightforwardly extended to GAN-based text generation, because the generator outputs discrete tokens, rather than a probability distribution. Currently, there is no single evaluation metric for GAN-based text generation, and existing metrics that are based on n-gram overlap are known to lack robustness and have low correlation with semantic coherence Semeniuta et al. (2018).

In this paper, we propose a method for evaluating GANs with standard probability-based evaluation metrics. We show that the expected prediction of a GAN generator can be viewed as a LM, and suggest a simple Monte-Carlo method for approximating it. The approximated probability distribution can then be evaluated with standard LM metrics such as perplexity or Bits Per Character (BPC).

To empirically establish our claim, we implement our evaluation on several RNN-based GANs: Press et al. (2017); Yu et al. (2017); Guo et al. (2018). We find that all models have substantially worse (higher) BPC than state-of-the-art LMs. By directly comparing to LMs, we put in perspective the current performance of RNN-based GANs for text generation. (Our code is available at http://github.com/GuyTevet/SeqGAN-eval and http://github.com/GuyTevet/rnn-gan-eval.)

2 Background

Following the success of GANs in image generation, several works applied the same idea to texts using convolutional neural networks Gulrajani et al. (2017); Subramanian et al. (2017), and later using RNNs Press et al. (2017); Yu et al. (2017). RNNs enable generating variable-length sequences, conditioning each token on the tokens generated in previous time steps. We leverage this characteristic in our approximation model (§4.1).

A main challenge in applying GANs for text is that generating discrete symbols is a non-differentiable operation. One solution is to perform a continuous relaxation of the GAN output, which leads to generators that emit a nearly discrete continuous distribution Press et al. (2017). This keeps the model differentiable and enables end-to-end training through the discriminator. Alternatively, SeqGAN Yu et al. (2017) and LeakGAN Guo et al. (2018) used policy gradient methods to overcome the differentiability requirement. We apply our approximation to both model types.

3 Evaluating GANs and LMs

LM Evaluation.

Text generation from LMs is commonly evaluated using probabilistic metrics. Specifically, given a test sequence of symbols $(t_1, \dots, t_n)$, and a LM $q$, the average cross-entropy over the entire test set is computed: $ACE = -\frac{1}{n} \sum_{i=1}^{n} \log_2 q(t_i \mid t_1 \dots t_{i-1})$. For word-based models, the standard metric is perplexity: $PP = 2^{ACE}$, while for character-based models it is $BPC = ACE$ directly.
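The relation between these metrics can be made concrete with a minimal sketch. It assumes the standard base-2 definitions (average cross-entropy as the mean negative log2-probability of the gold tokens, perplexity as 2 raised to that value, BPC equal to it); the function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def average_cross_entropy(gold_probs):
    """Mean of -log2 p(gold token | history), in bits per token."""
    return -sum(math.log2(p) for p in gold_probs) / len(gold_probs)


def perplexity(gold_probs):
    """Word-level metric: 2 raised to the average cross-entropy."""
    return 2.0 ** average_cross_entropy(gold_probs)


def bits_per_character(gold_probs):
    """Character-level metric: the average cross-entropy itself."""
    return average_cross_entropy(gold_probs)


# Hypothetical probabilities that some model assigned to four gold tokens.
example_probs = [0.5, 0.25, 0.5, 0.25]
ace = average_cross_entropy(example_probs)
pp = perplexity(example_probs)
bpc = bits_per_character(example_probs)
print(f"ACE={ace:.3f} bits, PP={pp:.3f}, BPC={bpc:.3f}")
```

Both metrics need only the probability assigned to each gold token, which is exactly what a GAN generator that emits discrete tokens does not expose directly.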

GAN-based Text Generation Evaluation.

Because GANs output discrete tokens rather than a probability distribution, LM metrics cannot be applied to evaluate the generated text. Therefore, other metrics have been used:

  • N-gram overlap: Yu et al. (2017); Press et al. (2017): Inspired by BLEU Papineni et al. (2002), this measures whether n-grams generated by the model appear in a held-out corpus. A major drawback is that this metric favors conservative models that always generate very common text (e.g., “it is”). To mitigate this, self-BLEU has been proposed Lu et al. (2018) as an additional metric, where overlap is measured between two independently sampled texts from the model.

  • LM score: The probability of generated text according to a pre-trained LM. This has the same problem of favoring conservative models.

  • Zhao et al. (2017) suggested an indirect score by training a LM on GAN-generated text, and evaluating it using perplexity. The drawback in this setting is the coupling of the performance of the GAN with that of the proxy LM.

  • Heusel et al. (2017) used Fréchet InferSent Distance (FID) to compute the distance between distributions of features extracted from real and generated samples. However, this approach relies on a problematic assumption that features are normally distributed.

  • Rajeswar et al. (2017) used a context-free grammar (CFG) to generate a reference corpus, and evaluated the model by the likelihood the CFG assigns to generated samples. However, simple CFGs do not fully capture the complexity of natural language.

  • To overcome the drawbacks of each individual method, Semeniuta et al. (2018) proposed a unified measure based on multiple evaluation metrics (N-grams, BLEU variations, FID, LM score variations and human evaluation).

Overall, current evaluation methods cannot fully capture the performance of GAN-based text generation models. While reporting various scores as proposed by Semeniuta et al. (2018) is possible, it is preferable to have a single measure of progress when comparing different text generation models.
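To illustrate why n-gram overlap favors conservative models, the following sketch computes a simplified overlap score: the fraction of generated n-grams that also occur in a reference text. This is not BLEU itself (no brevity penalty, no clipping), and the function names and toy sentences are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(generated, reference, n):
    """Fraction of n-grams in `generated` that also occur in `reference`."""
    generated_ngrams = ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    reference_ngrams = set(ngrams(reference, n))
    hits = sum(g in reference_ngrams for g in generated_ngrams)
    return hits / len(generated_ngrams)


reference = "it is a truth universally acknowledged".split()
conservative = "it is".split()                              # very common text
diverse = "a truth it acknowledged is universally".split()  # more varied text

conservative_score = ngram_overlap(conservative, reference, 2)
diverse_score = ngram_overlap(diverse, reference, 2)
print(conservative_score, diverse_score)
```

In this toy case the two-word output receives the maximal score, while the longer and more varied output is penalized, which is the bias described in the list above.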

4 Proposed Method

We now propose a method for approximating a distribution over tokens from a GAN, and then evaluating the model with standard LM metrics. We describe our approach for an RNN-based generator, which is the most commonly used architecture, but the approximation can be applied to other auto-regressive models Vaswani et al. (2017).

4.1 Language Model Approximation

The inputs to an RNN at time step $t$ are the state vector $h_t$ and the current input token $x_t$. The output token (one-hot) is denoted by $o_t$. In RNN-based GANs, the previous output token is used at inference time as the input Yu et al. (2017); Guo et al. (2018); Press et al. (2017); Subramanian et al. (2017). In contrast, when evaluating with BPC or perplexity, the gold token is given as input. Hence, LM-based evaluation neutralizes the problem of exposure bias addressed by GANs. Nevertheless, this allows us to compare the quality of text produced by GANs and LMs on an equal footing. Figure 1 illustrates the difference between inference and LM approximation.

Figure 1: Generator recurrent connections. $\{h_t\}$ is the internal state sequence and $\{o_t\}$ is the generator prediction sequence (one-hot). During inference, the outputs $\{o_t\}$ are fed back as the input for the next time step (dashed lines). During LM approximation, the input $\{x_t\}$ is a sequence of one-hot vectors from the test set.

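We can therefore define the generator function at time step $t$ as a function of the initial state $h_0$ and the past generated tokens $(x_0 \dots x_t)$, which we denote as $o_t = G_t(h_0, x_0 \dots x_t)$ ($x_0$ is a start token). Given a past sequence $(x_0 \dots x_t)$, $G_t$ is a stochastic function: the stochasticity of $G_t$ can be gained either by using a noise vector as the initial state $h_0$ Press et al. (2017), or by sampling from the GAN's internal distribution over possible output tokens Yu et al. (2017); Guo et al. (2018). Since $h_0$ is either constant or a noise vector that makes $G_t$ stochastic, we can omit it to get $G_t(x_0 \dots x_t)$. In such a setup, the expected value $\mathbb{E}[G_t(x_0 \dots x_t)]$ is a distribution $q$ over the next vocabulary token $a_{t+1}$:

$$q(a_{t+1} \mid x_0 \dots x_t) = \{\mathbb{E}[G_t(x_0 \dots x_t)]\}_{a_{t+1}}$$

To empirically approximate $\mathbb{E}[G_t(x_0 \dots x_t)]$, we can draw $N$ i.i.d. samples from $G_t(x_0 \dots x_t)$ and compute the approximation $\tilde{G}_{t,N} = \frac{1}{N} \sum_{n=1}^{N} g_{t,n}$, where $g_{t,n}$ is one sample from $G_t(x_0 \dots x_t)$. Then, according to the strong law of large numbers:

$$\mathbb{E}[G_t(x_0 \dots x_t)] = \lim_{N \to \infty} \tilde{G}_{t,N} \quad (1)$$
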

Given this approximate LM distribution, we can evaluate a GAN using perplexity or BPC. We summarize the evaluation procedure in Algorithm 1.

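Algorithm 1: LM Evaluation of RNN-based GANs
Require: $G_t(\cdot)$: the generator function at time step $t$; $(x_0, \dots, x_t)$: previous gold tokens; $f(\cdot, \cdot)$: a LM evaluation metric; $N$: number of samples
    for $n \leftarrow 1$ to $N$ do
        $g_{t,n} \leftarrow$ sample from $G_t(x_0 \dots x_t)$
    $\tilde{G}_{t,N} \leftarrow \frac{1}{N} \sum_{n=1}^{N} g_{t,n}$
    return $f(\tilde{G}_{t,N}, x_{t+1})$, where $x_{t+1}$ is the gold next token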
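The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a GAN generator that only returns sampled token indices, the three-symbol vocabulary is invented, and the `floor` argument (which avoids taking the log of zero when the gold token was never sampled) is an added assumption.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def toy_generator(history, rng):
    """Hypothetical stochastic generator: returns one sampled token index.

    Its internal distribution is hidden from the evaluation code, as with a
    GAN generator that only emits discrete tokens.
    """
    # Toy rule: after "a" prefer "b", otherwise prefer "a".
    if history and history[-1] == 0:
        weights = [0.1, 0.8, 0.1]
    else:
        weights = [0.7, 0.2, 0.1]
    return rng.choices(range(len(VOCAB)), weights=weights)[0]


def approximate_distribution(generator, history, num_samples, rng):
    """Average of num_samples one-hot samples, i.e. relative token counts."""
    counts = [0] * len(VOCAB)
    for _ in range(num_samples):
        counts[generator(history, rng)] += 1
    return [c / num_samples for c in counts]


def approximate_bpc(generator, gold_tokens, num_samples, rng, floor=1e-12):
    """Teacher-forced BPC: condition on gold history, score the gold next token."""
    total_bits = 0.0
    for t in range(1, len(gold_tokens)):
        dist = approximate_distribution(generator, gold_tokens[:t], num_samples, rng)
        total_bits += -math.log2(max(dist[gold_tokens[t]], floor))
    return total_bits / (len(gold_tokens) - 1)


rng = random.Random(0)
gold = [0, 1, 0, 1, 0]  # "ababa" as token indices
bpc_estimate = approximate_bpc(toy_generator, gold, num_samples=2000, rng=rng)
print(f"approximate BPC: {bpc_estimate:.3f}")
```

Because the gold history is fed at every step, the generator is never conditioned on its own outputs during evaluation, which matches the teacher-forced setting described in Section 4.1.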

4.2 Approximation Bound

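What $N$ should we choose to get a good approximation of $\tilde{G}_{t,N}$ to $\mathbb{E}[G_t]$? Given a vocabulary $V$, we define the bad event $B$ in which there exists $j \in V$ such that $|\{\mathbb{E}[G_t]\}_j - \{\tilde{G}_{t,N}\}_j| > \gamma$. Explicitly, $B = \bigcup_{j \in V} \{ |\{\mathbb{E}[G_t]\}_j - \{\tilde{G}_{t,N}\}_j| > \gamma \}$. In §A we prove that if we want to bound the probability of a bad event by some $\epsilon$, we should choose $N > \frac{1}{2\gamma^2} \ln \frac{2|V|}{\epsilon}$.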

As a numerical example, choosing $\gamma$ and $\epsilon$ for a character-based LM over the text8 dataset, with $|V| = 27$, gives a concrete lower bound on $N$, which grows as $1/\gamma^2$ and is therefore large for small $\gamma$. In practice, probability vectors of LMs tend to be sparse Kim et al. (2016). Thus, we argue that we can use a much smaller $N$ for a good approximation $\tilde{G}_{t,N}$. Since the sparsity of LMs is difficult to bound, as it differs between models, we suggest an empirical method for choosing $N$.
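A quick calculation shows how large such a bound can become. The sketch below assumes the bound takes the form obtained from the two-sided Hoeffding inequality with a union bound over the vocabulary, $N > \ln(2|V|/\epsilon) / (2\gamma^2)$, as outlined in Appendix A. The values of `gamma` and `epsilon` are illustrative assumptions and are not taken from the text; only the vocabulary size of 27 comes from the text8 setup.

```python
import math


def sample_lower_bound(vocab_size, gamma, epsilon):
    """Lower bound on N from Hoeffding plus a union bound over the vocabulary."""
    return math.log(2 * vocab_size / epsilon) / (2 * gamma ** 2)


VOCAB_SIZE = 27   # text8: 26 letters plus space
gamma = 1e-3      # assumed tolerance per vocabulary entry (illustrative)
epsilon = 1e-2    # assumed probability of a bad event (illustrative)

bound = sample_lower_bound(VOCAB_SIZE, gamma, epsilon)
print(f"N must exceed roughly {bound:,.0f} samples")

# Loosening the tolerance by 10x shrinks the bound by 100x (1 / gamma^2 scaling).
looser_bound = sample_lower_bound(VOCAB_SIZE, 10 * gamma, epsilon)
print(f"with gamma = {10 * gamma:g}: roughly {looser_bound:,.0f} samples")
```

Under these assumed values the worst-case bound runs into the millions of samples per time step, which motivates the empirical choice of $N$ discussed next.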

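The approximation $\tilde{G}_{t,N}$ is a converging sequence, particularly over $\|\cdot\|_\infty$ (see Equation 1). Hence, we can empirically choose an $N$ which satisfies $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty < \gamma$ for some $\alpha \in \mathbb{N}$. In §5 we empirically measure $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of $N$ to choose $N$. We choose a global $N$ for a model, rather than for every $t$, by averaging over a subset of the evaluation set.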
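One way to realize this empirical selection is sketched below. It is an assumption-laden illustration rather than the authors' code: the criterion is taken to be the max-norm gap between the running estimate after $N$ samples and the estimate a fixed number of samples earlier (`lag`), averaged over several positions; the hidden categorical distributions in `position_weights` stand in for a generator at different time steps, and all names and values are hypothetical.

```python
import random


def averaged_errors(position_weights, lag, max_samples, rng):
    """Map N -> mean over positions of max |estimate_N - estimate_(N - lag)|."""
    vocab = range(len(position_weights[0]))
    counts = [[0] * len(vocab) for _ in position_weights]
    estimates = []  # estimates[n - 1][pos] is the estimate after n samples
    errors = {}
    for n in range(1, max_samples + 1):
        for pos, weights in enumerate(position_weights):
            token = rng.choices(vocab, weights=weights)[0]
            counts[pos][token] += 1
        estimates.append([[c / n for c in row] for row in counts])
        if n > lag:
            older = estimates[n - lag - 1]
            gaps = [max(abs(a - b) for a, b in zip(new, old))
                    for new, old in zip(estimates[-1], older)]
            errors[n] = sum(gaps) / len(gaps)
    return errors


def choose_num_samples(errors, tolerance):
    """Smallest N whose averaged error is below `tolerance`, or None."""
    for n in sorted(errors):
        if errors[n] < tolerance:
            return n
    return None


rng = random.Random(0)
# Hypothetical next-token distributions at three evaluation positions.
position_weights = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
errors = averaged_errors(position_weights, lag=10, max_samples=2000, rng=rng)
chosen = choose_num_samples(errors, tolerance=0.01)
print("chosen N:", chosen)
```

A single position can satisfy the criterion by chance at a small $N$, so averaging over positions (and, in practice, adding a safety margin on top of the first $N$ that passes) makes the choice more stable.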

5 Evaluation

| Approach | Model | BPC | Approx. BPC |
|---|---|---|---|
| Language Models | mLSTM + dynamic eval Krause et al. (2018) | 1.19 | |
| | Large mLSTM +emb +WN +VD Krause et al. (2017) | 1.27 | |
| | Large RHN Zilly et al. (2016) | 1.27 | |
| | LayerNorm HM-LSTM Chung et al. (2017) | 1.29 | |
| | BN LSTM Cooijmans et al. (2017) | 1.36 | |
| | Unregularised mLSTM Krause et al. (2017) | 1.40 | |
| GANs (LM Approximation) | SeqGAN - pre-trained LM Yu et al. (2017) | 1.93 | 2.05 |
| | SeqGAN - full adversarial training Yu et al. (2017) | 2.06 | 2.25 |
| | Recurrent GAN without pre-training Press et al. (2017) | | 3.31 |
| Uniform Distribution | | 4.75 | |

Table 1: Test set evaluation of different character-based models on the text8 dataset. State-of-the-art results are taken from https://github.com/sebastianruder/NLP-progress/blob/master/language_modeling.md.
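The uniform baseline in Table 1 can be checked directly: a model that assigns equal probability to every symbol has a BPC of log2 of the vocabulary size. The sketch assumes the 27-symbol text8 alphabet (26 letters plus space) described in the evaluation settings; the function name is hypothetical.

```python
import math

VOCAB_SIZE = 27  # 26 English letters plus space, as in text8


def uniform_bpc(vocab_size):
    """BPC of a model assigning probability 1 / vocab_size to every symbol."""
    return -math.log2(1.0 / vocab_size)


uniform = uniform_bpc(VOCAB_SIZE)
print(f"uniform BPC: {uniform:.3f}")
```

The result is about 4.755 bits, in line with the 4.75 reported in the table, and gives an upper reference point against which the GAN scores can be read.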

5.1 Models

We focus on character-based GANs, which are more common than word-based GANs. We evaluate two RNN-based GANs with different characteristics. As opposed to the original GAN model Goodfellow et al. (2014), in which the generator is initialized with random noise, the GANs we evaluated both leverage gold standard text to initialize the generator, as detailed below.

Recurrent GAN Press et al. (2017) is a continuous RNN-based generator which minimizes the improved WGAN loss Gulrajani et al. (2017). To guide the generator, during training it is initialized with the first $i-1$ characters from the ground truth, starting the prediction in the $i$th character. Stochasticity is obtained by feeding the generator with a noise vector as a hidden state. At each time step, the input to the RNN generator is the output distribution of the previous step.

SeqGAN Yu et al. (2017) is a discrete RNN-based generator. To guide the generator, it is pre-trained as a LM on ground truth text. Stochasticity is obtained by sampling tokens from an internal distribution function over the vocabulary. To overcome the non-differentiability problem, it is trained using a policy gradient objective Sutton et al. (2000).

We also evaluated LeakGAN Guo et al. (2018), another discrete RNN-based generator, but since it is similar to SeqGAN and had lower performance, we omit it for brevity.

5.2 Evaluation Settings

Figure 2: Approximate error as a function of the number of samples $N$.

To compare to prior work in LM, we follow the common setup and train on the text8 dataset (http://mattmahoney.net/dc/textdata). The dataset is derived from Wikipedia, and includes 26 English characters plus spaces. We use the standard 90/5/5 split to train/validation/test. Finally, we measure performance with BPC.

We tuned hyper-parameters on the validation set, including sequence length to generate at test time (7 for Press et al. (2017), 1000 for Yu et al. (2017)). We chose the number of samples $N$ empirically for each model, as described in Section 4.2. We set $\alpha$ to 10, and chose the boundary $\gamma$ as a good trade-off between accuracy and run-time. Figure 2 plots the approximate error $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of $N$. For both models, the condition is satisfied once $N$ is large enough (red line in Figure 2). To be safe, we used a larger $N$ than this threshold.

5.3 Results

Table 1 describes model performance on the test set. Because SeqGAN models output a distribution over tokens at every time step, we can measure the true BPC and assess the quality of our approximation. Indeed, we observe that approximate BPC is only slightly higher than the true BPC.

GAN-based models perform worse than state-of-the-art LMs by a large margin. Moreover, in SeqGAN, the pre-trained LM (1.93 BPC) performs better than the fully trained model (2.06 BPC), and BPC deteriorates as adversarial training continues. Finally, we note that generating sequences longer than 7 characters hurts the BPC of Press et al. (2017). It is difficult to assess the quality of generation with such short sequences.

6 Conclusions

We propose an evaluation procedure for text GANs that is based on approximating the GAN output distribution and using standard LM metrics. We provide a bound for the number of samples required for the approximation and empirically show that, in practice, far fewer samples per time-step suffice. We evaluate character-based GAN models using our procedure, and show that their performance is substantially worse than that of state-of-the-art LMs. We hope our simple evaluation method leads to progress in GAN-based text generation by shedding light on the quality of such models.


  • Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.
  • Chernoff et al. (1952) Herman Chernoff et al. 1952. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507.
  • Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations (Conference Track).
  • Cooijmans et al. (2017) Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2017. Recurrent batch normalization. In International Conference on Learning Representations (Conference Track).
  • Goldberg (2017) Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777.
  • Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5141–5148.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc.
  • Hoeffding (1963) Wassily Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
  • Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.
  • Krause et al. (2018) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2771–2780.
  • Krause et al. (2017) Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2017. Multiplicative lstm for sequence modelling. In ICLR.
  • Lu et al. (2018) Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133.
  • Luong et al. (2016) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.
  • Neubig (2017) Graham Neubig. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Press et al. (2017) Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. In 1st Workshop on Learning to Generate Natural Language at ICML 2017.
  • Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.
  • Semeniuta et al. (2018) Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2018. On accurate evaluation of gans for language generation. In Advances in Neural Information Processing Systems.
  • Subramanian et al. (2017) Sandeep Subramanian, Sai Rajeswar, Francis Dutil, Chris Pal, and Aaron Courville. 2017. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 241–251, Vancouver, Canada. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659.
  • Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
  • Zhao et al. (2017) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. 2017. Adversarially regularized autoencoders for generating discrete structures. CoRR, abs/1706.04223.
  • Zilly et al. (2016) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.

Appendix A Approximation Bound Proof

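We provide a theoretical bound for choosing a number of samples $N$ that results in a good approximation of $\tilde{G}_{t,N}$ to $\mathbb{E}[G_t]$.

Perplexity and BPC rely on the log-probability of the ground truth token. Since the ground truth token is unknown, we conservatively define the bad event $B$ in which there exists $j \in V$ such that $|\{\mathbb{E}[G_t]\}_j - \{\tilde{G}_{t,N}\}_j| > \gamma$, where $V$ is the vocabulary. We can then bound the probability of $B$ by some $\epsilon$. We define the following notations:

  1. The probability of a token $a_t$ to be $j$ is $p_j = \{\mathbb{E}[G_t]\}_j$.

  2. $\chi_{j,n} = \{g_{t,n}\}_j$ is a random variable representing the binary value of the $j$'th index of the one-hot vector $g_{t,n}$. Note that the average of $\chi_{j,n}$ over $N$ samples is $X_j = \frac{1}{N} \sum_{n=1}^{N} \chi_{j,n} = \{\tilde{G}_{t,N}\}_j$.

Using the above notation, we can re-define the probability of the bad event $B$ with respect to the individual coordinates in the vectors:

$$\Pr(B) = \Pr\left( \bigcup_{j \in V} \{ |p_j - X_j| > \gamma \} \right)$$

We note that $\chi_{j,n} \sim \mathrm{Bernoulli}(p_j)$, and given that $\{\chi_{j,n}\}_{n=1}^{N}$ are i.i.d., we can apply the Chernoff-Hoeffding theorem Chernoff et al. (1952); Hoeffding (1963). According to the theorem, for every $j \in V$, $\Pr(|X_j - p_j| > \gamma) \le 2 e^{-2N\gamma^2}$. Taking the union bound over $V$ and requiring the result to be smaller than $\epsilon$ implies:

$$\Pr(B) \le 2|V| e^{-2N\gamma^2} < \epsilon$$

Hence, we get a lower bound on $N$:

$$N > \frac{1}{2\gamma^2} \ln \frac{2|V|}{\epsilon}$$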
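The bound can be sanity-checked by simulation. The sketch below assumes the two-sided Hoeffding inequality combined with a union bound, giving $\Pr(B) \le 2|V| e^{-2N\gamma^2}$, and compares it with the observed frequency of the bad event for a hypothetical three-symbol distribution. All names and numbers are illustrative assumptions.

```python
import math
import random


def union_hoeffding_bound(vocab_size, num_samples, gamma):
    """Upper bound on the bad-event probability, capped at 1."""
    return min(1.0, 2 * vocab_size * math.exp(-2 * num_samples * gamma ** 2))


def bad_event_frequency(probs, num_samples, gamma, trials, rng):
    """Fraction of trials in which some coordinate deviates by more than gamma."""
    vocab = range(len(probs))
    bad = 0
    for _ in range(trials):
        counts = [0] * len(probs)
        for j in rng.choices(vocab, weights=probs, k=num_samples):
            counts[j] += 1
        if any(abs(c / num_samples - p) > gamma for c, p in zip(counts, probs)):
            bad += 1
    return bad / trials


rng = random.Random(0)
probs = [0.6, 0.3, 0.1]        # hypothetical true next-token distribution
gamma, num_samples = 0.1, 200  # illustrative tolerance and sample count

bound = union_hoeffding_bound(len(probs), num_samples, gamma)
observed = bad_event_frequency(probs, num_samples, gamma, trials=500, rng=rng)
print(f"bound={bound:.4f}, observed={observed:.4f}")
```

The observed frequency is expected to fall well below the bound, which is consistent with the remark in Section 4.2 that the theoretical requirement on $N$ is conservative.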