1 Introduction
Neural networks have revolutionized the field of text generation, in machine translation Sutskever et al. (2014); Neubig (2017); Luong et al. (2016); Chen et al. (2018), summarization See et al. (2017), image captioning You et al. (2016) and many other applications Goldberg (2017).
Traditionally, text generation models are trained by going over a gold sequence of characters (tokens) from left-to-right, and maximizing the probability of the next character (token) given the history, namely, a language modeling (LM) objective. A commonly discussed drawback of such LM-based text generation is exposure bias Ranzato et al. (2016): during training, the model predicts the next token conditioned on the ground truth history, while at test time prediction is based on predicted tokens, causing a train-test mismatch. Models trained in this manner often struggle to overcome previous prediction errors.
Generative Adversarial Networks (GANs) Goodfellow et al. (2014) offer a solution for exposure bias. Originally introduced for images, GANs leverage a discriminator, which is trained to discriminate between real images and generated images via an adversarial loss. In such a framework, the generator is not directly exposed to the ground truth data, but instead learns to imitate it using global feedback from the discriminator. This has led to several attempts to use GANs for text generation, with a generator using either a recurrent neural network (RNN) Yu et al. (2017); Guo et al. (2018); Press et al. (2017); Subramanian et al. (2017), or a Convolutional Neural Network (CNN) Gulrajani et al. (2017); Subramanian et al. (2017).
However, evaluating GANs is more difficult than evaluating LMs. While in language modeling, evaluation is based on the log-probability of a model on held-out text, this cannot be straightforwardly extended to GAN-based text generation, because the generator outputs discrete tokens, rather than a probability distribution. Currently, there is no single evaluation metric for GAN-based text generation, and existing metrics that are based on n-gram overlap are known to lack robustness and have low correlation with semantic coherence Semeniuta et al. (2018).
In this paper, we propose a method for evaluating GANs with standard probability-based evaluation metrics. We show that the expected prediction of a GAN generator can be viewed as a LM, and suggest a simple Monte-Carlo method for approximating it. The approximated probability distribution can then be evaluated with standard LM metrics such as perplexity or Bits Per Character (BPC).
To empirically establish our claim, we implement our evaluation on several RNN-based GANs: Press et al. (2017); Yu et al. (2017); Guo et al. (2018). We find that all models have substantially higher (i.e., worse) BPC than state-of-the-art LMs. By directly comparing to LMs, we put in perspective the current performance of RNN-based GANs for text generation. (Our code is available at: http://github.com/GuyTevet/SeqGANeval & http://github.com/GuyTevet/rnnganeval.)
2 Background
Following the success of GANs in image generation, several works applied the same idea to texts using convolutional neural networks Gulrajani et al. (2017); Subramanian et al. (2017), and later using RNNs Press et al. (2017); Yu et al. (2017). RNNs enable generating variable-length sequences, conditioning each token on the tokens generated in previous time steps. We leverage this characteristic in our approximation model (§4.1).
A main challenge in applying GANs for text is that generating discrete symbols is a non-differentiable operation. One solution is to perform a continuous relaxation of the GAN output, which leads to generators that emit a nearly discrete continuous distribution Press et al. (2017). This keeps the model differentiable and enables end-to-end training through the discriminator. Alternatively, SeqGAN Yu et al. (2017) and LeakGAN Guo et al. (2018) used policy gradient methods to overcome the differentiability requirement. We apply our approximation to both model types.
3 Evaluating GANs and LMs
LM Evaluation.
Text generation from LMs is commonly evaluated using probabilistic metrics. Specifically, given a test sequence of symbols $(t_1, \ldots, t_n)$ and a LM $q$, the average cross-entropy over the entire test set is computed: $ACE = -\frac{1}{n}\sum_{i=1}^{n}\log_2 q(t_i \mid t_1 \ldots t_{i-1})$. For word-based models, the standard metric is perplexity: $PP = 2^{ACE}$, while for character-based models it is $BPC = ACE$ directly.
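The relation between these metrics can be illustrated with a minimal sketch. It assumes base-2 logarithms (so that cross-entropy is measured in bits) and that the probabilities the model assigned to each gold token are already available; the function names are illustrative and not taken from any released code.

```python
import math

def average_cross_entropy(token_probs):
    """Average cross-entropy in bits, given the probability the model
    assigned to each gold token of the test sequence."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity (word-based models): 2 raised to the average cross-entropy."""
    return 2.0 ** average_cross_entropy(token_probs)

def bits_per_character(token_probs):
    """BPC (character-based models): the average cross-entropy itself."""
    return average_cross_entropy(token_probs)

# A uniform model over 27 symbols assigns 1/27 to every gold character.
uniform_probs = [1.0 / 27] * 1000
print(round(bits_per_character(uniform_probs), 2))  # 4.75
```

A uniform model over a 27-symbol vocabulary gives $\log_2 27 \approx 4.75$ BPC, which is a useful sanity check for any implementation.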
GANbased Text Generation Evaluation.
Because GANs output discrete tokens rather than a probability distribution, LM metrics cannot be applied to evaluate the generated text. Therefore, other metrics have been used:

N-gram overlap Yu et al. (2017); Press et al. (2017): Inspired by BLEU Papineni et al. (2002), this measures whether n-grams generated by the model appear in a held-out corpus. A major drawback is that this metric favors conservative models that always generate very common text (e.g., “it is”). To mitigate this, self-BLEU has been proposed Lu et al. (2018) as an additional metric, where overlap is measured between two independently sampled texts from the model.

LM score: The probability of generated text according to a pretrained LM. This has the same problem of favoring conservative models.

Zhao et al. (2017) suggested an indirect score by training a LM on GAN-generated text, and evaluating it using perplexity. The drawback in this setting is the coupling of the performance of the GAN with that of the proxy LM.

Heusel et al. (2017) used Fréchet InferSent Distance (FID) to compute the distance between distributions of features extracted from real and generated samples. However, this approach relies on a problematic assumption that features are normally distributed.

Rajeswar et al. (2017) used a context-free grammar (CFG) to generate a reference corpus, and evaluated the model by the likelihood the CFG assigns to generated samples. However, simple CFGs do not fully capture the complexity of natural language.

To overcome the drawbacks of each individual method, Semeniuta et al. (2018) proposed a unified measure based on multiple evaluation metrics (N-grams, BLEU variations, FID, LM score variations and human evaluation).
Overall, current evaluation methods cannot fully capture the performance of GAN-based text generation models. While reporting various scores as proposed by Semeniuta et al. (2018) is possible, it is preferable to have a single measure of progress when comparing different text generation models.
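The bias of n-gram overlap toward conservative output can be seen in a small sketch. This is a simplified overlap ratio rather than BLEU itself, and the example sentences are made up for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(generated, reference, n):
    """Fraction of n-grams in `generated` that also occur in `reference`."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)

# Hypothetical held-out text and two hypothetical model outputs.
reference = "it is a truth universally acknowledged that it is raining".split()
conservative = "it is".split()
varied = "a truth is universally raining".split()

print(ngram_overlap(conservative, reference, 2))  # 1.0
print(ngram_overlap(varied, reference, 2))        # 0.25
```

The output that only repeats a very common bigram gets a perfect score, while the more varied output is penalized, even though neither score says much about fluency or coherence.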
4 Proposed Method
We now propose a method for approximating a distribution over tokens from a GAN, and then evaluate the model with standard LM metrics. We will describe our approach given an RNN-based LM, which is the most commonly-used architecture, but the approximation can be applied to other auto-regressive models Vaswani et al. (2017).
4.1 Language Model Approximation
The inputs to an RNN at time step $t$ are the state vector $h_t$ and the current input token $x_t$. The output token (one-hot) is denoted by $o_t$. In RNN-based GANs, the previous output token is used at inference time as the input $x_t$ Yu et al. (2017); Guo et al. (2018); Press et al. (2017); Subramanian et al. (2017). In contrast, when evaluating with BPC or perplexity, the gold token $x_t$ is given as input. Hence, LM-based evaluation neutralizes the problem of exposure bias addressed by GANs. Nevertheless, this allows us to compare the quality of text produced by GANs and LMs on an equal footing. Figure 1 illustrates the difference between inference time and LM approximation.
We can therefore define the generator function at time step $t$ as a function of the initial state $h_0$ and the past generated tokens $(x_0 \ldots x_t)$, which we denote as $o_t = G_t(h_0, x_0 \ldots x_t)$ ($x_0$ is a start token). Given a past sequence $(x_0 \ldots x_t)$, $G_t$ is a stochastic function: the stochasticity of $G_t$ can be gained either by using a noise vector as the initial state $h_0$ Press et al. (2017), or by sampling from the GAN’s internal distribution over possible output tokens Yu et al. (2017); Guo et al. (2018). Since $h_0$ is constant or a noise vector that makes $G_t$ stochastic, we can omit it to get $G_t(x_0 \ldots x_t)$. In such a setup, the expected value $\mathbb{E}[G_t(x_0 \ldots x_t)]$ is a distribution $q$ over the next vocabulary token; for each vocabulary token $v$:
$$q(o_t = v \mid x_0 \ldots x_t) = \{\mathbb{E}[G_t(x_0 \ldots x_t)]\}_v$$
To empirically approximate $q$, we can draw $N$ i.i.d. samples from $G_t(x_0 \ldots x_t)$ and compute the approximation $\tilde{G}_{t,N} = \frac{1}{N}\sum_{n=1}^{N} g_{t,n}$, where $g_{t,n}$ is one sample from $G_t(x_0 \ldots x_t)$. Then, according to the strong law of large numbers:
$$\mathbb{E}[G_t(x_0 \ldots x_t)] = \lim_{N \to \infty} \tilde{G}_{t,N} \qquad (1)$$
Given this approximate LM distribution, we can evaluate a GAN using perplexity or BPC. We summarize the evaluation procedure in Algorithm 1.
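The procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: `sample_next_token` is a hypothetical callable standing in for one stochastic forward pass of the generator given the gold history, `toy_generator` is a made-up stand-in for a real GAN, and the probability floor (to avoid taking the log of zero when the gold token was never sampled) is an assumption of this sketch.

```python
import math
import random

def approximate_lm_distribution(sample_next_token, history, vocab_size, num_samples):
    """Monte-Carlo estimate of the next-token distribution: average
    `num_samples` one-hot generator outputs produced from the gold history."""
    counts = [0] * vocab_size
    for _ in range(num_samples):
        counts[sample_next_token(history)] += 1
    return [c / num_samples for c in counts]

def approximate_bpc(sample_next_token, test_tokens, vocab_size, num_samples, floor=1e-12):
    """BPC of the approximated LM on `test_tokens`, feeding gold tokens as input."""
    total_bits = 0.0
    for t in range(1, len(test_tokens)):
        dist = approximate_lm_distribution(
            sample_next_token, test_tokens[:t], vocab_size, num_samples)
        # `floor` is an assumption of this sketch, not part of the method.
        total_bits -= math.log2(max(dist[test_tokens[t]], floor))
    return total_bits / (len(test_tokens) - 1)

def toy_generator(history, rng=random.Random(0)):
    """Hypothetical generator: ignores the history and samples a token id
    from a fixed distribution over a 3-token vocabulary."""
    return rng.choices([0, 1, 2], weights=[0.5, 0.25, 0.25])[0]

print(approximate_lm_distribution(toy_generator, [0], vocab_size=3, num_samples=2000))
print(approximate_bpc(toy_generator, [0, 0, 1, 2, 0], vocab_size=3, num_samples=2000))
```

For a real model, `sample_next_token` would run the generator once with fresh noise (or a fresh sample from its internal distribution) and return the index of the emitted token.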
4.2 Approximation Bound
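What $N$ should we choose to get a good approximation of $\tilde{G}_{t,N}$ to $\mathbb{E}[G_t]$? Given a vocabulary $V$, we define the bad event $B_\epsilon$ in which there exists $v \in V$ such that $|\{\mathbb{E}[G_t]\}_v - \{\tilde{G}_{t,N}\}_v| > \epsilon$. Explicitly, $B_\epsilon = \{\|\mathbb{E}[G_t] - \tilde{G}_{t,N}\|_\infty > \epsilon\}$. In §A we prove that if we want to bound the probability of a bad event by some $\delta$, we should choose $N > \frac{1}{2\epsilon^2}\ln\left(\frac{2|V|}{\delta}\right)$.
As a numerical example, for a character-based LM over the text8 dataset, with $|V| = 27$, fixing $\epsilon$ and $\delta$ turns this inequality into a concrete lower bound on $N$. In practice, probability vectors of LMs tend to be sparse Kim et al. (2016). Thus, we argue that we can use a much smaller $N$ for a good approximation $\tilde{G}_{t,N}$. Since the sparsity of LMs is difficult to bound, as it differs between models, we suggest an empirical method for choosing $N$.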
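To get a feel for the size of the theoretical bound, the sketch below evaluates $N > \frac{1}{2\epsilon^2}\ln\left(\frac{2|V|}{\delta}\right)$, which is the form the Hoeffding and union-bound argument of Appendix A yields. The values of $\epsilon$ and $\delta$ used here are illustrative assumptions chosen for this example; only $|V| = 27$ comes from the text8 setup.

```python
import math

def min_samples(epsilon, delta, vocab_size):
    """Smallest integer N strictly above ln(2|V|/delta) / (2 * epsilon^2)."""
    bound = math.log(2 * vocab_size / delta) / (2 * epsilon ** 2)
    return math.floor(bound) + 1

# Illustrative assumptions: epsilon = 1e-3 and delta = 1e-2 (not taken from the text).
n = min_samples(epsilon=1e-3, delta=1e-2, vocab_size=27)
print(n)  # on the order of a few million samples per time step
```

Under these assumed tolerances the bound asks for millions of samples per time step, which motivates looking for a smaller, empirically chosen $N$.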
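The approximation $\tilde{G}_{t,N}$ is a converging sequence, particularly over $\|\cdot\|_\infty$ (see Equation 1).
Hence, we can empirically choose an $N$ which satisfies $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty < \gamma$, $\alpha \in \mathbb{N}$. In §5 we empirically measure $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of $N$ to choose $N$. We choose a global $N$ for a model, rather than for every $t$, by averaging over a subset of the evaluation set.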
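A minimal sketch of this stopping rule for a single time step is given below. It assumes the generator samples are available as a list of token ids, and that `alpha` and `gamma` play the roles of the lag and the boundary in the criterion above; the function name is hypothetical.

```python
def choose_num_samples(token_samples, vocab_size, alpha, gamma):
    """Return the smallest N > alpha for which the running average after N
    samples differs from the one after N - alpha samples by less than gamma
    in the max norm, or None if the criterion is never met."""
    counts = [0] * vocab_size
    estimates = []  # estimates[n - 1] is the running average after n samples
    for n, token in enumerate(token_samples, start=1):
        counts[token] += 1
        estimates.append([c / n for c in counts])
        if n > alpha:
            previous = estimates[n - alpha - 1]
            diff = max(abs(a - b) for a, b in zip(previous, estimates[-1]))
            if diff < gamma:
                return n
    return None

# A degenerate alternating stream meets the criterion as soon as it can.
print(choose_num_samples([0, 1] * 100, vocab_size=2, alpha=10, gamma=0.01))  # 12
```

As the alternating example shows, a single time step can satisfy the criterion early by coincidence, which is one reason to average the measured difference over many time steps before fixing a global $N$.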
5 Evaluation
Table 1: Model performance (BPC) on the text8 test set.

| Approach | Model | BPC | Approx. BPC |
|---|---|---|---|
| Language Models | mLSTM + dynamic eval Krause et al. (2018) | 1.19 | |
| | Large mLSTM +emb +WN +VD Krause et al. (2017) | 1.27 | |
| | Large RHN Zilly et al. (2016) | 1.27 | |
| | LayerNorm HM-LSTM Chung et al. (2017) | 1.29 | |
| | BN LSTM Cooijmans et al. (2017) | 1.36 | |
| | Unregularised mLSTM Krause et al. (2017) | 1.40 | |
| GANs (LM Approximation) | SeqGAN - pre-trained LM Yu et al. (2017) | 1.93 | 2.05 |
| | SeqGAN - full adversarial training Yu et al. (2017) | 2.06 | 2.25 |
| | Recurrent GAN without pre-training Press et al. (2017) | | 3.31 |
| Uniform Distribution | | 4.75 | |
5.1 Models
We focus on character-based GANs, which are more common than word-based GANs. We evaluate two RNN-based GANs with different characteristics. As opposed to the original GAN model Goodfellow et al. (2014), in which the generator is initialized with random noise, the GANs we evaluated both leverage gold standard text to initialize the generator, as detailed below.
Recurrent GAN Press et al. (2017) is a continuous RNN-based generator which minimizes the improved WGAN loss Gulrajani et al. (2017). To guide the generator, during training it is initialized with the first $i-1$ characters from the ground truth, starting the prediction at the $i$-th character. Stochasticity is obtained by feeding the generator with a noise vector as a hidden state. At each time step, the input to the RNN generator is the output distribution of the previous step.
SeqGAN Yu et al. (2017) is a discrete RNN-based generator. To guide the generator, it is pre-trained as a LM on ground truth text. Stochasticity is obtained by sampling tokens from an internal distribution function over the vocabulary. To overcome the non-differentiability problem, it is trained using a policy gradient objective Sutton et al. (2000).
We also evaluated LeakGAN Guo et al. (2018), another discrete RNN-based generator, but since it is similar to SeqGAN and had lower performance, we omit it for brevity.
5.2 Evaluation Settings
To compare to prior work in LM, we follow the common setup and train on the text8 dataset (http://mattmahoney.net/dc/textdata). The dataset is derived from Wikipedia, and includes 26 English characters plus spaces. We use the standard 90/5/5 split to train/validation/test. Finally, we measure performance with BPC.
We tuned hyperparameters on the validation set, including sequence length to generate at test time (7 for Press et al. (2017), 1000 for Yu et al. (2017)). We chose the number of samples $N$ empirically for each model, as described in Section 4.2. We set $\alpha$ to 10, and chose the boundary $\gamma$ as a good trade-off between accuracy and run-time. Figure 2 plots the approximation error $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of $N$. For both models, the red line in Figure 2 marks the $N$ from which this condition is satisfied. To be safe, we used a larger $N$ in our experiments.
5.3 Results
Table 1 describes model performance on the test set. Because SeqGAN models output a distribution over tokens at every time step, we can measure the true BPC and assess the quality of our approximation. Indeed, we observe that approximate BPC is only slightly higher than the true BPC.
GAN-based models perform worse than state-of-the-art LMs by a large margin. Moreover, in SeqGAN, the pre-trained LM (1.93 BPC) performs better than the fully trained model (2.06 BPC), and BPC deteriorates as adversarial training continues. Finally, we note that generating sequences longer than 7 characters hurts the BPC of Press et al. (2017). It is difficult to assess the quality of generation with such short sequences.
6 Conclusions
We propose an evaluation procedure for text GANs that is based on approximating the GAN output distribution and using standard LM metrics. We provide a bound for the number of samples required for the approximation and empirically show that in practice far fewer samples per time-step suffice. We evaluate character-based GAN models using our procedure, and show their performance is substantially lower than that of state-of-the-art LMs. We hope our simple evaluation method leads to progress in GAN-based text generation by shedding light on the quality of such models.
References
 Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.
 Chernoff et al. (1952) Herman Chernoff et al. 1952. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507.
 Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations (Conference Track).
 Cooijmans et al. (2017) Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2017. Recurrent batch normalization. In International Conference on Learning Representations (Conference Track).
 Goldberg (2017) Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777.
 Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5141–5148.
 Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc.
 Hoeffding (1963) Wassily Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30.
 Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.
 Krause et al. (2018) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2771–2780.
 Krause et al. (2017) Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2017. Multiplicative lstm for sequence modelling. In ICLR.
 Lu et al. (2018) Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133.
 Luong et al. (2016) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.
 Neubig (2017) Graham Neubig. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
 Press et al. (2017) Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. In 1st Workshop on Learning to Generate Natural Language at ICML 2017.
 Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
 See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.
 Semeniuta et al. (2018) Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2018. On accurate evaluation of gans for language generation. In Advances in Neural Information Processing Systems.
 Subramanian et al. (2017) Sandeep Subramanian, Sai Rajeswar, Francis Dutil, Chris Pal, and Aaron Courville. 2017. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 241–251, Vancouver, Canada. Association for Computational Linguistics.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
 You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
 Zhao et al. (2017) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. 2017. Adversarially regularized autoencoders for generating discrete structures. CoRR, abs/1706.04223.
 Zilly et al. (2016) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.
Appendix A Approximation Bound Proof
We provide a theoretical bound for choosing a number of samples $N$ that results in a good approximation of $\tilde{G}_{t,N}$ to $\mathbb{E}[G_t]$.
Perplexity and BPC rely on the log-probability of the ground truth token. Since the ground truth token is unknown, we conservatively define the bad event $B_\epsilon$ in which there exists $v \in V$ such that $|\{\mathbb{E}[G_t]\}_v - \{\tilde{G}_{t,N}\}_v| > \epsilon$, where $V$ is the vocabulary. We can then bound the probability of $B_\epsilon$ by some $\delta$. We define the following notations:

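The probability of a token $o_t$ to be $v$ is $p_v = \{\mathbb{E}[G_t]\}_v$.

$\chi_{v,n} = \{g_{t,n}\}_v$ is a random variable representing the binary value of the $v$’th index of the one-hot sample vector $g_{t,n}$. Note that the average of $\chi_{v,n}$ over $N$ samples is $X_v = \frac{1}{N}\sum_{n=1}^{N}\chi_{v,n} = \{\tilde{G}_{t,N}\}_v$.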
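Using the above notation, we can redefine the probability of the bad event $B_\epsilon$ with respect to the individual coordinates in the vectors:
$$\Pr(B_\epsilon) = \Pr\left(\|\mathbb{E}[G_t] - \tilde{G}_{t,N}\|_\infty > \epsilon\right) = \Pr\left(\bigcup_{v \in V}\{|p_v - X_v| > \epsilon\}\right) \qquad (2)$$
We note that $\chi_{v,n} \sim \mathrm{Bernoulli}(p_v)$, and given that $\{\chi_{v,n}\}_{n=1}^{N}$ are i.i.d., we can apply the Chernoff-Hoeffding theorem Chernoff et al. (1952); Hoeffding (1963). According to the theorem, for every $v \in V$, $\Pr(|X_v - p_v| > \epsilon) \le 2e^{-2\epsilon^2 N}$. Taking the union bound over $V$ implies:
$$\Pr(B_\epsilon) \le 2|V|e^{-2\epsilon^2 N} \qquad (3)$$
Hence, requiring the right-hand side to be smaller than $\delta$, we get a lower bound on $N$:
$$N > \frac{1}{2\epsilon^2}\ln\left(\frac{2|V|}{\delta}\right) \qquad (4)$$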
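The final step can be spelled out explicitly. The sketch below assumes the union bound takes the form $\Pr(B_\epsilon) \le 2|V|e^{-2\epsilon^2 N}$, i.e. the two-sided Hoeffding bound summed over the $|V|$ vocabulary entries, and solves for the $N$ that pushes it below $\delta$:

```latex
\begin{align*}
2|V|\,e^{-2\epsilon^{2}N} < \delta
  &\iff e^{-2\epsilon^{2}N} < \frac{\delta}{2|V|} \\
  &\iff -2\epsilon^{2}N < \ln\!\left(\frac{\delta}{2|V|}\right) \\
  &\iff N > \frac{1}{2\epsilon^{2}}\,\ln\!\left(\frac{2|V|}{\delta}\right)
\end{align*}
```

The required $N$ grows only logarithmically in $|V|$ and $1/\delta$, but quadratically in $1/\epsilon$, so the accuracy tolerance dominates the sample count.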