Recent developments in neural language models Mikolov and Zweig (2012), Reiter and Belz (2009), Mikolov et al. (2011b), Mikolov et al. (2011a) have inspired the use of neural network based architectures for the task of natural language generation (NLG). From image captioning Karpathy and Fei-Fei (2015), Dai et al. (2017) and machine translation Sutskever et al. (2014), Bahdanau et al. (2014) to text summarization Rush et al. (2015), dialogue systems Wen et al. (2015), and poetry generation Zhang and Lapata (2014), deep neural networks have shown promising results for many natural language processing applications, and they have quickly replaced traditional handcrafted rule-based or template-based approaches to NLG.
Despite the fast development of models, however, there is a huge gap in the evaluation of NLG systems. On one hand, a rigorous, efficient, and reproducible evaluation is extremely critical for the development of any machine learning technology and for a fair interpretation of the state of the art. On the other hand, evaluating the quality of natural language generation is an inherently difficult task due to the special properties of text data, such as subjectivity and non-compositionality (the overall meaning of a piece of text is not composed of the meanings of individual words and is more than just the sum of its parts; it emerges through a holistic appraisal of specific features in a given context). Indeed, “there is no agreed objective criterion for comparing the goodness of texts” Dale and Mellish (1998), and the NLG literature lacks a clear model of text quality Hardcastle and Scott (2008).
Conventionally, most NLG systems have been evaluated in a rather informal manner. Reiter and Belz (2009) divide the evaluation methods commonly employed in text generation into three categories: i) evaluations based on task performance, assessing the impact of generated texts on end users; ii) human judgments and ratings, where human subjects are recruited to rate generated texts on a numeric scale along different textual dimensions; and iii) evaluations based on comparison to a reference corpus using automatic metrics. Task-based evaluation measures the impact of generated texts in real applications, considering that the value of a functional text lies in how well it serves the user in fulfilling a specific function. Young (1999) generates instructional texts and determines how informative they are when users carry out the directions outlined. Mani et al. (1999) perform extrinsic summary evaluation by comparing the functional value of a summary vs. the entire document. Carenini and Moore (2006) evaluate persuasive texts by assessing how users rank items in a list, while Di Eugenio et al. (2002) measure the learning gain in intelligent tutoring systems with an NLG component. Nevertheless, task-based evaluation can be expensive, time-consuming, and often depends on the good will of study participants. Moreover, it is hard to tease apart the general quality of the generated text from the special context (and confounds) of the application task, or to generalize evaluation conclusions across tasks. Human-based evaluation assesses the quality of text more directly than task-based evaluation, and it requires less support from domain experts. However, evaluating NLG systems on real users in a rigorous manner can be expensive and time-consuming, and it does not scale well Reiter et al. (2001). Alternative strategies that are cost-effective and provide accurate, immediate feedback are therefore used more frequently.
Automated evaluation compares texts generated by the candidate algorithms to human-written texts. Text overlap metrics and, more recently, automated adversarial evaluators are widely employed in NLG as they are cheap, quick, repeatable, and do not require human subjects once a reference corpus is available. In addition, they allow developers to make rapid changes to their systems and automatically tune parameters without human intervention. Despite these benefits, however, the use of automated metrics in the field of NLG is controversial Reiter and Belz (2009), and their results are often criticized as not meaningful, for a number of reasons. First, automatic evaluations rely on a high-quality reference corpus, which is not always available; variations in the writing style of authors and the presence of (grammatical) errors often result in inconsistent and error-prone judgments Reiter and Sripada (2002). Second, comparisons with a reference corpus do not assess the impact and usefulness of the generated text for readers, as human-based evaluations do, but instead determine how closely the text matches the references. Third, creating human-written reference texts specifically for the purpose of evaluation can still be expensive, especially if these reference texts need to be created by skilled domain experts. Finally and most importantly, using automatic evaluation metrics is sensible only if they correlate with the results of human-based evaluations and are accurate predictors of text quality, which is never formally verified. Their validity for evaluation should be a more important concern than their cost-effectiveness.
In this paper we conduct a systematic experiment that evaluates the different evaluators for natural language generation. We compare three types of evaluators for a carefully selected scenario of online review generation, including human evaluators, automated adversarial evaluators that are trained to distinguish human-written from machine-generated product reviews, and word overlap metrics (such as BLEU and ROUGE). The preferences of different evaluators on a dozen state-of-the-art deep-learning based NLG algorithms are correlated with human assessments of the quality of generated text. Our findings not only reveal the differences among the evaluators and suggest the more effective ones, but also provide important implications on how to guide the development of new natural language generation models.
2 Related Work
The presented study is related to two lines of research: deep learning based models and automated evaluation metrics for the task of text generation.
2.1 Deep Learning Based NLG
Recently, a good number of deep learning based models have been proposed for text generation in various scenarios. Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) models Hochreiter and Schmidhuber (1997) in particular, are widely used for generating sequential data, including text. Google LM Jozefowicz et al. (2016) is a publicly released language model pre-trained on the One Billion Word Benchmark data Chelba et al. (2013) using a 2-layer LSTM. The most popular strategy for training RNNs is teacher forcing Williams and Zipser (1989): at every step of a sequence, the model is trained on the observed token from the previous step, while at inference time it consumes the token it generated itself. This discrepancy between training and inference often leads to errors that accumulate quickly over the generated sequence and compromise the predictive power of the model Lamb et al. (2016). To mitigate this problem, Scheduled Sampling (SS) Bengio et al. (2015) is proposed for training RNNs; it mixes ground-truth tokens and model-generated tokens as training inputs. However, SS has been shown to be an inconsistent training strategy, and adversarial training strategies are argued to be more suitable for generative models Huszár (2015).
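The difference between teacher forcing and free-running inference, and the Scheduled Sampling mix between them, can be sketched in a few lines. The toy "model" and token names below are illustrative assumptions, not components of any cited system:

```python
import random

def next_input(gold_token, model_token, teacher_prob, rng):
    """With probability teacher_prob feed the ground-truth token
    (teacher forcing); otherwise feed back the model's own output,
    as in Scheduled Sampling."""
    return gold_token if rng.random() < teacher_prob else model_token

def run_decoder(gold_tokens, model_step, teacher_prob, seed=0):
    """Toy decoding loop; model_step maps an input token to an output token."""
    rng = random.Random(seed)
    inp, outputs = "<bos>", []
    for gold in gold_tokens:
        out = model_step(inp)
        outputs.append(out)
        inp = next_input(gold, out, teacher_prob, rng)
    return outputs

toy_model = lambda tok: tok + "*"  # hypothetical "model": echoes its input

forced = run_decoder(["a", "b", "c"], toy_model, teacher_prob=1.0)  # teacher forcing
free = run_decoder(["a", "b", "c"], toy_model, teacher_prob=0.0)    # free-running
```

In actual Scheduled Sampling, `teacher_prob` is annealed from 1 toward 0 during training, interpolating between the two regimes above; here the extremes are fixed only for illustration.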
Generative Adversarial Networks Goodfellow et al. (2014), or GANs, train generative models through an adversarial process. A GAN works through the interplay of two feedforward neural networks that are trained simultaneously by competing against each other: a generative model that captures the data distribution and generates high-quality synthetic data, and a discriminative model that learns to distinguish the generated samples from the real data.
Compared to generating images, generating text with GANs is challenging due to the discrete nature of sequence data. SeqGAN Yu et al. (2017) is one of the earliest GAN-based models for sequence generation; it treats the generation procedure as a sequential decision-making process Bachman and Precup (2015). The model addresses the non-differentiability of discrete outputs by treating the generator as a stochastic parameterized policy trained via policy gradient Sutton et al. (2000). The reward in this reinforcement learning process is based on how likely the discriminator is to be fooled by a complete sequence of tokens, and it is passed back to the intermediate state-action steps using Monte Carlo search.
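The Monte Carlo reward estimation described above can be sketched as follows. The rollout policy and discriminator here are hypothetical stand-ins; in SeqGAN both are learned networks and the estimated reward feeds a policy-gradient update:

```python
import random

def rollout_reward(prefix, rollout_policy, discriminator, n_rollouts=16, rng=None):
    """Estimate the reward for a partial sequence (state-action prefix) by
    completing it with Monte Carlo rollouts and averaging the discriminator's
    score (its probability that the completed sequence is human-written)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_rollouts):
        full = prefix + rollout_policy(prefix, rng)
        total += discriminator(full)
    return total / n_rollouts

# Hypothetical stand-ins: a "discriminator" that believes any sequence
# containing the token "great", and a rollout policy that appends one
# random token to complete the sequence.
toy_discriminator = lambda seq: 1.0 if "great" in seq else 0.0
toy_policy = lambda prefix, rng: [rng.choice(["great", "bad"])]

reward = rollout_reward(["this", "is"], toy_policy, toy_discriminator, n_rollouts=1000)
```

Since the rollout policy completes the prefix with "great" about half the time, the estimated reward converges toward 0.5, illustrating how intermediate steps receive credit from full-sequence outcomes.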
Following this direction, RankGAN Lin et al. (2017) proposes a framework that evaluates the quality of a set of generated sequences collectively. The discriminator in RankGAN is trained to rank model-generated sentences lower than human-written sentences w.r.t. a human-written reference set; the generator is trained to confuse the ranker so that its generated sentences are ranked higher than the human-written ones.
Many GAN-based text generators Yu et al. (2017), Lin et al. (2017), Rajeswar et al. (2017), Che et al. (2017), Li et al. (2017), Zhang et al. (2017) are only capable of generating short texts, typically around 20 tokens. LeakGAN Guo et al. (2017) is proposed for generating longer texts: its discriminator is allowed to leak its own high-level extracted features to better guide the training of the generator.
Deep learning architectures other than LSTM or GAN have also been proposed for text generation. Tang et al. (2016) study the problem of NLG given particular contexts or situations and propose two approaches (Contexts to Sequences and Gated Contexts to Sequences), both built on top of the encoder-decoder framework. Dong et al. (2017) focus on the same task and employ an attention mechanism to learn soft alignments between the input attributes and the generated words.
2.2 Automated Evaluation Metrics
The wide variety of natural language generation models has been evaluated with an equally wide variety of approaches. Arguably, the most natural way to evaluate the quality of a generator is to involve humans as judges, either through some type of Turing test Turing (1950) to distinguish generated texts from human-written texts, or by directly comparing the texts generated by different generators Mellish and Dale (1998). Such approaches are hard to scale, and they have to be redone whenever a new generator is included. Practically, it is important to find automated metrics that evaluate the quality of a generator independently of human judges or an exhaustive set of competing generators. Metrics commonly used in the literature are i) perplexity, ii) discriminative evaluators, and iii) text overlap metrics. We summarize these metrics below.
Perplexity measures how well a probability distribution predicts a sample (either seen or unseen) and captures the degree of uncertainty in the model. It is used to evaluate generators Yarats and Lewis (2017), Ficler and Goldberg (2017), Gerz et al. (2018), even though it is commonly criticized for not being a direct measure of the quality of generated text Fedus et al. (2018). Indeed, perplexity is a model-dependent metric, and “how likely a sentence is generated by a given model” is not comparable across different models. Therefore we do not include perplexity as an evaluation metric in this study.
A discriminative evaluator is an alternative way to evaluate a generator: it measures how likely the generated text is to fool a classifier that aims to distinguish it from human-written texts. In a way, this is an automated approximation of the Turing test, with machine judges in place of human judges. Discriminative machine judges can be trained either on a dataset with explicit labels Ott et al. (2011), or on a mixture of text written by real humans and text generated by the model under evaluation. The latter is usually referred to as adversarial evaluation. Bowman et al. (2015) propose one of the earliest works that use adversarial evaluation to assess the quality of generated sentences. The authors train two classifiers, one a logistic regression based on bag-of-unigram features and the other based on an LSTM, to separate generated sentences from human-written sentences. In this context, adversarial error is defined as the difference between the ideal accuracy of the discriminator when samples from the two categories are indistinguishable from each other (i.e., 50% accuracy) and the actual accuracy attained. Notably, maximizing the adversarial error is consistent with the objective of the generator in generative adversarial networks.
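The adversarial error defined above is straightforward to compute. This small helper follows the definition in the text (ideal accuracy of 0.5 minus attained accuracy), one way to operationalize it:

```python
def adversarial_error(discriminator_accuracy):
    """Difference between the ideal accuracy of the discriminator on
    indistinguishable samples (0.5) and the accuracy it actually attains.
    A generator that fools the discriminator more drives this value up."""
    return 0.5 - discriminator_accuracy
```

Maximizing this quantity pushes the discriminator's accuracy down toward (or below) chance, which matches the generator's objective in a GAN.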
In a later study, Kannan and Vinyals (2017) propose an adversarial loss to discriminate a dialogue model’s output from human output. The model consists of a generator and a discriminator (both RNNs) trained separately. The authors find that the discriminator’s preference is correlated with the length of the output, and that long responses are favoured even when they are not entirely coherent. In addition, the discriminator prefers rarer language over the most common responses produced by the generator, which implies that the discriminator may detect weaknesses of the generator such as a lack of diversity in the generated text. Even though the discriminator can distinguish generated output from human output at an accuracy of 62.5%, there is no evidence that a model that obtains a lower accuracy from the discriminative evaluator is better according to human evaluations.
Automatic dialogue evaluation is formulated as a learning problem in Lowe et al. (2017), who train a hierarchical RNN model to predict the scores a human would assign to dialogue responses. The authors show that the predictions correlate with human judgments at the utterance and system levels; however, each response is evaluated in a very specific context, and the system requires substantial human judgments for training. Li et al. (2017) employ the idea of having a discriminator (analogous to the human evaluator in the Turing test) both in training and testing, and define the adversarial success AdverSuc as the difference between 1 and the accuracy achieved by the evaluator (thus the higher, the better). The authors find that an adversarial training strategy is more useful when there is a big discrepancy between the distributions of the generated sequences and the reference target sequences, i.e., high entropy of the targets. Other work finds the performance of a discriminative agent (e.g., an attention-based bidirectional LSTM binary classifier) comparable with that of human judges at distinguishing between real and fake dialogue excerpts Bruni and Fernández (2017). Results show there is limited consensus among humans on what counts as a coherent dialogue passage. The discriminator is sensitive to patterns that are not apparent to humans; however, the utility of such an approach in developing generative models that interact with humans remains an open question.
Text Overlap Metrics, such as BLEU Papineni et al. (2002), ROUGE Lin (2004), and METEOR Banerjee and Lavie (2005), are commonly used to evaluate NLP tasks, machine translation and text summarization in particular. They are borrowed to evaluate language generation by comparing the similarity between the generated text and human-written references. Liu et al. (2016) find that word overlap metrics present weak or no correlation with human judgments in non-task-oriented dialogue systems. As these metrics are only a rough estimate of human judgments, the authors recommend always using them with caution, or in combination with user studies. On the contrary, Sharma et al. (2017) report that word overlap metrics are indicative of human judgments in task-oriented dialogue settings, when used on datasets that contain multiple ground-truth references. Dai et al. (2017) find word overlap metrics too restrictive, as they focus on fidelity of wording instead of fidelity of semantics, implying that sentences containing matched n-grams get substantially higher scores than sentences containing variant expressions. Callison-Burch et al. (2006) consider an improvement in BLEU insufficient for establishing an actual improvement in the quality of a system. For example, variations (e.g., permutations and substitutions) of a generated sentence that are not equally grammatical or semantically plausible are not distinguished by the BLEU score. The authors argue in favour of human evaluations.
GAN-based text generation models are compared against a maximum likelihood estimation (MLE) baseline in Caccia et al. (2018). The authors find that models trained with MLE yield superior quality-diversity trade-off according to negative log-likelihood and Self-BLEU metrics, where Self-BLEU Zhu et al. (2018) measures the diversity of the generated text by calculating the BLEU Papineni et al. (2002) score for each generated sentence using all other generated sentences in the corpus as references. Shi et al. compare frameworks for text generation including MLE, SeqGAN, LeakGAN and Inverse Reinforcement Learning using a simulated Turing test Shi et al. (2018); each generated sentence gets 1 point when it is judged by humans as real, otherwise 0 points. A benchmarking experiment with GAN neural text generation models is conducted in Lu et al. (2018); results show LeakGAN presents the highest BLEU scores on the test data.
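Self-BLEU can be sketched with a hand-rolled clipped n-gram precision. This is a simplified BLEU (no brevity penalty, no geometric averaging over n-gram orders, which the full metric includes), kept minimal for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=2):
    """Clipped n-gram precision, the core of BLEU: each candidate n-gram
    counts at most as many times as it appears in any single reference."""
    cand = Counter(ngrams(candidate, n))
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def self_bleu(corpus, n=2):
    """Self-BLEU: score each generated sentence against all the others.
    High values indicate low diversity (sentences resemble each other)."""
    scores = [modified_precision(sent, corpus[:i] + corpus[i + 1:], n)
              for i, sent in enumerate(corpus)]
    return sum(scores) / len(scores)
```

A corpus of identical sentences scores 1.0 (no diversity), while sentences sharing no n-grams score 0.0.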
While a large number of natural language generators have been proposed and evaluated with various metrics, no existing work has systematically evaluated the different evaluators. Do automated judges truly mimic human judges? How much do they agree with each other? Is adversarial evaluation indicative of who would pass a Turing test? Do human judges focus on different aspects of text quality than machine judges? Recently, the community has started to realize that these evaluation metrics for NLG are not consistent among themselves, and that using multiple methods of evaluation can help capture different aspects of text quality, from fluency and clarity to adequacy of semantic content and effectiveness of communication Gatt and Krahmer (2018). Yet these insights are not backed by empirical evidence. In this study, we bridge the gap and conduct a systematic empirical evaluation of the automated evaluators for natural language generation.
3 Experiment Design
We design a large-scale experiment to systematically analyze the procedures and metrics used for evaluating natural language generation models. While the main subjects of the experiment are the different evaluators, including those based on human judgments, those based on automated discriminators, and those based on word overlap metrics, the experiment carefully chooses a particular application context and a variety of natural language generators in this context. Ideally, a sound automated evaluator should be able to distinguish good generators from suboptimal ones, and its preferences (in ordering the generators) should be consistent with those of humans who are trained to make judgments in the particular task context.
3.1 Experiment Context and Procedure
We design the experiment in the context of generating online product reviews. There are several reasons why review generation is a desirable task: 1) online product reviews are widely available, and it is easy to collect a large number of examples for training/testing the generators; 2) Internet users are used to reading online reviews, and it is easy to recruit capable human judges to assess the quality of reviews; and 3) compared to tasks like image caption generation or dialogue systems, review generation has minimal dependency on conversation context or non-textual data, which reduces possible confounds in the experiment.
The general experiment procedure is presented in Figure 1. We start from the publicly available Amazon Product Reviews dataset (http://jmcauley.ucsd.edu/data/amazon/), which spans the period from May 1996 to July 2014. From the entire dataset, we select the three most popular domains: books, electronics, and movies. To control for potential confounds introduced by special products or language usage, we exclude products with fewer than 2 reviews and users who have written only a single review. The vocabulary is trimmed to the 5,000 most frequent words, with all out-of-vocabulary words replaced by a unified token. We discard reviews longer than 70 words, as they tend to describe the plots of particular books or movies instead of expressing opinions.
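The filtering steps above can be sketched as follows. Reviews are assumed to be pre-tokenized lists of words, and the token name `<unk>` is an illustrative choice for the unified out-of-vocabulary token:

```python
from collections import Counter

UNK = "<unk>"  # unified out-of-vocabulary token (name is our assumption)

def preprocess(reviews, vocab_size=5000, max_len=70):
    """Drop over-long reviews, trim the vocabulary to the most frequent
    words, and replace out-of-vocabulary words with the unified token."""
    kept = [r for r in reviews if len(r) <= max_len]
    freq = Counter(w for r in kept for w in r)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    return [[w if w in vocab else UNK for w in r] for r in kept]

# Tiny illustrative corpus: the third "review" is dropped for length,
# and "read" falls outside the size-2 vocabulary.
toy = [["good", "book"], ["good", "read"], ["plot"] * 80]
cleaned = preprocess(toy, vocab_size=2, max_len=70)
```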
The filtered dataset is randomly split into three parts, used to train, validate, and test the candidate review generators (denoted as G-train, G-valid, and G-test, respectively). Every generative model is trained and validated on the same datasets and, once trained, is used to generate a number of product reviews (details are included in the next section). These generated reviews, mixed with the real reviews in G-test, are randomly split into three new subsets for training, validating, and testing the candidate (discriminative) evaluators, denoted as D-train, D-valid, and D-test. Note that every subset contains the same proportion of human-written and machine-generated reviews. Finally, a random sample of reviews from D-test is sent for human evaluation.
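A random three-way split of this kind can be sketched as below. The 70/10/20 fractions are illustrative defaults (they match the proportions implied by the D-set counts in Table 2), not necessarily the exact procedure used:

```python
import random

def three_way_split(items, fractions=(0.7, 0.1, 0.2), seed=42):
    """Shuffle and split a dataset into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(fractions[0] * len(items))
    n_valid = int(fractions[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

train_part, valid_part, test_part = three_way_split(range(100))
```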
3.2 Review Generators
Although our goal is to evaluate the evaluators, it is critical to include a wide range of generators and generated reviews. Ideally, these generators should present various degrees of quality; a good evaluator should be able to distinguish the high-quality generators (or generated reviews) from the low-quality ones. To achieve this variety, we select a diverse set of generative models from the recent text generation literature. Note that the goal of this study is NOT to name the best generative model, and it is infeasible to include all existing models. Our criteria are to include models that represent different strategies and quality levels, and to prefer those with publicly available implementations. In Table 1 we list the candidate generative models, noting that this is not an exhaustive list of what is currently available.
|Generative Text Model|GAN-based|
|---|---|
|Word LSTM temp 1.0 Hochreiter and Schmidhuber (1997)|No|
|Word LSTM temp 0.7 Hochreiter and Schmidhuber (1997)|No|
|Word LSTM temp 0.5 Hochreiter and Schmidhuber (1997)|No|
|Scheduled Sampling Bengio et al. (2015)|No|
|Google LM Jozefowicz et al. (2016)|No|
|Attention Attribute to Sequence Dong et al. (2017)|No|
|Contexts to Sequences Tang et al. (2016)|No|
|Gated Contexts to Sequences Tang et al. (2016)|No|
|MLE SeqGAN Yu et al. (2017)|Yes|
|SeqGAN Yu et al. (2017)|Yes|
|RankGAN Lin et al. (2017)|Yes|
|LeakGAN Guo et al. (2017)|Yes|
Every generator (other than the Google LM) is trained and validated on the G-train and G-valid datasets, and then used to generate the same number of (fake) reviews (see Table 2). We follow best practices in the literature to train these models, although it is possible that the performance of individual models is not fully optimized due to constraints on data and computational resources. Again, our goal is to evaluate the evaluators rather than the individual generators. Google LM was not trained on reviews, but it provides a useful sanity check on the experimental results: this generator should not be ranked high by any reasonable evaluator.
|Generator|Reviews generated|D-train|D-valid|D-test|
|---|---|---|---|---|
|each model in Table 1 except Google LM|32,500|22,750|3,250|6,500|
3.3 Evaluators
The goal of the experiment is to analyze and compare different evaluators for review generation. We include a comprehensive set of evaluators for the quality of the aforementioned generators: i) human evaluators, ii) discriminative evaluators, and iii) word overlap evaluators.
3.3.1 Human evaluators
The most reasonable evaluator is a Turing test. When a full Turing test is not feasible in practice, controlled user annotation is often used to mimic a “shrunk” version of it. We conduct a careful power analysis to determine how many examples to include for annotation so as to obtain a certain level of statistical significance in the comparison of the generative models Flight and Julious (2016), Christensen (2007). The result of our power analysis suggests that at least 111 examples per generative model should be evaluated to infer that the machine-generated reviews are comparable in quality to human-written reviews, at a minimal statistical significance level of 0.05. Per this calculation, we sample 150 examples from each of the 12 generators for human evaluation. This totals 1,800 machine-generated reviews, to which we add 1,800 human-written reviews, for a total of 3,600 product reviews sent for human evaluation.
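A power analysis of this kind can be approximated with the standard normal-approximation sample-size formula for comparing two proportions. This is a simplified stand-in, not necessarily the exact procedure of Flight and Julious (2016); the 0.5 vs. 0.4 effect size below is purely illustrative:

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for detecting a difference between two
    proportions, using the normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)          # desired power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

# Illustrative: samples needed per group to detect 50% vs. 40% accuracy.
n = sample_size_two_proportions(0.5, 0.4)
```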
We recruit human annotators through the Amazon Mechanical Turk (AMT) platform Buhrmester et al. (2011) to label the reviews as real (i.e., human-written) or fake (i.e., machine-generated). Reviews are split into pages of 20 (a mixture of 10 real and 10 fake). We restrict participants in our study to highly qualified US-based workers (historical approval rate greater than 95%). To ensure high-quality annotations, we insert a “gotcha” question in each page and only accept labels from workers who answer all the questions on a page and also answer the “gotcha” question correctly. Each page is annotated by 5 distinct human evaluators; in total, 900 distinct workers participated in our study and evaluated all 3,600 reviews. Their judgments on every review are used to assemble two distinct human evaluators: H1 - individual votes, treating all human annotations independently, and H2 - majority votes over the 5 human judgments per review.
For every annotated review, a human evaluator (H1 or H2) makes a call that is either right or wrong with regard to the ground truth (whether the review was sampled from the Amazon review dataset or generated by one of the machines). A generator is considered high quality if the human evaluator achieves low accuracy on the fake reviews it generated.
3.3.2 Discriminative evaluators
With the D-train and D-valid datasets, we can train a discriminative classifier for every generator using the “fake” reviews it generated and the same number of “real” reviews. These classifiers serve as adversarial evaluators for each individual generator, the quality of which is measured by how much its corresponding adversarial evaluator is fooled by additional “fake” reviews it generates (in D-test). This is consistent with the objectives in generative adversarial networks.
Note that each adversarial evaluator is generator-dependent, as it is trained on the generated reviews of one individual generator. This is not ideal: comparisons between different generators may be unfair if they are not scored by the same evaluators, and for every new generator a new adversarial evaluator has to be trained. The inclusion of multiple generators provides the opportunity to create meta-adversarial evaluators, which are trained on the pooled reviews generated by many generators, mixed with a larger number of “real” reviews. Such a “pooling” strategy is similar to the standard practice used by the TREC conferences to evaluate different information retrieval systems. Compared to individual adversarial evaluators, a meta-evaluator should be more robust and fairer, and it can be applied to evaluate new generators without being retrained.
The actual discriminative classifiers can be either deep or shallow. We employ a total of 7 meta-adversarial evaluators: 3 deep versions, one using an LSTM Hochreiter and Schmidhuber (1997), one using a Convolutional Neural Network (CNN) Kim (2014), LeCun et al. (1998), and one combining the LSTM and CNN architectures; and 4 shallow versions, based on Naive Bayes (NB) Rish (2001), Random Forest (RF) Liaw et al. (2002), Support Vector Machines (SVM) Cortes and Vapnik (1995), and XGBoost Chen and Guestrin (2016), which use unigrams, bigrams, and trigrams as features. 12 individual adversarial evaluators are trained, all based on SVM. All 19 discriminative evaluators are trained on balanced training sets; for the shallow models we use 10-fold cross validation. We find the best hyper-parameters using random search Bergstra and Bengio (2012), and prevent the models from overfitting by using early stopping Prechelt (1998).
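The shallow real-vs-fake setup can be illustrated with a minimal hand-rolled Naive Bayes discriminator over unigram features. The paper's shallow models additionally use bigram and trigram features, tuned hyper-parameters, and cross-validation; the training examples below are made up:

```python
import math
from collections import Counter

class NaiveBayesDiscriminator:
    """Minimal multinomial Naive Bayes over unigrams, Laplace smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.priors, self.counts, self.totals = {}, {}, {}
        self.vocab = set()
        for c in self.classes:
            docs = [t for t, l in zip(texts, labels) if l == c]
            wc = Counter(w for t in docs for w in t.split())
            self.priors[c] = len(docs) / len(texts)
            self.counts[c], self.totals[c] = wc, sum(wc.values())
            self.vocab |= set(wc)
        return self

    def predict(self, text):
        v = len(self.vocab)
        def log_score(c):
            s = math.log(self.priors[c])
            for w in text.split():
                s += math.log((self.counts[c][w] + 1) / (self.totals[c] + v))
            return s
        return max(self.classes, key=log_score)

# Hypothetical "real" reviews vs. repetitive "fake" ones.
clf = NaiveBayesDiscriminator().fit(
    ["great plot great prose", "loved the story",
     "the the the product", "product product unk unk"],
    ["real", "real", "fake", "fake"])
```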
For every review in D-test (either annotated or not), a discriminative evaluator makes a judgment call. A generator is considered high quality if the discriminative evaluator makes more mistakes on reviews it generated, w.r.t. the ground truth.
3.3.3 Word overlap evaluators
We include a set of 4 word overlap metrics that are used for NLG evaluation: BLEU Papineni et al. (2002) and METEOR Banerjee and Lavie (2005), which are borrowed from machine translation tasks, ROUGE Lin (2004) that is borrowed from text summarization tasks, and CIDEr Vedantam et al. (2015) that is borrowed from image description evaluation.
An important aspect of these metrics is that they rely on matching n-grams in the target text (i.e., generated reviews) to those in the ground-truth text, or the “references” (i.e., human-written reviews). The higher the overlap (similarity), the higher the quality of the generated text. For every generated review in D-test Fake, we assemble the set of references by retrieving the top-k most similar human-written reviews in D-test Real, calculated using a simple vector space model (computed using a publicly available toolkit from Sharma et al. (2017)). A generator is considered high quality if its generated reviews obtain a high average score from a word overlap evaluator.
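The reference-retrieval step can be sketched with a plain term-frequency vector space model; the toolkit used in the study may weight terms differently (e.g., TF-IDF), and the review strings below are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * b[w] for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_references(generated, real_reviews, k=5):
    """Pick the k human-written reviews most similar to a generated one,
    to serve as the reference set for a word overlap metric."""
    g = Counter(generated.split())
    ranked = sorted(real_reviews,
                    key=lambda r: cosine(g, Counter(r.split())),
                    reverse=True)
    return ranked[:k]

refs = top_k_references("great camera battery",
                        ["great camera", "terrible plot", "battery life great"],
                        k=2)
```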
In total, we analyze and compare 25 candidate evaluators for review generation (2 human evaluators, 19 discriminative evaluators, and 4 word-overlap metrics), based on the D-test dataset.
We consider a few dependent variables to analyze the experimental results. First, we are interested in the accuracy of individual evaluators: how well they can distinguish “fake” (machine-generated) reviews from “real” (human-written) reviews. Second, we are interested in how an evaluator assesses the quality of the 12 generators in general, rather than of individual reviews. Since none of the evaluators (including the human evaluators) is likely to make perfect calls, the absolute scores an evaluator gives the generators are not as informative as how it ranks them: a good evaluator should rank good generators higher than bad generators and, implicitly, should be confident in that ranking. Last but not least, we are interested in how the rankings by different evaluators correlate with each other. Intuitively, an automated evaluator that ranks the generators similarly to the human evaluators is more reasonable and can potentially be used as a surrogate for human evaluation. Below we summarize the main results of the experiment.
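Comparing rankings across evaluators calls for a rank correlation measure. A hand-rolled Spearman correlation (rank transform followed by Pearson correlation) is one natural choice; the exact correlation statistic used in the analysis is not assumed here:

```python
def rank(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two evaluators' scores
    for the same set of generators."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Two evaluators that order the generators identically score +1; fully reversed orderings score -1.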
4 Experiment Results
4.1 Results of individual evaluators
We first present results for individual evaluators. As a reminder, the evaluators use three different ways to distinguish machine-generated reviews from human-written reviews: through human annotations, through discriminative classifiers, and through word-overlap metrics.
4.1.1 Human evaluators
In the AMT experiment, every review is annotated by 5 human judges as either “fake” or “real.” The inter-annotator agreement among the human judges is only “fair” according to the Fleiss-Kappa score Fleiss et al. (2013). This suggests that distinguishing machine-generated reviews from human-written reviews is in general a hard task, even for humans; there is limited consensus on what counts as a realistic review.
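Fleiss' kappa for an annotation matrix of this kind can be computed directly from its definition; the count matrices below are hypothetical, not our annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an items-by-categories matrix of counts:
    ratings[i][j] = number of annotators who put item i in category j
    (e.g., j in {real, fake}), with the same number of annotators per item."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # annotators per item
    k = len(ratings[0])              # number of categories
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa of 1; agreement at chance level yields 0, and values around 0.2-0.4 are conventionally read as "fair" agreement.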
In Figure 2 we present the accuracy of the two human evaluators on individual, Turker-annotated reviews, using either all 5 annotations (H1) or their majority vote (H2) for each review. Compared to the ground truth (whether a review is machine-generated or collected from Amazon), the accuracy of individual human decisions is 66.61%, while majority voting does as well as 72.63%. Neither is close to perfect. We observe that human evaluators generally do better at correctly labelling the human-written examples as real (true positive rates of 78.96% for H1 and 88.31% for H2), while they label the machine-generated examples correctly in little more than half of the cases (true negative rates of 54.26% for H1 and 56.95% for H2). This trend is consistent with observations in the literature Tang et al. (2016).
We then look at how the human evaluators rank the generators, according to the accuracy of the human evaluators on all reviews generated by each generator. The lower the accuracy, the more likely the human evaluator is confused by the generated reviews, and thus the better the generator. From Table 3, we observe a substantial variance in the accuracy of the human evaluators (both H1 and H2) on different generators, which suggests that human evaluators are able to distinguish between generators. The generator ranked as the best by both human evaluators is Gated Contexts to Sequences. Google LM is ranked on the lower side, which makes sense as the model is not trained to generate reviews. Interestingly, humans tend not to be fooled by reviews generated by the GAN-based models (MLE SeqGAN, SeqGAN, RankGAN and LeakGAN), even though their training objective is exactly to mix real and fake. GAN-generated reviews tend to be easily distinguished from human-written reviews by human judges.
| Generative Text Model | Accuracy (H1) | Accuracy (H2) |
| --- | --- | --- |
| Word LSTM temp 1.0 | 54.87 % | 59.73 % |
| Word LSTM temp 0.7 | 33.91 % | 28.19 % |
| Word LSTM temp 0.5 | 26.71 % | 17.80 % |
| Scheduled Sampling | 75.27 % | 87.25 % |
| Google LM | 68.19 % | 79.17 % |
| Attention Attribute to Sequence | 32.31 % | 27.21 % |
| Contexts to Sequences | 38.72 % | 34.23 % |
| Gated Contexts to Sequences | 24.63 % | 14.86 % |
| MLE SeqGAN | 76.23 % | 89.93 % |
| SeqGAN | 74.50 % | 85.03 % |
| RankGAN | 77.82 % | 84.25 % |
| LeakGAN | 68.14 % | 76.19 % |
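The ranking rule described above (lower evaluator accuracy means a better generator) can be applied directly to the H1 column of Table 3:

```python
# Rank generators by how often an evaluator catches them: the lower the
# evaluator's accuracy on a generator's reviews, the better the generator.
# Accuracy numbers are the H1 column of Table 3.

h1_acc = {
    "Word LSTM temp 1.0": 54.87, "Word LSTM temp 0.7": 33.91,
    "Word LSTM temp 0.5": 26.71, "Scheduled Sampling": 75.27,
    "Google LM": 68.19, "Attention Attribute to Sequence": 32.31,
    "Contexts to Sequences": 38.72, "Gated Contexts to Sequences": 24.63,
    "MLE SeqGAN": 76.23, "SeqGAN": 74.50, "RankGAN": 77.82, "LeakGAN": 68.14,
}

def rank_generators(accuracy):
    """Best generator first: ascending evaluator accuracy."""
    return sorted(accuracy, key=accuracy.get)

ranking = rank_generators(h1_acc)
```

Under H1 this reproduces the ordering discussed above, with Gated Contexts to Sequences first and the GAN-based models near the bottom.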
4.1.2 Discriminative evaluators
We then analyze the 7 meta-adversarial evaluators and the 12 individual adversarial evaluators that are trained to distinguish human-written reviews from machine-generated reviews. Unlike the human evaluators, which are applied to the 3,600 annotated reviews, the discriminative evaluators are applied to all reviews in the D-test dataset.
Meta-adversarial Evaluators In Table 4 we present the accuracy of both shallow and deep meta-adversarial evaluators on individual reviews and on each generator. In general, the 3 deep-learning-based meta-discriminators and the SVM-based one achieve higher accuracy than the two human evaluators, indicating that the adversarial evaluators distinguish machine-generated reviews from human-written reviews better than human judges do. There is no significant difference among the meta-adversarial evaluators in terms of accuracy on individual reviews. We notice that the meta-discriminators commonly rank the GAN-based generators (especially MLE SeqGAN) as the best and the contexts-to-sequence ones as the worst. This makes sense, as the GAN objective is directly aligned with the (reversed) accuracy of the evaluator. Interestingly, simply setting the temperature of the Word LSTM to 1.0 yields performance comparable to the GAN-based generators.
| Generators | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word LSTM temp 1.0 | 48.29 % | 55.22 % | 45.68 % | 50.31 % | 53.63 % | 32.77 % | 48.97 % |
| Word LSTM temp 0.7 | 92.58 % | 93.14 % | 91.02 % | 78.69 % | 81.05 % | 79.92 % | 80.49 % |
| Word LSTM temp 0.5 | 99.31 % | 99.35 % | 99.08 % | 94.74 % | 94.29 % | 96.86 % | 94.71 % |
| Scheduled Sampling | 50.09 % | 48.77 % | 43.37 % | 51.31 % | 52.88 % | 20.97 % | 44.12 % |
| Google LM | 84.58 % | 74.03 % | 74.85 % | 78.59 % | 82.71 % | 48.28 % | 82.41 % |
| Attention Attribute to Sequence | 90.08 % | 91.78 % | 89.94 % | 74.37 % | 77.29 % | 80.02 % | 71.68 % |
| Contexts to Sequences | 100.00 % | 100.00 % | 99.97 % | 100.00 % | 99.98 % | 100.00 % | 99.98 % |
| Gated Contexts to Sequences | 98.37 % | 99.06 % | 98.38 % | 96.26 % | 95.35 % | 98.63 % | 93.62 % |
| MLE SeqGAN | 41.45 % | 47.54 % | 41.91 % | 52.35 % | 51.14 % | 21.83 % | 43.71 % |
| SeqGAN | 50.05 % | 52.91 % | 47.35 % | 56.20 % | 54.91 % | 25.60 % | 48.11 % |
| RankGAN | 66.28 % | 67.23 % | 59.37 % | 70.17 % | 61.94 % | 35.98 % | 61.23 % |
| LeakGAN | 87.03 % | 80.28 % | 79.57 % | 77.55 % | 67.74 % | 46.80 % | 63.80 % |
| D-test (all) | 77.58 % | 74.72 % | 75.18 % | 74.50 % | 70.31 % | 70.74 % | 73.79 % |
| D-test (human-written) | 80.12 % | 73.54 % | 77.99 % | 75.98 % | 68.59 % | 83.53 % | 79.10 % |
| D-test (machine-generated) | 75.04 % | 75.90 % | 72.38 % | 73.01 % | 72.04 % | 57.95 % | 68.48 % |
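The per-generator rows above can be derived from a discriminator's per-review predictions by a simple group-by. The records below are invented for illustration; in our setup each record would carry the generator that produced the review, the discriminator's call, and the natural label.

```python
# Per-generator accuracy of a discriminator from per-review predictions.
from collections import defaultdict

def per_generator_accuracy(records):
    """records: (generator_name, predicted_is_fake, actually_is_fake) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gen, pred, truth in records:
        totals[gen] += 1
        hits[gen] += (pred == truth)
    return {gen: hits[gen] / totals[gen] for gen in totals}

# Toy predictions: the discriminator catches one of two SeqGAN reviews
# and both low-temperature LSTM reviews.
records = [
    ("SeqGAN", True, True), ("SeqGAN", False, True),
    ("Word LSTM temp 0.5", True, True), ("Word LSTM temp 0.5", True, True),
]
acc = per_generator_accuracy(records)
```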
Individual Adversarial Evaluators In Table 5 we present the results of the 12 individual adversarial evaluators (all based on SVM), trained and tested on reviews generated by each corresponding generator. We observe that, collectively, these generator-specific evaluators tend to rank the generators similarly to the meta-discriminators (e.g., MLE SeqGAN and Word LSTM temp 1.0 are considered the best generators), although the actual accuracy numbers may differ. This suggests that when discriminative evaluation is preferred, one can train a single meta-adversarial evaluator instead of many per-generator evaluators, which makes for fairer comparisons and is likely to be more robust to noise.
| Generative Text Model | Individual (SVM) |
| --- | --- |
| Word LSTM temp 1.0 | 65.02 % |
| Word LSTM temp 0.7 | 86.29 % |
| Word LSTM temp 0.5 | 95.72 % |
| Scheduled Sampling | 64.65 % |
| Google LM | 96.78 % |
| Attention Attribute to Sequence | 91.03 % |
| Contexts to Sequences | 100.00 % |
| Gated Contexts to Sequences | 97.66 % |
| MLE SeqGAN | 64.51 % |
4.1.3 Word-Overlap Evaluators
For every machine-generated review, we compute the scores of the 4 word-overlap metrics against a reference set assembled from the top-10 most similar human-written reviews in D-test. (We also evaluated BLEU and ROUGE against the entire D-test real as references, and the rankings are very similar. Using the top-10 nearest neighbors as references is more reasonable, as a generated review only needs to mimic some real reviews, and it is also considerably more efficient to compute.) As these scores are continuous, we do not have an “accuracy” of the word-overlap evaluators on individual reviews or generators. However, the generators can still be ranked by the average scores of their generated reviews. In Figure 3 we present the average scores of the 12 generators by each of the four evaluators. The word-overlap evaluators also tend to rank the generators in similar orders, though with some variance among the BLEU, ROUGE, and CIDEr scores. Interestingly, the top-ranked generator according to BLEU, ROUGE, and METEOR is Contexts to Sequences, while CIDEr ranks Gated Contexts to Sequences highest. GAN-based generators are generally ranked low.
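The assembly of the nearest-neighbor reference set can be sketched as follows. Jaccard similarity over word sets is one simple, assumed notion of "most similar" here, used only to make the sketch concrete; the texts are invented.

```python
# Select the k human-written reviews most similar to a generated review,
# to serve as the references for the word-overlap metrics.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def top_k_references(generated, human_reviews, k=10):
    """Return the k human reviews most lexically similar to the generated one."""
    return sorted(human_reviews,
                  key=lambda h: jaccard(generated, h), reverse=True)[:k]

human = ["great battery life and screen", "terrible battery , broke fast",
         "lovely screen but heavy", "fast shipping , great seller"]
refs = top_k_references("great battery and great screen", human, k=2)
```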
4.2 Comparing evaluators
Based on the results of individual evaluators, we move on to compare them: to what degree do they agree on the ranking of generators? Intuitively, we are most interested in how the automated evaluators compare to the human evaluators, and whether there is any suitable automatic surrogate for human judges at all. To do this, we compute the correlations between H1, H2, and each discriminative evaluator, and between H1, H2, and the word-overlap evaluators, based on their decisions on individual reviews, their scores of the generators (by Pearson’s coefficient Fieller et al. (1957)), and their rankings of the generators (by Spearman’s ρ Spearman (1904) and Kendall’s τ Daniel et al. (1978)). Results are presented in Table 6.
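As a minimal sketch of the rank-correlation computation, the following implements Spearman's ρ directly from its closed form, assuming no ties (which holds for the accuracy-based rankings here); the evaluator scores are made up.

```python
# Spearman rank correlation between two evaluators' generator scores:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference.

def spearman_rho(scores_a, scores_b):
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Two evaluators scoring the same four generators (made-up numbers):
rho_agree = spearman_rho([1, 2, 3, 4], [10, 20, 30, 40])  # identical order
rho_flip = spearman_rho([1, 2, 3, 4], [40, 30, 20, 10])   # reversed order
```

In practice one would use a library routine that also handles ties and significance testing; the point here is only the shape of the computation.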
| Evaluation Method | Kendall tau-b | Spearman | Pearson | Kendall tau-b | Spearman | Pearson |
| --- | --- | --- | --- | --- | --- | --- |
| CNN & LSTM meta-discriminator | -0.6060* | -0.7902* | -0.8392* | -0.6970* | -0.8462* | -0.8507* |
Not surprisingly, the two human evaluators make highly correlated decisions. Surprisingly though, none of the discriminative evaluators has a positive correlation with the human evaluators; in fact, their rankings are negatively correlated. That is, generators that easily fool machine judges are less likely to confuse human judges, and vice versa.
Interestingly, the word-overlap evaluators tend to have a positive correlation with the human evaluators in ranking the generators. Among them, BLEU appears to be closest to the human rankings. This pattern is consistent across all three types of correlations. These two observations are intriguing: they indicate that when identifying fake reviews, humans might focus more on word usage than on mentally constructing a “decision boundary.”
In summary, we find that 1) human evaluators cannot distinguish machine-generated reviews from human-written reviews perfectly, with significant bias between the two classes; 2) meta-adversarial evaluators make similar decisions to per-generator adversarial evaluators, and they tend to be negatively correlated with human evaluators; and 3) word-overlap evaluators are highly correlated with human evaluators in this task. In the next section, we provide more detailed analysis and possible explanations of the findings.
In this work we have designed and conducted a systematic study that evaluates the evaluators of natural language generation. Our results indicate that the decisions of discriminative evaluators for review generation do not correlate with the decisions of human evaluators, while word-overlap evaluators correlate better with a simulated Turing test. These findings are consistent with some findings in the literature Sharma et al. (2017), Li et al. (2017), while contradicting others Liu et al. (2016), Bruni and Fernández (2017). Below we conduct in-depth analysis and discuss possible explanations.
| Generative Text Model | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word LSTM temp 1.0 | 59.33 % | 57.33 % | 48.67 % | 54.00 % | 60.67 % | 40.67 % | 52.00 % |
| Word LSTM temp 0.7 | 96.67 % | 96.00 % | 94.67 % | 82.00 % | 81.33 % | 81.33 % | 86.00 % |
| Word LSTM temp 0.5 | 99.33 % | 99.33 % | 98.67 % | 94.00 % | 94.00 % | 96.00 % | 93.33 % |
| Scheduled Sampling | 50.67 % | 53.33 % | 52.67 % | 51.33 % | 52.57 % | 26.00 % | 46.00 % |
| Google LM | 81.33 % | 71.33 % | 72.67 % | 82.67 % | 86.67 % | 50.00 % | 85.33 % |
| Attention Attribute to Sequence | 94.00 % | 93.33 % | 91.33 % | 76.67 % | 75.33 % | 83.33 % | 69.33 % |
| Contexts to Sequences | 100.00 % | 100.00 % | 100.00 % | 100.00 % | 100.00 % | 100.00 % | 100.00 % |
| Gated Contexts to Sequences | 97.33 % | 99.33 % | 98.67 % | 96.00 % | 94.67 % | 99.33 % | 94.00 % |
| MLE SeqGAN | 40.00 % | 46.00 % | 40.00 % | 52.00 % | 52.67 % | 18.00 % | 46.67 % |
| SeqGAN | 67.33 % | 49.33 % | 54.67 % | 60.00 % | 49.33 % | 26.67 % | 45.33 % |
| RankGAN | 67.33 % | 67.33 % | 62.00 % | 70.00 % | 57.33 % | 36.67 % | 64.00 % |
| LeakGAN | 86.67 % | 80.00 % | 82.67 % | 76.00 % | 68.00 % | 44.00 % | 62.00 % |
| Test set reviews (all) | 78.57 % | 74.17 % | 75.92 % | 75.53 % | 70.78 % | 71.33 % | 75.22 % |
| Test set reviews (human-written) | 79.94 % | 72.28 % | 77.11 % | 76.50 % | 68.83 % | 84.17 % | 80.11 % |
| Test set reviews (machine-generated) | 77.22 % | 76.06 % | 74.72 % | 74.56 % | 72.72 % | 58.50 % | 70.33 % |
| Generative Text Model | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word LSTM temp 1.0 | 57.33 % | 58.00 % | 53.33 % | 60.00 % | 53.33 % | 48.00 % | 54.00 % |
| Word LSTM temp 0.7 | 30.67 % | 31.33 % | 31.33 % | 34.67 % | 36.67 % | 38.00 % | 33.33 % |
| Word LSTM temp 0.5 | 18.00 % | 18.00 % | 18.67 % | 22.00 % | 23.33 % | 21.33 % | 21.33 % |
| Scheduled Sampling | 48.00 % | 50.67 % | 46.00 % | 51.33 % | 48.67 % | 28.67 % | 48.67 % |
| Google LM | 68.67 % | 60.00 % | 62.67 % | 67.33 % | 68.67 % | 49.33 % | 71.33 % |
| Attention Attribute to Sequence | 31.33 % | 32.00 % | 31.33 % | 36.67 % | 36.67 % | 31.33 % | 33.33 % |
| Contexts to Sequences | 34.00 % | 34.00 % | 34.00 % | 34.00 % | 34.00 % | 34.00 % | 34.00 % |
| Gated Contexts to Sequences | 14.57 % | 15.33 % | 14.67 % | 17.33 % | 16.00 % | 15.33 % | 19.33 % |
| MLE SeqGAN | 46.00 % | 50.56 % | 47.33 % | 52.67 % | 52.00 % | 25.33 % | 50.00 % |
| SeqGAN | 59.33 % | 53.33 % | 60.00 % | 60.00 % | 53.33 % | 34.67 % | 52.00 % |
| RankGAN | 59.33 % | 60.67 % | 55.33 % | 70.00 % | 53.33 % | 38.00 % | 53.33 % |
| LeakGAN | 70.00 % | 63.33 % | 71.33 % | 76.00 % | 60.67 % | 39.33 % | 54.67 % |
| Test set reviews (all) | 60.08 % | 56.78 % | 58.19 % | 58.75 % | 54.78 % | 55.17 % | 59.28 % |
| Test set reviews (human-written) | 75.11 % | 69.33 % | 72.17 % | 70.89 % | 64.67 % | 76.44 % | 74.83 % |
| Test set reviews (machine-generated) | 45.06 % | 44.22 % | 44.22 % | 46.61 % | 44.89 % | 33.89 % | 43.72 % |
5.1 Imperfect Ground-truth
One important thing to note about the discriminative evaluators is that all the classifiers are trained using natural labels (i.e., treating all examples from the Amazon review dataset as positive and examples generated by the candidate models as negative) instead of human labels. It is possible that, had they been trained with human labels, the discriminative evaluators would have been more consistent with the human evaluators. Indeed, some reviews posted on Amazon may themselves have been generated by bots, and if that is the case, treating them as human-written examples may bias the discriminators. In fact, according to Figure 2, only around 80% of the reviews in D-train real (Amazon product reviews) were classified as human-written by the human evaluators. If we trust the annotations by the Turkers, the other 20% of reviews may have been generated by bots.
One way to verify this is to consider an alternative “ground-truth” for the discriminative evaluators. That is, we apply the already trained meta-discriminators to the Turker-annotated set (which contains 3,600 reviews) instead of the full D-test set, and we use the majority vote of the human annotations (whether a review is fake or real) as a surrogate for the “ground-truth” labels (whether a review is generated or sampled from Amazon). The results are presented in the following tables.
First, to rule out potential selection bias in the annotated sample of D-test, in Table 7 we present the results of the meta-discriminators on each generator, tested on the annotated subset but using the natural labels as ground-truth. Note that the only difference between Table 7 and Table 4 is whether the results are calculated on the annotated subset or on the full D-test dataset. Both the accuracy numbers and the rankings of the generators are consistent, which means using the smaller test set for evaluation does not introduce noticeable bias.
We then calculate the results of the meta-discriminators, tested on the annotated subset and using the majority votes of the Turkers as surrogates of ground-truth (Table 8). Surprisingly, both the accuracy numbers and the rankings of the generators differ significantly from Table 7 and Table 4 (which used the natural ground-truth labels). We note that the numbers and rankings are more in line with the human evaluators (Table 3).
To confirm the intuition, we calculate the correlations between the meta-discriminators and the human evaluators using the annotated subset only. Surprisingly, replacing the natural ground-truth with human annotated labels, the meta-discriminators become positively correlated with human evaluators (Table 9). Without using human annotated labels, even if evaluated on the same subset, the discriminative evaluators still disagree with human evaluators (Table 10).
Finally, for sanity check, if we calculate the word overlap metrics based on the annotated subset of D-test, the word overlap evaluators still present positive correlations with human evaluators (Table 11).
| Evaluation Method | Kendall tau-b | Spearman | Pearson | Kendall tau-b | Spearman | Pearson |
| --- | --- | --- | --- | --- | --- | --- |
| CNN & LSTM meta-discriminator | 0.5649* | 0.7180* | 0.8579* | 0.5344* | 0.6900* | 0.8623* |
| Evaluation Method | Kendall tau-b | Spearman | Pearson | Kendall tau-b | Spearman | Pearson |
| --- | --- | --- | --- | --- | --- | --- |
| CNN & LSTM meta-discriminator | -0.5954* | -0.7671* | -0.8375* | -0.6870* | -0.8161* | -0.8517* |
| Evaluation Method | Kendall tau-b | Spearman | Pearson | Kendall tau-b | Spearman | Pearson |
| --- | --- | --- | --- | --- | --- | --- |
These results are intriguing. They indicate that when the “ground-truth” used by an automated Turing test is questionable, the decisions of the evaluators may be biased. Discriminative evaluators suffer the most from this bias, as they are directly trained on the imperfect ground-truth. Word-overlap evaluators are more robust, as they only take the most relevant parts of the test set as references (which are more likely to be of high quality). Human evaluators are the most trustworthy, as long as inter-rater disagreements are resolved.
5.2 Role of Diversity
We also assess the role diversity plays in the rankings of the generators. To this end, we measure the lexical diversity Bache et al. (2013) of the samples produced by each generator as the ratio of unique tokens to the total number of tokens. We compute lexical diversity in turn for unigrams, bigrams, and trigrams, and observe that the generators producing the least diverse samples are easily distinguished by the meta-discriminators, while they confuse human evaluators the most. Conversely, samples produced by the most diverse generators are the hardest for the meta-discriminators to distinguish, while human evaluators classify them with higher accuracy. As reported in Kannan and Vinyals (2017), a lack of lexical richness can be a weakness of the generators, making them easily detected by a machine learning classifier. Meanwhile, a discriminator’s preference for rarer language does not necessarily mean it is favouring higher-quality reviews.
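The lexical diversity measure described above (unique n-grams over total n-grams) can be sketched directly; the two toy samples are invented to show that repetitive text scores low.

```python
# Lexical diversity: ratio of unique n-grams to total n-grams in a sample.

def lexical_diversity(tokens, n=1):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

varied = "the screen is bright and the battery lasts long".split()
loopy = "good good good good good good".split()
```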
In addition to lexical diversity, Self-BLEU Zhu et al. (2018) is an interesting measurement of the diversity of a set of texts (the average BLEU score of each document using the rest of the same collection as references; the lower, the more diverse). In Table 12 we present Self-BLEU scores for each generator, applied to its generated text in D-test fake. We also compute the correlation coefficients between the ranking of the generators by Self-BLEU and the rankings by the evaluators (Table 13). The results indicate that Self-BLEU correlates negatively with the human evaluators and word-overlap evaluators, and positively with the discriminative evaluators. This confirms the finding in the literature Kannan and Vinyals (2017) that discriminators in adversarial evaluation capture known limitations of the generative models, such as lack of diversity.
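The Self-BLEU computation can be sketched as follows. As an assumption for brevity, a simple unigram precision stands in for full BLEU; the toy samples are invented, and real Self-BLEU would use n-gram clipped precision with a brevity penalty.

```python
# Self-BLEU sketch: score each generated review against all the other
# reviews from the same generator as references, then average.
# Lower average means a more diverse sample set.

def unigram_precision(candidate, references):
    """Fraction of candidate tokens that appear anywhere in the references."""
    ref_vocab = set(w for r in references for w in r.split())
    cand = candidate.split()
    return sum(w in ref_vocab for w in cand) / len(cand)

def self_bleu(samples):
    scores = [unigram_precision(s, samples[:i] + samples[i + 1:])
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)

repetitive = ["good product", "good product", "good product"]
diverse = ["good product", "slow shipping", "broke after a week"]
```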
| Generative Text Model | Self-BLEU | Lexical diversity |
| --- | --- | --- |
| Word LSTM temp 1.0 | 0.1886 | 0.6467 |
| Word LSTM temp 0.7 | 0.4804 | 0.2932 |
| Word LSTM temp 0.5 | 0.6960 | 0.1347 |
| Attention Attribute to Sequence | 0.5021 | 0.2939 |
| Contexts to Sequences | 0.8950 | 0.0032 |
| Gated Contexts to Sequences | 0.7330 | 0.1129 |
| CNN & LSTM meta-discriminator | 0.7273* | 0.8601* | 0.8622* |
Following this insight, an important question is to what extent the generators are simply memorizing the training set G-train. To this end, we assess the degree of n-gram overlap between the generated reviews and the training reviews using the BLEU evaluator. In Table 14 we present the average BLEU scores of generated reviews using their nearest neighbors in G-train as references. We observe that the generators generally did not just memorize the training set, and GAN models generate reviews that overlap less with G-train. In Table 15 we compute the correlation between this divergence from the training set and the ratings by the evaluators in the study. BLEU w.r.t. G-train has a highly positive correlation with BLEU w.r.t. D-test real, and it is also positively correlated with the human evaluators.
| Generative Text Model | BLEU G-Train |
| --- | --- |
| Word LSTM temp 1.0 | 0.2701 |
| Word LSTM temp 0.7 | 0.4998 |
| Word LSTM temp 0.5 | 0.6294 |
| Attention Attribute to Sequence | 0.5122 |
| Contexts to Sequences | 0.7542 |
| Gated Contexts to Sequences | 0.6240 |
| BLEU G-train | Kendall tau-b | Spearman | Pearson |
| --- | --- | --- | --- |
| CNN & LSTM meta-discriminator | -0.6260* | -0.7811* | -0.7951* |
The effect of diversity is perhaps not hard to explain. In this particular task of distinguishing fake reviews from real, all decisions are made on individual reviews. Because a human judge is not exposed to many fake reviews generated by the same generator, whether or not a fake review is sufficiently different from the other generated reviews is not a major factor in their decision. Instead, the major factor is whether the generated review looks similar to the reviews they have seen in reality. In contrast, a discriminative evaluator makes its decisions after seeing many positive and negative examples, and a fake review that can fool an adversarial classifier has to be sufficiently different from all the other fake reviews the classifier has encountered (therefore the diversity of a generator is a major indicator of its ability to pass an adversarial judge).
5.3 User Study
Finally, we are interested in the reasons why human annotators label certain reviews as fake (machine-written). After annotating a batch of reviews, we asked the workers to explain their decisions by filling in an optional free-text comment. This gives us a better understanding of what differentiates machine-generated from human-written reviews from a human perspective. Analyzing the comments, we identify the main reasons why human evaluators annotate a review as machine-written. These are mainly related to grammatical errors in the review text, wrong wording or inappropriate choice of expressions, redundant use of specific phrases, or contradictory arguments in the review. Interestingly, the human evaluators’ innate biases are also reflected in their decisions: they are likely to categorize a review as fake if it is too formal, lacks emotion and personal pronouns, or is too vague and generic. The major clusters of reasons are as follows:
- Grammar/typos/misspellings: the language does not flow well.
- Too general/too generic/vague: generated reviews are vague and lack details.
- Word choice (wording): lack of slang, use of the wrong words.
- Flow (not fluent)/structure/logic: sentence-level language errors.
- Contradictory arguments: some arguments support opposite opinions.
- Emotion: lack of emotion and personality in the comments.
- Repeated text: using words/phrases repetitively.
- Overly polished: too much like an advertisement, too formal, too perfect to be real.
The results of our experiment have many intriguing implications for both the evaluation and the construction of natural language generators. First, we find that in the context of judging individual documents, discriminative evaluators are not as faithful as word-overlap evaluators, w.r.t. how they correlate with a simulated Turing test (human evaluators). This implies that adversarial accuracy might not be the optimal objective for natural language generation if the goal is to generate documents that humans consider real. A fake review that fools humans does not necessarily need to fool a machine that has seen everything. Pushing too hard towards the decision boundary might swing the pendulum too far towards rare corner cases or unrealistic ones. As a result, simple LSTM models or attention models may generate surprisingly competitive results. In contrast, GAN-based models may more easily pass a Turing test at the bot level (when judgments are made on a system as a whole instead of on individual items), or in a conversational context. That is, when the judges have seen enough examples from the same generator, the next example had better be somewhat different.
Our results also suggest that when adversarial training is used, the selection of training examples must be done with caution. In other words, if the “ground-truth” is hijacked by low-quality or “fake” examples, models trained by GAN may be significantly biased. This finding relates to the recent literature on the robustness and security of machine learning models.
We also find that when humans are distinguishing fake reviews from real ones, they tend to focus more on the word usage. The use of particular words, expressions, emotions, and other details may be more convincing than something that is just “generally” correct. These findings may affect the design of objectives for the next generation of natural language generation models.
We believe that our findings represent a preliminary foundation for proposing more solid and robust metrics for the evaluation of NLG output. In future work we plan to carry out additional experiments that include a wider range of generative models and meta-discriminator architectures, and, inspired by the current results, to propose more robust evaluation metrics for assessing the quality of NLG.
We thank Wei Ai for his help on the power analysis. We also thank Yue Wang and Teng Ye for helpful discussions. This work is in part supported by the National Science Foundation under grant numbers 1633370 and 1620319 and in part supported by the National Library of Medicine under grant number 2R01LM010681-05.
- Bache et al. (2013) Kevin Bache, David Newman, and Padhraic Smyth. 2013. Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 23–31. ACM.
- Bachman and Precup (2015) Philip Bachman and Doina Precup. 2015. Data generation as sequential decision making. In Advances in Neural Information Processing Systems, pages 3249–3257.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
- Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
- Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
- Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- Bruni and Fernández (2017) Elia Bruni and Raquel Fernández. 2017. Adversarial evaluation for open-domain dialogue generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 284–288.
- Buhrmester et al. (2011) Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. 2011. Amazon’s mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5.
- Caccia et al. (2018) Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language gans falling short. arXiv preprint arXiv:1811.02549.
- Callison-Burch et al. (2006) Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
- Carenini and Moore (2006) Giuseppe Carenini and Johanna D Moore. 2006. Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952.
- Che et al. (2017) Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.
- Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
- Chen and Goodman (1999) Stanley F Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM.
- Christensen (2007) Erik Christensen. 2007. Methodology of superiority vs. equivalence trials and non-inferiority trials. Journal of hepatology, 46(5):947–954.
- Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20(3):273–297.
- Dai et al. (2017) Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. 2017. Towards diverse and natural image descriptions via a conditional gan. arXiv preprint arXiv:1703.06029.
- Dale and Mellish (1998) Robert Dale and Chris Mellish. 1998. Towards evaluation in natural language generation. In Proceedings of the First International Conference on Language Resources and Evaluation.
- Daniel et al. (1978) Wayne W Daniel et al. 1978. Applied nonparametric statistics. Houghton Mifflin.
- Di Eugenio et al. (2002) Barbara Di Eugenio, Michael Glass, and Michael Trolio. 2002. The diag experiments: Natural language generation for intelligent tutoring systems. In Proceedings of the International Natural Language Generation Conference, pages 120–127.
- Dong et al. (2017) Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 623–632.
- Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736.
- Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633.
- Fieller et al. (1957) Edgar C Fieller, Herman O Hartley, and Egon S Pearson. 1957. Tests for rank correlation coefficients. i. Biometrika, 44(3/4):470–481.
- Fleiss et al. (2013) Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. 2013. Statistical methods for rates and proportions. John Wiley & Sons.
- Flight and Julious (2016) Laura Flight and Steven A Julious. 2016. Practical guide to sample size calculations: non-inferiority and equivalence trials. Pharmaceutical statistics, 15(1):80–89.
- Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
- Gerz et al. (2018) Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018. Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association of Computational Linguistics, 6:451–465.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
- Guo et al. (2017) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.
- Hardcastle and Scott (2008) David Hardcastle and Donia Scott. 2008. Can we evaluate the quality of generated text? In LREC. Citeseer.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Huszár (2015) Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.
- Jelinek et al. (1977) Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity: a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
- Kannan and Vinyals (2017) Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
- Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Lamb et al. (2016) Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
- Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomforest. R news, 2(3):18–22.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
- Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
- Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian V Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149.
- Lu et al. (2018) Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133.
- Machinery (1950) Computing Machinery. 1950. Computing machinery and intelligence (A. M. Turing). Mind, 59(236):433–460.
- Mani et al. (1999) Inderjeet Mani, David House, Gary Klein, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 1999. The tipster summac text summarization evaluation. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics, pages 77–85. Association for Computational Linguistics.
- Mellish and Dale (1998) Chris Mellish and Robert Dale. 1998. Evaluation in the context of natural language generation. Computer Speech & Language, 12(4):349–373.
- Mikolov et al. (2011a) Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2011a. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531. IEEE.
- Mikolov et al. (2011b) Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011b. RNNLM: recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201.
- Mikolov and Zweig (2012) Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12:234–239.
- Ott et al. (2011) Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 309–319. Association for Computational Linguistics.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- Prechelt (1998) Lutz Prechelt. 1998. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer.
- Rajeswar et al. (2017) Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. 2017. Adversarial generation of natural language. arXiv preprint arXiv:1705.10929.
- Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
- Reiter et al. (2001) Ehud Reiter, Roma Robertson, A Scott Lennox, and Liesl Osman. 2001. Using a randomised controlled clinical trial to evaluate an nlg system. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 442–449. Association for Computational Linguistics.
- Reiter and Sripada (2002) Ehud Reiter and Somayajulu Sripada. 2002. Should corpora texts be gold standards for nlg? In Proceedings of the International Natural Language Generation Conference, pages 97–104.
- Rish (2001) Irina Rish. 2001. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence, volume 3, pages 41–46. IBM.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- Sharma et al. (2017) Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. arXiv preprint arXiv:1706.09799.
- Shi et al. (2018) Zhan Shi, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2018. Towards diverse text generation with inverse reinforcement learning. arXiv preprint arXiv:1804.11258.
- Spearman (1904) Charles Spearman. 1904. The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
- Tang et al. (2016) Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
- Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
- Yarats and Lewis (2017) Denis Yarats and Mike Lewis. 2017. Hierarchical text generation and planning for strategic dialogue. arXiv preprint arXiv:1712.05846.
- Young (1999) R Michael Young. 1999. Using grice’s maxim of quantity to select the content of plan descriptions. Artificial Intelligence, 115(2):215–256.
- Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
- Zhang and Lapata (2014) Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.
- Zhang et al. (2017) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.
- Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886.