Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

by Cristina Garbacea, et al.
University of Michigan

Recent advances in deep learning have resulted in a resurgence in the popularity of natural language generation (NLG). Many deep learning based models, including recurrent neural networks and generative adversarial networks, have been proposed and applied to generating various types of text. Despite the fast development of methods, how to better evaluate the quality of these natural language generators remains a significant challenge. We conduct an in-depth empirical study to evaluate the existing evaluation methods for natural language generation. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well the generated text can be distinguished from human-written text, as well as text overlap metrics that measure how similar the generated text is to human-written references. We measure to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, which raises the larger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is a challenging task even for human evaluators, and their decisions tend to correlate better with text overlap metrics. We also find that diversity is an intriguing metric that is indicative of the assessments of different evaluators.





1 Introduction

Recent developments in neural language models Mikolov and Zweig (2012), Reiter and Belz (2009), Mikolov et al. (2011b), Mikolov et al. (2011a) have inspired the use of neural network based architectures for the task of natural language generation (NLG). From image captioning Karpathy and Fei-Fei (2015), Dai et al. (2017) and machine translation Sutskever et al. (2014), Bahdanau et al. (2014), to text summarization Rush et al. (2015), dialogue systems Wen et al. (2015), and poetry generation Zhang and Lapata (2014), deep neural networks have shown promising results for many natural language processing applications, and they have quickly replaced traditional handcrafted rule-based or template-based approaches to NLG.

Despite the fast development of models, however, there is a huge gap in the evaluation of NLG systems. On one hand, a rigorous, efficient, and reproducible evaluation is critical for the development of any machine learning technology and for a fair interpretation of the state of the art. On the other hand, evaluating the quality of natural language generation is an inherently difficult task due to the special properties of text data, such as subjectivity and non-compositionality (the overall meaning of a piece of text is not composed of the meaning of individual words and is more than just the sum of its parts; it emerges through a holistic appraisal of specific features in a given context). Indeed, “there is no agreed objective criterion for comparing the goodness of texts” Dale and Mellish (1998), and a clear model of text quality is lacking in the NLG literature Hardcastle and Scott (2008).

Conventionally, most NLG systems have been evaluated in a rather informal manner. Reiter and Belz (2009) divide existing evaluation methods commonly employed in text generation into three categories: i) evaluations based on task performance, assessing the impact of generated texts on end users, ii) human judgments and ratings, where human subjects are recruited to rate generated texts on an n-point scale and on different textual dimensions, and iii) evaluations based on comparison to a reference corpus using automatic metrics.

Task based evaluation measures the impact of generated texts in real applications, considering that the value of a functional text lies in how well it serves the user to fulfill a specific function. Young (1999) generates instructional texts and determines how informative these are when users carry out the directions outlined. Mani et al. (1999) perform extrinsic summary evaluation by comparing the functional value of a summary vs. the entire document. Carenini and Moore (2006) evaluate persuasive texts by assessing how users rank items in a list, while Di Eugenio et al. (2002) measure the learning gain in intelligent tutoring systems with an NLG component. Nevertheless, task based evaluation can be expensive and time-consuming, and it often depends on the good will of participants in the study. Besides that, it is hard to separate the general quality of text generation from the special context (and confounds) of the application task, or to generalize the evaluation conclusions across tasks.

Human based evaluation is able to assess the quality of text more directly than task based evaluation, and it requires less support from domain experts. However, evaluating NLG systems on real users in a rigorous manner can be expensive and time consuming, and it does not scale well Reiter et al. (2001). Alternative strategies which are cost effective and can provide accurate immediate feedback in less time are used more frequently.

Automated evaluation compares texts generated by the candidate algorithms to human-written texts. Text overlap metrics and more recent automated adversarial evaluators are widely employed in NLG as they are cheap, quick, repeatable, and do not require human subjects when a reference corpus is already available. In addition, they allow developers to make rapid changes to their systems and automatically tune parameters without human intervention. Despite the benefits, however, the use of automated metrics in the field of NLG is controversial Reiter and Belz (2009), and their results are often criticized as not meaningful for a number of reasons. First, these automatic evaluations rely on a high-quality reference corpus, which is often not available; variations in the writing style of authors and the presence of (grammatical) errors often result in inconsistent and error prone judgments Reiter and Sripada (2002). Second, comparisons with a reference corpus do not assess the impact and usefulness of the generated text on the readers as in human-based evaluations, but instead determine how closely the text matches the references. Third, creating human written reference texts specifically for the purpose of evaluation could still be expensive, especially if these reference texts need to be created by skilled domain experts. Finally and most importantly, using automatic evaluation metrics is sensible only if they correlate with results of human-based evaluations and are accurate predictors of text quality, which is never formally verified. Their validity for evaluation should be a more important concern than their cost-effectiveness.

In this paper we conduct a systematic experiment that evaluates the different evaluators for natural language generation. We compare three types of evaluators for a carefully selected scenario of online review generation: human evaluators, automated adversarial evaluators that are trained to distinguish human-written from machine-generated product reviews, and word overlap metrics (such as BLEU and ROUGE). The preferences of different evaluators on a dozen state-of-the-art deep-learning based NLG algorithms are correlated with human assessments of the quality of generated text. Our findings not only reveal the differences among the evaluators and suggest the more effective ones, but also have important implications for how to guide the development of new natural language generation models.

2 Related Work

The presented study is related to two lines of research: deep learning based models and automated evaluation metrics for the task of text generation.

2.1 Deep Learning Based NLG

Recently, a number of deep learning based models have been proposed for text generation in various scenarios. Recurrent Neural Networks (RNNs), and Long Short Term Memory (LSTM) models Hochreiter and Schmidhuber (1997) in particular, are widely used for generating sequential data including text. Jozefowicz et al. (2016) release Google LM, a language model pre-trained on the One Billion Word Benchmark data Chelba et al. (2013) using a 2-layer LSTM. The most popular strategy for training RNNs is teacher forcing Williams and Zipser (1989): at every step of a sequence, the model is trained using the observed token of the previous step, while at inference time it uses the token generated by the model itself. This discrepancy between training and inference often leads to errors that accumulate quickly over the generated sequence and compromise the predictive power of the model Lamb et al. (2016). To solve this problem, Bengio et al. (2015) propose Scheduled Sampling (SS) to train RNNs, which mixes the ground-truth outputs and the model-generated outputs as the training inputs. However, SS has been shown to be an inconsistent training strategy, and adversarial training strategies are argued to be more suitable for generative models Huszár (2015).
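The difference between teacher forcing, inference, and Scheduled Sampling can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the model's predictions are given as a precomputed list (in a real RNN they depend on earlier inputs), and the function name and the `p_observed` parameter are hypothetical, not taken from Bengio et al. (2015).

```python
import random

def previous_token_inputs(observed, generated, p_observed, rng):
    """Choose the token fed to the model at each step t >= 1.

    With probability p_observed the observed (ground-truth) token of the
    previous step is used; otherwise the model's own previous output is used.
    p_observed = 1.0 corresponds to teacher forcing, p_observed = 0.0 to the
    inference-time behaviour. Simplifying assumption: `generated` is a fixed
    list rather than the output of a live model.
    """
    inputs = []
    for t in range(1, len(observed)):
        use_observed = rng.random() < p_observed
        inputs.append(observed[t - 1] if use_observed else generated[t - 1])
    return inputs

observed = ["the", "book", "was", "great"]
generated = ["a", "movie", "is", "bad"]
rng = random.Random(0)
print(previous_token_inputs(observed, generated, 1.0, rng))  # teacher forcing
print(previous_token_inputs(observed, generated, 0.5, rng))  # mixed inputs
```

Scheduled Sampling typically decays the probability of using the observed token over the course of training; the decay schedule is omitted here.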

Generative Adversarial Networks Goodfellow et al. (2014), or GANs, train generative models through an adversarial process. A GAN works through the interplay of two feedforward neural networks that are trained simultaneously by competing against each other: a generative model G that captures the data distribution and generates high quality synthetic data, and a discriminative model D which estimates the probability that a sample comes from the real training data and not from the synthetic data generated by G.


Compared to generating images, generating text with GANs is challenging due to the discrete nature of sequence data. SeqGAN Yu et al. (2017) is one of the earliest GAN-based models for sequence generation, and it treats the generation procedure as a sequential decision making process Bachman and Precup (2015). The model addresses the problem of non-differentiability of discrete outputs by treating the generator as a stochastic parameterized policy trained via policy gradient Sutton et al. (2000) and optimized by directly performing gradient policy updates. The reward in such a reinforcement learning process is based on how likely the discriminator would be fooled by a complete sequence of tokens, and it is passed back to the intermediate state-action steps using Monte Carlo search.

Following this direction, RankGAN Lin et al. (2017) proposes a newer framework which evaluates the quality of a set of generated sequences collectively. The discriminator in RankGAN is trained to rank the model-generated sentences lower than human-written sentences w.r.t. a human-written reference set; the generator is trained to confuse the ranker in such a way that its generated sentences can be ranked higher than the human-written ones.

Many GAN-based text generators Yu et al. (2017), Lin et al. (2017), Rajeswar et al. (2017), Che et al. (2017), Li et al. (2017), Zhang et al. (2017) are only capable of generating short texts, say 20 tokens long. LeakGAN Guo et al. (2017) is proposed for generating longer texts. In LeakGAN, the discriminator is allowed to leak its own high-level extracted features to better guide the training of the generator.

Deep learning architectures other than LSTM or GAN have also been proposed for text generation. Tang et al. (2016) study the problem of NLG given particular contexts or situations and propose two approaches (Contexts to Sequences and Gated Contexts to Sequences), both built on top of the encoder-decoder framework. Dong et al. (2017) focus on the same task and employ an attention mechanism to learn soft alignments between the input attributes and the generated words.

2.2 Automated Evaluation Metrics

These natural language generation models are also evaluated with a variety of approaches. Arguably, the most natural way to evaluate the quality of a generator is to involve humans as judges, either through some type of Turing test Turing (1950) to distinguish generated text from human input texts, or by directly comparing the texts generated by different generators Mellish and Dale (1998). Such approaches are hard to scale, and they have to be redone whenever a new generator is included. In practice, it is important to find automated metrics that evaluate the quality of a generator independently of human judges or of an exhaustive set of competing generators. Metrics commonly used in the literature are i) perplexity, ii) discriminative evaluators, and iii) text overlap metrics. We summarize these metrics below.

Perplexity Jelinek et al. (1977) is commonly used to evaluate the quality of a language model Chen and Goodman (1999). It measures how well a probability distribution predicts a sample (either seen or unseen) and captures the degree of uncertainty in the model. Perplexity is used to evaluate generators Yarats and Lewis (2017), Ficler and Goldberg (2017), Gerz et al. (2018) even though it is commonly criticized for not being a direct measure of the quality of generated text Fedus et al. (2018). Indeed, perplexity is a model dependent metric, and “how likely a sentence is generated by a given model” is not comparable across different models. Therefore we do not include perplexity as an evaluation metric in this study.
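For readers unfamiliar with the metric, perplexity can be computed as the exponential of the average negative log-probability a model assigns to the tokens of a sample. The following Python sketch shows this standard definition; the function name and the example probabilities are illustrative assumptions and do not come from the study.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sample given the per-token probabilities a model
    assigned to it: exp of the average negative log-probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 4 options at every step
# behaves like a 4-way guess per token.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Because the probabilities come from the model being scored, two models with different vocabularies or tokenizations produce values that are not directly comparable, which is the limitation noted above.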

Discriminative Evaluation is an alternative way to evaluate a generator, which measures how likely its generated text is to fool a classifier that aims to distinguish the generated text from human-written texts. In a way, this is an automated approximation of the Turing test, where machine judges are used to replace human judges. Discriminative machine judges can be trained either using a data set with explicit labels Ott et al. (2011), or using a mixture of text written by real humans and text generated by the model being evaluated. The latter is usually referred to as adversarial evaluation. Bowman et al. (2015) present one of the earliest works that uses adversarial evaluation to assess the quality of generated sentences. The authors train two logistic regression classifiers, one based on bag-of-unigram features and the other based on LSTM, to separate the generated sentences from human-written sentences. In this context, adversarial error is defined as the difference between the ideal accuracy of the discriminator when samples from the two categories are indistinguishable from each other (i.e. 50% accuracy) and the actual accuracy attained. Notably, maximizing the adversarial error is consistent with the objective of the generator in generative adversarial networks.

In a later study, Kannan and Vinyals (2017) propose an adversarial loss to discriminate a dialogue model’s output from human output. The model consists of a generator and a discriminator (both RNNs) trained separately. The authors find that the discriminator’s preference is correlated with the length of the output, and long responses are favoured even when they are not entirely coherent. In addition, the discriminator prefers rarer language instead of the most common responses produced by the generator, which implies that the discriminator may detect weaknesses of the generator such as lack of diversity in generated text. Even though the discriminator can distinguish generated output from human output at an accuracy of 62.5%, there is no evidence that a model on which the discriminative evaluator obtains a lower accuracy is better according to human evaluations.

Automatic dialogue evaluation is formulated as a learning problem in Lowe et al. (2017), who train a hierarchical RNN model to predict the scores a human would assign to dialogue responses. The authors show that the predictions correlate with human judgments at the utterance and system level; however, each response is evaluated in a very specific context and the system requires substantial human judgments for training. Li et al. (2017) employ the idea of having a discriminator (analogous to the human evaluator in the Turing test) both in training and testing, and define adversarial success (AdverSuc) as the difference between 1 and the accuracy achieved by the evaluator (thus the higher the better). The authors find that an adversarial training strategy is more useful when there is a big discrepancy between the distributions of the generated sequences and the reference target sequences, i.e. high entropy of the targets. Other work finds that the performance of a discriminative agent (e.g., an attention-based bidirectional LSTM binary classifier) is comparable with human judges at distinguishing between real and fake dialogue excerpts Bruni and Fernández (2017). Results show there is limited consensus among humans on what is considered a coherent dialogue passage. The discriminator is sensitive to patterns which are not apparent to humans; however, the utility of such an approach in developing generative models that interact with humans is an open question.
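The adversarial success measure described above reduces to a one-line computation once the evaluator's predictions are available. The sketch below is illustrative only: the function name and the string labels are assumptions, and the example predictions are made up.

```python
def adversarial_success(predictions, truth):
    """AdverSuc = 1 - accuracy of the discriminative evaluator.

    A higher value means the evaluator was fooled more often, which is
    read as higher generator quality.
    """
    if len(predictions) != len(truth) or not truth:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, truth))
    return 1.0 - correct / len(truth)

# Four machine-generated texts; the evaluator mistakes one for human-written.
truth = ["fake", "fake", "fake", "fake"]
predictions = ["fake", "real", "fake", "fake"]
print(adversarial_success(predictions, truth))
```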

Text Overlap Metrics, such as BLEU Papineni et al. (2002), ROUGE Lin (2004), and METEOR Banerjee and Lavie (2005), are commonly used to evaluate NLP tasks, machine translation and text summarization in particular. They are borrowed to evaluate language generation by comparing the similarity between the generated text and human-written references. Liu et al. (2016) find that word overlap metrics present weak or no correlation with human judgments in non-task oriented dialogue systems. As these metrics are just a rough estimate of human judgments, the authors recommend always using them with caution or in combination with user studies. In contrast, Sharma et al. (2017) report that word overlap metrics are indicative of human judgments in task-oriented dialogue settings, when used on datasets which contain multiple ground truth references. Dai et al. (2017) find word overlap metrics too restrictive as they focus on fidelity of wording instead of fidelity of semantics, implying that sentences containing matched n-grams get substantially higher scores than sentences containing variant expressions. Callison-Burch et al. (2006) consider an improvement in BLEU insufficient for achieving an actual improvement in the quality of a system. For example, variations (e.g., permutations and substitutions) of a generated sentence that are not equally grammatical or semantically plausible are not distinguishable by the BLEU score. The authors argue in favour of human evaluations.

GAN-based text generation models are compared against a maximum likelihood estimation (MLE) baseline in Caccia et al. (2018). The authors find that models trained with MLE yield a superior quality-diversity trade-off according to negative log-likelihood and Self-BLEU metrics, where Self-BLEU Zhu et al. (2018) measures the diversity of the generated text by calculating the BLEU Papineni et al. (2002) score for each generated sentence using all other generated sentences in the corpus as references. Shi et al. (2018) compare frameworks for text generation including MLE, SeqGAN, LeakGAN and Inverse Reinforcement Learning using a simulated Turing test; each generated sentence gets 1 point when it is judged by humans as real, otherwise 0 points. A benchmarking experiment with GAN neural text generation models is conducted in Lu et al. (2018); results show that LeakGAN presents the highest BLEU scores on the test data.
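To make the Self-BLEU idea concrete, the Python sketch below scores each generated sentence against all the others and averages the results. It is a simplified illustration, not the implementation used in the cited work: the BLEU here uses clipped n-gram precision with uniform weights and a brevity penalty but no smoothing, the default maximum n-gram order of 2 is an assumption chosen for short toy sentences, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=2):
    """Simplified sentence-level BLEU (no smoothing, uniform weights)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

def self_bleu(sentences, max_n=2):
    """Average BLEU of each sentence against all other sentences."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    scores = [bleu(s, sentences[:i] + sentences[i + 1:], max_n)
              for i, s in enumerate(sentences)]
    return sum(scores) / len(scores)

repetitive = [["good", "book"], ["good", "book"], ["good", "book"]]
diverse = [["good", "book"], ["bad", "movie"], ["fast", "delivery"]]
print(self_bleu(repetitive), self_bleu(diverse))
```

A lower Self-BLEU indicates that the generated sentences overlap less with each other, i.e. that the output is more diverse.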

While a large number of natural language generators have been proposed and evaluated with various metrics, no existing work has systematically evaluated the different evaluators. Do automated judges truly mimic human judges? How much do they agree with each other? Is adversarial evaluation indicative of who would pass a Turing test? Do human judges focus on different aspects of text quality than machine judges? Recently, the community has started to realize that these evaluation metrics for NLG are not consistent among themselves, and that using multiple methods of evaluation can help capture different aspects of text quality, from fluency and clarity to adequacy of semantic content and effectiveness of communication Gatt and Krahmer (2018). Yet these insights are not backed by empirical evidence. In this study, we bridge the gap and conduct a systematic empirical evaluation of the automated evaluators for natural language generation.

3 Experiment Design

We design a large-scale experiment to systematically analyze the procedures and metrics used for evaluating natural language generation models. While the main subjects of the experiment are different evaluators, including those based on human judgments, those based on automated discriminators, and those based on word overlap metrics, the experiment carefully chooses a particular application context and a variety of natural language generators in this context. Ideally, a sound automated evaluator should be able to distinguish good generators from suboptimal ones. Its preferences (on ordering the generators) should be consistent with those of humans who are trained to make judgments in the particular task context.

3.1 Experiment Context and Procedure

We design the experiment in the context of generating online product reviews. There are several reasons why review generation is a desirable task: 1) online product reviews are widely available, and it is easy to collect a large number of examples for training/testing the generators; 2) Internet users are used to reading online reviews, and it is easy to recruit capable human judges to assess the quality of reviews; and 3) compared to tasks like image caption generation or dialogue systems, review generation has minimal dependency on the conversation context or on non-textual data, which reduces possible confounds of the experiment.

The general experiment procedure is presented in Figure 1. We start from the publicly available Amazon Product Reviews dataset (http://jmcauley.ucsd.edu/data/amazon/), which spans the period from May 1996 to July 2014. From the entire dataset, we select the three most popular domains: books, electronics, and movies. To control for the potential confounds introduced by special products or language usage, we do not include products with fewer than 2 reviews or users who have written only a single review. The vocabulary is trimmed to the most frequent 5,000 words, with all out-of-vocabulary words replaced by a unified token. We discard reviews longer than 70 words, as they tend to describe plots of particular books or movies instead of expressing opinions.
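The vocabulary trimming and length filtering steps can be sketched in a few lines of Python. This is an illustration under assumptions that the text does not specify: whitespace tokenization with lowercasing, length filtering applied before counting word frequencies, and `<unk>` as the name of the unified out-of-vocabulary token. The product and user filters are omitted.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unified out-of-vocabulary token

def preprocess(reviews, vocab_size=5000, max_len=70):
    """Drop reviews longer than max_len words, keep the vocab_size most
    frequent words, and map every other word to UNK."""
    tokenized = [review.lower().split() for review in reviews]
    kept = [tokens for tokens in tokenized if len(tokens) <= max_len]
    counts = Counter(token for tokens in kept for token in tokens)
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    cleaned = [[t if t in vocab else UNK for t in tokens] for tokens in kept]
    return cleaned, vocab

reviews = ["Great book great read", "great movie", "plot " * 80]
cleaned, vocab = preprocess(reviews, vocab_size=1)
print(cleaned)  # the 80-word review is dropped; only "great" stays in the vocabulary
```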

The filtered dataset is randomly split into three parts, to train, validate, and test the candidate review generators (denoted as G-train, G-valid, and G-test, respectively). Every generative model is trained and validated using the same datasets, and once trained, each of them is used to generate a number of product reviews (details are included in the next section). These generated reviews, mixed with the real reviews in G-test, are randomly split into three new subsets for training, validating, and testing candidate (discriminative) evaluators, denoted as D-train, D-valid, and D-test. Note that every subset contains the same proportion of human-written reviews and machine-generated reviews. Finally, a random sample of reviews from D-test is sent for human evaluation.
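One way to obtain subsets with the same proportion of human-written and machine-generated reviews is to split each class separately and then merge the pieces. The Python sketch below illustrates this; the 70/10/20 fractions are an assumption inferred from the per-model counts reported in Table 2 (22,750 / 3,250 / 6,500 out of 32,500), and the function name and seed are hypothetical.

```python
import random

def stratified_split(real, fake, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split real and fake reviews separately so that train, valid, and
    test each keep the same real-to-fake proportion."""
    rng = random.Random(seed)
    splits = {"train": [], "valid": [], "test": []}
    for label, items in (("real", real), ("fake", fake)):
        items = list(items)
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_valid = round(fractions[1] * len(items))
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["valid"] += [(x, label) for x in items[n_train:n_train + n_valid]]
        splits["test"] += [(x, label) for x in items[n_train + n_valid:]]
    return splits

real = [f"real review {i}" for i in range(100)]
fake = [f"fake review {i}" for i in range(100)]
splits = stratified_split(real, fake)
print({name: len(part) for name, part in splits.items()})
```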

Figure 1: Overview of the Experiment Procedure.

3.2 Review Generators

Although our goal is to evaluate the evaluators, it is critical to include a wide range of generators and generated reviews. Ideally, these generators should present various degrees of quality; a good evaluator should be able to distinguish the high-quality generators (or generated reviews) from the low-quality ones, and vice versa. To achieve this variety, we select a diverse set of generative models from recent text generation literature. Note that the goal of this study is NOT to name the best generative model, and it is unfeasible to include all existing models. Our criteria are to include models that represent different strategies and quality levels and to consider those with publicly available implementations. In Table 1 we list the candidate generative models, carefully noting that it is not an exhaustive list of what is currently available.

| Generative Model | Reference | Adversarial |
| --- | --- | --- |
| Word LSTM temp 1.0 | Hochreiter and Schmidhuber (1997) | No |
| Word LSTM temp 0.7 | Hochreiter and Schmidhuber (1997) | No |
| Word LSTM temp 0.5 | Hochreiter and Schmidhuber (1997) | No |
| Scheduled Sampling | Bengio et al. (2015) | No |
| Google LM | Jozefowicz et al. (2016) | No |
| Attention Attribute to Sequence | Dong et al. (2017) | No |
| Contexts to Sequences | Tang et al. (2016) | No |
| Gated Contexts to Sequences | Tang et al. (2016) | No |
| MLE SeqGAN | Yu et al. (2017) | Yes |
| SeqGAN | Yu et al. (2017) | Yes |
| RankGAN | Lin et al. (2017) | Yes |
| LeakGAN | Guo et al. (2017) | Yes |

Table 1: Overview of generative models employed for generating online product reviews.

Every generator (other than the Google LM) is trained and validated on G-train and G-valid datasets, and then used to generate the same number of (fake) reviews (see Table 2). We follow the best practice in literature to train these models, although it is possible that the performance of individual models might not be optimized due to the constraints of data and computational resources. Again, our goal is to evaluate the evaluators instead of the individual generators. Google LM was not trained on reviews, but it provides a nice sanity check of the experimental results - this generator is not supposed to be ranked high by any reasonable evaluator.

| Generative Model | Total | D-Train | D-Valid | D-Test |
| --- | --- | --- | --- | --- |
| Each model in Table 1 except Google LM | 32,500 | 22,750 | 3,250 | 6,500 |
| Google LM | 6,680 | 4,676 | 668 | 1,336 |

Table 2: Number of generated reviews by each model. Google LM is a pre-trained model and the inference takes a long time, so we limit the number of its samples.

3.3 Evaluators

The goal of the experiment is to analyze and compare different evaluators for review generation. We include a comprehensive set of evaluators for the quality of the aforementioned generators: i) human evaluators, ii) discriminative evaluators, and iii) word overlap evaluators.

3.3.1 Human evaluators

The most reasonable evaluator is a Turing test. When it is not feasible in practice, controlled user annotations are often used to mimic a "shrunk" version of the Turing test. We conduct a careful power analysis to determine how many examples to include for annotation so as to obtain a certain level of statistical significance in the comparison of the generative models Flight and Julious (2016), Christensen (2007). The result of our power analysis suggests that at least 111 examples per generative model should be evaluated to infer that the machine-generated reviews are comparable in quality to human-written reviews, at a minimal statistical significance level of 0.05. Per this calculation, we sample 150 examples for each of the 12 generators for human evaluation. This totals 1,800 machine-generated reviews, to which we add 1,800 human-written reviews, for a total of 3,600 product reviews sent for human evaluation.

We recruit human annotators through the Amazon Mechanical Turk (AMT) Buhrmester et al. (2011) platform to label the reviews as real (i.e. human written) or fake (i.e. machine generated). Reviews are split into pages of 20 (a mixture of 10 real and 10 fake). We restrict participants in our study to highly-qualified US-based workers (historical approval rate greater than 95%). To ensure high quality annotations, we insert “gotcha” questions in each page and only accept the labels from workers who answer all the questions per page and also answer the “gotcha” question correctly. Each page is annotated by 5 distinct human evaluators; in total 900 distinct workers participated in our study and evaluated all 3,600 reviews. Their judgments on every review are used to assemble two distinct human evaluators: H1 - individual votes, treating all human annotations independently, and H2 - majority votes of the 5 human judgments per review.

For every annotated review, a human evaluator (H1 or H2) makes a call which can be either right or wrong with regard to the ground truth (whether the review is sampled from the Amazon review dataset or generated by one of the machines). A generator is considered high quality if the human evaluator achieves a low accuracy on the fake reviews it generated.
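The difference between the two human evaluators, H1 (individual votes) and H2 (majority votes), can be illustrated with a short Python sketch. The data structures, function names, and example annotations are assumptions made for illustration; with 5 binary votes per review a majority always exists, so ties are not handled.

```python
from collections import Counter

def h1_accuracy(annotations, truth):
    """H1: every individual vote counts as an independent judgment."""
    votes = [(vote, truth[review_id])
             for review_id, review_votes in annotations.items()
             for vote in review_votes]
    return sum(vote == label for vote, label in votes) / len(votes)

def h2_accuracy(annotations, truth):
    """H2: one judgment per review, the majority of its votes."""
    correct = 0
    for review_id, review_votes in annotations.items():
        majority = Counter(review_votes).most_common(1)[0][0]
        correct += majority == truth[review_id]
    return correct / len(annotations)

# Two machine-generated reviews, each labelled by 5 annotators (made-up data).
truth = {"r1": "fake", "r2": "fake"}
annotations = {
    "r1": ["fake", "fake", "real", "fake", "real"],
    "r2": ["real", "real", "real", "fake", "real"],
}
print(h1_accuracy(annotations, truth), h2_accuracy(annotations, truth))
```

On these made-up annotations the two evaluators disagree: individual votes are right 4 times out of 10, while the majority vote is right for one of the two reviews.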

3.3.2 Discriminative evaluators

With the D-train and D-valid datasets, we can train a discriminative classifier for every generator using the “fake” reviews it generated and the same number of “real” reviews. These classifiers serve as adversarial evaluators for each individual generator, the quality of which is measured by how much its corresponding adversarial evaluator is fooled by additional “fake” reviews it generates (in D-test). This is consistent with the objectives in generative adversarial networks.

Note that each adversarial evaluator is generator-dependent, as it is trained using the generated reviews of each individual generator. This is not ideal, as comparisons between different generators may be unfair if they are not scored by the same evaluators, and for every new generator, a new adversarial evaluator has to be trained. The inclusion of multiple generators provides the opportunity of creating meta-adversarial evaluators, which are trained using the pool of generated reviews by many generators, mixed with a larger number of “real” reviews. Such a “pooling” strategy is similar to the standard practice used by the TREC conferences to evaluate different information retrieval systems. Compared to individual adversarial evaluators, a meta-evaluator is supposed to be more robust and fairer, and it can be applied to evaluate new generators without being retrained.

The actual discriminative classifiers can be either deep or shallow. We employ a total of 7 meta-adversarial evaluators: 3 deep versions, of which one uses LSTM Hochreiter and Schmidhuber (1997), one uses a Convolutional Neural Network (CNN) Kim (2014), LeCun et al. (1998), and one uses a combination of LSTM and CNN architectures; and 4 shallow versions, based on Naive Bayes (NB) Rish (2001), Random Forest (RF) Liaw et al. (2002), Support Vector Machines (SVM) Cortes and Vapnik (1995), and XGBoost Chen and Guestrin (2016), and they use unigrams, bigrams, and trigrams as features. In addition, 12 individual adversarial evaluators are trained, all based on SVM. All 19 discriminative evaluators are trained with balanced training sets; for the shallow models we use 10-fold cross validation. We find the best hyper-parameters using random search Bergstra and Bengio (2012), and prevent the models from overfitting by using early stopping Prechelt (1998).
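As an illustration of the unigram, bigram, and trigram features mentioned above, the following Python sketch counts all n-grams up to order 3 in a tokenized review. It shows only the feature extraction step; the function name and the count-based representation are assumptions, and the classifiers themselves are not reproduced here.

```python
from collections import Counter

def ngram_features(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a tokenized review."""
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = ngram_features("i love this book".split())
print(features["love"], features["love this"], features["i love this"])
```

Such count vectors could then be fed to any of the shallow classifiers listed above.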

For every review in D-test (either annotated or not), a discriminative evaluator makes a judgment call. A generator is considered high quality if the discriminative evaluator makes more mistakes on reviews it generated, w.r.t. the ground truth.
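The scoring rule above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the record layout, the function names, and the toy judgments are all assumptions made for the example.

```python
from collections import defaultdict

def accuracy_per_generator(records):
    """records: iterable of (generator_name, predicted_label, true_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, predicted, truth in records:
        total[generator] += 1
        correct[generator] += int(predicted == truth)
    return {g: correct[g] / total[g] for g in total}

def rank_generators(accuracy_by_generator):
    """Best generator first: lower evaluator accuracy means more mistakes."""
    return sorted(accuracy_by_generator, key=accuracy_by_generator.get)

# Hypothetical judgments: every record is a machine-generated review,
# so the ground-truth label is always "fake".
records = [
    ("gen_a", "fake", "fake"), ("gen_a", "real", "fake"),
    ("gen_a", "real", "fake"), ("gen_a", "real", "fake"),
    ("gen_b", "fake", "fake"), ("gen_b", "fake", "fake"),
    ("gen_b", "fake", "fake"), ("gen_b", "real", "fake"),
]
acc = accuracy_per_generator(records)
ranking = rank_generators(acc)
print(acc, ranking)
```

In this toy case the evaluator is wrong on three of the four reviews from `gen_a`, so `gen_a` is ranked ahead of `gen_b`.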

3.3.3 Word overlap evaluators

We include a set of 4 word overlap metrics that are used for NLG evaluation: BLEU Papineni et al. (2002) and METEOR Banerjee and Lavie (2005), which are borrowed from machine translation tasks; ROUGE Lin (2004), which is borrowed from text summarization tasks; and CIDEr Vedantam et al. (2015), which is borrowed from image description evaluation.

An important aspect of these metrics is that they rely on matching n-grams in the target text (i.e., generated reviews) to the ground truth text, or the “references” (i.e., human-written reviews). The higher the overlap (similarity), the higher the quality of the generated text. For every generated review in D-test Fake, we assemble the set of references by retrieving the top-k most similar human-written reviews in D-test Real, calculated using a simple vector space model (footnote 3: computed using a publicly available toolkit from Sharma et al. (2017)). A generator is considered high quality if its generated reviews obtain a high average score by a word overlap evaluator.
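As an illustration of the two steps described above (reference retrieval, then n-gram matching), the following sketch retrieves the most similar real reviews with a bag-of-words cosine similarity and computes a clipped n-gram precision, the building block of BLEU. Whitespace tokenization, raw term counts, and all function names are assumptions made for this example; the study itself relies on an existing toolkit.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_references(candidate, real_reviews, k):
    """Return the k real reviews most similar to the candidate."""
    cand_vec = Counter(candidate.split())
    ranked = sorted(real_reviews,
                    key=lambda r: cosine(cand_vec, Counter(r.split())),
                    reverse=True)
    return ranked[:k]

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, with each n-gram
    counted at most as often as it appears in any single reference."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Hypothetical data for illustration only.
real = ["great sound quality for the price",
        "the battery died after a week",
        "love this book"]
fake = "great sound for the price"
refs = top_k_references(fake, real, k=2)
p1 = clipped_precision(fake, refs, 1)
p2 = clipped_precision(fake, refs, 2)
print(refs, p1, p2)
```

Restricting the references to nearest neighbours means a generated review is only compared against the real reviews it most resembles, rather than being penalized for differing from unrelated ones.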

In total, we analyze and compare 25 candidate evaluators for review generation (2 human evaluators, 19 discriminative evaluators, and 4 word-overlap metrics), based on the D-test dataset.

4 Results

We consider a few dependent variables to analyze the experimental results. First, we are interested in the accuracy of individual evaluators: how well they can distinguish “fake” reviews (machine-generated) from “real” reviews (human-written). Second, we are interested in how an evaluator assesses the quality of the 12 generators in general, instead of individual reviews. Since none of the evaluators (including the human evaluators) is likely to make perfect calls, the absolute scores an evaluator gives to the generators are not as informative as how it ranks them: a good evaluator should be able to rank good generators higher than bad generators, and implicitly, it should be confident in the ranking. Last but not least, we are interested in how the rankings by different evaluators correlate with each other. Intuitively, an automated evaluator that ranks the generators similarly to the human evaluators is more reasonable and can potentially be used as a surrogate for human evaluation. Below we summarize the main results of the experiment.

4.1 Results of individual evaluators

We first present results for individual evaluators. As a reminder, the evaluators use three different ways to distinguish machine-generated reviews from human-written reviews: through human annotations, through discriminative classifiers, and through word-overlap metrics.

4.1.1 Human evaluators

In the AMT experiment, every review is annotated by 5 human judges as either “fake” or “real.” The inter-annotator agreement among the human judges is only considered fair by the Fleiss-Kappa score Fleiss et al. (2013). This suggests that distinguishing machine-generated reviews from human-written reviews is in general a hard task even for humans; there is limited consensus on what counts as a realistic review.
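For readers unfamiliar with the statistic, the sketch below computes Fleiss' kappa from a table of per-item category counts. The rating matrix is invented for illustration and is not the study's annotation data; the function name is likewise an assumption.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in ratings]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(row[j] for row in ratings) / (n_items * n_raters)
              for j in range(n_categories)]
    p_chance = sum(s * s for s in shares)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts of ("real", "fake") votes from 5 judges on 4 reviews.
toy_ratings = [[5, 0], [4, 1], [3, 2], [0, 5]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 4))
```

The statistic compares the observed pairwise agreement with the agreement expected if judges labelled at random according to the overall label frequencies, so a value near 0 means agreement is no better than chance.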

Figure 2: Accuracy of human evaluators on individual reviews: H1 - individual votes; H2 - majority votes. Humans tend to correctly label human-written reviews as real, but are prone to be confused by (approximately half of) machine-generated reviews. Overall, accuracy values for individual human decisions or majority votes are quite similar.

In Figure 2 we present the accuracy of the two human evaluators on individual, Turker-annotated reviews, using either all 5 annotations (H1) or their majority vote (H2) for each review. Compared to the ground-truth (whether a review is machine-generated or collected from Amazon), the accuracy of individual human decisions is 66.61%, while their majority votes reach 72.63%. Neither of them is close to perfect. We observe that human evaluators generally do better at correctly labelling the human-written examples as real (true positive rate of 78.96% for H1 and 88.31% for H2), and they are confused by the machine-written examples in nearly half of the cases (true negative rate of 54.26% for H1 and 56.95% for H2). This trend is consistent with the observation in the literature Tang et al. (2016).
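The difference between the two human evaluators can be made concrete with a small sketch: H1 scores every individual vote, while H2 first reduces the five votes on a review to their majority. The annotations below are invented for illustration, and the function names are assumptions.

```python
def majority(votes):
    """Majority label among an odd number of 'real'/'fake' votes."""
    return "real" if votes.count("real") > len(votes) / 2 else "fake"

def rates(pairs):
    """pairs: list of (predicted, truth), with 'real' as the positive class."""
    tp = sum(p == t == "real" for p, t in pairs)
    tn = sum(p == t == "fake" for p, t in pairs)
    positives = sum(t == "real" for _, t in pairs)
    negatives = len(pairs) - positives
    return {"accuracy": (tp + tn) / len(pairs),
            "tpr": tp / positives,
            "tnr": tn / negatives}

# Hypothetical annotations: (five votes, ground-truth label) per review.
annotations = [
    (["real", "real", "real", "fake", "real"], "real"),
    (["real", "fake", "real", "real", "fake"], "real"),
    (["fake", "real", "real", "real", "fake"], "fake"),
    (["fake", "fake", "fake", "real", "fake"], "fake"),
]
# H1: each individual vote is scored against the ground truth.
h1 = rates([(vote, truth) for votes, truth in annotations for vote in votes])
# H2: one majority decision per review.
h2 = rates([(majority(votes), truth) for votes, truth in annotations])
print(h1, h2)
```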

We then look at how the human evaluators rank the generators, according to the accuracy of the human evaluators on all reviews generated by each generator. The lower the accuracy, the more likely the human evaluator is confused by the generated reviews, and thus the better the generator. From Table 3, we observe a substantial variance in the accuracy of the human evaluators (both H1 and H2) across generators, which suggests that human evaluators are able to distinguish between generators. The generator ranked as the best by both human evaluators is Gated Contexts to Sequences. Google LM is ranked on the lower side, which makes sense as the model is not trained to generate reviews. Interestingly, humans tend not to be fooled by reviews generated by the GAN-based models (MLE SeqGAN, SeqGAN, RankGAN and LeakGAN), even though the objective of these models is to make generated text indistinguishable from real text: human judges easily tell GAN-generated reviews apart from human-written ones.

| Generative Text Model | Human Evaluators Accuracy (H1) | Human Evaluators Accuracy (H2) |
|---|---|---|
| Word LSTM temp 1.0 | 54.87% | 59.73% |
| Word LSTM temp 0.7 | 33.91% | 28.19% |
| Word LSTM temp 0.5 | 26.71% | 17.80% |
| Scheduled Sampling | 75.27% | 87.25% |
| Google LM | 68.19% | 79.17% |
| Attention Attribute to Sequence | 32.31% | 27.21% |
| Contexts to Sequences | 38.72% | 34.23% |
| Gated Contexts to Sequences | 24.63% | 14.86% |
| MLE SeqGAN | 76.23% | 89.93% |
| SeqGAN | 74.50% | 85.03% |
| RankGAN | 77.82% | 84.25% |
| LeakGAN | 68.14% | 76.19% |

Table 3: Accuracy of human evaluators at distinguishing reviews generated by individual generators. The lower the better. GAN-based models cannot confuse humans.

4.1.2 Discriminative evaluators

We then analyze the 7 meta-adversarial evaluators and the 12 individual adversarial evaluators that are trained to distinguish human-written reviews from machine-generated reviews. Unlike the human evaluators, which are applied to the 3,600 annotated reviews, the discriminative evaluators are applied to all reviews in the D-test dataset.

Meta-adversarial Evaluators. In Table 4 we present the accuracy of both shallow and deep meta-adversarial evaluators on individual reviews and on each generator. In general, the 3 deep learning based and the one SVM based meta-discriminators achieve accuracy higher than the two human evaluators, indicating that the adversarial evaluators can better distinguish machine-generated reviews from human-written reviews than human judges. There is no significant difference between different meta-adversarial evaluators in terms of accuracy on individual reviews. We notice that meta-discriminators commonly rank GAN-based generators (especially MLE SeqGAN) as the best and the Contexts to Sequences ones as the worst. This makes sense, as the objective of a GAN is consistent with the (reversed) accuracy of the evaluator. Interestingly, Word LSTM with its temperature simply set to 1.0 achieves performance comparable to the GAN-based generators.

| Generative Text Model | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
|---|---|---|---|---|---|---|---|
| Word LSTM temp 1.0 | 48.29% | 55.22% | 45.68% | 50.31% | 53.63% | 32.77% | 48.97% |
| Word LSTM temp 0.7 | 92.58% | 93.14% | 91.02% | 78.69% | 81.05% | 79.92% | 80.49% |
| Word LSTM temp 0.5 | 99.31% | 99.35% | 99.08% | 94.74% | 94.29% | 96.86% | 94.71% |
| Scheduled Sampling | 50.09% | 48.77% | 43.37% | 51.31% | 52.88% | 20.97% | 44.12% |
| Google LM | 84.58% | 74.03% | 74.85% | 78.59% | 82.71% | 48.28% | 82.41% |
| Attention Attribute to Sequence | 90.08% | 91.78% | 89.94% | 74.37% | 77.29% | 80.02% | 71.68% |
| Contexts to Sequences | 100.00% | 100.00% | 99.97% | 100.00% | 99.98% | 100.00% | 99.98% |
| Gated Contexts to Sequences | 98.37% | 99.06% | 98.38% | 96.26% | 95.35% | 98.63% | 93.62% |
| MLE SeqGAN | 41.45% | 47.54% | 41.91% | 52.35% | 51.14% | 21.83% | 43.71% |
| SeqGAN | 50.05% | 52.91% | 47.35% | 56.20% | 54.91% | 25.60% | 48.11% |
| RankGAN | 66.28% | 67.23% | 59.37% | 70.17% | 61.94% | 35.98% | 61.23% |
| LeakGAN | 87.03% | 80.28% | 79.57% | 77.55% | 67.74% | 46.80% | 63.80% |
| D-test (all) | 77.58% | 74.72% | 75.18% | 74.50% | 70.31% | 70.74% | 73.79% |
| D-test (human-written) | 80.12% | 73.54% | 77.99% | 75.98% | 68.59% | 83.53% | 79.10% |
| D-test (machine-generated) | 75.04% | 75.90% | 72.38% | 73.01% | 72.04% | 57.95% | 68.48% |

Table 4: Accuracy of deep (LSTM, CNN, CNN & LSTM) and shallow (SVM, RF, NB, XGBoost) meta-adversarial evaluators. The lower the better. Meta-adversarial evaluators do better than humans on individual reviews, with less bias between the two classes. GAN-based generators are considered to be the best by meta-adversarial evaluators.

Individual Adversarial Evaluators. In Table 5 we present the results of the 12 individual adversarial evaluators (all based on SVM), each trained and tested on reviews generated by its corresponding generator. We observe that collectively, these generator-specific evaluators tend to rank the generators similarly to the meta-discriminators (e.g., MLE SeqGAN and Word LSTM temp 1.0 are considered the best generators), although the actual accuracy numbers may differ. This suggests that when discriminative evaluation is preferred, one can train a single meta-adversarial evaluator instead of many per-generator evaluators, which makes for fairer comparisons and is likely to be more robust to noise.

| Generative Text Model | Individual (SVM) Adversarial Evaluators |
|---|---|
| Word LSTM temp 1.0 | 65.02% |
| Word LSTM temp 0.7 | 86.29% |
| Word LSTM temp 0.5 | 95.72% |
| Scheduled Sampling | 64.65% |
| Google LM | 96.78% |
| Attention Attribute to Sequence | 91.03% |
| Contexts to Sequences | 100.00% |
| Gated Contexts to Sequences | 97.66% |
| MLE SeqGAN | 64.51% |
| SeqGAN | 69.43% |
| RankGAN | 79.46% |
| LeakGAN | 86.62% |

Table 5: Accuracy of individual adversarial evaluators (SVM based) on each generator. The lower the better. GAN-based models are generally preferred.

4.1.3 Word-Overlap Evaluators

For every machine-generated review, we compute the scores of the 4 word-overlap metrics against a reference set assembled from the top-10 most similar human-written reviews in D-test (footnote 4: we also evaluate BLEU and ROUGE against the entire D-test real set as references, and the rankings are very similar. Using the top-10 nearest neighbors as references is more reasonable, as a generated review only needs to mimic some real reviews, and it is also more efficient to compute). As these scores are continuous, we do not have the “accuracy” of the word-overlap evaluators on individual reviews or generators. However, the generators can still be ranked based on the average scores of their generated reviews. In Figure 3 we present the average scores of the 12 generators by each of the four evaluators. Different word-overlap evaluators also tend to rank the generators in similar orders, with sufficient variance among the BLEU, ROUGE, and CIDEr scores. Interestingly, the top-ranked generator according to BLEU, ROUGE, and METEOR is Contexts to Sequences, while CIDEr scores the Gated Contexts to Sequences model highest. GAN-based generators are generally ranked low.

Figure 3: Scores of individual generators by the word-overlap evaluators (BLEU, ROUGE, METEOR and CIDEr). The higher the better. The rankings are overall similar, and GAN-based generators are ranked low.

4.2 Comparing evaluators

Based on the results of individual evaluators, we move on to compare them: to what degree do they agree on the ranking of generators? Intuitively, we are more interested in how the automated evaluators compare to the human evaluators, and whether there is any suitable automatic surrogate for human judges at all. To do this, we compute the correlations between H1, H2 and each discriminative evaluator, and the correlations between H1, H2 and the word-overlap evaluators, based on their decisions on individual reviews, their scores of the generators (by Pearson’s coefficient Fieller et al. (1957)), and their rankings of the generators (by Spearman’s coefficient Spearman (1904) and Kendall’s coefficient Daniel et al. (1978)). Results are presented in Table 6.
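The three coefficients can be computed with a few lines of standard-library Python, sketched below. The input vectors are the H1 accuracies from Table 3 and the individual SVM accuracies from Table 5, in row order; the function names are assumptions, and the printed values are not claimed to reproduce any row of Table 6 exactly, since the precise inputs behind that table are not restated here.

```python
import math

def pearson(x, y):
    """Linear correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 or dy == 0:
                ties_x += 1 if dx == 0 else 0
                ties_y += 1 if dy == 0 else 0
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))

# Accuracy (%) per generator, in the row order of Table 3 (H1) and Table 5 (SVM).
h1_accuracy = [54.87, 33.91, 26.71, 75.27, 68.19, 32.31,
               38.72, 24.63, 76.23, 74.50, 77.82, 68.14]
svm_accuracy = [65.02, 86.29, 95.72, 64.65, 96.78, 91.03,
                100.00, 97.66, 64.51, 69.43, 79.46, 86.62]
for name, fn in [("Kendall tau-b", kendall_tau_b),
                 ("Spearman", spearman),
                 ("Pearson", pearson)]:
    print(f"{name}: {fn(h1_accuracy, svm_accuracy):.4f}")
```

Pearson is computed on the raw scores, whereas Spearman and Kendall depend only on the order of the generators, which is why the text distinguishes correlations of scores from correlations of rankings.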

| Evaluation Method | Kendall tau-b (H1) | Spearman (H1) | Pearson (H1) | Kendall tau-b (H2) | Spearman (H2) | Pearson (H2) |
|---|---|---|---|---|---|---|
| SVM Individual-discriminators | -0.4545* | -0.6294* | -0.6716* | -0.5455* | -0.6783* | -0.6823* |
| LSTM meta-discriminator | -0.5455* | -0.7552* | -0.7699* | -0.6364* | -0.8042* | -0.7829* |
| CNN meta-discriminator | -0.6363* | -0.8112* | -0.8616* | -0.7273* | -0.8741* | -0.8766* |
| CNN & LSTM meta-discriminator | -0.6060* | -0.7902* | -0.8392* | -0.6970* | -0.8462* | -0.8507* |
| SVM meta-discriminator | -0.4545* | -0.6573* | -0.7207* | -0.5455* | -0.6993* | -0.7405 |
| RF meta-discriminator | -0.5455* | -0.7273* | -0.7994* | -0.6364* | -0.7832* | -0.8075* |
| NB meta-discriminator | -0.6364* | -0.8112* | -0.9290* | -0.7273* | -0.8741* | -0.9388* |
| XGBoost meta-discriminator | -0.5455* | -0.7413* | -0.7764* | -0.6364* | -0.8042* | -0.7878* |
| BLEU evaluator | 0.7576* | 0.8601* | 0.8974* | 0.6666* | 0.8182* | 0.9060* |
| ROUGE evaluator | 0.6060* | 0.7692* | 0.8054* | 0.5758* | 0.7483* | 0.8073* |
| METEOR evaluator | 0.5758* | 0.7762* | 0.8225* | 0.5455* | 0.7622* | 0.8231* |
| CIDEr evaluator | 0.5455* | 0.7413* | 0.8117* | 0.4545* | 0.6643* | 0.8203* |

Table 6: Kendall tau-b, Spearman and Pearson correlation coefficients between human evaluators H1, H2 and the discriminative and word-overlap evaluators (* denotes a statistically significant result).

Not surprisingly, the two human evaluators make highly correlated decisions. Surprisingly though, none of the discriminative evaluators has a positive correlation with the human evaluators, and in fact, their rankings are negatively correlated. That is, generators that fool machine judges easily are less likely to confuse human judges, and vice versa.

Interestingly, the word-overlap evaluators tend to have a positive correlation with the human evaluators in ranking the generators. Among them, BLEU appears to be closer to the human rankings than the others. This pattern is consistent across all three types of correlation. These two observations are intriguing: they indicate that when identifying fake reviews, humans might focus more on word usage rather than trying to construct a “decision boundary” mentally.

In summary, we find that 1) human evaluators cannot distinguish machine-generated reviews from human-written reviews perfectly, with significant bias between the two classes; 2) meta-adversarial evaluators make similar decisions to per-generator adversarial evaluators, and they tend to be negatively correlated with human evaluators; and 3) word-overlap evaluators are highly correlated with human evaluators in this task. In the next section, we provide more detailed analysis and possible explanations of the findings.

5 Discussion

In this work we have designed and conducted a systematic study that evaluates the evaluators of natural language generation. Our results indicate that the decisions of discriminative evaluators for review generation do not correlate with the decisions of human evaluators, while word-overlap evaluators seem to correlate better with a simulated Turing test. These findings are consistent with some findings in the literature Sharma et al. (2017), Li et al. (2017), while contradicting others Liu et al. (2016), Bruni and Fernández (2017). We conduct some in-depth analysis and aim to discover possible explanations.

| Generative Text Model | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
|---|---|---|---|---|---|---|---|
| Word LSTM temp 1.0 | 59.33% | 57.33% | 48.67% | 54.00% | 60.67% | 40.67% | 52.00% |
| Word LSTM temp 0.7 | 96.67% | 96.00% | 94.67% | 82.00% | 81.33% | 81.33% | 86.00% |
| Word LSTM temp 0.5 | 99.33% | 99.33% | 98.67% | 94.00% | 94.00% | 96.00% | 93.33% |
| Scheduled Sampling | 50.67% | 53.33% | 52.67% | 51.33% | 52.57% | 26.00% | 46.00% |
| Google LM | 81.33% | 71.33% | 72.67% | 82.67% | 86.67% | 50.00% | 85.33% |
| Attention Attribute to Sequence | 94.00% | 93.33% | 91.33% | 76.67% | 75.33% | 83.33% | 69.33% |
| Contexts to Sequences | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Gated Contexts to Sequences | 97.33% | 99.33% | 98.67% | 96.00% | 94.67% | 99.33% | 94.00% |
| MLE SeqGAN | 40.00% | 46.00% | 40.00% | 52.00% | 52.67% | 18.00% | 46.67% |
| SeqGAN | 67.33% | 49.33% | 54.67% | 60.00% | 49.33% | 26.67% | 45.33% |
| RankGAN | 67.33% | 67.33% | 62.00% | 70.00% | 57.33% | 36.67% | 64.00% |
| LeakGAN | 86.67% | 80.00% | 82.67% | 76.00% | 68.00% | 44.00% | 62.00% |
| Test set reviews (all) | 78.57% | 74.17% | 75.92% | 75.53% | 70.78% | 71.33% | 75.22% |
| Test set reviews (human-written) | 79.94% | 72.28% | 77.11% | 76.50% | 68.83% | 84.17% | 80.11% |
| Test set reviews (machine-generated) | 77.22% | 76.06% | 74.72% | 74.56% | 72.72% | 58.50% | 70.33% |

Table 7: Accuracy of deep (LSTM, CNN, CNN & LSTM) and shallow (SVM, RF, NB, XGBoost) meta-discriminators when each meta-discriminator is trained on data from all generative models and tested on the annotated D-test set with ground-truth test labels. Lower accuracy values denote a higher degree of confusion for the meta-discriminator at correctly identifying the true class of a review item, therefore lower accuracy is better (in boldface we highlight the best generative text model).

| Generative Text Model | LSTM | CNN | CNN & LSTM | SVM | RF | NB | XGBoost |
|---|---|---|---|---|---|---|---|
| Word LSTM temp 1.0 | 57.33% | 58.00% | 53.33% | 60.00% | 53.33% | 48.00% | 54.00% |
| Word LSTM temp 0.7 | 30.67% | 31.33% | 31.33% | 34.67% | 36.67% | 38.00% | 33.33% |
| Word LSTM temp 0.5 | 18.00% | 18.00% | 18.67% | 22.00% | 23.33% | 21.33% | 21.33% |
| Scheduled Sampling | 48.00% | 50.67% | 46.00% | 51.33% | 48.67% | 28.67% | 48.67% |
| Google LM | 68.67% | 60.00% | 62.67% | 67.33% | 68.67% | 49.33% | 71.33% |
| Attention Attribute to Sequence | 31.33% | 32.00% | 31.33% | 36.67% | 36.67% | 31.33% | 33.33% |
| Contexts to Sequences | 34.00% | 34.00% | 34.00% | 34.00% | 34.00% | 34.00% | 34.00% |
| Gated Contexts to Sequences | 14.57% | 15.33% | 14.67% | 17.33% | 16.00% | 15.33% | 19.33% |
| MLE SeqGAN | 46.00% | 50.56% | 47.33% | 52.67% | 52.00% | 25.33% | 50.00% |
| SeqGAN | 59.33% | 53.33% | 60.00% | 60.00% | 53.33% | 34.67% | 52.00% |
| RankGAN | 59.33% | 60.67% | 55.33% | 70.00% | 53.33% | 38.00% | 53.33% |
| LeakGAN | 70.00% | 63.33% | 71.33% | 76.00% | 60.67% | 39.33% | 54.67% |
| Test set reviews (all) | 60.08% | 56.78% | 58.19% | 58.75% | 54.78% | 55.17% | 59.28% |
| Test set reviews (human-written) | 75.11% | 69.33% | 72.17% | 70.89% | 64.67% | 76.44% | 74.83% |
| Test set reviews (machine-generated) | 45.06% | 44.22% | 44.22% | 46.61% | 44.89% | 33.89% | 43.72% |

Table 8: Accuracy of deep (LSTM, CNN, CNN & LSTM) and shallow (SVM, RF, NB, XGBoost) meta-discriminators when each meta-discriminator is trained on data from all generative models and tested on the annotated D-test set with majority vote test labels. Lower accuracy values denote a higher degree of confusion for the meta-discriminator at correctly identifying the true class of a review item, therefore lower accuracy is better (in boldface we highlight the best generative text model).

5.1 Imperfect Ground-truth

One important thing to note about the discriminative evaluators is that all the classifiers are trained using natural labels (i.e., treating all examples from the Amazon review dataset as positive and examples generated by the candidate models as negative) instead of human labels. It is possible that if they were trained with human labels, the discriminative evaluators would have been more consistent with the human evaluators. Indeed, some reviews posted on Amazon may have been generated by bots, and if that is the case, treating them as human-written examples may bias the discriminators. In fact, according to Figure 2, only around 80% of the reviews in D-train real (Amazon product reviews) were classified as human-written by human evaluators. If we trust the annotations by the Turkers, the other 20% of the reviews may have been generated by bots.

One way to verify this is to consider an alternative “ground-truth” for the discriminative evaluators. That is, we apply the already trained meta-discriminators to the Turker-annotated set (which contains 3,600 reviews) instead of the full D-test set, and we use the majority vote of human annotations (whether a review is fake or real) as a surrogate for the “ground-truth” labels (whether a review is generated or sampled from Amazon). The results are presented in the following tables.
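The label swap described above amounts to scoring the same predictions against two different reference label sets. The following sketch illustrates this with invented predictions and votes for four machine-generated reviews; the variable and function names are assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def majority(votes):
    """Majority label among an odd number of 'real'/'fake' votes."""
    return "real" if votes.count("real") > len(votes) / 2 else "fake"

# Hypothetical meta-discriminator output on four machine-generated reviews.
predictions = ["fake", "fake", "real", "fake"]
# Natural labels: all four reviews were produced by a generator.
natural_labels = ["fake", "fake", "fake", "fake"]
# Surrogate labels: majority vote of five Turkers per review.
turker_votes = [
    ["real", "real", "real", "real", "fake"],
    ["fake", "fake", "fake", "real", "real"],
    ["real", "real", "real", "real", "real"],
    ["real", "real", "real", "fake", "fake"],
]
human_labels = [majority(votes) for votes in turker_votes]

acc_natural = accuracy(predictions, natural_labels)
acc_human = accuracy(predictions, human_labels)
print(acc_natural, acc_human)
```

The classifier and its predictions are unchanged between the two numbers; only the reference labels differ, which is what separates Table 7 from Table 8.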

First, to rule out potential selection bias in the annotated sample of D-test, in Table 7 we present the results of the meta-discriminators on each generator, tested on the annotated subset but using the natural labels as ground-truth. Note that the only difference between Table 7 and Table 4 is whether the results are calculated on the annotated subset or on the full D-test dataset. Both the accuracy numbers and the rankings of the generators are consistent, which means that using the smaller test set for evaluation does not introduce noticeable bias.

We then calculate the results of the meta-discriminators, tested on the annotated subset and using the majority votes of Turkers as surrogates for the ground-truth (Table 8). Surprisingly, both the accuracy numbers and the rankings of the generators are significantly different from Table 7 and Table 4 (which used natural ground-truth labels). We note that the numbers and rankings are more in line with the human evaluators (Table 3).

To confirm this intuition, we calculate the correlations between the meta-discriminators and the human evaluators using the annotated subset only. Surprisingly, once the natural ground-truth is replaced with human-annotated labels, the meta-discriminators become positively correlated with the human evaluators (Table 9). Without human-annotated labels, even when evaluated on the same subset, the discriminative evaluators still disagree with the human evaluators (Table 10).

Finally, as a sanity check, if we calculate the word overlap metrics based on the annotated subset of D-test, the word overlap evaluators still present positive correlations with the human evaluators (Table 11).

| Evaluation Method | Kendall tau-b (H1) | Spearman (H1) | Pearson (H1) | Kendall tau-b (H2) | Spearman (H2) | Pearson (H2) |
|---|---|---|---|---|---|---|
| LSTM meta-discriminator | 0.5344* | 0.7180* | 0.8546* | 0.4733* | 0.6760* | 0.8587* |
| CNN meta-discriminator | 0.5455* | 0.7343* | 0.9002* | 0.4545* | 0.6503* | 0.9024* |
| CNN & LSTM meta-discriminator | 0.5649* | 0.7180* | 0.8579* | 0.5344* | 0.6900* | 0.8623* |
| SVM meta-discriminator | 0.5152* | 0.6993* | 0.8860* | 0.4848* | 0.6713* | 0.8926 |
| RF meta-discriminator | 0.5315* | 0.6879* | 0.8556* | 0.4690* | 0.6526* | 0.8696* |
| NB meta-discriminator | 0.2595 | 0.3292 | 0.4143* | 0.1679 | 0.2487 | 0.4241* |
| XGBoost meta-discriminator | 0.5649* | 0.6830* | 0.8548* | 0.4733* | 0.6410* | 0.8693* |

Table 9: Kendall tau-b, Spearman and Pearson correlation coefficients between human evaluators H1, H2 and discriminative evaluators (* denotes a statistically significant result). Meta-discriminators have been trained on the D-train and D-valid sets and tested on the annotated D-test set with majority-vote test labels.

| Evaluation Method | Kendall tau-b (H1) | Spearman (H1) | Pearson (H1) | Kendall tau-b (H2) | Spearman (H2) | Pearson (H2) |
|---|---|---|---|---|---|---|
| LSTM meta-discriminator | -0.6260* | -0.8021* | -0.8058* | -0.6870* | -0.8441 | -0.8197* |
| CNN meta-discriminator | -0.6260* | -0.8091* | -0.8788* | -0.7176* | -0.8722* | -0.8939* |
| CNN & LSTM meta-discriminator | -0.5954* | -0.7671* | -0.8375* | -0.6870* | -0.8161* | -0.8517* |
| SVM meta-discriminator | -0.5152* | -0.7203* | -0.7402* | -0.6061* | -0.7762* | -0.7573* |
| RF meta-discriminator | -0.5152* | -0.7622* | -0.8016* | -0.6061* | -0.8042* | -0.8006* |
| NB meta-discriminator | -0.6970 | -0.8601 | -0.9458* | -0.7879 | -0.9021* | -0.9543* |
| XGBoost meta-discriminator | -0.4849* | -0.7133* | -0.7618* | -0.5758* | -0.7972* | -0.7728* |

Table 10: Kendall tau-b, Spearman and Pearson correlation coefficients between human evaluators H1, H2 and discriminative evaluators (* denotes a statistically significant result). Meta-discriminators have been trained on the D-train and D-valid sets and tested on the annotated D-test set with ground-truth test labels.

| Evaluation Method | Kendall tau-b (H1) | Spearman (H1) | Pearson (H1) | Kendall tau-b (H2) | Spearman (H2) | Pearson (H2) |
|---|---|---|---|---|---|---|
| BLEU evaluator | 0.7273* | 0.8531* | 0.9211* | 0.6364* | 0.8112* | 0.9280* |
| ROUGE evaluator | 0.5455* | 0.7692* | 0.8220* | 0.5758* | 0.7832* | 0.8231* |
| METEOR evaluator | 0.6061* | 0.7692* | 0.8109* | 0.5758* | 0.7483 | 0.8110* |
| CIDEr evaluator | 0.5455* | 0.7413 | 0.8060* | 0.4545* | 0.6573* | 0.8082* |

Table 11: Kendall tau-b, Spearman and Pearson correlation coefficients between human evaluators H1, H2 and word overlap-based evaluators (* denotes a statistically significant result). Word overlap-based metrics retrieve the top-10 most similar sentences from the annotated D-test real set with ground-truth test labels.

These results are intriguing. They indicate that when the “ground-truth” used by an automated Turing test is questionable, the decisions of the evaluators may be biased. Discriminative evaluators suffer the most from this bias, as they are directly trained using the imperfect ground-truth. Word overlap evaluators are more robust, as they only take the most relevant parts of the test set as references (which are more likely to be of high quality). Human evaluators are more trustworthy, as long as the inter-rater disagreements are resolved.

5.2 Role of Diversity

We also assess the role diversity plays in the rankings of the generators. To this end, we measure the lexical diversity Bache et al. (2013) of the samples produced by each generator as the ratio of unique tokens to the total number of tokens. We compute lexical diversity in turn for unigrams, bigrams and trigrams, and observe that the generators that produce the least diverse samples are easily distinguished by the meta-discriminators, while they confuse human evaluators the most. Conversely, samples produced by the most diverse generators are the hardest for the meta-discriminators to distinguish, while human evaluators classify them with higher accuracy. As reported in Kannan and Vinyals (2017), the lack of lexical richness can be a weakness of the generators, making them easily detected by a machine learning classifier. Meanwhile, a discriminator’s preference for rarer language does not necessarily mean it is favouring higher quality reviews.
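The ratio defined above is straightforward to compute; the sketch below generalizes it from tokens to n-grams. Whitespace tokenization, pooling all reviews of a generator before taking the ratio, the function name, and the example reviews are assumptions made for illustration.

```python
def lexical_diversity(reviews, n=1):
    """Unique n-grams divided by total n-grams, pooled over all reviews."""
    grams = []
    for review in reviews:
        tokens = review.split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

# Hypothetical samples from a repetitive and a varied generator.
repetitive = ["i love this book", "i love this movie", "i love this album"]
varied = ["the plot drags badly", "shipping was quick", "my cat hates it"]

for n in (1, 2, 3):
    print(n, lexical_diversity(repetitive, n), lexical_diversity(varied, n))
```

A generator that keeps reusing the same phrases has few unique n-grams relative to its output length, so its ratio is low, as for the repetitive sample here.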

In addition to lexical diversity, Self-BLEU Zhu et al. (2018) is an interesting measurement of the diversity of a set of texts: it is the average BLEU score of each document computed with the rest of the same collection as references, so the lower the score, the more diverse the set. In Table 12 we present Self-BLEU scores for each generator, applied to their generated text in D-test fake. We also compute the correlation coefficients between the rankings of generators by Self-BLEU and the rankings by the evaluators (Table 13). The results indicate that Self-BLEU is negatively correlated with human evaluators and word-overlap evaluators and positively correlated with discriminative evaluators. This result confirms the finding in the literature Kannan and Vinyals (2017) that discriminators in adversarial evaluation capture known limitations of the generative models, such as lack of diversity.
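A simplified version of Self-BLEU is sketched below. It uses an unsmoothed BLEU restricted to unigrams and bigrams so that short toy documents do not all score zero; this simplification, the whitespace tokenization, and the function names are assumptions, and the scores in Table 12 come from a standard BLEU implementation rather than from this code.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=2):
    """Unsmoothed BLEU with uniform weights over 1..max_n-grams.
    Returns 0.0 as soon as one n-gram order has no match."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / sum(cand.values())))
    # Brevity penalty against the reference closest in length to the candidate.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    penalty = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)

def self_bleu(documents, max_n=2):
    """Average BLEU of each document against all other documents in the set."""
    tokenized = [d.split() for d in documents]
    scores = [bleu(doc, tokenized[:i] + tokenized[i + 1:], max_n)
              for i, doc in enumerate(tokenized)]
    return sum(scores) / len(scores)

# Hypothetical samples for illustration only.
repetitive = ["i love this book", "i love this movie", "i love this album"]
varied = ["the plot drags badly", "shipping was quick", "my cat hates it"]
print(self_bleu(repetitive), self_bleu(varied))
```

Because each document is scored against its siblings, a generator that repeats itself obtains a high Self-BLEU, which matches the “lower is more diverse” reading used in Table 12.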

| Generative Text Model | Self-BLEU | Lexical diversity |
|---|---|---|
| Word LSTM temp 1.0 | 0.1886 | 0.6467 |
| Word LSTM temp 0.7 | 0.4804 | 0.2932 |
| Word LSTM temp 0.5 | 0.6960 | 0.1347 |
| Scheduled Sampling | 0.1233 | 0.7652 |
| Google LM | 0.1706 | 0.7745 |
| Attention Attribute to Sequence | 0.5021 | 0.2939 |
| Contexts to Sequences | 0.8950 | 0.0032 |
| Gated Contexts to Sequences | 0.7330 | 0.1129 |
| MLE SeqGAN | 0.1206 | 0.7622 |
| SeqGAN | 0.1370 | 0.7330 |
| RankGAN | 0.1195 | 0.7519 |
| LeakGAN | 0.1775 | 0.7541 |

Table 12: Self-BLEU diversity scores per generator (the lower the more diverse), and lexical diversity scores (the higher the more diverse). There is high correlation between the two metrics with respect to the rankings of the generative text models.

| Self-BLEU | Kendall tau-b | Spearman | Pearson |
|---|---|---|---|
| H1 evaluator | -0.8788* | -0.9301* | -0.8920* |
| H2 evaluator | -0.7879* | -0.8881* | -0.9001* |
| LSTM meta-discriminator | 0.6667* | 0.8252* | 0.7953* |
| CNN meta-discriminator | 0.7576* | 0.8811* | 0.8740* |
| CNN & LSTM meta-discriminator | 0.7273* | 0.8601* | 0.8622* |
| SVM meta-discriminator | 0.5758* | 0.7413* | 0.8518* |
| RF meta-discriminator | 0.6667* | 0.8112* | 0.8944* |
| NB meta-discriminator | 0.7576* | 0.8811* | 0.9569* |
| XGBoost meta-discriminator | 0.6667* | 0.8252* | 0.8693* |
| BLEU evaluator | -0.8788 | -0.9301* | -0.9880* |
| ROUGE evaluator | -0.7273* | -0.8392* | -0.9299* |
| METEOR evaluator | -0.6967* | -0.8462* | -0.8955* |
| CIDEr evaluator | -0.5455* | -0.7413* | -0.7987* |

Table 13: Kendall tau-b, Spearman and Pearson correlation coefficients between Self-BLEU diversity rankings and the three evaluation methods: human evaluators H1, H2, discriminative evaluators and word-overlap based evaluators (* denotes a statistically significant result). Meta-discriminators have been trained on the D-train and D-valid sets and tested on the annotated D-test set with ground-truth test labels.

Following this insight, an important question to answer is to what extent the generators are simply memorizing the training set G-train. To this end, we assess the degree of n-gram overlap between the generated reviews and the training reviews using the BLEU evaluator. In Table 14 we present the average BLEU scores of generated reviews using their nearest neighbors in G-train as references. We observe that generally the generators did not just memorize the training set, and GAN models generate reviews that have less overlap with G-train. In Table 15 we compute the correlation between the divergence from training and the ratings by the evaluators in the study. BLEU w.r.t. G-train presents a highly positive correlation with BLEU w.r.t. D-test real, and it is also positively correlated with the human evaluators.

| Generative Text Model | BLEU G-train |
|---|---|
| Word LSTM temp 1.0 | 0.2701 |
| Word LSTM temp 0.7 | 0.4998 |
| Word LSTM temp 0.5 | 0.6294 |
| Scheduled Sampling | 0.1707 |
| Google LM | 0.0475 |
| Attention Attribute to Sequence | 0.5122 |
| Contexts to Sequences | 0.7542 |
| Gated Contexts to Sequences | 0.6240 |
| MLE SeqGAN | 0.1707 |
| SeqGAN | 0.1751 |
| RankGAN | 0.1525 |
| LeakGAN | 0.1871 |

Table 14: BLEU results when evaluating the generated reviews using G-train as the reference corpus (a lower score indicates fewer n-grams in common between the training set G-train and the generated text). GAN models present low similarity with the training set.

| BLEU G-train | Kendall tau-b | Spearman | Pearson |
|---|---|---|---|
| H1 evaluator | 0.7176* | 0.8511* | 0.9111* |
| H2 evaluator | 0.6260* | 0.8091* | 0.9209* |
| LSTM meta-discriminator | -0.5649* | -0.7461* | -0.7091* |
| CNN meta-discriminator | -0.6565 | -0.7951* | -0.8213* |
| CNN & LSTM meta-discriminator | -0.6260* | -0.7811* | -0.7951* |
| SVM meta-discriminator | -0.4428* | -0.6130* | -0.7442* |
| RF meta-discriminator | -0.5038* | -0.6340* | -0.7864* |
| NB meta-discriminator | -0.6260* | -0.7601* | -0.9164* |
| XGBoost meta-discriminator | -0.5649* | -0.6550* | -0.7586* |
| BLEU evaluator | 0.9619* | 0.9912* | 0.9936* |
| ROUGE evaluator | 0.5954* | 0.7496* | 0.8717* |
| METEOR evaluator | 0.6260* | 0.7636* | 0.8477* |
| CIDEr evaluator | 0.6565* | 0.8371* | 0.8318* |

Table 15: Kendall tau-b, Spearman and Pearson correlation coefficients between BLEU G-train rankings and the three evaluation methods: human evaluators H1, H2, discriminative evaluators and word-overlap based evaluators (* denotes a statistically significant result). Meta-discriminators have been trained on the D-train and D-valid sets and tested on the annotated D-test set with ground-truth test labels.

The effect of diversity is perhaps not hard to explain. In the particular task of distinguishing fake reviews from real ones, all decisions are made on individual reviews. Because a human judge is not exposed to many fake reviews generated by the same generator, whether or not a fake review is sufficiently different from the other generated reviews is not a major factor in their decision. Instead, the major factor is whether the generated review looks similar to the reviews they have seen in reality. In contrast, a discriminative evaluator makes its decision after seeing many positive and negative examples, and a fake review that can fool an adversarial classifier has to be sufficiently different from all other fake reviews it has encountered (therefore the diversity of a generator is a major indicator of its ability to pass an adversarial judge).

5.3 User Study

Finally, we are interested in the reasons why human annotators label certain reviews as fake (machine-written). After they annotated a batch of reviews, we asked the workers to explain their decisions by filling in an optional free-text comment. This gives us a better understanding of what differentiates machine-generated from human-written reviews from a human’s perspective. Analyzing their comments, we identify the main reasons why human evaluators annotate a review as machine-written. These are mainly related to the presence of grammatical errors in the review text, wrong wording or inappropriate choice of expressions, redundant use of specific phrases, or contradictory arguments in the review. Interestingly, human evaluators’ innate biases are also reflected in their decisions: they are likely to categorize a review as fake if it is too formal, lacks emotion and personal pronouns, or is too vague and generic. A more detailed list of the major clusters of reasons is as follows:

  1. Grammar/ typos/ misspellings: the language does not flow well.

  2. Too general/ too generic/ vague: generated reviews are vague and lack details.

  3. Word choice (wording): lack of slang, use of the wrong words.

  4. Flow (not fluent)/ structure/ logic: sentence-level language errors.

  5. Contradictory arguments: some arguments support opposite opinions.

  6. Emotion: lack of emotion and personality in the comments.

  7. Repeated text: words/ phrases are used repetitively.

  8. Overly polished: too much like an advertisement, too formal, too likely to be real.

6 Implications

The results of our experiments have many intriguing implications for both the evaluation and the construction of natural language generators. First, we find that in the context of judging individual documents, discriminative evaluators are not as realistic as word overlap evaluators with respect to how well they correlate with a simulated Turing test (human evaluators). That implies that adversarial accuracy might not be the optimal objective for natural language generation, if the goal is to generate documents that humans consider real. Indeed, a fake review that fools humans does not necessarily need to fool a machine that has seen everything. Pushing too hard towards the boundary might swing the pendulum too far towards rare corner cases or unrealistic cases. As a result, simple LSTM models or attention models may generate surprisingly competitive results. In contrast, GAN-based models may more easily pass a Turing test on a bot level (when judgments are made on a system as a whole instead of on individual items), or in a conversational context. That is, when the judges have seen enough examples from the same generator, the next example had better be somewhat different.

Our results also suggest that when adversarial training is used, the selection of training examples must be done with caution. In other words, if the “ground-truth” is hijacked by low quality or “fake” examples, models trained with GANs may be significantly biased. This finding is related to the recent literature on the robustness and security of machine learning models.

We also find that when humans distinguish fake reviews from real ones, they tend to focus more on word usage. The use of particular words, expressions, emotions, and other details may be more convincing than something that is just “generally” correct. These findings may affect the design of objectives for the next generation of natural language generation models.

We believe that our findings represent a preliminary foundation for proposing more solid and robust metrics for the evaluation of NLG output. In future work we plan to carry out additional experiments that include a wider range of generative models and meta-discriminator architectures, and, inspired by the current results, we aim to suggest more robust evaluation metrics for assessing the quality of NLG.


We thank Wei Ai for his help on the power analysis. We also thank Yue Wang and Teng Ye for helpful discussions. This work is in part supported by the National Science Foundation under grant numbers 1633370 and 1620319 and in part supported by the National Library of Medicine under grant number 2R01LM010681-05.