Unifying Human and Statistical Evaluation for Natural Language Generation

by Tatsunori B. Hashimoto, et al.
Stanford University

How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.



1 Introduction

Generating text is a core part of many NLP tasks such as image captioning (Lin et al., 2014), open-domain dialogue (Sordoni et al., 2015), story generation (Roemmele, 2016), and summarization (Nallapati et al., 2016). However, proper evaluation of natural language generation has proven difficult (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018). A good evaluation metric should capture not only the quality of generation but also its diversity, which is especially crucial for creative, open-ended tasks like dialogue or story generation.

Human evaluation, which is often viewed as the gold standard evaluation, captures quality but fails to capture diversity. As an example, for language modeling, a model that directly plagiarizes sentences from the training set would pass the human quality bar but would have zero generalization ability and thus have inadequate diversity. On the other hand, statistical evaluation (i.e., perplexity on a reference test set) captures diversity, as it ensures a model must assign reasonable probability to novel sentences, but perplexity provides an inadequate measure of quality (Theis et al., 2015). For example, modifying a perfect model by removing its ability to generate even a single test sentence results in infinite perplexity even though the model is still near-perfect. Automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin and Rey, 2004) capture quality better than perplexity but still correlate poorly with human evaluation and fail to capture diversity (Novikova et al., 2017; Chaganty et al., 2018).

Figure 1: HUSE is twice the classification error of distinguishing reference and generated text based on human judgment scores and model probabilities. HUSE identifies samples with defects in quality (Sharon has stroke ) and diversity (Cleared coach facing ).

Existing approaches to combining statistical and human evaluation have been ad-hoc, leading to misleading performance measures. A common approach is to measure diversity through the perplexity of a probabilistic model and quality through human evaluation on beam-searched outputs. This gives the illusion that a single model is high-quality and diverse, while the reality is that it shows we can have either a diverse model (when sampling from the distribution used to compute perplexity) or a high-quality model (when beam-searching).

In this paper, we define the idealized evaluation metric as twice the error of the optimal discriminator for classifying sentences as coming from the reference distribution or the model (Section 2). If a model generates gibberish (low quality), the optimal discriminator can classify these accurately as coming from the model. If the reference distribution contains sentences the model cannot generate (low diversity), the optimal discriminator can classify these accurately as coming from the reference.

Unfortunately, the optimal discriminator is unavailable. Human discriminators cannot capture diversity effectively, and learned discriminators—e.g., from a Generative Adversarial Network (Goodfellow et al., 2014) or one trained on human judgments (Lowe et al., 2017)—are too unreliable to use for rigorous evaluation.

Our key result (Section 3) is based on the observation that the optimal classifier depends only on two numbers: the probability of a sentence under the model and the probability under the reference distribution. The former can be computed directly from the model, and we show that the latter can be well-approximated by human judgment scores. The resulting two-dimensional space is illustrated in Figure 1. We apply a simple $k$-nearest neighbor classifier in this space and define Human Unified with Statistical Evaluation (HUSE) as twice the leave-one-out error of this classifier.

We apply HUSE to four natural language generation tasks (Section 5): language modeling, chitchat dialogue, story generation, and summarization. First, we show that human evaluation alone is insufficient to discriminate model generations from the references, leading to inflated estimates of model performance. In contrast, HUSE is able to reveal deficiencies of current models. We also show that common techniques for improving sample quality such as annealing actually increase distinguishability between the model and reference due to losses in diversity.

2 Optimal Discriminator

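Consider a natural language generation task where the model is given a context $x$ (e.g., a dialogue history) drawn from some prior $p(x)$ and must output a distribution over possible sentences $p_{\text{model}}(y \mid x)$. We define an idealized evaluation metric based on whether $p_{\text{model}}$ is close to a reference distribution $p_{\text{ref}}$, which is generally human-generated.

(Footnote 1: While some tasks only care about quality and thus only require $p_{\text{model}}$ to place mass on some high quality $y$, we demand that $p_{\text{model}}$ places mass on all high quality $y$ as given by $p_{\text{ref}}$. This diversity is important for open-ended tasks such as dialogue or story generation. Also note that $p_{\text{ref}}$ need not be the human distribution, or match the training distribution. It can be defined as the distribution given by experts.)

Specifically, consider a random variable $y$ drawn from either the reference or the model based on an indicator $z \sim \text{Bernoulli}(1/2)$:

$$
y \mid x, z \sim
\begin{cases}
p_{\text{ref}}(y \mid x) & \text{if } z = 1 \\
p_{\text{model}}(y \mid x) & \text{if } z = 0.
\end{cases}
\tag{1}
$$

Define $L^*$ to be twice the lowest possible error over any discriminator $f$ that attempts to determine $z$ based on $x$ and $y$:

$$
L^* := 2 \inf_{f} \mathbb{P}\left[f(x, y) \neq z\right].
\tag{2}
$$

$L^*$ measures similarity between $p_{\text{model}}$ and $p_{\text{ref}}$; it is 0 if $p_{\text{model}}$ and $p_{\text{ref}}$ are disjoint and 1 if they are identical.

(Footnote 2: Note that $L^*$ is a linear function of the total variational divergence: $\|p_{\text{model}} - p_{\text{ref}}\|_{\text{TV}} := \frac{1}{2} \sum_{x, y} p(x) \left| p_{\text{model}}(y \mid x) - p_{\text{ref}}(y \mid x) \right| = 1 - L^*$. See Appendix A.1 for details.)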
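As a small numerical illustration of this idealized score, the sketch below evaluates it for a single fixed context with a four-sentence output space. The helper names `optimal_score` and `total_variation` and the toy probabilities are assumptions made for illustration; the total variation distance is taken with the usual factor of one half so that it lies in [0, 1].

```python
def optimal_score(p_ref, p_model):
    """Twice the Bayes error of telling reference from model samples.

    With a fair coin deciding the source, the best discriminator errs with
    probability 0.5 * sum_y min(p_ref(y), p_model(y)), so twice that error
    is the summed pointwise minimum.
    """
    return sum(min(a, b) for a, b in zip(p_ref, p_model))


def total_variation(p_ref, p_model):
    """Total variation distance with the 1/2 normalisation (an assumption)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_ref, p_model))


# Toy distributions over four sentences (hypothetical values).
p_ref = [0.5, 0.3, 0.2, 0.0]
p_model = [0.5, 0.0, 0.2, 0.3]

score = optimal_score(p_ref, p_model)
tv = total_variation(p_ref, p_model)
print(score, 1.0 - tv)  # the two quantities agree up to rounding
```

Identical distributions give a score of 1 and disjoint ones give 0, matching the two extremes described above.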


Unfortunately, the optimal score $L^*$ is unattainable because it requires computing the optimal discriminator. In the spirit of the Turing Test, we could consider using the error rate of a human discriminator instead, often considered the gold standard for evaluation. However, while humans might have knowledge of the reference distribution $p_{\text{ref}}$, they do not have full knowledge of the model distribution $p_{\text{model}}$ and thus would have difficulties determining which sentences a model cannot generate.

As a concrete example, suppose $p_{\text{ref}}$ placed a uniform distribution over some set $S$. Without knowledge of $p_{\text{model}}$, the most sensible discriminator is to predict $z = 1$ (reference) when $y \in S$. This discriminator achieves the same classification error of $0.5$ for both the perfect model $p_{\text{model}} = p_{\text{ref}}$ and one which can only return a single $y \in S$. We could try to reveal $p_{\text{model}}$ to humans by showing multiple samples simultaneously, but this is expensive and, as we will later see, unnecessary.
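The example can be checked numerically. The sketch below assumes a hypothetical four-sentence set and compares a discriminator that only checks membership in the set (a stand-in for a human judge who cannot see the model distribution) against the Bayes-optimal error; all names and values are illustrative.

```python
def error_rate(discriminator, sentences, p_ref, p_model):
    """Error of a discriminator when the source is a fair coin flip.

    Label 1 means reference, label 0 means model.
    """
    err = 0.0
    for y in sentences:
        pred = discriminator(y)
        err += 0.5 * p_ref.get(y, 0.0) * (pred != 1)    # reference called model
        err += 0.5 * p_model.get(y, 0.0) * (pred != 0)  # model called reference
    return err


sentences = ["a", "b", "c", "d"]               # the set S (hypothetical)
p_ref = {y: 0.25 for y in sentences}           # uniform reference distribution
perfect_model = dict(p_ref)                    # matches the reference exactly
collapsed_model = {"a": 1.0}                   # can only produce one sentence


def membership_judge(y):
    """Every sentence in S looks fine, so always answer 'reference'."""
    return 1


err_perfect = error_rate(membership_judge, sentences, p_ref, perfect_model)
err_collapsed = error_rate(membership_judge, sentences, p_ref, collapsed_model)

# Bayes-optimal error for the collapsed model: 0.5 * sum_y min(p_ref, p_model).
optimal_err_collapsed = 0.5 * sum(
    min(p_ref[y], collapsed_model.get(y, 0.0)) for y in sentences
)
print(err_perfect, err_collapsed, optimal_err_collapsed)
```

The membership-based judge cannot separate the two models, while a discriminator that knows the model distribution finds the collapsed model far easier to detect.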

Another option is to learn the discriminator $f$ over an expressive class of functions such as neural networks on data sampled from $p_{\text{model}}$ and $p_{\text{ref}}$. This is analogous to learning the discriminator in a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) or learning an evaluation metric from human judgments (Lowe et al., 2017). However, as $(x, y)$ are high-dimensional objects, training a good classifier is extremely difficult (and perhaps not significantly easier than solving the original generation problem). Indeed, learned evaluation metrics do not generalize very well (Lowe et al., 2017; Chaganty et al., 2018). Unlike these approaches, which seek to replace human evaluation, our focus will instead be on combining human and automatic statistical evaluation to estimate the optimal classifier error.

3 Human Unified with Statistical Evaluation (HUSE)

Our key result is that the optimal discriminator depends on $(x, y)$ only through a two-dimensional sufficient statistic (Section 3.1), motivating an approximation which we call HUSE (Section 3.2).

For any feature map $\phi$ that maps $(x, y)$ to $\phi(x, y) \in \mathbb{R}^d$, define the evaluation score $L(\phi)$ to be twice the error rate of the optimal discriminator that depends on $(x, y)$ only through $\phi$:

$$
L(\phi) := 2 \inf_{f} \mathbb{P}\left[f(\phi(x, y)) \neq z\right].
\tag{3}
$$

Note that the evaluation score $L(\phi)$ given by a feature map $\phi$ optimizes over all functions that depend on $\phi$ (3). Thus, the more information $\phi$ contains, the lower $L(\phi)$ is. This has two implications: First, any feature map $\phi$ yields an (optimistic) upper bound on $L^*$ (2), meaning that $L(\phi)$ might be able to detect when a model is poor but cannot certify that it is good. Second, adding features to $\phi$ can only improve this bound.

3.1 Two features suffice

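Let us consider the following two-dimensional feature map:

$$
\phi_{\text{opt}}(x, y) := \left[\, p_{\text{ref}}(y \mid x),\; p_{\text{model}}(y \mid x) \,\right].
\tag{4}
$$

From the arguments above, it is clear that $L(\phi_{\text{opt}}) \geq L^*$, but perhaps more surprisingly, we actually have equality: The two-dimensional feature map $\phi_{\text{opt}}$ achieves the optimal discriminator score: $L(\phi_{\text{opt}}) = L^*$.

Proof. We compute the true posterior over $z$ given $x, y$. Since $p(z = 1) = p(z = 0) = \frac{1}{2}$, $p(y \mid x, z = 1) = p_{\text{ref}}(y \mid x)$, and $p(y \mid x, z = 0) = p_{\text{model}}(y \mid x)$, by Bayes’ rule:

$$
p(z = 1 \mid x, y) = \frac{p_{\text{ref}}(y \mid x)}{p_{\text{ref}}(y \mid x) + p_{\text{model}}(y \mid x)}.
$$

The optimal discriminator simply predicts $z = 1$ if $p_{\text{ref}}(y \mid x) > p_{\text{model}}(y \mid x)$ and $z = 0$ otherwise. In other words, the decision boundary is given by $\phi_{\text{opt}}(x, y)_1 > \phi_{\text{opt}}(x, y)_2$. ∎

More generally, we can obtain this equality with a wider class of $\phi$. It will hold exactly for any invertible transformation of $\phi_{\text{opt}}$ (Appendix Corollary A.2), and approximately for any $\phi$ which has high mutual information with $\phi_{\text{opt}}$ (Appendix Theorem 1). This means that we can substitute $p_{\text{ref}}$ with noisy, possibly un-normalized estimates and still obtain accurate estimates of $L^*$.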
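The argument can be sanity-checked on a toy problem: a discriminator that sees only the pair (reference probability, model probability) should reach the Bayes-optimal error. The following is a minimal sketch with made-up probabilities for one fixed context; `posterior_reference` and `discriminate` are illustrative names, not functions from the paper's code.

```python
def posterior_reference(p_ref_y, p_model_y):
    """Posterior probability that a sentence came from the reference,
    under a uniform prior on the source (Bayes' rule)."""
    return p_ref_y / (p_ref_y + p_model_y)


def discriminate(features):
    """Predict the source from the two-dimensional feature vector alone."""
    p_ref_y, p_model_y = features
    return 1 if p_ref_y > p_model_y else 0


# Hypothetical distributions over four sentences.
p_ref = [0.5, 0.3, 0.2, 0.0]
p_model = [0.5, 0.0, 0.2, 0.3]

error = 0.0
for a, b in zip(p_ref, p_model):
    pred = discriminate((a, b))
    # Reference samples predicted as model, plus model samples predicted as reference.
    error += 0.5 * a * (pred == 0) + 0.5 * b * (pred == 1)

two_feature_score = 2.0 * error
bayes_score = sum(min(a, b) for a, b in zip(p_ref, p_model))
print(two_feature_score, bayes_score)
```

Because the rule only compares the two coordinates, replacing both with their logarithms (an invertible, order-preserving transformation) leaves every prediction unchanged.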

3.2 HUSE features

While we can directly compute $p_{\text{model}}(y \mid x)$ for many probabilistic models, $p_{\text{ref}}(y \mid x)$ is unattainable, so $L(\phi_{\text{opt}})$ is not computable. However, the wisdom of the crowds (Surowiecki, 2004; Ungar et al., 2012) suggests that pooling together the judgments of many humans can often produce surprisingly reliable estimates of real-world probabilities such as $p_{\text{ref}}(y \mid x)$, even if no individual human is particularly reliable. With this motivation, we ask Amazon Mechanical Turk workers to rate a sentence from 1–5 based on how “typical” it is as a way to estimate $p_{\text{ref}}(y \mid x)$ (see Appendix A.3 for more details). We define $\text{HJ}(x, y)$ to be the average response over 20 crowdworkers. Figure 2 shows that for a language modeling task on the Reddit corpus, $\text{HJ}(x, y)$ strongly correlates with the actual log-frequency of $y$ in the corpus. The high correlation suggests that human judgments $\text{HJ}(x, y)$ are a good surrogate for $\log p_{\text{ref}}$.

(Footnote 3: We used the Reddit corpus due to crowdworker familiarity, corpus size, and short average sentence length, which results in a wide range of sentence frequencies.)

Figure 2: On the Reddit corpus, human judgment (HJ) of the “typicality” of a sentence correlates strongly with its frequency in the corpus, suggesting that HJ is a good surrogate for $\log p_{\text{ref}}$. Error bars at the 90% confidence interval.

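In addition, we found that rather than using the model probability directly as a feature, normalizing by sentence length $\text{len}(y)$ yielded lower (tighter) scores. We therefore define the HUSE features as follows:

$$
\phi_{\text{huse}}(x, y) := \left[\, \frac{\log p_{\text{model}}(y \mid x)}{\text{len}(y)},\; \text{HJ}(x, y) \,\right]
\tag{5}
$$

and define the (population) HUSE score as $L(\phi_{\text{huse}})$.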
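A minimal sketch of how the two features might be assembled for one sentence is given below. It assumes the model exposes per-token log-probabilities, that sentence length is measured in tokens, and that crowdworker ratings arrive as a plain list; `huse_features` and its inputs are hypothetical names.

```python
from statistics import mean


def huse_features(token_log_probs, ratings):
    """Return (length-normalised model log-probability, mean human judgment).

    token_log_probs: log-probability the model assigns to each token of the
        sentence given the context (assumed available from the model).
    ratings: typicality ratings from the crowdworkers for this sentence.
    """
    sentence_log_prob = sum(token_log_probs)
    return (sentence_log_prob / len(token_log_probs), mean(ratings))


# Hypothetical three-token sentence rated by three workers.
features = huse_features([-1.0, -2.0, -3.0], [4, 5, 3])
print(features)
```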

3.3 Guarantees derived from HUSE

We now show that the HUSE score satisfies two nice properties: (i) HUSE does at least as well as human evaluation and (ii) a low HUSE score is sufficient to show that a model is far from the reference distribution.

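To show (i), consider a feature map that only includes human evaluation: $\phi_{\text{hj}}(x, y) := [\text{HJ}(x, y)]$. Because $\phi_{\text{huse}}$ also incorporates human evaluation, $L(\phi_{\text{huse}})$ is always tighter (lower) than the human discriminator error $L(\phi_{\text{hj}})$:

Proposition 1 (Relationship between HUSE, human evaluation, and optimal scores).

$$
L(\phi_{\text{hj}}) \geq L(\phi_{\text{huse}}) \geq L^*.
\tag{6}
$$

Furthermore, the main difference between $L(\phi_{\text{huse}})$ and $L^*$ is that the former uses $\text{HJ}(x, y)$ and the latter uses $p_{\text{ref}}$. But as we argued using Figure 2, $\text{HJ}(x, y)$ is strongly correlated with $p_{\text{ref}}$, and good approximations to $p_{\text{ref}}$ provide approximation guarantees for $L(\phi_{\text{huse}})$ (Appendix Theorem 1).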

4 Evaluating models with HUSE

In this section, we show how we can estimate the error rate from finite data (Section 4.1). We then show how the HUSE estimate can be decomposed into a score that measures quality (HUSE-Q) and a score that measures diversity (HUSE-D), which allows us to study quality-diversity tradeoffs (Section 4.2).

4.1 Learning a discriminator

For any feature map $\phi$, we show how to produce an estimate of $L(\phi)$. Fix $n$ contexts $x_1, \dots, x_n$. First, we draw $n$ examples $y_1, \dots, y_n$ from the reference distribution $p_{\text{ref}}(y \mid x)$, which are usually human-generated sentences from a test set. We also draw $n$ examples $y'_1, \dots, y'_n$ from the model $p_{\text{model}}(y \mid x)$ we wish to evaluate. Next, for each of the $2n$ examples $(x, y)$, we compute the feature map $\phi(x, y)$, which might involve evaluating the model probability $p_{\text{model}}(y \mid x)$ as well as collecting human judgments $\text{HJ}(x, y)$ from crowdworkers.

Finally, we compute the leave-one-out error of a classifier that tries to predict whether a given example $(x, y)$ comes from the reference distribution ($z = 1$) or the model ($z = 0$).

The classification problems for HUSE are two-dimensional, which allows us to accurately estimate error rates using a $k$-nearest neighbors classifier. We opt to use nearest neighbors classifiers as they are simple, require no training, and can asymptotically capture arbitrary continuous decision boundaries. Specifically, we fix the number of neighbors $k$ and define neighbors using distances over the feature vectors $\phi(x, y)$ scaled componentwise to have unit variance. The overall procedure for computing the estimate $\hat{L}(\phi)$ is formally defined in Algorithm 1.

Algorithm 1: Estimating error rates under $\phi$
Require: Feature map $\phi$, number of neighbors $k$; contexts $x_1, \dots, x_n$; reference outputs $y_1, \dots, y_n$; model outputs $y'_1, \dots, y'_n$
1: Construct dataset: $\mathcal{D} = \bigcup_{i=1}^{n} \{ (\phi(x_i, y_i), 1),\; (\phi(x_i, y'_i), 0) \}$
2: $\hat{L}(\phi) :=$ twice the leave-one-out error of $k$-NN on $\mathcal{D}$
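The procedure can be sketched in a few lines of standard-library Python. This is an illustrative reading of the description above, not the authors' released code: Euclidean distance, majority voting with ties resolved toward the model class, and the toy feature values are all assumptions, and `estimate_score` is a hypothetical name.

```python
import math


def standardize(points):
    """Scale each feature dimension to unit variance."""
    n, dims = len(points), len(points[0])
    scales = []
    for d in range(dims):
        column = [p[d] for p in points]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        scales.append(math.sqrt(var) or 1.0)  # guard against constant features
    return [[p[d] / scales[d] for d in range(dims)] for p in points]


def estimate_score(ref_feats, model_feats, k):
    """Twice the leave-one-out error of a k-NN classifier on the pooled data.

    Label 1 marks reference examples and label 0 marks model examples.
    """
    points = standardize(list(ref_feats) + list(model_feats))
    labels = [1] * len(ref_feats) + [0] * len(model_feats)
    errors = 0
    for i, (p, z) in enumerate(zip(points, labels)):
        # Rank every other point by (assumed Euclidean) distance to point i.
        neighbours = sorted(
            (math.dist(p, q), labels[j]) for j, q in enumerate(points) if j != i
        )
        votes = sum(label for _, label in neighbours[:k])
        pred = 1 if 2 * votes > k else 0
        errors += pred != z
    return 2.0 * errors / len(points)


# Toy features: (length-normalised log-probability, mean human judgment).
ref = [(0.0, 5.0), (0.1, 4.9), (0.2, 5.1), (0.1, 5.0)]
model = [(3.0, 1.0), (3.1, 1.1), (2.9, 0.9), (3.0, 1.2)]
separable_score = estimate_score(ref, model, k=3)
print(separable_score)
```

On these perfectly separated toy clusters the estimate is 0, i.e. the model is completely distinguishable from the reference; overlapping clusters push the estimate toward 1.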

4.2 Quality-diversity decomposition

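We now define the (empirical) HUSE score using the feature map $\phi_{\text{huse}}$:

$$
\text{HUSE} := \hat{L}(\phi_{\text{huse}}).
\tag{7}
$$

We define the quality component of HUSE (HUSE-Q) similarly using human judgments alone:

$$
\text{HUSE-Q} := \hat{L}(\phi_{\text{hj}}).
\tag{8}
$$

Since humans can detect quality defects in a model, any increase in error from removing $p_{\text{model}}$ must come from a model’s lack of diversity. Therefore, we define the diversity component (HUSE-D) as follows:

$$
\text{HUSE-D} := 1 + \text{HUSE} - \text{HUSE-Q},
\tag{9}
$$

which implies the decomposition $(1 - \text{HUSE-D}) + (1 - \text{HUSE-Q}) = 1 - \text{HUSE}$. As long as the discriminators are non-degenerate (obtaining better performance than chance and $\text{HUSE} \leq \text{HUSE-Q}$), all scores are contained in $[0, 1]$. Here, $\text{HUSE-D} = 1$ implies that the model suffers no diversity defects, while $\text{HUSE-D} = 0$ indicates that the examples could be discriminated perfectly due to a lack of diversity.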
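The decomposition is simple arithmetic, which the sketch below checks against the first column of Table 1 (HUSE 0.53, HUSE-Q 0.58, HUSE-D 0.95). The function name `huse_d` is an illustrative choice.

```python
def huse_d(huse, huse_q):
    """Diversity component: one plus the drop from HUSE-Q to HUSE."""
    return 1.0 + huse - huse_q


huse, huse_q = 0.53, 0.58          # first column of Table 1
diversity = huse_d(huse, huse_q)   # approximately 0.95

# The quality and diversity shortfalls add up to the overall shortfall.
lhs = (1.0 - diversity) + (1.0 - huse_q)
rhs = 1.0 - huse
print(round(diversity, 2), round(lhs, 2), round(rhs, 2))
```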

| Score | Summarization (sampled) | Summarization (annealed) | Story generation (sampled) | Story generation (retrieval) | Chit-chat dialogue (sampled) | Chit-chat dialogue (annealed) | LM (sampled) |
|---|---|---|---|---|---|---|---|
| HUSE | 0.53 | 0.26 | 0.06 | 0.00 | 0.56 | 0.49 | 0.86 |
| HUSE-Q | 0.58 | 0.92 | 0.15 | 0.47 | 0.56 | 0.92 | 0.88 |
| HUSE-D | 0.95 | 0.34 | 0.91 | 0.53 | 1.00 | 0.57 | 1.02 |

Table 1: Performance achieved by the best models on the four tasks, as measured by overall goodness-of-fit (HUSE), sample quality (HUSE-Q) and diversity (HUSE-D). The scale for HUSE and HUSE-Q ranges from 0.0 (completely distinguishable from reference) to 1.0 (indistinguishable from reference) where the implied classification error is HUSE/2. HUSE-D may exceed 1.0 with small sample sizes when HUSE-Q < HUSE.

5 Experiments

5.1 Experimental setup

We use HUSE to evaluate three different types of single-sentence natural language generation tasks: (i) unconditional and high entropy (language modeling); (ii) conditional and high entropy (story generation, chit-chat dialogue); and (iii) conditional and low entropy (summarization). We show that HUSE provides a direct and interpretable measure of diversity on high-entropy tasks, while also serving as a useful model diagnostic on low-entropy ones.

The four tasks along with the datasets and models are as follows:

  • Summarization: Giganews story to headline dataset and the pre-trained model from Gehrmann et al. (2018). The dataset consists of 3.8 million news story-headline pairs. Examples from this dataset are shown in Table 2.

  • Story generation: Last sentence generation for ROC stories (Mostafazadeh et al., 2016), consisting of 96,198 examples of partially written four-sentence stories as input and a single sentence which completes the story as the target. We use a standard OpenNMT model with global attention (Klein et al., 2017).

  • Language modeling: One billion word benchmark pre-trained language model from Jozefowicz et al. (2016). The task consists of generating a single sentence from the one billion word newswire text distribution.

  • Chit-chat dialogue: Two-turn chit-chat dialogue dataset consisting of 37.3 million comment-response pairs from Reddit (Appendix A.4). Comments are generally short (5–15 tokens) and cover a single topic (e.g., given “wow how did i not notice that”, the response is “you were focusing on other things its understandable”). We train a convolutional model using fairseq (Gehring et al., 2017).

For all the tasks, we train neural models and evaluate their diversity-quality tradeoffs as we change the decoding scheme for generation. Our primary evaluation concerns diversity trade-offs involving temperature annealing, which is a generation technique applicable to any probabilistic model that generates words sequentially. In temperature annealed models, we sample a word $w$ proportional to $p(w)^{1/t}$, where $p$ is the model probability of $w$ given previous words and $t$ is the temperature parameter. We excluded beam search since it behaves qualitatively similarly to temperature annealing with low temperatures and is extremely under-diverse.
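Temperature annealing of a single next-word distribution can be sketched as follows. The probabilities and vocabulary are made up, and `anneal` and `sample_word` are illustrative helper names rather than part of any toolkit used in the paper.

```python
import random


def anneal(probs, t):
    """Raise each probability to the power 1/t and renormalise."""
    weights = [p ** (1.0 / t) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_word(vocab, probs, t, rng=random):
    """Draw one word from the temperature-annealed distribution."""
    return rng.choices(vocab, weights=anneal(probs, t), k=1)[0]


vocab = ["open", "final", "injury"]     # hypothetical next-word candidates
probs = [0.6, 0.3, 0.1]                 # hypothetical model probabilities

unchanged = anneal(probs, 1.0)          # t = 1 leaves the distribution as is
sharpened = anneal(probs, 0.5)          # t < 1 concentrates mass on likely words
word = sample_word(vocab, probs, t=0.5, rng=random.Random(0))
print(unchanged, sharpened, word)
```

Lowering the temperature moves probability mass toward the most likely word, which is why annealing tends to raise per-sample quality while reducing the range of outputs.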

As a non-neural baseline, we also consider retrieval-based models built on Apache Solr for a few tasks. For this approach, we retrieve the single most relevant response from the training set using the BM25 similarity metric on inputs. Such models are known to perform well in tasks with complex outputs such as program generation (Hayati et al., 2018; Hashimoto et al., 2018) and style transfer (Li et al., 2018).
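To make the retrieval baseline concrete, here is a small self-contained BM25 sketch. It is not the Solr pipeline used in the experiments: whitespace tokenisation, the parameter defaults `k1=1.2` and `b=0.75`, the particular IDF variant, and the toy training pairs are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_retrieve(query, train_inputs, train_outputs, k1=1.2, b=0.75):
    """Return the training output whose input best matches the query under BM25."""
    docs = [text.lower().split() for text in train_inputs]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(
                1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1.0) / norm
        return total

    best = max(range(n_docs), key=lambda i: score(docs[i]))
    return train_outputs[best]


# Hypothetical (input, output) training pairs.
inputs = [
    "the cat sat on the mat",
    "stocks fell sharply on monday",
    "rain expected this weekend",
]
outputs = ["cat sits", "markets drop", "wet weekend ahead"]
retrieved = bm25_retrieve("stocks fell again", inputs, outputs)
print(retrieved)
```

Because the returned sentence is copied verbatim from the training set, such a model can look fluent to human raters while offering a single output per context.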

For cost reasons, we did not measure certain combinations of task and generation mechanisms. We did not measure retrieval for chit-chat dialogue, as we observed its outputs were lower quality than a low-temperature neural model. We also did not anneal language models, as the generation quality from the language model was already high, and our goal was to show that they achieved high HUSE. Our set of measurements, while not comprehensive, generally covers the available quality-diversity tradeoffs for conditional tasks.

Figure 3: Tradeoffs between HUSE-D and HUSE-Q. Points are models and color indicates task. Neural models (circle) generate using temperature annealing (point labels indicate temperature). Models closer to the top right are superior, and gray diagonal lines indicate equivalent HUSE. A shaded region for a task indicates models which are strictly dominated (worse HUSE with the same HUSE-D-HUSE-Q proportion). Annealing can trade-off between diversity and quality but cannot easily increase the underlying model performance (HUSE).

Finally, we collect human judgments as per Section 4.1 where we query 20 Amazon Mechanical Turk crowdworkers for typicality ratings on 100 reference and 100 model sentences. Since our models generate UNK (unknown and out-of-vocabulary) tokens, we instructed crowdworkers to treat UNK tokens as rare, but appropriate words for the context.

5.2 Overall results

The HUSE scores across the four tasks vary widely. Table 1 shows that single-sentence language models are nearly indistinguishable, with HUSE = 0.86 and an implied discriminator error of 0.43.

In contrast, both summarization and dialogue are highly distinguishable (HUSE of 0.53 and 0.56, respectively) with relatively low quality when sampled directly from the model. Human evaluation alone (HUSE-Q) would suggest that using temperature annealing to emphasize high-probability outputs substantially improves the model (HUSE-Q goes from 0.58 to 0.92 for summarization and 0.56 to 0.92 for dialogue). However, we find that this increase in sample quality comes at the cost of diversity (HUSE-D goes from 0.95 to 0.34 for summarization and 1.00 to 0.57 for dialogue). Examining the achievable HUSE and diversity tradeoffs in Figure 3 shows that mechanisms such as annealing which improve sample quality actually degrade HUSE due to severe losses in diversity.

We find that all generation schemes and models are inadequate for story generation on ROC stories. The original model is very easily distinguishable by a human (HUSE-Q = 0.15), corresponding to a discriminator error of 0.075. The retrieval models can improve this to HUSE-Q = 0.47, but this comes at the expense of diversity.

Finally, we observe that directly sampling from the model is always diverse. This suggests that human evaluation is an appropriate evaluation for generation systems that are directly sampled (rather than beam-searched).

Figure 4: The two-dimensional classification problem in Algorithm 1 on the summarization task with different softmax temperatures (three panels). Each point represents a reference sentence $\phi_{\text{huse}}(x_i, y_i)$ or model-generated sentence $\phi_{\text{huse}}(x_i, y'_i)$. The color denotes the source of the sentence ($z$); shading is the classification confidence of the nearest neighbor classifier.

5.3 Model error analysis with HUSE

Since HUSE is estimated from a two-dimensional classification problem, we can directly visualize the classification problem to understand defects in both model quality and diversity.

Figure 4 shows both reference points (blue squares) and model points (red circles) for the summarization task. The shaded areas indicate the decision boundary of the $k$-nearest neighbor classifier.

At the highest temperature, we find that the classification boundary is mostly horizontal, implying that human judgment alone can distinguish model outputs from references. There is a cluster of sentences with high HJ and high $p_{\text{model}}$ which are essentially indistinguishable. Examining the samples in this top-right region reveals that these are news stories with short headlines such as “Nadal pulls out of Sydney International” which can be reliably generated even at this temperature. However, the model frequently generates low quality samples that can easily be distinguished such as “two new vaccines in the poor countries were effective against go-it-alone study says” (Table 2).

At the two lower temperatures, the boundary shifts towards becoming diagonal. Although the distribution is no longer directly separable on human judgment, the two distributions are clearly separable with the inclusion of $p_{\text{model}}$.

Using Figure 4, we can identify individual examples which were correctly and incorrectly classified based on $p_{\text{model}}$ and HJ. Table 2 shows examples of both quality failures and diversity failures identified by HUSE. For example, the “diversity failure” table shows that the summarization model has an extremely low probability of generating some reference sentences (“NFL’s bills shake up front office”) and is thus under-diverse. Closer examination of the model shows that the probability of generating “front office” is low, since it is an unusual way to refer to the president and general manager. Improving these models on the diversity failures will require that the model understand more subtle paraphrases. We can also identify model successes, where the model outputs are indistinguishable from the reference in terms of quality (“Agassi bows out of Australian Open after injury”), and the model assigns high probability to the reference (“Agassi withdraws from Australian Open”).

Quality failure
Context: Two new vaccines have been shown effective against rotavirus, which is responsible for a half-million infant deaths in poor countries each year, research studies published Wednesday said.

| Source | Headline | $\log p_{\text{model}}$ | HJ |
|---|---|---|---|
| Model | Two new vaccines in the poor countries were effective against go-it-alone study says | -2.3 | 2.6 |
| Reference | New vaccines for key UNK virus shown effective | -4.0 | 4.3 |

Diversity failure
Context: The Buffalo Bills sacked Tom Donahoe as president and general manager on Wednesday, fulfilling expectations of a shake-up after another failure to make the National Football League playoffs.

| Source | Headline | $\log p_{\text{model}}$ | HJ |
|---|---|---|---|
| Model | Bills sack UNK as president GM and general manager | -0.9 | 4.3 |
| Reference | NFL’s Bills shake up front office. | -5.1 | 4.3 |

Model is indistinguishable
Context: US veteran and eight-time Grand Slam winner Andre Agassi has withdrawn from this month’s Australian Open due to a nagging ankle injury, his management team announced Thursday.

| Source | Headline | $\log p_{\text{model}}$ | HJ |
|---|---|---|---|
| Model | Agassi bows out of Australian Open after injury. | -1.4 | 5.3 |
| Reference | Agassi withdraws from Australian Open. | -0.3 | 4.9 |

Table 2: Example reference and model outputs (capitalization added for readability) corresponding to Figure 4 (summarization task) that were shown to crowdworkers (left column). Crowdworkers were shown samples from the model (including the UNK token) and returned human judgments (right column). Using human judgments and the model probability, we can identify several types of failures. Quality failures are examples that are classified by human judgment. Diversity failures are examples that are classified by model probabilities. Finally, some examples are not easily classified, as they have similar human judgment and model probability scores.

5.4 HUSE stability

Figure 5: Estimates of HUSE are robust to small test set size, but generally require crowdworker measurements for each example.

Since HUSE depends on human crowdworker annotations, one might ask if it is possible to reduce either the number of annotated examples, or number of distinct crowdworkers for each example. We show that for low-quality models, substantially fewer annotations are needed.

Figure 5 shows the result of subsampling our original data of 200 sentences and 20 crowdworkers and estimating HUSE. First, we find that using 50 test set examples (Figure 5, left) is often sufficient to give accurate estimates of HUSE. Next, we find that the necessary number of crowdworkers per example depends heavily on the task. Easily distinguishable tasks (story generation) require only 10 crowdworkers, while less distinguishable tasks (summarization) require more than 20 crowdworkers to obtain accurate estimates.
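The crowdworker half of such a subsampling study could be sketched as below: for each sentence, keep a random subset of the available ratings and recompute the mean human judgment, which would then be fed back into the HUSE estimate. The ratings matrix, the seed, and the name `subsample_hj` are hypothetical.

```python
import random
from statistics import mean


def subsample_hj(ratings, n_workers, seed=0):
    """Mean human judgment per sentence from a random subset of workers.

    ratings: one list of crowdworker ratings per sentence.
    n_workers: number of ratings to keep for each sentence.
    """
    rng = random.Random(seed)
    return [mean(rng.sample(row, n_workers)) for row in ratings]


# Hypothetical ratings for two sentences from four workers each.
ratings = [[1, 2, 3, 4], [5, 5, 5, 5]]
full = subsample_hj(ratings, n_workers=4)   # uses every rating
half = subsample_hj(ratings, n_workers=2)   # uses two ratings per sentence
print(full, half)
```

Repeating this over several seeds and worker counts gives a spread of estimates whose width indicates how many annotations a task needs.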

6 Related work

The current state of NLG evaluation.

Existing approaches to NLG evaluation use a hodgepodge mix of quality and diversity measures. Out of the 26 NLG papers at ACL 2018, six perform only human evaluation, fourteen measure human evaluation and a diversity metric such as perplexity or n-gram diversity, and six do not evaluate using human judgments.

While perplexity and $n$-gram counts can in principle evaluate diversity, their practical implementations suffer from serious drawbacks. When human evaluation and perplexity are both evaluated, they are almost always done on separate models: human evaluations are done on beam-searched output, while perplexity is computed on the softmax outputs. This makes it appear as if the models can simultaneously generate high quality outputs while also being diverse, when in fact they can only be one at a time based on whether they sample or run beam search.

On the other hand, $n$-gram diversity was proposed by Li et al. (2016) to identify models with the generic utterance problem where models repeat phrases such as ‘I don’t know’. Unfortunately, $n$-gram diversity is computed across contexts by counting the number of unique $n$-grams generated, and so does not measure a model’s ability to generate multiple valid utterances at any single context. In particular, a model which only outputs a single memorized utterance per context (e.g., via memorization or retrieval) can still have high $n$-gram diversity as long as the memorized sentences differ across contexts.
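The limitation can be seen in a toy computation. The sketch below uses one common normalisation (unique n-grams divided by total n-grams), which is an assumption and may differ from the exact definition in Li et al. (2016); the sentences and the name `distinct_n` are illustrative.

```python
def distinct_n(sentences, n):
    """Fraction of n-grams that are unique, pooled across all sentences."""
    ngrams = []
    for sentence in sentences:
        tokens = sentence.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# A model that always gives the same generic reply, whatever the context.
generic = ["i don't know", "i don't know", "i don't know"]

# A model that returns one memorised sentence per context (three contexts).
memorized = ["the game was cancelled", "prices rose again today", "she won the election"]

generic_score = distinct_n(generic, 1)
memorized_score = distinct_n(memorized, 2)
print(generic_score, memorized_score)
```

The memorizing model scores as maximally diverse on bigrams even though it can produce only one output for any given context.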

Finally, all existing diversity measures are computed separately from human evaluation. This results in two incomparable evaluation metrics, which prevent us from reasoning about tradeoffs between diversity and quality. In contrast, HUSE allows us to make precise statements about the tradeoffs between model quality and diversity because it is a single metric which decomposes into diversity and quality terms.

Related evaluations of diversity.

The importance of diverse responses has previously been acknowledged for summarization (Nenkova et al., 2007) and information retrieval (Clarke et al., 2008). Our work differs in considering a single evaluation measure that captures quality and diversity applicable to any generation task.

Automated metrics based on $n$-gram overlap such as BLEU, METEOR, and ROUGE (Papineni et al., 2002; Lavie and Denkowski, 2009; Lin and Rey, 2004) work well for machine translation but do not generalize well to domains with a diverse spectrum of correct responses. While variants (Sun and Zhou, 2012; Galley et al., 2015; Shima and Mitamura, 2011) have adapted such metrics to high entropy generative environments, they are still significantly inferior to the human judgments they attempt to mimic.

Caccia et al. (2018) recently examined the diversity and quality tradeoffs for different language model architectures on synthetic datasets. However, as their approach relies on measuring log-likelihoods under both the model and reference distributions, it cannot be applied to real data where $p_{\text{ref}}$ is unavailable. Our main conceptual contribution overcomes this by showing that HJ is an acceptable proxy for $p_{\text{ref}}$.

Sajjadi et al. (2018) also examine diversity and quality (which they call precision and recall) in the context of generative image models. However, they rely on assuming that $p_{\text{model}}$ and $p_{\text{ref}}$ can be estimated accurately using the Fréchet Inception Distance (FID) (Heusel et al., 2017). HUSE avoids such assumptions and instead directly leverages human judgments, resulting in a simple and reliable metric more suitable for use as a gold-standard.

Estimating optimal classification error.

Evaluating a model by estimating its optimal classification error has been considered by several earlier works (Olsson et al., 2018; Kannan and Vinyals, 2016; Li et al., 2017; Bruni and Fernandez, 2017; Bowman et al., 2016). However, these methods have focused on classifying sentences directly, which is quite challenging to do reliably. Existing adversarial evaluation methods do not yet reliably outperform human classification (Kannan and Vinyals, 2016; Bruni and Fernandez, 2017). We propose the use of both human evaluation and model probabilities as part of the adversarial evaluation framework, and demonstrate that the resulting classifier reliably outperforms humans and captures both the sample quality and diversity of a model.

Distributional divergence estimation.

Our proposed evaluation metric is closely related to the total variation distance, which has been studied extensively in the distribution testing literature. It is known that total variation distance estimates have pessimistic minimax estimation rates in high dimensions (Balakrishnan and Wasserman, 2017). Our work overcomes this by utilizing $p_{\text{model}}$ and an estimate of $p_{\text{ref}}$. Other approaches to distributional testing include the maximum mean discrepancy (MMD) and Wasserstein distances, but these approaches require knowledge of a ground truth metric or kernel space (Tolstikhin et al., 2016; Singh et al., 2018). Although such divergences are easier to estimate than the total variation distance from samples, the implied convergence rates are still too slow to be practically useful.

7 Discussion

In this paper, we demonstrate that the current gold standard of human evaluation does not penalize under-diverse models. To remedy this, we propose HUSE, a general-purpose evaluation strategy which can be applied to any model whose sampling probabilities we can calculate. HUSE is an upper bound on the optimal classification error of distinguishing reference and model-generated text, and never does worse than human classification. HUSE leverages both model probabilities and human judgments, ensuring that models which do well on the metric are both high-quality and diverse.
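As a minimal sketch of how a classification error of this kind might be estimated, assume each sample is reduced to two features (a model log-probability feature and an average human judgment score) and that a leave-one-out k-nearest-neighbor classifier is used. The function name `loo_knn_error`, the choice of classifier, the value of `k`, the label coding, and the toy data are all assumptions made for illustration; this section does not specify them.

```python
import math

def loo_knn_error(features, labels, k=3):
    """Leave-one-out error of a k-nearest-neighbour classifier.

    features: list of (log_prob_feature, human_judgment) pairs (assumed layout).
    labels:   list of 0/1 labels; 1 = reference text, 0 = model text (assumed coding).
    """
    n = len(features)
    errors = 0
    for i in range(n):
        # Distances from the held-out point i to every other point.
        dists = sorted(
            (math.dist(features[i], features[j]), labels[j])
            for j in range(n) if j != i
        )
        votes = [label for _, label in dists[:k]]
        # Majority vote among the k nearest neighbours (ties go to label 0).
        predicted = 1 if sum(votes) * 2 > len(votes) else 0
        errors += predicted != labels[i]
    return errors / n

# Toy data: two well-separated clusters, so the classes are easy to tell apart.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_error = loo_knn_error(toy_features, toy_labels, k=3)
```

An error near zero, as on this toy data, means the model samples are easy to distinguish from the references. An error near one half means the classifier does little better than chance.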

Our work can be viewed as a “superhuman version” of the classic Turing Test Turing (1950). Instead of relying on just a human classifier, we approximate the optimal classifier, which can utilize information about the model in addition to the reference. We also modify the classification problem and seek to identify whether a sample comes from a (potentially superhuman) reference distribution, rather than the human distribution. These two changes lead to tractable, rigorous estimators which can quantify tradeoffs between model quality and diversity on a wide range of generation tasks.

Acknowledgements. We would like to thank Arun Chaganty, Robin Jia, and Peng Qi for extensive comments and feedback on the paper. This work was funded by DARPA CwC program under ARO prime contract no. W911NF-15-1-0462.

Reproducibility. All code, data, and experiments are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x88644b5ee189402eb19d39d721d1005c.


Appendix A

A.1 Relationship between total variation distance and optimal discriminator error

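This is a standard result, replicated here for completeness: The total variation distance is related to the optimal discriminator error as follows: $\|p_{\text{model}} - p_{\text{ref}}\|_{TV} = 1 - 2L^*$.


Fix any $x$. Define $a_y := p_{\text{ref}}(y \mid x)$ and $b_y := p_{\text{model}}(y \mid x)$. Let $S := \{y : a_y < b_y\}$ be the $y$ where $p_{\text{model}}$ assigns higher probability than $p_{\text{ref}}$, and define $a := \sum_{y \in S} a_y$ and $b := \sum_{y \in S} b_y$ to be the aggregated probabilities. On $S$, the optimal discriminator should return "model". This is an error when the sample was drawn from $p_{\text{ref}}$, which occurs with probability $\frac{1}{2} a$. Analogously, on the complement of $S$, the error probability (when the sample was drawn from $p_{\text{model}}$) is $\frac{1}{2}(1 - b)$. The total contribution to $L^*$ is thus $\frac{1}{2}(a + 1 - b)$. The rest follows from algebra: since $b - a = \frac{1}{2} \sum_y |b_y - a_y|$ is the total variation distance at this $x$, averaging over $x$ gives

$$L^* = \frac{1}{2} - \frac{1}{2}\,\|p_{\text{model}} - p_{\text{ref}}\|_{TV}.$$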
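The relation can be checked numerically on a small discrete example. The sketch below assumes a single fixed context, an equal prior over the two sources, and total variation taken as half the L1 distance. The function names and the toy distributions are illustrative only.

```python
def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance.

    Assumes p and q are dicts over the same set of outcomes.
    """
    return 0.5 * sum(abs(p[y] - q[y]) for y in p)

def optimal_error(p_ref, p_model):
    """Error of the best discriminator when each source is equally likely.

    For each outcome y the best rule guesses the source with the larger
    probability, so it errs with probability 0.5 * min(p_ref[y], p_model[y]).
    """
    return sum(0.5 * min(p_ref[y], p_model[y]) for y in p_ref)

# Toy distributions over three outcomes (illustrative values only).
p_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
p_model = {"a": 0.2, "b": 0.3, "c": 0.5}

tv = total_variation(p_ref, p_model)
err = optimal_error(p_ref, p_model)

# The two quantities should satisfy tv == 1 - 2 * err up to rounding.
gap = abs(tv - (1 - 2 * err))
```

Here the best discriminator errs on 35% of samples and the total variation distance is 0.3, which agrees with the stated relation.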


A.2 Approximation error from features

Theorem 1.

Let and be the optimal classification error and optimal error under feature map respectively. Then,

where is the conditional mutual information in bits and is the prediction of the optimal classifier.


The lower bound falls out of the definition of . To prove the upper bound, a variant of the entropy lower bound by Feder and Merhav Feder and Merhav (1994) shows that the error rate for predicting , via the optimal follows


Now expand the mutual information using the chain rule

The last line follows from the fact that is a deterministic function of (Proposition 3.1). Substituting this into the inequality gives the bound,

with .

Finally, note that incurs error, and we disagree with at most a fraction of time. Assuming that we get every one of these disagreements wrong gives an upper bound of on . ∎
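The steps of this argument can be sketched as a chain of inequalities. The notation is assumed rather than taken from the text: $L^*$ is the optimal error, $L(\phi)$ the optimal error under a feature map $\phi$, $\phi_{\mathrm{opt}}$ a feature map that determines the optimal classifier, $z^*$ the optimal classifier's prediction, and $\hat{z}$ the prediction of the best classifier that sees only $\phi(x, y)$. This is one reading of the proof, not necessarily the exact statement of the theorem.

```latex
% Assumed notation: z^* = prediction of the optimal classifier,
% \hat{z} = prediction of the best classifier that only sees \phi(x, y).
\begin{align*}
\Pr[\hat{z} \neq z^*]
  &\le 1 - 2^{-H(z^* \mid \phi)}
  && \text{(entropy bound, in bits)} \\
I(z^*; \phi_{\mathrm{opt}} \mid \phi)
  &= H(z^* \mid \phi) - H(z^* \mid \phi_{\mathrm{opt}}, \phi) \\
  &= H(z^* \mid \phi)
  && \text{($z^*$ is a deterministic function of $\phi_{\mathrm{opt}}$)} \\
L(\phi)
  &\le L^* + \Pr[\hat{z} \neq z^*]
   \le L^* + 1 - 2^{-I(z^*; \phi_{\mathrm{opt}} \mid \phi)}
\end{align*}
```

The last line is a union bound: $\hat{z}$ can be wrong only when $z^*$ is wrong or when the two predictions disagree.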

A straightforward corollary is that whenever is an invertible function of , the conditional mutual information is zero, and therefore the above inequalities become an equality. Whenever is an invertible function of , .

A.3 Amazon Mechanical Turk for human judgments

Figure 6: Amazon Mechanical Turk survey design for eliciting human judgment scores HJ in the summarization task.

In order to show that HUSE can be reliably estimated even with simple crowdsourcing techniques, we used a single uniform task design where we asked Amazon Mechanical Turk workers to rate the typicality of a sentence from 0–5. We defined 0 as invalid (grammatically or factually incorrect) and 5 as ‘very typical’. HJ$(x, y)$ is defined as the average score that crowdworkers assign to a response $y$ given the context $x$. We did not perform substantial filtering or qualification checks beyond HIT acceptance criteria (HIT approval rate greater than 95 percent, more than 50 approved HITs, and location in the USA). Each HIT consisted of 25 examples, and we paid one dollar per HIT.
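The averaging step and the cost arithmetic can be sketched as follows. The record layout `(context, response, score)` and both function names are hypothetical, and the cost helper assumes every replicate rating of every item occupies one of the 25 example slots in a HIT.

```python
import math
from collections import defaultdict
from statistics import mean

def average_judgments(ratings):
    """Average the 0-5 typicality ratings for each (context, response) pair.

    ratings: iterable of (context, response, score) tuples; this record
    layout is a hypothetical one chosen for the example.
    """
    by_item = defaultdict(list)
    for context, response, score in ratings:
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        by_item[(context, response)].append(score)
    return {item: mean(scores) for item, scores in by_item.items()}

def annotation_cost(n_items, replicates, per_hit=25, dollars_per_hit=1.0):
    """Dollar cost, assuming each rating fills one example slot in a HIT."""
    hits = math.ceil(n_items * replicates / per_hit)
    return hits * dollars_per_hit

example_ratings = [("c1", "r1", 4), ("c1", "r1", 2), ("c1", "r2", 5)]
example_hj = average_judgments(example_ratings)
example_cost = annotation_cost(n_items=100, replicates=20)
```

Under these assumptions, rating 100 responses with 20 replicates each takes 80 HITs, or 80 dollars.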

We observe that measuring many replicates is sufficient to get low-variance estimates of HJ. For classification tasks where the model is straightforward to identify from references (such as story generation) we require five to ten replicates, while for hard tasks such as summarization at least twenty replicates are needed (Section 5.4). Manual inspection suggests that up to 20% of the collected data are low-quality, but that this noise is uncorrelated with the sentence being rated and is outweighed by the larger majority of honest and reasonably accurate data. Even if the data quality is low, HUSE is still a valid upper bound (i.e., models with low HUSE are guaranteed to be distinguishable from humans). Thus the models which we identify as having low HUSE are reliably distinguishable regardless of the crowdworker quality.

A.4 Reddit Dataset

We use a subset of Reddit comments from 2006-2018 scraped from https://pushshift.io/. We construct a dictionary containing the 10,000 most popular words and preprocess the dataset by removing deleted posts, out-of-vocabulary tokens, profanity, comments with fewer than 10 upvotes, and comments with over 400 tokens.
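A minimal sketch of this filtering is given below, under several assumptions that the description leaves open: comments are dicts with hypothetical `body` and `score` fields, deleted posts are marked `[deleted]` or `[removed]`, tokens are lowercased and split on whitespace, a comment containing profanity is dropped entirely, and out-of-vocabulary tokens are removed from otherwise retained comments. The profanity list is a caller-supplied placeholder.

```python
from collections import Counter

def build_vocab(comments, size=10_000):
    """Return the `size` most frequent whitespace tokens across all comments."""
    counts = Counter(tok for c in comments for tok in c["body"].lower().split())
    return {tok for tok, _ in counts.most_common(size)}

def preprocess(comments, vocab, profanity, min_upvotes=10, max_tokens=400):
    """Filter comments and strip tokens outside the vocabulary."""
    kept = []
    for c in comments:
        if c["body"] in ("[deleted]", "[removed]"):
            continue  # deleted posts (marker strings are an assumption)
        if c["score"] < min_upvotes:
            continue  # fewer than 10 upvotes
        tokens = c["body"].lower().split()
        if len(tokens) > max_tokens:
            continue  # over 400 tokens
        if any(tok in profanity for tok in tokens):
            continue  # assumed: drop the whole comment if it contains profanity
        tokens = [tok for tok in tokens if tok in vocab]  # drop OOV tokens
        if tokens:
            kept.append(" ".join(tokens))
    return kept

sample_comments = [
    {"body": "hello world hello", "score": 12},
    {"body": "[deleted]", "score": 50},
    {"body": "hello rare", "score": 3},
    {"body": "world badword", "score": 20},
]
sample_vocab = build_vocab(sample_comments, size=2)
sample_kept = preprocess(sample_comments, sample_vocab, profanity={"badword"})
```

Whether the vocabulary is built before or after filtering is not stated; the sketch builds it first, from all comments.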