Evaluating the Evaluation of Diversity in Natural Language Generation

04/06/2020 ∙ by Guy Tevet, et al. ∙ 0

Despite growing interest in natural language generation (NLG) models that produce diverse outputs, there is currently no principled method for evaluating the diversity of an NLG system. In this work, we propose a framework for evaluating diversity metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable used to instruct crowdsourcing workers to generate text with either low or high content diversity. We demonstrate the utility of our framework by: (a) establishing best practices for eliciting diversity judgments from humans, (b) showing that humans substantially outperform automatic metrics in estimating content diversity, and (c) demonstrating that existing methods for controlling diversity by tuning a "decoding parameter" mostly affect form but not meaning. Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.


1 Introduction

Question: So what did I miss in the first 20 minutes?

Set A: Pretty much everything. Nothing, really. You won’t believe what happened! Why do you even care? What were you doing that was more important than this?

Set B: Not much. It was pretty dull. Blah, you didn’t miss anything. Not anything that important. Very little, it was uneventful.

Figure 1: Our diversity metric evaluation framework checks the capability of metrics to capture different aspects of diversity. Presented are two sets of responses to the same question, generated by crowdsourcing workers. While both sets are diverse in terms of the form of the sentences, only set A is diverse in terms of content. Each graph presents the distribution over a diversity metric for sets with high content diversity (blue) and low content diversity (orange). Distributions are approximated over sets such as the two presented. We observe that the human score metric (absHDS) separates the two distributions, while an n-gram based metric (distinct-n) fails, illustrating that n-gram metrics do not capture content diversity. The dotted lines correspond to the specific sets A and B presented above.

An important desideratum of natural language generation (NLG) systems is to produce outputs that are not only correct, but also diverse. For example, a dialog system Adiwardana et al. (2020) should permit many responses for the prompt “How are you today?”. Similarly, we expect diverse responses in NLG tasks such as story generation Li et al. (2018), question generation Pan et al. (2019) and abstractive question answering Fan et al. (2019).

Despite growing effort to produce more diverse models Li et al. (2016c, a); Holtzman et al. (2019); Du and Black (2019), there is currently no standard evaluation metric for measuring model diversity. Thus, different papers evaluate diversity differently (if at all), making it difficult to fairly compare competing approaches Hashimoto et al. (2019). Having a principled and consensual diversity evaluation metric is hence fundamental for advancing the field of NLG.

A key challenge in developing diversity evaluation metrics is that it is difficult to determine their efficacy. For metrics that evaluate the quality of generated text, one can measure the correlation between an automatic metric (such as BLEU Papineni et al. (2002) or METEOR Banerjee and Lavie (2005)) and human judgement Zhang et al. (2019a); Sagarkar et al. (2018). For diversity, in contrast, it is unknown whether humans can reliably estimate it.

In this paper, we propose a framework for evaluating diversity metrics (see Figure 2). We assume that a tester (human or model) is generating sets of sentences, conditioned on some diversity parameter that controls the diversity of the output sentences. We evaluate the diversity of the sentences using a proposed diversity metric, and measure the correlation between the proposed metric and the diversity parameter. High correlation indicates that the metric indeed captures how the diversity parameter affects the model output.

We instantiate this framework with two tests. In the first test, the tester is a neural generation model and the diversity parameter is a decoding parameter, such as softmax temperature Ackley et al. (1985). This parameter controls the skewness of the distribution in every generated token, and is known to affect model diversity Holtzman et al. (2019); Caccia et al. (2018). In the second test (see Figure 1), the tester is a human, and the diversity parameter is a binary variable, where the human is instructed to generate sets of sentences with either high or low diversity in content.

We evaluate several families of popular diversity metrics with these two tests: (a) n-gram-based metrics that estimate diversity based on surface patterns in a set of generated sentences, (b) neural metrics: we propose a reduction from evaluating sentence similarity to evaluating diversity, then evaluate diversity using state-of-the-art sentence similarity models, and (c) human evaluation: we explore multiple ways in which humans can be asked to estimate diversity, resulting in multiple Human Diversity Score (HDS) variations.

We find that n-gram-based metrics succeed in detecting diversity that is driven by decoding parameters (the first test above), suggesting that such parameters mostly control the form of generated text rather than its content. Conversely, n-gram-based metrics perform poorly in the second test, which focuses on diversity of content. While neural metrics outperform n-gram-based metrics, we establish that humans are substantially better than any automatic metric at detecting content diversity. This is illustrated in Figure 1, where a human score clearly distinguishes between sets that have high (blue) and low (orange) content diversity, while n-gram-based metrics fail to do so.

To conclude, our main contributions are:

  • A framework for evaluating diversity metrics.

  • Tests instantiating this framework, measuring the sensitivity of metrics to content and form.

  • Best practices for obtaining diversity evaluations from crowdsourcing workers.

  • Establishing that humans outperform current automatic metrics in detecting content diversity.

  • The collected data, test scores and code are publicly available at https://github.com/GuyTevet/diversity-eval, and can be used to easily compare new diversity metrics to existing results in our framework.

2 Background: Diversity Evaluation

Recently, interest in diversity in NLG has increased Du and Black (2019); Holtzman et al. (2019); Hashimoto et al. (2019); Dušek et al. (2020), resulting in multiple proposals for its evaluation. We describe recent approaches, highlighting the need for a standard way to evaluate metrics.

Perplexity

is the standard metric in language modeling (LM), measuring the proximity of an LM to the true distribution by empirically approximating the cross-entropy with held-out data sampled from the true distribution. Thus, perplexity captures diversity to some extent. For example, a dialog model that puts all probability mass on the output “I don’t know” for any given context will obtain infinite perplexity once it encounters any other response. This property makes perplexity popular in LM-based NLG models, and often it is the only reported measure for diversity Lewis et al. (2017); Fan et al. (2018); Wang et al. (2019); Li et al. (2019).

However, perplexity does not purely measure diversity, and high perplexity does not entail low diversity. For example, an LM with a uniform distribution over the vocabulary for each decoded token has high diversity, but its perplexity will be extremely high, due to its low quality. Moreover, perplexity evaluates an LM, while the diversity of an NLG system is also strongly affected by the decoding procedure. For example, Top-k and nucleus sampling are popular decoding schemes that trade off quality and diversity by ignoring some of the LM probability mass Holtzman et al. (2019).
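The two perplexity examples above can be checked numerically. The following is a minimal sketch, assuming perplexity is computed as the exponential of the average negative log-probability that a model assigns to held-out tokens; the function name and the toy vocabulary size are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity from the probabilities a model assigned to held-out tokens."""
    if any(p == 0.0 for p in token_probs):
        # A single zero-probability token makes the cross-entropy infinite.
        return math.inf
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Toy vocabulary size (an assumption for illustration only).
vocab_size = 1000

# Uniform LM: every token gets probability 1 / vocab_size.
uniform_ppl = perplexity([1.0 / vocab_size] * 20)

# Degenerate model: certain about one response, zero mass on anything else.
degenerate_ppl = perplexity([1.0, 1.0, 0.0])
```

Under these assumptions the uniform model's perplexity equals the vocabulary size even though its samples are maximally diverse, while the degenerate model's perplexity is infinite as soon as one unseen token appears.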

Last, some NLG models, such as Generative Adversarial Networks (GANs) Yu et al. (2017) are not based on a LM at all. While it is possible to approximate perplexity for such models Tevet et al. (2019), a metric should ideally not be tied to model specifics.

N-gram-based metrics

A popular metric is distinct n-grams Li et al. (2016b), which computes the proportion of unique n-grams out of the total number of n-grams in a set of generated sentences. For example, distinct unigrams is the ratio of word types to word tokens, alluding to the richness of the vocabulary. Dušek et al. (2020) calculated Shannon entropy Manning et al. (1999) based on different n-grams as a measure of lexical diversity. Self-BLEU Zhu et al. (2018); Shu et al. (2019) measures the BLEU score of a generated sentence with respect to another generated sentence (rather than a gold reference). High average Self-BLEU indicates high similarity between generated sentences and low diversity. In §5 we expand this idea and suggest a reduction from any similarity metric to a diversity metric. By design, n-gram based metrics are sensitive to diversity in the form of language, rather than its meaning.
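As a rough illustration of distinct n-grams, here is a minimal sketch. It assumes lowercased whitespace tokenization, which is a simplification for illustration and not necessarily the tokenization used in the paper's released code.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(responses: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over the response set."""
    pooled: List[Tuple[str, ...]] = []
    for response in responses:
        # Assumption: lowercased whitespace tokenization.
        pooled.extend(ngrams(response.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


repetitive = ["It was a fire", "It was a fire"]
varied = ["Pretty much everything", "Nothing really"]
```

Here `distinct_n(repetitive, 1)` is 0.5, because the second response contributes no new unigrams, while `distinct_n(varied, 1)` is 1.0.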

Embedding-based metrics

A new line of metrics embeds generated sentences in a latent space, then evaluates them in this space. Du and Black (2019) suggest clustering the embedded sentences with k-means, then using its inertia as a measure for diversity. Recently, Lai et al. (2020) suggested considering the volume induced by the embedded sentences as a diversity metric.

Human evaluation

Yang et al. (2019) asked humans to evaluate the internal diversity of a generated essay. Ghandeharioun et al. (2019) let crowdsourcing workers interact with a dialog chat-bot, then asked them to evaluate the diversity of a single conversation. In contrast, this paper focuses on the diversity of different responses given a context, as in Zhang et al. (2019b).

To conclude, increasing interest in diversity resulted in multiple proposed diversity metrics. However, there is no consensus on how to evaluate diversity and what each metric actually measures.

3 Evaluating Diversity Metrics

[Figure 2 diagram: a diversity parameter and a context (“How are you today?”) are given to the tester, which outputs a response set (“Very Good!”, “Fine Thank you.”, “Couldn’t be better.”); the diversity metric scores the set, which yields the test score.]

Figure 2: An overview of our diversity metrics evaluation framework. The tester (machine or human) generates a response set ($S_c$) given a diversity parameter ($d$) and a context ($c$). The test score of a metric is the correlation between the metric score for $S_c$ and $d$.

We now describe our framework for evaluating diversity metrics. We note that diversity has many facets (see discussion in §7): for instance, a set of sentences can be diverse in terms of their content, while another may have similar content, but diverse form (see Figure 1). Our framework provides a way to evaluate metrics for different aspects of diversity under moderate assumptions.

We define a diversity metric $m_{div}$ as a function that takes a set of generated responses $S_c$ as an input, and outputs a diversity score. Each response in $S_c$ is generated for the same input context $c$, hence $S_c$ is a sample from a generative distribution $P_{gen}$. The overall diversity score of a generative model can be obtained by averaging $m_{div}$ over sets $S_c$ sampled from the model given multiple contexts $c$.

To evaluate $m_{div}$, our framework assumes access to some deterministic diversity parameter $d$ that controls an aspect of diversity in $S_c$. Our framework tests the relation between $m_{div}$ and the parameter $d$. By varying $d$ and measuring $m_{div}$, we can compute the correlation between $m_{div}$ and an aspect of diversity, represented by $d$. Because our goal is to measure the ability of metrics to rank the diversity level of generated text, we use Spearman’s rank correlation as our test score. Figure 2 illustrates the flow of a test in our framework.

In practice, to control the diversity level of $S_c$ using $d$, we use a tester: a generative model that takes a context $c$ and a diversity parameter $d$ as input, and outputs a response set $S_c$. We stress that the tester can be either a neural model or a human. A good tester should reliably represent the diversity level quantified by $d$.

As a hypothetical example, $c$ can be a movie name and $d$ can represent sentiment diversity, that is, the number of different sentiments in a collection of generated reviews about that movie. A human tester can observe $c$ and $d$, and produce reviews accordingly (such data can be easily mined from IMDB). A collection of such triples of context, diversity parameter and response set makes a test, in which Spearman’s correlation between $m_{div}(S_c)$ and $d$ is a measure for the sensitivity of $m_{div}$ to sentiment diversity.

We note that perplexity cannot be evaluated as a diversity metric in our framework, because it requires a sample from the true distribution, while we assume a response set sampled from $P_{gen}$.
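The test score described above can be sketched in a few lines. This is a minimal standard-library sketch, not the released implementation: it computes Spearman's rank correlation as the Pearson correlation of average ranks (ties share the mean rank, which is an assumption about tie handling), and the names `metric_test_score` and `toy_metric` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


def metric_test_score(
    metric: Callable[[List[str]], float],
    response_sets: List[List[str]],
    diversity_params: Sequence[float],
) -> float:
    """Correlation between metric scores and the diversity parameter."""
    scores = [metric(responses) for responses in response_sets]
    return spearman(scores, diversity_params)


# Toy metric (hypothetical): number of distinct responses in the set.
def toy_metric(responses: List[str]) -> float:
    return float(len(set(responses)))
```

In this sketch each response set is paired with the value of the diversity parameter used to generate it, so a metric that ranks the sets in the same order as the parameter obtains a score of 1.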

We now describe two tests that instantiate this framework, roughly corresponding to the two main aspects of diversity: form diversity and content diversity.

3.1 Test #1: Decoding Parameters

The diversity of an NLG system constructed from an LM and a decoder depends on the decoding scheme. For example, beam search approximates the most probable output, and thus dramatically reduces diversity. Conversely, pure sampling from the LM distribution leads to high diversity, but low quality output Holtzman et al. (2019).

Consequently, a popular method to control diversity in NLG systems is to vary some decoding parameter. Variations include (a) softmax temperature Ackley et al. (1985), where a temperature parameter controls the skewness of the softmax distribution at each step, (b) Nucleus (Top-p) sampling Holtzman et al. (2019), where one samples at each step from the minimal set of most probable tokens whose cumulative probability is at least p, and (c) Top-k sampling, which samples from the k most probable tokens at each step. All methods skew the LM distribution in a way that avoids low-probability tokens and leads to higher quality Holtzman et al. (2019), providing a decoding parameter that trades off quality and diversity Caccia et al. (2018).
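To make the three decoding parameters concrete, the following is a minimal sketch of how each one reshapes a single next-token distribution. It operates on plain Python lists rather than on a real model, and all function names are illustrative assumptions rather than the API of any library.

```python
import math
from typing import List


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs: List[float], keep: set) -> List[float]:
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs: List[float], p: float) -> List[float]:
    """Keep the smallest set of most probable tokens with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
sharp = apply_temperature([2.0, 1.0, 0.0], 0.5)
flat = apply_temperature([2.0, 1.0, 0.0], 2.0)
```

In this toy example `top_k_filter(probs, 2)` and `top_p_filter(probs, 0.7)` both keep the first two tokens, and the low-temperature distribution `sharp` puts more mass on the most probable token than `flat` does.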

In Test #1, we define the tester to be a strong LM, such as GPT-2 Radford et al. (2019), and the diversity parameter to be a decoding parameter such as temperature. We check how different diversity metrics correlate with decoding parameters. This can shed light both on the quality of the metrics and on how decoding parameters actually affect the output of an NLG system.

3.2 Test #2: Content Diversity

In this test, our goal is to evaluate how different diversity metrics capture the notion of content diversity, that is, whether a set of responses is diverse in terms of content. Measuring content diversity requires deep understanding of the semantics of the responses in the set.

To isolate content diversity from form diversity, we aim to generate sets of responses with a similar level of form diversity, but where the level of content diversity is controlled by the diversity parameter $d$. To do this, we use crowdsourcing workers as testers, and a binary diversity parameter $d$, corresponding to low or high content diversity. A worker observes a context $c$ and produces a set of responses based on the value of $d$. We encourage workers to use different words and phrases in different responses regardless of the value of $d$, such that form diversity is generally high in all examples. Examples from this data are presented in Figure 1 and Appendix B.

In §6, we will focus on whether automatic diversity metrics can perform as well as humans on the task of estimating content diversity.

4 Human Diversity Score

One of the core questions we tackle is:

Can humans evaluate diversity reliably?

Although a few papers Ghandeharioun et al. (2019); Yang et al. (2019); Zhang et al. (2019b) asked humans to evaluate the diversity of their models, to the best of our knowledge no work thoroughly investigated this question. The importance of this question is clear when comparing to quality evaluation in NLG systems. There, human judgment is considered the gold standard, and automatic quality metrics are established by showing high correlation with human score. Thus, understanding whether humans can reliably judge diversity is important for improving diversity metrics. In this work, we use crowdsourcing workers (native English speakers, specifically qualified for this task; for more details see Appendix A) to compute a human diversity score: we show workers a context followed by a set of generated responses, and ask them to rate the diversity of the set.

To establish best practices, we experiment with multiple variations of HDS (detailed in §6.2), asking humans to rate the diversity of a response set, and then evaluating each practice with our framework. We focus on the following questions and present results in §6:

  • Should humans rate the absolute diversity score of a set of sentences or only rank whether one set is more diverse than another? (tl;dr: absolute scoring is more informative but rank scoring is moderately easier for humans.)

  • Should humans rate diversity of a set or similarity between pairs in the set, from which diversity can be inferred? (tl;dr: diversity)

  • Can humans evaluate different aspects of diversity well? (tl;dr: not effectively)

As a preliminary step, we conducted pilot experiments among a group of NLP graduate students. The main insights were: (a) Humans are biased toward quality. For example, if a generated set has high diversity but low quality, humans will rate diversity lower than if the quality of the samples was higher. To neutralize this effect, we explicitly ask workers to evaluate the quality of one of the responses in the set, and then instruct them to ignore quality in the diversity questions; (b) To make sure a worker reads the context, we ask them to generate a sentence before having them rate the diversity of a response set; (c) It is difficult for workers to evaluate the diversity of a set with more than 10 responses. Our crowdsourcing tasks are provided in Appendix A.

5 Similarity to Diversity Reduction

We expand the idea introduced by Zhu et al. (2018) and suggest a method to construct a diversity metric from any 2-sentence similarity metric.

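Given $m_{sim}(s_1, s_2)$, a symmetric similarity metric that gets a pair of input sentences and returns a similarity score, we can define a diversity metric $m_{div}$ as the negation of the mean similarity score across all (unordered) pairs of the response set $S_c$:

$$m_{div}(S_c) = -\frac{1}{\binom{|S_c|}{2}} \sum_{s_i, s_j \in S_c,\; i > j} m_{sim}(s_i, s_j)$$
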
This reduction allows us to easily define new diversity metrics based on past work on sentence similarity Gomaa et al. (2013); Devlin et al. (2019); Zhang et al. (2019a); Reimers and Gurevych (2019). In §6 we show that both n-gram-based similarity metrics and neural semantic similarity metrics provide useful diversity metrics.
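The reduction can be sketched as a higher-order function that wraps any pairwise similarity. In the minimal sketch below, the Jaccard word-overlap similarity is only a stand-in chosen for illustration; it is not one of the similarity metrics evaluated in the paper, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List


def similarity_to_diversity(
    similarity: Callable[[str, str], float],
) -> Callable[[List[str]], float]:
    """Build a diversity metric: negated mean similarity over unordered pairs."""

    def diversity(responses: List[str]) -> float:
        pairs = list(combinations(responses, 2))
        if not pairs:
            raise ValueError("need at least two responses to form a pair")
        return -sum(similarity(a, b) for a, b in pairs) / len(pairs)

    return diversity


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity (an assumption): word-set overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 1.0
    return len(words_a & words_b) / len(union)


jaccard_diversity = similarity_to_diversity(jaccard_similarity)
```

With this stand-in, a set of identical responses gets the minimal score of -1, and a set whose responses share no words gets 0. Any symmetric similarity function with the same signature can be plugged in instead.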

6 Experiments

We now turn to our empirical investigation.

6.1 NLG Tasks

We apply our evaluation procedure on three different NLG tasks (in English), in which diversity is essential.

  • Story completion (storyGen); We use the ROC Stories dataset Mostafazadeh et al. (2016), in which the context is the first four sentences of a story, and the response is a single sentence that ends the story. We use the contexts from this data and generate response sets for each context using our testers. The long contexts characterizing this data narrow down the space of possible responses, making this a “low-entropy” generation task, where the output is constrained, but diversity is still essential.

  • Dialog response generation (respGen); A comment-response pairs dataset extracted from the website reddit.com and pre-processed by Hashimoto et al. (2019). We use the comments from their data as contexts and generate response sets for each context using our testers. Since comments are single sentences, the response is less constrained, making this a “medium-entropy” generation task.

  • 3-words prompt completion (promptGen); Contexts are 3-words prompts, extracted from the Cornell Movie-Dialogs Corpus Danescu-Niculescu-Mizil and Lee (2011) by taking the first three words from each original context. The response sets are completions of the prompts, generated by our testers. This context provides minimal constraints, making this a “high-entropy” generation task.

Samples of the contexts extracted for each task, along with generated response sets, are presented in Appendix B. We intentionally avoid NLG tasks where diversity is not necessarily desired, such as summarization and machine translation.

6.2 Evaluated Metrics

N-gram-based metrics

We evaluate distinct n-grams (distinct-n), as described in §2. We also evaluate n-grams cosine similarity (cos-sim): a similarity measure computing the cosine between the vectors representing two sentences, where each vector is a count vector over the n-grams that appear in the response. We use the reduction from §5 to convert this to a diversity measure. In both metrics, rather than choosing a single n-gram order n, we average over several values of n, which we found to outperform any single choice of n.
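A minimal sketch of the n-gram cosine similarity follows. It assumes lowercased whitespace tokenization, and the default n-gram orders `(1, 2, 3)` are a placeholder assumption, since the exact set of orders averaged over is not recoverable from this text.

```python
import math
from collections import Counter
from typing import Sequence


def ngram_counts(sentence: str, n: int) -> Counter:
    """Count vector over the n-grams of a sentence (whitespace tokens)."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_cosine(a: str, b: str, n: int) -> float:
    """Cosine between the n-gram count vectors of two sentences."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sentence shorter than n has no n-grams
    return dot / (norm_a * norm_b)


def averaged_ngram_cosine(a: str, b: str, orders: Sequence[int] = (1, 2, 3)) -> float:
    """Average the cosine over several n-gram orders (orders are an assumption)."""
    return sum(ngram_cosine(a, b, n) for n in orders) / len(orders)
```

For example, `ngram_cosine("it was a fire", "it was a blaze", 1)` is 0.75, since three of the four unigrams are shared. The averaged similarity can then be turned into a diversity score by taking the negated mean over all pairs in a response set.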

Neural metrics

We exploit existing BERT-based models Devlin et al. (2019) fine-tuned for estimating similarity between two sentences (applying the reduction from §5).

BERT-STS; A BERT model fine-tuned on Semantic Textual Similarity Cer et al. (2017): a collection of sentence pairs annotated with scores from 1-5 denoting their semantic similarity (https://github.com/swen128/bert-sts).

BERT-Score Zhang et al. (2019a); Originally a quality metric, BERT-Score uses BERT’s embeddings to measure similarity between two sentences. We used RoBERTa-large Liu et al. (2019), as suggested by the authors (https://github.com/Tiiiger/bert_score).

Sentence-BERT (sent-BERT) Reimers and Gurevych (2019) is a sentence-level embedding model based on BERT. We use the cosine similarity between the embeddings of two responses as a similarity metric. In our experiments we used bert-large-nli-stsb-mean-tokens (https://github.com/UKPLab/sentence-transformers).

Human Metrics

We examine four methods for evaluating diversity with humans (see §4), to investigate best practices for obtaining diversity judgment from humans. In all metrics (except ranking), ratings are from 5 (highest diversity/similarity) to 1 (lowest). The original tasks presented to workers are in Appendix A.

Absolute HDS (absHDS); Given a context and a set of generated responses, rate the level of diversity of the set.

Ranking HDS (rnkHDS); Given a context and two sets generated with different values of the diversity parameter, rate which set is more diverse.

Similarity HDS (simHDS); Given a context and a set of generated responses, rate the similarity of each pair of sentences in the set, and then apply the reduction from §5.

Aspects HDS (aspHDS); Identical to absHDS, except we explicitly ask about a specific aspect of diversity, namely form or content.

6.3 Test #1

Context
Fire next door. John woke up smelling like something was burning. He went outside. He saw the fire next door. He called the authorities.
Response set ()
It was a minor fire and they put it out. It was a fire. It was a fire. It was a fire. It was a fire.
Response set ()
They arrived and put out the fire. It was a fire. It was a fire. It turned out to be a fire. It was a minor fire night.
Response set ()
It turned out to be a mechanic. Before the fire was put out it was a fire. It was a fire. They co-worker matter how bad the fire was. Several shells, the fire department came just in time.
Table 1: An example of the effect of temperature on the response set for a context from ROC Stories.

In this test we measure the correlation between diversity metrics ($m_{div}$) and the softmax temperature decoding parameter ($d$). The tester generating the response sets ($S_c$) is a neural NLG model.

Data and settings

For each of the three tasks, we generated sets of responses per context, using a linear temperature sweep over a fixed range of values Caccia et al. (2018). We generated 1K sets in total for 1K contexts, with the same number of sets per temperature, and evaluated on a random subset of the sets for each temperature. For automatic metrics, we repeat this experiment 100 times (randomly sampling the evaluated sets each time, with replacement), to present the mean and standard deviation of the experiment. HDS metrics are computed over one experiment only, due to their high cost (Appendix A). We provide an empirical justification for these particular values in §6.5.
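The repeated-sampling procedure for automatic metrics can be sketched as a generic resampling helper. This is a minimal sketch under stated assumptions: the function name, the default seed, and the use of the population standard deviation are illustrative choices, and `statistic` stands for whatever is computed per experiment (for instance, a rank correlation over the sampled sets).

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def resample_statistic(
    statistic: Callable[[List[T]], float],
    items: Sequence[T],
    sample_size: int,
    rounds: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and standard deviation of a statistic over repeated resampling."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = [
        # random.choices draws with replacement.
        statistic(rng.choices(items, k=sample_size))
        for _ in range(rounds)
    ]
    return statistics.mean(values), statistics.pstdev(values)


# Toy usage: the statistic is the mean of the sampled numbers.
def sample_mean(sample: List[float]) -> float:
    return sum(sample) / len(sample)


toy_mean, toy_std = resample_statistic(sample_mean, list(range(100)), sample_size=20)
```

In an actual run, `items` would hold pairs of a response set and its temperature, and `statistic` would score each sampled set with a metric and correlate the scores with the temperatures.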

The data for storyGen and respGen was generated by the neural model MASS Song et al. (2019), fine-tuned on each dataset separately. The data for promptGen was generated by GPT-2-large Radford et al. (2019) without fine-tuning. We provide examples for how story endings change as a function of temperature in Table 1. Examples for all tasks are in Appendix B. For each HDS metric, we collected 10 ratings per query from Amazon Mechanical Turk (AMT) workers. Whereas absHDS demands one query per response set, simHDS demands one query per pair of responses. To perform simHDS at a reasonable cost, we used only the first half of each original set, which reduces the number of crowdsourcing queries per set.

Absolute scoring results

Table 2 presents the results of absHDS and simHDS, as well as all automatic metrics. In general, n-gram based metrics succeed in capturing the diversity induced by a temperature sweep, beating HDS and neural metrics. Figure 3 provides a more detailed analysis, where each point represents a single set of responses generated at some temperature. We observe that while rank correlation for cosine similarity is high, the relation is far from linear and reaches high values even at low temperatures, which lowers its Pearson correlation. Conversely, the relation for BERT-STS and absHDS is more linear, which is reflected in their Pearson correlations. Thus, Pearson and Spearman correlations disagree in this case on the quality of the different metrics.

This result shows that humans perform worse than automatic metrics in this experimental setup, hinting that temperature mostly controls superficial changes to the generated text. Additionally, simHDS performs worse than absHDS although it is 3x more expensive, showing that rating the entire set rather than averaging over pairs is useful.

Figure 3: Test #1: Scatter plot of n-gram-based (cosine similarity), neural (BERT-STS) and human (absHDS) metrics as a function of temperature for respGen. Each point corresponds to a single generated set. Error bars of HDS represent the standard deviation over 10 annotator ratings.

| Metric | storyGen | respGen | promptGen |
|---|---|---|---|
| distinct-n | 0.76 (0.03) | 0.89 (0.01) | 0.91 (0.01) |
| cos-sim | 0.71 (0.04) | 0.89 (0.01) | 0.87 (0.02) |
| BERT-STS | 0.64 (0.04) | 0.81 (0.02) | 0.84 (0.02) |
| sent-BERT | 0.65 (0.03) | 0.80 (0.02) | 0.74 (0.03) |
| BERT-score | 0.69 (0.04) | 0.87 (0.01) | 0.88 (0.02) |
| absHDS | 0.69 | 0.81 | 0.79 |
| simHDS | - | 0.74 | - |

Table 2: Test #1 results: Spearman’s correlation between temperature and each metric score (mean and standard deviation). simHDS was tested only on respGen.

Ranking results

To examine whether we can improve correlation by asking humans to rank whether one set is more diverse than another, rather than providing an absolute score, we conduct a ranking experiment. Each context is given along with two sets (5 samples each), produced with different temperature values. We sweep over temperature differences instead of the absolute temperature values. The human metric in this setting is rnkHDS (see §6.2), and the automatic metrics are the difference between the scores each of the two sets got.

We report two measures. The first is Spearman’s ρ between the metric and the temperature difference. The second is accuracy, i.e., whether the metric can predict which set has higher temperature (e.g., in automatic metrics this is whether the sign of the temperature difference and the sign of the metric score difference agree). We consider ties in the metric score difference as a miss.
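The accuracy measure can be sketched as a sign-agreement check. This minimal sketch assumes that the two sets in every pair were produced with different temperatures, so the temperature difference is never zero; the function name is hypothetical.

```python
from typing import Sequence


def ranking_accuracy(
    temperature_diffs: Sequence[float],
    metric_diffs: Sequence[float],
) -> float:
    """Fraction of pairs where the metric difference has the same sign as the
    temperature difference. A zero metric difference (a tie) counts as a miss."""
    hits = 0
    for temp_diff, metric_diff in zip(temperature_diffs, metric_diffs):
        if metric_diff != 0 and (temp_diff > 0) == (metric_diff > 0):
            hits += 1
    return hits / len(temperature_diffs)


# Four toy pairs: two agreements, one tie, one disagreement.
toy_accuracy = ranking_accuracy([0.5, -0.3, 0.2, 0.4], [0.1, -0.2, 0.0, -0.1])
```

In the toy example the result is 0.5: the first two pairs agree in sign, the third is a tie and therefore a miss, and the fourth disagrees.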

Table 3 summarizes the ranking test results. We observe that humans are better at ranking than at giving absolute scores, and do as well as automatic metrics. However, the scores of all automatic metrics also improve, making it difficult to separate the different metrics.

| Family | Metric | storyGen ρ | storyGen acc | respGen ρ | respGen acc | promptGen ρ | promptGen acc |
|---|---|---|---|---|---|---|---|
| n-gram | distinct-n | 0.88 (0.02) | 0.88 (0.02) | 0.86 (0.01) | 0.9 (0.02) | 0.91 (0.01) | 0.91 (0.02) |
| n-gram | cos-sim | 0.86 (0.02) | 0.88 (0.02) | 0.87 (0.01) | 0.91 (0.02) | 0.9 (0.01) | 0.91 (0.02) |
| neural | BERT-STS | 0.84 (0.02) | 0.84 (0.02) | 0.85 (0.02) | 0.88 (0.02) | 0.9 (0.01) | 0.89 (0.02) |
| neural | sent-BERT | 0.85 (0.02) | 0.86 (0.02) | 0.83 (0.02) | 0.85 (0.02) | 0.85 (0.02) | 0.85 (0.02) |
| neural | BERT-score | 0.88 (0.02) | 0.89 (0.02) | 0.88 (0.01) | 0.89 (0.02) | 0.91 (0.01) | 0.9 (0.02) |
| human | rnkHDS | 0.87 | 0.89 | 0.89 | 0.9 | 0.89 | 0.88 |

Table 3: Test #1 ranking results (mean and standard deviation): Spearman’s (ρ) correlation between temperature differences and each metric score. Accuracy (acc) of classifying which set has the higher temperature.

Other decoding parameters

To examine the robustness of our conclusions to other decoding parameters, we repeat the experiment with two additional decoding methods: (a) in Nucleus (Top-p) sampling we swept linearly over 100 values of p; (b) in Top-k sampling we swept in logarithmic scale over 100 values of k. In both cases we present the correlation between the metrics and the swept parameter. While softmax temperature enables skewing the LM distribution to a more diverse one by using a temperature above 1, both Top-p and Top-k enable only skewing to a sharper (hence less diverse) distribution.

Table 4 presents results for all automatic metrics using the three decoding methods over promptGen. Although the correlation in Top-k is significantly lower, and the variance is higher, all three decoding methods reflect a similar ordering between the metrics. Results for other tasks are in Appendix C.

| Metric | Temperature | Top-p | Top-k |
|---|---|---|---|
| distinct-n | 0.91 (0.01) | 0.84 (0.02) | 0.61 (0.05) |
| cos-sim | 0.87 (0.02) | 0.78 (0.03) | 0.48 (0.05) |
| BERT-STS | 0.84 (0.02) | 0.74 (0.03) | 0.55 (0.05) |
| sent-BERT | 0.74 (0.03) | 0.63 (0.05) | 0.51 (0.05) |
| BERT-score | 0.88 (0.02) | 0.77 (0.03) | 0.57 (0.05) |

Table 4: Test #1 results for different decoding parameters: Spearman’s ρ (mean and standard deviation) of automatic metrics for promptGen.

6.4 Test #2

In this test, we measure the correlation between diversity metrics ($m_{div}$) and content diversity, represented by a binary parameter $d$. The testers are AMT workers, guided to create sets with a high level of form diversity and high or low content diversity according to $d$.

Data and settings

For each task, we collected 200 sets of 5 responses each (100 sets per class). For the high content diversity class, we asked the workers to give 5 responses for a context, with content and structure as different as possible. Then we asked the same workers to choose a single response they wrote, and rephrase it 5 times such that the original content is preserved while the form changes. This set is used for the low content diversity class. A sample from this data is in Figure 1 and more samples are in Appendix B. For each HDS metric, we collected 10 ratings from crowdsourcing workers, different from the ones who composed the sets.

Results

In addition to Spearman’s ρ between the metric score and the binary diversity parameter, we report the optimal single-threshold classifier accuracy (OCA), that is, the best accuracy that can be achieved in predicting the class of a response set (high or low content diversity) given any threshold on the metric score, such that the classifier predicts high diversity if the score exceeds the threshold, and predicts low diversity otherwise.
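OCA can be computed by a brute-force search over candidate thresholds. The following is a minimal sketch assuming labels of 1 for high and 0 for low content diversity; the function name is hypothetical, and it relies on the fact that only thresholds at the observed scores (plus one below all of them) can change the predictions.

```python
from typing import Sequence


def optimal_classifier_accuracy(
    scores: Sequence[float],
    labels: Sequence[int],
) -> float:
    """Best accuracy of the rule 'predict high diversity iff score > threshold'."""
    # One threshold below every score, plus each distinct observed score.
    candidates = [float("-inf")] + sorted(set(scores))
    best = 0.0
    for threshold in candidates:
        correct = sum(
            (score > threshold) == bool(label)
            for score, label in zip(scores, labels)
        )
        best = max(best, correct / len(scores))
    return best


# Toy data: the second and third sets are out of order, so one error remains.
toy_oca = optimal_classifier_accuracy([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

On the toy data the best threshold misclassifies exactly one set, giving 0.75. A metric whose scores separate the two classes perfectly reaches 1.0.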

Table 5 shows the test results. This time, n-gram-based metrics perform poorly, indicating they do not measure content diversity well. Neural models perform better than n-gram-based metrics (especially sent-BERT), but there is still a clear gap between automatic metrics and humans. Figure 4 illustrates the typical distributions of n-gram, neural and human metrics. Clearly, HDS separates high and low content diversity much better than neural metrics. In addition, n-gram-based metrics saturate both classes to near maximal values, similarly to test #1.

Figure 4: Test #2: histograms of metric values of n-gram (distinct n-grams), neural (BERT-Score) and human (absHDS) metrics for promptGen. The orange histogram represents the distribution of the low content diversity class, the blue histogram represents the distribution of the high content diversity class, and brown is the intersection between the two. Downward-pointing triangles mark the thresholds of the optimal classifiers. The histograms show how each metric separates the two classes.

Since test #2 isolates content diversity, we used aspHDS to ask workers to directly rate content diversity and form diversity. Content aspHDS gets similar scores to absHDS, implying that there is no additional gain in asking directly about the tested aspect. Form aspHDS gets substantially lower scores than absHDS, validating that the form diversity of the two classes is similar.

6.5 HDS Stability: Picking Parameter Values

HDS experiments demand expensive human labor. Thus, we need to carefully choose the number of sets and the number of ratings we ask for per set, to get reliable results within a reasonable budget. In Figure 5 we measure HDS results for different numbers of sets and different numbers of ratings. Empirically, the test results are stable starting from 7 ratings and 150 sets. Hence, we used 10 ratings and 200 sets for HDS experiments.
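One way such a stability check could be run is by repeatedly subsampling the collected sets and ratings and watching how the test statistic varies. The sketch below works under stated assumptions, since the text does not specify the exact procedure: the HDS score of a set is taken to be the mean of its sampled ratings, the statistic is the single-threshold accuracy, and the names `threshold_accuracy` and `subsample_stability` are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def threshold_accuracy(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Best accuracy of 'score > threshold => high diversity' over all thresholds."""
    candidates = [min(scores) - 1.0] + sorted(set(scores))
    return max(
        sum((s > t) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        for t in candidates
    )


def subsample_stability(
    ratings: List[List[int]],
    labels: List[int],
    n_sets: int,
    n_ratings: int,
    repeats: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and standard deviation of the statistic when only n_sets response
    sets and n_ratings ratings per set are used.

    ratings[i] holds the human ratings collected for response set i and
    labels[i] is its class (1 = high content diversity, 0 = low)."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        chosen = rng.sample(range(len(ratings)), n_sets)
        # Assumption: a set's HDS score is the mean of its sampled ratings.
        scores = [mean(rng.sample(ratings[i], n_ratings)) for i in chosen]
        results.append(threshold_accuracy(scores, [labels[i] for i in chosen]))
    return mean(results), pstdev(results)
```

Calling `subsample_stability` over a grid of `n_sets` and `n_ratings` values and plotting the mean with its spread would give a picture comparable in spirit to Figure 5; a small spread at a given grid point suggests that the budget is sufficient.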

                    storyGen        respGen         promptGen
Metric              ρ      OCA      ρ      OCA      ρ      OCA
distinct-n          0.57   0.77     0.34   0.67     0.33   0.68
cos-sim             0.56   0.77     0.33   0.66     0.36   0.67
BERT-STS            0.6    0.78     0.46   0.72     0.65   0.82
sent-BERT           0.77   0.90     0.59   0.79     0.68   0.81
BERT-score          0.59   0.77     0.49   0.74     0.4    0.69
absHDS              0.85   0.95     0.63   0.81     0.78   0.89
aspHDS (form)       0.35   0.65     0.56   0.79     0.4    0.68
aspHDS (content)    0.84   0.94     0.67   0.83     0.75   0.88
Table 5: Test #2 results: Spearman’s ρ correlation between a set’s class (1 = high content diversity, 0 = low content diversity) and each metric score, and the optimal classifier accuracy (OCA) between the two classes over each metric’s score.
Figure 5: Test #2 absHDS results as a function of the number of ratings per set and the number of sets.

7 Aspects of Diversity

In this work, we focused on the two primary aspects of diversity: content diversity (What to say?) and form diversity (How to say it?). In Figure 1, both sets are diverse, but Set B is only form diverse, as all answers deliver the same message, whereas Set A is diverse in both form and content.

Furthermore, aspects of diversity can be viewed as having a tree-like structure, in which both content and form diversity divide into sub-aspects. Content diversity (e.g., in answering the question “How are you today?”) can be expressed through different sentiment (“I’m doing good.” vs. “I’m so glad you asked! I’m really doing good.”), different relevance (“I’m fine” vs. “Did you see the game last night?”), and more. Form diversity can be divided into sub-aspects as well: syntactic diversity (“Someone took it from me.” vs. “It was taken from me.”) or lexical diversity (“I feel fine.” vs. “I feel very well.”). Even those sub-aspects can be further divided. For example, a sub-aspect of lexical diversity is register diversity (“How are you?” vs. “Sup bro?”).

Another observation is that different aspects are not orthogonal, that is, changing one aspect may lead to changes in other aspects. Specifically, we observe that while it is relatively easy to produce high form diversity with low content diversity (Set B in Figure 1), it is almost impossible to diversify content without changing form. This observation was important during the design of test #2.

8 Conclusions

This work presents a novel framework for evaluating diversity metrics as a step toward standardized evaluation. We limit the scope of this work to the differences between form and content diversity, which we consider key towards understanding the different aspects of diversity. Future work can explore other sub-aspects of diversity as detailed in §7, e.g., testing sentiment diversity, as proposed in §3. We urge researchers to use this framework as a platform for developing new diversity metrics and establishing their efficiency.

Acknowledgements

We thank Aya Meltzer-Asscher for linguistic advice, and Ben Bogin, Mor Geva, Omer Goldman and Ohad Rubin for their useful suggestions and references. This research was partially supported by The Israel Science Foundation grant 942/16, The Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC DELPHI 802800).

Appendix A HDS Questionnaires

All human scores for HDS metrics were collected on the AMT crowdsourcing platform from native English-speaking workers who were specifically qualified for this task. Figure 6 presents the warm-up part, common to all HDS questionnaires. Before asking workers to rate the diversity of each set, we first asked them to generate a response for the context themselves, to make sure they read it. To neutralize the effect of the responses’ quality on the workers, we also asked the workers to rate the quality of the first response in the set, and then explicitly instructed them to ignore quality when rating diversity.

Figures 7, 8, 9 and 10 present the diversity questions of absHDS, aspHDS, rnkHDS and simHDS, respectively, as they appeared in the AMT questionnaires.

Costs

For HDS metrics that require one query per response set (i.e. absHDS, rnkHDS, aspHDS), the cost for a single rating was . We collected 10 ratings per response set and conducted each experiment with 200 sets, hence the total cost for an experiment was . In the case of simHDS, the response set size was , and the number of queries needed per set is . The cost of a single rating for this task was $0.056, and with the same multipliers, the total cost for an experiment was , three times more expensive.
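The cost arithmetic described above is a simple product, sketched below. Only the $0.056 price per simHDS rating is stated in this appendix, and the 10 ratings and 200 sets come from §6.5. The set size of 5 and the idea that simHDS needs one query per unordered pair of responses are illustrative assumptions, and the function names are hypothetical.

```python
from decimal import Decimal
from math import comb


def experiment_cost(
    price_per_rating: Decimal,
    queries_per_set: int,
    ratings_per_query: int,
    num_sets: int,
) -> Decimal:
    """Total cost = price per rating x queries per set x ratings x sets."""
    return price_per_rating * queries_per_set * ratings_per_query * num_sets


def pairwise_queries(set_size: int) -> int:
    """Number of unordered response pairs in a set.

    Assumption: simHDS issues one similarity query per pair of responses."""
    return comb(set_size, 2)


# Hypothetical example: a set size of 5 is assumed, not taken from this appendix.
sim_cost = experiment_cost(Decimal("0.056"), pairwise_queries(5), 10, 200)
print(sim_cost)
```

Using `Decimal` rather than floats keeps the monetary product exact. Under these assumptions the number of queries grows quadratically with the set size, which is why a pairwise metric becomes expensive quickly.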

Figure 6: Warm-up part, starting each AMT HDS task. It includes the context and a single response generated by the tester. The worker is asked to generate a response of their own and to rate the quality of the tester’s response.

Figure 7: absHDS question along with the evaluated response set (Test #2 in this case).

Figure 8: aspHDS question (content in this case). The response set is the same as presented for absHDS question.

Figure 9: rnkHDS question along with the two evaluated response sets.

Figure 10: simHDS question along with the two evaluated responses.

Appendix B Data Samples

B.1 Test #1

Tables 6 to 14 present data samples from storyGen, respGen and promptGen with the neural testers of test #1, as detailed in §6. Each table presents two contexts and three response sets per context. Each response set was generated with a different value of the decoding parameter, for each of the three decoding methods: softmax temperature, nucleus sampling, and Top-k.
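To make the three decoding parameters concrete, the sketch below applies each of them to a toy next-token distribution. It is an illustration only, not the code used by the neural testers: the function names and the example logits are hypothetical, and a real generator would apply these steps to the model's full vocabulary at every decoding step.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens whose mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


# Hypothetical logits for three candidate next tokens.
logits = {"fire": 2.0, "minor": 1.0, "scary": 0.0}
base = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)
print(base, sharp, top_k_filter(base, 2), nucleus_filter(base, 0.9))
```

In all three cases the parameter controls how much probability mass is left for less likely tokens: a lower temperature, a smaller k, or a smaller p concentrates sampling on the top candidates, which matches the near-identical responses seen in the leftmost response sets of the tables below.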

Context Response set () Response set () Response set ()
Fire next door. John woke up smelling like something was burning. He went outside. He saw the fire next door. He called the authorities. It was a minor fire and they put it out. It was a fire. It was a fire. It was a fire. It was a fire. It was a fire. It was a minor fire. It was a fire. It was a fire. It was a minor fire and they put it out. They arrived and put out the fire. It was a fire. It was a fire. It turned out to be a fire. It was a minor fire night. They arrived and put it out. It was a scary but beautiful fire next time he went in. It was a fire. It was a put out and John was able to put it out. It was a scary place to live. It turned out to be a mechanic. Before the fire was put out it was a fire. It was a fire. They co-worker matter how bad the fire was. Several shells, the fire department came just in time. They arrived and put out the fire. It turned out to be a clear, fact. It was a fire. It was a scary meal for wet clothes. It was a scary stealing from him.
Charles Shaves His Chest. Charles decides he no longer wants chest hair. He shaves his chest hair with a razor. Although his skin is irritated, he hair is now gone. Charles is satisfied. He is glad that he finally shaved. He is glad that he finally shaved. He is very happy to have a chest hair. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that he finally shaved. He is very happy to have a chest hair. He is glad that he finally shaved. He is very happy that he no longer has chest hair. He is glad that he finally shaved. He is glad that he finally shaved. He is glad that his chest hair is no longer shaved. He is glad that he finally shaved He is glad that he finally shaved. He is glad that he finally of shaving. He is glad that he finally shashock his chest hair. He is glad that he finally peaceful his voice again. He is glad that he finally shaavoided his chest hair. He is glad that he finally downcome. He is ready to start better cooperation his fireplace. He is glad that he finally shaved. He is glad that he no longer has chest hair. He is happy to finally be shaved. He is very happy to have ster good ride of his life. He is glad that he finally shaved. He is glad that he finally has chest hair.
Table 6: Test #1 data samples for storyGen task and different temperatures.
Context Response set () Response set () Response set ()
Really Bad Decisions. Jake was nervous about a meeting at work the next day. He decided to have a drink to relax. Unfortunately Jake kept drinking. He was really hung over the next day. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He did not get to the meeting anymore. He missed his meeting. He passed out and failing the meeting He missed his meeting. He missed his meeting. He missed his meeting. He missed his meeting. He passed out and was kicked out of the meeting. He missed his meeting. He missed his meeting.
Family Night Food. Tonight, my mom ordered Mexican food for family night. She got it from my favorite Mexican place in town. When it arrived, it was hot and smelled wonderful. We devoured it with gusto. After a few hours of take it home we all enjoyed its night. After a few hours of take it home we all enjoyed its night. After a few hours of take it home we all enjoyed its night. After a few hours of eating everyone was satisfied. After a few hours of take it home we all enjoyed its night. After a few hours of eating everyone was satisfied. After a few hours of take it home we all enjoyed its night. After a few hours of take it home we all enjoyed its night. After a few hours of take it home we all enjoyed its night. After a few hours of eating everyone was satisfied. After dinner, we all went home to cook Mexican food. After a few hours of cooking she was tired and ready to eat. After dinner, I always put got ready for Christmas. After dinner, I helped her do the dishes. After a few hours of dinner, the food was amazing. After a few hours of take it home we all enjoyed margaria. After dinner, I was ready to take on work the next day of After dinner, I was sad to say goodbye to her After a few hours of take it home we all enjoyed one bite. After a few hours of eating everyone was satisfied. Even though my stomach was gone, I was sad it was finally pockets After dinner, I alone. All in all the family while my mom finished the food. After a few hours of dancing, she pianed. Afterwards I’d never finish single night a week. Afterwards we all went to sleep in my woods. Afterwards I helped her do much better than my wife. Afterwards my mom helped me finish my dinner. After a few hours we all enjoyed storm blood from the Italy. After dinner, we all enjoyed some good food together.
Table 7: Test #1 data samples for storyGen task and different p values (nucleus sampling).
Context Response set () Response set () Response set ()
Loud Noise. Jane was trying to take a nap. She heard a loud bang in the kitchen. It woke her up. A dish had fallen off the counter. Jane pulled over and started to clean herself. Jane was horrified and dropped her favorite food. Jane was able to finish her car and take a nap. Jane was able to finish her nap . Jane was able to finish her nap . Jane stopped at the store to buy a new one. Jane was able to grab a nap afterwards. Jane was unable to finish her nap since it was lost. Jane pulled over and started to clean up. Jane was able to finish her nap . Jane was able to catch a car using the seat. Jane stopped at the store to buy a new book. Jane was sad her cat dropped out of the kitchen. Jane screamed. Jane was horrified to find her car broken down on the floor. Jane was horrified and dropped her pay phone. Jane was easily able to grab a nap. Jane pulled over and started to cry. Jane pulled over and started to cry. Jane stopped at the store to buy a new dish from the store. Jane comes, noticed a lot of food left under it. Jane was horrified and dropped her book. Jane remembered to take a nap. Jane was since she took a nap while she waited for the refun Jane knew she had no time to finish her book. Jane was glad. Jane was annoyed and began to cry. Jane stopped at the store to buy a new one. Jane wanted to have her car back. Jane was monthed.
Headache. Kate was wearing big over the ear headphones. But they were tight and squeezing her head. She tried to adjust them to relieve the tension. But nothing really worked. Kate decided to go to the store and buy some ear phones. She decided to go to the store and buy some headphones instead. She decided to go to the store and buy some ear phones. She decided to go buy a pair of headphones instead. She decided to go to the store and buy some headphones instead. She decided to go to the store and buy some headphones instead. She decided to go to the doctor and have some rest. Kate decided to go to the store and buy some headphones instead. She decided to go to the store and buy some ear phones. She decided to go buy some ear plugs. She decided to go buy a hat so she could enjoy the the long lin Kate decided to go to the store to buy some candy since she was ti Kate decided to go outside to rest. Kate decided to go to the store and buy some ran phones. Kate decided to go to the store and buy some headphones instead. Kate decided to go back to her old dishes. She decided to go buy a big pair of headphones instead. Kate decided to go to the store and wearing some headphones. She was forced to go to the store to buy some cash. She decided to go to the store and buy some headphones instead. She decided to go to a ess instead. Kate decided to go to the store and buy some fake headphones. She decided to go to the store and buy some ear phones. She decided to go to the store and buy some headphones instead. She decided to go to the store and buy some headphones instead. She decided to go buy some ear phones. Kate decided to go to the gym and use some saw no more watching T Kate decided to go to the store and buy some cute phones. She decided to go buy a fake headphones instead. She decided to go to the store and buy some concert.
Table 8: Test #1 data samples for storyGen task and different k values (Top-k).
Context Response set () Response set () Response set ()
it is doing okay i just want people to talk more i think you mean people i think its working now i just want people to talk more i think i just want people to talk more i dont think you understand what youre saying i like to think so i know that feel i dont know what to say i think its pretty cool i think its working now i cant believe you didnt think this through i actually havent had may i ask why i just hope its not going to happen its alright dont even want to talk what did it say i was using windows 95 anyway xd thanks for reading the thread i understand and hope they dont mind that want to talk more its doing ok body language great voice acting and music all good i know your pain im just hawaii everyone wants to be friends lt go talk to people from its pretty smart we didnt run a year ago and have turned back but id lick your teeth but tha t sucks smart people it really didnt stop i think you need to go back to normal now sounds like a plan to me we we can try forever practice beer okay but another two more 200 radio views im going to ore my comment of team building looking bro yep they said that things have changed yeah we thank you random stranger
girls who say no not like it matters i dont like it at all i dont like it either i dont like it either i think you mean girls who say no not like it matters i think that is the most accurate description of this thread i like this one girls who say no dont like it matter i dont like it either i know a guy who says no and he doesnt like it girls who say no dont like it matter wait so there are girls who say no like it matters i have a friend who doesnt like it but i always say no and you dont like to play with girls i say no i dont want it to girls who say no dont like it care i see what you did there girls who say no need to make eye contact with girls girls who say yes dont like it does i really dont care if its not gay or straight out programs or not ugh that game was just awful troll how about mine like sales figure instead of higher definitely not as stupid as that agreed is that true refer to my highest iq you and me less keep it up dude ive never seen such an complaining member with a legitimate thought not that theres anything wrong with that or are more wise than telling want girls make videos youve never met gtthe girls who say no or a one time thing people thats which is weird since it is a girl i know so sad
Table 9: Test #1 data samples for respGen task and different temperatures.
Context Response set () Response set () Response set ()
do you know if he still lives in new orleans i dont think so i dont think so i dont think so no i dont think so no idea i dont know i dont think so no i dont think so no idea i just saw him live in new orleans i dont think so no idea i just saw him live in new orleans he is in nyc he does i hope he does i dont think so he lives in new orleans yes he does he is in florida im guessing not that i know of hes a small town i dont know what happened there nope still in new orleans i do not i dont know if he still lives in new orleans i dont think so aw screw you guys the gimme a shower and ill send you my money he does im so confused by how this is relevant they spent all that much in new orleans up in first place still living in new orleans according to most of them it jesus mustve worth anyone else want to wait for him to go back home how many tickets would you say the willing said if he warn you that hes im guessing he lives there as well yep rick albeit not he lives in ny no i dont get it any more guess i may canadian
its that hanging thing in the back of your mouth its a tattoo of a hand its a good thing its not i think its a banana i was thinking the same thing its a very nice thing to do its a hole in the bottom of my mouth i think its a good thing its a pretty cool thing to do thats what i thought i think its a butt well it is kind of like a mac that would be a good idea its my cake day you can see it in his mouth its the junk that was pooping in i was thinking the same thing i always see that i didnt know that was possible this is my favorite comment of the day thats not a sticker i think you just flash it is and feels good my ex and i reach our arms cheating into the door for almost a minute my mind went to side with that yep terrible for my teeth exactly what i was thinking why do i look so contest and where else are you going just dont wear a helmet and it looks like something some one would give to get in on
Table 10: Test #1 data samples for respGen task and different p values (nucleus sampling).
Context Response set () Response set () Response set ()
watching curry play in his prime is truly a privilege i know i just dont want him to play for us he has to be a good center for that he is a great center of football in his prime hes been playing in his prime for a long time he was a great back in the day hes been playing for a while now i dont know about that he was pretty damn good at that i dont think he was ever in his prime i dont think he is a prime minister i dont know why but i think he is a very good player and and his career as well agreed the way hes playing is awesome it has to be this is just called a job and then being on the field for the first time i dont see him doing that often enough he just likes to party in the kitchen at this point hes going to be a great star for the rest of the only if he pays well the only thing that can make that kind of difference is how much time you yeah my feeling i mean we dont like it but it happens all the you are one for real they still have a rule saying they might not be injured yet it really is a necessary thing to do finally some reason to continue watching him at some point yet that would be epic not to mention eating curry dinner is a privilege i just dont want to turn over for this goal like he does in gt playing in his prime is truly a privilege ftfy so is saying he is in high school
im going this evening when she usually works i think you accidentally a word you are a good man i hope she works im going to the same time as you when she usually works i am so sorry to hear that i hope she works for you i am so jealous of your work i am so jealous i hope you have fun i hope you get a job at a local bar i hope she works for you i see what you did there where in the world are you going to put your socks on she usually does but she has to keep up to date with her in the awesome what do you want for it how does a women have a relationship with someone did you get the job thats where she goes i know im also going this friday thats a great idea well make sure youre there too good luck thats what you like to hear go see her later this evening i read that as she usually fails okay ill see you there brah i always thought that it was for the sake of having the girl play as that can be expected here thats it im down now i dont see why it would matter if shes married i will be the ex gf she still does make her phone calls the whole office
Table 11: Test #1 data samples for respGen task and different k values (Top-k).
Response set () Response set () Response set ()
Not the hacking. The hacking is the fact that the DNC was hacked. ! Not the hacking. The hacking is the real problem. The hacking is the! Not the hacking of the DNC, but the leaks of the emails of the Democratic National Committee.! Not the hacking, but the way it was done. The FBI’s investigation into the! Not the hacking of the DNC, but the hacking of the emails of the Democratic National Committee.! Not the hacking of the DNC, but the leaking of the emails. The DNC’s! Not the hacking of the DNC. The hacking of the DNC was a ”false flag! Not the hacking of the DNC. But the hacking of the RNC. The DNC hack! Not the hacking. The hacking is the problem. The hacking is the problem! Not the hacking of the DNC, but the leaking of the emails. The DNC was hacked,! Not the hacking after all? I’m sure the nation-states that are involved in! Not the hacking that happened on the internal networks of the Energy Department. In fact, according to! Not the hacking of the American public but rather the fraudulent Heisenberg principle that seemed to be! Not the hacking that took place in the DNC last year or the release of hacked emails during the! Not the hacking futurists Cardboard inventor and self-described tinkerer Dennis! Not the hacking alone. In the first half of the report, the hackers tried to create fake! Not the hacking. The hacking is the NSA’s new SHIELD technology. It is! Not the hacking and hacking and hacking of the world government. I know this man is a man! Not the hacking aspect, but the pressure exerted by the Trumpistas. But also the Russia angle! Not the hacking, but the willingness.” The evidence of interest in this case comes in! Not the hacking experience of a CIA VRO crunch nine months ago—JumpStart for 2016 jumps! Not the hacking, David.) The directory was flagged in a document it created in late last year! Not the hacking of Democratic Party systems - said the Russian team’s activity represented ”just the beginning! 
Not the hacking, of course – which these sources sounded more concerned about than being attacked 140 times! Not the hacking story is over. But yet there’s another reason not to rush out such statements! Not the hacking-either.- These were scattered in the workshop.(Expanded- being guys with! Not the hacking of private material of elected officials, e.g. emails, even if the! Not the hacking has happened yet!!!!!!!!!!!!!! Not the hacking rumours have cost him any of his followers, least of all the proprietors of! Not the hacking group behind the breach of Sony, which has posted the staffer’s information online,!
How is our new technology helping us to do that? We are using a new technology! How is our system different from that of the United States? The United States is a! How is our approach different from that of the other major European countries? The European Commission! How is our country going to be able to compete with the rest of the world if we don! How is our country going to be able to compete with China in the future?” he asked.! How is our work different from that of other organizations? The work of the Center for! How is our work different from other research in this area? We are not the first! How is our system of government supposed to work? The reason we have a government is! How is our system different from the one that was used in the past? The system! How is our country supposed to be a beacon of hope for the world if we have to look! How is our government going to catch up with the cyber criminals?” he said. ”I’m! How is our society selling humanity on slavery? The answers to these questions are also important for us! How is our minister giving it to you? Isn’t it? It’s got a bit of! How is our research different from other studies? This study examined the effects of peer-! How is our mission different from Seniors’ Service Corps (SSC) other than the fact! How is our challenge different? The only difference is that this challenge is about building an! How is our nation governed?” As Obama moved into his second term, he is increasingly! How is our recommendation different from what more traditional veterinarians do? We don’t believe! How is our rapid abandonment of critical thinking, knowledge, and values, and the subsequent burial of! How is our education system designed for our futures? We are the children of immigrants,! How is our Internet even even connected with our corporate tracks? Every cell phone on the planet knows! How is our developer name attached to the icon? Since the Planetside icon is use internally! How is our food paradise created? 
Artificial chemical fertilizers. So these aren’t GMOs, but! How is our acquisition* worth - BOARD ROLL (Least Significant Equivalents)! How is our transit plan addressing this problem? Under our old plans, Burlington Buses! How is our mind different than any other part of the body?” A Broader View! How is our campaign working? Bitcoin launches alongside psychological research showing that people pay a lot! How is our mentioning application related to a related method (#five with two in queue) page such! How is our having to resort to roundabout hypotheticals to argue that Stewart may secretly want! How is our blood working out for you?” a statewide voter got an outpouring of rename and!
Table 12: Test #1 data samples for promptGen task and different temperatures. Bold text is the 3-word prompt context.
Response set () Response set () Response set ()
So that’s the story of the last few years. The current political climate is not! So that’s the end of the first part of this series. I hope you enjoyed it.! So that’s the first thing I want to say. I’m not going to be the guy! So that’s the thing about being a professional. You have to be able to handle the criticism! So that’s the way it is. I don’t think there’s any way to change it! So that’s the problem. It’s not just that the government is failing to protect! So that’s the thing about this. It’s not just about the money. It’s about! So that’s the end of the story. The next step is to create a custom! So that’s the case. So, what’s the problem? Well,! So that’s the first time I’ve ever seen a real one. I’m not! So that’s the state of the campaign. Now, what I do want to talk about is! So that’s the thing: For as much as I love TLC, it’s hard to! So that’s the idea, anyway. The last two seasons have been about doing that. It! So that’s the end of the half-hour segment. The next half-hour! So that’s the situation we’re in,” he said. ”We’re in the! So that’s the thing, I don’t know if you know, but in general it’s! So that’s the difference between the kinds of things that people will be talking about on Wednesday,! So that’s the $2.3 billion. Here’s the issue: You’re! So that’s the standard for using memcpy(). It’s fine to use memc! So that’s the next step, and the next step is to try to figure out what’s! So that’s the first time you want to punch somebody, not miss before.” The Seahawks would! So that’s the science behind the Broadwell-E processors from Intel that Intel launched last fall! So that’s the instinct from other teams, that they’re a headache. - Ramsay MacDonald,! So that’s the white whale right there about too much debt. And then what you! So that’s the end of our discussion about the causes. What happens when we look at the! So that’s the cover of inhibition against ”chronic” or ”adaptive” stimulants! 
So that’s the way the story goes, but exactly how is cloud providers going to restrict Their! So that’s the beginning, the beginning of the show, I guess five minutes.” ! So that’s the Indie Mobile Game Week Honoring Winners!!!!!!!!! So that’s the reason I’m writing, that’s why you don’t understand why people know!
do you listen to the music?” ”I don’t know. I don’t listen! do you listen to them?” ”I do,” he said. ”I’m not! do you listen to the voices of the people?” ”I do,” said the king! do you listen to the song?” ”I don’t know . I don’t know! do you listen to the music?” ”I do.” ”You’re not! do you listen to the news? I do. I’m a big fan of the! do you listen to me?” ”Yes, I do.” ”I’m! do you listen to the other side?” ”I don’t know. I don’t! do you listen to the other side?” ”I do,” said the boy. ”! do you listen to the news? No, I don’t. I don’t listen! do you listen to the current draft? I listen to the current draft. I’m! do you listen to it?” It’s easy to hear the ”why?” but when! do you listen to the people that come here?” ”No, I’m too busy! do you listen to the thing?” ”Of course I do. I’ve been reading! do you listen to those who are opposing it, who want to create a situation in which a! do you listen to music or watch TV? How often do you cook or clean? How much! do you listen to them? It’s like the first time you got into something and it just! do you listen to your father? We’ll leave it to the gods to decide.” ! do you listen to music? I like to listen to music, but I don’t really know! do you listen to my story and see if you like it?” ”I think you! do you listen to Human Fly?, which YouTuber Nico Perri collaborated on, and Google! do you listen to the acapella lyrics out of context and express the feeling?” It’s! do you listen to Michael Kiwanuka-Smith who writes, ”The American Journalism Review discern! do you listen to my songs as I said,” Ramckhalter said. ”You feel! do you listen to U.S. 90 night at this time of the year? ! do you listen to that as well?” ”The question was not, ’Who is! do you listen?” He asks, leaning forward as he woodenly talks to him. ”Listen! do you listen to those books and sway him so much? No. He was deeply brainwashed! do you listen?’ Simon(lol).I feel like i’m in a Kurdish Genocide. I! 
do you listen to value authenticated queries from your menu when running count? And if not, then!
Table 13: Test #1 data samples for promptGen task and different p values (nucleus sampling). Bold text is the 3-word prompt context.
Response set () Response set () Response set ()
I’m kidding. You can’t do that. ”I’ve been trying to find! I’m kidding. You can’t be serious.” The man was a little surprised.! I’m kidding. You can’t do that, because the game’s a little bit different.! I’m kidding. You’re the only one who can do it, you know. You’re! I’m kidding. You can have it. ”It’s not a bad idea,”! I’m kidding. You don’t need to be so serious.” ”No, I! I’m kidding. You’re not. I’m just a little bit of a dickhead.! I’m kidding. You’re not. You’re not. I know. I know. You! I’m kidding. You can’t do it. ”I don’t want to be! I’m kidding. You’re not going to be able to see it, and you’ll be! I’m kidding. You’re too nice. We need someone in the government to talk! I’m kidding. You just sit there, I’ll have the other side do it. What! I’m kidding. You can be my roommate for the holidays in a few weeks. You don! I’m kidding. You can get the full version of the file above, and also the original! I’m kidding. You don’t look very strong.” ”You look as strong as! I’m kidding. You’re right about the last guy. He’s a nice guy. But! I’m kidding. You’ll be happy to know it’s the first time that’s happened since! I’m kidding. You will come down with the same problem, or some other sort of problem! I’m kidding. You might have seen me in a lot of other things. I’m actually! I’m kidding. You should go and see a doctor. In fact, I’m! I’m kidding. You’re kidding?” ”I’m not.” ”Why! I’m kidding. You’re not.” ”What? A dick of the heart?! I’m kidding. You’re looking at a new version,” said Zilch, who was! I’m kidding. You know when someone takes to the streets to protest? It’s common for! I’m kidding. You are definitely a complete free agent,” said Caruthers. ! I’m kidding. You can have another at first, but don’t start just jumping ahead/! I’m kidding. You’re just a teenager, aren’t you?” It ends there, your! I’m kidding. You were never fully persuaded.” ”Perfect, I am not,”! I’m kidding. You are also in a worse case scenario for someone who was on $2500! 
I’m kidding. You know…” ”I should have stopped him; I shouldn’t!
Where did he go?” I ask, looking at him. ”I’m not sure. He! Where did he get the idea to do this? He had been working on a book! Where did he come from? He was born in the city of Karkaros! Where did he go?” I asked. ”I don’t know,” she said. ! Where did he go?” ”I think he went to the hospital,” she said.! Where did he get the idea for the name? I think it’s a combination of! Where did he get the idea to make a movie about the Holocaust? ”I had a lot! Where did he get that idea? ”I was just trying to make a statement,”! Where did he get that from? He’s a very good writer. I don’t know what! Where did he go? Where was he? Where was he? He’s gone. ! Where did he come back from? [The Doctor is sitting in a chair. Amy! Where did he find the money?” asked a reporter from the BBC. ”Is anybody else there! Where did he grow up?” But the boy answered, ”He always loved to read! Where did he get that idea?” he asked. ”I didn’t know. I’ve never! Where did he come from?” You’re looking for that missing piece. Maybe you’re missing the! Where did he come from? He was, I think, from a small island about midway between! Where did he come from, to be sure?” he asked, ”I know he came from! Where did he go? [A little while later] I am about to say this! Where did he hear about my story? I couldn’t tell you. He’d only heard of! Where did he come from? From a place called ”the City of the Sun.”! Where did he at the time in his day seek the God he worshipped? He said: ”! Where did he earn his master’s degree? He is part of a class of doctoral students who! Where did he learn to play guitar?” I asked, puzzled. ”Before I joined! Where did he come from?” ”Australia,” said Peter. ”How could! Where did he hear this, you might ask? Of course, he’d heard of it.! Where did he go? He’s probably dead – or dead and buried within the walls! Where did he earn $150 million on his way to a $5 billion makeover? ! Where did he learn to make his own sticks, or for that matter, hang a stick on! 
Where did he learn to skate, anyway? Go here and watch this beautiful skater! Where did he get this idea from? What do you think about it? I get!
Table 14: Test #1 data samples for promptGen task and different k values (Top-k). Bold text is the 3-word prompt context.
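The three decoding parameters varied in the tables above (temperature, nucleus sampling p, and Top-k) each reshape or truncate the next-token distribution before sampling. The following is a minimal sketch of that idea on a toy distribution, not the generation code used in the paper; the function names and the toy probabilities are illustrative assumptions.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution as p_i ** (1 / T), then renormalise.

    This is equivalent to dividing the logits by T before the softmax:
    T < 1 sharpens the distribution, T > 1 flattens it.
    """
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of most probable tokens
    whose cumulative probability reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


# Toy next-token distribution (hypothetical values, for illustration only).
toy = [0.5, 0.3, 0.15, 0.05]
for t in (0.5, 1.0, 2.0):
    print("temperature", t, "entropy", round(entropy(apply_temperature(toy, t)), 3))
print("top-k, k=2:", top_k_filter(toy, 2))
print("top-p, p=0.9:", top_p_filter(toy, 0.9))
```

Higher temperature, larger p, and larger k all leave more probability mass on lower-ranked tokens, which is consistent with the samples becoming less repetitive from the first to the last response set in each table.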

b.2 Test #2

Tables 15, 16 and 17 present data samples from storyGen, respGen and promptGen, respectively, written by the human testers of test #2, as detailed in §6. Each table presents two contexts and two response sets per context: one for the high content diversity class and one for the low content diversity class.

Context: Sold Out. Jane wanted to watch a big new action movie. She had been waiting a long time for it to come out. When tickets became available she was too busy. By the time she had a chance to buy some it was sold out.
Response set (high content diversity):
Jane cried over the fact that she couldn’t watch it and just gave up looking for a ticket.
Jane decided to look for a scalper that would sell her the ticket for the movie that she really wanted to see.
Jane thought it was okay since she can still have a chance to watch it once it gets uploaded in video and movie streaming applications.
Jane posted a status on her social media accounts asking her friends for any spare ticket that she is willing to buy.
Jane resorted to contacting her old friend who is working at a huge movie theater hoping she can help her get a ticket.
Response set (low content diversity):
Jane remembered that she has an old friend who is a manager at a big movie theater so she contacted that friend in the hopes that she can buy any spare ticket.
Desperate to watch the movie, Jane called her friend, who works at a movie theater, asking for a ticket to that movie.
Jane recalled that her friend works at a movie theater and hoped that she can help get a ticket for that movie.
Jane decided to look for her friend who could possibly have access to tickets for that movie since that friend currently works at a movie theater.
Jane realized that her friend might have spare tickets since she is a manager of a movie theater showing that film.

Context: Beavers. My friend has some beavers in his backyard. They come up from the creek by his house. He invites my over and we watch them. We take pictures of them and send them to our friends.
Response set (high content diversity):
They are fascinating animals.
Our friends love getting the pictures.
Sometimes his dogs chase them.
They are building a dam on the creek.
They won’t let us get too close to them.
Response set (low content diversity):
They are busy gathering sticks to make a dam.
The dam they are building is almost complete.
It’s fascinating to see their workmanship building a dam.
They are turning the creek into a pond by building a dam.
They all work together with careful engineering to build a dam.
Table 15: Test #2 data samples for storyGen task.
Context: kill la kill is still going new episode every thursday
Response set (high content diversity):
That show sucks
OMG I can’t wait
I thought they canceled it
What channel is it on
I only watch nature programs on BBC
Response set (low content diversity):
Lead actor is soooo hot
Did you see the cliffhanger at the end of the season
I’ve been waiting for it to return for weeks
I’m totally gonna binge watch last season
I just got into this show and can’t stop watching

Context: places apple slices in a bowl so they’ll stay fresh
Response set (high content diversity):
Oh boy, I love apples.
I don’t need you telling me how to keep things fresh, take a hike.
Girl, you’re the fresh one around here.
This post might be better in the life hacks section.
This is actually a useful bit of advice.
Response set (low content diversity):
I find merit in this input.
That information will serve me well.
Thanks, that’s really good to know!
Such knowledge is certainly beneficial.
Wise words, I will heed them.
Table 16: Test #2 data samples for respGen task.
Prompt context: Suppose there’s an
Response set (high content diversity):
Suppose there’s an escape plan we haven’t thought of yet.
Suppose there’s an omelet that is the most amazing ever.
Suppose there’s an airplane ticket that’s even cheaper.
Suppose there’s an actual deadline for this paper.
Suppose there’s an event that we can go to this weekend.
Response set (low content diversity):
Suppose there’s an airline that costs less.
Suppose there’s an flight that isn’t as expensive.
Suppose there’s an air travel fare, but doesn’t cost as much.
Suppose there’s an way to fly there that is low cost.
Suppose there’s an flight going there and it’s not a lot of money

Prompt context: Nothing remotely like
Response set (high content diversity):
Nothing remotely like eating a big breakfast.
Nothing remotely like dancing with your wife at the wedding.
Nothing remotely like singing Justin Bieber’s greatest hits
Nothing remotely like falling down a hill
Nothing remotely like getting yelled at
Response set (low content diversity):
Nothing remotely like being super full and satisfied.
Nothing remotely like getting to taste many different foods.
Nothing remotely like starting the day off right.
Nothing remotely like doing exactly what I want to do.
Nothing remotely like feeding myself with great food.
Table 17: Test #2 data samples for promptGen task. Bold text is the 3-word prompt context.

Appendix C Additional Results

Comparing test #1 results of storyGen to other tasks (Table 2), this task is characterised by noisier scores for all metrics (Figures 3 and 11), hence lower ρ values and higher variance. A possible explanation is larger effect of on the distribution in this task.

Tables 4, 18 and 19 present the test #1 absolute scoring experiment using the temperature, nucleus sampling and Top-k decoding parameters as the diversity parameter. Top-k consistently yields lower ρ than the other decoding parameters, especially for the storyGen task. This implies that Top-k represents diversity less reliably than other methods.
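To make the reported quantity concrete, the sketch below computes Spearman's ρ between a decoding parameter and a metric's scores, and estimates a mean and standard deviation by bootstrap resampling. The bootstrap is an assumption made for illustration: this section does not state how the mean and standard deviation in the tables were obtained, and the toy temperatures and scores are hypothetical.

```python
import math
import random
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_rho(params, scores, n_resamples=200, seed=0):
    """Mean and standard deviation of rho over bootstrap resamples.

    Assumption for illustration only; resamples in which either variable
    is constant are skipped because rho is undefined for them.
    """
    if len(set(params)) < 2 or len(set(scores)) < 2:
        raise ValueError("rho is undefined for constant input")
    rng = random.Random(seed)
    n = len(params)
    rhos = []
    while len(rhos) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [params[i] for i in idx]
        ys = [scores[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        rhos.append(spearman_rho(xs, ys))
    return statistics.mean(rhos), statistics.stdev(rhos)


# Hypothetical data: one metric score per generated set, one temperature each.
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
metric_scores = [0.10, 0.15, 0.30, 0.28, 0.50, 0.70]
print("rho:", spearman_rho(temperatures, metric_scores))
print("bootstrap mean, std:", bootstrap_rho(temperatures, metric_scores))
```

A metric that tracks the decoding parameter monotonically scores ρ close to 1 regardless of the scale of its raw values, which is why ρ allows metrics with different ranges to be compared in the same table.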

Figure 11: Test #1: Scatter plot of n-gram-based (cosine similarity), neural (BERT-STS) and human (absHDS) metrics as a function of temperature for storyGen. Each point corresponds to a single generated set. Error bars of HDS represent the standard deviation over 10 annotator ratings.
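For readers unfamiliar with the n-gram-based cosine similarity metric named in this caption, the following is a minimal sketch of one common formulation: each response is represented by its n-gram counts, and set diversity is taken as one minus the mean pairwise cosine similarity. The whitespace tokenisation, the default n, and the "one minus" convention are assumptions and may differ from the paper's implementation.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_counts(text, n=1):
    """Count the n-grams of a lowercased, whitespace-tokenised string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_diversity(responses, n=1):
    """One minus the mean pairwise cosine similarity over a response set."""
    if len(responses) < 2:
        raise ValueError("need at least two responses")
    sims = [
        cosine(ngram_counts(a, n), ngram_counts(b, n))
        for a, b in combinations(responses, 2)
    ]
    return 1.0 - sum(sims) / len(sims)


# Two small response sets in the spirit of the low/high examples above.
print(cosine_diversity(["they are building a dam", "they are building a big dam"]))
print(cosine_diversity(["they are fascinating animals", "sometimes his dogs chase them"]))
```

Because the representation is purely lexical, two responses that express the same content with different words still count as dissimilar under this measure.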
Metric      Temperature  Top-p        Top-k
distinct-n  0.76 (0.03)  0.69 (0.03)  0.2 (0.06)
cos-sim     0.71 (0.04)  0.66 (0.03)  0.16 (0.06)
BERT-STS    0.64 (0.04)  0.58 (0.04)  0.2 (0.07)
sent-BERT   0.65 (0.03)  0.59 (0.04)  0.17 (0.06)
BERT-score  0.69 (0.04)  0.61 (0.04)  0.23 (0.05)
Table 18: Test #1 results for different decoding parameters: Spearman’s ρ (mean and standard deviation) of automatic metrics for storyGen.
Metric      Temperature  Top-p        Top-k
distinct-n  0.89 (0.01)  0.84 (0.02)  0.64 (0.04)
cos-sim     0.89 (0.01)  0.78 (0.03)  0.62 (0.05)
BERT-STS    0.81 (0.02)  0.74 (0.03)  0.56 (0.04)
sent-BERT   0.80 (0.02)  0.63 (0.05)  0.51 (0.04)
BERT-score  0.87 (0.01)  0.77 (0.03)  0.6 (0.05)
Table 19: Test #1 results for different decoding parameters: Spearman’s ρ (mean and standard deviation) of automatic metrics for respGen.
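As a reference point for the distinct-n rows in these tables, the sketch below shows one common definition of the metric: the number of distinct n-grams in a response set divided by the total number of n-grams. Whitespace tokenisation and pooling over the whole set are assumptions; the paper's implementation may differ in tokenisation or in how several values of n are combined.

```python
def distinct_n(responses, n=1):
    """Ratio of distinct n-grams to total n-grams, pooled over the set."""
    total = 0
    seen = set()
    for response in responses:
        tokens = response.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        seen.update(grams)
    return len(seen) / total if total else 0.0


# Repeated wording lowers the score; varied wording raises it.
print(distinct_n(["i do not know", "i do not know"], n=1))
print(distinct_n(["i do not know", "maybe ask her later"], n=1))
```

Like the cosine-based measure, distinct-n only looks at surface tokens, so it responds to variation in form even when the content of the responses is unchanged.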