Generative language models are applicable to a wide variety of tasks including writing articles, composing Shakespearean sonnets, or engaging in conversation. For nearly all of these goals, human judgments are the sole way to credibly evaluate the quality of the generated text, rendering it prohibitively expensive to optimize directly over the desired objectives. Researchers typically address this issue by adopting a two-stage process.
At train time, models seek to imitate a human-written text corpus as a proxy for the true objective (e.g. higher quality samples). At inference time, models generate text sequences via a decoding algorithm that better optimizes the desired success criteria given the original predictions from the network. Nearly all major breakthroughs in image and language generation over the past few years (radford2019language; zhang2019dialogpt; fan2018hierarchical) have adopted this two-stage process where the model probability distributions differ between train and inference time.
This work examines decoding methods for language models, which are well known to be critical for performance in language generation (ippolito2019human). Recent efforts for improving generative language models have focused primarily on altering the model architecture (vaswani2017attention; gehring2017convolutional), training method (de2019training) and model size (radford2019language; adiwardana2020towards). While effort has also been made towards improving decoders (vijayakumar2016diverse; li2016mutual; ippolito2019comparison), there has been significantly less progress towards evaluating improvements in decoder performance, especially for open-ended generative tasks where successful models must generate a diverse spectrum of high quality answers rather than merely a single output.
For many tasks, these two criteria of quality and diversity are not always equally important. In machine translation, the most important criterion is to produce an accurate, high quality translation of the input; generating a variety of alternative translations is also useful, but not if it comes at the cost of correctness. Meanwhile, in open domain dialogue the goal is often to sustain an enjoyable conversation with a human conversational partner and as such, a higher premium is placed on diversity.
To give a concrete example for the case of dialogue, the phrase “I don’t know” is usually a perfectly reasonable remark, and it appears quite often during normal human conversation. However, a chatbot that repeats “I don’t know” on every turn of dialogue makes for a very poor conversationalist. In such open-ended domains, being able to converse about a wide variety of topics with the occasional odd remark is highly preferred to merely repeating the safest possible remark over and over (li2016diversity).
To simultaneously capture both of these criteria, we propose framing the goal of generative language models as a multi-objective optimization over both quality and diversity. The proposed framework is flexible enough to encompass tasks that traditionally place low emphasis on diversity such as machine translation or summarization and others with high diversity such as storytelling.
Furthermore, the proposed framework enables us to evaluate existing decoding algorithms by comparing their performance along the entire quality-diversity spectrum. We compare a variety of commonly-used decoding algorithms in the first large-scale study of decoder quality, utilizing over 38,000 ratings on almost 10,000 samples. We find that when diversity is highly valued, all decoders perform similarly, but when quality is viewed as more important, the recently proposed nucleus sampling (holtzman2019curious) outperforms all other evaluated decoding algorithms.
Additionally, we use our framework to investigate the commonly held intuition that model likelihood is directly correlated with human quality judgments. First, we explicitly test this belief by measuring the relationship between the quality of a sentence as judged by human raters and its likelihood under a generative model. Our findings confirm the existence of a likelihood trap, the counter-intuitive observation that the highest likelihood sentences are of surprisingly low quality, despite a generally positive relationship between model likelihoods and human quality judgments. While this finding has been observed across a wide variety of language generation tasks and models ranging from news generation to machine translation (cohen2018unconstrained; holtzman2019curious), to our knowledge we are the first to explicitly quantify the relationship between the two at all points in the model probability space.
Secondly, we propose and evaluate selective sampling, a decoder which emphasizes high probability sentences by drawing samples from the global temperature-adjusted model distribution. While this has traditionally been considered intractable due to the difficulty of computing the partition function, we propose a procedure that uses rejection sampling to directly sample from the desired distribution without explicitly computing the partition function. When evaluating this decoder alongside existing token-by-token decoders, we discover that it performs poorly even when taking the likelihood trap into account, suggesting that local token-by-token decoders may be capable of capturing structure that a global decoder does not.
In this section, we introduce a framework for trading off quality and diversity in language generation. Let $\mathcal{X}$ denote the space of all possible generated sentences. We consider autoregressive language models that decompose the likelihood of a sequence $x_{1:n} = (x_1, \ldots, x_n)$ token-by-token in a left-to-right fashion (hamilton1994time; sutskever2014sequence). Specifically, the (conditional) likelihood of the sequence is:
$$p_{\text{model}}(x_{1:n} \mid c) = \prod_{i=1}^{n} p_{\text{model}}(x_i \mid x_{1:i-1}, c),$$
where $c$ is any additional conditioning signal, such as the previous turn of dialogue. Random sampling is the decoding procedure that follows naturally from the factorization of the model’s joint distribution where tokens are sampled one-at-a-time according to the model’s conditional distribution, $p_{\text{model}}(x_i \mid x_{1:i-1}, c)$. Often $p_{\text{model}}$ is not sampled from directly; it is first post-processed by a decoder to bias it toward already high-likelihood tokens.
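As a minimal sketch of this factorization, the snippet below scores and samples sequences from a toy autoregressive model. The bigram table `COND`, its vocabulary, and the function names are hypothetical stand-ins for a real language model and are not taken from the paper.

```python
import math
import random

# Hypothetical toy model: the next-token distribution depends only on the
# previous token. "<s>" marks the start of a sequence. Illustrative numbers only.
COND = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_likelihood(tokens):
    """Sum of conditional log-probabilities, i.e. the log of the chain-rule product."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

def random_sample(length, rng):
    """Random sampling: draw each token from the model's conditional distribution."""
    prev, out = "<s>", []
    for _ in range(length):
        choices, weights = zip(*COND[prev].items())
        prev = rng.choices(choices, weights=weights, k=1)[0]
        out.append(prev)
    return out

rng = random.Random(0)
sample = random_sample(3, rng)
log_p = sequence_log_likelihood(sample)
```

A decoder in the sense used here would modify the conditional distributions inside `random_sample` before each draw, while `sequence_log_likelihood` would still report the likelihood under the unmodified model.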
In the proposed framework, we evaluate the quality of a single sentence $x \in \mathcal{X}$ by asking humans for a quality judgment $\mathrm{HJ}(x)$. We can define the quality $Q(p)$ of a model $p$ as the expected human “quality” judgment for sentences drawn from it:
$$Q(p) = \mathbb{E}_{x \sim p}\left[\mathrm{HJ}(x)\right].$$
We measure the diversity of a model via the Shannon entropy $H$ (shannon1948mathematical), a diversity metric widely used across many fields beyond computer science including biology, economics, chemistry, and physics. Shannon entropy is given by:
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
This allows us to define our multi-objective optimization problem as maximizing the following goal $G$:
$$G(p) = Q(p) + \lambda H(p),$$
where $\lambda$ is the task-specific measure of the relative importance of diversity and quality. For open-ended tasks such as dialogue that place a premium on variety, decoder performance under large $\lambda$ is critical. For more closed domain tasks such as summarization or machine translation, performance under smaller $\lambda$ (including possibly 0) is more important.
Ideally, one would optimize directly over goal $G$, but its dependence on human judgments makes direct optimization infeasible in practice. Instead, prior works optimize a proxy objective (such as the KL divergence) and then employ a decoding algorithm to “warp” the model $p_{\text{model}}$ post-hoc towards higher values of $G$.
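To make the trade-off concrete, the following sketch evaluates quality, Shannon entropy, and their weighted sum for two toy distributions over replies. The reply strings, the judgment scores in `hj`, and the function names are hypothetical; real human judgments cannot be tabulated this way for the full space of sentences.

```python
import math

def quality(p, human_judgment):
    """Expected human judgment under distribution p (dict: sentence -> probability)."""
    return sum(prob * human_judgment[x] for x, prob in p.items())

def entropy(p):
    """Shannon entropy in nats; zero-probability sentences contribute nothing."""
    return -sum(prob * math.log(prob) for prob in p.values() if prob > 0)

def goal(p, human_judgment, lam):
    """Quality plus lam times diversity."""
    return quality(p, human_judgment) + lam * entropy(p)

# Hypothetical judgment scores on an arbitrary scale for three candidate replies.
hj = {"I don't know": 0.6, "reply A": 0.5, "reply B": 0.4}
safe = {"I don't know": 1.0, "reply A": 0.0, "reply B": 0.0}
varied = {"I don't know": 1 / 3, "reply A": 1 / 3, "reply B": 1 / 3}

safe_wins_without_diversity = goal(safe, hj, 0.0) > goal(varied, hj, 0.0)
varied_wins_with_diversity = goal(varied, hj, 1.0) > goal(safe, hj, 1.0)
```

With the diversity weight set to zero the always-safe reply scores higher, while a weight of one favors the varied distribution, mirroring the “I don’t know” example from the introduction.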
In the following section, we relate our objective to existing decoders and investigate a novel decoding algorithm that normalizes globally across all possible sequences rather than simply token-by-token.
3 Selective Sampling
3.1 The Likelihood Trap
Sequence likelihood is commonly used as a heuristic for selecting high-quality generations. In the extreme, beam search approximates finding the single most likely generation and is the approach principally adopted in machine translation (koehn2004pharaoh).
However, prior work has suggested that this assumption of a monotonically positive relationship between sequence likelihood and sequence quality breaks down at the extremes. For example, it is well known in the machine translation and image captioning communities that after a certain point, increasing the beam size hurts BLEU scores and other measures of quality (stahlberg-byrne-2019-nmt; koehn2017six; vinyals2016show). More recently holtzman2019curious observe similar phenomena for open-ended generation where the highest likelihood sentences degenerate into extreme repetition.
We empirically quantify the relationship between sequence likelihoods and human quality judgments by sub-sampling a large number of context-response pairs representing a wide variety of model log likelihoods. We then request human crowdworkers to rate the quality of each response given the context on a five-point “Terrible”-to-“High Quality” scale.
| Context | Response | Log likelihood | Classification |
| --- | --- | --- | --- |
| The Atlanta Falcons have started the 2015 season 4-0 under new head coach Dan Quinn. Quarterback Matt Ryan has the … | … mental Tough O’Rourke Tough apology assessment category of virtue from Boser’ Blog here. It’s got letters and images on it and is utterly … | -177 | Nonsense |
| | … team afloat and looks closer to the 2010 Atlanta Falcons. Starting cornerback Desmond Trufant was one of the top players on the 2014 … | -74 | Reasonable |
| | … team in the thick of the NFC South race. The Atlanta Falcons have started the 2015 season 4-0 under new head coach Dan Quinn. Quarter… | -14 | Repetition |
| They have changed the phone menu to try to deflect us to email, but you can still get a live … | … answer from a female administratoria llallushoss@rahpx Sandra PJ Jenniea nightiopq HamidF daroyqg S’) … | -229 | Nonsense |
| | … message or call on line, so I suppose they are just using that as an excuse. Yet they are still telling people to change their telephone number… | -86 | Reasonable |
| | … link to a phone number here. They have changed the phone menu to try to deflect us to email, but you can still get a live link to… | -23 | Repetition |
Figure 1 plots these ratings as a function of $\log p_{\text{model}}$ and confirms that on average the highest quality generations are not the most likely. Specifically, we find that response quality is generally positively related with $\log p_{\text{model}}$ up until an inflection point after which it becomes negatively related. In our experiments, this inflection point occurs at . Our findings suggest that while model likelihoods form a good proxy for response quality, naively maximizing over sentence likelihood leads to suboptimal response quality. We term this phenomenon the likelihood trap.
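One simple way to locate such an inflection point from rated samples is to average ratings within log-likelihood bins and look for the peak. The sketch below does this on synthetic numbers invented purely for illustration; the bin width, the helper names, and the data are assumptions, and this is not necessarily how the analysis behind Figure 1 was carried out.

```python
from collections import defaultdict

def mean_rating_by_bin(log_likelihoods, ratings, bin_width):
    """Average rating within fixed-width log-likelihood bins, keyed by bin lower edge."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lp, r in zip(log_likelihoods, ratings):
        edge = (lp // bin_width) * bin_width
        sums[edge] += r
        counts[edge] += 1
    return {edge: sums[edge] / counts[edge] for edge in sorted(sums)}

def peak_bin(binned):
    """Lower edge of the bin with the highest mean rating."""
    return max(binned, key=binned.get)

# Synthetic data for illustration only: ratings rise, then fall, as likelihood grows.
log_p = [-200.0, -190.0, -120.0, -110.0, -60.0, -55.0, -20.0, -15.0]
rating = [1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 2.0, 1.0]
binned = mean_rating_by_bin(log_p, rating, bin_width=50.0)
peak_edge = peak_bin(binned)
```

In practice the bin width trades resolution against noise, so a real analysis would need many ratings per bin before the peak location is trustworthy.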
Examples of the likelihood trap can be seen in Table 1. Text sequences with extremely high likelihood tend to devolve into either extreme repetition or other nonsense, which some have attributed to either model biases (holtzman2019curious) or aberrations in the training data (ott2018analyzing). We do not examine the underlying causes of the likelihood trap in this paper.
3.2 Global Temperature Sampling
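Motivated by our findings that human judgments are positively correlated with model likelihoods for some interval of likelihoods, we investigate whether using $\log p_{\text{model}}(x)$ as a proxy for the human judgment $\mathrm{HJ}(x)$ would lead to a better decoding algorithm. Specifically, we create a proxy quality function,
$$\hat{Q}(x) = \begin{cases} \log p_{\text{model}}(x) & \text{if } \log p_{\text{model}}(x) < \alpha, \\ -\infty & \text{otherwise,} \end{cases}$$
where $\alpha$ is selected as a hyperparameter.
Using globally-normalized temperature sampling, we can then approximate optimizing for the goal $G$ by instead optimizing for the proxy objective $\hat{G}(p) = \mathbb{E}_{x \sim p}[\hat{Q}(x)] + \lambda H(p)$, in which the proxy $\hat{Q}$ replaces human judgments, $H$ is the Shannon entropy, and $\lambda$ weights diversity against quality. This is due to the following proposition.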
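Proposition 1. Let $p$ be a probability distribution over some finite set $\mathcal{X}$. Let $H$ be the Shannon entropy function. The probability distribution $q$ which minimizes the reverse KL divergence $D_{\mathrm{KL}}(q \,\|\, p)$ subject to $H(q) = K$ for any achievable constant $K$ has the form,
$$q(x) = \frac{p(x)^{1/T}}{\sum_{x' \in \mathcal{X}} p(x')^{1/T}}$$
for some temperature $T$.
Proof. Included in Appendix A.1. ∎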
When applied to autoregressive models, global temperature sampling is usually dismissed as intractable due to the need to sum over the exponentially large space of all possible sequences in pursuit of the partition function. Instead, past work typically decomposes sentences into tokens in a left-to-right autoregressive fashion and then uses a local approximation,
$$\frac{p_{\text{model}}(x_{1:n})^{1/T}}{\sum_{x'_{1:n}} p_{\text{model}}(x'_{1:n})^{1/T}} \approx \prod_{i=1}^{n} \frac{p_{\text{model}}(x_i \mid x_{1:i-1})^{1/T}}{\sum_{x'_i} p_{\text{model}}(x'_i \mid x_{1:i-1})^{1/T}},$$
where models are normalized locally over each set of tokens. This results in the well known (local) temperature sampling algorithm.
Unfortunately, while replacing the global partition function with a series of local ones transforms an exponential problem into a linear one, this approximation may bias the model towards favoring local structure over global structure. Indeed, we show via the following example that for some joint distributions, it is impossible to represent globally-normalized temperature sampling via local temperature sampling, even if local temperature sampling is allowed to use a different temperature at each timestep.
There exists a probability distribution and global temperature such that no choice of parameter allows local temperature sampling to match the joint distribution .
Figure 2 illustrates one such choice of . By construction, local temperature sampling is forced to set regardless of the temperature hyperparameter used at that timestep. Setting a global temperature of results in
which is not imitable by any local temperature setting. ∎
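The flavor of this argument can be checked numerically. The sketch below uses a hypothetical two-token model (chosen for illustration, not the distribution shown in Figure 2) whose first token is uniform: local temperature sampling must keep the first token uniform at every temperature, whereas global temperature sampling shifts mass towards the first token that leads to the more peaked continuation.

```python
import itertools

# Hypothetical two-step model, not the distribution from the paper's figure.
P_FIRST = {"a": 0.5, "b": 0.5}
P_SECOND = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.9, "b": 0.1}}

def joint(x1, x2):
    """Joint probability of the two-token sequence (x1, x2)."""
    return P_FIRST[x1] * P_SECOND[x1][x2]

def global_temperature(temperature):
    """Raise every sequence probability to 1/T, then normalize over all sequences."""
    weights = {(x1, x2): joint(x1, x2) ** (1.0 / temperature)
               for x1, x2 in itertools.product("ab", repeat=2)}
    z = sum(weights.values())
    return {seq: w / z for seq, w in weights.items()}

def local_first_token(temperature):
    """Local temperature sampling renormalizes the first-token distribution alone."""
    weights = {x: p ** (1.0 / temperature) for x, p in P_FIRST.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

target = global_temperature(0.5)
global_first_a = target[("a", "a")] + target[("a", "b")]
local_first_a = [local_first_token(t)["a"] for t in (0.1, 0.5, 1.0, 2.0)]
```

Here `global_first_a` is strictly below one half while every entry of `local_first_a` equals one half, so no per-timestep temperature can reproduce the global distribution for this toy model.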
Our core insight is that one can use rejection sampling to sample from the globally-normalized temperature sampling distribution without estimating the partition function. Rejection sampling (forsythe1972neumann) gives an algorithm for sampling from an (unnormalized) energy distribution $p_{\text{energy}}$ if there exists a proposal distribution $q$ and constant $M$ such that $M q(x) \geq p_{\text{energy}}(x)$ for all $x$.
We observe that $p_{\text{model}}(x) \geq p_{\text{model}}(x)^{1/T}$ for $0 < T \leq 1$ and $0 \leq p_{\text{model}}(x) \leq 1$. This allows us to use $p_{\text{model}}$ as the proposal distribution since the unnormalized probabilities of global temperature sampling are given by $p_{\text{model}}(x)^{1/T}$.
Selective sampling, by design, significantly increases the chances of sampling sequences with large values of $\log p_{\text{model}}(x)$. To avoid falling into the likelihood trap, we propose explicitly discarding generations where $\log p_{\text{model}}(x)$ is greater than a chosen hyperparameter $\alpha$. An additional positive side effect of the cutoff is that the envelope constant $M$ can be chosen to create a tight bound on the unnormalized target distribution, which increases acceptance probabilities by several orders of magnitude.
A priori, it is not obvious how to choose $\alpha$ effectively. We propose collecting human judgments for a selection of random samples from $p_{\text{model}}$ as illustrated in Figure 1 and setting $\alpha$ equal to the discovered inflection point. Note that while this results in our procedure ignoring the set of sentences that individually have the highest probabilities, the total probability mass of this set is quite low: less than 0.5% in our experiments.
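The procedure can be sketched as a short rejection sampler. This is an illustration under stated assumptions rather than the authors' implementation: the envelope constant $M = \exp(\alpha (1/T - 1))$ is one valid choice once sequences above the cutoff are discarded, and the four-sentence model `TOY`, the `propose` helper, and all function names are hypothetical.

```python
import math
import random

def acceptance_probability(log_p, temperature, alpha):
    """Accept a proposal drawn from the model with probability p^(1/T) / (M * p).

    With the cutoff log p <= alpha, M = exp(alpha * (1/T - 1)) is an envelope
    constant (an assumption of this sketch), so the ratio simplifies to
    exp((1/T - 1) * (log_p - alpha)). Proposals above the cutoff are discarded.
    """
    if log_p > alpha:
        return 0.0
    return math.exp((1.0 / temperature - 1.0) * (log_p - alpha))

def selective_sample(propose, temperature, alpha, rng, max_tries=100_000):
    """Rejection sampling; `propose(rng)` returns (sequence, log_p) drawn from the model."""
    for _ in range(max_tries):
        seq, log_p = propose(rng)
        if rng.random() < acceptance_probability(log_p, temperature, alpha):
            return seq
    raise RuntimeError("no proposal accepted; try a larger temperature or max_tries")

# Hypothetical toy model over four whole "sentences".
TOY = {"w": 0.4, "x": 0.3, "y": 0.2, "z": 0.1}

def propose(rng):
    seq = rng.choices(list(TOY), weights=list(TOY.values()), k=1)[0]
    return seq, math.log(TOY[seq])

rng = random.Random(0)
alpha = math.log(0.35)  # discards "w", the single most likely sentence
draws = [selective_sample(propose, 0.5, alpha, rng) for _ in range(4000)]
freq = {s: draws.count(s) / len(draws) for s in TOY}
```

With temperature 0.5 the accepted samples should follow probabilities proportional to the squared model probabilities of the sentences below the cutoff. Note that the acceptance rate shrinks quickly as the temperature decreases or as typical log likelihoods fall far below the cutoff, which is why a tight envelope matters.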
In Section 2, we introduced a theoretical framework for comparing decoding algorithms along a quality-diversity curve. Under this framework, we evaluate several commonly used decoding algorithms in a human study described below. In addition to selective sampling, we consider the following autoregressive decoding algorithms:
temperature: Sample tokens with probability proportional to $p_{\text{model}}(x_i \mid x_{1:i-1}, c)^{1/T}$. $T$ varies from 0 to 1.
top-$k$ (fan2018hierarchical): Sample tokens only from the top-$k$ highest likelihood tokens in the vocabulary at each timestep. $k$ varies from 1 to vocabulary size.
top-$p$ (also known as nucleus sampling) (holtzman2019curious): Sample only from tokens comprising the top-$p$ percent of probability mass at each timestep, as ordered from most to least likely. $p$ varies from 0 to 1.
At the extremes of their hyperparameter ranges, these algorithms all converge to greedy decoding and random sampling, respectively. To sweep across the quality-diversity curve, we consider several hyperparameter settings per decoding algorithm below. We refer to each decoding algorithm-hyperparameter combination as a ‘decoding configuration’.
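The three token-level decoders can be sketched as functions that warp a single next-token distribution before sampling. The snippet is a simplified illustration in plain Python: the function names and the example distribution `probs` are hypothetical, and ties between equally likely tokens are broken arbitrarily.

```python
import random

def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def temperature_warp(probs, temperature):
    """Raise probabilities to 1/T and renormalize (T in (0, 1])."""
    return normalize([p ** (1.0 / temperature) for p in probs])

def top_k_warp(probs, k):
    """Keep the k most likely tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return normalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_warp(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return normalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
rng = random.Random(0)
token = rng.choices(range(len(probs)), weights=top_p_warp(probs, 0.7), k=1)[0]
```

Setting `k=1`, a very small `p`, or a temperature near zero concentrates all mass on the most likely token (greedy decoding), while `k` equal to the vocabulary size, `p=1`, or a temperature of one leaves the distribution unchanged (random sampling).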
We apply each decoding algorithm to the 774M parameter variant of GPT-2 (radford2019language), a publicly-released language model. To ground samples in a common context, we select a set of 48 examples from the GPT-2 test set to condition upon. As samples are evaluated by human raters, we filter out examples containing explicit content or web markup. Samples are drawn by conditioning on a ‘prompt’ consisting of the first 20 space-delimited words of a test example. As sample quality becomes ambiguous when samples are terse (ippolito2019human), we explicitly require all sampling methods to generate exactly 30 tokens, a length approximately equal to the prompt.
To estimate the expected human judgment score of the probability distributions induced by each decoding algorithm, we enlist a qualified pool of 146 Amazon Mechanical Turk (AMT) workers selected by satisfactory performance on a qualification task. Workers are presented with sets of five samples, each conditioned on the same prompt and drawn from five different algorithm-hyperparameter configurations, and asked to assign qualitative scores to each sample ranging from human-like to gibberish. The exact prompts, as shown to crowdworkers, are included in the Appendix.
Prior work has found that human annotators have significant trouble in directly separating out machine and human generated responses when they are of similar quality, as the task of assessing sentence quality is highly subjective (ippolito2019human). We found that constructing pairwise preference ratings by randomly pairing samples evaluated at the same time significantly reduced the variance of our results. Specifically, if one sample is rated higher than the other, one is assigned a score of +1 and the other -1. If both are rated equally, both are assigned a score of 0. The score assigned to a decoding configuration is its average score across all pairwise preference ratings. The average scores for each decoding strategy setting we experimented with are shown in Figure 6.
We now introduce the first large-scale study comparing decoding algorithms and their hyperparameters. Unlike all prior work (holtzman2019curious; ippolito2019comparison), we explicitly put decoding algorithms on equal footing by comparing sample quality at equal points of diversity. We consider five hyperparameter configurations per decoding algorithm for a total of twenty decoding algorithm-hyperparameter configurations. For each configuration and prompt, we draw ten samples. In total, workers rate nearly 10,000 samples resulting in over 38,000 paired ratings.
Our main results are summarized in Figures 4 and 5. We empirically estimate the entropy of the probability distribution induced by each decoding configuration . Reassuringly, both entropy and human judgment scores vary smoothly with decoding algorithm hyperparameter.
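One standard way to estimate the entropy of a distribution that can be both sampled and scored is a Monte Carlo average of negative log probabilities. The sketch below illustrates that estimator on a hypothetical three-sequence distribution `Q`; whether the experiments here used exactly this estimator is an assumption.

```python
import math
import random

def estimate_entropy(sample_with_log_prob, num_samples, rng):
    """Monte Carlo estimate of H(q) = -E_{x ~ q}[log q(x)], in nats."""
    total = 0.0
    for _ in range(num_samples):
        _, log_q = sample_with_log_prob(rng)
        total -= log_q
    return total / num_samples

# Hypothetical decoder-induced distribution over whole sequences.
Q = {"s1": 0.7, "s2": 0.2, "s3": 0.1}

def sample_with_log_prob(rng):
    """Draw a sequence from Q and return it with its log probability under Q."""
    x = rng.choices(list(Q), weights=list(Q.values()), k=1)[0]
    return x, math.log(Q[x])

exact = -sum(q * math.log(q) for q in Q.values())
approx = estimate_entropy(sample_with_log_prob, 20000, random.Random(0))
```

For an autoregressive decoder the log probability of a sample under the warped distribution is the sum of the warped per-token log probabilities, so the same estimator applies without enumerating sequences.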
As expected, random sampling directly from the model is simultaneously the highest entropy and the lowest quality. This is empirically consistent with the long-standing intuition that decoding algorithms are critical to improving sample quality. Why are samples from random sampling such poor quality? Language models such as GPT-2 are trained to minimize the KL-divergence between a training set and the model distribution $p_{\text{model}}$, an objective that prioritizes recall over precision (arjovsky2017wasserstein). As a result, models tend to ensure that high quality sequences have high likelihood without insisting that all high likelihood sequences also have high quality. When we evaluate samples from the model, we evaluate the latter condition.
Our second conclusion is that sample quality varies significantly with entropy for all decoding algorithms. Moreover, when aligned on entropy, sample quality between all autoregressive decoding algorithms is comparable across a wide range. It is only when entropy is low – when decoding algorithms heavily influence sampling – that sample quality between algorithms diverges. In this regime, we find that nucleus sampling outperforms top-$k$, which in turn outperforms temperature sampling. Observing such a difference should be unsurprising: the entropy of a distribution alone does not determine its sample quality. We conclude that a fair comparison of decoding algorithms must not only compare at the same level of entropy but at a range of entropy levels.
Finally and most surprisingly, we find that, in spite of its theoretical appeal, selective sampling consistently underperforms all other decoding algorithms considered.
4.3 Selective Sampling
Why does selective sampling underperform? Our error analysis yields at least two potential causes: priors induced by decoding algorithms and a context-dependent likelihood trap.
We first consider the implicit priors of autoregressive decoding algorithms. Autoregressive decoding algorithms naturally favor sequences where each token has high model likelihood with respect to its conditional distribution $p_{\text{model}}(x_i \mid x_{1:i-1}, c)$. Note that this is not necessarily the same as favoring all sequences with high joint likelihood $p_{\text{model}}(x_{1:n} \mid c)$, a criterion selective sampling targets at low temperatures. We hypothesize that autoregressive decoding algorithms are inducing additional structure beyond high joint likelihood.
To test this hypothesis, we construct a human rating experiment that pairs a random sample from a decoding algorithm with another random sample from the model distribution such that the two samples have the same joint sentence likelihood. In this way, we are able to control for differences in the distribution of $\log p_{\text{model}}(x)$ that different decoders induce and explicitly test only how various decoding algorithms promote different sequences with the same overall joint likelihood. We draw samples from three commonly-used decoding configurations conditioned on all 48 prompts and compare each against random sampling by asking crowdworkers to rate which of the paired responses is of higher quality.
In Figure 6, we see that temperature sampling with is undeniably preferred to otherwise equivalent samples drawn directly from , though for other decoding configurations, the difference is currently less clear. Selective sampling, a method with proposals drawn from , does not share this prior of its autoregressive locally normalized decoding counterparts. We can thus conclude that the success of a decoding algorithm involves more than promoting high joint likelihood; in this way, selective sampling is deficient.
Second, we consider the distribution over sample log likelihoods conditioned on a fixed prompt, as shown in Figure 7. This distribution varies from prompt to prompt. In selective sampling, we have elected to choose a single, global maximum likelihood constant $\alpha$. For some prompts, this has nearly no impact – nearly all samples have likelihood below the cutoff. For others, this may eliminate nearly half of samples, leaving only those of lower quality. This suggests that a fixed cutoff for all prompts may not be ideal.
Based on the prior experiments, we find that choice of decoding algorithm and its hyperparameter has a significant impact on sample quality and diversity. Further, we find that sample quality and diversity can be traded for one another, and that the merit of a decoding algorithm requires comparing it to others at equivalent levels of diversity. We also give evidence that autoregressive decoding algorithms induce additional preference beyond promoting samples with high joint likelihood, a beneficial preference that selective sampling does not share.
5 Related Work
Encouraging Diversity Several recent works have proposed strategies for increasing the diversity of text generated by language models. These approaches fall into two broad categories: (1) algorithmic changes to the decoding algorithm and (2) methods that involve training auxiliary language models or modifying the training paradigm for the main language model in some way.
The advantage of changing the decoding algorithm is that improvements can rapidly be implemented on top of any already trained language model. vijayakumar2016diverse, li2016mutual, tam2019clustered, and kulikov2018importance all propose modifications to beam search to force it to explore a more diverse set of beams. In contrast, modifications to random sampling that have been proposed aim to reduce diversity and thereby increase quality (fan2018hierarchical; holtzman2019curious). ippolito2019comparison compare many of these algorithmic advancements on the tasks of open-ended dialog and image captioning, concluding that the quality-diversity tradeoff makes it nearly impossible to say that any one of these methods is ubiquitously best.
We choose to evaluate three commonly used decoding methods: nucleus sampling (holtzman2019curious), top-k sampling (fan2018hierarchical), and temperature sampling. All three of these methods control the relative tradeoff between quality and diversity with a single hyperparameter. Top-k sampling samples from only the top-k most likely tokens at a timestep, proportionally according to the original probability. Nucleus sampling (also called top-p sampling) operates similarly, but chooses an adaptive k such that the top-k tokens comprise the top-p percent of the total probability mass at each timestep. Temperature sampling divides the logits of each token by the temperature hyperparameter before normalizing and converting the logits into sampling probabilities. In terms of diversity-promoting approaches that require training new language models, li2016diversity use a language model that predicts the source sequence given the target sequence to rank candidate generations, penalizing generations that are too generic (have low $p(\text{source} \mid \text{target})$). welleck2019neural propose a novel loss function which discourages the model from assigning too high probability to repetitive wording. zhang2018generating and xu2018diversity use adversarial learning methods to encourage diversity. Though these methods are promising, the extra complexity of training makes them less attractive for quickly improving upon existing language models.
The concept of oversampling generations and then ranking them has been popular since the days of statistical machine translation (shen2004discriminative) but has also been used more recently in other domains (li2016diversity; ippolito2019comparison; kriz2019complexity). Our particular contribution is to relate our sampling algorithm to the reverse KL divergence and competing objectives maximization. We are also able to use this method to give approximate probability density estimates for sampled sentences, which typically cannot be done for algorithms that oversample generations.
Likelihood Trap We are far from the first to observe evidence of the likelihood trap. In particular, the machine translation and image captioning communities have long known that using higher beam sizes often leads to lower BLEU scores (cohen2018unconstrained; vinyals2016show; yang2018breaking). In open-ended generation, holtzman2019curious find similar results, observing that maximizing the likelihood generates extremely repetitive sentences. In addition to finding corroborating evidence that low quality generations appear at both the low and high probability extremes, our main contribution towards understanding the likelihood trap is the first explicit measurement of the relationship between model likelihoods and human quality judgments at all points in the model probability space, not just the endpoints.
ott2018analyzing attempt to quantify the reasons behind the likelihood trap, proposing that the underlying issue is low quality examples in the training data. They demonstrate that the likelihood trap can be avoided when restricting themselves to a significantly smaller dataset where each training point is carefully examined to guarantee that it is high quality. However, given the recent interest in training increasingly large language models on increasingly large datasets, it seems infeasible to guarantee the quality of every example included in the dataset.
Frameworks Note that our framework is related, but not identical, to many frameworks such as hashimoto2019unifying; kingma2013auto; goodfellow2014gan, which ask that generative models mimic the training distribution exactly. While some tasks do require indistinguishability as the ultimate goal (e.g. representation learning (bengio2013representation), Turing Test (turing2009computing; ippolito2019human), etc.), this is typically not the case for most generation tasks. Humans make errors, but a “perfect” model would not seek to imitate these mistakes. Because we ground quality evaluations in human judgments rather than on any statistical measure, our framework is easily able to capture the possibility of superhuman performance in ways that frameworks based solely on a statistical divergence would find difficult.
In this paper, we propose a framework for credibly evaluating decoding algorithms and use it to conduct the first large scale evaluation of decoding algorithms by measuring their performance along the entire quality-diversity frontier. Our findings suggest that existing decoding algorithms are more or less interchangeable in high diversity settings, but that nucleus sampling performs best when quality is valued over diversity. Additionally, we provide evidence for the existence of a likelihood trap and are the first to explicitly measure the relationship between model likelihood and human judgments. Finally, we propose and evaluate selective sampling, the first algorithm that can tractably estimate globally normalized temperature sampling.
In the future, we hope to extend our work to additional generative language models as well as other modalities such as image and music generation. Additionally, we leave questions of whether selective sampling can be improved via choice of an adaptive cutoff that can vary based on the prompt or proposal distributions other than random sampling for future discovery.
Appendix A
A.1 Proof of Proposition 1
Notice first that, subject to ,
Properly choosing allows us to write the Lagrangian dual for the above constrained optimization problem as
Setting and immediately gives us temperature sampling. Finally, observing that positive temperatures give us the local maxima and negative temperatures give us the local minima completes the proof.
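One way to fill in the steps of this argument is sketched below. The multipliers $\lambda$ and $\mu$, the constant $K$ for the entropy constraint, and the exact form of the Lagrangian are notation chosen here for illustration and may differ from the original derivation.

```latex
\begin{align*}
D_{\mathrm{KL}}(q \,\|\, p)
  &= \sum_{x} q(x) \log q(x) - \sum_{x} q(x) \log p(x)
   = -H(q) - \sum_{x} q(x) \log p(x), \\
\mathcal{L}(q, \lambda, \mu)
  &= -\sum_{x} q(x) \log p(x)
     + \lambda \Big( \sum_{x} q(x) \log q(x) + K \Big)
     + \mu \Big( \sum_{x} q(x) - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial q(x)}
  &= -\log p(x) + \lambda \big( \log q(x) + 1 \big) + \mu = 0
  \;\Longrightarrow\;
  q(x) \propto p(x)^{1/\lambda}.
\end{align*}
```

Under the constraint $H(q) = K$ the entropy term in the first line is constant, so only the cross term needs to be minimized. Identifying the multiplier $\lambda$ with the temperature $T$ and normalizing then recovers the temperature sampling form.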
A.2 Experimental Design
In this section, we describe the design of experiments presented in Section 4 in greater detail.
We begin by describing the task presented to crowdsourced raters. A sample task is shown in Figure 9. Each task consists of a “context” sequence of the first 20 words in a news article (news articles are sourced from GPT-2’s WebText dataset, https://github.com/openai/gpt-2-output-dataset). We then present the rater with five continuations of 30 word-piece tokens. The rater assigns a label of “High Quality”, “Decent”, “Passable”, “Bad” or “Terrible” to each. We note that these labels are inherently subjective, and include a description and reference example before each task to calibrate the rater. The same description and example are repeated in Figure 8.
In preliminary experiments, we found examples and instructions insufficient for achieving repeatable results. Manual inspection of rater responses revealed a failure to interpret the labels correctly as well as spammers who would always choose the same response for every prompt. As a result, we crafted a qualification exam of five continuations. Only raters who rated all five continuations correctly or nearly correctly were allowed to participate in further experiments; raters who incorrectly labeled at most one continuation with a label at most one level off (e.g. if the correct answer is “Bad”, acceptable errors are “Passable” and “Terrible”) are counted as “nearly correct”. Of the 550 crowdsourced workers surveyed, 136 met this criterion. We refer to this set of raters as the “qualified rater pool” below.
Even with a qualification exam, we found raters often disagree on the appropriate label for a given continuation. However, when asked to choose which of two continuations was of higher quality (if any), raters were better aligned. With this in mind, we choose to analyze pairs of ratings given in the same task. From five absolute ratings, we construct twenty pairwise preference ratings: two per pair of continuations. If two continuations receive the same label, they are assigned a preference of 0. If the first continuation is rated higher than the second, the pair (first, second) is assigned a score of +1 and the pair (second, first) a score of -1. All analyses comparing multiple decoding methods use this methodology.
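The construction of pairwise preferences from one task's absolute ratings can be sketched as follows. The ordinal encoding in `LABEL_SCORE`, the configuration names, and the example ratings are hypothetical and serve only to illustrate the bookkeeping.

```python
from itertools import permutations

# Assumed ordinal encoding of the five labels (higher is better).
LABEL_SCORE = {"Terrible": 0, "Bad": 1, "Passable": 2, "Decent": 3, "High Quality": 4}

def pairwise_preferences(task_ratings):
    """Map {configuration: label} for one task to a score for every ordered pair."""
    prefs = {}
    for first, second in permutations(task_ratings, 2):
        diff = LABEL_SCORE[task_ratings[first]] - LABEL_SCORE[task_ratings[second]]
        prefs[(first, second)] = (diff > 0) - (diff < 0)  # +1, 0, or -1
    return prefs

def configuration_scores(prefs):
    """Average preference of each configuration over the pairs in which it comes first."""
    totals, counts = {}, {}
    for (first, _), score in prefs.items():
        totals[first] = totals.get(first, 0) + score
        counts[first] = counts.get(first, 0) + 1
    return {cfg: totals[cfg] / counts[cfg] for cfg in totals}

# Hypothetical ratings for one task; configuration names are illustrative.
task = {"top-p 0.7": "Decent", "top-k 40": "Passable", "temp 0.8": "Passable",
        "temp 1.0": "Bad", "random": "Terrible"}
prefs = pairwise_preferences(task)
scores = configuration_scores(prefs)
```

Five continuations yield twenty ordered pairs, and averaging over many tasks would give one score per decoding configuration.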
Even with the precautions above, care is needed to ensure repeatable results. To measure this, we performed an “A/A” experiment prior to data collection. This experiment consists of having the same tasks rated by two different pools of raters. Identical analyses are performed on both rating results, and the experimental setup is deemed valid if conclusions are consistent. To achieve this, we constructed 150 tasks (the large-scale experiment includes 1,930 tasks) using a subset of the context sequences and decoding methods from our primary experiment. We artificially split the qualified worker pool in two by sending the same tasks for evaluation at midnight and at noon (all tasks within each experiment were rated within 4 hours and 1.5 hours, respectively). We submit the same set of tasks to both rater pools. An analysis of results from both sets of ratings (Figure 10) reveals a statistically consistent preference of top-$p$ over top-$k$ and (local) temperature sampling, and a severe disapproval of random sampling from the model. These results are also consistent with the same statistics gathered in the full-scale experiment presented in the main text and another experiment described below.
To further validate the reliability of our methodology, we explicitly measure inter-rater agreement on the same set of 150 tasks in a follow-up experiment after large-scale data collection. In this experiment, we ask that each task be rated by five distinct raters. We measure Fleiss’s Kappa, a measure of inter-rater agreement, on the resulting pairwise ratings. We obtain a score of 0.1964 – an indication that a correlation between raters exists but that the task is far from unambiguous. While this may initially appear concerning, we argue that this is an indication of the task’s difficulty. Unlike image classification, for example, a universally agreeable criterion for text quality does not exist. A measure of Cohen’s Kappa on the A/A experiment above produces a score of 0.19578 – nearly identical to the inter-rater agreement experiment described here. The similarity of these two statistics gives evidence that the proposed experimental design is repeatable in spite of the task’s ambiguity. These results underscore the importance of large-scale, repeatable studies like that presented here.
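For reference, Fleiss's Kappa can be computed from a table of per-item category counts as in the sketch below. This is the textbook formula written in plain Python; the function name and the layout of `counts` are assumptions, and for pairwise ratings the three categories would be -1, 0, and +1.

```python
def fleiss_kappa(counts):
    """Fleiss's Kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters (at least two).
    """
    num_items = len(counts)
    num_raters = sum(counts[0])
    num_categories = len(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / num_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1.0 - expected)

perfect = fleiss_kappa([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
split = fleiss_kappa([[1, 1], [1, 1]])
```

Perfect agreement gives a value of one, while raters who always split evenly give a negative value, which is why a score near 0.2 is read as weak but real agreement.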
We conclude by measuring rater preference between each pair of sampling method and hyperparameter on the five-raters-per-task inter-rater agreement experiment described above. Results, as shown in Figure 12, indicate that the same trends presented in the full-scale experiment (Figure 5) hold:
Top-$p$ is preferred to all other sampling methods,
Increased diversity correlates with lower human judgment scores, and
Random sampling directly from the model produces the lowest human judgment scores by a large margin.