ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons

by Margaret Li, et al.

While dialogue remains an important end-goal of natural language research, the difficulty of evaluation is an oft-quoted reason why it remains troublesome to make real progress towards its solution. Evaluation difficulties are actually two-fold: not only do automatic metrics not correlate well with human judgments, but also human judgments themselves are in fact difficult to measure. The two most used human judgment tests, single-turn pairwise evaluation and multi-turn Likert scores, both have serious flaws as we discuss in this work. We instead provide a novel procedure involving comparing two full dialogues, where a human judge is asked to pay attention to only one speaker within each, and make a pairwise judgment. The questions themselves are optimized to maximize the robustness of judgments across different annotators, resulting in better tests. We also show how these tests work in self-play model chat setups, resulting in faster, cheaper tests. We hope these tests become the de facto standard, and will release open-source code to that end.




Figure 1: Acute-eval asks humans to compare two multi-turn dialogues, and independent of the gray speakers, choose between Speaker 1 (light blue) and Speaker 2 (dark blue).

Dialogue between human and machine is an important end-goal of natural language research. The open-ended nature of generating sequences in a multi-turn setup naturally makes the task difficult to evaluate, with full evaluation possessing many of the difficulties of the task itself, as it requires deep understanding of the content of the conversation. As in many other natural language generation (NLG) tasks, automatic metrics have not been shown to have a clear correlation with human evaluations [Liu et al.2016, Lowe et al.2017]. This means the current standard for all dialogue research involves human trials, which slows down research and greatly increases the cost of model development.

Unfortunately, human judgments are themselves difficult to measure. The two most used approaches, single-turn pairwise evaluation [Vinyals and Le2015, Li et al.2016b], and multi-turn Likert scores [Venkatesh et al.2017, Zhang et al.2018, See et al.2019, Dinan et al.2019b, Dinan et al.2019a] have serious limitations. Single-turn pairwise evaluation provides the benefits and simplicity of an A/B test, allowing for cheap and fast annotations, with comparisons that are robust to annotator score bias, but fails to take into account the multi-turn aspect of conversations. To give a trivial example, such comparisons fail to capture whether the model would repeat itself in a multi-turn conversation because they only look at one turn; repetition is a known issue that humans dislike [See et al.2019].

Multi-turn Likert scores require the annotator to have a multi-turn conversation and then provide an integer score, which is more costly and time-consuming to run but evaluates full conversations more accurately. The integer scores, however, suffer from differing bias and variance per annotator, which researchers have tried to mitigate [Kulikov et al.2018]; nevertheless, due to its lack of sensitivity, the Likert approach often yields comparisons that are not statistically significant. Furthermore, due to strong anchoring effects during model evaluation, i.e. that annotators are affected by the first systems they evaluate, Likert comparisons are generally not comparable across multiple papers. This mandates that evaluations of new models be simultaneously collected with baselines, further increasing the cost of developing additional models [See et al.2019].

In this work we introduce Acute-eval, a method that combines the benefits, and attempts to mitigate the deficiencies, of the above two approaches by introducing a pairwise relative comparison setup for multi-turn dialogues. In each trial, we show the annotator two whole conversations, with the second speaker in each conversation highlighted, as the judgment should be independent of the quality of the first speaker, see Figure 1. We then show a carefully worded question with two choices: speaker A or B, where the question measures a desired quality such as which speaker is more engaging, interesting or knowledgeable. Our experiments show that annotators perform well in this setup, and that our method can reveal subtle but significant differences between conversational models that other approaches, such as multi-turn Likert, cannot.

Overall, our work provides the following contributions:

  • A new evaluation method with a clear mechanism that provides fast, cheap iteration. This evaluation method allows efficient reuse of data from prior papers, allowing new models to be evaluated independently of baselines, and dramatically lowers the cost of annotation.

  • We optimize question choices to find those with the highest agreement, increasing confidence in the desired test. We provide the wording of the questions that we found to work best for several questions of interest (most engaging, human, interesting or knowledgeable conversationalist) for further research use.

  • We provide an explicit benchmark comparison between current best performing retrieval and generative models on two recent tasks, PersonaChat [Zhang et al.2018] and Wizard of Wikipedia [Dinan et al.2019b] for several question choices, revealing the current state-of-the-art, and to be used for benchmarking on these tasks in the future.

  • We show that our test can be applied to self-chats rather than human-model conversation logs, which can reveal problems with existing models at a cheaper price, and provides high agreement with the human-model evaluations.

  • We will release the code for running these tests.

Related Work

Dialogue tasks have traditionally been separated into two areas: goal-oriented and chitchat. Goal-oriented tasks typically have a clearer evaluation, e.g. task completion can be measured if the correct actions are taken [Hastie2012, Henderson, Thomson, and Williams2014, Bordes, Boureau, and Weston2017, El Asri et al.2017, Wen et al.2017]. Chitchat tasks are more open ended, and instead feature conversations without a precise goal that can be automatically evaluated. Examples include conversations where two speaking partners discuss interests [Zhang et al.2018] or topics [Dinan et al.2019b]. We study the latter in this work.

Evaluation of chitchat tasks with automatic metrics is difficult precisely because of their open-ended nature. For example, the answer to the question “What are you doing tonight?” has many possible answers, each with little word overlap. This means standard metrics for tasks like question-answering or machine translation do not work well, and have poor correlation with human judgments [Liu et al.2016, Novikova et al.2017]. Nevertheless, a number of studies do report automatic metrics, without human studies [Serban et al.2016, Parthasarathi and Pineau2018]. Researchers have made attempts to improve automatic evaluation, trying methods such as adversarial evaluation [Li et al.2017], learning a scoring model [Lowe et al.2017], or a learnt ensemble of automatic metrics [Ghandeharioun et al.2019], but their value is as yet not fully understood.

Currently the standard approach in chitchat dialogue is to perform human evaluations [Vinyals and Le2015, Li et al.2016a, Li et al.2016c, Venkatesh et al.2017, Zhang et al.2018, Dinan et al.2019b], typically reporting a judgment such as conversation quality or appropriateness via a Likert scale or pairwise comparison. While conversations are naturally multi-turn, pairwise setups typically consider single turn evaluations, taking the “gold” dialogue history from human-human logs, and only consider altering a single utterance. A more complete multi-turn evaluation is typically measured with a Likert scale (usually 1-4 or 1-5) after the conversation takes place. Some works such as [See et al.2019] ask a series of questions relating to different aspects of conversational ability. There are some notable variants from these standard setups. Novikova et al. [2018] provide a method that combines continuous scales and relative assessments, but in single-turn, rather than multi-turn evaluation. Ghandeharioun et al. [2019] compare human evaluations to automatic metrics computed on self-chats. Note that we also use self-chats in this work, but we evaluate these with humans, rather than automatic metrics.

Finally, this work expands upon some of the ideas present in See et al. [2019]. In that work, a test for interestingness of a specificity-controlled model conducted with pairwise chat logs was mentioned, similar to the ones used here, but was not the focus of their work. In our work, we conduct a full study of novel variants of this approach, consider optimizing the questions for robust measurements over four types of questions, utilize self-chat logs in addition to human-bot logs, and benchmark state-of-the-art models across two recent tasks.

Method: Acute-eval

To compare two dialogue models, model A and model B, our evaluation asks humans to directly compare side-by-side multi-turn dialogues conducted by these models. See Figure 1 for an example.

Our method is thus the following: (1) collect conversation logs for model A; similarly for model B. (2) In a number of trials, ask annotators to make binary judgments between sampled pairs from the logs, and collate the results to determine the winner, either A or B, and the statistical significance.

We consider different approaches to step (1) and (2) below.

Human-Model chats

Our standard setup is to compare conversation logs between models and humans. In each evaluation trial we then show a human annotator two of the previously obtained conversations, one of model A conversing with a human, and one of model B conversing with a (possibly different) human. The annotator sees the conversations side by side on the same screen, with the two models’ utterances highlighted in different colors, and the human utterances in gray to minimally distract from the models.

The annotator is posed a question phrasing (e.g. “which speaker is more knowledgeable” or “which speaker sounds more human?”), and asked to make a binary choice between model A and model B. They are strongly encouraged to provide a short text justification for their choice. We collect trials of such pairwise judgments, and use them to decide which model wins. Statistical significance can be computed using a binomial test.
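As a minimal sketch of the significance computation, assume each trial is an independent binary vote and the null hypothesis is that both models are equally likely to win (p = 0.5). Under those assumptions an exact two-sided binomial test needs only the standard library. The function name and the example counts are illustrative and are not taken from the released code.

```python
from math import comb

def binomial_two_sided_p(wins_a: int, n_trials: int) -> float:
    """Exact two-sided binomial test against the null p = 0.5."""
    if not 0 <= wins_a <= n_trials:
        raise ValueError("wins_a must lie between 0 and n_trials")
    # Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the probability of a split at least as lopsided as observed.
    k = max(wins_a, n_trials - wins_a)
    tail = sum(comb(n_trials, i) for i in range(k, n_trials + 1)) / 2 ** n_trials
    return min(1.0, 2 * tail)

# Hypothetical example: model A preferred in 15 of 20 trials.
p_value = binomial_two_sided_p(15, 20)
print(f"p = {p_value:.4f}")
```

The symmetry shortcut only holds for p = 0.5; a test against a different null proportion would need the general definition.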


Self-Chat

Human-model conversation logs are themselves time-consuming and expensive to collect, which limits rapid iterative model development. We investigate if it is possible to remove the human from the conversation, and only use human annotators in the final pairwise conversation evaluation step. The concept of self-chats [Li et al.2016c, Ghandeharioun et al.2019], whereby a model talks to itself, playing the roles of both speaking partners, has been previously explored in other contexts. Such logs are easy to collect for models A and B, involving simply running inference for both speaker roles. We then use these logs in the Acute-eval pairwise comparison setup as described above.
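Collecting a self-chat log can be sketched as a simple loop in which one model produces every turn. The `Model` interface below (a callable from dialogue history to the next utterance) and the toy `echo_bot` are assumptions made for illustration; a real system would plug in its own inference call and, for tasks such as PersonaChat, would give each speaker role its own persona.

```python
from typing import Callable, List

# Hypothetical interface: a model maps the dialogue history to its next utterance.
Model = Callable[[List[str]], str]

def self_chat(model: Model, opening: str, n_turns: int) -> List[str]:
    """Let one model play both speakers, starting from an opening utterance."""
    history = [opening]
    for _ in range(n_turns - 1):
        # The same model answers regardless of whose turn it is.
        history.append(model(history))
    return history

def echo_bot(history: List[str]) -> str:
    # Toy stand-in for a real dialogue model, used only for illustration.
    return f"reply {len(history)} to: {history[-1][:20]}"

log = self_chat(echo_bot, "hi, how are you?", n_turns=4)
for turn in log:
    print(turn)
```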

Question Optimization

So far, we have not detailed the actual question(s) asked of the annotators. The framing and phrasing of questions in surveys is known to greatly affect the direction of responses, and therefore, in the case of evaluation, inter-annotator agreement. Though this has been noted in prior work [Lowe et al.2017], we have found no systematic experimentation on question formulation or task presentation. We therefore aim to propose and evaluate multiple potential question wordings to achieve higher agreement.

To do this, we build an initial test that compares human-human logs with human-model logs where the model is a relatively low quality baseline model. The aim is that there should be a clear and agreeable difference between human and model which is visible to human annotators. We ask annotators to make judgments between these two, where we choose pairs where the human should be judged as superior.

We then run independent trials with different question phrasing, and find the questions with highest inter-annotator agreement. The winning questions can then be used in future experiments by ourselves, and other researchers. Although having high inter-annotator agreement does not guarantee that crowdworkers interpret the question as intended, it increases the chance the question is understood uniformly. That is, the researcher still has to exercise care in the formulation of the question so that they believe it measures the quantity they are interested in. In our experiments we find questions with high-agreement rate over four axes: engagingness, interestingness, knowledge and humanness.

Annotation Quality

We use crowdworkers for our annotations. We recommend limiting the number of annotations a single worker may complete to be only a few pairs (in our experiments, if we are making model comparisons then we allow annotations). In preliminary trials, we found that limiting the influence of any one worker was important for replicability, but that results were highly consistent across multiple runs with this limitation.

Additionally, the first comparison any worker is asked to annotate consists of a conversation between a weak baseline model and human, and a human-human conversation. If a worker fails to rate the human-human conversation as better, we remove their annotations from the results, in order to remove poor quality annotators. We additionally remove workers who never give a reason for their choice. Note that adding such worker quality tests to pairwise annotation tasks is straightforward where the gold annotation is known, while it is harder for Likert tests which have integer scores. One may also increase the number of quality-control annotations to decrease the likelihood of fraudulent workers, but we found using a single control question had a reasonable cost-noise ratio.
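The two filtering rules above can be sketched as follows. The `Annotation` record and its field names are hypothetical, chosen only to make the logic concrete: a worker is dropped if they answer the control pair incorrectly or never give a reason, and the control items themselves are excluded from the results.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Annotation:
    worker_id: str
    choice: str               # "A" or "B"
    reason: str               # free-text justification, may be empty
    is_control: bool = False  # True for the weak-baseline vs. human-human pair
    control_answer: str = ""  # expected choice, only used for control items

def filter_annotations(annotations: List[Annotation]) -> List[Annotation]:
    """Keep non-control annotations from workers who pass both checks."""
    failed = {a.worker_id for a in annotations
              if a.is_control and a.choice != a.control_answer}
    gave_reason = {a.worker_id for a in annotations if a.reason.strip()}
    return [a for a in annotations
            if not a.is_control
            and a.worker_id not in failed
            and a.worker_id in gave_reason]
```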

Each specific pair of conversations is shown at most once, given that there are at least as many possible pairs of conversations as desired annotations. If there are more conversations available for each model than desired annotations, each conversation is shown at most once - that is, in only one annotation. We found that maximizing the diversity of pairs improved robustness of our evaluation across multiple replication experiments.
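One way to realize this pairing policy is sketched below, assuming the conversation logs are distinct strings and that a fixed random seed is acceptable. The function name and signature are illustrative rather than part of any released tooling.

```python
import random
from typing import List, Sequence, Tuple

def sample_pairs(logs_a: Sequence[str], logs_b: Sequence[str],
                 n_pairs: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw conversation pairs (one log per model) for annotation."""
    rng = random.Random(seed)
    if n_pairs <= min(len(logs_a), len(logs_b)):
        # Enough logs: every conversation appears in at most one pair.
        return list(zip(rng.sample(list(logs_a), n_pairs),
                        rng.sample(list(logs_b), n_pairs)))
    # Otherwise conversations are reused, but no specific pair repeats.
    all_pairs = [(a, b) for a in logs_a for b in logs_b]
    if n_pairs > len(all_pairs):
        raise ValueError("more annotations requested than distinct pairs")
    return rng.sample(all_pairs, n_pairs)
```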


Question | Choice 1 | Agrm.
Engagingness (PersonaChat)
Which speaker is more engaging to talk to? | Speaker 1 is more engaging | 82.5%
Who would you prefer to talk to for a long conversation? | I would prefer to talk to Speaker 1 | *87.5%
Which speaker do you think is more captivating? | Speaker 1 is more captivating than Speaker 2 | 84.2%
Interestingness (PersonaChat)
If you had to say one of these speakers is interesting and one is boring, who would you say is more interesting? | Speaker 1 is more interesting | *86.7%
Which speaker is more interesting to talk to? | Speaker 1 is more interesting | *81.5%
Which speaker is more boring to talk to? | Speaker 1 is more boring | 69.6%
Who would you rather talk to for fun? | Speaker 1 is more fun | 70.8%
Humanness (PersonaChat)
Which speaker sounds more human? | Speaker 1 sounds more human | *76.9%
If you had to guess that one speaker is human and one is a bot, which do you think is human? | Speaker 1 sounds human | 71.4%
Which speaker sounds more like a real person? | Speaker 1 sounds more like a real person | 76.9%
Knowledgeable (Wizard of Wikipedia)
Which speaker is more knowledgeable? | Speaker 1 is more knowledgeable | *88.9%
If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable? | Speaker 1 is more knowledgeable | *100%
Which speaker is more well-informed? | Speaker 1 is more well-informed | *85.0%

Table 1: Optimizing questions: we measure the agreement rates for the most chosen response for different phrasings of questions, and choose the most agreed upon versions. Starred agreements indicate statistical significance (binomial test), and bold agreements indicate the question was used in future trials.

We perform experiments on two tasks, PersonaChat and Wizard of Wikipedia, which evaluate different aspects of conversational ability. We first optimize the questions to maximize worker agreement, and then benchmark existing state-of-the-art models on each task.

PersonaChat task

PersonaChat [Zhang et al.2018] is a chitchat dialogue task involving two participants (two humans or a human and a bot). Each participant is given a persona (a short collection of personal traits such as I’m left handed or My favorite season is spring), and the participants are instructed to get to know each other by chatting naturally using their designated personas, for 6–8 turns. The original dataset contains nearly 9000 human-human training conversations; most models are pretrained with a larger corpus, and then fine-tuned on this set.

PersonaChat was the subject of the NeurIPS 2018 ConvAI2 Challenge [Dinan et al.2019a], in which competitors’ models were first evaluated with respect to automatic metrics, and then with respect to human judgment via human-bot chats followed by the question “How much did you enjoy talking to this user?” on a scale of 1–4. A total of 9 systems were evaluated using human annotators, 100 conversations for each. In this work, we leverage the human-model chat logs from the ConvAI2 competition for three models: Lost in Conversation (LIC), which won the competition; Hugging Face (HF) [Wolf et al.2019], which won the automatic evaluation track; and the KVMemNN [Miller et al.2016] baseline released by the competition organizers (KV) [Dinan et al.2019a]. LIC and HF are large pretrained and fine-tuned generative Transformer models, while KV is a retrieval model with no pretraining.

Secondly, we also compare to recently published models from See et al. [2019]. The authors studied the effects of controllable generation and showed that Repetition-controlled (RC), Inquisitive (INQ), and Interesting (INT) models obtained the highest human Likert scores in their study; however, their comparison to models from other studies is not direct. We thus compare to these models as well; we use the human-model conversation logs from their work, 100 for each model.

Finally, we also compare to the Polyencoder model (PE) [Humeau et al.2019], a recent state-of-the-art retrieval model. It is a type of large Transformer architecture pretrained on Reddit, which learns a small number of global features to represent the input so that retrieval can be computed efficiently. As no conversation logs were provided in that work, we additionally collect human-model conversations for that model.

Overall, we benchmark 7 models, and compare them to human (H) performance in a number of different settings: with human-model and self-chat over three questions: engagingness, humanness and interestingness.

Wizard of Wikipedia task

Wizard of Wikipedia [Dinan et al.2019b] is a chitchat dialogue task where two speakers discuss a topic in depth, chosen from 1247 topics. One speaker (termed the Wizard) is meant to be both engaging and knowledgeable on the topics, and has access to an information retrieval system over Wikipedia to supplement their own knowledge. The other speaker (the Apprentice) is meant to be curious and eager to learn about the topic. The original dataset contains over 18,000 human-human dialogues, and has been used to train various kinds of models to imitate the human wizards. These include the Memory Network Transformer, in both generative and retrieval versions that employ the retrieved knowledge by attending over it before producing an utterance (GK and RK respectively), and baselines that do not have access to the knowledge (GU and RU). See Figure 4 for an example chat. We use the human-model logs from that paper (100 conversations for each model) on unseen test topics and evaluate them against humans (H), using both engagingness and knowledgeability questions. We note the original paper tested engagingness only.

Question Optimization

We are interested in evaluating models in terms of four axes: engagingness, interestingness, knowledge and humanness. In order to find the questions with highest inter-annotator agreement, we run multiple trials of experiments according to the setup described below. Each trial tests the effectiveness of a single question and consists of the same set of multi-turn conversation logs, presented to the human annotators. We test 13 questions: three regarding engagingness, four regarding interestingness, three regarding humanness, and three regarding knowledgeability (see Table 1).

We compare human-human logs with human-model logs where the model is a relatively low quality baseline model, with the aim that there should be a clear and agreeable difference between human and model which is visible to human annotators. For PersonaChat we use a greedy generative baseline, and for Wizard we use the GU (generative unknowledgeable) model. Both of these baselines exhibit strong repetitive behavior which is known to be highly disfavored by crowdworkers [See et al.2019]. We select a single handpicked conversation pair for each of the tasks, and collect 20 annotations per question.

We calculate the inter-annotator agreement for each question. The question achieving the highest inter-annotator agreement is selected for use in the rest of our experiments. The specific question phrasing and the texts accompanying the option for Speaker 1 (i.e. the left-hand conversation) are listed in Table 1 along with inter-annotator agreements. As can be seen, the phrasing of the question is important, with poor phrasing choices leading to much lower agreement levels, e.g. 86.7% agreement in the best case for interestingness, and 69.6% in the worst case.
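The agreement figure reported for each question, i.e. the share of annotators who picked the most chosen response, can be sketched as below. The vote counts in the example are made up for illustration and do not correspond to any row of Table 1.

```python
from collections import Counter
from typing import Sequence, Tuple

def agreement_rate(choices: Sequence[str]) -> Tuple[str, float]:
    """Return the most chosen option and the fraction of annotators choosing it."""
    if not choices:
        raise ValueError("need at least one annotation")
    option, count = Counter(choices).most_common(1)[0]
    return option, count / len(choices)

# Hypothetical votes for one question phrasing.
votes = ["Speaker 1"] * 17 + ["Speaker 2"] * 3
winner, rate = agreement_rate(votes)
print(winner, f"{rate:.1%}")
```

Running this once per candidate phrasing and keeping the phrasing with the highest rate mirrors the selection step described above.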

As a preliminary sanity check, we ran A/A tests over each of the engagingness, interestingness, and humanness best questions, with the same model appearing as both Speaker 1 and 2. All three tests came back close to 50-50.

Overall, we see this question optimization step as an important pre-requisite for our main experiments, and use the best discovered phrasing in each case. We encourage further research to use them as well.

Benchmarking: Evaluation of State-of-the-art


Loses % matches \ Wins % matches | RC | KV | INQ | HF | INT | LIC | PE | H
RC  |    | 50 | 58 | 54 | 66 | 68 | 69 | 67
KV  | 50 |    | 57 | 55 | 57 | 57 | 61 | 60
INQ | 42 | 43 |    | 51 | 59 | 52 | 62 | 71
HF  | 46 | 45 | 49 |    | 55 | 54 | 57 | 64
INT | 34 | 43 | 41 | 45 |    | 52 | 54 | 52
LIC | 32 | 43 | 48 | 46 | 48 |    | 53 | 65
PE  | 31 | 39 | 38 | 43 | 46 | 47 |    | 53
H   | 33 | 40 | 29 | 36 | 48 | 35 | 47 |

Table 2: Acute-Eval results on the Engagingness question for the PersonaChat models talking to humans. Bold win percentages indicate significance.

Lose Margin \ Win Margin | RC | KV | INQ | HF | INT | LIC | H
RC  |      |      | .18  |      | .10  |      | .42
KV  |      |      |      | .17  |      | .58  |
INQ | -.18 |      |      |      | -.08 |      | .24
HF  |      | -.17 |      |      |      | .41  |
INT | -.10 |      | .08  |      |      |      | .32
LIC |      | -.58 |      | -.41 |      |      |
H   | -.42 |      | -.24 |      | -.32 |      |

Table 3: Likert pairwise differences for Engagingness on PersonaChat, where known. Differences are collected from multiple papers and may not be directly comparable.

Loses % matches \ Wins % matches | RC | KV | INQ | HF | INT | LIC | PE | H
RC  |    | 58 | 67 | 42 | 73 | 68 | 74 | 74
KV  | 42 |    | 51 | 26 | 57 | 60 | 63 | 71
INQ | 33 | 49 |    | 25 | 63 | 66 | 63 | 72
HF  | 58 | 74 | 75 |    | 81 | 81 | 82 | 81
INT | 27 | 43 | 37 | 16 |    | 51 | 51 | 63
LIC | 32 | 40 | 34 | 19 | 49 |    | 55 | 60
PE  | 26 | 37 | 37 | 18 | 49 | 45 |    | 61
H   | 26 | 29 | 28 | 19 | 37 | 40 | 39 |

Table 4: Acute-Eval results for self-chats for the Engagingness question on PersonaChat. Results largely agree with the human-model evaluations (Table 2) and the Likert evaluations (Table 3).

We first compare all 7 models and humans on the PersonaChat task using Acute-eval over the human-model chats using the optimized engagingness question. In total, we evaluate 28 paired comparisons. Results are given in Table 2. Bold win percentages indicate significance.

We first observe that the models form a clean well-ordered set, and there are no rock-paper-scissors effects, giving an order Human > PE > LIC > INT > HF > INQ > KV > RC. In general, these results agree closely with the known Likert comparisons made in prior papers, shown in Table 3. Similar conclusions are derived for the interestingness and humanness questions as well (see Tables 6 and 5, respectively), although the model ordering is slightly different for those questions. See et al. [2019] previously showed that different models often exhibit different rankings for different metrics, and Acute-eval results remain largely consistent with Likert.
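The absence of rock-paper-scissors effects can be checked mechanically by looking for three-model cycles in the pairwise results. The sketch below assumes a dictionary keyed by model pairs and treats a win percentage above 50 as "beats"; both the data layout and the toy numbers for the hypothetical models X, Y and Z are assumptions for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple

def find_cycles(win_pct: Dict[Tuple[str, str], float]) -> List[Tuple[str, str, str]]:
    """Return triples (a, b, c) where a beats b, b beats c and c beats a.

    win_pct[(a, b)] is the percentage of matches that a wins against b;
    the reverse direction is inferred when only one direction is given.
    """
    models = sorted({m for pair in win_pct for m in pair})

    def beats(a: str, b: str) -> bool:
        if (a, b) in win_pct:
            return win_pct[(a, b)] > 50.0
        if (b, a) in win_pct:
            return win_pct[(b, a)] < 50.0
        return False  # matchup not observed

    # Requiring a to be the smallest name reports each cycle only once.
    return [(a, b, c) for a, b, c in permutations(models, 3)
            if a < b and a < c and beats(a, b) and beats(b, c) and beats(c, a)]

well_ordered = {("X", "Y"): 60, ("Y", "Z"): 55, ("X", "Z"): 70}
cyclic = {("X", "Y"): 60, ("Y", "Z"): 55, ("Z", "X"): 52}
print(find_cycles(well_ordered), find_cycles(cyclic))
```

An empty result means the observed matchups contain no three-way cycle; exact 50-50 splits are treated as ties and never form an edge.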

A surprising result for the community is that the retrieval model PE outperforms all generative models, as the community has focused heavily on building generative models, e.g. almost all 23 entrants to the ConvAI2 competition [Dinan et al.2019a]. Now that the current best performing models have been benchmarked against each other we hope future research will use the same approach so the state-of-the-art can be clearly tracked.


Figure 2: Randomly chosen example of Hugging Face (HF) model talking with itself. HF self-chat degenerates rapidly, explaining its poor performance. Other models handle self-chat more successfully, see Fig. 3 and Supplementary Material.
Figure 3: Randomly chosen example of Polyencoder (PE) model talking with itself (self-chat).

Acute-Eval, Wins %:
Loses \ Wins | RC | LIC | INT | PE | H
RC  |    | 53 | 64 | 68 | 73
LIC | 47 |    | 54 | 56 | 59
INT | 36 | 46 |    | 51 | 59
PE  | 32 | 44 | 49 |    | 54
H   | 27 | 41 | 41 | 46 |

Likert, Win Margin:
Loses \ Wins | RC | INT | H
RC  |      | -.01 | .90
INT | -.01 |      | .91
H   | -.90 | -.91 |

Table 5: Results on the Humanness question for the PersonaChat models talking to humans. Acute-Eval (left) is able to identify significant differences between INT and RC when Likert (known published differences, right) does not.
Figure 4: Example of the Wizard Retrieval (RK) talking with a human. The Wizard model is able to use facts from Wikipedia during its conversation.

We perform Acute-eval over self-chats instead of human-model chats. We compare all models and humans (via human-human chats) in an otherwise identical setup to the human-bot evaluation for PersonaChat. Results are given in Table 4.

We observe very similar conclusions to human-model chats in terms of winning models, making this a viable and considerably cheaper alternative to collecting human-model conversations. This approach also appears to require relatively fewer annotations/person-hours in this case to achieve statistical significance. One important caveat is the performance of the HF model. HF self-chats surface degeneracies in the model itself, and do not look natural (see Figure 2 for an example), explaining its poor performance compared to all other models. The other models do not exhibit this behavior and, apart from HF, are ordered by humans exactly the same as for human-bot chats. For example, see Figure 3 for PE engaging in self-chat more successfully. Because the inadequacies of a specific model, in this case HF, can distort the comparison, conclusions from self-chat results must be handled with care, but we believe self-chat is a reasonable choice for early experiments in the model development cycle, enabling faster research iteration.

One concern with self-chat is that powerful models could easily cheat, and simply recall training examples with perfect accuracy. In practice, we found that none of the models exhibit this behavior: 1% of the Polyencoder’s call-response utterance pairs produced during self-chats come directly from the training set. The worst offender, INQ, has roughly 10% of pairs coming from training, but this stems from it using the same generic greeting and response in nearly all conversations (“Hello, how are you doing today?”, “I am doing well, how about yourself?”).
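A memorization check of this kind can be sketched by counting how many consecutive (call, response) utterance pairs in the self-chats also occur in the training dialogues. The sketch assumes exact string matching and dialogues stored as lists of utterances; the actual matching procedure used for the numbers above is not specified here.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]

def call_response_pairs(dialogue: List[str]) -> List[Pair]:
    """Consecutive (utterance, reply) pairs of one dialogue."""
    return list(zip(dialogue, dialogue[1:]))

def train_overlap(self_chats: Iterable[List[str]],
                  train_dialogues: Iterable[List[str]]) -> float:
    """Fraction of self-chat call-response pairs seen verbatim in training."""
    train_pairs: Set[Pair] = set()
    for dialogue in train_dialogues:
        train_pairs.update(call_response_pairs(dialogue))
    generated = [p for d in self_chats for p in call_response_pairs(d)]
    if not generated:
        return 0.0
    return sum(p in train_pairs for p in generated) / len(generated)

# Toy data for illustration only.
train = [["hello", "hi there", "how are you"]]
chats = [["hello", "hi there", "what's up"], ["a", "b"]]
print(f"{train_overlap(chats, train):.0%} of pairs come from training")
```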


Acute-Eval, Wins %:
Loses \ Wins | RC | LIC | INT | PE | H
RC  |    | 52 | 71 | 75 | 76
LIC | 48 |    | 57 | 66 | 66
INT | 29 | 43 |    | 55 | 64
PE  | 25 | 34 | 45 |    | 52
H   | 24 | 34 | 36 | 48 |

Likert, Win Margin:
Loses \ Wins | RC | INT | H
RC  |      | .04  | .26
INT | -.04 |      | .23
H   | -.26 | -.23 |

Table 6: Results on the Interestingness question for the PersonaChat models talking to humans. Acute-Eval (left) is able to identify significant differences between INT and RC when Likert (known published differences, right) does not.

Acute-Eval, Wins %:
Loses \ Wins | GU | GK | RU | RK | H
GU |    | 67 | 79 | 75 | 77
GK | 33 |    | 64 | 63 | 73
RU | 21 | 36 |    | 52 | 48
RK | 25 | 37 | 48 |    | 62
H  | 23 | 27 | 52 | 38 |

Likert, Win Margin:
Loses \ Wins | GU | GK | RU | RK | H
GU |      | .39  | .58  | .60  | 1.8
GK | -.39 |      | .19  | .21  | 1.4
RU | -.58 | -.19 |      | .02  | 1.2
RK | -.60 | -.21 | -.02 |      | 1.2
H  | -1.8 | -1.4 | -1.2 | -1.2 |

Table 7: Results on the Engagingness question for the Wizard of Wikipedia models (G/R for Generative/Retrieval and U/K for without and with access to knowledge). Left shows the Acute-Eval results, and right shows known Likert differences. Our method shows statistical significance between several methods that Likert does not.

Loses % \ Wins % | GU | GK | RU | RK | H
GU |    | 79 | 85 | 82 | 76
GK | 21 |    | 54 | 70 | 56
RU | 15 | 46 |    | 49 | 48
RK | 18 | 30 | 51 |    | 47
H  | 24 | 44 | 52 | 53 |

Table 8: Acute-Eval results on the Knowledgeability question for Wizard of Wikipedia models (G/R for Generative/Retrieval and U/K for without and with access to knowledge).

Wizard of Wikipedia

We similarly compare all 4 models and humans on the optimized engaging and knowledge questions. The results are given in Tables 7 and 8. We again find retrieval models outperform generative models, with knowledge attention (GK) clearly helping the generative models, but with RU and RK very close.

Results largely agree between the two questions, except retrieval with knowledge (RK) more clearly beats the generative version (GK) than retrieval without (RU) when the question is about knowledge. For the engagingness question, where it makes sense that this is less important, there is little difference between knowledge or not.

Figure 5: Relative cost effectiveness of potential collection methods: Likert and Acute-eval human-model chat and self-chat pairwise tests. Our methods obtain statistical significance with fewer person hours; Likert fails in this case.

Comparison to Likert

We compare Acute-eval to multi-turn Likert for both tasks by computing pairwise Likert differences, where known, from the original papers. We do not compare across papers as evaluation setups differ. Values are provided in Tables 3, 5, 6 and 7. While the tests generally agree, Acute-eval can be a more sensitive test, which more often yields significance. On Wizard of Wikipedia where all Likert matchups are known, 8 of the pairwise matchups are significant for our test with human-model chats, while 6 are significant for Likert. On PersonaChat for the interestingness question, 6 of 10 matchups are significant for Acute-eval, including all known Likert matchups, which only has 2 of 3 that are significant. For the humanness question, 5 of 10 matchups are significant for Acute-eval, including all known Likert matchups, which only has 2 of 3 that are significant. For the engagingness question, 5 of the 9 Likert matchups are significant. All 9 are significant for Acute-eval when using self-chats; 3 are significant for human-model chats.

We compare the cost effectiveness of Likert to Acute-eval human-model and self-chat comparisons in Figure 5. Shown is the PersonaChat Engagingness question comparing RC and INT models, a fairly tight matchup. We show the % chance of achieving significance when drawing pairs of dialogues at random, plotting with respect to person-hours spent annotating. In this case Likert fails to achieve significance, likely due to bias and variance issues with integer scores. Acute-eval human-model and self-chat pairwise tests perform well, achieving significance; self-chat requires fewer person-hours.
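The chance of achieving significance for a given annotation budget can be estimated by resampling collected pairwise votes. The sketch below makes several assumptions: votes are coded 1 when the first model is preferred and 0 otherwise, annotations are drawn with replacement, and the significance threshold is 0.05. Converting the number of annotations into person-hours would additionally require a time-per-annotation estimate, which is not modeled here.

```python
import random
from math import comb
from typing import Sequence

def two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided binomial test against p = 0.5."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def chance_of_significance(votes: Sequence[int], n_annotations: int,
                           n_resamples: int = 1000, alpha: float = 0.05,
                           seed: int = 0) -> float:
    """Fraction of resampled trials of size n_annotations that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        wins = sum(rng.choice(votes) for _ in range(n_annotations))
        if two_sided_p(wins, n_annotations) < alpha:
            hits += 1
    return hits / n_resamples

# Hypothetical pool of votes in which the first model wins 65% of the time.
pool = [1] * 65 + [0] * 35
for n in (20, 50, 100):
    print(n, chance_of_significance(pool, n))
```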


Conclusion

Studying the ability of machines to communicate with humans is an important long-term goal of AI research. Unfortunately, measuring progress towards that goal has been hampered by the trustworthiness of evaluation itself. Current human evaluation methods such as multi-turn Likert are expensive to run, have annotator bias and variance problems, and can fail to yield statistical significance.

In this work we have contributed a novel evaluation method that alleviates some of these problems. By optimizing questions and performing comparisons on pairs of human-bot dialogues, we arrive at more sensitive statistical tests when benchmarking current state-of-the-art models. By utilizing self-chat bot evaluations, we can often improve sensitivity while making evaluations even cheaper. We will publicly release the code for our tests and recommend their use in future research studies in order to push forward the state of the art.

References


  • [Bordes, Boureau, and Weston2017] Bordes, A.; Boureau, Y.-L.; and Weston, J. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of the International Conference on Learning Representations.
  • [Dinan et al.2019a] Dinan, E.; Logacheva, V.; Malykh, V.; Miller, A.; Shuster, K.; Urbanek, J.; Kiela, D.; Szlam, A.; Serban, I.; Lowe, R.; et al. 2019a. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.
  • [Dinan et al.2019b] Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and Weston, J. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
  • [El Asri et al.2017] El Asri, L.; Schulz, H.; Sharma, S.; Zumer, J.; Harris, J.; Fine, E.; Mehrotra, R.; and Suleman, K. 2017. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGDIAL Meeting on Discourse and Dialogue, 207–219. ACL.
  • [Ghandeharioun et al.2019] Ghandeharioun, A.; Shen, J. H.; Jaques, N.; Ferguson, C.; Jones, N.; Lapedriza, À.; and Picard, R. W. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. arXiv preprint arXiv:1906.09308.
  • [Hastie2012] Hastie, H. 2012. Metrics and evaluation of spoken dialogue systems. In Lemon, O., and Pietquin, O., eds., Data-Driven Methods for Adaptive Spoken Dialogue Systems. Springer. 131–150.
  • [Henderson, Thomson, and Williams2014] Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 263–272.
  • [Humeau et al.2019] Humeau, S.; Shuster, K.; Lachaux, M.; and Weston, J. 2019. Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969.
  • [Kulikov et al.2018] Kulikov, I.; Miller, A. H.; Cho, K.; and Weston, J. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.
  • [Li et al.2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, 110–119. ACL.
  • [Li et al.2016b] Li, J.; Galley, M.; Brockett, C.; Spithourakis, G. P.; Gao, J.; and Dolan, B. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 994–1003. ACL.
  • [Li et al.2016c] Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; and Gao, J. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1192–1202.
  • [Li et al.2017] Li, J.; Monroe, W.; Shi, T.; Jean, S.; Ritter, A.; and Jurafsky, D. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2157–2169. ACL.
  • [Liu et al.2016] Liu, C.-W.; Lowe, R.; Serban, I.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2122–2132. ACL.
  • [Lowe et al.2017] Lowe, R.; Noseworthy, M.; Serban, I. V.; Angelard-Gontier, N.; Bengio, Y.; and Pineau, J. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1116–1126. ACL.
  • [Miller et al.2016] Miller, A.; Fisch, A.; Dodge, J.; Karimi, A.-H.; Bordes, A.; and Weston, J. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1400–1409. ACL.
  • [Novikova et al.2017] Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2241–2252. ACL.
  • [Novikova, Dušek, and Rieser2018] Novikova, J.; Dušek, O.; and Rieser, V. 2018. RankME: Reliable human ratings for natural language generation. arXiv preprint arXiv:1803.05928.
  • [Parthasarathi and Pineau2018] Parthasarathi, P., and Pineau, J. 2018. Extending neural generative conversational model using external knowledge sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 690–695. ACL.
  • [See et al.2019] See, A.; Roller, S.; Kiela, D.; and Weston, J. 2019. What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 1702–1723. ACL.
  • [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, 3776–3784.
  • [Venkatesh et al.2017] Venkatesh, A.; Khatri, C.; Ram, A.; Guo, F.; Gabriel, R.; Nagar, A.; Prasad, R.; Cheng, M.; Hedayatnia, B.; Metallinou, A.; et al. 2017. On evaluating and comparing conversational agents. Advances in Neural Information Processing Systems, Conversational AI Workshop.
  • [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. In Proceedings of the 31st International Conference on Machine Learning, Deep Learning Workshop.
  • [Wen et al.2017] Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gasic, M.; Rojas Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 438–449. ACL.
  • [Wolf et al.2019] Wolf, T.; Sanh, V.; Chaumond, J.; and Delangue, C. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
  • [Zhang et al.2018] Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; and Weston, J. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2204–2213. ACL.

Supplementary Material

Figure 6: Randomly chosen examples of the Hugging Face (HF) model talking with a human (left) and with itself (self-chat, right). HF self-chat degenerates rapidly, explaining its poor performance. Other models do not exhibit this degeneration.
Figure 7: Examples of the Lost in Conversation (LIC) model talking with a human subject (left) and with itself (right). Both examples were selected randomly.
Figure 8: Examples of the Polyencoder (PE) model talking with a human subject (left) and with itself (right). Both examples were selected randomly.
Figure 9: Examples of Wizard of Wikipedia chats. Left shows the Generative model (GK) talking with a human subject. Right shows the Retrieval model (RK).