ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons

09/06/2019
by Margaret Li, et al.

While dialogue remains an important end-goal of natural language research, the difficulty of evaluation is an oft-quoted reason why it remains troublesome to make real progress towards its solution. Evaluation difficulties are actually two-fold: not only do automatic metrics not correlate well with human judgments, but also human judgments themselves are in fact difficult to measure. The two most used human judgment tests, single-turn pairwise evaluation and multi-turn Likert scores, both have serious flaws as we discuss in this work. We instead provide a novel procedure involving comparing two full dialogues, where a human judge is asked to pay attention to only one speaker within each, and make a pairwise judgment. The questions themselves are optimized to maximize the robustness of judgments across different annotators, resulting in better tests. We also show how these tests work in self-play model chat setups, resulting in faster, cheaper tests. We hope these tests become the de facto standard, and will release open-source code to that end.
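In practice, the protocol reduces to a simple aggregate: each annotator reads two full dialogues side by side, attends to one designated speaker in each, and casts a binary vote for the conversation they prefer. A comparison between two models is then a collection of such votes, scored as a win rate with a binomial significance test. Below is a minimal sketch of that aggregation step. It is not the authors' released ParlAI implementation; the function and variable names (`win_rate_and_pvalue`, `judgments`) are illustrative only.

```python
# Minimal sketch of ACUTE-Eval-style pairwise vote aggregation.
# Not the authors' released code; names here are illustrative.
from math import comb


def win_rate_and_pvalue(judgments):
    """Given a list of pairwise judgments ('A' or 'B'), return model A's
    win rate and a two-sided exact binomial p-value against the null
    hypothesis that both models are preferred equally often."""
    n = len(judgments)
    wins_a = sum(1 for j in judgments if j == "A")
    win_rate = wins_a / n

    # Probability of exactly k A-wins under the null (p = 0.5).
    def point_prob(k):
        return comb(n, k) * 0.5 ** n

    # Two-sided test: sum the probabilities of all outcomes at least
    # as extreme (i.e., as improbable) as the observed count. The tiny
    # relative tolerance guards against floating-point ties.
    observed = point_prob(wins_a)
    p_value = sum(point_prob(k) for k in range(n + 1)
                  if point_prob(k) <= observed * (1 + 1e-9))
    return win_rate, min(p_value, 1.0)


if __name__ == "__main__":
    # Example: 100 annotator votes, 63 preferring model A.
    votes = ["A"] * 63 + ["B"] * 37
    rate, p = win_rate_and_pvalue(votes)
    print(f"Model A win rate: {rate:.2f}, p-value: {p:.4f}")
```

The same aggregation applies whether the dialogues come from human-model chats or from the cheaper self-chat setup the paper advocates, since the unit of measurement is the pairwise vote, not the dialogue collection method.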

Related research

09/13/2017 · A Review of Evaluation Techniques for Social Dialogue Systems
In contrast with goal-oriented dialogue, social dialogue has no clear me...

05/06/2021 · Assessing Dialogue Systems with Distribution Distances
An important aspect of developing dialogue systems is how to evaluate an...

06/02/2021 · DynaEval: Unifying Turn and Dialogue Level Evaluation
A dialogue is essentially a multi-turn interaction among interlocutors. ...

09/19/2023 · MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
To solve complex tasks, large language models (LLMs) often require multi...

10/25/2022 · FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Recent model-based reference-free metrics for open-domain dialogue evalu...

04/14/2022 · Constructing Open Cloze Tests Using Generation and Discrimination Capabilities of Transformers
This paper presents the first multi-objective transformer model for cons...

09/14/2023 · Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Human evaluation has been widely accepted as the standard for evaluating...
