Towards Best Experiment Design for Evaluating Dialogue System Output

09/23/2019
by Sashank Santhanam et al.

To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from inconsistent ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of those judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that continuous scales achieve more consistent ratings than Likert-scale or ranking-based designs. Additionally, we find that factors such as the time taken to complete the task and a lack of prior experience with similar dialogue-rating studies positively impact consistency and agreement among raters.
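The abstract does not show how the Best-Worst scaling condition is scored; a common convention (not necessarily the authors' exact procedure) is the count-based score, where each item receives the fraction of trials in which raters picked it as best minus the fraction in which they picked it as worst. The sketch below illustrates that scoring; the trial layout and item names are hypothetical.

```python
from collections import Counter

def bws_scores(trials):
    """Count-based Best-Worst scaling scores.

    Each trial is (items, best, worst): the tuple of items shown to a
    rater, the item they chose as best, and the item chosen as worst.
    Returns score(item) = (#best - #worst) / #appearances, in [-1, 1].
    """
    appearances, best_counts, worst_counts = Counter(), Counter(), Counter()
    for items, best, worst in trials:
        appearances.update(items)   # every shown item counts as one appearance
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {
        item: (best_counts[item] - worst_counts[item]) / n
        for item, n in appearances.items()
    }

# Hypothetical example: raters judge 4-tuples of system responses A-E.
trials = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "E"), "B", "E"),
    (("B", "C", "D", "E"), "B", "D"),
]
print(bws_scores(trials))
# {'A': 0.5, 'B': 0.667, 'C': 0.0, 'D': -1.0, 'E': -0.5}
```

A score near +1 means an item was almost always chosen as best when shown, and a score near -1 that it was almost always chosen as worst, which yields a ranking of system outputs without asking raters for absolute scale judgments.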


