Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

01/12/2022
by Eric Michael Smith, et al.

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), and human evaluations are still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use each method, and to possible future directions.
