Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems

12/18/2022
by   Sarah E. Finch, et al.

There have been great recent advances in human-computer chat. However, proper evaluation currently requires human judgments, which produce notoriously high-variance metrics due to their inherent subjectivity. Furthermore, there is little standardization in the methods and labels used for evaluation, and an overall lack of work comparing and assessing the validity of various evaluation approaches. As a consequence, existing evaluation results likely paint an incomplete picture of the strengths and weaknesses of open-domain chatbots. We aim toward a dimensional evaluation of human-computer chat that can reliably measure several distinct aspects of chat quality. To this end, we present a novel human evaluation method that quantifies the rate of several quality-related chatbot behaviors. Our results demonstrate that our method is more suitable for dimensional chat evaluation than alternative Likert-style or comparative methods. We then use our validated method, alongside existing methods, to evaluate four open-domain chat models from the recent literature.


Related research

08/03/2021 · How to Evaluate Your Dialogue Models: A Review of Approaches
Evaluating the quality of a dialogue system is an understudied problem. ...

06/19/2022 · MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue
Automatic open-domain dialogue evaluation is a crucial component of dial...

09/14/2023 · Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Human evaluation has been widely accepted as the standard for evaluating...

10/25/2022 · FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Recent model-based reference-free metrics for open-domain dialogue evalu...

07/31/2019 · The Validity, Generalizability and Feasibility of Summative Evaluation Methods in Visual Analytics
Many evaluation methods have been used to assess the usefulness of Visua...

12/14/2019 · Leveraging Multi-Method Evaluation for Multi-Stakeholder Settings
In this paper, we focus on recommendation settings with multiple stakeho...

09/12/2022 · Open-Domain Dialog Evaluation using Follow-Ups Likelihood
Automatic evaluation of open-domain dialogs remains an unsolved problem....
