Human-Centered Metrics for Dialog System Evaluation

05/24/2023
by Salvatore Giorgi, et al.

We present metrics for evaluating dialog systems through a psychologically grounded "human" lens: conversational agents express a diversity of both states (short-term factors like emotions) and traits (longer-term factors like personality), just as people do. These interpretable metrics consist of five measures drawn from established psychology constructs that can be applied both across dialogs and to turns within dialogs: emotional entropy, linguistic style matching, emotion matching, agreeableness, and empathy. We compare these human metrics against 6 state-of-the-art automatic metrics (e.g., BARTScore and BLEURT) on 7 standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that the proposed human metrics offer novel information, are uncorrelated with automatic metrics, and improve accuracy beyond existing automatic metrics in predicting crowd-sourced dialog judgements. The interpretability and unique signal of our proposed human-centered framework make it a valuable tool for evaluating and improving dialog systems.
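
As a concrete illustration of one of these measures, the sketch below computes dialog-level emotional entropy from per-turn emotion probability distributions (e.g., from an off-the-shelf emotion classifier). The function name, label set, and aggregation by simple averaging are illustrative assumptions, not the paper's exact formulation.

```python
import math
from typing import Dict, List

def emotional_entropy(turn_emotions: List[Dict[str, float]]) -> float:
    """Shannon entropy (in bits) of the emotion distribution over a dialog.

    `turn_emotions` holds one probability distribution over emotion labels
    per turn. Higher entropy means the agent expresses a wider, more
    balanced mix of emotions; lower entropy means one emotion dominates.
    """
    # Average the per-turn distributions into one dialog-level distribution
    # (illustrative aggregation; the paper's exact procedure may differ).
    totals: Dict[str, float] = {}
    for dist in turn_emotions:
        for label, p in dist.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(turn_emotions)
    dialog_dist = {label: total / n for label, total in totals.items()}

    # Shannon entropy over the aggregated distribution.
    return -sum(p * math.log2(p) for p in dialog_dist.values() if p > 0)

# A bot locked into one emotion scores lower than one with a balanced mix.
balanced = [{"joy": 0.25, "sadness": 0.25, "anger": 0.25, "fear": 0.25}] * 4
peaked = [{"joy": 0.9, "sadness": 0.05, "anger": 0.03, "fear": 0.02}] * 4
print(emotional_entropy(balanced))  # 2.0 bits
print(emotional_entropy(peaked))    # about 0.62 bits
```

The same per-turn distributions could also feed a turn-level variant of the measure, since the abstract notes the metrics apply both across dialogs and to turns within dialogs.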
