Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

06/06/2023
by Abishek Komma, et al.

Measurement of interaction quality is a critical task for the improvement of spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately after an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). In DQA, expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
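To make the contrast in (ii) concrete, here is a minimal illustrative sketch, not the paper's actual model: a turn-aggregation baseline next to a dialog-level scorer that also consumes dialog attributes such as goal completion and user sentiment. All function names, weights, and numbers below are hypothetical; in the paper's setting these weights would be learned by a supervised model from dialog-level (DQA) annotations rather than hand-set.

```python
def aggregate_turn_scores(turn_scores):
    """Baseline: estimate dialog quality as the mean of per-turn quality scores."""
    return sum(turn_scores) / len(turn_scores)


def dialog_level_score(turn_scores, goal_completed, sentiment):
    """Sketch of a dialog-level scorer: combines aggregated turn quality
    with dialog-level attributes (goal completion, user sentiment in [-1, 1]).
    The weights here are hypothetical; a trained model would learn them
    from dialog-level annotations."""
    base = aggregate_turn_scores(turn_scores)
    goal = 1.0 if goal_completed else 0.0
    return 0.5 * base + 0.3 * goal + 0.2 * sentiment


# Toy dialog: three turns, goal achieved, mildly positive sentiment.
turns = [0.8, 0.6, 0.9]
print(aggregate_turn_scores(turns))            # turn-aggregation baseline
print(dialog_level_score(turns, True, 0.5))    # with dialog-level attributes
```

The point of the sketch is only the shape of the comparison: the baseline sees nothing beyond the turns, while the dialog-level scorer can reward outcomes (a completed goal, positive sentiment) that no single turn fully captures.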


