Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

07/24/2019
by Prakhar Gupta, et al.

The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time-consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments shows that the use of multiple references results in improved correlation between several automatic metrics and human judgement for both the quality and the diversity of system output.
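The core idea is that a reference-based metric becomes fairer when a system response is scored against several human-written references instead of one, since acceptable responses worded differently from a single reference otherwise score near zero. As a rough illustration only (not the paper's exact evaluation pipeline; the example sentences, tokenization, and smoothing choice below are placeholder assumptions), NLTK's sentence-level BLEU already handles multiple references by clipping n-gram counts against the best-matching reference:

```python
# Minimal sketch of multi-reference BLEU with NLTK.
# The responses and references here are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "i am doing well , thanks for asking .".split()
references = [
    "i am fine , thank you .".split(),
    "pretty good , thanks !".split(),
    "doing great , how about you ?".split(),
]

# sentence_bleu accepts a list of references and clips n-gram counts
# against whichever reference matches best, so adding references can
# only raise the score of a valid but differently worded response.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"multi-reference BLEU: {score:.3f}")
```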


