How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

05/23/2023
by Huda Khayrallah, et al.

We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog. Automatic metrics for dialog evaluation should be robust proxies for human judgments; however, verification of that robustness is currently far from satisfactory. To quantify how robustly metrics correlate with human judgments and to understand what a test set needs, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets, and we introduce this new language-learning conversation dataset. We then train 1750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, along with model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.
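To make the idea of multi-reference scoring and metric-to-human correlation concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the toy `unigram_f1` metric, the choice to take the maximum score over references, and the use of Spearman rank correlation are all assumptions made for illustration, and every function name below is hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric (an assumption): token-overlap F1 of two strings."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score against every reference and keep the best match (an assumption)."""
    return max(unigram_f1(hypothesis, ref) for ref in references)


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5
```

With per-system metric scores in one list and per-system human ratings in another, `spearman(metric_scores, human_ratings)` gives one number describing how well the metric ranks systems the way people do. Adding references can only raise or preserve `multi_reference_score` for a given response, which is one reason a multi-reference test set may change how a metric ranks systems.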


research
07/24/2019

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

The aim of this paper is to mitigate the shortcomings of automatic evalu...
research
04/12/2021

SuperSim: a test set for word similarity and relatedness in Swedish

Language models are notoriously difficult to evaluate. We release SuperS...
research
05/29/2021

Annotation Inconsistency and Entity Bias in MultiWOZ

MultiWOZ is one of the most popular multi-domain task-oriented dialog da...
research
10/05/2021

Investigating the Impact of Pre-trained Language Models on Dialog Evaluation

Recently, there is a surge of interest in applying pre-trained language ...
research
06/23/2020

Unsupervised Evaluation of Interactive Dialog with DialoGPT

It is important to define meaningful and interpretable automatic evaluat...
research
05/24/2023

Human-Centered Metrics for Dialog System Evaluation

We present metrics for evaluating dialog systems through a psychological...
research
10/22/2022

EnDex: Evaluation of Dialogue Engagingness at Scale

We propose EnDex, the first human-reaction based model to evaluate dialo...
