How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

03/25/2016
by Chia-Wei Liu, et al.

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
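To make the setup concrete, the sketch below (not the paper's code) shows how such an overlap-based metric is typically applied and checked against human judgements: sentence-level BLEU between each generated response and its single target response, followed by a Spearman rank correlation with human ratings. The example data, the use of nltk and scipy, and all names below are assumptions for illustration only.

```python
# Minimal sketch (illustrative, not the paper's pipeline): score generated
# responses against a single target with a word-overlap metric (sentence-level
# BLEU), then measure how well those scores track human ratings.
# Assumes nltk and scipy are installed; all data below is hypothetical.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical (generated response, single target response, human rating 1-5) triples.
examples = [
    ("i can help you with that", "sure , i can help with that", 4),
    ("try reinstalling the driver", "have you tried rebooting ?", 2),
    ("what movie did you see ?", "which film was it ?", 5),
]

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences

bleu_scores, human_scores = [], []
for generated, target, human in examples:
    score = sentence_bleu(
        [target.split()],        # BLEU expects a list of tokenized references
        generated.split(),
        smoothing_function=smooth,
    )
    bleu_scores.append(score)
    human_scores.append(human)

# Spearman rank correlation between the metric and the human judgements.
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

The third example hints at the failure mode the paper documents: a perfectly reasonable response that shares few words with the single target receives a low overlap score even though humans rate it highly.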

research · 04/29/2020 · Evaluating Dialogue Generation Systems via Response Selection
Existing automatic evaluation metrics for open-domain dialogue response ...

research · 06/29/2017 · Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
Automated metrics such as BLEU are widely used in the machine translatio...

research · 06/10/2021 · Shades of BLEU, Flavours of Success: The Case of MultiWOZ
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for b...

research · 06/02/2016 · Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation
We introduce the multiresolution recurrent neural network, which extends...

research · 05/19/2022 · Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation
Target-guided response generation enables dialogue systems to smoothly t...

research · 06/13/2023 · HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
Similes play an imperative role in creative writing such as story and di...
