Evaluating Dialogue Generation Systems via Response Selection

04/29/2020
by Shiki Sato, et al.

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method for constructing response selection test sets with well-chosen false candidates. Specifically, we construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with test sets developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
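As a rough illustration of the evaluation protocol described in the abstract (not the authors' released code), the sketch below computes response selection accuracy for a system that can score candidate responses given a dialogue context. The `SelectionInstance` type and the `score` callable are hypothetical stand-ins for a curated test instance and a system's scoring function; under the paper's setup, systems would then be ranked by this accuracy and the ranking compared against human evaluation.

```python
# Minimal sketch, assuming a hypothetical score(context, response) interface.
# Each test instance pairs a dialogue context with its ground-truth response
# and a set of well-chosen false candidates.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SelectionInstance:
    context: str                 # dialogue context (previous utterances)
    ground_truth: str            # the reference response
    false_candidates: List[str]  # curated false candidates


def selection_accuracy(
    score: Callable[[str, str], float],
    test_set: List[SelectionInstance],
) -> float:
    """Fraction of instances where the system scores the ground-truth
    response above every false candidate."""
    correct = 0
    for inst in test_set:
        gt_score = score(inst.context, inst.ground_truth)
        if all(gt_score > score(inst.context, c) for c in inst.false_candidates):
            correct += 1
    return correct / len(test_set)
```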


Related research

03/25/2016
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
We investigate evaluation metrics for dialogue response generation syste...

06/12/2020
Speaker Sensitive Response Evaluation Model
Automatic evaluation of open-domain dialogue response generation is very...

06/11/2021
Local Explanation of Dialogue Response Generation
In comparison to the interpretation of classification models, the explan...

07/06/2018
The price of debiasing automatic metrics in natural language evaluation
For evaluating generation systems, automatic metrics such as BLEU cost n...

04/06/2020
Grayscale Data Construction and Multi-Level Ranking Objective for Dialogue Response Selection
Response selection plays a vital role in building retrieval-based conver...

11/25/2020
Learning to Expand: Reinforced Pseudo-relevance Feedback Selection for Information-seeking Conversations
Intelligent personal assistant systems for information-seeking conversat...

02/23/2019
Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructu...
