Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

by Prakhar Gupta, et al.

The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, but doing so is expensive and time-consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments shows that the use of multiple references results in improved correlation between several automatic metrics and human judgement, for both the quality and the diversity of system output.
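The core idea of multi-reference evaluation can be sketched in a few lines: score a system response against each of the collected references and keep the best match, so a response is not penalized merely for diverging from the single gold reply. The sketch below uses unigram F1 overlap as a stand-in metric (an assumption for illustration; in practice one would plug in metrics such as BLEU or METEOR), and the max-over-references aggregation is likewise one common choice rather than the only one.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Unigram F1 overlap between a hypothesis and one reference.
    A simple stand-in for metrics like BLEU/METEOR."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(hypothesis, references, metric=unigram_f1):
    """Score a response against every reference and keep the best match,
    so any one valid reply is enough for a high score."""
    return max(metric(hypothesis, r) for r in references)

# Several distinct replies can all be valid answers to the same prompt.
refs = ["i am doing well thanks",
        "pretty good how about you",
        "not bad at all"]
print(multi_reference_score("i am doing well", refs))
```

With a single reference such as "pretty good how about you", the response above would score zero despite being perfectly acceptable; the multi-reference maximum recovers a high score via the first reference.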


REAM♯: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

The lack of reliable automatic evaluation metrics is a major impediment ...

Referring to the recently seen: reference and perceptual memory in situated dialog

From theoretical linguistic and cognitive perspectives, situated dialog ...

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Open-domain dialog system evaluation is one of the most important challe...

Open-Domain Dialog Evaluation using Follow-Ups Likelihood

Automatic evaluation of open-domain dialogs remains an unsolved problem....

SMRT Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multiple Reference Training

Non-task-oriented dialog models suffer from poor quality and non-diverse...

Designing Precise and Robust Dialogue Response Evaluators

Automatic dialogue response evaluator has been proposed as an alternativ...
