Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

05/21/2020
by Weixin Liang, et al.

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
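To make the comparison formulation concrete, here is a minimal sketch of how Likert-rated dialogs might be turned into pairwise comparison examples. The abstract does not describe how pairs are actually built, so the function name `build_comparison_pairs`, the `min_gap` parameter, and the toy dialog IDs are all hypothetical and are not taken from CMADE itself.

```python
from itertools import combinations


def build_comparison_pairs(rated_dialogs, min_gap=1):
    """Turn (dialog_id, likert_rating) items into pairwise comparison examples.

    Returns (id_a, id_b, label) tuples, where label is 1 if dialog a has the
    higher rating and 0 if dialog b does. Pairs whose ratings differ by less
    than `min_gap` are skipped (an assumption made for this sketch, since
    near-ties carry little comparison signal).
    """
    pairs = []
    for (id_a, r_a), (id_b, r_b) in combinations(rated_dialogs, 2):
        if abs(r_a - r_b) < min_gap:
            continue  # tie or near-tie: no usable comparison label
        pairs.append((id_a, id_b, 1 if r_a > r_b else 0))
    return pairs


# Toy data: four dialogs with self-reported ratings on a 1-5 scale.
ratings = [("d1", 3), ("d2", 5), ("d3", 3), ("d4", 1)]
pairs = build_comparison_pairs(ratings)
for pair in pairs:
    print(pair)
```

A comparison model would then be trained to predict the label from the two dialogs. The appeal of this framing is that it only asks which of two dialogs is better, which may be less sensitive to each user's personal rating scale than predicting the absolute score.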

research
05/01/2020

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

The lack of meaningful automatic evaluation metrics for dialog has imped...
research
07/24/2019

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

The aim of this paper is to mitigate the shortcomings of automatic evalu...
research
06/07/2021

A Comprehensive Assessment of Dialog Evaluation Metrics

Automatic evaluation metrics are a crucial component of dialog systems r...
research
10/05/2021

Investigating the Impact of Pre-trained Language Models on Dialog Evaluation

Recently, there is a surge of interest in applying pre-trained language ...
research
04/11/2023

Approximating Human Evaluation of Social Chatbots with Prompting

Once powerful conversational models have become available for a wide aud...
research
05/30/2019

Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention

Semantically controlled neural response generation on limited-domain has...
research
12/21/2020

Self-attention Comparison Module for Boosting Performance on Retrieval-based Open-Domain Dialog Systems

Since the pre-trained language models are widely used, retrieval-based o...
