PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

04/06/2020
by Tian Lan, et al.

Open-domain generative dialogue systems have attracted considerable attention over the past few years, yet how to evaluate them automatically remains a major challenge. To the best of our knowledge, there are three kinds of automatic methods for evaluating open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Due to the lack of systematic comparison, it is not clear which kind of metric is most effective. In this paper, we first systematically measure all kinds of automatic evaluation metrics under the same experimental setting to determine which kind is best. Through extensive experiments, learning-based metrics are demonstrated to be the most effective evaluation metrics for open-domain generative dialogue systems. Moreover, we observe that nearly all learning-based metrics depend on a negative sampling mechanism, which yields an extremely imbalanced and low-quality dataset for training the score model. To address this issue, we propose PONE, a novel and feasible learning-based metric that significantly improves the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples. Extensive experiments demonstrate that our proposed evaluation method significantly outperforms the state-of-the-art learning-based evaluation methods, with an average correlation improvement of 13.18% between our proposed method and state-of-the-art baselines.
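The abstract only names the two sampling schemes, so the following is a minimal, self-contained Python sketch (not the authors' code) contrasting vanilla random negative sampling with a PONE-style construction of training pairs. The bag-of-words cosine scorer, the word-dropout "augmentation", and the function names `random_negative_sampling` and `pone_style_sampling` are all simplifying assumptions standing in for the learned models the paper would actually use.

```python
import random
from collections import Counter
from math import sqrt

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two utterances."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def random_negative_sampling(dialogues):
    """Baseline scheme: pair each context with its reference (label 1) and a
    randomly drawn response from another dialogue (label 0). Random negatives
    are often trivially unrelated, which is the weakness PONE targets."""
    pairs = []
    for i, (context, reference) in enumerate(dialogues):
        pairs.append((context, reference, 1))
        j = random.choice([k for k in range(len(dialogues)) if k != i])
        pairs.append((context, dialogues[j][1], 0))
    return pairs

def pone_style_sampling(dialogues, n_candidates=5):
    """Hypothetical PONE-style scheme (illustrative only):
    (1) augment positives, here via a crude word-dropout "paraphrase";
    (2) keep only valuable negatives: the sampled response most similar to
        the reference, i.e. the one most informative to discriminate."""
    pairs = []
    responses = [r for _, r in dialogues]
    for context, reference in dialogues:
        pairs.append((context, reference, 1))
        words = reference.split()
        if len(words) > 3:  # stand-in for model-based positive augmentation
            dropped = " ".join(w for k, w in enumerate(words) if k % 3 != 1)
            pairs.append((context, dropped, 1))
        candidates = random.sample(responses, min(n_candidates, len(responses)))
        hard = max(candidates, key=lambda r: bow_cosine(r, reference))
        if hard != reference:
            pairs.append((context, hard, 0))
    return pairs

if __name__ == "__main__":
    data = [
        ("how are you doing today", "i am doing great thanks for asking"),
        ("what is your favorite food", "i really love pizza and pasta"),
        ("did you watch the game", "yes the final score was amazing"),
    ]
    for c, r, y in pone_style_sampling(data):
        print(y, "|", c, "->", r)
```

In the paper's actual setting, a generation or paraphrase model would produce the augmented positives and a trained scorer would weight the negatives; this sketch only shows where the two schemes diverge in how training pairs are built.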



