Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

04/24/2019
by Sarik Ghazarian, et al.

Despite advances in open-domain dialogue systems, automatic evaluation of such systems remains a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there can be many valid responses for a given context that share no common words with the reference responses. A recent work proposed the Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER), which combines a learning-based metric that predicts the relatedness between a generated response and a given query with a reference-based metric, and showed high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores and thus build better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.
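To make the idea concrete, the sketch below (not the authors' code) shows how a referenced score could be computed with contextualized embeddings from an off-the-shelf BERT encoder: the generated response and the human reference are each encoded, mean-pooled into a single vector, and compared by cosine similarity, then blended with an unreferenced score in RUBER's style. The choice of `bert-base-uncased`, mean pooling, the [0, 1] rescaling, and the simple averaging blend are illustrative assumptions; the trained unreferenced relatedness model is only stubbed out.

```python
# Hedged sketch of a BERT-based referenced metric in the spirit of the paper.
# Assumptions (not from the source): bert-base-uncased encoder, mean pooling,
# cosine similarity rescaled to [0, 1], and a simple average blend.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()


def embed(text: str) -> torch.Tensor:
    """Mean-pool the final-layer contextualized token embeddings into one vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)              # (768,)


def referenced_score(generated: str, reference: str) -> float:
    """Cosine similarity between pooled embeddings, mapped from [-1, 1] to [0, 1]."""
    sim = torch.nn.functional.cosine_similarity(
        embed(generated), embed(reference), dim=0
    )
    return (sim.item() + 1.0) / 2.0


def blended_score(referenced: float, unreferenced: float) -> float:
    """RUBER-style blending of the two scores; plain averaging is assumed here."""
    return 0.5 * (referenced + unreferenced)


if __name__ == "__main__":
    reply = "I went hiking with some friends."
    reference = "I spent the weekend hiking in the mountains."
    r = referenced_score(reply, reference)
    # The unreferenced score would come from a trained query-response
    # relatedness model (an MLP over pooled embeddings in RUBER);
    # a placeholder constant is used here.
    print(blended_score(r, unreferenced=0.7))
```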

