deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

06/23/2015
by Michel Galley, et al.

We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
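The abstract only sketches the weighting idea, so the following is a minimal, hypothetical Python sketch of rating-weighted multi-reference BLEU at the sentence level: each hypothesis n-gram is credited with the best-rated reference that contains it (clipped as in BLEU), while the denominator uses the highest rating available, so matches against poorly rated references earn proportionally less credit. The function name, the absence of smoothing, and the exact handling of negative ratings are illustrative assumptions, not the paper's reference implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def delta_bleu_sketch(hypothesis, rated_references, max_n=4):
    """Simplified deltaBLEU-style score for a single hypothesis (a sketch,
    not the paper's exact formulation).

    hypothesis       : list of tokens
    rated_references : list of (tokens, weight) pairs, weight in [-1, +1]
    max_n            : highest n-gram order (BLEU-4 by default)
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hypothesis, n)
        if not hyp_ngrams:
            return 0.0
        numerator = 0.0
        denominator = 0.0
        best_weight = max(w for _, w in rated_references)
        for gram, count in hyp_ngrams.items():
            # Clipped, weighted credit: for every reference containing the
            # n-gram, clip the hypothesis count against that reference and
            # scale it by the reference's rating; keep the best such credit.
            credits = [w * min(count, ngrams(ref, n)[gram])
                       for ref, w in rated_references
                       if gram in ngrams(ref, n)]
            numerator += max(credits) if credits else 0.0
            denominator += best_weight * count
        if numerator <= 0 or denominator <= 0:
            return 0.0  # no (positively weighted) match at this order
        log_precisions.append(math.log(numerator / denominator))

    # Brevity penalty against the closest reference length, as in BLEU.
    hyp_len = len(hypothesis)
    closest_ref_len = min((len(r) for r, _ in rated_references),
                          key=lambda rl: (abs(rl - hyp_len), rl))
    bp = 1.0 if hyp_len > closest_ref_len else math.exp(1 - closest_ref_len / hyp_len)
    return bp * math.exp(sum(log_precisions) / max_n)

if __name__ == "__main__":
    refs = [("sounds like a great plan".split(), 0.8),
            ("i have no idea".split(), -0.4)]
    # Bigram-level score so the short toy sentences are not zeroed out
    # by missing higher-order n-gram matches.
    print(delta_bleu_sketch("sounds like a good plan".split(), refs, max_n=2))
```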


