Why We Need New Evaluation Metrics for NLG

07/21/2017
by Jekaterina Novikova, et al.

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at the system level and can support system development by finding cases where a system performs poorly.

