Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

12/20/2022
by Qingyu Lu, et al.

State-of-the-art language-model-based automatic metrics such as BARTScore, which benefit from large-scale contextualized pre-training, have been successfully applied to a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text generation. Recent studies show that considering both major errors (e.g., mistranslated tokens) and minor errors (e.g., imperfections in fluency) yields high-quality human judgments. This inspires us to approach the ultimate goal of evaluation metrics, human-like evaluation, through automatic error analysis. To this end, we augment BARTScore with human-like error-analysis strategies, yielding BARTScore++, whose final score combines evaluations of both major and minor errors. Experimental results show that BARTScore++ consistently improves vanilla BARTScore and outperforms existing top-scoring metrics in 20 out of 25 test settings. We hope our technique can also be extended to other metrics based on pre-trained models. We will release our code and scripts to facilitate the community.
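The abstract states only that the final score combines evaluations of major and minor errors, so the following is a minimal sketch of that idea, not the authors' released implementation. The `bart_score` function follows the published BARTScore formulation (mean token log-likelihood of the hypothesis given the source under BART); the `bartscore_pp` function, its `refined_hypothesis` argument, and the weight `alpha` are hypothetical stand-ins for the paper's error-analysis component.

```python
# Minimal sketch, assuming a weighted combination of major- and minor-error
# evaluations; NOT the authors' released BARTScore++ implementation.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(source: str, hypothesis: str) -> float:
    """BARTScore: mean token log-likelihood of `hypothesis` given `source`."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    # out.loss is cross-entropy averaged over target tokens,
    # so its negation is the mean log-likelihood.
    return -out.loss.item()

def bartscore_pp(source: str, hypothesis: str,
                 refined_hypothesis: str, alpha: float = 0.5) -> float:
    """Hypothetical BARTScore++-style combination (an assumption).

    `refined_hypothesis` is assumed to be the hypothesis with its major
    errors corrected: the score gap between it and the raw hypothesis then
    reflects major errors, while its own score reflects the remaining minor
    imperfections. The weighted sum below is an illustrative choice only.
    """
    raw = bart_score(source, hypothesis)
    refined = bart_score(source, refined_hypothesis)
    major_eval = raw - refined   # penalty attributable to major errors
    minor_eval = refined         # quality once major errors are fixed
    return alpha * major_eval + (1 - alpha) * minor_eval
```

Under these assumptions, a hypothesis with few major errors loses little in the first term, and the second term still rewards fluency, mirroring the abstract's split between major and minor error evaluations.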

Related research

10/10/2022 · Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Is it possible to build a general and automatic natural language generat...

04/21/2022 · Spurious Correlations in Reference-Free Evaluation of Text Generation
Model-based, reference-free evaluation metrics have been proposed as a f...

12/21/2022 · Tracing and Removing Data Errors in Natural Language Generation Datasets
Recent work has identified noisy and misannotated data as a core cause o...

04/16/2021 · IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
A benchmark provides an ecosystem to measure the advancement of models w...

09/14/2021 · Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Natural language generation (NLG) spans a broad range of tasks, each of ...

05/13/2022 · Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Precisely assessing the progress in natural language generation (NLG) ta...

04/29/2021 · Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
Human evaluation of modern high-quality machine translation systems is a...
