Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

06/06/2023
by Jan Deriu, et al.

A major challenge in the field of Text Generation is evaluation: human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when they are used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that, using this combination, we only require about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while yielding the same evaluation outcome as a pure human evaluation in 95% of cases, on three text generation tasks: dialogue systems, machine translation, and text summarization.
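The abstract describes combining a small number of human preference ratings with many automated-metric preferences while correcting for the metric's error rate. The sketch below is not the authors' statistical model; it is a minimal illustration of that general idea, assuming a hypothetical setting where a handful of human labels is used to estimate how often the metric flips a preference, and the metric's observed win rate is then corrected (Rogan-Gladen style) before a significance test. All data and function names are illustrative.

```python
from scipy.stats import binomtest  # available in SciPy >= 1.7

def estimate_metric_error(human_prefs, metric_prefs_same_items):
    """Fraction of items where the metric's preference disagrees with the human one."""
    disagreements = sum(h != m for h, m in zip(human_prefs, metric_prefs_same_items))
    return disagreements / len(human_prefs)

def corrected_win_rate(observed_win_rate, error_rate):
    """Invert p_obs = p_true * (1 - e) + (1 - p_true) * e for a symmetric error rate e."""
    if error_rate >= 0.5:
        raise ValueError("Metric is no better than chance; correction is undefined.")
    p_true = (observed_win_rate - error_rate) / (1.0 - 2.0 * error_rate)
    return min(max(p_true, 0.0), 1.0)

# Toy data (hypothetical): 1 means "system A preferred", 0 means "system B preferred".
human_prefs = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]    # small set of human annotations
metric_same = [1, 0, 0, 1, 1, 1, 1, 1, 1, 0]    # metric preferences on the same items
metric_all = [1] * 140 + [0] * 60               # metric preferences on 200 unlabeled items

error_rate = estimate_metric_error(human_prefs, metric_same)
observed = sum(metric_all) / len(metric_all)
corrected = corrected_win_rate(observed, error_rate)

# Test whether system A wins significantly more than half the time after correction.
wins = round(corrected * len(metric_all))
result = binomtest(wins, n=len(metric_all), p=0.5, alternative="greater")
print(f"error_rate={error_rate:.2f} observed={observed:.2f} "
      f"corrected={corrected:.2f} p={result.pvalue:.4f}")
```

This toy correction shrinks the automated metric's apparent win rate toward chance as its estimated error rate grows, which is one simple way to see why an error-prone metric can be over-confident about significant differences; the paper's actual model is more involved.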

Related research

SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation (12/19/2022)
Is it possible to leverage large scale raw and raw parallel corpora to b...

On the Effectiveness of Automated Metrics for Text Generation Systems (10/24/2022)
A major challenge in the field of Text Generation is evaluation because ...

Automated Chess Commentator Powered by Neural Chess Engine (09/23/2019)
In this paper, we explore a new approach for automated chess commentary ...

Toward More Effective Human Evaluation for Machine Translation (04/11/2022)
Improvements in text generation technologies such as machine translation...

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis (10/10/2022)
Is it possible to build a general and automatic natural language generat...

VizSeq: A Visual Analysis Toolkit for Text Generation Tasks (09/12/2019)
Automatic evaluation of text generation tasks (e.g. machine translation,...

CGEMs: A Metric Model for Automatic Code Generation using GPT-3 (08/23/2021)
Today, AI technology is showing its strengths in almost every industry a...
