Human vs Automatic Metrics: on the Importance of Correlation Design

by Anastasia Shimorina, et al.

This paper discusses two existing approaches to correlation analysis between automatic evaluation metrics and human scores in the area of natural language generation. Our experiments show that, depending on whether a system- or sentence-level correlation analysis is used, correlation results between automatic scores and human judgments are inconsistent.








1 Context and Motivation

This work seeks to gain more insight into existing approaches to correlation analysis between automatic and human metrics in the area of natural language generation (NLG).

In the machine translation community, the practice of comparing system- and sentence-level correlation results is well established (Callison-Burch et al., 2008, 2009). System-level analysis is motivated by the fact that automatic evaluation metrics such as bleu (Papineni et al., 2002), meteor (Denkowski and Lavie, 2014), and ter (Snover et al., 2006) were initially created to evaluate whole systems (i.e. they are corpus-based metrics). Correlation analysis at the sentence level, on the other hand, is motivated by the need to gauge the quality of individual sentences and, more generally, by the need for a more fine-grained analysis of the results produced (Kulesza and Shieber, 2004). The common finding in MT is that automatic metrics correlate well with human judgments at the system level but much less so at the sentence level. This in turn prompted the search for alternative automatic metrics that would correlate well with human judgments at the sentence level.
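The two designs can be contrasted on toy data. A minimal sketch (system names and all scores are invented for illustration; the corpus-level metric is approximated here by a mean of sentence scores):

```python
# Toy per-sentence scores: for each system, a list of (automatic, human) pairs.
# All numbers are invented for illustration.
per_sentence = {
    "sysA": [(0.9, 3), (0.7, 2), (0.8, 3)],
    "sysB": [(0.5, 2), (0.4, 1), (0.6, 2)],
    "sysC": [(0.2, 1), (0.3, 1), (0.1, 2)],
}

# System-level design: one data point per system
# (a corpus-based metric would compute its score over all sentences at once;
#  here we approximate it with the mean of sentence scores).
system_points = [
    (sum(a for a, _ in pairs) / len(pairs),   # automatic score for the system
     sum(h for _, h in pairs) / len(pairs))   # mean human rating
    for pairs in per_sentence.values()
]

# Sentence-level design: one data point per generated sentence, systems pooled.
sentence_points = [p for pairs in per_sentence.values() for p in pairs]

print(len(system_points))    # 3 (one point per system)
print(len(sentence_points))  # 9 (one point per sentence)
```

The correlation coefficient is then computed over `system_points` in the first design and over `sentence_points` in the second; the sentence-level design yields far more data points, which is why significance is easier to reach there.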

In NLG, there is a lack of such comparative studies. Traditional NLG evaluations and challenges (Reiter and Belz, 2009; Gatt and Belz, 2010, among others) used only system-level comparisons and reported low to strong correlations depending on the automatic metric used. Reiter and Belz (2009) explicitly wrote that they did not compute correlations on individual texts because bleu-type metrics “are not intended to be meaningful for individual sentences”.

Nonetheless, when researchers have only one or a few systems to evaluate, they resort to sentence-level correlation analysis: see, e.g., Stent et al. (2005) for paraphrasing and Elliott and Keller (2014) for image caption generation. They usually report low to moderate correlations. More recently, Novikova et al. (2017) observed that results differ depending on the evaluation design, but only reported sentence-level correlation results, as they had few systems in their study.

In their recent survey of the state of the art in NLG, Gatt and Krahmer (2018) compared various validation studies, concluding that the studies yielded inconsistent results. However, they did not mention that the underlying designs of those studies differed (some were system-based, others sentence-based).

With this paper, we hope to raise awareness of the different designs used in correlation analysis for NLG evaluation. We present both a system- and a sentence-level correlation analysis on NLG data, show that the results are similar to those obtained for MT systems, and conclude with some recommendations concerning the evaluation of NLG systems.

(a) Spearman’s ρ at the system level. Crossed squares indicate that statistical significance was not reached. Human vs automatic metrics are in the black square.
(b) Spearman’s ρ at the sentence level. All correlations are statistically significant. Human vs automatic metrics are in the black square.
Figure 1: System- and sentence-level correlation analysis.

2 Experimental Setup

We used the webnlg dataset (Gardent et al., 2017a) for our experiments. The dataset maps data to text: a data input is a set of triples extracted from DBpedia, and a text is a verbalisation of those triples. We sampled 223 data inputs from webnlg and used the outputs of the nine NLG systems which participated in the WebNLG Challenge (Gardent et al., 2017b).

The data inputs were chosen based on different characteristics of the webnlg corpus: how many RDF triples were in the data units (sizes from 1 to 5) and what the DBpedia category was (Building, City, Artist, etc.). A sample for each system comprised texts from each category (15 categories); in each category all triple set sizes were covered (5 sizes), and we extracted 3 texts for every category and size. Note that, in such a way, our sample should have had 225 (i.e. 15 × 5 × 3) texts; however, the count was reduced to 223, as one category (ComicsCharacter) had few data units for a particular size.
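The sample-size arithmetic can be checked directly; a quick sketch (the two-text shortfall in the ComicsCharacter category is inferred from the counts above):

```python
n_categories = 15   # DBpedia categories (Building, City, Artist, ...)
n_sizes = 5         # triple-set sizes 1 to 5
n_per_cell = 3      # texts sampled per (category, size) cell

expected = n_categories * n_sizes * n_per_cell
print(expected)     # 225

# One category (ComicsCharacter) had few data units for a particular size,
# leaving it 2 texts short of a full cell and reducing the sample to 223.
actual = expected - 2
print(actual)       # 223
```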

Automatic evaluation results (i.e. meteor, ter, and bleu-4 scores) were calculated for each NLG system, both at the system and at the sentence level, comparing each generated sentence against three references on average.

We crowdsourced human judgments, using CrowdFlower to run the evaluation and mace (Hovy et al., 2013) to remove unreliable judges. Each participant was asked to rate each generated text on a three-point Likert scale for semantic adequacy, grammaticality, and fluency. We collected three judgments per text.

To perform correlation analysis at both system and sentence levels, we used Spearman’s correlation coefficient. All statistical experiments were conducted using R; data and scripts are publicly available. To prevent a possible bias, we excluded human references from the analysis, as their automatic scores are equal to 1.0 (bleu, meteor) and 0.0 (ter). Thus, for the system-level analysis, we have nine data points to build a regression line.
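Spearman’s ρ itself is straightforward to sketch: rank both score lists, averaging ranks over ties (which occur often with three-point ratings), and take the Pearson correlation of the rank vectors. A minimal stdlib-only version, not the paper’s R implementation:

```python
import math

def _ranks(xs):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with xs[order[i]]
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [2, 3, 4, 5]))  # ≈ 1.0 (perfect monotone agreement)
print(spearman([1, 2, 3], [3, 2, 1]))        # ≈ -1.0 (perfect disagreement)
```

Because ρ depends only on ranks, it is insensitive to the very different scales of the metrics involved (e.g. bleu in [0, 1] vs three-point human ratings), which is one reason it is a common choice for metric validation studies.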

3 Correlation Analysis Results

At the system level (Figure 1(a)), the only statistically significant correlation is between meteor and semantic adequacy. (We focus here mostly on the correlation between human and automatic metrics, as delineated by the black square in Figure 1.) Similar findings for meteor were reported in the MT community (Callison-Burch et al., 2009) and in the image caption generation domain (Bernardi et al., 2017). We also found a strong correlation between ter and bleu, and between grammaticality and fluency judgments.

At the sentence level, on the other hand (Figure 1(b)), all correlations are statistically significant. The highest correlation between human and automatic metrics is again between meteor and semantic adequacy. For the other human/automatic pairs, the correlation is moderate. Automatic metrics show strong correlations with each other.

In sum, there is a strong discrepancy between system- and sentence-level correlation results. Significance was not reached for most of the system-level correlations. At the sentence level, all correlations are significant; however, the correlation between automatic metrics and human scores remains relatively low, thereby confirming the findings of the MT community. At the sentence level, statistical significance is easier to achieve, since there are far more data points than in the system-level analysis. One way to obtain statistically significant results at the system level would be to use a one-tailed test (instead of a two-tailed one) without a Bonferroni multiple-hypothesis correction, as was done by Reiter and Belz (2009). However, that approach is considered less statistically robust.
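The interplay between the number of simultaneous tests and the significance threshold can be made concrete. A sketch (the test counts and p-values below are illustrative, not the paper’s exact figures):

```python
alpha = 0.05

# Bonferroni correction: with m simultaneous correlation tests, each individual
# test must clear alpha / m to keep the family-wise error rate at alpha.
def bonferroni_threshold(alpha, m):
    return alpha / m

# e.g. testing every pair among 6 metrics (3 automatic + 3 human) -> 15 tests
m = 6 * 5 // 2
print(bonferroni_threshold(alpha, m))  # 0.05 / 15 ≈ 0.00333

# A one-tailed test halves the p-value of the corresponding two-tailed test,
# so a borderline two-tailed p of 0.08 becomes a "significant" 0.04 --
# which is why dropping both safeguards is considered less robust.
p_two_tailed = 0.08
p_one_tailed = p_two_tailed / 2
print(p_two_tailed < alpha)  # False
print(p_one_tailed < alpha)  # True
```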

4 Conclusion

We argued that in NLG, as in MT, the specific type of correlation analysis (system- vs sentence-level) chosen to compare human and automatic metrics strongly impacts the outcome. While system-level correlation analyses have repeatedly been used in NLG challenges, sentence-level correlation is more relevant, as it better supports error analysis. Based on our experiment, we showed that, in NLG as in MT, the sentence-level correlation between human and automatic metrics is low, which in turn suggests the need for new automatic evaluation metrics for NLG that correlate better with human scores at the sentence level.