OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

05/19/2021
by Jian Guan, et al.

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics correlate poorly with human evaluation, and the lack of standardized benchmark datasets makes it difficult to fully assess a metric's capabilities or to compare different metrics fairly. We therefore propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite that assesses (a) correlation with human judgments, (b) generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and automatically constructed test examples. Evaluating existing metrics on OpenMEVA, we observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, lack inferential knowledge (e.g., the causal order between events), and show limited generalization ability and robustness. Our study offers insights for developing NLG models and metrics in future research.
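
Two of these checks, correlation with human judgments and robustness to perturbations, reduce to simple computations once a metric produces per-story scores. The sketch below illustrates both in Python; the metric function, the example stories, and the human ratings are hypothetical stand-ins, not OpenMEVA's actual API or test data.

    # A minimal sketch of checks (a) and (d): correlating metric scores with
    # human ratings, and verifying that a perturbation breaking causal order
    # does not raise the score. All names and data here are illustrative
    # placeholders, not the benchmark's real interface.
    from scipy.stats import pearsonr, spearmanr

    def metric(story: str) -> float:
        """Stand-in for a learned quality metric; scores in [0, 1]."""
        # Deliberately naive heuristic: longer stories score higher.
        return min(len(story.split()) / 50.0, 1.0)

    # (a) Correlation with human judgments on manually annotated stories.
    stories = [
        "He packed his bags, drove to the airport, and boarded the plane.",
        "She boarded the plane, then packed her bags the next morning.",
        "The plane bags airport he then boarded packed.",
    ]
    human_ratings = [5.0, 2.0, 1.0]               # e.g., 1-5 Likert scores
    metric_scores = [metric(s) for s in stories]
    print("Pearson: ", pearsonr(metric_scores, human_ratings)[0])
    print("Spearman:", spearmanr(metric_scores, human_ratings)[0])

    # (d) Robustness: reordering events against their causal order should
    # not make an otherwise identical story score higher.
    original, perturbed = stories[0], stories[1]
    assert metric(original) >= metric(perturbed), \
        "a robust metric should not prefer the causally scrambled story"

In the benchmark itself, such perturbed counterparts are constructed automatically rather than by hand, which is what makes robustness testing at scale feasible.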

Related research

08/24/2022 · Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Research on Automatic Story Generation (ASG) relies heavily on human and...

05/08/2022 · RoViST: Learning Robust Metrics for Visual Storytelling
Visual storytelling (VST) is the task of generating a story paragraph th...

09/16/2020 · UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation
Despite the success of existing referenced metrics (e.g., BLEU and Mover...

03/15/2023 · DeltaScore: Evaluating Story Generation with Differentiating Perturbations
Various evaluation metrics exist for natural language generation tasks, ...

10/31/2018 · dAIrector: Automatic Story Beat Generation through Knowledge Synthesis
dAIrector is an automated director which collaborates with human storyt...

02/09/2021 · Hallmarks of Human-Machine Collaboration: A framework for assessment in the DARPA Communicating with Computers Program
There is a growing desire to create computer systems that can communicat...

04/12/2021 · Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation
With the recent advances of open-domain story generation, the lack of re...
