DeltaScore: Evaluating Story Generation with Differentiating Perturbations

03/15/2023
by Zhuohan Xie, et al.

Various evaluation metrics exist for natural language generation tasks, but they have limited utility for story generation: they generally do not correlate well with human judgments, and because they are designed to assess overall generation quality, they do not measure fine-grained story aspects such as fluency versus relatedness. In this paper, we propose DeltaScore, an approach that uses perturbation to evaluate fine-grained story aspects. Our core idea rests on the hypothesis that the better a story performs in a specific aspect (e.g., fluency), the more it will be affected by a perturbation targeting that aspect (e.g., introducing typos). To measure the impact, we use a language model to calculate the likelihood difference between the pre- and post-perturbation stories. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across multiple story domains, and investigate its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. Our results show that DeltaScore performs strongly in evaluating fine-grained story aspects, and we find, strikingly, that one specific perturbation appears to be highly effective in measuring most aspects.
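
The scoring idea in the abstract (likelihood before perturbation versus likelihood after) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the character-bigram model is a toy stand-in for the language model, `add_typos` is a hypothetical typo perturbation, and computing the difference in log space as original minus perturbed is one plausible reading of "likelihood difference".

```python
import math
import random
from collections import Counter


def train_char_bigram(corpus):
    """Toy stand-in for a language model (assumption): add-one-smoothed character bigrams."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    vocab_size = len(set(corpus)) + 1  # one extra slot for unseen characters
    return pair_counts, context_counts, vocab_size


def log_likelihood(text, model):
    """Sum of smoothed log-probabilities of each character given the previous one."""
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for prev, cur in zip(text, text[1:]):
        prob = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(prob)
    return total


def add_typos(text, rate=0.2, seed=0):
    """Hypothetical fluency perturbation: swap adjacent, distinct letters at random."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        a, b = chars[i], chars[i + 1]
        if a.isalpha() and b.isalpha() and a != b and rng.random() < rate:
            chars[i], chars[i + 1] = b, a
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def delta_score(story, perturb, model):
    """Log-likelihood of the original story minus that of its perturbed version."""
    return log_likelihood(story, model) - log_likelihood(perturb(story), model)


if __name__ == "__main__":
    model = train_char_bigram("the cat sat on the mat. the dog sat on the log.")
    story = "the cat sat on the mat."
    print(add_typos(story))
    print(delta_score(story, add_typos, model))
```

Under the stated hypothesis, a larger difference would suggest that the original story was stronger in the aspect the perturbation targets. In practice the toy bigram model would be replaced by a pretrained language model, and the perturbation would be chosen to match the aspect being measured.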

research
01/04/2021

Outline to Story: Fine-grained Controllable Story Generation from Cascaded Events

Large-scale pretrained language models have shown thrilling generation c...
research
05/19/2021

OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Automatic metrics are essential for developing natural language generati...
research
04/30/2020

Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation

Suspense is a crucial ingredient of narrative fiction, engaging readers ...
research
10/04/2020

STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

Systems for story generation are asked to produce plausible and enjoyabl...
research
09/16/2020

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Despite the success of existing referenced metrics (e.g., BLEU and Mover...
research
10/11/2022

CHAE: Fine-Grained Controllable Story Generation with Characters, Actions and Emotions

Story generation has emerged as an interesting yet challenging NLP task ...
research
10/16/2022

StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

Existing automatic story evaluation methods place a premium on story lex...
