StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

10/16/2022
by   Hong Chen, et al.

Existing automatic story evaluation methods place a premium on lexical-level coherence, which deviates from human preference. We go beyond this limitation with a novel story evaluation method that mimics how humans judge a story, namely StoryER, which consists of three sub-tasks: Ranking, Rating and Reasoning. Given either a machine-generated or a human-written story, StoryER requires the machine to output 1) a preference score that corresponds to human preference, 2) specific ratings with their corresponding confidences, and 3) comments on various aspects (e.g., opening, character-shaping). To support these tasks, we introduce a well-annotated dataset comprising (i) 100k ranked story pairs and (ii) a set of 46k ratings and comments on various aspects of the stories. We finetune a Longformer-Encoder-Decoder (LED) on the collected dataset, with the encoder responsible for preference-score and aspect prediction and the decoder for comment generation. Our comprehensive experiments establish a competitive benchmark for each task, showing high correlation to human preference. In addition, we observe that jointly learning the preference scores, aspect ratings, and comments improves each individual task. Our dataset and benchmarks are publicly available to advance research on story evaluation tasks. [Dataset and pre-trained model demo are available at the anonymous website <http://storytelling-lab.com/eval> and <https://github.com/sairin1202/StoryER>]
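The encoder/decoder split described in the abstract can be sketched in code. The following is a minimal toy stand-in, not the paper's actual model: it uses small random-weight transformer modules instead of pretrained LED, and the layer sizes, aspect count, and pooling strategy are all illustrative assumptions. It only shows the output structure (a scalar preference score and per-aspect confidences from the encoder, comment token logits from the decoder).

```python
import torch
import torch.nn as nn

class StoryEvaluator(nn.Module):
    """Toy stand-in for an LED-style evaluator (hypothetical sizes;
    the paper finetunes a pretrained Longformer-Encoder-Decoder)."""
    def __init__(self, vocab_size=1000, d_model=64, n_aspects=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Encoder heads: scalar preference score and per-aspect ratings.
        self.preference_head = nn.Linear(d_model, 1)
        self.aspect_head = nn.Linear(d_model, n_aspects)
        # Decoder generates the free-text comment conditioned on the story.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, story_ids, comment_ids):
        memory = self.encoder(self.embed(story_ids))
        pooled = memory.mean(dim=1)  # mean-pool encoder states (assumption)
        preference = torch.sigmoid(self.preference_head(pooled)).squeeze(-1)
        aspect_conf = self.aspect_head(pooled).softmax(-1)  # aspect confidences
        dec_out = self.decoder(self.embed(comment_ids), memory)
        comment_logits = self.lm_head(dec_out)  # next-token logits per position
        return preference, aspect_conf, comment_logits

model = StoryEvaluator()
story = torch.randint(0, 1000, (2, 16))    # batch of 2 token-id "stories"
comment = torch.randint(0, 1000, (2, 8))   # teacher-forced comment prefixes
pref, aspects, logits = model(story, comment)
print(pref.shape, aspects.shape, logits.shape)
```

The 100k ranked story pairs could then supervise the preference head with a pairwise objective (e.g. a margin ranking loss over the two stories' scores), while the ratings and comments supervise the aspect head and decoder; the abstract reports that training these jointly improves each task.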

Related research

- Visual Story Post-Editing (06/05/2019): We introduce the first dataset for human edits of machine-generated visu...
- Knowledge-Enriched Visual Storytelling (12/03/2019): Stories are diverse and highly personalized, resulting in a large possib...
- STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation (10/04/2020): Systems for story generation are asked to produce plausible and enjoyabl...
- RoViST: Learning Robust Metrics for Visual Storytelling (05/08/2022): Visual storytelling (VST) is the task of generating a story paragraph th...
- DeltaScore: Evaluating Story Generation with Differentiating Perturbations (03/15/2023): Various evaluation metrics exist for natural language generation tasks, ...
- Understanding Aesthetics with Language: A Photo Critique Dataset for Aesthetic Assessment (06/17/2022): Computational inference of aesthetics is an ill-defined task due to its ...
- How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory (05/24/2022): Human ratings are treated as the gold standard in NLG evaluation. The st...
