Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA

05/23/2023
by David Heineman, et al.

Large language models (e.g., GPT-3.5) are uniquely capable of producing highly rated text simplifications, yet current human evaluation methods fail to provide a clear understanding of systems' specific strengths and weaknesses. To address this limitation, we introduce SALSA, an edit-based human annotation framework that enables holistic and fine-grained text simplification evaluation. We develop twenty-one linguistically grounded edit types, covering the full spectrum of success and failure across dimensions of conceptual, syntactic, and lexical simplicity. Using SALSA, we collect 12K edit annotations on 700 simplifications, revealing discrepancies in the distribution of transformation approaches performed by fine-tuned models, few-shot LLMs, and humans, and finding that GPT-3.5 performs more quality edits than humans but still exhibits frequent errors. Using our fine-grained annotations, we develop LENS-SALSA, a reference-free automatic simplification metric trained to predict sentence- and word-level quality simultaneously. Additionally, we introduce word-level quality estimation for simplification and report promising baseline results. Our training material, annotation toolkit, and data are released at http://salsa-eval.com.
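
To make the idea of comparing edit distributions across systems concrete, here is a minimal Python sketch. It is an illustration only and not the SALSA toolkit or its data schema: the `EditAnnotation` class, its field names, the two outcome labels ("quality" and "error"), the system names, and the toy records are all assumptions made for this example. The function tallies, for each system, what share of its annotated edits falls into each (dimension, outcome) cell.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EditAnnotation:
    """One annotated edit (hypothetical schema, not the SALSA format)."""
    system: str     # who produced the simplification, e.g. a model or a human
    dimension: str  # "conceptual", "syntactic", or "lexical"
    outcome: str    # assumed labels: "quality" or "error"


def edit_distribution(annotations):
    """Return, per system, the share of edits in each (dimension, outcome) cell."""
    counts = {}
    for ann in annotations:
        counts.setdefault(ann.system, Counter())[(ann.dimension, ann.outcome)] += 1
    return {
        system: {cell: n / sum(cells.values()) for cell, n in cells.items()}
        for system, cells in counts.items()
    }


# Made-up toy annotations, not data from the paper.
toy = [
    EditAnnotation("system_a", "lexical", "quality"),
    EditAnnotation("system_a", "lexical", "quality"),
    EditAnnotation("system_a", "conceptual", "error"),
    EditAnnotation("system_b", "syntactic", "quality"),
]
dist = edit_distribution(toy)

for system, cells in dist.items():
    for (dimension, outcome), share in sorted(cells.items()):
        print(f"{system}: {dimension}/{outcome} = {share:.2f}")
```

Normalizing within each system, rather than reporting raw counts, keeps the comparison fair when systems contribute different numbers of edits.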


