APPLS: A Meta-evaluation Testbed for Plain Language Summarization

05/23/2023
by Yue Guo, et al.

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. This is in part because PLS involves multiple, interrelated language transformations (e.g., adding background explanations, removing specialized terminology). No metrics are explicitly engineered for PLS, and the suitability of other text generation evaluation metrics remains unclear. To address these concerns, our study presents a granular meta-evaluation testbed, APPLS, designed to evaluate existing metrics for PLS. Drawing on insights from previous research, we define controlled perturbations for our testbed along four criteria that a metric of plain language should capture: informativeness, simplification, coherence, and faithfulness. Our analysis of metrics using this testbed reveals that current metrics fail to capture simplification, signaling a crucial gap. In response, we introduce POMME, a novel metric designed to assess text simplification in PLS. We demonstrate its correlation with simplification perturbations and validate it across a variety of datasets. Our research contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics, offering insights with relevance to other text generation tasks.
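To illustrate the perturbation-based meta-evaluation idea described above, the sketch below shows how one might check whether a candidate metric responds to a simplification-degrading perturbation (here, substituting jargon for plain terms) by correlating metric scores with perturbation intensity. This is a minimal illustration only, not the authors' APPLS or POMME implementation; the jargon list, `perturb`, and the word-length stand-in metric are hypothetical.

```python
# Minimal sketch (assumptions: hypothetical jargon substitutions and a toy
# word-length scorer standing in for a real PLS metric).
from scipy.stats import spearmanr

JARGON = {
    "high blood pressure": "hypertension",
    "heart attack": "myocardial infarction",
}

def perturb(text: str, k: int) -> str:
    """Apply the first k jargon substitutions, making the text less plain."""
    for plain_term, jargon_term in list(JARGON.items())[:k]:
        text = text.replace(plain_term, jargon_term)
    return text

def toy_metric(text: str) -> float:
    """Stand-in scorer: negative mean word length (higher = plainer)."""
    words = text.split()
    return -sum(len(w) for w in words) / len(words)

plain = "People with high blood pressure have a greater risk of a heart attack."
levels = list(range(len(JARGON) + 1))          # perturbation intensity: 0, 1, 2 substitutions
scores = [toy_metric(perturb(plain, k)) for k in levels]

# A metric sensitive to simplification should drop as more jargon is introduced.
rho, p = spearmanr(levels, scores)
print(f"Spearman rho between perturbation intensity and metric score: {rho:.2f} (p={p:.2f})")
```

In the same spirit, a metric that fails to capture simplification would show little or no correlation with such perturbations, which is the gap the paper reports for existing metrics.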
