SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

05/22/2023
by Elizabeth Clark, et al.

Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 quality dimensions (comprehensibility, repetition, grammar, attribution, main ideas, and conciseness), covering 6 languages, 9 systems, and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark for evaluating learnt metrics and as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make SEAHORSE publicly available for future research on multilingual and multifaceted summarization evaluation.


