Revisiting Summarization Evaluation for Scientific Articles

04/01/2016
by Arman Cohan, et al.

Evaluation of text summarization approaches has mostly been based on metrics that measure the similarity of system-generated summaries to a set of human-written gold-standard summaries. The most widely used metrics in summarization evaluation are the ROUGE family. ROUGE relies solely on lexical overlap between the terms and phrases in the sentences; therefore, in the presence of terminology variation and paraphrasing, ROUGE is not effective. Scientific article summarization is one such case, and it differs from general-domain summarization (e.g., of newswire data). We provide an extensive analysis of ROUGE's effectiveness as an evaluation metric for scientific summarization and show that, contrary to common belief, ROUGE is not a reliable metric for evaluating scientific summaries. We further show that different variants of ROUGE yield very different correlations with manual Pyramid scores. Finally, we propose an alternative metric for summarization evaluation based on the content relevance between a system-generated summary and the corresponding human-written summaries. We call our metric SERA (Summarization Evaluation by Relevance Analysis). Unlike ROUGE, SERA consistently achieves high correlations with manual scores, demonstrating its effectiveness for evaluating scientific article summarization.
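For intuition, ROUGE-N recall reduces to clipped n-gram overlap between a candidate summary and a reference. The following is a minimal Python sketch (whitespace tokenization only, far simpler than the official ROUGE toolkit) that illustrates why paraphrased summaries score poorly under a purely lexical metric:

    from collections import Counter

    def ngram_counts(tokens, n):
        """Multiset of n-grams in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate, reference, n=1):
        """ROUGE-N recall: fraction of reference n-grams that also appear
        in the candidate, with clipped counts as in the standard definition."""
        cand = ngram_counts(candidate.lower().split(), n)
        ref = ngram_counts(reference.lower().split(), n)
        total = sum(ref.values())
        if total == 0:
            return 0.0
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        return overlap / total

    # Paraphrases share meaning but few surface tokens, so lexical
    # overlap scores them low:
    print(rouge_n_recall("the drug lowers blood pressure",
                         "the medication reduces hypertension"))  # 0.25 (only "the")

SERA instead compares summaries through retrieval: the candidate and the gold summaries are each issued as queries against an index of articles from the same domain, and the score reflects how much their retrieved result lists agree. The sketch below is a toy rendering of that idea, not the paper's implementation: the retrieve and sera function names are illustrative, the TF-IDF cosine ranker stands in for the full-text search engine and large article index used in the paper, and the rank-discounted SERA variant is omitted.

    import math
    from collections import Counter

    def retrieve(query, corpus, k=5):
        """Rank corpus articles against the query by TF-IDF cosine similarity
        and return the indices of the top k (stand-in for a search engine)."""
        docs = [Counter(doc.lower().split()) for doc in corpus]
        n = len(docs)
        df = Counter(term for doc in docs for term in doc)
        idf = {term: math.log(n / df[term]) + 1.0 for term in df}

        def weight(counts):
            return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

        def cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            norm = math.sqrt(sum(w * w for w in a.values()) *
                             sum(w * w for w in b.values()))
            return dot / norm if norm else 0.0

        q = weight(Counter(query.lower().split()))
        ranked = sorted(range(n), key=lambda i: cosine(q, weight(docs[i])),
                        reverse=True)
        return ranked[:k]

    def sera(candidate, gold_summaries, corpus, k=5):
        """SERA-style score: average overlap between the result lists retrieved
        for the candidate summary and for each human-written gold summary."""
        cand_hits = set(retrieve(candidate, corpus, k))
        overlaps = [len(cand_hits & set(retrieve(gold, corpus, k))) / k
                    for gold in gold_summaries]
        return sum(overlaps) / len(overlaps)

Because retrieval keys on content-bearing terms rather than exact phrasing, two summaries that paraphrase the same findings tend to retrieve the same articles and are scored as similar, which is precisely the failure mode of lexical overlap that SERA is designed to address.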

Related research

04/21/2017 · Scientific Article Summarization Using Citation-Context and Article's Discourse Structure
We propose a summarization approach for scientific articles which takes ...

10/07/2021 · GeSERA: General-domain Summary Evaluation by Relevance Analysis
We present GeSERA, an open-source improved version of SERA for evaluatin...

09/04/2019 · Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
Abstractive summarization approaches based on Reinforcement Learning (RL...

10/08/2021 · Evaluation of Summarization Systems across Gender, Age, and Race
Summarization systems are ultimately evaluated by human annotators and r...

05/12/2023 · What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
Summarization models often generate text that is poorly calibrated to qu...

08/25/2015 · Better Summarization Evaluation with Word Embeddings for ROUGE
ROUGE is a widely adopted, automatic evaluation measure for text summari...

10/20/2017 · A Semantically Motivated Approach to Compute ROUGE Scores
ROUGE is one of the first and most widely used evaluation metrics for te...
