LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization

01/26/2023
by Laura Nguyen, et al.

Text summarization is a popular task and an active area of research in the Natural Language Processing community. By definition, it requires processing long input texts, which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. Layout information is highly relevant, both for highlighting salient content and for encoding long-range interactions between textual passages. Yet all publicly available summarization datasets provide only plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend two popular existing English datasets (arXiv and PubMed) with layout information and propose four novel datasets, consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines that merge layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.

research
05/18/2021

BookSum: A Collection of Datasets for Long-form Narrative Summarization

The majority of available text summarization datasets include short-form...
research
04/18/2022

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization

Video summarization intends to produce a concise video summary by effect...
research
07/19/2019

What is this Article about? Extreme Summarization with Topic-aware Convolutional Neural Networks

We introduce 'extreme summarization', a new single-document summarizatio...
research
01/17/2023

On the State of German (Abstractive) Text Summarization

With recent advancements in the area of Natural Language Processing, the...
research
08/23/2019

Neural Text Summarization: A Critical Evaluation

Text summarization aims at compressing long documents into a shorter for...
research
08/24/2023

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

Existing full text datasets of U.S. public domain newspapers do not reco...
research
01/26/2018

A Formal Definition of Importance for Summarization

Research on summarization has mainly been driven by empirical approaches...
