LANS: Large-scale Arabic News Summarization Corpus

10/24/2022
by   Abdulaziz Alhamadani, et al.
0

Text summarization has been intensively studied in many languages, and some languages have reached advanced stages. Yet, Arabic Text Summarization (ATS) is still in its developing stages. Existing ATS datasets are either small or lack diversity. We build, LANS, a large-scale and diverse dataset for Arabic Text Summarization task. LANS offers 8.4 million articles and their summaries extracted from newspapers websites metadata between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers, and include an eclectic mix of at least more than 7 topics from each source. We conduct an intrinsic evaluation on LANS by both automatic and human evaluations. Human evaluation of 1000 random samples reports 95.4 accuracy for our collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/30/2018

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

We present NEWSROOM, a summarization dataset of 1.3 million articles and...
research
06/10/2019

BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization

Most existing text summarization datasets are compiled from the news dom...
research
12/19/2022

LR-Sum: Summarization for Less-Resourced Languages

This preprint describes work in progress on LR-Sum, a new permissively-l...
research
07/27/2012

Diversity in Ranking using Negative Reinforcement

In this paper, we consider the problem of diversity in ranking of the no...
research
04/26/2023

ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries

Automatic chart to text summarization is an effective tool for the visua...
research
04/29/2020

Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization

Much progress has been made in text summarization, fueled by neural arch...
research
05/27/2023

MeetingBank: A Benchmark Dataset for Meeting Summarization

As the number of recorded meetings increases, it becomes increasingly im...

Please sign up or login with your details

Forgot password? Click here to reset