
LANS: Large-scale Arabic News Summarization Corpus

Text summarization has been intensively studied in many languages, and some languages have reached advanced stages. Yet, Arabic Text Summarization (ATS) is still in its developing stages. Existing ATS datasets are either small or lack diversity. We build LANS, a large-scale and diverse dataset for the Arabic Text Summarization task. LANS offers 8.4 million articles and their summaries extracted from the metadata of newspaper websites between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers, and include an eclectic mix of at least 7 topics from each source. We conduct an intrinsic evaluation of LANS by both automatic and human evaluations. Human evaluation of 1000 random samples reports 95.4% accuracy for our collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.


1 Introduction

Every day, an abundance of text is published on the internet, such as news articles, scientific papers, product reviews, and blogs. Text summarization is therefore needed to cope with this information overload. A good summary should be concise and include the main information of the original text (Radev et al., 2002). For some languages like English, the field has developed rapidly and achieved competitive results (Zhang et al., 2020; Lewis et al., 2019; Dou et al., 2020). In Arabic, by contrast, the field has developed slowly over the past few years and has not yet reached an advanced stage. In the field of Arabic Text Summarization (ATS) (Belkebir and Guessoum, 2015; AL-Khawaldeh and Samawi, 2015; Fejer and Omar, 2014; Abu Nada et al., 2020; El-Kassas et al., 2021), the dearth of a large and diverse summarization dataset is one of the main difficulties that researchers encounter (Al-Saleh and Menai, 2016; Elsaid et al., 2022).

Figure 1: The Webpage view (left) shows a typical news article view. The summaries are extracted from the HTML source code view’s (right) metadata (og:description).

Concerted efforts have been made to overcome those challenges by building various Arabic datasets for the task, such as EASC (El-Haj et al., 2010), Kalimat (El-Haj and Koulali, 2013), TAC2011 (El-Ghannam and El-Shishtawy, 2014), ANT (Chouigui et al., 2021), and XL-Sum (Hasan et al., 2021), but those datasets are limited in diversity or size. Therefore, a diverse and large-scale dataset is crucial to advance the ATS field. Diversity in an ATS dataset is twofold. The first kind of diversity exists in Modern Standard Arabic (MSA). Even though 22 countries use MSA as an official standard language, each country has its own dialects (Dialectal Arabic) for communication. Each country’s dialects affect its MSA writing style and choice of words. For example, in a sentence describing the rounds of a soccer match, Moroccan MSA would use the word "أطوار" for "rounds" and "المواجهة" for "the match", while Saudi MSA would use "أشواط" and "المباراة", respectively. Second, there is diversity in news categories. Each newspaper covers different news topics, such as finance, politics, sports, health, local news, international news, and more. Not all ATS datasets include both aspects of diversity (see Table 2). Thus, it is essential to build a dataset that considers both types of diversity.

In terms of size, the available ATS datasets contain between 100 and 41,000 training samples, which makes them too small to fully train a summarization model. The performance of summarization models relies on a substantial amount of applicable training samples (Völske et al., 2017; Grusky et al., 2018; Zhang et al., 2020; Lewis et al., 2019; Dou et al., 2020). A large-scale dataset, such as the one provided in this work, is therefore needed.

To overcome the current limitations in diversity and size, we introduce a new large-scale ATS dataset (LANS) that includes both types of diversity, presenting new opportunities for ATS models to improve their summary accuracy. To achieve MSA diversity, that is, the variety in how each Arab country’s dialects affect its MSA, LANS encompasses 19 Arab countries and collects articles along with their summaries from 22 popular newspapers (see Table 1). For diversity in text categories, we include all available news categories of each source. Thus, LANS covers both types of diversity across the Arab countries. To overcome the size limitation, LANS provides more than 8 million news articles along with their summaries. This substantial number of articles and summaries, plus the diversity in MSA sources and categories, makes LANS a valuable resource for ATS models.

LANS exploits the metadata of newspapers’ archives to extract and build the dataset. Figure 1 shows a high-level example of where the collected information originates, using two parallel views: the webpage view and its HTML source code view. The webpage view shows what a reader sees when reading a news article: the URL, the title, the bold part or abstract sentence(s), and the article body. LANS extracts the summaries from the metadata attributes in the HTML source code, specifically og:description. In the webpage view, the summaries appear either in bold text or before the article’s paragraphs. In the HTML source code view, the summaries lie in the metadata attributes, in our case in the og:description tag, which we extracted as the news articles’ summaries. After the extraction, we cleaned and filtered 11M news articles to obtain 8.4M articles along with their summaries.
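As a rough illustration of the extraction step described above, the following minimal sketch reads the og:description metadata from an HTML page using only Python's standard library. The class and function names, and the sample HTML, are hypothetical and chosen for illustration; this is not the authors' scraper, which was customized per newspaper.

```python
from html.parser import HTMLParser


class OGDescriptionParser(HTMLParser):
    """Collects the content of the first <meta property="og:description"> tag."""

    def __init__(self):
        super().__init__()
        self.summary = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags are of interest; stop after the first match.
        if tag != "meta" or self.summary is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:description":
            self.summary = attributes.get("content")


def extract_summary(html):
    """Return the og:description text of a page, or None if it is absent."""
    parser = OGDescriptionParser()
    parser.feed(html)
    return parser.summary


# Hypothetical page used only to exercise the function.
sample_html = (
    '<html><head><meta property="og:title" content="A headline">'
    '<meta property="og:description" content="A short summary.">'
    "</head><body><p>Article body.</p></body></html>"
)
print(extract_summary(sample_html))
```

Pages without the tag return None, which matches the selection rule that only sources exposing a summary in the metadata are usable.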

To quantify the quality of the collected summaries and examine their summarization properties, we conducted an automatic evaluation based on 3 common metrics. Moreover, we corroborated it with a human evaluation of 1000 samples to verify the accuracy of using the abstract from the HTML source code’s metadata as a summary. The human evaluation verifies that using the summary available in the metadata has a 95.4% accuracy. Given its large size of 8.4 million articles, LANS can benefit the ATS field, because large datasets benefit NLP tasks, for example by supplying the numerous training samples that pre-trained models need (Zhang et al., 2020; Lewis et al., 2019). Besides, both types of diversity create opportunities for researchers to construct more accurate ATS models.

ID Newspaper Country From Articles ID Newspaper Country From Articles
1 Elkhabar Algeria 2014 78201 12 Hespress Morocco 2007 91357
2 Alwasat Bahrain 2013 23860 13 Alwatan Oman 2014 130067
3 Gate Ahram Egypt 2016 315655 14 Alquds Palestine 2015 88313
4 Youm7 Egypt 2008 2039818 15 Alquds-UK Palestine 2013 349439
5 Albayan Emirates 1999 1137188 16 Alwatan Qatar 2016 214405
6 Almadapaper Iraq 2009 105925 17 Aljazira Saudi Arabia 2001 809445
7 Aldustoor Jordan 2000 601372 18 Alryiadh Saudi Arabia 2004 1004893
8 Annahar Kuwait 2007 575482 19 Alsudan Alyoom Sudan 2016 104439
9 Alakhbar Lebanon 2006 222215 20 Zamanalwsl Syria 2007 128785
10 WAL Libya 2013 141898 21 Alssabah Tunisia 2011 166137
11 Sahara Media Mauritania 2009 11982 22 Almasdar Yemen 2009 102608
Total 8,443,484
Table 1: Overall statistics of the collected articles

Our main contributions are as follows: (1) We curate LANS, a large-scale ATS dataset of 8.4 million Arabic news articles paired with their summaries, written by journalists between 1999 and 2019. To our knowledge, it is the largest to date. (2) LANS is collected from 22 reputable Arab newspapers to achieve a high degree of diversity in MSA, and each source covers at least 7 topics to achieve diversity in categories. (3) To quantify the intrinsic quality of LANS, a human evaluation is conducted on 1000 random samples and verifies 95.4% accuracy of the summaries. In addition, the automatic evaluation on the whole dataset quantifies the abstractness and properties of the summaries.

2 Related Work (Existing Datasets)

To the best of our knowledge, Lakhas (Douzidia and Lapalme, 2004) is one of the earliest ATS models. Due to the lack of ATS datasets at that time, Douzidia et al. translated the DUC dataset (an English text summarization dataset of news paired with human summaries; https://duc.nist.gov/) from English to Arabic to evaluate their ATS model (Douzidia and Lapalme, 2004). The translation relied on the machine translation of that time, which was not as accurate and advanced as it is today, and that had a negative impact on the results. Moreover, other ATS works built their own datasets to evaluate their models (Al-Maleh and Desouki, 2020). Consequently, researchers have built Arabic ground-truth summaries over the past years, and this section describes the major ones.

The Essex Arabic Summaries Corpus (EASC) Dataset. EASC (El-Haj et al., 2010) is an ATS dataset in which the summaries are extracted from the texts by Mechanical Turk workers. Its text sources are two Arabic newspapers (Alrai and Alwatan) and the Arabic-language version of Wikipedia. It contains 153 Arabic articles and 765 summaries (5 summaries per article). In short, EASC has high-quality human-generated summaries, but it is too small and lacks diversity.

Kalimat Dataset. El-Haj et al. worked on a dataset called Kalimat (El-Haj and Koulali, 2013). It has 20,291 extractive single-document and multi-document system summaries, and includes only 6 categories. It was collected from only one source, the Alwatan newspaper from Oman. The single-document summaries are generated by their model Gen-Summ, which takes the article and its first sentence as input and outputs the extractive summary. The multi-document summaries were generated for every 10, 100, and 500 articles in different categories. The generated summaries also lack human evaluation.

Arabic News Texts Corpus (ANT) and XL-Sum. ANT (Chouigui et al., 2021) and XL-Sum (Hasan et al., 2021) are the most recent works. ANT collected 31,798 documents paired with summaries using RSS feeds from 5 Arab news sources: AlArabiya, BBC, CNN, France24, and SkyNews, while XL-Sum collected 40,327 from BBC only. ANT includes 6 categories, while XL-Sum reported none. Unlike ANT, LANS uses the og:description tag in the HTML source code to collect the summaries, which is similar to Grusky et al. (2018). ANT is evaluated with several extractive summarization methods such as LexRank, TextRank, Luhn, and LSA. XL-Sum fine-tuned mT5 on their dataset and randomly sampled 500 documents each for the development and test sets. Besides, they conducted a human evaluation on 250 random samples. By comparison, LANS collected more than 8 million articles with summaries from local newspapers in 19 Arab countries. Moreover, experts evaluated 1000 random summaries from LANS to substantiate the validity of the summaries.


Corpus # of documents MSA Diversity Category Diversity Human Evaluation (samples)
EASC 153
KALIMAT 20291
ANT 31798
XL-Sum 40327 250
LANS 8 million 1000

Table 2: Arabic Text Summarization Datasets comparison

3 LANS Dataset

Collecting, extracting, processing, and building any dataset is meticulous work. This section details how LANS is collected, from the scraping process to building the dataset, and how it is shaped for public use.

3.1 Data Collection

Our main goal is to advance the ATS field by collecting and building the largest and most diverse ATS dataset. We collect newspapers from 19 countries. (There are 22 Arab countries, but 3 of them, Djibouti, the Comoros Islands, and Somalia, lack Arabic data and reliable newspapers.) For consistency and fairness of data collection, all TV news channels’ websites, like Alarabiya, Aljazeera, Arabic CNN, and Arabic BBC, are excluded because they are primarily established as TV news channels. To make our data sources comprehensive and trustworthy, we collected and listed nearly all the reliable newspapers for each country. For instance, we listed 18 reputable newspapers in Saudi Arabia. After analyzing the newspapers, we ranked them by assigning the highest priority to the newspaper with the longest publishing history.

Next, we select a newspaper only if its content meets three criteria: (a) history of published articles (archive), (b) diversity in categories, and (c) availability of the newspaper’s summary in the metadata. History of published articles (archive): each newspaper’s website is inspected to determine whether it has a considerable historical electronic archive that reflects the newspaper’s long publishing history. An old, reputable newspaper can be ranked below a more recent one if the latter has a longer historical e-archive. As a result, LANS has collected data from 1999 to 2019 (see Table 1). Diversity in categories: a newspaper should contain a variety of topics or categories (at least 7), for example, local news, international news, politics, economy, religion, culture, health, sports, art, technology, and so on. Availability of the summary in the metadata: the metadata of a document holds the hidden information of an article. The summary written by the author initially lies in the metadata and can also appear in bold on the webpage or ahead of the article. The availability of the summary published by the author/journalist is the major factor in selecting a newspaper: only newspapers that provide summaries in the metadata are selected.

The aforementioned criteria narrow down the list of reliable newspapers to those shown in Table 1. As a result, 22 popular newspapers from 19 Arab countries, covering the period from 1999 to 2019, have been selected for the next step. The wide variety of data sources can significantly benefit the diversity of the summaries.

3.1.1 Data Scraping

Since there are 22 newspaper websites to be scraped, it is necessary to write a customized scraper for each of them. Each scraper identifies the patterns, the selectors, and the URLs to be scraped. The main information scraped from each news article is the following: URL, title (or headline), article, and finally the summary (the metadata from og:description). An example is shown in Table 3, which shows the scraped information from an article’s webpage. For reproducibility, we used Scrapy, which is well suited to implementing recurring and large-scale web scraping projects. Besides, Scrapy supports different built-in data output formats such as JSON, XML, and CSV.

Selector Scraped info
URL http://www.alwasatnews.com/news/1196668.html
Title بالصور… المرخ الخيرية تنظم حملة تنظيف لمقبرة القرية
Article قام المشاركون بإزالة الأشجار والأوساخ الضارة وتقليم الأشجار، وقد شهدت الحملة مشاركة من الأهالي من جميع الفئات العمرية، بالإضافة لأعضاء مجلس إدارة الجمعية. من جانبه، قال رئيس لجنة شئون القرية والمقبرة في الجمعية مصطفى عبدالنبي إن الحملة تأتي استكمالاً لعملية التطوير شامل للمقبرة، حيث تستعد اللجنة للبدء بالمرحلة السادسة من عملية تطوير المقبرة والتي ستشمل عمل كراسي للمظلة ورصف الطريق المؤدي من المغتسل إلى المظلة ونقل خزان الماء الرئيسي من موقعه الحالي إلى الجهة الشرقية للمغتسل وإصلاح واستكمال شراء الاحتياجات، بالإضافة إلى متابعة الخطة التطويرية بالتنسيق مع إدارة الأوقاف الجعفرية، هذا وأثنى على نشاط المشتركين في الحملة، كما قدم شكره لجميع أبناء القرية لتعاونهم لإنجاح حملة تنظيف المقبرة. endR
Summary نظمت لجنة شئون القرية في جمعية المرخ الخيرية الاجتماعية، تزامناً مع رأس السنة الميلادية، حملة تنظيف لمقبرة القرية تحت شعار استثمر وقتك لآخرتك، صباح أمس الأحد1 يناير كانون الثاني 7102
Table 3: Scraped information from an Article

3.2 Building LANS Dataset

To curate the collected data so that it preserves good quality for reuse and evaluation, we detail how the data is extracted, cleaned, and preprocessed.

3.2.1 Data Extraction

Among the data formats for retrieval, the most convenient format to preserve data quality is XML. The extracted data is stored in a tree structure. Each newspaper has a dataset formatted as follows: "Item" is the root node of the tree. The root has many child nodes named "Items". Each "Items" child node holds the extracted data of a single document (a newspaper article) and has 4 child nodes of its own: Address, Title, Article, and Summary. Each of these four nodes has 1 or more child nodes of its own, depending on the actual values extracted from an article’s webpage. The data at this stage is considered neither clean nor reliable because it contains many errors that could impact the quality of LANS. Errors can be extraneous or foreign characters, empty values, HTML code, or other common text errors. Thus, we need to clean the data. In addition, to better utilize the data in the XML files, we need to preprocess the data for the evaluation process.
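The tree layout described above can be sketched with Python's standard xml.etree.ElementTree module. The node names (Item, Items, Address, Title, Article, Summary) follow the description; storing each field as a single text value is a simplifying assumption, since the source says each field may hold one or more nested values. The function names and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

FIELDS = ("Address", "Title", "Article", "Summary")


def build_tree(records):
    """Build the <Item> root with one <Items> child per article."""
    root = ET.Element("Item")
    for record in records:
        node = ET.SubElement(root, "Items")
        for field in FIELDS:
            # Assumption: one text value per field.
            ET.SubElement(node, field).text = record.get(field, "")
    return root


def read_records(root):
    """Convert every <Items> child back into a plain dictionary."""
    return [
        {child.tag: (child.text or "") for child in node}
        for node in root.findall("Items")
    ]


# Hypothetical record used only to show the round trip.
records = [
    {
        "Address": "http://example.com/news/1.html",
        "Title": "A headline",
        "Article": "The article body.",
        "Summary": "A short summary.",
    }
]
xml_text = ET.tostring(build_tree(records), encoding="unicode")
print(xml_text)
```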

Data cleaning: Initially, more than 11 million articles and their metadata are scraped. The data is carefully examined to check whether the extracted articles are free of errors and valid for use. One of the main errors was articles with missing content, which has two main causes. First, many articles contain only images or videos without any textual content, because they are news items that only report pictures or videos. Second, content can be missing because of mistakes in the HTML pages or because it is stored under a different selector. All articles with these errors are removed. The remaining errors are handled by the normalization step of the preprocessing described below. In short, the removed articles have no title, no article text, or no valid data. After removing all the unusable articles, the number dropped from 11,115,932 to 8,443,484 articles. After this step, the data is stored in its final XML tree format.
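The filtering rule above can be illustrated with a small sketch that drops records whose required fields are missing or blank. The choice of required fields (including Summary) and the function names are assumptions for illustration, not the authors' exact rules.

```python
# Assumption: a record is usable only if all three fields carry text.
REQUIRED_FIELDS = ("Title", "Article", "Summary")


def is_usable(record):
    """True when every required field is present and not just whitespace."""
    return all((record.get(field) or "").strip() for field in REQUIRED_FIELDS)


def clean(records):
    """Keep only the usable records."""
    return [record for record in records if is_usable(record)]


# Hypothetical records: the second one is an image-only article.
scraped = [
    {"Title": "A headline", "Article": "Body text.", "Summary": "A summary."},
    {"Title": "Photo gallery", "Article": "   ", "Summary": "Pictures only."},
    {"Title": "No summary", "Article": "Body text.", "Summary": None},
]
print(len(clean(scraped)))
```

By the counts reported above, this kind of filtering removed 2,672,448 of the 11,115,932 scraped articles, or about 24%.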

3.2.2 Preprocessing

Even though the data is clean at this stage, it requires preprocessing for the ATS evaluation process, due to the complex and rich nature of the Arabic language. The steps are normalization, segmentation, removal of stop words, and lemmatization, in that order. In Arabic, this is the primary stage that prepares the text for processing and transforms the input text into a unified representation.

The normalization step cleans the data and removes extraneous text. It removes extra white spaces or tabs, irrelevant foreign characters, non-letters, and diacritics. It also replaces variant forms of certain Arabic characters with a single character to normalize the differences between them. Normalization also removes the "Tatweel" (character stretching) (Ayedh et al., 2016). For example, a word that appears as "تــمـــديـــد" is replaced with "تمديد".
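A minimal sketch of such a normalization step is shown below, using only the standard library. The source does not list which letters are unified, so mapping the hamza and madda forms of alef to bare alef is an assumption, as are the function name and the exact character ranges.

```python
import re

TATWEEL = "\u0640"
# Arabic diacritics (tanween, short vowels, shadda, sukun, superscript alef).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumption: unify alef with hamza above/below and alef madda into bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Anything that is neither an Arabic letter nor whitespace.
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")
WHITESPACE = re.compile(r"\s+")


def normalize(text):
    """Strip tatweel and diacritics, unify alef forms, drop non-letters."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = text.translate(ALEF_VARIANTS)
    text = NON_LETTERS.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


# The stretched word from the text above, written with explicit code points.
stretched = "\u062a\u0640\u0640\u0645\u0640\u0640\u0640\u062f\u064a\u0640\u0640\u0640\u062f"
print(normalize(stretched))
```

Tatweel and diacritics are removed before the non-letter filter so that they do not leave stray spaces inside words.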

The terms segmentation and tokenization are commonly used interchangeably. The segmentation process splits the article into sentences and prepares it for the next steps. We use the Natural Language Toolkit (NLTK) (Loper and Bird, 2002) to tokenize sentences and words. We are aware that some scholars define tokenization differently, as breaking words into their constituent prefix(es), stem, and suffix(es) (Mubarak, 2017; Abdelali et al., 2016; El-Defrawy et al., 2015; Pasha et al., 2014). However, lemmatization in ATS accomplishes the intended purpose of that other definition of Arabic tokenization.

Stop words have a major impact on text summarization because they affect the length of the articles and summaries and inflate word frequencies, both of which change the weights of sentences (El-Khair, 2017; Al-Taani and Al-Omour, 2014). To remove the stop words, we used a list prepared by Abu El-khair et al. (El-Khair, 2017), which contains 1377 words.

For our evaluation, the final and most crucial preprocessing step is lemmatization. This step can improve the accuracy of the summarization and evaluation process. Lemmatization is the process of reducing words to their basic root by removing the attached affixes. The LANS dataset does not store the data in lemmatized form, because lemmatization is usually applied to the original data during training or testing. Many lemmatizers were considered, such as Alkhalil (Boudchiche and Mazroui, 2019), ISRI (Khoja) (El-Defrawy et al., 2015), Madamira (Pasha et al., 2014), and CAMeL (Obeid et al., 2020), but only Farasa (Mubarak, 2017; Abdelali et al., 2016) is applied, because it outperforms the state-of-the-art CAMeL by a slight margin and is fast on large-scale datasets. Following all the mentioned steps, the dataset is passed on for automatic evaluation (see Section 6).

4 LANS Description

LANS comprises 8,443,484 articles and their summaries from 22 newspapers of 19 Arab countries, dated from 1999 to 2019. The high-level statistics in Table 1 show that some newspapers have more data than others. This does not reflect on the standing of any country’s newspapers. Most of the newspapers with a long history of journalism were published in print before newspapers became digitized. The collection dates reflect how much data is available in each newspaper’s e-archive. For instance, the Gate Ahram newspaper from Egypt (Gat, 2022) was established in 1875 and has been published since then. However, its available e-archive starts from 2016. Each newspaper’s website has its own e-archive and its own progress over time, which is why the collection dates vary.

LANS encompasses 19 Arab countries for MSA diversity. One of the overlooked aspects of diversity in Arabic is the diversity of MSA in the Arab countries. It is true that all the newspapers in the Arab countries use the same MSA, but events, culture, and use of vocabulary are different from one country to another. Therefore, it is necessary to collect such diverse data from each country. To achieve MSA diversity in LANS, our dataset encompasses 19 Arab countries - except for the Comoros Islands, Djibouti, and Somalia because of the scarcity of data in their newspapers.

Further, LANS provides a wide variety of topics. The collected data from each country covers different categories, and some newspapers have more categories than others, which enhances the diversity of categories in LANS. Some newspapers have only a few categories (no fewer than 7), while others have more than 9, including local, international, political, financial, society, sports, technology, art, health, and religious news. This category diversity is one of the features of LANS. It allows researchers to create not only subsets of the dataset but also subsets of those subsets. For example, a subset can be all articles and summaries from Saudi Arabia, and a further subset can be the local news category within the Saudi Arabian articles and summaries.

The dataset is chunked into separate XML files, each under 2 GB, to make it easier to load and process. The total size of the whole dataset is 32 GB. Each country’s dataset is a subset of the whole dataset, and researchers are free to choose one or several subsets (by specific countries) to train and evaluate ATS models.
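Since each chunk can approach 2 GB, a streaming reader may be more practical than loading a whole file at once. The sketch below uses the standard library's ET.iterparse and assumes the Item/Items layout with single text values per field; the function name and the in-memory sample are hypothetical stand-ins for a real chunk file.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield one dict per <Items> element without keeping the whole tree."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "Items":
            yield {child.tag: (child.text or "") for child in element}
            element.clear()  # free the children that were just consumed


# Hypothetical two-article chunk held in memory instead of on disk.
chunk = io.BytesIO(
    b"<Item>"
    b"<Items><Address>u1</Address><Title>t1</Title>"
    b"<Article>a1</Article><Summary>s1</Summary></Items>"
    b"<Items><Address>u2</Address><Title>t2</Title>"
    b"<Article>a2</Article><Summary>s2</Summary></Items>"
    b"</Item>"
)
for article in iter_articles(chunk):
    print(article["Title"], "->", article["Summary"])
```

The same function accepts a file path, so a per-country subset could be read by iterating over only that country's chunk files.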

Summary
LANS من المقرر الكشف عن اسماء اهم خمسة مرشحين لجائزة افضل لاعب كرة قدم في افريقيا للعام الحالي غداً الأحد ويتوقع ان يكون كابتن منتخب نيجيريا جاي جاي اوكوتشا من بين اقوى المرشحين للجائزة السنوية.
mT5-based pipeline من المقرر الكشف عن أفضل خمسة مرشحين لأفضل لاعب كرة قدم في أفريقيا لهذا العام هذا الأحد، ومن المتوقع أن يكون الكابتن المنتخب لنيجيريا جاي أوكوكوشا من بين أقوى المرشحين للجائزة السنوية. أوكوشا، المرشح الرئيسي لأفضل لاعب أفريقي.
Table 4: A sample summary from LANS and the corresponding summary generated by the mT5-based pipeline.

5 Experiment

Since the ATS field is still under-researched for abstractive summarization, it is difficult to make multiple comparisons among the available works. Therefore, we created a translate-summarize-translate pipeline from the available pretrained state-of-the-art multilingual models such as mT5 (Xue et al., 2020), mBART (Tang et al., 2020), and CRISS (Tran et al., 2020). For our experiment, we chose mT5 because of its wide coverage of 101 languages and support for 41 languages. The model is used to generate summaries of the 1000 randomly sampled articles, which are then compared with the LANS ground-truth summaries using ROUGE-N. At a high level, the pipeline takes the preprocessed samples described in Section 3.2.2, translates the articles (Arabic → English), generates summaries from the translated articles, and then translates the generated summaries (English → Arabic) for evaluation. The model used in each step of the pipeline is given below.

Some of the pipeline steps that generate automatic text summaries are adapted to the Arabic language. First, we preprocess the text, as detailed in Section 3.2.2. Second, we translate the articles from Arabic to English using the OPUS-MT project (Tiedemann and Thottingal, 2020). OPUS-MT is based on Marian-NMT (Junczys-Dowmunt et al., 2018), a state-of-the-art transformer-based Neural Machine Translation (NMT) framework, and is trained on OPUS data using OPUS-MT-Train. It achieves accurate translation results. Next, since the articles are now in English, we generate automatic text summaries using mT5, which inherits all the benefits of T5 (Raffel et al., 2019). At this point, the automatic text summaries are in English. Finally, we translate the automatic text summaries into Arabic using settings similar to those of the second step. An example of a ground-truth summary and a generated Arabic summary is displayed in Table 4.
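The order of the pipeline stages can be sketched as a simple composition of four functions. All four stage functions are hypothetical placeholders passed in by the caller; in the experiment they would be backed by the preprocessing of Section 3.2.2, OPUS-MT, and mT5, whose actual invocation is not shown here.

```python
from typing import Callable

TextFn = Callable[[str], str]


def make_pipeline(preprocess: TextFn, ar_to_en: TextFn,
                  summarize_en: TextFn, en_to_ar: TextFn) -> TextFn:
    """Compose the translate-summarize-translate stages in order."""

    def run(article_ar: str) -> str:
        cleaned = preprocess(article_ar)          # step 1: preprocessing
        article_en = ar_to_en(cleaned)            # step 2: Arabic -> English
        summary_en = summarize_en(article_en)     # step 3: summarize in English
        return en_to_ar(summary_en)               # step 4: English -> Arabic

    return run


# Stand-in stages that only tag the text, to make the data flow visible.
pipeline = make_pipeline(
    preprocess=str.strip,
    ar_to_en=lambda text: "EN(" + text + ")",
    summarize_en=lambda text: "SUM(" + text + ")",
    en_to_ar=lambda text: "AR(" + text + ")",
)
print(pipeline("  article  "))
```

Keeping the stages as injected callables makes it easy to swap the translation or summarization model without touching the control flow.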

Both summaries are evaluated with the ROUGE evaluation metric (Ganesan, 2018) and are also used for human evaluation (see Section 6.2). We apply ROUGE-1, ROUGE-2, and ROUGE-L to consider different summary lengths. Moreover, we show how lemmatization impacts the scores. The results are reported in Table 5. The summaries generated by mT5 achieve lower scores before lemmatization. After we lemmatize the summaries with Farasa, the results improve by a good margin (for example, R-1 rises from 0.30 to 0.44). In both cases, for a model that has not been designed for the Arabic language, mT5 shows good scores when scored against LANS summaries (see Table 5; a sample pair is shown in Table 4).

Before Lemmatization After Lemmatization
R-1 R-2 R-L R-1 R-2 R-L
mT5 0.3 0.12 0.28 0.44 0.19 0.38
Table 5: Results of the generated summaries referenced to LANS summaries.
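To make the ROUGE-N scores above concrete, here is a minimal sketch of n-gram overlap scoring on whitespace tokens. It is not the ROUGE package used in the experiment, and it reports precision, recall, and F1 as an illustrative choice; lemmatization (for example with Farasa) would be applied to both strings before calling it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between a candidate and a reference summary."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy English strings used only to exercise the function.
generated = "the cat sat"
ground_truth = "the cat sat on the mat"
print(rouge_n(generated, ground_truth, n=1))
print(rouge_n(generated, ground_truth, n=2))
```

Because matching is on exact tokens, inflected Arabic word forms that share a lemma count as mismatches, which is one plausible reason the scores rise after lemmatization.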

6 Intrinsic Evaluation of LANS

We apply two methods of evaluation to validate the reliability of the summaries in LANS. The first is an automatic evaluation, which examines the summarization techniques in LANS using the following metrics: compression ratio, fragment density, and coverage. The automatic evaluation is performed on the whole dataset. The second is a human evaluation, in which experts assess 1000 randomly sampled articles and their respective summaries to verify the quality of LANS.

Dataset COV (↓) DENS (↑) CMP (↑) Dataset COV (↓) DENS (↑) CMP (↑)
Elkhabar(Algeria) 0.34 0.87 0.77 Alwatan(Oman) 0.35 0.64 0.68
Alwasat(Bahrain) 0.32 0.88 0.51 Alquds(Palestine) 0.28 0.74 0.65
Gate Ahram(Egypt) 0.27 0.81 0.57 Alquds-UK(Palestine) 0.39 0.90 0.79
Youm7(Egypt) 0.31 0.86 0.53 Alwatan(Qatar) 0.24 0.58 0.74
Aldustoor(Jordan) 0.25 0.52 0.50 Aljazira(Saudi Arabia) 0.23 0.46 0.57
Annahar(Kuwait) 0.24 0.57 0.72 Alryiadh(Saudi Arabia) 0.30 0.73 0.51
Almadapaper(Iraq) 0.45 0.52 0.64 Alsudan Alyoom(Sudan) 0.36 0.31 0.49
Alakhbar(Lebanon) 0.27 0.49 0.82 Zamanalwsl(Syria) 0.26 0.62 0.59
WAL(Libya) 0.32 0.30 0.55 Alssabah(Tunisia) 0.26 0.70 0.58
Sahara Media(Mauritania) 0.32 0.88 0.68 Albayan(Emirates) 0.41 0.35 0.65
Hespress(Morocco) 0.38 1.01 0.78 Almasdar(Yemen) 0.38 0.92 0.77
Table 6: Automatic evaluation results of LANS comparing all newspapers to each other. The up arrow (↑) indicates that higher is better, and the down arrow (↓) indicates that lower is better. The results show the diversity among the collected datasets from one source to another. They also show a high level of abstractiveness and conciseness.

6.1 Automatic Evaluation

To assess LANS, we apply 3 common metrics to quantify the abstractness of LANS’s summaries and examine their summarization strategies. Note that summaries can be extractive or abstractive; extractive summaries derive words from the source text, while abstractive summaries use novel words to describe the source text. The applied metrics are compression ratio, fragment density (abstractivity), and coverage (Grusky et al., 2018; Bommasani and Cardie, 2020). Compression Ratio quantifies the conciseness of summaries, and is defined from the ratio of words between a summary and an article:

$$\mathrm{CMP}(A, S) = 1 - \frac{|S|}{|A|} \tag{1}$$

where $|S|$ is the summary’s length and $|A|$ is the article’s length in words. Coverage by Grusky et al. (2018) quantifies how much the summary borrows words from the article. Its formula is below:

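$$\mathrm{COVERAGE}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f| \tag{2}$$

where $\mathcal{F}(A, S)$ is the set of extractive phrases in summary $S$ extracted from article $A$, and each $f$ is a sequence of summary tokens (words) derived from the article, with $|f|$ its length. In abstractive summaries, it is preferred not to derive many words from the article.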

Fragment Density was proposed by Grusky et al. (2018), and later introduced as Abstractivity by Bommasani and Cardie (2020) with a slight change that generalizes it. This paper uses fragment density. It quantifies how well the summary can be constructed from sequences of words that are greedily matched in the article. It is measured as follows:

$$\mathrm{DENSITY}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|^{2} \tag{3}$$

where $|f|$ is the length in words of an extractive fragment $f \in \mathcal{F}(A, S)$.

The results of the automatic evaluation are reported in Table 6. For coverage (COV), lower scores indicate more abstractive summaries; the reported low scores signify that the summaries use novel words to describe the articles. For density (DENS) and compression (CMP), higher is better. The highest density score is in the Hespress (Morocco) newspaper summaries, and the lowest is in WAL (a Libyan news agency). For compression, the most concise summaries are from Alakhbar (Lebanon), and the least concise are from Alsudan Alyoom (Sudan). The varying scores across all metrics indicate that the style of writing summaries differs among the Arab countries.
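Coverage and fragment density both rest on greedily matching summary word sequences in the article. The sketch below follows the greedy fragment procedure of Grusky et al. (2018) on pre-tokenized word lists; the function names and toy sentences are illustrative, and it is not the code used to produce Table 6.

```python
def extractive_fragments(article, summary):
    """Greedily find the longest shared word sequences, scanning the summary."""
    fragments = []
    i = 0
    while i < len(summary):
        best = []
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                i2, j2 = i, j
                # Extend the match as far as both sequences agree.
                while (i2 < len(summary) and j2 < len(article)
                       and summary[i2] == article[j2]):
                    i2 += 1
                    j2 += 1
                if i2 - i > len(best):
                    best = summary[i:i2]
                j = j2
            else:
                j += 1
        if best:
            fragments.append(best)
        i += max(len(best), 1)  # skip past the fragment, or one unmatched word
    return fragments


def coverage(article, summary):
    """Share of summary words that belong to an extractive fragment."""
    if not summary:
        return 0.0
    return sum(len(f) for f in extractive_fragments(article, summary)) / len(summary)


def density(article, summary):
    """Average squared fragment length per summary word."""
    if not summary:
        return 0.0
    return sum(len(f) ** 2 for f in extractive_fragments(article, summary)) / len(summary)


# Toy example with English tokens.
article_tokens = "the cat sat on the mat today".split()
summary_tokens = "the cat sat quietly".split()
print(extractive_fragments(article_tokens, summary_tokens))
print(coverage(article_tokens, summary_tokens), density(article_tokens, summary_tokens))
```

Here the single fragment "the cat sat" covers 3 of 4 summary words, so coverage is 0.75 and density is 9/4 = 2.25.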

6.2 Human Evaluation

Relying only on automatic evaluation and the ROUGE metric has limitations, such as scoring biases against systems that depend more on paraphrasing, such as abstractive systems (Grusky et al., 2018). As a result, even when meaningful summaries are generated, ROUGE can be subjective and assign a low score to well-generated summaries (See et al., 2017). Therefore, we conduct a human evaluation.

Human evaluation is costly, but the results from the automatic method described in Section 6.1 are yet to be verified by experts. A survey is created for human experts to assess which summaries capture the full key information of the articles, have better readability, and are syntactically correct. The survey contains the 1000 random samples selected for the experiment in Section 5. Each survey question contains the following data: the full article; Choice 1: LANS summary; Choice 2: mT5-based generated summary; and Choice 3: none of the above (neither summary). Choices 1 and 2 were shuffled and anonymized so that human experts can make fairer choices with less bias. For example, if Choice 1 were always the LANS summary, human experts might form a habit of always choosing Choice 1. Anonymizing the choices means that the human evaluators do not know the origin of each summary.
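The shuffling and anonymization of choices can be sketched as follows. The question layout, the labels in the answer key, and the function name are hypothetical; the sketch only shows how the origin of each summary can be kept separate from what the expert sees.

```python
import random


def build_question(article, lans_summary, generated_summary, rng):
    """Return an anonymized survey question and its hidden answer key."""
    candidates = [("LANS", lans_summary), ("mT5", generated_summary)]
    rng.shuffle(candidates)  # randomize which summary appears first
    choices = [text for _source, text in candidates] + ["None of the above"]
    answer_key = [source for source, _text in candidates] + ["none"]
    # Only the article and the unlabeled choices are shown to the expert.
    return {"article": article, "choices": choices}, answer_key


rng = random.Random(2022)  # fixed seed so the survey can be regenerated
question, key = build_question(
    "Full article text.", "Journalist summary.", "Generated summary.", rng
)
print(question["choices"])
```

After the survey, each expert's selected position is mapped back through the answer key to count how often the LANS summary was preferred.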

The experts who completed the survey are highly knowledgeable in Arabic. To evaluate the survey, an expert must be a native Arabic speaker and hold at least a bachelor’s degree majoring in the Arabic language. The experts were asked not only to choose the choice that best fits the given criteria, but also to provide feedback on the choices. The human evaluation results show that for 954 of the 1000 samples, the LANS summary was judged to have the more accurate semantic representation and correct syntactic form. Semantic representation means that the summary captures the salient and key information of the article and has better readability. The results also show that in 2 cases the experts chose "none", meaning that neither summary met the required criteria. While the ROUGE scores between the automatically generated summaries and the LANS summaries are low, the human evaluation results verify the correctness of the LANS summaries with an accuracy of 95.4%.

7 Conclusion

This work presents LANS, a large-scale and diverse text summarization dataset of more than 8 million news articles (32 GB) paired with their summaries written by journalists. The summaries are collected from the metadata of the websites of 22 popular Arab newspapers, scraped for the period from 1999 to 2019. For each of those sources, LANS covers a wide range of topics. The work applied two evaluation methods (automatic and human) to verify the superiority of the extracted summaries in LANS. The dataset is available upon request from the first author. LANS offers this dataset so that researchers can advance the field of ATS and use the data to train and evaluate new models. For future work, we plan to benchmark LANS for text classification, since the articles’ labels are available.

References