
NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

by Alexander Spangher, et al.

News article revision histories have the potential to give us novel insights across varied fields of linguistics and the social sciences. In this work, we present NewsEdits, to our knowledge the first publicly available dataset of news article revision histories. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.
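To make the added/changed/removed sentence counts concrete, the following is a minimal sketch of how such counts could be tallied for a single version pair. It is an illustration only: the abstract does not describe the report's actual sentence-matching procedure, so the exact-match alignment via Python's standard `difflib`, the function name `count_sentence_edits`, and the rule that a replaced sentence counts as "changed" are all assumptions. Sentence splitting and the derivation of atomic edits are not covered.

```python
from difflib import SequenceMatcher


def count_sentence_edits(old_sentences, new_sentences):
    """Tally added, changed, and removed sentences between two versions.

    Hypothetical helper: aligns the two sentence lists by exact match and
    treats each sentence in a replaced span as "changed" (an assumption,
    not the report's documented method).
    """
    counts = {"added": 0, "changed": 0, "removed": 0}
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_old, n_new = i2 - i1, j2 - j1
        if tag == "insert":
            counts["added"] += n_new
        elif tag == "delete":
            counts["removed"] += n_old
        elif tag == "replace":
            # Pair old and new sentences one-to-one as "changed";
            # any surplus on either side is an addition or a removal.
            paired = min(n_old, n_new)
            counts["changed"] += paired
            counts["added"] += n_new - paired
            counts["removed"] += n_old - paired
    return counts


old_version = ["a", "b", "c"]
new_version = ["a", "B", "c", "d"]
example_counts = count_sentence_edits(old_version, new_version)
```

Here each list element stands in for one sentence. Summing such per-pair counts over all consecutive versions of every article would give corpus-level totals of the kind reported above, although a real pipeline would likely need fuzzy matching to recognize lightly edited sentences as "changed" rather than as a removal plus an addition.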

