
NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

04/19/2021
by Alexander Spangher, et al.

News article revision histories can offer novel insights across many fields of linguistics and the social sciences. In this work, we present NewsEdits, to our knowledge the first publicly available dataset of news article revision histories. The dataset is multilingual: it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.
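
To make the sentence-level counts above concrete, the following is a minimal sketch of how added, removed, and changed sentences might be tallied for one version pair. It is not the authors' pipeline: the function name `count_sentence_edits`, the use of Python's standard `difflib.SequenceMatcher`, and the rule that a replaced span counts as "changed" are all assumptions made for illustration, and the example sentences are invented.

```python
from difflib import SequenceMatcher


def count_sentence_edits(old, new):
    """Count added, removed, and changed sentences between two versions.

    `old` and `new` are lists of sentence strings (sentence splitting is
    assumed to have happened already). Hypothetical heuristic: within a
    'replace' span, sentences are paired off in order and counted as
    changed; any surplus is counted as added or removed.
    """
    counts = {"added": 0, "removed": 0, "changed": 0}
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_old, n_new = i2 - i1, j2 - j1
        if tag == "insert":
            counts["added"] += n_new
        elif tag == "delete":
            counts["removed"] += n_old
        elif tag == "replace":
            paired = min(n_old, n_new)
            counts["changed"] += paired
            counts["added"] += n_new - paired
            counts["removed"] += n_old - paired
        # 'equal' spans contribute nothing
    return counts


# Invented example version pair
old_version = [
    "A storm hit the coast.",
    "Two people were injured.",
    "Officials will brief at noon.",
]
new_version = [
    "A storm hit the coast.",
    "Three people were injured.",
    "Officials will brief at noon.",
    "Power was restored by evening.",
]
result = count_sentence_edits(old_version, new_version)
print(result)
```

Summing such per-pair counts over every consecutive version pair would give corpus-level totals of the kind reported above; the matching method actually used for NewsEdits is described in the full report and may differ.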

06/14/2022

NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge

News article revision histories provide clues to narrative and factual e...
10/17/2022

Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

Knowledge is central to human and scientific developments. Natural Langu...
11/12/2016

1.5 billion words Arabic Corpus

This study is an attempt to build a contemporary linguistic corpus for A...
02/01/2021

Counting Protests in News Articles: A Dataset and Semi-Automated Data Collection Pipeline

Between January 2017 and January 2021, thousands of local news sources i...
12/03/2020

Context in Informational Bias Detection

Informational bias is bias conveyed through sentences or clauses that pr...
04/17/2019

Headline Generation: Learning from Decomposable Document Titles

We propose a novel method for generating titles for unstructured text do...
06/10/2019

AGRR-2019: A Corpus for Gapping Resolution in Russian

This paper provides a comprehensive overview of the gapping dataset for ...