NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

04/19/2021
by Alexander Spangher, et al.

News article revision histories have the potential to give us novel insights across varied fields of linguistics and social sciences. In this work, we present NewsEdits, to our knowledge the first publicly available dataset of news article revision histories. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.
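
To make the added / changed / removed sentence counts concrete, here is a minimal sketch of how such counts could be computed for one version pair. It is not the authors' pipeline: it assumes each version is already split into a list of sentences, uses Python's standard `difflib.SequenceMatcher` for alignment, and the function name `count_sentence_edits` is hypothetical.

```python
import difflib


def count_sentence_edits(old_sentences, new_sentences):
    """Count added, removed, and changed sentences between two versions.

    Assumption: inputs are lists of sentence strings. A "changed" sentence
    here is one that sits in a replaced region of the alignment; this is an
    illustrative definition, not necessarily the dataset's.
    """
    matcher = difflib.SequenceMatcher(
        a=old_sentences, b=new_sentences, autojunk=False
    )
    counts = {"added": 0, "removed": 0, "changed": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            counts["added"] += j2 - j1
        elif tag == "delete":
            counts["removed"] += i2 - i1
        elif tag == "replace":
            # Pair replaced sentences one-to-one; any surplus on either
            # side is counted as removed (old side) or added (new side).
            paired = min(i2 - i1, j2 - j1)
            counts["changed"] += paired
            counts["removed"] += (i2 - i1) - paired
            counts["added"] += (j2 - j1) - paired
    return counts


old_version = [
    "A fire broke out downtown.",
    "Two people were injured.",
    "Police are investigating.",
]
new_version = [
    "A fire broke out downtown.",
    "Three people were injured.",
    "Police are investigating.",
    "The cause is unknown.",
]
result = count_sentence_edits(old_version, new_version)
print(result)  # {'added': 1, 'removed': 0, 'changed': 1}
```

Summing these per-pair counts over all consecutive versions of all articles would give corpus-level totals of the kind reported above; finer-grained atomic edits would require a further word-level comparison inside each changed sentence.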

