NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge

06/14/2022
by Alexander Spangher, et al.

News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions: Addition, Deletion, Edit and Refactor, and develop a high-accuracy extraction algorithm to identify these actions. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely to contain updating events, main content and quotes than unchanged sentences. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting actions performed during version updates. We show that these tasks are possible for expert humans but are challenging for large NLP models. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news.
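
To make the four edit actions concrete, the following is a minimal sketch of how they could be labeled between two versions of an article. It is not the authors' extraction algorithm, which the abstract does not describe. It assumes that versions are already split into sentences, that sentences are paired greedily by character similarity using Python's standard `difflib.SequenceMatcher`, and that a Refactor is a matched sentence that moved out of order relative to the other matched sentences. The function names and the 0.6 threshold are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def match_sentences(old, new, threshold=0.6):
    """Greedily pair each old sentence with its most similar unused new sentence.

    Assumption: a similarity ratio of at least `threshold` means "same sentence".
    """
    pairs, used = [], set()
    for i, old_sent in enumerate(old):
        best_j, best_ratio = None, threshold
        for j, new_sent in enumerate(new):
            if j in used:
                continue
            ratio = SequenceMatcher(None, old_sent, new_sent).ratio()
            if ratio >= best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs


def in_order_pairs(pairs):
    """Return a longest subset of pairs whose new positions keep increasing."""
    if not pairs:
        return set()
    best_len = [1] * len(pairs)
    prev = [-1] * len(pairs)
    for a in range(len(pairs)):
        for b in range(a):
            if pairs[b][1] < pairs[a][1] and best_len[b] + 1 > best_len[a]:
                best_len[a] = best_len[b] + 1
                prev[a] = b
    k = max(range(len(pairs)), key=best_len.__getitem__)
    keep = set()
    while k != -1:
        keep.add(pairs[k])
        k = prev[k]
    return keep


def label_edit_actions(old, new, threshold=0.6):
    """Return (action, old_sentence, new_sentence) tuples for one version update."""
    pairs = match_sentences(old, new, threshold)
    stable = in_order_pairs(pairs)
    matched_old = {i for i, _ in pairs}
    matched_new = {j for _, j in pairs}
    actions = []
    for i, j in pairs:
        if (i, j) not in stable:
            # Matched, but moved relative to the other matched sentences.
            actions.append(("Refactor", old[i], new[j]))
        if old[i] != new[j]:
            # Matched, but the wording changed.
            actions.append(("Edit", old[i], new[j]))
    actions += [("Deletion", old[i], None)
                for i in range(len(old)) if i not in matched_old]
    actions += [("Addition", None, new[j])
                for j in range(len(new)) if j not in matched_new]
    return actions
```

Calling `label_edit_actions(old_sentences, new_sentences)` on two consecutive versions yields one tuple per detected action. Sentences that are matched, unmoved, and textually identical produce no action, and a sentence that is both moved and reworded is reported as both a Refactor and an Edit. Greedy matching is a simplification: a production pipeline would likely need a more careful alignment and a tuned threshold.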


