TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia

03/23/2017
by Fabian Flöck, et al.

We present a dataset that contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists of all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard for an average potential user to create, as (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, both those already popular in research and new ones, to be generated at complete-Wikipedia scale, while ensuring quality and sparing researchers the expensive text-comparison computation that has so far hindered scalable usage. We show how this data enables, at token level, the computation of provenance, the measurement of content survival over time, very detailed conflict metrics, and fine-grained editor interactions such as partial reverts and re-additions, gaining several novel insights in the process.
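To illustrate how these per-token annotations can be used, the Python sketch below derives simple provenance and survival statistics from TokTrack-style records. The record layout (token text, origin revision, deletion and re-addition revision lists) follows the description above, but the field names, the Token class, and the toy history are assumptions for illustration, not the released dataset's actual schema.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    text: str                                       # the token (word) itself
    o_rev_id: int                                   # revision that originally created the token
    outs: List[int] = field(default_factory=list)   # revisions in which it was deleted
    ins: List[int] = field(default_factory=list)    # revisions in which it was re-added

    def is_present(self) -> bool:
        # A token is present from its origin revision on; every "out" removes it
        # and every "in" restores it, so it survives in the latest revision
        # exactly when deletions and re-additions balance out.
        return len(self.outs) == len(self.ins)

    def redeletions(self) -> int:
        # Deletions beyond the first indicate back-and-forth editing,
        # a basic ingredient of token-level conflict metrics.
        return max(len(self.outs) - 1, 0)


def surviving_tokens_by_revision(tokens: List[Token]) -> Dict[int, int]:
    # Provenance view: how many still-surviving tokens each revision contributed.
    counts: Dict[int, int] = {}
    for tok in tokens:
        if tok.is_present():
            counts[tok.o_rev_id] = counts.get(tok.o_rev_id, 0) + 1
    return counts


# Toy article history: "wikipedia" was never touched again,
# "free" was deleted twice and re-added once, so it is absent from the latest revision.
tokens = [
    Token("wikipedia", o_rev_id=101),
    Token("free", o_rev_id=101, outs=[205, 310], ins=[280]),
]

print(surviving_tokens_by_revision(tokens))   # {101: 1}
print(tokens[1].redeletions())                # 1

On the full dump, the same per-token logic can be aggregated per article or per editor, for example by summing surviving tokens over their origin revisions to attribute authorship, without any further text-comparison computation.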

research · 11/02/2020
Analyzing Wikidata Transclusion on English Wikipedia
Wikidata is steadily becoming more central to Wikipedia, not just in mai...

research · 04/18/2021
A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation
Large pretrained generative models like GPT-3 often suffer from hallucin...

research · 05/30/2023
SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages
Text simplification research has mostly focused on sentence-level simpli...

research · 10/07/2020
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
Detecting fine-grained differences in content conveyed in different lang...

research · 01/28/2020
WikiHist.html: English Wikipedia's Full Revision History in HTML Format
Wikipedia is written in the wikitext markup language. When serving conte...

research · 11/02/2021
Quality change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia
Wikipedia has been turned into an immensely popular crowd-sourced encycl...

research · 09/05/2018
Token Curated Registries - A Game Theoretic Approach
Token curated registries (TCRs) have been proposed recently as an approa...
