WikiHist.html: English Wikipedia's Full Revision History in HTML Format

01/28/2020
by Blagoj Mitrevski, et al.

Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia's revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia's REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia's full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext and that the missing links are important for user navigation.
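The ad-hoc approach mentioned above can be sketched with Wikipedia's public REST transform endpoint, which accepts wikitext and returns the parsed HTML. This is a minimal illustration of that workflow, not the authors' pipeline; the sample wikitext is an arbitrary example, and each snippet costs one HTTP round trip, which is exactly why the approach does not scale to full revision histories.

```python
# Minimal sketch of ad-hoc wikitext-to-HTML parsing via Wikipedia's
# REST API (the approach the abstract says does not scale).
import json
import urllib.request

API = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"

def build_request(wikitext: str) -> urllib.request.Request:
    """Build the POST request asking the API to expand one wikitext snippet."""
    body = json.dumps({"wikitext": wikitext}).encode("utf-8")
    return urllib.request.Request(
        API,
        data=body,
        headers={"Content-Type": "application/json"},
    )

def parse_to_html(wikitext: str) -> str:
    """One HTTP round trip per snippet: send wikitext, receive rendered HTML."""
    with urllib.request.urlopen(build_request(wikitext)) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # A wiki link plus a template; the server expands both during parsing.
    print(parse_to_html("[[Wikipedia]] is parsed by [[MediaWiki]]."))
```

Note that the endpoint parses against the *current* state of templates and modules, so replaying a historical revision through it can yield HTML that never actually appeared on the site; this is the macro-expansion problem the paper's local MediaWiki instances address.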


