Web Archive Analytics

by Michael Völske, et al.

Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. To better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets, such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of this paper describes our infrastructure for processing this data treasure: we will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming a non-biased subset of the 30 PB web archive at the Internet Archive.



