Web Archive Analytics

07/02/2021
by Michael Völske, et al.

Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. To better understand the complexity of this task, the first part of this paper relates the entirety of the world's captured, created, and replicated data (the "Global Datasphere") to other important data sets, such as the public internet and its web pages, and to what the Internet Archive preserves of them. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of this paper describes our infrastructure for processing this data treasure: we will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming an unbiased subset of the 30 PB web archive at the Internet Archive.


