Web Archive Analytics

07/02/2021
by Michael Völske, et al.

Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. To better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets, such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of this paper describes our infrastructure for processing this data treasure: we will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming an unbiased subset of the 30 PB web archive at the Internet Archive.


05/10/2011

The Hidden Web, XML and Semantic Web: A Scientific Data Management Perspective

The World Wide Web no longer consists just of HTML pages. Our work sheds...
05/21/2019

The Blind Men and the Internet: Multi-Vantage Point Web Measurements

In this paper, we design and deploy a synchronized multi-vantage point w...
03/25/2011

From Linked Data to Relevant Data -- Time is the Essence

The Semantic Web initiative puts emphasis not primarily on putting data ...
11/22/2021

FastWARC: Optimizing Large-Scale Web Archive Analytics

Web search and other large-scale web data analytics rely on processing a...
10/28/2019

Large-Scale Characterization and Segmentation of Internet Path Delays with Infinite HMMs

Round-Trip Times are one of the most commonly collected performance metr...
07/03/2020

WordPress on AWS: a Communication Framework

Every organization needs to communicate with its audience, and social me...
04/06/2021

Large-scale Sustainable Search on Unconventional Computing Hardware

Since the advent of the Internet, quantifying the relative importance of...