The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

03/31/2020
by Xinyue Wang, et al.

The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing them as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
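To make the data-structure point concrete, here is a minimal sketch of why a record-by-record layout can be costly for a batch job that needs only one field. It is not the paper's experiment: the records, the simplified WARC-like framing, and the helper names (`write_row_oriented`, `scan_urls_row_oriented`, `write_columnar`, `scan_urls_columnar`) are all assumptions for illustration, and the "columnar" side is a plain Python dict rather than real Parquet. The sketch also assumes each payload must be read to reach the next record, as it would be in a compressed stream without an index.

```python
import io

# Toy records (url, MIME type, payload). Hypothetical data, not from the paper.
RECORDS = [
    ("http://example.org/a", "text/html", b"<html>" + b"x" * 1000 + b"</html>"),
    ("http://example.org/b", "image/png", b"\x89PNG" + b"y" * 5000),
    ("http://example.org/c", "text/html", b"<html>" + b"z" * 2000 + b"</html>"),
]


def write_row_oriented(records):
    """Serialize records back to back in a simplified WARC-like framing:
    header lines, a blank line, the payload, then a blank-line trailer."""
    buf = io.BytesIO()
    for url, mime, payload in records:
        header = (
            f"WARC-Target-URI: {url}\r\n"
            f"Content-Type: {mime}\r\n"
            f"Content-Length: {len(payload)}\r\n\r\n"
        ).encode("ascii")
        buf.write(header + payload + b"\r\n\r\n")
    return buf.getvalue()


def scan_urls_row_oriented(blob):
    """Collect every URL by walking the records in order.
    Returns (urls, bytes_read)."""
    stream = io.BytesIO(blob)
    urls = []
    while True:
        line = stream.readline()
        if not line:  # end of stream
            break
        headers = {}
        while line not in (b"\r\n", b""):
            key, _, value = line.decode("ascii").rstrip("\r\n").partition(": ")
            headers[key] = value
            line = stream.readline()
        # Assumption: the payload is read, not skipped, to reach the next record.
        stream.read(int(headers["Content-Length"]) + 4)  # payload + trailer
        urls.append(headers["WARC-Target-URI"])
    return urls, stream.tell()


def write_columnar(records):
    """Store each field contiguously, one list per column."""
    urls, mimes, payloads = zip(*records)
    return {"url": list(urls), "mime": list(mimes), "payload": list(payloads)}


def scan_urls_columnar(columns):
    """Collect every URL by touching only the url column.
    Returns (urls, bytes_read)."""
    urls = columns["url"]
    return urls, sum(len(u.encode("ascii")) for u in urls)


if __name__ == "__main__":
    blob = write_row_oriented(RECORDS)
    _, row_bytes = scan_urls_row_oriented(blob)
    _, col_bytes = scan_urls_columnar(write_columnar(RECORDS))
    print(f"row-oriented scan read {row_bytes} bytes")
    print(f"columnar scan read {col_bytes} bytes")
```

In this toy setting the row-oriented scan reads the whole stream while the columnar scan reads only the URL bytes. The size of any real gap depends on the workload, compression, and file layout, which is what the paper's controlled experiments measure.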

