
The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

by Xinyue Wang, et al.

The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing them as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
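To make the data-structure point concrete, here is a minimal sketch in standard-library Python. It is not the authors' benchmark or conversion pipeline. It assumes a simplified WARC-style record (a version line, a few headers, a blank line, then a payload of `Content-Length` bytes), and the helper names `make_record`, `iter_records`, and `to_columns` are hypothetical. It shows that answering a question about a single field requires walking every record in the record-oriented layout, whereas a columnar layout of the kind Parquet uses keeps each field together.

```python
import io


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a simplified WARC-style record (illustrative; real records carry more headers)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Records are terminated by two CRLF pairs.
    return head + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers, payload) pairs by scanning the stream one record at a time."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is now the version line; read header lines until the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def to_columns(records):
    """Pivot row-oriented records into one list per field (a toy columnar layout)."""
    columns = {"uri": [], "length": [], "payload": []}
    for headers, payload in records:
        columns["uri"].append(headers["WARC-Target-URI"])
        columns["length"].append(len(payload))
        columns["payload"].append(payload)
    return columns


archive = (
    make_record("http://example.org/a", b"<html>a</html>")
    + make_record("http://example.org/b", b"<html>bb</html>")
)

# One full sequential pass is needed to build the columns...
columns = to_columns(iter_records(io.BytesIO(archive)))

# ...after which a query over one field touches only that field's values.
total_bytes = sum(columns["length"])
```

In this toy layout the `length` column can be summed without reading any payload bytes, which is the kind of access pattern a batch analytics job benefits from.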

