FastWARC: Optimizing Large-Scale Web Archive Analytics

11/22/2021
by Janek Bevendorff, et al.

Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARCive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data via off-the-shelf open-source parsers in many programming languages, such as WARCIO, the de facto standard for Python. When processing web archives at the terabyte or petabyte scale, however, even small inefficiencies in these tools add up quickly, resulting in hours, days, or even weeks of wasted compute time. Reviewing the basic components of WARCIO and analyzing its bottlenecks, we proceed to build FastWARC, a new high-performance WARC processing library for Python, written in C++/Cython, which yields performance improvements by factors of 1.6-8x.
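
To illustrate why individually compressed records permit random access, the following is a minimal sketch using only the Python standard library. It does not use WARCIO or FastWARC and does not produce valid WARC records: the payloads are toy byte strings, and the helper names `write_records` and `read_record_at` are hypothetical. The idea it shows is that each record is its own gzip member, so a reader that knows a record's byte offset can seek there and decompress that one member without touching the preceding ones.

```python
import gzip
import io
import zlib


def write_records(records):
    """Compress each payload as a separate gzip member.

    Returns the concatenated bytes and the start offset of every member.
    (Hypothetical helper for illustration; payloads are not real WARC records.)
    """
    buf = io.BytesIO()
    offsets = []
    for payload in records:
        offsets.append(buf.tell())         # remember where this member starts
        buf.write(gzip.compress(payload))  # one self-contained gzip member
    return buf.getvalue(), offsets


def read_record_at(stream, offset, chunk_size=4096):
    """Seek to a member offset and decompress exactly that one member."""
    stream.seek(offset)
    # wbits = 16 + MAX_WBITS tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = bytearray()
    while not decomp.eof:                  # eof becomes True at the member's end
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        out += decomp.decompress(chunk)
    return bytes(out)


if __name__ == "__main__":
    records = [b"record-0", b"record-1 with more bytes", b"record-2"]
    blob, offsets = write_records(records)
    stream = io.BytesIO(blob)
    # Jump straight to the third record without reading the first two.
    print(read_record_at(stream, offsets[2]))
```

In practice the offsets would come from an external index rather than being kept in memory; the sketch only demonstrates the access pattern, not the parsing cost that the abstract identifies as the bottleneck.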

research
07/02/2021

Web Archive Analytics

Web archive analytics is the exploitation of publicly accessible web pag...
research
08/24/2020

ImarisWriter: Open Source Software for Storage of Large Images in Blockwise Multi-Resolution Format

We publish as open source a high performance file writer library to stor...
research
06/08/2019

A Component-Based Approach to Traffic Data Wrangling

We produce an increasing amount of data. This is positive as it allows u...
research
10/14/2020

fugashi, a Tool for Tokenizing Japanese in Python

Recent years have seen an increase in the number of large-scale multilin...
research
10/22/2020

Transform Data Complexity into Profitability through Data Mining Services

Data Mining experts are able to efficiently search and extract data from...
research
03/31/2020

The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

The WARC file format is widely used by web archives to preserve collecte...
research
10/19/2012

Exploiting Locality in Searching the Web

Published experiments on spidering the Web suggest that, given training ...