ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

02/03/2017
by Helge Holzmann, et al.

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility and reusability. Towards these objectives, we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools.
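
The index-driven optimization mentioned in the abstract can be illustrated with a minimal sketch: filter on small metadata records first, and read the full archived payload only for entries that survive the filter. This is not ArchiveSpark's API. `IndexEntry`, `select_then_load`, `build_demo_archive`, `make_loader`, and the in-memory byte blob standing in for an archive file are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical lightweight metadata record (not ArchiveSpark's schema)."""
    url: str
    timestamp: str
    mime: str
    status: int
    offset: int  # where the full record starts in the archive blob
    length: int  # how many bytes the full record occupies


def select_then_load(
    index: Iterable[IndexEntry],
    predicate: Callable[[IndexEntry], bool],
    load_payload: Callable[[IndexEntry], bytes],
) -> Iterator[Tuple[IndexEntry, bytes]]:
    """Filter on metadata only; touch the archive just for matching entries."""
    for entry in index:
        if predicate(entry):
            yield entry, load_payload(entry)


def build_demo_archive(records) -> Tuple[List[IndexEntry], bytes]:
    """Concatenate payloads into one blob and record each offset in an index."""
    blob = bytearray()
    index = []
    for url, timestamp, mime, status, body in records:
        index.append(IndexEntry(url, timestamp, mime, status, len(blob), len(body)))
        blob.extend(body)
    return index, bytes(blob)


def make_loader(blob: bytes, reads: List[str]) -> Callable[[IndexEntry], bytes]:
    """Return a loader that slices the blob and logs every payload read."""
    def load(entry: IndexEntry) -> bytes:
        reads.append(entry.url)
        return blob[entry.offset:entry.offset + entry.length]
    return load


# Made-up demo records, assumed purely for illustration.
RECORDS = [
    ("http://example.org/", "20170203120000", "text/html", 200, b"<html>home</html>"),
    ("http://example.org/logo.png", "20170203120001", "image/png", 200, b"\x89PNG"),
    ("http://example.org/gone", "20170203120002", "text/html", 404, b"<html>gone</html>"),
    ("http://example.org/about", "20170203120003", "text/html", 200, b"<html>about</html>"),
]

index, blob = build_demo_archive(RECORDS)
reads: List[str] = []
corpus = list(
    select_then_load(
        index,
        lambda e: e.mime == "text/html" and e.status == 200,
        make_loader(blob, reads),
    )
)
```

In this toy setup, `reads` ends up listing only the two successful HTML captures, so the image and the 404 page are never fetched from the blob. The speed-ups reported for the real framework depend on its own implementation and data, which this sketch does not attempt to reproduce.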
