Collecting 16K archived web pages from 17 public web archives

05/09/2019
by Mohamed Aturban, et al.

We document the creation of a dataset of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of the mementos already collected and used those URIs to look up further mementos in LANL's aggregator. Third, we downloaded the lists that web archives publish of the URIs of original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from those archives rather than through the Memento aggregator. Finally, we downsampled the collected mementos to 16,627 to satisfy two constraints: at most 1,600 mementos per archive, and the ability to download all mementos from each archive in less than 40 hours.
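The per-archive cap in the final downsampling step can be illustrated with a minimal Python sketch. The abstract does not say how mementos were chosen within an archive, so uniform random sampling is an assumption here, as are the function name `downsample`, the `(archive, memento URI)` input shape, and the example host names. The 40-hour download constraint is not modeled.

```python
import random
from collections import defaultdict


def downsample(mementos, cap=1600, seed=0):
    """Keep at most `cap` mementos per archive.

    `mementos` is an iterable of (archive, memento_uri) pairs.
    Uniform random sampling is an assumption made for illustration;
    the source only states the cap, not the selection procedure.
    """
    # Group memento URIs by the archive that holds them.
    by_archive = defaultdict(list)
    for archive, uri_m in mementos:
        by_archive[archive].append(uri_m)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for archive, uris in by_archive.items():
        if len(uris) > cap:
            # Over the cap: draw `cap` distinct mementos without replacement.
            sample[archive] = rng.sample(uris, cap)
        else:
            # At or under the cap: keep everything.
            sample[archive] = list(uris)
    return sample


if __name__ == "__main__":
    # Hypothetical input: one large archive and one small archive.
    collected = [("archive-a.example", f"memento-{i}") for i in range(5000)]
    collected += [("archive-b.example", f"memento-{i}") for i in range(300)]
    for archive, uris in downsample(collected).items():
        print(archive, len(uris))
```

With 17 archives, a cap of 1,600 allows at most 27,200 mementos, so the final count of 16,627 implies that some archives contributed fewer mementos than the cap.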


