Collecting 16K archived web pages from 17 public web archives

05/09/2019
by Mohamed Aturban, et al.
Old Dominion University
Los Alamos National Laboratory
knaw.nl

We document the creation of a dataset of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of already collected mementos and used them to look up mementos in LANL's aggregator. Third, we downloaded web archives' published lists of URIs of both original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from those archives rather than through the Memento aggregator. Finally, we downsampled the collected mementos to 16,627 to satisfy two constraints: a maximum of 1,600 mementos per archive, and the ability to download all mementos from each archive in less than 40 hours.
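The final downsampling step can be illustrated with a minimal sketch. The abstract does not say how mementos were selected within each archive, so the uniform random sampling below is an assumption, as are the function name `downsample`, the `(archive, memento URI)` input format, and the optional `seconds_per_download` map, which is a hypothetical way to turn the 40-hour limit into a per-archive count.

```python
import random
from collections import defaultdict


def downsample(mementos, cap=1600, budget_hours=40.0,
               seconds_per_download=None, seed=0):
    """Return {archive: [memento URIs]} with a bounded count per archive.

    mementos: iterable of (archive_name, memento_uri) pairs.
    cap: maximum number of mementos kept per archive (1,600 in the abstract).
    budget_hours: download time limit per archive (40 hours in the abstract).
    seconds_per_download: ASSUMED optional map of archive name to an
        estimated number of seconds needed per memento download.
    """
    seconds_per_download = seconds_per_download or {}

    # Group memento URIs by the archive that holds them.
    by_archive = defaultdict(list)
    for archive, uri_m in mementos:
        by_archive[archive].append(uri_m)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = {}
    for archive in sorted(by_archive):
        uris = sorted(set(by_archive[archive]))  # drop duplicate URIs
        limit = cap
        delay = seconds_per_download.get(archive)
        if delay:
            # Tighten the cap so that all downloads fit in the time budget.
            limit = min(limit, int(budget_hours * 3600 // delay))
        if len(uris) > limit:
            # ASSUMPTION: uniform random sampling without replacement.
            uris = sorted(rng.sample(uris, limit))
        sample[archive] = uris
    return sample
```

With these assumed inputs, an archive holding fewer mementos than the limit is kept whole, while a larger one is cut to at most 1,600, or fewer if its estimated download time would exceed the budget.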

