The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

09/10/2019 ∙ by Martin Klein, et al. ∙ Los Alamos National Laboratory

Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions such as the Internet Archive that are optimized for scale. Human-driven services such as the Webrecorder tool provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework that aims to balance archival quality and scalability. We outline its concept and architecture, and evaluate its archival quality and operation at scale. Our findings indicate that its archival quality is on par with or better than that of established archiving frameworks, and that operation at scale comes with a manageable overhead.


1 Introduction and Motivation

The web archiving landscape has evolved significantly over the last twenty years. While the Internet Archive (IA) is the uncontested pioneer in this field and is to date by far the largest publicly available web archive, we are now able to freely access archived web resources from more than twenty web archives around the world (http://timetravel.mementoweb.org/). Many national libraries and archives, such as the National Library of Australia [15] and the UK National Archives [2], have begun to capture parts of the web and contribute to increased diversity. However, most current web archiving frameworks are optimized either to cope with the scale of the web or to provide high-quality archival captures. The IA, for example, generally crawls the web in a best-effort approach, capturing as many web resources as possible. This results in an ever-increasing number of URIs archived and available via the IA's Wayback Machine replay engine [10]. At the time of writing, this number stands at hundreds of billions of URIs [11].

Regarding archival quality, web archiving has become increasingly challenging as a result of the proliferation of dynamic web content that only becomes available via activation of, typically JavaScript-based, affordances in pages. Archiving dynamic web content is technically challenging, and the crawler-driven solutions that have been experimented with thus far are resource intensive, slowing down the crawling process [4, 5]. As such, the IA, which focuses on scale, typically does not apply such techniques. The result of this focus on scale over quality is aptly illustrated by the fact that the cnn.com website has not been properly archived by the IA since November 2016 and cannot be replayed correctly in the Wayback Machine [3]. The screenshot in Figure 1(a) shows the replay of a cnn.com Memento (the archived copy) in the IA (http://web.archive.org/web/20190417195948/https://www.cnn.com/). At the other end of the spectrum, the Webrecorder tool [13] has emerged, focusing on high-fidelity web archiving. Its value is evident in Figure 1(b), showing a screenshot of the replay of a cnn.com Memento created with the Webrecorder tool (https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/). However, Webrecorder achieves this level of quality via human interaction with the web resource that is to be archived and, as such, cannot operate at a scale comparable to that of the IA's crawling processes. To date, approaches that can archive at scale and with high fidelity remain elusive.

(a) Internet Archive
(b) Webrecorder
Figure 1: Replay of cnn.com Mementos

In this paper we introduce the Memento Tracer web archiving framework that aims to operate at web scale while also providing high-quality web captures. Memento Tracer is a result of the “Scholarly Orphans” project, a collaborative effort between the Prototyping Team of the Los Alamos National Laboratory Research Library and members of the Web Science and Digital Library group of the Old Dominion University Computer Science Department. The project is focused on archiving scholarly artifacts, which are resources that scholars create or deposit in productivity portals such as GitHub, Slideshare, or Publons. Our contributions in this work are two-fold:

  1. We outline the Memento Tracer concept, detail its architecture, and describe a pilot implementation.

  2. We conduct an experimental evaluation of Memento Tracer regarding the scale and quality at which it can archive two resource types that present dynamic content challenges.

Despite the limited scope of the evaluation, we feel that our contributions reveal various attractive characteristics of Memento Tracer, which suggest the potential for it to evolve towards a web archiving approach that is able to capture web resources at scale and with high quality.

2 Related Work

In addition to its crawler-based web archiving services, the Internet Archive offers an archive-on-demand service called "Save page now". This service allows a user to submit a URI to the IA, which is then crawled immediately. Perma.cc and archive.today offer very similar services, all of which come with strengths and weaknesses. Perma.cc, for example, requires a user login to pro-actively create Mementos of submitted URIs and charges a subscription fee that depends on the number of Mementos created per month. Little is publicly known about archive.today, its technology stack, and its institutional background, but, similar to the IA's "Save page now" service, its capability to handle dynamic web content is limited. The IA has acknowledged this shortcoming and introduced a beta version of a more powerful archiving-on-demand service. This service is based on "brozzler" [8] and operates a Chromium browser to execute dynamic content and therefore, for example, discovers URIs that are generated by JavaScript. Our first tests did not return reliable results, but once the service reaches a more stable state, it should be included in this comparative study.

Brunelle et al. [5] conducted a study to investigate the trade-off and implicit overhead between operating a "regular" web crawler such as Heritrix [9] and a headless browser such as PhantomJS [7], which can more reliably execute dynamic content. They found that the more sophisticated, headless-browser-based crawling approach resulted in a spike of discovered URIs to crawl as well as vastly increased crawl time.

The Webrecorder tool is designed for humans to interact with a web resource and record the interaction into an archival record. Dynamic content is typically handled very well and, as long as all essential parts of the resource are interacted with, the archival record represents a high-fidelity capture of the live web resource that can be replayed, for example, with the Webrecorder Player [14]. While archiving with Webrecorder is a manual process, the tool's developers have taken initial steps towards automating certain interactions with individual web resources [12].

Brunelle et al. [6] proposed an automated method to assess the quality of archived web resources. Their algorithm assigns relative values to embedded resources and, depending on the availability of these resources, determines a damage rating. The authors implemented a web service to assess Memento Damage (http://memento-damage.cs.odu.edu/), which we considered for our quality evaluation. However, since it does not compare two versions of the same resource but rather analyzes individual resources separately, it is not applicable to our study.

3 Memento Tracer Framework

With the Memento Tracer framework, we introduce a new collaborative approach to capture web publications for archival purposes. The framework is inspired by existing capture approaches yet aims for a new balance between archiving at scale and the quality of the resulting snapshots. It was developed as part of a project that focuses on capturing scholarly artifacts from productivity portals such as GitHub, Slideshare, Publons, Figshare, Wikipedia, and Stack Overflow. Similar to other existing web archiving approaches such as LOCKSS [17, 18], it uses server-side processes that leverage the insight that web publications in a given portal are typically based on the same template and share features such as layout and interactive affordances.

Similar to the Webrecorder tool, a human helps achieve high-quality captures and determines the boundary of the resource to be archived. However, with Memento Tracer, heuristics are recorded that apply to an entire class of web publications, not to individual web publications. These heuristics can be created collaboratively by curators and deposited in a shared community repository. When the server-side capture processes come across a web publication of a class for which heuristics are available, they can be applied, yielding captures that are aligned with the curators' instructions.

3.1 Framework

Figure 2: Memento Tracer framework

Figure 2 visualizes the Memento Tracer concept and its three main components from left to right. Below we describe the framework in detail using the task of archiving slide decks from Slideshare as an example.

A Browser Extension The first step in the framework begins when a curator navigates to a web page representative of a class of resources in a portal, for example the landing page of a Slideshare presentation, and activates the browser extension. By interacting with the web page (clicking through the slides, downloading the entire slide deck, etc.), the curator creates a trace that, in an abstract manner, describes the artifact to be archived. The extension does not record actual resources or URLs that are traversed by the curator. Rather, it captures interactions in terms that uniquely identify the page's elements that are being interacted with, for example by means of their class ID or XPath. The extension's recording of a page's elements is inspired by the Selenium IDE (https://www.seleniumhq.org/selenium-ide/), an open source record-and-playback test automation suite for the web. Since all pages of the same class in the same portal are typically based on the same template, the resulting traces apply across all pages of the class rather than to a specific page only. In our example, the created trace is valid for all slide decks on Slideshare. Currently, the extension is able to record simple mouse-clicks, clicks on all links in a certain user interface component, and repeated clicks. The latter is especially useful when navigating through all slides of a presentation or paginating through multi-page blog posts or manuals. The created trace also indicates the URL pattern to which the trace applies and provenance information, including the resource on which the trace was created and the user agent used to create it. When the layout or affordances for a particular class of web publications change, a new trace needs to be recorded to ensure it remains valid for the changed publications. In contrast to crawler-based approaches, but similar to the Webrecorder concept, with Memento Tracer a curator is in charge of determining the desired components of a web resource that is to be archived. The fact that a trace can automatically be applied to all artifacts of the same class represents a major scalability advantage over other human-driven approaches such as Webrecorder, which requires all interactions to be executed anew for each resource, even within the same class and portal.
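To make the notion of a trace more concrete, the sketch below shows what a recorded trace for the Slideshare example might contain. The paper does not prescribe a serialization format, so all field names, selectors, and the overall structure are illustrative assumptions, expressed here as a Python dictionary.

```python
# Illustrative sketch of a recorded trace for the Slideshare example.
# Memento Tracer's actual serialization is not shown here, so all field
# names and the overall structure are assumptions.
slideshare_trace = {
    # URL pattern identifying the class of pages the trace applies to
    "url_pattern": "https://www.slideshare.net/*/*",
    # Provenance: the page the trace was recorded on and the user agent
    "provenance": {
        "recorded_on": "https://www.slideshare.net/some-user/some-deck",
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/74.0",
    },
    # Interactions recorded as element identifiers, not concrete URLs
    "actions": [
        {
            # Repeated click: advance through all slides of the deck
            "type": "click_repeat",
            "selector": {"xpath": "//button[@id='btnNext']"},
            "until": "element_disabled",
        },
        {
            # Click on all links within one user interface component
            "type": "click_all",
            "selector": {"css": "#notes-panel a"},
        },
    ],
}
```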

A Shared Repository After a curator has successfully recorded a trace, she can share it with the community via a publicly accessible repository, thereby crowdsourcing the web curation task. The shared repository allows for reuse and refinement of existing traces. Hence, anyone in the community can utilize a trace created by another curator, for example the aforementioned Slideshare trace, to capture other slide decks. Since the perception of what the essence of a web publication is may differ from one curator to the next [16], the repository supports multiple traces for a specific class of pages, each unambiguously identifiable in the repository. In addition, since the layout of pages evolves over time, traces will need updating, making version support by the repository essential. Given these requirements, we consider GitHub a suitable host for the shared repository. The availability of traces to the community for reuse and refinement, combined with versioning in a shared repository, further increases the scalability of the Memento Tracer approach.
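Under the assumption that traces are stored as files in a GitHub repository, a capture process could retrieve a specific, versioned trace via GitHub's raw file interface, as in the sketch below. The organization, repository name, file layout, and JSON serialization are hypothetical.

```python
import requests

# Hypothetical layout of a shared trace repository hosted on GitHub;
# organization, repository, and file paths are illustrative only.
TRACE_REPO = "https://raw.githubusercontent.com/some-org/tracer-traces"

def fetch_trace(portal, name, ref="master"):
    """Fetch one trace, pinned to a branch, tag, or commit hash."""
    url = f"{TRACE_REPO}/{ref}/{portal}/{name}.json"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

# Pinning `ref` to a commit hash unambiguously identifies one version
# of a trace, even after the trace is updated for a layout change.
trace = fetch_trace("slideshare", "slidedeck")
```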

A Headless Browser Application To generate web captures, the Memento Tracer framework assumes a setup consisting of a WebDriver (Selenium WebDriver: https://www.seleniumhq.org/), a headless browser (Headless Chrome: https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md), and a capturing tool (WarcProxy: https://github.com/internetarchive/warcprox). We developed a parser for the WebDriver (based on the Selenium WebDriver's API) that translates the content of a trace into instructions (JavaScript code) for the headless browser to emulate the interactions with the web resource as captured in the trace by the browser extension. The capturing tool writes resources navigated by the headless browser to WARC files [1]. When this fully automated capture setup comes across a web resource of a class for which a trace is available, the trace is invoked to guide the capturing of the resource. Capturing resources based on traces guarantees high-fidelity archived resources, which is a major advantage over, for example, the IA's automated crawling approaches.
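A minimal sketch of such a capture setup follows, assuming warcprox is already running locally on port 8000 and reducing trace replay to a single hard-coded click for brevity; the browser flags are real Chrome/Selenium options, but the concrete URL and XPath are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes warcprox is already running locally (e.g. `warcprox -p 8000`)
# and writes every resource the browser fetches into a WARC file.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Route all browser traffic through warcprox for WARC capture.
options.add_argument("--proxy-server=http://localhost:8000")
# warcprox intercepts HTTPS as a man-in-the-middle, so errors for its
# generated certificates must be ignored.
options.add_argument("--ignore-certificate-errors")

driver = webdriver.Chrome(options=options)
try:
    # Load the resource and replay one interaction from the trace.
    driver.get("https://www.slideshare.net/some-user/some-deck")
    driver.find_element(By.XPATH, "//button[@id='btnNext']").click()
finally:
    driver.quit()
```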

3.2 Pilot

We used a pilot implementation of the framework described above to capture artifacts deposited by researchers in productivity portals. We created a trace for a sample resource of each of the portals and used those traces to guide the capturing process of artifacts deposited in the respective portals by the researchers. The application at https://myresearch.institute provides an overview of artifacts captured since August. The application offers different views on the collection of captured artifacts, such as by capture date, by researcher, and by productivity portal. A landing page per captured artifact (for example: https://myresearch.institute/event/e7e8fcc4e8c14392af1c264295d6268a/) provides basic metadata about the artifact as well as links to the WARC file resulting from the capture and to a replay of the captured artifact. More information about the capture pipeline in which Memento Tracer was used is available via the About page (https://myresearch.institute/about/).

4 Experiment Design

We evaluate our Memento Tracer framework in two dimensions: archival quality and scalability. To assess archival quality, we compare its performance against Mementos created with Webrecorder, the tool designed to create high-fidelity captures. To evaluate scalability, we conduct two experiments. The first assesses the extent to which Memento Tracer can generate quality captures for a large set of web resources. The second compares the time Memento Tracer and an automated crawling framework designed for scale require to create captures.

The quality of archived web resources is not trivial to measure. Different Memento replay systems may vary in performance, and individuals' perceptions of what the essence of a resource is, and hence which parts of the resource need to be part of the archival record, may differ [16]. Rather than trying to find a compromise between these arguably subjective aspects, we decided to focus our quality assessment on the extent to which URIs that should be captured according to curatorial decisions are actually captured. In order to create a baseline of the number of URIs we expect in a Memento, we analyze the live web version of each resource. We expect a high-quality archival record to contain at least the same number of URIs as its live web version. We are aware that this comparison may not capture all dimensions of quality. For example, a CSS file that is missing from a captured resource may have a more detrimental impact on the "look and feel" of the replay than a missing image. However, this process enables us to automatically compare a dataset of live web resources with their corresponding Mementos.
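The following sketch illustrates the kind of comparison this metric enables, assuming the URIs of interest can be extracted as sets from both the live web resource and the Memento; the function and the example values are purely illustrative.

```python
def uri_coverage(live_uris, memento_uris):
    """Fraction of curatorially expected live web URIs found in a Memento."""
    if not live_uris:
        return 1.0
    return len(live_uris & memento_uris) / len(live_uris)

# Illustrative values: a repository with three file URIs plus the ZIP
# file URI, where the capture missed one file.
live = {"/repo/a.py", "/repo/b.py", "/repo/README.md", "/repo/main.zip"}
captured = {"/repo/a.py", "/repo/README.md", "/repo/main.zip"}
print(f"{uri_coverage(live, captured):.0%}")  # prints: 75%
```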

4.1 Data Gathering

Memento Tracer is a result of our Scholarly Orphans project, where our focus is on archiving scholarly artifacts that researchers deposit in web productivity portals. We generated a dataset consisting of resources from two such portals, selected because they present interesting web archiving challenges against which to analyze the performance of our novel web archiving framework. This dataset is suitable for investigating both web archiving quality and scale. The first portal from which we obtained resources is GitHub. Its API does not offer the functionality to randomly select GitHub resources. We therefore decided to utilize the news and podcast platform https://changelog.com/ and its digest of GitHub repositories, published daily since January 1st. Changelog's digest consists of the most popular GitHub repositories on a given day, as measured by the number of stars received. These repositories are further distinguished between popular overall, popular overall but making the list for the first time, and popular overall but newly created. We focused on the latter two categories, as this ensures we obtain established repositories while decreasing the chances of obtaining duplicates, and newly created repositories are included as well. This mixture of GitHub repositories, while somewhat biased towards popularity (given the number of received stars or "likes"), serves as our sample set of resources. In total, we obtained a set of GitHub repository URIs. In order to conduct an accurate analysis of archiving quality, we need to ensure that our comparisons are based on the live web versions that were used to create the Mementos, which may no longer be the current versions by the time we conduct our comparisons. We therefore use the GitHub API to identify the time-specific last-commit URI of each repository and use these URIs, for which the repository content is fixed.
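One way to obtain such a time-specific URI is via GitHub's public commits API, as in the sketch below; the paper does not detail its exact API calls, so this is an illustrative reconstruction.

```python
import requests

def last_commit_uri(owner, repo):
    """Build a GitHub URI pinned at the repository's most recent commit."""
    # Public commits API; unauthenticated requests are rate-limited,
    # so an API token would be needed when doing this at scale.
    api = f"https://api.github.com/repos/{owner}/{repo}/commits"
    response = requests.get(api, params={"per_page": 1}, timeout=10)
    response.raise_for_status()
    sha = response.json()[0]["sha"]
    # A /tree/<sha> URI identifies the repository content at a fixed commit.
    return f"https://github.com/{owner}/{repo}/tree/{sha}"

print(last_commit_uri("internetarchive", "warcprox"))
```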

Our second dataset consists of resources from Slideshare. In order to obtain a random sample, we use the portal's Explore feature (https://www.slideshare.net/explore) to obtain a sample of slide decks. Given this source, our dataset is clearly also biased towards popularity and Slideshare's selection algorithm. However, the algorithm that selects slide decks and features them on the Explore site is entirely opaque to us, and this process guarantees a broad variety of subjects under which the slide decks are classified, making our sample an applicable dataset. Since Slideshare creates a new URI for each uploaded and updated slide deck, there is no concern that a resource on the live web will change throughout our experiment. In total, we obtained a set of URIs of distinct slide decks for this dataset.

4.2 Traces for Dataset Resources

We use the Memento Tracer Chrome extension to create one trace for GitHub repositories and one for Slideshare slide decks. For this step we mimic the decision making of a curator and determine which parts of the resources are essential to capture and archive. According to these curatorial decisions, the trace created for GitHub repositories includes all files and top-level directories listed in the repository as well as the downloadable ZIP file containing the entire repository (a screencast of the Memento Tracer Chrome extension and the interactions with a GitHub repository recorded into a trace is available at https://doi.org/10.6084/m9.figshare.8049839.v1). Guided by this trace (available at https://doi.org/10.6084/m9.figshare.8024612), all GitHub repositories archived with Memento Tracer should therefore, when replayed, contain all repository files as well as the ZIP file. The trace created for Slideshare (available at https://doi.org/10.6084/m9.figshare.8024615) guides the capturing process to include all slides as well as all notes per slide deck.

These curatorial decisions allow us to precisely and automatically determine the number of URIs of interest contained in the live web version of each resource. For this purpose, our evaluation program loads the live web resource in a browser and interacts with it to count the number of URIs expected according to the curatorial decisions made. With this process we determine that each file and top-level directory in a GitHub repository, as well as the ZIP file, has a distinct URI. Similarly, each slide in a Slideshare slide deck, as well as its associated note, has a unique URI.

5 Experiments and Results

5.1 Archival Quality

With our baseline of live web URIs in place, we can compare Mementos created with different archiving frameworks. We use the same evaluation program we used to assess the number of URIs in live web resources to assess the number of URIs in Mementos. To conduct this comparison, we create subsets derived from our dataset introduced in Section 4.1: we randomly pick GitHub repositories and Slideshare slide decks and use the Webrecorder tool to manually create the respective Mementos. Being very familiar with the Webrecorder tool, we applied the same curatorial criteria that we used to record the traces for the two productivity portals. For GitHub repositories, this means we click on every single file in a repository in order to capture these resources, and on the "Clone or Download" button in order to capture the ZIP file of the repository. For slide decks, this means we click on the "Next" button as many times as necessary to capture all slides in the deck, and on each of the included notes. Since this manual process is rather time consuming, we limit the size of this dataset. In addition, with the traces recorded for GitHub repositories and for slide decks on Slideshare, we use the Memento Tracer framework to capture the same GitHub repositories and Slideshare slide decks.

(a) GitHub URIs, each left bar represents file URIs, the right ZIP file URIs
(b) Slideshare URIs, each left bar represents slide URIs, the right notes URIs
Figure 3: Relative number of URIs from live web, Memento Tracer, and Webrecorder Mementos. Green represents available, red unavailable resources.

Our first analysis is based on whether all expected URIs are contained in the archived record. We expect all URIs for files in GitHub repositories (each file has a distinct URI) and one URI for the repository ZIP file in addition to the URI of the repository itself. For Slideshare we expect all URIs for slides (each slide has a distinct URI) and all URIs for notes (each note has a distinct URI) plus the URI of the slide deck page itself. Since the URIs of the repositories and slide deck pages are all available on the live web and in all Mementos, we exclude them going forward and only focus on URIs of the component resources that are of interest according to our curatorial decisions.

Figure 3 displays the results of the analysis based on the URIs sampled from the overall dataset. The relative numbers of URIs are represented on the y-axis and the corresponding sources (LW, MT, WR) are shown on the x-axis. The size of the green portion of a bar indicates the number of available URIs and the red portion shows unavailable URIs. Figure 3(a) displays the GitHub URIs, with the left bars corresponding to repository file URIs and the right bars to ZIP file URIs. Figure 3(b) shows the Slideshare URIs, with the left bars representing URIs of slides and the right bars URIs of notes. We can immediately make a few observations from these graphs. As expected, the ratio of available URIs in Webrecorder Mementos is very similar to that of the live web versions. Generally, the vast majority of URIs are available, which confirms the tool's reputation for delivering high-fidelity captures. We also notice very high ratios of available URIs for Mementos created with Memento Tracer. In fact, the ratio of available URIs in Memento Tracer Mementos is at times even higher than the ratio in Webrecorder Mementos. The drop in available Webrecorder URIs for GitHub repository files can potentially be explained by network errors observed while creating the archival snapshots as well as by possible human errors, for example forgetting to click on a file. Both points favor an automated framework for capturing web resources: such a process can detect network errors, retry the capture, and is not subject to human error.

These findings strongly support our claim that Memento Tracer captures are of high quality: they are comparable to, if not better than, Webrecorder Mementos and align very closely with their live web versions.

(a) GitHub URIs, files left, ZIP right
(b) Slideshare URIs, slides left, notes right
Figure 4: Relative number of URIs from live web and Memento Tracer Mementos. Green represents available, red unavailable resources.

5.2 Quality at Scale

Using subsets in the previous experiments, we established that Memento Tracer Mementos are of very high quality when compared to their live web versions (and to Webrecorder Mementos). We are now interested in analyzing whether Memento Tracer keeps delivering this quality when operating at scale. We use the framework to capture all resources in our dataset: Memento Tracer Mementos of all GitHub repositories as well as of all Slideshare slide decks. This translates to a dataset increase of more than two orders of magnitude. If we find a high level of similarity in terms of available URIs between the live web versions and the Memento Tracer Mementos, we can confidently state that, even at scale, the Memento Tracer approach provides high-quality captures.

Figure 4 shows, similar in concept to Figure 3, the results of this large-scale analysis. Figure 4(a) represents the GitHub URI ratios and Figure 4(b) the Slideshare ratios. We can barely see a red portion in any of the bars, indicating that almost all URIs available in the live web versions are also available in the corresponding Memento Tracer Mementos.

Table 1 provides insight into the comparison between live web versions and Memento Tracer Mementos at the granularity level of GitHub repositories and Slideshare slide decks, showing the percentage of Memento Tracer Mementos that contain 100% of the URIs available in the corresponding live web version. For GitHub, 92.83% of Memento Tracer repository Mementos contain all URIs from their live web version. Considering only repository file URIs, this value is 93.29%, and considering only ZIP file URIs it is 98.7%; since there is only one ZIP file per repository, the latter measure is binary. We can observe that Memento Tracer does very well overall, and slightly better for ZIP files than for repository files (98.7% vs. 93.29%). The most likely reason is that temporary network issues not caught by the automatic capture process prevented the Memento Tracer framework from archiving all file URIs. The corresponding data for Slideshare slide decks shows that Memento Tracer does even better there: 98.67% of Mementos contain all URIs, the value for slides alone is almost perfect (99.9%), and the percentage for notes is also very high, at 98.58%.

                        100% of live web URIs available
GitHub       All        92.83%
             Files      93.29%
             ZIP        98.7%
Slideshare   All        98.67%
             Slides     99.9%
             Notes      98.58%

Table 1: Percentage of Memento Tracer Mementos containing 100% of the URIs available in the corresponding live web version

These findings exceed our expectations and confirm that the Memento Tracer framework, even at large scale, archives web resources with high quality.

5.3 Memento Tracer Overhead

Figure 5: Time deltas between Memento Tracer and a simple web crawler

We are further interested in the overhead of the WebDriver- and headless-browser-based capture approach of the Memento Tracer framework. Simply comparing runtimes of Memento Tracer versus another automatic web archiving framework such as Heritrix would not be fair, as such a crawler would not discover the same URIs since it cannot cope with some dynamic affordances. Instead, we extract all URIs captured by the Memento Tracer framework while creating Mementos of the initial subsets of GitHub and Slideshare URIs, and we crawl these URIs with a simple Python-based crawler. Our simple crawler builds on the popular Python Requests library to perform HTTP GET requests against the URIs and is configured to resemble a Chrome browser (specified user agent, set timeout values, etc.). We also used multiple threads to parallelize these HTTP requests, speed up the crawling process, and emulate a production crawling framework operating at scale. The advantage of this process is that the crawler simply captures URIs, and its runtime therefore provides the minimal time needed to capture the same resources as Memento Tracer. We compare both runtimes and present the deltas in Figure 5. All deltas (per GitHub and Slideshare URI) are positive (y-axis), which means Memento Tracer in all instances takes longer than the simple crawler. This finding is not surprising given Memento Tracer's overhead of running and controlling a browser for each URI. We see quite a variance in the deltas, particularly for Slideshare URIs. On average, based on our subsets, capturing a URI with Memento Tracer takes on the order of seconds longer than with the simple crawler, for both GitHub and Slideshare URIs. Extrapolated to our entire dataset, this overhead amounts to hours of additional capture time for each of the two portals.
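The sketch below illustrates a baseline crawler along the lines described above; the specific header value, timeout, and thread count are illustrative assumptions rather than the exact configuration used in the experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Headers, timeout, and thread count are illustrative assumptions.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
}

def fetch(uri):
    """Dereference a single URI and return its HTTP status code."""
    response = requests.get(uri, headers=HEADERS, timeout=30)
    return response.status_code

def crawl(uris, workers=16):
    """Fetch all URIs in parallel; return elapsed wall-clock seconds."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, uris))
    return time.time() - start
```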

While this may sound like a lot, we highlight two arguments for why these numbers are reasonable. First, on average, the Memento Tracer framework is slower than the simple crawler by a smaller factor than Brunelle et al. [5] found for their headless browser approach compared to a common crawler, so we see a significant decrease in crawl time. Second, simple automatic crawlers would not even discover many of the captured URIs, so speed is not the only factor. Runtimes can vary and depend on network speeds and potential framework and crawler optimizations, but objectively, these numbers provide insight into the cost, in terms of extra time, involved in automatic high-quality archiving.

6 Discussion and Future Work

Memento Tracer was developed as part of the Scholarly Orphans project, which focuses on artifacts that researchers deposit in a limited set of web productivity portals. Investigating the framework's applicability and merit beyond that scope remains future work. We anticipate limitations of the Chrome extension when creating traces for some resources, as well as limited value of the automated approach for web resources that are not based on common templates.

Traces created with our browser extension are currently expressed in a non-standardized manner. In order to enable interoperability between traces and other capture frameworks such as Puppeteer (https://github.com/GoogleChrome/puppeteer), a standard language to express interactions needs to be devised.

We are exploring alternative components to further stabilize our pilot implementation of the framework, as both the headless browser and the WarcProxy tool have proven unreliable at times.

7 Conclusion

In this paper we introduced Memento Tracer, a framework that provides high-quality captures of web resources. Memento Tracer puts the curator in charge of determining the desired components of a web resource to be archived and takes advantage of frequently reused patterns in online productivity portals. We conducted experiments showing that Memento Tracer delivers high archival quality and can even outperform the Webrecorder tool, which was designed for high-fidelity captures. We further showed that Memento Tracer captures web resources at high quality even when operated at scale. The technical complexity of the framework, however, comes at a cost: in our experimental setup, compared to a simple crawling framework, Memento Tracer takes on the order of seconds longer to capture a single URI. Our findings demonstrate the feasibility and highlight the potential of the Memento Tracer approach. As such, the contributions of this work should be considered a next step towards balancing quality and scalability for web archiving.

8 Acknowledgement

This work is supported in part by The Andrew W. Mellon Foundation grant 11600663.

References