Collecting 16K archived web pages from 17 public web archives

05/09/2019 ∙ by Mohamed Aturban, et al. ∙ Old Dominion University ∙ Los Alamos National Laboratory

We document the creation of a data set of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of already collected mementos. These URIs were then used to look up mementos in LANL's aggregator. Third, we downloaded web archives' published lists of URIs of both original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from archives, not through the Memento aggregator. Finally, we downsampled the collected mementos to 16,627 due to our constraints of a maximum of 1,600 mementos per archive and being able to download all mementos from each archive in less than 40 hours.

1 Introduction

Even though web archives hold billions of archived web pages [16], or mementos, obtaining a sample of mementos can be difficult. We describe the steps we took to create a data set of 16,627 mementos of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We use this collection in our study of identifying changes and transformations in the content of mementos over time (our preliminary work can be found in [6, 7, 8]).

To obtain a memento, lookup by URI-R (a URI-R identifies an original resource from the live web, as described in Section 2) is supported by most web archives, but this requires a user to know the URI of an original page. For instance, we expect to find mementos of well-known URI-Rs (e.g., www.cnn.com) in massive web archives, such as the Internet Archive (web.archive.org), as these archives try to capture the entire web by employing large-scale web crawlers. Other web archives focus on preserving special collections. For instance, the UK Web Archive (webarchive.org.uk/ukwa/) was established with the objective of archiving only UK websites (e.g., www.parliament.uk/) [10]. Other web archives, such as perma.cc, webcitation.org, and archive.is, capture web pages on demand, so they only preserve pages submitted by users, not pages discovered by crawling the web. Table 1 lists the 17 public web archives and classifies each by purpose:

  • General: Archives preserve any web page discovered through large-scale web crawlers.

  • On-demand: In general, only web pages (URIs) submitted by users are captured, but the archive might also create archived collections or obtain a copy of collections captured by other archives.

  • National: Archives preserve a government or country’s web content. They might capture web pages with one or more specific Top Level Domain.

  • Organizational: Archives preserve web pages that are about specific organizations, such as the European Union.

Archive URI Archive Name Purpose
swap.stanford.edu Stanford Web Archive Portal General
web.archive.org The Internet Archive General and on-demand
archive.bibalex.org Bibliotheca Alexandrina’s Internet Archive National
arquivo.pt The Portuguese Web Archive (PWA) National
collectionscanada.gc.ca Library and Archives Canada National
digar.ee The Estonian Web Archive National
nationalarchives.gov.uk The National Archives National
vefsafn.is The Icelandic Web Archive National
webarchive.loc.gov Library of Congress Web Archives National
webarchive.org.uk The UK Web Archive (UKWA) National
webarchive.proni.gov.uk Public Record Office of Northern Ireland (PRONI) National
webharvest.gov Congressional & Federal Government Web Harvests National
archive-it.org Archive-It - Web Archiving Services for Libraries and Archives On-demand
archive.is Archive.is On-demand
perma.cc Perma.cc On-demand
webcitation.org WebCite On-demand
europarchive.org The European Archive Organizational
Table 1: A set of 17 public web archives.
Archive URI-Ms 1996 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17
webarchive.loc.gov 1,594 - 1 1 1 4 100 100 100 99 100 100 100 100 98 99 99 99 98 98 99 98 -
vefsafn.is 1,589 6 8 10 11 11 13 13 14 42 46 74 71 70 85 102 116 140 153 152 152 150 150
webcitation.org 1,585 - - - - - - - - - 28 89 85 70 119 156 156 157 156 155 130 127 157
arquivo.pt 1,569 30 14 14 15 15 - - - - 1 1 - 163 167 166 163 162 167 165 164 162 -
web.archive.org 1,566 73 73 73 69 71 71 72 73 72 73 72 72 72 72 70 69 69 67 70 71 72 70
archive.is 1,396 11 10 9 12 10 12 14 13 18 14 20 33 25 29 28 59 12 214 214 214 213 212
archive-it.org 1,383 17 15 2 1 3 1 1 - 1 51 109 107 108 105 109 107 106 109 107 107 109 108
swap.stanford.edu 1,222 - - - - - - - - - - - 21 77 185 166 119 135 164 180 140 21 14
nationalarchives.gov.uk 994 - - - - - - 1 2 25 12 50 40 97 117 106 110 104 94 83 59 54 40
europarchive.org 979 - - - - - - - - - - - - - - - 120 219 72 172 146 213 37
webharvest.gov 712 - - - - - - - - 128 - 126 - 91 - 129 2 127 59 38 12 - -
digar.ee 488 - - - - - - - - - - - - - - 36 95 69 89 69 74 56 -
webarchive.proni.gov.uk 469 - - - - - - - - - - - - - - 17 94 19 75 75 78 59 52
collectionscanada.gc.ca 351 - - - - - - - - - 40 173 138 - - - - - - - - - -
webarchive.org.uk 349 - - - - - - - - - 6 9 10 31 34 31 34 34 30 34 29 34 33
archive.bibalex.org 199 - - 1 - - - - - - - 1 - - - - 99 98 - - - - -
perma.cc 182 - - - - - - - - - - - - - - - - - - 23 53 53 53
Total 16,627 137 121 110 109 114 197 201 202 385 371 824 677 904 1011 1215 1442 1550 1547 1635 1528 1421 926
Table 2: URI-Ms per archive per year. The data is available in CSV format on GitHub at https://github.com/oduwsdl/mementos-fixity/blob/master/urims-per-year.csv.

Table 2 shows the actual number of collected mementos, denoted by URI-M (a URI-M identifies an archived version, or memento, of an original resource, as described in Section 2), per archive and also the distribution of selected mementos through time. We explain in Section 3 how we obtained this set of 16,627 mementos.

During the process of collecting mementos, we obtained many archived pages, but because of the requirements for our target study, we only selected 16,627 mementos. The requirements include:

  1. Downloading mementos is a slow operation and since the bottleneck is the archives themselves, parallelization will not help. We chose a target of completing the download of all mementos from all the archives within 40 hours. We also planned to do no more than two such downloads per week in order to limit the load on the archives.

  2. Since we want to study changes in the playback of mementos over time, we chose 200 as the minimum number of URI-Rs per archive.

  3. The number of selected mementos from each web archive should not exceed 1,600. This condition should help reduce the difference between large archives and small archives in terms of the number of sampled mementos.

The main purpose of this paper is to document how the dataset of mementos was created so it can be reused by other studies.

2 Background

In order to automatically collect portions of the web, some web archives employ web crawling software, such as the Internet Archive’s Heritrix [23, 21]. Having a set of seed URIs placed in a queue, Heritrix will start by fetching web pages identified by those URIs, and each time a web page is downloaded, Heritrix writes the page to a WARC file [3], extracts any URIs from the page, places those discovered URIs in the queue, and repeats the process.

The crawling process will result in a set of archived pages. To provide access to their archived pages, many web archives use OpenWayback [14], the open-source implementation of IA’s Wayback Machine, that allows users to query the archive by submitting a URI. OpenWayback will replay the content of any selected archived web page in the browser. One of the main tasks of OpenWayback is to ensure that when replaying a web page from an archive, all resources that are used to construct the page (e.g., images, style sheets, and JavaScript files) should be retrieved from the archive, not from the live web. Thus, at the time of replaying the page, OpenWayback will rewrite all links to those resources to point directly to the archive [24]. In addition to OpenWayback, PyWb [17] is another tool for replaying mementos. It is used by perma.cc, Webrecorder [18], and other archives and services.

Memento [25, 26] is an HTTP protocol extension that uses time as a dimension to access the web by relating the current web resources to their prior states. The Memento protocol is supported by most public web archives including the Internet Archive. The protocol introduces two HTTP headers for content negotiation. First, Accept-Datetime is an HTTP Request header through which a client can request a prior state of a web resource by providing the preferred datetime, for example:

Accept-Datetime: Mon, 09 Jan 2017 11:21:57 GMT

Second, the Memento-Datetime HTTP Response header is sent by a server to indicate the datetime at which the resource was captured, for instance:

Memento-Datetime: Sun, 08 Jan 2017 09:15:41 GMT

The Memento protocol also defines the following terminology:

  • URI-R - identifies an original resource from the live web

  • URI-M - identifies an archived version (memento) of the original resource at a particular point in time

  • URI-T - a resource (TimeMap) that provides a list of mementos (URI-Ms) for a particular original resource (URI-R)

  • URI-G - a resource (TimeGate) that supports content negotiation based on datetime to access prior versions of an original resource (URI-R)
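
As an illustration of the two headers, the following minimal Python sketch builds an Accept-Datetime value and parses a Memento-Datetime value using only the standard library. The function names are hypothetical and are not part of any tool described in this paper.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(preferred):
    """Build an Accept-Datetime request header (hypothetical helper)."""
    # usegmt=True requires a UTC datetime and emits the "GMT" suffix.
    value = format_datetime(preferred.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": value}


def parse_memento_datetime(value):
    """Parse a Memento-Datetime response header value into a datetime."""
    return parsedate_to_datetime(value)


# The two example values quoted in the text above.
preferred = datetime(2017, 1, 9, 11, 21, 57, tzinfo=timezone.utc)
headers = accept_datetime_header(preferred)
captured = parse_memento_datetime("Sun, 08 Jan 2017 09:15:41 GMT")
```

A client would send `headers` to a TimeGate (URI-G) and read Memento-Datetime from the response of the selected memento (URI-M).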

Figure 1 shows an example of requesting a TimeMap of www.cnn.com from the Internet Archive. This TimeMap has a list of over 227,000 URI-Ms of the original page www.cnn.com captured between June 20, 2000 and March 07, 2019.

% curl http://web.archive.org/web/timemap/link/http://www.cnn.com
<http://cnn.com:80/>; rel="original",
<http://web.archive.org/web/timemap/link/http://www.cnn.com>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT",
<http://web.archive.org>; rel="timegate",
<http://web.archive.org/web/20000620180259/http://cnn.com:80/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://web.archive.org/web/20000620180259/http://cnn.com:80/>; rel="memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://web.archive.org/web/20000620180259/http://cnn.com:80/>; rel="memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
...
<http://web.archive.org/web/20190307012015/https://www.cnn.com/>; rel="memento"; datetime="Thu, 07 Mar 2019 01:20:15 GMT",
<http://web.archive.org/web/20190307013003/http://www.cnn.com/>; rel="memento"; datetime="Thu, 07 Mar 2019 01:30:03 GMT",
<http://web.archive.org/web/20190307013003/https://www.cnn.com/>; rel="memento"; datetime="Thu, 07 Mar 2019 01:30:03 GMT",
Figure 1: An example of downloading a TimeMap from the Internet Archive using curl. The TimeMap contains over 227,000 URI-Ms of www.cnn.com.
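
A link-format TimeMap like the one in Figure 1 can be reduced to (datetime, URI-M) pairs with a few lines of Python. The sketch below is only an illustration and assumes one link per line, as in the figure; a general link-format parser would also have to handle several links on one line and commas inside URIs. The function name parse_timemap is hypothetical.

```python
import re

# "<uri>; param=..." and key="value" pairs, as they appear in Figure 1.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*(.*)')
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return (datetime, URI-M) pairs for every link whose rel includes 'memento'."""
    mementos = []
    for line in text.splitlines():
        match = LINK_RE.match(line.strip().rstrip(","))
        if not match:
            continue  # skip elisions ("...") and blank lines
        uri, params = match.group(1), dict(PARAM_RE.findall(match.group(2)))
        # rel may hold several space-separated values, e.g. "first memento".
        if "memento" in params.get("rel", "").split():
            mementos.append((params.get("datetime"), uri))
    return mementos


sample = '''<http://cnn.com:80/>; rel="original",
<http://web.archive.org>; rel="timegate",
<http://web.archive.org/web/20000620180259/http://cnn.com:80/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
...
<http://web.archive.org/web/20190307013003/https://www.cnn.com/>; rel="memento"; datetime="Thu, 07 Mar 2019 01:30:03 GMT",'''

pairs = parse_timemap(sample)
```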

A Memento aggregator can be used to retrieve TimeMaps aggregated from multiple web archives. The Memento Aggregator from Los Alamos National Laboratory (LANL) [11] is one implementation of a Memento aggregator that provides TimeMaps across different web archives both with (a) native support of the Memento protocol and (b) by proxy support of the Memento protocol. MemGator [5, 4] is another implementation of a Memento aggregator and an open source project that provides a variety of customization options, such as allowing users to specify a list of web archives to retrieve TimeMaps from, but it only aggregates TimeMaps from archives that natively support the Memento protocol.

On the playback of a memento, archives rewrite or transform the original content so that the memento is rendered appropriately in the user’s browser. The transformation process includes adding HTML tags to the original content to indicate when the memento was created and retrieved, and rewriting all URI-Rs of embedded resources so they point to the archive, not to the live web. Archives also add banners which provide information about both the memento being viewed and the original page.

In addition to the rewritten content, most archives allow accessing unaltered, raw, archived content (i.e., retrieving the archived version of the original content without any type of transformation by the archive). The most common mechanism to retrieve the raw content, which is supported by different Wayback Machine implementations [15, 19], is by adding id_ after the timestamp in the requested URI-M.
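
The id_ convention can be sketched as a simple string rewrite. The helper below is hypothetical and assumes a Wayback-style URI-M whose path contains a 14-digit timestamp segment; URI-Ms that do not follow this pattern (e.g., the short WebCite identifiers) are returned unchanged.

```python
import re

# A 14-digit timestamp (YYYYMMDDhhmmss) that forms a whole path segment.
TIMESTAMP_RE = re.compile(r"/(\d{14})/")


def raw_urim(urim):
    """Insert id_ after the first timestamp segment of a Wayback-style URI-M."""
    return TIMESTAMP_RE.sub(lambda m: "/" + m.group(1) + "id_/", urim, count=1)


rewritten = raw_urim("https://web.archive.org/web/20190307013003/https://www.cnn.com/")
```

Because the pattern requires a slash directly after the 14 digits, a URI-M that already carries id_ is left as it is.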

3 Methodology

We collected URI-Rs from four different sources. The first 500 URI-Rs were from Moz [2], which provides a list of the top 500 domains on the web. The second set consists of 1,535 URI-Rs from a previous study [12] about investigating memento damage. The third set contains 6,657,856 URI-Rs that are publicly available in the HTTP Archive [1]. The final set of URI-Rs (8,774,352) is from the Web Archives for Historical Research group (WAHR) [22].

We included the first two sources even though they are relatively small compared to the other two sources because (1) we wanted our final selected set of URI-Rs to have some top/well-known web pages (i.e., URI-Rs from Moz), and (2) the URI-Rs from the study of memento damage contain a mixture of URI-Rs with different path lengths (e.g., www.example.com/path/to/file.html). The main characteristic of URI-Rs that belong to the first and third source (i.e., Moz and HTTP Archive) is that the URI-R consists of a domain name only (e.g., www.example.com). The URI-Rs from WAHR are extracted from tweets about the hashtags #climatemarch, #MarchForScience, #porteouverte, #paris, #Bataclan, #parisattacks, #WomensMarch, and #YMMfire between December 11, 2015 and May 3, 2017. Table 3 shows the number of collected URI-Rs including the number of URI-Rs by hashtag for the WAHR source. The total number of unique URI-Rs from all four sources is 8,220,606 after removing duplicates.

Name | Access time | URI-Rs | URI-Rs after removing duplicates
MOZ | 2017-06-08 | 500 | 500
Memento damage | 2017-06-08 | 1,535 | 1,535
HTTP Archive | 2017-04-15 | 6,657,856 | 6,657,856
WAHR: #climatemarch | 2017-04-19 2017-05-03 | 175,278 | 41,674
WAHR: #MarchForScience | 2017-04-12 2017-04-26 | 299,124 | 90,318
WAHR: #porteouverte #paris #Bataclan #parisattacks | 2015-12-11 | 5,561,037 | 857,490
WAHR: #WomensMarch | 2017-01-12 2017-01-28 | 2,403,637 | 526,903
WAHR: #YMMfire | 2016-08-20 | 335,276 | 45,327
Total | | 15,434,243 | 8,220,606
Table 3: URI-R source count.

We merged all 8,220,606 unique URI-Rs from the four sources into a single list. The order of how URI-Rs are placed on the list is as follows:

  1. Moz’s URI-Rs were placed on the top of this list followed by URI-Rs from our Memento damage study.

  2. We repeatedly selected 10 URI-Rs from HTTP Archive and 10 URI-Rs from WAHR, choosing 10 from a different hashtag each round.

The order of URI-Rs in the list is important because we decided to work with a smaller number of URI-Rs for our study. Thus, out of 8,220,606 URI-Rs, we only selected the first 10,000 URI-Rs that fulfill the conditions explained in Section 3.1.
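
The ordering described above can be sketched as a generator. This is an illustrative reconstruction, not the script used in the study: the function name, the rotation over hashtags in dictionary order, and the behavior once a source runs out (skip to the next non-empty hashtag) are all assumptions.

```python
from itertools import islice


def merged_order(moz, damage, http_archive, wahr_by_hashtag, batch=10):
    """Yield URI-Rs: Moz, then memento damage, then alternating batches."""
    yield from moz
    yield from damage
    http_it = iter(http_archive)
    tag_its = [iter(uris) for uris in wahr_by_hashtag.values()]
    round_no = 0
    while True:
        http_batch = list(islice(http_it, batch))
        wahr_batch = []
        # Start from this round's hashtag; fall through to the next one
        # if it is exhausted (assumption, not stated in the text).
        for offset in range(len(tag_its)):
            tag_it = tag_its[(round_no + offset) % len(tag_its)]
            wahr_batch = list(islice(tag_it, batch))
            if wahr_batch:
                break
        if not http_batch and not wahr_batch:
            return
        yield from http_batch
        yield from wahr_batch
        round_no += 1


# Toy input with a batch size of 2 instead of 10.
order = list(merged_order(
    ["m1"], ["d1"], ["h1", "h2", "h3"],
    {"#a": ["a1", "a2"], "#b": ["b1"]}, batch=2))
```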

3.1 Method 1: selecting the first 10,000 URI-Rs from the initial set of 8,220,606 URI-Rs

URI-Rs must be canonicalized to determine whether or not a URI-R with a particular domain name and file path length has already been selected. We used the canonicalization function that is part of PyWb [17]. The function indicates that http://www.example.com, http://www.example.com:80, and www.EXAMPLE.com are the same, as shown in Figure 2. The output of this canonicalization function is in Sort-friendly URI Reordering Transform (SURT) format [20].

% python canonicalize.py http://www.example.com
com,example)/
% python canonicalize.py http://www.example.com:80
com,example)/
% python canonicalize.py www.EXAMPLE.com
com,example)/
Figure 2: An example showing three different URI-Rs that map to the same URI-R (in SURT format [20]) using the canonicalization function from [17]
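
To make the idea behind Figure 2 concrete, the sketch below is a deliberately simplified approximation of SURT-style canonicalization. It is not the PyWb function: it only lowercases the host, drops the scheme, a default port and a leading "www.", and reverses the host labels. Query-string and path normalization are omitted, and the name simple_surt is hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def simple_surt(uri):
    """Return a simplified SURT-like key for a URI-R (approximation only)."""
    if "://" not in uri:
        uri = "http://" + uri  # assume http when no scheme is given
    parts = urlsplit(uri)
    host = parts.hostname or ""  # urlsplit already lowercases the host
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split(".")))
    # Keep only non-default ports.
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        key += ":%d" % parts.port
    return key + ")" + (parts.path or "/")


keys = [simple_surt(u) for u in
        ("http://www.example.com", "http://www.example.com:80", "www.EXAMPLE.com")]
```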

In addition to the canonicalization function, we issued HTTP HEAD requests to discover whether two URI-Rs redirect to the same web resource. As Figure 3 shows, sending an HTTP HEAD request to fb.com and facebook.com results in a “301” redirect to https://www.facebook.com/, which is the URI-R we select, rather than the first two URI-Rs.

% curl -sIL fb.com | egrep -i "(HTTP/|^location:)"
HTTP/1.1 301 Moved Permanently
Location: https://www.facebook.com/
HTTP/2 200
% curl -sIL https://facebook.com | egrep -i "(HTTP/|^location:)"
HTTP/2 301
location: https://www.facebook.com/
HTTP/2 200
Figure 3: An example showing two different URI-Rs that redirect to the same URI-R.
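
One way to turn output like that of Figure 3 into a grouping of URI-Rs is to take the last Location header of each redirect chain as the target. The sketch below works on header dumps that have already been collected (for instance with curl -sIL), so it makes no network requests. Both function names are hypothetical, and relative Location values are not resolved.

```python
from collections import defaultdict


def final_location(start_uri, header_dump):
    """Return the last Location value in a header dump, or start_uri if none."""
    final = start_uri
    for line in header_dump.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            final = value.strip()
    return final


def group_by_target(dumps):
    """Map each final URI-R to the list of URI-Rs that redirect to it."""
    groups = defaultdict(list)
    for uri, dump in dumps.items():
        groups[final_location(uri, dump)].append(uri)
    return dict(groups)


# Header lines as shown in Figure 3.
dumps = {
    "fb.com": "HTTP/1.1 301 Moved Permanently\n"
              "Location: https://www.facebook.com/\nHTTP/2 200",
    "https://facebook.com": "HTTP/2 301\n"
                            "location: https://www.facebook.com/\nHTTP/2 200",
}
groups = group_by_target(dumps)
```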

Also, the selected URI-Rs must contain a variety of file path lengths that we group into the following five sets, each of which contains 2,000 URI-Rs with:

  • s0: Path length of zero

    - www.example.com

  • s1: Path length of one

    - www.example.com/file1.html

  • s2: Path length of two

    - www.example.com/1/file2.html

  • s3: Path length of three

    - www.example.com/1/2/file3.html

  • s4+: Path length of four or more

    - www.example.com/1/2/3/file4.html

The final two conditions for selecting the first 10,000 URI-Rs are:

  1. URI-Rs with the same file path length should not have the same domain name. For example, if www.youtube.com/watch?v=cpPG0bKHYKc has already been selected, then www.youtube.com/watch?v=hFhiV5X5QM4 will not be selected. This may help to collect more unique URI-Rs and vary the content we plan to study.

  2. The TimeMaps of selected URI-Rs must contain at least one memento as our further work is to study any change or transformation in the content of mementos over time.
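
The selection rules of this section (five path-length sets with the labels used in Table 4, a quota of 2,000 URI-Rs per set, one URI-R per domain within a set, and at least one memento) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, has_memento stands in for a TimeMap lookup through the aggregator, path length is taken to be the number of non-empty path segments, and canonicalization and redirect resolution are assumed to have been applied beforehand.

```python
from urllib.parse import urlsplit

SETS = ["s0", "s1", "s2", "s3", "s4+"]


def split_uri(uri):
    """Return (domain, path-length set) for a URI-R."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    depth = len([seg for seg in parts.path.split("/") if seg])
    return parts.hostname, SETS[min(depth, 4)]


def select_urirs(candidates, has_memento, per_set=2000):
    """Walk the ordered candidates and keep those that satisfy all conditions."""
    counts = dict.fromkeys(SETS, 0)
    taken = set()  # (domain, set) pairs that are already represented
    selected = []
    for uri in candidates:
        domain, path_set = split_uri(uri)
        if counts[path_set] >= per_set or (domain, path_set) in taken:
            continue
        if not has_memento(uri):  # hypothetical TimeMap check
            continue
        taken.add((domain, path_set))
        counts[path_set] += 1
        selected.append(uri)
        if all(n >= per_set for n in counts.values()):
            break
    return selected


candidates = [
    "www.youtube.com/watch?v=cpPG0bKHYKc",
    "www.youtube.com/watch?v=hFhiV5X5QM4",  # same domain and set: skipped
    "www.youtube.com",
    "www.example.com/a/b",                  # no memento in this toy run
]
chosen = select_urirs(candidates, lambda u: u != "www.example.com/a/b")
```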

To retrieve TimeMaps, we used the LANL Memento Aggregator. Once a TimeMap was downloaded, we reduced the number of mementos in the TimeMap to one memento per year from each archive. TimeMaps returned from LANL's aggregator have more information and metadata than we need for our further study. Therefore, we wrote two Python scripts, available on GitHub (https://github.com/oduwsdl/mementos-fixity). The script timemap.py extracts only URI-Ms and their Memento-Datetime from the returned TimeMaps, while the second script, yearly-filter.py, filters TimeMaps by selecting one memento (the first) per year by archive. Figure 4 shows an example of a TimeMap with 64 mementos of the URI-R http://www.futureofmusic.org/about/positions.cfm, and Figure 5 shows the corresponding TimeMap after filtering. It contains only 10 mementos.

% python timemap.py http://www.futureofmusic.org/about/positions.cfm > full-timemap.txt
% cat full-timemap.txt
20120328211040 http://www.webcitation.org/66VfNacdz
20141021161223 http://archive.is/20141021161223/http://www.futureofmusic.org/about/positions.cfm
20141021175005 http://archive.is/20141021175005/http://www.futureofmusic.org/about/positions.cfm
20141021175817 http://archive.is/20141021175817/http://www.futureofmusic.org/about/positions.cfm
20141106145319 http://archive.is/20141106145319/http://www.futureofmusic.org/about/positions.cfm
20141106151301 http://archive.is/20141106151301/http://www.futureofmusic.org/about/positions.cfm
20070114182707 https://web.archive.org/web/20070114182707/http://www.futureofmusic.org:80/about/positions.cfm
... <18 mementos from 2007-2008> ...
20090122061339 https://web.archive.org/web/20090122061339/http://futureofmusic.org:80/about/positions.cfm
20090228213737 https://web.archive.org/web/20090228213737/http://futureofmusic.org:80/about/positions.cfm
20120607045812 https://web.archive.org/web/20120607045812/http://www.futureofmusic.org/about/positions.cfm
20120607045828 https://web.archive.org/web/20120607045828/http://futureofmusic.org/about/positions.cfm
20130323010922 https://web.archive.org/web/20130323010922/http://www.futureofmusic.org/about/positions.cfm
20130323011136 https://web.archive.org/web/20130323011136/http://futureofmusic.org/about/positions.cfm
20131231022915 https://web.archive.org/web/20131231022915/http://www.futureofmusic.org/about/positions.cfm
20140819212552 https://web.archive.org/web/20140819212552/http://www.futureofmusic.org/about/positions.cfm
20150320143837 https://web.archive.org/web/20150320143837/http://www.futureofmusic.org/about/positions.cfm
20160325184708 https://web.archive.org/web/20160325184708/http://www.futureofmusic.org/about/positions.cfm
20070114182707 https://web.archive.org/web/20070114182707/http://www.futureofmusic.org:80/about/positions.cfm
20070209043456 https://web.archive.org/web/20070209043456/http://www.futureofmusic.org:80/about/positions.cfm
... <26 duplicate mementos from web.archive.org> ...
Figure 4: The TimeMap of www.futureofmusic.org/about/positions.cfm contains 64 mementos from three different archives: web.archive.org, archive.is, and webcitation.org.
% cat full-timemap.txt | python yearly-filter.py > yearly-filter.txt
% cat yearly-filter.txt
20120328211040 http://www.webcitation.org/66VfNacdz
20141021161223 http://archive.is/20141021161223/http://www.futureofmusic.org/about/positions.cfm
20070114182707 https://web.archive.org/web/20070114182707/http://www.futureofmusic.org:80/about/positions.cfm
20080109053549 https://web.archive.org/web/20080109053549/http://www.futureofmusic.org:80/about/positions.cfm#ed
20090122061339 https://web.archive.org/web/20090122061339/http://futureofmusic.org:80/about/positions.cfm
20120607045812 https://web.archive.org/web/20120607045812/http://www.futureofmusic.org/about/positions.cfm
20130323010922 https://web.archive.org/web/20130323010922/http://www.futureofmusic.org/about/positions.cfm
20140819212552 https://web.archive.org/web/20140819212552/http://www.futureofmusic.org/about/positions.cfm
20150320143837 https://web.archive.org/web/20150320143837/http://www.futureofmusic.org/about/positions.cfm
20160325184708 https://web.archive.org/web/20160325184708/http://www.futureofmusic.org/about/positions.cfm
Figure 5: The TimeMap of www.futureofmusic.org/about/positions.cfm after filtering. It contains only 10 mementos (the first memento per year is selected from each archive).
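
The one-memento-per-year filter can be sketched in a few lines. This is not the yearly-filter.py script from the repository; it is a minimal illustration that assumes input lines of the form "timestamp URI-M" (as in Figure 4), identifies the archive by the host of the URI-M, and keeps the first line seen for each (archive, year) pair in input order.

```python
from urllib.parse import urlsplit


def yearly_filter(lines):
    """Keep the first memento per archive per year, preserving input order."""
    kept, seen = [], set()
    for line in lines:
        timestamp, urim = line.split(None, 1)
        key = (urlsplit(urim.strip()).hostname, timestamp[:4])
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# A few lines taken from Figure 4.
timemap = [
    "20120328211040 http://www.webcitation.org/66VfNacdz",
    "20141021161223 http://archive.is/20141021161223/http://www.futureofmusic.org/about/positions.cfm",
    "20141021175005 http://archive.is/20141021175005/http://www.futureofmusic.org/about/positions.cfm",
    "20120607045812 https://web.archive.org/web/20120607045812/http://www.futureofmusic.org/about/positions.cfm",
    "20120607045828 https://web.archive.org/web/20120607045828/http://futureofmusic.org/about/positions.cfm",
]
filtered = yearly_filter(timemap)
```

If the input is not sorted by time within each archive, sorting the lines first would make "first" mean "earliest".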

Table 4 shows the number of selected URI-Rs per source and path length and Table 5 shows that 13% of the selected URI-Rs currently have either the HTTP status code 4xx or 5xx. Even though these URI-Rs are no longer live, they are archived.

Path length
Source s0 s1 s2 s3 s4+ Total
MOZ 286 17 3 2 1 309
HTTP Archive 1,581 42 70 2 0 1,695
Memento Damage 114 63 62 42 60 341
#climatemarch 4 74 98 89 99 364
#MarchForScience 1 162 173 139 243 718
#porteouverte #paris #Bataclan #parisattacks 8 758 716 855 711 3,048
#WomensMarch 3 723 734 749 734 2,943
#YMMfire 3 161 144 122 152 582
Total 2,000 2,000 2,000 2,000 2,000 10,000
Table 4: The initial collected set of URI-Rs per source by path length (results of Method 1).
HTTP status code
Path length 200 4xx/5xx Total
s0 1,870 130 2,000
s1 1,651 349 2,000
s2 1,715 285 2,000
s3 1,720 280 2,000
s4+ 1,731 269 2,000
Total 8,687 1,313 10,000
Table 5: The final URI-R HTTP status codes of the initial collected set of URI-Rs (results of Method 1).

Table 6 (column Method 1) shows the list of 16 web archives from which the mementos of our 10,000 URI-Rs are collected (there is one archive, nationalarchives.gov.uk, that has not been counted yet because it has not contributed any mementos). The total number of URI-Rs in the table exceeds 10,000 because a URI-R often has mementos in multiple archives, resulting in some URI-Rs being counted multiple times, but the total number of unique URI-Rs is still 10,000. The total number of URI-Ms in all TimeMaps is 12,988,039. This number drops to 48,199 URI-Ms after applying the one memento per year filter.

Table 6: The four methods used to collect URI-Rs/URI-Ms. The table indicates (shown in bold) that (1) seven archives satisfy the condition of 200 URI-Rs by Method 1, (2) five additional archives satisfy the condition by Method 2, (3) four other archives satisfy the condition by Method 3, and (4) the last archive satisfies the condition by Method 4.
Method 1 Method 2 Method 3 Method 4
Archive URI-Ms URI-Rs URI-Ms URI-Rs URI-Ms URI-Rs URI-Ms URI-Rs
web.archive.org 32,139 9,790 40,258 10,353 45,155 10,924 45,155 10,924
archive.is 2,322 1,284 3,229 1,526 3,471 1,654 3,471 1,654
archive-it.org 3,500 804 7,986 1,355 8,994 1,593 8,994 1,593
archive.bibalex.org 3,363 568 6,176 941 7,286 1,148 7,286 1,148
webarchive.loc.gov 2,721 418 7,122 934 7,766 1,062 7,766 1,062
arquivo.pt 1,410 324 3,154 758 3,430 895 3,430 895
webcitation.org 1,125 472 1,858 725 1,954 775 1,954 775
europarchive.org 407 106 911 287 992 324 992 324
swap.stanford.edu 609 132 1,176 283 1,233 304 1,233 304
vefsafn.is 19 7 1,520 246 1,715 294 1,715 294
webharvest.gov 84 21 743 227 826 248 826 248
webarchive.org.uk 12 5 27 8 907 228 907 228
digar.ee 333 129 513 223 518 228 518 228
webarchive.proni.gov.uk 138 48 316 141 480 213 480 213
nationalarchives.gov.uk 0 0 0 0 1,011 200 1,011 200
collectionscanada.gc.ca 8 6 59 50 359 200 359 200
perma.cc 9 6 101 71 154 111 290 200
Total 48,199 14,120 75,149 18,128 86,251 20,401 86,387 20,490

From Table 6, we notice that several archives have a small number of URI-Rs and URI-Ms. Since we want to study the playback fidelity of the web archives, we chose 200 as the minimum number of URI-Rs per archive. After applying Method 1, we used the three methods (Sections 3.2, 3.3, and 3.4) to discover additional mementos from web archives that have fewer than 200 URI-Rs.

3.2 Method 2: Discovering additional URI-Rs from the HTML of already collected mementos

For each archive that had not satisfied the 200 URI-R condition, we downloaded the raw content of already collected mementos from the archive and extracted all URI-Rs found in the HTML. Using the LANL Memento Aggregator, we requested the TimeMap of each URI-R that had not already been selected. We applied this method to the following archives:

  1. webharvest.gov

  2. swap.stanford.edu

  3. vefsafn.is

  4. webarchive.org.uk

  5. webarchive.proni.gov.uk

  6. collectionscanada.gc.ca

  7. perma.cc

Three archives are not included in the list above. First, Method 2 cannot be applied to nationalarchives.gov.uk because the archive has not yet provided any mementos. Second, the archives europarchive.org and digar.ee satisfied the condition of 200 URI-Rs after applying Method 2 to swap.stanford.edu and vefsafn.is, respectively.

As shown in Table 7, any newly discovered URI-Rs/URI-Ms with this method caused the information for all archives to be updated, even for archives that already had more than 200 URI-Rs. Figure 6 shows an example of URI-Rs extracted from the HTML of the memento:

https://wayback.vefsafn.is/wayback/20041020191800id_/http://www.w3.org/

The URI-Rs are extracted from the href attribute of the <a> tags (using the Python script extract_urirs.py, available at https://github.com/oduwsdl/mementos-fixity). We downloaded the TimeMap of the URI-R www.inria.fr/, which had not previously been selected. As Figure 7 shows, the TimeMap contains mementos not only from vefsafn.is but from eight archives in total: web.archive.org, archive.bibalex.org, webcitation.org, webarchive.loc.gov, archive-it.org, archive.is, vefsafn.is, and digar.ee.

Method 2 | webharvest.gov | | swap.stanford.edu | | vefsafn.is | | webarchive.org.uk, proni.gov.uk, collectionscanada.gc.ca, perma.cc |
Archive | URI-Ms | URI-Rs | URI-Ms | URI-Rs | URI-Ms | URI-Rs | URI-Ms | URI-Rs
web.archive.org | 34,819 | 9,968 | 36,385 | 10,075 | 39,092 | 10,265 | 40,258 | 10,353
archive.is | 2,330 | 1,289 | 2,562 | 1,359 | 3,093 | 1,479 | 3,229 | 1,526
archive-it.org | 5,108 | 979 | 5,999 | 1,095 | 7,373 | 1,271 | 7,986 | 1,355
archive.bibalex.org | 4,169 | 675 | 4,764 | 762 | 5,827 | 891 | 6,176 | 941
webarchive.loc.gov | 4,432 | 590 | 5,297 | 698 | 6,598 | 860 | 7,122 | 934
arquivo.pt | 1,762 | 456 | 2,158 | 553 | 2,935 | 689 | 3,154 | 758
webcitation.org | 1,290 | 532 | 1,415 | 593 | 1,737 | 689 | 1,858 | 725
europarchive.org | 476 | 137 | 603 | 202 | 842 | 265 | 911 | 287
swap.stanford.edu | 651 | 145 | 797 | 200 | 1,072 | 261 | 1,176 | 283
vefsafn.is | 19 | 7 | 19 | 7 | 1,270 | 200 | 1,520 | 246
webharvest.gov | 647 | 200 | 698 | 212 | 711 | 214 | 743 | 227
digar.ee | 339 | 134 | 362 | 153 | 491 | 212 | 513 | 223
webarchive.proni.gov.uk | 144 | 52 | 156 | 58 | 224 | 84 | 316 | 141
perma.cc | 58 | 38 | 74 | 50 | 76 | 52 | 101 | 71
collectionscanada.gc.ca | 26 | 22 | 40 | 35 | 42 | 37 | 59 | 50
webarchive.org.uk | 12 | 5 | 12 | 5 | 12 | 5 | 27 | 8
nationalarchives.gov.uk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Total | 56,282 | 15,229 | 61,341 | 16,057 | 71,395 | 17,474 | 75,149 | 18,128

Table 7: After applying Method 2 to seven archives, five archives satisfy the condition of 200 URI-Rs (shown in bold). Notice that applying this method to one archive may increase the number of URI-Rs in other archives (e.g., applying method 2 for vefsafn.is makes both vefsafn.is and digar.ee satisfy the 200 URI-R condition).
% python extract_urirs.py http://wayback.vefsafn.is/wayback/20041020191800id_/http://www.w3.org/
http://www.csail.mit.edu/
http://www.google.com/
http://www.ilog.com/
http://www.inria.fr/
http://jigsaw.w3.org/css-validator/
http://www.w3.org/People/Raggett/tidy/
http://validator.w3.org/
http://www.w3.org/2004/MWeb/Overview.html
http://purl.org/rss/1.0/
...
Figure 6: An example of extracting URI-Rs from the HTML of the memento wayback.vefsafn.is/wayback/20041020191800id_/http://www.w3.org/ (only 9 URI-Rs, out of 138, are shown). Notice that we used the option id_ in the URI-M to retrieve the archived unaltered, or raw, content of the memento.
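
Extracting URI-Rs from the href attribute of <a> tags can be sketched with Python's standard HTML parser. The code below is not the extract_urirs.py script from the repository; it is a minimal illustration that assumes the raw (id_) HTML has already been downloaded, keeps only absolute http(s) links, and drops duplicates.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect absolute href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.urirs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if (name == "href" and value
                    and value.startswith(("http://", "https://"))
                    and value not in self.urirs):
                self.urirs.append(value)


def extract_urirs(html):
    """Return the unique absolute URI-Rs linked from the given HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.urirs


html = ('<a href="http://www.inria.fr/">INRIA</a> <a href="#top">top</a> '
        '<A HREF="http://www.inria.fr/">again</A> '
        '<link href="http://www.w3.org/style.css">')
found = extract_urirs(html)
```

Relative links are skipped here; resolving them against the URI-R of the memento would be a natural extension.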
% python timemap.py http://www.inria.fr/
19961230035541 https://web.archive.org/web/19961230035541/http://www4.inria.fr:80/
...
19961230035541 http://web.archive.bibalex.org:80/web/19961230035541/http://www4.inria.fr/
...
20140729051225 http://www.webcitation.org/6RQAbDGPm
20020808175122 http://webarchive.loc.gov/all/20020808175122/http://www.inria.fr/
20100731132417 http://wayback.archive-it.org/all/20100731132417/http://www.inria.fr/
...
19961230035541 http://archive.is/19961230035541/http://www.inria.fr/
...
19961013190926 https://arquivo.pt/wayback/19961013190926/http://www.inria.fr/
...
20110325131647 http://veebiarhiiv.digar.ee/a/20110325131647/http://www.inria.fr/
...
Figure 7: Downloading the TimeMap of the URI-R http://www.inria.fr/.

With Method 2, we now have 200 URI-Rs for the following additional archives:

  • webharvest.gov

  • swap.stanford.edu

  • vefsafn.is

  • digar.ee

  • europarchive.org

Table 6 (column Method 2) shows the new archives that satisfy the condition of 200 URI-Rs and the four archives which still did not satisfy the condition.

3.3 Method 3: URI-Rs discovered in archives’ published lists

Some archives make lists of URI-Rs they collect available on the web. Archives may also publish lists of URI-Ms associated with each URI-R. We found these published collections for three archives (Table 8) that had not met the 200 URI-R minimum.

Archive | List | URI-Rs | URI-Ms
webarchive.org.uk | data.webarchive.org.uk/opendata/ukwa.ds.1/ | 26,910 | -
collectionscanada.gc.ca | collectionscanada.gc.ca/webarchives/url-list/index-e.html | 2,613 | 27,232
nationalarchives.gov.uk | nationalarchives.gov.uk/webarchive/atoz/ | 4,956 | 168,328
Table 8: Archives’ published lists of URI-Rs and URI-Ms.

We downloaded the published list of URI-Rs only (URI-Ms were not included in this list) from the archive webarchive.org.uk. Then, using the LANL Memento Aggregator, we retrieved the TimeMaps of the first 192 URI-Rs that have at least one memento in the UK Web Archive. Table 6 (column Method 3) shows that this method helps two archives reach 200 URI-Rs (i.e., webarchive.proni.gov.uk and webarchive.org.uk), but at the same time, a new web archive appears in the TimeMaps, nationalarchives.gov.uk, raising the total number of archives to 17.

Next, we downloaded the lists of URI-Rs and URI-Ms made available by the two web archives collectionscanada.gc.ca and nationalarchives.gov.uk. We extracted only as many URI-Rs as were required to reach 200 URI-Rs per archive. With these lists we did not need a Memento aggregator, since the archives already provide the mementos, but for the sake of consistency we still used the LANL Memento Aggregator to download TimeMaps so that we could update the information for the other archives. Table 6 (column Method 3) shows that for perma.cc we needed to discover only 89 additional URI-Rs to reach 200 URI-Rs. Table 9 shows how the numbers of URI-Rs and URI-Ms increased after applying Method 3 to each of the three archives.

Method 2 webarchive.org.uk collectionscanada.gc.ca nationalarchives.gov.uk
Archive URI-Ms URI-Rs URI-Ms URI-Rs URI-Ms URI-Rs
web.archive.org 41,988 10,572 43,615 10,739 45,155 10,924
archive.is 3,370 1,590 3,426 1,619 3,471 1,654
archive-it.org 8,339 1,431 8,910 1,561 8,994 1,593
archive.bibalex.org 6,630 1,009 7,043 1,091 7,286 1,148
webarchive.loc.gov 7,344 989 7,721 1,039 7,766 1,062
arquivo.pt 3,298 819 3,379 855 3,430 895
webcitation.org 1,892 746 1,944 765 1,954 775
europarchive.org 963 311 986 320 992 324
swap.stanford.edu 1,198 292 1,232 303 1,233 304
vefsafn.is 1,626 270 1,703 289 1,715 294
webharvest.gov 743 227 826 248 826 248
webarchive.org.uk 812 201 812 201 907 228
digar.ee 515 225 518 228 518 228
webarchive.proni.gov.uk 460 201 462 203 480 213
nationalarchives.gov.uk 3 1 3 1 1,011 200
collectionscanada.gc.ca 59 50 359 200 359 200
perma.cc 104 73 154 111 154 111
Total 79,344 19,007 83,093 19,773 86,251 20,401

Table 9: Applying Method 3 (using archives’ published lists) for three archives.
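The step of taking only as many URI-Rs from a published list as an archive needs to reach 200 can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: the function name `top_up_uri_rs`, its parameters, and the example URIs are hypothetical, and only the target of 200 URI-Rs per archive comes from the text.

```python
def top_up_uri_rs(already_selected, published_list, target=200):
    """Return the URI-Rs to add from a published list so an archive reaches `target`.

    Hypothetical helper: walks the published list in order, skips URI-Rs that
    are already selected, and stops once the archive holds `target` URI-Rs.
    """
    have = set(already_selected)
    added = []
    for uri_r in published_list:
        if len(have) >= target:
            break
        if uri_r not in have:
            have.add(uri_r)
            added.append(uri_r)
    return added


# Made-up URI-Rs: an archive that already has 50 URI-Rs needs 150 more.
existing = [f"http://example.org/page{i}" for i in range(50)]
published = [f"http://example.org/page{i}" for i in range(25, 1000)]
print(len(top_up_uri_rs(existing, published)))  # 150
```

The example mirrors the collectionscanada.gc.ca row of Table 9, where the number of URI-Rs rises from 50 to 200.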

3.4 Method 4: Sending TimeMap requests directly to an archive

The LANL Memento Aggregator may serve cached TimeMaps [11], which may result in TimeMaps that do not contain recently created mementos. For this reason, we requested TimeMaps for the already selected URI-Rs directly from perma.cc. Figure 8 shows an example of requesting the TimeMap of the URI-R www.whitehouse.gov from perma.cc (the archive also uses other domain names, such as perma-archives.org). This TimeMap contains 57 mementos. With this method, we were able to obtain the additional 89 URI-Rs for perma.cc shown in Table 6 (column Method 4).

curl https://perma-archives.org/warc/timemap/*/http://www.whitehouse.gov/
<https://perma-archives.org/warc/timemap/*/http://www.whitehouse.gov/>; rel="self"; type="application/link-format"; from="Thu, 27 Aug 2015 17:14:18 GMT",
<http://www.whitehouse.gov/>; rel="original",
<https://perma-archives.org/warc/timegate/http://www.whitehouse.gov/>; rel="timegate",
<https://perma-archives.org/warc/20150827171418/http://www.whitehouse.gov/>; rel="memento"; datetime="Thu, 27 Aug 2015 17:14:18 GMT",
<https://perma-archives.org/warc/20150827171418/https://www.whitehouse.gov/>; rel="memento"; datetime="Thu, 27 Aug 2015 17:14:18 GMT",
<https://perma-archives.org/warc/20150831171426/https://www.whitehouse.gov/>; rel="memento"; datetime="Mon, 31 Aug 2015 17:14:26 GMT",
...
<https://perma-archives.org/warc/20180302185657/https://whitehouse.gov/>; rel="memento"; datetime="Fri, 02 Mar 2018 18:56:57 GMT",
<https://perma-archives.org/warc/20180302185657/https://www.whitehouse.gov/>; rel="memento"; datetime="Fri, 02 Mar 2018 18:56:57 GMT",
<https://perma-archives.org/warc/20180828214528/https://www.whitehouse.gov/>; rel="memento"; datetime="Tue, 28 Aug 2018 21:45:28 GMT"
Figure 8: An example of requesting a TimeMap from the Perma archive. The TimeMap of www.whitehouse.gov contains 57 mementos (only 6 are shown).
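A link-format TimeMap such as the one in Figure 8 can be turned into a list of (URI-M, datetime) pairs with a few lines of Python. The sketch below is illustrative only and is not the code used in the study. The TimeMap URI prefix is copied from Figure 8, the helper names `timemap_uri` and `parse_mementos` are hypothetical, and the regular expressions assume that every attribute value is double-quoted, as in the figure. No network request is made; the sample text is an excerpt of Figure 8.

```python
import re
from email.utils import parsedate_to_datetime

# Prefix taken from Figure 8; other archives expose different TimeMap endpoints.
PERMA_TIMEMAP_PREFIX = "https://perma-archives.org/warc/timemap/*/"

# One link-value: <URI> followed by zero or more ; name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def timemap_uri(uri_r):
    """Build the URI of the TimeMap to request directly from the archive."""
    return PERMA_TIMEMAP_PREFIX + uri_r


def parse_mementos(timemap_text):
    """Return (URI-M, datetime) pairs for links whose rel includes 'memento'."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(timemap_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return mementos


SAMPLE = """<http://www.whitehouse.gov/>; rel="original",
<https://perma-archives.org/warc/timegate/http://www.whitehouse.gov/>; rel="timegate",
<https://perma-archives.org/warc/20150827171418/http://www.whitehouse.gov/>; rel="memento"; datetime="Thu, 27 Aug 2015 17:14:18 GMT",
<https://perma-archives.org/warc/20150831171426/https://www.whitehouse.gov/>; rel="memento"; datetime="Mon, 31 Aug 2015 17:14:26 GMT"
"""

for uri_m, when in parse_mementos(SAMPLE):
    print(when.isoformat(), uri_m)
```

The links with rel="original" and rel="timegate" are skipped, so only the two mementos in the sample are returned.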

3.5 Filtering by download time, the maximum number of mementos, and HTTP status codes

At this point, the selected set contained 86,387 URI-Ms and 20,490 URI-Rs summed over the 17 web archives, of which 11,222 URI-Rs were unique. For our target study, we downloaded the rewritten and raw mementos 10 times. We ran 17 parallel processes, each downloading mementos from a specific archive, and found that download time varies between web archives. For example, it took about 40 hours to download 733 mementos from webharvest.gov but only 12 hours to download 1,011 mementos from nationalarchives.gov.uk. Thus, we limited the number of mementos per archive to what could be downloaded within 40 hours, with a maximum of 1,600 mementos per archive. This produced 18,472 mementos.

Unfortunately, when selecting mementos we did not check the HTTP status to make sure that the responses were “200 OK” or archival 4xx/5xx responses (i.e., responses that include the HTTP response header Memento-Datetime, for the archives that support the Memento protocol). After selecting the 18,472 mementos, we found that 1,975 of them (about 10%) had a non-archival 4xx or 5xx HTTP status code (1,498 of these are from archive.bibalex.org), as the example in Figure 9 shows. We removed most of these mementos and kept only 130 (out of 1,975) because we wanted to keep track of them. We could not replace the removed mementos because by this time we had already used the selected dataset in our study, and it was not possible to recover any excluded mementos. This resulted in 16,627 mementos remaining.
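The final count follows directly from the figures reported in this section, as the short Python check below shows. The variable names are ours; the numbers are those given in the text.

```python
selected = 18_472         # mementos after the 40-hour and 1,600-per-archive limits
non_archival = 1_975      # mementos with non-archival 4xx/5xx responses
kept_for_tracking = 130   # non-archival mementos deliberately kept

removed = non_archival - kept_for_tracking
final = selected - removed
print(removed, final)  # 1845 16627
```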

% curl -I http://web.archive.bibalex.org/web/20051026134855/http:/www.anchorage.gc.ca/
HTTP/1.1 503 Service Unavailable
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=817B5F389A4566092459D914091A0961; Path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Date: Sun, 31 Mar 2019 13:25:47 GMT
Connection: close
Figure 9: Non-archival HTTP 503 example. The HTTP response header Memento-Datetime is not included in the returned response.
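The distinction between archival and non-archival error responses can be expressed as a small check on the status code and the response headers. The Python sketch below is an illustration under stated assumptions, not the procedure used in the study: the function name `classify_response` and its labels are hypothetical, and the check is only meaningful for archives that support the Memento protocol, since other archives do not return Memento-Datetime at all.

```python
def classify_response(status, headers):
    """Label a memento response using its status code and header names.

    Assumption: an error response that carries Memento-Datetime is an archived
    4xx/5xx, while one without it was generated by the archive itself.
    """
    has_memento_datetime = any(
        name.lower() == "memento-datetime" for name in headers
    )
    if 400 <= status <= 599:
        return "archival error" if has_memento_datetime else "non-archival error"
    if status == 200:
        return "ok"
    return "other"


# Headers of the response shown in Figure 9 (cookie and date omitted).
figure9_headers = {
    "Server": "Apache-Coyote/1.1",
    "Content-Type": "text/html;charset=utf-8",
    "Transfer-Encoding": "chunked",
    "Connection": "close",
}
print(classify_response(503, figure9_headers))  # non-archival error
```

Header names are compared case-insensitively because HTTP header field names are not case-sensitive.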

3.6 Final set

Table 10 shows the final numbers of selected URI-Rs and URI-Ms per archive (available on GitHub at https://github.com/oduwsdl/mementos-fixity). The table shows that three archives have fewer than 200 URI-Rs for the following reasons:

  • perma.cc: It took about 40 hours to download 182 mementos from perma.cc, including the raw mementos.

  • archive.bibalex.org: We removed 1,498 mementos because they returned the “503 Service Unavailable” HTTP response code.

  • collectionscanada.gc.ca: We removed mementos of two URI-Rs that returned the “503 Service Unavailable” HTTP response code.

Figure 10 shows the distribution of URI-Ms between 1996 and 2017. The main reason for having fewer mementos in the years 1996-2005 is that most web archives did not exist during those early years [13, 9]. Figure 11 shows the number of URI-Rs per path length. The number of distinct URI-Rs is 3,698; of those, 1,996 (54%) have a path length of zero and the remaining 1,702 (46%) have a path length greater than or equal to one.
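One way to compute such a path length is sketched below in Python. The definition used here, the number of non-empty segments in the path component of the URI-R, is our assumption, since the text does not define the measure precisely; the function name `path_length` and the second example URI are hypothetical.

```python
from urllib.parse import urlsplit


def path_length(uri_r):
    """Count the non-empty path segments of a URI-R (assumed definition)."""
    return sum(1 for segment in urlsplit(uri_r).path.split("/") if segment)


print(path_length("http://www.inria.fr/"))         # 0
print(path_length("http://example.com/a/b.html"))  # 2
```

Under this definition, a URI-R that points to the root of a site, with or without a trailing slash, has a path length of zero.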

Archive URI-Rs URI-Ms
web.archive.org 1,566 1,566
archive-it.org 1,338 1,383
archive.is 1,257 1,396
webarchive.loc.gov 1,059 1,594
arquivo.pt 766 1,569
webcitation.org 720 1,585
europarchive.org 321 979
swap.stanford.edu 302 1,222
vefsafn.is 290 1,589
webharvest.gov 247 712
digar.ee 225 488
webarchive.org.uk 221 349
webarchive.proni.gov.uk 209 469
nationalarchives.gov.uk 200 994
collectionscanada.gc.ca 198 351
perma.cc 175 182
archive.bibalex.org 168 199
Total 9,262 16,627
Table 10: Final numbers in the selected set of URI-Rs and URI-Ms.

Figure 10: URI-Ms per year. Note that we collected mementos on November 15, 2017. For this reason, the number of mementos from 2017 is lower than the number of mementos in the other years, 2010-2016 (i.e., there are no mementos with a Memento-Datetime value after November 15, 2017).
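A per-year count like the one plotted in Figure 10 can be approximated from the URI-Ms themselves when they embed a 14-digit timestamp. The Python sketch below is a hypothetical illustration: it assumes the timestamp appears as its own path segment, as in the URI-Ms of Figure 8, which does not hold for every archive, and it ignores URI-Ms without such a segment. The Memento-Datetime header remains the authoritative source.

```python
import re
from collections import Counter

# Assumption: the URI-M contains a 14-digit YYYYMMDDhhmmss path segment.
TIMESTAMP_RE = re.compile(r"/(\d{14})/")


def mementos_per_year(uri_ms):
    """Count URI-Ms per year using the timestamp embedded in each URI-M."""
    years = Counter()
    for uri_m in uri_ms:
        match = TIMESTAMP_RE.search(uri_m)
        if match:
            years[int(match.group(1)[:4])] += 1
    return years


uri_ms = [
    "https://perma-archives.org/warc/20150827171418/http://www.whitehouse.gov/",
    "https://perma-archives.org/warc/20150831171426/https://www.whitehouse.gov/",
    "https://perma-archives.org/warc/20180302185657/https://whitehouse.gov/",
]
print(sorted(mementos_per_year(uri_ms).items()))  # [(2015, 2), (2018, 1)]
```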

Figure 11: URI-Rs per path length (54% of URI-Rs are with zero path length).

4 Conclusions

In this paper, we describe the four methods we used to collect 16,627 mementos from 17 public web archives. We use the LANL Memento Aggregator to look up mementos by submitting the URI-Rs of original web pages (Method 1). For archives that have fewer than 200 URI-Rs, we collect additional mementos by extracting URI-Rs from the HTML of already discovered mementos (Method 2). As our third method, we use published lists of original web pages and their associated mementos made available by several web archives (Method 3). Finally, we request TimeMaps directly from the archive perma.cc (Method 4). Even though the process of discovering mementos resulted in a total of 86,387 mementos (after applying the one memento per year filter), we downsampled this number to 16,627 due to our constraints of limiting the selection to 1,600 URI-Ms per archive, being able to download all the mementos from each archive in less than 40 hours, and the condition that the number of URI-Rs per archive should be greater than or equal to 200.

5 Acknowledgements

This work is supported in part by The Andrew W. Mellon Foundation (AMF) grant 11600663.

References

  • [1] The HTTP Archive Tracks How the Web is Built. https://httparchive.org/downloads.php (4 2017), accessed on 2017 April 15
  • [2] The Moz Top Pages. https://moz.com/top500 (6 2017), accessed on 2017 June 8
  • [3] WARC file format (ISO 28500:2017) (2017)
  • [4] Alam, S.: A Memento Aggregator CLI and Server in Go. https://github.com/oduwsdl/MemGator (2016)
  • [5] Alam, S., Nelson, M.L.: MemGator-A portable concurrent memento aggregator: Cross-platform CLI and server binaries in Go. In: Proceedings of the 16th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 243–244 (2016)
  • [6] Aturban, M., Alam, S., Nelson, M.L., Weigle, M.C.: Archive Assisted Archival Fixity Verification Framework. In: Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2019)
  • [7] Aturban, M., Kelly, M., Alam, S., Berlin, J.A., Nelson, M.L., Weigle, M.C.: ArchiveNow: Simplified, Extensible, Multi-Archive Preservation. In: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 321–322 (2018)
  • [8] Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping Archived Web Pages. Tech. Rep. arXiv:1712.03140 (December 2017)
  • [9] Bailey, J., Grotke, A., McCain, E., Moffatt, C., Taylor, N.: Web Archiving in the United States: A 2016 Survey. http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf (February 2017)
  • [10] Bailey, S., Thompson, D.: Building the UK’s First Public Web Archive. D-Lib Magazine 12(1), 1082–9873 (2006)
  • [11] Bornand, N.J., Balakireva, L., Van de Sompel, H.: Routing Memento requests using binary classifiers. In: Proceedings of the 16th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 63–72 (2016)

  • [12] Brunelle, J.F., Kelly, M., SalahEldeen, H., Weigle, M.C., Nelson, M.L.: Not all Mementos are created equal: Measuring the impact of missing resources. International Journal on Digital Libraries 16(3-4), 283–301 (2015)
  • [13] Costa, M., Gomes, D., Silva, M.J.: The evolution of web archiving. International Journal on Digital Libraries 18(3), 191–205 (2017)
  • [14] International Internet Preservation Consortium (IIPC): OpenWayback. https://github.com/iipc/openwayback/wiki (October 2005)
  • [15] International Internet Preservation Consortium (IIPC): OpenWayback. https://iipc.github.io/openwayback/2.1.0.RC.1/administrator_manual.html (2015)
  • [16] Kahle, B.: Wayback Rising! now 731,667,951,000 web objects (counting images and pages) active on https://web.archive.org . 731 billion! Thank you for all the support, it makes a difference. go @internetarchive. https://twitter.com/brewster_kahle/status/1118172506777509890 (April 2019)
  • [17] Kreymer, I.: PyWb - Web Archiving Tools for All. https://github.com/ikreymer/pywb (December 2013)
  • [18] Kreymer, I.: Webrecorder - a web archiving platform and service for all (2015), https://webrecorder.io
  • [19] Kreymer, I.: Rewriter. https://github.com/webrecorder/pywb/blob/master/docs/manual/rewriter.rst (2018)
  • [20] Kumar, R.: Sort-friendly URI Reordering Transform (SURT) python module. https://github.com/internetarchive/surt (2017)
  • [21] Mohr, G., Stack, M., Ranitovic, I., Avery, D., Kimpton, M.: An Introduction to Heritrix An open source archival quality web crawler. In: Proceedings of the 4th International Web Archiving Workshop (IWAW) (2004)
  • [22] Ruest, N., Milligan, I., Deschamps, R., Lin, J., Library and Archives Canada: Web Archives for Historical Research Group Dataverse (WAHR). https://dataverse.scholarsportal.info/dataverse/wahr (5 2017), accessed on 2017 May 3
  • [23] Sigurdsson, K.: Incremental crawling with Heritrix. In: Proceedings of the 5th International Web Archiving Workshop (IWAW) (2005)
  • [24] Tofel, B.: Wayback for accessing web archives. In: Proceedings of the 7th International Web Archiving Workshop (IWAW). pp. 27–37 (2007)
  • [25] Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states – Memento, Internet RFC 7089. http://tools.ietf.org/html/rfc7089 (2013)
  • [26] Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., Ainsworth, S., Shankar, H.: Memento: Time Travel for the Web. Tech. Rep. arXiv:0911.1112 (2009)