Making Recommendations from Web Archives for "Lost" Web Pages

08/07/2019 ∙ by Lulwah M. Alkwai, et al. ∙ Old Dominion University

When a user requests a web page from a web archive, the user will typically either get an HTTP 200 response if the page is available or an HTTP 404 response if the web page has not been archived. This is because web archives are typically accessed by URI lookup, and the response is binary: the archive either has the page or it does not, and the user will not learn of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed. First, we check if the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommend to the user. Next, we filter the candidates based on whether they are present in the archive. Finally, we rank the candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F1 score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the TLD produced the best result, with F1=0.59. For second-level classification, the micro-average F1=0.30. We found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary and that 50.07% contained long strings in the domain. In comparison, only 5.39% of the URIs from our Wayback access logs contained only words from a dictionary. These percentages are low and may limit our ability to correctly classify requested URIs.


1. Introduction

Web archives are a window to view past versions of web pages. The oldest and largest web archive, the Internet Archive’s Wayback Machine, contains over 700 billion web objects (Kahle, 2019). But even with this massive collection, sometimes a user requests a web page that the Wayback Machine does not have. Currently, in this case, the user is presented with a message saying that the Wayback Machine does not have the page archived and a link to search for other archived pages in that same domain (Figure 1(a)).

(a) Response to the request http://tripadvisor.com/where_to_travel at the Internet Archive

(b) Proposed recommendations for the requested URI http://tripadvisor.com/where_to_travel displayed with MementoEmbed (Jones, 2018) social cards
Figure 1. The actual response to the requested URI http://tripadvisor.com/where_to_travel (1(a)) and its proposed replacement (1(b))

Our goal is to enhance the response from a web archive with recommendations of other archived web pages that may be relevant to the request. For example, Figure 1(b) shows a potential set of recommended archived web pages for the request in Figure 1(a).

One approach to finding related web pages is to examine the content of the requested web page and then select candidates with similar content. However, in this work, we assume that the requested web page is neither available in web archives nor on the live web and thus is considered to be a “lost” web page. This assumption reflects previous work showing that users often search web archives when they cannot find the desired web page on the live web (AlNoamany et al., 2014) and that a significant number of web pages are not archived (Ainsworth et al., 2011; Alkwai et al., 2017). Learning about a requested web page without examining its content can be challenging because little context and content are available. There are several advantages to using the Uniform Resource Identifier (URI) over using the content of the web page. First, in some cases the content of the page identified by the URI is not available on the live web or in the archive. Second, the URI may contain hints about the resource it identifies. Third, it is more efficient in both time and space to use the text of the URI alone rather than to extract the content of the web page. Fourth, some web pages have little or no textual content, such as images or videos, so extracting the content would not be useful or even possible. Fifth, some web pages have privacy settings that do not permit them to be archived.

In this work we recommend similar URIs for a request by following five steps. First, we determine if the requested URI is one of the 4 million categorized URIs in DMOZ (the original DMOZ, http://dmoz.org, is out of service, but we have archived versions locally) or in Wikipedia, via the Wikipedia API. If the URI is found, we collect candidates in the same category from DMOZ or Wikipedia and move to Step 4. Second, if the URI is not found, we classify the requested URI at the first level of categorization. Third, we classify the requested URI to determine the deeper categorization levels and collect candidates. Fourth, we filter the candidates by removing those that are not archived. Finally, we filter and rank the candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity.

2. Related Work

There has been previous work on searching an archive without indexing it. Kanhabua et al. (Kanhabua et al., 2016) proposed a search system to support retrieval and analytics on the Internet Archive. They used Bing to search the live web and then extracted the URLs from the results and used those as queries to the web archive. They measured the coverage of the archived content retrieved by the current search engine and found that on page one of Bing results, 94% are available in the Internet Archive. Note that this technique will not find URLs that have been missing (HTTP status 404) long enough for Bing to have removed them from its index.

Klein et al. (Klein and Nelson, 2014) addressed a similar but slightly different problem by using web archives to recommend replacement pages on the live web. They investigated four techniques for using the archived page to generate queries for live web search engines: (1) lexical signatures, (2) web page titles, (3) tags, and (4) link neighborhood lexical signatures. Using these four methods helped to find a replacement for missing web pages. Various datasets were used, including DMOZ. By comparing the different methods, they found that 70% of the web pages were recovered using the title method. The result increased to 77% by combining the other three methods. In their work, the user will get a single alternative when a page is not found on the live Web.

Huurdeman et al. (Huurdeman et al., 2014, 2015) detailed their approach to recovering pages from the unarchived web based on the existence of links and anchors in crawled pages. The data used was from a 2012 crawl by the National Library of the Netherlands (KB, https://kb.nl/en). Both external links (inter-server links), which are links between different servers, and site-internal links (intra-server links), which occur within a server, were included in the dataset. Their findings included that the archived pages show evidence of a large number of unarchived pages and web sites. Finally, they found that even a few words describing a missing web page are enough to find it within the first ranks of results.

Classification is the process of comparing representations of documents with representations of labeled categories and computing similarity to find to which category the documents belong. Baykan et al. (Baykan et al., 2009, 2011) investigated using the URI to classify the web page and identify its topic. They found that there is a relationship between classification and the length of the URI: the longer the URI, the better the result. They used different machine learning algorithms, and the highest scores were achieved by the maximum entropy algorithm. They trained the classifiers on the DMOZ dataset using the all-grams method and tested the performance on Yahoo!, Wikipedia, Delicious, and Google data, with the classifier performing best on the Google data. We use Baykan et al.’s tokenization methods in Section 4.2.

Xue et al. (Xue et al., 2008) used text classification on a hierarchical structure. They proposed a deep classification method in which, given a document, the categories are divided into two kinds according to their similarity to the document: related categories and unrelated categories. Their method has two stages, a search stage and a classification stage, where the output of the first stage becomes the input of the second. The search stage produces a small subset of candidate categories in a hierarchical structure. Two strategies were proposed for the search stage, document-based and category-based: they either compared the requested document to each document in the dataset or compared it to all documents in a category, then used term frequency (TF) and cosine similarity to find the top 10 documents. For the second stage, the resulting 10 category candidates are structured as a tree, and the tree is pruned by removing any category that contains no candidate. Three strategies were proposed to accomplish this step: flat structure, pruned top-down, and ancestor-assistant. They used a Naïve Bayes classifier because of the large sample size and the speed desired, and 3-grams because of the close similarity between categories. As a dataset they used 1.3 million URIs from DMOZ, ignoring the Regional and World categories. For evaluation, they used the Micro-average F1 (Mi-F1) score metric, which evaluates the performance at each level. They found that deep classification performed the highest of the three approaches by Mi-F1 score, a 77% improvement over the top-down approach. This work is the basis for the deep-level classification we perform (Section 4.3).

Rajalakshmi et al. (Rajalakshmi and Aravindan, 2013) proposed an approach where N-gram based features are extracted from URIs alone, and the URI is classified using Support Vector Machines and Maximum Entropy classifiers. They used 3-gram features from the URI on two datasets: 2 million URIs from DMOZ and a WebKB dataset with 4K URIs. Using this method on the WebKB dataset increased the F1 score by 20.5% compared to related work (Kan, 2004; Kan and Thi, 2005; Devi et al., 2007), and using it on DMOZ increased the F1 score by 4.7% compared to related work (Rajalakshmi and Aravindan, 2011; Kan and Thi, 2005; Baykan et al., 2009).

One of the features we use to rank candidate URIs is archival quality. Archival quality refers to measuring memento damage by evaluating the impact of missing resources in a web page. The missing resources could be text, images, video, audio, style sheets, or any other type of resource on the web page. Brunelle et al. (Brunelle et al., 2015) proposed a damage rating algorithm to measure the relative value of embedded resources and evaluate archival success. The algorithm is based on the MIME type, size, and location of the embedded resources. In the Internet Archive, the average memento damage was reduced from 0.16 in 1998 to 0.13 in 2013.

3. Datasets

In this work we use three datasets: DMOZ, Wikipedia, and a set of requests to the Wayback Machine. We use the DMOZ and Wikipedia datasets as ontologies to help classify the requested URI and generate candidate recommendations. For evaluation, we use the Wayback Machine access logs as a sample of actual requests to a popular web archive. We chose DMOZ because its web pages are likely to be found in the archive (Ainsworth et al., 2011; AlSum, 2014). Wikipedia was chosen because it also includes new or recently created web pages. In this section we describe each of the datasets.

3.1. DMOZ

DMOZ, or the Open Directory Project (ODP), was the largest human-edited directory of the Web. DMOZ is a hierarchical classification in which each category may have sub-categories. Each entry in the dataset contains the following fields: category, URI, title, and description. For example, an entry could be: Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia, http://cs.odu.edu/, Old Dominion University, and Norfolk Virginia, as shown in Figure 2.

Figure 2. ODU main page found in DMOZ

DMOZ was shut down on March 14, 2017. We have archived 118 DMOZ files of type RDF, from 2001 to 2017. Since we focus on English-language web pages, we first filtered out the World category. Then, we collected all entries that contain at least the URI and the category fields. Next, starting from the latest archived dataset, we collected the entries that include a unique URI. After that, we converted all the URIs to Sort-friendly URI Reordering Transform (SURT, https://pypi.org/project/surt/) format. Table 1 shows the number of collected entries and sub-categories for each category. To be consistent with similar work (Rajalakshmi and Aravindan, 2013), we filtered out the Regional, Netscape, Kids_and_Teens, and Adult categories.
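For illustration, the SURT conversion can be performed with the surt Python package referenced above; a minimal sketch with default canonicalization (the example URIs are ours):

    from surt import surt  # pip install surt

    # SURT reverses the hostname so that URIs group and sort by domain hierarchy.
    print(surt('http://cs.odu.edu/compsci'))   # edu,odu,cs)/compsci
    print(surt('http://www.example.com/'))     # com,example)/ (www is dropped)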

Category Num. URIs Num. sub-categories
Regional 2,348,257 297,140
Arts 658,942 57,959
Society 487,834 36,259
Business 469,668 22,465
News 421,800 2,581
Computers 297,789 12,580
Sports 278,706 28,761
Recreation 261,005 15,467
Shopping 250,538 7,393
Science 217,071 17,212
Adult 197,141 10,683
Reference 160,652 13,077
Games 151,459 20,233
Health 149,648 10,292
Home 81,059 3,553
Kids_and_Teens 63,333 5,793
Netscape 27,223 2,581
Total 6,522,125 564,029
Table 1. The number of entries for each category and the number of sub-categories in the DMOZ dataset

Since we are going to gather recommendations from DMOZ, we wanted to analyze the dataset. We checked the top-level domains, the depth of URIs, if the URIs are on the live web, and if URI patterns occur.

Top-Level Domain

In this section we determine the diversity of the top-level domains (TLDs) in DMOZ. As shown in Table 2, we found that 61.85% of the URIs are in the commercial top-level domain, .com, followed by .org, .net, and .edu. Other top-level domains include .ca, .it, etc.

TLD Num. URIs Percent
com 4,034,276 61.85%
org 586,152 8.99%
net 371,753 5.70%
edu 224,539 3.44%
gov 60,919 0.93%
us 11,382 0.17%
others 1,233,105 18.90%
Total 6,522,125 100%
Table 2. Top-level domain analysis for DMOZ dataset
Depth

Here, we want to know if the URIs we are recommending are only URIs of depth 0. Note that depth 0 includes URIs ending with /index.html or /home.html. The depth is measured after URI canonicalization (https://pypi.org/project/surt/). As shown in Table 3, we found that 50.57% of the URIs in DMOZ are depth 0 (i.e., top-level web pages).

Depth Count Percent
0 3,298,369 50.57%
1 1,134,874 17.40%
2 905,849 13.89%
3+ 1,183,033 18.14%
Total 6,522,125 100%
Table 3. Depth analysis for DMOZ dataset
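As an illustration of this depth measure, a small sketch (the helper and its name are ours; the index.html/home.html handling follows the definition above):

    from urllib.parse import urlparse

    def uri_depth(uri):
        # Count path segments; /index.html and /home.html still count as depth 0.
        segments = [s for s in urlparse(uri).path.lower().split('/') if s]
        if segments and segments[-1] in ('index.html', 'home.html'):
            segments = segments[:-1]
        return len(segments)

    uri_depth('http://odu.edu/')               # 0
    uri_depth('http://odu.edu/index.html')     # 0
    uri_depth('http://cs.odu.edu/compsci')     # 1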
Live Web

As of November 2018, we found that 86% of the URIs in the DMOZ dataset are either live or redirect to live web pages.

Patterns

In this section we calculate the different URI patterns that occur in DMOZ. Table 4 shows the percentage of occurrence of each pattern in the hostname and the path. We analyze the following patterns: long strings, long slugs, numbers, change in case, query strings, port numbers, IP addresses, percent-encoding, and dates. We found that 42.65% of the URIs contain long strings in the hostname and 20.01% of the URIs contain numbers in the path.

Pattern % in hostname % in path
Long strings 42.65% 13.21%
Long slugs 10.85% 7.82%
Numbers 4.37% 20.01%
Change in case 0.36% 8.18%
Query - 4.72%
Port number 0.11% -
IP address 0.07% -
Percent-encoding 0% 0.50%
Date 0% 0.43%
Table 4. URI patterns present in DMOZ

3.2. Wikipedia

Wikipedia is a web-based encyclopedia, launched in 2001 (Wikipedia, [n. d.]a) and available in 304 languages (Wikipedia, [n. d.]b). It contains articles that are categorized, and most articles also contain a list of external links. For instance, the article shown in Figure 3 is categorized as Old Dominion University, Universities and colleges in Virginia, Educational institutions established in 1930, etc., and contains two external links at the end of the article. If the entity described in the article has an official website, then it will be linked as the “Official website” in the list of external links. We use Python Wikipedia packages (Goldsmith, 2016; Majlis, 2019) to extract the information needed.
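As a sketch of this extraction using Goldsmith's wikipedia package (the page title is from the example above; the references property returns the article's external links):

    import wikipedia  # Goldsmith's wrapper: pip install wikipedia

    page = wikipedia.page('Old Dominion University')
    print(page.categories)   # e.g., 'Universities and colleges in Virginia', ...
    print(page.references)   # external links, including the official website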

Figure 3. Searching for the request http://odu.edu in Wikipedia resulted in finding the Wikipedia web page https://en.wikipedia.org/wiki/Old_Dominion_University that contains the requested URI as the official website in the external link section. We use other web pages in the same categories (at the end of the page) as candidate web pages.

3.3. Wayback Machine

The Wayback Machine server access logs contain real requests to the Internet Archive’s Wayback Machine (Tofel, 2007). The requests are from 295 noncontiguous days between 2011-01-01 and 2012-03-02. A sample of this dataset was used for evaluation. This dataset has been used in other work (AlNoamany et al., 2013; AlNoamany, 2016).

Each request (line) contains the following information: Client IP, Access Time, HTTP Request Method, URI, Protocol, HTTP Status Code, Bytes Sent, Referring URI, User-Agent.

In our work, we will use a sample from the requests made on Feb 8, 2012, similar to data selected in AlNoamany et al. (AlNoamany et al., 2013). There were 49,026,577 requests on that day. Before collecting a sample to use, we performed several filtering steps. First, we filtered out any requests that did not result in an HTTP 200 status code. We also filtered out any requests with an invalid URI format or extension, non-HTML URIs, an IP address as the domain, or a ccTLD from a non-English speaking country. In addition, we filtered out requests that resulted in HTML with a non-English HTML language code. This filtering left 732,130 unique URIs.
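A sketch of the first filtering pass over the logs; the field layout is an assumption of ours, following a combined-log-like format matching the fields listed above, not a documented Wayback Machine format:

    import re

    # Assumed combined-log-style layout of the fields listed above.
    LOG = re.compile(r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] '
                     r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>\S+)" '
                     r'(?P<status>\d{3}) (?P<bytes>\S+) '
                     r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

    def keep(line):
        # Keep only well-formed requests that returned HTTP 200.
        m = LOG.match(line)
        return m is not None and m.group('status') == '200'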

4. Algorithm

Our recommendation algorithm, shown in Algorithm 1, is composed of four main steps, each of which will be described in more detail in the following subsections. As per the current method of searching a web archive, the user provides a requested URI and optionally a desired datetime.

Our goal is to provide recommendations for other archived web pages based on the requested URI, which we assume is “lost”, neither available on the live web nor archived. The first step is to obtain a first-level classification of the URI, using DMOZ or Wikipedia. This results in a high-level category for the URI, such as “Computers” or “Business”, similar to those in Table 1. We then use machine learning techniques to obtain a deeper categorization, such as “Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia”. Once this categorization is obtained, we can collect candidates from other URIs in the same category in DMOZ or Wikipedia. Then we filter out any candidates that are not archived and finally rank and recommend candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity.

Step 1: Classify the URI
function Classify_URI_level_one(requested_URI)
     tokens ← Tokenize(requested_URI)
     level_one_category ← ML(tokens)
end function
Step 2: Deep classify the URI
function Classify_URI_deep_levels(requested_URI)
     index ← Index_dataset_by_category()
     scores ← Cosine_similarity(requested_URI, index)
     candidate_categories ← Get_top_N_candidates(scores)
     tree ← Create_and_prune_tree(candidate_categories)
     deep_category ← ML(tree)
end function
Step 3: Filter candidates
function Archived(Candidates)
     for Candidate in Candidates do
          if Candidate is archived then
               add Candidate to Archived_Candidates
          end if
     end for
end function
Step 4: Score and rank candidates
function Rank(Archived_Candidates)
     scores ← Score(Archived_Candidates)
     Recommendations ← Get_top_N_candidates(scores)
end function
Main Function
function Recommending_Archived_Web_Pages(requested_URI)
     if requested_URI not in a_classified_ontology then
          Classify_URI_level_one(requested_URI)    ▷ Step 1
          Classify_URI_deep_levels(requested_URI)    ▷ Step 2
     end if
     Candidates ← Collect_All_Candidates()
     Archived(Candidates)    ▷ Step 3
     Rank(Archived_Candidates)    ▷ Step 4
end function
Algorithm 1 Algorithm for recommending archived web pages using only the URI
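The main function can also be sketched in Python; every helper name here is a placeholder of ours for the corresponding step described in Sections 4.1 through 4.4:

    def recommend(requested_uri, n=10):
        # Check DMOZ/Wikipedia first (Section 4.1); classify only if not found.
        category = lookup_ontology(requested_uri)
        if category is None:
            level_one = classify_first_level(requested_uri)            # Step 1
            category = classify_deep_levels(requested_uri, level_one)  # Step 2
        candidates = collect_candidates(category)
        archived = [c for c in candidates if is_archived(c)]           # Step 3
        return sorted(archived, key=score, reverse=True)[:n]           # Step 4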

4.1. Check Ontologies

The first step is to determine if the requested URI is already present and categorized in DMOZ or Wikipedia. Using DMOZ is straightforward; we check if the URI exists in DMOZ or not. However, in Wikipedia we check if the requested URI is the official web site (by searching for the keyword “official website”) and is categorized. For example, if the requested URI was http://odu.edu, we use the URI to find a related Wikipedia web page. In this example we find that the Wikipedia web page https://en.wikipedia.org/wiki/Old_Dominion_University mentions http://odu.edu as the official website. Then we collect the categories that this web page belongs to, such as Old Dominion University, Universities and colleges in Virginia, Educational institutions established in 1930, etc. Then we collect as candidates all of the official web pages that these categories contain.

To test how often this option might be available, we used the Wayback Machine access logs (Section 3.3). From the filtered set, we found that 13.17% of the URIs were in DMOZ or Wikipedia.

4.2. Step 1: First-Level Classification

For a request that did not appear in an ontology, we will classify it using only the tokens from the URI. We test three different methods of tokenization. First, we use URI tokens that are split by non-alphanumeric characters. Second, we use all-grams from the tokens. Third, we use all-grams from the URI.

4.2.1. Tokenize the URI


To classify the URI, we need to extract meaningful keywords, or tokens, from the URI. We adopt the three methods proposed by Baykan et al. (Baykan et al., 2011).

  • Tokens The URI is split into potentially meaningful tokens. The URI is converted to lower-case and then split into tokens using any non-alphabetic character as a delimiter. Finally, the “http” (or “https”) token is removed, along with any resulting token of length 2 or less.

  • All-grams from tokens The URI tokens are converted to all-grams. We perform the tokenization as above and then generate all-grams on the tokens by combining 4-, 5-, 6-, 7-, and 8-grams of the combined tokens.

  • All-grams from the URI The URI is converted to all-grams without tokenizing first. Any punctuation and numbers are removed from the URI, along with “http” (or “https”). Then the result is converted to lowercase. Finally, the all-grams are generated by combining the 4-, 5-, 6-, 7-, and 8-grams of the remaining URI characters.

An example of the different tokenization methods is shown in Table 5. Using these methods we also examine removing the TLDs from the URIs, removing numbers, and removing stop words (Section 4.2.2).

Method Result
Tokens odu, edu, compsci
All-grams from tokens odu, edu, comp, omps, mpsc, psci, comps, ompsc, mpsci, compsc, ompsci, compsci
All-grams from URI odue, dued, uedu, educ, duco, ucom, comp, omps, mpsc, psci, odued, duedu, ueduc, educo, ducom, ucomp, comps, ompsc, mpsci, oduedu, dueduc, ueduco, educom, ducomp, ucomps, compsc, ompsci, odueduc, dueduco, ueducom, educomp, ducomps, ucompsc, compsci, odueduco, dueducom, ueducomp, educomps, ducompsc, ucompsci
Table 5. Tokenizing the URI https://odu.edu/compsci using different methods (Baykan et al., 2011)
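A minimal sketch of the all-grams-from-URI method that reproduces the grams in Table 5 (the function name is ours; TLD removal, evaluated in Section 4.2.2, would happen before this step):

    import re

    def uri_allgrams(uri, nmin=4, nmax=8):
        # Drop the scheme, remove punctuation and numbers, and lowercase.
        s = re.sub(r'^https?', '', uri.lower())
        s = re.sub(r'[^a-z]', '', s)
        # Combine the 4- through 8-grams of the remaining characters.
        return [s[i:i + n] for n in range(nmin, nmax + 1)
                           for i in range(len(s) - n + 1)]

    uri_allgrams('https://odu.edu/compsci')
    # ['odue', 'dued', 'uedu', 'educ', 'duco', 'ucom', 'comp', ...]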

To determine the best tokenization method, as a baseline we tested the classification of tokens on the DMOZ dataset using machine learning. We took the DMOZ dataset and created a 10-fold cross-validation set, using 90% for training and 10% for testing. We employed a Naïve Bayes classifier to take tokens and return the top-level category. Naïve Bayes was selected because of its simplicity; it assumes independence between the features. In the testing dataset, we filtered out URIs that contain tokens not seen in the training set, as was also done in related work (Baykan et al., 2011).

We measured the F1 score to evaluate the different tokenization methods. Table 6 shows the result of our evaluation. In addition to the base tokenization methods described above, we also tested the following alternatives for each method:

  • remove TLD before tokenization

  • remove TLD and numbers before tokenization

  • remove TLD, numbers, and stop words before tokenization

The stop words were based on the set of stop words in the Natural Language Toolkit (NLTK, https://nltk.org/). We found that using the all-grams from the URI after removing the TLD and numbers had the highest F1 score, which was comparable to results obtained in related work (Rajalakshmi and Aravindan, 2013). We use this method of tokenization going forward.

Method Variant F1 score Micro average Macro average
Tokens All URI tokens 0.39 0.45 0.31
Tokens Without TLD 0.35 0.40 0.28
Tokens Without TLD and numbers 0.40 0.45 0.32
Tokens Without TLD and stop words 0.39 0.43 0.30
All-grams from tokens All URI tokens 0.51 0.53 0.45
All-grams from tokens Without TLD 0.51 0.53 0.46
All-grams from tokens Without TLD and numbers 0.51 0.52 0.47
All-grams from tokens Without TLD and stop words 0.50 0.52 0.46
All-grams from URI All URI tokens 0.56 0.55 0.48
All-grams from URI Without TLD 0.55 0.59 0.46
All-grams from URI Without TLD and numbers 0.59 0.62 0.61
All-grams from URI Without TLD and stop words 0.55 0.60 0.47
Table 6. Classifying at the first level: F1 score, micro average, and macro average on the DMOZ dataset for different tokenization methods

4.2.2. Classify the URI using Machine Learning


Now that we have determined the best tokenization method, we apply it to future requests. We trained the Naïve Bayes classifier on the entire DMOZ dataset, and this is used as the baseline for classification at the first level. We take the requested URI, remove the TLD and numbers, and then perform the all-grams-from-URI tokenization described in the previous section. The resulting all-grams are used by the classifier to produce a first-level classification.
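A sketch of this baseline with scikit-learn, reusing the uri_allgrams tokenizer sketched in Section 4.2.1; the three training pairs are toy stand-ins for the millions of categorized DMOZ entries:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the categorized DMOZ URIs (URI -> first-level category).
    train_uris = ['http://cs.odu.edu/compsci',
                  'http://timesonline.co.uk/tol/sport/cricket/',
                  'http://medlineplus.nlm.nih.gov/medlineplus/parkinsonsdisease.html']
    train_labels = ['Computers', 'Sports', 'Health']

    # The callable analyzer applies the all-grams tokenization to each URI.
    clf = make_pipeline(CountVectorizer(analyzer=uri_allgrams), MultinomialNB())
    clf.fit(train_uris, train_labels)
    clf.predict(['http://tripadvisor.com/where_to_travel'])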

4.3. Step 2: Deep-Level Classification

In this step we want to classify the requested URI http://cs.odu.edu/compsci into a hierarchical deep classification such as Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia. Known methods for hierarchical deep classification are the big-bang approach and the top-down approach (Sun and Lim, 2001). Neither method is ideal with a large number of hierarchies, and both may result in error propagation. For this reason we adopt the method of Xue et al. (Xue et al., 2008), but unlike that work, we are limited to the URI only and do not have the document or any supporting details.

  1. Index dataset. In preparation for computing similarity between the requested URI and the category entries, we index DMOZ by category, creating a list of all URIs in each of the DMOZ deep-level categories.

  2. Cosine similarity. We compute the cosine similarity between the tokenized requested URI and the tokenized URIs, titles, and descriptions in each category. In this step, each category in the index gets a similarity score to the requested URI, which is the average similarity to all entries in that category (see the sketch after this list).

  3. Collect N candidates. Next we select the top 10 candidate categories with the highest similarity score, similar to related work (Xue et al., 2008).

  4. Prune tree. Each candidate category could be a leaf node or an internal node. We create a hierarchical tree and then prune it to get the final list of candidates to classify using machine learning. First, we create the tree from the candidates by starting from the root node and going down until all 10 candidates are present, as shown in Figure 4(a). Next, in order to enhance the classification, the tree is pruned based on the ancestor assistance strategy. The ancestor assistance strategy includes the ancestors of a node if there are no common ancestors with another candidate, as shown in Figure 4(b).

    (a) Create hierarchical tree from the 10 candidate categories (the candidate categories are highlighted). The numbers represent the category ID
    (b) Pruned tree using ancestor assistance strategy. The parents of nodes 88 and 100 are included because they have no shared ancestor with other candidates
    Figure 4. The process of pruning a hierarchical tree using ancestor assistance strategy (Xue et al., 2008)
  5. Classify. To choose a single classification from the pruned tree, we classify the requested URI using two methods: 3-gram tokens and all-grams. The 3-gram method had the best result when comparing documents (Xue et al., 2008); however, in our work we compare URI tokens, so we expect the all-gram method to perform better.
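A sketch of the similarity scoring in steps 1 through 3; the two-category index is a toy stand-in for the real DMOZ index, and the count vectors approximate the TF weighting described above:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy stand-in for the step 1 index: deep category -> token strings built
    # from each entry's URI tokens, title, and description.
    index = {
        'Computers/Computer_Science/Academic_Departments':
            ['odu edu compsci old dominion university computer science'],
        'Sports/Cricket':
            ['timesonline tol sport cricket'],
    }

    def top_categories(request_tokens, n=10):
        # Steps 2-3: a category's score is its average similarity to its entries.
        entries = [doc for docs in index.values() for doc in docs]
        vec = CountVectorizer().fit([request_tokens] + entries)
        req = vec.transform([request_tokens])
        scores = {cat: cosine_similarity(req, vec.transform(docs)).mean()
                  for cat, docs in index.items()}
        return sorted(scores, key=scores.get, reverse=True)[:n]

    top_categories('odu edu compsci')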

4.4. Steps 3, 4: Filter, Rank and Recommend

Step 3 in our algorithm is to ensure that all recommendations come from a web archive. We take the candidates from Step 2 and remove any that are not archived, using MemGator (Alam and Nelson, 2016) to determine this. In Step 4, we rank and recommend the remaining candidates based on temporal similarity (St), web page popularity (Sp), URI similarity (Ss), and archival quality (Sq). Our final list of recommended web pages is ranked by Equation 1, where the weights wt, wp, ws, and wq specify the importance given to each feature and wt + wp + ws + wq = 1.0.

score = wt·St + wp·Sp + ws·Ss + wq·Sq    (1)
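For the Step 3 filter, a sketch of an archived-or-not check against a MemGator aggregator; the endpoint is an assumed deployment, and MemGator answers a TimeMap request with HTTP 200 when mementos exist and 404 otherwise:

    import requests

    MEMGATOR = 'https://memgator.cs.odu.edu'   # assumed MemGator instance

    def is_archived(uri):
        r = requests.get(f'{MEMGATOR}/timemap/link/{uri}')
        return r.status_code == 200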

4.4.1. Temporal similarity


Temporal similarity refers to how close the candidate web page’s Memento-Datetime (Van de Sompel et al., 2013) is to the requested datetime. This is shown in Equation 2, where dr is the request datetime, dc is the candidate datetime, dcur is the current datetime, and de is the earliest datetime. The earliest datetime is considered to be 1996, because that is when archiving the Web started (https://archive.org/about/).

St = 1 - (|dr - dc| / (dcur - de))    (2)

4.4.2. Web page popularity


We use how often the web page has been archived and the domain's popularity as determined by Alexa (https://alexa.com) as an approximation of the web page's popularity. Our popularity measure is given in Equation 3, where r is the Alexa Global Rank of the requested domain, rmax is the rank of the lowest-ranked domain in Alexa, a is the number of times the URI has been archived, and amax is the number of times Alexa's top-ranked web site has been archived.

Sp = ((1 - r/rmax) + (a/amax)) / 2    (3)

We set rmax to 30,000,000, as it is the current lowest ranking in Alexa, and we set amax to 538,300, the number of times that http://google.com, the top-ranked Alexa web page, has been archived.

4.4.3. URI similarity


We measure the similarity of the requested URI token set (Tr) and the candidate URI token set (Tc) using the Jaccard similarity coefficient (Equation 4).

Ss = |Tr ∩ Tc| / |Tr ∪ Tc|    (4)

4.4.4. Archival quality


Archival quality refers to how well the page is archived. We use Memento-Damage (Siregar, 2017) to calculate the impact of missing resources in the web page. We calculate archival quality using Equation 5, where D is the damage score calculated by Memento-Damage.

Sq = 1 - D    (5)
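Putting Equations 1 through 5 together as reconstructed above, a sketch of the scoring; the candidate dictionary keys are our own naming:

    from datetime import datetime

    EARLIEST = datetime(1996, 1, 1)   # when archiving the Web started
    R_MAX = 30_000_000                # lowest Alexa global rank
    A_MAX = 538_300                   # archive count of Alexa's top-ranked site

    def temporal_similarity(d_req, d_cand, now=None):           # Equation 2
        now = now or datetime.now()
        span = (now - EARLIEST).total_seconds()
        return 1 - abs((d_req - d_cand).total_seconds()) / span

    def popularity(rank, times_archived):                       # Equation 3
        return ((1 - rank / R_MAX) + times_archived / A_MAX) / 2

    def uri_similarity(req_tokens, cand_tokens):                # Equation 4
        a, b = set(req_tokens), set(cand_tokens)
        return len(a & b) / len(a | b)

    def archival_quality(damage):                               # Equation 5
        return 1 - damage

    def final_score(c, wt=0.25, wp=0.25, ws=0.25, wq=0.25):     # Equation 1
        return (wt * temporal_similarity(c['d_req'], c['d_cand'])
                + wp * popularity(c['rank'], c['archived'])
                + ws * uri_similarity(c['req_tokens'], c['cand_tokens'])
                + wq * archival_quality(c['damage']))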

5. Example

Here we present an example of a request and the resulting recommendations. We request http://odu.edu/compsci with the desired date of March 1, 2014. This URI is not classified in DMOZ or in Wikipedia, so we use machine learning to classify it, resulting in Computers/Computer_Science/Academic_Departments/North_America/United_States/Virginia. Then we collect all the candidates in that category from DMOZ.

Using equal weights (wt = wp = ws = wq = 0.25) in our ranking equation, we then recommend the top three ranked candidates to the user.

6. Evaluation and Results

First, we evaluate how well our deep classification method (Section 4.3) works. To test this step, we use 10% of the DMOZ dataset for testing and the rest for training. We assume that the first-level categorization has already been predicted in Step 1. We evaluate the performance by determining if we classified each level correctly. For example, if a URI is actually in the category c1/c2/c3, then for level-two evaluation, we check if we predicted c1/c2. For each level we calculate the micro-average F1 (Mi-F1) score. In Figure 5, we show the Mi-F1 score for each level using 3-gram cosine similarity. The highest Mi-F1 score in our results was 0.2, compared to 0.8 in the related work (Xue et al., 2008), but that is due to using only the requested URI as the testing data and the URI, title, and category as training data, as opposed to using the text of the full document as in (Xue et al., 2008). This shows that using only the tokens from the URI is not enough for deep classification. Because of this limited information, we also tested the same method using all-gram cosine similarity. We found that the results are better; however, they are still low compared to the related work.

Figure 5. Performance on classifying to different levels using 3-gram and all-gram

Some features could affect the URI classification. We investigated the relationship between the depth of the URI and classification. Table 7 shows the URI depth and the percentage of the correctly classified URIs. We only considered URIs to be correctly classified if they were correct to the deepest level. We found that 63.45% of the correctly classified URIs are of depth 0.

Depth Percent
0 63.45%
1 16.96%
2 13.48%
3 3.77%
4 1.47%
5+ 0.86%
Table 7. URI depth and percentage of correctly classified URIs

Next, we check if the words in the URIs are in a dictionary (after removing the TLD). We use the enchant English dictionary (https://pypi.org/project/pyenchant/) and wordninja (https://pypi.org/project/wordninja/) to split compound words. For example, the URI http://mickeymantlebaseballcards.net is split into mickey, mantle, baseball, and cards. We found that 36.92% of the correctly classified URIs contain only words from a dictionary, and 44.89% of the correctly classified URIs contain at least one word from a dictionary.
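A sketch of this dictionary check with pyenchant and wordninja, using the example URI above:

    import enchant     # pip install pyenchant
    import wordninja   # pip install wordninja

    d = enchant.Dict('en_US')
    words = wordninja.split('mickeymantlebaseballcards')
    print(words)                            # ['mickey', 'mantle', 'baseball', 'cards']
    print(any(d.check(w) for w in words))   # at least one dictionary word
    print(all(d.check(w) for w in words))   # only dictionary words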

An ideal URI structure contains long strings that carry more semantics. We are trying to identify a “slug”, which is the part of a URI that contains keywords or the web page title. An example of a slug is the path in https://cnn.com/2017/07/31/health/climate-change-two-degrees-studies/index.html. The slug in this URI is readable, and we can identify what the web page is about. We evaluate the existence of long strings in the correctly classified URIs. We assume that the average length of an English word is 5 characters (Pierce, 2012; Palmer, 1997), and anything longer is considered a long string.

Overall, we found that 41.58% of the sampled URIs contain long strings in the domain, for example, http://timesonline.co.uk/tol/sport/cricket/. Also, we found that 89.47% of the sampled URIs contain long strings in the path, for example, http://medlineplus.nlm.nih.gov/medlineplus/parkinsonsdisease.html. When analyzing the correctly classified URIs, we found that 50.07% contain long strings in the domain and 13.45% contain long strings in the path. Words can also be separated by delimiters in the domain or path. We found that 9.91% of the correctly classified URIs contain words separated by delimiters in the domain, for example, http://vintage-poster-art.com/, and 6.97% contain words separated by delimiters in the path, for example, http://seaworldparks.com/en/buschgardens-williamsburg/.

In addition, we wanted to investigate the effect of the category on correct classification. As shown in Table 8, we found that 15.32% of the correctly classified URIs were in the “Society” first-level category. We also found that none of the correctly classified URIs were in “News”. We found that in the “News” category in DMOZ, there is a level two subcategory “Online_Archive” that contains 95% of the “News” URIs and repeats several subcategories inside “News”. This caused errors in our classification.

Category Count Percent
Society 459 15.32%
Arts 401 13.38%
Shopping 355 11.85%
Recreation 331 11.05%
Sports 291 9.71%
Home 288 9.61%
Reference 238 7.94%
Computers 228 7.61%
Health 190 6.34%
Science 130 4.34%
Games 50 1.67%
Business 35 1.17%
News 0 0%
Total 2996 100%
Table 8. Percentage of the correctly classified URIs in each category

After finding certain characteristics that help with classifying the URI, we need to know what percentage of URIs in the Wayback access logs have similar characteristics. First, we determined the diversity of the top-level domains (TLDs) in the Wayback access log dataset. As shown in Table 9, we found that 71.80% of the URIs are in the commercial top-level domain, .com, followed by .net, .org, and .edu. This distribution is similar to that of DMOZ (Section 3.1).

TLD Num. URIs Percent
com 525,651 71.80%
net 56,589 7.73%
org 53,703 7.34%
edu 8,599 1.17%
gov 2,343 0.32%
us 2,071 0.28%
others 83,174 11.36%
Total 732,130 100%
Table 9. Top-level domain analysis for the Wayback Machine server access logs dataset

Next, we want to determine the depth of the requested URIs. As shown in Table 10, we found that 83.74% of the URIs in the Wayback access log are depth 0, essentially top-level web pages. This means that users most often request URIs of depth 0 from the archive. Since 63.45% of the correctly classified URIs in our DMOZ evaluation were of depth 0, having 83.74% of requests at depth 0 could enhance the classification results.

Depth Count Percent
0 613,121 83.74%
1 54,008 7.38%
2 33,644 4.60%
3+ 31,357 4.28%
Total 732,130 100%
Table 10. Depth analysis for Wayback access log dataset

We saw that having terms found in a dictionary affects classification. We found that 5.39% of the Wayback access log URIs contain only words from a dictionary, and 26.74% contain at least one word from a dictionary. These percentages are low and may limit our ability to correctly classify the requested URIs.

In our DMOZ evaluation, we found that long strings in the domain helped with classification. When analyzing the Wayback access log requests, we found that 50.16% contain long strings in the domain and only 3.24% contain long strings in the path. In addition, we found that 12.99% contain words separated by delimiters in the domain and only 1.54% in the path. This also reflects the large percentage of URIs from the access logs with depth 0 (no path). For classifying most of these requests, we will have to rely largely on domain information.

7. Conclusions

In this work we recommend web pages from a web archive for a requested “lost” URI. Our work proposes a method to enhance the current response from web archives when a URI cannot be found (Figure 1(a)). We use both DMOZ and Wikipedia to classify the request and find candidates. First, we check if the requested URI is classified in DMOZ or Wikipedia. If the requested URI is not pre-classified, then we classify the URI using first-level classification and then deep classification. This step results in a list of candidates that we filter based on whether the web page is archived. Next we score and rank the candidates based on archival quality, web page popularity, temporal similarity, and URI similarity.

We found that the best method for classifying at the first level is using all-grams from the URI after filtering the TLD and numbers from the URI. Using a Naïve Bayes classifier resulted in an F1 score of 0.59. For deeper levels, we measured the accuracy at each classification level: the micro-average F1 was 0.30 at the second level and 0.15 at the third level. We also found that 44.89% of the correctly classified URIs contain at least one word that exists in a dictionary, and 50.07% of the correctly classified URIs contain long strings in the domain. We also analyzed the properties of a sample of URIs requested from the Wayback Machine and found that the large majority were of depth 0, meaning that our classification will rely largely on domain information.

Future work includes adding other languages, filtering spam web pages, and ranking based on how long the web page has been missing from the live web. For popularity, if access logs are preserved, we can measure how frequently the URI was requested from the archive. For temporal similarity, we can measure the closeness of the creation dates of the requested page and the candidate.

8. Acknowledgments

This work is supported in part by the National Science Foundation, IIS-1526700.

References

  • Ainsworth et al. (2011) Scott G. Ainsworth, Ahmed Alsum, Hany M. SalahEldeen, Michele C. Weigle, and Michael L. Nelson. 2011. How Much of the Web is Archived?. In Proceedings of the 11th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 133–136.
  • Alam and Nelson (2016) Sawood Alam and Michael L Nelson. 2016. MemGator-A portable concurrent memento aggregator: Cross-platform CLI and server binaries in Go. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. ACM, 243–244.
  • Alkwai et al. (2017) Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS) 36, 1 (2017), 1:1–1:34.
  • AlNoamany (2016) Yasmin AlNoamany. 2016. Using Web Archives to Enrich the Live Web Experience Through Storytelling. Ph.D. Dissertation. Old Dominion University.
  • AlNoamany et al. (2014) Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson. 2014. Who and What Links to the Internet Archive. International Journal on Digital Libraries (IJDL) 14, 3-4 (2014), 101–115.
  • AlNoamany et al. (2013) Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson. 2013. Access Patterns for Robots and Humans in Web Archives. In Proceedings of the 13th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 339–348.
  • AlSum (2014) Ahmed AlSum. 2014. Web Archive Services Framework for Tighter Integration Between the Past and Present Web. Ph.D. Dissertation. Old Dominion University.
  • Baykan et al. (2009) Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. 2009. Purely URL-based Topic Classification. In Proceedings of the 18th International conference on World Wide Web (WWW). 1109–1110.
  • Baykan et al. (2011) Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. 2011. A Comprehensive Study of Features and Algorithms for URL-based Topic Classification. ACM Transactions on the Web (TWEB) 5, 3 (2011), 15.
  • Brunelle et al. (2015) Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson. 2015. Not all mementos are created equal: Measuring the impact of missing resources. International Journal on Digital Libraries (IJDL) 16, 3-4 (2015), 283–301.
  • Devi et al. (2007) M Indra Devi, R Rajaram, and K Selvakuberan. 2007. Machine learning techniques for automated web page classification using URL features. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), Vol. 2. 116–120.
  • Goldsmith (2016) Jonathan Goldsmith. 2016. A Pythonic wrapper for the Wikipedia API. https://github.com/goldsmith/Wikipedia. (2016).
  • Huurdeman et al. (2014) Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, and Arjen P. de Vries. 2014. Finding pages on the unarchived web. In Proceedings of the 14th IEEE/ACM Joint Conference on Digital Libraries (JCDL). 331–340.
  • Huurdeman et al. (2015) Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. 2015. Lost but not Forgotten: Finding Pages on the Unarchived Web. International Journal on Digital Libraries (IJDL) 16, 3-4 (2015), 247–265.
  • Jones (2018) Shawn M. Jones. 2018. A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages. https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html. (2018).
  • Kahle (2019) Brewster Kahle. 2019. 703,726,890,000 URL’s now in the @waybackmachine by the @internetarchive ! (703 billion) Over a billion more added each week. The Web is a grand experiment in sharing and giving. Loving it! http://web.archive.org/. https://twitter.com/brewster_kahle/status/1087515601717800960. (21 January 2019).
  • Kan (2004) Min-Yen Kan. 2004. Web Page Classification Without the Web Page. In Proceedings of the 13th International World Wide Web conference on Alternate Track Papers and Posters. 262–263.
  • Kan and Thi (2005) Min-Yen Kan and Hoang Oanh Nguyen Thi. 2005. Fast Webpage Classification Using URL Features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). 325–326.
  • Kanhabua et al. (2016) Nattiya Kanhabua, Philipp Kemkes, Wolfgang Nejdl, Tu Ngoc Nguyen, Felipe Reis, and Nam Khanh Tran. 2016. How to search the Internet Archive without indexing it. In Proceedings of the International conference on Theory and Practice of Digital Libraries (TPDL). 147–160.
  • Klein and Nelson (2014) Martin Klein and Michael L. Nelson. 2014. Moved but not Gone: An Evaluation of Real-time Methods for Discovering Replacement Web Pages. International Journal on Digital Libraries (IJDL) 14, 1-2 (2014), 17–38.
  • Majlis (2019) Martin Majlis. 2019. Python wrapper for Wikipedia. https://github.com/martin-majlis/Wikipedia-API. (2019).
  • Palmer (1997) David D Palmer. 1997. A trainable rule-based algorithm for word segmentation. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 321–328.
  • Pierce (2012) John R Pierce. 2012. An introduction to information theory: symbols, signals and noise. Courier Corporation.
  • Rajalakshmi and Aravindan (2011) R. Rajalakshmi and Chandrabose Aravindan. 2011. Naive bayes approach for website classification. In Proceedings of the Information Technology and Mobile Communication. Communications in Computer and Information Science. Vol. 147.
  • Rajalakshmi and Aravindan (2013) R Rajalakshmi and Chandrabose Aravindan. 2013. Web Page Classification Using N-gram Based URL Features. In Proceedings of the 5th International Conference on Advanced Computing (ICoAC). 15–21.
  • Siregar (2017) Erika Siregar. 2017. Deploying the Memento-Damage Service. https://ws-dl.blogspot.com/2017/11/2017-11-22-deploying-memento-damage.html. (2017).
  • Sun and Lim (2001) Aixin Sun and Ee-Peng Lim. 2001. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 521–528.
  • Tofel (2007) Brad Tofel. 2007. Wayback for Accessing Web Archives. In 7th International Web Archiving Workshop (IWAW’07).
  • Van de Sompel et al. (2013) Herbert Van de Sompel, Michael L. Nelson, and Robert Sanderson. 2013. HTTP framework for time-based access to resource states – Memento, Internet RFC 7089. http://tools.ietf.org/html/rfc7089. (2013).
  • Wikipedia ([n. d.]a) Wikipedia. [n. d.]a. History of Wikipedia. https://en.wikipedia.org/wiki/History_of_Wikipedia. ([n. d.]).
  • Wikipedia ([n. d.]b) Wikipedia. [n. d.]b. List of Wikipedias. https://en.wikipedia.org/wiki/List_of_Wikipedias. ([n. d.]).
  • Xue et al. (2008) Gui-Rong Xue, Dikan Xing, Qiang Yang, and Yong Yu. 2008. Deep classification in large-scale text hierarchies. In Proceedings of the 31st annual International ACM SIGIR conference on Research and Development in Information Retrieval. 619–626.