Tree-based Focused Web Crawling with Reinforcement Learning

12/12/2021
by   Andreas Kontogiannis, et al.

A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones; i.e., maximizing the harvest rate. Reinforcement Learning (RL) has been utilized to optimize the crawling process, yet it deals with huge state and action spaces, which can constitute a serious challenge. In this paper, we propose TRES, an end-to-end RL-empowered framework for focused crawling. Unlike other approaches, we properly model a crawling environment as a Markov Decision Process, by representing the state as a subgraph of the Web and actions as its expansion edges. TRES adopts a keyword expansion strategy based on the cosine similarity of keyword embeddings. To learn a reward function, we propose a deep neural network, called KwBiLSTM, leveraging the discovered keywords. To reduce the time complexity of selecting a best action, we propose Tree-Frontier, a two-fold decision tree, which also speeds up training by discretizing the state and action spaces. Experimentally, we show that TRES outperforms state-of-the-art methods in terms of harvest rate by at least 58%. Our implementation code can be found on https://github.com/ddaedalus/TRES.

1 Introduction

Focused crawlers [CHAKRABARTI19991623] are intelligent agents that, given a defined topic of interest, find a strategy to search for relevant web pages. Such agents are widely used to collect vast amounts of topic-related data. Typically, a focused crawler starts from an initial set of highly relevant data, in the form of URLs, called seeds. The crawler uses the seeds as a bootstrap, extracting their outlink URLs and storing them in a URL collection called the frontier [10.1145/988672.988714]. The agent selects the best URL from the frontier, visits its web page and classifies it as relevant or irrelevant to the user’s interests. Then, its outlink URLs (hyperlinks) are stored in the frontier. Based on relevance feedback, the crawling strategy is adjusted. The total number of fetched URLs is user-defined, so the whole process terminates when that number of URL selections is reached.

Many approaches, such as [Elaraby2019ANA, RAJIV20213, SALEH2017181], have proposed the use of classifiers that identify whether a given web page is relevant to the target topic. Such classifiers can be used to (a) identify the relevant web pages (in the form of URLs) in the frontier, assigning them high crawling priorities, (b) give relevance feedback to the crawler’s strategy after it visits a web page and (c) evaluate the crawled data after a crawling run ends.

An alternative way of measuring the quality of an outlink URL is by applying Reinforcement Learning (RL). Unlike supervised learning methods, such focused crawlers are able to distinguish web pages (or websites), called hubs [Rennie99efficientweb], that are possibly connected with relevant URLs on the Web. In cases where no relevant URLs exist in the frontier, a good RL strategy may identify hubs to find new relevant web pages, unlike a classification method, which ignores such information.

In this paper we propose a novel focused crawling framework on textual Web data, called TRES (Tree REinforcement Spider). Our crawler agent uses the typical focused crawling framework described above, adopting an RL crawling policy for selecting URLs. We take into account the semantic similarities between candidate keywords and construct small vector representations of candidate URLs summarizing various semantic and statistical information. We utilize this URL representation as a shared state-action representation.

Furthermore, we introduce Tree-Frontier, which is a two-fold decision tree, utilizing the rewards from experience as target labels. Since a decision tree splits on feature values, under a defined criterion, each leaf node corresponds to a local neighborhood of the input space. This way, in Tree-Frontier, a leaf stores a set of similar frontier samples, corresponding to outlink URLs (in the frontier), along with similar experience samples, corresponding to fetched web pages.

We utilize a Double Deep Q-Network (DDQN) [10.5555/3016100.3016191] focused crawler agent, though other model-free RL agents that estimate Q-values on a state-action input are also applicable. In summary, this paper makes the following contributions:

  • TRES outperforms other state-of-the-art focused crawlers on three evaluated domains, in terms of harvest rate of web pages, utilizing no more than a single seed URL.

  • We introduce the Tree-Frontier algorithm for efficiently and effectively selecting which URLs to fetch through discretization of the large state and action spaces.

  • Tree-Frontier improves the time complexity of a synchronous update [10.1007/978-3-319-91662-0_20] (i.e. selecting the best action, given the state, by calculating the Q-values of all available actions), by discarding a factor from the time complexity, accounting for the minimal number of new URLs inserted in frontier at each timestep.

  • We introduce a novel Markov Decision Process (MDP) formulation for focused crawling with RL, in which, unlike previous approaches, the agent necessarily follows the successive order of state transitions.

The remainder of the paper is organized as follows. In Section 2, we give a brief overview of the related work. In Section 3, we present the problem definition and our framework overview. Our experimental evaluation and the comparisons with baseline and state-of-the-art methods are presented in Section 4. In Section 5, we discuss the conclusions of this paper and some future steps.

2 Related Work

Chakrabarti et al. [CHAKRABARTI19991623] introduced the notion of focused crawling on target domains. Focused crawlers often utilize two classifiers: an apprentice and a critic [10.1145/511446.511466]. The critic classifies web pages, while the apprentice learns (from the critic’s feedback) to distinguish the most promising outlink URLs. Suebchua et al. [Suebchua2017EfficientTF] introduced the "Neighborhood feature", which exploits the relevance of all already fetched web pages of the same domain (web site), in order to more effectively select URLs belonging to the same domain. This approach assumes that a web page is likely to be relevant if many relevant web pages have been discovered in its neighborhood, which is similar to the empirical evidence of topical locality on the Web shown by Davison [10.1145/345508.345597].

Search-based Discovery methods utilize queries on search engine APIs (e.g. Google and Bing) and adapt a strategy onto query improvement. SeedFinder [10.1007/s11280-015-0331-7] tries to find new seeds in order to best bootstrap the crawling process of an arbitrary focused crawler, by executing queries on search engines utilizing terms (keywords) extracted from relevant texts. Liakos et al. [10.1007/s11280-015-0349-x] expanded a keyword collection by selecting the keyword with the highest TF-IDF score to form a query. By executing this query on a search engine, new candidate keywords were retrieved and the ones with the highest scores were stored in the collection. We adopt a similar keyword discovery strategy, with the exception of using the cosine similarities of word2vec [DBLP:journals/corr/abs-1301-3781] embeddings of candidate keywords instead of their TF-IDF scores.

On the other hand, Crawling-based Discovery methods utilize the link structure to discover new content by automatically fetching and iteratively selecting URLs extracted from the visited web pages. They can be divided into: (a) focused (forward) and (b) backward crawling methods. Backward crawling methods, e.g. the BIPARTITE crawler [barbosa-etal-2011-crawling], explore URLs through backlink search, considering the Web to be an undirected graph. In this work, we are not interested in evaluating backward crawling, since it requires paid APIs for most real users.

Du et al. [10.1016/j.asoc.2015.07.026] integrated the TF-IDF scores of the words with the semantic similarities among the words to construct topic and text semantic vectors and computed the cosine similarities between these vectors to select URLs. ACHE [10.1145/1242572.1242632] is an adaptive crawler aiming to maximize the harvest rate [CHAKRABARTI19991623] of fetched web pages through an online-learning policy. This way, it learns to identify promising URLs and adapts its focus as the crawl progresses.

Rennie and McCallum [Rennie99efficientweb] introduced RL in focused crawling, but they generalized over all possible states in order to deal with the huge state space; thus each action (outlink URL) was immediately disconnected from its state. To face these challenges, RLwCS [10.5555/1567281.1567456] uses a tabular Q-learning policy that selects the best classifier from a defined pool of classifiers, which evaluate all available candidate URLs that the crawler can immediately fetch. InfoSpiders [Menczer99adaptiveretrieval, 10.1145/1031114.1031117] approximate the Q-values of URLs in an online way, yet they only take into account statistical features of keywords for the state representation, without considering any experience samples.

Han et al. [10.1007/978-3-319-91662-0_20] proposed an online crawling algorithm, based on SARSA policy approximation [10.5555/3312046], introducing a shared state-action representation. They manually discretized the state-action space into buckets according to value ranges of statistical heuristic features. However, their MDP formulation somewhat unnaturally allows the agent to deviate from the successive order of states; i.e. it may select actions that do not really exist in a current state. This way, given a current state, the agent may transition with positive probability to states for which the corresponding transition probabilities must be zero.

Gouriten et al. [10.1145/2631775.2631795] highlighted the frontier batch disadvantage, considering that a batch is the number of successive crawling timesteps in which the frontier has not been updated. They showed that, even in an offline setting, the problem of determining the optimal sequence of fetched pages is NP-hard. Pham et al. [10.1145/3308558.3313709] used a multi-armed bandit strategy to combine different search operators (queries on APIs, forward and backward crawling) as a proxy to collect relevant web pages, and ranked them based on their similarity to seeds.

3 Problem and Method Overview

3.1 Problem Definition

3.1.1 Keyword Expansion Problem

Let $W_C$ be the set of all web page text documents belonging to a target topic $C$. Also, let $K_C$ be the finite set of all keywords that describe $C$. Given an initial small set of keywords $KS_0$, where $KS_0 \subset K_C$, and a corpus of text documents $D_C$, each of which contains candidate keywords, our goal is to expand $KS_0$ into a keyword set $KS$ so that $KS_0 \subseteq KS \subseteq K_C$.

3.1.2 Web Page Classification Problem

Similarly to [Lu2016AnIF], we deal with a binary web page classification problem, in order to decide whether a given web page is relevant to the topic of interest. Specifically, let represent the set of all web page text documents belonging to any topic, other than . Also, let represent the set of all web pages on the Web. Assuming (a) is finite, (b) each web page represents a text document and (c) a word vector space , then is a set of words. Given a collection of web pages , which all represent text documents of maximum word length , and a keyword set , our goal is to find a function , so that for each input web page , classifies correctly into or . The above function can be also written .

3.1.3 Focused Crawling Problem

Let $U_C$ be the finite set of URLs corresponding to all web pages that belong to the target topic $C$. Given $C$ and a small set of seeds $U_0$, where $U_0 \subset U_C$, our goal is to expand $U_0$ over a $T$-timestep process of attempts to discover URLs belonging to $U_C$. Thus, since the crawling process is terminated after $T$ timesteps, optimally $U_0$ would be expanded by $T$ new URLs. We will refer to the discovered URLs that belong to $U_C$ as relevant.

Next, we describe our proposed TRES framework. First, we formulate the focused crawling setting as an MDP. Afterwards, we present the keyword expansion strategy that our crawler adopts, in order to discover keywords that are relevant to the target topic. Next, we describe KwBiLSTM, which is a deep neural network that represents the reward function of our RL setting. Last but not least, we present the proposed focused crawling algorithm, which utilizes RL, in order to find good policies for selecting URLs, and the Tree-Frontier algorithm for efficient sampling from the frontier.

3.2 Modeling Focused Crawling as an MDP

Reinforcement Learning algorithms are almost always applicable when the problem setting can be formulated as an MDP [10.5555/3312046]. Similarly to [10.1007/978-3-319-91662-0_20], we consider the focused crawler to be an agent that exists in an interactive environment, which provides states, actions and rewards, and whose goal is to maximize the cumulative reward. Next we will highlight that maximizing the agent’s cumulative reward is equivalent to the focused crawling objective. To model a focused crawling setting as an MDP, we first describe some specific concepts.

Let $U$ be the finite set of all URLs on the Web and $W$ the set of all web pages. Let $\mathrm{Page}: U \to W$ be the function that maps each URL to its unique web page in $W$. Let $\mathrm{outlink}: U \to 2^U$ be the function that matches a given URL to a set containing its outlink URLs. Let $G = (U, E)$ be the directed graph of the Web. The node set of $G$ is equivalent to the URL set $U$. The edge set $E$ of $G$ is defined as follows: a node $u$ has a directed edge to another node $u'$ if $u' \in \mathrm{outlink}(u)$. We represent the corresponding edge with $(u, u')$.

When a URL is fetched, a directed subgraph of $G$ is expanded by a new node. Let $G_t$ be the directed subgraph of $G$ that the crawler has traversed until timestep $t$. For $t = 0$ it holds that $G_0 = (U_0, \emptyset)$, where $U_0$ is the seed set; i.e. the initial subgraph has its node set equal to the seed set and no edges. We denote the URL fetched at $t$ as $u^{(t)}$. Then, $G_t$ is expanded by node $u^{(t)}$ and edge $(u, u^{(t)})$, where $u$ is an already fetched URL with $u^{(t)} \in \mathrm{outlink}(u)$. Note that for each timestep $t$, $G_t$ is a forest of ordered trees.

Also, let the closure $Cl_t$, or crawled set as defined in [10.1145/2631775.2631795], be the set of fetched URLs until timestep $t$.

Definition 1 (Web Path).

We define the web path $\mathrm{path}(u^{(t)})$ of the URL $u^{(t)}$ fetched by the crawler at timestep $t$ as the path of the traversed subgraph $G_t$ from a given seed URL in the seed set $U_0$ to $u^{(t)}$.

From the definition of closure, the crawler fetches a URL at most once, so a URL has a unique web path for the whole crawling process. Let the frontier $F_t$, or crawl frontier, be the set of all available outlink URLs that the crawler has not fetched until timestep $t$, but were extracted from web pages fetched before $t$. More formally:

Definition 2 (Frontier).

We define the frontier at timestep $t$, given the closure $Cl_t$ (the set of URLs fetched until $t$), as:

$$F_t = \{(\mathrm{path}(u),\, u') \mid u \in Cl_t,\ u' \in \mathrm{outlink}(u),\ u' \notin Cl_t\}$$

In the above definition, notice that we also include the corresponding web paths of those URLs from which we extract the available URLs to fetch at timestep $t$.

At this point, we proceed with the MDP formulation; i.e. $(S, A, P, R)$. We denote a state at timestep $t$ as the directed subgraph of the Web graph $G$ that the crawler has traversed at $t$; i.e. $s_t = G_t$. We define the actions $A_t$ at timestep $t$ given $s_t$ (and thus the closure $Cl_t$ and the frontier $F_t$), as the set of all available edges through which $G_t$ can be expanded by one new node; that is:

$$A_t = \{(u, u') \mid u \in Cl_t,\ u' \in \mathrm{outlink}(u),\ u' \notin Cl_t\} \qquad (1)$$

Also, we define $P(s_{t+1} \mid s_t, a_t)$ as the transition probability of selecting the edge $a_t \in A_t$ to expand state $s_t$ in order to transition to a state $s_{t+1}$. Furthermore, we define a binary reward function $R$ that equals $1$ if the web page of the URL crawled at timestep $t$ is relevant and $0$ otherwise.

The initial state $s_0$ only consists of the seed URLs. A final state is not naturally defined since crawling is a non-episodic problem, where maximal rewards are ideally observed most of the time. Assuming that the crawling process stops after $T$ timesteps, $s_T$ is the final state.
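The formulation above can be illustrated with a small sketch. It assumes, following the prose, that an action is an edge from an already fetched URL to one of its unfetched outlinks; the function names, the dictionary-based graph and the toy URLs are hypothetical and are not taken from the TRES code base.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source URL, outlink URL)


def available_actions(outlinks: Dict[str, List[str]], closure: Set[str]) -> Set[Edge]:
    """Edges (u, v) where u has been fetched and v has not (assumed reading of Eq. 1)."""
    return {
        (u, v)
        for u in closure
        for v in outlinks.get(u, [])
        if v not in closure
    }


def step(closure: Set[str], action: Edge) -> Set[str]:
    """Fetch the target URL of the chosen edge, which grows the crawled subgraph by one node."""
    _, v = action
    return closure | {v}


# Toy web graph with hypothetical URLs.
outlinks = {"seed": ["a", "b"], "a": ["b", "c"], "b": ["seed"]}
closure = {"seed"}
actions = available_actions(outlinks, closure)

closure = step(closure, ("seed", "a"))
next_actions = available_actions(outlinks, closure)
```

Because the action set is recomputed from the closure after every fetch, the agent in this sketch can only pick edges that exist in the current state, which is the property the formulation emphasizes.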

From all the above, unlike other MDP formulations for focused crawling that regard action selection to be related to URL relevance estimation, to our knowledge, our approach is the first to preserve the successive order of state transitions. Also, observe that in our MDP, given a timestep $t$, selecting a URL $u'$ from the frontier $F_t$ is equivalent to selecting the action $(u, u')$ from $A_t$. Recall that $A_t$ is the set of all available actions at timestep $t$. Now we can match these actions to the elements of the frontier with URLs not in the closure $Cl_t$. Therefore, we can relate the traditional use of the frontier to the action set of an RL focused crawler agent for all timesteps.

3.3 Discovering Relevant Keywords

Keywords and seeds are often the crawler’s only prior knowledge about the target topic. Having a collection of keywords that are irrelevant to the target topic can lead the crawler to follow URLs that diverge from the topic of interest. In this paper, we assume that an initial small keyword set $KS_0$ is given as input, the keywords of which are all highly related to the target topic $C$. We propose a keyword expansion method that discovers new keywords from a corpus of text documents $D_C$.

In our setting, we consider all words of the texts in the corpus $D_C$ to be candidate keywords of the target topic. We utilize a word2vec model (http://vectors.nlpl.eu/explore/embeddings/en/) that has been trained on the whole Wikipedia corpus and represents each word $w$ as a vector $\mathbf{v}_w$ in a vector space $\mathbb{R}^d$. To decide whether a candidate word $w$ is a new keyword, we measure a semantic score $S(w)$, which equals the average cosine similarity of $w$ with each keyword $k$ in the initial keyword set $KS_0$.

To decide a new keyword, we define a keyword threshold $b$, so that if $S(w) > b$ then $w$ is regarded as a keyword and is stored in another keyword set $KS_c$. After all candidate keywords have been evaluated, we end up with a new set of keywords $KS = KS_0 \cup KS_c$. In this paper, we denote by $b$ the average cosine similarity of all $k \in KS_0$. More formally:

$$S(w) = \frac{1}{|KS_0|} \sum_{k \in KS_0} \cos(\mathbf{v}_w, \mathbf{v}_k), \qquad b = \frac{1}{|KS_0|} \sum_{k \in KS_0} S(k) \qquad (2)$$

We present our keyword expansion strategy in Algorithm 1. Intuitively, our keyword expansion strategy is highly selective, since a new keyword must have a greater score than some keywords in the initial keyword set. We also tested the TF-IDF schema for crawling [10.1007/s11280-015-0349-x] and TextRank [mihalcea-tarau-2004-textrank]. The latter approaches are not as selective as ours and resulted in retrieving many irrelevant keywords with low scores.

Input: Initial keyword set $KS_0$ of size $n$, a corpus of text documents $D_C$
      Output: the expanded keyword set $KS$

1: Initialize empty keyword set $KS_c$
2: Create a set $V$ of all words in $D_C$
3: for $w \in V$ do
4:      if $S(w) > b$ then  ▷ Equation (2)
5:          append $w$ in $KS_c$
6:      else
7:          continue
8:      end if
9: end for
10: $KS = KS_0 \cup KS_c$
Algorithm 1 Keyword Expansion Strategy
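As a rough illustration of Algorithm 1, the sketch below scores each candidate word by its average cosine similarity to the initial keywords and keeps those above a threshold. Two points are assumptions rather than statements from the paper: the threshold is computed as the mean score of the initial keywords (self-similarity included), and the comparison is strict. The 2-D embeddings are toy values standing in for word2vec vectors.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_score(word_vec, keyword_vecs):
    """Average cosine similarity of a word with every initial keyword."""
    return float(np.mean([cosine(word_vec, k) for k in keyword_vecs]))


def expand_keywords(initial, candidates, embeddings):
    """Return (expanded keyword list, threshold); threshold definition is an assumption."""
    kw_vecs = [embeddings[k] for k in initial]
    threshold = float(np.mean([semantic_score(v, kw_vecs) for v in kw_vecs]))
    new = [
        w for w in candidates
        if w in embeddings and w not in initial
        and semantic_score(embeddings[w], kw_vecs) > threshold
    ]
    return list(initial) + new, threshold


# Toy embeddings (hypothetical values, not real word2vec vectors).
embeddings = {
    "football": [1.0, 0.0],
    "tennis": [0.8, 0.6],
    "soccer": [0.9, 0.3],
    "banana": [0.0, 1.0],
}
expanded, threshold = expand_keywords(["football", "tennis"], ["soccer", "banana"], embeddings)
```

In this toy run, "soccer" lies between the two initial keywords and passes the threshold, while "banana" does not.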

3.4 Learning the Reward Function

In the RL setting, the agent adjusts its policy in order to learn the given task, which is often equivalent to maximizing the expected reward. Given a dataset of web pages $D$, a keyword set $KS$, a maximum text length $L$ and the word embedding space $\mathbb{R}^d$, we seek a classification function $f$ which plays the role of our reward function $R$.

We split $D$ into its relevant and irrelevant web pages and only assume realistic scenarios in which the relevant pages are the minority class. In our experiments, the majority (irrelevant) class has at least 9 times more instances than the minority class. We also assign target label $1$ to relevant pages and target label $0$ to irrelevant ones. To address this web page classification task, we propose KwBiLSTM, which is a Bidirectional Long Short-Term Memory (BiLSTM) [10.1007/11550907_126], empowered by the a priori knowledge of the expanded keyword set $KS$. Our model’s structure is shown in Figure 1.

Figure 1: The structure of KwBiLSTM

In the preprocessing phase of web page classification, we follow a standard procedure for each text (web page) in $D$: (a) we remove the words/phrases belonging to a list of stopwords, (b) we replace each remaining word with a token through tokenization, (c) we match each word token to a word2vec embedding vector in $\mathbb{R}^d$ and (d) we create a text matrix $M$ of $L$ rows, where $L$ is a fixed parameter of our algorithm that is common for all texts. $M$ may contain zero rows (zero vectors in $\mathbb{R}^d$), due to applying zero-padding in case the text consists of fewer than $L$ words. So for each text, $L$ embedding vectors are produced in total and gathered into $M$. In texts that contain more than $L$ words, we keep the first $L$ of them and discard the rest, on the grounds that a lot of useful information (such as the web page title) is usually written first in the web page text. To address the zero-padding process described above, KwBiLSTM leverages an additional Masking layer (before the BiLSTM layer).
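The padding and truncation steps can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name, the choice to skip out-of-vocabulary tokens and the toy 2-D embeddings are assumptions.

```python
import numpy as np


def text_matrix(tokens, embeddings, max_len, dim, stopwords=frozenset()):
    """Build a (max_len, dim) matrix: keep the first max_len usable tokens, zero-pad the rest."""
    # Assumption: tokens without an embedding are skipped rather than mapped to a special vector.
    kept = [t for t in tokens if t not in stopwords and t in embeddings][:max_len]
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, tok in enumerate(kept):
        mat[i] = embeddings[tok]
    return mat


# Toy embeddings (hypothetical values).
emb = {"cup": [1.0, 2.0], "final": [3.0, 4.0], "goal": [5.0, 6.0]}

short = text_matrix(["the", "cup", "final"], emb, max_len=4, dim=2, stopwords={"the"})
long_ = text_matrix(["cup", "final", "goal"], emb, max_len=2, dim=2)
```

The all-zero rows at the end of `short` are the ones a masking layer would be expected to ignore, while `long_` keeps only the leading words of the text.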

KwBiLSTM takes three inputs: an array of word embeddings, which is the word2vec representation of the candidate web page text, the Exhaustive Keyword Vector (EK) and the Limited Keyword Vector (LK). We explain EK and LK below. A BiLSTM is fed with the text matrix (using the word vectors as a sequence), in order to generate an embedding vector space of fixed dimensionality.

Specifically, is of fixed dimension which refers to the number of keywords estimated to appear in a relevant web page. In our experiments, is the average number of appearances of all in . Let be the i-th element of for a given web page . We define if has at least keyword appearances in its text, otherwise . Additionally, aggregates keyword information through a fixed dimension . Here, along with , a small keyphrase set may additionally be given as input. Similarly, let be the i-th element of for a given web page (text) . Also, let be the number of keyword appearances in , which is of text length . Then, , if any keyphrase found in , otherwise , and if any of the keywords are found as part of the URL, otherwise .
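A small sketch of the Exhaustive Keyword Vector may help. It reads the definition above as a thermometer-style encoding: the i-th element is 1 when the page contains at least i keyword appearances and 0 otherwise. That reading, the function name and the toy inputs are assumptions; the Limited Keyword Vector is not sketched here.

```python
def exhaustive_keyword_vector(tokens, keywords, dim):
    """EK[i-1] = 1 if the text has at least i keyword appearances, else 0 (assumed reading)."""
    count = sum(1 for t in tokens if t in keywords)
    return [1 if count >= i else 0 for i in range(1, dim + 1)]


# Toy example with hypothetical tokens and keywords.
tokens = ["world", "cup", "final", "tickets", "cup"]
ek = exhaustive_keyword_vector(tokens, keywords={"cup", "league"}, dim=4)
```

Here the keyword "cup" appears twice, so the first two elements are set and the remaining ones stay at zero.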

Intuitively, we note that the shortcut used in Concat2 (Figure 1) highlights the aggregative information of keywords found during prediction, even in cases where the text length is relatively small. We found that this is very useful in such cases, e.g. when predicting a web page utilizing only its title or an anchor text. We train the model by minimizing Cross-Entropy.

Last but not least, the use of the Mean Pooling Layer (Figure 1), instead of an Attention Mechanism, is based on the observation that a text is a finite collection of words, rather than a sequence of words. In our experiments, we observed that a change in the order of the words of a text does not affect our model’s performance in the binary classification task, and this was the main reason we opted for a Mean Pooling Layer, instead of an Attention Mechanism.

3.5 Focused Crawling with Reinforcement Learning

In this subsection, we describe our RL focused crawler agent and its use in learning good Q-value approximations for selecting available URLs from the frontier. Our approach is divided into three parts: (a) representing the states and the actions, (b) efficiently selecting URLs from the frontier and (c) training the agent with an RL algorithm.

3.5.1 State-Action Representation

Similarly to [10.1007/978-3-319-91662-0_20], we propose a shared state-action representation. Let $x_t$ be the shared representation vector for a given state $s_t$ and an action $a_t$ at timestep $t$. Recall that an available action $a_t$ (at timestep $t$) is related to a URL $u'$ in the frontier at timestep $t$, $F_t$, such that $a_t = (u^{(k)}, u')$, where $k$ is the timestep when the crawler fetched $u^{(k)}$, $k < t$.

To represent $s_t$ in $x_t$ given $a_t$, we simply aggregate information of $\mathrm{path}(u^{(k)})$; i.e. the web path of the URL $u^{(k)}$. Specifically, we use the following scalar state features: the reward received at $u^{(k)}$ (binary), the reciprocal of the distance of $u^{(k)}$ from the closest relevant node in $\mathrm{path}(u^{(k)})$ (continuous) and the relevance ratio of $\mathrm{path}(u^{(k)})$ (continuous).

To represent in , we use the following scalar features: the existence of keywords in the URL text (binary), the existence of keywords/keyphrases in the anchor text of the source page (binary) and a probability estimation of its relevance (continuous) given by KwBiLSTM (based on the web page title). We introduce hub features, which are two additional scalar action features: (a) the URL’s domain relevance ratio (continuous) until timestep and (b) the unknown domain relevance which is set to if the specific web site (domain) has not been visited before, otherwise . Hub features are based on the assumption that the crawler is more likely to fetch more relevant web pages trying to avoid less relevant or unknown domains. There is a trade-off between maximizing the harvest rate of web pages and maximizing the number of fetched domains. Experimentally, the use of Hub features served the first of these goals.

In the above setting, both state and action spaces are extremely large, and so it is more difficult to train an RL agent. Thus, we initially discretize (in the range of ) all continuous features of , except for the probability estimation of relevance, into buckets: , , …, . Experimentally, we noticed that contextually similar URLs corresponded to similar probability estimations of relevance scores, independently of how relevant their respective web pages were. This way, such a continuous feature seemed to contribute to better splits of Tree-Frontier, which we will discuss next.
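The bucketing of continuous features can be sketched as below. The number of buckets and the assumption that features are scaled to the unit interval are illustrative choices, since those specifics are not recoverable from the text here; the function name is hypothetical.

```python
def discretize(value, n_buckets=5):
    """Map a feature assumed to lie in [0, 1] to an equal-width bucket index in 0..n_buckets-1."""
    clipped = min(max(value, 0.0), 1.0)
    # The upper boundary 1.0 is folded into the last bucket.
    return min(int(clipped * n_buckets), n_buckets - 1)


indices = [discretize(v) for v in (0.0, 0.55, 1.0)]
```

Under this sketch, URLs whose relevance ratios fall in the same bucket share identical values for that feature, which shrinks the number of distinct state-action vectors the agent has to handle.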

3.5.2 Training with Double Deep Q-Learning

In our setting, the focused crawler is a DDQN agent existing in the MDP described previously. Let $Q(x; \theta)$ be the neural network estimation of the state-action value function of the agent’s policy for a shared state-action vector $x$. Specifically, let $\theta$ be the parameters of the online Q-Network and $\theta^-$ be the parameters of the target Q-Network of DDQN. Then, given a record $(x_t, r_{t+1}, X_{t+1})$ from the Experience Replay buffer $B$, the target used by DDQN can be written as

$$Y_t = r_{t+1} + \gamma\, Q\big(\arg\max_{x \in X_{t+1}} Q(x; \theta);\ \theta^-\big)$$

where $\gamma$ is the discount factor and $X_{t+1}$ is the set of state-action vectors of the actions extracted at timestep $t+1$. We initialize $B$ leveraging the experience from the seeds, using zero state features in their state-action vectors, as if they were reached from the empty graph. This initialization can speed up training in cases where a lot of seeds are given as input and/or positive rewards are sparse during exploration. Next, we minimize the squared error $(Y_t - Q(x_t; \theta))^2$ with respect to $\theta$ by performing gradient descent steps on mini-batches of $B$. We note that the function approximators used to produce $Q$ for both the online and target Q-Networks are standard Multilayer Perceptrons with two hidden layers.
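The Double DQN target for a shared state-action representation can be sketched as follows: the online network picks the best next state-action vector and the target network evaluates it. The callables `q_online` and `q_target` are hypothetical stand-ins for the two networks, and returning the bare reward when no next action exists is an assumption.

```python
import numpy as np


def ddqn_target(reward, next_vectors, q_online, q_target, gamma=0.99):
    """Double DQN target: argmax under the online network, value from the target network."""
    if len(next_vectors) == 0:
        # Assumption: with no available next action, the target is the reward alone.
        return float(reward)
    online_q = np.array([q_online(x) for x in next_vectors])
    best = next_vectors[int(np.argmax(online_q))]
    return float(reward + gamma * q_target(best))


# Toy "networks": the online net reads feature 0, the target net reads feature 1.
q_online = lambda x: x[0]
q_target = lambda x: x[1]
next_vectors = [np.array([0.2, 5.0]), np.array([0.9, 1.0])]

y = ddqn_target(1.0, next_vectors, q_online, q_target, gamma=0.5)
```

In the toy run the online network prefers the second vector, so the target network's value for that vector (not its own maximum of 5.0) enters the target, which is the decoupling Double DQN relies on to reduce overestimation.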

3.6 Synchronous Frontier Update

What we have not described so far is which action is selected at a given timestep. In a common Deep Q-learning setting, an $\epsilon$-greedy policy is often used [Mnih2015HumanlevelCT, 10.5555/3016100.3016191], which results in calculating the estimated Q-values of all actions of a given state.

In a focused crawling setting, it is common to implement the frontier as a priority queue, with the priority values being the estimated Q-values of actions. In that case, for each timestep, updating requires a synchronous update [10.1007/978-3-319-91662-0_20]. Let be the time of a single DDQN prediction and be the number of new URLs inserted in the frontier at timestep . Then, as the next theorem indicates, a synchronous update is costly and, thus, impractical.

Theorem 1.

Assuming and let be the frontier instance because of seeds , then a synchronous update has an overall time complexity of

3.7 Updating and Sampling through Tree-Frontier

To reduce the time complexity of a synchronous update, we introduce Tree-Frontier, which is a two-fold variation of the CART algorithm. Decision trees are traditionally used to find a discretization of a large state space by recursively growing a state tree [10.5555/295240.295802]. A state tree leaf represents a partition of the initial state space $S$. As height increases through a new splitting rule on a tree node, narrower convex regions are created, in which the agent behaves in a predictable way.

To group possible frontier samples and make fewer predictions, we borrow concepts from Explainable Reinforcement Learning (XRL), where decision trees are often used to provide interpretable policy approximation [Wu2018BeyondST, Liu2018TowardID, Bewley_Lawry_2021]. Specifically, Bewley and Lawry [Bewley_Lawry_2021] proposed a binary decision tree for MDP state abstraction through a linear combination of three impurity measures: action (in $A$) selected, expected sum of rewards, and temporal dynamics. This way, they identify convex regions of $S$ that are similar from all three above perspectives. Furthermore, to address the problem of allocating memory for huge continuous state spaces, Jiang et al. [Jiang2021AnER] utilized a decision tree for model learning that represented the Experience Replay buffer for generating simulated samples by using the variance of the continuous states and their average rewards.

In our setting, we are interested in discretizing both the state and action spaces, with the latter being a theoretically continuous vector space. Also, we encounter two different kinds of samples. For a given timestep $t$, let experience samples be those that were selected at previous timesteps $t' < t$, and thus we have received their respective rewards, and let frontier samples be those belonging to the frontier $F_t$. Combining the above, let $\mathcal{E}_t = \{(x_i, r_i)\}$, where $x_i$ is a state-action vector and $r_i$ its observed reward, be the set of experience samples (including seeds) at timestep $t$. Similarly, let $\mathcal{F}_t$ be the dataset of frontier samples at timestep $t$ given $F_t$.

We propose an online binary decision tree algorithm, called Tree-Frontier, in order to efficiently represent and manipulate the frontier. As mentioned before, Tree-Frontier is two-fold; a tree node stores both experience samples of $\mathcal{E}_t$ and frontier samples of $\mathcal{F}_t$ for a given timestep $t$. Given $\mathcal{E}_t$, Tree-Frontier is used to predict the agent’s reward for a given state-action vector. It uses experience samples in order to split nodes, utilizing their rewards as target labels. Let $\mathcal{E}_\ell$ be the subset of experience samples of $\mathcal{E}_t$ that belong to a tree leaf $\ell$. To split this leaf, we seek binary partitions $(\mathcal{E}_\ell^{L}, \mathcal{E}_\ell^{R})$, such that for some state-action feature $f$ and numerical threshold $c$:

$$\mathcal{E}_\ell^{L} = \{(x, r) \in \mathcal{E}_\ell : x^{(f)} < c\}, \qquad \mathcal{E}_\ell^{R} = \mathcal{E}_\ell \setminus \mathcal{E}_\ell^{L} \qquad (3)$$

where $x^{(f)}$ is the f-th element of the state-action vector $x$. Let $V(\ell)$ be the reward sample variance of node $\ell$. Unlike CART, to assess a candidate partition of $\ell$, we calculate the weighted sample variance reduction $VR$ of the rewards in $\ell$, which is a splitting criterion that has been utilized in XRL [Liu2018TowardID]. That is:

$$VR = V(\mathcal{E}_\ell) - \frac{|\mathcal{E}_\ell^{L}|}{|\mathcal{E}_\ell|} V(\mathcal{E}_\ell^{L}) - \frac{|\mathcal{E}_\ell^{R}|}{|\mathcal{E}_\ell|} V(\mathcal{E}_\ell^{R}) \qquad (4)$$

Here $V(\cdot)$ represents the reward sample variance of a node. Unlike CART, which follows a depth-first growth strategy, for each new experience sample in leaf $\ell$, Tree-Frontier checks for candidate partitions only in $\ell$. If no partition with positive $VR$ exists, no split is made. Otherwise, we select the partition of $\ell$ that causes the highest reduction. In this sense, Tree-Frontier follows a best-first (online) strategy.
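The splitting criterion can be sketched for a single leaf as follows. This is an illustration under stated assumptions, not the TRES implementation: candidate thresholds are midpoints between consecutive distinct feature values, the variance is the population variance (`np.var` default) so that one-sample children are well defined, and the function names are hypothetical.

```python
import numpy as np


def variance_reduction(rewards, left_mask):
    """Weighted reduction in reward variance obtained by splitting a leaf with left_mask."""
    rewards = np.asarray(rewards, dtype=float)
    left, right = rewards[left_mask], rewards[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(rewards)
    return float(np.var(rewards)
                 - len(left) / n * np.var(left)
                 - len(right) / n * np.var(right))


def best_split(X, rewards):
    """Return (gain, feature index, threshold) of the best split, or (0.0, None, None)."""
    X = np.asarray(X, dtype=float)
    best = (0.0, None, None)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for c in (values[:-1] + values[1:]) / 2:  # midpoints as candidate thresholds
            gain = variance_reduction(rewards, X[:, f] < c)
            if gain > best[0]:
                best = (gain, f, float(c))
    return best


# Toy leaf: feature 0 separates the rewards perfectly, feature 1 carries no signal.
X = [[0.1, 1.0], [0.2, 0.0], [0.8, 1.0], [0.9, 0.0]]
rewards = [0, 0, 1, 1]
gain, feature, threshold = best_split(X, rewards)
```

A leaf is split only when the best gain is strictly positive, which mirrors the rule that no split is made when no partition reduces the reward variance.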

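Input: Tree-Frontier $TF$, experience samples $\mathcal{E}$, frontier samples $\mathcal{F}$, new experience sample $e$, new frontier samples $\mathcal{F}_{new}$
      Output: a state-action $x$, $\mathcal{E}$ and $\mathcal{F}$

1: Update $\mathcal{E} = \mathcal{E} \cup \{e\}$
2: Update $TF$ from $e$: check for split on leaf $\ell$, that contains $e$, using (3) and (4)
3: Update $\mathcal{F}$ by inserting each sample in $\mathcal{F}_{new}$ to a leaf in $TF$ following the tree rules
4: Select a frontier sample from each leaf through uniform sampling
5: if exploration mode then select $x$ uniformly at random among the selected samples
6: else: select $x$ as the selected sample with the highest estimated Q-value
7: end if
Algorithm 2 Tree-Frontier update for a timestep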

The frontier now has a decision tree representation, where each frontier sample falls into some leaf. At a given timestep $t$, instead of calculating the estimated Q-values of all frontier samples, we select one representative from each leaf through uniform sampling. Thus, we calculate the estimated Q-values of only those representatives and select the action with the highest Q-value. We call this procedure tree-frontier update. We present the tree-frontier update algorithm for a given timestep in Algorithm 2.
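The selection step can be sketched as below, assuming the leaves are available as plain lists of state-action vectors. The epsilon-greedy exploration over the representatives, the function name and the toy data are assumptions made for illustration.

```python
import random


def select_action(leaves, q_value, epsilon=0.1, rng=random):
    """Pick one representative per non-empty leaf, then choose among the representatives only."""
    reps = [rng.choice(leaf) for leaf in leaves if leaf]
    if not reps:
        return None
    if rng.random() < epsilon:
        return rng.choice(reps)       # exploration: a random representative
    return max(reps, key=q_value)     # exploitation: highest estimated Q-value


calls = []

def q_value(x):
    """Toy Q-function that also records how many vectors were evaluated."""
    calls.append(x)
    return x[0]


# Three leaves holding five frontier samples in total; one leaf is empty.
leaves = [[(0.1,), (0.1,), (0.1,)], [(0.9,), (0.9,)], []]
chosen = select_action(leaves, q_value, epsilon=0.0, rng=random.Random(0))
```

Only two Q-value evaluations are made for five frontier samples in this toy run, which is the saving the tree representation is meant to provide: the number of predictions scales with the number of leaves rather than with the frontier size.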

Theorem 2.

A tree-frontier update has an overall time complexity of .

Thus, we have shown that a tree-frontier update always has a better overall time complexity than a synchronous update. We present the proofs of the above two theorems, along with the mathematical analysis of TRES, in the Appendix. The full focused crawling procedure of the proposed TRES framework is presented in Algorithm 3.

Input: Seed set , initial keyword set , a dataset of web pages , the total number of URL fetches
                    the maximal number of domain visits
      Output: a list of the fetched URLs

1:KS = KeywordExpansionStrategy(, ) Algorithm 1
2:Learn the reward function , by training a KwBiLSTM using and
3:Initialize state leveraging the experience from the relevant URLs in
4:Initialize experience samples and frontier samples using the seed set
5:Initialize closure and Tree-Frontier using and
6:Initialize timestep counter
7:Initialize and
8:Initialize Experience Replay buffer with experience samples
9:Initialize an empty list
10:while  do
11:     Train the DDQN agent on a minibatch of
12:     Get a state-action , and = TreeFrontierUpdate(, , , , ) s.t:
13:             URL and condition holds Algorithm 2
14:     Transition to a new state by expanding with the edge
15:     Fetch URL and observe its reward
16:     Extract outlink URLs to create new frontier samples
17:     Create new experience sample
18:     Update experience replay buffer with
19:     Update closure with the new fetched URL
20:     Append to
21:     
22:end while
Algorithm 3 TRES

4 Experimental Evaluation

In this section, we demonstrate the effectiveness of TRES through experiments on real-world data and comparisons with state-of-the-art crawlers. Experimentally, we study the following topics: (a) Sports, (b) Food, (c) Hardware, all belonging to the Open Directory Project (Dmoz, https://dmoz-odp.org/), which has been widely used for crawling evaluation [10.1007/s11280-015-0349-x, 10.1007/s11280-015-0331-7, Elaraby2019ANA, BATSAKIS20091001]. Dmoz indexes about five million URLs covering a wide range of topics [10.1007/s11280-015-0349-x]. Note that the topics are not equally represented by URLs in Dmoz; we assume that the less represented a topic is in Dmoz, the more difficult it is to crawl. Therefore, we expect the Hardware domain to be the most difficult of the three.

Unlike common (but less realistic) approaches [10.1145/1242572.1242632, Suebchua2017EfficientTF], for each experiment we utilize a single seed and average results from 10 different single-seed crawling runs. To select seeds, we used URLs from Google, such that they are not immediately mutually connected on the Web. In other words, none of these seeds belongs to the outlink URL set extracted from any other seed.

Similarly to [10.1145/3308558.3313709], the classification training set of irrelevant web pages, , consists of URLs from 10 Dmoz supertopics, so that there is only a small chance these pages belong to the target topic . More specifically, we use approximately and samples for the training set of relevant web pages, , and for each supertopic in , respectively.

4.1 Evaluation Metrics

To evaluate fetched URLs, our goal is to maximize the widely used harvest rate (HR) [CHAKRABARTI19991623] of fetched web pages. More formally:

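$$\mathrm{HR} = \frac{\text{number of relevant fetched web pages}}{\text{number of fetched web pages}} \tag{5}$$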

Observe that in our binary reward function setting the harvest rate is equivalent to the agent’s average cumulative reward, and thus to an RL metric [app11135826]. Also, we discuss the number of fetched web sites (domains), considering that a web site is relevant if it contains at least one relevant web page. Last but not least, we examine the efficiency of Tree-Frontier, in comparison to the use of synchronous updates on selecting the best action from the frontier.
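The equivalence between harvest rate and average reward, as well as the domain-level count, can be sketched in a few lines of Python. The helper names and the example URLs are hypothetical, and treating the URL's network location as its domain is an assumption of this sketch.

```python
from urllib.parse import urlparse


def harvest_rate(rewards):
    """Fraction of fetched pages that are relevant (mean of binary rewards)."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def relevant_domains(fetched):
    """Domains containing at least one relevant page.

    fetched: list of (url, reward) pairs with reward in {0, 1}.
    Assumption: the network location of the URL identifies the web site.
    """
    return {urlparse(url).netloc for url, reward in fetched if reward == 1}


fetched = [
    ("https://a.example/x", 1),
    ("https://a.example/y", 0),
    ("https://b.example/", 0),
    ("https://c.example/p", 1),
]
hr = harvest_rate([reward for _, reward in fetched])
domains = relevant_domains(fetched)
```

Here two of the four fetched pages are relevant, so the harvest rate is 0.5, and two of the three visited domains count as relevant.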

| Classifier | Sports Precision | Sports Recall | Sports F-Macro | Food Precision | Food Recall | Food F-Macro | Hardware Precision | Hardware Recall | Hardware F-Macro |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla BiLSTM | 90.1 | 88.3 | 93.2 | 85.6 | 88.2 | 92.6 | 78.4 | 85.0 | 88.3 |
| ACHE's SVM | 76.9 | 88.5 | 90.4 | 68.5 | 81.8 | 86.7 | 62.6 | 77.0 | 83.9 |
| KwBiLSTM | 91.3 | 90.7 | 95.4 | 87.0 | 90.8 | 94.2 | 80.0 | 86.2 | 90.8 |

Table 1: Classification Results: Precision (%) (relevant class), Recall (%) (relevant class) and Macro-average F-Score (%)

4.2 State-Of-The-Art and Baselines

We compare TRES to the following state-of-the-art methods:

(a) ACHE [10.1145/1242572.1242632]: is one of the most well-known focused crawlers, which aims to maximize the harvest rate of fetched web pages through an online-learning policy. ACHE has been widely used in crawling evaluation, such as in [10.1145/3308558.3313709, 10.1093/bioinformatics/btt571] and in textual data collection [10.1145/3159652.3159724].

(b) SeedFinder (SF) [10.1007/s11280-015-0331-7]: extracts new seeds and keywords from relevant pages to derive search queries.

As baseline methods we examine:

(a) Tree-Random (TR): is similar to TRES, with the exception of selecting each action from a random leaf. In other words, Tree-Random only uses exploration and thus it selects actions with uniform sampling (line 5 in Algorithm 2). Note that Tree-Random's frontier is a Tree-Frontier, exactly the same as that of TRES.

(b) Random (R): selects URLs from the frontier at random. It is equivalent to Tree-Random, with the only difference that the tree height is fixed and equals 1 for the whole crawling process.

From all the above, observe that TRES is a generalization of both baseline methods. We are interested in measuring the performance of Tree-Random, since it indicates to some extent the effectiveness of Tree-Frontier on discretizing the state and action spaces. Intuitively, assume an outcome of harvest rates from a crawling experiment, where Tree-Random achieves and Random achieves . Then, it means that the true percentage of relevant URLs were on average in the frontier, yet the probability of selecting a leaf of Tree-Frontier leading to a relevant URL is . Thus, in such a scenario, Tree-Frontier generated a lot of leaves, in which the agent demonstrated on average the desirable behavior, despite the fact that there were on average 19 times more actions, leading to not relevant URLs, in the frontier.

4.3 Keyword Expansion Evaluation

In all settings, we initialize the starting keyword set with a few keywords provided by Dmoz. We initialize with 62, 10 and 11 keywords in the Sports, Food and Hardware domain, respectively. The results from keyword expansion are represented in Table 2. We show that the discovered keywords (keyword set ) were at least 1400% more frequent (on average) in web pages of the target topic than in irrelevant web pages of . In Food and Hardware settings includes their supertopics (Recreation and Computers) with respective mean keyword counts equal to and . Thus, the discovered keywords are at least 160% more frequent in than in their supertopics (if existent).

Moreover, notice that our method managed to discover approximately 20 and 10 times more keywords than the size of in the Food and Hardware settings, respectively, despite the relatively small number of initial given keywords. On the other hand, in the Sports setting our method managed to retrieve 32 new keywords, and thus to enlarge only by 50%. Thus, these results support the highly selective behavior of our method, because a large initial leads (on average) to lower CS scores (than threshold ) for the candidate keywords.
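The selectivity argument above can be made concrete with a toy sketch. The exact scoring rule belongs to Algorithm 1, which lies outside this section, so the rule used here (accept a candidate when its mean cosine similarity to the current keyword embeddings reaches a threshold) is an assumption, as are the function names, the two-dimensional embeddings, and the threshold value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(candidate, keyword_vectors):
    """Assumed score: average cosine similarity to the current keyword set."""
    return sum(cosine(candidate, k) for k in keyword_vectors) / len(keyword_vectors)


def accept(candidate, keyword_vectors, threshold):
    return mean_similarity(candidate, keyword_vectors) >= threshold


# Toy 2-D embeddings: a small, focused keyword set versus a larger, broader one.
small_set = [(1.0, 0.0), (0.9, 0.1)]
large_set = small_set + [(0.0, 1.0), (0.1, 0.9)]
candidate = (1.0, 0.2)
threshold = 0.8  # hypothetical value

accepted_small = accept(candidate, small_set, threshold)
accepted_large = accept(candidate, large_set, threshold)
```

Under this assumed rule, the same candidate passes against the small focused set but fails against the larger, more diverse one, because averaging over more heterogeneous keywords lowers its score.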

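| | Sports | Food | Hardware |
|---|---|---|---|
| Mean keyword count (relevant web pages) | 19.55 | 15.46 | 17.66 |
| Mean keyword count (irrelevant web pages) | 0.86 | 1.04 | 1.44 |
| # initial keywords | 62 | 10 | 11 |
| # discovered keywords | 32 | 205 | 129 |

Table 2: Mean keyword count of the discovered keywords in relevant and irrelevant web pages, number of initial keywords and number of discovered keywords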

4.4 Web Page Classification Evaluation

To evaluate the proposed web page classification model, KwBiLSTM, we follow a stratified 5-fold cross validation setting on the labeled data set. As depicted in Table 1, we measure Precision and Recall of the relevant class and Macro-average F-Score (F-Macro). We use ACHE's SVM and a Vanilla BiLSTM as baseline methods. In all three settings, KwBiLSTM outperforms both baselines in all evaluation metrics, while ACHE's SVM performs the worst in every case except Recall in the Sports setting. Notice that our model performs worse in the Hardware setting than in the other two, because the irrelevant class includes the Computers topic, the web pages of which consistently include relevant keywords (6.67 on average).
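For reference, the three reported metrics can be computed as in the following pure-Python sketch. The function names and the toy labels are illustrative only, and the sketch does not reproduce the stratified cross-validation protocol.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Toy labels: 1 = relevant, 0 = irrelevant.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, _ = precision_recall_f1(y_true, y_pred, positive=1)
f_macro = macro_f1(y_true, y_pred)
```

In this toy case the relevant-class precision and recall are both 2/3, while the macro-average F-score is higher (about 0.73) because the irrelevant class is classified more accurately. A macro-average can thus exceed both relevant-class scores, which is the pattern seen in Table 1.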

| Focused Crawler | Sports HR | Sports Domains | Food HR | Food Domains | Hardware HR | Hardware Domains |
|---|---|---|---|---|---|---|
| ACHE | 54.43 | 2081 | 60.45 | 2153 | 49.25 | 940 |
| SeedFinder | 56.05 | 1842 | 60.16 | 3642 | 38.68 | 1314 |
| Random | 3.55 | 17 | 8.97 | 59 | 1.66 | 28 |
| Tree-Random | 60.01 | 204 | 36.60 | 523 | 40.63 | 354 |
| TRES | 97.43 | 83 | 95.59 | 78 | 93.64 | 55 |
| ACHE_100 | 47.57 | 4574 | 50.85 | 3820 | 42.18 | 1923 |
| SeedFinder_100 | 49.97 | 4032 | 57.49 | 5043 | 31.14 | 2255 |
| TRES_100 | 94.19 | 286 | 97.55 | 301 | 88.40 | 420 |
| ACHE_10 | 36.79 | 6084 | - | - | - | - |
| SeedFinder_10 | 37.70 | 5774 | 46.96 | 6742 | 19.63 | 2832 |
| TRES_10 | 84.52 | 3084 | 94.56 | 3189 | 63.14 | 1977 |
| TRES_5 | 78.56 | 4053 | 90.98 | 4324 | 46.57 | 2956 |

Table 3: Focused Crawling Evaluation Results

4.5 Focused Crawling Evaluation

For crawling evaluation, we use the trained KwBiLSTM as ground truth to classify the fetched web pages. Table 3 presents the focused crawling results for all three topics, in terms of harvest rate and the number of domains fetched. Considering the constraint that the crawler can fetch at most a fixed number of web pages from a domain (web site), we divide our experiments into three categories according to this limit: (a) rows 1-5, (b) rows 6-8 and (c) rows 9-12 of Table 3, where a numeric suffix in a method name indicates the limit used. As depicted, TRES consistently outperforms the state-of-the-art in harvest rate by at least 58% in (a), 94% in (b) and 110% in (c). Our baseline Tree-Random performs on a par with both ACHE and SeedFinder in HR. Specifically, Tree-Random outperforms both state-of-the-art methods in the Sports domain and only SeedFinder in the Hardware domain. Also, it consistently outperforms the Random crawler in both harvest rate and total fetched domains. The harvest rate score of Tree-Random reflects the relevance ratio of the leaves of Tree-Frontier, which appears higher than the relevance ratio of a traditional frontier. The lower harvest rate scores of the state-of-the-art methods could be attributed to the fact that their web page classifiers were less accurate than our KwBiLSTM and thus their crawling strategy diverged a lot from the target topic.
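The per-domain constraint used to define the experiment categories can be enforced with a small counter, as in the sketch below. The class name `DomainBudget` and its methods are hypothetical, and identifying a domain by the URL's network location is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainBudget:
    """Tracks fetched pages per domain and enforces a maximum per domain."""

    def __init__(self, max_pages_per_domain):
        self.max_pages = max_pages_per_domain
        self.counts = Counter()

    def allows(self, url):
        """True if another page may still be fetched from this URL's domain."""
        return self.counts[urlparse(url).netloc] < self.max_pages

    def record(self, url):
        """Register a fetched URL against its domain."""
        self.counts[urlparse(url).netloc] += 1


budget = DomainBudget(max_pages_per_domain=2)
for url in ["https://a.example/1", "https://a.example/2"]:
    if budget.allows(url):
        budget.record(url)

a_still_allowed = budget.allows("https://a.example/3")
b_still_allowed = budget.allows("https://b.example/1")
```

A crawler would filter candidate actions with `allows` before selecting one, so a lower limit forces it to spread its fetches over more domains.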

Figure 2: Frontier Sampling: Number of Tree-Frontier leaves over time (top left), Frontier size over time (top right), Frontier size to Tree-Frontier leaves number ratio (bottom)

Furthermore, we demonstrate the effect of the per-domain limit on the trade-off between harvest rate and the number of domains fetched. The state-of-the-art methods proved to be more exploratory and discovered a greater number of relevant domains than TRES. However, TRES was optimized neither for fetching different domains nor for discovering further seeds, as SeedFinder was. On the other hand, as the limit decreases, TRES discovers more relevant domains and performs better with respect to the state-of-the-art methods in this objective, while it consistently leads in harvest rate. Specifically, in category (c), ACHE aborted in most cases (due to running out of URLs to crawl) and thus we could not provide its average results. We note that similar experimental problems of ACHE, namely when it is bootstrapped by a small seed set, have also been observed in previous work [10.1145/3308558.3313709]. Notice that in this setting, in the Hardware topic, TRES (TRES_5) manages to outperform all evaluated methods both in the harvest rate and in the number of fetched domains.

4.6 Tree-Frontier vs Synchronous Update

The use of Tree-Frontier allows us to efficiently sample from the frontier, reducing the execution time by orders of magnitude, while at the same time preserving the high performance of a synchronous update, which exhaustively examines every frontier sample in each timestep. While tuning the hyperparameters of the TRES algorithm, the size of the tree was gradually increased until no significant improvement was observed in the evaluation metrics. After that point, the performance reached a plateau with respect to the tree size. This allows us to compare with the synchronous update approach, as the performance in terms of the evaluation metrics is similar, but the execution time is significantly lower in TRES. In particular, as we can see in Figure 2, the size of the subset of the frontier that TRES examines at each timestep is between 200 and 300 times smaller than the full frontier size. This implies an analogous reduction in execution time, which however was not directly measured, because running the synchronous update would be computationally impractical, considering that (the much faster) TRES took several hours to execute on our machines.

In Figure 2, we can also observe that the dependence of the frontier size and of the number of leaves (in Tree-Frontier) on the number of timesteps is not exactly linear. In the case of the full frontier (synchronous update), this is attributed to the variance in the number of outlinks that a random web page could contain. There is an oscillation around the mean outlink number, which would ideally be equal to the slope of the curve (see Figure 2, subplot (b)), had there been no variance in the random variable of the number of outlink URLs (of a web page). In the case of Tree-Frontier, the variance of the slope is due to the fact that at some timesteps no split is performed and, thus, the number of leaves remains the same. At other timesteps, binary splits may occur, increasing the size of the set of the candidate frontier samples that we examine.

It is worth mentioning that the Tree-Frontier approach could be employed in other RL settings, as well. One challenge in RL is the increasingly large action space, which should ideally be examined exhaustively at each timestep. Tree-Frontier effectively tackles this problem by reducing the number of actions that are examined at each timestep, that is the number of samples for which Q-values are calculated. Frontier sampling is performed in a stratified way utilizing a sort of clustering through an online discretization of both large state and action spaces. Therefore, in an RL setting with rampant action space growth (frontier), the Tree-Frontier algorithm could potentially be preferable to an exhaustive calculation of Q-values or alternative methods.

5 Conclusion

In this paper, we proposed an end-to-end RL focused crawling framework that outperforms other state-of-the-art methods on the harvest rate of web pages. We formalized the MDP for focused crawling, overcoming previous limitations, and improved the impractical time complexity of the synchronous update through the Tree-Frontier algorithm that we introduced.

As future work, we would like to investigate an exploration bonus in the reward function, in order to enhance the agent's exploration towards discovering a greater number of relevant domains. An additional open problem is minimizing the size of the labeled data set used for training the classifier, which we use to estimate the reward function.