1 Introduction
Focused crawlers [CHAKRABARTI19991623] are intelligent agents that, given a defined topic of interest, devise a strategy to search for relevant web pages. Such agents are widely used to collect vast amounts of topic-related data. Typically, a focused crawler starts from an initial set of highly relevant URLs, called seeds. The crawler uses the seeds as a bootstrap, extracting their outlink URLs and storing them in a URL collection called the frontier [10.1145/988672.988714]. The agent selects the best URL from the frontier, visits its web page and classifies it as relevant or irrelevant to the user's interests. Its outlink URLs (hyperlinks) are then stored in the frontier, and the crawling strategy is adjusted based on relevance feedback. The total number of fetched URLs is user-defined, so the whole process terminates when that number of URL selections is reached.
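The crawl loop described above can be sketched as follows. This is a minimal, illustrative implementation, not the paper's algorithm: `fetch`, `classify` and `extract_outlinks` are placeholder callables, and the URL-selection rule is reduced to picking the highest classifier score.

```python
def focused_crawl(seeds, fetch, classify, extract_outlinks, max_fetches):
    """Minimal focused-crawl loop: seeds bootstrap a frontier of candidate
    URLs; the best-scored URL is fetched, classified, and its unseen
    outlinks are added back to the frontier until the budget is spent."""
    frontier = list(seeds)          # candidate URLs not yet fetched
    closure = set()                 # URLs fetched so far
    relevant = []
    while frontier and len(closure) < max_fetches:
        # select the most promising URL (here: highest classifier score)
        url = max(frontier, key=classify)
        frontier.remove(url)
        closure.add(url)
        page = fetch(url)
        if classify(url) >= 0.5:    # relevance feedback on the fetched page
            relevant.append(url)
        frontier.extend(u for u in extract_outlinks(page)
                        if u not in closure and u not in frontier)
    return relevant, closure
```

In a real crawler the selection step is the interesting part; the sections below replace the plain classifier score with an RL policy.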
Many approaches, such as [Elaraby2019ANA, RAJIV20213, SALEH2017181], have proposed the use of classifiers that identify whether a given web page is relevant to the target topic. Such classifiers can be used to (a) identify the relevant web pages (in the form of URLs) in the frontier, assigning them high crawling priorities, (b) give relevance feedback to the crawler's strategy after it visits a web page and (c) evaluate the crawled data after a run ends.
An alternative way of measuring the quality of an outlink URL is to apply Reinforcement Learning (RL). Unlike supervised learning methods, RL-based focused crawlers are able to identify web pages (or websites), called hubs [Rennie99efficientweb], that are likely connected to relevant URLs on the Web. When no relevant URLs exist in the frontier, a good RL strategy may exploit hubs to reach new relevant web pages, whereas a classification method ignores such information.
In this paper we propose a novel focused crawling framework for textual Web data, called TRES (Tree REinforcement Spider). Our crawler agent follows the typical focused crawling framework described above, adopting an RL policy for selecting URLs. We take into account the semantic similarities between candidate keywords and construct small vector representations of candidate URLs, summarizing various semantic and statistical information. We use this URL representation as a shared state-action representation.
Furthermore, we introduce TreeFrontier, which is a twofold decision tree, utilizing the rewards from experience as target labels. Since a decision tree splits on feature values, under a defined criterion, each leaf node corresponds to a local neighborhood of the input space. This way, in TreeFrontier, a leaf stores a set of similar frontier samples, corresponding to outlink URLs (in the frontier), along with similar experience samples, corresponding to fetched web pages.
We use a Double Deep Q-Network (DDQN) [10.5555/3016100.3016191] focused crawler agent, though other model-free RL agents that estimate Q-values from a state-action input are also applicable. In summary, this paper makes the following contributions:

Our TRES outperforms other state-of-the-art focused crawlers on three evaluated domains, in terms of the harvest rate of fetched web pages, using no more than a single seed URL.

We introduce the TreeFrontier algorithm for efficiently and effectively selecting which URLs to fetch through discretization of the large state and action spaces.

TreeFrontier improves the time complexity of a synchronous update [10.1007/9783319916620_20] (i.e. selecting the best action, given the state, by calculating the Q-values of all available actions), discarding from the time complexity a factor that accounts for the minimal number of new URLs inserted into the frontier at each timestep.

We introduce a novel Markov Decision Process (MDP) formulation for focused crawling with RL, in which, unlike previous approaches, the agent necessarily follows the successive order of state transitions.
The remainder of the paper is organized as follows. In Section 2, we give a brief overview of related work. In Section 3, we present the problem definition and an overview of our framework. Our experimental evaluation and comparisons with baseline and state-of-the-art methods are presented in Section 4. In Section 5, we discuss our conclusions and some future steps.
2 Related Work
Chakrabarti et al. [CHAKRABARTI19991623] introduced focused crawling on target domains. Focused crawlers often utilize two classifiers: an apprentice and a critic [10.1145/511446.511466]. The critic classifies web pages, while the apprentice learns (from the critic's feedback) to distinguish the most promising outlink URLs. Suebchua et al. [Suebchua2017EfficientTF] introduced the "Neighborhood feature", which exploits the relevance of all already fetched web pages of the same domain (web site), in order to select URLs belonging to that domain more effectively. This approach assumes that a web page is likely to be relevant if many relevant web pages have been discovered in its neighborhood, which agrees with the empirical evidence of topical locality on the Web shown by Davison [10.1145/345508.345597].
Search-based Discovery methods issue queries to search engine APIs (e.g. Google and Bing) and adapt a strategy of query improvement. SeedFinder [10.1007/s1128001503317] tries to find new seeds that best bootstrap the crawling process of an arbitrary focused crawler, by executing queries on search engines using terms (keywords) extracted from relevant texts. Liakos et al. [10.1007/s112800150349x] expanded a keyword collection by selecting the keyword with the highest TF-IDF score to form a query. By executing this query on a search engine, new candidate keywords were retrieved, and the ones with the highest scores were stored in the collection. We adopt a similar keyword discovery strategy, except that we use the cosine similarities of word2vec [DBLP:journals/corr/abs13013781] embeddings of candidate keywords instead of their TF-IDF scores.
Crawling-based Discovery methods, on the other hand, exploit the link structure to discover new content by automatically fetching and iteratively selecting URLs extracted from the visited web pages. They can be divided into (a) focused (forward) and (b) backward crawling methods. Backward crawling methods, e.g. the BIPARTITE crawler [barbosaetal2011crawling], explore URLs through backlink search, considering the Web to be an undirected graph. In this work, we are not interested in evaluating backward crawling, since it requires paid APIs for most real users.
Du et al. [10.1016/j.asoc.2015.07.026] integrated the TF-IDF scores of words with the semantic similarities among them to construct topic and text semantic vectors, and computed the cosine similarities between these vectors to select URLs. ACHE [10.1145/1242572.1242632] is an adaptive crawler that aims to maximize the harvest rate [CHAKRABARTI19991623] of fetched web pages through an online-learning policy. This way, it learns to identify promising URLs and adapts its focus as the crawl progresses.
Rennie and McCallum [Rennie99efficientweb] introduced RL to focused crawling, but they generalized over all possible states to cope with the huge state space; thus each action (outlink URL) was immediately disconnected from its state. To face these challenges, RLwCS [10.5555/1567281.1567456] uses a tabular Q-learning policy that selects the best classifier from a defined pool of classifiers, which evaluate all candidate URLs that the crawler can immediately fetch. InfoSpiders [Menczer99adaptiveretrieval, 10.1145/1031114.1031117] approximate the Q-values of URLs in an online way, yet they only take into account statistical features of keywords for the state representation, without considering any experience samples.
Han et al. [10.1007/9783319916620_20] proposed an online crawling algorithm based on SARSA policy approximation [10.5555/3312046], introducing a shared state-action representation. They manually discretized the state-action space into buckets according to value ranges of statistical heuristic features. However, their MDP formulation somewhat unnaturally allows the agent to deviate from the successive order of states; i.e., it may select actions that do not actually exist in the current state. This way, given a current state, the agent may transition with positive probability to states for which the corresponding transition probabilities should be zero.
Gouriten et al. [10.1145/2631775.2631795] highlighted the frontier batch disadvantage, where a batch is a run of successive crawling timesteps in which the frontier has not been updated. They showed that even in an offline setting, the problem of determining the optimal sequence of fetched pages is NP-hard. Pham et al. [10.1145/3308558.3313709] used a multi-armed bandit strategy to combine different search operators (queries on APIs, forward and backward crawling) as a proxy to collect relevant web pages, and ranked them based on their similarity to the seeds.
3 Problem and Method Overview
3.1 Problem Definition
3.1.1 Keyword Expansion Problem
Let be the set of all web page text documents belonging to a target topic . Also, let be the finite set of all keywords that describe . Given an initial small set of keywords, , where , and a corpus of text documents , each containing candidate keywords, our goal is to expand so that .
3.1.2 Web Page Classification Problem
Similarly to [Lu2016AnIF], we deal with a binary web page classification problem, in order to decide whether a given web page is relevant to the topic of interest. Specifically, let represent the set of all web page text documents belonging to any topic other than . Also, let represent the set of all web pages on the Web. Assuming (a) is finite, (b) each web page represents a text document and (c) a word vector space , then is a set of words. Given a collection of web pages , which all represent text documents of maximum word length , and a keyword set , our goal is to find a function , so that for each input web page , classifies correctly into or . The above function can also be written .
3.1.3 Focused Crawling Problem
Let be the finite set of URLs corresponding to all web pages that belong to the target topic . Given and a small set of seeds , where , our goal is to expand over a crawling process of timesteps, discovering URLs belonging to . Thus, since the crawling process terminates after timesteps, would optimally be expanded by new URLs. We will refer to the URLs discovered this way as relevant.
Next, we describe our proposed TRES framework. First, we formulate the focused crawling setting as an MDP. Afterwards, we present the keyword expansion strategy that our crawler adopts, in order to discover keywords that are relevant to the target topic. Next, we describe KwBiLSTM, which is a deep neural network that represents the reward function of our RL setting. Last but not least, we present the proposed focused crawling algorithm, which utilizes RL, in order to find good policies for selecting URLs, and the TreeFrontier algorithm for efficient sampling from the frontier.
3.2 Modeling Focused Crawling as an MDP
Reinforcement Learning algorithms are almost always applicable when the problem setting can be formulated as an MDP [10.5555/3312046]. Similarly to [10.1007/9783319916620_20], we consider the focused crawler to be an agent in an interactive environment, which provides states, actions and rewards; the agent's goal is to maximize the cumulative reward. Next we will highlight that maximizing the agent's cumulative reward is equivalent to the focused crawling objective. To model a focused crawling setting as an MDP, we first describe some specific concepts.
Let be the finite set of all URLs on the Web. Let Page: be the function that assigns a web page in to its unique URL. Let be the function that matches a given URL to a set containing its outlink URLs. Let be the directed graph of the Web. The node set of is equivalent to the URL set . The edge set of is defined as follows: a node has a directed edge to another node if . We represent the corresponding edge with .
When a URL is fetched, a directed subgraph of is expanded by a new node. Let be the directed subgraph of that the crawler has traversed until timestep . For it holds that ; i.e. the initial subgraph has its node set equal to the seed set and no edges. We denote the URL fetched at as . Then, is expanded by node and edge , where . Note that for each timestep , is a forest of ordered trees.
Also, let closure, or crawled set as defined in [10.1145/2631775.2631795], be the set of fetched URLs until timestep .
Definition 1 (Web Path).
We define the web path of the URL fetched by the crawler at timestep as the path of from a given seed URL in to .
From the definition of closure, the crawler fetches a URL at most once, so a URL has a unique web path for the whole crawling process. Let frontier, or crawl frontier, be the set of all available outlink URLs that the crawler has not fetched until timestep , but which were extracted from web pages fetched before . More formally:
Definition 2 (Frontier).
We define frontier at timestep , given , as:
In the above definition, notice that we also include the corresponding web paths of those URLs, from which we extract the available URLs to fetch at timestep .
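The bookkeeping implied by the closure and frontier definitions can be sketched as follows. This is an illustrative data-structure sketch under our own naming, not the paper's implementation: the frontier is a mapping from each candidate URL to the web path through which it was reached.

```python
def update_frontier(frontier, closure, fetched_url, outlinks, path_to_fetched):
    """After fetching `fetched_url`, extend the frontier with its unseen
    outlinks; each frontier entry keeps the web path through which it was
    reached (Definition 2 pairs every candidate URL with such a path)."""
    closure.add(fetched_url)
    path = path_to_fetched + [fetched_url]   # web path ending at the new node
    for u in outlinks:
        if u not in closure and u not in frontier:
            frontier[u] = path               # path from a seed to u's parent
    # the fetched URL is no longer a candidate
    frontier.pop(fetched_url, None)
    return frontier, closure
```

Because a URL enters the closure exactly once, each URL keeps a unique web path for the whole crawling process, as noted above.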
At this point, we proceed with the MDP formulation; i.e. . We denote a state at timestep as the directed subgraph of that the crawler has traversed at ; i.e. . We define actions at timestep given (and thus and ), as the set of all available edges through which can be expanded by one new node; that is:
(1) 
Also, we define as the transition probability of selecting the edge to expand state in order to transition to a state . Furthermore, we define a binary reward function that equals if the web page of the URL crawled at timestep is relevant and otherwise .
The initial state consists only of the seed URLs. A final state is not naturally defined, since crawling is a non-episodic problem, where maximal rewards are ideally observed most of the time. Assuming that the crawling process stops after timesteps, is the final state.
Unlike other MDP formulations for focused crawling, which tie action selection to URL relevance estimation, our approach is, to our knowledge, the first to preserve the successive order of state transitions. Also, observe that in our MDP, given a timestep , selecting a from is equivalent to selecting action from . Recall that is the set of all available actions at timestep . We can now match these actions to the elements of the frontier whose URLs are not in the closure . Therefore, we can relate the traditional use of the frontier to the action set of an RL focused crawler agent for all timesteps.
3.3 Discovering Relevant Keywords
Keywords and seeds often constitute the crawler's only prior knowledge about the target topic. A collection of keywords that are irrelevant to the target topic can lead the crawler to follow URLs that diverge from the topic of interest. In this paper, we assume that an initial small keyword set is given as input, whose keywords are all highly related to the target topic . We propose a keyword expansion method that discovers new keywords from a corpus of text documents .
In our setting, we consider all words of the texts in to be candidate keywords of the target topic. We utilize a word2vec model (http://vectors.nlpl.eu/explore/embeddings/en/) that has been trained on the whole Wikipedia corpus and represents each word in a vector space . To decide whether a candidate word is a new keyword, we measure a semantic score , which equals the average cosine similarity of with each keyword .
To accept a new keyword, we define a keyword threshold , so that if , then is regarded as a keyword and stored in another keyword set . After all candidate keywords have been evaluated, we end up with a new set of keywords . In this paper, we denote by the average cosine similarity of all . More formally:
(2) 
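This scoring step can be sketched as follows. The sketch is illustrative: the embedding lookup `embed` stands in for the pretrained word2vec model, and the threshold value is a free parameter.

```python
import math

def cos(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def expand_keywords(candidates, keywords, embed, threshold):
    """Score each candidate word by its average cosine similarity to the
    current keyword set; keep it as a new keyword if the score exceeds
    the keyword threshold."""
    new_keywords = set()
    for w in candidates:
        score = sum(cos(embed[w], embed[k]) for k in keywords) / len(keywords)
        if score > threshold:
            new_keywords.add(w)
    return new_keywords
```

With a high threshold, the rule is highly selective, which is the behavior the strategy above relies on.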
We present our keyword expansion strategy in Algorithm 1. Intuitively, our keyword expansion strategy is highly selective, since a new keyword must score higher than some keywords in . We also tested the TF-IDF schema for crawling [10.1007/s112800150349x] and TextRank [mihalceatarau2004textrank]. The latter approaches are not as selective as ours and resulted in retrieving many irrelevant keywords with low scores.
3.4 Learning the Reward Function
In the RL setting, the agent adjusts its policy in order to learn the given task, which is often equivalent to maximizing the expected reward. Given a dataset of web pages , a keyword set , a maximum text length and the word embedding space , we seek a classification function that plays the role of our reward function .
Let and . We only assume realistic scenarios where , where . In our experiments, , and thus the majority (irrelevant) class has at least 9 times more instances than the minority class. We also assign target label to and target label to . To address this web page classification task, we propose KwBiLSTM, which is a Bidirectional Long Short-Term Memory (BiLSTM) network [10.1007/11550907_126] empowered by the a priori knowledge of the expanded keyword set . Our model's structure is shown in Figure 1.
In the preprocessing phase of web page classification, we follow a standard procedure for each text (web page) in : (a) we remove the words/phrases belonging to a list of stopwords, (b) we replace each remaining word with a token through tokenization, (c) we match each word token to a word2vec embedding vector in and (d) we create a text matrix of rows, where is a fixed parameter of our algorithm, common to all texts. may contain zero rows (vectors in ), due to zero-padding when the text consists of fewer than words. So for each text, embedding vectors are produced in total and gathered into . For texts that contain more than words, we keep the first of them and discard the rest, on the grounds that much useful information (such as the web page title) is usually written first in the web page text. To handle the zero-padding described above, KwBiLSTM includes an additional Masking layer (before the BiLSTM layer).
KwBiLSTM takes three inputs: an array of word embeddings, which is the word2vec representation of the candidate web page text, the Exhaustive Keyword Vector (EK) and the Limited Keyword Vector (LK) . We explain EK and LK below. A BiLSTM is fed with the text matrix (using the word vectors as a sequence), in order to generate an embedding vector space of fixed dimensionality.
Specifically, is of fixed dimension , which refers to the number of keywords estimated to appear in a relevant web page. In our experiments, is the average number of appearances of all in . Let be the ith element of for a given web page . We define if has at least keyword appearances in its text, otherwise . Additionally, aggregates keyword information through a fixed dimension . Here, along with , a small keyphrase set may additionally be given as input. Similarly, let be the ith element of for a given web page (text) . Also, let be the number of keyword appearances in , which is of text length . Then, if any keyphrase is found in , otherwise , and if any of the keywords are found as part of the URL, otherwise .
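Since the exact dimensions and thresholds are parameters elided above, the following sketch only illustrates the flavor of how EK and LK aggregate keyword information; the thresholding rule and feature ordering here are our own simplification, not the paper's exact definition.

```python
def keyword_vectors(text, url, keywords, keyphrases, ek_dim):
    """Illustrative construction of the two keyword inputs of KwBiLSTM:
    EK marks, per position i, whether the text contains at least i keyword
    occurrences; LK aggregates a keyword-density score plus two binary
    keyphrase/URL indicators."""
    words = text.lower().split()
    # total keyword appearances in the text
    count = sum(1 for w in words for k in keywords if k in w)
    ek = [1.0 if count >= i else 0.0 for i in range(1, ek_dim + 1)]
    lk = [
        count / max(len(words), 1),                                  # density
        1.0 if any(p in text.lower() for p in keyphrases) else 0.0,  # keyphrase hit
        1.0 if any(k in url.lower() for k in keywords) else 0.0,     # keyword in URL
    ]
    return ek, lk
```

Both vectors are cheap to compute at prediction time, which matters when classifying a URL from only its title or anchor text.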
Intuitively, the shortcut used in Concat2 (Figure 1) highlights the aggregative keyword information during prediction, even when the text length is relatively small. We found this very useful in such cases, e.g. when predicting a web page from only its title or an anchor text. We train the model by minimizing the cross-entropy loss.
Last but not least, the use of the Mean Pooling Layer (Figure 1), instead of an Attention Mechanism, is based on the observation that a text is a finite collection of words, rather than a sequence of words. In our experiments, we observed that a change in the order of the words of a text does not affect our model’s performance in the binary classification task, and this was the main reason we opted for a Mean Pooling Layer, instead of an Attention Mechanism.
3.5 Focused Crawling with Reinforcement Learning
In this subsection, we describe our RL focused crawler agent and its use in learning good Q-value approximations for selecting available URLs from the frontier. Our approach is divided into three parts: (a) representing the states and the actions, (b) efficiently selecting URLs from the frontier and (c) training the agent with an RL algorithm.
3.5.1 StateAction Representation
Similarly to [10.1007/9783319916620_20], we propose a shared state-action representation. Let be the shared representation vector for a given state and an action at timestep . Recall that an available action (at timestep ) is related to a URL in the frontier at timestep , , such that , where is the timestep at which the crawler fetched , .
To represent in given , we simply aggregate information of ; i.e. the web path of the URL . Specifically, we use the following scalar state features: the reward received at (binary), the reciprocal of the distance of from the closest relevant node in (continuous) and the relevance ratio of (continuous).
To represent in , we use the following scalar features: the existence of keywords in the URL text (binary), the existence of keywords/keyphrases in the anchor text of the source page (binary) and a probability estimate of its relevance (continuous) given by KwBiLSTM (based on the web page title). We introduce hub features, two additional scalar action features: (a) the URL's domain relevance ratio (continuous) until timestep and (b) the unknown domain relevance, which is set to if the specific web site (domain) has not been visited before, otherwise . Hub features are based on the assumption that the crawler is more likely to fetch relevant web pages by avoiding less relevant or unknown domains. There is a trade-off between maximizing the harvest rate of web pages and maximizing the number of fetched domains. Experimentally, the use of hub features served the first of these goals.
In the above setting, both the state and action spaces are extremely large, which makes training an RL agent more difficult. Thus, we initially discretize (in the range ) all continuous features of , except for the probability estimate of relevance, into buckets: , , …, . Experimentally, we noticed that contextually similar URLs had similar relevance probability estimates, independently of how relevant their respective web pages were. Keeping this feature continuous thus seemed to contribute to better splits of TreeFrontier, which we discuss next.
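The representation above can be sketched as follows. The bucket count and feature ordering are our assumptions for illustration (the paper's exact bucket ranges are elided), but the structure matches the description: binary features pass through, continuous features are bucketized, and the KwBiLSTM probability stays continuous.

```python
def bucketize(x, n_buckets):
    """Discretize x in [0, 1] into one of n_buckets equal-width bins,
    returning the bin's lower edge."""
    i = min(int(x * n_buckets), n_buckets - 1)
    return i / n_buckets

def state_action_vector(last_reward, inv_distance, path_ratio,
                        kw_in_url, kw_in_anchor, prob_relevant,
                        domain_ratio, unknown_domain, n_buckets=4):
    """Shared state-action representation: 3 state features (last reward,
    reciprocal distance to the closest relevant node, web-path relevance
    ratio) followed by 5 action features including the two hub features."""
    return [
        float(last_reward),
        bucketize(inv_distance, n_buckets),
        bucketize(path_ratio, n_buckets),
        float(kw_in_url),
        float(kw_in_anchor),
        prob_relevant,                        # left continuous on purpose
        bucketize(domain_ratio, n_buckets),
        float(unknown_domain),
    ]
```

The small, mostly discrete vector is what later makes the decision-tree discretization in TreeFrontier effective.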
3.5.2 Training with Double Deep Q-Learning
In our setting, the focused crawler is a DDQN agent in the MDP described previously. Let be the neural network estimate of the state-action value function of the agent's policy. Specifically, let be the parameters of the online Q-network and the parameters of the target Q-network of DDQN. Then, given a record from the Experience Replay , the target used by DDQN can be written as
where is the set of actions extracted at timestep ; i.e. . Also, let be the empty graph. Then, we initialize using the experience from the seeds , with zero state features in . This initialization can speed up training when many seeds are given as input and/or positive rewards are sparse during exploration. Next, we minimize with respect to by performing gradient descent steps on minibatches of . We note that the function approximators used to produce for both the online and target Q-networks are standard multilayer perceptrons with two hidden layers.
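The Double DQN target computation can be sketched as follows. `q_online` and `q_target` stand in for the two networks (here arbitrary callables over state-action vectors), and the discount factor is a placeholder value; this is the standard DDQN decomposition, not the paper's exact code.

```python
def ddqn_target(reward, next_state_actions, q_online, q_target, gamma=0.99):
    """Double DQN target: the online network *selects* the best next
    action, while the target network *evaluates* it, which reduces the
    overestimation bias of vanilla DQN."""
    if not next_state_actions:                    # terminal / empty frontier
        return reward
    best = max(next_state_actions, key=q_online)  # argmax via online net
    return reward + gamma * q_target(best)        # evaluated by target net
```

In the crawling setting, `next_state_actions` corresponds to the frontier samples available at the next timestep.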
3.6 Synchronous Frontier Update
The only element we have not yet described is which action is selected at a given timestep. In a common deep Q-learning setting, an ε-greedy policy is often used [Mnih2015HumanlevelCT, 10.5555/3016100.3016191], which entails calculating the estimated Q-values of all actions of a given state.
In a focused crawling setting, it is common to implement the frontier as a priority queue, with priority values being the estimated Q-values of actions. In that case, updating the queue at each timestep requires a synchronous update [10.1007/9783319916620_20]. Let be the time of a single DDQN prediction and the number of new URLs inserted into the frontier at timestep . Then, as the next theorem indicates, a synchronous update is costly and thus impractical.
Theorem 1.
Assuming and letting be the frontier instance induced by the seeds , a synchronous update has an overall time complexity of
3.7 Updating and Sampling through TreeFrontier
To reduce the time complexity of a synchronous update, we introduce TreeFrontier, a twofold variation of the CART algorithm. Decision trees are traditionally used to find a discretization of a large state space by recursively growing a state tree [10.5555/295240.295802]. A state tree leaf represents a partition of the initial state space . As the tree height increases through new splitting rules, narrower convex regions are created, in which the agent behaves in a predictable way.
To group similar frontier samples and make fewer predictions, we borrow concepts from Explainable Reinforcement Learning, where decision trees are often used to provide interpretable policy approximation [Wu2018BeyondST, Liu2018TowardID, Bewley_Lawry_2021]. Specifically, Bewley and Lawry [Bewley_Lawry_2021] proposed a binary decision tree for MDP state abstraction through a linear combination of three impurity measures: action (in ) selected, expected sum of rewards, and temporal dynamics. This way, they identify convex regions of that are similar from all three perspectives. Furthermore, to address the problem of allocating memory for huge continuous state spaces, Jiang et al. [Jiang2021AnER] utilized a decision tree for model learning, representing the Experience Replay buffer for generating simulated samples by using the variance of the continuous states and their average rewards.
In our setting, we are interested in discretizing both the state and action spaces, the latter being a theoretically continuous vector space. We also encounter two different kinds of samples. For a given timestep , let experience samples be those that were selected at previous timesteps , for which we have already received their respective rewards, and let frontier samples be those belonging to . Combining the above, let , where and , be the set of experience samples (including seeds) at timestep . Similarly, let be the dataset of frontier samples at timestep given .
We propose an online binary decision tree algorithm, called TreeFrontier, in order to efficiently represent and manipulate the frontier. As mentioned before, TreeFrontier is twofold: a tree node stores both and for a given timestep . Given , TreeFrontier is used to predict the agent's reward for a given state-action vector. It uses experience samples to split nodes, utilizing their rewards as target labels. Let be the subset of experience samples of that belong to a tree leaf . To split this leaf, we seek binary partitions , such that for some state-action feature and numerical threshold :
(3) 
where is the fth element of the state-action vector . Let be the reward sample variance of node . Unlike CART, to assess a candidate partition of , we calculate the weighted sample variance reduction of the rewards in , a splitting criterion that has also been utilized in XRL [Liu2018TowardID]. That is:
(4) 
Unlike CART, which follows a depth-first growth strategy, for each new experience sample in leaf , TreeFrontier checks for candidate partitions only in . If no partition with positive exists, no split is made. Otherwise, we select the partition of that yields the highest reduction. In this sense, TreeFrontier follows a best-first (online) strategy.
The frontier now has a decision tree representation, where each frontier sample falls into some leaf. At a given timestep , instead of calculating the estimated Q-values of all frontier samples, we select one representative from each leaf through uniform sampling. Thus, we calculate the estimated Q-values of only those representatives and select the action with the highest Q-value. We call this procedure a treefrontier update. We present the treefrontier update algorithm for a given timestep in Algorithm 2.
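The two TreeFrontier ingredients can be sketched as follows: the weighted variance-reduction split criterion over a leaf's experience samples, and the sampling of one representative frontier sample per leaf. This is a simplified sketch of those two steps under our own naming, not the full online algorithm.

```python
import random

def variance(rs):
    """Sample variance (biased) of a list of rewards."""
    m = sum(rs) / len(rs)
    return sum((r - m) ** 2 for r in rs) / len(rs)

def best_split(samples, rewards):
    """Pick the (feature, threshold) pair whose binary partition of a
    leaf's experience samples maximally reduces the weighted reward
    variance; return None when no split reduces it."""
    n, parent = len(samples), variance(rewards)
    best, best_red = None, 0.0
    for f in range(len(samples[0])):
        for t in sorted({s[f] for s in samples}):
            left = [r for s, r in zip(samples, rewards) if s[f] <= t]
            right = [r for s, r in zip(samples, rewards) if s[f] > t]
            if not left or not right:
                continue
            red = parent - (len(left) / n) * variance(left) \
                         - (len(right) / n) * variance(right)
            if red > best_red:
                best, best_red = (f, t), red
    return best

def treefrontier_select(leaves, q_value):
    """Draw one representative frontier sample uniformly from each
    non-empty leaf and score only those representatives."""
    reps = [random.choice(leaf) for leaf in leaves if leaf]
    return max(reps, key=q_value)
```

Only one Q-network prediction per leaf is needed, which is where the complexity gain over a synchronous update comes from.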
Theorem 2.
A treefrontier update has an overall time complexity of .
Thus, a treefrontier update always has a better overall time complexity than a synchronous update. We present the proofs of the above two theorems, along with the mathematical analysis of TRES, in the Appendix. The full focused crawling procedure of the proposed TRES framework is presented in Algorithm 3.
4 Experimental Evaluation
In this section, we demonstrate the effectiveness of TRES through experiments on real-world data and comparisons with state-of-the-art crawlers. We study the following topics: (a) Sports, (b) Food and (c) Hardware, all belonging to the Open Directory Project (Dmoz, https://dmoz-odp.org/), which has been widely used for crawling evaluation [10.1007/s112800150349x, 10.1007/s1128001503317, Elaraby2019ANA, BATSAKIS20091001]. Dmoz indexes about five million URLs covering a wide range of topics [10.1007/s112800150349x]. Note that topics are not equally represented by URLs in Dmoz; we assume a topic is as difficult to crawl as it is to find in Dmoz. Therefore, we expect the Hardware domain to be the most difficult of the three.
Unlike common (but less realistic) approaches [10.1145/1242572.1242632, Suebchua2017EfficientTF], for each experiment we utilize a single seed and average results over 10 different single-seed crawling runs. To select seeds, we used URLs from Google such that they are not immediately mutually connected on the Web; in other words, none of these seeds belongs to the outlink URL set extracted from any other seed.
Similarly to [10.1145/3308558.3313709], the classification training set of irrelevant web pages, , consists of URLs from 10 Dmoz supertopics, so that there is only a small chance these pages belong to the target topic . More specifically, we use approximately and samples for the training set of relevant web pages, , and for each supertopic in , respectively.
4.1 Evaluation Metrics
To evaluate fetched URLs, our goal is to maximize the widely used harvest rate (HR) [CHAKRABARTI19991623] of fetched web pages. More formally:
(5) 
Observe that in our binary reward setting, the harvest rate is equivalent to the agent's average cumulative reward, and is thus an RL metric [app11135826]. We also discuss the number of fetched web sites (domains), considering a web site relevant if it contains at least one relevant web page. Last but not least, we examine the efficiency of TreeFrontier in comparison to synchronous updates for selecting the best action from the frontier.
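With a binary reward, the harvest rate reduces to a simple average, as the equivalence noted above suggests:

```python
def harvest_rate(rewards):
    """Harvest rate: fraction of fetched pages that are relevant.
    With binary rewards (1 = relevant, 0 = irrelevant) this equals the
    agent's average cumulative reward."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```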
Classifier       | Sports                 | Food                   | Hardware
                 | Prec.  Rec.  F-Macro   | Prec.  Rec.  F-Macro   | Prec.  Rec.  F-Macro
Vanilla BiLSTM   | 90.1   88.3  93.2      | 85.6   88.2  92.6      | 78.4   85.0  88.3
ACHE's SVM       | 76.9   88.5  90.4      | 68.5   81.8  86.7      | 62.6   77.0  83.9
KwBiLSTM         | 91.3   90.7  95.4      | 87.0   90.8  94.2      | 80.0   86.2  90.8
Table 1: Classification results: Precision (%) and Recall (%) on the relevant class, and macro-average F-score (%).
4.2 State-of-the-Art and Baselines
We compare TRES to the following stateoftheart methods:
(a) ACHE [10.1145/1242572.1242632]: one of the most well-known focused crawlers, which aims to maximize the harvest rate of fetched web pages through an online-learning policy. ACHE has been widely used in crawling evaluation, e.g. in [10.1145/3308558.3313709, 10.1093/bioinformatics/btt571], and in textual data collection [10.1145/3159652.3159724].
(b) SeedFinder (SF) [10.1007/s1128001503317]: extracts new seeds and keywords from relevant pages to derive search queries.
As baseline methods we examine:
(a) TreeRandom (TR): similar to TRES, except that it selects each action from a random leaf. In other words, TreeRandom relies on exploration only and selects actions by uniform sampling (line 5 in Algorithm 2). Note that TreeRandom's frontier is a TreeFrontier, exactly as in TRES.
(b) Random (R): selects URLs from the frontier uniformly at random. It is equivalent to TreeRandom, with the only difference that the tree height is fixed to 1 for the whole crawling process.
From the above, observe that TRES is a generalization of both baseline methods. We are interested in the performance of TreeRandom, since it indicates to some extent how effectively TreeFrontier discretizes the state and action spaces. Intuitively, consider a crawling experiment in which Random achieves a harvest rate of 5%: this means that on average only 5% of the URLs in the frontier were relevant, i.e. there were on average 19 times more actions leading to irrelevant URLs. If TreeRandom nonetheless achieves a much higher harvest rate, then the probability that a uniformly selected TreeFrontier leaf leads to a relevant URL equals that higher rate. In such a scenario, TreeFrontier generated many leaves in which the agent demonstrated the desirable behavior on average, despite the heavily skewed frontier.
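The arithmetic behind this intuition can be made explicit; the following sketch uses our own illustrative numbers (a 5% relevance ratio and a hypothetical set of leaves), not measured results:

```python
# Illustrative assumption: only 5% of frontier URLs are relevant,
# i.e. 19 times more irrelevant than relevant actions.
p_relevant = 0.05
irrelevant_per_relevant = round((1 - p_relevant) / p_relevant)  # rounded to avoid float noise
assert irrelevant_per_relevant == 19

# Uniform sampling over the raw frontier (the Random baseline) has an
# expected harvest rate equal to the relevance ratio itself.
expected_hr_random = p_relevant  # 5%

# TreeRandom samples uniformly over TreeFrontier *leaves*; its expected
# harvest rate reflects the fraction of leaves leading to relevant URLs,
# which can be far higher if relevant URLs concentrate in few leaves.
leaf_relevance = [1, 1, 0, 1, 0]  # hypothetical: 3 of 5 leaves are "good"
expected_hr_treerandom = sum(leaf_relevance) / len(leaf_relevance)
assert expected_hr_treerandom > expected_hr_random
```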
4.3 Keyword Expansion Evaluation
In all settings, we initialize the starting keyword set with a few keywords provided by Dmoz: 62, 10 and 11 keywords in the Sports, Food and Hardware domains, respectively. The results of keyword expansion are presented in Table 2. The discovered keywords were, on average, at least 1400% more frequent in web pages of the target topic than in irrelevant web pages. In the Food and Hardware settings, the irrelevant set includes their supertopics (Recreation and Computers, respectively), and the discovered keywords are at least 160% more frequent in the target topic than in these supertopics.
Moreover, notice that our method discovered approximately 20 and 10 times more keywords than the size of the initial keyword set in the Food and Hardware settings, respectively, despite the relatively small number of initially given keywords. In the Sports setting, on the other hand, our method retrieved 32 new keywords, enlarging the initial set by only about 50%. These results support the highly selective behavior of our method, since a large initial keyword set leads (on average) to lower CS scores (below the acceptance threshold) for the candidate keywords.
                                       Sports  Food   Hardware
Mean keyword count (relevant pages)    19.55   15.46  17.66
Mean keyword count (irrelevant pages)   0.86    1.04   1.44
# initial keywords                        62      10     11
# discovered keywords                     32     205    129
Table 2. Keyword expansion results per topic.
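The selectivity described above can be sketched as a frequency-ratio filter; this is a simplification for illustration, with our own helper names and threshold, not the exact CS-based criterion used by TRES:

```python
def expand_keywords(candidates, freq_relevant, freq_irrelevant, threshold=2.0):
    """Keep candidate keywords that are markedly more frequent in
    relevant pages than in irrelevant ones.

    freq_relevant / freq_irrelevant map a keyword to its mean count per
    page in the respective page set; `threshold` is an illustrative
    stand-in for the CS acceptance threshold."""
    accepted = []
    for kw in candidates:
        rel = freq_relevant.get(kw, 0.0)
        irr = freq_irrelevant.get(kw, 0.0)
        # A keyword absent from irrelevant pages is maximally selective.
        ratio = rel / irr if irr > 0 else float("inf")
        if ratio >= threshold:
            accepted.append(kw)
    return accepted

# Toy example: "league" is topic-specific, "page" is generic.
rel = {"league": 12.0, "page": 3.0}
irr = {"league": 0.5, "page": 4.0}
print(expand_keywords(["league", "page"], rel, irr))  # ['league']
```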
4.4 Web Page Classification Evaluation
To evaluate the proposed web page classification model, KwBiLSTM, we follow a stratified 5-fold cross-validation setting. As depicted in Table 1, we measure Precision and Recall of the relevant class and the Macro-average F-Score (F-Macro). We use ACHE's SVM and a Vanilla BiLSTM as baseline methods. In all three settings, KwBiLSTM outperforms both baselines on all evaluation metrics, while ACHE's SVM consistently performs the worst. Notice that in the Hardware setting our model performs worse than in the other settings, because the irrelevant class includes the Computers topic, whose web pages consistently contain relevant keywords (6.67 on average).
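The metrics reported in Table 1 can be reproduced from counts of true/false positives; a minimal sketch of the binary and macro-averaged variants (our own helpers, not the evaluation code we used):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, Recall and F1 of the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Macro-average F-score: unweighted mean of per-class F1."""
    classes = set(y_true) | set(y_pred)
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)

# Toy labels: one relevant page misclassified as irrelevant.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
p, r, f = precision_recall_f1(y_true, y_pred)  # precision 1.0, recall 2/3
```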
                 Sports             Food               Hardware
Focused Crawler  HR     Domains     HR     Domains     HR     Domains
ACHE             54.43   2081       60.45   2153       49.25    940
SeedFinder       56.05   1842       60.16   3642       38.68   1314
Random            3.55     17        8.97     59        1.66     28
TreeRandom       60.01    204       36.60    523       40.63    354
TRES             97.43     83       95.59     78       93.64     55
ACHE_100         47.57   4574       50.85   3820       42.18   1923
SeedFinder_100   49.97   4032       57.49   5043       31.14   2255
TRES_100         94.19    286       97.55    301       88.40    420
ACHE_10          36.79   6084       –       –          –       –
SeedFinder_10    37.70   5774       46.96   6742       19.63   2832
TRES_10          84.52   3084       94.56   3189       63.14   1977
TRES_5           78.56   4053       90.98   4324       46.57   2956
Table 3. Focused crawling results: harvest rate (HR, %) and number of fetched domains.
4.5 Focused Crawling Evaluation
For the crawling evaluation, we use the ground-truth-trained KwBiLSTM to classify the fetched web pages. Table 3 presents the focused crawling results for all three topics, in terms of harvest rate and the number of fetched domains. Considering the constraint that the crawler can fetch at most a fixed number of web pages per domain (web site), we divide our experiments into three categories: (a) no per-domain limit (rows 1-5), (b) a limit of 100 pages per domain (rows 6-8) and (c) limits of 10 and 5 pages per domain (rows 9-12). As depicted, TRES consistently outperforms the state-of-the-art methods in harvest rate by at least 58%, 94% and 110% in the respective evaluation settings. Our baseline TreeRandom performs on a par with both ACHE and SeedFinder in HR: it outperforms both state-of-the-art methods in the Sports domain and SeedFinder alone in the Hardware domain, and it consistently outperforms the Random crawler in both harvest rate and total fetched domains. The harvest rate of TreeRandom reflects the relevance ratio of the leaves of TreeFrontier, which appears higher than the relevance ratio of a traditional frontier. The lower harvest rates of the state-of-the-art methods could be attributed to their web page classifiers being less accurate than our KwBiLSTM, causing their crawling strategies to diverge from the target topic.
Furthermore, we demonstrate the effect of the per-domain page limit on the trade-off between harvest rate and the number of fetched domains. The state-of-the-art methods proved more exploratory and discovered a greater number of relevant domains than TRES; however, TRES was optimized neither for fetching different domains nor for discovering further seeds, as SeedFinder was. As the per-domain limit decreases, TRES discovers more relevant domains and closes the gap with the state-of-the-art methods in this objective, while consistently leading in harvest rate. In the setting with a limit of 10 pages per domain, ACHE aborted in most cases (due to running out of URLs to crawl) and thus we could not report its average results. Similar experimental problems of ACHE when bootstrapped with a small seed set have been observed in previous works as well [10.1145/3308558.3313709]. Notice that in this setting, in the Hardware topic, TRES (TRES_5) manages to outperform all evaluated methods both in harvest rate and in the number of fetched domains.
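The per-domain constraint can be enforced with a simple counter over fetched URLs; a minimal sketch (class and method names are ours, and TRES's actual bookkeeping may differ):

```python
from collections import Counter
from urllib.parse import urlparse

class DomainLimiter:
    """Tracks fetches per domain and rejects URLs whose domain has
    reached the per-domain page limit (e.g. 100, 10 or 5)."""

    def __init__(self, max_pages_per_domain):
        self.max_pages = max_pages_per_domain
        self.counts = Counter()

    def allow(self, url):
        domain = urlparse(url).netloc
        if self.counts[domain] >= self.max_pages:
            return False  # domain exhausted: skip this URL
        self.counts[domain] += 1
        return True

limiter = DomainLimiter(max_pages_per_domain=2)
urls = ["http://a.com/1", "http://a.com/2", "http://a.com/3", "http://b.com/1"]
print([limiter.allow(u) for u in urls])  # [True, True, False, True]
```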
4.6 TreeFrontier vs Synchronous Update
The use of TreeFrontier allows us to sample efficiently from the frontier, reducing execution time by orders of magnitude while preserving the high performance of a synchronous update, which exhaustively examines every frontier sample at each timestep. While tuning the hyperparameters of the TRES algorithm, we gradually increased the size of the tree until no significant improvement in the evaluation metrics was observed; beyond that point, performance plateaued with respect to the tree size. This allows a fair comparison with the synchronous-update approach: the performance in terms of the evaluation metrics is similar, but the execution time of TRES is significantly lower. In particular, as shown in Figure 2, the subset of the frontier that TRES examines at each timestep is between 200 and 300 times smaller than the full frontier. This implies an analogous reduction in execution time, which we did not measure directly because doing so would be computationally impractical, considering that (the much faster) TRES already took several hours to execute on our machines. In Figure 2 we can also observe that the dependence of the frontier size and of the number of TreeFrontier leaves on the number of timesteps is not exactly linear. In the case of the full frontier (synchronous update), this is attributed to the variance in the number of outlinks a random web page contains: the curve oscillates around the mean outlink number, which would be exactly the slope of the curve (see Figure 2, subplot (b)) had there been no variance in the number of outlink URLs per page. In the case of TreeFrontier, the variance in the slope is due to the fact that at some timesteps no split is performed, and thus the number of leaves remains the same, while at other timesteps binary splits occur, increasing the number of candidate frontier samples that we examine.
It is worth mentioning that the TreeFrontier approach could be employed in other RL settings as well. One challenge in RL is an increasingly large action space, which would ideally be examined exhaustively at each timestep. TreeFrontier tackles this problem by reducing the number of actions examined at each timestep, i.e. the number of samples for which Q-values are calculated. Frontier sampling is performed in a stratified way, utilizing a form of clustering through an online discretization of both the large state and action spaces. Therefore, in an RL setting with a rapidly growing action space (the frontier), the TreeFrontier algorithm could be preferable to an exhaustive calculation of Q-values or to alternative methods.
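The idea of scoring one representative per leaf instead of the whole action set can be sketched as follows; this is a toy stand-in for TreeFrontier with a made-up Q-function, not the actual algorithm:

```python
import random

def best_action_synchronous(actions, q):
    """Exhaustive (synchronous) update: score every action."""
    return max(actions, key=q)

def best_action_stratified(leaves, q, rng=random):
    """TreeFrontier-style selection: draw one representative action per
    leaf (cluster), then compute Q-values only for those representatives."""
    representatives = [rng.choice(leaf) for leaf in leaves if leaf]
    return max(representatives, key=q)

# Toy setup: 1000 actions partitioned into 10 leaves, so Q-values are
# computed for only 10 representatives instead of all 1000 actions.
actions = list(range(1000))
leaves = [actions[i:i + 100] for i in range(0, 1000, 100)]
q = lambda a: -abs(a - 730)   # made-up Q-function peaking at action 730
n_examined = len(leaves)      # 10, versus len(actions) == 1000
```

The quality of the stratified choice depends on how well the leaves group actions with similar Q-values, which is exactly what TreeFrontier's online discretization aims for.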
5 Conclusion
In this paper, we proposed an end-to-end RL focused crawling framework that outperforms state-of-the-art methods in the harvest rate of fetched web pages. We formalized the MDP for focused crawling, overcoming previous limitations, and improved on the impractical time complexity of the synchronous update through the TreeFrontier algorithm we introduced.
As future work, we would like to investigate an exploration bonus in the reward function, in order to push the agent's exploration towards discovering a greater number of relevant domains. An additional open problem is minimizing the size of the labeled data set used to train the classifier with which we estimate the reward function.