Learning to Crawl

05/29/2019 ∙ by Utkarsh Upadhyay, et al. ∙ Poznan University of Technology ∙ Max Planck Institute for Software Systems ∙ Verizon Media

Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follow a Poisson distribution with known parameters. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an O(√(T)) regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters.




1 Introduction

As information dissemination in the world becomes near real-time, it becomes more and more important for search engines, like Bing and Google, and other knowledge repositories to keep their caches of information and knowledge fresh. In this paper, we consider the web-crawling problem of designing policies for refreshing webpages in a local cache with the objective of maximizing the number of incoming requests which are served with the latest version of the page. Webpages are the simplest and most ubiquitous source of information on the internet. As items to be kept in a cache, they have two key properties: (i) they need to be polled, which uses bandwidth, and (ii) polling them only provides partial information about their change process, i.e., a single bit indicating whether the webpage has changed since it was last refreshed or not. In their seminal work, cho2003effective presented a formulation of the problem which was recently studied by Azar8099. Under the assumption that the changes to the webpages and the requests are Poisson processes with known rates, Azar8099 describe an efficient algorithm to find the optimal refresh rates for the webpages.

However, the change rates of the webpages are often not known in advance and need to be estimated. Since the web crawler cannot continuously monitor every page, there is only partial information available on the change process. cho2003estimating, and more recently li2017temporal, have proposed estimators of the rate of change given partial observations. However, the problem of learning the refresh rates of items while also trying to optimise the objective of keeping the cache as up-to-date for incoming requests as possible seems very challenging, for two reasons. On the one hand, the optimal policy found by the algorithm of Azar8099 does not allocate bandwidth to pages that are changing very frequently. On the other hand, rate estimates with low precision, especially for pages that are changing frequently, may result in a policy that has non-vanishing regret. We formulate this web-crawling problem with unknown change rates as an online optimization problem for which we define a natural notion of regret, describe the conditions under which the refresh rates of the webpages can be learned, and show that using a simple explore-and-commit algorithm, one can obtain regret of order $O(\sqrt{T})$.

Though in this paper we primarily investigate the problem of web-crawling, our notion of regret and the observations we make about the learning algorithms will also be applicable to other control problems which model the actions of agents as Poisson processes and the policies as intensities. Such an approach is seen in recent works which model or predict social activities (farajtabar2018point; du2016recurrent), control online social activity (zarezade2017steering; karimi2016smart; wang2017variational; upadhyay2018deep), or even control spacing for optimal learning (tabibian2019enhancing). All such problems admit online versions where the parameters for the models (e.g., difficulty of items from recall events (tabibian2019enhancing), or rates of posting of messages by other broadcasters (karimi2016smart)) need to be learned while also optimising the policy the agent follows.

In Section 2, we formally describe the problem setup and formulate the objective and the associated constraints. Section 3 takes a closer look at the objective function and the optimal policy with the aim of describing the properties the learning algorithm should have. We propose an estimator for learning the parameters of a Poisson process with partial observability and provide guarantees on its performance in Section 4. Leveraging the bound on the estimator's performance, we propose a simple explore-and-commit algorithm in Section 5 and show that it achieves $O(\sqrt{T})$ regret. In Section 6, we test our algorithm using synthetic data to justify our theoretical findings, and we conclude with future research directions in Section 7.

2 Problem Formulation

In this section, we consider the problem of keeping a cache of $m$ webpages up-to-date by modelling the changes to webpages, the requests for the pages, and the bandwidth constraints placed on a standard web-crawler. We assume that the cache is empty when all processes start at time $0$.

We model the changes to each webpage as Poisson processes with constant rates. The parameters of these change processes are denoted by $\xi = [\xi_1, \dots, \xi_m]$, where $\xi_i > 0$ denotes the rate of changes made to webpage $i$. We assume that $\xi$ is not known to us; we know only an upper bound $\xi_{\max}$ and a lower bound $\xi_{\min}$ on the change rates. The crawler will learn $\xi$ by refreshing the pages and observing the single-bit feedback described below. We denote the time at which webpage $i$ changes for the $n$th time by $x_{i,n}$. We also model the incoming requests for each webpage as Poisson processes with constant rates and denote these rates by $\zeta = [\zeta_1, \dots, \zeta_m]$. We assume that these rates, which can also be interpreted as the importance of each webpage in our cache, are known to the crawler. We denote the time at which webpage $i$ is requested for the $n$th time by $z_{i,n}$. The change process and the request process, given their parameters, are assumed to be independent of each other.

We denote the time points at which page $i$ is refreshed by the crawler by $(y_{i,n})_{n \ge 1}$. The feedback which the crawler gets after refreshing webpage $i$ at time $y_{i,n}$ consists of a single bit which indicates whether the webpage has changed since the last observation, which was made at time $y_{i,n-1}$. Let $E^{\emptyset}_i[t_0, t]$ denote the event that neither a change nor a refresh of the page has happened between time $t_0$ and $t$ for webpage $i$. Define $\mathrm{Fresh}(i, t)$ as the event that webpage $i$ is fresh in the cache at time $t$. Defining the maximum of an empty set to be $-\infty$, we have:

where the indicator function $\mathbb{I}(\cdot)$ takes value 1 on the event in its argument and value 0 on its complement. Hence, we can describe the feedback we receive upon refreshing page $i$ at time $y_{i,n}$ as:


We call this a partial observation of the change process to contrast it with full observability of the process, i.e., when a refresh provides the number of changes made to the webpage in the period since the previous refresh. For example, the crawler will have full observability of the incoming request processes.

The policy space $\Pi$ consists of all measurable functions which, at any time $t$, decide when the crawler should refresh which page in its cache based on the observations available up to time $t$, including the feedback bits received so far.

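The objective of the web-crawling problem is to refresh webpages so as to maximize the number of requests which are served a fresh version. So the utility of a policy $\pi \in \Pi$ followed from time $t_1$ to $t_2$, counting over all requests $z_{i,n}$ made to each webpage $i$ in that period, can be written as:

$$U([t_1, t_2], \pi; \xi) = \sum_{i=1}^{m} \; \sum_{n:\, t_1 \le z_{i,n} \le t_2} \mathrm{Fresh}(i, z_{i,n}). \qquad (2)$$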

Figure 1: Problem setup under the class of stochastic (Poisson) refresh policies.

Our goal is to find a policy that maximizes this utility (2). (Footnote 1: The freshness of the webpages does depend on the policy; this dependence is hidden by the function Fresh.) However, if the class of policies is unconstrained, the utility can be maximized by a trivial policy which continuously refreshes all webpages in the cache. This is not a practical policy since it will overburden the web servers and the crawler. Therefore, we would like to impose a bandwidth constraint on the crawler. Such a constraint can take various forms, and a natural way of framing it is that the expected number of webpages that are refreshed in any time interval of width $w$ cannot exceed $w \times R$. This constraint defines a class of stochastic policies $\Delta_R = \{(\rho_1, \dots, \rho_m) : \rho_i \ge 0, \ \sum_{i=1}^{m} \rho_i = R\}$, where each webpage's refresh times are drawn by the crawler from a Poisson process with rate $\rho_i$. Under this class of policies, the problem setup is as shown in Figure 1. This problem setup was studied by Azar8099 and shown to be tractable. We define the regret of a policy over the horizon $[0, T]$ as the difference between the expected utility of the best policy in $\Delta_R$, chosen with knowledge of the true parameters, and the expected utility of the policy itself.

It is worth reiterating that the change rates $\xi$ will not be known to the crawler. The crawler will need to determine when and which page to refresh given only the single bits of information corresponding to each refresh the policy makes. In the following sections, we will determine what properties such a learning algorithm should have and propose such an algorithm.
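To make the feedback model concrete, the following is a minimal simulation sketch for a single page. The function name `simulate_feedback`, the fixed refresh schedule, and the convention that a bit equals 1 when a change was detected are illustrative assumptions, not the authors' code.

```python
import random

def simulate_feedback(change_rate, refresh_times, seed=0):
    """Return one bit per refresh: 1 if the page changed since the previous
    refresh (or since time 0 for the first refresh), else 0.
    Assumption: changes follow a Poisson process with rate `change_rate`."""
    rng = random.Random(seed)
    bits = []
    next_change = rng.expovariate(change_rate)  # time of the first change
    for y in refresh_times:
        changed = 0
        # Consume every change that happened up to this refresh time.
        while next_change <= y:
            changed = 1
            next_change += rng.expovariate(change_rate)
        bits.append(changed)
    return bits

# Hypothetical schedule: refresh every 0.5 time units, 2000 times.
refresh_times = [0.5 * k for k in range(1, 2001)]
bits = simulate_feedback(change_rate=1.0, refresh_times=refresh_times)
frac_unchanged = 1 - sum(bits) / len(bits)
```

The crawler never sees how many changes occurred between two refreshes, only whether at least one did. Under these assumptions, the fraction of "unchanged" bits should be close to exp(-rate × interval), which is the quantity the estimators in Section 4 build on.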

3 Learning with Poisson Processes with Partial Observability

In this section, we will derive an analytical form of the utility function which is amenable to analysis, describe how to uncover the optimal policy in the class of Poisson refresh policies if all parameters (the change rates $\xi$ and the request rates $\zeta$) are known, and consider the problem of learning the parameters with partial observability. We will use these insights to determine some properties a learning algorithm should have to be tractable to analyse and for it to uncover the optimal policy.

3.1 Utility and the Optimal policy

Consider the expected value of the utility of a policy which the crawler follows from time till time . Assume that the state of the cache at time is given by , where . Then, using (2), we have:


where (3) follows from Campbell's formula for Poisson processes (poissonprocess) (the expectation of a sum over a point process equals the integral over time against the process's intensity measure) as well as the fact that the request process and the change/refresh processes are independent. In the next lemma, we show that the differential utility function, defined implicitly in (3), can be made time-independent if the policy is allowed to run for long enough.

Lemma 1 (Adapted from (Azar8099)).

For any given , let be a policy which the crawler adopts at time and let the initial state of the cache be , where . Then if , then .


Let $E^{\emptyset}_i[t_0, t]$ denote the event that neither a change nor a refresh has happened for webpage $i$ in the time interval $[t_0, t]$. Note that under this event, the freshness of the page at time $t$ equals its initial state at time $t_0$. Otherwise (i.e., under the complementary event), as we have assumed that the change and the refresh processes are independent Poisson processes for all webpages, the probability that the last event which happened for webpage $i$ between $t_0$ and $t$ was an update (refresh) event is $\rho_i / (\rho_i + \xi_i)$. Hence, we can write the differential utility function as:


This proves the first part of the inequality that .

Now substituting into (4), we get:

where we have used in the first inequality and in the second inequality. ∎

Hence, as long as the condition described by Lemma 1 holds, the differential utility function for a policy is time-independent, so its time argument can be dropped. Substituting this into (3), we get:


This leads to the following time-horizon independent optimisation problem for the optimal policy:


Azar8099 have considered the approximate utility function given by (6) to derive the optimal refresh rates efficiently when the change rates $\xi$ are known (see Algorithm 2 in (Azar8099)). (Footnote 2: The optimal policy can also be obtained by using the method proposed by duchi2008efficient.)
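As an illustration of what solving this optimisation problem involves, here is a minimal sketch that maximizes the steady-state utility, assumed here to be the sum over pages of ζ_i ρ_i / (ρ_i + ξ_i), subject to the rates being non-negative and summing to the bandwidth R. It uses bisection on the Lagrange multiplier, which is a generic substitute and not Algorithm 2 of Azar8099; the function name is hypothetical.

```python
import math

def optimal_refresh_rates(xis, zetas, bandwidth, iters=200):
    """Sketch: maximize sum_i zeta_i * rho_i / (rho_i + xi_i)
    subject to rho_i >= 0 and sum_i rho_i = bandwidth."""
    def rates(lam):
        # Stationarity: zeta * xi / (rho + xi)^2 = lam, truncated at zero.
        return [max(0.0, math.sqrt(z * x / lam) - x) for x, z in zip(xis, zetas)]

    # At lam = max(zeta / xi) every rate is zero; as lam -> 0 the rates grow
    # without bound, so the right multiplier lies in between.
    lo, hi = 0.0, max(z / x for x, z in zip(xis, zetas))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(rates(mid)) > bandwidth:
            lo = mid   # using too much bandwidth: raise the multiplier
        else:
            hi = mid
    return rates(hi)

balanced = optimal_refresh_rates([1.0, 1.0], [1.0, 1.0], bandwidth=2.0)
skewed = optimal_refresh_rates([0.1, 100.0], [1.0, 1.0], bandwidth=0.5)
```

In the second call, the page that changes 1000 times faster receives no bandwidth at all, which matches the remark in the introduction that the optimal policy does not allocate bandwidth to pages that change very frequently.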

This approximation has bearing upon the kind of learning algorithms we could use to keep the analysis of the algorithm, and the computation of the optimal policy, tractable. The learning algorithm we employ must follow a policy for a certain amount of burn-in time before we can use (5) to approximate the performance of the policy. If the learning algorithm changes the policy too quickly, then we may see large deviations between the actual utility and the approximation. However, if the learning algorithm changes the policy slowly, where Lemma 1 can serve as a guide to the appropriate period, then we can use (5) to easily calculate its performance between consecutive policy changes.
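The burn-in effect can be seen in a small sketch. For a single page under Poisson refreshes (rate `rho`) and Poisson changes (rate `xi`), freshness behaves as a two-state Markov chain, so the probability of being fresh relaxes exponentially to its stationary value rho / (rho + xi). The function names and the closed form below are a sketch under those assumptions, not the statement of Lemma 1.

```python
import math

def fresh_probability(rho, xi, s0, elapsed):
    """P(page is fresh) after `elapsed` time, starting from state s0 in {0, 1}.
    With probability exp(-(rho + xi) * elapsed) nothing happened and the state
    is still s0; otherwise the last event was a refresh w.p. rho / (rho + xi)."""
    stationary = rho / (rho + xi)
    return stationary + (s0 - stationary) * math.exp(-(rho + xi) * elapsed)

def steady_state_utility(rhos, xis, zetas):
    """Expected number of fresh-served requests per unit time in steady state."""
    return sum(z * r / (r + x) for r, x, z in zip(rhos, xis, zetas))

# Gap to the stationary value for an initially stale page, at several times.
gaps = [2.0 / 3.0 - fresh_probability(2.0, 1.0, 0, t) for t in (0.0, 1.0, 5.0)]
```

The gap shrinks at rate rho + xi, which is at least the smallest change rate, so a lower bound on the change rates is enough to decide how long a policy must run before the time-independent approximation is accurate to a given tolerance.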

Now that we know how to uncover the optimal policy when the change rates $\xi$ are known, we turn our attention to the task of learning it with partial observations.

3.2 Learnability of Poisson Process’s Rate with Partial Observability

In this section, we address the problem of partial information about a Poisson process and investigate under what conditions the rate of the Poisson process can be estimated. In our setting, for an arbitrary webpage, we only observe the binary outcomes defined by (1). The refresh times and the Poisson process of changes with rate $\xi$ induce a distribution over these binary outcomes. If the observations happen at regular time intervals, i.e., the gap between consecutive refreshes is some constant $c$, then the support of this distribution is:

This means that we can have a consistent estimator, based on the strong law of large numbers, if the crawler refreshes the cache at fixed intervals.
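For the fixed-interval case, the estimator can be sketched in a few lines. With refreshes every `interval` time units, each bit reports "no change" with probability exp(-rate × interval), so the empirical fraction of unchanged bits can be inverted directly. The function name and the convention that a bit equals 1 when a change was detected are assumptions made for this illustration.

```python
import math

def estimate_rate_fixed_interval(bits, interval):
    """bits[n] = 1 if a change was detected at the n-th refresh, else 0.
    Inverts frac_unchanged = exp(-rate * interval)."""
    frac_unchanged = 1 - sum(bits) / len(bits)
    if frac_unchanged == 0:
        return math.inf  # every refresh saw a change: the data give no upper limit
    return -math.log(frac_unchanged) / interval

# Synthetic bits whose unchanged fraction is 0.3679, close to exp(-1).
example_bits = [0] * 3679 + [1] * 6321
example_rate = estimate_rate_fixed_interval(example_bits, interval=1.0)
```

By the strong law of large numbers the unchanged fraction converges to exp(-rate × interval), so this estimate converges to the true rate as the number of refreshes grows.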

More generally, we can characterise the necessary property of the set of partial observations which allows parameter estimation of Poisson processes under the general class of policies. This result may be of independent interest.

Lemma 2.

Let be a sequence of times, such that , at which observations are made of a Poisson process with rate , such that iff there was an event of the process in , define , and . Then:

  1. If and , then any statistic for estimating has non-vanishing bias.

  2. If , then there exist disjoint subsets of such that is monotone and for For any such sequence , the mapping is strictly monotone and

  3. If then, there exists a sequence of disjoint subsets of such that is monotone and for For any such , the mapping is strictly monotone and


See Appendix A. ∎

Note that it is possible that, for some , the statistics almost surely converge to a value that is unique to , but for some other one they do not. Indeed, when , then and for , but for . More concretely, assuming that the respective limits exist, we have:

In particular, if , it implies that for all , which implies that it will be possible to learn the true value for any parameter .

Lemma 2 has important implications on the learning algorithms we can use to learn . It suggests that if the learning algorithm decreases the refresh rate for a webpage too quickly, such that (assuming the limit exists), then the estimate of each parameter has non-vanishing error.

In summary, in this section, we have made two important observations about the learning algorithm we can employ to solve the web-crawling problem. Firstly, given an error tolerance of $\varepsilon$, the learning algorithm should change the policy only after the burn-in period prescribed by Lemma 1, to allow the time-invariant differential utility approximation to be valid. Secondly, in order to obtain consistent estimates of the change rates $\xi$ from partial observations, the learning algorithm should not change the policy so drastically that it violates the conditions in Lemma 2. These observations strongly suggest that to obtain theoretical guarantees on the regret, one should use phased learning algorithms where each phase lasts at least as long as this burn-in period, the policy is only changed when moving from one phase to the other, and the changes made to the policy are such that the constraints presented in Lemma 2 are not violated. Parallels can be drawn between such learning algorithms and the algorithms used for online learning of Markov Decision Processes which rely on bounds on mixing times (neu2010online). In Section 5, we present the simplest of such algorithms, i.e., the explore-and-commit (ETC) algorithm, for the problem and provide theoretical guarantees on the regret. Additionally, in Section 6.3, we also empirically compare the performance of ETC to the phased $\varepsilon$-greedy learning algorithm.

In the next section, we investigate practical estimators for the parameters and the formal guarantees they provide for the web-crawling problem.

4 Parameter Estimation and Sensitivity Analysis with Partial Observations

In this section, we address the problem of parameter estimation of a Poisson process under partial observability and investigate the relationship between the utility of the optimal policy (obtained using the true parameters) and that of the policy obtained using the estimates.

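Assume the same setup as for Lemma 2, i.e., we are given a finite sequence of observation times $y_0 = 0 < y_1 < \dots < y_N$ in advance, and we observe $o_1, \dots, o_N$, defined as in (1), based on a Poisson process with rate $\xi$, where $o_n = 1$ iff there was an event of the process between $y_{n-1}$ and $y_n$. Define $w_n = y_n - y_{n-1}$. Then the log-likelihood of $\xi$ is:

$$\mathcal{L}(\xi) = \sum_{n=1}^{N} \Big[ o_n \log\big(1 - e^{-\xi w_n}\big) - (1 - o_n)\, \xi w_n \Big],$$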

which is a concave function. Taking the derivative and solving for the rate yields the maximum likelihood estimator (MLE). However, as the MLE lacks a closed form, coming up with a non-asymptotic confidence interval for it is a very challenging task. Hence, we consider a simpler estimator.
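The log-likelihood of single-bit observations can be sketched as follows. The sketch assumes that an interval of length `w` shows a change with probability 1 - exp(-xi × w) and no change with probability exp(-xi × w), and that intervals are independent; the function name is illustrative.

```python
import math

def log_likelihood(xi, bits, widths):
    """Log-likelihood of rate `xi` given bits (1 = change detected) observed
    over consecutive intervals with the given widths."""
    total = 0.0
    for o, w in zip(bits, widths):
        if o:
            # At least one change in an interval of length w.
            total += math.log1p(-math.exp(-xi * w))
        else:
            # No change in an interval of length w.
            total += -xi * w
    return total

bits, widths = [1, 0, 1], [0.5, 1.0, 2.0]
midpoint = log_likelihood(1.0, bits, widths)
average = 0.5 * (log_likelihood(0.5, bits, widths) + log_likelihood(1.5, bits, widths))
```

The midpoint value is at least the average of the two endpoint values, as concavity requires. Because the function is concave, its maximizer can be located numerically, but there is no closed form when the interval widths differ.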

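Let us define an intermediate statistic $\hat{p}$ as the fraction of times we observed that the underlying Poisson process produced no events, $\hat{p} = \frac{1}{N} \sum_{n=1}^{N} (1 - o_n)$. Since $\mathbb{E}[1 - o_n] = e^{-\xi w_n}$, we get $\mathbb{E}[\hat{p}] = \frac{1}{N} \sum_{n=1}^{N} e^{-\xi w_n}$. Motivated by this, we can estimate $\xi$ by the following moment matching method: choose $\tilde{\xi}$ to be the unique solution of the equation

$$\hat{p} = \frac{1}{N} \sum_{n=1}^{N} e^{-\tilde{\xi} w_n}, \qquad (7)$$

and then obtain the estimator $\hat{\xi}$ of $\xi$ by clipping $\tilde{\xi}$ to the range $[\xi_{\min}, \xi_{\max}]$, i.e., $\hat{\xi} = \max\{\xi_{\min}, \min\{\xi_{\max}, \tilde{\xi}\}\}$. The RHS in (7) is monotonically decreasing in $\tilde{\xi}$; therefore the solution of (7) can be found to within any desired error by binary search, with a number of iterations that is logarithmic in the inverse of the error. Additionally, if the intervals are of fixed size, i.e., $w_n = c$ for all $n$, then $\tilde{\xi}$ reduces to the maximum likelihood estimator. Such an estimator was proposed by cho2003estimating and was shown to have good empirical performance. Here, instead of smoothing the estimator, a subsequent clipping of $\tilde{\xi}$ resolves the issue of its instability for the extreme values $\hat{p} = 0$ and $\hat{p} = 1$ (when the solution to (7) becomes $\infty$ and $0$, respectively). In the following lemma, we will show that this estimator is also amenable to non-asymptotic analysis by providing a high probability confidence interval for the estimator $\hat{\xi}$.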
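A minimal sketch of the moment matching estimator follows. It assumes the matching equation (7) equates the observed fraction of unchanged intervals with the average of exp(-xi × w_n) over the interval widths, and it restricts the binary search to the clipping range, which gives the same result as solving and then clipping because the right-hand side is decreasing. The function name and tolerance are illustrative.

```python
import math

def moment_matching_estimate(bits, widths, xi_min, xi_max, tol=1e-9):
    """bits[n] = 1 if a change was detected over the n-th interval, else 0."""
    n = len(bits)
    p_hat = 1 - sum(bits) / n  # fraction of intervals with no detected change

    def rhs(xi):
        # Expected fraction of unchanged intervals if the rate were xi.
        return sum(math.exp(-xi * w) for w in widths) / n

    # Clipping: the solution lies outside [xi_min, xi_max] in these cases.
    if p_hat >= rhs(xi_min):
        return xi_min
    if p_hat <= rhs(xi_max):
        return xi_max

    lo, hi = xi_min, xi_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rhs(mid) > p_hat:
            lo = mid   # rhs too large, so the rate must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

fixed = moment_matching_estimate([0] * 3679 + [1] * 6321, [1.0] * 10000, 0.01, 10.0)
mixed = moment_matching_estimate([0, 1], [0.5, 2.0], 0.01, 10.0)
```

With equal widths the result coincides with the closed form -log(p̂)/w, and the two extreme cases (all bits 0 or all bits 1) return the lower and upper clipping bounds instead of 0 or infinity.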


Lemma 3.

Under the condition of Lemma 2, for any , and observations it holds that

where and is obtained by solving (7).


Recall that is the empirical frequency of no-event counts, and denote by . In this notation, we have:

where is monotonically decreasing in (and similarly for as a function of ).

First, assume that , which implies by the monotonicity property mentioned above, and by the property of clipping (and the fact that ) we also have . By the convexity of the exponential function:

which implies after summing over :


Similarly for we have and therefore:

which implies:


By combining (8) and (9), we get:


For being a frequency of counts, Hoeffding’s inequality for independent Bernoulli variables implies that for :

Hence, with probability at least , we have and combining it with (10), with probability :

which finishes the proof. ∎
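The Hoeffding step of the proof can be illustrated numerically. The sketch below computes the standard two-sided Hoeffding radius for the frequency of no-change observations and, for the special case of equal interval widths only, maps it to an interval for the rate by inverting p = exp(-xi × width). This inversion is an illustrative shortcut and not the bound stated in Lemma 3; both function names are hypothetical.

```python
import math

def hoeffding_radius(n_obs, delta):
    """With probability at least 1 - delta, the empirical frequency of n_obs
    independent Bernoulli variables is within this radius of its mean."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_obs))

def rate_interval_fixed_width(p_hat, n_obs, width, delta):
    """Confidence interval for the rate when every interval has length `width`."""
    r = hoeffding_radius(n_obs, delta)
    upper_p = min(1.0, p_hat + r)
    lower_p = max(0.0, p_hat - r)
    low = -math.log(upper_p) / width  # a larger unchanged fraction means a smaller rate
    high = math.inf if lower_p == 0.0 else -math.log(lower_p) / width
    return low, high

low, high = rate_interval_fixed_width(p_hat=0.5, n_obs=1000, width=1.0, delta=0.05)
```

The radius shrinks as one over the square root of the number of observations, which is what ultimately drives the dependence of the regret on the length of the exploration phase.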

With the following lemma, we bound the sensitivity of the expected utility to the accuracy of our parameter estimates.

Lemma 4.

For the expected utility defined in (6), let , and define the suboptimality of as . Then can be bounded by:


See Appendix B. ∎

This lemma gives us hope that if we can learn the change rates $\xi$ well enough, i.e., with a sufficiently small estimation error for every page, then we can obtain sub-linear regret by following the policy computed from the estimates. This is indeed possible and, in the next section, we show that an explore-and-commit algorithm can yield $O(\sqrt{T})$ regret.

5 Explore-Then-Commit Algorithm

In this section, we will analyse a version of the explore-and-commit (ETC) algorithm for solving the web-crawling problem. The algorithm will first learn the change rates $\xi$ by sampling all pages until an exploration time $\tau$ and then, from time $\tau$ until the horizon $T$, commit to the policy of observing the pages at the rates given by Algorithm 2 in (Azar8099), obtained by passing it the estimated rates $\hat{\xi}$ instead of the true rates $\xi$.

Revisiting the policy space. The constraint we had used to define the policy space was that, given any interval of width $w$, the expected number of refreshes in that interval should not exceed $wR$, which limited us to the class of Poisson policies. However, an alternative way to impose the constraint is to bound the time-averaged number of requests made per unit of time asymptotically. It can be shown that, given our modelling assumptions that the request and change processes are memoryless, the policy which maximizes the utility in (2) given a fixed number of observations per page will space them equally. This motivates a second policy class: the set of deterministic policies which refresh each webpage at regular time intervals of a fixed length. Policies from this class allow us to obtain tight confidence intervals for the estimates $\hat{\xi}$ by a straightforward application of Lemma 3. However, the sensitivity of the utility function for this policy space to the quality of the estimated parameters is difficult to bound tightly. In particular, the differential utility function for this class of policies (defined in (3)) is not strongly concave, and strong concavity is a basic building block of Lemma 4. This precludes performance bounds which are quadratic in the error of the estimates, which leads to worse bounds on the regret of the ETC algorithm. These reasons are further expounded in Appendix D. Nevertheless, we show in Appendix C that using the uniform-intervals policy incurs lower regret than the uniform-rates policy, while still making on average $R$ requests per unit time.

Hence, to arrive at regret bounds, we will perform the exploration using the uniform-interval exploration policy, which refreshes every webpage at regular intervals of length $m/R$. This allows us to use Lemma 3 to bound the error of the estimates $\hat{\xi}$ with high probability.

Lemma 5.

For a given $\delta \in (0, 1)$, after following the uniform-interval policy for time $\tau$, which is assumed to be a multiple of $m/R$, we can claim the following for the error in the estimates $\hat{\xi}$ produced using the estimator proposed in Lemma 3:


Running the uniform-interval policy for time results in observations collected for each webpage with time intervals for all , including an observation made at , so that . Substituting these in Lemma 3, we have that for the -th webpage, with probability at most it holds

By the union bound, the above event occurs for any page with probability at most $\delta$, which finishes the proof. ∎

With these lemmas, we can bound the regret suffered by the ETC algorithm using the following Theorem.

Theorem 1.

Let ETC denote the explore-and-commit algorithm which explores using the uniform-interval exploration policy for time $\tau$ (assumed to be a multiple of $m/R$), estimates the change rates using the estimator proposed in (7), and then uses the policy computed from these estimates until time $T$. Then for a given $\delta \in (0, 1)$, with probability $1 - \delta$, the expected regret of the explore-and-commit policy is bounded by:

Further, we can choose an exploration horizon $\tau$ such that, with probability $1 - \delta$, the expected regret is $O(\sqrt{T})$.


Since the utility of any policy is non-negative, we can upper-bound the regret of the algorithm in the exploration phase by the expected utility of the best stationary policy , which is . In the exploitation phase, the regret is given by (see (5)), which we bound using Lemma 4. Hence, we see that (with a slight abuse of notation to allow us to write for ):


As we are using the estimator from Lemma 3, we have . Using this and Lemma 5 with (11), we get with probability :


This proves the first claim.

The bound in (12) takes the maximum value when , giving with probability , the worst-case regret bound of:

This proves the second part of the theorem. ∎

This theorem bounds the expected regret conditioned on the event that the crawler learns the change rates $\xi$ to within the accuracy stated in Lemma 5. These kinds of guarantees have been seen in recent works (rosenski2016multi; avner2014concurrent).

Finally, note that, using the doubling trick, the regret bound can be made horizon-independent at no extra cost. The policy can be de-randomized to yield either a fixed-interval policy or a carousel-like policy with similar performance guarantees (see Algorithm 3 in Azar8099). With this upper bound on the regret of the ETC algorithm, in the next section we explore the empirical performance of the strategy.
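The overall structure of the algorithm can be sketched end to end. The sketch below makes several labelled assumptions: exploration lasts roughly √T time units, each page is refreshed every m/R time units during exploration, the no-change bits are drawn directly as Bernoulli variables with probability exp(-xi × width), and the commit step uses a generic bisection solver instead of Algorithm 2 of Azar8099. All function names are hypothetical.

```python
import math
import random

def commit_rates(xis, zetas, bandwidth, iters=200):
    """Bisection on the Lagrange multiplier for
    max sum_i zeta_i * rho_i / (rho_i + xi_i) s.t. sum_i rho_i = bandwidth."""
    def rates(lam):
        return [max(0.0, math.sqrt(z * x / lam) - x) for x, z in zip(xis, zetas)]
    lo, hi = 0.0, max(z / x for x, z in zip(xis, zetas))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(rates(mid)) > bandwidth:
            lo = mid
        else:
            hi = mid
    return rates(hi)

def explore_then_commit(true_xis, zetas, bandwidth, horizon, xi_min, xi_max, seed=0):
    rng = random.Random(seed)
    m = len(true_xis)
    width = m / bandwidth  # uniform-interval exploration: one refresh per page every m/R
    # Assumption: an exploration length on the order of sqrt(horizon).
    n_obs = max(1, round(math.sqrt(horizon) / width))
    tau = n_obs * width
    xi_hat = []
    for xi in true_xis:
        # Each interval shows "no change" with probability exp(-xi * width).
        unchanged = sum(rng.random() < math.exp(-xi * width) for _ in range(n_obs))
        p_hat = unchanged / n_obs
        raw = math.inf if p_hat == 0 else -math.log(p_hat) / width
        xi_hat.append(min(xi_max, max(xi_min, raw)))  # clip to the known range
    # Commit phase: refresh at these rates from time tau until the horizon.
    return tau, xi_hat, commit_rates(xi_hat, zetas, bandwidth)

tau, xi_hat, rhos = explore_then_commit([0.5, 2.0], [1.0, 1.0], 2.0, 1e6, 0.05, 10.0)
```

The trade-off behind the choice of exploration length is visible in this structure: exploring longer costs utility during exploration but shrinks the estimation error, and hence the per-unit-time loss during the commit phase.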

6 Experimental Evaluation

We start with an empirical evaluation of the MLE estimator and the moment matching estimator for partial observations, and the associated confidence intervals proposed in Lemma 3. These show that, for a variety of different parameters, the performances of the MLE estimator and the moment matching estimator are close to each other. This is followed by the analysis of the ETC algorithm, which shows empirically that the bounds that we have proven in Theorem 1 are tight up to constants. Finally, we compare the ETC algorithm with the phased $\varepsilon$-greedy algorithm and show that phased strategies can out-perform a well-tuned ETC algorithm, if given a sufficient number of phases to learn. We leave the detailed analysis of this class of algorithms for later work.

6.1 Performance of Moment Matching Estimator

Figure 2 (panels (a) to (f)): Error in the estimates produced by the moment matching estimator (in green) compared to the upper bound (in blue) and the MLE estimator (in orange) for three different values of the change rate. The two rows of panels correspond to two different refresh rates (in refresh events per unit time). The error bars show the 25th to 75th percentiles of the error across simulations.

In this section, we consider the performance of the moment matching estimator (7) and the bounds on its performance proposed in Lemma 3.

In all experiments below, we have assumed that and . For a fixed number of observations , known and a fixed random-seed, we simulate times to refresh a webpage