Change Rate Estimation and Optimal Freshness in Web Page Crawling

04/05/2020 ∙ by Konstantin Avrachenkov, et al. ∙ Inria

To provide quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across various web pages. However, finite bandwidth availability and server restrictions constrain the crawling frequency. Consequently, the ideal crawling rates are the ones that maximise the freshness of the local cache while respecting these constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve this optimisation problem. However, they assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawled instance. For both schemes, we prove convergence and derive their convergence rates. Finally, we provide some numerical experiments to compare the performance of our proposed estimators with the existing ones (e.g., MLE).

1 Introduction

The world wide web is gigantic: it has a lot of interconnected information and both the information and the connections keep changing. However, irrespective of the challenges arising out of this, a user always expects a search engine to instantaneously provide accurate and up-to-date results. A search engine deals with this by maintaining a local cache of all the useful web pages and their links. As the freshness of this cache determines the quality of the search results, the search engine regularly updates it by employing a crawler (also referred to as a web spider or a web robot). The job of a crawler is (a) to access various web pages at certain frequencies so as to determine if any changes have happened to the content since the last crawled instance and (b) to update the local cache whenever there is a change. To understand the detailed working of crawlers, see [13, 6, 14, 17, 12].

In general, a crawler has two constraints on how often it can access a page. The first one is due to limitations on the available bandwidth. The second one, also known as the politeness constraint, arises when a server imposes limits on the crawl frequency. The latter implies that the crawler cannot access pages on that server too often in a short amount of time. Such constraints cannot be ignored, since otherwise the server may forbid the crawler from all future accesses.

In summary, to identify the ideal rates for crawling different web pages, a search engine needs to solve the following optimisation problem: Maximise the freshness of the local database subject to constraints on the crawling frequency.

In the early variants of this problem, the freshness of each page was assumed to be equally important [7, 12]. In such cases, experimental evidence surprisingly shows that the uniform policy—crawl all pages at the same frequency irrespective of their change rates—is more or less the optimal crawling strategy.

Starting from the pioneering work in [9], however, the freshness definition was modified to include different weights for different pages depending on their importance, e.g., represented as the frequency of requests for the pages. The motivation for this change was the fact that only a finite number of pages can be crawled in any given time frame. Hence, to improve the utility of the local database, important pages should be kept as fresh as possible. Not surprisingly, under this new definition, the optimal crawling policy does indeed depend on the page change rates. This was numerically demonstrated first in [9] for a setup with a small number of pages. A more rigorous derivation of this fact was recently given in the path breaking paper [2] by Azar et al. In fact, this work also provides a near-linear time algorithm to find a near-optimal solution.

A separate line of work [1, 16] provides a Whittle index based dynamic programming approach to optimise the schedule of a web crawler. In that context, the page/catalogue freshness estimate also influences the optimal crawling policy, so a good estimate of it is needed.

Our work is mainly motivated by the work of Azar et al. [2]. In particular, the input to their algorithm consists of the actual page change rates. However, in practice, these values are not known in advance and, instead, have to be estimated. This is the issue that we address in this paper.

Our main contributions can be summarised as follows. First, we propose two novel approaches for online estimation of the actual page change rates. The first is based on the Law of Large Numbers (LLN), while the second is derived using the Stochastic Approximation (SA) principles. Second, we theoretically show that both these estimators almost surely (a.s.) converge to the actual change rate values, i.e., both our estimators are asymptotically consistent. Furthermore, we also derive their convergence rates in the expected error sense. Finally, we provide some simulation results to compare the performance of our online schemes to each other and also to that of the (offline) MLE estimator. Alongside, we also show how our estimates can be combined with the algorithm in [2] to obtain near-optimal crawling rates.

The rest of this paper is organised as follows. The next section provides a formal summary of this work in terms of the setup, goals, and key contributions. It also gives the explicit update rules for our two online schemes. In Section 3, we discuss their convergence and convergence rates and also provide the formal analysis for the same. The numerical experiments discussed above are given in Section 4. We conclude in Section 5 with some future directions.

2 Setup, Goal, and Key Contributions

The three topics are individually described below.

Setup: We assume the following. The local cache consists of copies of $N$ pages, and $w_i$ denotes the importance of the $i$-th page. Further, each page changes independently, and the actual times at which page $i$ changes form a homogeneous Poisson point process in $[0, \infty)$ with a constant but unknown rate $\Delta_i$. Independent of everything else, page $i$ is crawled (accessed) at the random instances $\{t_k\}_{k \geq 0} \subset [0, \infty)$, where $t_0 = 0$ and the inter-arrival times, i.e., $\{t_k - t_{k-1}\}_{k \geq 1}$, are iid exponential random variables with a known rate $p_i$. Thus, the times at which page $i$ is crawled also form a Poisson point process, but with rate $p_i$. At time instance $t_k$, we get to know if page $i$ got modified or not in the interval $(t_{k-1}, t_k]$, i.e., we can access the value of the indicator

$$I_k := \begin{cases} 1, & \text{if page } i \text{ got modified in } (t_{k-1}, t_k], \\ 0, & \text{otherwise.} \end{cases}$$

We emphasise that each page is crawled independently. In other words, the notations $t_k$ and $I_k$ defined above do depend on the page index $i$. However, we hide this dependence for the sake of notational simplicity. We shall follow this practice for the other notations as well; the dependence on $i$ should be clear from the context.

Although the above assumptions are standard in the crawling literature, we now provide a quick justification for them. Our assumption that the page change process is a Poisson point process is based on the experiments reported in [4, 5, 8]. Some generalised models for the page change process have also been considered in the literature [15, 18]; however, we do not pursue these ideas here. Separately, our assumption on the indicator $I_k$ is based on the fact that a crawler can only access incomplete knowledge about the page change process. In particular, a crawler does not know when and how many times a page has changed between two crawling instances. Instead, all it can track is the status of a page at each crawling instance and know if it has changed or not with respect to the previous access. Sometimes, it is possible to also know the time at which the page was last modified [6, 10], but we do not consider this case here.
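The observation model in the Setup can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (Poisson changes, exponential crawl gaps); the function name `simulate_indicators` and the parameter values used below are hypothetical and do not come from the paper. Because the change process is memoryless, the time to the first change after each crawl can be redrawn as a fresh exponential variable.

```python
import random


def simulate_indicators(change_rate, crawl_rate, num_crawls, seed=0):
    """Simulate one page and return (intervals, indicators).

    intervals[k]  : gap between two successive crawls, drawn from Exp(crawl_rate).
    indicators[k] : 1 if the page changed at least once during that gap, else 0.
    """
    rng = random.Random(seed)
    intervals, indicators = [], []
    for _ in range(num_crawls):
        gap = rng.expovariate(crawl_rate)
        # Time until the first change after the previous crawl; by
        # memorylessness of the Poisson process this is again exponential.
        first_change = rng.expovariate(change_rate)
        intervals.append(gap)
        indicators.append(1 if first_change <= gap else 0)
    return intervals, indicators


# Illustrative values only (assumed, not taken from the paper).
intervals, indicators = simulate_indicators(
    change_rate=5.0, crawl_rate=3.0, num_crawls=20000, seed=1
)
fraction_changed = sum(indicators) / len(indicators)
print(fraction_changed)  # should be close to 5 / (5 + 3)
```

Note that the crawler only ever sees `indicators`; the number of changes inside each gap is never observed.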

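Goal: Develop online algorithms for estimating the change rate $\Delta_i$ of each page in the above setup. Subsequently, find optimal crawling rates $\{p_i\}$ so that the overall freshness of the local cache defined by

$$\frac{1}{T} \sum_{i=1}^{N} w_i \int_0^T \mathbb{1}\{\mathrm{Fresh}(i, t)\} \, dt \tag{1}$$

is maximised subject to $\sum_{i=1}^{N} p_i \leq B$. Here, $T > 0$ is some finite horizon, $B \geq 0$ is a bound on the overall crawling frequency, $\mathbb{1}$ is the indicator, and $\mathrm{Fresh}(i, t)$ is the event that page $i$ is fresh at time $t$, i.e., the local copy matches the actual page.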

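Key Contributions: We present two online methods for estimating the page change rate $\Delta$, the first based on the LLN and the second based on SA. If $\{x_k\}$ and $\{y_k\}$ denote the iterates of these two methods, then their update rules are as shown below. Here, $p$ denotes the known crawl rate of the page.

  • LLN Estimator: For $k \geq 1$,

    $$x_k = \frac{p \, \hat{I}_k}{k - \hat{I}_k + \alpha_k}. \tag{2}$$

    Here, $\hat{I}_k = \sum_{j=1}^{k} I_j$; hence, $\hat{I}_k \in \{0, 1, \ldots, k\}$. And, $\{\alpha_k\}$ is any positive sequence satisfying the conditions in Theorem 1.

  • SA Estimator: For $k \geq 0$ and some initial value $y_0 \in \mathbb{R}$,

    $$y_{k+1} = y_k + \eta_k \left[ I_{k+1} (y_k + p) - y_k \right]. \tag{3}$$

    Here, $\{\eta_k\}$ is any stepsize sequence that satisfies the conditions in Theorem 2.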

We call these methods online because the estimates can be updated on the fly as and when a new observation becomes available. This contrasts with the MLE estimator, in which one needs to start the calculation from scratch each time a new data point arrives. Also, unlike MLE, our estimators are never unstable. See Section 3.3 for the complete details on this.
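To make the "online" property concrete, here is a minimal Python sketch of the two update rules. It assumes that rule (2) has the form $x_k = p \hat{I}_k / (k - \hat{I}_k + \alpha_k)$, as motivated in Section 3.1, and that rule (3) has the form $y_{k+1} = y_k + \eta_k [I_{k+1}(y_k + p) - y_k]$. The function names, the constant choice $\alpha_k = 1$, and the stepsize $\eta_k = (k+1)^{-0.75}$ are illustrative assumptions, not choices prescribed by the paper.

```python
def lln_estimate(crawl_rate, detected, k, alpha_k):
    """LLN-style estimate after k crawls, `detected` of which saw a change."""
    return crawl_rate * detected / (k - detected + alpha_k)


def sa_update(y, indicator, crawl_rate, step):
    """One stochastic-approximation step using the newest indicator."""
    return y + step * (indicator * (y + crawl_rate) - y)


def run_online(indicators, crawl_rate, gamma=0.75, y0=0.0):
    """Feed observations one at a time; each update costs O(1) work."""
    detected = 0
    x, y = 0.0, y0
    for k, indicator in enumerate(indicators, start=1):
        detected += indicator
        x = lln_estimate(crawl_rate, detected, k, alpha_k=1.0)  # assumed alpha_k = 1
        # The k-th observation uses step eta_{k-1} = k^(-gamma).
        y = sa_update(y, indicator, crawl_rate, step=k ** -gamma)
    return x, y


x_final, y_final = run_online([1, 0, 1, 1], crawl_rate=2.0)
print(x_final, y_final)
```

Both updates only need a running count or the previous iterate, so no past observations have to be stored. The positive term `alpha_k` keeps the LLN denominator away from zero even when every crawl detects a change.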

Our main results include the following. We show that both $x_k$ and $y_k$ converge to the true change rate $\Delta$ a.s. Further, we show that

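  1. $E|x_k - \Delta| = O(\max\{k^{-1/2}, \alpha_k / k\})$, and

  2. $E|y_k - \Delta| = O(k^{-\gamma/2})$ if $\eta_k = (k+1)^{-\gamma}$ with $\gamma \in (0, 1)$.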

Finally, we provide three numerical experiments for judging the strength of our two estimators. In the first one, we compare the performance of our estimators to each other and also to that of the Naive estimator and the MLE estimator described in [10]. In the second one, we combine our estimates with the algorithm in [2] and compute the optimal crawling rates. Subsequently, we use this to measure the overall freshness of the local cache. In the last and final experiment, we look at the behaviour of our estimators for different choices of the sequences $\{\alpha_k\}$ and $\{\eta_k\}$.

3 Change rate estimation

Here, we provide a formal convergence and convergence rate analysis for our two estimators. Thereafter, we compare their behaviours to that of the estimators that already exist in the literature—the Naive estimator, the MLE estimator, and the Moment Matching (MM) estimator.

3.1 LLN Estimator

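Our first aim here is to obtain a formula for $E[I_1]$. We shall use this later to motivate the form of our LLN estimator.

Let $\tau_1 = t_1 - t_0$. Then, as per our assumptions in Section 2, $\tau_1$ is an exponential random variable with rate $p$. Also, $E[I_1 \mid \tau_1 = \tau] = 1 - e^{-\Delta \tau}$, since a Poisson process with rate $\Delta$ has no point in an interval of length $\tau$ with probability $e^{-\Delta \tau}$. These two facts put together show that

$$E[I_1] = \int_0^\infty \left(1 - e^{-\Delta \tau}\right) p \, e^{-p \tau} \, d\tau = \frac{\Delta}{\Delta + p}. \tag{4}$$

This gives the desired formula for $E[I_1]$.

From this last calculation, we have

$$\Delta = \frac{p \, E[I_1]}{1 - E[I_1]}. \tag{5}$$

Separately, because $\{I_k\}$ is an iid sequence and $E|I_1| < \infty$, it follows from the strong law of large numbers that $\lim_{k \to \infty} \hat{I}_k / k = E[I_1]$ a.s., where $\hat{I}_k = \sum_{j=1}^{k} I_j$. Thus, $\lim_{k \to \infty} p \hat{I}_k / (k - \hat{I}_k) = \Delta$ a.s.

Consequently, a natural estimator for $\Delta$ is

$$\frac{p \, \hat{I}_k}{k - \hat{I}_k}, \tag{6}$$

where $\hat{I}_k$ is as defined below (2).

Unfortunately, the above estimator faces an instability issue: it is undefined when $I_1, \ldots, I_k$ are all $1$, since the denominator is then zero. To fix this, one can add a non-zero term $\alpha_k$ in the denominator. The different choices of $\{\alpha_k\}$ then give rise to the LLN estimator defined in (2).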
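As a sanity check on the derivation above, the following Python sketch numerically integrates $(1 - e^{-\Delta \tau}) \, p \, e^{-p \tau}$ and compares the result with the closed form $\Delta / (\Delta + p)$, then inverts it to recover $\Delta$. The function names, the midpoint rule, the truncation point, and the values $\Delta = 5$, $p = 3$ are assumptions made purely for illustration.

```python
import math


def expected_indicator_numeric(change_rate, crawl_rate, steps=200_000):
    """Midpoint-rule approximation of E[I_1] over a truncated range."""
    upper = 40.0 / crawl_rate  # the neglected tail has mass about e^{-40}
    h = upper / steps
    total = 0.0
    for j in range(steps):
        tau = (j + 0.5) * h
        prob_changed = 1.0 - math.exp(-change_rate * tau)
        density = crawl_rate * math.exp(-crawl_rate * tau)
        total += prob_changed * density
    return total * h


def change_rate_from_mean(crawl_rate, mean_indicator):
    """Invert E[I_1] = Delta / (Delta + p) to recover Delta."""
    return crawl_rate * mean_indicator / (1.0 - mean_indicator)


mean_numeric = expected_indicator_numeric(change_rate=5.0, crawl_rate=3.0)
mean_closed_form = 5.0 / (5.0 + 3.0)
print(mean_numeric, mean_closed_form)
print(change_rate_from_mean(3.0, mean_closed_form))
```

The inversion step is exactly what the plug-in estimator does once $E[I_1]$ is replaced by the empirical mean of the indicators.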

The following result discusses the convergence and convergence rate of this estimator.

Theorem 1.

Consider the estimator given in (2) for some positive sequence $\{\alpha_k\}$.

  1. If $\lim_{k \to \infty} \alpha_k / k = 0$, then $\lim_{k \to \infty} x_k = \Delta$ a.s.

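  2. Additionally, if $\lim_{k \to \infty} \log(k / \alpha_k) / k = 0$, then $E|x_k - \Delta| = O(\max\{k^{-1/2}, \alpha_k / k\})$.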

Proof.

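Let $\bar{I}_k = \hat{I}_k / k$ and $\bar{\alpha}_k = \alpha_k / k$. Then, observe that (2) can be rewritten as $x_k = p \bar{I}_k / (1 - \bar{I}_k + \bar{\alpha}_k)$. Now, $\lim_{k \to \infty} \bar{I}_k = E[I_1]$ a.s. and $\lim_{k \to \infty} \bar{\alpha}_k = 0$. The first claim holds due to the strong law of large numbers, while the second one is true due to our assumption. Statement (1) is now easy to see from (5).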

We now derive Statement (2). From (5), we have

where

and

Since and, hence,

Similarly,

Because we have assumed we get It remains to show Towards that, let be a positive sequence that we will pick later. Then,

where

and

On the one hand,

On the other hand, since and it follows by applying the Chernoff bound that

We now pick so that for all Then, Now, due to our assumptions on Similarly, whence it follows that These relations together then show that

The desired result now follows. ∎

3.2 SA Estimator

Here, we use the theory of stochastic approximation to study the behaviour of our SA estimator.

Theorem 2.

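Consider the estimator given in (3) for some positive stepsize sequence $\{\eta_k\}$.

  1. Suppose that $\sum_{k \geq 0} \eta_k = \infty$ and $\sum_{k \geq 0} \eta_k^2 < \infty$. Then, $\lim_{k \to \infty} y_k = \Delta$ a.s.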

  2. Suppose that $\eta_k = (k+1)^{-\gamma}$ with $\gamma \in (0, 1)$. Then, $E|y_k - \Delta| = O(k^{-\gamma/2})$.

Proof.

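For $k \geq 0$, let $\mathcal{F}_k = \sigma(y_0, I_1, \ldots, I_k)$. Then, from (4) and the fact that $\{I_k\}$ is an iid sequence, we get

$$E\left[ I_{k+1}(y_k + p) - y_k \mid \mathcal{F}_k \right] = h(y_k),$$

where $h(y) = \frac{\Delta}{\Delta + p}(y + p) - y = -\frac{p}{\Delta + p}(y - \Delta)$. Hence, one can rewrite (3) as

$$y_{k+1} = y_k + \eta_k \left[ h(y_k) + M_{k+1} \right], \tag{7}$$

where $M_{k+1} = (I_{k+1} - E[I_1])(y_k + p)$.

Since $E[M_{k+1} \mid \mathcal{F}_k] = 0$ for all $k \geq 0$, $\{M_k\}$ is a martingale difference sequence. Consequently, (7) is a classical SA algorithm whose limiting ODE is

$$\dot{y}(t) = h(y(t)). \tag{8}$$

Now, Statement (1) follows from Corollary 4 and Theorem 7 in Chapters 2 and 3, respectively, of [3], provided we show that:

  i.) $h$ is a globally Lipschitz continuous function.

  ii.) $\Delta$ is the unique globally asymptotically stable equilibrium of (8).

  iii.) $\sum_{k \geq 0} \eta_k = \infty$ and $\sum_{k \geq 0} \eta_k^2 < \infty$.

  iv.) $\{M_k\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_k\}$. Further, there is a constant $C \geq 0$ such that $E[M_{k+1}^2 \mid \mathcal{F}_k] \leq C(1 + y_k^2)$ for all $k \geq 0$.

  v.) There exists a continuous function $h_\infty$ such that the functions $h_c(y) := h(cy)/c$, $c \geq 1$, satisfy $h_c(y) \to h_\infty(y)$ uniformly on compact sets as $c \to \infty$.

  vi.) The ODE $\dot{y}(t) = h_\infty(y(t))$ has the origin as its unique globally asymptotically stable equilibrium.

Since $h$ is affine, the Lipschitz continuity condition trivially holds. Separately, observe that $h(\Delta) = 0$; this shows that $\Delta$ is an equilibrium point of (8). Now, $L(y) = (y - \Delta)^2 / 2$ is a Lyapunov function for (8) with respect to $\Delta$. This is because $L(y) \geq 0$ and $L'(y) h(y) = -\frac{p}{\Delta + p}(y - \Delta)^2 \leq 0$, while the equality holds in both these relations if and only if $y = \Delta$. This shows that $\Delta$ is the unique globally asymptotically stable equilibrium of (8), which establishes Condition ii.).

Condition iii.) trivially holds due to our assumption about $\{\eta_k\}$. Regarding the next condition, observe that $\{M_k\}$ is indeed a martingale difference sequence. Further, $|M_{k+1}| \leq |y_k| + p$, whence it follows that Condition iv.) also holds.

Next, let $h_\infty(y) = -\frac{p}{\Delta + p} y$. Then, it is easy to see that Condition v.) trivially holds. Similarly, it is easy to see that Condition vi.) holds as well.

Statement (1) now follows, as desired.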

We now sketch a proof for Statement (2). First, note that

where Now, since

Recall that for some constant Using this above and then repeating all the steps from the proof of [11, Theorem 3.1] gives Statement (2), as desired. ∎

3.3 Comparison with Existing Estimators

As far as we know, there are three other approaches in the literature for estimating page change rates—the Naive estimator, the MLE estimator, and the MM estimator. The details about the first two estimators can be found in [10] while, for the third one, one can look at [19]. We now do a comparison, within the context of our setup, between these estimators and the ones that we have proposed.

The Naive estimator simply uses the average number of changes detected to approximate the rate at which a page changes. That is, if $\{n_k\}$ denote the values of the Naive estimator then, in our setup, $n_k = p \hat{I}_k / k$, where $\hat{I}_k$ is as defined below (2). The intuition behind this is the following. If $\tau_1$ is as defined at the beginning of Section 3.1, then observe that the expected number of changes in $(t_0, t_1]$ equals $\Delta E[\tau_1] = \Delta / p$. Hence, the Naive estimator tries to approximate this expected number of changes with $\hat{I}_k / k$ so that the previous relation can then be used for guessing the change rate.

Clearly, $E[n_k] = p \Delta / (\Delta + p) \neq \Delta$. Also, from the strong law of large numbers, $\lim_{k \to \infty} n_k = p \Delta / (\Delta + p) \neq \Delta$ a.s. Thus, this estimator is not consistent and is also biased. This is to be expected since this estimator does not account for all the changes that occur between two consecutive accesses.
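The size of this bias is easy to tabulate. The sketch below assumes the Naive estimator has the form $p \hat{I}_k / k$, so that its almost sure limit is $p \Delta / (\Delta + p)$; the function names and the sample rates are hypothetical and chosen only for illustration.

```python
def naive_estimate(crawl_rate, detected, k):
    """Naive estimate: crawl rate times the fraction of crawls that saw a change."""
    return crawl_rate * detected / k


def naive_limit(change_rate, crawl_rate):
    """Large-sample limit of the Naive estimate under the Poisson model."""
    return crawl_rate * change_rate / (crawl_rate + change_rate)


# Illustrative rates (assumed): the limit never exceeds the crawl rate.
for change_rate in (0.5, 5.0, 50.0):
    print(change_rate, naive_limit(change_rate, crawl_rate=3.0))
```

Since at most one change can be detected per crawl, the Naive estimate can never exceed the crawl rate, which is why it degrades most for pages that change much faster than they are crawled.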

Next, we look at the MLE estimator. Informally, this estimator identifies the parameter value that has the highest probability of producing the observed set of observations. In our setup, the value of the MLE estimator after $k$ crawls is obtained by solving the following equation for $\Delta$:

$$\sum_{j=1}^{k} \frac{I_j \, \tau_j}{e^{\Delta \tau_j} - 1} = \sum_{j=1}^{k} (1 - I_j) \, \tau_j, \tag{9}$$

where $\tau_j = t_j - t_{j-1}$ and $I_j$ is as defined in Section 2. The derivation of this relation is given in [10, Appendix C]. As mentioned in [10, Section 4], the above estimator is consistent.

Note that the MLE estimator makes actual use of the inter-arrival crawl times unlike our two estimators and also the Naive estimator. In this sense, it fully accounts for the randomness in crawling intervals. And, as we shall see in the numerical section, the quality of the estimate obtained via MLE improves rapidly in comparison to the Naive estimator as the sample size increases.

However, MLE suffers in two aspects: computational tractability and mathematical instability. Specifically, note that the MLE estimator lacks a closed form expression. Therefore, one has to solve (9) by using numerical methods such as the Newton–Raphson method, Fisher’s Scoring Method, etc. Unfortunately, using these ideas to solve (9) takes more and more time as the number of samples grows. Also note that, under the above solution ideas, the MLE estimator works in an offline fashion, in that each time we get a new observation, (9) needs to be solved afresh. This is because there is no easy way to efficiently reuse the calculations from one iteration into the next. One reasonable alternative is to perform MLE estimation in a batch mode, i.e., wait until we gather a large number of samples and then apply one of the above-mentioned methods. However, even then the computation time will be long when the number of samples is large.

Besides the complexity, the MLE estimator is also unstable in two situations. One, when no changes have been detected ($\hat{I}_k = 0$, where $\hat{I}_k$ is the number of detected changes), and the other, when all the accesses detect a change ($\hat{I}_k = k$). In the first setting, no solution exists; in the second setting, the solution is $\infty$. One simple strategy to avoid these instability issues is to clip the estimate to some pre-defined range whenever one of these bad observation instances occurs.
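For contrast with the online schemes, here is a hedged Python sketch of a batch MLE solver. It assumes the likelihood equation takes the form $\sum_{j: I_j = 1} \tau_j / (e^{\Delta \tau_j} - 1) = \sum_{j: I_j = 0} \tau_j$, which follows from the Poisson model of Section 2, and it uses plain bisection rather than Newton–Raphson or Fisher scoring. The function name and the clipping range `[lo, hi]` are assumptions made for illustration.

```python
import math


def mle_change_rate(intervals, indicators, lo=1e-6, hi=1e6, iters=200):
    """Solve the assumed likelihood equation by bisection, clipped to [lo, hi]."""
    changed = [t for t, ind in zip(intervals, indicators) if ind == 1 and t > 0.0]
    unchanged_total = sum(t for t, ind in zip(intervals, indicators) if ind == 0)
    if not changed:
        return lo  # no change detected: no positive solution, so clip
    if unchanged_total == 0.0:
        return hi  # every crawl detected a change: solution is infinite, so clip

    def score(rate):
        # Left side minus right side of the equation; decreasing in `rate`.
        total = 0.0
        for t in changed:
            z = rate * t
            if z < 700.0:  # beyond this the term is numerically zero
                total += t / math.expm1(z)
        return total - unchanged_total

    if score(lo) <= 0.0:
        return lo
    if score(hi) >= 0.0:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With equal unit gaps the equation has the closed-form root log(1 + 3/1).
print(mle_change_rate([1.0, 1.0, 1.0, 1.0], [1, 1, 1, 0]), math.log(4.0))
```

Every evaluation of `score` touches all stored intervals, so the cost of one solve grows with the number of samples, and the whole solve has to be repeated when a new observation arrives.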

Finally, we talk about the MM estimator. Here, one looks at the fraction of times no changes were detected during page accesses and, then, using a moment matching method tries to approximate the actual page change rate. In our context, the value of this estimator is obtained by solving for The details of this equation are given in [19, Section 4]. While the MM idea is indeed simpler than MLE, the associated estimation process continues to suffer from similar instability and computational issues like the ones discussed above.

We emphasise that none of our estimators suffer from any of the issues mentioned above. In particular, both our estimators are online and have a significantly simple update rule; thus, improving the estimate whenever a new data point arrives is extremely easy. Also, both our estimators are stable, i.e., the estimated values will almost surely be finite. More importantly, the performance of our estimators is comparable to that of MLE. This can be seen from the numerical experiments in Section 4.

(a)
(b) Confidence interval
Figure 1: Comparison between Different Estimators.

4 Numerical Results

In this section, we provide three simulations to help evaluate the strength of our estimators. In the first experiment, we look at how well our estimation ideas perform in comparison to the Naive and the MLE estimator. In the second experiment, we substitute the change rate estimates obtained via the above approaches into the algorithm given in [2] and compute the optimal crawling rates. To judge the quality of the crawling policy so obtained, we also look at the associated average freshness as defined in (1). Finally, in the third experiment, we compare the performance of our two estimators for different choices of $\{\alpha_k\}$ and $\{\eta_k\}$, respectively.

Expt. 1: Comparison of Estimation Quality

Here, we compare four different page rate estimators: LLN, SA, Naive, and MLE. Their performances can be seen in Fig 1. We now describe what is happening in the two figures there. Unless specified, the notations are as in Section 2.

In Fig. 1(a), we work with exactly one page. We suppose that the times at which this page changes is a homogeneous Poisson point process with rate Separately, we set the crawling frequency arbitrarily to be This implies that the times at which we crawl this page is another Poisson point process with rate

Using the above parameters, we now generate the random time instances at which this page changes. Alongside, we also sample the time instances at which this page is crawled. We then check if the page has changed or not between two successive page accesses. This generates the values of indicator sequence

We now give and as input to the four different estimators mentioned above and analyse their performances. The trajectory shown in Fig. 1(a) corresponds to exactly one run of each estimator. Note that the trajectory of the estimates obtained by the SA estimator is labelled etc. For the SA estimator, we had set with On the other hand, for our LLN estimator, we had set

In Fig. 1(b), the parameter values are exactly as in Fig. 1(a). However, we now run the simulation multiple times; the page change times and the page access times are generated afresh in each run. We then look at the confidence interval of the obtained estimates.

We now summarise our findings. Clearly, in each case, we can observe that the performances of the MLE, LLN, and SA estimators are comparable to each other and all of them outperform the Naive estimator. This last observation is not surprising since the Naive estimator completely ignores the missing changes between two crawling instances. However, the fact that the estimates from our approaches are close to that of the MLE estimator, both in terms of mean and variance, was indeed surprising to us. This is because, unlike MLE, our estimators completely ignore the actual lengths of the intervals between two accesses. Instead, they use the crawl rate $p$, which only accounts for the mean interval length.

While the plots do not show this, we once again draw attention to the fact that the time taken by each iteration in MLE rapidly grows as the number of samples increases. However, our estimators take roughly the same amount of time for each iteration.
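A scaled-down version of this comparison can be reproduced with the following self-contained Python sketch. The parameter values used in the paper's figures are not available here, so the change rate, crawl rate, number of crawls, $\alpha_k = 1$, and stepsize exponent below are all assumed values; the estimator forms are the ones assumed for (2), (3), and the Naive estimator $p \hat{I}_k / k$.

```python
import random


def run_experiment(change_rate, crawl_rate, num_crawls, gamma=0.75, seed=0):
    """Simulate one page and return the final LLN, SA and Naive estimates."""
    rng = random.Random(seed)
    detected = 0
    sa = 0.0  # assumed initial value y_0 = 0
    for k in range(1, num_crawls + 1):
        gap = rng.expovariate(crawl_rate)
        # The page changed in this gap iff its first change arrives before the crawl.
        changed = 1 if rng.expovariate(change_rate) <= gap else 0
        detected += changed
        # SA step with stepsize k^(-gamma) for the k-th observation.
        sa += k ** -gamma * (changed * (sa + crawl_rate) - sa)
    lln = crawl_rate * detected / (num_crawls - detected + 1.0)  # assumed alpha_k = 1
    naive = crawl_rate * detected / num_crawls
    return {"lln": lln, "sa": sa, "naive": naive}


estimates = run_experiment(change_rate=5.0, crawl_rate=3.0, num_crawls=20000, seed=7)
for name, value in estimates.items():
    print(name, round(value, 3))
```

In runs of this kind, the LLN and SA values should settle near the true change rate, whereas the Naive value stays below the crawl rate regardless of how many samples are collected.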

(a) Optimal crawling rate for Page
(b) Average Freshness
Figure 2: Optimal Crawling Rates and Freshness

Expt. 2: Optimal Crawling rates and Freshness

In this experiment, we consider pages together. The

sequence—the mean change rates for different pages—is obtained by sampling independently from the uniform distribution on

i.e., . We further assume that the bound on the overall bandwidth is The initial crawling frequencies for different pages are set by breaking up evenly across all pages, i.e., for all Because the values are arbitrarily chosen, these are not the optimal crawling rates. We then independently generate the change and access times for each page as in Expt. 1. Subsequently, we estimate the unknown change rate for each page individually.

For each we then substitute the change rate estimates given by the different estimators into [2, Algorithm 2] and obtain the associated optimal crawling rates. In the same way, we substitute the actual values there and obtain the true optimal crawling rates. Fig. 2(a) provides a comparison between these values for a single page. We can see that the estimate of the optimal crawling rate obtained from our approaches is much better than that of the Naive estimator.

To check how good our estimate of the true optimal crawling policy is, we look at the associated average freshness given by

$$\sum_{i=1}^{N} \frac{w_i \, p_i}{p_i + \Delta_i} \tag{10}$$

and compare the same to that of the true optimal crawling policy. (Footnote 2: In [2], it was shown that maximising (1) under a bandwidth constraint for large enough $T$ corresponds to maximising (10) under the same bandwidth constraint.) This comparison is given in Fig. 2(b). Somewhat surprisingly, the average freshness does not vary much for all the three estimators. However, eventually, the average freshness captured by our estimators becomes much closer to the true optimal average freshness.
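The average freshness objective is simple to evaluate directly. The Python sketch below assumes it has the form $\sum_i w_i p_i / (p_i + \Delta_i)$ and, for a toy two-page instance, finds the best bandwidth split by brute-force grid search. This search is only an illustration and is not the algorithm from [2]; the function names and all numerical values are hypothetical.

```python
def average_freshness(weights, crawl_rates, change_rates):
    """Assumed long-run freshness: sum of w_i * p_i / (p_i + Delta_i)."""
    total = 0.0
    for w, p, d in zip(weights, crawl_rates, change_rates):
        if p > 0.0:  # a page that is never crawled contributes nothing
            total += w * p / (p + d)
    return total


def best_two_page_split(weights, change_rates, bandwidth, grid=2000):
    """Grid search over (p1, bandwidth - p1) for a two-page instance."""
    best_rates, best_value = None, -1.0
    for j in range(grid + 1):
        p1 = bandwidth * j / grid
        rates = (p1, bandwidth - p1)
        value = average_freshness(weights, rates, change_rates)
        if value > best_value:
            best_rates, best_value = rates, value
    return best_rates, best_value


weights, change_rates, bandwidth = [1.0, 1.0], [1.0, 9.0], 2.0
uniform_value = average_freshness(weights, [1.0, 1.0], change_rates)
best_rates, best_value = best_two_page_split(weights, change_rates, bandwidth)
print(uniform_value, best_rates, best_value)
```

In this toy instance the search gives the whole budget to the slowly changing page, which illustrates why the optimal policy depends on the change rates and why errors in their estimates can shift the resulting crawl rates.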

(a) LLN estimator for different choices
(b) SA estimator with for different choices
Figure 3: Impact of $\{\alpha_k\}$ and $\{\eta_k\}$ choices on Performance.

Expt. 3: Impact of $\{\alpha_k\}$ and $\{\eta_k\}$ choices

The theoretical results presented in Section 3 showed that the convergence rate of our estimators is affected by the choice of $\{\alpha_k\}$ and $\{\eta_k\}$, respectively. Figures 3(a) and 3(b) provide a numerical verification of the same.

The details are as follows. Here, again, we restrict our attention to one single page. For Fig. 3(a), we chose and Notice that the page change rate is very high, whereas the crawling frequency is relatively a low value. We then used the LLN estimator with three different choices of these choices are shown in the figure itself. The LLN estimator with has the worst performance. This behaviour matches the prediction made by Theorem 1.

In Fig. 3(b), we again consider the same setup as above. However, this time we run the SA estimator with three different choices of the choices are given in the figure itself. We see that the performance for is better than the case. This is as predicted in Theorem 2. However, it worsens for the case. Notice that the latter case is not covered by Theorem 2.

5 Conclusion and Future work

We proposed two new online approaches for estimating the rate of change of web pages. Both these estimators are computationally efficient in comparison to the MLE estimator. We first provide theoretical analysis on the convergence of our estimators and then provide numerical simulations to compare their performance with the existing estimators in the literature. From numerical experiments, we have verified that the proposed estimators perform significantly better than the Naive estimator and have extremely simple update rules which make them computationally attractive.

The performance of both our estimators currently depends on the choice of $\{\alpha_k\}$ and $\{\eta_k\}$, respectively. One aspect to analyse in the future is the ideal choice of these sequences, i.e., the one that attains the fastest convergence rate. Another interesting research direction is to combine the online estimation with dynamic optimisation.

Acknowledgement

This work is partly supported by ANSWER project PIA FSN2 (P15 9564-266178 \DOS0060094) and DST-Inria project “Machine Learning for Network Analytics” IFC/DST-Inria-2016-01/448.

References

  • [1] K. E. Avrachenkov and V. S. Borkar (2016) Whittle index policy for crawling ephemeral content. IEEE Transactions on Control of Network Systems 5 (1), pp. 446–455.
  • [2] Y. Azar, E. Horvitz, E. Lubetzky, Y. Peres, and D. Shahaf (2018) Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences 115 (32), pp. 8099–8103.
  • [3] V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer, India.
  • [4] B. E. Brewington and G. Cybenko (2000) How dynamic is the web?. Computer Networks 33 (1-6), pp. 257–276.
  • [5] B. E. Brewington and G. Cybenko (2000) Keeping up with the changing web. Computer 33 (5), pp. 52–58.
  • [6] C. Castillo (2005) Effective web crawling. In ACM SIGIR Forum, Vol. 39, New York, NY, USA, pp. 55–56.
  • [7] J. Cho and H. Garcia-Molina (2000) Synchronizing a database to improve freshness. ACM SIGMOD Record 29 (2), pp. 117–128.
  • [8] J. Cho and H. Garcia-Molina (2000) The evolution of the web and implications for an incremental crawler. In 26th International Conference on Very Large Databases, San Francisco, CA, USA, pp. 1–18.
  • [9] J. Cho and H. Garcia-Molina (2003) Effective page refresh policies for web crawlers. ACM Transactions on Database Systems (TODS) 28 (4), pp. 390–426.
  • [10] J. Cho and H. Garcia-Molina (2003) Estimating frequency of change. ACM Transactions on Internet Technology (TOIT) 3 (3), pp. 256–290.
  • [11] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor (2018) Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, pp. 6144–6160.
  • [12] J. Edwards, K. McCurley, and J. Tomlin (2001) An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web, Vol. 8, New York, NY, USA, pp. 106–113.
  • [13] A. Heydon and M. Najork (1999) Mercator: a scalable, extensible web crawler. World Wide Web 2 (4), pp. 219–229.
  • [14] R. Kumar, A. Jain, and C. Agrawal (2016) A survey of web crawling algorithms. Advances in vision computing: An international journal 3, pp. 1–7.
  • [15] N. Matloff (2005) Estimation of internet file-access/modification rates from indirect data. ACM Transactions on Modeling and Computer Simulation (TOMACS) 15 (3), pp. 233–253.
  • [16] J. Niño-Mora (2014) A dynamic page-refresh index policy for web crawlers. In Analytical and Stochastic Modeling Techniques and Applications, Cham, pp. 46–60.
  • [17] C. Olston, M. Najork, et al. (2010) Web crawling. Foundations and Trends® in Information Retrieval 4 (3), pp. 175–246.
  • [18] S. R. Singh (2007) Estimating the rate of web page updates. In Proc. International Joint Conferences on Artificial Intelligence, San Francisco, CA, USA, pp. 2874–2879.
  • [19] U. Upadhyay, R. Busa-Fekete, W. Kotlowski, D. Pal, and B. Szorenyi (2020) Learning to crawl. In Thirty-fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, pp. 8471–8478.

Footnote to the title: This paper has been accepted to the 13th EAI International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS’20, May 18–20, 2020, Tsukuba, Japan. This is the author version of the paper.