Change Rate Estimation and Optimal Freshness in Web Page Crawling

04/05/2020
by Konstantin Avrachenkov, et al.

To provide quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across web pages. However, finite bandwidth and server restrictions constrain the crawling frequency. The ideal crawling rates are therefore those that maximise the freshness of the local cache while respecting these constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve this optimisation problem. However, they assume that the exact page change rates are known, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes need only partial information about the page change process: whether or not the page has changed since it was last crawled. For both schemes, we prove convergence and derive their convergence rates. Finally, we provide numerical experiments comparing the performance of our proposed estimators with existing ones (e.g., MLE).
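To make the partial-information setting concrete, here is a minimal sketch of the classical maximum-likelihood baseline, not the two schemes proposed in the paper. It assumes, beyond what the abstract states, that page changes follow a Poisson process with rate `rate` and that the page is crawled at a fixed interval `interval`; under those assumptions each crawl reports a change with probability 1 - exp(-rate * interval). The function names and parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def simulate_change_indicators(rate, interval, n_crawls, seed=0):
    """Simulate what the crawler sees: one True/False flag per crawl.

    Assumption (not from the abstract): changes form a Poisson process,
    so a page changed at least once in an interval with probability
    1 - exp(-rate * interval), independently across intervals.
    """
    rng = random.Random(seed)
    p_change = 1.0 - math.exp(-rate * interval)
    return [rng.random() < p_change for _ in range(n_crawls)]


def estimate_change_rate(indicators, interval):
    """Maximum-likelihood estimate of the change rate from binary flags.

    Inverts p = 1 - exp(-rate * interval) at the observed fraction of
    crawls that detected a change.
    """
    n = len(indicators)
    if n == 0:
        raise ValueError("need at least one crawl observation")
    changed = sum(indicators)
    if changed == n:
        # Every crawl saw a change: the likelihood has no finite maximiser.
        return math.inf
    return -math.log(1.0 - changed / n) / interval


if __name__ == "__main__":
    flags = simulate_change_indicators(rate=0.5, interval=1.0, n_crawls=20000)
    print(estimate_change_rate(flags, interval=1.0))
```

The infinite estimate when every crawl detects a change is a known weakness of this baseline when the number of observations is small.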


research
09/17/2020

Online Algorithms for Estimating Change Rates of Web Pages

For providing quick and accurate search results, a search engine maintai...
research
06/17/2021

A Short Note of PAGE: Optimal Convergence Rates for Nonconvex Optimization

In this note, we first recall the nonconvex problem setting and introduc...
research
05/29/2019

Learning to Crawl

Web crawling is the problem of keeping a cache of webpages fresh, i.e., ...
research
11/09/2021

Prediction of new outlinks for focused Web crawling

Discovering new hyperlinks enables Web crawlers to find new pages that h...
research
11/14/2006

Cartes auto-organisées pour l'analyse exploratoire de données et la visualisation

This paper shows how to use the Kohonen algorithm to represent multidime...
research
04/19/2023

WASEF: Web Acceleration Solutions Evaluation Framework

The World Wide Web has become increasingly complex in recent years. This...
research
01/12/2017

VESPA: VIPT Enhancements for Superpage Accesses

L1 caches are critical to the performance of modern computer systems. Th...
