Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers

by Dulani Meedeniya, et al.

The backbone of every search engine is its set of web crawlers, which revisit all indexed web pages and update the search indexes with fresh copies whenever a page has changed. Crawling keeps search results accurate by keeping the indexes refreshed and up to date. Ideally, this requires an "ideal scheduler" that crawls each web page immediately after a change occurs. Creating an optimum scheduler is possible when the web crawler knows how often a particular page changes. This paper discusses a novel methodology to determine the change frequency of a web page using machine learning and server scheduling techniques. The methodology has been evaluated on more than 3,000 web pages with various change patterns. The results indicate how Information Access (IA) and Performance Gain (PG) are balanced out to zero in order to create an optimum crawling schedule for search engine indexing.



