Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers

03/06/2022
by Dulani Meedeniya, et al.

The backbone of every search engine is its set of web crawlers, which traverse all indexed web pages and update the search indexes with fresh copies whenever a page has changed. Crawling supports optimum search results by keeping the indexes refreshed and up to date. Doing so requires an "ideal scheduler" that crawls each web page immediately after a change occurs. Building an optimum scheduler becomes possible when the web crawler knows how often a particular page changes. This paper discusses a novel methodology for determining the change frequency of a web page using machine learning and server scheduling techniques. The methodology was evaluated on more than 3,000 web pages with various change patterns. The results indicate how Information Access (IA) and Performance Gain (PG) are balanced out to zero in order to create an optimum crawling schedule for search engine indexing.
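
To illustrate why knowing a page's change frequency helps scheduling, here is a minimal sketch. It is not the paper's method, which relies on a random forest classifier; instead it assumes the crawler already holds a list of timestamps at which changes to one page were observed, takes the mean gap between them as the change interval, and schedules the next crawl one interval after the latest change. The function names `estimate_change_interval` and `next_crawl_time` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_change_interval(change_times: list[datetime]) -> timedelta:
    """Return the mean gap between consecutive observed changes.

    Assumption: a plain average of gaps stands in for the learned
    change frequency described in the abstract.
    """
    if len(change_times) < 2:
        raise ValueError("need at least two observed changes")
    ordered = sorted(change_times)
    # Total span divided by the number of gaps gives the mean gap.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def next_crawl_time(change_times: list[datetime]) -> datetime:
    """Schedule the next crawl one estimated interval after the last change."""
    return max(change_times) + estimate_change_interval(change_times)


# Hypothetical observations: the page changed every six hours.
observed = [
    datetime(2022, 3, 1, 0, 0),
    datetime(2022, 3, 1, 6, 0),
    datetime(2022, 3, 1, 12, 0),
]

print(next_crawl_time(observed))  # 2022-03-01 18:00:00
```

A crawler that revisits the page close to the predicted time avoids both stale index entries and wasted fetches of unchanged content, which is the trade-off the abstract describes.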

research
03/06/2022

Detection of Change Frequency in Web Pages to Optimize Server-based Scheduling

The Internet at present has become vast and dynamic with the ever increa...
research
04/30/2023

Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives

Webpages change over time, and web archives hold copies of historical ve...
research
09/17/2020

Online Algorithms for Estimating Change Rates of Web Pages

For providing quick and accurate search results, a search engine maintai...
research
11/21/2017

Raspberry Pi and Arduino Uno Working together as a Basic Meteorological Station

The present paper describes a novel Raspberry Pi and Arduino UNO archite...
research
05/19/2019

Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors

Phishing, a continuously growing cyber threat, aims to obtain innocent u...
research
10/22/2018

An Efficient Bandit Algorithm for Realtime Multivariate Optimization

Optimization is commonly employed to determine the content of web pages,...
research
10/19/2012

Exploiting Locality in Searching the Web

Published experiments on spidering the Web suggest that, given training ...
