Exploiting Locality in Searching the Web

by   Joel Young, et al.

Published experiments on spidering the Web suggest that, given training data in the form of a (relatively small) subgraph of the Web containing a subset of a selected class of target pages, it is possible to conduct a directed search and find additional target pages significantly faster (with fewer page retrievals) than by performing a blind or uninformed search, whether random or systematic, e.g., breadth-first search. If true, this claim motivates a number of practical applications. Unfortunately, these experiments were carried out in specialized domains or under conditions that are difficult to replicate. We present and apply an experimental framework designed to reexamine and resolve the basic claims of the earlier work, so that the supporting experiments can be replicated and built upon. We provide high-performance tools for building experimental spiders, make use of the ground truth and static nature of the WT10g TREC Web corpus, and rely on simple, well-understood machine learning techniques to conduct our experiments. In this paper, we describe the basic framework, motivate the experimental design, and report on our findings, which support and qualify the conclusions of the earlier research.
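To make the comparison in the abstract concrete, the following is a minimal sketch of counting page retrievals for a breadth-first spider versus a directed (best-first) spider. It is not the authors' framework and does not use WT10g: the toy link graph, the page names, and the `toy_score` function (a stand-in for a learned relevance model) are all assumptions made for illustration.

```python
import heapq
from collections import deque


def bfs_crawl(graph, seed, targets):
    """Fetch pages in breadth-first order; return retrievals needed to find all targets."""
    queue, seen, found, fetched = deque([seed]), {seed}, set(), 0
    while queue and found != targets:
        page = queue.popleft()
        fetched += 1  # one page retrieval
        if page in targets:
            found.add(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def directed_crawl(graph, seed, targets, score):
    """Fetch the highest-scoring frontier page first; return retrievals needed."""
    # Heap entries are (-score, insertion order, page); the counter breaks ties
    # so that equally scored pages are fetched in discovery order.
    frontier, seen, found, fetched, counter = [(-score(seed), 0, seed)], {seed}, set(), 0, 1
    while frontier and found != targets:
        _, _, page = heapq.heappop(frontier)
        fetched += 1  # one page retrieval
        if page in targets:
            found.add(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), counter, link))
                counter += 1
    return fetched


# Hypothetical toy Web: target pages cluster behind one hub (locality).
graph = {
    "seed": ["news", "sports", "ml-hub"],
    "news": ["news-1", "news-2"],
    "sports": ["sports-1", "sports-2"],
    "ml-hub": ["ml-1", "ml-2"],
}
targets = {"ml-1", "ml-2"}


def toy_score(page):
    """Assumed stand-in for a trained model: favors pages whose name looks on-topic."""
    return 1.0 if page.startswith("ml") else 0.0


print("breadth-first retrievals:", bfs_crawl(graph, "seed", targets))
print("directed retrievals:", directed_crawl(graph, "seed", targets, toy_score))
```

On this toy graph the directed spider needs fewer retrievals because the scorer steers it to the hub first. Whether a realistic, learned scorer yields a comparable advantage on real Web data is exactly the question the paper sets out to examine.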




Color Assessment and Transfer for Web Pages

Colors play a particularly important role in both designing and accessin...

Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers

The backbone of every search engine is the set of web crawlers, which go...

Google Dataset Search by the Numbers

Scientists, governments, and companies increasingly publish datasets on ...

CPS-MEBR: Click Feedback-Aware Web Page Summarization for Multi-Embedding-Based Retrieval

Embedding-based retrieval (EBR) is a technique to use embeddings to repr...

Recommending Relevant Sections from a Webpage about Programming Errors and Exceptions

Programming errors or exceptions are inherent in software development an...

HTMLPhish: Enabling Accurate Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis

Recently, the development and implementation of phishing attacks require...