Exploiting Locality in Searching the Web

10/19/2012
by Joel Young, et al.

Published experiments on spidering the Web suggest that, given training data in the form of a (relatively small) subgraph of the Web containing a subset of a selected class of target pages, it is possible to conduct a directed search and find additional target pages significantly faster (that is, with fewer page retrievals) than with an uninformed search, whether random or systematic, such as breadth-first search. If true, this claim motivates a number of practical applications. Unfortunately, these experiments were carried out in specialized domains or under conditions that are difficult to replicate. We present and apply an experimental framework designed to reexamine and resolve the basic claims of the earlier work, so that the supporting experiments can be replicated and built upon. We provide high-performance tools for building experimental spiders, make use of the ground truth and static nature of the WT10g TREC Web corpus, and rely on simple, well-understood machine learning techniques to conduct our experiments. In this paper, we describe the basic framework, motivate the experimental design, and report on our findings, which support and qualify the conclusions of the earlier research.
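
To make the comparison in the abstract concrete, the following is a minimal sketch, not the authors' framework, of how a directed (best-first) spider can reach target pages with fewer retrievals than breadth-first search. The toy link graph, the target set, and the per-page scores are all hypothetical; the scores stand in for whatever estimate a trained classifier might assign to a link before the page is fetched.

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links.
GRAPH = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["t1", "c1"],
    "a1": [], "a2": [], "b1": [], "b2": [],
    "c1": ["t2", "t3"],
    "t1": [], "t2": [], "t3": [],
}
TARGETS = {"t1", "t2", "t3"}

# Hypothetical link scores, standing in for a learned model's estimate
# of how likely a link is to lead to (or be) a target page.
SCORE = {
    "seed": 0.5, "a": 0.1, "b": 0.1, "c": 0.9,
    "a1": 0.1, "a2": 0.1, "b1": 0.1, "b2": 0.1,
    "c1": 0.8, "t1": 0.9, "t2": 0.9, "t3": 0.9,
}


def bfs_crawl(seed):
    """Count page retrievals until every target is found, using BFS."""
    queue, seen, found, fetched = deque([seed]), {seed}, set(), 0
    while queue and found != TARGETS:
        page = queue.popleft()
        fetched += 1  # one simulated page retrieval
        if page in TARGETS:
            found.add(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def best_first_crawl(seed):
    """Count page retrievals when the highest-scored link is fetched first."""
    counter = 0  # tie-breaker so equal scores pop in insertion order
    heap, seen, found, fetched = [(-SCORE[seed], counter, seed)], {seed}, set(), 0
    while heap and found != TARGETS:
        _, _, page = heapq.heappop(heap)
        fetched += 1  # one simulated page retrieval
        if page in TARGETS:
            found.add(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(heap, (-SCORE[link], counter, link))
    return fetched


if __name__ == "__main__":
    print("breadth-first retrievals:", bfs_crawl("seed"))
    print("best-first retrievals:  ", best_first_crawl("seed"))
```

On this hand-built graph the directed spider needs 6 retrievals where breadth-first search needs 12. The gap depends entirely on how well the scores track the true location of the targets, and that dependence is the claim the experiments described above set out to test on a real corpus.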


