Bootstrapping Domain-Specific Content Discovery on the Web

02/25/2019
by Kien Pham, et al.

The ability to continuously discover domain-specific content on the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect two sets of Web pages: representative pages that serve as training examples for a classifier that recognizes pages in D, and pages that seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills them, and ranks them. It also applies multiple discovery strategies: keyword-based and related queries issued to search engines, as well as backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
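The iterative discover, distill, and rank loop described in the abstract can be illustrated with a minimal Python sketch. This is not DISCO's implementation: the in-memory `WEB` dictionary, the `Page` type, the `discover` function, its `rounds` and `top_k` parameters, and the Jaccard token overlap used for ranking are all assumptions made for illustration, since the abstract does not specify the ranking function. Only forward and backward crawling are modeled; the search-engine strategies (keyword-based and related queries) are omitted because they would require an external service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Page:
    """Hypothetical stand-in for a fetched page: its tokens and outlinks."""
    tokens: Set[str]
    links: List[str] = field(default_factory=list)


Web = Dict[str, Page]
Operator = Callable[[str, Web], List[str]]


def forward_crawl(url: str, web: Web) -> List[str]:
    """Forward crawling: follow the links found on a page."""
    return list(web[url].links)


def backward_crawl(url: str, web: Web) -> List[str]:
    """Backward crawling: find pages that link to a page."""
    return [src for src, page in web.items() if url in page.links]


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap; an assumed placeholder for the real ranking function."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def discover(seeds: List[str], web: Web, operators: List[Operator],
             rounds: int = 2, top_k: int = 2) -> List[str]:
    """Iteratively discover candidates, distill them, and keep the top-ranked."""
    seed_tokens: Set[str] = set().union(*(web[s].tokens for s in seeds))
    relevant = list(seeds)
    seen = set(seeds)
    for _ in range(rounds):
        # Discover: apply every operator to every page judged relevant so far.
        candidates: Set[str] = set()
        for url in relevant:
            for op in operators:
                candidates.update(op(url, web))
        # Distill: keep only known pages that have not been considered before.
        candidates = {u for u in candidates if u in web} - seen
        if not candidates:
            break
        # Rank: most similar to the seeds first; ties broken by URL.
        ranked = sorted(
            candidates,
            key=lambda u: (-similarity(web[u].tokens, seed_tokens), u),
        )
        seen |= candidates
        relevant.extend(ranked[:top_k])
    return relevant


# Toy data, invented for this example.
WEB: Web = {
    "a.org": Page({"ebola", "outbreak", "health"}, ["b.org", "x.com"]),
    "b.org": Page({"ebola", "health", "clinic"}, ["c.org"]),
    "c.org": Page({"outbreak", "health", "vaccine"}),
    "d.org": Page({"ebola", "outbreak"}, ["a.org"]),
    "x.com": Page({"shoes", "sale"}, ["a.org"]),
}

found = discover(["a.org"], WEB, [forward_crawl, backward_crawl])
print(found)  # ['a.org', 'd.org', 'b.org', 'c.org']
```

In this toy run, backward crawling surfaces `d.org`, which the seed never links to, and ranking keeps the off-topic `x.com` out of the result even though it is reachable in both directions. That is the intuition behind combining several discovery strategies with a ranking step.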

