Bootstrapping Domain-Specific Content Discovery on the Web

by Kien Pham, et al.
NYU college

The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers, distills, and ranks new pages. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, as well as backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
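To make the discover, distill, and rank loop concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy in-memory link graph in place of the Web, uses keyword-set Jaccard similarity to the seeds as a stand-in ranking function, and implements only forward and backward crawling (search-engine queries are omitted). All names here (`LINKS`, `KEYWORDS`, `discover`, `harvest_rate`) and all site data are hypothetical. Harvest rate is computed in the usual focused-crawling sense: the fraction of discovered sites that are relevant.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy data standing in for the Web (not from the paper).
LINKS: Dict[str, List[str]] = {          # site -> sites it links to
    "a.org": ["b.org", "d.com"],
    "b.org": ["e.com"],
    "c.org": ["a.org"],
}
KEYWORDS: Dict[str, Set[str]] = {        # site -> keywords found on it
    "a.org": {"solar", "panel", "energy"},
    "b.org": {"solar", "energy", "grid"},
    "c.org": {"solar", "panel", "energy", "inverter"},
    "d.com": {"shoes", "sale"},
    "e.com": {"recipes", "pasta"},
}

Strategy = Callable[[str], List[str]]


def forward_links(site: str) -> List[str]:
    """Forward crawling: sites that `site` links to."""
    return LINKS.get(site, [])


def backward_links(site: str) -> List[str]:
    """Backward crawling: sites that link to `site`."""
    return [src for src, outs in LINKS.items() if site in outs]


def similarity(site: str, seeds: List[str]) -> float:
    """Best Jaccard overlap between a site's keywords and any seed's."""
    words = KEYWORDS.get(site, set())
    best = 0.0
    for seed in seeds:
        seed_words = KEYWORDS.get(seed, set())
        union = words | seed_words
        if union:
            best = max(best, len(words & seed_words) / len(union))
    return best


def discover(seeds: List[str], strategies: List[Strategy],
             rounds: int, top_k: int) -> List[str]:
    """Iteratively discover, distill (deduplicate), and rank new sites."""
    known = set(seeds)
    frontier = list(seeds)
    discovered: List[str] = []
    for _ in range(rounds):
        # Discover: apply every strategy to every frontier site.
        candidates: Set[str] = set()
        for site in frontier:
            for strategy in strategies:
                candidates.update(strategy(site))
        # Distill: drop sites that were already seen.
        candidates -= known
        if not candidates:
            break
        # Rank: most seed-like first; ties broken by name for determinism.
        ranked = sorted(candidates, key=lambda s: (-similarity(s, seeds), s))
        known |= candidates
        discovered.extend(ranked)
        # Only the top-ranked sites drive the next round.
        frontier = ranked[:top_k]
    return discovered


def harvest_rate(discovered: List[str], relevant: Set[str]) -> float:
    """Fraction of discovered sites that are relevant."""
    if not discovered:
        return 0.0
    return sum(1 for s in discovered if s in relevant) / len(discovered)


found = discover(["a.org"], [forward_links, backward_links], rounds=3, top_k=2)
rate = harvest_rate(found, {"b.org", "c.org"})
```

In this sketch, expanding only the `top_k` ranked sites each round is what keeps discovery focused: low-scoring sites are recorded but never used as starting points for further crawling. A real system would replace the keyword overlap with a learned ranking model and the toy graph with live crawling and search-engine queries.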

