The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

12/19/2016
by   Gerhard Gossen, et al.

Collections of Web documents about specific topics are needed for many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs), which also serve to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires considerable expertise. In this demonstration, we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and Social Media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard, even non-expert users can create semantic specifications for focused crawlers interactively and efficiently.
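
To make the idea of a crawl specification concrete, the following is a minimal sketch of what such a specification might look like: a set of seed URLs plus weighted topic terms, with candidate seeds filtered by how well their text snippets match the terms. It is not the iCrawl Wizard's implementation. The names `CrawlSpec`, `relevance`, and `select_seeds`, the single-word term matching, and the threshold value are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlSpec:
    """Hypothetical crawl specification: seed URLs plus weighted topic terms."""
    seeds: list = field(default_factory=list)
    terms: dict = field(default_factory=dict)  # single-word term -> weight


def relevance(spec, text):
    """Return the share of total term weight whose term occurs in `text`.

    Assumption: a simple bag-of-words match stands in for the semantic
    description mentioned in the abstract.
    """
    total = sum(spec.terms.values())
    if total == 0:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = sum(w for term, w in spec.terms.items() if term.lower() in tokens)
    return matched / total


def select_seeds(spec, candidates, threshold=0.5):
    """Keep candidate URLs whose snippet scores at or above `threshold`.

    `candidates` maps URL -> snippet text, e.g. as a search result might
    provide; the threshold of 0.5 is an arbitrary illustrative default.
    """
    return [url for url, snippet in candidates.items()
            if relevance(spec, snippet) >= threshold]
```

A user-facing tool would presumably let the user inspect and adjust both the terms and the selected seeds before starting the crawl; here the selected URLs could simply be appended to `spec.seeds`.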


