Extracting Event-Centric Document Collections from Large-Scale Web Archives

07/28/2017
by Gerhard Gossen, et al.

Web archives are typically very broad in scope and extremely large in scale, which makes data analysis appear daunting, especially for non-computer scientists. These collections constitute an increasingly important source for researchers in the social and historical sciences and for journalists interested in studying past events. However, there are currently no access methods that help users efficiently access information, in particular about specific events, beyond the retrieval of individual disconnected documents. Therefore, we propose a novel method to extract event-centric document collections from large-scale Web archives. This method relies on a specialized focused extraction algorithm. Our experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables the extraction of event-centric collections for different event types.
