Harvest – An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums

02/03/2021
by Albert Weichselbraun, et al.

Automatic extraction of forum posts and metadata is a crucial but challenging task, since forums do not expose their content in a standardized structure. Content extraction methods therefore often need customization, such as adaptations to page templates and improvements to their extraction code, before they can be deployed to new forums. Most current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content, such as the identification of author metadata and information on the thread structure. This paper therefore presents a method that determines the XPath of forum posts, eliminating the incorrect mergers and splits of extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL, and structure are extracted. We also introduce Harvest, a new open source toolkit that implements the presented methods, and create a gold standard extracted from 52 different Web forums for evaluating our approach. A comprehensive evaluation reveals that Harvest clearly outperforms competing systems.
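
To make the idea of locating posts by XPath more concrete, the following is a minimal sketch and not Harvest's actual algorithm or API. It assumes a small, well-formed sample page and uses a deliberately naive heuristic: among all tag/class paths that occur at least twice, pick the one whose elements carry the most text. The sample markup, the function names (`guess_post_xpath`, `extract_posts`), and the `span[@class='author']` selector are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample page. Real forum HTML is rarely valid XML
# and would need a lenient HTML parser instead of ElementTree.
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a><a href="/faq">FAQ</a></div>
  <div class="thread">
    <div class="post"><span class="author">alice</span><p>Has anyone tried the new firmware on the older model?</p></div>
    <div class="post"><span class="author">bob</span><p>Yes, it works, but the update takes about ten minutes.</p></div>
    <div class="post"><span class="author">carol</span><p>Thanks, that matches what I saw on my device as well.</p></div>
  </div>
</body></html>
"""


def step(element):
    """Return one path step: the tag, plus a class predicate if present."""
    cls = element.get("class")
    return f"{element.tag}[@class='{cls}']" if cls else element.tag


def candidate_paths(root):
    """Group all descendants of root by their tag/class path from the root."""
    groups = defaultdict(list)

    def walk(element, path):
        for child in element:
            child_path = f"{path}/{step(child)}"
            groups[child_path].append(child)
            walk(child, child_path)

    walk(root, ".")
    return groups


def guess_post_xpath(root):
    """Naive heuristic (an assumption, not the paper's method): choose the
    repeated path whose elements contain the most text overall."""
    best_path, best_score = None, 0
    for path, elements in candidate_paths(root).items():
        if len(elements) < 2:  # a post container should repeat on the page
            continue
        score = sum(len("".join(e.itertext()).strip()) for e in elements)
        if score > best_score:
            best_path, best_score = path, score
    return best_path


def extract_posts(root, post_xpath):
    """Extract author and text per post; the author selector is hypothetical."""
    posts = []
    for post in root.findall(post_xpath):
        author = post.findtext("span[@class='author']")
        text = " ".join(p.text.strip() for p in post.findall("p") if p.text)
        posts.append({"author": author, "text": text})
    return posts


root = ET.fromstring(SAMPLE_PAGE.strip())
post_xpath = guess_post_xpath(root)
posts = extract_posts(root, post_xpath)
print(post_xpath)
for post in posts:
    print(post["author"], "-", post["text"])
```

Selecting one XPath that matches each post exactly once is what avoids merging several posts into one record or splitting a single post into fragments. The text-volume heuristic used here is only a stand-in for whatever scoring a production system applies.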
