Web2Text: Deep Structured Boilerplate Removal

01/08/2018
by Thijs Vogels, et al.

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from these documents is essential for the performance of downstream applications. To this end, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content. Our method uses a hidden Markov model on top of potentials derived from DOM tree features using convolutional neural networks. The proposed method sets a new state of the art for boilerplate removal on the CleanEval benchmark. As a component of information retrieval pipelines, it also improves retrieval performance on the ClueWeb12 collection.
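To make the sequence-labeling step concrete, the following is a minimal sketch of Viterbi decoding over a chain of text blocks, assuming per-block (unary) and adjacent-block (pairwise) log-potentials are already available. It is not the authors' implementation: the function name `viterbi_decode`, the label convention (0 = boilerplate, 1 = main content), and all numeric values are hypothetical, and in the paper the potentials would come from convolutional networks over DOM tree features rather than being hand-set.

```python
from typing import List, Sequence


def viterbi_decode(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Return the highest-scoring label sequence for a chain of blocks.

    unary[i][y]       : log-potential of block i taking label y
    pairwise[i][a][b] : log-potential of blocks i and i+1 taking labels a and b
    (Both are hypothetical inputs for this sketch.)
    """
    n = len(unary)
    if n == 0:
        return []
    k = len(unary[0])

    # score[y] = best total score of any labeling of blocks 0..i ending in y
    score = list(unary[0])
    backpointers: List[List[int]] = []

    for i in range(1, n):
        new_score: List[float] = []
        pointers: List[int] = []
        for b in range(k):
            # Best previous label a for current label b
            best_a = max(range(k), key=lambda a: score[a] + pairwise[i - 1][a][b])
            new_score.append(score[best_a] + pairwise[i - 1][best_a][b] + unary[i][b])
            pointers.append(best_a)
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final label
    labels = [max(range(k), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels


# Hand-set toy potentials: the middle block is ambiguous on its own.
unary = [[0.0, 2.0], [0.6, 0.4], [0.0, 2.0]]
smooth = [[1.0, 0.0], [0.0, 1.0]]  # rewards neighbouring blocks sharing a label
flat = [[0.0, 0.0], [0.0, 0.0]]    # no interaction between neighbours

labels_joint = viterbi_decode(unary, [smooth, smooth])
labels_independent = viterbi_decode(unary, [flat, flat])
```

With the flat pairwise potentials each block is labeled on its own, so the ambiguous middle block falls to boilerplate. With the smoothing potentials the neighbouring content blocks pull it to main content, which illustrates why classifying all blocks collectively can differ from classifying them independently.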
