Boilerplate Removal using a Neural Sequence Labeling Model

04/22/2020
by Jurek Leonhardt, et al.

The extraction of main content from web pages is an important task for numerous applications, ranging from usability features, such as reader views for news articles in web browsers, to information retrieval and natural language processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. The resulting models are tailored to a specific distribution of web pages, e.g. those from a certain time frame, and generalize poorly. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension that uses our model to highlight the content of arbitrary web pages directly within the browser. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and to outperform the state-of-the-art model.
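To make the input format concrete, the following is a minimal sketch of how a page could be flattened into a sequence that carries only HTML tags and words. It is not the authors' implementation: the names `LeafSequenceBuilder` and `page_to_sequence`, the choice to emit one item per text node, and the sets of skipped and void tags are all assumptions made for illustration. A sequence labeling model would then assign each item a content or boilerplate label.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text (an assumption of this sketch).
SKIPPED_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafSequenceBuilder(HTMLParser):
    """Collects (tag_path, words) pairs for each text node, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags that are currently open
        self.sequence = []  # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        if words and not SKIPPED_TAGS.intersection(self.stack):
            self.sequence.append((tuple(self.stack), words))


def page_to_sequence(html):
    """Return the page as a list of (tag_path, words) items."""
    builder = LeafSequenceBuilder()
    builder.feed(html)
    builder.close()
    return builder.sequence


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='#'>Home</a></nav>"
        "<article><p>Main content here.</p></article>"
        "<script>var x = 1;</script></body></html>"
    )
    for tag_path, words in page_to_sequence(page):
        print("/".join(tag_path), words)
```

Each item pairs the chain of enclosing tags with the words of one text node, so no hand-crafted feature enters the representation. How tags and words are embedded and which network labels the sequence is left open here.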

research
01/08/2018

Web2Text: Deep Structured Boilerplate Removal

Web pages are a valuable source of information for many natural language...
research
11/21/2021

The Impact of Main Content Extraction on Near-Duplicate Detection

Commercial web search engines employ near-duplicate detection to ensure ...
research
12/12/2012

Learning with Scope, with Application to Information Extraction and Classification

In probabilistic approaches to classification and information extraction...
research
10/27/2021

Don't read, just look: Main content extraction from web pages using visually apparent features

The extraction of main content provides only primary informative blocks ...
research
01/01/2022

Usability and Aesthetics: Better Together for Automated Repair of Web Pages

With the recent explosive growth of mobile devices such as smartphones o...
research
04/13/2018

A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

Malicious web content is a serious problem on the Internet today. In thi...
research
05/15/2021

A Large Visual, Qualitative and Quantitative Dataset of Web Pages

The World Wide Web is not only one of the most important platforms of co...
