Don't read, just look: Main content extraction from web pages using visually apparent features

10/27/2021
by Geunseong Jung, et al.

The extraction of main content keeps only the primary informative blocks of a web page by removing minor areas such as navigation menus, ads, and site templates. It has various applications, including information retrieval, search engine optimization, and browser reader mode. We tested four existing main content extraction methods (Firefox Readability.js, Chrome DOM Distiller, Web2Text, and Boilernet) on two English datasets from global websites and seven non-English datasets, one from each of seven local regions. The results show that performance decreases by up to 40% on the non-English datasets compared with the English datasets. This paper proposes a multilingual main content extraction method that uses visually apparent features such as the elements' positions, sizes, and distances from the centers of the browser window and the web document. These features reflect the intentions of web page authors: the placement and appearance of elements in web pages are constrained by humans' narrow central vision. Hence, our method, Grid-Center-Expand (GCE), finds the leaf node closest to the centroid of the web page after minor areas have been removed. To obtain the main content, GCE ascends repeatedly from this leaf node to its parent in the DOM tree until the node meets one of the following conditions: it is an <article> tag, it contains specific attributes, or its width changes suddenly. In the non-English datasets, our method performs better by up to 13% [...] 56% [...] and performs well regardless of the regional and linguistic characteristics of the web page. In addition, we create DNN models using Google's TabNet with GCE's features. The best of our models performs similarly to Boilernet and Web2Text on all datasets. Accordingly, we show that these features can be useful to machine learning models for extracting main content.
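
The center-then-expand idea in the abstract can be illustrated with a small Python sketch. This is only one reading of the abstract, not the authors' implementation: the `Node` class, the `CONTENT_HINTS` keywords, and the `WIDTH_JUMP` threshold are all hypothetical, the tree is assumed to have its minor areas already removed, and the centroid is simply taken to be the center of the root box.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a rendered DOM element and its bounding box."""
    tag: str
    x: float
    y: float
    width: float
    height: float
    attrs: dict[str, str] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)
    parent: Optional[Node] = field(default=None, repr=False)

    def add(self, child: Node) -> Node:
        """Attach a child, set its parent link, and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


# Assumed values for illustration; the abstract does not specify them.
CONTENT_HINTS = ("content", "article", "post")
WIDTH_JUMP = 1.5


def leaves(node: Node) -> Iterator[Node]:
    """Yield every leaf under node in document order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def closest_leaf(root: Node, centroid: tuple[float, float]) -> Node:
    """Return the leaf whose box center is nearest to the centroid."""
    return min(leaves(root), key=lambda n: math.dist(n.center(), centroid))


def has_content_hint(node: Node) -> bool:
    """Check id/class for a keyword (one guess at 'specific attributes')."""
    text = " ".join(node.attrs.get(k, "") for k in ("id", "class")).lower()
    return any(hint in text for hint in CONTENT_HINTS)


def expand(leaf: Node) -> Node:
    """Ascend from the leaf until one of the three stop conditions holds."""
    node = leaf
    while node.parent is not None:
        if node.tag == "article" or has_content_hint(node):
            return node
        # Treat a parent much wider than the current node as a sudden change.
        if node.parent.width > node.width * WIDTH_JUMP:
            return node
        node = node.parent
    return node


if __name__ == "__main__":
    body = Node("body", 0, 0, 1400, 2000)
    wrap = body.add(Node("div", 300, 100, 800, 1800, attrs={"class": "wrap"}))
    para = wrap.add(Node("p", 310, 900, 760, 200))
    found = expand(closest_leaf(body, body.center()))
    print(found.tag, found.attrs)
```

In a real browser the boxes would come from the rendered layout rather than hand-written numbers, and the stop conditions would need tuning against actual pages.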


