Effective Blog Pages Extractor for Better UGC Accessing

08/26/2017
by Kui Zhao, et al.

Blogs are becoming an increasingly popular medium for publishing information. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves user experience but also better adapts the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they can be costly, because a large number of templates need to be developed and they fail once a template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM tree, where all elements, including the title and body blocks of a page, correspond to subtrees. Then we construct a subtree candidate set for the title and the body blocks respectively, and extract both spatial and content features for the elements contained in each subtree. SVM classifiers for the title and the body blocks are trained using these features. Finally, the classifiers are used to extract the main content from blog pages. We test our extractor on 2,250 blog pages crawled from nine blog sites with obviously different styles and templates. Experimental results verify the effectiveness of our extractor.
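The candidate-subtree idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses only content features (text length and link density), omits the spatial features, and replaces the trained SVM classifiers with a hand-set score. The names `Node`, `TreeBuilder`, `CANDIDATE_TAGS`, `score` and `extract_body` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Assumption: a small set of void and block-level tags is enough for the sketch.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
CANDIDATE_TAGS = {"div", "article", "section", "td"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def text_lengths(node, inside_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    inside_link = inside_link or node.tag == "a"
    total = len(node.text)
    link = total if inside_link else 0
    for child in node.children:
        child_total, child_link = text_lengths(child, inside_link)
        total += child_total
        link += child_link
    return total, link


def candidates(node):
    """Yield every block-level subtree as a body candidate (pre-order)."""
    if node.tag in CANDIDATE_TAGS:
        yield node
    for child in node.children:
        yield from candidates(child)


def score(node):
    """Hand-set stand-in for a trained classifier: favor long, link-poor text."""
    total, link = text_lengths(node)
    if total == 0:
        return 0.0
    return (total - link) * (1.0 - link / total)


def extract_body(html):
    """Return the highest-scoring candidate subtree, or None if there is none."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(candidates(builder.root), key=score, default=None)
```

Calling `extract_body(html)` on a page returns the candidate node with the highest score. In a setup closer to the one the abstract describes, the per-candidate feature values would be fed to separately trained title and body classifiers rather than combined by a fixed formula.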


research
11/26/2019

A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

Search engines have become an indispensable tool for browsing informatio...
research
10/27/2021

Don't read, just look: Main content extraction from web pages using visually apparent features

The extraction of main content provides only primary informative blocks ...
research
11/08/2018

SpeedReader: Reader Mode Made Fast and Private

Most popular web browsers include "reader modes" that improve the user e...
research
08/26/2017

Navigation Objects Extraction for Better Content Structure Understanding

Existing works for extracting navigation objects from webpages focus on ...
research
06/16/2015

A Novel Semantics and Feature Preserving Perspective for Content Aware Image Retargeting

There is an increasing requirement for efficient image retargeting techn...
research
07/24/2018

Time-efficient Garbage Collection in SSDs

SSDs are currently replacing magnetic disks in many application areas. A...
research
01/08/2018

Web2Text: Deep Structured Boilerplate Removal

Web pages are a valuable source of information for many natural language...
