A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

11/26/2019
by Dat Quoc Nguyen, et al.

Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also index non-informative blocks of Web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that improves on the ContentExtractor algorithm to automatically detect the main content blocks in a Web page. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new Web page from the same website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page.
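
The general idea of template-based extraction can be illustrated with a minimal sketch. This is not the FastContentExtractor algorithm itself, whose details are in the full text. It rests on several assumptions that are not stated in the abstract: a block is a `div` or `td` element, a block is identified by its path of tag names and `id` attributes, and a block counts as non-informative when its exact text repeats on more than one page of the site. The names `BlockParser`, `parse_blocks`, `learn_template`, and `extract` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: only these tags delimit blocks.
BLOCK_TAGS = {"div", "td"}


class BlockParser(HTMLParser):
    """Collect [path, text parts] for block elements, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open blocks
        self.blocks = []  # every block, in the order it was opened

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            label = tag + "#" + (dict(attrs).get("id") or "")
            parent = self.stack[-1][0] if self.stack else ""
            entry = [parent + "/" + label, []]
            self.stack.append(entry)
            self.blocks.append(entry)

    def handle_endtag(self, tag):
        # Assumes well-formed nesting of block tags.
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open block only.
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())


def parse_blocks(html):
    """Return (path, text) pairs for blocks that directly contain text."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return [(path, " ".join(parts)) for path, parts in parser.blocks if parts]


def learn_template(pages):
    """Return the set of block paths whose text never repeats across pages."""
    seen = Counter()
    for html in pages:
        for block in set(parse_blocks(html)):
            seen[block] += 1
    repeated = {path for (path, _), count in seen.items() if count > 1}
    all_paths = {path for (path, _) in seen}
    return all_paths - repeated


def extract(html, template):
    """Return the text of template blocks, keeping the page's own order."""
    return [text for path, text in parse_blocks(html) if path in template]
```

Once a template has been learned from a few sample pages of a site, `extract` needs only a single pass over a new page and a set lookup per block, and the output follows document order because `parse_blocks` records blocks in the order they are opened.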

research
01/21/2019

DroidMeter: A Measurement Tool to Study Embedded Web Pages

Traditional Web browsing involves typing a URL on a browser and loading ...
research
11/21/2021

The Impact of Main Content Extraction on Near-Duplicate Detection

Commercial web search engines employ near-duplicate detection to ensure ...
research
08/26/2017

Effective Blog Pages Extractor for Better UGC Accessing

Blog is becoming an increasingly popular media for information publishin...
research
10/27/2021

Don't read, just look: Main content extraction from web pages using visually apparent features

The extraction of main content provides only primary informative blocks ...
research
08/13/2017

Interstitial Content Detection

Interstitial content is online content which grays out, or otherwise obs...
research
08/26/2017

Navigation Objects Extraction for Better Content Structure Understanding

Existing works for extracting navigation objects from webpages focus on ...
research
09/07/2017

Advanced Page Rank Algorithm with Semantics, In Links, Out Links and Google Analytics

In this paper we have modified the existing page ranking mechanism as an...
