Navigation Objects Extraction for Better Content Structure Understanding

08/26/2017
by Kui Zhao, et al.

Existing work on extracting navigation objects from webpages focuses on navigation menus, with the aim of revealing the information architecture of a site. However, Web 2.0 sites such as social networks and e-commerce portals make the content structure of a website increasingly difficult to understand. Dynamic and personalized elements in a webpage, such as top stories and recommended lists, are vital to understanding the dynamic nature of Web 2.0 sites. To better understand the content structure of Web 2.0 sites, in this paper we propose a new method for extracting navigation objects from a webpage. Our method extracts not only static navigation menus but also dynamic and personalized page-specific navigation lists. Since the navigation objects in a webpage naturally come in blocks, we first cluster hyperlinks into blocks by exploiting the spatial locations of hyperlinks, the hierarchical structure of the DOM tree, and the hyperlink density. We then identify navigation objects among those blocks using an SVM classifier with novel features such as anchor text length. Experiments on real-world data sets with webpages from various domains and styles verify the effectiveness of our method.
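
The two-stage idea in the abstract (group hyperlinks into blocks, then describe each block with features for a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that grouping links by their nearest container element is an acceptable stand-in for the paper's clustering, which also uses spatial locations and hyperlink density, and it stops at feature computation rather than training an SVM. The names `LinkBlockParser`, `block_features`, and `CONTAINER_TAGS` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean

# Assumption: these tags delimit candidate link blocks. The paper's
# clustering is richer; this is only a DOM-based proxy.
CONTAINER_TAGS = {"nav", "ul", "ol", "div", "table", "section"}


class LinkBlockParser(HTMLParser):
    """Groups anchor texts by their nearest container ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open elements as (tag, uid)
        self.blocks = defaultdict(list)  # container uid -> anchor texts
        self._uid = 0
        self._text = None                # text pieces of the open <a>

    def _nearest_container(self):
        for tag, uid in reversed(self.stack):
            if tag in CONTAINER_TAGS:
                return uid
        return 0  # no container found: treat the document root as the block

    def handle_starttag(self, tag, attrs):
        self._uid += 1
        self.stack.append((tag, self._uid))
        if tag == "a":
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            text = "".join(self._text).strip()
            self.blocks[self._nearest_container()].append(text)
            self._text = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def block_features(html):
    """Returns one feature dict per link block, in document order."""
    parser = LinkBlockParser()
    parser.feed(html)
    return [
        {"links": len(texts), "mean_anchor_len": mean(len(t) for t in texts)}
        for texts in parser.blocks.values()
    ]


if __name__ == "__main__":
    page = (
        "<ul><li><a href='/'>Home</a></li><li><a href='/n'>News</a></li></ul>"
        "<div><p>Read <a href='/x'>A long article headline</a></p></div>"
    )
    for features in block_features(page):
        print(features)
```

In a fuller pipeline, feature vectors like these would be passed to a trained classifier to decide which blocks are navigation objects. Short, uniform anchor texts are one plausible signal of a menu, while long anchor texts are more typical of in-content links.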
