Introduction to OXPath

06/28/2018
by Ruslan R. Fayzrakhmanov, et al.

Contemporary web pages with increasingly sophisticated interfaces rival traditional desktop applications in interface complexity and are often called web applications or Rich Internet Applications (RIAs). They often require the execution of JavaScript in a web browser and can issue AJAX requests to generate content dynamically in response to user interaction. For automatic data acquisition, it is therefore essential to render web pages correctly and to mimic user actions in order to obtain the relevant data from the page content. In short, to obtain data through existing web interfaces and transform it into structured form, contemporary wrappers should be able to: 1) interact with the sophisticated interfaces of web applications; 2) precisely acquire relevant data; 3) scale with the number of crawled web pages or states of a web application; 4) offer an embeddable programming API for integration with existing web technologies. OXPath is a state-of-the-art technology that meets these requirements and has demonstrated its efficiency in comprehensive experiments. OXPath integrates Firefox for correct rendering of web pages and extends XPath 1.0 for DOM node selection, interaction, and extraction. It provides means for converting extracted data into different formats, such as XML, JSON, and CSV, and for saving data into relational databases. This tutorial explains the main features of the OXPath language and the setup of a suitable working environment. The guidelines for using OXPath are provided in the form of prototypical examples.
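
To make the "select, extract, convert" part of the requirements above concrete, the following is a minimal sketch in plain Python. It is not OXPath and does not use the OXPath API: it relies only on the limited XPath subset of the standard-library `xml.etree.ElementTree` module, performs no browser rendering or user interaction, and works on a small hypothetical page fragment. The markup, the class names `paper` and `year`, and the helper names `extract_records`, `to_json`, and `to_csv` are assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical, already-rendered page fragment (an assumption for illustration;
# a real wrapper would obtain the DOM from a browser after JavaScript has run).
PAGE = """
<html><body>
  <div class="paper"><h3>Paper A</h3><span class="year">2017</span></div>
  <div class="paper"><h3>Paper B</h3><span class="year">2018</span></div>
  <div class="ad"><h3>Sponsored</h3></div>
</body></html>
"""


def extract_records(markup):
    """Select the relevant nodes with XPath-style paths and extract their text."""
    root = ET.fromstring(markup.strip())
    records = []
    # Only <div class="paper"> nodes are relevant; the "ad" block is skipped.
    for node in root.findall(".//div[@class='paper']"):
        records.append({
            "title": node.findtext("h3"),
            "year": node.findtext("span[@class='year']"),
        })
    return records


def to_json(records):
    """Serialize the extracted records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the extracted records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = extract_records(PAGE)
print(to_json(records))
print(to_csv(records))
```

The sketch covers only requirement 2 (precise acquisition) and the output conversion step; interaction with the page, scaling, and embedding are the parts that a tool such as OXPath adds on top of plain XPath selection.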
