Semi-structured data extraction and modelling: the WIA Project

09/30/2013
by Gianluca Colombo, et al.

Over the last decades, the amount of electronically available data of all kinds has increased dramatically. Data are accessible through a range of interfaces, including Web browsers, database query languages, and application-specific interfaces, built on top of a number of different data exchange formats. These data range from unstructured to highly structured. Very often they have some structure, even if that structure is implicit and not as rigid or regular as that found in standard database systems. Spreadsheet documents are prototypical in this respect. Spreadsheets are a lightweight technology that supplies companies with easy-to-build business management and business intelligence applications, and business people widely adopt them as a convenient vehicle for generating and sharing data files. Indeed, the more spreadsheets grow in complexity (e.g., when used for product development plans and quoting), the more their arrangement, maintenance, and analysis become a knowledge-driven activity. The WIA project (Worksheets Intelligent Analyser) proposes an algorithmic approach to the automatic extraction of data structure from spreadsheet documents (i.e., grid-structured and free topologically related data). The WIA algorithm shows how to describe spreadsheet contents at higher levels of abstraction, or conceptualisations. In particular, it targets the extraction of i) the calculation workflow implemented in the spreadsheet formulas and ii) the logical role played by the data that take part in the calculation. The resulting conceptualisations are intended to provide spreadsheets with abstract representations useful for further model refinement and optimisation through evolutionary algorithms.
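To make the two extraction targets more concrete, the following is a minimal Python sketch. It is not the WIA algorithm, whose details are not given in this abstract. It assumes a hypothetical toy sheet stored as a dictionary, a simple regular expression for single-cell references (ranges such as A1:B3 are not handled), and an illustrative three-way labelling of cells as input, intermediate, or output.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy sheet: cell address -> raw content (formulas start with "=").
SHEET = {
    "A1": "120",      # unit price
    "A2": "3",        # quantity
    "A3": "=A1*A2",   # subtotal
    "A4": "0.2",      # tax rate
    "A5": "=A3*A4",   # tax
    "A6": "=A3+A5",   # total
}

# Assumption: references are plain addresses such as "A1" or "BC12".
CELL_REF = re.compile(r"\b[A-Z]{1,3}[0-9]+\b")


def dependencies(sheet):
    """Map each formula cell to the set of cells its formula references."""
    return {
        addr: set(CELL_REF.findall(content))
        for addr, content in sheet.items()
        if content.startswith("=")
    }


def roles(sheet):
    """Label each cell by the part it plays in the calculation."""
    deps = dependencies(sheet)
    referenced = set().union(*deps.values())
    labels = {}
    for addr in sheet:
        if addr not in deps:
            labels[addr] = "input"          # constant, holds no formula
        elif addr in referenced:
            labels[addr] = "intermediate"   # formula used by another formula
        else:
            labels[addr] = "output"         # formula nobody else reads
    return labels


def evaluation_order(sheet):
    """Return one valid order in which the cells can be computed."""
    return list(TopologicalSorter(dependencies(sheet)).static_order())


if __name__ == "__main__":
    print(roles(SHEET))
    print(evaluation_order(SHEET))
```

The dependency map plays the part of the calculation workflow, and the labels are a crude stand-in for the logical role of each datum. A real spreadsheet would also need handling for ranges, cross-sheet references, and circular formulas, which `TopologicalSorter` reports by raising `CycleError`.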
