Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

07/07/2020
by Vladimir Bernstein, et al.

Processing large amounts of data is an essential problem of the big data era. Most data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the data is transferred using arbitrarily formatted computer-generated documents (such as invoices, purchase orders, and financial reports), which require sophisticated processing and human intervention for data interpretation and extraction. The currently available solutions, ranging from manual data entry to low-level scripting and data extraction tools, are costly and still depend on manual effort. This paper describes the principal methodology for unsupervised, fully automatic data extraction from a wide range of computer-generated documents, assuming that their formatting reflects the original structure of the data sources. The presented methodology falls into the category of unsupervised machine learning and consists of three main parts: (1) detecting repeating patterns of text formatting by employing relative feature space clustering and adaptive weighted feature score maps; (2) detecting hierarchical formatting structures via a collapsing and noise-filtering procedure applied to the repeating formatting patterns; and (3) automatic configuration of the interactive data extraction tool (SiMX TextConverter) for fully automated processing.
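To make the idea of "repeating patterns of text formatting" concrete, the following is a minimal sketch, not the paper's method. It swaps in a much simpler stand-in for the relative feature space clustering described above: each line is reduced to a coarse character-class signature, and lines with identical signatures are grouped. The function names `line_signature` and `repeating_patterns`, the signature alphabet, and the sample invoice text are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def line_signature(line: str) -> str:
    """Reduce a line to a coarse formatting signature (illustrative only)."""
    sig = re.sub(r"[A-Za-z]+", "A", line.rstrip())  # any word -> A
    sig = re.sub(r"\d+", "9", sig)                  # any digit run -> 9
    sig = re.sub(r" {2,}", "_", sig)                # wide gap -> column separator
    return sig


def repeating_patterns(lines, min_count=2):
    """Group non-empty lines by signature; keep signatures seen min_count+ times."""
    groups = defaultdict(list)
    for idx, line in enumerate(lines):
        if line.strip():
            groups[line_signature(line)].append(idx)
    return {sig: idxs for sig, idxs in groups.items() if len(idxs) >= min_count}


# Hypothetical single-line-formatted document fragment.
sample = [
    "INVOICE 1001",
    "Widget    12   3.50",
    "Gadget     7  10.00",
    "Sprocket   1   0.99",
    "TOTAL 14.49",
]

patterns = repeating_patterns(sample)
print(patterns)
```

In this toy input, the three item rows share one signature and are reported as a repeating pattern, while the header and total lines are not. Exact signature matching is brittle against small formatting variations, which is presumably part of why the abstract describes clustering in a feature space rather than exact matching.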


