Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images

02/20/2021
by Siwen Luo, et al.

Automatic table detection in PDF documents has achieved great success, but tabular data extraction remains challenging because of integrity and noise issues in the detected table areas. Accurate data extraction is especially crucial in the finance domain. Motivated by this, this research proposes a method for automated table detection and tabular data extraction from financial PDF documents. The method consists of three main processes: detecting table areas on each page image with a Faster R-CNN (Region-based Convolutional Neural Network) model with a Feature Pyramid Network (FPN); extracting contents and structures with a compound layout segmentation technique based on optical character recognition (OCR); and formulating regular expression rules for table header separation. The tabular data extraction feature is embedded with rule-based filtering and restructuring functions that are highly scalable. We annotate a new Financial Documents dataset with table regions for the experiments. The detection model obtains excellent table detection performance on our customized dataset. The main contributions of this paper are the Financial Documents dataset with table-area annotations, the superior detection model, and the rule-based layout segmentation technique for tabular data extraction from PDF files.
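To make the idea of regular-expression-based header separation more concrete, the following is a minimal Python sketch. It is not the authors' rule set, which the abstract does not spell out. It assumes the OCR stage has already produced a table as a list of rows of cell strings, and it treats the leading rows that contain no financial value as the header. The patterns `NUMERIC_CELL` and `YEAR` and the functions `is_value` and `split_header` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical pattern for a financial value: optional opening parenthesis
# (accounting-style negatives), optional sign and currency symbol, digits with
# optional thousands separators and decimals, optional closing parenthesis and %.
NUMERIC_CELL = re.compile(r"^\(?[-+]?[$€£]?\d[\d,]*(?:\.\d+)?\)?%?$")

# Assumption: a bare four-digit year such as "2020" is a column label, not a value.
YEAR = re.compile(r"^(?:19|20)\d{2}$")


def is_value(cell: str) -> bool:
    """Return True if the cell looks like a numeric data value."""
    text = cell.strip()
    if YEAR.match(text):
        return False
    return bool(NUMERIC_CELL.match(text))


def split_header(rows):
    """Split rows into (header_rows, body_rows) at the first row holding a value."""
    for i, row in enumerate(rows):
        if any(is_value(cell) for cell in row):
            return rows[:i], rows[i:]
    # No numeric value found: treat every row as header.
    return rows, []


# Illustrative OCR output for a small financial table (made-up data).
rows = [
    ["", "2020", "2019"],
    ["", "$'000", "$'000"],
    ["Revenue", "1,234", "1,100"],
    ["Net loss", "(56)", "(78)"],
]
header, body = split_header(rows)
```

A rule of this kind is cheap to extend: adding a new currency symbol or unit marker only means adjusting the patterns, which fits the rule-based filtering and restructuring approach the abstract describes. Real scanned documents would likely need additional rules for OCR noise and multi-line headers.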

