Line-items and table understanding in structured documents

03/22/2019
by Martin Holeček, et al.

Table detection and extraction have been studied in the context of documents such as scientific papers, where tables are clearly outlined and stand out from the visual document structure. We study this topic in the more challenging domain of layout-heavy business documents, particularly invoices. Invoices present the novel challenge that tables often lack outlines, whether in the form of borders or surrounding text flow, and have ragged columns and widely varying data content. We also show that different structural information can be extracted from different table-like structures. We present a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and we rephrase table detection as a text box labeling problem. We work on a new dataset of invoices using this representation and propose multiple baselines to solve this labeling problem. We then propose a novel neural network model that achieves strong, practical results on the presented dataset, and we analyze the model's performance and the effects of graph convolutions and self-attention in detail.
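
To make the "graph over word boxes" idea concrete, here is a minimal Python sketch. The abstract does not say how the graph or the positional inputs are constructed, so everything below is an assumption for illustration only: the `WordBox` type, the rule of linking each box to its nearest overlapping neighbor in each of four directions, and the normalized-coordinate features are hypothetical choices, not the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WordBox:
    """Hypothetical word box: text plus page coordinates (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True if the intervals [a0, a1] and [b0, b1] share positive length."""
    return min(a1, b1) - max(a0, b0) > 0


def build_neighbor_graph(boxes: List[WordBox]) -> List[Dict[str, Optional[int]]]:
    """For each box, find the index of the nearest box in each direction.

    Assumed rule: a left/right neighbor must overlap vertically, and an
    up/down neighbor must overlap horizontally.
    """
    graph: List[Dict[str, Optional[int]]] = []
    for i, a in enumerate(boxes):
        best: Dict[str, Optional[int]] = {
            "left": None, "right": None, "up": None, "down": None
        }
        dist = {key: float("inf") for key in best}
        for j, b in enumerate(boxes):
            if i == j:
                continue
            candidates = []
            if _overlaps(a.y0, a.y1, b.y0, b.y1):
                if b.x0 >= a.x1:
                    candidates.append(("right", b.x0 - a.x1))
                elif b.x1 <= a.x0:
                    candidates.append(("left", a.x0 - b.x1))
            if _overlaps(a.x0, a.x1, b.x0, b.x1):
                if b.y0 >= a.y1:
                    candidates.append(("down", b.y0 - a.y1))
                elif b.y1 <= a.y0:
                    candidates.append(("up", a.y0 - b.y1))
            for direction, gap in candidates:
                if gap < dist[direction]:
                    dist[direction] = gap
                    best[direction] = j
        graph.append(best)
    return graph


def positional_features(box: WordBox, page_w: float, page_h: float) -> List[float]:
    """Box corners normalized to [0, 1]; a stand-in for a positional input."""
    return [box.x0 / page_w, box.y0 / page_h, box.x1 / page_w, box.y1 / page_h]


# Toy two-row, two-column layout resembling a borderless invoice table.
boxes = [
    WordBox("Qty", 0, 0, 10, 5),
    WordBox("Price", 20, 0, 40, 5),
    WordBox("2", 0, 10, 10, 15),
    WordBox("9.99", 20, 10, 40, 15),
]
graph = build_neighbor_graph(boxes)
features = [positional_features(b, page_w=40, page_h=20) for b in boxes]
```

Under the text box labeling formulation, a model would then assign one label per word box (for example, whether the box belongs to a line-item table), using the neighbor links for graph convolutions and the features as part of each node's input.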

