DocParser: Hierarchical Structure Parsing of Document Renderings

11/05/2019
by Johannes Rausch, et al.

Translating document renderings (e.g., PDFs, scans) into hierarchical structures is in high demand across many real-world applications and is often a prerequisite for downstream NLP tasks. Earlier attempts focused on different but simpler tasks, such as detecting table or cell locations within documents; a holistic, principled approach to inferring the complete hierarchical structure of documents has been missing. As a remedy, we developed "DocParser": an end-to-end system for parsing the complete document structure, including all text elements, figures, tables, and table cell structures. To the best of our knowledge, DocParser is the first system that derives the full hierarchical composition of documents. Given the complexity of the task, annotating appropriate datasets is costly. Therefore, our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data is scarce, which we address with a novel approach to weak supervision. Our computational experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 37.1%, and for hierarchical relations between entity pairs, it improves the F1 score by 27.6%.
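To make the notion of a hierarchical document structure concrete, the following is a minimal sketch of how such a parse could be represented and scored. It is not the DocParser implementation or its data format: the `Entity` class, the `parent_of` relation label, and the helper names are all hypothetical, and the F1 computation assumes exact matching of relation triples.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[int, str, int]

@dataclass
class Entity:
    """One node in a document tree (hypothetical schema, for illustration)."""
    id: int
    category: str                               # e.g. "table", "table_cell"
    bbox: Tuple[float, float, float, float]     # (x0, y0, x1, y1) on the page
    children: List["Entity"] = field(default_factory=list)

def relations(root: Entity) -> Set[Triple]:
    """Flatten a tree into (parent_id, 'parent_of', child_id) triples."""
    triples: Set[Triple] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            triples.add((node.id, "parent_of", child.id))
            stack.append(child)
    return triples

def f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """F1 score over exact-match relation triples (an assumed criterion)."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# A toy document: one table containing two cells.
cell_a = Entity(2, "table_cell", (10, 10, 50, 30))
cell_b = Entity(3, "table_cell", (50, 10, 90, 30))
table = Entity(1, "table", (10, 10, 90, 30), [cell_a, cell_b])
doc = Entity(0, "document", (0, 0, 100, 100), [table])

gold = relations(doc)
# A hypothetical prediction that misses the second cell.
predicted = {(0, "parent_of", 1), (1, "parent_of", 2)}
score = f1(predicted, gold)
```

In this toy case the prediction recovers two of the three gold relations with no false positives, so precision is 1 and recall is 2/3. Representing relations as triples keeps the scoring independent of how the tree itself is stored.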

research
01/05/2022

TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets

Tables have been an ever-existing structure to store data. There exist n...
research
05/01/2020

Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context

Documents are often the format of choice for knowledge sharing and prese...
research
03/24/2023

HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures

The problem of document structure reconstruction refers to converting di...
research
03/08/2022

Table Structure Recognition with Conditional Attention

Tabular data in digital documents is widely used to express compact and ...
research
05/03/2023

Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents with Semantic-Oriented Hierarchical Graphs

Discrete reasoning over table-text documents (e.g., financial reports) g...
research
09/06/2021

Parsing Table Structures in the Wild

This paper tackles the problem of table structure parsing (TSP) from ima...
