HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures

03/24/2023
by   Jiefeng Ma, et al.

Document structure reconstruction is the problem of converting digital or scanned documents into their corresponding semantic structures. Most existing work focuses on delimiting the boundaries of individual elements within a single page and neglects the reconstruction of semantic structure across multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task relevant to both the NLP and CV fields. To evaluate system performance on the new task, we build a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations, obtained from rule-based extractors and human annotators. Moreover, we propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
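
To make the task concrete, the following is a minimal sketch of how line-level annotations with a category and a parent relation could be assembled into a document tree. It does not show the HRDoc file format or the DSPS model: the `Unit` class, its field names, and the category labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Unit:
    """One line-level semantic unit (hypothetical schema, not the HRDoc format)."""
    uid: int
    text: str
    category: str             # assumed labels, e.g. "title", "section", "paragraph"
    parent: Optional[int]     # uid of the parent unit, or None for a root
    children: List["Unit"] = field(default_factory=list)


def build_tree(units: List[Unit]) -> List[Unit]:
    """Attach every unit to its parent and return the root units."""
    by_id: Dict[int, Unit] = {u.uid: u for u in units}
    roots: List[Unit] = []
    for u in units:
        if u.parent is None:
            roots.append(u)
        else:
            by_id[u.parent].children.append(u)
    return roots


def outline(nodes: List[Unit], depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per unit, in reading order."""
    lines: List[str] = []
    for n in nodes:
        lines.append("  " * depth + f"[{n.category}] {n.text}")
        lines.extend(outline(n.children, depth + 1))
    return lines
```

Under these assumptions, a flat list of units in reading order, each pointing at its parent, is enough to recover the hierarchy; `outline(build_tree(units))` then gives an indented view of the reconstructed structure.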


research
12/07/2022

Hierarchical multimodal transformers for Multi-Page DocVQA

Document Visual Question Answering (DocVQA) refers to the task of answer...
research
12/06/2022

Multimodal Tree Decoder for Table of Contents Extraction in Document Images

Table of contents (ToC) extraction aims to extract headings of different...
research
05/08/2023

SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation

Instance-level segmentation of documents consists in assigning a class-a...
research
11/05/2019

DocParser: Hierarchical Structure Parsing of Document Renderings

Translating document renderings (e.g. PDFs, scans) into hierarchical str...
research
04/18/2023

Deep Unrestricted Document Image Rectification

In recent years, tremendous efforts have been made on document image rec...
research
01/24/2022

Importance of Textlines in Historical Document Classification

This paper describes a system prepared at Brno University of Technology ...
research
01/25/2021

PAWLS: PDF Annotation With Labels and Structure

Adobe's Portable Document Format (PDF) is a popular way of distributing ...
