Multimodal Tree Decoder for Table of Contents Extraction in Document Images

12/06/2022
by Pengfei Hu, et al.

Table of contents (ToC) extraction aims to extract the headings of different levels in a document so that the outline of its content can be better understood; it is widely applicable to document understanding and information retrieval. Existing works often rely on hand-crafted features and predefined rule-based functions to detect headings and to resolve the hierarchical relationships between them. Both benchmarks and deep-learning-based research remain limited. Accordingly, in this paper, we first introduce a standard dataset, HierDoc, comprising image samples from 650 scientific papers together with their content labels. We then propose a novel end-to-end model, the multimodal tree decoder (MTD), for ToC extraction as a benchmark for HierDoc. The MTD model consists of three main parts: an encoder, a classifier, and a decoder. The encoder fuses the multimodal features of vision, text, and layout for each entity in the document. The classifier then recognizes and selects the heading entities. Finally, a tree-structured decoder parses the hierarchical relationships between the heading entities. To evaluate performance, both tree-edit-distance similarity (TEDS) and F1-Measure are adopted. Our MTD approach achieves an average TEDS of 87.2 on HierDoc. The code and dataset will be released at: https://github.com/Pengfei-Hu/MTD.
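
To make the TEDS metric mentioned above more concrete, the following is a minimal Python sketch rather than the authors' evaluation code. It assumes the commonly used definition TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), unit costs for inserting, deleting, and relabeling a node, and a hypothetical tree encoding in which each heading is a `(label, children)` tuple. The names `forest_size`, `forest_distance`, and `teds` are illustrative, and the cost model used in the paper may differ.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple
# of trees. A forest is a tuple of trees. Tuples keep everything hashable.

def forest_size(forest):
    """Count all nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered tree edit distance between two forests (unit costs, assumed)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return forest_size(f2)  # insert every node of f2
    if not f2:
        return forest_size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2; symmetric to deletion.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, paying 1 if their labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)

def teds(tree_a, tree_b):
    """Tree-edit-distance similarity in [0, 1]; 1.0 means identical trees."""
    dist = forest_distance((tree_a,), (tree_b,))
    return 1.0 - dist / max(forest_size((tree_a,)), forest_size((tree_b,)))

# Hypothetical ToC trees: the prediction misses one sub-heading.
reference = ("root", (("1 Introduction", ()),
                      ("2 Method", (("2.1 Encoder", ()),))))
predicted = ("root", (("1 Introduction", ()),
                      ("2 Method", ())))
score = teds(reference, predicted)
```

In this example the predicted tree differs from the reference by one missing node out of four, so the score is 0.75. The memoized recursion is meant for small trees such as a paper outline; a production evaluation would more likely use an optimized tree edit distance algorithm.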

