Table-Of-Contents generation on contemporary documents

11/20/2019
by Najah-Imane Bentabet, et al.

Generating a precise and detailed Table-Of-Contents (TOC) from a document is a problem of major importance for document understanding and information extraction. Despite its importance, it remains a challenging task, especially for non-standardized documents with rich layout information, such as commercial documents. In this paper, we present a new neural-based pipeline for TOC generation that is applicable to any searchable document. Unlike previous methods, we neither use semantic labeling nor assume the presence of parsable TOC pages in the document. Moreover, we analyze the influence of using external knowledge encoded as a template, and we show empirically that this approach is useful only in a very low-resource environment. Finally, we propose a new domain-specific data set that sheds some light on the difficulties of TOC generation in real-world documents. The proposed method outperforms the state of the art both on a public data set and on the newly released data set.


