Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs

06/12/2023
by Alejandro Peña, et al.

Every day, thousands of digital documents are generated that contain useful information for companies, public organizations, and citizens. Since processing them manually is infeasible, automatic processing is becoming increasingly necessary in certain sectors. However, this task remains challenging, since in most cases text-only parsing is not enough to fully understand the information presented through different components of varying significance. In this regard, Document Layout Analysis (DLA), which aims to detect and classify the basic components of a document, has been an active research field for many years. In this work, we propose a procedure to semi-automatically annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441K document pages, and more than 8M labels associated with 8 layout block units. The results of our experiments validate the proposed text labeling procedure with accuracy up to 99
