M^6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis

05/15/2023
by Hiuyi Cheng, et al.

Document layout analysis is a crucial prerequisite for document understanding, including document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents, so models trained on them may not generalize well to real-world scenarios. This paper therefore introduces a large and diverse document layout analysis dataset called M^6Doc. The M^6 designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6) Modern documents. Additionally, we propose a transformer-based document layout analysis method called TransDLANet. It leverages an adaptive element matching mechanism that enables query embeddings to better match the ground truth, improving recall, and it constructs a segmentation branch for more precise document image instance segmentation. We conduct a comprehensive evaluation of M^6Doc with various layout analysis methods and demonstrate its effectiveness. TransDLANet achieves state-of-the-art performance on M^6Doc with 64.5% mAP. The M^6Doc dataset will be available at https://github.com/HCIILAB/M6Doc.


research
06/02/2022

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Accurate document layout analysis is a key requirement for high-quality ...
research
02/03/2022

DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts

Digitization of newspapers is of interest for many reasons including pre...
research
03/19/2023

Diffusion-based Document Layout Generation

We develop a diffusion-based approach for various document layout sequen...
research
05/11/2023

WeLayout: WeChat Layout Analysis System for the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

In this paper, we introduce WeLayout, a novel system for segmenting the ...
research
12/15/2019

Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts

Historical palm-leaf manuscript and early paper documents from Indian su...
research
04/24/2023

PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis

Document layout analysis has a wide range of requirements across various...
research
08/21/2023

Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout Analysis

Understanding digital documents is like solving a puzzle, especially his...
