DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

06/02/2022
by Birgit Pfitzmann, et al.

Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of large, public ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven very effective at layout detection and segmentation. While these datasets are large enough to train such models, they severely lack layout variability, since they are sourced only from scientific article repositories such as PubMed and arXiv. Consequently, the accuracy of layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available document-layout annotation dataset in COCO format. It contains 80,863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
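
To make the COCO-format annotations and the idea of box-level agreement more concrete, here is a minimal Python sketch. It assumes only the standard COCO layout (top-level "images", "annotations" and "categories" keys, with each "bbox" given as [x, y, width, height]). The sample record, the file name and the helper names count_boxes_per_class and bbox_iou are hypothetical and are not taken from the DocLayNet release. The IoU function shows the overlap measure that underlies mAP-style scoring; it is not the paper's evaluation code.

```python
import json
from collections import Counter

# Hypothetical miniature COCO-style record, for illustration only.
# Real annotation files follow the same top-level structure.
SAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "page_0001.png", "width": 1025, "height": 1025}],
  "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 60, 400, 120]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 200, 400, 90]},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [50, 320, 600, 300]}
  ]
}
"""


def count_boxes_per_class(coco: dict) -> Counter:
    """Count annotated bounding boxes per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


def bbox_iou(a, b) -> float:
    """Intersection over union of two COCO boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


coco = json.loads(SAMPLE_JSON)
counts = count_boxes_per_class(coco)
print(dict(counts))

# Two boxes drawn for the same region, e.g. by two different annotators.
print(round(bbox_iou([0, 0, 10, 10], [5, 0, 10, 10]), 3))
```

With a real annotation file, the same functions would apply after loading it with json.load; comparing boxes from two annotators of the same page via bbox_iou gives a rough, per-box view of how closely their annotations agree.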


