PubLayNet: largest dataset ever for document layout analysis

by Xu Zhong, et al.

Recognizing the layout of unstructured digital documents is an important step when parsing the documents into a structured, machine-readable format for downstream applications. Deep neural networks developed for computer vision have proven effective for analyzing the layout of document images. However, the document layout datasets that are currently publicly available are several orders of magnitude smaller than established computer vision datasets, so models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images in which typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base model for transfer learning on a different document domain. We release the dataset to support the development and evaluation of more advanced models for document layout analysis.
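To make "document images where typical layout elements are annotated" concrete, the following is a minimal sketch of how such annotations might be inspected. The abstract does not specify a file format; the sketch assumes a COCO-style JSON layout (images, categories, and bounding-box annotations). The sample record, the file name `page_0001.jpg`, and the helper names `count_layout_elements` and `elements_for_image` are hypothetical and serve only as illustration.

```python
from collections import Counter

# Hypothetical miniature annotation record in an assumed COCO-style layout.
# Real annotation files would be loaded with json.load() instead.
SAMPLE = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 4, "name": "table"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels (COCO convention).
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 300]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 400, 500, 120]},
        {"id": 13, "image_id": 1, "category_id": 4, "bbox": [50, 540, 500, 200]},
    ],
}


def count_layout_elements(coco):
    """Count annotated layout elements per category name."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


def elements_for_image(coco, image_id):
    """Return (category name, bbox) pairs for one page image, top to bottom."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    hits = [a for a in coco["annotations"] if a["image_id"] == image_id]
    hits.sort(key=lambda a: a["bbox"][1])  # sort by the y coordinate of the box
    return [(names[a["category_id"]], a["bbox"]) for a in hits]


if __name__ == "__main__":
    print(count_layout_elements(SAMPLE))
    for name, bbox in elements_for_image(SAMPLE, 1):
        print(name, bbox)
```

Sorting by the top edge of each box is only a rough proxy for reading order and would not handle multi-column pages.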




DocBank: A Benchmark Dataset for Document Layout Analysis

Document layout analysis usually relies on computer vision models to und...

Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Recognizing the layout of unstructured digital documents is crucial when...

Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks

We present an approach for adapting convolutional neural networks for ob...

A Graphical Approach to Document Layout Analysis

Document layout analysis (DLA) is the task of detecting the distinct, se...

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

Recent advances in document image analysis (DIA) have been primarily dri...

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Accurate document layout analysis is a key requirement for high-quality ...

Efficient Document Image Classification Using Region-Based Graph Neural Network

Document image classification remains a popular research area because it...
