DocBank: A Benchmark Dataset for Document Layout Analysis

06/01/2020
by Minghao Li, et al.

Document layout analysis usually relies on computer vision models to understand documents, ignoring the textual information that is vital to capture. Meanwhile, high-quality labeled datasets that carry both visual and textual information remain scarce. In this paper, we present DocBank, a benchmark dataset with fine-grained token-level annotations for document layout analysis. DocBank is constructed in a simple yet effective way, using weak supervision from documents available on arXiv. With DocBank, models from different modalities can be compared fairly, and multi-modal approaches can be investigated further to improve the performance of document layout analysis. We build several strong baselines and manually split the data into train/dev/test sets for evaluation. Experimental results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset will be publicly available at <https://github.com/doc-analysis/DocBank>.
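To make "token-level annotations" concrete, the following is a minimal sketch of how such annotations might be represented and merged into region-level boxes, so that text-based and image-based models could be scored against the same ground truth. The `Token` fields, the label names, and the coordinates are illustrative assumptions, not the actual DocBank file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical schema, not DocBank's real format)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in page coordinates
    label: str  # semantic unit, e.g. "title" or "paragraph" (assumed names)


def union_box(boxes):
    """Return the smallest box that encloses every box in `boxes`."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)


def regions_by_label(tokens):
    """Merge token boxes into one enclosing region per label.

    Simplification: all tokens sharing a label are merged into a single
    region, which is only reasonable when each label occurs once per page.
    """
    grouped = defaultdict(list)
    for token in tokens:
        grouped[token.label].append(token.box)
    return {label: union_box(boxes) for label, boxes in grouped.items()}


# Made-up example page with two semantic units.
tokens = [
    Token("DocBank:", (100, 50, 180, 70), "title"),
    Token("A", (185, 50, 195, 70), "title"),
    Token("Document", (100, 120, 170, 135), "paragraph"),
    Token("layout", (175, 120, 215, 135), "paragraph"),
]
regions = regions_by_label(tokens)
```

A sequence-labeling model would consume the per-token `label` values directly, while an object-detection model would be trained on the merged `regions`; keeping both views derived from the same tokens is one plausible way to compare modalities on equal footing.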


research
08/16/2019

PubLayNet: largest dataset ever for document layout analysis

Recognizing the layout of unstructured digital documents is an important...
research
01/25/2021

PAWLS: PDF Annotation With Labels and Structure

Adobe's Portable Document Format (PDF) is a popular way of distributing ...
research
06/01/2021

Incorporating Visual Layout Structures for Scientific Text Classification

Classifying the core textual components of a scientific paper-title, aut...
research
02/17/2023

Entry Separation using a Mixed Visual and Textual Language Model: Application to 19th century French Trade Directories

When extracting structured data from repetitively organized documents, s...
research
07/05/2023

Line Graphics Digitization: A Step Towards Full Automation

The digitization of documents allows for wider accessibility and reprodu...
research
10/15/2021

Accurate Fine-grained Layout Analysis for the Historical Tibetan Document Based on the Instance Segmentation

Accurate layout analysis without subsequent text-line segmentation remai...
research
05/20/2021

Document Domain Randomization for Deep Learning Document Layout Extraction

We present document domain randomization (DDR), the first successful tra...
