Document Domain Randomization for Deep Learning Document Layout Extraction

05/20/2021
by Meng Ling, et al.

We present document domain randomization (DDR), the first successful transfer of convolutional neural networks (CNNs) trained only on graphically rendered pseudo-paper pages to real-world document segmentation. DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest, with user-defined layout and font styles to support joint learning of fine-grained classes. We demonstrate competitive results using DDR to extract nine document classes from the CS-150 benchmark and from papers published in two domains: the annual meetings of the Association for Computational Linguistics (ACL) and IEEE Visualization (VIS). We compare DDR against conditions that are more easily obtained in the real world: style mismatch, fewer samples, and noisier samples. We show that high-fidelity semantic information is not necessary to label semantic classes, but style mismatch between training and test data can lower model accuracy. Using smaller training sets had a slightly detrimental effect. Finally, network models still achieved high test accuracy when correct labels were diluted towards confusing labels; this behavior holds across several classes.
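To make the idea of randomized page rendering concrete, here is a minimal sketch of how a pseudo-page layout with labeled regions could be sampled. It is not the authors' implementation: the class names, font pool, page dimensions, and the `Block` and `random_page` names are all assumptions chosen for illustration, and the abstract does not list the nine classes or the layout rules DDR actually uses.

```python
import random
from dataclasses import dataclass

# Hypothetical labels and fonts for illustration only; the abstract does not
# enumerate the nine classes or the font styles used by DDR.
CLASSES = ["title", "author", "abstract", "body_text", "figure", "caption", "table"]
FONTS = ["Times", "Helvetica", "Courier"]


@dataclass
class Block:
    label: str   # semantic class of the region
    x: float     # left edge, in points
    y: float     # top edge, in points
    w: float     # width, in points
    h: float     # height, in points
    font: str    # font style assigned to the page


def random_page(rng, page_w=612.0, page_h=792.0, margin=54.0, gutter=18.0):
    """Sample one pseudo-page: stack randomly sized, randomly labeled
    blocks top to bottom in one or two columns."""
    n_cols = rng.choice([1, 2])
    col_w = (page_w - 2 * margin - (n_cols - 1) * gutter) / n_cols
    font = rng.choice(FONTS)  # one randomized font style per page
    blocks = []
    for c in range(n_cols):
        x = margin + c * (col_w + gutter)
        y = margin
        while True:
            h = rng.uniform(30.0, 200.0)
            if y + h > page_h - margin:
                break  # column is full
            blocks.append(Block(rng.choice(CLASSES), x, y, col_w, h, font))
            y += h + rng.uniform(6.0, 24.0)  # random vertical gap
    return blocks


if __name__ == "__main__":
    for block in random_page(random.Random(0)):
        print(block)
```

In a full pipeline, each sampled block would presumably be filled with rendered text or an image of the matching class, and the block rectangles would serve as the ground-truth labels for training a segmentation network.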

