Generating Synthetic Handwritten Historical Documents With OCR Constrained GANs

03/15/2021
by Lars Vögtlin, et al.

We present a framework to generate synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images. Obtaining large labeled datasets is often the limiting factor in effectively using supervised deep learning methods for Document Image Analysis (DIA). Prior approaches to synthetic data generation either require expertise or result in poor accuracy in the synthetic documents. To achieve high-precision transformations without requiring expertise, we tackle the problem in two steps. First, we create template documents with user-specified content and structure. Second, we transfer the style of a collection of unlabeled historical images to these template documents while preserving their text and layout. We evaluate the use of our synthetic historical documents in a pre-training setting and find that we outperform the baselines (randomly initialized and pre-trained). Additionally, with visual examples, we demonstrate a high-quality synthesis that makes it possible to generate large labeled historical document datasets with precise ground truth.
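The first of the two steps, building a template document whose content and layout are known in advance, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LineBox` and `layout_template`, the page dimensions, and the fixed-width character assumption are all hypothetical and chosen only for illustration. The point is that when the text and its position are specified up front, the ground truth (text plus bounding box per line) comes for free. The second step, style transfer onto the template, is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineBox:
    """Ground truth for one text line: its content and bounding box (hypothetical structure)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def layout_template(
    lines: List[str],
    page_width: int = 800,   # assumed page width in pixels
    margin: int = 50,        # assumed margin in pixels
    line_height: int = 30,   # assumed height of one text line
    char_width: int = 12,    # assumption: every character is equally wide
) -> List[LineBox]:
    """Place user-specified text lines top to bottom and record their boxes."""
    max_width = page_width - 2 * margin
    boxes = []
    y = margin
    for text in lines:
        # Clip the box to the printable area; a real renderer would wrap instead.
        width = min(len(text) * char_width, max_width)
        boxes.append(LineBox(text=text, x=margin, y=y, width=width, height=line_height))
        y += line_height
    return boxes


template = layout_template(["In principio erat verbum", "et verbum erat apud Deum"])
for box in template:
    print(box)
```

Because the layout is generated rather than annotated, any later transformation that preserves text and layout leaves these boxes valid as labels for the styled image.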

research
05/03/2023

DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents

Language identification describes the task of recognizing the language o...
research
09/05/2019

Deep Visual Template-Free Form Parsing

Automatic, template-free extraction of information from form images is c...
research
12/11/2019

Lifelong learning for text retrieval and recognition in historical handwritten document collections

This chapter provides an overview of the problems that need to be dealt ...
research
03/16/2022

A Survey of Historical Document Image Datasets

This paper presents a systematic literature review of image datasets for...
research
07/14/2021

Synthesis in Style: Semantic Segmentation of Historical Documents using Synthetic Data

One of the most pressing problems in the automated analysis of historica...
research
09/04/2020

Externalizing Transformations of Historical Documents: Opportunities for Provenance-Driven Visualization

Transcription, annotation, digitization and/or visualization are common ...
research
05/22/2019

A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis

Automatic analysis of scanned historical documents comprises a wide rang...
