DocEmul: a Toolkit to Generate Structured Historical Documents

10/10/2017
by Samuele Capobianco, et al.

We propose a toolkit that generates structured synthetic documents by emulating the actual document production process. Synthetic documents can be used to train systems that perform document analysis tasks. Here we address record counting on handwritten structured collections for which only a limited number of examples is available. With the DocEmul toolkit we can generate a larger dataset and use it to train a deep architecture that predicts the number of records on each page. The toolkit can both generate synthetic collections and perform data augmentation to enlarge the training set. It includes a method for extracting the page background from real pages; the extracted background serves as a substrate on which records are written, following variable structures and using cursive fonts. The synthetic collection can be extended further by adding random noise, page rotations, and other visual variations. We ran experiments on two different handwritten collections, using the toolkit to generate synthetic data for training a Convolutional Neural Network that counts the number of records in the real collections.
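The generation idea described above can be illustrated with a minimal sketch. This is not the DocEmul implementation or its API: the function names `make_synthetic_page` and `augment`, the page size, the layout rule, and the noise level are all assumptions chosen for illustration. A lightly varied background stands in for one extracted from a real page, horizontal ink strokes stand in for cursive text, and the number of records written becomes the training label. Only additive noise is shown; page rotation is omitted for brevity.

```python
import random


def make_synthetic_page(height=64, width=48, max_records=5, rng=None):
    """Return (page, n_records); page is a list of rows with values in [0, 1].

    Hypothetical stand-in for a page generator: 1.0 is white paper, 0.0 is ink.
    """
    rng = rng or random.Random()
    slot = height // max_records  # vertical space reserved for each record
    if slot < 9 or width < 8:
        raise ValueError("page too small for the assumed layout")

    # Background: light paper with slight variation (assumed substitute
    # for a background extracted from a real page).
    page = [[0.9 + rng.uniform(-0.05, 0.05) for _ in range(width)]
            for _ in range(height)]

    n_records = rng.randint(1, max_records)
    for r in range(n_records):
        top = r * slot + 2
        # Variable structure: each record has 1 to 3 text lines of random length.
        for line in range(rng.randint(1, 3)):
            y = top + line * 3
            length = rng.randint(width // 2, width - 4)
            for x in range(2, 2 + length):
                page[y][x] = 0.1  # dark ink stroke
    return page, n_records


def augment(page, rng, noise=0.05):
    """Add Gaussian pixel noise, clipped to [0, 1]; the label is unchanged."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, noise))) for v in row]
            for row in page]


if __name__ == "__main__":
    rng = random.Random(0)
    page, label = make_synthetic_page(rng=rng)
    noisy = augment(page, rng)
    print(len(noisy), len(noisy[0]), label)
```

Each (page, label) pair produced this way could serve as one training example for a record-counting model, with `augment` applied repeatedly to enlarge the set.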

