Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning

06/15/2021
by Christian Reul, et al.

In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2%. This model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results, we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected, more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47% [...] 50% [...] the aforementioned standard model. Our new mixed model is made openly available to the community.
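For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein (edit) distance between the OCR output and the ground truth, divided by the ground-truth length. The abstract does not state the exact normalization used in the paper, so that choice is an assumption here, and the function names `levenshtein`, `cer`, and `relative_reduction` are illustrative only. The last helper shows the arithmetic for comparing the two CER figures reported for the 29 unseen books (1.73 versus 2.84).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate as a fraction of the ground-truth length.
    Assumption: normalization by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(prediction, ground_truth) / len(ground_truth)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` over `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One substituted character in a four-character line
    print(cer("abcd", "abce"))
    # Relative reduction between the two reported CER values
    print(round(relative_reduction(2.84, 1.73), 3))
```

Under this definition, going from a CER of 2.84% to 1.73% corresponds to a relative error reduction of roughly 39%.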


research
01/19/2022

Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

This paper deals with the task of practical and open source Handwritten ...
research
10/08/2018

State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

In this paper we evaluate Optical Character Recognition (OCR) of 19th ce...
research
07/05/2018

Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Optical Character Recognition (OCR) on contemporary and historical data ...
research
09/14/2018

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

In this paper we describe a dataset of German and Latin ground truth (GT...
research
02/27/2018

Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

We combine three methods which significantly improve the OCR accuracy of...
research
08/06/2020

On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation

We investigate how to train a high quality optical character recognition...
research
09/27/2022

3D Rendering Framework for Data Augmentation in Optical Character Recognition

In this paper, we propose a data augmentation framework for Optical Char...
