Transfer Learning for OCRopus Model Training on Early Printed Books

12/15/2017
by Christian Reul, et al.

A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building on already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and that of the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction: characters can now be flexibly added to and deleted from the pretrained alphabet when an existing model is loaded. For our experiments we use a self-trained mixed model on early Latin prints and the two standard OCRopus models on modern English and German Fraktur texts. The evaluation on seven early printed books showed that training from the Latin mixed model reduces the average number of errors by 43% and 26%, compared to training from scratch with 60 and 150 lines of ground truth, respectively. Furthermore, it is shown that even building from mixed models trained on data unrelated to the newly added training and test data can lead to significantly improved recognition results.
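The alphabet expansion and reduction described above can be illustrated with a minimal sketch. It is not the adapted OCRopus code: it assumes the network ends in a linear output layer with one row of weights per character, and the function name `adapt_output_layer`, the initialisation range, and the toy alphabets are hypothetical. Characters shared by both alphabets keep their pretrained weights, characters absent from the new ground truth are dropped, and newly added characters receive small random weights.

```python
import numpy as np


def adapt_output_layer(weights, bias, old_alphabet, new_alphabet, rng=None):
    """Re-index output-layer parameters from old_alphabet to new_alphabet.

    weights has shape (len(old_alphabet), hidden); bias has shape
    (len(old_alphabet),). Both shapes are assumptions of this sketch.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    old_index = {ch: i for i, ch in enumerate(old_alphabet)}
    hidden = weights.shape[1]

    new_weights = np.empty((len(new_alphabet), hidden), dtype=weights.dtype)
    new_bias = np.zeros(len(new_alphabet), dtype=bias.dtype)

    for j, ch in enumerate(new_alphabet):
        if ch in old_index:
            # Shared character: reuse the pretrained row.
            new_weights[j] = weights[old_index[ch]]
            new_bias[j] = bias[old_index[ch]]
        else:
            # New character: small random weights (range is an assumption).
            new_weights[j] = rng.uniform(-0.1, 0.1, size=hidden)
    # Characters only present in old_alphabet are dropped implicitly.
    return new_weights, new_bias


# Toy example: "b" is removed, the long s "ſ" is added.
# Treating index 0 (the empty string) as a blank label is an assumption.
old_alphabet = ["", "a", "b", "c"]
new_alphabet = ["", "a", "c", "ſ"]
w = np.arange(12, dtype=float).reshape(4, 3)
b = np.arange(4, dtype=float)
new_w, new_b = adapt_output_layer(w, b, old_alphabet, new_alphabet)
```

After this step, training would continue on the new ground truth as usual, so only the rows for added characters start from random values.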

research
11/27/2017

Improving OCR Accuracy on Early Printed Books by utilizing Cross Fold Training and Voting

In this paper we introduce a method that significantly reduces the chara...
research
09/14/2018

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

In this paper we describe a dataset of German and Latin ground truth (GT...
research
02/27/2018

Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

We combine three methods which significantly improve the OCR accuracy of...
research
08/06/2016

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

This article describes the results of a case study that applies Neural N...
research
01/19/2022

Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

This paper deals with the task of practical and open source Handwritten ...
research
09/06/2018

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in R...
research
01/17/2022

Evaluation of HTR models without Ground Truth Material

The evaluation of Handwritten Text Recognition (HTR) models during their...
