Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

02/27/2018
by Christian Reul, et al.

We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average the combination of pretraining and voting improved the character accuracy by 46% [...] starting from the same mixed model. This number rose to 53% [...] models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% [...] average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% [...] substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1% [...]
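To make methods (2) and (3) more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the committee members' outputs for a line have already been aligned position by position, and that each voter supplies its top-N alternatives as (character, confidence) pairs. All function names and the data layout are hypothetical. Summing confidences is one simple way to take alternatives into account, and the share of non-unanimous positions is one simple proxy for disagreement.

```python
from collections import defaultdict


def confidence_vote(position_candidates):
    """Pick one character for a single aligned position.

    position_candidates holds one list per voter; each list contains that
    voter's top-N (character, confidence) alternatives for this position.
    """
    totals = defaultdict(float)
    for voter_alternatives in position_candidates:
        for char, conf in voter_alternatives:
            totals[char] += conf
    # Highest summed confidence wins; iterating in sorted order makes
    # ties resolve to the alphabetically smallest character.
    return max(sorted(totals), key=lambda c: totals[c])


def vote_line(aligned_line):
    """Combine the committee's outputs for one line into a single string."""
    return "".join(confidence_vote(pos) for pos in aligned_line)


def disagreement(aligned_line):
    """Fraction of positions where the voters' top-1 characters differ."""
    if not aligned_line:
        return 0.0
    split = 0
    for pos in aligned_line:
        top1 = {max(alts, key=lambda a: a[1])[0] for alts in pos}
        if len(top1) > 1:
            split += 1
    return split / len(aligned_line)


def select_for_annotation(lines, k):
    """Return the names of the k lines the committee disagrees on most."""
    ranked = sorted(lines, key=lambda name: disagreement(lines[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical three-voter committee output for two short lines.
unanimous_d = [[("d", 0.9)], [("d", 0.8)], [("d", 0.95)]]
contested = [
    [("e", 0.9), ("c", 0.1)],    # voter 1 is confident in "e"
    [("c", 0.55), ("e", 0.45)],  # voters 2 and 3 narrowly prefer "c"
    [("c", 0.5), ("e", 0.4)],
]
lines = {
    "line_a": [unanimous_d, unanimous_d],
    "line_b": [unanimous_d, contested],
}

voted = vote_line(lines["line_b"])
to_transcribe = select_for_annotation(lines, 1)
```

In the contested position a plain majority vote over top-1 outputs would choose "c", while the summed confidences favor "e". This is the kind of case where considering the alternatives and their confidence values can change the outcome. The line containing that position is also the one ranked first for additional transcription.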


