Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning
We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement, we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average, the combination of pretraining and voting improved the character accuracy by 46% starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5%. A more substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1%.
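To make the voting and active-learning steps concrete, the following is a minimal sketch, not the authors' implementation: a real voter aligns the fold models' outputs character-wise and considers top-N alternatives, whereas this simplification votes on whole line strings weighted by confidence, and ranks lines by how strongly the committee disagrees. All function names and the toy data are hypothetical.

```python
from collections import Counter

def vote_line(predictions):
    """Pick the committee's output for one line.

    predictions: list of (text, confidence) pairs, one per fold model.
    Simplification: weighted majority over whole strings instead of
    character-wise alignment with top-N alternatives.
    """
    scores = Counter()
    for text, conf in predictions:
        scores[text] += conf
    best, _ = scores.most_common(1)[0]
    return best

def disagreement(predictions):
    """Fraction of fold models deviating from the majority string."""
    texts = [text for text, _ in predictions]
    _, majority_count = Counter(texts).most_common(1)[0]
    return 1.0 - majority_count / len(texts)

def select_for_training(lines, k):
    """Active learning: choose the k lines the voters disagree on most,
    i.e. those expected to carry the highest information gain."""
    return sorted(lines, key=disagreement, reverse=True)[:k]

# Toy example: three lines, each transcribed by three fold models.
lines = [
    [("gloria", 0.90), ("gloria", 0.95), ("gloria", 0.85)],   # full agreement
    [("ſeculum", 0.9), ("ſeculum", 0.8), ("seculum", 0.6)],   # one deviation
    [("cœlum", 0.5), ("coelum", 0.6), ("celum", 0.4)],        # total disagreement
]
print([vote_line(preds) for preds in lines])
print("most informative line:", lines.index(select_for_training(lines, 1)[0]))
```

The third line, where every voter differs, is ranked as the most informative candidate for manual transcription and retraining.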