OCR4all -- An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

09/09/2019
by Christian Reul, et al.

Optical Character Recognition (OCR) on historical printings is a challenging task, mainly due to the complexity of the layout and the highly variant typography. Nevertheless, great progress has been made in historical OCR over the last few years, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition, and post-processing. A frequent drawback of these tools is their limited applicability by non-technical users such as humanist scholars, in particular when several tools have to be combined in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output but already in early stages, which minimizes error propagation. Furthermore, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings, if necessary. Experiments showed that users with minimal or no experience were able to capture the text of even the earliest printed books with manageable effort and great quality, achieving excellent character error rates (CERs) below 0.5%. The fully automated application on 19th century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components through standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
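To make the reported error rates concrete, the following minimal sketch shows how a character error rate is commonly computed: the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. This is an assumption about the usual definition, not the evaluation code used by the authors; the function names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER = edit distance / number of ground-truth characters (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Synthetic illustration only: 5 wrong characters out of 1,000.
    truth = "a" * 1000
    recognized = "a" * 995 + "b" * 5
    print(f"CER: {character_error_rate(truth, recognized):.2%}")
```

Under this definition, a CER below 0.5% corresponds to fewer than five character errors per 1,000 ground-truth characters.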


