Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs

10/13/2021
by Matteo Romanello, et al.

Together with critical editions and translations, commentaries are one of the main genres of publication in literary and textual scholarship, and have a century-long tradition. Yet the exploitation of thousands of digitized historical commentaries has hitherto been hindered by the poor quality of Optical Character Recognition (OCR), especially for commentaries on Greek texts. In this paper, we evaluate the performance of two pipelines suitable for the OCR of historical classical commentaries. Our results show that Kraken + Ciaconna reaches a substantially lower character error rate (CER) than Tesseract/OCR-D on commentary sections with a high density of polytonic Greek text (average CER 7% [...]) [...] Ciaconna on text sections written predominantly in Latin script (average CER 8.2% [...]). [...] dataset with OCR ground truth for 19th-century classical commentaries, and Pogretra, a large collection of training data and pre-trained models for a wide variety of ancient Greek typefaces.
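
For readers unfamiliar with the metric, CER is commonly defined as the Levenshtein (edit) distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below illustrates that common definition only; it is not the evaluation tooling used in the paper. The function names `levenshtein` and `cer` are hypothetical, and normalizing both strings to Unicode NFC is an assumption made here so that precomposed and decomposed polytonic Greek characters compare as equal.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate: edit distance / ground-truth length.

    Assumption: both strings are normalized to NFC before comparison.
    """
    gt = unicodedata.normalize("NFC", ground_truth)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not gt:
        raise ValueError("ground truth must not be empty")
    return levenshtein(gt, hyp) / len(gt)


if __name__ == "__main__":
    # One missing diacritic in an 11-character ground-truth line.
    print(cer("μῆνιν ἄειδε", "μηνιν ἄειδε"))
```

Under this definition a single wrong diacritic counts as one full character error, which is one reason polytonic Greek sections tend to be harder to score well on than Latin-script sections.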

research
08/06/2016

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

This article describes the results of a case study that applies Neural N...
research
07/05/2018

Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Optical Character Recognition (OCR) on contemporary and historical data ...
research
10/08/2018

State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

In this paper we evaluate Optical Character Recognition (OCR) of 19th ce...
research
06/01/2022

Optical character recognition quality affects perceived usefulness of historical newspaper clippings

Introduction. We study effect of different quality optical character rec...
research
11/05/2014

Optical Character Recognition, Using K-Nearest Neighbors

The problem of optical character recognition, OCR, has been widely discu...
research
08/26/2019

End-To-End Measure for Text Recognition

Measuring the performance of text recognition and text line detection en...
research
09/14/2018

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

In this paper we describe a dataset of German and Latin ground truth (GT...