A Tool for Facilitating OCR Postediting in Historical Documents

04/23/2020
by Alberto Poncelas, et al.

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistent typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, specifically for correcting common errors in digitized historical documents. The tool suggests alternatives for word forms not found in a specified vocabulary. In the postedited text, each assumed error is replaced by a presumably correct alternative, chosen according to the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
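
The workflow described in the abstract (flag word forms missing from a vocabulary, propose alternatives, pick one by LM score) can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion pairs in `CONFUSIONS`, the add-one-smoothed `BigramLM`, and the function names `candidates` and `postedit` are all assumptions made for illustration, and a real system would likely use a much larger vocabulary and a stronger language model.

```python
import math
from collections import Counter

# Hypothetical OCR confusion pairs (wrong, right). These are illustrative
# assumptions, not a list taken from the paper.
CONFUSIONS = [("f", "s"), ("ſ", "s"), ("vv", "w"), ("1", "l")]


def candidates(word, vocab):
    """Return in-vocabulary variants reachable by one confusion substitution."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            variant = word[:start] + right + word[start + len(wrong):]
            if variant in vocab:
                found.add(variant)
            start = word.find(wrong, start + 1)
    return sorted(found)


class BigramLM:
    """Toy bigram language model with add-one smoothing (an assumption)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, prev, word):
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)


def postedit(tokens, vocab, lm):
    """Replace out-of-vocabulary tokens with the best-scoring candidate.

    Returns the corrected tokens and a log of (original, chosen, options)
    so that a human can review or override each replacement.
    """
    output, log = [], []
    prev = "<s>"
    for token in tokens:
        best = token
        if token not in vocab:
            options = candidates(token, vocab)
            if options:
                best = max(options, key=lambda c: lm.logprob(prev, c))
                log.append((token, best, options))
        output.append(best)
        prev = best
    return output, log


if __name__ == "__main__":
    training = ["the poor of this kingdom", "the trade of this kingdom"]
    model = BigramLM(training)
    vocabulary = {w for s in training for w in s.split()}
    corrected, changes = postedit("the poor of thif kingdom".split(), vocabulary, model)
    print(" ".join(corrected))
    print(changes)
```

Returning the change log alongside the corrected tokens reflects the abstract's point that the tool is transparent and open to human intervention: every replacement can be inspected and reverted.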


