Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

07/30/2023
by Omri Suissa, et al.

Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using Optical Character Recognition (OCR) technology. Unfortunately, this widely used technology is error-prone, especially when the OCRed document was written hundreds of years ago. Neural networks have shown great success in various text processing tasks, including OCR post-correction. Their main disadvantage for historical corpora is the lack of the sufficiently large training datasets they need to learn from, especially for morphologically rich languages like Hebrew. Moreover, because of Hebrew's unique features, it is not clear what the optimal structure and hyperparameter values (predefined parameters) of a neural network for OCR error correction in Hebrew are. Furthermore, languages change across genres and periods, and these changes may affect the accuracy of neural network models for OCR post-correction. To overcome these challenges, we developed a new multi-phase method that generates artificial training datasets with OCR errors and optimizes hyperparameters in order to build an effective neural network for OCR post-correction in Hebrew.
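
The idea of an artificial training dataset with OCR errors can be illustrated with a minimal sketch. It is not the authors' multi-phase method: the confusion table, the function names (`inject_ocr_errors`, `make_training_pairs`), and the `error_rate` parameter are all assumptions made for illustration. The sketch only shows the general pattern of corrupting clean text with plausible character confusions to obtain (noisy, clean) pairs for a correction model.

```python
import random

# Hypothetical confusion table of visually similar Hebrew letters.
# Illustrative only; it is not taken from the paper.
CONFUSIONS = {
    "ד": ["ר"],
    "ר": ["ד"],
    "ה": ["ח"],
    "ח": ["ה"],
    "ו": ["ז"],
    "ז": ["ו"],
    "ב": ["כ"],
    "כ": ["ב"],
}


def inject_ocr_errors(text, error_rate=0.1, seed=None):
    """Replace confusable characters with a look-alike at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Only characters listed in the table can be corrupted.
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(sentences, error_rate=0.1, seed=0):
    """Build (noisy, clean) pairs from a list of clean sentences."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        # Draw a per-sentence seed so the whole dataset is reproducible.
        noisy = inject_ocr_errors(sentence, error_rate, rng.randrange(2**32))
        pairs.append((noisy, sentence))
    return pairs
```

A substitution-only table keeps the sketch short; real OCR noise also includes insertions, deletions, and merged or split characters, which a fuller generator would need to model.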


research
07/30/2023

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based documents such ...
research
06/12/2021

Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction

Digitization of historical documents is a challenging task in many digit...
research
05/25/2020

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

A great deal of historical corpora suffer from errors introduced by the ...
research
12/14/2020

Vartani Spellcheck – Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance

Traditional Optical Character Recognition (OCR) systems that generate te...
research
11/04/2021

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Much of the existing linguistic data in many languages of the world is l...
research
08/29/2023

Enhancing OCR Performance through Post-OCR Models: Adopting Glyph Embedding for Improved Correction

The study investigates the potential of post-OCR models to overcome limi...
research
09/06/2018

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in R...
