DeepAI AI Chat
Log In Sign Up

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

by   Shruti Rijhwani, et al.

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29 of self-training and lexically-aware decoding essential for achieving consistent improvements. Data and code are available at


page 1

page 2

page 3

page 4


OCR Post Correction for Endangered Language Texts

There is little to no data available to build natural language processin...

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based documents such ...

Building One-Shot Semi-supervised (BOSS) Learning up to Fully Supervised Performance

Reaching the performance of fully supervised learning with unlabeled dat...

Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based historical docu...

Post-OCR Document Correction with large Ensembles of Character Sequence Models

In this paper, we propose a novel method based on character sequence-to-...

ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation

Optical character recognition (OCR) systems performance have improved si...

Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer

We propose a semi-supervised network for wide-angle portraits correction...