Lexically Aware Semi-Supervised Learning for OCR Post-Correction

11/04/2021
by Shruti Rijhwani, et al.

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through self-training, a technique in which a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically-aware decoding essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.
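
The self-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the neural post-correction model is replaced by a toy character-substitution table, and the names `train_char_map` and `self_train`, the example strings, and the number of rounds are all hypothetical. The sketch covers only the loop in which a model is retrained on its own outputs for unlabeled text.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (noisy OCR text, corrected text)
Model = Callable[[str], str]      # maps noisy text to corrected text


def train_char_map(pairs: List[Pair]) -> Model:
    """Toy stand-in for a post-correction model (assumption, not the paper's model).

    Learns, for each source character, the target character it most often
    aligns with in equal-length training pairs.
    """
    votes = defaultdict(Counter)
    for src, tgt in pairs:
        if len(src) != len(tgt):
            continue  # the toy model only handles one-to-one substitutions
        for s, t in zip(src, tgt):
            votes[s][t] += 1
    table = {s: counts.most_common(1)[0][0] for s, counts in votes.items()}

    def correct(text: str) -> str:
        # Characters never seen in training are passed through unchanged.
        return "".join(table.get(ch, ch) for ch in text)

    return correct


def self_train(
    train_fn: Callable[[List[Pair]], Model],
    labeled: List[Pair],
    unlabeled: List[str],
    rounds: int = 2,
) -> Model:
    """Generic self-training loop: retrain on gold pairs plus pseudo-labels."""
    model = train_fn(labeled)
    for _ in range(rounds):
        # Pseudo-label the raw OCR output with the current model.
        pseudo = [(src, model(src)) for src in unlabeled]
        # Retrain on the manually corrected data plus the model's own outputs.
        model = train_fn(labeled + pseudo)
    return model


labeled = [("he1lo", "hello"), ("wor1d", "world")]
unlabeled = ["1ight", "ta1l"]
model = self_train(train_char_map, labeled, unlabeled)
print(model("1amp"))
```

In this toy setting the pseudo-labeled pairs only reinforce the substitution learned from the gold pairs; the abstract's claim is that, for the real model, such a loop helps consistently only when combined with lexically-aware decoding.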


