A Simple and Practical Approach to Improve Misspellings in OCR Text

06/22/2021
by Junxia Lin, et al.

This paper focuses on identifying and correcting non-word errors in OCR text. Such errors may result from the incorrect insertion, deletion, or substitution of a character, or from the transposition of two adjacent characters within a single word. They may also result from word boundary problems that lead to run-on errors (words incorrectly merged) and incorrect-split errors (a word incorrectly broken apart). Traditional N-gram correction methods handle single-word errors effectively but show limitations when dealing with split and merge errors. In this paper, we develop an unsupervised method that can handle both kinds of error, and it leads to a sizable improvement in correction rates. This tutorial paper addresses very difficult word correction problems, namely incorrect run-on and split errors, and illustrates what needs to be considered when addressing them. We outline a possible approach and assess its success in a limited study.
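To make the error types named in the abstract concrete, here is a minimal Python sketch of dictionary-based non-word correction. It is not the authors' method, which the abstract does not describe in detail. The function names (`edits1`, `split_run_on`, `correct`), the toy vocabulary, the lowercase-ASCII alphabet, and the order in which repairs are tried (merge, then split, then single-character edit) are all assumptions made for illustration. A practical system would rank competing candidates, for example with N-gram statistics, rather than take the first match.

```python
from typing import List, Set, Tuple

# Assumption: lowercase ASCII text only.
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word: str) -> Set[str]:
    """Strings one insertion, deletion, substitution, or adjacent
    transposition away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in LETTERS}
    inserts = {a + c + b for a, b in splits for c in LETTERS}
    return deletes | transposes | substitutes | inserts


def split_run_on(word: str, vocab: Set[str]) -> List[Tuple[str, str]]:
    """Ways to cut a run-on token into two dictionary words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]


def correct(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Greedy, illustrative repair of non-word tokens (hypothetical helper)."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in vocab:  # a known word: keep it
            out.append(tok)
            i += 1
            continue
        # Incorrect-split error: try joining with the next token.
        if i + 1 < len(tokens) and tok + tokens[i + 1] in vocab:
            out.append(tok + tokens[i + 1])
            i += 2
            continue
        # Run-on error: try cutting the token into two known words.
        pairs = split_run_on(tok, vocab)
        if pairs:
            out.extend(pairs[0])
            i += 1
            continue
        # Single-word error: take the first known word one edit away.
        candidates = sorted(edits1(tok) & vocab)
        out.append(candidates[0] if candidates else tok)
        i += 1
    return out


if __name__ == "__main__":
    vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
    noisy = ["the", "qu", "ick", "brownfox", "jmups", "over", "teh", "lazy", "dog"]
    print(" ".join(correct(noisy, vocab)))
```

In this toy input, `qu` and `ick` form an incorrect split, `brownfox` is a run-on, and `jmups` and `teh` are transpositions. The sketch handles each case in isolation and makes no attempt to resolve ambiguity between competing repairs.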

research
02/09/2023

Correcting Real-Word Spelling Errors: A New Hybrid Approach

Spelling correction is one of the main tasks in the field of Natural Lan...
research
11/21/2016

Statistical Learning for OCR Text Correction

The accuracy of Optical Character Recognition (OCR) is crucial to the su...
research
04/21/2016

OCR Error Correction Using Character Correction and Feature-Based Word Classification

This paper explores the use of a learned classifier for post-OCR text co...
research
04/04/2017

Guided Proofreading of Automatic Segmentations for Connectomics

Automatic cell image segmentation methods in connectomics produce merge ...
research
02/07/2023

Real-Word Error Correction with Trigrams: Correcting Multiple Errors in a Sentence

Spelling correction is a fundamental task in Text Mining. In this study,...
research
04/26/2022

Pretraining Chinese BERT for Detecting Word Insertion and Deletion Errors

Chinese BERT models achieve remarkable progress in dealing with grammati...
research
08/29/2023

Enhancing OCR Performance through Post-OCR Models: Adopting Glyph Embedding for Improved Correction

The study investigates the potential of post-OCR models to overcome limi...
