Vartani Spellcheck – Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance

12/14/2020
by   Aditya Pal, et al.
0

Traditional Optical Character Recognition (OCR) systems that generate text of highly inflectional Indic languages like Hindi tend to suffer from poor accuracy due to a wide alphabet set, compound characters and difficulty in segmenting characters in a word. Automatic spelling error detection and context-sensitive error correction can be used to improve accuracy by post-processing the text generated by these OCR systems. A majority of previously developed language models for error correction of Hindi spelling have been context-free. In this paper, we present Vartani Spellcheck - a context-sensitive approach for spelling correction of Hindi text using a state-of-the-art transformer - BERT in conjunction with the Levenshtein distance algorithm, popularly known as Edit Distance. We use a lookup dictionary and context-based named entity recognition (NER) for detection of possible spelling errors in the text. Our proposed technique has been tested on a large corpus of text generated by the widely used Tesseract OCR on the Hindi epic Ramayana. With an accuracy of 81 improvement over some of the previously established context-sensitive error correction mechanisms for Hindi. We also explain how Vartani Spellcheck may be used for on-the-fly autocorrect suggestion during continuous typing in a text editor environment.

READ FULL TEXT
research
09/29/2021

Hierarchical Character Tagger for Short Text Spelling Error Correction

State-of-the-art approaches to spelling error correction problem include...
research
01/23/2019

Context-Sensitive Malicious Spelling Error Correction

Misspelled words of the malicious kind work by changing specific keyword...
research
10/19/2017

Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings

We present an unsupervised context-sensitive spelling correction method ...
research
05/28/2019

A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Word error rate of an ocr is often higher than its character error rate....
research
09/19/2023

RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans

The text editing tasks, including sentence fusion, sentence splitting an...
research
07/30/2023

Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based historical docu...
research
06/26/2019

Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

A common approach for improving OCR quality is a post-processing step ba...

Please sign up or login with your details

Forgot password? Click here to reset