Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

09/06/2018
by Amrith Krishna, et al.

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, together with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01 for one of the dataset settings. A human judgment survey performed on the models shows that our proposed model produces predictions that are faster for a human to comprehend and faster to improve than those of the other systems.

research
11/10/2020

OCR Post Correction for Endangered Language Texts

There is little to no data available to build natural language processin...
research
02/17/2018

Building a Word Segmenter for Sanskrit Overnight

There is an abundance of digitised texts available in Sanskrit. However,...
research
09/20/2023

Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Scientific articles published prior to the "age of digitization" (~1997)...
research
07/29/2022

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

Inverse text normalization (ITN) is an essential post-processing step in...
research
09/13/2021

Post-OCR Document Correction with large Ensembles of Character Sequence Models

In this paper, we propose a novel method based on character sequence-to-...
research
07/30/2023

Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based historical docu...
research
12/15/2017

Transfer Learning for OCRopus Model Training on Early Printed Books

A method is presented that significantly reduces the character error rat...
