Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR)

09/13/2021
by Charangan Vasantharajan, et al.

Most low-resource languages lack the resources needed to create even a substantial monolingual corpus. Text in these languages is often found in government proceedings, but mostly as Portable Document Format (PDF) files that contain legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging because of the legacy fonts and printer-friendly encodings, which are not optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales across Tamil, Sinhala, and English and across many documents. To this end, we enhanced the performance of Tesseract 4.1.1 by employing LSTM-based training on many legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in printed documents. We show that this approach boosts the character-level accuracy of Tesseract 4.1.1 from 85.5 to 98.2 for Tamil (+12.9 relative change) and from 91.8 to 94.8 for Sinhala (+3.26 relative change) on a dataset that its authors consider challenging.
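As a rough illustration of what a character-level accuracy figure can mean, the sketch below scores an OCR output against a reference transcription using edit distance. The abstract does not say how the paper computes its accuracy, so the formula here (one minus edit distance over reference length) is an assumption, and the function names `levenshtein` and `char_accuracy` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Character-level accuracy in percent (assumed definition, not
    taken from the paper): 100 * (1 - edit_distance / len(ref)),
    clipped at zero."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - levenshtein(ref, hyp) / len(ref)) * 100.0


if __name__ == "__main__":
    # Python strings iterate over Unicode code points, so a Tamil or
    # Sinhala letter with a combining vowel sign counts as more than
    # one "character" here; grapheme-level scoring would need extra
    # segmentation.
    print(char_accuracy("abcd", "abxd"))
```

Scoring by code point keeps the sketch dependency-free; for scripts such as Tamil and Sinhala, segmenting into grapheme clusters before comparison may give figures closer to what a reader perceives as character errors.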

research
08/21/2023

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

Despite the existence of numerous Optical Character Recognition (OCR) to...
research
05/14/2019

A human-inspired recognition system for premodern Japanese historical documents

Recognition of historical documents is a challenging problem due to the ...
research
07/11/2013

Conversion of Braille to Text in English, Hindi and Tamil Languages

The Braille system has been used by the visually impaired for reading an...
research
03/06/2022

Capsule Networks for Character Recognition in Low Resource Languages

Most of the existing techniques in handwritten character recognition are...
research
12/24/2014

A Fuzzy Based Model to Identify Printed Sinhala Characters (ICIAfS14)

Character recognition techniques for printed documents are widely used f...
research
11/12/2020

Inference-only sub-character decomposition improves translation of unseen logographic characters

Neural Machine Translation (NMT) on logographic source languages struggl...
research
01/12/2012

Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach

We study the task of cleaning scanned text documents that are strongly c...
