Lipi Gnani - A Versatile OCR for Documents in any Language Printed in Kannada Script

01/02/2019
by Shiva Kumar H R, et al.

A Kannada OCR, named Lipi Gnani, has been designed and developed from scratch, with the aim of converting printed text or poetry in Kannada script without any restriction on vocabulary. The training and test sets have been collected from over 35 books published between 1970 and 2002, including books written in Halegannada and pages containing Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly complete in the sense that it recognizes all the punctuation marks, special symbols, Indo-Arabic and Kannada numerals, and also the interspersed English words. Several minor and major original contributions have been made at the different processing stages of this OCR, such as binarization, line and character segmentation, recognition, and Unicode mapping. The result is a Kannada OCR that performs as well as, and in some cases better than, Google's Tesseract OCR, as shown by the results. To the knowledge of the authors, this is the first report of a complete Kannada OCR that handles all the issues involved. Currently, there is no dictionary-based postprocessing, and the obtained results are due solely to the recognition process. Four benchmark test databases have been created, containing scanned pages from books in the Kannada, Sanskrit, Konkani and Tulu languages, all of them printed in Kannada script. The word-level recognition accuracy of Lipi Gnani is 4% higher than that of Google's Tesseract OCR on the Kannada dataset, 8% higher on the datasets of Tulu and Sanskrit, and 25% higher on the Konkani dataset.
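The abstract reports results as word-level recognition accuracy but does not define how that figure is computed. As a rough illustration only, the sketch below assumes one common reading: split the ground truth and the OCR output on whitespace, align the two word sequences, and report the fraction of ground-truth words reproduced exactly. The function name `word_accuracy` and the alignment via Python's `difflib` are assumptions made for this example, not the authors' evaluation procedure.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, recognized: str) -> float:
    """Fraction of reference words that the OCR output reproduces exactly.

    Hypothetical metric for illustration; the paper may define it differently.
    """
    ref_words = reference.split()
    rec_words = recognized.split()
    if not ref_words:
        # No reference words: perfect only if the OCR also produced nothing.
        return 1.0 if not rec_words else 0.0
    # Align the two word sequences so an inserted or dropped word
    # does not shift every later word out of position.
    matcher = SequenceMatcher(a=ref_words, b=rec_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    truth = "ಕನ್ನಡ ಲಿಪಿ ಜ್ಞಾನಿ ಓಸಿಆರ್"
    output = "ಕನ್ನಡ ಲಿಪ ಜ್ಞಾನಿ ಓಸಿಆರ್"  # second word misrecognized
    print(f"word accuracy: {word_accuracy(truth, output):.2f}")
```

Under this definition a single wrong character makes the whole word count as an error, which is why word-level figures are usually lower than character-level ones for the same output.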


