Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system

05/10/2012
by   Ankit Kumar, et al.

India is a multilingual, multi-script country. In every state of India there are two languages in use: the local state language and English. For example, in Andhra Pradesh, a state in India, a document may contain text words in both English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text words to the OCRs of the individual scripts. In this paper, we introduce a simple and efficient technique of script identification for Kannada, English and Hindi text words of a printed document. The proposed approach uses the horizontal and vertical projection profiles to discriminate the three scripts; the features themselves are extracted from the horizontal projection profile of each text word. We analysed 700 different words of Kannada, English and Hindi in order to extract the discriminating features and to develop the knowledge base. The proposed system is tested on 100 different document images containing more than 1000 text words of each script and a classification rate of 98.25 Kannada, English and Hindi respectively.
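As a rough illustration of the projection profiles the abstract refers to, the sketch below computes both profiles for a binarized word image stored as a list of rows (1 = ink, 0 = background). The toy image, the function names, and the `peak_row_ratio` feature are assumptions made for this example; the abstract does not state which features the authors actually extract.

```python
def horizontal_projection(word):
    """Count foreground (1) pixels in each row of a binary word image."""
    return [sum(row) for row in word]


def vertical_projection(word):
    """Count foreground (1) pixels in each column of a binary word image."""
    return [sum(col) for col in zip(*word)]


def peak_row_ratio(profile):
    """Hypothetical feature: relative position of the densest row
    (0.0 = top of the word, 1.0 = bottom). Not taken from the paper."""
    if len(profile) < 2:
        return 0.0
    peak = max(range(len(profile)), key=profile.__getitem__)
    return peak / (len(profile) - 1)


# Toy 5x6 word image (an assumption for illustration): a dense top row
# followed by a few vertical strokes.
word = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

h_profile = horizontal_projection(word)  # one count per row
v_profile = vertical_projection(word)    # one count per column
ratio = peak_row_ratio(h_profile)
```

A classifier built on this idea would compare such profile-derived features against thresholds or rules learned from labelled words; how the paper's knowledge base does this is not described in the abstract.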

research
06/25/2011

Morphological Reconstruction for Word Level Script Identification

A line of a bilingual document page may contain text words in regional l...
research
10/11/2014

Direct Processing of Document Images in Compressed Domain

With the rapid increase in the volume of Big data of this digital era, f...
research
06/29/2021

Language Lexicons for Hindi-English Multilingual Text Processing

Language Identification in textual documents is the process of automatic...
research
06/29/2021

A Simple and Efficient Probabilistic Language model for Code-Mixed Text

The conventional natural language processing approaches are not accustom...
research
03/23/2017

Content-based similar document image retrieval using fusion of CNN features

Rapid increase of digitized document give birth to high demand of docume...
research
07/11/2013

Conversion of Braille to Text in English, Hindi and Tamil Languages

The Braille system has been used by the visually impaired for reading an...
research
10/18/2016

Stylometric Analysis of Early Modern Period English Plays

Function word adjacency networks (WANs) are used to study the authorship...
