On-Device Language Identification of Text in Images using Diacritic Characters

11/10/2020
by Shubham Vatsal, et al.

Diacritic characters form a distinctive set of characters that provides significant clues for identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics, often serve as a distinguishing feature for many languages, especially those written in a Latin script. In this work, we aim to identify the language of text in images from the presence of diacritic characters, in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin-script languages encompassing 85 diacritic characters. We use an architecture similar to SqueezeDet for object detection of diacritic characters, followed by a shallow network that identifies the language. OCR systems supplied with the identified language as a parameter tend to produce better results than OCR systems deployed on their own. Apart from improving OCR results, the discussed work also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.
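
To make the second stage of the pipeline concrete, the following is a minimal sketch of how a set of detected diacritic characters could be mapped to a language. It is not the authors' method: the abstract describes a shallow network, whereas this sketch swaps in a simple weighted vote. The table `DIACRITIC_LANGUAGES` and the function names are hypothetical, and the table is a small hand-picked illustration, not the 85 characters or 13 languages covered by the paper. The input list stands in for the output of an upstream diacritic detector.

```python
import unicodedata
from collections import defaultdict

# Illustrative, incomplete table (an assumption, not taken from the paper):
# each diacritic character maps to some languages whose alphabets use it.
DIACRITIC_LANGUAGES = {
    "ñ": {"Spanish"},
    "ã": {"Portuguese"},
    "ğ": {"Turkish"},
    "ő": {"Hungarian"},
    "ř": {"Czech"},
    "ł": {"Polish"},
    "ç": {"French", "Portuguese", "Turkish"},
    "é": {"French", "Spanish", "Portuguese", "Hungarian", "Czech"},
    "ö": {"German", "Turkish", "Hungarian"},
    "ü": {"German", "Turkish", "Hungarian"},
}


def score_languages(detected_chars):
    """Return a {language: score} dict for the detected characters.

    Each character splits one vote evenly among the languages that use it,
    so a character unique to one language counts more than a shared one.
    """
    scores = defaultdict(float)
    for ch in detected_chars:
        # Compose base letter + combining mark into one code point and
        # fold case so that "Ñ" and "n\u0303" both match "ñ".
        key = unicodedata.normalize("NFC", ch).lower()
        languages = DIACRITIC_LANGUAGES.get(key)
        if not languages:
            continue  # not a diacritic known to this toy table
        for language in languages:
            scores[language] += 1.0 / len(languages)
    return dict(scores)


def identify_language(detected_chars):
    """Return the highest-scoring language, or None if nothing matched."""
    scores = score_languages(detected_chars)
    if not scores:
        return None
    # Iterate over sorted names so ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    print(identify_language(["ç", "ã"]))       # Portuguese
    print(identify_language(["ö", "ü", "ğ"]))  # Turkish
    print(identify_language(["a", "b"]))       # None
```

In a full system, the returned language would be passed to the OCR engine as its language parameter. When no diacritic is detected the sketch returns `None`, in which case the caller would presumably fall back to the OCR engine's default behaviour.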


