Contextual Analysis for Middle Eastern Languages with Hidden Markov Models

05/07/2015
by Kazem Taghva, et al.

Displaying a document in a Middle Eastern language requires contextual analysis, because each character of the alphabet has several presentational forms. Words are formed by joining the positional glyphs that represent the appropriate presentational form of each character. A set of rules defines how the glyphs join. These rules vary from language to language and are subject to interpretation by software developers. In this paper, we propose a machine learning approach to contextual analysis based on the first-order Hidden Markov Model. We design and build a model for the Farsi language to demonstrate this technology. The Farsi model achieves 94% accuracy when trained on a short list of 89 Farsi vocabulary items comprising 2780 Farsi characters. The experiment can easily be extended to many languages, including Arabic, Urdu, and Sindhi. An advantage of this approach is that the same software can perform contextual analysis without coding complex rules for each specific language. Of particular interest is that languages with fewer speakers, which software developers typically ignore for lack of financial incentive, can gain greater representation on the web.
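To make the idea concrete, the following is a minimal sketch of a first-order Hidden Markov Model for positional-form selection. It is not the authors' implementation. It assumes that the hidden states are the four positional forms (isolated, initial, medial, final), that the observations are the characters themselves, that the model is trained by counting over hand-tagged words with add-one smoothing, and that decoding uses the Viterbi algorithm. The names `train`, `decode`, and `TOY_WORDS`, and the toy Latin-letter alphabet (where "a" and "d" stand in for letters that do not join to the following letter), are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hidden states: the four positional forms a character can take (assumed state set).
FORMS = ("isolated", "initial", "medial", "final")

def log_probs(counts, rows, cols, alpha):
    """Turn raw counts into add-alpha smoothed log-probabilities, row by row."""
    table = {}
    for r in rows:
        total = sum(counts[(r, c)] for c in cols) + alpha * len(cols)
        for c in cols:
            table[(r, c)] = math.log((counts[(r, c)] + alpha) / total)
    return table

def train(tagged_words, alpha=1.0):
    """Estimate start, transition, and emission tables from (char, form) sequences."""
    chars = sorted({c for word in tagged_words for c, _ in word})
    start, trans, emit = Counter(), Counter(), Counter()
    for word in tagged_words:
        start[("START", word[0][1])] += 1
        for c, f in word:
            emit[(f, c)] += 1
        for (_, f1), (_, f2) in zip(word, word[1:]):
            trans[(f1, f2)] += 1
    return (log_probs(start, ["START"], FORMS, alpha),
            log_probs(trans, FORMS, FORMS, alpha),
            log_probs(emit, FORMS, chars, alpha))

def decode(word, model):
    """Viterbi decoding: the most likely sequence of positional forms for `word`."""
    start, trans, emit = model
    # Characters never seen in training contribute equally to every form.
    e = lambda f, c: emit.get((f, c), 0.0)
    best = {f: start[("START", f)] + e(f, word[0]) for f in FORMS}
    back = []
    for c in word[1:]:
        prev, best, pointers = best, {}, {}
        for f in FORMS:
            p = max(FORMS, key=lambda q: prev[q] + trans[(q, f)])
            best[f] = prev[p] + trans[(p, f)] + e(f, c)
            pointers[f] = p
        back.append(pointers)
    path = [max(FORMS, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def tag(word, forms):
    """Pair each character with its hand-assigned positional form."""
    return list(zip(word, forms))

# Invented toy training data; not taken from the paper.
TOY_WORDS = [
    tag("bsn", ["initial", "medial", "final"]),
    tag("nsb", ["initial", "medial", "final"]),
    tag("sbn", ["initial", "medial", "final"]),
    tag("ba", ["initial", "final"]),
    tag("bad", ["initial", "final", "isolated"]),
    tag("ad", ["isolated", "isolated"]),
    tag("d", ["isolated"]),
]

model = train(TOY_WORDS)
print(decode("bad", model))
```

In this sketch, supporting another language would mean supplying a different tagged word list rather than writing new joining rules, which is the advantage the abstract describes.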
