Graphemic Normalization of the Perso-Arabic Script

10/21/2022
by Raiomond Doctor, et al.

Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions. This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community. We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies, insufficient literacy, and loss or lack of orthographic tradition. We evaluate the effects of script normalization for eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques, especially for languages with a paucity of resources.
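
To make the phrase "visually ambiguous yet canonically nonequivalent letters" concrete, here is a minimal Python sketch. It is not the normalization method evaluated in the paper. The mapping table `PERSIAN_MAP` and the function `normalize_persian` are hypothetical names introduced only for illustration, and the two character pairs are a simplified assumption about what a Persian-oriented normalizer might fold together. The point is that standard Unicode normalization (NFC, NFKC) leaves such look-alike letters distinct, so a language-specific mapping is needed on top of it.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping (an assumption for illustration):
# Arabic-orthography letters that often appear in Persian text by mistake,
# mapped to the letters Persian orthography conventionally uses.
PERSIAN_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(PERSIAN_MAP)


def normalize_persian(text: str) -> str:
    """Apply NFC, then fold look-alike letters using the table above."""
    return unicodedata.normalize("NFC", text).translate(_TABLE)


# The word for "book", typed once with Arabic KAF and once with KEHEH.
with_arabic_kaf = "\u0643\u062A\u0627\u0628"
with_keheh = "\u06A9\u062A\u0627\u0628"

# The two strings look alike in many fonts but differ as code point
# sequences, and neither NFC nor NFKC makes them equal.
same_raw = with_arabic_kaf == with_keheh
same_nfkc = (unicodedata.normalize("NFKC", with_arabic_kaf)
             == unicodedata.normalize("NFKC", with_keheh))
same_folded = normalize_persian(with_arabic_kaf) == normalize_persian(with_keheh)

print(same_raw, same_nfkc, same_folded)
```

A table like this is only appropriate for one orthography: the same KAF to KEHEH mapping would corrupt Arabic text, which is why the choice of mapping has to follow the regional orthographic tradition of the language being processed.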

research
01/26/2023

Beyond Arabic: Software for Perso-Arabic Script Manipulation

This paper presents an open-source software library that provides a set ...
research
04/03/2023

PALI: A Language Identification Benchmark for Perso-Arabic Scripts

The Perso-Arabic scripts are a family of scripts that are widely adopted...
research
11/25/2020

A Panoramic Survey of Natural Language Processing in the Arab World

The term natural language refers to any system of symbolic communication...
research
05/25/2023

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

The wide accessibility of social media has provided linguistically under...
research
04/03/2021

Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology

The goal of this paper is to provide a complete representation of region...
research
05/19/2022

Curras + Baladi: Towards a Levantine Corpus

The processing of the Arabic language is a complex field of research. Th...
research
10/25/2015

Statistical Parsing by Machine Learning from a Classical Arabic Treebank

Research into statistical parsing for English has enjoyed over a decade ...
