PALI: A Language Identification Benchmark for Perso-Arabic Scripts

04/03/2023
by Sina Ahmadi, et al.

The Perso-Arabic scripts are a family of scripts widely adopted by linguistic communities around the globe. Identifying the languages written in these scripts is crucial to language technologies and is challenging in low-resource setups. This paper sheds light on the challenges of detecting languages that use Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address these challenges, we use a set of supervised techniques to classify sentences by language. Building on these, we also propose a hierarchical model that targets clusters of languages that the classifiers confuse most often. Our experimental results indicate that these approaches are effective.
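The two-stage idea described above (a sentence classifier, followed by a specialised classifier for a cluster of easily confused languages) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the classifiers, features, or clusters used in the paper, so the character n-gram naive Bayes model, the class names `NgramNB` and `HierarchicalLID`, the `fa-ur` cluster, and the toy sentences are all hypothetical and are not taken from the PALI data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the space-padded text."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)    # label -> n-gram counts
        self.label_counts = Counter(labels)   # used for class priors
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence)
        v = len(self.vocab)
        n = sum(self.label_counts.values())

        def score(label):
            log_prior = math.log(self.label_counts[label] / n)
            return log_prior + sum(
                math.log((self.counts[label][g] + 1) / (self.totals[label] + v))
                for g in grams)

        return max(self.counts, key=score)


class HierarchicalLID:
    """Root model predicts a language or a cluster; a per-cluster model
    then separates the languages inside that cluster."""

    def __init__(self, clusters):
        # clusters: cluster name -> set of language labels (hypothetical)
        self.clusters = clusters
        self.label_to_cluster = {
            lang: name for name, langs in clusters.items() for lang in langs}

    def fit(self, sentences, labels):
        coarse = [self.label_to_cluster.get(y, y) for y in labels]
        self.root = NgramNB().fit(sentences, coarse)
        self.leaf = {}
        for name, members in self.clusters.items():
            pairs = [(s, y) for s, y in zip(sentences, labels) if y in members]
            if pairs:
                xs, ys = zip(*pairs)
                self.leaf[name] = NgramNB().fit(xs, ys)
        return self

    def predict(self, sentence):
        top = self.root.predict(sentence)
        return self.leaf[top].predict(sentence) if top in self.leaf else top


# Toy sentences written for this sketch (Persian, Urdu, Arabic).
train = [
    ("این یک کتاب است", "fa"), ("من به خانه رفتم", "fa"),
    ("یہ ایک کتاب ہے", "ur"), ("میں گھر جا رہا ہوں", "ur"),
    ("هذا كتاب", "ar"), ("أنا ذاهب إلى البيت", "ar"),
]
sentences, labels = zip(*train)
model = HierarchicalLID({"fa-ur": {"fa", "ur"}}).fit(sentences, labels)
prediction = model.predict("یہ کتاب ہے")
```

With so little training data the prediction is not meaningful; the point is only the control flow, in which the root model first chooses between `ar` and the `fa-ur` cluster, and the cluster-specific model is consulted only when the cluster is chosen.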
