Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

05/25/2023
by Sina Ahmadi, et al.

The wide accessibility of social media has given linguistically under-represented communities an extraordinary opportunity to create content in their native languages. This, however, comes with challenges in script normalization, particularly where speakers in a bilingual community rely on the script or orthography of another language to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We also conduct a small-scale evaluation on real data. Our experiments indicate that script normalization also improves the performance of downstream tasks such as machine translation and language identification.
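As a rough illustration of what "synthetic data with various levels of noise" could look like, the sketch below corrupts clean text by swapping language-specific letters for look-alike letters from a dominant-language keyboard, with a probability given by the noise level. The character mapping, the function names `add_script_noise` and `make_pairs`, and the sampling scheme are all assumptions made for this example; the abstract does not specify how the authors generate their data.

```python
import random

# Hypothetical mapping (an assumption, not taken from the paper): a few
# Central Kurdish letters and look-alike letters a writer might type
# instead on a Persian or Arabic keyboard.
ILLUSTRATIVE_MAP = {
    "ڕ": "ر",
    "ڵ": "ل",
    "ۆ": "و",
    "ێ": "ی",
}


def add_script_noise(text, noise_level, mapping=ILLUSTRATIVE_MAP, seed=None):
    """Replace each mappable character with probability `noise_level`."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0 and 1")
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Only characters present in the mapping can be corrupted.
        if ch in mapping and rng.random() < noise_level:
            out.append(mapping[ch])
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, noise_levels, seed=0):
    """Build (noisy, clean, level) triples for each sentence and noise level."""
    pairs = []
    for level in noise_levels:
        for i, sentence in enumerate(sentences):
            noisy = add_script_noise(sentence, level, seed=seed + i)
            pairs.append((noisy, sentence, level))
    return pairs
```

Pairs produced this way could serve as source and target sides for a sequence-to-sequence model, with the noisy string as input and the clean string as the normalization target. A one-to-one character map is a deliberate simplification; real unconventional writing may also involve many-to-one substitutions and missing or extra characters.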
