Phonetic and Visual Priors for Decipherment of Informal Romanization

by Maria Ryskina, et al.

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages—namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.





An Efficient Language-Independent Multi-Font OCR for Arabic Script

Optical Character Recognition (OCR) is the process of extracting digitiz...

Arabic Character Segmentation Using Projection Based Approach with Profile's Amplitude Filter

Arabic is one of the languages that present special challenges to Optica...

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems

Visual modifications to text are often used to obfuscate offensive comme...

Chinese-Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information

Unsupervised neural machine translation (UNMT) requires only monolingual...

#HashtagWars: Learning a Sense of Humor

In this work, we present a new dataset for computational humor, specific...

Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization

We propose a novel architecture for labelling character sequences that a...