Phonetic and Visual Priors for Decipherment of Informal Romanization

05/05/2020
by Maria Ryskina, et al.

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages—namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.
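As a rough illustration of the noisy-channel idea, the sketch below decodes a romanized string by maximizing P(source) x P(observed | source) with Viterbi search. It is not the paper's WFST cascade and involves no unsupervised training: it assumes a strict one-to-one character substitution, a tiny hand-picked corpus for the character bigram language model, and a fixed emission table built from hypothetical pseudo-counts that stand in for phonetic and visual similarity priors. All names and numbers (`PRIOR`, `CORPUS`, `ALPHA`, `lm`, `decode`) are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical pseudo-counts for P(latin | cyrillic). Larger values mark
# pairs assumed to be phonetically similar, small values mark pairs assumed
# to be visually similar (e.g. "n" for "п", "p" for "р").
PRIOR = {
    "п": {"p": 9, "n": 1},
    "р": {"r": 8, "p": 2},
    "и": {"i": 9, "u": 1},
    "в": {"v": 8, "b": 2},
    "е": {"e": 1},
    "т": {"t": 1},
}
# Normalize each row of pseudo-counts into an emission distribution.
EMIT = {c: {l: n / sum(row.values()) for l, n in row.items()}
        for c, row in PRIOR.items()}

# Toy corpus (an assumption) for a character bigram language model.
CORPUS = ["привет", "привет", "три"]
ALPHA = 0.1                       # add-alpha smoothing constant
SYMBOLS = list(PRIOR) + ["$"]     # possible next symbols, "$" = end of word

bigrams, contexts = Counter(), Counter()
for word in CORPUS:
    seq = "^" + word + "$"        # "^" = start of word
    for a, b in zip(seq, seq[1:]):
        bigrams[a, b] += 1
        contexts[a] += 1

def lm(prev, cur):
    """Smoothed bigram probability P(cur | prev) over Cyrillic characters."""
    return (bigrams[prev, cur] + ALPHA) / (contexts[prev] + ALPHA * len(SYMBOLS))

def decode(latin):
    """Viterbi search for the most probable Cyrillic source string."""
    # best[c] = (log-probability, path) of the best path ending in state c
    best = {"^": (0.0, "")}
    for ch in latin:
        nxt = {}
        for c, emis in EMIT.items():
            if ch not in emis:
                continue          # this Cyrillic char cannot produce ch
            nxt[c] = max(
                (lp + math.log(lm(prev, c)) + math.log(emis[ch]), path + c)
                for prev, (lp, path) in best.items()
            )
        if not nxt:
            raise ValueError(f"no source character can emit {ch!r}")
        best = nxt
    # Close the word with the end-of-word transition.
    _, path = max((lp + math.log(lm(prev, "$")), path)
                  for prev, (lp, path) in best.items())
    return path

print(decode("privet"))   # привет
print(decode("npivet"))   # привет
```

In the second call the observed "p" is more likely under the emission table to come from "п", but the language model strongly prefers "пр" over "пп", so the decoder recovers "р". This interaction between a source-side language model and a prior-shaped emission model is the part of the approach the sketch is meant to show.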

