Phonology-Augmented Statistical Framework for Machine Transliteration using Limited Linguistic Resources

10/07/2018
by Gia H. Ngo, et al.

Transliteration converts words in a source language (e.g., English) into words in a target language (e.g., Vietnamese). This conversion must respect the phonological structure of the target language, because the transliterated output needs to be pronounceable in that language. For example, a Vietnamese word that begins with a consonant cluster is phonologically invalid and would therefore be an incorrect output of a transliteration system. Most statistical transliteration approaches, although widely adopted, do not explicitly model the target language's phonology, which often results in invalid outputs. The problem is compounded by the limited linguistic resources available for converting foreign words into transliterated words in the target language. In this work, we present a phonology-augmented statistical framework suitable for transliteration, especially when only limited linguistic resources are available. We propose the concept of pseudo-syllables as structures representing how segments of a foreign word are organized according to the syllables of the target language's phonology. We performed transliteration experiments on Vietnamese and Cantonese. We show that the proposed framework outperforms the statistical baseline by up to 44.68% when training examples are limited (587 entries).
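To make the consonant-cluster constraint concrete, here is a minimal sketch of an onset check over a list of phoneme symbols. It is an illustration only, not the framework described in the paper: the vowel inventory, the function names `onset_length` and `has_valid_onset`, and the `max_onset` parameter are all hypothetical, and real Vietnamese phonotactics involve more than this single rule.

```python
# Toy vowel inventory, assumed for illustration only.
# It is NOT the phoneme set used in the paper.
VOWELS = {"a", "e", "i", "o", "u"}


def onset_length(phonemes):
    """Count the consonants that precede the first vowel."""
    count = 0
    for p in phonemes:
        if p in VOWELS:
            break
        count += 1
    return count


def has_valid_onset(phonemes, max_onset=1):
    """Return True if the syllable starts with at most `max_onset` consonants.

    With max_onset=1, any syllable-initial consonant cluster is rejected.
    """
    return onset_length(phonemes) <= max_onset


if __name__ == "__main__":
    # "s t o p" starts with the cluster "st", so it fails the check.
    print(has_valid_onset(["s", "t", "o", "p"]))
    # "t o p" starts with a single consonant, so it passes.
    print(has_valid_onset(["t", "o", "p"]))
```

A check of this kind could serve as a filter on candidate outputs, whereas the abstract describes building phonological structure into the model itself through pseudo-syllables.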


