Learning to pronounce as measuring cross-lingual joint orthography-phonology complexity

01/29/2022
by Domenic Rosati, et al.

Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model's proficiency against each language's grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely, the complexity of a language's pronunciation from its orthography is due to the expressiveness or simplicity of its grapheme-to-phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
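
As a rough illustration of what measuring g2p proficiency against grapheme and phoneme inventories could look like, the sketch below computes inventory sizes and a phoneme error rate on a toy word list. The abstract does not state which metric the authors use, so the edit-distance-based error rate, the function names, and the two-word example data are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Sequence, Set, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Total edits divided by total reference phonemes (assumed metric)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return total_edits / total_len


def inventories(pairs: List[Tuple[str, List[str]]]) -> Tuple[Set[str], Set[str]]:
    """Collect the distinct graphemes and phonemes seen in the data."""
    graphemes = {ch for word, _ in pairs for ch in word}
    phonemes = {p for _, phones in pairs for p in phones}
    return graphemes, phonemes


# Hypothetical toy data: (spelling, reference phonemes) and model outputs.
toy_pairs = [("cat", ["k", "æ", "t"]), ("city", ["s", "ɪ", "t", "i"])]
toy_predictions = [["k", "æ", "t"], ["k", "ɪ", "t", "i"]]

graphemes, phonemes = inventories(toy_pairs)
per = phoneme_error_rate([p for _, p in toy_pairs], toy_predictions)
print(len(graphemes), len(phonemes), round(per, 3))
```

In this toy case the letter "c" maps to two different phonemes, which is the kind of one-to-many grapheme-to-phoneme mapping the abstract associates with harder languages.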


research
08/18/2021

Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

This paper studies the relative importance of attention heads in Transfo...
research
12/04/2022

Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer

Multi-lingual language models (LM), such as mBERT, XLM-R, mT5, mBART, ha...
research
09/23/2021

Cross-Lingual Language Model Meta-Pretraining

The success of pretrained cross-lingual language models relies on two es...
research
04/11/2019

Strong Baselines for Complex Word Identification across Multiple Languages

Complex Word Identification (CWI) is the task of identifying which words...
research
10/24/2019

Cross-Lingual Vision-Language Navigation

Vision-Language Navigation (VLN) is the task where an agent is commanded...
research
06/02/2023

Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are...
research
03/13/2023

Instate: Predicting the State of Residence From Last Name

India has twenty-two official languages. Serving such a diverse language...
