Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation

10/05/2022
by   Arturo Oncevay, et al.

Language modelling and machine translation tasks mostly use subword or character inputs, whereas syllables are seldom used. Syllables provide shorter sequences than characters, require less specialised extraction rules than morphemes, and their segmentation is not affected by corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. With a comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the impact of syllables on neural machine translation for an unrelated, low-resource language pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform both unsupervised subwords and morphological segmentation methods when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform a human evaluation and discuss limitations and opportunities.
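As a rough illustration of why syllable inputs are shorter than character inputs, the sketch below segments words with a toy rule. It is not the syllabification or hyphenation method used in the paper: the five-vowel set, the function name toy_syllabify and the onset/nucleus/coda rule are all assumptions made for this example, and the rule only behaves sensibly on simple lowercase words in a transparent orthography.

```python
import re

# Assumption: a five-vowel inventory; real languages need their own rules.
_SYLLABLE = re.compile(
    r"[^aeiou]*"              # onset: any consonants before the nucleus
    r"[aeiou]+"               # nucleus: one or more vowels (kept together)
    r"(?:[^aeiou](?![aeiou]))*"  # coda: consonants not followed by a vowel
)

def toy_syllabify(word):
    """Split a lowercase word into rough syllables (illustrative only)."""
    pieces = _SYLLABLE.findall(word)
    # Words without any vowel are returned as a single unit.
    return pieces if pieces else [word]

for w in ["banana", "canto"]:
    syllables = toy_syllabify(w)
    # Compare sequence lengths: characters versus toy syllables.
    print(w, list(w), syllables, len(w), len(syllables))
```

Under this toy rule, "banana" is six character tokens but three syllable tokens, and the segmentation depends only on the rule, not on how much training text is available.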


research
10/24/2020

Revisiting Neural Language Modelling with Syllables

Language modelling is regularly analysed at word, subword or character u...
research
09/09/2020

Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages

Machine translation tools do not yet exist for the Yup'ik language, a po...
research
10/05/2019

How Transformer Revitalizes Character-based Neural Machine Translation: An Investigation on Japanese-Vietnamese Translation Systems

While translating between Chinese-centric languages, many works have dis...
research
03/16/2022

BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Morphologically-rich polysynthetic languages present a challenge for NLP...
research
03/04/2020

Evaluating Low-Resource Machine Translation between Chinese and Vietnamese with Back-Translation

Back translation (BT) has been widely used and become one of standard te...
research
09/02/2017

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

Word segmentation plays a pivotal role in improving any Arabic NLP appli...
