Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages

09/09/2020
by Christopher Liu, et al.

No machine translation tools yet exist for Yup'ik, a polysynthetic language spoken by around 8,000 people who live primarily in Southwest Alaska. We compiled a parallel text corpus for Yup'ik and English and developed a rule-based morphological parser for Yup'ik. We trained a seq2seq neural machine translation model with attention to translate Yup'ik input into English. We then compared how different tokenization methods, namely rule-based parsing, unsupervised segmentation (byte pair encoding, BPE), and unsupervised morphological segmentation (Morfessor), affect BLEU scores for Yup'ik-to-English translation. We find that tokenized input yields higher translation accuracy than unparsed input. Although Morfessor did best overall with a vocabulary size of 30k, our first experiments show that BPE performed best with a reduced vocabulary size.
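To make the byte pair encoding option concrete, the following is a minimal sketch of how BPE merge rules can be learned and applied. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy English word list are all assumptions made for illustration. The number of merges is what controls the resulting subword vocabulary size.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of word strings (hypothetical helper)."""
    # Start from characters, with an end-of-word marker (an assumed convention).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy data, not drawn from the Yup'ik corpus.
toy_words = ["low", "low", "lower", "lowest"]
toy_merges = learn_bpe(toy_words, num_merges=2)
toy_segments = apply_bpe("lowest", toy_merges)
```

With only two merges the toy word "lowest" is still split into several pieces; allowing more merges yields longer subword units and a larger vocabulary, which is the trade-off the vocabulary-size comparison in the abstract concerns.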

research
10/05/2022

Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation

Language modelling and machine translation tasks mostly use subword or c...
research
08/16/2018

Augmenting Statistical Machine Translation with Subword Translation of Out-of-Vocabulary Words

Most statistical machine translation systems cannot translate words that...
research
10/06/2020

Converting the Point of View of Messages Spoken to Virtual Assistants

Virtual Assistants can be quite literal at times. If the user says "tell...
research
09/19/2023

NSOAMT – New Search Only Approach to Machine Translation

Translation automation mechanisms and tools have been developed for seve...
research
08/16/2021

Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages

We translate a closed text that is known in advance and available in man...
research
06/17/2021

Central Kurdish machine translation: First large scale parallel corpus and experiments

While the computational processing of Kurdish has experienced a relative...
