Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

04/12/2021
by Zhong Zhou, et al.

We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. Our contribution is four-fold. Firstly, we rank 124 source languages empirically to determine their closeness to the low resource language and select the top few. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on roughly 1,000 lines (roughly 3.5%) of low resource data. Using English as a hypothetical low resource language to translate from Spanish, we obtain a +24.7 BLEU increase over a multilingual baseline, and a +10.2 BLEU increase over our asymmetric baseline on the Bible dataset. Thirdly, we also use a real severely low resource Mayan language, Eastern Pokomchi. Finally, we add an order-preserving lexiconized component to translate named entities accurately. We build a massive lexicon table for 2,939 Bible named entities in 124 source languages; it includes many entities that occur only once and covers more than 66 severely low resource languages. Training on 1,093 randomly sampled lines of low resource data, we reach a 30.3 BLEU score for Spanish-English translation when testing on 30,022 lines of the Bible, and a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset.
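To make the idea of an order-preserving lexiconized component concrete, here is a minimal sketch in Python. It is not the authors' implementation. The placeholder format `__NE0__`, the toy two-entry lexicon, and the function names `lexiconize` and `delexiconize` are all assumptions made for illustration. The sketch replaces known named entities with numbered placeholders in order of appearance, so that a translation model could carry the placeholders through and a lexicon lookup could restore the entities in the target language.

```python
import re

# Hypothetical toy lexicon (an assumption, not the paper's table):
# entity id -> surface form per language code.
LEXICON = {
    "E1": {"es": "Juan", "en": "John"},
    "E2": {"es": "Pedro", "en": "Peter"},
}


def lexiconize(sentence, lang, lexicon=LEXICON):
    """Replace known entities with numbered placeholders, left to right.

    Returns the rewritten sentence and the list of entity ids, where the
    position in the list matches the placeholder number.
    """
    surface_to_id = {
        forms[lang]: eid for eid, forms in lexicon.items() if lang in forms
    }
    if not surface_to_id:
        return sentence, []
    # Try longer names first so a long name is not shadowed by a shorter one.
    names = sorted(surface_to_id, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    slots = []

    def to_placeholder(match):
        slots.append(surface_to_id[match.group(1)])
        return f"__NE{len(slots) - 1}__"

    return pattern.sub(to_placeholder, sentence), slots


def delexiconize(sentence, slots, lang, lexicon=LEXICON):
    """Replace each numbered placeholder with the entity's form in `lang`."""

    def to_surface(match):
        return lexicon[slots[int(match.group(1))]][lang]

    return re.sub(r"__NE(\d+)__", to_surface, sentence)


# Usage: the middle step stands in for a translation model's output.
tagged, slots = lexiconize("Pedro dijo a Juan", "es")
restored = delexiconize("__NE0__ said to __NE1__", slots, "en")
```

Numbering the placeholders by position rather than by entity id is one simple way to keep the correspondence between source and target mentions even when the same entity appears more than once in a sentence.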

