Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

by Zhong Zhou, et al.

We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. Our contribution is four-fold. Firstly, we rank 124 source languages empirically to determine their closeness to the low resource language and select the top few. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on about 1,000 lines (about 3.5%) of low resource data. Using English as a hypothetical low resource language to translate from Spanish, we obtain a +24.7 BLEU increase over a multilingual baseline, and a +10.2 BLEU increase over our asymmetric baseline in the Bible dataset. Thirdly, we also use a real severely low resource Mayan language, Eastern Pokomchi. Finally, we add an order-preserving lexiconized component to translate named entities accurately. We build a massive lexicon table for 2,939 Bible named entities in 124 source languages; the table includes many entities that occur only once and covers more than 66 severely low resource languages. Training on 1,093 randomly sampled lines of low resource data, we reach a 30.3 BLEU score for Spanish-English translation when testing on 30,022 lines of the Bible, and a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset.
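The order-preserving lexiconized component mentioned above can be pictured with a small sketch. This is not the authors' implementation: the toy `LEXICON`, the `__NE<i>__` placeholder format, and the function names `tag_entities` and `restore_entities` are all assumptions made for illustration. The idea shown is only that named entities are swapped for placeholders numbered by order of first appearance before translation, then filled back in from the lexicon table in the target language afterwards.

```python
import re

# Hypothetical toy lexicon table: entity id -> surface form per language code.
# The real table described in the abstract covers 2,939 entities in 124 languages.
LEXICON = {
    "E1": {"es": "Jesús", "en": "Jesus"},
    "E2": {"es": "Galilea", "en": "Galilee"},
}


def tag_entities(sentence, lang, lexicon=LEXICON):
    """Replace known entities with placeholders numbered by first appearance."""
    surface_to_id = {
        forms[lang]: eid for eid, forms in lexicon.items() if lang in forms
    }
    if not surface_to_id:
        return sentence, []
    # Try longer surface forms first so they win over shorter overlapping ones.
    # Note: this simple sketch matches substrings and ignores word boundaries.
    alternatives = sorted(surface_to_id, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in alternatives))
    order = []  # entity ids in order of first appearance

    def to_placeholder(match):
        eid = surface_to_id[match.group(0)]
        if eid not in order:
            order.append(eid)
        return f"__NE{order.index(eid)}__"

    return pattern.sub(to_placeholder, sentence), order


def restore_entities(sentence, order, lang, lexicon=LEXICON):
    """Fill placeholders back in with target-language forms from the lexicon."""

    def to_surface(match):
        idx = int(match.group(1))
        if idx < len(order) and lang in lexicon[order[idx]]:
            return lexicon[order[idx]][lang]
        return match.group(0)  # leave unknown placeholders untouched

    return re.sub(r"__NE(\d+)__", to_surface, sentence)


tagged, order = tag_entities("Jesús vino a Galilea", "es")
# Stand-in for the translation model's output on the tagged sentence.
translated = "__NE0__ came to __NE1__"
restored = restore_entities(translated, order, "en")
```

Because the placeholders carry their index, an entity that occurs only once in the training data can still be rendered correctly, as long as it has an entry in the lexicon table.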


