Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages

05/05/2023
by Zhong Zhou, et al.

In many humanitarian scenarios, translation into severely low-resource languages does not require a universal translation engine, but rather a dedicated, text-specific one. Healthcare records, hygiene procedures, government communications, emergency procedures, and religious texts are all examples of such limited texts. While a generic translation engine covering every language does not exist, translating a limited text that is already available in many languages into a new, endangered language may be feasible and can reduce human translation effort. We leverage translation resources from many rich-resource languages to produce the best possible translation of a well-known text, available in multiple languages, into a new, severely low-resource language. We examine two approaches: 1. selecting the seed sentences to be translated first in the new language so that they generalize best to the remainder of the targeted text(s), and 2. adapting large, general multilingual translation engines trained on many other languages to focus on the specific text in the new, unknown language. We find that adapting a large pretrained multilingual model first to the domain/text and then to the severely low-resource language works best. If we also select a good set of seed sentences, average chrF on new test languages improves from a baseline of 21.9 to 50.7, while the number of seed sentences needed in the new, unknown language is reduced to only around 1,000.
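The abstract does not specify how the seed sentences are chosen, so the sketch below only illustrates the general idea behind approach 1: greedily picking a small budget of sentences whose character n-grams best cover the rest of the known, closed text, so that human translations of those seeds generalize well to the remainder. The function names (select_seed_sentences, char_ngrams), the n-gram order, and the budget of roughly 1,000 seeds are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: greedy seed-sentence selection by character n-gram coverage.
# This is NOT the paper's algorithm; it only illustrates selecting ~1,000 seed
# sentences that cover the rest of a known, closed text as well as possible.

from collections import Counter
from typing import List, Set


def char_ngrams(sentence: str, n: int = 4) -> Set[str]:
    """Return the set of character n-grams in a sentence (hypothetical helper)."""
    s = sentence.strip()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def select_seed_sentences(corpus: List[str], budget: int = 1000, n: int = 4) -> List[str]:
    """Greedily pick up to `budget` sentences maximizing coverage of the corpus's n-grams."""
    per_sentence = [char_ngrams(sent, n) for sent in corpus]
    ngram_counts = Counter(g for grams in per_sentence for g in grams)

    selected, covered = [], set()
    candidates = set(range(len(corpus)))
    for _ in range(min(budget, len(corpus))):
        # Pick the sentence whose n-grams add the most yet-uncovered mass.
        best = max(candidates,
                   key=lambda i: sum(ngram_counts[g]
                                     for g in per_sentence[i] - covered))
        selected.append(corpus[best])
        covered |= per_sentence[best]
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy usage: in practice `corpus` would be the full known text in the source language.
    corpus = ["In the beginning was the Word.",
              "Wash your hands before eating.",
              "In case of fire, leave the building immediately."]
    print(select_seed_sentences(corpus, budget=2))
```

Coverage-based greedy selection is only one plausible criterion; diversity- or uncertainty-based active learning strategies could be substituted without changing the overall two-stage (domain/text first, then language) adaptation recipe described in the abstract.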


Related research

05/31/2022 · Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model
Numerous recent work on unsupervised machine translation (UMT) implies t...

06/06/2021 · The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
One of the biggest challenges hindering progress in low-resource and mul...

10/13/2022 · Bootstrapping Multilingual Semantic Parsers using Large Language Models
Despite cross-lingual generalization demonstrated by pre-trained multili...

04/12/2021 · Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation
We translate a closed text that is known in advance into a severely low ...

08/16/2021 · Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages
We translate a closed text that is known in advance and available in man...

10/20/2018 · Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages
Measuring the semantic similarity between two sentences (or Semantic Tex...

03/13/2020 · LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
Language recognition has been significantly advanced in recent years by ...
