MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

03/15/2021
by David I. Adelani, et al.

Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages under the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá–English (yo–en) language pair with standardized train–test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare them to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. When we use MENYO-20k to fine-tune generic models, we achieve major gains of +9.9 and +8.6 BLEU (en2yo) over Facebook's M2M-100 and Google's multilingual NMT, respectively.

