The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT

10/13/2020
by Jörg Tiedemann, et al.

This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, as well as tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages. Using the package, it is possible to work on realistic low-resource scenarios, avoiding the artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages, with systematic language and script annotation and data splits, to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.

research
03/15/2021

MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

Massively multilingual machine translation (MT) has shown impressive cap...
research
10/20/2022

SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

In recent years, multilingual machine translation models have achieved p...
research
05/27/2023

Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec ...
research
09/01/2021

Survey of Low-Resource Machine Translation

We present a survey covering the state of the art in low-resource machin...
research
07/31/2022

Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages

Translation Quality Estimation (QE) is the task of predicting the qualit...
research
08/11/2020

A parallel evaluation data set of software documentation with document structure annotation

This paper accompanies the software documentation data set for machine t...
research
01/05/2023

Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI

Detecting "toxic" language in internet content is a pressing social and ...
