Neural machine translation, corpus and frugality

01/26/2021
by Raoul Blin, et al.

In the field of machine translation, in both academia and industry, there is growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on the observation of a standard human professional translator, we estimate that the corpora should comprise at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.

research
06/06/2022

MorisienMT: A Dataset for Mauritian Creole Machine Translation

In this paper, we describe MorisienMT, a dataset for benchmarking machin...
research
11/03/2017

Towards Neural Machine Translation with Partially Aligned Corpora

While neural machine translation (NMT) has become the new paradigm, the ...
research
03/22/2021

Monolingual and Parallel Corpora for Kangri Low Resource Language

In this paper we present the dataset of Himachali low resource endangere...
research
04/09/2020

Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios

Unsupervised neural machine translation (UNMT) that relies solely on mas...
research
04/05/2020

Reference Language based Unsupervised Neural Machine Translation

Exploiting common language as an auxiliary for better translation has a ...
research
10/29/2017

JESC: Japanese-English Subtitle Corpus

In this paper we describe the Japanese-English Subtitle Corpus (JESC). J...
research
08/05/2020

Designing the Business Conversation Corpus

While the progress of machine translation of written text has come far i...