
Neural machine translation, corpus and frugality

01/26/2021
by   Raoul Blin, et al.

In the field of machine translation, both in academia and industry, there is growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we argue for developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on observation of a typical professional human translator, we estimate that such corpora should comprise at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.
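The corpus budgets quoted above can be summarized in a short sketch. The numbers come from the abstract; the dictionary structure and totals are purely illustrative, not code from the paper:

```python
# Corpus budgets proposed in the abstract (illustrative summary only).
# "Examples" are monolingual sentences; "bi-examples" are aligned sentence pairs.
budgets = {
    "preferred": {
        "source_monolingual": 75_000_000,  # examples, source language
        "target_monolingual": 6_000_000,   # examples, target language
        "aligned_bilingual": 6_000_000,    # bi-examples
    },
    "alternative": {
        "aligned_bilingual": 47_500_000,   # bi-examples only
    },
}

for name, parts in budgets.items():
    total = sum(parts.values())
    print(f"{name}: {total:,} examples/bi-examples in total")
```

Note that the preferred setup leans mostly on cheap monolingual data, while the alternative concentrates the entire budget in costlier aligned bi-examples.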

