
Neural machine translation, corpus and frugality

by   Raoul Blin, et al.

In the machine translation field, in both academia and industry, interest is growing in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on observation of a typical human professional translator, we estimate that the corpora should comprise at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.
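The corpus budgets above can be restated as simple arithmetic. The sketch below is purely illustrative: the dictionary keys and variable names are my own, not the paper's; only the numbers come from the abstract.

```python
# Corpus budgets proposed in the abstract, in numbers of examples.
# (Illustrative restatement only; structure and names are assumptions.)
frugal_budget = {
    "monolingual_source": 75_000_000,   # source-language monolingual sub-corpus
    "monolingual_target": 6_000_000,    # target-language monolingual sub-corpus
    "aligned_bilingual": 6_000_000,     # aligned bilingual sub-corpus (bi-examples)
}

# Less desirable alternative: a single aligned bilingual corpus.
alternative_budget = {"aligned_bilingual": 47_500_000}

total_frugal = sum(frugal_budget.values())
print(total_frugal)                      # 87,000,000 examples in total
print(alternative_budget["aligned_bilingual"])
```

Note the contrast with state-of-the-art systems: even the larger alternative budget stays well under the hundreds of millions to billions of examples cited for current large-scale systems.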



