Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

04/21/2021
by Tim Isbister, et al.

Most work in NLP assumes that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development and explores the idea of simply translating the data into English, thereby enabling the use of large-scale pretrained English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument to training native language models for low-resource languages. This paper therefore strives to make a provocative but important point. As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is from an empirical and environmental standpoint more effective to translate data from low-resource languages into English than to build language models for such languages.


12/28/2021

A Preordered RNN Layer Boosts Neural Machine Translation in Low Resource Settings

Neural Machine Translation (NMT) models are strong enough to convey sema...
09/26/2021

Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation

In recent times, there has been definitive progress in the field of NLP,...
03/24/2016

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL

This work examines the impact of cross-linguistic transfer on grammatica...
04/30/2020

Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages

The large annotated datasets in NLP are overwhelmingly in English. This ...
09/30/2021

Prose2Poem: The Blessing of Transformers in Translating Prose to Persian Poetry

Persian Poetry has consistently expressed its philosophy, wisdom, speech...
01/10/2017

Bidirectional American Sign Language to English Translation

We outline a bidirectional translation system that converts sentences fr...
10/09/2016

Enabling Medical Translation for Low-Resource Languages

We present research towards bridging the language gap between migrant wo...

Code Repositories

ScandiSent

Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴󠁧󠁢󠁥󠁮󠁧󠁿)

