The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages

04/19/2023
by Vesa Akerman, et al.

Efficiently and accurately translating a corpus into a low-resource language remains a challenge, whether the strategy is manual, automated, or a combination of the two. Many Christian organizations are dedicated to translating the Holy Bible into languages that lack a modern translation. Bible translation (BT) work is currently underway for over 3000 extremely low-resource languages. We introduce the eBible corpus: a dataset containing 1009 translations of portions of the Bible with data in 833 different languages across 75 language families. In addition to a BT benchmarking dataset, we introduce model performance benchmarks built on the No Language Left Behind (NLLB) neural machine translation (NMT) models. Finally, we describe several problems specific to the domain of BT and consider how the established data and model benchmarks might be used for future translation efforts. For a BT task trained with NLLB, the Austronesian and Trans-New Guinea language families achieve BLEU scores of 35.1 and 31.6, respectively, a result that may spur further innovation in NMT for low-resource languages in Papua New Guinea.


