Revisiting Low Resource Status of Indian Languages in Machine Translation

08/11/2020
by Jerin Philip, et al.

Machine translation for Indian languages is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we present and analyse an automated framework for obtaining such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, which together operate on publicly available web sources such as government press releases. Our main contribution is an incremental method that uses this pipeline to iteratively grow the corpus and, in turn, improve each component of the system. We also evaluate design choices such as the choice of pivot language and the effect of iteratively increasing the corpus size. In addition to the automated framework, our work yields a corpus that is larger than existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and on other standard evaluation benchmarks.
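The iterative loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the names `grow_corpus`, `toy_train`, and `toy_align` are hypothetical, the "NMT system" is replaced by a word-level lexicon, alignment is a simple position-wise token match, and the retrieval step is assumed to have already paired source and target documents.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]            # (source sentence, target sentence)
Translator = Callable[[str], str]


def toy_train(corpus: Set[Pair]) -> Translator:
    """Stand-in for NMT training (assumption): build a word-level lexicon."""
    lexicon: Dict[str, str] = {}
    for src, tgt in corpus:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):       # only learn from equal-length pairs
            lexicon.update(zip(s, t))
    # Unknown words are copied through unchanged.
    return lambda sent: " ".join(lexicon.get(w, w) for w in sent.split())


def toy_align(translate: Translator, src_sents: List[str],
              tgt_sents: List[str], threshold: float = 0.5) -> Set[Pair]:
    """Stand-in for the alignment module (assumption): pair each source
    sentence with the target sentence its translation matches best."""
    pairs: Set[Pair] = set()
    for src in src_sents:
        hyp = translate(src).split()
        best, best_score = None, 0.0
        for tgt in tgt_sents:
            ref = tgt.split()
            matches = sum(h == r for h, r in zip(hyp, ref))
            score = matches / max(len(hyp), len(ref), 1)
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.add((src, best))
    return pairs


def grow_corpus(seed: Set[Pair],
                documents: List[Tuple[List[str], List[str]]],
                train: Callable[[Set[Pair]], Translator] = toy_train,
                align: Callable[..., Set[Pair]] = toy_align,
                max_rounds: int = 5) -> Set[Pair]:
    """Retrain, re-align, and add newly mined pairs until nothing new appears."""
    corpus = set(seed)
    for _ in range(max_rounds):
        translate = train(corpus)              # retrain on the current corpus
        mined: Set[Pair] = set()
        for src_sents, tgt_sents in documents:  # documents assumed pre-paired
            mined |= align(translate, src_sents, tgt_sents)
        new_pairs = mined - corpus
        if not new_pairs:                       # converged: no new sentence pairs
            break
        corpus |= new_pairs
    return corpus


seed = {("a b", "x y")}
docs = [(["a c", "c d"], ["x z", "z w"])]
print(sorted(grow_corpus(seed, docs)))
```

In this toy run the pair `("c d", "z w")` can only be aligned in the second round, after the first round has added `("a c", "x z")` and the retrained lexicon has learned `c -> z`. That is the kind of feedback between corpus size and component quality the abstract refers to, shown here only in miniature.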


