Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages

11/21/2018
by   Saurav Jha, et al.

Out-of-vocabulary (OOV) words pose serious challenges for machine translation (MT) tasks, particularly for low-resource languages (LRLs). This paper adapts variants of seq2seq models to transduce such words from Hindi to Bhojpuri (an instance of an LRL), learning from a set of cognate pairs built from a bilingual dictionary of Hindi-Bhojpuri words. We demonstrate that our models can be used effectively for languages with limited parallel corpora: by working at the character level, they capture phonetic and orthographic similarities across multiple types of word adaptation, whether synchronic or diachronic, loan words or cognates. We provide a comprehensive overview of the training aspects of character-level NMT systems adapted to this task, combined with a detailed analysis of their respective error cases. Using our method, we achieve an improvement of over 6 BLEU points on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions generalize well to other languages by applying them successfully to Hindi-Bangla cognate pairs. Our work can be seen as an important step toward: (i) resolving the OOV word problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii) transferring the enhanced semantic knowledge captured by word-level embeddings to character-level tasks.
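As a rough illustration of what working at the character level means for word transduction, the sketch below turns word pairs into padded sequences of character IDs, the kind of input a seq2seq encoder-decoder typically consumes. It is not the authors' preprocessing pipeline: the helper names (`to_chars`, `build_vocab`, `encode`), the special tokens, and the word pairs are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Special symbols are an assumption of this sketch, not taken from the paper.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def to_chars(word: str) -> List[str]:
    """Split a word into Unicode code points, wrapped in start/end markers.

    For Devanagari input, vowel signs (matras) are separate code points,
    so they become separate symbols here.
    """
    return [SOS] + list(word) + [EOS]


def build_vocab(words: List[str]) -> Dict[str, int]:
    """Map each symbol to an integer ID; special tokens take IDs 0, 1, 2."""
    symbols = sorted({ch for w in words for ch in w})
    return {sym: i for i, sym in enumerate([PAD, SOS, EOS] + symbols)}


def encode(word: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a word to IDs and right-pad it to max_len."""
    ids = [vocab[s] for s in to_chars(word)]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Made-up placeholder pairs (source word, target word); NOT real cognates
# and NOT drawn from the paper's Hindi-Bhojpuri dictionary.
pairs: List[Tuple[str, str]] = [("kaba", "kabo"), ("tira", "tiro")]

src_vocab = build_vocab([s for s, _ in pairs])
tgt_vocab = build_vocab([t for _, t in pairs])
max_len = max(len(w) for pair in pairs for w in pair) + 2  # room for SOS and EOS

batch = [
    (encode(s, src_vocab, max_len), encode(t, tgt_vocab, max_len))
    for s, t in pairs
]
```

Because the vocabulary consists of characters rather than whole words, a word never seen in training can still be encoded as long as its characters are known, which is why a character-level model sidesteps the OOV problem at the input side.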


research
12/16/2020

Improving Multilingual Neural Machine Translation For Low-Resource Languages: French-, English- Vietnamese

Prior works have demonstrated that a low-resource language pair can bene...
research
04/05/2018

Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts

Although there are increasing and significant ties between China and Por...
research
08/16/2018

Augmenting Statistical Machine Translation with Subword Translation of Out-of-Vocabulary Words

Most statistical machine translation systems cannot translate words that...
research
11/30/2021

Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages

We conduct an empirical study of neural machine translation (NMT) for tr...
research
07/31/2023

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Sub-word segmentation is an essential pre-processing step for Neural Mac...
research
09/17/2017

Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Word discovery is the task of extracting words from unsegmented text. In...
research
09/06/2018

Character-Aware Decoder for Neural Machine Translation

Standard neural machine translation (NMT) systems operate primarily on w...
