Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation

11/05/2020
by Aloka Fernando, et al.

Out-of-vocabulary (OOV) words are a problem for Machine Translation (MT) of low-resource languages, and the problem worsens when the source and/or target languages are morphologically rich. Bilingual list integration addresses the OOV problem by allowing words that are absent from the training data to be translated. However, because bilingual lists contain words in their base form, this approach does not translate the inflected forms found in morphologically rich languages such as Sinhala and Tamil. This paper focuses on data augmentation techniques in which bilingual lexicon terms are expanded based on case markers, with the objective of generating new words for use in Statistical Machine Translation (SMT). This data augmentation technique for dictionary terms yields improved BLEU scores for Sinhala-English SMT.
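
The case-marker expansion described in the abstract can be pictured roughly as follows. This is only a minimal sketch and not the authors' implementation: the `CASE_MARKERS` table, the `+DAT`/`+GEN`/`+ABL` suffixes, the English preposition templates, and the function names `expand_entry` and `augment_lexicon` are all hypothetical placeholders chosen for illustration, and they do not represent real Sinhala or Tamil morphology.

```python
from typing import Dict, List, Tuple

# Hypothetical case-marker table: case name -> (source suffix, target template).
# The suffixes are placeholders, NOT real Sinhala or Tamil case markers.
CASE_MARKERS: Dict[str, Tuple[str, str]] = {
    "dative": ("+DAT", "to {}"),
    "genitive": ("+GEN", "of {}"),
    "ablative": ("+ABL", "from {}"),
}


def expand_entry(src_base: str, tgt_base: str) -> List[Tuple[str, str]]:
    """Return the base pair plus one synthetic pair per case marker."""
    pairs = [(src_base, tgt_base)]
    for suffix, template in CASE_MARKERS.values():
        # Attach the (placeholder) suffix on the source side and wrap the
        # target term in the matching (assumed) English template.
        pairs.append((src_base + suffix, template.format(tgt_base)))
    return pairs


def augment_lexicon(lexicon: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Expand every base-form entry of a bilingual list."""
    augmented: List[Tuple[str, str]] = []
    for src, tgt in lexicon:
        augmented.extend(expand_entry(src, tgt))
    return augmented


if __name__ == "__main__":
    # "SRCWORD" stands in for a base-form source-language dictionary term.
    for pair in augment_lexicon([("SRCWORD", "school")]):
        print(pair)
```

In a setup like this, the augmented pairs would be added to the bilingual list alongside the original base-form entries, so that inflected forms unseen in the parallel corpus can still be matched.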


