Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

07/01/2022
by Mukhlis Amien, et al.

Indonesian is an agglutinative language: its words are formed through a compounding process. A translation model for this language therefore requires a mechanism that operates below the word level, referred to as the sub-word level. The compounding process leads to a rare word problem because the vocabulary size explodes. We propose a strategy to address the unique word problem of neural machine translation (NMT) systems in which Indonesian is one of the paired languages. Our approach uses a rule-based method to split a word into its root and accompanying affixes while retaining its meaning and context. A rule-based algorithm has a further advantage: it requires no corpus data and only applies the standard rules of Indonesian. Our experiments confirm that this method is practical. It reduces vocabulary size by up to 57%, and on English-to-Indonesian translation it yields an improvement of up to 5 BLEU points over a comparable NMT system that does not use this technique.
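To make the idea of rule-based root and affix separation concrete, here is a minimal Python sketch. It is not the authors' algorithm: the prefix list, suffix list, root lexicon, the `split_word` function, and the `@@` joining marker are all hypothetical choices made for illustration, and real Indonesian morphology (for example, sound changes at prefix boundaries) needs many more rules.

```python
# Illustrative only: tiny, hypothetical inventories, not the paper's rule set.
PREFIXES = ["meng", "mem", "men", "me", "ber", "di", "ter", "pe", "ke", "se"]
SUFFIXES = ["kan", "nya", "lah", "an", "i"]
ROOTS = {"baca", "main", "ajar", "makan", "jalan", "buku"}


def split_word(word):
    """Split a word into [prefix@@, root, @@suffix] pieces.

    A split is accepted only if the remaining root is in ROOTS;
    otherwise the word is returned unchanged as a single token.
    """
    word = word.lower()
    if word in ROOTS:
        return [word]
    # Try "no prefix" first, then each known prefix in list order.
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        # Try "no suffix" first, then each known suffix in list order.
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            root = rest[:len(rest) - len(suffix)] if suffix else rest
            if root in ROOTS:
                parts = []
                if prefix:
                    parts.append(prefix + "@@")
                parts.append(root)
                if suffix:
                    parts.append("@@" + suffix)
                return parts
    return [word]


if __name__ == "__main__":
    for w in ["membaca", "dibacakan", "bermain", "makanan", "bukunya"]:
        print(w, "->", " ".join(split_word(w)))
```

Because the sketch relies only on fixed affix lists and a root check, it needs no training corpus, which mirrors the advantage the abstract claims for a rule-based approach. The markers on each affix keep the split reversible, so the original word can be restored after translation.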


