Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

01/02/2020
by   Yirong Pan, et al.
Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, a limited-size vocabulary containing only the top-N highest-frequency words is employed for model training, which leaves many rare and unknown words. This problem is especially severe when translating from low-resource, morphologically rich agglutinative languages, which have complex morphology and large vocabularies. In this paper, we propose a source-side morphological word segmentation method for NMT that incorporates morphological knowledge to preserve the linguistic and semantic information in word structure while reducing the vocabulary size at training time. It can also be used as a preprocessing tool to segment words in agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model and achieves significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks by reducing data sparseness and language complexity.
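To make the vocabulary argument concrete, the following is a minimal sketch of why splitting agglutinative words into a stem plus suffix tokens can shrink the vocabulary and lower the unknown-word rate under a top-N cutoff. It is not the paper's segmentation method: the suffix list, the greedy right-to-left stripping rule, the `##` suffix marker, the toy Turkish-like word list, and all function names are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy suffix inventory (Turkish-like); NOT the paper's analyzer.
SUFFIXES = ["ler", "lar", "de", "da", "im"]


def segment(word, suffixes=SUFFIXES):
    """Greedily strip known suffixes from the right: stem + marked suffix tokens."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        # Try longer suffixes first so "ler" is preferred over shorter matches.
        for suf in sorted(suffixes, key=len, reverse=True):
            # Assumption: keep a stem of at least two characters.
            if word.endswith(suf) and len(word) - len(suf) >= 2:
                parts.insert(0, "##" + suf)  # "##" marks a suffix token
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + parts


def top_n_vocab(tokens, n):
    """Return the set of the n most frequent tokens."""
    return {tok for tok, _ in Counter(tokens).most_common(n)}


def unk_rate(tokens, vocab):
    """Fraction of tokens that fall outside the vocabulary."""
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy corpus: ten distinct surface forms built from three stems.
words = ["evler", "evlerde", "evde", "evim", "evlerim",
         "kitaplar", "kitaplarda", "okullar", "okulda", "okullarda"]
subwords = [tok for w in words for tok in segment(w)]

word_types = set(words)
subword_types = set(subwords)

# Same vocabulary budget (N = 5) for both representations.
word_unk = unk_rate(words, top_n_vocab(words, 5))
subword_unk = unk_rate(subwords, top_n_vocab(subwords, 5))

print(segment("evlerde"))  # ['ev', '##ler', '##de']
print(len(word_types), len(subword_types))
print(word_unk, subword_unk)
```

On this toy data the segmented corpus has fewer distinct token types than the word-level corpus, and the same top-N budget covers a larger share of its tokens. A real system would replace the hand-written suffix list with a proper morphological analyzer for the language in question.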

research
10/07/2019

Overcoming the Rare Word Problem for Low-Resource Language Pairs in Neural Machine Translation

Among the six challenges of neural machine translation (NMT) coined by (...
research
07/31/2017

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

The necessity of using a fixed-size word vocabulary in order to control ...
research
01/11/2018

Improved English to Russian Translation by Neural Suffix Prediction

Neural machine translation (NMT) suffers a performance deficiency when a...
research
03/14/2021

Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

Building effective neural machine translation (NMT) models for very low-...
research
09/02/2021

How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Data-driven subword segmentation has become the default strategy for ope...
research
10/07/2016

Morphology Generation for Statistical Machine Translation using Deep Learning Techniques

Morphology in unbalanced languages remains a big challenge in the contex...
research
06/14/2018

Morphological and Language-Agnostic Word Segmentation for NMT

The state of the art of handling rich morphology in neural machine trans...
