Morphological and Language-Agnostic Word Segmentation for NMT

06/14/2018
by Dominik Macháček, et al.

The state of the art in handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits imposed by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and one novel method based on a derivational dictionary. Our experiments with German-to-Czech translation, where both languages are morphologically rich, show that so far the linguistically uninformed methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple preprocessing step for BPE that considerably increases translation quality as evaluated by automatic measures.
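To make the idea of subword construction concrete, the following is a minimal sketch of how BPE-style merges can be learned from word frequencies and then applied to segment a word. It is an illustration only, not the authors' implementation, and it does not reproduce the preprocessing step mentioned in the abstract. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy corpus (hypothetical frequencies).
    freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(freqs, 3)
    print(merges)
    print(segment("newest", merges))
    print(segment("best", merges))
```

The number of merges controls the vocabulary size: more merges yield longer, more word-like units, while fewer merges keep the vocabulary small at the cost of longer sequences. Note that the units are chosen purely by frequency, which is why such methods are described as linguistically uninformed.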

research
07/31/2017

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

The necessity of using a fixed-size word vocabulary in order to control ...
research
01/02/2020

Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Neural machine translation (NMT) has achieved impressive performance on ...
research
10/30/2019

A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

Translation into morphologically-rich languages challenges neural machin...
research
08/16/2019

Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring

This paper describes CAiRE's submission to the unsupervised machine tran...
research
01/11/2018

Improved English to Russian Translation by Neural Suffix Prediction

Neural machine translation (NMT) suffers a performance deficiency when a...
research
03/25/2022

Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies

Morphologically rich languages pose difficulties to machine translation....
