Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

04/29/2018
by Taku Kudo

Subword units are an effective way to alleviate the open-vocabulary problem in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous, and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether this segmentation ambiguity can be harnessed as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements, especially in low-resource and out-of-domain settings.
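
The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the vocabulary `TOY_VOCAB`, its probabilities, the helper names, and the smoothing exponent `alpha` are all assumptions made for illustration. The sketch enumerates every segmentation of a word under a toy unigram language model, scores each one as the product of its piece probabilities, and then either picks the single best segmentation (the usual deterministic behavior) or samples one in proportion to its score (the regularized behavior).

```python
import math
import random

# Hypothetical toy unigram vocabulary: subword -> probability.
# The values are illustrative only and are not taken from the paper.
TOY_VOCAB = {
    "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
    "he": 0.10, "ll": 0.10, "el": 0.05, "lo": 0.10,
    "hel": 0.10, "llo": 0.10, "hell": 0.10, "hello": 0.15,
}


def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def log_prob(pieces, vocab):
    """Unigram log-probability: sum of the log-probabilities of the pieces."""
    return sum(math.log(vocab[p]) for p in pieces)


def best_segmentation(word, vocab):
    """Deterministic baseline: the single most probable segmentation."""
    return max(segmentations(word, vocab), key=lambda c: log_prob(c, vocab))


def sample_segmentation(word, vocab, alpha=1.0, rng=random):
    """Sample one segmentation with weight P(segmentation) ** alpha.

    `alpha` is an assumed smoothing knob: smaller values flatten the
    distribution, larger values concentrate it on the best segmentation.
    """
    candidates = segmentations(word, vocab)
    if not candidates:
        raise ValueError(f"no segmentation of {word!r} under this vocabulary")
    weights = [math.exp(alpha * log_prob(c, vocab)) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    print("best:", best_segmentation("hello", TOY_VOCAB))
    for _ in range(5):
        print("sampled:", sample_segmentation("hello", TOY_VOCAB, alpha=0.5))
```

During training, a model would see a freshly sampled segmentation of each sentence at every step, while inference would typically use the best segmentation. Exhaustive enumeration grows exponentially with word length, so it is only suitable for a toy example; a practical implementation would rely on dynamic programming over the segmentation lattice instead.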

research
04/29/2020

Adversarial Subword Regularization for Robust Neural Machine Translation

Exposing diverse subword segmentations to neural machine translation (NM...
research
09/17/2019

Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation

Neural machine translation (NMT) systems require large amounts of high q...
research
11/08/2019

Domain Robustness in Neural Machine Translation

Translating text that diverges from the training domain is a key challen...
research
10/29/2019

BPE-Dropout: Simple and Effective Subword Regularization

Subword segmentation is widely used to address the open vocabulary probl...
research
10/19/2018

Optimizing Segmentation Granularity for Neural Machine Translation

In neural machine translation (NMT), it has become standard to transl...
research
10/30/2019

A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

Translation into morphologically-rich languages challenges neural machin...
research
11/04/2022

Dealing with Abbreviations in the Slovenian Biographical Lexicon

Abbreviations present a significant challenge for NLP systems because th...
