LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation

02/28/2022
by Keita Nonaka, et al.

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing step for training data in Neural Machine Translation (NMT). Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, compression-based approaches have a drawback: because they are deterministic, generating multiple segmentations is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm called locally-consistent parsing (LCP), which has been applied to achieve optimal compression. Exploiting the probabilistic mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
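To make the determinism point concrete, the sketch below illustrates the BPE-dropout idea that LCP-dropout builds on: standard BPE always applies the highest-priority merge, so a word segments the same way every time, while randomly skipping candidate merges with probability `p_drop` yields multiple segmentations of the same word. This is a minimal illustration with a toy merge list, not the authors' LCP-dropout algorithm; the function name and merge format are assumptions.

```python
import random

def bpe_dropout_segment(word, merges, p_drop=0.1, rng=None):
    """Segment `word` using a learned BPE merge list, skipping each
    candidate merge with probability p_drop (the BPE-dropout idea).
    With p_drop=0 this reduces to deterministic BPE; with p_drop>0
    the same word can yield different segmentations across calls."""
    rng = rng or random.Random()
    symbols = list(word)
    # Earlier merges in the list have higher priority (lower rank).
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Collect applicable adjacent pairs, randomly dropping some.
        pairs = [(rank[(a, b)], i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                 if (a, b) in rank and rng.random() >= p_drop]
        if not pairs:
            break
        # Apply the surviving merge with the best (lowest) rank.
        _, i = min(pairs)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With `p_drop=0.0` the output is the single deterministic BPE segmentation; sampling with `p_drop>0` during training exposes the model to several segmentations of each word, which is the regularization effect the abstract refers to.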


