SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

07/31/2023
by Haiyue Song et al.

Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters outperform Byte-Pair Encoding (BPE); however, they are inefficient, requiring parallel corpora, days to train, and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated with a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate multiple segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle-, and high-resource scenarios, comparing the performance of different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves an improvement of more than 1.2 BLEU points over BPE and SentencePiece, and of 1.1 points over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization mechanism achieves approximately 4.3 BLEU points of improvement over BPE and 1.2 BLEU points over BPE-dropout, the regularized version of BPE. We also observe significant improvements on the IWSLT15 Vi->En, WMT16 Ro->En, and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.
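The abstract does not spell out the decoding step, so the following is a minimal sketch of the Viterbi-style dynamic programming that selects the segmentation with the highest total log-probability. The `subword_logprob` scorer and the `TOY_VOCAB` table are hypothetical placeholders invented for illustration; in SelfSeg the per-subword scores would come from the self-supervised neural segmenter trained on partially masked character sequences, not from a fixed table.

```python
# Hedged sketch of maximum-probability segmentation via dynamic programming.
# The scorer and vocabulary below are toy placeholders, not the paper's model.
import math

# Hypothetical toy vocabulary with made-up probabilities (illustration only).
TOY_VOCAB = {
    "un": 0.05, "related": 0.04, "relate": 0.03, "d": 0.02,
    "u": 0.01, "n": 0.01, "r": 0.01, "e": 0.01, "l": 0.01,
    "a": 0.01, "t": 0.01,
}

def subword_logprob(piece: str) -> float:
    """Placeholder scorer; SelfSeg would query its neural segmenter instead."""
    p = TOY_VOCAB.get(piece)
    return math.log(p) if p else float("-inf")

def best_segmentation(word: str, max_len: int = 8):
    """Viterbi-style DP: best[i] = max log-prob of any segmentation of word[:i]."""
    n = len(word)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + subword_logprob(word[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the best segmentation by walking the back-pointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces)), best[n]

print(best_segmentation("unrelated"))  # -> (['un', 'related'], <log-prob>)
```

A regularized variant, as described in the abstract, would sample among several high-scoring segmentations rather than always returning the single argmax.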

Related research

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation (05/03/2020)
Incorporating Discrete Translation Lexicons into Neural Machine Translation (06/07/2016)
Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation (04/05/2020)
Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages (11/21/2018)
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task (08/12/2020)
BPE-Dropout: Simple and Effective Subword Regularization (10/29/2019)
GEO-BLEU: Similarity Measure for Geospatial Sequences (12/14/2021)
