Finding Better Subword Segmentation for Neural Machine Translation

07/25/2018
by   Yingting Wu, et al.
0

For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice usually replaces all these rare or unknown words with a <UNK> token, which limits the translation performance to some extent. Most of recent work handled such a problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one of the successful attempts that has been shown extremely competitive by providing effective subword segmentation for NMT systems. In this paper, we extend the BPE style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that AV and DLG enhanced systems outperform the FRQ baseline in the frequency weighted schemes at different significant levels.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/31/2015

Neural Machine Translation of Rare Words with Subword Units

Neural machine translation (NMT) models typically operate with a fixed v...
research
08/10/2022

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Neural Machine Translation (NMT) is an open vocabulary problem. As a res...
research
10/19/2018

Optimizing Segmentation Granularity for Neural Machine Translation

In neural machine translation (NMT), it is has become standard to transl...
research
05/25/2018

Japanese Predicate Conjugation for Neural Machine Translation

Neural machine translation (NMT) has a drawback in that can generate onl...
research
06/02/2023

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Subword tokenization is the de facto standard for tokenization in neural...
research
03/21/2020

A Joint Approach to Compound Splitting and Idiomatic Compound Detection

Applications such as machine translation, speech recognition, and inform...
research
12/20/2018

How Much Does Tokenization Affect Neural Machine Translation?

Tokenization or segmentation is a wide concept that covers simple proces...

Please sign up or login with your details

Forgot password? Click here to reset