How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

08/10/2022
by   Ali Araabi, et al.
0

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) have long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE) which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word-level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named-entities and when the languages involved are linguistically close to each other.

READ FULL TEXT

page 7

page 9

research
07/25/2018

Finding Better Subword Segmentation for Neural Machine Translation

For different language pairs, word-level neural machine translation (NMT...
research
08/31/2015

Neural Machine Translation of Rare Words with Subword Units

Neural machine translation (NMT) models typically operate with a fixed v...
research
12/20/2018

How Much Does Tokenization Affect in Neural Machine Translation?

Tokenization or segmentation is a wide concept that covers simple proces...
research
12/20/2018

How Much Does Tokenization Affect Neural Machine Translation?

Tokenization or segmentation is a wide concept that covers simple proces...
research
03/14/2021

Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

Building effective neural machine translation (NMT) models for very low-...
research
09/04/2017

Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some Practical Aspects

This paper describes a preliminary study for producing and distributing ...
research
06/02/2023

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Subword tokenization is the de facto standard for tokenization in neural...

Please sign up or login with your details

Forgot password? Click here to reset