Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

06/02/2023
by Benoist Wolleb, et al.

Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and the ability to deal with unknown words. As their relative importance is not yet entirely clear, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality. The approach uses Huffman coding to tokenize words, by order of frequency, using a fixed number of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for 90%-95% of the scores reached by BPE, hence compositionality has less importance than previously thought.
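To make the idea of frequency-only tokenization concrete, the following is a minimal sketch of n-ary Huffman coding over whole words. It is not the authors' implementation: the function name `huffman_codes`, the `n_symbols` parameter, and the toy corpus are assumptions made for illustration, and the abstract does not specify how the paper handles ties, padding, or unknown words. Each word receives a sequence of symbol ids drawn from a fixed alphabet, with frequent words getting shorter sequences, and the resulting symbols carry no information about the word's spelling.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs, n_symbols):
    """Map each word to a tuple of symbol ids using n-ary Huffman coding.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if n_symbols < 2:
        raise ValueError("need at least two symbols")
    if not freqs:
        return {}
    tie = count()  # tie-breaker so the heap never compares node payloads
    # Heap entries are (frequency, tie-breaker, node).
    # A node is a word (leaf), a list of child nodes, or None (padding).
    heap = [(f, next(tie), w) for w, f in freqs.items()]
    # Pad with zero-frequency dummies so every merge takes exactly n_symbols nodes.
    while len(heap) > 1 and (len(heap) - 1) % (n_symbols - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(n_symbols)]
        total = sum(f for f, _, _ in children)
        heapq.heappush(heap, (total, next(tie), [node for _, _, node in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):
            for symbol, child in enumerate(node):
                walk(child, prefix + (symbol,))
        elif node is not None:
            # A single-word vocabulary still needs a non-empty code.
            codes[node] = prefix or (0,)

    walk(heap[0][2], ())
    return codes


if __name__ == "__main__":
    # Toy corpus, assumed for illustration only.
    corpus = "the the the the cat cat sat on".split()
    codes = huffman_codes(Counter(corpus), n_symbols=2)
    for word in sorted(codes, key=lambda w: len(codes[w])):
        print(word, codes[word])
```

Because the codes are prefix-free, a stream of symbols can be decoded back into words unambiguously. Raising `n_symbols` shortens the codes at the cost of a larger symbol vocabulary, which is the same trade-off a BPE merge budget controls.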


