VOLT: Improving Vocabularization via Optimal Transport for Machine Translation

12/31/2020
by Jingjing Xu, et al.

It is well accepted that the choice of token vocabulary largely affects the performance of machine translation. However, due to expensive trial costs, most studies only conduct simple trials with dominant approaches (e.g., BPE) and commonly used vocabulary sizes. In this paper, we find an exciting relation between an information-theoretic feature and BLEU scores. With this observation, we formulate the quest of vocabularization – finding the best token dictionary with a proper size – as an optimal transport problem. We then propose VOLT, a simple and efficient vocabularization solution without the full and costly trial training. We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. Empirical results show that VOLT beats widely-used vocabularies in diverse scenarios. For example, VOLT achieves almost 70% vocabulary size reduction and 0.5 BLEU gain on English-German translation. Another advantage of VOLT lies in its low resource consumption: compared to naive BPE search, VOLT reduces the search time from 384 GPU hours to 30 GPU hours.
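The information-theoretic feature behind VOLT is, roughly, how much entropy a vocabulary packs per character of text: a good vocabulary lowers the character-normalized entropy of the tokenized corpus without growing too large. The sketch below is a minimal toy illustration of that quantity, not the paper's actual optimal-transport algorithm; the function name `corpus_entropy` and the toy corpus are invented for illustration, and frequency-estimation details from the paper are ignored.

```python
import math
from collections import Counter

def corpus_entropy(tokens):
    """Character-normalized Shannon entropy of a tokenized corpus.

    In VOLT's information-theoretic view, vocabularies that lower this
    per-character entropy tend to correlate with better BLEU (this toy
    omits the paper's marginal-utility and transport-matrix machinery).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # Average token length in characters, weighted by frequency.
    avg_len = sum(len(t) * c for t, c in counts.items()) / total
    # Shannon entropy over the token distribution, in bits.
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / avg_len

# Toy comparison: the same string tokenized at character level
# versus with a small subword vocabulary.
char_level = list("lowlowerlowest")
subword = ["low", "low", "er", "low", "est"]
print(corpus_entropy(char_level), corpus_entropy(subword))
```

Sweeping such a score over candidate vocabulary sizes, and looking at its marginal change as the vocabulary grows, is the intuition the paper formalizes as an optimal transport problem.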


