A Systematic Analysis of Vocabulary and BPE Settings for Optimal Fine-tuning of NMT: A Case Study of In-domain Translation

The effectiveness of Neural Machine Translation (NMT) models largely depends on the vocabulary used during training; small vocabularies can lead to out-of-vocabulary problems, while large ones can cause memory issues. Subword (SW) tokenization has been successfully employed to mitigate both. The choice of vocabulary and SW tokenization has a significant impact on both training and fine-tuning an NMT model. Fine-tuning is a common practice for optimizing an MT model on new data. However, the new data may introduce new words (or tokens) which, if not taken into account, can lead to suboptimal performance. Moreover, the token distribution of the new data can differ from that of the original data, making the original SW tokenization model less suitable for the new data. In this work, we compare, through a systematic empirical evaluation, different strategies for SW tokenization and vocabulary generation, with the goal of identifying an optimal setting for fine-tuning a domain-specific model. Furthermore, we develop several in-domain models, the best of which achieves an improvement of 6 BLEU points over the baseline.
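
To make the token-distribution mismatch concrete, below is a minimal sketch (not the authors' code) of how one might quantify it with SentencePiece BPE in Python. The corpus paths, model prefix, and vocabulary size are hypothetical placeholders.

import sentencepiece as spm

# Train a BPE model on the original (general-domain) training corpus.
spm.SentencePieceTrainer.train(
    input="general_domain.txt",      # hypothetical corpus path
    model_prefix="bpe_general",
    vocab_size=32000,                # hypothetical vocabulary size
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_general.model")

def fragmentation(path: str) -> float:
    """Average number of subword tokens per whitespace-separated word.

    A higher value on the new data than on the original data signals
    that the original BPE model splits in-domain words aggressively,
    i.e. the token distribution has shifted."""
    n_words = n_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_words += len(line.split())
            n_tokens += len(sp.encode(line, out_type=str))
    return n_tokens / max(n_words, 1)

print("general-domain fragmentation:", fragmentation("general_domain.txt"))
print("in-domain fragmentation:     ", fragmentation("in_domain.txt"))

If the in-domain fragmentation rate is noticeably higher, the original BPE model over-segments in-domain words; this is exactly the scenario in which retraining the SW tokenization model or adapting the vocabulary before fine-tuning can pay off.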


