Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary

11/03/2018 ∙ by Surafel M. Lakew, et al. ∙ Amazon 0

We propose a method to transfer knowledge across neural machine translation (NMT) models by means of a shared dynamic vocabulary. Our approach allows us to extend an initial model for a given language pair to cover new languages by adapting its vocabulary as new data become available (i.e., introducing new vocabulary items if they are not included in the initial model). The parameter transfer mechanism is evaluated in two scenarios: i) adapting a trained single language pair NMT system to work with a new language pair and ii) continuously adding new language pairs to grow into a multilingual NMT system. In both scenarios our goal is to improve translation performance while minimizing the training convergence time. Preliminary experiments spanning five languages with different training data sizes (i.e., 5k and 50k parallel sentences) show a significant performance gain ranging from +3.85 up to +13.63 BLEU in different language directions. Moreover, when compared with training an NMT model from scratch, our transfer-learning approach allows us to reach higher performance after training up to 4




1 Introduction

Neural Machine Translation (NMT) has been shown to surpass phrase-based Machine Translation approaches not only in high-resource language settings, but also in low-resource [1] and zero-resource translation tasks [2, 3]. Although recent approaches yield promising results, training models in low-resource settings remains a challenge for MT research [4]. [2] have shown that a multilingual NMT (M-NMT) model that utilizes a concatenation of data covering multiple language pairs (including high-resourced ones) can result in better performance in the low-resource translation task. Alternatively, [5] proposed a transfer-learning approach in which an NMT “parent-model” trained on a high-resource language is used to initialize a “child-model” in a low-resource setting, showing consistent translation improvements on the latter task.

Though effective, training models on a concatenation of data covering multiple language pairs or initializing them by transferring knowledge from a parent model does not consider the dynamic nature of new language vocabularies. In relation to how and when model vocabularies are built, there can be two distinct scenarios. In the first one, all the training data for all the language pairs are available from the beginning. In this case, either separate or joint sub-word segmentation models can be applied to the training material to build vocabularies that represent all the data [6, 7]. In the second scenario, training data covering different language directions are not available at the same time (most real-world MT training scenarios fall in this category, in which new data or new needs in terms of domains or language coverage emerge over time). In such cases, either: i) new MT models are trained from scratch with new vocabularies built from the incoming training data, or ii) the word segmentation rules of a prior (parent) model are applied to the new data to continue the training as a fine-tuning task. In all the scenarios, accurate word segmentation is crucial to avoid out-of-vocabulary (OOV) tokens. However, different strategies for the different training conditions can result in longer training time or performance degradation. More specifically, limiting the target task to the initial model vocabulary will result in: i) a word segmentation that is unfavorable for the new language directions and ii) a fixed vocabulary/model dimension despite the varying language and training dataset size.

NMT models are not only data-demanding, but also require considerable time to be trained, optimized, and put into use. In particular real-world scenarios, strict time constraints prevent the deployment and use of NMT technology (consider, for instance, emergency situations that require communication across languages to be enabled promptly [8]). On top of this, when the available training corpora are limited in size, delivering usable NMT systems (i.e., systems that can be used with the requirement of not making severe errors [9]) becomes prohibitive. In summary: i) on the data side, acquiring new training material for undefined languages is costly and not always possible, and ii) on the model side, building an NMT system from scratch when new data become available raises efficiency and performance issues that are particularly relevant in low-resource scenarios.

We address these issues by introducing a method to transfer knowledge across languages by means of a dynamic vocabulary. Starting from an initial model, our method allows us to build new NMT models, covering either a single or multiple translation directions, by dynamically updating the initial vocabulary to fit new incoming data. For instance, given a trained German-English NMT system, the learned parameters can be transferred across models while adopting new language vocabularies. In our experimental setting we test two transfer approaches:

  • progAdapt: train a chain of consecutive M-NMT models by transferring the parameters of an initial model for one language pair to new language pairs. In this scenario, the goal is to maximize performance on the new language pairs.

  • progGrow: progressively introduce new language pairs to the initial model to create a growing M-NMT model covering multiple translation directions. In this scenario, the goal is to maximize performance on all the language pairs.

Our experiments are carried out with Italian↔English, Romanian↔English, and Dutch↔English training data sets of different sizes, ranging from low-resource (50k) to extremely low-resource (5k) conditions.

As such, in a rather different way from previous work [5], we show our transfer-learning approach in a multilingual NMT model with dynamic vocabulary both in the source and target directions. Our contributions are as follows:

  • we develop a transfer-learning technique for NMT based on a dynamic vocabulary, which adapts the parameters learned on a parent task (language direction) to cover new target tasks;

  • through experiments in different scenarios, we show that our approach improves knowledge transfer across NMT models for different languages, particularly in low-resource conditions;

  • we show that, with our transfer learning approach, it is possible to train a faster converging model that achieves better performance than a system trained from scratch.

2 Related work

2.1 Transfer Learning

Recent efforts [10, 11] in natural language processing (NLP) research have shown promising results when transfer-learning techniques are applied to leverage existing models to cope with the scarcity of training data in specific domains or language settings. The advancements in NLP came following a much larger impact of transfer-learning in computer vision tasks, such as classification and segmentation, either using features of ImageNet or by fine-tuning the last layers of a deep neural network [13]. Specific to NLP, pre-trained word embeddings [14] used as input to the first layer of the network have become a common practice. In a broader sense, pre-trained models have been successfully exploited for several NLP tasks. [15] used an MT model as a pre-training step to further contextualize word vectors for tasks like sentiment analysis, question classification, textual entailment, and question answering. In a similar way, a language model is utilized for pre-training in sequence labeling tasks [16], question answering, textual entailment, and sentiment analysis [17].

Close to our approach, [5] explored techniques for transfer-learning across two NMT models. First, a “parent” model is trained with a large amount of available data. Then the encoder-decoder components are transferred to initialize the parameters of a low-resourced “child” model. In this parent-child setting, the decoder parameters of the child model are fixed at the time of fine-tuning. Later, in [18], the parent-child approach has been extended to analyze the effect of using related languages on the source side.

Although this work shares a related approach with [5], we diverge in our hypothesis: rather than selectively updating only the encoder, we allow all the parameters to be updated, which we consider a beneficial strategy in our setting. Our strategy is based on both the source→target and target→source translation directions, which we consider transferable. Moreover, our transfer-learning approach relies on a dynamic vocabulary that enforces changes in the trainable parameters of the network, in contrast to fixing them. (In future work, we plan to further study which parameters are more beneficial if transferred and which part of the network to selectively update.)

2.2 Multilingual NMT

In a one-to-many multilingual translation scenario, [19] proposed a multi-task learning approach that utilizes a single encoder for the source language and separate attention mechanisms and decoders for each target language. [20] used distinct encoder and decoder networks for modeling multiple language pairs in a many-to-many setting. Later, [21] introduced a way to share the attention mechanism across multiple languages. Aimed at avoiding translation ambiguities on the decoder side, a many-to-one character level NMT setup [22] and a two/multi-source NMT [23] were also proposed. Inspired by [24], who automatically annotated the source side with artificial flags to manage the politeness level of the output, other works focused on controlling the grammatical voice [25], the text domain [26, 27], and enforcing gender agreement [28]. Simplified yet efficient multilingual NMT approaches have been proposed by [2] and [3]. The approach in [3] applies a language-specific code to words from different languages in a mixed-language vocabulary. The approach in [2], by prepending a language flag to the input string, greatly simplified multilingual NMT eliminating the need of having separate encoder/decoder networks and attention mechanism for each new language pair. In this work we follow a similar strategy by incorporating an artificial language flag.

3 Transfer Learning in M-NMT

In this work, we cast transfer-learning in a multilingual neural machine translation (M-NMT) task as the problem of dynamically changing/updating the vocabulary of a trained NMT system. In particular, transfer-learning across models is assumed to: i) include a strategy to add new language-specific items to an existing NMT vocabulary, and ii) be able to manage a number of new translation directions in different transfer rounds, either by covering them one at a time (i.e., in a chain where new languages are covered stepwise) or simultaneously (i.e., pursuing all directions at each step). Our investigation focuses on two aspects. The first one is how the parameters of an existing model can be transferred to a target one for a new language pair. The second aspect is how to limit the impact of parameter transfer on the performance of the initial model as new language directions are added. For convenience, we refer to our approach as TL-DV (Transfer-Learning using Dynamic Vocabulary).

Figure 1: Transfer-Learning; (left) from an initial NMT model to a model for a new language pair, applied after inserting the new vocabulary entries: the initial model parameters are transferred with the updated embedding space (i.e., keeping the overlapping entries, while replacing the non-overlapping ones with the new language vocabulary), and (right) from an initial model to a new model, but incorporating both the previous and new language pair data and vocabulary entries.

As shown in Figure 1, our transfer-learning approach is evaluated in two conditions:

  • progAdapt, in which progressive updates are made on the assumption that new target NMT task data become available for one language direction at a time (i.e., new language directions are covered sequentially). In this condition, our goal is to maximize performance on the new target tasks by taking advantage of parameters learned in their parent task;

  • progGrow, in which progressive updates are made on the same assumption of receiving new target task data as in progAdapt, but with the additional goal of preserving the performance of the previous language directions.

We discuss these two scenarios below in 3.2 and 3.3.

3.1 Dynamic Vocabulary

In the defined scenarios, we update the vocabulary of the previous model with the vocabulary of the current language direction. The approach simply keeps the intersection (shared entries) of the two vocabularies, and replaces the entries of the previous vocabulary that do not exist in the current one with the new entries. At training time, these new entries are randomly initialized, while the intersecting items maintain the embeddings of the former model. The alternative to a dynamic vocabulary in continued model training is to keep using the initial model vocabulary, which we refer to as static-vocabulary.
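The update described above can be illustrated with a minimal sketch. It is not the script used in our experiments: the function name `update_vocabulary`, the toy vocabularies, and the normal initialization with a small standard deviation are assumptions made for illustration only.

```python
import numpy as np


def update_vocabulary(old_vocab, old_emb, new_vocab, seed=0):
    """Build an embedding matrix for new_vocab (hypothetical helper).

    Rows of entries shared with old_vocab are copied from old_emb;
    all other rows keep a random initialization.
    """
    rng = np.random.default_rng(seed)
    dim = old_emb.shape[1]
    old_index = {token: i for i, token in enumerate(old_vocab)}
    # Assumption: small random normal values for the new entries.
    new_emb = rng.normal(0.0, 0.01, size=(len(new_vocab), dim))
    shared = []
    for j, token in enumerate(new_vocab):
        i = old_index.get(token)
        if i is not None:
            new_emb[j] = old_emb[i]  # keep the learned embedding
            shared.append(token)
    return new_emb, shared


# Toy example: a German-English vocabulary updated for Italian-English.
old_vocab = ["<unk>", "the", "haus", "en@@"]
old_emb = np.arange(12, dtype=float).reshape(4, 3)
new_vocab = ["<unk>", "the", "casa", "en@@", "zione"]
new_emb, shared = update_vocabulary(old_vocab, old_emb, new_vocab)
print(shared)
```

In a full model the same row-wise copy would also apply to any other vocabulary-sized parameter (for example an output projection), which this sketch leaves out.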

3.2 Progressive Adaptation to New Languages

In this scenario, starting from the init model, we perform progressive adaptation by initializing the training of the model at each step with the model from the previous step. When the previous model is reloaded, a TL-DV update is performed as described in 3.1. In this approach, the dataset of the initial model is not included at the current training stage. This allows adaptation to the new language without the unnecessary word segmentation that may arise from applying the initial model’s segmentation rules. As shown in Figure 1 (left), the adaptation at any of the stages is language independent, though subject to the available training dataset. We refer to the application of this approach in the experimental settings and discussion as progAdapt.

3.3 Progressive Growth of Translation Directions

In this scenario, an initial model is simultaneously adapted to an incremental number of translation directions, under the constraint that the level of performance on the previously covered directions has to be maintained. For a simplified experimental setup, we incorporate a single language pair (source↔target) at a time when adapting from the previous model to the new one (see Figure 1 (right)). We refer to the application of this approach in the experimental settings and discussion as progGrow.
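The difference between the two schedules of 3.2 and 3.3 lies in which data each stage sees. The sketch below makes that explicit; the function `training_plan` is a hypothetical helper, and it assumes that a progAdapt stage trains only on the newly received pair while a progGrow stage trains on the init pair plus every pair received so far.

```python
def training_plan(init_pair, new_pairs, mode):
    """Return, for each stage, the language pairs whose data are used."""
    if mode not in ("progAdapt", "progGrow"):
        raise ValueError("mode must be 'progAdapt' or 'progGrow'")
    plan = []
    seen = [init_pair]
    for pair in new_pairs:
        if mode == "progAdapt":
            # Assumption: only the newly received pair is used.
            plan.append([pair])
        else:
            # Previous and new data are concatenated.
            seen.append(pair)
            plan.append(list(seen))
    return plan


adapt = training_plan("De-En", ["It-En", "Ro-En", "Nl-En"], "progAdapt")
grow = training_plan("De-En", ["It-En", "Ro-En", "Nl-En"], "progGrow")
for stage, (a, g) in enumerate(zip(adapt, grow), start=1):
    print(stage, a, g)
```

In both modes, each stage would start from the previous stage's parameters after a dynamic vocabulary update.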

4 Experimental Setting

4.1 Dataset and Preprocessing

Our experimental setting includes the init model language pair (German-English) and three additional language pairs (Italian-English, Romanian-English, and Dutch-English) for testing the proposed approaches. We use publicly available datasets from the WIT TED corpus [29]. Table 1 shows the summary of the training, dev, and test sets. To simulate extremely low-resource and low-resource model settings, 5K and 50K sentences, respectively, are sampled from the training data of the last three language pairs.

At the preprocessing step, we first tokenize the raw data and remove sentences longer than 70 tokens. As in [2], we prepend a “language flag” on the source side of the corpus for all multilingual models. For instance, if a German source is paired with an English target, we prepend <2ENG> to the source segments. Next, a shared byte pair encoding (BPE) model [6] is trained using the union of the source and target sides of each language pair. Following [30], the number of BPE segmentation rules is set according to the data size used in our experimental setting. At each training stage, a BPE model for the corresponding language pairs is then used to segment the training, dev, and test data into sub-word units. While the vocabulary size of the init model is fixed, the vocabulary varies in the consecutive training stages depending on the overlap of sub-word units and the lexical similarity between two language pairs.
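The length filter and language flag can be sketched as follows. This is an illustration only: the function `preprocess` is hypothetical, whitespace splitting stands in for a real tokenizer, and applying the 70-token limit to both sides of a pair is an assumption.

```python
def preprocess(pairs, target_flag, max_len=70):
    """Drop over-long pairs and prepend the target-language flag to the source."""
    kept = []
    for src, tgt in pairs:
        src_tokens = src.split()  # stand-in for real tokenization
        tgt_tokens = tgt.split()
        # Assumption: the length limit applies to both sides.
        if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
            continue
        flagged_src = " ".join([target_flag] + src_tokens)
        kept.append((flagged_src, " ".join(tgt_tokens)))
    return kept


corpus = [
    ("Das ist ein Haus .", "This is a house ."),
    (" ".join(["wort"] * 71), "too long"),
]
result = preprocess(corpus, "<2ENG>")
print(result)
```

BPE segmentation would then be applied to the flagged text, with the flag kept as a single token.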

Language          Train    Dev    Test   Received
German(De)-En     200k     1497   1138   init
Italian(It)-En    5k/50k   1501   1147
Romanian(Ro)-En   5k/50k   1633   1129
Dutch(Nl)-En      5k/50k   1726   1181

Table 1: Languages and dataset sizes for train, dev, and test sets of the init model for De-En direction and other pairs assumed to be received progressively (It-En, Ro-En, Nl-En).

4.2 Experimental Settings

All systems are trained using the Transformer [31] model implementation of the OpenNMT-tf sequence modeling framework [32]. At training time, to alternate between dynamic and static vocabulary, we utilized an updated version of a script within the same framework. For all trainings, we use LazyAdam, a variant of the Adam optimizer [33], with a constant initial learning rate and dropout [34, 35]. The learning rate is increased linearly in the early stages (the warmup training steps), and afterwards it is decreased with the inverse square root of the training step.
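A schedule of this shape (linear warmup followed by inverse square root decay) can be written as in the sketch below, which follows the formulation in [31]. The default values for `d_model`, `warmup_steps`, and `scale` are placeholders chosen for illustration and are not the values used in our experiments.

```python
def learning_rate(step, d_model=512, warmup_steps=4000, scale=1.0):
    """Linear warmup, then decay with the inverse square root of the step.

    All default values are illustrative assumptions.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    warmup_term = step * warmup_steps ** -1.5  # grows linearly with step
    decay_term = step ** -0.5  # shrinks as 1 / sqrt(step)
    return scale * d_model ** -0.5 * min(decay_term, warmup_term)


for s in (1000, 2000, 4000, 8000, 16000):
    print(s, learning_rate(s))
```

The two terms meet at `step == warmup_steps`, where the learning rate peaks.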

To train our models using Transformer, we employ a uniform setting for the hidden units and embedding dimension, and a 6-layer self-attention encoder-decoder network. The training batch size is measured in sub-word tokens. At inference time, we use beam search with batched decoding. Following [31] and for a fair comparison, all baseline experiments are run for 100k training steps, within which all models are observed to converge. The consecutive experiments converge in a variable number of training steps. However, to make sure a convergence point is reached, all restarted experiments are run for additional training steps. All models are trained on a GeForce-GTX-1080 machine with a single GPU. Systems are compared in terms of BLEU [36] using the multi-bleu.perl implementation (a script from the Moses SMT toolkit), on the single references of the official IWSLT test sets.

4.3 Baseline Models

To evaluate and compare with our approach, we train single language pair baseline models corresponding to the newly introduced language pairs at each training stage. The baseline models, referred to as Bi-NMT, are separately trained from scratch in a bi-directional setting (i.e., source↔target). In addition, we report scores from a multilingual (M-NMT) model trained with the concatenation of all available data in each training stage. An alternative baseline is built by fine-tuning the init model. These models use the vocabulary (word segmentation rules) of the init model, avoiding the proposed dynamic vocabulary. This fine-tuning approach is prevalent in continued model training, for adapting NMT models [37, 38] or improving zero-shot and low-resource translation tasks [39, 40, 41]. For the alternative baseline where we fine-tune init with its static-vocabulary, we observed that results were mostly analogous to those of the Bi-NMT models. Hence, we omitted this comparison in this work and relied on the former baselines.

5 Results and Discussion

Experiments are performed using the progAdapt (see 3.2) and progGrow (3.3) approaches. The experimental results with the associated discussion are presented in Table 2 for models characterized by relatively low-resource data (50K), and in Table 3 for an extremely low-resource condition (5K). In both dataset conditions, the performance of the proposed approaches is compared with the baseline systems (Bi-NMT and M-NMT, see 4.3).

The init model, which is trained with a data size 4X larger than the 50K and 40X the size of the 5K training sets, achieves BLEU scores of 26.74 and 23.30, respectively, for the De-En and En-De directions. In Tables 2 and 3, progAdapt is reported for each training stage, whereas progGrow is reported for the final stage. Moreover, Table 4 analyzes the effect of language relatedness and training stage reordering in our TL-DV approach. Bold highlighted BLEU scores show the best performing approach, while the arrows indicate statistically significant differences of the hypothesis against the better performing baseline (M-NMT) using bootstrap resampling [42].

5.1 Low-Resource Setting

System        Dir   De-En   It-En   Ro-En   Nl-En
Init/Bi-NMT   >     26.74   25.21   10.80   21.75
              <     23.30   22.39   12.94   19.75
M-NMT         >     24.14   26.42   22.17   24.00
              <     21.80   23.57   17.35   21.25
ProgAdapt     >     -       30.08   24.43   26.36
              <     -       26.24   20.31   25.52
ProgGrow      >     26.22   29.61   23.23   24.78
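Table 2: Model performance in the low-resource (50K) setting i) for the init De-En direction and baseline (Bi-NMT) It-En, Ro-En, and Nl-En directions, ii) for progAdapt at each training stage, and iii) for the progGrow approach at the final stage. In the Dir column, > denotes translation into English and < translation from English.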

For each language pair (i.e., It-En, Ro-En, and Nl-En), the results of the baseline Bi-NMT models trained using the available 50K parallel data are presented in the first two rows of Table 2. The progAdapt results are reported from three consecutive adaptations to new language directions. These include the adaptation of init to It-En, followed by the adaptation to Ro-En, and then to Nl-En. Compared to the corresponding Bi-NMT and M-NMT models, all three progressive adaptations using the dynamic vocabulary technique achieved higher performance.

If we look at each level of adaptation against the Bi-NMT, we observe that the It-En direction showed a +4.87 and +3.85 gain for the En and It targets, respectively. When we take this model and continue the adaptation to Ro-En and Nl-En, we see a similar trend, where the highest gain is observed for the Ro-En direction with +13.63 and +7.37 points. These significant improvements over the baseline models tell us that transfer-learning using dynamic vocabulary in a multilingual setting is a viable direction. Its capability to quickly tune the representation space of the init model to deliver improved results is an indication of the importance of using different word representations for each language pair. (We reserve the adaptation from the init model directly to all three new language pairs, and the comparison with the current setting, for future work.)
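The gains quoted above follow directly from Table 2. The short sketch below recomputes them from the Bi-NMT and progAdapt rows; the dictionary layout and the helper `gains` are illustrative choices, with each tuple holding the into-English score first and the from-English score second.

```python
# BLEU scores copied from Table 2: (into English, from English).
bi_nmt = {"It-En": (25.21, 22.39), "Ro-En": (10.80, 12.94), "Nl-En": (21.75, 19.75)}
prog_adapt = {"It-En": (30.08, 26.24), "Ro-En": (24.43, 20.31), "Nl-En": (26.36, 25.52)}


def gains(system, baseline):
    """Per-pair BLEU difference between a system and a baseline."""
    return {
        pair: tuple(round(s - b, 2) for s, b in zip(system[pair], baseline[pair]))
        for pair in system
    }


adapt_gains = gains(prog_adapt, bi_nmt)
for pair, (to_en, from_en) in adapt_gains.items():
    print(f"{pair}: +{to_en:.2f} (into En), +{from_en:.2f} (from En)")
```

The It-En and Ro-En entries reproduce the +4.87/+3.85 and +13.63/+7.37 figures given in the text.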

In the case of progGrow, we observed a similar improvement trend as in the progAdapt approach. The results are reported from the final stage of the model growth, but improvements are consistent throughout the earlier stages. The M-NMT outperformed the Bi-NMT models except for the De-En pair. However, compared to the multilingual model as an alternative method for achieving cross-lingual transfer-learning, our approach shows improvements in the consecutive training stages. Overall, our observation is that the suggested progGrow model can accommodate new translation directions when the data are received. Most importantly, improvements are observed for these newly introduced languages without substantially altering the performance of the init model in the De-En direction.

Specific to each language direction, It-En shows performance comparable to the progAdapt approach, whereas a small degradation is observed for De-En, Ro-En, and Nl-En (see Table 2). The loss in performance is likely due to the increased ambiguities on the encoder side of the progGrow model, where at both training and inference time there is no disambiguation mechanism between languages except the prepended language flag. This observation, which sheds light on our initial expectation that more data aggregation would benefit model performance, requires further investigation.

5.2 Extremely Low-Resource Setting

System        Dir   De-En   It-En   Ro-En   Nl-En
Init/Bi-NMT   >     26.74   7.64    4.56    5.69
              <     23.30   5.25    3.86    5.14
M-NMT         >     24.96   16.26   12.67   15.59
              <     21.67   10.38   8.67    12.72
ProgAdapt     >     -       15.16   11.03   11.52
              <     -       14.40   11.10   13.57
ProgGrow      >     25.61   15.02   11.20   13.56
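Table 3: Model performance in the extremely low-resource (5K) setting i) for the init De-En direction and baseline (Bi-NMT) It-En, Ro-En, and Nl-En directions, ii) for progAdapt at each training stage, and iii) for the progGrow approach at the final stage. In the Dir column, > denotes translation into English and < translation from English.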

Similar to what we observed in the 50K experiments, the baseline models in the extremely low-resource setting demonstrate poor performance. Looking at our approaches, we observe a relatively higher gain at the first stage of progAdapt and progGrow. For instance, for the It-En pair there is a +7.52 improvement over the Bi-NMT model, compared to the +4.87 in the 50K models (see Table 2). In the subsequent additional language directions (i.e., Ro-En and Nl-En), we also observe a similar trend. However, in comparison with the M-NMT, both of our approaches perform worse when translating to the En target. The main reason for this could be the aggregation of all the available data for a single run in the M-NMT model, while our approaches exploit data as they become available in continued training. Alternatively, the distance between each language pair could play a significant role when we adapt with extremely sparse data.

prog-Adapt/Grow with Related Languages. When related language pairs are consecutively added at each training stage, our TL-DV approach showed the best performance. For instance, for the Nl-En experiments, we changed the sequence of the added language pairs, moving from a random order to a sequence based on the similarity to the init model.

                    50K             5K
System      Dir   De-En   Nl-En   De-En   Nl-En
ProgAdapt   >     -       27.23   -       16.21
            <     -       25.51   -       15.86
ProgGrow    >     26.62   26.41   26.52   15.52
Table 4: Performance of the 50K and 5K models for the progAdapt and progGrow approaches in a closely related De-En (init) and Nl-En language pairs setting. In the Dir column, > denotes translation into English and < translation from English.

Table 4 shows the results from progAdapt and progGrow when the closely related Nl-En pair is added to the De-En init model. The results confirm the trend observed in Table 2, however, with relatively better performance when translating into English. Most importantly, in the 5K setting the results show a consistent and larger gain of +4.69 (Nl-En) and +2.29 (En-Nl) with progAdapt, and +1.96 (Nl-En) with progGrow, compared to the corresponding results in Table 3. Thus, we emphasize the degree of language similarity as a direct influencing factor when incorporating a new language pair in both the progAdapt and progGrow approaches.

Prog-Adapt/Grow with Faster Convergence. The other main advantage of our TL-DV approach comes from the time a model takes to restart from the init model and reach a convergence point with better performance. In all experiments with our TL-DV approach, a converged model is found within 4K to 20K steps, depending on the training setting. Compared to the 100K steps needed by a model trained from scratch to reach good performance, our approach takes only 4% to 20% of the training steps with significantly higher performance. For instance, taking into consideration the 5K models, Figure 2 illustrates the steps required for the baseline systems to converge (Table 3), in comparison with our approach, where progGrow converges slightly faster than progAdapt. However, with the relatively larger data of the 50K models, the progAdapt approach converges much faster than progGrow, because the newly introduced vocabulary and training dataset sizes are smaller than the concatenation of the init and new data.

Figure 2: Comparison of the number of training steps for the three different language pairs between the baseline (rightmost) and the proposed approaches in the 5K setting.

We further analyzed the influence of the vocabulary shared between consecutive models on the performance of TL-DV. For this discussion, we took the progAdapt models from all stages. Figure 3 summarizes the improvement differences from consecutive models in relation to the percentage of shared vocabulary: the overlap between the init and the It-En model vocabularies is compared with the overlap that the Ro-En and Nl-En models share with their respective previous models. The interesting aspect of the shared vocabulary is that model performance increases with a higher fraction of shared vocabulary entries. Thus, a larger number of shared parameters between two consecutive models allows for a better gain in performance of the latter.

Figure 3: The difference in performance between the baseline and progAdapt models (Tgt→Src and Src→Tgt directions) in relation to the vocabulary shared between the previous model and the new language pair model.
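The percentage of shared vocabulary between two consecutive models can be computed as in the sketch below. The function `shared_vocab_percent` is a hypothetical helper, and normalizing by the size of the new vocabulary is an assumption, since the exact normalization is not restated here.

```python
def shared_vocab_percent(prev_vocab, new_vocab):
    """Percentage of new-vocabulary entries already present in the previous one."""
    prev, new = set(prev_vocab), set(new_vocab)
    if not new:
        return 0.0
    # Assumption: the overlap is normalized by the new vocabulary size.
    return 100.0 * len(prev & new) / len(new)


prev_vocab = ["<unk>", "the", "haus", "en@@"]
new_vocab = ["<unk>", "the", "casa", "zione"]
overlap = shared_vocab_percent(prev_vocab, new_vocab)
print(overlap)
```

Under this definition, the shared entries are exactly those whose embeddings are carried over in a dynamic vocabulary update.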

The results achieved by the transfer-learning with dynamic vocabulary approach in two different training size conditions show that: i) adapting a trained NMT model to a new language pair improves performance on the target task significantly, and ii) it is possible to train a model faster to achieve better performance. Overall, the capability of injecting new vocabularies for new language pairs in the initial model is a crucial aspect for efficient and fast adaptation steps.

6 Conclusions

In this work, we proposed a transfer-learning approach within multilingual NMT. Experimental results show that our dynamic vocabulary based transfer-learning significantly improves model performance over a bilingual baseline model in both the extremely low-resource and the low-resource settings, with gains of up to +13.63 BLEU in the latter.

In future work, we will focus on finding the optimal way of transferring model parameters. Moreover, we plan to test our approach for various languages and language varieties.

7 Acknowledgments

This work has been partially supported by the EC-funded project ModernMT (H2020 grant agreement no. 645487). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Moreover, we thank the Erasmus Mundus European Program in Language and Communication Technology.