Improving Zero-Shot Translation of Low-Resource Languages

11/04/2018 ∙ by Surafel M. Lakew, et al. ∙ Fondazione Bruno Kessler

Recent work on multilingual neural machine translation reported competitive performance with respect to bilingual models and surprisingly good performance even on (zero-shot) translation directions not observed at training time. We investigate here zero-shot translation in a particularly low-resource multilingual setting. We propose a simple iterative training procedure that leverages a duality of translations directly generated by the system for the zero-shot directions. The translations produced by the system (sub-optimal since they contain mixed language from the shared vocabulary) are then used together with the original parallel data to feed and iteratively re-train the multilingual network. Over time, this allows the system to learn from its own generated and increasingly better output. Our approach proves effective in improving the two zero-shot directions of our multilingual model. In particular, we observed gains of about 9 BLEU points over a baseline multilingual model and up to 2.08 BLEU over a pivoting mechanism using two bilingual models. Further analysis shows that there is also a slight improvement in the non-zero-shot language directions.




1 Introduction

Machine translation of low-resource languages represents a challenge for neural machine translation (NMT) [1]. Recent efforts in multilingual NMT (Multi-NMT) [2, 3] have been shown to improve translation performance in low-resource settings. Multi-NMT models can be trained with parallel corpora of several language pairs to work in many-to-one, one-to-many, or many-to-many translation directions. A simple approach, named target-forcing [3], is to prepend to the source sentence a tag specifying the target language, both at training and testing time. In addition to performance gains for low-resource languages, a further benefit of Multi-NMT is the possibility of performing zero-shot translation, i.e. translation across directions that were not observed at training time.

Figure 1: Our zero-shot translation setting for Italian-Romanian. Parallel data is available only for Italian-English, Romanian-English, German-English, and Dutch-English. We leverage multilingual neural machine translation trained on all available parallel data to translate across Italian-Romanian (in both directions), either directly (zero-shot) or through English (pivoting).

Application scenarios in which zero-shot translation can bootstrap the creation of new parallel data (e.g. via human post-editing) [2] show how translation performance in the initial zero-shot direction improves over time with the addition of new parallel data. In this work, we explore instead the possibility of enabling a trained Multi-NMT model to further learn from its own generated data. Briefly, our method works as follows: first (1), we let the Multi-NMT engine generate zero-shot translations on some portion of the training data; then (2), we restart the training process on both the generated translations and the original parallel data. We repeat this training-inference-training cycle a few times. Notice that, at each iteration, the original training data is augmented only with the last batch of generated translations. We observe that the generated outputs initially contain a mix of words from the shared vocabulary, but after a few iterations they tend to contain only words in the zero-shot target language, thus becoming more and more suitable for learning.

We test our approach on a Multi-NMT scenario including Italian, Romanian, English, German and Dutch, assuming that the zero-shot translation pair is Italian-Romanian. We also assume that all languages have parallel data only with English (see Figure 1). We apply our approach on top of the multilingual NMT training method suggested by [2]. Experimental results show that our iterative training procedure not only significantly improves performance on the zero-shot directions, but also boosts multilingual NMT in general. Finally, our approach also outperforms pivot-based machine translation.

2 Related Work

In this section we discuss relevant works on multilingual NMT, zero-shot NMT, and model training with self-generated data, which are closely related to our approach.

2.1 Multilingual NMT

Previous works in Multi-NMT are characterized by the use of separate encoding and/or decoding networks for every translation direction. Dong et al. (2015) [4] proposed a multi-task learning approach for a one-to-many translation scenario, based on sharing representations between related tasks (i.e. the source language) in order to enhance generalization on the target language. In particular, they used a single encoder on the source side, and separate attention mechanisms and decoders for every target language. In related work, [5] used separate encoder and decoder networks for modeling language pairs in a many-to-many setting. Notably, they dropped the attention mechanism in favor of a shared vector space in which to represent both text and multi-modal information. Aimed at reducing ambiguities at translation time, [6] employed a multi-source system that considers two languages on the encoder side and one target language on the decoder side. In particular, the attention model is applied to a combination of the two encoder states. In a many-to-many translation scenario, [7] introduced a way to share the attention mechanism across multiple languages. As in [4] (but only on the decoder side) and in [5], they used separate encoders and decoders for each source and target language.

Despite the reported improvements, the need for an additional encoder and/or decoder for every language added to the system is a limitation of these approaches, since it makes their networks complex and expensive to train.

In a very different way, [2] and [3] developed similar Multi-NMT approaches by introducing a target-forcing token in the input. The approach in [3] applies a language-specific code to words from different languages in a mixed-language vocabulary. In practice, they force the decoder to translate to a specific target language by prepending and appending an artificial token to the source text. However, their word and sub-word level language-specific coding mechanism significantly increases the input length, which has been shown to affect the computational cost and performance of NMT [8]. In [2], only one artificial token is prepended to the source sentences in order to specify the target language. Prepending language tokens has eliminated the need for separate encoder/decoder networks and attention mechanisms for every new language pair.

2.2 Zero-Shot Translation

By extending the approach in [7], zero-resource NMT has been suggested in [9]. The authors proposed a many-to-one translation setting and used the idea of generating a pseudo-parallel corpus [10], using a pivot language, to fine-tune their model. However, also in this case the need for separate encoders and decoders for every language pair significantly increases the complexity of the model.

An attractive feature of the target-forcing mechanism comes from the possibility to perform zero-shot translation with the same multilingual setting as in [2, 3]. However, recent experiments have shown that the mechanism fails to achieve reasonable zero-shot translation performance for low-resource languages [11]. The promising results in [2] and [3] hence require further investigation to verify if their method can work in various language settings, particularly across distant languages.

2.3 Training with self-generated data

Training procedures using self-generated data have been around for a while. For instance, in statistical machine translation (SMT), [12, 13] showed how the output of a translation model can be used iteratively to improve results in a task like post-editing. Mechanisms like back-translating the target side of a single language pair have been used for domain adaptation [14] and more recently by [10] to improve an NMT baseline model. In [15], a dual-learning mechanism is proposed where two NMT models working in opposite directions provide each other with feedback signals that permit them to learn from monolingual data. In a related way, our approach also considers training from monolingual data along dual zero-shot directions. In contrast, however, our train-infer-train loop leverages the capability of the network to jointly learn multiple translation directions.

Although our brief survey shows that re-using the output of an MT system for further training and improvement has been successfully applied in different settings, our approach differs from past works in two main aspects: i) we introduce for the first time a train-infer-train mechanism that addresses Multi-NMT, and ii) we cast the approach into a self-correcting training procedure over two dual zero-shot directions, so that incrementally improved translations mutually reinforce each direction.

3 Neural Machine Translation

The standard NMT architecture comprises an encoder, a decoder and an attention mechanism, which are all trained with maximum likelihood in an end-to-end fashion [16]. The encoder is a recurrent neural network (RNN) that encodes a source sentence into a sequence of hidden state vectors. The decoder is another RNN that uses the representation of the encoder to predict words in the target language [8] [17]. As the name suggests, attention is a mechanism used to improve the translation quality by deciding which part of the source sentence can contribute most to the prediction process [18]. As shown in Figure 2, which simplifies the NMT architecture, first the encoder takes the source words on the left (purple color), maps them to vectors and feeds them into the RNN. When the eos (i.e. end of sentence) symbol is seen, the final time step initializes the decoder RNN (blue color). At each time step, the attention mechanism is applied over the encoder hidden states and combined with the current hidden state of the decoder to predict the next target word. Then, the prediction is fed back to the decoder RNN to predict the next word, until the eos symbol is generated [19].
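The attention step described above can be illustrated with a small sketch. It assumes plain dot-product scoring between the decoder state and each encoder state, which is only one of several scoring functions in the literature and is not necessarily the one used in our systems; the function names `softmax` and `attend` are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight each encoder hidden state by its relevance to the decoder state.

    Assumption: relevance is a dot product; real systems often use a
    learned scoring function instead.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of the encoder hidden states.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Two toy encoder states for a two-word source sentence "A B".
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

In this toy case the decoder state points in the direction of the first encoder state, so the first source position receives the larger weight and dominates the context vector.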

Figure 2: NMT architecture with encoder-decoder and an attention mechanism, showing a source sentence “A B” translated into a target sentence “X Y”.

4 Mixed Language Input for Multi-NMT

Our goal is to improve translation in the zero-shot directions of a multilingual model with limited directions covered by the training data (see Figure 1). The training strategy of the proposed approach is summarized in Algorithm 1, while its flow chart is illustrated in Figure 3.

To address this problem, our training procedure is performed in three steps which are iterated for several rounds. In the first step (line 2), the multilingual NMT system is trained on the original data available. In the second step (line 5), the trained model is run to translate between the zero-shot directions. Then, in the third step (line 7), the output translations are combined with the corresponding source sentences and added to the original training data. The resulting expanded corpus is then used for a new round of the training process.

According to our train-infer-train scheme, new synthetic data for the two zero-shot directions are generated at each round. This process creates a duality between the two zero-shot translation directions, which we can exploit for mutual improvement. Indeed, for each direction, sub-optimal translations paired with the corresponding original (and well-formed) source are used to obtain new “parallel” sentence pairs that extend the training material for the other direction. The translated mixed-language output forms the source side of these pairs, while the target side consists of the original sentences extracted for inference.

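Algorithm 1: Iterative Learning Procedure
 1: TRAIN: (original dataset)
 2:    Multi-NMT ← initial training using the original dataset
 3: repeat INFER-TRAIN
 4:    for each of the two zero-shot directions do
 5:       synthetic translations ← inference in duality using Multi-NMT
 6:    end for
 7:    augmented dataset ← prepare (original dataset + synthetic pairs for each of the two directions)
 8:    Multi-NMT ← reload Multi-NMT, train using the augmented dataset
 9:    return Multi-NMT
10: until Multi-NMT converges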

Table 1: Iterative Learning algorithm of the proposed approach using the duality of zero-shot translation directions.

Figure 3: Illustration of the proposed multilingual train-infer-train strategy. Using a standard NMT architecture, a portion of the monolingual data for the two zero-shot directions is extracted for inference to construct a dual source-target mixed-input and continue the training. The top solid line shows the training process, whereas the dashed lines show the inference stage.

In the Multi-NMT scenario, the sub-optimal translations representing the source element of the new training pairs will likely contain mixed language that includes words from a vocabulary shared with other languages. The expectation is that, round after round, the model will generate better outputs by learning at the same time to translate and “correct” its own translations by removing spurious elements from other languages. If this intuition holds true, the iterative improvement will yield increasingly better results in translating between the zero-shot directions. Ideally, this incremental training and inference cycle can continue until the model converges (line 10).
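The control flow of the train-infer-train procedure can be sketched as follows. This is a minimal illustration and not the implementation used in our experiments: `train` and `translate` are hypothetical callables standing in for a real NMT toolkit, training examples are represented as (source, target, target language) triples, and a fixed number of rounds replaces the convergence check.

```python
def train_infer_train(train, translate, parallel_data, mono_it, mono_ro, rounds=5):
    """Sketch of the iterative procedure (all names are illustrative).

    train(model, data) -> model         hypothetical training call
    translate(model, text, lang) -> str hypothetical inference call
    """
    # Step 1: initial training on the original parallel data only.
    model = train(None, parallel_data)
    for _ in range(rounds):
        # Step 2: inference in both zero-shot directions.
        hyp_ro = [translate(model, s, "ro") for s in mono_it]
        hyp_it = [translate(model, s, "it") for s in mono_ro]
        # Step 3: the (possibly mixed-language) output becomes the source,
        # the original well-formed sentence becomes the target.
        synthetic = [(h, s, "it") for h, s in zip(hyp_ro, mono_it)]
        synthetic += [(h, s, "ro") for h, s in zip(hyp_it, mono_ro)]
        # Only the latest batch of synthetic pairs is added to the original data.
        model = train(model, parallel_data + synthetic)
    return model

# Toy stand-ins so the sketch runs: the "model" is a lookup table.
def toy_train(model, data):
    table = dict(model or {})
    for src, tgt, lang in data:
        table[(src, lang)] = tgt
    return table

def toy_translate(model, sentence, target_lang):
    # Copies the input when no translation is known.
    return model.get((sentence, target_lang), sentence)

model = train_infer_train(
    toy_train, toy_translate,
    parallel_data=[("ciao", "hello", "en")],
    mono_it=["ciao"], mono_ro=["salut"], rounds=2,
)
```

The toy stand-ins only demonstrate the data flow; with a real system, `train` would resume from the previous checkpoint and `translate` would run beam search.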

5 Experiments

All the experiments are carried out using the open source OpenNMT-py toolkit [19]. For training the models, we used the parameters specified in Table 2. Considering the high data sparsity of our low-resource setting, we applied dropout [20] to prevent overfitting [21]. To train the baseline Multi-NMT, we used Adam [22] as the optimization algorithm. In the subsequent train-infer-train rounds, we switched to SGD [23] with a separate learning rate. If the perplexity does not decrease on the validation set or the number of epochs exceeds a set threshold, a learning rate decay is applied. This combination of optimizers was found to be effective in accelerating the training in the first few iterations. In all the reported experiments the baseline models are trained until convergence, while each training round after the inference stage runs for 10 epochs. For decoding, a beam search of size 10 is applied.
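The learning rate decay rule can be written as a short function. This is a sketch under assumptions: the decay factor and the epoch threshold below are placeholder values chosen for illustration, not the settings of our experiments, and the function name is hypothetical.

```python
def next_learning_rate(lr, epoch, val_ppl, prev_val_ppl,
                       decay=0.5, start_decay_epoch=8):
    """Return the learning rate to use for the next epoch.

    The rate is decayed when validation perplexity fails to decrease or
    when the epoch count is above a threshold. `decay` and
    `start_decay_epoch` are placeholder values, not the paper's settings.
    """
    if val_ppl >= prev_val_ppl or epoch > start_decay_epoch:
        return lr * decay
    return lr

# Perplexity improved and the epoch is below the threshold: rate is unchanged.
lr_kept = next_learning_rate(1.0, epoch=3, val_ppl=10.0, prev_val_ppl=12.0)
# Perplexity stalled: the rate is decayed.
lr_decayed = next_learning_rate(1.0, epoch=3, val_ppl=12.0, prev_val_ppl=12.0)
```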

Model parameters    Value
RNN type            LSTM
RNN size            1024
Embedding dim       512
Encoder             bidirectional
Encoder depth       2
Decoder depth       2
Beam size           10
Batch size          128

Table 2: Hyper-parameters used to train all the models, unless specified in a different setting.

5.1 Dataset

To evaluate our approach, we consider five languages: English (EN), Dutch (NL), German (DE), Italian (IT), and Romanian (RO). To simulate a low-resource scenario, each language pair has roughly 200,000 parallel sentences (see Table 3 for details). All the parallel datasets are from the IWSLT17 multilingual shared task [24].

Direction    Training    test2010    test2017
EN-DE        197,489     1,497       1,138
EN-IT        221,688     1,501       1,147
EN-NL        231,669     1,726       1,181
EN-RO        211,508     1,633       1,129
IT-RO        209,668     1,605       1,127

Table 3: Number of sentences used to train the multilingual model on eight directions. The IT-RO pairs are used to train only the bilingual models.

To train all models, we used the same pipeline. First, we tokenize the dataset. Then, we apply byte pair encoding (BPE) [25], using a shared BPE model trained jointly on the source and target datasets, to segment the tokens into sub-word units. For this operation we fixed the number of BPE merging rules and set a minimum frequency threshold for applying the segmentation. When training the multilingual models, the pipeline includes adding the artificial language token at the source side of each parallel dataset, both for the training and validation sets [2]. We evaluate our models using test2010, and for comparison we use test2017 of the IWSLT2017 evaluation dataset.
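The language-token step of the pipeline amounts to a simple string operation on the source side. The sketch below assumes a token of the form `<2ro>`; this exact format and the function names are illustrative assumptions, since any reserved token that cannot be confused with ordinary vocabulary would serve.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend an artificial token naming the desired target language.

    The "<2xx>" format is an assumption made for illustration.
    """
    return f"<2{target_lang}> {source_sentence}"

def tag_corpus(pairs, target_lang):
    """Tag the source side of every (source, target) pair; targets are untouched."""
    return [(add_target_token(src, target_lang), tgt) for src, tgt in pairs]

# English-Italian training pairs are tagged with the Italian token.
tagged = tag_corpus([("It is an incredible story .", "una storia incredibile .")], "it")
```

At test time the same token selects the output language, which is what makes a zero-shot request such as Italian input with a Romanian token possible.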

5.2 Models

Our baseline models are trained in multilingual and bilingual settings. For each direction of the multilingual model and every bilingual model we report the BLEU [26] score computed using multi-bleu.perl from the Moses SMT implementation. BLEU scores of the Multi-NMT systems trained on the parallel data in Table 3 are reported in Tables 6 and 7 (second column). To compare our zero-shot translations against those of the bilingual models we trained two Italian-Romanian models, one per direction. Both bilingual models are trained with the same amount of training data used by each direction of the Multi-NMT model (see Table 3). Moreover, as additional terms of comparison, we trained two pivoting-based systems (using English as a pivot language): Italian→English→Romanian and Romanian→English→Italian.

5.2.1 Bilingual models

The baseline models for comparison consist of: i) an eight-direction multilingual model (Multi-NMT), and ii) two bilingual NMT models.

System              tst2010    tst2017
Italian→Romanian    19.66      19.14
Romanian→Italian    22.44      20.69

Table 4: BLEU scores of two bilingual NMT models (Italian→Romanian and Romanian→Italian) on IWSLT data tst2010 and tst2017.

The results of the two bilingual models are shown in Table 4. From the Multi-NMT model (see the last two rows of Table 6 and Table 7), we particularly focus on the performance of the zero-shot directions that can be compared with the results from these two models.

5.2.2 Pivoting

If data are available, the pivoting strategy is the most intuitive way to accomplish zero-shot translation, or to translate from/into under-resourced languages through high-resource ones [27]. However, results in the pivoting framework are strictly bounded by the performance of the two combined translation engines, and especially by that of the weaker one. In contrast, Multi-NMT models that leverage knowledge acquired from data for different language combinations (similar to multi-task learning) can potentially compete with or even outperform the pivoting ones. Checking this possibility is the motivation for our comparison between the two approaches.
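Pivoting is a plain composition of two translation systems, which makes the error-propagation argument easy to see. The following sketch uses toy lookup tables as hypothetical stand-ins for the two bilingual engines; the function names are illustrative.

```python
def pivot_translate(sentence, source_to_pivot, pivot_to_target):
    """Translate via a pivot language by chaining two systems.

    Any error made by the first system is passed on to the second,
    which is why the weaker engine bounds the overall quality.
    """
    return pivot_to_target(source_to_pivot(sentence))

# Toy stand-ins for the Italian-to-English and English-to-Romanian engines.
IT_EN = {"ciao": "hello"}
EN_RO = {"hello": "salut"}

def italian_to_english(text):
    return IT_EN.get(text, text)  # copies the input when unknown

def english_to_romanian(text):
    return EN_RO.get(text, text)

romanian = pivot_translate("ciao", italian_to_english, english_to_romanian)
```

Note that pivoting requires two decoding passes per sentence, whereas a zero-shot Multi-NMT system translates in one.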

System              tst2010    tst2017
Italian→Romanian    16.4       15.00
Romanian→Italian    18.9       17.36

Table 5: Performance of the Italian-Romanian pivot translation directions using English as a pivot on tst2010 and tst2017.

In our experiment we take English as the bridge language between Italian and Romanian in both translation directions. Unsurprisingly, the results shown in Table 5 are lower than those of the bilingual models trained on Italian-Romanian data.

In both translation directions, the bilingual models are indeed about 3 to 4 BLEU points better (compare Tables 4 and 5). Such a comparison, however, is not the main point of our experiment. Instead, we aim to fairly analyze performance differences between pivoting and zero-shot methods trained under the same condition, which lacks Italian-Romanian training data.

5.3 Zero-shot results

In this experiment, we show how our approach helps to improve the baseline Multi-NMT model. The train-infer-train procedure described in Section 4 was applied for five rounds, each comprising multiple training iterations. Table 6 shows the improvement on the Italian-Romanian zero-shot directions using the Multi-NMT model. Specifically, the Italian→Romanian direction reached a BLEU score of 17.38, improving over the baseline (8.59) by 8.79 points. Romanian→Italian translation improved by an even larger margin (+10.71), from 8.65 to 19.36 BLEU.

Direction            Multi-NMT (baseline)    Multi-NMT (train-infer-train)
English→Italian      27.07                   28.47
Italian→English      32.12                   33.16
English→Romanian     24.65                   25.37
Romanian→English     32.7                    34.00
English→German       26.39                   26.42
German→English       31.3                    31.79
English→Dutch        30.27                   30.85
Dutch→English        35.13                   35.77
Italian→Romanian     8.59                    17.38
Romanian→Italian     8.65                    19.36

Table 6: Comparison on the test2010 set between a baseline Multi-NMT model and a Multi-NMT model with our proposed train-infer-train approach for the Italian-Romanian zero-shot directions. The best result for each direction is shown in bold.

Direction            Multi-NMT (baseline)    Multi-NMT (train-infer-train)
English→Italian      29.02                   30.43
Italian→English      32.87                   33.61
English→Romanian     20.96                   21.94
Romanian→English     27.48                   28.21
English→German       19.75                   19.85
German→English       24.12                   24.25
English→Dutch        25.37                   26.12
Dutch→English        29.25                   29.15
Italian→Romanian     8.18                    17.08
Romanian→Italian     8.58                    19.25

Table 7: Comparison on the test2017 set between a baseline Multi-NMT model and a Multi-NMT model with our proposed train-infer-train approach for the Italian-Romanian zero-shot directions.

In addition, and to our great surprise, our self-correcting mechanism performed even better than the pivoting strategy. To check the validity of our results, we also compared the baseline multilingual system and our approach on the IWSLT 2017 test set (test2017). As shown in Table 7, the results confirm those computed on test2010, with almost identical gains (+8.90 and +10.67).

The other important advantage of our approach is evidenced by the performance gains obtained on the language directions supported by parallel training corpora. To put this into perspective, all translation directions have shown improvements, except for the slight drop (-0.10 BLEU) observed for the Dutch→English direction in the test2017 case.
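The gains discussed in this section follow directly from the reported scores. The snippet below recomputes them from the zero-shot rows of Tables 5, 6 and 7; it assumes that the tst2010/tst2017 columns of Table 5 refer to the same test sets as test2010/test2017, and the dictionary layout and function name are illustrative.

```python
# BLEU scores copied from Tables 5, 6 and 7 (zero-shot directions only).
SCORES = {
    "test2010": {
        "it-ro": {"baseline": 8.59, "ours": 17.38, "pivot": 16.4},
        "ro-it": {"baseline": 8.65, "ours": 19.36, "pivot": 18.9},
    },
    "test2017": {
        "it-ro": {"baseline": 8.18, "ours": 17.08, "pivot": 15.00},
        "ro-it": {"baseline": 8.58, "ours": 19.25, "pivot": 17.36},
    },
}

def gain(test_set, direction, reference):
    """BLEU difference between our approach and a reference system."""
    row = SCORES[test_set][direction]
    return round(row["ours"] - row[reference], 2)

for test_set, directions in SCORES.items():
    for direction in directions:
        print(test_set, direction,
              "vs baseline:", gain(test_set, direction, "baseline"),
              "vs pivot:", gain(test_set, direction, "pivot"))
```

On test2017, the Italian→Romanian margin over pivoting is 17.08 - 15.00 = 2.08 BLEU, which is the largest of the four margins over pivoting.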

Figure 4: Results from test2017 for the Italian-Romanian zero-shot directions, comparing our iterative learning approach (solid lines) with the pivoting mechanism (dashed lines).

Comparing the results from every round (see Figure 4), we observe that the training after the first inference step is responsible for the largest portion of the overall gain. This is mainly due to the initial introduction of (noisy) parallel data for the zero-shot directions. The contribution of the self-correcting process can be seen in the following rounds, i.e. the improvement after each inference suggests that the generated data are getting cleaner and cleaner.

Source                  … che rafforza la corruzione, l’evasione fiscale, la povertà, l’instabilità.
Pivot                   … poarta de bază, evazia fiscală, sărăcia, instabilitatea.
Multi-NMT (baseline)    … restrânge corrupția, fiscale de evasion, poverty, instabilitate.
Multi-NMT (enhanced)    … care rafinează corupția, evasarea fiscală, sărăcia, instabilitatea.
Reference               … care protejează corupţia, evaziunea fiscală, sărăcia şi instabilitatea.

Source                  E o poveste incredibilă.
Pivot                   È una storia incredibile
Multi-NMT (baseline)    È una storia incredible.
Multi-NMT (enhanced)    È una storia incredibile
Reference               È una storia incredibile .

Source                  We can’t use them to make simple images of things out in the Universe.
Multi-NMT (baseline)    Non possiamo usarli per creare immagini semplici di cose nell’universo.
Multi-NMT (enhanced)    Non possiamo usarle per fare semplici immagini di cose nell’universo.
Reference               Non possiamo usarle per fare semplici immagini di cose nell’univero

Table 8: Top two examples: zero-shot translations generated by pivoting via English, multilingual translation (Multi-NMT, baseline) and multilingual translation enhanced with our approach (Multi-NMT, enhanced). Last example: multilingual and enhanced multilingual translation in a resourced translation direction.

Looking at the sample translation outputs using the different approaches in Table 8, we observe that the baseline Multi-NMT system produces mixed-language output (e.g. “poverty” in Italian→Romanian and “incredible” in Romanian→Italian). Thanks to our approach, the Multi-NMT system instead tends to produce more consistent target language (“poverty” becomes “sărăcia” in Italian→Romanian and “incredible” becomes “incredibile” in Romanian→Italian). Furthermore, even in the non-zero-shot directions there are cases where the enhanced Multi-NMT system produces better translations (see the last reported example).

6 Conclusions

We introduced a method to improve zero-shot translation in multilingual NMT under particularly resource-scarce training conditions. The proposed self-correcting procedure, by leveraging synthetic dual translations, achieved significant improvements over a multilingual NMT baseline and outperformed a pivoting NMT approach for the Italian-Romanian directions.

In future work, we plan to improve the train-infer-train stages to reach better performance in less time and with lower training complexity. In our current setup we did not consider any form of selection for the dataset to be translated at the inference stage of the train-infer-train procedure. We expect that applying frequency-based and similarity-based approaches to select promising training candidates can bring further improvements. Moreover, we also plan to test our approach with additional monolingual data from the two zero-shot directions. Finally, we plan to extend our approach to the translation of mixed-language sentences (i.e. code-mixing).

7 Acknowledgements

This work has been partially supported by the EC-funded projects ModernMT (H2020 grant agreement no. 645487) and QT21 (H2020 grant agreement no. 645452). This work was also supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and by a donation of Azure credits by Microsoft. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.