Multi-lingual translation models, which can map from multiple source languages into multiple target languages, have recently received significant attention because of the potential for positive transfer between high- and low-resource language pairs, and because of possible efficiency gains enabled by translation models which share parameters across many languages Dong et al. (2015); Ha et al. (2016); Firat et al. (2016); Johnson et al. (2016); Blackwood et al. (2018); Sachan and Neubig (2018); Aharoni et al. (2019). Multi-lingual models that share parameters across languages can also perform zero-shot translation, translating between language pairs for which no parallel training data is available Wu et al. (2016); Ha et al. (2016); Johnson et al. (2016).
Although multi-task models have recently been shown to achieve positive transfer for some combinations of NLP tasks, in the context of MT, multi-lingual models do not universally outperform models trained to translate in a single direction when sufficient training data is available. However, the ability to do zero-shot translation may be of practical importance in many cases, as parallel training data is not available for most language pairs Wu et al. (2016); Johnson et al. (2016); Aharoni et al. (2019). Therefore, small decreases in the performance of supervised pairs may be admissible if the corresponding gain in zero-shot performance is large. In addition, zero-shot translation can be used to generate synthetic training data for low- or zero-resource language pairs, making it a practical alternative to the bootstrapping by back-translation approach that has recently been used to build completely unsupervised MT systems Firat et al. (2016); Artetxe et al. (2018); Lample et al. (2018a, b). Therefore, understanding the trade-offs between different methods of constructing multi-lingual MT systems is still an important line of research.
Deep sequence-to-sequence models have become the established state-of-the-art for machine translation. The dominant paradigm continues to be models divided into roughly three high-level components: embeddings, which map discrete tokens into real-valued vectors; encoders, which map sequences of vectors into an intermediate representation; and decoders, which use the representation from an encoder, combined with a dynamic representation of the current state, to output a sequence of tokens in the target language conditioned upon the encoder’s representation of the input. For multi-lingual systems, any combination of embedding, encoder and/or decoder parameters can potentially be shared by groups of tasks, or duplicated and kept private for each task.
Our work builds upon recent research on many-to-one, one-to-many, and many-to-many translation models. We are interested in evaluating many-to-many models under realistic conditions, including:
A highly imbalanced amount of training data available for different language pairs.
A very diverse set of source and target languages.
Training and evaluation data from many domains.
We focus on multi-layer transformer models Vaswani et al. (2017), which achieve state-of-the-art performance on large-scale MT and NLP tasks Devlin et al. (2018); Bojar et al. (2018). The decoder component of the transformer is visualized in Figure 1. We study four ways of building multi-lingual translation models. Importantly, all of the models we study can do zero-shot translation: translating between language pairs for which no parallel data was seen at training time. The models use training data from 11 distinct languages (cs, de, en, fi, fr, gu, kk, lt, ru, tr and zh), with supervised data available from the WMT19 news-translation task for 22 of the 110 unique translation directions (note that we do not consider auto-encoding, so the number of translation directions is 11 × 10 = 110). This leaves 88 translation directions for which no parallel data is available. We try to evaluate zero-shot translation performance on all of these additional directions, using both gold parallel data, and evaluations based on pivoting or multi-hop translation.
Target Language Specification
Although the embedding and encoder parameters of a multi-lingual system may be shared across all languages without any special modification to the model, decoding from a multi-lingual model requires a means of specifying the desired output language. Previous work has accomplished this in different ways, including prepending a target-task token to the source input, adding target-task embeddings, and using task-specific decoder components (see Section 2.2).
However, to the best of our knowledge, no side-by-side comparison of these approaches has been conducted. We therefore train models which are identical except for the way that decoding into different target languages is handled, and conduct a large-scale evaluation. We use only the language pairs and official parallel data released by the WMT task organisers, meaning that all of our systems correspond to the constrained setting of the WMT shared task, and our experimental settings should thus be straightforward to replicate.
2 Multi-Task Translation Models
This section discusses the key components of the transformer-based NMT model, focusing on the various ways to enable translation into many target languages. We use the terms source/target task and language interchangeably, to emphasize our view that multi-lingual NMT is one instantiation of the more general case of multi-task sequence to sequence learning.
2.1 Shared Encoders and Embeddings
In this work, we are only interested in ways of providing target task information to the model – information about the source task is never given explicitly, and encoder parameters are always fully shared across all tasks. The segmentation model and embedding parameters are also shared between all source and target tasks (see below for more details).
2.2 Multi-lingual Decoder Configurations
Figure 1 visualizes the decoder component of the transformer model, with dashed and dotted lines indicating the parameter sets that we can replicate or share across target tasks.
2.2.1 Target Task Tokens (Prepend)
Wu et al. (2016) showed that, as long as a mechanism exists for specifying the target task, it is possible to share the decoder module’s parameters across all tasks. In the case where all parameters are shared, the decoder model must learn to operate in a number of distinct modes which are triggered by some variation in the input. A simple way to achieve this variation is by pre-pending a special "task-token" to each input. We refer to this method as Prepend.
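The Prepend mechanism amounts to a one-line preprocessing step. A minimal sketch follows; the `<2xx>` token format is an illustrative convention (the token only needs to be a reserved symbol in the shared vocabulary), not necessarily the exact format used in our systems.

```python
def prepend_task_token(source_tokens, target_lang):
    """Mark the desired output language by prepending a reserved
    task token to the source token sequence. The '<2xx>' format is
    an illustrative convention, not a prescribed one."""
    return ["<2{}>".format(target_lang)] + list(source_tokens)
```

The decoder then learns to associate the leading token with the correct output mode, with no architectural change to the model.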
2.2.2 Task Embeddings (Emb)
An alternative to the use of a special task token is to treat the target task as an additional input feature, and to train a unique embedding for each target task Lample and Conneau (2019), which is combined with the source input. This technique has the advantage of explicitly decoupling target task information from source task input, introducing a relatively small number of additional parameters. This approach can also be seen as adding an additional token-level feature which is the same for all tokens in a sequence Sennrich and Haddow (2016). We refer to this setting as Emb.
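The Emb approach can be sketched as follows. This is a simplified stand-in (plain Python lists instead of learned tensors, random initialisation instead of training), and element-wise addition is one common way of combining the task embedding with the source input; it adds only 11 × dim extra parameters.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality; real models use hundreds of dimensions
LANGS = ["cs", "de", "en", "fi", "fr", "gu", "kk", "lt", "ru", "tr", "zh"]

# one trainable vector per target task (randomly initialised here; in a
# real model these are learned alongside the token embeddings)
task_embedding = {l: [random.gauss(0.0, 0.1) for _ in range(DIM)] for l in LANGS}

def add_task_embedding(token_embeddings, target_lang):
    """Inject target-task information by adding the task embedding to
    every source token embedding (the same feature for all tokens)."""
    t = task_embedding[target_lang]
    return [[x + y for x, y in zip(tok, t)] for tok in token_embeddings]
```

Because the task vector is identical for every position, this is exactly the token-level-feature view of Sennrich and Haddow (2016) mentioned above.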
2.2.3 Task-specific Decoders (Dec)
In general, any subset of decoder parameters may be replicated for each target language, resulting in parameter sets which are specific to each target task. At one extreme, the entire decoder module may be replicated for each target language, a setting which we label Dec Dong et al. (2015).
2.2.4 Task-specific Attention (Attn)
An approach somewhere in-between Emb and Dec is to partially share some of the decoder parameters, while keeping others unique to each task. Recent work proposed creating unique attention modules for every target task, while sharing the other decoder parameters Sachan and Neubig (2018); Blackwood et al. (2018). The implementations of their approaches differ significantly – we propose to create completely unique attention parameters for each task. This means that for each of our 11 languages, we have unique context- and self-attention parameters in each layer of the transformer decoder. We refer to this setting as Attn.
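The parameter layout of the Attn scheme can be sketched as below. Dicts stand in for weight tensors, and the class is a hypothetical simplification of one transformer decoder layer: the self- and context-attention parameter sets are unique per target language, while the feed-forward sub-layer (and everything else) is shared.

```python
LANGS = ["cs", "de", "en", "fi", "fr", "gu", "kk", "lt", "ru", "tr", "zh"]

class DecoderLayer:
    """One decoder layer under the Attn scheme: attention parameters
    are per-target-language, the feed-forward block is shared."""

    def __init__(self, langs):
        self.shared_ffn = {"W1": None, "W2": None}  # shared across all tasks
        self.self_attn = {l: {"Wq": None, "Wk": None, "Wv": None} for l in langs}
        self.context_attn = {l: {"Wq": None, "Wk": None, "Wv": None} for l in langs}

    def params_for(self, target_lang):
        # select the attention parameters belonging to this target task
        return (self.self_attn[target_lang],
                self.context_attn[target_lang],
                self.shared_ffn)
```

At decoding time the model routes each batch through the attention parameters of its target language, so no input-side task marking is needed.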
All experiments are conducted using the transformer-base configuration of Vaswani et al. (2017) with the relevant modifications for each system discussed in the previous section. We use a shared sentencepiece (https://github.com/google/sentencepiece) segmentation model with 32000 pieces. We use all available parallel data from the WMT19 news-translation task for training, with the exception of commoncrawl, which we found to be very noisy after manually checking a sample of the data, and paracrawl, which we use only for en-fi and en-lt (Turkish (tr) is included from the 2018 language pairs because the task organizers suggest the possibility of using tr data to improve kk performance).
We train each model on two P100 GPUs with an individual batch size of up to 2048 tokens. Gradients are accumulated over 8 mini-batches and parameters are updated synchronously, meaning that our effective batch size is up to 2048 × 8 = 16,384 tokens per GPU per iteration. Because the task pair for each mini-batch is sampled according to our policy weights and (fixed) random seed, and each iteration consists of 8 unique mini-batches, a single parameter update can potentially contain information from up to 8 unique task pairs. We train each model for 100,000 iterations without early stopping, which takes about 40 hours per model. When evaluating, we always use the final model checkpoint (i.e. the model parameters saved after 100,000 iterations). We use our in-house research NMT system, which is heavily based upon OpenNMT-py Klein et al. (2017).
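The control flow of one parameter update can be sketched as follows. The task list and weights here are illustrative placeholders (not our actual policy weights), and the gradient computation is stubbed out; the point is that each update mixes gradients from up to 8 task pairs sampled with a fixed seed.

```python
import random

TASKS = ["en-cs", "cs-en", "en-de", "de-en", "en-kk", "kk-en"]
WEIGHTS = [3, 3, 4, 4, 1, 1]  # illustrative policy weights only

def one_update(rng, accumulate=8):
    """One synchronous parameter update: sample a task pair for each
    of the 8 accumulated mini-batches. Gradient math is stubbed out."""
    sampled = rng.choices(TASKS, weights=WEIGHTS, k=accumulate)
    # ... forward/backward on one mini-batch per sampled task,
    #     summing gradients before a single optimizer step ...
    return sampled

rng = random.Random(0)  # fixed seed, as in the training setup
tasks_in_update = one_update(rng)
```

Because the 8 draws are independent, a single update may touch anywhere between 1 and 8 distinct task pairs.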
The sampling policy weights were specified manually by looking at the amount of available data for each pair, and estimating the difficulty of each translation direction. The result of the sampling policy is that lower-resource language pairs are upsampled significantly. Table 1 summarizes the statistics for each language pair. Note that the data in each row represents a pair of tasks, i.e. the total number of segments seen for en-cs is split evenly between en→cs and cs→en. Because we train for only 100,000 iterations, we do not see all of the available training data for some high-resource language pairs.
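The upsampling effect can be illustrated numerically. The corpus sizes and policy weights below are hypothetical (they do not match Table 1); they show how a low-resource pair's share of mini-batches can far exceed its share of the data.

```python
import random

# hypothetical corpus sizes (segments) and hand-picked policy weights;
# neither matches the paper's actual numbers
data_size = {"en-de": 4_000_000, "en-ru": 2_000_000, "en-kk": 100_000}
policy    = {"en-de": 3,         "en-ru": 3,         "en-kk": 2}

rng = random.Random(0)
pairs, weights = zip(*policy.items())
draws = rng.choices(pairs, weights=weights, k=100_000)

share_of_data    = data_size["en-kk"] / sum(data_size.values())  # ~1.6%
share_of_batches = draws.count("en-kk") / len(draws)             # ~25%
```

Under these toy numbers, en-kk holds under 2% of the segments but receives a quarter of the mini-batches, which is the kind of upsampling the policy produces for the lowest-resource pairs.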
With the exception of the Prepend system, the input to each model is identical. Each experimental setting is mutually exclusive, i.e. in the Emb setting we do not prepend task tokens, and in the Attn setting we do not use task embeddings.
We conduct four different evaluations of the performance of our models. First, we check performance on the 22 supervised pairs using dev and test sets from the WMT shared task. We then try to evaluate zero-shot translation performance in several ways. We use the TED talks multi-parallel dataset Ye et al. (2018) to create gold sets for all zero-shot pairs that occur in the TED talks corpus, and evaluate on those pairs. We also try two ways of evaluating zero-shot translation without gold data. In the first, we do round-trip translation (source → pivot → source) and measure the performance of the round-tripped output against the original source segments – this method is labeled Zero-Shot Pivot. In the second, we use parallel evaluation datasets from the WMT shared tasks (consisting of source–target pairs), translate source → pivot → target, then measure performance on the resulting target-language hypotheses (see below for more details), where the pivot–target language pair is a zero-shot translation task – this method is labeled Zero-Shot Parallel Pivot. For the Zero-Shot Pivot and Zero-Shot Parallel Pivot evaluations we use the first 1000 segments of each dataset, because we need to translate twice for every possible pair.
Table 2 lists the WMT evaluation dataset that we use for each language pair. In the Zero-Shot Pivot setting, the reference side of the dataset is used as input.
Table 3 shows global results for all parallel tasks and all zero-shot tasks, by system. Global scores are obtained by concatenating the segmented outputs for each translation direction, and computing the BLEU score against the corresponding concatenated, segmented reference translations. The results in Table 3 are thus tokenized BLEU scores.
3.2 Parallel Tasks
Table 4 lists results for all supervised task pairs from WMT 2019. For each pair, we report BLEU scores on de-tokenized output, computed using sacrebleu (signature: BLEU+case.mixed+lang.<src-lang>-<trg-lang>+numrefs.1+smooth.exp+tok.<trg-lang>+version.1.2.19). Therefore, we expect our BLEU scores to be equivalent to those used in the WMT automatic evaluation.
We note that across all but the lowest-resource tasks, the model with a unique decoder for each language outperforms all others. However, for en→gu and en→kk, the lowest-resource translation directions, the unique-decoder model fails completely, probably because the unique parameters for kk and gu were not updated by a sufficient number of mini-batches (approximately 15,600 for en→gu and 14,800 for en→kk).
3.3 Zero-shot Translation Tasks
In order to test our models in the zero-shot setting, we first create a multi-parallel dataset from the TED Talks multi-parallel corpus Ye et al. (2018), which has recently been used for the training and evaluation of multi-lingual models. We filter the dev and test sets of this corpus to find segments which have translations for all of en, fr, ru, tr, de, cs, lt, fi, and are at least 20 characters long, resulting in 606 segments. Because this corpus is pre-processed, we first de-tokenize and de-escape punctuation using sacremoses (https://github.com/alvations/sacremoses). We then evaluate zero-shot translation for all possible pairs which do not occur in our parallel training data; aggregate results are shown in the second row of Table 3.
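The filtering step above can be sketched as follows. The `corpus` layout (lang → {segment_id: text}) is a hypothetical stand-in for the aligned TED dev/test files, not their actual on-disk format.

```python
def multiparallel_subset(corpus, langs, min_chars=20):
    """Keep segment ids that have a translation in every language of
    interest, each at least `min_chars` characters long. `corpus` maps
    lang -> {segment_id: text} (a hypothetical layout)."""
    common = set.intersection(*(set(corpus[l]) for l in langs))
    return sorted(i for i in common
                  if all(len(corpus[l][i]) >= min_chars for l in langs))
```

The length threshold filters out very short segments (greetings, fragments) that would otherwise inflate BLEU on trivial outputs.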
We then adapt an evaluation technique that has recently been used for unsupervised MT – we translate from the source language into a pivot language, then back into the source language, and evaluate the score of the resulting source-language hypotheses against the original source Lample et al. (2018a). This technique allows us to evaluate for all possible translation directions in our multi-directional model.
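The round-trip procedure is simple enough to state as code. Here `translate` and `score` are stand-ins for the NMT system and BLEU; the function names and signatures are illustrative.

```python
def round_trip_score(segments, translate, score, src, pivot):
    """Zero-Shot Pivot evaluation: translate src -> pivot -> src and
    score the round-tripped hypotheses against the original source
    segments. `translate(text, from_lang, to_lang)` and
    `score(hyps, refs)` stand in for the NMT system and BLEU."""
    pivoted  = [translate(s, src, pivot) for s in segments]
    restored = [translate(p, pivot, src) for p in pivoted]
    return score(restored, segments)
```

Since no reference in the pivot language is needed, this procedure covers all 110 directions of the multi-directional model.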
(# failed pivot tasks per system: 3 / 31 / 1 / 1)
Aware of the risk that the model simply copies through the original source segment instead of translating, we assert that at least 95% of pivot translations’ language codes are correctly detected by langid (https://github.com/saffsd/langid.py), and pairs which do not meet this criterion for any system are removed from the evaluation for all systems (not just for the system that failed). For all models except Emb, only ru→kk→ru, fi→lt→fi, and zh→gu→zh failed this test, but for the Emb model 31 of the 110 translation directions failed (see Tables 6 and 7 in the appendix; we conduct round-trip translation on all 110 directions, but we only use directions that are (1) not available in the parallel training data, and (2) pass the language identification test to compute the global zero-shot translation performance). This result indicates that models which use language embeddings may have a more "fuzzy" representation of the output task, and are much more prone to copying than other approaches to multi-lingual MT.
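The copy-through guard amounts to a threshold over a language-identification pass. A minimal sketch, where `detect` stands in for a language identifier such as `langid.classify`:

```python
def passes_language_filter(pivot_hyps, pivot_lang, detect, threshold=0.95):
    """Copy-through guard: keep a pivot direction only if at least 95%
    of its intermediate translations are identified as the pivot
    language. `detect(text)` returns a language code."""
    hits = sum(1 for h in pivot_hyps if detect(h) == pivot_lang)
    return hits / len(pivot_hyps) >= threshold
```

A direction that fails for any system is dropped for all systems, so the surviving set of directions is comparable across models.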
Finally, we conduct the Zero-Shot Parallel Pivot evaluation using the same datasets as in Table 2, translating from EN (or DE in the case of *-FR) to each possible pivot language, and then from the pivot language into the target language. Compared to the Zero-Shot Pivot setting, this evaluation should help to protect against the risk of copying, because the source and reference segments are not in the same language. Aggregate results for this setting are shown in the third row of Table 3, with full results in Table 8 in the appendix.
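The two-hop chain differs from the round trip only in its final leg. A sketch (again with `translate` as a stand-in for the NMT system; scoring against the gold target references is not shown):

```python
def parallel_pivot_outputs(src_segments, translate, src, pivot, trg):
    """Zero-Shot Parallel Pivot: translate src -> pivot, then
    pivot -> trg. The resulting trg hypotheses are scored against
    gold trg references. Only the pivot -> trg leg is zero-shot."""
    pivoted = [translate(s, src, pivot) for s in src_segments]
    return [translate(p, pivot, trg) for p in pivoted]
```

Because the references are in a different language than the input, a model that merely copies its input scores near zero here rather than being rewarded.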
Our results show that models with either (1) completely unique decoders for each target language or (2) unique decoder attention parameters for each target language clearly outperform models with fully shared decoder parameters.
As shown in Table 3, the Zero-Shot Pivot evaluation is the outlier in our results, with the Emb system outperforming the others. Even for the languages which passed the language identification filter used in this evaluation, we suspect that some copying is occurring for the Emb system, because of the mismatch in results between the Zero-Shot Pivot task and the Supervised, Zero-Shot TED, and Zero-Shot Parallel Pivot tasks (see Table 3). Since the ranking of the models according to the Zero-Shot Parallel Pivot evaluation is well aligned with the Zero-Shot TED and Supervised evaluations, which use gold parallel evaluation data, we believe that this method is effective for zero-shot evaluation of translation quality for language pairs where no gold data is available.
It is plausible that the language-independence of encoder output could be correlated with the amount of sharing in the decoder module. Because most non-English target tasks only have parallel training data in English, a unique decoder for those tasks only needs to learn to decode from English, not from every possible source task. However, our results show that the Attn model, which partially shares parameters across target languages, only slightly outperforms the Dec model globally, because of the improved performance of the Attn model on the lowest-resource tasks (Table 4, Table 7 in the appendix). Thus we conclude that multi-lingual encoders still learn to share information across languages, even when trained using decoders that are unique to each target task.
4 Related Work
Dong et al. (2015); Firat et al. (2016); Ha et al. (2016); Johnson et al. (2016) and others have shown that multi-way NMT systems can be created with minimal modification to the approach used for single-language-pair systems. Johnson et al. (2016) showed that simply prepending a target-task token to source inputs is enough to enable zero-shot translation between language pairs for which no parallel training data is available.
Our work is most similar to Sachan and Neubig (2018), where several different strategies for sharing decoder parameters are investigated for one-to-many translation models. However, their evaluation setting is constrained to one-to-many models which translate from English into two target languages, whereas our setting is more ambitious, performing multi-way translation between 11 languages. Blackwood et al. (2018) showed that using separate attention parameters for each task can improve the performance of multi-task MT models – this work was the inspiration for the Attn setting in our experiments.
Several recent papers focus specifically upon improving the zero-shot performance of multi-lingual MT models Chen et al. (2017); Arivazhagan et al. (2019); Gu et al. (2019); Lu et al. (2018); Al-Shedivat and Parikh (2019); Sestorain et al. (2019).
Concurrently with this work, Aharoni et al. (2019) evaluated a multi-way MT system on a large number of language pairs using the TED talks corpus. However, they focus upon EN-* and *-EN, and do not test different model variants.
5 Conclusions and Future Work
We have presented results which are consistent with recent smaller-scale evaluations of multi-lingual MT systems, showing that assigning unique attention parameters to each target language in a multi-lingual NMT system is optimal when evaluating such a system globally. However, when evaluated on the individual task level, models which have unique decoder parameters for every target task tend to outperform other configurations, except when the amount of available training data is extremely small. We have also introduced two methods of evaluating zero-shot translation performance when parallel data is not available, and we conducted a large-scale evaluation of translation performance across all possible translation directions in the constrained setting of the WMT19 news-translation task.
In future work, we hope to continue studying how multi-lingual translation systems scale to realistic volumes of training data and large numbers of source and target tasks.
- Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
- Al-Shedivat and Parikh (2019) Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of NAACL.
- Arivazhagan et al. (2019) Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091.
- Artetxe et al. (2018) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of the Sixth International Conference on Learning Representations.
- Blackwood et al. (2018) Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3112–3122, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Bojar et al. (2018) Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels. Association for Computational Linguistics.
- Chen et al. (2017) Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. 2017. A teacher-student framework for zero-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1925–1935, Vancouver, Canada. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.
- Firat et al. (2016) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.
- Gu et al. (2019) Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations.
- Ha et al. (2016) Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798.
- Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Technical report, Google.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Lample et al. (2018a) Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- Lample et al. (2018b) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Lu et al. (2018) Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 84–92, Belgium, Brussels. Association for Computational Linguistics.
- Luong et al. (2016) Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations.
- McInnes et al. (2018) Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. Umap: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.
- Sachan and Neubig (2018) Devendra Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 261–271, Belgium, Brussels. Association for Computational Linguistics.
- Sennrich and Haddow (2016) Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany. Association for Computational Linguistics.
- Sestorain et al. (2019) Lierni Sestorain, Massimiliano Ciaramita, Christian Buck, and Thomas Hofmann. 2019. Zero-shot dual machine translation.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
- Ye et al. (2018) Qi Ye, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In HLT-NAACL.