Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models

06/24/2019 ∙ by Chris Hokamp, et al.

We study several methods for full or partial sharing of the decoder parameters of multilingual NMT models. We evaluate both fully supervised and zero-shot translation performance in 110 unique translation directions using only the WMT 2019 shared task parallel datasets for training. We use additional test sets and re-purpose evaluation methods recently used for unsupervised MT in order to evaluate zero-shot translation performance for language pairs where no gold-standard parallel data is available. To our knowledge, this is the largest evaluation of multi-lingual translation yet conducted in terms of the total size of the training data we use, and in terms of the diversity of zero-shot translation pairs we evaluate. We conduct an in-depth evaluation of the translation performance of different models, highlighting the trade-offs between methods of sharing decoder parameters. We find that models which have task-specific decoder parameters outperform models where decoder parameters are fully shared across all tasks.




1 Introduction

Multi-lingual translation models, which can map from multiple source languages into multiple target languages, have recently received significant attention because of the potential for positive transfer between high- and low-resource language pairs, and because of possible efficiency gains enabled by translation models which share parameters across many languages Dong et al. (2015); Ha et al. (2016); Firat et al. (2016); Johnson et al. (2016); Blackwood et al. (2018); Sachan and Neubig (2018); Aharoni et al. (2019). Multi-lingual models that share parameters across languages can also perform zero-shot translation, translating between language pairs for which no parallel training data is available Wu et al. (2016); Ha et al. (2016); Johnson et al. (2016).

Although multi-task models have recently been shown to achieve positive transfer for some combinations of NLP tasks, in the context of MT, multi-lingual models do not universally outperform models trained to translate in a single direction when sufficient training data is available. However, the ability to do zero-shot translation may be of practical importance in many cases, as parallel training data is not available for most language pairs Wu et al. (2016); Johnson et al. (2016); Aharoni et al. (2019). Therefore, small decreases in the performance of supervised pairs may be admissible if the corresponding gain in zero-shot performance is large. In addition, zero-shot translation can be used to generate synthetic training data for low- or zero- resource language pairs, making it a practical alternative to the bootstrapping by back-translation approach that has recently been used to build completely unsupervised MT systems Firat et al. (2016); Artetxe et al. (2018); Lample et al. (2018a, b). Therefore, understanding the trade-offs between different methods of constructing multi-lingual MT systems is still an important line of research.

Deep sequence-to-sequence models have become the established state-of-the-art for machine translation. The dominant paradigm continues to be models divided into roughly three high-level components: embeddings, which map discrete tokens into real-valued vectors, encoders, which map sequences of vectors into an intermediate representation, and decoders, which use the representation from an encoder, combined with a dynamic representation of the current state, and output a sequence of tokens in the target language conditioned upon the encoder’s representation of the input. For multi-lingual systems, any combination of embedding, encoder and/or decoder parameters can potentially be shared by groups of tasks, or duplicated and kept private for each task.

Figure 1: The decoder component of the transformer model Vaswani et al. (2017). All parameters may be shared across all target tasks, or a unique set of decoder parameters can be created for each task (outer dashed line). Alternatively, we can create unique attention parameters for each task, while sharing the final feed-forward layers (inner dotted lines). The possibility of including an embedding for the target task is visualized at the bottom of the diagram. Illustration modeled after Sachan and Neubig (2018).

Our work builds upon recent research on many-to-one, one-to-many, and many-to-many translation models. We are interested in evaluating many-to-many models under realistic conditions, including:

  1. A highly imbalanced amount of training data available for different language pairs.

  2. A very diverse set of source and target languages.

  3. Training and evaluation data from many domains.

We focus on multi-layer transformer models Vaswani et al. (2017), which achieve state-of-the-art performance on large-scale MT and NLP tasks Devlin et al. (2018); Bojar et al. (2018). The decoder component of the transformer is visualized in figure 1. We study four ways of building multi-lingual translation models. Importantly, all of the models we study can do zero-shot translation: translating between language pairs for which no parallel data was seen at training time. The models use training data from 11 distinct languages (cs, de, en, fi, fr, gu, kk, lt, ru, tr and zh), with supervised data available from the WMT19 news-translation task for 22 of the 110 unique translation directions. (We do not consider auto-encoding, so the number of translation directions is $11^2 - 11 = 110$.) This leaves 88 translation directions for which no parallel data is available. We try to evaluate zero-shot translation performance on all of these additional directions, using both gold parallel data, and evaluations based on pivoting or multi-hop translation.
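The direction counts above can be checked with a few lines of Python. This is only an illustrative sketch: the language codes are those listed in the text, and the figure of 22 supervised directions is taken from the text rather than derived from the data.

```python
from itertools import permutations

# Language codes as listed in the text.
LANGUAGES = ["cs", "de", "en", "fi", "fr", "gu", "kk", "lt", "ru", "tr", "zh"]

# Ordered (source, target) pairs with source != target, i.e. no auto-encoding.
directions = list(permutations(LANGUAGES, 2))

# Number of supervised directions as stated in the text (not derived here).
NUM_SUPERVISED = 22
num_zero_shot = len(directions) - NUM_SUPERVISED

print(len(directions), num_zero_shot)
```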

Target Language Specification

Although the embedding and encoder parameters of a multi-lingual system may be shared across all languages without any special modification to the model, decoding from a multi-lingual model requires a means of specifying the desired output language. Previous work has accomplished this in different ways, including:

  • pre-pending a special target-language token to the input Wu et al. (2016)

  • using an additional embedding vector for the target language Lample and Conneau (2019)

  • using unique decoders for each target language Luong et al. (2016); Firat et al. (2016)

  • partially sharing some of the decoder parameters while keeping others unique to each target language Sachan and Neubig (2018); Blackwood et al. (2018).

However, to the best of our knowledge, no side-by-side comparison of these approaches has been conducted. We therefore train models which are identical except for the way that decoding into different target languages is handled, and conduct a large-scale evaluation. We use only the language pairs and official parallel data released by the WMT task organisers, meaning that all of our systems correspond to the constrained setting of the WMT shared task, and our experimental settings should thus be straightforward to replicate.

2 Multi-Task Translation Models

This section discusses the key components of the transformer-based NMT model, focusing on the various ways to enable translation into many target languages. We use the terms source/target task and language interchangeably, to emphasize our view that multi-lingual NMT is one instantiation of the more general case of multi-task sequence to sequence learning.

2.1 Shared Encoders and Embeddings

In this work, we are only interested in ways of providing target task information to the model – information about the source task is never given explicitly, and encoder parameters are always fully shared across all tasks. The segmentation model and embedding parameters are also shared between all source and target tasks (see below for more details).

2.2 Multi-lingual Decoder Configurations

Figure 1 visualizes the decoder component of the transformer model, with dashed and dotted lines indicating the parameter sets that we can replicate or share across target tasks.

2.2.1 Target Task Tokens (Prepend)

Wu et al. (2016) showed that, as long as a mechanism exists for specifying the target task, it is possible to share the decoder module’s parameters across all tasks. In the case where all parameters are shared, the decoder model must learn to operate in a number of distinct modes which are triggered by some variation in the input. A simple way to achieve this variation is by pre-pending a special "task-token" to each input. We refer to this method as Prepend.
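As a minimal sketch of the Prepend idea, the function below puts a target-task token in front of an already segmented source sequence. The `<2xx>` token format and the function name are hypothetical choices for illustration; the text does not specify the exact token used.

```python
def prepend_task_token(source_tokens, target_lang):
    """Return a new token list with a target-task token in front.

    The "<2xx>" token format is a hypothetical choice for this sketch.
    """
    return [f"<2{target_lang}>"] + list(source_tokens)


# The same source segment, routed to two different target tasks.
to_german = prepend_task_token(["▁Hello", "▁world"], "de")
to_finnish = prepend_task_token(["▁Hello", "▁world"], "fi")
print(to_german)
print(to_finnish)
```

In this scheme the task token would need to be part of the shared vocabulary, so that the model can learn an embedding for it like any other input token.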

2.2.2 Task Embeddings (Emb)

An alternative to the use of a special task token is to treat the target task as an additional input feature, and to train a unique embedding for each target task Lample and Conneau (2019), which is combined with the source input. This technique has the advantage of explicitly decoupling target task information from source task input, introducing a relatively small number of additional parameters. This approach can also be seen as adding an additional token-level feature which is the same for all tokens in a sequence Sennrich and Haddow (2016). We refer to this setting as Emb.
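The following sketch illustrates one way a task embedding could be combined with the input: the same target-task vector is added to every token vector in the sequence. Combining by addition, the tiny dimensionality, and all names here are assumptions made for illustration; the text only states that the task embedding is "combined with the source input".

```python
import random

EMB_DIM = 4  # tiny dimension, for illustration only
rng = random.Random(0)


def random_vector(dim):
    """Stand-in for a trainable embedding vector."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


# One vector per target task (hypothetical initialisation).
task_embeddings = {lang: random_vector(EMB_DIM) for lang in ["de", "fi", "kk"]}


def add_task_embedding(token_vectors, target_lang):
    """Add the same target-task vector to every token vector (assumed: addition)."""
    task_vec = task_embeddings[target_lang]
    return [[t + k for t, k in zip(tok, task_vec)] for tok in token_vectors]


source = [random_vector(EMB_DIM) for _ in range(3)]  # stand-in token embeddings
combined = add_task_embedding(source, "de")
```

Because the task vector is identical at every position, it behaves like a token-level feature that is constant across the sequence, which matches the interpretation given above.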

2.2.3 Task-specific Decoders (Dec)

In general, any subset of decoder parameters may be replicated for each target language, resulting in parameter sets which are specific to each target task. At one extreme, the entire decoder module may be replicated for each target language, a setting which we label Dec Dong et al. (2015).

2.2.4 Task-specific Attention (Attn)

An approach somewhere in-between Emb and Dec is to partially share some of the decoder parameters, while keeping others unique to each task. Recent work proposed creating unique attention modules for every target task, while sharing the other decoder parameters Sachan and Neubig (2018); Blackwood et al. (2018). The implementations of their approaches differ significantly – we propose to create completely unique attention parameters for each task. This means that for each of our 11 languages, we have unique context- and self-attention parameters in each layer of the transformer decoder. We refer to this setting as Attn.
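The four decoder configurations can be summarized as a rule that decides, for each decoder component, whether its parameters are shared or owned by a target task. The sketch below is a hypothetical illustration: the component names and the function are not from any particular toolkit, and components not mentioned in the text (for example layer normalization) are left out.

```python
SHARED = "shared"


def decoder_parameter_owner(setting, component, target_lang):
    """Return which parameter set a decoder component uses.

    setting:   "Prepend", "Emb", "Dec" or "Attn"
    component: "self_attention", "context_attention" or "feed_forward"
    """
    if setting in ("Prepend", "Emb"):
        return SHARED  # decoder parameters fully shared across target tasks
    if setting == "Dec":
        return target_lang  # whole decoder replicated per target task
    if setting == "Attn":
        # Attention is task-specific, feed-forward layers stay shared.
        is_attention = component.endswith("attention")
        return target_lang if is_attention else SHARED
    raise ValueError(f"unknown setting: {setting}")


for setting in ("Prepend", "Emb", "Dec", "Attn"):
    owners = {
        comp: decoder_parameter_owner(setting, comp, "kk")
        for comp in ("self_attention", "context_attention", "feed_forward")
    }
    print(setting, owners)
```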

3 Experiments

All experiments are conducted using the transformer-base configuration of Vaswani et al. (2017) with the relevant modifications for each system discussed in the previous section. We use a shared sentencepiece (https://github.com/google/sentencepiece) segmentation model with 32000 pieces. We use all available parallel data from the WMT19 news-translation task for training, with the exception of commoncrawl, which we found to be very noisy after manually checking a sample of the data, and paracrawl, which we use only for en-fi and en-lt. Turkish (tr) is included from the 2018 language pairs because the task organizers suggest the possibility of using tr data to improve kk performance.

We train each model on two P100 GPUs with an individual batch size of up to 2048 tokens. Gradients are accumulated over 8 mini-batches and parameters are updated synchronously, meaning that our effective batch size is $2 \times 8 \times 2048 = 32{,}768$ tokens per iteration. Because the task pair for each mini-batch is sampled according to our policy weights and (fixed) random seed, and each iteration consists of 8 unique mini-batches, a single parameter update can potentially contain information from up to 8 unique task pairs. We train each model for 100,000 iterations without early stopping, which takes about 40 hours per model. When evaluating we always use the final model checkpoint (i.e. the model parameters saved after 100,000 iterations). We use our in-house research NMT system, which is heavily based upon OpenNMT-py Klein et al. (2017).
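The sketch below illustrates the batching arithmetic and the per-mini-batch task sampling described above. It assumes that the effective batch size is the product of the number of GPUs, the per-GPU token limit, and the number of accumulated mini-batches, that the policy weights equal the "% budget" column of Table 1, and that the direction within a pair is chosen with equal probability. The function and variable names are hypothetical and do not come from the authors' system.

```python
import random

NUM_GPUS = 2
MAX_TOKENS_PER_BATCH = 2048
ACCUMULATION_STEPS = 8

# Upper bound on tokens contributing to one parameter update (assumed product).
effective_batch_tokens = NUM_GPUS * MAX_TOKENS_PER_BATCH * ACCUMULATION_STEPS

# Assumed policy weights: the "% budget" column of Table 1.
POLICY = {
    "en-cs": 10.7, "en-de": 8.3, "en-fi": 10.7, "en-gu": 3.9, "en-kk": 3.7,
    "en-lt": 11.2, "en-ru": 15.6, "en-tr": 4.9, "en-zh": 18.1, "de-fr": 12.7,
}


def sample_iteration(rng, steps=ACCUMULATION_STEPS):
    """Sample one (source, target) task per accumulated mini-batch."""
    pairs = list(POLICY)
    weights = [POLICY[pair] for pair in pairs]
    tasks = []
    for pair in rng.choices(pairs, weights=weights, k=steps):
        a, b = pair.split("-")
        # Assumed: each direction of a pair is equally likely.
        tasks.append((a, b) if rng.random() < 0.5 else (b, a))
    return tasks


rng = random.Random(1234)  # fixed seed, so the task sequence is reproducible
tasks = sample_iteration(rng)
print(effective_batch_tokens, tasks)
```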

Pair     # seen        # available   # epochs   % budget
en-cs    3,466,692     51,136,198    0.06       10.7
en-de    2,678,808     3,054,632     0.88       8.3
en-fi    3,466,692     6,457,071     0.54       10.7
en-gu    1,260,615     137,905       9.14       3.9
en-kk    1,181,827     158,067       7.47       3.7
en-lt    3,624,269     2,283,272     1.59       11.2
en-ru    5,042,462     11,391,126    0.44       15.6
en-tr    1,575,769     207,678       7.58       4.9
en-zh    5,846,104     14,549,833    0.40       18.1
de-fr    4,097,000     1,980,332     2.06       12.7
Total    32,240,238    91,356,114    -          100

Table 1: Training dataset statistics for our multilingual NMT experiments. # seen is the total number of segments seen during training. # available is the number of unique segments available in the parallel training datasets. # epochs is the number of passes made over the available training data – when this is $< 1$, the available training data was only partially seen. % budget is the percentage of the training budget allocated to this pair of tasks.

The sampling policy weights were specified manually by looking at the amount of available data for each pair, and estimating the difficulty of each translation direction. The result of the sampling policy is that lower resource language pairs are upsampled significantly. Table 1 summarizes the statistics for each language pair. Note that the data in each row represents a pair of tasks, i.e. the total number of segments seen for en-cs is split evenly between en→cs and cs→en. Because we train for only 100,000 iterations, we do not see all of the available training data for some high-resource language pairs.
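The derived columns of Table 1 follow from the two count columns. The sketch below recomputes them under the assumption that # epochs is # seen divided by # available and that % budget is each pair's share of the total # seen; a few recomputed values may differ from the table in the last digit, depending on rounding.

```python
# pair: (# seen, # available), copied from Table 1
TABLE_1 = {
    "en-cs": (3_466_692, 51_136_198),
    "en-de": (2_678_808, 3_054_632),
    "en-fi": (3_466_692, 6_457_071),
    "en-gu": (1_260_615, 137_905),
    "en-kk": (1_181_827, 158_067),
    "en-lt": (3_624_269, 2_283_272),
    "en-ru": (5_042_462, 11_391_126),
    "en-tr": (1_575_769, 207_678),
    "en-zh": (5_846_104, 14_549_833),
    "de-fr": (4_097_000, 1_980_332),
}

total_seen = sum(seen for seen, _ in TABLE_1.values())

# Assumed definitions of the derived columns.
epochs = {pair: seen / avail for pair, (seen, avail) in TABLE_1.items()}
budget = {pair: 100 * seen / total_seen for pair, (seen, _) in TABLE_1.items()}

for pair in TABLE_1:
    print(f"{pair}: {epochs[pair]:.2f} epochs, {budget[pair]:.1f}% of budget")
```

The output makes the upsampling visible: the low-resource pairs en-gu, en-kk and en-tr are each seen more than seven times, while less than a tenth of the en-cs data is seen once.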

With the exception of the Prepend system, the input to each model is identical. Each experimental setting is mutually exclusive, i.e. in the Emb setting we do not prepend task tokens, and in the Attn setting we do not use task embeddings.

Figure 2 plots the validation performance during training on one of our validation datasets. The language embeddings from the Emb system are visualized in figure 3.

Figure 2: Word-level accuracy on WMT EN-DE 2014 dev set as training progresses. The model which has a DE-specific decoder achieves the highest accuracy on this dev set.
Figure 3: Language embeddings of the Emb system projected with UMAP McInnes et al. (2018).

3.1 Results

Pair     Evaluation Dataset
en-cs    newstest2018
en-de    newstest2018
en-fi    newstest2018
en-gu    newsdev2019
en-kk    newsdev2019
en-lt    newsdev2019
en-ru    newstest2018
en-tr    newstest2018
en-zh    newstest2018
de-fr    euelections_dev2019
Table 2: The WMT evaluation dataset used for each language pair.

We conduct four different evaluations of the performance of our models. First, we check performance on the 22 supervised pairs using dev and test sets from the WMT shared task. We then try to evaluate zero-shot translation performance in several ways. We use the TED talks multi-parallel dataset Ye et al. (2018) to create gold sets for all zero-shot pairs that occur in the TED talks corpus, and evaluate on those pairs. We also try two ways of evaluating zero-shot translation without gold data. In the first, we do round-trip translation $\text{SRC} \rightarrow \text{PIVOT} \rightarrow \widehat{\text{SRC}}$, and measure performance on the $(\text{SRC}, \widehat{\text{SRC}})$ pair – this method is labeled Zero-Shot Pivot. In the second, we use parallel evaluation datasets from the WMT shared tasks (consisting of $(\text{SRC}, \text{REF})$ pairs), and translate $\text{SRC} \rightarrow \text{PIVOT} \rightarrow \widehat{\text{TRG}}$, then measure performance on the resulting $(\widehat{\text{TRG}}, \text{REF})$ pairs (see below for more details), where the pivot and target language pair is a zero-shot translation task – this method is labeled Zero-Shot Parallel Pivot. For the Zero-Shot Pivot and Zero-Shot Parallel Pivot evaluations we use the first 1000 segments of each dataset, because we need to translate twice for every possible pair.

Table 2 lists the WMT evaluation dataset that we use for each language pair. In the Zero-Shot Pivot setting, the reference side of the dataset is used as input.

Table 3 shows global results for all parallel tasks and all zero-shot tasks, by system. Global scores are obtained by concatenating the segmented outputs for each translation direction, and computing the BLEU score against the corresponding concatenated, segmented reference translations. The results in table 3 are thus tokenized BLEU scores.
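To make the global scoring procedure concrete, the sketch below pools segmented hypotheses and references from all directions and scores them once. The `corpus_bleu` function is a simplified, unsmoothed, single-reference BLEU written for this illustration; it is not the implementation used in the paper, and all names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU over whitespace-separated pieces."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_counts & r_counts).values())  # clipped
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return 100 * math.exp(log_brevity + log_precision)


def global_bleu(outputs_by_direction):
    """outputs_by_direction maps a direction to (hypotheses, references)."""
    all_hyps, all_refs = [], []
    for hyps, refs in outputs_by_direction.values():
        all_hyps.extend(hyps)
        all_refs.extend(refs)
    return corpus_bleu(all_hyps, all_refs)


demo = {
    "en-de": (["▁das ▁ist ▁ein ▁kleiner ▁Test"], ["▁das ▁ist ▁ein ▁kleiner ▁Test"]),
    "de-fr": (["▁ceci ▁est ▁un ▁petit ▁test"], ["▁ceci ▁est ▁un ▁petit ▁essai"]),
}
print(global_bleu(demo))
```

Because n-gram statistics are pooled before the score is computed, the global score is generally not the mean of the per-direction scores, and directions with longer or more numerous segments carry more weight.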

Prepend Emb Dec Attn
Table 3: Overall results for supervised and zero-shot tasks. Tokenized BLEU scores are computed by concatenating all of the hypotheses for all translation directions, and computing BLEU with respect to the concatenated references. We use the sentencepiece-segmented hypotheses and references to avoid issues with tokenization of multi-lingual hypotheses and references.

3.2 Parallel Tasks

Table 4 lists results for all supervised task pairs from WMT 2019. For each pair, we report BLEU scores on de-tokenized output, and compute scores using sacrebleu (signature: BLEU+case.mixed+lang.<src-lang>-<trg-lang>+numrefs.1+smooth.exp+tok.<trg-lang>+version.1.2.19). Therefore, we expect BLEU scores to be equivalent to those used in the WMT automatic evaluation.

Prepend Emb Dec Attn

Table 4: Results for all task pairs in the WMT 2019 news-translation shared task where parallel training data is available.

We note that across all but the lowest-resource tasks, the model with a unique decoder for each language outperforms all others. However, for en→gu and en→kk, the lowest-resource translation directions, the unique decoder model fails completely, probably because the unique parameters for kk and gu were not updated by a sufficient number of mini-batches (approximately 15,600 for en→gu and 14,800 for en→kk).
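The approximate mini-batch counts quoted above appear to be consistent with the training setup and Table 1. The sketch below reproduces them under the assumption that each pair receives its "% budget" share of all mini-batches (100,000 iterations of 8 mini-batches each) and that this share is split evenly between the two directions of the pair; this is a reconstruction, not a derivation given in the text.

```python
ITERATIONS = 100_000
MINI_BATCHES_PER_ITERATION = 8
total_mini_batches = ITERATIONS * MINI_BATCHES_PER_ITERATION


def mini_batches_per_direction(budget_percent):
    """Budget share of a language pair, split evenly between its two directions."""
    return round(total_mini_batches * budget_percent / 100 / 2)


en_gu = mini_batches_per_direction(3.9)  # % budget for en-gu in Table 1
en_kk = mini_batches_per_direction(3.7)  # % budget for en-kk in Table 1
print(en_gu, en_kk)
```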

3.3 Zero-shot Translation Tasks

Prepend Emb Dec Attn
Table 5: Zero-shot translation results for ru→*→ru. Note that BLEU scores are computed by translating $\text{SRC} \rightarrow \text{PIVOT} \rightarrow \widehat{\text{SRC}}$, and computing the score between $\text{SRC}$ and $\widehat{\text{SRC}}$. Systems which do not pass the language identification filter are struck-through and removed from global evaluation. Note that parallel training data was available for ru↔en.

In order to test our models in the zero-shot setting, we first create a multi-parallel dataset from the TED Talks multi-parallel corpus Ye et al. (2018), which has recently been used for the training and evaluation of multi-lingual models. We filter the dev and test sets of this corpus to find segments which have translations for all of en, fr, ru, tr, de, cs, lt, fi, and are at least 20 characters long, resulting in 606 segments. Because this corpus is pre-processed, we first de-tokenize and de-escape punctuation using sacremoses (https://github.com/alvations/sacremoses). We then evaluate zero-shot translation for all possible pairs which do not occur in our parallel training data; aggregate results are shown in the second row of table 3.
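The filtering step can be sketched as follows. The data layout (one dictionary per segment, mapping a language code to its text) is a hypothetical stand-in for the corpus format, and applying the 20-character minimum to every language's text is an assumption, since the text does not say which side the length check applies to.

```python
REQUIRED_LANGS = ("en", "fr", "ru", "tr", "de", "cs", "lt", "fi")
MIN_CHARS = 20


def keep_segment(segment):
    """segment maps a language code to its text (missing or empty if absent).

    Assumed: the length threshold applies to every required language.
    """
    return all(len(segment.get(lang) or "") >= MIN_CHARS for lang in REQUIRED_LANGS)


def filter_multi_parallel(segments):
    """Keep only segments translated into all required languages."""
    return [seg for seg in segments if keep_segment(seg)]


complete = {lang: f"placeholder sentence in {lang}" for lang in REQUIRED_LANGS}
missing_lt = {lang: text for lang, text in complete.items() if lang != "lt"}
too_short = dict(complete, fi="lyhyt")

kept = filter_multi_parallel([complete, missing_lt, too_short])
print(len(kept))
```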

We then adapt an evaluation technique that has recently been used for unsupervised MT – we translate from the source language into a pivot language, then back into the source language, and evaluate the score of the resulting source-language hypotheses against the original source Lample et al. (2018a). This technique allows us to evaluate for all possible translation directions in our multi-directional model.

                        Prepend   Emb   Dec   Attn
# Failed Pivot Tasks    3         31    1     1
Table 6: Out of 110 pivot translation tasks, how many failed the language identification check?

Aware of the risk that the model simply copies through the original source segment instead of translating, we assert that at least 95% of pivot translations’ language code is correctly detected by langid (https://github.com/saffsd/langid.py), and pairs which do not meet this criterion for any system are removed from the evaluation for all systems (not just for the system that failed). For all models except Emb, only ru→kk→ru, fi→lt→fi, and zh→gu→zh failed this test, but for the Emb model 31 of the 110 translation directions failed (see table 6, and table 7 in the appendix). This result indicates that models which use language embeddings may have a more "fuzzy" representation of the output task, and are much more prone to copying than other approaches to multi-lingual MT. (We conduct round-trip translation on all 110 directions, but to compute the global zero-shot translation performance we only use directions that (1) are not available in the parallel training data, and (2) pass the language identification test.)
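The language identification filter can be sketched as two small functions: one computes the fraction of pivot hypotheses detected as the intended pivot language, and the other keeps only the tasks that pass for every system. The names and data layout are hypothetical; `detect` is any callable returning a language code, which with langid.py could plausibly be `lambda text: langid.classify(text)[0]`.

```python
def detection_rate(pivot_hypotheses, pivot_lang, detect):
    """Fraction of pivot hypotheses whose detected language is the pivot language."""
    hits = sum(1 for text in pivot_hypotheses if detect(text) == pivot_lang)
    return hits / len(pivot_hypotheses)


def surviving_tasks(rates_by_system, threshold=0.95):
    """Keep tasks that pass for every system.

    rates_by_system: {system: {task: detection rate}}. A task that fails for
    any one system is removed for all systems.
    """
    tasks = set.intersection(*(set(rates) for rates in rates_by_system.values()))
    return {
        task for task in tasks
        if all(rates[task] >= threshold for rates in rates_by_system.values())
    }


def toy_detect(text):
    """Toy detector for illustration: Cyrillic text counts as "ru", else "en"."""
    return "ru" if any("а" <= ch <= "я" for ch in text.lower()) else "en"


rate = detection_rate(["привет мир", "hello world"], "ru", toy_detect)
rates = {
    "Prepend": {"ru-kk-ru": 0.40, "ru-de-ru": 0.99},
    "Emb": {"ru-kk-ru": 0.97, "ru-de-ru": 0.96},
}
print(rate, surviving_tasks(rates))
```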

Finally, we conduct the Zero-Shot Parallel Pivot evaluation using the same datasets in table 2 by translating from EN (or DE in the case of *-FR) to each possible pivot language, and then from the pivot language into the target language. Compared to the Zero-Shot Pivot setting, this evaluation should help to protect against the risk of copying, because source and reference segments are not from the same language. Aggregate results for this setting are shown in the third row of table 3; full results are in table 8 in the appendix.
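The two-hop procedure can be sketched as below. The `translate` argument is a hypothetical callable standing in for a multi-lingual model, taking a list of segments plus source and target language codes; the toy implementation here only tags segments so that the data flow is visible.

```python
def parallel_pivot(sources, src_lang, pivot_lang, trg_lang, translate):
    """Two-hop translation: source -> pivot -> target.

    `translate` is a hypothetical callable (segments, src_lang, trg_lang)
    returning a list of translated segments.
    """
    pivot_hyps = translate(sources, src_lang, pivot_lang)
    return translate(pivot_hyps, pivot_lang, trg_lang)


def toy_translate(segments, src_lang, trg_lang):
    """Toy stand-in for a model: records each hop instead of translating."""
    return [f"{seg} [{src_lang}>{trg_lang}]" for seg in segments]


# en-fi evaluation data, pivoting through ru: the ru -> fi hop is zero-shot.
hyps = parallel_pivot(["a sentence"], "en", "ru", "fi", toy_translate)
print(hyps)
```

The resulting hypotheses would then be scored against the target-language references of the parallel evaluation set, so a system that merely copies its input cannot obtain a high score.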

3.4 Discussion

Our results show that models with either (1) a completely unique decoder for each target language or (2) unique decoder attention parameters for each target language clearly outperform models with fully shared decoder parameters.

As shown in table 3, the Zero-Shot Pivot evaluation is the outlier in our results, with the Emb system outperforming the others. Even for the languages which passed the language identification filter used in this evaluation, we suspect that some copying is occurring for the Emb system, because of the mismatch in results between the Zero-Shot Pivot task and the Supervised, Zero-Shot TED, and Zero-Shot Parallel Pivot tasks (see table 3). Since the ranking of the models according to the Zero-Shot Parallel Pivot evaluation is well aligned with the Zero-Shot TED and Supervised evaluations which use gold parallel evaluation data, we believe that this method is effective for zero-shot evaluation of translation quality for language pairs where no gold data is available.

It is plausible that the language-independence of encoder output could be correlated with the amount of sharing in the decoder module. Because most non-English target tasks only have parallel training data in English, a unique decoder for those tasks only needs to learn to decode from English, not from every possible source task. However, our results show that the Attn model, which partially shares parameters across target languages, only slightly outperforms the Dec model globally, because of the improved performance of the Attn model on the lowest-resource tasks (table 4, and table 7 in the appendix). Thus we conclude that multi-lingual encoders still learn to share information across languages, even when trained using decoders that are unique to each target task.

4 Related Work

Dong et al. (2015); Firat et al. (2016); Ha et al. (2016); Johnson et al. (2016) and others have shown that multi-way NMT systems can be created with minimal modification to the approach used for single-language-pair systems. Johnson et al. (2016) showed that simply prepending a target-task token to source inputs is enough to enable zero-shot translation between language pairs for which no parallel training data is available.

Our work is most similar to Sachan and Neubig (2018), where several different strategies for sharing decoder parameters are investigated for one-to-many translation models. However, their evaluation setting is constrained to one-to-many models which translate from English into two target languages, whereas our setting is more ambitious, performing multi-way translation between 11 languages. Blackwood et al. (2018) showed that using separate attention parameters for each task can improve the performance of multi-task MT models – this work was the inspiration for the Attn setting in our experiments.

Several recent papers focus specifically upon improving the zero-shot performance of multi-lingual MT models Chen et al. (2017); Arivazhagan et al. (2019); Gu et al. (2019); Lu et al. (2018); Al-Shedivat and Parikh (2019); Sestorain et al. (2019).

Concurrently with this work, Aharoni et al. (2019) evaluated a multiway MT system on a large number of language pairs using the TED talks corpus. However, they focus upon EN-* and *-EN, and do not test different model variants.

5 Conclusions and Future Work

We have presented results which are consistent with recent smaller-scale evaluations of multi-lingual MT systems, showing that assigning unique attention parameters to each target language in a multi-lingual NMT system is optimal when evaluating such a system globally. However, when evaluated on the individual task level, models which have unique decoder parameters for every target task tend to outperform other configurations, except when the amount of available training data is extremely small. We have also introduced two methods of evaluating zero-shot translation performance when parallel data is not available, and we conducted a large-scale evaluation of translation performance across all possible translation directions in the constrained setting of the WMT19 news-translation task.

In future work, we hope to continue studying how multi-lingual translation systems scale to realistic volumes of training data and large numbers of source and target tasks.