The Missing Ingredient in Zero-Shot Neural Machine Translation

03/17/2019 ∙ by Naveen Arivazhagan, et al. ∙ Google

Multilingual Neural Machine Translation (NMT) models are capable of translating between multiple source and target languages. Despite various approaches to train such models, they have difficulty with zero-shot translation: translating between language pairs that were not seen together during training. In this paper we first diagnose why state-of-the-art multilingual NMT models that rely purely on parameter sharing fail to generalize to unseen language pairs. We then propose auxiliary losses on the NMT encoder that impose representational invariance across languages. Our simple approach vastly improves zero-shot translation quality without regressing on supervised directions. For the first time, on WMT14 English-French-German, we achieve zero-shot performance that is on par with pivoting. We also demonstrate the easy scalability of our approach to multiple languages on the IWSLT 2017 shared task.




1 Introduction

Neural Machine Translation (NMT) Sutskever et al. (2014); Bahdanau et al. (2014); Cho et al. (2014) allows for a simple extension to the multilingual setting with the ultimate goal of a single model supporting translation between all languages Dong et al. (2015); Luong et al. (2015); Firat et al. (2016a); Johnson et al. (2016). The challenge, however, is that parallel training data is usually only available in concurrence with English. Even so, it seems plausible that, given good cross-language generalization, the model should be able to translate between any pairing of supported source and target languages - even untrained, non-English pairs. However, despite the model’s excellent performance on supervised directions, the quality on these zero-shot directions consistently lags behind pivoting by 2-10 BLEU points (Firat et al., 2016b; Johnson et al., 2016; Ha et al., 2017a; Lu et al., 2018).

The failure of multilingual models to generalize to these zero-shot directions has been patched up by using a few techniques. Typically, translation between zero-shot language pairs is instead achieved by following a two-step process of pivoting or bridging through a common language Wu and Wang (2007); Salloum and Habash (2013). Decoding through two noisy channels, however, doubles the latency and compounds errors. Several data augmentation, or zero resource, methods have therefore been proposed to enable one-step translation Firat et al. (2016b); Chen et al. (2017), but these require multiple training phases and grow quadratically in the number of languages. In this work, we try to understand the generalization problem that impairs zero-shot translation and resolve it directly rather than treating the symptoms.

The success of zero-shot translation depends on the ability of the model to learn language invariant features, or an interlingua, for cross-lingual transfer (Ben-David et al., 2007; Mansour et al., 2009). We begin with an error analysis which reveals that the standard approach of tying weights in the encoder is, by itself, not a sufficient constraint to elicit this, and the model enters a failure mode when translating between zero-shot languages Firat et al. (2016b).

To resolve this issue, we begin to view zero-shot translation as a domain adaptation problem Ben-David et al. (2007); Mansour et al. (2009) in multilingual NMT. We treat English, the language with which we always have parallel data, as the source domain, and the other languages collectively as the target domain. Rather than passively relying on parameter sharing, we apply auxiliary losses to explicitly incentivize the model to use domain/source-language invariant representations. By essentially using English representations as an implicit pivot in the continuous latent space, we achieve large improvements on zero-shot translation performance.

As we demonstrate on WMT14 (English-German-French), we are for the first time able to achieve zero-shot performance that is on par with pivoting. We are able to do this without any meaningful regression on the supervised directions and without the multi-phase training and quadratic complexity of data-synthesis approaches. We show how our approach can easily be scaled up to more languages on IWSLT17. Our results suggest that explicitly incentivizing cross-lingual transfer may be the missing ingredient to improving the quality of zero-shot translation.

2 Related Work

Multilingual NMT was first proposed by Dong et al. (2015) for translating from one source language to multiple target languages. Subsequently, sequence-to-sequence models were extended to the many-to-many setting by the addition of task-specific encoder and decoder modules Luong et al. (2015); Firat et al. (2016b). Since then, Johnson et al. (2016); Ha et al. (2016) have shown that a vanilla sequence-to-sequence model with a single encoder and decoder can be used in the many-to-many setting by using a special token to indicate the target language.

Zero-Shot Translation was first demonstrated by Johnson et al. (2016); Ha et al. (2016), who showed that multilingual models are somewhat capable of translating between untrained language pairs. Because these models can translate between language pairs they were never trained on, they give evidence for cross-lingual transfer. Zero-shot translation thus also becomes an important measure of model generalization. Unfortunately, the performance of such zero-shot translation is often not good enough to be useful, as it is easily beaten by the simple pivoting approach. For example, the multilingual model of Johnson et al. (2016) scores 6 BLEU points lower than pivoting on zero-shot Portuguese-to-Spanish translations.

Since then there have been efforts to improve the quality of zero-shot translation. Ha et al. (2017b, 2016) report a language bias problem in zero-shot translation wherein the multilingual NMT model often decodes to the wrong language. Strategies proposed to counter this include alternative ways of indicating the desired language, dictionary-based filtering at inference time, and balancing the dataset with additional monolingual data. While these techniques improved the quality of zero-shot translation, it still lagged pivoting by 4-5 BLEU points.

There have also been several innovative proposals to promote cross-lingual transfer, the key to zero-shot translation, by modifying the model’s architecture and selectively sharing parameters. Firat et al. (2016b) use separate encoders and decoders per language, but employ a common attention mechanism. In contrast, Blackwood et al. (2018) propose sharing all parameters but the attention mechanism. Platanios et al. (2018) develop a contextual parameter generator that can be used to generate the encoder-decoder parameters for any source-target language pair. Lu et al. (2018) develop a shared “interlingua layer” at the interface of otherwise unshared, language-specific encoders and decoders. While making great progress, these efforts still end up behind pivoting by 2-10 BLEU points depending on the specific dataset and language pair. The above results suggest that parameter sharing alone is not sufficient for the system to learn language agnostic representations.

Language Invariance as part of the objective function may help solve this problem. Learning coordinated representations with the use of parallel data has been explored thoroughly in the context of multi-view and multi-modal learning Wang et al. (2015); Baltrušaitis et al. (2018). Without access to parallel data that can be used for direct alignment, a large body of work minimizes domain discrepancy at the feature distribution level (Ben-David et al., 2007; Pan et al., 2011; Ganin et al., 2016) to improve transfer. These techniques have been widely applied to learning cross-lingual representations Mikolov et al. (2013); Hermann and Blunsom (2014).

The idea of aligning intermediate representations has also been explored in low-resource and unsupervised settings. Gu et al. (2018) develop a way to align word embeddings and support cross-lingual transfer to languages with different scripts. They also apply a mixture-of-experts layer on top of the encoder to improve sentence level transfer, but this can be considered under the umbrella of parameter sharing techniques. Similar to our work, Artetxe et al. (2018); Lample et al. (2017); Yang et al. (2018) explore applying adversarial losses on the encoder to ensure that the representations are language agnostic. However, more recent work on unsupervised NMT (Lample et al., 2018) has shown that the cycle consistency loss was the key ingredient in their systems. Such translation consistency losses have also been explored in DBLP:journals/corr/abs-1805-04813, DBLP:journals/corr/abs-1805-10338, DBLP:journals/corr/XiaHQWYLM16.

Zero-Resource NMT methods are another class of approaches for building translation systems for language pairs with no available training data. Unlike zero-shot translation systems, they are not immediately concerned with improving cross-lingual transfer. They instead address the problem by synthesizing a pseudo-parallel corpus that covers the missing parallel source-target data. This data is typically acquired by translating the English portion of available English-Source or English-Target parallel data to the third language Firat et al. (2016b); Chen et al. (2017). With the help of this supervision, these approaches perform very well, often beating pivoting. However, this style of zero-resource translation requires multiple phases to train teacher models, generate pseudo-parallel data (back-translation, Sennrich et al. (2015)), and then train a multilingual model on all possible language pairs. The added training complexity, along with the fact that it scales quadratically with the number of languages, makes these approaches less suitable for a truly multilingual setting.

3 An Error Analysis of Zero-Shot Translation

For zero-shot translation to work, the intermediate representations of the multilingual model need to be language invariant. In this section we evaluate the degree to which a standard multilingual NMT system is able to achieve language invariant representations. We compare its translation quality to bilingual systems on both supervised and unsupervised (zero-shot) directions and develop an understanding of the pathologies that lead to low zero-shot quality.

3.1 Experimental Setup

3.1.1 Data

Our experiments use the standard WMT14 en→fr (39M) and en→de (4.5M) training datasets that are used to benchmark state-of-the-art NMT systems Vaswani et al. (2017); Gehring et al. (2017); Chen et al. (2018). We pre-process the data by applying the standard Moses pre-processing scripts (normalize-punctuation.perl, remove-non-printing-char.perl, and tokenizer.perl). We swap the source and target to get parallel data for the fr→en and de→en directions. The resulting datasets are merged by oversampling the German portion to match the size of the French portion. This results in a total of 158M sentence pairs. The vocabulary is built by applying 32k BPE (Sennrich et al., 2016) to obtain subwords, and it is shared by the encoder and the decoder. The target language <tl> tokens are also added to the vocabulary. For evaluation we use the 3-way parallel newstest-2012 (3003 sentences) as the dev set and newstest-2013 (3000 sentences) as the test set. (We could not use newstest-2014 since it is not 3-way parallel, which would have made evaluating and analyzing results on de↔fr translation difficult.)

3.1.2 Model and Optimization

We run all our experiments with Transformers (Vaswani et al., 2017), using the TransformerBase configuration. Embeddings are initialized from a Gaussian distribution. We train our model with the transformer learning rate schedule using 4k warmup steps. Dropout is set to 0.1. We use synchronized training with 16 Tesla P100 GPUs and train the model until convergence, which takes around 500k steps. All models are implemented in Tensorflow-Lingvo (Shen et al., 2019).
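
The "transformer learning rate schedule" is presumably the warmup-then-inverse-square-root schedule of Vaswani et al. (2017). A minimal sketch of that formula follows, assuming the base model dimension of 512; the function name is illustrative, and the actual Lingvo implementation may apply additional scaling.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Schedule from Vaswani et al. (2017): linear warmup, then step**-0.5 decay.

    d_model=512 is an assumption (Transformer base); warmup_steps=4000 follows the text.
    """
    step = max(step, 1)  # guard against 0 ** -0.5 on the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate rises linearly until step 4000, where both branches of the `min` coincide, and decays afterwards.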

The bilingual NMT models are trained as usual. Similar to Johnson et al. (2016), our multilingual NMT model has the exact same architecture as the single direction models, using a single encoder and a single decoder for all language pairs. This setup maximally enforces the parameter sharing constraint that previous works rely on to promote cross-lingual transfer. Its simplicity also makes it favorable to analyze. The model is instructed on which language to translate a given input sentence into by feeding in a <tl> token, which is unique per target language, along with the source sentence.
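
As an illustration of the target-language token, the sketch below prepends a token to the tokenized source sentence. The token spelling (`<2fr>`) and its position at the start of the sentence are assumptions made for the example; the text only states that a per-target-language token is fed in along with the source.

```python
from typing import List


def add_target_token(source_tokens: List[str], target_lang: str) -> List[str]:
    """Return the source tokens with a target-language token in front.

    The "<2xx>" spelling is hypothetical; any token unique per target language works.
    """
    return [f"<2{target_lang}>"] + source_tokens


# The same German input, routed to two different target languages.
to_french = add_target_token(["Guten", "Morgen"], "fr")
to_english = add_target_token(["Guten", "Morgen"], "en")
```

At test time, a zero-shot request simply pairs a source language with a token it was never paired with during training.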

3.2 Baseline Result

System | Zero-Shot de→fr | Zero-Shot fr→de | Pivot de→fr | Pivot fr→de | Supervised en→fr | Supervised en→de | Supervised fr→en | Supervised de→en
single | - | - | 27.59 | 20.71 | 34.76 | 24.31 | 33.61 | 30.46
multi (drop=0.1) | 17.00 | 11.84 | 26.25 | 20.18 | 32.68 | 24.48 | 32.33 | 30.26
multi (drop=0.3) | 21.57 | 13.18 | - | - | 29.64 | 21.98 | 29.55 | 27.52
Table 1: WMT14 en-de-fr Zero-shot results with baseline and aligned models compared against pivoting. Pivoting through English is performed using the baseline multilingual model.

We train 4 one-to-one translation models for en→fr, en→de, fr→en, and de→en, and one multilingual model covering all four directions, and report results in Table 1. We see that the multilingual model performs well on the directions for which it received supervision. The 1-2 BLEU point regression compared to the one-to-one models is expected given that the multilingual model is trained to perform multiple tasks while using the same capacity.

Pivoting results for de→fr and fr→de were obtained by first translating from German/French to English, and then translating the English to French/German. Once again the multilingual model performs well. The 1 BLEU drop as compared to the single model baseline arises from the relative difference in performance on supervised directions. Unlike single language pair models, the multilingual model is capable of zero-shot translation. Unfortunately, the quality is far below that of pivoting, making zero-shot translation unusable.
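
The two-step pivoting procedure can be sketched as follows. The `translate` callable stands in for any translation system, and `toy_translate` is a hypothetical placeholder that only tags its input, so the example shows the control flow rather than real translation.

```python
from typing import Callable, List, Tuple


def pivot_translate(text: str, src: str, tgt: str,
                    translate: Callable[[str, str, str], str],
                    pivot: str = "en") -> str:
    """Translate src -> pivot -> tgt with two calls to the same system."""
    intermediate = translate(text, src, pivot)  # first decoding pass
    return translate(intermediate, pivot, tgt)  # second decoding pass


calls: List[Tuple[str, str]] = []


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for an NMT model: records the direction and tags the text."""
    calls.append((src, tgt))
    return f"[{src}->{tgt}] {text}"


result = pivot_translate("Guten Morgen", "de", "fr", toy_translate)
```

Every pivoted request costs two full decoding passes, and any error made in the first pass is passed on to the second.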

3.3 Target Language is Entangled with Source Language

Direction | en | de | fr
de→fr | 14% | 25% | 60%
fr→de | 12% | 54% | 34%
Table 2: Percentage of sentences by language in reference translations and the sentences decoded using the baseline multilingual model (newstest2012)

Inspecting the model’s predictions, we find that a significant fraction of the examples were translated to the wrong language. They were either translated to English or simply copied as shown in Table 2. This phenomenon has been reported before Ha et al. (2017b, 2016). It is likely a consequence of the fact that at training time, German and French sentences were always translated into English. As a result, the model never learns to properly attribute the target language to the <tl> token, and simply changing the <tl> token at test time is not effective.
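
A breakdown like the one in Table 2 can be computed from per-sentence language labels. The sketch below assumes those labels already exist, for instance from an off-the-shelf language identifier; which tool the authors used is not stated, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict, List


def language_shares(predicted_langs: List[str]) -> Dict[str, float]:
    """Percentage of decoded sentences per detected language."""
    counts = Counter(predicted_langs)
    total = len(predicted_langs)
    return {lang: 100.0 * n / total for lang, n in counts.items()}


# Toy labels for four decoded sentences in a de->fr zero-shot run.
shares = language_shares(["fr", "fr", "de", "en"])
```

In this toy run only half of the outputs are in the requested language; the rest were copied (de) or translated into English (en).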

3.4 Problems with Cross-lingual Generalization

# examples | Pivot | Zero-Shot
1875/3003 | 19.71 | 19.22
1591/3003 | 24.33 | 21.63
Table 3: BLEU on subset of examples predicted in the right language by zero-shot translation through the multilingual model (newstest2012)

Given that a large portion of the errors are due to incorrect language, we try to estimate the improvement to zero-shot translation quality that could potentially be achieved by solving this issue. We discount these errors by re-evaluating the BLEU scores of zero-shot translation and pivoting on only those examples that the multilingual model already zero-shot translates to the right language. The results are shown in Table 3. We find that although the vanilla zero-shot translation system is much stronger than expected at first glance, it still lags pivoting by 0.5 BLEU points on French to German and by 2.7 BLEU points on German to French. This gap and the analysis below indicate a generalization problem in the model.

One way to improve model generalization is by restricting the capacity of the model. With lower capacity, the model is expected to be forced to learn cross-lingual representations which can be more broadly used across the different tasks of translating in many directions. This can be done by decreasing the number of parameters, for example through weight tying as previous multilingual approaches have done, or by simply increasing the regularization. We increase the dropout applied to the model from 0.1 to 0.3. We see that this results in higher zero-shot performance. However, this comes at a high cost to the performance on supervised directions, which end up being over-regularized.

Figure 1: The proposed multilingual NMT model along with alignment. $x$ and $y$ are a pair of translations sampled from available data, $D_{x,y}$. One of $x$ or $y$ is always English. $\mathrm{Enc}(x)$ and $\mathrm{Enc}(y)$ are the encoder representations of $x$ and $y$, respectively. $\hat{y}$ is the decoder prediction. $\mathcal{L}_{CE}$ is the standard cross-entropy loss associated with maximum likelihood training for NMT. $\Omega$ is the alignment loss. Both losses, $\mathcal{L}_{CE}$ and $\Omega$, are minimized simultaneously.

Based on this, it seems that when a model is trained on just the end-to-end translation objective, there is no guarantee that it will discover language invariant representations; given enough capacity, it is possible for the model to partition its intrinsic dimensions and overfit to the supervised translation directions. Without any explicit incentive to learn invariant features, the intermediate encoder representations are specific to individual languages and this leads to poor zero-shot performance. While constraining model capacity can help alleviate this problem, it also impairs performance on supervised translation directions. We thus need to develop a more direct approach to push the model to learn transferable features.

4 Aligning Latent Representations

To improve generalization to other languages, we apply techniques from domain adaptation to multilingual NMT. Multilingual NMT can be seen as a multi-task, multi-domain problem. Each source language forms a new domain, and each target language is a different task. We simplify this to a two domain problem by taking English to be the source domain, $D_{En}$, and grouping the non-English languages into the target domain, $D_{T}$. English is chosen as the source domain since it is the only domain for which we consistently have enough data for all the tasks/target languages. Minimizing the discrepancy between the feature distributions of the source and target domains will allow us to enable zero-shot translation Ben-David et al. (2007); Mansour et al. (2009). To this end we apply a regularizer while training the model that will force the model to make the representations of sentences in all non-English languages similar to their English counterparts, effectively making the model domain/source-language agnostic. In this way, English representations at the final layer of the encoder now form an implicit pivot in the latent space. The multilingual model is now trained on both the cross-entropy translation loss and the new regularization loss. The loss function we then minimize is:


$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\,\Omega$$

where $\mathcal{L}_{CE}$ is the cross-entropy translation loss, $\Omega$ is the alignment regularizer that will be defined below, and $\lambda$ is a hyper-parameter that controls the contribution of the alignment loss.

Since we wish to make the representations source language invariant, we choose to apply the above regularization on top of the NMT encoder. This is because NMT models naturally decompose into an encoder and a decoder with a presumed separation of roles: The encoder encodes text in the source language into an intermediate latent representation, and the decoder generates the target language text conditioned on the encoder representation Cho et al. (2014).

Below, we discuss two classes of regularizers that can be used. The first minimizes distribution level discrepancy between the source and target domain. The second uses the available parallel data to directly enforce a correspondence at the instance level.

4.1 Aligning Distributions

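We minimize the discrepancy between the feature distributions of the source and target domains by explicitly optimizing the following domain adversarial loss (Ganin et al., 2016):

$$\Omega_{adv}(\theta_{disc}) = -\,\mathbb{E}_{x \sim D_{En}}\left[\log \mathrm{Disc}\left(\mathrm{Enc}(x)\right)\right] - \mathbb{E}_{x \sim D_{T}}\left[\log\left(1 - \mathrm{Disc}\left(\mathrm{Enc}(x)\right)\right)\right]$$

where $\mathrm{Disc}$ is the discriminator, parametrized by $\theta_{disc}$, $\mathrm{Enc}$ is the encoder, and $D_{En}$ and $D_{T}$ are the English (source) and non-English (target) domains. Note that, unlike Artetxe et al. (2018); Yang et al. (2018), who also train their encoder adversarially to a language detecting discriminator, we are trying to align the distribution of encoder representations of all other languages to that of English and vice-versa. Our discriminator is just a binary predictor, independent of how many languages we are jointly training on.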

Architecturally, the discriminator is a feed-forward network with 3 hidden layers of dimension 2048 using the leaky ReLU non-linearity. It operates on the temporally max-pooled representation of the encoder output. We also experimented with a discriminator that made independent predictions for the encoder representation at each time-step Lample et al. (2017), but found the pooling based approach to work better for our purposes. More involved discriminators that consider the sequential nature of the encoder representations may be more effective, but we do not explore them in this work.
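
To make the two sides of the adversarial game concrete, here is a minimal sketch of a binary domain-adversarial objective. It assumes the discriminator outputs, for each pooled sentence encoding, the probability that the sentence is English. Flipping the sign for the encoder is one common formulation (it matches the effect of a gradient reversal layer); it is shown as an assumption, not as the authors' exact implementation.

```python
import math
from typing import Sequence

EPS = 1e-12  # numerical guard against log(0)


def discriminator_loss(p_english: Sequence[float],
                       p_non_english: Sequence[float]) -> float:
    """Binary cross-entropy with English labelled 1 and every other language 0.

    Each input value is the discriminator's predicted probability of "English".
    """
    loss_en = -sum(math.log(p + EPS) for p in p_english) / len(p_english)
    loss_other = -sum(math.log(1.0 - p + EPS) for p in p_non_english) / len(p_non_english)
    return loss_en + loss_other


def encoder_adversarial_loss(p_english: Sequence[float],
                             p_non_english: Sequence[float]) -> float:
    """The encoder is rewarded when the discriminator is wrong (sign-flipped loss)."""
    return -discriminator_loss(p_english, p_non_english)
```

Because all non-English languages share the label 0, the discriminator stays a binary classifier no matter how many languages are trained jointly.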

4.2 Aligning Known Translation Pairs

The above adversarial domain adaptation strategy does not take full advantage of the fact that we have access to parallel data. Instead, it only enforces alignment between the source and the target domain at a distribution level. Here we attempt to make use of the available parallel data, and enforce an instance level correspondence between known translations, rather than just aligning the distributions in embedding space.

Previous work on multi-modal and multi-view representation learning has shown that when given paired data, transferable representations can be much more easily learned by improving some measure of similarity between the alternative views Baltrušaitis et al. (2018). In our case, the different views correspond to semantically equivalent sentences written in different languages. These are immediately available to us in our parallel training data. We now minimize:



$$\Omega_{align} = \mathbb{E}_{(x, y) \sim D_{x,y}}\left[1 - \mathrm{sim}\left(\mathrm{Enc}(x), \mathrm{Enc}(y)\right)\right]$$

where $D_{x,y}$ is the joint distribution of translation pairs and $\mathrm{sim}$ is a similarity measure between encoder representations. Note that $\mathrm{Enc}(x)$ and $\mathrm{Enc}(y)$ are actually a pair of sequences, and to compare them we would ideally have access to the word level correspondences between the two sentences. In the absence of this information, we make a bag-of-words assumption and align the pooled representation similar to Gouws et al. (2015); Coulmance et al. (2016). Empirically, we find that max pooling and minimizing the cosine distance between the representations of parallel sentences works well, but many other loss functions may yet be explored to obtain even better results.
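
The pooled cosine alignment described above can be sketched in a few lines of plain Python. Encoder outputs are represented as lists of per-token vectors; the function names are illustrative, and the sketch assumes non-zero pooled vectors.

```python
import math
from typing import List

Vector = List[float]


def max_pool(encodings: List[Vector]) -> Vector:
    """Max over the time axis: one fixed-size vector per sentence, whatever its length."""
    return [max(dim) for dim in zip(*encodings)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def pool_cosine_loss(src_enc: List[Vector], tgt_enc: List[Vector]) -> float:
    """Alignment loss for one translation pair of possibly different lengths."""
    return cosine_distance(max_pool(src_enc), max_pool(tgt_enc))


# A two-token source and a one-token target whose pooled vectors are parallel.
aligned = pool_cosine_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 4.0]])
```

Pooling removes the need for word-level correspondences, which is what lets sentences of different lengths be compared directly.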

5 Experiments

We experiment with alignment on the same baseline multilingual setup as section 3. In addition to the model being trained end-to-end on the cross-entropy loss from translation, the encoder is also trained to minimize the alignment loss. To do this, we simultaneously encode both the source and the target sentence of all the translation pairs in a minibatch. While only the encoding of the source sentence is passed on to the decoder for translation, the encodings of both sentences are used to minimize the alignment loss.

For cosine alignment, we simply minimize the cosine distance between the encodings of a given sentence pair. For adversarial adaptation, the encodings of all sentences in a batch are grouped into English and non-English encodings and fed to the discriminator. For each sentence encoding, the discriminator is trained to predict whether it came from the English group or the non-English group. On the other hand, the encoder is trained adversarially to the discriminator.

The alignment weight $\lambda$ was tuned to 1.0 for both the adversarial and the cosine alignment loss. Simply fine-tuning a pre-trained multilingual model with SGD using a learning rate of 1e-4 works well, obviating the need to train from scratch. The models converge within a few thousand updates.

5.1 Zero-Shot Now Matches Pivoting

Multilingual System | Zero-Shot de→fr | Zero-Shot fr→de | Supervised en→fr | Supervised en→de | Supervised fr→en | Supervised de→en
vanilla | 17.00 | 11.84 | 32.68 | 24.48 | 32.33 | 30.26
adversarial | 26.00 | 20.39 | 32.92 | 24.5 | 32.39 | 30.21
pool-cosine | 25.85 | 20.18 | 32.94 | 24.51 | 32.36 | 30.32
Table 4: WMT14 en-de-fr Zero-shot results with baseline and aligned models compared against pivoting. Pivoting through English is performed using the baseline multilingual model.

We compare the zero-shot performance of the multilingual models against pivoting with the same multilingual model. Pivoting was able to achieve BLEU scores of 26.25 on de→fr and 20.18 on fr→de as evaluated on newstest2013. Our results in Table 4 demonstrate that both our approaches to latent representation alignment result in large improvements in zero-shot translation quality for both directions, effectively closing the gap to the strong performance of pivoting. The alignment losses also effectively disentangle the representation of the source sentence from the target language, ensuring prediction in the desired language.

In contrast to naively constraining the model to encourage it to learn transferable representations, as was explored in section 3.4, the alignment losses are able to strike a much finer balance by taking a pinpointed approach to enforcing source language invariance. This is what allows us to push the model to generalize to the zero-shot language pairs without hurting the quality in the supervised directions.

5.2 Quantifying the Improvement to Language Invariance

Figure 2: Average cosine distance between aligned context vectors for all combinations of English (en), German (de) and French (fr) as training progresses.

We design a simple experiment to determine the degree to which representations learned while training a multilingual translation model are truly cross-lingual. Because sentences in different languages can have different lengths and word orders despite being translations of each other, it is not possible to directly compare encoder output representations. We instead go further downstream and compare the context vectors obtained while decoding from such a pair of sentences.

In sequence-to-sequence models with attention, the attention mechanism is the only means by which the decoder can access the encoder representation. Thus, if we expect that for semantically equivalent source sentences, the decoder prediction should not change, then neither should the context vectors returned by the attention mechanism. Comparing context vectors obtained in such a manner will allow us to determine the extent to which their representations are functionally equivalent to the decoder.

We sample a set of parallel en-de-fr sentences extracted from our dev set, newstest2012, for this analysis. For each sentence in each triple of aligned sentences, we obtain the sequence of pairs of context vectors while decoding to it from the other two sentences. We plot the mean cosine distances of these pairs for our baseline multilingual training run in Figure 2. We also show how these curves evolve when we fine-tune with the alignment losses. Our results indicate that the vanilla multilingual model learns to align encoder representations over the course of training. However, in the absence of an external incentive, the alignment process stalls as training progresses. Incrementally training with the alignment losses results in a more language-agnostic representation, which contributes to the improvements in zero-shot performance.

5.3 Cosine vs Adversarial

The quality of the simple approach of just maximizing the representational similarity of known translation pairs is nearly indistinguishable from that of the more sophisticated adversarial training based approach. The adversarial regularizer suffers from three major problems: 1) it is sensitive to its initialization scheme and the choice of hyperparameters; 2) it has many moving parts coming from the architecture of the discriminator, the optimizer, and the non-linearity, all of which are non-trivial to control; and 3) it may also exhibit various failure modes including vanishing gradients and unstable oscillatory behaviour. In comparison, the cosine loss on translation pairs is simple, robust and effective, with the only hyper-parameter being $\lambda$, which controls the weight of the alignment loss with respect to the translation loss.

5.4 Zero-Shot with Adaptation vs Zero-Resource

We also evaluate against a zero-resource system. Here we synthesize parallel data by translating the English portion of the available parallel data with previously trained one-to-one models. We thus obtain 4.5M sentences with synthesized French and 39M sentences with synthesized German. These are reversed and concatenated to obtain 43.5M sentences. We then train two one-to-one NMT models for de→fr and fr→de. These models obtained BLEU scores of 29.04 and 21.66 respectively on newstest2013. Note that these are one-to-one models and thus have the advantage of focussing on a single task as compared to a many-to-many multilingual model.

While this approach achieves very good results, it is hard to apply to a multilingual setting with many languages. It requires multiple phases: 1) Teacher models from English to each target language need to be trained, 2) Pseudo-parallel data for each language pair needs to be synthesized, 3) The multilingual model then needs to be jointly trained on data for all language pairs. The sequential nature of these phases and the quadratic scaling of this process make this approach unsuitable when we wish to support a large number of languages. In contrast, our approach does not require any additional pre-processing, additional training phases, or data generation. With the paired cosine alignment loss, the only hyper-parameter that we need to tune is the alignment weight $\lambda$.
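
The quadratic scaling is easy to verify with a short count. The sketch below assumes English-centric training data (every non-English language is paired only with English); the function name is illustrative.

```python
from typing import Dict


def direction_counts(num_languages: int) -> Dict[str, int]:
    """Count translation directions when parallel data exists only to/from English."""
    n = num_languages
    total = n * (n - 1)        # every ordered pair of distinct languages
    supervised = 2 * (n - 1)   # into and out of English
    return {"total": total, "supervised": supervised, "zero_shot": total - supervised}


five_languages = direction_counts(5)
```

With 5 languages there are 20 directions, of which 8 are supervised and 12 are zero-shot. The supervised count grows linearly with the number of languages, while the pairs a data-synthesis approach has to cover grow quadratically.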

5.5 IWSLT17: Scaling to more languages

Group | vanilla, direct | vanilla, pivot | align (cosine), direct
to/from English (8) | 30.11 | - | 29.95
non-English pairs (12) | 16.73 (zs) | 17.76 | 17.72 (zs)
All (20) | 22.2 | 22.81 | 22.72
Table 5: Average BLEU scores for multilingual model on IWSLT-2017; Zero-Shot results are marked (zs).

Here we demonstrate the scalability of our approach to multiple languages. We use the dataset from the IWSLT-17 shared task, which has transcripts of TED talks in 5 languages: English (en), Dutch (nl), German (de), Italian (it), and Romanian (ro). The original dataset is multi-way parallel with approximately 220 thousand sentences per language, but for the sake of our experiments we only use the to/from English directions for training. The dev and test sets are also multi-way parallel and comprise around 900 and 1100 sentences per language pair, respectively. We again use the transformer base architecture but multiply the learning rate by 2.0 and increase the number of warmup steps to 8k to make the learning rate schedule more conservative. Dropout is set to 0.2. We use the cosine loss with the alignment weight $\lambda$ set to 0.001, but higher values up to 0.1 are equally effective.

On this dataset, the baseline model does not seem to have trouble with decoding to the correct language and performs well on zero-shot translation from the start. This may be a symptom of this dataset being multi-way parallel with the English sentences shared across all language pairs. However, it is still 1 BLEU point behind the quality of pivoting as shown in Table 5. By training with the auxiliary cosine alignment loss, we are once again able to match the quality of bridging.

6 Conclusion

We started with an error analysis of zero-shot translation in naively trained multilingual NMT models and diagnosed why such models do not automatically generalize to zero-shot directions. Viewing zero-shot NMT in the light of domain adaptation, we proposed auxiliary losses to force the model to learn source language invariant representations that improve generalization. Through careful analyses we showed how these representations lead to better zero-shot performance while still maintaining performance on the supervised directions. We demonstrated the simplicity and effectiveness of our approach on two public benchmark datasets: WMT English-French-German and the IWSLT 2017 shared task.


We would like to thank the Google Brain and Google Translate teams for their useful inputs and discussions. We would also like to thank the entire Lingvo development team for their foundational contributions to this project.