Although Neural Machine Translation (NMT) has dominated recent research on translation tasks [wu2016google, vaswani2017attention, hassan2018achieving], it heavily relies on large-scale parallel data, resulting in poor performance on low-resource or zero-resource language pairs [koehn2017six]. Translation between these low-resource languages (e.g., Arabic→Spanish) is usually accomplished by pivoting through a rich-resource language (such as English), i.e., the Arabic (source) sentence is first translated into English (pivot), which is then translated into Spanish (target) [kauers2002interlingua, de2006catalan]. However, the pivot-based method requires doubled decoding time and suffers from the propagation of translation errors.
One common alternative to avoid pivoting in NMT is transfer learning [zoph2016transfer, nguyen2017transfer, kocmi2018trivial, Kim2019PivotbasedTL], which leverages a high-resource pivot→target model (parent) to initialize a low-resource source→target model (child) that is further optimized with a small amount of available parallel data. Although this approach has achieved success in some low-resource language pairs, it still performs very poorly in extremely low-resource or zero-resource translation scenarios. Specifically, kocmi2018trivial report that without any child model training data, the performance of the parent model on the child test set is very poor.
In this work, we argue that the language space mismatch problem, also known as the domain shift problem [fu2015transductive], brings about the zero-shot translation failure in transfer learning. This is because transfer learning has no explicit training process to guarantee that the source and pivot languages share the same feature distributions, so the child model inherited from the parent model fails in such a situation. For instance, as illustrated in the left of Figure 1, the points of a sentence pair with the same semantics do not overlap in the source space, so the shared decoder generates different translations, denoted by different points in the target space. Actually, transfer learning for NMT can be viewed as a multi-domain problem where each source language forms a new domain. Minimizing the discrepancy between the feature distributions of different source languages, i.e., different domains, will ensure a smooth transition between the parent and child models, as shown in the right of Figure 1. One way to achieve this goal is the fine-tuning technique, which forces the model to forget the specific knowledge from the parent data and learn new features from the child data. However, the domain shift problem still exists, and the demand for parallel child data for fine-tuning heavily hinders transfer learning for NMT in the zero-resource setting.
In this paper, we explore transfer learning in a common zero-shot scenario where abundant source→pivot and pivot→target parallel data are available but no source→target parallel data. In this scenario, we propose a simple but effective transfer approach, the key idea of which is to relieve the burden of the domain shift problem by means of cross-lingual pre-training. To this end, we first investigate the performance of two existing cross-lingual pre-training methods proposed by lample2019cross in the zero-shot translation scenario. In addition, a novel pre-training method called BRidge Language Modeling (BRLM) is designed to make full use of the source→pivot bilingual data to obtain a universal encoder for different languages. Once the universal encoder is constructed, we only need to train the pivot→target model and then test this model directly in the source→target direction. The main contributions of this paper are as follows:
We propose a new transfer learning approach for NMT which uses cross-lingual language model pre-training to enable high performance on zero-shot translation.
We propose a novel pre-training method called BRLM, which effectively reduces the distance between different source language spaces.
Our proposed approach significantly improves zero-shot translation performance, consistently surpassing pivoting and multilingual approaches. Meanwhile, the performance on supervised translation directions remains at the same level or even improves when using our method.
In recent years, zero-shot translation in NMT has attracted widespread attention in academic research. Existing methods are mainly divided into four categories: pivot-based method, transfer learning, multilingual NMT, and unsupervised NMT.
Pivot-based Method is a common strategy to obtain a source→target model by introducing a pivot language. This approach is further divided into pivoting and pivot-synthetic. The former first translates a source language into the pivot language, which is later translated into the target language [kauers2002interlingua, de2006catalan, utiyama2007comparison], while the latter trains a source→target model with pseudo data generated from source-pivot or pivot-target parallel data [chen2017teacher, zheng2017maximum]. Although the pivot-based methods can achieve reasonable performance, they face expensive computation and a vast number of parameters that grow quadratically with the number of source languages, and suffer from the error propagation problem [zhu2013improving].
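The two-step decoding that pivoting requires can be sketched as follows (a pure illustration; `src2pivot` and `pivot2tgt` are hypothetical stand-ins for the two translation models):

```python
def pivot_translate(src_sentence, src2pivot, pivot2tgt):
    """Pivoting: decode source -> pivot, then pivot -> target.
    Two sequential passes double the decoding time, and any error
    made in the first pass propagates into the second."""
    pivot_sentence = src2pivot(src_sentence)
    return pivot2tgt(pivot_sentence)
```

This makes explicit why pivoting is both slower (two decoder runs) and error-prone (the second model only ever sees the first model's output).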
Transfer Learning was first introduced for NMT by zoph2016transfer, who leverage a high-resource parent model to initialize a low-resource child model. On this basis, nguyen2017transfer and kocmi2018trivial use shared vocabularies for the source/target languages to improve transfer learning, while kim2019effective relieves the vocabulary mismatch mainly by using cross-lingual word embeddings. Although these methods are successful in low-resource settings, they have limited effect in zero-shot translation.
Multilingual NMT (MNMT) enables training a single model that supports translation from multiple source languages into multiple target languages, even for unseen language pairs [firat2016multi, firat2016zero, johnson2017google, al2019consistency, aharoni2019massively]. Aside from simpler deployment, MNMT benefits from transfer learning where low-resource language pairs are trained together with high-resource ones. However, gu2019improved point out that MNMT for zero-shot translation easily fails and is sensitive to hyper-parameter settings. Also, MNMT usually performs worse than the pivot-based method in the zero-shot translation setting [arivazhagan2019missing].
Unsupervised NMT (UNMT) considers a harder setting in which only large-scale monolingual corpora are available for training. Recently, many methods have been proposed to improve the performance of UNMT, including denoising auto-encoders, statistical machine translation (SMT) and unsupervised pre-training [artetxe2017unsupervised, lample2018phrase, ren2019unsupervised, lample2019cross]. Although UNMT performs well between similar languages (e.g., English-German translation), its performance between distant languages is still far from satisfactory.
Our proposed method belongs to transfer learning, but it differs from traditional transfer methods, which train a parent model as the starting point. Before training a parent model, our approach fully leverages cross-lingual pre-training to make all source languages share the same feature space, which enables a smooth transition for zero-shot translation.
In this section, we present our cross-lingual pre-training based transfer approach. This method is designed for a common zero-shot scenario where abundant source→pivot and pivot→target bilingual data are available but no source→target parallel data, and the whole training process can be summarized in the following steps:
Pre-train a universal encoder with source/pivot monolingual or source→pivot bilingual data.
Train a pivot→target parent model built on the pre-trained universal encoder with the available parallel data. During the training process, we freeze several layers of the pre-trained universal encoder to avoid the degeneracy issue [Howard2018UniversalLM].
Directly translate source sentences into target sentences with the parent model, which benefits from the availability of the universal encoder.
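The three steps above can be sketched as a pipeline (a minimal sketch; `pretrain`, `train_parent`, and `decode` are hypothetical placeholders for the actual pretraining, training, and decoding routines):

```python
def zero_shot_transfer(pretrain, train_parent, decode,
                       src_pivot_data, pivot_tgt_data, src_sentences):
    """Step 1: pretrain a universal encoder on source/pivot data.
    Step 2: train the pivot->target parent model on top of it
            (layer freezing is assumed to happen inside train_parent).
    Step 3: decode source sentences directly with the parent model,
            relying on the shared encoder for the transfer."""
    encoder = pretrain(src_pivot_data)
    parent = train_parent(encoder, pivot_tgt_data)
    return [decode(parent, s) for s in src_sentences]
```

The key property is that no source→target data ever enters the pipeline; only the pretrained encoder bridges the two directions.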
The key difficulty of this method is to ensure that the intermediate representations of the universal encoder are language invariant. In the rest of this section, we first present two existing methods yet to be explored in zero-shot translation, and then propose a straightforward but effective cross-lingual pre-training method. Finally, we present the whole training and inference protocol for transfer.
Masked and Translation Language Model Pretraining
Two existing cross-lingual pre-training methods, Masked Language Modeling (MLM) and Translation Language Modeling (TLM), have shown their effectiveness on the XNLI cross-lingual classification task [lample2019cross, Huang2019UnicoderAU], but they have not been well studied on cross-lingual generation tasks under the zero-shot condition. We attempt to exploit the cross-lingual ability of these two methods for zero-shot translation.
Specifically, MLM adopts the Cloze objective of BERT [devlin2018bert] and predicts masked words that are randomly selected and replaced with the [MASK] token on a monolingual corpus. In practice, MLM takes monolingual corpora of different languages as input to find features shared across languages. With this method, word pieces shared across all languages are mapped into a shared space, which brings the sentence representations of different languages close [DBLP:journals/corr/abs-1906-01502].
Since the MLM objective is unsupervised and only requires monolingual data, TLM is designed to leverage parallel data when it is available. TLM is a simple extension of MLM, with the difference that TLM concatenates a sentence pair into one sequence and then randomly masks words in both the source and target sentences. In this way, the model can attend either to surrounding words or to the translated sentence, implicitly encouraging it to align the source and target language representations. Note that although each sentence pair is formed into one sequence, the positions of the target sentence are reset to count from zero.
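The TLM input construction just described can be sketched as follows (a simplified illustration; the function name and the 0/1 language ids are assumptions, and the 80/10/10 replacement refinement is omitted for clarity):

```python
import random

MASK = "[MASK]"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Build a TLM training example: concatenate the sentence pair,
    reset the target-side positions to count from zero, and randomly
    mask tokens on both sides."""
    rng = random.Random(seed)
    tokens = list(src_tokens) + list(tgt_tokens)
    # Positions restart at zero for the target sentence.
    positions = list(range(len(src_tokens))) + list(range(len(tgt_tokens)))
    # Language ids distinguish the two sides of the pair.
    langs = [0] * len(src_tokens) + [1] * len(tgt_tokens)
    labels = [None] * len(tokens)  # original token where masked
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok        # the model must predict this token
            tokens[i] = MASK
    return tokens, positions, langs, labels
```

Because both sentences live in one sequence, a masked position can draw on context from either language, which is what encourages the implicit alignment.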
Bridge Language Model Pretraining
Aside from MLM and TLM, we propose BRidge Language Modeling (BRLM) to further obtain word-level representation alignment between different languages. This method is inspired by the assumption that if the feature spaces of different languages are aligned well, the masked words in a corrupted sentence can also be guessed from the context of the correspondingly aligned words on the other side. To achieve this goal, BRLM is designed to strengthen the ability to infer words across languages based on alignment information, instead of inferring words within a monolingual sentence as in MLM or within the pseudo sentence formed by concatenating a sentence pair as in TLM.
As illustrated in Figure 2, BRLM applies the shared encoder to the sentences on both sides separately. In particular, we design two network structures for BRLM, Hard Alignment (BRLM-HA) and Soft Alignment (BRLM-SA), according to the way the alignment information is generated. Both structures extend MLM to a bilingual scenario, with the difference that BRLM leverages an external aligner tool or an additional attention layer to explicitly introduce alignment information during model training.
Hard Alignment (BRLM-HA). We first run an external aligner tool on the source→pivot parallel data to extract the alignment information of each sentence pair. During model training, given a source→pivot sentence pair, BRLM-HA randomly masks some words in the source sentence and leverages the alignment information to obtain the aligned words in the pivot sentence for the masked words. Based on the processed input, BRLM-HA adopts the Transformer [vaswani2017attention] encoder to obtain the hidden states for the source and pivot sentences respectively. The training objective of BRLM-HA is then to predict the masked words from not only the surrounding words in the source sentence but also the encoder outputs of the aligned words. Note that this training process is also carried out symmetrically, in which we mask some words in the pivot sentence and obtain the aligned words in the source sentence.
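The hard-alignment lookup can be sketched as follows (an illustration only; the function names are assumptions, but the `i-j` Pharaoh-style pair format is what tools such as fast_align emit):

```python
def parse_alignment(line):
    """Parse one aligner output line like "0-0 1-2 2-1" into a
    mapping from source index to the list of aligned pivot indices."""
    align = {}
    for pair in line.split():
        i, j = map(int, pair.split("-"))
        align.setdefault(i, []).append(j)
    return align

def aligned_pivot_positions(masked_src_positions, align):
    """For BRLM-HA: collect the pivot-side positions aligned to the
    masked source words; the encoder outputs at these positions join
    the prediction of the masked words."""
    out = []
    for i in masked_src_positions:
        out.extend(align.get(i, []))
    return sorted(set(out))
```

Unaligned masked words simply contribute no extra pivot context, falling back to MLM-style prediction from the source side alone.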
Soft Alignment (BRLM-SA). Instead of using an external aligner tool, BRLM-SA introduces an additional attention layer to learn the alignment information jointly with model training. In this way, BRLM-SA avoids the effect of wrong external alignment information and enables many-to-one soft alignment during model training. Similar to BRLM-HA, the training objective of BRLM-SA is to predict the masked words from not only the surrounding words in the source sentence but also the outputs of the attention layer. In our implementation, the attention layer is a multi-head attention layer as adopted in the Transformer, where the queries come from the masked source sentence, and the keys and values come from the pivot sentence.
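The soft-alignment attention can be sketched with a single-head scaled dot-product attention over plain Python lists (a simplification; the paper's layer is multi-head, and the identity key/value split here is an assumption for illustration):

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: queries come from the
    masked source sentence, keys/values from the pivot sentence, so
    each masked position attends softly over all pivot words."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the pivot positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of pivot values: a soft, many-to-one alignment.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

Because the weights are learned jointly with the rest of the model, no hard alignment decision is ever committed to, which is exactly what shields BRLM-SA from aligner errors.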
In principle, MLM and TLM can learn some implicit alignment information during model training. However, the alignment process in MLM is inefficient, since the shared word pieces account for only a small proportion of the whole corpus, making it difficult to expand the shared information to align the whole corpus. TLM also lacks explicit alignment between the source and target sentences, since concatenating the sentence pair into one sequence makes explicit source-target alignment infeasible. BRLM fully utilizes the alignment information to obtain better word-level representation alignment between different languages, which better relieves the burden of the domain shift problem.
We consider the typical zero-shot translation scenario in which a high-resource pivot language has parallel data with both source and target languages, while the source and target languages have no parallel data between themselves. Our proposed cross-lingual pretraining based transfer approach for source→target zero-shot translation is divided into two phases: the pretraining phase and the transfer phase.
In the pretraining phase, we first pretrain MLM on monolingual corpora of both the source and pivot languages, and then continue pretraining TLM or the proposed BRLM on the available parallel data between the source and pivot languages, in order to build a cross-lingual encoder shared by the source and pivot languages.
In the transfer phase, we train the pivot→target NMT model initialized by the cross-lingually pre-trained encoder, and finally transfer the trained NMT model to source→target translation thanks to the shared encoder. Note that during training of the pivot→target NMT model, we freeze several layers of the cross-lingually pre-trained encoder to avoid the degeneracy issue.
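The layer-freezing step can be sketched as a simple trainable-flag map (illustrative only; the parameter naming scheme is an assumption, and in a real framework this would toggle each layer's gradient flags instead):

```python
def encoder_trainable_flags(num_layers, freeze_bottom):
    """Mark the bottom `freeze_bottom` encoder layers of the pretrained
    universal encoder as frozen (False) and the rest as trainable (True),
    so the cross-lingual features learned in pretraining survive
    parent-model training."""
    return {f"encoder.layer.{i}": i >= freeze_bottom
            for i in range(num_layers)}
```

With the 6-layer encoder used in the experiments, freezing the first four layers corresponds to `encoder_trainable_flags(6, 4)`.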
For the more complicated scenario in which either the source side or the target side has multiple languages, the encoder and the decoder are also shared across the languages on each side for efficient deployment of translation between multiple languages.
[Table 1: Training corpus statistics. MultiUN: Ar-En 9.7M, En-Es 11.3M, and En-Ru 11.6M sentence pairs, with 4,000 validation and 4,000 test sentences. Europarl: Fr-En-Es, De-En-Fr, and Ro-En-De.]
[Table 2: Zero-shot BLEU on Europarl for the directions Fr→Es, En→Es, De→Fr, En→Fr, Ro→De, and En→De. Cross-lingual Transfer [kim2019effective]: 18.45, 34.01, 9.86, 34.05, 2.02, 23.61. The remaining rows report the proposed cross-lingual pretraining based transfer approaches.]
We evaluate our cross-lingual pre-training based transfer approach against several strong baselines on two public datasets, Europarl [koehn2005europarl] and MultiUN [eisele2010multiun], which contain multi-parallel evaluation data to assess zero-shot performance. In all experiments, we use BLEU as the automatic metric for translation evaluation, calculated with the multi-bleu.perl script.
The statistics of the Europarl and MultiUN corpora are summarized in Table 1. For the Europarl corpus, we evaluate on French-English-Spanish (Fr-En-Es), German-English-French (De-En-Fr) and Romanian-English-German (Ro-En-De), where English acts as the pivot language, its left side is the source language, and its right side is the target language. We remove the multi-parallel sentences between different training corpora to ensure the zero-shot setting. We use devtest2006 as the validation set and test2006 as the test set for Fr→Es and De→Fr. For the distant language pair Ro→De, since there are no official validation and test sets, we extract 1,000 overlapping sentences from newstest2016 as the test set and 2,000 overlapping sentences split from the training set as the validation set. For the vocabulary, we use 60K sub-word tokens based on Byte Pair Encoding (BPE) [sennrich2015neural].
For the MultiUN corpus, we use four languages: English (En) is set as the pivot language, which has parallel data with the other three languages, which in turn have no parallel data between each other. The three languages are Arabic (Ar), Spanish (Es), and Russian (Ru), and mutual translation among them constitutes six zero-shot translation directions for evaluation. We use 80K BPE splits as the vocabulary. Note that all sentences are tokenized by the tokenizer.perl script (https://github.com/moses-smt/mosesdecoder/blob/RELEASE-3.0/scripts/tokenizer/tokenizer.perl), and we lowercase all data to avoid a large vocabulary for the MultiUN corpus.
We use traditional transfer learning, the pivot-based method and multilingual NMT as our baselines. For a fair comparison, the Transformer-big model with 1024 embedding/hidden units, 4096 feed-forward filter size, 6 layers and 8 heads per layer is adopted for all translation models in our experiments. We set the batch size to 2400 and limit the sentence length to 100 BPE tokens. We apply a dropout rate on each attention head, which is favorable to zero-shot translation and has no effect on supervised translation directions [gu2019improved]. For model initialization, we use Facebook's cross-lingual pretrained models released by XLM (https://github.com/facebookresearch/XLM) to initialize the encoder part, and the remaining parameters are initialized with Xavier uniform initialization. We employ the Adam optimizer. At decoding time, we generate greedily with a length penalty.
Regarding MLM, TLM and BRLM, as mentioned in the pre-training phase of the transfer protocol, we first pre-train MLM on monolingual data of both the source and pivot languages, then use the parameters of MLM to initialize TLM and the proposed BRLM, which are further optimized with source-pivot bilingual data. In our experiments, we use MLM+TLM and MLM+BRLM to denote this training process. For the masking strategy during training, following devlin2018bert, 15% of BPE tokens are selected to be masked. Among the selected tokens, 80% of them are replaced with the [MASK] token, 10% are replaced with a random BPE token, and 10% remain unchanged. The prediction accuracy of masked words is used as a stopping criterion in the pre-training stage. Besides, we use the fast_align tool [dyer2013simple] to extract word alignments for BRLM-HA.
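The BERT-style 15%/80/10/10 masking strategy from devlin2018bert can be sketched as follows (an illustration; the function name and vocabulary handling are assumptions):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Select `mask_prob` of the tokens for prediction; of those,
    80% become [MASK], 10% become a random vocabulary token, and
    10% are left unchanged (per devlin2018bert)."""
    rng = random.Random(seed)
    out = list(tokens)
    labels = [None] * len(tokens)  # original token where selected
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token
    return out, labels
```

Keeping 10% of selected tokens unchanged forces the model to maintain useful representations for every position, not just the visibly masked ones.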
[Table 3: Zero-shot BLEU on MultiUN for Ar→Es, Es→Ar, Ar→Ru, Ru→Ar, Es→Ru, and Ru→Es, with averages over zero-shot (A-ZST) and supervised (A-ST) directions, covering the proposed cross-lingual pretraining based transfer approaches and variants with added back translation.]
Tables 2 and 3 report zero-shot results on the Europarl and MultiUN evaluation sets, respectively. We compare our approaches with the related approaches of pivoting, multilingual NMT (MNMT) [johnson2017google], and cross-lingual transfer without pretraining [kim2019effective]. The results show that our approaches consistently outperform the other approaches across languages and datasets, and especially surpass pivoting, a strong baseline in the zero-shot scenario that multilingual NMT systems often fail to beat [johnson2017google, al2019consistency, arivazhagan2019missing]. Pivoting translates source to pivot and then to target in two steps, making the translation process inefficient. Our approaches use one encoder-decoder model to translate between any zero-shot directions, which is more efficient than pivoting. Regarding the comparison between transfer approaches, our cross-lingual pretraining based transfer outperforms the transfer method without pretraining by a large margin.
Results on Europarl Dataset.
Regarding the comparison between the baselines in Table 2, we find that pivoting is the strongest baseline, with a significant advantage over the other two. Cross-lingual transfer for languages without shared vocabularies [kim2019effective] performs worst because it does not use the source→pivot parallel data, which serves as a beneficial supervised signal for the other two baselines.
Our best approach, MLM+BRLM-SA, achieves significantly superior performance to all baselines in the zero-shot directions, improving by 0.9-4.8 BLEU points over the strong pivoting baseline. Meanwhile, in the supervised pivot→target direction, our approaches perform even better than the original supervised Transformer, thanks to the shared encoder trained on both large-scale monolingual data and parallel data between multiple languages.
MLM alone, which does not use source→pivot parallel data, performs much better than the cross-lingual transfer baseline and achieves comparable results to pivoting. When MLM is combined with TLM or the proposed BRLM, the performance is further improved. MLM+BRLM-SA performs the best, and is better than MLM+BRLM-HA, indicating that soft alignment is more helpful than hard alignment for cross-lingual pretraining.
Results on MultiUN Dataset.
As with the experimental results on Europarl, MLM+BRLM-SA performs the best among all proposed cross-lingual pretraining based transfer approaches, as shown in Table 3. When comparing systems consisting of one encoder-decoder model for all zero-shot translation directions, our approaches perform significantly better than MNMT [johnson2017google].
Although it is challenging for one model to translate all zero-shot directions between the multiple distant language pairs of MultiUN, MLM+BRLM-SA still achieves better performance on Es→Ar and Es→Ru than strong pivoting, which uses MNMT to translate source to pivot and then to target in two separate steps, with each step receiving the supervised signal of parallel corpora. Our approaches surpass pivoting in all zero-shot directions by adding back translation [sennrich2015neural] to generate pseudo-parallel sentences for all zero-shot directions based on our pretrained models such as MLM+BRLM-SA, and further training our universal encoder-decoder model with these pseudo data. gu2019improved introduce back translation into MNMT, while we adopt it in our transfer approaches. Finally, our best MLM+BRLM-SA with back translation outperforms pivoting by 2.4 BLEU points on average, and outperforms MNMT [gu2019improved] by 4.6 BLEU points on average. Again, in the supervised translation directions, MLM+BRLM-SA with back translation also achieves better performance than the original supervised Transformer.
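The back-translation step for a zero-shot direction can be sketched as pairing real target-side sentences with synthetic sources (a minimal sketch; `reverse_translate` is a hypothetical stand-in for decoding with the pretrained model in the reverse direction):

```python
def back_translate(monolingual_targets, reverse_translate):
    """Create pseudo-parallel data for a zero-shot direction: translate
    target-side monolingual sentences back into the source language,
    pairing each synthetic source with its real target sentence."""
    return [(reverse_translate(t), t) for t in monolingual_targets]
```

The resulting (synthetic source, real target) pairs can then be mixed into further training of the universal encoder-decoder model.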
We first evaluate the representational invariance across languages for all cross-lingual pre-training methods. Following arivazhagan2019missing, we adopt a max-pooling operation to collect the sentence representation of each encoder layer for all source-pivot sentence pairs in the Europarl validation sets. Then we calculate the cosine similarity for each sentence pair and average all cosine scores. As shown in Figure 3, MLM+BRLM-SA has the most stable and similar cross-lingual representations of sentence pairs on all layers, and it achieves the best performance in zero-shot translation. This demonstrates that better cross-lingual representations benefit the transfer learning process. Besides, MLM+BRLM-HA is not as strong as MLM+BRLM-SA and is even worse than MLM+TLM on Fr-En, since MLM+BRLM-HA may suffer from wrong alignment knowledge from the external aligner tool. We also find an interesting phenomenon: as the number of layers increases, the cosine similarity decreases.
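The invariance measurement just described amounts to max-pooling the token vectors of each sentence and comparing the pooled vectors by cosine similarity, which can be sketched as (pure-Python illustration; function names are assumptions):

```python
import math

def max_pool(hidden_states):
    """Max-pool a sentence's token vectors (list of equal-length lists)
    into a single sentence vector, dimension by dimension."""
    return [max(tok[j] for tok in hidden_states)
            for j in range(len(hidden_states[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Averaging `cosine(max_pool(src_layer), max_pool(pivot_layer))` over all validation sentence pairs, per layer, yields the curves in Figure 3.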
Contextualized Word Representation.
We further sample an English-Russian sentence pair from the MultiUN validation sets and visualize the cosine similarity between the hidden states of the top encoder layer to investigate the differences among the cross-lingual pre-training methods. As shown in Figure 4, the hidden states generated by MLM+BRLM-SA show higher similarity between aligned words. This indicates that MLM+BRLM-SA gains better word-level representation alignment between the source and pivot languages, which better relieves the burden of the domain shift problem.
The Effect of Freezing Parameters.
Freezing parameters is a common strategy to avoid catastrophic forgetting in transfer learning [Howard2018UniversalLM]. Table 4 shows the performance of transfer learning when freezing different layers, evaluated on the MultiUN test set, in which En→Ru denotes the parent model, Ar→Ru and Es→Ru are two child models, and all models are based on MLM+BRLM-SA. We find that updating all parameters during training causes a notable drop in the zero-shot directions due to catastrophic forgetting. On the contrary, freezing all parameters leads to a decline in the supervised direction because the language features extracted during pre-training are not sufficient for the MT task. Freezing the first four layers of the Transformer shows the best performance and keeps the balance between pre-training and fine-tuning.
[Table 4: BLEU on the MultiUN test set when freezing different numbers of encoder layers, for En→Ru (supervised) and Ar→Ru, Es→Ru (zero-shot).]
In this paper, we propose a cross-lingual pretraining based transfer approach for the challenging zero-shot translation task, in which the source and target languages have no parallel data, while both have parallel data with a high-resource pivot language. With the aim of building language-invariant representations between the source and pivot languages for a smooth transfer of the parent model of the pivot→target direction to the child model of the source→target direction, we introduce one monolingual pretraining method and two bilingual pretraining methods to construct a universal encoder for the source and pivot languages. Experiments on public datasets show that our approaches significantly outperform several strong baseline systems and manifest language invariance in both sentence-level and word-level neural representations.
We would like to thank the anonymous reviewers for the helpful comments. This work was supported by National Key R&D Program of China (Grant No. 2016YFE0132100), National Natural Science Foundation of China (Grant No. 61525205, 61673289). This work was also partially supported by Alibaba Group through Alibaba Innovative Research Program and the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.