Neural machine translation (NMT) has achieved great success and reached satisfactory translation performance for several language pairs bahdanau_iclr2018_nmt; gehring2017cnnmt; vaswani2017transformer. Such breakthroughs depend heavily on the availability of large-scale bilingual sentence pairs. Taking the WMT14 English-French task as an example, NMT uses about 35 million parallel sentence pairs for training. As bilingual sentence pairs are costly to collect, the success of NMT has not been fully realized on the vast majority of language pairs, especially for zero-resource languages. Recently, artetxe2018unsupervisedmt and lample2018unsupervisedmt tackled this challenge by training unsupervised neural machine translation (UNMT) models using only monolingual data. These models achieve reasonably good accuracy but remain far behind state-of-the-art supervised models.
Most previous work focuses on modeling the architecture through parameter sharing or proper initialization. We argue that the drawback of UNMT mainly comes from the lack of supervised signals. In this paper, we propose a simple yet effective multilingual unsupervised NMT (Munmt) framework that leverages weakly supervised signals from multilingual data. More specifically, we consider two variants of multilingual supervision: a) Multilingual monolingual data that is unrelated to the training directions. For example, monolingual Fr data can help the training of En-De. b) A less strict unsupervised setting in which bilingual data of other language pairs can be introduced. For example, we can naturally leverage parallel En-Fr data to guide the unsupervised En-De direction.
We illustrate our methods in Figure 1. When only monolingual data is considered, Munmt takes advantage of multiple unsupervised translation tasks and jointly trains a single model to serve all directions. This multilingual approach has proved effective for supervised NMT but remains underexplored for UNMT. For the unrestricted scenario, we leverage the parallel data from other languages in two ways, shown in Figure 1(c) and Figure 1(d) respectively. Without loss of generality, we focus on the unsupervised direction En→De, with En-Fr as the parallel pair. For forward cross translation, the unsupervised system Fr→De serves as a teacher, translating the Fr part of the parallel data to De. The resulting synthetic data can be used to improve our target system En→De. This method can be viewed as consistency regularization, which aims at generating the same De output for both sides of a bilingual input pair. For backward cross translation, we translate the monolingual De data to Fr with De→Fr, and then translate Fr to En with Fr→En. The resulting synthetic bilingual data can be used for En→De as well. It is also worth mentioning that the unrestricted scenario still follows the multilingual approach, which means we jointly train both supervised and unsupervised directions in one model.
Munmt borrows ideas from multilingual NMT dong2015multi; johnson-etal-2017-googlemnmt; kudugunta2019investigating, although multilingual translation has not been well studied in unsupervised scenarios. Munmt is also related to pivot-based methods for zero-shot machine translation firat2016zero; chen2017teacher; ren-etal-2018-triangular; kim2019pivot. The main difference is that these methods model the zero-shot directions with two supervised NMT models. For Fr-De translation, pivot methods need both Fr-En and En-De parallel training data, which is more restrictive than Munmt. Our contributions can be summarized as follows: a) We extend unsupervised NMT to a multilingual approach, Munmt, which is capable of translating all language pairs, including both rich-resource and zero-resource ones. It makes fewer assumptions and can be widely used. b) We are the first to propose leveraging unrelated bilingual data to help the training of UNMT. We believe that the learning method can also be used in other unsupervised cross-lingual problems. c) We empirically evaluate Munmt on six benchmark translation directions. Munmt surpasses individual models by more than 3 BLEU points on average and even outperforms a strong competitor with pre-trained cross-lingual BERT.
2 Background: Multilingual NMT & UNMT
Neural machine translation is a sequence-to-sequence model composed of an encoder and a decoder. The basic building blocks of an NMT model can be RNNs sutskever_nips2014_s2smt, CNNs gehring2017cnnmt, or Transformers vaswani2017transformer. An NMT model encodes input sentences into intermediate representations and then decodes the target sentences word by word.
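Formally, let $x_l$ denote a sentence in language $l$. For supervised machine translation models, given a parallel dataset $D_{s,t}$ with source language $s$ and target language $t$, we use $\mathcal{L}_{s \rightarrow t}$ to denote the supervised training procedure from language $s$ to language $t$. The training loss for a single sentence pair $(x_s, x_t) \in D_{s,t}$ can be defined as:

$$\mathcal{L}_{s \rightarrow t} = -\log P(x_t \mid x_s).$$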
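To extend an individual NMT model to a multilingual one, johnson-etal-2017-googlemnmt proposed training multiple translation directions in a single model. The loss is then defined as:

$$\mathcal{L}_{\text{mt}} = \sum_{(s,t) \in \mathcal{D}} \mathcal{L}_{s \rightarrow t},$$

where $\mathcal{D}$ indicates all translation directions with parallel data and $\mathcal{L}_{s \rightarrow t}$ is the supervised loss for translating from language $s$ to language $t$. The training procedure and architecture are extremely simple and require no large model modification. Benefiting from the shared architecture, the multilingual model can even translate zero-shot directions.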
For unsupervised machine translation systems, only monolingual data is used during training. This method relies heavily on the dual structure of machine translation. Because the problem is ill-posed, it is important to find a way to associate the source-side and target-side languages. Methods for doing so include word embedding alignment artetxe-2017-unsupword; artetxe2018unsupervisedmt; lample2018unsupword; lample2018unsupervisedmt, joint learning of embeddings lample-2018-pbumt, and pre-training language models lample2019cross; song2019icml-mass. This step can also be seen as a good initialization of the system. With a better start, optimizing UNMT becomes much easier and thus yields better results. This observation is in line with our motivation.
After the initialization, unsupervised neural machine translation systems use language modeling via denoising auto-encoders and iterative back-translation to train the model. The language model gives a prior about how to encode and generate sentences in one language. Adding noise to the input prevents the system from merely copying and improves the quality of translation. We use $\mathcal{L}_{\text{lm}}$ to denote the language model training on a language. Back-translation translates the monolingual data into synthetic source sentences, and uses the synthetic sentences as source data and the original sentences as target data to train the model. For example, suppose we have monolingual data of language $t$ and want to train the translation direction from $s$ to $t$. We translate the sentences in $t$ to language $s$ with the model in inference mode, which gives pseudo-parallel data of $s$ and $t$. After that, we can use these data to train the model with the supervised method from $s$ to $t$. We use $\mathcal{L}_{\text{bt}}$ to denote this back-translation procedure. Back-translation is the key part of the unsupervised setting: it connects the two languages and trains the model in a way similar to the supervised one. For an unsupervised machine translation system with two languages $l_1$, $l_2$ and monolingual datasets $D_{l_1}$, $D_{l_2}$, the losses of the two steps are respectively:

$$\mathcal{L}_{\text{lm}} = \mathbb{E}_{x \sim D_{l_1}}\left[-\log P_{l_1 \rightarrow l_1}(x \mid C(x))\right] + \mathbb{E}_{x \sim D_{l_2}}\left[-\log P_{l_2 \rightarrow l_2}(x \mid C(x))\right],$$

$$\mathcal{L}_{\text{bt}} = \mathbb{E}_{x \sim D_{l_1}}\left[-\log P_{l_2 \rightarrow l_1}(x \mid u_{l_2}(x))\right] + \mathbb{E}_{x \sim D_{l_2}}\left[-\log P_{l_1 \rightarrow l_2}(x \mid u_{l_1}(x))\right],$$

where $C$ is a noise model that randomly drops, swaps, or blanks some words, and $u_{l'}(x)$ denotes the sentence in language $l'$ inferred from the sentence $x$ in the other language with the same model. The total loss of an unsupervised machine translation system is then:

$$\mathcal{L} = \lambda_{\text{lm}} \mathcal{L}_{\text{lm}} + \lambda_{\text{bt}} \mathcal{L}_{\text{bt}},$$

where $\lambda_{\text{lm}}$ and $\lambda_{\text{bt}}$ are coefficients. During training, $\lambda_{\text{lm}}$ is gradually reduced to 0. If the system is initialized with pre-trained language models, $\lambda_{\text{lm}}$ can even be set to 0. The initialization and back-translation are the most important parts of unsupervised machine translation.
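As an illustration of the noise model and the weighted total loss described above, the following is a minimal Python sketch. The function names, the probabilities `p_drop` and `p_blank`, the `swap_window` size, and the `<blank>` placeholder are all assumptions chosen for illustration. The local swap is implemented by sorting positions with a bounded random offset, which is one common way to realize word swapping and is not necessarily the authors' implementation.

```python
import random


def add_noise(tokens, rng, p_drop=0.1, p_blank=0.1, swap_window=3, blank="<blank>"):
    """Randomly drop, blank, and locally swap words (illustrative settings)."""
    # Drop each word with probability p_drop; keep at least one word if possible.
    kept = [t for t in tokens if rng.random() >= p_drop] or tokens[:1]
    # Replace each surviving word with a placeholder with probability p_blank.
    blanked = [blank if rng.random() < p_blank else t for t in kept]
    # Local swap: sort by position plus a bounded random offset, so that a
    # word can only move a few positions away from where it started.
    keys = [i + rng.uniform(0, swap_window) for i in range(len(blanked))]
    ordered = sorted(zip(keys, blanked), key=lambda pair: pair[0])
    return [token for _, token in ordered]


def total_loss(loss_lm, loss_bt, lam_lm, lam_bt):
    """Weighted sum of the denoising and back-translation losses."""
    return lam_lm * loss_lm + lam_bt * loss_bt


rng = random.Random(0)
sentence = "the cat sat on the mat".split()
noisy = add_noise(sentence, rng)  # a corrupted copy used as the encoder input
```

The denoising objective then asks the model to reconstruct `sentence` from `noisy`, which is what discourages plain copying.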
In addition, unsupervised machine translation assumes that the latent representations of different languages lie in a shared latent space. Since only monolingual data is available, shared representations ensure that the model connects these languages. Only on this basis can back-translation work properly. Sharing the model is one of the simplest ways to satisfy this assumption.
3 The Proposed Munmt
In this part, we propose Munmt, which is based on a multilingual machine translation model involving supervised and unsupervised methods with a triangular training structure. As mentioned before, the zero-shot directions of johnson-etal-2017-googlemnmt are not trained directly, so the performance of these translation directions cannot be guaranteed. We extend the zero-shot directions with the unsupervised method, which gives better performance for these directions. Munmt makes use of both monolingual data and parallel data during training. The goal of Munmt is to improve the translation performance of the zero-shot directions while keeping a multilingual model for the rich-resource directions.
Our settings mainly follow lample2019cross, but for time efficiency we only use pre-trained token embeddings for initialization rather than the pre-trained cross-lingual language model.
3.1 Munmt without Parallel Data
In Munmt with only monolingual data, we generally have several languages, each with its own monolingual data. The goal is to train a single system that can translate between any pair of these languages.
For simplicity, we describe our method on three languages. The method can be easily extended to more languages. Suppose that the three languages are En, Fr, and De, and that we have monolingual data for all three. There is also bilingual data for En-Fr, and our target is to train an unsupervised En→De system. The structure is depicted in Figure 1. Most previous zero-shot translation studies, such as the pivot method or the teacher-student method chen-etal-acl2017-teacher; al-shedivat-parikh-naacl2019-consistency, need parallel data for several pivot language pairs. In contrast, Munmt makes fewer assumptions: only one parallel language pair is needed. Therefore, the application scenario of Munmt is broader.
The baseline UNMT system mainly follows lample2019cross. The difference is that we use a shared model for all translation directions, including the vocabulary, the embeddings, the encoder, and the decoder. To distinguish different languages, we add language embeddings to the input of the encoder and the decoder lample-2018-pbumt; lample2019cross. Word embeddings, position embeddings, and language embeddings are added together as the input.
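The way the three embeddings are combined can be sketched in a few lines of plain Python. This is only an illustration under assumed names (`embed`, `tok_table`, `pos_table`, `lang_table`) and toy two-dimensional vectors. An actual model would use learned embedding matrices of the model dimension.

```python
def embed(token_ids, lang_id, tok_table, pos_table, lang_table):
    """Return one input vector per position: token + position + language."""
    lang_vec = lang_table[lang_id]  # the same language vector at every position
    return [
        [t + p + l for t, p, l in zip(tok_table[tok], pos_table[pos], lang_vec)]
        for pos, tok in enumerate(token_ids)
    ]


# Toy tables: 2 tokens, 2 positions, 2 languages, dimension 2 (all hypothetical).
tok_table = [[1.0, 2.0], [3.0, 4.0]]
pos_table = [[0.5, 0.25], [1.0, 2.0]]
lang_table = [[10.0, 20.0], [30.0, 40.0]]

inputs = embed([1, 0], lang_id=1, tok_table=tok_table,
               pos_table=pos_table, lang_table=lang_table)
```

Because the vocabulary and token table are shared across languages, the language embedding is the only signal that tells the encoder or decoder which language a sequence belongs to.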
3.2 Multilingual UNMT with Parallel Data
As bilingual data is more crucial than monolingual data, we should make the most of it instead of just treating it as monolingual. Specifically, the training process can be decomposed into two steps: multilingual joint training and cross-language translation.
In the first step, we jointly train a multilingual translation model. This approach is similar to Munmt without bilingual data. The difference is that we incorporate both supervised and unsupervised directions in one model. The model-sharing strategy helps transfer language-independent knowledge from supervised directions to unsupervised ones.
In the second step, we propose a simple yet effective strategy to leverage two-sided synthetic data for NMT. a) Forward Cross Translation: We translate the Fr part of the bilingual data to De, which yields triples of En, Fr, and synthetic De sentences. We then select the generated parallel En-De data to fine-tune the unsupervised direction. b) Backward Cross Translation: We translate the monolingual De data to Fr and then from Fr to En. The resulting parallel De-En data can also be used to fine-tune our all-in-one model.
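The two cross-translation strategies can be sketched as simple data-construction routines. The sketch below assumes a hypothetical `translate(sentence, src, tgt)` callable standing in for the jointly trained model in inference mode. The `toy_translate` stub only tags its input so that the data flow is visible, and none of these names come from the original implementation.

```python
def forward_cross_translation(parallel_en_fr, translate):
    """Translate the Fr side of En-Fr pairs to De; pair real En with synthetic De."""
    return [(en, translate(fr, "fr", "de")) for en, fr in parallel_en_fr]


def backward_cross_translation(mono_de, translate):
    """Translate De -> Fr -> En; pair synthetic En with the real De sentence."""
    pairs = []
    for de in mono_de:
        fr = translate(de, "de", "fr")   # first hop, an unsupervised direction
        en = translate(fr, "fr", "en")   # second hop, a supervised direction
        pairs.append((en, de))
    return pairs


def toy_translate(sentence, src, tgt):
    """Hypothetical stand-in for the model: tags the sentence with the direction."""
    return f"[{src}->{tgt}] {sentence}"


forward_pairs = forward_cross_translation([("hello", "bonjour")], toy_translate)
backward_pairs = backward_cross_translation(["hallo"], toy_translate)
```

In the forward case the target side is synthetic, while in the backward case the target side is real De text and only the source side is synthetic. Both sets of pairs are then used to fine-tune the model on the unsupervised direction.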
The details of our method are shown in Algorithm 1. The training is rather straightforward. Steps 1 to 5 in Algorithm 1 show the multilingual joint training procedure. First, we train the cross-lingual token embeddings on all mixed data, that is, the monolingual data of all three languages and the parallel data. The data are segmented into tokens through Byte-Pair Encoding (BPE) sennrich-haddow-birch:2016:P16-12, and the embeddings are generated using FastText bojanowski2017fasttextemb. After that, we initialize the model embeddings with the trained token embeddings. Then, in each training epoch, we iteratively switch among three training stages: language modeling, back-translation, and supervised training. Language modeling is accomplished via a denoising autoencoder on the monolingual data of all three languages. Back-translation is conducted only on unsupervised directions, for which no parallel data is available. In our setting, there are four directions for back-translation. For the remaining two directions, we do supervised training on the parallel data. Details of these stages have been introduced in the Background section. After each epoch, we can adjust the coefficients to control the training procedure. Usually, we anneal the language model coefficient to 0 gradually. The final loss of this joint training step can be written as:

$$\mathcal{L} = \lambda_{\text{mt}} \mathcal{L}_{\text{mt}} + \lambda_{\text{lm}} \mathcal{L}_{\text{lm}} + \lambda_{\text{bt}} \mathcal{L}_{\text{bt}},$$

where $\mathcal{L}_{\text{mt}}$, $\mathcal{L}_{\text{lm}}$, and $\mathcal{L}_{\text{bt}}$ are the supervised, language modeling, and back-translation losses summed over their respective directions or languages, and $\lambda_{\text{mt}}$, $\lambda_{\text{lm}}$, and $\lambda_{\text{bt}}$ are the corresponding coefficients.
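The split into supervised and back-translation directions in the joint training step can be checked with a short sketch. The helper names `split_directions` and `joint_loss` are assumptions for illustration. The only facts taken from the text are that a parallel pair supports supervised training in both of its directions and that all other directions are trained with back-translation.

```python
from itertools import permutations


def split_directions(languages, parallel_pairs):
    """Split all ordered language pairs into supervised and back-translation sets."""
    # A parallel corpus supports training in both directions of its pair.
    parallel = {frozenset(pair) for pair in parallel_pairs}
    supervised, back_translation = [], []
    for src, tgt in permutations(languages, 2):
        if frozenset((src, tgt)) in parallel:
            supervised.append((src, tgt))
        else:
            back_translation.append((src, tgt))
    return supervised, back_translation


def joint_loss(loss_mt, loss_lm, loss_bt, lam_mt=1.0, lam_lm=1.0, lam_bt=1.0):
    """Weighted sum of the supervised, language modeling, and back-translation losses."""
    return lam_mt * loss_mt + lam_lm * loss_lm + lam_bt * loss_bt


supervised, back_translation = split_directions(["en", "fr", "de"], [("en", "fr")])
print(len(supervised), len(back_translation))  # prints: 2 4
```

With three languages and one parallel pair, this reproduces the two supervised directions and four back-translation directions mentioned above.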
Through joint training, we obtain a multilingual translation model. However, the supervision from parallel data is not fully utilized, and the performance of the unsupervised directions can still be improved by further leveraging the bilingual data. Steps 6 and 7 introduce the forward and backward auxiliary data. The comparison and analyses are given in the Analyses section.
Table 1: Data settings.
|Triad||Mono. data||Para. data||Eval pair|
| ||En: 20M, De: 20M, Fr: 20M||En-Fr: 36M||En-De|
| ||En: 20M, De: 20M, Fr: 20M||En-De: 4.5M||En-Fr|
| ||En: 20M, Fr: 20M, Ro: 2.9M||En-Fr: 36M||En-Ro|

Table 2: Main results in BLEU on the six evaluated translation directions.
|Munmt w/o Para.||32.34||29.93||23.03||28.81||30.37||27.14|
|Munmt w/ Para.||33.67||30.47||23.99||28.68||31.95||28.27|
|Munmt + Forward||35.88||33.34||26.50||30.00||33.12||31.42|
|Munmt + Backward + Forward||36.53||33.41||26.60||30.10||35.09||31.58|
Our experiments are conducted in three different settings. For zero-shot directions, we use dataset settings similar to those of lample2019cross, that is, we use only monolingual data to simulate the zero-shot setting. We use a triad of languages to denote one language setting. The meaning of the triad is the same as in the Method section. We mainly evaluate on a single language pair per triad (the “Eval pair” column of Table 1), owing to the easy availability of evaluation data and the symmetry of the training structure. The settings are shown in Table 1.
Our data settings cover three triads. For English, French, and German monolingual data, we randomly extract 20 million sentences from all available WMT monolingual News Crawl datasets from 2007 to 2017. For Romanian monolingual data, we use all of the available Romanian sentences from the News Crawl dataset and augment them with WMT16 monolingual data, which results in 2.9 million sentences. For parallel data, we use the standard WMT 2014 English-French dataset consisting of about 36M sentence pairs, and the standard WMT 2014 English-German dataset consisting of about 4.5M sentence pairs. Following previous work, we report results on newstest 2014 for the English-French pair, and on newstest 2016 for English-German and English-Romanian.
4.2 Model Settings
We use Moses scripts (https://github.com/moses-smt/mosesdecoder) for sentence tokenization. All the languages in each setting are jointly segmented into sub-word units with 60K BPE codes, and the vocabulary is shared across all three languages. The cross-lingual token embeddings are trained on all monolingual and parallel data with the FastText toolkit (https://github.com/facebookresearch/fastText).
In the experiments, Munmt is built upon Transformer models. We use a Transformer with 6 layers, 1024 hidden units, 16 heads, GELU activations, a dropout rate of 0.1, and learned positional embeddings. The token embeddings, positional embeddings, and language embeddings have the same dimension of 1024 and are added up as the final inputs. We train our models with the Adam optimizer, a linear warm-up, and a learning rate that varies during training. For the coefficients, we set those of supervised training and back-translation to 1, and we set the initial language modeling coefficient to 1 and anneal it to 0 gradually until 300K updates. The model is trained on 8 NVIDIA V100 GPUs. We implement all our models in PyTorch based on the code of lample2019cross (https://github.com/facebookresearch/XLM). During testing, we use beam search with a beam size of 4. All results are evaluated in terms of BLEU with Moses scripts, in line with previous studies.
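The annealing of the language modeling coefficient can be sketched as a simple schedule. The text only states that the coefficient starts at 1 and reaches 0 by 300K updates, so the linear shape below is an assumption, and the function name `lm_coefficient` is hypothetical.

```python
def lm_coefficient(step, start=1.0, end_step=300_000):
    """Decay linearly from `start` at step 0 to 0 at `end_step`, then stay at 0."""
    return start * max(0.0, 1.0 - step / end_step)


# Coefficient at a few points of training (assuming a linear schedule).
schedule = [lm_coefficient(s) for s in (0, 150_000, 300_000, 400_000)]
```

Under this assumption the denoising objective contributes half its initial weight at 150K updates and is switched off entirely from 300K updates onward, leaving only back-translation and supervised training.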
4.3 Main Results
The main results are shown in Table 2. For each dataset setting, we use the multilingual unsupervised machine translation model of the evaluated language pair as the baseline (denoted as “Munmt w/o Para.” in the table), which is similar to the EMB initialization of lample2019cross. Owing to joint multilingual training, our baseline results are higher than the EMB results of lample2019cross. The MLM setting uses models pre-trained on large corpora for better initialization.
We then report the results of the Munmt model after joint training with parallel data. Compared with the baseline, Munmt with parallel data is higher in most settings, which means the multilingual training method helps the unsupervised directions.
Based on the jointly trained model, we use cross-translation synthetic data to improve the model. The results of “Munmt + Forward” are from the model tuned for only 1 epoch on about 100K sentences. This method is fast and surprisingly effective. “Munmt + Backward + Forward” denotes that, besides forward translation, we also take monolingual data and cross translate it to the source language. This method achieves the best performance, outperforming “Munmt w/o Para.” by more than 3 BLEU points on average. The improvements show great potential for introducing indirect supervision into unsupervised NMT. Simple yet effective, Munmt with only embedding initialization obtains significantly better results than MLM, which is pre-trained as a BERT-like language model lample2019cross.
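The average gain quoted above can be checked directly from the rows of Table 2. The short sketch below only averages the six reported BLEU scores per system. The variable names are illustrative.

```python
# BLEU scores copied from Table 2 (six evaluated directions per row).
table2 = {
    "Munmt w/o Para.": [32.34, 29.93, 23.03, 28.81, 30.37, 27.14],
    "Munmt w/ Para.": [33.67, 30.47, 23.99, 28.68, 31.95, 28.27],
    "Munmt + Forward": [35.88, 33.34, 26.50, 30.00, 33.12, 31.42],
    "Munmt + Backward + Forward": [36.53, 33.41, 26.60, 30.10, 35.09, 31.58],
}


def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


baseline = mean(table2["Munmt w/o Para."])
best = mean(table2["Munmt + Backward + Forward"])
gain = best - baseline  # average improvement in BLEU points
```

The averages are roughly 28.6 and 32.2 BLEU, a gain of about 3.6 points, which is consistent with the claim of more than 3 BLEU points on average.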
In this part, we conduct several studies to better understand the settings of our method.
Effectiveness on Supervised Directions
In Table 3, we report the performance on supervised directions to show how they are affected by jointly training a single system. First, we test the baseline supervised system, in which only the supervised training procedures of the two directions are run on the model. Its performance is slightly lower than the state of the art, because the model architecture is a little different, some techniques such as model averaging are not applied, and two directions are trained in one model. In Munmt, the performance of the supervised directions drops a little, but in exchange the performance of the zero-shot directions is greatly improved and the model can conveniently serve multiple translation directions.
Beam search or Greedy Decoding
For synthetic data generation, the reported results use greedy decoding for time efficiency. We compare the effects of decoding strategies on the language setting in which En-De is the supervised direction. With beam-search generation, the two evaluated directions reach 33.59 and 34.53 BLEU, respectively. Compared with greedy decoding, the first result is slightly better whereas the second is inferior, since beam search biases the synthetic data further toward the learned patterns. The results show that Munmt is robust to the decoding strategy when performing forward and backward cross translation.
Strategies of Making the Most of Parallel Data
It is beneficial to introduce parallel data to guide unsupervised training. Here we explore three different ways to leverage the supervised data and plot the performance curves over the course of training in Figure 3. The comparison system is Munmt trained only with monolingual data.
The simplest way to introduce parallel data is to fine-tune Munmt with additional parallel data (corresponding to “A”). This approach is similar to multi-task learning. High-level language-independent knowledge can transfer from supervised language pairs to unsupervised ones through parameter sharing. As the results show, fine-tuning improves the comparison system within a few update steps. With continued fine-tuning, however, the performance degrades dramatically. This is not unexpected, since with continued fine-tuning the model may drift toward the supervised direction, a phenomenon that has been explored in multi-task learning kirkpatrick2017overcoming.
We then study the effect of introducing backward cross translation. As shown in the three figures, “C” is clearly higher than the black horizontal baseline and lower than “B”. Backward cross translation introduces the multilingual signal in a relatively indirect way. It keeps the target side as real data and generates synthetic source data through cross translation. Since conventional back-translation is already included in bilingual UNMT, the information gain from backward cross translation is limited.
Forward cross translation, corresponding to “B”, achieves the best performance among all the competitors. This approach directly learns the pivot translation probability and proves effective. Compared with backward cross translation “C”, this suggests that auxiliary data on the target side is more important than auxiliary data on the source side.
Robustness on Parallel Data
Munmt is robust to the parallel data in several aspects. As shown in Figure 4, when we switch the parallel data from En-Fr to En-De, the performance is almost preserved. This result is also in line with the unsupervised En-Fr experiments in Table 2: the smaller En-De parallel data still greatly improves unsupervised En-Fr translation performance. We then reduce the scale of the En-De parallel data and, surprisingly, find that Munmt still works well even with a much smaller amount of supervised data. The experiments demonstrate that Munmt is robust and has great potential to be applied in practical systems.
5 Related Work
Neural machine translation models rely heavily on large amounts of parallel data. To tackle low-resource machine translation, many approaches have been explored to improve system performance. For low-resource machine translation, it has been shown that utilizing other rich-resource data is beneficial. These methods include multilingual translation systems firat-etal-emnlp2016-mnmt; johnson-etal-2017-googlemnmt, the teacher-student framework chen-etal-acl2017-teacher, and others zheng_ijcai2017-zeroresourcenmt. Beyond parallel data, many attempts have been made to explore the usefulness of monolingual data. These attempts include semi-supervised methods and unsupervised methods in which only monolingual data is used. Many works use monolingual data together with supervised data to get a better system. Some of them use small amounts of parallel data and augment the system with monolingual data sennrich-etal-acl2016-improvingwithmono; he_nips2016_dualnmt; wang_aaai2018_dualtransfernmt; gu-etal-naacl2018-lowresourcenmt; edunov-etal-emnlp2018-backtrans. Others utilize parallel data of rich-resource language pairs together with monolingual data ren-etal-acl2018-triangular; wang_iclr2018_multiagentnmt; al-shedivat-parikh-naacl2019-consistency. ren-etal-acl2018-triangular also proposed a triangular architecture, but their work still relied on parallel data of low-resource language pairs. With the joint help of parallel and monolingual data, the performance of a low-resource system can be improved.
However, learning with only monolingual data is a challenging task, since it is hard to find clues that connect the languages. One of the starting points of unsupervised translation is statistical decipherment ravi-knight-acl2011-deciphering; dou-knight-emnlpconll2012-decipher, which regards translation as decipherment. Some works use dictionaries, a few parallel sentences, or other forms of supervision to prune the search space and connect the languages klementiev-etal-eacl2012-mt; irvine-callison-burch-conll2014-hallucinating, and so may not be purely unsupervised. In 2017, purely unsupervised machine translation methods using only monolingual data were developed. On the basis of embedding alignment artetxe-2017-unsupword; lample2018unsupword, lample2018unsupervisedmt and artetxe2018unsupervisedmt developed similar methods for fully unsupervised machine translation. After that, much work has been done to improve unsupervised machine translation systems with methods such as statistical machine translation lample-2018-pbumt; artetxe-etal-emnlp2018-unsupervised; ren_aaai2019_usmt; artetxe-etal-acl2019-umt, pre-trained models lample2019cross; song2019icml-mass, and others wu-etal-naacl2019-extracteditumt. These methods greatly improved the performance of unsupervised machine translation.
Our work utilizes both monolingual and parallel data, and combines unsupervised and supervised machine translation through a multilingual translation method into a single model, Munmt, to get better performance on unsupervised language pairs.
In this work, we propose a multilingual machine translation framework, Munmt, which incorporates supervised and unsupervised methods to tackle the challenging unsupervised translation task. By mixing different training schemes in one model and utilizing backward and forward cross-language translation, we greatly improve the performance of the unsupervised NMT direction. Through joint training, Munmt can serve all translation directions in one model. Empirically, Munmt achieves substantial improvements over a strong UNMT baseline.
In the future, we plan to build a universal Munmt system involving a massive number of languages. We will study Munmt along the following lines: a) Will Munmt benefit from more supervised language pairs? b) Is it possible for Munmt to improve unsupervised translation with totally unrelated unsupervised language pairs? c) Can further analyses show how UNMT learns from indirect supervision from other language pairs?