, which advocates the use of neural networks to directly model the translation process in an end-to-end way. Thanks to the capability of learning representations from training data, NMT systems have achieved significant improvements over conventional statistical machine translation (SMT) across a variety of language pairs [Junczys-Dowmunt et al.2016, Johnson et al.2016].
However, a major challenge remains for NMT: large-scale parallel corpora do not exist for most language pairs. This is unfortunate because NMT is a data-hungry approach that requires a large amount of data to fully train its model parameters. Without sufficient training data, NMT tends to learn poor estimates for low-count events. Zoph et al. [2016] indicate that NMT obtains much worse translation quality than SMT when only small-scale parallel corpora are available.
As a result, improving neural machine translation on resource-scarce language pairs has attracted much attention in the community [Firat et al.2016, Zoph et al.2016, Johnson et al.2016]. Most existing methods focus on leveraging data of multiple resource-rich language pairs to improve NMT for resource-scarce language pairs. Firat et al. [2016] propose multi-way, multilingual neural machine translation to achieve direct source-to-target translation even without parallel data available. Zoph et al. [2016] present a transfer learning method that uses the model parameters trained for resource-rich language pairs to initialize and constrain the training of translation models for resource-scarce language pairs. Johnson et al. [2016] introduce a universal NMT model for all language pairs, which takes advantage of multilingual data to improve NMT for all languages involved.
Bridging source and target languages with a pivot language is another important direction, which has been intensively studied in conventional SMT [Cohn and Lapata2007, Wu and Wang2007, Utiyama and Isahara2007, Bertoldi et al.2008, Zahabi et al.2013, El Kholy et al.2013]. Pivot-based approaches assume that there exist source-pivot and pivot-target parallel corpora, which can be used to train source-to-pivot and pivot-to-target translation models, respectively. One of the most representative approaches, the triangulation approach, constructs a source-to-target phrase table by combining source-to-pivot and pivot-to-target phrase tables. Another representative approach adopts a pivot-based translation strategy, in which source-to-target translation is divided into two steps: the source sentence is first translated into a pivot sentence using the source-to-pivot model, and the pivot sentence is then translated into a target sentence using the pivot-to-target model. Pivot-based approaches have been widely used in SMT due to their simplicity, effectiveness, and minimal requirement of multilingual data. Recently, Johnson et al. [2016] adapt pivot-based approaches to NMT and show that their universal model without incremental training achieves much worse translation performance than pivot-based NMT.
However, pivot-based approaches often suffer from the error propagation problem: the errors made in the source-to-pivot translation are propagated to the pivot-to-target translation. This can be partly attributed to the discrepancy between source-pivot and pivot-target parallel corpora, since they are usually loosely related or even unrelated. To aggravate the situation, source-to-pivot and pivot-to-target translation models are trained independently, which further enlarges the gap between source and target languages.
In this work, we propose an approach to joint training for pivot-based neural machine translation. The basic idea is to connect the source-to-pivot and pivot-to-target NMT models and enable them to interact with each other during training. This can be done either by encouraging the sharing of word embeddings on the pivot language or by maximizing the likelihood of the cascaded model on a small source-target parallel corpus. Experiments on the Europarl and WMT corpora show that joint training of source-to-pivot and pivot-to-target models obtains significant improvements over independent training.
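Given a source language sentence $\mathbf{x}$ and a target language sentence $\mathbf{y}$, we use $P(\mathbf{y}|\mathbf{x};\theta_{x\rightarrow y})$ to denote a standard attention-based neural machine translation model [Bahdanau et al.2015], where $\theta_{x\rightarrow y}$ is a set of model parameters.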
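Ideally, the source-to-target model can be trained on a source-target parallel corpus $D_{x,y}$ using maximum likelihood estimation:

$$\hat{\theta}_{x\rightarrow y} = \operatorname*{argmax}_{\theta_{x\rightarrow y}} \Big\{ L(\theta_{x\rightarrow y}) \Big\}$$

where the log-likelihood is defined as

$$L(\theta_{x\rightarrow y}) = \sum_{\langle \mathbf{x},\mathbf{y}\rangle \in D_{x,y}} \log P(\mathbf{y}|\mathbf{x};\theta_{x\rightarrow y})$$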
Unfortunately, parallel corpora are usually not readily available for low-resource language pairs. Instead, one can assume that there exists a third language, called the pivot, for which source-pivot and pivot-target parallel corpora are available. As a result, it is possible to bridge the source and target languages with the pivot [Cohn and Lapata2007, Wu and Wang2007, Utiyama and Isahara2007, Bertoldi et al.2008, Zahabi et al.2013, El Kholy et al.2013].
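Let $\mathbf{z}$ be a pivot language sentence. The source-to-target model can be decomposed into two sub-models by treating the pivot sentence as a latent variable:

$$P(\mathbf{y}|\mathbf{x};\theta_{x\rightarrow z},\theta_{z\rightarrow y}) = \sum_{\mathbf{z}} P(\mathbf{z}|\mathbf{x};\theta_{x\rightarrow z})\, P(\mathbf{y}|\mathbf{z};\theta_{z\rightarrow y}) \tag{3}$$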
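Let $D_{x,z}$ be a source-pivot parallel corpus, and $D_{z,y}$ be a pivot-target parallel corpus. The source-to-pivot and pivot-to-target models can be independently trained on the two parallel corpora, respectively:

$$\hat{\theta}_{x\rightarrow z} = \operatorname*{argmax}_{\theta_{x\rightarrow z}} \Big\{ L(\theta_{x\rightarrow z}) \Big\}, \qquad \hat{\theta}_{z\rightarrow y} = \operatorname*{argmax}_{\theta_{z\rightarrow y}} \Big\{ L(\theta_{z\rightarrow y}) \Big\}$$

where the log-likelihoods are defined as:

$$L(\theta_{x\rightarrow z}) = \sum_{\langle \mathbf{x},\mathbf{z}\rangle \in D_{x,z}} \log P(\mathbf{z}|\mathbf{x};\theta_{x\rightarrow z}), \qquad L(\theta_{z\rightarrow y}) = \sum_{\langle \mathbf{z},\mathbf{y}\rangle \in D_{z,y}} \log P(\mathbf{y}|\mathbf{z};\theta_{z\rightarrow y})$$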
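As Figure 1 shows, a pivot-based translation strategy is usually adopted. Given an unseen source sentence $\mathbf{x}$ to be translated, the decision rule is given by:

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}} \Big\{ \sum_{\mathbf{z}} P(\mathbf{z}|\mathbf{x};\hat{\theta}_{x\rightarrow z})\, P(\mathbf{y}|\mathbf{z};\hat{\theta}_{z\rightarrow y}) \Big\}$$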
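Due to the exponential search space of the pivot language, the decoding process is usually approximated with two steps. The first step translates the source sentence into a pivot sentence:

$$\hat{\mathbf{z}} = \operatorname*{argmax}_{\mathbf{z}} \Big\{ P(\mathbf{z}|\mathbf{x};\hat{\theta}_{x\rightarrow z}) \Big\} \tag{9}$$

Then, the pivot sentence is translated to a target sentence:

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y}} \Big\{ P(\mathbf{y}|\hat{\mathbf{z}};\hat{\theta}_{z\rightarrow y}) \Big\} \tag{10}$$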
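As a toy illustration of how the two-step approximation can differ from the exact decision rule, the sketch below enumerates a three-sentence pivot space. The probability tables and the function names `marginal_decode` and `two_step_decode` are hypothetical and chosen only for illustration; real NMT models cannot enumerate the pivot space this way.

```python
# Toy conditional distributions (hypothetical numbers, not taken from the paper).
# p_z_given_x[z]    = P(z | x) for one fixed source sentence x
# p_y_given_z[z][y] = P(y | z)
p_z_given_x = {"z1": 0.4, "z2": 0.3, "z3": 0.3}
p_y_given_z = {
    "z1": {"y1": 0.6, "y2": 0.4},
    "z2": {"y1": 0.1, "y2": 0.9},
    "z3": {"y1": 0.1, "y2": 0.9},
}


def marginal_decode(p_z_given_x, p_y_given_z):
    """Exact rule: argmax over y of sum_z P(z|x) * P(y|z)."""
    scores = {}
    for z, pz in p_z_given_x.items():
        for y, py in p_y_given_z[z].items():
            scores[y] = scores.get(y, 0.0) + pz * py
    return max(scores, key=scores.get), scores


def two_step_decode(p_z_given_x, p_y_given_z):
    """Approximation: pick the best pivot first, then the best target given it."""
    z_hat = max(p_z_given_x, key=p_z_given_x.get)
    y_hat = max(p_y_given_z[z_hat], key=p_y_given_z[z_hat].get)
    return z_hat, y_hat


exact_y, exact_scores = marginal_decode(p_z_given_x, p_y_given_z)
z_hat, approx_y = two_step_decode(p_z_given_x, p_y_given_z)
print(exact_y, approx_y)
```

In this constructed case the most probable pivot `z1` leads to `y1`, while the marginal over all pivots prefers `y2`, so committing to a single pivot sentence changes the final translation.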
Although pivot-based approaches are widely used for addressing the data scarcity problem in machine translation, they suffer from cascaded translation errors: the mistakes made in the source-to-pivot translation as shown in Eq. (9) are propagated to the pivot-to-target translation as shown in Eq. (10). This can be partly attributed to the model discrepancy problem: the source-to-pivot and pivot-to-target models are quite different in terms of vocabulary and parameter space because the source-pivot and pivot-target parallel corpora are usually loosely related or even unrelated. To make things worse, the source-to-pivot model and the pivot-to-target model are trained on the two parallel corpora independently, which further increases the discrepancy between the two models.
Therefore, it is important to reduce the discrepancy between source-to-pivot and pivot-to-target models to further improve pivot-based neural machine translation.
3 Joint Training for Pivot-based NMT
3.1 Training Objective
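To alleviate the model discrepancy problem, we propose an approach to joint training for pivot-based neural machine translation. The basic idea is to connect source-to-pivot and pivot-to-target models and enable them to interact with each other during training. Our new training objective is given by:

$$J(\theta_{x\rightarrow z},\theta_{z\rightarrow y}) = L(\theta_{x\rightarrow z}) + L(\theta_{z\rightarrow y}) + \lambda R(\theta_{x\rightarrow z},\theta_{z\rightarrow y})$$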
Note that the training objective consists of three parts: the source-to-pivot likelihood $L(\theta_{x\rightarrow z})$, the pivot-to-target likelihood $L(\theta_{z\rightarrow y})$, and a connection term $R(\theta_{x\rightarrow z},\theta_{z\rightarrow y})$. The hyper-parameter $\lambda$ is used to balance the preference between the likelihoods and the connection term.
We expect that the connection term associates the source-to-pivot model with the pivot-to-target model and enables the interaction between two models during training. In the following subsection, we will introduce the three connection terms used in our experiments.
3.2 Connection Terms
It is difficult to connect the source-to-pivot and pivot-to-target models during training because the source-to-pivot and pivot-to-target models are distantly-related by definition. More importantly, NMT lacks linguistically interpretable language structures such as phrases in SMT to achieve a direct connection at the parameter level [Wu and Wang2007].
Fortunately, both the source-to-pivot and pivot-to-target models include the word embeddings of the pivot language as parameters. It is possible to connect the two models via pivot word embeddings.
More formally, let $V_{z}^{x\rightarrow z}$ be the pivot vocabulary of the source-to-pivot model and $V_{z}^{z\rightarrow y}$ be the pivot vocabulary of the pivot-to-target model. We use $w$ to denote a word in the pivot language and $\theta_{x\rightarrow z}^{w}$ to denote the vector representation of $w$ in the source-to-pivot model. $\theta_{z\rightarrow y}^{w}$ is defined in a similar way.
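Our first connection term encourages the two models to generate the same vector representations for pivot words in the intersection of two vocabularies:

$$R_{\mathrm{hard}}(\theta_{x\rightarrow z},\theta_{z\rightarrow y}) = \prod_{w \in V_{z}^{x\rightarrow z} \cap V_{z}^{z\rightarrow y}} \delta\big(\theta_{x\rightarrow z}^{w},\theta_{z\rightarrow y}^{w}\big)$$

where $\delta(\theta_{x\rightarrow z}^{w},\theta_{z\rightarrow y}^{w}) = 1$ if the two vectors $\theta_{x\rightarrow z}^{w}$ and $\theta_{z\rightarrow y}^{w}$ are identical. Otherwise, $\delta(\theta_{x\rightarrow z}^{w},\theta_{z\rightarrow y}^{w}) = 0$.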
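As word embeddings are unlikely to be exactly identical due to the divergence of natural languages, an alternative is to soften the above hard matching constraint by penalizing the Euclidean distance between two vectors:

$$R_{\mathrm{soft}}(\theta_{x\rightarrow z},\theta_{z\rightarrow y}) = -\sum_{w \in V_{z}^{x\rightarrow z} \cap V_{z}^{z\rightarrow y}} \big\lVert \theta_{x\rightarrow z}^{w} - \theta_{z\rightarrow y}^{w} \big\rVert_{2}$$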
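The two embedding-based connection terms can be sketched on plain Python dictionaries. This is a minimal illustration that assumes the hard term is a product of identity indicators and the soft term is a negated sum of Euclidean distances over the shared pivot words; the function names and the toy embeddings are hypothetical.

```python
import math


def shared_pivot_words(emb_src2pvt, emb_pvt2tgt):
    """Pivot words that appear in both models' pivot vocabularies."""
    return sorted(set(emb_src2pvt) & set(emb_pvt2tgt))


def hard_connection(emb_src2pvt, emb_pvt2tgt):
    """1.0 only if every shared pivot word has identical vectors, else 0.0."""
    for w in shared_pivot_words(emb_src2pvt, emb_pvt2tgt):
        if emb_src2pvt[w] != emb_pvt2tgt[w]:
            return 0.0
    return 1.0


def soft_connection(emb_src2pvt, emb_pvt2tgt):
    """Negated sum of Euclidean distances over shared pivot words."""
    total = 0.0
    for w in shared_pivot_words(emb_src2pvt, emb_pvt2tgt):
        total += math.dist(emb_src2pvt[w], emb_pvt2tgt[w])
    return -total


# Hypothetical 2-dimensional pivot embeddings of the two models.
emb_a = {"house": [1.0, 0.0], "cat": [0.0, 2.0], "tree": [1.0, 1.0]}
emb_b = {"house": [1.0, 0.0], "cat": [3.0, 6.0], "dog": [0.5, 0.5]}

hard = hard_connection(emb_a, emb_b)
soft = soft_connection(emb_a, emb_b)
print(hard, soft)
```

Only "house" and "cat" are shared here. Because the "cat" vectors differ, the hard term is zero, whereas the soft term still gives a graded signal that grows as the two vectors move closer.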
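The third connection term assumes that there is a small bridging source-target parallel corpus $D_{x,y}$ (the bridging corpus) available. The connection term is defined as the log-likelihood of the bridging data:

$$R_{\mathrm{likelihood}}(\theta_{x\rightarrow z},\theta_{z\rightarrow y}) = \sum_{\langle \mathbf{x},\mathbf{y}\rangle \in D_{x,y}} \log P(\mathbf{y}|\mathbf{x};\theta_{x\rightarrow z},\theta_{z\rightarrow y})$$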
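In training, our goal is to find the optimal source-to-pivot and pivot-to-target model parameters that maximize the training objective:

$$\hat{\theta}_{x\rightarrow z},\hat{\theta}_{z\rightarrow y} = \operatorname*{argmax}_{\theta_{x\rightarrow z},\theta_{z\rightarrow y}} \Big\{ J(\theta_{x\rightarrow z},\theta_{z\rightarrow y}) \Big\}$$

The partial derivative of $J(\theta_{x\rightarrow z},\theta_{z\rightarrow y})$ with respect to the parameters $\theta_{x\rightarrow z}$ of the source-to-pivot model can be calculated as:

$$\frac{\partial J(\theta_{x\rightarrow z},\theta_{z\rightarrow y})}{\partial \theta_{x\rightarrow z}} = \frac{\partial L(\theta_{x\rightarrow z})}{\partial \theta_{x\rightarrow z}} + \lambda \frac{\partial R(\theta_{x\rightarrow z},\theta_{z\rightarrow y})}{\partial \theta_{x\rightarrow z}}$$

The partial derivative with respect to the parameters $\theta_{z\rightarrow y}$ can be calculated similarly.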
The gradients of the first and second connection terms $R_{\mathrm{hard}}$ and $R_{\mathrm{soft}}$ with respect to model parameters are easy to calculate. However, calculating the gradients of the third connection term $R_{\mathrm{likelihood}}$ involves enumerating all possible pivot sentences in an exponential search space (see Eq. (15)).
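To make the source of this cost concrete, the following is a short derivation sketch for a single bridging pair $\langle \mathbf{x},\mathbf{y}\rangle$. It assumes that the likelihood connection term is the log of the marginal over pivot sentences, as in the decomposition of Eq. (3):

$$\frac{\partial}{\partial \theta_{x\rightarrow z}} \log \sum_{\mathbf{z}} P(\mathbf{z}|\mathbf{x};\theta_{x\rightarrow z})\, P(\mathbf{y}|\mathbf{z};\theta_{z\rightarrow y}) = \sum_{\mathbf{z}} \frac{P(\mathbf{z}|\mathbf{x};\theta_{x\rightarrow z})\, P(\mathbf{y}|\mathbf{z};\theta_{z\rightarrow y})}{\sum_{\mathbf{z}'} P(\mathbf{z}'|\mathbf{x};\theta_{x\rightarrow z})\, P(\mathbf{y}|\mathbf{z}';\theta_{z\rightarrow y})} \cdot \frac{\partial \log P(\mathbf{z}|\mathbf{x};\theta_{x\rightarrow z})}{\partial \theta_{x\rightarrow z}}$$

The weight in front of each gradient is the posterior probability of the pivot sentence given both $\mathbf{x}$ and $\mathbf{y}$, so an exact computation needs one term for every possible pivot sentence.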
To alleviate this problem, we follow standard practice and use a subset of pivot sentences to approximate the full space [Shen et al.2016, Cheng et al.2016]. Two methods can be used to generate a subset: sampling translations from the full space [Shen et al.2016] or generating a top-$k$ list of candidate translations [Cheng et al.2016]. We find that using top-$k$ lists leads to better results than sampling in our experiments.
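The subset approximation can be sketched as follows. This is a minimal illustration, assuming each candidate pivot sentence in the subset comes with two log-probabilities (source-to-pivot and pivot-to-target); the function names and the numbers are hypothetical.

```python
import math


def approx_log_marginal(candidates):
    """Approximate log sum_z P(z|x) P(y|z) over a candidate subset.

    candidates: list of (log P(z|x), log P(y|z)) pairs, one per pivot candidate.
    Uses log-sum-exp for numerical stability.
    """
    joint = [lp_z + lp_y for lp_z, lp_y in candidates]
    m = max(joint)
    return m + math.log(sum(math.exp(v - m) for v in joint))


def posterior_weights(candidates):
    """Weights of each candidate, renormalized over the subset only."""
    log_marg = approx_log_marginal(candidates)
    return [math.exp(lp_z + lp_y - log_marg) for lp_z, lp_y in candidates]


# Hypothetical top-3 list for one bridging sentence pair.
top_list = [
    (math.log(0.5), math.log(0.2)),
    (math.log(0.3), math.log(0.4)),
    (math.log(0.1), math.log(0.1)),
]

log_marginal = approx_log_marginal(top_list)
weights = posterior_weights(top_list)
print(math.exp(log_marginal), weights)
```

The renormalized weights play the role of the posterior over pivot sentences when gradients are computed on the subset instead of the full space. Note that the second candidate receives the largest weight even though the first has the highest source-to-pivot probability.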
We use standard mini-batched stochastic gradient descent algorithms to optimize model parameters. In each iteration, three mini-batches are constructed by randomly selecting sentence pairs from the source-pivot parallel corpus $D_{x,z}$, the pivot-target parallel corpus $D_{z,y}$, and the bridging source-target parallel corpus $D_{x,y}$ (only available for the third connection term), respectively. After separate gradient calculation in each mini-batch, the gradients are collected to update model parameters.
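The update schedule described above can be sketched with a deliberately simplified stand-in model. Everything below is an assumption made for illustration: each sub-model has a single scalar parameter, the "log-likelihoods" are negative squared errors, and the bridging term depends on the sum of both parameters. Only the control flow (three mini-batches, separate gradients, one collected update) mirrors the text.

```python
import random


def sample_batch(corpus, batch_size, rng):
    """Randomly select batch_size examples (with replacement) from a corpus."""
    return [rng.choice(corpus) for _ in range(batch_size)]


def joint_training_step(theta, corpora, rng, lam=1.0, lr=0.05, batch_size=4):
    """One gradient-ascent step on a toy joint objective.

    Toy objective (stand-in, not a real NMT likelihood):
      L_xz = -mean((theta_xz - a)^2),  L_zy = -mean((theta_zy - b)^2),
      R    = -mean((theta_xz + theta_zy - c)^2)
    """
    xz_batch = sample_batch(corpora["src_pvt"], batch_size, rng)
    zy_batch = sample_batch(corpora["pvt_tgt"], batch_size, rng)
    br_batch = sample_batch(corpora["bridge"], batch_size, rng)

    # Separate gradient calculation in each mini-batch.
    g_xz = sum(-2.0 * (theta["xz"] - a) for a in xz_batch) / batch_size
    g_zy = sum(-2.0 * (theta["zy"] - b) for b in zy_batch) / batch_size
    g_br = sum(-2.0 * (theta["xz"] + theta["zy"] - c) for c in br_batch) / batch_size

    # The connection term touches both models, so its gradient reaches both.
    grad = {"xz": g_xz + lam * g_br, "zy": g_zy + lam * g_br}
    return {name: theta[name] + lr * grad[name] for name in theta}


rng = random.Random(0)
corpora = {"src_pvt": [1.0], "pvt_tgt": [2.0], "bridge": [3.0]}
theta = {"xz": 0.0, "zy": 0.0}
for _ in range(500):
    theta = joint_training_step(theta, corpora, rng)
print(theta)
```

In this toy setting the three terms agree on the optimum (1 and 2), so the parameters converge there; the point of the sketch is only that the connection gradient, scaled by the balancing hyper-parameter, is added to both models' likelihood gradients before a single update.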
| es→en | en→fr | es→fr | de→en | en→fr | de→fr |
We evaluated our approach on two translation tasks:
Spanish-English-French: Spanish as the source language, English as the pivot language, and French as the target language,
German-English-French: German as the source language, English as the pivot language, and French as the target language.
Table 1 shows the statistics of the Europarl and WMT corpora used in our experiments. We use the tokenize.perl script for tokenization. For each language pair, we remove empty lines and retain sentence pairs with no more than 50 words. To keep the source-pivot and pivot-target corpora from overlapping, we split the overlapping pivot-language sentences of the two corpora into two equal-sized parts and merge each part with the non-overlapping portion of one of the two corpora.
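The filtering and de-overlapping steps can be sketched as below. This is one plausible reading of the description, not the authors' script: the helper names are hypothetical, whitespace tokenization stands in for tokenize.perl, and the overlapping pivot sentences are split by sorted order.

```python
def filter_pairs(pairs, max_len=50):
    """Drop pairs with an empty side or a side longer than max_len tokens."""
    kept = []
    for src, tgt in pairs:
        s_tokens, t_tokens = src.split(), tgt.split()
        if s_tokens and t_tokens and len(s_tokens) <= max_len and len(t_tokens) <= max_len:
            kept.append((src, tgt))
    return kept


def split_overlap(src_pvt, pvt_tgt):
    """Make the pivot sides of the two corpora disjoint.

    Pivot sentences found in both corpora are split into two halves:
    the first half stays only in src_pvt, the second only in pvt_tgt.
    """
    pivots_a = {p for _, p in src_pvt}
    pivots_b = {p for p, _ in pvt_tgt}
    overlap = sorted(pivots_a & pivots_b)
    half = len(overlap) // 2
    keep_in_a, keep_in_b = set(overlap[:half]), set(overlap[half:])
    new_src_pvt = [(s, p) for s, p in src_pvt if p not in keep_in_b]
    new_pvt_tgt = [(p, t) for p, t in pvt_tgt if p not in keep_in_a]
    return new_src_pvt, new_pvt_tgt


# Hypothetical miniature corpora.
src_pvt = [("s1", "p1"), ("s2", "p2"), ("s3", "p3"), ("s4", "p4")]
pvt_tgt = [("p1", "t1"), ("p2", "t2"), ("p5", "t5")]
new_src_pvt, new_pvt_tgt = split_overlap(src_pvt, pvt_tgt)

filtered = filter_pairs([("a b", "c"), ("", "x"), ("w " * 51, "y")])
print(new_src_pvt, new_pvt_tgt, filtered)
```

After the split, no pivot sentence appears in both corpora, which is the property the preprocessing is meant to guarantee.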
| Ground truth | source | uno no debe empezar a dudar en público del valor , tampoco del valor inmediato en el aspecto material , de esta ampliación . |
| | pivot | it makes little sense to start to doubt in public the value , including the direct value at a material level , of this enlargement . |
| | target | il ne faut pas commencer à douter en public de la valeur , ni de la valeur immédiate , de la portée matérielle de cet élargissement . |
| Indep. | pivot | one should not begin to doubt in terms of the value of courage , or of the immediate effect on material , of enlargement . [BLEU: 13.33] |
| | target | il ne faudrait pas se tromper en termes de valeur de courage ou d ’ effet immédiat sur le matériel , l ’ élargissement . [BLEU: 8.69] |
| Hard | pivot | one must not start to doubt in the public , not the immediate value in the material , this enlargement . [BLEU: 19.02] |
| | target | il ne faut pas que l ’ on commence à douter , ni au public , ni à la valeur immédiate , à l ’ élargissement . [BLEU: 25.36] |
| Soft | pivot | one cannot start thinking of the value of the value , and the immediate courage , of this enlargement . [BLEU: 21.57] |
| | target | on ne peut pas commencer à penser à la valeur de la valeur , au courage immédiat , de cet élargissement . [BLEU: 26.60] |
| Likelihood | pivot | one must not start to question the value of the value , either of the immediate value in the material aspect , of this enlargement . [BLEU: 24.60] |
| | target | il ne faut pas commencer à remettre en question la valeur de la valeur , ni de la valeur immédiate de l ’ aspect matériel , de cet élargissement . [BLEU: 56.40] |
The Europarl corpus consists of 850K Spanish-English sentence pairs with 22.32M Spanish words and 21.44M English words, 840K German-English sentence pairs with 20.88M German words and 21.91M English words, and 900K English-French sentence pairs with 22.56M English words and 25.00M French words. The WMT 2006 shared task datasets are used as the development and test sets. The evaluation metric is case-insensitive BLEU [Papineni et al.2002] as calculated by the
The WMT corpus is composed of the Common Crawl, News Commentary, Europarl v7 and UN corpora. The Spanish-English parallel corpus consists of 6.78M sentence pairs with 183.01M Spanish words and 166.28M English words. The English-French parallel corpus comprises 9.29M sentence pairs with 227.06M English words and 258.95M French words. The newstest2011 and newstest2012 datasets serve as development and test sets. We use case-sensitive BLEU as the evaluation metric.
We use the attention-based neural machine translation system RNNsearch [Bahdanau et al.2015] in our experiments. For the Europarl corpus in Table 1, we set the vocabulary size of all the languages to 30K, which covers over 99% of words for English, Spanish and French and over 97% for German. We follow Jean et al. [2015] to address rare words. For the Spanish-English and English-French corpora from the WMT corpus, due to the large vocabulary size, we adopt byte pair encoding [Sennrich et al.2016b] to split rare words into sub-words. The sub-word vocabulary size is set to 43K, 33K, and 43K for Spanish, English, and French, respectively. These sub-words cover 100% of the text.
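Coverage figures of this kind are computed as the share of running tokens that fall inside the most frequent word types. The sketch below shows that calculation; the function name and the toy token list are hypothetical.

```python
from collections import Counter


def vocab_coverage(tokens, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / len(tokens)


toy_tokens = "a a a b b c".split()
coverage_top2 = vocab_coverage(toy_tokens, 2)
print(coverage_top2)
```

With a vocabulary of two types, five of the six toy tokens are covered; the remaining token is the kind of rare word that must be handled separately.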
We set the hyper-parameter $\lambda$ for balancing between the likelihoods and the connection term to 1.0. The threshold for gradient clipping is set to 0.1. The bridging source-target parallel corpus contains 100K sentence pairs that do not overlap with the training data. We set $k$ to 10 for calculating top-$k$ lists to approximate the full search space. The parameters of the source-to-pivot and pivot-to-target translation models in the likelihood connection term are initialized with pre-trained model parameters.
| es→en | en→fr | es→fr |
| Firat et al. [2016] | 21.81 | 21.46 |
4.2 Results on the Europarl Corpus
Table 2 compares joint training with the three connection terms against independent training on the Europarl corpus. For each source-to-target translation task, we report source-to-pivot, pivot-to-target, and source-to-target translation results. In the Spanish-to-French task, the soft connection achieves significant improvements in the Spanish-to-French and Spanish-to-English directions, while the hard connection performs comparably with independent training. In the German-to-French task, both the soft and hard connections perform comparably with independent training.
In contrast, we find that the likelihood connection dramatically improves translation performance on both the Spanish-to-French and German-to-French tasks (up to +2.80 BLEU points in Spanish-to-French and up to +2.23 BLEU points in German-to-French). Significant improvements in the source-to-pivot and pivot-to-target directions are also observed. This suggests that introducing a source-to-target parallel corpus to maximize $P(\mathbf{y}|\mathbf{x};\theta_{x\rightarrow z},\theta_{z\rightarrow y})$ with $\mathbf{z}$ as latent variables improves the source-to-pivot and pivot-to-target translation models collaboratively.
Table 3 shows pivot and target translation examples of independent training and our approaches. Apparently, our approaches improve translation quality of both pivot sentences and target sentences.
According to Eq. (3), the cost of the source-to-target model can be decomposed into the costs of the source-to-pivot and pivot-to-target models. Because we have a small trilingual test corpus (Spanish, English, French), we use the English sentence to approximate the latent variable in Eq. (3) and then calculate the Spanish-to-French cost on the trilingual corpus. Figure 2 shows the learning curves of the test cost for independent training and for joint training with the three connection terms. The hard and soft connections learn more slowly than independent training. The likelihood connection drives its cost lower within just 10K iterations of fine-tuning from pre-trained parameters.
| # Sent. | es→en | en→fr | es→fr |
4.3 Results on the WMT Corpus
The likelihood connection obtains the best performance among our three proposed connection terms in the experiments on the Europarl corpus. To further verify its practicability, Table 4 shows results on the WMT corpus, which is much larger. We find that the likelihood connection still significantly outperforms independent training in the Spanish-to-English, English-to-French and Spanish-to-French directions (up to +1.18 BLEU points in Spanish-to-French).
We also compare our approach with Firat et al. [2016], who propose a multi-way, multilingual NMT model to build a source-to-target translation model. Although our parallel training corpus is much smaller than theirs, Table 5 shows that our approach achieves substantial improvements over their results (up to +4.32 BLEU).
4.4 Effect of Bridging Corpora
Since bridging corpora are used in the likelihood connection term for “bridging” the source-to-pivot and pivot-to-target translation models, why not build NMT systems directly on these corpora?
We train source-to-target models using only the bridging corpora and show the translation results in Table 6. We observe that performance is much worse than that of the pivot-based translation strategy in Table 2 and Table 4. This indicates that NMT yields poor performance on low-resource language pairs and that the pivot-based translation strategy effectively alleviates data scarcity.
We also investigate the effect of the size of the bridging corpus on the likelihood connection. Table 7 shows that even a small parallel corpus (1K sentence pairs) yields a measurable improvement. When more than 50K sentence pairs are added, further improvements become modest. This finding suggests that a small corpus suffices for the likelihood connection to reach reasonable performance.
5 Related Work
Our work is inspired by two lines of research: (1) machine translation with pivot languages and (2) incorporating additional data resources for NMT.
5.1 Machine Translation with Pivot Languages
Machine translation suffers from the scarcity of parallel corpora. For low-resource language pairs, a pivot language is introduced to “bridge” source and target languages in statistical machine translation [Cohn and Lapata2007, Wu and Wang2007, Utiyama and Isahara2007, Zahabi et al.2013, El Kholy et al.2013].
In NMT, Firat et al. [2016] and Johnson et al. [2016] propose multi-way, multilingual NMT models that enable zero-resource machine translation. They also need to apply pivot-based approaches to NMT to improve the performance of zero-resource machine translation. Zoph et al. [2016] adopt transfer learning to fine-tune the parameters of low-resource language pairs starting from parameters trained on high-resource language pairs. In contrast, our approach aims to jointly train source-to-pivot and pivot-to-target NMT models, which can alleviate the error propagation of pivot-based approaches. We use connection terms to “bridge” these two models and make them benefit each other.
5.2 Incorporating Additional Data Resources for NMT
Due to the limited quantity, quality and coverage of parallel corpora, additional data resources have attracted attention recently. Gulcehre et al. [2015] propose to incorporate target-side monolingual corpora as a language model for NMT. Sennrich et al. [2016a] pair target monolingual corpora with their corresponding translations, then merge them with parallel corpora to retrain the source-to-target model. Zhang and Zong [2016] propose two approaches, a self-training algorithm and a multi-task learning framework, to incorporate source-side monolingual corpora. Cheng et al. [2016] introduce an autoencoder framework to reconstruct monolingual sentences using source-to-target and target-to-source NMT models. Their model can exploit both source and target monolingual corpora. In contrast to Cheng et al. [2016], the objective of our likelihood connection is to maximize the probability of target-language sentences through pivot sentences given source sentences. We use a small source-to-target parallel corpus to train source-to-pivot and pivot-to-target NMT models jointly.
We present joint training for pivot-based neural machine translation. Experiments on different language pairs confirm that our approach achieves significant improvements. It is appealing to combine source and pivot sentences for decoding target sentences [Firat et al.2016] or train a multi-source model directly [Zoph and Knight2016]. We also plan to study better connection terms for our joint training.
- [Bahdanau et al.2015] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
- [Bertoldi et al.2008] Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. Phrase-based statistical machine translation with pivot languages. In IWSLT, 2008.
- [Cheng et al.2016] Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Semi-supervised learning for neural machine translation. In Proceedings of ACL, 2016.
- [Cohn and Lapata2007] Trevor Cohn and Mirella Lapata. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of ACL, 2007.
- [El Kholy et al.2013] Ahmed El Kholy, Nizar Habash, Gregor Leusch, Evgeny Matusov, and Hassan Sawaf. Language independent connectivity strength features for phrase pivot statistical machine translation. In Proceedings of ACL, 2013.
- [Firat et al.2016] Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of EMNLP, 2016.
- [Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. On using monolingual corpora in neural machine translation. arXiv:1503.03535 [cs.CL], 2015.
- [Jean et al.2015] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL, 2015.
- [Johnson et al.2016] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
- [Junczys-Dowmunt et al.2016] Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. Is neural machine translation ready for deployment? a case study on 30 translation directions. arXiv:1610.01108v2, 2016.
- [Koehn2004] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, 2004.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, 2002.
- [Sennrich et al.2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of ACL, 2016.
- [Sennrich et al.2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of ACL, 2016.
- [Shen et al.2016] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In Proceedings of ACL, 2016.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, 2014.
- [Utiyama and Isahara2007] Masao Utiyama and Hitoshi Isahara. A comparison of pivot methods for phrase-based statistical machine translation. In HLT-NAACL, 2007.
- [Wu and Wang2007] Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical machine translation. Machine Translation, 2007.
- [Zahabi et al.2013] Samira Tofighi Zahabi, Somayeh Bakhshaei, and Shahram Khadivi. Using context vectors in improving a machine translation system with bridge language. In Proceedings of ACL, 2013.
- [Zhang and Zong2016] Jiajun Zhang and Chengqing Zong. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, 2016.
- [Zoph and Knight2016] Barret Zoph and Kevin Knight. Multi-source neural translation. In Proceedings of NAACL, 2016.
- [Zoph et al.2016] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, 2016.