End-to-end neural machine translation (NMT), which leverages a single, large neural network to directly transform a source-language sentence into a target-language sentence, has attracted increasing attention in recent years [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015]. Free of the latent structure design and feature engineering that are critical in conventional statistical machine translation (SMT) [Brown et al.1993, Koehn et al.2003, Chiang2005], NMT has proven to excel in modeling long-distance dependencies by enhancing recurrent neural networks (RNNs) with gating [Hochreiter and Schmidhuber1993, Cho et al.2014, Sutskever et al.2014] and attention mechanisms [Bahdanau et al.2015].
However, most existing NMT approaches suffer from a major drawback: they rely heavily on parallel corpora for training translation models. This is because NMT directly models the probability of a target-language sentence given a source-language sentence and does not have a separate language model as SMT does [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015]. Unfortunately, parallel corpora are usually only available for a handful of resource-rich languages and are restricted to limited domains such as government documents and news reports. In contrast, SMT is capable of exploiting abundant target-side monolingual corpora to boost the fluency of translations. Therefore, the unavailability of large-scale, high-quality, and wide-coverage parallel corpora hinders the applicability of NMT.
As a result, several authors have tried to use abundant monolingual corpora to improve NMT. Gulcehre et al. [2015] propose two methods, referred to as shallow fusion and deep fusion, to integrate a language model into NMT. The basic idea is either to use the language model to score the candidate words proposed by the translation model at each time step (shallow fusion) or to concatenate the hidden states of the language model and the decoder (deep fusion). Although their approach leads to significant improvements, one possible downside is that the network architecture has to be modified to integrate the language model.
Alternatively, Sennrich et al. [2015] propose two approaches to exploiting monolingual corpora that are transparent to network architectures. The first approach pairs monolingual sentences with dummy input; the parameters of the encoder and attention model are then fixed when training on these pseudo parallel sentence pairs. In the second approach, they first train a neural translation model on the parallel corpus and then use the learned model to translate a monolingual corpus. The monolingual corpus and its translations constitute an additional pseudo parallel corpus. Similar ideas have also been suggested in conventional SMT [Ueffing et al.2007, Bertoldi and Federico2009]. Sennrich et al. [2015] report that their approach significantly improves translation quality across a variety of language pairs.
In this paper, we propose semi-supervised learning for neural machine translation. Given labeled (i.e., parallel corpora) and unlabeled (i.e., monolingual corpora) data, our approach jointly trains source-to-target and target-to-source translation models. The key idea is to append a reconstruction term to the training objective, which aims to reconstruct the observed monolingual corpora using an autoencoder. In the autoencoder, the source-to-target and target-to-source models serve as the encoder and decoder, respectively. As exact inference is intractable, we propose to approximate the full search space with a small sample of candidate translations to improve efficiency. Specifically, our approach has the following advantages:
Transparent to network architectures: our approach does not depend on specific architectures and can be easily applied to arbitrary end-to-end NMT systems.
Both the source and target monolingual corpora can be used: our approach can benefit NMT not only from target monolingual corpora, as is conventional, but also from monolingual corpora of the source language.
Experiments on Chinese-English NIST datasets show that our approach results in significant improvements in both directions over state-of-the-art SMT and NMT systems.
2 Semi-Supervised Learning for Neural Machine Translation
2.1 Supervised Learning
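Given a parallel corpus $\mathcal{D} = \{\langle \mathbf{x}^{(n)}, \mathbf{y}^{(n)} \rangle\}_{n=1}^{N}$, the standard training objective in NMT is to maximize the likelihood of the training data:
$$L(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \boldsymbol{\theta}) \tag{1}$$
where $P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ is a neural translation model and $\boldsymbol{\theta}$ is a set of model parameters. $\mathcal{D}$ can be seen as labeled data for the task of predicting a target sentence $\mathbf{y}$ given a source sentence $\mathbf{x}$.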
As the translation probability $P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ is modeled by a single, large neural network, there does not exist a separate target language model in NMT. Therefore, parallel corpora have been the only resource for parameter estimation in most existing NMT systems. Unfortunately, even for a handful of resource-rich languages, the available domains are unbalanced and restricted to government documents and news reports. As a result, the limited availability of large-scale, high-quality, and wide-coverage parallel corpora becomes a major obstacle for NMT.
2.2 Autoencoders on Monolingual Corpora
It is appealing to explore the more readily available, abundant monolingual corpora to improve NMT. Let us first consider an unsupervised setting: how can NMT models be trained on a target-language monolingual corpus $\mathcal{T} = \{\mathbf{y}^{(t)}\}_{t=1}^{T}$?
Our idea is to leverage autoencoders [Vincent et al.2010, Socher et al.2011]: (1) encoding an observed target sentence into a latent source sentence using a target-to-source translation model and (2) decoding the source sentence to reconstruct the observed target sentence using a source-to-target model. For example, as shown in Figure 1(b), given an observed English sentence “Bush held a talk with Sharon”, a target-to-source translation model (i.e., encoder) transforms it into a Chinese translation “bushi yu shalong juxing le huitan” that is unobserved on the training data (highlighted in grey). Then, a source-to-target translation model (i.e., decoder) reconstructs the observed English sentence from the Chinese translation.
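More formally, let $P(\mathbf{y} \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}})$ and $P(\mathbf{x} \mid \mathbf{y}; \overleftarrow{\boldsymbol{\theta}})$ be source-to-target and target-to-source translation models respectively, where $\overrightarrow{\boldsymbol{\theta}}$ and $\overleftarrow{\boldsymbol{\theta}}$ are the corresponding model parameters. An autoencoder aims to reconstruct the observed target sentence via a latent source sentence:
$$P(\mathbf{y}' \mid \mathbf{y}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = \sum_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{y}; \overleftarrow{\boldsymbol{\theta}}) \, P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}}) \tag{2}$$
where $\mathbf{y}$ is an observed target sentence, $\mathbf{y}'$ is a copy of $\mathbf{y}$ to be reconstructed, and $\mathbf{x}$ is a latent source sentence.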
We refer to Eq. (2) as a target autoencoder. (Our definition of autoencoders is inspired by Ammar et al. [2014]. Note that our autoencoders share the spirit of conventional autoencoders [Vincent et al.2010, Socher et al.2011], except that the hidden layer is a latent sentence instead of real-valued vectors.)
Likewise, given a monolingual corpus of source language $\mathcal{S} = \{\mathbf{x}^{(s)}\}_{s=1}^{S}$, it is natural to introduce a source autoencoder that aims at reconstructing the observed source sentence via a latent target sentence:
$$P(\mathbf{x}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = \sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}}) \, P(\mathbf{x}' \mid \mathbf{y}; \overleftarrow{\boldsymbol{\theta}}) \tag{3}$$
Please see Figure 1(a) for illustration.
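To make the marginalization over latent sentences concrete, the following minimal sketch computes a target-autoencoder reconstruction probability over a tiny, hand-made candidate set. The candidate sentences, all probabilities, and the function name `reconstruction_prob` are hypothetical and chosen only for illustration; a real system would obtain these probabilities from the two translation models.

```python
import math


def reconstruction_prob(candidates):
    """Return the sum over x of P(x | y) * P(y' | x) for the given candidates.

    `candidates` maps each latent source sentence x to the pair
    (P(x | y), P(y' | x)). Both numbers are toy values here.
    """
    return sum(p_encode * p_decode for p_encode, p_decode in candidates.values())


# Hypothetical latent source sentences for one observed English sentence.
toy_candidates = {
    "bushi yu shalong juxing le huitan": (0.6, 0.5),
    "bushi he shalong juxing huitan": (0.3, 0.4),
    "bushi juxing le huitan": (0.1, 0.1),
}

prob = reconstruction_prob(toy_candidates)
log_prob = math.log(prob)  # the log reconstruction probability of this sentence
```

With only three candidates the sum is trivial; the difficulty discussed later arises because the true sum ranges over every possible source sentence.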
2.3 Semi-Supervised Learning
As the autoencoders involve both source-to-target and target-to-source models, it is natural to combine parallel corpora and monolingual corpora to learn bidirectional NMT models in a semi-supervised setting.
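Formally, given a parallel corpus $\mathcal{D} = \{\langle \mathbf{x}^{(n)}, \mathbf{y}^{(n)} \rangle\}_{n=1}^{N}$, a monolingual corpus of target language $\mathcal{T} = \{\mathbf{y}^{(t)}\}_{t=1}^{T}$, and a monolingual corpus of source language $\mathcal{S} = \{\mathbf{x}^{(s)}\}_{s=1}^{S}$, we introduce our new semi-supervised training objective as follows:
$$J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \overrightarrow{\boldsymbol{\theta}}) + \sum_{n=1}^{N} \log P(\mathbf{x}^{(n)} \mid \mathbf{y}^{(n)}; \overleftarrow{\boldsymbol{\theta}}) + \lambda_1 \sum_{t=1}^{T} \log P(\mathbf{y}' \mid \mathbf{y}^{(t)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) + \lambda_2 \sum_{s=1}^{S} \log P(\mathbf{x}' \mid \mathbf{x}^{(s)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) \tag{4}$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters for balancing the preference between likelihood and autoencoders.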
Note that the objective consists of four parts: source-to-target likelihood, target-to-source likelihood, target autoencoder, and source autoencoder. In this way, our approach is capable of exploiting abundant monolingual corpora of both source and target languages.
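As a rough illustration of how the four parts combine, the sketch below adds toy log-probabilities into a single scalar. The function name, the argument names, and the assignment of `lambda1` to the target autoencoder and `lambda2` to the source autoencoder are assumptions made for this example, and the numbers are invented.

```python
def semi_supervised_objective(ll_s2t, ll_t2s, ae_target, ae_source, lambda1, lambda2):
    """Combine the four parts of the objective into one scalar.

    ll_s2t / ll_t2s: per-pair log-likelihoods on the parallel corpus.
    ae_target / ae_source: per-sentence log reconstruction probabilities
    on the target and source monolingual corpora.
    lambda1 / lambda2: assumed weights of the target and source autoencoders.
    """
    likelihood = sum(ll_s2t) + sum(ll_t2s)
    autoencoders = lambda1 * sum(ae_target) + lambda2 * sum(ae_source)
    return likelihood + autoencoders


# Toy log-probabilities (hypothetical values).
value = semi_supervised_objective(
    ll_s2t=[-2.0, -3.0],
    ll_t2s=[-1.5, -2.5],
    ae_target=[-4.0],
    ae_source=[-6.0],
    lambda1=0.1,
    lambda2=0.0,  # a zero weight switches the source autoencoder off
)
```

Setting one weight to zero recovers a setting in which only one side's monolingual corpus contributes to training.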
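The optimal model parameters are given by
$$\overrightarrow{\boldsymbol{\theta}}^{*} = \operatorname*{argmax}_{\overrightarrow{\boldsymbol{\theta}}} \; J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) \tag{5}$$
$$\overleftarrow{\boldsymbol{\theta}}^{*} = \operatorname*{argmax}_{\overleftarrow{\boldsymbol{\theta}}} \; J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) \tag{6}$$
where $J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})$ denotes the semi-supervised training objective above.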
It is clear that the source-to-target and target-to-source models are connected via the autoencoder and can hopefully benefit each other in joint training.
We use mini-batch stochastic gradient descent to train our joint model. For each iteration, besides the mini-batch from the parallel corpus, we also construct two additional mini-batches by randomly selecting sentences from the source and target monolingual corpora. Then, gradients are collected from these mini-batches to update model parameters.
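The batching scheme can be sketched as follows. This is only an illustration under simplifying assumptions: the corpora are small in-memory lists, all three mini-batches share one `batch_size`, and the helper name `sample_minibatches` is hypothetical.

```python
import random


def sample_minibatches(parallel, target_mono, source_mono, batch_size, rng):
    """Draw one mini-batch from each of the three corpora for one iteration."""
    return {
        "parallel": rng.sample(parallel, batch_size),
        "target_mono": rng.sample(target_mono, batch_size),
        "source_mono": rng.sample(source_mono, batch_size),
    }


# Toy corpora (hypothetical data).
parallel = [(f"src {i}", f"tgt {i}") for i in range(10)]
target_mono = [f"tgt only {i}" for i in range(20)]
source_mono = [f"src only {i}" for i in range(20)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batches = sample_minibatches(parallel, target_mono, source_mono, batch_size=4, rng=rng)
# Gradients computed on the three mini-batches would then be combined
# before a single parameter update.
```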
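The partial derivative of $J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})$ with respect to the source-to-target model $\overrightarrow{\boldsymbol{\theta}}$ is given by
$$\frac{\partial J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} = \sum_{n=1}^{N} \frac{\partial \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \overrightarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} + \lambda_1 \sum_{t=1}^{T} \frac{\partial \log P(\mathbf{y}' \mid \mathbf{y}^{(t)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} + \lambda_2 \sum_{s=1}^{S} \frac{\partial \log P(\mathbf{x}' \mid \mathbf{x}^{(s)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} \tag{7}$$
The partial derivative with respect to the target-to-source model $\overleftarrow{\boldsymbol{\theta}}$ can be calculated similarly.
The autoencoder terms in Eq. (7) involve sums over all latent sentences. For example, since the target autoencoder marginalizes over every possible source sentence, the derivative in the second term is
$$\frac{\partial \log P(\mathbf{y}' \mid \mathbf{y}^{(t)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} = \frac{\sum_{\mathbf{x} \in \mathcal{X}(\mathbf{y}^{(t)})} P(\mathbf{x} \mid \mathbf{y}^{(t)}; \overleftarrow{\boldsymbol{\theta}}) \, P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}}) \, \frac{\partial \log P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}}}{\sum_{\mathbf{x} \in \mathcal{X}(\mathbf{y}^{(t)})} P(\mathbf{x} \mid \mathbf{y}^{(t)}; \overleftarrow{\boldsymbol{\theta}}) \, P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}})} \tag{8}$$
where $\mathcal{X}(\mathbf{y}^{(t)})$ denotes the set of all possible source translations of $\mathbf{y}^{(t)}$.
It is prohibitively expensive to compute the sums due to the exponential search space of $\mathcal{X}(\mathbf{y}^{(t)})$.
Alternatively, we propose to use a subset $\tilde{\mathcal{X}}(\mathbf{y}^{(t)}) \subset \mathcal{X}(\mathbf{y}^{(t)})$ of the full space to approximate Eq. (8):
$$\frac{\partial \log P(\mathbf{y}' \mid \mathbf{y}^{(t)}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} \approx \frac{\sum_{\mathbf{x} \in \tilde{\mathcal{X}}(\mathbf{y}^{(t)})} P(\mathbf{x} \mid \mathbf{y}^{(t)}; \overleftarrow{\boldsymbol{\theta}}) \, P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}}) \, \frac{\partial \log P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}}}{\sum_{\mathbf{x} \in \tilde{\mathcal{X}}(\mathbf{y}^{(t)})} P(\mathbf{x} \mid \mathbf{y}^{(t)}; \overleftarrow{\boldsymbol{\theta}}) \, P(\mathbf{y}' \mid \mathbf{x}; \overrightarrow{\boldsymbol{\theta}})} \tag{9}$$
In practice, we use the top-$k$ list of candidate translations of $\mathbf{y}^{(t)}$ as $\tilde{\mathcal{X}}(\mathbf{y}^{(t)})$. As $|\tilde{\mathcal{X}}(\mathbf{y}^{(t)})| \ll |\mathcal{X}(\mathbf{y}^{(t)})|$, it is possible to calculate Eq. (9) efficiently by enumerating all candidates in $\tilde{\mathcal{X}}(\mathbf{y}^{(t)})$. Empirically, we find that this approximation results in significant improvements and seems to suffice to keep the balance between efficiency and translation quality.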
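One way to picture the top-k approximation: differentiating the logarithm of a sum over candidates yields a weighted average of per-candidate gradients, where each candidate's weight is its normalized share of P(x | y) · P(y' | x). The sketch below computes these weights and the resulting gradient for a toy top-3 list. All probabilities and gradient vectors are hypothetical, and plain Python lists stand in for real parameter gradients; the function names are illustrative only.

```python
import math


def candidate_weights(log_p_encode, log_p_decode):
    """Normalize P(x | y) * P(y' | x) over the k candidates (in log space)."""
    scores = [a + b for a, b in zip(log_p_encode, log_p_decode)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def approx_gradient(weights, candidate_grads):
    """Weighted sum of per-candidate gradients of log P(y' | x)."""
    dim = len(candidate_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, candidate_grads)) for i in range(dim)]


# Toy top-3 list (hypothetical probabilities and 2-dimensional gradients).
log_p_encode = [math.log(p) for p in (0.6, 0.3, 0.1)]  # log P(x | y)
log_p_decode = [math.log(p) for p in (0.5, 0.4, 0.1)]  # log P(y' | x)
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = candidate_weights(log_p_encode, log_p_decode)
grad = approx_gradient(weights, grads)
```

Because the weights are renormalized over the k candidates only, the cost of one update grows linearly with k rather than with the size of the full search space.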
We evaluated our approach on the Chinese-English dataset.
As shown in Table 1, we use both a parallel corpus and two monolingual corpora as the training set. The parallel corpus from LDC consists of 2.56M sentence pairs with 67.53M Chinese words and 74.81M English words. The vocabulary sizes of Chinese and English are 0.21M and 0.16M, respectively. We use the Chinese and English parts of the Xinhua portion of the GIGAWORD corpus as the monolingual corpora. The Chinese monolingual corpus contains 18.75M sentences with 451.94M words. The English corpus contains 22.32M sentences with 399.83M words. The vocabulary sizes of Chinese and English are 0.97M and 1.34M, respectively.
For Chinese-to-English translation, we use the NIST 2006 Chinese-English dataset as the validation set for hyper-parameter optimization and model selection. The NIST 2002, 2003, 2004, and 2005 datasets serve as test sets. Each Chinese sentence has four reference translations. For English-to-Chinese translation, we use the NIST datasets in a reverse direction: treating the first English sentence in the four reference translations as a source sentence and the original input Chinese sentence as the single reference translation. The evaluation metric is case-insensitive BLEU[Papineni et al.2002] as calculated by the
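We compared our approach with two state-of-the-art SMT and NMT systems: Moses [Koehn et al.2007], a phrase-based SMT system, and RNNsearch [Bahdanau et al.2015], an attention-based NMT system.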
For Moses, we use the default setting to train the phrase-based translation model on the parallel corpus and optimize the parameters of log-linear models using the minimum error rate training algorithm [Och2003]. We use the SRILM toolkit [Stolcke2002] to train 4-gram language models.
For RNNsearch, we use the parallel corpus to train the attention-based neural translation models. We set the vocabulary size to 30K for both Chinese and English. We follow Luong et al. [2015] to address rare words.
On top of RNNsearch, our approach is capable of training bidirectional attention-based neural translation models on the concatenation of parallel and monolingual corpora. The sample size is set to 10. We set the hyper-parameter and when we add the target monolingual corpus, and and for source monolingual corpus incorporation. The threshold of gradient clipping is set to. The parameters of our model are initialized by the model trained on the parallel corpus.
3.2 Effect of Sample Size
As the inference of our approach is intractable, we propose to approximate the full search space with the top-$k$ list of candidate translations to improve efficiency (see Eq. (9)).
Figure 2 shows the BLEU scores of various settings of $k$ over time. Only the English monolingual corpus is appended to the training data. We observe that increasing the size of the approximate search space generally leads to improved BLEU scores. There are significant gaps between and . However, further increasing $k$ does not result in significant improvements and decreases the training efficiency. We find that $k = 10$ achieves a balance between training efficiency and translation quality. As shown in Figure 3, similar findings are also observed on the English-to-Chinese validation set. Therefore, we set $k = 10$ in the following experiments.
3.3 Effect of OOV Ratio
|this work||C E||35.61||38.78||38.32||38.49||36.45|
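Given a parallel corpus, what kind of monolingual corpus is most beneficial for improving translation quality? To answer this question, we investigate the effect of OOV ratio on translation quality, which is defined as
$$\mathrm{ratio} = \frac{\sum_{y \in \mathbf{y}} \mathbb{1}\left[ y \notin \mathcal{V}_{\mathcal{D}_t} \right]}{|\mathbf{y}|}$$
where $\mathbf{y}$ is a target-language sentence in the monolingual corpus $\mathcal{T}$, $y$ is a target-language word in $\mathbf{y}$, $\mathcal{V}_{\mathcal{D}_t}$ is the vocabulary of the target side of the parallel corpus $\mathcal{D}$, and $\mathbb{1}[\cdot]$ is 1 if its argument holds and 0 otherwise.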
Intuitively, the OOV ratio indicates how closely a sentence in the monolingual corpus resembles the parallel corpus. If the ratio is 0, all words in the monolingual sentence also occur in the parallel corpus.
Figure 4 shows the effect of OOV ratio on the Chinese-to-English validation set. Only the English monolingual corpus is appended to the parallel corpus during training. We constructed four monolingual corpora of the same size in terms of sentences. “0% OOV” means the OOV ratio is 0% for all sentences in the monolingual corpus. “10% OOV” means that the OOV ratio is no greater than 10% for each sentence in the monolingual corpus. We find that using a monolingual corpus with a lower OOV ratio generally leads to higher BLEU scores. One possible reason is that a low-OOV monolingual corpus is easier to reconstruct than its high-OOV counterpart and thus results in better estimation of model parameters.
Figure 5 shows the effect of OOV ratio on the English-to-Chinese validation set. Only the English monolingual corpus is appended to the parallel corpus during training. We find that “0% OOV” still achieves the highest BLEU scores.
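The OOV ratio and the construction of a bounded-OOV monolingual corpus can be sketched as below. This is a minimal illustration that assumes whitespace tokenization; `oov_ratio`, `filter_by_oov`, and the toy vocabulary are hypothetical names and data, not the tooling used in the experiments.

```python
def oov_ratio(sentence, vocab):
    """Fraction of words in `sentence` that are missing from `vocab`."""
    words = sentence.split()
    if not words:
        return 0.0  # treat an empty sentence as having no OOV words
    return sum(1 for w in words if w not in vocab) / len(words)


def filter_by_oov(corpus, vocab, max_ratio):
    """Keep only sentences whose OOV ratio does not exceed `max_ratio`."""
    return [s for s in corpus if oov_ratio(s, vocab) <= max_ratio]


# Toy target-side vocabulary of the parallel corpus (hypothetical).
vocab = {"bush", "held", "a", "talk", "with", "sharon"}
mono = ["bush held a talk", "bush met sharon yesterday"]

ratios = [oov_ratio(s, vocab) for s in mono]
zero_oov = filter_by_oov(mono, vocab, max_ratio=0.0)
```

Raising `max_ratio` to 0.1 would correspond to the “10% OOV” setting described above.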
| Type | Sentence | Iteration |
|---|---|---|
| Monolingual | hongsen shuo , ruguo you na jia famu gongsi dangan yishenshifa , name tamen jiang zihui qiancheng . | |
| Reference | hongsen said, if any logging companies dare to defy the law, then they will destroy their own future . | |
| Translation | hun sen said , if any of those companies dare defy the law , then they will have their own fate . | 0 |
| | hun sen said if any tree felling company dared to break the law , then they would kill themselves . | 40K |
| | hun sen said if any logging companies dare to defy the law , they would destroy the future themselves . | 240K |
| Monolingual | dan yidan panjue jieguo zuizhong queding , ze bixu zai 30 tian nei zhixing . | |
| Reference | But once the final verdict is confirmed , it must be executed within 30 days . | |
| Translation | however , in the final analysis , it must be carried out within 30 days . | 0 |
| | however , in the final analysis , the final decision will be carried out within 30 days . | 40K |
| | however , once the verdict is finally confirmed , it must be carried out within 30 days . | 240K |
3.4 Comparison with SMT
Table 2 shows the comparison between Moses and our work. Moses used the monolingual corpora as shown in Table 1: 18.75M Chinese sentences and 22.32M English sentences. We find that exploiting monolingual corpora dramatically improves translation performance in both Chinese-to-English and English-to-Chinese directions.
Relying only on the parallel corpus, RNNsearch outperforms Moses trained on the same parallel corpus alone. However, the ability to exploit abundant monolingual corpora enables Moses to achieve much higher BLEU scores than RNNsearch trained only on the parallel corpus.
Instead of using all sentences in the monolingual corpora, we constructed smaller monolingual corpora with zero OOV ratio: 2.56M Chinese sentences with 47.51M words and 2.56M English sentences with 37.47M words. In other words, the monolingual corpora we used in the experiments are much smaller than those used by Moses.
By adding English monolingual corpus, our approach achieves substantial improvements over RNNsearch using only parallel corpus (up to +4.7 BLEU points). In addition, significant improvements are also obtained over Moses using both parallel and monolingual corpora (up to +3.5 BLEU points).
An interesting finding is that adding English monolingual corpora helps to improve English-to-Chinese translation over RNNsearch using only parallel corpus (up to +3.2 BLEU points), suggesting that our approach is capable of improving NMT using source-side monolingual corpora.
In the English-to-Chinese direction, we obtain similar findings. In particular, adding Chinese monolingual corpus leads to more benefits to English-to-Chinese translation than adding English monolingual corpus. We also tried to use both Chinese and English monolingual corpora through simply setting all the to but failed to obtain further significant improvements.
Therefore, our findings can be summarized as follows:
Adding target monolingual corpus improves over using only parallel corpus for source-to-target translation;
Adding source monolingual corpus also improves over using only parallel corpus for source-to-target translation, but the improvements are smaller than adding target monolingual corpus;
Adding both source and target monolingual corpora does not lead to further significant improvements.
3.5 Comparison with Previous Work
We re-implemented the method of Sennrich et al. [2015] on top of RNNsearch as follows:
1. Train the target-to-source neural translation model on the parallel corpus.
2. Use the trained target-to-source model to translate the target monolingual corpus into a source monolingual corpus.
3. Pair the target monolingual corpus with its translations to form a pseudo parallel corpus, which is then appended to the original parallel corpus to obtain a larger parallel corpus.
4. Re-train the source-to-target neural translation model on the larger parallel corpus to obtain the final model parameters.
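The four steps above can be sketched as a small pipeline. The `train` and `translate` arguments are hypothetical callables standing in for a real NMT toolkit, and the toy stand-ins below simply memorize sentence pairs so that the example runs end to end.

```python
def pseudo_parallel_pipeline(parallel, target_mono, train, translate):
    """Build a pseudo parallel corpus from target monolingual text and re-train."""
    # Step 1: train a target-to-source model on the reversed parallel corpus.
    t2s_model = train([(tgt, src) for src, tgt in parallel])
    # Step 2: translate each target monolingual sentence into the source language.
    pseudo_sources = [translate(t2s_model, tgt) for tgt in target_mono]
    # Step 3: pair the translations with the originals and enlarge the corpus.
    enlarged = parallel + list(zip(pseudo_sources, target_mono))
    # Step 4: re-train the source-to-target model on the enlarged corpus.
    s2t_model = train(enlarged)
    return s2t_model, enlarged


def toy_train(pairs):
    """Toy 'model' that just memorizes input -> output (illustration only)."""
    return dict(pairs)


def toy_translate(model, sentence):
    """Look the sentence up in the toy model, or emit an unknown marker."""
    return model.get(sentence, "<unk>")


parallel = [("ni hao", "hello"), ("xie xie", "thanks")]
target_mono = ["hello", "goodbye"]
model, enlarged = pseudo_parallel_pipeline(parallel, target_mono, toy_train, toy_translate)
```

Note that in this sketch the pseudo parallel corpus is produced once and never refreshed after re-training.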
Table 3 shows the comparison results. Both approaches use the same parallel and monolingual corpora. Our approach achieves significant improvements over Sennrich et al. [2015] in both Chinese-to-English and English-to-Chinese directions (up to +1.8 and +1.0 BLEU points). One possible reason is that Sennrich et al. [2015] use the pseudo parallel corpus for parameter estimation only once (see Step 4 above), while our approach enables source-to-target and target-to-source models to interact with each other iteratively on both parallel and monolingual corpora.
To some extent, our approach can be seen as an iterative extension of the approach of Sennrich et al. [2015]: after estimating model parameters on the pseudo parallel corpus, the learned model parameters are used to produce a better pseudo parallel corpus. Table 4 shows example Viterbi translations on the Chinese monolingual corpus over iterations:
We observe that the quality of Viterbi translations generally improves over time.
4 Related Work
Our work is inspired by two lines of research: (1) exploiting monolingual corpora for machine translation and (2) autoencoders in unsupervised and semi-supervised learning.
4.1 Exploiting Monolingual Corpora for Machine Translation
Exploiting monolingual corpora for conventional SMT has attracted intensive attention in recent years. Several authors have introduced transductive learning to make full use of monolingual corpora [Ueffing et al.2007, Bertoldi and Federico2009]. They use an existing translation model to translate unseen source text, which can be paired with its translations to form a pseudo parallel corpus. This process iterates until convergence. While Klementiev et al. [2012] propose an approach to estimating phrase translation probabilities from monolingual corpora, Zhang and Zong [2013] directly extract parallel phrases from monolingual corpora using retrieval techniques. Another important line of research is to treat translation on monolingual corpora as a decipherment problem [Ravi and Knight2011, Dou et al.2014].
Closely related to Gulcehre et al. [2015] and Sennrich et al. [2015], our approach focuses on learning bidirectional NMT models via autoencoders on monolingual corpora. The major advantages of our approach are the transparency to network architectures and the capability to exploit both source and target monolingual corpora.
4.2 Autoencoders in Unsupervised and Semi-Supervised Learning
Autoencoders and their variants have been widely used in unsupervised deep learning ([Vincent et al.2010, Socher et al.2011, Ammar et al.2014], just to name a few). Among them, the approach of Socher et al. [2011] bears close resemblance to ours, as they introduce semi-supervised recursive autoencoders for sentiment analysis. The difference is that we are interested in making better use of parallel and monolingual corpora while they concentrate on injecting partial supervision into conventional unsupervised autoencoders. Dai and Le [2015] introduce a sequence autoencoder to reconstruct an observed sequence via RNNs. Our approach differs from sequence autoencoders in that we use bidirectional translation models as encoders and decoders to enable them to interact within the autoencoders.
We have presented a semi-supervised approach to training bidirectional neural machine translation models. The central idea is to introduce autoencoders on the monolingual corpora with source-to-target and target-to-source translation models as encoders and decoders. Experiments on Chinese-English NIST datasets show that our approach leads to significant improvements.
As our method is sensitive to the OOVs present in monolingual corpora, we plan to integrate the technique of Jean et al. [2015] for using very large vocabularies into our approach. It is also necessary to further validate the effectiveness of our approach on more language pairs and NMT architectures. Another interesting direction is to enhance the connection between source-to-target and target-to-source models (e.g., letting the two models share the same word embeddings) to help them benefit more from interacting with each other.
This work was done while Yong Cheng was visiting Baidu. This research is supported by the 973 Program (2014CB340501, 2014CB340505), the National Natural Science Foundation of China (No. 61522204, 61331013, 61361136003), 1000 Talent Plan grant, Tsinghua Initiative Research Program grant 20151080475 and a Google Faculty Research Award. We sincerely thank the reviewers for their valuable suggestions.
- [Ammar et al.2014] Waleed Ammar, Chris Dyer, and Noah Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Proceedings of NIPS 2014.
- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- [Bertoldi and Federico2009] Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of WMT.
- [Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
- [Chiang2005] David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.
- [Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8.
- [Dai and Le2015] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of NIPS.
- [Dou et al.2014] Qing Dou, Ashish Vaswani, and Kevin Knight. 2014. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of EMNLP.
- [Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv:1503.03535 [cs.CL].
- [Hochreiter and Schmidhuber1993] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
- [Jean et al.2015] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL.
- [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP.
- [Klementiev et al.2012] Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of EACL.
- [Koehn et al.2003] Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.
- [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL (demo session).
- [Luong et al.2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
- [Och2003] Franz Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.
- [Ravi and Knight2011] Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL.
- [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv:1511.06709 [cs.CL].
- [Socher et al.2011] Richard Socher, Jeffrey Pennington, Eric Huang, Andrew Ng, and Christopher Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP.
- [Stolcke2002] Andreas Stolcke. 2002. Srilm - an extensible language modeling toolkit. In Proceedings of ICSLP.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.
- [Ueffing et al.2007] Nicola Ueffing, Gholamreza Haffari, and Anoop Sarkar. 2007. Transductive learning for statistical machine translation. In Proceedings of ACL.
- [Vincent et al.2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
- [Zhang and Zong2013] Jiajun Zhang and Chengqing Zong. 2013. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL.