Neural machine translation has significantly improved the quality of machine translation in recent several years Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Bahdanau et al. (2015); Junczys-Dowmunt et al. (2016a). Although most sentences are more fluent than translations by statistical machine translation (SMT) Koehn et al. (2003); Chiang (2005), NMT has a problem to address translation adequacy especially for the rare and unknown words. Additionally, it suffers from over-translation and under-translation to some extent Tu et al. (2016). Compared to NMT, SMT, such as phrase-based machine translation (PBMT, Koehn et al. (2003)) and hierarchical phrase-based machine translation (HPMT, Chiang (2005)), does not need to limit the vocabulary and can guarantee translation coverage of source sentences. It is obvious that NMT and SMT have different strength and weakness. In order to take full advantages of both NMT and SMT, system combination can be a good choice.
Traditionally, system combination has been explored respectively in sentence-level, phrase-level, and word-level Kumar and Byrne (2004); Feng et al. (2009); Chen et al. (2009). Among them, word-level combination approaches that adopt confusion network for decoding have been quite successful Rosti et al. (2007); Ayan et al. (2008); Freitag et al. (2014). However, these approaches are mainly designed for SMT without considering the features of NMT results. NMT opts to produce diverse words and free word order, which are quite different from SMT. And this will make it hard to construct a consistent confusion network. Furthermore, traditional system combination approaches cannot guarantee the fluency of the final translation results.
In this paper, we propose a neural system combination framework, which is adapted from the multi-source NMT model Zoph and Knight (2016)
. Different encoders are employed to model the semantics of the source language input and each best translation produced by different NMT and SMT systems. The encoders produce multiple context vector representations, from which the decoder generates the final output word by word. Since the same training data is used for NMT, SMT and neural system combination, we further design a smart strategy to simulate the real training data for neural system combination.
Specifically, we make the following contributions in this paper:
We propose a neural system combination method, which is adapted from multi-source NMT model and can accommodate both source inputs and different system translations. It combines the fluency of NMT and adequacy (especially the ability to address rare words) of SMT.
We design a good strategy to construct appropriate training data for neural system combination.
The extensive experiments on Chinese-English translation show that our model archives significant improvement by 3.4 BLEU points over the state-of-the-art system combination methods and 5.3 BLEU points over the best individual system output.
2 Neural Machine Translation
The encoder-decoder NMT with an attention mechanism Bahdanau et al. (2015)
has been proposed to softly align each decoder state with the encoder states, and computes the conditional probability of the translation given the source sentence.
The decoder is a recurrent neural network that predicts a target sequence. Each word is predicted based on a recurrent hidden state , the previously predicted word , and a context vector . is obtained from the weighted sum of the annotations . We use the latest implementation of attention-based NMT111https://github.com/nyu-dl/dl4mt-tutorial.
3 Neural System Combination for Machine Translation
Macherey and Och Macherey and Och (2007) gave empirical evidence that these systems to be combined need to be almost uncorrelated in order to be beneficial for system combination. Since NMT and SMT are two kinds of translation models with large differences, we attempt to build a neural system combination model, which can take advantage of the different systems.
Model: Figure 1 illustrates the neural system combination framework, which can take as input the source sentence and the results of MT systems. Here, we use MT results as inputs to detail the model.
Formally, given the result sequences (, , and ) of three MT systems for the same source sentence and previously generated target sequence , the probability of the next target word is
Here is a non-linear function, represents the word embedding of the previous prediction word, and is the state of decoder at time step j, calculated by
|Jane Freitag et al. (2014)||39.83||42.75||38.63||39.10||40.08|
where is previous hidden state, is an intermediate state. And is the context vector of system combination obtained by attention mechanism, which is computed as weighted sum of the context vectors of three MT systems, just as illustrated in the middle part of Figure 1.
where K is the number of MT systems, and is a normalized item calculated as follows:
Here, we calculate kth MT system context as a weighted sum of the source annotations:
where is the annotation of from a bi-directional GRU, and its weight is computed by
where scores how well and match.
Training Data Simulation: The neural system combination framework should be trained on the outputs of multiple translation systems and the gold target translations. In order to keep consistency in training and testing, we design a strategy to simulate the real scenario. We randomly divide the training corpus into two parts, then reciprocally train the MT system on one half and translate the source sentences of the other half into target translations. The MT translations and the gold target reference can be available.
We perform our experiments on the Chinese-English translation task. The MT systems participating in system combination are PBMT, HPMT and NMT. The evaluation metric is case-insensitive BLEUPapineni et al. (2002).
4.1 Data preparation
Our training data consists of 2.08M sentence pairs extracted from LDC corpus. We use NIST 2003 Chinese-English dataset as the validation set, NIST 2004-2006 datasets as test sets. We list all the translation methods as follows:
PBMT: It is the start-of-the-art phrase-based SMT system. We use its default setting and train a 4-gram language model on the target portion of the bilingual training data.
HPMT: It is a hierarchical phrase-based SMT system, which uses its default configuration as PBMT in Moses.
NMT: It is an attention-based NMT system, with the same setting given in section 2.
4.2 Training Details
The hyper-parameters used in our neural combination system are described as follows. We limit both Chinese and English vocabulary to 30k in our experiments. The number of hidden units is 1000 and the word embedding dimension is 500 for all source and target word. The network parameters are updated with Adadelta algorithm. We adopt beam search with beam size b=10 at test time.
As to confusion-network-based system Jane, we use its default configuration and train a 4-gram language model on target data and 10M Xinhua portion of Gigaword corpus.
4.3 Main Results
We compare our neural combination system with the best individual engines, and the state-of-the-art traditional combination system Jane Freitag et al. (2014). Table 1 shows the BLEU of different models on development data and test data. The BLEU score of the multi-source neural combination model is 2.53 higher than the best single model HPMT. The source language input gives a further improvement of +1.12 BLEU points.
As shown in Table 1, Jane outperforms the best single MT system by 1.92 BLEU points. However, our neural combination system with source language gets an improvement of 1.67 BLEU points over Jane. Furthermore, when augmenting our neural combination system with ensemble decoding 222We use four neural combination models in ensemble model., it leads to another significant boost of +1.69 BLEU points.
4.4 Word Order of Translation
We evaluate word order by the automatic evaluation metrics RIBES Isozaki et al. (2010), whose score is a metric based on rank correlation coefficients with word precision. RIBES is known to have stronger correlation with human judgements than BLEU for English as discussed in Isozaki et al. Isozaki et al. (2010).
Figure 2 illustrates experimental results of RIBES scores, which demonstrates that our neural combination model outperforms the best result of single MT system and Jane. Additionally, although BLEU point of Jane is higher than single NMT system, the word order of Jane is worse in terms of RIBES.
4.5 Rare and Unknown Words Translation
It is difficult for NMT systems to handle rare words, because low-frequency words in training data cannot capture latent translation mappings in neural network model. However, we do not need to limit the vocabulary in SMT, which are often able to translate rare words in training data. As shown in Table 2, the number of unknown words of our proposed model is 137 fewer than original NMT model.
Table 4 shows an example of system combination. The Chinese word zuzhiwang is an out-of-vocabulary(OOV) for NMT and the baseline NMT cannot correctly translate this word. Although PBMT and HPMT translate this word well, they does not conform to the grammar. By combining the merits of NMT and SMT, our model gets the correct translation.
4.6 Effect of Ensemble Decoding
|Source||海珊 也 与 恐怖 组织网 建立 了 联系 。|
|Pinyin||hanshan ye yu kongbu zuzhiwang jianli le lianxi 。|
|Reference||hussein has also established ties with terrorist networks .|
|PBMT||hussein also has established relations and terrorist group .|
|HPMT||hussein also and terrorist group established relations .|
|NMT||hussein also established relations with UNK .|
|Jane||hussein also has established relations with .|
|Multi||hussein also has established relations with the terrorist group .|
The performance of candidate systems is very important to the result of system combination, and we use ensemble strategy with four NMT models to improve the performance of original NMT system. As shown in Table 3, the E-NMT with ensemble strategy outperforms the original NMT system by +1.40 BLEU points, and it has become the best sytem in all MT systems, which is +0.68 BLEU points higher than HPMT.
After replacing original NMT with strong E-NMT , Jane outperforms original result by +0.45 BLEU points, and our model gets an improvement of +3.08 BLEU points over Jane. Experiments further demonstrate that our proposed model is effective and robust for system combination.
5 Related Work
The recently proposed neural machine translation has drawn more and more attention. Most of the existing approaches and models mainly focus on designing better attention modelsLuong et al. (2015a); Mi et al. (2016a, b); Tu et al. (2016); Meng et al. (2016), better strategies for handling rare and unknown words Luong et al. (2015b); Li et al. (2016); Zhang and Zong (2016a); Sennrich et al. (2016b) , exploiting large-scale monolingual data Cheng et al. (2016); Sennrich et al. (2016a); Zhang and Zong (2016b), and integrating SMT techniques Shen et al. (2016); Junczys-Dowmunt et al. (2016b); He et al. (2016).
Our focus in this work is aiming to take advantage of NMT and SMT by system combination, which attempts to find consensus translations among different machine translation systems. In past several years, word-level, phrase-level and sentence-level system combination methods were well studied Bangalore et al. (2001); Rosti et al. (2008); Li and Zong (2008); Li et al. (2009); Heafield and Lavie (2010); Freitag et al. (2014); Ma and Mckeown (2015); Zhu et al. (2016), and reported state-of-the-art performances in benchmarks for SMT. Here, we propose a neural system combination model which combines the advantages of NMT and SMT efficiently.
Recently, Niehues et al. Niehues et al. (2016) use phrase-based SMT to pre-translate the inputs into target translations. Then a NMT system generates the final hypothesis using the pre-translation. Moreover, multi-source MT has been proved to be very effective to combine multiple source languages Och and Ney (2001); Zoph and Knight (2016); Firat et al. (2016a, b); Garmash and Monz (2016). Unlike previous works, we adapt multi-source NMT for system combination and design a good strategy to simulate the real training data for our neural system combination.
6 Conclusion and Future Work
In this paper, we propose a novel neural system combination framework for machine translation. The central idea is to take advantage of NMT and SMT by adapting the multi-source NMT model. The neural system combination method cannot only address the fluency of NMT and the adequacy of SMT, but also can accommodate the source sentences as input. Furthermore, our approach can further use ensemble decoding to boost the performance compared to traditional system combination methods.
Experiments on Chinese-English datasets show that our approaches obtain significant improvements over the best individual system and the state-of-the-art traditional system combination methods. In the future work, we plan to encode n-best translation results to further improve the system combination quality. Additionally, it is interesting to extend this approach to other tasks like sentence compression and text abstraction.
The research work has been funded by the Natural Science Foundation of China under Grant No. 61333018 and No. 61673380, and it is also supported by the Strategic Priority Research Program of the CAS under Grant No. XDB02070007.
- Ayan et al. (2008) Necip Fazil Ayan, Jing Zheng, and Wen Wang. 2008. Improving alignments for better confusion networks for combining machine translation systems. In Proceedings of COLING 2008.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Nuural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
- Bangalore et al. (2001) Srinivas Bangalore, German Bordel, and Giuseppe Richardi. 2001. Computing consensus translation from multiple machine translation systems. In Proceedings of IEEE ASRU.
- Chen et al. (2009) Boxing Chen, Min Zhang, Haizhou Li, and Aiti Aw. 2009. A comparative study of hypothesis alignment and its improvement for machine translation system combination. In Proceedings of ACL 2009.
- Cheng et al. (2016) Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Proceedings of ACL 2016.
- Chiang (2005) David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL 2005.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder Cdecoder for statistical machine translation. In Proceedings of EMNLP 2014.
- Feng et al. (2009) Yang Feng, Yang Liu, Haitao Mi, Qun Liu, and Yajuan Lu. 2009. Lattice-based system combination for statistical machine translation. In Proceedings of ACL 2009.
- Firat et al. (2016a) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL-HLT 2016.
- Firat et al. (2016b) Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of EMNLP 2016.
- Freitag et al. (2014) Markus Freitag, Matthias Huck, and Hermann Ney. 2014. Jane: open source machine translation system combiantion. In Proceedings of EACL 2014.
- Garmash and Monz (2016) Ekaterina Garmash and Christof Monz. 2016. Ensemble learning for multi-source neural machine translation. In Proceedings of COLING 2016.
- He et al. (2016) Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of AAAI 2016.
- Heafield and Lavie (2010) Kenneth Heafield and Alon Lavie. 2010. Combining machine translation output with open source. The Prague Bulletin of Machematical Linguistics.
- Isozaki et al. (2010) Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of EMNLP 2010.
- Junczys-Dowmunt et al. (2016a) Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016a. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of IWSLT 2016.
- Junczys-Dowmunt et al. (2016b) Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016b. The AMU-UEDIN submission to the WMT16 news translation task: attention-based NMT models as feature functions in phrase-based SMT. In Proceedings of WMT 2016.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP 2013.
- Koehn et al. (2003) Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of ACL NAACL 2013.
- Kumar and Byrne (2004) Shankar Kumar and William Byrne. 2004. Minimum bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL 2004.
- Li et al. (2009) Maoxi Li, Jiajun Zhang, Yu Zhou, and Chengqing Zong. 2009. The CASIA statistical machine translation system for IWSLT 2009. In Proceedings of IWSLT2009.
- Li and Zong (2008) Maoxi Li and Chengqing Zong. 2008. Word reordering alignment for combination of statistical machine translation systems. In Proceedings of the International Symposium on Chinese Spoken Language Processing.
- Li et al. (2016) Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. 2016. Towards zero unknown word in neural machine translation. In Proceedings of IJCAI 2016.
- Luong et al. (2015a) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015.
- Luong et al. (2015b) Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of ACL 2015.
- Ma and Mckeown (2015) Wei-Yun Ma and Kathleen Mckeown. 2015. System combination for machine translation through paraphrasing. In Proceedings of EMNLP 2015.
- Macherey and Och (2007) Wolfgang Macherey and Franz Josef Och. 2007. An empirical study on computing consensus translations from multiple machine translation systems. In Proceedings of EMNLP 2007.
- Meng et al. (2016) Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Interactive attention for neural machine translation. In Proceedings of COLING 2016.
- Mi et al. (2016a) Haitao Mi, Baskaran Sankaran, Zhiguo Wang, Niyu Ge, and Abe Ittycheriah. 2016a. A coverage embedding model for neural machine translation. In Proceedings of EMNLP 2016.
- Mi et al. (2016b) Haitao Mi, Zhiguo Wang, Niyu Ge, and Abe Ittycheriah. 2016b. Supervised attentions for neural machine translation. In Proceedings of EMNLP 2016.
- Niehues et al. (2016) Jan Niehues, Eunah Cho, Thanh-Le Ha, and Alex Waibel. 2016. Pre-translation for neural machine translation. In Proceedings of COLING 2016.
- Och and Ney (2001) Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In Proceedings of MT Summit.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a methof for automatic evaluation of machine translation. In Proceedings of ACL 2002.
- Rosti et al. (2007) Antti-Veikko I. Rosti, Spyros Matsoukas, and Richard Schwartz. 2007. Improved word-level system combination for machine translation. In Proceedings of ACL 2007.
- Rosti et al. (2008) Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, and Richard Schwartz. 2008. Incremental hypothesis alignment for building confusion networks with appplication to machine translation systems combination. In Proceedings of the Third ACL Workshop on Statistical Machine Translation.
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of ACL 2016.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.
- Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of ACL 2016.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS 2014.
- Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of ACL 2016.
- Zhang and Zong (2016a) Jiajun Zhang and Chengqing Zong. 2016a. Bridging neural machine translation and bilingual dictionaries. arXiv preprint arXiv:1610.07272.
- Zhang and Zong (2016b) Jiajun Zhang and Chengqing Zong. 2016b. Exploiting source-side monolingual data in neural machine translation.. In Proceedings of EMNLP 2016.
- Zhu et al. (2016) Junguo Zhu, Muyun Yang, Sheng Li, and Tiejun Zhao. 2016. Sentence-level paraphrasing for machine translation system combination. In Proceedings of ICYCSEE 2016.
- Zoph and Knight (2016) Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of NAACL-HLT 2016.