1 Introduction
Lattice minimum Bayes-risk (LMBR) decoding has been applied successfully to translation lattices in traditional SMT to improve the translation performance of a single system [Kumar and Byrne, 2004; Tromble et al., 2008; Blackwood et al., 2010]. However, minimum Bayes-risk (MBR) decoding is also a very powerful framework for combining diverse systems [Sim et al., 2007; de Gispert et al., 2009]. Therefore, we study combining traditional SMT and NMT in a hybrid decoding scheme based on MBR. We argue that MBR-based methods in their present form are not well-suited for NMT for the following reasons:

- Previous approaches work well with rich lattices and diverse hypotheses. However, NMT decoding usually relies on beam search with a limited beam and thus produces very narrow lattices [Li and Jurafsky, 2016; Vijayakumar et al., 2016].

- NMT decoding is computationally expensive. It is therefore difficult to collect the statistics needed for risk calculation for NMT.

- The Bayes-risk in SMT is usually defined over complete translations. The risk computation therefore needs to be restructured in order to integrate it into an NMT decoder, which builds up hypotheses from left to right.
To address these challenges, we use a special loss function which is computationally tractable, as it avoids using NMT scores for risk calculation. We show how to reformulate the original LMBR decision rule for use in a word-based NMT decoder which is not restricted to an n-best list or a lattice. Our hybrid system outperforms lattice rescoring on multiple data sets for English-German and Japanese-English. We report similar gains from applying our method to subword-unit-based rather than word-based NMT.

2 Combining NMT and SMT by Minimising the Lattice Bayes-risk
We propose to collect statistics for MBR from a potentially large translation lattice generated with SMT, and to use the n-gram posteriors as an additional score in NMT decoding. The LMBR decision rule used by Tromble et al. (2008) has the form
ŷ = argmax_{y ∈ Y_h} ( Θ_0·|y| + Σ_{u ∈ N} Θ_{|u|}·#_u(y)·P(u|Y_e) )    (1)
where Y_h is the hypothesis space of possible translations, Y_e is the evidence space for computing the Bayes-risk, and N is the set of all n-grams u in Y_e (typically, |u| ≤ 4). In this work, our evidence space Y_e is a translation lattice generated with SMT. The function #_u(y) counts how often n-gram u occurs in translation y. P(u|Y_e) denotes the path posterior probability of u in Y_e. Our aim is to integrate these n-gram posteriors into the NMT decoder, since they correlate well with the presence of n-grams in reference translations [de Gispert et al., 2013]. We call the quantity to be maximised the evidence; it corresponds to the (negative) Bayes-risk which is normally minimised in MBR decoding. We emphasise that this risk can be computed for any translation hypothesis, not only those produced by the SMT system.

NMT assigns a probability P(y|x) to a translation y of source sentence x via a left-to-right factorisation based on the chain rule:
P(y|x) = Π_{t=1}^{|y|} P(y_t | y_1^{t-1}, x) = Π_{t=1}^{|y|} g(y_{t-1}, s_t, c_t)    (2)
where g(·) is a neural network which uses the hidden state s_t of the decoder network and the context vector c_t encoding relevant parts of the source sentence [Bahdanau et al., 2015].¹ The function g(·) can also represent an ensemble of NMT systems, in which case the scores of the individual systems are multiplied together to form a single distribution. Applying the LMBR decision rule in Eq. 1 directly to NMT would involve computing P(y|x) for all translations y in the evidence space. In the case of LMBR, this is equivalent to rescoring the entire translation lattice exhaustively with NMT. However, this is not feasible even for small lattices because the evaluation of g(·) is computationally very expensive. Therefore, we propose to calculate the Bayes-risk over SMT translation lattices using only pure SMT scores, and to bias the NMT decoder towards low-risk hypotheses. Our final combined decision rule is

ŷ = argmax_y ( Θ_0·|y| + Σ_{u ∈ N} Θ_{|u|}·#_u(y)·P(u|Y_e) + λ·log P_NMT(y|x) )    (3)

^1 We refer to Bahdanau et al. (2015) for a full discussion of attention-based NMT.
If y contains a word not in the NMT vocabulary, the NMT model provides a score and updates its decoder state as for an unknown word. We note that the risk term can be computed even if y is not in the SMT lattice. Therefore, Eq. 3 can be used to generate translations outside the SMT search space. We further note that Eq. 3 can be derived as an instance of LMBR under a modified loss function.
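For illustration, the combined score of Eq. 3 can be evaluated directly for any tokenised hypothesis once the n-gram posteriors have been extracted. The following is our own minimal sketch, not the authors' implementation; the Θ and λ values are hypothetical placeholders.

```python
from collections import Counter

def ngram_counts(tokens, max_order=4):
    """#_u(y): count every n-gram u of order 1..max_order in y."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def combined_score(tokens, nmt_logprob, posteriors,
                   theta=(0.1, 0.2, 0.2, 0.3, 0.3), lam=0.5):
    """Eq. 3: Theta_0*|y| + sum_u Theta_|u| * #_u(y) * P(u|Y_e)
    + lam * log P_NMT(y|x).

    `posteriors` maps n-gram tuples to P(u|Y_e); the theta/lam
    values here are made up for the example.
    """
    score = theta[0] * len(tokens)
    for u, count in ngram_counts(tokens).items():
        score += theta[len(u)] * count * posteriors.get(u, 0.0)
    return score + lam * nmt_logprob
```

Because the risk term relies only on SMT-derived posteriors, this score is defined even for hypotheses that never appeared in the SMT lattice, as noted above.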
3 Left-to-right Decoding
Beam search is often used for NMT because the factorisation in Eq. 2 allows hypotheses to be built up from left to right. In contrast, our definition of the evidence in Eq. 1 contains a sum over the (unordered) set of all n-grams. However, we can rewrite our objective function in Eq. 3 in a form which makes it easy to use with beam search:
ŷ = argmax_y Σ_{t=1}^{|y|} ( Θ_0 + Σ_{n=1}^{4} Θ_n·P(y_{t-n+1}^t | Y_e) + λ·log P(y_t | y_1^{t-1}, x) )    (4)
for n-grams up to order 4. This form lends itself naturally to beam search: at each time step, we add to the previous partial hypothesis score both the log-likelihood of the last token according to the NMT model and the partial MBR gains from the current n-gram history. Note that this is similar to applying (the exponentiated scores of) an interpolated language model based on n-gram posteriors extracted from the SMT lattice. In the remainder of this paper, we refer to decoding according to Eq. 4 as MBR-based NMT.

Table 1: English-German (En-De) results in BLEU.

Setup                 | System             | newstest2014 | newstest2015 | newstest2016

SMT baseline [de Gispert et al., 2010, HiFST] | | 18.9 | 21.2 | 26.0
Single NMT (word)     | Pure NMT           | 17.7 | 19.6 | 23.1
                      | 100-best rescoring | 20.6 | 22.5 | 27.5
                      | Lattice rescoring  | 21.6 | 23.8 | 29.6
                      | This work          | 22.0 | 24.6 | 29.5
5-Ensemble NMT (word) | Pure NMT           | 19.4 | 21.8 | 25.4
                      | 100-best rescoring | 21.0 | 23.3 | 28.6
                      | Lattice rescoring  | 22.1 | 24.2 | 30.2
                      | This work          | 22.8 | 25.4 | 30.8
Single NMT (BPE)      | Pure NMT           | 19.6 | 21.9 | 24.6
                      | Lattice rescoring  | 21.5 | 24.0 | 29.6
                      | This work          | 21.7 | 24.1 | 28.6
3-Ensemble NMT (BPE)  | Pure NMT           | 21.0 | 23.4 | 27.0
                      | Lattice rescoring  | 21.7 | 24.2 | 30.0
                      | This work          | 22.3 | 24.9 | 29.2
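The left-to-right decomposition of Eq. 4 drops into a conventional beam search loop: each expansion adds the NMT log-probability of the new token plus the partial MBR gain from its n-gram history. The sketch below is ours, not the paper's SGNMT implementation; `nmt_logprob` stands in for an arbitrary NMT scoring callback, and the Θ/λ values are placeholders.

```python
import math

def mbr_beam_search(nmt_logprob, posteriors, vocab, eos,
                    beam=4, max_len=10,
                    theta=(0.1, 0.2, 0.2, 0.3, 0.3), lam=0.5):
    """Beam search over Eq. 4. `nmt_logprob(prefix, w)` returns
    log P(w | prefix, x); `posteriors` maps n-gram tuples to P(u|Y_e)."""
    hyps = [((), 0.0)]          # (partial hypothesis, accumulated score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, score in hyps:
            for w in vocab:
                hist = toks + (w,)
                gain = theta[0]            # word insertion gain Theta_0
                for n in range(1, 5):      # partial gains for orders 1..4
                    if len(hist) >= n:
                        gain += theta[n] * posteriors.get(hist[-n:], 0.0)
                cand = (hist, score + lam * nmt_logprob(toks, w) + gain)
                (finished if w == eos else candidates).append(cand)
        hyps = sorted(candidates, key=lambda h: -h[1])[:beam]
    return max(finished + hyps, key=lambda h: h[1])
```

Note that, as in the text, the MBR gain at each step only inspects the last four tokens, which is what makes the objective compatible with left-to-right decoding.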
4 Efficient n-gram Posterior Calculation
The risk computation in our approach is based on posterior probabilities P(u|Y_e) for n-grams u which we extract from the SMT translation lattice Y_e. P(u|Y_e) is defined as the sum of the probabilities of the paths in Y_e containing u [Blackwood et al., 2010, Eq. 2]:
P(u|Y_e) = Σ_{y ∈ Y_e : #_u(y) > 0} P(y|x)    (5)
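For intuition, Eq. 5 can be computed by brute force over an explicit n-best list before resorting to path counting transducers. This is our own illustrative version, not the paper's lattice implementation; the `alpha` smoothing knob and `vocab_size` argument are made-up stand-ins for the uniform-mixing step described in this section.

```python
from collections import defaultdict

def ngram_posteriors(hypotheses, max_order=4, alpha=1.0, vocab_size=None):
    """Naive Eq. 5: P(u|Y_e) is the total normalised posterior mass of
    hypotheses containing n-gram u at least once. Setting alpha < 1
    mixes the result with a uniform distribution over vocab_size."""
    total = sum(p for _, p in hypotheses)
    post = defaultdict(float)
    for tokens, p in hypotheses:
        seen = set()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for u in seen:          # each hypothesis contributes its mass once per u
            post[u] += p / total
    if vocab_size:              # optional uniform smoothing
        post = {u: alpha * p + (1 - alpha) / vocab_size
                for u, p in post.items()}
    return dict(post)
```

The quadratic blow-up of this loop is exactly why the transducer-based precomputation of Blackwood et al. (2010) is preferable for large lattices.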
We use the framework of Blackwood et al. (2010), based on n-gram mapping and path counting transducers, to efficiently precompute all non-zero values of P(u|Y_e). Complete enumeration of all n-grams in a lattice is usually feasible even for very large lattices [Blackwood et al., 2010]. Additionally, for all these n-grams u, we smooth P(u|Y_e) by mixing it with the uniform distribution, in order to flatten the distribution and increase the offset to n-grams which are not in the lattice.

5 Subword-unit-based NMT
Character-based or subword-unit-based NMT [Chitnis and DeNero, 2015; Sennrich et al., 2016; Chung et al., 2016; Luong and Manning, 2016; Costa-Jussà and Fonollosa, 2016; Ling et al., 2015; Wu et al., 2016] does not use isolated words as modelling units but applies a finer-grained tokenisation scheme. One of the main motivations for these approaches is to overcome the limited vocabulary of word-based NMT. We consider our hybrid system as an alternative way to fix NMT OOVs. However, our method can also be used with subword-unit-based NMT. In this work, we use byte pair encodings [Sennrich et al., 2016, BPE] to test combining word-based SMT with subword-unit-based NMT via both lattice rescoring and MBR. First, we construct a finite state transducer (FST) which maps word sequences to BPE sequences. Then, we convert the word-based SMT lattices to BPE-based lattices by composing them with the mapping transducer and projecting on the output tape, using standard OpenFST operations [Allauzen et al., 2007]. The converted lattices are used for extracting n-gram posteriors as described in the previous sections. Note that even though the n-grams are on the BPE level, their posteriors are computed from word-level SMT translation scores.
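The paper performs this conversion by FST composition; the toy function below is our simplification, operating on an n-best list rather than a lattice, with a hypothetical `bpe_map` dictionary in place of the mapping transducer. It illustrates the key point made above: tokens move to the BPE level while the scores stay at the word level.

```python
def words_to_bpe(hypotheses, bpe_map):
    """Map word-level hypotheses to BPE-level token sequences while
    keeping the original word-level scores unchanged."""
    converted = []
    for words, score in hypotheses:
        bpe_tokens = []
        for w in words:
            # unmapped words pass through unchanged
            bpe_tokens.extend(bpe_map.get(w, [w]))
        converted.append((bpe_tokens, score))
    return converted
```

The converted hypotheses can then be fed to the same n-gram posterior extraction as before, now over BPE-level n-grams.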
Table 2: Japanese-English (Ja-En) results in BLEU.

Setup                 | System             | dev  | test
SMT baseline [Neubig, 2013, Travatar] |     | 19.5 | 22.2
Single NMT (word)     | Pure NMT           | 20.3 | 22.5
                      | 10k-best rescoring | 22.2 | 24.5
                      | This work          | 22.4 | 25.2
6-Ensemble NMT (word) | Pure NMT           | 22.6 | 25.0
                      | 10k-best rescoring | 22.4 | 25.4
                      | This work          | 23.9 | 26.5
Single NMT (BPE)      | Pure NMT           | 20.8 | 23.5
                      | 10k-best rescoring | 21.9 | 24.6
                      | This work          | 23.0 | 25.4
3-Ensemble NMT (BPE)  | Pure NMT           | 23.3 | 25.9
                      | 10k-best rescoring | 22.6 | 25.1
                      | This work          | 24.1 | 26.7
6 Experimental Setup
We test our approach on English-German (En-De) and Japanese-English (Ja-En). For En-De, we use the WMT newstest2014 set (the filtered version) as a development set, and keep newstest2015 and newstest2016 as test sets. For Ja-En, we use the ASPEC corpus [Nakazawa et al., 2016] to be strictly comparable to the evaluation done in the Workshop on Asian Translation (WAT).
The NMT systems are as described by Stahlberg et al. (2016b), using the Blocks and Theano frameworks [van Merriënboer et al., 2015; Bastien et al., 2012], with hyper-parameters as in [Bahdanau et al., 2015] and a vocabulary size of 30k for Ja-En and 50k for En-De. We use the coverage penalty proposed by Wu et al. (2016) to improve the length and coverage of translations. Our final ensembles combine five (En-De) to six (Ja-En) independently trained NMT systems.

Our En-De SMT baseline is a hierarchical system based on the HiFST package⁴ which produces rich output lattices. The system uses rules extracted as described by de Gispert et al. (2010) and a 5-gram language model [Heafield et al., 2013].

^4 http://ucam-smt.github.io/
For Ja-En we use Travatar [Neubig, 2013], an open-source tree-to-string system. We provide the system with Japanese trees obtained using the Ckylark parser [Oda et al., 2015] and train it on high-quality alignments as recommended by Neubig and Duh (2014). This system, which reproduces the results of the best submission at WAT 2014 [Neubig, 2014], is used to create a 10k-best list of hypotheses, which we convert into determinised and minimised FSAs for our work. Our Ja-En NMT models are trained on the same 500k training samples as the Travatar baseline.
The parameter λ is tuned by optimising the BLEU score on the validation set, and we set the Θ parameters to fixed values (). Using the BOBYQA algorithm [Powell, 2009] or lattice MERT [Macherey et al., 2008] to optimise the Θ parameters independently did not yield improvements. The beam search implementation of the SGNMT decoder⁵ [Stahlberg et al., 2016b] is used in all our experiments. We set the beam size to 20 for En-De and 12 for Ja-En.

^5 http://ucam-smt.github.io/sgnmt/html/
7 Results
Our results are summarised in Tab. 1 and 2.⁷ Our approach outperforms both the single NMT and the SMT baselines by up to 3.4 BLEU for En-De and 2.8 BLEU for Ja-En. Ensembling yields further gains across all test sets, both for the NMT baselines and for our MBR-based hybrid systems. We see substantial gains from our MBR-based method over lattice rescoring for both single and ensembled NMT on all test sets and language pairs, except En-De newstest2016. On Ja-En, we report 26.7 BLEU⁶, second only to one system (as of February 2017) which uses a number of techniques such as minimum risk training and a much larger vocabulary size that could also be used in our framework.

^6 Comparable to http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=2
^7 Instructions for reproducing our key results will be available upon publication at http://ucam-smt.github.io/sgnmt/html/tutorial.html
Our word-level NMT baselines suffer from their limited vocabulary, since we do not apply post-processing techniques like UNK-replace [Luong et al., 2015]. Therefore, NMT with subword units (BPE) consistently outperforms them by a large margin. Lattice rescoring and MBR yield large gains for both BPE-based and word-based NMT. However, the performance difference between BPE-based and word-level NMT diminishes with lattice rescoring and MBR decoding: rescoring with NMT often performs on the same level for both words and subword units, and MBR-based NMT is often even better with a word-level NMT baseline. This indicates that subword units are often not necessary when the hybrid system has access to a large word-level vocabulary such as the SMT vocabulary.
Note that the BPE lattice rescoring system is constrained to produce words in the output vocabulary of the syntactic SMT system, and is prevented from inventing new target-language words out of combinations of subword units. MBR imposes a soft version of such a constraint by biasing the BPE-based system towards words in the SMT search space.
The hypotheses produced by our MBR-based method often differ from the translations in the baseline systems. For example, 77.8% of the translations from our best MBR-based system on Ja-En cannot be found in the SMT 10k-best list, and 78.0% do not match the translation from the pure NMT 6-ensemble.⁸ This suggests that our MBR decoder is able to produce entirely new hypotheses, and that our method has a profound effect on the translations which goes beyond rescoring the SMT search space or fixing UNKs in the NMT output.

^8 Up to NMT OOVs.
Tab. 1 also shows that rescoring is sensitive to the size of the n-best list or lattice: rescoring the entire lattice instead of a 100-best list often yields a gain of a full BLEU point. In order to test our MBR-based method on small lattices, we compiled n-best lists of varying sizes into lattices and extracted n-gram posteriors from the reduced lattices. Fig. 1 shows that the n-best list size has an impact on both methods. Rescoring a 10-best list already yields a large improvement of 1.2 BLEU. However, the hypotheses are still close to the SMT baseline. The MBR-based approach can make better use of small n-best lists as it does not suffer from this restriction. MBR-based combination on a 10-best list performs at about the same level as rescoring a 10,000-best list, which demonstrates a practical advantage of MBR over rescoring.
8 Related Work
Combining the advantages of NMT and traditional SMT has received some attention in recent research. One line of research attempts to integrate SMT-style translation tables into the NMT system [Zhang and Zong, 2016; Arthur et al., 2016; He et al., 2016]. Wang et al. (2016) interpolated NMT posteriors with word recommendations from SMT and jointly trained the NMT system together with a gating function which dynamically assigns the weight between SMT and NMT scores. Neubig et al. (2015) rescored n-best lists from a syntax-based SMT system with NMT. Stahlberg et al. (2016b) restricted the NMT search space to a Hiero lattice and reported improvements over n-best list rescoring. Stahlberg et al. (2016a) combined Hiero and NMT via a loose coupling scheme, based on the composition of finite state transducers and translation lattices, which takes the edit distance between translations into account. Our approach is similar to the latter in that it allows deviating from SMT and generating translations without derivations in the SMT system. This ability is crucial for NMT ensembles because SMT lattices are often too narrow for the NMT decoder [Stahlberg et al., 2016a]. However, the method proposed by Stahlberg et al. (2016a) insists on a monotone alignment between SMT and NMT translations to calculate the edit distance. This can be computationally expensive and is not appropriate for MT, where word reorderings are common. The MBR decoding described here does not have this shortcoming.
9 Conclusion
This paper discussed a novel method for blending NMT with traditional SMT by biasing NMT scores towards translations with low Bayes-risk with respect to the SMT lattice. We reported significant improvements of the new method over lattice rescoring on Japanese-English and English-German, and showed that it can make good use even of very small lattices and n-best lists.
In this work, we calculated the Bayes-risk over non-neural SMT lattices. In the future, we plan to introduce neural models into the risk estimation while keeping the computational complexity under control, e.g. by using neural n-gram language models [Bengio et al., 2003; Vaswani et al., 2013] or approximations of NMT scores [Lecorvé and Motlicek, 2012; Liu et al., 2016] for n-gram posterior calculation.
Acknowledgments
This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC grant EP/L027623/1).
We thank Graham Neubig for providing pretrained parsing and alignment models, as well as scripts, to allow perfect reproduction of the NAIST WAT 2014 submission.
References
[Allauzen et al., 2007] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In International Conference on Implementation and Application of Automata, pages 11-23. Springer.

[Arthur et al., 2016] Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In EMNLP, pages 1557-1567.

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

[Bastien et al., 2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. 2012. Theano: new features and speed improvements. In NIPS.

[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

[Blackwood et al., 2010] Graeme Blackwood, Adrià de Gispert, and William Byrne. 2010. Efficient path counting transducers for minimum Bayes-risk decoding of statistical machine translation lattices. In ACL, pages 27-32.

[Chitnis and DeNero, 2015] Rohan Chitnis and John DeNero. 2015. Variable-length word encodings for neural translation models. In EMNLP, pages 2088-2093.

[Chung et al., 2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In ACL.

[Costa-Jussà and Fonollosa, 2016] Marta R. Costa-Jussà and José A.R. Fonollosa. 2016. Character-based neural machine translation. In ACL.

[de Gispert et al., 2009] Adrià de Gispert, Sami Virpioja, Mikko Kurimo, and William Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In HLT-NAACL, pages 73-76.

[de Gispert et al., 2010] Adrià de Gispert, Gonzalo Iglesias, Graeme Blackwood, Eduardo R. Banga, and William Byrne. 2010. Hierarchical phrase-based translation with weighted finite-state transducers and shallow-n grammars. Computational Linguistics, 36(3):505-533.

[de Gispert et al., 2013] Adrià de Gispert, Graeme Blackwood, Gonzalo Iglesias, and William Byrne. 2013. N-gram posterior probability confidence measures for statistical machine translation: an empirical study. Machine Translation, 27(2):85-114.

[He et al., 2016] Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In AAAI.

[Heafield et al., 2013] Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In ACL, pages 690-696.

[Kumar and Byrne, 2004] Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In HLT-NAACL.

[Lecorvé and Motlicek, 2012] Gwénolé Lecorvé and Petr Motlicek. 2012. Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition. Technical report, Idiap.

[Li and Jurafsky, 2016] Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.

[Ling et al., 2015] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. arXiv preprint arXiv:1511.04586.

[Liu et al., 2016] Xunying Liu, Xie Chen, Yongqiang Wang, Mark J.F. Gales, and Philip C. Woodland. 2016. Two efficient lattice rescoring methods using recurrent neural network language models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(8):1438-1449.

[Luong and Manning, 2016] Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In ACL.

[Luong et al., 2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

[Macherey et al., 2008] Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based minimum error rate training for statistical machine translation. In EMNLP, pages 725-734.

[Nakazawa et al., 2016] Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In LREC, pages 2204-2208.

[Neubig and Duh, 2014] Graham Neubig and Kevin Duh. 2014. On the elements of an accurate tree-to-string machine translation system. In ACL, pages 143-149.

[Neubig et al., 2015] Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. In WAT, Kyoto, Japan.

[Neubig, 2013] Graham Neubig. 2013. Travatar: A forest-to-string machine translation engine based on tree transducers. In ACL.

[Neubig, 2014] Graham Neubig. 2014. Forest-to-string SMT for Asian language translation: NAIST at WAT 2014. In WAT.

[Oda et al., 2015] Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Ckylark: A more robust PCFG-LA parser. In NAACL, pages 41-45.

[Powell, 2009] Michael J.D. Powell. 2009. The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Cambridge.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

[Sim et al., 2007] Khe Chai Sim, William J. Byrne, Mark J.F. Gales, Hichem Sahbi, and Philip C. Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In ICASSP. IEEE.

[Stahlberg et al., 2016a] Felix Stahlberg, Eva Hasler, and Bill Byrne. 2016a. The edit distance transducer in action: The University of Cambridge English-German system at WMT16. In WMT.

[Stahlberg et al., 2016b] Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016b. Syntactically guided neural machine translation. In ACL.

[Tromble et al., 2008] Roy W. Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In EMNLP, pages 620-629.

[van Merriënboer et al., 2015] Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and fuel: Frameworks for deep learning. CoRR.

[Vaswani et al., 2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In EMNLP, pages 1387-1392.

[Vijayakumar et al., 2016] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.

[Wang et al., 2016] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. 2016. Neural machine translation advised by statistical machine translation. CoRR, abs/1610.05150.

[Wu et al., 2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

[Zhang and Zong, 2016] Jiajun Zhang and Chengqing Zong. 2016. Bridging neural machine translation and bilingual dictionaries. arXiv preprint arXiv:1610.07272.