Statistical phrase-based models used to be the state-of-the-art machine translation systems (Koehn et al., 2007)
before the deep learning revolution. In contrast to word-based systems (Koehn et al., 2003; Lopez, 2008; Koehn, 2009), phrase-based approaches explicitly model phrase structures in both source and target sentences, together with their corresponding alignments. These linguistic structures can be understood as a form of inductive bias, a key factor in their superior performance over word-based counterparts.
Neural sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014) have since become the dominant approach to machine translation. These models usually consist of three main components: an encoder that encodes the source sentence into a fixed-length vector, a decoder that generates the translation word by word, and an attention module that selectively retrieves source-side information as the decoder needs it. Innovations in both architectures (Vaswani et al., 2017; Gehring et al., 2017) and training techniques (Vaswani et al., 2017; Ba et al., 2016) keep advancing the state-of-the-art results on standard machine translation benchmarks.
Despite the remarkable success of neural sequence-to-sequence models, the biases induced from phrase structures, which have been shown to be useful in machine translation (Koehn, 2009), are largely ignored. Recently, Huang et al. (2018) developed Neural Phrase-based Machine Translation (NPMT) to incorporate target-side phrasal information into a neural translation model. Their model builds upon Sleep-WAke Networks (SWAN), a segmentation-based sequence modeling technique described in Wang et al. (2017a).
In this paper, we propose Neural Phrase-to-Phrase Machine Translation (NP²MT). Our contributions are twofold. First, we develop an explicit phrase-level attention mechanism that captures source-side phrasal information and its alignment with target-side phrase outputs. This approach also avoids the costly dynamic programming procedure that the monotonic input-output alignment requirement imposes on NPMT. NP²MT achieves performance comparable to state-of-the-art methods on benchmark datasets. Second, we can naturally incorporate an external phrase-to-phrase dictionary during decoding. NMT systems trained on a fixed amount of parallel data are known to have limitations with out-of-vocabulary words (Luong et al., 2014); in other words, those out-of-vocabulary words come from data distributions different from the training data. The standard remedy is a post-processing step that patches in unknown words based on the word-level alignment information from the attention mechanism (Luong et al., 2014; Hashimoto et al., 2016). In our model, given the phrase-level attentions, we develop a dictionary look-up decoding method with an external phrase-to-phrase dictionary. We demonstrate that NP²MT consistently outperforms the Transformer (Vaswani et al., 2017) in a simulated open vocabulary setting and a cross-domain translation setting.
2 Neural phrase-to-phrase machine translation
In this section, we first describe the proposed NP²MT model with its phrase-level attention mechanism and show how it avoids the more costly dynamic programming used in NPMT (Huang et al., 2018). Next, given the phrase-level attention mechanism, we develop a decoding algorithm that can naturally use an external phrase-to-phrase dictionary.
Figure 1 shows an overview of NP²MT. There are three main components in our model: a source encoder, a target encoder, and a segment decoder. The source encoder itself consists of a sentence-level encoder and a phrase-level encoder. We use bidirectional LSTMs (Hochreiter & Schmidhuber, 1997) for the source encoders, an LSTM for the target encoder, and the Transformer (Vaswani et al., 2017) for the segment decoder. (In our proposed model, we first tried an LSTM as the segment decoder, but later found that the Transformer captures meaningful output segments better.)
Consider a source sentence and a target sentence . First, the source sentence is passed through the sentence-level encoder to obtain the word vector representations,
Second, we use the phrase-level encoder to independently encode all possible segments of the source sentence as,
where is the maximum segment length considered in our model. This architecture for computing phrase embeddings is inspired by Segmental RNNs (Kong et al., 2015), which have been shown to be effective in both text and speech tasks (Lu et al., 2017).
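Since the encoder symbols were lost here, a small sketch may help make the segment enumeration concrete. The following Python snippet uses hypothetical helper names of our own; the real phrase-level encoder would map each span to a vector, whereas this stand-in simply returns the token sub-sequence:

```python
def enumerate_segments(tokens, max_seg_len):
    """Enumerate every contiguous segment (i, j) with 1 <= j - i <= max_seg_len.

    Returns a dict mapping (start, end) spans to the token sub-sequence,
    standing in for the phrase-level encoder's per-segment embeddings.
    """
    segments = {}
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, min(i + max_seg_len, n) + 1):
            segments[(i, j)] = tokens[i:j]
    return segments

# Toy example: a 4-token source sentence with maximum segment length 2.
spans = enumerate_segments(["das", "ist", "ein", "test"], max_seg_len=2)
print(len(spans))  # 4 unigram spans + 3 bigram spans = 7
```

Each span would then be encoded independently, so the whole set of phrase vectors can be precomputed once per source sentence.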
Denote as the entire set of phrase-level encoding vectors. The rest of the model is most conveniently described as a generative process. Let be a concatenation operator, be the end-of-segment symbol, and be the end-of-sentence symbol.
For segment index :
1. Update the attention state given all previous segments,
where is the total length of all previous segments. Note that different collections of previous segments can lead to the same attention state as long as their concatenation is the same.
2. Given the current state , sample words from the segment decoder until we reach the end-of-segment symbol or the end-of-sentence symbol . This gives us a segment .
3. If appears in , stop.
Concatenate to obtain the output sentence . Since there is more than one way to obtain the same via the generative process above, we need to consider all cases. Let be a valid segmentation of ; then we have
where is the number of segments in , the attention is defined in Eq. 3,
and the segment probability is defined via the decoder .
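To see why this marginalization is necessary, note that the number of valid segmentations grows quickly with sentence length. The following sketch (a hypothetical helper, not from the paper) counts the segmentations the model must sum over:

```python
from functools import lru_cache

def count_segmentations(n, max_seg_len):
    """Number of ways to split a length-n target sentence into contiguous
    segments of length at most max_seg_len. Every such segmentation yields
    the same sentence once its segments are concatenated, so the model's
    likelihood sums the probabilities of exactly these cases."""
    @lru_cache(maxsize=None)
    def ways(t):
        if t == 0:
            return 1  # the empty suffix has one (trivial) segmentation
        return sum(ways(t - j) for j in range(1, min(max_seg_len, t) + 1))
    return ways(n)

print(count_segmentations(4, 2))  # 5
```

With an unbounded segment length the count becomes 2^(n-1) (every gap is either a boundary or not), which is why a naive enumeration is intractable and a dynamic program is needed.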
Given a pair of source and target sentence , the objective function of machine translation model is defined as follows,
Directly computing this objective is intractable, as it sums over all valid segmentations. We therefore develop a dynamic programming algorithm to efficiently compute the loss function. We denote the conditional probability as ; then we have . It can be shown that the following recursion holds,
The initial condition is . The computational complexity of this algorithm is , where denotes the maximum segment length. Compared with the original NPMT model (Huang et al., 2018) and its training complexity, the proposed approach is much faster.
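Since the original symbols were lost in extraction, the recursion can only be sketched here under assumed notation: let $\alpha_t$ denote the probability of generating the first $t$ target words $y_{1:t}$ given the source, let $L$ be the maximum segment length, and let $\$$ mark the end of a segment. One plausible rendering of the recursion is

$$
\alpha_t \;=\; \sum_{j=1}^{\min(L,\,t)} \alpha_{t-j}\; p\!\left(y_{t-j+1:t}\,\$ \;\middle|\; s_{t-j}\right),
\qquad \alpha_0 = 1,
$$

where $s_{t-j}$ is the attention state after emitting the first $t-j$ words. Each target position sums over at most $L$ predecessor states, which is the source of the complexity advantage over NPMT noted above.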
For the proposed NP²MT, besides the standard beam search algorithm, we propose a decoding method that uses an external phrase-to-phrase dictionary. We describe the beam search algorithm in the Appendix.
Dictionary integration in decoding.
The proposed NP²MT builds a phrase-to-phrase framework, which can easily be extended to leverage an external phrase-to-phrase dictionary. Here, we describe the dictionary-enhanced decoding algorithm in the greedy search case. (Integrating the dictionary into beam search is nontrivial: in the current model, the use of the dictionary is determined by rules rather than probability scores, but beam search requires a quantitative comparison among candidates consisting of phrases generated by both the rule-based system and the neural model. It is unclear how to quantify the former into beam search scores, so we leave this for future work.)
The decoding process is shown in Figure 2. At each time step, NP²MT produces a phrase-level attention distribution, and the source phrase with the maximum attention is the one to be translated. Given the attended phrase, the model must decide whether to translate it with the dictionary. Only attended phrases that contain unknown words are translated with the dictionary, and only if the dictionary includes a translation for them. The intuition is that a person reading a sentence reaches for a dictionary only upon meeting an unknown word whose phrase can be found in the dictionary. When the dictionary offers multiple translations for a phrase, we use the model to score all candidates and choose the best one; in NP²MT, the segment decoder performs this scoring. The decoding algorithm is shown in Algorithm 1.
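As a rough illustration of this rule (the function names and the toy dictionary below are our own assumptions, not the paper's API), one greedy decoding step with dictionary lookup might be sketched as:

```python
def choose_segment(attended_phrase, dictionary, decode_segment, score_segment):
    """One greedy decoding step with dictionary lookup (hypothetical API).

    attended_phrase: tuple of source tokens with maximal phrase-level attention.
    dictionary:      maps source phrases to lists of candidate translations.
    decode_segment:  the model's normal greedy segment decoder.
    score_segment:   the segment decoder's log-probability for a candidate.
    """
    has_unk = "<unk>" in attended_phrase
    if has_unk and attended_phrase in dictionary:
        # Rule: only consult the dictionary for attended phrases containing
        # unknown words; rank multiple candidates with the segment decoder.
        candidates = dictionary[attended_phrase]
        return max(candidates, key=score_segment)
    return decode_segment(attended_phrase)

# Toy illustration with stub model functions.
toy_dict = {("<unk>", "genf"): [("below", "geneva"), ("in", "geneva")]}
best = choose_segment(
    ("<unk>", "genf"), toy_dict,
    decode_segment=lambda p: ("<unk>",),
    score_segment=lambda c: {("below", "geneva"): -0.3, ("in", "geneva"): -1.2}[c],
)
print(best)  # ('below', 'geneva')
```

The stub scores stand in for the segment decoder's probabilities; the key point is that the dictionary branch is rule-triggered but the choice among dictionary candidates is still made by the neural model.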
3 Experiments

In this section, we demonstrate NP²MT on several machine translation benchmarks under different settings. We first compare NP²MT with strong baselines on the IWSLT14 German-English and IWSLT15 English-Vietnamese machine translation tasks (Cettolo et al., 2014, 2015). Furthermore, to demonstrate the benefit of using the phrase-level attention mechanism with an external dictionary, we evaluate NP²MT on open vocabulary and cross-domain translation tasks. (We focus on word-based models in this paper, as words are more appropriate for dictionary representation; we leave BPE-style representations (Sennrich et al., 2015) for future work.)
3.1 IWSLT Machine Translation Tasks
In this section, we compare NP²MT with state-of-the-art machine translation models on the IWSLT14 German-English and IWSLT15 English-Vietnamese tasks (Cettolo et al., 2014, 2015). The IWSLT14 German-English data (Cettolo et al., 2014) comes from translated TED talks and contains roughly 153K training sentences, 7K development sentences, and 7K test sentences. We use the same preprocessing and dataset splits as in Ott et al. (2018). Sentences longer than words are removed from the dataset. For the IWSLT15 English-Vietnamese task, the data is also from translated TED talks, and the dataset contains roughly 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015). Following the same preprocessing steps as Luong et al. (2017) and Huang et al. (2018), we use TED tst2012 (1553 sentences) as the validation set for hyperparameter tuning and TED tst2013 (1268 sentences) as the test set, and report results on the test set.
We use 6-layer BiLSTMs in the source encoder at both the word and segment levels, a 6-layer LSTM as the target encoder, and a 6-layer Transformer as the segment decoder. (We also experimented with an LSTM segment decoder, but its performance was inferior.) The models are trained using Adam (Kingma & Ba, 2014). Similar to the RNMT+ learning rate scheduler (Chen et al., 2018; Ott et al., 2018), we use a three-stage learning rate scheduler, replacing the exponential decay with a linear one for fast convergence. Specifically, the learning rate is quickly warmed up to its maximum, held at the maximum value for a period, and finally decayed linearly to zero. In our experiments, we set the maximum learning rate to , weight decay to , the word embedding dimensionality to , and the dropout rate to . The performance figures of the baseline models, including BSO (Wiseman & Rush, 2016), Seq2Seq with attention, Actor-Critic (Bahdanau et al., 2016), Luong & Manning (2015), and NPMT (Huang et al., 2018), are taken from Huang et al. (2018). We use the fairseq implementation of the Transformer (https://github.com/pytorch/fairseq) and set the number of layers in both the encoder and decoder to 6; no further improvement was observed by increasing the depth. We fine-tuned the hyperparameters using the validation set and found that the best-performing hyperparameters, except the maximum learning rate, are the same as in NPMT. We replace the default inverse-square-root scheduler (Vaswani et al., 2017) with the proposed learning rate scheduler, which works better empirically; the maximum learning rate of the Transformer is set to .
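The three-stage schedule can be sketched as follows; the step counts below are illustrative placeholders, since the paper's actual values were lost in extraction:

```python
def three_stage_lr(step, max_lr, warmup_steps, hold_steps, decay_steps):
    """Three-stage learning rate schedule: linear warmup to max_lr,
    a constant plateau, then linear decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # stage 1: linear warmup
    if step < warmup_steps + hold_steps:
        return max_lr                                 # stage 2: hold at maximum
    progress = (step - warmup_steps - hold_steps) / decay_steps
    return max(0.0, max_lr * (1.0 - progress))        # stage 3: linear decay

# Illustrative schedule: 10 warmup, 20 hold, 30 decay steps.
schedule = [three_stage_lr(s, 1.0, 10, 20, 30) for s in range(60)]
print(schedule[0], schedule[10], schedule[25], schedule[45])
```

The difference from the RNMT+ scheduler is only the decay stage, which is linear here rather than exponential.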
The IWSLT14 German-English and IWSLT15 English-Vietnamese test results are shown in Tables 1 and 2. The proposed NP²MT achieves results comparable to the Transformer (Vaswani et al., 2017) and outperforms the other baseline models on both tasks. Moreover, compared with the NPMT model, NP²MT achieves better results with much less training time (24 hours vs. 2 hours on an Nvidia V100 GPU).
| BSO (Wiseman & Rush, 2016) | 23.83 | 25.48 |
| Seq2Seq with attention | 26.17 | 27.61 |
| Actor-Critic (Bahdanau et al., 2016) | 27.49 | 28.53 |
| Transformer (Vaswani et al., 2017) | 31.27 | 32.30 |
| NPMT (Huang et al., 2018) | 28.57 | 29.92 |
3.2 Translation with Dictionary
To demonstrate the usefulness of an external dictionary, we set up an experiment showing how an external bilingual dictionary can enhance the performance of NP²MT on open vocabulary translation tasks, where there is a large number of out-of-vocabulary (OOV) words.
We build the open vocabulary machine translation task by replacing infrequent words with UNK. By infrequent words, we mean words that appear fewer times in the training data than a predefined threshold. For the masked words, the model can only see the UNK token; no vector representations of the original words are available. Given this setting, the models must translate sentences containing masked unknown words.
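The masking procedure can be sketched as follows (a minimal illustration with a hypothetical helper; the real setup operates on the full training corpus):

```python
from collections import Counter

def mask_infrequent(corpus, threshold):
    """Replace words whose training-data frequency is below the threshold
    with <unk>, simulating the open-vocabulary setting."""
    counts = Counter(w for sent in corpus for w in sent)
    keep = {w for w, c in counts.items() if c >= threshold}
    return [[w if w in keep else "<unk>" for w in sent] for sent in corpus]

# Toy corpus: "dog" appears once, below the threshold of 2, so it is masked.
corpus = [["the", "cat"], ["the", "dog"], ["the", "cat"]]
masked = mask_infrequent(corpus, threshold=2)
print(masked)
# [['the', 'cat'], ['the', '<unk>'], ['the', 'cat']]
```

Raising the threshold masks more of the vocabulary, which is how the increasing OOV rates in the tables below are produced.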
In our experiment, we evaluate different infrequent-word thresholds (3, 5, 10, 20, 50, 100) to show how an external dictionary improves translation in the open vocabulary setting. In Tables 3 and 4, we report results on the IWSLT14 German-English and IWSLT15 English-Vietnamese datasets. For both datasets, we use Moses (Koehn et al., 2007) to extract an in-domain dictionary from the training data, including and translations. We also try an out-of-domain dictionary on the IWSLT14 German-English task, extracted from the WMT14 German-English dataset, which includes translations. The goals of these two setups are different. For the IWSLT15 English-Vietnamese task, the dictionary is extracted from the training data by Moses and is thus in-domain; the external resource here is the dictionary extraction tool, namely Moses. For the IWSLT14 German-English task, however, the dictionary itself is an external resource, which can be obtained from any textual corpus in any way, and is out-of-domain. In other words, the difference between these setups lies in the overlap between the training data and the dictionary. Note that in these dictionaries, several translations may be attached to each word. For comparison, we also report the Transformer with a trivial lookup mechanism that translates the attended unknown word with the given dictionary.
| Unknown-word threshold | 3 | 5 | 10 | 20 | 50 | 100 |
| --- | --- | --- | --- | --- | --- | --- |
| Test data OOV rate (German) | 3.8% | 4.7% | 6.2% | 8.1% | 11.6% | 14.8% |
| Test data OOV rate (English) | 2.2% | 2.8% | 3.9% | 5.6% | 8.9% | 12.1% |
In Figs. 3 and 4, we analyze the lookup phrase ratio and the BLEU score improvement on both datasets, where the lookup phrase ratio is the percentage of target words that are translated using the dictionary. We observe that both the lookup phrase ratio and the BLEU improvement increase as the source OOV rate grows, which suggests the need for a dictionary in high-OOV scenarios. In Fig. 3, at an OOV rate of , using an in-domain dictionary enhances the performance of NP²MT by and BLEU points on the IWSLT14 German-English and IWSLT15 English-Vietnamese tasks, whereas only and BLEU points of improvement are achieved with the Transformer on these tasks, respectively. Fig. 4 shows the difference between a small in-domain dictionary and a large out-of-domain one for IWSLT14 German-English. When the OOV rate is low, more improvement is gained from the larger dictionary because it contains more translations; but when the OOV rate becomes higher, the in-domain dictionary shows its effectiveness by providing accurate in-domain translation candidates.
Table 5 shows translation examples by NP²MT with unknown-word frequency threshold . At the word level, NP²MT discovers meaningful phrases such as “this is”, “the machine”, and “read by”. Unseen words, such as “genf”, “bescheidene”, and “zeitschriften”, can be attended to and decoded with the dictionary.
| Target ground truth | this is the machine below geneva . |
| Target ground truth | this is a modest little app . |
| Target ground truth | our magazines are read by millions . |
3.3 Cross-Domain Translation
In this section, Transformer and the proposed model are evaluated on a cross-domain machine translation task. The models are trained and validated on the IWSLT14 German-English dataset, which is originally from TED talks, and tested on the WMT14 German-English dataset, which is extracted from the news domain.
In this task, we used the lowest threshold for the training data to make the observed vocabulary as complete as possible. Note that although this vocabulary is the most complete one for the training domain, the OOV rate is still high on the out-of-domain test data. The dictionary, which is extracted from the test domain, is used to handle such OOV words at the word/phrase level. (In practice, it would be better to use expert-created dictionaries for cross-domain tasks.)
In Table 6, we show the experimental results with and without the dictionary for the Transformer and NP²MT. We observe that while the dictionary contributed to the performance of both models, the “NP²MT + ” model used the dictionary more often and achieved a higher BLEU gain (+) than the Transformer did (+).
4 Related Work
Neural phrase-based machine translation was first introduced by Huang et al. (2018). Their model builds upon Sleep-WAke Networks (SWAN), a segmentation-based sequence modeling technique described in Wang et al. (2017a), and mitigates the monotonic alignment assumption by introducing a new layer that performs (soft) local reordering on input sequences. Another related model that shares the monotonic alignment assumption (which is often inappropriate for many language pairs) is the segment-to-segment neural transduction model (SSNT) (Yu et al., 2016b, a). Our model relies on the phrasal attention mechanism rather than marginalizing out the monotonic alignments using dynamic programming. Several works have proposed different ways to incorporate phrases into attention-based neural machine translation. Wang et al. (2017b), Tang et al. (2016), and Zhao et al. (2018) incorporate the phrase table as memory in the neural machine translation architecture. Hasler et al. (2018) use a user-provided phrase table of terminologies in an NMT system by organizing the beam search into multiple stacks corresponding to subsets of satisfied constraints as defined by FSA states. Dahlmann et al. (2017) divide the beams into a word beam and a phrase beam of fixed size. He et al. (2016) use statistical machine translation (SMT) features in the NMT model under a log-linear framework. Yang et al. (2018) enhance self-attention networks to capture useful phrase patterns by imposing learned Gaussian biases. Nguyen & Joty (2018) also incorporate phrase-level attention in the Transformer by encoding a fixed number of n-grams (e.g., unigrams, bigrams). However, Nguyen & Joty (2018) only consider phrase-level attention on the source side, whereas our model performs phrase-to-phrase translation by attending to a phrase on the source side and generating a phrase on the target side. Our model further integrates an external dictionary during decoding, which is important in open vocabulary and cross-domain translation settings.
5 Conclusion

We proposed Neural Phrase-to-Phrase Machine Translation (NP²MT), which uses a phrase-level attention mechanism to enable phrase-to-phrase translation in a neural machine translation system. This phrase-level attention lets us incorporate an external dictionary during decoding. We show that using the external phrase dictionary at decoding time improves machine translation results in open vocabulary and cross-domain settings.
Acknowledgements

We thank Dani Yogatama and Chris Dyer for helpful comments and discussions on this project.
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bahdanau et al. (2016) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
- Cettolo et al. (2014) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of IWSLT, 2014.
- Cettolo et al. (2015) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2015 evaluation campaign. In International Conference on Spoken Language, 2015.
- Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
- Dahlmann et al. (2017) Leonard Dahlmann, Evgeny Matusov, Pavel Petrushkov, and Shahram Khadivi. Neural machine translation leveraging phrase-based models in a hybrid search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1422–1431, 2017.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
- Hashimoto et al. (2016) Kazuma Hashimoto, Akiko Eriguchi, and Yoshimasa Tsuruoka. Domain adaptation and attention-based unknown word replacement in chinese-to-japanese neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pp. 75–83, 2016.
- Hasler et al. (2018) Eva Hasler, Adrià De Gispert, Gonzalo Iglesias, and Bill Byrne. Neural machine translation decoding with terminology constraints. arXiv preprint arXiv:1805.03750, 2018.
- He et al. (2016) Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. Improved neural machine translation with smt features. In AAAI, pp. 151–157, 2016.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al. (2018) Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. Towards neural phrase-based machine translation. International Conference on Learning Representations, 2018.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Koehn (2009) Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.
- Koehn et al. (2003) Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 48–54, 2003.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, 2007.
- Kong et al. (2015) Lingpeng Kong, Chris Dyer, and Noah A Smith. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018, 2015.
- Lopez (2008) Adam Lopez. Statistical machine translation. ACM Computing Surveys (CSUR), 40(3):8, 2008.
- Lu et al. (2017) Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith. Multitask learning with ctc and segmental crf for speech recognition. arXiv preprint arXiv:1702.06378, 2017.
- Luong & Manning (2015) Minh-Thang Luong and Christopher D Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, 2015.
- Luong et al. (2014) Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.
- Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt, 2017.
- Nguyen & Joty (2018) Phi Xuan Nguyen and Shafiq Joty. Phrase-based attentions. arXiv preprint arXiv:1810.03444, 2018.
- Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In EMNLP, 2018.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, pp. 3104–3112, 2014.
- Tang et al. (2016) Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, and Philip LH Yu. Neural machine translation with external phrase memory. arXiv preprint arXiv:1606.01792, 2016.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- Wang et al. (2017a) Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. Sequence modeling via segmentations. In International Conference on Machine Learning, pp. 3674–3683, 2017a.
- Wang et al. (2017b) Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. Translating phrases in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1432–1442. Association for Computational Linguistics, 2017b.
- Wiseman & Rush (2016) Sam Wiseman and Alexander M Rush. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960, 2016.
- Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. Modeling localness for self-attention networks. In EMNLP, 2018.
- Yu et al. (2016a) Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The neural noisy channel. arXiv preprint arXiv:1611.02554, 2016a.
- Yu et al. (2016b) Lei Yu, Jan Buys, and Phil Blunsom. Online segment to segment neural transduction. arXiv preprint arXiv:1609.08194, 2016b.
- Zhao et al. (2018) Yang Zhao, Yining Wang, Jiajun Zhang, and Chengqing Zong. Phrase table as recommendation memory for neural machine translation. arXiv preprint arXiv:1805.09960, 2018.