. Generally, it relies on a recurrent neural network under the Encode-Decode framework: it firstly encodes a source sentence into context vectors and then generates its translation token-by-token, selecting from the target vocabulary. Among different variants of NMT, attention based NMT, which is the focus of this paper, is attracting increasing interests in the community[Bahdanau et al.2015, Luong et al.2015]. One of its advantages is that it is able to dynamically make use of the encoded context through an attention mechanism thereby allowing the use of fewer hidden layers while still maintaining high levels of translation performance.
An attention mechanism is designed to predict the alignment of a target word with respect to source words. In order to facilitate incremental decoding, it tries to make this alignment prediction without any information about the target word itself, and thus this attention can be considered to be a form of a reordering model (see §2 for more details). However, it differs from conventional alignment models that are able to use the target word to infer its alignments [Och and Ney2000, Dyer et al.2013, Liu and Sun2015], and as a result there is a substantial gap in quality between the alignments derived by this attention based NMT and conventional alignment models (54 VS 30 in terms of AER for Chinese-to-English as reported in [Cheng et al.2016]). This discrepancy might be an indication that the potential of NMT is limited. In addition, the attention in NMT is learned in an unsupervised manner without explicit prior knowledge about alignment.111 We do agree that NMT is a supervised model with respect to translation rather than reordering. In contrast, in conventional statistical machine translation (SMT), it is standard practice to learn reordering models in a supervised manner with the guidance from conventional alignment models.
Inspired by the supervised reordering in conventional SMT, in this paper, we propose a Supervised Attention based NMT (SA-NMT) model. Specifically, similar to conventional SMT, we first run off-the-shelf aligners (GIZA++ [Och and Ney2000] or fast_align [Dyer et al.2013]
etc.) to obtain the alignment of the bilingual training corpus in advance. Then, treating this alignment result as the supervision of attention, we jointly learn attention and translation, both in supervised manners. Since the conventional aligners delivers higher quality alignment, it is expected that the alignment in the supervised attention NMT will be improved leading to better end-to-end translation performance. One advantage of the proposed SA-NMT is that it implements the supervision of attention as a regularization in the joint training objective (§3.2). Furthermore, since the supervision of attention lies in the middle of the entire network architecture rather than the top ( as in the supervision of translation (see Figure 1(b)), it serves to mitigate the vanishing gradient problem during the back-propagation[Szegedy et al.2015].
This paper makes the following contributions:
It revisits the attention model from the point view of reordering (§2), and propose a supervised attention for NMT that is supervised by statistical alignment models (§3). The proposed approach is simple and easy to be implemented, and it is generally applicable to any attention-based NMT models, although in this case it is implemented on top of the model in[Bahdanau et al.2015].
On two Chinese-to-English translation tasks, it empirically shows that the proposed approach gives rise to improved performance (§4): on a large scale task, it outperforms three baselines including a state-of-the-art Moses, and leads to improvements of up to 2.5 BLEU points over the strongest baseline; on a low resource task, it even obtains about 5 BLEU points over the attention based NMT system on which is it based.
2 Revisiting Neural Machine Translation
|(a) NMT||(b) SA-NMT|
Suppose denotes a source sentence, a target sentence. In addition, let denote a prefix of
. Neural Machine Translation (NMT) directly maps a source sentence into a target under an encode-decode framework. In the encoding stage, it uses two bidirectional recurrent neural networks to encodeinto a sequence of vectors , with representing the concatenation of two vectors for
source word from two directional RNNs. In the decoding stage, it generates the target translation from the conditional probability over the pair of sequencesand via a recurrent neural network parametrized by as follows:
where and respectively denote an RNN hidden state (i.e. a vector) and a context vector at timestep ; is a transformation function mapping into a vector with dimension of the target vocabulary size; and denotes the component of a vector.222In that sense, in Eq.(1) also denotes the index of this word in its vocabulary. Furthermore,Chung et al.2014]; and the context vector is a dynamical source representation at timestep , and calculated as the weighted sum of source encodings , i.e. . Here the weight implements an attention mechanism, and is the alignment probability of being aligned to . is derived through a feedforward neural network as follows:
consists of two layers, the top one being a softmax layer. We skip the detailed definitions oftogether with , and , and refer the readers to [Bahdanau et al.2015] instead.333In the original paper, is independent on the in Eq.(2), is independent on the in Eq.(2), but this dependency was retained in our direct baseline NMT2. Figure 1(a) shows one slice of computational graph for NMT definition at time step .
To train NMT, the following negative log-likelyhood is minimized:
where is a bilingual sentence pair from a given training corpus, is as defined in Eq.(1). Note that even though the training is conducted in a supervised manner with respect to translation, i.e., are observable in Figure 1(a), the attention is learned in a unsupervised manner, since is hidden.
In Figure 1(a), can not be dependent on , as the target word is unknown at the timestep during the testing. Therefore, at timestep , NMT firstly tries to calculate , through which NMT figures out those source words will be translated next, even though the next target word is unavailable. From this point of view, the attention mechanism plays a role in reordering and thus can be considered as a reordering model. Unlike this attention model, conventional alignment models define the alignment directly over and as follows:
where denotes either a log-probability for a generative model like IBM models [Brown et al.1993] or a feature function for discriminative models [Liu and Sun2015]. In order to infer , alignment models can readily use the entire , of course including as well, thereby they can model the alignment between and more sufficiently. As a result, the attention based NMT might not deliver satisfying alignments, as reported in [Cheng et al.2016], compared to conventional alignment models. This may be a sign that the potential of NMT is limited in end-to-end translation.
3 Supervised Attention
In this section, we introduce supervised attention to improve the alignment, which consequently leads to better translation performance for NMT. Our basic idea is simple: similar to conventional SMT, it firstly uses a conventional aligner to obtain the alignment on the training corpus; then it employs these alignment results as supervision to train the NMT. During testing, decoding proceeds in exactly the same manner as standard NMT, since there is no alignment supervision available for unseen test sentences.
3.1 Preprocessing Alignment Supervision
As described in §2, the attention model outputs a soft alignment , such that
is a normalized probability distribution. In contrast, most aligners are typically oriented to grammar induction for conventional SMT, and they usually output ‘hard’ alignments, such as[Och and Ney2000]. They only indicate whether a target word is aligned to a source word or not, and this might not correspond to a distribution for each target word. For example, one target word may align to multiple source words, or no source words at all.
Therefore, we apply the following heuristics to preprocess the hard alignment: if a target word does not align to any source words, we inherit its affiliation from the closest aligned word with preference given to the right, following[Devlin et al.2014]; if a target word is aligned to multiple source words, we assume it aligns to each one evenly. In addition, in the implementation of NMT, there are two special tokens ‘eol’ added to both source and target sentences. We assume they are aligned to each other. In this way, we can obtain the final supervision of attention, denoted as .
3.2 Jointly Supervising Translation and Attention
We propose a soft constraint method to jointly supervise the translation and attention as follows:
where is as defined in Eq. (1),
is a loss function that penalizes the disagreement betweenand , and is a hyper-parameter that balances the preference between likelihood and disagreement. In this way, we treat the attention variable as an observable variable as shown in Figure 1(b), and this is different from the standard NMT as shown in Figure 1(a) in essence. Note that this training objective resembles to that in multi-task learning [Evgeniou and Pontil2004]. Our supervised attention method has two further advantages: firstly, it is able to alleviate overfitting by means of the ; and secondly it is capable of addressing the vanishing gradient problem because the supervision of is more close to than as in Figure 1(b).
In order to quantify the disagreement between and , three different methods are investigated in our experiments:
We conducted experiments on two Chinese-to-English translation tasks: one is the NIST task oriented to NEWS domain, which is a large scale task and suitable to NMT; and the other is the speech translation oriented to travel domain, which is a low resource task and thus is very challenging for NMT. We used the case-insensitive BLEU4 to evaluate translation quality and adopted the multi-bleu.perl as its implementation.
4.1 The Large Scale Translation Task
We used the data from the NIST2008 Open Machine Translation Campaign. The training data consisted of 1.8M sentence pairs, the development set was nist02 (878 sentences), and the test sets are were nist05 (1082 sentences), nist06 (1664 sentences) and nist08 (1357 sentences).
We compared the proposed approach with three strong baselines:
Moses: a phrase-based machine translation system [Koehn et al.2007];
We developed the proposed approach based on NMT2, and denoted it as SA-NMT.
|Mean Squared Error (MSE)||39.4|
|Cross Entropy (CE)||40.0|
We followed the standard pipeline to run Moses. GIZA++ with grow-diag-final-and was used to build the translation model. We trained a 5-gram target language model on the Gigaword corpus, and used a lexicalized distortion model. All experiments were run with the default settings.
To train NMT1, NMT2 and SA-NMT, we employed the same settings for fair comparison. Specifically, except the stopping iteration which was selected using development data, we used the default settings set out in [Bahdanau et al.2015] for all NMT-based systems: the dimension of word embedding was 620, the dimension of hidden units was 1000, the batch size was 80, the source and target side vocabulary sizes were 30000, the maximum sequence length was 50, 444This excludes all the sentences longer than 50 words in either source or target side only for NMT systems, but for Moses we use the entire training data. the beam size for decoding was 12, and the optimization was done by Adadelta with all hyper-parameters suggested by [Zeiler2012]. Particularly for SA-NMT, we employed a conventional word aligner to obtain the word alignment on the training data before training SA-NMT. In this paper, we used two different aligners, which are fast_align and GIZA++. We tuned the hyper-parameter to be 0.3 on the development set, to balance the preference between the translation and alignment. Training was conducted on a single Tesla K40 GPU machine. Each update took about 3.0 seconds for both NMT2 and SA-NMT, and 2.4 seconds for NMT1. Roughly, it took about 10 days to NMT2 to finish 300000 updates.
4.1.2 Settings on External Alignments
We implemented three different losses to supervise the attention as described in §3.2. To explore their behaviors on the development set, we employed the GIZA++ to generate the alignment on the training set prior to the training SA-NMT. In Table 1, we can see that MUL is better than MSE. Furthermore, CE performs best among all losses, and thus we adopt it for the following experiments.
In addition, we also run fast_align to generate alignments as the supervision for SA-NMT and the results were reported in Table 2. We can see that GIZA++ performs slightly better than fast_align and thus we fix the external aligner as GIZA++ in the following experiments.
4.1.3 Results on Large Scale Translation Task
Figure 2 shows the learning curves of NMT2 and SA-NMT on the development set. We can see that NMT2 generally obtains higher BLEU as the increasing of updates before peaking at update of , while it is unstable from then on. On the other hand, SA-NMT delivers much better BLEU for the beginning updates and performs more steadily along with the updates, although it takes more updates to reach the peaking point.
Table 3 reports the main end-to-end translation results for the large scale task. We find that both standard NMT generally outperforms Moses except NMT1 on nist05. The proposed SA-NMT achieves significant and consistent improvements over all three baseline systems, and it obtains the averaged gains of 2.2 BLEU points on test sets over its direct baseline NMT2. It is clear from these results that our supervised attention mechanism is highly effective in practice.
4.1.4 Results and Analysis on Alignment
As explained in §2, standard NMT can not use the target word information to predict its aligned source words, and thus might fail to predict the correct source words for some target words. For example, for the sentence in the training set in Figure 3 (a), NMT2 aligned ‘following’ to ‘皮诺契特 (gloss: pinochet)’ rather than ‘继 (gloss: follow)’, and worse still it aligned the word ‘.’ to ‘在 (gloss: in)’ rather than ‘。’ even though this word is relatively easy to align correctly. In contrast, with the help of information from the target word itself, GIZA++ successfully aligned both ‘following’ and ‘.’ to the expected source words (see Figure3(c)). With the alignment results from GIZA++ as supervision, we can see that our SA-NMT can imitate GIZA++ and thus align both words correctly. More importantly, for sentences in the unseen test set, like GIZA++, SA-NMT confidently aligned ‘but’ and ‘.’ to their correct source words respectively as in Figure3(b), where NMT2 failed. It seems that SA-NMT can learn its alignment behavior from GIZA++, and subsequently apply the alignment abilities it has learned to unseen test sentences.
Results on word alignment task for the large scale data. The evaluation metric is Alignment Error Rate (AER). ‘*’ denotes that the corresponding result is significanly better than NMT2 with.
Table 4 shows the overall alignment results on word alignment task in terms of the metric, alignment error rate. We used the manually-aligned dataset as in [Liu and Sun2015] as the test set. Following [Luong and Manning2015], we force-decode both the bilingual sentences including source and reference sentences to obtain the alignment matrices, and then for each target word we extract one-to-one alignments by picking up the source word with the highest alignment confidence as the hard alignment. From Table 4, we can see clearly that standard NMT (NMT2) is far behind GIZA++ in alignment quality. This shows that it is possible and promising to supervise the attention with GIZA++. With the help from GIZA++, our supervised attention based NMT (SA-NMT) significantly reduces the AER, compared with the unsupervised counterpart (NMT2). This shows that the proposed approach is able to realize our intuition: the alignment is improved, leading to better translation performance.
Note that there is still a gap between SA-NMT and GIZA++ as indicated in Table 4. Since SA-NMT was trained for machine translation instead of word alignment, it is possible to reduce its AER if we aim to the word alignment task only. For example, we can enlarge in Eq.(4) to bias the training objective towards word alignment task, or we can change the architecture slightly to add the target information crucial for alignment as in [Yang et al.2013, Tamura et al.2014].
4.2 Results on the Low Resource Translation Task
For the low resource translation task, we used the BTEC corpus as the training data, which consists of 30k sentence pairs with 0.27M Chinese words and 0.33M English words. As development and test sets, we used the CSTAR03 and IWSLT04 held out sets, respectively. We trained a 4-gram language model on the target side of training corpus for running Moses. For training all NMT systems, we employed the same settings as those in the large scale task, except that vocabulary size is 6000, batch size is 16, and the hyper-parameter for SA-NMT.
Table 5 reports the final results. Firstly, we can see that both standard neural machine translation systems NMT1 and NMT2 are much worse than Moses with a substantial gap. This result is not difficult to understand: neural network systems typically require sufficient data to boost their performance, and thus low resource translation tasks are very challenging for them. Secondly, the proposed SA-NMT gains much over NMT2 similar to the case in the large scale task, and the gap towards Moses is narrowed substantially.
While our SA-NMT does not advance the state-of-the-art Moses as in large scale translation, this is a strong result if we consider that previous works on low resource translation tasks: arthur+:2016 gained over Moses on the Japanese-to-English BTEC corpus, but they resorted to a corpus consisting of 464k sentence pairs; luong+manning:2015 revealed the comparable performance to Moses on English-to-Vietnamese with 133k sentences pairs, which is more than 4 times of our corprus size. Our method is possible to advance Moses by using reranking as in [Neubig et al.2015, Cohn et al.2016], but it is beyond the scope of this paper and instead we remain it as future work.
5 Related Work
Many recent works have led to notable improvements in the attention mechanism for neural machine translation. tu+:2016 introduced an explicit coverage vector into the attention mechanism to address the over-translation and under-translation inherent in NMT. feng+:2016 proposed an additional recurrent structure for attention to capture long-term dependencies. cheng+:2016 proposed an agreement-based bidirectional NMT model for symmetrizing alignment. cohn+:2016 incorporated multiple structural alignment biases into attention learning for better alignment. All of them improved the attention models that were learned in an unsupervised manner. While we do not modify the attention model itself, we learn it in a supervised manner, therefore our approach is orthogonal to theirs.
It has always been standard practice to learn reordering models from alignments for conventional SMT either at the phrase level or word level. At the phrase level, koehn+:2007 proposed a lexicalized MSD model for phrasal reordering; xiong+:2006 proposed a feature-rich model to learn phrase reordering for BTG; and li+:2014 proposed a neural network method to learn a BTG reordering model. At the word level, bisazza+federico:2016 surveyed many word reordering models learned from alignment models for SMT, and in particular there are some neural network based reordering models, such as [Zhang et al.2016]. Our work is inspired by these works in spirit, and it can be considered to be a recurrent neural network based word-level reordering model. The main difference is that in our approach the reordering model and translation model are trained jointly rather than separately as theirs.
It has been shown that attention mechanism in NMT is worse than conventional word alignment models in its alignment accuracy. This paper firstly provides an explanation for this by viewing the atten- tion mechanism from the point view of reordering. Then it proposes a supervised attention for NMT with guidance from external conventional alignment models, inspired by the supervised reordering models in conventional SMT. Experiments on two Chinese-to-English translation tasks show that the proposed approach achieves better alignment results leading to significant gains relative to standard attention based NMT.
We would like to thank Xugang Lu for invaluable discussions on this work.
- [Arthur et al.2016] Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. CoRR, abs/1606.02006.
- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- [Bisazza and Federico2016] Arianna Bisazza and Marcello Federico. 2016. A survey of word reordering in statistical machine translation: Computational models and language phenomena. Computational Linguistics, 42.
[Brown et al.1993]
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L.
The mathematics of statistical machine translation: Parameter estimation.Comput. Linguist., 19(2):263–311.
- [Cheng et al.2016] Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proceedings of IJCAI.
- [Chung et al.2014] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
- [Cohn et al.2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of NAACL-HLT.
- [Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of ACL.
- [Dyer et al.2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of ibm model 2. In In Proc. NAACL.
- [Evgeniou and Pontil2004] Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi–task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04.
- [Feng et al.2016] Shi Feng, Shujie Liu, Mu Li, and Ming Zhou. 2016. Implicit distortion and fertility models for attention-based encoder-decoder NMT model. CoRR, abs/1601.03317.
- [Koehn et al.2007] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of ACL: Demonstrations.
- [Lehmann and Casella1998] E.L. Lehmann and G. Casella. 1998. Theory of Point Estimation. Springer Verlag.
- [Li et al.2014] Peng Li, Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. 2014. A neural reordering model for phrase-based translation. In Proceedings of COLING.
- [Liang et al.2006] Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of HLT-NAACL.
- [Liu and Sun2015] Yang Liu and Maosong Sun. 2015. Contrastive unsupervised word alignment with non-local features.
- [Luong and Manning2015] Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of IWSLT.
- [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.
- [Neubig et al.2015] Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015).
- [Och and Ney2000] Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of ACL, pages 440–447.
[Rubinstein and Kroese2004]
Reuven Y. Rubinstein and Dirk P. Kroese.
The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-carlo Simulation (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
- [Sutskever et al.2015] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2015. Sequence to sequence learning with neural networks. In Proceedings of NIPS.
- [Szegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In .
- [Tamura et al.2014] Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2014. Recurrent neural networks for word alignment model. In Proceedings of ACL.
- [Tu et al.2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of ACL.
- [Xiong et al.2006] Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proceedings of ACL.
- [Yang et al.2013] Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Nenghai Yu. 2013. Word alignment modeling with context dependent deep neural network. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August.
- [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR.
- [Zhang et al.2016] Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Hai Zhao, Graham Neubig, and Satoshi Nakamura. 2016. Learning local word reorderings for hierarchical phrase-based statistical machine translation. Machine Translation.