The attention model plays a crucial role in NMT, as it shows which source word(s) the model should focus on in order to predict the next target word. However, the attention or alignment quality of NMT is still very low [Mi et al.2016a, Tu et al.2016].
In this paper, we alleviate the above issue by utilizing the alignments (human annotated data or machine alignments) of the training set. Given the alignments of all the training sentence pairs, we add an alignment distance cost to the objective function. Thus, we not only maximize the log translation probabilities, but also minimize the alignment distance cost. Large-scale experiments over Chinese-to-English on various test sets show that our best method for a single system improves the translation quality significantly over the large vocabulary NMT system (Section 5) and beats the state-of-the-art syntax-based system.
2 Neural Machine Translation
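Attention-based NMT [Bahdanau et al.2014] is an encoder-decoder network (Figure 1). The encoder employs a bi-directional recurrent neural network to encode the source sentence $x = (x_1, \dots, x_l)$, where $l$ is the sentence length (including the end-of-sentence token), into a sequence of hidden states $h = (h_1, \dots, h_l)$; each $h_i$ is a concatenation of a left-to-right $\overrightarrow{h_i}$ and a right-to-left $\overleftarrow{h_i}$.
Given $h$, the decoder predicts the target translation by maximizing the conditional log-probability of the correct translation $y^* = (y^*_1, \dots, y^*_m)$, where $m$ is the sentence length (including the end-of-sentence token). At each time $t$, the probability of each word $y_t$ from a target vocabulary $V_y$ is:
$$p(y_t \mid h, y^*_{t-1}, \dots, y^*_1) = g(s_t, y^*_{t-1}), \tag{1}$$
where $g$ is a two-layer feed-forward neural network over the embedding of the previous word $y^*_{t-1}$ and the hidden state $s_t$. The $s_t$ is computed as:
$$s_t = q(s_{t-1}, y^*_{t-1}, H_t), \tag{2}$$
$$H_t = \sum_{i=1}^{l} \alpha_{t,i} \, h_i, \tag{3}$$
where $q$ is a gated recurrent unit and $H_t$ is a weighted sum of $h$; the weights $\alpha_{t,i}$ are computed with a two-layer feed-forward neural network $r$:
$$\alpha_{t,i} = \frac{\exp\{r(s_{t-1}, h_i, y^*_{t-1})\}}{\sum_{k=1}^{l} \exp\{r(s_{t-1}, h_k, y^*_{t-1})\}}. \tag{4}$$
We put all $\alpha_{t,i}$ ($t = 1, \dots, m$; $i = 1, \dots, l$) into a matrix $A'$, which gives an alignment matrix like (c) in Figure 2, where each row (one per target word) is a probability distribution over the source sentence $x$.
The training objective is to maximize the conditional log-probability of the correct translation $y^*$ given $x$ with respect to the parameters $\theta$:
$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{m} \log p(y^{*n}_t \mid x^n, y^{*n}_{t-1}, \dots, y^{*n}_1), \tag{5}$$
where $n$ indexes the $n$-th sentence pair $(x^n, y^{*n})$ in the training set and $N$ is the total number of pairs.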
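The attention step described above can be illustrated with a minimal sketch. It assumes the scores of the feed-forward scoring network are already available as plain numbers; the function names, the toy scores, and the two-dimensional hidden states are hypothetical and are not taken from the system described here.

```python
import math

def attention_weights(scores):
    """Softmax over the alignment scores of one target step."""
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alpha, hidden_states):
    """Weighted sum of the source hidden states under the weights alpha."""
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]

# Hypothetical scores for three source positions (assumed, not from the paper).
scores = [2.0, 0.5, 0.1]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = attention_weights(scores)            # one row of the attention matrix
context = context_vector(alpha, hidden)      # the weighted sum fed to the decoder
```

Stacking one such `alpha` row per target step yields the attention matrix whose rows are probability distributions over the source sentence.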
3 Alignment Component
The attention weights at each step play an important role in NMT. However, their accuracy is still far behind that of the traditional MaxEnt alignment model in terms of alignment F1 score [Mi et al.2016b, Tu et al.2016]. Thus, in this section, we explicitly add an alignment distance to the objective function in Eq. 5. The “truth” alignments for each sentence pair can come from human-annotated data or from unsupervised or supervised alignment models (e.g. GIZA++ [Och and Ney2000] or MaxEnt [Ittycheriah and Roukos2005]).
Figure 2 (a) shows an alignment matrix $A$ for a sentence pair. We append an end-of-source-sentence token to the source sentence and align every unaligned target word to it; we also force the end-of-target-sentence token to be aligned to the end-of-source-sentence token with probability one. We then apply one of two transformations to obtain a probability distribution matrix ((b) and (c) in Figure 2).
3.1 Simple Transformation
The first transformation simply normalizes each row. Figure 2 (b) shows the resulting matrix $A^*$. The last column, outlined with red dashed lines, shows the alignments to the special end-of-sentence token.
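A minimal sketch of the simple transformation follows. It assumes 0-based indices, that the last source column and the last target row hold the end-of-sentence tokens, and that alignments arrive as a set of (target, source) pairs; the function name and the toy links are hypothetical.

```python
def simple_transform(links, num_target, num_source):
    """Row-normalize a hard alignment matrix.

    links: set of (target_index, source_index) pairs, 0-based.
    The last source column and the last target row are the end-of-sentence tokens.
    """
    eos_src, eos_tgt = num_source - 1, num_target - 1
    matrix = [[0.0] * num_source for _ in range(num_target)]
    for t, i in links:
        if t != eos_tgt:                     # the end-of-target row is set explicitly below
            matrix[t][i] = 1.0
    matrix[eos_tgt][eos_src] = 1.0           # end-of-target aligns to end-of-source
    for row in matrix:
        if sum(row) == 0.0:                  # unaligned target word goes to end-of-source
            row[eos_src] = 1.0
        total = sum(row)
        for i in range(num_source):
            row[i] /= total                  # each row becomes a probability distribution
    return matrix

# Hypothetical example: 3 target words plus eos, 3 source words plus eos.
links = {(0, 0), (0, 1), (2, 2)}
A_simple = simple_transform(links, num_target=4, num_source=4)
```

In this toy case the first target word is aligned to two source words, so its row becomes 0.5 and 0.5, while the unaligned second target word puts all of its mass on the end-of-source column.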
3.2 Smoothed Transformation
Given the original alignment matrix $A$, we create a matrix $A^*$ with all points initialized to zero. Then, for each alignment point in $A$, we update $A^*$ by adding a Gaussian distribution centered at that point, truncated to a window of size $w$ on either side. For example, with $w = 2$ each alignment point adds mass to the positions at most two steps away from it. Then we normalize each row and get Figure 2 (c). In our experiments, we use a sharp distribution.
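The smoothed transformation can be sketched as below, under several assumptions that the text does not pin down: the window is taken to run over neighbouring target positions within the same source column, each alignment point contributes an unnormalized Gaussian weight exp(-d^2 / (2 sigma^2)) at offset d, and the default `sigma` is a placeholder. The function name and the toy matrix are hypothetical.

```python
import math

def smoothed_transform(matrix, window=2, sigma=1.0):
    """Spread each hard alignment point with a truncated Gaussian, then row-normalize.

    Assumption: the window covers neighbouring target rows in the same source column.
    Assumption: the weight at offset d is exp(-d^2 / (2 * sigma^2)), so the centre adds 1.
    """
    num_target, num_source = len(matrix), len(matrix[0])
    smooth = [[0.0] * num_source for _ in range(num_target)]
    for t in range(num_target):
        for i in range(num_source):
            if matrix[t][i] != 1:
                continue
            for offset in range(-window, window + 1):
                row = t + offset
                if 0 <= row < num_target:    # clip the window at the matrix border
                    smooth[row][i] += math.exp(-(offset ** 2) / (2 * sigma ** 2))
    for row in smooth:
        total = sum(row)
        if total > 0:
            for i in range(num_source):
                row[i] /= total
    return smooth

# Hypothetical diagonal alignment of three target words to three source words.
hard = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A_smooth = smoothed_transform(hard, window=1, sigma=1.0)
```

A smaller `sigma` keeps more mass on the original alignment point, which corresponds to a sharper distribution; switching the window to the source axis would only change which index the offset is added to.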
Alignment Objective: Given the “true” alignment $A^*$ and the machine attentions $A'$ produced by the NMT model, we compute the Euclidean distance between $A^*$ and $A'$:
$$d(A', A^*) = \sqrt{\sum_{t,i} \left(A^*_{t,i} - A'_{t,i}\right)^2}.$$
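As a rough illustration, the distance and its combination with the translation term might look as follows. The unit `weight` on the alignment term is an assumption (the text only says the distance is added to the objective), and both function names are hypothetical.

```python
import math

def alignment_distance(reference, attention):
    """Euclidean distance between two matrices of the same shape."""
    return math.sqrt(sum((r - a) ** 2
                         for ref_row, att_row in zip(reference, attention)
                         for r, a in zip(ref_row, att_row)))

def joint_cost(log_probs, reference, attention, weight=1.0):
    """Cost to minimize: negative log-likelihood plus the alignment distance.

    weight is an assumed knob; the source does not state how the terms are balanced.
    """
    return -sum(log_probs) + weight * alignment_distance(reference, attention)

reference = [[1.0, 0.0], [0.0, 1.0]]         # toy "true" alignment
attention = [[0.5, 0.5], [0.5, 0.5]]         # toy uniform attention
distance = alignment_distance(reference, attention)
cost = joint_cost([-1.0, -2.0], reference, attention)
```

Minimizing this cost is equivalent to maximizing the log translation probability while minimizing the alignment distance.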
There are two parts: translation and alignment, so we can optimize them jointly, or separately (e.g. we first optimize alignment only, then optimize translation). Thus, we divide the network in Figure 1 into alignment A and translation T parts:
A: all networks before the decoder hidden state,
T: the feed-forward network that predicts the target word from that hidden state and the previous word.
If we only optimize A, we keep the parameters in T unchanged. We can also optimize them jointly (J). In our experiments, we test different optimization strategies.
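One way to picture these strategies is as a sequence of training stages, each of which unfreezes some parts of the network. The sketch below is only a bookkeeping illustration; the part names, the `PARTS` table, and the string format for a strategy are hypothetical.

```python
# Which parts of the network receive updates under each strategy letter (assumed naming).
PARTS = {
    "A": {"alignment"},
    "T": {"translation"},
    "J": {"alignment", "translation"},
}

def training_stages(strategy):
    """Turn a strategy such as 'A T J' (or 'A → T → J') into per-stage trainable parts."""
    names = strategy.replace("→", " ").split()
    return [sorted(PARTS[name]) for name in names]

stages = training_stages("A T J")
# First only the alignment part is updated, then only the translation part,
# and finally both parts are updated jointly.
```

In an actual implementation each stage would freeze the parameters of the parts that are not listed for it.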
4 Related Work
In order to improve the attention or alignment accuracy, Cheng et al. [2016] adapted agreement-based learning [Liang et al.2006, Liang et al.2008] and introduced a combined objective that takes into account both translation directions (source-to-target and target-to-source) and an agreement term between the two alignment directions. By contrast, our approach directly optimizes the NMT parameters using the “supervised” alignments.
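Table 1: Translation results on the three test sets. BP is the brevity penalty and T-B is (Ter-Bleu)/2 (lower is better); the last column is the average T-B.

| alignment | system | MT06 BP | MT06 Bleu | MT06 T-B | MT08 news BP | MT08 news Bleu | MT08 news T-B | MT08 web BP | MT08 web Bleu | MT08 web T-B | avg. T-B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Cov. LVNMT [Mi et al.2016b] | 0.92 | 35.59 | 10.71 | 0.89 | 30.18 | 15.33 | 0.97 | 27.48 | 16.67 | 14.24 |
| Zh → En | A → J | 0.95 | 35.71 | 10.38 | 0.93 | 30.73 | 14.98 | 0.96 | 27.38 | 16.24 | 13.87 |
| | A → T → J | 0.95 | 35.95 | 10.24 | 0.92 | 30.95 | 14.62 | 0.97 | 26.76 | 17.04 | 13.97 |
| | J + Gau. | 0.96 | 36.95 | 9.71 | 0.94 | 32.43 | 13.61 | 0.97 | 28.63 | 15.80 | 13.04 |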
5.1 Data Preparation
We run our experiments on the Chinese-to-English task. The training corpus consists of approximately 5 million sentences available within the DARPA BOLT Chinese-English task. The corpus includes a mix of newswire, broadcast news, and weblog data. We do not include HK Law, HK Hansard and UN data. The Chinese text is segmented with a segmenter trained on CTB data using conditional random fields (CRF). Our development set is the concatenation of several tuning sets (GALE Dev, P1R6 Dev, and Dev 12) initially released under the DARPA GALE program. The development set is 4491 sentences in total. Our test sets are NIST MT06 (1664 sentences), MT08 news (691 sentences), and MT08 web (666 sentences).
For all NMT systems, the full vocabulary size of the training set is . In the training procedure, we use AdaDelta [Zeiler2012] to update model parameters with a mini-batch size of 80. Following Mi et al. [2016a], the output vocabulary for each mini-batch or sentence is a subset of the full vocabulary. For each source sentence, the sentence-level target vocabulary is the union of the most frequent target words and the top 10 candidates from the word-to-word/phrase translation tables learned with ‘fast_align’ [Dyer et al.2013]. The maximum length of a source phrase is 4. At training time, we add the reference in order to make the translation reachable.
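The construction of a sentence-level target vocabulary can be sketched as follows. The data structures are assumptions: `translation_table` is a hypothetical dictionary from a source word or phrase (tokens joined by spaces) to a list of target candidates sorted best first, and `frequent_words` stands for the list of most frequent target words, whose size is not fixed here.

```python
def sentence_vocabulary(source_words, frequent_words, translation_table,
                        top_candidates=10, max_phrase_len=4, reference=None):
    """Union of frequent target words and top translation candidates of the sentence."""
    vocab = set(frequent_words)
    n = len(source_words)
    for start in range(n):
        for length in range(1, max_phrase_len + 1):
            if start + length > n:
                break
            phrase = " ".join(source_words[start:start + length])
            # Keep only the best candidates for this source word or phrase.
            vocab.update(translation_table.get(phrase, [])[:top_candidates])
    if reference is not None:                # training time: make the reference reachable
        vocab.update(reference)
    return vocab

# Hypothetical toy tables (romanized source words for readability).
table = {"wo": ["I", "me"], "ai": ["love", "like"], "wo ai": ["adore"]}
vocab = sentence_vocabulary(["wo", "ai"], ["the", "."], table,
                            reference=["I", "love", "you"])
small = sentence_vocabulary(["wo", "ai"], ["the", "."], table, top_candidates=1)
```

At test time `reference` is left as `None`, so only the frequent words and the table candidates are available to the decoder.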
The Cov. LVNMT system is a re-implementation of the enhanced NMT system of Mi et al. [2016a], which employs a coverage embedding model and achieves better performance than the large vocabulary NMT of Jean et al. [2015]. The coverage embedding dimension of each source word is 100.
Following Jean et al. [2015], we dump the alignments (attentions) for each sentence and replace UNKs using the word-to-word translation model or the aligned source word.
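A minimal sketch of this UNK replacement step is given below. It assumes the aligned source word is the one with the highest attention weight for that target position, and that `word_table` is a hypothetical dictionary from a source word to its single best translation; if the word is missing from the table, the source word is copied.

```python
def replace_unks(target_words, attention, source_words, word_table, unk="UNK"):
    """Replace each UNK with the translation (or a copy) of its most attended source word."""
    output = []
    for t, word in enumerate(target_words):
        if word == unk:
            row = attention[t]
            i = max(range(len(row)), key=row.__getitem__)   # most attended source position
            source = source_words[i]
            word = word_table.get(source, source)           # translate, else copy
        output.append(word)
    return output

# Hypothetical toy example.
source = ["wo", "ai", "Beijing"]
target = ["I", "UNK", "UNK"]
attention = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
fixed = replace_unks(target, attention, source, {"ai": "love"})
```

Here the second UNK has no entry in the table, so the aligned source word is copied into the output.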
Our traditional SMT system is a hybrid syntax-based tree-to-string model [Zhao and Al-onaizan2008], a simplified version of joint decoding [Liu et al.2009, Cmejrek et al.2013]. We parse the Chinese side with the Berkeley parser, align the bilingual sentences with GIZA++ and MaxEnt, and extract Hiero and tree-to-string rules on the training set. Our language models are trained on the English side of the parallel corpus and on monolingual corpora (around 10 billion words from Gigaword (LDC2011T07)). We tune our system with PRO [Hopkins and May2011] to minimize (Ter-Bleu)/2¹ on the development set.
¹ The metric used for optimization in this work is (Ter-Bleu)/2, which prevents the system from using sentence length alone to affect Bleu or Ter. Typical SMT systems use target word count as a feature, and it has been observed that Bleu can be optimized by tweaking the weighting of the target word count with no improvement in human assessments of translation quality. Conversely, shorter sentences can be produced in order to optimize Ter. Optimizing the combination of metrics alleviates this effect [Arne Mauser and Ney2008].
5.2 Translation Results
Table 1 shows the translation results of all systems. The syntax-based statistical machine translation model achieves an average (Ter-Bleu)/2 of 13.36 on the three test sets. The Cov. LVNMT system achieves an average (Ter-Bleu)/2 of 14.24, which is about 0.9 points worse than the tree-to-string SMT system. Please note that all systems are single systems. It is quite possible that an ensemble of NMT systems with different random seeds would lead to better results than SMT.
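The averaged metric can be checked with a few lines of arithmetic. The per-test-set values below are the (Ter-Bleu)/2 entries of the Cov. LVNMT row in Table 1; the function names are hypothetical, and the Ter and Bleu inputs in the last line are made-up numbers used only to show the formula.

```python
def ter_minus_bleu(ter, bleu):
    """The tuning and reporting metric (Ter-Bleu)/2; lower is better."""
    return (ter - bleu) / 2

def average(values):
    return sum(values) / len(values)

cov_lvnmt = [10.71, 15.33, 16.67]            # per-test-set (Ter-Bleu)/2 from Table 1
cov_avg = round(average(cov_lvnmt), 2)       # reported average for Cov. LVNMT
gap_to_smt = round(cov_avg - 13.36, 2)       # 13.36 is the tree-to-string average
example = ter_minus_bleu(60.0, 40.0)         # made-up Ter and Bleu values
```

The resulting gap of 0.88 is the "about 0.9 points" mentioned above.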
We test three different alignments:
Zh → En (one direction of GIZA++),
GDFA (the “grow-diag-final-and” heuristic merge of both directions of GIZA++),
MaxEnt (trained on hand-aligned sentences).
The alignment quality improves from Zh → En to MaxEnt. We also test different optimization strategies: J (jointly), A (alignment only), and T (translation model only). A combination such as A → T means that we first optimize A only, then fix A and update only the T part. Gau. denotes the smoothed transformation (Section 3.2). Only the last row uses the smoothed transformation; all others use the simple transformation.
The results in Table 1 reveal several trends. First, with the same alignment, joint optimization (J) works better than the other optimization strategies (lines 3 to 6). Unfortunately, breaking the network down into two separate parts (A and T) and optimizing them separately does not help (lines 3 to 5). We have to conduct joint optimization (J) in order to get a comparable or better result (lines 3, 5 and 6) than the baseline system.
Second, when we change the training alignment seeds (Zh → En, GDFA, and MaxEnt), the NMT model does not yield significantly different results (lines 6 to 8).
Third, the smoothed transformation (J + Gau.) gives some improvements over the simple transformation (the last two lines) and achieves the best result (1.2 better than LVNMT and 0.3 better than Tree-to-string). In terms of Bleu scores, we conduct statistical significance tests with the sign test of Collins et al. [2005]; the results show that the improvements of our J + Gau. over LVNMT are significant on all three test sets.
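For readers unfamiliar with sign tests, a generic two-sided binomial sign test is sketched below. This is a textbook version and may differ in detail from the procedure of Collins et al. [2005]; in particular, how per-sentence wins and losses are decided is not specified here, and the counts in the example are invented.

```python
import math

def sign_test_p_value(wins, losses):
    """Two-sided binomial sign test under the null of equal win probability; ties are ignored."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented counts: sentences where one system scores better than the other.
p_clear = sign_test_p_value(wins=10, losses=0)
p_tied = sign_test_p_value(wins=5, losses=5)
```

A small p-value indicates that one system wins on more sentences than would be expected by chance.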
Finally, the brevity penalty (BP) consistently improves after we add the alignment cost to the NMT objective. Our alignment objective thus brings the translation length more in line with the human references.
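For reference, the standard Bleu brevity penalty is 1 when the output is at least as long as the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length. The sketch below uses corpus-level token counts; the function name and the lengths in the example are hypothetical.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Standard Bleu brevity penalty for total candidate and reference lengths."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

bp_short = brevity_penalty(90, 100)          # output shorter than the reference
bp_equal = brevity_penalty(100, 100)         # output as long as the reference
```

A BP closer to 1 therefore means the system output is closer in length to the references.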
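Table 2: Alignment results on the alignment test set.

| system | precision | recall | F1 |
|---|---|---|---|
| Cov. LVNMT [Mi et al.2016b] | 51.11 | 41.42 | 45.76 |
| A → T → J | 53.71 | 49.33 | 51.43 |
| J + Gau. | 48.90 | 55.38 | 51.94 |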
5.3 Alignment Results
Table 2 shows the alignment F1 scores on the alignment test set (447 hand-aligned sentences). The MaxEnt model is trained on hand-aligned sentences and achieves an F1 score of 75.96. For NMT systems, we dump the alignment matrices and convert them into alignments as follows: for each target word, we sort the attention weights and add the highest-probability link if its probability is higher than 0.2. If we only tune the alignment component (A in line 3), we improve the alignment F1 score from 45.76 to 47.87, and we further boost the score to 50.97 by tuning alignment and translation jointly (J in line 7). Interestingly, the system using MaxEnt produces more alignments in the output, which results in a higher recall. This suggests that using MaxEnt can lead to a sharper attention distribution: because we pick the alignment links based on the attention probabilities, the sharper the distribution is, the more links we can pick. We believe that a sharp attention distribution is a desirable property of NMT.
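The conversion from attention matrices to alignment links can be sketched as follows. The threshold of 0.2 comes from the text; the function name, the 0-based indexing, and the toy matrix are hypothetical.

```python
def extract_links(attention, threshold=0.2):
    """Keep, for each target word, its most attended source word if the weight exceeds the threshold."""
    links = set()
    for t, row in enumerate(attention):
        i = max(range(len(row)), key=row.__getitem__)   # highest-probability source position
        if row[i] > threshold:
            links.add((t, i))
    return links

# Toy attention matrix: the last row is too flat to produce a link.
attention = [
    [0.70, 0.20, 0.05, 0.05, 0.00, 0.00],
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],
    [0.17, 0.17, 0.17, 0.17, 0.16, 0.16],
]
links = extract_links(attention)
```

The flat last row illustrates why a sharper attention distribution yields more links and hence a higher recall.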
Again, the best result is J + Gau. in the last row, which improves the F1 by about 6 points over the baseline Cov. LVNMT system (from 45.76 to 51.94). When we use MaxEnt alignments, J + Gau. smoothing gives us a gain of about 1.7 points over the J system. It would therefore be interesting to run J + Gau. over the GDFA alignments as well.
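The relation between the precision, recall and F1 columns of Table 2 can be verified with a short sketch. It assumes plain set-based scoring of links, which may be simpler than the scoring actually used (for example, no distinction between sure and possible links); the function names and the toy link sets are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(predicted, gold):
    """Set-based scores for predicted and gold alignment links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)

# Check two rows of Table 2 from their precision and recall columns.
baseline_f1 = round(f1_score(51.11, 41.42), 2)
best_f1 = round(f1_score(48.90, 55.38), 2)
toy = precision_recall_f1({(0, 0), (1, 2)}, {(0, 0), (1, 1), (2, 2)})
```

Both rounded values agree with the F1 column of Table 2 (45.76 and 51.94).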
Together with the results in Table 1, we conclude that adding the alignment cost to the training objective helps both translation and alignment significantly.
In this paper, we utilize “supervised” alignments and add the alignment cost to the NMT objective function. In this way, we directly optimize the attention model in a supervised way. Experiments show significant improvements in both translation and alignment tasks over a very strong LVNMT system.
We thank the anonymous reviewers for useful comments.
- [Arne Mauser and Ney2008] Arne Mauser, Saša Hasan, and Hermann Ney. 2008. Automatic evaluation measures for statistical machine translation system optimization. In Proceedings of LREC 2008, Marrakech, Morocco, May.
- [Bahdanau et al.2014] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv e-prints, September.
- [Cheng et al.2016] Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proceedings of IJCAI, New York, USA, July.
- [Cmejrek et al.2013] Martin Cmejrek, Haitao Mi, and Bowen Zhou. 2013. Flexible and efficient hypergraph interactions for joint hierarchical and forest-to-string decoding. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 545–555, Seattle, Washington, USA, October. Association for Computational Linguistics.
- [Collins et al.2005] Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of ACL, pages 531–540, Ann Arbor, Michigan, June.
- [Dyer et al.2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.
- [Hopkins and May2011] Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of EMNLP.
- [Ittycheriah and Roukos2005] Abraham Ittycheriah and Salim Roukos. 2005. A maximum entropy word aligner for Arabic-English machine translation. In HLT ’05: Proceedings of the HLT and EMNLP, pages 89–96.
- [Jean et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL, pages 1–10, Beijing, China, July.
- [Liang et al.2006] P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In North American Association for Computational Linguistics (NAACL), pages 104–111.
- [Liang et al.2008] P. Liang, D. Klein, and M. I. Jordan. 2008. Agreement-based learning. In Advances in Neural Information Processing Systems (NIPS).
- [Liu et al.2009] Yang Liu, Haitao Mi, Yang Feng, and Qun Liu. 2009. Joint decoding with multiple translation models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 576–584, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
- [Mi et al.2016a] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016a. A coverage embedding model for neural machine translation. ArXiv e-prints.
- [Mi et al.2016b] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016b. Vocabulary manipulation for neural machine translation. In Proceedings of ACL, Berlin, Germany, August.
- [Och and Ney2000] Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, pages 440–447, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Tu et al.2016] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li. 2016. Coverage-based Neural Machine Translation. ArXiv e-prints, January.
- [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR.
- [Zhao and Al-onaizan2008] Bing Zhao and Yaser Al-onaizan. 2008. Generalizing local and non-local word-reordering patterns for syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 572–581, Stroudsburg, PA, USA. Association for Computational Linguistics.