Supervised Attentions for Neural Machine Translation

07/30/2016 ∙ by Haitao Mi, et al. ∙ IBM

In this paper, we improve the attention or alignment accuracy of neural machine translation by utilizing the alignments of training sentence pairs. We simply compute the distance between the machine attentions and the "true" alignments, and minimize this cost in the training procedure. Our experiments on a large-scale Chinese-to-English task show that our model significantly improves both translation and alignment quality over the large-vocabulary neural machine translation system, and even beats a state-of-the-art traditional syntax-based system.




1 Introduction

Neural machine translation (NMT) has gained popularity in the past two years [Bahdanau et al.2014, Jean et al.2015, Luong et al.2015], especially the attention-based models of Bahdanau et al. (2014).

The attention model plays a crucial role in NMT, as it shows which source word(s) the model should focus on in order to predict the next target word. However, the attention or alignment quality of NMT is still very low [Mi et al.2016a, Tu et al.2016].

In this paper, we alleviate the above issue by utilizing the alignments (human annotated data or machine alignments) of the training set. Given the alignments of all the training sentence pairs, we add an alignment distance cost to the objective function. Thus, we not only maximize the log translation probabilities, but also minimize the alignment distance cost. Large-scale experiments over Chinese-to-English on various test sets show that our best method for a single system improves the translation quality significantly over the large vocabulary NMT system (Section 5) and beats the state-of-the-art syntax-based system.

2 Neural Machine Translation

Figure 1: The architecture of attention-based NMT [Bahdanau et al.2014]. The source sentence is $x = (x_1, ..., x_l)$ with length $l$; $x_l$ is an end-of-sentence token $\langle eos \rangle$ on the source side. The reference translation is $y^* = (y^*_1, ..., y^*_m)$ with length $m$; similarly, $y^*_m$ is the target-side $\langle eos \rangle$. $\overleftarrow{h_i}$ and $\overrightarrow{h_i}$ are bi-directional encoder states. $\alpha_{t,i}$ is the attention probability at time $t$, position $i$. $H_t$ is the weighted sum of encoding states. $s_t$ is a hidden state. $o_t$ is an output state. Another one-layer neural network projects $o_t$ to the target output vocabulary, and conducts softmax to predict the probability distribution over the output vocabulary. The attention model (the right box) is a two-layer feed-forward neural network: its first layer produces an intermediate state, then another layer converts it into a real number $e_{t,i}$; the final attention probability at position $i$ is $\alpha_{t,i}$.

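As shown in Figure 1, attention-based NMT [Bahdanau et al.2014] is an encoder-decoder network. The encoder employs a bi-directional recurrent neural network to encode the source sentence $x = (x_1, ..., x_l)$, where $l$ is the sentence length (including the end-of-sentence $\langle eos \rangle$), into a sequence of hidden states $h = (h_1, ..., h_l)$; each $h_i$ is a concatenation of a left-to-right $\overrightarrow{h_i}$ and a right-to-left $\overleftarrow{h_i}$.

Given $h$, the decoder predicts the target translation by maximizing the conditional log-probability of the correct translation $y^* = (y^*_1, ..., y^*_m)$, where $m$ is the sentence length (including the end-of-sentence). At each time $t$, the probability of each word $y_t$ from a target vocabulary $V_y$ is:

$$p(y_t \mid h, y^*_{t-1}, ..., y^*_1) = g(s_t, y^*_{t-1}), \tag{1}$$

where $g$ is a two-layer feed-forward neural network over the embedding of the previous word $y^*_{t-1}$ and the hidden state $s_t$. The $s_t$ is computed as:

$$s_t = q(s_{t-1}, y^*_{t-1}, H_t), \tag{2}$$

$$H_t = \sum_{i=1}^{l} \alpha_{t,i} \cdot h_i, \tag{3}$$

where $q$ is a gated recurrent unit and $H_t$ is a weighted sum of $h$; the weights, $\alpha$, are computed with a two-layer feed-forward neural network $r$:

$$\alpha_{t,i} = \frac{\exp\{r(s_{t-1}, h_i, y^*_{t-1})\}}{\sum_{k=1}^{l} \exp\{r(s_{t-1}, h_k, y^*_{t-1})\}}. \tag{4}$$

We put all $\alpha_{t,i}$ ($t = 1, ..., m$, $i = 1, ..., l$) into a matrix $A'$, which gives an alignment matrix like (c) in Figure 2, where each row (for each target word) is a probability distribution over the source sentence $x$.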
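To make the attention step concrete, the following is a minimal sketch in plain Python of how one row of the attention matrix and the corresponding weighted sum of encoder states could be computed. The scores and encoder states are made-up toy values, and the function names `softmax` and `context_vector` are illustrative assumptions, not part of the original system.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw attention scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alphas: List[float], states: List[List[float]]) -> List[float]:
    """Weighted sum of encoder states, using the attention weights."""
    dim = len(states[0])
    return [sum(a * h[d] for a, h in zip(alphas, states)) for d in range(dim)]


# Hypothetical raw scores for one target step over three source positions.
scores = [2.0, 0.5, -1.0]
alphas = softmax(scores)  # one row of the attention matrix

# Toy two-dimensional encoder states for the three source positions.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_t = context_vector(alphas, states)
```

Stacking one such `alphas` row per target word gives the attention matrix whose rows are probability distributions over the source sentence.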

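The training objective is to maximize the conditional log-probability of the correct translation $y^*$ given $x$ with respect to the parameters $\theta$:

$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{m} \log p(y^{*n}_t \mid x^n, y^{*n}_{t-1}, ..., y^{*n}_1), \tag{5}$$

where $(x^n, y^{*n})$ is the $n$-th sentence pair in the training set, and $N$ is the total number of pairs.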

3 Alignment Component

Figure 2: Alignment transformation. A special token, $\langle eos \rangle$, is introduced to the source sentence, and we align all the unaligned target words (one such word appears in this example) to $\langle eos \rangle$. (a): the original alignment matrix $A$ from GIZA++ or the MaxEnt aligner. (b): simple normalization by rows (a probability distribution over the source sentence $x$). (c): smoothed transformation followed by normalization by rows; we always align end-of-source-sentence to end-of-target-sentence with probability one.

The attentions, $\alpha_{t,i}$, in each step play an important role in NMT. However, their accuracy is still far behind the traditional MaxEnt alignment model in terms of alignment F1 score [Mi et al.2016b, Tu et al.2016]. Thus, in this section, we explicitly add an alignment distance to the objective function in Eq. 5. The “true” alignments for each sentence pair can come from human-annotated data, or from unsupervised or supervised aligners (e.g. GIZA++ [Och and Ney2000] or MaxEnt [Ittycheriah and Roukos2005]).

Given an alignment matrix $A$ for a sentence pair ($x$, $y^*$) in Figure 2 (a), we have an end-of-source-sentence token $\langle eos \rangle = x_l$, and we align all the unaligned target words (one in this example) to $\langle eos \rangle$; we also force $y^*_m$ (end-of-target-sentence) to be aligned to $x_l$ with probability one. Then we conduct two transformations to get the probability distribution matrices ((b) and (c) in Figure 2).

3.1 Simple Transformation

The first transformation simply normalizes each row. Figure 2 (b) shows the resulting matrix $A^*$. The last column, in red dashed lines, shows the alignments of the special end-of-sentence token $\langle eos \rangle$.
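The row normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is given as a 0/1 matrix with one row per target word, the last column stands for the source end-of-sentence token, and the function name `simple_transform` and the toy matrix are hypothetical.

```python
from typing import List


def simple_transform(links: List[List[int]]) -> List[List[float]]:
    """Row-normalize a 0/1 alignment matrix (rows: target words, columns: source words).

    The last column is assumed to be the source end-of-sentence token; a target
    word with no link is aligned to it before normalizing.
    """
    result = []
    for row in links:
        row = list(row)  # copy so the input is left untouched
        if sum(row) == 0:
            row[-1] = 1  # unaligned target word -> source end-of-sentence
        total = sum(row)
        result.append([value / total for value in row])
    return result


# Toy alignment: 4 target words (last is end-of-sentence) x 4 source positions.
links = [
    [1, 0, 0, 0],  # aligned to source word 0
    [0, 1, 1, 0],  # aligned to two source words
    [0, 0, 0, 0],  # unaligned target word
    [0, 0, 0, 1],  # target end-of-sentence aligned to source end-of-sentence
]
probs = simple_transform(links)
```

A target word with two links gets probability 0.5 on each, and the unaligned word puts all of its mass on the end-of-sentence column.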

3.2 Smoothed Transformation

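Given the original alignment matrix $A$, we create a matrix $A^*$ with all points initialized to zero. Then, for each alignment point $A_{t,i} = 1$, we update $A^*$ by adding a Gaussian distribution, $g(\mu, \sigma)$, over a window of size $w$ around the aligned position (offsets $-w$, …, $+w$). In the example of Figure 2, the window size is $w = 2$: the aligned cell and its neighbors inside the window each receive the corresponding Gaussian value. Then we normalize each row and get (c). In our experiments, we use a sharp distribution, i.e. one with a small $\sigma$.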
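The smoothing step can be sketched as below. Several details are assumptions made for illustration, not taken from the original: the Gaussian is used in its unnormalized form `exp(-d^2 / (2 * sigma^2))`, the window runs over source positions and is clipped at the sentence boundaries, unaligned target words go to the last (end-of-sentence) column, and the last target row is kept one-hot on that column. The default `sigma` is a placeholder, not the value used in the experiments.

```python
import math
from typing import List


def smoothed_transform(
    links: List[List[int]], window: int = 2, sigma: float = 1.0
) -> List[List[float]]:
    """Spread each alignment link over neighboring source positions, then row-normalize."""
    n_cols = len(links[0])
    last_row = len(links) - 1
    smoothed = []
    for t, row in enumerate(links):
        acc = [0.0] * n_cols
        if t == last_row or sum(row) == 0:
            # Target end-of-sentence, or an unaligned word: all mass on source end-of-sentence.
            acc[-1] = 1.0
        else:
            for i, linked in enumerate(row):
                if not linked:
                    continue
                for offset in range(-window, window + 1):
                    j = i + offset
                    if 0 <= j < n_cols:  # clip the window at the sentence boundaries
                        acc[j] += math.exp(-(offset ** 2) / (2.0 * sigma ** 2))
        total = sum(acc)
        smoothed.append([value / total for value in acc])
    return smoothed


links = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
smoothed = smoothed_transform(links, window=2, sigma=1.0)
```

A smaller `sigma` concentrates more of each row's mass on the originally aligned position, which corresponds to a sharper target distribution.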

3.3 Objectives

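Alignment Objective: Given the “true” alignment $A^*$ and the machine attentions $A'$ produced by the NMT model, we compute the Euclidean distance between $A^*$ and $A'$:

$$\Delta(A^*, A') = \frac{1}{2} \sum_{t=1}^{m} \sum_{i=1}^{l} (A^*_{t,i} - A'_{t,i})^2. \tag{6}$$

NMT Objective: Plugging Eq. 6 into Eq. 5, we have

$$\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \Big\{ \sum_{t=1}^{m} \log p(y^{*n}_t \mid x^n, y^{*n}_{t-1}, ..., y^{*n}_1) - \Delta(A^{*n}, A'^n) \Big\}. \tag{7}$$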
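The two objectives can be sketched for a single sentence pair as follows. This is an illustrative sketch under assumptions: the distance is taken as half the sum of squared differences between the two matrices, the joint cost is written as a quantity to minimize (negative log-likelihood plus distance) with no extra weighting, and the names `alignment_distance` and `joint_cost` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def alignment_distance(reference: Matrix, attention: Matrix) -> float:
    """Half the sum of squared differences between reference and attention matrices."""
    return 0.5 * sum(
        (ref - att) ** 2
        for ref_row, att_row in zip(reference, attention)
        for ref, att in zip(ref_row, att_row)
    )


def joint_cost(log_probs: List[float], reference: Matrix, attention: Matrix) -> float:
    """Cost to minimize: negative log-likelihood plus the alignment distance."""
    return -sum(log_probs) + alignment_distance(reference, attention)


# Toy sentence with two target words and two source positions.
reference = [[1.0, 0.0], [0.0, 1.0]]
attention = [[0.5, 0.5], [0.0, 1.0]]
log_probs = [math.log(0.5), math.log(0.25)]  # made-up per-word log-probabilities

distance = alignment_distance(reference, attention)
cost = joint_cost(log_probs, reference, attention)
```

Minimizing this cost is the same as maximizing the log translation probability minus the alignment distance.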


There are two parts, translation and alignment, so we can optimize them jointly or separately (e.g. first optimize alignment only, then optimize translation). Thus, we divide the network in Figure 1 into alignment A and translation T parts:

  1. A: all networks before the hidden state $s_t$,

  2. T: the network $g(s_t, y^*_{t-1})$.

If we only optimize A, we keep the parameters in T unchanged. We can also optimize them jointly (J). In our experiments, we test different optimization strategies.
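One way to picture the A, T, and J strategies is as a choice of which parameter group receives updates. The sketch below is only an illustration: it uses a plain gradient step as a stand-in for the actual optimizer, and the parameter names, group labels, and the `update_step` helper are all hypothetical.

```python
from typing import Dict

# Which parameter groups are trainable under each strategy.
TRAINABLE = {"A": {"A"}, "T": {"T"}, "J": {"A", "T"}}


def update_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    groups: Dict[str, str],
    mode: str,
    lr: float = 0.1,
) -> Dict[str, float]:
    """Apply a plain gradient step only to parameters in the trainable groups."""
    trainable = TRAINABLE[mode]
    return {
        name: value - lr * grads[name] if groups[name] in trainable else value
        for name, value in params.items()
    }


# Hypothetical scalar parameters, one per part of the network.
params = {"attention.W": 1.0, "output.W": 2.0}
grads = {"attention.W": 0.5, "output.W": -1.0}
groups = {"attention.W": "A", "output.W": "T"}

after_a = update_step(params, grads, groups, mode="A")  # T stays frozen
after_j = update_step(params, grads, groups, mode="J")  # both parts move
```

A staged strategy such as optimizing A first and then T would correspond to running steps with `mode="A"` for a while and then switching to `mode="T"`.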

4 Related Work

In order to improve the attention or alignment accuracy, Cheng et al. (2016) adapted agreement-based learning [Liang et al.2006, Liang et al.2008], and introduced a combined objective that takes into account both translation directions (source-to-target and target-to-source) and an agreement term between the two alignment directions. By contrast, our approach directly optimizes the NMT parameters using the “supervised” alignments.

5 Experiments

                                        MT06                  MT08 News             MT08 Web          avg.
single system                     Bp    Bleu   T-b      Bp    Bleu   T-b      Bp    Bleu   T-b      T-b
Tree-to-string                    0.95  34.93   9.45    0.94  31.12  12.90    0.90  23.45  17.72    13.36
Cov. LVNMT [Mi et al.2016b]       0.92  35.59  10.71    0.89  30.18  15.33    0.97  27.48  16.67    14.24
  Zh → En   A → J                 0.95  35.71  10.38    0.93  30.73  14.98    0.96  27.38  16.24    13.87
  Zh → En   A → T                 0.95  28.59  16.99    0.92  24.09  20.89    0.97  20.48  23.31    20.40
  Zh → En   A → T → J             0.95  35.95  10.24    0.92  30.95  14.62    0.97  26.76  17.04    13.97
  Zh → En   J                     0.96  36.76   9.67    0.94  31.24  14.80    0.96  28.35  15.61    13.36
  GDFA      J                     0.96  36.44  10.16    0.94  30.66  15.01    0.96  26.67  16.72    13.96
  MaxEnt    J                     0.95  36.80   9.49    0.93  31.74  14.02    0.96  27.53  16.21    13.24
  MaxEnt    J + Gau.              0.96  36.95   9.71    0.94  32.43  13.61    0.97  28.63  15.80    13.04

Table 1: Single system results in terms of (Ter-Bleu)/2 (T-b, the lower the better) on the 5 million sentence Chinese-to-English training set. Bp denotes the brevity penalty. NMT results are on a large vocabulary and with UNK replaced. The indented rows are Cov. LVNMT trained with different alignments: Zh → En (one direction), GDFA (“grow-diag-final-and”), and MaxEnt [Ittycheriah and Roukos2005]. A, T, and J mean optimizing alignment only, translation only, and jointly. Gau. denotes the smoothed transformation.

5.1 Data Preparation

We run our experiments on a Chinese-to-English task. The training corpus consists of approximately 5 million sentences available within the DARPA BOLT Chinese-English task. The corpus includes a mix of newswire, broadcast news, and weblog. We do not include HK Law, HK Hansard and UN data. The Chinese text is segmented with a segmenter trained on CTB data using conditional random fields (CRF). Our development set is the concatenation of several tuning sets (GALE Dev, P1R6 Dev, and Dev 12) initially released under the DARPA GALE program. The development set is 4491 sentences in total. Our test sets are NIST MT06 (1664 sentences), MT08 news (691 sentences), and MT08 web (666 sentences).

For all NMT systems, we use the full vocabulary of the training set. In the training procedure, we use AdaDelta [Zeiler2012] to update model parameters with a mini-batch size of 80. Following Mi et al. (2016a), the output vocabulary for each mini-batch or sentence is a subset of the full vocabulary. For each source sentence, the sentence-level target vocabulary is the union of the most frequent target words and the top 10 candidates from the word-to-word/phrase translation tables learned from ‘fast_align’ [Dyer et al.2013]. The maximum length of a source phrase is 4. At training time, we add the reference in order to make the translation reachable.

The Cov. LVNMT system is a re-implementation of the enhanced NMT system of Mi et al. (2016a), which employs a coverage embedding model and achieves better performance than the large-vocabulary NMT of Jean et al. (2015). The coverage embedding dimension of each source word is 100.

Following Jean et al. (2015), we dump the alignments (attentions) for each sentence, and replace UNKs using the word-to-word translation model or the aligned source word.
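The UNK replacement step can be sketched roughly as follows. This is a minimal illustration under assumptions: each UNK is replaced using the source word with the highest attention weight, a small dictionary stands in for the word-to-word translation model, and the function name `replace_unks`, the token `"UNK"`, and the toy sentence are hypothetical.

```python
from typing import Dict, List


def replace_unks(
    target: List[str],
    source: List[str],
    attention: List[List[float]],
    lexicon: Dict[str, str],
    unk: str = "UNK",
) -> List[str]:
    """Replace each UNK with a translation of its most-attended source word.

    Falls back to copying the source word when the lexicon has no entry.
    """
    output = []
    for t, word in enumerate(target):
        if word != unk:
            output.append(word)
            continue
        row = attention[t]
        best = max(range(len(row)), key=row.__getitem__)  # most-attended source position
        source_word = source[best]
        output.append(lexicon.get(source_word, source_word))
    return output


source = ["wo", "ai", "beijing"]
target = ["I", "love", "UNK"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
translated = replace_unks(target, source, attention, {"beijing": "Beijing"})
copied = replace_unks(target, source, attention, {})
```

The quality of this replacement depends directly on how accurate the attention is, which is one practical reason to supervise it.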

Our traditional SMT system is a hybrid syntax-based tree-to-string model [Zhao and Al-onaizan2008], a simplified version of the joint decoding [Liu et al.2009, Cmejrek et al.2013]. We parse the Chinese side with the Berkeley parser, align the bilingual sentences with GIZA++ and MaxEnt, and extract Hiero and tree-to-string rules on the training set. Our language models are trained on the English side of the parallel corpus, and on monolingual corpora (around 10 billion words from Gigaword (LDC2011T07)). We tune our system with PRO [Hopkins and May2011] to minimize (Ter-Bleu)/2¹ on the development set.

¹ The metric used for optimization in this work is (Ter-Bleu)/2 to prevent the system from using sentence length alone to impact Bleu or Ter. Typical SMT systems use target word count as a feature and it has been observed that Bleu can be optimized by tweaking the weighting of the target word count with no improvement in human assessments of translation quality. Conversely, in order to optimize Ter shorter sentences can be produced. Optimizing the combination of metrics alleviates this effect [Arne Mauser and Ney2008].

5.2 Translation Results

Table 1 shows the translation results of all systems. The syntax-based statistical machine translation model achieves an average (Ter-Bleu)/2 of 13.36 on the three test sets. The Cov. LVNMT system achieves an average (Ter-Bleu)/2 of 14.24, which is about 0.9 points worse than the Tree-to-string SMT system. Please note that all systems are single systems. It is highly possible that an ensemble of NMT systems with different random seeds would lead to better results than SMT.
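As a quick arithmetic check, the averages quoted above can be recomputed from the per-test-set T-b values in Table 1. The helper names below are illustrative; the numbers are copied from the table.

```python
from typing import List


def ter_minus_bleu_half(ter: float, bleu: float) -> float:
    """The tuning metric (Ter - Bleu) / 2; lower is better."""
    return (ter - bleu) / 2.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# T-b scores from Table 1, in the order MT06, MT08 News, MT08 Web.
tree_to_string = [9.45, 12.90, 17.72]
cov_lvnmt = [10.71, 15.33, 16.67]
j_gau = [9.71, 13.61, 15.80]

avg_tree = mean(tree_to_string)
avg_lvnmt = mean(cov_lvnmt)
avg_j_gau = mean(j_gau)

print(round(avg_tree, 2), round(avg_lvnmt, 2), round(avg_j_gau, 2))
```

The rounded averages match the avg. column of Table 1, and the gap between the Cov. LVNMT baseline and the Tree-to-string system comes out at 0.88, consistent with "about 0.9 points".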

We test three different alignments:

  1. Zh → En (one direction of GIZA++),

  2. GDFA (the “grow-diag-final-and” heuristic merge of both directions of GIZA++),

  3. MaxEnt (trained on hand-aligned sentences).

The alignment quality improves from Zh → En to MaxEnt. We also test different optimization strategies: J (jointly), A (alignment only), and T (translation model only). A combination, A → T, means that we first optimize A only, then fix A and only update the T part. Gau. denotes the smoothed transformation (Section 3.2). Only the last row uses the smoothed transformation; all others use the simple transformation.

The experimental results in Table 1 show some interesting patterns. First, with the same alignment, joint optimization J works better than the other optimization strategies (lines 3 to 6). Unfortunately, breaking down the network into two separate parts (A and T) and optimizing them separately does not help (lines 3 to 5). We have to conduct joint optimization J in order to get a comparable or better result (lines 3, 5 and 6) over the baseline system.

Second, when we change the training alignment seeds (Zh → En, GDFA, and MaxEnt), the NMT model does not yield significantly different results (lines 6 to 8).

Third, the smoothed transformation (J + Gau.) gives some improvements over the simple transformation (the last two lines), and achieves the best result (1.2 better than LVNMT, and 0.3 better than Tree-to-string). In terms of Bleu scores, we conduct statistical significance tests with the sign-test of Collins et al. (2005); the results show that the improvements of our J + Gau. over LVNMT are significant on all three test sets.

Finally, the brevity penalty (Bp) consistently gets better after we add the alignment cost to the NMT objective. Our alignment objective adjusts the translation length to be more in line with the human references.

system                          pre.    rec.    F1
MaxEnt                          74.86   77.10   75.96
Cov LVNMT [Mi et al.2016b]      51.11   41.42   45.76
  Zh → En   A                   50.88   45.19   47.87
  Zh → En   A → J               53.18   49.37   51.21
  Zh → En   A → T               50.29   44.90   47.44
  Zh → En   A → T → J           53.71   49.33   51.43
  Zh → En   J                   54.29   48.02   50.97
  GDFA      J                   53.88   48.25   50.91
  MaxEnt    J                   44.42   55.25   49.25
  MaxEnt    J + Gau.            48.90   55.38   51.94

Table 2: Alignment F1 scores of different models.

5.3 Alignment Results

Table 2 shows the alignment F1 scores on the alignment test set (447 hand-aligned sentences). The MaxEnt model is trained on hand-aligned sentences, and achieves an F1 score of 75.96. For NMT systems, we dump the alignment matrices and convert them into alignments with the following steps: for each target word, we sort the alphas and add the max-probability link if it is higher than 0.2. If we only tune the alignment component (A in line 3), we improve the alignment F1 score from 45.76 to 47.87. We further boost the score to 50.97 by tuning alignment and translation jointly (J in line 7). Interestingly, the system using MaxEnt produces more alignments in the output, which results in a higher recall. This suggests that using MaxEnt can lead to a sharper attention distribution: since we pick the alignment links based on the attention probabilities, the sharper the distribution is, the more links we can pick. We believe that a sharp attention distribution is a great property of NMT.
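The link extraction and scoring described above can be sketched as follows. This is an illustrative sketch: links are represented as (target index, source index) pairs, the function names are hypothetical, and the toy attention matrix and gold links are made up. Only the 0.2 threshold and the precision and recall figures in the final check come from the text and Table 2.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (target position, source position)


def extract_links(attention: List[List[float]], threshold: float = 0.2) -> Set[Link]:
    """For each target word, keep its highest-probability link if it exceeds the threshold."""
    links = set()
    for t, row in enumerate(attention):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] > threshold:
            links.add((t, best))
    return links


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


attention = [
    [0.8, 0.05, 0.05, 0.05, 0.05],  # confident link to source position 0
    [0.2, 0.2, 0.2, 0.2, 0.2],      # flat row: the max does not exceed 0.2, so no link
    [0.1, 0.1, 0.6, 0.1, 0.1],      # confident link to source position 2
]
gold = {(0, 0), (1, 1), (2, 2)}
predicted = extract_links(attention)
precision, recall, f1 = precision_recall_f1(predicted, gold)

# Consistency check against the MaxEnt row of Table 2 (values in percent).
maxent_f1 = f1_score(74.86, 77.10)
```

The flat second row illustrates the point made in the text: a diffuse attention distribution yields no link at all, so sharper distributions produce more links and tend to raise recall.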

Again, the best result is J + Gau. in the last row, which improves the F1 by about 6 points over the baseline Cov. LVNMT system (51.94 vs. 45.76). With MaxEnt alignments, J + Gau. smoothing gives about a 2.7-point gain over the J system (51.94 vs. 49.25). It would therefore be interesting to run J + Gau. over the GDFA alignments as well.

Together with the results in Table 1, we conclude that adding the alignment cost to the training objective helps both translation and alignment significantly.

6 Conclusion

In this paper, we utilize the “supervised” alignments, and add the alignment cost to the NMT objective function. In this way, we directly optimize the attention model in a supervised way. Experiments show significant improvements in both translation and alignment tasks over a very strong LVNMT system.


We thank the anonymous reviewers for useful comments.


  • [Arne Mauser and Ney2008] Arne Mauser, Saša Hasan, and Hermann Ney. 2008. Automatic evaluation measures for statistical machine translation system optimization. In Proceedings of LREC 2008, Marrakech, Morocco, May.
  • [Bahdanau et al.2014] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv e-prints, September.
  • [Cheng et al.2016] Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proceedings of IJCAI, New York, USA, July.
  • [Cmejrek et al.2013] Martin Cmejrek, Haitao Mi, and Bowen Zhou. 2013. Flexible and efficient hypergraph interactions for joint hierarchical and forest-to-string decoding. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 545–555, Seattle, Washington, USA, October. Association for Computational Linguistics.
  • [Collins et al.2005] Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of ACL, pages 531–540, Ann Arbor, Michigan, June.
  • [Dyer et al.2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.
  • [Hopkins and May2011] Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of EMNLP.
  • [Ittycheriah and Roukos2005] Abraham Ittycheriah and Salim Roukos. 2005. A maximum entropy word aligner for Arabic-English machine translation. In HLT ’05: Proceedings of the HLT and EMNLP, pages 89–96.
  • [Jean et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL, pages 1–10, Beijing, China, July.
  • [Liang et al.2006] P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In North American Association for Computational Linguistics (NAACL), pages 104–111.
  • [Liang et al.2008] P. Liang, D. Klein, and M. I. Jordan. 2008. Agreement-based learning. In Advances in Neural Information Processing Systems (NIPS).
  • [Liu et al.2009] Yang Liu, Haitao Mi, Yang Feng, and Qun Liu. 2009. Joint decoding with multiple translation models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 576–584, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
  • [Mi et al.2016a] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016a. A coverage embedding model for neural machine translation. ArXiv e-prints.
  • [Mi et al.2016b] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016b. Vocabulary manipulation for neural machine translation. In Proceedings of ACL, Berlin, Germany, August.
  • [Och and Ney2000] Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, pages 440–447, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Tu et al.2016] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li. 2016. Coverage-based Neural Machine Translation. ArXiv e-prints, January.
  • [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR.
  • [Zhao and Al-onaizan2008] Bing Zhao and Yaser Al-onaizan. 2008. Generalizing local and non-local word-reordering patterns for syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 572–581, Stroudsburg, PA, USA. Association for Computational Linguistics.