Phrase Table as Recommendation Memory for Neural Machine Translation

05/25/2018 ∙ by Yang Zhao, et al.

Neural Machine Translation (NMT) has recently drawn much attention due to its promising translation performance. However, several studies indicate that NMT often generates fluent but unfaithful translations. In this paper, we propose a method to alleviate this problem by using a phrase table as recommendation memory. The main idea is to add a bonus to words worth recommending, so that NMT can make correct predictions. Specifically, we first derive a prefix tree to accommodate all the candidate target phrases by searching the phrase translation table according to the source sentence. Then, we construct a recommendation word set by matching the candidate target phrases against the target words previously translated by NMT. After that, we determine the specific bonus value for each recommendable word by using the attention vector and the phrase translation probability. Finally, we integrate this bonus value into NMT to improve the translation results. Extensive experiments demonstrate that the proposed method obtains remarkable improvements over a strong attention-based NMT baseline.


1 Introduction

The past several years have witnessed a significant progress in Neural Machine Translation (NMT). Most NMT methods are based on the encoder-decoder architecture [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015] and can achieve promising translation performance in a variety of language pairs [Gehring et al.2017, Vaswani et al.2017, Junczys-Dowmunt et al.2016a].

Figure 1: An example of mistakes made by NMT, for which SMT produces a correct translation.

However, recent studies [Arthur et al.2016, Tu et al.2017] show that NMT often generates words that make target sentences fluent but unfaithful to the source sentences. In contrast, traditional Statistical Machine Translation (SMT) methods rarely make this kind of mistake. Fig. 1 shows an example in which NMT makes mistakes when translating the phrases “jinkou dafu xiahua (the sharp decline in imports)” and “maoyi shuncha (the trade surplus)”, whereas SMT produces correct results for these two phrases. [Arthur et al.2016] argue that the reason behind this is that the use of distributed word representations in NMT makes systems generate words that seem natural in the context but do not reflect the content of the source sentence. Traditional SMT can avoid this problem because it produces translations based on phrase mappings.

Therefore, it is beneficial to combine SMT and NMT to alleviate the aforementioned problem. Researchers have already made some effective attempts toward this goal. Earlier studies were based on the SMT framework and are discussed in depth in [Zhang and Zong2015]. Later, researchers turned to the NMT framework. Specifically, the coverage mechanism [Tu et al.2016, Mi et al.2016a], SMT features [Wang et al.2016, He et al.2016, Stahlberg et al.2016, Li et al.2016, Wang et al.2017] and translation lexicons [Arthur et al.2016, Zhang and Zong2016, Feng et al.2017] have been fully explored. In contrast, the phrase translation table, the core of SMT, has not been fully studied. Recently, [Tang et al.2016] and [Wang et al.2017] explored the possibility of translating phrases in NMT. However, the “phrases” in their approaches differ from those used in phrase-based SMT. In [Tang et al.2016]’s models, a phrase pair must be a one-to-one mapping, with a source phrase having a unique target phrase (named entity translation pairs). In [Wang et al.2017]’s models, the source side of a phrase pair must be a chunk. Therefore, it remains a challenge to incorporate arbitrary phrase pairs from the phrase table into an NMT system to alleviate the unfaithfulness problem.

In this paper, we propose an effective method to incorporate a phrase table as recommendation memory into the NMT system. To achieve this, we add bonuses to the words in the recommendation set to help NMT make better predictions. Generally, our method contains three steps. 1) In order to find out which words are worth recommending, we first derive a candidate target phrase set by searching the phrase table according to the input sentence. After that, we construct a recommendation word set at each decoding step by matching the candidate target phrases against the target words previously translated by NMT. 2) We then determine the specific bonus value for each recommendable word by using the attention vector produced by NMT and the phrase translation probability extracted from the phrase table. 3) Finally, we integrate the word bonus value into the NMT system to improve the final results.

In this paper, we make the following contributions:

1) We propose a method to incorporate the phrase table as recommendation memory into the NMT system. We design a novel approach to find in the phrase table the target words worth recommending, calculate their recommendation scores, and use them to help NMT make better predictions.

2) Our empirical experiments on Chinese-English and English-Japanese translation tasks show the efficacy of our method. For Chinese-English translation, we obtain an average improvement of 2.23 BLEU points. For English-Japanese translation, the improvement reaches 1.96 BLEU points. We further find that the phrase table is much more beneficial to NMT than bilingual lexicons.

2 Neural Machine Translation

NMT contains two parts, an encoder and a decoder, where the encoder transforms the source sentence $x = (x_1, \ldots, x_n)$ into a set of context vectors $C = (h_1, \ldots, h_n)$. This context set is constructed by stacked Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] layers, and each $h_i$ can be calculated as follows:

$h_i = \mathrm{LSTM}(h_{i-1}, x_i)$   (1)

The decoder generates one target word at a time by computing the probability of $y_t$ as follows:

$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(s(y_t))$   (2)

where $s(y_t)$ is the score produced by NMT:

$s(y_t) = W_s \tilde{z}_t$   (3)

and $\tilde{z}_t$ is the attention output, which combines the decoder hidden state $z_t$ with the source-side context $c_t$:

$\tilde{z}_t = \tanh(W_c [z_t; c_t])$   (4)

The attention model calculates $c_t$ as the weighted sum of the source-side context vectors:

$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$   (5)

$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$   (6)

The alignment score $e_{t,i}$ is computed using the following formula:

$e_{t,i} = z_t^{\top} W_a h_i$   (7)
3 Phrase Table as Recommendation Memory for NMT

In Section 2 we described how the standard NMT model calculates the probability of the next target word (Eq. (2)). Our goal in this paper is to improve the accuracy of this probability estimation by incorporating information from the phrase table. Our main idea is to find the recommendable words and increase their probabilities at each decoding time step. Thus, three questions arise:

1) Which words are worth recommending at each decoding step?

2) How to determine an appropriate bonus value for each recommendable word?

3) How to integrate the bonus value into NMT?

In this section, we describe the specific methods to answer the above three questions. As the basis of our work, we first introduce two definitions used by our methods.

Definition 1 (prefix of phrase): the prefix of a phrase is a word sequence which begins with the first word of the phrase and ends with any word of the phrase. Note that the prefix string can be empty. For a three-word phrase $w_1 w_2 w_3$, the phrase contains four prefixes: $\{\epsilon, w_1, w_1 w_2, w_1 w_2 w_3\}$, where $\epsilon$ denotes the empty string.

Definition 2 (suffix of partial translation): the suffix of the partial translation $y_1 \ldots y_{t-1}$ is a word sequence which begins with any word of $y_1 \ldots y_{t-1}$ and ends with $y_{t-1}$. Similarly, the suffix string can also be empty. For a three-word partial translation $y_1 y_2 y_3$, there are four suffixes: $\{\epsilon, y_3, y_2 y_3, y_1 y_2 y_3\}$.
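To make the two definitions concrete, the small helpers below enumerate all prefixes of a phrase and all suffixes of a partial translation, both including the empty string; this is an illustrative sketch, not code from the paper.

```python
def prefixes(phrase):
    """All prefixes of a phrase, including the empty one (Definition 1)."""
    return [tuple(phrase[:i]) for i in range(len(phrase) + 1)]

def suffixes(partial_translation):
    """All suffixes of a partial translation, including the empty one (Definition 2)."""
    return [tuple(partial_translation[i:]) for i in range(len(partial_translation) + 1)]

# prefixes(["suburb", "of", "Milwaukee"])
# -> [(), ('suburb',), ('suburb', 'of'), ('suburb', 'of', 'Milwaukee')]
```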

3.1 Word Recommendation Set

3.1.1 Candidate Target Phrase Set

The first step is to derive a candidate target phrase set for a source sentence. The recommendation words are selected from this set.

Given a source sentence and a phrase translation table (as shown in the upper right of Fig. 2), we traverse the phrase translation table and get all the phrase pairs whose source side matches the input source sentence. Then, for each matched source phrase, we add the target phrases with the highest phrase translation probabilities into the candidate target phrase set.

In order to improve the efficiency of the next step, we represent this candidate target phrase set as a prefix tree. If phrases share the same prefix (Definition 1), the prefix tree merges them and represents the shared prefix with the same non-terminal nodes. The root of this prefix tree is an empty node. Fig. 2 shows an example that illustrates how we obtain the candidate target phrase set for a source sentence. In this example, in the phrase table (upper right), we find four phrases whose source side matches the source sentence (upper left). We add the target phrases into the candidate target phrase set (middle). Finally, we use a prefix tree (bottom) to represent the candidate target phrases.

Figure 2: The procedure of constructing the target side prefix tree from candidate target phrase set for a source sentence.
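A minimal sketch of how the candidate target phrase set and its prefix tree could be built, assuming the phrase table is stored as a dictionary from source phrases to (target phrase, probability) entries sorted by probability; all names are ours.

```python
def collect_candidates(source_words, phrase_table, top_n=10):
    """Collect target phrases whose source side matches the input sentence.

    phrase_table maps a source phrase (tuple of words) to a list of
    (target_phrase, probability) entries, assumed sorted by probability.
    """
    candidates = []
    for i in range(len(source_words)):
        for j in range(i + 1, len(source_words) + 1):
            src_phrase = tuple(source_words[i:j])
            for tgt_phrase, prob in phrase_table.get(src_phrase, [])[:top_n]:
                candidates.append((src_phrase, tuple(tgt_phrase), prob))
    return candidates

def build_prefix_tree(candidates):
    """Merge candidate target phrases that share a prefix; the root is an empty node."""
    root = {}
    for _src_phrase, tgt_phrase, _prob in candidates:
        node = root
        for word in tgt_phrase:
            node = node.setdefault(word, {})
    return root
```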

3.1.2 Word Recommendation Set

With the above preparations, we can start to construct the word recommendation set. In our method, we need to construct a word recommendation set $C_t$ at each decoding step $t$. The basic idea is that if a prefix (Definition 1) of a phrase in the candidate target phrase set matches a suffix (Definition 2) of the partial translation $y_1 \ldots y_{t-1}$, the next word after that prefix in the phrase may be the next target word to be predicted, and is thus worth recommending.

Here, we take Fig. 2 as an example to illustrate this idea. We assume that the partial translation is “he settled in the US, and lived in the suburb of”. According to our definition, this partial translation contains the suffix “suburb of”. Meanwhile, in the candidate target phrase set, there is a phrase (“suburb of Milwaukee”) whose two-word prefix is also “suburb of”. We can see that the next word after the prefix (“Milwaukee”) is exactly the one that should be predicted by the decoder. Thus, we recommend “Milwaukee” by adding a bonus to it, with the hope that if NMT would otherwise mistranslate this low-frequency word, our recommendation can fix the mistake.

Under this assumption, the procedure of constructing the word recommendation set is illustrated in Algorithm 1. We first get all suffixes of $y_1 \ldots y_{t-1}$ (line 2) and all prefixes of the target phrases belonging to the candidate target phrase set (line 3). If a prefix of a candidate phrase matches a suffix of $y_1 \ldots y_{t-1}$, we add the next word after that prefix in the phrase into the recommendation set (lines 4-7).

In the definition of the prefix and suffix, we also allow them to be empty strings. By doing so, we can add the first word of each phrase into the word recommendation set, since the empty suffix of $y_1 \ldots y_{t-1}$ and the empty prefix of any target phrase always match. The reason we add the first word of each phrase into the recommendation set is that we hope our method can still recommend some possible words when NMT has finished translating one phrase and begins to translate a new one, or predicts the first target word of the whole sentence.

Input: candidate target phrase set; already generated partial translation $y_1 \ldots y_{t-1}$
Output: word recommendation set $C_t$

1: $C_t \leftarrow \emptyset$
2: Get all suffixes of $y_1 \ldots y_{t-1}$ (denote each suffix by $s$)
3: Get all prefixes of each target phrase in the candidate target phrase set (denote every prefix by $p$)
4: for each suffix $s$ and each prefix $p$ do
5:     if $s = p$ then
6:         Add the next word after $p$ in its phrase into $C_t$
7: return $C_t$
Algorithm 1: Construct the word recommendation set
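Below is a brute-force Python rendering of Algorithm 1 for illustration; the paper walks the prefix tree instead of enumerating every prefix, so this sketch only captures the matching logic.

```python
def recommendation_set(partial_translation, candidate_phrases):
    """Brute-force rendering of Algorithm 1 (a sketch, not the paper's prefix-tree walk)."""
    # all suffixes of the partial translation, including the empty one (line 2)
    suffs = {tuple(partial_translation[i:]) for i in range(len(partial_translation) + 1)}
    rec = set()
    for phrase in candidate_phrases:                   # prefixes enumerated implicitly (line 3)
        for k in range(len(phrase)):                   # prefix = phrase[:k], next word = phrase[k]
            if tuple(phrase[:k]) in suffs:             # prefix matches a suffix (lines 4-5)
                rec.add(phrase[k])                     # add the next word after the prefix (line 6)
    return rec

# The empty prefix (k = 0) always matches the empty suffix, so the first word of
# every candidate phrase enters the recommendation set, as discussed above.
```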

Now we know which words are worth recommending. In order to facilitate the calculation of the bonus value (Section 3.2), we also need to record the origin of each recommendation word. Here, the origin of a recommendation word contains two parts: 1) the phrase pair this word belongs to and 2) the phrase translation probability between the source and target phrases. Formally, for a recommendation word $w$, we denote it by:

$w \Rightarrow \{(sp_1, tp_1, p_1), \ldots, (sp_k, tp_k, p_k)\}$   (8)

where $(sp_i, tp_i)$ denotes the $i$-th phrase pair the recommendation word belongs to (some words may belong to several phrase pairs, and $k$ denotes the number of such phrase pairs), $sp_i$ is the source phrase, $tp_i$ is the target phrase, and $p_i$ is the phrase translation probability between the source and target phrases (here the phrase translation probability is the mean of four probabilities, i.e., the bidirectional phrase translation probabilities and the bidirectional lexical weights). Take Fig. 2 as an example. When the partial translation is “he”, the word “settled” can be recommended according to Algorithm 1. The word “settled” is contained in two phrase pairs, whose translation probabilities are 0.6 and 0.4, respectively. Thus, we can denote the word “settled” as follows:

$\text{settled} \Rightarrow \{(sp_1, tp_1, 0.6), (sp_2, tp_2, 0.4)\}$   (9)
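For bookkeeping, each recommended word can be mapped to the phrase pairs it occurs in, mirroring Eq. (8); the placeholders below stand in for the concrete source and target phrases of Fig. 2, which we do not reproduce here.

```python
# word -> list of (source_phrase, target_phrase, phrase_translation_probability), cf. Eq. (8)
origins = {
    "settled": [
        ("<source phrase 1>", "<target phrase 1>", 0.6),   # first phrase pair containing "settled"
        ("<source phrase 2>", "<target phrase 2>", 0.4),   # second phrase pair containing "settled"
    ],
}
```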

3.2 Bonus Value Calculation

The next task is to calculate the bonus value for each recommendation word. For a recommendation word $w$ denoted by Eq. (8), its bonus value is calculated as follows:

Step 1: Extract each phrase translation probability $p_i$.

Step 2: For each phrase pair $(sp_i, tp_i)$, we convert the attention weights $\alpha_{t,j}$ in NMT (Eq. (6)) between the target word $y_t$ and the source words $x_j$ into a phrase alignment probability $a_{t,i}$ between the target word $y_t$ and the source phrase $sp_i$ as follows:

$a_{t,i} = \frac{1}{|sp_i|} \sum_{x_j \in sp_i} \alpha_{t,j}$   (10)

where $|sp_i|$ is the number of words in the phrase $sp_i$. As shown in Eq. (10), our conversion method averages the word alignment probabilities whose source words belong to the source phrase $sp_i$.

Step 3: Calculate the bonus value for each recommendation word $w$ as follows:

$b_t(w) = \sum_{i=1}^{k} a_{t,i} \cdot p_i$   (11)

From Eq. (11), the bonus value is determined by two factors, i.e., 1) the alignment information $a_{t,i}$ and 2) the translation probability $p_i$. Involving $a_{t,i}$ is important because the bonus value is then influenced by which source phrases the system is currently focusing on. We take $p_i$ into consideration with the hope that the larger $p_i$ is, the larger the bonus value is.
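Putting Steps 1-3 together, a sketch of the bonus computation might look as follows, assuming the origin records above, the attention weights of Eq. (6), and a mapping from each source phrase to the positions it covers in the source sentence; the summation over phrase pairs follows our reconstruction of Eq. (11).

```python
def bonus_value(word, origins, alpha, src_positions):
    """Bonus for a recommended word, following our reading of Eqs. (10)-(11).

    origins[word]:  list of (source_phrase, target_phrase, phrase_prob), cf. Eq. (8)
    alpha:          attention weights alpha_{t, 0..n-1} over source words at step t
    src_positions:  maps each source phrase to the positions it covers in the sentence
    """
    bonus = 0.0
    for src_phrase, _tgt_phrase, phrase_prob in origins.get(word, []):
        positions = src_positions[src_phrase]
        # Eq. (10): average the word-level attention weights over the source phrase
        phrase_align = sum(alpha[j] for j in positions) / len(positions)
        # Eq. (11): combine phrase alignment with the phrase translation probability
        bonus += phrase_align * phrase_prob
    return bonus
```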

3.3 Integrating Bonus Values into NMT

The last step is to combine the bonus value with the conditional probability of the baseline NMT model (Eq. (2)). Specifically, we add the bonuses to the words on top of the original NMT score (Eq. (3)) as follows:

$s'(y_t) = s(y_t) + \lambda \cdot b_t(y_t)$   (12)

where $b_t(y_t)$ is calculated by Eq. (11) and $\lambda$ is the bonus weight. Specifically, $\lambda$ is the result of a sigmoid function, $\lambda = \mathrm{sigmoid}(\theta)$, where $\theta$ is a learnable parameter; the sigmoid function ensures that the final weight falls between 0 and 1. (In our preliminary experiments, we also tried another strategy which adds the bonus to the NMT results as a bias, but its performance was lower than the method introduced here, Eq. (12).)
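The following minimal sketch, assuming plain NumPy arrays, shows how Eq. (12) could be applied: the bonus vector (zero for non-recommended words) is added to the NMT scores with a sigmoid-squashed weight before the softmax. The names and the treatment of the weight as a plain scalar are ours.

```python
import numpy as np

def rescored_distribution(nmt_scores, bonuses, weight_param):
    """Add recommendation bonuses to the NMT scores before the softmax, as in Eq. (12).

    nmt_scores:   (V,) scores s(y_t) over the target vocabulary
    bonuses:      (V,) bonus values, zero for words that were not recommended
    weight_param: learnable scalar; the sigmoid keeps the final weight in (0, 1)
    """
    weight = 1.0 / (1.0 + np.exp(-weight_param))
    rescored = nmt_scores + weight * bonuses
    exp = np.exp(rescored - rescored.max())
    return exp / exp.sum()
```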

4 Experimental Settings

In this section, we describe the experiments to evaluate our proposed methods.

4.1 Dataset

We test the proposed methods on Chinese-to-English (CH-EN) translation and English-to-Japanese (EN-JA) translation. In CH-EN translation, we test the proposed methods with two data sets: 1) a small data set, which includes 0.63M sentence pairs (LDC2000T50, LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17, LDC2004T07); 2) a large-scale data set, which contains about 2.1M sentence pairs. The NIST 2003 (MT03) dataset is used for validation. The NIST 2004-2006 (MT04-06) and NIST 2008 (MT08) datasets are used for testing. In EN-JA translation, we use the KFTT dataset (http://www.phontron.com/kftt/), which includes 0.44M sentence pairs for training, 1,166 sentence pairs for validation and 1,160 sentence pairs for testing.

4.2 Training and Evaluation Details

We use the Zoph_RNN toolkit (https://github.com/isi-nlp/ZophRNN), which we extend with global attention, to implement all our described methods. In all experiments, the encoder and decoder include two stacked LSTM layers. The word embedding dimension and the size of the hidden layers are both set to 1,000. The minibatch size is set to 128. We limit the vocabulary to the 30K most frequent words for both the source and target languages; other words are replaced by the special symbol “UNK”. At test time, we employ beam search with a beam size of 12. We use the case-insensitive 4-gram BLEU score [Papineni et al.2002] as the automatic metric for translation quality evaluation.
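For reference, the hyper-parameters above can be summarized in one illustrative configuration; the field names below are ours, not flags of the Zoph_RNN toolkit.

```python
# Illustrative training/decoding configuration matching the settings described above.
config = {
    "encoder_layers": 2,     # stacked LSTM layers in the encoder
    "decoder_layers": 2,     # stacked LSTM layers in the decoder
    "embedding_dim": 1000,   # word embedding dimension
    "hidden_size": 1000,     # LSTM hidden layer size
    "minibatch_size": 128,
    "vocab_size": 30000,     # most frequent words; the rest become "UNK"
    "beam_size": 12,         # beam search width at test time
}
```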

4.3 Phrase Translation Table

Our phrase translation table is learned directly from the parallel data by Moses [Koehn et al.2007]. To ensure the quality of the phrase pairs, in all experiments, the phrase translation table is filtered as follows: 1) out-of-vocabulary words in the phrase table are replaced by UNK; 2) we remove the phrase pairs whose words are all punctuation marks or UNK; 3) for each source phrase, we retain at most 10 target phrases having the highest phrase translation probabilities.
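A sketch of the filtering described above, assuming the phrase table is held as a dictionary from source phrases to (target phrase, probability) lists; the function and parameter names are ours.

```python
def filter_phrase_table(table, vocab, max_targets=10,
                        punctuation={",", ".", "!", "?", ";", ":", "(", ")", "\"", "'"}):
    """Filter a Moses phrase table as described above (illustrative sketch)."""
    filtered = {}
    for src_phrase, entries in table.items():
        # 1) replace out-of-vocabulary words with UNK on both sides
        src = tuple(w if w in vocab else "UNK" for w in src_phrase)
        cleaned = []
        for tgt_phrase, prob in entries:
            tgt = tuple(w if w in vocab else "UNK" for w in tgt_phrase)
            # 2) drop pairs whose words are all punctuation marks or UNK
            if all(w == "UNK" or w in punctuation for w in src + tgt):
                continue
            cleaned.append((tgt, prob))
        # 3) keep at most the 10 most probable target phrases per source phrase
        cleaned.sort(key=lambda entry: entry[1], reverse=True)
        if cleaned:
            filtered[src] = cleaned[:max_targets]
    return filtered
```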

4.4 Translation Methods

#  Method                 CH-EN: MT03(dev)  MT04   MT05   MT06   MT08   Ave     EN-JA: dev  test
1  Moses                  28.35             30.02  29.10  32.92  23.20  28.72   20.06       22.40
2  Baseline                34.20             36.96  32.60  33.85  25.96  32.71   23.61       25.99
3  Arthur                  34.98             37.96  33.36  34.79  26.53  33.52   24.33       26.72
4  Our method              36.48             38.79  35.34  36.58  27.49  34.94   25.63       27.95
5  system(no matching)     34.99             37.54  33.32  34.22  26.39  33.29   24.11       26.47
6  system(no first)        35.25             38.07  34.13  34.95  26.67  33.81   24.37       26.93
Table 1: Translation results (BLEU score) for different translation methods.

We compare our method with other relevant methods as follows:

1) Moses: It is a widely used phrasal SMT system [Koehn et al.2007].

2) Baseline: It is the baseline attention-based NMT system [Luong et al.2015, Zoph and Knight2016].

3) Arthur: It is the state-of-the-art method which incorporates discrete translation lexicons into the NMT model [Arthur et al.2016]. We choose the automatically learned lexicons and the bias method, and implement the method on top of the baseline attention-based NMT system. The bias hyper-parameter is set to 0.001, the same value reported in their work.

5 Translation Results

Table 1 reports the detailed translation results for different methods. Comparing the first two rows in Table 1, it is clear that the attention-based NMT system Baseline substantially outperforms the phrase-based SMT system Moses on both CH-EN and EN-JA translation. The average improvements for CH-EN and EN-JA translation are up to 3.99 BLEU points (32.71 vs. 28.72) and 3.59 BLEU points (25.99 vs. 22.40), respectively.

5.1 Effect of Integrating Phrase Translation Table

The first question we are interested in is whether the phrase translation table can improve the translation quality of NMT. Compared to the baseline, our method markedly improves the translation quality on both CH-EN and EN-JA translation. In CH-EN translation, the average improvement is up to 2.23 BLEU points (34.94 vs. 32.71). In EN-JA translation, the improvement reaches 1.96 BLEU points (27.95 vs. 25.99). This indicates that incorporating a phrase table into NMT can substantially improve NMT's translation quality.

Figure 3: Translation examples, where the proposed method is able to obtain a correct translation while the baseline NMT cannot.

In Fig. 3, we show an illustrative example of CH-EN translation. In this example, our method obtains a correct translation while the baseline does not. Specifically, the baseline NMT system mistranslates “jinkou dafu xiahua (the sharp decline in imports)” into “import of imports”, and incorrectly translates “maoyi shuncha (trade surplus)” into “trade”. Both mistakes are fixed by our method, because the phrase table contains the two phrase translation pairs (“jinkou dafu xiahua” to “the sharp decline in imports” and “maoyi shuncha” to “trade surplus”), and the correct translations are obtained thanks to our recommendation method.

5.2 Lexicon vs. Phrase

A natural question is whether it is more beneficial to incorporate a phrase translation table than translation lexicons. From Table 1, we can conclude that both translation lexicons and the phrase translation table improve the NMT system's translation quality. In CH-EN translation, Arthur improves the baseline NMT system by 0.81 BLEU points, while our method improves it by 2.23 BLEU points. In EN-JA translation, Arthur improves the baseline NMT system by 0.73 BLEU points, while our method improves it by 1.96 BLEU points. Therefore, phrase information is clearly more effective than lexicon information for improving the NMT system.

Method Faithfulness
Baseline 3.21
Arthur 3.25
Our method 3.33
Table 2: Subjective evaluation of translation faithfulness.

Fig. 4 shows an illustrative example. In this example, the baseline NMT mistranslates “dianli (electricity) anquan (safety)” into “coal”. Arthur partially fixes this error: it correctly translates “dianli” into “electrical”, but the source word “anquan (safety)” is still missed. This mistake is fixed by our proposed method. The reason is that Arthur uses information from translation lexicons, which only allows the system to fix the translation of an individual word (in this example, “dianli”), while our method uses information from phrases, which allows the system not only to obtain the correct translation of individual words but also to capture local word reordering, fixed collocations, etc.

Besides the BLEU score, we also conduct a subjective evaluation to validate the benefit of incorporating a phrase table into NMT. The subjective evaluation is conducted on CH-EN translation. As our method tries to solve the problem that the NMT system does not reflect the true meaning of the source sentence, the criterion of the subjective evaluation is the faithfulness of the translation results. Specifically, five human evaluators, who are native Chinese speakers proficient in English, are asked to evaluate the translations of 500 source sentences randomly sampled from the test sets, without knowing which system each translation comes from. The score ranges from 0 to 5; the higher the score of a translation, the more faithful it is. Table 2 shows the average results of the five subjective evaluations on CH-EN translation. As shown in Table 2, the faithfulness of the translations produced by our method is better than that of Arthur and the baseline NMT system.

Method MT03 MT04 MT05 MT06 MT08 Ave
Baseline 39.07 40.49 37.26 38.04 28.83 36.74
Arthur 39.92 41.41 38.18 38.67 29.32 37.50
Our method 40.87 42.41 39.29 39.83 30.47 38.57
Table 3: Translation results (BLEU score) for different translation methods on large-scale data.
Figure 4: Translation examples, where both methods improve over the baseline system, but our proposed model produces a better translation result.

5.3 Different Methods to Construct Recommendation Set

When constructing the word recommendation set, our method adds the next word after the matched part into the recommendation set. In order to test the validity of this strategy, we compare it with another system, in which all words in the candidate target phrase set are added into the recommendation set without matching. We denote this system by system(no matching); its results are reported in line 5 of Table 1. From the results, we can conclude that in both CH-EN and EN-JA translation, system(no matching) boosts the baseline system, but the improvements are much smaller than those of our method. This indicates that the matching between the phrase and the partial translation is essential for our method.

As discussed in Section 3.1, we allow the prefix and suffix to be empty strings so that the first word of each phrase can enter the word recommendation set. To show the effectiveness of this setting, we implement another system as a comparison, in which the first words of the phrases are not included in the recommendation set (we denote this system by system(no first)). The results of this system are reported in line 6 of Table 1. As shown in Table 1, our method performs better than system(no first) on both CH-EN and EN-JA translation. This result shows that the first word of a target phrase is also important for our method and is worth recommending.

5.4 Translation Results on Large Data

We also conduct another experiment to find out whether our method is still effective when many more sentence pairs are available. Therefore, we conduct CH-EN experiments on millions of sentence pairs, and Table 3 reports the results. We can conclude from Table 3 that our model also improves NMT translation quality on all of the test sets, with an average improvement of up to 1.83 BLEU points.

6 Related Work

In this work, we focus on integrating the phrase translation table of SMT into NMT. There have been several effective efforts to combine SMT and NMT.

Using the coverage mechanism. [Tu et al.2016] and [Mi et al.2016a] alleviated the over-translation and under-translation problems in NMT, inspired by the coverage mechanism in SMT.

Extending beam search. [Dahlmann et al.2017] extended the beam search method with SMT hypotheses. [Stahlberg et al.2016] improved the beam search by using the SMT lattices.

Combining SMT features and results. [He et al.2016] presented a log-linear model to integrate SMT features (the translation model and the language model) into NMT. [Liu et al.2016] and [Mi et al.2016b] proposed supervised attention models for NMT to minimize the alignment disagreement between NMT and SMT. [Wang et al.2016] proposed a method that incorporates the translations of SMT into NMT with an auxiliary classifier and a gating function. [Zhou et al.2017] proposed a neural combination model to fuse NMT and SMT translation results.

Incorporating translation lexicons. [Arthur et al.2016, Feng et al.2017] attempted to integrate NMT with probabilistic translation lexicons. [Zhang and Zong2016] moved a step further by incorporating bilingual dictionaries into NMT.

In the above works, integrating the phrase translation table of SMT into NMT has not been fully studied.

Translating phrases in NMT. The most related works are [Tang et al.2016] and [Wang et al.2017]. Both attempted to explore the possibility of translating phrases as a whole in NMT. In their models, NMT can generate a target phrase from a phrase memory or a word from the vocabulary by using a gate. However, their “phrases” are different from those used in phrase-based SMT. [Tang et al.2016]’s models only support a unique translation for a source phrase. In [Wang et al.2017]’s models, the source side of a phrase pair must be a chunk. Different from these two methods, our model can use any phrase pair in the phrase translation table, and promising results are achieved.

7 Conclusions and Future Work

In this paper, we have proposed a method to incorporate a phrase translation table as recommendation memory into the NMT system to alleviate the problem that NMT is apt to generate fluent but unfaithful translations.

Given a source sentence and a phrase translation table, we first construct a word recommendation set at each decoding step using a matching method. Then we calculate a bonus value for each recommendable word. Finally, we integrate the bonus value into NMT. Extensive experiments show that our method achieves substantial improvements on both Chinese-English and English-Japanese translation tasks.

In the future, we plan to design more effective methods to calculate accurate bonus values.

Acknowledgments

The research work described in this paper has been supported by the National Key Research and Development Program of China under Grant No. 2016QY02D0303 and the Natural Science Foundation of China under Grant No. 61333018 and 61673380. The research work in this paper also has been supported by Beijing Advanced Innovation Center for Language Resources.

References

  • [Arthur et al.2016] Philip Arthur, Graham Neubig, and Satoshi Nakamura. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of EMNLP 2016, pages 1557–1567, 2016.
  • [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015, 2015.
  • [Dahlmann et al.2017] Leonard Dahlmann, Evgeny Matusov, Pavel Petrushkov, and Shahram Khadivi. Neural machine translation leveraging phrase-based models in a hybrid search. In proceedings of EMNLP 2017, pages 1422–1431, 2017.
  • [Feng et al.2017] Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew Abel. Memory-augmented neural machine translation. In Proceedings of EMNLP 2017, pages 1401–1410, 2017.
  • [Gehring et al.2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
  • [He et al.2016] Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. Improved neural machine translation with smt features. In Proceedings of AAAI 2016, 2016.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [Junczys-Dowmunt et al.2016a] Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. Is neural machine translation ready for deployment? a case study on 30 translation directions. arXiv preprint arXiv:1610.01108, 2016.
  • [Junczys-Dowmunt et al.2016b] Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. The amu-uedin submission to the wmt16 news translation task: Attention-based nmt models as feature functions in phrase-based smt. In Proceedings of the First Conference on Machine Translation, pages 319–325, 2016.
  • [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of EMNLP 2013, pages 1700–1709, 2013.
  • [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007, pages 177–180, 2007.
  • [Liu et al.2016] Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Neural machine translation with supervised attention. In Proceedings of COLING 2016, pages 3093–3102, 2016.
  • [Luong et al.2015] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015, pages 1412–1421, 2015.
  • [Mi et al.2016a] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage embedding model for neural machine translation. In Proceedings of EMNLP 2016, pages 955–960, 2016.
  • [Mi et al.2016b] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Supervised attentions for neural machine translation. In Proceedings of EMNLP 2016, pages 2283–2288, 2016.
  • [Mi et al.2016c] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. Vocabulary manipulation for neural machine translation. In Proceedings of ACL 2016, pages 1–10, 2016.
  • [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318, 2002.
  • [Stahlberg et al.2016] Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. Syntactically guided neural machine translation. In Proceedings of ACL 2016, pages 299–305, 2016.
  • [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le.

    Sequence to sequence learning with neural networks.

    In Proceedings of NIPS 2014, 2014.
  • [Tang et al.2016] Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, and Philip LH Yu. Neural machine translation with external phrase memory. arXiv preprint arXiv:1606.01792, 2016.
  • [Tu et al.2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Coverage-based neural machine translation. In Proceedings of ACL 2016, pages 76–85, 2016.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, and Łukasz Kaiser. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [Wang et al.2016] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. Neural machine translation advised by statistical machine translation. In proceedings of AAAI 2017, 2016.
  • [Wang et al.2017] Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. Translating phrases in neural machine translation. In proceedings of EMNLP 2017, pages 1432–1442, 2017.
  • [Zhang and Zong2016] Jiajun Zhang and Chengqing Zong. Bridging neural machine translation and bilingual dictionaries. arXiv preprint arXiv:1610.07272, 2016.
  • [Zhou et al.2017] Long Zhou, Wenpeng Hu, Jiajun Zhang, and Chengqing Zong. Neural system combination for machine translation. In Proceedings of ACL 2017, pages 378–384, 2017.
  • [Zoph and Knight2016] Barret Zoph and Kevin Knight. Multi-source neural translation. In Proceedings of NAACL 2016, pages 30–34, 2016.