Neural Phrase-to-Phrase Machine Translation

In this paper, we propose Neural Phrase-to-Phrase Machine Translation (NP^2MT). Our model uses a phrase attention mechanism to discover relevant input (source) segments that are used by a decoder to generate output (target) phrases. We also design an efficient dynamic programming algorithm to decode segments, which allows the model to be trained faster than the existing neural phrase-based machine translation method of Huang et al. (2018). Furthermore, our method can naturally integrate with external phrase dictionaries during decoding. Empirical experiments show that our method achieves performance comparable to state-of-the-art methods on benchmark datasets. However, when the training and testing data are from different distributions or domains, our method performs better.



1 Introduction

Statistical phrase-based models used to be the state-of-the-art machine translation systems (Koehn et al., 2007) before the deep learning revolution. In contrast to word-based systems (Koehn et al., 2003; Lopez, 2008; Koehn, 2009), phrase-based approaches explicitly model phrase structures in both source and target sentences and their corresponding alignments. These linguistic structures can be understood as a form of inductive bias in the model, a key factor in its superior performance over its word-based counterparts.

In recent years, we have witnessed the surge of neural sequence-to-sequence (seq2seq) models (Bahdanau et al., 2014; Sutskever et al., 2014). These models usually consist of three main components: an encoder that encodes the source sentence into a fixed-length vector, a decoder that generates the translation word by word, and an attention module that selectively retrieves the source-side information as per the decoder’s needs. Innovations in both architectures (Vaswani et al., 2017; Gehring et al., 2017) and training techniques (Vaswani et al., 2017; Ba et al., 2016) keep advancing the state-of-the-art results on standard benchmarks for machine translation.

Despite the remarkable success of neural sequence-to-sequence models, the biases induced by phrase structures, which have been shown to be useful in machine translation (Koehn, 2009), have been largely ignored. Only recently did Huang et al. (2018) develop Neural Phrase-based Machine Translation (NPMT) to incorporate target-side phrasal information into a neural translation model. Their model builds upon the Sleep-WAke Networks (SWAN), a segmentation-based sequence modeling technique described in Wang et al. (2017a).

In this paper, we propose Neural Phrase-to-Phrase Machine Translation (NP^2MT). Our contributions are twofold. First, we develop an explicit phrase-level attention mechanism to capture source-side phrasal information and its alignment with the target-side phrase outputs. This approach also avoids the more costly dynamic programming procedure that NPMT requires because of its monotonic input-output alignment requirement. NP^2MT achieves performance comparable to state-of-the-art methods on benchmark datasets. Second, we can naturally incorporate an external phrase-to-phrase dictionary during decoding. NMT systems trained on a fixed amount of parallel data are known to have limitations with out-of-vocabulary words (Luong et al., 2014); in other words, such words come from a data distribution different from that of the training data. The standard remedy is a post-processing step that patches in unknown words based on the word-level alignment information from the attention mechanism (Luong et al., 2014; Hashimoto et al., 2016). In our model, given the phrase-level attention, we develop a dictionary look-up decoding method with an external phrase-to-phrase dictionary. We demonstrate that NP^2MT consistently outperforms the Transformer (Vaswani et al., 2017) in a simulated open-vocabulary setting and a cross-domain translation setting.

2 Neural Phrase-to-Phrase Machine Translation

In this section, we first describe the proposed NP^2MT model with the phrase-level attention mechanism and show how it avoids the more costly dynamic programming used in NPMT (Huang et al., 2018). Next, given the phrase-level attention mechanism, we develop a decoding algorithm that can naturally use an external phrase-to-phrase dictionary.

2.1 NP^2MT

Figure 1: The architecture of NPMT model. The example shows how the target phrase is translated conditioning directly on the source phrase , using the phrase-level attention. Note that for brevity, in the phrase-level encoder, we only show one possible segmentation of the source sentence. We use “” to indicate all the possible segments in Eq. 2.

Figure 1 shows an overview of NP^2MT. There are three main components in our model: a source encoder , a target encoder , and a segment decoder . The source encoder consists of sentence-level and phrase-level encoders (, ), namely . We use bidirectional LSTMs (Hochreiter & Schmidhuber, 1997) for the source encoders, an LSTM for the target encoder, and the Transformer (Vaswani et al., 2017) for the segment decoder. [Footnote 1: In our proposed model, we first tried an LSTM as the target decoder, but we later found that the Transformer did better at capturing meaningful output segments.]

Consider a source sentence and a target sentence . First, the source sentence is passed through sentence-level encoder to obtain the word vector representation,


Second, we use phrase-level encoder to independently encode all possible segments in the source sentence as,


where is the maximum segment length we will consider in our model. This architecture of computing phrase embedding is inspired by the Segmental RNNs (Kong et al., 2015), which has been shown to be effective in both text and speech tasks (Lu et al., 2017).
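To make the size of the phrase-level encoding set concrete, the following minimal sketch enumerates every candidate source span up to a maximum segment length. The function name and the (start, end) span convention are illustrative assumptions, not the paper's notation; in the actual model each span would be encoded by the phrase-level BiLSTM rather than just listed.

```python
def enumerate_segments(sentence_length, max_segment_length):
    """List all (start, end) spans, end exclusive, with 1 <= length <= max_segment_length.

    Hypothetical helper: it only enumerates the spans that a phrase-level
    encoder would have to encode; it does not compute any embeddings.
    """
    spans = []
    for start in range(sentence_length):
        for length in range(1, max_segment_length + 1):
            end = start + length
            if end > sentence_length:
                break  # the span would run past the end of the sentence
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    # A 4-word sentence with segments of at most 2 words.
    for span in enumerate_segments(4, 2):
        print(span)
```

The number of spans grows roughly as sentence length times maximum segment length, which is why the maximum segment length is capped.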

Denote as the entire set of phrase-level encoding vectors. Now the rest of the model is more conveniently described as a generative process. Let be a concatenation operator, be the end of segment symbol and as the end of sentence symbol.

For segment index

Update the attention state given all previous segments,


where is the total length of all previous segments. Note that different collections of previous segments can lead to the same attention state as long as their concatenation is the same.

Given the current state , sample words from segment decoder until we reach the end of segment symbol or end of sentence . This gives us a segment .

If we see in , stop.

Concatenate to obtain the output sentence . Since there are more than one way to obtain the same using the generative process above, we need to consider all cases. Let be a valid segmentation of , then we have


where is the number of segment in , attention is defined in Eq. 3

and the segment probability

is defined via the decoder .

2.2 Training

Given a pair of source and target sentence , the objective function of machine translation model is defined as follows,


As with NPMT (Huang et al., 2018), directly computing Eq. (5) is intractable, so we also develop a dynamic programming algorithm to compute the loss function efficiently. We denote the conditional probability

as ; then we have . It can be shown that we have the following recursion,

The initial condition is . The computational complexity of this algorithm is to , where denotes the maximum segment length. Comparing with the original NPMT model (Huang et al., 2018) with training complexity, the proposed approach is much faster.
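As an illustration of how a sum over all segmentations can be computed without enumerating them, here is a minimal sketch of a forward-style recursion in log space. It assumes the recursion takes the usual form in which the probability of a target prefix is a sum over the length of its last segment; this form is an assumption here, not taken from the paper's equations. The callable `segment_log_prob(start, end)` is a hypothetical stand-in for the segment decoder score of target words `start..end-1` given the preceding prefix and the source.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    highest = max(values)
    if highest == float("-inf"):
        return highest
    return highest + math.log(sum(math.exp(v - highest) for v in values))


def marginal_log_likelihood(target_length, max_segment_length, segment_log_prob):
    """Sum the probabilities of all segmentations of the target, in log space.

    alpha[end] holds the log-probability of producing the first `end` target
    words under any segmentation; alpha[0] = log 1 is the initial condition.
    """
    alpha = [float("-inf")] * (target_length + 1)
    alpha[0] = 0.0
    for end in range(1, target_length + 1):
        first_start = max(0, end - max_segment_length)
        # Sum over every admissible start position of the last segment.
        terms = [alpha[start] + segment_log_prob(start, end)
                 for start in range(first_start, end)]
        alpha[end] = log_sum_exp(terms)
    return alpha[target_length]


if __name__ == "__main__":
    # Toy scorer (an assumption): every segment has probability 0.5.
    toy_scorer = lambda start, end: math.log(0.5)
    print(math.exp(marginal_log_likelihood(3, 3, toy_scorer)))
```

This sketch calls the segment scorer at most `target_length * max_segment_length` times, whereas the number of distinct segmentations grows exponentially with the target length.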

2.3 Decoding

For the proposed NP^2MT, besides the standard beam search algorithm, we propose a method to decode with an external phrase-to-phrase dictionary. The beam search algorithm is given in Appendix A.

Dictionary integration in decoding.

Figure 2: The decoding process of NPMT model with dictionary extension. The example shows how a German sentence “sie sehen in der mitte das kolosseum , den fluß tiber .” is translated into the English sentence “you see the colosseum in the middle , the river tiber .” Different colors are used to represent the aligned phrases. The underlined words (“kolosseum”, “fluß”, “tiber”) are OOV words, and the italic phrases (“sehen”, “das kolosseum”, “tiber”) can be found in the dictionary. The solid lines are used to denote the translation by the neural model, and the dashed ones are by the dictionary.

The proposed NP^2MT builds a phrase-to-phrase framework, which can easily be extended to leverage an external phrase-to-phrase dictionary. Here, we describe the dictionary-enhanced decoding algorithm in the greedy search case. [Footnote 2: Integrating the dictionary into beam search is nontrivial. In the current model, the use of the dictionary is determined by rules rather than probability scores, whereas beam search requires a quantitative comparison among candidates whose phrases are generated by both the rule-based system and the neural model. It is unclear how to turn the former into beam search scores; we leave this for future work.]

The decoding process is shown in Figure 2. At each time step of decoding, NP^2MT produces a phrase-level attention distribution, and the phrase with the maximum attention is the one to be translated. Given the attended phrase, the model needs to decide whether to use the dictionary to translate it. Here, an attended phrase is translated with the dictionary only if it contains unknown words and the dictionary includes its translation. The intuition is that a person reading a sentence decides to use a dictionary when they meet an unknown word and the attended phrase can be found in the dictionary. When there are multiple translations for a phrase in the dictionary, we use the model to score all candidates and choose the best one; in NP^2MT, the segment decoder serves as this scoring function. The decoding algorithm is shown in Algorithm 1.

Data: a source sentence: , a dictionary:
Result: a target sentence:

Compute the hidden representation for the source sentence;
Set initial output segment ;
Set output sentence ;
while eos is not in the current output segment do
       Compute the attention state and the attention distribution;
       Select the attended source phrase with the maximum attention score;
       if the attended phrase contains UNK && the attended phrase is in the dictionary then
             Retrieve all target translations of the attended phrase from the dictionary;
             Select the best candidate as the next output segment, using the neural segment decoder as the scoring function;
       else
             Generate the next output segment by using the neural segment decoder;
       end if
      Append the output segment to the output sentence;
end while
Algorithm 1 NP^2MT greedy decoding with dictionaries.
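The control flow of Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the neural model is abstracted into three hypothetical callables (`attend`, `generate_segment`, `score_segment`), the dictionary is a plain mapping from source-phrase tuples to lists of candidate translations, and `max_segments` is an added safety bound.

```python
def greedy_decode_with_dictionary(source, dictionary, vocabulary, attend,
                                  generate_segment, score_segment,
                                  eos="<eos>", max_segments=50):
    """Greedy phrase-by-phrase decoding with an optional dictionary lookup.

    attend(output) -> (start, end): source span with the highest attention.
    generate_segment(output, phrase) -> list of target words from the model.
    score_segment(output, phrase, candidate) -> model score of a candidate.
    All three callables are hypothetical stand-ins for the neural model.
    """
    output = []
    for _ in range(max_segments):
        start, end = attend(output)
        phrase = tuple(source[start:end])
        has_unknown = any(word not in vocabulary for word in phrase)
        candidates = dictionary.get(phrase, [])
        if has_unknown and candidates:
            # Dictionary branch: the model only ranks the listed translations.
            segment = list(max(candidates,
                               key=lambda c: score_segment(output, phrase, c)))
        else:
            # Neural branch: the segment decoder generates the next phrase.
            segment = generate_segment(output, phrase)
        output.extend(segment)
        if eos in segment:
            break
    return output
```

The dictionary is consulted only when both conditions hold (the attended phrase contains an unknown word and has an entry), which mirrors the rule-based trigger described in the text.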

3 Experiments

In this section, we evaluate NP^2MT on several machine translation benchmarks under different settings. We first compare the NP^2MT model with strong baselines on the IWSLT14 German-English and IWSLT15 English-Vietnamese machine translation tasks (Cettolo et al., 2014, 2015). Furthermore, to demonstrate the benefit of using the phrase-level attention mechanism with an external dictionary, we evaluate NP^2MT on open-vocabulary and cross-domain translation tasks. [Footnote 3: We focus on word-based models in this paper, as words are more appropriate for dictionary representation. We leave BPE-style representations (Sennrich et al., 2015) for future work.]

3.1 IWSLT Machine Translation Tasks

In this section, we compare NP^2MT with state-of-the-art machine translation models on the IWSLT14 German-English and IWSLT15 English-Vietnamese machine translation tasks (Cettolo et al., 2014, 2015). The IWSLT14 German-English data (Cettolo et al., 2014) comes from translated TED talks and contains roughly 153K training sentences, 7K development sentences, and 7K test sentences. We use the same preprocessing and dataset splits as Ott et al. (2018). Sentences longer than words are removed from the dataset. The IWSLT15 English-Vietnamese data also comes from translated TED talks and contains roughly 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015). Following the preprocessing steps of Luong et al. (2017) and Huang et al. (2018), we use TED tst2012 (1553 sentences) as the validation set for hyperparameter tuning and TED tst2013 (1268 sentences) as the test set. We report results on the test set.

We use 6-layer BiLSTMs to encode at both the word and segment levels in the source encoder, 6-layer LSTMs as the target encoder, and a 6-layer Transformer as the segment decoder. [Footnote 4: We also experimented with an LSTM segment decoder variant, but its performance was inferior.] The models are trained using Adam (Kingma & Ba, 2014). Similar to the RNMT+ learning rate scheduler (Chen et al., 2018; Ott et al., 2018), we use a three-stage learning rate scheduler, replacing the exponential decay with a linear one for faster convergence. Specifically, the learning rate is quickly warmed up to the maximum, held at the maximum value, and finally decayed linearly to zero. In our experiments, we set the maximum learning rate to , weight decay to , the word embedding dimensionality to , and the dropout rate to . The performances of the baseline models, including BSO (Wiseman & Rush, 2016), Seq2Seq with attention, Actor-Critic (Bahdanau et al., 2016), Luong & Manning (2015) and NPMT (Huang et al., 2018), are taken from Huang et al. (2018). We use the fairseq implementation of the Transformer and set the number of layers in both the encoder and decoder to 6. [Footnote 6: No further improvement is observed by increasing the depth.] We tuned the hyperparameters on the validation set and found that the best-performing hyperparameters, except the maximum learning rate, are the same as for NP^2MT. We replace the default inverse-square-root scheduler (Vaswani et al., 2017) with the proposed learning rate scheduler, which works better empirically, and set the maximum learning rate of the Transformer to .
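The three-stage schedule described above (warm-up, hold, linear decay to zero) can be written as a small function of the training step. This is a minimal sketch: the function name, the linear shape of the warm-up, and the stage lengths `warmup_steps`, `hold_steps` and `decay_steps` are illustrative assumptions, since the text does not give the stage boundaries.

```python
def three_stage_learning_rate(step, max_lr, warmup_steps, hold_steps, decay_steps):
    """Warm up linearly to max_lr, hold it, then decay linearly to zero.

    The stage lengths are hypothetical parameters; decay_steps must be > 0.
    """
    if step < warmup_steps:
        # Stage 1: linear warm-up from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Stage 2: keep the learning rate at its maximum.
        return max_lr
    # Stage 3: linear decay, clamped at zero once the decay window is over.
    decay_progress = (step - warmup_steps - hold_steps) / decay_steps
    return max_lr * max(0.0, 1.0 - decay_progress)


if __name__ == "__main__":
    for step in (0, 5, 10, 29, 35, 40):
        print(step, three_stage_learning_rate(step, 1.0, 10, 20, 10))
```

Unlike an exponential decay, the linear final stage reaches exactly zero after a fixed number of steps, so the total training budget is known in advance.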

The IWSLT14 German-English and IWSLT15 English-Vietnamese test results are shown in Tables 1 and 2. The proposed NP^2MT achieves results comparable to the Transformer (Vaswani et al., 2017) and outperforms the other baseline models on both tasks. Moreover, compared with the NPMT model (Huang et al., 2018), NP^2MT achieves better results with much less training time (24 hours vs. 2 hours on an Nvidia V100 GPU).

| Model | BLEU (Greedy) | BLEU (Beam Search) |
|---|---|---|
| BSO (Wiseman & Rush, 2016) | 23.83 | 25.48 |
| Seq2Seq with attention | 26.17 | 27.61 |
| Actor-Critic (Bahdanau et al., 2016) | 27.49 | 28.53 |
| Transformer (Vaswani et al., 2017) | 31.27 | 32.30 |
| NPMT (Huang et al., 2018) | 28.57 | 29.92 |
| NP^2MT | 30.99 | 31.70 |

Table 1: Performance on IWSLT14 German-English test set.

| Model | BLEU (Greedy) | BLEU (Beam Search) |
|---|---|---|
| Luong & Manning (2015) | - | 23.30 |
| Seq2Seq with attention | 25.50 | 26.10 |
| Transformer (Vaswani et al., 2017) | 29.72 | 30.74 |
| NPMT (Huang et al., 2018) | 26.91 | 27.69 |
| NP^2MT | 29.93 | 30.60 |

Table 2: Performance on IWSLT15 English-Vietnamese test set.

3.2 Translation with Dictionary

To demonstrate the usefulness of an external dictionary, we set up an experiment to show how an external bilingual dictionary can enhance the performance of NP^2MT on open-vocabulary translation tasks, where there are a large number of out-of-vocabulary (OOV) words.

We build the open vocabulary machine translation task by using UNK to replace the infrequent words. By infrequent words, we mean the words that appear fewer times than a predefined threshold in the training data. For the masked words, the model can only see their corresponding texts, where no vector representations are available. Given this setting, the models are supposed to translate the sentences with masked unknown words.
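The masking procedure can be illustrated with a short sketch: count word frequencies in the training data, keep words that reach the threshold, replace everything else with UNK, and measure the resulting OOV rate on held-out text. The function names and the toy data are illustrative assumptions; the sketch only mirrors the definition of infrequent words given above.

```python
from collections import Counter


def build_vocabulary(training_sentences, threshold):
    """Keep words that appear at least `threshold` times in the training data."""
    counts = Counter(word for sentence in training_sentences for word in sentence)
    return {word for word, count in counts.items() if count >= threshold}


def mask_infrequent(sentence, vocabulary, unk="UNK"):
    """Replace every word outside the vocabulary with the UNK token."""
    return [word if word in vocabulary else unk for word in sentence]


def oov_rate(sentences, vocabulary):
    """Fraction of tokens in `sentences` that are not in the vocabulary."""
    total = sum(len(sentence) for sentence in sentences)
    unknown = sum(1 for sentence in sentences
                  for word in sentence if word not in vocabulary)
    return unknown / total if total else 0.0


if __name__ == "__main__":
    train = [["a", "b", "a"], ["a", "c", "b"]]  # toy data, not from the paper
    vocab = build_vocabulary(train, threshold=2)
    print(mask_infrequent(["a", "c", "d"], vocab))
    print(oov_rate([["a", "c", "d", "b"]], vocab))
```

Raising the threshold shrinks the vocabulary and raises the OOV rate, which is the knob varied in the experiments that follow.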

In our experiments, we evaluate different infrequent-word thresholds (3, 5, 10, 20, 50, 100) to show how an external dictionary can improve translation in the open-vocabulary setting. Tables 3 and 4 report the results on the IWSLT14 German-English and IWSLT15 English-Vietnamese datasets. For both datasets, we use Moses (Koehn et al., 2007) to extract an in-domain dictionary from the training data, including and translations. We also try an out-of-domain dictionary on the IWSLT14 German-English task, which is extracted from the WMT14 German-English dataset and includes translations. The goals of these two tasks are different. For the IWSLT15 English-Vietnamese task, the dictionary is extracted from the training data by Moses and is therefore in-domain; the external resource here is the dictionary extraction tool, namely Moses. For the IWSLT14 German-English task, the dictionary itself is an external resource, which can be obtained from any textual corpus in any way, and is out-of-domain. In other words, the difference between these tasks lies in the overlap between the training data and the dictionary. Note that these dictionaries may contain several translations for each word. For comparison, we also report the Transformer with a trivial lookup mechanism that translates the attended unknown word with the given dictionary.

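| Threshold | 3 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|
| Vocab size (German) | 35,479 | 23,327 | 13,679 | 8,055 | 3,831 | 2,152 |
| Vocab size (English) | 24,743 | 17,807 | 11,559 | 7,359 | 3,807 | 2,224 |
| Test data OOV rate (German) | 3.8% | 4.7% | 6.2% | 8.1% | 11.6% | 14.8% |
| Test data OOV rate (English) | 1.5% | 2.1% | 3.0% | 4.4% | 7.2% | 10.2% |
| Transformer | 31.27 | 30.92 | 30.35 | 28.88 | 26.27 | 23.73 |
| Transformer + in-domain dictionary | 31.27 | 30.97 | 30.65 | 29.34 | 27.78 | 26.01 |
| Transformer + out-of-domain dictionary | 31.67 | 31.40 | 31.04 | 29.43 | 27.42 | 25.20 |
| NP^2MT | 30.99 | 30.92 | 29.86 | 28.29 | 25.81 | 23.08 |
| NP^2MT + in-domain dictionary | 30.99 | 31.01 | 30.33 | 29.52 | 28.02 | 26.61 |
| NP^2MT + out-of-domain dictionary | 31.48 | 31.75 | 31.06 | 30.04 | 27.86 | 25.85 |

Table 3: IWSLT14 German-English translation (BLEU) with in-domain and out-of-domain dictionaries.

| Threshold | 3 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|
| Vocab size (English) | 24,415 | 17,191 | 10,919 | 6,799 | 3,431 | 1,943 |
| Vocab size (Vietnamese) | 10,663 | 7,711 | 5,343 | 3,863 | 2,575 | 1,943 |
| Test data OOV rate (English) | 2.2% | 2.8% | 3.9% | 5.6% | 8.9% | 12.1% |
| Test data OOV rate (Vietnamese) | 0.8% | 1.0% | 1.4% | 2.0% | 3.3% | 4.6% |
| Transformer | 29.72 | 29.71 | 28.97 | 28.40 | 25.68 | 23.00 |
| Transformer + in-domain dictionary | 29.72 | 29.72 | 29.01 | 28.71 | 26.92 | 25.02 |
| NP^2MT | 29.93 | 29.78 | 29.27 | 27.87 | 25.89 | 23.63 |
| NP^2MT + in-domain dictionary | 29.94 | 29.79 | 29.36 | 28.38 | 27.65 | 26.31 |

Table 4: IWSLT15 English-Vietnamese translation (BLEU) with an in-domain dictionary.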

In Fig. 3 and Fig. 4, we analyze the lookup phrase ratio and the BLEU score improvement on both datasets, where the lookup phrase ratio is the percentage of target words that is translated using the dictionary. We observe the lookup phrase ratio and BLEU improvement are increasing when the source OOV rate is higher. This also suggests the need of using the dictionary in a high OOV scenario. In Fig. 3, when there are OOV rate, using an in-domain dictionary can enhance the performance of the NPMT model by and BLEU scores on IWSLT14 German-English and IWSLT15 English-Vietnamese task, whereas only and BLEU score improvement is achieved with the transformer model on these task respectively. Fig. 4 shows the difference between a small in-domain dictionary and a large out-of-domain one for IWSLT14 DE-EN. When the OOV rate is low, more improvements are gained by larger because it contained more translations. But when the OOV rate becomes higher, show its effectiveness by providing accurate in-domain translation candidates.
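The lookup phrase ratio defined above is simple to compute once each output segment is tagged with its origin. The sketch below assumes a hypothetical representation in which a decoded sentence is a list of `(words, from_dictionary)` pairs; the example segments are illustrative and not taken from the paper's outputs.

```python
def lookup_phrase_ratio(segments):
    """Fraction of target words that were produced by dictionary lookup.

    `segments` is a list of (words, from_dictionary) pairs, a hypothetical
    representation of a decoded sentence.
    """
    total = sum(len(words) for words, _ in segments)
    looked_up = sum(len(words) for words, from_dictionary in segments
                    if from_dictionary)
    return looked_up / total if total else 0.0


if __name__ == "__main__":
    decoded = [(["you", "see"], False),
               (["the", "colosseum"], True),
               (["."], False)]
    print(lookup_phrase_ratio(decoded))
```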

Figure 3: Comparison on BLEU score improvement and lookup phrase ratio under different source language OOV rates both under in-domain dictionaries.
Figure 4: Comparison on BLEU score improvement and lookup phrase ratio for IWSLT14 DE-EN under in-domain and out-of-domain dictionaries.

Table 5 shows the translation examples by NP^2MT with unknown word frequency threshold . At the word level, NP^2MT discovers some meaningful phrases such as “this is”, “the machine” and “read by”. Unseen words, such as “genf”, “bescheidene” and “zeitschriften”, can be attended to and decoded with the dictionary.

Greedy decoding
Target ground truth this is the machine below geneva .
Greedy decoding
Target ground truth this is a modest little app .
Greedy decoding
Target ground truth our magazines are read by millions .
Table 5: DE-EN translation examples, where “[]” represents the phrase boundary, “()” represents the phrase is looked up by the external dictionary, and “    ” represents the frequency of the word is under the vocabulary threshold and hence the word is replaced by the “UNK” token in the network vocabulary. The subscript represents the corresponding phrases found by NPMT.

3.3 Cross-Domain Translation

In this section, Transformer and the proposed model are evaluated on a cross-domain machine translation task. The models are trained and validated on the IWSLT14 German-English dataset, which is originally from TED talks, and tested on the WMT14 German-English dataset, which is extracted from the news domain.

In this task, we use the lowest threshold for the training data to make the observed vocabulary as complete as possible. Note that although this is the most complete vocabulary for the training domain, the OOV rate is still high in the out-of-domain test data. The dictionary, which is extracted from the test domain, is used to handle such OOV words at the word/phrase level. [Footnote 7: In practice, it would be better to use expert-created dictionaries for cross-domain tasks.]

Table 6 shows the experimental results with and without the dictionary for the Transformer and NP^2MT models. While the dictionary improves both models, NP^2MT with the dictionary used it more often and achieved a higher BLEU gain (+1.25) than the Transformer did (+0.91).

| Model | BLEU |
|---|---|
| Transformer | 14.69 |
| Transformer + dictionary | 15.60 |
| NP^2MT | 14.86 |
| NP^2MT + dictionary | 16.11 |

Table 6: Results on cross-domain translation by training the model on IWSLT14 and testing on WMT14, both with German-English data. In the test data from WMT14, OOV rates are % for German and % for English.

4 Related Work

Neural phrase-based machine translation was first introduced by Huang et al. (2018). The model builds upon Sleep-WAke Networks (SWAN), a segmentation-based sequence modeling technique described in Wang et al. (2017a), and mitigates SWAN's monotonic alignment assumption by introducing a new layer that performs (soft) local reordering on input sequences. Another related model that shares the monotonic alignment assumption (which is inappropriate for many language pairs) is the segment-to-segment neural transduction model (SSNT) (Yu et al., 2016b, a). Our model relies on the phrasal attention mechanism rather than marginalizing out the monotonic alignments using dynamic programming. There have been several works that propose different ways to incorporate phrases into attention-based neural machine translation.

Wang et al. (2017b), Tang et al. (2016) and Zhao et al. (2018) incorporate the phrase table as memory in the neural machine translation architecture. Hasler et al. (2018) integrate a user-provided phrase table of terminologies into an NMT system by organizing the beam search into multiple stacks corresponding to subsets of satisfied constraints as defined by FSA states. Dahlmann et al. (2017) divide the beams into a word beam and a phrase beam of fixed size. He et al. (2016) use statistical machine translation (SMT) as features in the NMT model under the log-linear framework. Yang et al. (2018) enhance self-attention networks to capture useful phrase patterns by imposing learned Gaussian biases. Nguyen & Joty (2018) also incorporate phrase-level attention into the Transformer by encoding a fixed number of n-gram types (e.g. unigram, bigram). However, Nguyen & Joty (2018) focus only on phrase-level attention on the source side, whereas our model focuses on phrase-to-phrase translation by attending to a phrase on the source side and generating a phrase on the target side. Our model further integrates an external dictionary during decoding, which is important in open-vocabulary and cross-domain translation settings.

5 Conclusions

We proposed Neural Phrase-to-Phrase Machine Translation (NP^2MT), which uses a phrase-level attention mechanism to enable phrase-to-phrase translation in a neural machine translation system. This phrase-level attention lets us incorporate an external dictionary during decoding. We show that using the external phrase dictionary at decoding time improves machine translation results in open-vocabulary and cross-domain settings.


We thank Dani Yogatama and Chris Dyer for helpful comments and discussions on this project.


Appendix A Beam Search

The beam search algorithm is mostly based on Wang et al. (2017a). Here we use the word-level beam search algorithm shown in Algorithm 2.

Data: a source sentence: , beam size
Result: a target sentence:
compute the hidden representation for the source sentence;
prev_sent end-of-sentence token eos;
prev_segment end-of-segment token ;
init_score ;
// Z is the appendable candidates;
// Y is the finished candidates;
while Z does not reach maximum length do
       // is the next appendable candidates;
       for prev_sent, prev_segment, score in  do
             Generate next_tokens with probability ;
             for next_token in next_tokens candidate set do
                   if next_token is eos then
                         prev_segment prev_segment + next_token;
                         prev_sent prev_sent + prev_segment;
                         score score + ;
                         Append (prev_sent, score) to ;
                   end if
                  if next_token is  then
                         prev_sent prev_sent + prev_segment;
                         score score + ;
                         Append (prev_sent, , score) to ;
                   end if
                  if next_token is not or eos then
                         prev_segment prev_segment + next_token;
                         prev_sent prev_sent + prev_segment;
                         score score + ;
                         Append (prev_sent, prev_segment, score) to ;
                   end if
             end for
       end for
      Keep only top candidates in ;
end while
the best candidates with highest score in
Algorithm 2 NP^2MT Beam Search.