Plan, Attend, Generate: Character-level Neural Machine Translation with Planning in the Decoder

06/13/2017 ∙ by Caglar Gulcehre, et al.

We investigate the integration of a planning mechanism into an encoder-decoder architecture with an explicit alignment for character-level machine translation. We develop a model that plans ahead when it computes alignments between the source and target sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model. Our proposed model is end-to-end trainable with fully differentiable operations. We show that it outperforms a strong baseline on three character-level translation tasks from the WMT'15 corpus. Our analysis demonstrates that our model can compute qualitatively intuitive alignments and achieves superior performance with fewer parameters.




1 Introduction

Character-level neural machine translation (NMT) is an attractive research problem (Lee et al., 2016; Chung et al., 2016; Luong and Manning, 2016) because it addresses important issues encountered in word-level NMT. Word-level NMT systems can suffer from problems with rare words (Gulcehre et al., 2016) or data sparsity, and the existence of compound words without explicit segmentation in certain language pairs can make learning alignments and translations more difficult. Character-level neural machine translation mitigates these issues.

In this work we propose to augment the encoder-decoder model for character-level NMT by integrating a planning mechanism. Specifically, we develop a model that uses planning to improve the alignment between input and output sequences. Our model’s encoder is a recurrent neural network (RNN) that reads the source (a sequence of byte pairs representing text in some language) and encodes it as a sequence of vector representations; the decoder is a second RNN that generates the target translation character-by-character in the target language. The decoder uses an attention mechanism to align its internal state to vectors in the source encoding. It creates an explicit plan of source-target alignments to use at future time-steps based on its current observation and a summary of its past actions. At each time-step it may follow or modify this plan. This enables the model to plan ahead rather than attending to what is relevant primarily at the current generation step. More concretely, we augment the decoder’s internal state with (i) an alignment plan matrix and (ii) a commitment plan vector. The alignment plan matrix is a template of alignments that the model intends to follow at future time-steps, i.e., a sequence of probability distributions over input tokens. The commitment plan vector governs whether to follow the alignment plan at the current step or to recompute it, and thus models discrete decisions. This planning mechanism is inspired by the strategic attentive reader and writer (STRAW) of Vezhnevets et al. (2016).

Our work is motivated by the intuition that, although natural language is output step-by-step because of constraints on the output process, it is not necessarily conceived and ordered according to only local, step-by-step interactions. Sentences are not conceived one word at a time. Planning, that is, choosing some goal along with candidate macro-actions to arrive at it, is one way to induce coherence in sequential outputs like language. Learning to generate long coherent sequences, or how to form alignments over long input contexts, is difficult for existing models. NMT performance of encoder-decoder models with attention deteriorates as sequence length increases (Cho et al., 2014; Sutskever et al., 2014), and this effect can be more pronounced in character-level NMT because character sequences are longer than word sequences. A planning mechanism can make the decoder’s search for alignments more tractable and more scalable.

We evaluate our proposed model and report results on character-level translation tasks from WMT’15 for English to German, English to Finnish, and English to Czech language pairs. On almost all pairs we observe improvements over a baseline that represents the state of the art in neural character-level translation. In our NMT experiments, our model outperforms the baseline despite using significantly fewer parameters and converges faster in training.

2 Planning for Character-level Neural Machine Translation

We now describe how to integrate a planning mechanism into a sequence-to-sequence architecture with attention (Bahdanau et al., 2015). Our model first creates a plan, then computes a soft alignment based on the plan, and generates at each time-step in the decoder. We refer to our model as PAG (Plan-Attend-Generate).

2.1 Notation and Encoder

As input our model receives a sequence of tokens, $X = (x_1, \ldots, x_{|X|})$, where $|X|$ denotes the length of $X$. It processes these with the encoder, a bidirectional RNN. At each input position $i$ we obtain annotation vector $h_i$ by concatenating the forward and backward encoder states, $h_i = [h_i^{\rightarrow}; h_i^{\leftarrow}]$, where $h_i^{\rightarrow}$ denotes the hidden state of the encoder’s forward RNN and $h_i^{\leftarrow}$ denotes the hidden state of the encoder’s backward RNN.

Through the decoder the model predicts a sequence of output tokens, $Y = (y_1, \ldots, y_{|Y|})$. We denote by $s_t$ the hidden state of the decoder RNN generating the target output token at time-step $t$.

2.2 Alignment and Decoder

Our goal is a mechanism that plans which parts of the input sequence to focus on for the next $k$ time-steps of decoding. For this purpose, our model computes an alignment plan matrix $A_t \in \mathbb{R}^{k \times |X|}$ and commitment plan vector $c_t \in \mathbb{R}^{k}$ at each time-step. Matrix $A_t$ stores the alignments for the current and the next $k-1$ timesteps; it is conditioned on the current input, i.e. the token predicted at the previous time-step $y_{t-1}$, and the current context $\psi_t$, which is computed from the input annotations $h_i$. The recurrent decoder function, $f_{\text{dec-rnn}}(\cdot)$, receives $s_{t-1}$, $y_{t-1}$, $\psi_t$ as inputs and computes the hidden state vector

$$s_t = f_{\text{dec-rnn}}(s_{t-1}, y_{t-1}, \psi_t).$$


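Context $\psi_t$ is obtained by a weighted sum of the encoder annotations,

$$\psi_t = \sum_{i=1}^{|X|} \alpha_{ti} h_i.$$

The alignment vector $\alpha_t = \mathrm{softmax}(A_t[0])$ is a function of the first row of the alignment matrix. At each time-step, we compute a candidate alignment-plan matrix $\bar{A}_t$ whose entry at the $i$-th row and $j$-th source position is

$$\bar{A}_t[i, j] = f_{\text{align}}(s_{t-1}, h_j, \beta_t^i, y_{t-1}),$$

where $f_{\text{align}}$ is an MLP and $\beta_t^i$ denotes a summary of the alignment matrix’s $i$-th row at time $t-1$. The summary is computed using an MLP, $f_r(\cdot)$, operating row-wise on $A_{t-1}$: $\beta_t^i = f_r(A_{t-1}[i])$.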
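The relation between the plan's first row, the alignment vector, and the context can be illustrated with a small sketch. This is not the authors' implementation: the function names (`softmax`, `context_from_plan`) and the toy values are assumptions made for illustration, and the plan is represented as a plain list of rows of unnormalised scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_from_plan(plan, annotations):
    """Return (alignment, context) for the current step.

    plan:        k rows of |X| unnormalised scores (hypothetical layout)
    annotations: |X| annotation vectors h_i, each a list of floats
    """
    alpha = softmax(plan[0])  # only the first row is used at this step
    dim = len(annotations[0])
    # Context is the alignment-weighted sum of the annotation vectors.
    psi = [sum(a * h[d] for a, h in zip(alpha, annotations)) for d in range(dim)]
    return alpha, psi

# Toy example: k = 2 planned steps, |X| = 3 source positions, 2-dim annotations.
plan = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, psi = context_from_plan(plan, annotations)
```

With an all-zero first row the alignment is uniform, so the context is simply the mean of the annotation vectors; the remaining rows are untouched until the plan is shifted.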

The commitment plan vector $c_t$ governs whether to follow the existing alignment plan, by shifting it forward from $t-1$, or to recompute it. Thus, $c_t$ represents a discrete decision. For the model to operate discretely, we use the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) in conjunction with the straight-through estimator (Bengio et al., 2013) to backpropagate through $c_t$ (see footnote 1). The model further learns the temperature for the Gumbel-Softmax as proposed in Gulcehre et al. (2017). Both the commitment vector and the action plan matrix are initialized with ones; this initialization is not modified through training.

Footnote 1: We also experimented with training using REINFORCE (Williams, 1992) but found that Gumbel-Softmax led to better performance.
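As a rough illustration of the sampling step, the sketch below draws a relaxed one-hot sample with Gumbel noise and then discretizes it. It covers the forward pass only: in an automatic-differentiation framework, the straight-through estimator would use the discretized vector in the forward pass and the gradient of the relaxed sample in the backward pass. The function names, the fixed temperature, and the toy logits are assumptions for illustration (the model described here learns its temperature).

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)     # keep log(u) finite
        gumbel = -math.log(-math.log(u)) # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def discretize(soft):
    """Set the largest element to 1 and all other elements to 0."""
    best = max(range(len(soft)), key=lambda i: soft[i])
    return [1.0 if i == best else 0.0 for i in range(len(soft))]

rng = random.Random(0)  # seeded only to make the sketch repeatable
soft_c = gumbel_softmax_sample([2.0, 0.5, 0.1], temperature=0.5, rng=rng)
hard_c = discretize(soft_c)
```

Lower temperatures push `soft_c` closer to a one-hot vector, which narrows the gap between the relaxed sample used for gradients and the discrete decision used by the decoder.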

Figure 1: Our planning mechanism in a sequence-to-sequence model that learns to plan and execute alignments. Distinct from a standard sequence-to-sequence model with attention, rather than using a simple MLP to predict alignments our model makes a plan of future alignments using its alignment-plan matrix and decides when to follow the plan by learning a separate commitment vector. We illustrate the model for a decoder with two layers, $s'_t$ for the first layer and $s_t$ for the second layer of the decoder. The planning mechanism is conditioned on the first layer of the decoder ($s'_t$).

Alignment-plan update

Our decoder updates its alignment plan as governed by the commitment plan. We denote by $g_t$ the first element of the discretized commitment plan $\bar{c}_t$. In more detail, $g_t = \bar{c}_t[0]$, where the discretized commitment plan is obtained by setting $c_t$’s largest element to 1 and all other elements to 0. Thus, $g_t$ is a binary indicator variable; we refer to it as the commitment switch. When $g_t = 0$, the decoder simply advances the time index by shifting the action plan matrix $A_{t-1}$ forward via the shift function $\rho(\cdot)$. When $g_t = 1$, the controller reads the action-plan matrix to produce the summary of the plan, $\beta_t^i$. We then compute the updated alignment plan by interpolating the previous alignment plan matrix $A_{t-1}$ with the candidate alignment plan matrix $\bar{A}_t$. The mixing ratio is determined by a learned update gate $u_t$, whose elements $u_{ti}$ correspond to tokens in the input sequence and are computed by an MLP with sigmoid activation, $f_{\text{up}}(\cdot)$:

$$u_{ti} = f_{\text{up}}(h_i, s_{t-1}),$$
$$A_t[:, i] = (1 - u_{ti}) \, A_{t-1}[:, i] + u_{ti} \, \bar{A}_t[:, i].$$

To reiterate, the model only updates its alignment plan when the current commitment switch is active. Otherwise it uses the alignments planned and committed at previous time-steps.
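The two branches of the update can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the names `shift_plan` and `update_plan` are hypothetical, the gate is taken to hold one value per source position, and filling the freed last row with zeros after a shift is an assumption, since the text does not state what the shift function appends to the matrix.

```python
def shift_plan(plan, fill=0.0):
    """Shift the plan forward: drop the first row, append a filler row.

    The zero filler is an assumption made for this sketch.
    """
    width = len(plan[0])
    return [row[:] for row in plan[1:]] + [[fill] * width]

def update_plan(prev_plan, candidate, gate, g_t):
    """Follow the plan (g_t == 0) or interpolate towards the candidate (g_t == 1).

    gate[i] in [0, 1] is the update gate for source position i.
    """
    if g_t == 0:
        return shift_plan(prev_plan)
    return [
        [(1.0 - gate[i]) * old + gate[i] * new
         for i, (old, new) in enumerate(zip(old_row, new_row))]
        for old_row, new_row in zip(prev_plan, candidate)
    ]

# Toy example with k = 2 planned steps and |X| = 2 source positions.
prev_plan = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[0.0, 2.0], [2.0, 0.0]]
gate = [0.5, 1.0]
followed = update_plan(prev_plan, candidate, gate, g_t=0)
recomputed = update_plan(prev_plan, candidate, gate, g_t=1)
```

In the toy example, the column for the second source position is fully overwritten because its gate is 1, while the first column is an even mix of the old and candidate plans.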

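  for $j \in \{1, \ldots, |X|\}$ do
     for $t \in \{1, \ldots, |Y|\}$ do
         if $g_t = 1$ then
             $c_t = \text{gumbel\_softmax}(f_c(s_{t-1}))$ {Recompute commitment plan}
             $\beta_t^i = f_r(A_{t-1}[i])$ {Read alignment plan}
             $\bar{A}_t[i, j] = f_{\text{align}}(s_{t-1}, h_j, \beta_t^i, y_{t-1})$ {Compute candidate alignment plan}
             $u_{tj} = f_{\text{up}}(h_j, s_{t-1})$ {Compute update gate}
             $A_t[:, j] = (1 - u_{tj}) \, A_{t-1}[:, j] + u_{tj} \, \bar{A}_t[:, j]$ {Update alignment plan}
         else
             $A_t = \rho(A_{t-1})$ {Shift alignment plan}
             $c_t = \rho(c_{t-1})$ {Shift commitment plan}
         end if
         Compute the alignment as $\alpha_t = \mathrm{softmax}(A_t[0])$
     end for
  end for
Algorithm 1 Pseudocode for updating the alignment plan and commitment vector.

Commitment-plan update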

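The commitment plan also updates when $g_t$ becomes 1. If $g_t$ is 0, the shift function $\rho(\cdot)$ shifts the commitment vector forward and appends a $0$-element. If $g_t$ is 1, the model recomputes $c_t$ using a single layer MLP ($f_c$) followed by a Gumbel-Softmax, and $\bar{c}_t$ is recomputed by discretizing $c_t$ as a one-hot vector:

$$c_t = \text{gumbel\_softmax}(f_c(s_{t-1})),$$
$$\bar{c}_t = \text{one\_hot}(c_t).$$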


We provide pseudocode for the algorithm to compute the commitment plan vector and the action plan matrix in Algorithm 1. An overview of the model is depicted in Figure 1.
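To make the role of the commitment switch concrete, the sketch below steps a discretized commitment vector through three decoding steps. It is a simplified illustration, not the authors' implementation: the helper names are hypothetical, the appended element is assumed to be 0, and a fixed list `new_soft` stands in for the Gumbel-Softmax output that the model would compute when the switch activates.

```python
def shift_commitment(commitment):
    """Shift the commitment vector forward and append a 0 element (assumed filler)."""
    return commitment[1:] + [0.0]

def one_hot_argmax(values):
    """Discretize: 1 at the largest element, 0 elsewhere."""
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(len(values))]

def update_commitment(prev_hard, new_soft, g_t):
    """Shift when the switch is off, recompute and discretize when it is on."""
    if g_t == 0:
        return shift_commitment(prev_hard)
    return one_hot_argmax(new_soft)

# A plan that commits for two steps and recomputes at the third.
hard = [0.0, 0.0, 1.0]
new_soft = [0.1, 0.7, 0.2]  # stand-in for a freshly sampled relaxed vector
switches = []
for _ in range(3):
    g_t = int(hard[0])      # commitment switch: first element of the plan
    switches.append(g_t)
    hard = update_commitment(hard, new_soft, g_t)
```

Each shift moves the single 1 one position closer to the front, so the position of the 1 in a freshly discretized vector determines how many steps the decoder follows its plan before recomputing.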

2.2.1 Alignment Repeat

In order to reduce the model’s computational cost, we also propose an alternative to computing the candidate alignment-plan matrix at every step. Specifically, we propose a model variant that reuses the alignment from the previous time-step until the commitment switch activates, at which time the model computes a new alignment. We call this variant repeat, plan, attend, and generate (rPAG). rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition can reduce the computational complexity of the alignment mechanism drastically; it also eliminates the need for an explicit alignment-plan matrix, which reduces the model’s memory consumption as well. We provide pseudocode for rPAG in Algorithm 2.

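  for $j \in \{1, \ldots, |X|\}$ do
     for $t \in \{1, \ldots, |Y|\}$ do
         if $g_t = 1$ then
             $c_t = \text{gumbel\_softmax}(f_c(s_{t-1}))$ {Recompute the commitment vector}
             $\alpha_t = \mathrm{softmax}(f_{\text{align}}(s_{t-1}, h_j, y_{t-1}))$ {Compute a new alignment}
         else
             $c_t = \rho(c_{t-1})$ {Shift the commitment vector $c_{t-1}$}
             $\alpha_t = \alpha_{t-1}$ {Reuse the old alignment}
         end if
     end for
  end for
Algorithm 2 Pseudocode for updating the repeat alignment and commitment vector.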
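The saving from repetition can be seen in a minimal sketch: the alignment function is only called when the commitment switch is 1. The names `rpag_decode` and `toy_alignment`, the hand-written switch sequence, and the choice to compute an alignment at the very first step regardless of the switch are assumptions for illustration; in the model the switches come from the learned commitment plan.

```python
def rpag_decode(switches, compute_alignment):
    """Reuse the previous alignment until the commitment switch activates."""
    alignments = []
    current = None
    for t, g_t in enumerate(switches):
        if g_t == 1 or current is None:
            current = compute_alignment(t)  # expensive attention step
        alignments.append(current)          # otherwise repeat the old alignment
    return alignments

calls = []  # records the time-steps at which attention was actually computed

def toy_alignment(t):
    """Stand-in for the attention MLP: attends to position 0, later to position 1."""
    calls.append(t)
    return [1.0, 0.0] if t < 2 else [0.0, 1.0]

out = rpag_decode([1, 0, 0, 1, 0], toy_alignment)
```

Here five decoding steps require only two alignment computations, and no alignment-plan matrix has to be stored.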

2.3 Training

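We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over output tokens,

$$p(y_t \mid y_{<t}, X) \propto y_t^{\top} \exp\left(W_o f_o(s_t, y_{t-1}, \psi_t)\right),$$

where $W_o$ is a matrix of learned parameters and we have omitted the bias for brevity. Function $f_o$ is an MLP with $\tanh$ activation.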

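The full model, including both the encoder and decoder, is jointly trained to minimize the (conditional) negative log-likelihood

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\left(Y^{(n)} \mid X^{(n)}\right),$$

where the training corpus is a set of $N$ pairs $(X^{(n)}, Y^{(n)})$ and $\theta$ denotes the set of all tunable parameters. As noted by Vezhnevets et al. (2016), the proposed model can learn to recompute very often, which decreases the utility of planning. In order to avoid this behavior, we introduce a loss that penalizes the model for committing too often,

$$\mathcal{L}_{\text{com}} = \lambda_{\text{com}} \sum_{t=1}^{|Y|} \sum_{i=0}^{k-1} \left\| \frac{1}{k} - c_{ti} \right\|_2^2,$$

where $\lambda_{\text{com}}$ is the commitment hyperparameter and $k$ is the timescale over which plans operate.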
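As a worked illustration of such a penalty, the sketch below assumes it takes the form of a squared distance between each commitment vector and the uniform vector with entries 1/k, scaled by the commitment hyperparameter. The function name `commitment_loss`, the argument `lam`, and the toy inputs are assumptions for illustration, not the authors' code.

```python
def commitment_loss(commitments, lam):
    """lam * sum over steps t and entries i of (1/k - c_ti)^2 (assumed form)."""
    total = 0.0
    for c_t in commitments:
        k = len(c_t)
        total += sum((1.0 / k - c) ** 2 for c in c_t)
    return lam * total

# Three decoding steps with plans of length k = 4.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
peaked = [[1.0, 0.0, 0.0, 0.0] for _ in range(3)]  # recompute at every step

loss_uniform = commitment_loss(uniform, lam=0.1)
loss_peaked = commitment_loss(peaked, lam=0.1)
```

Under this assumed form, a commitment vector that always places its mass on the first element (recomputing at every step) contributes 0.75 per step, while a uniform vector contributes nothing, so the penalty discourages plans that collapse onto immediate recomputation.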

Figure 2: Learning curves for different models on WMT’15 for EnDe. Models with the planning mechanism converge faster than our baseline (which has larger capacity).




Figure 3: We visualize the alignments learned by PAG in (a), rPAG in (b), and our baseline model with a 2-layer GRU decoder using $h^2_t$ for the attention in (c). As depicted, the alignments learned by PAG and rPAG are smoother than those of the baseline. The baseline tends to put too much attention on the last token of the sequence, defaulting to this empty location in alternation with more relevant locations. Our model, however, places higher weight on the last token usually when no other good alignments exist. We observe that rPAG tends to generate less monotonic alignments in general.

3 Experiments

In our NMT experiments we use byte pair encoding (BPE) (Sennrich et al., 2015) for the source sequence and character representation for the target, the same setup described in Chung et al. (2016). We also use the same preprocessing as in that work (see footnote 2).

Footnote 2: Our implementation is based on the code available at

We test our planning models against a baseline on the WMT’15 tasks for English to German (EnDe), English to Czech (EnCs), and English to Finnish (EnFi) language pairs. We present the experimental results in Table 1.

As a baseline we use the biscale GRU model of Chung et al. (2016), with the attention mechanisms in both the baseline and (r)PAG conditioned on both layers of the encoder’s biscale GRU ($h^1_t$ and $h^2_t$; see Chung et al. (2016) for more detail). Our implementation reproduces the results in the original paper to within a small margin.

Table 1 shows that our planning mechanism generally improves translation performance over the baseline. It does this with fewer updates and fewer parameters. We trained (r)PAG for 350K updates on the training set, while the baseline was trained for 680K updates. We used 600 units in (r)PAG’s encoder and decoder, while the baseline used 512 in the encoder and 1024 units in the decoder. In total our model has about 4M fewer parameters than the baseline. We tested all models with a beam size of 15.

As can be seen from Table 1, layer normalization (Ba et al., 2016) improves the performance of the PAG model significantly. However, according to our results on EnDe, layer norm affects the performance of rPAG only marginally. Thus, we decided not to train rPAG with layer norm on other language pairs.

| Pair | Model | Layer Norm | Dev | Test 2014 | Test 2015 |
|---|---|---|---|---|---|
| EnDe | Baseline | no | 21.57 | 21.33 | 23.45 |
| EnDe | Baseline† | no | 21.4 | 21.16 | 22.1 |
| EnDe | PAG | no | 21.52 | 21.35 | 22.21 |
| EnDe | PAG | yes | 22.12 | 21.93 | 22.83 |
| EnDe | rPAG | no | 21.81 | 21.71 | 22.45 |
| EnDe | rPAG | yes | 21.67 | 21.81 | 22.73 |
| EnCs | Baseline | no | 17.68 | 19.27 | 16.98 |
| EnCs | PAG | no | 17.44 | 18.72 | 16.99 |
| EnCs | PAG | yes | 18.78 | 20.9 | 18.59 |
| EnCs | rPAG | no | 17.83 | 19.54 | 17.79 |
| EnFi | Baseline | no | 11.19 | - | 10.93 |
| EnFi | PAG | no | 11.51 | - | 11.13 |
| EnFi | PAG | yes | 12.67 | - | 11.84 |
| EnFi | rPAG | no | 11.50 | - | 10.59 |

Table 1: The results of different models on the WMT’15 tasks for the English to German, English to Czech and English to Finnish language pairs. We report BLEU scores of each model computed via the multi-bleu.perl script. The best score of each model for each language pair appears in bold-face. We use newstest2013 as our development set, newstest2014 as our "Test 2014" and newstest2015 as our "Test 2015" set. † denotes the results of the baseline that we trained using the hyperparameters reported in Chung et al. (2016) and the code provided with that paper. For our baseline, we only report the median result, and do not have multiple runs of our models.

In Figure 3, we show qualitatively that our model constructs smoother alignments. In contrast to (r)PAG, we see that the baseline decoder aligns the first few characters of each word that it generates to a byte in the source sequence; for the remaining characters it places the largest alignment weight on the final, empty token of the source sequence. This is because the baseline becomes confident of which word to generate after the first few characters, and generates the remainder of the word mainly by relying on language-model predictions. As illustrated by the learning curves in Figure 2, we observe further that (r)PAG converges faster with the help of its improved alignments.

4 Conclusions and Future Work

In this work, we addressed a fundamental issue in neural generation of long sequences by integrating planning into the alignment mechanism of sequence-to-sequence architectures for machine translation. We proposed two different planning mechanisms: PAG, which constructs explicit plans in the form of stored matrices, and rPAG, which plans implicitly and is computationally cheaper. The (r)PAG approach empirically improves alignments over long input sequences. In machine translation experiments, models with a planning mechanism outperform a state-of-the-art baseline on almost all language pairs while using fewer parameters. As future work, we plan to test our planning mechanism at the outputs of the model and on other sequence-to-sequence tasks.


Appendix A Qualitative Translations from both Models

In Table 2, we present example translations from our model and the baseline along with the ground-truth.

| | Groundtruth | Our Model (PAG + Biscale) | Baseline (Biscale) |
|---|---|---|---|
| 1 | Eine republikanische Strategie , um der Wiederwahl von Obama entgegenzutreten | Eine republikanische Strategie gegen die Wiederwahl von Obama | Eine republikanische Strategie zur Bekämpfung der Wahlen von Obama |
| 2 | Die Führungskräfte der Republikaner rechtfertigen ihre Politik mit der Notwendigkeit , den Wahlbetrug zu bekämpfen . | Republikanische Führungspersönlichkeiten haben ihre Politik durch die Notwendigkeit gerechtfertigt , Wahlbetrug zu bekämpfen . | Die politischen Führer der Republikaner haben ihre Politik durch die Notwendigkeit der Bekämpfung des Wahlbetrugs gerechtfertigt . |
| 3 | Der Generalanwalt der USA hat eingegriffen , um die umstrittensten Gesetze auszusetzen . | Die Generalstaatsanwälte der Vereinigten Staaten intervenieren , um die umstrittensten Gesetze auszusetzen . | Der Generalstaatsanwalt der Vereinigten Staaten hat dazu gebracht , die umstrittensten Gesetze auszusetzen . |
| 4 | Sie konnten die Schäden teilweise begrenzen | Sie konnten die Schaden teilweise begrenzen | Sie konnten den Schaden teilweise begrenzen . |
| 5 | Darüber hinaus haben Sie das Recht von Einzelpersonen und Gruppen beschränkt , jenen Wählern Hilfestellung zu leisten , die sich registrieren möchten . | Darüber hinaus begrenzten sie das Recht des Einzelnen und der Gruppen , den Wählern Unterstützung zu leisten , die sich registrieren möchten . | Darüber hinaus unterstreicht Herr Beaulieu die Bedeutung der Diskussion Ihrer Bedenken und Ihrer Familiengeschichte mit Ihrem Arzt . |

Table 2: Randomly chosen example translations from the development set (see footnote 3).

Footnote 3: These examples are randomly chosen from the first 100 examples of the development set. None of the authors of this paper can speak or understand German.