Character-level neural machine translation (NMT) is an attractive research problem (Lee et al., 2016; Chung et al., 2016; Luong and Manning, 2016) because it addresses important issues encountered in word-level NMT. Word-level NMT systems can suffer from problems with rare words (Gulcehre et al., 2016) or data sparsity, and the existence of compound words without explicit segmentation in certain language pairs can make learning alignments and translations more difficult. Character-level neural machine translation mitigates these issues.
In this work we propose to augment the encoder-decoder model for character-level NMT by integrating a planning mechanism. Specifically, we develop a model that uses planning to improve the alignment between input and output sequences. Our model’s encoder is a recurrent neural network (RNN) that reads the source (a sequence of byte pairs representing text in some language) and encodes it as a sequence of vector representations; the decoder is a second RNN that generates the target translation character-by-character in the target language. The decoder uses an attention mechanism to align its internal state to vectors in the source encoding. It creates an explicit plan of source-target alignments to use at future time-steps based on its current observation and a summary of its past actions. At each time-step it may follow or modify this plan. This enables the model to plan ahead rather than attending to what is relevant primarily at the current generation step.
More concretely, we augment the decoder’s internal state with (i) an alignment plan matrix and (ii) a commitment plan vector. The alignment plan matrix is a template of alignments that the model intends to follow at future time-steps, i.e., a sequence of probability distributions over input tokens. The commitment plan vector governs whether to follow the alignment plan at the current step or to recompute it, and thus models discrete decisions. This planning mechanism is inspired by the strategic attentive reader and writer (STRAW) of Vezhnevets et al. (2016).
Our work is motivated by the intuition that, although natural language is output step-by-step because of constraints on the output process, it is not necessarily conceived and ordered according to only local, step-by-step interactions. Sentences are not conceived one word at a time. Planning, that is, choosing some goal along with candidate macro-actions to arrive at it, is one way to induce coherence in sequential outputs like language. Learning to generate long coherent sequences, or how to form alignments over long input contexts, is difficult for existing models. NMT performance of encoder-decoder models with attention deteriorates as sequence length increases (Cho et al., 2014; Sutskever et al., 2014), and this effect can be more pronounced in character-level NMT because character sequences are longer than word sequences. A planning mechanism can make the decoder’s search for alignments more tractable and more scalable.
We evaluate our proposed model and report results on character-level translation tasks from WMT’15 for English to German, English to Finnish, and English to Czech language pairs. On almost all pairs we observe improvements over a baseline that represents the state of the art in neural character-level translation. In our NMT experiments, our model outperforms the baseline despite using significantly fewer parameters and converges faster in training.
2 Planning for Character-level Neural Machine Translation
We now describe how to integrate a planning mechanism into a sequence-to-sequence architecture with attention (Bahdanau et al., 2015). Our model first creates a plan, then computes a soft alignment based on the plan, and finally generates an output token at each decoder time-step. We refer to our model as PAG (Plan-Attend-Generate).
2.1 Notation and Encoder
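As input our model receives a sequence of tokens, $X = (x_1, \cdots, x_{|X|})$, where $|X|$ denotes the length of $X$. It processes these with the encoder, a bidirectional RNN. At each input position $i$ we obtain annotation vector $\mathbf{h}_i$ by concatenating the forward and backward encoder states, $\mathbf{h}_i = [\mathbf{h}_i^{\rightarrow}; \mathbf{h}_i^{\leftarrow}]$, where $\mathbf{h}_i^{\rightarrow}$ denotes the hidden state of the encoder’s forward RNN and $\mathbf{h}_i^{\leftarrow}$ denotes the hidden state of the encoder’s backward RNN.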
Through the decoder the model predicts a sequence of output tokens, $Y = (y_1, \cdots, y_{|Y|})$. We denote by $\mathbf{s}_t$ the hidden state of the decoder RNN generating the target output token at time-step $t$.
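The annotation construction can be illustrated with a toy sketch. It assumes a scalar tanh recurrence in each direction purely for brevity; the function names, the weights, and the recurrence itself are hypothetical and are not the encoder used in the experiments.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Toy scalar RNN: h = tanh(w_in * x + w_rec * h_prev), starting from h = 0."""
    states, h = [], 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def annotate(inputs, fwd=(0.5, 0.3), bwd=(0.4, 0.2)):
    """Concatenate the forward and backward state at every input position."""
    forward = run_rnn(inputs, *fwd)
    # The backward RNN reads the sequence right to left; re-reverse its states
    # so that index i again refers to input position i.
    backward = run_rnn(inputs[::-1], *bwd)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

# Hypothetical input: three scalar "token embeddings".
annotations = annotate([1.0, -2.0, 0.5])
```

Each annotation therefore summarizes both the prefix and the suffix of the source around its position, which is what lets the decoder attend to a position with context from both sides.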
2.2 Alignment and Decoder
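Our goal is a mechanism that plans which parts of the input sequence to focus on for the next $k$ time-steps of decoding. For this purpose, our model computes an alignment plan matrix $\mathbf{A}_t \in \mathbb{R}^{k \times |X|}$ and commitment plan vector $\mathbf{c}_t \in \mathbb{R}^{k}$ at each time-step. Matrix $\mathbf{A}_t$ stores the alignments for the current and the next $k-1$ time-steps; it is conditioned on the current input, i.e. the token predicted at the previous time-step $\mathbf{y}_t$, and the current context $\psi_t$, which is computed from the input annotations $\mathbf{h}_i$. The recurrent decoder function, $f_{\text{dec-rnn}}(\cdot)$, receives $\mathbf{s}_{t-1}$, $\mathbf{y}_t$, $\psi_t$ as inputs and computes the hidden state vector
$$\mathbf{s}_t = f_{\text{dec-rnn}}(\mathbf{s}_{t-1}, \mathbf{y}_t, \psi_t).$$
Context $\psi_t$ is obtained by a weighted sum of the encoder annotations,
$$\psi_t = \sum_{i=1}^{|X|} \alpha_{ti} \mathbf{h}_i.$$
The alignment vector $\alpha_t = \mathrm{softmax}(\mathbf{A}_t[0]) \in \mathbb{R}^{|X|}$ is a function of the first row of the alignment matrix. At each time-step, we compute a candidate alignment-plan matrix $\bar{\mathbf{A}}_t$ whose entry at the $i^{\text{th}}$ row is
$$\bar{\mathbf{A}}_t[i] = f_{\text{align}}(\mathbf{s}_{t-1}, \mathbf{h}_j, \beta_t^i, \mathbf{y}_t),$$
where $f_{\text{align}}(\cdot)$ is an MLP and $\beta_t^i$ denotes a summary of the alignment matrix’s $i^{\text{th}}$ row at time $t-1$. The summary is computed using an MLP, $f_r(\cdot)$, operating row-wise on $\mathbf{A}_{t-1}$: $\beta_t^i = f_r(\mathbf{A}_{t-1}[i])$.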
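As a minimal sketch of how the context follows from the plan, the snippet below assumes the alignment vector is a softmax over the first row of the alignment plan matrix and that the context is the resulting weighted sum of annotations. The function names and the toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_from_plan(plan, annotations):
    """Return the alignment weights and the context vector they induce.

    plan:        k rows, each with one score per source position.
    annotations: one vector per source position.
    """
    alpha = softmax(plan[0])  # only the first row is used at the current step
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alpha, annotations)) for d in range(dim)]
    return alpha, context

# Toy example: k = 2 planned steps, 3 source positions, 2-dimensional annotations.
plan = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = context_from_plan(plan, annotations)
```

Because the first row here is uniform, every source position receives the same weight; the second row is stored for the next time-step and does not affect the current context.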
The commitment plan vector $\mathbf{c}_t$ governs whether to follow the existing alignment plan, by shifting it forward from $t-1$, or to recompute it. Thus, $\mathbf{c}_t$ represents a discrete decision. For the model to operate discretely, we use the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) in conjunction with the straight-through estimator (Bengio et al., 2013) to backpropagate through $\mathbf{c}_t$. (Footnote 1: We also experimented with training $\mathbf{c}_t$ using REINFORCE (Williams, 1992) but found that Gumbel-Softmax led to better performance.) The model further learns the temperature for the Gumbel-Softmax as proposed in Gulcehre et al. (2017). Both the commitment vector and the alignment plan matrix are initialized with ones; this initialization is not modified through training.
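The forward pass of a Gumbel-Softmax sample with straight-through discretization can be sketched as follows. This is an illustration in plain Python without automatic differentiation, so only the sampling and the hard one-hot output are shown; the function name, logits, and fixed temperature are hypothetical (the model described here learns its temperature).

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Return (soft, hard): a relaxed sample and its one-hot discretization."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly in (0, 1)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Straight-through, forward pass: emit the one-hot argmax of the soft sample.
    # With autodiff, the backward pass would use the gradient of `soft` instead.
    top = soft.index(max(soft))
    hard = [1.0 if i == top else 0.0 for i in range(len(soft))]
    return soft, hard

rng = random.Random(0)
soft, hard = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
```

Lower temperatures push the soft sample closer to one-hot, which reduces the mismatch between the forward (hard) and backward (soft) passes at the cost of higher-variance gradients.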
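Our decoder updates its alignment plan as governed by the commitment plan. We denote by $g_t$ the first element of the discretized commitment plan $\bar{\mathbf{c}}_t$. In more detail, $g_t = \bar{\mathbf{c}}_t[0]$, where the discretized commitment plan is obtained by setting $\mathbf{c}_t$’s largest element to 1 and all other elements to 0. Thus, $g_t$ is a binary indicator variable; we refer to it as the commitment switch. When $g_t = 0$, the decoder simply advances the time index by shifting the alignment plan matrix $\mathbf{A}_{t-1}$ forward via the shift function $\rho(\cdot)$. When $g_t = 1$, the controller reads the alignment plan matrix to produce the summary of the plan, $\beta_t^i$. We then compute the updated alignment plan by interpolating the previous alignment plan matrix $\mathbf{A}_{t-1}$ with the candidate alignment plan matrix $\bar{\mathbf{A}}_t$. The mixing ratio is determined by a learned update gate $\mathbf{u}_t \in \mathbb{R}^{k \times |X|}$, whose elements $\mathbf{u}_{ti}$ correspond to tokens in the input sequence and are computed by an MLP with sigmoid activation, $f_{\text{up}}(\cdot)$:
$$\mathbf{u}_{ti} = f_{\text{up}}(\mathbf{h}_i, \mathbf{s}_{t-1}),$$
$$\mathbf{A}_t[:, i] = (1 - \mathbf{u}_{ti}) \odot \mathbf{A}_{t-1}[:, i] + \mathbf{u}_{ti} \odot \bar{\mathbf{A}}_t[:, i].$$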
To reiterate, the model only updates its alignment plan when the current commitment switch is active. Otherwise it uses the alignments planned and committed at previous time-steps.
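The commitment plan also updates when $g_t$ becomes 1. If $g_t$ is 0, the shift function $\rho(\cdot)$ shifts the commitment vector forward and appends a $0$-element. If $g_t$ is 1, the model recomputes $\mathbf{c}_t$ using a single layer MLP ($f_c(\cdot)$) followed by a Gumbel-Softmax, and $\bar{\mathbf{c}}_t$ is recomputed by discretizing $\mathbf{c}_t$ as a one-hot vector:
$$\mathbf{c}_t = \mathrm{gumbel\_softmax}(f_c(\mathbf{s}_{t-1})),$$
$$\bar{\mathbf{c}}_t = \mathrm{one\_hot}(\mathbf{c}_t).$$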
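The shift-or-recompute logic described above can be sketched on plain nested lists. The sketch assumes that shifting drops the first row and pads the end with zeros, and that the update gate has the same shape as the plan and is applied element-wise; the function names and the zero padding of the alignment plan are assumptions made for illustration.

```python
def shift_plan(plan, fill=0.0):
    """Advance the plan one step: drop the first row, append a row of `fill`."""
    width = len(plan[0])
    return plan[1:] + [[fill] * width]

def shift_commitment(commitment):
    """Advance the commitment vector one step and append a 0-element."""
    return commitment[1:] + [0.0]

def update_plan(prev_plan, candidate, gate, switch):
    """Follow the stored plan (switch == 0) or blend in a candidate (switch == 1)."""
    if switch == 0:
        return shift_plan(prev_plan)
    # Element-wise interpolation: gate = 0 keeps the old entry, gate = 1 replaces it.
    return [
        [(1.0 - u) * a + u * c for a, c, u in zip(prev_row, cand_row, gate_row)]
        for prev_row, cand_row, gate_row in zip(prev_plan, candidate, gate)
    ]

# Toy example: k = 2 planned steps over 2 source positions.
prev_plan = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[0.0, 1.0], [1.0, 0.0]]
gate = [[0.5, 0.5], [0.5, 0.5]]
followed = update_plan(prev_plan, candidate, gate, switch=0)
replanned = update_plan(prev_plan, candidate, gate, switch=1)
```

With the switch off, the row planned for the next step becomes the current row at no extra cost; with the switch on, the gate decides per entry how much of the old plan survives.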
2.2.1 Alignment Repeat
To reduce the model’s computational cost, we also propose an alternative to computing the candidate alignment-plan matrix at every step. Specifically, we propose a model variant that reuses the alignment from the previous time-step until the commitment switch activates, at which time the model computes a new alignment. We call this variant repeat, plan, attend, and generate (rPAG). rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition can reduce the computational complexity of the alignment mechanism drastically; it also eliminates the need for an explicit alignment-plan matrix, which reduces the model’s memory consumption as well. We provide pseudocode for rPAG in Algorithm 1.
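The alignment-reuse idea can be illustrated with a small control-flow sketch. It is not the paper's Algorithm 1: the function names are hypothetical, the commitment switches are given as a fixed list rather than sampled, and the alignment function is a stand-in that returns a one-hot vector.

```python
def decode_alignments(switches, compute_alignment):
    """Return the alignment used at each step and the number of recomputations.

    A new alignment is computed only when the commitment switch is 1
    (or at the very first step, when nothing has been computed yet).
    """
    alignments = []
    current = None
    recomputed = 0
    for t, switch in enumerate(switches):
        if switch == 1 or current is None:
            current = compute_alignment(t)
            recomputed += 1
        alignments.append(current)
    return alignments, recomputed

def toy_alignment(t, source_len=4):
    """Stand-in for the attention computation: one-hot on position t mod source_len."""
    return [1.0 if j == t % source_len else 0.0 for j in range(source_len)]

alignments, recomputed = decode_alignments([1, 0, 0, 1, 0], toy_alignment)
```

In this toy run the attention computation is invoked twice for five decoding steps, and no plan matrix is stored; only the most recent alignment is kept.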
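2.3 Training
We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over output tokens,
$$p(\mathbf{y}_t \mid \mathbf{y}_{<t}, \mathbf{x}) \propto \mathbf{y}_t^{\top} \exp(\mathbf{W}_o f_o(\mathbf{s}_t, \mathbf{y}_{t-1}, \psi_t)),$$
where $\mathbf{W}_o$ is a matrix of learned parameters and we have omitted the bias for brevity. Function $f_o$ is an MLP with $\tanh$ activation.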
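The full model, including both the encoder and decoder, is jointly trained to minimize the (conditional) negative log-likelihood
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}),$$
where the training corpus is a set of $N$ pairs $(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$ and $\theta$ denotes the set of all tunable parameters. As noted by Vezhnevets et al. (2016), the proposed model can learn to recompute very often, which decreases the utility of planning. To avoid this behavior, we introduce a loss that penalizes the model for committing too often,
$$\mathcal{L}_{\text{com}} = \lambda_{\text{com}} \sum_{t=1}^{|Y|} \sum_{i=0}^{k-1} \left\| \frac{1}{k} - \mathbf{c}_{ti} \right\|_2^2,$$
where $\lambda_{\text{com}}$ is the commitment hyperparameter and $k$ is the timescale over which plans operate.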
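A commitment penalty of this kind can be sketched as below, assuming it takes the form of a squared distance between each commitment plan and the uniform vector with entries 1/k, scaled by the commitment hyperparameter. That functional form, the function name, and the toy values are assumptions for illustration.

```python
def commitment_loss(commitment_plans, weight):
    """Sum of squared distances from the uniform vector 1/k, scaled by `weight`.

    commitment_plans: one length-k commitment vector per decoding time-step.
    weight:           the commitment hyperparameter.
    """
    total = 0.0
    for plan in commitment_plans:
        k = len(plan)
        total += sum((1.0 / k - c) ** 2 for c in plan)
    return weight * total

# A uniform plan incurs no penalty; a plan that puts all mass on one slot does.
uniform_penalty = commitment_loss([[0.5, 0.5]], weight=2.0)
peaked_penalty = commitment_loss([[1.0, 0.0]], weight=1.0)
```

Under this assumed form, a plan that concentrates all of its mass on the first slot, which is the slot that triggers recomputation, is penalized relative to one that spreads mass across the planning horizon.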
3 Experiments
In our NMT experiments we use byte pair encoding (BPE) (Sennrich et al., 2015) for the source sequence and a character representation for the target, the same setup described in Chung et al. (2016). We also use the same preprocessing as in that work. (Footnote 2: Our implementation is based on the code available at https://github.com/nyu-dl/dl4mt-cdec)
We test our planning models against a baseline on the WMT’15 tasks for English to German (EnDe), English to Czech (EnCs), and English to Finnish (EnFi) language pairs. We present the experimental results in Table 1.
As a baseline we use the biscale GRU model of Chung et al. (2016), with the attention mechanisms in both the baseline and (r)PAG conditioned on both layers of the encoder’s biscale GRU ($\mathbf{h}^1$ and $\mathbf{h}^2$; see Chung et al. (2016) for more detail). Our implementation reproduces the results in the original paper to within a small margin.
Table 1 shows that our planning mechanism generally improves translation performance over the baseline. It does this with fewer updates and fewer parameters. We trained (r)PAG for 350K updates on the training set, while the baseline was trained for 680K updates. We used 600 units in (r)PAG’s encoder and decoder, while the baseline used 512 in the encoder and 1024 units in the decoder. In total our model has about 4M fewer parameters than the baseline. We tested all models with a beam size of 15.
As can be seen from Table 1, layer normalization (Ba et al., 2016) improves the performance of the PAG model significantly. However, according to our results on EnDe, layer norm affects the performance of rPAG only marginally. Thus, we decided not to train rPAG with layer norm on other language pairs.
|Model||Layer Norm||Dev||Test 2014||Test 2015|
In Figure 3, we show qualitatively that our model constructs smoother alignments. In contrast to (r)PAG, we see that the baseline decoder aligns the first few characters of each word that it generates to a byte in the source sequence; for the remaining characters it places the largest alignment weight on the final, empty token of the source sequence. This is because the baseline becomes confident of which word to generate after the first few characters, and generates the remainder of the word mainly by relying on language-model predictions. As illustrated by the learning curves in Figure 2, we observe further that (r)PAG converges faster with the help of its improved alignments.
4 Conclusions and Future Work
In this work, we addressed a fundamental issue in neural generation of long sequences by integrating planning into the alignment mechanism of sequence-to-sequence architectures for machine translation. We proposed two different planning mechanisms: PAG, which constructs explicit plans in the form of stored matrices, and rPAG, which plans implicitly and is computationally cheaper. The (r)PAG approach empirically improves alignments over long input sequences. In machine translation experiments, models with a planning mechanism outperform a state-of-the-art baseline on almost all language pairs while using fewer parameters. In future work, we plan to test our planning mechanism at the outputs of the model and on other sequence-to-sequence tasks.
References
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR).
- Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
- Chung et al. (2016) Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.
- Gulcehre et al. (2016) Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.
- Gulcehre et al. (2017) Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. 2017. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Lee et al. (2016) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.
- Luong and Manning (2016) Minh-Thang Luong and Christopher D Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.
- Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
- Pascanu et al. (2013) Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
- Vezhnevets et al. (2016) Alexander Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, and Koray Kavukcuoglu. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems. pages 3486–3494.
- Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
Appendix A Qualitative Translations from both Models
In Table 2, we present example translations from our model and the baseline along with the ground-truth.
| # | Groundtruth | Our Model (PAG + Biscale) | Baseline (Biscale) |
|---|---|---|---|
| 1 | Eine republikanische Strategie , um der Wiederwahl von Obama entgegenzutreten | Eine republikanische Strategie gegen die Wiederwahl von Obama | Eine republikanische Strategie zur Bekämpfung der Wahlen von Obama |
| 2 | Die Führungskräfte der Republikaner rechtfertigen ihre Politik mit der Notwendigkeit , den Wahlbetrug zu bekämpfen . | Republikanische Führungspersönlichkeiten haben ihre Politik durch die Notwendigkeit gerechtfertigt , Wahlbetrug zu bekämpfen . | Die politischen Führer der Republikaner haben ihre Politik durch die Notwendigkeit der Bekämpfung des Wahlbetrugs gerechtfertigt . |
| 3 | Der Generalanwalt der USA hat eingegriffen , um die umstrittensten Gesetze auszusetzen . | Die Generalstaatsanwälte der Vereinigten Staaten intervenieren , um die umstrittensten Gesetze auszusetzen . | Der Generalstaatsanwalt der Vereinigten Staaten hat dazu gebracht , die umstrittensten Gesetze auszusetzen . |
| 4 | Sie konnten die Schäden teilweise begrenzen | Sie konnten die Schaden teilweise begrenzen | Sie konnten den Schaden teilweise begrenzen . |
| 5 | Darüber hinaus haben Sie das Recht von Einzelpersonen und Gruppen beschränkt , jenen Wählern Hilfestellung zu leisten , die sich registrieren möchten . | Darüber hinaus begrenzten sie das Recht des Einzelnen und der Gruppen , den Wählern Unterstützung zu leisten , die sich registrieren möchten . | Darüber hinaus unterstreicht Herr Beaulieu die Bedeutung der Diskussion Ihrer Bedenken und Ihrer Familiengeschichte mit Ihrem Arzt . |