Recently, models of end-to-end machine translation based on neural network classification have been shown to produce excellent translations, rivalling or in some cases surpassing traditional statistical machine translation systems [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015]. This is despite the neural approaches using an overall simpler model, with fewer assumptions about the learning and prediction problem.
Broadly, neural approaches are based around the notion of an encoder-decoder [Sutskever et al.2014], in which the source language is encoded into a distributed representation, followed by a decoding step which generates the target translation. We focus on the attentional model of translation [Bahdanau et al.2015], which uses a dynamic representation of the source sentence while allowing the decoder to attend to different parts of the source as it generates the target sentence. The attentional model raises intriguing opportunities, given the correspondence between the notions of attention and alignment in traditional word-based machine translation models [Brown et al.1993].
In this paper we map modelling biases from word-based translation models into the attentional model, such that known linguistic elements of translation can be better captured. We incorporate absolute positional bias, whereby word order tends to be similar between the source sentence and its translation (e.g., IBM Model 2 and [Dyer et al.2013]); fertility, whereby each instance of a source word type tends to be translated into a consistent number of target tokens (e.g., IBM Models 3, 4, 5); relative position bias, whereby prior preferences for monotonic alignments/attention can be encouraged (e.g., IBM Models 4, 5 and HMM-based alignment [Vogel et al.1996]); and alignment consistency, whereby the attention in both translation directions is encouraged to agree (e.g., symmetrization heuristics [Och and Ney2003] or joint modelling [Liang et al.2006, Ganchev et al.2008]).
We provide an empirical analysis of incorporating the above structural biases into the attentional model, considering a low-resource translation scenario over four language pairs. Our results demonstrate consistent improvements over the vanilla encoder-decoder and attentional models in terms of perplexity and BLEU score, e.g., up to 3.5 BLEU points when re-ranking the candidate translations generated by a state-of-the-art phrase-based model. (The source code will be released on publication.)
2 The attentional model of translation
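The encoding of the source sentence is formulated using a pair of RNNs (denoted bi-RNN), one operating left-to-right over the input sequence and another operating right-to-left,

$$\overrightarrow{\mathbf{h}}_i = \text{RNN}\big(\overrightarrow{\mathbf{h}}_{i-1}, \mathbf{r}^{(s)}_{s_i}\big), \qquad \overleftarrow{\mathbf{h}}_i = \text{RNN}\big(\overleftarrow{\mathbf{h}}_{i+1}, \mathbf{r}^{(s)}_{s_i}\big),$$

where $\overrightarrow{\mathbf{h}}_i$ and $\overleftarrow{\mathbf{h}}_i$ are the RNN hidden states. The left-to-right RNN function is defined as

$$\overrightarrow{\mathbf{h}}_i = \tanh\big(W^{(hs)} \mathbf{r}^{(s)}_{s_i} + W^{(hh)} \overrightarrow{\mathbf{h}}_{i-1} + \mathbf{b}^{(h)}\big) \qquad (1)$$

where $\mathbf{b}^{(h)} \in \mathbb{R}^{H}$ is a learned parameter vector, as are $W^{(hs)} \in \mathbb{R}^{H \times E}$, $W^{(hh)} \in \mathbb{R}^{H \times H}$ and the source embedding matrix $R^{(s)} \in \mathbb{R}^{V_S \times E}$ (whose row $\mathbf{r}^{(s)}_{s_i}$ is the embedding of source word $s_i$), with $H$ the number of hidden units, $V_S$ the size of the source vocabulary and $E$ the word embedding dimensionality. (Similarly, the right-to-left RNN has its own corresponding set of parameters.) Note that we use a long short-term memory unit [Hochreiter and Schmidhuber1997] in place of the simple RNN, which is shown here for simplicity of exposition. Each source word is then represented as a pair of hidden states, one from each RNN, $\mathbf{e}_i = [\overrightarrow{\mathbf{h}}_i; \overleftarrow{\mathbf{h}}_i]$. This encodes not only the word but also its left and right context, which can provide important evidence for its translation.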
A crucial question is how this dynamically sized matrix of source encodings can be used in the decoder to generate the target sentence. As with the encoder-decoder of [Sutskever et al.2014], the target sentence is created left-to-right using an RNN, while the encoded source is used to bias the process as an auxiliary input. This bias is realised through attentional vectors, i.e., vectors of scores over each source sentence location, which are used to aggregate the dynamic source encoding into a fixed length vector.
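The decoder operates as a standard RNN over the translation $\mathbf{t}$, formulated as follows

$$\mathbf{g}_j = \text{RNN-attn}\big(\mathbf{g}_{j-1}, \mathbf{r}^{(t)}_{t_{j-1}}, \mathbf{c}_j\big) \qquad (2)$$

$$\mathbf{u}_j = \tanh\big(\mathbf{g}_j + W^{(uc)} \mathbf{c}_j + W^{(ut)} \mathbf{r}^{(t)}_{t_{j-1}}\big) \qquad (3)$$

$$t_j \sim \text{softmax}\big(W^{(uo)} \mathbf{u}_j + \mathbf{b}^{(uo)}\big) \qquad (4)$$

where the decoder RNN is defined analogously to Eq 1 but with an additional input, the source attention component $\mathbf{c}_j$, and weighting matrix $W^{(gc)}$. The hidden state of the recurrence is then passed through a single hidden layer (Eq 3) in combination with the source attention and the previous target word, using weighting matrices $W^{(uc)}$ and $W^{(ut)}$. ([Bahdanau et al.2015] use a max-out layer for this final step; however, we found this to be a needless complication, and instead use a standard hidden layer with tanh activation.) In Eq 4 this vector is transformed to be target vocabulary sized, using weight matrix $W^{(uo)}$ and bias $\mathbf{b}^{(uo)}$, after which a softmax is taken, and the resulting normalised vector is used as the parameters of a categorical distribution in generating the next target word.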
The presentation above assumes a simple RNN is used to define the recurrence over hidden states; however, we can easily use alternative formulations of recurrent networks, including multiple-layer RNNs, gated recurrent units (GRU; [Cho et al.2014]), or long short-term memory (LSTM; [Hochreiter and Schmidhuber1997]) units. These more advanced methods allow for more efficient learning of more complex concepts, particularly long distance effects. Empirically we found LSTMs to be the best performing, and therefore use these units herein.
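The attention component is defined as

$$f_{ji} = \mathbf{v}^\top \tanh\big(W^{(ae)} \mathbf{e}_i + W^{(ah)} \mathbf{g}_{j-1}\big) \qquad (5)$$

$$\boldsymbol{\alpha}_j = \text{softmax}(\mathbf{f}_j), \qquad \mathbf{c}_j = \sum_i \alpha_{ji} \mathbf{e}_i$$

with the scalars $f_{ji}$ denoting the compatibility between the target hidden state $\mathbf{g}_{j-1}$ and the source encoding $\mathbf{e}_i$. This is defined as a neural network with one hidden layer of size $A$ and a single output, parameterised by $W^{(ae)}$, $W^{(ah)}$ and $\mathbf{v}$. The softmax then normalises the scalar compatibility values such that, for a given target word $j$, the values of $\alpha_{ji}$ can be interpreted as alignment probabilities to each source location. Finally, these alignments are used to reweight the source components $\mathbf{e}_i$ to produce a fixed length context representation.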
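The normalisation and reweighting steps can be illustrated with a small sketch in plain Python. This is not the authors' implementation: the compatibility scores are passed in directly rather than computed by the one-hidden-layer network, and the names `softmax` and `attend` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalise raw compatibility scores into attention weights."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores: List[float], encodings: List[List[float]]) -> List[float]:
    """Attention-weighted sum of per-word source encodings.

    The result has the dimensionality of one encoding, regardless of
    how many source words there are.
    """
    alpha = softmax(scores)
    dim = len(encodings[0])
    return [sum(a * e[d] for a, e in zip(alpha, encodings)) for d in range(dim)]


# Toy example: three source positions with 2-dimensional encodings.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal compatibility gives uniform attention
context = attend(scores, encodings)
```

With equal scores the attention is uniform, so the context vector is simply the average of the three encodings; a longer source sentence would still yield a 2-dimensional context.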
Training of this model is done by minimising the cross-entropy of the target sentence, measured word-by-word as for a language model. We use standard stochastic gradient optimisation using the back-propagation technique for computation of partial derivatives according to the chain rule.
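As a rough illustration of the word-by-word cross-entropy, and of how it relates to the perplexity figures reported in the experiments, the following sketch takes the probabilities a model assigned to each reference word. The function names and the toy probabilities are assumptions made for this example only.

```python
import math
from typing import List


def sentence_nll(word_probs: List[float]) -> float:
    """Negative log-likelihood of one target sentence, summed over its words."""
    return -sum(math.log(p) for p in word_probs)


def perplexity(corpus_probs: List[List[float]]) -> float:
    """Exponential of the average per-word negative log-likelihood."""
    total_nll = sum(sentence_nll(sentence) for sentence in corpus_probs)
    total_words = sum(len(sentence) for sentence in corpus_probs)
    return math.exp(total_nll / total_words)


# Toy corpus: the model gives probability 0.25 to every reference word,
# which is as uncertain as a uniform choice among four words.
toy_corpus = [[0.25, 0.25], [0.25]]
toy_perplexity = perplexity(toy_corpus)
```

Lower cross-entropy therefore corresponds directly to lower perplexity, which is why the two are used interchangeably when comparing models.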
3 Incorporating Structural Biases
The attentional model, as described above, provides a powerful and elegant model of translation in which alignments between source and target words are learned through the implicit conditioning context afforded by the attention mechanism. Despite its elegance, the attentional model omits several key components of traditional alignment models such as the IBM models [Brown et al.1993] and Vogel’s hidden Markov model [Vogel et al.1996], as implemented in the GIZA++ toolkit [Och and Ney2003]. Combining the strengths of this highly successful body of research into a neural model of machine translation holds potential to further improve the modelling accuracy of neural techniques. Below we outline methods for incorporating these factors as structural biases into the attentional model.
3.1 Position bias
First we consider position bias, based on the observation that a word at a given relative position in the source tends to align to a word at a similar relative position in the target (IBM Model 2). Relatedly, alignments tend to occur near the diagonal [Dyer et al.2013] when considering the alignments as a binary matrix (illustrated in Figure 1), where the cell at $(i, j)$ denotes whether an alignment exists between source word $i$ and target word $j$.
We include a position bias through redefining the pre-normalised attention scalars $f_{ji}$ in Eq 5 as:

$$f_{ji} = \mathbf{v}^\top \tanh\big(W^{(ae)} \mathbf{e}_i + W^{(ah)} \mathbf{g}_{j-1} + W^{(ap)} \psi(j, i, I)\big) \qquad (6)$$

where the new component in the input is a simple feature function of the positions in the source and target sentences and the source length,

$$\psi(j, i, I) = \big[\log(1 + j), \log(1 + i), \log(1 + I)\big]^\top$$

and $W^{(ap)} \in \mathbb{R}^{A \times 3}$. We exclude the target length $J$, as this is unknown during decoding: a partial translation can be completed to many (in principle unboundedly many) different lengths. The $\log(1 + \cdot)$ function is used to avoid numerical instabilities from widely varying sentence lengths. The non-linearity in Eq 6 allows for complex functions of these inputs to be learned, such as relative positions and approximate distance from the diagonal, as well as their interactions with the other inputs (e.g., to learn that some words are exceptional cases where a diagonal bias should not apply).
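The position features can be sketched as follows. This is only an illustration under stated assumptions: the squashing function is taken to be log(1 + x), positions are treated as non-negative integers, and the name `position_features` is hypothetical. Only the target position, source position and source length are used, since the target length is not available while decoding.

```python
import math
from typing import List


def position_features(j: int, i: int, src_len: int) -> List[float]:
    """Three position features: target position, source position, source length.

    log1p(x) = log(1 + x) is an assumed choice of squashing function; it
    keeps the features in a narrow range even for very long sentences.
    """
    return [math.log1p(j), math.log1p(i), math.log1p(src_len)]


# A tenfold increase in sentence length changes the feature only modestly.
short = position_features(j=3, i=4, src_len=10)
long_ = position_features(j=3, i=4, src_len=100)
```

In the model these three values would be fed into the attention network alongside the source encoding and the decoder state, so that the network itself learns how position interacts with content.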
3.2 Markov condition
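The HMM model of translation [Vogel et al.1996] is based on a Markov condition over alignment random variables, to allow the model to learn local effects such as the following: when target word $j$ is aligned to source word $i$, it is likely that target word $j+1$ aligns to source word $i+1$, or to $i$ again. These correspond to local diagonal alignments or one-to-many alignments, respectively. In general, there are many correlations between the alignments of a word and the word immediately to its left.

Markov conditioning can be incorporated in a similar manner to the position bias, by augmenting the attention input of Eq 6 with a further component:

$$f_{ji} = \mathbf{v}^\top \tanh\big(\ldots + W^{(am)} \xi_1(\boldsymbol{\alpha}_{j-1}; i)\big) \qquad (7)$$

where $\ldots$ abbreviates the $\mathbf{e}_i$, $\mathbf{g}_{j-1}$ and $\psi$ components from Eq 6, and $\xi_1(\boldsymbol{\alpha}_{j-1}; i)$ provides a fixed dimensional representation of the attention state for the preceding word. It is not immediately obvious how to incorporate the previous attention vector, as $\boldsymbol{\alpha}_{j-1}$ is dynamically sized to match the source sentence length; using it directly would not generalise over sentences of different lengths. For this reason, we make a simplification by just considering local moves offset by $\pm k$ positions, that is,

$$\xi_1(\boldsymbol{\alpha}_{j-1}; i) = \big[\alpha_{j-1,i-k}, \ldots, \alpha_{j-1,i}, \ldots, \alpha_{j-1,i+k}\big]^\top$$

with $W^{(am)} \in \mathbb{R}^{A \times (2k+1)}$. Our approach is likely to capture the most important alignment patterns forming the backbone of the alignment HMM, namely monotone, 1-to-many, and local inversions.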
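The fixed-size window over the previous attention vector can be sketched like this. The function name is hypothetical, source positions are 0-based, and padding out-of-range positions with zero is an assumption made for the example; the text does not say how sentence boundaries are handled.

```python
from typing import List


def markov_features(prev_alpha: List[float], i: int, k: int) -> List[float]:
    """Previous-step attention at source positions i-k .. i+k.

    The output always has 2k + 1 entries, whatever the source length,
    which is what lets one weight matrix be shared across sentences.
    Positions outside the sentence contribute 0.0 (an assumption).
    """
    n = len(prev_alpha)
    return [prev_alpha[p] if 0 <= p < n else 0.0 for p in range(i - k, i + k + 1)]


# The previous target word attended mostly to source position 1.
prev_alpha = [0.1, 0.7, 0.2]
window_at_2 = markov_features(prev_alpha, i=2, k=1)  # looks at positions 1, 2, 3
```

For source position 2 the first entry of the window is large, signalling that the preceding target word was aligned one position to the left, the typical monotone pattern.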
3.3 Fertility

Fertility is the propensity for a word to be translated as a consistent number of words in the other language, e.g., Iseseisvusdeklaratsioon translates as (the) Declaration of Independence. Fertility is a central component in the IBM models 3–5 [Brown et al.1993]. Incorporating fertility into the attentional model is a little more involved, and we present two techniques for doing so.
First we consider a feature-based technique, which includes the following features

$$\xi_2(\boldsymbol{\alpha}_{<j}; i) = \Big[\sum_{j'<j} \alpha_{j',i-k}, \ldots, \sum_{j'<j} \alpha_{j',i}, \ldots, \sum_{j'<j} \alpha_{j',i+k}\Big]^\top$$

and the corresponding feature weights, i.e., $W^{(af)} \in \mathbb{R}^{A \times (2k+1)}$. These sums represent the total alignment score for the surrounding source words, similar to fertility in a traditional latent variable model, which is the sum over binary alignment random variables. A word which already has several alignments can be excluded from participating in more alignments, thus combating the garbage collection problem. Conversely, words that tend to need high fertility can be learned through the interactions between these features and the word and context embeddings in Eq 7.
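A sketch of such running attention totals is given below. The function name, the 0-based indexing, the window half-width `k`, and the zero padding at sentence boundaries are all assumptions made for illustration.

```python
from typing import List


def local_fertility_features(alphas: List[List[float]], i: int, k: int) -> List[float]:
    """Attention mass accumulated so far at source positions i-k .. i+k.

    `alphas` holds one attention vector per target word generated so far.
    Out-of-range positions contribute 0.0 (an assumption).
    """
    n = len(alphas[0]) if alphas else 0
    feats = []
    for p in range(i - k, i + k + 1):
        if 0 <= p < n:
            feats.append(sum(alpha[p] for alpha in alphas))
        else:
            feats.append(0.0)
    return feats


# Two target words generated so far, over a three-word source sentence.
history = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
feats = local_fertility_features(history, i=1, k=1)
```

A large total at a position indicates a source word that has already been attended to heavily, which the attention network can learn to use to steer attention elsewhere.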
A second, more explicit, technique for incorporating fertility is to include this as a modelling constraint. Initially we considered a soft constraint based on the approach in [Xu et al.2015], where an image captioning model was biased to attend to every pixel in the image exactly once. In our setting, the same idea can be applied through adding a regularisation term to the training objective of the form $\sum_i \big(1 - \sum_j \alpha_{ji}\big)^2$. However, this method is overly restrictive: enforcing that every word is used exactly once is not appropriate in translation, where some words are likely to be dropped (e.g., determiners and other function words), while others might need to be translated several times to produce a phrase in the target language. (Modern decoders [Koehn et al.2003] often impose the restriction of each word being translated exactly once; however, this is tempered by their use of phrases as translation units rather than words, which allows for higher fertility in contiguous translation chunks.) For this reason we develop an alternative method, based around a contextual fertility model, $p(f_i \mid \mathbf{s}, i)$, which scores the fertility of source word $i$, defined as $f_i = \sum_j \alpha_{ji}$, using a normal distribution parameterised by $\mu(\mathbf{e}_i)$ and $\sigma^2(\mathbf{e}_i)$, both positive scalar valued non-linear functions of the source word encoding $\mathbf{e}_i$. (The normal distribution is deficient, as it has support for all scalar values, despite $f_i$ being bounded above and below, $0 \le f_i \le J$. This could be corrected by using a truncated normal, or various other choices of distribution.) This is incorporated into the training objective as an additional additive term, $\sum_i \log p(f_i \mid \mathbf{s}, i)$, for each training sentence.
This formulation allows for greater consistency in translation, for example by learning which words tend to be omitted from translation or to translate as several words. Compared to the fertility model in IBM 3–5 [Brown et al.1993], ours uses many fewer parameters through working over vector embeddings; moreover, the bi-RNN encoding of the source means that we learn context-dependent fertilities, which can be useful for dealing with fixed syntactic patterns or multi-word expressions.
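The two global fertility scores discussed above can be contrasted in a short sketch. It assumes the attention matrix is stored with one row per target word and one column per source word, and it takes the mean and standard deviation as plain numbers, whereas in the model they are predicted from each source word's encoding. All function names are hypothetical.

```python
import math
from typing import List


def fertilities(attention: List[List[float]]) -> List[float]:
    """Total attention received by each source word (column sums)."""
    return [sum(column) for column in zip(*attention)]


def exactly_once_penalty(attention: List[List[float]]) -> float:
    """Squared deviation of each fertility from 1 (the restrictive variant)."""
    return sum((1.0 - f) ** 2 for f in fertilities(attention))


def gaussian_log_density(f: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (f - mu) ** 2 / (2.0 * sigma ** 2)


def fertility_log_likelihood(attention: List[List[float]],
                             mus: List[float], sigmas: List[float]) -> float:
    """Sum over source words of the log density of their observed fertility."""
    return sum(gaussian_log_density(f, m, s)
               for f, m, s in zip(fertilities(attention), mus, sigmas))


# Three target words, two source words; source word 0 is translated twice.
attention = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
penalty = exactly_once_penalty(attention)
good_fit = fertility_log_likelihood(attention, mus=[2.0, 1.0], sigmas=[1.0, 1.0])
poor_fit = fertility_log_likelihood(attention, mus=[1.0, 1.0], sigmas=[1.0, 1.0])
```

The exactly-once penalty punishes the doubled word regardless of context, while the contextual score is highest when the predicted mean matches the fertility the word actually needs.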
3.4 Bilingual Symmetry
So far we have considered a conditional model of the target given the source, modelling $p(\mathbf{t} \mid \mathbf{s})$. However, it is well established for latent variable translation models that the alignments improve if $p(\mathbf{s} \mid \mathbf{t})$ is also modelled and the inferences of both directional models are combined, as evidenced by the symmetrisation heuristics used in most decoders [Koehn et al.2005], and also by explicit joint agreement training objectives [Liang et al.2006, Ganchev et al.2008]. The rationale is that both models make somewhat independent errors, so an ensemble stands to gain from variance reduction.
We propose a method for joint training of two directional models as pictured in Figure 2. Training twinned models involves optimising

$$\mathcal{L} = -\log p(\mathbf{t} \mid \mathbf{s}) - \log p(\mathbf{s} \mid \mathbf{t}) + \gamma B$$

where, as before, we consider only a single sentence pair, for simplicity of notation. This corresponds to a pseudo-likelihood objective, with the $B$ term linking the two models. (We could share some parameters, e.g., the word embedding matrices; however, we found this didn’t make much difference versus using disjoint parameter sets.) We set $\gamma = 1$ herein. The $B$ component considers the alignment (attention) matrices, $\alpha^{s \to t} \in \mathbb{R}^{J \times I}$ and $\alpha^{t \to s} \in \mathbb{R}^{I \times J}$, and attempts to make these close to one another for both translation directions (see Fig. 2). To achieve this, we use a ‘trace bonus’, inspired by [Levinboim et al.2015], formulated as

$$B = -\text{tr}\big(\alpha^{s \to t} \alpha^{t \to s}\big) = -\sum_j \sum_i \alpha^{s \to t}_{j,i} \, \alpha^{t \to s}_{i,j}.$$

As the alignment cells are normalised using the softmax and thus take values in [0,1], the trace term is bounded above by $\min(I, J)$, which occurs when the two alignment matrices are transposes of each other, representing perfect one-to-one alignments in both directions.
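The agreement term can be illustrated by computing the trace of the product of the two directional attention matrices. This sketch assumes one matrix has a row per target word and the other a row per source word, so that the product is square; the function name is hypothetical and the sign and weighting used in the training objective are left out.

```python
from typing import List


def trace_of_product(a_st: List[List[float]], a_ts: List[List[float]]) -> float:
    """tr(A B) for A of shape J x I and B of shape I x J.

    Only the diagonal of A B is needed, so the full product is never formed:
    the trace is the sum over (j, i) of A[j][i] * B[i][j].
    """
    num_target = len(a_st)
    num_source = len(a_ts)
    return sum(a_st[j][i] * a_ts[i][j]
               for j in range(num_target)
               for i in range(num_source))


# Confident one-to-one alignments that agree in both directions.
swap_st = [[0.0, 1.0], [1.0, 0.0]]
swap_ts = [[0.0, 1.0], [1.0, 0.0]]  # the transpose of swap_st
agree = trace_of_product(swap_st, swap_ts)

# Uniform, uncommitted attention in both directions.
uniform = [[0.5, 0.5], [0.5, 0.5]]
vague = trace_of_product(uniform, uniform)
```

The agreeing one-to-one case reaches the maximum value for a two-by-two example, while diffuse attention scores lower, so rewarding the trace pushes both models towards confident and mutually consistent alignments.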
4 Experiments

Table 1: Statistics of the training sets (columns: lang-pair, # tokens (K), # types (K)).
We conducted our experiments with four language pairs, translating between English and each of Romanian, Estonian, Russian and Chinese. These languages were chosen to represent a range of translation difficulties, including languages with significant morphological complexity (Estonian, Russian). We focus on a (simulated) low-resource setting, where only a limited amount of training data is available. This serves to demonstrate the robustness and generalisation of our model on sparse data, something that has not yet been established for neural models, which have millions of parameters and hence vast potential for over-fitting.
Table 1 shows the statistics of the training sets. (For all datasets, words were thresholded by training frequency, with uncommon training words and unseen testing words replaced by an ⟨unk⟩ symbol.) For Chinese-English, the data comes from the BTEC corpus, where the number of training sentence pairs is 44,016. We used ‘devset1_2’ and ‘devset_3’ as the development and test sets, respectively, and in both cases used only the first reference for evaluation. For the other language pairs, the data come from the Europarl corpus [Koehn2005], where we used 100K sentence pairs for training, 3K for development and 2K for testing. (The first 100K sentence pairs were used for training, while the development and test sets were drawn from the last 100K sentence pairs, taking the first 2K for testing and the last 3K for development.)
Models and Baselines.
We have implemented our neural translation model with linguistic features in C++ using the CNN library (https://github.com/clab/cnn/). We compared our proposed model against our implementations of the attentional model [Bahdanau et al.2015] and the encoder-decoder architecture [Sutskever et al.2014]. As the baseline, we used a state-of-the-art phrase-based statistical machine translation model built using Moses [Koehn et al.2007] with the standard features: relative-frequency and lexical translation model probabilities in both directions; distortion model; language model and word count. We used KenLM [Heafield2011] to create 3-gram language models with Kneser-Ney smoothing on the target side of the bilingual training corpora.
Following previous work [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015, Neubig et al.2015], we evaluated all neural models using test set perplexities and, in a re-ranking setting, using the BLEU measure [Papineni et al.2002]. For re-ranking, we generated 100-best translations using the baseline phrase-based model, to which we added log probability features from our neural models alongside the features of the underlying phrase-based model.
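The re-ranking setup can be sketched as a weighted linear combination of feature scores over an n-best list. This is a simplified illustration only: the feature names, scores and weights below are invented, and in practice the weights would be tuned on development data rather than set by hand.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score each hypothesis as a weighted sum of its features, best first."""
    scored = []
    for text, feats in nbest:
        score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
        scored.append((text, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 2-best list: "pb" stands for the phrase-based model score and
# "nmt" for the log probability assigned by a neural model.
nbest = [
    ("the council faces difficult decisions", {"pb": -10.0, "nmt": -3.0}),
    ("the council faces difficult decide", {"pb": -9.5, "nmt": -6.0}),
]
baseline_order = rerank(nbest, {"pb": 1.0})
reranked_order = rerank(nbest, {"pb": 1.0, "nmt": 1.0})
```

With the phrase-based score alone the second hypothesis wins narrowly, whereas adding the neural log probability as an extra feature promotes the first one.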
4.1 Analysis of Alignment Biases
We start by investigating the effect of the various linguistic constraints described in Section 3 on the attentional model. Table 2 presents the perplexity of trained models for Chinese-English translation. For comparison, we report the results of an encoder-decoder-based neural translation model [Sutskever et al.2014] as the baseline. All other results are for the attentional model with a single-layer LSTM as encoder and a two-layer LSTM as decoder, using 512 embedding, 512 hidden, and 256 alignment dimensions. For each model, we also report the number of its parameters. Models are trained using stochastic gradient optimisation, allowing up to 20 epochs. For each model the best perplexity on the held-out development set is reported, which was achieved in 5–10 epochs in most cases.
As expected, the vanilla attentional model greatly improves over the encoder-decoder (perplexity of 4.77 vs. 5.35), clearly making good use of the additional context. Adding the combined positional bias, local fertility, and Markov structure (denoted by +align) further decreases the perplexity to 4.56. Adding the global fertility (+glofer), however, is detrimental, increasing the perplexity to 5.20. Interestingly, global fertility does help to reduce the perplexity (to 4.31) when used in the pre-training setting (+align+glofer-pre). In this case, it is refining an already excellent model from which reliable global fertility estimates can be obtained. This finding is consistent across the other languages; see Figure 3, which shows typical learning curves of different variants of the attentional model. Note that when global fertility is added to the vanilla attentional model with alignment features, it significantly slows down training as it limits exploration in early training iterations; however, it does bring a sizeable win when used to fine-tune a pre-trained model. Finally, the bilingual symmetry also helps to reduce the perplexity scores when used with the alignment features, but does not combine well with global fertility (+align+sym+glofer-pre). This is perhaps an unsurprising result, as both methods often impose a similar regularising effect over the attention matrix.
Figure 4 illustrates the different attention matrices inferred by the various model variants. Note the difference between the base attentional model and its variant with alignment features (‘+align’), where more weight is assigned to diagonal and 1-to-many alignments. Global fertility pushes more attention to the sentinel symbols ⟨s⟩ and ⟨/s⟩. Determiners and prepositions in English show much lower fertility than nouns, while Estonian nouns have even higher fertility. This accords with Estonian morphology, wherein nouns are inflected with rich case marking, e.g., nõukoguga has the comitative -ga suffix, meaning ‘with’, and thus translates as several English words (with the council). The right-most column corresponds to joint symmetric training, with many more confident attention values, especially for consistent 1-to-many alignments (difficult in English and raskeid in Estonian, an adjective in partitive case meaning ‘some difficult’).
4.2 Full Results
The perplexity results of the neural models for the two translation directions across the four language pairs are presented in Tables 3a and 3b. In all cases, our model achieves lower perplexities than the vanilla attentional model and the encoder-decoder architecture, owing to the linguistic constraints.
Table 4 presents the BLEU scores in the re-ranking setting for translating into English from our four languages. We compare re-ranking settings using the log probabilities produced by our model as additional features vs. using log probabilities from the vanilla attentional model and the encoder-decoder. The re-rankers based on our model are significantly better than the rest for Chinese and Estonian, and on par with the others for Russian-English and Romanian-English. In all cases our model performs at least 1 BLEU point better than the baseline phrase-based system. It is worth noting that for Chinese-English, our re-ranker leads to an increase of almost 3 points in the BLEU score using an ensemble of neural models with different configurations. (We use the outputs of 6–12 models trained in both directions, using different alignment and fertility options, and using a smaller dimensionality than earlier: 100 embedding, 100 hidden and 50 attention dimensions.)
5 Related Work
[Kalchbrenner and Blunsom2013] were the first to propose a full neural model of translation, using a convolutional network as the source encoder, followed by an RNN decoder to generate the target translation. This was extended by [Sutskever et al.2014], who replaced the source encoder with an RNN using Long Short-Term Memory (LSTM) units, and by [Bahdanau et al.2015], who introduced the notion of “attention” to the model, whereby the source context can dynamically change during the decoding process to attend to the most relevant parts of the source sentence. [Luong et al.2015] refined the attention mechanism to be more local, by constraining attention to a text span, whose words’ representations are averaged. To leverage the attention history, they also made use of the attention vector of the previous position when generating the attention vector for the next position, similar in spirit to our method for incorporating alignment structural biases. Concurrent with our work, [Cheng et al.2015] proposed a similar agreement-based joint training for bidirectional attention-based neural machine translation, and showed significant improvement in the BLEU score for large-data French-English translation.
6 Conclusion

We have shown that the attentional model of translation does not capture many well known properties of traditional word-based translation models, and proposed several ways of imposing these as structural biases on the model. We show improvements across several challenging language pairs in a low-resource setting, both in perplexity and re-ranking evaluations. In future work we intend to investigate the model performance on larger datasets, and incorporate further linguistic information such as morphological representations.
References

- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA.
- [Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).
- [Cheng et al.2015] Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Agreement-based joint training for bidirectional attention-based neural machine translation. In arXiv:1512.04650 [cs.CL].
- [Cho et al.2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. On the properties of neural machine translation. In arXiv:1409.1259 [cs.CL].
- [Dyer et al.2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.
- [Ganchev et al.2008] Kuzman Ganchev, João V. Graça, and Ben Taskar. 2008. Better alignments = better translations? In Proceedings of ACL-08: HLT, pages 986–993, Columbus, Ohio, June. Association for Computational Linguistics.
- [Heafield2011] Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.
- [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
- [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, October.
- [Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.
- [Koehn et al.2005] Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In IWSLT, pages 68–75.
- [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. ACL Interactive Poster and Demonstration Sessions, pages 177–180.
- [Koehn2005] Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86. AAMT.
- [Levinboim et al.2015] Tomer Levinboim, Ashish Vaswani, and David Chiang. 2015. Model invertibility regularization: Sequence alignment with or without parallel data. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 609–618, Denver, CO.
- [Liang et al.2006] Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 104–111, New York, NY.
- [Luong et al.2015] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
- [Neubig et al.2015] Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), Kyoto, Japan, October.
- [Och and Ney2003] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pages 3104–3112, Montréal.
- [Vogel et al.1996] Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 836–841.
- [Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2048–2057.