Exploring the Use of Attention within Neural Machine Translation Decoder States to Translate Idioms

10/10/2018 ∙ by Giancarlo D. Salton, et al. ∙ Dublin Institute of Technology

Idioms pose problems to almost all Machine Translation systems. This type of language is very frequent in day-to-day use and cannot simply be ignored. The recent interest in memory-augmented models in the field of Language Modelling has helped systems achieve good results by bridging long-distance dependencies. In this paper we explore the use of such techniques in a Neural Machine Translation system to help with the translation of idiomatic language.






1 Introduction

The recent interest in Deep Learning (DL) research has resulted in ideas originating in the Deep Neural Network (DNN) field being applied to Machine Translation (MT), resulting in the development of a new area of research called Neural Machine Translation (NMT). The basic idea of NMT is to apply two different Recurrent Neural Networks (RNNs), a specific type of DNN for processing sequences, in the so-called Encoder/Decoder framework [Sutskever et al.2014, Cho et al.2014]. The first RNN, called the encoder, is trained to compress the input sentence, written in the source language, into a distributed representation (i.e., a fixed-size vector of real numbers). The second RNN, called the decoder, is trained to take that distributed representation and decompress it (word-by-word) into the output sentence, written in the target language.
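The encoder-decoder idea can be illustrated with a toy numpy sketch. This is not the trained model from the paper: the weights are random, the dimensions are tiny, and word embeddings are stand-in random vectors; it only shows how the encoder's final hidden state becomes the fixed-size representation that initializes the decoder.

```python
import numpy as np

def rnn_step(W_x, W_h, b, x, h):
    # One vanilla (Elman) RNN step: mix the current input with the previous state.
    return np.tanh(W_x @ x + W_h @ h + b)

rng = np.random.default_rng(0)
d_emb, d_hid = 4, 5  # toy sizes, not the paper's 1,000-unit layers
W_x = rng.normal(size=(d_hid, d_emb))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

# Encoder: fold a source "sentence" (random embeddings here) into its
# final hidden state, the fixed-size distributed representation c.
source = [rng.normal(size=d_emb) for _ in range(3)]
h = np.zeros(d_hid)
for x in source:
    h = rnn_step(W_x, W_h, b, x, h)
c = h  # distributed representation of the source sentence

# Decoder: an RNN whose first hidden state is c; at each step the previously
# emitted word (here a random embedding) is fed back in as input.
s = c
target_inputs = [rng.normal(size=d_emb) for _ in range(3)]
for y_prev in target_inputs:
    s = rnn_step(W_x, W_h, b, y_prev, s)
```

In the real architecture the decoder would additionally project each state `s` through a softmax over the vocabulary to emit a word; the sketch stops at the hidden states.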

A number of extensions to the basic encoder-decoder architecture have been proposed. For example, both bahdanau:2015 and luong:2015 integrate attention mechanisms into the NMT architecture to improve the alignment of information flow between the encoder and the decoder at each step in the generation of the translation. However, a potentially interesting (and as yet under-explored) extension to the encoder-decoder framework is the use of attention within the decoder itself. The decoder is essentially an RNN language model (RNN-LM) that is conditioned on a representation of the input sentence generated by an encoder RNN. Consequently, the quality of a translation is dependent on the power of the RNN-LM that is used by the NMT architecture. Within language model research there have been a number of augmentations proposed to improve the performance of RNN-LMs, such as those of daniluk:2017, merity:2017 and salton:2017b. However, to date these innovations in the design of RNN-LMs have not been integrated into NMT decoders.

Many of the augmentations to RNN-LMs are focused on improving the ability of the language models to span long-distance dependencies. Consequently, an interesting case study with which to explore the potential of using attention within an NMT Decoder is the translation of idioms. In general, the performance of MT systems degrades when processing idioms [Vilar et al.2006, Salton et al.2014]. Furthermore, the inclusion of an idiomatic phrase within a translation introduces a break into the evolving context of the RNN-LM decoder that the decoder must bridge once the idiom has been generated. Hence, improving the ability of the RNN-LM decoder to span long-distance dependencies may result in improved performance on the translation of idioms.

In this paper we propose an extension to NMT that uses, as its decoder, an RNN-LM augmented with memory and an attention mechanism. We evaluate this model on a number of different datasets and present an analysis of its performance on sentences that include idioms. We begin by introducing NMT in more detail and outlining previous work in the field in Section 2. In Section 3 we describe the modifications we made to a baseline NMT architecture in order to integrate an Attentive RNN-LM into the architecture as its decoder (we refer to this augmented NMT architecture as an Attentive NMT). We then describe a set of experiments that evaluated the performance of an Attentive NMT over benchmark datasets, first when translating idioms and then when translating regular language, in Section 4. In Section 5 we present an analysis of the performance of an Attentive NMT system compared to a set of baseline systems. Finally, in Section 6 we draw our conclusions.

2 Neural Machine Translation

Figure 1 informally outlines the basic NMT model. Formally, let $X = (x_1, \dots, x_{T_x})$ and $Y = (y_1, \dots, y_{T_y})$ represent the source sentence and the target sentence respectively, where $T_x$ and $T_y$ denote the lengths of $X$ and $Y$ (note that $T_x$ might be different from $T_y$). Also let $n$ denote the number of hidden units in the Encoder and $m$ denote the number of hidden units in the Decoder.

Figure 1: The Encoder-Decoder architecture. The rectangles inside the box on the left represent the Encoder RNN unfolded over time and the rectangles inside the box on the right represent the Decoder RNN, also unfolded over time. The output of the last step of the Encoder RNN is the distributed representation of the source sentence. The first input for the Decoder RNN is the start-of-sentence ( <s>) symbol of the output sentence and the Decoder RNN’s first hidden state is set to be the distributed representation of the input sentence. At each timestep $t$ the Decoder RNN emits a symbol (word) that will serve as input to the Decoder RNN at timestep $t+1$. The Decoder RNN performs computations until the </s> symbol is emitted.

Encoder. The Encoder compresses information about $X$ into a distributed representation $c$ using an RNN. The computation of $c$ involves iterating over the following equation:

$$h_t = f(x_t, h_{t-1}) \qquad (1)$$

where $f$ is a non-linear function and $h_t$ (also called the Encoder hidden state) is the output of $f$ at each iteration $t$. The Encoder then outputs its last hidden state as the representation $c$ (note that $c \in \mathbb{R}^n$):

$$c = h_{T_x} \qquad (2)$$
Decoder. The Decoder is often trained to predict the next word $y_t$ given $c$ and all previous words $y_1, \dots, y_{t-1}$ (i.e., all words of previous timesteps up to $t-1$). Therefore, the Decoder is understood to define a probability over the translation $Y$ using the (ordered) conditionals:

$$p(Y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) \qquad (3)$$

where each conditional probability is defined as:

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \qquad (4)$$

where $g$ is a non-linear function and $s_t$ (also called the Decoder hidden state) is the state computed by the Decoder RNN at each iteration $t$.
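The factorization into ordered conditionals can be made concrete with a small numpy sketch. The decoder states and the output matrix below are random stand-ins (not trained parameters), and the target word ids are hypothetical; the point is only how the sentence log-probability accumulates one softmax at a time.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vocabulary-sized vector.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
vocab, d_hid = 6, 5
W_o = rng.normal(size=(vocab, d_hid))  # output projection (random stand-in)

# Stand-ins for the decoder states s_1..s_T produced by g at each timestep.
states = rng.normal(size=(4, d_hid))
target = [2, 0, 5, 1]  # hypothetical target word ids

# p(Y) factorises into ordered conditionals p(y_t | y_<t, c); we accumulate
# the log of each conditional as the decoder steps through the sentence.
log_p = 0.0
for s_t, y_t in zip(states, target):
    p_t = softmax(W_o @ s_t)   # distribution over the vocabulary at step t
    log_p += np.log(p_t[y_t])  # log p(y_t | y_<t, c)
```

Because each `p_t` is a proper distribution, `log_p` is the log of a product of probabilities, which is what training minimizes (as average negative log probability).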

The Sequence-to-Sequence Learning (Seq2seq) model proposed by [Sutskever et al.2014] closely follows the Encoder-Decoder approach using Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] units. A single Seq2seq model has achieved results close to those of state-of-the-art Statistical Machine Translation (SMT) systems [Sutskever et al.2014]. Despite being relatively simple in comparison with other NMT systems, the ensemble of Seq2seq models using LSTM units still achieves state-of-the-art results for French-to-English translation [Sutskever et al.2014]. Nevertheless, a negative point of these models is that they are difficult to train given the required number of hidden layers for each neural network and the size of the models. In fact, these models can easily have more than 26 billion parameters.

An alternative NMT model incorporating an attention mechanism was proposed by bahdanau:2015. Their model, called NMT with Soft Attention (or NMT with Global Attention), also builds upon the Encoder-Decoder approach and adds a small neural network that learns which part of the encoded distributed representation of the source sentence to pay attention to at different stages of the decoding process. This model is more complex than Seq2seq but requires a smaller number of hidden layers and parameters. This reduction in hidden layers and parameters is due to the use of Gated Recurrent Units (GRUs) [Cho et al.2014] in the hidden layers of the model. This approach was the first pure NMT system to win a Machine Translation Shared Task at the Workshop on Statistical Machine Translation (WMT) [Bojar et al.2015].

More recently, luong:2015 proposed the NMT with Local Attention model, building upon the ideas of both Seq2seq and NMT with Global Attention. In this work, luong:2015 also use two stacks of LSTM units (similar to Seq2seq) and include two feedforward networks. The first network is trained to predict a fixed-size window over the distributed representation of the source sentence. The second network is trained to learn which part of the predicted window to pay attention to at different stages of the decoding step, similar in spirit to NMT with Global Attention. The NMT with Local Attention model is currently the state of the art for translating from English into German [Luong et al.2015]. In addition, luong:2015 also experimented with the use of stacks of LSTM units together with a Global Attention mechanism (more general than the original), with slightly worse results than the Local Attention model.

3 Attentive NMT

According to sutskever:2014, the decoder of an NMT system is essentially an RNN-LM conditioned on an input sequence encoded by another RNN (i.e., the encoder). To date, the extensions to the Seq2seq architecture have focused on improving the representation generated by the encoder (for example, using bi-directional inputs [Bahdanau et al.2015]) or using attention (either local or global) to provide an alignment mechanism between the decoder and the encoder. In this paper we adopt a different approach and explore how extensions to the decoder can help with the translation of idioms. Specifically, we experiment with using a language model that integrates an attention mechanism. In recent years a number of augmented language models have been proposed [Daniluk et al.2017, Grave et al.2017, Merity et al.2017, Salton et al.2017]. For the purposes of this work, any of these augmented LM systems could be used as a decoder in the NMT system. However, for our experiments we chose to use the Attentive Language Model architecture of salton:2017b. Our motivation for using this model was the relative compactness of the architecture in terms of parameter size, its suitability for sentence-based processing, and its ease of implementation. In order to adapt the Attentive RNN-LM of salton:2017b to work as the decoder of an NMT system, we use the NMT architecture described in luong:2015.

luong:2015 add a memory buffer in which all the hidden states of the encoder RNN are stored. The authors then modify Eq. 2 so that a new representation $c_t$ for the input sequence is generated at each timestep $t$ using an attention-based model. More formally, generating $c_t$ involves scoring each stored encoder state $\bar{h}_s$ against the current decoder state $h_t$ with one of three content functions:

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \quad (7) \\ h_t^\top W_a \bar{h}_s & \text{general} \quad (8) \\ v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & \text{concat} \quad (9) \end{cases}$$

where $h_t^\top \bar{h}_s$ is a dot product and $W_a$ is a matrix of parameters. The scores are normalized with a softmax to produce attention weights, and $c_t$ is the resulting weighted sum of the encoder states.

The vector $c_t$ is then merged with the current state $h_t$ by means of a concatenation layer:

$$\tilde{h}_t = \tanh(W_c [c_t; h_t])$$

$\tilde{h}_t$, in the original model, is then passed to the softmax layer to make the prediction for the next word.
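The three content functions and the subsequent context/merge steps can be sketched in a few lines of numpy. All weights here are random stand-ins and the dimensions are toy-sized; the sketch only illustrates the shape of the computation, not the trained luong:2015 model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d = 5
h_t = rng.normal(size=d)        # current decoder state
enc = rng.normal(size=(3, d))   # stored encoder states \bar{h}_s (memory buffer)
W_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)
W_cat = rng.normal(size=(d, 2 * d))

# The three content functions (Eqs. 7-9), one score per encoder state.
dot = enc @ h_t
general = enc @ (W_a @ h_t)
concat = np.array(
    [v_a @ np.tanh(W_cat @ np.concatenate([h_t, h_s])) for h_s in enc]
)

# Attention weights and context vector, using e.g. the "general" scores.
a_t = softmax(general)
c_t = a_t @ enc  # weighted sum of encoder states

# Concatenation layer merging c_t with the current decoder state.
W_c = rng.normal(size=(d, 2 * d))
h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
```

Note that "dot" needs no extra parameters, "general" one matrix, and "concat" a matrix plus a vector, which is why the dot and general variants are the lighter-weight options used in the experiments below.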

As we are interested in simple models, we choose to integrate the Attentive RNN-LM with a single score of salton:2017b, which performs reasonably well despite its simplicity. This model generates a context vector ($c_t$) based on the previous hidden states of the RNN by calculating a weighted sum of these hidden states (see Eq. 11). The weights used in the summation represent the attention ($a_{t,i}$) the model pays to each of the past states ($h_i$). This context vector is then used to bring forward, to the prediction step, past information processed by the model. More formally, the context for the decoder is generated by iterating over the following equation:

$$c_t = \sum_{i=1}^{t-1} a_{t,i} h_i \qquad (11)$$

where the weights $a_{t,i}$ are a softmax over a single score assigned to each past state. According to salton:2017b, that score reflects the “relevance” of the state $h_i$ to the current prediction. The context $c_t$ is merged with the current state $h_t$ using a concatenation layer:

$$\tilde{h}_t = \tanh(W_c [h_t; c_t])$$

The next-word probability is computed using Eq. 15:

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = \mathrm{softmax}(W \tilde{h}_t + b) \qquad (15)$$

In order to integrate the Attentive RNN-LM as a decoder within the luong:2015 NMT architecture, we add a concatenation layer so that we may merge the output of the encoder-side attention with the decoder-side context $c_t$, bringing forward information from previously predicted words:

$$\hat{h}_t = \tanh(W_m [\tilde{h}_t; c_t]) \qquad (16)$$

$\hat{h}_t$ is then passed to the softmax layer (Eq. 15) to make the next prediction. In all equations above, $W$ indicates a matrix of parameters, $v$ indicates a vector of parameters, and $b$ indicates a bias vector.
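The decoder-side attention can also be sketched in numpy. The exact single-score parameterisation of salton:2017b is not reproduced here; the `v @ tanh(W_m @ h_i)` form below is one plausible stand-in, and all weights are random toy values. The sketch shows the weighted sum over past decoder states and the final merge that Eq. 16 describes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d = 5
past = rng.normal(size=(4, d))  # past decoder states h_1..h_{t-1}
h_t = rng.normal(size=d)        # current decoder state
v = rng.normal(size=d)          # hypothetical single-score parameters
W_m = rng.normal(size=(d, d))

# One scalar "relevance" score per past state (assumed form, see lead-in).
e = np.array([v @ np.tanh(W_m @ h_i) for h_i in past])
a = softmax(e)    # attention the model pays to each past state
ctx = a @ past    # context vector: weighted sum of past decoder states

# Merge the decoder-side context with the current state; in the full
# Attentive NMT a second merge of this kind combines in the encoder-side
# attention output before the softmax prediction.
W_c = rng.normal(size=(d, 2 * d))
h_merged = np.tanh(W_c @ np.concatenate([h_t, ctx]))
```

The key difference from the encoder-side attention of the previous section is the memory being attended over: past *decoder* states rather than stored *encoder* states.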

4 NMT Experiments

We are interested in the effectiveness of the Attentive NMT when translating idioms. We describe the experiments on the translation of idioms from German to English and from English to German in Section 4.1. To get some intuition about how the model performs on regular language, we also tested the model on benchmark datasets, as described in Section 4.2. We present and discuss the results obtained by our models in Section 4.3.

4.1 Idioms Experiment

Datasets. As our idiom dataset for the English/German language pair (in both directions), we used the dataset of fritzinger:2010. This dataset was extracted from EUROPARL and from a German newspaper corpus (the Frankfurter Allgemeine Zeitung). The dataset consists of sentences containing one of 77 German preposition+noun+verb triples. Each sentence is annotated with syntactic and morphological features and has one of four labels: idiomatic, literal, ambiguous, or error (which indicates parsing or extraction errors). Only part of the dataset consists of idioms; the remainder carries the other three labels. We considered only the 3,050 sentences extracted from EUROPARL, given that it is a bilingual corpus. From these sentences, we randomly selected 2,200 sentences to build a test set and used the remaining 850 sentences as part of our training data so as to ensure that there were idioms in the training data.
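The random split described above can be sketched as follows. The sentence identifiers and the seed are assumptions for illustration (the paper does not report a seed); only the 2,200/850 split sizes come from the text.

```python
import random

# Hedged sketch of the split: 3,050 EUROPARL sentences with idiom
# annotations, 2,200 held out for testing, 850 kept in the training data.
random.seed(42)  # seed chosen for illustration only
sentences = [f"sent_{i}" for i in range(3050)]  # stand-ins for sentence pairs
random.shuffle(sentences)
test_set, train_extra = sentences[:2200], sentences[2200:]
```

Keeping the 850 sentences in the training data ensures the models see at least some idiomatic usage during training, which the evaluation on the held-out 2,200 then probes.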

Given that EUROPARL is part of the training data used in WMT 2016, we used that dataset (after the pre-processing performed on the EUROPARL portion of the data, as described above) to train our models and the baseline models.

Attentive NMT. We use Attentive NMT models of similar sizes to those proposed by luong:2015. More specifically, the encoder is an RNN composed of four layers of 1,000 LSTM units, and the decoder is the Attentive RNN-LM (also composed of four layers of 1,000 LSTM units). We also applied attention over the encoder outputs using the “dot” and “general” content functions (Eq. 7 and Eq. 8 respectively) as described in luong:2015. In total, we trained four models for each direction (eight in total).

We optimize the model using ADAM [Kingma and Ba2014] with a learning rate of 0.0001 and mini-batches of size 128 to minimize the average negative log probability of the target words. We train the models until we no longer see improvements in the negative log probability on the development set, with an early-stop counter of 5 epochs (in general, the average number of epochs is around 14, including the 5 epochs of patience). Once a model runs out of patience, we roll back its parameters and use the model that achieved the best performance on the validation set to obtain the translations. We initialize the weight matrices of the network uniformly within a small interval, while all biases are initialized to a constant value. We also apply dropout to the non-recurrent connections and clip the norm of the gradients, normalized by mini-batch size, at 5.0. In all our models (similar to [Press and Wolf2016]), we tie the output matrix in Eq. 15 to the embedding matrix (which also has 1,000 dimensions) used to represent the input words. We remove all sentences in the training set longer than 50 words and pad all sentences shorter than 50 words with a special symbol so that they all have the same length. We use a vocabulary of the 50,000 most frequent words in the training set, including three symbols to account for the padding of shorter sentences, the end of sequence, and out-of-vocabulary (OOV) words respectively.
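The early-stopping-with-rollback scheme described above can be sketched as a small training driver. The function names and the dict-based "model" are illustrative stand-ins, not the paper's implementation; only the patience/rollback logic follows the text.

```python
import copy

def train_with_patience(model, evaluate, train_epoch, patience=5):
    """Train until `patience` consecutive epochs pass without improvement
    on the dev set, then roll back to the best-performing parameters."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    bad_epochs = 0
    while bad_epochs < patience:
        train_epoch(model)
        loss = evaluate(model)  # e.g. avg negative log prob on the dev set
        if loss < best_loss:
            # Improvement: snapshot the parameters and reset the counter.
            best_loss, best_model, bad_epochs = loss, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
    return best_model  # rollback: the model with the best dev-set score
```

Returning a deep copy of the best snapshot (rather than the final state) is what the paper calls "rolling back" the parameters once the model runs out of patience.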

NMT Baseline. We used the model of sennrich-wmt:2016 (code available at http://data.statmt.org/rsennrich/wmt16_systems/) as our NMT baseline. This model is equal to the model of bahdanau:2015 apart from the fact that it uses BPE to reduce the size of the vocabulary. In this model we used word embeddings of 500 dimensions and 1,024 GRU units in both encoder and decoder. The recurrent weight matrices were initialized as random orthogonal matrices; the bias vectors were initialized at 0; the parameters of the attention layer (the encoder attention layer) and the remaining weight matrices were sampled from Gaussian distributions with fixed means and variances; the model was trained using Adadelta [Zeiler2012] (with a “patience” of 10 epochs) and the norm of the gradients (normalized by mini-batch size) was clipped.

SMT Baseline. We trained a PBSMT system using the Moses toolkit [Koehn et al.2007] with its baseline settings. We used a 5-gram LM trained with the KenLM toolkit [Heafield2011] with modified Kneser-Ney smoothing. In this case, we used the newstest2013 set (3K sentences) to tune this model.

4.2 Regular Language Experiments

To get some intuition about their performance on regular language, we evaluate our models on two benchmark datasets, the WMT 2014 and WMT 2015 test sets. We used the same Attentive NMT and baseline models trained for the idiom experiment described in Section 4.1.

Model                        BLEU   TER
English/German (EN/DE)
NMT                          14.4   72.4
SMT                          15.9   68.7
Dot & Attentive NMT          19.5   66.9
General & Attentive NMT      19.9   64.1
German/English (DE/EN)
NMT                          20.1   64.2
SMT                          24.0   60.6
Dot & Attentive NMT          13.9   74.1
General & Attentive NMT      14.3   73.8
Table 1: Results in terms of BLEU and TER scores on the idiom test set for the English/German language pair in both directions. “Dot” and “General” refer to the content function applied to compute the attention over the encoder states. Larger BLEU and smaller TER scores indicate better performance.
Model                        WMT 2014 (BLEU / TER)   WMT 2015 (BLEU / TER)
English/German (EN/DE)
SMT                          14.8 / 69.4             17.3 / 66.1
NMT                          13.6 / 72.3             13.8 / 72.4
Dot & Attentive NMT          15.3 / 67.1             17.2 / 64.8
General & Attentive NMT      16.1 / 68.7             17.8 / 66.2
German/English (DE/EN)
SMT                          21.6 / 58.6             22.9 / 56.9
NMT                          17.9 / 63.3             17.7 / 63.9
Dot & Attentive NMT          14.2 / 70.1             14.8 / 68.1
General & Attentive NMT      13.9 / 70.8             14.6 / 68.4
Table 2: Results in terms of BLEU and TER scores on the WMT 2014 and WMT 2015 test sets. “Dot” and “General” refer to the content function applied to compute the attention over the encoder states. Larger BLEU and smaller TER scores indicate better performance.

4.3 Results

Table 1 presents the results in terms of BLEU and TER scores on the idiom dataset of fritzinger:2010. Given that this idiom corpus was extracted from the data used to train all models, we can consider it an in-domain test set. In the English/German direction, both of our models performed better than all the baselines, including the SMT, in terms of both BLEU and TER. Our models had similar performance when the BLEU scores are considered, but there was a difference of almost 3 TER points between our best and our worst model. However, in the German/English direction, both of our models had the worst performance among all models in terms of both BLEU and TER scores. The SMT system performed best in this language direction, achieving almost 10 BLEU points more than our best model.

Table 2 shows the results of our experiments on the English/German language pair in terms of BLEU and TER scores. Although the baseline NMT is the model that won the WMT 2016 shared task on the English/German language pair in both directions, the results it obtained in our experiments were below the SMT baseline for both datasets (WMT 2014 and WMT 2015) and both metrics (BLEU and TER). However, the entries submitted by sennrich-wmt:2016 were the results obtained by an ensemble of these baseline models, including reranking and other post-processing steps on the output. As we are interested in comparisons of single models, we did not apply the same steps.

In the English/German direction, the General & Attentive NMT achieved the highest BLEU score on both the WMT 2014 and WMT 2015 test sets. It is noteworthy that the best model in terms of BLEU is not the best model in terms of TER: using that metric to score the systems, the Attentive NMT with the dot content function performed best on both test sets. None of the Attentive NMT models reached the same level of performance in the German/English direction as the baselines, i.e. NMT and SMT.

All results presented in this section were tested for significance using the method of koehn:2004, and all differences were found to be statistically significant.
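The significance method of koehn:2004 is paired bootstrap resampling; a minimal sketch is below. As an assumption for brevity, per-sentence scores stand in for recomputing corpus-level BLEU/TER on each resample, which is what the full procedure does.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004): draw
    test sets with replacement and count how often system A beats B.
    `scores_a`/`scores_b` are per-sentence scores for the same sentences."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap test set
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples  # fraction of resamples where A outscores B

# A beats B at significance level p when this fraction is at least 1 - p.
```

Using the *same* resampled indices for both systems is what makes the test paired: each comparison is over an identical bootstrap test set.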

5 Analysis of the Models

The results obtained by our Attentive NMT are mixed. The Attentive NMT models improved the translations from English into German, including the translation of idioms. Given that German syntax exhibits long-distance dependencies, this result may be an indication that our model helps bridge those dependencies by bringing forward past information that may have faded from the context on the target side, similar to the smoothing effect shown in the language modeling experiments of salton:2017b. However, that improvement is not observed when English is the target language. A mitigating point for consideration here is that, as shown by jean-wmt:2015, NMT models often struggle when translating into English (in comparison to translating into other languages), and even the inclusion of an attention module to score past states of the decoder may not overcome this challenge for NMT.

However, this still leaves open the question of why the baseline NMT system outperforms the Attentive NMT models when translating into English. Our hypothesis is that the attention mechanism within the decoder interferes with the encoder-decoder alignment mechanism. When translating between languages, there are interactions between input and output words that are highly informative to the system when making the next prediction and that are captured via word alignments [Koehn2010]. By introducing a module to score states of the decoder and subsequently merging this information with the alignments calculated by the attention module over the encoder states (as we do in Eq. 16, after the computation of the alignments with the Encoder states), we introduce a bias towards the decoder states and thus may weaken the information carried about input/output alignments. If the model is not robust enough to balance this trade-off, it will fail and produce poor translations, a fact observed in several of our models.

6 Conclusions

In this paper we have studied the inclusion of an attention module over the decoder states of an NMT system. Although this type of attention module has achieved good performance in language modeling, such improvements were limited in the NMT setting. We have shown that, although our systems do not achieve state-of-the-art results for the English/German direction, they are on par with the single models that compose the ensembles that won the WMT 2016 shared task and with baseline SMT systems. Despite the mixed results, we have demonstrated that in some cases the representations built by the attention module also improve the translation of idioms, especially when the target language has long-distance dependencies, as is the case for German.

In future work we plan to investigate the use of other architectures of memory-augmented RNN-LMs within the decoder of the NMT system. We also plan to integrate the computation of the attention over the encoder states and the attention over the decoder states into a single step.


References


  • [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’2015).
  • [Bojar et al.2015] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September. Association for Computational Linguistics.
  • [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    , pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.
  • [Daniluk et al.2017] Michal Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly Short Attention Spans in Neural Language Modeling. 5th International Conference on Learning Representations (ICLR’2017).
  • [Fritzinger et al.2010] Fabienne Fritzinger, Marion Weller, and Ulrich Heid. 2010. A survey of idiomatic preposition-noun-verb triples on token level. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).
  • [Grave et al.2017] Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. 5th International Conference on Learning Representations (ICLR’2017).
  • [Heafield2011] Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. volume 9, pages 1735–1780, Cambridge, MA, USA, November. MIT Press.
  • [Jean et al.2015] Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for wmt’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal, September. Association for Computational Linguistics.
  • [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv, abs/1412.6980.
  • [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
  • [Koehn2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP’2004), pages 388–395.
  • [Koehn2010] Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA.
  • [Luong et al.2015] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • [Merity et al.2017] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. 5th International Conference on Learning Representations (ICLR’2017).
  • [Press and Wolf2016] Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv, abs/1608.05859.
  • [Salton et al.2014] Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014. An Empirical Study of the Impact of Idioms on Phrase Based Statistical Machine Translation of English to Brazilian-Portuguese. In Third Workshop on Hybrid Approaches to Translation (HyTra) at 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 36–41.
  • [Salton et al.2017] Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2017. Attentive language models. In Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017).
  • [Sennrich et al.2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. In Proc. of the First Conference on Machine Translation (WMT16), Berlin, Germany.
  • [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112.
  • [Vilar et al.2006] D. Vilar, J. Xu, L. D’haro, and H. Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L06-1244.
  • [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.