Memory-Augmented Neural Networks (MANN) are a new class of recurrent neural network (RNN
) that separate computation from memory. The key distinction between MANNs and other RNNs such as Long Short-Term Memory cells (LSTM)  is the existence of an external memory unit. A controller network in the MANN receives input, interacts with the external memory unit via read and write heads and produces output. MANNs have been shown to learn faster and generalize better than LSTMs on a range of artificial sequential learning tasks [7, 8, 18]. Despite their success on artificial tasks, LSTM based models remain the preferred choice for many commercially important sequence learning tasks such as handwriting recognition , machine translation  and speech recognition .
, where the encoder and decoder are often LSTMs or other gated RNNs such as the Gated Recurrent Unit, are a class of neural network models that have achieved state-of-the-art performance on many language pairs for machine translation [12, 16]
. An encoder RNN reads the source sentence one token at a time. The encoder both maintains an internal vector representing the full source sentence and it encodes each token in the source sentence into a vector often assumed to represent the meaning of that token in its surrounding context. The decoder receives the internal vector from the encoder and can read from the encoded source sentence when producing the target sentence.
Attentional encoder-decoders can be seen as a basic form of MANN. The collection of vectors representing the encoded source sentence can be viewed as external memory which is written to by the encoder and read from by the decoder. But attentional encoder-decoders do not have the same range of capabilities as MANNs such as the Neural Turing Machine (NTM)  or Differentiable Neural Computer (DNC) . The encoder RNN in attentional encoder-decoders must write a vector at each timestep and this write must be to a single memory location. The encoder is not able to update previously written vectors and has only one write head. The decoder has read only access to the encoded source sentence and typically just a single read head. Widely used attention mechanisms [1, 13] do not have the ability to iterate through the source sentence from a previously attended location. All of these capabilities are present in NTMs and DNCs.
In this paper we propose two extensions to the attentional encoder-decoder which add several capabilities present in other MANNs. We are also the first that we are aware of to evaluate the performance of MANNs applied directly to machine translation.
We briefly review how attention weights are computed for Luong attention  and how addresses are computed for the NTM. Alternative attention mechanisms have similar computations  and likewise for alternative MANNs such as DNCs .
2.1 Luong Attention
At each timestep during the decoding of an attentional encoder-decoder a weighting over the encoded source sentence is computed, where and . The predicted token at that timestep during decoding is then a function of the decoder RNN hidden state and the weighted sum of the encoder hidden states i.e. .
The difference between various attention mechanisms is how to compute the weighting . In Luong attention  the weighting is computed as the softmax over scaled scores for each source sentence token, eq. 2. The scores for each source sentence token are computed as the dot product of decoder RNN hidden state and encoder RNN hidden state
which is first linearly transformed by a matrix.
2.2 NTM Addressing
Rather than computing weightings over an encoded source sentence, NTMs have a fixed sized external memory unit which is a memory matrix. represents the number of memory locations and the dimension of each memory cell. A controller neural network has read and write heads into the memory matrix. Addresses for read and write heads in a NTM are computed somewhat similarly to attention mechanisms. However in addition to being able to address memory using the similarity between a lookup key and memory contents, so called content based addressing, NTMs also have the ability to iterate from current or past addresses. This enables NTMs to learn a broader class of algorithms than attentional encoder-decoders [7, 8].
At each timestep (), for each read and write head the controller network outputs a set of parameters; a lookup key , a scaling factor
, an interpolation gate, a shift kernel (s.t. and ) and a sharpening parameter which are used to compute the weighting over the N memory locations in the memory matrix as follows:
We can see that is computed similarly to Luong attention and allows for content based addressing. represents a lookup key into memory and
is some similarity measure such as cosine similarity:
NTMs enable iteration from current or previously computed memory weights as follows:
where (5) enables the network to choose whether to use the current content based weights or the previous weight vector, (6) enables iteration through memory by convolving the current weighting by a 1-D convolutional shift kernel and (7) corrects for any blurring occurring as a result of the convolution operation.
The vector read by a particular read head at timestep is computed as a weighted sum over memory locations similarly to Luong attention:
An attentional encoder-decoder has no write mechanism. Another way to view this, is that an attentional encoder-decoder has a memory matrix with equal to the source sentence length and the encoder must always write its hidden state to the memory location corresponding to its position in the source sentence. A NTM does have a write operation, with write addresses determining a weighting over memory locations for the write. Each write head modifies the memory matrix by outputting erase () and add () vectors which are then used to softly zero out existing memory contents and write new memory contents through addition:
3 Proposed Models
We propose two models which bridge the gap between the attentional encoder-decoder and MANNs, extending the attentional encoder-decoder with additional mechanisms inspired by MANNs. We also propose the application of MANNs directly to machine translation.
3.1 Neural Turing Machine Style Attention
The reads from a decoder in an attentional encoder-decoder for machine translation often exhibit monotonic iteration through the encoded source sentence [1, 15]. However widely used attention mechanisms have no way to explicitly encode such a strategy. NTMs combine content based addressing similar to attention mechanisms with the ability to iterate through memory. We propose a new attention mechanism which combines the content based addressing of Luong attention  with the ability to iterate through memory from NTMs. For our proposed attention mechanism at each timestep () the decoder outputs a set of parameters for each of its read heads: , , , (s.t. and ) and which are used to compute the weighting over encoded source sentence for
Equations (11) and (12) represent the standard content based addressing of Luong style attention. Equations (13-15) replicate equations (5-7) of the NTM to enable iteration from the currently attended source sentence token or the previously attended token . As with the NTM equation (14) represents a 1D convolution on the weighting with a convolutional shift kernel which is outputted by the decoder to enable iteration. Equation (15) corrects for any blurring resulting from the 1D convolution. We can see that such an attention mechanism has the content based addressing capability of Luong attention and the same capability to iterate from previously computed addresses as NTMs.
3.2 Memory-Augmented Decoder (M.A.D)
The introduction of attention mechanisms has proved highly successful for neural machine translation. Attention extends the writable memory capacity of the encoder in an encoder-decoder model linearly with the length of the source sentence. This avoids the bottleneck of having to encode the whole source sentence meaning into the fixed size vector passed from the encoder to decoder[1, 2]. But the decoder in an attentional encoder-decoder must still maintain a history of its past actions in a fixed size vector. We are motivated by the success of attention which extended the memory capacity of the encoder to propose the addition of an external memory unit to the decoder of an attentional encoder-decoder, hence extending the decoder’s memory capacity, fig. 1. We still maintain a read-only attention mechanism into the encoded source sentence, however the decoder now has the ability to read and write to an external memory unit. We can set the external memory unit to have a number of memory locations greater than the maximum target sentence length in the corpus, thus scaling the decoder’s memory capacity with the target sentence length in a similar vain as to how attention scaled the encoder’s memory capacity with source sentence length.
We note that a similar model has been proposed before , but that in order to train their model the authors propose a pre-training approach based on first training without the external memory unit attached to decoder and then adding it on. This approach restricts the form of possible memory interactions as it must be possible to add the external memory unit while maintaining the pre-trained weights of the attentional encoder-decoder. We simply make the decoder a NTM with the standard read and write heads into an external memory and an additional read head into the encoded source sentence with the addresses on this read head computed in Luong attention style, but other choices for the addressing mechanism are possible, including DNC style addressing. Following a recent stable NTM implementation  we do not have any problems training our proposed model.
3.3 Pure Memory-Augmented Neural Network
We propose a pure MANN model for machine translation, fig. 2. Under our proposed model a MANN receives the embedded source sentence as input one token at a time and then receives an end of sequence token. The MANN must then output the target sentence. We are motivated by the enhanced performance of MANNs compared to LSTMs on artificial sequence learning tasks [7, 8, 14].
We note that our proposed model has the representational capability to learn a solution similar to an attentional encoder-decoder by simply writing an encoding of each source sentence token to a single memory location and reading the encodings back using content based addressing after the end of sequence token.
We also highlight the differences to the attentional encoder-decoder model. The pure MANN model may have multiple read and write heads each of which uses more powerful addressing mechanisms than popular attention mechanisms. The proposed model may also update previously written locations in light of new information or reuse memory locations if the previous contents have already served their purpose. There is no separation between encoding and decoding and thus only a single RNN is used as the MANN’s controller rather than different RNN cells for the encoder and decoder in an encoder-decoder, halving the number of network parameters dedicated to this part of the network.
In this paper for the pure MANN model we use and compare NTMs and DNCs as the choice MANN, however any other MANN with differentiable read and write mechanisms into an external memory unit would be permissible. In both cases we use a LSTM controller. We also compare the use of multiple read and write heads for the NTM model.
We evaluate our models on two machine translation tasks. As a low resource spoken language task we use the 2015 International Workshop on Spoken Language’s dataset of English to Vietnamese translated TED talks. We follow  in their preprocessing and setup and use their results as a baseline. For training we use TED tst2013, a dataset of 133K sentence pairs. As the validation set we use TED tst2012 and test set results are reported on the TED tst2015 dataset. We use a fixed vocabulary of 17.5K words and 7.7K words for English and Vietnamese respectively. Any words outside the source or target vocabulary are mapped to an unknown token (UNK).
As a medium resource written language task we follow  in their general setup for the Romanian to English task from the ACL’s 2016 First Conference on Machine Translation’s, Machine Translation of News Task. We use their results as a baseline. We train our models on the Europarl English Romanian dataset which consists of 600k sentence pairs. We use the newsdev2016 and newstest2016 datasets as the validation and test sets respectively. We Byte Pair Encode  the source and target languages with 89,500 merge operations. After Byte Pair Encoding, the English vocabulary size is 48,824 sub-words and 65,699 sub-words for the Romanian vocabulary.
For all models we use the Adam optimizer 
with an initial learning rate of 0.001. We train for a fixed number of steps but after each epoch we measure the BLEU score on the validation set and measure the test set performance from the version of the model with the highest validation set BLEU score. For the VietnameseEnglish models we train for 14,000 steps and for the RomanianEnglish models we train for 120,000 steps.
For all models we use beam search with a beam width of 10. We set the dropout rate to 0.3 with no other regularization applied. For both the VietnameseEnglish and RomanianEnglish tasks we follow [12, 16] in using a stack of 2 x 512 unit LSTMs as the encoder and decoder for all relevant models and the controller network for the MANNs. For the memory-augmented decoder the number of memory locations is set to 64 and each memory location is a 512 dimensional vector. Whereas for the pure MANN model the number of memory locations is set to 128 with the memory cell size also set to 512.
The test set BLEU scores for the VietnameseEnglish translation task are all very similar, with each model’s score within the range of 23.1-23.8 BLEU (table 1). Interestingly, despite the pure MANN models seeing the source sentence in a uni-directional fashion (with all other models using bi-directional encoders) the pure MANN models perform on par with the other models.
|NTM Style Attention||21.5||23.6|
|M.A.D. (1 R/W head)||21.1||23.1|
|M.A.D. (2 R/W heads)||21.2||23.8|
|Pure MANN (NTM - 1 R/W head)||20.9||23.5|
|Pure MANN (NTM - 2 R/W heads)||21.3||23.5|
|Pure MANN (DNC - 1 R/W head)||20.6||23.6|
The attentional encoder-decoder  has the highest test set BLEU score of all the models for the RomanianEnglish translation task (table 2). The proposed extensions to the attentional encoder-decoder result in 0.3-0.9 lower test set BLEU score. For the RomanianEnglish translation task, the pure MANN model has 1.5-1.9 lower test set BLEU score.
|NTM Style Attention||30.0||28.7|
|M.A.D. (1 R/W head)||29.8||28.9|
|M.A.D. (2 R/W heads)||29.7||28.3|
|Pure MANN (NTM - 1 R/W head)||28.9||27.7|
|Pure MANN (NTM - 2 R/W heads)||28.0||27.3|
|Pure MANN (DNC - 1 R/W head)||27.8||27.5|
We now examine the attention weights for an attentional encoder-decoder and address computation for the 1 R/W head NTM on a particular RomanianEnglish translation. The sentence was chosen as it was the first sentence in our test set which had the same translation from both models. We note that the pattern of addresses are typical of the addresses computed on other sentences for both language pairs, but that a single typical example is presented for brevity.
We see that we replicate the monotonic iteration through the source sentence often observed in attention mechanisms [1, 15] in fig. 3. We note that this pattern of addressing must be computed solely using the content based addressing of the attention mechanism as no iteration capability is available to the attention weight computation.
We now examine how the NTM has computed the addresses for its read and write head in order to arrive at the same resulting translation. Looking first at the write head, fig. 4, we see that as the NTM is shown the source sentence it has learned a very similar strategy to the encoder of an attentional encoder-decoder. In particular we can see that the write head writes an encoded version of the source sentence tokens to successive memory locations, fig. 3(a). Interestingly we see that the successive memory locations are computed using the iteration cabability of the NTM as the content based addresses are not significant, fig. 3(b) and the shift kernel is iterating forward through memory, fig. 3(d) from the address at the previous timestep as can be seen from the interpolation gate, fig. 3(c). If we interpret the encoded source sentence for an attentional encoder-decoder as being written to memory, then this is precisely the form of addresses we would see - except that in the case of a NTM the addressing strategy is learned not hardcoded. This suggests that this particular inductive bias built into the attentional encoder-decoder is a sensible one.
The attentional encoder-decoder leaves the encoded source sentence unchanged during decoding as it has no write mechanism. However we observe that the write head is active during decoding for the NTM, fig. 3(a). We see that the NTM uses content based addressing, fig. 3(b) to write to the memory locations that are previously read from by the read head , fig. 5. This suggests that perhaps the NTM has developed a strategy of marking particular source sentence tokens as completed so as not to retranslate them later during decoding. Interestingly such a mechanism is built directly into the DNC  and in fact monotonic attention mechanisms have been developed which prevent retranslation of previously translated tokens or preceeding tokens in the source sentence . But here of course this strategy is learned from random initialization by the NTM.
Having seen that the NTM learns to write the encoded source sentence to successive memory locations we are not surprised that as the predicted sentence is produced the NTM reads the from memory locations similarly to the attentional encoder-decoder. We see that the previously written to memory locations are then read back from, fig. 4(a). Interestingly, we see the read head addresses of the NTM as it produces the predicted sentence are heavily determined by its content based addressing, fig. 4(b). Thus the NTM does not make significant use of its iteration capability, despite exhibiting the type of monotonic iteration through the source sentence as has been observed with attention mechanisms.
We also note that the read head of the NTM is not particularly focused as the NTM sees the source sentence, fig. 4(a). This is somewhat surprising as the results of the read operation are available to the controller at the next timestep and thus could be used to retrieve the encoding of a previous source sentence token or a summary of a section of the source sentence rather than relying on the LSTM controller memory solely for this. We suspect that this behaviour is the result of the read head operation not being available to the write head at the current timestep and thus cannot be used to disambiguate the current token as has been the motivation for the successful Transformer NMT model . Thus, we suggest that extending the NTM and other MANNs depth-wise to have successive rather than parallel operations on the memory matrix at each timestep may be a fruitful avenue of future research.
We have proposed a series of MANN inspired models for machine translation. Two of these models; NTM Style Attention and the Memory-Augmented Decoder extend the attentional encoder-decoder which has achieved state-of-the-art results on many language pairs. These extensions perform 0.2-0.5 BLEU better than the attentional encoder-decoder alone on the low resource VietnameseEnglish translation task and 0.3-0.9 lower BLEU on the RomanianEnglish translation task. We conclude that a content based addressing mechanism is sufficient to encode a strategy of monotonic iteration through source sentences and that enabling the network to express this strategy directly does not significantly improve translation quality. From the Memory-Augmented Decoder results it appears as though extending the memory capacity of the decoder in an attentional encoder-decoder does not offer an advantage, contrary to previous results .
Our third proposed model is to just use MANNs directly for machine translation. As far as we are aware we are the first to publish results on MANNs used directly for machine translation. The pure MANN model performs marginally better, +0.2-0.3 BLEU, than the attentional encoder-decoder for the VietnameseEnglish translation task. Performance is 0.3-1.9 BLEU worse for the RomanianEnglish translation task. We conclude that MANNs in their current form do not improve over the attentional encoder-decoder for machine translation. Our analysis of the algorithm learned by the pure MANN shows that despite being randomly initialized the pure MANN learns a very similar solution to the attentional encoder-decoder.
We note that the performance gap between the pure MANN and attentional encoder-decoder is not very large and that the pure MANN model is very general and does not incorporate any domain specific knowledge. MANNs are a relatively new architecture that have received less attention than encoder-decoder approaches. We expect that with the development of improved MANN architectures, MANNs could achieve state-of-the-art results for machine translation.
This publication emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 13/RC/2106.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §1, §2, §3.1, §3.2, §5.1.
-  (2014) On the properties of neural machine translation: encoder–decoder approaches. Conference Proceedings In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Cited by: §3.2.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §1.
-  (2018) Implementing neural turing machines. In International Conference on Artificial Neural Networks, pp. 94–104. Cited by: §3.2.
Towards end-to-end speech recognition with recurrent neural networks.
International Conference on Machine Learning, pp. 1764–1772. Cited by: §1.
-  (2009) A novel connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern analysis and machine intelligence 31 (5), pp. 855–868. External Links: Cited by: §1.
-  (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §1, §1, §2.2, §3.3.
-  (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471. External Links: Cited by: §1, §1, §2.2, §2, §3.3, §5.1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. External Links: Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
-  (2017) Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt. Cited by: §4.
-  (2015) Stanford neural machine translation systems for spoken language domains. Conference Proceedings In Proceedings of the International Workshop on Spoken Language Translation, pp. 76–79. Cited by: §1, §4, §4, Table 1.
Effective approaches to attention-based neural machine translation.
Conference Proceedings In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §1, §1, §2.1, §2, §3.1.
-  (2016) Scaling memory-augmented neural networks with sparse reads and writes. Conference Proceedings In Advances in Neural Information Processing Systems, pp. 3621–3629. Cited by: §3.3.
-  (2017) Online and linear-time attention by enforcing monotonic alignments. In International Conference on Machine Learning, pp. 2837–2846. Cited by: §3.1, §5.1, §5.1.
-  (2016) Edinburgh neural machine translation systems for wmt 16. Conference Proceedings In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Vol. 2, pp. 371–376. Cited by: §1, §4, §4, Table 2, §5.
-  (2016) Neural machine translation of rare words with subword units. Conference Proceedings In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1715–1725. Cited by: §4.
-  (2015) End-to-end memory networks. Conference Proceedings In Advances in neural information processing systems, pp. 2440–2448. Cited by: §1.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §5.1.
-  (2016) Memory-enhanced decoder for neural machine translation. Conference Proceedings In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 278–286. Cited by: §3.2, §6.
-  (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1.