1 Introduction
Recent successes of deep neural networks have spanned many domains, from computer vision [1] to speech recognition [2] and many other tasks. In particular, sequence-to-sequence recurrent neural networks (RNNs) with long short-term memory (LSTM) cells [3] have proven especially successful at natural language processing (NLP) tasks, including machine translation [4, 5, 6].
The basic sequence-to-sequence architecture for machine translation is composed of an RNN encoder which reads the source sentence one token at a time and transforms it into a fixed-size state vector. This is followed by an RNN decoder, which generates the target sentence, one token at a time, from the state vector. While a pure sequence-to-sequence recurrent neural network can already obtain good translation results [4, 6], it suffers from the fact that the whole sentence to be translated needs to be encoded into a single fixed-size vector. This clearly manifests itself in the degradation of translation quality on longer sentences (see Figure 6) and hurts even more when there is less training data [7].
In [5], a successful mechanism to overcome this problem was presented: a neural model of attention. In a sequence-to-sequence model with attention, one retains the outputs of all steps of the encoder and concatenates them into a memory tensor. At each step of the decoder, a probability distribution over this memory is computed and used to estimate a weighted average encoder representation to be used as input to the next decoder step. The decoder can hence focus on different parts of the encoder representation while producing tokens. Figure 1 illustrates a single step of this process.
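A single attention read of the kind described above can be sketched as follows. This is only an illustration: the dot-product score function and the names `attend`, `memory` and `query` are assumptions of the sketch, not the scoring network used in the attention model cited above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(memory, query):
    """One soft-attention read.

    memory: list of encoder output vectors (one per source token).
    query:  current decoder state; a plain dot product is used as the
            score function, which is an assumption of this sketch.
    Returns the attention mask and the weighted average of the memory.
    """
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    mask = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * vec[j] for w, vec in zip(mask, memory)) for j in range(dim)]
    return mask, context

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask, context = attend(memory, query=[0.0, 5.0])
```

Note that every memory element receives a score at every step, even though the mask then concentrates on only a few of them.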
The attention mechanism has proven useful well beyond the machine translation task. Image models can benefit from attention too; for instance, image captioning models can focus on the relevant parts of the image when describing it [8]; generative models for images yield especially good results with attention, as was demonstrated by the DRAW model [9], where the network focuses on a part of the image to produce at a given time. Another interesting use case for the attention mechanism is the Neural Turing Machine [10], which can learn basic algorithms and generalize beyond the length of the training instances.
While the attention mechanism is very successful, one important limitation is built into its definition. Since the attention mask is computed using a Softmax, it by definition tries to focus on a single element of the memory it is attending to. In the extreme case, also known as hard attention [8], one of the memory elements is selected and the selection is trained using the REINFORCE algorithm [11] (since this selection is not differentiable). It is easy to demonstrate that this restriction can make some tasks almost unlearnable for an attention model. For example, consider the task of adding two decimal numbers, presented one after another like this:
Input  1  2  5  0  +  2  3  1  5 
Output  3  5  6  5 
A recurrent neural network can keep the carry-over in its state and could learn to shift its attention to subsequent digits. But that is only possible if there are two attention heads, attending to the first and to the second number. If only a single attention mechanism is present, the model will have a hard time learning this task and will not generalize properly, as was demonstrated in [12, 13].
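To make the argument concrete, the following sketch adds the two numbers with two read positions, one per operand, and a carry that plays the role of the recurrent state. It is a hand-written illustration of why two positions must be read at every step, not a learned model; the function name is hypothetical.

```python
def add_with_two_heads(tokens):
    """Add two decimal numbers given as digit tokens around '+'.

    Two read positions (one per operand) move from the least significant
    digit to the most significant one; the carry is the only state kept.
    """
    plus = tokens.index("+")
    first, second = tokens[:plus], tokens[plus + 1:]
    head_a, head_b = len(first) - 1, len(second) - 1
    carry, out = 0, []
    while head_a >= 0 or head_b >= 0 or carry:
        a = int(first[head_a]) if head_a >= 0 else 0
        b = int(second[head_b]) if head_b >= 0 else 0
        carry, digit = divmod(a + b + carry, 10)
        out.append(str(digit))
        head_a, head_b = head_a - 1, head_b - 1
    return out[::-1]

result = add_with_two_heads(list("1250+2315"))
```

Each output digit needs one digit from each operand, so a single mask that focuses on one memory element cannot supply both inputs in one step.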
A solution to this problem, already proposed in the recent literature (for instance, the Neural GPU from [12]), is to allow the model to access and change all its memory at each decoding step. We will call this mechanism an active memory. While it might seem more expensive than attention models, it is actually not, since the attention mechanism needs to compute an attention score for all its memory as well in order to focus on the most appropriate part. The approximate complexity of an attention mechanism is therefore the same as the complexity of the active memory. In practice, we get step-times around second for an active memory model, the Extended Neural GPU introduced below, and second for a comparable model with an attention mechanism. But active memory can potentially make parallel computations on the whole memory, as depicted in Figure 2.
Active memory is a natural choice for image models as they usually operate on a canvas. And indeed, recent works have shown that actively updating the canvas that will be used to produce the final results can be beneficial. Residual networks [14], currently the best performing models on the ImageNet task, fall into this category. In [15] it was shown that the weights of different layers of a residual network can be tied (so it becomes recurrent), without degrading performance. Other models that operate on the whole canvas at each step were presented in [16, 17]. Both of these models are generative and show very good performance, yielding better results than the original DRAW model. Thus, the active memory approach seems to be a better choice for image models.
But what about non-image models? The Neural GPUs [12] demonstrated that active memory yields superior results on algorithmic tasks. But can it be applied to real-world problems? In particular, the original attention model brought great success to natural language processing, especially to neural machine translation. Can active memory be applied to this task on a large scale?
We answer this question positively, by presenting an extension of the Neural GPU model that yields good results for neural machine translation. This model allows us to investigate in depth a number of questions about the relationship between attention and active memory. We clarify why the previous active memory model did not succeed on machine translation by showing how it is related to the inherent dependencies in the target distributions, and we study a few variants of the model that show how a recurrent structure on the output side is necessary to obtain good results.
2 Active Memory Models
In the previous section, we used the term active memory broadly, referring to any model where every part of the memory undergoes active change at every step. This is in contrast to attention models where only a small part of the memory changes at every step, or where the memory remains constant.
The exact implementation of an active change of the memory might vary from model to model. In the present paper, we will focus on the most common implementations of this change, all of which rely on the convolution operator.
The convolution acts on a kernel bank and a 3-dimensional tensor. Our kernel banks are 4-dimensional tensors of shape $[k_w, k_h, m, m]$, i.e., they contain $k_w \cdot k_h \cdot m^2$ parameters, where $k_w$ and $k_h$ are kernel width and height. A kernel bank $U$ can be convolved with a 3-dimensional tensor $s$ of shape $[w, h, m]$, which results in the tensor $U * s$ of the same shape as $s$, defined by:
$$ (U * s)[x, y, i] = \sum_{u=-\lfloor k_w/2 \rfloor}^{\lfloor k_w/2 \rfloor} \; \sum_{v=-\lfloor k_h/2 \rfloor}^{\lfloor k_h/2 \rfloor} \; \sum_{c=1}^{m} s[x+u, y+v, c] \cdot U[u, v, c, i]. $$
In the equation above the index $x+u$ might sometimes be negative or larger than the size of $s$, and in such cases we assume the value is $0$. This corresponds to the standard convolution operator used in many deep learning toolkits, with zero padding on both sides and stride $1$. Using the standard operator has the advantage that it is heavily optimized and can directly benefit from any new work (e.g., [18]) on optimizing convolutions.
Given a memory tensor $s$, an active memory model will produce the next memory $s'$ by using a number of convolutions on $s$ and combining them. In the most basic setting, a residual active memory model will be defined as:
$$ s' = s + U * s, $$
i.e., it will only add to an already existing state.
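The convolution with zero padding and stride 1, and a residual update of the form "old memory plus convolution output", can be sketched in plain Python. The nested-list layout, the function names and the restriction to odd kernel sizes are assumptions of this sketch; a real implementation would call an optimized convolution from a deep learning toolkit.

```python
def conv(kernel, s):
    """Convolve a kernel bank with a memory tensor (zero padding, stride 1).

    kernel: nested lists of shape [kw][kh][m][m] (odd kw and kh assumed).
    s:      nested lists of shape [w][h][m].
    Returns a tensor of the same shape as s.
    """
    w, h, m = len(s), len(s[0]), len(s[0][0])
    kw, kh = len(kernel), len(kernel[0])
    out = [[[0.0] * m for _ in range(h)] for _ in range(w)]
    for x in range(w):
        for y in range(h):
            for a in range(kw):
                for b in range(kh):
                    xx, yy = x + a - kw // 2, y + b - kh // 2
                    if not (0 <= xx < w and 0 <= yy < h):
                        continue  # out-of-range entries count as 0
                    for c in range(m):
                        for i in range(m):
                            out[x][y][i] += s[xx][yy][c] * kernel[a][b][c][i]
    return out

def residual_step(kernel, s):
    """Residual update: the convolution output is added to the old memory."""
    delta = conv(kernel, s)
    return [[[sv + dv for sv, dv in zip(s_cell, d_cell)]
             for s_cell, d_cell in zip(s_row, d_row)]
            for s_row, d_row in zip(s, delta)]

# Tiny example: 3x1 memory with one map, kernel that copies the left neighbour.
s = [[[1.0]], [[2.0]], [[3.0]]]
kernel = [[[[1.0]]], [[[0.0]]], [[[0.0]]]]  # shape [3][1][1][1]
shifted = conv(kernel, s)
updated = residual_step(kernel, s)
```

In the toy example every cell is updated in the same step, which is the sense in which the whole memory changes actively.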
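While residual models have been successful in image analysis [14] and generation [16], they might suffer from the vanishing gradient problem in the same way as recurrent neural networks do. Therefore, in the same spirit as LSTM gates [3] and GRU gates [19] improve over pure RNNs, one can introduce convolutional LSTM and GRU operators. Let us focus on the convolutional GRU, which we define in the same way as in [12], namely:
$$ \mathrm{CGRU}(s) = u \odot s + (1 - u) \odot \tanh(U * (r \odot s) + B), \quad \text{where} \quad u = \sigma(U' * s + B') \ \text{ and } \ r = \sigma(U'' * s + B''). \tag{1} $$
Here $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $U, U', U''$ are kernel banks and $B, B', B''$ are learned biases.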
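A convolutional GRU step can be sketched as below, assuming the usual GRU form with convolutions in place of matrix products. The parameter layout (a dictionary with the hypothetical keys `update`, `reset` and `candidate`) and the biases given already broadcast to the shape of the memory are simplifications made for this illustration.

```python
import math

def conv(kernel, s):
    """Zero-padded, stride-1 convolution; kernel [kw][kh][m][m], s [w][h][m]."""
    w, h, m = len(s), len(s[0]), len(s[0][0])
    kw, kh = len(kernel), len(kernel[0])
    out = [[[0.0] * m for _ in range(h)] for _ in range(w)]
    for x in range(w):
        for y in range(h):
            for a in range(kw):
                for b in range(kh):
                    xx, yy = x + a - kw // 2, y + b - kh // 2
                    if 0 <= xx < w and 0 <= yy < h:
                        for c in range(m):
                            for i in range(m):
                                out[x][y][i] += s[xx][yy][c] * kernel[a][b][c][i]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def pointwise(fn, *tensors):
    """Apply fn element-wise to same-shaped [w][h][m] tensors."""
    return [[[fn(*vals) for vals in zip(*cells)]
             for cells in zip(*rows)]
            for rows in zip(*tensors)]

def cgru(s, params):
    """One convolutional GRU step on memory s.

    params maps "update", "reset" and "candidate" to (kernel, bias) pairs;
    each bias has the shape of s (a simplification of this sketch).
    """
    ku, bu = params["update"]
    kr, br = params["reset"]
    kc, bc = params["candidate"]
    u = pointwise(lambda a, b: sigmoid(a + b), conv(ku, s), bu)      # update gate
    r = pointwise(lambda a, b: sigmoid(a + b), conv(kr, s), br)      # reset gate
    gated = pointwise(lambda a, b: a * b, r, s)
    cand = pointwise(lambda a, b: math.tanh(a + b), conv(kc, gated), bc)
    return pointwise(lambda g, old, new: g * old + (1.0 - g) * new, u, s, cand)

s = [[[2.0]], [[-4.0]]]
zero_kernel = [[[[0.0]]]]              # shape [1][1][1][1]
zero_bias = [[[0.0]], [[0.0]]]
params = {k: (zero_kernel, zero_bias) for k in ("update", "reset", "candidate")}
out = cgru(s, params)
```

With all-zero parameters both gates equal 0.5 and the candidate is 0, so the step simply halves the memory; the update gate interpolates between keeping the old state and writing the new candidate.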
As a baseline for our investigation of active memory models, we will use the Neural GPU model from [12], depicted in Figure 3, and defined as follows. The given sequence $i = (i_1, \dots, i_n)$ of $n$ discrete symbols from $\{0, \dots, I\}$ is first embedded into the tensor $s_0$ by concatenating the vectors obtained from an embedding lookup of the input symbols into its first column. More precisely, we create the starting tensor $s_0$ of shape $[n, w, m]$ by using an embedding matrix $E$ of shape $[I, m]$ and setting $s_0[k, 0, :] = E[i_k]$ (in python notation) for all $k = 1, \dots, n$ (here $i_1, \dots, i_n$ is the input). All other elements of $s_0$ are set to $0$. Then, we apply $l$ different CGRU gates in turn for $n$ steps to produce the final tensor $s_{\mathrm{fin}}$:
$$ s_{t+1} = \mathrm{CGRU}_l(\mathrm{CGRU}_{l-1}(\dots \mathrm{CGRU}_1(s_t) \dots)) \quad \text{and} \quad s_{\mathrm{fin}} = s_n. $$
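The construction of the starting tensor and the repeated application of the gates can be sketched as follows. The gates are passed in as arbitrary callables, so the toy `halve` function below is only a stand-in for a CGRU layer, and all names are hypothetical.

```python
def initial_state(symbols, embedding, width):
    """Build a start tensor of shape [n][width][m]: embeddings in column 0, zeros elsewhere."""
    m = len(embedding[0])
    state = [[[0.0] * m for _ in range(width)] for _ in symbols]
    for k, sym in enumerate(symbols):
        state[k][0] = list(embedding[sym])
    return state

def run_neural_gpu(symbols, embedding, width, gates):
    """Apply every gate in turn, repeated once per input symbol.

    gates is a list of callables mapping a state to a state; in the model
    these would be CGRU layers, here any function with that signature works.
    """
    state = initial_state(symbols, embedding, width)
    for _ in range(len(symbols)):
        for gate in gates:
            state = gate(state)
    return state

def halve(state):
    """Toy stand-in for a gate: divides every entry by two."""
    return [[[v / 2 for v in cell] for cell in row] for row in state]

embedding = [[0.0, 0.0], [1.0, 0.5], [0.2, 0.9]]  # 3 symbols, m = 2
s0 = initial_state([2, 1], embedding, width=4)
s_fin = run_neural_gpu([2, 1], embedding, width=4, gates=[halve, halve])
```

With two input symbols and two gates, the toy state is halved four times, so the number of sequential steps grows with the input length.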
The result of a Neural GPU is produced by multiplying each item in the first column of $s_{\mathrm{fin}}$ by an output matrix $O$ to obtain the logits $l_k = O\, s_{\mathrm{fin}}[k, 0, :]$ and then selecting the largest one: $o_k = \mathrm{argmax}(l_k)$. During training we use the standard loss function, i.e., we compute a Softmax over the logits $l_k$ and use the negative log probability of the target as the loss.
2.1 The Markovian Neural GPU
The baseline Neural GPU model yields very poor results on neural machine translation: its per-word perplexity on WMT (see Section 3 for more details on the experimental setting) does not go below 30 (good models on this task go below ), and its BLEU scores are also very bad (below 5, while good models are higher than ). Which part of the model is responsible for such bad results?
It turns out that the main culprit is the output generator. As one can see in Figure 3 above, every output symbol is generated independently of all other output symbols, conditionally only on the state $s_{\mathrm{fin}}$. This is fine for learning purely deterministic functions, like the toy tasks the Neural GPU was designed for. But it does not work for harder real-world problems, where there could be multiple possible outputs for each input.
The most basic way to mitigate this problem is to make every output symbol depend on the previous output. This only changes the output generation, not the state, so the definition of the model is the same as above until $s_{\mathrm{fin}}$. The result is then obtained by multiplying by an output matrix $O$ each item from the first column of $s_{\mathrm{fin}}$ concatenated with the embedding of the previous output generated by another embedding matrix $E'$:
$$ l_k = O\, \mathrm{concat}(s_{\mathrm{fin}}[k, 0, :],\ E' o_{k-1}). $$
For $k = 0$ we use a special symbol $o_{k-1} = \mathrm{GO}$ and, to get the output, we select $o_k = \mathrm{argmax}(l_k)$. During training we use the standard loss function, i.e., we compute a Softmax over the logits $l_k$ and use the negative log probability of the target as the loss. Also, as is standard in recurrent networks [4], we use teacher forcing, i.e., during training we provide the true output label as $o_{k-1}$ instead of using the previous output generated by the model. This means that the loss incurred from generating $o_k$ does not directly influence the value of $o_{k+1}$. We depict this model in Figure 4.
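The Markovian output rule and the difference between greedy decoding and teacher forcing can be sketched as below. The `GO` id, the function names and the toy matrices are hypothetical; only the structure (state vector concatenated with the embedding of the previous output, Softmax loss on the target) follows the description above.

```python
import math

GO = 0  # hypothetical id of the special start symbol

def logits_for(state_vec, prev_symbol, out_matrix, out_embedding):
    """Logits from the state vector concatenated with the previous output's embedding."""
    features = list(state_vec) + list(out_embedding[prev_symbol])
    return [sum(w * f for w, f in zip(row, features)) for row in out_matrix]

def log_softmax(logits):
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]

def markovian_outputs(first_column, out_matrix, out_embedding, targets=None):
    """Greedy decoding when targets is None, teacher forcing otherwise.

    Returns the predicted symbols and the summed negative log-probability
    of the targets (0.0 when no targets are given).
    """
    prev, outputs, loss = GO, [], 0.0
    for k, state_vec in enumerate(first_column):
        logits = logits_for(state_vec, prev, out_matrix, out_embedding)
        pred = max(range(len(logits)), key=logits.__getitem__)
        outputs.append(pred)
        if targets is None:
            prev = pred                      # feed back the model's own output
        else:
            loss -= log_softmax(logits)[targets[k]]
            prev = targets[k]                # teacher forcing: feed the true label
    return outputs, loss

first_column = [[1.0], [1.0]]                      # two positions, m = 1
out_embedding = [[1.0, 0.0], [0.0, 1.0]]           # one-hot embeddings of 2 symbols
out_matrix = [[0.0, 0.0, 2.0], [0.0, 2.0, 0.0]]    # favours alternating outputs
greedy, _ = markovian_outputs(first_column, out_matrix, out_embedding)
forced, loss = markovian_outputs(first_column, out_matrix, out_embedding, targets=[1, 0])
```

In the toy example both positions have the same state vector, yet the outputs differ because each prediction is conditioned on the previous symbol.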
2.2 The Extended Neural GPU
The Markovian Neural GPU yields much better results on neural machine translation than the baseline model: its per-word perplexity reaches about 12 and its BLEU scores improve a bit. But these results are still far from those achieved by models with attention.
Could it be that the Markovian dependence of the outputs is too weak for this problem, that a full recurrent dependence of the state is needed for good performance? We test this by extending the baseline model with an active memory decoder, as depicted in Figure 5.
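The definition of the Extended Neural GPU follows the baseline model until $s_{\mathrm{fin}} = s_n$. We consider $s_n$ as the starting point for the active memory decoder, i.e., we set $d_0 = s_n$. In the active memory decoder we will also use a separate output tape tensor $p$ of the same shape as $d_0$, i.e., $p$ is of shape $[n, w, m]$. We start with $p_0$ set to all $0$ and define the decoder states by
$$ d_{t+1} = \mathrm{CGRU}^d_l(\mathrm{CGRU}^d_{l-1}(\dots \mathrm{CGRU}^d_1(d_t, p_t) \dots, p_t), p_t), $$
where $\mathrm{CGRU}^d$ is defined just like CGRU in Equation (1) but with additional input as highlighted below in bold:
$$ \mathrm{CGRU}^d(s, p) = u \odot s + (1 - u) \odot \tanh(U * (r \odot s) + \boldsymbol{W * p} + B), \quad \text{where} \quad u = \sigma(U' * s + \boldsymbol{W' * p} + B') \ \text{ and } \ r = \sigma(U'' * s + \boldsymbol{W'' * p} + B''). \tag{2} $$
We generate the $k$-th output by multiplying the $k$-th vector in the first column of $d_k$ by the output matrix $O$, i.e., $l_k = O\, d_k[k, 0, :]$. We then select $o_k = \mathrm{argmax}(l_k)$. The symbol $o_k$ is then embedded back into a dense representation using another embedding matrix $E'$ and we put it into the $k$-th place on the output tape $p$, i.e., we define
$$ p_{k+1} = p_k \ \text{ with } \ p_k[k, 0, :] \leftarrow E' o_k. $$
In this way, we accumulate (embedded) outputs step-by-step on the output tape $p$. Each step $d_t$ has access to all outputs produced in all steps before $t$.
Again, it is important to note that during training we use teacher forcing, i.e., we provide the true output labels for $o_k$ instead of using the outputs generated by the model.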
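The decoding loop with an output tape can be sketched as follows. The decoder update is passed in as a callable, so `toy_step` below is only a stand-in for the stacked gated decoder layers; the names and the exact step indexing are assumptions of this sketch.

```python
def decode_with_tape(d0, decoder_step, output_fn, out_embedding, targets=None):
    """Active-memory decoding loop with an output tape.

    d0:           start state, nested lists of shape [n][w][m].
    decoder_step: callable (d, tape) -> new d; hypothetical stand-in for
                  the stacked gated decoder layers.
    output_fn:    callable mapping a state vector to a list of logits.
    targets:      true labels for teacher forcing, or None for greedy decoding.
    """
    n, w, m = len(d0), len(d0[0]), len(d0[0][0])
    tape = [[[0.0] * m for _ in range(w)] for _ in range(n)]  # starts as all zeros
    d, outputs = d0, []
    for k in range(n):
        d = decoder_step(d, tape)
        logits = output_fn(d[k][0])            # k-th vector of the first column
        symbol = max(range(len(logits)), key=logits.__getitem__)
        outputs.append(symbol)
        fed = symbol if targets is None else targets[k]
        tape[k][0] = list(out_embedding[fed])  # write the embedded output at place k
    return outputs, tape

def toy_step(d, tape):
    """Toy stand-in: add the sum of everything written on the tape to every cell."""
    m = len(d[0][0])
    total = [sum(row[0][j] for row in tape) for j in range(m)]
    return [[[v + t for v, t in zip(cell, total)] for cell in row] for row in d]

d0 = [[[0.0, 1.0]], [[0.0, 1.0]]]
out_embedding = [[0.0, 0.0], [3.0, 0.0]]   # symbol 1 pushes map 0 up
outputs, tape = decode_with_tape(d0, toy_step, list, out_embedding)
```

In the toy run the first output is written to the tape and changes the whole state before the second output is read, which illustrates the full recurrent dependence on earlier outputs.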
2.3 Related Models
A convolutional architecture has already been used to obtain good results in word-level neural machine translation in [20] and more recently in [21]. These models use a standard RNN on top of the convolution to generate the output and avoid the output dependence problem in this way. But the state of this RNN has a fixed size, and in the first one the sentence representation generated by the convolutional network is also a fixed-size vector. Therefore, while superficially similar to active memory, these models are more similar to fixed-size memory models. The first one suffers from all the limitations of sequence-to-sequence models without attention [4, 6] that we discussed before.
Another recently introduced model, the Grid LSTM [22], might look less related to active memory, as it does not use convolutions at all. But in fact it is to a large extent an active memory model: the memory is on the diagonal of the grid of the running LSTM cells. The Reencoder architecture for neural machine translation introduced in that paper is therefore related to the Extended Neural GPU. But it differs in a number of ways. For one, the input is provided step-wise, so the network cannot start processing the whole input in parallel, as in our model. The diagonal memory changes in size and the model is a 3-dimensional grid, which might not be necessary for language processing. The Reencoder also does not use convolutions and this is crucial for performance. The experiments from [22] are only performed on a very small dataset of 44K short sentences. This is almost 1000 times smaller than the dataset we are experimenting with and makes it unclear whether Grid LSTMs can be applied to large-scale real-world tasks.
In image processing, in addition to the captioning [8] and generative models [16, 17] that we mentioned before, there are several other active memory models. They use convolutional LSTMs, an architecture similar to CGRU, and have recently been used for weather prediction [23] and image compression [24], in both cases surpassing the state of the art.
3 Experiments
Since all components of our models (defined above) are differentiable, we can train them using any stochastic gradient descent optimizer. For the results presented in this paper we used the Adam optimizer [25] with and gradient norm clipped to . The number of layers was set to , the width of the state tensors was constant at , the number of maps was , and the convolution kernels' width and height were always . (Our model was implemented using TensorFlow [26]. Its code is available as open source at https://github.com/tensorflow/models/tree/master/neural_gpu/.)
As our main test, we train the models discussed above and a baseline attention model on the WMT’14 English-French translation task. This is the same task that was used to introduce attention [5], but, to avoid the problem with the UNK token, we spell out each word that is not in the vocabulary. More precisely, we use a 32K vocabulary that includes all characters and the most common words, and every word that is not in the vocabulary is spelled out letter-by-letter. We also include a special SPACE symbol, which is used to mark spaces between characters (we assume spaces between words). We train without any data filtering on the WMT’14 corpus and test on the WMT’14 test set (newstest’14).
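One possible reading of this spell-out scheme is sketched below. The exact rules are not given in the text, so the following are assumptions of the sketch: the token `<SPACE>` is a hypothetical spelling of the SPACE symbol, every single character is assumed to be in the vocabulary, an implicit space is assumed next to any whole-word token, and SPACE is inserted only where two character tokens from different words meet.

```python
SPACE = "<SPACE>"  # hypothetical spelling of the special SPACE symbol

def tokenize(sentence, vocab):
    """Split on whitespace; spell out any word that is not in vocab."""
    tokens = []
    for word in sentence.split():
        pieces = [word] if (word in vocab and len(word) > 1) else list(word)
        if tokens and len(tokens[-1]) == 1 and len(pieces[0]) == 1:
            tokens.append(SPACE)  # word boundary between two character tokens
        tokens.extend(pieces)
    return tokens

def detokenize(tokens):
    """Invert tokenize under the same assumptions."""
    text, prev = "", None
    for tok in tokens:
        if tok == SPACE:
            text += " "
        elif prev is None or prev == SPACE or (len(prev) == 1 and len(tok) == 1):
            text += tok        # start, explicit space already added, or inside a spelling
        else:
            text += " " + tok  # implicit space next to a whole-word token
        prev = tok
    return text

vocab = {"the", "cat", "sat"}
tokens = tokenize("the zyx ab cat", vocab)
restored = detokenize(tokens)
```

Under this scheme no input word maps to an unknown token, at the price of longer sequences for rare words.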
As a baseline, we use a GRU model with attention that is almost identical to the original one from [5], except that it has 2 layers of GRU cells, each with 1024 units. Tokens from the vocabulary are embedded into vectors of size 512, and attention is put on the top layer. This model is identical to the one in [7], except that it uses GRU cells instead of LSTM cells. It has about M parameters, while our Extended Neural GPU model has about M parameters. Better results have been reported on this task with attention models with more parameters, but we aim at a baseline similar in size to the active memory model we are using.
When decoding from the Extended Neural GPU model, one has to provide the expected size of the output, as it determines the size of the memory. We test all sizes between input size and double the input size using a greedy decoder and pick the result with smallest log-perplexity (highest likelihood). This is expensive, so we only use a very basic beam-search with beam of size and no length normalization. It is possible to reduce the cost by predicting the output length: we tried a basic estimator based just on input sentence length and it decreased the BLEU score by . Better training and decoding could remove the need to predict output length, but we leave this for future work.
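The search over output sizes can be sketched as a simple loop. The `decode` callable is a hypothetical stand-in for running the model with a memory of the given size and returning the decoded tokens together with their log-perplexity.

```python
def best_over_lengths(source, decode):
    """Try every output size from len(source) to 2 * len(source).

    decode: callable (source, size) -> (tokens, log_perplexity); a stand-in
            for running the model with a memory of the given size.
    Returns the candidate with the smallest log-perplexity.
    """
    n = len(source)
    candidates = [decode(source, size) for size in range(n, 2 * n + 1)]
    return min(candidates, key=lambda cand: cand[1])

def toy_decode(source, size):
    """Toy stand-in: pretends that outputs of length 5 are the most likely."""
    return ["tok"] * size, 1.0 + 0.1 * abs(size - 5)

best_tokens, best_score = best_over_lengths(["a", "b", "c"], toy_decode)
```

The loop runs the decoder once per candidate size, which is why the cost grows with the input length and why predicting the output length would reduce it.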
For the baseline model, we use a full beam-search decoder with beam of size , length normalization and an attention coverage penalty in the decoder. This is a basic penalty that pushes the decoder to attend to all words in the source sentence. We experimented with more elaborate methods following [27] but it did not improve our results. The parameters for length normalization and coverage penalty are tuned on the development set (newstest’13). The final BLEU scores and per-word perplexities for these different models are presented in Table 1. Worse models have higher variance of their BLEU scores, so we only write < 5 for these models.

| Model | Perplexity (log) | BLEU |
|---|---|---|
| Neural GPU | 30.1 (3.5) | < 5 |
| Markovian Neural GPU | 11.8 (2.5) | < 5 |
| Extended Neural GPU | 3.3 (1.19) | 29.6 |
| GRU+Attention | 3.4 (1.22) | 26.4 |
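The two numbers in the perplexity column appear to be related by the natural exponential, i.e., per-word perplexity is roughly exp(log-perplexity). The short check below recomputes one from the other using the values printed in Table 1. The relation is an assumption of this sketch: it holds closely for the last two rows, while the first two rows agree only roughly, possibly because of rounding.

```python
import math

# (per-word perplexity, log-perplexity) pairs as printed in Table 1
table = {
    "Neural GPU": (30.1, 3.5),
    "Markovian Neural GPU": (11.8, 2.5),
    "Extended Neural GPU": (3.3, 1.19),
    "GRU+Attention": (3.4, 1.22),
}

# Recompute the perplexity from the reported log-perplexity.
recomputed = {name: math.exp(log_ppl) for name, (_, log_ppl) in table.items()}

for name, (ppl, _) in table.items():
    print(f"{name}: reported {ppl}, exp(log) = {recomputed[name]:.2f}")
```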
One can see from Table 1 that an active memory model can indeed match an attention model on the machine translation task, even with slightly fewer parameters. It is interesting to note that the active memory model does not need the length normalization that is necessary for the attention model (especially when rare words are spelled out). We conjecture that active memory inherently generalizes better from shorter examples and makes decoding easier, which is welcome news, since tuning decoders is a large problem in sequence-to-sequence models.
In addition to the summary results from Table 1, we analyzed the performance of the models on sentences of different lengths. This was the key problem solved by the attention mechanism, so it is worth asking if active memory solves it as well. In Figure 6 we plot the BLEU scores on the test set for sentences in each length bucket, bucketing by , i.e., for lengths and so on. We plot the curves for the Extended Neural GPU model, the baseline GRU model with attention, and, for comparison, we add the numbers for a non-attention model from Figure 2 of [5]. (Note that these numbers are for a model that uses different tokenization, so they are not fully comparable, but still provide a context.)
As can be seen, our active memory model is less sensitive to sentence length than the attention baseline. It indeed solves the problem that the attention mechanism was designed to solve.
Parsing.
In addition to the main large-scale translation task, we tested the Extended Neural GPU on English constituency parsing, the same task as in [7]. We only used the standard WSJ dataset for training. It is small by neural network standards, as it contains only 40K sentences. We trained the Extended Neural GPU with the same settings as above, only with (instead of ) and dropout of in each step. During decoding, we selected well-bracketed outputs with the right number of POS-tags from all lengths considered. Evaluated with the standard EVALB tool on the standard WSJ 23 test set, we got F1 score. This is lower than reported in [7], but we didn’t use any of their optimizations (no early stopping, no POS-tag substitution, no special tuning). Since a pure sequence-to-sequence model has F1 score well below , this shows that the Extended Neural GPU is versatile and can learn and generalize well even on small datasets.
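The filter applied during decoding can be sketched as a check on a linearized tree. The token format used here ("(S" opens a constituent, ")S" closes it, any other token is a POS-tag) is a hypothetical format chosen for illustration, and the function name is an assumption.

```python
def is_valid_parse(tokens, num_words):
    """Check a linearized tree: labelled brackets match and the POS-tag count fits.

    Assumed token format (hypothetical): "(S" opens a constituent, ")S"
    closes it, and any other token is a POS-tag.
    """
    stack, tags = [], 0
    for tok in tokens:
        if tok.startswith("("):
            stack.append(tok[1:])
        elif tok.startswith(")"):
            if not stack or stack.pop() != tok[1:]:
                return False
        else:
            tags += 1
    return not stack and tags == num_words

good = "(S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S".split()
candidates = [good, good[:-1], good[1:]]
# Keep only well-bracketed candidates with one POS-tag per input word.
valid = [c for c in candidates if is_valid_parse(c, num_words=5)]
```

Candidates decoded at different output sizes can be passed through such a filter before the most likely one is picked.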
4 Discussion
To better understand the main shortcoming of previous active memory models, let us look at the average log-perplexities of different active memory models in Table 1. A pure Neural GPU model yields 3.5, a Markovian one yields 2.5, and only a model with full dependence, trained with teacher forcing, achieves 1.19. The recurrent dependence in generating the output distribution turns out to be the key to achieving good performance.
We find it illuminating that the issue of dependencies in the output distribution can be disentangled from the particularities of the model or model class. In earlier works, such dependence (and training with teacher forcing) was always used in LSTM and GRU models, but very rarely in other kinds of models. We show that it can be beneficial to consider this issue separately from the model architecture. It allows us to create the Extended Neural GPU and this way of thinking might also prove fruitful for other classes of models.
When the issue of recurrent output dependencies is addressed, as we do in the Extended Neural GPU, an active memory model can indeed match or exceed attention models on a large-scale real-world task. Does this mean we can always replace attention by active memory?
The answer could be yes for the case of soft attention. Its cost is approximately the same as active memory, it performs much worse on some tasks like learning algorithms, and, with the introduction of the Extended Neural GPU, we do not know of a task where it performs clearly better.
Still, an attention mask is a very natural concept, and it is probable that some tasks can benefit from a selector that focuses on single items by definition. This is especially obvious for hard attention: it can be used over large memories with potentially much less computational cost than an active memory, so it might be indispensable for devising long-term memory mechanisms. Luckily, active memory and attention are not exclusive, and we look forward to investigating models that combine these mechanisms.
References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[2] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30–42, 2012.
[3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[4] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[6] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[7] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.
[8] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[9] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
[10] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
[11] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[12] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[13] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems (NIPS), 2015.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. CoRR, abs/1604.03640, 2016.
[16] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. CoRR, abs/1603.05106, 2016.
[17] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. CoRR, abs/1604.08772, 2016.
[18] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308, 2015.
[19] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.
[20] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings EMNLP 2013, pages 1700–1709, 2013.
[21] Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. Encoding source language with convolutional neural network for machine translation. In ACL, pages 20–30, 2015.
[22] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In International Conference on Learning Representations, 2016.
[23] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.
[24] George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations, 2016.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.
[27] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. CoRR, abs/1601.04811, 2016.