1 Introduction
Long Short-Term Memory (LSTM) networks are recurrent neural networks equipped with a special gating mechanism that controls access to memory cells
(Hochreiter & Schmidhuber, 1997). Since the gates can prevent the rest of the network from modifying the contents of the memory cells for multiple time steps, LSTM networks preserve signals and propagate errors for much longer than ordinary recurrent neural networks. By independently reading, writing and erasing content from the memory cells, the gates can also learn to attend to specific parts of the input signals and ignore other parts. These properties allow LSTM networks to process data with complex and separated interdependencies and to excel in a range of sequence learning domains such as speech recognition
(Graves et al., 2013), offline handwriting recognition (Graves & Schmidhuber, 2008), machine translation (Sutskever et al., 2014) and image-to-caption generation (Vinyals et al., 2014; Kiros et al., 2014).

Even for non-sequential data, the recent success of deep networks has shown that long chains of sequential computation are key to finding and exploiting complex patterns. Deep networks suffer from exactly the same problems as recurrent networks applied to long sequences: namely that information from past computations rapidly attenuates as it progresses through the chain (the vanishing gradient problem; Hochreiter, 1991) and that each layer cannot dynamically select or ignore its inputs. It therefore seems attractive to generalise the advantages of LSTM to deep computation.
We extend LSTM cells to deep networks within a unified architecture. We introduce Grid LSTM, a network that is arranged in a grid of one or more dimensions. The network has LSTM cells along any or all of the dimensions of the grid. The depth dimension is treated like the other dimensions and also uses LSTM cells to communicate directly from one layer to the next. Since the number of dimensions in the grid can easily be 2 or more, we propose a novel, robust way for modulating the N-way communication across the LSTM cells.
N-dimensional Grid LSTM (N-LSTM for short) can naturally be applied as a feedforward network as well as a recurrent one. One-dimensional Grid LSTM corresponds to a feedforward network that uses LSTM cells in place of transfer functions such as tanh and ReLU (Nair & Hinton, 2010). These networks are related to Highway Networks (Srivastava et al., 2015), where a gated transfer function is used to successfully train feedforward networks with up to 900 layers of depth. Grid LSTM with two dimensions is analogous to the Stacked LSTM (Graves et al., 2013; Sutskever et al., 2014), but it adds cells along the depth dimension too. Grid LSTM with three or more dimensions is analogous to Multidimensional LSTM (Graves et al., 2007; Graves, 2012), but differs from it not just by having the cells along the depth dimension, but also by using the proposed mechanism for modulating the N-way interaction, which is not prone to the instability present in Multidimensional LSTM.

We study some of the learning properties of Grid LSTM in various algorithmic tasks. We compare the performance of two-dimensional Grid LSTM to Stacked LSTM on computing the addition of two 15-digit integers without curriculum learning and on memorizing sequences of numbers (Zaremba & Sutskever, 2014). We find that in these settings having cells along the depth dimension is more effective than not having them; similarly, tying the weights across the layers is also more effective than untying the weights, despite the reduced number of parameters.
We also apply Grid LSTM to two empirical tasks. The architecture achieves 1.47 bits-per-character on the 100M-character Wikipedia dataset (Hutter, 2012), outperforming other neural networks. Secondly, we use Grid LSTM to define a novel neural translation model that re-encodes the source sentence based on the target words generated up to that point. The network outperforms the reference phrase-based CDEC system (Dyer et al., 2010) on the IWSLT BTEC Chinese-to-English translation task. The appendix contains additional results for Grid LSTM on learning parity functions and classifying MNIST images.
The outline of the paper is as follows. In Sect. 2 we describe the standard LSTM networks that comprise the background. In Sect. 3 we define the Grid LSTM architecture. In Sect. 4 we consider the experiments and we conclude in Sect. 5.
2 Background
We begin by describing the standard LSTM recurrent neural network and the derived Stacked and Multidimensional LSTM networks; some aspects of the networks motivate the Grid LSTM.
2.1 Long Short-Term Memory
The LSTM network processes a sequence of input and target pairs $(x_1, y_1), \dots, (x_m, y_m)$. For each pair the LSTM network takes the new input $x_t$ and produces an estimate for the target $y_t$ given all the previous inputs $x_1, \dots, x_t$. The past inputs determine the state of the network, which comprises a hidden vector $\mathbf{h} \in \mathbb{R}^d$ and a memory vector $\mathbf{m} \in \mathbb{R}^d$. The computation at each step is defined as follows (Graves et al., 2013):

$$
\begin{aligned}
\mathbf{g}^u &= \sigma(\mathbf{W}^u H) \\
\mathbf{g}^f &= \sigma(\mathbf{W}^f H) \\
\mathbf{g}^o &= \sigma(\mathbf{W}^o H) \\
\mathbf{g}^c &= \tanh(\mathbf{W}^c H) \\
\mathbf{m}' &= \mathbf{g}^f \odot \mathbf{m} + \mathbf{g}^u \odot \mathbf{g}^c \\
\mathbf{h}' &= \tanh(\mathbf{g}^o \odot \mathbf{m}')
\end{aligned}
\tag{1}
$$
where $\sigma$ is the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{W}^u, \mathbf{W}^f, \mathbf{W}^o, \mathbf{W}^c \in \mathbb{R}^{d \times 2d}$ are the recurrent weight matrices of the network and $H \in \mathbb{R}^{2d}$ is the concatenation of the new input $x_t$, transformed by a projection matrix $\mathbf{I}$, and the previous hidden vector $\mathbf{h}$:

$$
H = \begin{bmatrix} \mathbf{I}\,x_t \\ \mathbf{h} \end{bmatrix}
\tag{2}
$$
The computation outputs new hidden and memory vectors $\mathbf{h}'$ and $\mathbf{m}'$ that comprise the next state of the network. The estimate for the target is then computed in terms of the hidden vector $\mathbf{h}'$. We use the functional $\mathrm{LSTM}(\cdot, \cdot, \cdot)$ as shorthand for Eq. 1 as follows:

$$
(\mathbf{h}', \mathbf{m}') = \mathrm{LSTM}(H, \mathbf{m}, \mathbf{W})
\tag{3}
$$

where $\mathbf{W}$ concatenates the four weight matrices $\mathbf{W}^u, \mathbf{W}^f, \mathbf{W}^o, \mathbf{W}^c$.
One aspect of LSTM networks is the role of the gates $\mathbf{g}^u$, $\mathbf{g}^f$, $\mathbf{g}^o$ and $\mathbf{g}^c$. The forget gate $\mathbf{g}^f$ can delete parts of the previous memory vector $\mathbf{m}$ whereas the gate $\mathbf{g}^c$ can write new content to the new memory $\mathbf{m}'$ modulated by the input gate $\mathbf{g}^u$. The output gate $\mathbf{g}^o$ controls what is then read from the new memory $\mathbf{m}'$ onto the hidden vector $\mathbf{h}'$. The mechanism has two important learning properties. Each memory vector is obtained by a linear transformation of the previous memory vector and the gates; this ensures that the forward signals from one step to the other are not repeatedly squashed by a nonlinearity such as $\tanh$ and that the backward error signals do not decay sharply at each step, an issue known as the vanishing gradient problem (Hochreiter et al., 2001). The mechanism also acts as a memory and implicit attention system, whereby the signal from some input can be written to the memory vector and attended to in parts across multiple steps by being retrieved one part at a time.
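To make the transform concrete, here is a minimal NumPy sketch of Eqs. 1-3; the dimensions, the random initialisation and the helper names (lstm_transform, the projection matrix I) are illustrative assumptions rather than the exact setup used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transform(H, m, W):
    """One LSTM step (Eq. 1): H is the concatenated input, m the previous
    memory vector and W the vertical concatenation [W_u; W_f; W_o; W_c]
    of the four gate weight matrices."""
    W_u, W_f, W_o, W_c = np.split(W, 4, axis=0)
    g_u = sigmoid(W_u @ H)           # input gate
    g_f = sigmoid(W_f @ H)           # forget gate
    g_o = sigmoid(W_o @ H)           # output gate
    g_c = np.tanh(W_c @ H)           # candidate content
    m_new = g_f * m + g_u * g_c      # linear memory update
    h_new = np.tanh(g_o * m_new)     # new hidden vector
    return h_new, m_new

# Eq. 2: concatenate the projected input with the previous hidden vector.
d, x_dim = 4, 3
rng = np.random.default_rng(0)
I = rng.standard_normal((d, x_dim))             # input projection matrix
W = rng.standard_normal((4 * d, 2 * d)) * 0.1   # stacked gate weights
x_t, h_prev, m_prev = rng.standard_normal(x_dim), np.zeros(d), np.zeros(d)
H = np.concatenate([I @ x_t, h_prev])
h_next, m_next = lstm_transform(H, m_prev, W)   # Eq. 3
```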
2.2 Stacked LSTM
A model that is closely related to the standard LSTM network is Stacked LSTM (Graves et al., 2013; Sutskever et al., 2014). Stacked LSTM adds capacity by stacking LSTM layers on top of each other. The output hidden vector $\mathbf{h}'$ in Eq. 1 from the LSTM below is taken as the input to the LSTM above in place of $\mathbf{I}\,x_t$. The Stacked LSTM is depicted in Fig. 2. Note that although the LSTM cells are present along the sequential computation of each LSTM network, they are not present in the vertical computation from one layer to the next.
2.3 Multidimensional LSTM
Another related model is Multidimensional LSTM (Graves et al., 2007). Here the inputs are not arranged in a sequence, but in an $N$-dimensional grid, such as the two-dimensional grid of pixels in an image. At each input $x$ in the array the network receives $N$ hidden vectors $\mathbf{h}_1, \dots, \mathbf{h}_N$ and $N$ memory vectors $\mathbf{m}_1, \dots, \mathbf{m}_N$ and computes a hidden vector $\mathbf{h}$ and a memory vector $\mathbf{m}$ that are passed as the next state for each of the $N$ dimensions. The network concatenates the transformed input $\mathbf{I}\,x$ and the $N$ hidden vectors into a vector $H$ and, as in Eq. 1, computes $\mathbf{g}^u$ and $\mathbf{g}^c$, as well as $N$ forget gates $\mathbf{g}^f_i$. These gates are then used to compute the memory vector as follows:

$$
\mathbf{m} = \sum_{i=1}^{N} \mathbf{g}^f_i \odot \mathbf{m}_i + \mathbf{g}^u \odot \mathbf{g}^c
\tag{4}
$$
As the number of paths in a grid grows combinatorially with the size of each dimension and the total number of dimensions $N$, the values in $\mathbf{m}$ can grow at the same rate due to the unconstrained summation in Eq. 4. This can cause instability for large grids, and adding cells along the depth dimension increases $N$ and exacerbates the problem. This motivates the simple alternative way of computing the output memory vectors used in the Grid LSTM.
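To see concretely how the summation in Eq. 4 can inflate the memory values, consider the following sketch (the function name and shapes are assumptions made for illustration):

```python
import numpy as np

def md_lstm_memory(forget_gates, memories, g_u, g_c):
    """Multidimensional LSTM memory update (Eq. 4): one forget gate per
    incoming dimension; the gated memories are summed without any
    normalisation, so the magnitude of the result can grow with N."""
    return sum(g_f * m_i for g_f, m_i in zip(forget_gates, memories)) + g_u * g_c

# With N incoming memories of similar magnitude, the output memory can be
# roughly N times larger, and this compounds across a large grid.
N, d = 3, 4
m = md_lstm_memory([np.ones(d)] * N, [np.ones(d)] * N, np.ones(d), np.ones(d))
```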
3 Architecture
Grid LSTM deploys cells along any or all of the dimensions, including the depth of the network. In the context of predicting a sequence, the Grid LSTM has cells along two dimensions, the temporal one of the sequence itself and the vertical one along the depth. To modulate the interaction of the cells in the two dimensions, Grid LSTM uses a simple mechanism in which the values in the cells cannot grow combinatorially as in Eq. 4. In this section we describe the multidimensional blocks and the way in which they are combined to form a Grid LSTM.
3.1 Grid LSTM Blocks
As in Multidimensional LSTM, an N-dimensional block in a Grid LSTM receives as input $N$ hidden vectors $\mathbf{h}_1, \dots, \mathbf{h}_N$ and $N$ memory vectors $\mathbf{m}_1, \dots, \mathbf{m}_N$. Unlike the multidimensional case, the block outputs $N$ hidden vectors $\mathbf{h}'_1, \dots, \mathbf{h}'_N$ and $N$ memory vectors $\mathbf{m}'_1, \dots, \mathbf{m}'_N$ that are all distinct.
The computation is simple and proceeds as follows. The model first concatenates the input hidden vectors from the $N$ dimensions:

$$
H = \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_N \end{bmatrix}
\tag{5}
$$

Then the block computes $N$ transforms $\mathrm{LSTM}(\cdot, \cdot, \cdot)$, one for each dimension, obtaining the desired output hidden and memory vectors:

$$
\begin{aligned}
(\mathbf{h}'_1, \mathbf{m}'_1) &= \mathrm{LSTM}(H, \mathbf{m}_1, \mathbf{W}_1) \\
&\;\;\vdots \\
(\mathbf{h}'_N, \mathbf{m}'_N) &= \mathrm{LSTM}(H, \mathbf{m}_N, \mathbf{W}_N)
\end{aligned}
\tag{6}
$$
Each transform has distinct weight matrices $\mathbf{W}^u_i, \mathbf{W}^f_i, \mathbf{W}^o_i, \mathbf{W}^c_i \in \mathbb{R}^{d \times Nd}$ and applies the standard LSTM mechanism across the respective dimension. Note how the vector $H$ that contains all the input hidden vectors is shared across the $N$ transforms, whereas the input memory vectors affect the N-way interaction but are not directly combined. N-dimensional blocks can naturally be arranged in an N-dimensional grid forming a Grid LSTM. As for a block, the grid has $N$ sides with incoming hidden and memory vectors and $N$ sides with outgoing hidden and memory vectors. Note that a block does not receive a separate data representation. A data point is projected into the network via a pair of input hidden and memory vectors along one of the sides of the grid.
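A minimal sketch of an N-dimensional block (Eqs. 5-6), reusing the hedged lstm_transform sketch from Sect. 2.1; the list-based interface is an assumption made for illustration only:

```python
def grid_lstm_block(h_list, m_list, W_list):
    """One N-dimensional Grid LSTM block (Eqs. 5-6): the concatenated hidden
    vector H is shared across the N transforms, while each dimension keeps
    its own memory vector and its own weights W_i of shape (4*d, N*d)."""
    H = np.concatenate(h_list)                               # Eq. 5
    outputs = [lstm_transform(H, m_i, W_i)                   # Eq. 6
               for m_i, W_i in zip(m_list, W_list)]
    return [h for h, _ in outputs], [m for _, m in outputs]
```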
3.2 Priority Dimensions
In an N-dimensional block the transforms for all dimensions are computed in parallel. But it can be useful for a dimension to know the outputs of the transforms from the other dimensions, especially if the outgoing vectors from that dimension will be used to estimate the target. For instance, to prioritize the first dimension of the network, the block first computes the transforms for the other dimensions, obtaining the output hidden vectors $\mathbf{h}'_2, \dots, \mathbf{h}'_N$. Then the block concatenates these output hidden vectors and the input hidden vector $\mathbf{h}_1$ for the first dimension into a new vector $H'$ as follows:
$$
H' = \begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}'_2 \\ \vdots \\ \mathbf{h}'_N \end{bmatrix}
\tag{7}
$$
The vector $H'$ is then used in the final transform to obtain the prioritized output hidden and memory vectors $\mathbf{h}'_1$ and $\mathbf{m}'_1$.
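Under the same assumptions as the sketch above, prioritizing the first dimension only changes the order of the transforms and the vector seen by the first transform (Eq. 7):

```python
def grid_lstm_block_priority(h_list, m_list, W_list):
    """Grid LSTM block with dimension 1 prioritized: dimensions 2..N are
    transformed first, and their output hidden vectors replace the inputs in
    the vector seen by the first dimension (Eq. 7)."""
    H = np.concatenate(h_list)
    rest = [lstm_transform(H, m_i, W_i)
            for m_i, W_i in zip(m_list[1:], W_list[1:])]
    H_prime = np.concatenate([h_list[0]] + [h for h, _ in rest])   # Eq. 7
    h1, m1 = lstm_transform(H_prime, m_list[0], W_list[0])
    return [h1] + [h for h, _ in rest], [m1] + [m for _, m in rest]
```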
3.3 NonLSTM Dimensions
In Grid LSTM networks that have only a few blocks along a given dimension in the grid, it can be useful to just have regular connections along that dimension without the use of cells. This can be naturally accomplished inside the block by using, for that dimension in Eq. 6, a simple transformation with a nonlinear activation function instead of the LSTM transform. Given a weight matrix $\mathbf{V} \in \mathbb{R}^{d \times Nd}$, for the first dimension this looks as follows:

$$
\mathbf{h}'_1 = \alpha(\mathbf{V} H)
\tag{8}
$$
where $\alpha$ is a standard nonlinear transfer function or simply the identity. This allows us to see how, modulo the differences in the mechanism inside the blocks, Grid LSTM networks generalize the models in Sect. 2. A 2d Grid LSTM applied to temporal sequences with cells in the temporal dimension but not in the vertical depth dimension corresponds to the Stacked LSTM. Likewise, the 3d Grid LSTM without cells along the depth corresponds to Multidimensional LSTM, stacked with one or more layers.
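A non-LSTM dimension simply swaps the corresponding LSTM transform in the block for a plain projection (Eq. 8); V and alpha below are the weight matrix and transfer function named in the text, and the function name is an illustrative assumption:

```python
def non_lstm_dimension(H, V, alpha=np.tanh):
    """Eq. 8: a regular (possibly nonlinear) connection along one dimension,
    applied to the shared concatenated vector H instead of an LSTM transform.
    With alpha the identity the dimension becomes a plain linear layer."""
    return alpha(V @ H)
```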
3.4 Inputs from Multiple Sides
If we picture an N-dimensional block as in Fig. 1, we see that N of the sides of the block have input vectors associated with them and the other N sides have output vectors. As the blocks are arranged in a grid, this separation extends to the grid as a whole; each side of the grid has either input or output vectors associated with it. In certain tasks that have inputs of different types, a model can exploit this separation by projecting each type of input onto a different side of the grid. The mechanism inside the blocks ensures that the hidden and memory vectors from the different sides will interact closely without being conflated. This is the case in the neural translation model introduced in Sect. 4, where source words and target words are projected on two different sides of a Grid LSTM.
3.5 Weight Sharing
Sharing of weight matrices can be specified along any dimension in a Grid LSTM and it can be useful to induce invariance in the computation along that dimension. As in the translation and image models, if multiple sides of a grid need to share weights, capacity can be added to the model by introducing into the grid a new dimension without sharing of weights. If the weights are shared along all dimensions including the depth, we refer to the model as a Tied LSTM.
4 Experiments
4.1 Addition
We first experiment with LSTM networks on learning to sum two 15digit integers. The problem formulation is similar to that in (Zaremba & Sutskever, 2014)
, where each number is given to the network one digit at a time and the result is also predicted one digit at a time. The input numbers are separated by delimiter symbols and an end-of-result symbol is predicted by the network; these symbols, as well as the input and target padding, are indicated by a special padding symbol.

Contrary to the work in (Zaremba & Sutskever, 2014), which uses from 4 to 9 digits for the input integers, we fix the number of digits to 15, we do not use curriculum learning strategies and we do not feed digits from the partially predicted output back into the network, forcing the network to remember its partial predictions and making the task more challenging. The predicted output numbers have either 15 or 16 digits.
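As an illustration of the setup, the sketch below generates one training example under a hypothetical symbol encoding: the operand delimiter '+', the end-of-input '=', the end-of-result '.' and the padding '_' are stand-ins chosen here, not the symbols used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def addition_example(n_digits=15, pad="_"):
    """Generate one addition problem as aligned (input, target) strings.
    The target digits are emitted only after the whole input has been read,
    so the input positions of the target are padded and vice versa."""
    a = int(rng.integers(10 ** (n_digits - 1), 10 ** n_digits))
    b = int(rng.integers(10 ** (n_digits - 1), 10 ** n_digits))
    inp = f"{a}+{b}="
    out = f"{a + b}."          # result plus end-of-result symbol
    return inp + pad * len(out), pad * len(inp) + out

x, y = addition_example()      # two strings of equal length
```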
We compare the performance of two-dimensional Grid LSTM (2-LSTM) networks with that of standard Stacked LSTM (Fig. 2). We train the two types of networks with either tied or untied weights, with 400 hidden units each and with between 1 and 50 layers. We train the networks with stochastic gradient descent using minibatches of size 15 and the Adam optimizer with a learning rate of 0.001 (Kingma & Ba, 2014). We train the networks for up to 5 million samples or until they reach 100% accuracy on a random sample of 100 unseen addition problems. Note that since during training all samples are randomly generated, samples are seen only once and it is not possible for the network to overfit on the training data. The training and test accuracies agree closely.

Figure 4 reports the results of the experiments on the addition problem. The best performing tied 2-LSTM is 18 layers deep and learns to solve the task perfectly in less than 550K training samples. We find that tied 2-LSTM networks generally perform better than untied 2-LSTM networks, which is likely due to the repetitive nature of the steps involved in the addition algorithm. The best untied 2-LSTM network has 5 layers, learns more slowly and achieves a per-digit accuracy of 67% after 5 million examples. 2-LSTM networks in turn perform better than either tied or untied Stacked LSTM networks, where more stacked layers do not improve over the single-layer models. We see that the cells present a clear advantage for the deep networks by helping to mitigate the vanishing of gradients along the depth dimension.
4.2 Memorization
For our second algorithmic task, we analyze the performance of 2-LSTM networks on memorizing a random sequence of symbols. The sequences are 20 symbols long and we use a vocabulary of 64 symbols encoded as one-hot vectors and given to the network one symbol per step. The setup is similar to the one for addition above. The network is tasked with reading the input sequence and outputting the same sequence unchanged. Since the sequences are randomly generated, there is no correlation between successive symbols and the network must memorize the whole sequence without compression.
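A possible way to generate one memorization sample is sketched below; the one-hot encoding follows the description above, while the delimiter and padding handling is omitted and the function name is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def memorization_example(length=20, vocab=64):
    """One memorization sample: a random symbol sequence given as one
    one-hot vector per step; the target is the same sequence, to be
    emitted after the input has been read."""
    symbols = rng.integers(0, vocab, size=length)
    one_hot = np.eye(vocab)[symbols]      # shape (length, vocab)
    return one_hot, symbols               # input, target
```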
We train 2-LSTM and Stacked LSTM networks with either tied or untied weights on the memorization task. All networks have 100 hidden units and between 1 and 50 layers. We use minibatches of size 15 and optimize the networks using Adam with a learning rate of 0.001. As above, we train each network for up to 5 million samples or until it reaches 100% accuracy on 100 unseen samples. Accuracy is measured per individual symbol, not per sequence. We do not use curriculum learning or other training strategies.
Figure 5 reports the performance of the networks. The small number of hidden units contributes to making the training of the networks difficult. But we see that tied 2-LSTM networks are the most successful and learn to solve the task with the smallest number of samples. The 43-layer tied 2-LSTM network learns a solution with less than 150K samples. Although there is fairly high variance among the solving networks, deeper networks tend to learn faster. In addition, there is a large difference in the performance of tied 2-LSTM networks and tied Stacked LSTM networks. The latter perform with much lower accuracy, and Stacked LSTM networks with more than 16 layers do not reach an accuracy of more than 50%. Here we see that the optimization property of the cells in the depth dimension delivers a large gain. Similarly to the case of the addition problem, both the untied 2-LSTM networks and the untied Stacked LSTM networks take significantly longer to learn than the respective counterparts with tied weights, but the advantage of the cells in the depth direction clearly emerges for untied 2-LSTM networks too.
Figure 6:

Model | BPC | Parameters | Alphabet Size | Test data
Stacked LSTM (Graves, 2013) | 1.67 | 27M | 205 | last 4MB
MRNN (Sutskever et al., 2011) | 1.60 | 4.9M | 86 | last 10MB
GF-RNN (Chung et al., 2015) | 1.58 | 20M | 205 | last 5MB
Tied 2-LSTM | 1.47 | 16.8M | 205 | last 5MB

4.3 Character-Level Language Modelling
We next test the LSTM network on the Hutter challenge Wikipedia dataset (Hutter, 2012). The aim is to successively predict the next character in the corpus. The dataset has 100 million characters. We follow the splitting procedure of (Chung et al., 2015), where the last 5 million characters are used for testing. The alphabet has 205 characters in total.
We use a tied 2-LSTM with 1000 hidden units and 6 layers of depth. As in Fig. 2 and in the previous tasks, the characters are projected to form the initial input hidden and cell vectors, and the top softmax layer is connected to the topmost output hidden and cell vectors. The model has a total of 16.8M parameters (Fig. 6). As usual the objective is to minimize the negative log-likelihood of the character sequence under the model. Training is performed by sampling sequences of 10,000 characters and processing them in order. We backpropagate the errors every 50 characters. The initial cell and hidden vectors in the temporal direction are initialized to zero only at the beginning of each sequence; they maintain their forward-propagated values after each update in order to simulate full backpropagation. We use minibatches of 100, thereby processing 100 sequences of 10,000 characters each in parallel. The network is trained with Adam with a learning rate of 0.001 and training proceeds for approximately 20 epochs.
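The state handling described above (reset per 10,000-character sequence, truncation every 50 characters) can be sketched as follows; model_step is a stand-in for the actual Grid LSTM forward and backward pass, and the loop structure is an illustration rather than the exact training code:

```python
import numpy as np

def model_step(window, h, m):
    """Stand-in for the forward/backward pass over one 50-character window;
    it only illustrates the state flow, not the actual model."""
    return h, m, 0.0   # new hidden state, new memory state, window loss

def train_char_lm(corpus_ids, n_layers=6, d=1000, seq_len=10000, bptt=50):
    """Temporal state is zeroed only at the start of each 10,000-character
    sequence and carried across the 50-character truncated-BPTT windows."""
    h = np.zeros((n_layers, d))
    m = np.zeros((n_layers, d))
    for start in range(0, len(corpus_ids) - seq_len + 1, seq_len):
        h[:], m[:] = 0.0, 0.0                       # reset per sequence
        seq = corpus_ids[start:start + seq_len]
        for t0 in range(0, seq_len, bptt):
            h, m, loss = model_step(seq[t0:t0 + bptt], h, m)
            # a parameter update (e.g. Adam) would happen here after
            # each truncated window
```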
Figure 6 reports the bits-per-character performance, together with the number of parameters, of various recently proposed models on the dataset. The tied 2-LSTM significantly outperforms the other models despite having fewer parameters. More layers of depth and adding capacity by untying some of the weights are likely to further enhance the 2-LSTM.
4.4 Translation
We next use the flexibility of Grid LSTM to define a novel neural translation model. In the neural approach to machine translation one trains a neural network end-to-end to map the source sentence to the target sentence (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014). The mapping is usually performed within the encoder-decoder framework. A neural network, which can be convolutional or recurrent, first encodes the source sentence; the computed representation of the source then conditions a recurrent neural network to generate the target sentence. This approach has yielded strong empirical results, but it can suffer from a bottleneck. The encoding of the source sentence must contain information about all the words and their order; the decoder network in turn cannot easily revisit the unencoded source sentence to make decisions based on partially produced translations. This issue can be alleviated by a soft attention mechanism in the decoder neural network that uses gates to focus on specific parts of the source sentence (Bahdanau et al., 2014).
We use Grid LSTM to view translation in a novel fashion as a two-dimensional mapping. We call this the Re-encoder network. One dimension processes the source sentence whereas the other dimension produces the target sentence. The resulting network repeatedly re-encodes the source sentence conditioned on the part of the target sentence generated so far, thus functioning as an implicit attention mechanism. The size of the representation of the source sentence varies with its length, and the source sentence is repeatedly scanned based on each generated target word. As represented in Fig. 9, for each target word, beginning with the start-of-target-sentence symbol, the network scans the source sentence one way in the first layer and the other way in the second layer; the scan depends on all the target words that have been generated so far and at each block the two layers communicate directly. Note that, like the attention-based model (Bahdanau et al., 2014), the two-dimensional translation model has complexity $O(nm)$, where $n$ and $m$ are respectively the lengths of the source and the target; by contrast the recurrent encoder-decoder model only has complexity $O(n + m)$. This gives additional computational capacity to the former models.
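The two-dimensional scan can be pictured as a nested loop over target and source positions, reusing the grid_lstm_block sketch from Sect. 3.1. The sketch below is purely structural: where exactly the projected source and target words enter the grid and how the output distribution is read out are simplifying assumptions, not the precise wiring of the model.

```python
def reencoder_scan(src_proj, tgt_proj, W_list, d):
    """Structural sketch of the Re-encoder: for every target position j the
    whole source sentence is re-scanned; state flows along the source
    dimension (i -> i+1) and along the target dimension (j -> j+1), giving
    the O(n*m) cost noted in the text.
    src_proj / tgt_proj: projected (hidden, memory) pairs for the words."""
    row = list(src_proj)                  # target-dimension state per source position
    outputs = []
    for j in range(len(tgt_proj)):
        h_src, m_src = tgt_proj[j]        # source-dimension state seeded by target word j
        new_row = []
        for i in range(len(row)):
            h_tgt, m_tgt = row[i]
            (h_src, h_tgt), (m_src, m_tgt) = grid_lstm_block(
                [h_src, h_tgt], [m_src, m_tgt], W_list)
            new_row.append((h_tgt, m_tgt))
        row = new_row                     # conditions the next target step
        outputs.append(h_src)             # read-out for predicting the next target word
    return outputs
```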
Besides addressing the bottleneck, the two-dimensional setup aims at explicitly capturing the invariance present in translation. Translation patterns between two languages are invariant above all to the position and scale of the pattern. For instance, reordering patterns, such as the one that maps the English "do not verb" to the French "ne verb pas", or the one that sends a part of an English verb to the end of a German sentence, should be detected and applied independently of where they occur in the source sentence or of the number of words involved in that instance of the pattern. To capture this, the Grid LSTM translation model shares the weights across the source and target dimensions. In addition, a hierarchy of stacked two-dimensional grids in opposite directions is used to both increase capacity and help with learning longer-scale translation patterns. The resulting model is a three-dimensional Grid LSTM where the hierarchy grows along the third dimension. The model is depicted in Fig. 7.
Figure 8:

Model | Valid (1 ref) | Test (1 ref) | Valid (15 refs) | Test (15 refs)
DGLSTM-Attention (Yao et al., 2015) | - | 34.5 | - | -
CDEC (Dyer et al., 2010) | 30.1 | 41 | 50.1 | 58.9
3-LSTM (7 Models) | 30.3 | 42.4 | 51.8 | 60.2
Reference: thank you . please pay for this bill at the cashier .
Generated: thank you , ma ’am . please give this bill to the cashier and pay there .

Reference: how about having lunch with me some day ? i found a good restaurant near my hotel .
Generated: how about lunch with me ? i found a good restaurant near my hotel .
We evaluate the Grid LSTM translation model on the IWSLT BTEC Chinese-to-English corpus that consists of 44016 pairs of source and target sentences for training, 1006 for development and 503 for testing. The corpus has about 0.5M words in each language, a source vocabulary of 7055 Chinese words and a target vocabulary of 5646 English words (after replacing words that occur only once with the UNK symbol). Target sentences are on average around 12 words long. The development and test corpora come with 15 reference translations. The network uses two two-dimensional grids of 3-LSTM blocks for the hierarchy. Since the network has just two layers in the third dimension, we use regular identity connections without a nonlinear transfer function along the third dimension, as defined in Sect. 3.3; the source and target dimensions have tied weights and LSTM cells. The processing is bidirectional, in that the first grid processes the source sentence from beginning to end and the second one from end to beginning. This allows the shortest distance that the signal travels between input and output target words to be constant and independent of the length of the source. Note that the second grid receives an input coming from the grid below at each 3-LSTM block.

We train seven models with vectors of size 450 and apply dropout with probability 0.5 to the hidden vectors within the blocks. For the optimization we use Adam with a learning rate of 0.001. At decoding time the output probabilities are averaged across the models. The beam search has size 20 and we discard all candidates that are shorter than half the length of the source sentence. The results are shown in Fig. 8. Our best model reaches a perplexity of 4.54 on the test data. We use as baseline the state-of-the-art hierarchical phrase-based system CDEC
(Dyer et al., 2010). We see that the Grid LSTM significantly outperforms the baseline system on both the validation and test data sets.

5 Conclusion
We have introduced Grid LSTM, a network that uses LSTM cells along all of the dimensions and modulates in a novel fashion the multi-way interaction. We have seen the advantages of the cells compared to regular connections in solving tasks such as parity, addition and memorization. We have described powerful and flexible ways of applying the model to character prediction, machine translation and image classification, showing strong performance across the board.
Acknowledgements
We thank Koray Kavukcuoglu, Razvan Pascanu, Ilya Sutskever and Oriol Vinyals for helpful comments and discussions.
References
 Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
 Cho et al. (2014) Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
 Chung et al. (2015) Chung, Junyoung, Gülçehre, Çaglar, Cho, KyungHyun, and Bengio, Yoshua. Gated feedback recurrent neural networks. CoRR, abs/1502.02367, 2015. URL http://arxiv.org/abs/1502.02367.
 Ciresan et al. (2012) Ciresan, Dan Claudiu, Meier, Ueli, and Schmidhuber, Jürgen. Multi-column deep neural networks for image classification. arXiv:1202.2745 [cs.CV], 2012.
 Duch (2006) Duch, Wlodzislaw. K-separability. In Kollias, Stefanos, Stafylopatis, Andreas, Duch, Wlodzislaw, and Oja, Erkki (eds.), Artificial Neural Networks, ICANN 2006, volume 4131 of Lecture Notes in Computer Science, pp. 188–197. Springer Berlin Heidelberg, 2006. ISBN 978-3-540-38625-4. doi: 10.1007/11840817_20. URL http://dx.doi.org/10.1007/11840817_20.
 Duchi et al. (2010) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, EECS Department, University of California, Berkeley, Mar 2010. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-24.html.
 Dyer et al. (2010) Dyer, Chris, Lopez, Adam, Ganitkevitch, Juri, Weese, Johnathan, Ture, Ferhan, Blunsom, Phil, Setiawan, Hendra, Eidelman, Vladimir, and Resnik, Philip. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the Association for Computational Linguistics (ACL), 2010.

 Goodfellow et al. (2013) Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C., and Bengio, Yoshua. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1319–1327, 2013. URL http://jmlr.org/proceedings/papers/v28/goodfellow13.html.
 Graham (2014a) Graham, Benjamin. Spatially-sparse convolutional neural networks. CoRR, abs/1409.6070, 2014a. URL http://arxiv.org/abs/1409.6070.
 Graham (2014b) Graham, Benjamin. Fractional max-pooling. CoRR, abs/1412.6071, 2014b. URL http://arxiv.org/abs/1412.6071.
 Graves (2012) Graves, A. Supervised sequence labelling with recurrent neural networks, volume 385. Springer, 2012.
 Graves & Schmidhuber (2008) Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, volume 21, 2008.
 Graves et al. (2007) Graves, A., Fernández, S., and Schmidhuber, J. Multidimensional recurrent neural networks. In Proceedings of the 2007 International Conference on Artificial Neural Networks, Porto, Portugal, September 2007.
 Graves et al. (2013) Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In Proc ICASSP 2013, Vancouver, Canada, May 2013.
 Graves (2013) Graves, Alex. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013. URL http://arxiv.org/abs/1308.0850.
 Hochreiter (1991) Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
 Hochreiter et al. (2001) Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kremer and Kolen (eds.), A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
 Hohil et al. (1999) Hohil, Myron E., Liu, Derong, and Smith, Stanley H. Solving the n-bit parity problem using neural networks. Neural Networks, 12(9):1321–1323, 1999. doi: 10.1016/S0893-6080(99)00069-6. URL http://dx.doi.org/10.1016/S0893-6080(99)00069-6.
 Hutter (2012) Hutter, Marcus. The human knowledge compression contest, 2012. URL http://prize.hutter1.net.
 Kalchbrenner & Blunsom (2013) Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, October 2013. Association for Computational Linguistics.
 Kingma & Ba (2014) Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kiros et al. (2014) Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014. URL http://arxiv.org/abs/1411.2539.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 Lee et al. (2015) Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015. URL http://jmlr.org/proceedings/papers/v38/lee15a.html.
 Lin et al. (2013) Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. CoRR, abs/1312.4400, 2013. URL http://arxiv.org/abs/1312.4400.
 Mairal et al. (2014) Mairal, Julien, Koniusz, Piotr, Harchaoui, Zaïd, and Schmid, Cordelia. Convolutional kernel networks. In Advances in Neural Information Processing Systems, 2014. URL http://arxiv.org/abs/1406.3332.
 Minsky & Papert (1972) Minsky, Marvin and Papert, Seymour. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge MA, 1972.
 Nair & Hinton (2010) Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 807–814, 2010. URL http://www.icml2010.org/papers/432.pdf.

 Simard et al. (2003) Simard, Patrice Y., Steinkraus, David, and Platt, John C. Best practices for convolutional neural networks applied to visual document analysis. In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 2-Volume Set, 3-6 August 2003, Edinburgh, Scotland, UK, pp. 958–962, 2003. doi: 10.1109/ICDAR.2003.1227801. URL http://dx.doi.org/10.1109/ICDAR.2003.1227801.
 Srivastava et al. (2015) Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
 Sutskever et al. (2011) Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In ICML, 2011.
 Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Vinyals et al. (2014) Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
 Visin et al. (2015) Visin, Francesco, Kastner, Kyle, Cho, Kyunghyun, Matteucci, Matteo, Courville, Aaron C., and Bengio, Yoshua. ReNet: A recurrent neural network based alternative to convolutional networks. CoRR, abs/1505.00393, 2015. URL http://arxiv.org/abs/1505.00393.
 Wan et al. (2013) Wan, Li, Zeiler, Matthew D., Zhang, Sixin, LeCun, Yann, and Fergus, Rob. Regularization of neural networks using dropconnect. In ICML (3), volume 28 of JMLR Proceedings, pp. 1058–1066. JMLR.org, 2013. URL http://dblp.uni-trier.de/db/conf/icml/icml2013.html#WanZZLF13.
 Yao et al. (2015) Yao, Kaisheng, Cohn, Trevor, Vylomova, Katerina, Duh, Kevin, and Dyer, Chris. Depth-gated LSTM. CoRR, abs/1508.03790, 2015. URL http://arxiv.org/abs/1508.03790.
 Zaremba & Sutskever (2014) Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. CoRR, abs/1410.4615, 2014. URL http://arxiv.org/abs/1410.4615.
Appendix
We here report on two additional results, one algorithmic and the other empirical, where we see that, without special initialization or training tricks, a 1-LSTM network can learn to compute parity for up to 250 input bits, and a 3-LSTM network applied to images obtains strong results on MNIST.
A.1 Parity
We apply one-dimensional Grid LSTM to learning parity. Given a string of bits 0 or 1, the parity or generalized XOR of the string is defined to be 1 if the sum of the bits is odd, and 0 if the sum of the bits is even. Although manually crafted neural networks for the problem have been devised (Hohil et al., 1999), training a generic neural network from a finite number of examples and a generic random initialization of the weights to successfully learn to compute the parity of bit strings of significant length is a long-standing problem (Minsky & Papert, 1972; Duch, 2006). It is core to the problem that the bit string is given to the neural network as a whole through a single projection; considering one bit at a time and remembering the previous partial result in a recurrent or multi-step architecture reduces the problem of learning $k$-bit parity to the simple one of learning just 2-bit parity. Learning parity is difficult because a change in a single bit in the input changes the target value and the decision boundaries in the resulting space are highly non-linear.

We train 1-LSTM networks with tied weights and we compare them with fully-connected feedforward networks with ReLU or tanh activation functions and with either tied or untied weights. We search the space of hyperparameters as follows. The 1-LSTM networks are trained with either 500 or 1500 hidden units and with from 1 to 150 hidden layers, on input strings whose length increases in increments of 10 bits. The feedforward ReLU and tanh networks are trained with 500, 1500 or 3000 units, also with from 1 to 150 hidden layers, on input bit strings whose length increases in increments of 5 bits. Each network is trained up to a fixed maximum number of samples or four days of computation on a Tesla K40m GPU. For the optimization we use minibatches of size 20 and the AdaGrad rule with a learning rate of 0.06 (Duchi et al., 2010). A network is considered to have found the solution if it correctly computes the parity of 100 randomly sampled unseen bit strings. Due to the nature of the problem, during training the predicted accuracy is never better than random guessing, and when the network finds a solution the accuracy suddenly spikes to 100%.
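For reference, the target function itself is trivial to state; the difficulty lies entirely in learning it from a single projection of the whole bit string:

```python
import numpy as np

def parity(bits):
    """Generalized XOR: 1 if the number of ones in the string is odd, else 0."""
    return int(np.sum(bits) % 2)

parity(np.array([1, 0, 1, 1]))   # -> 1
```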
Figure 9 depicts the results of the experiments with 1-LSTM networks and Figure 10 relates the best performing networks of each type. For the feedforward ReLU and tanh networks with either tied or untied weights, we find that these networks fail to find solutions beyond short bit strings; some networks in the search space find solutions for the shorter input lengths. By contrast, as represented in Fig. 9, tied 1-LSTM networks find solutions for up to 250 bits.
There appears to be a correlation between the length of the input bit strings and the minimum depth of the 1-LSTM networks. The minimum depth of the networks increases with the number of input bits, suggesting that longer bit strings need more operations to be applied to them; however, the rate of growth is sublinear, suggesting that more than a single bit of the input is considered at every step. We visualized the activations of the memory vectors obtained via a feedforward pass through one of the 1-LSTM networks using selected input bit strings (Fig. 10). This revealed the prominent presence of counting neurons that keep a counter for the number of layers processed so far. These two aspects seem to suggest that the networks are using the cells to process the bit string sequentially by attending to parts of it at each step in the computation, a seemingly crucial feature that is not available with ReLU or tanh transfer functions.
A.2 MNIST Digit Recognition
In our last experiment we apply a 3-LSTM network to images. We consider non-overlapping patches of pixels in an image as forming a two-dimensional grid of inputs. The 3-LSTM performs computations with LSTM cells along three different dimensions. Two of the dimensions correspond to the two spatial dimensions of the grid, whereas the remaining dimension is the depth of the network. As in a convolutional neural network (LeCun et al., 1998), the same three-way transform of the 3-LSTM is applied at all parts of the grid, ensuring that the same features can be extracted across all parts of the input image. Due to the unbounded context size of the 3-LSTM, the computation of features at one end of the image can be influenced by the features computed at the other end of the image within the same layer. Due to the cells along the depth direction, features from the present patch can be passed on to the next layer either unprocessed or as processed by the layer itself as a function of neighboring patches.
We construct the network as depicted in Fig. 11. We divide the MNIST image into non-overlapping square pixel patches whose side is a small number of pixels, such as 2 or 4. The patches are then linearized and projected into two vectors of the size of the hidden layer of the 3-LSTM; the projected vectors are the input hidden and memory vectors at the first layer in the depth direction of the 3-LSTM. At each layer the computation of the 3-LSTM starts from one corner of the image, follows the two spatial dimensions and ends in the opposite corner of the image. The network has a few layers of depth, each layer starting the computation at one of the corners of the image. In the current form there is no pooling between successive layers of the 3-LSTM. The topmost layer concatenates all the output hidden and memory vectors at all parts of the grid. These are then passed through a layer of ReLUs and a final softmax layer.
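The patch extraction can be sketched as follows; the patch side k and the absence of any further preprocessing here are assumptions made for illustration:

```python
import numpy as np

def image_to_patch_grid(image, k=2):
    """Cut a 28x28 MNIST image into non-overlapping k x k patches and
    linearize each patch, giving the two-dimensional grid of inputs that
    the 3-LSTM scans from corner to corner."""
    h, w = image.shape
    grid = image.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    return grid.reshape(h // k, w // k, k * k)   # (rows, cols, k*k)

patches = image_to_patch_grid(np.zeros((28, 28)), k=2)   # shape (14, 14, 4)
```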
The setup has some similarity with the original application of Multidimensional LSTM to images (Graves, 2012) and with the recently described ReNet architecture (Visin et al., 2015). The difference with the former is that we apply multiple layers of depth to the image, use three-dimensional blocks and concatenate the top output vectors before classification. The difference with the ReNet architecture is that the 3-LSTM processes the image according to the two inherent spatial dimensions; instead of stacking hidden layers as in the ReNet, the block also modulates directly what information is passed along the depth dimension.
The training details are as follows. The MNIST dataset consists of 50000 training images, 10000 validation images and 10000 test images. The pixel values are normalized by dividing them by 255. Data augmentation is performed by shifting training images by 0 to 4 pixels in the horizontal and vertical directions and padding with zero values. The shift in the two directions is chosen uniformly at random. Validation samples are used for retraining the best model settings found during the grid search. We train the 3-LSTM both with and without cells in the depth dimension. The 3-LSTM with the cells has four LSTM layers with 100 hidden units and one ReLU layer with 4096 units. The 3-LSTM without the cells in the depth dimension has input patches of a different size, obtained by cropping the image; it also has four LSTM layers of 100 units and has a ReLU layer of 2048 units. For the latter model we use ReLU as the transfer function for the depth direction, as in Eq. 8. We use minibatches of size 128 and train the models using Adam and a learning rate of 0.001.
Figure 12 reports the test set errors of our models and those of competing approaches. We can see that even in the absence of pooling the 3-LSTM with the cells performs near the state of the art. The 3-LSTM without the cells also performs quite well; the cells in the depth direction likely help with the feature extraction at the higher layers. The other approaches, with the exception of ReNet, are convolutional neural networks.
Figure 12:

Model | Test Error (%)
Wan et al. (Wan et al., 2013) | 0.28
Graham (Graham, 2014a) | 0.31
Untied 3-LSTM | 0.32
Ciresan et al. (Ciresan et al., 2012) | 0.35
Untied 3-LSTM with ReLU (*) | 0.36
Mairal et al. (Mairal et al., 2014) | 0.39
Lee et al. (Lee et al., 2015) | 0.39
Simard et al. (Simard et al., 2003) | 0.4
Graham (Graham, 2014b) | 0.44
Goodfellow et al. (Goodfellow et al., 2013) | 0.45
Visin et al. (Visin et al., 2015) | 0.45
Lin et al. (Lin et al., 2013) | 0.47