1 Introduction
Recurrent neural networks (RNN, see, e.g., Rumelhart et al., 1986) have recently become a popular choice for modeling variable-length sequences. RNNs have been successfully used for various tasks such as language modeling (see, e.g., Graves, 2013; Pascanu et al., 2013a; Mikolov, 2012; Sutskever et al., 2011), learning word embeddings (see, e.g., Mikolov et al., 2013a), online handwriting recognition (Graves et al., 2009) and speech recognition (Graves et al., 2013).
In this work, we explore deep extensions of the basic RNN. Depth for feedforward models can lead to more expressive models (Pascanu et al., 2013b), and we believe the same should hold for recurrent models. We claim that, unlike in the case of feedforward neural networks, the depth of an RNN is ambiguous. In one sense, if we call a neural network deep whenever it is a composition of several nonlinear computational layers, RNNs are already deep, since any RNN can be expressed as a composition of multiple nonlinear layers when unfolded in time.
Schmidhuber (1992) and El Hihi and Bengio (1996) earlier proposed another way of building a deep RNN by stacking multiple recurrent hidden states on top of each other. This approach potentially allows the hidden state at each level to operate at a different timescale (see, e.g., Hermans and Schrauwen, 2013). Nonetheless, we notice that there are some other aspects of the model that may still be considered shallow. For instance, the transition between two consecutive hidden states at a single level is shallow when viewed separately. This has implications for what kind of transitions this model can represent, as discussed in Section 3.2.3.
Based on this observation, in this paper, we investigate possible approaches to extending an RNN into a deep RNN. We begin by studying which parts of an RNN may be considered shallow. Then, for each shallow part, we propose an alternative deeper design, which leads to a number of deeper variants of an RNN. The proposed deeper variants are then empirically evaluated on two sequence modeling tasks.
The layout of the paper is as follows. In Section 2 we briefly introduce the concept of an RNN. In Section 3 we explore different concepts of depth in RNNs. In particular, in Sections 3.3.1–3.3.2 we propose two novel variants of deep RNNs and evaluate them empirically in Section 5 on two tasks: polyphonic music prediction (Boulanger-Lewandowski et al., 2012) and language modeling. Finally, we discuss the shortcomings and advantages of the proposed models in Section 6.
2 Recurrent Neural Networks
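A recurrent neural network (RNN) is a neural network that simulates a discrete-time dynamical system that has an input $x_t$, an output $y_t$ and a hidden state $h_t$. In our notation the subscript $t$ represents time. The dynamical system is defined by
$$h_t = f_h(x_t, h_{t-1}) \tag{1}$$
$$y_t = f_o(h_t) \tag{2}$$
where $f_h$ and $f_o$ are a state transition function and an output function, respectively. Each function is parameterized by a set of parameters, $\theta_h$ and $\theta_o$.
Given a set of $N$ training sequences $D = \left\{ \left( (x_1^{(n)}, y_1^{(n)}), \dots, (x_{T_n}^{(n)}, y_{T_n}^{(n)}) \right) \right\}_{n=1}^{N}$, the parameters of an RNN can be estimated by minimizing the following cost function:
$$J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} d\left(y_t^{(n)}, f_o(h_t^{(n)})\right) \tag{3}$$
where $h_t^{(n)} = f_h(x_t^{(n)}, h_{t-1}^{(n)})$ and $h_0^{(n)} = 0$. $d(a, b)$ is a predefined divergence measure between $a$ and $b$, such as Euclidean distance or cross-entropy.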
2.1 Conventional Recurrent Neural Networks
A conventional RNN is constructed by defining the transition function and the output function as
$$h_t = f_h(x_t, h_{t-1}) = \phi_h\left(W^\top h_{t-1} + U^\top x_t\right) \tag{4}$$
$$y_t = f_o(h_t) = \phi_o\left(V^\top h_t\right) \tag{5}$$
where $W$, $U$ and $V$ are respectively the transition, input and output matrices, and $\phi_h$ and $\phi_o$ are element-wise nonlinear functions. It is usual to use a saturating nonlinear function such as a logistic sigmoid function or a hyperbolic tangent function for $\phi_h$. An illustration of this RNN is in Fig. 2 (a).
The parameters of the conventional RNN can be estimated by, for instance, the stochastic gradient descent (SGD) algorithm with the gradient of the cost function in Eq. (3) computed by backpropagation through time (Rumelhart et al., 1986).
3 Deep Recurrent Neural Networks
3.1 Why Deep Recurrent Neural Networks?
Deep learning is built around a hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow one (Bengio, 2009). A number of recent theoretical results support this hypothesis (see, e.g., Le Roux and Bengio, 2010; Delalleau and Bengio, 2011; Pascanu et al., 2013b). For instance, it has been shown by Delalleau and Bengio (2011) that a deep sum-product network may require exponentially fewer units to represent the same function compared to a shallow sum-product network. Furthermore, there is a wealth of empirical evidence supporting this hypothesis (see, e.g., Goodfellow et al., 2013; Hinton et al., 2012a, b). These findings make us suspect that the same argument should apply to recurrent neural networks.
3.2 Depth of a Recurrent Neural Network
Depth is defined in the case of feedforward neural networks as having multiple nonlinear layers between input and output. Unfortunately, this definition does not apply trivially to a recurrent neural network (RNN) because of its temporal structure. For instance, any RNN when unfolded in time as in Fig. 1 is deep, because a computational path from the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers.
A close analysis of the computation carried out by an RNN (see Fig. 2 (a)) at each time step individually, however, shows that certain transitions are not deep, but are only the result of a linear projection followed by an element-wise nonlinearity. It is clear that the hidden-to-hidden ($h_{t-1} \to h_t$), hidden-to-output ($h_t \to y_t$) and input-to-hidden ($x_t \to h_t$) functions are all shallow in the sense that there exists no intermediate, nonlinear hidden layer.
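As a minimal sketch of this point, the single step below writes out a conventional RNN in plain Python. The names `matvec_t`, `sigmoid` and `rnn_step`, the list-of-rows matrix layout, the omission of biases, and the choice of tanh and a logistic sigmoid for the two nonlinearities are assumptions made for illustration only. Each of the three maps is one linear projection followed by an element-wise nonlinearity, with no intermediate layer.

```python
import math

def matvec_t(M, v):
    """Compute M^T v for a matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def sigmoid(a):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-a))

def rnn_step(x, h_prev, W, U, V):
    """One step of a conventional RNN (illustrative names and layout).

    W: hidden-to-hidden, U: input-to-hidden, V: hidden-to-output.
    """
    # Hidden-to-hidden and input-to-hidden: one affine map, then tanh.
    pre_h = [a + b for a, b in zip(matvec_t(W, h_prev), matvec_t(U, x))]
    h = [math.tanh(a) for a in pre_h]
    # Hidden-to-output: again one affine map, then a sigmoid.
    y = [sigmoid(a) for a in matvec_t(V, h)]
    return h, y
```

Making any one of these maps deeper amounts to replacing its single affine-plus-nonlinearity stage with a small multilayer network.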
We can now consider different types of depth of an RNN by considering those transitions separately. We may make the hidden-to-hidden transition deeper by having one or more intermediate nonlinear layers between two consecutive hidden states ($h_{t-1}$ and $h_t$). At the same time, the hidden-to-output function can be made deeper, as described previously, by plugging multiple intermediate nonlinear layers between the hidden state $h_t$ and the output $y_t$. Each of these choices has a different implication.
3.2.1 Deep Input-to-Hidden Function
A model can exploit more non-temporal structure from the input by making the input-to-hidden function deep. Previous work has shown that higher-level representations of deep networks tend to better disentangle the underlying factors of variation than the original input (Goodfellow et al., 2009; Glorot et al., 2011b) and flatten the manifolds near which the data concentrate (Bengio et al., 2013). We hypothesize that such higher-level representations should make it easier to learn the temporal structure between successive time steps because the relationship between abstract features can generally be expressed more easily. This has been, for instance, illustrated by the recent work of Mikolov et al. (2013b) showing that word embeddings from neural language models tend to be related to their temporal neighbors by simple algebraic relationships, with the same type of relationship (adding a vector) holding over very different regions of the space, allowing a form of analogical reasoning.
This approach of making the input-to-hidden function deeper is in line with the standard practice of replacing input with extracted features in order to improve the performance of a machine learning model (see, e.g., Bengio, 2009). Recently, Chen and Deng (2013) reported that better speech recognition performance could be achieved by employing this strategy, although they did not jointly train the deep input-to-hidden function together with the other parameters of an RNN.
3.2.2 Deep Hidden-to-Output Function
A deep hidden-to-output function can be useful to disentangle the factors of variation in the hidden state, making it easier to predict the output. This allows the hidden state of the model to be more compact and may result in the model being able to summarize the history of previous inputs more efficiently. Let us denote an RNN with this deep hidden-to-output function a deep output RNN (DORNN).
Instead of having feedforward, intermediate layers between the hidden state and the output, Boulanger-Lewandowski et al. (2012) proposed to replace the output layer with a conditional generative model such as restricted Boltzmann machines or the neural autoregressive distribution estimator (Larochelle and Murray, 2011). In this paper we only consider feedforward intermediate layers.
3.2.3 Deep Hidden-to-Hidden Transition
The third knob we can play with is the depth of the hidden-to-hidden transition. The state transition between the consecutive hidden states effectively adds a new input to the summary of the previous inputs represented by the fixed-length hidden state. Previous work with RNNs has generally limited the architecture to a shallow operation: an affine transformation followed by an element-wise nonlinearity. Instead, we argue that this procedure of constructing a new summary, or a hidden state, from the combination of the previous one and the new input should be highly nonlinear. This nonlinear transition could allow, for instance, the hidden state of an RNN to rapidly adapt to quickly changing modes of the input, while still preserving a useful summary of the past. This may be impossible to model with a function from the family of generalized linear models. However, this highly nonlinear transition can be modeled by an MLP with one or more hidden layers, which has a universal approximator property (see, e.g., Hornik et al., 1989).
An RNN with this deep transition will be called a deep transition RNN (DTRNN) throughout the remainder of this paper. This model is shown in Fig. 2 (b).
This approach of having a deep transition, however, introduces a potential problem. As the introduction of a deep transition increases the number of nonlinear steps the gradient has to traverse when propagated back in time, it might become more difficult to train the model to capture long-term dependencies (Bengio et al., 1994). One possible way to address this difficulty is to introduce shortcut connections (see, e.g., Raiko et al., 2012) in the deep transition, where the added shortcut connections provide shorter paths, skipping the intermediate layers, through which the gradient is propagated back in time. We refer to an RNN having a deep transition with shortcut connections as DT(S)RNN (see Fig. 2 (b*)).
Furthermore, we will call an RNN having both a deep hidden-to-output function and a deep transition a deep output, deep transition RNN (DOTRNN). See Fig. 2 (c) for the illustration of DOTRNN. If we consider shortcut connections as well in the hidden-to-hidden transition, we call the resulting model DOT(S)RNN.
An approach similar to the deep hidden-to-hidden transition has been proposed recently by Pinheiro and Collobert (2014) in the context of parsing a static scene. They introduced a recurrent convolutional neural network (RCNN) which can be understood as a recurrent network whose transition between consecutive hidden states (and from input to hidden state) is modeled by a convolutional neural network. The RCNN was shown to speed up scene parsing and obtained state-of-the-art results on the Stanford Background and SIFT Flow datasets.
Ko and Dieter (2009) proposed deep transitions for Gaussian Process models. Earlier, Valpola and Karhunen (2002) used a deep neural network to model the state transition in a nonlinear, dynamical state-space model.
3.2.4 Stack of Hidden States
An RNN may be extended deeper in yet another way by stacking multiple recurrent hidden layers on top of each other (Schmidhuber, 1992; El Hihi and Bengio, 1996; Jaeger, 2007; Graves, 2013). We call this model a stacked RNN (sRNN) to distinguish it from the other proposed variants. The goal of such a model is to encourage each recurrent level to operate at a different timescale.
It should be noted that the DTRNN and the sRNN extend the conventional, shallow RNN in different aspects. If we look at each recurrent level of the sRNN separately, it is easy to see that the transition between the consecutive hidden states is still shallow. As we have argued above, this limits the family of functions it can represent. For example, if the structure of the data is sufficiently complex, incorporating a new input frame into the summary of what has been seen up to now might be an arbitrarily complex function. In such a case we would like to model this function by something that has universal approximator properties, such as an MLP. The model cannot rely on the higher layers to do so, because the higher layers do not feed back into the lower layer. On the other hand, the sRNN can deal with multiple timescales in the input sequence, which is not an obvious feature of the DTRNN. The DTRNN and the sRNN are, however, orthogonal in the sense that it is possible to have the features of both by stacking multiple levels of DTRNNs to build a stacked DTRNN, which we do not explore further in this paper.
3.3 Formal descriptions of deep RNNs
Here we give a more formal description of how the deep transition recurrent neural network (DTRNN) and the deep output RNN (DORNN) as well as the stacked RNN are implemented.
3.3.1 Deep Transition RNN
We noticed from the state transition equation of the dynamical system simulated by RNNs in Eq. (1) that there is no restriction on the form of $f_h$. Hence, we propose here to use a multilayer perceptron to approximate $f_h$ instead.
In this case, we can implement $f_h$ by $L$ intermediate layers such that
$$h_t = f_h(x_t, h_{t-1}) = \phi_h\left(W_L^\top \phi_{L-1}\left(W_{L-1}^\top \phi_{L-2}\left(\cdots \phi_1\left(W_1^\top h_{t-1} + U^\top x_t\right)\right)\right)\right)$$
where $\phi_l$ and $W_l$ are the element-wise nonlinear function and the weight matrix for the $l$-th layer. This RNN with a multilayered transition function is a deep transition RNN (DTRNN).
An illustration of building an RNN with the deep state transition function is shown in Fig. 2 (b). In the illustration the state transition function is implemented with a neural network with a single intermediate layer.
This formulation allows the RNN to learn a nontrivial, highly nonlinear transition between the consecutive hidden states.
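A minimal sketch of such a multilayered transition in plain Python follows. It assumes matrices stored as lists of rows, one nonlinearity `phi` shared by every layer (the text allows a different one per layer), no biases, and no shortcut connections; the names `matvec_t` and `deep_transition` are hypothetical.

```python
import math

def matvec_t(M, v):
    """Compute M^T v for a matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def deep_transition(x, h_prev, U, Ws, phi=math.tanh):
    """Deep transition: the new state is an MLP of (x, h_prev).

    Ws is the list of layer matrices, bottom layer first. The bottom layer
    combines the previous state and the input; every further layer is
    another affine map followed by the nonlinearity.
    """
    pre = [a + b for a, b in zip(matvec_t(Ws[0], h_prev), matvec_t(U, x))]
    z = [phi(a) for a in pre]
    for W in Ws[1:]:
        z = [phi(a) for a in matvec_t(W, z)]
    return z
```

With a single matrix in `Ws` this reduces to the conventional shallow transition, which makes the two easy to compare in experiments.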
3.3.2 Deep Output RNN
Similarly, we can use a multilayer perceptron with $L$ intermediate layers to model the output function $f_o$ in Eq. (2) such that
$$y_t = f_o(h_t) = \phi_o\left(V_L^\top \phi_{L-1}\left(V_{L-1}^\top \phi_{L-2}\left(\cdots \phi_1\left(V_1^\top h_t\right)\right)\right)\right)$$
where $\phi_l$ and $V_l$ are the element-wise nonlinear function and the weight matrix for the $l$-th layer. An RNN implementing this kind of multilayered output function is a deep output recurrent neural network (DORNN).
Fig. 2 (c) shows a deep output, deep transition RNN (DOTRNN) implemented using both the deep transition and the deep output with a single intermediate layer each.
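The multilayered output function can be sketched in the same style. The softmax at the top, the shared intermediate nonlinearity `phi`, the absence of biases, and the names `matvec_t`, `softmax` and `deep_output` are assumptions for illustration rather than details given in the text.

```python
import math

def matvec_t(M, v):
    """Compute M^T v for a matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(a):
    """Numerically stable softmax over a list of scores."""
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def deep_output(h, Vs, phi=math.tanh, phi_o=softmax):
    """Deep output function: intermediate layers, then the output layer.

    Vs is the list of layer matrices, bottom layer first; the last matrix
    feeds the output nonlinearity phi_o, which acts on the whole vector.
    """
    z = h
    for V in Vs[:-1]:
        z = [phi(a) for a in matvec_t(V, z)]
    return phi_o(matvec_t(Vs[-1], z))
```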
3.3.3 Stacked RNN
The stacked RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996) has multiple levels of transition functions defined by
$$h_t^{(l)} = f_h^{(l)}\left(h_t^{(l-1)}, h_{t-1}^{(l)}\right) = \phi_h\left(W_l^\top h_{t-1}^{(l)} + U_l^\top h_t^{(l-1)}\right)$$
where $h_t^{(l)}$ is the hidden state of the $l$-th level at time $t$. When $l = 1$, the state is computed using $x_t$ instead of $h_t^{(l-1)}$. The hidden states of all the levels are recursively computed from the bottom level $l = 1$.
Once the top-level hidden state is computed, the output can be obtained using the usual formulation in Eq. (5). Alternatively, one may use all the hidden states to compute the output (Hermans and Schrauwen, 2013). Each hidden state at each level may also be made to depend on the input as well (Graves, 2013). Both of them can be considered approaches using the shortcut connections discussed earlier.
The illustration of this stacked RNN is in Fig. 2 (d).
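The bottom-up recursion over levels can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: matrices are lists of rows, all levels share the nonlinearity `phi`, biases are omitted, and `matvec_t` and `stacked_step` are hypothetical names.

```python
import math

def matvec_t(M, v):
    """Compute M^T v for a matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def stacked_step(x, h_prev_levels, Ws, Us, phi=math.tanh):
    """One time step of a stacked RNN, computed from the bottom level up.

    h_prev_levels[l] is the previous state of level l; Ws[l] is its
    recurrent matrix and Us[l] maps the level below (or the input x for
    the bottom level) into level l.
    """
    below = x
    new_levels = []
    for h_prev, W, U in zip(h_prev_levels, Ws, Us):
        pre = [a + b for a, b in zip(matvec_t(W, h_prev), matvec_t(U, below))]
        h = [phi(a) for a in pre]
        new_levels.append(h)
        below = h  # the next level reads this level's new state
    return new_levels
```

Note that each level's own recurrence is still a single affine map followed by a nonlinearity, which is the shallowness of the sRNN transition discussed in Section 3.2.4.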
4 Another Perspective: Neural Operators
In this section, we briefly introduce a novel approach with which the already discussed deep transition (DT) and/or deep output (DO) recurrent neural networks (RNN) may be built. We call this approach, which is based on building an RNN with a set of predefined neural operators, an operator-based framework.
In the operator-based framework, one first defines a set of operators, each of which is implemented by a multilayer perceptron (MLP). For instance, a plus operator $\oplus$ may be defined as a function receiving two vectors $x$ and $h$ and returning the summary $h'$ of them:
$$h' = x \oplus h,$$
where we may constrain the dimensionality of $h$ and $h'$ to be identical. Additionally, we can define another operator $\triangleright$ which predicts the most likely output symbol $y$ given a summary $h$, such that
$$y = \triangleright h.$$
It is possible to define many other operators, but in this paper, we stick to these two operators which are sufficient to express all the proposed types of RNNs.
It is clear that the plus operator $\oplus$ and the predict operator $\triangleright$ correspond to the transition function and the output function in Eqs. (1)–(2). Thus, at each step, an RNN can be thought of as performing the plus operator to update the hidden state given an input ($h_t = x_t \oplus h_{t-1}$) and then the predict operator to compute the output ($y_t = \triangleright h_t$). See Fig. 3 for the illustration of how an RNN can be understood from the operator-based framework.
Each operator can be parameterized as an MLP with one or more hidden layers, hence a neural operator, since we cannot simply expect the operation will be linear with respect to the input vector(s). By using an MLP to implement the operators, the proposed deep transition, deep output RNN (DOTRNN) naturally arises.
This framework gives us insight into how the constructed RNN may be regularized. For instance, one may regularize the model such that the plus operator is commutative. However, in this paper, we do not explore this approach further.
Note that this is different from Mikolov et al. (2013a), where the learned embeddings of words happened to be suitable for algebraic operators. The operator-based framework proposed here is rather geared toward learning these operators directly.
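To make the operator view concrete, the sketch below runs an RNN as a repeated application of a plus operator followed by a predict operator. The two toy operators here (an element-wise sum and an argmax) are stand-ins chosen for illustration; in the framework described above each would be an MLP, and the names `plus`, `predict` and `run_rnn` are hypothetical.

```python
def plus(x, h):
    """Toy plus operator: element-wise sum, standing in for an MLP."""
    return [a + b for a, b in zip(x, h)]

def predict(h):
    """Toy predict operator: index of the largest entry of the summary."""
    return max(range(len(h)), key=lambda i: h[i])

def run_rnn(xs, h0, plus_op=plus, predict_op=predict):
    """Apply plus to fold each input into the summary, then predict."""
    h, ys = list(h0), []
    for x in xs:
        h = plus_op(x, h)         # update the summary with the new input
        ys.append(predict_op(h))  # read out a prediction from the summary
    return h, ys
```

Because the operators are passed in as functions, a constraint such as commutativity of the plus operator can be tested or enforced on `plus_op` in isolation.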
5 Experiments
We train four types of RNNs described in this paper on a number of benchmark datasets to evaluate their performance. For each benchmark dataset, we try the task of predicting the next symbol.
The task of predicting the next symbol is equivalent to the task of modeling the distribution over a sequence. For each sequence $(x_1, \dots, x_T)$, we decompose it into
$$p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
and each term on the right-hand side will be replaced with a single timestep of an RNN. In this setting, the RNN predicts the probability of the next symbol $x_{t+1}$ in the sequence given all the previous symbols $x_1, \dots, x_t$. Then, we train the RNN by maximizing the log-likelihood.
We try this task of modeling the joint distribution on three different tasks: polyphonic music prediction, character-level and word-level language modeling.
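The decomposition into next-symbol predictions used above can be sketched as a sum of per-step log-probabilities. In this illustration `next_symbol_probs` is a hypothetical stand-in for the RNN: any function that maps the prefix seen so far to a distribution over the vocabulary. Symbols are assumed to be integer indices.

```python
import math

def sequence_log_likelihood(symbols, next_symbol_probs):
    """Log-likelihood of a sequence under a next-symbol predictor.

    symbols: list of integer symbol indices.
    next_symbol_probs(prefix): list of probabilities over the vocabulary
    for the symbol that follows `prefix` (the empty prefix gives the
    distribution of the first symbol).
    """
    total = 0.0
    for t, s in enumerate(symbols):
        probs = next_symbol_probs(symbols[:t])
        total += math.log(probs[s])
    return total

def uniform_model(prefix, vocab_size=4):
    """Toy predictor that ignores the prefix."""
    return [1.0 / vocab_size] * vocab_size
```

Maximizing this quantity over the training sequences is the training criterion; its negative is the cost that is minimized.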
We test the RNNs on the task of polyphonic music prediction using three datasets: Nottingham, JSB Chorales and MuseData (Boulanger-Lewandowski et al., 2012). On the tasks of character-level and word-level language modeling, we use the Penn Treebank Corpus (Marcus et al., 1993).
5.1 Model Descriptions
We compare the conventional recurrent neural network (RNN), the deep transition RNN with shortcut connections in the transition MLP (DT(S)RNN), the deep output/transition RNN with shortcut connections in the hidden-to-hidden transition MLP (DOT(S)RNN) and the stacked RNN (sRNN). See Fig. 2 (a)–(d) for the illustrations of these models.
Table 1: Sizes of the final models. Columns: RNN, DT(S)RNN, DOT(S)RNN, sRNN (2 layers). Rows: Music (Nottingham, JSB Chorales, MuseData) and Language (Char-level, Word-level).
The size of each model is chosen from a limited set to minimize the validation error for each polyphonic music task (see Table 1 for the final models). In the case of language modeling tasks, we chose the size of the models from and for word-level and character-level tasks, respectively. In all cases, we use a logistic sigmoid function as an element-wise nonlinearity of each hidden unit. Only for the character-level language modeling we used rectified linear units (Glorot et al., 2011a) for the intermediate layers of the output function, which gave lower validation error.
5.2 Training
We use stochastic gradient descent (SGD) and employ the strategy of clipping the gradient proposed by Pascanu et al. (2013a). Training stops when the validation cost stops decreasing.
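Gradient clipping of this kind is commonly implemented by rescaling the gradient whenever its norm exceeds a threshold. The sketch below is one such norm-based version in plain Python; the function name `clip_gradient`, the flat-list representation of the gradient, and the default threshold are illustrative assumptions.

```python
import math

def clip_gradient(grad, threshold=1.0):
    """Rescale grad so that its Euclidean norm does not exceed threshold.

    The direction of the gradient is kept; only its length is reduced.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return list(grad)
```

For example, `clip_gradient([3.0, 4.0], 1.0)` has norm 5 before clipping and norm 1 afterwards, while a gradient already inside the threshold is returned unchanged.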
Polyphonic Music Prediction: For Nottingham and MuseData datasets we compute each gradient step on subsequences of at most 200 steps, while we use subsequences of 50 steps for JSB Chorales. We do not reset the hidden state for each subsequence, unless the subsequence belongs to a different song than the previous subsequence.
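The subsequence scheme described here can be sketched as a generator that cuts each song into chunks and flags where the hidden state should be reset. The name `iter_subsequences` and the representation of a song as a Python list of frames are assumptions for illustration.

```python
def iter_subsequences(songs, max_len):
    """Yield (chunk, reset_state) pairs for truncated training.

    Each song is cut into consecutive chunks of at most max_len steps.
    reset_state is True only for the first chunk of a song, so the hidden
    state is carried across chunks of the same song and reset between songs.
    """
    for song in songs:
        for start in range(0, len(song), max_len):
            yield song[start:start + max_len], start == 0
```

A training loop would compute one gradient step per yielded chunk and reinitialize the hidden state whenever the flag is true.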
The cutoff threshold for the gradients is set to 1. The hyperparameter for the learning rate schedule (see Footnote 1) is tuned manually for each dataset. We set the hyperparameter to for Nottingham, for MuseData and for JSB Chorales. They correspond to two epochs, a single epoch and a third of an epoch, respectively.
Footnote 1: We use at each update , the following learning rate where and indicate respectively when the learning rate starts decreasing and how quickly the learning rate decreases. In the experiment, we set to coincide with the time when the validation error starts increasing for the first time.
The weights of the connections between any pair of hidden layers are sparse, having only 20 nonzero incoming connections per unit (see, e.g., Sutskever et al., 2013). Each weight matrix is rescaled to have a unit largest singular value (Pascanu et al., 2013a). The weights of the connections between the input layer and the hidden state as well as between the hidden state and the output layer are initialized randomly from the white Gaussian distribution with its standard deviation fixed to and , respectively. In the case of deep output functions (DOT(S)RNN), the weights of the connections between the hidden state and the intermediate layer are sampled initially from the white Gaussian distribution of standard deviation . In all cases, the biases are initialized to .
To regularize the models, we add white Gaussian noise of standard deviation to each weight parameter every time the gradient is computed (Graves, 2011).
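The sparse initialization and the rescaling to a unit largest singular value can be sketched in plain Python as below. This is an illustration under stated assumptions: the nonzero weights are drawn from a standard normal distribution (their scale is removed by the rescaling anyway), the largest singular value is estimated by power iteration rather than a full decomposition, and the names `sparse_init`, `largest_singular_value` and `rescale_to_unit_spectral_norm` are hypothetical.

```python
import math
import random

def sparse_init(n_in, n_out, n_nonzero=20, rng=None):
    """Matrix (list of rows) where each output unit has at most
    n_nonzero nonzero incoming weights."""
    rng = rng or random.Random(0)
    W = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        for i in rng.sample(range(n_in), min(n_nonzero, n_in)):
            W[i][j] = rng.gauss(0.0, 1.0)
    return W

def largest_singular_value(W, n_iter=200):
    """Estimate the largest singular value by power iteration on W^T W."""
    n = len(W[0])
    v = [1.0 / math.sqrt(n)] * n
    sigma = 0.0
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(n)) for row in W]             # W v
        w = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]  # W^T W v
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            return 0.0
        sigma = math.sqrt(norm)  # for unit v, ||W^T W v|| approaches sigma_max^2
        v = [a / norm for a in w]
    return sigma

def rescale_to_unit_spectral_norm(W):
    """Divide W by its (estimated) largest singular value."""
    sigma = largest_singular_value(W)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[w / sigma for w in row] for row in W]
```

Power iteration can converge slowly when the two largest singular values are close; a library singular value decomposition would be the more robust choice in practice.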
Language Modeling: We used the same strategy for initializing the parameters in the case of language modeling. For character-level modeling, the standard deviations of the white Gaussian distributions for the input-to-hidden weights and the hidden-to-output weights were and , respectively, while those hyperparameters were both for word-level modeling. In the case of DOT(S)RNN, we sample the weights between the hidden state and the rectifier intermediate layer of the output function from the white Gaussian distribution of standard deviation . When using rectifier units (character-level language modeling) we fix the biases to .
In language modeling, the learning rate starts from an initial value and is halved each time the validation cost does not decrease significantly (Mikolov et al., 2010). We do not use any regularization for the character-level modeling, but for the word-level modeling we use the same strategy of adding weight noise as we do with the polyphonic music prediction.
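The halving rule can be sketched as a small function applied after each validation pass. What counts as a significant decrease is not specified in the text, so the `min_improvement` threshold below is a hypothetical parameter, as is the function name.

```python
def halve_on_plateau(lr, prev_cost, cost, min_improvement=1e-3):
    """Return the learning rate to use next.

    The rate is halved when the validation cost failed to decrease by at
    least min_improvement since the previous check; otherwise it is kept.
    """
    if prev_cost - cost < min_improvement:
        return lr / 2.0
    return lr
```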
For all the tasks (polyphonic music prediction, character-level and word-level language modeling), the stacked RNN and the DOT(S)RNN were initialized with the weights of the conventional RNN and the DT(S)RNN, which is similar to layer-wise pretraining of a feedforward neural network (see, e.g., Hinton and Salakhutdinov, 2006). We use a ten times smaller learning rate for each parameter that was pretrained as either RNN or DT(S)RNN.
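Table 2: Log-probabilities on the test set of each polyphonic music dataset. Columns: RNN, DT(S)RNN, DOT(S)RNN, sRNN, DOT(S)RNN*. Rows: Nottingham, JSB Chorales, MuseData.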
5.3 Results and Analysis
5.3.1 Polyphonic Music Prediction
The log-probabilities on the test set of each dataset are presented in the first four columns of Tab. 2. We were able to observe that in all cases one of the proposed deep RNNs outperformed the conventional, shallow RNN. However, the suitability of each deep RNN depended on the data it was trained on. The best results obtained by the DT(S)RNNs on Nottingham and JSB Chorales are close to, but worse than, the results obtained by RNNs trained with the technique of fast dropout (FD), which are and , respectively (Bayer et al., 2013).
In order to quickly investigate whether the proposed deeper variants of RNNs may also benefit from the recent advances in feedforward neural networks, such as the use of non-saturating activation functions and the method of dropout, we have built another set of DOT(S)RNNs that have the recently proposed $L_p$ units (Gulcehre et al., 2013) in the deep transition and maxout units (Goodfellow et al., 2013) in the deep output function. (Note that it is not trivial to use non-saturating activation functions in conventional RNNs, as this may cause the activations of the hidden states to explode. However, it is perfectly safe to use non-saturating activation functions at the intermediate layers of a deep RNN with deep transition.) Furthermore, we used the method of dropout (Hinton et al., 2012b) instead of weight noise during training. Similarly to the previously trained models, we searched for the size of the models as well as other learning hyperparameters that minimize the validation cost. We, however, did not pretrain these models.
The results obtained by the DOT(S)RNNs having $L_p$ and maxout units trained with dropout are shown in the last column of Tab. 2. On every music dataset the performance of this model is significantly better than that achieved by all the other models as well as the best results reported with recurrent neural networks in (Bayer et al., 2013). This suggests that the proposed variants of deep RNNs also benefit from having non-saturating activations and using dropout, just like feedforward neural networks. We reported these results and more details on the experiment in (Gulcehre et al., 2013).
We, however, acknowledge that the model-free state-of-the-art results for both datasets were obtained using an RNN combined with a conditional generative model, such as restricted Boltzmann machines or the neural autoregressive distribution estimator (Larochelle and Murray, 2011), in the output (Boulanger-Lewandowski et al., 2012).
Table 3: Perplexities on the test set for language modeling. Columns: RNN, DT(S)RNN, DOT(S)RNN, sRNN, followed by the previous/current state-of-the-art results (*). Rows: Character-Level, Word-Level.
Notes on the state-of-the-art entries: (1) Reported by Mikolov et al. (2012a) using mRNN with the Hessian-free optimization technique. (2) Reported by Mikolov et al. (2011) using dynamic evaluation. (3) Reported by Graves (2013) using dynamic evaluation and weight noise. (*) The previous/current state-of-the-art results obtained with RNNs having long short-term memory units.
5.3.2 Language Modeling
In Tab. 3, we can see the perplexities on the test set achieved by all four models. We can clearly see that the deep RNNs (DT(S)RNN, DOT(S)RNN and sRNN) outperform the conventional, shallow RNN significantly. On these tasks DOT(S)RNN outperformed all the other models, which suggests that it is important to have a highly nonlinear mapping from the hidden state to the output in the case of language modeling.
The results of both the DOT(S)RNN and the sRNN for word-level modeling surpassed the previous best performance achieved by an RNN with 1000 long short-term memory (LSTM) units (Graves, 2013) as well as that of a shallow RNN with a larger hidden state (Mikolov et al., 2011), even when both of them used dynamic evaluation (an approach where the parameters of a model are updated as the validation/test data is predicted). The results we report here are without dynamic evaluation.
For character-level modeling the state-of-the-art results were obtained using the Hessian-free optimization method with a specific type of RNN architecture called mRNN (Mikolov et al., 2012a) or a regularization technique called adaptive weight noise (Graves, 2013). Our result, however, is better than the performance achieved by conventional, shallow RNNs without any of those advanced regularization methods (Mikolov et al., 2012b), where they reported the best performance of using an RNN trained with the Hessian-free learning algorithm (Martens and Sutskever, 2011).
6 Discussion
In this paper, we have explored a novel approach to building a deep recurrent neural network (RNN). We considered the structure of an RNN at each timestep, which revealed that the relationship between the consecutive hidden states and that between the hidden state and output are shallow. Based on this observation, we proposed two alternative designs of deep RNN that model those shallow relationships with deep neural networks. Furthermore, we proposed to make use of shortcut connections in these deep RNNs to alleviate the learning difficulty potentially introduced by the increasing depth.
We empirically evaluated the proposed designs against the conventional RNN which has only a single hidden layer and against another approach of building a deep RNN (stacked RNN, Graves, 2013), on the task of polyphonic music prediction and language modeling.
The experiments revealed that the RNN with the proposed deep transition and deep output (DOT(S)RNN) outperformed both the conventional RNN and the stacked RNN on the task of language modeling, achieving the state-of-the-art result on the task of word-level language modeling. For polyphonic music prediction, a different deeper variant of an RNN achieved the best performance for each dataset. Importantly, however, in all the cases, the conventional, shallow RNN was not able to outperform the deeper variants. These results strongly support our claim that an RNN benefits from having a deeper architecture, just like feedforward neural networks.
The observation that there is no clear winner in the task of polyphonic music prediction suggests that each of the proposed deep RNNs has a distinct characteristic that makes it more, or less, suitable for certain types of datasets. We suspect that in the future it will be possible to design and train yet another deeper variant of an RNN that combines the proposed models to be more robust to the characteristics of datasets. For instance, a stacked DT(S)RNN may be constructed by combining the DT(S)RNN and the sRNN.
In a quick additional experiment where we trained DOT(S)RNNs constructed using non-saturating nonlinear activation functions and trained with the method of dropout, we were able to improve the performance of the deep recurrent neural networks on the polyphonic music prediction tasks significantly. This suggests that it is important to investigate the possibility of applying recent advances in feedforward neural networks, such as novel, non-saturating activation functions and the method of dropout, to recurrent neural networks as well. However, we leave this as future research.
One practical issue we ran into during the experiments was the difficulty of training deep RNNs. We were able to train the conventional RNN as well as the DT(S)RNN easily, but it was not trivial to train the DOT(S)RNN and the stacked RNN. In this paper, we proposed to use shortcut connections as well as to pretrain them either with the conventional RNN or with the DT(S)RNN. We, however, believe that learning may become even more problematic as the size and the depth of a model increase. In the future, it will be important to investigate the root causes of this difficulty and to explore potential solutions. We find some of the recently introduced approaches, such as advanced regularization methods (Pascanu et al., 2013a) and advanced optimization algorithms (see, e.g., Pascanu and Bengio, 2013; Martens, 2010), to be promising candidates.
Acknowledgments
We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We also thank Justin Bayer for his insightful comments on the paper. We would like to thank NSERC, Compute Canada, and Calcul Québec for providing computational resources. Razvan Pascanu is supported by a DeepMind Fellowship. Kyunghyun Cho is supported by FICS (Finnish Doctoral Programme in Computational Sciences) and “the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, 251170)”.
References
 Bastien et al. (2012) Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
 Bayer et al. (2013) Bayer, J., Osendorfer, C., Korhammer, D., Chen, N., Urban, S., and van der Smagt, P. (2013). On fast dropout and its applicability to recurrent networks. arXiv:1311.0701 [cs.NE].
 Bengio (2009) Bengio, Y. (2009). Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1), 1–127.
 Bengio et al. (1994) Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
 Bengio et al. (2013) Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML’13.
 Bergstra et al. (2010) Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
 Boulanger-Lewandowski et al. (2012) Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML’2012.
 Chen and Deng (2013) Chen, J. and Deng, L. (2013). A new method for learning deep recurrent neural networks. arXiv:1311.6091 [cs.LG].
 Delalleau and Bengio (2011) Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
 El Hihi and Bengio (1996) El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS 8. MIT Press.
 Glorot et al. (2011a) Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS.
 Glorot et al. (2011b) Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML’2011.
 Goodfellow et al. (2009) Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS’09, pages 646–654.
 Goodfellow et al. (2013) Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML’2013.
 Graves (2011) Graves, A. (2011). Practical variational inference for neural networks. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356.
 Graves (2013) Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE].
 Graves et al. (2009) Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 Graves et al. (2013) Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. ICASSP.
 Gulcehre et al. (2013) Gulcehre, C., Cho, K., Pascanu, R., and Bengio, Y. (2013). Learned-norm pooling for deep feedforward and recurrent neural networks. arXiv:1311.1780 [cs.NE].
 Hermans and Schrauwen (2013) Hermans, M. and Schrauwen, B. (2013). Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems 26, pages 190–198.
 Hinton et al. (2012a) Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
 Hinton and Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
 Hinton et al. (2012b) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
 Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
 Jaeger (2007) Jaeger, H. (2007). Discovering multiscale dynamical features with hierarchical echo state networks. Technical report, Jacobs University.
 Ko and Dieter (2009) Ko, J. and Dieter, F. (2009). GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots.
 Larochelle and Murray (2011) Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS’2011), volume 15 of JMLR: W&CP.
 Le Roux and Bengio (2010) Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192–2207.
 Marcus et al. (1993) Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
 Martens (2010) Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), pages 735–742. ACM.
 Martens and Sutskever (2011) Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML’2011. ACM.
 Mikolov (2012) Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), volume 2010, pages 1045–1048. International Speech Communication Association.
 Mikolov et al. (2011) Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., and Khudanpur, S. (2011). Extensions of recurrent neural network language model. In Proc. 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP 2011).
 Mikolov et al. (2012a) Mikolov, T., Sutskever, I., Deoras, A., Le, H., Kombrink, S., and Cernocky, J. (2012a). Subword language modeling with neural networks. unpublished.
 Mikolov et al. (2012b) Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012b). Subword language modeling with neural networks. preprint (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf).
 Mikolov et al. (2013a) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
 Mikolov et al. (2013b) Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013b). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.
 Pascanu and Bengio (2013) Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Technical report, arXiv:1301.3584.
 Pascanu et al. (2013a) Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML’2013.
 Pascanu et al. (2013b) Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of response regions of deep feed forward networks with piecewise linear activations. arXiv:1312.6098 [cs.LG].
 Pinheiro and Collobert (2014) Pinheiro, P. and Collobert, R. (2014). Recurrent convolutional neural networks for scene labeling. In Proceedings of The 31st International Conference on Machine Learning, pages 82–90.
 Raiko et al. (2012) Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), volume 22 of JMLR Workshop and Conference Proceedings, pages 924–932. JMLR W&CP.
 Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
 Schmidhuber (1992) Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, (4), 234–242.
 Sutskever et al. (2011) Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 1017–1024, New York, NY, USA. ACM.
 Sutskever et al. (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML.
 Valpola and Karhunen (2002) Valpola, H. and Karhunen, J. (2002). An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Comput., 14(11), 2647–2692.