1 Introduction
Recurrent Neural Networks (RNNs, rumelhart86 ; werbos88 ) are powerful dynamic systems for tasks that involve sequential inputs, such as audio classification, machine translation and speech generation. As they process a sequence one element at a time, internal states are maintained to store information computed from the past inputs which makes RNNs capable of modeling temporal correlations between elements from any distance in theory.
In practice, however, it is difficult for RNNs to learn long-term dependencies in data by using backpropagation through time (BPTT, rumelhart86 ) due to the well-known vanishing and exploding gradient problem hochreiter01 . Besides, training RNNs suffers from gradient conflicts (e.g. input conflict and output conflict lstm ) which make it challenging to latch long-term information while keeping mid- and short-term memory simultaneously. Various attempts have been made to increase the temporal range over which credit assignment takes effect for recurrent models during training, including adopting a much more sophisticated Hessian-free optimization method instead of stochastic gradient descent martens10 ; martens11 , using orthogonal weight matrices to assist optimization saxe13 ; irnn and allowing direct connections to model inputs or states from the distant past narx ; dilatedrnn ; mistrnn . Long short-term memory (LSTM, lstm ) and its variant, known as gated recurrent units (GRU, gru ), mitigate gradient conflicts by using multiplicative gate units. Moreover, the vanishing gradient problem is alleviated by the additivity in their state transition operator. Simplified gated units have been proposed ugrnn ; mgu , yet the ability to capture long-term dependencies has not been improved. Recent work also supports the idea of partitioning the hidden units in an RNN into separate modules with different processing periods cwrnn .
In this paper, we introduce the Grouped Distributor Unit (GDU), a new gated recurrent architecture with additive state transition and only one gate unit. Hidden states inside a GDU are partitioned into groups, each of which keeps a constant proportion of previous memory at each time step, forcing information to be latched. The vanishing gradient problem and the issue of gradient conflict, both of which impede the extraction of long-term dependencies, are thus alleviated.
We empirically evaluated the proposed model against LSTM and GRU on both synthetic problems designed to be pathologically difficult and a natural dataset containing long-term components. The results reveal that our proposed model outperforms LSTM and GRU on these tasks with a simpler structure and fewer parameters.
2 Background and related work
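An RNN is able to encode sequences of arbitrary length into a fixed-length representation by folding a new observation $x_t$ into its hidden state $h_{t-1}$ using a transition operator $\mathcal{T}$ at each time step $t$:
(1) $h_t = \mathcal{T}(x_t, h_{t-1})$
Simple recurrent networks (SRN, elman90 ), known as one of the earliest variants, make $\mathcal{T}$ the composition of an elementwise nonlinearity with an affine transformation of both $x_t$ and $h_{t-1}$:
(2) $h_t = \phi(W x_t + U h_{t-1} + b)$
where $W$ is the input-to-state weight matrix, $U$ is the state-to-state recurrent weight matrix, $b$ is the bias and $\phi$ is the nonlinear activation function. For convenience in the following descriptions, we denote operators of this kind as $\mathcal{F}$, and a subscript can be added to distinguish different network components. Thus in SRN, $\mathcal{T} = \mathcal{F}$.
During training via BPTT, the error obtained from the output of an RNN at time step $t$ (denoted as $E_t$) travels backward through each state unit. The corresponding error signal propagated back to time step $k$ (denoted as $\partial E_t / \partial h_k$, $k < t$) contains a product of Jacobian matrices:
(3) $\frac{\partial E_t}{\partial h_k} = \frac{\partial E_t}{\partial h_t} \prod_{k < i \le t} \frac{\partial h_i}{\partial h_{i-1}}$
Here $\partial E_t / \partial h_k \in \mathbb{R}^n$, in which $n$ is the state size and the $j$th component represents the sensitivity of $E_t$ to small perturbations in the $j$th state unit at time step $k$.
From Eq. (3) we can easily find a sufficient condition for the vanishing gradient problem to occur, i.e. $\| \partial h_i / \partial h_{i-1} \| < 1$. Under this condition, a bound $\eta \in \mathbb{R}$ can be found such that $\forall i$, $\| \partial h_i / \partial h_{i-1} \| \le \eta < 1$, and
(4) $\left\| \frac{\partial E_t}{\partial h_t} \prod_{k < i \le t} \frac{\partial h_i}{\partial h_{i-1}} \right\| \le \eta^{t-k} \left\| \frac{\partial E_t}{\partial h_t} \right\|$
As $\eta < 1$, long-term contributions (for which $t - k$ is large) go to $0$ exponentially fast with $t - k$.
In SRN, $\partial h_i / \partial h_{i-1}$ is given by $\mathrm{diag}(\phi'(W x_i + U h_{i-1} + b))\, U$. As a result, if the derivative of the nonlinear function is bounded in SRN, namely, $\exists \gamma \in \mathbb{R}$ s.t. $\| \mathrm{diag}(\phi'(\cdot)) \| \le \gamma$, it will be sufficient for $\lambda_1 < 1 / \gamma$, where $\lambda_1$ is the largest singular value of the recurrent weight matrix $U$, for $\partial E_t / \partial h_k$ to vanish (as $t \to \infty$) pascanu13 .
Any RNN architecture with a long-term memory ability should at least be designed to make sure the norm of its transition Jacobian will not easily be bounded by $1$ for a long time span as it goes through a sequence.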
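The sufficient condition above can be checked numerically. The following is a minimal sketch, assuming a tanh SRN (whose derivative is bounded by 1) with random inputs and a recurrent matrix rescaled so that its largest singular value is 0.9; the function name `srn_backprop_norms` and all sizes are illustrative choices, not part of the original experiments.

```python
import numpy as np

def srn_backprop_norms(U, steps, rng):
    """Norms of an error signal pushed back through `steps` SRN transitions.

    Forward pass: h_i = tanh(U h_{i-1} + x_i) with random x_i, so each
    transition Jacobian is diag(1 - h_i**2) @ U.
    """
    n = U.shape[0]
    h = np.zeros(n)
    jacobians = []
    for _ in range(steps):
        h = np.tanh(U @ h + rng.normal(size=n))
        jacobians.append((1.0 - h ** 2)[:, None] * U)  # diag(phi') @ U
    grad = np.ones(n) / np.sqrt(n)       # unit-norm error injected at the last step
    norms = [np.linalg.norm(grad)]
    for J in reversed(jacobians):
        grad = grad @ J                   # one factor of the Jacobian product
        norms.append(np.linalg.norm(grad))
    return np.array(norms)

rng = np.random.default_rng(0)
n = 32
U = rng.normal(size=(n, n))
U *= 0.9 / np.linalg.norm(U, 2)           # largest singular value becomes 0.9
norms = srn_backprop_norms(U, steps=50, rng=rng)
print(norms[[0, 10, 25, 50]])
```

Under these assumptions every entry `norms[k]` stays below `0.9 ** k`, so the error signal shrinks at least geometrically with the number of steps it travels back.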
2.1 Gated additive state transition (GAST)
Long short-term memory (LSTM, lstm ) introduced a memory unit with a self-connected structure which can maintain its state over time, and nonlinear gating units (originally input and output gates) which control the information flow into and out of it. Since the initial proposal in 1997, many improvements have been made to the LSTM architecture forgetgate ; peephole . In this paper, we refer to the variant with forget gate and without peephole connections, which performs comparably to more complex variants greff17 :
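(5a) $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
(5b) $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
(5c) $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
(5d) $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
(5e) $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
(5f) $h_t = o_t \odot \tanh(c_t)$
Here $\sigma$ denotes the sigmoid activation and $\odot$ denotes elementwise multiplication. Note that $c_t$ should also be considered as hidden state besides $h_t$.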
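As a rough illustration of the LSTM variant referred to above (forget gate, no peephole connections), one step might be sketched in NumPy as follows. The stacked parameter layout, the gate ordering inside the stacked vector, and the names `lstm_step`, `W`, `U`, `b` are assumptions made for this sketch rather than the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step with forget gate and without peephole connections.

    W: (4n, d), U: (4n, n), b: (4n,), stacked as [input, forget, output, candidate].
    """
    n = h_prev.shape[0]
    a = W @ x + U @ h_prev + b
    i = sigmoid(a[0:n])                   # input gate
    f = sigmoid(a[n:2 * n])               # forget gate
    o = sigmoid(a[2 * n:3 * n])           # output gate
    c_tilde = np.tanh(a[3 * n:4 * n])     # candidate cell state
    c = f * c_prev + i * c_tilde          # additive cell update
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is copied forward almost unchanged, which is the additive shortcut discussed in this section.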
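Cho et al. gru proposed a similar architecture with gating units called the gated recurrent unit (GRU). Different from LSTM, GRU exposes all its states to the output and uses a linear interpolation between the previous state $h_{t-1}$ and the candidate state $\tilde{h}_t$:
(6a) $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
(6b) $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
(6c) $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
(6d) $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$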
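For comparison, a GRU step can be sketched in the same style. This is a minimal sketch under assumptions: the parameter names are hypothetical, and the interpolation direction (an update-gate value of 0 keeps the previous state) follows one common convention and is assumed here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step; `p` is a dict of weight matrices and biases (hypothetical names)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])             # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    # Linear interpolation between previous state and candidate state.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
n, d = 4, 3
p = {k: rng.normal(size=(n, d)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.normal(size=(n, n)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(n) for k in ("bz", "br", "bh")})
h = gru_step(rng.normal(size=d), np.zeros(n), p)
```

With the update gate shut (values near 0), the state is latched; this is the behaviour that the gradient-conflict discussion later in the paper relies on.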
Previous work has clearly indicated the advantages of the gating units over the more traditional recurrent units chung14 . Both LSTM and GRU perform well in tasks that require capturing long-term dependencies. However, the choice between these two structures may depend heavily on the dataset and corresponding task.
The most prominent feature shared between these units is the additivity in their state transition operators. In other words, both LSTM and GRU keep the existing states and add the new states on top of them instead of replacing previous states directly, as is done in traditional recurrent units like SRN. Another important ingredient in their transition operator is the gating mechanism, which regulates the information flow and enables the network to form skip connections adaptively. In this paper we refer to this kind of transition operator as the Gated Additive State Transition (GAST) with a general formula:
(7a)  
(7b) 
where , and are called gate operators with subscript indicating that values of the corresponding gating units change over time (see Fig. 1 (left)). In LSTM:
(8a)  
(8b)  
(8c) 
whilst in GRU:
(9a)  
(9b)  
(9c) 
We denote the gate vector used in a gate operator at time step as . Note that except Eq. (8a), gate operators have a common form (in the remainder of this paper, gate operators are assumed to be in this form):
(10) 
where is a state vector to be gated. We use to indicate as in the case of GRU. According to Eq. (7b), the transition Jacobian of a GAST can be resolved into 4 parts:
(11) 
in which
(12a)  
(12b)  
(12c)  
(12d) 
The gradient property of GAST is much better than that of SRN since it can easily prevent its transition Jacobian norm from being bounded within by saturating part of the units in nearly at 1. Intuitively, when this happens, the corresponding components of the error signal can be back-propagated easily through the shortcut created by the additive character of GAST without vanishing too quickly.
The original LSTM lstm uses full gate recurrence odyssey , which means that all neurons receive recurrent inputs from all gate activations at the previous time step besides the block outputs. Nevertheless, it still follows Eqs. (7). Another difference is that the original LSTM does not use a forget gate, i.e. , thus in Eq. (12a), is a unit diagonal matrix . In addition, gradients are truncated by replacing the other components in its transition Jacobian, i.e. Eqs. (12b), (12c) and (12d), by zero, forming a constant error carrousel (CEC) where . It is worth noting, however, that if the gradients are not truncated, Eq. (3) does not hold for LSTMs since the gate vector used in is calculated at the previous time step, see Eq. (8a). In this case, a concatenation of and should be used in the analysis of its transition Jacobian, as in Fig. 7.
Simplifying GAST has drawn the interest of researchers recently. GRU itself reduces the number of gate units to 2, compared to the 3 gate units of LSTM, by coupling the forget gate and input gate into one update gate, namely making the gate operator equal to . In this paper we denote this kind of GAST as cGAST, with the prefix c short for coupled. Based on GRU, the Minimal Gated Unit (MGU, mgu ) reduced the gate number further to only 1 by letting without losing GRU’s accuracy benefits. The Update Gate RNN (UGRNN, ugrnn ) entirely removed operator. However, none of these models has shown superiority over LSTM and GRU on long-term tasks with single-layer hidden states.
2.2 Units partitioning
Although the capacity to capture long-term dependencies in sequences is of crucial importance to RNNs, it is worth noting that the flowing data is usually embedded with both slow-moving and fast-moving information, of which the former corresponds to long-term dependencies. Because both long- and short-term information exist in sequences, gradient conflict is always present during training. Here gradient conflict mainly refers to the contradiction between error signals that are back-propagated to the same time step but injected at different time steps during training via BPTT. This issue may hinder the establishment of long-term memory even without the vanishing gradient problem.
Consider a task in which a GRU is given one data point at a time and asked to predict the next, e.g. ERG (see Section 4.3). If the correct prediction at time step depends heavily on the data point that appeared at time step , namely , where , we can say a long-term dependency exists between and . GRU can capture this kind of dependency by learning to encode into some state units and latch it until . For simplicity, let us focus on a single state unit and assume that the information of has been stored in . At time step (), state unit will often receive conflicting error signals. The error signal injected at time step may attempt to make keep its value until , while other error signals injected before , say, , may push to help with the prediction at time step , and thus attempt to have overwritten by a new value. This conflict makes the GRU model hesitate to shut the update gate for by setting to . In GRU, we also observed that state units latching long-term memories (with corresponding neurons in staying active for a long time) are usually sparse (see Fig. 6 (left)), which impedes the back-propagation of effective long-term error signals, since short-term error signals dominate. As a result, learning can be slow.
El Hihi and Bengio first showed that RNNs can learn both long- and short-term dependencies more easily and efficiently if state units are partitioned into groups with different timescales hihi95 . The clockwork RNN (CWRNN) cwrnn implemented this by assigning each state unit a fixed temporal granularity, so that its state transition happens only at its prescribed clock rate. It can also be seen as a member of the cGAST family: more specifically, a UGRNN with a special gate operator in which each gate vector value is explicitly scheduled to saturate at either $0$ or $1$. CWRNN does not suffer from gradient conflict because it inherently has the ability to latch information. However, the clock rate schedule must be tuned for each task.
3 Grouped Distributor Unit
As introduced in Section 2, a network combining the advantages of GAST and the idea to partition state units into groups seems promising. Further, we argue that a dynamic system with memory does not need to overwrite the vast majority of its memory based on relatively little input data. For cGAST models whose , we define the proportion of states to be overwritten at time step as:
(13) 
where is the state size. On the other hand, the proportion of previous states to be kept is:
(14) 
Hence in our view, if a model input contains a small amount of information compared to the system memory , should be kept low to protect the previous states. For cGAST family members, a lower leads to more active units in (see Fig. 6 (right)) and thus makes the model less prone to gradient conflict.
To put a limit on , we start from a plain UGRNN and partition its state units into groups:
(15) 
where the th group contains units. At each time step, for each , we let a positive constant be distributed to the corresponding components in , namely
(16) 
Thus becomes a constant given by
(17) 
As shown in Fig. 1 (right), the distribution within each group is done by a distributor; hence the proposed structure is called the Grouped Distributor Unit (GDU). The distributor is implemented by applying the softmax activation over each group individually when calculating :
(18a)  
(18b)  
(18c) 
Here and (the permutation of can be arbitrary). Note that when and when . The resulting GDU is given by
(19a)  
(19b) 
where denotes distributor operator with group configuration as is detailed in Eqs. (18).
In this paper, we let . As a consequence,
(20) 
If the size of each state group is set to a constant , will be further reduced to .
GDU has an inherent strength in keeping long-term memory, since any saturated state unit will force all other members of its group to latch information. As a result, the “bandwidth” is wider for long-term information to travel forward and for error signals to back-propagate (see Fig. 6 (right)).
Like CWRNN, we set an explicit rate for each group. However, instead of making all group members act in the same way, we allow each unit to find its own rate by learning.
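The distributor mechanism described in this section can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' TensorFlow code: the candidate state is assumed to be a UGRNN-style tanh unit, the update takes the coupled form `(1 - u) * h_prev + u * h_tilde`, and the names `distributor`, `gdu_step`, `delta` and the parameter keys are hypothetical.

```python
import numpy as np

def distributor(a, group_sizes, delta=1.0):
    """Per-group softmax, scaled so that each group's gate values sum to `delta`."""
    assert sum(group_sizes) == a.shape[0]
    u = np.empty(a.shape[0])
    start = 0
    for size in group_sizes:
        block = a[start:start + size]
        e = np.exp(block - block.max())           # numerically stable softmax
        u[start:start + size] = delta * e / e.sum()
        start += size
    return u

def gdu_step(x, h_prev, p, group_sizes, delta=1.0):
    """One GDU-style step (assumed form): h = (1 - u) * h_prev + u * h_tilde."""
    u = distributor(p["Wu"] @ x + p["Uu"] @ h_prev + p["bu"], group_sizes, delta)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ h_prev + p["bh"])
    return (1.0 - u) * h_prev + u * h_tilde, u

rng = np.random.default_rng(0)
group_sizes, d = [4, 4], 3
n = sum(group_sizes)
p = {"Wu": rng.normal(size=(n, d)), "Uu": rng.normal(size=(n, n)), "bu": np.zeros(n),
     "Wh": rng.normal(size=(n, d)), "Uh": rng.normal(size=(n, n)), "bh": np.zeros(n)}
h, u = gdu_step(rng.normal(size=d), np.zeros(n), p, group_sizes)
overwrite_proportion = u.mean()   # equals (number of groups) / n when delta = 1
```

With two groups of four units and `delta = 1`, the gate values in each group sum to 1, so the overwritten proportion is fixed at 2/8 regardless of the input. If one unit in a group takes almost all of the softmax mass, the remaining units receive gate values near 0 and keep their previous state.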
4 Experiments
We evaluated the proposed GDU on both pathological synthetic tasks and a natural data set in comparison with LSTM and GRU. It is important to point out that although LSTM and GRU have similar performance on natural data sets chung14 , one model may outperform the other by a huge gap on different pathological tasks, such as the adding problem (see Section 4.1), at which GRU is good, and the temporal order problem (see Section 4.2), on which LSTM performs better.
If not otherwise specified, all networks have one hidden layer with the same state size. Weight variables were initialized via the Xavier uniform initializer glorot10 , and the initial values of all internal state variables were set to . All networks were trained using the Adam optimization method adam via BPTT, and the models were implemented using TensorFlow tf . In GDU models, apply to all groups.
4.1 The adding problem
The adding problem is a sequence regression problem which was originally proposed in lstm to examine the ability of recurrent models to capture long-term dependencies. Two sequences of length are taken as input. The first one consists of real numbers sampled from a uniform distribution in $[0, 1]$, while the second sequence serves as indicators with exactly two entries being 1 and the remaining being 0. We followed the settings in urnn where is a constant and the first 1 entry is located uniformly at random in the first half of the indicator sequence, whilst the second 1 entry is located uniformly at random in the other half. The target of this problem is to add up the two entries in the first sequence whose corresponding indicator in the second sequence is 1. A naive strategy of outputting $1$ regardless of the inputs yields a mean squared error of $1/6 \approx 0.167$, which is the variance of the sum of two independent uniform distributions over $[0, 1]$. We took it as the baseline.
Four different lengths of sequences, were used in this experiment. For each length, sequences were generated for testing, while a batch of sequences were randomly generated at each training step. Four models, an LSTM with hidden states, a GRU with hidden states, a GDU with groups of size and a GDU with only one group of size were compared, with the corresponding parameter number , , and . A simple linear layer without activation is stacked on top of the recurrent layer in each model.
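The data generation described above can be sketched as follows. This is a minimal sketch assuming values drawn from the uniform distribution on [0, 1] and one marker in each half of the sequence; the function name `adding_problem_batch` and the batch size and length used in the example are illustrative, not the settings of the experiments.

```python
import numpy as np

def adding_problem_batch(batch_size, length, rng):
    """Return inputs of shape (batch, length, 2) and targets of shape (batch,)."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, length))
    markers = np.zeros((batch_size, length))
    half = length // 2
    first = rng.integers(0, half, size=batch_size)        # marker in the first half
    second = rng.integers(half, length, size=batch_size)  # marker in the second half
    rows = np.arange(batch_size)
    markers[rows, first] = 1.0
    markers[rows, second] = 1.0
    inputs = np.stack([values, markers], axis=-1)
    targets = values[rows, first] + values[rows, second]
    return inputs, targets

rng = np.random.default_rng(0)
x, y = adding_problem_batch(20000, 100, rng)
baseline_mse = np.mean((y - 1.0) ** 2)   # error of always predicting 1
print(baseline_mse)
```

The empirical error of the constant prediction should come out close to 1/6, the variance of the sum of two independent uniform variables on [0, 1], which is the baseline a trained model has to beat.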
The results are shown in Fig. 2. GRU clearly outperforms LSTM in these trials. LSTM fails to converge within training steps when is while GRU can learn this task within steps even when trained with sequences of length . Our GDU models perform slightly better than GRU with fewer parameters. As increases, this advantage becomes more obvious. Note that a GDU with only one group of size has comparable performance to a much bigger one, which indicates that GDU can efficiently capture simple long-term dependencies even with a tiny model.
4.2 The 3-bit temporal order problem
The 3-bit temporal order problem is a sequence classification problem that examines the ability of recurrent models to extract information conveyed by the temporal order of widely separated inputs lstm . The input sequence consists of randomly chosen symbols from the set except for three elements at position , and that are either or . Position is randomly chosen between and , where and is the sequence length. The target is to classify the order (either XXX, XXY, XYX, XYY, YXX, YXY, YYX, YYY) which is represented locally using units, as well as the input symbol (represented using units).
Four different lengths of sequences, were used in this experiment. As in Section 4.1, we generated testing sequences for each length, and randomly generated a batch of sequences for each training step. Accuracy is used as the metric on the testing set, and the baseline is $0.125$ (chance level for the 8 classes). We compared an LSTM model with hidden states, a GRU model with hidden states and a GDU with groups of size on these data sets. The parameter numbers are , and respectively.
The results are shown in Fig. 3. In contrast to the results of the adding problem, LSTM outperforms GRU on this task. However, both LSTM and GRU fail to learn to distinguish the temporal order when the sequence length increases to . The GDU model with always starts learning earlier. When trained with relatively longer sequences, GDU outperforms these models by a large margin with far fewer parameters.
4.3 Multi-embedded Reber grammar
Embedded Reber grammar (ERG) fahlman91 ; lstm is a good example containing dependencies with different time scales. This task requires RNNs to read strings, one symbol at a time, and to predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, a model has to remember the second symbol. However, since it allows for training sequences with short time lags (of as few as steps), using it to evaluate a model’s ability to learn long-term dependencies is not appropriate. In order to make the training sequences longer, we modified the ERG by having multiple Reber strings embedded between the second and the last-but-one symbols (see Fig. 4).
We refer to this variant as the multi-embedded Reber grammar (mERG) and simply use the prefix $m$ to indicate the number of embedded Reber strings. For example, “BT(BPVVE)(BTSXSE)(BTXXVVE)TE” is a 3ERG sequence. Since each Reber string has a minimal length of $5$, the shortest $m$ERG sequence has a length of $5m + 4$.
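To make the construction concrete, the sketch below generates mERG strings from the standard Reber grammar transition graph. The state numbering, the function names and the uniform choice between outgoing transitions are assumptions of this sketch; the generated strings omit the parentheses that the example above uses only for readability.

```python
import random

# Standard Reber grammar after the initial "B": state -> [(symbol, next_state), ...].
# State 6 is the accepting state, after which "E" is emitted.
REBER = {
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
}

def reber_string(rng):
    """Sample one Reber string such as 'BPVVE'."""
    out, state = ["B"], 1
    while state != 6:
        symbol, state = rng.choice(REBER[state])
        out.append(symbol)
    out.append("E")
    return "".join(out)

def is_reber(s):
    """Check whether `s` is accepted by the Reber grammar."""
    if len(s) < 2 or s[0] != "B" or s[-1] != "E":
        return False
    state = 1
    for ch in s[1:-1]:
        transitions = dict(REBER.get(state, []))
        if ch not in transitions:
            return False
        state = transitions[ch]
    return state == 6

def merg_string(m, rng):
    """Embed m Reber strings between matching second and last-but-one symbols."""
    branch = rng.choice("TP")
    body = "".join(reber_string(rng) for _ in range(m))
    return "B" + branch + body + branch + "E"

rng = random.Random(0)
sample = merg_string(3, rng)
print(sample, len(sample))
```

The last-but-one symbol of every generated string repeats the second symbol, so predicting it requires remembering information across all embedded Reber strings.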
Learning ERG requires a recurrent model to have the ability to latch longterm memory while keeping mid and shortterm memory (provided is big) in the meantime. Further, there may be two legal successors from a given symbol and the model will never be able to do a perfect job of prediction. During training, the rules defining the grammar are never presented. Thus the model will see contradictory examples, sometimes with one successor and sometimes the other, which requires it to learn to activate both legal outputs. What’s more, a model must remember how many Reber strings it has read to make a correct prediction of the next symbol if the current symbol is an E. In other words, models must learn to count.
We set to be , and for this task, with the minimal sequence length , and respectively. One sequence is given at a time. As for the symbols with legal successors, a prediction is considered correct if the two desired outputs are the two with the largest values. For each we generated sequences for training and sequences for testing. The sequences in testing set are unique and have never appeared in training set. The same training and testing sets are used for comparing all models.
We also defined two criteria to test the model’s ability to capture long- and short-term dependencies separately. The one for short-term dependency is (short for short-term criterion), defined as the percentage of testing sequences in which every symbol except the one before last is predicted correctly by the model. The other is (short for long-term criterion), defined as the percentage of testing sequences whose last-but-one symbol is predicted correctly. We stopped the training when both and are satisfied (reach $100\%$), namely when all symbols in all testing sequences are predicted correctly. A naive strategy of predicting the symbol before last as T or P gives an expected of $50\%$, which serves as the baseline.
An LSTM model and a GRU model, both with 100 hidden states, were chosen for comparison as before, with corresponding parameter numbers and . As for GDU, we chose a model with 2 groups of size 35 and 10 groups of size 3 (denoted as GDU(2x35+10x3)), totalling 100 hidden units and parameters.
From the results presented in Fig. 5, we can see that for ERG, models always learn to capture the short-term dependencies first, while the long-term dependency is much more difficult to learn. GRU outperforms LSTM this time under both criteria. GDU is slightly inferior to LSTM and GRU in terms of . However, in terms of , it has an obvious advantage.
As discussed in Section 2, learning to latch long-term information in the presence of short-term dependencies is difficult for a traditional GAST model due to the gradient conflict. GDU greatly alleviates this problem by limiting in cGAST, namely the proportion of states to be overwritten, which results in a broader “bandwidth” for long-term information flow. Fig. 6 illustrates this by visualizing the activation of GAST models on the same ERG sequence after the has been satisfied.
4.4 Sequential pMNIST classification
The sequential MNIST task irnn can be seen as a sequence classification task in which MNIST images lecun98 of digits are read pixel by pixel from left to right, top to bottom. The sequential pMNIST task irnn is a more challenging variant in which the pixels of every image are permuted by the same randomly generated permutation matrix. This creates many longer-term dependencies across pixels than in the original pixel ordering, which makes it necessary for a model to learn and remember more complicated dependencies embedded in varying time scales.
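The input preparation for this task can be sketched as below. This is a minimal sketch, assuming 28x28 images and using random arrays as a stand-in for the actual MNIST data; the function name `to_pixel_sequences` and the seed are illustrative.

```python
import numpy as np

def to_pixel_sequences(images, permutation=None):
    """Flatten (N, 28, 28) images row by row into (N, 784, 1) pixel sequences.

    If `permutation` is given, the same pixel reordering is applied to every image.
    """
    seq = images.reshape(images.shape[0], -1)   # left to right, top to bottom
    if permutation is not None:
        seq = seq[:, permutation]
    return seq[..., None]

rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)        # drawn once and reused for all images
images = rng.random((3, 28, 28))       # stand-in for MNIST digits
plain = to_pixel_sequences(images)
permuted = to_pixel_sequences(images, perm)
```

Fixing one permutation for the whole data set keeps the task learnable while destroying the local spatial structure that makes the unpermuted ordering easier.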
Table 1: Results on the sequential pMNIST task.
Model       # parameters (≈, k)   Test accuracy (%)
LSTM(128)   67.9                  91.2
GRU(128)    51.2                  90.6
GDU(4x32)   34.6                  93.5
GDU(5x25)   33.0                  93.0
LSTM(256)   266.8                 91.8
GRU(256)    200.7                 92.6
GDU(4x62)   134.7                 94.7
GDU(5x51)   133.6                 94.8
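The parameter counts in the table can be cross-checked with a short calculation. The sketch assumes a scalar input per time step, a single recurrent layer made of 4 (LSTM), 3 (GRU) or 2 (GDU) affine blocks with biases, and a 10-way linear readout with bias; these assumptions and the helper name `rnn_param_count` are not stated in the text and are used only for this check.

```python
def rnn_param_count(n_hidden, n_blocks, n_input=1, n_classes=10):
    """Parameters of a single-layer gated RNN plus a linear readout."""
    per_block = n_hidden * (n_input + n_hidden) + n_hidden   # weights + bias
    readout = n_hidden * n_classes + n_classes
    return n_blocks * per_block + readout

counts = {
    "LSTM(128)": rnn_param_count(128, 4),
    "GRU(128)": rnn_param_count(128, 3),
    "GDU(4x32)": rnn_param_count(4 * 32, 2),
    "GDU(5x25)": rnn_param_count(5 * 25, 2),
    "LSTM(256)": rnn_param_count(256, 4),
    "GRU(256)": rnn_param_count(256, 3),
    "GDU(5x51)": rnn_param_count(5 * 51, 2),
}
for name, count in counts.items():
    print(f"{name}: {count / 1000:.1f}k")
```

Under these assumptions the formula reproduces the listed figures for these rows. The 134.7k listed for GDU(4x62) is what the same formula gives for 256 units (four groups of 64) rather than 248, so the group size in that row may be a typo.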
All models are trained with batch size of and the learning rate is set to . No tricks, such as dropout dropout pascanu13, recurrent batch normalization rbn16 , etc., are used since we are not focusing on achieving the highest possible accuracy. We trained two sets of models with and hidden states respectively. Again, GDU outperforms LSTM and GRU with fewer parameters in this task, as shown in Table 1.
As discussed in Section 2, controlling is the key to avoiding the vanishing gradient issue, so that long-term dependencies can be learned. We explored how each model propagated gradients by examining as a function of , where is the prediction loss. Gradient norms were computed after and epochs and the normalized curves are plotted in Fig. 7. For LSTM and GRU, we can see that at the early stage of training, error signals have trouble reaching far from where they are injected. This problem is reduced after training for dozens of epochs. GDU models have better gradient properties than LSTM and GRU because of the distributor mechanism in Eqs. (18).
5 Conclusions and future work
We proposed a novel RNN architecture with gated additive state transition which contains only one gate unit. The issues of gradient vanishing and conflict are mitigated by explicitly limiting the proportion of states to be overwritten at each time step. Our experiments mainly focused on challenging pathological problems. The results were consistent over different tasks and clearly demonstrated that the proposed grouped distributor architecture helps to extract long-term dependencies embedded in data.
A plethora of further ideas can be explored based on our findings. For example, various combinations of groups with different sizes and overwrite proportions can be explored. Further, the overwrite proportion can be trained. More interestingly, the grouped distributor structure can be used spatially to ease gradient-based training of very deep networks. To be more specific, this work can build on the highway network highway , in which the distributor operator can be used to calculate the transform gate. Tests of the stacked GDU on other data sets are also planned.
References
[1] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[2] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1:339–356, 1988.
[3] Sepp Hochreiter, Yoshua Bengio, and Paolo Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.
[4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
[5] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 735–742, USA, 2010. Omnipress.
[6] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, pages 1033–1040, USA, 2011. Omnipress.
[7] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.
[8] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units, 2015.
[9] Tsungnan Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. Trans. Neur. Netw., 7(6):1329–1338, 1996.
[10] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 77–87. Curran Associates, Inc., 2017.
[11] Robert DiPietro, Christian Rupprecht, Nassir Navab, and Gregory D. Hager. Analyzing and exploiting NARX recurrent neural networks for long-term dependencies, 2018.
[12] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014.
[13] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. In International Conference on Learning Representations, 2016.
[14] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks, 2016.
[15] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1863–1871, Beijing, China, 2014. PMLR.
[16] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[17] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, ICML’13, pages III–1310–III–1318. JMLR.org, 2013.
[18] Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Comput., 12(10):2451–2471, October 2000.
[19] Felix A. Gers and Juergen Schmidhuber. Recurrent nets that time and count. Technical report, 2000.
[20] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(8):2222–2232, 2017.
[21] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
[22] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
[23] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. 1996.
[24] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.
[27] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks, 2015.
[28] Scott E. Fahlman. The recurrent cascade-correlation architecture. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190–196. Morgan-Kaufmann, 1991.
[29] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[31] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville. Recurrent batch normalization, 2016.
[32] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks, 2015.