Learning Longer-term Dependencies via Grouped Distributor Unit

04/29/2019 ∙ by Wei Luo, et al.

Learning long-term dependencies remains difficult for recurrent neural networks (RNNs) despite their recent success in sequence modeling. In this paper, we propose a novel gated RNN structure that contains only one gate. Hidden states in the proposed grouped distributor unit (GDU) are partitioned into groups. For each group, the proportion of memory to be overwritten in each state transition is limited to a constant and is adaptively distributed to each group member. In other words, every separate group has a fixed overall update rate, yet all units are allowed to have different paces. Information is therefore forced to be latched in a flexible way, which helps the model to capture long-term dependencies in data. Besides having a simpler structure, GDU is demonstrated experimentally to outperform LSTM and GRU on tasks including both pathological problems and a natural data set.




1 Introduction

Recurrent Neural Networks (RNNs, rumelhart86 ; werbos88 ) are powerful dynamic systems for tasks that involve sequential inputs, such as audio classification, machine translation and speech generation. As they process a sequence one element at a time, internal states are maintained to store information computed from the past inputs, which in theory makes RNNs capable of modeling temporal correlations between elements at any distance.

In practice, however, it is difficult for RNNs to learn long-term dependencies in data by using back-propagation through time (BPTT, rumelhart86 ) due to the well-known vanishing and exploding gradient problem hochreiter01 . Besides, training RNNs suffers from gradient conflicts (e.g. input conflict and output conflict lstm ), which make it challenging to latch long-term information while keeping mid- and short-term memory simultaneously. Various attempts have been made to increase the temporal range over which credit assignment takes effect for recurrent models during training, including adopting a much more sophisticated Hessian-free optimization method instead of stochastic gradient descent martens10 ; martens11 , using orthogonal weight matrices to assist optimization saxe13 ; irnn , and allowing direct connections to model inputs or states from the distant past narx ; dilatedrnn ; mistrnn . Long short-term memory (LSTM, lstm ) and its variant, known as gated recurrent units (GRU, gru ), mitigate gradient conflicts by using multiplicative gate units. Moreover, the vanishing gradient problem is alleviated by the additivity in their state transition operator. Simplified gated units have been proposed ugrnn ; mgu , yet the ability to capture long-term dependencies has not been improved. Recent work also supports the idea of partitioning the hidden units in an RNN into separate modules with different processing periods cwrnn .

In this paper, we introduce the Grouped Distributor Unit (GDU), a new gated recurrent architecture with additive state transition and only one gate unit. Hidden states inside a GDU are partitioned into groups, each of which keeps a constant proportion of previous memory at each time step, forcing information to be latched. The vanishing gradient problem and the issue of gradient conflict, both of which impede the extraction of long-term dependencies, are thus alleviated.

We empirically evaluated the proposed model against LSTM and GRU on both synthetic problems, which are designed to be pathologically difficult, and a natural dataset containing long-term components. Results reveal that our proposed model outperforms LSTM and GRU on these tasks with a simpler structure and fewer parameters.

2 Background and related work

An RNN is able to encode sequences of arbitrary length into a fixed-length representation by folding a new observation $x_t$ into its hidden state $h_t$ using a transition operator $\mathcal{T}$ at each time step $t$:

$$h_t = \mathcal{T}(x_t, h_{t-1}) \qquad (1)$$

Simple recurrent networks (SRN, elman90 ), known as one of the earliest variants, make $\mathcal{T}$ the composition of an element-wise nonlinearity with an affine transformation of both $x_t$ and $h_{t-1}$:

$$h_t = \phi(W x_t + U h_{t-1} + b) \qquad (2)$$

where $W$ is the input-to-state weight matrix, $U$ is the state-to-state recurrent weight matrix, $b$ is the bias and $\phi$ is the nonlinear activation function. For the convenience of the following descriptions, we denote this kind of operator as $\mathcal{N}^{\phi}$, and a subscript can be added to distinguish different network components. Thus in SRN, $\mathcal{T} = \mathcal{N}^{\phi}$.

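During training via BPTT, the error obtained from the output of an RNN at time step $t$ (denoted as $E_t$) travels backward through each state unit. The corresponding error signal propagated back to time step $\tau$ (denoted as $\partial E_t / \partial h_\tau \in \mathbb{R}^n$, in which $n$ is the state size and the $i$-th component represents the sensitivity of $E_t$ to small perturbations in the $i$-th state unit at time step $\tau$; $\tau < t$) contains a product of Jacobian matrices:

$$\frac{\partial E_t}{\partial h_\tau} = \frac{\partial E_t}{\partial h_t} \prod_{k=\tau+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \qquad (3)$$

From Eq. (3) we can easily find a sufficient condition for the vanishing gradient problem to occur, i.e. $\|\partial h_k / \partial h_{k-1}\| < 1$. Under this condition, a bound $\eta \in \mathbb{R}$ can be found such that $\forall k$, $\|\partial h_k / \partial h_{k-1}\| \le \eta < 1$, and

$$\left\| \frac{\partial E_t}{\partial h_\tau} \right\| \le \eta^{\,t-\tau} \left\| \frac{\partial E_t}{\partial h_t} \right\| \qquad (4)$$

As $\eta < 1$, long-term contributions (for which $t - \tau$ is large) go to $0$ exponentially fast with $t - \tau$.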

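In SRN, $\partial h_k / \partial h_{k-1}$ is given by $\mathrm{diag}\big(\phi'(W x_k + U h_{k-1} + b)\big)\, U$. As a result, if the derivative of the nonlinear function $\phi$ is bounded in SRN, namely, $\exists \gamma \in \mathbb{R}$ s.t. $\|\mathrm{diag}(\phi'(\cdot))\| \le \gamma$, it will be sufficient for $\lambda_1 < 1/\gamma$, where $\lambda_1$ is the largest singular value of the recurrent weight matrix $U$, for $\partial E_t / \partial h_\tau$ to vanish (as $t - \tau \to \infty$) pascanu13 .

Any RNN architecture with a long-term memory ability should at least be designed to make sure the norm of its transition Jacobian will not easily be bounded by $1$ for a long time span as it goes through a sequence.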
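The exponential decay of long-term contributions can be checked numerically. The following is a minimal sketch, assuming a scalar SRN with a tanh nonlinearity; the weights `u` and `w`, the input sequence and the function name `backprop_factor` are illustrative choices, not values from the paper. Since the derivative of tanh never exceeds 1, each one-step Jacobian has magnitude at most |u|, so the accumulated factor cannot exceed |u| raised to the number of steps.

```python
import math

def backprop_factor(u, w, inputs):
    """Return |d h_T / d h_0| for the scalar SRN h_t = tanh(u*h_{t-1} + w*x_t)."""
    h = 0.0
    factor = 1.0
    for x in inputs:
        h = math.tanh(u * h + w * x)
        # One-step Jacobian of the scalar recurrence: tanh'(a) * u = (1 - h^2) * u.
        factor *= (1.0 - h * h) * u
    return abs(factor)

u, w = 0.9, 0.5                      # hypothetical weights with |u| < 1
for steps in (10, 50, 100):
    xs = [math.sin(0.3 * k) for k in range(steps)]
    # Second column: back-propagated factor; third column: the bound |u|^steps.
    print(steps, backprop_factor(u, w, xs), abs(u) ** steps)
```

In this toy setting the measured factor stays below the bound for every length, and both shrink quickly as the time lag grows.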

2.1 Gated additive state transition (GAST)

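Long short-term memory (LSTM, lstm ) introduced a memory unit with a self-connected structure which can maintain its state over time, and non-linear gating units (originally input and output gates) which control the information flow into and out of it. Since the initial proposal in 1997, many improvements have been made to the LSTM architecture forgetgate ; peephole . In this paper, we refer to the variant with a forget gate and without peephole connections, which has performance comparable to that of more complex variants greff17 :

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \qquad (5)$$

Here $\sigma$ denotes the sigmoid activation and $\odot$ denotes element-wise multiplication. Note that $c_t$ should also be considered as hidden state besides $h_t$.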

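Cho et al. gru proposed a similar architecture with gating units called the gated recurrent unit (GRU). Different from LSTM, GRU exposes all its states to the output and uses a linear interpolation between the previous state $h_{t-1}$ and the candidate state $\tilde{h}_t$:

$$\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned} \qquad (6)$$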

Previous work has clearly indicated the advantages of the gating units over the more traditional recurrent units chung14 . Both LSTM and GRU perform well in tasks that require capturing long-term dependencies. However, the choice between these two structures may depend heavily on the dataset and the corresponding task.

Figure 1: Left: The gated additive state transition (GAST). Inputs and outputs are not shown. Superscript denotes the ordinal number of a state unit. In LSTM, corresponds to . In GRU, . Right: The GAST in a GDU group with size and . Compared to LSTM and GRU, gate operator is removed and gate operators inside the group are correlated, i.e. . Any unit assigned with a high will force other group members to latch information.

It is easy to notice that the most prominent feature shared between these units is the additivity in their state transition operators. In other words, both LSTM and GRU keep the existing states and add the new states on top of them instead of replacing previous states directly, as is done in traditional recurrent units like SRN. Another important ingredient in their transition operator is the gating mechanism, which regulates the information flow and enables the network to form skip connections adaptively. In this paper we refer to this kind of transition operator as the Gated Additive State Transition (GAST), with a general formula:


where , and are called gate operators with subscript indicating that values of the corresponding gating units change over time (see Fig. 1 (left)). In LSTM:


whilst in GRU:


We denote the gate vector used in a gate operator at time step as . Note that, except for Eq. (8a), gate operators have a common form (in the remainder of this paper, gate operators are referred to as being in this form):


where is a state vector to be gated. We use to indicate as in the case of GRU. According to Eq. (7b), the transition Jacobian of a GAST can be resolved into 4 parts:


in which


The gradient property of GAST is much better than that of SRN since it can easily prevent its transition Jacobian norm from being bounded within by saturating part of the units in nearly at 1. Intuitively, when this happens, the corresponding components of the error signal are allowed to be back-propagated easily through the shortcut created by the additive character of GAST without vanishing too quickly.

The original LSTM lstm uses full gate recurrence odyssey , which means that all neurons receive recurrent inputs from all gate activations at the previous time step besides the block outputs. Nevertheless, it still follows Eqs. (7). Another difference is that the original LSTM does not use a forget gate, i.e. , thus in Eq. (12a), is a unit diagonal matrix . In addition, gradients are truncated by replacing the other components in its transition Jacobian, i.e. Eqs. (12b), (12c) and (12d), by zero, forming a constant error carousel (CEC) where . It is noticeable, however, that if the gradients are not truncated, Eq. (3) does not hold for LSTMs since the gate vector used in is calculated at the previous time step, see Eq. (8a). In this condition, a concatenation of and should be used in analysis of its transition Jacobian, as in Fig. 7.

Simplifying GAST has drawn the interest of researchers recently. GRU itself reduces the gate units to compared to LSTM which has gate units by coupling the forget gate and input gate into one update gate, namely making the gate operator equal to . In this paper we denote this kind of GAST as cGAST, with the prefix c short for coupled. Based on GRU, the Minimal Gated Unit (MGU, mgu ) reduced the gate number further to only 1 by letting without losing GRU’s accuracy benefits. The Update Gate RNN (UGRNN, ugrnn ) entirely removed the operator. However, none of these models has shown superiority over LSTM and GRU on long-term tasks with single-layer hidden states.

2.2 Units partitioning

Although the capacity to capture long-term dependencies in sequences is of crucial importance to RNNs, it is worth noting that the flowing data usually carries both slow-moving and fast-moving information, of which the former corresponds to long-term dependencies. Because both long- and short-term information exist in sequences, gradient conflict is always present during training. Here gradient conflict mainly refers to the contradiction between error signals back-propagated to the same time step, but injected at different time steps during training via BPTT. This issue may hinder the establishment of long-term memory even without the vanishing gradient problem.

Consider a task in which a GRU is given one data point at a time and asked to predict the next, e.g. ERG (see Section 4.3). If the correct prediction at time step depends heavily on the data point that appeared at time step , namely , where , we can say a long-term dependency exists between and . GRU can capture this kind of dependency by learning to encode into some state units and latch it until . For simplicity, let us focus on a single state unit and assume that the information of has been stored in . At time step (), state unit will often receive conflicting error signals. The error signal injected at time step may attempt to make keep its value until , while other error signals injected before , say, , may require to help with the prediction at time step and thus attempt to have overwritten by a new value. This conflict makes the GRU model hesitate to shut the update gate for by setting to . In GRU, we also observed that state units latching long-term memories (with corresponding neurons in staying active for a long time) are usually sparse (see Fig. 6 (left)), which impedes the back-propagation of effective long-term error signals, since short-term error signals dominate. As a result, learning can be slow.

El Hihi and Bengio first showed that RNNs can learn both long- and short-term dependencies more easily and efficiently if state units are partitioned into groups with different timescales hihi95 . The clockwork RNN (CW-RNN) cwrnn implemented this by assigning each state unit a fixed temporal granularity, so that a state transition happens only at its prescribed clock rate. It can also be seen as a member of the cGAST family: more specifically, a UGRNN with a special gate operator in which each gate vector value is explicitly scheduled to saturate at either or . CW-RNN does not suffer from gradient conflict because it inherently has the ability to latch information. However, the clock rate schedule has to be tuned for each task.
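The view of CW-RNN as a coupled transition with a scheduled binary gate can be sketched as follows. This is only an illustration of the gate schedule under stated assumptions: the clock periods, the stand-in candidate values and the names `clockwork_gate` and `coupled_update` are hypothetical, and the block-structured recurrent connectivity of the actual CW-RNN is not modeled.

```python
def clockwork_gate(t, periods):
    """Binary update gate: unit i is overwritten only when t is a multiple of periods[i]."""
    return [1.0 if t % p == 0 else 0.0 for p in periods]

def coupled_update(h_prev, candidate, gate):
    """Coupled additive transition: keep (1 - g) of the old state, write g of the candidate."""
    return [(1.0 - g) * h + g * c for h, c, g in zip(h_prev, candidate, gate)]

periods = [1, 2, 4, 8]                      # hypothetical clock periods, one per state unit
h = [0.0] * len(periods)
for t in range(1, 8):                       # time steps 1..7
    candidate = [float(t)] * len(periods)   # stand-in for the candidate state
    h = coupled_update(h, candidate, clockwork_gate(t, periods))
print(h)                                    # [7.0, 6.0, 4.0, 0.0]
```

Each unit ends up holding the value written at its most recent clock tick, so slow units latch information across many steps without any learned gate.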

3 Grouped Distributor Unit

As introduced in Section 2, a network combining the advantages of GAST and the idea of partitioning state units into groups seems promising. Further, we argue that a dynamic system with memory does not need to overwrite the vast majority of its memory based on relatively little input data. For cGAST models whose , we define the proportion of states to be overwritten at time step as:


where is the state size. On the other hand, the proportion of previous states to be kept is:


Hence in our view, if a model input contains a small amount of information compared to system memory , should be kept low to protect the previous states. For cGAST family members, a lower leads to more active units in (see Fig. 6 (right)) and thus makes the model less prone to be affected by gradient conflict.

To put a limit on , we start from a plain UGRNN and partition its state units into groups:


where the -th group contains units. At each time step, for each , we let a positive constant be distributed to the corresponding components in , namely


Thus becomes a constant given by


As shown in Fig. 1 (right), the distribution work in each group is done by a distributor; hence the proposed structure is called the Grouped Distributor Unit (GDU). The distributor is implemented by applying the softmax activation over each group individually in calculating :


Here and (the permutation of can be arbitrary). Note that when and when . The resulting GDU is given by


where denotes distributor operator with group configuration as is detailed in Eqs. (18).

In this paper, we let . As a consequence,


If the size of each state group is set to a constant , will be further reduced to .

GDU has an inherent strength in keeping long-term memory, since any saturated state unit will force all other group members to latch information. As a result, the “bandwidth” is wider for long-term information to travel forward and for error signals to back-propagate (see Fig. 6 (right)).

Like CW-RNN, we set an explicit rate for each group. However, instead of making all group members act in the same way, we allow each unit to find its own rate by learning.
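The state transition described in this section can be sketched in a few lines. The form below is an assumption reconstructed from the prose, not the authors' implementation: a UGRNN-style coupled transition (one gate and one candidate state, no reset gate), a softmax applied separately over each group so that every group's gate values sum to 1, and a tanh candidate. All names (`gdu_step`, `grouped_softmax`, `group_sizes`) and the random weights are hypothetical.

```python
import math
import random

def affine(W, U, b, x, h):
    """Compute W x + U h + b with plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(W[r], x))
            + sum(ui * hi for ui, hi in zip(U[r], h)) + b[r]
            for r in range(len(b))]

def grouped_softmax(a, group_sizes):
    """Softmax applied separately to each contiguous group, so each group sums to 1."""
    out, start = [], 0
    for size in group_sizes:
        chunk = a[start:start + size]
        m = max(chunk)                       # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in chunk]
        total = sum(exps)
        out.extend(e / total for e in exps)
        start += size
    return out

def gdu_step(params, x, h_prev, group_sizes):
    """One transition of the assumed GDU form: h = (1 - z) * h_prev + z * candidate."""
    Wz, Uz, bz, Wc, Uc, bc = params
    z = grouped_softmax(affine(Wz, Uz, bz, x, h_prev), group_sizes)
    cand = [math.tanh(v) for v in affine(Wc, Uc, bc, x, h_prev)]
    h = [(1.0 - zi) * hp + zi * ci for zi, hp, ci in zip(z, h_prev, cand)]
    return h, z

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
group_sizes = [4, 4, 4]                      # hypothetical: 3 groups of 4 units
n = sum(group_sizes)
params = (random_matrix(n, 2, rng), random_matrix(n, n, rng), [0.0] * n,
          random_matrix(n, 2, rng), random_matrix(n, n, rng), [0.0] * n)
h, z = gdu_step(params, [0.3, -0.7], [0.0] * n, group_sizes)
print(sum(z) / n)                            # overall overwrite proportion, close to 3/12
```

Under these assumptions the overall overwrite proportion equals the number of groups divided by the state size regardless of the input, and a unit whose gate value approaches 1 leaves its group mates with gate values near 0, which is the latching behavior described above.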

4 Experiments

We evaluated the proposed GDU on both pathological synthetic tasks and a natural data set in comparison with LSTM and GRU. It is important to point out that although LSTM and GRU have similar performance on natural data sets chung14 , one model may outperform the other by a huge gap on different pathological tasks, such as the adding problem (see 4.1), at which GRU is good, and the temporal order problem (see 4.2), on which LSTM performs better.

If not otherwise specified, all networks have one hidden layer with the same state size. Weight variables were initialized via the Xavier uniform initializer glorot10 , and the initial values of all internal state variables were set to . All networks were trained using the Adam optimization method adam via BPTT, and the models were implemented using TensorFlow tf . In GDU models, apply to all groups.

4.1 The adding problem

The adding problem is a sequence regression problem which was originally proposed in lstm to examine the ability of recurrent models to capture long-term dependencies. Two sequences of length are taken as input. The first one consists of real numbers sampled from a uniform distribution in . The second sequence serves as indicators, with exactly two entries being 1 and the remaining being 0. We followed the settings in urnn where is a constant and the first 1 entry is located uniformly at random in the first half of the indicator sequence, whilst the second 1 entry is located uniformly at random in the other half. The target of this problem is to add up the two entries in the first sequence whose corresponding indicator in the second sequence is 1. A naive strategy of outputting regardless of the inputs yields a mean squared error of , which is the variance of the sum of two independent uniform distributions over . We took it as the baseline.
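A data generator for this task can be sketched as below. The sampling range [0, 1] is an assumption taken from the commonly used setting of this benchmark, the sequence length of 100 is arbitrary, and `adding_example` is a hypothetical name. Under that assumed range, the variance of one uniform variable is 1/12, so the variance of the sum of two independent ones is 1/6.

```python
import random

def adding_example(length, rng):
    """Return (values, indicators, target) for one adding-problem sequence."""
    values = [rng.uniform(0.0, 1.0) for _ in range(length)]   # assumed range [0, 1]
    half = length // 2
    first = rng.randrange(0, half)           # marker position in the first half
    second = rng.randrange(half, length)     # marker position in the second half
    indicators = [0] * length
    indicators[first] = indicators[second] = 1
    return values, indicators, values[first] + values[second]

# Variance of the sum of two independent U[0, 1] variables: 2 * (1/12) = 1/6.
baseline_mse = 2 * (1.0 / 12.0)

rng = random.Random(0)
values, indicators, target = adding_example(100, rng)
print(sum(indicators), round(baseline_mse, 3))   # 2 0.167
```

A model that always outputs the mean of the target attains this baseline error in expectation, so any useful model must fall clearly below it.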

Four different lengths of sequences, were used in this experiment. For each length, sequences were generated for testing, while a batch of sequences was randomly generated at each training step. Four models, an LSTM with hidden states, a GRU with hidden states, a GDU with groups of size and a GDU with only group of size were compared, with corresponding parameter counts of , , and . A simple linear layer without activation was stacked on top of the recurrent layer in each model.

Figure 2: The results of the adding problem on different sequence lengths. The legends for all sub-figures are the same and are thus only shown in the first sub-figure, in which state sizes are specified following model names. For a GDU model, means it has groups of size . Each training trial was stopped when the test MSE reached below , as indicated by a short vertical bar. When training with sequences of length , LSTM(100) failed to converge within steps and only the curve of the first steps is shown.

The results are shown in Fig. 2. GRU clearly outperforms LSTM in these trials. LSTM fails to converge within training steps when is while GRU can learn this task within steps even when trained with sequences of length . Our GDU models perform slightly better than GRU with fewer parameters. As increases, this advantage becomes more obvious. Note that a GDU with only one group of size has performance comparable to that of a much bigger one, which indicates that GDU can efficiently capture simple long-term dependencies even with a tiny model.

4.2 The 3-bit temporal order problem

The 3-bit temporal order problem is a sequence classification problem to examine the ability of recurrent models to extract information conveyed by the temporal order of widely separated inputs lstm . The input sequence consists of randomly chosen symbols from the set except for three elements at position , and that are either or . Position is randomly chosen between and , where and is the sequence length. The target is to classify the order (either XXX, XXY, XYX, XYY, YXX, YXY, YYX, YYY) which is represented locally using units, as well as the input symbol (represented using units).

Four different lengths of sequences, were used in this experiment. As in Section 4.1, we generated testing sequences for each length, and randomly generated a batch of sequences for each training step. Accuracy is used as the metric on the testing set, and the baseline is . We compared an LSTM model with hidden states, a GRU model with hidden states and a GDU with groups of size on these data sets. The parameter numbers are , and respectively.

Figure 3: The results of the 3-bit temporal order problem on different sequence lengths. The legends containing the information of model size are only shown in the first sub-figure. Each trial was stopped if all sequences in the testing set were classified correctly, as indicated by a dashed vertical line. When the sequence length is 500, both LSTM and GRU failed within 50000 training steps, and their accuracy curves, which keep fluctuating around the baseline, are partially plotted.

The results are shown in Fig. 3. In contrast to the results of the adding problem, LSTM outperforms GRU on this task. However, both LSTM and GRU fail in learning to distinguish the temporal order when the sequence length increases to . The GDU model with always starts learning earlier. When trained with relatively longer sequences, GDU outperforms these models by a large margin with far fewer parameters.

4.3 Multi-embedded Reber grammar

Embedded Reber grammar (ERG) fahlman91 ; lstm is a good example containing dependencies with different time scales. This task requires RNNs to read strings, one symbol at a time, and to predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, a model has to remember the second symbol. However, since it allows for training sequences with short time lags (of as few as steps), using it to evaluate a model’s ability to learn long-term dependencies is not appropriate. In order to make the training sequences longer, we modified the ERG by having multiple Reber strings embedded between the second and the last-but-one symbols (see Fig. 4).

Figure 4: Left: Transition diagram for the Reber grammar. Right: Transition diagram for the multi-embedded Reber grammar. Each box represents a Reber string.

We refer to this variant as the multi-embedded Reber grammar (mERG) and simply use the prefix to indicate the number of embedded Reber strings. For example, “BT(BPVVE)(BTSXSE)(BTXXVVE)TE” is a 3ERG sequence. Since each Reber string has a minimal length , the shortest ERG sequence has a length of .
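A generator for such sequences can be sketched as follows. The transition table encodes the standard Reber grammar, which is assumed here to match the diagram in Fig. 4; the function names and the state numbering are hypothetical. The three bracketed strings in the example above (BPVVE, BTSXSE, BTXXVVE) are all accepted by this table.

```python
import random

# Reber grammar as a transition table: state -> list of (symbol, next_state).
REBER = {
    0: [("B", 1)],
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", None)],
}

def reber_string(rng):
    """Sample one Reber string by a random walk from the start state to the end."""
    state, out = 0, []
    while state is not None:
        symbol, state = rng.choice(REBER[state])
        out.append(symbol)
    return "".join(out)

def merg_string(m, rng):
    """B, then T or P, then m Reber strings, then the same T or P, then E."""
    branch = rng.choice("TP")
    return "B" + branch + "".join(reber_string(rng) for _ in range(m)) + branch + "E"

def is_reber(s):
    """Check whether s is exactly one complete Reber string."""
    state = 0
    for ch in s:
        if state is None:                    # symbols after the final E
            return False
        nxt = dict(REBER[state]).get(ch, "bad")
        if nxt == "bad":                     # symbol not allowed in this state
            return False
        state = nxt
    return state is None

rng = random.Random(0)
s = merg_string(3, rng)
print(len(s) >= 5 * 3 + 4, s[1] == s[-2])    # True True
```

The second and the last-but-one symbols always agree, which is the long-term dependency a model has to latch across all embedded strings.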

Learning ERG requires a recurrent model to have the ability to latch long-term memory while keeping mid- and short-term memory (provided is big) in the meantime. Further, there may be two legal successors from a given symbol, so the model will never be able to do a perfect job of prediction. During training, the rules defining the grammar are never presented. Thus the model will see contradictory examples, sometimes with one successor and sometimes the other, which requires it to learn to activate both legal outputs. Moreover, a model must remember how many Reber strings it has read to make a correct prediction of the next symbol if the current symbol is an E. In other words, models must learn to count.

We set to be , and for this task, with minimal sequence lengths of , and respectively. One sequence is given at a time. As for the symbols with legal successors, a prediction is considered correct if the two desired outputs are the two with the largest values. For each we generated sequences for training and sequences for testing. The sequences in the testing set are unique and never appear in the training set. The same training and testing sets are used for comparing all models.

We also defined two criteria to test the model’s ability to capture long- and short-term dependencies separately. The one for short-term dependency, (short for short-term criterion), is defined as the percentage of testing sequences in which every symbol except the one before last is predicted correctly by the model. The other, (short for long-term criterion), is defined as the percentage of testing sequences whose last-but-one symbol is predicted correctly. We stopped the training when both and are satisfied (reach ), namely when all symbols in all testing sequences are predicted correctly. A naive strategy of predicting the symbol before last as T or P gives an expected of , which serves as the baseline.
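The correctness rule and the two criteria can be expressed compactly. This is a minimal sketch under assumptions: `prediction_correct` and `criteria` are hypothetical names, the criteria are returned as fractions rather than percentages, and each test sequence is represented by a list of per-step booleans in which the second-to-last entry refers to the prediction of the last-but-one symbol.

```python
def prediction_correct(outputs, legal):
    """True if the len(legal) largest outputs are exactly the legal successor indices."""
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    return set(ranked[:len(legal)]) == set(legal)

def criteria(per_sequence_flags):
    """Return (short-term, long-term) criteria as fractions of test sequences.

    per_sequence_flags holds, for each test sequence, one boolean per predicted
    symbol; flags[-2] is the prediction of the last-but-one symbol.
    """
    n = len(per_sequence_flags)
    # Long-term: the last-but-one symbol is predicted correctly.
    long_ok = sum(1 for flags in per_sequence_flags if flags[-2])
    # Short-term: every prediction except that of the last-but-one symbol is correct.
    short_ok = sum(1 for flags in per_sequence_flags
                   if all(f for k, f in enumerate(flags) if k != len(flags) - 2))
    return short_ok / n, long_ok / n

flags = [[True, True, False, True],      # only the long-term prediction fails
         [True, True, True, True],       # everything correct
         [False, True, True, True]]      # a short-term prediction fails
print(prediction_correct([0.1, 0.6, 0.05, 0.7], [1, 3]), criteria(flags))
```

Training would stop, in this formulation, once both returned fractions reach 1.0.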

An LSTM model and a GRU model, both with 100 hidden states, were chosen for comparison as before, with corresponding parameter numbers and . As for GDU, we chose a model with groups of size and groups of size (denoted as GDU(2x35+10x3)), with hidden units and parameters in total.

Figure 5: The results of the multi-embedded Reber grammar. The upper left and right figures show the training steps each model takes to satisfy the criteria (reach 1.0) for or . Each box-whisker (showing median, and quantiles, minimum, maximum and outliers) contains the corresponding results of trials. For we only give the best results of each model in the bottom left figure. The bottom right figure shows the density histogram of sequence lengths in the ERG training set.

From the results presented in Fig. 5, we can see that for ERG, models always learn to capture the short-term dependencies first, while the long-term dependency is much more difficult to learn. GRU outperforms LSTM this time on both criteria. GDU is slightly inferior to LSTM and GRU in terms of . However, in terms of , it has an obvious advantage.

As discussed in Section 2, learning to latch long-term information in the presence of short-term dependencies is difficult for a traditional GAST model due to the gradient conflict. GDU greatly alleviates this problem by limiting in cGAST, namely the proportion of states to be overwritten, which results in a broader “bandwidth” for long-term information flow. Fig. 6 illustrates this by visualizing the activation of GAST models on the same ERG sequence after the has been satisfied.

Figure 6: The activation of of GRU(100) (left) and GDU(2x35+10x3) (right) on a same sequence from 10ERG testing set. Each column corresponds to the gate activation at one time step. Each row with continuous dark color corresponds to a gate unit which keeps active and thus latches information.

4.4 Sequential pMNIST classification

The sequential MNIST task irnn can be seen as a sequence classification task in which MNIST images lecun98 of digits are read pixel by pixel from left to right, top to bottom. The permuted sequential MNIST (pMNIST) irnn is a more challenging variant in which the pixels of every image are permuted by the same randomly generated permutation matrix. This creates many longer-term dependencies across pixels than in the original pixel ordering, which makes it necessary for a model to learn and remember more complicated dependencies embedded in varying time scales.
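The input preparation for both variants can be sketched as follows. The seed, the dummy image and the function names are illustrative assumptions; the only points taken from the text are the row-by-row reading order and the single permutation shared by all images.

```python
import random

def make_permutation(n_pixels, seed):
    """One fixed random permutation shared by every image."""
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)
    return perm

def to_sequence(image_rows, perm=None):
    """Flatten an image row by row; optionally reorder the pixels with perm."""
    flat = [p for row in image_rows for p in row]
    return flat if perm is None else [flat[i] for i in perm]

side = 28                                      # MNIST images are 28 x 28
perm = make_permutation(side * side, seed=0)   # the seed is an arbitrary choice
image = [[r * side + c for c in range(side)] for r in range(side)]  # dummy image
seq = to_sequence(image, perm)
print(len(seq), sorted(seq) == to_sequence(image))   # 784 True
```

Each image becomes a sequence of 784 scalar inputs; the permutation only changes the order in which the pixels are presented, not their values.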

Model        # parameters (≈, k)   Test accuracy (%)
LSTM(128)    67.9                  91.2
GRU(128)     51.2                  90.6
GDU(4x32)    34.6                  93.5 *
GDU(5x25)    33.0                  93.0
LSTM(256)    266.8                 91.8
GRU(256)     200.7                 92.6
GDU(4x62)    134.7                 94.7
GDU(5x51)    133.6                 94.8 *
Table 1: Results for permuted pixel-by-pixel MNIST. The best result in each model set is marked with *.

All models are trained with batch size of and the learning rate is set to . No tricks, such as dropout dropout , gradient clipping , recurrent batch normalization rbn16 , etc., are used since we are not focusing on achieving absolute high accuracy. We trained two sets of models with and hidden states respectively. Again, GDU outperforms LSTM and GRU with fewer parameters in this task, as shown in Table 1.
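The parameter counts in Table 1 can be cross-checked with a short calculation. This is a sketch under assumptions that are not stated in the text: one scalar pixel per time step, a 10-way linear read-out layer, and 4, 3 and 2 affine maps (each with input weights, recurrent weights and a bias) for LSTM, GRU and GDU respectively. The function names are hypothetical.

```python
def recurrent_params(n_state, n_in, n_blocks):
    """n_blocks affine maps, each with input weights, recurrent weights and a bias."""
    return n_blocks * (n_state * n_in + n_state * n_state + n_state)

def total_params(n_state, n_blocks, n_in=1, n_out=10):
    """Recurrent layer plus a linear read-out layer."""
    return recurrent_params(n_state, n_in, n_blocks) + n_state * n_out + n_out

BLOCKS = {"LSTM": 4, "GRU": 3, "GDU": 2}   # assumed number of affine maps per cell

for name, n in [("LSTM", 128), ("GRU", 128), ("GDU", 128), ("GDU", 125),
                ("LSTM", 256), ("GRU", 256), ("GDU", 256), ("GDU", 255)]:
    print(name, n, total_params(n, BLOCKS[name]))
```

Under these assumptions the counts are 67850, 51210, 34570, 33010, 266762, 200714, 134666 and 133630, which agree with the table after rounding to the nearest 0.1k. Note that the 134.7k entry is reproduced with a state size of 256, whereas the label GDU(4x62) would imply 248 units (126490 parameters under the same assumptions); this may indicate a typo in the label or a setup that differs from the one assumed here.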

Figure 7: Norms of the error signal back-propagated to each time step, i.e. after epochs (left) and epochs (right). For LSTM model, we calculate instead of , where is a concatenation of and .

As discussed in Section 2, controlling is the key to avoid the vanishing gradient issue, so that long-term dependencies can be learned. We explored how each model propagated gradients by examining as a function of , where is the prediction loss. Gradient norms were computed after and epochs and the normalized curves are plotted in Fig. 7. For LSTM and GRU, we can see that error signals have trouble in reaching far from where they are injected at the early stage. This problem is reduced after training for dozens of epochs. GDU models have better gradient properties than LSTM and GRU because of the distributor mechanism in Eqs. (18).

5 Conclusions and future work

We proposed a novel RNN architecture with gated additive state transition which contains only one gate unit. The issues of gradient vanishing and conflict are mitigated by explicitly limiting the proportion of states to be overwritten at each time step. Our experiments mainly focused on challenging pathological problems. The results were consistent over different tasks and clearly demonstrated that the proposed grouped distributor architecture is helpful to extract long-term dependencies embedded in data.

A plethora of further ideas can be explored based on our findings. For example, various combinations of groups with different sizes and overwrite proportions can be explored. Further, the overwrite proportion can be trained. More interestingly, the grouped distributor structure can be used spatially to ease gradient-based training of very deep networks. To be more specific, this work can build on the highway network highway , in which the distributor operator can be used to calculate the transform gate. Tests of the stacked GDU on other data sets are also planned.


  • [1] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • [2] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1:339–356, 1988.
  • [3] Sepp Hochreiter, Yoshua Bengio, and Paolo Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.
  • [4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
  • [5] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 735–742, USA, 2010. Omnipress.
  • [6] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, pages 1033–1040, USA, 2011. Omnipress.
  • [7] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.
  • [8] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units, 2015.

  • [9] Tsungnan Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in narx recurrent neural networks. Trans. Neur. Netw., 7(6):1329–1338, 1996.
  • [10] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 77–87. Curran Associates, Inc., 2017.
  • [11] Robert DiPietro, Christian Rupprecht, Nassir Navab, and Gregory D. Hager. Analyzing and exploiting NARX recurrent neural networks for long-term dependencies, 2018.
  • [12] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014.
  • [13] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. In International Conference on Learning Representations, 2016.
  • [14] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks, 2016.
  • [15] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1863–1871, Beijing, China, 2014. PMLR.
  • [16] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • [17] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1310–III–1318. JMLR.org, 2013.
  • [18] Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with lstm. Neural Comput., 12(10):2451–2471, October 2000.
  • [19] Felix A. Gers and Juergen Schmidhuber. Recurrent nets that time and count. Technical report, 2000.
  • [20] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(8):2222–2232, 2017.
  • [21] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
  • [22] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. CoRR, abs/1503.04069, 2015.
  • [23] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. 1996.
  • [24] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010.
  • [25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [26] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems, 2015.
  • [27] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks, 2015.
  • [28] Scott E. Fahlman. The recurrent cascade-correlation architecture. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190–196. Morgan-Kaufmann, 1991.
  • [29] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [31] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville. Recurrent batch normalization, 2016.
  • [32] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks, 2015.