There are many examples of sequential data with non-uniform information flow. One is text, where nouns and adjectives can be more crucial than determiners in shaping sentence context. We show such a case in Figure 1: the words "brown", "fox" and "quick" carry more information than "the" and "is". Another example is video, where the difference between adjacent frames varies: some frames change drastically and convey a large amount of information, while others change only slightly or even freeze. Sequential data from sensor systems can also be non-uniform. Because different sensors have diverse sampling rates, and wireless connections suffer occasional packet loss, the frequency of the resulting signal sequence is highly variable, leading to a non-uniform information distribution.
To process sequential data, Recurrent Neural Networks (RNNs) [Rumelhart, Hinton, and Williams 1986] have emerged as a popular approach. They have been applied to speech recognition [Graves, Mohamed, and Hinton 2013], language modeling [Sundermeyer, Schlüter, and Ney 2012], learning word embeddings [Goldberg and Levy 2014], location prediction [Kong and Wu 2018], electronic health record analysis [Jin et al. 2018], etc. Among the many RNN variants and extensions, Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997] is especially important for learning long-term dependencies, thanks to its gated structure, which alleviates the problem of vanishing and exploding gradients.
Although the traditional LSTM has produced promising results, it performs best on data sequences where information is uniformly distributed across steps. For sequences with non-uniform information flow between steps, its performance is hardly satisfactory. The work in [Pascanu et al. 2013] has investigated this case and attributed the problem to the lack of depth between steps in the LSTM design. In other words, under high information flow, the transition function from one hidden state to the next is so shallow in LSTM that it may be regarded as a single linear transformation with an activation, which cannot capture the latent structure in sequences.
To tackle the above, many recent works devise deeper structures, for example by adding intermediate layers or stacking LSTMs. One such approach is Fast-Slow LSTM (FS-LSTM) [Mujika, Meier, and Steger 2017], which forms a two-layer hierarchical structure. It uses a fixed "worst-case" depth chosen according to the maximum information flow at any step of the data sequence, which is inefficient. In speech recognition, for example, few operations are needed to model silent parts, so applying a deep LSTM structure there wastes computation. Moreover, a deep structure can lead to vanishing and exploding gradients, because error signals have to traverse a long path. Models are therefore needed that handle such sequences with adaptive depth at each step.
Phased LSTM [Neil, Pfeiffer, and Liu 2016] has been proposed to address this problem; it extends LSTM with a time gate whose openness is controlled by an independent rhythmic oscillation. Based on this temporal information, the information flow is normalized, which improves performance. A similar design is Heterogeneous Event LSTM (HE-LSTM) [Liu et al. 2018], which replaces the time gate with an event gate. This change enables learning on the hierarchical structure of input features and produces better results on asynchronous sequential data. However, both methods require prior knowledge of the latent structure of the sequences, which may not be available in practice. Furthermore, for sequences without timestamp features, normalization cannot be conducted at all.
In this paper, we propose Depth-Adaptive Long Short-Term Memory (DA-LSTM), a hierarchical architecture that can adapt model depth to the non-uniform information flow in sequential data. Unlike Phased LSTM and HE-LSTM, DA-LSTM requires neither prior knowledge nor timestamps. DA-LSTM consists of two layers. In the bottom layer, multiple cells are sequentially connected to increase the capacity for modeling multiscale transitions, while in the top layer, a short path is constructed by linking the head and tail cells of the bottom layer, alleviating the exploding and vanishing gradient problem. Additionally, we add a portion gate to the LSTM cell, which dynamically determines how many hidden units to update. DA-LSTM can therefore provide sufficient depth at information-dense steps while saving computation on information-sparse inputs.
Using DA-LSTM, we conduct experiments on real-world classification tasks. Experimental results show that DA-LSTM can dynamically adjust model depth, and hence achieves faster training while preserving, or even improving on, baseline performance without prior knowledge of the sequences' latent structure. DA-LSTM can therefore be applied universally to sequences with non-uniform information distribution.
The remainder of this paper is organized as follows. First, related work is presented and compared with DA-LSTM. Next, we describe the hierarchical structure and cell structure of our model. We then present experimental results and analysis, followed by a conclusion.
In this section, we review work relevant to DA-LSTM. First, we focus on methods that use a deep structure to model complicated transitions, which carry a considerable amount of information. Next, several approaches that can flexibly adjust the model are compared. Finally, we discuss how our model differs from these two types of methods.
The standard LSTM can sometimes produce suboptimal results. Prior work has investigated this problem and shown that the degradation in performance can be attributed to the shallow hidden-to-hidden depth [Pascanu et al. 2013]. In other words, because only a few linear and nonlinear transformations are performed between hidden states at adjacent time steps, the standard LSTM cannot model complex features such as multiscale temporal structure or localized attention. Many works have therefore increased model depth to boost performance. Deep Transition LSTM [Pascanu et al. 2013] is one typical approach; it inserts several intermediate layers between hidden layers. However, adding too many intermediate layers introduces a potential problem: as the number of nonlinear steps increases, gradients must traverse a longer path during backpropagation, contributing to vanishing and exploding gradients. Recurrent Highway Networks (RHN) [Zilly et al. 2016] address this issue by utilizing the highway layer [Srivastava, Greff, and Schmidhuber 2015], achieving state-of-the-art BPC on the Penn Treebank and Hutter Prize Wikipedia datasets. Stacking recurrent hidden layers is another primary approach; models of this kind have been applied to sequence segmentation and recognition tasks [Graves, Mohamed, and Hinton 2013], achieving baseline results on the TIMIT phoneme recognition benchmark. More recent works introduce hierarchical designs such as Clockwork RNN (CW-RNN) [Koutnik et al. 2014], which divides the hidden state into several modules with different clock rates. The model follows an asynchronous update rule and is capable of capturing multiscale latent structure over sequences.
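As a concrete illustration of the clock-rate idea, the CW-RNN update rule can be sketched in a few lines; the module periods below are example values, not the configuration used in the original paper.

```python
def active_modules(t, periods):
    """Clockwork-RNN style rule: module i updates at time step t
    only if t is divisible by its clock period T_i; slower modules
    (larger T_i) therefore capture longer time scales."""
    return [i for i, period in enumerate(periods) if t % period == 0]

# With exponential periods 1, 2, 4, 8, step t=4 activates modules 0-2,
# while step t=1 activates only the fastest module 0.
active_modules(4, [1, 2, 4, 8])
```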
However, the better performance of deeper models comes at the expense of higher computation load and a harder training process. Since a sequence can carry a highly variable amount of information, models need not maintain equal depth over the whole sequence; a mechanism that adjusts depth according to the sequence's latent structure can save significant computation. One type of model uses timestamps as an indicator for handling non-uniform sequences, such as Phased LSTM [Neil, Pfeiffer, and Liu 2016] and Time-Aware LSTM [Baytas et al. 2017]. The former follows a periodic oscillation function to control the hidden state transition, while the latter discounts the transition with an arbitrary function. One limitation of these models is that both require some prior knowledge about the sequences' temporal structure, which cannot be obtained in many scenarios. Another approach is Adaptive Computation Time (ACT) [Graves 2016], which performs a varied number of intermediate updates between steps. ACT utilizes a halting gate to assign each update a probability; when the sum of the halting probabilities so far exceeds a threshold, ACT stops updating at the current step, enters the next step, and resets the accumulated halting probability to zero. A similar concept is implemented in Hierarchical Multiscale Recurrent Neural Networks (HM-LSTM) [Chung, Ahn, and Bengio 2017], a multi-layer LSTM with both bottom-up and top-down connections. Each layer in HM-LSTM has three possible operations, namely COPY, UPDATE and FLUSH, and a boundary state decides among them; this design is borrowed by Skip-RNN [Campos et al. 2018]. Despite their high performance, both methods must be trained with REINFORCE [Williams 1992] and require an extra reward signal, and defining these rewards is demanding and task-specific. The Variable Computation Unit (VCU) [Jernite et al. 2017] provides an alternative view on adjusting depth: instead of skipping the whole cell as Skip-RNN does, VCU performs partial updates, leaving the higher dimensions of the hidden units intact.
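The ACT halting rule described above can be sketched as follows. This is a simplified stand-alone version: the halting probabilities are given directly here, whereas in ACT they are produced by a learned halting gate, and the threshold value is an assumption (the original paper uses 1 minus a small epsilon).

```python
def act_num_updates(halt_probs, threshold=0.99):
    """Count how many intermediate updates ACT performs at one step:
    it stops as soon as the running sum of halting probabilities
    exceeds the threshold; the accumulator then resets for the next
    step."""
    total = 0.0
    for n, p in enumerate(halt_probs, start=1):
        total += p
        if total > threshold:
            return n
    return len(halt_probs)

act_num_updates([0.3, 0.5, 0.4, 0.2])  # running sum exceeds 0.99 at the third update
```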
Our DA-LSTM architecture combines the advantages of a deep structure and a flexible depth-adjustment mechanism. Firstly, by forming a two-layer hierarchical structure, DA-LSTM is equipped with sufficient depth to model multiscale complex transitions. Secondly, since a portion gate is added to the standard LSTM cell, our model is flexible enough to assign an appropriate amount of computation to each hidden state update, reducing computation load and speeding up training.
In this section, we first introduce the hierarchical structure of DA-LSTM to give an overview. Then, the modified LSTM cell is presented in detail to illustrate how the portion gate helps save computation resources.
DA-LSTM Hierarchical Framework
Transitions in some parts of a sequence can be complex, such as periods when changes between steps are drastic and carry a large amount of information, and thereby require models capable of capturing various latent features. Since a single standard LSTM cell has a shallow structure between hidden states, consisting of only a few linear transformations and activations, learning such statistics from sequences is challenging. To solve the problem caused by insufficient depth, we propose a hierarchical structure (see Figure 2(a)) comprising two stacked layers.
In the bottom layer, cells are connected sequentially. Each cell represents a transition function at time step t, so this layer can model complex transitions by performing multiple updates within one step. The input to a cell can come from the sequence itself or from the output of a cell in a different layer; if there is no additional input, it can be omitted by setting its value to zero. The tuple of memory and hidden states is carried from cell to cell. Only the first cell receives an input, and only the last cell emits an output. The top layer consists of one customized cell, which is updated once at every step; it receives the signal from the first cell and sends feedback to the last cell in the bottom layer. The equations describing how information is transmitted in the DA-LSTM architecture are presented in the following:
Note that the function in (5) can be any transformation that converts hidden states to a predicted output.
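The transmission pattern above can be sketched structurally as follows. This is a sketch only: `toy_cell` is a hypothetical stand-in for the modified LSTM cell described later, and the additive form of the top-down feedback is an assumption.

```python
import numpy as np

def toy_cell(x, h, c):
    # Hypothetical stand-in for the modified LSTM cell:
    # any map (input, hidden, memory) -> (hidden, memory).
    c_new = 0.5 * c + 0.5 * np.tanh(x + h)
    return np.tanh(c_new), c_new

def da_lstm_step(x, bottom, top, cell=toy_cell):
    """One DA-LSTM step. bottom: list of (h, c) states for the
    sequentially connected bottom-layer cells; top: (h, c) of the
    single top-layer cell g."""
    bottom = list(bottom)
    # Only the first bottom cell receives the external input x.
    bottom[0] = cell(x, *bottom[0])
    # The top cell g updates once per step, from the first bottom cell.
    top = cell(bottom[0][0], *top)
    # Remaining bottom cells chain their hidden states; the last one
    # also receives the top-down feedback from g (the shortcut link).
    for k in range(1, len(bottom)):
        inp = bottom[k - 1][0]
        if k == len(bottom) - 1:
            inp = inp + top[0]
        bottom[k] = cell(inp, *bottom[k])
    # Only the last bottom cell emits the output.
    return bottom[-1][0], bottom, top
```

With several bottom cells, each step runs the first cell, then the top cell g, then the remaining chain, so the top layer is updated less often than the bottom cells, matching the different update frequencies of the two layers.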
Compared with Deep Transition LSTM, DA-LSTM maintains a two-layer hierarchical structure. While the bottom layer is similar to Deep Transition LSTM, the top layer is updated at a different frequency, enabling the learning of multiscale latent features. Besides, the links between layers can be regarded as shortcut connections [Raiko, Valpola, and LeCun 2012], which alleviate the problem of vanishing and exploding gradients; error signals can thus be backpropagated over longer spans, promoting the learning of long-term dependencies. DA-LSTM also differs from stacked LSTM: instead of connecting layers with only bottom-up links, DA-LSTM also contains top-down links. The hidden state transition in DA-LSTM can therefore be regarded as a multilayer perceptron (MLP), which is a universal approximator [Hornik, Stinchcombe, and White 1989] capable of representing larger families of functions.
Modified LSTM Cell Structure
The Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997] cell is a pervasive unit for building recurrent neural networks. We start by defining the update rules for the LSTM without peephole connections [Gers and Schmidhuber 2000]:
Compared with plain RNN cells, the LSTM utilizes a gated structure: f_t, i_t and o_t are the forget, input and output gates at time step t. The new memory candidate is computed by a linear transformation of the previous hidden state h_{t-1} concatenated with the new input x_t, followed by an activation function. The weight matrices and biases appearing above are the parameters of the corresponding gates. The forget gate and input gate, which use the sigmoid function as their nonlinearity, output values between 0 and 1. These values are multiplied with the previous memory c_{t-1} and the new memory candidate, respectively, to produce the new memory c_t. Finally, the output h_t is obtained by element-wise multiplication between tanh(c_t) and the value of the output gate o_t.
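As a minimal NumPy sketch of these update rules (the weight shapes and the `[h_prev; x]` concatenation order are conventions chosen here, not prescribed by the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step without peephole connections. params maps
    'W_f', 'W_i', 'W_o', 'W_g' to (d, d+dx) matrices acting on the
    concatenation [h_prev; x], and 'b_f', 'b_i', 'b_o', 'b_g' to
    length-d bias vectors."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])  # memory candidate
    c = f * c_prev + i * g          # new memory
    h = o * np.tanh(c)              # new hidden state
    return h, c

# Toy usage with random parameters.
rng = np.random.default_rng(0)
d, dx = 4, 3
params = {f"W_{k}": 0.1 * rng.standard_normal((d, d + dx)) for k in "fiog"}
params.update({f"b_{k}": np.zeros(d) for k in "fiog"})
h, c = lstm_step(rng.standard_normal(dx), np.zeros(d), np.zeros(d), params)
```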
The modified cell in DA-LSTM extends the LSTM with a portion gate, which outputs a fraction in the interval (0, 1). As shown in the following, the portion gate has its own weight and bias parameters and uses a sigmoid activation.
Figure 2(b) shows the whole structure of the modified LSTM cell. Assume the previous hidden state h_{t-1} has dimension D; DA-LSTM will then update only a leading fraction of the dimensions of the hidden units and the input. Note that the cell input can be either the input to the cell in the top layer or a new input in the bottom layer. In the former case, its dimension is D, because the input to the top cell is exactly the hidden state output of the first bottom cell, which has the same dimension as h_{t-1}. In the latter case, the dimension of the input can also be regarded as D for simplicity, since an extra preprocessing layer can be inserted to transform the raw input to dimension D. Therefore, the input is assumed to have the same dimension as h_{t-1}.
After obtaining the portion value, the modified LSTM cell performs a partial update over the leading dimensions. As can be seen from Figure 3, the dimensions beyond this fraction are truncated, generating a truncated hidden state and input, so only a fraction of the multiplications are needed to compute the remaining gates. Since these multiplications account for the majority of the computation load, time and resources are expected to be saved.
In practice, a hard selection of the update dimension is not favored, because training a model with a non-differentiable function is non-trivial. Here we apply a soft mask [Jernite et al. 2017], which approximates the hard choice with a threshold function. In Figure 2(b), we use a continuous function to generate the soft mask vector. The threshold function is defined as follows (element-wise multiplication is denoted by ⊙):
Given a sharpness parameter λ, the soft mask vector, the truncated hidden state and the truncated input are defined as:
The element-wise multiplication of the soft mask and h_{t-1} generates the truncated hidden state, whose leading dimensions are unchanged while the remaining dimensions are zero, up to a small offset caused by the fractional part of the cut-off index. If the sharpness parameter λ increases, this offset decreases toward zero, because mask values are less likely to fall strictly between 0 and 1.
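Since the exact equations are elided above, the following sketch follows the thresholded-sigmoid soft mask of Jernite et al. (2017); the precise parameterization inside the sigmoid (`p * dim - j - 1`) is an assumption.

```python
import numpy as np

def soft_mask(p, dim, sharpness):
    """Differentiable approximation of 'keep the first ~p*dim
    dimensions'. p is the portion gate output in (0, 1); a larger
    sharpness pushes the mask toward a hard 0/1 cut-off."""
    j = np.arange(dim)
    return 1.0 / (1.0 + np.exp(-sharpness * (p * dim - j - 1)))

def partial_update(h_prev, x, p, sharpness=50.0):
    # Truncated hidden state and input: dimensions past ~p*dim are
    # driven toward zero, so the gate computations on them vanish.
    m = soft_mask(p, len(h_prev), sharpness)
    return m * h_prev, m * x

m = soft_mask(0.5, 8, sharpness=50.0)  # first half near 1, second half near 0
```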
Experiments and Evaluation
In this section, we present experimental results on real sensor data to demonstrate the efficiency and effectiveness of the proposed DA-LSTM architecture.
Datasets and Experimental Setup
We evaluate our model on the PAMAP2 physical activity monitoring dataset [Reiss and Stricker 2012a; Reiss and Stricker 2012b], which can be obtained from the UCI Machine Learning Repository. PAMAP2 was collected by placing wearable compound sensors on a group of subjects, providing roughly 10 hours of recorded instances. The specification of the sensors can be seen in Table 1.
Since the thermometer, 3D-accelerometer, 3D-gyroscope, and 3D-magnetometer are embedded into a compound unit called an IMU sensor, they share the same sampling rate. However, as the sensors communicate over a wireless connection, the actual frequency of IMU sensors placed at different body positions varies, and packet loss frequently occurs, generating many NaN tags in the resulting integrated records. We first preprocess the sequential data by removing rows of signal records that contain a NaN tag. Together, the sensors provide the features used for physical activity recognition.
The dataset contains multiple activity types, including sitting, walking and running, which are more recognizable, and house cleaning, folding laundry and ironing, which are less distinguishable. Besides, the states between these activities, namely transient states, receive their own label; during them, sensor readings are noisy and irrelevant to prediction. These transient periods can therefore be regarded as information-sparse parts, and we define a transient ratio r to control the proportion of transient states in our extracted sequences. If r is too large, the inputs mostly consist of transient states and the information is uniformly distributed. Similarly, when r is too small, the inputs consist largely of non-transient states, whose information distribution is again uniform; the only difference is the total amount of information. In other words, sequences with small r contain a larger number of complicated transitions, while sequences with large r contain fewer. If r takes an intermediate value, however, the amount of information at different steps is highly variable.
We follow these steps to extract sequences from the 10-hour sensor records. Firstly, the records are split into two parts, one containing nothing but transient states and the other containing the rest. Next, we sample data points from the two parts according to the transient ratio r. Finally, the records for transient and non-transient states from the same subject are concatenated and sorted in chronological order, with a sequence-length parameter defining the number of steps in each sequence. Unless otherwise stated, we use a fixed set of baseline parameters, and the data are divided into training, cross-validation and testing sets with fixed proportions. Several single-cell and hierarchical structures are tested on these extracted sequences, and all experiments are conducted with the TensorFlow framework.
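The extraction steps can be sketched as follows; the array layout, the transient label value, and the exact sampling scheme are assumptions, since the paper's parameter values are not reproduced here.

```python
import numpy as np

def extract_sequences(features, labels, times, r, seq_len, transient_label=0):
    """features: (N, F) sensor readings (may contain NaN),
    labels: (N,) activity ids, times: (N,) timestamps,
    r: transient ratio, seq_len: steps per extracted sequence."""
    # 1. Drop rows with NaN readings caused by packet loss.
    keep = ~np.isnan(features).any(axis=1)
    features, labels, times = features[keep], labels[keep], times[keep]

    # 2. Split transient / non-transient records and sample them so
    #    transient states make up a fraction r of the extracted data.
    trans = np.flatnonzero(labels == transient_label)
    other = np.flatnonzero(labels != transient_label)
    rng = np.random.default_rng(0)
    budget = min(int(len(trans) / r), int(len(other) / (1.0 - r)))
    pick = np.concatenate([
        rng.choice(trans, int(r * budget), replace=False),
        rng.choice(other, budget - int(r * budget), replace=False),
    ])

    # 3. Restore chronological order and cut fixed-length sequences.
    pick = pick[np.argsort(times[pick])]
    usable = (len(pick) // seq_len) * seq_len
    idx = pick[:usable].reshape(-1, seq_len)
    return features[idx], labels[idx]
```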
|Heart Rate Monitor||1||9Hz|
Comparison Schemes and Metrics
DA-LSTM architecture is compared with the following methods to prove its effectiveness and efficiency.
Phased LSTM: Phased LSTM [Neil, Pfeiffer, and Liu2016] as described in related work.
Stacked LSTM: LSTM cells are stacked on top of each other by treating the output of the bottom layer as input to the next [Malhotra et al. 2015]. In this structure, there is no link conveying information from an upper cell to a lower cell.
Deep Transition LSTM (DT-LSTM): A model that sequentially connects several cells between steps [Pascanu et al. 2013].
Clockwork RNN: Clockwork RNN [Koutnik et al.2014] as described in related work.
We choose the default parameter settings for Phased LSTM and Clockwork RNN from the corresponding papers. Other common hyperparameters are listed in Table 2. Note that the modules in Clockwork RNN contain different numbers of cells, and we use an additive notation to denote the structure.
|Method||# of hidden units||# of cells|
|Deep Transition LSTM||40||3|
Since all the methods are tested on classification tasks, we use cross entropy loss as the quantitative metric. Besides, convergence time is recorded to indicate computation load.
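For reference, the cross entropy metric over predicted class distributions can be computed as:

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy loss. probs: (N, C) predicted class
    probabilities; targets: (N,) integer class labels. Probabilities
    are clipped to avoid log(0)."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

loss = cross_entropy(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0, 1]))
```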
|Method||Cross Entropy at Epoch 1||Cross Entropy||Convergence Time (s)|
|Deep Transition LSTM||1.3609||0.7932||8935.57|
Table 3 lists the cross entropy loss on the test set and the convergence time under the baseline parameters. Several conclusions can be drawn from the table.
Firstly, Clockwork RNN and DA-LSTM achieve lower cross entropy loss than Deep Transition LSTM and Phased LSTM. The performance of Deep Transition LSTM is degraded because gradients must traverse a longer path in backpropagation, which hampers the learning of long-term dependencies. For Phased LSTM, the suboptimal results can be attributed to the incompatibility between the real latent structure of the sequences and Phased LSTM's temporal assumption. We can therefore conclude that both longer backpropagation paths and incorrect assumptions about sequences' latent structure are detrimental to performance.
Secondly, comparing Phased LSTM with Stacked LSTM, Deep Transition LSTM and DA-LSTM, the convergence time of the single-cell structure is significantly smaller than that of the hierarchical structures. Instead of forming multiple cells between steps, Phased LSTM maintains a simpler structure and performs much less computation. Even though hierarchical structures require a higher computation load, DA-LSTM manages to contain the cost: compared with Stacked LSTM and Deep Transition LSTM, it reduces convergence time significantly by introducing a portion gate, which dynamically adjusts model depth and omits many operations according to the latent structure of the sequences. We can therefore conclude that DA-LSTM is capable of saving computation resources. The time for Clockwork RNN cannot be fairly compared here, because its basic unit is not an LSTM cell, which removes much of the overhead and many operations.
Thirdly, both DA-LSTM and Clockwork RNN achieve state-of-the-art performance. Unlike Clockwork RNN, whose module settings implicitly encode some temporal structure of the inputs, DA-LSTM does not require any prior knowledge about the inputs' latent structure. Even though the temporal assumption made by Clockwork RNN is more general than that of Phased LSTM and achieves better performance, the results of Clockwork RNN become quite unstable when we modify the transient ratio r. Details are provided in the following parts.
To further demonstrate DA-LSTM's capability of dealing with non-uniform information flow in sequential data, we run experiments on sequences with different transient ratios r. Figure 4 displays the corresponding learning curves. As the graphs show, when we change the transient ratio, the resulting cross entropy of Clockwork RNN fluctuates rapidly; for certain ratios, Clockwork RNN generates suboptimal results, possibly due to the incompatibility between the model's assumption and the sequences' latent structure. DA-LSTM, in contrast, maintains stable performance throughout.
Besides, we find that the performance of both Deep Transition LSTM and Phased LSTM is inferior. For Phased LSTM, because we use the default parameter settings described in the corresponding paper, its assumption about the latent temporal structure of the inputs may not be compatible with our dataset, hence the suboptimal results. The poor performance of Deep Transition LSTM has a different cause: although adding intermediate layers between cells increases its depth, which should improve performance, the longer backpropagation path intensifies the problem of vanishing and exploding gradients, so Deep Transition LSTM is incapable of capturing long-term dependencies. DA-LSTM alleviates the problem by applying links that connect the start and end of the intermediate cells, allowing gradients to traverse a short path and thereby promoting the learning of long-term dependencies. One exception occurs at an extreme transient ratio: because the latent structure of the inputs is then quite simple, all the methods achieve high accuracy.
Based on the above experimental results, we can conclude that DA-LSTM architecture can significantly save computation cost while preserving or sometimes even improving the state-of-the-art performance.
In this section, to understand how DA-LSTM adapts its depth, we run experiments with different transient ratios r and cell numbers. The average portion gate value after convergence is recorded for comparison.
Figure 4(a) illustrates how the portion gate value changes with the transient ratio r. As the figure shows, when r increases, meaning there are more transient states, the portion gate value decreases significantly, reducing the total number of performed operations. This observation confirms the functionality of the portion gate: it adjusts the model depth according to the sequences' latent structure. Figure 4(b) shows how the portion gate value responds to an increase in the number of cells. As the cell count grows, the average value gradually decreases, helping DA-LSTM maintain a constant overall depth compatible with the inputs' latent structure.
Therefore, it can be concluded that DA-LSTM architecture is capable of learning inputs’ latent structure and adjusting itself by the portion gate to maintain performance.
In this paper, we present the novel DA-LSTM architecture for processing sequential data with non-uniform information distribution. With a two-layer hierarchical structure, the capability of the model is enhanced by increased depth. Additionally, our model learns the latent structure of sequential inputs and saves computation by dynamically adjusting the number of performed operations. Experiments conducted on a real-world dataset show that DA-LSTM can preserve or even improve on baseline performance while reducing the amount of computation significantly.
- [Baytas et al.2017] Baytas, I. M.; Xiao, C.; Zhang, X.; Wang, F.; Jain, A. K.; and Zhou, J. 2017. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 65–74. ACM.
- [Campos et al.2018] Campos, V.; Jou, B.; i Nieto, X. G.; Torres, J.; and Chang, S.-F. 2018. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations.
- [Chung, Ahn, and Bengio2017] Chung, J.; Ahn, S.; and Bengio, Y. 2017. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations.
- [Gers and Schmidhuber2000] Gers, F. A., and Schmidhuber, J. 2000. Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 3, 189–194. IEEE.
- [Goldberg and Levy2014] Goldberg, Y., and Levy, O. 2014. word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
- [Graves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, 6645–6649. IEEE.
- [Graves2016] Graves, A. 2016. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- [Hornik, Stinchcombe, and White1989] Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2(5):359–366.
- [Jernite et al.2017] Jernite, Y.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Variable computation in recurrent neural networks. In International Conference on Learning Representations.
- [Jin et al.2018] Jin, B.; Che, C.; Liu, Z.; Zhang, S.; Yin, X.; and Wei, X. 2018. Predicting the risk of heart failure with ehr sequential data modeling. IEEE Access 6:9256–9261.
- [Kong and Wu2018] Kong, D., and Wu, F. 2018. Hst-lstm: A hierarchical spatial-temporal long-short term memory network for location prediction. In IJCAI, 2341–2347.
- [Koutnik et al.2014] Koutnik, J.; Greff, K.; Gomez, F.; and Schmidhuber, J. 2014. A clockwork rnn. In Xing, E. P., and Jebara, T., eds., Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, 1863–1871. Bejing, China: PMLR.
- [Liu et al.2018] Liu, L.; Shen, J.; Zhang, M.; Wang, Z.; and Tang, J. 2018. Learning the joint representation of heterogeneous temporal events for clinical endpoint prediction. arXiv preprint arXiv:1803.04837.
- [Malhotra et al.2015] Malhotra, P.; Vig, L.; Shroff, G.; and Agarwal, P. 2015. Long short term memory networks for anomaly detection in time series. In Proceedings of ESANN, 89. Presses universitaires de Louvain.
- [Mujika, Meier, and Steger2017] Mujika, A.; Meier, F.; and Steger, A. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, 5915–5924.
- [Neil, Pfeiffer, and Liu2016] Neil, D.; Pfeiffer, M.; and Liu, S.-C. 2016. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, 3882–3890.
- [Pascanu et al.2013] Pascanu, R.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
- [Raiko, Valpola, and LeCun2012] Raiko, T.; Valpola, H.; and LeCun, Y. 2012. Deep learning made easier by linear transformations in perceptrons. In Artificial Intelligence and Statistics, 924–932.
- [Reiss and Stricker2012a] Reiss, A., and Stricker, D. 2012a. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, 40. ACM.
- [Reiss and Stricker2012b] Reiss, A., and Stricker, D. 2012b. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on, 108–109. IEEE.
- [Rumelhart, Hinton, and Williams1986] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. nature 323(6088):533.
- [Srivastava, Greff, and Schmidhuber2015] Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
- [Sundermeyer, Schlüter, and Ney2012] Sundermeyer, M.; Schlüter, R.; and Ney, H. 2012. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
- [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
- [Zilly et al.2016] Zilly, J. G.; Srivastava, R. K.; Koutník, J.; and Schmidhuber, J. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.