Training very deep neural networks has become common in the last few years. Both theoretical and empirical evidence indicate that deeper networks can represent certain functions more efficiently (Bengio et al., 2007; Bianchini and Scarselli, 2014). Commonly used architectures for deep feed-forward networks include ResNet (He et al., 2015), Highway Networks (Srivastava et al., 2015) and DenseNet (Huang et al.). These architectures can be built with tens, and sometimes even hundreds, of layers. Unfortunately, training a very deep Recurrent Neural Network (RNN) remains a challenge.
Zilly et al. (2016) introduced the Recurrent Highway Network (RHN) to address this issue. Its main difference from previous deep RNN architectures is that it incorporates Highway layers inside the recurrent transition. Using a transition depth of 10 Highway layers, the RHN achieved state-of-the-art results on several word- and character-prediction benchmarks. However, increasing the transition depth of a similar RHN further does not improve the results significantly.
In this paper, we first analyze the reasons for this phenomenon. Based on the results of our analysis, we suggest a simple solution that adds a negligible number of parameters. This variant is called a Highway State Gating cell, or an HSG. With the HSG mechanism, the new state is generated as a weighted combination of the previous state and the output of the RHN cell. The main idea behind the HSG cell is to provide a fast route for information to flow through time. In this way, we also provide a shorter path for back-propagation through time (BPTT). This enables the use of a deeper transition depth, together with a significant performance improvement on a widely used benchmark.
2 Related Work
Gated Recurrent Units (GRUs) (Cho et al., 2014) were suggested in order to reduce the number of parameters of the traditional and commonly used Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber, 1997). Similarly to HSG, in GRUs the new state is a weighted sum of the previous state and a non-linear transition of the current input and the previous state. The main difference is that the transition has a depth of a single layer and is, therefore, less robust.
Kim et al. (2017) introduced the Residual LSTM. They proposed adding to the LSTM cell a residual connection from its input to the reset gate projection output, which gives the information another route to flow directly through. They managed to train a net of 10 residual LSTM layers that outperformed other architectures. Their work focused on the way information passes through layers in the feed-forward direction, and not on the way it passes through time.
Wang and Tian (2016) used residual connections in time, and addressed the way information passes through time. They managed to improve performance on some benchmarks while reducing the number of parameters. The difference from our work is that they need a fixed residual length, which is a hyper-parameter. In addition, their work focused on cells with a transition depth of one layer.
Zoneout regularization (Krueger et al., 2016) also relates to information flow through time. The authors introduced a new regularization method for RNNs whose idea is very similar to dropout (Srivastava et al., 2014). The difference is that the dropped neurons in the state vectors keep their values from the former time-step instead of being zeroed. They mention that one of the benefits of this method is that BPTT skips a time-step on its path back through time. In our work, there is a direct (weighted) connection between the current state and the former one, which is used in the same way for both training and inference.
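The contrast between zoneout and dropout described above can be illustrated with a few lines of NumPy. This is only a sketch of the training-time behavior as summarized in this section, not the implementation of Krueger et al.; the function names `zoneout_step` and `dropout_step`, the rate `p`, and the omission of dropout rescaling are assumptions made for the example.

```python
import numpy as np

def zoneout_step(h_prev, h_new, p, rng):
    """Training-time zoneout: each unit keeps its previous value with probability p."""
    keep_prev = rng.random(h_prev.shape) < p  # True where the unit is "zoned out"
    return np.where(keep_prev, h_prev, h_new)

def dropout_step(h_new, p, rng):
    """Plain dropout for comparison: dropped units are zeroed (no rescaling here)."""
    dropped = rng.random(h_new.shape) < p
    return np.where(dropped, 0.0, h_new)

rng = np.random.default_rng(0)
h_prev = np.full(8, -1.0)  # state at the former time-step
h_new = np.full(8, 1.0)    # candidate state at the current time-step

h_zone = zoneout_step(h_prev, h_new, p=0.5, rng=rng)  # mixture of -1.0 and 1.0 entries
h_drop = dropout_step(h_new, p=0.5, rng=rng)          # mixture of 0.0 and 1.0 entries
```

A zoned-out unit is an identity connection through time for that step, which is why the gradient can skip it; the HSG replaces this random binary choice with a learned, continuous gate.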
Another relevant line of work is slowness regularizers (Hinton, 1990; Földiák, 1991; Luciw and Schmidhuber, 2012; Jonschkowski and Brock, 2015; Merity et al., 2017), which add a penalty for large changes in the state through time. In our work we do not add such a penalty. Instead, we allow a direct route for the state to pass through time-steps, and thereby "encourage" the state not to change when no change is needed.
3 Revisiting Vanilla Recurrent Highway Networks
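Let $L$ be the transition depth of the RHN cell, and $x^{[t]}$ be the cell's input at time $t$. Let $W_{H,T,C}$ and $R_{H_l,T_l,C_l}$ represent the weight matrices of the $H$ nonlinear transform and the $T$ and $C$ gates at layer $l \in \{1, \ldots, L\}$. The biases are denoted by $b_{H_l,T_l,C_l}$, and let $s_l^{[t]}$ denote the intermediate output at layer $l$ at time $t$, with $s_0^{[t]} = s_L^{[t-1]}$. The gates $T$ and $C$ utilize a sigmoid ($\sigma$) non-linearity and "$\cdot$" denotes element-wise multiplication. An RHN layer is described by

$$s_l^{[t]} = h_l^{[t]} \cdot t_l^{[t]} + s_{l-1}^{[t]} \cdot c_l^{[t]},$$

where

$$h_l^{[t]} = \tanh\left(W_H x^{[t]} \mathbb{I}_{\{l=1\}} + R_{H_l} s_{l-1}^{[t]} + b_{H_l}\right),$$

$$t_l^{[t]} = \sigma\left(W_T x^{[t]} \mathbb{I}_{\{l=1\}} + R_{T_l} s_{l-1}^{[t]} + b_{T_l}\right),$$

$$c_l^{[t]} = \sigma\left(W_C x^{[t]} \mathbb{I}_{\{l=1\}} + R_{C_l} s_{l-1}^{[t]} + b_{C_l}\right),$$

and $\mathbb{I}_{\{\cdot\}}$ is the indicator function. A very common variant is to couple gate $C$ to gate $T$, i.e. $C = 1 - T$. Figure 1 illustrates the RHN cell.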
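As an illustration of the recurrent transition, the following NumPy sketch runs one RHN time step in the formulation of Zilly et al.: a tanh transform $H$, sigmoid gates $T$ and $C$, the input feeding only the first layer, and optional coupling $C = 1 - T$. The parameter layout (a dictionary of lists), the helper `init_params`, and the initialization scale are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n, m, depth, rng, gate_bias=0.0):
    """Hypothetical initializer: small random weights, constant T-gate bias."""
    def mats():
        return [0.1 * rng.standard_normal((n, n)) for _ in range(depth)]
    return {
        "W_H": 0.1 * rng.standard_normal((n, m)),
        "W_T": 0.1 * rng.standard_normal((n, m)),
        "W_C": 0.1 * rng.standard_normal((n, m)),
        "R_H": mats(), "R_T": mats(), "R_C": mats(),
        "b_H": [np.zeros(n) for _ in range(depth)],
        "b_T": [np.full(n, gate_bias) for _ in range(depth)],
        "b_C": [np.zeros(n) for _ in range(depth)],
    }

def rhn_step(x, s_prev, params, coupled=True):
    """One RHN time step: pass the state through all highway layers."""
    s = s_prev
    for l in range(len(params["R_H"])):
        use_x = 1.0 if l == 0 else 0.0  # indicator: the input enters only at the first layer
        h = np.tanh(use_x * (params["W_H"] @ x) + params["R_H"][l] @ s + params["b_H"][l])
        t = sigmoid(use_x * (params["W_T"] @ x) + params["R_T"][l] @ s + params["b_T"][l])
        if coupled:
            c = 1.0 - t  # coupled variant: carry gate is the complement of the transform gate
        else:
            c = sigmoid(use_x * (params["W_C"] @ x) + params["R_C"][l] @ s + params["b_C"][l])
        s = h * t + s * c  # highway combination of transform and carry
    return s

rng = np.random.default_rng(0)
params = init_params(n=4, m=3, depth=5, rng=rng)
x = rng.standard_normal(3)
s1 = rhn_step(x, np.zeros(4), params)  # new state after one time step
```

Note that the loop applies every layer at every time step, so the number of non-linear transformations between two states grows with both the depth and the number of time steps.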
According to Zilly et al. (2016), one of the main advantages of using a deep RHN instead of stacked RNNs is the path length. While the path length of stacked RNNs of depth $L$ from time $t$ to time $t + T$ is $L + T - 1$ (figure 2), the path length of an RHN of depth $L$ is $L \times T$ (figure 3). The high recurrence depth can add significantly higher modeling power.
We believe that this power might sometimes also be its weakness. Consider a case where information that is relevant for a large number of time steps is given at time $t$; for example, in stock forecasting, where we expect a sharp movement to occur within the next few time steps. We would like the state to remain the same until the event happens (unless some dramatic event changes the forecast); that is, we would prefer the net state to remain stable. However, when using a deep RHN, the information must pass through many layers, and that might cause an unwanted change of the state. For example, with an RHN of depth $L$, the state at time $t$ has to pass through $L \times T$ layers in order to propagate $T$ time steps, which quickly amounts to hundreds of layers. To the best of our knowledge, no feed-forward Highway Network of this depth is in use in any field. This also aggravates the vanishing gradient issue in BPTT: a gradient that needs to back-propagate through hundreds of layers tends to vanish and become ineffective. The empirical results support our assumption, and a performance bottleneck appears to occur when we use deeper nets.
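To make the depth argument concrete, the short sketch below compares the two path lengths. It uses the expressions reported by Zilly et al. for a stacked RNN (depth + steps - 1) and for an RHN (depth × steps). The function names are hypothetical, the depth of 10 is the transition depth mentioned earlier, and the horizon of 20 time steps is an arbitrary value chosen for illustration.

```python
def stacked_rnn_path_length(depth: int, steps: int) -> int:
    """Longest path between states `steps` time steps apart in a stacked RNN."""
    return depth + steps - 1

def rhn_path_length(depth: int, steps: int) -> int:
    """Longest path between states `steps` time steps apart in an RHN."""
    return depth * steps

depth, steps = 10, 20  # illustrative values, not taken from the experiments
stacked = stacked_rnn_path_length(depth, steps)  # 29
recurrent = rhn_path_length(depth, steps)        # 200
```

Even for these moderate values the RHN path is several hundred layers long, which is the regime the paragraph above refers to.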
4 Highway State Gate in time
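We suggest a simple, yet efficient, solution for the depth-performance bottleneck issue. Let $W_R$ and $W_F$ represent the weight matrices, and let $b$ be a bias vector. Let $s_L^{[t]}$ represent the output of the RHN cell at time $t$, and let $h^{[t]}$ be the output of the HSG cell at time $t$. The HSG cell is described by

$$h^{[t]} = g \cdot h^{[t-1]} + (1 - g) \cdot s_L^{[t]},$$

$$g = \sigma\left(W_R h^{[t-1]} + W_F s_L^{[t]} + b\right).$$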
Schematics of the HSG cell and of an unfolded RHN with HSG are depicted in figure 4 and figure 5, respectively. The direct outcome of adding an HSG cell is that the information gains an alternative, fast route to flow through time.
Since the gate utilizes a sigmoid, its values are in the range $(0, 1)$. When the gate value is $0$, i.e. the HSG is closed, the output of the HSG cell equals the output of the RHN cell. When the gate value is $1$, i.e. the gate is open, the output of the HSG cell equals its output at the previous time step. In the first case, the net functions as a vanilla RHN: the information from the former state passes only through the RHN. This means that the functionality of a regular RHN can easily be recovered even after stacking the HSG layer.
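The gating behavior described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate pre-activation is assumed to be a linear function of the previous HSG output and the current RHN output plus a bias, and the names `hsg_step`, `W_prev`, `W_rhn` and `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def hsg_step(h_prev, s_rhn, W_prev, W_rhn, b):
    """Combine the previous HSG output with the current RHN output.

    Gate near 0 (closed): the result follows the RHN output.
    Gate near 1 (open): the result copies the previous state.
    """
    g = sigmoid(W_prev @ h_prev + W_rhn @ s_rhn + b)  # one gate value per state neuron
    h = g * h_prev + (1.0 - g) * s_rhn
    return h, g

rng = np.random.default_rng(0)
n = 4
W_prev = 0.1 * rng.standard_normal((n, n))
W_rhn = 0.1 * rng.standard_normal((n, n))
h_prev = rng.standard_normal(n)  # HSG output at the previous time step
s_rhn = rng.standard_normal(n)   # RHN output at the current time step

# A strongly negative bias closes every gate; a strongly positive bias opens them.
h_closed, g_closed = hsg_step(h_prev, s_rhn, W_prev, W_rhn, b=np.full(n, -50.0))
h_open, g_open = hsg_step(h_prev, s_rhn, W_prev, W_rhn, b=np.full(n, 50.0))
```

With the gates closed the cell reproduces the RHN output, and with the gates open it carries the previous state forward unchanged, which matches the two limiting cases in the text.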
One of the strengths of this architecture is that each state neuron has its own stand-alone gate. This means that some of the neurons can pass information easily through many time-steps, whereas other neurons learn short time dependencies.
Now let us revisit the example above, this time using an RHN with the HSG cell. The net depth is $L$, and a state needs to propagate $T$ time-steps. In this case, the state has multiple routes to propagate through. The propagation lengths are now $k \times L$, where $k \le T$ is the number of time-steps at which the state passes through the RHN layers instead of being carried over by the gate. This means that even if we use a really deep net, the information still has a short path to flow through. For this reason, we expect our variant to enable training deeper RHNs more efficiently. The results below support our claim.
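The set of available routes can be enumerated with a small sketch. It assumes an idealized binary view in which, at every time step, a state neuron either passes through all RHN layers or is carried over directly by the gate, and it counts only RHN layers (the HSG itself is not counted). The function name `route_lengths` and the values used are hypothetical.

```python
from itertools import product

def route_lengths(depth: int, steps: int) -> list[int]:
    """Distinct numbers of RHN layers a state can traverse over `steps` time steps."""
    lengths = set()
    # Each route is one choice per time step: True = through the RHN, False = carried by the gate.
    for route in product((True, False), repeat=steps):
        lengths.add(depth * sum(route))
    return sorted(lengths)

lengths = route_lengths(depth=10, steps=3)  # [0, 10, 20, 30]
```

The longest route equals the vanilla RHN path, while the shorter ones are only available because of the gate. In practice the gate values are continuous, so the state is a soft mixture of these routes rather than a single one.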
Our experiments study the benefit of adding depth to an RHN with and without stacking HSG cells at its output. We conducted our experiments on the Penn Treebank (PTB) benchmark.
PTB: The Penn Treebank (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz), presented by Marcus et al. (1993), is a well-known data set for experiments in the field of language modeling. The goal is to predict the next word at each time step, based on the past. Its vocabulary contains 10k unique words. All words that are not in the vocabulary are mapped to a single token. The data set consists of 929k training words, 73k validation words, and 82k test words.
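The out-of-vocabulary handling described above can be sketched as follows. This is an illustrative example only: the helper names `build_vocab` and `encode`, the token string `<unk>`, and the toy corpus are assumptions, and the preprocessing pipeline actually used for the experiments is not specified here.

```python
from collections import Counter

def build_vocab(tokens, max_size, unk="<unk>"):
    """Keep the most frequent words; index 0 is reserved for the unknown token."""
    counts = Counter(t for t in tokens if t != unk)
    words = [unk] + [w for w, _ in counts.most_common(max_size - 1)]
    return {w: i for i, w in enumerate(words)}

def encode(tokens, vocab, unk="<unk>"):
    """Map each word to its id; words outside the vocabulary share the unknown id."""
    return [vocab.get(t, vocab[unk]) for t in tokens]

corpus = ["a", "a", "a", "b", "b", "c"]         # toy corpus standing in for the training words
vocab = build_vocab(corpus, max_size=3)         # {"<unk>": 0, "a": 1, "b": 2}
ids = encode(["a", "c", "b", "zzz"], vocab)     # [1, 0, 2, 0]

# Next-word prediction pairs: the input at each step is a word, the target is the following word.
inputs, targets = ids[:-1], ids[1:]
```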
, and L2 weight decay. The learning rate was decreased exponentially at each epoch. A negative initial bias was used for both the RHN and the HSG gates, so that the gates are closed at the beginning of training. We tried several RHN depths. Results are shown in table 1. The results show clearly that a performance bottleneck occurs when adding more layers to the vanilla RHN, whereas adding more layers to the RHN with the HSG cell results in a steady improvement. Figure 6 also illustrates the difference between the two architectures during training: not only does the net with HSG achieve better results, it also converges a bit faster than the vanilla one. Another interesting aspect is the histogram of the gate values of the HSG cell in figure 8. Most of the gates are usually closed (small valued). However, in a significant number of cases the gates open, which means that the model passes a very similar state to the next time step.
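Two of the training choices above, the exponentially decreasing learning rate and the gate bias that keeps the gates closed at the start, can be illustrated with a short sketch. All numeric values below (initial learning rate, decay factor, bias) are hypothetical placeholders chosen for illustration and are not the values used in the experiments.

```python
import math

def exponential_lr(initial_lr: float, decay: float, epoch: int) -> float:
    """Learning rate after `epoch` epochs when multiplied by `decay` once per epoch."""
    return initial_lr * decay ** epoch

def initial_gate_value(bias: float) -> float:
    """Sigmoid gate value when the pre-activation is dominated by its bias."""
    return 1.0 / (1.0 + math.exp(-bias))

lr_schedule = [exponential_lr(1.0, 0.5, e) for e in range(4)]  # [1.0, 0.5, 0.25, 0.125]
gate_at_start = initial_gate_value(-4.0)                        # about 0.018: gate nearly closed
```

A gate that starts near zero makes the HSG output follow the RHN output, so training begins from the behavior of a vanilla RHN and the gates open only where that helps.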
Table 1: Results on the PTB validation set and test set for each RHN depth, with HSG and without (w/o) HSG.
*Note that the HSG is more significant as the depth of the RHN increases.
In this work, we revisit a widely used RNN model. We analyze its limits and issues, and propose a variant of it called Highway State Gate (HSG). The main idea behind HSG is to create an alternative, fast route for the information to flow through time. The HSG uses a gating mechanism to assemble a new state as a weighted sum of the former state and the RHN output. We show that, when using our method, training deeper nets results in better performance. To the best of our knowledge, this is the first time in the field of recurrent nets that adding layers at this scale has resulted in a steady improvement.
- Bengio et al.  Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel machines, 34(5):1–41, 2007.
- Bianchini and Scarselli  Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems, 25(8):1553–1565, 2014.
- Cho et al.  Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
- Földiák  Peter Földiák. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.
- Gal and Ghahramani  Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc., 2016.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- Hinton  Geoffrey E Hinton. Connectionist learning procedures. In Machine Learning, Volume III, pages 555–610. Elsevier, 1990.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Huang et al.  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In
- Jonschkowski and Brock  Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
- Kim et al.  Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. Residual lstm: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360, 2017.
- Krueger et al.  David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
- Luciw and Schmidhuber  Matthew Luciw and Juergen Schmidhuber. Low complexity proto-value function learning from sensory observations with incremental slow feature analysis. In International Conference on Artificial Neural Networks, pages 279–287. Springer, 2012.
- Marcus et al.  Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19(2):313–330, June 1993. ISSN 0891-2017.
- Merity et al.  Stephen Merity, Bryan McCann, and Richard Socher. Revisiting activation regularization for language rnns. arXiv preprint arXiv:1708.01009, 2017.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Srivastava et al.  Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
- Wang and Tian  Yiren Wang and Fei Tian. Recurrent residual learning for sequence classification. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 938–943, 2016.
- Zilly et al.  Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.