Highway State Gating for Recurrent Highway Networks: improving information flow through time

05/23/2018 · by Ron Shoham et al., Ben-Gurion University of the Negev

Recurrent Neural Networks (RNNs) play a major role in the field of sequential learning, and have outperformed traditional algorithms on many benchmarks. Training deep RNNs still remains a challenge, and most state-of-the-art models are structured with a transition depth of 2-4 layers. Recurrent Highway Networks (RHNs) were introduced to tackle this issue, and achieved state-of-the-art performance on a few benchmarks using a depth of 10 layers. However, this architecture suffers from a performance bottleneck: it ceases to improve when more layers are added. In this work, we analyze the causes of this bottleneck and postulate that the main source is the way information flows through time. We introduce a novel and simple variation of the RHN cell, called Highway State Gating (HSG), which allows adding more layers while continuing to improve performance. By using a gating mechanism for the state, we allow the net to "choose" whether to pass information directly through time or to gate it. This mechanism also allows the gradient to back-propagate directly through time and, therefore, results in slightly faster convergence. We use the Penn Treebank (PTB) dataset as a platform for an empirical proof of concept. The results show that Highway State Gating improves performance at all depths, and that the improvement grows as the depth increases.




1 Introduction

Training very deep neural networks has become very common in the last few years. Both theoretical and empirical evidence points to the fact that deeper networks can represent specific functions more efficiently (Bengio et al. [1], Bianchini and Scarselli [2]). Some commonly used architectures for deep feed-forward networks are ResNet [6], Highway Networks [17] and DenseNet [9]. These architectures can be structured with tens, and sometimes even hundreds, of layers. Unfortunately, training a very deep Recurrent Neural Network (RNN) still remains a challenge.

Zilly et al. [19] introduced the Recurrent Highway Network (RHN) in order to address this issue. Its main difference from previous deep RNN architectures was the incorporation of Highway layers inside the recurrent transition. By using a transition depth of 10 Highway layers, RHN managed to achieve state-of-the-art results on several benchmarks of word and character prediction. However, increasing the transition depth of a similar RHN does not improve the results significantly.

In this paper, we first analyze the reasons for this phenomenon. Based on the results of our analysis, we suggest a simple solution which adds a negligible number of parameters. We call this variant the Highway State Gating (HSG) cell. With the HSG mechanism, the new state is generated as a weighted combination of the previous state and the output of the RHN cell. The main idea behind the HSG cell is to provide a fast route for information to flow through time. In doing so, we also provide a shorter path for back-propagation through time (BPTT). This enables the use of a deeper transition depth, together with a significant performance improvement on a widely used benchmark.

2 Related Work

Gated Recurrent Units (GRUs) [3] were suggested in order to reduce the number of parameters of the traditional and commonly used Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber [8]). Similarly to HSG, in GRUs the new state is a weighted sum of the previous state and a non-linear transition of the current input and the previous state. The main difference is that the transition has a depth of only a single layer and is, therefore, less robust.

Kim et al. [11] introduced a different variant of the LSTM cell, inspired by ResNet [6]. They proposed adding to the LSTM cell a residual connection from its input to the reset gate projection output, thereby allowing another route for information to flow directly through. They managed to train a net of 10 residual LSTM layers which outperformed other architectures. Their work focused on the way information passes through layers in the feed-forward direction, not on the way it passes through time.

Wang and Tian [18] used residual connections in time, addressing the way information passes through time. They managed to improve performance on some benchmarks while reducing the number of parameters. The difference from our work is that they required a fixed residual length, which is a hyper-parameter, and their work focused on cells with a single-layer transition depth.

Zoneout regularization (Krueger et al. [12]) also relates to information flow through time. The authors introduced a new regularization method for RNNs whose idea is very similar to dropout [16]. The difference is that the dropped neurons in the state vector keep their values from the previous time step instead of being zeroed. They mentioned that one benefit of this method is that BPTT skips a time step on its path back through time. In our work, there is a direct (weighted) connection between the current state and the former one, which is used identically for training and inference.

Another relevant line of work is slowness regularizers (Hinton [7], Földiák [4], Luciw and Schmidhuber [13], Jonschkowski and Brock [10], Merity et al. [15]), which add a penalty for large changes in state through time. We do not add such a penalty; instead, we allow a direct route for the state to pass through time-steps, thereby 'encouraging' the state not to change when change is not needed.

3 Revisiting Vanilla Recurrent Highway Networks

Let $L$ be the transition depth of the RHN cell, and let $x_t \in \mathbb{R}^m$ be the cell's input at time $t$. Let $W_{H,T,C} \in \mathbb{R}^{n \times m}$ and $R^{\ell}_{H,T,C} \in \mathbb{R}^{n \times n}$ represent the weight matrices of the nonlinear transform $H$ and the $T$ and $C$ gates at layer $\ell \in \{1,\dots,L\}$. The biases are denoted by $b^{\ell}_{H,T,C} \in \mathbb{R}^n$, and let $s^{\ell}_t$ denote the intermediate output at layer $\ell$ at time $t$, with $s^{0}_t = s^{L}_{t-1}$. The gates $t^{\ell}_t$ and $c^{\ell}_t$ utilize a sigmoid ($\sigma$) non-linearity and $\odot$ denotes element-wise multiplication. An RHN layer is described by

$$s^{\ell}_t = h^{\ell}_t \odot t^{\ell}_t + s^{\ell-1}_t \odot c^{\ell}_t,$$

where

$$h^{\ell}_t = \tanh\!\left(W_H x_t \mathbb{1}_{\{\ell=1\}} + R^{\ell}_H s^{\ell-1}_t + b^{\ell}_H\right),$$
$$t^{\ell}_t = \sigma\!\left(W_T x_t \mathbb{1}_{\{\ell=1\}} + R^{\ell}_T s^{\ell-1}_t + b^{\ell}_T\right),$$
$$c^{\ell}_t = \sigma\!\left(W_C x_t \mathbb{1}_{\{\ell=1\}} + R^{\ell}_C s^{\ell-1}_t + b^{\ell}_C\right),$$

and $\mathbb{1}_{\{\cdot\}}$ is the indicator function. A very common variant couples gate $c$ to gate $t$, i.e. $c^{\ell}_t = 1 - t^{\ell}_t$. Figure 1 illustrates the RHN cell.


Figure 1: Schematic showing RHN cell computation. The Feed-Forward route goes from bottom to top through stacked Highway layers. On the right side there is the memory unit, followed by the recurrent connection.
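The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration of the coupled-gate variant ($c = 1 - t$) only; the parameter-dictionary layout and function names are our own, not from the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhn_step(x, s_prev, p, L):
    """One time step of a coupled-gate RHN with transition depth L.

    p["W_H"], p["W_T"]: input projections (n x m), applied at layer 1 only.
    p["R_H"][l], p["R_T"][l]: recurrent matrices (n x n) of layer l.
    p["b_H"][l], p["b_T"][l]: bias vectors of length n.
    """
    s = s_prev  # s^0_t = s^L_{t-1}
    for l in range(L):
        # The external input enters only at the first layer (indicator 1_{l=1}).
        in_h = p["W_H"] @ x if l == 0 else 0.0
        in_t = p["W_T"] @ x if l == 0 else 0.0
        h = np.tanh(in_h + p["R_H"][l] @ s + p["b_H"][l])
        t = sigmoid(in_t + p["R_T"][l] @ s + p["b_T"][l])
        s = h * t + s * (1.0 - t)  # coupled gates: c = 1 - t
    return s
```

Note how the state is rewritten at every one of the $L$ inner layers: this is the source of the long $T \cdot L$ path through time discussed below.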

According to Zilly et al. [19], one of the main advantages of using a deep RHN instead of stacked RNNs is the path length. While the path length of stacked RNNs from time $t$ to time $t+T$ is $T + L$ (Figure 2), the path length of an RHN of depth $L$ is $T \cdot L$ (Figure 3). The high recurrence depth can add significantly higher modeling power.


Figure 2: The figure illustrates an unfolded stacked RNN with $L$ layers. Here the path length from time $t$ to time $t+T$ is $T + L$.


Figure 3: The figure illustrates an unfolded RHN with $L$ layers. Here the path length from time $t$ to time $t+T$ is $T \cdot L$.

We believe that this power might, sometimes, also be its weakness. Let us examine a case where information that is relevant for a large number of time steps is given at time $t$; for example, in stock forecasting, where we expect a sharp movement to occur in the next few time steps. We would like the state to remain the same until the event happens (unless some dramatic event changes the forecast). In this case, we probably prefer the net state to remain stable, without dramatic changes. However, when using a deep RHN, the information must pass through many layers, and that might cause an unwanted change of the state. For example, with an RHN of depth $L$, the input state at time $t$ has to pass through $T \cdot L$ layers in order to propagate $T$ time steps. To the best of our knowledge, no feed-forward Highway Network of such depth has been used in any field. This also aggravates the vanishing-gradient issue in BPTT: a gradient that must back-propagate through hundreds of layers vanishes and becomes ineffective. The empirical results support this assumption, and a performance bottleneck appears when we use deeper nets.

4 Highway State Gate in time

We suggest a simple, yet efficient, solution for the depth-performance bottleneck. Let $W_1, W_2 \in \mathbb{R}^{n \times n}$ represent the weight matrices, and let $b_g \in \mathbb{R}^n$ be a bias vector. Let $y_t$ represent the output of the RHN cell at time $t$, and let $s_t$ be the output of the HSG cell at time $t$. The HSG cell is described by

$$g_t = \sigma\!\left(W_1 s_{t-1} + W_2 y_t + b_g\right),$$
$$s_t = g_t \odot s_{t-1} + (1 - g_t) \odot y_t.$$
A scheme of the HSG cell and an unfolded RHN with HSG is depicted in figure 4 and figure 5, respectively. The direct outcome of adding an HSG cell is giving the information an alternative and fast route to flow through time.
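A minimal NumPy sketch of the HSG update follows; the parameter names ($W_1$, $W_2$, $b_g$) are illustrative choices for this sketch, not taken from a released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hsg_step(s_prev, y, W1, W2, b_g):
    """Blend the previous state s_{t-1} with the RHN output y_t.

    The gate g is computed element-wise, so each state neuron has its
    own stand-alone gate: some coordinates can carry information directly
    through time while others update freely.
    """
    g = sigmoid(W1 @ s_prev + W2 @ y + b_g)
    return g * s_prev + (1.0 - g) * y
```

With a strongly negative bias the gate stays near 0 and the cell reduces to the vanilla RHN output $y_t$; with a strongly positive bias the previous state passes through almost unchanged.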


Figure 4: The figure illustrates a zoom into the HSG cell.


Figure 5: A macro scheme of an unfolded RHN with HSG cell. The state feeds both the RHN and the next time-step HSG cell.

Since gate $g_t$ utilizes a sigmoid, its values lie in the range $(0,1)$. When $g_t \to 0$, i.e. the HSG is closed, $s_t = y_t$; when $g_t \to 1$, i.e. the gate is open, $s_t = s_{t-1}$. In the first case, the net functions as a vanilla RHN: the information from the former state passes only through the RHN itself. This means that the functionality of a regular RHN can easily be recovered even after stacking the HSG layer.

One of the strengths of this architecture is that each state neuron has its own stand-alone gate. This means that some of the neurons can pass information easily through many time-steps, whereas other neurons learn short time dependencies.

Now let us examine the example mentioned above, using an RHN with the HSG cell. The net depth is $L$, and a state needs to propagate $T$ time-steps. In this case, the state has multiple routes to propagate through. The propagation lengths are now $kL + (T - k)$, with $k \in \{0,1,\dots,T\}$. This means the information has multiple routes, and even in a very deep net it still has a short path (of length $T$) to flow through. For this reason, we expect our variant to enable training deeper RHNs more efficiently. The results below support this claim.
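The set of path lengths is easy to enumerate: at each of the $T$ time steps the state either skips through the HSG (one edge) or crosses all $L$ highway layers, giving total length $kL + (T-k)$ when $k$ steps take the deep route.

```python
def path_lengths(L, T):
    """All possible propagation lengths over T time steps when each step
    either skips through the HSG (1 edge) or crosses all L layers."""
    return sorted({k * L + (T - k) for k in range(T + 1)})

lengths = path_lengths(30, 10)
print(lengths[0], lengths[-1])  # prints: 10 300
```

The shortest route has length $T$ (all skips) and the longest $T \cdot L$ (all deep transitions), which is exactly the single path available to a vanilla RHN.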

5 Results

Our experiments study the benefit of adding depth to a RHN with and without stacking HSG cells at its output. We conducted our experiments on the Penn Treebank (PTB) benchmark.

PTB: The Penn Treebank (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz), presented by Marcus et al. [14], is a well-known data set for experiments in the field of language modeling. The goal is to predict the next word at each time step, based on the past. Its vocabulary contains 10k unique words, and all words not in the vocabulary are mapped to a single token. The database is structured of 929k training words, 73k validation words, and 82k test words.

We used a hidden size of 830, similar to that used by Zilly et al. [19]. For regularization, we used variational dropout [5] and L2 weight decay. The learning rate was decreased exponentially at each epoch. A negative initial bias was used for both the RHN and the HSG gates, so that the gates are closed at the beginning of training. We tried RHN depths from 10 to 40 layers. Results are shown in Table 1. It can be seen clearly that a performance bottleneck occurs when adding more layers to the vanilla RHN, whereas adding more layers to the RHN with the HSG cell results in a steady improvement. Figure 6 also illustrates the difference between the two architectures during training: not only does the net with HSG achieve better results, it also converges slightly faster than the vanilla one. Another interesting aspect is the histogram of the HSG gate values in Figure 7. Most of the gates are usually closed (small valued); however, in a significant number of cases the gates open, which means the model passes a very similar state to the next time step.

RHN depth   Validation (with HSG)   Validation (w/o HSG)   Test (with HSG)   Test (w/o HSG)
10          67.5                    67.9                   65.0              65.4
20          65.6                    66.4                   62.9              63.2
30          64.8                    66.4                   62.0              63.4
40          64.7                    66.7                   61.7              63.6

*Note that the benefit of HSG grows as the depth of the RHN increases.

Table 1: Single RHN model test and validation perplexity of the PTB dataset
Figure 6: Comparison of the learning curve between RHN with (green) and without (red) HSG cell. The upper and the lower graphs show the perplexity on the validation and test sets respectively.
Figure 7: Histogram of HSG cell gate values, drawn from a trained RHN with a hidden size of 830 at randomly chosen time steps. The gates utilize a sigmoid function and, therefore, the values lie in the range $(0,1)$. In most cases the gate values are relatively low, meaning the state gates are closed and the new state is generated in a feed-forward manner. However, a substantial fraction of the values are high, meaning the information flows directly through time.
Figure 8: Perplexity vs. depth of the RHN over the test set, with (blue) and without (orange) the HSG cell. This figure illustrates the depth-performance bottleneck phenomenon. Up to a depth of 20 layers both architectures give similar results; however, when more layers are stacked, the vanilla RHN stops improving (and even deteriorates), whereas the RHN with the HSG cell keeps improving.

6 Conclusion

In this work, we revisit a widely used RNN model. We analyze its limits and issues, and propose a variant of it, called the Highway State Gate (HSG). The main idea behind HSG is to create an alternative fast route for information to flow through time. The HSG uses a gating mechanism to assemble the new state from a weighted sum of the former state and the RHN output. We show that with our method, training deeper nets results in better performance. To the best of our knowledge, this is the first time in the field of recurrent nets that adding layers at this scale has resulted in a steady improvement.