Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks

06/14/2018 · Minmin Chen et al.

Recurrent neural networks have gained widespread use in modeling sequence data across various domains. While many successful recurrent architectures employ a notion of gating, the exact mechanism that enables such remarkable performance is not well understood. We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and random matrix theory. To simplify our discussion, we introduce a new RNN cell with a simple gating mechanism that we call the minimalRNN and compare it with vanilla RNNs. Our theory allows us to define a maximum timescale over which RNNs can remember an input. We show that this theory predicts trainability for both recurrent architectures, and that gated recurrent networks feature a much broader, more robust, trainable region than vanilla RNNs, corroborating recent experimental findings. We then develop a closed-form critical initialization scheme that achieves dynamical isometry in both vanilla RNNs and minimalRNNs, and show that it significantly improves training dynamics. Finally, we demonstrate that the minimalRNN achieves comparable performance to its more complex counterparts, such as LSTMs and GRUs, on a language modeling task.


1 Introduction

Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) have found widespread use across a variety of domains, from language modeling (Mikolov et al., 2010; Kiros et al., 2015; Jozefowicz et al., 2016) and machine translation (Bahdanau et al., 2014) to speech recognition (Graves et al., 2013) and recommendation systems (Hidasi et al., 2015; Wu et al., 2017). However, RNNs as originally proposed are difficult to train and are rarely used in practice. Instead, variants of RNNs that feature various forms of “gating”, such as Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Chung et al., 2014), perform significantly better than their vanilla counterparts. Often, these models must be paired with techniques such as normalization layers (Ioffe & Szegedy, 2015b; Ba et al., 2016) and gradient clipping (Pascanu et al., 2013) to achieve good performance.

A rigorous explanation for the remarkable success of gated recurrent networks remains elusive (Jozefowicz et al., 2015; Greff et al., 2017). Recent work (Collins et al., 2016) provides empirical evidence that the benefits of gating are mostly rooted in improved trainability rather than increased capacity or expressivity. The problem of disentangling trainability from expressivity is widespread in machine learning, since state-of-the-art architectures are nearly always the result of sparse searches in high dimensional spaces of hyperparameters. As a result, we often mistake trainability for expressivity. Seminal early work (Glorot & Bengio; Bertschinger et al.) showed that a major hindrance to trainability was the vanishing and exploding of gradients.

Recently, progress has been made in the feed-forward setting (Schoenholz et al., 2017; Pennington et al., 2017; Yang & Schoenholz, 2017) by developing a theory of both the forward propagation of signal and the backward propagation of gradients. This theory is based on studying neural networks whose weights and biases are randomly distributed, which is equivalent to studying the behavior of neural networks after random initialization, or the prior over functions induced by a particular choice of hyperparameters (Lee et al., 2017). It was shown that randomly initialized neural networks are trainable if three conditions are satisfied: (1) the size of the output of the network is finite for finite inputs, (2) the output of the network is sensitive to changes in the input, and (3) gradients neither explode nor vanish. Moreover, neural networks achieving dynamical isometry, i.e. having input-output Jacobian matrices that are well-conditioned, were shown to train orders of magnitude faster than networks lacking this property.

In this work, we combine mean field theory and random matrix theory to extend these results to the recurrent setting. We will be particularly focused on understanding the role that gating plays in trainability. As we will see, there are a number of subtleties that must be addressed for (gated) recurrent networks that were not present in the feed-forward setting. To clarify the discussion, we therefore contrast vanilla RNNs with a gated RNN cell that we call the minimalRNN, which is significantly simpler than LSTMs and GRUs but implements a similar form of gating. We expect the framework introduced here to be applicable to more complicated gated architectures.

The first main contribution of this paper is the development of a mean field theory for forward propagation of signal through vanilla RNNs and minimalRNNs. In doing so, we identify a theory of the maximum timescale over which signal can propagate in each case. Next, we produce a random matrix theory for the end-to-end Jacobian of the minimalRNN. As in the feed-forward setting, we establish that the duality between the forward propagation of signal and the backward propagation of gradients persists in the recurrent setting. We then show that our theory is indeed predictive of trainability in recurrent neural networks by comparing the maximum trainable number of steps of RNNs with the timescale predicted by the theory. Overall, we find remarkable alignment between theory and practice. Additionally, we develop a closed-form initialization procedure for both networks and show that on a variety of tasks RNNs initialized to be dynamically isometric are significantly easier to train than those lacking this property.

Corroborating the experimental findings of Collins et al. (2016), we show that both signal propagation and dynamical isometry in vanilla RNNs are far more precarious than in the case of the minimalRNN. Indeed, the vanilla RNN achieves dynamical isometry only if the network is initialized with orthogonal weights at the boundary between order and chaos, a one-dimensional line in parameter space. Owing to its gating mechanism, the minimalRNN, on the other hand, enjoys a robust multi-dimensional subspace of good initializations, all of which enable dynamical isometry. Based on these insights, we conjecture that more complex gated recurrent neural networks benefit from similar effects.

2 Related Work

Identity and orthogonal initialization schemes have been identified as a promising approach to improving the trainability of deep neural networks (Le et al., 2015; Mishkin & Matas, 2015). Additionally, Arjovsky et al. (2016); Hyland & Rätsch (2017); Xie et al. (2017) advocate going beyond initialization and constraining the transition matrix to be orthogonal throughout the entire learning process, either through re-parametrization or by constraining the optimization to the Stiefel manifold (Wisdom et al., 2016). However, as was pointed out in Vorontsov et al. (2017), strictly enforcing orthogonality during training may hinder training speed and generalization performance. While these contributions are similar to our own, in the sense that they attempt to construct networks that feature dynamical isometry, it is worth noting that orthogonal weight matrices do not guarantee dynamical isometry, owing to the nonlinear nature of deep neural networks, as shown in Pennington et al. (2017). In this paper we continue this trend and show that orthogonality has little impact on the conditioning of the Jacobian (and so on trainability) in gated RNNs.

The notion of “edge of chaos” initialization has been explored previously, especially in the case of recurrent neural networks. Bertschinger et al. and Glorot & Bengio propose edge-of-chaos initialization schemes that they show lead to improved performance. Additionally, architectural innovations such as batch normalization (Ioffe & Szegedy, 2015a), orthogonal matrix initialization (Saxe et al., 2013), random walk initialization (Sussillo & Abbott, 2014), composition kernels (Daniely et al., 2016), and residual network architectures (He et al., 2015) all share the common goal of stabilizing gradients and improving training dynamics.

There is a long history of applying mean-field-like approaches to understand the behavior of neural networks. Indeed, several pieces of seminal work used statistical physics (Derrida & Pomeau; Sompolinsky et al., 1988) and Gaussian processes (Neal, 2012) to show that neural networks exhibit remarkable regularity as the width of the network gets large. Mean field theory has also long been used to study Boltzmann machines (Ackley et al.) and sigmoid belief networks (Saul et al., 1996). More recently, there has been a revitalization of mean field theory to explore questions of trainability and expressivity in fully-connected networks and residual networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Karakida et al., 2018; Hayou et al., 2018; Hanin & Rolnick, 2018; Yang & Schoenholz, 2018). Our approach closely follows these later contributions and extends many of their techniques to the case of recurrent networks with gating. Beyond mean field theory, there have been several attempts to understand signal propagation in RNNs, e.g., using the Geršgorin circle theorem (Zilly et al., 2016) or time invariance (Tallec & Ollivier, 2018).

3 Theory and Critical Initialization

We begin by developing a mean field theory for vanilla RNNs and discuss the notion of dynamical isometry. Afterwards, we move on to a simple gated architecture to explain the role of gating in facilitating signal propagation in RNNs.

3.1 Vanilla RNN

Vanilla RNNs are described by the recurrence relation,

h^t = W φ(h^{t-1}) + V x^t + b.    (1)

Here x^t is the input at time t, h^t is the pre-activation, and φ(h^t) is the hidden state after applying an arbitrary activation function φ. For the purposes of this discussion we set φ = tanh. Furthermore, W and V are weight matrices that multiply the hidden state and the inputs respectively, and b is a bias.

Next, we apply mean-field theory to vanilla RNNs, following a strategy similar to the one introduced in Poole et al. (2016) and Schoenholz et al. (2017). At the level of mean-field theory, vanilla RNNs will prove to be intimately related to feed-forward networks, and so this discussion proceeds analogously. For a more detailed discussion, see these earlier studies.

Consider two sequences of inputs, {x^t_a} and {x^t_b}, described by the covariance matrix R^t with entries R^t_{ab}. To simplify notation, we assume the input sequences have been standardized so that R^t_{aa} = R independent of time. This allows us to write R^t_{ab} = R Σ^t_{ab}, where Σ^t is a matrix whose diagonal terms are 1 and whose off-diagonal terms are the cosine similarity between the inputs at time t. These sequences are then passed into two identical copies of an RNN to produce two corresponding pre-activation sequences, {h^t_a} and {h^t_b}. As in Poole et al. (2016), we let the weights and biases be Gaussian distributed so that W_ij ~ N(0, σ_w²/N), with N the width of the hidden state, V_ij ~ N(0, σ_v²/m), with m the input dimension, and b_i ~ N(μ_b, σ_b²)¹, and we consider the wide network limit, N → ∞. As in the fully-connected setting, we would like to invoke the Central Limit Theorem (CLT) to conclude that the pre-activations of the hidden states are jointly Gaussian distributed. Unfortunately, the CLT is violated in the recurrent setting, as h^{t-1} is correlated with W due to weight sharing between steps of the RNN.

To make progress, we proceed by developing the theory of signal propagation for RNNs with untied weights. This allows for several simplifications, including the application of the CLT to conclude that (h^t_{i;a}, h^t_{i;b}) are jointly Gaussian distributed with covariance matrix Q^t, independent of the neuron index i. We explore the ramifications of this approximation by comparing simulations of RNNs with tied and untied weights. Overall, we will see that while ignoring weight tying leads to quantitative differences between theory and experiment, it does not change the qualitative picture that emerges. See figs. 1 and 2 for verification.

¹ In practice we set μ_b = 0 for the vanilla RNN.

With this approximation in mind, we will now quantify how the pre-activation hidden states h^t_a and h^t_b evolve by deriving the recurrence relation for the covariance matrix Q^t from the recurrence on h^t in eq. (1). Using identical arguments to Poole et al. (2016) one can show that,

Q^t_{ab} = σ_w² ∫ φ(z_a) φ(z_b) D_{Q^{t-1}} z + σ_v² R^t_{ab} + σ_b²,    (2)

where z = (z_a, z_b) and

D_Q z = dz_a dz_b exp(-½ zᵀ Q^{-1} z) / (2π √(det Q))    (3)

is a Gaussian measure with covariance matrix Q. By symmetry, our normalization allows us to define q^t = Q^t_{aa} to be the magnitude of the pre-activation hidden state and c^t = Q^t_{ab}/q^t to be the cosine similarity between the hidden states. We will be particularly concerned with understanding the dynamics of the cosine similarity, c^t.

In feed-forward networks, the inputs dictate the initial value of the cosine similarity, and then the evolution of c^t is determined solely by the network architecture. By contrast, in recurrent networks the inputs perturb c^t at each timestep. Analyzing the dynamics of c^t for arbitrary Σ^t is therefore challenging; however, significant insight can be gained by studying the off-diagonal entries of eq. (2) for Σ^t_{ab} = Σ_{ab} independent of time. In the case of time-independent Σ_{ab}, as t → ∞ both q^t → q* and c^t → c*, where q* and c* are fixed points of the variance of the pre-activation hidden state and the cosine similarity between pre-activation hidden states respectively. As was discussed previously (Poole et al., 2016; Schoenholz et al., 2017), the dynamics of q^t are generally uninteresting provided q* is finite. We therefore choose to normalize the hidden state such that q^0 = q*, which implies that q^t = q* independent of time.

In this setting, it was shown in Schoenholz et al. (2017) that in the vicinity of a fixed point the off-diagonal term in eq. (2) can be expanded to lowest order in ε^t = c^t - c* to give the linearized dynamics ε^{t+1} = χ_{c*} ε^t, where

χ_{c*} = σ_w² ∫ φ′(z_a) φ′(z_b) D_{Q*} z.    (4)

Here Q* denotes the covariance matrix at the fixed point. These dynamics have the solution ε^t = ε^{t_0} χ_{c*}^{t-t_0}, where t_0 is the time when c^t is close enough to c* for the linear approximation to be valid. If χ_{c*} < 1 it follows that c^t approaches c* exponentially quickly over a timescale τ = -1/log χ_{c*}, and c* is called a stable fixed point. When c^t gets too close to c* to be distinguished from it to within numerical precision, information about the initial inputs has been lost. Thus, τ sets the maximum timescale over which we expect the RNN to be able to remember information. If χ_{c*} > 1 then c^t gets exponentially farther from c* over time and c* is an unstable fixed point. In this case, for the activation function considered here, another fixed point that is stable will emerge. Note that the expression for χ_{c*} does not explicitly depend on Σ_{ab}, and so the dynamics of c^t near c* do not depend on Σ_{ab}.
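To make the timescale concrete, the sketch below estimates χ_{c*} for φ = tanh by Monte Carlo sampling of the bivariate Gaussian in eq. (4) and converts it into τ = -1/log χ_{c*}; the sampling scheme and the parameter values are illustrative assumptions.

```python
import numpy as np

def chi_tanh(q_star, c_star, sigma_w, n_samples=500_000, seed=0):
    """Monte Carlo estimate of chi_{c*} = sigma_w^2 E[phi'(z_a) phi'(z_b)], phi = tanh,
    with (z_a, z_b) zero-mean Gaussian with variance q* and correlation c* (cf. eq. (4))."""
    rng = np.random.default_rng(seed)
    z1, z2 = rng.standard_normal((2, n_samples))
    za = np.sqrt(q_star) * z1
    zb = np.sqrt(q_star) * (c_star * z1 + np.sqrt(1.0 - c_star**2) * z2)
    dphi = lambda z: 1.0 - np.tanh(z) ** 2
    return sigma_w**2 * np.mean(dphi(za) * dphi(zb))

chi = chi_tanh(q_star=1.0, c_star=0.6, sigma_w=1.3)
tau = np.inf if chi >= 1.0 else -1.0 / np.log(chi)  # maximum memory timescale
print(f"chi = {chi:.3f}, tau = {tau:.1f} steps")
```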

In vanilla fully-connected networks, c* = 1 is always a fixed point of the dynamics of c^t, but it is not always stable. Indeed, it was shown that these networks exhibit a phase transition where c* = 1 goes from being a stable fixed point to an unstable one as a function of the network's hyperparameters. This is known as the order-to-chaos transition, and it occurs exactly when χ_1 = 1. Since τ = -1/log χ_{c*}, signal can propagate infinitely far at the boundary between order and chaos. Comparing the diagonal and off-diagonal entries of eq. (2), we see that in recurrent networks c* = 1 is a fixed point only when Σ_{ab} = 1, and in this case the discussion is identical to the feed-forward setting. When Σ_{ab} < 1, it is easy to see that c* = 1 cannot be a fixed point, since if c^t = 1 at some time then c^{t+1} < 1. We see that in recurrent networks noise from the inputs destroys the ordered phase and there is no order-to-chaos critical point. As a result, we should expect the maximum timescale over which memory may be stored in vanilla RNNs to be fundamentally limited by noise from the inputs.

The end-to-end Jacobian of a vanilla RNN with untied weights is in fact formally identical to the input-output Jacobian of a feed-forward network, and thus the results from Pennington et al. (2017) regarding conditions for dynamical isometry apply directly. In particular, dynamical isometry is achieved with orthogonal state-to-state transition matrices W, tanh non-linearities, and small values of q*. Perhaps surprisingly, these conclusions continue to be valid if the assumption of untied weights is relaxed. To understand why this is the case, consider the example of a linear network. For untied weights, the end-to-end Jacobian over T steps is given by a product of T independent matrices, while for tied weights the Jacobian is given by the T-th matrix power of W. It turns out that as N → ∞ there is sufficient self-averaging to overcome the dependencies induced by weight tying, and the asymptotic singular value distributions of the two products are actually identical (Haagerup & Larsen, 2000).
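The self-averaging claim can be checked numerically in the linear case. The sketch below compares the singular values of the T-th power of a single Gaussian matrix (tied weights) with those of a product of T independent Gaussian matrices (untied weights); the width, depth, and use of Gaussian rather than orthogonal matrices are illustrative choices of ours.

```python
import numpy as np

def tied_vs_untied_singvals(N=512, T=8, seed=0):
    """Singular values of W^T versus a product of T independent copies,
    with W_ij ~ N(0, 1/N); per Haagerup & Larsen (2000) the limiting
    distributions coincide as N -> infinity."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
    tied = np.linalg.matrix_power(W, T)
    untied = np.eye(N)
    for _ in range(T):
        untied = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N)) @ untied
    return (np.linalg.svd(tied, compute_uv=False),
            np.linalg.svd(untied, compute_uv=False))

s_tied, s_untied = tied_vs_untied_singvals()
print(np.percentile(s_tied, [25, 50, 75]), np.percentile(s_untied, [25, 50, 75]))
```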

3.2 MinimalRNN

3.2.1 Mean-Field Theory

To study the role of gating, we introduce the minimalRNN, which is simpler than other gated RNN architectures but nonetheless features the same gating mechanism. A sequence of raw inputs x̃^t is first mapped to the hidden space through x^t = Φ(x̃^t)². From here on, we refer to x^t as the inputs to the minimalRNN. The minimalRNN is then described by the recurrence relation,

z^t = W h^{t-1} + V x^t + b,   u^t = σ(z^t),   h^t = u^t ⊙ h^{t-1} + (1 - u^t) ⊙ x^t,    (5)

where z^t is the pre-activation to the gating function, u^t the update gate and h^t the hidden state. The minimalRNN retains the most essential gate in LSTMs (Jozefowicz et al., 2015; Greff et al., 2017) and achieves competitive performance. The simplified update of this cell, on the other hand, enables us to pinpoint the role of gating in a more controlled setting.

² Here Φ can be any highly flexible function such as a feed-forward network. In our experiments, we take Φ to be a single fully-connected layer followed by a pointwise activation.
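For concreteness, a minimal NumPy sketch of the cell as written in eq. (5) follows. Note that eq. (5) itself is reconstructed from the surrounding text, so details such as feeding the embedded input into the gate pre-activation should be read as our interpretation rather than a verbatim restatement of the original equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minimal_rnn_step(h_prev, x, W, V, b):
    """One step of the (reconstructed) minimalRNN, eq. (5):
        z^t = W h^{t-1} + V x^t + b
        u^t = sigmoid(z^t)                       # update gate
        h^t = u^t * h^{t-1} + (1 - u^t) * x^t    # mix of memory and new input
    Here x is the embedded input x^t = Phi(x_tilde^t), living in the hidden space."""
    z = W @ h_prev + V @ x + b
    u = sigmoid(z)
    h = u * h_prev + (1.0 - u) * x
    return h, u
```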

As in the previous case, we consider two sequences of inputs to the network, {x^t_a} and {x^t_b}. We take W_ij ~ N(0, σ_w²/N), V_ij ~ N(0, σ_v²/N), and b_i ~ N(μ_b, σ_b²). By analogy to the vanilla case, we can make the mean field approximation that the z^t_i are jointly Gaussian distributed with covariance matrix Q^t. Here,

Q^t_{ab} = σ_w² Q̃^{t-1}_{ab} + σ_v² R^t_{ab} + σ_b²,    (6)

where we have defined Q̃^t_{ab} = E[h^t_{i;a} h^t_{i;b}] as the second-moment matrix of the hidden state³. As in the vanilla case, R^t_{ab} is the covariance between inputs so that R^t_{ab} = R Σ^t_{ab}.

³ h^t will be centered under the mean field approximation if h^0 is initialized with mean zero.

We note that R^t is fixed by the input, but it remains for us to work out Q̃^t. We find that (see SI section B),

Q̃^t_{ab} = E[u^t_a u^t_b] Q̃^{t-1}_{ab} + E[(1 - u^t_a)(1 - u^t_b)] R^t_{ab}.    (7)

Here we assume that the expectation factorizes so that the gates u^t and the states h^{t-1} and x^t are approximately independent. We believe this approximation becomes exact in the N → ∞ limit.

We choose to normalize the data in a similar manner to the vanilla case so that R^t_{aa} = R independent of time. An immediate consequence of this normalization is that Q^t_{aa} and Q̃^t_{aa} are the same for both sequences. We then write ĉ^t and c^t for the cosine similarities between the hidden states and between the pre-activations respectively. With this normalization, we can work out the mean-field recurrence relation characterizing the covariance matrix for the minimalRNN. This analysis can be done by deriving the recurrence relation for either Q^t or Q̃^t; we will choose to study the dynamics of Q^t, however the two are trivially related by eq. (6). In SI section C, we analyze the dynamics of the diagonal term in the recurrence relation and prove that there is always a fixed point at some q*. In SI section D, we compute the depth scale over which q^t approaches q*. However, as in the case of the vanilla RNN, the dynamics of q^t are generally uninteresting.

We now turn our attention to the dynamics of the cosine similarity between the pre-activations, c^t. As in the case of vanilla RNNs, we note that q^t approaches its fixed point quickly relative to the dynamics of c^t. We therefore choose to normalize the hidden state of the RNN so that q^0 = q*, in which case both Q^t_{aa} and Q̃^t_{aa} are constant, independent of time. From eq. (6) and eq. (7) it follows that the cosine similarity of the pre-activations evolves as,

(8)

where the expectations over the gates are taken with respect to the Gaussian distribution of the pre-activations. As in the case of the vanilla RNN, we can study the behavior of c^t in the vicinity of a fixed point, c*. By expanding eq. (8) to lowest order in ε^t = c^t - c* we arrive at a linearized recurrence relation that has an exponential solution ε^t ∝ χ_{c*}^t, where here,

(9)

The discussion above in the vanilla case carries over directly to the minimalRNN with the appropriate replacement of χ_{c*}. Unlike in the case of the vanilla RNN, here we see that χ_{c*} itself depends on Σ_{ab}.

Again, c* = 1 is a fixed point of the dynamics only when Σ_{ab} = 1. In this case, the minimalRNN experiences an order-to-chaos phase transition when χ_1 = 1, at which point the maximum timescale over which signal can propagate goes to infinity. Similar to the vanilla RNN, when Σ_{ab} < 1 we expect that the phase transition will be destroyed and the maximum duration of signal propagation will be severely limited. However, in a significant departure from the vanilla case, when the bias mean μ_b is taken to be large the gates saturate, u^t → 1, and the hidden state is essentially copied forward at every step. Considering eq. (9), we notice that in this regime χ_{c*} → 1 independent of Σ_{ab}. In other words, gating allows for arbitrarily long term signal propagation in recurrent neural networks independent of Σ_{ab}.

Figure 1: Numerical verification of mean field results on the minimalRNN. All experiments were done with fixed , , , and . These hyperparameters are chosen so that the minimalRNN has an order-to-chaos critical point at . MC simulations were averaged over 1000 minimalRNNs with hidden dimension 8192. (a) Cosine similarity for theory (white dashed) compared with MC simulations of a minimalRNN with untied weights (solid). (b) Cosine similarity for tied weights (dashed) compared with untied weights (solid). (c) Comparison of for theory (white dashed), MC with untied weights (solid black), tied weights (dashed grey). (d) Comparison of linearized dynamics. Dashed lines show and solid lines show the result of simulations with untied weights. (e) Comparison of the timescale for signal propagation for different values of .

We explore the agreement between our theory and MC simulations of the minimalRNN in fig. 1. In this set of experiments, we consider inputs such that for and for . Fig. 1 (a,c,d) shows excellent quantitative agreement between our theory and MC simulations. In fig. 1 (a,b) we compare the MC simulations of the minimalRNN with and without weight tying. While we observe that for many choices of hyperparameters the untied weight approximation is quite good (particularly when ), deeper into the chaotic phase the quantitative agreement between the two breaks down. Nonetheless, we observe that the untied approximation describes the qualitative behavior of the real minimalRNN overall. In fig. 1 (e) we plot the timescale for signal propagation for and for the minimalRNN with identical choices of hyperparameters. We see that while as gets large independent of , a critical point at is only observed when .
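An MC simulation in the spirit of fig. 1 can be sketched as follows: two copies of a random minimalRNN (with tied or untied weights) are fed input sequences with a fixed cosine similarity, and the cosine similarity of their hidden states is tracked over time. The hidden size, hyperparameter values, and input construction below are illustrative assumptions, not the settings used in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def simulate_similarity(N=1024, T=30, sigma_w=1.5, sigma_v=1.0, sigma_b=0.5,
                        mu_b=0.0, input_sim=0.5, untied=True, seed=0):
    """Cosine similarity of the hidden states of two copies of a random
    minimalRNN (as reconstructed above) fed inputs with fixed similarity."""
    rng = np.random.default_rng(seed)
    b = rng.normal(mu_b, sigma_b, size=N)
    W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
    V = rng.normal(0.0, sigma_v / np.sqrt(N), size=(N, N))
    ha = rng.standard_normal(N)
    hb = ha.copy()                      # the two copies start from the same state
    sims = []
    for t in range(T):
        # Correlated Gaussian inputs with cosine similarity ~ input_sim.
        e1, e2 = rng.standard_normal((2, N))
        xa = e1
        xb = input_sim * e1 + np.sqrt(1.0 - input_sim**2) * e2
        if untied:                      # fresh weights each step (mean field assumption)
            W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
            V = rng.normal(0.0, sigma_v / np.sqrt(N), size=(N, N))
        ua = sigmoid(W @ ha + V @ xa + b)
        ub = sigmoid(W @ hb + V @ xb + b)
        ha = ua * ha + (1.0 - ua) * xa
        hb = ub * hb + (1.0 - ub) * xb
        sims.append(cosine(ha, hb))
    return sims

print(np.round(simulate_similarity(), 3)[::5])
```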

3.2.2 Dynamical Isometry

In the previous subsection, we derived a quantity χ_1 that defines the boundary between the ordered and the chaotic phases of forward propagation. Here we show that it also defines the boundary between exploding and vanishing gradients. To see this, consider the Jacobian of the state-to-state transition operator,

J^t = ∂h^t/∂h^{t-1} = D_{u^t} + D_{σ′(z^t) ⊙ (h^{t-1} - x^t)} W,    (10)

where D_a denotes a diagonal matrix with a along its diagonal. We can compute the expected norm-squared of back-propagated error signals, which measures the growth or shrinkage of gradients. It is equal to the mean-squared singular value of the Jacobian (Poole et al., 2016; Schoenholz et al., 2017), or the first moment of J^t (J^t)ᵀ,

(11)

where we have used the fact that the elements of W, h^{t-1} and x^t are i.i.d. Since we assume convergence to the fixed point, these distributions are independent of t, and it is easy to see that the mean-squared singular value equals χ_1. The variance of back-propagated error signals through T time steps is therefore χ_1^T. As such, the constraint χ_1 = 1 defines the boundary between phases of exponentially exploding and exponentially vanishing gradient norm (variance). Note that, unlike in the case of forward signal propagation, in the case of backpropagation this boundary is independent of Σ_{ab}.
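The quantities in this argument are easy to probe numerically. The sketch below builds the state-to-state Jacobian of eq. (10) for the reconstructed cell at a random point and reports its mean squared singular value and its extreme singular values; all hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minimal_rnn_jacobian(h_prev, x, W, V, b):
    """State-to-state Jacobian dh^t/dh^{t-1} of the reconstructed minimalRNN,
    J = diag(u) + diag(sigma'(z) * (h^{t-1} - x)) @ W  (cf. eq. (10))."""
    z = W @ h_prev + V @ x + b
    u = sigmoid(z)
    return np.diag(u) + np.diag(u * (1.0 - u) * (h_prev - x)) @ W

N = 512
rng = np.random.default_rng(0)
sigma_w, sigma_v, sigma_b, mu_b = 1.5, 1.0, 0.5, 2.0   # positive gate bias mean
W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
V = rng.normal(0.0, sigma_v / np.sqrt(N), size=(N, N))
b = rng.normal(mu_b, sigma_b, size=N)
h, x = rng.standard_normal(N), rng.standard_normal(N)

J = minimal_rnn_jacobian(h, x, W, V, b)
msv = np.sum(J * J) / N                 # mean squared singular value, tr(J J^T)/N
svals = np.linalg.svd(J, compute_uv=False)
print(msv, svals.min(), svals.max())    # singular values near 1 indicate isometry
```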

As argued in Pennington et al. (2017, 2018), controlling the variance of back-propagated gradients is necessary but not sufficient to guarantee trainability, especially for very deep networks. Beyond the first moment, the entire distribution of eigenvalues of J Jᵀ (or of singular values of J) is relevant. Indeed, it was found in Pennington et al. (2017, 2018) that enabling dynamical isometry, namely the condition that all singular values of J are close to unity, can drastically improve training speed for very deep feed-forward networks.

Following Pennington et al. (2017, 2018), we use tools from free probability theory to compute the variance σ²_{JJᵀ} of the limiting spectral density of J Jᵀ; however, unlike previous work, in our case the relevant matrices are not symmetric and therefore we must invoke tools from non-Hermitian free probability; see Cakmak (2012) for a review. As in the previous section, we make the simplifying assumption that the weights are untied, relying on the same motivations given in section 3.1. Using these tools, an unilluminating calculation reveals that,

(12)

where,

(13)

and s_1 is the first term in the Taylor expansion of the S-transform of the eigenvalue distribution of W Wᵀ (Pennington et al., 2018). For example, for Gaussian matrices s_1 = -1, and for orthogonal matrices s_1 = 0.

Some remarks are in order about eq. (12). First, we note the duality between the forward and backward signal propagation (eq. (9) and eq. (13)). For critical initializations, χ_1 = 1, so σ²_{JJᵀ} does not grow exponentially, but it still grows linearly with T. This situation is entirely analogous to the feed-forward analysis of Pennington et al. (2017, 2018). In the case of the vanilla RNN, the coefficient of the linear term can only be reduced by taking the weight and bias variances to be small. A crucial difference in the minimalRNN is that the coefficient of the linear term can be made arbitrarily small by simply adjusting the bias mean μ_b to be positive, which sends the gates toward unity and the coefficient toward zero independent of the weight and bias variances. Therefore the conditions for dynamical isometry decouple from the weight and bias variances, implying that trainability can occur for a higher-dimensional, more robust slice of parameter space. Moreover, the value of s_1 has no effect on the capacity of the minimalRNN to achieve dynamical isometry. We believe these are fundamental reasons why gated cells such as the minimalRNN perform well in practice.

Algorithm 1 describes the procedure for finding σ_w and σ_b that achieve the critical condition for the minimalRNN. Given these, we then construct the weight matrices and biases accordingly. The fixed point q* is used to initialize the hidden state to avoid a transient phase.

Algorithm 1: Critical initialization for minimalRNNs. The procedure evaluates the fixed-point relation of eq. (7) and the criticality condition of eq. (13) to obtain the initialization hyperparameters.
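We do not reproduce the exact steps of Algorithm 1, but the following sketch conveys the general idea of a closed-loop critical initialization for the reconstructed cell: for a chosen bias mean μ_b and a target scale q*, it numerically adjusts σ_w until a Monte Carlo estimate of the mean squared singular value of the state-to-state Jacobian equals one. The objective, the bisection strategy, and all constants are our own assumptions rather than the authors' procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_sq_singval(sigma_w, sigma_v, sigma_b, mu_b, q_star, N=1024, seed=0):
    """Monte Carlo estimate of (1/N) tr(J J^T) for the reconstructed minimalRNN,
    evaluated at a random point whose hidden-state and input norms match q*."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
    V = rng.normal(0.0, sigma_v / np.sqrt(N), size=(N, N))
    b = rng.normal(mu_b, sigma_b, size=N)
    h = np.sqrt(q_star) * rng.standard_normal(N)
    x = np.sqrt(q_star) * rng.standard_normal(N)
    u = sigmoid(W @ h + V @ x + b)
    J = np.diag(u) + np.diag(u * (1.0 - u) * (h - x)) @ W
    return np.sum(J * J) / N            # equals tr(J J^T) / N

def solve_critical_sigma_w(mu_b, q_star=1.0, sigma_v=1.0, sigma_b=0.1,
                           lo=0.01, hi=10.0, iters=30):
    """Bisect on sigma_w so that the mean squared singular value is ~ 1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_sq_singval(mid, sigma_v, sigma_b, mu_b, q_star) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(solve_critical_sigma_w(mu_b=1.0))
```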

4 Experiments

Figure 2: Relationship between theory and trainability. We plot the training accuracy (higher accuracies in red) overlaid with the theoretical timescale τ (shown in white). The top row of figures shows the results with untied weights and the bottom row shows the results with weight tying. (a-b) Vanilla RNN. (c-d), (e-f), (g-h) MinimalRNN under three different hyperparameter settings (see section 4.1).

Having established a theory for the behavior of random vanilla RNNs and minimalRNNs, we now discuss the connection between our theory and trainability in practice. We begin by corroborating the claim that the maximum timescale over which memory can be stored in an RNN is controlled by the timescale τ identified in the previous section. We will then investigate the role of dynamical isometry in speeding up learning.

4.1 Trainability

Dataset. To verify the results of our theoretical calculation, we consider a task that is reflective of the theory above. To that end, we constructed a sequence dataset for training RNNs from MNIST (LeCun et al., 1998). Each 28 × 28 digit image is flattened into a vector of 784 pixels and sent as the first input to the RNN. We then send T random inputs into the RNN, varying T between 10 and 1000 steps. As the only salient information about the digit is in the first step, the network will need to propagate information through T steps to accurately identify the MNIST digit. The random inputs are drawn independently for each example, and so this is a regime where Σ_{ab} ≈ 0 for all t.
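The task is straightforward to construct; the sketch below assumes an MNIST array loader is available and uses Gaussian noise for the distractor inputs, both of which are our own choices.

```python
import numpy as np

def make_delayed_mnist(images, labels, T, noise_std=1.0, seed=0):
    """Build the trainability task: the flattened 784-pixel digit is the first
    input, followed by T-1 i.i.d. random inputs, so the inputs are uncorrelated
    after the first step and the label can only be recovered by remembering it.

    images : (num_examples, 28, 28) array; labels : (num_examples,) array.
    Returns sequences of shape (num_examples, T, 784) and the labels."""
    rng = np.random.default_rng(seed)
    num = images.shape[0]
    first = images.reshape(num, 1, 784).astype(np.float32)
    noise = noise_std * rng.standard_normal((num, T - 1, 784)).astype(np.float32)
    return np.concatenate([first, noise], axis=1), labels

# Usage (assuming `train_images`, `train_labels` come from any MNIST loader):
# seqs, ys = make_delayed_mnist(train_images, train_labels, T=200)
```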

We then performed a series of experiments on this task to make connection with our theory. In each case we experimented with both tied and untied weights. The results are shown in fig. 2. In the case of untied weights, we observe strong quantitative agreement between our theoretical prediction for τ and the maximum depth at which the network is still trainable. When the weights of the network are tied, we observe quantitative deviations between our theory and experiments, but the overall qualitative picture remains.

We train vanilla RNNs for around 10 epochs, varying the weight variance while holding the remaining hyperparameters fixed. The results of this experiment are shown in fig. 2 (a-b). We train minimalRNNs for around 1 epoch. We perform three different experiments here, varying different combinations of the weight variance and the gate bias mean, shown in fig. 2 (c-d), (e-f), and (g-h) respectively. Comparing fig. 2 (a,b) with fig. 2 (c,d,g,h), the minimalRNN at large depth is trainable over a much wider range of hyperparameters than the vanilla RNN, despite the fact that the network was trained for an order of magnitude less time.

4.2 Critical initialization

Figure 3: Learning dynamics, measured by accuracy on the test set, for vanilla RNNs and minimalRNNs trained with depth 196 (a, b) and 784 (c, d) under four different initialization conditions. A drastic difference in convergence speed is observed between critical and off-critical initialization in both models. Well-trained models reach an accuracy of 0.98 on the test set.

Dataset. To study the impact of critical initialization on training speed, we constructed a more realistic sequence dataset from MNIST. We unroll the 784 pixels of each image into a sequence of T inputs, each containing 784/T pixels. We tested T = 196 and T = 784 to vary the difficulty of the task.

Note that we are more interested in the training speed of these networks under different initialization conditions than in their final test accuracy. We compare the convergence speed of the vanilla RNN and the minimalRNN under four initialization conditions: 1) critical initialization with orthogonal weights (solid blue); 2) critical initialization with Gaussian distributed weights (solid red); 3) off-critical initialization with orthogonal weights (dotted green); 4) off-critical initialization with Gaussian distributed weights (dotted black).

We fix μ_b to zero in all settings. Under critical initialization, σ_w and σ_b are carefully chosen to achieve criticality as defined via eq. (4) for the vanilla RNN and eq. (13) (detailed in Algorithm 1) for the minimalRNN, respectively. When testing networks off criticality, we employ a commonly used initialization procedure.

Figure 3 summarizes our findings: there is a clear difference in training speed between models trained with critical initialization and models initialized far from criticality. We observe a difference of two orders of magnitude in training speed between a critical and an off-critical initialization for vanilla RNNs. While a critically initialized model reaches a test accuracy of 0.98 after 750 optimization steps, the off-critical network takes over 16,000 updates to do so. A similar trend was observed for the minimalRNN. This difference is even more pronounced in the case of the longer sequences with T = 784: both vanilla RNNs and minimalRNNs initialized off criticality failed at the task. The well-conditioned minimalRNN trains a factor of three faster than the vanilla RNN. As predicted above, the difference in training speed between orthogonal and Gaussian initialization schemes is significant for vanilla RNNs but insignificant for the minimalRNN. This is corroborated in fig. 3 (b,d), where the distribution of the weights has no impact on the training speed.

5 Language modeling

We compare the minimalRNN against more complex gated RNNs such as LSTM and GRU on the Penn Tree-Bank corpus (Marcus et al., 1993). Language modeling is a difficult task, and competitive performance is often achieved by more complicated RNN cells. We show that the minimalRNN achieves competitive performance despite its simplicity.

We follow the precise setup of Mikolov et al. (2010) and Zaremba et al. (2014), and train RNNs of two sizes: a small configuration with 5M parameters and a medium-sized configuration with 20M parameters⁴. We report the perplexity on the validation and test sets. We focus our comparison on single-layer RNNs; however, we also report perplexities for multi-layer RNNs from the literature for reference. We follow the learning schedule of Zaremba et al. (2014) and Jozefowicz et al. (2015). We review additional hyperparameter ranges in section F of the supplementary material.

⁴ The hidden layer sizes of these networks are adjusted accordingly to reach the target model size.

Table 1 summarizes our results. We find that single-layer RNNs perform on par with their multi-layer counterparts. Despite being a significantly simpler model, the minimalRNN performs comparably to GRUs. Given the closed-form critical initialization developed here, which significantly boosts convergence speed, the minimalRNN might be a favorable alternative to GRUs. There is a gap in perplexity between the performance of LSTMs and minimalRNNs. We hypothesize that this is due to the removal of an independent gate on the input. The same strategy is employed in GRUs and may cause a conflict between keeping longer-range memory and updating with new information, as was originally pointed out by Hochreiter & Schmidhuber (1997).

Model                                   5M (test)   20M (valid)   20M (test)
VanillaRNN (Jozefowicz et al., 2015)    122.8       103.0         97.7
GRU (Jozefowicz et al., 2015)           108.2       95.5          91.7
LSTM (Jozefowicz et al., 2015)          109.7       83.3          78.8
LSTM                                    95.4        87.5          83.8
GRU                                     99.5        93.9          89.8
minimalRNN                              101.4       94.4          89.9
Table 1: Perplexities on the PTB. The minimalRNN achieves comparable performance to the more complex gated RNN architectures despite its simplicity.

6 Discussion

We have developed a theory of signal propagation for random vanilla RNNs and a simple gated RNN, the minimalRNN. We demonstrated rigorously that the theory predicts trainability of these networks and that gating mechanisms allow for a significantly larger trainable region. We plan to extend the theory to more complicated RNN cells as well as to RNNs with multiple layers.

Acknowledgements

We thank Jascha Sohl-Dickstein and Greg Yang for helpful discussions and Ashish Bora for many contributions to early stages of this project.

References

Appendix A MinimalRNN Architecture

Figure Supp.1: Model architecture of minimalRNN.

Appendix B Diagonal Recurrence Relation

Here we analyze the mean field dynamics of the minimalRNN. The minimalRNN features a hidden state h^t and inputs x^t. The inputs are transformed via a fully-connected network before being fed into the cell. The RNN cell is then described by the equations,

z^t = W h^{t-1} + V x^t + b,    (14)
u^t = σ(z^t),    (15)
h^t = u^t ⊙ h^{t-1} + (1 - u^t) ⊙ x^t.    (16)

Here z^t denotes the (pre-)activation and x^t denotes an input to the network. Thus, u^t acts as a gate on the t'th step. We take W_ij ~ N(0, σ_w²/N), V_ij ~ N(0, σ_v²/N), and b_i ~ N(μ_b, σ_b²).

By the CLT we can make a mean field assumption that the z^t_i are jointly Gaussian distributed with covariance matrix Q^t, where,

Q^t_{ab} = σ_w² Q̃^{t-1}_{ab} + σ_v² R^t_{ab} + σ_b²,    (17)

where we have defined Q̃^t_{ab} = E[h^t_{i;a} h^t_{i;b}] and R^t_{ab} as the second moments of the hidden states and the inputs respectively. We note that R^t is fixed by the input, but it remains for us to work out Q̃^t. We find that,

(18)
(19)
(20)

where we have assumed that the expectation factorizes so that the gates u^t and the states h^{t-1} and x^t are approximately independent.

We choose to normalize the data so that R^t_{aa} = R independent of time. An immediate consequence of this normalization is that Q^t_{aa} and Q̃^t_{aa} are the same for both sequences. We then write Σ^t_{ab}, ĉ^t, and c^t, where Σ^t_{ab}, ĉ^t, and c^t are the cosine similarities between the inputs, the hidden states, and the pre-activations respectively. With this normalization, we can work out the mean-field recurrence relation characterizing the covariance matrix for the minimalRNN.

We begin by considering the diagonal recurrence relations. We find that the dynamics are described by the equation,

(21)
(22)

As expected, the first and second integrands determine how much of the update of the random network is controlled by the norm of the hidden state and how much is determined by the norm of the input. Since σ(-z) = 1 - σ(z), it follows that when μ_b = 0 the first and second terms will be equal and so,

(23)

In general, μ_b will therefore control the degree to which the hidden state of the random minimalRNN is updated based on the previous hidden state or based on the inputs, with μ_b = 0 implying parity between the two. This is reflected in eq. (23).

Appendix C Existence of a Fixed Point

In the event that the norm of the inputs is time-independent, R^t_{aa} = R for all t, the minimalRNN will have a fixed point provided there exists a q* that satisfies a transcendental equation, namely that

(24)

It is easy to see that such a solution always exists. When the first term of approaches while the magnitude of the second increases without bound and so . Conversely, when the first term is positive while and so . The existence of a satisfying the transcendental equation then follows directly from the intermediate value theorem.

Appendix D Dynamics

We can now investigate the dynamics of the norm of the hidden state in the vicinity of q*. To do this, suppose that q^t = q* + ε^t with |ε^t| ≪ q*. Our goal is then to expand eq. (21) about q*. First, we note that,

(25)
(26)
(27)

Letting this implies that,

(28)
(29)
(30)
(31)
(32)
(33)
(34)

It follows that as,

(35)

with

(36)

as expected.

Appendix E Off-Diagonal Recurrence Relation

We now turn our attention to the off-diagonal term. From eq. (7) it follows that,

(37)
(38)

where