1 Introduction
Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) have found widespread use across a variety of domains, from language modeling (Mikolov et al., 2010; Kiros et al., 2015; Jozefowicz et al., 2016) and machine translation (Bahdanau et al., 2014) to speech recognition (Graves et al., 2013) and recommendation systems (Hidasi et al., 2015; Wu et al., 2017). However, RNNs as originally proposed are difficult to train and are rarely used in practice. Instead, variants of RNNs such as Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Chung et al., 2014) that feature various forms of "gating" perform significantly better than their vanilla counterparts. Often, these models must be paired with techniques such as normalization layers (Ioffe & Szegedy, 2015b; Ba et al., 2016) or gradient clipping (Pascanu et al., 2013) to achieve good performance.

A rigorous explanation for the remarkable success of gated recurrent networks remains elusive (Jozefowicz et al., 2015; Greff et al., 2017). Recent work (Collins et al., 2016)
provides empirical evidence that the benefits of gating are mostly rooted in improved trainability rather than increased capacity or expressivity. The problem of disentangling trainability from expressivity is widespread in machine learning, since state-of-the-art architectures are nearly always the result of sparse searches in high-dimensional spaces of hyperparameters. As a result, we often mistake trainability for expressivity. Seminal early work (Glorot & Bengio, 2010; Bertschinger et al., 2004) showed that a major hindrance to trainability was the vanishing and exploding of gradients.

Recently, progress has been made in the feedforward setting (Schoenholz et al., 2017; Pennington et al., 2017; Yang & Schoenholz, 2017) by developing a theory of both the forward propagation of signal and the backward propagation of gradients. This theory is based on studying neural networks whose weights and biases are randomly distributed. This is equivalent to studying the behavior of neural networks after random initialization or, equivalently, to studying the prior over functions induced by a particular choice of hyperparameters (Lee et al., 2017). It was shown that randomly initialized neural networks are trainable if three conditions are satisfied: (1) the size of the output of the network is finite for finite inputs, (2) the output of the network is sensitive to changes in the input, and (3) gradients neither explode nor vanish. Moreover, neural networks achieving dynamical isometry, i.e. having input-output Jacobian matrices that are well-conditioned, were shown to train orders of magnitude faster than networks that do not.
In this work, we combine mean field theory and random matrix theory to extend these results to the recurrent setting. We will be particularly focused on understanding the role that gating plays in trainability. As we will see, there are a number of subtleties that must be addressed for (gated) recurrent networks that were not present in the feedforward setting. To clarify the discussion, we will therefore contrast vanilla RNNs with a gated RNN cell that we call the minimalRNN, which is significantly simpler than LSTMs and GRUs but implements a similar form of gating. We expect the framework introduced here to be applicable to more complicated gated architectures.
The first main contribution of this paper is the development of a mean field theory for forward propagation of signal through vanilla RNNs and minimalRNNs. In doing so, we identify the maximum timescale over which signal can propagate in each case. Next, we produce a random matrix theory for the end-to-end Jacobian of the minimalRNN. As in the feedforward setting, we establish that the duality between the forward propagation of signal and the backward propagation of gradients persists in the recurrent setting. We then show that our theory is indeed predictive of trainability in recurrent neural networks by comparing the maximum trainable number of steps of RNNs with the timescale predicted by the theory. Overall, we find remarkable alignment between theory and practice. Additionally, we develop a closed-form initialization procedure for both networks and show that on a variety of tasks RNNs initialized to be dynamically isometric are significantly easier to train than those lacking this property.
Corroborating the experimental findings of Collins et al. (2016), we show that both signal propagation and dynamical isometry in vanilla RNNs are far more precarious than in the case of the minimalRNN. Indeed, the vanilla RNN achieves dynamical isometry only if the network is initialized with orthogonal weights at the boundary between order and chaos, a one-dimensional line in parameter space. Owing to its gating mechanism, the minimalRNN, on the other hand, enjoys a robust multi-dimensional subspace of good initializations, all of which enable dynamical isometry. Based on these insights, we conjecture that more complex gated recurrent neural networks also benefit from similar effects.
2 Related Work
Identity and orthogonal initialization schemes have been identified as a promising approach to improve the trainability of deep neural networks (Le et al., 2015; Mishkin & Matas, 2015). Additionally, Arjovsky et al. (2016), Hyland & Rätsch (2017), and Xie et al. (2017) advocate going beyond initialization and constraining the transition matrix to be orthogonal throughout the entire learning process, either through reparametrization or by constraining the optimization to the Stiefel manifold (Wisdom et al., 2016). However, as was pointed out in Vorontsov et al. (2017), strictly enforcing orthogonality during training may hinder training speed and generalization performance. While these contributions are similar to our own, in the sense that they attempt to construct networks that feature dynamical isometry, it is worth noting that orthogonal weight matrices do not guarantee dynamical isometry, owing to the nonlinear nature of deep neural networks, as shown in Pennington et al. (2017). In this paper we continue this trend and show that orthogonality has little impact on the conditioning of the Jacobian (and so on trainability) in gated RNNs.
The notion of "edge of chaos" initialization has been explored previously, especially in the case of recurrent neural networks. Bertschinger et al. (2004) and Glorot & Bengio (2010) propose edge-of-chaos initialization schemes that they show lead to improved performance. Additionally, architectural innovations such as batch normalization
(Ioffe & Szegedy, 2015a), orthogonal matrix initialization (Saxe et al., 2013), random walk initialization (Sussillo & Abbott, 2014), composition kernels (Daniely et al., 2016), and residual network architectures (He et al., 2015) all share a common goal of stabilizing gradients and improving training dynamics.

There is a long history of applying mean-field-like approaches to understand the behavior of neural networks. Indeed, several pieces of seminal work used statistical physics (Derrida & Pomeau, 1986; Sompolinsky et al., 1988) and Gaussian processes (Neal, 2012) to show that neural networks exhibit remarkable regularity as the width of the network gets large. Mean field theory has also long been used to study Boltzmann machines (Ackley et al., 1985) and sigmoid belief networks (Saul et al., 1996). More recently, there has been a revitalization of mean field theory to explore questions of trainability and expressivity in fully-connected networks and residual networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Karakida et al., 2018; Hayou et al., 2018; Hanin & Rolnick, 2018; Yang & Schoenholz, 2018). Our approach will closely follow these later contributions and extend many of their techniques to the case of recurrent networks with gating. Beyond mean field theory, there have been several attempts at understanding signal propagation in RNNs, e.g., using the Geršgorin circle theorem (Zilly et al., 2016) or time invariance (Tallec & Ollivier, 2018).

3 Theory and Critical Initialization
We begin by developing a mean field theory for vanilla RNNs and discuss the notion of dynamical isometry. Afterwards, we move on to a simple gated architecture to explain the role of gating in facilitating signal propagation in RNNs.
3.1 Vanilla RNN
Vanilla RNNs are described by the recurrence relation,
z^t = W h^{t-1} + V x^t + b,    h^t = φ(z^t)    (1)

Here x^t is the input, z^t is the pre-activation, and h^t is the hidden state after applying an arbitrary activation function φ. For the purposes of this discussion we set φ = tanh. Furthermore, W and V are weight matrices that multiply the hidden state and inputs respectively, and b is a bias.

Next, we apply mean-field theory to vanilla RNNs following a strategy similar to that introduced in Poole et al. (2016) and Schoenholz et al. (2017). At the level of mean-field theory, vanilla RNNs will prove to be intimately related to feedforward networks, and so this discussion proceeds analogously. For a more detailed discussion, see these earlier studies.
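To make eq. (1) concrete, the following is a minimal NumPy sketch of one recurrence step under the random initialization studied below. The function and parameter names are our own, and we assume φ = tanh:

```python
import numpy as np

def init_params(n_hidden, n_in, sigma_w=1.0, sigma_v=1.0, sigma_b=0.0, seed=0):
    """Random initialization studied by the mean-field theory:
    W_ij ~ N(0, sigma_w^2 / N), V_ij ~ N(0, sigma_v^2 / N_in), b_i ~ N(0, sigma_b^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma_w / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
    V = rng.normal(0.0, sigma_v / np.sqrt(n_in), size=(n_hidden, n_in))
    b = rng.normal(0.0, sigma_b, size=n_hidden) if sigma_b > 0 else np.zeros(n_hidden)
    return W, V, b

def vanilla_rnn_step(h_prev, x, W, V, b, phi=np.tanh):
    """One step of eq. (1): z^t = W h^{t-1} + V x^t + b, then h^t = phi(z^t)."""
    z = W @ h_prev + V @ x + b
    return phi(z), z
```

Sampling fresh W, V, b at every step corresponds to the untied-weight approximation used in the analysis below; reusing the same matrices recovers the actual (tied-weight) RNN.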
Consider two sequences of inputs x_a^t and x_b^t, described by a covariance matrix whose off-diagonal entries measure the overlap between the two sequences at each time step. To simplify notation, we assume the input sequences have been standardized so that their magnitude R is independent of time. This allows us to factor the covariance as R Σ^t, where Σ^t is a matrix whose diagonal terms are 1 and whose off-diagonal terms are the cosine similarity Σ_ab^t between the inputs at time t. These sequences are then passed into two identical copies of an RNN to produce two corresponding pre-activation sequences z_a^t and z_b^t. As in Poole et al. (2016), we let the weights and biases be Gaussian distributed so that W_ij ∼ N(0, σ_w²/N), V_ij ∼ N(0, σ_v²/N), and b_i ∼ N(0, σ_b²) (in practice we will set σ_b = 0 for the vanilla RNN), and we consider the wide-network limit, N → ∞. As in the fully-connected setting, we would like to invoke the Central Limit Theorem (CLT) to conclude that the pre-activations of hidden states are jointly Gaussian distributed. Unfortunately, the CLT is violated in the recurrent setting, as the pre-activations at different time steps are correlated through the shared weights of the RNN.

To make progress, we proceed by developing the theory of signal propagation for RNNs with untied weights. This allows for several simplifications, including the application of the CLT to conclude that the pre-activations z_a^t and z_b^t are jointly Gaussian distributed with a covariance matrix Q^t that is independent of neuron index. We explore the ramifications of this approximation by comparing simulations of RNNs with tied and untied weights. Overall, we will see that while ignoring weight tying leads to quantitative differences between theory and experiment, it does not change the qualitative picture that emerges. See figs. 1 and 2 for verification.

With this approximation in mind, we will now quantify how the pre-activation hidden states evolve by deriving the recurrence relation for the covariance matrix Q^t from the recurrence on z^t in eq. (1). Using identical arguments to Poole et al. (2016), one can show that,
Q_ab^t = σ_w² ⟨φ(u_a) φ(u_b)⟩_{Q^{t-1}} + σ_v² R Σ_ab^t + σ_b²    (2)

where (u_a, u_b) are drawn from a bivariate Gaussian with covariance Q^{t-1} and

⟨f(u_a, u_b)⟩_Q = ∫ du f(u) exp(-½ uᵀ Q⁻¹ u) / (2π √(det Q))    (3)

is a Gaussian measure with covariance matrix Q. By symmetry, our normalization allows us to define q^t = Q_aa^t = Q_bb^t to be the magnitude of the pre-activation hidden state and c^t = Q_ab^t / q^t to be the cosine similarity between the hidden states. We will be particularly concerned with understanding the dynamics of the cosine similarity, c^t.
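On the diagonal, the Gaussian average in eqs. (2)-(3) is one-dimensional and easy to estimate numerically. The following sketch iterates the variance recurrence to its fixed point q*, assuming φ = tanh and normalized inputs (R = 1); the Monte Carlo estimator and the hyperparameter values are illustrative, not taken from the paper:

```python
import numpy as np

def variance_map(q, sigma_w2, sigma_v2, sigma_b2, n_samples=400_000, seed=0):
    """One step of the diagonal of eq. (2): a Gaussian average of phi(u)^2
    with u ~ N(0, q), plus the input (R = 1) and bias contributions."""
    u = np.random.default_rng(seed).normal(0.0, np.sqrt(q), size=n_samples)
    return sigma_w2 * np.mean(np.tanh(u) ** 2) + sigma_v2 + sigma_b2

def fixed_point_variance(sigma_w2=1.0, sigma_v2=0.5, sigma_b2=0.0, n_iter=60):
    """Iterate the variance map until it settles at the fixed point q*."""
    q = 1.0
    for _ in range(n_iter):
        q = variance_map(q, sigma_w2, sigma_v2, sigma_b2)
    return q
```

Because |tanh| < 1 and its Gaussian average contracts, the iteration converges quickly; this is the sense in which the dynamics of q^t are "uninteresting" compared to those of c^t.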
In feedforward networks, the inputs dictate the initial value of the cosine similarity, and the evolution of c^t is then determined solely by the network architecture. By contrast, in recurrent networks, inputs perturb c^t at each time step. Analyzing the dynamics of c^t for arbitrary input correlations is therefore challenging; however, significant insight can be gained by studying the off-diagonal entries of eq. (2) for Σ_ab^t = Σ_ab independent of time. In the case of time-independent Σ_ab, as t → ∞ both q^t → q* and c^t → c*, where q* and c* are fixed points of the variance of the pre-activation hidden state and the cosine similarity between pre-activation hidden states, respectively. As was discussed previously (Poole et al., 2016; Schoenholz et al., 2017), the dynamics of q^t are generally uninteresting provided q* is finite. We therefore choose to normalize the hidden state such that q^0 = q*, which implies that q^t = q* independent of time.

In this setting it was shown in Schoenholz et al. (2017) that in the vicinity of a fixed point, the off-diagonal term in eq. (2) can be expanded to lowest order in ε^t = c^t - c* to give the linearized dynamics ε^{t+1} = χ_{c*} ε^t, where

χ_{c*} = σ_w² ⟨φ′(u_a) φ′(u_b)⟩_{Q*}    (4)
These dynamics have the solution ε^t = ε^{t₀} e^{-(t - t₀)/ξ} with ξ⁻¹ = -log χ_{c*}, where t₀ is the time when c^t is close enough to c* for the linear approximation to be valid. If χ_{c*} < 1 it follows that c^t approaches c* exponentially quickly over a timescale ξ, and c* is called a stable fixed point. When c^t gets too close to c* to be distinguished from it to within numerical precision, information about the initial inputs has been lost. Thus, ξ sets the maximum timescale over which we expect the RNN to be able to remember information. If χ_{c*} > 1 then c^t gets exponentially farther from c* over time, and c* is an unstable fixed point. In this case, for the activation function considered here, another fixed point that is stable will emerge. Note that χ_{c*} does not depend explicitly on Σ_ab, and so the linearized dynamics of c^t near c* do not depend directly on the inputs.
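The timescale ξ and the resulting memory horizon can be sketched directly from the solution above (function names are ours):

```python
import numpy as np

def xi(chi):
    """Decay timescale of |c^t - c*| ~ exp(-t / xi) near a stable fixed point.
    Defined by 1/xi = -log(chi); only meaningful for chi < 1 (stable c*)."""
    assert 0.0 < chi < 1.0, "c* is stable only when chi < 1"
    return -1.0 / np.log(chi)

def memory_horizon(chi, eps=1e-7):
    """Number of steps until an O(1) deviation from c* decays below numerical
    precision eps -- the maximum timescale over which the RNN can be
    expected to remember its inputs."""
    return xi(chi) * np.log(1.0 / eps)
```

As χ → 1 from below the horizon diverges, which is the mean-field statement that initializing near criticality enables arbitrarily long memory.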
In vanilla fully-connected networks, c* = 1 is always a fixed point of the cosine-similarity dynamics, but it is not always stable. Indeed, it was shown that these networks exhibit a phase transition where c* = 1 goes from being a stable fixed point to an unstable one as a function of the network's hyperparameters. This is known as the order-to-chaos transition, and it occurs exactly when χ_{c*} = 1. Since ξ → ∞ as χ_{c*} → 1, signal can propagate infinitely far at the boundary between order and chaos. Comparing the diagonal and off-diagonal entries of eq. (2), we see that in recurrent networks c* = 1 is a fixed point only when Σ_ab = 1, in which case the discussion is identical to the feedforward setting. When Σ_ab < 1, it is easy to see that c* = 1 cannot be a fixed point, since if c^t = 1 at some time then c^{t+1} < 1. We see that in recurrent networks noise from the inputs destroys the ordered phase, and there is no order-to-chaos critical point. As a result, we should expect the maximum timescale over which memory may be stored in vanilla RNNs to be fundamentally limited by noise from the inputs.

The end-to-end Jacobian of a vanilla RNN with untied weights is in fact formally identical to the input-output Jacobian of a feedforward network, and thus the results from Pennington et al. (2017) regarding conditions for dynamical isometry apply directly. In particular, dynamical isometry is achieved with orthogonal state-to-state transition matrices W, tanh nonlinearities, and small values of q*. Perhaps surprisingly, these conclusions continue to be valid if the assumption of untied weights is relaxed. To understand why this is the case, consider the example of a linear network. For untied weights, the end-to-end Jacobian is a product of independent random matrices, while for tied weights the Jacobian is a power of a single random matrix. It turns out that as N → ∞ there is sufficient self-averaging to overcome the dependencies induced by weight tying, and the asymptotic singular value distributions of the two Jacobians are actually identical (Haagerup & Larsen, 2000).

3.2 MinimalRNN
3.2.1 Mean-Field Theory
To study the role of gating, we introduce the minimalRNN, which is simpler than other gated RNN architectures but nonetheless features the same gating mechanism. A sequence of inputs x^t is first mapped to the hidden space through z^t = Φ(x^t), where Φ can be any highly flexible function such as a feedforward network; in our experiments, we take Φ to be a fully connected layer with tanh activation. From here on, we refer to z^t as the inputs to the minimalRNN. The minimalRNN is then described by the recurrence relation,
a^t = W h^{t-1} + V z^t + b,    u^t = σ(a^t),    h^t = u^t ⊙ h^{t-1} + (1 - u^t) ⊙ z^t    (5)

where a^t is the pre-activation to the gating function, u^t the update gate, and h^t the hidden state. The minimalRNN retains the most essential gate in LSTMs (Jozefowicz et al., 2015; Greff et al., 2017) and achieves competitive performance. The simplified update of this cell, on the other hand, enables us to pinpoint the role of gating in a more controlled setting.
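A minimal sketch of one minimalRNN step, following the structure of the recurrence described above (the names are ours, and the exact form of the gate pre-activation is an assumption consistent with the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minimal_rnn_step(h_prev, z, W, V, b):
    """One minimalRNN step: the update gate u^t interpolates elementwise
    between keeping the previous state h^{t-1} and writing the input z^t."""
    a = W @ h_prev + V @ z + b          # gate pre-activation a^t
    u = sigmoid(a)                      # update gate u^t
    h = u * h_prev + (1.0 - u) * z      # gated convex combination
    return h, u
```

Pushing the bias mean toward large positive values saturates u^t → 1, so the state is copied nearly unchanged from step to step; this is the mechanism behind the long-term signal propagation discussed below.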
As in the previous case, we consider two sequences of inputs to the network, z_a^t and z_b^t. We take W_ij ∼ N(0, σ_w²/N), V_ij ∼ N(0, σ_v²/N), and b_i ∼ N(μ_b, σ_b²). By analogy to the vanilla case, we can make the mean-field approximation that the gate pre-activations a_a^t and a_b^t are jointly Gaussian distributed with covariance matrix Q^t. Here,

Q_ab^t = σ_w² M_ab^{t-1} + σ_v² R Σ_ab^t + σ_b²    (6)

where we have defined M^t as the second-moment matrix of the hidden state, with M_ab^t = E[h_a^t h_b^t]. (h^t will be centered under the mean-field approximation if h^0 is initialized with mean zero.) As in the vanilla case, R Σ_ab^t is the covariance between the inputs z_a^t and z_b^t. We note that the input covariance is fixed by the data, but it remains for us to work out M^t. We find that (see SI section B),

M_ab^t = ⟨u_a u_b⟩ M_ab^{t-1} + ⟨(1 - u_a)(1 - u_b)⟩ R Σ_ab^t    (7)

Here we assume that the expectation factorizes so that u^t and h^{t-1} are approximately independent. We believe this approximation becomes exact in the N → ∞ limit.
We choose to normalize the data in a similar manner to the vanilla case, so that the input magnitude R is independent of time. An immediate consequence of this normalization is that the diagonal entries M_aa^t and Q_aa^t approach fixed points M* and Q*. We then write c_h^t = M_ab^t / M* and c^t = Q_ab^t / Q* for the cosine similarities between the hidden states and between the pre-activations, respectively. With this normalization, we can work out the mean-field recurrence relation characterizing the covariance matrix for the minimalRNN. This analysis can be done by deriving the recurrence relation for either c_h^t or c^t. We will choose to study the dynamics of c^t; however, the two are trivially related by eq. (6). In SI section C, we analyze the dynamics of the diagonal term in the recurrence relation and prove that there is always a fixed point at some M*. In SI section D, we compute the depth scale over which the diagonal approaches its fixed point. However, as in the case of the vanilla RNN, these dynamics are generally uninteresting.
We now turn our attention to the dynamics of the cosine similarity between the pre-activations, c^t. As in the case of vanilla RNNs, we note that the diagonal approaches its fixed point quickly relative to the dynamics of c^t. We therefore choose to normalize the hidden state of the RNN so that M_aa^0 = M*, in which case both M_aa^t = M* and Q_aa^t = Q* independent of time. From eqs. (6) and (7) it follows that the cosine similarity of the pre-activation evolves as,

c^{t+1} = ⟨u_a u_b⟩ c^t + [(1 - ⟨u_a u_b⟩)(σ_v² R Σ_ab + σ_b²) + σ_w² ⟨(1 - u_a)(1 - u_b)⟩ R Σ_ab] / Q*    (8)

where the gate moments ⟨u_a u_b⟩ and ⟨(1 - u_a)(1 - u_b)⟩ are Gaussian averages over pre-activations with mean μ_b and covariance Q^t. As in the case of the vanilla RNN, we can study the behavior of c^t in the vicinity of a fixed point, c*. By expanding eq. (8) to lowest order in ε^t = c^t - c*, we arrive at a linearized recurrence relation that has an exponential solution ε^t ∝ e^{-t/ξ} with ξ⁻¹ = -log χ_{c*}, where here,

χ_{c*} = ⟨u_a u_b⟩ + σ_w² ⟨σ′(a_a) σ′(a_b)⟩ (M_ab* + R Σ_ab)    (9)

The discussion above in the vanilla case carries over directly to the minimalRNN with the appropriate replacement of χ. Unlike in the case of the vanilla RNN, here we see that χ_{c*} itself depends on Σ_ab.
Again, c* = 1 is a fixed point of the dynamics only when Σ_ab = 1. In this case, the minimalRNN experiences an order-to-chaos phase transition when χ₁ = 1, at which point the maximum timescale over which signal can propagate goes to infinity. Similar to the vanilla RNN, when Σ_ab < 1 we expect that the phase transition will be destroyed and the maximum duration of signal propagation will be severely limited. However, in a significant departure from the vanilla case, when μ_b → ∞ we notice that u^t → 1, so that ⟨u_a u_b⟩ → 1 and ⟨σ′(a_a) σ′(a_b)⟩ → 0 for all Σ_ab. Considering eq. (9), we notice that in this regime χ_{c*} → 1 independent of Σ_ab. In other words, gating allows for arbitrarily long-term signal propagation in recurrent neural networks independent of the inputs.
We explore the agreement between our theory and MC simulations of the minimalRNN in fig. 1. In this set of experiments, we consider inputs whose cosine similarity takes one fixed value up to an initial time and a different value thereafter. Fig. 1 (a,c,d) shows excellent quantitative agreement between our theory and MC simulations. In fig. 1 (a,b) we compare the MC simulations of the minimalRNN with and without weight tying. While we observe that for many choices of hyperparameters the untied-weight approximation is quite good, deeper into the chaotic phase the quantitative agreement breaks down. Nonetheless, we observe that the untied approximation describes the qualitative behavior of the real minimalRNN overall. In fig. 1 (e) we plot the timescale for signal propagation for the minimalRNN over a range of identical hyperparameter choices. We see that while the timescale diverges as μ_b gets large, a true critical point is only observed when the input perturbations vanish.
3.2.2 Dynamical Isometry
In the previous subsection, we derived a quantity χ that defines the boundary between the ordered and the chaotic phases of forward propagation. Here we show that it also defines the boundary between exploding and vanishing gradients. To see this, consider the Jacobian of the state-to-state transition operator,

J^t = ∂h^t / ∂h^{t-1} = D_{u^t} + D_{σ′(a^t) ⊙ (h^{t-1} - z^t)} W    (10)

where D_x denotes a diagonal matrix with x along its diagonal. We can compute the expected norm-squared of back-propagated error signals, which measures the growth or shrinkage of gradients. It is equal to the mean-squared singular value of the Jacobian (Poole et al., 2016; Schoenholz et al., 2017), or the first moment of J Jᵀ,

χ₁ = (1/N) E[tr(J Jᵀ)] = ⟨u²⟩ + σ_w² ⟨σ′(a)²⟩ (M* + R)    (11)

where we have used the fact that the elements of W are i.i.d. and independent of the gates and states. Since we assume convergence to the fixed point, these distributions are independent of time, and it is easy to see that χ₁ coincides with the forward quantity χ_{c*} evaluated at c* = Σ_ab = 1. The variance of back-propagated error signals through T time steps is therefore χ₁^T. As such, the constraint χ₁ = 1 defines the boundary between phases of exponentially exploding and exponentially vanishing gradient norm (variance). Note that, unlike in the case of forward signal propagation, in the case of backpropagation this quantity is independent of Σ_ab.
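The statement that gradient variance scales like χ₁^T is easy to check numerically in the linear, untied-weight case, where χ₁ reduces to σ_w². A Monte Carlo sketch (function names and parameter values are illustrative):

```python
import numpy as np

def backprop_norm2(sigma_w2, n=200, T=20, seed=0):
    """Squared norm of a unit error signal back-propagated through T steps
    of a linear untied-weight network.  Each step multiplies by an
    independent W with entries N(0, sigma_w2 / n), so the expected squared
    norm scales like chi_1**T with chi_1 = sigma_w2."""
    rng = np.random.default_rng(seed)
    v = np.ones(n) / np.sqrt(n)          # unit-norm error signal
    for _ in range(T):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / n), size=(n, n))
        v = W.T @ v                      # one step of backpropagation
    return float(v @ v)
```

For χ₁ < 1 the norm collapses exponentially, for χ₁ > 1 it explodes, and only at χ₁ = 1 does it remain of order one, which is the boundary described above.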
As argued in Pennington et al. (2017; 2018), controlling the variance of back-propagated gradients is necessary but not sufficient to guarantee trainability, especially for very deep networks. Beyond the first moment, the entire distribution of eigenvalues of J Jᵀ (or of singular values of J) is relevant. Indeed, it was found in Pennington et al. (2017; 2018) that enabling dynamical isometry, namely the condition that all singular values of the end-to-end Jacobian are close to unity, can drastically improve training speed for very deep feedforward networks.

Following Pennington et al. (2017; 2018), we use tools from free probability theory to compute the variance σ²_{JJᵀ} of the limiting spectral density of J Jᵀ; however, unlike previous work, in our case the relevant matrices are not symmetric, and therefore we must invoke tools from non-Hermitian free probability; see Cakmak (2012) for a review. As in the previous section, we make the simplifying assumption that the weights are untied, relying on the same motivations given in section 3.1. Using these tools, an unilluminating calculation reveals that,

(12)
where,

(13)

and s₁ is the first term in the Taylor expansion of the S-transform of the eigenvalue distribution of Wᵀ W (Pennington et al., 2018). For example, for Gaussian matrices s₁ = -1, and for orthogonal matrices s₁ = 0.
Some remarks are in order about eq. (12). First, we note the duality between forward and backward signal propagation (eq. (9) and eq. (13)). For critical initializations, χ₁ = 1, so σ²_{JJᵀ} does not grow exponentially, but it still grows linearly with T. This situation is entirely analogous to the feedforward analysis of Pennington et al. (2017; 2018). In the case of the vanilla RNN, the coefficient of the linear term is proportional to q*, and can only be reduced by taking the weight and bias variances to zero. A crucial difference in the minimalRNN is that the coefficient of the linear term can be made arbitrarily small by simply adjusting the bias mean μ_b to be positive, which drives the gates toward 1 and the coefficient toward zero independent of σ_w². Therefore the conditions for dynamical isometry decouple from the weight and bias variances, implying that trainability can occur on a higher-dimensional, more robust slice of parameter space. Moreover, the value of s₁ has no effect on the capacity of the minimalRNN to achieve dynamical isometry. We believe these are fundamental reasons why gated cells such as the minimalRNN perform well in practice.
Algorithm 1 describes the procedure to find σ_w and μ_b that achieve the critical condition χ₁ = 1 for the minimalRNN. Given these values, we then construct the weight matrices and biases accordingly. The fixed point of the hidden-state second moment is used to initialize h^0 to avoid a transient phase.
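Algorithm 1 itself gives a closed-form procedure; since its exact equations are not reproduced in this excerpt, the following is only a generic numerical stand-in. Given any map μ_b ↦ χ₁ that is continuous and monotonically decreasing (larger gate bias pushes the network toward the ordered phase), the critical bias solving χ₁ = 1 can be found by bisection. The toy χ₁ below is hypothetical, purely for illustration:

```python
import numpy as np

def solve_critical_bias(chi_fn, lo=-10.0, hi=10.0, tol=1e-10, n_iter=100):
    """Find mu_b with chi_fn(mu_b) = 1 by bisection, assuming chi_fn is
    continuous and monotonically decreasing in mu_b.  A generic stand-in
    for the closed-form procedure of Algorithm 1."""
    assert chi_fn(lo) > 1.0 > chi_fn(hi), "chi = 1 must be bracketed"
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if chi_fn(mid) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical chi that decays with the bias mean, for illustration only.
toy_chi = lambda mu_b: 2.0 * np.exp(-mu_b)
mu_b_star = solve_critical_bias(toy_chi)   # chi = 1  =>  mu_b = log 2
```

In the actual procedure, χ₁ would be evaluated from the Gaussian gate moments at the fixed point rather than from a toy formula.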
4 Experiments
Having established a theory for the behavior of random vanilla RNNs and minimalRNNs, we now discuss the connection between our theory and trainability in practice. We begin by corroborating the claim that the maximum timescale over which memory can be stored in an RNN is controlled by the timescale ξ identified in the previous section. We will then investigate the role of dynamical isometry in speeding up learning.
4.1 Trainability
Dataset. To verify the results of our theoretical calculation, we consider a task that is reflective of the theory above. To that end, we constructed a sequence dataset for training RNNs from MNIST (LeCun et al., 1998). Each 28 × 28 digit image is flattened into a vector of 784 pixels and sent as the first input to an RNN. We then send T random inputs into the RNN, varying T between 10 and 1000 steps. As the only salient information about the digit is in the first input, the network will need to propagate information through T layers to accurately identify the MNIST digit. The random inputs are drawn independently for each example, and so this is a regime where the input cosine similarity vanishes at all subsequent time steps.

We then performed a series of experiments on this task to make connection with our theory. In each case we experimented with both tied and untied weights. The results are shown in fig. 2. In the case of untied weights, we observe strong quantitative agreement between our theoretical prediction for ξ and the maximum depth at which the network is still trainable. When the weights of the network are tied, we observe quantitative deviations between our theory and experiments, but the overall qualitative picture remains.
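The construction above can be sketched as a small data pipeline (the noise scale is an assumption; this excerpt does not specify the distribution of the random inputs):

```python
import numpy as np

def make_memory_task(images, labels, T, noise_std=1.0, seed=0):
    """Section 4.1 task: each example is a length-(T+1) sequence whose first
    element is a flattened 28x28 digit (784 pixels) and whose remaining T
    elements are i.i.d. random vectors, so the label can only be recovered
    by propagating the first input through T recurrent steps."""
    rng = np.random.default_rng(seed)
    n = images.shape[0]
    first = images.reshape(n, 1, 784).astype(np.float64)
    noise = rng.normal(0.0, noise_std, size=(n, T, 784))
    return np.concatenate([first, noise], axis=1), labels
```

Because the noise is drawn independently per example, any two sequences are uncorrelated after the first step, matching the regime of vanishing input cosine similarity analyzed by the theory.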
We train vanilla RNNs for around 10 epochs, varying σ_w while fixing the remaining hyperparameters. The results of this experiment are shown in fig. 2 (a-b). We train minimalRNNs for around 1 epoch with the weight variances fixed. We perform three different experiments, in each case varying one hyperparameter while holding the others fixed; the results are shown in fig. 2 (c-d), fig. 2 (e-f), and fig. 2 (g-h) respectively. Comparing fig. 2 (a,b) with fig. 2 (c,d,g,h), the minimalRNN with large depth is trainable over a much wider range of hyperparameters than the vanilla RNN, despite the fact that the network was trained for an order of magnitude less time.

4.2 Critical initialization
Dataset. To study the impact of critical initialization on training speed, we constructed a more realistic sequence dataset from MNIST. We unroll each image's pixels into a sequence of inputs, each containing a fixed number of pixels, and tested two sequence lengths to vary the difficulty of the task.
Note that we are more interested in the training speed of these networks under different initialization conditions than in the test accuracy. We compare the convergence speed of the vanilla RNN and the minimalRNN under four initialization conditions: 1) critical initialization with orthogonal weights (solid blue); 2) critical initialization with Gaussian-distributed weights (solid red); 3) off-critical initialization with orthogonal weights (dotted green); 4) off-critical initialization with Gaussian-distributed weights (dotted black).
We fix σ_b to zero in all settings. Under critical initialization, the remaining hyperparameters are carefully chosen to achieve χ₁ = 1, as defined in eq. (4) for the vanilla RNN and eq. (13) (detailed in Algorithm 1) for the minimalRNN, respectively. When testing networks off criticality, we employ a commonly used initialization procedure for the weights and biases.
Figure 3 summarizes our findings: there is a clear difference in training speed between models trained with critical initialization and models initialized far from criticality. We observe two orders of magnitude difference in training speed between a critical and an off-critical initialization for vanilla RNNs. While a critically initialized model reaches a high test accuracy after 750 optimization steps, the off-critical network takes over 16,000 updates to do so. A similar trend was observed for the minimalRNN. This difference is even more pronounced in the case of the longer sequence: both vanilla RNNs and minimalRNNs initialized off criticality failed at the task. The well-conditioned minimalRNN trains a factor of three faster than the vanilla RNN. As predicted above, the difference in training speed between orthogonal and Gaussian initialization schemes is significant for vanilla RNNs but insignificant for the minimalRNN. This is corroborated in fig. 3 (b,d), where the distribution of the weights has no impact on the training speed.
5 Language modeling
We compare the minimalRNN against more complex gated RNNs such as LSTM and GRU on the Penn TreeBank corpus (Marcus et al., 1993). Language modeling is a difficult task, and competitive performance is often achieved by more complicated RNN cells. We show that the minimalRNN achieves competitive performance despite its simplicity.
We follow the precise setup of Mikolov et al. (2010) and Zaremba et al. (2014), and train RNNs of two sizes: a small configuration with 5M parameters and a medium-sized configuration with 20M parameters (the hidden layer sizes of these networks are adjusted accordingly to reach the target model size). We report the perplexity on the validation and test sets. We focus our comparison on single-layer RNNs; however, we also report perplexities for multi-layer RNNs from the literature for reference. We follow the learning schedule of Zaremba et al. (2014) and Jozefowicz et al. (2015). We review additional hyperparameter ranges in section F of the supplementary material.
Table 1 summarizes our results. We find that single-layer RNNs perform on par with their multi-layer counterparts. Despite being a significantly simpler model, the minimalRNN performs comparably to GRUs. Given the closed-form critical initialization developed here, which significantly boosts convergence speed, the minimalRNN might be a favorable alternative to GRUs. There is a gap in perplexity between the performance of LSTMs and minimalRNNs. We hypothesize that this is due to the removal of an independent gate on the input: the same strategy is employed in GRUs and may cause a conflict between keeping longer-range memory and updating with new information, as was originally pointed out by Hochreiter & Schmidhuber (1997).
Table 1. Perplexities on the Penn TreeBank validation and test sets.

Model                                   5M (test)   20M (valid)   20M (test)
VanillaRNN (Jozefowicz et al., 2015)    122.8       103.0         97.7
GRU (Jozefowicz et al., 2015)           108.2       95.5          91.7
LSTM (Jozefowicz et al., 2015)          109.7       83.3          78.8
LSTM                                    95.4        87.5          83.8
GRU                                     99.5        93.9          89.8
minimalRNN                              101.4       94.4          89.9
6 Discussion
We have developed a theory of signal propagation for random vanilla RNNs and a simple gated RNN. We demonstrated rigorously that the theory predicts the trainability of these networks and that gating mechanisms allow for a significantly larger trainable region. We plan to extend the theory to more complicated RNN cells as well as to RNNs with multiple layers.
Acknowledgements
We thank Jascha Sohl-Dickstein and Greg Yang for helpful discussions and Ashish Bora for many contributions to early stages of this project.
References
 Ackley et al. (1985) Ackley, David H., Hinton, Geoffrey E., and Sejnowski, Terrence J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985. ISSN 1551-6709.
 Arjovsky et al. (2016) Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128, 2016.
 Ba et al. (2016) Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bertschinger et al. (2004) Bertschinger, Nils, Natschläger, Thomas, and Legenstein, Robert A. At the edge of chaos: Real-time computations and self-organized criticality in recurrent neural networks. In Advances in Neural Information Processing Systems, 2004.
 Cakmak (2012) Cakmak, Burak. Non-Hermitian random matrix theory for MIMO channels. Master’s thesis, Institutt for elektronikk og telekommunikasjon, 2012.
 Chung et al. (2014) Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Collins et al. (2016) Collins, Jasmine, Sohl-Dickstein, Jascha, and Sussillo, David. Capacity and trainability in recurrent neural networks. ICLR, 2016.
 Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. arXiv:1602.05897, 2016.
 Derrida & Pomeau (1986) Derrida, B. and Pomeau, Y. Random networks of automata: A simple annealed approximation. EPL (Europhysics Letters), 1(2):45, 1986.
 Elman (1990) Elman, Jeffrey L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
 Glorot & Bengio (2010) Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256, 2010.
 Graves et al. (2013) Graves, Alex, Mohamed, Abdelrahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649. IEEE, 2013.
 Greff et al. (2017) Greff, Klaus, Srivastava, Rupesh K, Koutník, Jan, Steunebrink, Bas R, and Schmidhuber, Jürgen. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
 Haagerup & Larsen (2000) Haagerup, Uffe and Larsen, Flemming. Brown's spectral distribution measure for R-diagonal elements in finite von Neumann algebras. Journal of Functional Analysis, 176(2):331–367, 2000.
 Hanin & Rolnick (2018) Hanin, Boris and Rolnick, David. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018.
 Hayou et al. (2018) Hayou, Soufiane, Doucet, Arnaud, and Rousseau, Judith. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. ArXiv eprints, December 2015.
 Hidasi et al. (2015) Hidasi, Balázs, Karatzoglou, Alexandros, Baltrunas, Linas, and Tikk, Domonkos. Sessionbased recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Hyland & Rätsch (2017) Hyland, Stephanie L and Rätsch, Gunnar. Learning unitary operators with help from u(n). In AAAI, pp. 2050–2058, 2017.
 Ioffe & Szegedy (2015b) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456, 2015b.
 Jozefowicz et al. (2015) Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An empirical exploration of recurrent network architectures. In ICML, pp. 2342–2350, 2015.
 Jozefowicz et al. (2016) Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Karakida et al. (2018) Karakida, R., Akaho, S., and Amari, S.i. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. ArXiv eprints, June 2018.
 Kiros et al. (2015) Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R, Zemel, Richard, Urtasun, Raquel, Torralba, Antonio, and Fidler, Sanja. Skipthought vectors. In Advances in neural information processing systems, pp. 3294–3302, 2015.
 Le et al. (2015) Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2017) Lee, Jaehoon, Bahri, Yasaman, Novak, Roman, Schoenholz, Samuel S, Pennington, Jeffrey, and SohlDickstein, Jascha. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
 Marcus et al. (1993) Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
 Mikolov et al. (2010) Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernockỳ, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
 Mishkin & Matas (2015) Mishkin, Dmytro and Matas, Jiri. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
 Neal (2012) Neal, Radford M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Pascanu et al. (2013) Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.

 Pennington et al. (2017) Pennington, Jeffrey, Schoenholz, Sam, and Ganguli, Surya. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. NIPS, 2017.
 Pennington et al. (2018) Pennington, Jeffrey, Schoenholz, Samuel S., and Ganguli, Surya. The emergence of spectral universality in deep networks. In AISTATS, pp. 1924–1932, 2018.
 Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., SohlDickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. NIPS, 2016.
 Rumelhart et al. (1986) Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

 Saul et al. (1996) Saul, Lawrence K, Jaakkola, Tommi, and Jordan, Michael I. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.
 Saxe et al. (2013) Saxe, Andrew M, McClelland, James L, and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
 Schoenholz et al. (2017) Schoenholz, S. S., Gilmer, J., Ganguli, S., and SohlDickstein, J. Deep Information Propagation. ICLR, 2017.
 Schoenholz et al. (2017) Schoenholz, Samuel S, Pennington, Jeffrey, and SohlDickstein, Jascha. A correspondence between random neural networks and statistical field theory. arXiv preprint arXiv:1710.06570, 2017.
 Sompolinsky et al. (1988) Sompolinsky, H., Crisanti, A., and Sommers, H. J. Chaos in random neural networks. Phys. Rev. Lett., 61:259–262, Jul 1988. doi: 10.1103/PhysRevLett.61.259.
 Sussillo & Abbott (2014) Sussillo, David and Abbott, LF. Random walks: Training very deep nonlinear feedforward networks with smart initialization. CoRR, vol. abs/1412.6558, 2014.
 Tallec & Ollivier (2018) Tallec, Corentin and Ollivier, Yann. Can recurrent neural networks warp time? arXiv preprint arXiv:1804.11188, 2018.
 Vorontsov et al. (2017) Vorontsov, Eugene, Trabelsi, Chiheb, Kadoury, Samuel, and Pal, Chris. On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071, 2017.
 Wisdom et al. (2016) Wisdom, Scott, Powers, Thomas, Hershey, John, Le Roux, Jonathan, and Atlas, Les. Fullcapacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880–4888, 2016.
 Wu et al. (2017) Wu, ChaoYuan, Ahmed, Amr, Beutel, Alex, Smola, Alexander J, and Jing, How. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 495–503. ACM, 2017.
 Xie et al. (2017) Xie, Di, Xiong, Jiang, and Pu, Shiliang. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv preprint arXiv:1703.01827, 2017.
 Yang & Schoenholz (2018) Yang, Greg and Schoenholz, Sam S. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion, 2018.
 Yang & Schoenholz (2017) Yang, Greg and Schoenholz, Samuel S. Mean field residual networks: On the edge of chaos. arXiv preprint arXiv:1712.08969, 2017.
 Zaremba et al. (2014) Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zilly et al. (2016) Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutník, Jan, and Schmidhuber, Jürgen. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
Appendix A MinimalRNN Architecture
Appendix B Diagonal Recurrence Relation
Here we analyze the mean field dynamics of the minimalRNN. The minimalRNN features a hidden state $h_t \in \mathbb{R}^N$ and inputs $x_t \in \mathbb{R}^M$. The inputs are transformed via a fully-connected network before being fed into the cell. The RNN cell is then described by the equations,

(14) $z_t = \Phi(x_t)$

(15) $\tilde u_t = W h_{t-1} + V z_t + b, \qquad u_t = \sigma(\tilde u_t)$

(16) $h_t = u_t \odot h_{t-1} + (1 - u_t) \odot z_t$

Here $\tilde u_t$ denotes the (pre)activation and $z_t$ denotes an input to the network. Thus, $u_t$ acts as a gate on the $t$'th step. We take $W_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N)$, $V_{ij} \sim \mathcal{N}(0, \sigma_v^2 / N)$, and $b_i \sim \mathcal{N}(\mu_b, \sigma_b^2)$.
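To make the gated recurrence concrete, the following is a minimal numpy sketch of one step of the cell. The update rule $h_t = u_t \odot h_{t-1} + (1 - u_t) \odot z_t$ and the Gaussian initialization follow our reading of the (partially elided) cell equations; the variable names `sw`, `sv`, `sb`, and `mu_b` (for $\sigma_w$, $\sigma_v$, $\sigma_b$, $\mu_b$) are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minimal_rnn_step(h_prev, z_t, W, V, b):
    """One step of the minimalRNN cell: the gate u_t interpolates
    elementwise between the previous hidden state and the
    transformed input z_t = Phi(x_t)."""
    u_t = sigmoid(W @ h_prev + V @ z_t + b)   # gate in (0, 1)
    return u_t * h_prev + (1.0 - u_t) * z_t   # convex combination

# Random "mean field" initialization: W_ij ~ N(0, sw^2/N), etc.
N, sw, sv, sb, mu_b = 256, 1.0, 1.0, 0.5, 0.0
rng = np.random.default_rng(0)
W = rng.normal(0.0, sw / np.sqrt(N), (N, N))
V = rng.normal(0.0, sv / np.sqrt(N), (N, N))
b = rng.normal(mu_b, sb, N)

h0 = np.zeros(N)                 # initial hidden state
z = rng.normal(0.0, 1.0, N)      # normalized input, q_z ~= 1
h1 = minimal_rnn_step(h0, z, W, V, b)
```

Because each coordinate of $h_t$ is a convex combination of the corresponding coordinates of $h_{t-1}$ and $z_t$, the update can never leave the elementwise interval spanned by the two, which is the mechanism behind the bounded variance dynamics analyzed below.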
By the CLT we can make a mean field assumption that $\tilde u^i_t \sim \mathcal{N}(\mu_b, \tilde q^t)$ where,

(17) $\tilde q^t = \sigma_w^2 q^{t-1} + \sigma_v^2 q_z^t + \sigma_b^2$

where we have defined $q^t = \mathbb{E}[(h^i_t)^2]$ and $q_z^t = \mathbb{E}[(z^i_t)^2]$. We note that $q_z^t$ is fixed by the input, but it remains for us to work out $q^t$. We find that,

(18) $q^t = \mathbb{E}\!\left[\left(u^i_t h^i_{t-1} + (1 - u^i_t) z^i_t\right)^2\right]$

(19) $\phantom{q^t} = \mathbb{E}[(u^i_t)^2]\,\mathbb{E}[(h^i_{t-1})^2] + 2\,\mathbb{E}[u^i_t (1 - u^i_t)]\,\mathbb{E}[h^i_{t-1} z^i_t] + \mathbb{E}[(1 - u^i_t)^2]\,\mathbb{E}[(z^i_t)^2]$

(20) $\phantom{q^t} = \mathbb{E}[(u^i_t)^2]\, q^{t-1} + \mathbb{E}[(1 - u^i_t)^2]\, q_z^t$

where we have assumed that the expectation factorizes so that $u_t$, $h_{t-1}$, and $z_t$ are approximately independent (in particular, $\mathbb{E}[h^i_{t-1} z^i_t] \approx 0$).
We choose to normalize the data so that $q_z^t = q_z$ independent of time. An immediate consequence of this normalization is that the input contribution to eq. (20) is constant and that $\tilde q^t$ depends on time only through $q^{t-1}$. We then write $q_{z,12}^t = q_z c_z$, $q_{12}^t = q^t c^t$, and $\tilde q_{12}^t = \tilde q^t \tilde c^t$, where $c_z$, $c^t$, and $\tilde c^t$ are cosine similarities between the inputs, the hidden states, and the preactivations respectively. With this normalization, we can work out the mean field recurrence relation characterizing the covariance matrix for the minimalRNN.
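The mean field (CLT) claim above can be checked numerically: for a wide random cell the gate pre-activation $W h + V z + b$ should have second moment close to $\sigma_w^2 q + \sigma_v^2 q_z + \sigma_b^2$. The sketch below assumes this variance form; the parameter values and names are our own.

```python
import numpy as np

# Monte Carlo check of the mean field claim: for a wide random cell
# the gate pre-activation is approximately Gaussian with variance
# qtil = sw^2 * q + sv^2 * q_z + sb^2 (our reading of eq. 17).
N, sw, sv, sb = 4096, 1.2, 0.8, 0.3
rng = np.random.default_rng(1)
h = rng.normal(0.0, 1.0, N)                   # hidden state, q ~= 1
z = rng.normal(0.0, 1.0, N)                   # transformed input, q_z ~= 1
W = rng.normal(0.0, sw / np.sqrt(N), (N, N))
V = rng.normal(0.0, sv / np.sqrt(N), (N, N))
b = rng.normal(0.0, sb, N)

pre = W @ h + V @ z + b                       # gate pre-activation
qtil_emp = np.mean(pre**2)                    # empirical second moment
qtil_mft = sw**2 * np.mean(h**2) + sv**2 * np.mean(z**2) + sb**2
```

The empirical second moment agrees with the mean field prediction up to the expected $O(1/\sqrt{N})$ fluctuations.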
We begin by considering the diagonal recurrence relation. We find that the dynamics are described by the equation,

(21) $q^t = \mathbb{E}[(u^i_t)^2]\, q^{t-1} + \mathbb{E}[(1 - u^i_t)^2]\, q_z$

(22) $\phantom{q^t} = q^{t-1} \int \mathcal{D}z\, \sigma^2\!\left(\sqrt{\tilde q^t}\, z + \mu_b\right) + q_z \int \mathcal{D}z\, \sigma^2\!\left(\sqrt{\tilde q^t}\, z - \mu_b\right)$

As expected, the first and second integrands determine how much of the update of the random network is controlled by the norm of the hidden state and how much is determined by the norm of the input. Since $1 - \sigma(x) = \sigma(-x)$, it follows that when $\mu_b = 0$ the first and second integrals will be equal and so,

(23) $q^t = \left(q^{t-1} + q_z\right) \int \mathcal{D}z\, \sigma^2\!\left(\sqrt{\tilde q^t}\, z\right)$

In general, $\mu_b$ will therefore control the degree to which the hidden state of the random minimalRNN is updated based on the previous hidden state or based on the inputs, with $\mu_b = 0$ implying parity between the two. This is reflected in eq. (23).
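The diagonal recurrence can be iterated numerically by evaluating the two Gaussian integrals with Gauss-Hermite quadrature. The functional form below, $q^t = \mathbb{E}[u^2]\, q^{t-1} + \mathbb{E}[(1-u)^2]\, q_z$, is our reconstruction of the elided equations, with hypothetical parameter names (`sw`, `sv`, `sb`, `mu_b`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss_int(f, deg=101):
    """E[f(z)] for z ~ N(0, 1), via Gauss-Hermite quadrature with the
    probabilists' weight exp(-z^2 / 2)."""
    x, w = np.polynomial.hermite_e.hermegauss(deg)
    return float((w * f(x)).sum() / np.sqrt(2.0 * np.pi))

def q_next(q, qz, sw, sv, sb, mu_b):
    """One step of the diagonal variance recurrence: the hidden-state
    term is weighted by E[u^2], the input term by E[(1 - u)^2], with
    the gate pre-activation distributed as N(mu_b, qtil)."""
    qtil = sw**2 * q + sv**2 * qz + sb**2
    e_u2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z + mu_b)**2)
    e_1mu2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z - mu_b)**2)
    return e_u2 * q + e_1mu2 * qz

q1 = q_next(1.0, 1.0, sw=1.0, sv=1.0, sb=0.5, mu_b=0.0)
```

Since both expectations are bounded by one, a single update can never exceed $q^{t-1} + q_z$; with $\mu_b = 0$ the two integrals coincide (the parity case), while a large positive $\mu_b$ drives the gate toward one so that the hidden-state norm is simply carried forward.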
Appendix C Existence of a Fixed Point
In the event that the norm of the inputs is time-independent, $q_z^t = q_z$ for all $t$, then the minimalRNN will have a fixed point provided there exists a $q^*$ that satisfies a transcendental equation, namely that

(24) $\left(1 - \int \mathcal{D}z\, \sigma^2\!\left(\sqrt{\tilde q^*}\, z + \mu_b\right)\right) - \frac{q_z}{q^*} \int \mathcal{D}z\, \sigma^2\!\left(\sqrt{\tilde q^*}\, z - \mu_b\right) = 0$

with $\tilde q^* = \sigma_w^2 q^* + \sigma_v^2 q_z + \sigma_b^2$. It is easy to see that such a solution always exists. When $q^* \to 0$ the first term of eq. (24) approaches a positive constant while the magnitude of the second (negative) term increases without bound, and so the left-hand side is negative. Conversely, when $q^* \to \infty$ the first term is positive while the second term vanishes, and so the left-hand side is positive. The existence of a $q^*$ satisfying the transcendental equation then follows directly from the intermediate value theorem.
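The intermediate value argument translates directly into a bisection search for $q^*$. This sketch reuses the reconstructed variance map (redefined here so the snippet is self-contained); all parameter values and names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss_int(f, deg=101):
    # E[f(z)] for z ~ N(0, 1) via Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite_e.hermegauss(deg)
    return float((w * f(x)).sum() / np.sqrt(2.0 * np.pi))

def q_next(q, qz, sw, sv, sb, mu_b):
    # Reconstructed diagonal map: q' = E[u^2] q + E[(1-u)^2] q_z.
    qtil = sw**2 * q + sv**2 * qz + sb**2
    e_u2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z + mu_b)**2)
    e_1mu2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z - mu_b)**2)
    return e_u2 * q + e_1mu2 * qz

def find_qstar(qz, sw, sv, sb, mu_b, hi=1e4, iters=100):
    """Bisection on F(q) = q_next(q) - q.  F(0) >= 0 because the input
    term is non-negative, while F(q) < 0 for large q since E[u^2] < 1,
    so the intermediate value theorem guarantees a root."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q_next(mid, qz, sw, sv, sb, mu_b) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

qstar = find_qstar(qz=1.0, sw=1.0, sv=1.0, sb=0.5, mu_b=0.0)
```

Bisection is a natural fit here precisely because the existence proof is an intermediate-value argument: the sign change it relies on is the bracket the search needs.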
Appendix D Dynamics
We can now investigate the dynamics of the norm of the hidden state in the vicinity of $q^*$. To do this, suppose that $q^t = q^* + \epsilon^t$ with $|\epsilon^t| \ll q^*$. Our goal is then to expand eq. (21) about $q^*$. First, we note that,
(25)  
(26)  
(27) 
Expanding eq. (21) to linear order in $\epsilon^{t-1}$, this implies that,
(28)  
(29)  
(30)  
(31)  
(32)  
(33)  
(34) 
It follows that $\epsilon^t \to 0$ as,

(35) $\epsilon^t \sim e^{-t/\tau}$

with

(36) $\tau^{-1} = -\log \chi_{q^*}$

where $\chi_{q^*} = \partial q^t / \partial q^{t-1} \big|_{q^*}$ is the slope of the variance map at the fixed point, as expected.
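The exponential approach to the fixed point can be verified by iterating the reconstructed variance map: the perturbation should shrink by a factor of $\chi_{q^*}$ per step, giving the timescale $\tau = -1/\log \chi_{q^*}$. Everything below (map, parameters, names) is our own sketch, redefined so the snippet is self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss_int(f, deg=101):
    # E[f(z)] for z ~ N(0, 1) via Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite_e.hermegauss(deg)
    return float((w * f(x)).sum() / np.sqrt(2.0 * np.pi))

def q_next(q, qz, sw, sv, sb, mu_b):
    # Reconstructed diagonal map: q' = E[u^2] q + E[(1-u)^2] q_z.
    qtil = sw**2 * q + sv**2 * qz + sb**2
    e_u2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z + mu_b)**2)
    e_1mu2 = gauss_int(lambda z: sigmoid(np.sqrt(qtil) * z - mu_b)**2)
    return e_u2 * q + e_1mu2 * qz

def find_qstar(qz, sw, sv, sb, mu_b, hi=1e4, iters=100):
    # Bisection on q_next(q) - q (see the fixed-point sketch above).
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q_next(mid, qz, sw, sv, sb, mu_b) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

params = dict(qz=1.0, sw=1.0, sv=1.0, sb=0.5, mu_b=0.0)
qstar = find_qstar(**params)

# Slope of the variance map at the fixed point (central difference).
d = 1e-5
chi = (q_next(qstar + d, **params) - q_next(qstar - d, **params)) / (2 * d)
tau = -1.0 / np.log(chi)    # decay timescale, valid for 0 < chi < 1

# Iterate the map from a perturbed start: the deviation eps^t should
# shrink by roughly chi per step, i.e. eps^t ~ exp(-t / tau).
q, eps = qstar + 0.1, [0.1]
for _ in range(15):
    q = q_next(q, **params)
    eps.append(q - qstar)
ratio = eps[-1] / eps[-2]
```

The measured per-step contraction `ratio` matches the analytically computed slope `chi`, confirming the exponential relaxation of eqs. (35)-(36) for this choice of hyperparameters.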