Deep neural network architectures have become ubiquitous in machine learning. The success of deep networks is due to the fact that they are highly expressive (Montufar et al., 2014) while simultaneously being relatively easy to optimize (Choromanska et al., 2015; Goodfellow et al., 2014) with strong generalization properties (Recht et al., 2015). Consequently, developments in machine learning often accompany improvements in our ability to train increasingly deep networks. Despite this, designing novel network architectures is frequently equal parts art and science. This is, in part, because a general theory for neural networks that might inform design decisions has lagged behind the feverish pace of design.
Recent work by Poole et al. (2016) and Raghu et al. (2016) demonstrated that random neural networks are exponentially expressive in their depth. Central to their approach was the consideration of networks after random initialization, whose weights and biases were i.i.d. Gaussian distributed. In particular, the paper by Poole et al. (2016) developed a “mean field” formalism for treating wide, untrained neural networks. They showed that these mean field networks exhibit an order-to-chaos transition as a function of the weight and bias variances. Notably, the mean field formalism is not closely tied to a specific choice of activation function or loss.
In this paper, we demonstrate the existence of several characteristic “depth” scales that emerge naturally and control signal propagation in these random networks. We then show that one of these depth scales, $\xi_c$, diverges at the boundary between order and chaos. This result is insensitive to many architectural decisions (such as choice of activation function) and will generically be true at any order-to-chaos transition. We then extend these results to include dropout and we show that even small amounts of dropout destroy the order-to-chaos critical point and consequently remove the divergence in $\xi_c$. Together these results bound the depth to which signal may propagate through random neural networks.
We then develop a corresponding mean field model for gradients and we show that a duality exists between the forward propagation of signals and the backpropagation of gradients. The ordered and chaotic phases that Poole et al. (2016) identified correspond to regions of vanishing and exploding gradients, respectively. We demonstrate the validity of this mean field theory by computing gradients of random networks on MNIST. This provides a formal explanation of the ‘vanishing gradients’ phenomenon that has long been observed in neural networks (Bengio et al., 1993). We continue to show that the covariance between two gradients is controlled by the same depth scale that limits correlated signal propagation in the forward direction.
Finally, we hypothesize that a necessary condition for a random neural network to be trainable is that information should be able to pass through it. Thus, the depth-scales identified here bound the set of hyperparameters that will lead to successful training. To test this ansatz we train ensembles of deep, fully connected, feed-forward neural networks of varying depth on MNIST and CIFAR10, with and without dropout. Our results confirm that neural networks are trainable precisely when their depth is not much larger than $\xi_c$. This result is dataset independent and is, therefore, a universal function of network architecture.
A corollary of these results is that asymptotically deep neural networks should be trainable provided they are initialized sufficiently close to the order-to-chaos transition. The notion of “edge of chaos” initialization has been explored previously. Such investigations have been both direct, as in Bertschinger et al. (2005) and Glorot & Bengio (2010), and indirect, through initialization schemes that favor deep signal propagation such as batch normalization (Ioffe & Szegedy, 2015), orthogonal matrix initialization (Saxe et al., 2014), random walk initialization (Sussillo & Abbott, 2014), composition kernels (Daniely et al., 2016), or residual network architectures (He et al., 2015). The novelty of the work presented here is two-fold. First, our framework predicts the depth at which networks may be trained even far from the order-to-chaos transition. While a skeptic might ask when it would be profitable to initialize a network far from criticality, we respond by noting that there are architectures (such as neural networks with dropout) where no critical point exists and so this more general framework is needed. Second, our work provides a formal, as opposed to intuitive, explanation for why very deep networks can only be trained near the edge of chaos.
We begin by recapitulating the mean-field formalism developed in Poole et al. (2016). Consider a fully-connected, untrained, feed-forward, neural network of depth $L$ with layer width $N_l$ and some nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$. Since this is an untrained neural network we suppose that its weights and biases are respectively i.i.d. as $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2/N_l)$ and $b^l_i \sim \mathcal{N}(0, \sigma_b^2)$. Notationally we set $z^l_i$ to be the pre-activations of the $l$th layer and $y^{l+1}_i$ to be the activations of that layer. Finally, we take the input to the network to be $y^0_i = x_i$. The propagation of a signal through the network is described by the pair of equations,
$$z^l_i = \sum_j W^l_{ij}\, y^l_j + b^l_i, \qquad y^{l+1}_i = \phi(z^l_i). \tag{1}$$
Since the weights and biases are randomly distributed, these equations define a probability distribution on the activations and pre-activations over an ensemble of untrained neural networks. The “mean-field” approximation is then to replace $z^l_i$ by a Gaussian whose first two moments match those of $z^l_i$. For the remainder of the paper we will take the mean field approximation as given.
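Consider first the evolution of a single input, $x_{i;a}$, as it evolves through the network (as quantified by $y^l_{i;a}$ and $z^l_{i;a}$). Since the weights and biases are independent with zero mean, the first two moments of the pre-activations in the same layer will be,
$$\mathbb{E}[z^l_{i;a}] = 0, \qquad \mathbb{E}[z^l_{i;a}\, z^l_{j;a}] = q^l_{aa}\,\delta_{ij}, \tag{2}$$
where $\delta_{ij}$ is the Kronecker delta. Here $q^l_{aa}$ is the variance of the pre-activations in the $l$th layer due to an input $x_{i;a}$ and it is described by the recursion relation,
$$q^l_{aa} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2, \tag{3}$$
where $\mathcal{D}z = \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}$ is the measure for a standard Gaussian distribution. Together these equations completely describe the evolution of a single input through a mean field neural network. For any choice of $\sigma_w^2$ and $\sigma_b^2$ with bounded $\phi$, eq. 3 has a fixed point at $q^* = \lim_{l\to\infty} q^l_{aa}$.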
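The single-input recursion can be explored numerically. The sketch below is illustrative only: it assumes the variance map has the standard mean-field form $q \mapsto \sigma_w^2 \int \mathcal{D}z\, \phi^2(\sqrt{q}\,z) + \sigma_b^2$ with $\phi = \tanh$, and it approximates the Gaussian integral with a simple uniform-grid quadrature. The names `gauss_mean`, `variance_map`, and `fixed_point_q` are hypothetical helpers, not part of any library.

```python
import math

# Uniform grid on [-8, 8]; multiplied by the Gaussian density this acts as a
# quadrature rule for the standard Gaussian measure Dz (an illustrative choice).
STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the average of f(z) for z ~ N(0, 1)."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def variance_map(q, sw2, sb2):
    """One step of the assumed variance recursion for phi = tanh."""
    root = math.sqrt(q)
    return sw2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sb2


def fixed_point_q(sw2, sb2, q0=1.0, tol=1e-12, max_iter=10000):
    """Iterate the variance map until successive values agree to within tol."""
    q = q0
    for _ in range(max_iter):
        q_next = variance_map(q, sw2, sb2)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


if __name__ == "__main__":
    q_star = fixed_point_q(sw2=1.5, sb2=0.05)
    # The two printed numbers should agree if q_star is a fixed point.
    print(q_star, variance_map(q_star, 1.5, 0.05))
```

Plain iteration is used rather than a root finder because it mirrors how the variance actually evolves layer by layer.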
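The propagation of a pair of signals, $x_{i;a}$ and $x_{i;b}$, through this network can be understood similarly. Here the mean pre-activations are trivially the same as in the single-input case. The independence of the weights and biases implies that the covariance between different pre-activations in the same layer will be given by $\mathbb{E}[z^l_{i;a}\, z^l_{j;b}] = q^l_{ab}\,\delta_{ij}$. The covariance, $q^l_{ab}$, will be given by the recurrence relation,
$$q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi(u_1)\,\phi(u_2) + \sigma_b^2, \tag{4}$$
where $u_1 = \sqrt{q^{l-1}_{aa}}\, z_1$ and $u_2 = \sqrt{q^{l-1}_{bb}} \left[ c^{l-1}_{ab}\, z_1 + \sqrt{1 - (c^{l-1}_{ab})^2}\, z_2 \right]$, with $c^l_{ab} = q^l_{ab} / \sqrt{q^l_{aa}\, q^l_{bb}}$, are Gaussian approximations to the pre-activations in the preceding layer with the correct covariance matrix. Moreover $c^l_{ab}$ is the correlation between the two inputs after $l$ layers.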
Examining eq. 4 it is clear that $c^* = 1$ is a fixed point of the recurrence relation. To determine whether or not $c^* = 1$ is an attractive fixed point the quantity,
$$\chi_1 = \frac{\partial c^l_{ab}}{\partial c^{l-1}_{ab}} \bigg|_{c = 1} = \sigma_w^2 \int \mathcal{D}z \left[ \phi'\!\left(\sqrt{q^*}\, z\right) \right]^2, \tag{5}$$
is introduced. Poole et al. (2016) note that the $c^* = 1$ fixed point is stable if $\chi_1 < 1$ and is unstable otherwise. Thus, $\chi_1 = 1$ represents a critical line separating an ordered phase (in which $c^* = 1$ and all inputs end up asymptotically correlated) and a chaotic phase (in which $c^* < 1$ and all inputs end up asymptotically decorrelated). For the case of $\phi = \tanh$, the phase diagram in fig. 1 (a) is observed.
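The stability criterion lends itself to a short numerical check. The following sketch assumes $\phi = \tanh$ and that the stability quantity takes the form $\chi_1 = \sigma_w^2 \int \mathcal{D}z\, [\phi'(\sqrt{q^*}\,z)]^2$; the quadrature grid and the helper names (`gauss_mean`, `fixed_point_q`, `chi_1`, `phase`) are illustrative assumptions rather than anything prescribed by the text.

```python
import math

# Uniform-grid quadrature for the standard Gaussian measure (illustrative).
STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Approximate the average of f(z) for z ~ N(0, 1)."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def fixed_point_q(sw2, sb2, q0=1.0, tol=1e-12, max_iter=10000):
    """Fixed point of the assumed variance map for phi = tanh."""
    q = q0
    for _ in range(max_iter):
        root = math.sqrt(q)
        q_next = sw2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sb2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def chi_1(sw2, sb2):
    """Slope of the correlation map at c = 1, evaluated at the fixed point q*."""
    root = math.sqrt(fixed_point_q(sw2, sb2))
    # tanh'(x) = 1 - tanh(x)^2
    return sw2 * gauss_mean(lambda z: (1.0 - math.tanh(root * z) ** 2) ** 2)


def phase(sw2, sb2):
    """Classify a hyperparameter pair by the sign of chi_1 - 1."""
    return "ordered" if chi_1(sw2, sb2) < 1.0 else "chaotic"


if __name__ == "__main__":
    for sw2 in (0.5, 1.0, 2.0, 4.0):
        print(sw2, chi_1(sw2, 0.05), phase(sw2, 0.05))
```

Sweeping `sw2` and `sb2` over a grid and recording where `chi_1` crosses one would trace out a phase boundary of the kind described above.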
3 Asymptotic Expansions and Depth Scales
Our first contribution is to demonstrate the existence of two depth-scales that arise naturally within the framework of mean field neural networks. To motivate the existence of these depth-scales, we iterate eq. 3 and eq. 4 until convergence for many values of $\sigma_w^2$ between 0.1 and 3.0, at fixed $\sigma_b^2$ and from fixed initial values of $q^0_{aa}$ and $c^0_{ab}$. We see, in fig. 1 (b) and (c), that the manner in which both $q^l_{aa}$ approaches $q^*$ and $c^l_{ab}$ approaches $c^*$ is exponential over many orders of magnitude. We therefore anticipate that asymptotically $|q^l_{aa} - q^*| \sim e^{-l/\xi_q}$ and $|c^l_{ab} - c^*| \sim e^{-l/\xi_c}$ for sufficiently large $l$. Here, $\xi_q$ and $\xi_c$ define depth-scales over which information may propagate about the magnitude of a single input and the correlation between two inputs respectively.
We will presently prove that the approach of $q^l_{aa}$ and $c^l_{ab}$ to their fixed points is asymptotically exponential. In both cases we will use the same fundamental strategy wherein we expand one of the recurrence relations (either eq. 3 or eq. 4) about its fixed point to get an approximate “asymptotic” recurrence relation. We find that this asymptotic recurrence relation in turn implies exponential decay towards the fixed point over a depth-scale, $\xi$.
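We first analyze eq. 3 and identify a depth-scale at which information about a single input may propagate. Let $\epsilon^l = q^l_{aa} - q^*$. By construction, so long as $q^* = \lim_{l\to\infty} q^l_{aa}$ exists it follows that $\epsilon^l \to 0$ as $l \to \infty$. Eq. 3 may be expanded to lowest order in $\epsilon^l$ to arrive at an asymptotic recurrence relation (see Appendix 7.1),
$$\epsilon^{l+1} = \epsilon^l \left[ \chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*}\, z\right) \phi\!\left(\sqrt{q^*}\, z\right) \right] + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{6}$$
Notably, the term multiplying $\epsilon^l$ is a constant. It follows that for large $l$ the asymptotic recurrence relation has an exponential solution, $\epsilon^l \sim e^{-l/\xi_q}$, with $\xi_q$ given by
$$\xi_q^{-1} = -\log\left[ \chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*}\, z\right) \phi\!\left(\sqrt{q^*}\, z\right) \right]. \tag{7}$$
This establishes $\xi_q$ as a depth scale that controls how deep information from a single input may penetrate into a random neural network.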
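Next we apply the same strategy to eq. 4, expanding about the fixed point $c^*$ with $q^l_{aa} = q^l_{bb} = q^*$ and $\epsilon^l = c^l_{ab} - c^*$ (see Appendix 7.2). This gives the asymptotic recurrence relation,
$$\epsilon^{l+1} = \epsilon^l \left[ \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi'(u_1^*)\, \phi'(u_2^*) \right] + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{8}$$
Here $u_1^* = \sqrt{q^*}\, z_1$ and $u_2^* = \sqrt{q^*} \left[ c^* z_1 + \sqrt{1 - (c^*)^2}\, z_2 \right]$. Thus, once again, we expect that for large $l$ this recurrence will have an exponential solution, $\epsilon^l \sim e^{-l/\xi_c}$, with $\xi_c$ given by
$$\xi_c^{-1} = -\log\left[ \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi'(u_1^*)\, \phi'(u_2^*) \right]. \tag{9}$$
In the ordered phase $c^* = 1$ and so $\xi_c^{-1} = -\log \chi_1$. Since the transition between order and chaos occurs when $\chi_1 = 1$ it follows that $\xi_c$ diverges at any order-to-chaos transition so long as $q^*$ and $c^*$ exist.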
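Both depth scales can be evaluated numerically once the fixed points are known. The sketch below is a hedged illustration: it assumes $\phi = \tanh$, takes $\xi_q^{-1} = -\log[\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\phi]$ and $\xi_c^{-1} = -\log[\sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*)]$ as the working formulas, and finds $q^*$ and $c^*$ by plain iteration. All function names and the quadrature grid are assumptions made for the example.

```python
import math

STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    """Average of f(z) over z ~ N(0, 1) on a uniform grid."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def gauss_mean_2d(f):
    """Average of f(z1, z2) over two independent standard Gaussians."""
    return sum(w1 * sum(w2 * f(z1, z2) for z2, w2 in zip(NODES, WEIGHTS))
               for z1, w1 in zip(NODES, WEIGHTS))


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def dd_tanh(x):
    t = math.tanh(x)
    return -2.0 * t * (1.0 - t * t)


def fixed_point_q(sw2, sb2, q0=1.0, tol=1e-12, max_iter=10000):
    q = q0
    for _ in range(max_iter):
        root = math.sqrt(q)
        q_next = sw2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sb2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def correlation_map(c, q_star, sw2, sb2):
    """One step of the assumed correlation recursion with both variances at q*."""
    r = math.sqrt(q_star)
    s = math.sqrt(max(0.0, 1.0 - c * c))
    cov = sw2 * gauss_mean_2d(
        lambda z1, z2: math.tanh(r * z1) * math.tanh(r * (c * z1 + s * z2))) + sb2
    return cov / q_star


def fixed_point_c(q_star, sw2, sb2, c0=0.5, tol=1e-10, max_iter=2000):
    c = c0
    for _ in range(max_iter):
        c_next = correlation_map(c, q_star, sw2, sb2)
        if abs(c_next - c) < tol:
            return c_next
        c = c_next
    return c


def depth_scales(sw2, sb2):
    """Return (xi_q, xi_c, chi_1) from the assumed asymptotic expansions."""
    q_star = fixed_point_q(sw2, sb2)
    r = math.sqrt(q_star)
    chi1 = sw2 * gauss_mean(lambda z: d_tanh(r * z) ** 2)
    second = sw2 * gauss_mean(lambda z: dd_tanh(r * z) * math.tanh(r * z))
    xi_q = -1.0 / math.log(chi1 + second)
    c_star = fixed_point_c(q_star, sw2, sb2)
    s = math.sqrt(max(0.0, 1.0 - c_star * c_star))
    slope = sw2 * gauss_mean_2d(
        lambda z1, z2: d_tanh(r * z1) * d_tanh(r * (c_star * z1 + s * z2)))
    xi_c = -1.0 / math.log(slope)
    return xi_q, xi_c, chi1


if __name__ == "__main__":
    for sw2 in (1.0, 1.3, 4.0):
        print(sw2, depth_scales(sw2, 0.05))
```

Very close to the transition the iteration for $c^*$ converges slowly, which is itself a manifestation of the diverging depth scale, so `max_iter` may need to be raised there.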
These results can be investigated intuitively by plotting $c^{l+1}_{ab}$ vs $c^l_{ab}$ in fig. 2 (a). In the ordered phase there is only a single fixed point, $c^l_{ab} = 1$. In the chaotic regime we see that a second fixed point develops and the $c^l_{ab} = 1$ point becomes unstable. We see that the linearization about the fixed points becomes significantly closer to the trivial map near the order-to-chaos transition.
To test these claims we measure $\xi_q$ and $\xi_c$ directly by iterating the recurrence relations for $q^l_{aa}$ and $c^l_{ab}$ as before, now sweeping over a range of values of both $\sigma_w^2$ and $\sigma_b^2$. For each hyperparameter setting we fit the resulting residuals, $|q^l_{aa} - q^*|$ and $|c^l_{ab} - c^*|$, to exponential functions and infer the depth-scale. We then compare this measured depth-scale to that predicted by the asymptotic expansion. The result of this measurement is shown in fig. 2. In general we see that the agreement is quite good. As expected we see that $\xi_c$ diverges at the critical point.
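A measurement of this kind can be sketched in a few lines. The example below is an assumption-laden illustration for the ordered phase only, where $c^* = 1$: it iterates the correlation map for $\phi = \tanh$, fits a straight line to the logarithm of the residual $1 - c^l_{ab}$, and compares the inferred depth scale with the prediction $-1/\log \chi_1$. The fitting window (`lo`, `hi`), the starting correlation, and all helper names are hypothetical choices.

```python
import math

STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def fixed_point_q(sw2, sb2, q0=1.0, tol=1e-13, max_iter=10000):
    q = q0
    for _ in range(max_iter):
        root = math.sqrt(q)
        q_next = sw2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sb2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def correlation_map(c, q_star, sw2, sb2):
    """One step of the assumed correlation recursion at fixed variance q*."""
    r = math.sqrt(q_star)
    s = math.sqrt(max(0.0, 1.0 - c * c))
    total = 0.0
    for z1, w1 in zip(NODES, WEIGHTS):
        inner = sum(w2 * math.tanh(r * (c * z1 + s * z2))
                    for z2, w2 in zip(NODES, WEIGHTS))
        total += w1 * math.tanh(r * z1) * inner
    return (sw2 * total + sb2) / q_star


def fit_depth_scale(residuals):
    """Least-squares slope of log(residual) against layer index; xi = -1/slope."""
    n = len(residuals)
    ys = [math.log(r) for r in residuals]
    mean_l = (n - 1) / 2.0
    mean_y = sum(ys) / n
    num = sum((l - mean_l) * (y - mean_y) for l, y in enumerate(ys))
    den = sum((l - mean_l) ** 2 for l in range(n))
    return -den / num


def measure_xi_c(sw2, sb2, c0=0.6, lo=1e-8, hi=1e-3, max_iter=2000):
    """Measured correlation depth scale, assuming the fixed point is c* = 1."""
    q_star = fixed_point_q(sw2, sb2)
    c = c0
    residuals = []
    for _ in range(max_iter):
        c = correlation_map(c, q_star, sw2, sb2)
        eps = 1.0 - c
        if eps < lo:
            break
        if eps < hi:
            residuals.append(eps)
    return fit_depth_scale(residuals)


def predicted_xi_c_ordered(sw2, sb2):
    """Prediction -1/log(chi_1), valid only when chi_1 < 1."""
    r = math.sqrt(fixed_point_q(sw2, sb2))
    chi1 = sw2 * gauss_mean(lambda z: (1.0 - math.tanh(r * z) ** 2) ** 2)
    return -1.0 / math.log(chi1)


if __name__ == "__main__":
    print(measure_xi_c(1.0, 0.05), predicted_xi_c_ordered(1.0, 0.05))
```

Restricting the fit to small residuals keeps the comparison inside the regime where the linearized recurrence is expected to hold.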
As observed in Poole et al. (2016) we see that the depth scale for the propagation of information in a single input, $\xi_q$, is consistently finite and significantly shorter than $\xi_c$. To understand why this is the case consider eq. 6 and note that for $\tanh$ nonlinearities the second term is always negative. Thus, even as $\chi_1$ approaches 1 we expect $\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''(\sqrt{q^*}\, z)\, \phi(\sqrt{q^*}\, z)$ to be substantially smaller than 1.
The mean field formalism can be extended to include dropout. The main contribution here will be to argue that even infinitesimal amounts of dropout destroy the mean field critical point, and therefore limit the trainable network depth. In the presence of dropout the propagation equation, eq. 1, becomes,
$$z^l_i = \frac{1}{\rho} \sum_j W^l_{ij}\, p^l_j\, y^l_j + b^l_i, \tag{10}$$
where $p^l_j \sim \text{Bernoulli}(\rho)$ and $\rho$ is the dropout rate. In this convention $\rho$ is the probability that a unit is retained, so $\rho = 1$ corresponds to no dropout. As is typically the case we have re-scaled the sum by $\rho^{-1}$ so that the mean of the pre-activation is invariant with respect to our choice of dropout rate.
Following a similar procedure to the original mean field calculation, consider the fate of two inputs, $x_{i;a}$ and $x_{i;b}$, as they are propagated through such a random network. We take the dropout masks to be chosen independently for the two inputs, mimicking the manner in which dropout is employed in practice. With dropout the diagonal term in the covariance matrix will be (see Appendix 7.3),
$$\bar q^l_{aa} = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{\bar q^{l-1}_{aa}}\, z\right) + \sigma_b^2. \tag{11}$$
The variance of a single input with dropout will therefore propagate in an identical fashion to the vanilla case with a re-scaling $\sigma_w^2 \to \sigma_w^2/\rho$. Intuitively, this result implies that, for the case of a single input, the presence of dropout simply increases the effective variance of the weights.
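Computing the off-diagonal term of the covariance matrix similarly (see Appendix 7.4),
$$\bar q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1\, \mathcal{D}z_2\, \phi(\bar u_1)\, \phi(\bar u_2) + \sigma_b^2, \tag{12}$$
with $\bar u_1$, $\bar u_2$, and $\bar c^l_{ab}$ defined by analogy to the mean field equations without dropout. Here, unlike in the case of a single input, the recurrence relation is identical to the recurrence relation without dropout. To see that $\bar c^* = 1$ is no longer a fixed point of these dynamics consider what happens to eq. 12 when we input $\bar c^l_{ab} = 1$. For simplicity, we leverage the short range of $\xi_q$ to replace $\bar q^l_{aa} = \bar q^l_{bb} = \bar q^*$. We find (see Appendix 7.5),
$$\bar c^{l+1}_{ab} = 1 - \frac{1 - \rho}{\rho\, \bar q^*}\, \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{\bar q^*}\, z\right). \tag{13}$$
The second term is positive for any $\rho < 1$. This implies that if $\bar c^l_{ab} = 1$ for any $l$ then $\bar c^{l+1}_{ab} < 1$. Thus, $\bar c^* = 1$ is not a fixed point of eq. 12 for any $\rho < 1$. Since eq. 12 is identical in form to eq. 4 it follows that the depth scale for signal propagation with dropout will likewise be given by eq. 9 with the substitutions $q^* \to \bar q^*$ and $c^* \to \bar c^*$ computed using eq. 11 and eq. 12 respectively. Importantly, since there is no longer a sharp critical point with dropout we do not expect a diverging depth scale.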
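The loss of the perfectly correlated fixed point can be checked numerically. The sketch below assumes $\phi = \tanh$, a retain probability `rho`, a single-input variance map with $\sigma_w^2$ replaced by $\sigma_w^2/\rho$, and an off-diagonal map that keeps the bare $\sigma_w^2$. It evaluates where two identical inputs ($c = 1$) are sent after one layer, both directly and through the closed form $1 - \frac{1-\rho}{\rho \bar q^*}\sigma_w^2 \int \mathcal{D}z\, \phi^2(\sqrt{\bar q^*}\,z)$. Helper names are hypothetical.

```python
import math

STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def second_moment(q):
    """Gaussian average of tanh^2(sqrt(q) z)."""
    root = math.sqrt(q)
    return gauss_mean(lambda z: math.tanh(root * z) ** 2)


def fixed_point_q_dropout(sw2, sb2, rho, q0=1.0, tol=1e-12, max_iter=10000):
    """Fixed point of the assumed dropout variance map (weights scaled by 1/rho)."""
    q = q0
    for _ in range(max_iter):
        q_next = (sw2 / rho) * second_moment(q) + sb2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def correlation_after_identical_inputs(sw2, sb2, rho):
    """Image of c = 1 under one layer with independent dropout masks.

    Returns (direct, closed_form); the two should agree at the fixed point.
    """
    q_bar = fixed_point_q_dropout(sw2, sb2, rho)
    m = second_moment(q_bar)
    direct = (sw2 * m + sb2) / q_bar
    closed_form = 1.0 - (1.0 - rho) / (rho * q_bar) * sw2 * m
    return direct, closed_form


if __name__ == "__main__":
    for rho in (1.0, 0.99, 0.9, 0.5):
        print(rho, correlation_after_identical_inputs(1.5, 0.05, rho))
```

Only `rho = 1` maps $c = 1$ back to itself in this sketch; any smaller retain probability pushes the correlation strictly below one.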
As in networks without dropout we plot, in fig. 3 (a), the iterative map $\bar c^{l+1}_{ab}$ as a function of $\bar c^l_{ab}$. Most significantly, we see that $\bar c^l_{ab} = 1$ is no longer a fixed point of the dynamics. Instead, as the dropout rate increases $\bar c^l_{ab} = 1$ gets mapped to decreasing values and the fixed point monotonically decreases.
To test these results we plot in fig. 3 (b) the asymptotic correlation, $\bar c^*$, as a function of $\sigma_w^2$ for a range of dropout rates. As expected, we see that for all $\rho < 1$ there is no sharp transition between $\bar c^* = 1$ and $\bar c^* < 1$. Moreover as the dropout rate increases the correlation monotonically decreases. Intuitively this makes sense. Identical inputs passed through two different dropout masks will become increasingly dissimilar as the dropout rate increases. In fig. 3 (c) we show the depth scale, $\xi_c$, as a function of $\sigma_w^2$ for the same range of dropout probabilities. We find that, as predicted, the depth of signal propagation with dropout is drastically reduced and, importantly, there is no longer a divergence in $\xi_c$. Increasing the dropout rate continues to decrease the correlation depth for constant $\sigma_w^2$.
4 Gradient Backpropagation
There is a duality between the forward propagation of signals and the backpropagation of gradients. To elucidate this connection consider the backpropagation equations given a loss $E$,
$$\frac{\partial E}{\partial W^l_{ij}} = \delta^l_i\, \phi(z^{l-1}_j), \qquad \delta^l_i = \phi'(z^l_i) \sum_j \delta^{l+1}_j\, W^{l+1}_{ji}, \tag{14}$$
with the identification $\delta^l_i = \partial E / \partial z^l_i$. Within mean field theory, it is clear that the scale of fluctuations of the gradient of weights in a layer will be proportional to $\mathbb{E}[(\delta^l_i)^2]$ (see appendix 7.6). In contrast to the pre-activations in forward propagation (eq. 1), the $\delta^l_i$ will typically not be Gaussian distributed even in the large layer width limit.
Nonetheless, we can work out a recurrence relation for the variance of the error, , leveraging the Gaussian ansatz on the pre-activations. In order to do this, however, we must first make an additional approximation that the weights used during forward propagation are drawn independently from the weights used in backpropagation. This approximation is similar in spirit to the vanilla mean field approximation and is reminiscent of work on feedback alignment (Lillicrap et al., 2014). With this in mind we arrive at the recurrence (see appendix 7.7),
The presence of $\chi_1$ in the above equation should perhaps not be surprising. Poole et al. (2016) show that $\chi_1$ is intimately related to the tangent space of a given layer in mean field neural networks. We note that the backpropagation recurrence features an explicit dependence on the ratio of widths of adjacent layers of the network. Here we will consider exclusively constant width networks where this factor is unity. For a discussion of the case of unequal layer widths see Glorot & Bengio (2010).
Since $\chi_1$ depends only on the asymptotic $q^*$ it follows that for constant width networks we expect eq. 15 to again have an exponential solution with,
$$\xi_\nabla^{-1} = -\log \chi_1. \tag{16}$$
Note that here $\xi_\nabla^{-1} = -\log \chi_1$ both above and below the transition. It follows that $\xi_\nabla$ can be both positive and negative. We conclude that there should be three distinct regimes for the gradients.
In the ordered phase, $\chi_1 < 1$ and so $\xi_\nabla > 0$. We therefore expect gradients to vanish over a depth $|\xi_\nabla|$.
At criticality, $\chi_1 \to 1$ and so $\xi_\nabla \to \infty$. Here gradients should be stable regardless of depth.
In the chaotic phase, $\chi_1 > 1$ and so $\xi_\nabla < 0$. It follows that in this regime gradients should explode over a depth $|\xi_\nabla|$.
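The three regimes can be read off from $\chi_1$ alone. The sketch below assumes $\phi = \tanh$, constant layer width, and $\xi_\nabla^{-1} = -\log \chi_1$; under those assumptions the squared gradient scale is expected to change by a factor of $\chi_1$ per layer of backpropagation. The tolerance used to call a point "critical" and the helper names are illustrative choices.

```python
import math

STEP = 0.25
NODES = [-8.0 + STEP * k for k in range(65)]
WEIGHTS = [STEP * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in NODES]


def gauss_mean(f):
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))


def fixed_point_q(sw2, sb2, q0=1.0, tol=1e-12, max_iter=10000):
    q = q0
    for _ in range(max_iter):
        root = math.sqrt(q)
        q_next = sw2 * gauss_mean(lambda z: math.tanh(root * z) ** 2) + sb2
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def chi_1(sw2, sb2):
    root = math.sqrt(fixed_point_q(sw2, sb2))
    return sw2 * gauss_mean(lambda z: (1.0 - math.tanh(root * z) ** 2) ** 2)


def gradient_regime(sw2, sb2, tol=1e-6):
    """Return (regime, xi_grad) with xi_grad = -1 / log(chi_1)."""
    chi1 = chi_1(sw2, sb2)
    if abs(chi1 - 1.0) < tol:
        return "critical", math.inf
    xi_grad = -1.0 / math.log(chi1)
    return ("vanishing" if xi_grad > 0.0 else "exploding"), xi_grad


def gradient_scale_after(k, sw2, sb2):
    """Predicted ratio of squared gradient scale after k layers of backprop
    (constant-width assumption, so each layer contributes one factor of chi_1)."""
    return chi_1(sw2, sb2) ** k


if __name__ == "__main__":
    for sw2 in (1.0, 4.0):
        print(sw2, gradient_regime(sw2, 0.05), gradient_scale_after(50, sw2, 0.05))
```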
Intuitively these three regimes make sense. To see this, recall that perturbations to a weight in layer $l$ can alternatively be viewed as perturbations to the pre-activations in the same layer. In the ordered phase both the perturbed signal and the unperturbed signal will be asymptotically mapped to the same point and the derivative will be small. In the chaotic phase the perturbed and unperturbed signals will become asymptotically decorrelated and the gradient will be large.
To investigate these predictions we construct deep random networks of fixed depth $L$ and layer-width $N$. We then consider the cross-entropy loss of these networks on MNIST. In fig. 4 (a) we plot the layer-by-layer 2-norm of the gradient, $\|\nabla_{W^l} E\|_2$, as a function of layer, $l$, for different values of $\sigma_w^2$. We see that the gradient norm behaves exponentially over many orders of magnitude. Moreover, we see that the gradient vanishes in the ordered phase and explodes in the chaotic phase. We test the quantitative predictions of eq. 16 in fig. 4 (b) where we compare $|\xi_\nabla|$ as predicted from theory with the measured depth-scale constructed from exponential fits to the gradient data. Here we see good quantitative agreement between the theoretical predictions from mean field random networks and experimentally realized networks. Together these results suggest that the approximations on the backpropagation equations were representative of deep, wide, random networks.
Finally, we show that the depth scale for correlated signal propagation likewise controls the depth at which information stored in the covariance between gradients can survive. The existence of consistent gradients across similar samples from a training set ought to be especially important for determining whether or not a given neural network architecture can be trained. To establish this depth-scale first note (see Appendix 7.8) that the covariance between gradients of two different inputs, and , will be proportional to where is the loss evaluated on and are appropriately defined errors.
It can be shown (see Appendix 7.9) that features the recurrence relation,
where and are defined similarly as for the forward pass. Expanding asymptotically it is clear that to zeroth order in , will have an exponential solution with with as defined in the forward pass.
5 Experimental Results
Taken together, the results of this paper lead us to the following hypothesis: a necessary condition for a random network to be trained is that information about the inputs should be able to propagate forward through the network, and information about the gradients should be able to propagate backwards through the network. The preceding analysis shows that networks will have this property precisely when the network depth, $L$, is not much larger than the depth-scale $\xi_c$. This criterion is data independent and therefore offers a “universal” constraint on the hyperparameters that depends on network architecture alone. We now explore this relationship between depth of signal propagation and network trainability empirically.
To investigate this prediction, we consider random networks over a range of depths $L$ and weight variances $\sigma_w^2$. We train these networks using Stochastic Gradient Descent (SGD) and RMSProp on MNIST and CIFAR10. The learning rate for SGD depends on depth, with one value for shallower networks and another for larger $L$, and a separate rate is used for RMSProp. These learning rates were selected by grid search over exponentially spaced values. We note that the depth dependence of learning rate was explored in detail in Saxe et al. (2014). In fig. 5 (a)-(d) we color in red the training accuracy that neural networks achieved as a function of $\sigma_w^2$ and $L$ for different datasets, training time, and choice of minimizer (see Appendix 7.10 for more comparisons). In all cases the neural networks over-fit the data, with training accuracy exceeding test accuracy on both MNIST and CIFAR10. We emphasize that the purpose of this study is to demonstrate trainability as opposed to optimizing test accuracy.
We now make the connection between the depth scale, $\xi_c$, and the maximum trainable depth more precise. Given the arguments in the preceding sections we note that if $L = n\xi_c$ then signal through the network will be attenuated by a factor of $e^{n}$. To understand how much signal can be lost while still allowing for training, we overlay in fig. 5 (a) curves corresponding to $n\xi_c$ for a range of values of $n$. We find that networks appear to be trainable when $L$ is at most a fixed multiple of $\xi_c$. It would be interesting to understand why this is the case.
Motivated by this argument, in fig. 5 (b)-(d) we plot twice the predicted depth scale as a white, dashed overlay. There is clearly a relationship between the depth of correlated signal propagation and whether or not these networks are trainable. Networks closer to their critical point appear to train more quickly than those further away. Moreover, this relationship has no obvious dependence on dataset, duration of training, or minimizer. We therefore conclude that these bounds on trainable hyperparameters are universal. This in turn implies that to train increasingly deep networks, one must generically be ever closer to criticality.
Next we consider the effect of dropout. As we showed earlier, even infinitesimal amounts of dropout disrupt the order-to-chaos phase transition and cause the depth scale to become finite. However, since the effect of a single dropout mask is to simply re-scale the weight variance by $\rho^{-1}$, the gradient magnitude will be stable near criticality, while the input and gradient correlations will not be. This therefore offers a unique opportunity to test whether the relevant depth-scale is $|\xi_\nabla|$ or $\xi_c$.
In fig. 6 we repeat the same experimental setup as above on MNIST with dropout rates and 0.94. We observe, first and foremost, that even extremely modest amounts of dropout limit the maximum trainable depth to about . We additionally notice that the depth-scale, , predicts the trainable region accurately for varying amounts of dropout.
In this paper we have elucidated the existence of several depth-scales that control signal propagation in random neural networks. Furthermore, we have shown that the degree to which a neural network can be trained depends crucially on its ability to propagate information about inputs and gradients through its full depth. At the transition between order and chaos, information stored in the correlation between inputs can propagate infinitely far through these random networks. This in turn implies that extremely deep neural networks may be trained sufficiently close to criticality. However, our contribution goes beyond advocating for hyperparameter selection that brings random networks to be nearly critical. Instead, we offer a general purpose framework that predicts, at the level of mean field theory, which hyperparameters should allow a network to be trained. This is especially relevant when analyzing schemes like dropout where there is no critical point and which therefore imply an upper bound on trainable network depth.
An alternative perspective as to why information stored in the covariance between inputs is crucial for training can be understood by appealing to the correspondence between infinitely wide Bayesian neural networks and Gaussian Processes (Neal, 2012). In particular the covariance, $q^l_{ab}$, is intimately related to the kernel of the induced Gaussian Process. It follows that cases in which signal stored in the covariance between inputs may propagate through the network correspond precisely to situations in which the associated Gaussian Process is well defined.
Our work suggests that it may be fruitful to investigate pre-training schemes that attempt to perturb the weights of a neural network to favor information flow through the network. In principle this could be accomplished through a layer-by-layer local criterion for information flow or by selecting the mean and variance in schemes like batch normalization to maximize the covariance depth-scale.
These results suggest that theoretical work on random neural networks can be used to inform practical architectural decisions. However, there is still much work to be done. For instance, the framework developed here does not apply to unbounded activations, such as rectified linear units, where it can be shown that there are phases in which eq. 3 does not have a fixed point. Additionally, the analysis here applies directly only to fully connected feed-forward networks, and will need to be extended to architectures with structured weight matrices such as convolutional networks.
We close by noting that in physics it has long been known that, through renormalization, the behavior of systems near critical points can control their behavior even far from the idealized critical case. We therefore make the somewhat bold hypothesis that a broad class of neural network topologies will be controlled by the fully-connected mean field critical point.
We thank Ben Poole, Jeffrey Pennington, Maithra Raghu, and George Dahl for useful discussions. We are additionally grateful to RocketAI for introducing us to Temporally Recurrent Online Learning and two-dimensional time.
- Bengio et al. (1993) Y Bengio, Paolo Frasconi, and P Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pp. 1183–1188. IEEE, 1993.
- Bertschinger et al. (2005) Nils Bertschinger, Thomas Natschläger, and Robert A. Legenstein. At the edge of chaos: Real-time computations and self-organized criticality in recurrent neural networks. In L. K. Saul, Y. Weiss, and L. Bottou (eds.), Advances in Neural Information Processing Systems 17, pp. 145–152. MIT Press, 2005.
- Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
- Daniely et al. (2016) A. Daniely, R. Frostig, and Y. Singer. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. arXiv:1602.05897, 2016.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pp. 249–256, 2010.
- Goodfellow et al. (2014) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014.
- He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. ArXiv e-prints, December 2015.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448–456, 2015.
- Lillicrap et al. (2014) Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks. arXiv:1411.0247, 2014.
- Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2924–2932. Curran Associates, Inc., 2014.
- Neal (2012) Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. arXiv:1606.05340, June 2016.
- Raghu et al. (2016) M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. arXiv:1606.05336, June 2016.
- Recht et al. (2015) Benjamin Recht, Moritz Hardt, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240, 2015.
- Saxe et al. (2014) A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
- Sussillo & Abbott (2014) David Sussillo and LF Abbott. Random walks: Training very deep nonlinear feed-forward networks with smart initialization. CoRR, vol. abs/1412.6558, 2014.
Here we present derivations of results from throughout the paper.
7.1 Single input depth-scale
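Consider the recurrence relation for the variance of a single input,
$$q^l_{aa} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2,$$
and a fixed point of the dynamics, $q^*$. With $\epsilon^l = q^l_{aa} - q^*$, $q^l_{aa}$ can be expanded about the fixed point to yield the asymptotic recurrence relation,
$$\epsilon^{l+1} = \epsilon^l \left[ \chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*}\, z\right) \phi\!\left(\sqrt{q^*}\, z\right) \right] + \mathcal{O}\!\left((\epsilon^l)^2\right).$$
We begin by first expanding to order $\epsilon^l$,
$$q^* + \epsilon^{l+1} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^* + \epsilon^l}\, z\right) + \sigma_b^2 \approx \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*}\, z + \frac{\epsilon^l}{2\sqrt{q^*}}\, z\right) + \sigma_b^2 \approx q^* + \frac{\sigma_w^2}{\sqrt{q^*}}\, \epsilon^l \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*}\, z\right) \phi'\!\left(\sqrt{q^*}\, z\right).$$
We therefore arrive at the approximate recurrence relation,
$$\epsilon^{l+1} = \frac{\sigma_w^2}{\sqrt{q^*}}\, \epsilon^l \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*}\, z\right) \phi'\!\left(\sqrt{q^*}\, z\right) + \mathcal{O}\!\left((\epsilon^l)^2\right).$$
Using the identity $\int \mathcal{D}z\, z f(z) = \int \mathcal{D}z\, f'(z)$, we can rewrite this asymptotic recurrence relation as,
$$\epsilon^{l+1} = \epsilon^l \left[ \sigma_w^2 \int \mathcal{D}z \left[\phi'\!\left(\sqrt{q^*}\, z\right)\right]^2 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*}\, z\right) \phi\!\left(\sqrt{q^*}\, z\right) \right] + \mathcal{O}\!\left((\epsilon^l)^2\right),$$
where the first term in brackets is $\chi_1$.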
7.2 Two input depth-scale
Consider the recurrence relation for the covariance of two inputs,
a correlation between the inputs, , and a fixed point of the dynamics, . can be expanded about the fixed point to yield the asymptotic recurrence relation,
Since the relaxation of and to occurs much more quickly than the convergence of we approximate as in Poole et al. (2016). We therefore consider the perturbation . It follows that we may make the approximation,
We now consider the case where and separately; we will later show that these two results agree with one another. First we consider the case where in which case we may safely expand the above equation to get,
This allows us to in turn approximate the recurrence relation,
are appropriately defined asymptotic random variables. This leads to the asymptotic recurrence relation,
We now consider the case where and . In this case the expansion of will become,
and so the lowest order correction is of order as opposed to . As usual we now expand the recurrence relation, noting that is independent of when to find,
It follows that the asymptotic recurrence relation in this case will be,
where is the stability condition for the ordered phase. We note that although the approximations were somewhat different the asymptotic recurrence relation for reduces eq. 47 result for . We may therefore use 4 for all .
7.3 Variance of an input with dropout
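In the presence of dropout with rate $\rho$, the variance of a single input as it is passed through the network is described by the recurrence relation,
$$\bar q^l_{aa} = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{\bar q^{l-1}_{aa}}\, z\right) + \sigma_b^2.$$
Recall that the recurrence relation for the pre-activations is given by,
$$z^l_i = \frac{1}{\rho} \sum_j W^l_{ij}\, p^l_j\, y^l_j + b^l_i,$$
where $p^l_j \sim \text{Bernoulli}(\rho)$. It follows that the variance will be given by,
$$\bar q^l_{aa} = \mathbb{E}[(z^l_i)^2] = \frac{1}{\rho^2} \sum_j \mathbb{E}[(W^l_{ij})^2]\, \mathbb{E}[(p^l_j)^2]\, \mathbb{E}[(y^l_j)^2] + \mathbb{E}[(b^l_i)^2] = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{\bar q^{l-1}_{aa}}\, z\right) + \sigma_b^2,$$
where we have used the fact that $\mathbb{E}[(p^l_j)^2] = \rho$.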
7.4 Covariance of two inputs with dropout
The co-variance between two signals, and , with separate i.i.d. dropout masks and is given by,
where, in analogy to eq. 4, and