1 Introduction
Deep convolutional neural networks (CNNs) have been crucial to the success of deep learning. Architectures based on CNNs have achieved unprecedented accuracy in domains ranging across computer vision
(Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014), and recently even the board game Go (Silver et al., 2016, 2017). The performance of deep convolutional networks has improved as these networks have been made ever deeper. For example, some of the best-performing models on ImageNet
(Deng et al., 2009) have employed hundreds or even a thousand layers (He et al., 2016a, b). However, these extremely deep architectures have been trainable only in conjunction with techniques like residual connections (He et al., 2016a) and batch normalization (Ioffe & Szegedy, 2015). It is an open question whether these techniques qualitatively improve model performance or whether they are necessary crutches that solely make the networks easier to train. In this work, we study vanilla CNNs using a combination of theory and experiment to disentangle the notions of trainability and generalization performance. In doing so, we show that through a careful, theoretically-motivated initialization scheme, we can train vanilla CNNs with 10,000 layers using no architectural tricks.

Recent work has used mean field theory to build a theoretical understanding of neural networks with random parameters (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Karakida et al., 2018; Hayou et al., 2018; Hanin & Rolnick, 2018; Yang & Schoenholz, 2018)
. These studies revealed a maximum depth through which signals can propagate at initialization, and verified empirically that networks are trainable precisely when signals can travel all the way through them. In the fully-connected setting, the theory additionally predicts the existence of an order-to-chaos phase transition in the space of initialization hyperparameters. For networks initialized on the critical line separating these phases, signals can propagate indefinitely and arbitrarily deep networks can be trained. While mean field theory captures the "average" dynamics of random neural networks, it does not quantify the scale of gradient fluctuations that are crucial to the stability of gradient descent. A related body of work
(Saxe et al., 2013; Pennington et al., 2017, 2018) has examined the input-output Jacobian and used random matrix theory to quantify the distribution of its singular values in terms of the activation function and the distribution from which the initial random weight matrices are drawn. These works concluded that networks can be trained most efficiently when the Jacobian is well-conditioned, a criterion that can be achieved with orthogonal, but not Gaussian, weight matrices. Together, these approaches have allowed researchers to efficiently train extremely deep network architectures, but so far they have been limited to neural networks composed of fully-connected layers.
In the present work, we continue this line of research and extend it to the convolutional setting. We show that a well-defined mean-field theory exists for convolutional networks in the limit that the number of channels is large, even when the size of the image is small. Moreover, convolutional networks have precisely the same order-to-chaos transition as fully-connected networks, with vanishing gradients in the ordered phase and exploding gradients in the chaotic phase. And just like fully-connected networks, very deep CNNs that are initialized on the critical line separating those two phases can be trained with relative ease.
Moving beyond mean field theory, we additionally show that the random matrix analysis of (Pennington et al., 2017, 2018) carries over to the convolutional setting. Furthermore, we identify an efficient construction from the wavelet literature that generates random orthogonal matrices with the block-circulant structure that corresponds to convolution operators. This construction facilitates random orthogonal initialization for convolutional layers and enables good conditioning of the end-to-end Jacobian matrices of arbitrarily deep networks. We show empirically that networks with this initialization can train significantly more quickly than standard convolutional networks.
Finally, we emphasize that although the order-to-chaos phase boundaries of fully-connected and convolutional networks look identical, the underlying mean-field theories are in fact quite different. In particular, a novel aspect of the convolutional theory is the existence of multiple depth scales that control signal propagation at different spatial frequencies. In the large depth limit, signals can only propagate along modes with minimal spatial structure; all other modes end up deteriorating, even at criticality. We hypothesize that this type of signal degradation is harmful for generalization, and we develop a modified initialization scheme that allows for balanced propagation of signals among all frequencies. In this scheme, which we call Delta-Orthogonal initialization, the orthogonal kernel is drawn from a spatially non-uniform distribution, and it allows us to train vanilla CNNs of 10,000 layers or more with no degradation in performance.
2 Theoretical results
In this section, we first derive a mean field theory for signal propagation in random convolutional neural networks, following the general methodology established in Poole et al. (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017). We then arrive at a theory for the singular value distribution of the Jacobian following Pennington et al. (2017, 2018). Together, this will allow us to derive theoretically-motivated initialization schemes for convolutional neural networks that we call orthogonal kernels and Delta-Orthogonal kernels. (An example implementation of a deep network initialized critically using the Delta-Orthogonal kernel is provided at https://github.com/brain-research/mean-field-cnns.) Later we will demonstrate experimentally that these kernels outperform existing initialization schemes for very deep vanilla convolutional networks.
2.1 A mean field theory for CNNs
2.1.1 Recursion relation for covariance
Consider an $L$-layer 1D CNN (for notational simplicity we consider one-dimensional convolutions, but the higher-dimensional case proceeds identically) with periodic boundary conditions, filter width $2k+1$, number of channels $c$, spatial size $n$, per-layer weight tensors $\omega^l \in \mathbb{R}^{(2k+1)\times c\times c}$, and biases $b^l \in \mathbb{R}^{c}$. Let $\phi$ be the activation function and let $x^l_{i,\alpha}$ denote the pre-activation unit at layer $l$, channel $i$, and spatial location $\alpha$, where we define the set of spatial locations $sp = \{1,\dots,n\}$ and the set of filter indices $ker = \{-k,\dots,k\}$. The forward-propagation dynamics can be described by the recurrence relation,

$$x^l_{j,\alpha} \;=\; \sum_{i=1}^{c}\sum_{\beta\in ker} \omega^l_{j,i,\beta}\,\phi(x^{l-1}_{i,\alpha+\beta}) \;+\; b^l_j \qquad (2.1)$$
where $\alpha \in sp$ and $\beta \in ker$. At initialization, we take the weights to be drawn i.i.d. from the Gaussian $\omega^l_{j,i,\beta} \sim \mathcal{N}(0, \sigma_\omega^2/(c(2k+1)))$ and the biases to be drawn i.i.d. from the Gaussian $b^l_j \sim \mathcal{N}(0, \sigma_b^2)$. Note that the spatial index $\alpha+\beta$ is taken modulo $n$ since we assume periodic boundary conditions. We wish to understand how signals propagate through these networks. As in previous work in this vein, we will take the large network limit, which in this context corresponds to taking the number of channels $c \to \infty$. This allows us to use powerful theoretical tools such as mean field theory and random matrix theory. Moreover, this approximation has been shown to give results that agree well with experiments on finite-size networks.
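As a sanity check on this large-$c$ limit, the per-layer variance recursion implied by eqn. (2.1) can be verified numerically. The following sketch (our own illustration, with arbitrary parameter values) fixes the previous layer's activations, draws many i.i.d. output channels with the stated weight and bias variances, and compares the empirical pre-activation variance against the mean-field prediction $\sigma_\omega^2\,\overline{y^2} + \sigma_b^2$:

```python
import numpy as np

# Monte Carlo check of the mean-field variance recursion for one conv layer.
# Notation follows the text: c channels, filter width 2k+1, spatial size n,
# weights ~ N(0, sigma_w^2 / (c*(2k+1))), biases ~ N(0, sigma_b^2).
rng = np.random.default_rng(0)
c, k, n = 64, 1, 8
sigma_w2, sigma_b2 = 1.5, 0.1

y = rng.normal(size=(c, n))          # fixed post-activations from layer l-1
c_out = 10000                        # many i.i.d. output channels -> good statistics
w = rng.normal(scale=np.sqrt(sigma_w2 / (c * (2*k + 1))), size=(c_out, c, 2*k + 1))
b = rng.normal(scale=np.sqrt(sigma_b2), size=(c_out, 1))

# circular cross-correlation: x[j, a] = sum_{i, beta} w[j, i, beta] * y[i, a+beta] + b[j]
shifts = np.stack([np.roll(y, -beta, axis=1) for beta in range(-k, k + 1)], axis=2)
x = np.einsum('jib,inb->jn', w, shifts) + b

empirical_var = x.var(axis=0).mean()              # variance across output channels
theory_var = sigma_w2 * (y**2).mean() + sigma_b2  # mean-field prediction
print(empirical_var, theory_var)
```

The two printed numbers agree to within Monte Carlo error, illustrating that the channel average self-averages as $c$ grows.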
In the limit of a large number of channels, the central limit theorem implies that the pre-activation vectors $x^l_i = (x^l_{i,1},\dots,x^l_{i,n})$ are i.i.d. Gaussian with mean zero and covariance matrix $\Sigma^l_{\alpha,\alpha'} = \mathbb{E}[x^l_{i,\alpha}\,x^l_{i,\alpha'}]$. Here, the expectation is taken over the weights and biases, and it is independent of the channel index $i$. In this limit, the covariance matrix takes the form (see Supplemental Materials (SM)),

$$\Sigma^l_{\alpha,\alpha'} \;=\; \frac{1}{2k+1}\sum_{\beta\in ker} \mathcal{C}(\Sigma^{l-1})_{\alpha+\beta,\,\alpha'+\beta} \qquad (2.2)$$
and is independent of the channel index $i$. A more compact representation of this equation can be given as,

$$\Sigma^l \;=\; A \star \mathcal{C}(\Sigma^{l-1}) \qquad (2.3)$$

where $A_{\beta,\beta'} = \frac{1}{2k+1}\,\delta_{\beta,\beta'}$ for $\beta,\beta' \in ker$ (and zero otherwise) and $\star$ denotes 2D circular cross-correlation, i.e. for any matrix $\Sigma$, $A \star \Sigma$ is defined as,

$$[A \star \Sigma]_{\alpha,\alpha'} \;=\; \sum_{\beta,\beta'} A_{\beta,\beta'}\,\Sigma_{\alpha+\beta,\,\alpha'+\beta'} \qquad (2.4)$$
The function $\mathcal{C}$ is related to the map of the same name defined in Poole et al. (2016) (see also (Daniely et al., 2016)) and is given by,

$$\mathcal{C}(\Sigma)_{\alpha,\alpha'} \;=\; \sigma_\omega^2\,\mathbb{E}_{u\sim\mathcal{N}(0,\Sigma)}\big[\phi(u_\alpha)\,\phi(u_{\alpha'})\big] \;+\; \sigma_b^2 \qquad (2.5)$$
All but the two dimensions $\alpha$ and $\alpha'$ in eqn. (2.5) marginalize, so, as in (Poole et al., 2016), the map $\mathcal{C}$ can be computed by a two-dimensional integral. Unlike in (Poole et al., 2016), $\alpha$ and $\alpha'$ do not correspond to different examples but rather to different spatial positions, and eqn. (2.5) characterizes how signals from a single input propagate through convolutional networks in the mean-field approximation. (The multi-input analysis proceeds in precisely the same manner as we present here, but comes with increased notational complexity and features no qualitatively different behavior, so we focus our presentation on the single-input case.)
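To illustrate the reduction to a two-dimensional integral, the sketch below (our own, with $\phi = \tanh$ and hypothetical arguments) evaluates a single entry of the $\mathcal{C}$ map by Monte Carlo over the relevant $2\times 2$ Gaussian marginal:

```python
import numpy as np

# A sketch of one entry of the C-map of eqn. (2.5) for phi = tanh: although
# Sigma is n x n, the (a, a') entry only involves the 2x2 submatrix of Sigma,
# so it reduces to a two-dimensional Gaussian integral (here, Monte Carlo).
def c_map_entry(q_a, q_b, c_ab, sigma_w2=1.0, sigma_b2=0.0,
                n_samples=200000, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[q_a, c_ab * np.sqrt(q_a * q_b)],
                    [c_ab * np.sqrt(q_a * q_b), q_b]])
    u = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    return sigma_w2 * np.mean(np.tanh(u[:, 0]) * np.tanh(u[:, 1])) + sigma_b2

# fully correlated inputs reduce to sigma_w2 * E[tanh(u)^2]
print(c_map_entry(1.0, 1.0, 1.0))
```

As expected, the output entry grows monotonically with the input correlation $c_{ab}$ and vanishes when the two coordinates are independent.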
2.1.2 Dynamics of signal propagation
We now seek to study the dynamics induced by eqn. (2.3). Schematically, our approach will be to identify fixed points of eqn. (2.3) and then linearize the dynamics around these fixed points. These linearized dynamics will dictate the stability and rate of decay towards the fixed points, which determines the depth scales over which signals in the network can propagate.
Schoenholz et al. (2017) found that for many activation functions (e.g. $\tanh$) and any choice of $\sigma_\omega^2$ and $\sigma_b^2$, the $\mathcal{C}$ map has a fixed point $\Sigma^*$ (i.e. $\mathcal{C}(\Sigma^*) = \Sigma^*$) of the form,

$$\Sigma^*_{\alpha,\alpha'} \;=\; q^*\big[\delta_{\alpha,\alpha'} + (1-\delta_{\alpha,\alpha'})\,c^*\big] \qquad (2.6)$$

where $\delta_{\alpha,\alpha'}$ is the Kronecker delta, $q^*$ is the fixed-point variance of a single input, and $c^*$ is the fixed-point correlation between two inputs. It follows from the form of eqn. (2.4) that $\Sigma^*$ is also a fixed point of the layer-to-layer covariance map in the convolutional case (eqn. (2.3)), i.e. $A \star \mathcal{C}(\Sigma^*) = \Sigma^*$.
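The fixed-point variance $q^*$ can be found by simply iterating the single-site variance map $q \mapsto \sigma_\omega^2\,\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\phi(\sqrt{q}\,z)^2] + \sigma_b^2$. A minimal sketch (our own, using Gauss-Hermite quadrature and arbitrary hyperparameters):

```python
import numpy as np

# Fixed-point iteration of the single-site variance map
#   q <- sigma_w^2 * E_z[ tanh(sqrt(q) z)^2 ] + sigma_b^2,  z ~ N(0, 1),
# using probabilists' Gauss-Hermite quadrature; the iterates converge to q*.
def q_map(q, sigma_w2, sigma_b2, n_quad=64):
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)  # nodes/weights for e^{-z^2/2}
    w = w / w.sum()                                     # normalize to an N(0,1) average
    return sigma_w2 * np.sum(w * np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2

sigma_w2, sigma_b2 = 2.0, 0.5
q = 1.0
for _ in range(500):
    q = q_map(q, sigma_w2, sigma_b2)
qstar = q
print(qstar)
```

The iteration converges geometrically, at a rate set by the slope of the map at the fixed point.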
To analyze the dynamics of the iteration map (2.3) near the fixed point $\Sigma^*$, we define $\epsilon^l = \Sigma^l - \Sigma^*$ and expand eqn. (2.3) to lowest order in $\epsilon^l$. This expansion requires the Jacobian of the $\mathcal{C}$ map evaluated at the fixed point, the properties of which we analyze in the SM. In brief, perturbations to the diagonal and off-diagonal entries of $\Sigma^l$ evolve independently, and the Jacobian decomposes into a diagonal eigenspace $S_q$ with eigenvalue $\chi_{q^*}$ and an off-diagonal eigenspace $S_c$ with eigenvalue $\chi_{c^*}$. The eigenvalues are given by (by the symmetry of $\Sigma^*$, these expectations are independent of spatial location and of the choice of $\alpha \neq \alpha'$),

$$\chi_{q^*} = \sigma_\omega^2\,\mathbb{E}_{\Sigma^*}\big[\phi'(u_\alpha)\phi'(u_{\alpha'}) + \phi(u_\alpha)\phi''(u_{\alpha'})\big], \qquad \chi_{c^*} = \sigma_\omega^2\,\mathbb{E}_{\Sigma^*}\big[\phi'(u_\alpha)\phi'(u_{\alpha'})\big] \qquad (2.7)$$

and the eigenspaces have bases,

$$S_q = \big\{\,e_\alpha e_\alpha^T : \alpha \in sp\,\big\}, \qquad S_c = \big\{\,e_\alpha e_{\alpha'}^T + e_{\alpha'} e_\alpha^T : \alpha \neq \alpha' \in sp\,\big\} \qquad (2.8)$$

i.e. $S_q$ spans perturbations to the diagonal of $\Sigma$ and $S_c$ spans perturbations to its off-diagonal entries. Note that $\chi_{q^*}$ and $\chi_{c^*}$ were also found in Schoenholz et al. (2017) to control signal propagation in the fully-connected case. The constant appearing in this decomposition is given in Lemma B.2 of the SM but does not concern us here. This eigendecomposition implies that the layerwise deviations from the fixed point evolve under eqn. (2.3) as,
$$\epsilon^{l} \;=\; A \star \big(\chi_{q^*}\,\epsilon_q^{l-1} + \chi_{c^*}\,\epsilon_c^{l-1}\big) \qquad (2.9)$$

where $\epsilon_q^{l-1}$ and $\epsilon_c^{l-1}$ are the projections of $\epsilon^{l-1}$ onto the eigenspaces $S_q$ and $S_c$.
Eqn. (2.9) defines the linear dynamics of random convolutional neural networks near their fixed points and is the basis for the in-depth analysis of the following subsections.
2.1.3 Multidimensional signal propagation
In the fully-connected setting, the dynamics of signal propagation near the fixed point are governed by scalar evolution equations. In contrast, the convolutional setting enjoys much richer dynamics, as eqn. (2.9) describes a multidimensional system that we now analyze.
It follows from eqns. (2.4) and (2.8) (see also the SM) that $A\star$ does not mix the diagonal and off-diagonal eigenspaces, i.e. it maps the span of $S_q$ into itself and the span of $S_c$ into itself. To see this, note that for $D = e_\alpha e_\alpha^T \in S_q$, the definition (2.4) together with the diagonality of $A$ implies that $A \star D$ is itself a diagonal matrix. This property ensures that $A \star D$ can be expressed as a linear combination of matrices in $S_q$, which means it also belongs to the span of $S_q$. The same argument applies to $S_c$. As a result, these eigenspaces evolve entirely independently under the linearization of the covariance iteration map (2.3).
Let $l_0$ denote the depth over which transient effects persist and after which eqn. (2.9) accurately describes the linearized dynamics. Therefore, at depths $l$ larger than $l_0$, we have, within each eigenspace,

$$\epsilon^{l} \;=\; \chi^{\,l-l_0}\,\underbrace{A \star \cdots \star A}_{l-l_0\ \text{times}} \star\, \epsilon^{l_0} \qquad (2.10)$$

where $\chi$ stands for $\chi_{q^*}$ on $S_q$ and $\chi_{c^*}$ on $S_c$. This matrix-valued equation is still somewhat complicated owing to the nested applications of $A\star$. To further elucidate the dynamics, we can move to a Fourier basis, which diagonalizes the circular cross-correlation operator and decouples the modes of eqn. (2.10). In particular, let $\mathcal{F}$ denote the 2D discrete Fourier transform and $\tilde{\epsilon}^l_f = \mathcal{F}(\epsilon^l)_f$ denote a Fourier mode of $\epsilon^l$. Then eqn. (2.10) becomes a simple scalar equation,

$$\tilde{\epsilon}^{l}_f \;=\; (\chi\,\lambda_f)^{\,l-l_0}\,\tilde{\epsilon}^{l_0}_f \qquad (2.11)$$
with $\lambda_f = \mathcal{F}(A)_f$. Thus, the linearized dynamics of convolutional neural networks decouple into independently-evolving Fourier modes that approach the fixed point at frequency-dependent rates.
2.1.4 Fixed-point analysis
The stability of the fixed point $\Sigma^*$ is determined by whether nearby points move closer to or farther from $\Sigma^*$ under the dynamics described by eqn. (2.9). Eqn. (2.11) shows that this condition depends on whether the quantities $|\chi_{q^*}\lambda_f|$ and $|\chi_{c^*}\lambda_f|$ are less than or greater than one.
Since $A$ is a diagonal matrix, the eigenvalues $\lambda_f$ have a specific structure. In particular, the set of eigenvalues is comprised of $n$ copies of the 1D discrete Fourier transform of the diagonal entries of $A$. Furthermore, since the diagonal entries of $A$ are non-negative and sum to one, their Fourier coefficients have absolute value no larger than one and the zero-frequency coefficient is equal to one; see Figure 4 for the full distribution in the case of 2D convolutions. It follows that the fixed point will be stable if and only if $\chi_{q^*} \le 1$ and $\chi_{c^*} \le 1$.
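These spectral claims are easy to check numerically. The sketch below (our own, with hypothetical sizes) places the uniform averaging weights $1/(2k+1)$ on the kernel taps and computes the eigenvalues as the 1D DFT of that diagonal:

```python
import numpy as np

# The diagonal of A holds the uniform weights 1/(2k+1) on the 2k+1 kernel
# taps (placed circularly); the eigenvalues of the averaging operator are
# copies of the 1D DFT of this diagonal. We check the stability claims:
# zero-frequency coefficient equal to one, all magnitudes at most one.
k, n = 2, 16
diag = np.zeros(n)
diag[np.arange(-k, k + 1)] = 1.0 / (2*k + 1)  # negative indices wrap around

lam = np.fft.fft(diag)                        # 1D DFT -> eigenvalues lambda_f
print(lam[0].real)                            # zero-frequency coefficient
print(np.abs(lam).max())
```

For any non-negative averaging vector summing to one, the same bounds hold, which is what makes the uniform kernel marginally stable only at zero frequency.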
These stability conditions are precisely the ones found to govern fully-connected networks (Poole et al., 2016; Schoenholz et al., 2017). Moreover, the fixed point matrix $\Sigma^*$ is also the same as in the fully-connected case. Together, these observations imply that the entire fixed-point structure of the convolutional case is identical to that of the fully-connected case. In particular, based on the results of (Poole et al., 2016), we can immediately conclude that the $(\sigma_b^2, \sigma_\omega^2)$ hyperparameter plane is separated by the line $\chi_{c^*} = 1$ into an ordered phase with $\chi_{c^*} < 1$ in which all pixels approach the same value, and a chaotic phase with $\chi_{c^*} > 1$ in which the pixels become decorrelated with one another; see the SM for a review of this phase diagram analysis.
2.1.5 Depth scales of signal propagation
We now assume that the conditions for a stable fixed point are met, i.e. $\chi_{q^*} \le 1$ and $\chi_{c^*} \le 1$, and we consider the rate at which the fixed point is approached. As in (Schoenholz et al., 2017), it is convenient to additionally assume $\chi_{q^*} < \chi_{c^*}$ so that the dynamics in the diagonal subspace can be neglected. In this case, eqn. (2.11) can be rewritten as

$$\tilde{\epsilon}^{l}_f \;=\; e^{-(l-l_0)/\xi_f}\,\tilde{\epsilon}^{l_0}_f\,, \qquad \xi_f^{-1} = -\log\!\big(\chi_{c^*}\,|\lambda_f|\big) \qquad (2.12)$$

where the $\xi_f$ are depth scales governing the convergence of the different modes. In particular, we expect signals corresponding to a specific Fourier mode $f$ to be able to travel a depth commensurate with $\xi_f$ through the network. Thus, unlike fully-connected networks, which exhibit only a single depth scale, convolutional networks feature a hierarchy of depth scales.
Recalling that $\lambda_f = 1$ for the zero-frequency modes, it follows that $\xi_0 = -1/\log \chi_{c^*}$, which is identical to the depth scale governing signal propagation through fully-connected networks. It follows from (Schoenholz et al., 2017) that when $\chi_{c^*} = 1$, $\xi_0$ diverges and thus convolutional networks can propagate signals arbitrarily far through the $\lambda_f = 1$ modes. Since $|\lambda_f| < 1$ for all other $f$, these are the only modes through which signals can propagate without attenuation. Finally, we note that the $\lambda_f = 1$ modes correspond to perturbations that are spatially uniform along the cyclic diagonals of the covariance matrix. The fact that all signals with additional spatial structure attenuate at large depth suggests that deep critical convolutional networks behave quite similarly to fully-connected networks, which also cannot propagate spatially-structured signals.
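The hierarchy of depth scales can be made concrete: at criticality ($\chi_{c^*} = 1$), only the unit-modulus Fourier modes have divergent depth scales, while all other modes attenuate over a finite depth. A sketch (our own, assuming a uniform kernel and hypothetical sizes):

```python
import numpy as np

# Depth scales xi_f = -1 / log(chi_c * |lambda_f|), evaluated at criticality
# (chi_c = 1) for a uniform averaging kernel: the zero-frequency mode has an
# infinite depth scale, every other mode has a finite one.
k, n = 2, 16
diag = np.zeros(n)
diag[np.arange(-k, k + 1)] = 1.0 / (2*k + 1)
lam = np.abs(np.fft.fft(diag))      # |lambda_f|

chi_c = 1.0                          # critical initialization
with np.errstate(divide='ignore'):
    xi = np.where(chi_c * lam >= 1.0, np.inf, -1.0 / np.log(chi_c * lam))
print(xi)
```

Off criticality ($\chi_{c^*} < 1$), every entry of `xi` becomes finite, recovering the fully-connected picture of a single bounded propagation depth.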
2.1.6 Non-uniform kernels
The similarities between signal propagation in convolutional neural networks and fully-connected networks in the limit of large depth are surprising. A consequence may be that the performance of very deep convolutional networks degrades as the signal is forced to propagate along modes with minimal spatial structure. Indeed, Fig. 3 shows that the generalization performance decreases with depth, and that for very large depth it barely surpasses the performance of a fully-connected network.
If increased spatial uniformity is the problem, eqn. (2.12) holds the solution. In order for all modes to propagate without attenuation, it is necessary that $|\lambda_f| = 1$ for all $f$. In fact, it is easy to show that the distribution of $\lambda_f$ can be modified by allowing for spatial non-uniformity in the variance of the weights within the kernel. To this end, we introduce a non-negative vector $v = (v_{-k},\dots,v_k)$ chosen such that $\sum_{\beta} v_\beta = 1$, and initialize the weights of the network according to $\omega^l_{j,i,\beta} \sim \mathcal{N}(0, \sigma_\omega^2\, v_\beta / c)$. Each choice of $v$ will induce a new dynamical equation analogous to eqn. (2.3) (see SM),

$$\Sigma^l \;=\; A_v \star \mathcal{C}(\Sigma^{l-1}) \qquad (2.13)$$

where $(A_v)_{\beta,\beta'} = v_\beta\,\delta_{\beta,\beta'}$ for $\beta,\beta' \in ker$. It follows directly from the previous analysis that the linearized dynamics of eqn. (2.13) will be identical to the dynamics of eqn. (2.3), only now with $\lambda_f = \mathcal{F}(A_v)_f$. By the same argument presented in Section 2.1.3, the set of eigenvalues is now comprised of $n$ copies of the 1D Fourier transform of $v$. As a result, it is possible to control the depth scales over which different modes of the signal can propagate through the network by changing the variance vector $v$. We will return to this point in Section 2.4.
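The effect of a non-uniform variance vector is immediate to verify numerically: a one-hot choice of $v$ gives every Fourier coefficient unit modulus, while the uniform choice does not. A sketch (our own, 1D, with hypothetical sizes):

```python
import numpy as np

# Eigenvalue magnitudes |lambda_f| as the 1D DFT of the variance vector v:
# comparing the uniform vector against the one-hot ("delta") vector, which
# concentrates all variance at the spatial center of the kernel.
n, k = 16, 2
v_uniform = np.zeros(n)
v_uniform[np.arange(-k, k + 1)] = 1.0 / (2*k + 1)
v_delta = np.zeros(n)
v_delta[0] = 1.0                      # all variance at the center tap

lam_uniform = np.abs(np.fft.fft(v_uniform))
lam_delta = np.abs(np.fft.fft(v_delta))
print(lam_delta)                      # all ones: every Fourier mode propagates
```

Only the delta vector achieves $|\lambda_f| = 1$ for every frequency, foreshadowing the Delta-Orthogonal initialization of Section 2.4.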
2.2 Backpropagation of signal
We now turn our attention to the backpropagation of error signals through a convolutional network. Let $E$ denote the loss and $\delta^l_{i,\alpha}$ the backpropagated signal at layer $l$, channel $i$ and spatial location $\alpha$, i.e.,

$$\delta^l_{i,\alpha} \;=\; \frac{\partial E}{\partial x^l_{i,\alpha}} \qquad (2.14)$$

The recurrence relation is given by $\delta^l_{i,\alpha} = \phi'(x^l_{i,\alpha}) \sum_{j}\sum_{\beta\in ker} \omega^{l+1}_{j,i,\beta}\,\delta^{l+1}_{j,\alpha-\beta}$. As in (Schoenholz et al., 2017), we additionally make the assumption that the weights used during backpropagation are drawn independently from the weights used in forward propagation, in which case the random variables $\delta^l_{i,\alpha}$ are independent for each channel $i$. The covariance matrices $\tilde{\Sigma}^l_{\alpha,\alpha'} = \mathbb{E}[\delta^l_{i,\alpha}\,\delta^l_{i,\alpha'}]$ backpropagate according to,

$$\tilde{\Sigma}^l_{\alpha,\alpha'} \;=\; \sigma_\omega^2\,\mathbb{E}\big[\phi'(x^l_{\alpha})\,\phi'(x^l_{\alpha'})\big]\,\big[A \star \tilde{\Sigma}^{l+1}\big]_{\alpha,\alpha'} \qquad (2.15)$$
We are primarily interested in the diagonal of $\tilde{\Sigma}^l$, which measures the variance of backpropagated signals. We will also assume $l \gg l_0$ (see Section 2.1.3) so that $\Sigma^l$ is well-approximated by $\Sigma^*$. In this case,

$$\tilde{\Sigma}^l_{\alpha,\alpha} \;=\; \chi_{c^*}\,\big[A \star \tilde{\Sigma}^{l+1}\big]_{\alpha,\alpha} \qquad (2.16)$$

where we used eqn. (2.7). Therefore we find that $\tilde{\Sigma}^l_{\alpha,\alpha} \sim \chi_{c^*}^{\,L-l}$, where $L$ is the total depth of the network. As in the fully-connected case, $\chi_{c^*} = 1$ is a necessary condition for gradient signals to neither explode nor vanish as they backpropagate through a convolutional network. However, as discussed in (Pennington et al., 2017, 2018), this is not always a sufficient condition for trainability. To further understand backward signal propagation, we need to push our analysis beyond mean field theory.
2.2.1 Beyond mean field theory
We have observed that the quantity $\chi_{c^*}$ is crucial for determining signal propagation in CNNs, both in the forward and backward directions. As discussed in (Poole et al., 2016), this quantity equals the mean squared singular value of the Jacobian of the layer-to-layer transition operator. Beyond just the second moment, higher moments and indeed the whole distribution of singular values of the entire end-to-end Jacobian $J$ are important for ensuring trainability of very deep fully-connected networks (Pennington et al., 2017, 2018). Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to $1$.

In fact, we can adopt the entire analysis of (Pennington et al., 2017, 2018) in the convolutional setting with essentially no modification. The reason stems from the fact that, because convolution is a linear operator, it has a matrix representation, $W^l$, which appears in the end-to-end Jacobian in precisely the same manner as do the weight matrices in the fully-connected case. In particular, $J = \prod_{l} D^l W^l$, where $D^l$ is the diagonal matrix whose diagonal elements contain the vectorized representation of the derivatives of the post-activation neurons in layer $l$. Roughly speaking, since this is the same expression as in (Pennington et al., 2017, 2018), the conclusions found in that work regarding dynamical isometry apply equally well in the convolutional setting.

The analysis of Pennington et al. (2017, 2018) reveals that the singular values of $J$ depend crucially on the distribution of singular values of the $D^l$ and $W^l$. In particular, to achieve dynamical isometry, all of these matrices should be close to orthogonal. As in the fully-connected case, the singular values of $D^l$ can be made arbitrarily close to $1$ by choosing a small value for $q^*$ and by using an activation function like $\tanh$ that is smooth and linear near the origin. In the convolutional setting, the matrix representation $W^l$ of the convolution operator is a block matrix with circulant blocks. Note that in the large-$c$ limit the matrix dimension $cn$ grows while the relative size of the blocks vanishes. Therefore, if the weights are i.i.d. random variables, we can invoke universality results from random matrix theory to conclude that its singular value distribution converges to the Marchenko-Pastur distribution; see Fig. S4 in the SM. As such, we find that CNNs with i.i.d. weights cannot achieve dynamical isometry. We address this issue in the next section.
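This conclusion can be checked directly by forming the block-circulant matrix representation of a random 1D convolution and inspecting its singular values. A sketch (our own, with hypothetical sizes and i.i.d. Gaussian weights normalized so the mean squared singular value is one):

```python
import numpy as np

# Matrix representation of a 1D circular convolution: a cn x cn block-circulant
# matrix. With i.i.d. Gaussian kernel entries its singular values spread out
# (Marchenko-Pastur-like), so the operator cannot be close to orthogonal even
# though its mean squared singular value is one.
rng = np.random.default_rng(0)
c, k, n = 32, 1, 8
w = rng.normal(scale=np.sqrt(1.0 / (c * (2*k + 1))), size=(2*k + 1, c, c))

W = np.zeros((c * n, c * n))
for beta in range(-k, k + 1):
    for a in range(n):
        b = (a + beta) % n                         # periodic boundary conditions
        W[a*c:(a+1)*c, b*c:(b+1)*c] += w[beta + k]

s = np.linalg.svd(W, compute_uv=False)
print(s.min(), s.max())
```

The smallest and largest singular values sit well away from one on both sides, so products of many such layers become severely ill-conditioned.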
2.3 Orthogonal Initialization for CNNs
In (Pennington et al., 2017, 2018), it was observed that dynamical isometry can lead to dramatic improvements in training speed, and that achieving these favorable conditions requires orthogonal weight initializations. While the procedure to generate random orthogonal weight matrices in the fully-connected setting is well-known, it is less obvious how to do so in the convolutional setting, and at first sight it is not at all clear whether it is even possible. We resolve this question by invoking a result from the wavelet literature (Kautsky & Turcajová, 1994) and provide an explicit construction. We will focus on the two-dimensional convolution here and begin with some notation.
Definition 2.1.
We say a kernel $K$ is an orthogonal kernel if the convolution operator it defines is norm-preserving, i.e. for all inputs $x$, $\|K * x\|_2 = \|x\|_2$.
Definition 2.2.
Consider the block matrices $P = (P_{ij})$ and $Q = (Q_{ij})$, with constituent blocks $P_{ij}$ and $Q_{ij}$. Define the block-wise convolution operator $\circledast$ by,

$$\big(P \circledast Q\big)_{ij} \;=\; \sum_{p,q} P_{p,q}\,Q_{i-p,\,j-q} \qquad (2.17)$$

where the out-of-range matrices are taken to be zero.
2.4 Delta-Orthogonal Initialization
In Section 2.1.5 it was observed that, in contrast to fully-connected networks, CNNs have multiple depth scales controlling the propagation of signals along different Fourier modes. Even at criticality, for generic variance-averaging vectors $v$, the majority of these depth scales are finite. However, there does exist one special averaging vector for which all of the depth scales are infinite: a one-hot vector, i.e. $v_\beta = \delta_{\beta,0}$. This kernel places all of its variance in the spatial center of the kernel and zero variance elsewhere. In this case, the eigenvalues $\lambda_f$ are all equal to $1$ and all depth scales diverge, implying that signals can propagate arbitrarily far along all Fourier modes.
If we combine this special averaging vector with the orthogonal initialization of the previous section, we obtain a powerful new initialization scheme that we call Delta-Orthogonal initialization. Matrices of this type can be generated from Algorithm 1 by padding with appropriate zeros, or directly from Algorithm 2 in the SM. In the following sections, we demonstrate experimentally that extraordinarily deep convolutional networks can be trained with these initialization techniques.
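For concreteness, here is a minimal sketch of the idea behind Delta-Orthogonal initialization in the 1D case (our own illustration, not the paper's Algorithm 2): a random orthogonal matrix placed at the center tap of an otherwise-zero kernel makes the circular convolution exactly norm-preserving:

```python
import numpy as np

# Delta-Orthogonal initialization, 1D sketch: draw a Haar-random c x c
# orthogonal matrix H (QR of a Gaussian, with sign correction) and place it
# at the spatial center of the kernel, with zeros at every other tap.
rng = np.random.default_rng(0)
c, k, n = 16, 1, 10

g = rng.normal(size=(c, c))
q_mat, r = np.linalg.qr(g)
h = q_mat * np.sign(np.diag(r))      # Haar-distributed orthogonal matrix

kernel = np.zeros((2*k + 1, c, c))
kernel[k] = h                        # all weight at the center tap

# apply the circular convolution to a test input and compare norms
x = rng.normal(size=(c, n))
y = sum(np.dot(kernel[beta + k], np.roll(x, -beta, axis=1))
        for beta in range(-k, k + 1))
print(np.linalg.norm(y), np.linalg.norm(x))   # equal norms
```

Because the kernel acts as $H$ independently at every spatial location, the operator is orthogonal in the sense of Definition 2.1 while still having the spatial shape of a convolution.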
Algorithm 1: 2D orthogonal kernels for CNNs (available as a TensorFlow initializer).

3 Experiments
To support the theoretical results built up in Section 2, we trained a large number of very deep CNNs on MNIST and CIFAR-10 with $\tanh$ as the activation function. We use the following vanilla CNN architecture. First we apply three convolutions with strides 1, 2 and 2 in order to increase the channel size and reduce the spatial dimension (the reduced spatial dimension is slightly larger for CIFAR-10 than for MNIST), and then a deep block of equal-width convolutions whose depth varies across experiments. Finally, an average pooling layer and a fully-connected layer are applied. To maximally support our theories, we applied no common training techniques (including learning rate decay). Note that the early downsampling is necessary from a computational perspective, but it does diminish the maximum achievable performance; e.g. our best achieved test accuracy with downsampling was 82% on CIFAR-10. We performed an additional experiment training a 50-layer network without downsampling. This resulted in a test accuracy comparable to the best performance on CIFAR-10 that we were able to find for a similar architecture (Mishkin & Matas, 2015).

3.1 Trainability and Critical Initialization
The analysis in Section 2.1 gives a prediction for precisely which initialization hyperparameters a CNN will be trainable with. In particular, we predict that the network ought to be trainable provided its depth $L$ does not greatly exceed the depth scale $\xi_0$. To test this, we train a large number of convolutional neural networks on MNIST with varying depth and with weights initialized across a range of $\sigma_\omega^2$. In Fig. 2 we plot, using a heatmap, the training accuracy obtained by these networks after different numbers of steps. Additionally, we overlay the depth scale predicted by our theory, $\xi_0$. We find strikingly good agreement between our theory of random networks and the results of our experiments.
3.2 Orthogonal Initialization and Ultra-deep CNNs
We argued in Section 2.2.1 that the input-output Jacobian of CNNs with i.i.d. weights will become increasingly ill-conditioned as the number of layers grows. On the other hand, orthogonal weight initializations can achieve dynamical isometry and dramatically boost the training speed. To verify this, we train a 4,000-layer CNN on MNIST using both a critically-tuned Gaussian weight initialization and the orthogonal initialization scheme developed in Section 2.3. Fig. 5 shows that the network with Gaussian initialization learns slowly (test and training accuracy remain low even after about 60 epochs). In contrast, the orthogonal initialization learns quickly, reaching high test accuracy after only 1 epoch and converging after about 7 epochs.

3.3 Multidimensional Signal Propagation
The analyses in Sections 2.1.3 and 2.1.6 suggest that CNNs initialized with kernels of spatially uniform variance may suffer a degradation in generalization performance as the depth increases. Fig. 3 shows the learning curves of CNNs of varying depth on CIFAR-10. Although the orthogonal initialization enables even the deepest model to reach high training accuracy, the test accuracy decays as the depth increases, with the deepest model generalizing only marginally better than a fully-connected network.
To test whether this degradation in performance may be the result of attenuation of spatially non-uniform signals, we trained a variety of models on CIFAR-10 whose kernels were initialized with spatially non-uniform variance. According to the analysis in Section 2.1.6, changing the shape of this non-uniformity controls the depth scales over which different Fourier components of the signal can propagate through the network. We examined five different non-uniform critical Gaussian initialization methods. The variance vectors $v$ were chosen in the following way: GS0 refers to the one-hot delta initialization, for which the eigenvalues $\lambda_f$ are all equal to 1. GS1, GS2 and GS3 are obtained by interpolating between GS0 and GS4, which is the uniform-variance initialization. Each variance vector induces a set of singular values, plotted in Fig. 4(b) in descending order. Note that from GS0 to GS4, the singular values become more poorly-conditioned (the distribution becomes more concentrated around 0). Fig. 4(a) shows that the relative falloff of generalization performance with depth follows the same pattern: the more poorly-conditioned the singular values, the worse the model generalizes. These observations suggest that salient information may be propagating along multiple Fourier modes.
3.4 Training 10,000 Layers: Delta-Orthogonal Initialization
Our theory predicts that ultra-deep CNNs can train faster and perform better if critically initialized using Delta-Orthogonal kernels. To test this theory, we train CNNs of 1,250, 2,500, 5,000 and 10,000 layers on both MNIST and CIFAR-10 (Fig. 1). All these networks learn surprisingly quickly and, remarkably, the learning time measured in number of training epochs is independent of depth. Furthermore, our experimental results match well with the predicted benefits of this initialization in terms of test accuracy on both MNIST and CIFAR-10, even for 10,000-layer networks. To isolate the benefits of the Delta-Orthogonal initialization, we also train a 2048-layer CNN (Fig. 3) using the spatially-uniform orthogonal initialization proposed in Section 2.3; its test accuracy is noticeably lower. Note that the test accuracy using (spatially uniform) Gaussian (non-orthogonal) initialization has already degraded substantially at depth 259.
4 Discussion
In this work, we developed a theoretical framework based on mean field theory to study the propagation of signals in deep convolutional neural networks. By examining the necessary conditions for signals to flow both forward and backward through the network without attenuation, we derived an initialization scheme that facilitates training of vanilla CNNs of unprecedented depths. We presented an algorithm for the generation of random orthogonal convolutional kernels, an ingredient that is necessary to enable dynamical isometry, i.e. good conditioning of the network's input-output Jacobian. In contrast to the fully-connected case, signal propagation in CNNs is intrinsically multidimensional; we showed how to decompose those signals into independent Fourier modes and how to promote uniform signal propagation across them. By leveraging these various theoretical insights, we demonstrated empirically that it is possible to train vanilla CNNs with 10,000 layers or more.
Our results indicate that we have removed all the major fundamental obstacles to training arbitrarily deep vanilla convolutional networks. In doing so, we have laid the groundwork to begin addressing some outstanding questions in the deep learning community, such as whether depth alone can deliver enhanced generalization performance. Our initial results suggest that past a certain depth, on the order of tens or hundreds of layers, the test performance of vanilla convolutional architectures saturates. These observations suggest that architectural features such as residual connections and batch normalization are likely to play an important role in defining a good model class, rather than simply enabling efficient training.
Acknowledgements
We thank Xinyang Geng, Justin Gilmer, Alex Kurakin, Jaehoon Lee, Hoang Trieu Trinh, and Greg Yang for useful discussions and feedback.
References

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems 29, pp. 2253–2261. Curran Associates, Inc., 2016.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018.

Hayou, S., Doucet, A., and Rousseau, J. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.

Karakida, R., Akaho, S., and Amari, S.-i. Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv e-prints, June 2018.

Kautsky, J. and Turcajová, R. A matrix approach to discrete wavelets. In Wavelets: Theory, Algorithms, and Applications, volume 5 of Wavelet Analysis and Its Applications, pp. 117–135. Academic Press, 1994.

Kim, Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Mishkin, D. and Matas, J. All you need is a good init. CoRR, abs/1511.06422, 2015. URL http://arxiv.org/abs/1511.06422.

Pennington, J., Schoenholz, S., and Ganguli, S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems 30, pp. 4788–4798. Curran Associates, Inc., 2017.

Pennington, J., Schoenholz, S. S., and Ganguli, S. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1924–1932, 2018. URL http://proceedings.mlr.press/v84/pennington18a.html.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In NIPS, 2016.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. In ICLR, 2017.

Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. A correspondence between random neural networks and statistical field theory. arXiv preprint arXiv:1710.06570, 2017.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Yang, G. and Schoenholz, S. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems 30, pp. 2865–2873. Curran Associates, Inc., 2017.

Yang, G. and Schoenholz, S. S. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion, 2018. URL https://openreview.net/forum?id=rJGY8GbR.
Appendix A Discussion of Mean Field Theory
Consider an $L$-layer 1D periodic CNN (for notational simplicity, as in the main text, we again consider 1D convolutions, but the 2D case proceeds identically) with filter size $2k+1$, channel size $c$, spatial size $n$, per-layer weight tensors $\omega^{(l)}$, and biases $b^{(l)}$. Let $\phi$ be the activation function and let $z^{(l)}_{j,\alpha}$ denote the pre-activation at layer $l$, channel $j$, and spatial location $\alpha$. Suppose the weights are drawn i.i.d. from the Gaussian $\mathcal{N}\!\big(0,\sigma_\omega^2/(c(2k+1))\big)$ and the biases are drawn i.i.d. from the Gaussian $\mathcal{N}(0,\sigma_b^2)$. The forward-propagation dynamics can be described by the recurrence relation,
$$z^{(l+1)}_{i,\alpha} \;=\; \sum_{j=1}^{c}\sum_{\beta=-k}^{k}\omega^{(l+1)}_{i,j,\beta}\,\phi\big(z^{(l)}_{j,\alpha+\beta}\big) \;+\; b^{(l+1)}_{i}.$$
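As a concrete illustration, the forward recurrence above can be sketched numerically. This is a minimal sketch under the parameterization just described (weight variance $\sigma_\omega^2/(c(2k+1))$, bias variance $\sigma_b^2$, periodic boundary conditions); the function name `conv_layer` and the tanh nonlinearity are illustrative choices, not prescribed by the text.

```python
import numpy as np

def conv_layer(z, omega, b, phi=np.tanh):
    """One layer of a 1D periodic CNN in the mean-field parameterization.

    z:     (c, n) pre-activations at layer l
    omega: (c, c, 2k+1) weights, entries ~ N(0, sigma_w^2 / (c*(2k+1)))
    b:     (c,) biases, entries ~ N(0, sigma_b^2)
    """
    c, n = z.shape
    k = (omega.shape[-1] - 1) // 2
    h = phi(z)                       # post-activations
    z_next = np.zeros((c, n))
    for beta in range(-k, k + 1):
        # periodic (circular) shift implements spatial index alpha + beta mod n
        z_next += omega[:, :, beta + k] @ np.roll(h, -beta, axis=1)
    return z_next + b[:, None]

rng = np.random.default_rng(0)
c, n, k = 256, 16, 1
sigma_w, sigma_b = 1.0, 0.5
z = rng.normal(size=(c, n))
omega = rng.normal(scale=sigma_w / np.sqrt(c * (2 * k + 1)), size=(c, c, 2 * k + 1))
b = rng.normal(scale=sigma_b, size=c)
z1 = conv_layer(z, omega, b)
print(z1.shape)  # (256, 16)
```

Stacking this layer repeatedly gives the deep signal-propagation setting studied in the appendix.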
For $l \ge 1$, note that (a) $\{z^{(l)}_{j}\}_{1\le j\le c}$ are i.i.d. random variables and (b) for each $i$, $z^{(l+1)}_{i,\alpha}$ is a sum of $c(2k+1)$ i.i.d. random variables with mean zero. The central limit theorem implies that, as $c \to \infty$, $\{z^{(l+1)}_{j}\}_{j}$ are i.i.d. Gaussian random variables. Let $\Sigma^{(l+1)}$ denote the covariance matrix, where
$$\Sigma^{(l+1)}_{\alpha,\alpha'} \;=\; \mathbb{E}\big[z^{(l+1)}_{i,\alpha}\,z^{(l+1)}_{i,\alpha'}\big],$$
and the expectation is taken over all random variables in and before layer $l+1$. Therefore, we have the following lemma.
Lemma A.1.
As $c \to \infty$, for each $l \ge 1$, $\{z^{(l)}_{j}\}_{1\le j\le c}$ is a collection of i.i.d. mean zero Gaussians with covariance matrix $\Sigma^{(l)}$ satisfying the recurrence relation,
(S1) $\displaystyle \Sigma^{(l+1)}_{\alpha,\alpha'} \;=\; \sigma_b^2 \;+\; \frac{\sigma_\omega^2}{2k+1}\sum_{\beta=-k}^{k}\mathcal{C}\big(\Sigma^{(l)}\big)_{\alpha+\beta,\,\alpha'+\beta}.$
Proof.
Let $h^{(l)}_{j,\alpha} = \phi\big(z^{(l)}_{j,\alpha}\big)$ and note that the weights and biases of layer $l+1$ are independent of $z^{(l)}$. Then,
(S2) $\displaystyle \Sigma^{(l+1)}_{\alpha,\alpha'} = \mathbb{E}\Big[\Big(\sum_{j,\beta}\omega^{(l+1)}_{i,j,\beta}\,h^{(l)}_{j,\alpha+\beta} + b^{(l+1)}_{i}\Big)\Big(\sum_{j',\beta'}\omega^{(l+1)}_{i,j',\beta'}\,h^{(l)}_{j',\alpha'+\beta'} + b^{(l+1)}_{i}\Big)\Big]$
(S3) $\displaystyle \phantom{\Sigma^{(l+1)}_{\alpha,\alpha'}} = \sigma_b^2 + \frac{\sigma_\omega^2}{c(2k+1)}\sum_{j=1}^{c}\sum_{\beta=-k}^{k}\mathbb{E}\big[h^{(l)}_{j,\alpha+\beta}\,h^{(l)}_{j,\alpha'+\beta}\big],$
where we used the fact that,
$$\mathbb{E}\big[\omega^{(l+1)}_{i,j,\beta}\,\omega^{(l+1)}_{i,j',\beta'}\big] \;=\; \frac{\sigma_\omega^2}{c(2k+1)}\,\delta_{j,j'}\,\delta_{\beta,\beta'}.$$
Note that $\Sigma^{(l+1)}$ can be computed once $\Sigma^{(l)}$ (or $z^{(l)}$) is given. We will proceed by induction. Let $l$ be fixed and assume $\{z^{(l)}_{j}\}_{j}$ are i.i.d. mean zero Gaussian with covariance $\Sigma^{(l)}$. It is not difficult to see that $\{z^{(l+1)}_{j}\}_{j}$ are also i.i.d. mean zero Gaussian as $c \to \infty$. To compute the covariance, note that for any fixed pair $(\alpha,\alpha')$, $\{h^{(l)}_{j,\alpha}h^{(l)}_{j,\alpha'}\}_{1\le j\le c}$ are i.i.d. random variables. Then, by the law of large numbers,
(S4) $\displaystyle \frac{1}{c}\sum_{j=1}^{c}h^{(l)}_{j,\alpha+\beta}\,h^{(l)}_{j,\alpha'+\beta} \;\longrightarrow\; \mathbb{E}\big[\phi(u_{\alpha+\beta})\,\phi(u_{\alpha'+\beta})\big], \qquad u \sim \mathcal{N}\big(0,\Sigma^{(l)}\big).$
Thus by eq. (2.5), eq. (S2) can be written as,
(S5) $\displaystyle \Sigma^{(l+1)}_{\alpha,\alpha'} \;=\; \sigma_b^2 \;+\; \frac{\sigma_\omega^2}{2k+1}\sum_{\beta=-k}^{k}\mathcal{C}\big(\Sigma^{(l)}\big)_{\alpha+\beta,\,\alpha'+\beta},$
so that,
(S6) $\displaystyle \Sigma^{(l+1)} \;=\; \sigma_b^2\,\mathbf{1}\mathbf{1}^{T} \;+\; \sigma_\omega^2\,\big(\mathcal{A}\circ\mathcal{C}\big)\big(\Sigma^{(l)}\big),$
where $\mathcal{A}$ averages over the $2k+1$ diagonal shifts, $[\mathcal{A}(\Sigma)]_{\alpha,\alpha'} = \frac{1}{2k+1}\sum_{\beta=-k}^{k}\Sigma_{\alpha+\beta,\alpha'+\beta}$.
∎
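The variance computation in the proof can be checked by Monte Carlo: conditioning on fixed post-activations $h = \phi(z^{(l)})$, the channel-wise covariance of the next layer's pre-activations is determined entirely by the weight variance $\sigma_\omega^2/(c(2k+1))$ and the bias variance $\sigma_b^2$. A small sketch under those i.i.d. assumptions (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
c, n, k = 4, 8, 1
sigma_w, sigma_b = 1.2, 0.3
k_size = 2 * k + 1

h = rng.normal(size=(c, n))  # fixed post-activations phi(z^l)
# shifted[j, beta + k, a] = h[j, (a + beta) mod n]  (periodic boundary)
shifted = np.stack([np.roll(h, -beta, axis=1) for beta in range(-k, k + 1)], axis=1)
S = shifted.reshape(c * k_size, n)

# Monte Carlo over the layer-(l+1) weights/biases, one output channel
trials = 200_000
omega = rng.normal(scale=sigma_w / np.sqrt(c * k_size), size=(trials, c * k_size))
b = rng.normal(scale=sigma_b, size=(trials, 1))
z = omega @ S + b                    # (trials, n) pre-activations
cov_mc = z.T @ z / trials

# prediction: sigma_b^2 + sigma_w^2/(c(2k+1)) * sum_{j,beta} h_{j,a+b} h_{j,a'+b}
cov_th = sigma_b**2 + (sigma_w**2 / (c * k_size)) * (S.T @ S)
print(np.abs(cov_mc - cov_th).max())  # small Monte Carlo error
```

The agreement holds for any fixed $h$; the large-$c$ limit then replaces the empirical channel average by the expectation under the Gaussian $\mathcal{N}(0,\Sigma^{(l)})$.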
The same proof yields the following corollary.
Corollary A.2.
Let $v = (v_\beta)_{-k\le\beta\le k}$ be a sequence of non-negative numbers with $\sum_{\beta}v_\beta = 1$. Let $\mathcal{A}_v$ be the cross-correlation operator induced by $v$, i.e.,
(S7) $\displaystyle \big[\mathcal{A}_v(\Sigma)\big]_{\alpha,\alpha'} \;=\; \sum_{\beta=-k}^{k}v_\beta\,\Sigma_{\alpha+\beta,\,\alpha'+\beta}.$
Suppose the weights are drawn i.i.d. from the Gaussian $\omega^{(l)}_{i,j,\beta} \sim \mathcal{N}(0,\sigma_\omega^2 v_\beta/c)$. Then the recurrence relation for the covariance matrix is given by,
(S8) $\displaystyle \Sigma^{(l+1)} \;=\; \sigma_b^2\,\mathbf{1}\mathbf{1}^{T} \;+\; \sigma_\omega^2\,\big(\mathcal{A}_v\circ\mathcal{C}\big)\big(\Sigma^{(l)}\big).$
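The cross-correlation operator is easy to implement directly. This sketch assumes the standard periodic form $[\mathcal{A}_v(\Sigma)]_{\alpha,\alpha'} = \sum_\beta v_\beta\,\Sigma_{\alpha+\beta,\alpha'+\beta}$; the function name `cross_corr_op` is illustrative.

```python
import numpy as np

def cross_corr_op(sigma, v):
    """Apply [A_v(Sigma)]_{a,a'} = sum_beta v_beta * Sigma_{a+beta, a'+beta},
    with periodic boundary conditions; v is indexed by beta = -k..k."""
    k = (len(v) - 1) // 2
    out = np.zeros_like(sigma)
    for beta, w in zip(range(-k, k + 1), v):
        # shifting both indices by beta = rolling both axes of Sigma
        out += w * np.roll(np.roll(sigma, -beta, axis=0), -beta, axis=1)
    return out

# a uniform v recovers the (2k+1)-tap average appearing in Lemma A.1
sigma = np.arange(16.0).reshape(4, 4)
v = np.full(3, 1.0 / 3.0)
print(cross_corr_op(sigma, v))
```

Because the shifts move both indices together, $\mathcal{A}_v$ fixes constant matrices (since $\sum_\beta v_\beta = 1$) and preserves the trace, consistent with its role as a local averaging of the covariance along diagonals.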
A.1 Backpropagation
Let $E$ denote the loss associated to a CNN and let $\delta^{(l)}_{j,\alpha}$ denote the backprop signal given by,
$$\delta^{(l)}_{j,\alpha} \;=\; \frac{\partial E}{\partial z^{(l)}_{j,\alpha}}.$$
The layer-to-layer recurrence relation is given by,
$$\delta^{(l)}_{j,\alpha} \;=\; \phi'\big(z^{(l)}_{j,\alpha}\big)\sum_{i=1}^{c}\sum_{\beta=-k}^{k}\delta^{(l+1)}_{i,\alpha-\beta}\,\omega^{(l+1)}_{i,j,\beta}.$$
We need to make the assumption that the weights used during backpropagation are drawn independently from the weights used in forward propagation. This implies that the backprop signals are independent of the forward-propagation signals for all layers, and the variance of the backprop signal satisfies a layer-to-layer recurrence,
and
For large , the second parenthesized term can be approximated by if and by otherwise.
Appendix B The Jacobian of the map
Recall that $\mathcal{C}$ is given by,
(S9) $\displaystyle \mathcal{C}(\Sigma)_{\alpha,\alpha'} \;=\; \mathbb{E}\big[\phi(u_\alpha)\,\phi(u_{\alpha'})\big], \qquad u \sim \mathcal{N}(0,\Sigma).$
We are interested in the linearized dynamics of $\mathcal{C}$ near the fixed point $\Sigma^{*}$. Let $\mathcal{J}$ denote the Jacobian of $\mathcal{C}$ at $\Sigma^{*}$. The main result of this section is that $\mathcal{J}$ commutes with any diagonal convolution operator.
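The map itself can be evaluated numerically. This sketch assumes $\mathcal{C}$ has the standard form $\mathcal{C}(\Sigma)_{\alpha,\alpha'} = \mathbb{E}[\phi(u_\alpha)\phi(u_{\alpha'})]$ with $u \sim \mathcal{N}(0,\Sigma)$, as in the main text's eq. (2.5); `c_map` is an illustrative name and the estimator is plain Monte Carlo.

```python
import numpy as np

def c_map(sigma, phi=np.tanh, samples=100_000, seed=0):
    """Monte Carlo estimate of C(Sigma)_{a,a'} = E[phi(u_a) phi(u_a')],
    where u ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    u = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=samples)
    h = phi(u)                      # apply the nonlinearity sample-wise
    return h.T @ h / samples        # empirical second-moment matrix

sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(c_map(sigma))
```

Fixed points $\Sigma^{*}$ and the linearization $\mathcal{J}$ can then be probed by iterating `c_map` (composed with the averaging operator) and finite-differencing around the iterate.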
Theorem B.1.
Let be as above and be any diagonal matrix and be any symmetric matrix. Then,
(S10) 
Let $\{E^{(\alpha,\alpha')}\}_{\alpha\le\alpha'}$ be the canonical basis of the space of symmetric matrices, i.e. $\big(E^{(\alpha,\alpha')}\big)_{\beta,\beta'} = 1$ if $(\beta,\beta') = (\alpha,\alpha')$ or $(\alpha',\alpha)$ and 0 otherwise. We claim the following:
Lemma B.2.
The Jacobian has the following representation:

For the off-diagonal terms (i.e. $\alpha \neq \alpha'$),
(S11) 
For the diagonal terms,
(S12) where is given by,
(S13)
Proof of Theorem B.1.
It is clear that
is an eigenspace of with eigenvalue . Here denotes the linear span of . For , define,
It is straightforward to verify that
is an eigenspace of with eigenvalue and the direct sum is the whole space of symmetric matrices. Note that acts on in a pointwise fashion and that maps onto itself (one can form an eigendecomposition of (and ) in using Fourier matrices; see below for details.) Thus commutes with in . It remains to verify that they also commute in .
A key observation is that this basis has a nice group structure,
which we can use to form a new basis,
(S14) 
where $F_\alpha$ is the diagonal matrix formed by the $\alpha$-th row of the Fourier matrix, i.e. $(F_\alpha)_{\beta,\beta} = e^{2\pi i\alpha\beta/n}$. Since each element of this basis is an eigenvector of the convolutional operator,
where is the eigenvalue of . This finishes our proof. ∎
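The diagonalization used in the proof can be illustrated concretely: a periodic convolution (circulant) operator has the Fourier modes as eigenvectors, with eigenvalues given by the discrete Fourier transform of the filter. A sketch with an arbitrary example filter (the specific values are illustrative):

```python
import numpy as np

n = 8
v = np.array([0.25, 0.5, 0.25])  # example non-negative filter summing to 1
k = (len(v) - 1) // 2

# circulant matrix A with A[a, (a + beta) mod n] = v[beta + k]
A = np.zeros((n, n))
for beta in range(-k, k + 1):
    for a in range(n):
        A[a, (a + beta) % n] = v[beta + k]

# each Fourier mode f_alpha is an eigenvector of A ...
alpha = 3
f = np.exp(2j * np.pi * alpha * np.arange(n) / n)
# ... with eigenvalue equal to the filter's DFT at frequency alpha
lam = sum(v[beta + k] * np.exp(2j * np.pi * alpha * beta / n)
          for beta in range(-k, k + 1))
assert np.allclose(A @ f, lam * f)
```

The same computation, applied to both indices of a symmetric matrix simultaneously, underlies the eigendecomposition of the averaging operator on the space of covariance matrices.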
Proof of Lemma B.2.
We first consider perturbing the off-diagonal terms. Let $\epsilon$ be a small number and . Note that for ,
(S15) 
and