Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

06/14/2018 ∙ by Lechao Xiao, et al.

In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.


1 Introduction

Deep convolutional neural networks (CNNs) have been crucial to the success of deep learning. Architectures based on CNNs have achieved unprecedented accuracy in domains ranging across computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014), and recently even the board game Go (Silver et al., 2016, 2017).

Figure 1:

Extremely deep CNNs can be trained without the use of batch normalization or residual connections simply by using a Delta-Orthogonal initialization with critical weight and bias variance and an appropriate (in this case, tanh) nonlinearity. Test (solid) and training (dashed) curves on MNIST (top) and CIFAR-10 (bottom) for depths 1,250, 2,500, 5,000, and 10,000.

The performance of deep convolutional networks has improved as these networks have been made ever deeper. For example, some of the best-performing models on ImageNet (Deng et al., 2009) have employed hundreds or even a thousand layers (He et al., 2016a, b). However, these extremely deep architectures have been trainable only in conjunction with techniques like residual connections (He et al., 2016a) and batch normalization (Ioffe & Szegedy, 2015). It is an open question whether these techniques qualitatively improve model performance or whether they are necessary crutches that solely make the networks easier to train. In this work, we study vanilla CNNs using a combination of theory and experiment to disentangle the notions of trainability and generalization performance. In doing so, we show that through a careful, theoretically-motivated initialization scheme, we can train vanilla CNNs with 10,000 layers using no architectural tricks.

Recent work has used mean field theory to build a theoretical understanding of neural networks with random parameters (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Schoenholz et al., 2017; Karakida et al., 2018; Hayou et al., 2018; Hanin & Rolnick, 2018; Yang & Schoenholz, 2018). These studies revealed a maximum depth through which signals can propagate at initialization, and verified empirically that networks are trainable precisely when signals can travel all the way through them. In the fully-connected setting, the theory additionally predicts the existence of an order-to-chaos phase transition in the space of initialization hyperparameters. For networks initialized on the critical line separating these phases, signals can propagate indefinitely and arbitrarily deep networks can be trained. While mean field theory captures the "average" dynamics of random neural networks, it does not quantify the scale of gradient fluctuations that are crucial to the stability of gradient descent. A related body of work (Saxe et al., 2013; Pennington et al., 2017, 2018) has examined the input-output Jacobian and used random matrix theory to quantify the distribution of its singular values in terms of the activation function and the distribution from which the initial random weight matrices are drawn. These works concluded that networks can be trained most efficiently when the Jacobian is well-conditioned, a criterion that can be achieved with orthogonal, but not Gaussian, weight matrices. Together, these approaches have allowed researchers to efficiently train extremely deep network architectures, but so far they have been limited to neural networks composed of fully-connected layers.

In the present work, we continue this line of research and extend it to the convolutional setting. We show that a well-defined mean-field theory exists for convolutional networks in the limit that the number of channels is large, even when the size of the image is small. Moreover, convolutional networks have precisely the same order-to-chaos transition as fully-connected networks, with vanishing gradients in the ordered phase and exploding gradients in the chaotic phase. And just like fully-connected networks, very deep CNNs that are initialized on the critical line separating those two phases can be trained with relative ease.

Moving beyond mean field theory, we additionally show that the random matrix analysis of (Pennington et al., 2017, 2018) carries over to the convolutional setting. Furthermore, we identify an efficient construction from the wavelet literature that generates random orthogonal matrices with the block-circulant structure that corresponds to convolution operators. This construction facilitates random orthogonal initialization for convolutional layers and enables good conditioning of the end-to-end Jacobian matrices of arbitrarily deep networks. We show empirically that networks with this initialization can train significantly more quickly than standard convolutional networks.

Finally, we emphasize that although the order-to-chaos phase boundaries of fully-connected and convolutional networks look identical, the underlying mean-field theories are in fact quite different. In particular, a novel aspect of the convolutional theory is the existence of multiple depth scales that control signal propagation at different spatial frequencies. In the large depth limit, signals can only propagate along modes with minimal spatial structure; all other modes end up deteriorating, even at criticality. We hypothesize that this type of signal degradation is harmful for generalization, and we develop a modified initialization scheme that allows for balanced propagation of signals among all frequencies. In this scheme, which we call Delta-Orthogonal initialization, the orthogonal kernel is drawn from a spatially non-uniform distribution, and it allows us to train vanilla CNNs of 10,000 layers or more with no degradation in performance.

2 Theoretical results

In this section, we first derive a mean field theory for signal propagation in random convolutional neural networks. We will follow the general methodology established in Poole et al. (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017). We will then arrive at a theory for the singular value distribution of the Jacobian following Pennington et al. (2017, 2018). Together, this will allow us to derive theoretically motivated initialization schemes for convolutional neural networks that we call orthogonal kernels and Delta-Orthogonal kernels. (An example implementation of a deep network initialized critically using the Delta-Orthogonal kernel is provided at https://github.com/brain-research/mean-field-cnns.) Later we will demonstrate experimentally that these kernels outperform existing initialization schemes for very deep vanilla convolutional networks.

2.1 A mean field theory for CNNs

2.1.1 Recursion relation for covariance

Consider an L-layer 1D CNN (for notational simplicity, we consider one-dimensional convolutions, but the two-dimensional case proceeds identically) with periodic boundary conditions, filter width 2k+1, number of channels c, spatial size n, per-layer weight tensors ω^l ∈ R^{(2k+1)×c×c}, and biases b^l ∈ R^c. Let φ be the activation function and let h^l_j(α) denote the pre-activation unit at layer l, channel j, and spatial location α, where we define the set of spatial locations sp = {1, ..., n}. The forward-propagation dynamics can be described by the recurrence relation,

h^l_i(α) = Σ_{j=1}^{c} Σ_{β∈ker} ω^l_{i,j}(β) x^{l-1}_j(α+β) + b^l_i,    (2.1)

where x^l_j(α) = φ(h^l_j(α)) and ker = {-k, ..., k}. At initialization, we take the weights to be drawn i.i.d. from the Gaussian ω^l_{i,j}(β) ∼ N(0, σ_w^2 / (c(2k+1))) and the biases to be drawn i.i.d. from the Gaussian b^l_i ∼ N(0, σ_b^2). Note that the index α+β is taken modulo n since we assume periodic boundary conditions. We wish to understand how signals propagate through these networks. As in previous work in this vein, we will take the large-network limit, which in this context corresponds to taking the number of channels c → ∞. This allows us to use powerful theoretical tools such as mean field theory and random matrix theory. Moreover, this approximation has been shown to give results that agree well with experiments on finite-size networks.
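To make the recursion concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code; the helper names are ours) of a forward pass through the periodic 1D convolution of eqn. (2.1) with the Gaussian initialization just described.

```python
import numpy as np

def init_layer(c, k, sigma_w, sigma_b, rng):
    """Draw one layer's parameters as in eqn. (2.1): omega ~ N(0, sigma_w^2/(c(2k+1))), b ~ N(0, sigma_b^2)."""
    omega = rng.normal(0.0, sigma_w / np.sqrt(c * (2 * k + 1)), size=(2 * k + 1, c, c))
    b = rng.normal(0.0, sigma_b, size=(c,))
    return omega, b

def conv_layer(h, omega, b, phi=np.tanh):
    """One layer of eqn. (2.1): h has shape (channels, n); periodic boundary conditions."""
    x = phi(h)                               # post-activations x^{l-1}
    k = (omega.shape[0] - 1) // 2
    out = np.zeros_like(h)
    for beta in range(-k, k + 1):            # sum over the 2k+1 filter taps
        x_shift = np.roll(x, -beta, axis=1)  # x_shift[:, alpha] = x[:, alpha + beta], with wrap-around
        out += omega[beta + k] @ x_shift
    return out + b[:, None]

# Propagate a random input through 100 layers and track the per-pixel variance.
c, n, k, depth = 256, 32, 1, 100
rng = np.random.default_rng(0)
h = rng.normal(size=(c, n))
for _ in range(depth):
    # (sigma_w^2, sigma_b^2) = (1, 0) lies on the tanh critical line discussed below
    omega, b = init_layer(c, k, sigma_w=1.0, sigma_b=0.0, rng=rng)
    h = conv_layer(h, omega, b)
print("variance after", depth, "layers:", np.mean(h ** 2))
```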

In the limit of a large number of channels, c → ∞, the central limit theorem implies that the pre-activation vectors h^l_i are i.i.d. Gaussian with mean zero and covariance matrix Σ^l_{α,α'} ≡ E[h^l_i(α) h^l_i(α')]. Here, the expectation is taken over the weights and biases and it is independent of the channel index i. In this limit, the covariance matrix takes the form (see Supplemental Materials (SM)),

Σ^l_{α,α'} = σ_b^2 + (σ_w^2 / (2k+1)) Σ_{β∈ker} E[φ(h^{l-1}_j(α+β)) φ(h^{l-1}_j(α'+β))],    (2.2)

and is independent of the channel index j. A more compact representation of this equation can be given as,

Σ^l = [A ⋆ C](Σ^{l-1}),    (2.3)

where A is the diagonal matrix with entries A_{β,β} = 1/(2k+1) for β ∈ ker, and ⋆ denotes 2D circular cross-correlation, i.e. for any matrix Σ, A ⋆ Σ is defined as,

[A ⋆ Σ]_{α,α'} = Σ_{β,β'∈ker} A_{β,β'} Σ_{α+β,α'+β'}.    (2.4)

The function C is related to the C-map defined in Poole et al. (2016) (see also (Daniely et al., 2016)) and is given by,

C(Σ)_{α,α'} = σ_w^2 E_{z∼N(0,Σ)}[φ(z(α)) φ(z(α'))] + σ_b^2.    (2.5)

All but the two dimensions α and α' in eqn. (2.5) marginalize, so, as in (Poole et al., 2016), the C-map can be computed by a two-dimensional integral. Unlike in (Poole et al., 2016), α and α' do not correspond to different examples but rather to different spatial positions, and eqn. (2.5) characterizes how signals from a single input propagate through convolutional networks in the mean-field approximation. (The multi-input analysis proceeds in precisely the same manner as we present here, but comes with increased notational complexity and features no qualitatively different behavior, so we focus our presentation on the single-input case.)
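Because only the entries Σ_{α,α}, Σ_{α',α'} and Σ_{α,α'} enter, each entry of the C-map is a two-dimensional Gaussian integral that is easy to evaluate numerically. Below is a small Monte Carlo sketch (ours; the function name and estimator are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def c_map_entry(q_a, q_b, rho, sigma_w2, sigma_b2, phi=np.tanh,
                n_samples=200_000, seed=0):
    """Monte Carlo estimate of C(Sigma)_{alpha,alpha'} = sigma_w^2 E[phi(z(alpha)) phi(z(alpha'))] + sigma_b^2,
       where z(alpha), z(alpha') are jointly Gaussian with variances q_a, q_b and correlation rho."""
    rng = np.random.default_rng(seed)
    z1, z2 = rng.standard_normal((2, n_samples))
    z_a = np.sqrt(q_a) * z1
    z_b = np.sqrt(q_b) * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
    return sigma_w2 * np.mean(phi(z_a) * phi(z_b)) + sigma_b2

# Diagonal entries recover the familiar variance map q -> sigma_w^2 E[phi(sqrt(q) z)^2] + sigma_b^2:
print(c_map_entry(q_a=1.0, q_b=1.0, rho=1.0, sigma_w2=1.3, sigma_b2=0.1))
```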

2.1.2 Dynamics of signal propagation

We now seek to study the dynamics induced by eqn. (2.3). Schematically, our approach will be to identify fixed points of eqn. (2.3) and then linearize the dynamics around these fixed points. These linearized dynamics will dictate the stability and rate of decay towards the fixed points, which determines the depth scales over which signals in the network can propagate.

Schoenholz et al. (2017) found that for many activation functions (e.g. φ = tanh) and any choice of σ_w^2 and σ_b^2, the C-map has a fixed point Σ* (i.e. C(Σ*) = Σ*) of the form,

Σ*_{α,α'} = q* [δ_{α,α'} + (1 - δ_{α,α'}) c*],    (2.6)

where δ_{α,α'} is the Kronecker-δ, q* is the fixed-point variance of a single input, and c* is the fixed-point correlation between two inputs. It follows from the form of eqn. (2.4) that Σ* is also a fixed point of the layer-to-layer covariance map in the convolutional case (eqn. (2.3)), i.e. A ⋆ C(Σ*) = Σ*.

To analyze the dynamics of the iteration map (2.3) near the fixed point Σ*, we define ε^l = Σ^l - Σ* and expand eqn. (2.3) to lowest order in ε^l. This expansion requires the Jacobian J of the C-map evaluated at the fixed point, the properties of which we analyze in the SM. In brief, perturbations of the diagonal (variance) and off-diagonal (covariance) entries evolve independently, and the Jacobian decomposes into a diagonal eigenspace E_D with eigenvalue χ_{q*}, and an off-diagonal eigenspace E_O with eigenvalue χ_{c*}. The eigenvalues are given by (by the symmetry of Σ*, these expectations are independent of spatial location and of the choice of α ≠ α'),

χ_{q*} = σ_w^2 E_{z∼N(0,Σ*)}[φ'(z(α))^2 + φ(z(α)) φ''(z(α))],    χ_{c*} = σ_w^2 E_{z∼N(0,Σ*)}[φ'(z(α)) φ'(z(α'))],    (2.7)

and the eigenspaces have bases,

E_O = span{ E_{α,α'} := e_α e_{α'}^T + e_{α'} e_α^T : α ≠ α' },    E_D = span{ D_α := e_α e_α^T + c_1 Σ_{α'≠α} E_{α,α'} : α ∈ sp },    (2.8)

i.e. J(E_{α,α'}) = χ_{c*} E_{α,α'} and J(D_α) = χ_{q*} D_α, where e_α denotes the α-th standard basis vector. Note that χ_{q*} and χ_{c*} also were found in Schoenholz et al. (2017) to control signal propagation in the fully-connected case. The constant c_1 is determined by the Jacobian entries given in Lemma B.2 of the SM but does not concern us here. This eigen-decomposition implies that the layer-wise deviations from the fixed point evolve under eqn. (2.3) as,

ε^l ≈ χ_{q*} A ⋆ ε^{l-1}_D + χ_{c*} A ⋆ ε^{l-1}_O,    (2.9)

where ε^{l-1}_D and ε^{l-1}_O are the decompositions of ε^{l-1} into the eigenspaces E_D and E_O.

Eqn. (2.9) defines the linear dynamics of random convolutional neural networks near their fixed points and is the basis for the in-depth analysis of the following subsections.
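For a concrete activation such as tanh, the eigenvalue χ_{c*} in eqn. (2.7) is a two-dimensional Gaussian integral that is straightforward to estimate numerically; the sketch below (ours, by Monte Carlo) also recovers the quantity χ_1 ≡ χ_{c*}|_{c*=1} used later by setting c* = 1.

```python
import numpy as np

def chi_c(q_star, c_star, sigma_w2, phi_prime=lambda x: 1.0 / np.cosh(x) ** 2,
          n_samples=500_000, seed=0):
    """chi_{c*} = sigma_w^2 E[phi'(z(alpha)) phi'(z(alpha'))], with the pair jointly
       Gaussian with variance q* and correlation c*. Defaults to phi = tanh, so phi' = sech^2."""
    rng = np.random.default_rng(seed)
    z1, z2 = rng.standard_normal((2, n_samples))
    u1 = np.sqrt(q_star) * z1
    u2 = np.sqrt(q_star) * (c_star * z1 + np.sqrt(1.0 - c_star ** 2) * z2)
    return sigma_w2 * np.mean(phi_prime(u1) * phi_prime(u2))

# chi_1 is chi_{c*} evaluated at c* = 1; criticality corresponds to chi_1 = 1 (see below).
print(chi_c(q_star=0.5, c_star=1.0, sigma_w2=1.05))
```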

2.1.3 Multi-dimensional signal propagation

In the fully-connected setting, the dynamics of signal propagation near the fixed point are governed by scalar evolution equations. In contrast, the convolutional setting enjoys much richer dynamics, as eqn. (2.9) describes a multi-dimensional system that we now analyze.

It follows from eqns. (2.4) and (2.8) (see also the SM) that the operator A ⋆ does not mix the diagonal and off-diagonal eigenspaces, i.e. A ⋆ E_D ⊆ E_D and A ⋆ E_O ⊆ E_O. To see this, note that for α ≠ α', the definition (2.4) implies A ⋆ E_{α,α'} = Σ_{β∈ker} A_{β,β} E_{α-β,α'-β}, with α-β ≠ α'-β for every β. This property ensures that A ⋆ E_{α,α'} can be expressed as a linear combination of matrices in E_O, which means it also belongs to E_O. The same argument applies to E_D. As a result, these eigenspaces evolve entirely independently under the linearization of the covariance iteration map (2.3).

Let l_0 denote the depth over which transient effects persist and after which eqn. (2.9) accurately describes the linearized dynamics. Therefore, at depths l larger than l_0, we have

ε^l ≈ χ_{q*}^{l-l_0} A^{⋆(l-l_0)} ⋆ ε^{l_0}_D + χ_{c*}^{l-l_0} A^{⋆(l-l_0)} ⋆ ε^{l_0}_O,    (2.10)

where A^{⋆m} denotes m successive applications of A ⋆. This matrix-valued equation is still somewhat complicated owing to the nested applications of A ⋆. To further elucidate the dynamics, we can move to a Fourier basis, which diagonalizes the circular cross-correlation operator and decouples the modes of eqn. (2.10). In particular, let F denote the 2D discrete Fourier transform and ε̃^l_i = [F(ε^l)]_i denote a Fourier mode of ε^l. Then eqn. (2.10) becomes a simple scalar equation,

ε̃^l_i = (χ λ_i)^{l-l_0} ε̃^{l_0}_i,    (2.11)

with λ_i the i-th eigenvalue of A ⋆ (i.e. the i-th 2D Fourier coefficient of A) and χ equal to χ_{q*} or χ_{c*} according to whether the mode comes from ε^{l_0}_D or ε^{l_0}_O. Thus, the linearized dynamics of convolutional neural networks decouple into independently-evolving Fourier modes that evolve near the fixed point at frequency-dependent rates.
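Because the λ_i are just Fourier coefficients of the averaging kernel, they can be read off with an FFT. A short sketch (ours), assuming the uniform width-(2k+1) kernel introduced above:

```python
import numpy as np

n, k = 32, 1
a = np.zeros(n)
a[np.r_[0:k + 1, n - k:n]] = 1.0 / (2 * k + 1)  # uniform averaging vector, centered at position 0 on the circle
lam = np.fft.fft(a)                             # 1D DFT of the diagonal of A
print(np.round(np.abs(lam), 3))                 # all magnitudes <= 1; the zero-frequency coefficient equals 1
```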

2.1.4 Fixed-point analysis

The stability of the fixed point Σ* is determined by whether nearby points move closer or farther from Σ* under the dynamics described by eqn. (2.9). Eqn. (2.11) shows that this condition depends on whether the quantities χ_{q*} |λ_i| and χ_{c*} |λ_i| are less than or greater than one.

Since A is a diagonal matrix, the eigenvalues λ_i have a specific structure. In particular, the set of eigenvalues {λ_i} is comprised of n copies of the 1D discrete Fourier transform of the diagonal entries of A. Furthermore, since the diagonal entries of A are non-negative and sum to one, their Fourier coefficients have absolute value no larger than one and the zero-frequency coefficient is equal to one; see Figure 4 for the full distribution in the case of 2D convolutions. It follows that the fixed point will be stable if and only if χ_{q*} < 1 and χ_{c*} < 1.

These stability conditions are precisely the ones found to govern fully-connected networks (Poole et al., 2016; Schoenholz et al., 2017). Moreover, the fixed-point matrix Σ* is also the same as in the fully-connected case. Together, these observations imply that the entire fixed-point structure of the convolutional case is identical to that of the fully-connected case. In particular, based on the results of (Poole et al., 2016), we can immediately conclude that the (σ_w^2, σ_b^2) hyperparameter plane is separated by the line χ_1 = 1 (where χ_1 denotes χ_{c*} evaluated at c* = 1) into an ordered phase, χ_1 < 1, in which all pixels approach the same value, and a chaotic phase, χ_1 > 1, in which the pixels become decorrelated with one another; see the SM for a review of this phase diagram analysis.

Figure 2: Mean field theory predicts the maximum trainable depth for CNNs. For fixed bias variance σ_b^2, the heat map shows the training accuracy on MNIST obtained for a given network depth and weight variance σ_w^2, after (a) through (d) increasing numbers of training steps. Also plotted (white dashed line) is a multiple of the characteristic depth scale ξ_c governing convergence to the fixed point.

2.1.5 Depth scales of signal propagation

We now assume that the conditions for a stable fixed point are met, i.e. χ_{q*} < 1 and χ_{c*} < 1, and we consider the rate at which the fixed point is approached. As in (Schoenholz et al., 2017), it is convenient to additionally assume that the variance has already converged to its fixed point q*, so that the dynamics in the diagonal subspace can be neglected. In this case, eqn. (2.11) can be rewritten as

ε̃^l_i = ε̃^{l_0}_i e^{-(l-l_0)/ξ_i},    ξ_i^{-1} = -log(χ_{c*} |λ_i|),    (2.12)

where the ξ_i are depth scales governing the convergence of the different modes. In particular, we expect signals corresponding to a specific Fourier mode i to be able to travel a depth commensurate to ξ_i through the network. Thus, unlike fully-connected networks which exhibit only a single depth scale, convolutional networks feature a hierarchy of depth scales.

Recalling that the zero-frequency eigenvalue equals one, it follows that the corresponding depth scale is ξ_c = -1/log χ_{c*}, which is identical to the depth scale governing signal propagation through fully-connected networks. It follows from (Schoenholz et al., 2017) that when χ_{c*} → 1, i.e. at criticality, ξ_c diverges and thus convolutional networks can propagate signals arbitrarily far through the λ_i = 1 modes. Since |λ_i| < 1 for all other modes, these are the only modes through which signals can propagate without attenuation. Finally, we note that the λ_i = 1 modes correspond to perturbations that are spatially uniform along the cyclic diagonals of the covariance matrix. The fact that all signals with additional spatial structure attenuate at large depth suggests that deep critical convolutional networks behave quite similarly to fully-connected networks, which also cannot propagate spatially-structured signals.
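Combining these Fourier coefficients with χ_{c*} gives the per-mode depth scales of eqn. (2.12). A small numerical sketch (ours, with an illustrative value χ_{c*} = 0.99):

```python
import numpy as np

def depth_scales(lam, chi_c_star):
    """xi_i = -1 / log(chi_{c*} |lambda_i|); diverges as chi_{c*}|lambda_i| -> 1."""
    rate = chi_c_star * np.abs(lam)
    with np.errstate(divide="ignore"):
        return -1.0 / np.log(rate)

n, k = 32, 1
a = np.zeros(n)
a[np.r_[0:k + 1, n - k:n]] = 1.0 / (2 * k + 1)   # uniform averaging kernel on the circle
xi = depth_scales(np.fft.fft(a), chi_c_star=0.99)
print(np.round(np.sort(xi)[::-1][:5], 1))        # only the zero-frequency mode approaches xi ~ 100
```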

Figure 3: Test (solid) and training (dashed) curves of CNNs with different depths initialized critically using orthogonal kernels on CIFAR-10. Training accuracy reaches 100% for all these curves (except for depth 8192, which was stopped early) but generalization performance degrades with increasing depth, likely because of attenuation of spatially non-uniform modes. The Delta-Orthogonal initialization in Fig. 1 addresses this reduction in test performance with increasing depth.

2.1.6 Non-uniform kernels

Figure 4: Test performance, as a function of depth, is correlated with the singular value distribution (SVD) of the generalized averaging operator A_v (see eqn. (2.13)). (a) Initialized critically, we examine the test accuracy of CNNs with different depths and with Gaussian initialization for several non-uniform variance vectors. We "deform" the variance vector from a delta function (red) to a uniformly distributed one (black). Starting from depth 35, we see the test accuracy curve also "deform" from the red one to the black one. (b) The SVD of A_v for the selected variance vectors. The x-axis represents the index of a singular value, with a total of 64 singular values (each with 64 copies) for each variance vector. See Section 3.3 for details.

The similarities between signal propagation in convolutional neural networks and fully-connected networks in the limit of large depth are surprising. A consequence may be that the performance of very deep convolutional networks degrades as the signal is forced to propagate along modes with minimal spatial structure. Indeed, Fig. 3 shows that the generalization performance decreases with depth, and that for very large depth it barely surpasses the performance of a fully-connected network.

If increased spatial uniformity is the problem, eqn. (2.12) holds the solution. In order for all modes to propagate without attenuation, it is necessary that |λ_i| = 1 for all i. In fact, it is easy to show that the distribution of the λ_i can be modified by allowing for spatial non-uniformity in the variance of the weights within the kernel. To this end, we introduce a non-negative vector v = (v_{-k}, ..., v_k) chosen such that Σ_{β∈ker} v_β = 1, and initialize the weights of the network according to ω^l_{i,j}(β) ∼ N(0, σ_w^2 v_β / c). Each choice of v will induce a new dynamical equation analogous to eqn. (2.3) (see SM),

Σ^l = [A_v ⋆ C](Σ^{l-1}),    (2.13)

where (A_v)_{β,β'} = v_β δ_{β,β'}. It follows directly from the previous analysis that the linearized dynamics of eqn. (2.13) will be identical to the dynamics of eqn. (2.3), only now with A replaced by A_v. By the same argument presented in Section 2.1.3, the set of eigenvalues {λ_i} is now comprised of n copies of the 1D Fourier transform of v. As a result, it is possible to control the depth scales over which different modes of the signal can propagate through the network by changing the variance vector v. We will return to this point in Section 2.4.
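The effect of the variance vector v on the mode structure is easy to verify numerically: a one-hot (delta) vector yields |λ_i| = 1 for every Fourier mode, whereas the uniform vector attenuates all non-zero frequencies. A quick check (ours):

```python
import numpy as np

def mode_magnitudes(v, n):
    """|lambda_i| for the averaging operator A_v: magnitudes of the 1D DFT of v placed on a circle of size n."""
    k = (len(v) - 1) // 2
    a = np.zeros(n)
    a[np.arange(-k, k + 1) % n] = v      # tap beta of v goes to circular position beta
    return np.abs(np.fft.fft(a))

n, k = 32, 1
uniform = np.full(2 * k + 1, 1.0 / (2 * k + 1))    # equal variance at every tap
one_hot = np.zeros(2 * k + 1); one_hot[k] = 1.0    # all variance at the spatial center
print(mode_magnitudes(one_hot, n))   # all ones: every Fourier mode propagates without attenuation
print(mode_magnitudes(uniform, n))   # magnitudes decay away from zero frequency
```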

2.2 Back-propagation of signal

We now turn our attention to the back-propagation of error signals through a convolutional network. Let E denote the loss and δ^l_i(α) the back-propagated signal at layer l, channel i and spatial location α, i.e.,

δ^l_i(α) = ∂E / ∂h^l_i(α).    (2.14)

The recurrence relation is given by

δ^l_i(α) = φ'(h^l_i(α)) Σ_{j=1}^{c} Σ_{β∈ker} ω^{l+1}_{j,i}(β) δ^{l+1}_j(α-β).

As in (Schoenholz et al., 2017), we additionally make the assumption that the weights used during back-propagation are drawn independently from the weights used in forward propagation, in which case the random variables δ^l_i(α) are independent for each i. The covariance matrices Σ̃^l_{α,α'} ≡ E[δ^l_i(α) δ^l_i(α')] back-propagate according to,

Σ̃^l_{α,α'} = (σ_w^2 / (2k+1)) E[φ'(h^l_i(α)) φ'(h^l_i(α'))] Σ_{β∈ker} Σ̃^{l+1}_{α-β,α'-β}.    (2.15)

We are primarily interested in the diagonal of Σ̃^l, which measures the variance of back-propagated signals. We will also assume c* = 1 (see Section 2.1.3) so that Σ^l is well-approximated by its fixed-point value. In this case,

Σ̃^l_{α,α} = (χ_1 / (2k+1)) Σ_{β∈ker} Σ̃^{l+1}_{α-β,α-β},    (2.16)

where we used eqn. (2.7). Therefore we find that Σ̃^l_{α,α} ∼ χ_1^{L-l}, where L is the total depth of the network. As in the fully-connected case, χ_1 = 1 is a necessary condition for gradient signals to neither explode nor vanish as they back-propagate through a convolutional network. However, as discussed in (Pennington et al., 2017, 2018), this is not always a sufficient condition for trainability. To further understand backward signal propagation, we need to push our analysis beyond mean field theory.
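To get a rough sense of why exact criticality matters at these depths, consider an illustrative miscalibration: χ_1 = 1.01 rescales the variance of the back-propagated signal by roughly 1.01^10,000 ≈ e^100 over a 10,000-layer network, while χ_1 = 0.99 shrinks it by a comparable factor; only χ_1 = 1 avoids both extremes.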

2.2.1 Beyond mean field theory

We have observed that the quantity χ_1 is crucial for determining signal propagation in CNNs, both in the forward and backward directions. As discussed in (Poole et al., 2016), χ_1 equals the mean squared singular value of the Jacobian of the layer-to-layer transition operator. Beyond just the second moment, higher moments and indeed the whole distribution of singular values of the entire end-to-end Jacobian J = ∂h^L/∂h^0 are important for ensuring trainability of very deep fully-connected networks (Pennington et al., 2017, 2018). Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1.

In fact, we can adopt the entire analysis of (Pennington et al., 2017, 2018) into the convolutional setting with essentially no modification. The reason stems from the fact that, because convolution is a linear operator, it has a matrix representation, W^l, which appears in the end-to-end Jacobian in precisely the same manner as do the weight matrices in the fully-connected case. In particular, J = Π_{l=1}^{L} D^l W^l, where D^l is the diagonal matrix whose diagonal elements contain the vectorized representation of derivatives of post-activation neurons in layer l. Roughly speaking, since this is the same expression as in (Pennington et al., 2017, 2018), the conclusions found in that work regarding dynamical isometry apply equally well in the convolutional setting.

The analysis of Pennington et al. (2017, 2018) reveals that the singular values of J depend crucially on the distribution of singular values of the D^l and W^l. In particular, to achieve dynamical isometry, all of these matrices should be close to orthogonal. As in the fully-connected case, the singular values of D^l can be made arbitrarily close to 1 by choosing a small value for q* and by using an activation function like tanh that is smooth and linear near the origin. In the convolutional setting, the matrix representation W of the convolution operator is a block matrix whose blocks are n×n circulant matrices. Note that in the large-c limit, the number of blocks grows while their size stays fixed, so the relative size of the blocks vanishes. Therefore, if the weights are i.i.d. random variables, we can invoke universality results from random matrix theory to conclude that its singular value distribution converges to the Marcenko-Pastur distribution; see Fig. S4 in the SM. As such, we find that CNNs with i.i.d. weights cannot achieve dynamical isometry. We address this issue in the next section.
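This ill-conditioning is easy to check directly: the conv operator is block-circulant, so its singular values are those of the per-frequency matrices Σ_β ω(β) e^{-2πiuβ/n}. The sketch below (1D for simplicity; our own construction, not the code behind Fig. S4) contrasts an i.i.d. Gaussian kernel with a kernel whose central tap is an orthogonal matrix, anticipating Section 2.4.

```python
import numpy as np

def conv_singular_values(omega, n):
    """Singular values of the circular conv operator defined by omega[beta, :, :], beta in {-k..k}:
       for each Fourier frequency u they are the singular values of sum_beta omega[beta] e^{-2*pi*i*u*beta/n}."""
    k = (omega.shape[0] - 1) // 2
    svals = []
    for u in range(n):
        phases = np.exp(-2j * np.pi * u * np.arange(-k, k + 1) / n)
        block = np.tensordot(phases, omega, axes=(0, 0))  # c x c block at this frequency
        svals.append(np.linalg.svd(block, compute_uv=False))
    return np.concatenate(svals)

c, k, n = 64, 1, 16
rng = np.random.default_rng(0)
# (a) i.i.d. Gaussian kernel with variance 1/(c(2k+1)): a spread-out, Marcenko-Pastur-like spectrum
omega_iid = rng.normal(0.0, 1.0 / np.sqrt(c * (2 * k + 1)), size=(2 * k + 1, c, c))
# (b) all weight at the central tap, set to an orthogonal matrix: every singular value equals 1
q, _ = np.linalg.qr(rng.standard_normal((c, c)))
omega_orth = np.zeros_like(omega_iid)
omega_orth[k] = q
for name, om in [("i.i.d. Gaussian", omega_iid), ("central orthogonal", omega_orth)]:
    s = conv_singular_values(om, n)
    print(name, "min/max singular value:", np.round(s.min(), 2), np.round(s.max(), 2))
```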

2.3 Orthogonal Initialization for CNNs

In (Pennington et al., 2017, 2018), it was observed that dynamical isometry can lead to dramatic improvements in training speed, and that achieving these favorable conditions requires orthogonal weight initializations. While the procedure to generate random orthogonal weight matrices in the fully-connected setting is well-known, it is less obvious how to do so in the convolutional setting, and at first sight it is not at all clear whether it is even possible. We resolve this question by invoking a result from the wavelet literature (Kautsky & Turcajová, 1994) and provide an explicit construction. We will focus on the two-dimensional convolution here and begin with some notation.

Figure 5: Orthogonal initialization leads to faster training in CNNs. Training (solid lines) and test curves for a 4,000-layer CNN trained using orthogonal (red) and Gaussian (blue) initializations with identical weight variance.
Definition 2.1.

We say a kernel K is an orthogonal kernel if the convolution operator it defines is norm-preserving, i.e. ‖Conv_K(x)‖_2 = ‖x‖_2 for all inputs x.

Definition 2.2.

Consider the block matrices A = (A_{i,j}) and B = (B_{i,j}), with constituent blocks A_{i,j}, B_{i,j} ∈ R^{c×c}. Define the block-wise convolution operator ⊛ by,

(A ⊛ B)_{i,j} = Σ_{p,q} A_{p,q} B_{i-p, j-q},    (2.17)

where the out-of-range matrices are taken to be zero.

Algorithm 1 shows how to construct orthogonal kernels for 2D convolutions of size k × k × c_in × c_out with c_out ≥ c_in. One can employ the same method to construct kernels of higher (or lower) dimensions. This new initialization method can dramatically boost the learning speed of deep CNNs; see Fig. 5 and Section 3.2.

2.4 Delta-Orthogonal Initialization

In Section 2.1.5 it was observed that, in contrast to fully-connected networks, CNNs have multiple depth scales controlling propagation of signals along different Fourier modes. Even at criticality, for generic variance-averaging vectors v, the majority of these depth scales are finite. However, there does exist one special averaging vector for which all of the depth scales are infinite: a one-hot vector, i.e. v_β = δ_{β,0}. This kernel places all of its variance in the spatial center of the kernel and zero variance elsewhere. In this case, the eigenvalues λ_i are all equal to 1 and all depth scales diverge, implying that signals can propagate arbitrarily far along all Fourier modes.

If we combine this special averaging vector with the orthogonal initialization of the previous section, we obtain a powerful new initialization scheme that we call Delta-Orthogonal Initialization. Matrices of this type can be generated from Algorithm 1 with kernel size 1 and padding with appropriate zeros, or directly from Algorithm 2 in the SM.
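Concretely, a Delta-Orthogonal kernel can be generated by drawing a (sub-)orthogonal matrix and placing it at the spatial center of an otherwise zero kernel. A minimal NumPy sketch (ours, not the reference implementation linked in Section 2):

```python
import numpy as np

def delta_orthogonal_kernel(ksize, c_in, c_out, sigma_w=1.0, rng=None):
    """Return a kernel of shape (ksize, ksize, c_in, c_out) that is zero everywhere except
       the spatial center, which holds sigma_w * H with orthonormal rows (requires c_out >= c_in)."""
    assert c_out >= c_in, "norm preservation requires at least as many output as input channels"
    rng = np.random.default_rng() if rng is None else rng
    # QR of a random Gaussian gives a c_out x c_out orthogonal matrix; keep its first c_in columns.
    q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    q *= np.sign(np.diag(r))          # fix the sign ambiguity of QR
    h = q[:, :c_in]                   # c_out x c_in with orthonormal columns
    kernel = np.zeros((ksize, ksize, c_in, c_out))
    kernel[ksize // 2, ksize // 2] = sigma_w * h.T
    return kernel

kern = delta_orthogonal_kernel(ksize=3, c_in=64, c_out=128)
print(kern.shape, np.allclose(kern[1, 1] @ kern[1, 1].T, np.eye(64)))   # center tap has orthonormal rows
```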

In the following sections, we demonstrate experimentally that extraordinarily deep convolutional networks can be trained with these initialization techniques.

  Input: kernel size k, number of input channels c_in, number of output channels c_out (with c_out ≥ c_in). Return: a tensor K of shape k × k × c_in × c_out.
  Step 1. Let H be the k × k block tensor (with c_out × c_out blocks) such that H_{1,1} = I_{c_out}, where I_{c_out} is the c_out × c_out identity matrix, and all other blocks are zero.
  Step 2. Repeat the following k - 1 times: randomly generate two orthogonal projection matrices P and Q of size c_out × c_out and set H = H ⊛ B(P, Q), the 2 × 2 block kernel built from P and Q (see eqn. (2.17)).
  Step 3. Randomly generate a matrix M with orthonormal rows and, for 1 ≤ i, j ≤ k, set K_{i,j} = H_{i,j} M. Return K.
Algorithm 1: 2D orthogonal kernels for CNNs, available as an initializer in TensorFlow.
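For illustration, the sketch below implements a simplified, separable variant of this construction (ours, not the paper's exact Algorithm 1, and restricted to c_in = c_out): starting from the 1×1 identity kernel, it repeatedly block-convolves with length-2 paraunitary factors [P, I-P] built from random orthogonal projections, once per spatial direction, and then applies a final orthogonal factor in the spirit of Step 3. It also verifies that the induced circular convolution is orthogonal by checking that every per-frequency block is unitary.

```python
import numpy as np

def random_projection(c, rng):
    """Random orthogonal projection P = V V^T onto a random subspace of dimension c//2."""
    v, _ = np.linalg.qr(rng.standard_normal((c, c // 2)))
    return v @ v.T

def orthogonal_kernel_2d(ksize, c, rng=None):
    """Separable construction: K[a, b] = A[a] @ B[b], where A and B are 1D paraunitary
       (orthogonal) filters of length ksize built from projection factors [P, I - P]."""
    rng = np.random.default_rng() if rng is None else rng

    def paraunitary_1d(length):
        taps = [np.eye(c)]                       # start from the delta filter
        for _ in range(length - 1):
            p = random_projection(c, rng)
            new = [np.zeros((c, c)) for _ in range(len(taps) + 1)]
            for t, m in enumerate(taps):         # block-convolve with the factor [P, I - P]
                new[t] += p @ m
                new[t + 1] += (np.eye(c) - p) @ m
            taps = new
        h0, _ = np.linalg.qr(rng.standard_normal((c, c)))
        return [m @ h0 for m in taps]            # extra orthogonal factor

    A, B = paraunitary_1d(ksize), paraunitary_1d(ksize)
    return np.array([[A[a] @ B[b] for b in range(ksize)] for a in range(ksize)])

def is_orthogonal(kernel, n=8):
    """Check that every per-frequency block of the induced circular conv operator is unitary."""
    ksize, c = kernel.shape[0], kernel.shape[2]
    for u in range(n):
        for v in range(n):
            w = sum(kernel[a, b] * np.exp(-2j * np.pi * (u * a + v * b) / n)
                    for a in range(ksize) for b in range(ksize))
            if not np.allclose(w.conj().T @ w, np.eye(c), atol=1e-8):
                return False
    return True

K = orthogonal_kernel_2d(ksize=3, c=16, rng=np.random.default_rng(0))
print(is_orthogonal(K))   # True: the induced convolution preserves norms
```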

3 Experiments

To support the theoretical results built up in Section 2, we trained a large number of very deep CNNs on MNIST and CIFAR-10 with tanh as the activation function. We use the following vanilla CNN architecture. First we apply three 3×3 convolutions with strides 1, 2 and 2 in order to increase the channel size to c and reduce the spatial dimension to 7×7 (or 8×8 for CIFAR-10), and then a block of d 3×3 convolutions with d varying up to 10,000. Finally, an average pooling layer and a fully-connected layer are applied. The channel size c is chosen depending on the depth d. To maximally support our theories, we applied no common performance-enhancing tricks (including learning rate decay). Note that the early downsampling is necessary from a computational perspective, but it does diminish the maximum achievable performance; e.g. our best achieved test accuracy with downsampling was 82% on CIFAR-10. We performed an additional experiment training a 50-layer network without downsampling. This resulted in a higher test accuracy, comparable to the best performance on CIFAR-10 using a tanh architecture that we were able to find (Mishkin & Matas, 2015).

3.1 Trainability and Critical Initialization

The analysis in Section 2.1 gives a prediction for precisely which initialization hyperparameters leave a CNN trainable. In particular, we predict that the network ought to be trainable provided its depth does not greatly exceed the depth scale ξ_c. To test this, we train a large number of convolutional neural networks on MNIST with a wide range of depths and with weights initialized with a range of variances σ_w^2. In Fig. 2 we plot – using a heatmap – the training accuracy obtained by these networks after different numbers of steps. Additionally we overlay a multiple of the depth scale predicted by our theory, ξ_c. We find strikingly good agreement between our theory of random networks and the results of our experiments.

3.2 Orthogonal Initialization and Ultra-deep CNNs

We argued in Section 2.2.1 that the input-output Jacobian of CNNs with i.i.d. weights will become increasingly ill-conditioned as the number of layers grows. On the other hand, orthogonal weight initializations can achieve dynamical isometry and dramatically boost the training speed. To verify this, we train a 4,000-layer CNN on MNIST using both a critically-tuned Gaussian weight initialization and the orthogonal initialization scheme developed in Section 2.3. Fig. 5 shows that the network with Gaussian initialization learns slowly, with test and training accuracy still low after about 60 epochs. In contrast, the orthogonal initialization learns quickly, reaching high test accuracy after only 1 epoch and substantially higher accuracy after about 7 epochs.

3.3 Multi-dimensional Signal Propagation

The analyses in Section 2.1.3 and Section 2.1.6 suggest that CNNs initialized with kernels of spatially uniform variance may suffer a degradation in generalization performance as the depth increases. Fig. 3 shows the learning curves of CNNs on CIFAR-10 with depths up to 8,192. Although the orthogonal initialization enables even the deepest model to reach high training accuracy, the test accuracy decays as the depth increases, with the deepest model generalizing only marginally better than a fully-connected network.

To test whether this degradation in performance may be the result of attenuation of spatially non-uniform signals, we trained a variety of models on CIFAR-10 whose kernels were initialized with spatially non-uniform variance. According to the analysis in Section 2.1.6, changing the shape of this non-uniformity controls the depth scales over which different Fourier components of the signal can propagate through the network. We examined five different non-uniform critical Gaussian initialization methods. The variance vectors v were chosen in the following way: GS0 refers to the one-hot delta initialization, for which the eigenvalues λ_i are all equal to 1. GS1, GS2 and GS3 are obtained by interpolating between GS0 and GS4, which is the uniform-variance initialization.

Each variance vector gives an averaging operator with 64 distinct singular values (each with 64 copies), plotted in Fig. 4(b) in descending order. Note that from GS0 to GS4, the singular values become more poorly conditioned (the distribution becomes more concentrated around 0). Fig. 4(a) shows that the relative fall-off of generalization performance with depth follows the same pattern: the more poorly conditioned the singular values, the worse the model generalizes. These observations suggest that salient information may be propagating along multiple Fourier modes.

3.4 Training 10,000-layers: Delta-Orthogonal Initialization.

Our theory predicts that an ultra-deep CNN can train faster and perform better if critically initialized using Delta-Orthogonal kernels. To test this theory, we train CNNs of 1,250, 2,500, 5,000 and 10,000 layers on both MNIST and CIFAR-10 (Fig. 1). All these networks learn surprisingly quickly and, remarkably, the learning time measured in number of training epochs is independent of depth. Furthermore, our experimental results match well with the predicted benefits of this initialization, with strong test accuracy on both MNIST and CIFAR-10 even for the 10,000-layer networks. To isolate the benefits of the Delta-Orthogonal initialization, we also train a 2048-layer CNN (Fig. 3) using the spatially-uniform orthogonal initialization proposed in Section 2.3; its test accuracy is noticeably lower. Note that the test accuracy using a (spatially uniform) Gaussian, non-orthogonal initialization is already below this level when the depth is 259.

4 Discussion

In this work, we developed a theoretical framework based on mean field theory to study the propagation of signals in deep convolutional neural networks. By examining the necessary conditions for signals to flow both forward and backward through the network without attenuation, we derived an initialization scheme that facilitates training of vanilla CNNs of unprecedented depths. We presented an algorithm for the generation of random orthogonal convolutional kernels, an ingredient that is necessary to enable dynamical isometry, i.e. good conditioning of the network’s input-output Jacobian. In contrast to the fully-connected case, signal propagation in CNNs is intrinsically multi-dimensional – we showed how to decompose those signals into independent Fourier modes and how to promote uniform signal propagation across them. By leveraging these various theoretical insights, we demonstrated empirically that it is possible to train vanilla CNNs with 10,000 layers or more.

Our results indicate that we have removed all the major fundamental obstacles to training arbitrarily deep vanilla convolutional networks. In doing so, we have laid the groundwork to begin addressing some outstanding questions in the deep learning community, such as whether depth alone can deliver enhanced generalization performance. Our initial results suggest that past a certain depth, on the order of tens or hundreds of layers, the test performance of vanilla convolutional architectures saturates. These observations suggest that architectural features such as residual connections and batch normalization are likely to play an important role in defining a good model class, rather than simply enabling efficient training.

Acknowledgements

We thank Xinyang Geng, Justin Gilmer, Alex Kurakin, Jaehoon Lee, Hoang Trieu Trinh, and Greg Yang for useful discussions and feedback.

References

Appendix A Discussion of Mean Field Theory

Consider an L-layer 1D periodic CNN (for notational simplicity, as in the main text, we again consider 1D convolutions, but the 2D case proceeds identically) with filter size 2k+1, channel size c, spatial size n, per-layer weight tensors ω^l ∈ R^{(2k+1)×c×c} and biases b^l ∈ R^c. Let φ be the activation function and let h^l_j(α) denote the pre-activation at layer l, channel j, and spatial location α. Suppose the weights are drawn i.i.d. from the Gaussian ω^l_{i,j}(β) ∼ N(0, σ_w^2/(c(2k+1))) and the biases are drawn i.i.d. from the Gaussian b^l_i ∼ N(0, σ_b^2). The forward-propagation dynamics can be described by the recurrence relation,

h^l_i(α) = Σ_{j=1}^{c} Σ_{β∈ker} ω^l_{i,j}(β) φ(h^{l-1}_j(α+β)) + b^l_i.

For each i and α, note that (a) the layer-l weights and biases are i.i.d. random variables and (b) conditioned on the previous layer, h^l_i(α) is a sum of i.i.d. random variables with mean zero. The central limit theorem implies that, as c → ∞, the h^l_i(α) are i.i.d. Gaussian random variables across channels i. Let Σ^l denote the covariance matrix, where

Σ^l_{α,α'} = E[h^l_i(α) h^l_i(α')],

and the expectation is taken over all random variables in and before layer l. Therefore, we have the following lemma.

Lemma A.1.

As c → ∞, for each l and i, h^l_i is a mean zero Gaussian with covariance matrix Σ^l satisfying the recurrence relation,

Σ^l = [A ⋆ C](Σ^{l-1}).    (S1)
Proof.

Let x^{l-1}_j(α) = φ(h^{l-1}_j(α)). Then,

Σ^l_{α,α'} = E[h^l_i(α) h^l_i(α')] = Σ_{j,j'} Σ_{β,β'} E[ω^l_{i,j}(β) ω^l_{i,j'}(β')] E[x^{l-1}_j(α+β) x^{l-1}_{j'}(α'+β')] + σ_b^2    (S2)

= σ_b^2 + (σ_w^2 / (c(2k+1))) Σ_{j=1}^{c} Σ_{β∈ker} E[x^{l-1}_j(α+β) x^{l-1}_j(α'+β)],    (S3)

where we used the fact that,

E[ω^l_{i,j}(β) ω^l_{i,j'}(β')] = (σ_w^2 / (c(2k+1))) δ_{j,j'} δ_{β,β'}.

Note that Σ^l can be computed once the distribution of h^{l-1} (or Σ^{l-1}) is given. We will proceed by induction. Let l-1 be fixed and assume the h^{l-1}_j are i.i.d. mean zero Gaussian with covariance Σ^{l-1}. It is not difficult to see that the h^l_i are also i.i.d. mean zero Gaussian as c → ∞. To compute the covariance, note that for any fixed pair (α+β, α'+β), the products x^{l-1}_j(α+β) x^{l-1}_j(α'+β) are i.i.d. random variables across j. Then, by the law of large numbers,

(1/c) Σ_{j=1}^{c} x^{l-1}_j(α+β) x^{l-1}_j(α'+β) → E_{h∼N(0,Σ^{l-1})}[φ(h(α+β)) φ(h(α'+β))]  as c → ∞.    (S4)

Thus by eq. (2.5), eq. (S2) can be written as,

Σ^l_{α,α'} = (1/(2k+1)) Σ_{β∈ker} C(Σ^{l-1})_{α+β,α'+β},    (S5)

so that,

Σ^l = [A ⋆ C](Σ^{l-1}).    (S6)

∎

The same proof yields the following corollary.

Corollary A.2.

Let v = (v_{-k}, ..., v_k) be a sequence of non-negative numbers with Σ_{β∈ker} v_β = 1. Let A_v ⋆ be the cross-correlation operator induced by v, i.e.,

[A_v ⋆ Σ]_{α,α'} = Σ_{β∈ker} v_β Σ_{α+β,α'+β}.    (S7)

Suppose the weights are drawn i.i.d. from the Gaussian ω^l_{i,j}(β) ∼ N(0, σ_w^2 v_β / c). Then the recurrence relation for the covariance matrix is given by,

Σ^l = [A_v ⋆ C](Σ^{l-1}).    (S8)

A.1 Back-propagation

Let E denote the loss associated to a CNN and δ^l_i(α) denote a backprop signal given by,

δ^l_i(α) = ∂E / ∂h^l_i(α).

The layer-to-layer recurrence relation is given by,

δ^l_i(α) = φ'(h^l_i(α)) Σ_{j=1}^{c} Σ_{β∈ker} ω^{l+1}_{j,i}(β) δ^{l+1}_j(α-β).

We need to make the assumption that the weights used during back-propagation are drawn independently from the weights used in forward propagation. This implies that the δ^l_i(α) are independent across channels i, and for the covariance Σ̃^l_{α,α'} = E[δ^l_i(α) δ^l_i(α')] we obtain

Σ̃^l_{α,α'} = ( (σ_w^2/(2k+1)) Σ_{β∈ker} Σ̃^{l+1}_{α-β,α'-β} ) ( E[φ'(h^l_i(α)) φ'(h^l_i(α'))] ).

For large c, the second parenthesized term can be approximated by χ_1 / σ_w^2 if α = α' and by χ_{c*} / σ_w^2 otherwise.

Appendix B The Jacobian of the C-map

Recall that C is given by,

C(Σ)_{α,α'} = σ_w^2 E_{z∼N(0,Σ)}[φ(z(α)) φ(z(α'))] + σ_b^2.    (S9)

We are interested in the linearized dynamics of C near the fixed point Σ*. Let J denote the Jacobian of C at Σ*. The main result of this section is that J commutes with any diagonal convolution operator.

Theorem B.1.

Let J be as above, let A be any diagonal matrix (regarded as a cross-correlation operator A ⋆ as in eqn. (2.4)) and let Σ be any symmetric matrix. Then,

J(A ⋆ Σ) = A ⋆ J(Σ).    (S10)

Let {E_{α,α'}} be the canonical basis of the space of symmetric matrices, i.e. (E_{α,α'})_{γ,γ'} = 1 if (γ,γ') = (α,α') or (γ,γ') = (α',α) and 0 otherwise. We claim the following:

Lemma B.2.

The Jacobian J has the following representation:

  • For the off-diagonal terms (i.e. α ≠ α'),

    J(E_{α,α'}) = χ_{c*} E_{α,α'}.    (S11)

  • For the diagonal terms,

    J(E_{α,α}) = χ_{q*} E_{α,α} + c_0 Σ_{α'≠α} E_{α,α'},    (S12)

    where c_0 is given by,

    c_0 = (σ_w^2 / 2) E_{z∼N(0,Σ*)}[φ''(z(α)) φ(z(α'))],  α ≠ α'.    (S13)

We first prove Theorem B.1 assuming Lemma B.2, and afterwards we prove the latter.

Proof of Theorem B.1.

It is clear that E_O := span{E_{α,α'} : α ≠ α'} is an eigenspace of J with eigenvalue χ_{c*}, where span denotes the linear span. For each α ∈ sp, define,

D_α := E_{α,α} + (c_0 / (χ_{q*} - χ_{c*})) Σ_{α'≠α} E_{α,α'}.

It is straightforward to verify that E_D := span{D_α : α ∈ sp} is an eigenspace of J with eigenvalue χ_{q*} and that the direct sum E_D ⊕ E_O is the whole space of symmetric matrices. Note that J acts on E_O in a pointwise fashion (as multiplication by χ_{c*}) and that A ⋆ maps E_O onto itself (one can form an eigen-decomposition of J (and A ⋆) in E_O using Fourier matrices; see below for details). Thus J commutes with A ⋆ in E_O. It remains to verify that they also commute in E_D.

A key observation is that the basis {D_α} has a nice group structure: each D_α is a circular shift of D_1. We can use this structure to form a new basis for E_D,

G_j := Σ_{α∈sp} (F_j)_{α,α} D_α,    (S14)

where F_j is the diagonal matrix formed by the j-th row of the Fourier matrix, i.e. (F_j)_{α,α} = ω^{jα} with ω = e^{2πi/n}. Since each G_j is an eigenvector of the convolution operator A ⋆ (A is diagonal),

A ⋆ G_j = λ_j G_j,

where λ_j is the corresponding eigenvalue of A ⋆. Since E_D is an eigenspace of J, it follows that J(A ⋆ G_j) = λ_j χ_{q*} G_j = A ⋆ J(G_j) on this basis. This finishes our proof. ∎

Proof of Lemma B.2.

We first consider perturbing the off-diagonal terms. Let ε be a small number and consider the perturbed covariance Σ* + ε E_{α,α'} with α ≠ α'. Note that for (γ, γ') ∉ {(α,α'), (α',α)},

(S15)

and