# Disentangling trainability and generalization in deep learning

A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths, where we will see that the situation simplifies considerably. To do this, we leverage recent advances that have separately shown: (1) that in the wide network limit, random networks before training are Gaussian Processes governed by a kernel known as the Neural Network Gaussian Process (NNGP) kernel, (2) that at large depths the spectrum of the NNGP kernel simplifies considerably and becomes "weakly data-dependent", and (3) that gradient descent training of wide neural networks is described by a kernel called the Neural Tangent Kernel (NTK) that is related to the NNGP. Here we show that in the large depth limit the spectrum of the NTK simplifies in much the same way as that of the NNGP kernel. By analyzing this spectrum, we arrive at a precise characterization of trainability and a necessary condition for generalization across a range of architectures including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). In particular, we find that there are large regions of hyperparameter space where networks can only memorize the training set, in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set, in contrast with several recent results. By comparing CNNs with and without global average pooling, we show that CNNs without average pooling have very nearly identical learning dynamics to FCNs, while CNNs with pooling contain a correction that alters their generalization performance. We perform a thorough empirical investigation of these theoretical results and find excellent agreement on real datasets.


## 1 Introduction

Machine learning models based on deep neural networks have attained state-of-the-art performance across a dizzying array of tasks including vision (Cubuk et al., 2019), speech recognition (Park et al., 2019), machine translation (Bahdanau et al., 2014), chemical property prediction (Gilmer et al., 2017), diagnosing medical conditions (Raghu et al., 2019), and playing games (Silver et al., 2018). Historically, the rampant success of deep learning models has lacked a sturdy theoretical foundation; architectures, hyperparameters, and learning algorithms are often selected by brute force search (Bergstra & Bengio, 2012) and heuristics (Glorot & Bengio, 2010). Recently, significant theoretical progress has been made on several fronts that have shown promise in making neural network design more systematic. In particular, in the infinite width (or channel) limit, the distribution of functions induced by neural networks with random weights and biases has been precisely characterized before, during, and after training.

The study of infinite networks dates back to seminal work by Neal (1994), who showed that the distribution of functions given by single hidden-layer networks with random weights and biases in the infinite-width limit is a Gaussian Process (GP). Recently, there has been renewed interest in studying random, infinite networks, starting with concurrent work on "conjugate kernels" (Daniely et al., 2016; Daniely, 2017) and "mean-field theory" (Poole et al., 2016; Schoenholz et al., 2017). The former set of papers argued that the empirical covariance matrix of pre-activations becomes deterministic in the infinite-width limit and called this the conjugate kernel of the network, while the latter papers studied the properties of these limiting kernels along with the kernel describing the distribution of gradients. In particular, it was shown that the spectrum of the conjugate kernel of wide fully-connected networks approaches a well-defined, data-independent limit when the depth exceeds a certain scale, $\xi$. Networks with $\tanh$-nonlinearities (among other bounded activations) exhibit a phase transition between two limiting spectral distributions of the conjugate kernel as a function of their hyperparameters, with $\xi$ diverging at the transition. It was additionally hypothesized that networks were un-trainable when the conjugate kernel was sufficiently close to its limit.

Since then this analysis has been extended to include a wide range of architectures such as convolutions (Xiao et al., 2018), recurrent networks (Chen et al., 2018; Gilboa et al., 2019), networks with residual connections (Yang & Schoenholz, 2017), networks with quantized activations (Blumenfeld et al., 2019), the spectrum of the Fisher information matrix (Karakida et al., 2018), a range of activation functions (Hayou et al., 2018; Yang et al., 2019), and weight-tied autoencoders (Li & Nguyen, 2019). In each case, it was observed that the spectra of the kernels correlated strongly with whether or not the architectures were trainable. While these papers studied the properties of the conjugate kernels, especially the spectrum in the large-depth limit, a branch of concurrent work made a stronger statement: that many networks converge to Gaussian Processes as their width becomes large (Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019b; Garriga-Alonso et al., 2018; Yang, 2019). In this case, the conjugate kernel was referred to as the Neural Network Gaussian Process (NNGP) kernel.

Together this work offered a significant advance in our understanding of wide neural networks; however, this theoretical progress was limited to networks at initialization or after Bayesian posterior estimation and provided no link to gradient descent. Moreover, there was some preliminary evidence that the situation might be more nuanced than the qualitative link between the NNGP spectrum and trainability suggests. For example, Philipp & Carbonell (2018) observed that deep fully-connected $\tanh$-networks could be trained after the kernel reached its large-depth, data-independent limit but that these networks did not generalize to unseen data.

In the last year, significant theoretical clarity has been reached regarding the relationship between the GP prior and the distribution following gradient descent. In particular, Jacot et al. (2018) along with followup work (Lee et al., 2019; Chizat et al., 2019) showed that the distribution of functions induced by gradient descent for infinite-width networks is a Gaussian Process with a particular compositional kernel known as the Neural Tangent Kernel (NTK). In addition to characterizing the distribution over functions following gradient descent in the wide network limit, the learning dynamics can be solved analytically throughout optimization.

In this paper, we leverage these developments and revisit the relationship between architecture, hyperparameters, trainability, and generalization in the large-depth limit for a variety of neural networks. In particular, we make the following contributions:

1. We compute the large-depth asymptotics of several quantities related to trainability, including the largest eigenvalue of the NTK, $\lambda_{\max}$, and the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\min}$ is the smallest eigenvalue; see Table 1.

2. We introduce the residual predictor $\Delta^{(l)}$, namely the difference between the finite depth and infinite depth NTK predictions, which is related to the model's ability to generalize: the network fails to generalize if $\Delta^{(l)}$ is too small.

3. We show that the ordered and chaotic phases identified in Poole et al. (2016) lead to markedly different limiting spectra of the NTK. A corollary is that, as a function of depth, the optimal learning rates ought to decay exponentially in the chaotic phase, decay linearly on the order-to-chaos transition line, and remain roughly constant in the ordered phase.

4. We examine the differences in the above quantities for fully-connected networks (FCNs) and convolutional networks (CNNs) with and without pooling and precisely characterize the effect of pooling on the interplay between trainability, generalization, and depth.

5. We provide substantial experimental evidence supporting these claims, including experiments that densely vary the hyperparameters of FCNs and CNNs with and without pooling.

Together these results provide a complete, analytically tractable, and dataset-independent theory for learning in very deep and wide networks. Finally, our results provide clarity regarding the observation that for linear networks the learning rate must be decreased linearly in the depth of the network (Saxe et al., 2013). Here, we note that this is true only for networks that are initialized critically, i.e. on the order-to-chaos phase boundary.

## 2 Background

We summarize recent developments in the study of wide random networks. We will keep our discussion relatively informal; see (Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019b) for a more rigorous version of these arguments. To simplify this discussion and as a warmup for the main text, we will consider the case of FCNs. Consider a fully-connected network of depth $L$ where each layer has a width $N^{(l)}$ and an activation function $\phi$. In this work we will take $\phi = \tanh$; however, most of the results hold for a wide range of nonlinearities, though specifics, such as the phase diagram, can vary substantially. For simplicity, we will take the widths of the hidden layers to infinity sequentially: $N^{(1)} \to \infty, \dots, N^{(L-1)} \to \infty$. The network is parameterized by weights $W^{(l)}_{ij}$ and biases $b^{(l)}_i$ that we take to be randomly initialized with $W^{(l)}_{ij}, b^{(l)}_i \sim \mathcal{N}(0, 1)$, along with hyperparameters $\sigma_w$ and $\sigma_b$ that set the scale of the weights and biases. Letting the $i$th pre-activation in the $l$th layer due to an input $x$ be given by $z^{(l)}_i(x)$, the network is then described by the recursion,

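$$z^{(l+1)}_i(x) = \frac{\sigma_w}{\sqrt{N^{(l)}}} \sum_{j=1}^{N^{(l)}} W^{(l+1)}_{ij}\, \phi\big(z^{(l)}_j(x)\big) + \sigma_b\, b^{(l+1)}_i, \qquad 0 \le l \le L-1. \tag{1}$$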

Notice that as $N^{(l)} \to \infty$, the sum is over a large number of random variables and we can invoke the central limit theorem to conclude that the $z^{(l+1)}_i$ are i.i.d. Gaussian with zero mean. Given a dataset of $m$ points, the distribution over pre-activations can therefore be described completely by the covariance matrix between neurons for different inputs, $K^{(l)}(x, x') = \mathbb{E}\big[z^{(l)}_i(x)\, z^{(l)}_i(x')\big]$. Inspecting Equation 1, we see that $K^{(l+1)}$ can be computed in terms of $K^{(l)}$ as

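$$K^{(l+1)}(x,x') = \sigma_w^2\, \mathbb{E}_{(z,z') \sim \mathcal{N}(0,\, K^{(l)}(x,x'))}\big[\phi(z)\phi(z')\big] + \sigma_b^2 \equiv \sigma_w^2\, T\big(K^{(l)}(x,x')\big) + \sigma_b^2 \tag{2}$$

for $T$, an appropriately defined operator from the space of positive semi-definite matrices to itself.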
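As a small illustration of Equation 2, the sketch below iterates the diagonal of the recursion until it reaches a fixed-point variance. It assumes $\phi = \mathrm{erf}$ rather than $\tanh$, because for erf the Gaussian expectation has the closed form $\mathbb{E}[\mathrm{erf}(z)^2] = \frac{2}{\pi}\arcsin\frac{2q}{1+2q}$ for $z \sim \mathcal{N}(0, q)$. The function names and the hyperparameter values are hypothetical and chosen only for demonstration.

```python
import math

def erf_variance_map(q, sigma_w2, sigma_b2):
    """Diagonal of Equation 2 for phi = erf (an assumption of this sketch):
    q -> sigma_w^2 * E[erf(z)^2] + sigma_b^2 with z ~ N(0, q)."""
    return sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q / (1.0 + 2.0 * q)) + sigma_b2

def fixed_point_variance(sigma_w2, sigma_b2, q0=1.0, iters=500):
    """Iterate the variance map from q0; plain fixed-point iteration."""
    q = q0
    for _ in range(iters):
        q = erf_variance_map(q, sigma_w2, sigma_b2)
    return q

# Hypothetical hyperparameters (sigma_w^2, sigma_b^2), for illustration only.
q_star = fixed_point_variance(sigma_w2=2.0, sigma_b2=0.1)
print("fixed-point variance:", q_star)
```

Plain iteration is enough here because the variance map is a contraction near its fixed point for these values; other activations would need numerical quadrature in place of the closed form.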

Equation 2 describes a dynamical system on positive semi-definite matrices $K$. It was shown in Poole et al. (2016) that fixed points, $K^*(x,x')$, of these dynamics exist such that $\lim_{l\to\infty} K^{(l)}(x,x') = K^*(x,x') = q^*\big[c^* + (1 - c^*)\delta_{x,x'}\big]$ with $q^*$ and $c^*$ independent of the inputs $x$ and $x'$. The values of $q^*$ and $c^*$ are determined by the hyperparameters $\sigma_w$ and $\sigma_b$. However, Equation 2 admits multiple fixed points (e.g. $c^* = 1$ and some $c^* < 1$) and the stability of these fixed points plays a significant role in determining the properties of the network. Generically, there are large regions of the $(\sigma_w, \sigma_b)$ plane in which the fixed-point structure is constant, punctuated by curves, called phase transitions, where the structure changes.

The rate at which $K^{(l)}(x,x')$ approaches or departs from $K^*(x,x')$ can be determined by expanding Equation 2 about its fixed point, with $\delta K^{(l)}(x,x') = K^{(l)}(x,x') - K^*(x,x')$, to find

$$\delta K^{(l+1)}(x,x') \approx \sigma_w^2\, \dot{T}\big(K^*(x,x')\big)\, \delta K^{(l)}(x,x') \tag{3}$$

with $\dot{T}(K) = \mathbb{E}_{(z,z') \sim \mathcal{N}(0, K)}\big[\dot\phi(z)\dot\phi(z')\big]$. (Footnote 1: More precisely, one needs to consider the Jacobian of $T$ as an operator from positive semi-definite matrices to positive semi-definite matrices. We refer the readers to Section B of Xiao et al. (2018) for more details.) This expansion naturally exhibits exponential convergence to, or divergence from, the fixed point as $\delta K^{(l)}(x,x') \sim \chi(x,x')^l$ where $\chi(x,x') = \sigma_w^2 \dot{T}(K^*(x,x'))$. Since $K^*(x,x')$ does not depend on $x$ or $x'$, it follows that $\chi(x,x')$ will take on a single value, $\chi_{c^*}$, whenever $x \neq x'$. If $\chi_{c^*} < 1$ then this fixed point is stable, but if $\chi_{c^*} > 1$ then the fixed point is unstable and, as discussed above, the system will converge to a different fixed point. If $\chi_{c^*} = 1$ then the hyperparameters lie at a phase transition and convergence is non-exponential. As was shown in Poole et al. (2016), there is always a fixed point at $c^* = 1$ whose stability is determined by $\chi_1$. This defines the order-to-chaos transition. Note that $\chi_{c^*}$ can be used to define a depth scale, $\xi_{c^*} = -1/\log(\chi_{c^*})$, that describes the number of layers over which $K^{(l)}$ approaches $K^*$.
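The stability criterion above can be checked numerically. The following sketch again assumes $\phi = \mathrm{erf}$ (for which $\mathbb{E}[\mathrm{erf}'(z)^2] = \frac{4}{\pi}(1+4q)^{-1/2}$ when $z \sim \mathcal{N}(0,q)$), computes the quantity $\sigma_w^2 \dot{T}$ at the $c^* = 1$ fixed point, and labels the phase. The helper names and the example hyperparameters are hypothetical.

```python
import math

def q_star_erf(sigma_w2, sigma_b2, iters=500):
    """Fixed-point variance for phi = erf (assumption), by plain iteration."""
    q = 1.0
    for _ in range(iters):
        q = sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q / (1.0 + 2.0 * q)) + sigma_b2
    return q

def chi_1_erf(sigma_w2, sigma_b2):
    """chi_1 = sigma_w^2 * E[erf'(z)^2], z ~ N(0, q*), using the erf closed form."""
    q = q_star_erf(sigma_w2, sigma_b2)
    return sigma_w2 * (4.0 / math.pi) / math.sqrt(1.0 + 4.0 * q)

def phase(sigma_w2, sigma_b2, tol=1e-6):
    """Label the phase by comparing chi_1 with 1."""
    chi = chi_1_erf(sigma_w2, sigma_b2)
    if chi < 1.0 - tol:
        return "ordered"
    if chi > 1.0 + tol:
        return "chaotic"
    return "critical"

def depth_scale(chi):
    """Number of layers over which a perturbation changes by a factor e."""
    return 1.0 / abs(math.log(chi))

for sw2 in (0.5, 4.0):  # illustrative values only
    chi = chi_1_erf(sw2, 0.1)
    print(sw2, phase(sw2, 0.1), round(chi, 3), round(depth_scale(chi), 2))
```

The absolute value in `depth_scale` is a convenience of this sketch so that the same helper returns a positive scale on both sides of the transition.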

This provides a precise characterization of the NNGP kernel at large depths. As discussed above, recent work (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019) has connected the prior described by the NNGP with the result of gradient descent training using a quantity called the NTK. To construct the NTK, suppose we enumerate all the parameters in the fully-connected network described above by $\theta_\alpha$. The finite width NTK is defined by $\hat\Theta(x,x') = J(x) J(x')^T$, where $J_{i\alpha}(x) = \partial_{\theta_\alpha} z^{(L)}_i(x)$ is the Jacobian evaluated at a point $x$. The main result in Jacot et al. (2018) was to show that in the infinite-width limit, the NTK converges to a deterministic kernel $\Theta$ and remains constant over the course of training. As such, at a time $t$ during gradient descent training with an MSE loss, the expected outputs of an infinitely wide network, $\mu_t(x) = \mathbb{E}[z^{(L)}_i(x)]$, evolve as

$$\mu_t(X_{\text{train}}) = \big(\mathrm{Id} - e^{-\eta\, \Theta_{\text{train,train}}\, t}\big)\, Y_{\text{train}} \tag{4}$$

$$\mu_t(X_{\text{test}}) = \Theta_{\text{test,train}}\, \Theta^{-1}_{\text{train,train}} \big(\mathrm{Id} - e^{-\eta\, \Theta_{\text{train,train}}\, t}\big)\, Y_{\text{train}} \tag{5}$$

for train and test points respectively; see Section 2 in Lee et al. (2019). Here $\Theta_{\text{test,train}}$ denotes the NTK between the test inputs $X_{\text{test}}$ and training inputs $X_{\text{train}}$, and $\Theta_{\text{train,train}}$ is defined similarly. Since $\hat\Theta$ converges to $\Theta$ as the width grows, the gradient flow dynamics of real networks also converge to the dynamics described by Equation 4 and Equation 5 (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019; Yang, 2019; Arora et al., 2019; Huang & Yau, 2019). As the training time $t$ tends to infinity, these equations reduce to $\mu(X_{\text{train}}) = Y_{\text{train}}$ and $\mu(X_{\text{test}}) = \Theta_{\text{test,train}}\, \Theta^{-1}_{\text{train,train}}\, Y_{\text{train}}$. Consequently we call the linear operator

$$P(\Theta) \equiv \Theta_{\text{test,train}}\, \Theta^{-1}_{\text{train,train}} \tag{6}$$

the "mean predictor" or "predictor" for short. In addition to showing that the NTK describes networks during gradient descent, Jacot et al. (2018) showed that the NTK could be computed in closed form in terms of $T$, $\dot{T}$, and the NNGP as

$$\Theta^{(l+1)}(x,x') = K^{(l+1)}(x,x') + \sigma_w^2\, \dot{T}\big(K^{(l)}\big)(x,x')\, \Theta^{(l)}(x,x'), \tag{7}$$

where $\Theta^{(l)}$ is the NTK for the pre-activations at layer $l$.
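To make Equations 4 to 6 concrete, here is a minimal NumPy sketch that evaluates the mean predictions through an eigendecomposition of the training kernel. The kernel used below is a ridge-regularized linear kernel on random data, a stand-in chosen only so the example runs on its own; it is not an NTK, and all names are hypothetical.

```python
import numpy as np

def ntk_mean_prediction(theta_train, theta_test_train, y_train, eta, t):
    """Evaluate Equations 4 and 5 for a symmetric positive-definite train kernel."""
    lam, U = np.linalg.eigh(theta_train)      # theta_train = U diag(lam) U^T
    decay = 1.0 - np.exp(-eta * lam * t)      # (Id - exp(-eta * Theta * t)) in the eigenbasis
    y_tilde = U.T @ y_train                   # labels in the eigenbasis, shape (m, k)
    mu_train = U @ (decay[:, None] * y_tilde)
    # Theta^{-1} (Id - exp(...)) Y shares the same eigenbasis, so divide by lam.
    mu_test = theta_test_train @ (U @ ((decay / lam)[:, None] * y_tilde))
    return mu_train, mu_test

# Stand-in data and kernel (assumptions for illustration, not the NTK itself).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 5))
X_test = rng.normal(size=(3, 5))
Y_train = rng.normal(size=(8, 1))
theta_train = X_train @ X_train.T / 5 + 0.1 * np.eye(8)
theta_test_train = X_test @ X_train.T / 5

mu_train, mu_test = ntk_mean_prediction(theta_train, theta_test_train, Y_train, eta=1.0, t=1e4)
```

For large `t` the training predictions reproduce the labels and the test predictions reduce to the predictor of Equation 6 applied to `Y_train`.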

## 3 Metrics for Trainability and Generalization at Large Depth

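We begin by discussing the interplay between the conditioning of $\Theta_{\text{train,train}}$ and the trainability of wide networks. We can write Equation 4 in terms of the spectrum of $\Theta_{\text{train,train}}$, letting $\Theta_{\text{train,train}} = U^T D\, U$ with $D = \operatorname{diag}(\lambda_i)$, as

$$\tilde\mu_t(X_{\text{train}})_i = \big(1 - e^{-\eta \lambda_i t}\big)\, \tilde{Y}_{\text{train},i} \tag{8}$$

where $\lambda_i$ are the eigenvalues of $\Theta_{\text{train,train}}$ and $\tilde\mu_t(X_{\text{train}}) = U \mu_t(X_{\text{train}})$, $\tilde{Y}_{\text{train}} = U Y_{\text{train}}$ are the mean prediction and the labels respectively, written in the eigenbasis of $\Theta_{\text{train,train}}$. If we order the eigenvalues such that $\lambda_1 \geq \cdots \geq \lambda_m$, then it has been hypothesized in e.g. Lee et al. (2019) that the maximum feasible learning rate scales as $\eta \sim 2/\lambda_1$, as we verify empirically in Section 5. (Footnote 2: For finite width, the optimization problem is non-convex and there are no rigorous bounds on the maximum learning rate.) Plugging this scaling for $\eta$ into Equation 8, we see that the mode with the smallest eigenvalue converges exponentially at a rate given by the inverse condition number $1/\kappa$, with $\kappa = \lambda_1/\lambda_m$. It follows that if the condition number of the NTK associated with a neural network diverges then the network becomes untrainable, and so we use $\kappa$ as a metric for trainability. We will see that at large depths, the spectrum of $\Theta_{\text{train,train}}$ typically features a single large eigenvalue, $\lambda_{\max}$, and then a gap that is large compared with the rest of the spectrum. We therefore will often refer to a typical eigenvalue in the bulk as $\lambda_{\text{bulk}}$ and approximate the condition number as $\kappa = \lambda_{\max}/\lambda_{\text{bulk}}$.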
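The role of the largest eigenvalue can be illustrated with discrete-time gradient descent on the linearized dynamics, where the training residual evolves as $r_{k+1} = (\mathrm{Id} - \eta\Theta) r_k$. This is a simplified sketch under stated assumptions: the kernel is a random positive-definite matrix rather than an NTK, the initial function is taken to be zero, and the names are hypothetical. It shows that step sizes just below $2/\lambda_{\max}$ converge while step sizes just above diverge.

```python
import numpy as np

def residual_norm(theta, y, eta, steps):
    """Norm of the training residual after `steps` gradient steps on the MSE loss
    of the linearized model, starting from f_0 = 0 (an assumption of this sketch)."""
    r = -y.astype(float)                 # residual f_0 - y
    for _ in range(steps):
        r = r - eta * (theta @ r)        # r_{k+1} = (Id - eta * Theta) r_k
    return float(np.linalg.norm(r))

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 20))
theta = A @ A.T / 20 + 0.1 * np.eye(6)   # stand-in positive-definite kernel
lam, U = np.linalg.eigh(theta)           # eigenvalues in ascending order
lam_max, lam_min = lam[-1], lam[0]
y = U[:, -1] + 0.5 * U[:, 0]             # labels with weight on the top and bottom modes

eta_max = 2.0 / lam_max
stable = residual_norm(theta, y, 0.95 * eta_max, 200)
unstable = residual_norm(theta, y, 1.05 * eta_max, 200)
kappa = lam_max / lam_min
print(stable, unstable, kappa)
```

At $\eta$ close to $2/\lambda_{\max}$ the slowest mode contracts by roughly $1 - 2/\kappa$ per step, which is why a diverging condition number makes training impractically slow.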

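In the large-depth limit we will see that $\Theta^{(l)}$ converges to $\Theta^*$ independent of the data distribution. In this case $\Theta^*_{\text{test,train}}$ will be a rank-1 constant matrix. As such, the mean prediction defined by Equation 5 completely fails to generalize since the prediction is independent of the test inputs. We define the finite depth correction to the infinite depth predictor,

$$\Delta^{(l)} Y_{\text{train}} \equiv \big(P(\Theta^{(l)}) - P(\Theta^*)\big)\, Y_{\text{train}}. \tag{9}$$

(Footnote 3: If $\Theta^{(l)}$ diverges to infinity, we define $P(\Theta^*) = \lim_{l\to\infty} P(\Theta^{(l)})$. If $\Theta^*$ is singular, we will add a diagonal regularizer into $\Theta^*$.)

By the triangle inequality, the generalization error is lower bounded by

$$\big\|P(\Theta^{(l)})\, Y_{\text{train}} - Y_{\text{test}}\big\|_2 \geq \big\|P(\Theta^*)\, Y_{\text{train}} - Y_{\text{test}}\big\|_2 - \big\|\Delta^{(l)} Y_{\text{train}}\big\|_2. \tag{10}$$

Here $P(\Theta^*)\, Y_{\text{train}}$ is a constant independent of the test inputs, and the right-hand side of Equation 10 is large if $\|\Delta^{(l)} Y_{\text{train}}\|_2$ is too small. Therefore, a necessary condition for the network to generalize is that there exists some $\rho > 0$ such that

$$\big\|\Delta^{(l)} Y_{\text{train}}\big\|_2 \geq \rho\, \big\|P(\Theta^*)\, Y_{\text{train}} - Y_{\text{test}}\big\|_2. \tag{11}$$

As such, we use $\Delta^{(l)}$ as a metric for generalization in this paper.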
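For completeness, the step leading to Equation 10 can be spelled out; this short worked derivation uses only the definition in Equation 9 and the triangle inequality for the Euclidean norm:

$$
\begin{aligned}
\big\|P(\Theta^*)\, Y_{\text{train}} - Y_{\text{test}}\big\|_2
&= \big\|\big(P(\Theta^{(l)})\, Y_{\text{train}} - Y_{\text{test}}\big) - \Delta^{(l)} Y_{\text{train}}\big\|_2 \\
&\leq \big\|P(\Theta^{(l)})\, Y_{\text{train}} - Y_{\text{test}}\big\|_2 + \big\|\Delta^{(l)} Y_{\text{train}}\big\|_2 .
\end{aligned}
$$

Moving the last term to the left-hand side gives Equation 10. In particular, if Equation 11 fails for some $\rho < 1$, then the test error at depth $l$ is at least $(1-\rho)\,\|P(\Theta^*)\, Y_{\text{train}} - Y_{\text{test}}\|_2$, i.e. within a factor $(1-\rho)$ of the error of the data-independent infinite-depth predictor.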

Our goal is therefore to characterize the evolution of the two metrics $\kappa^{(l)}$ and $\Delta^{(l)}$ in $l$. We follow the methodology outlined in Schoenholz et al. (2017) and Xiao et al. (2018) to explore the spectrum of the NTK as a function of depth. We will use this to make precise predictions relating trainability and generalization to the hyperparameters $(\sigma_w^2, \sigma_b^2, l)$. Our main results are summarized in Table 1, which describes the evolution of $\lambda^{(l)}_{\max}$ (the largest eigenvalue of $\Theta^{(l)}$), $\lambda^{(l)}_{\text{bulk}}$ (the remaining eigenvalues), $\kappa^{(l)}$, and $\Delta^{(l)}$ in three different phases (ordered, chaotic, and the phase transition) and their dependence on $m$, the size of the training set, on the choice of architecture: FCN, CNN-F (convolution with flattening) and CNN-P (convolution with pooling), and on the size of the window in the pooling layer (which we always take to be the penultimate layer).

We give a brief derivation of these results in Section 4 followed by a more detailed discussion in the appendix. However, it is useful to first give a qualitative overview of the phenomenology. In the ordered phase, $\lambda^{(l)}_{\max} \to m p^*$ and $\lambda^{(l)}_{\text{bulk}} \sim l \chi_1^l$. At large depths, since $\chi_1 < 1$, it follows that $\lambda^{(l)}_{\text{bulk}} \to 0$ and so the condition number diverges exponentially quickly. Thus, in the ordered phase we expect networks not to be trainable (or, specifically, the time they take to learn will grow exponentially in their depth). The predictor scales as $l\chi_1^l$, which goes to zero at the same rate as the divergence of $\kappa^{(l)}$; thus, in the ordered phase networks fail to train and generalize simultaneously. By contrast, in the chaotic phase we see that there is no gap between $\lambda^{(l)}_{\max}$ and $\lambda^{(l)}_{\text{bulk}}$, and networks become perfectly conditioned and are trainable everywhere. However, in this regime we see that the predictor scales as $l(\chi_{c^*}/\chi_1)^l$. Since in the chaotic phase $\chi_{c^*} < 1$ and $\chi_1 > 1$, it follows that $\Delta^{(l)} \to 0$ over a depth $\xi_* = -1/\log(\chi_{c^*}/\chi_1)$. Thus, in the chaotic phase, networks fail to generalize at a finite depth but remain trainable indefinitely. Finally, notice that introducing pooling modestly augments the depth over which networks can generalize in the chaotic phase but reduces the depth in the ordered phase. We will explore all of these predictions in detail in Section 5.

## 4 Large-Depth Asymptotics of the NNGP and NTK

We now give a brief derivation of the results in Table 1. To simplify the notation we will discuss fully-connected networks and then extend the results to CNNs with pooling (CNN-P) and without pooling (CNN-F). Details of these two cases can be found in the appendix. We will focus on the NTK here since Schoenholz et al. (2017) and Xiao et al. (2018) contain a detailed description of the NNGP in this case. As in Section 2, we will be concerned with the fixed points of $\Theta^{(l)}$ as well as the linearization of Equation 7 about its fixed point. Recall that the fixed point structure is invariant within a phase, so it suffices to consider the ordered phase, the chaotic phase, and the critical line separately. In cases where a stable fixed point exists, we will describe how $\Theta^{(l)}$ converges to the fixed point. We will see that in the chaotic phase and on the critical line, $\Theta^{(l)}$ has no stable fixed point and in that case we will describe its divergence. As above, in each case the fixed points of $\Theta^{(l)}$ have a simple structure, with a single value $p^*$ on the diagonal and a single value $p^*_{ab}$ off the diagonal. To simplify the forthcoming analysis, without loss of generality, we assume the inputs are normalized to have variance $q^*$. (Footnote 4: It has been observed in previous works (Poole et al., 2016; Schoenholz et al., 2017) that the diagonals converge much faster than the off-diagonals for $\tanh$- or erf-networks.) As such, we can treat $T$ and $\dot{T}$, restricted to kernels with diagonal $q^*$, as point-wise functions,

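$$T(K)(x,x') = \mathbb{E}\,\phi(u)\phi(v), \qquad (u,v)^T \sim \mathcal{N}\left(0, \begin{bmatrix} q^* & K(x,x') \\ K(x,x') & q^* \end{bmatrix}\right). \tag{12}$$

Since the off-diagonal elements approach the same fixed point at the same rate, we use $q^{(l)}_{ab}$ and $p^{(l)}_{ab}$ to denote any off-diagonal entry of $K^{(l)}$ and $\Theta^{(l)}$ respectively, and $q^{(l)}$ and $p^{(l)}$ for the corresponding diagonal entries. We will similarly use $q^*_{ab}$ and $p^*_{ab}$ to denote the limits $\lim_{l\to\infty} q^{(l)}_{ab}$ and $\lim_{l\to\infty} p^{(l)}_{ab}$. Using the above notation, Equation 7 and Equation 2 become

$$q^{(l+1)}_{ab} = \sigma_w^2\, T\big(q^{(l)}_{ab}\big) + \sigma_b^2, \qquad p^{(l+1)}_{ab} = q^{(l+1)}_{ab} + \sigma_w^2\, \dot{T}\big(q^{(l)}_{ab}\big)\, p^{(l)}_{ab} \tag{13}$$

$$q^{(l+1)} = q^*, \qquad p^{(l+1)} = q^* + \sigma_w^2\, \dot{T}(q^*)\, p^{(l)}, \tag{14}$$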

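with initial conditions $p^{(0)}_{ab} = q^{(0)}_{ab}$ and $p^{(0)} = q^{(0)} = q^*$. In what follows, we split the discussion into three parts according to the value of $\chi_1 = \sigma_w^2 \dot{T}(q^*)$, recalling that in Poole et al. (2016) and Schoenholz et al. (2017) it was shown that $\chi_1$ controls the fixed point structure.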
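The scalar recursions in Equations 13 and 14 are straightforward to iterate numerically. The sketch below does so for $\phi = \mathrm{erf}$, an assumption made because both $T$ and $\dot{T}$ then have closed forms: $T(k) = \frac{2}{\pi}\arcsin\frac{2k}{1+2q^*}$ and $\dot{T}(k) = \frac{4}{\pi}\big((1+2q^*)^2 - 4k^2\big)^{-1/2}$. The initial conditions (off-diagonal NTK entry equal to the off-diagonal NNGP entry, diagonal entry equal to $q^*$), the function names, and the hyperparameters are further assumptions of this illustration.

```python
import math

def erf_T(k, q):
    """E[erf(u) erf(v)] for (u, v) ~ N(0, [[q, k], [k, q]])."""
    return (2.0 / math.pi) * math.asin(2.0 * k / (1.0 + 2.0 * q))

def erf_T_dot(k, q):
    """E[erf'(u) erf'(v)] for the same Gaussian."""
    return (4.0 / math.pi) / math.sqrt((1.0 + 2.0 * q) ** 2 - 4.0 * k ** 2)

def fixed_point_variance(sigma_w2, sigma_b2, iters=500):
    q = 1.0
    for _ in range(iters):
        q = sigma_w2 * erf_T(q, q) + sigma_b2
    return q

def iterate_kernels(sigma_w2, sigma_b2, q_star, q_ab0, depth):
    """Iterate Equations 13 and 14 for `depth` layers."""
    q_ab, p_ab, p = q_ab0, q_ab0, q_star        # assumed initial conditions
    chi_1 = sigma_w2 * erf_T_dot(q_star, q_star)
    for _ in range(depth):
        q_next = sigma_w2 * erf_T(q_ab, q_star) + sigma_b2
        p_ab = q_next + sigma_w2 * erf_T_dot(q_ab, q_star) * p_ab   # Equation 13
        p = q_star + chi_1 * p                                       # Equation 14
        q_ab = q_next
    return q_ab, p_ab, p, chi_1

# Hypothetical ordered-phase hyperparameters.
sigma_w2, sigma_b2, depth = 0.5, 0.1, 40
q_star = fixed_point_variance(sigma_w2, sigma_b2)
q_ab, p_ab, p, chi_1 = iterate_kernels(sigma_w2, sigma_b2, q_star, q_ab0=0.1, depth=depth)
print(chi_1, q_ab, p_ab, p)
```

With these values the off-diagonal entries collapse onto the diagonal ones, which is the ordered-phase behaviour analysed in Section 4.2; choosing a larger weight variance moves the same code into the chaotic regime.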

### 4.1 The Chaotic Phase $\chi_1 = \sigma_w^2 \dot{T}(q^*) > 1$

The chaotic phase is so-named because $\chi_1 > 1$, so that similar inputs become more uncorrelated as they pass through the network. In this phase, the diagonal entries of $\Theta^{(l)}$ grow exponentially and the off-diagonal entries converge to a fixed value. Indeed, Equation 14 implies

$$p^{(l+1)} = q^* + \chi_1 p^{(l)} \implies p^{(l)} = q^*\, \frac{\chi_1^{l+1} - 1}{\chi_1 - 1}, \tag{15}$$

which diverges exponentially. To find the limit of the off-diagonal terms, define $\chi_{c^*} = \sigma_w^2 \dot{T}(q^*_{ab})$, which was shown to control convergence of the $q^{(l)}_{ab}$ and is always less than 1 (Schoenholz et al., 2017; Xiao et al., 2018). Letting $l \to \infty$ in Equation 13, we find that

$$p^*_{ab} = \frac{q^*_{ab}}{1 - \sigma_w^2 \dot{T}(q^*_{ab})} = \frac{q^*_{ab}}{1 - \chi_{c^*}} < \infty. \tag{16}$$

The rate of convergence of $p^{(l)}_{ab}$ is $O(l \chi_{c^*}^l)$ (see Section A in the appendix). Since the diagonal terms diverge and the off-diagonal terms are finite, it follows that in very deep networks in the chaotic phase, $\Theta^{(l)} \approx p^{(l)}\, \mathrm{Id}$. Thus, in the chaotic phase, the spectrum of the NTK for very deep networks approaches the diverging constant multiplying the identity. From Equation 4 this implies that optimization in the chaotic phase should be easy since $\kappa^{(l)} \to 1$ (provided numerical precision issues from the prefactor do not become problematic); see Figure 1 (a). However, computing the mean prediction on test points and noticing that $P(\Theta^*) = 0$, we find (see Section B for the derivation)

$$\Delta^{(l)} Y_{\text{train}} = P(\Theta^{(l)})\, Y_{\text{train}} \approx \big(p^{(l)}\big)^{-1} O\big(l \chi_{c^*}^l\big)\, Y_{\text{train}} \to 0. \tag{17}$$

It follows that in the chaotic phase the networks' predictions on unseen data converge to zero exponentially quickly in the depth. Since Equation 17 decays like $(\chi_{c^*}/\chi_1)^l$, we expect the network to fail to generalize after $\xi_*$ layers, where $\xi_* = -1/\log(\chi_{c^*}/\chi_1)$. (Footnote 5: For simplicity, we ignore the polynomial correction in $l$.)

In summary, for wide networks in the chaotic phase, optimization becomes increasingly easy as the depth increases, but the generalization performance degrades and eventually the network fails completely away from the training set after $\xi_*$ layers. Therefore, in the chaotic phase, deep networks memorize the training data. We will confirm this prediction for both kernel prediction and neural network training in the experimental results; see Figure 3.
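The conditioning claim for the chaotic phase can be probed with a small numerical sketch. It assumes $\phi = \mathrm{erf}$ (closed-form $T$ and $\dot{T}$), idealizes the training NTK as a matrix with a common diagonal entry and a common off-diagonal entry, and uses hypothetical hyperparameters and initial conditions; it is an illustration of the mechanism, not a reproduction of the experiments.

```python
import math
import numpy as np

def chaotic_condition_number(sigma_w2, sigma_b2, q_ab0, m, depth):
    """Condition number of an idealized m x m NTK with diagonal p and off-diagonal p_ab,
    obtained by iterating Equations 13 and 14 for phi = erf (assumption)."""
    q_star = 1.0
    for _ in range(500):  # fixed-point variance by plain iteration
        q_star = sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q_star / (1.0 + 2.0 * q_star)) + sigma_b2
    D = 1.0 + 2.0 * q_star
    chi_1 = sigma_w2 * (4.0 / math.pi) / math.sqrt(D * D - 4.0 * q_star ** 2)
    q_ab, p_ab, p = q_ab0, q_ab0, q_star          # assumed initial conditions
    for _ in range(depth):
        t_dot = (4.0 / math.pi) / math.sqrt(D * D - 4.0 * q_ab ** 2)
        q_next = sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q_ab / D) + sigma_b2
        p_ab = q_next + sigma_w2 * t_dot * p_ab
        p = q_star + chi_1 * p
        q_ab = q_next
    theta = (p - p_ab) * np.eye(m) + p_ab * np.ones((m, m))
    lam = np.linalg.eigvalsh(theta)               # ascending eigenvalues
    return lam[-1] / lam[0], chi_1

kappa_shallow, chi_1 = chaotic_condition_number(4.0, 0.1, q_ab0=0.1, m=10, depth=5)
kappa_deep, _ = chaotic_condition_number(4.0, 0.1, q_ab0=0.1, m=10, depth=40)
print(chi_1, kappa_shallow, kappa_deep)
```

As depth grows the diagonal dominates and the condition number approaches 1, in line with the statement that optimization becomes easy in this phase.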

### 4.2 The Ordered Phase $\chi_1 = \sigma_w^2 \dot{T}(q^*) < 1$

The ordered phase is defined by the stable fixed point with $c^* = 1$; in this case, disparate inputs will end up converging to the same output at the end of the network. In the ordered phase, Equation 14 implies that all the diagonal entries of $\Theta^{(l)}$ converge to the same value,

$$p^{(l)} = q^*\, \frac{\chi_1^{l+1} - 1}{\chi_1 - 1} \xrightarrow{\; l \to \infty \;} p^* = \frac{q^*}{1 - \chi_1} < \infty. \tag{18}$$

However, as with the NNGP kernel, the off-diagonal terms of the NTK, $p^{(l)}_{ab}$, will also converge to the value on the diagonal, $p^*$. It follows that the limiting kernels have the form $K^* = q^* \mathbf{1}\mathbf{1}^T$ and $\Theta^* = p^* \mathbf{1}\mathbf{1}^T$. Thus, the limiting kernels are highly singular and feature only one non-zero eigenvalue. Since the limit is singular, we must linearize the dynamics about the fixed point to gain insight into the limiting behavior of the network. To compute the corrections we first define the deviation from the fixed point,

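$$\epsilon^{(l)}_{ab} = q^{(l)}_{ab} - q^*_{ab}, \qquad \delta^{(l)}_{ab} = p^{(l)}_{ab} - p^*_{ab} \tag{19}$$

$$\epsilon^{(l)} = q^{(l)} - q^*, \qquad \delta^{(l)} = p^{(l)} - p^*. \tag{20}$$

The diagonal correction can be obtained directly from Equation 18, and we find that $\epsilon^{(l)} = 0$ and $\delta^{(l)} = -\chi_1^{l+1} p^*$. To compute the correction of the off-diagonals, we linearize the equation around the fixed point to find that asymptotically (see Section A),

$$\epsilon^{(l)}_{ab} \approx \chi_1^l\, \epsilon^{(0)}_{ab}, \qquad \delta^{(l)}_{ab} \approx \chi_1^l \Big[\delta^{(0)}_{ab} + l \Big(1 + \frac{\chi_2}{\chi_1}\, p^*\Big) \epsilon^{(0)}_{ab}\Big], \tag{21}$$

where $\chi_2 = \sigma_w^2 \ddot{T}(q^*)$, with $\ddot{T}$ the derivative of $\dot{T}$ with respect to its argument. While the NNGP and NTK feature the same exponential rate of convergence set by $\chi_1$, we see that the off-diagonal terms of the NTK feature polynomial corrections.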

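We see that $\Theta^{(l)}$ features approximately two eigenspaces. The first eigenspace corresponds to the single non-zero eigenvalue at the fixed point and it is very close to the DC mode (i.e. all entries of the eigenvector are equal to 1) with eigenvalue

$$\lambda^{(l)}_{\max} \approx (m-1)\big(p^* + \delta^{(l)}_{ab}\big) + \big(p^* + \delta^{(l)}\big) \to m p^* = \frac{m q^*}{1 - \chi_1}, \tag{22}$$

i.e. $\lambda^{(l)}_{\max}$ is the sum of one row, where $m$ is the size of the dataset. The second eigenspace comes from lifting the degenerate zero-modes when $l < \infty$; it has dimension $m-1$ with eigenvalue $\lambda^{(l)}_{\text{bulk}} = p^{(l)} - p^{(l)}_{ab} = O(l\chi_1^l)$, which goes to zero exponentially in the depth $l$. The eigenvalues of $K^{(l)}$ have a similar distribution with $\lambda_{\max} \to m q^*$ and $\lambda_{\text{bulk}} = O(\chi_1^l)$. Thus the condition number, $\kappa^{(l)}$, of both $\Theta^{(l)}$ and $K^{(l)}$ diverges exponentially, as $O\big((l\chi_1^l)^{-1}\big)$ (see Figure 1 (b)) and $O\big(\chi_1^{-l}\big)$ respectively. As discussed above, there is a polynomial correction in the condition number of the NTK that slightly improves its conditioning.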
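The two-eigenspace picture follows from a standard fact about matrices with a constant diagonal $a$ and a constant off-diagonal $b$: the all-ones vector has eigenvalue $a + (m-1)b$ (the row sum), and every vector orthogonal to it has eigenvalue $a - b$. The short NumPy check below verifies this for illustrative values of $a$, $b$, and $m$ that are not taken from the paper.

```python
import numpy as np

def two_level_spectrum(a, b, m):
    """Eigenvalues of the m x m matrix with `a` on the diagonal and `b` elsewhere."""
    theta = (a - b) * np.eye(m) + b * np.ones((m, m))
    lam = np.linalg.eigvalsh(theta)      # ascending order
    return lam[-1], lam[:-1]             # top eigenvalue, remaining (bulk) eigenvalues

lam_max, lam_bulk = two_level_spectrum(a=1.0, b=0.9, m=6)
print(lam_max, lam_bulk)
```

As $b$ approaches $a$ the bulk eigenvalue $a - b$ vanishes while the top eigenvalue tends to $ma$, which is the singular limit described for the ordered phase.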

Since is singular, we insert a diagonal regularization term into of the linear predictor Equation 6, where is a positive constant independent from and . Define the regularized mean and residual predictors to be

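$$P_\sigma(\Theta) = \Theta_{\text{test,train}} \big(\Theta_{\text{train,train}} + \sigma\, \mathrm{Id}\big)^{-1} \tag{23}$$

$$\Delta^{(l)}_\sigma = P_\sigma(\Theta^{(l)}) - P_\sigma(\Theta^*). \tag{24}$$

We find $\Delta^{(l)}_\sigma = O(l\chi_1^l)$; see Section B for the derivation. In summary, in the ordered phase, the depth scale $\xi_1 = -1/\log(\chi_1)$ (for simplicity, we ignore the polynomial correction) governs both trainability and generalizability of the predictor.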

### 4.3 The Critical Line $\chi_1 = \sigma_w^2 \dot{T}(q^*) = 1$

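On the critical line both the diagonal and the off-diagonal terms of $\Theta^{(l)}$ diverge linearly in the depth while $K^{(l)}$ converges to $q^* \mathbf{1}\mathbf{1}^T$. From Equation 14 we see immediately that the diagonal terms are given by $q^{(l)} = q^*$ and $p^{(l)} = (l+1)\, q^* \approx l q^*$. To compute the correction of the off-diagonals, we keep the definition of $\epsilon^{(l)}_{ab}$ unchanged but define $\delta^{(l)}_{ab}$ slightly differently to the above, as $\delta^{(l)}_{ab} = p^{(l)}_{ab} - l q^*$, to take into account the linear divergence at large depths. Taylor expanding to second order we find, with $\chi_2 = \sigma_w^2 \ddot{T}(q^*)$,

$$\epsilon^{(l)}_{ab} = -\frac{2}{\chi_2}\, \frac{1}{l} + o\Big(\frac{1}{l}\Big), \qquad \delta^{(l)}_{ab} = -\frac{2}{3}\, l q^* + O(1). \tag{25}$$

Thus for large $l$, $\Theta^{(l)}$ has diagonal entries $p^{(l)} \approx l q^*$ and off-diagonal entries $p^{(l)}_{ab} \approx \tfrac{1}{3}\, l q^*$. As in the ordered phase, for large $l$ it follows that $\Theta^{(l)}$ essentially has two eigenspaces: one has dimension one and the other has dimension $m - 1$, with

$$\lambda^{(l)}_{\max} = \frac{(m+2)\, q^*}{3}\, l + m\, O(1), \qquad \lambda^{(l)}_{\text{bulk}} = \frac{2}{3}\, q^* l + O(1), \tag{26}$$

and the condition number $\kappa^{(l)} = \frac{m+2}{2} + m\, O(l^{-1})$ as $l \to \infty$; see Figure 1 (c). Unlike in the chaotic and ordered phases, $\kappa^{(l)}$ converges with rate $O(l^{-1})$. The NNGP kernel $K^{(l)}$ has $\lambda_{\max} \to m q^*$ and $\lambda_{\text{bulk}} = O(l^{-1})$, and its condition number diverges linearly with slope $m q^* \chi_2 / 2$. A similar calculation shows that $\Delta^{(l)}$ decays polynomially in $l$ on the critical line. In summary, $\kappa^{(l)}$ converges to a finite number and the network ought to be trainable for arbitrary depth, but the residual predictor decays polynomially, explaining why critically initialized networks with thousands of layers could still generalize (Xiao et al., 2018).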
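The finite limit of the condition number on the critical line can be checked with a short numerical sketch. It assumes $\phi = \mathrm{erf}$, for which the critical hyperparameters at a chosen $q^*$ follow from $\sigma_w^2 \cdot \frac{4}{\pi}(1+4q^*)^{-1/2} = 1$ together with the fixed-point condition of Equation 2. It again idealizes the NTK as having a common diagonal and a common off-diagonal entry. The names, the value of $q^*$, and the initial conditions are hypothetical.

```python
import math

def critical_erf_hyperparameters(q_star):
    """(sigma_w^2, sigma_b^2) with chi_1 = 1 and fixed-point variance q_star, for phi = erf."""
    sigma_w2 = (math.pi / 4.0) * math.sqrt(1.0 + 4.0 * q_star)
    sigma_b2 = q_star - sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q_star / (1.0 + 2.0 * q_star))
    return sigma_w2, sigma_b2

def critical_condition_number(q_star, q_ab0, m, depth):
    """Iterate Equations 13 and 14 on the critical line and return (kappa, p_ab / p)."""
    sigma_w2, sigma_b2 = critical_erf_hyperparameters(q_star)
    D = 1.0 + 2.0 * q_star
    q_ab, p_ab, p = q_ab0, q_ab0, q_star           # assumed initial conditions
    for _ in range(depth):
        t_dot = (4.0 / math.pi) / math.sqrt(D * D - 4.0 * q_ab * q_ab)
        q_next = sigma_w2 * (2.0 / math.pi) * math.asin(2.0 * q_ab / D) + sigma_b2
        p_ab = q_next + sigma_w2 * t_dot * p_ab
        p = q_star + p                             # chi_1 = 1 on the critical line
        q_ab = q_next
    lam_max = p + (m - 1) * p_ab                   # row sum (DC mode)
    lam_bulk = p - p_ab                            # remaining m - 1 modes
    return lam_max / lam_bulk, p_ab / p

m = 10
kappa, ratio = critical_condition_number(q_star=1.0, q_ab0=0.2, m=m, depth=5000)
print(kappa, (m + 2) / 2, ratio)
```

For this choice the off-diagonal entries settle at roughly one third of the diagonal ones, so the condition number approaches $(m+2)/2$ instead of diverging; the approach is slow, consistent with polynomial rather than exponential convergence.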

### 4.4 Remarks

We end this section with a couple of remarks. (1) The above theory holds for CNNs; see Section D. In the large depth setting, the NTK of CNNs without pooling is essentially the same as the NTK of FCNs; see Figure 1. (2) In the ordered phase, adding a dropout layer could significantly improve the conditioning of the NTK. For example, when adding dropout to the penultimate layer, the condition number converges to a finite number rather than diverging exponentially; see (f) in Figure 1 and Equation 99 in the appendix.

## 5 Experiments

In this section, we provide empirical results to support the theoretical results in Section 4. Figure 1 is generated using synthetic data and all other plots are generated using CIFAR-10 with an MSE loss.

Evolution of (Figure 1). We randomly sample inputs with shapes for FCN and for CNN-F/CNN-P, where and . We compute the exact NTK with activation function Erf using the Neural Tangents library (Novak et al., 2019a). We see excellent agreement between the theoretical calculation of in Section 4 (summarized in Table 1) and the experimental results Figure 1.

Maximum Learning Rates (Figure 2). In practice, given a set of hyperparameters of a network, knowing the range of feasible learning rates is extremely valuable. As discussed above, in the infinite width setting, Equation 4 implies the maximal convergent learning rate is given by $2/\lambda_{\max}$. We argue that this value is a good prediction for the maximal convergent learning rate of wide networks. To test this statement, we apply SGD to train a collection of fully-connected networks on CIFAR-10 using training samples with the following configurations: (1) width: 2048 (2) fixed, (3) depths: , (4) 10 different values of moving from the ordered phase (blue) to the chaotic phase (red) (5) 10 different learning rates , with . Overall, we see excellent agreement for depths less than or equal to 20 and reasonably good agreement for depth 40. We point out that the degradation of the agreement at larger depth may be due to the fact that the finite width NTK becomes more stochastic as the ratio between depth and width increases (Hanin & Nica, 2019). Note that Table 1 predicts that, as depth increases, the maximal learning rate should decay exponentially in the chaotic phase and linearly on the critical line, and remain roughly constant in the ordered phase.

Trainability vs Generalization (Figure 3 top). Our theoretical result suggests that in the deep chaotic regime ( is large) training becomes easier but the network can not generalize. On the other hand, the network can generalize but training becomes much more difficult as one moves towards the deep ordered region because blows up exponentially. To confirm this claim, we conduct an experiment using 16k training samples from CIFAR-10 with different configurations. We train each network using SGD with batch size and learning rate . Deep in the chaotic phase we see that all configurations reach perfect training accuracy but the network completely fails to generalize in the sense test accuracy approaches . However, in the ordered phase although the training accuracy degrades, generalization improves. The network eventually becomes untrainable after layers. In both phases we see that the depth scales, and respectively, perfectly capture the transition from generalizing to untrainable or overfitting.

CNN-P v.s. CNN-F: spatial correction (Figure 3 bottom). We compute the test accuracy using the analytic NTK predictor Equation 5, which corresponds to the test accuracy of ensemble of gradient descent trained neural networks taking the width to infinity. We choose training points, fix , and choose different configurations. We plot the test performance of CNN-P and CNN-F and the performance difference in Fig 3. Remarkably, the performance of both CNN-P and CNN-F are captured by in the ordered phase and by in the chaotic phase. We see that the test performance difference between CNN-P and CNN-F exhibits a region in the ordered phase (a blue strip) where CNN-F outperforms CNN-P by a large margin. This performance difference is due to the correction term as predicted by the -row of Table 1.

## 6 Further Related Work

There has been a significant recent literature studying the global convergence of neural networks in the over-parameterized regime. Under the same scaling limit (also known as the kernel regime or linearized regime) used in this paper, the parameters of the network do not move much from their initial values. The NTK essentially remains constant and global convergence of deep networks has been proved (Jacot et al., 2018; Du et al., 2018b; Allen-Zhu et al., 2018; Du et al., 2018a; Zou et al., 2018). However, in another scaling limit, namely the mean field limit, global convergence results are much more difficult to obtain and are known for neural networks with one hidden layer (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2018; Rotskoff & Vanden-Eijnden, 2018). Therefore, understanding the training and generalization properties in this mean field limit remains a very challenging open question.

Two concurrent works (Hayou et al., 2019; Jacot et al., 2019) also study the dynamics of $\Theta^{(l)}$ for FCNs (and deconvolutions in Jacot et al. (2019)) as a function of depth and variances of the weights and biases. Hayou et al. (2019) investigate the role of activation functions (smooth vs. non-smooth) and skip-connections. Jacot et al. (2019) demonstrate that batch normalization helps remove the "ordered phase" (as in Yang et al. (2019)) and that a layer-dependent learning rate allows every layer in a network to contribute to learning.

We highlight some of the key differences in our paper: 1) we provide a non-asymptotic (and asymptotic) theory for the spectrum of the NTK in the large depth limit for both FCNs and CNNs; 2) we elucidate a quantitative relationship between trainability, generalization, hyperparameters, and architectural choices (e.g. pooling vs. flattening) that are commonplace in the field, and in doing this we successfully disentangle generalization from trainability; 3) we provide large scale experiments verifying our theory.

## 7 Conclusion and Future work

In this work, we identify several quantities ($\lambda_{\max}$, $\lambda_{\text{bulk}}$, $\kappa$, and $\Delta^{(l)}$) related to the spectrum of the NTK that control trainability and generalization of deep networks. We offer a precise characterization of these quantities and provide substantial experimental evidence supporting the theoretical results. In future work, we would like to extend our framework to other architectures, e.g., ResNets (with batch-norm) and attention models. Understanding the implications of the sub-Fourier modes in the NTK for the test performance of CNNs is also an important research direction. Finally, extending our results to shallower networks remains an important open question.

## Appendix A Signal propagation of NNGP and NTK

Recall that

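$$q^{(l+1)}_{ab} = \sigma_w^2\, T\big(q^{(l)}_{ab}\big) + \sigma_b^2, \qquad p^{(l+1)}_{ab} = q^{(l+1)}_{ab} + \sigma_w^2\, \dot{T}\big(q^{(l)}_{ab}\big)\, p^{(l)}_{ab} \tag{27}$$

$$q^{(l+1)} = q^*, \qquad p^{(l+1)} = q^* + \sigma_w^2\, \dot{T}(q^*)\, p^{(l)}. \tag{28}$$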

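### A.1 Correction of the off-diagonals in the chaotic/ordered phase

Applying Taylor's expansion to the first equation of (27) gives

$$
\begin{aligned}
q^*_{ab} + \epsilon^{(l+1)}_{ab} &= \sigma_w^2\, T\big(q^*_{ab} + \epsilon^{(l)}_{ab}\big) + \sigma_b^2 && (29) \\
&= \sigma_w^2\, T(q^*_{ab}) + \sigma_b^2 + \sigma_w^2\, \dot{T}(q^*_{ab})\, \epsilon^{(l)}_{ab} + O\big((\epsilon^{(l)}_{ab})^2\big) && (30) \\
&= q^*_{ab} + \sigma_w^2\, \dot{T}(q^*_{ab})\, \epsilon^{(l)}_{ab} + O\big((\epsilon^{(l)}_{ab})^2\big). && (31)
\end{aligned}
$$

With $\chi_{c^*} = \sigma_w^2 \dot{T}(q^*_{ab})$, we have

$$\epsilon^{(l+1)}_{ab} \approx \chi_{c^*}\, \epsilon^{(l)}_{ab}. \tag{32}$$

Similarly, applying Taylor's expansion to the second equation of (27) gives

$$\delta^{(l+1)}_{ab} \approx \epsilon^{(l+1)}_{ab} + \chi_{c,2}\, p^*_{ab}\, \epsilon^{(l)}_{ab} + \chi_{c^*}\, \delta^{(l)}_{ab} \approx \Big(1 + \frac{\chi_{c,2}}{\chi_{c^*}}\, p^*_{ab}\Big) \epsilon^{(l+1)}_{ab} + \chi_{c^*}\, \delta^{(l)}_{ab}, \tag{33}$$

where $\chi_{c,2} = \sigma_w^2 \ddot{T}(q^*_{ab})$ and $\ddot{T}$ denotes the derivative of $\dot{T}$ with respect to its argument. This implies

$$\epsilon^{(l)}_{ab} \approx \chi_{c^*}^l\, \epsilon^{(0)}_{ab} \tag{34}$$

$$\delta^{(l)}_{ab} \approx \chi_{c^*}^l \Big[\delta^{(0)}_{ab} + l \Big(1 + \frac{\chi_{c,2}}{\chi_{c^*}}\, p^*_{ab}\Big) \epsilon^{(0)}_{ab}\Big]. \tag{35}$$

Note that $\delta^{(l)}_{ab}$ contains a polynomial correction term and decays like $l \chi_{c^*}^l$.

The correction to the fixed points in the ordered phase can be obtained using the same calculation, with $\chi_2 = \sigma_w^2 \ddot{T}(q^*)$:

$$\epsilon^{(l)}_{ab} \approx \chi_1^l\, \epsilon^{(0)}_{ab} \tag{36}$$

$$\delta^{(l)}_{ab} \approx \chi_1^l \Big[\delta^{(0)}_{ab} + l \Big(1 + \frac{\chi_2}{\chi_1}\, p^*\Big) \epsilon^{(0)}_{ab}\Big]. \tag{37}$$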

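### A.2 Correction of the off-diagonals on the critical line

We have $\chi_1 = 1$ on the critical line. We need to expand the first equation of (27) to second order, with $\chi_2 = \sigma_w^2 \ddot{T}(q^*)$:

$$\epsilon^{(l+1)}_{ab} = \epsilon^{(l)}_{ab} + \frac{1}{2}\, \chi_2 \big(\epsilon^{(l)}_{ab}\big)^2 + O\big((\epsilon^{(l)}_{ab})^3\big). \tag{38}$$

Here we assume $T$ has a continuous third derivative. The above equation implies

$$\epsilon^{(l)}_{ab} = -\frac{2}{\chi_2}\, \frac{1}{l} + o\Big(\frac{1}{l}\Big). \tag{39}$$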

Then

 δ(l+1)ab =q(l+1)ab−q∗+σ2ω˙T(q∗+ϵ(l)ab)p(l)ab−lq∗ (40) ≈ϵ