Machine learning models based on deep neural networks have attained state-of-the-art performance across a dizzying array of tasks including vision (Cubuk et al., 2019), speech recognition (Park et al., 2019), machine translation (Bahdanau et al., 2014), chemical property prediction (Gilmer et al., 2017), diagnosing medical conditions (Raghu et al., 2019), and playing games (Silver et al., 2018). Historically, the rampant success of deep learning models has lacked a sturdy theoretical foundation; architectures, hyperparameters, and learning algorithms are often selected by brute force search (Bergstra & Bengio, 2012)
and heuristics (Glorot & Bengio, 2010). Recently, significant theoretical progress has been made on several fronts that shows promise for making neural network design more systematic. In particular, in the infinite width (or channel) limit, the distribution of functions induced by neural networks with random weights and biases has been precisely characterized before, during, and after training.
The study of infinite networks dates back to seminal work by Neal (1994), who showed that the distribution of functions given by single hidden-layer networks with random weights and biases in the infinite-width limit is a Gaussian Process (GP). Recently, there has been renewed interest in studying random, infinite networks, starting with concurrent work on “conjugate kernels” (Daniely et al., 2016; Daniely, 2017) and “mean-field theory” (Poole et al., 2016; Schoenholz et al., 2017). The former set of papers argued that the empirical covariance matrix of pre-activations becomes deterministic in the infinite-width limit and called this the conjugate kernel of the network, while the latter papers studied the properties of these limiting kernels along with the kernel describing the distribution of gradients. In particular, it was shown that the spectrum of the conjugate kernel of wide fully-connected networks approaches a well-defined, data-independent limit when the depth exceeds a certain scale, $\xi$. Networks with $\tanh$-nonlinearities (among other bounded activations) exhibit a phase transition between two limiting spectral distributions of the conjugate kernel as a function of their hyperparameters, with $\xi$ diverging at the transition. It was additionally hypothesized that networks were un-trainable when the conjugate kernel was sufficiently close to its limit.
Subsequent work extended this analysis to networks with residual connections (Yang & Schoenholz, 2017), networks with quantized activations (Blumenfeld et al., 2019), the spectrum of the Fisher information (Karakida et al., 2018), a range of activation functions (Hayou et al., 2018), and weight-tied autoencoders (Li & Nguyen, 2019). In each case, it was observed that the spectra of the kernels correlated strongly with whether or not the architectures were trainable. While these papers studied the properties of the conjugate kernels, especially the spectrum in the large-depth limit, a branch of concurrent work made a stronger statement: that many networks converge to Gaussian Processes as their width becomes large (Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019b; Garriga-Alonso et al., 2018; Yang, 2019). In this case, the conjugate kernel was referred to as the Neural Network Gaussian Process (NNGP) kernel.
Together, this work offered a significant advance in our understanding of wide neural networks; however, this theoretical progress was limited to networks at initialization or after Bayesian posterior estimation and provided no link to gradient descent. Moreover, there was some preliminary evidence suggesting that the situation might be more nuanced than the qualitative link between the NNGP spectrum and trainability implies. For example, Philipp & Carbonell (2018) observed that deep fully-connected $\tanh$-networks could be trained after the kernel reached its large-depth, data-independent limit but that these networks did not generalize to unseen data.
In the last year, significant theoretical clarity has been reached regarding the relationship between the GP prior and the distribution following gradient descent. In particular, Jacot et al. (2018) along with followup work (Lee et al., 2019; Chizat et al., 2019) showed that the distribution of functions induced by gradient descent for infinite-width networks is a Gaussian Process with a particular compositional kernel known as the Neural Tangent Kernel (NTK). In addition to characterizing the distribution over functions following gradient descent in the wide network limit, the learning dynamics can be solved analytically throughout optimization.
In this paper, we leverage these developments and revisit the relationship between architecture, hyperparameters, trainability, and generalization in the large-depth limit for a variety of neural networks. In particular, we make the following contributions:
- We introduce the residual predictor, namely the difference between the finite-depth and infinite-depth NTK predictions, which is related to the model’s ability to generalize: the network fails to generalize if this difference is too small.
- We show that the ordered and chaotic phases identified in Poole et al. (2016) lead to markedly different limiting spectra of the NTK. A corollary is that, as a function of depth, the optimal learning rates ought to decay exponentially in the chaotic phase, decay linearly on the order-to-chaos transition line, and remain roughly constant in the ordered phase.
- We examine the differences in the above quantities for fully-connected networks (FCNs) and convolutional networks (CNNs) with and without pooling and precisely characterize the effect of pooling on the interplay between trainability, generalization, and depth.
- We provide substantial experimental evidence supporting these claims, including experiments that densely vary the hyperparameters of FCNs and CNNs with and without pooling.
Together these results provide a complete, analytically tractable, and dataset-independent theory for learning in very deep and wide networks. Finally, our results provide clarity regarding the observation that for linear networks the learning rate must be decreased linearly in the depth of the network (Saxe et al., 2013). Here, we note that this is true only for networks that are initialized critically, i.e. on the order-to-chaos phase boundary.
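We summarize recent developments in the study of wide random networks. We keep the discussion relatively informal; see Lee et al. (2018), Matthews et al. (2018), and Novak et al. (2019b) for a more rigorous version of these arguments. To simplify this discussion and as a warmup for the main text, we consider the case of FCNs. Consider a fully-connected network of depth $L$ where each layer has a width $N^{(l)}$ and an activation function $\phi$. In this work we take $\phi = \tanh$; however, most of the results hold for a wide range of non-linearities, though specifics, such as the phase diagram, can vary substantially. For simplicity, we take the width of the hidden layers to infinity sequentially: $N^{(1)} \to \infty, \dots, N^{(L-1)} \to \infty$. The network is parameterized by weights $W^{(l)}$ and biases $b^{(l)}$ that we take to be randomly initialized with $W^{(l)}_{ij}, b^{(l)}_i \sim \mathcal{N}(0, 1)$, along with hyperparameters $\sigma_w$ and $\sigma_b$ that set the scale of the weights and biases. Letting the $i$th pre-activation in the $l$th layer due to an input $x$ be given by $z^{(l)}_i(x)$, the network is then described by the recursion,

$$z^{(l+1)}_i(x) = \frac{\sigma_w}{\sqrt{N^{(l)}}} \sum_j W^{(l+1)}_{ij}\, \phi\big(z^{(l)}_j(x)\big) + \sigma_b\, b^{(l+1)}_i. \tag{1}$$

Notice that as $N^{(l)} \to \infty$, the $z^{(l)}_i$ are i.i.d. Gaussian with zero mean. Given a dataset of $m$ points, the distribution over pre-activations can therefore be described completely by the covariance matrix $\mathcal{K}^{(l)}$ between neurons in different inputs, $\mathcal{K}^{(l)}_{ij} = \mathbb{E}\big[z^{(l)}_k(x_i)\, z^{(l)}_k(x_j)\big]$. Inspecting Equation 1, we see that $\mathcal{K}^{(l+1)}$ can be computed in terms of $\mathcal{K}^{(l)}$ as

$$\mathcal{K}^{(l+1)} = \sigma_w^2\, \mathcal{T}\big(\mathcal{K}^{(l)}\big) + \sigma_b^2, \qquad \mathcal{T}(\mathcal{K}) = \mathbb{E}_{z \sim \mathcal{N}(0, \mathcal{K})}\big[\phi(z)\phi(z)^T\big], \tag{2}$$

for $\mathcal{T}$, an appropriately defined operator from the space of positive semi-definite matrices to itself.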
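To make the kernel recursion concrete, the following sketch iterates a map of the form $\mathcal{K}^{(l+1)} = \sigma_w^2\, \mathcal{T}(\mathcal{K}^{(l)}) + \sigma_b^2$ in NumPy. It is an illustration under stated assumptions rather than the authors' code: it uses the erf activation because the Gaussian expectation then has a known closed form (the arcsine kernel), it treats the input Gram matrix as the layer-0 kernel, the function names `erf_T` and `nngp_kernels` are hypothetical, and the hyperparameter values are arbitrary.

```python
import numpy as np

def erf_T(K):
    """E[erf(u) erf(v)] for (u, v) ~ N(0, K), evaluated entrywise (arcsine kernel)."""
    d = 1.0 + 2.0 * np.diag(K)
    ratio = 2.0 * K / np.sqrt(np.outer(d, d))
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))

def nngp_kernels(K0, sigma_w2, sigma_b2, depth):
    """Return [K0, K1, ..., K_depth] under K_{l+1} = sigma_w2 * T(K_l) + sigma_b2."""
    Ks = [K0]
    for _ in range(depth):
        Ks.append(sigma_w2 * erf_T(Ks[-1]) + sigma_b2)
    return Ks

# Illustrative inputs and hyperparameters (assumed values, not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
K0 = X @ X.T / X.shape[1]          # input Gram matrix used as the layer-0 kernel
Ks = nngp_kernels(K0, sigma_w2=1.5, sigma_b2=0.1, depth=50)

# The diagonal (per-input variance) settles to a single input-independent value.
print(np.diag(Ks[-1]))
```

The printed diagonal entries agree with one another at large depth, which is the numerical counterpart of the per-input variance reaching a fixed point that does not depend on the input.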
Equation 2 describes a dynamical system on positive semi-definite matrices . It was shown in Poole et al. (2016) that fixed points, , of these dynamics exist such that with independent of the inputs and . The values of and are determined by the hyperparameters, and . However Equation 2 admits multiple fixed points (e.g. ) and the stability of these fixed points plays a significant role in determining the properties of the network. Generically, there are large regions of the plane in which the fixed-point structure is constant punctuated by curves, called phase transitions, where the structure changes.
The rate at which $\mathcal{K}^{(l)}$ approaches or departs $\mathcal{K}^*$ can be determined by expanding Equation 2 about its fixed point [Footnote 1: More precisely, one needs to consider the Jacobian of $\mathcal{T}$ as an operator from positive semi-definite matrices to positive semi-definite matrices. We refer the reader to Section B of Xiao et al. (2018) for more details.], to find
with . This expansion naturally exhibits exponential convergence to - or divergence from - the fixed-point as where . Since does not depend on or it follows that will take on a single value, , whenever . If then this fixed point is stable, but if then the fixed point is unstable and, as discussed above, the system will converge to a different fixed point. If then the hyperparameters lie at a phase transition and convergence is non-exponential. As was shown in Poole et al. (2016), there is always a fixed-point at whose stability is determined by . This defines the order-to-chaos transition. Note, that can be used to define a depth-scale, that describes the number of layers over which approaches
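The stability analysis above can be explored numerically. The sketch below assumes the standard mean-field definitions from the cited work (Poole et al., 2016; Schoenholz et al., 2017), which are not spelled out in the surviving text: the variance fixed point solves $q^* = \sigma_w^2\, \mathbb{E}_z[\phi(\sqrt{q^*} z)^2] + \sigma_b^2$ with $z \sim \mathcal{N}(0,1)$, the stability of the perfectly correlated fixed point is governed by $\chi_1 = \sigma_w^2\, \mathbb{E}_z[\phi'(\sqrt{q^*} z)^2]$, and the associated depth scale is $1/|\log \chi_1|$. It uses $\phi = \tanh$ and Gauss-Hermite quadrature; the helper names are hypothetical.

```python
import numpy as np

# Probabilists' Gauss-Hermite rule; after normalization the weights sum to 1,
# so sum(w * f(z)) approximates E[f(z)] for z ~ N(0, 1).
_z, _w = np.polynomial.hermite_e.hermegauss(64)
_w = _w / np.sqrt(2.0 * np.pi)

def gauss_mean(f):
    return float(np.sum(_w * f(_z)))

def q_star(sigma_w2, sigma_b2, iters=500, q0=1.0):
    """Iterate the variance map for tanh until it (numerically) reaches its fixed point."""
    q = q0
    for _ in range(iters):
        q = sigma_w2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return q

def chi_1(sigma_w2, sigma_b2):
    """chi_1 = sigma_w2 * E[phi'(sqrt(q*) z)^2], using tanh'(x) = 1 - tanh(x)^2."""
    q = q_star(sigma_w2, sigma_b2)
    return sigma_w2 * gauss_mean(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)

def depth_scale(chi):
    """Depth scale 1 / |log(chi)|; it diverges as chi approaches 1."""
    return np.inf if np.isclose(chi, 1.0) else 1.0 / abs(np.log(chi))

for sw2, sb2 in [(0.5, 0.0), (4.0, 0.0)]:
    c = chi_1(sw2, sb2)
    phase = "ordered" if c < 1 else "chaotic"
    print(f"sigma_w^2={sw2}, sigma_b^2={sb2}: chi_1={c:.3f} ({phase}), depth scale={depth_scale(c):.2f}")
```

With zero bias variance and $\sigma_w^2 < 1$ the variance fixed point is $q^* = 0$, so $\chi_1 = \sigma_w^2$ exactly, which gives a simple sanity check on the quadrature.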
This provides a precise characterization of the NNGP kernel at large depths. As discussed above, recent work (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019) has connected the prior described by the NNGP with the result of gradient descent training using a quantity called the NTK. To construct the NTK, suppose we enumerate all the parameters in the fully-connected network described above by . The finite width NTK is defined by where is the Jacobian evaluated at a point . The main result in Jacot et al. (2018) was to show that in the infinite-width limit, the NTK converges to a deterministic kernel and remains constant over the course of training. As such, at a time during gradient descent training with an MSE loss, the expected outputs of an infinitely wide network, , evolve as
for train and test points respectively; see Section 2 in Lee et al. (2019). Here denotes the NTK between the test inputs and training inputs and is defined similarly. Since converges to , the gradient flow dynamics of real network also converge to the dynamics described by Equation 4 and Equation 5 (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019; Yang, 2019; Arora et al., 2019; Huang & Yau, 2019). As the training time, tends to infinity we note that these equations reduce to and . Consequently we call the linear operator
the “mean predictor” or “predictor” for short. In addition to showing that the NTK describes networks during gradient descent, Jacot et al. (2018) showed that the NTK could be computed in closed form in terms of , , and the NNGP as,
where is the NTK for the pre-activations at layer-.
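As a rough numerical companion to the closed-form NTK, the sketch below iterates the standard layer-wise recursion $\Theta^{(l+1)} = \mathcal{K}^{(l+1)} + \sigma_w^2\, \dot{\mathcal{T}}(\mathcal{K}^{(l)}) \odot \Theta^{(l)}$, where $\dot{\mathcal{T}}$ is the analogue of $\mathcal{T}$ with $\phi$ replaced by $\phi'$. This form is assumed from Jacot et al. (2018), since the displayed equation did not survive in the text above. The sketch uses erf, for which both Gaussian expectations have closed forms, normalizes the inputs to a common norm, and reports the condition number as a function of depth. All function names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def erf_T(K):
    """E[erf(u) erf(v)] for (u, v) ~ N(0, K), entrywise."""
    d = 1.0 + 2.0 * np.diag(K)
    return (2.0 / np.pi) * np.arcsin(np.clip(2.0 * K / np.sqrt(np.outer(d, d)), -1.0, 1.0))

def erf_T_dot(K):
    """E[erf'(u) erf'(v)] for (u, v) ~ N(0, K), entrywise."""
    d = 1.0 + 2.0 * np.diag(K)
    return (4.0 / np.pi) / np.sqrt(np.outer(d, d) - 4.0 * K ** 2)

def ntk_by_depth(K0, sigma_w2, sigma_b2, depth):
    """Return the list of NTK matrices for depths 1..depth."""
    K = sigma_w2 * K0 + sigma_b2      # first affine layer
    theta = K.copy()
    thetas = [theta]
    for _ in range(depth - 1):
        K_next = sigma_w2 * erf_T(K) + sigma_b2
        theta = K_next + sigma_w2 * erf_T_dot(K) * theta
        thetas.append(theta)
        K = K_next
    return thetas

def condition_number(theta):
    lam = np.linalg.eigvalsh(theta)   # ascending order
    return lam[-1] / lam[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 32))
X *= np.sqrt(X.shape[1]) / np.linalg.norm(X, axis=1, keepdims=True)  # equal-norm inputs
K0 = X @ X.T / X.shape[1]

# Assumed example settings: (1.0, 0.3) lies in the ordered phase for erf,
# (4.0, 0.0) in the chaotic phase.
kappa_ordered = [condition_number(t) for t in ntk_by_depth(K0, 1.0, 0.3, 30)]
kappa_chaotic = [condition_number(t) for t in ntk_by_depth(K0, 4.0, 0.0, 30)]
print(kappa_ordered[4], kappa_ordered[-1])
print(kappa_chaotic[4], kappa_chaotic[-1])
```

In this toy setting the condition number grows rapidly with depth for the ordered-phase hyperparameters and approaches one for the chaotic-phase hyperparameters, in line with the qualitative picture developed in the following sections.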
3 Metrics for Trainability and Generalization at Large Depth
We begin by discussing the interplay between the conditioning of and the trainability of wide networks. We can write Equation 4 in terms of the spectrum of letting as,
where are the eigenvalues of and , are the mean prediction and the labels respectively written in the eigenbasis of . If we order the eigenvalues such that then it has been hypothesized in e.g. Lee et al. (2019) that the maximum feasible learning rate scales as [Footnote 2: For finite width, the optimization problem is non-convex and there are no rigorous bounds on the maximum learning rate.], as we verify empirically in Section 5. Plugging this scaling for into Equation 8 we see that the smallest eigenvalue will converge exponentially at a rate given by the condition number. It follows that if the condition number of the NTK associated with a neural network diverges then it will become untrainable and so we use as a metric for trainability. We will see that at large depths, the spectrum of typically features a single large eigenvalue, , and then a gap that is large compared with the rest of the spectrum. We therefore will often refer to a typical eigenvalue in the bulk as and approximate the condition number as .
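The role of the largest eigenvalue and the condition number can be seen in a small simulation of the linearized dynamics. The sketch below assumes the discrete-time update $f_{t+1} = f_t - \eta\, \Theta\, (f_t - Y)$ on the training outputs, for which each eigenmode of the residual is multiplied by $1 - \eta \lambda_i$ per step. Stability therefore requires $\eta < 2/\lambda_{\max}$, and at that scale the slowest mode contracts by roughly $1 - 2/\kappa$ per step. The kernel here is a synthetic positive-definite matrix rather than an NTK, and all names are hypothetical.

```python
import numpy as np

def train_residual_norms(theta, y, lr, steps):
    """Run f <- f - lr * theta @ (f - y) from f = 0 and record ||f - y|| at each step."""
    f = np.zeros_like(y)
    norms = [np.linalg.norm(f - y)]
    for _ in range(steps):
        f = f - lr * theta @ (f - y)
        norms.append(np.linalg.norm(f - y))
    return np.array(norms)

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
theta = A @ A.T / 10 + 0.1 * np.eye(10)    # synthetic positive-definite kernel
y = rng.standard_normal(10)

lam = np.linalg.eigvalsh(theta)            # ascending eigenvalues
lam_max, kappa = lam[-1], lam[-1] / lam[0]

stable = train_residual_norms(theta, y, lr=1.9 / lam_max, steps=200)
unstable = train_residual_norms(theta, y, lr=2.1 / lam_max, steps=200)
print(f"kappa = {kappa:.1f}")
print(f"lr = 1.9/lam_max: residual {stable[0]:.3f} -> {stable[-1]:.3e}")
print(f"lr = 2.1/lam_max: residual {unstable[0]:.3f} -> {unstable[-1]:.3e}")
```

Just below the threshold the residual shrinks, while just above it the component along the top eigenvector grows geometrically. This is the linear-model version of the maximum learning rate discussed above.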
In the large-depth limit we will see that converges to independent of the data distribution. In this case will be a rank-1 constant matrix. As such, the mean prediction defined by Equation 5 completely fails to generalize since the prediction is independent of the test inputs. We define the finite depth correction to the infinite depth predictor [Footnote 3: If diverges to infinity, we define . If is singular, we will add a diagonal regularizer into .],
By the triangle inequality, the generalization error is lower bounded by
is a constant independent of the test inputs and Equation 10 is large if is too small. Therefore, a necessary condition for the network to generalize is that there exists some such that
As such, we use as a metric for generalization in this paper.
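The following sketch illustrates why a constant limiting kernel cannot generalize and what the finite-depth correction measures. It assumes the mean predictor takes the usual kernel-regression form $P(\Theta) = \Theta_{\text{test,train}}\, \Theta_{\text{train,train}}^{-1}$ with a small diagonal regularizer, models the limiting kernel as a constant matrix $p^* \mathbf{1}\mathbf{1}^T$, and models a finite-depth kernel as that constant plus a small data-dependent perturbation. The perturbation, the value of `p_star`, and all function names are hypothetical stand-ins rather than quantities from the paper.

```python
import numpy as np

def mean_predictor(theta_test_train, theta_train_train, ridge=0.0):
    """Return P = theta_test_train @ inv(theta_train_train + ridge * I)."""
    m = theta_train_train.shape[0]
    reg = theta_train_train + ridge * np.eye(m)
    # reg is symmetric, so solve(reg, B.T).T equals B @ inv(reg).
    return np.linalg.solve(reg, theta_test_train.T).T

rng = np.random.default_rng(0)
m_train, m_test = 6, 3
F = rng.standard_normal((m_train + m_test, 12))
G = F @ F.T / F.shape[1]                  # stand-in for data-dependent structure
y = rng.standard_normal(m_train)
p_star, delta, ridge = 2.0, 0.1, 1e-3     # assumed illustrative values

def split(theta):
    """Split a joint kernel into (test-train, train-train) blocks."""
    return theta[m_train:, :m_train], theta[:m_train, :m_train]

theta_limit = p_star * np.ones_like(G)    # rank-1 constant kernel
theta_finite = theta_limit + delta * G    # constant kernel plus a small correction

P_limit = mean_predictor(*split(theta_limit), ridge=ridge)
P_finite = mean_predictor(*split(theta_finite), ridge=ridge)
residual = P_finite - P_limit             # finite-depth correction to the predictor

print("limit predictions: ", P_limit @ y)
print("finite predictions:", P_finite @ y)
```

With the constant kernel every test point receives the same prediction, so any ability to distinguish test inputs comes entirely from the residual term.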
Our goal is therefore to characterize the evolution of the two metrics and in . We follow the methodology outlined in Schoenholz et al. (2017); Xiao et al. (2018) to explore the spectrum of the NTK as a function of depth. We will use this to make precise predictions relating trainability and generalization to the hyperparameters . Our main results are summarized in Table 1 which describes the evolution of (the largest eigenvalue of ), (the remaining eigenvalues), , and in three different phases (ordered, chaotic, and the phase transition) and their dependence on , the size of the training set, the choices of architectures: FCN, CNN-F (convolution with flattening) and CNN-P (convolution with pooling), and size, , of the window in the pooling layer (which we always take to be the penultimate layer).
We give a brief derivation of these results in Section 4 followed by a more detailed discussion in the appendix. However, it is useful to first give a qualitative overview of the phenomenology. In the ordered phase, and . At large depths since it follows that and so the condition number diverges exponentially quickly. Thus, in the ordered phase we expect networks not to be trainable (or, specifically, the time they take to learn will grow exponentially in their depth). The predictor scales as which goes to zero at the same rate as the divergence of ; thus, in the ordered phase networks fail to train and generalize simultaneously. By contrast in the chaotic phase we see that there is no gap between and and networks become perfectly conditioned and are trainable everywhere. However, in this regime we see that the predictor scales as . Since in the chaotic phase and it follows that over a depth . Thus, in the chaotic phase, networks fail to generalize at a finite depth but remain trainable indefinitely. Finally, notice that introducing pooling modestly augments the depth over which networks can generalize in the chaotic phase but reduces the depth in the ordered phase. We will explore all of these predictions in detail in section 5.
4 Large-Depth Asymptotics of the NNGP and NTK
We now give a brief derivation of the results in Table 1. To simplify the notation we will discuss fully-connected networks and then extend the results to CNNs with pooling (CNN-P) and without pooling (CNN-F). Details of these two cases can be found in the appendix. We will focus on the NTK here since Schoenholz et al. (2017) and Xiao et al. (2018) contain a detailed description of the NNGP in this case. As in Section 2, we will be concerned with the fixed points of as well as the linearization of Equation 7 about its fixed point. Recall that the fixed point structure is invariant within a phase so it suffices to consider the ordered phase, the chaotic phase, and the critical line separately. In cases where a stable fixed point exists, we will describe how converges to the fixed point. We will see that in the chaotic phase and on the critical line, has no stable fixed point and in that case we will describe its divergence. As above, in each case the fixed points of have a simple structure with
. To simplify the forthcoming analysis, without loss of generality, we assume the inputs are normalized to have variance [Footnote 4: It has been observed in previous works (Poole et al., 2016; Schoenholz et al., 2017) that the diagonals converge much faster than the off-diagonals for $\tanh$- or erf-networks.]. As such, we can treat and , restricted on , as point-wise functions,
Since the off-diagonal elements approach the same fixed point at the same rate, we use and to denote any off diagonal entry of and respectively. We will similarly use and to denote the limits, and . Using the above notation, Equation 7 and Equation 2 become
where and . In what follows, we split the discussion into three parts according to the values of recalling that in Poole et al. (2016); Schoenholz et al. (2017) it was shown that controls the fixed point structure.
4.1 The Chaotic Phase $\chi_1 > 1$:
The chaotic phase is so-named because so that similar inputs become more uncorrelated as they pass through the network. In this phase, the diagonal entries of grow exponentially and the off-diagonal entries converge to a fixed value. Indeed, Equation 14 implies,
which diverges exponentially. To find the limit of the off-diagonal terms, define which was shown to control convergence of the and is always less than 1 (Schoenholz et al., 2017; Xiao et al., 2018). Let in Equation 13, we find that
The rate of convergence of is (see Section A in the appendix). Since the diagonal terms diverge and the off-diagonal terms are finite it follows that in very deep networks in the chaotic phase, . Thus, in the chaotic phase, the spectrum of the NTK for very deep networks approaches the diverging constant multiplying the identity. From Equation 4 this implies that optimization in the chaotic phase should be easy since (provided numerical precision issues from the prefactor do not become problematic); see Figure 1 (a). However, computing the mean prediction on test points and noticing that we find (see Section B for the derivation),
It follows that in the chaotic phase the networks’ predictions on unseen data converge to exponentially quickly in the depth. Since Equation 17 decays like , we expect the network to fail to generalize after layers, where [Footnote 5: For simplicity, we ignore the polynomial correction in .].
In summary, for wide networks in the chaotic phase, as the depth increases optimization becomes increasingly easy but the generalization performance degrades, and eventually the network fails completely away from the training set after layers. Therefore, in the chaotic phase, a deep network memorizes the training data. We will confirm this prediction for both kernel prediction and neural network training in the experimental results; see Figure 3.
4.2 The Ordered Phase $\chi_1 < 1$:
The ordered phase is defined by the stable fixed point with ; in this case, disparate inputs will end up converging to the same output at the end of the network. In the ordered phase, Equation 14 implies that all the diagonal entries of converge to the same value,
However, as with the NNGP kernel, the off-diagonal terms of the NTK, , will also converge to the value on the diagonal, . It follows that the limiting kernels have the form and Thus, the limiting kernels are highly singular and feature only one non-zero eigenvalue. Since the limit is singular, we must linearize the dynamics about the fixed point to gain insight into the limiting behavior of the network. To compute the corrections we first define the deviation from the fixed point,
The diagonal correction can be obtained directly from Equation 18 and we find that and . To compute the correction of the off-diagonals, we linearize the equation around the fixed point to find that asymptotically (see Section A),
where with . While the NNGP and NTK feature the same exponential rate of convergence set by , we see that the off-diagonal terms of the NTK feature polynomial corrections.
We see that
features approximately two eigenspaces. The first eigenspace corresponds to the single non-zero eigenvalue at the fixed point and it is very close to the DC mode (i.e. all entries of the eigenvector are equal to 1) with eigenvalue
i.e. is the sum of one row, where is the size of the dataset. The second eigenspace comes from lifting the degenerate zero-modes when and it has dimension with eigenvalue which goes to zero exponentially over depth . The eigenvalues of have a similar distribution with and Thus the condition number, , of both and diverges exponentially as (see Figure 1 (b)) and respectively. As discussed above, there is a polynomial correction in the condition number of the NTK that slightly improves its conditioning.
Since is singular, we insert a diagonal regularization term into of the linear predictor Equation 6, where is a positive constant independent from and . Define the regularized mean and residual predictors to be
We find ; see Section B for the derivation. In summary, in the ordered phase, (for simplicity, we ignore the polynomial correction) governs both trainability and generalizability of the predictor.
4.3 The Critical Line
On the critical line both the diagonal and the off-diagonal terms of diverge linearly in the depth while converges to . From Equation 14 we see immediately that the diagonal terms are given by and . To compute the correction of the off-diagonals, we keep the definition of unchanged but define slightly differently to the above as to take into account the linear divergence at large depths. Taylor expanding to second order we find,
Thus for large , has the following form and . As in the ordered phase, for large it follows that essentially has two eigenspaces: one has dimension one and the other has dimension with
and the condition number as ; see Figure 1 (c). Unlike the chaotic and ordered phases, converges with rate . The has and and the condition number diverges linearly with slope . A similar calculation gives on the critical line. In summary, converges to a finite number and the network ought to be trainable for arbitrary depth but the residual predictor decays polynomially, explaining why critically initialized networks with thousands of layers could still generalize (Xiao et al., 2018).
We end this section with a couple of remarks. (1) The above theory holds for CNNs; see Section D. In the large depth setting, the NTK of CNNs without pooling is essentially the same as the NTK of FCNs; see Figure 1. (2) In the ordered phase, adding a dropout layer could significantly improve the conditioning of the NTK. For example, when dropout is added to the penultimate layer, the condition number converges to a finite number rather than diverging exponentially; see (f) in Figure 1 and Equation 99 in the appendix.
In this section, we provide empirical results to support the theoretical results in Section 4. Figure 1 is generated using synthetic data and all other plots are generated using CIFAR-10 with an MSE loss.
Evolution of (Figure 1). We randomly sample inputs with shapes for FCN and for CNN-F/CNN-P, where and . We compute the exact NTK with activation function Erf using the Neural Tangents library (Novak et al., 2019a). We see excellent agreement between the theoretical calculation of in Section 4 (summarized in Table 1) and the experimental results Figure 1.
Maximum Learning Rates (Figure 2). In practice, given a set of hyper-parameters of a network, knowing the range of feasible learning rates is extremely valuable. As discussed above, in the infinite width setting, Equation 4 implies the maximal convergent learning rate is given by . We argue that is a good prediction for the maximal convergent learning rate for wide networks. To test this statement, we apply SGD to train a collection of fully-connected networks on CIFAR-10 using training samples with the following configurations: (1) width: 2048 (2) fixed, (3) depths: , (4) 10 different values of moving from the ordered phase (blue) to the chaotic phase (red) (5) 10 different learning rates , with . Overall, we see excellent agreement for depths less than or equal to 20 and reasonably good agreement for depth 40. We point out that the degradation of the agreement for larger depth may be due to the fact that the finite width NTK becomes more stochastic as the ratio between depth and width increases (Hanin & Nica, 2019). Note that Table 1 indicates that, as depth increases, should decay exponentially in the chaotic phase, decay linearly on the critical line, and remain roughly constant in the ordered phase.
Trainability vs Generalization (Figure 3 top). Our theoretical result suggests that in the deep chaotic regime ( is large) training becomes easier but the network cannot generalize. On the other hand, the network can generalize but training becomes much more difficult as one moves towards the deep ordered region because blows up exponentially. To confirm this claim, we conduct an experiment using 16k training samples from CIFAR-10 with different configurations. We train each network using SGD with batch size and learning rate . Deep in the chaotic phase we see that all configurations reach perfect training accuracy but the network completely fails to generalize in the sense that test accuracy approaches . However, in the ordered phase, although the training accuracy degrades, generalization improves. The network eventually becomes untrainable after layers. In both phases we see that the depth scales, and respectively, closely capture the transition from generalizing to untrainable or overfitting.
CNN-P vs. CNN-F: spatial correction (Figure 3 bottom). We compute the test accuracy using the analytic NTK predictor Equation 5, which corresponds to the test accuracy of an ensemble of gradient descent trained neural networks taking the width to infinity. We choose training points, fix , and choose different configurations. We plot the test performance of CNN-P and CNN-F and the performance difference in Figure 3. Remarkably, the performance of both CNN-P and CNN-F is captured by in the ordered phase and by in the chaotic phase. We see that the test performance difference between CNN-P and CNN-F exhibits a region in the ordered phase (a blue strip) where CNN-F outperforms CNN-P by a large margin. This performance difference is due to the correction term as predicted by the -row of Table 1.
6 Further Related Work
There has been significant recent literature studying the global convergence of neural networks in the over-parameterized regime. Under the same scaling limit (also known as the kernel regime or linearized regime) used in this paper, parameters of the network do not move much from their initial values. The NTK essentially remains constant and global convergence of deep networks has been proved (Jacot et al., 2018; Du et al., 2018b; Allen-Zhu et al., 2018; Du et al., 2018a; Zou et al., 2018). However, in another scaling limit, namely the mean field limit, global convergence results are much more difficult to obtain and are known only for neural networks with one hidden layer (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2018; Rotskoff & Vanden-Eijnden, 2018). Therefore, understanding the training and generalization properties in this mean field limit remains a very challenging open question.
Two concurrent works (Hayou et al., 2019; Jacot et al., 2019) also study the dynamics of for FCNs (and deconvolutions in Jacot et al. (2019)) as a function of depth and variances of the weights and biases. Hayou et al. (2019) investigate the role of activation functions (smooth vs. non-smooth) and skip-connections. Jacot et al. (2019) demonstrate that batch normalization helps remove the “ordered phase” (as in Yang et al. (2019)) and that a layer-dependent learning rate allows every layer in a network to contribute to learning.
We highlight some of the key differences in our paper: 1) we provide a non-asymptotic (and asymptotic) theory for the spectrum of the NTK in the large depth limit for both FCNs and CNNs; 2) we elucidate a quantitative relationship between trainability, generalization, hyperparameters, and architectural choices (e.g. pooling vs. flattening) that are commonplace in the field, and in doing so we disentangle generalization from trainability; 3) we provide large scale experiments verifying our theory.
7 Conclusion and Future work
In this work, we identify several quantities (the largest eigenvalue, the bulk eigenvalues, the condition number, and the residual predictor) related to the spectrum of the NTK that control trainability and generalization of deep networks. We offer a precise characterization of these quantities and provide substantial experimental evidence supporting the theoretical results. In future work, we would like to extend our framework to other architectures, e.g., ResNets (with batch-norm) and attention models. Understanding the implication of the sub-Fourier modes in the NTK for the test performance of CNNs is also an important research direction. Finally, extending our results to shallower networks remains an important open question.
- Allen-Zhu et al. (2018) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
- Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bergstra & Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Blumenfeld et al. (2019) Yaniv Blumenfeld, Dar Gilboa, and Daniel Soudry. A mean field theory of quantized deep networks: The quantization-depth trade-off. arXiv preprint arXiv:1906.00771, 2019.
- Chen et al. (2018) Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, 2018.
- Chizat & Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, pp. 3040–3050, 2018.
- Chizat et al. (2019) Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. 2019.
- Cubuk et al. (2019) Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. In
- Daniely (2017) Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems 30. 2017.
- Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, 2016.
- Du et al. (2018a) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018a.
- Du et al. (2018b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks, 2018b.
- Garriga-Alonso et al. (2018) Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow gaussian processes, 2018.
- Gilboa et al. (2019) Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of lstms and grus. CoRR, abs/1901.08987, 2019. URL http://arxiv.org/abs/1901.08987.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1263–1272. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305381.3305512.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
- Hanin & Nica (2019) Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019.
- Hayou et al. (2018) Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.
- Hayou et al. (2019) Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. Mean-field behaviour of neural tangent kernel for deep neural networks, 2019.
- Huang & Yau (2019) Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. arXiv preprint arXiv:1909.08156, 2019.
- Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31. 2018.
- Jacot et al. (2019) Arthur Jacot, Franck Gabriel, and Clément Hongler. Freeze and chaos for dnns: an ntk view of batch normalization, checkerboard and boundary effects, 2019.
- Karakida et al. (2018) Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Universal statistics of fisher information in deep neural networks: mean field approach. arXiv preprint arXiv:1806.01316, 2018.
- Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
- Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- Li & Nguyen (2019) Ping Li and Phan-Minh Nguyen. On random deep weight-tied autoencoders: Exact asymptotic analysis, phase transitions, and implications to training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJx54i05tX.
- Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 4 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Mei et al. (2018) Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- Neal (1994) Radford M. Neal. Priors for infinite networks (tech. rep. no. crg-tr-94-1). University of Toronto, 1994.
- Novak et al. (2019a) Roman Novak, Lechao Xiao, Jaehoon Lee, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python, 2019a. URL http://github.com/google/neural-tangents.
- Novak et al. (2019b) Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019b.
- Park et al. (2019) Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Philipp & Carbonell (2018) George Philipp and Jaime G. Carbonell. The nonlinearity coefficient - predicting generalization in deep neural networks, 2018.
- Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
- Raghu et al. (2019) Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning with applications to medical imaging. arXiv preprint arXiv:1902.07208, 2019.
- Rotskoff & Vanden-Eijnden (2018) Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
- Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. International Conference on Learning Representations, 2017.
- Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018. ISSN 0036-8075. doi: 10.1126/science.aar6404. URL https://science.sciencemag.org/content/362/6419/1140.
- Sirignano & Spiliopoulos (2018) Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
- Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.
- Yang & Schoenholz (2017) Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems. 2017.
- Yang (2019) Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Yang et al. (2019) Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.
- Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Signal propagation of NNGP and NTK
A.1 Correction of the off-diagonals in the chaotic/ordered phase
Applying a Taylor expansion to the first equation of Eq. 27 gives
With , we have
Similarly, applying a Taylor expansion to the second equation of Eq. 27 gives
where . This implies
Note that contains a polynomial correction term and decays like .
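As an illustration of how a polynomial correction of this kind can arise, the following is a minimal sketch under assumptions that are not taken from the text above: the two linearized deviations are assumed to share a single contraction rate, and the second is assumed to be driven linearly by the first. The names `chi`, `kappa`, `eps0`, and `delta0` are hypothetical placeholders rather than the paper's notation. Under these assumptions the driven deviation behaves like l * chi^l instead of chi^l.

```python
def iterate_corrections(chi, kappa, eps0, delta0, depth):
    """Iterate a toy pair of linearized recursions (illustrative assumption):

        eps_{l+1}   = chi * eps_l
        delta_{l+1} = chi * delta_l + kappa * eps_l

    Returns two lists of length depth + 1 holding eps_l and delta_l.
    """
    eps, delta = [eps0], [delta0]
    for _ in range(depth):
        # Both updates read the previous-layer values.
        e, d = eps[-1], delta[-1]
        eps.append(chi * e)
        delta.append(chi * d + kappa * e)
    return eps, delta


def closed_form_delta(chi, kappa, eps0, delta0, l):
    """Closed form of the toy recursion (requires chi != 0):

        delta_l = chi**l * delta0 + l * kappa * chi**(l - 1) * eps0
    """
    return chi**l * delta0 + l * kappa * chi ** (l - 1) * eps0


if __name__ == "__main__":
    chi, kappa, eps0, delta0 = 0.8, 0.5, 0.1, 0.2  # hypothetical values
    eps, delta = iterate_corrections(chi, kappa, eps0, delta0, depth=30)
    for l in (0, 10, 20, 30):
        # delta_l / chi**l grows linearly in l: the polynomial factor.
        print(l, delta[l] / chi**l)
```

In this toy model the ratio `delta[l] / chi**l` increases by the constant `kappa * eps0 / chi` at every layer, so the deviation carries a factor linear in depth on top of the exponential decay.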
The correction to the fixed points in the ordered phase can be obtained using the same calculation:
A.2 Correction of the off-diagonals on the critical line
We have on the critical line. We need to expand the first equation of 27 to the second order
Here we assume has a continuous third derivative. The above equation implies