Neural Tangent Kernel Eigenvalues Accurately Predict Generalization

10/08/2021

by James B. Simon, et al. (UC Berkeley)

Finding a quantitative theory of neural network generalization has long been a central goal of deep learning research. We extend recent results to demonstrate that, by examining the eigensystem of a neural network's "neural tangent kernel", one can predict its generalization performance when learning arbitrary functions. Our theory accurately predicts not only test mean-squared-error but all first- and second-order statistics of the network's learned function. Furthermore, using a measure quantifying the "learnability" of a given target function, we prove a new "no-free-lunch" theorem characterizing a fundamental tradeoff in the inductive bias of wide neural networks: improving a network's generalization for a given target function must worsen its generalization for orthogonal functions. We further demonstrate the utility of our theory by analytically predicting two surprising phenomena - worse-than-chance generalization on hard-to-learn functions and nonmonotonic error curves in the small data regime - which we subsequently observe in experiments. Though our theory is derived for infinite-width architectures, we find it agrees with networks as narrow as width 20, suggesting it is predictive of generalization in practical neural networks. Code replicating our results is available at https://github.com/james-simon/eigenlearning.


1 Introduction

Deep neural networks have proven extraordinarily useful for a wide array of learning problems, but theoretical understanding of their high performance is notably lacking. One longstanding mystery is the fact that the functions learned by neural networks typically generalize quite well to new data. In light of neural networks’ extreme overparameterization and high expressivity (Zhang et al., 2017), conventional statistical intuition incorrectly predicts poor performance (Belkin et al., 2019; James et al., 2013), leading to a lack of useful generalization bounds and a need for new concepts to augment classical wisdom.

As is common when facing great mysteries, the first challenge lies in framing a precise question. To clarify our aims, we pose the following problem:

Can one efficiently predict, from first principles, how well a given network architecture will generalize when learning a given function from a given number of training examples?

Here we show that one can. To do so, we build on two recent developments in deep learning theory. The first is the theory of infinite-width networks, which has shown that, as hidden-layer widths tend to infinity, commonly-used neural networks take remarkably simple analytical forms (Daniely et al., 2016; Lee et al., 2018; Novak et al., 2019a; Arora et al., 2019). In particular, a wide network trained by gradient descent with mean-squared-error (MSE) loss is equivalent to a classical model called kernel regression, where the kernel is the network’s so-called “neural tangent kernel” (NTK) (Jacot et al., 2018; Lee et al., 2019). This line of work suggests that, by studying these simple infinite-width expressions, we might gain insight into the behavior of real, finite networks.
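To make this equivalence concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) implements the kernel regression predictor f̂(x) = K(x, D) K_D^{-1} f(D), with a generic Gaussian RBF kernel standing in for a network's NTK; the kernel, data, and target below are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    # stand-in kernel; in the paper's setting this would be a network's NTK
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def kernel_regression(X_train, y_train, X_test, kernel=rbf_kernel):
    K_D = kernel(X_train, X_train)                 # kernel matrix on the training set
    K_xD = kernel(X_test, X_train)                 # kernel between test and training points
    return K_xD @ np.linalg.solve(K_D, y_train)    # f_hat at the test points

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(20, 1))
y_train = np.sin(3 * X_train[:, 0])                # target function f(x) = sin(3x)
X_test = np.linspace(-1, 1, 5)[:, None]
print(kernel_regression(X_train, y_train, X_test))
```

All of the network-specific physics enters through the kernel; the question studied below is how the choice of kernel determines which targets this predictor learns well.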

The second development takes this approach, studying the generalization performance of kernel regression and showing agreement with finite networks. In a pioneering study, Bordelon et al. (2020) derived an approximation for the generalization MSE of kernel regression and showed that, using the network NTK as the kernel, their expressions accurately predict the MSE of neural networks learning arbitrary functions. Their results reveal a simple picture of neural network generalization: as samples are added to the training set, the network generalizes well on a larger and larger subspace of input functions. The natural basis for this subspace of learnable functions is the eigenbasis of the NTK, and its eigenfunctions are learned in descending order of their eigenvalues. We note a substantial body of related work studying the inductive bias of neural networks and kernel regression (Valle-Perez et al., 2018; Yang and Salman, 2019; Bietti and Mairal, 2019; Spigler et al., 2020; Ortiz-Jiménez et al., 2021; Cohen et al., 2021; Canatar et al., 2021).

In this paper, we extend these results significantly. We begin in Section 2.2 by formulating a figure of merit we call the “learnability” of a target function, a measure which we prove obeys several desirable properties that MSE does not. In Section 2.4, we use learnability to prove a strong new “no-free-lunch” theorem describing a fundamental tradeoff in the inductive bias of a kernel towards all functions in an orthogonal basis. This shows that not only are higher NTK eigenmodes easier to learn, but these eigenmodes are in a zero-sum competition with one another for the ability to be learned at a given training set size. We further prove that, for any kernel or wide network, this tradeoff necessarily leaves some functions with worse-than-chance generalization.

In Sections 2.5 and 2.6, we derive expressions for not just expected test MSE (a particular second-order statistic of the learned function) but for all first- and second-order statistics of the learned function. Our expression for learnability (a first-order statistic of the learned function) is extremely simple, involving just one dataset-size-dependent parameter that acts as an eigenvalue threshold above which eigenmodes are well-learned. From the form of this expression, we conclude that for any supervised neural network learning problem, eigenmodes' eigenvalues and learnabilities lie on the same universal curve. Furthermore, in Section 2.7 we expand MSE for small dataset size and make the counterintuitive prediction that, for many functions, MSE increases with training set size in this small-data regime.

In Section 3, we experimentally verify all our conclusions on three input spaces for both exact NTK regression and trained finite networks. We find that our analytical expressions give excellent agreement with the true generalization behavior of real deep ReLU networks. We verify our no-free-lunch result with finite networks, and we experimentally observe our theoretical predictions of both worse-than-chance generalization and nonmonotonic MSE. Finally, we compare our theory to finite nets with widths as low as 20 and find that our theory holds with remarkable accuracy even at these narrow widths, suggesting that it is not merely applicable in the canonical NTK regime but in fact correctly predicts generalization performance in practical neural networks.

Figure 1: Spectral analysis of neural kernels allows accurate prediction of key measures of learning and generalization. All panels illustrate the generalization performance of a 4L ReLU net learning a sinusoidal eigenmode on the discretized unit circle from n random training points; the first and third rows show a lower-frequency (higher-eigenvalue) mode, and the second and fourth rows a higher-frequency (lower-eigenvalue) mode. All sinusoids of integer frequency are eigenmodes of the NTK on this domain. Our theoretical predictions agree closely with experiment. (A-F): Plots of the model's learned interpolation of the data after training via full-batch gradient descent. As n increases, the model's learned function approaches the true function. Our theory correctly predicts that the higher-eigenvalue mode is learned at smaller n. (G, J): Test MSE between the target and the learned function as a function of n. Here and in subsequent panels, dots indicate means, and error bars represent symmetric variation due to random initialization and dataset selection. Theoretical curves show excellent agreement, correctly predicting the faster decrease of the MSE of the more-learnable mode. Surprisingly, MSE can increase with n (left of (J)). (H, K): Coefficients of a spurious eigenmode in the learned function as a function of n, plotted on a "symmetric log" scale. For the higher-eigenvalue target these coefficients are smaller than for the lower-eigenvalue target at any given n, and for both targets they approach zero as the true function is fully learned. Our theoretical bounds (shaded regions) show excellent agreement with experimental data. (I, L): "Learnability," our measure of the alignment of the target and learned functions, as a function of n. As n increases, learnability monotonically increases from its minimum of 0 to its maximum of 1. The higher-eigenvalue mode's learnability is always greater, approaching 1 at smaller n than the lower-eigenvalue mode.

2 Theory

2.1 A review of kernel regression

Consider the task of learning a (generally vector-valued) function f given a set of n unique training points D = {x_1, …, x_n} and their corresponding function values f(D). To simplify our analysis, we will let the domain X be discrete with size M and assume the training points are sampled uniformly from X. We later note how, taking M → ∞, our results extend to settings where the data are sampled nonuniformly from a continuous domain.

We will use f̂ to denote the function learned by a neural network trained on this dataset. Remarkably, for an infinite-width neural network optimized via gradient descent to zero training MSE loss, this learned function is given by

(1)   f̂(x) = K(x, D) K_D^{-1} f(D),

where K is the network's "neural tangent kernel" (NTK) (Jacot et al., 2018; Lee et al., 2019), K_D is the "kernel matrix" with elements (K_D)_ab = K(x_a, x_b), and K(x, D) is a row vector with components K(x, x_a). (Naively, Equation 1 is only the expected learned function, and the true learned function will include a fluctuation term reflecting the random initialization. However, by storing a copy of the parameters at initialization and subtracting the network's initial function from its output throughout optimization and at test time, this term becomes zero, and so we neglect it in our theory and use this trick in our experiments.) We give a brief introduction to the NTK in Appendix C.

Due to its similarity to the normal equation of linear regression, Equation 1 is often called "kernel regression." Interestingly, exact Bayesian inference for infinite-width neural networks yields predictions of the same form as Equation 1, with K being the "neural network Gaussian process" (NNGP) kernel instead of the NTK (Lee et al., 2018). We will proceed treating K as a network's NTK, but our theory and exact results (including our "no-free-lunch" theorem) apply equally well to any incarnation of kernel regression.

Equation 1 holds exactly in the infinite-width limit of fully-connected networks (Lee et al., 2019), convolutional networks (Arora et al., 2019), transformers (Hron et al., 2020), and more (Yang, 2019). Moreover, several empirical studies have shown it to be a good approximation for networks of even modest width (Lee et al., 2019, 2020). Our approach will be to study the generalization behavior of Equation 1, conjecture that our results also apply to finite networks, and finally provide strong support for our conjecture with experiments.

Examining Equation 1, one finds that the output indices of f̂ are predicted independently: the learned f̂ is equivalent to simply vectorizing the results of kernel regression on each index. For simplicity, we hereafter assume a scalar target function; to apply our results to a multivariate target function, one simply considers each index of the target function independently.

2.2 Figures of merit of f̂

We will study three measures of the quality of the learned function f̂. All three will be defined in terms of the inner product over X: for two functions g and h, their inner product is ⟨g, h⟩ ≡ E_x[g(x) h(x)], the expectation of their product over the input measure.

The first measure of quality is mean-squared error (MSE). For a particular dataset D, MSE is given by MSE_D(f) ≡ ⟨f − f̂, f − f̂⟩. Of more interest will be the expected MSE over all datasets of size n, given by MSE(f) ≡ E_D[MSE_D(f)]. We note that the inner product is taken over all of X, including the training points, even though f̂(x) = f(x) for x ∈ D for kernel regression.

In maximizing the similarity of f̂ to f, we typically wish to minimize its similarity to all functions orthogonal to f. The second measure examines the coefficient in f̂ of one such orthogonal function. Letting g be a function such that ⟨f, g⟩ = 0, we consider the mean and variance of the quantity ⟨g, f̂⟩. We will derive accurate predictions for this metric of generalization.

Lastly, we introduce a figure of merit quantifying the alignment of f and f̂, which we call "learnability." It is given by

(2)   L_D(f) ≡ ⟨f, f̂⟩ / ⟨f, f⟩,    L(f) ≡ E_D[L_D(f)],

where L_D(f) is the dataset-dependent learnability of the function f ("D-learnability") and L(f) is its expectation over random datasets ("learnability"). Though at first glance these two seem like odd figures of merit, we will soon show that they have many desirable properties when f̂ is given by Equation 1: unlike MSE, both are bounded in [0, 1], always change monotonically as new data points are added, are invariant to rescalings of f, and obey a simple conservation law. Furthermore, expanding the inner product in the definition of MSE and noting that ⟨f̂, f̂⟩ ≥ 0, one can see that low MSE is impossible without high learnability. We will ultimately derive an accurate approximation for learnability that is substantially simpler than any known approximation for MSE.
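As a concrete illustration of these three figures of merit, here is a short NumPy sketch (our own conventions, not the authors' code) that evaluates MSE, the coefficient of an orthogonal "spurious" mode, and D-learnability for a toy learned function on a small discrete domain under the uniform measure:

```python
import numpy as np

def inner(g, h):
    return np.mean(g * h)                  # <g, h>: expectation over the uniform measure on the domain

def mse(f, f_hat):
    return inner(f - f_hat, f - f_hat)     # first figure of merit

def spurious_coeff(f_hat, g):
    return inner(g, f_hat) / inner(g, g)   # coefficient of an orthogonal mode g in f_hat

def d_learnability(f, f_hat):
    return inner(f, f_hat) / inner(f, f)   # alignment of f_hat with the target f

# toy example on an 8-point domain: f_hat only partially captures f and leaks into sin
M = 8
theta = 2 * np.pi * np.arange(M) / M
f = np.cos(theta)
f_hat = 0.7 * np.cos(theta) + 0.1 * np.sin(theta)
print(mse(f, f_hat), spurious_coeff(f_hat, np.sin(theta)), d_learnability(f, f_hat))
```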

2.3 The kernel eigensystem

By definition, any kernel function is symmetric and positive-semidefinite. This implies that we can find a set of orthonormal eigenfunctions {φ_i} and nonnegative eigenvalues {λ_i} that satisfy

(3)   ⟨K(x, ·), φ_i⟩ = λ_i φ_i(x).

For simplicity (and to ensure that K_D is invertible), we will assume that K is in fact positive definite and all λ_i > 0, an assumption that will hold in most cases of interest. (By Mercer's Theorem, one can also find the eigensystem of any kernel on a continuous input space. We do not use the reproducing kernel Hilbert space (RKHS) formalism in this work, but we note that, by the Moore-Aronszajn theorem, the kernel defines a unique RKHS.)
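As a concrete example of this eigendecomposition, the following NumPy sketch (our conventions; the kernel is an arbitrary stand-in, not an actual NTK) diagonalizes a translation-invariant kernel on a discretized unit circle with respect to the uniform measure, yielding eigenfunctions orthonormal under the inner product above and nonnegative eigenvalues:

```python
import numpy as np

M = 64
theta = 2 * np.pi * np.arange(M) / M                        # discretized unit circle
K = np.exp(np.cos(theta[:, None] - theta[None, :]))         # a symmetric, positive-definite stand-in kernel

# solve <K(x, .), phi_i> = lambda_i phi_i(x): diagonalize the measure-weighted operator K / M
eigvals, eigvecs = np.linalg.eigh(K / M)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
phis = np.sqrt(M) * eigvecs[:, order]                       # rescale so <phi_i, phi_j> = delta_ij

print(eigvals[:5])                                          # eigenvalues decay with frequency
print(np.allclose(phis.T @ phis / M, np.eye(M)))            # orthonormality under the uniform measure
```

Because this toy kernel is rotation-invariant, its eigenfunctions come out as Fourier modes on the circle, grouped into degenerate sine/cosine pairs.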

We will now translate Equation 1 to this eigenbasis. First we decompose f and f̂ into weighted sums of the eigenfunctions as

(4)   f = Σ_i v_i φ_i,    f̂ = Σ_i v̂_i φ_i,

where v and v̂ are vectors of coefficients. Using this notation, MSE is MSE_D(f) = Σ_i (v_i − v̂_i)² and D-learnability is L_D(f) = (Σ_i v_i v̂_i) / (Σ_i v_i²).

Noting that K(x_a, x_b) = Σ_i λ_i φ_i(x_a) φ_i(x_b), we can decompose the kernel matrix as K_D = Φᵀ Λ Φ, where Φ is the "design matrix" whose a-th column contains the eigenfunction values (φ_1(x_a), φ_2(x_a), …) and Λ is a diagonal matrix of eigenvalues. The learned coefficients are then given by v̂_i = λ_i [Φ K_D^{-1} Φᵀ v]_i. Stacking these coefficients into a matrix equation, we find that

(5)   v̂ = T(D) v,    T(D) ≡ Λ Φ (Φᵀ Λ Φ)^{-1} Φᵀ,

where T(D) is a square matrix, independent of f, that fully describes the model's learning behavior on a training set D. We call this fundamental quantity the "learning transfer matrix." If we can determine the statistical properties of this matrix, we will understand the learning behavior of our model.
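The following NumPy sketch (a synthetic toy eigensystem of our own construction, not the paper's code) builds the learning transfer matrix explicitly and checks that v̂ = T(D) v reproduces the direct kernel-regression prediction of Equation 1:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 40, 10                                          # domain size and training set size (toy values)
Psi = np.sqrt(M) * np.linalg.qr(rng.standard_normal((M, M)))[0]   # Psi[x, i] = phi_i(x)
lam = np.arange(1, M + 1, dtype=float) ** -2.0         # a decaying eigenvalue spectrum
Lam = np.diag(lam)
K = Psi @ Lam @ Psi.T                                  # K(x, x') = sum_i lam_i phi_i(x) phi_i(x')

train = rng.choice(M, size=n, replace=False)
Phi = Psi[train].T                                     # design matrix: one column per training point

# learning transfer matrix T(D) = Lam Phi (Phi^T Lam Phi)^{-1} Phi^T
T = Lam @ Phi @ np.linalg.solve(Phi.T @ Lam @ Phi, Phi.T)

v = rng.standard_normal(M)                             # eigencoefficients of the target, f = Psi @ v
f = Psi @ v
f_hat = K[:, train] @ np.linalg.solve(K[np.ix_(train, train)], f[train])   # Equation 1 directly
print(np.allclose(Psi @ (T @ v), f_hat))               # True: v_hat = T(D) v matches kernel regression
```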

We also define the mean learning transfer matrix T̄ ≡ E_D[T(D)]. D-learnability and learnability are then respectively given by L_D(f) = vᵀ T(D) v / (vᵀ v) and L(f) = vᵀ T̄ v / (vᵀ v).

2.4 Exact results

The following lemma gives basic properties of the quantities defined above.

Lemma 1.

The following properties of T(D), T̄, L_D, and L hold:

  (a) L_D(φ_i) = T(D)_ii, and L(φ_i) = T̄_ii.

  (b) When n = 0, T(D) = 0 and all learnabilities are 0.

  (c) When n = M, T(D) = I and all learnabilities are 1.

  (d) All eigenvalues of T(D) are in {0, 1}, all eigenvalues of T̄ are in [0, 1], and so L_D(f), L(f) ∈ [0, 1].

  (e) Let D' be D ∪ {x'}, where x' is a new data point. Then L_{D'}(f) ≥ L_D(f).

  (f) For any i, ∂L_D(φ_i)/∂λ_i ≥ 0, hence ∂L(φ_i)/∂λ_i ≥ 0.

  (g) For any j ≠ i, ∂L_D(φ_j)/∂λ_i ≤ 0, hence ∂L(φ_j)/∂λ_i ≤ 0.

Property (a) in Lemma 1 formalizes the relationship between the transfer matrix and learnability. Properties (b-e) together give an intuitive picture of the learning process: the learning transfer matrix monotonically interpolates between zero and the identity as the training set grows; adding data never harms the learnability of any function. Properties (f-g) show that the kernel eigenmodes are in competition: increasing one eigenvalue improves the learnability of the corresponding eigenfunction, but decreases the learnabilities of all others. We prove Lemma 1 in Appendix D.

We now present our first major result.

Theorem 1 (“No-free-lunch” theorem for kernel regression).

For any complete basis of orthogonal functions {f_i} on X,

(6)   Σ_i L_D(f_i) = Σ_i L(f_i) = n.

The proof, which hinges on the intermediate result that this sum equals the trace of the learning transfer matrix, is given in Appendix E. We note that this result is stronger than the ordinary "no-free-lunch" theorem for learning algorithms, which requires averaging over all target functions instead of merely an orthonormal basis (Wolpert, 1996). To understand the significance of this result, consider that one might naively hope to design a neural kernel that achieves generally high performance for all target functions. Theorem 1 states that this is impossible: averaged over a complete basis of functions, all kernels achieve the same learnability. This exact result implies that, because there exist no universally high-performing kernels, we must instead aim to choose a kernel whose high-eigenvalue modes align well with the function to be learned. To our knowledge, this is the first exact result quantifying such a tradeoff in kernel regression or deep learning. We note that Theorem 1 also applies to linear regression, a special case of kernel regression with a linear (dot-product) kernel.
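As a quick numerical sanity check of this conservation law (again with a synthetic eigensystem of our own construction), the summed D-learnabilities, i.e. the trace of T(D), equal n regardless of the spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 30, 7
Psi = np.sqrt(M) * np.linalg.qr(rng.standard_normal((M, M)))[0]        # an arbitrary orthonormal eigenbasis
for lam in (np.arange(1, M + 1, dtype=float) ** -2.0, rng.uniform(0.1, 1.0, M)):   # two arbitrary spectra
    Phi = Psi[rng.choice(M, n, replace=False)].T                        # design matrix for a random dataset
    T = np.diag(lam) @ Phi @ np.linalg.solve(Phi.T @ np.diag(lam) @ Phi, Phi.T)
    print(np.trace(T))                                                  # = sum_i L_D(phi_i) = n = 7
```

Because the trace is fixed at n for every kernel, raising one diagonal entry of T̄ by reshaping the spectrum necessarily lowers others.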

Illustrating a consequence of this tradeoff, the following theorem states that, for any kernel, there will always be functions poorly aligned with the kernel's inductive bias on which the model generalizes as badly as or worse than it would by simply predicting zero on all unseen test points.

Theorem 2 (Negative generalization).

There is always at least one eigenfunction φ_i for which L(φ_i) ≤ L₀(φ_i) and MSE(φ_i) ≥ MSE₀(φ_i), where L₀ and MSE₀ are the learnability and mean MSE given by a naive, non-generalizing model with predictions given by

(7)   f̂₀(x) = f(x) for x ∈ D, and f̂₀(x) = 0 otherwise.

This theorem follows from Theorem 1. We give a full proof in Appendix E.

2.5 Deriving a closed-form expression for T̄

We will now derive an approximation for T̄ and the second moments of T(D), ultimately yielding simple yet accurate expressions for learnability, spurious-mode coefficients, and MSE. We sketch our method and state our results here and provide a full derivation in Appendix F.

We begin by noting that the expectation in T̄ = E_D[T(D)] is essentially an expectation over a combinatorially large set of possible design matrices Φ, each of which has orthonormal columns. We replace this with an average over all matrices with orthonormal columns, which we write

(8)   T̄ ≈ E_Φ[Λ Φ (Φᵀ Λ Φ)^{-1} Φᵀ].

We can then readily prove by symmetry that the off-diagonal elements of T̄ vanish, leaving the question of how each diagonal element depends on the set of eigenvalues. To probe this, we consider adding an (M+1)-th "test eigenmode" to the problem, augmenting the principal quantities as

(9)   Λ → diag(λ_1, …, λ_M, λ_{M+1}),    Φ → [Φ; φ_{M+1}(D)ᵀ],    v → (v, v_{M+1}).

Using the Sherman-Morrison formula, we find that the new mode's learnability is given by

(10)   L(φ_{M+1}) ≈ λ_{M+1} / (λ_{M+1} + C_D),

where C_D is a nonnegative dataset-dependent constant. For large n and realistic eigenspectra, C_D is distributed tightly around its mean, and so we can replace it with a constant, which we call C. We emphasize that C is the same for every eigenmode. Using Theorem 1 to obtain a constraint on C, we reach our ultimate approximation that

(11)   L(φ_i) ≈ T̄_ii ≈ λ_i / (λ_i + C),    where C ≥ 0 satisfies Σ_i λ_i / (λ_i + C) = n.

From Lemma 1, we immediately obtain the following major result:

Lemma 2.

Under the above approximations, the learnability of a function f = Σ_i v_i φ_i is given by

(12)   L(f) ≈ [Σ_i v_i² λ_i / (λ_i + C)] / [Σ_j v_j²],

with C given by Equation 11.

A function is thus more learnable the more weight it places in high eigenvalue modes. We show in Section 3 that Equation 12 is in excellent agreement with experiments using both kernel regression and trained finite networks.
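A minimal NumPy sketch of this prediction (our own code; the power-law spectrum and single-eigenmode target are arbitrary stand-ins for a real NTK eigensystem): solve the constraint of Equation 11 for C by bisection, then evaluate Equation 12:

```python
import numpy as np

def solve_C(lam, n, tol=1e-12):
    # sum_i lam_i / (lam_i + C) is strictly decreasing in C, so bisection works
    lo, hi = 0.0, lam.sum()
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        if np.sum(lam / (lam + mid)) > n:
            lo = mid                       # constraint sum still too large: increase C
        else:
            hi = mid
    return 0.5 * (lo + hi)

def learnability(v, lam, n):
    C = solve_C(lam, n)
    return np.sum(v**2 * lam / (lam + C)) / np.sum(v**2)

lam = np.arange(1, 101, dtype=float) ** -2.0    # a power-law stand-in spectrum with 100 modes
v = np.zeros(100); v[3] = 1.0                   # target = the fourth eigenmode
print([round(float(learnability(v, lam, n)), 3) for n in (1, 5, 20, 80)])   # rises toward 1 with n
```

Modes with eigenvalues well above C sit near the top of the resulting sigmoidal curve and modes well below C near the bottom, which is the universal form explored in Figure 3.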

The learnability of each eigenmode depends critically on the value of C. We now present several properties characterizing how C depends on n.

Lemma 3.

For C satisfying the constraint in Equation 11, with the eigenvalues λ_1 ≥ λ_2 ≥ … ordered from greatest to least, the following properties hold:

  (a) C → ∞ as n → 0, and C → 0 as n → M.

  (b) C is strictly decreasing with n.

  (c) C ≤ […] for all n.

  (d) C ≥ […] for all n.

Properties (a-b), paired with Equation 11, paint a surprisingly simple picture of the learning process: as the training set grows, C gradually decreases, and eigenmodes are learned high-to-low as C passes each eigenvalue. Properties (c-d) provide bounds on C, which, in addition to then yielding bounds on learnability, can be used to furnish an initial guess when numerically solving for C.

2.6 Second-order statistics of f̂

We now have an expression for the mean of T(D), but many quantities of interest, including MSE, depend on second-order fluctuations about that mean. In Appendix G, we show how, by taking a suitable derivative, we can obtain expressions for the second-order statistics of T(D) and, as a result, for MSE. Our main second-order result is that

(13)

Using Equation 13, we can study the admixtures of particular spurious modes in the learned function f̂. For example, if the target is a single eigenmode and we examine the coefficient of a different eigenmode in f̂, then we find that

(14)

We note that this admixture grows with the spurious mode's eigenvalue, reinforcing our broad conclusion that the model is biased towards high-eigenvalue modes. Finally, by noting that MSE is itself a second-order statistic of f̂, we can recover the expression of Bordelon et al. (2020) for MSE:

(15)

Our experiments corroborate existing evidence that this is an excellent approximation for MSE.
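The following sketch gives our reading of that expression in the present notation (an assumption on our part, since we restate rather than quote it): the predicted MSE for a noiseless target with eigencoefficients v is E(f) ≈ [n / (n − Σ_i L_i²)] Σ_i (1 − L_i)² v_i², with per-mode learnabilities L_i = λ_i / (λ_i + C).

```python
import numpy as np

def solve_C(lam, n, tol=1e-12):
    lo, hi = 0.0, lam.sum()
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.sum(lam / (lam + mid)) > n else (lo, mid)
    return 0.5 * (lo + hi)

def predicted_mse(v, lam, n):
    L = lam / (lam + solve_C(lam, n))          # per-mode learnabilities lam_i / (lam_i + C)
    overfitting = n / (n - np.sum(L**2))       # >= 1; amplifies the error left in unlearned modes
    return overfitting * np.sum((1 - L)**2 * v**2)

lam = np.arange(1, 101, dtype=float) ** -2.0   # same stand-in spectrum as above
v = np.zeros(100); v[3] = 1.0                  # target = the fourth eigenmode
print([round(float(predicted_mse(v, lam, n)), 4) for n in (1, 5, 20, 80)])
# note: at very small n the prediction can exceed <f, f> = 1, i.e. worse than predicting zero
```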

2.7 Nonmonotonic MSE curves

Expanding MSE(φ_i) for small n, we find that

(16)

The second equation implies that dMSE(φ_i)/dn > 0 at n = 0 for all modes with sufficiently small eigenvalue, which suggests that, for such low-eigenvalue modes, MSE increases as the training set size grows from zero. In practice, such low modes are commonplace; in fact, for a bounded kernel on a continuous input space (for which the number of eigenmodes is infinite), arbitrarily low modes are inevitable due to the constraint that the sum of eigenvalues is bounded (this constraint follows from the fact that the eigenvalues sum to the mean of K(x, x) over the domain). We therefore expect that there quite often exist functions for which a given neural network yields an MSE nonmonotonic with n. Figure A1 shows that experiments confirm this surprising first-principles prediction. We emphasize that this is a different, more generic phenomenon than that noted by Canatar et al. (2021), which required that the target was either noisy or placed weight in unlearnable zero-eigenvalue modes.

3 Experiments

Here we describe experiments confirming all our theoretical predictions for both finite networks and exact NTK regression. Unless otherwise stated, all experiments used a fully-connected (FC) four-hidden-layer (4L) ReLU architecture with width 500. Because this model is FC, it has a rotation-invariant NTK (Lee et al., 2019). These experiments used three distinct input spaces. For each, the eigenmodes of the NTK can be grouped into degenerate subsets indexed by an integer k, with higher k corresponding to faster variation in input space. In all cases, we find that as k increases, eigenvalues decrease, in concordance with the widespread belief that neural nets have a "spectral bias" towards slowly-varying functions (Rahaman et al., 2019; Canatar et al., 2021; Cao et al., 2019). We now describe these three input spaces. For full experimental details, see Appendix H.

Discretized Unit Circle. The simplest input space we consider is the discretization of the unit circle into M points. The eigenfunctions on this domain are the constant function, cos(kθ), and sin(kθ), for integer frequencies k up to the resolution of the discretization.

Hypercube. The next input space we consider is the set of vertices of the d-dimensional hypercube, {−1, +1}^d, giving M = 2^d. The eigenfunctions on this domain are the subset-parity functions φ_S(x) = Π_{i∈S} x_i, where S indicates the subset of input elements to which the output is sensitive (Yang and Salman, 2019). Here we define k ≡ |S|.
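For concreteness, a short NumPy sketch (ours) of this input space and its eigenfunctions: the subset-parity functions form an orthonormal family of 2^d functions under the uniform measure on the hypercube, shown here for d = 3.

```python
import numpy as np
from itertools import product, combinations

d = 3
X = np.array(list(product([-1.0, 1.0], repeat=d)))          # all 2^d vertices of the hypercube

def subset_parity(X, S):
    # phi_S(x) = prod_{i in S} x_i; the empty subset gives the constant function
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))

subsets = [S for k in range(d + 1) for S in combinations(range(d), k)]
phis = np.array([subset_parity(X, S) for S in subsets])      # 2^d eigenfunctions, one row per subset S

gram = phis @ phis.T / len(X)                                # <phi_S, phi_S'> under the uniform measure
print(np.allclose(gram, np.eye(2 ** d)))                     # True: the parities are orthonormal
```

In the experiments, the degeneracy index k = |S| plays the role of a frequency: larger subsets vary faster across the cube and correspond to smaller NTK eigenvalues.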

Hypersphere. To illustrate that our results extend easily to continuous input spaces, we consider the d-sphere. The eigenfunctions on this domain are the hyperspherical harmonics (see, e.g., Frye and Efthimiou (2012); Bordelon et al. (2020)), which group into degenerate sets indexed by an integer k. The corresponding eigenvalues decrease exponentially with k, and so when summing over all eigenmodes to compute C and our predicted metrics, we simply truncate the sum at a finite maximum k.

Figure I5 shows the 4L ReLU NTK eigenvalues on each domain. We now describe our experiments.

Figure 2: Theoretical predictions closely match the true learnabilities of arbitrary eigenfunctions on diverse input spaces. Each plot shows learnability (Equation 11) of several eigenfunctions as a function of training set size n. Theoretical curves show excellent agreement with results from exact NTK regression (triangles) and finite nets trained via gradient descent (circles). Error bars reflect variation due to random choice of dataset and, for finite nets, random initialization. (A) Learnabilities of sinusoidal eigenfunctions on the unit circle discretized into M points. Eigenfunctions with higher frequency k have lower eigenvalues and thus require more data to learn. At n = M, the training set contains all input points and all functions are thus predicted perfectly. (B) Learnabilities of subset-parity functions on the vertices of the 8d hypercube. Eigenfunctions with higher k again have lower eigenvalues and are learned later, with all functions predicted perfectly at n = M. The dashed line indicates the learnability of the naive, non-generalizing model of Theorem 2; the lowest-eigenvalue mode shown falls below this curve, showing that it generalizes worse than chance. (C) Learnabilities of hyperspherical harmonics on the continuous 7-sphere. Eigenfunctions with higher k again have lower eigenvalues and are learned later, but the continuous input space prevents learnability from exactly reaching 1 at finite n.
Figure 3: Eigenmode learnability vs. eigenvalue takes a universal functional form. For any dataset size and input domain, eigenmode learnability closely follows a universal curve with one problem-dependent parameter C. Theoretical curves (solid lines) have the same sigmoidal shape in every panel. True eigenmode learnabilities for both exact NTK regression (triangles) and finite networks (circles) exhibit excellent agreement. Vertical dashed lines indicate C for each learning problem. (A-C) Learnability vs. eigenvalue for eigenmodes of the unit circle, 8d hypercube, and 7-sphere at a fixed training set size. The lowest-eigenvalue modes are cut off to the left of (C). (D-F) Learnability curves at a larger training set size. Eigenmodes lie higher on each curve than the corresponding points in (A-C), reflecting greater learnability due to the larger n. (G) All points from (A-F), rescaled by their respective values of C, lie on the same universal curve λ/(λ + C).
Figure 4: Eigenfunction learnability always sums to the size of the training set. Stacked bar charts show D-learnability, for a particular random dataset D, of each of the 10 eigenfunctions over a simple 10-point domain. For all architectures, the total height of each bar is approximately n. (A) As per Theorem 1, the summed D-learnabilities for exact NTK regression (left bars) are all exactly n, and those for trained finite nets (right bars) are remarkably close. Stacked bars show D-learnability for the 10 eigenfunctions, all from the same training set of n data points, stacked from top to bottom in descending order of eigenvalue. A different network architecture was used in each of the four pairs of columns. As per Lemma 1(d), the height of each eigenmode contribution falls in [0, 1]. (B) Same as (A) with a different n.

Warmup on the unit circle. As an example problem to illustrate our theory, we consider learning two sinusoidal eigenmodes of different frequencies on the discretized unit circle. We use trained networks to find the learned function, its test MSE, its spurious-mode coefficients, and its learnability for both modes as n varies and compare to our theoretical predictions. The results show that our theoretical expressions give excellent agreement with experiment on all counts (Figure 1).

Predicting learnability. We next use both finite nets and exact NTK regression to learn several eigenmodes on all three domains and compare true learnabilities with our theoretical predictions. Our theory again predicts true generalization behavior quite well (Figure 2). We further show that a low-eigenvalue mode on the hypercube has worse-than-chance generalization, an inevitable result of its low eigenvalue and Theorem 2. We note that finite networks and NTK regression give very similar results, supporting the validity of approximating networks with the NTK.

Universal form for learnability. We next fix n and plot learnability vs. eigenvalue for eight modes over each domain. In each case we compute C, and by rescaling the eigenvalues by C (an overall rescaling of the kernel leaves Equation 1 invariant), we see that for any neural network learning problem with any training set size n, eigenmode learnability always lies on one universal curve (Figure 3).

No free lunch for neural networks. We then experimentally confirm that our no-free-lunch result applies to finite networks as well as kernel regression (Figure 4). We use both models to learn a function on the discretized unit circle and sum the resulting D-learnabilities of each eigenmode. Unlike in our other experiments, we only sample one dataset D for each n; even without the benefit of averaging over datasets, we find that total D-learnability is always conserved.

Nonmonotonic MSE curves. We next confirm our first-principles prediction of nonmonotonic MSE curves at small n. We plot MSE curves for four eigenmodes on each domain, three of which Equation 16 predicts will have MSE initially increasing with n. MSE is indeed increasing as predicted (Figure A1).

Agreement with narrow networks. Finally, we repeat the experiment of Figure 2B for varying network width. We find that our theory gives accurate predictions of learnability and MSE even for networks of width as small as 20 (Figures B2 and B3). This surprising agreement suggests our theory will faithfully predict generalization in practical deep learning systems.

4 Conclusion

We have presented a first-principles theory of neural network generalization that efficiently and accurately predicts many measures of generalization performance. This theory offers new insight into neural networks’ inductive bias and provides a general framework for understanding their learning behavior, opening the door to the principled study of many other deep learning mysteries.

Acknowledgments

The authors would like to thank Zack Weinstein for useful discussions and Sajant Anand, Jesse Livezey, Roy Rinberg, Jascha Sohl-Dickstein, and Liu Ziyin for helpful comments on the manuscript. This research was supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract W911NF-20-1-0151. JS gratefully acknowledges support from the National Science Foundation Graduate Fellow Research Program (NSF-GRFP) under grant DGE 1752814.

References

  • S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8139–8148. External Links: Link Cited by: §1, §2.1.
  • M. Belkin, D. Hsu, S. Ma, and S. Mandal (2019)

    Reconciling modern machine-learning practice and the classical bias–variance trade-off

    .
    Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854. Cited by: §1.
  • A. Bietti and J. Mairal (2019) On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173. Cited by: §1.
  • B. Bordelon, A. Canatar, and C. Pehlevan (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. Cited by: §G.3, §1, §2.6, §3.
  • A. Canatar, B. Bordelon, and C. Pehlevan (2021) Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications 12 (1), pp. 1–12. Cited by: §1, §2.7, §3.
  • Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu (2019) Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198. Cited by: §3.
  • O. Cohen, O. Malka, and Z. Ringel (2021) Learning curves for overparametrized deep neural networks: a field theory perspective. Physical Review Research 3 (2), pp. 023034. Cited by: §1.
  • A. Daniely, R. Frostig, and Y. Singer (2016) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NeurIPS), D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2253–2261. External Links: Link Cited by: §1.
  • C. Frye and C. J. Efthimiou (2012) Spherical harmonics in p dimensions. arXiv preprint arXiv:1205.3548. Cited by: §3.
  • J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak (2020) Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 119, pp. 4376–4386. External Links: Link Cited by: §2.1.
  • A. Jacot, C. Hongler, and F. Gabriel (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8580–8589. External Links: Link Cited by: Appendix C, §1, §2.1.
  • G. James, D. Witten, T. Hastie, and R. Tibshirani (2013) An introduction to statistical learning. Vol. 112, Springer. Cited by: §1.
  • J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018) Deep neural networks as gaussian processes. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, footnote 4.
  • J. Lee, S. S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein (2020) Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems (NeurIPS), External Links: Link Cited by: §2.1.
  • J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington (2019) Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8570–8581. External Links: Link Cited by: Appendix C, Appendix C, §1, §2.1, §2.1, §3.
  • A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari (2020) The large learning rate phase of deep learning: the catapult mechanism. CoRR abs/2003.02218. External Links: Link, 2003.02218 Cited by: Appendix H.
  • R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2019a) Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1.
  • R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz (2019b) Neural tangents: fast and easy infinite neural networks in python. CoRR abs/1912.02803. External Links: Link, 1912.02803 Cited by: Appendix H.
  • G. Ortiz-Jiménez, S. Moosavi-Dezfooli, and P. Frossard (2021) What can linearized neural networks actually say about generalization?. arXiv preprint arXiv:2106.06770. Cited by: §1.
  • N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. Cited by: Appendix H, §3.
  • J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee (2020) On the infinite width limit of neural networks with a standard parameterization. arXiv preprint arXiv:2001.07301. Cited by: Appendix H.
  • S. Spigler, M. Geiger, and M. Wyart (2020) Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124001. Cited by: §1.
  • G. Valle-Perez, C. Q. Camargo, and A. A. Louis (2018) Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522. Cited by: §1.
  • D. H. Wolpert (1996) The lack of a priori distinctions between learning algorithms. Neural computation 8 (7), pp. 1341–1390. Cited by: §2.4.
  • G. Yang and H. Salman (2019) A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599. Cited by: §1, §3.
  • G. Yang (2019) Tensor programs I: wide feedforward or recurrent neural networks of any architecture are gaussian processes. CoRR abs/1910.12478. External Links: Link, 1910.12478 Cited by: §2.1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1.

Appendix A Nonmonotonic MSE curves at small training set size

Figure A1: Our theory correctly predicts that, for low-eigenvalue eigenfunctions, MSE counterintuitively increases as points are added to a small training set. (A-C) Generalization MSE of exact NTK regression (triangles) and finite networks (circles) when learning four different eigenmodes on each of three different domains given n training points. Theoretical curves closely match experimental data. Eigenmodes with higher k have lower eigenvalues and thus higher mean MSEs, and for the lowest-eigenvalue modes, MSE even increases as n grows from zero. Dashed lines show the small-n behavior predicted by Equation 16.

Appendix B Experimental results for narrow networks

Figure B2: Theoretical learnability predictions remain accurate even for quite narrow networks. Plots show learnability vs. training set size for four eigenmodes of the 8d hypercube, learned with a 4L ReLU net of various widths. Except for the changing width, these experiments are identical to those of Figure 2B. Theoretical predictions (solid curves) are the same in all plots. Dashed lines show learnability for a naive, nongeneralizing model; points below the line imply worse-than-chance generalization (see Theorem 2). (A) Infinite-width results using exact NTK regression. (B-F) Results for successively narrower finite networks. As width decreases, mean learnability increases slightly, and error bars grow. Despite this, mean learnabilities remain remarkably close to our theoretical predictions even at width 20.
Figure B3: Theoretical MSE predictions remain accurate even for quite narrow networks. Plots show MSE vs. training set size for four eigenmodes of the 8d hypercube, learned with a 4L ReLU net of various widths. Except for the changing width and the fact that MSE is plotted instead of learnability, these experiments are identical to those of Figure 2B. Theoretical predictions (solid curves) are the same in all plots. Dashed lines show MSE for a naive, nongeneralizing model; points above the dashed lines imply worse-than-chance generalization (see Theorem 2). (A) Infinite-width results using exact NTK regression. (B-F) Results for successively narrower finite networks. As width decreases, MSE tends to increase, but only slightly. Theoretical predictions remain remarkably accurate for most eigenmodes down to width 50, with qualitative agreement even at width 20.

Appendix C Review of the NTK

In the main text, we assume prior familiarity with the NTK, using Equation 1 as the starting point of our derivations. Here we provide a definition and very brief introduction to the NTK for unfamiliar readers. For derivations and full discussions, see Jacot et al. (2018) and Lee et al. (2019).

Consider a feedforward neural network representing a function f(θ, x), where θ is a parameter vector. Further consider one training example x₁ with target value y₁ and one test point x, and suppose we perform one step of gradient descent with a small learning rate η with respect to the MSE loss ½(f(θ, x₁) − y₁)². This gives the parameter update

(17)   Δθ = −η (f(θ, x₁) − y₁) ∇_θ f(θ, x₁).

We now wish to know how this parameter update changes f(θ, x). To do so, we linearize f about the current parameters θ, finding that

(18)   f(θ + Δθ, x) ≈ f(θ, x) − η (f(θ, x₁) − y₁) Θ(x, x₁),

where we have defined Θ(x, x₁) ≡ ∇_θ f(θ, x) · ∇_θ f(θ, x₁). This quantity is the NTK. Remarkably, as network width goes to infinity (the "width" parameter varies by architecture; for example, it is the minimal hidden-layer width for fully-connected networks and the minimal number of channels per hidden layer for a convolutional network), the higher-order corrections become negligible, and Θ is the same after any random initialization (assuming the parameters are drawn from the same distribution) and at any time during training. This dramatically simplifies the analysis of network training, allowing one to prove that, after training for infinite time on MSE loss for an arbitrary dataset, the network's learned function is given by Equation 1. See, for example, Equations 14-16 of Lee et al. (2019). (We note that there exists a different infinite-width kernel, called the "NNGP kernel," describing a network's random initialization; that reference uses K for the NNGP kernel and Θ for the NTK.)
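To make the definition concrete, here is a small self-contained sketch (ours; a toy two-layer network with finite-difference gradients, not an efficient or exact implementation) that estimates one entry of the empirical NTK, Θ(x₁, x₂) = ∇_θ f(θ, x₁) · ∇_θ f(θ, x₂):

```python
import numpy as np

D_IN, WIDTH = 4, 32
N_PARAMS = D_IN * WIDTH + WIDTH                 # first-layer weights + output weights, flattened

def f(theta, x):
    # tiny ReLU network with 1/sqrt(fan-in) scaling and scalar output
    W1 = theta[:D_IN * WIDTH].reshape(D_IN, WIDTH) / np.sqrt(D_IN)
    w2 = theta[D_IN * WIDTH:] / np.sqrt(WIDTH)
    return np.maximum(x @ W1, 0.0) @ w2

def grad_theta(theta, x, eps=1e-5):
    # central finite differences; fine for a toy network, not for real models
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta = rng.standard_normal(N_PARAMS)
x1, x2 = rng.standard_normal(D_IN), rng.standard_normal(D_IN)
print(grad_theta(theta, x1) @ grad_theta(theta, x2))   # empirical NTK entry Theta(x1, x2)
```

At finite width this quantity fluctuates with the random draw of θ and drifts during training; the infinite-width claim above is that those fluctuations and that drift vanish.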

Appendix D Proof of Lemma 1

Property (a): L_D(φ_i) = T(D)_ii, and L(φ_i) = T̄_ii.

Proof. Using the fact that the coefficient vector of φ_i is e_i, a one-hot M-vector with its one at index i, we see that L_D(φ_i) = e_iᵀ T(D) e_i = T(D)_ii. The second clause of the property follows by averaging.

Property (b): When n = 0, T(D) = 0 and all learnabilities are 0.

Proof. When n = 0, Φ has no columns, and thus T(D) = 0. The other clauses follow from Property (a) and averaging.

Property (c): When n = M, T(D) = I and all learnabilities are 1.

Proof. When n = M, Φ is a full-rank square matrix. Inspection of the formula for T(D) then shows that T(D) = I. The other clauses follow from Property (a) and averaging.

Property (d): All eigenvalues of T(D) are in {0, 1}, all eigenvalues of T̄ are in [0, 1], and so L_D(f), L(f) ∈ [0, 1].

Proof. From the definition of T(D), it is easy to see that T(D)² = T(D). T(D) is thus idempotent, with all eigenvalues in {0, 1}. The fact that all eigenvalues of T̄ are in [0, 1] follows by averaging. The stated bounds on learnability follow from the fact that, for any compatible vector v and matrix A, the quotient vᵀ A v / (vᵀ v) is bounded by the maximum and minimum eigenvalues of A.

Property (e): Let D' be D ∪ {x'}, where x' is a new data point. Then L_{D'}(f) ≥ L_D(f).

Proof. To begin, we use the Moore-Penrose pseudoinverse, denoted by a superscript +, to cast T(D) into a more transparent form:

(19)

where we have suppressed the dependence on D. This follows from the property of pseudoinverses that A (Aᵀ A)^{-1} Aᵀ = A A⁺ for any matrix A with full column rank. We now augment our system with one extra data point, getting

(20)

where the appended column of Φ, containing the eigenfunction values at the new point, is an M-element vector orthonormal to the existing columns of Φ. We now convert the pseudoinverse into an inverse via a limit, getting

(21)

We now use the Sherman-Morrison matrix inversion formula to find that

(22)

Because both the appended rank-one term and the inverted matrix in Equation 22 are positive semidefinite, we conclude that, for any M-vector v, it will hold that vᵀ T(D') v ≥ vᵀ T(D) v. The desired property follows.

Property (f): For any i, ∂L_D(φ_i)/∂λ_i ≥ 0, hence ∂L(φ_i)/∂λ_i ≥ 0.

Proof. Differentiating T(D) with respect to a particular eigenvalue λ_i, we find that

(23)

where the first quantity appearing is the i-th row of Φ. Specializing to the case j = i, we note that the relevant factors are nonnegative: one because the corresponding matrix is positive semidefinite, and the other because it is one of the positive semidefinite summands making up the kernel matrix. The desired property follows.

Property (g): For any j ≠ i, ∂L_D(φ_j)/∂λ_i ≤ 0, hence ∂L(φ_j)/∂λ_i ≤ 0.

Proof. Differentiating as in the proof of Property (f) and simplifying, we see that

(24)

which is manifestly nonpositive. The desired property follows.

Appendix E Proofs of Theorems 1 and 2

Here we prove Theorem 1, our “no-free-lunch” result, and Theorem 2, which states the existence of eigenfunctions that generalize worse than chance. We begin with the former.

Proof of Theorem 1. First, we note that, for any orthogonal basis {f_i} on X,

(25)   Σ_i L_D(f_i) = Σ_i w_iᵀ T(D) w_i / (w_iᵀ w_i),

where {w_i} is an orthogonal set of coefficient vectors spanning the coefficient space. This sum is equivalent to tr[T(D)]. This trace is given by

(26)   tr[T(D)] = tr[Λ Φ (Φᵀ Λ Φ)^{-1} Φᵀ] = tr[(Φᵀ Λ Φ)^{-1} Φᵀ Λ Φ] = tr[I_n] = n,

which proves the desired theorem. ∎

Proof of Theorem 2. First, we note that the naive, nongeneralizing model described in the theorem will always have a learnability of n/M. As per Theorem 1, this is exactly the mean learnability of the M eigenfunctions, so either all have precisely this learnability or else at least one has lower learnability. This proves the clause of the theorem specific to learnability.

To prove the clause specific to MSE, we now note that, as defined by Equation 1, kernel regression will always perfectly memorize the training data, so for both the naive model and kernel regression, MSE is supported on the unseen points and is given by

(27)   MSE_D(f) = (1/M) Σ_{x∉D} f(x)² + (1/M) Σ_{x∉D} f̂(x)² − (2/M) Σ_{x∉D} f(x) f̂(x).

The first term is the same for any model, the second term is nonnegative but zero for the naive model, and the third term, by the learnability clause proven above, must be nonnegative in expectation for at least one eigenfunction. Kernel regression therefore gives worse MSE than the naive model unless, as for the naive model, f̂(x) = 0 for all x ∉ D. ∎

Appendix F Approximating T̄

Here we provide details of the derivations of our approximation for T̄ and of Lemma 3. Whenever possible, we leave it implicit that all expressions for T̄ in this appendix are approximations, and we use the equality symbol for simplicity.

We begin by taking the approximation of Equation 8 that

(28)

where the expectation is taken over all design matrices Φ with orthonormal columns, with the uniform measure. It turns out that we can equivalently average over all matrices Φ with a mean-zero i.i.d. Gaussian measure over each element, without changing T̄. To see this, note that this equation is symmetric under right-multiplication of Φ by arbitrary invertible matrices A:

(29)

Letting A be a Φ-dependent matrix which orthonormalizes the columns of Φ, it is straightforward to see that we can convert from an average over Gaussian Φ to an average over only orthonormal Φ, and vice versa, as desired. Moving forward, we will exploit this equivalence and assume that each element of Φ is sampled i.i.d. from a mean-zero Gaussian. This assumption will largely remain in the background, but will be useful later.

We now evaluate T̄, starting with the off-diagonal elements. We observe that

(30)

where R is any orthogonal matrix under which the distribution of Φ is invariant. Choosing R to flip the sign of a single row of Φ, noting that this leaves Λ unchanged, and plugging it in as in Equation 30, we find that

(31)

By choosing such sign flips for each index in turn, we conclude that T̄_ij = 0 if i ≠ j.

To evaluate the diagonal elements of T̄, we consider augmenting our eigensystem with a new "test eigenmode" with eigenvalue λ_{M+1}, as described in the text and formalized in Equation 9. To proceed, we make the core assumption that adding this mode negligibly perturbs the original diagonal elements when M is large (empirically, this approximation holds already for realistic values of M and n used in neural net experiments). This assumption, combined with the symmetry T̄_ii = T̄_jj whenever λ_i = λ_j, implies that each diagonal element is given by the test mode's learnability evaluated at the corresponding eigenvalue. Therefore, to evaluate T̄_ii, it suffices to evaluate the test mode's learnability as a function of λ_{M+1}.

We now manipulate the expression for the test mode's learnability to isolate its dependence on λ_{M+1}. Using the Sherman-Morrison matrix inversion formula, we obtain