Neural Tangent Kernel Eigenvalues Accurately Predict Generalization
Finding a quantitative theory of neural network generalization has long been a central goal of deep learning research. We extend recent results to demonstrate that, by examining the eigensystem of a neural network's "neural tangent kernel", one can predict its generalization performance when learning arbitrary functions. Our theory accurately predicts not only test mean-squared-error but all first- and second-order statistics of the network's learned function. Furthermore, using a measure quantifying the "learnability" of a given target function, we prove a new "no-free-lunch" theorem characterizing a fundamental tradeoff in the inductive bias of wide neural networks: improving a network's generalization for a given target function must worsen its generalization for orthogonal functions. We further demonstrate the utility of our theory by analytically predicting two surprising phenomena - worse-than-chance generalization on hard-to-learn functions and nonmonotonic error curves in the small data regime - which we subsequently observe in experiments. Though our theory is derived for infinite-width architectures, we find it agrees with networks as narrow as width 20, suggesting it is predictive of generalization in practical neural networks. Code replicating our results is available at https://github.com/james-simon/eigenlearning.
Deep neural networks have proven extraordinarily useful for a wide array of learning problems, but theoretical understanding of their high performance is notably lacking. One longstanding mystery is the fact that the functions learned by neural networks typically generalize quite well to new data. In light of neural networks’ extreme overparameterization and high expressivity (Zhang et al., 2017), conventional statistical intuition incorrectly predicts poor performance (Belkin et al., 2019; James et al., 2013), leading to a lack of useful generalization bounds and a need for new concepts to augment classical wisdom.
As is common when facing great mysteries, the first challenge lies in framing a precise question. To clarify our aims, we pose the following problem:
Can one efficiently predict, from first principles, how well a given network architecture will generalize when learning a given function, provided a given number of training examples?
Here we show that one can. To do so, we build on two recent developments in deep learning theory. The first is the theory of infinite-width networks, which has shown that, as hidden-layer widths tend to infinity, commonly-used neural networks take remarkably simple analytical forms (Daniely et al., 2016; Lee et al., 2018; Novak et al., 2019a; Arora et al., 2019). In particular, a wide network trained by gradient descent with mean-squared-error (MSE) loss is equivalent to a classical model called kernel regression, where the kernel is the network’s so-called “neural tangent kernel” (NTK) (Jacot et al., 2018; Lee et al., 2019). This line of work suggests that, by studying these simple infinite-width expressions, we might gain insight into the behavior of real, finite networks.
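As a concrete, purely illustrative sketch of this equivalence's endpoint, the kernel-regression predictor f̂(x) = K(x, D) K(D, D)^{-1} f(D) can be written in a few lines of NumPy. Here an RBF kernel stands in for an NTK, and the function names are our own, not those of the paper's codebase:

```python
import numpy as np

def kernel_regression(kernel, X_train, y_train, X_test):
    """Kernel regression: f_hat(x) = K(x, D) K(D, D)^{-1} y."""
    K_dd = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    K_td = np.array([[kernel(a, b) for b in X_train] for a in X_test])
    return K_td @ np.linalg.solve(K_dd, y_train)

# Stand-in kernel (an RBF, not an NTK) on scalar inputs
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)

X_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(X_train)

# On the training points themselves, kernel regression interpolates exactly
preds = kernel_regression(rbf, X_train, y_train, X_train)
assert np.allclose(preds, y_train)
```

Generalization is then a question of how this interpolant behaves off the training set, which is what the eigenmode analysis quantifies.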
The second development takes this approach, studying the generalization performance of kernel regression and showing agreement with finite networks. In a pioneering study, Bordelon et al. (2020)
derived an approximation for the generalization MSE of kernel regression and showed that, using the network NTK as the kernel, their expressions accurately predict the MSE of neural networks learning arbitrary functions. Their results reveal a simple picture of neural network generalization: as samples are added to the training set, the network generalizes well on a larger and larger subspace of input functions. The natural basis for this subspace of learnable functions is the eigenbasis of the NTK, and its eigenfunctions are learned in descending order of their eigenvalues. We note a substantial body of related work studying the inductive bias of neural networks and kernel regression (Valle-Perez et al., 2018; Yang and Salman, 2019; Bietti and Mairal, 2019; Spigler et al., 2020; Ortiz-Jiménez et al., 2021; Cohen et al., 2021; Canatar et al., 2021).
In this paper, we extend these results significantly. We begin in Section 2.2 by formulating a figure of merit we call the “learnability” of a target function, a measure which we prove obeys several desirable properties that MSE does not. In Section 2.4, we use learnability to prove a strong new “no-free-lunch” theorem describing a fundamental tradeoff in the inductive bias of a kernel towards all functions in an orthogonal basis. This shows that not only are higher NTK eigenmodes easier to learn, but these eigenmodes are in a zero-sum competition with one another for the ability to be learned at a given training set size. We further prove that, for any kernel or wide network, this tradeoff necessarily leaves some functions with worse-than-chance generalization.
In Sections 2.5 and 2.6, we derive expressions for not just expected test MSE (a particular second-order statistic of the learned function) but for all first- and second-order statistics of the learned function. Our expression for learnability (a first-order statistic of the learned function) is extremely simple, involving just one dataset-size-dependent parameter that acts as an eigenvalue threshold above which eigenmodes are well-learned. From the form of this expression, we conclude that for any supervised neural network learning problem, eigenmodes' eigenvalues and learnabilities lie on the same universal curve. Furthermore, in Section 2.7 we expand MSE for small dataset size and make the counterintuitive prediction that, for many functions, MSE increases with training set size in this small-data regime.
In Section 3, we experimentally verify all our conclusions on three input spaces for both exact NTK regression and trained finite networks. We find that our analytical expressions give excellent agreement with the true generalization behavior of real deep ReLU networks. We verify our no-free-lunch result with finite networks, and we experimentally observe our theoretical predictions of both worse-than-chance generalization and nonmonotonic MSE. Finally, we compare our theory to finite nets with widths as low as 20 and find that our theory holds with remarkable accuracy even at these narrow widths, suggesting that it is not merely applicable in the canonical NTK regime but in fact correctly predicts generalization performance in practical neural networks.
Consider the task of learning a function f given a set of n unique training points D = {x_1, ..., x_n} and their corresponding function values f(x_1), ..., f(x_n). To simplify our analysis, we will let the domain X be discrete with size M and assume the training points are uniformly sampled from X. We later note how, taking M → ∞, our results extend to settings where the data are sampled nonuniformly from a continuous domain.
We will use f̂ to denote the function learned by a neural network trained on this dataset. Remarkably, for an infinite-width neural network optimized via gradient descent to zero training MSE loss, this learned function is given by

    f̂(x) = K(x, D) K(D, D)^{-1} f(D),    (1)

where K is the network's kernel, K(D, D) is the n × n matrix of kernel evaluations between training points, f(D) is the n-vector of training targets, and K(x, D) is a row vector with n components. (Naively, Equation 1 gives only the expected learned function, and the true learned function will include a fluctuation term reflecting the random initialization. However, by storing a copy of the network's output at initialization and subtracting it from the output throughout optimization and at test time, this term becomes zero, and so we neglect it in our theory and use this trick in our experiments.) We give a brief introduction to the NTK in Appendix C. Due to its similarity to the normal equation of linear regression, Equation 1 is often called "kernel regression." (Interestingly, exact Bayesian inference for infinite-width neural networks yields predictions of the same form as Equation 1, with K being the "neural network Gaussian process" (NNGP) kernel instead of the NTK (Lee et al., 2018).) We will proceed treating K as a network's NTK, but our theory and exact results (including our "no-free-lunch" theorem) apply equally well to any incarnation of kernel regression.
Equation 1 holds exactly in the infinite-width limit of fully-connected networks (Lee et al., 2019), convolutional networks (Arora et al., 2019), transformers (Hron et al., 2020), and more (Yang, 2019). Moreover, several empirical studies have shown it to be a good approximation for networks of even modest width (Lee et al., 2019, 2020). Our approach will be to study the generalization behavior of Equation 1, conjecture that our results also apply to finite networks, and finally provide strong support for our conjecture with experiments.
Examining Equation 1, one finds that the elements of a vector-valued output are predicted independently: the learned f̂ is equivalent to simply vectorizing the results of kernel regression on each output element. For simplicity, we hereafter assume the target function is scalar; to apply our results to a multivariate target function, one just considers each element of the target function independently.
We will study three measures of the quality of the learned function f̂. All three will be defined in terms of the inner product over X: for two functions g and h, their inner product is

    ⟨g, h⟩ ≡ (1/M) Σ_{x ∈ X} g(x) h(x).

The first measure of quality is mean-squared error (MSE). For a particular dataset D, MSE is given by MSE^(D)(f) = ⟨f − f̂, f − f̂⟩. Of more interest will be the expected MSE over all datasets of size n, given by MSE(f) = E_D[MSE^(D)(f)]. We note that the inner product is taken over all x ∈ X, including x ∈ D, even though f̂(x) = f(x) for x ∈ D for kernel regression.
In maximizing the similarity of f̂ to f, we typically wish to minimize its similarity to all functions orthogonal to f. The second measure examines the coefficient in f̂ of one such orthogonal function. Letting g be a function such that ⟨f, g⟩ = 0, we consider the mean and variance of the quantity ⟨g, f̂⟩. We will derive accurate predictions for this metric of generalization.
Lastly, we introduce a figure of merit quantifying the alignment of f and f̂, which we call "learnability." It is given by

    L^(D)(f) ≡ ⟨f, f̂⟩ / ⟨f, f⟩,    L(f) ≡ E_D[L^(D)(f)],

where L^(D)(f) is the dataset-dependent learnability of the function ("D-learnability") and L(f) is its expectation over random datasets ("learnability"). Though at first glance these two seem like odd figures of merit, we will soon show that they have many desirable properties when f̂ is given by Equation 1: unlike MSE, both are bounded in [0, 1], always change monotonically as new data points are added, are invariant to rescalings of f, and obey a simple conservation law. Furthermore, expanding the inner product in the definition of MSE and noting that ⟨f̂, f̂⟩ ≥ 0, one can see that low MSE is impossible without high learnability. We will ultimately derive an accurate approximation for learnability that is substantially simpler than any known approximation for MSE.
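For intuition, learnability can be estimated directly by Monte Carlo: draw random training sets, run kernel regression, and average the normalized inner product. Here is a hypothetical NumPy sketch on a toy discrete domain (our own kernel and names, assuming the uniform inner product over X):

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 20, 5
thetas = 2 * np.pi * np.arange(M) / M                    # discretized unit circle
K = np.exp(np.cos(thetas[:, None] - thetas[None, :]))    # toy positive-definite kernel
f = np.cos(thetas)                                        # target: a kernel eigenfunction

def d_learnability(train_idx):
    """D-learnability <f, f_hat> / <f, f> for one training set."""
    K_dd = K[np.ix_(train_idx, train_idx)]
    f_hat = K[:, train_idx] @ np.linalg.solve(K_dd, f[train_idx])
    return (f @ f_hat) / (f @ f)                          # 1/M factors cancel

samples = [d_learnability(rng.choice(M, n, replace=False)) for _ in range(200)]
L_est = float(np.mean(samples))
assert 0.0 <= L_est <= 1.0          # learnability is bounded in [0, 1]
```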
By definition, any kernel function is symmetric and positive-semidefinite. This implies that we can find a set of M orthonormal eigenfunctions φ_i and nonnegative eigenvalues λ_i that satisfy

    (1/M) Σ_{x' ∈ X} K(x, x') φ_i(x') = λ_i φ_i(x).

For simplicity (and to ensure that K(D, D) is invertible), we will assume that K is in fact positive definite and λ_i > 0 for all i, an assumption that will hold in most cases of interest. (By Mercer's Theorem, one can also find the eigensystem of any kernel on a continuous input space. We do not use the reproducing kernel Hilbert space (RKHS) formalism in this work, but we note that, by the Moore–Aronszajn theorem, the kernel K defines a unique RKHS.)
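On a discrete domain, the eigensystem is just a matrix diagonalization. A small NumPy sketch (a toy kernel and our own normalization convention, with the 1/M factor of the uniform inner product folded in):

```python
import numpy as np

M = 16
thetas = 2 * np.pi * np.arange(M) / M
K = np.exp(np.cos(thetas[:, None] - thetas[None, :]))    # toy PD kernel on the circle

# Solve (1/M) sum_{x'} K(x, x') phi_i(x') = lambda_i phi_i(x) by diagonalizing K/M
evals, evecs = np.linalg.eigh(K / M)
evals, evecs = evals[::-1], evecs[:, ::-1]               # sort eigenvalues descending
phis = evecs * np.sqrt(M)                                # <phi_i, phi_i> = 1 under (1/M) sum

assert np.all(evals > 0)                                 # kernel is positive definite
gram = phis.T @ phis / M
assert np.allclose(gram, np.eye(M))                      # orthonormal eigenfunctions
```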
We will now translate Equation 1 to this eigenbasis. First we decompose f and f̂ into weighted sums of the eigenfunctions as

    f = Σ_i v_i φ_i,    f̂ = Σ_i v̂_i φ_i,

where v and v̂ are vectors of coefficients. Using this notation, MSE^(D)(f) is |v − v̂|² and D-learnability is (v · v̂) / |v|².
Noting that K(x, x') = Σ_i λ_i φ_i(x) φ_i(x'), we can decompose the kernel matrix as K(D, D) = M Φ^T Λ Φ, where Φ is the M × n "design matrix" with Φ_ia = M^{-1/2} φ_i(x_a) (so that Φ has orthonormal columns) and Λ is a diagonal matrix of eigenvalues. The learned coefficients are then given by v̂ = Λ Φ (Φ^T Λ Φ)^{-1} Φ^T v. Stacking these coefficients into a matrix equation, we find that

    v̂ = T(D) v,    where    T(D) ≡ Λ Φ (Φ^T Λ Φ)^{-1} Φ^T

is an M × M matrix, independent of f, that fully describes the model's learning behavior on a training set D. We call this fundamental quantity the "learning transfer matrix." If we can determine the statistical properties of this matrix, we will understand the learning behavior of our model.
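The learning transfer matrix is easy to exhibit numerically. The following hypothetical NumPy sketch (our own toy kernel and normalization) builds T(D) = Λ Φ (Φ^T Λ Φ)^{-1} Φ^T and checks that it maps the target's eigencoefficients to kernel regression's learned coefficients and that it is idempotent:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 16, 6
thetas = 2 * np.pi * np.arange(M) / M
K = np.exp(np.cos(thetas[:, None] - thetas[None, :]))    # toy positive-definite kernel

evals, evecs = np.linalg.eigh(K / M)                     # (1/M) sum K phi = lambda phi
Lam = np.diag(evals)
phis = evecs * np.sqrt(M)                                # phis[x, i] = phi_i(x)

idx = rng.choice(M, n, replace=False)                    # random training set D
Phi = phis[idx, :].T / np.sqrt(M)                        # M x n design matrix, Phi^T Phi = I

T = Lam @ Phi @ np.linalg.inv(Phi.T @ Lam @ Phi) @ Phi.T # learning transfer matrix

f = np.cos(thetas)                                       # a target function
v = phis.T @ f / M                                       # its eigenbasis coefficients
f_hat = K[:, idx] @ np.linalg.solve(K[np.ix_(idx, idx)], f[idx])
v_hat = phis.T @ f_hat / M

assert np.allclose(T @ v, v_hat, atol=1e-6)              # v_hat = T(D) v
assert np.allclose(T @ T, T, atol=1e-6)                  # T(D) is idempotent
```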
We also define the mean learning transfer matrix T̄ ≡ E_D[T(D)]. D-learnability and learnability are then respectively given by L^(D)(f) = (v^T T(D) v) / (v^T v) and L(f) = (v^T T̄ v) / (v^T v).
The following lemma gives basic properties of the quantities defined above.
Lemma 1. The following properties of T(D), T̄, L^(D), and L hold:
(a) L^(D)(φ_i) = [T(D)]_ii, and L(φ_i) = T̄_ii.
(b) When n = 0, T(D) = 0 and L^(D)(f) = L(f) = 0.
(c) When D = X, T(D) = I and L^(D)(f) = L(f) = 1.
(d) All eigenvalues of T(D) are in {0, 1}, all eigenvalues of T̄ are in [0, 1], and so L^(D)(f), L(f) ∈ [0, 1].
(e) Let D' be D ∪ {x'}, where x' is a new data point. Then L^(D')(f) ≥ L^(D)(f).
(f) For any i, ∂L(φ_i)/∂λ_i ≥ 0, hence increasing an eigenvalue can only improve the learnability of the corresponding eigenfunction.
(g) For any i and any j ≠ i, ∂L(φ_i)/∂λ_j ≤ 0, hence increasing any other eigenvalue can only worsen it.
Property (a) in Lemma 1 formalizes the relationship between the transfer matrix and learnability. Properties (b-e) together give an intuitive picture of the learning process: the learning transfer matrix monotonically interpolates between zero and the identity as the training set grows — adding data never harms the learnability of any function. Properties (f-g) show that the kernel eigenmodes are in competition: increasing one eigenvalue improves the learnability of the corresponding eigenfunction, but decreases the learnabilities of all others. We prove Lemma 1 in Appendix D.
We now present our first major result.
For any complete basis of orthogonal functions {φ_i}_{i=1}^M,

    Σ_{i=1}^M L(φ_i) = n.
The proof, which hinges on the intermediate result that Tr[T(D)] = n, is given in Appendix E. We note that this result is stronger than the ordinary "no-free-lunch" theorem for learning algorithms, which requires averaging over all target functions instead of merely an orthonormal basis (Wolpert, 1996). To understand the significance of this result, consider that one might naively hope to design a neural kernel that achieves generally high performance for all target functions. Theorem 1 states that this is impossible: averaged over a complete basis of functions, all kernels achieve the same learnability. This exact result implies that, because there exist no universally high-performing kernels, we must instead aim to choose a kernel whose high-eigenvalue modes align well with the function to be learned. To our knowledge, this is the first exact result quantifying such a tradeoff in kernel regression or deep learning. We note that Theorem 1 also applies to linear regression, a special case of kernel regression with the linear kernel K(x, x') = x · x'.
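The conservation law is easy to check numerically: for a fixed training set, the D-learnabilities of any complete orthonormal basis sum to exactly n, whatever the kernel. A hypothetical NumPy sketch (toy kernels and a random orthonormal basis, our own construction):

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 12, 5
thetas = 2 * np.pi * np.arange(M) / M
dtheta = thetas[:, None] - thetas[None, :]
idx = rng.choice(M, n, replace=False)                     # one fixed training set

def total_learnability(K, basis):
    """Sum of D-learnabilities over an orthonormal basis (columns of `basis`)."""
    coef = np.linalg.solve(K[np.ix_(idx, idx)], basis[idx, :])
    basis_hat = K[:, idx] @ coef                          # kernel-regression predictions
    return float(np.sum(np.sum(basis * basis_hat, axis=0) / np.sum(basis**2, axis=0)))

# Two quite different kernels, each with an arbitrary orthonormal basis
for bandwidth in (1.0, 3.0):
    K = np.exp(bandwidth * np.cos(dtheta))
    basis, _ = np.linalg.qr(rng.standard_normal((M, M)))  # random orthonormal basis
    assert np.isclose(total_learnability(K, basis), n)    # total is always exactly n
```

The sum equals n because it is the trace of the prediction operator, which is invariant to the choice of basis and, by the memorization property of kernel regression, equals the number of training points for any kernel.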
Illustrating a consequence of this tradeoff, the following theorem states that, for any kernel, there will always be functions poorly aligned with the kernel's inductive bias, on which the model generalizes as badly as or worse than it would by simply predicting zero on all unseen test points.
There is always at least one eigenfunction φ_i for which L(φ_i) ≤ L_0 and MSE(φ_i) ≥ MSE_0, where L_0 and MSE_0 are the learnability and mean MSE given by a naive, non-generalizing model with predictions given by f̂_0(x) = f(x) for x ∈ D and f̂_0(x) = 0 otherwise.
We will now derive an approximation for T̄ and the second moments of T(D), ultimately yielding simple yet accurate expressions for learnability, MSE, and the other first- and second-order statistics of f̂. We sketch our method and state our results here and provide a full derivation in Appendix F.
We begin by noting that the expectation in T̄ = E_D[T(D)] is essentially an expectation over a combinatorially large set of possible design matrices Φ, each of which has orthonormal columns. We replace this with an average over all M × n matrices Φ satisfying Φ^T Φ = I_n, which we write

    T̄ ≈ ⟨Λ Φ (Φ^T Λ Φ)^{-1} Φ^T⟩_Φ.    (8)
We can then readily prove by symmetry that the off-diagonal elements of T̄ vanish, leaving the question of how each diagonal element T̄_ii depends on the set of eigenvalues. To probe this, we consider adding an (M+1)-th "test eigenmode" to the problem, augmenting the principal quantities as

    Λ → [[Λ, 0], [0, λ_{M+1}]],    Φ → [[Φ], [φ_{M+1}^T]],    (9)

where λ_{M+1} is the new mode's eigenvalue and φ_{M+1} is the n-vector of the new mode's values at the training points.
Using the Sherman-Morrison formula, we find that the new mode's learnability is given by

    L(φ_{M+1}) = E_D[ λ_{M+1} / (λ_{M+1} + C(D)) ],

where C(D) is a nonnegative dataset-dependent constant. For large n and realistic eigenspectra, C(D) is distributed tightly around its mean, and so we can replace it with a constant, which we call C. We emphasize that C is the same for every eigenmode. Using Theorem 1 to obtain a constraint on C, we reach our ultimate approximation that

    L(φ_i) = λ_i / (λ_i + C),    where C ≥ 0 satisfies Σ_i λ_i / (λ_i + C) = n.    (11)
From Lemma 1, we immediately obtain the following major result:
Under the above approximations, the learnability of a function f = Σ_i v_i φ_i is given by

    L(f) = (1 / Σ_i v_i²) Σ_i (λ_i / (λ_i + C)) v_i²,    (12)

with C given by Equation 11.
A function is thus more learnable the more weight it places in high eigenvalue modes. We show in Section 3 that Equation 12 is in excellent agreement with experiments using both kernel regression and trained finite networks.
The learnability of each eigenmode depends critically on the value of C. We now present several properties characterizing how C depends on n.
For C satisfying the constraint in Equation 11, with the eigenvalues λ_i ordered from greatest to least, the following properties hold:
(a) C → ∞ when n → 0, and C → 0 when n → M.
(b) C is strictly decreasing with n.
(c) C ≤ (1/n) Σ_i λ_i for all n.
(d) C ≥ (1/n) Σ_{i > n} λ_i for all n.
Properties (a-b), paired with Equation 11, paint a surprisingly simple picture of the learning process: as the training set grows, C gradually decreases, and eigenmodes are learned high-to-low as C passes each eigenvalue λ_i. Properties (c-d) provide bounds on C, which, in addition to then yielding bounds on learnability, can be used to furnish an initial guess when numerically solving for C.
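Since Σ_i λ_i/(λ_i + C) is strictly decreasing in C, the constraint can be solved by simple bisection, with (1/n) Σ_i λ_i as an upper bracket. A hypothetical NumPy sketch (our own solver, not the paper's code):

```python
import numpy as np

def solve_C(evals, n, iters=200):
    """Bisect for the C > 0 satisfying sum_i evals_i / (evals_i + C) = n."""
    g = lambda C: np.sum(evals / (evals + C)) - n
    lo, hi = 0.0, np.sum(evals) / n      # C is bracketed by 0 and (1/n) sum_i lambda_i
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

evals = 1.0 / np.arange(1, 101) ** 2     # toy power-law spectrum, M = 100
Cs = [solve_C(evals, n) for n in (5, 20, 50)]
assert Cs[0] > Cs[1] > Cs[2] > 0         # C decreases as the training set grows
for n, C in zip((5, 20, 50), Cs):
    L = evals / (evals + C)              # predicted eigenmode learnabilities
    assert np.isclose(L.sum(), n)        # the defining constraint / no free lunch
    assert np.all(np.diff(L) <= 0)       # higher eigenvalue, higher learnability
```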
We now have an expression for the mean of v̂, but many quantities of interest, including MSE, depend on second-order fluctuations about that mean. In Appendix G, we show how, by taking a derivative with respect to λ_i, we can obtain expressions for the second-order statistics of v̂ and, as a result, for MSE. Our main second-order result is that

    Cov_D[v̂_i, v̂_j] ≈ δ_ij (L(φ_i)² / n) MSE(f).    (13)

Using Equation 13, we can study the admixtures of particular spurious modes in the learned function f̂. For example, if f = φ_j and i ≠ j, then we find that

    E_D[v̂_i²] ≈ (L(φ_i)² / n) MSE(φ_j).

We note that E_D[v̂_i²] increases with λ_i, reinforcing our broad conclusion that the model is biased towards high-eigenvalue modes. Finally, by noting that MSE(f) = Σ_i E_D[(v_i − v̂_i)²], we can recover the expression of Bordelon et al. (2020) for MSE:

    MSE(f) ≈ E_0 Σ_i (1 − L(φ_i))² v_i²,    where    E_0 ≡ n / (n − Σ_i L(φ_i)²).

Our experiments corroborate existing evidence that this is an excellent approximation for MSE.
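Putting the pieces together, an MSE estimate can be computed from the spectrum and target coefficients alone. The sketch below assumes the form MSE(f) ≈ E_0 Σ_i (1 − L_i)² v_i² with E_0 = n/(n − Σ_i L_i²), which is our restatement of the Bordelon et al. (2020) expression in the notation above; treat the exact form as an assumption of this illustration:

```python
import numpy as np

def solve_C(evals, n, iters=200):
    """Bisect for C satisfying sum_i evals_i / (evals_i + C) = n."""
    lo, hi = 0.0, np.sum(evals) / n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.sum(evals / (evals + mid)) > n else (lo, mid)
    return 0.5 * (lo + hi)

def predicted_mse(evals, v, n):
    """Assumed form: E0 * sum_i (1 - L_i)^2 v_i^2, E0 = n / (n - sum_i L_i^2)."""
    C = solve_C(evals, n)
    L = evals / (evals + C)
    E0 = n / (n - np.sum(L**2))          # overfitting factor; denominator is positive
    return E0 * np.sum((1 - L) ** 2 * v**2)

evals = 1.0 / np.arange(1, 51) ** 2      # toy power-law spectrum
v = np.zeros(50)
v[2] = 1.0                               # target is the third eigenmode
mses = [predicted_mse(evals, v, n) for n in (2, 10, 40)]
assert mses[0] > mses[1] > mses[2] > 0   # predicted MSE falls as the mode is learned
```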
Expanding MSE(φ_i) for small n, we find that

    C ≈ (1/n) Σ_k λ_k,    MSE(φ_i) ≈ 1 + n ( Σ_k λ_k² / (Σ_k λ_k)² − 2 λ_i / Σ_k λ_k ) + O(n²).    (16)

The second equation implies that MSE(φ_i) initially grows with n for all modes such that λ_i < Σ_k λ_k² / (2 Σ_k λ_k), which suggests that, for such low-eigenvalue modes, MSE increases as the training set size grows from zero. In practice, such low modes are commonplace; in fact, for a bounded kernel on a continuous input space (for which M → ∞), arbitrarily low modes are inevitable due to the constraint that Σ_i λ_i is bounded. (This constraint follows from the fact that Σ_i λ_i equals the mean of K(x, x) over the domain.) We therefore expect that there quite often exist functions for which a given neural network yields an MSE nonmonotonic with n. Figure A1 shows that experiments confirm this surprising first-principles prediction. We emphasize that this is a different, more generic phenomenon than that noted by Canatar et al. (2021), which required that the target was either noisy or placed weight in unlearnable zero-eigenvalue modes.
Here we describe experiments confirming all our theoretical predictions for both finite networks and exact NTK regression. Unless otherwise stated, all experiments used a fully-connected (FC) four-hidden-layer (4L) ReLU architecture with width 500. Because this model is FC, it has a rotation-invariant NTK (Lee et al., 2019). These experiments used three distinct input spaces X. For each, the eigenmodes of the NTK can be grouped into degenerate subsets indexed by an integer k, with higher k corresponding to faster variation in input space. In all cases, we find that as k increases, eigenvalues decrease, in concordance with the widespread belief that neural nets have a "spectral bias" towards slowly-varying functions (Rahaman et al., 2019; Canatar et al., 2021; Cao et al., 2019). We now describe these three input spaces. For full experimental details, see Appendix H.
Discretized Unit Circle. The simplest input space we consider is the discretization of the unit circle into M points, X = {(cos θ_j, sin θ_j) : θ_j = 2πj/M}. The eigenfunctions on this domain are the constant function, cos(kθ), and sin(kθ), for k = 1, 2, ..., M/2.
Hypercube. The next input space we consider is the set of vertices of the d-dimensional hypercube, X = {−1, +1}^d, giving M = 2^d. The eigenfunctions on this domain are the subset-parity functions φ_S(x) = Π_{i∈S} x_i, where S ⊆ {1, ..., d} indicates the elements of x to which the output is sensitive (Yang and Salman, 2019). Here we define k = |S|.
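As a quick check of this picture, the subset-parity functions are easy to generate and are exactly orthonormal under the uniform inner product on {−1, +1}^d; a small NumPy sketch (our own code, not the paper's):

```python
import numpy as np
from itertools import product, combinations

d = 4
X = np.array(list(product([-1, 1], repeat=d)))              # all 2^d hypercube vertices

def parity(S):
    """Subset-parity function phi_S(x) = prod_{i in S} x_i on the hypercube."""
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))

subsets = [S for k in range(d + 1) for S in combinations(range(d), k)]
Phi = np.array([parity(S) for S in subsets], dtype=float)   # 2^d parity functions

# Subset parities are orthonormal under the uniform inner product on the hypercube
gram = Phi @ Phi.T / len(X)
assert np.allclose(gram, np.eye(2 ** d))
```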
Hypersphere. To illustrate that our results extend easily to continuous input spaces, we consider the unit hypersphere. The eigenfunctions on this domain are the hyperspherical harmonics (see, e.g., Frye and Efthimiou (2012); Bordelon et al. (2020)), which group into degenerate sets indexed by k. The corresponding eigenvalues decrease exponentially with k, and so when summing over all eigenmodes to compute C and MSE, we simply truncate the sum at a modest maximum k.
Figure I5 shows the 4L ReLU NTK eigenvalues on each domain. We now describe our experiments.
Warmup on the unit circle. As an example problem to illustrate our theory, we consider learning two eigenmodes on the discretized unit circle. We use trained networks to estimate learnability, MSE, and spurious-mode coefficients for both modes as n varies and compare to our theoretical predictions. The results show that our theoretical expressions give excellent agreement with experiment on all counts (Figure 1).
Predicting learnability. We next use both finite nets and exact NTK regression to learn several eigenmodes on all three domains and compare true learnabilities with our theoretical predictions. Our theory again predicts true generalization behavior quite well (Figure 2). We further show that a low-eigenvalue mode on the hypercube has worse-than-chance generalization, an inevitable consequence of its low eigenvalue and Theorem 2. We note that finite networks and NTK regression give very similar results, supporting the validity of approximating networks with the NTK.
Universal form for learnability. We next fix n and plot learnability vs. eigenvalue for eight modes over each domain. In each case we numerically solve for C, and by rescaling the eigenvalues by C (a symmetry which leaves Equation 1 invariant), we see that for any neural network learning problem with any training set size n, eigenmode learnability always lies on one universal curve, L = λ/(λ + C) (Figure 3).
No free lunch for neural networks. We then experimentally confirm that our no-free-lunch result applies to finite networks as well as kernel regression (Figure 4). We use both models to learn a function on the discretized unit circle and sum the resulting D-learnabilities of each eigenmode. Unlike in our other experiments, we sample only one dataset D for each n; even without the benefit of averaging over datasets, we find that total D-learnability is always conserved.
Nonmonotonic MSE curves. We next confirm our first-principles prediction of nonmonotonic MSE curves at small n. We plot MSE curves for four eigenmodes on each domain, three of which Equation 16 predicts will have MSE initially increasing with n. MSE is indeed increasing as predicted (Figure A1).
Agreement with narrow networks. Finally, we repeat the experiment of Figure 2B for varying network width. We find that our theory gives accurate predictions of learnability and MSE even for networks of width as small as 20 (Figures B2 and B3). This surprising agreement suggests our theory will faithfully predict generalization in practical deep learning systems.
We have presented a first-principles theory of neural network generalization that efficiently and accurately predicts many measures of generalization performance. This theory offers new insight into neural networks’ inductive bias and provides a general framework for understanding their learning behavior, opening the door to the principled study of many other deep learning mysteries.
The authors would like to thank Zack Weinstein for useful discussions and Sajant Anand, Jesse Livezey, Roy Rinberg, Jascha Sohl-Dickstein, and Liu Ziyin for helpful comments on the manuscript. This research was supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract W911NF-20-1-0151. JS gratefully acknowledges support from the National Science Foundation Graduate Fellow Research Program (NSF-GRFP) under grant DGE 1752814.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854.
In the main text, we assume prior familiarity with the NTK, using Equation 1 as the starting point of our derivations. Here we provide a definition and very brief introduction to the NTK for unfamiliar readers. For derivations and full discussions, see Jacot et al. (2018) and Lee et al. (2019).
Consider a feedforward neural network representing a function f(x; θ), where θ is a parameter vector. Further consider one training example x_1 with target value y_1 and one test point x, and suppose we perform one step of gradient descent with a small learning rate η with respect to the MSE loss L = (1/2)(f(x_1; θ) − y_1)². This gives the parameter update

    Δθ = −η ∇_θ L = η (y_1 − f(x_1; θ)) ∇_θ f(x_1; θ).
We now wish to know how this parameter update changes f(x). To do so, we linearize about θ, finding that

    Δf(x) = ∇_θ f(x; θ) · Δθ + O(|Δθ|²) = η (y_1 − f(x_1; θ)) K(x, x_1) + O(η²),

where we have defined K(x, x_1) ≡ ∇_θ f(x; θ) · ∇_θ f(x_1; θ). This quantity is the NTK. Remarkably, as network width goes to infinity (the "width" parameter varies by architecture; for example, it is the minimal hidden layer width for fully-connected networks and the minimal number of channels per hidden layer for a convolutional network), the O(η²) corrections become negligible, and K is the same after any random initialization (assuming the parameters are drawn from the same distribution) and at any time during training. This dramatically simplifies the analysis of network training, allowing one to prove that after infinite time training on MSE loss for an arbitrary dataset, the network's learned function is given by Equation 1. See, for example, Equations 14-16 of Lee et al. (2019). (We note that there exists a different infinite-width kernel, called the "NNGP kernel," describing a network's random initialization; this reference uses K for the NNGP kernel and Θ for the NTK.)
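The defining dot product of gradients can be checked empirically on a tiny network. The sketch below (our own toy model, using finite-difference gradients rather than autodiff) computes K(x_1, x_2) = ∇_θ f(x_1; θ) · ∇_θ f(x_2; θ) for a one-hidden-layer ReLU net and verifies two properties any such kernel must have: symmetry and positive semidefiniteness of its Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, width = 2, 64                                   # tiny one-hidden-layer ReLU net
W1_size = width * d_in
theta0 = np.concatenate([rng.standard_normal(W1_size) / np.sqrt(d_in),
                         rng.standard_normal(width) / np.sqrt(width)])

def net(x, theta):
    W1 = theta[:W1_size].reshape(width, d_in)
    w2 = theta[W1_size:]
    return float(w2 @ np.maximum(W1 @ x, 0.0))

def grad(x, theta, eps=1e-5):
    """Finite-difference gradient of the network output w.r.t. all parameters."""
    g = np.empty_like(theta)
    for i in range(len(theta)):
        dt = np.zeros_like(theta)
        dt[i] = eps
        g[i] = (net(x, theta + dt) - net(x, theta - dt)) / (2 * eps)
    return g

def ntk(x1, x2, theta):
    return grad(x1, theta) @ grad(x2, theta)          # K(x1, x2) = grad . grad

x1, x2 = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
K11, K22, K12 = ntk(x1, x1, theta0), ntk(x2, x2, theta0), ntk(x1, x2, theta0)
assert np.isclose(K12, ntk(x2, x1, theta0))           # the kernel is symmetric
assert K11 > 0 and K22 > 0
assert K12**2 <= K11 * K22 + 1e-9                     # every 2x2 Gram is PSD
```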
Property (a): L^(D)(φ_i) = [T(D)]_ii, and L(φ_i) = T̄_ii.
Proof. Using the fact that v = e_i when f = φ_i, we see that L^(D)(φ_i) = e_i^T T(D) e_i = [T(D)]_ii, where e_i is a one-hot M-vector with the one at index i. The second clause of the property follows by averaging.
Property (b): When n = 0, T(D) = 0 and L^(D)(f) = L(f) = 0.
Proof. When n = 0, Φ has no columns, and thus T(D) = 0. The other clauses follow from Property (a) and averaging.
Property (c): When D = X, T(D) = I and L^(D)(f) = L(f) = 1.
Proof. When D = X, Φ is a full-rank M × M matrix. Inspection of the formula for T(D) shows that T(D) = Λ Φ (Φ^T Λ Φ)^{-1} Φ^T = I. The other clauses follow from Property (a) and averaging.
Property (d): All eigenvalues of T(D) are in {0, 1}, all eigenvalues of T̄ are in [0, 1], and so L^(D)(f), L(f) ∈ [0, 1].
Proof. From the definition of T(D), it is easy to see that T(D)² = T(D). T(D) is thus idempotent, with all eigenvalues in {0, 1}. The fact that all eigenvalues of T̄ are in [0, 1] follows by averaging. The stated properties of L^(D) and L follow from the fact that, for any compatible vector v and matrix A, the quotient v^T A v / v^T v is bounded by the maximum and minimum eigenvalues of A.
Property (e): Let D' be D ∪ {x'}, where x' is a new data point. Then L^(D')(f) ≥ L^(D)(f).
Proof. To begin, we use the Moore-Penrose pseudoinverse, which we denote by ⁺, to cast T into a more transparent form:

    T = Λ^{1/2} (Λ^{1/2} Φ Φ^T Λ^{1/2}) (Λ^{1/2} Φ Φ^T Λ^{1/2})⁺ Λ^{-1/2},

where we have suppressed the D in T(D). This follows from the property of pseudoinverses that A (A^T A)^{-1} A^T = (A A^T)(A A^T)⁺ for any matrix A. We now augment our system with one extra data point, getting

    Φ' = [Φ, φ'],

where φ' is an M-element column vector orthonormal to the others of Φ'. We now convert the pseudoinverse into an inverse with a limit, getting

    T' = lim_{ε→0⁺} Λ^{1/2} B' (B' + ε I)^{-1} Λ^{-1/2},    where B' ≡ Λ^{1/2} Φ' Φ'^T Λ^{1/2}.    (22)

We now use the Sherman-Morrison matrix inversion formula to expand (B + u u^T + ε I)^{-1}, where B ≡ Λ^{1/2} Φ Φ^T Λ^{1/2} and u ≡ Λ^{1/2} φ'. Because u u^T and the inverted matrix in Equation 22 are positive semidefinite, we conclude that, for any M-vector v, it will hold that v^T T' v ≥ v^T T v. The desired property follows.
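The Sherman-Morrison formula used above, (A + u u^T)^{-1} = A^{-1} − (A^{-1} u)(u^T A^{-1}) / (1 + u^T A^{-1} u), is easy to verify numerically; a minimal NumPy check (with a positive-definite A, so the denominator exceeds 1):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 6
A = rng.standard_normal((m, m))
A = A @ A.T + m * np.eye(m)               # a well-conditioned positive-definite matrix
u = rng.standard_normal(m)

Ainv = np.linalg.inv(A)
denom = 1.0 + u @ Ainv @ u                # > 1 since A is positive definite
sm = Ainv - np.outer(Ainv @ u, u @ Ainv) / denom

assert np.allclose(sm, np.linalg.inv(A + np.outer(u, u)))   # Sherman-Morrison holds
```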
Property (f): For any i, ∂[T(D)]_ii/∂λ_i ≥ 0, hence ∂L(φ_i)/∂λ_i ≥ 0.
Proof. Differentiating [T(D)]_ii = λ_i φ_i^T G^{-1} φ_i with respect to a particular λ_j, we find that

    ∂[T(D)]_ii/∂λ_j = δ_ij φ_i^T G^{-1} φ_i − λ_i (φ_i^T G^{-1} φ_j)²,

where φ_i is the i-th row of Φ and G ≡ Φ^T Λ Φ. Specializing to the case j = i, we note that φ_i^T G^{-1} φ_i ≥ 0 because G^{-1} is positive semidefinite, and λ_i φ_i^T G^{-1} φ_i ≤ 1 because λ_i φ_i φ_i^T is one of the positive semidefinite summands in G. The desired property follows.
Property (g): For any i and any j ≠ i, ∂[T(D)]_ii/∂λ_j ≤ 0, hence ∂L(φ_i)/∂λ_j ≤ 0.
Proof. Differentiating as in the proof of Property (f) and using the fact that δ_ij = 0, we see that

    ∂[T(D)]_ii/∂λ_j = −λ_i (φ_i^T G^{-1} φ_j)²,

which is manifestly nonpositive because λ_i ≥ 0. The desired property follows.
Proof of Theorem 1. First, we note that, for any complete orthogonal basis {f_i} on X,

    Σ_i L^(D)(f_i) = Σ_i (w_i^T T(D) w_i) / (w_i^T w_i),

where {w_i} is an orthogonal set of vectors spanning R^M. This is equivalent to Tr[T(D)]. This trace is given by

    Tr[T(D)] = Tr[(Φ^T Λ Φ)^{-1} Φ^T Λ Φ] = Tr[I_n] = n,

which, upon averaging over datasets, proves the desired theorem. ∎
Proof of Theorem 2. First, we note that the naive, nongeneralizing model described in the theorem will always have a learnability of n/M. As per Theorem 1, this is exactly the mean learnability of the M eigenfunctions, so either all have precisely this learnability or else at least one has lower learnability. This proves the clause of the theorem specific to learnability.
To prove the clause specific to MSE, we now note that, as defined by Equation 1, kernel regression will always perfectly memorize the training data, so for both the naive model and kernel regression, MSE is given by

    MSE^(D)(f) = (1/M) Σ_{x ∉ D} (f(x) − f̂(x))² = (1/M) Σ_{x ∉ D} f(x)² + (1/M) Σ_{x ∉ D} f̂(x)² − (2/M) Σ_{x ∉ D} f(x) f̂(x).

The first term is the same for any model, the second term is nonnegative but zero for the naive model, and the third term is equivalent to −2(L^(D)(f) − n/M)⟨f, f⟩ in expectation over eigenfunctions, which must be nonnegative for at least one eigenfunction. Kernel regression therefore gives worse MSE than the naive model unless, as for the naive model, f̂(x) = 0 for x ∉ D. ∎
Here we provide details of the derivations of our approximation for T̄ and of Lemma 3. Whenever possible, we leave it implicit that all expressions for T̄ in this appendix are approximations and use the symbol = for equality for simplicity.
We begin by taking the approximation of Equation 8 that

    T̄ = ⟨Λ Φ (Φ^T Λ Φ)^{-1} Φ^T⟩_Φ,

where the expectation is taken over all M × n matrices Φ satisfying Φ^T Φ = I_n with the uniform measure. It turns out that we can equivalently average over all matrices Φ with a mean-zero i.i.d. Gaussian measure over each element, without changing T̄. To see this, note that this equation is symmetric under right-multiplication of Φ by arbitrary invertible matrices R:

    Λ (ΦR) ((ΦR)^T Λ (ΦR))^{-1} (ΦR)^T = Λ Φ (Φ^T Λ Φ)^{-1} Φ^T.    (30)

Letting R be a Φ-dependent matrix which orthonormalizes its columns, it is straightforward to see that we can convert from an average over Gaussian Φ to an average over only orthonormal Φ, and vice-versa, as desired. Moving forward, we will exploit this equivalence and assume that each element of Φ is sampled i.i.d. from a standard Gaussian. This assumption will largely remain in the background, but will be useful later.
We now evaluate T̄, starting with the off-diagonal elements. We observe that the Gaussian measure over Φ is invariant under Φ → OΦ for any orthogonal matrix O. Defining O^(i) as the diagonal matrix such that O^(i)_kk = 1 − 2δ_ik (a sign flip of the i-th row), noting that O^(i)^T Λ O^(i) = Λ, and plugging in Φ → O^(i)Φ as in Equation 30, we find that

    T̄ = O^(i) T̄ O^(i).

By choosing each i in turn, we conclude that T̄_ij = −T̄_ij = 0 if i ≠ j.
To evaluate the diagonal elements of T̄, we consider augmenting our eigensystem with a new "test eigenmode" with eigenvalue λ_{M+1}, as described in the main text and formalized in Equation 9. To proceed, we make the core assumption that the addition of a single test mode negligibly perturbs T̄ when M is large. (Empirically, we see that this approximation holds already for realistic values of M and n used in neural net experiments.) This assumption, combined with the symmetry T̄_ii = T̄_jj whenever λ_i = λ_j, implies that T̄_ii equals the test mode's learnability evaluated at λ_{M+1} = λ_i. Therefore, to evaluate T̄_ii, it suffices to evaluate L(φ_{M+1}) as a function of λ_{M+1}.
We now manipulate the expression for T̄_{M+1,M+1} to isolate λ_{M+1}. Using the Sherman-Morrison matrix inversion formula, we obtain