Deep learning has reached unparalleled performance in a variety of tasks and revolutionized many fields (LeCun et al., 2015), yet there are many important theoretical questions about deep learning that is unanswered. Perhaps the most important one is the question of generalization: how a deep network can predict accurately on data that it has not seen before. In this paper, we present a theory of generalization in kernel machines and wide neural networks. We ask how generalization performance depends on the number of training examples a network sees, a notion captured by learning curves. Such performance is expected to also depend on other factors such as the network’s architecture and the complexity of the learning task, yet identifying precisely how these factors impact generalization poses a theoretical challenge.
In order to address these questions, we work in a limit of neural networks where training dynamics simplifies. When the hidden layers of a neural network are taken to infinite width with a certain initialization scheme, recent influential work (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019)
showed that training a feedforward neural network with gradient descent to zero training loss is equivalent to kernel interpolation (or ridgeless kernel regression) with a kernel called the Neural Tangent Kernel (NTK)(Jacot et al., 2018). Exploiting this correspondence, we develop an approximate theory of generalization in kernel regression and use our findings to get insight into deep neural networks in the NTK limit. Our kernel regression theory is generally applicable to any kernel and contains kernel interpolation as a special limit (ridge parameter going to zero).
The goal of our theory is not to provide worst case bounds on generalization performance in the sense of statistical learning theory(Vapnik, 1999), but to provide analytical expressions that explain the average or a typical performance, in the spirit of statistical physics. The techniques we use are a continuous approximation to learning curves previously used in Gaussian processes (Sollich, 1999, 2002; Sollich and Halees, 2002) and the replica trick of statistical physics (Sherrington and Kirkpatrick, 1975).
Our contributions and results are summarized below:
Analysis of our theoretical expressions show that different spectral modes of a target function are learned with different rates. Modes corresponding to higher kernel eigenvalues are learned faster, in the sense that a marginal training data point causes a greater percent change in generalization error for high eigenvalue modes than for low eigenvalue modes.
When data is sampled from a uniform distribution on a hypersphere, dot product kernels, which include NTK, admit a degenerate Mercer decomposition in spherical harmonics,. In this case, our theory predicts that generalization error of lower frequency modes of the target function decrease more quickly than higher frequency modes as the dataset size grows. Different learning stages are visible in the sense below.
As the dimensions of data, , go to infinity, learning curves exhibit different learning stages. For a training set of size , modes with are perfectly learned, are being learned, and are not learned.
We verify the predictions of our theory for kernel regression and kernel interpolation with NTK, and wide and deep neural networks trained with gradient descent using numerical simulations. Our theory fits experiments remarkably well.
1.1 Related Work
Our main approximation technique comes from the literature on Gaussian processes, which is related to kernel regression is a certain limit. Total learning curves, but not their spectral decomposition as we do here, for Gaussian processes have been studied in a limited teacher-student setting where both student and teacher were described by the same Gaussian process and the same noise in (Opper and Vivarelli, 1999; Sollich, 1999). Here we allow any teacher distribution including the error for a single fixed teacher. Sollich also considered mismatched models where teacher and student kernels had different eigenspectra and different noise levels (Sollich, 2002). The total learning curve from this model is consistent with our results when the teacher noise sent to zero, but we also consider, provide and analyze generalization in spectral modes. We use an analogue of the “lower-continuous” approximation scheme introduced in (Sollich and Halees, 2002) which we also reproduce here through a replica trick and saddle-point approximation.
Generalization bounds for kernel ridge regression have been obtained in many contexts(Scholkopf and Smola, 2001; Cucker and Smale, 2002; Vapnik, 1999; Gyorfi et al., 2003), but the rates of convergence often crucially depend on the explicit ridge parameter and do not provide guarantees in the ridgeless case. Recently, interest in explaining the phenomenon of interpolation has led to the study of ridgeless regression (Belkin et al., 2018c, a, 2019; Liang and Rakhlin, 2018), where generalization bounds are kernel and data dependent. In this work, our aim is to capture the average case performance of kernel regression in a spectrum dependent model that remains valid in the ridgeless case.
In statistical physics domain, (Dietrich et al., 1999)
calculated learning curves for support vector machines, but not kernel regression, in the limit of number of training samples going to infinity for dot product kernels with binary inputs using a replica method. Our theory applies to general kernels and finite size datasets. In the infinite training set limit, they observed several learning stages where each spectral modes are learned with different rates. We observe similar phenomena in kernel regression as summarized above. In a similar spirit,(Cohen et al., 2019) calculates learning curves for infinite-width neural networks using a path integral formulation and a replica analysis but do not discuss the spectral dependence of the test risk as we do here.
In the infinite width limit we are considering, neural networks have many more parameters than training samples. Some works have explained generalization of overparameterized neural networks as a consequence of the training procedure since stochastic gradient descent exhibits an implicit bias towards choosing the simplest functions that interpolate the training data(Belkin et al., 2018b, c; Luo et al., 2019; Jacot et al., 2018). Empirical studies have shown that neural networks fit the low frequency components of the target before the high frequency components during training with gradient descent (Rahaman et al., 2018). In addition to training dynamics, recent works such as (Yang and Salman, 2019; Bietti and Mairal, 2019; Cao et al., 2019) have discussed how the spectrum of kernels impacts its smoothness and approximation properties. Here we explore similar ideas by explicitly calculating average case learning curves for kernel regression and studying its dependence on the kernel’s eigenspectrum.
2 Kernel Regression Learning Curves
We start with a general theory of kernel regression. Implications of our theory for dot product kernels including NTK, and its relation to trained neural networks are described in Section 3.
2.1 Notation and Problem Setup
We start by defining our notation and setting up our problem. Our initial goal is to derive a mathematical expression for generalization error in kernel regression, which we will analyze in the subsequent sections using techniques from the Gaussian process literature (Sollich, 1999, 2002; Sollich and Halees, 2002) and statistical physics (Sherrington and Kirkpatrick, 1975; Parisi, 1979).
The goal of kernel regression is to learn a function from a finite number of observations (Wahba, 1990). Let , where , be one of the training examples and let be a Reproducing Kernel Hilbert space (RKHS) with inner product . Kernel ridge regression is defined as:
limit is referred to as interpolating kernel regression, and, as we will discuss later, relevant to training wide neural networks. The unique minimum of the convex kernel regression loss is given by
where is the reproducing kernel for , is the kernel gram matrix , and . Lastly, is the vector of target values . For interpolating kernel regression, the solution is the same except that , meaning that training data is fit perfectly. The proof of this optimal solution is provided in the Supplementary Information section 1.
be the probability density function from which the input data are sampled. The generalization error is defined as the expected risk with expectation taken over new test points sampled from the same density. For a given dataset and target function , let represent the function learned with kernel regression. The generalization error for this dataset and target function is
To calculate the average case performance of kernel regression, we average this generalization error over the possible datasets and teacher functions
Our aim is to calculate for a general kernel, and distribution over teacher functions.
For our theory, we will find it convenient to work with the feature map defined by the Mercer decomposition. By Mercer’s theorem (Mercer, 1909; Rasmussen and Williams, 2005) , the kernel admits a representation in terms of its
where is the feature map we will work with. In our analysis, will be taken to be infinite, but for the development of the learning curves, we will first consider as a finite integer. The eigenfunctions and eigenvalues are defined with respect to the probability measure that generates the data
We will also find it convenient to work with a vector representation of the RKHS functions in the feature space. To do so, we note that the kernel eigenfunctions form a complete orthonormal basis for square integrable functions as , allowing the expansion of the target function and learned function in terms of features
Hence, -dimensional vectors and constitute a representation of and respectively in the feature space.
We can also obtain a feature space expression for the optimal kernel regression function (2). Let be feature matrix for the sample so that . With this representation, Kernel ridge regression (1) can be recast as the optimization problem , whose solution is
Another novelty of our theory is the decomposition of the test risk into its contributions from different eigenmodes. The feature space expression of the generalization error after averaging over the test distribution can be written as:
where we identify as the generalization error in mode .
Finally, with our notation set up, we can present our first result about generalization error.
For the that minimizes the training error, if the target functions are produced by a target function , we find that the generalization error is:
We leave the proof to Supplementary Information Section 2 but provide a few cursory observations of this result. First, note that all of the dependence on the teacher function comes in the matrix whereas all of the dependence on the empirical samples is in . In the rest of the paper, we will develop multiple theoretical methods to calculate the generalization error given by expression (11). Averaging over the target weights in the expression for is easily done for generic weight distributions. The case of a fixed target is included by choosing a delta-function distribution over .
We present two methods for computing the nontrivial average of the matrix over the training samples . First, we consider the effect of adding a single new sample to to derive a recurrence relation for
at different number of data points. This method generates a partial differential equation that must be solved to compute the generalization error. Second, we use a replica method and a saddle point approximation to calculate the generalization error by computing the average log-determinant of. These approaches give identical predictions for the learning curves of kernel machines.
For notational simplicity, in the rest of the paper, we will use to mean unless stated otherwise. In all cases, the quantity inside the brackets will depend either on the data distribution or the distribution of target weights, but not both.
2.2 Continuous Approximation to Learning Curves
First, we adapt a method following Sollich (Sollich, 1999, 2002; Sollich and Halees, 2002), to calculate the generalization error. We generalize the definition of by introducing an auxiliary parameter , and make explicit its dataset size, , dependence:
Note that the quantity we want to calculate is given by
By considering the effect of adding a single randomly sampled input , and treating as a continuous parameter, we can derive an approximate quasi-linear partial differential equation (PDE) for the average elements of as a function of the number of data points (see below for a derivation):
with the initial condition , which follows from when there is no data. Since is initialized as a diagonal matrix, the off-diagonal elements will not vary under the dynamics and will remain diagonal for all . We will use the solutions to this PDE and relation (15) to arrive at an approximate expression for the generalization error.
Derivation of the PDE approximation (16).
Let represent the new feature to be added to so that where is a random sample from the data distribution. Let denote the matrix averaged over it’s -sample design matrix . By the Woodbury matrix inversion formula
Performing the average of the last term on the right hand side is difficult so we resort to an approximation, where the numerator and denominator are averaged separately.
where we used the fact that .
Treating as a continuous variable and taking a continuum limit of the finite differences given above, we arrive at (16). ∎
Next, we present the solution to the PDE (16).
Let represent the diagonal elements of the average matrix . These matrix elements satisfy the implicit relationship
This implicit solution is obtained from the method of characteristics which we provide in Section 3.1 of the Supplementary Information. The proof relies on the observation that each of the matrix elements depend explicitly on and the trace . Once is solved for, the result above is immediate.
Under the PDE approximation (16), the average error associated with mode is
where is the solution to the implicit equation
and is defined as
The full proof of this proposition is provided in section 3.1 of the Supplementary Information. We show the steps required to compute theoretical learning curves numerically in Algorithm 1.
In these expressions, the target function sets the overall scale of . The spectrum of the kernel affects all modes in a nontrivial way. When we apply this theory to neural networks in Section (3), the information about the architecture of the network will be in the spectrum . The dependence on number of samples is also nontrivial, but we will consider various informative limits below. Next, we consider another theoretical method to obtain the same learning curves for kernel regression.
2.3 Replica Trick and Finite- Saddle-Point Approximation
The result of the continuous approximation can be obtained using another approximation method, which we outline here and detail in Supplementary Information Section 4. We perform the average of matrix over the training data, using the replica trick (Sherrington and Kirkpatrick, 1975; Parisi, 1979) from statistical physics and a finite size saddle-point approximation, which yields identical learning curves to Proposition 3.
The calculation proceeds by recasting the generalization error in terms of the log determinant of . First, we generalize the definition of once again
With this definition, the generalization error for a fixed teacher is given by
We identify the partition function and represent it with a Gaussian integral
As before, the difficulty in the calculation is averaging over the training samples: i.e. performing . To analytically perform this average, we resort to the replica method to compute the free energy
by computing integer moments of.
The calculation is detailed in Supplementary Information Section 4 and mostly follows the well-known steps using a replica symmetric ansatz. We approximate the integral that results from performing the data average of with a saddle-point approximation and arrive at the same result for learning curves as acquired from the continuous approximation, Proposition 3.
We note that to develop a theory for learning curves for finite sample sizes, we evaluated the saddle point at a finite . This is different than the usual application of the replica method where a thermodynamic limit () is taken (Mézard et al., 1987). In another work, we will provide a detailed account of the replica theory for this problem.
2.4 Spectral Dependency of Learning Curves
We can get insight about the behavior of learning curves by considering ratios between errors in different modes:
For small this ratio approaches . For large , , indicating that asymptotically (), the amount of error in mode grows with the ratio , showing that the asymptotic mode error is large if the teacher function places large amounts of power in modes that have small RKHS eigenvalues .
We can also examine how the RKHS spectrum affects the evolution of the error ratios with . Without loss of generality take and note that
Since and , we have . In this sense, the marginal training data point causes a greater percent change in generalization error for modes with larger RKHS eigenvalues.
3 Dot Product Kernels on and NTK
For the remainder of the paper, we specialize to the case where our inputs are drawn uniformly on , a -dimensional unit hyper-sphere. In addition, we will assume that the kernel is a dot product kernel, as is the case for NTK. In this setting, the kernel eigenfunctions are spherical harmonics (Bietti and Mairal, 2019; Efthimiou and Frye, 2014). This allows an expansion of the kernel in terms of Gegenbauer polynomials , which form a complete basis for the measure of inner products on the sphere (Dai and Xu, 2013). The Gegenbauer polynomials can be further expanded in terms of spherical harmonics (Dai and Xu, 2013) (see Supplementary Information Sections 5 and 6 for a review of these facts). We denote the dimension of the subspace spanned by -dimensional spherical harmonics of degree as . Using completeness of the Gegenbauer polynomials for inner products on the sphere, we expand any rotation invariant kernel as
This last expression is the Mercer decomposition of the kernel since the spherical harmonics are eigenfunctions of . In this form, it is clear that rotation invariance renders the eigenspectrum degenerate since each of the modes of frequency share the same eigenvalue .
3.1 Frequency Dependency of Learning Curves
In the special case of dot product kernels with monotonically decaying spectra, the result give in equation (27) indicates that the marginal training data point causes greater reduction in relative error for low frequency modes than for high frequency modes. This exercise illustrates that monotonic RKHS spectra represent an inductive bias that preferentially favors fitting lower frequencies as more data becomes available. More rapid decay in the spectrum yields a stronger bias to fit low frequencies first.
To make this intuition more precise, we now discuss an informative limit where the degeneracy factor approaches to . In the following, we replace eigenfunction index with index pair . Eigenvalues of the kernel scales with as (Smola et al., 2001) in the limit, as we verify numerically in Figure 1 for NTK. If we take for some integer degree , then exhibits three distinct learning stages. Leaving the details to Supplementary Information Section 9, we find that in this limit:
where . In other words, modes are perfectly learned, are being learned, and are not learned.
This simple calculation demonstrates that the lower modes are learned earlier with increasing sample complexity since the higher modes stays stationary until reaches to the degeneracy of that mode.
3.2 Neural Tangent Kernel and its Spectrum
The NTK is a rotation invariant kernel that describes how the predictions of infinitely wide neural networks evolve under gradient flow (Jacot et al., 2018). Let index all of the parameters of the neural network and let be the output of the network. Then the neural tangent kernel is defined as
Let be the current predictions of on the training data. If the parameters of the model are updated via gradient flow, , then the predictions on the training data evolve with the dynamics
When the width of the neural network is taken to infinity with proper initialization (where the weights at layer are sampled where is the number of hidden units in layer
) the NTK becomes independent of the particular realization of parameters and approaches to a deterministic function of the inputs and the nonlinear activation function(Jacot et al., 2018). Further, the kernel is approximately fixed throughout gradient descent (Jacot et al., 2018; Arora et al., 2019). If we assume that at , then the final learned function is
Note that this corresponds to ridgeless, interpolating regression where . We will use this correspondence and our kernel regression theory to explain neural network learning curves in the next section.
To generate theoretical learning curves, we need the eigenspectrum of the kernels involved. As a dot product kernel, since NTK lives on , it suffices to calculate the inner products over , which we evaluate numerically with Gauss-Gegenbauer quadrature (Supplementary Information Section 6). The NTK spectrum is shown as a function of input dimension (Figure 1), which verifies that for large dimensions, NTK spectrum stays constant as the dimension increases.
In this section, we test our theoretical results for kernel regression, kernel interpolation and wide networks.
4.1 Kernel Regression and Interpolation
We first test our theory in a kernel regression task. In this experiment, the teacher function is taken to be a linear combination of the kernel evaluated at randomly sampled points
are sampled randomly from a Bernoulli distribution onand are sampled uniformly from . The points are independent samples from and are different than the training set . The student function is learned with kernel regression by inverting the gram matrix defined on the training samples according to equation (2). With this choice of target function, exact computation of the mode wise errors
in terms of Gegenbauer polynomials is possible; the formula and its derivation are provided in section 10.2 of the Supplementary Information. We compare these experimental mode-errors to those predicted by our theory and find perfect agreement. For these experiments, both the teacher and student kernels are taken to be NTK of a 4 layer fully connected ReLU without bias. Figure2 shows the errors for each frequency as a function of sample size . In accordance with our theory, higher frequency modes require more data to fit.
The different panels of Figure 2 shows the effect of regularization and increased input dimension. In Figure 2(a), we show the mode errors for ridge regression with an explicit regularization penalty (, ) while Figure 2(b) shows the learning curves for ridgeless regression (, ). In both figures . The higher ridge parameter causes the learning curves to be learned more slowly. Figure 2(c) shows the effect of increasing the input dimension to . Consistent with our theory, higher input dimension causes the higher frequency modes to be learned at much larger .
). Errorbars show standard deviation across the 50 trials.
4.2 Learning Curves for Finite Width Neural Networks
Having established that our theory accurately predicts the generalization error of kernel regression with the NTK, we now compare the generalization error of finite width neural networks with the theoretical learning curves for NTK. For these experiments, we use the neural-tangents library (Novak et al., 2020) which supports training and inference for both finite and infinite width neural networks.
First, we use “pure mode” teacher functions, meaning the teacher is composed only of spherical harmonics of the same degree. For “pure mode” , the teacher is constructed with the following rule.
where again and are sampled randomly. Figure 6(b) shows the learning curve for a fully connected two layer ReLU network with width , input dimension and . As before, we see that the lower
pure modes require less data to be fit. The generalization error was estimated by taking a random test sample ofdata points. The average was taken over 25 trials and the standard deviations are shown with errorbars. The neural network was trained with stochastic gradient descent for iterations with a step size of 32 and minibatch size of 50. The weights were initialized with the default Gaussian NTK parameterization (Jacot et al., 2018). Experimental test errors for kernel regression with NTK on the same synthetic datasets are plotted as triangles. Our theory perfectly fits the experiments.
Results from a depth 4 simulation are provided in Figure 3(b). Each hidden layer had hidden units. Training was conducted with SGD with a learning rate of 2 for 10,000 iterations. We again see that the mode is only learned for . mode is not learned at all in this range. Our theory again perfectly fits the experiments.
Lastly we show that our theory also works for composite functions that contain many different degree spherical harmonics. In this setup, we randomly initialize a two layer teacher neural network and train a student neural network
where are the feedforward weights for the student and teacher respectively and are the student and teacher readout weights. Chosen in this way with ReLU activations, the teacher is composed of spherical harmonics of many different degrees (Section 11 in Supplementary Information). The total generalization error for this teacher student setup as well as the theoretical prediction of our theory is provided in Figure 6(f) for , . They agree excellently. Results from additional neural network simulations are provided in Section 11 of the Supplementary Information.
5 Discussion and Conclusion
In this paper, we presented an approximate theory of the average generalization performance for kernel regression. We studied our theory in the ridgeless limit to explain the behavior of trained neural networks in the infinite width limit (Jacot et al., 2018; Arora et al., 2019). Our theory demonstrates how the RKHS eigenspectrum of NTK encodes a preferential bias to learn high frequency modes only after the sample size is sufficiently large. Our theory fits kernel regression experiments remarkably well. We further experimentally verified that the theoretical learning curves obtained in the infinite width limit provide a good approximation of the learning curves for wide but finite NNs.
0 Supplementary Information (SI)
1 Background on Kernel Machines
1.1 Reproducing Kernel Hilbert Space
be a probability distribution over. Let be a Hilbert space with inner product . A kernel is said to be reproducing for if function evaluation at any is the equivalent to the Hilbert inner product with : is reproducing for if for all and all
1.2 Mercer’s Theorem
where the eigenvalue statement is
1.3 Representer Theorem
Let be a RKHS with inner product . Consider the regularized learning problem
where is an empirical cost defined on the discrete support of the dataset and . The optimal solution to the optimization problem above can always be written as
For any , function evaluation of at a test point is obtained by taking the Hilbert inner product of with the feature map for , : . Consider the orthogonal decomposition of into the subspace spanned by and its orthogonal complement. Thus we write
Evaluating of the function at a point in the training set we find
Therefore the empirical cost does not depend on . However, the Hilbert norm is minimized by choosing since
Since the optimal solution must have , the optimal function . Stated in terms of function evaluation,
for some coefficients . ∎
1.4 Solution to Least Squares
Specializing to the case of least squares regression, let
Using the representer theorem, we may reformulate the entire objective in terms of the coefficients
Optimizing this loss with respect to gives
Therefore the optimal function evaluated at a test point is
2 Derivation of the Generalization Error
Let the RKHS have an eigenvalues for . Define where are the eigenfunctions of the reproducing kernel for . Let the target function have the following expansion in terms of the kernel eigenfunctions . Define the design matrices and . Then the average generalization error for kernel regression is
Define the student’s eigenfunction expansion and decompose the risk in the basis of eigenfunctions.
Next, it suffices to calculate the weights learned through kernel regression. Define a matrix with elements . The training error for kernel regression is
The norm on is equivalent to the Hilbert norm on the target function. If then
since (Bietti and Mairal, 2019). The training error has a unique minimum
where the target function is produced deterministically according to .
Plugging in the that minimizes the training error into the formula for the generalization error, we find
gives the desired result. ∎
3 Solution of the PDE Using Method of Characteristics
Let . Here we derive the solution to the PDE in equation (14) of the main text by adapting the method used by (Sollich, 1999). Introduce a variable for the trace . Then,
The solution is a surface that passes through the line and satisfies the PDE above at all points. The solution to first order partial differential equations of the form is given by the method of characteristics (Arfken, 1985), which we describe below.
The tangent plane to the solution surface at a point is . Therefore a vector normal to the solution surface must satisfy
One such normal vector is .
The solution to the PDE satisfies
demonstrating that is tangent to the solution surface. This allows us to parameterize one dimensional curves along the solution in these tangent directions. Such curves are defined as characteristics. Introducing a parameter that varies along the one dimensional characteristic curves.
The first of these equations indicate that is constant along each characteristic curve. Integrating along the parameter, and where is the value of when and is the value of at . Without loss of generality, take so that . At , we have our initial condition
Since takes on the same value for each characteristic
which gives an implicit solution for . Now that we have solved for , we may write
For the kernel regression generalization, computing error requires the differentiation with respect to at
We need to calculate
Letting and solving for the derivative, we get