1 Introduction
The analysis of kernel methods has seen a resurgence since Jacot et al. (2018) showed an equivalence between wide neural networks trained with gradient descent and the so-called neural tangent kernel
(NTK). Contemporaneously, there has been growing interest in high-dimensional asymptotic analyses of machine learning methods in a regime where the number of input samples
and number of input features grow proportionally as (1) 
for some and the data follow some random distribution. This regime often enables remarkably precise predictions on the behavior of complex algorithms (see, e.g. (Krzakala et al., 2012), and the references below).
In this work, we study kernel estimators of the form:
(2) 
where are weights learned from training samples , and is a kernel. We consider the training of such kernel models in an asymptotic random regime similar in form to several other high-dimensional analyses:
Proportional, uniform large scale limit: Consider a sequence of problems indexed by the number of data samples satisfying the following assumptions:


(Uniform data) Training features are generated as where has i.i.d. entries with , , and for some . A test sample, , is generated similarly. Further, the covariance matrix is positive definite with , and
. 
(Proportional asymptotics) Number of samples and number of input features scale as (1).

(Kernel) The kernel function is of the form
(3) where is around , around .
Under these assumptions we show that:
Kernel regression offers no gain over linear models.
Our result does not dismiss kernel methods (or neural networks) as a whole, but serves as a caution regarding the proportional uniform large scale limit model when examining the asymptotic properties of kernels. A result of this nature, concerning the high-dimensional degeneracy of two-layer neural networks, was obtained by Hu et al. (2020).
1.1 Summary of Contributions
To be precise, we show three surprising results concerning kernel regression in the proportional, uniform large scale limit:


First, we show kernel models only learn linear relations between the covariates and the response in this regime. Consequently, kernel models (including neural networks in the kernel regime) have no benefit over linear models in this regime.

Our second result considers the training dynamics of the kernel and linear models. We show that under gradient descent, in the high-dimensional setting, the dynamics of the kernel model and a scaled linear model are equivalent throughout training.

Finally, we consider the case where the true data is generated from a kernel model with some unknown parameters. In this case, the relation between and can be highly nonlinear. An example of such a model is that is generated from via a neural network with random, unknown parameters. In this case, we show that in the high-dimensional limit, linear models achieve the minimum generalization error. That is, again, nonlinear kernel methods provide no benefit and training a wide neural network would result in a linear model.
The main takeaway of this paper is that under certain data distribution assumptions that are widely used in theoretical papers, a large class of kernel methods, including fully connected neural networks (and residual architectures with fully connected blocks) in the kernel regime, can only learn linear functions. Therefore, in order to theoretically understand the benefits that they provide over linear models, more complex data distributions should be considered. Informally, if covers this space in every direction (not necessarily isotropically), and the number of samples grows only linearly in the dimension of this space, many kernels can only see linear relationships between the covariates and the response. In other words, we argue that if we seek high-dimensional models for analyzing the performance of neural networks, other distributional assumptions will be needed.
The proofs of our results rely on a generalization of Theorems 2.1 and 2.2 of El Karoui et al. (2010). This generalization might be of independent interest for other works.
1.2 Prior Work
High-dimensional analyses in the proportional asymptotics regime, similar to assumptions 1 to 3,
have been widely used in statistical physics and random-matrix-based analyses of inference algorithms
(Zdeborová and Krzakala, 2016). The high-dimensional framework has yielded powerful results in a wide range of applications such as estimation error in linear inverse problems (Donoho et al., 2009; Bayati and Montanari, 2011; Krzakala et al., 2012; Rangan et al., 2019; Hastie et al., 2019), convolutional inverse problems (Sahraee-Ardakan et al., 2021), dynamics of deep linear networks (Saxe et al., 2013), matrix factorization (Kabashima et al., 2016), binary classification (Taheri et al., 2020; Kini and Thrampoulidis, 2020), inverse problems with deep priors (Gabrié et al., 2019; Pandit et al., 2019, 2020), generalization error in linear and generalized linear models (Gerace et al., 2020; Emami et al., 2020; Loureiro et al., 2021; Gerbelot et al., 2020), random features (D’Ascoli et al., 2020), and for choosing the optimal objective function for regression (Bean et al., 2013; Advani and Ganguli, 2016), to name a few. Our result that, under a similar set of assumptions, kernel regression degenerates to linear models is thus somewhat surprising.
That being said, the result is not entirely new. Several authors have suggested that high-dimensional data modeled with i.i.d. covariates are inadequate
(Goldt et al., 2020b; Mossel, 2016). The results in this paper can thus be seen as an attempt to describe the limitations precisely.
In this regard, the work is closest to (Hu et al., 2020), which proves that for a two-layer fully connected neural network, the training dynamics are equivalent to those of a linear model in the inputs. They provide asymptotic rates for convergence in the early stages of training (). Our result, however, considers a much larger class of kernels and is not limited to the NTK. In addition, we consider the dynamics throughout training, including the limit.
The generalization of kernel ridgeless regression is also discussed in this setting in (Liang and Rakhlin, 2020). The connections to double descent with explicit regularization have been analyzed in (Liu et al., 2021). The authors in (Dobriban and Wager, 2018) characterize the limiting predictive risk for ridge regression and regularized discriminant analysis.
(Cui et al., 2021) provides the error rates for kernel ridge regression (KRR) in the noisy case, and the generalization error in learning with random features with kernel approximation has been discussed in (Liu et al., 2020b). A comparison between neural networks and kernel methods for Gaussian mixture classification is also provided in (Refinetti et al., 2021).
The kernel approximation of overparameterized neural networks does not limit their performance in practical applications. In fact, these networks have been shown to generalize surprisingly well (Neyshabur et al., 2017; Zhang et al., 2021; Belkin et al., 2018). Of course, in the non-asymptotic regime, these models also have very large capacity (Bartlett et al., 2017). While this high capacity allows learning complex functions, trained networks do not always exploit it, and large models may still favor simpler functions. Works such as (Kalimeris et al., 2019; Hu et al., 2020) show that this simplicity can come from the implicit regularization induced by training algorithms such as gradient descent in the early-time dynamics. In this work, however, we show that in the high-dimensional limit, this simplicity can be a result of the uniformity of the input distribution over the space. In fact, we show that in this regime, kernel methods are no better than linear models.
2 Kernel Methods Learn Linear Models
In this section we show the first result of this paper: in the proportional, uniform high-dimensional regime, fitting kernel models is equivalent to fitting a regularized least squares model with appropriate regularization parameters. A short review of reproducing kernel Hilbert spaces (RKHS) and kernel regression can be found in Appendix A.1.
Suppose we have data points , with , and an RKHS corresponding to the kernel .
Consider two models fitted to this data:


Kernel ridge regression model which solves
(4) where is the Hilbert norm of the function.

Linear model fitted by solving the regularized least squares problem:
(5)  
(6) 
The problem in (4) is an optimization over a function space. By parameterizing as we can find the optimal function by solving
(7) 
By the representer theorem (Schölkopf et al., 2001), the optimal function in (4) also has the form
(8) 
where solves
(9) 
where with is the kernel matrix and is its row.
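As a rough illustration of the reduction given by the representer theorem, the following NumPy sketch solves a finite-dimensional kernel ridge regression problem. It assumes the objective is the sum of squared residuals plus a multiple `lam` of the squared RKHS norm; the exact normalization used in (4) and (9) is not reproduced here. The names `linear_kernel`, `krr_fit`, and `krr_predict` are hypothetical helpers introduced only for this example, and the linear kernel is chosen so that the answer can be checked against ordinary ridge regression.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of the plain inner-product kernel k(x, x') = x . x'."""
    return A @ B.T

def krr_fit(K, y, lam):
    """Return dual coefficients a that minimize ||y - K a||^2 + lam * a^T K a.

    Assumed normalization: unscaled squared loss plus lam times the RKHS norm.
    a = (K + lam I)^{-1} y is a minimizer (the unique one when K is invertible).
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(K_test_train, alpha):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) at the test points."""
    return K_test_train @ alpha

rng = np.random.default_rng(0)
n, d, lam = 50, 20, 0.5          # illustrative sizes, not taken from the paper
X, X_test = rng.standard_normal((n, d)), rng.standard_normal((5, d))
y = rng.standard_normal(n)

alpha = krr_fit(linear_kernel(X, X), y, lam)
pred_kernel = krr_predict(linear_kernel(X_test, X), alpha)

# Primal ridge solution for the same linear kernel, used as a sanity check.
theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
pred_linear = X_test @ theta
```

With a linear kernel the dual and primal predictions coincide exactly, which is the finite-sample identity that the asymptotic equivalence in this section extends to nonlinear kernels.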
To state the result we need to define the following constants related to the kernel and its associated function from assumption 3
(10a)  
(10b)  
(10c) 
where and are partial derivatives of in the second argument.
Our first result shows that with an appropriate choice of the two models and are in fact equivalent.
Theorem 2.1.
Proof.
See Appendix B. ∎
Remark 1.
Note that the result in Theorem 2.1 does not imply that the linear model and the kernel model are equal in probability at every point in the domain of these functions in the proportional uniform regime, but rather at a random test point as given by assumption 1. However, this suffices for understanding the generalization properties of these functions.
Remark 2.
Since convergence in probability implies convergence in distribution, we also have that the generalization error of is the same as that of for any bounded continuous metric.
Remark 3.
Theorem 2.1 states a convergence in probability for a single test point. This holds for test samples so long as grows at most linearly in the number of training samples, i.e. and the outputs of kernel model and the linear model would be equal in probability over all these test samples.
3 Linear Dynamics of Kernel Models
Our next result shows that if a kernel ridge regression is solved using gradient descent, every intermediate estimator during training has an equivalent linear model.
Consider a kernel model that is parameterized as (where is the feature map) that is trained by regularized empirical risk minimization:
(12) 
The gradient descent iterates for this problem are,
(13) 
with . Here, is a matrix with as its th row and is the learning rate. Similarly, consider a scaled linear model learned by optimizing, via gradient descent, the regularized squared loss:
(14) 
The parameters are initialized at zero in gradient descent. Observe that this optimization problem is equivalent to the one in (5) as we have only made a change of variables to and to , i.e. they learn the same model. The scalings are introduced to make the training dynamics of the linear model and the kernel model the same. Let the parameter of the kernel model after steps of gradient descent be and define . Similarly, let the parameters of the linear model after step of gradient descent be and define . Then we have the following result.
Theorem 3.1.
If and are given by equation (11), then for any step of gradient descent (initialized at zero) and any test sample drawn from the same distribution as the training data we have
(15) 
Proof.
The proof can be found in Appendix D. ∎
4 Optimality of Linear Models
Our last result shows that in the proportional uniform large scale limit, if the true model has a Gaussian process prior with a kernel that satisfies assumption 3, then linear models are in fact optimal, even though the true underlying relationship between the covariates and the responses could be highly nonlinear. See Appendix A.2 for a review of Gaussian process regression.
Assume that we are given training samples
(16) 
and the function is a zero mean Gaussian process with covariance kernel . An example occurs in the so-called student-teacher setup of (Gardner and Derrida, 1989; Aubin et al., 2019) where the unknown function is of the form
(17) 
and is a neural network with unknown parameters . If the network has infinitely wide hidden layers and the unknown parameters are generated randomly with i.i.d. Gaussian coefficients with the appropriate scaling, the unknown function in (17) becomes asymptotically a Gaussian process (Neal, 2012; Lee et al., 2017; Matthews et al., 2018; Daniely et al., 2016).
Now assume that we are given a test sample from the same model and we are interested in estimating . It is well known (see Appendix A.2) that the Bayes optimal estimator with respect to squared error in this case is
(18) 
and its Bayes risk is
(19) 
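For readers who want to evaluate the Bayes optimal estimator numerically, the sketch below implements the standard Gaussian process posterior mean and posterior variance. It is an illustration under stated assumptions: the risk is taken for the noiseless function value at the test point, the noise is Gaussian with variance `noise_var`, and an RBF kernel is used purely as a convenient positive definite example rather than as the kernel of this paper. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel; a stand-in for the paper's kernel."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X, y, X_star, kernel, noise_var):
    """Posterior mean and variance of f(x*) given noisy samples y = f(X) + noise."""
    K = kernel(X, X) + noise_var * np.eye(X.shape[0])
    K_star = kernel(X_star, X)                        # shape (m, n)
    L = np.linalg.cholesky(K)                         # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha                             # k_*^T (K + s^2 I)^{-1} y
    V = np.linalg.solve(L, K_star.T)                  # shape (n, m)
    prior_var = np.diag(kernel(X_star, X_star))
    var = prior_var - np.sum(V**2, axis=0)            # pointwise posterior variance
    return mean, var

rng = np.random.default_rng(1)
X, y = rng.standard_normal((30, 5)), rng.standard_normal(30)
X_star = rng.standard_normal((7, 5))
mean, var = gp_posterior(X, y, X_star, rbf_kernel, noise_var=0.1)
```

Averaging `var` over many test points drawn from the input distribution gives a Monte Carlo estimate of the Bayes risk under these assumptions.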
Next consider a linear model fitted by solving the regularized least squares problem in (5). Define the square error risk of this model as
(20) 
where the expectation is with respect to the randomness in as well as the noise .
Theorem 4.1.
Proof.
It is important to contrast this result with (Goldt et al., 2020a) and (Aubin et al., 2019). The works (Aubin et al., 2019; Goldt et al., 2020a) consider exactly the case where the true function is of the form (17) where is a neural network with Gaussian i.i.d. parameters. However, in their analyses, the number of hidden units in both the true and the trained network is fixed while the dimension of and the number of samples grow with proportional scaling. With a fixed number of hidden units, the true function is not a Gaussian process, and the model class is not a simple kernel estimator; hence, our results do not apply. Interestingly, in this case, the results of (Aubin et al., 2019; Goldt et al., 2020a) show that nonlinear models can significantly outperform linear models. Hence, very wide neural networks can underperform networks with smaller numbers of hidden units. It is an open question which scalings of the number of hidden units, number of samples, and dimension yield degenerate results.
5 Sketch of Proofs
Here we provide the main ideas behind the proofs of our main theorems. The details of the proof of Theorem 2.1 can be found in Appendix B.1. Proof of Theorem 3.1 can be found in Appendix D.
5.1 Degeneracy of empirical kernel matrices
Our first result extends Theorems 2.1 and 2.2 of El Karoui et al. (2010) and may be of independent interest to the reader.
Proposition 5.1.
Proof.
See Appendix B. ∎
El Karoui et al. (2010) present this result for kernels of the form or . Importantly, the NTK has a form that is neither or , but in fact of the form in equation (3), whereby Proposition 5.1 provides new insights into the behaviour of empirical kernel matrices of the NTK for a large class of architectures.
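The inner-product case treated by El Karoui et al. (2010) can be checked numerically. The sketch below is not Proposition 5.1 itself: it uses the linear-plus-constant surrogate from the cited theorem for inner-product kernels, specialized to identity covariance, Gaussian entries, and the illustrative choice of the exponential function, all of which are assumptions made for this example. The helper names are hypothetical.

```python
import numpy as np

def inner_product_kernel(X, f):
    """Kernel matrix with entries f(x_i . x_j / d)."""
    d = X.shape[1]
    return f(X @ X.T / d)

def linearized_kernel(X, f, f0, df0, d2f0):
    """Linear-plus-constant surrogate, specialized to identity covariance.

    f0, df0, d2f0 are f(0), f'(0), f''(0). With Sigma = I we have
    trace(Sigma)/d = 1 and trace(Sigma^2)/d^2 = 1/d.
    """
    n, d = X.shape
    tau = 1.0
    const = f0 + 0.5 * d2f0 / d                # coefficient of the all-ones matrix
    diag_shift = f(tau) - f0 - tau * df0       # coefficient of the identity
    return const * np.ones((n, n)) + df0 * (X @ X.T) / d + diag_shift * np.eye(n)

rng = np.random.default_rng(0)
n = d = 400                                    # proportional regime, illustrative sizes
X = rng.standard_normal((n, d))
K = inner_product_kernel(X, np.exp)
M = linearized_kernel(X, np.exp, f0=1.0, df0=1.0, d2f0=1.0)
err = np.linalg.norm(K - M, ord=2)             # spectral norm of the gap
scale = np.linalg.norm(K, ord=2)
```

At these sizes the gap `err` is small compared with `scale`, and the cited theory predicts that it vanishes as the dimensions grow proportionally; the finite-size convergence can be slow because the diagonal entries concentrate only at rate roughly one over the square root of the dimension.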
5.2 Equivalence of Kernel and Linear Models
Proposition 5.1 is the main tool we use to show that kernel methods and linear methods are equivalent in the proportional, uniform large scale limit.
The model learned by the kernel ridge regression in equation (4) can be written as
(23) 
Next, since the optimization in (5) is a quadratic problem it has a closed form solution.
Proposition 5.2.
The linear estimator in Model 2 has the following form
(24) 
Proof.
5.3 Equivalence Throughout Training
The proof of equivalence of the kernel model and scaled linear model after steps of gradient descent is very similar. The updates for parameters of the kernel model as well as the parameters of the scaled linear model have linear dynamics (in their respective parameters). By unrolling the gradient update through time, we can write the parameters after step as a summation over the past time steps. Using this, we can simplify the sums to write the output of the kernel model at time over a test sample as
(25) 
Similarly, for the linear model at step we get
(26) 
Here, we can use Proposition 5.1 again to show that all the terms in the linear model converge in probability to the corresponding term in the kernel model, thus proving that the two models are equal in probability for any test sample drawn from the same distribution as the training data over the course of gradient descent.
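The unrolling step can be made concrete for a generic ridge-regularized least squares problem. The sketch below assumes the loss is one half of the squared residual norm plus one half of `lam` times the squared parameter norm, which may differ from the scalings in (12) and (14), and it verifies that the gradient descent iterate started at zero equals the sum over past time steps. The function names are hypothetical.

```python
import numpy as np

def gd_iterates(A, y, lam, lr, steps):
    """Gradient descent from zero on 0.5*||A w - y||^2 + 0.5*lam*||w||^2."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ w - y) + lam * w
        w = w - lr * grad
    return w

def gd_unrolled(A, y, lam, lr, steps):
    """Unrolled form: w_t = lr * sum_{s<t} (I - lr*H)^s A^T y, H = A^T A + lam*I."""
    p = A.shape[1]
    M = np.eye(p) - lr * (A.T @ A + lam * np.eye(p))
    term = A.T @ y                      # holds (I - lr*H)^s A^T y for the current s
    w = np.zeros(p)
    for _ in range(steps):
        w += lr * term
        term = M @ term
    return w

rng = np.random.default_rng(0)
A, y = rng.standard_normal((40, 10)), rng.standard_normal(40)
lam = 0.3
lr = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)   # step size below 1/L for stability
w_gd = gd_iterates(A, y, lam, lr, 25)
w_sum = gd_unrolled(A, y, lam, lr, 25)
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)
```

Because every term in the sum depends on the data only through the Gram-type matrices, replacing the kernel matrix by its linear surrogate term by term is what allows the comparison between (25) and (26).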
6 Numerical Experiments
6.1 Linearity of Kernel Models for NTK
We demonstrate via numerical simulations the predictions made by our results in Theorems 2.1, 3.1, and 4.1.
As shown in Lee et al. (2019) and Liu et al. (2020a), wide fully connected neural networks can be approximated by their first order Taylor expansion throughout the training
and this approximation becomes exact in the limit that all the hidden dimensions of the neural network go to infinity. Therefore, training a network by minimizing
(27) 
is equivalent (in the limit of wide network) to performing kernel ridge regression in an RKHS with feature map and neural tangent kernel as its kernel (see Appendix A.3 for a brief review of the NTK). Instead of removing the initial network, one can use a symmetric initialization scheme which makes the output of the neural network zero at initialization without changing its NTK (Chizat et al., 2018; Zhang et al., 2020; Hu et al., 2020).
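One common way to realize a symmetric initialization is to duplicate the hidden layer and give the two copies opposite output weights. The sketch below shows only that the network output is zero at initialization under this construction; the specific scaling, the one-hidden-layer ReLU architecture, and the function names are assumptions for illustration and are not necessarily the scheme used in the cited works or in the experiments here.

```python
import numpy as np

def init_symmetric(d, width, rng):
    """Draw one hidden layer and duplicate it with opposite output signs."""
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    # Two copies of the same hidden weights; output weights are a and -a.
    return {"W1": W.copy(), "a1": a.copy(), "W2": W.copy(), "a2": -a}

def forward(params, X):
    """Sum of two one-hidden-layer ReLU networks, scaled by 1/sqrt(2*width)."""
    width = params["W1"].shape[0]
    scale_in = np.sqrt(X.shape[1])
    h1 = np.maximum(X @ params["W1"].T / scale_in, 0.0)
    h2 = np.maximum(X @ params["W2"].T / scale_in, 0.0)
    return (h1 @ params["a1"] + h2 @ params["a2"]) / np.sqrt(2 * width)

rng = np.random.default_rng(0)
params = init_symmetric(d=20, width=256, rng=rng)
X = rng.standard_normal((8, 20))
out_at_init = forward(params, X)
```

The two halves are trained as independent parameters afterwards, so the cancellation holds only at initialization.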
A key property of the NTK of fully connected neural networks is that it satisfies assumption 3 since it has the form in equation (3). Hence, if the input data
satisfies the requirements of this theorem, in the proportional asymptotics regime the NTK should behave like a linear kernel. The first and second order derivatives of the kernel function can be obtained by backpropagation through the recursive equations in (57) and (58).
Figure 1 illustrates a setting where kernel models and neural networks in the kernel regime perform no better than appropriately trained linear models. This verifies the main result of this paper, Theorem 2.1.
We generate training data for as
(28) 
where and and
is a fully connected ReLU network with two hidden layers with 100 hidden units each.
We train 3 models:


A fully connected ReLU neural network with a single layer of
hidden units to fit this data using stochastic gradient descent (SGD) with momentum parameter
. The initial network is removed from the output as in (27).
We compare the test error for these models, measured as over test samples:
(29) 
We compare the test error for different numbers of training samples, averaged over 3 runs.
We can see that the NTK model and the equivalent linear model match almost perfectly for all the values of test samples, and the neural network model follows them very closely, matching them for small numbers of samples.
There are two main sources of mismatch between the neural network model and the NTK model: first, the width of the network, while large (20,000), is still finite; second, the training of the neural network model is stopped after 150 epochs, i.e. the trained neural network differs from the optimal neural network. Finally, the oracle model’s performance is the noise floor.
6.2 Equivalence of Kernel and Linear Models Throughout Training
Next, we verify Theorem 3.1 by showing that the test error of the scaled linear model and neural network match for all the steps of gradient descent. The setting is the same as in Section 6.1. We generate data using a random neural network with two hidden layers of units each and train a neural network with a single hidden layer of 10,000 units as well as the scaled linear model using gradient descent. We plot the error of each of the models over the test data throughout the training. We train each model for 100 epochs. Figure 2 shows that the two models have the same test error over the course of training.
6.3 Optimality of Linear Models
A polynomial kernel of degree has the following form
(30) 
where and is a constant that adjusts the relative influence of higher degree terms and lower degree terms. In this example, we sample test and training samples from the following model
(31) 
where is a Gaussian process with covariance kernel being a polynomial kernel. We use for the polynomial kernel and set , . We generate samples and train the kernel model and the equivalent linear model and estimate the normalized mean squared error of the estimator by averaging the normalized error over test samples. We use as the regularization parameter which makes the kernel estimator Bayes optimal (with respect to squared error). The results are averaged over 5 runs.
The results are shown in Figure 3, where normalized errors (defined in equation (29)) are plotted against the number of training samples. The dashed line corresponds to the optimal error curve obtained from Equation (18). The generalization errors for the linear model and the kernel model match, which confirms Theorem 2.1, and, as Theorem 4.1 predicts, both curves are very close to the optimal error curve. This figure verifies that the optimal estimator is indeed linear.
6.4 Counterexample: Beyond the Proportional Uniform Regime
Our results should not be misconstrued as implying that kernel methods or neural networks are ineffective. The equivalence of kernel models and linear models holds in the proportional uniform data regime. However, kernel models and neural networks outperform linear models when we deviate from this regime, as demonstrated in Figure 4.
This observation is closer to the real-world experience of the machine learning community, which perhaps suggests that assumptions A1-A3 are unrealistic for understanding high-dimensional phenomena relating large datasets and high-dimensional models.
We consider Gaussian process regression as in Section 6.3, but the input variables are generated from a mixture of two zero-mean Gaussians with low-rank covariances, which clearly violates assumption 1. The probability of each mixture component is set . We use and set the rank of the covariance of each component to . The covariance of each component is generated as
(32) 
Under this model, the resulting covariance matrix of the data would be
(33) 
which would have rank almost surely. In other words, the data only spans a subspace of dimension of the dimensional space.
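A small simulation can illustrate why such data occupy only a low-dimensional subspace. The construction below is a sketch under assumptions that are not taken from (32): each component covariance is built from a Gaussian factor as the product of the factor with its transpose, the components are equally likely, and the sizes `d`, `r`, and `n` are placeholders. The helper names are hypothetical.

```python
import numpy as np

def low_rank_factor(d, r, rng):
    """Return F (d x r) so that the component covariance is F F^T, of rank r."""
    return rng.standard_normal((d, r)) / np.sqrt(r)

def sample_mixture(n, factors, probs, rng):
    """Draw n zero-mean samples from a mixture of Gaussians N(0, F_k F_k^T)."""
    d = factors[0].shape[0]
    comps = rng.choice(len(factors), size=n, p=probs)
    X = np.empty((n, d))
    for k, F in enumerate(factors):
        idx = np.flatnonzero(comps == k)
        # z ~ N(0, I_r) mapped through F has covariance F F^T.
        X[idx] = rng.standard_normal((idx.size, F.shape[1])) @ F.T
    return X

rng = np.random.default_rng(0)
d, r, n = 50, 5, 200                      # placeholder sizes, not the paper's values
factors = [low_rank_factor(d, r, rng) for _ in range(2)]
probs = [0.5, 0.5]                        # assumed equal mixture weights
X = sample_mixture(n, factors, probs, rng)
mix_cov = sum(p * (F @ F.T) for p, F in zip(probs, factors))
rank_cov = np.linalg.matrix_rank(mix_cov)
rank_data = np.linalg.matrix_rank(X)
```

In this construction the mixture covariance has rank equal to the sum of the component ranks, so the samples span a strict subspace of the ambient space and the positive definiteness required by assumption 1 fails.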
Figure 4 shows that the kernel model which is the optimal estimator has a generalization error very close to the expected optimal error, whereas the linear model performs worse. The linear approximation of the true kernel matrix is inaccurate when we deviate from the proportional uniform data regime.
7 Conclusions
This paper, of course, does not contest the power of neural networks or kernel models relative to linear models. In a tremendous range of practical applications, nonlinear models outperform linear models. The results should be interpreted as a limitation of Assumptions 1-3 as a model for high-dimensional data. While this proportional high-dimensional regime has been incredibly successful in explaining the complex behavior of many other ML estimators, it provides degenerate results for kernel models and neural networks that operate in the kernel regime.
As mentioned above, the intuition is that when the data samples are generated as where has i.i.d. components and is positive definite, the data uniformly covers the space . When the number of samples only scales linearly with , it is impossible to learn models more complex than linear models.
This limitation suggests that more complex models for the generated data will be needed if the high-dimensional asymptotics of kernel methods are to be understood.
References
 Advani and Ganguli (2016) Advani, M. and Ganguli, S. (2016). Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):031034.
 Alemohammad et al. (2020) Alemohammad, S., Wang, Z., Balestriero, R., and Baraniuk, R. (2020). The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246.
 Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148.
 Aubin et al. (2019) Aubin, B., Maillard, A., Barbier, J., Krzakala, F., Macris, N., and Zdeborová, L. (2019). The committee machine: Computational to statistical gaps in learning a two-layers neural network. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124023.
 Bartlett et al. (2017) Bartlett, P., Foster, D. J., and Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.
 Bayati and Montanari (2011) Bayati, M. and Montanari, A. (2011). The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785.
 Bean et al. (2013) Bean, D., Bickel, P. J., El Karoui, N., and Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568.
 Belkin et al. (2018) Belkin, M., Ma, S., and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR.
 Bietti and Mairal (2019) Bietti, A. and Mairal, J. (2019). On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173.
 Chizat et al. (2018) Chizat, L., Oyallon, E., and Bach, F. (2018). On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956.
 Cui et al. (2021) Cui, H., Loureiro, B., Krzakala, F., and Zdeborová, L. (2021). Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. arXiv preprint arXiv:2105.15004.
 Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261.
 D’Ascoli et al. (2020) D’Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. (2020). Double trouble in double descent: Bias and variance(s) in the lazy regime. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2280–2290. PMLR.
 Dobriban and Wager (2018) Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279.
 Donoho et al. (2009) Donoho, D. L., Maleki, A., and Montanari, A. (2009). Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919.
 El Karoui et al. (2010) El Karoui, N. et al. (2010). The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50.
 Emami et al. (2020) Emami, M., Sahraee-Ardakan, M., Pandit, P., Rangan, S., and Fletcher, A. (2020). Generalization error of generalized linear models in high dimensions. In International Conference on Machine Learning, pages 2892–2901. PMLR.
 Gabrié et al. (2019) Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. (2019). Entropy and mutual information in models of deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124014.
 Gardner and Derrida (1989) Gardner, E. and Derrida, B. (1989). Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983.
 Gerace et al. (2020) Gerace, F., Loureiro, B., Krzakala, F., Mézard, M., and Zdeborová, L. (2020). Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR.
 Gerbelot et al. (2020) Gerbelot, C., Abbara, A., and Krzakala, F. (2020). Asymptotic errors for teacher-student convex generalized linear models (or: How to prove Kabashima’s replica formula). arXiv preprint arXiv:2006.06581.
 Goldt et al. (2020a) Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., and Zdeborová, L. (2020a). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124010.
 Goldt et al. (2020b) Goldt, S., Mézard, M., Krzakala, F., and Zdeborová, L. (2020b). Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10(4):041044.
 Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
 Hu et al. (2020) Hu, W., Xiao, L., Adlam, B., and Pennington, J. (2020). The surprising simplicity of the early-time learning dynamics of neural networks. arXiv preprint arXiv:2006.14599.
 Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580.
 Kabashima et al. (2016) Kabashima, Y., Krzakala, F., Mézard, M., Sakata, A., and Zdeborová, L. (2016). Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62(7):4228–4265.
 Kalimeris et al. (2019) Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
 Kini and Thrampoulidis (2020) Kini, G. R. and Thrampoulidis, C. (2020). Analytic study of double descent in binary classification: The impact of loss. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2527–2532. IEEE.
 Krzakala et al. (2012) Krzakala, F., Mézard, M., Sausset, F., Sun, Y., and Zdeborová, L. (2012). Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009.
 Lee et al. (2017) Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2017). Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165.
 Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8570–8581.
 Liang and Rakhlin (2020) Liang, T. and Rakhlin, A. (2020). Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347.
 Liu et al. (2020a) Liu, C., Zhu, L., and Belkin, M. (2020a). On the linearity of large nonlinear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33.
 Liu et al. (2020b) Liu, F., Huang, X., Chen, Y., and Suykens, J. A. (2020b). Random features for kernel approximation: A survey on algorithms, theory, and beyond. arXiv preprint arXiv:2004.11154.
 Liu et al. (2021) Liu, F., Liao, Z., and Suykens, J. (2021). Kernel regression in high dimensions: Refined analysis beyond double descent. In International Conference on Artificial Intelligence and Statistics, pages 649–657. PMLR.
 Loureiro et al. (2021) Loureiro, B., Sicuro, G., Gerbelot, C., Pacco, A., Krzakala, F., and Zdeborová, L. (2021). Learning Gaussian mixtures with generalised linear models: Precise asymptotics in high-dimensions. arXiv preprint arXiv:2106.03791.
 Mann and Wald (1943) Mann, H. B. and Wald, A. (1943). On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14(3):217–226.
 Matthews et al. (2018) Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
 Mossel (2016) Mossel, E. (2016). Deep learning and hierarchal generative models. arXiv preprint arXiv:1612.09057.

 Muthukumar et al. (2021) Muthukumar, V., Narang, A., Subramanian, V., Belkin, M., Hsu, D., and Sahai, A. (2021). Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research, 22(222):1–69.
 Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
 Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956.
 Pandit et al. (2019) Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. (2019). Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409.
 Pandit et al. (2020) Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. (2020). Matrix inference and estimation in multilayer models. In NeurIPS.
 Rangan et al. (2019) Rangan, S., Schniter, P., and Fletcher, A. K. (2019). Vector approximate message passing. IEEE Transactions on Information Theory, 65(10):6664–6684.
 Refinetti et al. (2021) Refinetti, M., Goldt, S., Krzakala, F., and Zdeborová, L. (2021). Classifying high-dimensional gaussian mixtures: Where kernel methods fail and neural networks succeed. arXiv preprint arXiv:2102.11742.
 Sahraee-Ardakan et al. (2021) Sahraee-Ardakan, M., Mai, T., Rao, A. B., Rossi, R. A., Rangan, S., and Fletcher, A. K. (2021). Asymptotics of ridge regression in convolutional models. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 9265–9275. PMLR.
 Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

 Schölkopf et al. (2001) Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In International conference on computational learning theory, pages 416–426. Springer.
 Taheri et al. (2020) Taheri, H., Pedarsani, R., and Thrampoulidis, C. (2020). Sharp asymptotics and optimal performance for inference in binary models. In International Conference on Artificial Intelligence and Statistics, pages 3739–3749. PMLR.
 Yang (2019a) Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

 Yang (2019b) Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture are gaussian processes.
 Zdeborová and Krzakala (2016) Zdeborová, L. and Krzakala, F. (2016). Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552.
 Zhang et al. (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115.
 Zhang et al. (2020) Zhang, Y., Xu, Z.-Q. J., Luo, T., and Ma, Z. (2020). A type of generalization error induced by initialization in deep neural networks. In Mathematical and Scientific Machine Learning, pages 144–164. PMLR.
Appendix
Appendix A Preliminaries
In this appendix, we present a short overview of reproducing kernel Hilbert spaces, Gaussian process regression, and neural tangent kernels, which are used throughout the paper.
A.1 Kernel Regression
In kernel regression, the estimator is a function that belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. A kernel $K(x, x')$ is an inner product in a possibly infinite-dimensional space called the feature space, i.e. $K(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ is called the feature map. With this feature map, the functions in the RKHS are of the form $f(x) = \langle \phi(x), \theta \rangle$, which is nonlinear in $x$ but linear in the parameters $\theta$. In this work, we consider kernels of the form in equation (3), which includes inner-product kernels as well as shift-invariant kernels. Many commonly used kernels, such as RBF kernels, polynomial kernels, and the neural tangent kernel, are of this form.
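As a small illustration of the kernel/feature-map correspondence described above, the following sketch checks numerically that a kernel value equals an inner product of explicit features. The homogeneous degree-2 polynomial kernel and the names `poly2_kernel` and `poly2_features` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def poly2_kernel(x, xp):
    """Homogeneous degree-2 polynomial kernel K(x, x') = (x . x')^2 (illustrative choice)."""
    return float(np.dot(x, xp)) ** 2

def poly2_features(x):
    """Explicit feature map phi(x) = vec(x x^T), so <phi(x), phi(x')> = (x . x')^2."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(4), rng.standard_normal(4)

# Kernel evaluated directly, and as an inner product in feature space.
k_direct = poly2_kernel(x, xp)
k_features = float(poly2_features(x) @ poly2_features(xp))

# A function in the RKHS: nonlinear in x, linear in the parameters theta.
theta = rng.standard_normal(16)
f_x = float(poly2_features(x) @ theta)
```

Here the feature space has dimension 16, so working with features is still cheap; for kernels such as the RBF kernel the feature space is infinite-dimensional and only the kernel evaluation is practical.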
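In kernel methods, the estimator is often learned via regularized empirical risk minimization (ERM),
(34) $$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \mathcal{L}(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2,$$
where $\mathcal{L}$ is a loss function and $\|\cdot\|_{\mathcal{H}}$ is the RKHS norm. By writing $f$ as a parametric function with parameters $\theta$, this optimization over the function space can be written as an optimization over the parameter space as
(35) $$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \mathcal{L}(y_i, \langle \phi(x_i), \theta \rangle) + \lambda \|\theta\|^2,$$
(36) $$\hat{f}(x) = \langle \phi(x), \hat{\theta} \rangle.$$
Note that this optimization is often very high-dimensional, as the dimension of the feature space can be very large or even infinite. By the representer theorem (Schölkopf et al., 2001), the solution to the optimization problem in (34) has the form
(37) $$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$
By the reproducing property of the kernel, it is easy to show that $\|\hat{f}\|_{\mathcal{H}}^2 = \alpha^T K \alpha$, where $\alpha = [\alpha_1, \dots, \alpha_n]^T$ and $K$ is the data kernel matrix with $K_{ij} = K(x_i, x_j)$. The optimization problem in (34) can then be written in terms of the $\alpha_i$s as
(38) $$\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{n} \mathcal{L}(y_i, K_i \alpha) + \lambda \alpha^T K \alpha,$$
where $K_i$ is the $i$th row of $K$. Observe that this optimization problem only depends on the kernel evaluated over the data points, and hence the optimization problem in (34) can be solved without ever working in the feature space. If we let $X$ represent the data matrix with $x_i^T$ as its $i$th row, and $y$ the vector of observations, then for the special case of square loss, $\mathcal{L}(y, \hat{y}) = (y - \hat{y})^2$, the optimization problem in (38) has the closed-form solution $\hat{\alpha} = (K + \lambda I)^{-1} y$, which corresponds to the estimator
(39) $$\hat{f}(x) = K(x, X) \left( K(X, X) + \lambda I \right)^{-1} y.$$
Throughout this paper, for two matrices of data points $X$ and $X'$ we use the notation $K(X, X')$ to represent the matrix with $[K(X, X')]_{ij} = K(x_i, x'_j)$, so that $K(x, X)$ is a row vector and $K(X, X)$ is the data kernel matrix.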
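The closed-form solution for square loss can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the RBF kernel, its `gamma` parameter, the synthetic data, and the helper names `fit_kernel_ridge` and `predict_kernel_ridge` are all illustrative choices, not the paper's setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Matrix with entries exp(-gamma * ||a_i - b_j||^2) for rows a_i of A, b_j of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(X, y, lam, kernel=rbf_kernel):
    """Dual weights alpha solving (K(X, X) + lam * I) alpha = y."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(X.shape[0]), y)

def predict_kernel_ridge(X_train, alpha, X_test, kernel=rbf_kernel):
    """Predictions K(X_test, X_train) @ alpha; only kernel evaluations are needed."""
    return kernel(X_test, X_train) @ alpha

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

lam = 1e-2
alpha = fit_kernel_ridge(X, y, lam)
y_fit = predict_kernel_ridge(X, alpha, X)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, and the feature map never appears: the fit touches the data only through `kernel(X, X)`.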
A.2 Gaussian Process Regression
A Gaussian process is a stochastic process $f$ in which, for every fixed set of points $\{x_i\}_{i=1}^{n}$, the joint distribution of $(f(x_1), \dots, f(x_n))$ is multivariate Gaussian. As with a multivariate Gaussian distribution, the distribution of a Gaussian process is completely determined by its first- and second-order statistics, known as the mean function and the covariance kernel, respectively. If we denote the mean function by $\mu(\cdot)$ and the covariance kernel by $K(\cdot, \cdot)$, then for any finite set of points
(40) $$[f(x_1), \dots, f(x_n)]^T \sim \mathcal{N}(\mu, K),$$
where $\mu = [\mu(x_1), \dots, \mu(x_n)]^T$ is the vector of mean values and $K$ is the covariance matrix with $K_{ij} = K(x_i, x_j)$. Next, assume that a priori we set the mean function to be zero everywhere. Then, the problem of Gaussian process regression can be stated as follows: we are given training samples
(41) 
where is a zero-mean Gaussian process with covariance kernel . Given a test point , we are interested in the posterior distribution of given the training samples. Defining and as in the previous section, we have
(42) 
where is the kernel matrix evaluated at training points. Therefore, if we define we have where
(43)  
(44)  
(45) 
The minimum mean squared error (MMSE) estimator is the estimator that minimizes the square risk
(46) 
where is the class of all measurable functions of . For a given , we have where minimizes the posterior risk
(47) 
and the expectation is with respect to the randomness in as well as . The estimator that minimizes this risk is the mean of the posterior, i.e. in (43) is the Bayes optimal estimator with respect to mean squared error and its mean squared error is . Note that while this estimator is linear in the training outputs, it is nonlinear in the input data.
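The posterior mean and variance of a zero-mean Gaussian process at a test point can be sketched as follows. This uses the standard textbook formulas and assumes i.i.d. Gaussian observation noise with variance `noise_var`; the noise model, the RBF kernel, and the function name `gp_posterior` are assumptions for illustration and may differ from the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Matrix with entries exp(-gamma * ||a_i - b_j||^2) (illustrative kernel choice)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def gp_posterior(X, y, x_ts, noise_var, kernel=rbf_kernel):
    """Posterior mean and variance of the latent f(x_ts) given noisy observations y.

    Assumes y_i = f(x_i) + noise with noise variance noise_var (an assumption here).
    """
    x_ts = np.atleast_2d(x_ts)
    K_reg = kernel(X, X) + noise_var * np.eye(X.shape[0])
    k_ts = kernel(x_ts, X)[0]           # cross-covariances, shape (n,)
    k_tt = kernel(x_ts, x_ts)[0, 0]     # prior variance at the test point
    mean = k_ts @ np.linalg.solve(K_reg, y)
    var = k_tt - k_ts @ np.linalg.solve(K_reg, k_ts)
    return mean, var

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
y1 = rng.standard_normal(40)
y2 = rng.standard_normal(40)
x_ts = rng.standard_normal(3)
noise_var = 0.1

m1, v1 = gp_posterior(X, y1, x_ts, noise_var)
m2, v2 = gp_posterior(X, y2, x_ts, noise_var)
m12, v12 = gp_posterior(X, y1 + y2, x_ts, noise_var)
```

The three calls illustrate the remark above: the posterior mean is linear in the training outputs (`m12` matches `m1 + m2` up to rounding), while the posterior variance does not depend on the outputs at all. With `lam = noise_var`, the mean coincides with the kernel ridge regression estimator of Appendix A.1.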
In this work, the problem of Gaussian process regression arises for systems that are in the Gaussian kernel regime. More specifically, assume that we have training and test data and that are generated by a parametric model where . Furthermore, assume that conditioned on and
(48) 
which is dimensional vector of the function values on the training and test inputs is jointly Gaussian and zero mean. Also, for and , in the training and test inputs define the kernel function by
(49) 
Then the problem of estimating can be considered as a Gaussian process regression problem. An important instance of this kernel model is a wide neural network with parameters drawn from random Gaussian distributions and a linear last layer. In this case, one can show that, conditioned on the input, all the preactivation signals in the neural network (i.e. all the signals right before going through the nonlinearities), as well as the gradients with respect to the parameters, are Gaussian processes, as discussed below.
A.3 Neural Tangent Kernel
Consider a neural network function defined recursively as
(50)  
(51)  
(52) 
where is an elementwise nonlinearity, , and is the collection of all weights and biases, which are all initialized with i.i.d. draws from the standard normal distribution. As noted in many works (Neal, 2012; Lee et al., 2017; Matthews et al., 2018; Daniely et al., 2016), conditioned on the input signals, with a Lipschitz nonlinearity , the entries of the preactivations converge in distribution to i.i.d. Gaussian processes in the limit of with covariance defined recursively as
(53) 
(54) 
Therefore, if the true model is a random deep network plus noise, the optimal estimator would be as in (43) with the covariance in (54) used as the kernel.
The main result of Jacot et al. (2018) considers the problem of fitting a neural network to training data using gradient descent. It is shown that in the limit of wide networks (i.e. for all ), training a neural network with gradient descent is equivalent to fitting a kernel regression with respect to a specific kernel called the neural tangent kernel (NTK).
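When $f(x; \theta)$ is a neural network with a scalar output and parameters $\theta$, the neural tangent kernel (NTK) is defined as
(55) $$K_{\mathrm{NTK}}(x, x') = \langle \nabla_{\theta} f(x; \theta), \nabla_{\theta} f(x'; \theta) \rangle.$$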
In the limit of wide fully connected neural networks, Jacot et al. (2018) show that this kernel converges in probability to a kernel that is fixed throughout training.
Similar to (54), the neural tangent kernel can be evaluated via a set of recursive equations, the details of which can be found in Jacot et al. (2018). Similar results for architectures other than fully connected networks have since been proven (Arora et al., 2019; Yang, 2019a, b; Alemohammad et al., 2020).
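To make the definition of the NTK as an inner product of parameter gradients concrete, the sketch below computes the finite-width (empirical) kernel for a toy network. The two-layer, bias-free ReLU network $f(x) = a^T \mathrm{relu}(Wx)/\sqrt{m}$, its width `m`, and the function names are hypothetical simplifications chosen for illustration; the paper's networks are deeper and include biases.

```python
import numpy as np

def grad_params(x, W, a):
    """Gradient of f(x) = a^T relu(W x) / sqrt(m) with respect to (W, a), flattened."""
    m = W.shape[0]
    pre = W @ x                                    # preactivations, shape (m,)
    dW = np.outer(a * (pre > 0), x) / np.sqrt(m)   # df/dW, shape (m, d)
    da = np.maximum(pre, 0.0) / np.sqrt(m)         # df/da, shape (m,)
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(X, W, a):
    """Empirical NTK matrix: inner products of parameter gradients at the rows of X."""
    G = np.stack([grad_params(x, W, a) for x in X])  # shape (n, num_params)
    return G @ G.T

rng = np.random.default_rng(0)
n, d, m = 5, 3, 2000
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))   # standard normal initialization
a = rng.standard_normal(m)

K = empirical_ntk(X, W, a)
```

By construction `K` is a Gram matrix of gradient vectors, so it is symmetric and positive semidefinite. At finite width it depends on the random draw of `W` and `a`; the results cited above concern the limit in which this dependence vanishes.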
For a fully connected network with ReLU nonlinearities, the NTK has a closed recursive form given by Bietti and Mairal (2019). Let with and
(56) 
where is the ReLU function, , and all the parameters and , are initialized with i.i.d. entries drawn from . Then the corresponding NTK,