Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions

01/20/2022
by Mojtaba Sahraee-Ardakan, et al.

Empirical observation of high-dimensional phenomena, such as the double descent behaviour, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications for the generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between the input and the response can be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data, beyond independent features, are needed for high-dimensional analysis.


1 Introduction

The analysis of kernel methods has seen a resurgence after Jacot et al. (2018) showed an equivalence between wide neural networks trained with gradient descent and the so-called neural tangent kernel (NTK). Contemporaneously, there has been growing interest in high-dimensional asymptotic analyses of machine learning methods in a regime where the number of input samples and the number of input features grow proportionally as

(1)

for some fixed positive ratio, while the data follow some random distribution. This regime often enables remarkably precise predictions on the behavior of complex algorithms (see, e.g., Krzakala et al. (2012) and the references below).

In this work, we study kernel estimators of the form:

(2)

where the weights are learned from the training samples and the kernel measures the similarity between the test input and each training input. We consider the training of such kernel models in an asymptotic random regime similar in form to several other high-dimensional analyses:

Proportional, uniform large scale limit: Consider a sequence of problems indexed by the number of data samples satisfying the following assumptions:

  A1. (Uniform data) Training features are generated by applying the square root of a covariance matrix to vectors with i.i.d. zero-mean, unit-variance entries whose higher-order moments are bounded. A test sample is generated similarly. Further, the covariance matrix is positive definite and satisfies additional boundedness conditions.

  A2. (Proportional asymptotics) The number of samples and the number of input features scale as in (1).

  A3. (Kernel) The kernel function is of the form

    (3)

    where the scalar function defining the kernel is sufficiently smooth in a neighborhood of the points at which its arguments concentrate in the limit.

Under these assumptions we show that:

Kernel regression offers no gain over linear models.

Our result does not disregard kernel methods (or neural networks) as a whole, but serves as a caution regarding the proportional uniform large scale limit model when examining the asymptotic properties of kernels. A result of this nature regarding the high-dimensional degeneracy of two-layer neural networks was established in Hu et al. (2020).
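
For concreteness, the following sketch (in Python, with our own choices of dimensions, covariance, and kernel, since the paper's symbols and constants are not reproduced here) illustrates the data model of Assumption A1 and a kernel of the form required by Assumption A3.

```python
# Data model of Assumption A1 and a kernel of the form (3).
# All dimensions, the covariance, and the kernel below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 200                              # samples and features grow proportionally (A2)

# A1: features obtained by applying Sigma^{1/2} to vectors with i.i.d. entries.
Sigma = np.diag(1.0 + rng.random(p))         # a positive definite covariance
Z = rng.standard_normal((n, p))
X = Z @ np.sqrt(Sigma)                       # rows are the training inputs

# A3: kernels of the form K(x, z) = f(|x|^2/p, |z|^2/p, <x, z>/p).
def kernel(x, z, h=1.0):
    a, b, c = x @ x / p, z @ z / p, x @ z / p
    # Example: a Gaussian RBF written in terms of (a, b, c), since |x - z|^2/p = a + b - 2c.
    return np.exp(-(a + b - 2.0 * c) / (2.0 * h ** 2))

K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K.shape)                               # the (n, n) empirical kernel matrix
```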

1.1 Summary of Contributions

To be precise, we show three surprising results concerning kernel regression in the proportional, uniform large scale limit:

  1. First, we show that kernel models only learn linear relations between the covariates and the response in this regime. Consequently, kernel models (including neural networks in the kernel regime) offer no benefit over linear models in this regime.

  2. Our second result concerns the training dynamics of the kernel and linear models. We show that under gradient descent, in the high-dimensional setting, the dynamics of the kernel model and of a suitably scaled linear model are equivalent throughout training.

  3. Finally, we consider the case where the true data is generated from a kernel model with unknown parameters. In this case, the relation between the covariates and the response can be highly nonlinear; an example is a response generated from the input via a neural network with random, unknown parameters. We show that in the high-dimensional limit, linear models achieve the minimum generalization error. That is, again, nonlinear kernel methods provide no benefit, and training a wide neural network results in a linear model.

The main take-away of this paper is that under certain data distribution assumptions that are widely used in theoretical papers, a large class of kernel methods, including fully connected neural networks (and residual architectures with fully connected blocks) in the kernel regime, can only learn linear functions. Therefore, in order to theoretically understand the benefits that they provide over linear models, more complex data distributions should be considered. Informally, if the data covers the input space in every direction (not necessarily isotropically) and the number of samples grows only linearly in the dimension of that space, many kernels can only see linear relationships between the covariates and the response. In other words, we argue that if we seek high-dimensional models for analyzing the performance of neural networks, other distributional assumptions will be needed.

The proofs of our results rely on a generalization of Theorems 2.1 and 2.2 of (El Karoui et al., 2010). This generalization might be of independent interest.

1.2 Prior Work

High-dimensional analyses in the proportional asymptotics regime similar to Assumptions A1-A3 have been widely used in statistical physics and random matrix-based analyses of inference algorithms (Zdeborová and Krzakala, 2016). The high-dimensional framework has yielded powerful results in a wide range of applications such as estimation error in linear inverse problems (Donoho et al., 2009; Bayati and Montanari, 2011; Krzakala et al., 2012; Rangan et al., 2019; Hastie et al., 2019), convolutional inverse problems (Sahraee-Ardakan et al., 2021), dynamics of deep linear networks (Saxe et al., 2013), matrix factorization (Kabashima et al., 2016), binary classification (Taheri et al., 2020; Kini and Thrampoulidis, 2020), inverse problems with deep priors (Gabrié et al., 2019; Pandit et al., 2019, 2020), generalization error in linear and generalized linear models (Gerace et al., 2020; Emami et al., 2020; Loureiro et al., 2021; Gerbelot et al., 2020), random features (D’Ascoli et al., 2020), and the choice of optimal objective function for regression (Bean et al., 2013; Advani and Ganguli, 2016), to name a few. Our result that, under a similar set of assumptions, kernel regression degenerates to linear models is thus somewhat surprising.

That being said, the result is not entirely new. Several authors have suggested that modeling high-dimensional data with i.i.d. covariates is inadequate (Goldt et al., 2020b; Mossel, 2016). The results in this paper can thus be seen as attempting to describe the limitations precisely.

In this regard, our work is closest to (Hu et al., 2020), which proves that for a two-layer fully-connected neural network, the training dynamics are equivalent to those of a linear model in the inputs. They provide asymptotic rates for convergence in the early stages of training. Our result, however, considers a much larger class of kernels and is not limited to the NTK. In addition, we consider the dynamics throughout training, including convergence.

The generalization of kernel ridgeless regression is also discussed in this setting in (Liang and Rakhlin, 2020). The connections to double descent with explicit regularization have been analyzed in (Liu et al., 2021). Dobriban and Wager (2018) characterize the limiting predictive risk for ridge regression and regularized discriminant analysis. (Cui et al., 2021) provides the error rates for KRR in the noisy case, and the generalization error of learning with random features for kernel approximation has been discussed in (Liu et al., 2020b). A comparison between neural networks and kernel methods for Gaussian mixture classification is also provided in (Refinetti et al., 2021).

The kernel approximation of over-parameterized neural networks does not limit their performance in practical applications. In fact, these networks have, surprisingly, been shown to generalize well (Neyshabur et al., 2017; Zhang et al., 2021; Belkin et al., 2018). Of course, in the non-asymptotic regime, these models also have very large capacity (Bartlett et al., 2017). While this high capacity permits learning complex functions, trained networks do not always use it, and large models may still favor simpler functions. Works such as (Kalimeris et al., 2019; Hu et al., 2020) show that this simplicity can come from the implicit regularization induced by training algorithms such as gradient descent in the early stages of training. In this work, however, we show that in the high-dimensional limit, this simplicity can be a result of the uniformity of the input distribution over the space. In fact, we show that in this regime, kernel methods are no better than linear models.

2 Kernel Methods Learn Linear Models

In this section we show the first result of this paper: in the proportional, uniform high-dimensional regime, fitting kernel models is equivalent to fitting a regularized least squares model with appropriate regularization parameters. A short review of reproducing kernel Hilbert spaces (RKHS) and kernel regression can be found in Appendix A.1.

Suppose we have training data points, each consisting of a covariate vector and a response, together with an RKHS corresponding to the kernel.

Consider two models fitted to this data:

  1. Kernel ridge regression model, which solves

    (4)

    where the regularizer is the Hilbert norm of the function.

  2. Linear model fitted by solving the ℓ2-regularized least squares problem:

(5)
(6)

The problem in (4) is an optimization over a function space. By parameterizing the function in terms of the feature map, we can find the optimal function by solving

(7)

By the representer theorem (Schölkopf et al., 2001), the optimal function in (4) also has the form

(8)

where the coefficient vector solves

(9)

where the kernel matrix contains the kernel evaluated at all pairs of training points, and its rows give the kernel evaluated between one training point and all others.
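
As a minimal illustration of the representer-theorem solution, the following self-contained sketch computes the dual coefficients and a prediction; the placement of the regularizer is one common convention and may differ by constants from (9).

```python
# Kernel ridge regression solved through the representer theorem (our conventions).
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 300, 150, 1.0
X = rng.standard_normal((n, p))              # training inputs (rows)
y = rng.standard_normal(n)                   # training responses

def k(x, z):
    """An inner-product kernel of the form (3): a function of <x, z>/p."""
    return np.exp(x @ z / p)

K = np.array([[k(xi, xj) for xj in X] for xi in X])   # kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(n), y)       # dual coefficients, cf. (9)

x_test = rng.standard_normal(p)
f_hat = np.array([k(x_test, xi) for xi in X]) @ alpha  # prediction of the form (8)
print(f_hat)
```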

To state the result, we need to define the following constants related to the kernel and its associated scalar function from Assumption A3:

(10a)
(10b)
(10c)

where the derivatives appearing above are partial derivatives of the kernel's scalar function with respect to its second argument.

Our first result shows that, with an appropriate choice of regularization parameters, the two models are in fact equivalent.

Theorem 2.1.

Under Assumptions A1-A3, if we use the same data to train the two models with regularization parameters chosen as

(11)

where the constants are defined in equations (10), then, at a test sample drawn from the same distribution as the training samples, the outputs of the kernel model and the linear model converge to each other in probability.

Proof.

See Appendix B. ∎

Remark 1.

Note that the result in Theorem 2.1 does not imply that the linear model and the kernel model are equal in probability at all points in the domain of these functions in the proportional uniform regime, but rather at a random test point as given by Assumption A1. However, this suffices for understanding the generalization properties of these functions.

Remark 2.

Since convergence in probability implies convergence in distribution, we also have that the generalization error of the kernel model is the same as that of the linear model for any bounded continuous error metric.

Remark 3.

Theorem 2.1 states a convergence in probability for a single test point. The result extends to multiple test samples so long as their number grows at most linearly in the number of training samples, in which case the outputs of the kernel model and the linear model are equal in probability over all of these test samples.
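
The following rough numerical sketch illustrates the message of Theorem 2.1 in our own notation: for an inner-product kernel of the form (3), replacing the empirical kernel matrix by a linear surrogate (a constant term, a scaled Gram matrix, and a ridge-like diagonal shift, standing in for the constants in (10)) leaves the prediction at a random test point essentially unchanged. The exact regularization mapping (11) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 1000, 500, 1.0
Sigma = np.diag(0.5 + rng.random(p))
X = rng.standard_normal((n, p)) @ np.sqrt(Sigma)
y = rng.standard_normal(n)
x_test = np.sqrt(Sigma) @ rng.standard_normal(p)

g = lambda c: (1.0 + c) ** 3                  # an inner-product kernel of the form (3)
K = g(X @ X.T / p)                            # empirical kernel matrix
k = g(X @ x_test / p)                         # kernel vector at the test point

# Constants standing in for (10), computed from g and its derivatives at 0.
tau = np.trace(Sigma) / p                     # limiting value of |x_i|^2 / p
eps = 1e-4
g0, g1 = g(0.0), (g(eps) - g(-eps)) / (2 * eps)
g2 = (g(eps) - 2 * g0 + g(-eps)) / eps ** 2
c0 = g0 + g2 * np.trace(Sigma @ Sigma) / (2 * p ** 2)
c1, c2 = g1, g(tau) - g0 - g1 * tau

K_lin = c0 + c1 * (X @ X.T) / p + c2 * np.eye(n)   # linear surrogate kernel
k_lin = c0 + c1 * (X @ x_test) / p

pred_kernel = k @ np.linalg.solve(K + lam * np.eye(n), y)
pred_linear = k_lin @ np.linalg.solve(K_lin + lam * np.eye(n), y)
print(pred_kernel, pred_linear)               # approximately equal; the gap shrinks as n, p grow
```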

3 Linear Dynamics of Kernel Models

Our next result shows that if the kernel ridge regression problem is solved using gradient descent, every intermediate estimator during training has an equivalent linear model.

Consider a kernel model parameterized in terms of the feature map, trained by regularized empirical risk minimization:

(12)

The gradient descent iterates for this problem are

(13)

initialized at zero. Here, the iterates involve the matrix whose rows are the feature maps of the training samples, along with the learning rate. Similarly, consider a scaled linear model learned by optimizing, via gradient descent, the regularized squared loss:

(14)

The parameters are initialized at zero in gradient descent. Observe that this optimization problem is equivalent to the one in (5), as we have only made a change of variables; i.e., the two problems learn the same model. The scalings are introduced to make the training dynamics of the linear model and the kernel model the same. Consider the kernel model and the linear model obtained after the same number of gradient descent steps. Then we have the following result.

Theorem 3.1.

If the regularization parameters are given by equation (11), then for any step of gradient descent (initialized at zero) and any test sample drawn from the same distribution as the training data we have

(15)
Proof.

The proof can be found in Appendix D. ∎

Remark 4.

Theorem 3.1 provides an insight into the training dynamics of kernel models in the proportional uniform regime. This could potentially have implications regarding the Kernel-SVM solution in this regime, following the work of Muthukumar et al. (2021).
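
A sketch of the gradient-descent comparison, under our own conventions: with zero initialization, gradient descent on the regularized square loss for the kernel model stays in the span of the training features and can be run in dual coordinates. Rather than reproducing the exact scaled linear model of (14), the same recursion is run with the kernel matrix replaced by the linear surrogate from the sketch at the end of Section 2; Theorem 3.1 says the two trajectories of test predictions agree in the high-dimensional limit.

```python
# Gradient descent in dual coordinates (our conventions):
#   beta_{t+1} = beta_t - eta * ((K + lam*I) @ beta_t - y),   f_t(x) = k(x) @ beta_t.
import numpy as np

def gd_dual(K, k_test, y, lam=1.0, eta=None, steps=200):
    """Run gradient descent in dual coordinates and record test predictions."""
    n = K.shape[0]
    eta = eta if eta is not None else 1.0 / (np.linalg.norm(K, 2) + lam)  # stable step size
    beta, preds = np.zeros(n), []
    for _ in range(steps):
        beta = beta - eta * ((K + lam * np.eye(n)) @ beta - y)
        preds.append(k_test @ beta)          # prediction at the test point after this step
    return np.array(preds)

# Usage with K, k, K_lin, k_lin, y from the sketch at the end of Section 2:
#   traj_kernel = gd_dual(K, k, y)
#   traj_linear = gd_dual(K_lin, k_lin, y)
#   np.max(np.abs(traj_kernel - traj_linear))   # small, shrinking as n, p grow
```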

4 Optimality of Linear Models

Our last result shows that in the proportional uniform large scale limit, if the true model has a Gaussian process prior with a kernel that satisfies Assumption A3, then linear models are in fact optimal, even though the true underlying relationship between the covariates and the responses can be highly nonlinear. See Appendix A.2 for a review of Gaussian process regression.

Assume that we are given training samples

(16)

where the unknown function is a zero-mean Gaussian process with a given covariance kernel. An example occurs in the so-called student-teacher set-up of (Gardner and Derrida, 1989; Aubin et al., 2019), where the unknown function is of the form

(17)

where the teacher is a neural network with unknown parameters. If the network has infinitely wide hidden layers and the unknown parameters are generated randomly with i.i.d. Gaussian coefficients with the appropriate scaling, the unknown function in (17) becomes asymptotically a Gaussian process (Neal, 2012; Lee et al., 2017; Matthews et al., 2018; Daniely et al., 2016).

Now assume that we are given a test sample from the same model and we are interested in estimating the corresponding response. It is well known (see Appendix A.2) that the Bayes optimal estimator with respect to squared error in this case is

(18)

and its Bayes risk is

(19)
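
Assuming the standard Gaussian-process posterior-mean form for (18) (our reading; see Appendix A.2), a minimal sketch of the Bayes-optimal estimator and its pointwise risk is:

```python
import numpy as np

def gp_posterior(K, k_star, k_starstar, y, sig2):
    """Posterior mean of the test function value and its posterior variance."""
    n = len(y)
    C = K + sig2 * np.eye(n)                 # covariance of the noisy observations
    mean = k_star @ np.linalg.solve(C, y)    # the estimator in (18), up to conventions
    var = k_starstar - k_star @ np.linalg.solve(C, k_star)
    return mean, var                         # var (plus sig2 for noisy targets) plays the role of (19)
```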

Next consider a linear model fitted by solving the regularized least squares problem in (5). Define the square error risk of this model as

(20)

where the expectation is with respect to the randomness in the data as well as the noise.

Theorem 4.1.

Under Assumptions A1-A3 and the Gaussian data model (16), if the linear model in equation (5) is trained with regularization parameters

(21)

where the constants are defined in Proposition 5.1, then the linear model achieves the Bayes optimal risk for any test sample drawn from the same distribution as the training data,

(22)
Proof.

The result of Theorem 2.1 shows that with the choice of regularization parameters in (21), the linear model and the kernel model in (18) are equivalent. The result then immediately follows, as the kernel model is Bayes optimal for squared error. ∎

It is important to contrast this result with (Goldt et al., 2020a) and (Aubin et al., 2019). These works consider exactly the case where the true function is of the form (17) with a neural network teacher with i.i.d. Gaussian parameters. However, in their analyses, the number of hidden units in both the true and trained networks is fixed while the input dimension and the number of samples grow with proportional scaling. With a fixed number of hidden units, the true function is not a Gaussian process, and the model class is not a simple kernel estimator; hence, our results do not apply. Interestingly, in this case, the results of (Aubin et al., 2019; Goldt et al., 2020a) show that nonlinear models can significantly outperform linear models. Hence, very wide neural networks can underperform networks with smaller numbers of hidden units. It is an open question which scalings of the number of hidden units, the number of samples, and the dimension yield degenerate results.

5 Sketch of Proofs

Here we provide the main ideas behind the proofs of our theorems. The details of the proof of Theorem 2.1 can be found in Appendix B.1. The proof of Theorem 3.1 can be found in Appendix D.

5.1 Degeneracy of empirical kernel matrices

Our first result extends Theorems 2.1 and 2.2 of El Karoui et al. (2010) and may be of independent interest to the reader.

Proposition 5.1.

If the kernel matrix has entries of the form (3) and Assumptions A1-A3 hold, then, in probability, the kernel matrix is approximated in operator norm by a matrix built from a constant (rank-one) term, a scaled Gram matrix of the design matrix whose rows are the samples, and a multiple of the identity, with coefficients defined in equation (10).

Proof.

See appendix B. ∎

El Karoui et al. (2010) present this result for inner-product kernels and for shift-invariant kernels. Importantly, the NTK is of neither of these two forms, but rather of the more general form in equation (3), whereby Proposition 5.1 provides new insight into the behaviour of empirical kernel matrices of the NTK for a large class of architectures.
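
A small numerical sanity check of the linearization, specialized (as in El Karoui et al. (2010)) to an inner-product kernel, with constants computed by finite differences as stand-ins for (10):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 800, 400
Sigma = np.diag(0.5 + rng.random(p))
X = rng.standard_normal((n, p)) @ np.sqrt(Sigma)

g = lambda c: np.exp(c)                       # an inner-product kernel of the form (3)
K = g(X @ X.T / p)

tau = np.trace(Sigma) / p
eps = 1e-4
g0, g1 = g(0.0), (g(eps) - g(-eps)) / (2 * eps)
g2 = (g(eps) - 2 * g0 + g(-eps)) / eps ** 2
c0 = g0 + g2 * np.trace(Sigma @ Sigma) / (2 * p ** 2)   # rank-one coefficient
c1 = g1                                                 # coefficient of X X^T / p
c2 = g(tau) - g0 - g1 * tau                             # diagonal shift

K_lin = c0 * np.ones((n, n)) + c1 * (X @ X.T) / p + c2 * np.eye(n)
print(np.linalg.norm(K - K_lin, 2))           # operator norm gap, small relative to the O(1) entries
```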

5.2 Equivalence of Kernel and Linear Models

Proposition 5.1 is the main tool we use to show that kernel methods and linear methods are equivalent in the proportional, uniform large scale limit.

The model learned by the kernel ridge regression in equation (4) can be written as

(23)

Next, since the optimization in (5) is a quadratic problem it has a closed form solution.

Proposition 5.2.

The linear estimator in Model 2 has the following form

(24)
Proof.

Solving the optimization problem in (5) gives a closed-form expression for the estimator, and the form in (24) follows from a special case of the Woodbury matrix identity (see Lemma C.1 in Appendix C). ∎
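
A quick numerical check of the type of identity invoked here (a push-through identity commonly derived as a special case of the Woodbury formula, written in our own notation):

```python
# (X^T X + lam*I_p)^{-1} X^T  ==  X^T (X X^T + lam*I_n)^{-1}
import numpy as np

rng = np.random.default_rng(8)
n, p, lam = 60, 40, 0.3
X = rng.standard_normal((n, p))
lhs = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
rhs = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
print(np.max(np.abs(lhs - rhs)))            # ~ 1e-14
```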

Next, we can use Proposition 5.1 to show that each of the terms in Equation (24) converges in probability to the corresponding term in Equation (23). This proves Theorem 2.1.

5.3 Equivalence Throughout Training

The proof of the equivalence of the kernel model and the scaled linear model after any number of gradient descent steps is very similar. The updates for the parameters of the kernel model, as well as those of the scaled linear model, are linear dynamical systems (in their respective parameters). By unrolling the gradient update through time, we can write the parameters after a given step as a summation over the past time steps. Using this, we can simplify the sums and write the output of the kernel model at that step, evaluated at a test sample, as

(25)

Similarly, for the linear model at the same step we get

(26)

Here, we can use Proposition 5.1 again to show that all the terms in the linear model converge in probability to the corresponding terms in the kernel model, thus proving that the two models are equal in probability, for any test sample drawn from the same distribution as the training data, throughout gradient descent.
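
A small check of the unrolling argument, in our notation: with zero initialization, the dual iterates of gradient descent on the kernel objective admit a closed form obtained by summing the geometric series over past steps.

```python
#   beta_T = (K + lam*I)^{-1} (I - (I - eta*(K + lam*I))^T) y
import numpy as np

rng = np.random.default_rng(4)
n = 50
A = rng.standard_normal((n, n))
K = A @ A.T / n                                  # a PSD stand-in kernel matrix
y = rng.standard_normal(n)
lam, T = 0.5, 100
M = K + lam * np.eye(n)
eta = 1.0 / np.linalg.norm(M, 2)

beta = np.zeros(n)
for _ in range(T):
    beta = beta - eta * (M @ beta - y)           # the gradient-descent recursion

closed = np.linalg.solve(M, (np.eye(n) - np.linalg.matrix_power(np.eye(n) - eta * M, T)) @ y)
print(np.max(np.abs(beta - closed)))             # ~ 1e-12: recursion matches the closed form
```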

6 Numerical Experiments

6.1 Linearity of Kernel Models for NTK

We demonstrate via numerical simulations the predictions made by our results in Theorems 2.1, 3.1, and 4.1.

As shown in Lee et al. (2019) and Liu et al. (2020a), wide fully connected neural networks can be approximated by their first-order Taylor expansion throughout training, and this approximation becomes exact in the limit that all the hidden dimensions of the neural network go to infinity. Therefore, training a network by minimizing

(27)

is equivalent (in the limit of a wide network) to performing kernel ridge regression in an RKHS whose feature map is the network's parameter gradient at initialization and whose kernel is the neural tangent kernel (see Appendix A.3 for a brief review of the NTK). Instead of removing the initial network, one can use a symmetric initialization scheme which makes the output of the neural network zero at initialization without changing its NTK (Chizat et al., 2018; Zhang et al., 2020; Hu et al., 2020).

A key property of the NTK of fully connected neural networks is that it satisfies Assumption A3, since it has the form in equation (3). Hence, if the input data satisfies the assumptions above, in the proportional asymptotics regime the NTK should behave like a linear kernel. The first- and second-order derivatives of the kernel function can be obtained by backpropagation through the recursive equations in (57) and (58).
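
For reference, the NTK of a two-layer (one-hidden-layer) ReLU network has a well-known closed form in terms of arc-cosine kernel functions; the sketch below uses one common normalization (the recursions (57)-(58) cover general depths and may differ by constant factors). Note that the kernel depends on the inputs only through their norms and inner product, which is exactly the form (3) required by Assumption A3.

```python
import numpy as np

def kappa0(u):
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi

def ntk_two_layer_relu(x, z):
    """NTK of a one-hidden-layer ReLU network (one common normalization)."""
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    u = np.clip(x @ z / (nx * nz), -1.0, 1.0)
    # output-layer contribution + input-layer contribution; both depend on x, z
    # only through |x|, |z|, and <x, z>.
    return nx * nz * kappa1(u) + (x @ z) * kappa0(u)
```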

Figure 1: Comparison of test error for three different models: (i) a neural network with a single hidden layer, (ii) the NTK of a two-layer fully-connected network, and (iii) the equivalent linear model prescribed by Theorem 2.1. The errors of the kernel model and the equivalent linear model match perfectly, and the neural network follows them very closely. The oracle model is the true model and represents the noise floor.

Figure 1 illustrates a setting where kernel models and neural networks in the kernel regime perform no better than appropriately trained linear models. This verifies the main result of this paper, Theorem 2.1.

We generate training data as

(28)

where the responses are obtained by adding noise to the output of a teacher, which is a fully-connected ReLU network with two hidden layers of 100 hidden units each.

We train 3 models:

  (i) A fully connected ReLU neural network with a single hidden layer of 20,000 hidden units, fitted to this data using stochastic gradient descent (SGD) with momentum. The initial network is removed from the output as in (27).

  (ii) A kernel model as in equation (4) corresponding to the NTK of the model in (i) above. The kernel is evaluated using the recursive formulae given in (57) and (58).

  (iii) A linear model as in equation (5) trained using the regularization parameters prescribed by Theorem 2.1.

We compare the test error for these models, measured as the normalized error over the test samples:

(29)

We compare the test error for different numbers of training samples, averaged over 3 runs.

We can see that the NTK model and the equivalent linear model match almost perfectly for all sample sizes, and the neural network model follows them very closely, matching them for small numbers of samples.

There are two main sources of mismatch between the neural network model and the NTK model: first, the width of the network, while large (20,000), is still finite; and second, the training of the neural network is stopped after 150 epochs, i.e. the trained neural network differs from the fully optimized one. Finally, the oracle model’s performance is the noise floor.

6.2 Equivalence of Kernel and Linear Models Throughout Training

Next, we verify Theorem 3.1 by showing that the test errors of the scaled linear model and the neural network match at every step of gradient descent. The setting is the same as in Section 6.1. We generate data using a random neural network with two hidden layers, as in Section 6.1, and train a neural network with a single hidden layer of 10,000 units as well as the scaled linear model using gradient descent. We plot the error of each model on the test data throughout training. We train each model for 100 epochs. Figure 2 shows that the two models have the same test error over the course of training.

Figure 2: Equivalence of test error of scaled linear model and the neural network vs. epochs of gradient descent.

6.3 Optimality of Linear Models

A polynomial kernel of a given degree has the following form

(30)

where the offset is a constant that adjusts the relative influence of the higher-degree and lower-degree terms. In this example, we sample test and training data from the following model

(31)

where the underlying function is a Gaussian process whose covariance kernel is a polynomial kernel. We fix the degree and offset of the polynomial kernel and the noise level, generate the samples, train the kernel model and the equivalent linear model, and estimate the normalized mean squared error of each estimator by averaging the normalized error over the test samples. The regularization parameter is chosen to make the kernel estimator Bayes optimal (with respect to squared error). The results are averaged over 5 runs.
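
A sketch of this experiment under our own parameter choices (dimensions, degree, offset, and noise level are ours; the normalized error follows our reading of (29)):

```python
import numpy as np

rng = np.random.default_rng(5)
n_train, n_test, p, sig2 = 400, 200, 200, 0.1
X = rng.standard_normal((n_train + n_test, p))

def poly_kernel(A, B, degree=3, c=1.0):
    return (c + A @ B.T / p) ** degree

# Draw responses from a Gaussian process with the polynomial kernel, plus noise.
K_all = poly_kernel(X, X)
f_all = np.linalg.cholesky(K_all + 1e-8 * np.eye(len(X))) @ rng.standard_normal(len(X))
y_all = f_all + np.sqrt(sig2) * rng.standard_normal(len(X))

Xtr, Xte = X[:n_train], X[n_train:]
ytr, yte = y_all[:n_train], y_all[n_train:]
Ktr, Kte_tr = poly_kernel(Xtr, Xtr), poly_kernel(Xte, Xtr)

alpha = np.linalg.solve(Ktr + sig2 * np.eye(n_train), ytr)   # regularization = noise variance
yhat = Kte_tr @ alpha
print(np.sum((yte - yhat) ** 2) / np.sum(yte ** 2))           # normalized test error
```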

Figure 3: Normalized errors vs. number of training samples for a kernel model and the equivalent linear model for data generated from a Gaussian process. The curves for the kernel and linear fits match almost perfectly. The dashed line corresponds to the theoretical optimal error given in equation (18).

The results are shown in Figure 3, where normalized errors (defined in equation (29)) are plotted against the number of training samples. The dashed line corresponds to the optimal error curve obtained from equation (18). The generalization errors of the linear model and the kernel model match, which confirms Theorem 2.1, and, as Theorem 4.1 proves, both curves are very close to the optimal error curve. This figure verifies that the optimal estimator is indeed linear.

6.4 Counterexample: Beyond the Proportional Uniform Regime

Figure 4: If the assumptions A1-A3 are not satisfied, the kernel model and linear model are not equivalent.

Our results should not be misconstrued as implying the ineffectiveness of kernel methods or neural networks. The equivalence of kernel models and linear models holds in the proportional uniform data regime. However, kernel models and neural networks outperform linear models when we deviate from this regime, as demonstrated in Figure 4.

This observation is closer to real-world experiences of the machine learning community, which perhaps suggests that the assumptions A1-A3 are unrealistic for understanding high dimensional phenomena relating large datasets and high dimensional models.

We consider Gaussian process regression as in Section 6.3, but the input variables are generated from a mixture of two zero-mean Gaussians with low-rank covariances, which clearly violates Assumption A1. The mixture probabilities, the input dimension, and the rank of each component's covariance are fixed, and the covariance of each component is generated as

(32)

Under this model, the resulting covariance matrix of the data would be

(33)

which is rank deficient almost surely. In other words, the data only spans a low-dimensional subspace of the ambient space.
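
A sketch of the counterexample's input distribution, with our own choices for the dimensions, rank, and (equal) mixture weights:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 500, 200, 10                     # our choices: ambient dimension p, component rank r

U1, U2 = rng.standard_normal((p, r)), rng.standard_normal((p, r))
labels = rng.random(n) < 0.5               # equal mixture weights (our assumption)

# x = U_k z / sqrt(r) with z standard normal, so each component has the
# low-rank covariance U_k U_k^T / r (one natural reading of (32)).
Z = rng.standard_normal((n, r))
X = np.where(labels[:, None], Z @ U1.T, Z @ U2.T) / np.sqrt(r)

cov = 0.5 * (U1 @ U1.T + U2 @ U2.T) / r    # overall covariance, cf. (33)
print(np.linalg.matrix_rank(cov))          # at most 2r << p: Assumption A1 fails
```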

Figure 4 shows that the kernel model, which is the optimal estimator, has a generalization error very close to the expected optimal error, whereas the linear model performs worse. The linear approximation of the true kernel matrix is inaccurate when we deviate from the proportional uniform data regime.

7 Conclusions

This paper, of course, does not contest the power of neural networks or kernel models relative to linear models. In a tremendous range of practical applications, nonlinear models outperform linear models. The results should instead be interpreted as a limitation of Assumptions A1-A3 as a model for high-dimensional data. While this proportional high-dimensional regime has been incredibly successful in explaining the complex behavior of many other ML estimators, it provides degenerate results for kernel models and neural networks that operate in the kernel regime.

As mentioned above, the intuition is that when the data samples are generated by applying a positive definite covariance transform to vectors with i.i.d. components, the data uniformly covers the input space. When the number of samples scales only linearly with the dimension, it is impossible to learn models more complex than linear models.

This limitation suggests that more complex models for the generated data will be needed if the high-dimensional asymptotics of kernel methods are to be understood.

References

Appendix

Appendix A Preliminaries

In this appendix, we present a short overview of reproducing kernel Hilbert spaces, Gaussian process regression, and neural tangent kernels, which are used throughout the paper.

A.1 Kernel Regression

In kernel regression, the estimator is a function that belongs to a reproducing kernel Hilbert space (RKHS). A kernel is an inner product in a possibly infinite-dimensional space called the feature space, computed through a feature map. With this feature map, the functions in the RKHS are linear in the feature-space parameters but generally nonlinear in the input. In this work, we consider kernels of the form in equation (3), which includes inner product kernels as well as shift-invariant kernels. Many commonly used kernels, such as RBF kernels and polynomial kernels, as well as the neural tangent kernel, are of this form.

In kernel methods, the estimator is often learned via a regularized ERM problem

(34)

where the objective combines a loss function over the training data with the RKHS norm of the estimator. By writing the estimator as a parametric function of the feature map, this optimization over the function space can be written as an optimization over the parameter space as

(35)
(36)

Note that this optimization is often very high-dimensional, as the dimension of the feature space can be very large or even infinite. By the representer theorem (Schölkopf et al., 2001), the solution to the optimization problem in (34) has the form

(37)

By the reproducing property of the kernel, both the function values at the training points and the RKHS norm of such a function can be expressed through the data kernel matrix. The optimization problem in (34) can then be written in terms of the representer coefficients as

(38)

where each term involves a row of the kernel matrix. Observe that this optimization problem only depends on the kernel evaluated over the data points, and hence the optimization problem in (34) can be solved without ever working in the feature space. If we let the data matrix have the training samples as its rows and collect the observations into a vector, then for the special case of the square loss the optimization problem in (38) has a closed-form solution, which corresponds to the estimator

(39)

Throughout this paper, for two matrices of data points, we use the corresponding kernel-matrix notation to denote the matrix of pairwise kernel evaluations between their rows.

A.2 Gaussian Process Regression

A Gaussian process is a stochastic process in which, for every fixed finite set of points, the joint distribution of the function values is multivariate Gaussian. As with the multivariate Gaussian distribution, the distribution of a Gaussian process is completely determined by its first- and second-order statistics, known as the mean function and the covariance kernel, respectively. In terms of the mean function and the covariance kernel, for any finite set of points,

(40)

where the mean vector collects the values of the mean function and the covariance matrix collects the covariance kernel evaluated at all pairs of points. Next, assume that a priori we set the mean function to be zero everywhere. Then the problem of Gaussian process regression can be stated as follows: we are given training samples

(41)

where the unknown function is a zero-mean Gaussian process with a given covariance kernel. Given a test point, we are interested in the posterior distribution of its function value given the training samples. Defining the kernel quantities as in the previous section, we have

(42)

where the kernel matrix is evaluated at the training points. Therefore, conditioning on the training observations, the posterior of the test-point function value is Gaussian, with

(43)
(44)
(45)

The minimum mean squared error (MMSE) estimator is the estimator that minimizes the squared risk

(46)

where the minimization is over the class of all measurable functions of the observed data. For a given test input, the minimizer is obtained by minimizing the posterior risk

(47)

and the expectation is with respect to the randomness in the unknown function as well as the noise. The estimator that minimizes this risk is the mean of the posterior; that is, the posterior mean in (43) is the Bayes optimal estimator with respect to mean squared error, and its mean squared error is given by the posterior variance. Note that while this estimator is linear in the training outputs, it is nonlinear in the input data.

In this work, the problem of Gaussian process regression arises for systems that are in the Gaussian kernel regime. More specifically, assume that we have training and test data that are generated by a parametric model with random parameters and additive noise. Furthermore, assume that, conditioned on the inputs,

(48)

which is the vector of the function values on the training and test inputs, is jointly Gaussian and zero mean. Also, for pairs of training and test inputs, define the kernel function by

(49)

Then the problem of estimating the function value at the test input can be considered as a Gaussian process regression problem. An important instance of this kernel model is a wide neural network with parameters drawn from random Gaussian distributions and a linear last layer. In this case, one can show that, conditioned on the input, all the preactivation signals in the neural network, i.e. all the signals right before going through the nonlinearities, as well as the gradients with respect to the parameters, are Gaussian processes, as discussed below.

A.3 Neural Tangent Kernel

Consider a neural network function defined recursively as

(50)
(51)
(52)

where the nonlinearity is applied elementwise and the parameters are the collection of all weights and biases, which are all initialized with i.i.d. draws from the standard normal distribution. As noted in many works (Neal, 2012; Lee et al., 2017; Matthews et al., 2018; Daniely et al., 2016), conditioned on the input signals, with a Lipschitz nonlinearity, the entries of the preactivations converge in distribution to i.i.d. Gaussian processes in the limit of infinite width, with covariance defined recursively as

(53)
(54)

Therefore, if the true model is a random deep network plus noise, the optimal estimator would be as in (43) with the covariance in (54) used as the kernel.
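
For ReLU nonlinearities, the Gaussian expectation in the recursion (53)-(54) has a closed form via an arc-cosine function; the sketch below uses one common ("He") normalization without bias terms and may differ from the paper's recursion by constant factors.

```python
import numpy as np

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi

def nngp_relu(x, z, depth):
    """NNGP covariance of a depth-`depth` ReLU network at inputs x, z."""
    d = len(x)
    cxx, czz, cxz = x @ x / d, z @ z / d, x @ z / d     # first-layer covariance
    for _ in range(depth - 1):
        u = cxz / np.sqrt(cxx * czz)
        cxz = np.sqrt(cxx * czz) * kappa1(u)
        # kappa1(1) = 1, so the diagonal terms cxx, czz are unchanged by the recursion
    return cxz
```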

The main result of Jacot et al. (2018) considers the problem of fitting a neural network to training data using gradient descent. It is shown that in the limit of wide networks (i.e., all hidden-layer widths tending to infinity), training a neural network with gradient descent is equivalent to fitting a kernel regression with respect to a specific kernel called the neural tangent kernel (NTK).

For a neural network with a scalar output, the neural tangent kernel (NTK) is defined as

(55)

In the limit of wide fully connected neural networks, Jacot et al. (2018) show that this kernel converges in probability to a kernel that is fixed throughout training. Similar to (54), the neural tangent kernel can be evaluated via a set of recursive equations, the details of which can be found in Jacot et al. (2018). Similar results for architectures other than fully connected networks have since been proven (Arora et al., 2019; Yang, 2019a, b; Alemohammad et al., 2020).

For a fully connected network with ReLU nonlinearities, the NTK has a closed recursive form, given by Bietti and Mairal (2019). Consider the recursion

(56)

where the nonlinearity is the ReLU function and all the weights and biases are initialized with i.i.d. standard Gaussian entries. The corresponding NTK then admits a closed recursive expression in terms of arc-cosine kernel functions.