Kernel ridge regression (KRR) belongs to the family of kernel-based machine learning methods, which deploy a set of nonlinear functions to describe the real output of interest [1, 2]. More precisely, the idea is to map the data into a high-dimensional space, known as the feature space, which can even be of infinite dimension, so that the data admit a linear representation with respect to the output. A linear regression problem is then solved in this feature space, with over-fitting controlled by a regularization term. In fact, the most important advantage of kernel methods is the kernel trick, or kernel substitution, which makes it possible to work directly with kernels and to avoid the explicit use of feature vectors.
Due to its popularity, a rich body of research has analyzed the performance of KRR. A randomized version of KRR has been studied with performance guarantees in terms of concentration bounds, and the random features approximation has been analyzed for least squares kernel regression. More relevant results derive upper bounds on the prediction risk in terms of the empirical quadratic risk for general regression models. Similarly, for KRR models, upper and lower bounds on the expected risk have been provided and later generalized to general regularization operators. Hence, most of the results on the performance of KRR and related regression techniques take the form of upper or lower bounds on the prediction risk. In this work, we study the problem from an asymptotic analysis perspective. As we will demonstrate in the course of the paper, such an analysis yields novel results that accurately predict the risk metrics. Our focus is on a variation of KRR called centered kernel ridge regression (CKRR), which is built upon the same principles as KRR with the additional requirement of minimizing the bias in the learning problem. This variation has been motivated by the work of Cortes et al. and by [9, 10], where the benefits of centering kernels have been highlighted. The obtained regression technique can be seen as KRR with centered kernels. Moreover, in the high dimensional setting with certain normalizations, we show that kernel matrices behave as a rank one matrix; centering thus neutralizes this non-informative component and highlights higher order components that retain useful information about the data.
To understand the behavior of CKRR, we conduct a theoretical analysis in the large dimensional regime where both the data dimension and the training size tend to infinity with a fixed ratio. As far as inner-product kernels are concerned, and under mild assumptions on the data statistics, we show using fundamental results from random matrix theory that both the empirical and prediction risks approach deterministic quantities that relate these performance measures, in closed form, to the data statistics and dimensions. This finding reveals how the model performance behaves as a function of the problem's parameters and thus how to tune the design parameters to minimize the prediction risk. Moreover, as an outcome of this result, we show that it is possible to jointly optimize the regularization parameter and the kernel function so as to achieve the minimum possible prediction risk. In other words, the minimum prediction risk is always attainable for all kernels with a proper choice of the regularization parameter. This implies that all kernels behave similarly to the linear kernel. We regard this fact as a consequence of the curse of dimensionality, which causes CKRR to be asymptotically equivalent to centered linear ridge regression. As an additional contribution of the present work, we build a consistent estimator of the prediction risk based on the training samples, thereby paving the way towards an optimal setting of the regularization parameter.
The rest of the paper is structured as follows. In section II, we give a brief background on kernel ridge regression and introduce its centered variation. In section III, we provide the main results of the paper related to the asymptotic analysis of CKRR as well as the construction of a consistent estimator of the prediction risk. Then, we provide some numerical examples in section IV. We finally make some concluding remarks in section V.
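Notations: $\mathbb{E}[\cdot]$ and $\mathrm{var}[\cdot]$ stand for the expectation and the variance of a random variable, while $\xrightarrow{a.s.}$ and $\xrightarrow{prob.}$ respectively stand for almost sure convergence and convergence in probability. $\|\cdot\|$ denotes the operator norm of a matrix and the $\ell_2$ norm for vectors, and $\operatorname{tr}[\cdot]$ stands for the trace operator. The notation $u = O(v)$ means that there exists a bounded $M$ such that $u \le M v$. We say that $f$ is $C^k$ if the $k$th derivative of $f$ exists and is continuous.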
II Background on kernel ridge regression
Let be a set of observations in , where denotes the input space and the output space. Our aim is to predict the output of new input points with a reasonable accuracy. Assume that the output is generated using a function , then the problem can be cast as a function approximation problem where the goal is to find an estimate of denoted by such that is close to the real output . In this context, the kernel learning problem is formulated as follows
where is a reproducing kernel Hilbert space (RKHS), is a loss function and is a regularization parameter that controls overfitting. Denoting by a feature map that maps the data points to the feature space , we define such that for all , where is known as the positive definite kernel corresponding to the feature map . With these definitions, the representer theorem [13, 14] shows that the minimizer of the problem in (1) writes as . Thus, we can reformulate (1) as follows
When is the squared loss, the optimization problem in (2) can be reformulated as
where . This yields the following solution , where . Then, the output estimate of any data point is given by 
where is the information vector and with entries , . This is commonly known as the kernel trick, which greatly simplifies the problem by reducing it to a finite-dimensional one whose size equals the number of training samples. Throughout this paper, we consider the following data model
where generates the actual output of the data and are i.i.d. standard normal random variables, with assumed to be known. We consider both the empirical (training) and the prediction (testing) risks, respectively defined as
where is the data input distribution, is taken independent of the training data and . The above two equations respectively measure the goodness of fit relative to the training data and to new unseen data all in terms of the mean squared error (MSE).
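To make the estimator and the two risk measures above concrete, the following minimal NumPy sketch fits KRR in its dual form and evaluates both risks as mean squared errors. It is an illustration under stated assumptions rather than the exact setup of this paper: the squared loss is taken un-normalized, so that the dual coefficients solve a linear system with the kernel matrix shifted by the regularization parameter; the kernel is an exponential inner-product kernel scaled by the data dimension; the prediction risk is approximated on fresh inputs against noiseless outputs; and the generating function, the sizes and all function names are hypothetical.

```python
import numpy as np

def inner_product_kernel(A, B, f=np.exp):
    """Kernel matrix with entries f(a_i^T b_j / p), p being the data dimension."""
    p = A.shape[1]
    return f(A @ B.T / p)

def krr_fit(K, y, lam):
    """Dual KRR coefficients: solve (K + lam * I) alpha = y (assumed normalization)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(K_new, alpha):
    """Predictions; K_new has entries k(new_point_i, training_point_j)."""
    return K_new @ alpha

rng = np.random.default_rng(0)
n, p, n_test, sigma = 200, 100, 1000, 0.5

def f_true(X):
    # Hypothetical data generating function, used only for this toy example.
    return np.sin(X.sum(axis=1) / np.sqrt(X.shape[1]))

X = rng.standard_normal((n, p))
y = f_true(X) + sigma * rng.standard_normal(n)   # noisy training outputs
X_new = rng.standard_normal((n_test, p))         # fresh inputs, independent of training

K = inner_product_kernel(X, X)
alpha = krr_fit(K, y, lam=1.0)

# Empirical (training) risk: MSE on the noisy training outputs.
risk_train = np.mean((krr_predict(K, alpha) - y) ** 2)
# Monte Carlo proxy of the prediction (testing) risk on unseen inputs.
risk_test = np.mean((krr_predict(inner_product_kernel(X_new, X), alpha) - f_true(X_new)) ** 2)
print(f"training risk: {risk_train:.4f}, testing risk: {risk_test:.4f}")
```

Only an n-by-n linear system is solved, regardless of the dimension of the feature space, which is the practical content of the kernel trick.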
II-A Centered kernel ridge regression
The concept of centered kernels dates back to the work of Cortes et al. on learning kernels based on the notion of centered alignment. As we will show later, this notion of centering arises naturally when we account for the bias in the learning problem (see also the lecture notes by Jaakkola). More specifically, we modify the optimization problem in (2) to account for the bias as follows
where clearly we do not penalize the offset (or the bias) in the regularization term. With being the squared loss, the optimal offset is immediately obtained in closed form. Substituting it in (8), we solve the centered optimization problem given by
where is the centered kernel matrix as defined in [8, Lemma 1] and is obtained using the Woodbury identity. With some basic manipulations, the centered kernel ridge regression estimate of the output of data point is given by
Therefore, the feature map corresponding to as well as the information vector can be respectively obtained as follows
where the normalization in (12) is convenient in the large dimensional regime, as we will show later (a similar normalization has been used in the analysis of LS-SVMs). (Footnote: This normalization is equivalent to rescaling all data points by a common factor. This type of normalization has also been considered in prior work following the heuristic of Jaakkola.) In the following, we conduct a large dimensional analysis of the performance of CKRR with the aim of gaining useful insights into its design. In particular, we will focus on studying the empirical and the prediction risks of CKRR, which we define as
The novelty of our analysis with respect to previous studies lies in that
It provides a mathematical connection between the performance and the problem’s dimensions and statistics resulting in a deeper understanding of centered kernel ridge regression in the large regime.
It brings insights on how to choose the kernel function and the regularization parameter in order to guarantee a good generalization performance for unknown data.
As far as the second point is concerned, we show later that both the kernel function and the regularization parameter can be optimized jointly as a consequence of the mathematical result connecting the prediction risk with these design parameters. Our analysis does not assume a specific choice of the inner-product kernel, and is valid for the following popular ones.
Linear kernels: .
Polynomial kernels: .
Sigmoid kernels: .
Exponential kernels: .
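The centered estimator described in this section can be sketched numerically as follows. This is a minimal illustration, assuming the usual derivation of ridge regression with an unpenalized offset: the kernel matrix is centered with the projector I - 11^T/n, the dual coefficients solve a linear system with the centered kernel matrix shifted by the regularization parameter, and the offset reduces to the sample mean of the outputs. The specific kernel parameterizations in `KERNELS`, the toy data and all function names are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

# Hypothetical parameterizations of inner-product kernels k(x, x') = f(x^T x' / p).
KERNELS = {
    "linear": lambda t: t,
    "polynomial": lambda t: (1.0 + t) ** 2,
    "sigmoid": lambda t: np.tanh(1.0 + t),
    "exponential": lambda t: np.exp(t),
}

def gram(A, B, f):
    """Matrix with entries f(a_i^T b_j / p), p being the data dimension."""
    return f(A @ B.T / A.shape[1])

def ckrr_fit(X, y, lam, f):
    n = X.shape[0]
    K = gram(X, X, f)
    P = np.eye(n) - np.ones((n, n)) / n            # centering projector
    Kc = P @ K @ P                                  # centered kernel matrix
    alpha = np.linalg.solve(Kc + lam * np.eye(n), P @ y)
    return {"X": X, "K": K, "P": P, "alpha": alpha, "y_mean": y.mean(), "f": f}

def ckrr_predict(model, X_new):
    kappa = gram(X_new, model["X"], model["f"])     # shape (m, n)
    # Centered information vectors: P (kappa(x) - K 1 / n), one row per new point.
    kappa_c = (kappa - model["K"].mean(axis=1)) @ model["P"]
    return kappa_c @ model["alpha"] + model["y_mean"]

rng = np.random.default_rng(0)
n, p = 150, 75
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.1 * rng.standard_normal(n)          # hypothetical toy outputs
X_new = rng.standard_normal((500, p))

for name, f in KERNELS.items():
    model = ckrr_fit(X, y, lam=0.5, f=f)
    risk = np.mean((ckrr_predict(model, X_new) - X_new[:, 0]) ** 2)
    print(f"{name:12s} test risk: {risk:.3f}")
```

A useful sanity check of the centering step is that adding a constant to the kernel function leaves the predictions unchanged, since the constant only contributes a rank-one term proportional to 11^T that the projector removes.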
III Main results
III-A Technical assumptions
In this section, we present our theoretical results on the prediction risk of CKRR, starting with the assumptions on the data growth rate, the kernel function, and the true function. Without loss of generality, we assume that the data samples are independent such that , with positive definite covariance matrix . Throughout the analysis, we consider the large dimensional regime in which both and grow simultaneously large with the following growth rate assumptions.
Assumption 1 (Growth rate).
As both the data dimension and the training size grow large, we assume the following
Data scaling: .
Covariance scaling: .
The above assumptions are standard and make it possible to exploit the rich body of results in random matrix theory. Moreover, allowing and to grow large at the same rate is of practical interest when dealing with modern data sets that are both large and high dimensional. The covariance scaling assumption is technically convenient since it allows the use of important theoretical results on the behavior of large kernel matrices [11, 12]. Under Assumption 1, we have the following implications.
where due to the covariance scaling in Assumption 1. This means that in the limit when , the kernel matrix as defined earlier has all its entries converging to a deterministic limit. Applying a Taylor expansion on the entries of , and under some assumption on the kernel function , it has been shown in [11, Theorem 2.1] that
where the convergence is in operator norm and exhibits nice properties and can be expressed using standard random matrix models. The explicit expression of as well as its properties will be thoroughly investigated in Appendix A. We subsequently make additional assumptions to control the kernel function and the data generating function .
Assumption 2 (Kernel function).
As in [11, Theorem 2.1], we shall assume that is in a neighborhood of and in a neighborhood of 0. Moreover, we assume that for any independent observations and drawn from and ,
where is the third derivative of .
Assumption 3 (Data generating function).
We assume that is and polynomially bounded together with its derivatives. We shall further assume that the moments of and its gradient are finite. More explicitly, we need to have:
As we will show later, the above assumptions are needed to guarantee a bounded asymptotic risk and to carry out the analysis. Under the setting of Assumptions 1, 2 and 3, we aim to study the performance of CKRR by asymptotically evaluating the performance metrics defined in (6). Inspired by fundamental results on large kernel matrices, including in the context of spectral clustering, and following the observations made in (13) and (14), it is always possible to linearize the kernel matrix around the matrix , which avoids dealing with the original intractable expression of . Note that the first component of the approximation given by will be neutralized by the projection matrix in the context of CKRR, which means that the behavior of CKRR will be essentially governed by the higher order approximations of . Consequently, one can resort to those approximations to obtain an explicit expression of the asymptotic risk in the large dimensional regime. This expression reveals the mathematical connection between the regression risk and the data's statistics and dimensions.
III-B Limiting risk
With the above assumptions at hand, we are now in a position to state the main results of the paper related to the derivation of the asymptotic risk of CKRR. Before doing so, we shall introduce some useful quantities.
Also, for all at macroscopic distance from the eigenvalues of , we define the Stieltjes transform of , also known as the Stieltjes transform of the Marčenko-Pastur law, as the unique solution to the following fixed-point equation
where in (17) is bounded as provided that Assumption 1 is satisfied. For ease of notation, we shall use to denote for all appropriate . The first main result of the paper is summarized in the following theorem, the proof of which is deferred to Appendix A.
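The fixed-point characterization above lends itself to a simple numerical evaluation. The sketch below is a minimal illustration, assuming the standard form of the fixed-point equation associated with the Marčenko-Pastur law, m = 1 / (-z + (1/n) tr[Σ (I + m Σ)^{-1}]), evaluated at a real negative argument z. The exact normalization used in (17) may differ, and the function name and its arguments are hypothetical. For an identity covariance the equation reduces to a quadratic, which gives a closed-form value to compare against.

```python
import numpy as np

def mp_stieltjes(z, sigma_eigs, n, tol=1e-12, max_iter=10_000):
    """Solve m = 1 / (-z + (1/n) * sum_i s_i / (1 + m * s_i)) for real z < 0.

    sigma_eigs holds the eigenvalues s_i of the covariance matrix and n is the
    number of samples. This form of the equation is an assumption of the sketch.
    """
    if z >= 0:
        raise ValueError("this sketch only handles real z < 0")
    s = np.asarray(sigma_eigs, dtype=float)
    m = 1.0 / (-z)   # upper bound on the solution; the iterates decrease monotonically
    for _ in range(max_iter):
        m_new = 1.0 / (-z + np.sum(s / (1.0 + m * s)) / n)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    raise RuntimeError("fixed-point iteration did not converge")

# Identity covariance with p = 50 and n = 100, i.e. ratio c = p / n = 0.5.
p, n, lam = 50, 100, 1.0
m_iter = mp_stieltjes(-lam, np.ones(p), n)

# Closed form for identity covariance: positive root of lam*m^2 + (lam + c - 1)*m - 1 = 0.
c = p / n
m_closed = (-(lam + c - 1.0) + np.sqrt((lam + c - 1.0) ** 2 + 4.0 * lam)) / (2.0 * lam)
print(m_iter, m_closed)
```

Starting from the upper bound 1/(-z) makes the iteration monotone, because the right-hand side of the equation is increasing in m and bounded by that value.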
Theorem 1 (Limiting risk).
Under Assumptions 1, 2 and 3 and by taking for kernel functions satisfying and at macroscopic distance from the eigenvalues of , both the empirical and the prediction risks converge in probability to nontrivial deterministic limits respectively given by (Footnote: The case of is asymptotically equivalent to taking the sample mean as an estimate of , which is of neither practical nor theoretical interest.)
where the expressions of and are given in the top of the next page.
where can be explicitly derived as in 
It follows from Theorem 1 that the limiting prediction risk can be expressed in terms of the limiting empirical risk in the following fashion.
Lemma 1 (A consistent estimator of the prediction risk).
Since the aim of any learning system is to design a model that achieves minimal prediction risk, the relation described in Lemma 1 by (23) is particularly valuable: it makes it possible to estimate the prediction risk from the empirical risk and hence to optimize the prediction risk accordingly.
One important observation from the expression of the limiting prediction risk in (21) is that the information on the kernel (given by and ) as well as the information on are both encapsulated in with . This means that one should optimize to have minimal prediction risk and thus jointly choose the kernel and the regularization parameter . Moreover, it implies that the choice of the kernel (as long as ) is asymptotically irrelevant, since a bad choice of the kernel can be compensated by a good choice of and vice versa. This essentially implies that a linear kernel asymptotically achieves the same optimal performance as any other type of kernel. (Footnote: This does not mean that all kernels will have the same performance for a given regularization parameter, but rather that they will achieve the same minimum prediction risk.)
III-C A consistent estimator of the prediction risk
Although the estimator provided in Lemma 1 makes it possible to estimate the prediction risk from the empirical risk, it has the drawback of being sensitive to small values of . In the following theorem, we provide a consistent estimator of the prediction risk that is constructed from the training data and is less sensitive to small values of .
Theorem 2 (A consistent estimator of the prediction risk).
Theorem 2 provides a generic way to estimate the prediction risk from the pairs of training examples . This allows using the closed form expressions in (24) and (25) with the same set of arguments in Remark 2 to jointly estimate the optimal kernel and the optimal regularization parameter . (Footnote: The expression in (25) is useful because, unlike the one in (24), it does not involve any matrix inversion.)
III-D Parameters optimization
We briefly discuss how to jointly optimize the kernel function and the regularization parameter . As mentioned earlier, we exploit the special structure in the expression of the consistent estimate where both parameters (the kernel function and ) are summarized in . We focus on the case where due to the tractability of the expression of in (25). By simple calculations, we can show that is minimized when satisfies the equation
where , which admits the following closed-form solution
Then, look up such that . Finally, choose and such that . In the general case, it is difficult to get a closed form expression in terms of or ; however, it is possible to numerically optimize the expression of with respect to . This can be done using simple one dimensional optimization techniques available in most numerical software packages.
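As an illustration of such a one dimensional search, the following sketch implements a golden-section minimizer in plain Python. The objective `standin_risk` is a hypothetical unimodal stand-in with a bias-like and a variance-like term; it is not the estimator of Theorem 2, which would take its place in practice. The search interval and the tolerance are likewise assumptions of this example.

```python
import math

def golden_section_minimize(func, lo, hi, tol=1e-8):
    """Minimize a unimodal scalar function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return 0.5 * (a + b)

def standin_risk(t):
    # Hypothetical risk curve: squared-bias-like term plus variance-like term.
    u = 1.0 / (1.0 + t)
    return (1.0 - u) ** 2 + 0.25 * u ** 2

t_star = golden_section_minimize(standin_risk, 0.0, 10.0)
print(t_star, standin_risk(t_star))
```

Golden-section search only requires function evaluations and a bracketing interval, which suits an objective that is available in closed form but whose derivative is cumbersome.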
IV-A Synthetic data
To validate our theoretical findings, we consider both Gaussian and Bernoulli data. As shown in Figure 2, both data distributions exhibit the same behavior for all the settings with different kernel functions. More importantly, even though the derived formulas heavily rely on the Gaussian assumption, we observe good agreement with the theoretical limits when the data is Bernoulli distributed. This can be understood as part of the universality property often encountered in many high dimensional settings. Therefore, we conjecture that the obtained results are valid for any data distribution following the model where satisfies Assumption 1 and the entries of are i.i.d. with zero mean, unit variance and bounded moments. (Footnote: We could not provide experiments for more data distributions due to space limitations.) For more clarity, we refer the reader to Figure 1 as a representative of Figure 2 when the data is Gaussian with and . As shown in Figure 1, the proposed consistent estimators are able to track the real behavior of the prediction risk for all types of kernels under consideration. It is worth mentioning, however, that the estimator in Lemma 1 exhibits some instability for small values of due to the inversion of in (23). Therefore, it is advisable to use the estimator given by Theorem 2. It is also clear from Figure 1 that all the considered kernels achieve the same minimum prediction risk but with different optimal regularizations . This is not the case for the empirical risk, as shown in Figure 1 and (20), where the information on the kernel and the regularization parameter are decoupled. Hence, in contrast to the prediction risk, the regularization parameter and the kernel cannot be jointly optimized to minimize the empirical risk.
IV-B Real data
As a further experiment, we validate the accuracy of our results on a real data set. To this end, we use the Communities and Crime Data Set, which has 123 samples and 122 features. For the experiment in Figure 3, we divide the data set to have training samples () and the remaining ones for testing (). The risks in Figure 3 are obtained by averaging the prediction risk (computed using ) over random permutations of the data. Although the data set is far from being Gaussian, we notice that the proposed prediction risk estimators are able to track the real behavior of the prediction risk for all types of considered kernels. We can also validate the previous insight from Theorem 1, namely that all kernels achieve almost the same minimum prediction risk.
V Concluding Remarks
We conducted a large dimensional analysis of centered kernel ridge regression, which is a modified version of kernel ridge regression that accounts for the bias in the regression formulation. By allowing both the data dimension and the training size to grow infinitely large in a fixed proportion and by relying on fundamental tools from random matrix theory, we showed that both the empirical and the prediction risks converge in probability to a deterministic quantity that mathematically connects these performance metrics to the data dimensions and statistics. A fundamental insight taken from the analysis is that asymptotically the choice of the kernel is irrelevant to the learning problem which asserts that a large class of kernels will achieve the same best performance in terms of prediction risk as a linear kernel. Finally, based on the asymptotic analysis, we built a consistent estimator of the prediction risk making it possible to estimate the optimal regularization parameter that achieves the minimum prediction risk.
Proof of Theorem 1
Here, we provide the details of the derivation for the prediction risk. The analysis of the empirical risk follows in a very similar way and is thus omitted. Before delving into the proof of Theorem 1, we shall introduce some fundamental results on the asymptotic behavior of inner-product kernel matrices established by El-Karoui [11, Theorem 2.1].
Theorem (Asymptotic behavior of inner product kernel random matrices).
Under the assumptions of [11, Theorem 2.1], the kernel matrix can be approximated by in the sense that almost surely in operator norm, where
where . A similar result can be found in  where the accuracy of has been assessed as where denotes a matrix with spectral norm converging in probability to zero with a rate .
Note that using the Woodbury identity, it is easy to show the following useful relations
The above theorem has the following consequence
where (30) is obtained by a simple application of the Sherman-Morrison lemma (inversion lemma), along with the use of the resolvent identity , which holds for any square invertible matrices and . The proof of the above theorem follows from a Taylor expansion of the elements of in the vicinity of their mean. Applying the same approach to the vector , we get
where has elements
with . Then, since is uniformly bounded in for all , we have for all ,
As shall be seen later, we need also to control . This is performed in the following technical Lemma.
Let be as in (32). Then,
Similarly, the following approximations hold true
To begin with, note that for a random matrix whose elements satisfy, for some , as , we have . We first start by deriving . We have
Using Assumption 2, we can prove that is bounded for all . Hence, by computing the expectation over of the first term, we obtain
When , . Hence,
Using the preceding relation, we thus obtain
It is easy to see that . On the other hand, one can show that we can replace in the second term by . This is because
Putting all the above results together, we obtain
Now, using the approximation given by the above theorem on the asymptotic behavior of inner product kernel random matrices, we obtain