Risk Convergence of Centered Kernel Ridge Regression with Large Dimensional Data

04/19/2019, by Khalil Elkhalil et al.

This paper carries out a large dimensional analysis of a variation of kernel ridge regression that we call centered kernel ridge regression (CKRR), also known in the literature as kernel ridge regression with offset. This modified technique is obtained by accounting for the bias in the regression problem, resulting in the usual kernel ridge regression but with centered kernels. The analysis is carried out under the assumption that the data is drawn from a Gaussian distribution and heavily relies on tools from random matrix theory (RMT). In the regime in which the data dimension and the training size grow infinitely large with a fixed ratio, and under some mild assumptions controlling the data statistics, we show that both the empirical and the prediction risks converge to deterministic quantities that describe in closed form the performance of CKRR in terms of the data statistics and dimensions. Inspired by this theoretical result, we subsequently build a consistent estimator of the prediction risk based on the training data, which allows us to optimally tune the design parameters. A key insight of the proposed analysis is the fact that asymptotically a large class of kernels achieves the same minimum prediction risk. This insight is validated with both synthetic and real data.


I Introduction

Kernel ridge regression (KRR) belongs to the family of kernel-based machine learning methods that deploy a set of nonlinear functions to describe the output of interest [1, 2]. More precisely, the idea is to map the data into a high-dimensional space, a.k.a. the feature space, which can even be of infinite dimension, resulting in a linear representation of the data with respect to the output. Then, a linear regression problem is solved in the feature space while controlling over-fitting with a regularization term. In fact, the most important advantage of kernel methods is the kernel trick, or kernel substitution [1], which allows one to work directly with kernels and avoid the explicit use of feature vectors in the feature space.

Due to its popularity, a rich body of research has been conducted to analyze the performance of KRR. In [3], a randomized version of KRR is studied with performance guarantees in terms of concentration bounds. The work in [4] analyzes the random features approximation in least squares kernel regression. More relevant results can be found in [5], where upper bounds on the prediction risk have been derived in terms of the empirical quadratic risk for general regression models. Similarly, for KRR models, upper and lower bounds on the expected risk have been provided in [6] before being generalized to general regularization operators in [7]. Therefore, most of the results related to the performance analysis of KRR and related regression techniques take the form of upper or lower bounds on the prediction risk. In this work, we study the problem from an asymptotic analysis perspective. As we will demonstrate in the course of the paper, such an analysis yields novel results that accurately predict the risk metrics. Our focus is on a variation of KRR called centered kernel ridge regression (CKRR), which is built upon the same principles as KRR with the additional requirement of minimizing the bias in the learning problem. This variation has been motivated by Cortes et al. in [8] and [9, 10], where the benefits of centering kernels have been highlighted. The obtained regression technique can be seen as KRR with centered kernels. Moreover, in the high dimensional setting with certain normalizations, we show that kernel matrices behave as a rank-one matrix; thus, centering neutralizes this non-informative component and highlights higher-order components that retain useful information about the data.

To understand the behavior of CKRR, we conduct a theoretical analysis in the large dimensional regime where both the data dimension and the training size tend to infinity at a fixed ratio. As far as inner-product kernels are concerned, and under mild assumptions on the data statistics, we show, using fundamental results from random matrix theory elaborated in [11] and [12], that both the empirical and prediction risks approach deterministic quantities that relate these performance measures in closed form to the data statistics and dimensions. This important finding shows how the model performance behaves as a function of the problem's parameters and hence how to tune the design parameters to minimize the prediction risk. Moreover, as an outcome of this result, we show that it is possible to jointly optimize the regularization parameter along with the kernel function so as to achieve the minimum possible prediction risk. In other words, the minimum prediction risk is attainable by all kernels with a proper choice of the regularization parameter. This implies that all kernels behave similarly to the linear kernel. We regard such a fact as a consequence of the curse of dimensionality, which causes CKRR to be asymptotically equivalent to centered linear ridge regression. As an additional contribution of the present work, we build a consistent estimator of the prediction risk based on the training samples, thereby paving the way towards an optimal setting of the regularization parameter.

The rest of the paper is structured as follows. In section II, we give a brief background on kernel ridge regression and introduce its centered variation. In section III, we provide the main results of the paper related to the asymptotic analysis of CKRR as well as the construction of a consistent estimator of the prediction risk. Then, we provide some numerical examples in section IV. We finally make some concluding remarks in section V.

Notations: $\mathbb{E}[\cdot]$ and $\mathrm{var}[\cdot]$ stand for the expectation and the variance of a random variable, while $\xrightarrow{\rm a.s.}$ and $\xrightarrow{p}$ respectively stand for almost sure convergence and convergence in probability. $\|\cdot\|$ denotes the operator norm of a matrix and the Euclidean norm for vectors, and $\operatorname{tr}(\cdot)$ stands for the trace operator. The notation $a_n = O(b_n)$ means that there exists a bounded constant $C$ such that $|a_n| \le C\, b_n$. We say that a function is $C^k$ if its $k$th derivative exists and is continuous.

II Background on kernel ridge regression

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of observations in $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ the output space. Our aim is to predict the output of new input points with reasonable accuracy. Assume that the output is generated by a function $g$; then the problem can be cast as a function approximation problem where the goal is to find an estimate of $g$, denoted by $\hat{g}$, such that $\hat{g}(x)$ is close to the real output $y$. In this context, the kernel learning problem is formulated as follows

$\hat{g} = \arg\min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big) + \lambda \|h\|_{\mathcal{H}}^2 \qquad (1)$

where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), $\ell$ is a loss function, and $\lambda > 0$ is a regularization parameter that permits to control overfitting. Denoting by $\phi: \mathcal{X} \to \mathcal{H}$ a feature map that maps the data points to the feature space $\mathcal{H}$, we define $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for all $x, x' \in \mathcal{X}$, where $k$ is known as the positive definite kernel corresponding to the feature map $\phi$. With these definitions, the representer theorem [13, 14] shows that the minimizer of the problem in (1) writes as $\hat{g}(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$. Thus, we can reformulate (1) as follows

$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\sum_{i=1}^{n} \ell\Big(\sum_{j=1}^{n}\alpha_j k(x_i, x_j),\, y_i\Big) + \lambda \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \qquad (2)$

When $\ell$ is the squared loss, the optimization problem in (2) can be reformulated as

$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\left\| y - K\alpha \right\|^2 + \lambda\, \alpha^\top K \alpha \qquad (3)$

where $y = [y_1, \ldots, y_n]^\top$ and $K \in \mathbb{R}^{n \times n}$ is the kernel (Gram) matrix with entries $K_{i,j} = k(x_i, x_j)$. This yields the solution $\hat{\alpha} = \left(K + n\lambda I_n\right)^{-1} y$. Then, the output estimate of any data point $x$ is given by [1]

$\hat{y}(x) = \kappa(x)^\top \hat{\alpha} = \kappa(x)^\top \left(K + n\lambda I_n\right)^{-1} y \qquad (4)$

where $\kappa(x) \in \mathbb{R}^n$ is the information vector with entries $[\kappa(x)]_i = k(x, x_i)$, $i = 1, \ldots, n$. This is commonly known as the kernel trick, which greatly simplifies the problem: everything boils down to solving an $n$-dimensional problem. Throughout this paper, we consider the following data model

$y_i = g(x_i) + \sigma \varepsilon_i, \quad i = 1, \ldots, n, \qquad (5)$

where $g$ generates the actual output of the data and the $\varepsilon_i$ are i.i.d. standard normal random variables, with the noise variance $\sigma^2$ assumed to be known. We consider both the empirical (training) and the prediction (testing) risks, respectively defined as [15]

$R_{\rm train} = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}(x_i) \right)^2, \qquad (6)$
$R_{\rm test} = \mathbb{E}\left[ \left( y - \hat{y}(x) \right)^2 \right], \qquad (7)$

where the expectation in (7) is taken over a test pair $(x, y)$, with $x$ drawn from the data input distribution independently of the training data and $y = g(x) + \sigma\varepsilon$. The above two equations respectively measure the goodness of fit relative to the training data and to new, unseen data, both in terms of the mean squared error (MSE).
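To make the above definitions concrete, the following minimal Python sketch fits plain KRR with the squared loss and evaluates the empirical and prediction risks by Monte Carlo. It is only an illustration: the $n\lambda$ scaling of the ridge term, the problem sizes, and the placeholder generating function `g_true` are assumptions made for this sketch, not the paper's exact setting.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Squared-loss KRR dual weights: alpha = (K + n*lam*I)^{-1} y.
    The n*lam scaling of the ridge term is an assumption; other conventions exist."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(k_x, alpha):
    """Predict with y_hat(x) = sum_i alpha_i k(x, x_i); k_x has shape (n_test, n_train)."""
    return k_x @ alpha

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Toy illustration with a linear inner-product kernel k(x, x') = x^T x' / p.
rng = np.random.default_rng(0)
n, n_test, p, lam, sigma2 = 200, 200, 100, 0.1, 0.25
X, X_test = rng.standard_normal((n, p)), rng.standard_normal((n_test, p))
g_true = lambda X: np.sin(X.sum(axis=1) / np.sqrt(X.shape[1]))  # placeholder generating function
y = g_true(X) + np.sqrt(sigma2) * rng.standard_normal(n)
y_test = g_true(X_test) + np.sqrt(sigma2) * rng.standard_normal(n_test)

K = X @ X.T / p
alpha = krr_fit(K, y, lam)
print("empirical (training) risk:", mse(y, krr_predict(K, alpha)))
print("prediction (testing) risk:", mse(y_test, krr_predict(X_test @ X.T / p, alpha)))
```

The empirical risk is computed on the training pairs, while the prediction risk is approximated by the average squared error over fresh test pairs.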

II-A Centered kernel ridge regression

The concept of centered kernels dates back to the work of Cortes [8] on learning kernels based on the notion of centered alignment. As we will show later, this notion of centering comes naturally into the picture when we account for the bias in the learning problem (see also the lecture notes by Jaakkola [16]). More specifically, we modify the optimization problem in (2) to account for the bias as follows

$\min_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\Big(\sum_{j=1}^{n}\alpha_j k(x_i, x_j) + b,\, y_i\Big) + \lambda\, \alpha^\top K \alpha \qquad (8)$

where clearly we do not penalize the offset (or bias) $b$ in the regularization term. With $\ell$ being the squared loss, we immediately get $b = \frac{1}{n}\mathbf{1}_n^\top\left(y - K\alpha\right)$. Substituting back in (8), we solve the centered optimization problem given by

$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n}\left\| P\left(y - K\alpha\right) \right\|^2 + \lambda\, \alpha^\top K \alpha \qquad (9)$

where $P = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is referred to as a projection matrix or a centering matrix [8, 16]. Finally, we get $\hat{\alpha} = \left(K_c + n\lambda I_n\right)^{-1} P y$, where $K_c = P K P$ is the centered kernel matrix as defined in [8, Lemma 1], and this form is obtained using the Woodbury identity. With some basic manipulations, the centered kernel ridge regression estimate of the output of a data point $x$ is given by

$\hat{y}_c(x) = \kappa_c(x)^\top \left(K_c + n\lambda I_n\right)^{-1} P y + \frac{1}{n}\mathbf{1}_n^\top y \qquad (10)$

Therefore, the feature map corresponding to the centered kernel as well as the centered information vector $\kappa_c(x)$ can be respectively obtained as follows

$\phi_c(x) = \phi(x) - \frac{1}{n}\sum_{i=1}^{n}\phi(x_i), \qquad \kappa_c(x) = P\left(\kappa(x) - \frac{1}{n}K\mathbf{1}_n\right) \qquad (11)$
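A minimal sketch of the centered variant follows. It assumes the standard kernel-centering formulas (the projection $P = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ applied to the Gram matrix, test kernel rows centered consistently, and an unpenalized offset equal to the training-output mean) and the same $n\lambda$ scaling as in the KRR sketch above; the paper's exact expressions in (10)-(11) may be normalized differently.

```python
import numpy as np

def ckrr_fit(K, y, lam):
    """Centered KRR: ridge on the centered Gram matrix Kc = P K P, offset not penalized.
    The n*lam scaling of the ridge term is an assumption (conventions differ)."""
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n        # centering (projection) matrix
    Kc = P @ K @ P                             # centered kernel matrix
    alpha = np.linalg.solve(Kc + n * lam * np.eye(n), P @ y)
    return alpha, y.mean()                     # dual weights and offset

def ckrr_predict(k_x, K_train, alpha, y_mean):
    """k_x: (n_test, n_train) kernel evaluations between test and training points.
    Test kernel rows are centered consistently with the training Gram matrix."""
    kc_x = (k_x
            - k_x.mean(axis=1, keepdims=True)
            - K_train.mean(axis=0, keepdims=True)
            + K_train.mean())
    return kc_x @ alpha + y_mean
```

With these helpers, the CKRR training and prediction risks can be evaluated exactly as in the KRR sketch above.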

Throughout this paper, we consider inner-product kernels [1, 11] defined as follows

$K_{i,j} = f\!\left(\frac{x_i^\top x_j}{p}\right), \quad 1 \le i, j \le n, \qquad (12)$

and subsequently $[\kappa(x)]_i = f\!\left(\frac{x^\top x_i}{p}\right)$, where the normalization by $p$ in (12) is convenient in the large $n, p$ regime, as we will show later (see also [18] for a similar normalization in the analysis of LS-SVMs); this normalization is equivalent to normalizing all data points by $\sqrt{p}$, a choice that has also been considered in [17] following the heuristic of Jaakkola. In the following, we conduct a large dimensional analysis of the performance of CKRR with the aim of getting useful insights on the design of CKRR. In particular, we will focus on studying the empirical and the prediction risks of CKRR, which we define as in (6) and (7) with the KRR estimate replaced by the CKRR estimate in (10).

The novelty of our analysis with respect to previous studies lies in the following two points:

  1. It provides a mathematical connection between the performance and the problem's dimensions and statistics, resulting in a deeper understanding of centered kernel ridge regression in the large dimensional regime.

  2. It brings insights on how to choose the kernel function and the regularization parameter in order to guarantee a good generalization performance for unknown data.

As far as the second point is concerned, we show later that both the kernel function and the regularization parameter can be optimized jointly, as a consequence of the mathematical result connecting the prediction risk with these design parameters. Our analysis does not assume a specific choice of the inner-product kernel and is valid for the following popular ones (see the illustrative sketch following the list).

  • Linear kernels: $f(t) = t$.

  • Polynomial kernels: $f(t) = (a t + b)^d$ for some $a > 0$, $b \ge 0$ and integer $d \ge 1$.

  • Sigmoid kernels: $f(t) = \tanh(a t + b)$.

  • Exponential kernels: $f(t) = \exp(a t)$, $a > 0$.
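The normalization by $p$ in (12) can be made explicit as in the short sketch below; the particular coefficients used for the polynomial, sigmoid and exponential kernels are illustrative placeholders rather than the paper's choices.

```python
import numpy as np

def inner_product_kernel(X1, X2, f):
    """Inner-product kernel matrix with entries f(x1_i^T x2_j / p), as in (12)."""
    p = X1.shape[1]
    return f(X1 @ X2.T / p)

# Examples of smooth kernel functions f (constants are illustrative choices only).
kernels = {
    "linear":      lambda t: t,
    "polynomial":  lambda t: (1.0 + t) ** 2,
    "sigmoid":     lambda t: np.tanh(2.0 * t + 0.5),
    "exponential": lambda t: np.exp(t),
}

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))                        # n = 50 samples in dimension p = 200
K_poly = inner_product_kernel(X, X, kernels["polynomial"])
```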

III Main results

III-A Technical assumptions

In this section, we present our theoretical results on the prediction risk of CKRR by first introducing our assumptions on the data growth rate, the kernel function $f$ and the data generating function $g$. Without loss of generality, we assume that the data samples are independent with $x_i \sim \mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ is positive definite. Throughout the analysis, we consider the large dimensional regime in which both $n$ and $p$ grow simultaneously large under the following growth rate assumptions.

Assumption 1 (Growth rate).

As $n, p \to \infty$, we assume the following:

  • Data scaling: the ratio $p/n$ converges to a finite positive constant.

  • Covariance scaling: the spectral norm of $\Sigma$ remains bounded, i.e., $\limsup_p \|\Sigma\| < \infty$.

The above assumptions are standard and allow us to exploit the rich literature on random matrix theory. Moreover, allowing $n$ and $p$ to grow large at the same rate is of practical interest when dealing with modern large and numerous data. The assumption on the covariance scaling is technically convenient since it allows the use of important theoretical results on the behavior of large kernel matrices [11, 12]. Under Assumption 1, we have the following implications.

$\frac{x_i^\top x_j}{p} \xrightarrow{\rm a.s.} 0, \quad i \neq j, \qquad (13)$
$\frac{\|x_i\|^2}{p} - \frac{1}{p}\operatorname{tr}\Sigma \xrightarrow{\rm a.s.} 0, \qquad (14)$

where the convergence in (14) follows from the covariance scaling in Assumption 1. This means that in the limit when $n, p \to \infty$, the kernel matrix $K$ as defined earlier has all its entries converging to deterministic limits. Applying a Taylor expansion to the entries of $K$, and under some assumptions on the kernel function $f$, it has been shown in [11, Theorem 2.1] that

$\left\| K - \tilde{K} \right\| \xrightarrow{\rm a.s.} 0, \qquad (15)$

where the convergence is in operator norm and the matrix $\tilde{K}$ exhibits nice properties and can be expressed using standard random matrix models. The explicit expression of $\tilde{K}$ as well as its properties will be thoroughly investigated in Appendix A. We subsequently make additional assumptions to control the kernel function $f$ and the data generating function $g$.
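Before stating these additional assumptions, the observations in (13)-(15) can be checked numerically: as $p$ grows, the off-diagonal inner products $x_i^\top x_j / p$ concentrate around 0, so the kernel matrix is dominated in operator norm by a rank-one component proportional to $\mathbf{1}_n\mathbf{1}_n^\top$, which the centering matrix removes. A minimal sketch with illustrative sizes and $\Sigma = I_p$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 600                        # n and p large, with a fixed ratio (illustrative sizes)
X = rng.standard_normal((n, p))        # Sigma = I_p for simplicity

G = X @ X.T / p                        # off-diagonal entries are O(1/sqrt(p))
off_diag = np.abs(G[~np.eye(n, dtype=bool)])
print("max |x_i^T x_j / p| for i != j:", off_diag.max())

f = np.exp                             # any smooth inner-product kernel function
K = f(G)
rank_one = f(0.0) * np.ones((n, n))    # dominant rank-one component
P = np.eye(n) - np.ones((n, n)) / n    # centering matrix

print("||K - f(0) 11^T|| / ||K||:", np.linalg.norm(K - rank_one, 2) / np.linalg.norm(K, 2))
print("||P K P|| (informative part):", np.linalg.norm(P @ K @ P, 2))
```

The relative operator-norm distance to the rank-one matrix is small because the rank-one part grows like $n$, while the centered part $P K P$ retains the informative, order-one structure.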

Assumption 2 (Kernel function).

As in [11, Theorem 2.1], we shall assume that $f$ is $C^1$ in a neighborhood of $\frac{1}{p}\operatorname{tr}\Sigma$ and $C^3$ in a neighborhood of 0. Moreover, we assume that for any independent observations $x_i$ and $x_j$ drawn from the data distribution, $f^{(3)}\!\left(\frac{x_i^\top x_j}{p}\right)$ has bounded moments, where $f^{(3)}$ is the third derivative of $f$.

Assumption 3 (Data generating function).

We assume that the data generating function $g$ is sufficiently smooth and polynomially bounded together with its derivatives. We shall further assume that the moments of $g(x)$ and of its gradient are finite. More explicitly, we need the following moment conditions to hold:

(16)

As we will show later, the above assumptions are needed to guarantee a bounded asymptotic risk and to carry out the analysis. Under the setting of Assumptions 1, 2 and 3, we aim to study the performance of CKRR by asymptotically evaluating the performance metrics defined in (6) and (7). Inspired by the fundamental results from [11] and [12] in the context of spectral clustering, and following the observations made in (13) and (14), it is always possible to linearize the kernel matrix $K$, which avoids dealing with its original intractable expression. Note that the first, rank-one component of the approximation (proportional to $\mathbf{1}_n\mathbf{1}_n^\top$) will be neutralized by the projection matrix $P$ in the context of CKRR, which means that the behavior of CKRR is essentially governed by the higher-order terms of the approximation of $K$. Consequently, one can resort to those approximations to obtain an explicit expression of the asymptotic risk in the large $n, p$ regime. This expression reveals the mathematical connection between the regression risk and the data statistics and dimensions as $n, p \to \infty$.

III-B Limiting risk

With the above assumptions at hand, we are now in a position to state the main results of the paper related to the derivation of the asymptotic risk of CKRR. Before doing so, we shall introduce some useful quantities.

Also, for all $z$ at a macroscopic distance from the eigenvalues of $\Sigma$, we define the Stieltjes transform $m(z)$, also known as the Stieltjes transform of the Marčenko–Pastur law, as the unique solution to the following fixed-point equation [19]

(17)

where the solution $m(z)$ of (17) remains bounded provided that Assumption 1 is satisfied. For ease of notation, we shall use $m$ to denote $m(z)$ for all appropriate $z$. The first main result of the paper is summarized in the following theorem, the proof of which is postponed to Appendix A.
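Before stating Theorem 1, we note that a fixed-point equation of this type is easily solved numerically by iteration. The sketch below assumes one standard normalization of the Marchenko-Pastur equation for a population covariance $\Sigma$, evaluated at a negative real argument $z = -\lambda$; the exact form of (17) used in the paper may differ.

```python
import numpy as np

def mp_stieltjes_delta(Sigma, n, lam, n_iter=500, tol=1e-10):
    """Fixed-point iteration for one standard form of the Marchenko-Pastur equation
    at z = -lam < 0:
        delta = (1/n) * tr( Sigma @ inv( Sigma / (1 + delta) + lam * I_p ) ).
    This is an assumed normalization; the paper's equation (17) may be written differently."""
    p = Sigma.shape[0]
    delta = 1.0
    for _ in range(n_iter):
        new = np.trace(Sigma @ np.linalg.inv(Sigma / (1.0 + delta) + lam * np.eye(p))) / n
        if abs(new - delta) < tol:
            break
        delta = new
    return delta

# Example: Sigma = I_p, ratio p/n = 0.5, lam = 0.1 (illustrative values).
print(mp_stieltjes_delta(np.eye(100), n=200, lam=0.1))
```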

Theorem 1 (Limiting risk).

Under Assumptions 1, 2 and 3, for kernel functions satisfying $f'(0) \neq 0$ (the case $f'(0) = 0$ is asymptotically equivalent to taking the sample mean as an estimate of the output, which is of neither practical nor theoretical interest), and for a regularization parameter such that the corresponding argument of the Stieltjes transform lies at a macroscopic distance from the eigenvalues of $\Sigma$, both the empirical and the prediction risks converge in probability to non-trivial deterministic limits, respectively given by

(18)
(19)

where the expressions of the limiting risks in (18) and (19) are given in (20) and (21) at the top of the next page.

(20)
(21)

Note that in the case where $\Sigma = I_p$, the limiting risks in (20) and (21) can be further simplified, since the Stieltjes transform $m(z)$ can then be derived explicitly as in [20].

Remark 1.

From Theorem 1, it follows that the limiting prediction risk can be expressed in terms of the limiting empirical risk in the following fashion.

(22)
Lemma 1 (A consistent estimator of the prediction risk).

Inspired by the outcome of Theorem 1 summarized in Remark 1, we construct a consistent estimator of the prediction risk given by

(23)

in the sense that the estimation error vanishes in probability as $n, p \to \infty$, where the unknown quantities involved in (23) can be consistently estimated from the training data.

Proof.

The proof is straightforward, relying on the relation in (22) and the consistency result shown in [12, Lemma 1]. ∎

Since the aim of any learning system is to design a model that achieves minimal prediction risk [15], the relation described in Lemma 1 by (23) has enormous advantages, as it permits estimating the prediction risk in terms of the empirical risk and hence optimizing the prediction risk accordingly.

Remark 2.

One important observation from the expression of the limiting prediction risk in (21) is that the information on the kernel as well as the information on the regularization parameter are both encapsulated in a single effective parameter. This means that one should optimize this effective parameter to attain the minimal prediction risk, and thus jointly choose the kernel and the regularization parameter $\lambda$. Moreover, it entails that the choice of the kernel (as long as $f'(0) \neq 0$) is asymptotically irrelevant, since a bad choice of the kernel can be compensated by a good choice of $\lambda$ and vice versa. This essentially implies that a linear kernel asymptotically achieves the same optimal performance as any other type of kernel (this does not mean that all kernels have the same performance for a given regularization parameter, but rather that they achieve the same minimum prediction risk).
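Remark 2 can be illustrated numerically by sweeping the regularization parameter for two different inner-product kernels and comparing the best achieved prediction risk. The sketch below reuses the hypothetical `inner_product_kernel`, `kernels`, `ckrr_fit` and `ckrr_predict` helpers introduced earlier, with illustrative sizes and a placeholder generating function; it is a qualitative check, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test, p, sigma2 = 300, 300, 150, 0.25
X, X_test = rng.standard_normal((n, p)), rng.standard_normal((n_test, p))
g_true = lambda X: (X ** 2).mean(axis=1)                # placeholder generating function
y = g_true(X) + np.sqrt(sigma2) * rng.standard_normal(n)
y_test = g_true(X_test) + np.sqrt(sigma2) * rng.standard_normal(n_test)

lams = np.logspace(-4, 2, 60)
for name in ("linear", "exponential"):
    f = kernels[name]                                   # from the inner-product kernel sketch
    K = inner_product_kernel(X, X, f)
    k_x = inner_product_kernel(X_test, X, f)
    risks = []
    for lam in lams:
        alpha, y_mean = ckrr_fit(K, y, lam)
        risks.append(np.mean((y_test - ckrr_predict(k_x, K, alpha, y_mean)) ** 2))
    print(name, "minimum prediction risk over lambda:", min(risks))
# The two minima should be close to each other, attained at different optimal lambdas.
```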

III-C A consistent estimator of the prediction risk

Although the estimator provided in Lemma 1 permits estimating the prediction risk by virtue of the empirical risk, it has the drawback of being sensitive to small values of the regularization parameter $\lambda$. In the following theorem, we provide a consistent estimator of the prediction risk that is constructed from the training data and is less sensitive to small values of $\lambda$.

Theorem 2 (A consistent estimator of the prediction risk).

Under Assumptions 1, 2 and 3, with $f'(0) \neq 0$ and $\lambda > 0$, we construct a consistent estimator of the prediction risk based on the training data such that

(24)

where the resolvent matrix appearing in (24) is built from the centered kernel matrix and the regularization parameter $\lambda$. Moreover, in the special case where $\Sigma = I_p$, the estimator reduces to

(25)

Theorem 2 provides a generic way to estimate the prediction risk from the training examples. This allows using the closed-form expressions in (24) and (25) (the expression in (25) is useful because, unlike (24), it does not involve any matrix inversion), together with the arguments in Remark 2, to jointly estimate the optimal kernel and the optimal regularization parameter.

III-D Parameters optimization

We briefly discuss how to jointly optimize the kernel function $f$ and the regularization parameter $\lambda$. As mentioned earlier, we exploit the special structure in the expression of the consistent estimate, where both parameters (the kernel function and $\lambda$) are summarized in a single effective parameter. We focus on the case where $\Sigma = I_p$ due to the tractability of the expression in (25). By simple calculations, we can show that the estimated risk is minimized when this effective parameter satisfies a first-order optimality equation, which admits the following closed-form solution

(26)

Then, one looks up the value of the effective parameter achieving this optimum and finally chooses the kernel $f$ and the regularization parameter $\lambda$ accordingly. In the general case, it is difficult to get a closed-form expression in terms of $f$ or $\lambda$; however, it is possible to numerically optimize the estimated prediction risk with respect to $\lambda$. This can be done using simple one-dimensional optimization techniques implemented in most software packages, as illustrated in the sketch below.
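For instance, assuming a routine `risk_hat(lam)` that implements the consistent estimator of Theorem 2 (not reproduced here, since it relies on the closed-form expressions (24)-(25)), the one-dimensional search over $\lambda$ could be carried out as follows; this is only an illustrative sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimal_lambda(risk_hat, lam_min=1e-6, lam_max=1e3):
    """One-dimensional search for the regularization parameter minimizing the
    estimated prediction risk. `risk_hat` is a placeholder for a function mapping
    lambda > 0 to the risk estimate of Theorem 2 (its implementation is not shown).
    The search runs over log10(lambda) to cope with the wide dynamic range."""
    res = minimize_scalar(lambda t: risk_hat(10.0 ** t),
                          bounds=(np.log10(lam_min), np.log10(lam_max)),
                          method="bounded")
    return 10.0 ** res.x
```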

IV Experiments

IV-A Synthetic data

To validate our theoretical findings, we consider both Gaussian and Bernoulli data. As shown in Figure 2, both data distributions exhibit the same behavior for all the settings with different kernel functions. More importantly, even though the derived formulas heavily rely on the Gaussian assumption, in the case where the data is Bernoulli distributed we still observe a good agreement with the theoretical limits. This can be understood as part of the universality property often encountered in many high dimensional settings. Therefore, we conjecture that the obtained results are valid for any data distribution following the model $x_i = \Sigma^{1/2} z_i$, where $\Sigma$ satisfies Assumption 1 and the entries of $z_i$ are i.i.d. with zero mean, unit variance and bounded moments (we could not provide experiments for more data distributions due to space limitations). For more clarity, we refer the reader to Figure 1 as a representative of Figure 2 for Gaussian data. As shown in Figure 1, the proposed consistent estimators are able to track the real behavior of the prediction risk for all types of kernels under consideration. It is worth mentioning, however, that the estimator proposed in Lemma 1 exhibits some instability for small values of $\lambda$ due to the inversion in (23). Therefore, it is advisable to use the estimator given by Theorem 2. It is also clear from Figure 1 that all the considered kernels achieve the same minimum prediction risk but with different optimal regularization parameters. This is not the case for the empirical risk, as shown in Figure 1 and (20), where the information on the kernel and the regularization parameter are decoupled. Hence, in contrast to the prediction risk, the regularization parameter and the kernel cannot be jointly optimized to minimize the empirical risk.

IV-B Real data

As a further experiment, we validate the accuracy of our result on a real data set. To this end, we use the Communities and Crime Data Set for evaluation [21], which has 123 samples and 122 features. For the experiment in Figure 3, we divide the data set into a training part and keep the remaining samples for testing. The risks in Figure 3 are obtained by averaging the prediction risk over random permutations of the data. Although the data set is far from being Gaussian, we notice that the proposed prediction risk estimators are able to track the real behavior of the prediction risk for all types of considered kernels. We can also validate the previous insight from Theorem 1, namely that all kernels achieve almost the same minimum prediction risk.

Fig. 1: CKRR risk with respect to the regularization parameter on Gaussian data, for different types of kernels.
Fig. 2: CKRR risk with respect to the regularization parameter on both Gaussian and Bernoulli data.
Fig. 3: CKRR risk with respect to the regularization parameter on the Communities and Crime Data Set, where independent zero-mean Gaussian noise samples are added to the true response.

V Concluding Remarks

We conducted a large dimensional analysis of centered kernel ridge regression, a modified version of kernel ridge regression that accounts for the bias in the regression formulation. By allowing both the data dimension and the training size to grow infinitely large in a fixed proportion, and by relying on fundamental tools from random matrix theory, we showed that both the empirical and the prediction risks converge in probability to deterministic quantities that mathematically connect these performance metrics to the data dimensions and statistics. A fundamental insight of the analysis is that, asymptotically, the choice of the kernel is irrelevant to the learning problem, in the sense that a large class of kernels achieves the same best prediction risk as a linear kernel. Finally, based on the asymptotic analysis, we built a consistent estimator of the prediction risk, making it possible to estimate the optimal regularization parameter that achieves the minimum prediction risk.

Appendix A

Proof of Theorem 1

Here, we provide the details of the derivation for the prediction risk. The analysis of the empirical risk follows in a very similar way and is thus omitted. Before delving into the proof of Theorem 1, we shall introduce some fundamental results on the asymptotic behavior of inner-product kernel matrices established by El-Karoui [11, Theorem 2.1].

Theorem (Asymptotic behavior of inner product kernel random matrices).

Under the assumptions of [11, Theorem 2.1], the kernel matrix $K$ can be approximated by a matrix $\tilde{K}$, in the sense that $\|K - \tilde{K}\| \to 0$ almost surely in operator norm, where $\tilde{K}$ combines a rank-one term proportional to $\mathbf{1}_n\mathbf{1}_n^\top$, a term proportional to the Gram matrix $\frac{1}{p}XX^\top$ of the data matrix $X = [x_1, \ldots, x_n]^\top$, and a correction proportional to the identity; the precise coefficients, which depend on $f(0)$, $f'(0)$, $f''(0)$ and $\frac{1}{p}\operatorname{tr}\Sigma$, are given in [11, Theorem 2.1]. A similar result can be found in [11], where the accuracy of the approximation has been assessed in terms of a matrix whose spectral norm converges in probability to zero at a known rate.

Define

(27)

Note that using the Woodbury identity, it is easy to show the following useful relations

(28)
(29)

The above theorem has the following consequence

(30)

where (30) is obtained by a simple application of the Sherman–Morrison lemma (matrix inversion lemma), along with the use of the resolvent identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, which holds for any square invertible matrices $A$ and $B$. The proof of the above theorem follows from the application of a Taylor expansion of the elements of $K$ in the vicinity of their mean. Applying the same approach to the information vector, we get

where has elements

(31)

with . Then, since is uniformly bounded in for all , we have for all ,

(32)

As will be seen later, we also need to control this quantity. This is performed in the following technical lemma.

Lemma 2.

Let be as in (32). Then,

Similarly, the following approximations hold true

Proof.

To begin with, note that for a random matrix whose elements satisfies, for some , as , we have . We first start by deriving . We have

Using Assumption 2, we can prove that this quantity is bounded. Hence, by computing the expectation of the first term, we obtain

When , . Hence,

Using the above relation, we thus obtain

It is easy to see that . On the other hand, one can show that we can replace in the second term by . This is because

Putting all the above results together, we obtain

Now, using the approximation given in the theorem on the asymptotic behavior of inner-product kernel random matrices, we obtain

(33)
Theorem (Asymptotic behavior of and ).

As in [12, Lemma 1], let Assumption 1 hold; then, as $n, p \to \infty$ and for all