1 Introduction
Kernel-based methods are now well-established tools for supervised learning, allowing one to perform various tasks, such as regression or binary classification, with linear and non-linear predictors
[19, 18]. A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, based either on cross-validation, Gaussian processes or multiple kernel learning [6, 17, 3]. In this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator whose quadratic risk is as small as possible. This problem includes, for instance, model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2).
The main contribution of the paper is to extend the notion of minimal penalty [4, 2] to all discrete classes of linear operators, and to use it to define a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results, presented in Section 4, extend similar results which were limited to unregularized least-squares regression (i.e., projection operators). Finally, in Section 5, we show that our algorithm improves on the performance of classical selection procedures, such as GCV [7] and 10-fold cross-validation, for kernel ridge regression and multiple kernel learning, for moderate values of the sample size.
2 Linear estimators
In this section, we define the problem we aim to solve and give several examples of linear estimators.
2.1 Framework and notation
Let us assume that one observes

$$Y_i = f(x_i) + \varepsilon_i \in \mathbb{R}, \qquad i = 1, \ldots, n,$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. centered random variables with $\mathbb{E}[\varepsilon_i^2] = \sigma^2$ unknown, $f$ is an unknown measurable function and $x_1, \ldots, x_n$ are deterministic design points. No assumption is made on the set $\mathcal{X}$ to which the $x_i$ belong. The goal is to reconstruct the signal $f = (f(x_i))_{1 \le i \le n} \in \mathbb{R}^n$, with some estimator $\widehat{f} \in \mathbb{R}^n$, depending only on $(x_1, Y_1), \ldots, (x_n, Y_n)$, and having a small quadratic risk $n^{-1} \| \widehat{f} - f \|_2^2$, where, for every $t \in \mathbb{R}^n$, we denote by $\| t \|_2$ the $\ell_2$-norm of $t$, defined as $\| t \|_2 := ( \sum_{i=1}^n t_i^2 )^{1/2}$.

In this paper, we focus on linear estimators $\widehat{f}$ that can be written as a linear function of $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$, that is, $\widehat{f} = F Y$, for some (deterministic) $n \times n$ matrix
$F$. Here and in the rest of the paper, vectors such as
$Y$ or $\varepsilon$ are assumed to be column vectors. We present in Section 2.2 several important families of estimators of this form. The matrix $F$ may depend on $x_1, \ldots, x_n$ (which are known and deterministic), but not on $Y$, and may be parameterized by certain quantities, usually a regularization parameter or kernel combination weights.

2.2 Examples of linear estimators
In this paper, our theoretical results apply to matrices $F$ that are symmetric and positive semidefinite, such as the ones defined below.
Ordinary least-squares regression / model selection. If we consider linear predictors from a design matrix $X \in \mathbb{R}^{n \times p}$, then $\widehat{f} = F Y$ with $F = X (X^\top X)^{-1} X^\top$, which is a projection matrix (i.e., $F^\top F = F$); $\widehat{f}$ is often called a projection estimator. In the variable selection setting, one wants to select a subset $J \subseteq \{1, \ldots, p\}$, and the matrices $F$ are parameterized by $J$.
Kernel ridge regression / spline smoothing. We assume that a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is given, and we are looking for a function in the associated reproducing kernel Hilbert space (RKHS) $\mathcal{F}$, with norm $\| \cdot \|_{\mathcal{F}}$. If $K$ denotes the $n \times n$ kernel matrix, defined by $K_{ij} = k(x_i, x_j)$, then the ridge regression estimator (a.k.a. spline smoothing estimator for spline kernels [22]) is obtained by minimizing with respect to $g \in \mathcal{F}$ [18]:

$$\frac{1}{n} \sum_{i=1}^n ( Y_i - g(x_i) )^2 + \lambda \| g \|_{\mathcal{F}}^2 .$$

The unique solution is equal to $\widehat{g} = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, where $\alpha = (K + n \lambda I_n)^{-1} Y$. This leads to the smoothing matrix $F_\lambda = K (K + n \lambda I_n)^{-1}$, parameterized by the regularization parameter $\lambda \in \mathbb{R}_+$.
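As a concrete illustration of this smoothing matrix, the following sketch computes $F_\lambda = K(K + n\lambda I_n)^{-1}$ with NumPy; the Gaussian kernel, design points and value of $\lambda$ are illustrative choices, not prescribed by the text.

```python
import numpy as np

def ridge_smoother(K, lam):
    """Kernel ridge smoothing matrix F = K (K + n*lam*I_n)^{-1}."""
    n = K.shape[0]
    return K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))

# Toy design and Gaussian kernel (illustrative choices).
rng = np.random.default_rng(0)
x = rng.uniform(size=20)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
F = ridge_smoother(K, lam=0.1)

# The linear estimator is simply f_hat = F Y.
Y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
f_hat = F @ Y
```

Since $K$ is symmetric positive semidefinite, $F_\lambda$ is symmetric with spectrum in $[0, 1)$, which is the setting covered by the theoretical results of this paper.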
Multiple kernel learning / Group Lasso / Lasso. We now assume that we have $p$ different kernels $k_j$, feature spaces $\mathcal{F}_j$ and feature maps $\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \ldots, p$. The group Lasso [23] and multiple kernel learning [11, 3] frameworks consider the following objective function:

$$J(f_1, \ldots, f_p) = \frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^p \langle f_j, \Phi_j(x_i) \rangle \Big)^2 + 2 \lambda \sum_{j=1}^p \| f_j \| .$$

Note that when $\Phi_j(x)$ is simply the $j$-th coordinate of $x \in \mathbb{R}^p$, we get back the penalization by the $\ell_1$-norm and thus the regular Lasso [21].

Using $2 \| f_j \| = \min_{\eta_j \ge 0} \{ \eta_j^{-1} \| f_j \|^2 + \eta_j \}$, we obtain a variational formulation of the sum of norms $2 \sum_{j=1}^p \| f_j \|$. Thus, minimizing $J$ with respect to $(f_1, \ldots, f_p)$ is equivalent to minimizing with respect to $\eta \in \mathbb{R}_+^p$ (see [3] for more details):

$$\lambda Y^\top \Big( \sum_{j=1}^p \eta_j K_j + n \lambda I_n \Big)^{-1} Y + \lambda \sum_{j=1}^p \eta_j ,$$

where $I_n$ is the $n \times n$ identity matrix. Moreover, given $\eta$, this leads to a smoothing matrix of the form

$$F_{\lambda, \eta} = K_\eta ( K_\eta + n \lambda I_n )^{-1}, \qquad K_\eta = \sum_{j=1}^p \eta_j K_j, \qquad (1)$$

parameterized by the regularization parameter $\lambda \in \mathbb{R}_+$ and the kernel combination weights $\eta \in \mathbb{R}_+^p$; note that $F_{\lambda, \eta}$ depends only on $\eta / \lambda$, so that the two can be grouped in a single parameter set $\Theta = \mathbb{R}_+^p$.
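To make Eq. (1) concrete, here is a minimal sketch (NumPy assumed; the two kernels are arbitrary illustrative choices) of the combined smoothing matrix:

```python
import numpy as np

def mkl_smoother(Ks, eta, lam):
    """Smoothing matrix of Eq. (1): F = K_eta (K_eta + n*lam*I_n)^{-1},
    with K_eta = sum_j eta_j * K_j a nonnegative combination of kernel matrices."""
    n = Ks[0].shape[0]
    K_eta = sum(e * K for e, K in zip(eta, Ks))
    return K_eta @ np.linalg.solve(K_eta + n * lam * np.eye(n), np.eye(n))

# Two kernels on two different variables (illustrative choices).
rng = np.random.default_rng(1)
x1, x2 = rng.uniform(size=(2, 30))
K1 = np.exp(-np.abs(x1[:, None] - x1[None, :]))
K2 = np.exp(-(x2[:, None] - x2[None, :]) ** 2)
F = mkl_smoother([K1, K2], eta=[0.5, 2.0], lam=0.05)
```

One can check numerically the scale invariance mentioned in the text: rescaling $\eta$ and $\lambda$ by the same constant leaves $F_{\lambda,\eta}$ unchanged, so the smoother depends only on $\eta/\lambda$.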
Thus, the Lasso/group Lasso can be seen as a particular (convex) way of optimizing over $\eta$. In this paper, we propose a non-convex alternative with better statistical properties (oracle inequality in Theorem 1). Note that in our setting, finding the global solution of the problem is hard in general, since the optimization is not convex. However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures, which only find local optima.
Non-symmetric linear estimators. Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [10]; those however lead to non-symmetric matrices $F$, and are not entirely covered by our theoretical results.
3 Linear estimator selection
In this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty.
3.1 Unbiased risk estimation heuristics
Usually, several estimators of the form $\widehat{f} = F Y$ can be used. The problem that we consider in this paper is then to select one of them, that is, to choose a matrix $F$. Let us assume that a family of matrices $(F_\theta)_{\theta \in \Theta}$ is given (examples are shown in Section 2.2), hence a family of estimators $(\widehat{f}_\theta)_{\theta \in \Theta}$ can be used, with $\widehat{f}_\theta := F_\theta Y$. The goal is to choose from data some $\widehat{\theta} \in \Theta$, so that the quadratic risk of $\widehat{f}_{\widehat\theta}$ is as small as possible.

The best choice would be the oracle:

$$\theta^\star \in \arg\min_{\theta \in \Theta} \left\{ n^{-1} \| \widehat{f}_\theta - f \|_2^2 \right\},$$

which cannot be used since it depends on the unknown signal $f$. Therefore, the goal is to define a data-driven $\widehat\theta$ satisfying an oracle inequality

$$n^{-1} \| \widehat{f}_{\widehat\theta} - f \|_2^2 \le C_n \inf_{\theta \in \Theta} \left\{ n^{-1} \| \widehat{f}_\theta - f \|_2^2 \right\} + R_n \qquad (2)$$

with large probability, where the leading constant $C_n$ should be close to 1 (at least for large $n$) and the remainder term $R_n$ should be negligible compared to the risk of the oracle.

Many classical selection methods are built upon the "unbiased risk estimation" heuristics: if $\widehat\theta$ minimizes a criterion $\mathrm{crit}(\theta)$ such that

$$\forall \theta \in \Theta, \qquad \mathbb{E}[\mathrm{crit}(\theta)] \approx \mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - f \|_2^2 \right],$$

then $\widehat\theta$ satisfies an oracle inequality such as in Eq. (2) with large probability. For instance, cross-validation [1, 20] and generalized cross-validation (GCV) [7] are built upon this heuristics.
One way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i.e., using a criterion of the form:

$$\mathrm{crit}(\theta) = n^{-1} \| \widehat{f}_\theta - Y \|_2^2 + \mathrm{pen}(\theta) .$$

The unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the ideal (deterministic) penalty

$$\mathrm{pen}_{\mathrm{id}}(\theta) := \mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - f \|_2^2 \right] - \mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - Y \|_2^2 \right] .$$

When $\widehat{f}_\theta = F_\theta Y$, we have:

$$n^{-1} \| \widehat{f}_\theta - f \|_2^2 = n^{-1} \| (F_\theta - I_n) f \|_2^2 + 2 n^{-1} \langle (F_\theta - I_n) f, F_\theta \varepsilon \rangle + n^{-1} \| F_\theta \varepsilon \|_2^2 , \qquad (3)$$

$$n^{-1} \| \widehat{f}_\theta - Y \|_2^2 = n^{-1} \| (F_\theta - I_n) f \|_2^2 + 2 n^{-1} \langle (F_\theta - I_n) f, (F_\theta - I_n) \varepsilon \rangle + n^{-1} \| (F_\theta - I_n) \varepsilon \|_2^2 , \qquad (4)$$

where $Y = f + \varepsilon$ and $I_n$ denotes the $n \times n$ identity matrix. Since $\varepsilon$ is centered with covariance matrix $\sigma^2 I_n$, Eq. (3) and Eq. (4) imply that

$$\mathrm{pen}_{\mathrm{id}}(\theta) = \frac{2 \sigma^2 \operatorname{tr}(F_\theta)}{n} \qquad (5)$$

up to the term $-\mathbb{E}[ n^{-1} \| \varepsilon \|_2^2 ] = -\sigma^2$, which can be dropped since it does not vary with $\theta$.
Note that $\operatorname{tr}(F_\theta)$ is called the effective dimensionality or degrees of freedom [24], so that the ideal penalty in Eq. (5) is proportional to the dimensionality associated with the estimator $\widehat{f}_\theta$: for projection matrices, we get back the dimension of the subspace, which is classical in model selection.
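The identification of $\operatorname{tr}(F)$ with a number of degrees of freedom can be checked on a projection matrix, for which it recovers the subspace dimension exactly; a minimal sketch (NumPy assumed, toy design):

```python
import numpy as np

def degrees_of_freedom(F):
    """Effective dimensionality df(F) = tr(F) of a linear smoother [24]."""
    return float(np.trace(F))

# For the orthogonal projection onto span(X), tr(F) equals the subspace dimension.
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))           # n = 30 points, 5 covariates
P = X @ np.linalg.solve(X.T @ X, X.T)      # projection matrix X (X^T X)^{-1} X^T
df = degrees_of_freedom(P)                 # close to 5.0
```

For a ridge smoother the same formula gives a non-integer value between 0 and $n$, shrinking toward 0 as $\lambda$ grows.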
The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular Mallows' $C_L$ (called $C_p$ in the case of projection estimators) [14], where $\sigma^2$ is replaced by some estimator $\widehat{\sigma}^2$. The estimator of $\sigma^2$ usually used with $C_L$ is based upon the value of the empirical risk at some $\theta_0$ with $\operatorname{tr}(F_{\theta_0})$ large; it has the drawback of overestimating the risk, in a way which depends on $\theta_0$ [8]. GCV, which implicitly estimates $\sigma^2$, has the drawback of overfitting if the family contains a matrix too close to $I_n$ [5]; GCV also overestimates the risk even more than $C_L$ for most $F_\theta$ (see (7.9) and Table 4 in [8]).
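For comparison purposes, the two classical criteria mentioned above can be written in a few lines under the linear-smoother notation of this section. This is a hedged sketch (NumPy assumed; data and kernel are illustrative), not the authors' code:

```python
import numpy as np

def mallows_cl(F, Y, sigma2):
    """Mallows' C_L: empirical risk plus the ideal penalty 2*sigma2*tr(F)/n of Eq. (5)."""
    n = len(Y)
    return np.sum((Y - F @ Y) ** 2) / n + 2 * sigma2 * np.trace(F) / n

def gcv(F, Y):
    """Generalized cross-validation: (1/n) ||(I - F) Y||^2 / (1 - tr(F)/n)^2."""
    n = len(Y)
    return (np.sum((Y - F @ Y) ** 2) / n) / (1 - np.trace(F) / n) ** 2

# Selecting lambda over a small kernel ridge family (illustrative data).
rng = np.random.default_rng(3)
n = 50
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.abs(x[:, None] - x[None, :]))
Y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
Fs = [K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
      for lam in np.logspace(-4, 0, 20)]
i_gcv = int(np.argmin([gcv(F, Y) for F in Fs]))
```

Note that `gcv` needs no variance estimate, which is precisely why it can overfit when some $F_\theta$ is close to $I_n$ (the denominator then approaches zero).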
3.2 Minimal and optimal penalties
We deduce from Eq. (3) the bias-variance decomposition of the risk:

$$\mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - f \|_2^2 \right] = n^{-1} \| (F_\theta - I_n) f \|_2^2 + \frac{\sigma^2 \operatorname{tr}(F_\theta F_\theta^\top)}{n} , \qquad (6)$$

and from Eq. (4) the expectation of the empirical risk:

$$\mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - Y \|_2^2 \right] = n^{-1} \| (F_\theta - I_n) f \|_2^2 + \frac{\sigma^2 \operatorname{tr}(F_\theta F_\theta^\top)}{n} - \frac{\sigma^2 \left( 2 \operatorname{tr}(F_\theta) - n \right)}{n} . \qquad (7)$$

Note that the variance term in Eq. (6) is not proportional to the effective dimensionality $\operatorname{tr}(F_\theta)$ but to $\operatorname{tr}(F_\theta F_\theta^\top)$. Although several papers argue these terms are of the same order (for instance, they are equal when $F_\theta$ is a projection matrix), this may not hold in general. If $F_\theta$ is symmetric with a spectrum in $[0, 1]$, as in all the examples of Section 2.2, we only have

$$0 \le \operatorname{tr}(F_\theta F_\theta^\top) \le \operatorname{tr}(F_\theta) \le 2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top) . \qquad (8)$$
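The inequalities in Eq. (8) are easy to verify numerically on kernel ridge smoothers, whose eigenvalues lie strictly inside $(0, 1)$; a sketch assuming NumPy and an illustrative Laplacian kernel:

```python
import numpy as np

# For symmetric F with spectrum in [0, 1], each eigenvalue mu satisfies
# mu^2 <= mu <= 2*mu - mu^2, hence tr(F F^T) <= tr(F) <= 2 tr(F) - tr(F F^T),
# with equalities exactly when F is a projection (eigenvalues 0 or 1).
rng = np.random.default_rng(4)
n = 50
x = rng.uniform(size=n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))
for lam in [1e-3, 1e-2, 1e-1]:
    F = K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
    t1, t2 = np.trace(F), np.trace(F @ F.T)
    assert 0 <= t2 <= t1 <= 2 * t1 - t2 + 1e-10
```

For these smoothers the gap between $\operatorname{tr}(F F^\top)$ and $\operatorname{tr}(F)$ is strict, which is precisely why the minimal penalty below differs from half the ideal penalty.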
In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon concentration inequalities are developed in the Appendix and summarized in Section 4, leading to the same conclusion as the present informal reasoning.
First, as proved in Appendix E, the bias $n^{-1} \| (F_\lambda - I_n) f \|_2^2$ is a decreasing function of the dimensionality $\operatorname{tr}(F_\lambda)$, and the variance $\sigma^2 \operatorname{tr}(F_\lambda F_\lambda^\top) / n$ is an increasing function of $\operatorname{tr}(F_\lambda)$, as is $2 \operatorname{tr}(F_\lambda) - \operatorname{tr}(F_\lambda F_\lambda^\top)$. Therefore, Eq. (6) shows that the optimal $\lambda$ realizes the best trade-off between bias (which decreases with $\operatorname{tr}(F_\lambda)$) and variance (which increases with $\operatorname{tr}(F_\lambda)$), which is a classical fact in model selection.
Second, the expectation of the empirical risk in Eq. (7) can be decomposed into the bias, the constant $\sigma^2$, and a "negative variance" term which is the opposite of

$$\mathrm{pen}_{\min}(\theta) := \frac{\sigma^2 \left( 2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top) \right)}{n} . \qquad (9)$$

As suggested by the notation $\mathrm{pen}_{\min}$, we will show it is a minimal penalty in the following sense. If

$$\forall C \ge 0, \qquad \widehat\theta_{\min}(C) \in \arg\min_{\theta \in \Theta} \left\{ n^{-1} \| \widehat{f}_\theta - Y \|_2^2 + C \, \frac{2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)}{n} \right\},$$

then, up to concentration inequalities that are detailed in Section 4.2, $\widehat\theta_{\min}(C)$ behaves like a minimizer of

$$g_C(\theta) = \mathbb{E}\left[ n^{-1} \| \widehat{f}_\theta - Y \|_2^2 \right] + C \, \frac{2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)}{n} = n^{-1} \| (F_\theta - I_n) f \|_2^2 + ( C - \sigma^2 ) \, \frac{2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)}{n} + \sigma^2 .$$
Therefore, two main cases can be distinguished:

if $C < \sigma^2$, then $g_C(\theta)$ decreases with $2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)$, so that $\operatorname{tr}(F_{\widehat\theta_{\min}(C)})$ is huge: $\widehat{f}_{\widehat\theta_{\min}(C)}$ overfits.

if $C > \sigma^2$, then $g_C(\theta)$ increases with $2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)$ when the latter is large enough, so that $\operatorname{tr}(F_{\widehat\theta_{\min}(C)})$ is much smaller than when $C < \sigma^2$.

As a conclusion, $\mathrm{pen}_{\min}(\theta)$ is the minimal amount of penalization needed so that a minimizer of a penalized criterion is not clearly overfitting.

Following an idea first proposed in [4] and further analyzed or used in several other papers such as [12, 2, 16], we now propose to use the fact that $\mathrm{pen}_{\min}(\theta)$ is a minimal penalty for estimating $\sigma^2$, and to plug this estimator into Eq. (5). This leads to the algorithm described in Section 4.1.

Note that the minimal penalty given by Eq. (9) is new; it generalizes previous results [4, 2] where $\mathrm{pen}_{\min}(\theta) = \sigma^2 \operatorname{tr}(F_\theta) / n$ because all $F_\theta$ were assumed to be projection matrices, i.e., $F_\theta F_\theta^\top = F_\theta$. Furthermore, our results generalize the slope heuristics $\mathrm{pen}_{\mathrm{id}} \approx 2 \, \mathrm{pen}_{\min}$ (only valid for projection estimators [4, 2]) to general linear estimators, for which the ratio between the two penalties can differ from 2.
4 Main results
In this section, we first describe our algorithm and then present our theoretical results.
4.1 Algorithm
The following algorithm first computes an estimator $\widehat{C}$ of $\sigma^2$ using the minimal penalty in Eq. (9), then uses the ideal penalty in Eq. (5) for selecting $\theta$.

Input: $\Theta$ a finite set, and matrices $(F_\theta)_{\theta \in \Theta}$.

1. For every $C > 0$, compute $\widehat\theta_0(C) \in \arg\min_{\theta \in \Theta} \left\{ n^{-1} \| F_\theta Y - Y \|_2^2 + C \, n^{-1} \left( 2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top) \right) \right\}$.

2. Find $\widehat{C}$ such that $\operatorname{tr}(F_{\widehat\theta_0(\widehat{C})})$ is of order at most $\sqrt{n}$.

3. Select $\widehat\theta \in \arg\min_{\theta \in \Theta} \left\{ n^{-1} \| F_\theta Y - Y \|_2^2 + 2 \widehat{C} \, n^{-1} \operatorname{tr}(F_\theta) \right\}$.

In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used for $C$, and our theoretical results from the next section suggest an appropriate order for the step-size. Note that it may not be possible in all cases to find a $\widehat{C}$ satisfying the condition of step 2 exactly; the condition can therefore be relaxed to finding a $\widehat{C}$ such that $\operatorname{tr}(F_{\widehat\theta_0(C)})$ is large for all $C < \widehat{C}$ and small for all $C > \widehat{C}$.

Alternatively, using the same grid in log-scale, we can select $\widehat{C}$ with maximal jump between successive values of $\operatorname{tr}(F_{\widehat\theta_0(C)})$; note that our theoretical result then does not entirely hold, as we show the presence of a jump around $\sigma^2$, but do not show the absence of similar jumps elsewhere.
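The whole procedure can be sketched compactly as follows. This is a minimal illustration under stated assumptions (NumPy; the grid, kernel, data, and the simple "df at most $\sqrt{n}$" stopping rule are illustrative choices, not the exact constants of the paper):

```python
import numpy as np

def minimal_penalty_select(Y, Fs, C_grid):
    """Sketch of the algorithm of Section 4.1:
    1) for each C, minimize emp. risk + C * (2 tr F - tr F F^T) / n;
    2) take sigma2_hat = smallest C on the grid whose minimizer has small df;
    3) select the F minimizing emp. risk + 2 * sigma2_hat * tr(F) / n."""
    n = len(Y)
    emp = np.array([np.sum((Y - F @ Y) ** 2) / n for F in Fs])
    pen_min = np.array([(2 * np.trace(F) - np.trace(F @ F.T)) / n for F in Fs])
    df = np.array([np.trace(F) for F in Fs])
    sigma2_hat = C_grid[-1]          # fallback if the grid never crosses the jump
    for C in C_grid:                 # C_grid is assumed increasing
        i_C = int(np.argmin(emp + C * pen_min))
        if df[i_C] <= np.sqrt(n):    # illustrative threshold for "small df"
            sigma2_hat = C
            break
    i_hat = int(np.argmin(emp + 2 * sigma2_hat * df / n))
    return i_hat, sigma2_hat

# Illustrative run with a kernel ridge family.
rng = np.random.default_rng(5)
n = 100
x = np.sort(rng.uniform(size=n))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
Y = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(n)
Fs = [K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
      for lam in np.logspace(-6, 1, 30)]
i_hat, s2 = minimal_penalty_select(Y, Fs, C_grid=np.logspace(-3, 1, 60))
```

Precomputing the traces makes step 1 cheap for every $C$, so scanning a fine grid costs little more than a single fit per $\theta$.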
4.2 Oracle inequality
Theorem 1.
Let $\widehat{C}$ and $\widehat\theta$ be defined as in the algorithm of Section 4.1. Assume that, for every $\theta \in \Theta$, $F_\theta$ is symmetric with spectrum in $[0, 1]$, that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Gaussian with variance $\sigma^2 > 0$, and that
() 
Then, a numerical constant and an event of probability at least exist on which, for every ,
(10) 
Furthermore, if
() 
then, a constant depending only on exists such that for every , on the same event,
(11) 
Theorem 1 is proved in the Appendix. The proof mainly follows from the informal arguments developed in Section 3.2, completed with the following two concentration inequalities: if $\xi \in \mathbb{R}^n$ is a standard Gaussian random vector and $M$ is a real-valued matrix, then for every $x \ge 0$,
(12)  
(13)
where $\| M \|$ is the operator norm of $M$. A proof of Eq. (12) and Eq. (13) can be found in Appendix D.
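A quick Monte Carlo sanity check of the expectations these inequalities concentrate around, namely $\mathbb{E}\|M\xi\|_2^2 = \operatorname{tr}(M M^\top)$ and $\mathbb{E}\langle \xi, M\xi \rangle = \operatorname{tr}(M)$ (NumPy assumed; this checks only the means, not the full deviation bounds):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 30, 20000
M = rng.standard_normal((n, n)) / np.sqrt(n)   # an arbitrary real matrix
xi = rng.standard_normal((reps, n))            # 20000 standard Gaussian vectors

quad = np.sum((xi @ M.T) ** 2, axis=1)         # ||M xi||_2^2 for each draw
chaos = np.einsum("ri,ij,rj->r", xi, M, xi)    # <xi, M xi> for each draw

# Empirical means should match tr(M M^T) and tr(M) up to Monte Carlo error.
err_quad = abs(quad.mean() - np.trace(M @ M.T))
err_chaos = abs(chaos.mean() - np.trace(M))
```

With $M = F_\theta$ and $\xi = \varepsilon/\sigma$, these are exactly the quadratic terms appearing in Eqs. (3) and (4).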
4.3 Discussion of the assumptions of Theorem 1
Gaussian noise. When $\varepsilon$ is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for $\xi = \sigma^{-1} \varepsilon$ at the price of additional technicalities, which implies that Theorem 1 is still valid.
Symmetry. The assumption that the matrices $F_\theta$ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq. (13) a concentration inequality for $\langle \xi, F_\theta \xi \rangle$. Note that requiring the spectrum of $F_\theta$ to lie in $[0, 1]$ is barely an assumption, since it means that $F_\theta$ actually shrinks $Y$.
Assumptions (). () holds if and the bias is smaller than for some , a quite classical assumption in the context of model selection. Besides, () is much less restrictive and can even be relaxed, see Appendix B.
Assumption (). The upper bound () on the risk of the oracle is certainly the strongest assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), () holds when $F_{\theta^\star}$ is a projection matrix, since then $\operatorname{tr}(F_{\theta^\star} F_{\theta^\star}^\top) = \operatorname{tr}(F_{\theta^\star})$. In the kernel ridge regression framework, () holds as soon as the eigenvalues of the kernel matrix $K$
decrease fast enough; see Appendix E. In general, () means that the oracle should not have a risk smaller than the parametric convergence rate associated with a model of dimension $\operatorname{tr}(F_{\theta^\star})$.

When () does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem, and it may not be possible to attain the risk of the oracle in general. Nevertheless, an oracle inequality can still be proved without (), at the price of enlarging $\widehat{C}$ slightly and adding a small fraction of the variance term in the right-hand side of Eq. (11); see Appendix C. Enlarging $\widehat{C}$ is necessary in general: if $\operatorname{tr}(F_\theta F_\theta^\top) \ll \operatorname{tr}(F_\theta)$ for most $\theta \in \Theta$, the minimal penalty is very close to the ideal penalty, so that according to Eq. (10), overfitting is likely as soon as $\widehat{C}$ underestimates $\sigma^2$, even by a very small amount.
4.4 Main consequences of Theorem 1 and comparison with previous results
Consistent estimation of $\sigma^2$. The first part of Theorem 1 shows that $\widehat{C}$ is a consistent estimator of $\sigma^2$ in a general framework and under mild assumptions. Compared to classical estimators of $\sigma^2$, such as the one usually used with Mallows' $C_p$, $\widehat{C}$ does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating $\sigma^2$ by an unknown amount [8].
Oracle inequality. Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. (11): the risk of the selected estimator is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality $\operatorname{tr}(F_{\theta^\star})$ grows fast enough with $n$, a typical situation when the bias is never equal to zero, for instance in kernel ridge regression.
Several oracle inequalities have been proved in the statistical literature for Mallows' $C_L$ with a consistent estimator of $\sigma^2$, for instance in [13]. Nevertheless, except for the model selection problem (see [4] and references therein), all previous results were asymptotic, meaning that $n$ is implicitly assumed to be large compared to each parameter of the problem. This assumption can be problematic for several learning problems, for instance in multiple kernel learning, where the number of kernels may grow with $n$. On the contrary, Eq. (11) is non-asymptotic, meaning that it holds for every fixed $n$ as soon as the assumptions explicitly made in Theorem 1 are satisfied.
Comparison with other procedures. According to Theorem 1 and previous theoretical results [13, 5], $C_L$, GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. This should not lead to the conclusion that these procedures are completely equivalent: second-order terms can be large for a given $n$, while they are hidden in asymptotic results and not tightly estimated by non-asymptotic results. As shown by the simulations in Section 5, our algorithm yields statistical performance as good as existing methods, and often better.
Furthermore, our algorithm never overfits too much, because $\operatorname{tr}(F_{\widehat\theta})$ is by construction smaller than the effective dimensionality at which the jump occurs. This is quite an interesting property compared, for instance, to GCV, which is likely to overfit if it is not corrected, because GCV minimizes a criterion proportional to the empirical risk.
5 Simulations
Throughout this section, we consider exponential kernels on , , with the ’s sampled i.i.d. from a standard multivariate Gaussian. The functions are then selected randomly as , where both and are i.i.d. standard Gaussian (i.e., belongs to the RKHS).
Jump. In Figure 2 (left), we consider data and study the size of the jump for kernel ridge regression. With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do. In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. In addition, we show two ways of optimizing over $\lambda$ and $\eta$: discrete optimization with different kernel matrices, a situation covered by Theorem 1, or continuous optimization with respect to $\eta$ in Eq. (1) by gradient descent, a situation not covered by Theorem 1.
Comparison of estimator selection methods. In Figure 3, we plot model selection results for 20 replications of data (, ), comparing GCV [7], our minimal penalty algorithm, and crossvalidation methods. In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate ), and use for crossvalidation all possible values of . In the right part (multiple kernel), we compare to the performance of Mallows’ when is known (i.e., penalty in Eq. 5), and since we cannot enumerate all ’s, we use the solution obtained by MKL with CV [3]. We also compare to using our minimal penalty algorithm with the sum of kernels.
6 Conclusion
A new light on the slope heuristics. Theorem 1 generalizes some results first proved in [4], where all $F_\theta$ are assumed to be projection matrices, a framework in which assumption () is automatically satisfied. To this extent, Birgé and Massart's slope heuristics has been modified in a way that sheds a new light on the "magical" factor 2 between the minimal and the optimal penalty, as proved in [4, 2]. Indeed, Theorem 1 shows that for general linear estimators,

$$\frac{\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(\theta)]}{\mathrm{pen}_{\min}(\theta)} = \frac{2 \operatorname{tr}(F_\theta)}{2 \operatorname{tr}(F_\theta) - \operatorname{tr}(F_\theta F_\theta^\top)} , \qquad (14)$$

which can take any value in $(1, 2]$ in general; this ratio is only equal to 2 when $\operatorname{tr}(F_\theta F_\theta^\top) = \operatorname{tr}(F_\theta)$, hence mostly when $F_\theta$ is a projection matrix.
Future directions.
In the case of projection estimators, the slope heuristics still holds when the design is random and the data are heteroscedastic
[2]; we would like to know whether Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the good empirical performance of elbow-heuristics-based algorithms (i.e., algorithms based on the sharp variation of a certain quantity around good hyperparameter values) suggests that Theorem 1 can be generalized to many learning frameworks (and potentially to non-linear estimators), probably with small modifications in the algorithm, but always relying on the concept of minimal penalty.

Another interesting open problem would be to extend the results of Section 4, where a finite $\Theta$ is assumed, to continuous sets $\Theta$ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that Theorem 1 is valid without modification for a "small" continuous $\Theta$, such as in kernel ridge regression, where taking a sufficiently fine grid in log-scale is almost equivalent to taking $\Theta = \mathbb{R}_+$. On the contrary, in applications such as the Lasso with many variables, the natural set $\Theta$ cannot be well covered by a grid of small cardinality, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.
Appendix
This appendix is mainly devoted to the proof of Theorem 1, which is split into two results. First, Proposition 1 shows that $\mathrm{pen}_{\min}$ is a minimal penalty, so that $\widehat{C}$ defined in the algorithm of Section 4.1 consistently estimates $\sigma^2$. Second, Proposition 2 shows that penalizing the empirical risk with $2 C \operatorname{tr}(F_\theta) / n$ and $C$ close to $\sigma^2$ leads to an oracle inequality. Proving Theorem 1 is straightforward by combining Propositions 1 and 2.
In Section A, we introduce some notation and make some computations that will be used in the following. Proposition 1 is proved in Section B. Proposition 2 is proved in Section C. Concentration inequalities needed for proving Propositions 1 and 2 are stated and proved in Section D. Computations specific to the kernel ridge regression example are made in Section E.
Appendix A Notation and first computations
Recall that $Y = f + \varepsilon \in \mathbb{R}^n$,
where $f$ is deterministic, $\varepsilon$ is centered with covariance matrix $\sigma^2 I_n$, and $I_n$ is the $n \times n$ identity matrix. For every $\theta \in \Theta$, $\widehat{f}_\theta = F_\theta Y$ for some real-valued $n \times n$ matrix $F_\theta$, so that
(15)  
(16) 
where , and .
Note that , and are deterministic, and for all , all are random with zero mean. In particular, we deduce the following expressions of the risk and the empirical risk of :
(19)  
(20) 
Define
Since , we have
(21) 
In addition, if has a spectrum , then
so that
(22) 
Appendix B Minimal penalty
Define
(23) 
We will prove the following proposition in this section.
Proposition 1.
Let $\widehat{C}$ be defined by Eq. (23). Assume that, for every $\theta \in \Theta$, $F_\theta$ is symmetric with spectrum in $[0, 1]$, that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. Gaussian with zero mean and variance $\sigma^2 > 0$, and that
()  
() 
Then, a numerical constant exists such that for every , for every ,
(24)  
(25) 
hold with probability at least .
If , Proposition 1 with proves that with probability at least , defined in the Algorithm of Section 4.1 exists and
Remark 1.
Remark 2.
Remark 3.
Let us now prove Proposition 1.