1 Introduction and Motivation
We study optimal recovery of the regression function in the framework of reproducing kernel Hilbert space (RKHS) learning. Here we are given random and noisy observations of the form
at i.i.d. data points , drawn according to some unknown distribution on some input space , taken as a standard Borel space. More precisely, we assume that the observed data
are sampled i.i.d. from an unknown probability measure
on , with , so that the distribution of may depend on , while satisfying . For simplicity, we take the output space to be the set of real numbers, but this could be generalized to any separable Hilbert space, see [8]. In our setting, an estimator for lies in a hypothesis space , which we choose to be a separable reproducing kernel Hilbert space (RKHS) with a measurable positive semidefinite kernel , satisfying .
More precisely, we confine ourselves to estimators arising from the fairly large class of spectral regularization methods, see e.g. [12], [1], [10], [5]. This class of methods contains the well-known Tikhonov regularization, Landweber iteration, and spectral cut-off.
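For readers unfamiliar with this class, the three filters just named can be sketched numerically. The following is a minimal illustration, not the paper's method: the Gaussian test kernel, the function names, and the Landweber step size are our own assumptions, and we assume a bounded kernel so that the eigenvalues of the normalized kernel matrix lie in [0, 1] (needed for the unit Landweber step).

```python
import numpy as np

def filter_tikhonov(t, lam):
    """Tikhonov regularization: g_lambda(t) = 1 / (t + lambda)."""
    return 1.0 / (t + lam)

def filter_spectral_cutoff(t, lam):
    """Spectral cut-off: g_lambda(t) = 1/t for t >= lambda, else 0."""
    return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

def filter_landweber(t, lam, step=1.0):
    """Landweber iteration, run for m ~ 1/lambda steps:
    g_lambda(t) = (1 - (1 - step * t)^m) / t, with limit m * step at t = 0.
    Assumes step * t <= 1 (bounded kernel)."""
    m = int(np.ceil(1.0 / lam))
    t_safe = np.where(t > 0, t, 1.0)             # avoid 0/0; patched below
    val = (1.0 - (1.0 - step * t) ** m) / t_safe
    return np.where(t > 0, val, m * step)

def spectral_estimator(K, y, lam, g):
    """Apply the filter g spectrally to the empirical operator K/n and
    return coefficients alpha with f_hat(x) = sum_i alpha_i k(x, x_i)."""
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)
    evals = np.clip(evals, 0.0, None)            # PSD kernel: clip round-off
    return evecs @ (g(evals, lam) * (evecs.T @ y)) / n
```

For the Tikhonov filter this reduces, as a sanity check, to the familiar kernel ridge regression coefficients `(K + n*lam*I)^{-1} y`.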
We recall that while tuning the regularization parameter is essential for spectral regularization to work well, an a priori choice of the regularization parameter is in general not feasible in statistical problems, since the choice necessarily depends on unknown structural properties (e.g. smoothness of the target function or behavior of the statistical dimension). This imposes the need for data-driven a posteriori choices of the regularization parameter, which hopefully are optimal in some well-defined sense. An attractive approach is (some version of) the balancing principle going back to Lepskii’s seminal paper [15]
in the context of Gaussian white noise, having been elaborated by Lepskii himself in a series of papers and by other authors, see e.g.
[16], [17], [13], [2], [19] and references therein. Before we present our somewhat abstract approach, we motivate the general idea in a specific example. Denoting by
the kernel integral operator associated to and the sampling measure , we recall from [5] that the optimal regularization parameter (as well as the rate of convergence) is determined by the source condition assumption for some constants as well as by an assumed power decay of the effective dimension
with intrinsic dimensionality
and by the noise variance
. Error estimates are usually established by deriving a bias-variance decomposition, which in this special case takes the form (1.1) 
holding with probability at least , for any , provided is sufficiently large. Here, the function is the leading order of an upper bound for the approximation error and is the leading order of an upper bound for the sample error
. We combine all parameters in a vector
with and . The optimal regularization parameter is chosen by balancing the two leading error terms, more precisely by choosing as the unique solution of (1.2) 
leading to the resulting error estimate
with probability at least .
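For intuition, the balance equation (1.2) can be solved numerically once the two leading error terms are specified. The power-law forms `A(lam) = R * lam**r` and `S(lam) = sigma * sqrt(lam**(-1/b) / n)` below are illustrative assumptions (consistent with a Hölder source condition and a power decay of the effective dimension), not the paper's exact bounds; all constants are made up for the sketch.

```python
import numpy as np

def balance(A, S, lo=1e-12, hi=1.0, iters=200):
    """Find lam with A(lam) = S(lam) by bisection on a logarithmic scale.
    Assumes A is increasing and S is decreasing in lam, with a single
    crossing inside (lo, hi)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)        # geometric midpoint
        if A(mid) < S(mid):
            lo = mid                  # approximation error still below sample error
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Illustrative parameters: sample size n, smoothness r, eigenvalue decay b.
n, r, b, R, sigma = 10_000, 0.5, 2.0, 1.0, 1.0
A = lambda lam: R * lam ** r                          # approximation error
S = lambda lam: sigma * np.sqrt(lam ** (-1.0 / b) / n)  # sample error

lam_star = balance(A, S)
# For these power laws the crossing has the closed form
# lam* = (sigma**2 / (R**2 * n)) ** (b / (2*b*r + 1)).
```

The closed-form exponent recovers the familiar scaling of the optimal regularization parameter with the sample size for this type of model.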
The associated sequence of estimated solutions
, depending on the regularization parameter
was called weakly/strongly minimax optimal over the model family with rate of convergence
given by , pointwise for any fixed .
However, if the parameter in the source condition or the intrinsic dimensionality are unknown,
an a priori choice of the theoretically best value as in (1.2) is impossible. Therefore, it is necessary to use
some a posteriori choice of , independent of the parameter . Our aim is to construct an estimator
, i.e. to find a sequence of regularization parameters ,
without knowledge of , but depending on the data , on and on the confidence level,
such that is (minimax) optimal adaptive in the sense of Definition 3.1.
Contribution: More generally, we derive adaptivity in the case where the approximation error is upper bounded by some increasing unknown function and where
is an upper bound for the sample error. Crucial for our approach is a two-sided estimate of the effective dimension in terms of its empirical approximation. In particular, this allows us to control the spectral structure of the covariance operator through the given input data. In summary, our approach achieves:

A fully data-driven estimator for the whole class of spectral regularization algorithms, which does not rely on data splitting (as, e.g., cross-validation does).

Adaptation to unknown smoothness and unknown covariance structure.

One for all: balancing in (which is easiest) automatically gives optimal balancing in the stronger norm (an analogous result is open for other approaches to data-dependent choices of the regularization parameter).
The paper is organized as follows: In Section 2 we provide a two-sided estimate of the effective dimension by its empirical counterpart. The main results are presented in Section 3, followed by specific examples in Section 4. A more detailed discussion is given in Section 5. The proofs are collected in the Appendix.
2 Empirical Effective Dimension
The main point of this section is a two-sided estimate of the effective dimension by its empirical approximation, which is crucial for our entire approach. We recall the definition of the effective dimension and introduce its empirical approximation, the empirical effective dimension: For we set
(2.1) 
where we introduce the shorthand notation and similarly . Here depends on the marginal (through ), but is considered deterministic, while
is considered as a random variable.
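Numerically, the empirical effective dimension is directly computable from the kernel matrix of the sample. The following hedged sketch computes it from the eigenvalues of the normalized kernel matrix; the Gaussian kernel, its bandwidth, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def empirical_effective_dimension(K, lam):
    """N_hat(lam) = trace( (K/n) * (K/n + lam * I)^{-1} )
                  = sum_i mu_i / (mu_i + lam),
    where mu_i are the eigenvalues of the normalized kernel matrix K/n."""
    mu = np.clip(np.linalg.eigvalsh(K / K.shape[0]), 0.0, None)
    return float(np.sum(mu / (mu + lam)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
K = gaussian_kernel(X, X)
# N_hat is strictly decreasing in lam and bounded by the sample size:
vals = [empirical_effective_dimension(K, lam) for lam in (1e-3, 1e-2, 1e-1)]
```

Note that `N_hat` plays the role of the random quantity above, while its population counterpart, defined through the marginal, is deterministic but unknown.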
Proposition 2.1.
For any , with probability at least
(2.2) 
for all and .
Corollary 2.2.
For any , with probability at least , one has
as well as
where . In particular, if , with probability at least one has
3 Balancing Principle
In this section, we present the main ideas related to the balancing principle and make the informal presentation from the Introduction more precise. We begin with a definition:
Definition 3.1.
Let be sets and let, for , be a class of data generating distributions on . For each let be an algorithm. If there is a sequence and a parameter choice (not depending on ) such that
(3.1) 
and
(3.2) 
where the infimum is taken over all estimators , then the sequence of estimators is called minimax optimal adaptive over and the model family , with respect to the family of rates , for the interpolation norm of parameter .
We remind the reader from [5] that upper estimates typically hold on a class and lower estimates hold on a possibly different class , the model class in the above definition being the intersection of both.
To find such an adaptive estimator, we apply a method which is known in the statistical literature as Balancing Principle. Throughout this section we need
Assumption 3.2.
Let be a class of models. We consider a discrete set of possible values for the regularization parameter
for some . Let and . We assume to have the following error decomposition uniformly over the grid :
(3.3) 
where
(3.4) 
with probability at least , for all data generating distributions from . The bounds and are given by
with and
where is increasing, satisfying and for some constants , . We further define .
We remark that it is actually sufficient to assume (3.3) for and . Interpolation via inequality implies validity of (3.3) for any .
Note that for any , the map as well as are strictly decreasing in . Also, if is sufficiently large and if is sufficiently small, .
We let
In this definition we have replaced and by and , thus including the remainder terms and into our definition of . It will emerge a posteriori that the definition of is not affected, since the remainder terms turn out to be subleading; a priori, however, this is not known. A correct proof of the crucial oracle inequality in Lemma 3.8 below is much easier with this definition of .
The grid has to be designed such that the optimal value is contained in .
The best estimator for within belongs to the set
and is given by
(3.5) 
In particular, since we assume that and , there is some such that . Note also that the choice of the grid has to depend on .
Before we define the balancing principle estimate of , we give some intuition of its possible choice: For any , we have . Moreover, for any we have
Finally, since is decreasing, Assumption 3.2 gives for any two satisfying , with probability at least
(3.6) 
An essential step is to find an empirical approximation of the sample error. In view of Corollary 2.2 we define
with and the empirical effective dimension given in (2.1). Corollary 2.2 implies uniformly in
(3.7) 
with probability at least , provided
(3.8) 
Substituting (3.7) into the right-hand side of the estimate (3.6) motivates our definition of the balancing principle estimate of as follows:
Definition 3.3.
Given , and , we set
and define
(3.9) 
Notice that as well as depend on the confidence level .
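Schematically, a Lepskii-type balancing selection over the grid can be implemented as follows. This is an illustrative simplification, not the precise Definition 3.3: the comparison constant `C`, the stopping rule, and the names are our own, and `S_hat[j]` stands for the empirical sample-error bound at the j-th (increasing) grid point, so it is decreasing in `j`.

```python
import numpy as np

def balancing_select(estimators, S_hat, C=4.0):
    """Schematic balancing principle over a grid lam_1 < ... < lam_M:
    return the largest index j such that the estimator f_j stays within
    C * S_hat[k] of every f_k with k < j (where the sample error,
    bounded by S_hat[k], still dominates)."""
    j_star = 0
    for j in range(len(estimators)):
        ok = all(
            np.linalg.norm(estimators[j] - estimators[k]) <= C * S_hat[k]
            for k in range(j)
        )
        if ok:
            j_star = j
        else:
            break
    return j_star
```

The idea is the standard Lepskii comparison: as long as two estimators differ by no more than their (larger) sample-error bound, the approximation error has not yet taken over, so a larger regularization parameter is still admissible.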
For the analysis it will be important that the grid has a certain regularity. We summarize all required conditions in
Assumption 3.4.
(on the grid)

Assume that and .

(Regularity of the grid) There is some such that the elements in the grid obey , .

Choose as the unique solution of . We require that is sufficiently large, such that (so that the maximum in the definition of can be dropped). We further assume that .
Note that as . Then, since as , this satisfies . Furthermore, a short argument shows that the optimal value indeed satisfies , if is sufficiently large. Since as , we get as . Since by definition, it follows that for sufficiently large. From the definition of as a supremum, we actually have for sufficiently large.
Under the regularity assumption, we find that
(3.10) 
Indeed, while the effective dimension is decreasing, the related function is nondecreasing. Hence we find that
and since
Therefore
One also easily verifies that
implying (3.10).
Remark 3.5.
The typical case in which Assumption 3.4 holds is when the parameters follow a geometric progression, i.e., for some we let , and with . In this case we are able to upper bound the total number of grid points in terms of . In fact, since , simple calculations lead to
Recall that the starting point is required to obey if is sufficiently large, implying . Finally, we obtain for sufficiently large
(3.11) 
with .
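The geometric grid of this remark, together with the logarithmic bound (3.11) on the number of grid points, can be illustrated numerically. The concrete choices q = 1.5 and a starting point of order 1/n are assumptions for this sketch.

```python
import numpy as np

def geometric_grid(lam_min, lam_max=1.0, q=1.5):
    """Geometric grid lam_j = lam_min * q**j, truncated at lam_max."""
    grid = []
    lam = lam_min
    while lam <= lam_max:
        grid.append(lam)
        lam *= q
    return np.array(grid)

n = 10_000
grid = geometric_grid(1.0 / n)
# The number of grid points satisfies M <= log(n) / log(q) + 1,
# i.e. it grows only logarithmically in the sample size.
M = len(grid)
```

This logarithmic cardinality is what keeps the union bound over the grid (and hence the loss in the confidence level) benign.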
We shall need an additional assumption on the effective dimension:
Assumption 3.6.

For some and for any sufficiently small
for some .

For some and for any sufficiently small
for some .
Note that such an additional assumption restricts the class of admissible marginals and shrinks the class in Assumption 3.2 to a subclass . Such lower and upper bounds hold in all examples which we encounter in Section 4.
We further remark that Assumption 3.6 ensures a precise asymptotic behavior for of the form
(3.12) 
for some , .
3.0.1 Main Results
The first result is of preparatory character.
Proposition 3.7.
We shall need
Lemma 3.8.
If Assumption 3.4 holds, then
(3.13) 
We immediately arrive at our first main result of this section:
Theorem 3.9.
In particular, choosing a geometric grid and assuming a lower and upper bound on the effective dimension, we obtain:
Corollary 3.10.
Note that as .
3.0.2 One for All: Balancing is sufficient!
This section builds on an idea suggested by P. Mathé (itself inspired by the work [3]), which we have worked out in detail. We define the balancing estimate according to Definition 3.3 by explicitly choosing (in contrast to Theorem 3.9, where we choose depending on the norm parameter ). Our main result states that balancing in the norm suffices to automatically give balancing in all other (stronger!) intermediate norms , for any .
Theorem 3.11.
In particular, choosing a geometric grid and assuming a lower and upper bound on the effective dimension, we obtain:
Corollary 3.12.
Note that as .
Remark 3.13.
Still, our choice for is only a theoretical value which remains unknown as it depends on the unknown marginal through the effective dimension
. Implementation requires a data-driven choice. Heuristically, it seems reasonable to proceed as follows. Let
and , (we start from the right and reverse the order). Define the stopping index and let . Here, depends on the empirical effective dimension , see (2.1), which by Corollary 2.2 is close to the unknown effective dimension . Thus we consider the above choice of reasonable for implementing the dependence of on the unknown marginal. A complete mathematical analysis is in development.
4 Specific Examples
We proceed by illustrating some specific examples of our method as described in the previous section. In view of Theorem 3.11 and Corollary 3.12, it suffices to consider balancing in only. We always choose a geometric grid as in Remark 3.5, satisfying .
(1) The regular case
We consider the setting of [5]
, where the eigenvalues of
decay polynomially (with parameter ), the target function satisfies a Hölder-type source condition, and the noise satisfies a Bernstein assumption
(4.1) 
for any integer and for some and . We combine all structural parameters in a vector , with and . We are interested in adaptivity over .
It has been shown in [5], that the corresponding minimax optimal rate is given by
We shall now check the validity of our Assumption 3.2. In the following, we assume that the data generating distribution belongs to the class , defined in [5]. Recall that we let be determined as the unique solution of . Then, we have uniformly for all data generating distributions from the class , with probability at least , for any ,
for sufficiently large, with
where does not depend on the parameters . Remember that the optimal choice for the regularization parameter is obtained by solving
and belongs to the interval . This can be seen by the following argument: If is sufficiently large
which is equivalent to . Since is strictly decreasing we conclude . Here we use the bound