Adaptivity for Regularized Kernel Methods by Lepskii's Principle

04/15/2018
by   Nicole Mücke, et al.
Istituto Italiano di Tecnologia

We address the problem of adaptivity in the framework of reproducing kernel Hilbert space (RKHS) regression. More precisely, we analyze estimators arising from a linear regularization scheme g_λ. In practical applications, an important task is to choose the regularization parameter appropriately, i.e. based only on the given data and independently of unknown structural assumptions on the regression function. An attractive approach avoiding data-splitting is the Lepskii Principle (LP), also known as the Balancing Principle in this setting. We show that a modified parameter choice based on (LP) is minimax optimal adaptive, up to a log log(n) factor. A convenient result is the fact that balancing in the L^2(ν)-norm, which is easiest, automatically gives optimal balancing in all stronger norms interpolating between L^2(ν) and the RKHS. An analogous result is open for other classical approaches to data-dependent choices of the regularization parameter, e.g. for hold-out.


1 Introduction and Motivation

We study optimal recovery of the regression function in the framework of reproducing kernel Hilbert space (RKHS) learning. Here we are given random and noisy observations of the form $y_i = f_\rho(x_i) + \varepsilon_i$, $i = 1, \dots, n$, at i.i.d. data points $x_1, \dots, x_n$, drawn according to some unknown distribution $\nu$ on some input space $X$, taken as a standard Borel space. More precisely, we assume that the observed data $(x_1, y_1), \dots, (x_n, y_n)$ are sampled i.i.d. from an unknown probability measure $\rho$ on $X \times Y$, with $\rho(dx, dy) = \rho(dy \mid x)\, \nu(dx)$, so that the conditional distribution of $y$ may depend on $x$, while satisfying $\int_Y y \, \rho(dy \mid x) = f_\rho(x)$. For simplicity, we take the output space $Y$ as the set of real numbers, but this could be generalized to any separable Hilbert space, see [8].

In our setting, an estimator for $f_\rho$ lies in a hypothesis space $\mathcal{H}$, which we choose to be a separable reproducing kernel Hilbert space (RKHS), having a measurable positive semi-definite kernel $K$ satisfying the uniform bound $\sup_{x \in X} K(x, x) \le \kappa^2 < \infty$.

More precisely, we confine ourselves to estimators arising from the fairly large class of spectral regularization methods, see e.g. [12], [1], [10], [5]. This class of methods contains the well-known Tikhonov regularization, Landweber iteration and spectral cut-off.
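To make this class of methods concrete, the following sketch applies three standard filter functions to the eigendecomposition of the normalized kernel Gram matrix. It is a generic illustration of spectral regularization under common conventions (bounded kernel, filter applied to K/n), not the implementation used in the paper; the function names and the Landweber step count are choices made here for illustration.

```python
import numpy as np

def spectral_estimator(K, y, lam, filter_name="tikhonov"):
    """Kernel spectral regularization via an eigendecomposition of the
    normalized Gram matrix K_hat = K / n.  Returns the coefficient vector
    alpha with f(x) = sum_i alpha_i K(x, x_i).

    Generic sketch of the class of methods referred to in the text
    (Tikhonov, spectral cut-off, Landweber); not the authors' code."""
    n = K.shape[0]
    K_hat = K / n
    evals, evecs = np.linalg.eigh(K_hat)      # K_hat = V diag(evals) V^T
    evals = np.clip(evals, 0.0, None)         # guard against round-off

    if filter_name == "tikhonov":             # g_lam(t) = 1 / (t + lam)
        g = 1.0 / (evals + lam)
    elif filter_name == "cutoff":             # g_lam(t) = 1/t if t >= lam else 0
        g = np.where(evals >= lam, 1.0 / np.maximum(evals, lam), 0.0)
    elif filter_name == "landweber":          # about 1/lam gradient-descent steps;
        m = max(int(np.ceil(1.0 / lam)), 1)   # assumes eigenvalues of K_hat lie in [0, 1]
        g = np.where(evals > 0,
                     (1.0 - (1.0 - evals) ** m) / np.maximum(evals, 1e-12),
                     float(m))
    else:
        raise ValueError(filter_name)

    # alpha = (1/n) g_lam(K_hat) y; for Tikhonov this is (K + n lam I)^{-1} y,
    # i.e. ordinary kernel ridge regression.
    return (evecs * g) @ (evecs.T @ y) / n
```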

We recall that while tuning the regularization parameter is essential for spectral regularization to work well, an a priori choice of the regularization parameter is in general not feasible in statistical problems, since the choice necessarily depends on unknown structural properties (e.g. smoothness of the target function or behavior of the statistical dimension). This imposes the need for data-driven a posteriori choices of the regularization parameter, which hopefully are optimal in some well-defined sense. An attractive approach is (some version of) the balancing principle going back to Lepskii's seminal paper [15] in the context of Gaussian white noise, which has since been elaborated by Lepskii himself in a series of papers and by other authors, see e.g. [16], [17], [13], [2], [19] and references therein.

Before we present our somewhat abstract approach, we shall motivate the general idea in a specific example. In terms of the kernel integral operator associated with the kernel and the sampling measure $\nu$, we recall from [5] that the optimal regularization parameter (as well as the rate of convergence) is determined by a source condition assumption on the regression function (with given smoothness constants), by an assumed power decay of the effective dimension $\mathcal{N}(\lambda)$ governed by an intrinsic dimensionality parameter, and by the noise variance $\sigma^2$. Error estimates are usually established by deriving a bias-variance decomposition, which in this special case takes the form

(1.1)

holding with probability at least $1 - \eta$, for any confidence parameter $\eta \in (0, 1]$, provided $n$ is big enough. Here, the first term in (1.1) is the leading order of an upper bound for the approximation error, and the second term is the leading order of an upper bound for the sample error. We combine all structural parameters in a vector. The optimal regularization parameter $\lambda_n$ is chosen by balancing the two leading error terms, more precisely by choosing $\lambda_n$ as the unique solution of

(1.2)

leading to the resulting error estimate with probability at least $1 - \eta$. The associated sequence of estimated solutions, depending on the regularization parameter $\lambda_n$, was called weak/strong minimax optimal over the model family in [5], with rate of convergence determined by the structural parameters, pointwise for any fixed parameter vector.
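Numerically, the balancing in (1.2) can be pictured by scanning a grid and picking the point where an increasing approximation-error bound meets a decreasing sample-error bound. The two bound functions below are hypothetical placeholders with the typical shapes discussed above (a Hölder-type approximation error and a sample-error proxy driven by the effective dimension), not the exact expressions of [5].

```python
import numpy as np

def balance(approx_err, sample_err, grid):
    """Return the grid point at which the increasing approximation-error
    bound and the decreasing sample-error bound are closest, mimicking
    the balancing equation (1.2)."""
    gaps = [abs(approx_err(lam) - sample_err(lam)) for lam in grid]
    return grid[int(np.argmin(gaps))]

# Illustrative shapes: a Hoelder-type approximation error ~ R * lam**r and a
# sample-error proxy ~ sigma * sqrt(N(lam) / n) with N(lam) ~ lam**(-1/b).
n, R, r, sigma, b = 10_000, 1.0, 0.5, 1.0, 2.0
grid = np.logspace(-8, 0, 200)
lam_star = balance(lambda lam: R * lam ** r,
                   lambda lam: sigma * np.sqrt(lam ** (-1.0 / b) / n),
                   grid)
print(lam_star)  # the balanced regularization parameter on this grid
```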

However, if the smoothness parameter in the source condition or the intrinsic dimensionality is unknown, an a priori choice of the theoretically best value $\lambda_n$ as in (1.2) is impossible. Therefore, it is necessary to use some a posteriori choice of the regularization parameter, independent of the structural parameters. Our aim is to construct such an estimator, i.e. to find a sequence of regularization parameters, chosen without knowledge of the structural parameters but depending on the data, on the sample size and on the confidence level, such that the resulting estimator is (minimax) optimal adaptive in the sense of Definition 3.1.

Contribution: More generally, we derive adaptivity in the case where the approximation error is upper bounded by some increasing, unknown function and where the sample error is upper bounded by an explicit function involving the effective dimension. Crucial for our approach is a two-sided estimate of the effective dimension in terms of its empirical approximation. This in particular allows us to control the spectral structure of the covariance operator through the given input data. In summary, our approach achieves:

  1. A fully data-driven estimator for the whole class of spectral regularization algorithms, which does not use data splitting (as, e.g., cross-validation does).

  2. Adaptation to unknown smoothness and unknown covariance structure.

  3. One for all: Balancing in the L^2(ν)-norm (which is easiest) automatically gives optimal balancing in all stronger norms interpolating between L^2(ν) and the RKHS (an analogous result is open for other approaches to data-dependent choices of the regularization parameter).

The paper is organized as follows: In Section 2 we provide a two-sided estimate of the effective dimension by its empirical counterpart. The main results are presented in Section 3, followed by some specific examples in Section 4. A more detailed discussion is given in Section 5. The proofs are collected in the Appendix.

2 Empirical Effective Dimension

The main point of this section is a two-sided estimate on the effective dimension by its empirical approximation, which is crucial for our entire approach. We recall the definition of the effective dimension and introduce its empirical approximation, the empirical effective dimension: for $\lambda > 0$ we set

(2.1)   $\mathcal{N}(\lambda) = \mathrm{tr}\big( T (T + \lambda)^{-1} \big), \qquad \mathcal{N}_{\mathbf{x}}(\lambda) = \mathrm{tr}\big( T_{\mathbf{x}} (T_{\mathbf{x}} + \lambda)^{-1} \big),$

where $T$ denotes the kernel covariance operator associated with the marginal $\nu$ and $T_{\mathbf{x}}$ its empirical counterpart based on the input data $\mathbf{x} = (x_1, \dots, x_n)$. Here $\mathcal{N}(\lambda)$ depends on the marginal $\nu$ (through $T$), but is considered as deterministic, while $\mathcal{N}_{\mathbf{x}}(\lambda)$ is considered as a random variable.


Proposition 2.1.

For any , with probability at least

(2.2)

for all and .

Corollary 2.2.

For any , with probability at least , one has

as well as

where  . In particular, if , with probability at least one has
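In practice, the empirical effective dimension can be read off the spectrum of the normalized Gram matrix. The sketch below assumes the standard convention $\mathcal{N}_{\mathbf{x}}(\lambda) = \mathrm{tr}\big( \hat K (\hat K + \lambda)^{-1} \big)$ with $\hat K = K/n$, whose non-zero spectrum coincides with that of the empirical covariance operator, so only the Gram-matrix eigenvalues are needed.

```python
import numpy as np

def empirical_effective_dimension(K, lam):
    """Empirical effective dimension computed from the n x n Gram matrix K:
    N_x(lam) = trace( K_hat (K_hat + lam I)^{-1} ) with K_hat = K / n."""
    n = K.shape[0]
    evals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)  # eigenvalues of K/n
    return float(np.sum(evals / (evals + lam)))
```

Corollary 2.2 then licenses replacing the unknown effective dimension by this empirical quantity when forming the empirical sample-error bound used below.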

3 Balancing Principle

In this section, we present the main ideas related to the Balancing Principle and make the informal presentation from the Introduction more precise. We start with a definition:

Definition 3.1.

Let be sets and let, for , be a class of data generating distributions on . For each let be an algorithm. If there is a sequence and a parameter choice (not depending on ) such that

(3.1)

and

(3.2)

where the infimum is taken over all estimators , then the sequence of estimators is called minimax optimal adaptive over and the model family , with respect to the family of rates , for the interpolation norm of parameter .

We remind the reader from [5] that upper estimates typically hold on a class and lower estimates hold on a possibly different class , the model class in the above definition being the intersection of both.

To find such an adaptive estimator, we apply a method which is known in the statistical literature as the Balancing Principle. Throughout this section we need

Assumption 3.2.

Let be a class of models. We consider a discrete set of possible values for the regularization parameter

for some . Let and . We assume that the following error decomposition holds uniformly over the grid :

(3.3)

where

(3.4)

with probability at least , for all data generating distributions from . The bounds and are given by

with and

where is increasing, satisfying and for some constants , . We further define .

We remark that it is actually sufficient to assume (3.3) for the two extreme norm parameters. An interpolation inequality then implies validity of (3.3) for any intermediate value.

Note that for any , the map as well as are strictly decreasing in . Also, if is sufficiently large and if is sufficiently small, .

We let

In this definition we have replaced the leading-order error bounds by versions that include the remainder terms. It will emerge a posteriori that the definition is not affected, since the remainder terms are subleading; a priori, however, this is not known. A correct proof of the crucial oracle inequality in Lemma 3.8 below is much easier with this definition, and it will finally turn out that the remainder terms are indeed subleading.

The grid has to be designed such that the optimal value is contained in .

The best estimator for within belongs to the set

and is given by

(3.5)

In particular, since we assume that and , there is some such that . Note also that the choice of the grid has to depend on .

Before we define the balancing principle estimate, we give some intuition for its choice: For any , we have . Moreover, for any we have

Finally, since is decreasing, Assumption 3.2 gives for any two satisfying , with probability at least

(3.6)

An essential step is to find an empirical approximation of the sample error. In view of Corollary 2.2 we define

with and the empirical effective dimension given in (2.1). Corollary 2.2 implies uniformly in

(3.7)

with probability at least , provided

(3.8)

Substituting (3.7) into the right-hand side of the estimate (3.6) motivates our definition of the balancing principle estimate as follows:

Definition 3.3.

Given , and , we set

and define

(3.9)

Notice that as well as depend on the confidence level .
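For illustration, the following sketch implements a generic Lepskii-type balancing rule on an increasing grid, using the constant 4 that is customary in the literature. The exact constant, norm and empirical noise levels prescribed by (3.9) may differ, so this should be read as a sketch of the principle rather than of Definition 3.3 itself.

```python
import numpy as np

def balancing_choice(estimators, noise_levels, grid, const=4.0):
    """Generic Lepskii / balancing principle on an increasing grid.

    estimators[j]   : estimate computed with grid[j] (here a vector of
                      predictions, so norms are Euclidean)
    noise_levels[j] : empirical sample-error bound at grid[j]

    Returns the largest grid point lam_j such that
        || f_{lam_j} - f_{lam_k} || <= const * noise_levels[k]  for all k < j.
    """
    best = 0
    for j in range(len(grid)):
        if all(np.linalg.norm(estimators[j] - estimators[k]) <= const * noise_levels[k]
               for k in range(j)):
            best = j
        else:
            break
    return grid[best]
```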

For the analysis it will be important that the grid has a certain regularity. We summarize all requirements needed in

Assumption 3.4.

(on the grid)

  1. Assume that and .

  2. (Regularity of the grid) There is some such that the elements in the grid obey , .

  3. Choose as the unique solution of . We require that is sufficiently large, such that (so that the maximum in the definition of can be dropped). We further assume that .

Note that as . Then, since as , we get that this satisfies . Furthermore, a short argument shows that the optimal value indeed satisfies , if is big enough. Since as , we get as . Since by definition, it follows for big enough. From the definition of as a supremum, we actually have , for sufficiently large.

Under the regularity assumption, we find that

(3.10)

Indeed, while the effective dimension is decreasing, the related function is non-decreasing. Hence we find that

and since

Therefore

One also easily verifies that

implying (3.10).

Remark 3.5.

The typical case for Assumption 3.4 to hold is given when the parameters follow a geometric progression, i.e., for some we let , and with . In this case we are able to upper bound the total number of grid points in terms of . In fact, since , simple calculations lead to

Recall that the starting point is required to obey if is sufficiently large, implying . Finally, we obtain for sufficiently large

(3.11)

with .
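A geometric grid as in Remark 3.5 is straightforward to generate, and its size grows only logarithmically in the ratio of the largest to the smallest grid value (hence logarithmically in n when the smallest value scales like an inverse power of n). A minimal sketch, with start point, end point and ratio chosen arbitrarily for illustration:

```python
import numpy as np

def geometric_grid(lam_min, lam_max, q):
    """Grid lam_j = lam_min * q**j, j = 0, ..., M, with ratio q > 1 and
    M = ceil( log(lam_max / lam_min) / log(q) ), i.e. M = O(log(lam_max / lam_min))."""
    assert q > 1.0 and 0.0 < lam_min < lam_max
    M = int(np.ceil(np.log(lam_max / lam_min) / np.log(q)))
    return lam_min * q ** np.arange(M + 1)

grid = geometric_grid(lam_min=1e-6, lam_max=1.0, q=1.5)  # 36 grid points here
```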

We shall need an additional assumption on the effective dimension:

Assumption 3.6.
  1. For some and for any sufficiently small

    for some .

  2. For some and for any sufficiently small

    for some .

Note that such an additional assumption restricts the class of admissible marginals and shrinks the class in Assumption 3.2 to a subclass . Such a lower and upper bound will hold in all examples which we encounter in Section 4.

We further remark that Assumption 3.6 ensures a precise asymptotic behavior for of the form

(3.12)

for some , .

3.0.1 Main Results

The first result is of preparatory character.

Proposition 3.7.

Let Assumption 3.2 be satisfied. Define as in (3.5). Assume . Then for any

uniformly over , with probability at least

We shall need

Lemma 3.8.

If Assumption 3.4 holds, then

(3.13)

We immediately arrive at our first main result of this section:

Theorem 3.9.

Let Assumption 3.2 be satisfied and suppose the grid obeys Assumption 3.4. Then for any

uniformly over , with probability at least

with

for some .

In particular, choosing a geometric grid and assuming a lower and upper bound on the effective dimension, we obtain:

Corollary 3.10.

Let Assumption 3.2, Assumption 3.4 and Assumption 3.6 be satisfied. Suppose the grid is given by a geometric sequence , with , and with . Then for any

uniformly over , with probability at least

with

for some and some , provided is sufficiently large.

Note that as .

3.0.2 One for All: L^2(ν)-Balancing is Sufficient!

This section is based on an idea suggested by P. Mathé (itself inspired by the work [3]), which we have worked out in detail. We define the balancing estimate according to Definition 3.3 by explicitly choosing the L^2(ν)-norm (in contrast to Theorem 3.9, where the choice depends on the norm parameter). Our main result states that balancing in the L^2(ν)-norm suffices to automatically give balancing in all other (stronger!) intermediate norms.

Theorem 3.11.

Let Assumption 3.2 be satisfied and suppose the grid obeys Assumption 3.4. Then for any

uniformly over , with probability at least

with

for some .

In particular, choosing a geometric grid and assuming a lower and upper bound on the effective dimension, we obtain:

Corollary 3.12.

Let Assumption 3.2, Assumption 3.4 and Assumption 3.6 be satisfied. Suppose the grid is given by a geometric sequence , with , and with . Then, for sufficiently large and for any

uniformly over , with probability at least

with

for some and some .

Note that as .

Remark 3.13.

Still, our choice is only a theoretical value which remains unknown, as it depends on the unknown marginal through the effective dimension. Implementation requires a data-driven choice. Heuristically, it seems reasonable to proceed as follows: starting from the right and reversing the order of the grid, define a stopping index and take the corresponding grid value as the regularization parameter. Here, the stopping criterion depends on the empirical effective dimension, see (2.1), which by Corollary 2.2 is close to the unknown effective dimension. Thus we think that this choice is reasonable for implementing the dependence on the unknown marginal. A complete mathematical analysis is in development.
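As an illustration of such a reversed scan, the sketch below returns the last grid point (scanning from the right) at which a given empirical test still holds; here `criterion` is a hypothetical placeholder for the test based on the empirical effective dimension of (2.1) referred to above.

```python
def reverse_scan_choice(grid, criterion):
    """Scan the grid from its largest value downwards and return the last
    grid point for which `criterion` still holds, stopping at the first
    failure, in the spirit of the heuristic of Remark 3.13.

    criterion(lam): hypothetical placeholder for the empirical test
                    described in the remark."""
    chosen = max(grid)
    for lam in sorted(grid, reverse=True):
        if criterion(lam):
            chosen = lam
        else:
            break
    return chosen
```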

4 Specific Examples

We proceed by illustrating some specific examples of our method as described in the previous section. In view of Theorem 3.11 and Corollary 3.12 it suffices to consider balancing in the L^2(ν)-norm only. We always choose a geometric grid as in Remark 3.5, satisfying the requirements stated there.

(1) The regular case

We consider the setting of [5], where the eigenvalues of the covariance operator decay polynomially (with a given decay parameter), the target function satisfies a Hölder-type source condition, and the noise satisfies a Bernstein assumption

(4.1)

for any integer moment order and for appropriate constants. We combine all structural parameters in a vector, and we are interested in adaptivity over the resulting model class.

It has been shown in [5] that the corresponding minimax optimal rate is given by

We shall now check the validity of Assumption 3.2. In the following, we assume that the data-generating distribution belongs to the class defined in [5]. Recall that the optimal regularization parameter is determined as the unique solution of the balancing equation. Then we have, uniformly for all data-generating distributions from this class, with probability at least $1 - \eta$, for any $\eta \in (0, 1]$,

for sufficiently large, with

where the constant does not depend on the structural parameters. Recall that the optimal choice of the regularization parameter is obtained by solving

and belongs to the interval . This can be seen by the following argument: If is sufficiently large

which is equivalent to . Since is strictly decreasing we conclude . Here we use the bound