Margin-adaptive model selection in statistical learning

04/18/2008 ∙ by Sylvain Arlot, et al. ∙ École Normale Supérieure ∙ UC Berkeley

A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this "strong margin adaptivity" makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.


1 Introduction

We consider in this paper the model selection problem in a general framework. Since our main motivation comes from the supervised binary classification setting, we focus on this framework in this introduction. Section 2 introduces the natural generalization to empirical (risk) minimization problems, which we consider in the remainder of the paper.

We observe $n$ independent realizations $(X_i, Y_i)_{1 \le i \le n}$ of a random variable $(X, Y)$ with distribution $P$, where $X \in \mathcal{X}$ and $Y \in \{0, 1\}$. The goal is to build a (data-dependent) predictor, i.e., a measurable function $f : \mathcal{X} \to \{0, 1\}$, such that $f(X)$ is as often as possible equal to $Y$, where $(X, Y) \sim P$ is independent from the data. This is the prediction problem, in the setting of supervised binary classification. In other words, the goal is to find $f$ minimizing the prediction error $\mathbb{P}(f(X) \neq Y)$, that is, the expectation of the 0-1 loss.

The minimizer $f^\star$ of the prediction error, when it exists, is called the Bayes predictor. Define the regression function $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$. Then, a classical argument shows that $f^\star(x) = \mathbf{1}_{\eta(x) \ge 1/2}$. However, $f^\star$ is unknown, since it depends on the unknown distribution $P$. Our goal is to build from the data some predictor nearly minimizing the prediction error, or equivalently nearly minimizing the excess loss, i.e., the difference between its prediction error and that of $f^\star$.

A classical approach to the prediction problem is empirical risk minimization. Let $P_n$ be the empirical measure of the observations and let any set of predictors, which is called a model, be given. The empirical risk minimizer over the model is then defined as any predictor in the model minimizing the empirical prediction error, that is, the fraction of observations it misclassifies.

We expect that the risk of the empirical risk minimizer is close to the smallest risk over the model, assuming that such a risk minimizer exists.
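For concreteness, here is a minimal sketch of empirical risk minimization over a finite model with the 0-1 loss; the function names and the toy threshold classifiers are ours, purely for illustration, and are not taken from the paper.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical 0-1 risk of classifier f on the sample (X, y)."""
    return np.mean(f(X) != y)

def erm(model, X, y):
    """Empirical risk minimizer over a finite model (a list of classifiers)."""
    risks = [empirical_risk(f, X, y) for f in model]
    return model[int(np.argmin(risks))]

# Toy usage: threshold classifiers on [0, 1], with 10% label noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = ((X > 0.3).astype(int) + (rng.uniform(size=200) < 0.1).astype(int)) % 2
model = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(0.0, 1.0, 21)]
f_hat = erm(model, X, y)
```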

1.1 Margin condition

Depending on some properties of $P$ and on the complexity of the model, the prediction error of the empirical risk minimizer is more or less distant from that of the Bayes predictor. For instance, when the model has finite Vapnik-Chervonenkis dimension $V$ [27, 26] and contains the Bayes predictor, it has been proven (see for instance [19]) that the expected excess loss of the empirical risk minimizer is at most of order $\sqrt{V/n}$, up to a numerical constant. This is optimal without any further assumption on $P$, in the minimax sense: no estimator can have a smaller prediction risk uniformly over all distributions for which the Bayes predictor belongs to the model, up to a numerical factor [14].
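In symbols, with notation that is ours rather than the paper's ($\widehat{f}_m$ the empirical risk minimizer over a model $\mathcal{F}_m$ of VC dimension $V$ containing the Bayes predictor $f^\star$, and $\ell$ the excess loss), this classical bound reads

$$\mathbb{E}\big[\ell(f^\star, \widehat{f}_m)\big] \;\le\; C \sqrt{\frac{V}{n}},$$

and no estimator can improve on the $\sqrt{V/n}$ rate uniformly over such distributions.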

However, there exist favorable situations where much smaller prediction errors (“fast rates”, up to $n^{-1}$ instead of $n^{-1/2}$) can be obtained. A sufficient condition, the so-called “margin condition”, has been introduced by Mammen and Tsybakov [21]. If the margin condition

(1)

holds for suitable margin parameters, if the Bayes predictor belongs to the model, and if the model is a VC-class of finite dimension, then the expected excess loss of the empirical risk minimizer is bounded by a rate that improves on $\sqrt{V/n}$ and approaches the fast rate as the margin condition becomes stronger, with a constant depending only on the margin parameters and the dimension. Minimax lower bounds [23] and other upper bounds can be obtained under other complexity assumptions (for instance assumption (A2) of Tsybakov [24], involving bracketing entropy). In the extreme situation where the regression function is bounded away from $1/2$, i.e., for some positive margin,

(2)

the same result holds with the fastest rate; a more precise statement is proved in [23].
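Since the displays (1) and (2) did not survive, here is the Mammen-Tsybakov condition in one standard parametrization (our notation, not necessarily the paper's): for some $\kappa \ge 1$ and $c > 0$,

$$\forall f, \qquad \ell(f^\star, f) \;\ge\; c \,\big(\mathbb{P}(f(X) \neq f^\star(X))\big)^{\kappa},$$

while the extreme case corresponds, up to constants, to the existence of some $h > 0$ such that

$$|2\eta(X) - 1| \;\ge\; h \qquad \text{almost surely},$$

in which case the first condition holds with $\kappa = 1$ and $c = h$.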

Following the approach of Koltchinskii [16], we will consider the following generalization of the margin condition:

(3)

where the condition is required over the whole set of predictors, and the margin function is convex, non-decreasing and null at zero. Indeed, the proofs of the above upper bounds on the prediction error of the empirical risk minimizer use only that (1) implies (3) with a power margin function, and that (2) implies (3) with a linear one. (See, for instance, Proposition 1 in [24].)

All these results show that the empirical risk minimizer is adaptive to the margin condition, since it leads to an optimal excess risk under various assumptions on the complexity of the model. However, obtaining such rates of estimation requires knowledge of some model to which the Bayes predictor belongs, which is a strong assumption.

A less restrictive framework is the following. First, we do not assume that the Bayes predictor belongs to the model. Second, we do not assume that the margin condition (3) is satisfied for all predictors, but only within the model under consideration, which can be seen as a “local” margin condition:

(4)

where the local margin function is again convex, non-decreasing and null at zero, and is allowed to depend on the model. The fact that it can depend on the model allows situations where we are lucky to have a strong margin condition for some small models while the global margin condition is loose. As proven in Section 5.2 (Proposition 2), such situations certainly exist.
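The displays (3) and (4) are likewise missing. One reconstruction consistent with the surrounding text (a convex margin function, and the implications from (1) and (2) stated above) is the following, in our notation, writing $\gamma(f)$ for the 0-1 loss function of a predictor $f$ and $w$, $w_m$ for the global and local margin functions:

$$\text{(global, cf. (3))} \qquad \forall f \in \mathcal{F}, \quad \ell(f^\star, f) \;\ge\; w\Big(P\big[(\gamma(f) - \gamma(f^\star))^2\big]\Big),$$

$$\text{(local, cf. (4))} \qquad \forall f \in \mathcal{F}_m, \quad \ell(f^\star, f) \;\ge\; w_m\Big(P\big[(\gamma(f) - \gamma(f^\star))^2\big]\Big).$$

For the 0-1 loss, $P[(\gamma(f) - \gamma(f^\star))^2] = \mathbb{P}(f(X) \neq f^\star(X))$, so the first condition recalled above gives the global form with the power function $w(x) = c\,x^{\kappa}$, and its extreme case gives it with the linear function $w(x) = h\,x$.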

Note that, for suitable margin functions, (3) and (4) can be traced back to mean-variance conditions, which were used in several papers for deriving convergence rates of some minimum contrast estimators on a given model (see for instance [11] and references therein).

1.2 Adaptive model selection

Assume now that we are not given a single model but a whole family of models. By empirical risk minimization within each model, we obtain a family of predictors, from which we would like to select one with a prediction error as small as possible. The aim of such a model selection procedure is to satisfy an oracle inequality of the form

(5)

where the leading constant should be close to one and the remainder term should be as small as possible. Typically, one proves that (5) holds either in expectation or with high probability.
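Since the display (5) is missing, here is the generic shape of such an oracle inequality, in our own notation ($\widehat{m}$ the selected model, $\widehat{f}_m$ the empirical risk minimizer in model $\mathcal{F}_m$):

$$\ell\big(f^\star, \widehat{f}_{\widehat{m}}\big) \;\le\; C \,\inf_{m \in \mathcal{M}} \Big\{ \ell(f^\star, \mathcal{F}_m) + R_{n,m} \Big\},$$

where $\ell(f^\star, \mathcal{F}_m) = \inf_{f \in \mathcal{F}_m} \ell(f^\star, f)$ is the bias (approximation error) of model $\mathcal{F}_m$ and $R_{n,m}$ is a remainder term.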

Assume for instance that the local margin condition holds in some model for some margin parameters, and that this model has finite VC-dimension. In view of the aforementioned minimax lower bounds of [23], one cannot hope in general to prove an oracle inequality (5) with a remainder term smaller than the corresponding minimax rate, in which a logarithmic factor may only be necessary for some VC classes (see [23]).

Adaptive model selection then occurs when the selected predictor satisfies an oracle inequality (5) with a remainder term of the order of this minimax lower bound. More generally, consider some complexity measure of the models (for instance the VC-dimension, or the exponent appearing in Tsybakov's assumption [24]). Then, define the benchmark as the minimax prediction error over the set of distributions for which the local margin condition (4) is satisfied in a model of complexity at most a given value. Massart and Nédélec [23] have proven tight upper and lower bounds on this minimax error for several complexity measures; their results are stated with the margin condition (3), but they actually use only its local version (4).

A margin adaptive model selection procedure should satisfy an oracle inequality of the form

(6)

without using the knowledge of the local margin functions or of the complexities of the models. We call this property “strong margin adaptivity”, to emphasize that it is more challenging than adaptivity to a margin condition that holds uniformly over the models.
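The display (6) is lost; a form consistent with the surrounding discussion (again in our notation, with $R_n(m)$ the minimax rate attached to the complexity and the local margin function of model $\mathcal{F}_m$) would be

$$\ell\big(f^\star, \widehat{f}_{\widehat{m}}\big) \;\le\; C \,\inf_{m \in \mathcal{M}} \Big\{ \ell(f^\star, \mathcal{F}_m) + R_n(m) \Big\},$$

the key point being that the remainder attached to each model reflects both its own complexity and its own margin function, rather than a single worst-case margin function shared by all models.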

1.3 Penalization

In this article, we focus in particular on penalization procedures, which are defined as follows. Let a (data-dependent) penalty function on the collection of models be given; the selected model is then defined as a minimizer, over the collection, of the penalized empirical criterion, namely the empirical risk of the empirical risk minimizer of the model plus the penalty of the model.

Since our goal is to minimize the prediction error of the finally selected predictor, the ideal penalty would be

(7)

but it is unknown because it depends on the distribution $P$. A classical way of designing a penalty is to estimate this ideal penalty, or at least a tight upper bound on it.

We consider in particular local complexity measures [20, 10, 8, 16], because they estimate the ideal penalty tightly enough to achieve fast estimation rates when the margin condition holds true. See Section 3.2 for a detailed definition of these penalties.
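As an illustration, here is a minimal sketch (ours, not the paper's procedure) of penalized model selection over finite models. The penalty below is a naive global Rademacher-style placeholder rather than the localized penalties studied in Section 3.2, and all names are hypothetical; the ideal penalty of (7) would instead be the unobservable gap between the true and empirical risks of each fitted classifier.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical 0-1 risk of classifier f on the sample (X, y)."""
    return np.mean(f(X) != y)

def rademacher_penalty(model, X, y, rng, n_draws=20):
    """Crude Monte Carlo estimate of the (global) empirical Rademacher complexity
    of the 0-1 loss class of a finite model, used here only as a placeholder penalty."""
    losses = np.array([(f(X) != y).astype(float) for f in model])  # shape (|model|, n)
    n = len(y)
    draws = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        draws.append(np.max(losses @ sigma) / n)
    return float(np.mean(draws))

def select_model(models, X, y, rng):
    """Penalized selection: argmin over models of {empirical risk of its ERM + penalty}."""
    criterion = []
    for model in models:
        best_emp_risk = min(empirical_risk(f, X, y) for f in model)
        criterion.append(best_emp_risk + rademacher_penalty(model, X, y, rng))
    return int(np.argmin(criterion))
```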

1.4 Related results

There is a considerable literature on margin adaptivity, in the context of model selection as well as model aggregation. Most of the papers consider the uniform margin condition, that is, the same margin function for every model. Barron, Birgé and Massart [7] have proven oracle inequalities for deterministic penalties, under some mean-variance condition close to (3). Following a similar approach, margin adaptive oracle inequalities (with more general margin functions) have been proven with localized random penalties [20, 10, 8, 16], and in [25] with other penalties in a particular framework.

Adaptivity to the margin has also been considered with a regularized boosting method [12], with the hold-out [13], and in a PAC-Bayes framework [5]. Aggregation methods have been studied in [24, 17]. Notice also that a completely different approach is possible: first estimate the regression function (possibly through model selection), then use a plug-in classifier; this works provided the regression function is smooth enough [6].

It is quite unclear whether any of these results can be extended to strong margin adaptivity (actually, we will prove that this requires additional restrictions in general). To our knowledge, the only results allowing the margin function to depend on the model can be found in [16]. First, when the models are nested, a comparison method based on local Rademacher complexities attains strong margin adaptivity, assuming that the Bayes predictor belongs to one of the models (Theorem 7; it is quite unclear whether this still holds without the latter assumption). Second, a penalization method based on local Rademacher complexities has the same property in the general case, but it uses the knowledge of the local margin functions (Theorems 6 and 11).

Our claim is that when the margin function strongly depends on the model, it is crucial to take this into account in order to choose the best model in the collection. Such situations do occur, as proven by our Proposition 2 in Section 5.2. But assuming either that the Bayes predictor belongs to one of the models, or that the local margin functions are known, is not realistic. Our goal is to investigate the kind of results that can be obtained with completely data-driven procedures, in particular when neither of these assumptions holds.

1.5 Our results

In this paper, we aim at understanding when strong margin adaptivity can be obtained for data-dependent model selection procedures. Notice that we do not restrict ourselves to the classification setting. We consider a much more general framework (as in [16] for instance), which is described in Section 2. We prove two kinds of results. First, when the models are nested, we show that some penalization methods are strongly margin adaptive (Theorem 1). In particular, this result holds for local Rademacher complexities (Corollary 1). Compared to previous results (in particular the ones of [16]), our main advance is that our penalties do not require knowledge of the local margin functions, and we do not assume that the Bayes predictor belongs to any of the models.

Our second result probes the limits of strong margin adaptivity without the nestedness assumption. A family of models exists such that, for every sample size and every (model) selection procedure, a distribution exists for which the procedure fails to be strongly margin adaptive with positive probability (Theorem 2). Hence, the previous positive results (Theorem 1 and Corollary 1) cannot be extended outside of the nested case for a general distribution.

Where is the boundary between these two extremes? Obviously, the nestedness assumption is not necessary. For instance, when the global margin assumption is indeed tight (that is, when the local margin functions coincide with the global one for every model), margin adaptivity can be obtained in several ways, as mentioned in Section 1.4. We sketch in Section 5 some situations where strong margin adaptivity is possible. More precisely, we state a general oracle inequality (Theorem 3), valid for any family of models and any distribution. We then discuss assumptions under which its remainder term is small enough to imply strong margin adaptivity.

This paper is organized as follows. We describe the general setting in Section 2. We consider in Section 3 the nested case, in which strong margin adaptivity holds. Negative results (i.e., lower bounds on the prediction error of a general model selection procedure) are stated in Section 4. The line between these two situations is sketched in Section 5. We discuss our results in Section 6. All the proofs are given in Section 7.

2 The general empirical minimization framework

Although our main motivation comes from the classification problem, it turns out that all our results can be proven in the general setting of empirical minimization. As explained below, this setting includes binary classification with the 0-1 loss, bounded regression and several other frameworks. In the rest of the paper, we will use the following general notation, in order to emphasize the generality of our results.

We observe $n$ independent realizations of a random variable with distribution $P$, and we are given a set of measurable real-valued functions. Our goal is to build some (data-dependent) function in this set whose expectation under $P$ is as small as possible. For the sake of simplicity, we assume that a minimizer of this expectation over the whole set exists.

This includes the prediction framework, in which the observations are the pairs $(X_i, Y_i)$ and the functions are the loss functions $(x, y) \mapsto \gamma(f; (x, y))$ of the predictors $f$, for some contrast function $\gamma$. Then, the minimizer of the expectation is the loss function of the Bayes predictor. In the binary classification framework, we can take the 0-1 contrast for instance; we then recover the setting described in Section 1. In the bounded regression framework, assuming that the response variable is bounded, we can take the least-squares contrast. Many other contrast functions can be considered, provided that they take their values in $[0, 1]$. Notice the one-to-one correspondence between predictors and the associated loss functions in the prediction framework.

The empirical minimizer over a given subset of functions (called a model) can then be defined as any minimizer of the empirical mean over the model. We expect that its expectation is close to the smallest expectation over the model, assuming that such a minimizer exists. In the prediction framework, the empirical minimizer is the loss function associated with the empirical risk minimizer of Section 1, and its expectation is the prediction error of that empirical risk minimizer.
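To fix ideas (the symbols below are ours, not necessarily the paper's): writing $P_n$ for the empirical measure, $\mathcal{G}_m$ for a model of functions, and $\gamma(f; \cdot)$ for the loss function of a predictor $f$, the general empirical minimizer and its specialization to prediction read

$$\widehat{g}_m \in \operatorname*{arg\,min}_{g \in \mathcal{G}_m} P_n g, \qquad \mathcal{G}_m = \{\gamma(f; \cdot) : f \in \mathcal{F}_m\} \;\Longrightarrow\; \widehat{g}_m = \gamma(\widehat{f}_m; \cdot) \;\text{ and }\; P\widehat{g}_m = P\gamma(\widehat{f}_m),$$

so that minimizing $P\widehat{g}_m$ over the collection of models is exactly minimizing the prediction error of the selected empirical risk minimizer.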

We can now write the global margin condition as follows:

(8)

where the margin function is again a convex non-decreasing function, null at zero. Similarly, the local margin condition is

(9)

Notice that most of the upper and lower bounds on the risk under the margin condition given in the introduction remain valid in the general empirical minimization framework, at least for power-type margin functions (see for instance [23, 16]). In particular, if the model is a VC-type class of finite dimension, the risk of the empirical minimizer satisfies, up to numerical constants, bounds analogous to those recalled in Section 1.1: a fast rate under a linear margin function, and intermediate rates under power-type margin functions.

Given a collection of models, we are looking for a model selection procedure satisfying an oracle inequality of the form

(10)

with a leading constant close to 1 and a remainder term as small as possible. Similarly to (6), we define a strongly margin adaptive procedure as any procedure such that (10) holds with a numerical leading constant and with a remainder term of the order of the corresponding minimax risk.

Defining penalization methods as

(11)

for some data-dependent penalty function, the ideal penalty being, as in (7), the gap between the true and the empirical mean of the empirical minimizer of each model.

3 Margin adaptive model selection for nested models

3.1 General result

Our first result is a sufficient condition for penalization procedures to attain strong margin adaptivity when the models are nested (Theorem 1). Since this condition is satisfied by local Rademacher complexities, this leads to a data-driven margin adaptive penalization procedure (Corollary 1).

Theorem 1.

Fix margin functions such that the local margin conditions (9) hold, and let a sequence of positive reals, nondecreasing with respect to the inclusion ordering on the models, be given. Assume that some constants exist such that the following holds:

  • the models are nested.

  • lower bounds on the penalty: with high probability, the following holds for every model:

    (12)
    (13)

Then, if the selected model is defined by (11), with high probability we have, for every model,

(14)

where the remainder involves the convex conjugate of the margin function of each model.

Theorem 1 is proved in Section 7.1.

Remark 1.
  1. If the penalty is of the right order, i.e., not much larger than the lower bounds required in (12) and (13), then Theorem 1 is a strong margin adaptivity result. Indeed, for power-type margin functions the remainder term is not too large, and for a suitable choice of the free parameters, (14) can be rewritten as an oracle inequality whose remainder term is, up to positive constants, of the order of the minimax risk. When the margin function is a general convex function, minimax estimation rates are no longer available, so that we do not know whether the remainder term in (14) is of the right order. Nevertheless, no better risk bound is known, even for a single model to which the Bayes predictor belongs.

  2. In the case where the margin functions are known, methods involving local Rademacher complexities satisfy oracle inequalities similar to (14) (see Theorems 6 and 11 in [16]). On the contrary, the margin functions are not assumed to be known in Theorem 1, and conditions (12) and (13) are satisfied by completely data-dependent penalties, as shown in Section 3.2.
    Also, Theorem 7 of [16] shows that adaptivity is possible using a comparison method, provided that the Bayes predictor belongs to one of the models. However, it is not clear whether this comparison method achieves the optimal bias-variance trade-off in the general case, as Theorem 1 does.

3.2 Local Rademacher complexities

Although Theorem 1 applies to any penalization procedure satisfying assumptions (12) and (13), we now focus on methods based on local Rademacher complexities. Let us define precisely these complexities. We mainly use the notation of [16]:

  • for every positive level, the minimal set of the model with respect to the distribution, i.e., the set of functions of the model whose expectation exceeds the smallest expectation over the model by at most that level;

  • the diameter of this minimal set;

  • the expected modulus of continuity of the empirical process over this minimal set.

We then define from these quantities a localized function of the level, involving a numerical constant (to be chosen later). The (ideal) local complexity is (roughly) the smallest positive fixed point of this function. More precisely,

(15)

where another numerical constant is involved.
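Since the displays defining these quantities are missing, here is the generic shape of the construction, in our notation only (the actual definition in [16] involves specific numerical constants and a more careful infimum): writing $\phi_m(\delta)$ for the localized bound built from the diameter of, and the modulus of continuity over, the $\delta$-minimal set of model $m$, the ideal local complexity is essentially

$$\bar{\delta}_m \;\approx\; \inf\{\delta > 0 \;:\; \phi_m(\delta) \le \delta\},$$

i.e., the smallest level at which the localized fluctuation bound falls below the level itself.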

Two important points, which follow from Theorems 1 and 3 of Koltchinskii [16], are that:

  1. the ideal local complexity defined by (15) is large enough to satisfy assumption (12), with high probability, for each model;

  2. a completely data-dependent quantity exists which, with high probability, is an upper bound on it (up to constants).
    This data-dependent quantity is a resampling estimate of the ideal local complexity, called the “local Rademacher complexity”.

Before stating the main result of this section, let us recall the definition of the local Rademacher complexity, as in [16]. We need the following additional notation:

  • for every positive level, the empirical minimal set of the model, defined like the minimal set but with the empirical measure in place of the distribution;

  • the empirical diameter of this empirical minimal set;

  • the modulus of continuity of the Rademacher process over this empirical minimal set, where the Rademacher variables are i.i.d. random signs (i.e., each takes the values $+1$ and $-1$ with probability $1/2$ each).

Defining from these quantities the empirical counterpart of the localized function above (where numerical constants, to be chosen later, are involved), the local Rademacher complexity is (roughly) the smallest positive fixed point of this empirical function. More precisely,

(16)

where a further numerical constant is involved.
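To make the localization idea concrete, here is a rough illustrative sketch in code (ours, not Koltchinskii's exact construction, which involves additional constants and the fixed-point definition (16)): for a finite model, one restricts the Rademacher average to the functions whose empirical risk is within a level delta of the best one, and then looks for a small delta at which this localized average drops below delta.

```python
import numpy as np

def localized_rademacher(losses, delta, rng, n_draws=50):
    """Localized empirical Rademacher average of a finite loss class.

    losses: array of shape (n_functions, n_samples) with values in [0, 1]
            (e.g., the 0-1 losses of the candidate classifiers on the sample).
    Only the functions whose empirical risk is within `delta` of the best one
    are kept (an empirical "delta-minimal set")."""
    emp_risks = losses.mean(axis=1)
    local = losses[emp_risks <= emp_risks.min() + delta]
    n = losses.shape[1]
    draws = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        draws.append(np.max(np.abs(local @ sigma)) / n)
    return float(np.mean(draws))

def local_complexity(losses, rng, grid=None):
    """Crude proxy for the fixed point: the smallest delta on a grid such that
    the localized Rademacher average is at most delta."""
    if grid is None:
        grid = np.linspace(1e-3, 1.0, 100)
    for delta in grid:
        if localized_rademacher(losses, delta, rng) <= delta:
            return float(delta)
    return 1.0
```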

Corollary 1 (Strong margin adaptivity for local Rademacher complexities).

There exist numerical constants such that the following holds. Fix a confidence level, and assume that a numerical constant and an event of high probability exist on which

(17)

where the ideal local complexity is defined by (15) (and depends on both the model and the confidence level). Assume moreover that the models are nested and that the penalty dominates the local Rademacher complexity of each model.

Then, an event of high probability exists on which, for every model,

(18)

In particular, this holds when the penalty is proportional to the local Rademacher complexity, provided that the constants involved are larger than some constants depending only on the parameters above.

Corollary 1 is proved in Section 7.1.

Remark 2.

One can always enlarge the constants in the penalty, making the leading constant of the oracle inequality (18) closer to one, at the price of enlarging the remainder term. We do not know whether it is possible to make the leading constant closer to one without changing the penalization procedure itself.

As we show in Section 5.2, there are distributions and collections of models such that this is a strong improvement over the “uniform margin” case, in terms of prediction error. It seems reasonable to expect that this happens in a significant number of practical situations.

In Section 5, we state a more general result (from which Theorem 1 is a corollary) which suggests why it is more difficult to prove Corollary 1 when the margin function really depends on the model. This general result is also useful for understanding how the nestedness assumption might be relaxed in Theorem 1 and Corollary 1.

The reason why Corollary 1 implies strong margin adaptivity is that the local Rademacher complexities are not too large when the local margin condition is satisfied together with a complexity assumption on the model. Indeed, there exists a distribution-dependent counterpart of the local complexity (defined like the empirical one, with the empirical quantities replaced by their population versions and with related numerical constants) that dominates the local Rademacher complexity with high probability. (See Theorem 3 of [16].) This leads to several upper bounds on the local Rademacher complexity under the local margin condition (9), obtained by combining Lemma 5 of [16] with the examples of its Section 2.5. For instance, in the binary classification case, when the model is the class of 0-1 loss functions associated with a VC-class of finite dimension, and the margin condition (9) holds with a linear margin function, we have, for every model and sample size,

(19)

where the constants depend only on the numerical constants of the construction. (Similar upper bounds hold under several other complexity assumptions on the models; see [16].) In particular, when each model is a VC-class whose dimension, bias and margin parameter may depend on the model, (18) implies an explicit oracle inequality which holds with high probability, for some numerical constants. Up to a logarithmic factor, this is a strong margin adaptive model selection result, provided that the number of models is smaller than some power of the sample size. Notice that the logarithmic factor is sometimes necessary (as shown by [23]), meaning that this upper bound is then optimal.

4 Lower bound for some non-nested models

In this section, we investigate the assumption in Theorem 1 that the models are nested. To this aim, let us consider the case where the models are singletons. Then, any empirical minimizer is deterministic and equal to the unique element of its model, so that model selection amounts to selecting among a family of functions. Theorem 2 below shows that no selection procedure can be strongly margin adaptive in general.

Theorem 2.

Let the contrast be the 0-1 loss and consider the associated class of loss functions. Under a mild assumption on the input space, two functions and absolute constants exist such that the following holds. For every integer sample size and every selection procedure (that is, any measurable function of the data choosing one of the two functions), a distribution exists such that

(20)
(21)

Theorem 2 is proved in Section 7.2. A straightforward corollary of Theorem 2 is that, in the classification setting with the 0-1 loss, strong margin adaptive model selection is not always possible when the models are not nested. Indeed, taking the two singleton models associated with these two functions, (20) shows that for any model selection procedure, some distribution exists such that results like Theorem 1 or Corollary 1 cannot hold.

Remark 3.
  1. Theorem 2 (and its corollary for model selection) also holds for randomized rules (where the value of the selection procedure is the probability assigned to the choice of each candidate). Hence, aggregating models instead of selecting one does not modify the conclusion of Theorem 2.

  2. The most reasonable selection procedure among two functions (or two singleton models) clearly is empirical minimization. The proof of Theorem 2 exhibits explicitly some distribution such that (20) and (21) hold for empirical minimization. Note that when the models are singletons, most penalization procedures coincide with empirical minimization, for instance when the penalty is proportional to the local Rademacher complexity, to the ideal penalty, to its expectation, or to some quantile of it.

  3. Theorem 2 focuses on margin adaptivity with linear margin functions, whereas the margin condition is also satisfied with other functions. This is both for simplicity and because this choice emphasizes that one could hope for learning rates of order $1/n$ if strong margin adaptivity were possible. The meaning of Theorem 2 is then mainly that one cannot guarantee to learn at a rate better than $1/\sqrt{n}$, whereas for some model, both the excess loss and the corresponding minimax rate are of order $1/n$.

  4. The counterexample given in the proof of Theorem 2 is highly nonasymptotic, since the distribution strongly depends on the sample size. If the distribution and the candidate functions were fixed, it is well known that empirical minimization would lead to asymptotic optimality, because the set of candidates is finite and fixed while the sample size grows. This illustrates a significant difference between the asymptotic and non-asymptotic frameworks.
    Another example of such a difference occurs when the number of candidate functions (or models) is infinite, or grows to infinity with the sample size; see (iv) in Proposition 2 in Section 5.2.

With Theorem 1, we have proven a strong margin adaptivity result for nested models, which holds true when the penalty is built upon local Rademacher complexities. Therefore, adaptive model selection is attainable for nested models, whatever the distribution of the data. On the other hand, Theorem 2 gives a simple example where no model selection procedure can satisfy an oracle inequality (10) with a leading constant smaller than a quantity growing with the sample size.

Looking carefully at the selection problems considered in the proof of Theorem 2, it appears that the main reason why they are particularly tough is that we are quite “lucky” with one of the models: it simultaneously has a very small bias, a very small size and a large margin parameter, while other models with a very similar appearance are much worse. When looking for more general strong margin adaptivity results, we must therefore keep in mind that this is a hopeless task in such situations.

Let us finally mention a related result in a close but slightly different framework. In the classification framework, under a global margin condition, Theorem 3 in [18] shows that for any sample size, a family of classifiers exists for which, for any selection procedure, some distribution exists such that