We consider in this paper the model selection problem in a general framework. Since our main motivation comes from the supervised binary classification setting, we focus on this framework in this introduction. Section 2 introduces the natural generalization to empirical (risk) minimization problems, which we consider in the remainder of the paper.
We observe $n$ independent realizations $(X_i, Y_i)$, for $1 \le i \le n$, of a random variable $(X, Y)$ with distribution $P$, where $X \in \mathcal{X}$ and $Y \in \{0, 1\}$. The goal is to build a (data-dependent) predictor $\widehat{s}$ (i.e., a measurable function $\mathcal{X} \to \{0, 1\}$) such that $\widehat{s}(X)$ is as often as possible equal to $Y$, where $(X, Y) \sim P$ is independent from the data. This is the prediction problem, in the setting of supervised binary classification. In other words, the goal is to find $s$ minimizing the prediction error $P\gamma(s) := \mathbb{P}(s(X) \neq Y)$, where $\gamma$ is the 0-1 loss.
The minimizer of the prediction error, when it exists, is called the Bayes predictor $s^\star$. Define the regression function $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$. Then, a classical argument shows that $s^\star(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$. However, $s^\star$ is unknown, since it depends on the unknown distribution $P$. Our goal is to build from the data some predictor $\widehat{s}$ minimizing the prediction error $P\gamma(\widehat{s})$, or equivalently the excess loss $\ell(s^\star, \widehat{s}) := P\gamma(\widehat{s}) - P\gamma(s^\star)$.
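To make the Bayes predictor concrete, here is a minimal numerical sketch (our illustration, not part of the paper): we fix a hypothetical regression function $\eta$, form the plug-in rule $s^\star(x) = \mathbf{1}_{\{\eta(x) \ge 1/2\}}$, and check by Monte Carlo that another predictor does no better.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Hypothetical regression function eta(x) = P(Y = 1 | X = x), with X uniform on [0, 1]
    return 0.2 + 0.6 * x

def bayes_predictor(x):
    # s*(x) = 1 iff eta(x) >= 1/2: the plug-in form of the Bayes predictor
    return (eta(x) >= 0.5).astype(int)

def prediction_error(s, n_mc=200_000):
    # Monte-Carlo estimate of the prediction error P(s(X) != Y)
    x = rng.uniform(0.0, 1.0, n_mc)
    y = (rng.uniform(0.0, 1.0, n_mc) < eta(x)).astype(int)
    return float(np.mean(s(x) != y))

always_one = lambda x: np.ones_like(x, dtype=int)
print(prediction_error(bayes_predictor))  # close to the Bayes error E[min(eta, 1 - eta)] = 0.35
print(prediction_error(always_one))       # close to 0.5: strictly worse
```

Any other measurable predictor has a prediction error at least as large as the plug-in rule, which is the content of the classical argument above.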
A classical approach to the prediction problem is empirical risk minimization. Let $P_n := n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ be the empirical measure and $S_m$ be any set of predictors, which is called a model. The empirical risk minimizer over $S_m$ is then defined as
$$\widehat{s}_m \in \mathop{\arg\min}_{s \in S_m} \left\{ P_n \gamma(s) \right\}.$$
We expect that the risk of $\widehat{s}_m$ is close to that of
$$s_m \in \mathop{\arg\min}_{s \in S_m} \left\{ P \gamma(s) \right\},$$
assuming that such a minimizer exists.
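The definitions above can be sketched numerically. The following toy example (ours; the finite model of threshold classifiers is an assumption for illustration) computes the empirical risk minimizer over a model by direct minimization of the empirical 0-1 risk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: X uniform on [0, 1], Y = 1 with probability 0.1 for x <= 0.4 and 0.9 above
n = 500
x = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(0.0, 1.0, n) < 0.1 + 0.8 * (x > 0.4)).astype(int)

# A toy model S_m: threshold classifiers s_t(x) = 1{x >= t}, t on a finite grid
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    # P_n gamma(s_t) = (1/n) #{i : s_t(X_i) != Y_i}: the empirical 0-1 risk
    return float(np.mean((x >= t).astype(int) != y))

t_hat = min(thresholds, key=empirical_risk)  # empirical risk minimizer over S_m
print(t_hat, empirical_risk(t_hat))
```

With this sample size, the selected threshold lands near the true jump of the regression function at $0.4$, and its empirical risk is close to the Bayes error $0.1$.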
1.1 Margin condition
Depending on some properties of $P$ and the complexity of $S_m$, the prediction error of $\widehat{s}_m$ is more or less distant from that of $s_m$. For instance, when $S_m$ has a finite Vapnik-Chervonenkis dimension $V_m$ [27, 26] and $s^\star \in S_m$, it has been proven (see for instance ) that
$$\mathbb{E}\left[\ell(s^\star, \widehat{s}_m)\right] \le C \sqrt{\frac{V_m}{n}}$$
for some numerical constant $C > 0$. This is optimal without any assumption on $P$, in the minimax sense: no estimator can have a smaller prediction risk uniformly over all distributions $P$ such that $s^\star \in S_m$, up to the numerical factor $C$.
However, there exist favorable situations where much smaller prediction errors (“fast rates”, up to $n^{-1}$ instead of $n^{-1/2}$) can be obtained. A sufficient condition, the so-called “margin condition”, has been introduced by Mammen and Tsybakov . If, for some $h > 0$ and $\kappa \ge 1$,
$$\forall s \in S, \quad \ell(s^\star, s) \ge h \left( \mathbb{P}(s(X) \neq s^\star(X)) \right)^{\kappa}, \qquad (1)$$
if the Bayes predictor $s^\star$ belongs to $S_m$, and if $S_m$ is a VC-class of dimension $V_m$, then the prediction error of $\widehat{s}_m$ is smaller than $C n^{-\beta}$ in expectation, where $\beta = \kappa/(2\kappa - 1)$ and $C$ only depends on $h$, $\kappa$ and $V_m$. Minimax lower bounds  and other upper bounds can be obtained under other complexity assumptions (for instance assumption (A2) of Tsybakov , involving bracketing entropy). In the extreme situation where $\kappa = 1$, i.e., for some $h > 0$,
$$|2\eta(x) - 1| \ge h \quad \text{for almost every } x \in \mathcal{X}, \qquad (2)$$
then the same result holds with $\beta = 1$ and $C = C(h, V_m)$. More precisely, as proved in ,
$$\mathbb{E}\left[\ell(s^\star, \widehat{s}_m)\right] \le C \, \frac{V_m}{n h} \left( 1 + \log\left( \frac{n h^2}{V_m} \right) \right).$$
Following the approach of Koltchinskii , we will consider the following generalization of the margin condition:
$$\forall s \in S, \quad \ell(s^\star, s) \ge w\left( \operatorname{Var}_P\left( \gamma(s; \cdot) - \gamma(s^\star; \cdot) \right) \right), \qquad (3)$$
where $S$ is the set of predictors, and $w$ is a convex non-decreasing function on $[0, +\infty)$ with $w(0) = 0$. Indeed, the proofs of the above upper bounds on the prediction error of $\widehat{s}_m$ use only that (1) implies (3) with $w : x \mapsto h x^{\kappa}$, and that (2) implies (3) with $w : x \mapsto h x$. (See, for instance, Proposition 1 in .)
All these results show that the empirical risk minimizer is adaptive to the margin condition, since it leads to an optimal excess risk under various assumptions on the complexity of $S_m$. However, obtaining such rates of estimation requires the knowledge of some model $S_m$ to which the Bayes predictor $s^\star$ belongs, which is a strong assumption.
A less restrictive framework is the following. First, we do not assume that $s^\star \in S_m$. Second, we do not assume that the margin condition (3) is satisfied for all $s \in S$, but only for $s \in S_m$, which can be seen as a “local” margin condition:
$$\forall s \in S_m, \quad \ell(s^\star, s) \ge w_m\left( \operatorname{Var}_P\left( \gamma(s; \cdot) - \gamma(s^\star; \cdot) \right) \right), \qquad (4)$$
where $w_m$ is a convex non-decreasing function on $[0, +\infty)$ with $w_m(0) = 0$. The fact that $w_m$ can depend on $m$ allows situations where we are lucky to have a strong margin condition for some small models while the global margin condition is loose. As proven in Section 5.2 (Proposition 2), such situations certainly exist.
1.2 Adaptive model selection
Assume now that we are not given a single model but a whole family $(S_m)_{m \in \mathcal{M}}$. By empirical risk minimization, we obtain a family $(\widehat{s}_m)_{m \in \mathcal{M}}$ of predictors, from which we would like to select some $\widehat{s}_{\widehat{m}}$ with a prediction error as small as possible. The aim of such a model selection procedure is to satisfy an oracle inequality of the form
$$\ell(s^\star, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}} \left\{ \ell(s^\star, \widehat{s}_m) \right\} + R_n, \qquad (5)$$
where the leading constant $C$ should be close to one and the remainder term $R_n$ should be negligible compared to the value $\inf_{m \in \mathcal{M}} \ell(s^\star, \widehat{s}_m)$. Typically, one proves that (5) holds either in expectation, or with high probability.
Assume for instance that (2) holds for some $h > 0$ and that some model $S_{m_0} \ni s^\star$ has a finite VC-dimension $V_{m_0}$. In view of the aforementioned minimax lower bounds of , one cannot hope in general to prove an oracle inequality (5) with a remainder smaller than
$$C \, \frac{V_{m_0}}{n h} \left( 1 + \log\left( \frac{n h^2}{V_{m_0}} \right) \right),$$
where the logarithmic term may only be necessary for some VC classes (see ).
Then, adaptive model selection occurs when $\widehat{m}$ satisfies an oracle inequality (5) with $R_n$ of the order of this minimax lower bound. More generally, let $D_m$ be some complexity measure of $S_m$ (for instance its VC-dimension, or the exponent appearing in Tsybakov's assumption ). Then, define $R_{\min}(D, w, n)$ as the minimax prediction error over the set of distributions $P$ such that $s^\star \in S_m$ and the local margin condition (4) is satisfied in $S_m$ with $w_m = w$, where $S_m$ has a complexity at most $D$. Massart and Nédélec  have proven tight upper and lower bounds on $R_{\min}(D, w, n)$ with several complexity measures; their results are stated with the margin condition (3), but they actually use only its local version (4).
A margin adaptive model selection procedure should satisfy an oracle inequality of the form
$$\ell(s^\star, \widehat{s}_{\widehat{m}}) \le C \inf_{m \in \mathcal{M}} \left\{ \ell(s^\star, s_m) + R_{\min}(D_m, w_m, n) \right\}, \qquad (6)$$
without using the knowledge of $(w_m)_{m \in \mathcal{M}}$ and $(D_m)_{m \in \mathcal{M}}$. We call this property “strong margin adaptivity”, to emphasize the fact that this is more challenging than adaptivity to a margin condition that holds uniformly over the models.
We focus in particular in this article on penalization procedures, which are defined as follows. Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ be a (data-dependent) function, and define
$$\widehat{m} \in \mathop{\arg\min}_{m \in \mathcal{M}} \left\{ P_n \gamma(\widehat{s}_m) + \mathrm{pen}(m) \right\}.$$
Since our goal is to minimize the prediction error of $\widehat{s}_{\widehat{m}}$, the ideal penalty would be
$$\mathrm{pen}_{\mathrm{id}}(m) := (P - P_n) \gamma(\widehat{s}_m),$$
but it is unknown because it depends on the distribution $P$. A classical way of designing a penalty is to estimate $\mathrm{pen}_{\mathrm{id}}(m)$, or at least a tight upper bound on it.
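The penalization scheme can be sketched as follows (our toy illustration: histogram classifiers on dyadic partitions as nested models, and a simple $\sqrt{D_m/n}$-type placeholder penalty, which is not the penalty studied in this paper).

```python
import numpy as np

rng = np.random.default_rng(2)

n = 400
x = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(0.0, 1.0, n) < 0.1 + 0.8 * (x > 0.5)).astype(int)

def fit_histogram(k):
    """ERM over the model S_k of histogram classifiers on 2**k regular bins
    (majority vote inside each bin); the models S_1, S_2, ... are nested."""
    bins = np.minimum((x * 2**k).astype(int), 2**k - 1)
    votes = np.array([y[bins == b].mean() if np.any(bins == b) else 0.0
                      for b in range(2**k)])
    return (votes >= 0.5).astype(int)

def empirical_risk(pred, k):
    bins = np.minimum((x * 2**k).astype(int), 2**k - 1)
    return float(np.mean(pred[bins] != y))

# Penalized model selection: m_hat minimizes P_n gamma(s_m) + pen(m).
# pen(m) ~ sqrt(D_m / n) is only a placeholder for this sketch, not the
# local Rademacher penalty studied in the paper.
crit = {}
for k in range(1, 7):
    pred = fit_histogram(k)
    crit[k] = empirical_risk(pred, k) + np.sqrt(2.0**k / n)
k_hat = min(crit, key=crit.get)
print(k_hat, crit[k_hat])
```

Here the smallest model already contains the signal (a single jump at $1/2$), so the penalized criterion selects a small $k$: larger models lower the empirical risk only by overfitting, which the penalty term counterbalances.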
1.4 Related results
There is a considerable literature on margin adaptivity, in the context of model selection as well as model aggregation. Most of the papers consider the uniform margin condition, that is, when $w_m = w$ for every $m \in \mathcal{M}$. Barron, Birgé and Massart  have proven oracle inequalities for deterministic penalties, under some mean-variance condition close to (3). Following a similar approach, margin adaptive oracle inequalities (with more general functions $w$) have been proven with localized random penalties [20, 10, 8, 16], and in  with other penalties in a particular framework.
Adaptivity to the margin has also been considered with a regularized boosting method , the hold-out  and in a PAC-Bayes framework . Aggregation methods have been studied in [24, 17]. Notice also that a completely different approach is possible: estimate first the regression function $\eta$ (possibly through model selection), then use a plug-in classifier; this works provided $\eta$ is smooth enough .
It is quite unclear whether any of these results can be extended to strong margin adaptivity (actually, we will prove that this requires additional restrictions in general). To our knowledge, the only results allowing $w_m$ to depend on $m$ can be found in . First, when the models are nested, a comparison method based on local Rademacher complexities attains strong margin adaptivity, assuming that $s^\star \in \bigcup_{m \in \mathcal{M}} S_m$ (Theorem 7; it is quite unclear whether this still holds without the latter assumption). Second, a penalization method based on local Rademacher complexities has the same property in the general case, but it uses the knowledge of $(w_m)_{m \in \mathcal{M}}$ (Theorems 6 and 11).
Our claim is that when $w_m$ strongly depends on $m$, it is crucial to take this dependence into account to choose the best model in $\mathcal{M}$. And such situations occur, as proven by our Proposition 2 in Section 5.2. But assuming either that $s^\star$ belongs to one of the models or that $(w_m)_{m \in \mathcal{M}}$ is known is not realistic. Our goal is to investigate the kind of results which can be obtained with completely data-driven procedures, in particular when the Bayes predictor does not belong to any of the models.
1.5 Our results
In this paper, we aim at understanding when strong margin adaptivity can be obtained for data-dependent model selection procedures. Notice that we do not restrict ourselves to the classification setting. We consider a much more general framework (as in  for instance), which is described in Section 2. We prove two kinds of results. First, when models are nested, we show that some penalization methods are strongly margin adaptive (Theorem 1). In particular, this result holds for local Rademacher complexities (Corollary 1). Compared to previous results (in particular the ones of ), our main advance is that our penalties do not require the knowledge of $(w_m)_{m \in \mathcal{M}}$, and we do not assume that the Bayes predictor belongs to any of the models.
Our second result probes the limits of strong margin adaptivity, without the nested assumption. A family of models exists such that, for every sample size $n$ and every (model) selection procedure, a distribution $P$ exists for which the procedure fails to be strongly margin adaptive with positive probability (Theorem 2). Hence, the previous positive results (Theorem 1 and Corollary 1) cannot be extended outside of the nested case for a general distribution $P$.
Where is the boundary between these two extremes? Obviously, the nested assumption is not necessary. For instance, when the global margin assumption is indeed tight ($w_m = w$ for every $m \in \mathcal{M}$), margin adaptivity can be obtained in several ways, as mentioned in Section 1.4. We sketch in Section 5 some situations where strong margin adaptivity is possible. More precisely, we state a general oracle inequality (Theorem 3), valid for any family of models and any distribution $P$. We then discuss assumptions under which its remainder term is small enough to imply strong margin adaptivity.
This paper is organized as follows. We describe the general setting in Section 2. We consider in Section 3 the nested case, in which strong margin adaptivity holds. Negative results (i.e., lower bounds on the prediction error of a general model selection procedure) are stated in Section 4. The line between these two situations is sketched in Section 5. We discuss our results in Section 6. All the proofs are given in Section 7.
2 The general empirical minimization framework
Although our main motivation comes from the classification problem, it turns out that all our results can be proven in the general setting of empirical minimization. As explained below, this setting includes binary classification with the 0-1 loss, bounded regression and several other frameworks. In the rest of the paper, we will use the following general notation, in order to emphasize the generality of our results.
We observe $n$ independent realizations $\xi_1, \dots, \xi_n$ of a random variable $\xi$ with distribution $P$, and we are given a set $\mathcal{F}$ of measurable functions taking their values in $[0, 1]$. Our goal is to build some (data-dependent) $\widehat{f} \in \mathcal{F}$ such that its expectation $P\widehat{f} := \mathbb{E}[\widehat{f}(\xi)]$ is as small as possible. For the sake of simplicity, we assume that there is a minimizer $f^\star$ of $Pf$ over $\mathcal{F}$.
This includes the prediction framework, in which $\xi = (X, Y) \in \mathcal{X} \times \mathcal{Y}$, $\mathcal{F} = \{ \gamma(s; \cdot) \, : \, s \in S \}$,
$$\gamma(s; (x, y)) := \gamma_0(s(x), y),$$
where $\gamma_0$ is any contrast function. Then, $\inf_{f \in \mathcal{F}} Pf$ is equal to $P\gamma(s^\star)$, where $s^\star$ is the Bayes predictor. In the binary classification framework, $\mathcal{Y} = \{0, 1\}$ and we can take the 0-1 contrast $\gamma_0(u, y) = \mathbf{1}_{u \neq y}$ for instance. We then recover the setting described in Section 1. In the bounded regression framework, assuming that $Y \in [0, 1]$, we can take the least-squares contrast $\gamma_0(u, y) = (u - y)^2$. Many other contrast functions can be considered, provided that they take their values in $[0, 1]$. Notice the one-to-one correspondence between predictors $s$ and functions $\gamma(s; \cdot) \in \mathcal{F}$ in the prediction framework.
The empirical minimizer over $\mathcal{F}_m \subset \mathcal{F}$ (called a model) can then be defined as
$$\widehat{f}_m \in \mathop{\arg\min}_{f \in \mathcal{F}_m} \left\{ P_n f \right\}, \quad \text{where } P_n := n^{-1} \sum_{i=1}^{n} \delta_{\xi_i}.$$
We expect that its expectation $P\widehat{f}_m$ is close to that of $f_m \in \arg\min_{f \in \mathcal{F}_m} \{ Pf \}$, assuming that such a minimizer exists. In the prediction framework, defining $S_m := \{ s \, : \, \gamma(s; \cdot) \in \mathcal{F}_m \}$, we have $\widehat{f}_m = \gamma(\widehat{s}_m; \cdot)$ and $f_m = \gamma(s_m; \cdot)$.
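The correspondence between contrasts and loss-function classes can be illustrated with a small sketch (ours; the function names are hypothetical): the same empirical-mean routine covers both the 0-1 contrast of classification and the bounded least-squares contrast of regression.

```python
import numpy as np

# General empirical minimization: observe xi_1, ..., xi_n ~ P and minimize
# P_n g = (1/n) sum_i g(xi_i) over a class F of [0, 1]-valued functions g.
def empirical_mean(g, sample):
    return float(np.mean([g(xi) for xi in sample]))

# In the prediction framework, g = gamma(s; .) for a predictor s and a contrast.
def zero_one_contrast(pred, label):        # binary classification: values in {0, 1}
    return float(pred != label)

def least_squares_contrast(pred, label):   # bounded regression: values in [0, 1]
    return (pred - label) ** 2             # assuming pred and label lie in [0, 1]

rng = np.random.default_rng(3)
sample = [(x, int(x > 0.5)) for x in rng.uniform(0.0, 1.0, 200)]

s = lambda x: int(x > 0.4)                           # a fixed predictor
g = lambda xi: zero_one_contrast(s(xi[0]), xi[1])    # the associated loss function
print(empirical_mean(g, sample))  # estimates P gamma(s) = P(s(X) != Y)
```

Swapping the contrast (and the label space) changes the framework without touching the minimization routine, which is why all results below can be stated at the level of the class $\mathcal{F}$.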
We can now write the global margin condition as follows:
$$\forall f \in \mathcal{F}, \quad Pf - Pf^\star \ge w\left( \operatorname{Var}_P\left( f - f^\star \right) \right), \qquad (8)$$
where $w$ is a convex non-decreasing function on $[0, +\infty)$ with $w(0) = 0$. Similarly, the local margin condition is
$$\forall f \in \mathcal{F}_m, \quad Pf - Pf^\star \ge w_m\left( \operatorname{Var}_P\left( f - f^\star \right) \right). \qquad (9)$$
Notice that most of the upper and lower bounds on the risk under the margin condition given in the introduction stay valid in the general empirical minimization framework, at least when $w(x) = h x^{\kappa}$ for some $h > 0$ and $\kappa \ge 1$ (see for instance [23, 16]). Assume that $\mathcal{F}_m$ is a VC-type class of dimension $V_m$. If $f^\star \in \mathcal{F}_m$,
$$\mathbb{E}\left[ P\widehat{f}_m - Pf^\star \right] \le C \sqrt{\frac{V_m}{n}}$$
for some numerical constant $C$. If (9) moreover holds with $w_m : x \mapsto h x$ for some $h > 0$,
$$\mathbb{E}\left[ P\widehat{f}_m - Pf^\star \right] \le C' \, \frac{V_m}{n h} \left( 1 + \log\left( \frac{n h^2}{V_m} \right) \right)$$
for some numerical constant $C'$.
Given a collection $(\mathcal{F}_m)_{m \in \mathcal{M}}$ of models, we are looking for a model selection procedure $\widehat{m}$ satisfying an oracle inequality of the form
$$P\widehat{f}_{\widehat{m}} - Pf^\star \le C \inf_{m \in \mathcal{M}} \left\{ P\widehat{f}_m - Pf^\star \right\} + R_n, \qquad (10)$$
with a leading constant $C$ close to 1 and a remainder term $R_n$ as small as possible. Similarly to (6), we define a strongly margin adaptive procedure as any $\widehat{m}$ such that (10) holds with some numerical constant $C$, and $R_n$ of the order of the minimax risk $R_{\min}(D_m, w_m, n)$.
Defining penalization methods as
$$\widehat{m} \in \mathop{\arg\min}_{m \in \mathcal{M}} \left\{ P_n \widehat{f}_m + \mathrm{pen}(m) \right\} \qquad (11)$$
for some data-dependent $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$, the ideal penalty is $\mathrm{pen}_{\mathrm{id}}(m) := (P - P_n)\widehat{f}_m$.
3 Margin adaptive model selection for nested models
3.1 General result
Our first result is a sufficient condition for penalization procedures to attain strong margin adaptivity when the models are nested (Theorem 1). Since this condition is satisfied by local Rademacher complexities, this leads to a data-driven margin adaptive penalization procedure (Corollary 1).
Fix $(w_m)_{m \in \mathcal{M}}$ such that the local margin conditions (9) hold, and let $(\mathrm{pen}(m))_{m \in \mathcal{M}}$ be a sequence of positive reals that is nondecreasing (with respect to the inclusion ordering on $(\mathcal{F}_m)_{m \in \mathcal{M}}$). Assume that some constants $\theta \in (0, 1)$ and $\delta \in (0, 1)$ exist such that the following holds:
the models are nested.
lower bounds on the penalty: with probability at least $1 - \delta$, for every $m \in \mathcal{M}$,
Then, if $\widehat{m}$ is defined by (11), with high probability we have, for every $m \in \mathcal{M}$,
where $w_m^{\ast}$ is the convex conjugate of $w_m$.
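As a concrete illustration (ours, not taken from the paper), assume the margin function has the power form $w_m(x) = h x^{\kappa}$ with $h > 0$ and $\kappa \ge 1$, as in conditions (1) and (2). The convex conjugate appearing in the remainder term can then be computed explicitly:

```latex
% Convex conjugate of w(x) = h x^{\kappa} on [0, +\infty), for \kappa > 1:
w^{\ast}(y) := \sup_{x \ge 0} \left\{ x y - h x^{\kappa} \right\}
             = (\kappa - 1) \, h \left( \frac{y}{\kappa h} \right)^{\kappa/(\kappa - 1)},
\qquad y \ge 0,
% the supremum being attained at x^{\ast} = ( y / (\kappa h) )^{1/(\kappa - 1)}.
% For \kappa = 1, i.e. w(x) = h x, one gets w^{\ast}(y) = 0 if y \le h
% and w^{\ast}(y) = +\infty otherwise.
```

In particular, for a linear margin function ($\kappa = 1$) the conjugate vanishes on $[0, h]$, which is consistent with the small remainder terms discussed below in that case.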
If the penalty is of the right order, i.e., not much larger than the lower bound (13), then Theorem 1 is a strong margin adaptivity result. Indeed, assuming that $w_m : x \mapsto h_m x$, the remainder term is not too large, since
for some positive constant $C$. Hence, choosing the constants appropriately, we can rewrite (14) as
for some positive constants $C_1$ and $C_2$. When $w_m$ is a general convex function, minimax estimation rates are no longer available, so that we do not know whether the remainder term in (14) is of the right order. Nevertheless, no better risk bound is known, even for a single model to which $f^\star$ belongs.
In the case where the $(w_m)_{m \in \mathcal{M}}$ are known, methods involving local Rademacher complexities satisfy oracle inequalities similar to (14) (see Theorems 6 and 11 in ). On the contrary, the $(w_m)_{m \in \mathcal{M}}$ are not assumed to be known in Theorem 1, and conditions (12) and (13) are satisfied by completely data-dependent penalties, as shown in Section 3.2.
Also, Theorem 7 of  shows that adaptivity is possible using a comparison method, provided that $f^\star$ belongs to one of the models. However, it is not clear whether this comparison method achieves the optimal bias-variance trade-off in the general case, as in Theorem 1.
3.2 Local Rademacher complexities
Although Theorem 1 applies to any penalization procedure satisfying assumptions (12) and (13), we now focus on methods based on local Rademacher complexities. Let us define precisely these complexities. We mainly use the notation of :
for every $\delta > 0$, the $\delta$-minimal set of $\mathcal{F}_m$ w.r.t. the distribution $P$ is
$$\mathcal{F}_m(\delta) := \left\{ f \in \mathcal{F}_m \, : \, Pf \le \inf_{g \in \mathcal{F}_m} Pg + \delta \right\};$$
the diameter of the $\delta$-minimal set of $\mathcal{F}_m$:
$$D_P(\mathcal{F}_m; \delta) := \sup_{f, g \in \mathcal{F}_m(\delta)} \sqrt{\operatorname{Var}_P(f - g)};$$
the expected modulus of continuity of the empirical process over $\mathcal{F}_m(\delta)$:
$$\phi_n(\mathcal{F}_m; \delta) := \mathbb{E}\left[ \sup_{f, g \in \mathcal{F}_m(\delta)} \left| (P_n - P)(f - g) \right| \right].$$
We then define
$$U_n(\delta; t) := K \left( \phi_n(\mathcal{F}_m; \delta) + D_P(\mathcal{F}_m; \delta) \sqrt{\frac{t}{n}} + \frac{t}{n} \right),$$
where $K$ is a numerical constant (to be chosen later). The (ideal) local complexity $\delta_m^{\ast}(t)$ is (roughly) the smallest positive fixed-point of $\delta \mapsto U_n(\delta; t)$. More precisely,
$$\delta_m^{\ast}(t) := K' \inf\left\{ \delta > 0 \, : \, U_n(\delta; t) \le \delta \right\},$$
where $K'$ is a numerical constant.
Two important points, which follow from Theorems 1 and 3 of Koltchinskii , are that:
$\delta_m^{\ast}(t)$ is large enough to satisfy assumption (12) with probability at least $1 - e^{-t}$ for each model $m \in \mathcal{M}$.
there is a completely data-dependent quantity $\widehat{\delta}_m(t)$ such that
This data-dependent quantity $\widehat{\delta}_m(t)$ is a resampling estimate of $\delta_m^{\ast}(t)$, called the “local Rademacher complexity”.
Before stating the main result of this section, let us recall the definition of $\widehat{\delta}_m(t)$, as in . We need the following additional notation:
for every $\delta > 0$, the empirical $\delta$-minimal set of $\mathcal{F}_m$ is
$$\widehat{\mathcal{F}}_m(\delta) := \left\{ f \in \mathcal{F}_m \, : \, P_n f \le \inf_{g \in \mathcal{F}_m} P_n g + \delta \right\};$$
the empirical diameter of the empirical $\delta$-minimal set of $\mathcal{F}_m$:
$$D_{P_n}(\mathcal{F}_m; \delta) := \sup_{f, g \in \widehat{\mathcal{F}}_m(\delta)} \sqrt{\operatorname{Var}_{P_n}(f - g)};$$
the modulus of continuity of the Rademacher process over $\widehat{\mathcal{F}}_m(\delta)$, where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher random variables (i.e., each $\sigma_i$ takes the values $+1$ and $-1$ with probability $1/2$ each):
$$\widehat{\phi}_n(\mathcal{F}_m; \delta) := \mathbb{E}_{\sigma}\left[ \sup_{f, g \in \widehat{\mathcal{F}}_m(\delta)} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \left( f(\xi_i) - g(\xi_i) \right) \right| \right].$$
Given numerical constants $K_1$, $K_2$ and $K_3$ (to be chosen later), the local Rademacher complexity $\widehat{\delta}_m(t)$ is (roughly) the smallest positive fixed-point of
$$\delta \mapsto \widehat{U}_n(\delta; t) := K_1 \widehat{\phi}_n(\mathcal{F}_m; \delta) + K_2 D_{P_n}(\mathcal{F}_m; \delta) \sqrt{\frac{t}{n}} + K_3 \frac{t}{n}.$$
More precisely,
$$\widehat{\delta}_m(t) := K_4 \inf\left\{ \delta > 0 \, : \, \widehat{U}_n(\delta; t) \le \delta \right\},$$
where $K_4$ is a numerical constant.
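For intuition only, here is a small numerical sketch (ours, with placeholder constants and a finite toy class): the Rademacher modulus of continuity is evaluated over the empirical $\delta$-minimal set, and the smallest $\delta$ dominating the localized complexity is found by grid search.

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite toy class of 0-1 loss functions gamma_t(x, y) = 1{1{x >= t} != y}
n = 300
x = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(0.0, 1.0, n) < 0.15 + 0.7 * (x > 0.5)).astype(int)
thresholds = np.linspace(0.0, 1.0, 51)
losses = np.array([((x >= t).astype(int) != y).astype(float) for t in thresholds])
emp_risk = losses.mean(axis=1)          # P_n gamma_t for each function in the class

def rademacher_modulus(delta, n_rounds=50):
    """Rademacher analogue of the modulus of continuity: average over sigma draws
    of the sup over pairs (f, g) in the empirical delta-minimal set of
    (1/n) sum_i sigma_i (f - g)(xi_i)."""
    active = losses[emp_risk <= emp_risk.min() + delta]  # empirical minimal set
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        proj = active @ sigma / n
        total += proj.max() - proj.min()   # sup over differences f - g
    return total / n_rounds

# Smallest fixed point of delta -> C * rademacher_modulus(delta), by grid search.
# C = 2.0 is a placeholder: the paper's numerical constants are not specified here.
C = 2.0
grid = np.linspace(1e-3, 1.0, 200)
delta_hat = next((d for d in grid if C * rademacher_modulus(d) <= d), 1.0)
print(delta_hat)
```

The key design point mirrors the definitions above: enlarging $\delta$ enlarges the minimal set, hence the modulus, so the map whose fixed point is sought is (on average) nondecreasing and the grid search stops at the first crossing.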
Corollary 1 (Strong margin adaptivity for local Rademacher complexities).
There exist numerical constants $K_1$, $K_2$, $K_3$ and $K_4$ such that the following holds. Let $t > 0$. Assume that a numerical constant $A$ exists and an event of probability at least $1 - e^{-t}$ exists on which
where $\widehat{\delta}_m(t)$ is defined by (15) (and depends on both $K_1, K_2, K_3$ and $t$). Assume moreover that the models are nested and
Then, an event of probability at least $1 - C e^{-t}$ exists (for some numerical constant $C$) on which, for every $m \in \mathcal{M}$,
In particular, this holds when $\mathrm{pen}(m) = \widehat{\delta}_m(t)$, provided that $K_1, K_2, K_3$ are larger than some constants depending only on $A$.
One can always enlarge the constants $K_1, K_2, K_3$ and $K_4$, making the leading constant of the oracle inequality (18) closer to one, at the price of enlarging the penalty (hence the remainder term). We do not know whether it is possible to make the leading constant closer to one without changing the penalization procedure itself.
As we show in Section 5.2, there are distributions and collections of models such that this is a strong improvement over the “uniform margin” case, in terms of prediction error. It seems reasonable to expect that this happens in a significant number of practical situations.
In Section 5, we state a more general result (from which Theorem 1 is a corollary) which suggests why it is more difficult to prove Corollary 1 when $w_m$ really depends on $m$. This general result is also useful to understand how the nestedness assumption might be relaxed in Theorem 1 and Corollary 1.
The reason why Corollary 1 implies strong margin adaptivity is that the local Rademacher complexities are not too large when the local margin condition is satisfied, together with a complexity assumption on the model. Indeed, there exists a distribution-dependent quantity $\delta_m^{\mathrm{det}}(t)$ (defined as $\delta_m^{\ast}(t)$ with $\phi_n$ and $D_P$ replaced by distribution-dependent upper bounds, for some numerical constants related to $K_1, K_2, K_3$ and $K_4$) such that
(See Theorem 3 of .) This leads to several upper bounds on $\widehat{\delta}_m(t)$ under the local margin condition (9), by combining Lemma 5 of  with the examples of its Section 2.5. For instance, in the binary classification case, when $\mathcal{F}_m$ is the class of 0-1 loss functions associated with a VC-class $S_m$ of dimension $V_m$, such that the margin condition (9) holds with $w_m : x \mapsto h_m x$, we have, for every $t > 0$ and $n \ge 1$,
$$\widehat{\delta}_m(t) \le C_1 \frac{V_m \log n}{n h_m} + C_2 \frac{t}{n h_m},$$
where $C_1$ and $C_2$ depend only on the numerical constants above. (Similar upper bounds hold under several other complexity assumptions on the models, see .) In particular, when each $S_m$ is a VC-class of dimension $V_m$, the margin conditions (9) hold with $w_m : x \mapsto h_m x$, and $t = \log n$, (18) implies that
$$\ell(s^\star, \widehat{s}_{\widehat{m}}) \le C_3 \inf_{m \in \mathcal{M}} \left\{ \ell(s^\star, s_m) + \frac{V_m \log n}{n h_m} \right\}$$
with probability at least $1 - C_4 n^{-1}$, for some numerical constants $C_3$ and $C_4$. Up to the $\log n$ factor, this is a strong margin adaptive model selection result, provided that $\mathrm{Card}(\mathcal{M})$ is smaller than some power of $n$. Notice that the $\log n$ factor is sometimes necessary (as shown by ), meaning that this upper bound is then optimal.
4 Lower bound for some non-nested models
In this section, we investigate the assumption in Theorem 1 that the models are nested. To this aim, let us consider the case where the models are singletons $\mathcal{F}_m = \{f_m\}$. Then, any estimator $\widehat{f}_m$ is deterministic and equal to $f_m$, so that model selection amounts to selecting among a family of functions. Theorem 2 below shows that no selection procedure can be strongly margin adaptive in general.
Let $\gamma$ be the 0-1 loss and $\mathcal{F}$ be the associated loss function class. Two functions $f_1, f_2 \in \mathcal{F}$ and absolute constants exist such that the following holds. For every integer $n$ and every selection procedure (that is, a function $\widehat{m}$ of the data with values in $\{1, 2\}$), a distribution $P$ exists such that
Theorem 2 is proved in Section 7.2. A straightforward corollary of Theorem 2 is that in the classification setting with the 0-1 loss, strong margin adaptive model selection is not always possible when the models are not nested. Indeed, when the models are singletons, (20) shows that for any model selection procedure, some distribution $P$ exists such that results like Theorem 1 or Corollary 1 do not hold.
The most reasonable selection procedure among two functions $f_1$ and $f_2$ (or two models) clearly is empirical minimization. The proof of Theorem 2 yields explicitly some distribution such that (20) and (21) hold for empirical minimization. Note that when the models are singletons, most penalization procedures coincide with empirical minimization, for instance when $\mathrm{pen}(m)$ is proportional to the local Rademacher complexity $\widehat{\delta}_m(t)$, or to the ideal penalty $(P - P_n)\widehat{f}_m$, its expectation or some quantile of it.
Theorem 2 focuses on margin adaptivity with linear margin functions $w_m : x \mapsto h_m x$, whereas the margin condition is also satisfied with other functions $w_m$. This is both for simplicity reasons, and because this choice emphasizes that one could hope for learning rates of order $n^{-1}$ if strong margin adaptivity were possible. The meaning of Theorem 2 is then mainly that one cannot guarantee to learn at a rate better than $n^{-1/2}$, whereas for some model, the excess loss and the corresponding minimax risk both are of order $n^{-1}$.
The counterexample given in the proof of Theorem 2 is highly nonasymptotic, since the distribution strongly depends on $n$. If the distribution $P$ and the family of models were fixed, it is well known that empirical minimization leads to asymptotic optimality, because the family is finite and fixed when $n$ grows. This illustrates a significant difference between the asymptotic and non-asymptotic frameworks.
Another example of such a difference occurs when the number of candidate functions (or models) is infinite, or grows to infinity with the sample size, see (iv) in Proposition 2 in Section 5.2.
With Theorem 1, we have proven a strong margin adaptivity result for nested models, which holds true when the penalty is built upon local Rademacher complexities. Therefore, adaptive model selection is attainable for nested models, whatever the distribution of the data. On the other hand, Theorem 2 gives a simple example where no model selection procedure can satisfy an oracle inequality (10) with a leading constant smaller than some absolute constant $C_0 > 1$.
Looking carefully at the selection problems considered in the proof of Theorem 2, it appears that the main reason why they are particularly tough is that we are quite “lucky” with one of the models: it simultaneously has a very small bias, a very small size and a large margin parameter, while other models with a very similar appearance are much worse. When looking for a more general strong margin adaptivity result, we must then keep in mind that this is a hopeless task in such situations.
Let us finally mention a related result in a close but slightly different framework. In the classification framework, under a global margin condition with $w : x \mapsto h x$, Theorem 3 in  shows that for any $n$, a family of classifiers exists for which, for any selection procedure, some distribution exists such that